Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4633
Mohamed Kamel Aurélio Campilho (Eds.)
Image Analysis and Recognition 4th International Conference, ICIAR 2007 Montreal, Canada, August 22-24, 2007 Proceedings
Volume Editors Mohamed Kamel University of Waterloo Electrical and Computer Engineering Waterloo, Ontario, N2L 3G1, Canada E-mail:
[email protected] Aurélio Campilho University of Porto Faculty of Engineering, Institute of Biomedical Engineering Rua Dr. Roberto Frias, 4200-465 Porto, Portugal E-mail:
[email protected] Library of Congress Control Number: 2007932626 CR Subject Classification (1998): I.4, I.5, I.3.5, I.2.10, I.2.6, F.2.2 LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-74258-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74258-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12108244 06/3180 543210
Preface
ICIAR 2007, the International Conference on Image Analysis and Recognition, held in Montreal, Canada, August 22nd to 24th, 2007 was the fourth in the ICIAR series of annual conferences alternating between Europe and North America. ICIAR 2004 was held in Porto, Portugal; ICIAR 2005 in Toronto, Canada; and ICIAR 2006 in P´ovoa do Varzim, Portugal. The idea of organizing these conferences came as a result of discussions between researchers in Portugal and Canada wishing to encourage collaboration and exchange, mainly between scientists from these two countries, but also amongst researchers from other countries. The conference addresses recent advances in theory, methodology and applications of Image Analysis and Recognition. For ICIAR 2007, we received a total of 261 full papers from 26 countries. The review process was carried out by the Program Committee members and other reviewers; all are experts in various image analysis and recognition areas. Each paper was reviewed by at least two reviewers, and also checked by the conference co-chairs. A total of 115 papers were finally accepted (71 oral presentations and 44 posters) and appear in these proceedings. The high quality of the papers is attributed first to the authors, and second to the quality of the reviews provided by the experts. We would like to thank the authors for responding to our call for papers. Some were returning authors, who attended previous ICIAR conferences, and some were new. We would like to sincerely thank the reviewers for their careful evaluation and the excellent feedback they provided to the authors. It is this collective effort that resulted in the strong conference program and the high quality proceedings in your hands. We were very pleased to be able to include in the conference program keynote talks by three world-renowned experts: Ching Suen, Concordia University, Canada; Hubert Labelle, University of Montreal, Canada, and Prabir Bhattacharya, Concordia University, Canada. We would like to express our sincere gratitude to each of them for accepting our invitation to share with the conference participants their recent advances in the fields of image analysis, recognition and their applications. We would like to thank Khaled Hammouda, the webmaster of the conference, for maintaining the Web pages, interacting with the authors and preparing the proceedings. We would like to thank the conference secretary Heidi Campbell, for her administrative assistance. We are grateful to Farida Cheriet for her work as local arrangements chair. She and her team have provided excellent input, ´ advice, planning and support. The first-class facilities provided by the Ecole Polytechnique de Montr´eal, Canada were much appreciated. We are also grateful to Springer’s editorial staff for supporting this publication in the LNCS series.
Finally, we were very pleased to welcome all the participants to this conference. For those who did not attend, we hope this publication provides a good view into the research presented at the conference, and we look forward to meeting you at the next ICIAR conference. August 2007
Mohamed Kamel   Aurélio Campilho
ICIAR 2007 – International Conference on Image Analysis and Recognition
General Chair Mohamed Kamel University of Waterloo, Canada
[email protected] General Co-chair Aurélio Campilho University of Porto, Portugal
[email protected] Local Organizing Committee Farida Cheriet École Polytechnique de Montréal, Canada
[email protected] Jeanne Daunais École Polytechnique de Montréal, Canada
[email protected] Luc Lalonde École Polytechnique de Montréal, Canada
[email protected] Conference Secretary Heidi Campbell University of Waterloo, Canada
[email protected] Webmaster Khaled Hammouda University of Waterloo, Canada
[email protected]
Supported by Pattern Analysis and Machine Intelligence Group, University of Waterloo, Canada Department of Electrical and Computer Engineering, Faculty of Engineering, University of Porto, Portugal
INEB – Instituto de Engenharia Biomédica
IEEE Kitchener-Waterloo Section
Advisory Committee
M. Ahmadi – University of Windsor, Canada
P. Bhattacharya – Concordia University, Canada
T.D. Bui – Concordia University, Canada
M. Cheriet – University of Quebec, Canada
V. Di Gesú – Università degli Studi di Palermo, Italy
E. Dubois – University of Ottawa, Canada
Z. Duric – George Mason University, USA
M. Ejiri – Japan
G. Granlund – Linköping University, Sweden
L. Guan – Ryerson University, Canada
M. Haindl – Institute of Information Theory and Automation, Czech Republic
E. Hancock – The University of York, UK
J. Kovacevic – Carnegie Mellon University, USA
M. Kunt – Swiss Federal Institute of Technology (EPFL), Switzerland
K.N. Plataniotis – University of Toronto, Canada
A. Sanfeliu – Technical University of Catalonia, Spain
M. Shah – University of Central Florida, USA
M. Sid-Ahmed – University of Windsor, Canada
C.Y. Suen – Concordia University, Canada
A.N. Venetsanopoulos – University of Toronto, Canada
M. Viergever – University of Utrecht, Netherlands
B. Vijayakumar – Carnegie Mellon University, USA
J. Villanueva – Autonomous University of Barcelona, Spain
R. Ward – University of British Columbia, Canada
D. Zhang – The Hong Kong Polytechnic University, Hong Kong
Program Committee S. Abdallah W. Abd-Almageed P. Aguiar M. Ahmed J. Alirezaie D. Androutsos H. Araujo N. Arica J. Barron O. Basir J. Batista C. Bauckhage A. Bernardino P. Bhattacharya G. Bilodeau J. Bioucas B. Boufama T.D. Bui E. Cernadas F. Cheriet M. Cheriet M. Correia L. Corte-Real J. Costeira D. Cuesta-Frau J. Cunha V. Di Ges´ u J. Dias Z. Duric M. El-Sakka R. Fazel M. Ferretti P. Fieguth M. Figueiredo A. Fred G. Freeman V. Grau M. Greenspan L. Guan F. Guibault
American University of Beirut, Lebanon University of Maryland, USA Institute for Systems and Robotics, Portugal Wilfrid Laurier University, Canada Ryerson University, Canada Ryerson University, Canada University of Coimbra, Portugal Turkish Naval Academy, Turkey University of Western Ontario, Canada University of Waterloo, Canada University of Coimbra, Portugal York University, Canada Technical University of Lisbon, Portugal Concordia University, Canada ´ Ecole Polytechnique de Montr´eal, Canada Technical University of Lisbon, Portugal University of Windsor, Canada Concordia University, Canada University of Vigo, Spain ´ Ecole Polytechnique de Montr´eal, Canada University of Quebec, Canada University of Porto, Portugal University of Porto, Portugal Technical University of Lisbon, Portugal Polytechnic University of Valencia, Spain University of Aveiro, Portugal Universit` a degli Studi di Palermo, Italy University of Coimbra, Portugal George Mason University, USA University of Western Ontario, Canada University of Manitoba, Canada University of Pavia, Italy University of Waterloo, Canada Technical University of Lisbon, Portugal Technical University of Lisbon, Portugal University of Waterloo, Canada University of Oxford, UK Queen’s University, Canada Ryerson University, Canada ´ Ecole Polytechnique de Montr´eal, Canada
M. Haindl E. Hancock B. Huet J. Jiang J. Jorge J. Kamarainen M. Kechadi G. Khan A. Krzyzak R. Lagani`ere X. Li R. Lins J. Lorenzo-Ginori R. Lukac A. Mansouri A. Marcal J. Marques M. Melkemi A. Mendon¸ca J. Meunier O. Michailovich M. Mignotte A. Monteiro J. Orchard A. Padilha P. Payeur F. Perales F. Pereira N. Peres de la Blanca E. Petrakis P. Pina A. Pinho J. Pinto F. Pla K. Plataniotis M. Queluz T. Rabie P. Radeva E. Ribeiro L. Rueda F. Samavati J. S´ anchez B. Santos
Institute of Information Theory and Automation, Czech Republic University of York, UK Institut Eurecom, France University of Bradford, UK INESC-ID, Portugal Lappeenranta University of Technology, Finland University College Dublin, Ireland Ryerson University, Canada Concordia University, Canada University of Ottawa, Canada University of London, UK Universidade Federal de Pernambuco, Brazil Universidad Central “Marta Abreu” de Las Villas, Cuba University of Toronto, Canada Universit´e de Bourgogne, France University of Porto, Portugal Technical University of Lisbon, Portugal Universit´e de Haute Alsace, France University of Porto, Portugal University of Montreal, Canada University of Waterloo, Canada University of Montreal, Canada University of Porto, Portugal University of Waterloo, Canada University of Porto, Portugal University of Ottawa, Canada University of the Balearic Islands, Spain Technical University of Lisbon, Portugal University of Granada, Spain Technical University of Crete, Greece Technical University of Lisbon, Portugal University of Aveiro, Portugal Technical University of Lisbon, Portugal Universitat Jaume I, Spain University of Toronto, Canada Technical University of Lisbon, Portugal University of Toronto, Canada Autonomous University of Barcelona, Spain Florida Institute of Technology, USA University of Windsor, Canada University of Calgary, Canada University of Las Palmas de Gran Canaria, Spain University of Aveiro, Portugal
G. Schaefer P. Scheunders J. Sequeira A. Silva J. Silva W. Skarbek B. Smolka J. Sousa H. Suesse S. Sural D. Tao G. Thomas H. Tizhoosh D. Vandermeulen M. Vento B. Vijayakumar J. Vitria Y. Voisin E. Vrscay Y. Wei M. Wirth J. Wu F. Yarman-Vural Y. Yuan J. Zelek G. Zheng D. Ziou
Nottingham Trent University, UK University of Antwerp, Belgium Ecole Sup´erieure d’Ing´enieurs de Luminy, France University of Aveiro, Portugal University of Porto, Portugal Warsaw University of Technology, Poland Silesian University of Technology, Poland Technical University of Lisbon, Portugal Friedrich-Schiller University Jena, Germany Indian Institute of Technology, India The Hong Kong Polytechnic University, Hong Kong University of Waterloo, Canada University of Waterloo, Canada Catholic University of Leuven, Belgium University of Salerno, Italy Carnegie Mellon University, USA Computer Vision Center, Spain Universit´e de Bourgogne, France University of Waterloo, Canada University of Waterloo, Canada University of Guelph, Canada University of Windsor, Canada Middle East Technical University, Turkey Aston University, UK University of Waterloo, Canada University of Bern, Switzerland University of Sherbrooke, Canada
Reviewers B. Al-Diri N. Alajlan T. Barata J. Barbosa S. Berretti A. Bevilacqua J. Cardoso M. Coimbra A. Dawoud I. El Rube A. El-ghazal D. ElShafie J. Glasa C. Hong A. Kong
University of Lincoln, UK King Saud University, Saudi Arabia Technical University of Lisbon, Portugal University of Porto, Portugal University of Florence, Italy ARCES-DEIS University of Bologna, Italy INESC Porto, Portugal IEETA-University of Aveiro, Portugal University of South Alabama, USA University of Waterloo, Canada University of Waterloo, Canada Concordia University, Canada Slovak Academy of Science, Slovakia Hong Kong Polytechnic, Hong Kong University of Waterloo, Canada
B. Massey S. Mohamed A. Mohebi F. Monteiro M. Nappi J. Nascimento E. Papageorgiou J. P´erez A. Picariello A. Puga A. Quddus S. Rahnamayan R. Rocha F. Sahba J. Sanches A. Sousa L.F. Teixeira L.M. Teixeira M. Vega-Rodr´ıguez C. Vinhais C. Wang Y. Yu A. Zaim
University of Lincoln, UK University of Waterloo, Canada University of Waterloo, Canada Biomedical Engineering Institute, Portugal University of Salerno, Italy Technical University of Lisbon, Portugal University of Patras, Greece Departamento de Inform´atica Escuela Polit´ecnica, Spain University of Naples, Italy University of Porto, Portugal University of Waterloo, Canada University of Waterloo, Canada Biomedical Engineering Institute, Portugal University of Waterloo, Canada Technical University of Lisbon, Portugal INEB, Portugal INESC Porto, Portugal INESC Porto, Portugal University of Extremadura, Spain Biomedical Engineering Institute, Portugal University of Lincoln, UK ENS-Cachan, France University of Texas, USA
Table of Contents
Image Restoration and Enhancement A New Image Scaling Algorithm Based on the Sampling Theorem of Papoulis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alain Hor´e, Djemel Ziou, and Fran¸cois Deschˆenes A New Fuzzy Additive Noise Reduction Method . . . . . . . . . . . . . . . . . . . . . Stefan Schulte, Val´erie De Witte, Mike Nachtegael, Tom M´elange, and Etienne E. Kerre
1 12
Subband-Adaptive and Spatially-Adaptive Wavelet Thresholding for Denoising and Feature Preservation of Texture Images . . . . . . . . . . . . . . . . J. Li, S.S. Mohamed, M.M.A. Salama, and G.H. Freeman
24
Image Denoising Based on the Ridgelet Frame Using the Generalized Cross Validation Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xi Tan and Hong He
38
Parameterless Discrete Regularization on Graphs for Color Image Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Olivier Lezoray, S´ebastien Bougleux, and Abderrahim Elmoataz
46
Multicomponent Image Restoration, an Experimental Study . . . . . . . . . . . Arno Duijster, Steve De Backer, and Paul Scheunders
58
Reconstruction of Low-Resolution Images Using Adaptive Bimodal Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiˆep Luong and Wilfried Philips
69
Image and Video Processing and Analysis Learning Basic Patterns from Repetitive Texture Surfaces Under Non-rigid Deformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roman Filipovych and Eraldo Ribeiro
81
An Approach for Extracting Illumination-Independent Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rub´en Mu˜ niz and Jos´e Antonio Corrales
93
Image Decomposition and Reconstruction Using Single Sided Complex Gabor Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reza Fazel-Rezai and Witold Kinsner
105
Solving the Inverse Problem of Image Zooming Using “Self-Examples” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mehran Ebrahimi and Edward R. Vrscay
117
A Run-Based Two-Scan Labeling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . Lifeng He, Yuyan Chao, and Kenji Suzuki
131
New Computational Methods for the Construction of “Darcyan” Biological Coordinate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nataliya Portman, Ulf Grenander, and Edward R. Vrscay
143
An Approach to the 2D Hilbert Transform for Image Processing Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Juan Valent´ın Lorenzo-Ginori
157
Landmark-Based Non-rigid Registration Via Graph Cuts . . . . . . . . . . . . . . Herve Lombaert, Yiyong Sun, and Farida Cheriet
166
New Hypothesis Distinctiveness Measure for Better Ellipse Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cuilan Wang, Timothy S. Newman, and Chunguang Cao
176
Automatic Closed Edge Detection Using Level Lines Selection . . . . . . . . . Thomas Hurtut and Farida Cheriet
187
Constrained Sampling Using Simulated Annealing . . . . . . . . . . . . . . . . . . . . Azadeh Mohebi and Paul Fieguth
198
A High-Accuracy Rotation Estimation Algorithm Based on 1D Phase-Only Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sei Nagashima, Koichi Ito, Takafumi Aoki, Hideaki Ishii, and Koji Kobayashi
210
Image Segmentation Enhancing Contour Primitives by Pairwise Grouping and Relaxation . . . Robert Bergevin and Alexandre Filiatrault
222
Image Segmentation Using Level Set and Local Linear Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . David Rivest-H´enault and Mohamed Cheriet
234
Bimodal Texture Segmentation with the Lee-Seo Model . . . . . . . . . . . . . . . Michalis A. Savelonas, Dimitris K. Iakovidis, and Dimitris Maroulis
246
Face Detection Based on Skin Color in Video Images with Dynamic Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ant´ onio Mota Ferreira and Miguel Velhote Correia
254
Data Segmentation of Stereo Images with Complicated Background . . . . Yi Wei, Shushu Hu, and Yu Li
263
Computer Vision Variable Homography Compensation of Parallax Along Mosaic Seams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shawn Zhang and Michael Greenspan
271
Deformation Weight Constraint and 3D Reconstruction of Nonrigid Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guanghui Wang and Q.M. Jonathan Wu
285
Parallel Robot High Speed Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . ´ J.M. Sebasti´ an, A. Traslosheros, L. Angel, F. Roberti, and R. Carelli
295
Robust Contour Tracking Using a Modified Snake Model in Stereo Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shin-Hyoung Kim and Jong Whan Jang
307
Background Independent Moving Object Segmentation Using Edge Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Ali Akber Dewan, M. Julius Hossain, and Oksam Chae
318
Pattern Recognition for Image Analysis Unsupervised Feature and Model Selection for Generalized Dirichlet Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sabri Boutemedjet, Nizar Bouguila, and Djemel Ziou
330
On Design of Discriminant Analysis Diagram for Error Based Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mariusz Leszczy´ nski and Wladyslaw Skarbek
342
Robust Tensor Classifiers for Color Object Recognition . . . . . . . . . . . . . . . Christian Bauckhage
352
New Algorithm to Extract Centerline of 2D Objects Based on Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seifeddine Ferchichi, Shengrui Wang, and Sofiane Grira
364
A Novel Bayesian Classifier with Smaller Eigenvalues Reset by Threshold Based on Given Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guorong Xuan, Xiuming Zhu, Yun Q. Shi, Peiqi Chai, Xia Cui, and Jue Li
375
Median Binary Pattern for Textures Classification . . . . . . . . . . . . . . . . . . . . Adel Hafiane, Guna Seetharaman, and Bertrand Zavidovique
387
A New Incremental Optimal Feature Extraction Method for On-Line Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youness Aliyari Ghassabeh and Hamid Abrishami Moghaddam
399
Shape and Matching A View-Based 3D Object Shape Representation Technique . . . . . . . . . . . . Yasser Ebrahim, Maher Ahmed, Siu-Cheung Chau, and Wegdan Abdelsalam
411
A Novel Multi-scale Representation for 2-D Shapes . . . . . . . . . . . . . . . . . . . Kidiyo Kpalma and Joseph Ronsin
423
Retrieval of Hand-Sketched Envelopes in Logo Images . . . . . . . . . . . . . . . . Naif Alajlan
436
Matching Flexible Polygons to Fields of Corners Extracted from Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Siddharth Manay and David W. Paglieroni
447
Pattern Retrieval from a Cloud of Points Using Geometric Concepts . . . . L. Idoumghar and M. Melkemi
460
Motion Analysis Real-Time Vehicle Ego-Motion Using Stereo Pairs and Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fadi Dornaika and Angel D. Sappa
469
Enhanced Cross-Diamond Search Algorithm for Fast Block Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gwanggil Jeon, Jungjun Kim, and Jechang Jeong
481
Block-Based Motion Vector Smoothing for Periodic Pattern Region . . . . Young Wook Sohn and Moon Gi Kang
491
A Fast and Reliable Image Mosaicing Technique with Application to Wide Area Motion Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alessandro Bevilacqua and Pietro Azzari
501
High Accuracy Optical Flow Method Based on a Theory for Warping: Implementation and Qualitative/Quantitative Evaluation . . . . . . . . . . . . . Mohammad Faisal and John Barron
513
Tracking A Sequential Monte-Carlo and DSmT Based Approach for Conflict Handling in Case of Multiple Targets Tracking . . . . . . . . . . . . . . . . . . . . . . . Yi Sun and Layachi Bentabet
526
Incremental Update of Linear Appearance Models and Its Application to AAM: Incremental AAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sangjae Lee, Jaewon Sung, and Daijin Kim
538
Robust Face Tracking Using Motion Prediction in Adaptive Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sukwon Choi and Daijin Kim
548
A Simple Oriented Mean-Shift Algorithm for Tracking . . . . . . . . . . . . . . . . Jamil Drar´eni and S´ebastien Roy
558
Real-Time 3D Head Tracking Under Rapidly Changing Pose, Head Movement and Illumination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wooju Ryu and Daijin Kim
569
Tracking Multiple People in the Context of Video Surveillance . . . . . . . . . B. Boufama and M.A. Ali
581
Video Object Tracking Via Central Macro-blocks and Directional Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. Zhang, J. Jiang, and G. Xiao
593
Tracking of Multiple Targets Using On-Line Learning for Appearance Model Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Franz Pernkopf
602
Image Retrieval and Indexing Using Visual Dictionary to Associate Semantic Objects in Region-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rongrong Ji, Hongxun Yao, Zhen Zhang, Peifei Xu, and Jicheng Wang Object-Based Surveillance Video Retrieval System with Real-Time Indexing Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jacky S-C. Yuk, Kwan-Yee K. Wong, Ronald H-Y. Chung, K.P. Chow, Francis Y-L. Chin, and Kenneth S-H. Tsang
615
626
Image Retrieval Using Transaction-Based and SVM-Based Learning in Relevance Feedback Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaojun Qi and Ran Chang
638
Shape-Based Image Retrieval Using Pair-Wise Candidate Co-ranking . . . Akrem El-ghazal, Otman Basir, and Saeid Belkasim
650
Gaussian Mixture Model Based Retrieval Technique for Lossy Compressed Color Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maria Luszczkiewicz and Bogdan Smolka
662
Logo and Trademark Retrieval in General Image Databases Using Color Edge Gradient Co-occurrence Histograms . . . . . . . . . . . . . . . . . . . . . . Raymond Phan and Dimitrios Androutsos
674
Use of Adaptive Still Image Descriptors for Annotation of Video Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrea Kutics, Akihiko Nakagawa, and Kazuhiro Shindoh
686
Image and Video Coding and Encryption Data Hiding on H.264/AVC Compressed Video . . . . . . . . . . . . . . . . . . . . . . Sung Min Kim, Sang Beom Kim, Youpyo Hong, and Chee Sun Won
698
Reduced Uneven Multi-hexagon-grid Search for Fast Integer Pel Motion Estimation in H.264/AVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheong-Ghil Kim, In-Jik Lee, and Shin-Dug Kim
708
Reversible Data Hiding for JPEG Images Based on Histogram Pairs . . . . Guorong Xuan, Yun Q. Shi, Zhicheng Ni, Peiqi Chai, Xia Cui, and Xuefeng Tong
715
Iterated Fourier Transform Systems: A Method for Frequency Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gregory S. Mayer and Edward R. Vrscay
728
Adaptive Deblocking Algorithm Based on Image Characteristics for Low Bit-Rate Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jongho Kim, Minseok Choi, and Jechang Jeong
740
Ingredient Separation of Natural Images: A Multiple Transform Domain Method Based on Sparse Coding Strategy . . . . . . . . . . . . . . . . . . . Xi Tan
752
Optimal Algorithm for Lossy Vector Data Compression . . . . . . . . . . . . . . . Alexander Kolesnikov
761
MPEG Video Watermarking Using Tensor Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Emad E. Abdallah, A. Ben Hamza, and Prabir Bhattacharya
772
Embedding Quality Measures in PIFS Fractal Coding . . . . . . . . . . . . . . . . Andrea Abate, Michele Nappi, and Daniel Riccio
784
Biometrics Comparison of ARTMAP Neural Networks for Classification for Face Recognition from Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Barry and E. Granger
794
Face Recognition by Curvelet Based Feature Extraction . . . . . . . . . . . . . . . Tanaya Mandal, Angshul Majumdar, and Q.M. Jonathan Wu
806
Low Frequency Response and Random Feature Selection Applied to Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto A. V´ azquez, Humberto Sossa, and Beatriz A. Garro
818
Facial Expression Recognition Using 3D Facial Feature Distances . . . . . . Hamit Soyel and Hasan Demirel
831
Locating Facial Features Using an Anthropometric Face Model for Determining the Gaze of Faces in Image Sequences . . . . . . . . . . . . . . . . . . . Jorge P. Batista
839
Iris Recognition Based on Zigzag Collarette Region and Asymmetrical Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kaushik Roy and Prabir Bhattacharya
854
Biomedical Image Analysis A Modified Fuzzy C-Means Algorithm for MR Brain Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . L´ aszl´ o Szil´ agyi, S´ andor M. Szil´ agyi, and Zolt´ an Beny´ o
866
Evaluation of Contrast Enhancement Filters for Lung Nodule Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Carlos S. Pereira, Ana Maria Mendon¸ca, and Aur´elio Campilho
878
Robust Coronary Artery Tracking from Fluoroscopic Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pascal Fallavollita and Farida Cheriet
889
Classification of Breast Tissues in Mammogram Images Using Ripley’s K Function and Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . Leonardo de Oliveira Martins, Geraldo Braz Junior, Erick Corrˆea da Silva, Arist´ ofanes Corrˆea Silva, and Anselmo Cardoso de Paiva Comparison of Class Separability, Forward Sequential Search and Genetic Algorithms for Feature Selection in the Classification of Individual and Clustered Microcalcifications in Digital Mammograms . . . Rolando R. Hern´ andez-Cisneros, Hugo Terashima-Mar´ın, and Santiago E. Conant-Pablos Contourlet-Based Mammography Mass Classification . . . . . . . . . . . . . . . . . Fatemeh Moayedi, Zohreh Azimifar, Reza Boostani, and Serajodin Katebi Fuzzy C-Means Clustering for Segmenting Carotid Artery Ultrasound Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amr R. Abdel-Dayem and Mahmoud R. El-Sakka
899
911
923
935
A Novel 3D Segmentation Method of the Lumen from Intravascular Ultrasound Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ionut Alexandrescu, Farida Cheriet, and S´ebastien Delorme
949
Carotid Ultrasound Segmentation Using DP Active Contours . . . . . . . . . . Ali K. Hamou, Said Osman, and Mahmoud R. El-Sakka
961
Bayesian Reconstruction Using a New Nonlocal MRF Prior for Count-Limited PET Transmission Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yang Chen, Wufan Chen, Pengcheng Shi, Yanqiu Feng, Qianjin Feng, Qingqi Wang, and Zhiyong Huang Medical Image Registration Based on Equivalent Meridian Plane . . . . . . . Zhentai Lu, Minghui Zhang, Qianjin Feng, Pengcheng Shi, and Wufan Chen Prostate Tissue Texture Feature Extraction for Cancer Recognition in TRUS Images Using Wavelet Decomposition . . . . . . . . . . . . . . . . . . . . . . . . J. Li, S.S. Mohamed, M.M.A. Salama, and G.H. Freeman
972
982
993
Multi-scale and First Derivative Analysis for Edge Detection in TEM Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1005 Nicolas Coudray, Jean-Luc Buessler, Hubert Kihl, and Jean-Philippe Urban Watershed Segmentation of Intervertebral Disk and Spinal Canal from MRI Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1017 Claudia Chevrefils, Farida Ch´eriet, Guy Grimard, and Carl-Eric Aubin Towards Segmentation of Pedicles on Posteroanterior X-Ray Views of Scoliotic Patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1028 Vincent Dor´e, Luc Duong, Farida Cheriet, and Mohamed Cheriet Adaptive Mesh Generation of MRI Images for 3D Reconstruction of Human Trunk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1040 O. Courchesne, F. Guibault, J. Dompierre, and F. Cheriet Efficient and Effective Ultrasound Image Analysis Scheme for Thyroid Nodule Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1052 Eystratios G. Keramidas, Dimitris K. Iakovidis, Dimitris Maroulis, and Stavros Karkanis Contour Energy Features for Recognition of Biological Specimens in Population Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061 Daniel Ochoa, Sidharta Gautama, and Boris Vintimilla Processing Random Amplified Polymorphysm DNA Images Using the Radon Transform and Mathematical Morphology . . . . . . . . . . . . . . . . . . . . 1071 Luis Rueda, Omar Uyarte, Sofia Valenzuela, and Jaime Rodriguez
Applications Scale-Adaptive Segmentation and Recognition of Individual Trees Based on LiDAR Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1082 Roman M. Palenichka and Marek B. Zaremba Iterative and Localized Radon Transform for Road Centerline Detection from Classified Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1093 Isabelle Couloigner and Qiaoping Zhang Using Wavelet Transform and Partial Distance Search to Implement kNN Classifier on FPGA with Multiple Modules . . . . . . . . . . . . . . . . . . . . . 1105 Hui-Ya Li, Yao-Jung Yeh, and Wen-Jyi Hwang A Framework for Wrong Way Driver Detection Using Optical Flow . . . . . 1117 Gon¸calo Monteiro, Miguel Ribeiro, Jo˜ ao Marcos, and Jorge Batista Automated Stroke Classification in Tennis . . . . . . . . . . . . . . . . . . . . . . . . . . 1128 Hitesh Shah, Prakash Chokalingam, Balamanohar Paluri, Nalin Pradeep, and Balasubramanian Raman Color-Based Road Sign Detection and Tracking . . . . . . . . . . . . . . . . . . . . . . 1138 Luis David Lopez and Olac Fuentes A New Automatic Planning of Inspection of 3D Industrial Parts by Means of Visual System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1148 J.M. Sebasti´ an, D. Garc´ıa, A. Traslosheros, F.M. S´ anchez, and S. Dom´ınguez Model-Guided Luminance Range Enhancement in Mixed Reality . . . . . . . 1160 Yunjun Zhang and Charles E. Hughes Improving Reliability of Oil Spill Detection Systems Using Boosting for High-Level Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1172 Geraldo L.B. Ramalho and F´ atima N.S. de Medeiros Computer Assisted Transcription for Ancient Text Images . . . . . . . . . . . . . 1182 Ver´ onica Romero, Alejandro H. Toselli, Luis Rodr´ıguez, and Enrique Vidal Methods for Written Ancient Music Restoration . . . . . . . . . . . . . . . . . . . . . 1194 Pedro Castro and J.R. Caldas Pinto Suppression of Noise in Historical Photographs Using a Fuzzy Truncated-Median Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1206 Michael Wirth and Bruce Bobier Generating Color Documents from Segmented and Synthetic Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1217 Rafael Dueire Lins and Jo˜ ao Marcelo Monte da Silva
Enhancing Document Images Acquired Using Portable Digital Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1229 Rafael Dueire Lins, Andr´e R. Gomes e Silva, and Gabriel Pereira e Silva A Page Content Independent Book Dewarping Method to Handle 2D Images Captured by a Digital Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1242 Minghui Wu, Rongfeng Li, Bin Fu, Wenxin Li, and Zhuoqun Xu 3D Reconstruction of Soccer Sequences Using Non-calibrated Video Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1254 S´ebastien Mavromatis, Paulo Dias, and Jean Sequeira Automatic Ortho-rectification of ASTER Images by Matching Digital Elevation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1265 Jos´e A. Gon¸calves and Andr´e R.S. Mar¸cal A New Pyramidal Approach for the Address Block Location Based on Hierarchical Graph Coloring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1276 Djamel Gaceb, V´eronique Eglin, Frank Lebourgeois, and Hubert Emptoz Poultry Skin Tumor Detection in Hyperspectral Reflectance Images by Combining Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289 Chengzhe Xu, Intaek Kim, and Moon S. Kim Intelligent Real-Time Fabric Defect Detection . . . . . . . . . . . . . . . . . . . . . . . 1297 Hugo Peres Castilho, Paulo Jorge Sequeira Gon¸calves, Jo´co Rog´erio Caldas Pinto, and Ant´ onio Limas Serafim Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1309
A New Image Scaling Algorithm Based on the Sampling Theorem of Papoulis Alain Horé, Djemel Ziou, and François Deschênes Département d’Informatique, Université de Sherbrooke, 2500 boulevard de l’Université, Sherbrooke (Québec), J1K 2R1, Canada {alain.hore, djemel.ziou, francois.deschenes}@usherbrooke.ca
Abstract. We present in this paper a new image scaling algorithm which is based on the generalized sampling theorem of Papoulis. The main idea consists in using the first and second derivatives of an image in the scaling process. The derivatives contain information about edges and discontinuities that should be preserved during resizing. The sampling theorem of Papoulis is used to combine this information. We compare our algorithm with eight of the most common scaling algorithms, and two measures of quality are used: the standard deviation for the evaluation of blur, and the curvature for the evaluation of aliasing. The results presented here show that our algorithm gives the best images, with very little aliasing, good contrast, good edge preservation and little blur.
Keywords: Papoulis, image resizing, scaling, curvature, derivatives, resolution.
1 Introduction Image scaling algorithms are widely used in many applications going from simple personal pictures processing to complex microscopic defects identification. Scaling is desired because the resolutions of image sources are various and depend on their acquisition device, while the physical screen resolution of a digital display device is fixed [7]. Consequently, in the case where the resolution of an image is different from the screen resolution of a digital display device, we need to perform resizing. In the image-scaling process, image quality should be preserved as much as possible; thus, we need to preserve edges and have few blur and aliasing. The basic concept of image scaling is to resample a two-dimensional function on a new sampling grid [13], [14]. During the last years, many algorithms for image resizing have been proposed. The simplest method is the nearest neighbour [4], which samples the nearest pixel from original image. It has good high frequency response, but degrades image quality due to aliasing. The widely used methods are the bilinear and bicubic algorithm [5], [6]. Their weakness is the blur effect causing bad high frequency response. Other complex algorithms have been proposed in the last years. Among them are the Lanczos algorithm, the Hanning algorithm and the Hermite algorithm [10]. They are mainly based on interpolation through finite sinusoidal filters. These algorithms produce images of various qualities which are generally better when the filter used has a shape close to the sinc function. A new algorithm called winscale M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 1–11, 2007. © Springer-Verlag Berlin Heidelberg 2007
has been proposed [8] where pixels are considered as areas of constant intensity. This algorithm gives images with quality comparable to the bilinear algorithm. Other algorithms using adaptive [2], [15], correlative [9], polynomial [1] or fixed scale factors [3] properties have also been proposed in the last years. However, these methods are based on various assumptions (smoothness, regularity, etc.) which may lead to bad results when they are not satisfied. In this paper, we introduce a new image scaling algorithm which is based on the generalized sampling theorem of Papoulis [12]. In signal processing, this theorem is used for perfect reconstruction of a signal given a set of its sampled signals. In our scaling algorithm, we take advantage of the information provided by the derivatives of an image in the resizing process. For example, the first derivative gives information about edges or regions of grey levels variations; also, the second derivative extracts image details more accurately. Combining this information (with Papoulis’ theorem) during the resizing process tends to produce images of better quality than algorithms based on only the original image. The outline of the paper is organized as follows: in section 2, we present the problem statement. In section 3, we describe the generalized sampling theorem of Papoulis. In section 4, we apply the theorem to image scaling. Section 5 presents the experimental results. We end the paper with the concluding remarks.
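As a point of reference for the methods surveyed above, here is a minimal sketch of the classical interpolation filters as exposed by OpenCV. The input file name is hypothetical, and these calls illustrate nearest-neighbour, bilinear, bicubic and Lanczos resampling only, not the algorithm proposed in this paper.

```python
# Illustrative only: classical interpolation-based scaling via OpenCV.
import cv2

img = cv2.imread("input.png")                              # hypothetical input file
new_size = (2 * img.shape[1], 2 * img.shape[0])             # (width, height): 2x upscaling

nearest  = cv2.resize(img, new_size, interpolation=cv2.INTER_NEAREST)   # prone to aliasing
bilinear = cv2.resize(img, new_size, interpolation=cv2.INTER_LINEAR)    # tends to blur
bicubic  = cv2.resize(img, new_size, interpolation=cv2.INTER_CUBIC)
lanczos  = cv2.resize(img, new_size, interpolation=cv2.INTER_LANCZOS4)  # windowed-sinc filter
```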
2 Problem Statement The basic concept of image scaling is to resample a two-dimensional function on a new sampling grid. According to signal processing, the problem is to correctly compute signal values at arbitrary continuous times from a set of discrete-time samples of the signal amplitude. In other words, we must be able to interpolate the signal between samples. When we assume that the original signal is bandlimited to half the sampling rate (Nyquist criterion), the sampling theorem of Shannon states that the signal can be exactly and uniquely reconstructed for all time from its samples by bandlimited interpolation using an ideal low-pass filter. When the Nyquist criterion is not satisfied, the frequencies overlap; that is, frequencies above half the sampling rate will be reconstructed as, and appear as, frequencies below half the sampling rate [11]. This causes aliasing. In the case of image scaling, the effect is quite noticeable and leads to a loss of details of the original image. Several methods have been proposed in the last years to approximate the perfect reconstruction model of Shannon. They are generally based on finite interpolation filters with various analytical expressions ranging from polynomial to sinusoidal functions [10]. When the filter is very different from an ideal filter (sinc function), aliasing and blurring occur [16]. Generally, the scaling algorithms are subject to the trade-off between high contrast, few aliasing and unnoticeable blur. This appears because the same finite filter is applied to both the low and high frequency components. Thus, in some cases the unique filter will tend to blur the image while in other cases aliasing will occur. To reduce the aliasing and blur effects, we propose to use image derivatives in the reconstruction process. The image derivatives enhance the high-frequencies. Blurring is attenuated because the use of enhanced high-frequencies enables a good preservation of contrast in the scaling process and prevents the image from being uniformly smooth. Aliasing is reduced
because high-frequencies allow us to enhance both spatial and radiometric resolutions. To combine together the image and its partial derivatives, we propose to use the generalized sampling theorem of Papoulis [12]. This theorem enables us, in the reconstruction process, to use the information provided by the derivatives of an image so as to better handle aliasing and blurring artefacts. In the next section, we present the generalized sampling theorem of Papoulis.
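To make the idea of this section concrete, the following small sketch shows how the first- and second-derivative images referred to above might be computed. The exact derivative operators used by the authors are not given in this excerpt, so central differences (NumPy's gradient) are assumed here purely for illustration.

```python
# A minimal sketch, assuming central differences, of the auxiliary derivative
# channels that a Papoulis-style reconstruction could use alongside the image.
import numpy as np

def derivative_channels(image):
    """First- and second-order derivative images of a 2-D grayscale array."""
    img = np.asarray(image, dtype=np.float64)
    fy, fx = np.gradient(img)      # central differences along rows (y) and columns (x)
    fyy, _ = np.gradient(fy)       # second derivative in y
    _, fxx = np.gradient(fx)       # second derivative in x
    return fx, fy, fxx, fyy
```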
3 Papoulis' Sampling Theorem
Given a σ-bandlimited function f(t) satisfying the following conditions:
− f has a finite energy: $\int_{-\infty}^{+\infty} |f(t)|^{2}\,dt < \infty$
with p ∈ ]0, √2] for the normalized input image A. We not only define one fuzzy set small but three such fuzzy sets, one for each couple (one for the couple red-green, another for the couple red-blue and the last one for the couple green-blue). All these fuzzy sets depend on the parameter p as seen in expression (2) and Fig. 1. These parameters, which are denoted as prg, prb and pgb, are determined adaptively as follows:
$$p_{rg}(i,j) = \max_{k,l\in\vartheta}\big(\beta_{rg}(i,j,k,l)\big), \quad p_{rb}(i,j) = \max_{k,l\in\vartheta}\big(\beta_{rb}(i,j,k,l)\big), \quad p_{gb}(i,j) = \max_{k,l\in\vartheta}\big(\beta_{gb}(i,j,k,l)\big) \qquad (3)$$
where ϑ defines the (2K + 1) × (2K + 1) neighborhood around the central pixel, i.e., k, l ∈ {−K, −K + 1, ..., 0, ..., K − 1, K}, and where βrg(i, j, k, l), βrb(i, j, k, l) and βgb(i, j, k, l) were used as simplified notation for the following distances:
$$\beta_{rg}(i,j,k,l) = D\big(rg(i,j),\, rg(i+k,j+l)\big), \quad \beta_{rb}(i,j,k,l) = D\big(rb(i,j),\, rb(i+k,j+l)\big), \quad \beta_{gb}(i,j,k,l) = D\big(gb(i,j),\, gb(i+k,j+l)\big) \qquad (4)$$
As shown in the Fuzzy Rules 1–3 we use these β's in order to determine the weights of each neighbor. These rules contain conjunctions. In fuzzy logic, triangular norms and co-norms are used to represent conjunctions [13] (roughly the equivalent of the AND operator) and disjunctions (roughly the equivalent of the OR operator). Two well-known triangular norms (together with their dual
co-norms) are the product (probabilistic sum) and the minimum (maximum) [13]. We have used the algebraic product triangular norm. This means for instance that the fuzzification of the antecedent of fuzzy rule 1 is μS1(βrg(i, j, k, l)) · μS2(βrb(i, j, k, l)), where μS1 and μS2 are equal to the membership function SMALL shown in expression (2) with parameters prg(i, j) and prb(i, j), respectively. The obtained value (μS1(βrg(i, j, k, l)) · μS2(βrb(i, j, k, l))) is called the activation degree of the fuzzy rule and is used to obtain the corresponding weight, i.e., w(i + k, j + l, 1) = μS1(βrg(i, j, k, l)) · μS2(βrb(i, j, k, l)).
2.2 Output of the First Fuzzy Subfilter
So the weights w(i, j, 1), w(i, j, 2) and w(i, j, 3) are calculated as follows:
$$w(i+k,j+l,1) = \mu_{S1}(\beta_{rg}(i,j,k,l)) \cdot \mu_{S2}(\beta_{rb}(i,j,k,l)), \quad w(i+k,j+l,2) = \mu_{S1}(\beta_{rg}(i,j,k,l)) \cdot \mu_{S3}(\beta_{gb}(i,j,k,l)), \quad w(i+k,j+l,3) = \mu_{S2}(\beta_{rb}(i,j,k,l)) \cdot \mu_{S3}(\beta_{gb}(i,j,k,l)) \qquad (5)$$
where μS1, μS2 and μS3 are equal to the membership function SMALL shown in expression (2) with parameters prg(i, j), prb(i, j) and pgb(i, j), respectively. The output of the first subfilter can finally be illustrated for the red component, where the output image is denoted as F, i.e.:
$$F(i,j,1) = \frac{\sum_{k=-K}^{+K}\sum_{l=-K}^{+K} w(i+k,j+l,1)\, A(i+k,j+l,1)}{\sum_{k=-K}^{+K}\sum_{l=-K}^{+K} w(i+k,j+l,1)} \qquad (6)$$
The filtering method for the two other components is analogous to the one above.
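A rough sketch of the first subfilter described by expressions (3)-(6) is given below. Two elements are assumptions on our part, since they are defined in parts of the paper not reproduced in this excerpt: the SMALL membership function of expression (2) is approximated by a linear ramp, and the distance D of expression (4) is taken to be the absolute difference. Border pixels are ignored.

```python
# Sketch of the first subfilter (expressions (3)-(6)) for one interior pixel (i, j).
import numpy as np

def small(x, p):
    # Placeholder for the SMALL membership function of expression (2) (assumption).
    return np.clip(1.0 - x / p, 0.0, 1.0) if p > 0 else (x == 0).astype(float)

def first_subfilter_pixel(A, i, j, K=1):
    """A: float RGB image in [0, 1]; returns the filtered (R, G, B) value at (i, j)."""
    win = A[i - K:i + K + 1, j - K:j + K + 1, :]
    # Couples of color-component differences for the whole image.
    rg, rb, gb = A[..., 0] - A[..., 1], A[..., 0] - A[..., 2], A[..., 1] - A[..., 2]
    # Distances beta between the central couple and each neighbouring couple (4),
    # with D assumed to be the absolute difference.
    b_rg = np.abs(rg[i - K:i + K + 1, j - K:j + K + 1] - rg[i, j])
    b_rb = np.abs(rb[i - K:i + K + 1, j - K:j + K + 1] - rb[i, j])
    b_gb = np.abs(gb[i - K:i + K + 1, j - K:j + K + 1] - gb[i, j])
    # Adaptive parameters (3): maximum distance over the neighborhood.
    p_rg, p_rb, p_gb = b_rg.max(), b_rb.max(), b_gb.max()
    # Weights (5): conjunction realized by the algebraic product triangular norm.
    w1 = small(b_rg, p_rg) * small(b_rb, p_rb)
    w2 = small(b_rg, p_rg) * small(b_gb, p_gb)
    w3 = small(b_rb, p_rb) * small(b_gb, p_gb)
    # Weighted averages (6), one weight map per color component.
    out = np.empty(3)
    for c, w in enumerate((w1, w2, w3)):
        out[c] = (w * win[..., c]).sum() / w.sum()
    return out
```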
3 Second Fuzzy Subfilter
The second subfilter is used as a complementary filter to the first one. The goal of this subfilter is to improve the first method by reducing the noise in the color component differences without destroying the fine details of the image. This is realized by calculating the local differences in the red, green and blue environment separately. These differences are then combined to calculate the local estimation of the central pixel. As in the first method, we use a window of size (2L + 1) × (2L + 1) (where L is not necessarily equal to the parameter K of the first subfilter) centered at (i, j) to filter the current image pixel. We have used L = 1 since it reduces the time complexity of the method while still giving reasonably good results. Next we calculate the local differences (also called gradients or derivatives) for each element of the window, denoted as LDR, LDG and LDB for the red, green and blue environment, respectively. If the output image of the previous subfilter is denoted as F, then the differences are calculated as follows:
$$LD_{R}(k,l) = F(i+k,j+l,1) - F(i,j,1), \quad LD_{G}(k,l) = F(i+k,j+l,2) - F(i,j,2), \quad LD_{B}(k,l) = F(i+k,j+l,3) - F(i,j,3) \qquad (7)$$
for all k, l ∈ {−L, ..., 0, ..., +L}. These differences are then combined to calculate the following correction terms:
$$\varepsilon(k,l) = \frac{1}{3}\big(LD_{R}(k,l) + LD_{G}(k,l) + LD_{B}(k,l)\big) \qquad (8)$$
+L
Out(i, j, 1) =
k=−L l=−L
(2L + 1)2 +L
+L
Out(i, j, 2) =
,
F (i + k, j + l, 2) + (k, l)
k=−L l=−L
(2L + 1)2 +L
+L
Out(i, j, 3) =
F (i + k, j + l, 1) + (k, l)
,
(9)
F (i + k, j + l, 3) + (k, l)
k=−L l=−L
(2L + 1)2
,
with Out(i, j, 1), Out(i, j, 2) and Out(i, j, 3) the red, green and blue component of the output image and where (k, l) is the correction term for the neighboring pixel F (i + k, j + l, 1), F (i + k, j + l, 2) and F (i + k, j + l, 3).
4
Experiments
In this section we present some experimental results. We compared our proposed method, called the fuzzy color preserving Gaussian noise reduction method (FCG) with (i) other well-known fuzzy filters and (ii) recently developed waveletbased methods. More precisely we have: – Fuzzy: decreasing weight fuzzy filter with moving average center [14] (DWMAV), the fuzzy median filter [15] (FMF), the fuzzy bilateral filter [16] (FBF), the GOA filter [17] and the iterative fuzzy control based filter [18] (IFCF).
18
S. Schulte et al.
– Wavelet: the hidden Markov tree method [19,20] (HMT), the 3D-DFT method [21], the Bayesian least squares - Gaussian scale mixture filter [22] (BLS-GSM), the bivariate shrinkage method (BiShrink) [23], the chromatic filter proposed in [2] (Lucchese), the total least square filter [24] (TLS) and the fuzzy shrinkage method [25] (FuzzyShrink). Additionally we use the non-linear complex diffusion method [26] (NLCDM). As a measure of objective dissimilarity between a filtered image F and the original noisefree one O, we use the well-known peak signal to noise ratio (PSNR) and the normalized color difference (NCD). The NCD [4] is defined as: M N
N CDLAB (F, O) =
||ΔELAB ||
i=1 j=1 M N
,
(10)
∗ ||ELAB ||
i=1 j=1
∗ || = (L∗ )2 +(a∗ )2 +(b∗ )2 where ||ELAB
1/2
is the norm or magnitude of the orig inal (in LAB transferred) color image O and ||ΔELAB || = (ΔL∗ )2 + (Δa∗ )2 + 1/2 (Δb∗ )2 the perceptual color error, with ΔL∗ , Δa∗ and Δb∗ the difference ∗ ∗ in the L , a and b∗ components between the filtered image F and the original color images O. In order to get a clear idea of the performance of all mentioned methods we have carried out experiments for three test images shown in Fig. 2, i.e., “Baboon”, “Leaves” and “Plane”, each of size 512 × 512. The numerical results for the corrupted versions, for different standard deviations of the noise σ, are shown in Tables 1 - 3. Additionally, we illustrated the visual performances of the best methods in Fig. 3. From those results we can make the following conclusions: – The numerical results generally illustrate that the more complex wavelet based methods receive a better noise reduction than most of the fuzzy based
(a)
(b)
(c)
Fig. 2. The three test images: (a) the (512 × 512) “Baboon” image, (b) a (512 × 512) “Leaves” image and (c) the (512 × 512) “Plane” image
A New Fuzzy Additive Noise Reduction Method
19
Table 1. Comparative results in terms of PSNR (dB) of different filtering methods for various distortions of Gaussian noise for the “Baboon” image and “Leaves” image
σ 5 Noise 34.2 FMF 34.61 IFCF 24.91 FBF 34.51 GOA 29.28 FCG 36.59 HMT 34.51 3D-DFT 34.39 BLS-GSM 26.72 BiShrink 33.77 FuzzyShrink 34.64 Lucchese 25.86 NLCDM 27.16 TLS 35.95 FuzzyShrink+FCG 35.58
PSNR (dB) Baboon 10 20 30 40 28.22 22.28 18.87 16.54 29.65 25.19 22.72 20.92 24.62 23.84 22.97 21.74 29.26 25.71 23.89 22.51 27.44 22.59 21.92 21.52 31.92 26.95 25.13 23.61 30.22 26.37 24.26 22.95 30.46 26.60 24.53 23.26 26.72 26.56 25.19 21.14 30.51 26.94 24.94 23.60 30.50 26.55 24.57 23.23 25.70 25.53 24.00 21.25 26.67 25.12 23.36 21.69 31.32 27.19 25.02 23.04 31.59 27.73 25.25 23.78
5 34.13 26.83 27.12 34.95 33.76 35.34 35.37 35.29 29.41 35.73 35.94 29.06 29.41 36.37 37.28
Leaves 20 30 22.12 18.53 27.12 24.60 25.56 24.39 27.16 25.17 27.00 25.28 28.59 26.28 27.15 25.02 28.54 26.57 28.32 25.95 28.08 25.99 28.38 26.28 27.36 24.97 26.36 24.14 27.84 25.75 29.43 27.21
Table 2. Comparative results in terms of PSNR (dB) of different filtering methods for various distortions of Gaussian noise for the “Plane” image
σ 5 Noise 34.14 FBF 35.57 DWMAV 32.69 GOA 37.53 IFCF 32.52 FCG 37.27 HMT 37.50 3D-DFT 38.78 BLS-GSM 37.99 BiShrink 37.70 FuzzyShrink 38.39 Lucchese 31.89 NLCDM 34.92 TLS 38.62 FuzzyShrink+FCG 38.15
10 28.13 31.14 32.77 34.46 31.40 34.24 33.89 35.55 35.34 34.15 35.22 31.56 32.57 35.24 35.71
15 24.60 29.52 30.58 32.64 30.19 32.40 31.80 33.73 33.46 32.15 33.35 31.02 30.87 33.34 33.91
PSNR (dB) 20 25 22.11 20.17 28.62 27.99 29.36 28.64 31.36 30.41 28.99 27.88 31.28 30.15 30.39 29.30 32.39 31.42 32.23 31.21 30.79 29.79 32.01 30.97 30.18 28.81 29.69 28.58 32.02 31.01 32.58 31.52
30 18.91 27.48 28.01 29.61 26.96 29.46 28.51 30.62 30.35 28.97 30.10 26.91 27.82 30.18 30.64
40 16.09 26.56 26.61 28.01 25.33 28.17 26.98 29.13 28.82 27.61 28.72 22.98 26.84 28.70 29.24
50 14.15 25.62 25.25 26.51 24.10 27.40 25.76 27.74 27.33 26.36 27.62 20.00 25.75 27.23 28.13
methods. The proposed FCG method and the GOA are the only methods which really can compete with the more complex NLCDM and the wavelet based methods.
Table 3. Comparative results in terms of NCD of different filtering methods for various distortions of Gaussian noise for the “Plane” image
Method           NCD (×10²)
                  σ=5    σ=10   σ=15   σ=20    σ=25    σ=30    σ=40    σ=50
Noise            21.38  42.77  64.17   85.01  104.50  122.26  152.57  177.02
DWMAV            11.89  17.43  23.63   29.89   27.31   31.09   38.25   45.05
IFCF             12.97  18.96  25.06   30.99   36.85   42.21   52.79   62.20
GOA               9.67  13.07  15.74   17.97   20.21   22.15   26.19   30.61
FCG              10.42  13.56  16.62   18.29   18.83   21.37   23.51   26.39
HMT              10.59  15.64  20.08   23.45   26.65   28.74   35.03   38.58
3D-DFT            9.09  11.95  13.76   15.52   16.88   18.08   20.45   23.22
BLS-GSM           8.84  11.86  13.88   16.12   17.60   19.01   21.97   25.40
BiShrink         10.40  14.61  17.70   20.09   22.00   23.63   26.55   29.43
NLCDM            12.28  20.79  22.37   28.32   34.03   27.71   33.63   39.18
TLS               9.41  12.48  14.61   16.26   17.60   18.70   20.97   24.03
FuzzyShrink       9.22  12.52  15.11   17.51   19.78   22.03   26.27   30.38
FuzzyShrink+FCG   8.92  11.15  13.28   15.38   17.39   19.37   23.16   26.82
– It can be observed that the GOA filter does not only remove noise but also fine details and texture elements of the image. This causes too much smoothed arrays on the edges and contours, which make the output image less sharp. – The best visual performance is received by the proposed FCG method, the BLS-GSM, the BiShrink, the TLS and the FuzzyShrink method. The main improvement in the proposed method can be seen in the small details of the images where we observe that the colors are preserved much better than the other complex wavelet based methods. The proposed method is the only method where the color component distances are preserved well. – In Tables 1-2 we illustrate the combination of the wavelet-based FuzzyShrink method and the proposed FCG method. By combining both methods we receive a good noise reduction filter where the differences between the color components are preserved much better, so that no outliers appear at the edges. These results indicate that future research should be done in that field.
5
Conclusion
In this paper we proposed a new fuzzy filter called the the fuzzy color preserving Gaussian noise reduction method (FCG) for additive noise reduction in digital color images. The main advantages of this new filter are the denoising capability and the reconstruction capability of the destroyed color component differences. Numerical measures, such as the PSNR, and visual observations showed convincing results. We illustrated that the proposed method generally outperforms most of the other fuzzy filters. Moreover we showed that the proposed method achieves a comparable noise reduction performance to the more complex wavelet
Fig. 3. The restoration of a noisy “Baboon” image with (a) a noise-free part, (b) the same part contaminated with additive Gaussian noise with σ = 30, (c) FCG, (d) BLS-GSM, (e) BiShrink, (f) TLS, (g) FuzzyShrink, (h) HMT, (i) FBF, (j) GOA, (k) EIFCF, (l) GMAV
We also illustrated that we obtain a better performance (numerically as well as visually) when a wavelet-based method (e.g., the FuzzyShrink method) is combined with our proposed method. Future research can focus on this issue and on the construction of other fuzzy wavelet filtering methods for color images to suppress other noise types as well (Rician noise, speckle noise, striping noise, etc.).
Subband-Adaptive and Spatially-Adaptive Wavelet Thresholding for Denoising and Feature Preservation of Texture Images
J. Li, S.S. Mohamed, M.M.A. Salama, and G.H. Freeman
Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada
Abstract. In imaging applications such as ultrasound or Synthetic Aperture Radar (SAR), the local texture features are descriptive of the type of geological formation or biological tissue at that spatial location. Therefore, when denoising these texture images, it is essential that the local texture details characterizing the geological formation or tissue type are not lost. When processing such images, the operator usually has prior knowledge of the type of textures to expect in the image. In this work, this prior knowledge is exploited to implement a spatially-adaptive and subband-adaptive wavelet threshold that denoises texture images while preserving the characteristic features of the textures. The proposed algorithm involves three stages: texture characterization, texture region identification system training, and texture region identification and denoising. In the first stage, the texture features to be preserved are characterized by the subband energies of the wavelet decomposition details at each level. Next, the energies of the characteristic subbands are used as inputs to train the adaptive neural-fuzzy inference system (ANFIS) classifier for texture region identification. Finally, the texture regions are identified by the ANFIS and the subband-adaptive BayesShrink threshold is adjusted locally to obtain the proposed spatially-adaptive and subband-adaptive threshold.
1 Introduction
Wavelet-based denoising for additive Gaussian noise removal has lately been explored extensively in the literature. Unlike traditional denoising approaches such as Wiener filtering, wavelet thresholding or shrinkage is inherently nonlinear. It has been shown that wavelet thresholding provides better rates of convergence for functions in Besov spaces compared to linear approaches [1]. Typically, in wavelet denoising, the image is first decomposed into its subband components. Then, the coefficients in each subband are compared to a threshold value. If a coefficient is below the threshold, it is deemed noise and set to zero. If the coefficient is larger than the threshold, it is likely part of an edge or texture, and is kept or modified. Selecting the appropriate threshold is crucial for optimal wavelet-based denoising. VisuShrink was one of the earliest and best-known thresholds [2]. VisuShrink is a universal threshold (the same for all coefficients in all subbands), defined as $T_V = \sigma\sqrt{2\log M}$, where $M$ is the length of the 1-D deterministic signal and $\sigma$ is the estimated standard deviation of the noise.
VisuShrink is asymptotically optimal in the min-max sense. More recently, subband-adaptive thresholds have been proposed to take advantage of the fact that coefficients relating to strong edges and textures are often stored in several characteristic subbands. BayesShrink is an example of a subband-adaptive threshold [3]. The BayesShrink threshold is defined as

$$T_B(\hat{\sigma}_X) = \frac{\hat{\sigma}^2}{\hat{\sigma}_X},$$

where the standard deviation of the original (noise-free) coefficients in the subband is approximated by

$$\hat{\sigma}_X = \sqrt{\max\big(\hat{\sigma}_Y^2 - \hat{\sigma}^2,\ 0\big)}.$$

The noise standard deviation $\hat{\sigma}$ is estimated using the robust median estimator $\hat{\sigma} = \mathrm{Median}(|Y_{ij}|)/0.6745$, where $Y_{ij} \in$ subband HH1 (finest-level details in the diagonal direction), and the standard deviation of the (noisy) coefficients in the subband is found by

$$\hat{\sigma}_Y = \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} Y_{ij}^2}\,,$$
where $n \times n$ is the size of the subband being processed. This threshold is subband-adaptive because it depends on parameter estimation in each subband. Subband-adaptive thresholds such as the BayesShrink threshold preserve more edge and texture details than universal thresholds such as the VisuShrink threshold. However, from a spatial perspective, the optimal preservation of edges and texture details is still only accomplished in a global sense. Since texture and edges are important discriminators for image segmentation and classification, preservation of texture details in the local sense is extremely important. Therefore, a wavelet thresholding strategy that can preserve the most important characteristics of each texture locally would be very beneficial. Constructing a threshold that is locally adaptive requires the algorithm to detect local textures. The classical approach to texture identification is the co-occurrence matrix. The texture identification problem was also explored in [4], where it was argued that any complex texture can be separated into two mutually orthogonal, spatially homogeneous components, one of which is deterministic and one of which is purely indeterministic. The wavelet decomposition inherently separates a texture into orthogonal components (subbands), some of which are descriptive of the texture, while others are indeterministic. This property was utilized by Laine et al., who used wavelet detail coefficients at different levels of the wavelet decomposition as inputs to a two-layer feedforward neural network to successfully classify 25 natural textures [5]. In this paper, a supervised classifier is implemented for texture region detection. Once the texture regions are identified, the threshold is adapted locally in each subband to optimally remove the noise while preserving the local texture characteristics in each texture region.
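To make the two thresholds discussed above concrete, the following minimal Python sketch computes the universal VisuShrink threshold and the subband-adaptive BayesShrink threshold for one subband. It is not the authors' code; the function names are illustrative, and the only assumptions are the formulas just given.

```python
import numpy as np

def visushrink_threshold(sigma, M):
    """Universal threshold T_V = sigma * sqrt(2 log M) for a signal of length M."""
    return sigma * np.sqrt(2.0 * np.log(M))

def estimate_noise_sigma(hh1):
    """Robust median estimator: sigma_hat = median(|Y_ij|) / 0.6745 on subband HH1."""
    return np.median(np.abs(hh1)) / 0.6745

def bayesshrink_threshold(subband, sigma):
    """Subband-adaptive BayesShrink threshold T_B = sigma^2 / sigma_X for one subband.

    `subband` holds the noisy coefficients Y_ij of the subband and `sigma` is the
    noise standard deviation (e.g. estimated with estimate_noise_sigma above).
    """
    sigma_y2 = np.mean(subband.astype(float) ** 2)        # (1/n^2) * sum of Y_ij^2
    sigma_x = np.sqrt(max(sigma_y2 - sigma ** 2, 0.0))    # std of the noise-free coefficients
    if sigma_x == 0.0:                                    # subband is essentially pure noise:
        return np.abs(subband).max()                      # threshold everything
    return sigma ** 2 / sigma_x
```

With a standard wavelet package, `hh1` would simply be the finest-level diagonal detail subband of the noisy image.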
In Section 2, the procedure for constructing the subband-adaptive and spatially-adaptive threshold is outlined. In Section 3, the texture characterization method is explained, while Section 4 covers the training of the texture region identification system. In Section 5, texture identification and denoising are explained, and the results of the proposed algorithm are presented in Section 6. The important features and pitfalls associated with this approach are discussed in Section 7. The major contributions of this work are summarized in Section 8.
2 Proposed Algorithm Overview
To preserve important texture characteristics, one must first determine which characteristics are considered important. This degree of importance is usually application dependent. In applications such as SAR and medical imaging, the goal is often to differentiate between two textures, each representative of a terrain or tissue class. In this case, the difference in subband energy levels of the two textures is a useful indicator to distinguish between them. A denoising algorithm which preserves important texture characteristics locally requires the identification of the local texture regions. Wavelet decomposition subband energies which characterize these regions can be used as input features to train a classifier to identify the texture regions. Once the regions are identified, the wavelet denoising threshold can be modified locally to preserve the texture's important characteristics. For this paper, to prove the concept, the focus will be on the problem of denoising images in which there are two important textures to be distinguished. An overview of the proposed algorithm is depicted in Fig. 1. The algorithm consists of three stages: texture characterization, texture identification system training, and texture identification and denoising. The texture characterization stage takes the two textures as input and determines the optimal subbands distinguishing between the textures and the wavelet threshold multiplication factors for each subband of the two textures. The adaptive neural-fuzzy inference system (ANFIS) is selected as the classifier for region identification.
Fig. 1. System Overview
Fig. 2. Synthetic texture images. Left-panel: purely horizontal texture. Right-panel: purely diagonal texture.
At the texture identification system training stage, the ANFIS is trained with the energies of the optimal subbands distinguishing between the textures. At the texture identification and denoising stage, the texture regions are localized by the ANFIS and the thresholds are set locally. Finally, the wavelet coefficients of the noisy image are soft-thresholded with the subband-adaptive and spatially-adaptive threshold maps and the denoised image is reconstructed. The effectiveness of the proposed algorithm is demonstrated using the synthetic texture images in Fig. 2.
3 Texture Characterization
The procedure for the texture characterization step is shown in Fig. 3 and involves the following four steps: 1) Perform wavelet decomposition on the individual texture images. 2) Calculate the local energy in each subband. 3) Determine the characteristic subbands and distinguishing subbands based on median energy comparisons. 4) Set the thresholding multiplication factor for each subband of a particular texture to be inversely proportional to the square root of the median energy in that subband. The dyadic wavelet transform [6] is performed on the original texture images, $f_A(m,n)$ and $f_B(m,n)$, to extract the wavelet detail coefficients at various scales and orientations, $w_{HL1}(m2^{-1}, n2^{-1})$, $w_{LH1}(m2^{-1}, n2^{-1})$, $w_{HH1}(m2^{-1}, n2^{-1})$, $w_{HL2}(m2^{-2}, n2^{-2})$, etc. The subscripts H and L represent the high- and low-pass filters of the quadrature-mirror filter used in the analysis filter bank. For example, $w_{HL1}(m2^{-1}, n2^{-1})$ represents the first-level detail coefficients of the horizontal component of the image. The subbands of the orthogonal wavelet transform are depicted in Fig. 4(a), and the corresponding subband detail coefficients for the purely horizontal and purely diagonal textures are displayed in Fig. 4(b) and Fig. 4(c), respectively. The local energy is calculated as follows:
$$E_{d,l}(m2^{-l}, n2^{-l}) = \frac{1}{aN^2}\sum_{i,j=1}^{N}\big(w_{d,l}(m2^{-l} - i2^{-l},\ n2^{-l} - j2^{-l})\big)^2 \qquad (1)$$
where d and l represent the direction and level of the subband being processed, N is the size of the local feature window, and a represents a parameter to normalize the
energy. The larger the parameter a, the greater the amount of denoising and the fewer texture details are preserved. If a is small, less denoising occurs and more details are kept. This parameter is set to a value of 200 for subsequent processing. The subband energy maps of the texture images are depicted in Fig. 5. The median energy value in each subband is representative of the strength of the texture at the direction and scale represented by the subband. This value for each of the subbands of the purely horizontal and purely diagonal textures, as well as their subband energy differences, are listed in Table 1.
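A minimal sketch of the local energy computation of Eq. (1) for one subband is given below. It is illustrative only: it assumes the detail coefficients of the subband are available as a 2-D array (for example from a standard wavelet package), uses a trailing N×N window truncated at the image borders, and keeps the normalization a·N² fixed everywhere.

```python
import numpy as np

def local_energy(w_dl, N=4, a=200.0):
    """Local energy map of Eq. (1) for one subband of detail coefficients w_dl.

    For each coefficient, the squared coefficients in a trailing N x N window
    (truncated at the borders) are summed and normalized by a * N^2.
    """
    rows, cols = w_dl.shape
    sq = w_dl.astype(float) ** 2
    E = np.zeros_like(sq)
    for m in range(rows):
        for n in range(cols):
            i0, j0 = max(0, m - N + 1), max(0, n - N + 1)
            E[m, n] = sq[i0:m + 1, j0:n + 1].sum() / (a * N ** 2)
    return E
```

The median of E over a subband is the per-subband value reported in Table 1.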
Fig. 3. Texture characterization subsystem overview
Fig. 4. (a) Subbands of the 2-D orthogonal wavelet transform. (b) Subband detail coefficients of the purely horizontal texture. (c) Subband detail coefficients of the purely diagonal texture.
Fig. 5. (a) Subband energy map of the purely horizontal texture. (b) Subband energy map of the purely diagonal texture. (c) Difference in subband energy map between the purely horizontal and purely diagonal textures.

Table 1. Median energy in each subband of both textures, and absolute differences of their subband energies, normalized by the parameter a set to 200
Subband   pure_hor    pure_diag   difference   order
HL1       8.1074      0.4067      7.7007       7
LH1       0.0000      0.4067      0.4067       8
HH1       0.0000      8.4580      8.4580       6
HL2       23.2093     10.8854     12.3239      4
LH2       0.0000      10.8854     10.8854      5
HH2       0.0000      34.3337     34.3337      3
HL3       670.0265    0.0000      670.0265     1
LH3       0.0000      0.0000      0.0000       9
HH3       0.0000      418.9551    418.9551     2
HL4       0.0000      0.0000      0.0000       10
LH4       0.0000      0.0000      0.0000       11
HH4       0.0000      0.0000      0.0000       12
The absolute difference of the subband energy between the textures is a descriptive characteristic; hence, the subbands are ranked by this difference to distinguish between the textures. For the purely horizontal and purely diagonal textures, the subbands HL3 (3rd-level horizontal component of the decomposition) and HH3 (3rd-level diagonal component of the decomposition) are ranked as the most descriptive subbands for this case. The energy values in these two subbands are then used as input features to train the ANFIS for texture region identification. In wavelet thresholding, coefficients below the set threshold are either removed or modified (depending on hard or soft thresholding). Thus, to optimally preserve the textures locally, the adaptive threshold in a particular subband should be set lower for texture regions in which that subband is considered characteristic of the texture, so that fewer details in that subband are distorted. Conversely, to optimally remove the noise locally, the adaptive threshold in a particular subband should be set higher if that subband is considered non-deterministic in characterizing the texture, so that more noisy coefficients in the subband can be removed. In this paper, it is proposed that the degree of importance in characterizing a texture is quantified by the texture's median subband local energy levels. A threshold multiplication factor is defined to adjust the wavelet threshold in each subband of the two textures based on the importance of the
subband in characterizing that texture. This threshold multiplication factor is defined as follows:

$$tmf_{t,d,l} = \min\!\left(\frac{1}{\sqrt{\operatorname{median}\big(E_{d,l}(m2^{-l}, n2^{-l})\big)}},\ q\right) \qquad (2)$$

where t represents the associated texture, q represents a threshold parameter with a value in the range [0,1], and d, l describe the direction and level of the wavelet decomposition subband. The min operator assigns the cap value q to subbands whose median energy is low (non-deterministic subbands), so that these subbands receive a relatively large threshold and more noise can be removed in them. The threshold multiplication factors for each subband are listed in Table 2.

Table 2. Threshold multiplication factors for each subband of each texture (with q set to 1)
Subband   pure_hor   pure_diag
HL1       0.35       1.00
LH1       1.00       1.00
HH1       1.00       0.34
HL2       0.21       0.30
LH2       1.00       0.30
HH2       1.00       0.17
HL3       0.04       1.00
LH3       1.00       1.00
HH3       1.00       0.05
HL4       1.00       1.00
LH4       1.00       1.00
HH4       1.00       1.00
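Under the reconstruction of Eq. (2) given above, the factors of Table 2 can be reproduced with a few lines of Python. This is an illustrative sketch only (not the authors' implementation), with q = 1 as in the table caption; the median energies below are taken from Table 1.

```python
import numpy as np

def threshold_multiplication_factor(median_energy, q=1.0):
    """Eq. (2) as reconstructed here: tmf = min(1 / sqrt(median subband energy), q)."""
    if median_energy <= 0.0:      # zero-energy (non-deterministic) subband: use the cap q
        return q
    return min(1.0 / np.sqrt(median_energy), q)

# Median energies of the purely horizontal texture (HL1, HL2, HL3) from Table 1:
for e in (8.1074, 23.2093, 670.0265):
    print(round(threshold_multiplication_factor(e), 2))   # 0.35, 0.21, 0.04 as in Table 2
```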
4 Training of Texture Region Identification System
Training the ANFIS for texture region identification requires the following steps: 1) Add an upper-bound amount of noise to the mixed texture image used for training. 2) Perform wavelet decomposition. 3) Calculate the local energy. 4) Train the ANFIS using the energy of the maximally distinguishing subband(s). This process is depicted in Fig. 6. In this work, the textures to be identified are randomly mixed to create a synthesized training image. Furthermore, since noise distorts the deterministic components of the texture, additive Gaussian noise is injected into the training image to improve the noise robustness of the classifier. The noisy training image and its target map are displayed in Fig. 7.
Fig. 6. Texture region identification system training
Fig. 7. (a) The noisy training image and (b) the target map for the training image
ANFIS is a supervised neural-fuzzy classifier which combines the advantages of fuzzy inferencing with neural network training [7]. ANFIS has the ability to generalize from a limited number of input samples and has been shown to be robust in pattern recognition applications [8]. In this work, ANFIS is used as the classifier for texture region identification. In the texture characterization stage, the most important subbands in differentiating between the textures are determined to be HL3 and HH3. The energy values in these subbands, along with the target map, are used as input features for ANFIS training.
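No ANFIS implementation is assumed to be available here. Purely as an illustration of the interface this stage needs, the sketch below builds the training features as described (the local energies of the distinguishing subbands HL3 and HH3 of the noisy mixed training image, plus the target map) and fits a trivial nearest-centroid classifier as a stand-in for the ANFIS; all names are hypothetical.

```python
import numpy as np

def train_region_classifier(e_hl3, e_hh3, target_map):
    """Stand-in for ANFIS training: one centroid per class in the (E_HL3, E_HH3) plane.

    e_hl3, e_hh3 : local energy maps of the distinguishing subbands (training image)
    target_map   : 0/1 map of the same shape (0 = horizontal texture, 1 = diagonal)
    """
    X = np.stack([e_hl3.ravel(), e_hh3.ravel()], axis=1)
    y = target_map.ravel()
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def region_map(e_hl3, e_hh3, centroids):
    """Soft region map in [0, 1]: relative closeness to the 'diagonal' centroid."""
    c0, c1 = centroids
    X = np.stack([e_hl3.ravel(), e_hh3.ravel()], axis=1)
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    rm = d0 / (d0 + d1 + 1e-12)          # 0 -> horizontal region, 1 -> diagonal region
    return rm.reshape(e_hl3.shape)
```

An actual ANFIS, as used in the paper, additionally fuzzifies the inputs and tunes membership functions during training; the stand-in only mimics the interface (a soft output between 0 and 1).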
5 Texture Identification and Denoising
After the textures are characterized and the ANFIS is trained, the denoising stage starts, which consists of the following six steps:
1) Perform wavelet decomposition of the image to be denoised.
2) Apply the ANFIS to identify the texture regions.
3) Compute the BayesShrink threshold for each subband.
4) Calculate the threshold maps based on the BayesShrink threshold and the predefined threshold multiplication factor for each detected texture region.
5) Apply soft thresholding with the threshold maps.
6) Reconstruct the image.
These steps are depicted in Fig. 8.
Fig. 8. Texture region identification and denoising subsystem overview
The texture image to be denoised is shown in Fig. 9.a. The two previously defined most-distinguishing subbands, HL3 and HH3, are selected as input features to the ANFIS to determine the location of the texture regions. The texture region map resulting from the ANFIS is shown in Fig. 9.b. There are some inaccuracies in the detected regions, which can be primarily attributed to the loss of resolution at the 3rd level of the wavelet decomposition. The proposed subband-adaptive and spatially-adaptive threshold is created by combining the subband-adaptive BayesShrink threshold, which has near-optimal mean squared error (MSE) relative to the best wavelet thresholding strategy [3], with the locally-adaptive texture region maps created by the ANFIS. The ANFIS
Fig. 9. a) Texture image to be denoised, b) Texture regions identified by ANFIS
outputs are located in a continuous range from 0 (horizontal) to 1 (diagonal). Since the ANFIS output is proportional to the likelihood that the pixel is located in a certain region, the subband-adaptive and spatially-adaptive threshold can be formulated as

$$T_{new,d,l}(m2^{-l}, n2^{-l}) = T_{B,d,l} \times \Big( RM_{d,l}(m2^{-l}, n2^{-l}) \times tmf_{A,d,l} + \big(1 - RM_{d,l}(m2^{-l}, n2^{-l})\big) \times tmf_{B,d,l} \Big)$$
where d, l represent the direction and the level of the wavelet decomposition subbands, $T_{B,d,l}$ is the BayesShrink threshold for that subband, $RM_{d,l}$ is the region map for the subband, and $tmf_{A,d,l}$ is the threshold multiplication factor for the subband found in the texture characterization section. The previous equation is expected to increase the threshold in regions where both the relative region map value and the threshold multiplication factor (importance of the subband for describing this texture) are high, to preserve the details in the subband, and to decrease the threshold in regions which have a high relative region map value and a low threshold multiplication factor (non-deterministic subband), to remove more noise. The resulting threshold map is displayed in Fig. 10(c). It is compared to the threshold maps created using the VisuShrink (universal) and BayesShrink (subband-adaptive) approaches, shown in Fig. 10(a) and (b), respectively. Dark regions correspond to relatively low thresholds, where more details in those subbands are preserved. Bright regions represent relatively high thresholds, which remove more information in the subband. After the threshold map is calculated, the wavelet coefficients are soft-thresholded as follows:
$$w_{mod,d,l}(m2^{-l}, n2^{-l}) = \operatorname{sgn}(x)\cdot\max\big(|x| - T_{new,d,l}(m2^{-l}, n2^{-l}),\ 0\big),$$

where $x = w_{d,l}(m2^{-l}, n2^{-l})$, and d, l represent the direction and level of the wavelet decomposition subbands. The denoised image is reconstructed by performing the inverse wavelet transform on the modified wavelet coefficients.
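The construction of the combined threshold map and the soft thresholding of one subband can be sketched as follows. This is illustrative only: `region_map` is assumed to be the ANFIS output resampled to the subband size, and `tmf_a`/`tmf_b` are the factors of Table 2 for that subband.

```python
import numpy as np

def adaptive_threshold_map(t_bayes, region_map, tmf_a, tmf_b):
    """Subband- and spatially-adaptive threshold:
    T_new = T_B * (RM * tmf_A + (1 - RM) * tmf_B), element-wise over the subband."""
    return t_bayes * (region_map * tmf_a + (1.0 - region_map) * tmf_b)

def soft_threshold(w, t_map):
    """w_mod = sgn(w) * max(|w| - T_new, 0), applied element-wise."""
    return np.sign(w) * np.maximum(np.abs(w) - t_map, 0.0)
```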
Fig. 10. Threshold maps calculated using a) VisuShrink, b) BayesShrink, c) the proposed approach
6 Results
The proposed approach is applied to denoise the mixed texture image in Fig. 9.a, which is 512×512 pixels in size and has pixel intensities in the range [0, 255]. The proposed approach is tested at additive Gaussian noise levels σ = 10, 20, 30, 40, 56. The MSE is used as a quantitative criterion to judge the effectiveness of the proposed approach and is defined as

$$\mathrm{MSE}(\hat{f}) = \frac{1}{N^2}\sum_{i,j=1}^{N}\big(f_{ij} - \hat{f}_{ij}\big)^2.$$
The MSE values of the resulting denoised images using Wiener filtering, VisuShrink, BayesShrink, and the proposed approach are tabulated in Table 3 and compared in Fig. 11. Table 3 and Fig. 11 show that the proposed approach achieves the minimum MSE under all noise conditions. The performance of the BayesShrink algorithm is comparable to the proposed algorithm at low noise levels. However, at high noise levels, the advantage of the locally-adaptive and subband-adaptive strategy is more noticeable. The Wiener filter and VisuShrink are neither subband-adaptive nor locally-adaptive, resulting in a greater amount of distortion.

Table 3. MSE values for the various denoising algorithms at different levels of noise
σ 10 20 30 40 56
Wiener 4.11E+03 5.71E+03 6.77E+03 7.62E+03 8.77E+03
VisuShrink 6.96E+03 8.48E+03 8.93E+03 9.38E+03 1.01E+04
BayesShrink 245.8813 667.3399 1.24E+03 2.11E+03 3.45E+03
Proposed 178.2508 369.4528 632.6539 1.02E+03 1.86E+03
Fig. 11. MSE of various denoising approaches at different noise levels
Fig. 12. a) Denoising result of the Wiener filter for the noisy image with σ=56. b) Denoising result of the VisuShrink approach for the noisy image with σ=56. c) Denoising result of the BayesShrink approach for the noisy image with σ=56. d) Denoising result of the proposed approach for the noisy image with σ=56.
The denoised image (σ=56) using the proposed approach is displayed in Fig. 12.d. It is compared to the results of Wiener filtering and of the VisuShrink and BayesShrink approaches, shown in Fig. 12.a, b and c, respectively. The Wiener filter is based on the Fourier transform, which is not localized spatially. Thus, when the noise level is high, denoising with the Wiener filter results in a severe loss of local texture properties. Fig. 10 showed that the universal threshold of VisuShrink removed important coefficients in the HL3 (horizontal, level 3) subband. This is apparent in Fig. 12, where the details in the horizontal texture region are blurred and the diagonal textures are severely distorted. The subband-adaptive nature of BayesShrink preserved the most important features of the local texture regions. However, since the BayesShrink threshold is not spatially-adaptive, this results in a horizontal noise preservation bias in the diagonal texture region and a diagonal noise preservation bias in the horizontal texture region. The horizontal noise preservation bias in the diagonal texture region is more apparent to the human observer. The systematically preserved noise appears as diagonal lines running from upper left to bottom right in the diagonal texture regions. Unlike the BayesShrink threshold, the proposed locally-adaptive threshold has no noise preservation bias in the local regions, which results in optimal preservation of texture integrity. The differences between the results of the BayesShrink approach and the proposed approach are highlighted in Fig. 13(a) and (b), in which the difference images between the denoised images produced by the respective algorithms and the image to be denoised are shown. The BayesShrink difference image shows that the BayesShrink algorithm results in a loss of horizontal texture details in the horizontal texture zone and of diagonal texture details in the diagonal texture zone. In contrast, the proposed algorithm results in no noticeable loss of local texture properties.
7 Discussions
The proposed approach for texture identification has several advantages over classical approaches, such as the co-occurrence matrix approach. The latter is ineffective when classifying macro-textures. In the proposed approach, macro-texture characteristics are captured by the local energy filter at subbands situated at high wavelet decomposition levels. For example, if the original image is of size 512×512, at decomposition level 1 the coefficient images are of size 256×256, and a local energy filter of size 4×4 covers 1/4096 of the image area at each step. At decomposition level 4, where the coefficient images are of size 32×32, the same local energy filter of size 4×4 covers 1/64 of the image area at each step, and consequently larger texture details are captured. When training the ANFIS for texture region identification, there are several tradeoffs to consider. If the number of distinguishing input features is increased, the ANFIS will likely provide more accurate region identification, especially for cases where the texture characteristic differences are not concentrated in a small number of subbands. However, the training and evaluation time required increases exponentially with the number of input features. Furthermore, an increase in the
number of textures to be distinguished also negatively affects the computational time and the effectiveness of the ANFIS approach. In finding the optimal threshold, it would be ideal if the effect of the a and q parameters used to calculate the normalized local energy and the threshold multiplication factor could be more accurately defined. This is possible if a cost function (for example, one that minimizes the MSE) is set up in which a and q are variables. Then, the optimal solution could be found using numerical methods, as sketched below.
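As a concrete, hypothetical instance of the numerical optimization suggested here, a and q could be chosen by a simple grid search on a reference image with known ground truth, using the MSE of Section 6 as the cost function. In the sketch below, `denoise` is a placeholder for the full pipeline run with the given parameters.

```python
import numpy as np

def mse(f, f_hat):
    return np.mean((f.astype(float) - f_hat.astype(float)) ** 2)

def grid_search_a_q(denoise, noisy, clean, a_values, q_values):
    """Return (a*, q*, minimum MSE); denoise(noisy, a=a, q=q) runs the whole pipeline."""
    best = (None, None, np.inf)
    for a in a_values:
        for q in q_values:
            err = mse(clean, denoise(noisy, a=a, q=q))
            if err < best[2]:
                best = (a, q, err)
    return best
```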
8 Conclusions
In this work, a locally-adaptive and subband-adaptive threshold is defined so that a wavelet thresholding strategy can be applied to denoise texture images while preserving the textures' characteristic details. To implement the strategy, first, the textures to be denoised are characterized by the subband energies of their wavelet decomposition. Threshold multiplication factors are set to be inversely proportional to the square root of the median energy in each subband. Then, an ANFIS is trained to localize the texture regions based on the energies of the distinguishing subbands. Finally, to prove the concept, an image that is to be denoised and contains both textures is processed by the ANFIS and the texture regions are identified. The appropriate threshold multiplication factors for each texture are applied locally in each subband, which, combined with the BayesShrink threshold of that subband, results in the locally-adaptive and subband-adaptive threshold. This threshold is applied to the wavelet detail coefficients in each subband using the soft-thresholding strategy, and the denoised image is reconstructed with the inverse wavelet transform. Preliminary results indicate that the proposed strategy is more effective in preserving the local texture details, both in the MSE sense and visually, than other denoising strategies such as Wiener filtering, VisuShrink and BayesShrink.
References
1. Donoho, D.L.: De-noising by soft-thresholding. IEEE Trans. Inform. Theory 41(3), 613–627 (1995)
2. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455 (1994)
3. Chang, S.G., Yu, B., Vetterli, M.: Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Processing 9(9), 1532–1546 (2000)
4. Francos, J.M., Meiri, A.Z., Porat, B.: A unified texture model based on a 2-D Wold-like decomposition. IEEE Trans. Signal Processing 41, 2665–2678 (1993)
5. Laine, A., Fan, J.: Texture classification by wavelet packet signatures. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1186–1191 (1993)
6. Mallat, S.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
7. Jang, J.-S.R.: ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23(3), 665–685 (1993)
8. Heiss, J.E., Held, C.M., Estevez, P.A., Perez, C.A., Holzmann, C.A., Perez, J.P.: IEEE Engineering in Medicine and Biology Magazine 21(5), 147–151 (2002)
Image Denoising Based on the Ridgelet Frame Using the Generalized Cross Validation Technique
Xi Tan and Hong He
Department of Electrical Engineering, Hunan University of Technology, 412000 Zhuzhou, China
Abstract. A new image-denoising algorithm is proposed, which uses the Generalized Cross Validation technique in the ridgelet frame domain. The proposed algorithm has two advantages: first, it can select the optimal threshold automatically without knowledge of the noise level; second, it can recover the 'line-type' structures contained in noisy images well. Experimentally, the high performance of the proposed algorithm is demonstrated.
1 Introduction
Wavelet-based image denoising algorithms have attracted much attention recently [1]. In particular, wavelet thresholding is widely used in the literature. Donoho and Johnstone [2][3] proved that the wavelet thresholding strategy is statistically optimal. Its performance is heavily dependent on the threshold chosen. The optimal threshold minimizes the error of the estimation result. There are different methods to determine the threshold. For example, the "universal threshold" [2] and the "SURE" threshold [3] essentially take a threshold proportional to the noise level. However, the noise level is often unknown in practical applications, so one has to estimate it before using a wavelet thresholding strategy. In [4] and [5], based on previous work [6][7], a Generalized Cross Validation (GCV) scheme was proposed in the wavelet domain for image denoising. It tries to directly find the optimal threshold for wavelet thresholding from the input data rather than estimating the noise level first. GCV is a function of the threshold based only on the input data, and the solution that minimizes GCV is asymptotically optimal. Wavelets achieve optimal nonlinear approximation for function classes that are smooth away from point singularities in 2-D. However, the approximation ability is not satisfactory when straight singularities are concerned. In the worst case, the decay rate of the m-term nonlinear wavelet approximation is only $O(m^{-1})$, which is no better than Fourier analysis. In [8], Candès developed a new system, ridgelet analysis, and showed how it can be applied to solve important problems such as constructing neural networks and approximating and estimating multivariate functions by linear combinations of ridge functions. In a following paper [9], Donoho constructed orthonormal ridgelets, which provide an orthonormal basis for $L^2(\mathbb{R}^2)$. Orthonormal ridgelets can optimally represent functions that are smooth away from straight singularities. To construct them, one
needs to make use of two special properties of the Meyer wavelet, both of which do not hold for other dominant wavelet families such as the Daubechies compactly supported wavelets. In papers [10][11], Tan et al. constructed a tight frame with frame bound 1 in $L^2(\mathbb{R}^2)$ using ordinary orthonormal wavelet families. This kind of tight frame is referred to as the ridgelet frame. One can think of the ridgelet frame as a natural extension of orthonormal ridgelets. The powerful ability of the ridgelet frame to recover 'line-type' structures in the presence of noise was demonstrated in [10][11]. In this paper, we aim to generalize the GCV technique from the wavelet domain to the ridgelet frame domain. However, the conditions for a successful application of GCV are not rigorously met in the ridgelet frame domain. To overcome this problem, a modified version of the ridgelet frame is proposed. The new image denoising algorithm, the GCV scheme in the ridgelet frame domain, can automatically select the optimal threshold without knowledge of the noise level and recovers the 'line-type' structures contained in noisy images well.
2 Ridgelet Frame
To obtain the orthogonality of orthonormal ridgelets, Donoho made use of two special properties of the Meyer wavelet, i.e., the closure property under reflection about the origin in the ridge direction, $\psi_{j,k}(-t) = \psi_{j,1-k}(t)$, and the closure property under translation by half a cycle in the angular direction, $w_{i,l}(\theta+\pi) = w_{i,l+2^{i-1}}(\theta)$ [9]. It is
the closure properties that make it possible to construct an orthonormal basis by removing the duplications. As an extension of the orthonormal ridgelets, a ridgelet frame constructed using common orthonormal wavelets was proposed [10][11]. Owing to the loss of the special closure properties of the Meyer wavelet, the new system is a tight frame with frame bound 1 instead of an orthonormal basis in $L^2(\mathbb{R}^2)$. In the field of computed tomography, it is well known that there exists an isometric map from the Radon domain $\mathfrak{R}$ to the spatial domain $L^2(\mathbb{R}^2)$. So, in order to construct a tight frame in the spatial domain $L^2(\mathbb{R}^2)$, one can construct a tight frame in the Radon domain first; the image of the resulting tight frame under the isometric map then constitutes a tight frame in $L^2(\mathbb{R}^2)$. To construct a tight frame in the Radon domain, Tan et al. started from an orthonormal basis of $L^2(\mathbb{R}\otimes[0,2\pi))$, obtained as the tensor product of one-dimensional wavelet bases for $L^2(\mathbb{R})$ and $L^2([0,2\pi))$, respectively. For convenience below, we denote this orthonormal basis by $w''_\lambda$ ($\lambda\in\Lambda$), where $\Lambda$ is the collection of indices $\lambda$. Now, let $w'_\lambda := 2\sqrt{\pi}\, w''_\lambda$. Define the orthoprojector $P_{\mathfrak{R}}$ from $L^2(\mathbb{R}\otimes[0,2\pi))$ to the Radon domain by $(P_{\mathfrak{R}} F)(t,\theta) = (F(t,\theta) + F(-t,\theta+\pi))/2$, where $F\in L^2(\mathbb{R}\otimes[0,2\pi))$. Then, applying $P_{\mathfrak{R}}$ to $w'_\lambda$, we obtain
$$w_\lambda := P_{\mathfrak{R}}(w'_\lambda) = \frac{I + T\otimes S}{2}\, w'_\lambda = 2\sqrt{\pi}\, P_{\mathfrak{R}}(w''_\lambda), \qquad (1)$$

where the operator T is defined by $(Tf)(t) = f(-t)$ and S is defined by $(Sg)(\theta) = g(\theta+\pi)$. In papers [10][11], Tan et al. proved that the collection $w_\lambda$ ($\lambda\in\Lambda$) is a tight frame with frame bound 1 in the Radon domain $\mathfrak{R}$. From the properties of classical frame theory, we have

$$F = \sum_{\lambda}\,[F, w_\lambda]\, w_\lambda, \qquad \sum_{\lambda}\,\big|[F, w_\lambda]\big|^2 = \|F\|^2. \qquad (2)$$
As mentioned above, we can obtain a tight frame by mapping the one in the Radon domain $\mathfrak{R}$ to the spatial domain $L^2(\mathbb{R}^2)$, and the resulting tight frame in $L^2(\mathbb{R}^2)$ has the same frame bound 1 as its counterpart in the Radon domain $\mathfrak{R}$. We call the tight frame in $L^2(\mathbb{R}^2)$ the ridgelet frame. It is worth emphasizing that one obtains the orthonormal ridgelets when the Meyer wavelet is used in the above construction and the redundancy of the resulting tight frame is removed. Generally, the ridgelet frame can be considered an extension of the orthonormal ridgelets. The ridgelet frame can effectively represent 2-D functions that are smooth away from straight singularities, since it is constructed in the Radon domain and hence retains the key idea of orthonormal ridgelets, namely transferring the question of analyzing straight singularities into the question of analyzing point singularities using a 2-D wavelet system. It was shown that a simple denoising method, thresholding the noisy coefficients in the ridgelet domain, can provide state-of-the-art performance in both PSNR and visual quality. In particular, the advantage that the edges in the denoised image are well preserved is attractive [10][11]. Note that in [10][11] the threshold was chosen equal to $k\sigma$ under the assumption that the noise level is known ($\sigma$ is the noise standard deviation and k is a constant set to 3).
3 GCV Technique in the Ridgelet Frame Domain
For image arrays $F = (f_{ij})_{i,j=1}^N$, let $\mathbf{R}$ denote the digital version of the Radon transform operator, which maps F into the Radon coefficient matrix $\bar{F} = (\bar{f}_{ij})_{i,j=1}^M$ satisfying $\bar{F} = \mathbf{R}F$; $\mathbf{R}$ is a linear operator from $\mathbb{R}^{N\times N}$ to $\mathbb{R}^{M\times M}$. Then, let $\mathbf{W}$ denote the orthonormal wavelet transform operator, which transforms $X = (x_{ij})_{i,j=1}^M$ into the wavelet
coefficient matrix $\hat{X} = (\hat{x}_{ij})_{i,j=1}^M$ by $\hat{X} = \mathbf{W}X$. Its inverse transform $\mathbf{W}^{-1}$ is defined by $X = \mathbf{W}^{-1}\hat{X}$. Both $\mathbf{W}$ and $\mathbf{W}^{-1}$ are linear operators from $\mathbb{R}^{M\times M}$ to $\mathbb{R}^{M\times M}$. Then, for F, the ridgelet frame operator $\mathbf{R}_F: \mathbb{R}^{N\times N}\to\mathbb{R}^{M\times M}$ can be essentially
denoted by $\bar{F} := (\bar{f}_{ij})_{i,j=1}^M = \mathbf{R}_F(F) = \mathbf{W}\circ\mathbf{R}(F)$ [11]. In other words, to expand a function in the ridgelet frame, one may compute the Radon transform of the function first and then take the wavelet transform. Assume that the noisy image data satisfy the model
$$(y_{ij})_{i,j=1}^N = (f_{ij})_{i,j=1}^N + \sigma\,(\varepsilon_{ij})_{i,j=1}^N, \qquad (3)$$

where $\varepsilon_{ij}$ is Gaussian white noise, i.e., $\varepsilon_{ij}\overset{iid}{\sim} N(0,1)$, and $f_{ij}$ is the noise-free bivariate signal. Alternatively, we may write (3) in vector notation as

$$Y = F + \sigma\varepsilon. \qquad (4)$$
From the viewpoint of statistics, the quintessential goal of denoising is to elicit useful information about the signal F underlying an observed random process Y. To recover F using the ridgelet frame, we first apply the operator $\mathbf{R}_F$ to (4) and obtain

$$\bar{Y} = (\bar{y}_{kl})_{k,l=1}^M = \mathbf{R}_F(F + \sigma\varepsilon) = \mathbf{W}\circ\mathbf{R}(F + \sigma\varepsilon) = \mathbf{W}(\bar{F} + \sigma\bar{\varepsilon}), \qquad (5)$$

with $\bar{F} = (\bar{f}_{ij})_{i,j=1}^M := \mathbf{R}F$ and $\bar{\varepsilon} = (\bar{\varepsilon}_{ij})_{i,j=1}^M := \mathbf{R}\varepsilon$. Let

$$Z := (z_{kl})_{k,l=1}^M = \mathbf{R}[F + \sigma\varepsilon] = \bar{F} + \sigma\bar{\varepsilon} = (\bar{f}_{kl})_{k,l=1}^M + \sigma\,(\bar{\varepsilon}_{kl})_{k,l=1}^M. \qquad (6)$$
Obviously, due to the definition of the Radon transform, $\bar{\varepsilon}$ is colored and non-stationary; namely, for all $k,l,k',l' \in \{1,\dots,M\}$, $\bar{\varepsilon}_{kl}$ is Gaussian with $E\bar{\varepsilon}_{kl}=0$ and $E(\bar{\varepsilon}_{kl})^2 = \sigma_{kl}^2$, but $\sigma_{k,l} \neq \sigma_{k',l'}$ when $k\neq k'$ or $l\neq l'$. In papers [4][12], in order to find a good threshold using only the input data, the GCV technique was introduced in the wavelet domain for image denoising. The authors proved that the solution that minimizes the GCV function is an asymptotically optimal threshold. The proof of this conclusion needs several assumptions. First, the original image F must be smooth enough that it can be represented parsimoniously by wavelets; second, the noise in the wavelet domain should be stationary. The second assumption is essential to find an estimate of the optimal threshold using the GCV technique, and it makes it possible to reduce noise decently with a wavelet thresholding strategy. After all, a single threshold cannot perform well at the same time on a coefficient with a large noise variance and on another coefficient with little noise variance. Note that stationarity of the wavelet coefficients requires stationary input noise and an orthogonal wavelet transform. From (5), the coefficients in the ridgelet frame domain are exactly the wavelet transform coefficients of $\bar{F} + \sigma\bar{\varepsilon}$. So, it is reasonable to adopt the same assumptions for a successful application of GCV in the ridgelet frame domain. Indeed, the first assumption is satisfied for the ridgelet frame in that it can represent certain function classes effectively. However, the second assumption is not met in the ridgelet frame domain, since $\bar{\varepsilon}$ is colored and non-stationary (see (5) and (6)). So the GCV technique cannot be used directly in the ridgelet frame domain. We present a GCV technique for a modified version of the ridgelet frame below to overcome this shortcoming. Dividing both sides of equation (6) by $\sigma_{kl}$, we obtain
$$Z' := (z'_{k,l})_{k,l=1}^M = \left(\frac{z_{k,l}}{\sigma_{k,l}}\right)_{k,l=1}^M = \left(\frac{\bar{f}_{kl}}{\sigma_{kl}}\right)_{k,l=1}^M + \sigma\left(\frac{\bar{\varepsilon}_{kl}}{\sigma_{kl}}\right)_{k,l=1}^M. \qquad (7)$$
Then, modify (5) as

$$\bar{Y}' = (\bar{y}'_{kl})_{k,l=1}^M = \mathbf{W}(Z') = \mathbf{W}\!\left(\left(\frac{\bar{f}_{kl}}{\sigma_{kl}}\right)_{k,l=1}^M + \sigma\left(\frac{\bar{\varepsilon}_{kl}}{\sigma_{kl}}\right)_{k,l=1}^M\right). \qquad (8)$$
Note that the modified ridgelet frame is now equivalent to taking the wavelet transform of the signal $\bar{F}' = (\bar{f}'_{kl})_{k,l=1}^M := (\bar{f}_{kl}/\sigma_{kl})_{k,l=1}^M$ contaminated by stationary additive Gaussian noise with noise level σ. The denoising algorithm based on the ridgelet frame is hence reduced to one based on wavelets. Define the thresholding operation with threshold δ as $\bar{Y}'_\delta = \mathbf{D}_\delta \bar{Y}'$, where $\mathbf{D}_\delta = (d_{kl})_{k,l=1}^M$ with
$$d_{kl} = \begin{cases} 0 & \text{if } |\bar{y}'_{kl}| < \delta,\\[2pt] 1 - \dfrac{\lambda\delta}{|\bar{y}'_{kl}|} & \text{otherwise.} \end{cases} \qquad (9)$$
The parameter $\lambda\in[0,1]$ determines a family of thresholding methods: $\lambda=1$ corresponds to soft thresholding (shrinkage) and $\lambda=0$ to hard thresholding. Then, applying the inverse transform $\mathbf{W}^{-1}$ to the thresholded coefficients $\bar{Y}'_\delta$, one gets

$$Z'_\delta = (z'_{kl,\delta})_{k,l=1}^M := \mathbf{W}^{-1}(\bar{Y}'_\delta) = \mathbf{W}^{-1}\!\Big(\mathbf{D}_\delta\big(\mathbf{W}\big((\bar{f}_{kl}/\sigma_{kl})_{k,l=1}^M + \sigma(\bar{\varepsilon}_{kl}/\sigma_{kl})_{k,l=1}^M\big)\big)\Big). \qquad (10)$$
Then, the denoising result can be obtained as $Y_\delta = (y_{ij,\delta})_{i,j=1}^N := \mathbf{R}^T\big((\sigma_{kl}\, z'_{kl,\delta})_{k,l=1}^M\big)$, where $\mathbf{R}^T: \mathbb{R}^{M\times M}\to\mathbb{R}^{N\times N}$ denotes the reconstruction operator that reconstructs the original image from its Radon projections. The optimal threshold should minimize the mean square error function

$$R(\delta) = \frac{1}{N^2}\,\|Y_\delta - F\|^2. \qquad (11)$$

We define
$$R'(\delta) = \frac{1}{M^2}\,\big\|\bar{Y}'_\delta - \mathbf{W}[(\bar{f}'_{kl})_{k,l=1}^M]\big\|^2 = \frac{1}{M^2}\,\big\|\mathbf{W}^{-1}\bar{Y}'_\delta - (\bar{f}'_{kl})_{k,l=1}^M\big\|^2, \qquad (12)$$
with $\bar{f}'_{kl} := \bar{f}_{kl}/\sigma_{kl}$. In this paper, we adopt the assumption that minimizing (11) and minimizing (12) have the same solution. This is reasonable to some degree because the thresholding process used in (11) and (12) is exactly the same one. In this sense, using the GCV technique in the ridgelet frame domain is reduced to using it in a special wavelet domain, as (8) shows. The GCV function is a formalized generalization of a leave-one-out operation [4], and in our setting it reduces to

$$GCV(\delta) = \frac{\dfrac{1}{M^2}\,\big\|\bar{Y}' - \bar{Y}'_\delta\big\|^2}{\left[\dfrac{M_0}{M^2}\right]^2}\,, \qquad (13)$$
where $M^2$ is the total number of ridgelet coefficients and $M_0$ is the number of these coefficients that were replaced by zero in the thresholding operation.
In [4], the authors proved that the value of δ minimizing GCV(δ) is an asymptotically optimal threshold choice: let $\delta^* = \arg\min_\delta R'(\delta)$ and $\tilde{\delta} = \arg\min_\delta GCV(\delta)$; then, as $N\to\infty$, $E\,R'(\tilde{\delta})\,/\,E\,R'(\delta^*) \to 1$, and $E\,GCV(\delta) \approx E\,R'(\delta) + \sigma^2$ in the neighborhood of $\delta^*$.
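To make the threshold selection concrete, the following Python sketch (not the authors' code) evaluates GCV(δ) of Eq. (13) on a vector of normalized ridgelet-frame coefficients for a range of candidate thresholds and returns the minimizer, using the soft-thresholding member (λ = 1) of the family in Eq. (9). All names are illustrative.

```python
import numpy as np

def soft_threshold(y, delta):
    """Eq. (9) with lambda = 1: coefficients below delta are zeroed, the rest are shrunk."""
    return np.sign(y) * np.maximum(np.abs(y) - delta, 0.0)

def gcv(y, delta):
    """Eq. (13): GCV(delta) = [(1/M^2) ||Y' - Y'_delta||^2] / (M_0 / M^2)^2."""
    y = np.asarray(y, dtype=float).ravel()
    m2 = y.size                           # total number of coefficients (M^2)
    y_thr = soft_threshold(y, delta)
    m0 = np.count_nonzero(y_thr == 0.0)   # number of coefficients replaced by zero (M_0)
    if m0 == 0:
        return np.inf
    return (np.sum((y - y_thr) ** 2) / m2) / (m0 / m2) ** 2

def optimal_threshold(y, candidates):
    """Candidate threshold minimizing GCV, as suggested by the result of [4]."""
    return min(candidates, key=lambda d: gcv(y, d))
```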
4 Experimental Results
We test the performance of the proposed algorithm below. The experiments are performed on Lena (of size 512×512 with 256 gray levels). The test image is contaminated by additive Gaussian white noise with mean 0 and standard deviation 20. The
Fig. 1. Comparison of the quality of different image denoising algorithms on Lena: (a) original image, (b) noisy image with additive Gaussian white noise of standard deviation 20, PSNR=22.069 dB, (c) denoised image using GCV for RWT thresholding proposed in [12], PSNR=29.7823 dB, (d) denoised image using the proposed ridgelet frame-based GCV method, PSNR=30.515 dB
Fig. 2. Visual comparison of denoised images using different methods (a) Crop of denoised image using GCV for RWT thresholding proposed in paper [12], PSNR=29.7823dB, (b) Crop of denoised image using the proposed ridgelet frame-based GCV method, PSNR=30.515dB
qualities of the proposed algorithm are compared with those of the GCV denoising method based on the Redundant Wavelet Transform (RWT) [4][12]. In the experiment, we use the RWT for the implementation of the wavelet transform in the ridgelet frame, and the filter is Coiflet-2. Note that neither algorithm has knowledge of the noise level. Test results are displayed in Fig. 1 and Fig. 2. From the results, it is clear that the proposed algorithm clearly outperforms the wavelet-based algorithm, especially in visual quality. As is usually the case, edges become blurry in the wavelet case. On the contrary, in the ridgelet frame case, the details in the original image are well recovered. In particular, the 'line-type' structures in Lena, such as the hat brim, are well preserved.
5 Conclusion
We have described an image denoising algorithm that uses the GCV technique in the ridgelet frame domain. This algorithm aims to take advantage of both the ridgelet frame representation and the GCV scheme. Experimental results of the proposed algorithm were promising. However, in the proposed algorithm, the GCV technique is used in a modified version of the ridgelet frame rather than in the original one. How, and to what degree, the ability of the ridgelet frame to represent function classes has been affected remains unknown.
References
1. Portilla, J., Strela, V., Wainwright, M.J., Simoncelli, E.P.: Image Denoising Using Scale Mixtures of Gaussians in the Wavelet Domain. IEEE Trans. Image Processing 12, 1338–1351 (2003)
2. Donoho, D.L., Johnstone, I.M.: Ideal Spatial Adaptation Via Wavelet Shrinkage. Biometrika 81, 425–455 (1994)
3. Donoho, D.L., Johnstone, I.M.: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B 57, 301–369 (1995)
4. Jansen, M., Malfait, M., Bultheel, A.: Generalized Cross Validation for Wavelet Thresholding. Signal Processing 56, 33–44 (1997)
5. Weyrich, N., Warhola, G.T.: Wavelet Shrinkage and Generalized Cross Validation for Image Denoising. IEEE Trans. Image Processing 7, 82–90 (1998)
6. Weyrich, N., Warhola, G.T.: De-noising Using Wavelets and Cross Validation. In: Singh, S.P. (ed.) Approximation Theory, Wavelets and Applications. NATO ASI Series C: Mathematics and Physical Sciences, vol. 454, pp. 523–532. Kluwer, Dordrecht (1995)
7. Weyrich, N., Warhola, G.T.: De-noising by Wavelet Shrinkage and Generalized Cross Validation with Applications to Speech. In: Chui, C.K., Schumaker, L.L. (eds.) Wavelets and Multilevel Approximation, Approximation Theory VIII, Singapore, pp. 407–414 (1995)
8. Candès, E.J.: Harmonic Analysis of Neural Networks. Appl. Comput. Harmon. Anal. 6, 197–218 (1999)
9. Donoho, D.L.: Orthonormal Ridgelets and Linear Singularities. SIAM J. Math. Anal. 31, 1062–1099 (2000)
10. Tan, S., Jiao, L.C., Feng, X.C.: Ridgelet Frame. In: Campilho, A., Kamel, M. (eds.) ICIAR 2004. LNCS, vol. 3211, pp. 479–486. Springer, Heidelberg (2004)
11. Tan, S., Jiao, L.C.: Ridgelet Bi-Frame. Appl. Comput. Harmon. Anal. 20, 391–402 (2006)
12. Jansen, M., Bultheel, A.: Multiple Wavelet Threshold Estimation by Generalized Cross Validation for Data with Correlated Noise. TW Report 250, Department of Computer Science, K.U. Leuven, Leuven, Belgium (1997)
Parameterless Discrete Regularization on Graphs for Color Image Filtering
Olivier Lezoray¹, Sébastien Bougleux², and Abderrahim Elmoataz¹
¹ Université de Caen, LUSAC EA 2607, Vision and Image Analysis Team, IUT SRC, 120 Rue de l'exode, Saint-Lô, F-50000, France
² ENSICAEN, GREYC, 6 Bd. Maréchal Juin, Caen, F-14050, France

Abstract. A discrete regularization framework on graphs is proposed and studied for color image filtering purposes when images are represented by grid graphs. Image filtering is considered as a variational problem which consists in minimizing an appropriate energy function. In this paper, we propose a general discrete regularization framework defined on weighted graphs which can be seen as a discrete analogue of classical regularization theory. With this formulation, we propose a family of fast and simple anisotropic linear and nonlinear filters. The parameters of the proposed discrete regularization are estimated to obtain parameterless filtering.
1 Introduction
Processing color images has become a crucial problem in the field of image processing. Numerous approaches can be found to process color images and, among those, variational models have been extremely successful in a wide variety of computer vision problems such as image filtering and image segmentation. Variational formulations provide a framework that can handle such problems and provide algorithms for their solutions. Solutions of variational models can be obtained by minimizing appropriate energy functions, and this minimization is usually performed by designing continuous partial differential equations (PDEs). PDEs [1] are written in a continuous setting referring to images, and are discretized in order to obtain a numerical solution. One typical use of PDEs is image filtering, and many authors have proposed color image filtering with PDEs [2,3,4,5]. Discrete methods might be more suitable than PDEs in some cases, since an image can be represented in a discrete setting by a grid graph. Discrete regularization on graphs has already been used for semi-supervised data classification [6]. Inspired by these works, we propose a regularization framework on weighted graphs of arbitrary topologies which defines a family of simple and fast anisotropic linear and nonlinear filters [7]. The discrete minimization problem is analogous to the continuous one, and we propose to use such a discrete
This research work was partially supported by the ANR foundation under grant ANR-06-MDCA-008-01/FOGRIMMI.
M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 46–57, 2007. c Springer-Verlag Berlin Heidelberg 2007
Parameterless Discrete Regularization on Graphs
47
regularization for parameterless filtering of color images. When f 0 denotes a degraded image of an image f (f 0 = f + η, where η is additive noise of variance σ 2 ), the aim is to reconstruct f from f 0 by regularization. The paper is organized as follows. In Section 2, after having recalled basic definitions on weighted graphs, we present differential geometry on graphs which is similar to the one proposed by Benssoussan [8] and Zhou [6]. In Section 3, we present a general framework for discrete regularization on graphs [7]. In section 4, we present a parameterless version of the proposed discrete regularization. In Section 5, we show how discrete regularization can be applied to color image filtering. Section 6 concludes.
2 Differential Geometry on Graphs

2.1 Preliminaries on Graphs
A graph is a structure used to describe a set of objects and the pairwise relations between those objects (links between objects). The objects are called vertices (or nodes) and a link between two objects is called an edge. We provide some basic definitions on graph theory (further details can be found in [9]). A graph $G$ is a couple $G=(V,E)$ where $V$ is a finite set of vertices and $E$ is a set of edges included in a subset of $V\times V$. Two vertices $u$ and $v$ are adjacent if the edge $(u,v)\in E$. In the rest of this paper we only consider simple graphs which are always assumed to be connected, undirected and with no self-loops (see [9,6] for details on these notions). A graph as defined above is said to be weighted if it is associated with a weight function $w : E \to \mathbb{R}^+$ satisfying $w(u,v)>0$ for $(u,v)\in E$, $w(u,v)=0$ for $(u,v)\notin E$ and $w(u,v)=w(v,u)$ for all edges in $E$, since we consider undirected graphs. The degree function $\delta : V \to \mathbb{R}^+$ of a vertex $v\in V$ is defined to be $\delta(v)=\sum_{u\sim v} w(u,v)$, where $u\sim v$ denotes all vertices $u$ connected to $v$ by an edge $(u,v)\in E$. Now we can define the space of functions on graphs. Let $H(V)$ denote the Hilbert space of real-valued functions on vertices, in which each $f : V \to \mathbb{R}^+$ assigns a real value $f(v)$ to each vertex $v$. The function space $H(V)$ is endowed with the usual inner product $\langle f,g\rangle_{H(V)} = \sum_{v\in V} f(v)g(v)$, where $f$ and $g$ are two functions in $H(V)$. A function $f$ in $H(V)$ can be thought of as a column vector in $\mathbb{R}^{|V|}$. The norm of a function $f$ induced from the inner product is $\|f\| = \sqrt{\langle f,f\rangle_{H(V)}}$. Similarly, one can define $H(E)$ as the space of real-valued functions on edges, in which each $h : E \to \mathbb{R}^+$ assigns a real value to each edge $e\in E$. This function space is endowed with the usual inner product $\langle h,l\rangle_{H(E)} = \sum_{(u,v)\in E} h(u,v)l(u,v)$, where $h,l : E \to \mathbb{R}^+$ denote two functions in $H(E)$. In this paper, grid graphs are considered [10]. They correspond to the definition of digital images: vertices represent pixels and edges represent the pixel adjacency relationship. Therefore, processing color images comes down to processing grid graphs, whose vertex models and edge weights depend on colorimetric properties of the image.
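As an illustration of how such a grid graph can be laid out in memory, the following sketch (Python with NumPy, not from the paper; the function names and data layout are ours) enumerates the edges of an 8-connected grid graph over a height × width image and evaluates the degree function δ from a given weight function.

```python
import numpy as np

def grid_graph(height, width, connectivity=8):
    """Adjacency structure of a grid graph: one vertex per pixel,
    edges between 4- or 8-neighbouring pixels."""
    offsets = [(0, 1), (1, 0)]
    if connectivity == 8:
        offsets += [(1, 1), (1, -1)]
    neighbors = {(i, j): [] for i in range(height) for j in range(width)}
    for i in range(height):
        for j in range(width):
            for di, dj in offsets:
                ni, nj = i + di, j + dj
                if 0 <= ni < height and 0 <= nj < width:
                    neighbors[(i, j)].append((ni, nj))   # undirected edge
                    neighbors[(ni, nj)].append((i, j))
    return neighbors

def degree(v, neighbors, w):
    """delta(v) = sum of w(u, v) over all vertices u adjacent to v."""
    return sum(w(u, v) for u in neighbors[v])
```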
2.2 Gradient and Divergence Operators
The difference operator $d : H(V)\to H(E)$ on $G=(V,E)$ of a function $f\in H(V)$ on an edge $(u,v)$ linking two vertices $u$ and $v$ is defined for all $(u,v)\in E$ as
$$(df)(u,v) = (df)_{uv} = \sqrt{w(u,v)}\,\bigl(f(v)-f(u)\bigr) \quad (1)$$
The directional edge derivative of a function $f$ at vertex $v$ along the edge $e=(u,v)$ is defined as $\partial_v f(u) = (df)(u,v)$. This definition is consistent with the continuous definition of the derivative of a function, e.g., if $f(v)=f(u)$ then $\partial_v f(u)=0$. Moreover, one has $\partial_v f(u) = -\partial_u f(v)$ and $\partial_u f(u)=0$. Given a function $f\in H(V)$ and a vertex $v$, the gradient of $f$ at vertex $v$ is the vector operator $\nabla : V\to\mathbb{R}^N$ defined by $\nabla f(v) = \nabla_v f = \bigl(\partial_v f(u) : (u,v)\in E, u\sim v\bigr)^T$. Then, the norm $\|\cdot\| : \mathbb{R}^N\to\mathbb{R}^+$ of the graph gradient $\nabla f$ at vertex $v$, or the local variation of $f$ at vertex $v$, is defined as:
$$\|\nabla_v f\| = \sqrt{\sum_{u\sim v}\bigl(\partial_v f(u)\bigr)^2} = \sqrt{\sum_{u\sim v} w(u,v)\bigl(f(v)-f(u)\bigr)^2} \quad (2)$$
Let $R^p$ denote a functional on $H(V)$, for any $p\in[1,+\infty)$, defined by $R^p(f) = \sum_{v\in V}\|\nabla_v f\|^p$. This functional $R^p$ can be seen as the measure of the smoothness of $f$ since it is the sum of the local variations at each vertex. The graph divergence operator is the operator $\mathrm{div} : H(E)\to H(V)$ which satisfies $\langle df, h\rangle_{H(E)} = \langle f, -\mathrm{div}(h)\rangle_{H(V)}$ with $f\in H(V)$ and $h\in H(E)$. The operator $-\mathrm{div}$ is therefore the adjoint operator $d^*$ of the difference operator $d$. From the definition of the inner products in $H(V)$ and $H(E)$ and Equation (1), one can prove that the graph divergence of a function $h\in H(E)$ at a vertex $v$ can be expressed as
$$(d^* h)(v) = (-\mathrm{div}(h))(v) = \sum_{u\sim v}\sqrt{w(v,u)}\,\bigl(h(u,v)-h(v,u)\bigr) \quad (3)$$
The divergence operator measures the net outflow of a function $h\in H(E)$ at each vertex $v$ of $G$.

2.3 p-Laplace Operator
The graph p-Laplacian is the operator $\Delta_p : H(V)\to H(V)$, with $p\in[1,+\infty)$, defined as
$$\Delta_p f = -\mathrm{div}\bigl(\|\nabla f\|^{p-2}\, df\bigr) = d^*\bigl(\|\nabla f\|^{p-2}\, df\bigr) \quad (4)$$
Substituting (1) and (3) into the definition (4) of $\Delta_p f$, we obtain
$$(\Delta_p f)(v) = \sum_{u\sim v}\gamma(u,v)\bigl(f(v)-f(u)\bigr) \quad (5)$$
where $\gamma(u,v)$ is the function defined by
$$\gamma(u,v) = w(u,v)\bigl(\|\nabla f(v)\|^{p-2} + \|\nabla f(u)\|^{p-2}\bigr) \quad (6)$$
which generalizes the classical graph Laplacian and curvature. Indeed, the classical graph Laplacian is the linear operator $\Delta : H(V)\to H(V)$ defined as $\Delta f = -\mathrm{div}(df) = d^*(df)$ and the classical graph curvature is the nonlinear operator $\kappa : H(V)\to H(V)$ defined as $\kappa f = -\mathrm{div}\bigl(\tfrac{df}{\|\nabla f\|}\bigr) = d^*\bigl(\tfrac{df}{\|\nabla f\|}\bigr)$. Clearly, one has $\Delta_1 = \kappa$ and $\Delta_2 = \Delta$. In general $\Delta_p$ is nonlinear (except in the case of $p=2$) and it is positive semi-definite. One can then prove that $\langle f, \Delta_p f\rangle_{H(V)} = R^p(f) = \sum_{v\in V}\|\nabla_v f\|^p \ge 0$, which implies that
$$\Delta_p f = \frac{\partial R^p(f)}{\partial f} \quad (7)$$
Practically, to avoid having a zero denominator when computing the curvature (i.e. $p=1$), the graph gradient $\|\nabla_v f\|$ has to be replaced by its regularized version: $\|\nabla_v f\|_\beta = \sqrt{\beta^2 + \|\nabla_v f\|^2}$, where $\beta>0$ is a small positive parameter called the regularization parameter [10]. For the sake of clarity, when it is not necessary, we keep the notation $\|\nabla_v f\|$ instead of $\|\nabla_v f\|_\beta$ in the rest of the paper.
3 Discrete Regularization on Graphs
In this section, we propose a general framework to regularize color images represented by grid graphs. For the sake of clarity we present this framework for scalar images, but the principle is the same for color images (see next section). Given a graph $G=(V,E)$ associated with a weighting function $w : E\to\mathbb{R}^+$, we want to perform the discrete regularization of a function $f^0\in H(V)$ (i.e. the initial image) using the p-Laplacian. It consists in seeking a function $f^*$ which is smooth and simultaneously close to the function $f^0$. This comes down to considering general variational problems on graphs. Given a function $f^0\in H(V)$, the goal is to find a function $f^*\in H(V)$ which is not only smooth enough on $G$ but also close enough to the given function $f^0$. This can be formalized by minimizing a weighted sum of two energy terms:
$$f^* = \min_{f\in H(V)}\Bigl\{ E_p = R^p(f) + \lambda\|f-f^0\|^2 = \sum_{v\in V}\|\nabla_v f\|^p + \lambda\sum_{v\in V}\bigl(f(v)-f^0(v)\bigr)^2 \Bigr\} \quad (8)$$
The first term is the smoothness term or regularizer, which requires $f$ not to change too much between closely related objects. The second term is the fitting term, which says that $f$ should not be far away from $f^0$. The parameter $\lambda\ge 0$ is a fidelity parameter, called the Lagrange multiplier, which specifies the trade-off between the two competing terms. Both terms of the energy $E_p$ are strictly convex functions of $f$ [10,11]; therefore, by standard arguments in convex analysis, this optimization problem has a unique solution for $p=1$ or $p=2$, which satisfies $\frac{\partial E_p}{\partial f(v)} = 0, \forall v\in V$.
Using the property (7) of the p-Laplacian to compute the derivative of the first term in $E_p$, the above-mentioned problem can be rewritten as follows:
$$(\Delta_p f^*)(v) + 2\lambda\bigl(f^*(v)-f^0(v)\bigr) = 0, \quad \forall v\in V \quad (9)$$
The solution $f^*$ of (8) is also the solution of (9). Substituting the expression of the p-Laplacian into (9), we obtain:
$$\Bigl(2\lambda + \sum_{u\sim v}\gamma(u,v)\Bigr)f^*(v) - \sum_{u\sim v}\gamma(u,v)f^*(u) = 2\lambda f^0(v), \quad \forall v\in V. \quad (10)$$
Among the existing methods which can be used to solve (10), we use the Gauss-Jacobi iterative algorithm. In this paper, we consider only the case of $p=1$, based on the nonlinear curvature operator $\kappa$. Let $t$ be the iteration step, and let $f^{(t)}$ be the solution of (10) at step $t$. The initial function $f^{(0)}$ can be initialized to $f^0$. The corresponding linearized Gauss-Jacobi algorithm is given by:
$$\begin{cases} f^{(0)} = f^0 \\ \gamma^{(t)}(u,v) = w(u,v)\Bigl(\dfrac{1}{\|\nabla f^{(t)}(v)\|_\beta} + \dfrac{1}{\|\nabla f^{(t)}(u)\|_\beta}\Bigr), & \forall (u,v)\in E \\ f^{(t+1)}(v) = \dfrac{2\lambda}{2\lambda+\sum_{u\sim v}\gamma^{(t)}(u,v)}\, f^0(v) + \dfrac{\sum_{u\sim v}\gamma^{(t)}(u,v)\, f^{(t)}(u)}{2\lambda+\sum_{u\sim v}\gamma^{(t)}(u,v)}, & \forall v\in V, \end{cases} \quad (11)$$
where $\gamma^{(t)}$ is the function $\gamma(u,v)$ at step $t$. One can note that the value of $f(v)$ for a given iteration $(t+1)$ depends on two quantities: the original value of $f$ at $v$ (i.e. $f^0(v)$) and the values at iteration $t$ in the neighborhood of $v$. Coefficients which depend on the sum of weighted local variations are associated to those quantities. The obtained filtering is a low-pass filter where the coefficients are adaptively updated at each iteration, in addition to updating the function $f$. It is worth noting the connection between the proposed filter and the TV digital filter [10] (TV+L2 on grid graphs). Indeed, with $p=1$, if $\forall (u,v)\in E$, $w(u,v)=1$, i.e. the edges all have the same weight, one recovers exactly the same iterative filtering performed on a regular grid represented by a graph.
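As a concrete reading of (11), the sketch below (Python/NumPy, our own illustration rather than the authors' code) runs the linearized Gauss-Jacobi iterations on an 8-connected grid graph. It assumes the edge weights are supplied as a dictionary mapping each neighbourhood offset to a per-pixel weight array (keeping the weights symmetric is the caller's responsibility), and it replicates border pixels so that border edges contribute no variation.

```python
import numpy as np

# Offsets of the 8-neighborhood: pixel (i, j) is linked to (i + di, j + dj).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def shift(a, di, dj):
    """Value of the neighbor at offset (di, dj); borders are replicated,
    so border pixels simply see themselves in the missing directions."""
    p = np.pad(a, 1, mode='edge')
    return p[1 + di:1 + di + a.shape[0], 1 + dj:1 + dj + a.shape[1]]

def local_variation(f, w, beta):
    """Regularized local variation ||grad_v f||_beta of Eq. (2)."""
    var = np.zeros_like(f)
    for off in OFFSETS:
        var += w[off] * (shift(f, *off) - f) ** 2
    return np.sqrt(beta ** 2 + var)

def regularize_p1(f0, w, lam, beta, n_iter=100):
    """Gauss-Jacobi iterations of Eq. (11) for p = 1 (curvature operator)."""
    f = f0.astype(float).copy()
    for _ in range(n_iter):
        grad = local_variation(f, w, beta)
        num = 2.0 * lam * f0
        den = 2.0 * lam * np.ones_like(f)
        for off in OFFSETS:
            # gamma(u, v) of Eq. (6) with p = 1, using the regularized gradient
            gamma = w[off] * (1.0 / grad + 1.0 / shift(grad, *off))
            num += gamma * shift(f, *off)
            den += gamma
        f = num / den
    return f
```

With every weight set to 1 this sketch reduces to the TV digital filter mentioned above.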
4 Parameterless Discrete Regularization
In this section, we show how the proposed discrete regularization can be made parameterless for the purpose of filtering color images. Indeed, several parameters act upon the proposed discrete regularization: the representation of the color vectors associated to the vertices ($f(v), \forall v\in V$), the weights associated to the edges ($w(u,v), \forall (u,v)\in E$), the regularization constants ($\lambda$, but also $\beta$), and the number of iterations involved in algorithm (11).

4.1 The Case of Color Images
For the case of color images, we define $f\in H(V)$, $f : V\to\mathbb{R}^3$, which associates a red-green-blue color vector to each vertex. To perform the regularization on a grid graph representing a color image, an iteration of (11) is considered. Since a color image is composed of three channels, three independent regularization processes are considered. To take the coupling between vector channels into account, the component-wise regularizations do not have to use different local geometries (the p-Laplacian being different for each channel with $p=1$) but a vector one. Therefore, the p-Laplace operator is considered as being the same for the three-channel regularizations (channel coupling) and is defined by $\gamma(u,v) = w(u,v)\bigl(\|\nabla f(v)\|_{3D}^{p-2} + \|\nabla f(u)\|_{3D}^{p-2}\bigr)$. The norm of a color vector is defined as its multi-dimensional Euclidean norm and remains the same whatever the color channel under consideration. This is required to have a global vector geometry: $\|\nabla f(v)\|_{3D} = \sqrt{\sum_{i=1}^{3}\|\nabla f_i(v)\|^2}$. Another model (Chromaticity-Brightness, denoted as CB) can be used to represent color images. It decomposes an RGB color vector $f(v)$ into two components: the brightness component $b(v)=\|f(v)\|$ and the chromaticity component $c(v)=f(v)/b(v)$. The discrete regularization is then performed separately on the scalar brightness component and on the vectorial chromaticity component (this involves a supplementary normalization step to ensure that the chromaticity remains on the unit sphere, see [12] for further details).

4.2 Edge Weights
We can associate a weight function with each edge of a graph. This weight function determines the type of regularization induced by the functional $R^p(f)$. Weights are positive and symmetric and quantify the proximity between two vertices based on some features. Therefore, similarities between vertices are obtained by comparing their features, which generally depend on the function $f$ and the set $V$. A feature vector $F_f(v)\in\mathbb{R}^q$ is assigned to every vertex $v\in V$, with $q\in\mathbb{N}^+$. This feature vector can incorporate several image features such as color and texture. The general formulation of a weight function can be defined as $w(u,v) = g\bigl(F_f(u),F_f(v)\bigr), \forall (u,v)\in E$. We consider three different weight functions: $g_1\bigl(F_f(u),F_f(v)\bigr) = \frac{1}{\varepsilon+\|F_f(u)-F_f(v)\|}$ with $F_f(u)=f(u)$; $g_2\bigl(F_f(u),F_f(v)\bigr) = \exp\bigl(-\frac{\|F_f(u)-F_f(v)\|^2}{\sigma^2}\bigr)$ with $F_f(u)=f(u)$; and $g_3\bigl(F_f(u),F_f(v)\bigr) = \exp\bigl(-\frac{\|F_f(u)-F_f(v)\|^2}{\sigma^2}\bigr)\exp\bigl(-\frac{\|u-v\|^2}{2\sigma_d^2}\bigr)$ with $F_f(u) = [f(v') : v'\in W_u^r]^T$. $\sigma_d$ controls the spatial decay and $W_u^r$ denotes the set of vertices which can be reached from $u$ in $r$ walks (for grid graphs, $W_u^r$ is a window of size $(2r+1)\times(2r+1)$ centered on $u$). This last weight function $g_3$ was proposed by Buades for the nonlocal means filter [13]. We use it to define a nonlocal discrete regularization on graphs. All these weight functions involve parameters. For $g_1$, the parameter $\varepsilon$ is needed to avoid a zero denominator and it can be fixed to a very small value ($10^{-4}$ in our experiments). For $g_2$ and $g_3$, the parameter $\sigma$ is usually fixed a priori in the literature. To have parameterless weight functions, a measure of dispersion around vertices can be used to estimate this parameter. We propose to estimate $\sigma$ for two nodes $u$ and $v$ as the product $\sigma_u\sigma_v$ of the standard deviations estimated around each vertex. Therefore,
$$\sigma_u = \frac{1}{q}\sum_{i=1}^{q}\sqrt{\sum_{v\in W_u^r}\frac{F_f^i(v)^2}{|W_u^r|} - \Bigl(\sum_{v\in W_u^r}\frac{F_f^i(v)}{|W_u^r|}\Bigr)^2}$$
where $F_f^i(v)$ denotes the $i$th component
of the vector $F_f(v)$ and $|W_u^r|$ is the size of the window $W_u^r$. For $g_2$, $r=1$. For computational reasons, $\sigma_u$ is estimated only once for each vertex, on the original image $f^0$. The proposed weight functions are now parameterless, but their choice depends on the application (see next section).
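A possible implementation of the weight functions and of the dispersion estimate σ_u is sketched below (Python/NumPy; the names, the scalar handling of the patch feature vectors, and the choice of squaring σ = σ_u σ_v in the exponent of g2/g3 are our own reading, not prescribed by the paper).

```python
import numpy as np

def g1(Fu, Fv, eps=1e-4):
    """g1: inverse-distance weight; eps avoids a zero denominator."""
    return 1.0 / (eps + np.linalg.norm(Fu - Fv))

def g2(Fu, Fv, sigma_u, sigma_v):
    """g2: Gaussian weight with the data-driven scale sigma = sigma_u * sigma_v."""
    sigma = sigma_u * sigma_v
    return np.exp(-np.sum((Fu - Fv) ** 2) / sigma ** 2)

def g3(Fu, Fv, u, v, sigma_u, sigma_v, sigma_d=2.0):
    """g3: nonlocal-means-like weight on patch feature vectors Fu, Fv,
    with a Gaussian spatial decay between pixel positions u and v."""
    sigma = sigma_u * sigma_v
    spatial = np.sum((np.asarray(u) - np.asarray(v)) ** 2)
    return (np.exp(-np.sum((Fu - Fv) ** 2) / sigma ** 2)
            * np.exp(-spatial / (2.0 * sigma_d ** 2)))

def local_sigma(window_features):
    """Dispersion sigma_u around a vertex: mean over feature components of
    the standard deviation inside the window W_u^r (rows = window pixels,
    columns = feature components)."""
    return np.mean(np.std(window_features, axis=0))
```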
4.3 Regularization Constants
The regularization constant $\lambda$ determines the trade-off between the smoothness of the regularized image and the closeness of the regularized image to the original image. It plays an important role in the regularization and should be estimated along the iterations rather than fixed a priori, since it depends on the noise level. A way to estimate $\lambda$ is to consider the equivalent constrained minimization problem [14] of (8), formulated as: $\min_f \sum_{v\in V}\|\nabla_v f\|^p$ subject to $\frac{1}{|V|}\sum_{v\in V}\|f-f^0\|^2 = \sigma^2$. Then, using (7) and (5), one obtains, at a given iteration $t$, an estimation of $\lambda$ [10]:
$$\lambda_t = \frac{1}{\bar\sigma_t^2}\frac{1}{|V|}\sum_{v\in V}\Bigl(\bigl(f^t(v)-f^0(v)\bigr)\sum_{u\sim v}\gamma_{uv}\bigl(f^t(v)-f^t(u)\bigr)\Bigr).$$
For $t=0$, we fix $\lambda_0 = \frac{1}{\bar\sigma_0^2}$, where $\bar\sigma_t^2$ denotes the variance of the noise estimated on the whole image $f^t$. With $p=1$, the regularization parameter $\beta$ (see Section 2.3) is needed for numerical stability, but this leads to an approximation of the solution of (8). As stated in [10], the performance of the regularization is insensitive to $\beta$ as long as it is kept small. The use of $\beta$ is needed to reduce degeneracies in flat regions where $\|\nabla_v f\|\approx 0$ and is a commonly used technique [2]. Therefore, we fix $\beta^2 = \frac{1}{\bar\sigma_0^2}$.
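For completeness, the estimate of λ_t can be computed directly from the current iterate. The lines below (NumPy, reusing the hypothetical shift helper and the per-offset layout of γ from the earlier sketch) illustrate one way to do it under those assumptions.

```python
import numpy as np

def estimate_lambda(f_t, f0, gamma, noise_var):
    """lambda_t of Section 4.3: (1 / (sigma_t^2 |V|)) times the sum over v of
    (f^t(v) - f^0(v)) * sum over u~v of gamma(u, v) (f^t(v) - f^t(u)).
    `gamma` maps each neighborhood offset to the per-pixel gamma(u, v) array."""
    flow = np.zeros_like(f_t)
    for off, g in gamma.items():
        flow += g * (f_t - shift(f_t, *off))
    return float(np.sum((f_t - f0) * flow)) / (noise_var * f_t.size)
```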
4.4 Termination Criterion
To have a completely parameterless algorithm, the ideal number of iterations of (11) should be dynamically determined. To that aim, we use a termination criterion defined as $\frac{\|f^t(v)-f^0(v)\|}{\|f^t(v)\|} < \varepsilon$ (the $\varepsilon$ value is the same as in Subsection 4.2). This enables us to automatically determine the stopping time of (11) when few modifications occur on $f^t$.
5 Applications
In this section, we show how the proposed parameterless discrete regularization can be used to perform color image filtering. For a more complete evaluation of the discrete regularization without automatic estimation of the parameters as defined in Section 4, one can refer to [7]. Images are represented by grid graphs (one vertex per pixel) with 8-neighborhood connectivity (filter window of size 3 × 3). First, we consider an image (Figure 1(a)) corrupted by impulse noise
Fig. 1. Illustrative examples of the filtering efficiency for two types of noise corruption and different edge weight functions: (a) original image; (b) image distorted by 15% impulse noise; (c) image distorted by Gaussian noise (σ = 15); (d) restoration of impulse noise with g1; (e) restoration of Gaussian noise with g2; (f) restoration of Gaussian noise with g2 on CB features.
(corruption of 15%, Figure 1(b)) or Gaussian noise (σ = 15, Figure 1(c)). For impulse noise cancelation, the weight function g1 is considered. One can see in Figure 1(d) the behavior of the proposed parameterless discrete regularization. For Gaussian noise cancelation, the weight function g2 is considered. Figure 1(e) and Figure 1(f) present filtering results operating respectively on the RGB image or on its representation in Chromaticity-Brightness. The noise is suppressed and the use of Chromaticity-Brightness features enables a better restoration of
Fig. 2. Illustrative examples of the filtering efficiency on a real image corrupted by noise: (a) original noisy image; (b) restored image with g1; (c) restored image with g2; (d) restored image with g3; (e) restored image with g3 on CB features.
the image (see the background in Figure 1(f)). This illustrates that the parameterless discrete regularization performs well for color image filtering and that it can be adapted to different types of noise by choosing an appropriate weight function and color vector representation.
Fig. 3. Illustrative examples of the differences between local and nonlocal weight functions: (a) image with Gaussian noise (σ = 15); (b) restored image with g2; (c) restored image with g3; (d) part of Figure 3(a); (e) part of Figure 3(b); (f) part of Figure 3(c).
We illustrate this last remark on a real noisy image (Figure 2(a)), captured by a digital camera at very high shutter speed using a high film ISO setting. Parameterless discrete regularization is performed with the following configurations: Figure 2(b) with g1, Figure 2(c) with g2, Figure 2(d) with g3 (r = 1) and Figure 2(e) with g3 (r = 1) on Chromaticity-Brightness features. When g3 is used, a 24-neighborhood connectivity is considered (filter window of size 5 × 5). Indeed, the feature vector associated to each vertex is defined over an 8-neighborhood (r = 1) and the filter window has to be larger than the size of the feature vector (see [15] for more details). The weight function g2 provides in general better results than g1 (except for impulse noise). Moreover, the use of the weight function g3, which is nonlocal [13], provides better results than the same norm in a fully local version, i.e. with the weight function g2 (compare Figures 2(c) and 2(d)). The difference between the nonlocal discrete regularization on RGB color vectors (Figure 2(d)) and on Chromaticity-Brightness features (Figure 2(e)) is less evident (notice that the face of the goalkeeper is better restored with CB features). Even if it is visually evident that the nonlocal version of the proposed parameterless discrete regularization provides better results, it has another interesting property. This is shown in Figure 3. Figure 3(a) is the classical Barbara image corrupted with Gaussian noise (σ = 15). Figures 3(b) and 3(c) present the filtering results with the weight functions g2 and g3, respectively (i.e. local versus nonlocal). The differences are not so marked, but if we study them more precisely, the filtering with the g3 weight function performs much better: it better preserves texture. This effect can be seen in Figures 3(e) and 3(f), where cropped and zoomed portions of Figures 3(b) and 3(c) are shown. Figure 3(d) provides the same cropped and zoomed part of the original corrupted image depicted in Figure 3(a). This texture-preservation effect is due to the nature of the nonlocal weights, as in the original nonlocal means filter (see [13,15]).
6 Conclusion
In this paper, we have considered a discrete regularization framework based on graph differential geometry. The discrete regularization is based on the p-Laplacian and leads to a family of linear and nonlinear iterative filters. Moreover, a parameterless version of the considered discrete regularization is also proposed, which enables all the required parameters to be estimated automatically. The obtained parameterless discrete regularization can then be applied to a wide range of color image filtering applications with a proper choice of the weight function. The abilities of the proposed parameterless discrete regularization have been illustrated on several examples. Future work will concern the adaptation of the proposed framework for image inpainting purposes.
References 1. Aubert, G., Kornprobst, P.: Mathematical Problems in Image Processing. Springer, Heidelberg (2002) 2. Chan, T., Shen, J.: Image Processing and Analysis - Variational, PDE, wavelet, and stochastic methods. SIAM (2005) 3. Tang, B., Sapiro, G., Caselles, V.: Color image enhancement via chromaticity diffusion. IEEE Transactions on Image Processing 10, 701–707 (2001) 4. Tschumperl´e, D., Deriche, R.: Vector-valued image regularization with PDEs: A common framework for different applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(4), 506–517 (2005) 5. Weickert, J.: Coherence-enhancing diffusion of colour images. Image Vision Comput. 17(3-4), 201–212 (1999) 6. Zhou, D., Scholkopf, B.: A regularization framework for learning from graph data. In: ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, pp. 132–137 (2004) 7. Lezoray, O., Bougleux, S., Elmoataz, A.: Graph regularization for color image processing. Computer Vision and Image Understading (in press) (2007) 8. Bensoussan, A., Menaldi, J.L.: Difference equations on weighted graphs. Journal of Convex Analysis 12, 13–44 (2005) 9. Diestel, R.: Graph Theory, vol. 173. Springer, Heidelberg (2005) 10. Chan, T., Osher, S., Shen, J.: The digital TV filter and nonlinear denoising. IEEE Transactions on on Image Processing 10, 231–241 (2001) 11. Chambolle, A., Lions, P.L.: Image recovery via total variation minimization and related problems. Numerische Mathematik 76(2), 167–188 (1997) 12. Chan, T., Kang, S., Shen, J.: Total variation denoising and enhancement of color images based on the CB and HSV color models. J. of Visual Communication and Image Representation 12, 422–435 (2001) 13. Buades, A., Coll, B., Morel, J.: A non local algorithm for image denoising. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 60–65. IEEE Computer Society Press, Los Alamitos (2005) 14. Brook, A., Kimmel, R., Sochen, N.: Variational restoration and edge detection for color images. Journal of Mathematical Imaging and Vision 18, 247–268 (2003) 15. Buades, A., Coll, B., Morel, J.: A review of image denoising algorithms, with a new one. Multiscale Modeling and Simulation (SIAM interdisciplinary journal) 4(2), 490–530 (2005)
Multicomponent Image Restoration, an Experimental Study Arno Duijster, Steve De Backer, and Paul Scheunders IBBT, Vision Lab, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium {arno.duijster,steve.debacker,paul.scheunders}@ua.ac.be http://visielab.ua.ac.be Abstract. In this paper, we study the problem of restoring multicomponent images. In particular, we investigate the effects of accounting for the correlation between the image components on the deconvolution and denoising steps. The proposed restoration is a 2-step procedure, comprising a shrinkage in the Fourier domain, followed by a shrinkage in the wavelet domain. The Fourier shrinkage is performed in a decorrelated space, by performing PCA before the Fourier transform. The wavelet shrinkage is performed in a Bayesian denoising framework by applying multicomponent probability density models for the wavelet coefficients that fully account for the intercomponent correlations. In an experimental section, we compare this procedure with the single-component analogies, i.e. performing the Fourier shrinkage in the correlated space and using single-component probability density models for the wavelet coefficients. In this way, the effect of the multicomponent procedures on the deconvolution and denoising performance is studied experimentally. Keywords: multicomponent image restoration, deconvolution, denoising, wavelet transform.
1 Introduction
With the evolution of imaging technology, an increasing number of imaging modalities becomes available. In this paper, we focus on what is called multichannel or multicomponent imagery, i.e. images containing several image components. The most well-known example is a color image, which generally consists of a red, green and blue component. In remote sensing, sensors are available that can generate multispectral or hyperspectral data, involving a few to more than a hundred bands. In medical imagery, distinct image modalities reveal different features of the internal body. Examples are MRI images acquired by using different imaging parameters (T1, T2, proton density, diffusion, . . . ) and different CT and nuclear medicine imaging modalities. Remote sensed and medical images are often degraded due to limitations such as aperture effects of the camera, motion or atmospheric effects and physical limitations. As a consequence, the images are corrupted by blur and noise. The standard imaging model consists of a convolution with some impulse response function and additive noise. The standard way of deconvolution is inverse
filtering of the image in the Fourier domain. A regularization is required to handle the noise. This leads to the classical Wiener type filters. In the past, specific multicomponent image restoration was proposed [1,2,3]. A disadvantage of the Fourier transform is that it does not efficiently represent image singularities, so that only small amounts of shrinkage are allowed to avoid distortion of edges in the image. Recently, 2-step procedures were proposed, involving a Fourier shrinkage and a wavelet based image denoising [4,5,6]. The first step is a shrinkage in the Fourier domain, in order to exploit the Fourier transform’s economical noise representation, followed by a shrinkage in the wavelet domain, to exploit the wavelet transform’s economical representation of piecewise smooth images. The wavelet transform offers an efficient representation of spatial discontinuities. It compresses the essential information of an image into a relatively few, large coefficients coinciding with the positions of image discontinuities. Such a representation naturally facilitates the construction of spatially adaptive denoising methods that can smooth noise without excessive blurring of image details. Typically, noise is reduced by shrinking the noisy wavelet coefficient magnitudes. Shrinkage estimators can result from a Bayesian approach, which imposes a prior distribution on noise-free data. Recently, several wavelet based procedures for multicomponent images were proposed that account to some extent for the inter-component correlations, applying wavelet thresholding [7] or Bayesian estimation, using different prior models [8,9,10,11]. In this work, we propose to restore multicomponent images by a combination of a Fourier-based and a wavelet-based shrinkage, employing a 2-step procedure. In this study, we want to investigate whether it helps to account for the intercomponent correlations when restoring a multicomponent image. In case of the Fourier shrinkage step, a PCA transformation is applied prior to the Fourier transform, in order to decorrelate the image components, then a componentwise shrinkage is applied, after which the inverse PCA transform is applied. In the wavelet shrinkage step, a multicomponent procedure is applied, where the multicomponent wavelet coefficients are modeled using a multinormal or a multicomponent heavy-tailed distribution (a Gaussian Scale Mixture model). In both cases, the proposed procedure is compared to the single-component analogs, i.e. no PCA and single-component wavelet shrinkage. In the experimental section, conclusions on these comparisons will be drawn. The remaining of this manuscript is organized as follows. In the next section, the 2-step procedure is reviewed. In section 3, we introduce our multicomponent restoration procedure. Finally, section 4 is devoted to experiments and discussion.
2 The 2-Step Restoration Procedure

2.1 The Image Model
The observed image sample $Y$ of size $M\times M$ consists of an unknown signal $S$, first degraded by a circular convolution (denoted by $\ast$) with a known impulse response $H$ from a linear system $\mathcal{H}$, and then corrupted by zero-mean additive white Gaussian noise $N$ with variance $\sigma^2$:
$$Y = H\ast S + N = \mathcal{H}S + N. \quad (1)$$
In the Fourier domain, this becomes
$$Y_f = H_f\cdot S_f + N_f. \quad (2)$$
Inverse filtering
$$\hat S_f = H_f^{-1}(Y_f - N_f) \quad (3)$$
leads to amplified noise components, due to singularities at the frequencies where $H_f$ is (almost) zero. Application of a Wiener filter can prevent this.
2.2 Step 1: Fourier Shrinkage
Denote the result of the Wiener filter as $X (= \hat S_{\mathrm{Wiener}})$, defined by
$$X_f = \frac{H_f^* P_S}{|H_f|^2 P_S + P_N}\, Y_f \quad (4)$$
where $P_S$ and $P_N$ are the power spectra of the original image and the noise, respectively. $P_S$ can be estimated by an iterative procedure [12], where an initial estimate of $P_S$ is inserted in (4). The initial value is set to $P_S = \max(0, P_Y - P_N)$, with $P_Y = |Y_f|^2$ and, in the case of pure white Gaussian noise, $P_N = M^2\sigma^2$. From the resulting $X_f$, a new estimate of $P_S$ is set to $|X_f|^2$. This new estimate can be iteratively inserted in (4). By applying this procedure we obtain an estimate for $X_f$. Since the Fourier transform does not economically represent singularities, only a small amount of shrinkage is allowed in order not to distort the image edges. Because of this, the estimate $X$ will still be noisy. For this reason, a wavelet shrinkage step is included.
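A minimal sketch of this step in Python/NumPy is given below, assuming the impulse response is supplied centred and zero-padded to the image size; the function name and the fixed number of refinement passes are our choices, not the paper's.

```python
import numpy as np

def fourier_shrinkage(y, h, sigma, n_iter=3):
    """Wiener-type Fourier shrinkage of Eq. (4) with the iterative power
    spectrum refinement of [12]: y is the M x M observation, h the centred
    impulse response padded to the image size, sigma the noise std."""
    M = y.shape[0]
    Yf = np.fft.fft2(y)
    Hf = np.fft.fft2(np.fft.ifftshift(h))
    PN = (M ** 2) * sigma ** 2                   # white-noise power spectrum
    PS = np.maximum(0.0, np.abs(Yf) ** 2 - PN)   # initial estimate of P_S
    Xf = Yf
    for _ in range(n_iter):
        Xf = np.conj(Hf) * PS / (np.abs(Hf) ** 2 * PS + PN) * Yf
        PS = np.abs(Xf) ** 2                     # refined estimate of P_S
    return np.real(np.fft.ifft2(Xf))
```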
2.3 Step 2: Wavelet Shrinkage

The obtained Wiener-filtered estimate $X$ is wavelet transformed. Let $x_l^{(j,o)}$ denote its wavelet coefficient at spatial position $l$, resolution level $j$ and orientation subband $o$. The corresponding wavelet coefficients of the noise-free image and the noise are $s_l^{(j,o)}$ and $n_l^{(j,o)}$, respectively. Equivalent processing is typically applied to all the wavelet subbands, and hence we shall omit the indexes. Then
$$x = s + n. \quad (5)$$
Using the Bayesian approach, a priori knowledge about the distribution of the noise-free data is assumed. The minimum mean squared error (MMSE) estimate is the posterior conditional mean
$$\hat s = \int_{-\infty}^{\infty} s\, p(s|x)\, ds = \frac{\int_{-\infty}^{\infty} s\, p(x|s)\, p(s)\, ds}{\int_{-\infty}^{\infty} p(x|s)\, p(s)\, ds} = \frac{\int_{-\infty}^{\infty} s\, \phi(x-s;\sigma_n^2)\, p(s)\, ds}{\int_{-\infty}^{\infty} \phi(x-s;\sigma_n^2)\, p(s)\, ds}. \quad (6)$$
Assuming e.g. a Gaussian prior for the noise-free signal, $p(s) = \phi(s;\sigma_s^2)$, the above MMSE estimate becomes the Wiener filter:
$$\hat s = \frac{\hat\sigma_s^2}{\hat\sigma_s^2 + \sigma_n^2}\, x. \quad (7)$$
In this work, $\hat\sigma_s^2 = \max(0, \sigma_x^2 - \sigma_n^2)$ is calculated over the entire wavelet subband. The noise variance in each wavelet subband, $\sigma_n^2$, is in general a scaled version of the input image noise variance, where the scaling factors depend on the wavelet filter coefficients (see e.g. [13]). With the orthogonal wavelet families [14] that we use in this paper, the noise variance in all the wavelet subbands is equal to the input image noise variance.
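The per-subband shrinkage of (7) then amounts to a few lines; in the sketch below (our own notation) the subband signal variance is estimated over the whole subband, exactly as described above.

```python
import numpy as np

def wiener_shrink_subband(x, noise_var):
    """MMSE shrinkage of Eq. (7) applied to one wavelet subband x."""
    signal_var = max(0.0, float(np.var(x)) - noise_var)
    denom = signal_var + noise_var
    return x if denom == 0 else (signal_var / denom) * x
```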
3 The Multicomponent Restoration Procedure

3.1 Multicomponent Fourier Shrinkage
To extend the restoration technique from the previous section to multicomponent images, we rewrite (1) in vector-matrix notation. Let S be the vector describing the multicomponent image S = [S 1 , S 2 , . . . , S K ] as a concatenation of multiple single component images S i with K the number of components. We can similarly define Y and N, being respectively the vectors containing the observed multicomponent image and the additive noise. If we assume the blurring operation to be equal for all components and no blurring among different components exist, then the multicomponent blurring operation can be written as H = IK ⊗ H with ⊗ the Kronecker product, IK the identity matrix of size K ×K and H the singlecomponent blurring operation. The multicomponent observation model can be written as Y = HS + N . (8) Following [1], the multicomponent image restoration can, depending on the assumptions made, be performed independently or in decorrelated fashion. In the independent case, all components are deblurred separately. In the decorrelated case, the assumption is made that the spatial correlations of each component, and the spatial cross-correlations between all pairs of components are equal, up to some scaling factor. We can write the total multicomponent covariance matrix as CS = Ccomp ⊗ CS , where Ccomp is the component covariance matrix of size K × K and CS the spatial correlation of one component, of size M 2 × M 2 . For the noise covariance we assume it to be spatially and component independent CN = σ 2 IM 2 K . Generally, it is assumed that the spatial covariance matrix is diagonalized by the discrete Fourier transform, with the diagonal elements described by the power spectrum PS (CS is a circulant matrix). The multicomponent restoration process can then be performed by first decorrelating the components, i.e. diagonalizing Ccomp , and then performing single-component restoration for each component.
Let $U_{pca}$ be the orthogonal transformation diagonalizing the covariance matrix $C_{comp}$ by the operation generally referred to as principal component analysis (PCA) or the Karhunen-Loève transform:
$$U_{pca}^T C_{comp} U_{pca} = \Lambda \quad (9)$$
with $\Lambda$ a diagonal matrix containing the eigenvalues of $C_{comp}$ on the diagonal. Applying the transformation to the multicomponent image, the covariance matrix $C_{\mathbf S}$ is transformed to
$$(U_{pca}\otimes I_{M^2})^T C_{\mathbf S}(U_{pca}\otimes I_{M^2}) = (U_{pca}\otimes I_{M^2})^T (C_{comp}\otimes C_S)(U_{pca}\otimes I_{M^2}) = \Lambda\otimes C_S. \quad (10)$$
A block diagonal matrix is obtained, removing all cross-component correlations. Applying the discrete Fourier transform to each component fully diagonalizes the multicomponent covariance matrix.
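The decorrelation step itself is a small eigendecomposition of the K × K component covariance. The sketch below (NumPy, our own helpers) returns the decorrelated components together with U_pca, so that the independently restored components can be transformed back.

```python
import numpy as np

def pca_decorrelate(Y):
    """Karhunen-Loeve / PCA transform of Eq. (9) applied across the K
    components of a multicomponent image Y of shape (K, M, M)."""
    K = Y.shape[0]
    flat = Y.reshape(K, -1)
    C_comp = np.cov(flat)                    # K x K component covariance
    eigval, U = np.linalg.eigh(C_comp)       # U^T C_comp U = diag(eigval)
    decorrelated = (U.T @ flat).reshape(Y.shape)
    return decorrelated, U

def pca_recompose(Z, U):
    """Inverse transform: map restored decorrelated components back."""
    K = Z.shape[0]
    return (U @ Z.reshape(K, -1)).reshape(Z.shape)
```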
3.2 Multicomponent Wavelet Shrinkage
After Fourier shrinkage, each component of the estimated multicomponent image is wavelet transformed. The wavelet coefficients are concatenated into a vector, obeying
$$\mathbf x = \mathbf s + \mathbf n. \quad (11)$$
The estimated covariance matrix $\hat C_s$ of the unknown noise-free vector $\mathbf s$ is given by
$$\hat C_s = C_x - C_n \quad (12)$$
where $C_x$ denotes the covariance matrix of the noisy vector $\mathbf x$ and $C_n = \sigma_n^2 I_K$ is the noise covariance. In this paper, $C_x$ is calculated over the entire image: $C_x = \langle\mathbf x_l\mathbf x_l^T\rangle_l$, where it is assumed that the intercomponent correlations are the same for all wavelet coefficients of a subband. Since $\hat C_s$ is a covariance matrix, it needs to be semi-positive definite. This is assured by performing an eigenvalue decomposition and clipping possible negative eigenvalues to zero. The minimum mean squared error (MMSE) estimate, assuming a Gaussian prior for the noise-free signal $p(\mathbf s) = \phi(\mathbf s; C_s)$, becomes the vector-based Wiener filter:
$$\hat{\mathbf s} = \hat C_s(\hat C_s + C_n)^{-1}\mathbf x. \quad (13)$$

The Gaussian Scale Mixture (GSM) Prior. The standard Wiener result is obtained using a multicomponent Gaussian prior model. It accounts for the multicomponent covariance, but it assumes that the marginal densities of the wavelet coefficients are Gaussian. It is well known that this assumption is not justified: these marginals are symmetric and zero mean, but heavier tailed than Gaussians. Several other priors were proposed to better approximate the marginal densities.
The Gaussian scale mixture (GSM) prior [15] models the probability density function $p(\mathbf s)$ by a mixture of Gaussians
$$p(\mathbf s) = \int p(\mathbf s|z)\, p(z)\, dz \quad (14)$$
where $p(z)$ is the mixing density and $p(\mathbf s|z)$ is a zero-mean Gaussian with covariance $C_{s|z}$. Under the GSM model, the additive noise model (11) becomes
$$\mathbf x = \mathbf s + \mathbf n = \sqrt z\,\mathbf u + \mathbf n \quad (15)$$
where both $\mathbf u$ and $\mathbf n$ are zero-mean Gaussians, with covariances given by $C_u$ and $C_n$ respectively. Then $C_{s|z} = zC_u$, or, by taking expectations over $z$ with $E(z)=1$, $C_s = C_u$. GSM densities are symmetric, zero-mean and heavier tailed than Gaussians. They are known to model the shape of the wavelet coefficient marginals more accurately than Gaussians. We apply the GSM to model multicomponent wavelet coefficients. In this way, the prior fully accounts for the inter-component covariances. The Bayes least squares estimate is given by
$$\hat{\mathbf s} = \int\mathbf s\, p(\mathbf s|\mathbf x)\, d\mathbf s = \int\!\!\int_0^{\infty}\mathbf s\, p(\mathbf s,z|\mathbf x)\, dz\, d\mathbf s = \int\!\!\int_0^{\infty}\mathbf s\, p(\mathbf s|\mathbf x,z)\, p(z|\mathbf x)\, dz\, d\mathbf s = \int_0^{\infty} p(z|\mathbf x)\, E(\mathbf s|\mathbf x,z)\, dz. \quad (16)$$
Since, under the GSM model, $\mathbf s$ conditioned on $z$ is Gaussian, the expected value within the integral is given by a Wiener estimate
$$E(\mathbf s|\mathbf x,z) = zC_u(zC_u + C_n)^{-1}\mathbf x. \quad (17)$$
The posterior distribution of $z$ can be obtained, using Bayes' rule, as
$$p(z|\mathbf x) = \frac{p(\mathbf x|z)\, p(z)}{\int_0^{\infty} p(\mathbf x|\alpha)\, p(\alpha)\, d\alpha} \quad (18)$$
with $p(\mathbf x|z) = \phi(\mathbf x; zC_u + C_n)$. In [15], the authors motivate the use of the so-called Jeffreys prior [16] for the random multiplier $z$: $p(z)\propto\frac{1}{z}$. We also refer to [10] and [15] for further information about the practical implementation of (16).
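One way to evaluate (16)–(18) numerically is to discretize the integral over z on a finite grid. The sketch below (NumPy; the grid and quadrature choices are ours, not the paper's) follows that route for a single K-dimensional coefficient vector with the Jeffreys prior p(z) ∝ 1/z.

```python
import numpy as np

def gsm_estimate(x, Cu, Cn, z_grid=None):
    """Bayes least-squares estimate of Eq. (16) for one coefficient vector x,
    integrating numerically over a grid of z values."""
    if z_grid is None:
        z_grid = np.logspace(-3, 3, 60)
    dz = np.gradient(z_grid)
    K = x.shape[0]
    log_post = np.empty(z_grid.size)
    cond_mean = np.empty((z_grid.size, K))
    for i, z in enumerate(z_grid):
        C = z * Cu + Cn
        sign, logdet = np.linalg.slogdet(C)
        # log p(x|z) for a zero-mean Gaussian with covariance C
        log_like = -0.5 * (K * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(C, x))
        log_post[i] = log_like - np.log(z) + np.log(dz[i])   # Jeffreys prior 1/z
        cond_mean[i] = z * (Cu @ np.linalg.solve(C, x))      # E(s|x,z), Eq. (17)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                             # p(z|x) on the grid
    return w @ cond_mean
```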
4 Experiments and Discussion
In all presented techniques, the noise variance is required. Instead of assuming that it is known, the noise variance is estimated as follows: let for each image component x1,HH denote the wavelet coefficients of the first resolution level and
Fig. 1. The PSNR in function of the noise standard deviation σ for the restored image ‘Lena’
orientation subband HH. Then the diagonal elements of the noise covariance are estimated by the classical median estimator [17]
$$\hat\sigma^2 = \left(\frac{\operatorname{median}(|x_{1,HH}|)}{0.6745}\right)^2. \quad (19)$$
The noise is assumed to be uncorrelated between different image components. After the Fourier shrinkage step, the noise is no longer white, since different frequencies were scaled by different factors. Therefore, before the wavelet shrinkage step, we need to re-estimate the noise variance at each wavelet scale and orientation separately. However, it can be derived directly from the Fourier shrinkage filter. Denote by $W_f$ the Fourier transformed shrinkage filter as estimated in (4):
$$W_f = \frac{H_f^* P_S}{|H_f|^2 P_S + P_N}. \quad (20)$$
Let $W = \mathcal{FT}^{-1}\bigl(\sqrt{|W_f|^2}\bigr)$ be the inverse Fourier transform of the square root of the filter's power spectral density. Define $w_l^{(j,o)}$ as its wavelet coefficient at location $l$, scale $j$ and orientation $o$. Then, at each scale and orientation,
$$\hat\sigma^{(j,o)} = \hat\sigma\sqrt{\sum_l |w_l^{(j,o)}|^2}. \quad (21)$$
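The two noise estimates translate directly into code; the sketch below (our own naming) gives the median estimator of (19) and the subband rescaling of (21).

```python
import numpy as np

def mad_sigma(x_hh):
    """Median estimator of Eq. (19) applied to the HH subband of the first
    resolution level; returns the noise standard deviation."""
    return np.median(np.abs(x_hh)) / 0.6745

def subband_sigma(sigma, w_subband):
    """Noise std in one wavelet subband after Fourier shrinkage, Eq. (21);
    w_subband holds the wavelet coefficients of W at that scale/orientation."""
    return sigma * np.sqrt(np.sum(np.abs(w_subband) ** 2))
```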
Fig. 2. The PSNR in function of the noise standard deviation σ for the restored image ‘San Diego’
In the experimental section, we want to find out whether multicomponent restoration outperforms single-component restoration. For this, the proposed 2-step procedure is applied both in single-component mode and in multicomponent mode. The procedures are tested on 2 images of 3 components: the color image ‘Lena’ and the 3-band multispectral remote sensing image ‘San Diego’. The images are blurred using a 3 × 3 boxcar function. On both images, Gaussian white noise is simulated with standard deviations ranging from 4 to 40, in steps of 4. The following procedures are applied:
– SF: the single-component Fourier shrinkage step, without PCA and denoising, using the procedure described in Section 2.2.
– SFSW: the single-component Fourier shrinkage step, with single-component denoising, using (7).
– MF: the multicomponent Fourier shrinkage step, without denoising, using the procedure described in Section 3.1.
– MFMW1: the multicomponent Fourier shrinkage step, with multicomponent denoising, using the multinormal model (13).
– MFMW2: the multicomponent Fourier shrinkage step, with multicomponent denoising, using the multicomponent GSM model (16).
In Figures 1 and 2, the results are shown as the PSNR versus the noise standard deviation. For each measurement, 10 experiments with different noise realizations are performed and the average result is shown. Error bars are negligible.
Fig. 3. The third channel of the multicomponent ‘San Diego’ image. From left to right and top to bottom: (a) original scene; (b) blurred and corrupted with noise; (c) restored using SF, (d) SFSW, (e) MF, (f) MFMW1 and (g) MFMW2.
When the multicomponent Fourier shrinkage step is applied, a gain of about a half to one dB is obtained, compared to the single-component Fourier shrinkage. This clearly indicates that the multicomponent Fourier shrinkage step does account for the correlations between the image components. Including the wavelet shrinkage step improves the image quality. When applying the single-component wavelet shrinkage on top of the single-component Fourier shrinkage, a gain in image quality is observed that increases with higher noise variance. The multicomponent denoising models generate superior results compared to the single-component model. This clearly indicates that, when applying the wavelet shrinkage step, it is favorable to account for the correlations between the image components. When a model is applied that better approximates the probability density of the wavelet coefficients (as in this case the GSM model), better results are obtained. Applying both steps in a multicomponent way is the most favorable for the image quality improvement. In figure 3a the third channel of a part of the multicomponent image ‘San Diego’ is shown. A 3 × 3 boxcar function is used to blur the image and Gaussian white noise with a standard deviation of 30 is added (figure 3b). The figures 3c and d show the results of the single-component Fourier shrinkage without and with denoising, respectively. The results of the multi-component Fourier shrinkage are shown in the last three subfigures: figure 3e shows the image without denoising, while the figures 3f and g show the denoised images, using the multinormal and the GSM model, respectively. In this work, the Fourier shrinkage step was applied with a standard Wiener filter. In [4], a procedure is described where the regularization of the Fourier and wavelet shrinkage are simultaneously optimized. When applying such an optimization, we expect even better results. Acknowledgements. This work was supported by the Flemish Interdisciplinary institute for Broadband Technology (IBBT), through the project ICA4DT, Image-based Computer Assistance for Diagnosis and Therapy.
References 1. Hunt, B.R., K¨ ubler, O.: Karhunen-Loeve multispectral image restoration, part I: Theory. IEEE Trans. Acoust., Speech, Signal Process. ASSP-32, 592–600 (1984) 2. Zervakis, M.E.: Generalized maximum a posteriori processing of multichannel images and applications. Circuits, Systems, and Signal Processing 15, 233–260 (1996) 3. Galatsanos, N.P., Wernick, M.N., Katsaggelos, A.K.: Multi-channel Image Recovery. In: The Handbook of Image and Video Processing, pp. 161–174. Academic Press, London (2000) 4. Neelamani, R., Choi, H., Baraniuk, R.: Forward: Fourier-wavelet regularized deconvolution for ill-conditioned systems. IEEE Trans. Image Process. 52, 418–433 (2004) 5. Figueiredo, M.A.T., Nowak, R.D.: An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process. 12, 906–916 (2003)
6. Bioucas-Dias, J.: Bayesian wavelet-based image deconvolution: a GEM algorithm exploiting a class of heavy-tailed priors. IEEE Trans. Image Process 15, 937–951 (2006) 7. Scheunders, P.: Wavelet thresholding of multivalued images. IEEE Trans. Image Process 13, 475–483 (2004) 8. Benazza-Benyahia, A., Pesquet, J.C.: Building robust wavelet estimators for multicomponent images using Steins’ principle. IEEE Trans. Image Process 14, 1814– 1830 (2005) 9. Scheunders, P., Driesen, J.: Least-squares interband denoising of color and multispectral images. In: Proc. IEEE Internat. Conf. Image Proc. ICIP, Singapore, vol. 2, pp. 985–988. IEEE Computer Society Press, Los Alamitos (2004) 10. Scheunders, P., De Backer, S.: Wavelet denoising of multicomponent images using a gaussian scale mixture model. In: Proc. Internat. Conf. on Pattern Recognition, ICPR, Hong Kong (2006) 11. Piˇzurica, A., Philips, W.: Estimating the probability of the presence of a signal of interest in multiresolution single- and multiband image denoising. IEEE Trans. Image Process 15, 654–665 (2006) 12. Hillery, A., Chin, R.: Iterative wiener filters for image restoration. IEEE Trans. Signal Process. 39, 1892–1899 (1991) 13. Jansen, M.: Noise Reduction by Wavelet Thresholding. Springer-Verlag, New York (2001) 14. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1998) 15. Portilla, J., Strela, V., Wainwright, M.J., Simoncelli, E.P.: Image denoising using scale mixtures of gaussians in the wavelet domain. IEEE Trans. Image Process 12, 1338–1351 (2003) 16. Box, G.E.P., Tiao, C.: Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading (1992) 17. Donoho, D.L., Johnstone, I.M.: Adapting to unknown smoothness via wavelet shrinking. Journal of the American Statistical Association 90, 1200–1224 (1995)
Reconstruction of Low-Resolution Images Using Adaptive Bimodal Priors

Hiêp Luong and Wilfried Philips

Ghent University, Department of Telecommunications and Information Processing, Image Processing and Interpretation Research Group, IBBT, B-9000 Ghent, Belgium
Abstract. This paper introduces a Bayesian restoration method for lowresolution images combined with a smoothness prior and a newly proposed adaptive bimodal prior. The bimodal prior is based on the fact that an edge pixel has a colour value that is typically a mixture of the colours of two connected regions, each having a dominant colour distribution. Local adaptation of the parameters of the bimodal prior is made to handle neighbourhoods which have typically more than two dominant colours. The maximum a posteriori estimator is worked out to solve the problem. Experimental results confirm the effectiveness of the proposed bimodal prior and show the visual superiority of our reconstruction scheme to other traditional interpolation and reconstruction methods for images with a strong colour modality like cartoons: noise and compression artefacts are removed very well and our method produces less blur and other annoying artefacts.
1 Introduction
Due to the huge amount of data, images and video sequences are compressed before transmission or storage. Image quality will typically be lost in the form of blocking artefacts (intrinsic to the block structure used in jpeg and mpeg algorithms) and mosquito noise (random noise originating from the quantization of high-frequency information). With the growing popularity of High Definition television (hdtv), these artefacts become more bothersome. That is why digital high-resolution (hr) image reconstruction has become highly important nowadays. Many image interpolation methods have already been proposed in the literature, but all suffer from one or more artefacts. Linear or non-adaptive interpolation methods deal with jagged edges, blurring and/or ringing effects. Well-known and popular linear interpolation methods are nearest neighbour, bilinear and interpolation with higher order (piecewise) polynomials, b-splines, truncated or windowed sinc functions, etc. [10,15] Non-linear or adaptive interpolation methods incorporate a priori knowledge about images. Some general methods focus on reconstructing edges [1,11,17] and other methods tackle unwanted interpolation artefacts such as jagged edges, blurring and ringing using isophote smoothing, level curve mapping or mathematical morphology [9,12,16,20]. Some other adaptive techniques exploit the
self-similarity property, e.g. iterated function systems [8] or the repetitive behaviour in an image [13]. Another class of adaptive image enlargement methods takes advantage of training-based priors [3], which for example map blocks of the low-resolution (lr) image onto predefined high-resolution blocks [7]. General interpolation methods need clean input images, which are often not available due to noise and/or compression artefacts. The solution is a combination of denoising and interpolation. Most existing methods essentially perform the enhancement and resolution expansion as two separate steps. Constrained partial differential equations (pde’s) take interpolation information into account while denoising [20]. Training-based and non-local interpolation methods can also handle a certain amount of noise, but the result depends heavily on the used training set or image content [7,13]. The proposed method performs the image enhancement and interpolation simultaneously. Bimodality has been successfully applied in low-resolution text enhancement [5,19]. Typically the intensities of the text pixels tend to cluster around one value (e.g. black) and the intensities of the background pixels tend to cluster around another (e.g. white). Taking bimodality into account improves the contrast and thus the readability. We will extend this concept to general image enhancement. Another use of multimodality is introduced in image retrieval [14]. Local windows with one, two or three colours (respectively unimodal, bimodal and trimodal neighbourhoods) describe the features being matched in the image database. In the remainder of this paper, we propose a reconstruction framework for degraded low-resolution images using adaptive bimodal priors. Section 2 gives the necessary background, describing the problem of image acquisition. In Section 3, we describe the Bayesian reconstruction framework. In Section 4, we focus on the image priors, including the proposed adaptive bimodal priors. Section 5 presents experimental results using our technique and compares with other interpolation/reconstruction methods. Finally, Section 6 concludes this paper.
2 Problem Statement
The image acquisition process consists of converting a continuous scene into a (discrete) digital image. However, in practice, the acquired image lacks resolution and is corrupted by noise and blur. The recovery of the unknown high-resolution image x from a known low-resolution image y is related by y = DHx + n = Ax + n.
(1)
In this equation, A represents some linear degradation operation, which is commonly a combination of a blur operator H and a decimation operator D, and n describes the additive noise, which is assumed to be zero-mean Gaussian distributed (with a standard deviation σn ). This is just a simplification and an approximation of the real noise. We assume that the blur in the imaging model is denoted by a space-invariant point spread function (psf). In this paper, we assume that the degradation operator is performed by simple block averaging. The relationship between the lr grid and the hr grid is illustrated in figure 1.
Fig. 1. Block averaging operator with downsampling factor r
If we rearrange $y$ and $x$ into vectors of length $N$ and $r^2N$ respectively, with $r$ the magnification factor in the horizontal and vertical direction, then the probability of observing $y$ given $x$ in the presence of white Gaussian noise becomes
$$p(y|x) = \frac{1}{(\sqrt{2\pi}\,\sigma_n)^N}\exp\left(-\frac{\|y-Ax\|_2^2}{2\sigma_n^2}\right). \quad (2)$$
The $L_2$-norm $\|y-Ax\|_2^2$ is defined as $(y-Ax)^T(y-Ax)$. The multiplication with the matrix $A$ (which has dimension $N\times r^2N$) has the same effect as the block averaging in figure 1. Because of the high dimensionality, this matrix multiplication is replaced with its actual image operator in our implementation. The maximum likelihood (ml) estimator suggests choosing the $\hat x_{ML}$ that maximizes the likelihood function or minimizes the negative log-likelihood function:
$$\hat x_{ML} = \arg\max_x p(y|x) = \arg\min_x\bigl(-\log p(y|x)\bigr) = \arg\min_x\|y-Ax\|_2^2. \quad (3)$$
For a given low-resolution image $y$, an infinite set of high-resolution images can be generated from the observed data. Since most solutions are unstable, the inverse problem is ill-posed. To obtain an hr image $x$ that is stable and optimal in some sense, prior knowledge about the image is introduced in a Bayesian framework.
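Since A is a simple block-averaging operator, both it and its adjoint (needed later for the gradient of the data term) can be written in a few lines. The following sketch (NumPy, our own helpers) assumes the hr image dimensions are multiples of r.

```python
import numpy as np

def block_average(x, r):
    """Degradation operator A of Eq. (1): r x r block averaging followed by
    downsampling (blur H and decimation D combined)."""
    H, W = x.shape
    return x.reshape(H // r, r, W // r, r).mean(axis=(1, 3))

def block_average_adjoint(y, r):
    """Adjoint A^T: replicate each lr pixel over its r x r block, scaled by 1/r^2."""
    return np.kron(y, np.ones((r, r))) / (r * r)
```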
3 Bayesian Reconstruction Framework
Via the Bayes rule, the probabilities $p(y|x)$ in the likelihood function are replaced with the posterior probabilities $p(x|y)$:
$$p(x|y) = \frac{p(y|x)\cdot p(x)}{p(y)}. \quad (4)$$
The maximum a posteriori (map) estimator suggests choosing the $\hat x_{MAP}$ that maximizes the posterior probability:
$$\hat x_{MAP} = \arg\max_x p(x|y) = \arg\max_x p(y|x)\cdot p(x) = \arg\min_x\bigl(-\log p(y|x) - \log p(x)\bigr). \quad (5)$$
Note that the denominator, p(y), does not affect the maximization. In the next section we will derive two prior probability density functions (pdf) on the hr image, namely the smoothness prior pS (x) and the adaptive bimodality prior pB (x).
4 Image Priors
A general way to describe the prior pdf $p(x)$ is the Gibbs distribution, which has the following exponential form [3]:
$$p(x) = c_G\cdot\exp\{-\alpha f(x)\}, \quad (6)$$
where $c_G$ is a normalizing constant, guaranteeing that the integral over all $x$ is 1, and the term $f(x)$ is a non-negative energy function, which is low if the solution is close to the postulate and high otherwise. The expression of the prior in the negative log-likelihood function then becomes very simple, namely $\alpha f(x)$.

4.1 Smoothness Prior
We assume that images are locally smooth except for edges. A simple way to achieve this goal is via Tikhonov's approach. The Gibbs smoothness pdf prior is defined as
$$p_S\bigl(x(u)\bigr) = c_{G,S}\cdot\exp\left(-\sum_{u'\in\aleph_4(u)}\frac{\rho_T\bigl(x(u)-x(u')\bigr)}{\sigma_s^2}\right), \quad (7)$$
where $\aleph_4(u)$ denotes the 4-connectivity neighbourhood of $u$ as illustrated in figure 2 and the function $\rho_T(.)$ represents the quadratic term $\|.\|_2^2$. Applying this smoothness term will remove jagged edges on the one hand but will cause
Fig. 2. The 4-connectivity and 8-connectivity neighbourhoods: (a) ℵ4(x, y); (b) ℵ8(x, y) or ℵ3,3(x, y).
blurred edges on the other hand; this latter problem will be automatically solved using the bimodal prior. Minimizing the penalty function $\rho_T(x)$ leads to a very simple closed expression, namely $\rho_T'(x) = \psi_T(x) = 2x$. Furthermore, the use of so-called edge-stopping functions is very popular, because they suppress the noise better while retaining important edge information [18]. In Section 5, we will discuss and compare our method with one of the most successful edge-preserving regularization terms proposed for image denoising, namely the total variation (tv) prior:
$$p_{TV}\bigl(x(u)\bigr) = c_{G,TV}\cdot\exp\left(-\sum_{u'\in\aleph_4(u)}\frac{\rho_{TV}\bigl(x(u)-x(u')\bigr)}{\sigma_{s_1}^2}\right), \quad (8)$$
and the bilateral total variation (btv) proposed in [6]:
$$p_{BTV}\bigl(x(u)\bigr) = c_{G,BTV}\cdot\exp\left(-\sum_{u'\in\aleph_{p,p}(u)}\frac{\gamma^{\|u-u'\|_1}\,\rho_{TV}\bigl(x(u)-x(u')\bigr)}{\sigma_{s_2}^2}\right), \quad (9)$$
where $\aleph_{p,p}(u)$ denotes the $p\times p$ window centered around $u$ ($\aleph_{3,3}(u)$ is illustrated in figure 2), $\gamma$ is a weight that takes spatial distance into account and $\rho_{TV}(.)$ represents the $L_1$-norm $\|.\|_1$ of the image gradient.

4.2 Adaptive Bimodal Prior
The number of different colours in a small hr image region is very small in general, if we do not take noise into account. Depending on the number of modes of the probability distribution of colour values, the regions are characterized as unimodal (one dominant colour), bimodal (two modes), or in general multimodal neighbourhoods [14]. Unimodal neighbourhoods appear in flat regions, while bimodal neighbourhoods occur in edge regions. For bimodal regions we use a Gibbs prior with a non-negative fourth-order polynomial [5]:
$$p_{B2}(x) = c_{G,B2}\cdot\exp\left(-\frac{\|x-\mu_f\|_2^2\,\|x-\mu_b\|_2^2}{\sigma_{b2}^4}\right), \quad (10)$$
where μ_b and μ_f are the means of the background and foreground pixel distributions respectively and σ_{b_2} is the standard deviation, which is the same for both pixel distributions. The one-dimensional bimodal pdf is illustrated in Figure 3. In our implementation each pixel is represented by its rgb colour vector. When μ_b and μ_f are equal (or close) to each other, the bimodal pdf has a much lower peak than the normal distribution, as plotted in Figure 3. This means that the kurtosis, i.e. the degree of peakedness, is lower than 3, which is known as platykurtosis. If the Euclidean distance ‖μ_b − μ_f‖_{2,L*a*b*} is lower than a threshold τ_u (i.e. μ_f is located inside the hypersphere of radius τ_u centred at μ_b in L*a*b* colour space), we switch to the following unimodal Gaussian pdf:

p_{B1}(x) = c_{G,B1} \cdot \exp\Big( -\frac{\|x - \mu\|_2^2}{\sigma_{b_1}^2} \Big)   (11)

where μ is the mean between μ_b and μ_f and σ_{b_1} is the standard deviation of the unimodal pixel distribution. The gradient of the fourth-order polynomial in equation (10) leads to the following expression:

\phi_{B2}(x) = 2(x - \mu_f)\,\|x - \mu_b\|_2^2 + 2(x - \mu_b)\,\|x - \mu_f\|_2^2,   (12)
while minimizing the quadratic term of the unimodal pdf in equation (11) leads to:

\phi_{B1}(x) = 2(x - \mu).   (13)

If an image contains only two colours, e.g. a black-and-white cartoon or a text document, we can solve this problem globally using a global μ_b (e.g. white) and a global μ_f (e.g. black).
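As an illustration only, the gradients of equations (12) and (13) can be written directly in NumPy; the array shapes and helper names below are assumptions of this sketch, not the authors' implementation.

import numpy as np

def phi_b2(x, mu_f, mu_b):
    # Gradient of the bimodal prior's fourth-order polynomial, equation (12).
    # x, mu_f and mu_b are rgb vectors, or arrays of shape (..., 3).
    d_f = x - mu_f
    d_b = x - mu_b
    return (2.0 * d_f * np.sum(d_b ** 2, axis=-1, keepdims=True)
            + 2.0 * d_b * np.sum(d_f ** 2, axis=-1, keepdims=True))

def phi_b1(x, mu_f, mu_b):
    # Gradient of the unimodal quadratic term, equation (13),
    # with mu taken as the mean between mu_b and mu_f.
    mu = 0.5 * (mu_f + mu_b)
    return 2.0 * (x - mu)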
Fig. 4. (a) Original image, (b) μ_b (light), (c) μ_f (dark) and (d) bimodal regions
Via the expectation maximization (em) algorithm, we can obtain the parameters μ_b and μ_f of the mixture of Gaussian distributions [4]. Nevertheless, general colour images contain many more than two colours, so we need to adapt μ_b and μ_f locally to the colour content of the neighbourhood of a pixel. An edge pixel in the low-resolution grid has a colour value that is typically a mixture of the colours of two connected regions. If we assume that the two colours have a different luminance y^L (in L*a*b* colour space), then we can retrieve μ_b (lighter colour) and μ_f (darker colour) from the neighbouring pixels:

\mu_b(u) = \arg\min_{\tilde{y}} y^L(u'), \qquad \mu_f(u) = \arg\max_{\tilde{y}} y^L(u'),   (14)
where u' ∈ ℵ_8(u) (the 8-connectivity neighbourhood is illustrated in Figure 2) and ỹ is the denoised pixel of y. Denoising is performed by clustering similar colour values using the mean-shift algorithm [2]. The mean-shift filtering preserves discontinuities and retains local image structures. Examples of μ_b- and μ_f-images are illustrated in Figure 4. The colours μ_b and μ_f are determined on the lr grid because image enlargement typically does not introduce new colour information, and this requires much less computation time than operating on the hr grid. To deal with multimodal regions with more than two colours, we look at the neighbouring pixels on the lr grid and choose μ_b and μ_f based on the distance between the current hr estimate x(u_HR) and the line constructed by μ_b(u_LR) and μ_f(u_LR) in rgb colour space:

\big(\mu_b(u_{HR}), \mu_f(u_{HR})\big) = \arg\min_{\mu_b, \mu_f} \frac{\big\| \big(\mu_b(u'_{LR}) - \mu_f(u'_{LR})\big) \times \big(\mu_f(u'_{LR}) - x(u_{HR})\big) \big\|_2}{\big\| \mu_b(u'_{LR}) - \mu_f(u'_{LR}) \big\|_2},   (15)

where u'_{LR} ∈ ℵ_8(u_LR) and × denotes the cross product between two vectors. Note that the bimodal prior is only valid when the proportions and variances of the background and foreground pixel distributions are equal; this is approximately true in our case because bimodality is only considered in strips along edges, as illustrated in figure 4.
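A small sketch of the selection rule of equation (15), assuming the candidate (μ_b, μ_f) pairs from the 8-neighbourhood on the lr grid have already been collected; the point-to-line distance in rgb space is computed with the cross product, as in the numerator and denominator of (15). Names and the tiny eps guard are our own choices.

import numpy as np

def pick_modes(x_hr, neighbour_pairs):
    # Choose the (mu_b, mu_f) pair whose rgb line passes closest to the current
    # HR estimate x_hr. neighbour_pairs is an assumed list of (mu_b, mu_f) rgb vectors.
    best, best_dist = None, np.inf
    for mu_b, mu_f in neighbour_pairs:
        direction = mu_b - mu_f
        # point-to-line distance: ||(mu_b - mu_f) x (mu_f - x)|| / ||mu_b - mu_f||
        dist = np.linalg.norm(np.cross(direction, mu_f - x_hr)) / (np.linalg.norm(direction) + 1e-12)
        if dist < best_dist:
            best, best_dist = (mu_b, mu_f), dist
    return best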
Fig. 5. (a) Cubic b-spline and (b) btv regularization (γ = 0.8, p = 7)
4.3 Minimization
We solve the minimization problem in equation 5 by substituting the previously defined priors and using steepest descent. The solution for the bimodal regions (analogous for unimodal regions) can be found using the following iterative closed-form expression:

\hat{x}^{n+1}(u_{HR}) = \hat{x}^{n}(u_{HR}) - \beta \Big( \frac{A^T\big([A\hat{x}^n](u_{LR}) - y(u_{LR})\big)}{\sigma_n^2} + \frac{\phi_{B2}\big(\hat{x}^n(u_{HR})\big)}{\sigma_{b_2}^4} + \sum_{u'_{HR} \in \aleph(u_{HR})} \frac{2\big(\hat{x}^n(u_{HR}) - \hat{x}^n(u'_{HR})\big)}{\sigma_s^2} \Big),   (16)
where β denotes the scalar step size in the direction of the gradient, and this iterative procedure is initialized with the nearest-neighbour interpolation x̂^0. The prior terms are also called the regularization terms, where σ_n^{-1}, σ_s^{-1} and σ_b^{-1} become the weights or regularization parameters.
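For illustration, one iteration of equation (16) on a single-channel image might look as follows; the box-average decimation used here for A and its transpose are placeholders for the actual acquisition model, the HR size is assumed to be a multiple of the scale factor, and wrap-around borders are used in the smoothness term for brevity.

import numpy as np

def degrade(x_hr, s):
    # Toy observation operator A: box-average downsampling by factor s
    # (an assumption; the paper's A also models blur and the camera PSF).
    h, w = x_hr.shape
    return x_hr.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def backproject(r_lr, s):
    # Toy transpose A^T: spread each low-resolution residual over its s x s block.
    return np.kron(r_lr, np.ones((s, s))) / (s * s)

def sr_step(x_hr, y_lr, s, beta, sigma_n2, sigma_s2, sigma_b2_4, mu_f, mu_b):
    # One steepest-descent update of equation (16), sketched for a scalar image.
    grad = backproject(degrade(x_hr, s) - y_lr, s) / sigma_n2          # data term
    grad += (2 * (x_hr - mu_f) * (x_hr - mu_b) ** 2                    # bimodal term, eq. (12)
             + 2 * (x_hr - mu_b) * (x_hr - mu_f) ** 2) / sigma_b2_4
    for shift in ((1, 0), (-1, 0), (0, 1), (0, -1)):                   # Tikhonov smoothness term
        grad += 2 * (x_hr - np.roll(x_hr, shift, axis=(0, 1))) / sigma_s2
    return x_hr - beta * grad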
Fig. 6. (a) Curvature preserving pde’s [20] and (b) proposed method
5 Experimental Results
Our method is particularly well suited to images with a strong colour modality, such as cartoons, logos, maps, etc. As an experiment we have enlarged a part of an mpeg-compressed frame given in figure 4. The parameters for our method are β = 0.25, τ_u = 10, σ_n^{-2} = 0.5, σ_s^{-2} = 0.25, σ_{b_2}^{-4} = 0.00001, σ_{b_1}^{-2} = 0.25 and 100 iterations for the restoration process. The regularization parameters are chosen according to the rough order of magnitude of the regularization terms. In figures 5 and 6 we compare our method with the popular cubic b-spline interpolation, btv regularization (discussed in Section 4.1) and the curvature-preserving pde's [20] using a linear magnification factor of 4. The cubic b-spline clearly shows the compression artefacts, which are largely removed by the other methods. Our method is also visibly much sharper than the other methods. In figure 7 we have enlarged a region of interest with 4× nearest-neighbour interpolation in order to achieve better visibility. We additionally compare our method with the tv and Tikhonov regularization (i.e. the proposed method without
Fig. 7. Enlarged region of the mouth: (a) cubic b-spline, (b) Tikhonov regularization, (c) tv regularization, (d) btv regularization, (e) curvature preserving pde’s [20] and (f) proposed method
the adaptive bimodal priors). Note that the tv and btv regularization schemes are initialized with cubic b-spline interpolation instead of nearest neighbour, because they tend to retain the jagged edges. We can clearly see that our proposed method outperforms the other methods in visual quality. Noise and compression artefacts are heavily reduced compared to traditional interpolation techniques. Jagged edges are removed very well, while tv regularization tends to preserve the jaggedness. Staircasing artefacts (i.e. piecewise constant regions) also do not occur in our method, as opposed to the curvature-preserving pde method for example. Our method also contains less blurring than the other methods and has a built-in anti-alias filter, thanks to the Tikhonov regularization, which is visually more pleasant and prevents jagged edges.
6 Conclusion
In this paper we have presented a reconstruction scheme for low-resolution images from a Bayesian point of view. Because an edge pixel in the low-resolution grid has a colour value that is a mixture of the colours of two
connected regions, we have proposed an adaptive bimodal prior for edge regions and a unimodal prior otherwise. The parameters of these priors are retrieved by reducing the number of different colour modes in a small image region. Results show the effectiveness of the proposed bimodal prior and the visual superiority of our reconstruction scheme over other interpolation/reconstruction schemes for images with a strong colour modality such as cartoons: noise and compression artefacts (like mosquito noise) are removed and our method contains less blur and fewer other annoying artefacts (e.g. jagged edges, staircasing effects, etc.).
References
1. Allebach, J., Wong, P.W.: Edge-Directed Interpolation. In: Proc. of IEEE International Conference on Image Processing, pp. 707–710. IEEE Computer Society Press, Los Alamitos (1996)
2. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 603–619 (2002)
3. Datsenko, D., Elad, M.: Example-Based Single Image Super-Resolution: A Global MAP Approach with Outlier Rejection. To appear in the Journal of Multidimensional Systems and Signal Processing
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39, 1–38 (1977)
5. Donaldson, K., Myers, G.: Bayesian Super-Resolution of Text in Video With a Text-Specific Bimodal Prior. International Journal on Document Analysis and Recognition 7, 159–167 (2005)
6. Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and Robust Multiframe Super Resolution. IEEE Trans. on Image Processing 13, 1327–1344 (2004)
7. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22, 56–65 (2002)
8. Honda, H., Haseyama, M., Kitajima, H.: Fractal Interpolation For Natural Images. In: Proc. of IEEE International Conference of Image Processing, vol. 3, pp. 657–661. IEEE Computer Society Press, Los Alamitos (1999)
9. Ledda, A., Luong, H.Q., Philips, W., De Witte, V., Kerre, E.E.: Image Interpolation Using Mathematical Morphology. In: Proc. of 2nd IEEE International Conference On Document Image Analysis For Libraries, IEEE Computer Society Press, Los Alamitos (to appear) (2006)
10. Lehmann, T., Gönner, C., Spitzer, K.: Survey: Interpolation Methods In Medical Image Processing. IEEE Trans. on Medical Imaging 18, 1049–1075 (1999)
11. Li, X., Orchard, M.T.: New Edge-Directed Interpolation. IEEE Trans. on Image Processing 10, 1521–1527 (2001)
12. Luong, H.Q., De Smet, P., Philips, W.: Image Interpolation Using Constrained Adaptive Contrast Enhancement Techniques. In: Proc. of IEEE International Conference of Image Processing, vol. 2, pp. 998–1001. IEEE Computer Society Press, Los Alamitos (2005)
13. Luong, H.Q., Ledda, A., Philips, W.: An Image Interpolation Scheme for Repetitive Structures. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4141, pp. 104–115. Springer, Heidelberg (2006)
14. Matas, J., Koubaroulis, D., Kittler, J.: Colour Image Retrieval and Object Recognition Using the Multimodal Neighbourhood Signature. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 48–64. Springer, Heidelberg (2000)
15. Meijering, E.H.W., Niessen, W.J., Viergever, M.A.: Quantitative Evaluation Of Convolution-Based Methods For Medical Image Interpolation. Medical Image Analysis 5, 111–126 (2001)
16. Morse, B.S., Schwartzwald, D.: Isophote-Based Interpolation. In: Proc. of IEEE International Conference on Image Processing, pp. 227–231. IEEE Computer Society Press, Los Alamitos (1998)
17. Muresan, D.: Fast Edge Directed Polynomial Interpolation. In: Proc. of IEEE International Conference of Image Processing, vol. 2, pp. 990–993 (2005)
18. Pižurica, A., Vanhamel, I., Sahli, H., Philips, W., Katartzis, A.: A Bayesian Approach To Nonlinear Diffusion Based On A Laplacian Prior For Ideal Image Gradient. In: Proc. of IEEE Workshop On Statistical Signal Processing, IEEE Computer Society Press, Los Alamitos (2005)
19. Thouin, P., Chang, C.: A Method For Restoration of Low-Resolution Document Images. International Journal on Document Analysis and Recognition 2, 200–210 (2000)
20. Tschumperlé, D.: Fast Anisotropic Smoothing of Multi-Valued Images using Curvature-Preserving PDE's. International Journal of Computer Vision 1, 65–82 (2006)
Learning Basic Patterns from Repetitive Texture Surfaces Under Non-rigid Deformations
Roman Filipovych and Eraldo Ribeiro
Computer Vision and Bio-Inspired Computing Laboratory, Department of Computer Sciences, Florida Institute of Technology, Melbourne, FL 32901, USA
{rfilipov,eribeiro}@fit.edu
http://www.cs.fit.edu/~eribeiro
Abstract. In this paper, we approach the problem of determining the basic components from repetitive textured surfaces undergoing free-form deformations. Traditional methods for texture modeling are usually based on measurements performed on fronto-parallel planar surfaces. Recently, affine invariant descriptors have been proposed as an effective way to extract local information from non-planar texture surfaces. However, affine transformations are unable to model local image distortions caused by changes in surface curvature. Here, we propose a method for selecting the most representative candidates for the basic texture elements of a texture field while preserving the descriptors’ affine invariance requirement. Our contribution in this paper is twofold. First, we investigate the distribution of extracted affine invariant descriptors on a nonlinear manifold embedding. Secondly, we describe a learning procedure that allows us to group repetitive texture elements while removing candidates presenting high levels of curvature-induced distortion. We demonstrate the effectiveness of our method on a set of images obtained from man-made texture surfaces undergoing a range of non-rigid deformations. Keywords: texture learning, non-rigid motion, texture classification, dynamic texture.
1 Introduction
In this paper, we investigate the problem of learning representative local texture components from repetitive textures. In particular, we focus on the specific case in which the learning stage is performed by observing a textured surface undergoing unknown free-form deformations. Figure 1 shows samples of such textured surfaces presenting a range of curvature deformations. Perceiving and modeling the appearance of repetitive textures are important visual tasks with a number of applications including surface tracking [8,14], texture classification [3,12], and texture synthesis [16,9]. However, obtaining accurate descriptions from non fronto-parallel texture fields is not a trivial problem
Fig. 1. Sample frames of sequences of surface textures under non-rigid deformation. Local planarity of texture primitives is often disturbed by abrupt changes in local surface curvature where the surface tends to curve into folds.
as the underlying texture appearance varies significantly with both the perspective geometry and the orientation of the observed surfaces [21,9,16,12]. While this appearance variation represents a rich source of information for shape-fromtexture methods [25,19], it is also a main source of problems for standard texture learning methods. Indeed, changes in local curvature produce nonlinear warping of some image regions. Consequently, texture descriptors evaluated on these warped image regions are likely to produce unreliable measurements. Our primary goal in this paper is to study the effects of surface curvature variation on local texture descriptors when the observed surface undergoes nonrigid deformations. In particular, we show how local texture appearance models can be estimated from deforming textured surfaces with high level of curvature distortion. The contribution of the study presented in this paper is twofold. First, we investigate the distribution of extracted affine invariant descriptors on a nonlinear manifold embedding. Here, we assume that the population of affine invariant descriptors lie on a lower dimensional manifold describing mainly variations in surface orientation and curvature. This lower dimensional manifold seems to describe the departure from local planarity of local affine invariant descriptors. Secondly, we describe a learning procedure that allows us to group repetitive texture elements while selecting the best set of candidates to represent the actual undistorted basic texture components. The remainder of this paper is organized as follows. Section 2 provides a survey of the related literature. In Section 3, we discuss the effects of curvature distortion on affine invariant texture measurements. Section 4 describes the details of our texture primitive selection method. The preliminary results of our
study are shown in Section 5. Finally, in Section 6, we present our conclusions and directions for future investigations.
2 Related Literature
Modeling texture appearance is usually the initial step in the solution of many texture-related problems including recognition [3,12], tracking [8,14], and synthesis [16,9]. Nevertheless, finding general representations for texture is a challenging problem. In fact, despite extensive research efforts by the computer vision community, there is no currently widely accepted method to model the complexity found in all available textures. State-of-the-art texture classification algorithms have successfully approached the texture representation problem by means of statistical descriptors. These descriptors can be obtained from measurements based on the response of convolution filters [13,24,26], image regions and pixel distributions [12], and frequency-domain measurements [7,2]. Most texture modeling methods are based on measurements obtained from planar fronto-parallel texture fields [13,24,26,7]. For example, Leung and Malik [13] introduced a filter bank-based descriptive model for textures that is capable of encoding the local appearance of both natural and synthetic textures. Since then, many extensions of this work have been proposed [24,5]. Indeed, these methods achieve high classification rates due to their ability to learn representative statistical models of each texture. However, it is unclear how they would perform on non-rigid deforming surfaces. There has been some recent research effort aimed at addressing the texture learning problem from non-rigid, non fronto-parallel texture images [21,9,16,12]. For example, Chetverikov and Foldvari [4] use a frequency-domain affineinvariant representation for local texture regions. Bhalerao and Wilson [2] also use a frequency-based affine-invariant descriptor for texture segmentation. Mikolajczyk et al. [17] describe a local affine frame measurement for wide-baseline stereo followed by a comparison of affine region detectors. Zhu et al. [26] present a multilevel generative image model for learning texture primitives based on Gabor-like filter measurements. Hayes et al. [9] address the problem of learning non-rigid texture deformations on regular and quasi-regular lattices. They propose a feature matching algorithm to discover lattices of near-regular textures in images. Their final goal is to synthesize fields of texture from a set of examples. The focus of this paper is on non-rigid deforming texture surfaces. Recent work by Lazebnik et al. [12] addresses the problem of learning texture models from images of non fronto-parallel texture surfaces. They propose a texture classification method based on learned texture components using affine-invariant descriptors. This approach is very effective under the assumptions of orthographic viewing and low-curvature surfaces. However, for surfaces with high levels of curvature deformation, the surface folds and bends will reduce the ability of affine-invariant descriptors to capture correct local texture representations. As a result, the learned appearance of basic texture components using this approach is likely to be less representative of the actual surface texture.
Depending on the curvature of the observed surfaces, the deformation of texture elements can present a significant degree of nonlinearity. Nonlinear manifold learning techniques such as Isomap [23], Local Linear Embedding (LLE) [20], and Laplacian Eigenmaps [1] are suitable candidates for the analysis of such deformations. For example, Souvenir and Pless [22] characterize deformations in magnetic resonance imaging. Nonlinear manifold learning is also a useful technique for the synthesis of dynamic textures. Liu et al. [15] approach the dynamic texture synthesis problem using nonlinear manifold learning and manifold traversing.
3 Curvature-Induced Distortion of Texture Elements
In this section, we investigate the effect of surface curvature on affine-invariant descriptors. The aim of this section is twofold. First, we show how changes in local curvature can reduce the quality of affine-invariant descriptors. Secondly, we analyze the distribution of local affine-invariant descriptors using a nonlinear manifold learning technique. We will represent the local texture appearance using the affine-invariant descriptor proposed in [12]. This descriptor is essentially a pixel gray-level intensity histogram calculated on a scale-invariant polar representation of an image subregion. It represents the radial frequency of normalized pixel intensities. The representation is normalized to minimize the effects of illumination. The polar representation of the pixel intensity maps rotations onto translations. The histogram representation that follows is translation invariant. This representation allows for full affine invariance when calculated at locations centered at image patches. Figure 2(a) illustrates the affine-invariant region extraction process, and is described in detail next. Single texture patch distortion. We begin by analyzing the influence of surface curvature on a single affine-invariant descriptor calculated on a small texture subregion. The selected region contains a single planar texture element from the textured surface. Here, we artificially project the extracted region on a cylindrical surface observed under perspective viewing. We proceed by bending the surface to increase the local curvature while using the standard Euclidean distance to measure the error between the affine-invariant descriptor of a planar patch and its curved versions. The plot in Figure 2(b) shows the Euclidean distance measuring the departure from the original planar patch for increasing levels of curvature deformation. The plot shows that the information in the affine-invariant descriptor remains almost constant when the local curvature is low. However, the difference from the planar patch increases significantly for medium and high surface curvatures. This behavior is surely expected as, for curved surfaces, texture affine-invariance is only valid for small planar regions. Distortion distribution of local affine-invariant descriptors. The previous analysis suggests that local texture affine invariance is not preserved if the texture region is deformed by the 3-D surface’s local curvature. We now would
Fig. 2. (a) affine-invariant region extraction. Elliptical regions centered at affineinvariant interest points are extracted from the image. The elliptical regions are normalized to a circular shape. A translation invariant radial histogram of pixel intensities is build to represent the extracted regions. (b) Deviation from the original texton with increase in curvature observed in the affine-invariant space.
like to turn our attention to the distribution of distortions of affine-invariant descriptors across a deforming surface. In this analysis, we commence by detecting a large set of interest points in a sequence of images of a non-rigid deforming surface. A sample image of the analyzed surface is shown in Figure 2(a). The set of interest point locations is obtained using the affine-invariant interest point detector proposed by Kadir and Brady [11]. This detector provides information about the affine scale of the neighborhood of each detected interest point. Once these image locations are at hand, we use the scale information provided by the detector to extract elliptical regions from the images. These regions are subsequently normalized to a standard rectangular size and represented by the rotation-invariant descriptor proposed by Lazebnik et al. [12] to remove the inherent rotational ambiguity. At this stage, we would like to point out the following: (a) we expect the distribution of local texture patches to form groups representing various basic texture components and their parts; (b) repetitive texture patterns can be compactly represented by statistical descriptors based on affine-invariant measurements. However, localization errors and incorrect scale axes are likely generate redundant groups in the distribution (i.e., non-centered similar patches will form separate groups); (c) the distribution of distorted texture elements can be assumed to lie on a low dimensional manifold modeling both patch location noise and curvature-induced distortions. We now use Isomap [23] to obtain an embedded representation of the lower dimensional nature of the distortion distribution presented by the subregion descriptors. Isomap manifold embedding preserves “geodesic” distances between data points. Figure 3 illustrates the two-dimensional manifold learned using Isomap for the local affine-invariant patch distribution. The figure also shows the embedded image subregions back-projected onto the image domain. We extracted a large number of affine-invariant descriptors from the sequence of images shown in the first row of Figure 1. The figure shows a highly dense cluster near the center of the plot containing mostly locally planar patches. On the other
Fig. 3. 2-D Isomap embedding of the affine-invariant field. Undistorted texture components form the large cluster close to the center of the plot. Nonlinearly deformed elements also tend to form clusters. Border points usually represent noise.
hand, nonlinearly deformed patches tend to group themselves into clusters with respect to their deformation similarity. Finally, occluded or distorted elements form relatively sparse groups with large within-class variation.
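To make the analysis above reproducible in outline, the following sketch builds a rotation-invariant intensity histogram in the spirit of the descriptor of [12] and embeds a population of such descriptors with Isomap. scikit-learn is an assumed dependency, the patches are random placeholders, and the bin counts and neighbourhood size are illustrative choices rather than the values used in this paper.

import numpy as np
from sklearn.manifold import Isomap   # assumed available; any Isomap implementation would do

def intensity_spin_descriptor(patch, n_dist=10, n_int=10):
    # A 2-D histogram of (distance from patch centre, normalised intensity),
    # computed on an already normalised grey-level region; rotation invariant by construction.
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - (h - 1) / 2.0, xx - (w - 1) / 2.0)
    r = r / (r.max() + 1e-12)
    p = (patch - patch.min()) / (np.ptp(patch) + 1e-12)   # simple illumination normalisation
    hist, _, _ = np.histogram2d(r.ravel(), p.ravel(), bins=(n_dist, n_int), range=((0, 1), (0, 1)))
    return (hist / (hist.sum() + 1e-12)).ravel()

# embed a population of descriptors on a low-dimensional manifold, as in Section 3
patches = [np.random.rand(32, 32) for _ in range(200)]        # placeholder patches
S = np.vstack([intensity_spin_descriptor(p) for p in patches])
X = Isomap(n_neighbors=8, n_components=2).fit_transform(S)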
4 Our Texture Component Learning Method
Our goal in this paper is to build an algorithm for learning compact representations of textures from images of surfaces undergoing non-rigid deformations. While this represents a significantly challenging capture setup, it also allows us to apply our method to more realistic imaging situations such as capturing the basic patterns from a piece of waving textured cloth, a moving animal’s natural skin pattern, and clothes worn by a moving person. In this section, we use some of the insights obtained from the analysis presented in the previous section to derive a algorithm for selecting a set of representative components of a texture pattern. The main steps of our texture learning algorithm are given as follows. Extraction of affine-invariant regions. The first step of our algorithm consists of extracting a large number of image subregions from a set of video frames of the observed surface. This step is subdivided into two main parts. First, we detect a large set of affine-invariant interest points on each image or video-frame.
We use the Kadir and Brady’s salient feature detector [11] as it provides information about the affine scale of image features. The salient feature detector outputs elliptical regions centered at each feature of interest. Subregions extracted using this detector can be normalized to a common scale invariant shape (e.g., a circle). The remaining rotation ambiguity can then be removed by representing the normalized subregions using the spin-image affine-invariant descriptor proposed in [12]. Accordingly, let S = (s1 , s2 , . . . , sN ) be the set of affine-invariant descriptors obtained by this step, where N is the total number of subregions. Manifold learning representation of affine descriptors. Our aim is to obtain a compact representation of the most significant repetitive patterns on the image. However, the nonlinear nature of the distortion present in the dataset does not allow for correct distance measurements in the original feature space. Additionally, spin-image descriptors carry a significant level of information redundancy. To accomplish a better description of the variation in the dataset, we assume that both the basic texture elements as well as their nonlinear deformations lie on a low-dimensional nonlinear manifold in which the two intrinsic dimensions of variability describe local surface orientation and curvature distortion. Based on this assumption, we perform Isomap [23] nonlinear manifold learning on the original distribution S . Isomap allows for a reduction of the dimensionality of the data while preserving the manifold’s intrinsic geometry, and is stable even for sparse sets of data points. The resulting reduced dimension dataset of extracted subregions is then given by X = (x1 , x2 , . . . , xN ). Learning representative texture components. The goal of this step is to determine the most representative classes of texture elements in X. We begin by modeling the distribution of affine-invariant K descriptors as a mixture of K Gaussian densities given by p(x|Θ) = i=1 αi pi (x|θi ), where x is an affine-invariant descriptor in the Isomap manifold space, αi represent the mix ing weights such that K α = 1, Θ represents the collection of parameters i=1 i (α1 , . . . , αK , θ1 , . . . , θK ), and pi is a multivariate Gaussian density function parameterized by θi = (μi , i ), such that: p(x|j) =
1 d 2
(2π) |Σj |1/2
1 T exp{− (x − μj ) Σj−1 (x − μj )} 2
(1)
where μj and Σj are the mean vector and covariance matrix of the texture element j, respectively. Each mixture component represents a set of texture descriptors of similar appearance on the image. The mixture of Gaussians model parameters can be estimated by the Expectation-Maximization (EM) algorithm [6]. A straightforward consequence of the mixture of Gaussians modeling is that texture descriptors closer to the mean vector in each class will present less curvature-induced distortion when compared to descriptors that are further away from the class mean. In order to obtain sharper representations of the learned texture components, we select a single descriptor from each cluster to
represent a basic texture component in the image. In other words, a set of texture components is selected as:

\tau_j = \arg\max_{x_i} p(x_i | j), \qquad j = 1, \ldots, K   (2)
A set of basic components is obtained by this process and is used to create a dictionary representation d = {τ_1, ..., τ_K}. However, the nonlinear nature of the surface distortion will compromise the representativeness of some of the learned mixture components. As a result, the learned clusters might not represent actual texture components but a geometrically warped version of them. Next, we propose a way to remove these non-representative elements from our dataset of learned texture primitives.
Ranking of learned texture components in the dictionary. Our main goal in this step is to distinguish between distributions of affine-transformed basic texture elements and their nonlinearly deformed counterparts. The nature of the nonlinear deformations is mostly anisotropic (i.e., directional appearance distortion). Consequently, we expect to have a relatively small number of data points belonging to clusters of non-affine distorted elements. Thus, distributions with low prior probability will most likely correspond to regions that were distorted by nonlinear transformations. Elements falling within such distributions can be safely discarded and therefore removed from the dictionary. The remaining distributions may represent two cases. The first case corresponds to classes of affine-transformed elements that are representative of the texture. The second case represents classes consisting of nonlinearly distorted regions. Our experiments have shown that distributions of nonlinearly distorted elements tend to have high within-class variation. Based on this observation, we rank the remaining dictionary elements based on the decreasing order of the within-class variation of their classes (i.e., the determinant of the covariance matrix for the class, |Σ_j|). Finally, the top-ranked elements are selected as the ones that represent classes of locally planar regions:

d = \{\tau_j\} \text{ such that } |\Sigma_j| \geq |\Sigma_{j+1}|   (3)
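The learning and ranking steps can be sketched as follows, assuming the embedded descriptors X from the Isomap step. scikit-learn's GaussianMixture stands in for the EM algorithm of [6]; the 60% prior-mass and 75% ranking thresholds mirror the values reported in Section 5 but are passed here as ordinary parameters.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture   # assumed stand-in for the EM algorithm of [6]

def learn_dictionary(X, K=12, keep_mass=0.60, keep_frac=0.75):
    # EM fit of the K-component mixture with diagonal covariances (as in Section 5)
    gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(X)
    # keep the highest-prior components until their mass reaches keep_mass
    order = np.argsort(gmm.weights_)[::-1]
    kept, mass = [], 0.0
    for j in order:
        kept.append(j)
        mass += gmm.weights_[j]
        if mass >= keep_mass:
            break
    # equation (2): one representative descriptor per kept component
    reps, dets = {}, {}
    for j in kept:
        lik = multivariate_normal(mean=gmm.means_[j], cov=np.diag(gmm.covariances_[j])).pdf(X)
        reps[j] = X[int(np.argmax(lik))]
        dets[j] = float(np.prod(gmm.covariances_[j]))   # |Sigma_j| for a diagonal covariance
    # equation (3): rank by |Sigma_j| in decreasing order and keep the top fraction
    ranked = sorted(kept, key=lambda j: dets[j], reverse=True)
    ranked = ranked[: max(1, int(keep_frac * len(ranked)))]
    return [reps[j] for j in ranked]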
5 Experimental Results
Our experiments were divided into two main parts. First, we evaluated our algorithm on video sequences of a number of deforming texture surfaces. The surfaces used in our experiments consisted of patterned fabrics bought from a local shop. To produce the deformations, we have waved and deformed the fabrics manually while recording the video sequences. Three of these patterns are shown in Figure 4. Secondly, we show qualitative results of extracted basic texture elements using a standard K-Means learning approach similar to the one proposed in [12]. We commenced by extracting a large set of subregions from a sample of frames of the video sequence. Our current method does not use any temporal information and a sparse set of frames was usually sufficient for our algorithm to work.
Fig. 4. Learning locally planar regions. Three distinct textures. Column (a) shows the three top-ranked learned texture elements (top to down). Column (b) shows the corresponding learned elements (in red color) mapped on a sample video frame.
We extracted approximately 2,000 local affine-invariant descriptors from the set of images of the fabric surfaces. The feature extraction stage was followed by a ten-dimensional Isomap embedding of the corresponding affine-invariant descriptors. In our experiments, the EM learning step was performed using diagonal covariance matrices. After the EM learning step was performed, the algorithm automatically selected the classes with the highest prior probabilities such that the sum of priors formed 60% of the total population. For every such a class, the basic texture element was selected using Equation 2. Finally, the ranking stage was performed to remove noisy components and rank the most representative ones. Here, the 75% top-ranked elements were selected by the program. These thresholds were chosen experimentally. We selected a single patch descriptor whenever the calculations produced no resulting components. Our results were consistent for different values of K for the EM algorithm. It is common that among the top-ranked texture primitives there will be several exemplars of the
Fig. 5. Texture components learned using the standard K-Means method. For every texture, the affine-invariant space was clustered using the K-Means clustering algorithm. The learned texture elements are shown on the right-hand side of each sample frame.
same element. Such a situation may occur due to imprecisions in the region detection process as well as due to illumination changes. Figure 4 shows the results for three types of patterned fabrics used in our experiments. Figure 4(a) shows (from top to down) the ranked sequence of learned texture components obtained by our method. Figure 4(b) shows three frame images with the learned locally planar texture components superimposed on the fabric’s surface. At this point, we would like to turn the reader’s attention to the third row in Figure 4. In the figure, the learned texture components represent a texture with flower-shaped elements as well as the first three top-ranked elements discovered by our method. While texture elements with large area are more likely to be distorted, some of their parts still may be locally planar. In the case of this texture, a part of a larger element received the second highest “planarity” rank. This suggests that our method is able to discover locally planar regions even in the presence of strong deformations by selecting those distributions that are most resistant to distortions. In the second part of our experiments, we provide a qualitative comparison between the results obtained by our algorithm and typical results obtained by clustering the affine-invariant feature space using the K-Means clustering algorithm. Figure 5 shows a sample of results from this experiment. Basically, we use the same feature extraction steps as in the previous experiment. However, the nonlinear manifold learning step was not performed, and the EM algorithm clustering was substituted by the K-Means. Additionally, there was no texture component selection procedure performed on the clustered results. The figure shows a sample image-frame of each texture with some of the extracted affine regions indicated by red ellipses. The learned texture components produced by this procedure are shown on the righ-hand side of each texture sample. In this case, the resulting learned texture components present significant levels of deformation when compared to the ones obtained by our method. These results show that our method is able to distinguish between locally planar texture elements and their nonlinearly distorted versions. For illustration purposes we selected only the first three top-ranked elements for every texture. Increasing the size of the dictionary of representative elements leads to classifying more regions as locally planar. However, lower ranked texture elements usually
come from distributions with large covariance and are likely to contain noisy data. The investigation of this trade-off is important and is left for future work.
6 Conclusions and Future Work
In this paper, we proposed a method for learning the basic texture primitives of patterned surfaces distorted by non-rigid motion. The proposed algorithm uses nonlinear manifold learning to capture the intrinsic dimensionality of curvature distortion on non-rigid deforming texture surfaces. A selection procedure for determining the most representative local texture components was presented. A qualitative comparison between our texture primitive learning method and a standard K-Means learning procedure was also presented. Our experiments demonstrated the effectiveness of our method on a set of images obtained from patterned fabric surfaces undergoing a range of non-rigid deformations. Interesting future directions include the further investigation of the effects of curvature on the local texture measurements as well as the introduction of both spatio-temporal information and inherent texture repetitiveness into the nonlinear manifold learning stage [10,18]. Extending the method to include color information is also a possibility. We plan to apply our ideas to the classification of deforming texture surfaces. Studies aimed at developing these ideas are in hand and will be reported in due course. Acknowledgments. This research was supported by U.S. Office of Naval Research under contract: N00014-05-1-0764.
References 1. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Neural Information and Processing Systems (NIPS), pp. 585– 591 (2001) 2. Bhalerao, A., Wilson, R.: Affine invariant image segmentation. In: British Machine Vision Conference (BMVC) (2004) 3. Carcassoni, M., Ribeiro, E., Hancock, E.R.: Texture recognition through modal analysis of spectral peak patterns. In: IEEE International Conference on Pattern Recognition (ICPR), vol. 1, p. 10243 (2002) 4. Chetverikov, D., Foldvari, Z.: Affine-invariant texture classification. In: IEEE International Conference on Pattern Recognition (ICPR), Washington, DC, p. 3901. IEEE Computer Society Press, Los Alamitos (2000) 5. Cula, O., Dana, K.J.: 3D texture recognition using bidirectional feature histograms. International Journal of Computer Vision 59, 33–60 (2004) 6. Dempster, A., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal Royal Stat. Soc., Series B 39(1), 1–38 (1977) 7. Do, M.N., Vetterli, M.: Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Transactions on Image Processing 11(2), 146–158 (2002)
8. Guskov, I., Klibanov, S., Bryant, B.: Trackable surfaces. In: ACM SIGGRAPH/Eurographics symposium on Computer animation, pp. 251–257. ACM Press, New York (2003) 9. Hays, J., Leordeanu, M., Efros, A.A., Liu, Y.: Discovering texture regularity as a higher-order correspondence problem. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 522–535. Springer, Heidelberg (2006) 10. Jenkins, O.C., Matari, M.J.: A spatio-temporal extension to isomap nonlinear dimension reduction. In: Int’l Conference on Machine Learning, vol. 56 (2004) 11. Kadir, T., Zisserman, A., Brady, M.: An affine invariant salient region detector. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 228–241. Springer, Heidelberg (2004) 12. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1265–1278 (2005) 13. Leung, T., Malik, J.: Recognising surfaces using three-dimensional textons. In: IEEE International Conference on Computer Vision, pp. 1010–1017. IEEE Computer Society Press, Los Alamitos (1999) 14. Lin, W.-C., Liu, Y.: Tracking dynamic near-regular textures under occlusion and rapid movements. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 15. Liu, C.B., Lin, R.S., Ahuja, N., Yang, M.H.: Dynamic textures synthesis as nonlinear manifold learning and traversing. In: British Machine Vision Conference (BMVC), pp. 859–868 (2006) 16. Liu, Y., Lin, W.-C., Hays, J.H.: Near regular texture analysis and manipulation. ACM Transactions on Graphics (SIGGRAPH 2004) 23(3), 368–376 (2004) 17. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65(1-2), 43–72 (2005) 18. Rahimi, A., Recht, B., Darrell, T.: Learning appearance manifolds from video. In: CVPR. IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 868–875 (2005) 19. Ribeiro, E., Hancock, E.: Shape from periodic texture using the eigenvectors of local affine distortion. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(12), 1459–1465 (2001) 20. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000) 21. Schaffalitzky, F., Zisserman, A.: Geometric grouping of repeated elements within images. In: Shape, Contour and Grouping in Computer Vision, pp. 165–181 (1999) 22. Souvenir, R., Pless, R.: Isomap and nonparametric models of image deformation. In: IEEE Workshop on Motion and Video Computing, vol. 2, pp. 195–200 (2005) 23. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000) 24. Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. International Journal of Computer Vision 62, 61–81 (2004) 25. White, R., Forsyth, D.A.: Retexturing single views using texture and shading. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 70–81. Springer, Heidelberg (2006) 26. Zhu, S.C., en Guo, C., Wang, Y., Xu, Z.: What are textons? International Journal of Computer Vision 62(1-2), 121–143 (2005)
An Approach for Extracting Illumination-Independent Texture Features
Rubén Muñiz and José Antonio Corrales
Department of Computer Science, University of Oviedo, Gijón, Spain
{rubenms,ja}@lsi.uniovi.es
Abstract. A common issue in many computer vision applications is the effect of the illumination conditions on the performance and reliability of the built system. In many cases the researchers have to face an extra problem: studying the environmental conditions of the facilities where the application will run, the lighting technology, and the wattage of the chosen lamps; nowadays we are moving to LED technology due to its increased life and absence of flicker, among other benefits. Nevertheless, it would be desirable to make the intelligent system more robust to changes in lighting conditions, as in the case of texture classification systems [1]. On such systems the effect of light changes on the measured features may eventually lead to texture misclassification and performance degradation. In this paper we present an approach that helps to overcome such problems when the light comes from a directional source, such as halogen projectors, LED arrays, etc.
1 Introduction to Texture Analysis and Classification
Texture classification is a key field in many computer vision applications, ranging from quality control to remote sensing. Briefly described, there is a finite number of texture classes we have to learn to recognize. In the first stage of the development of such systems, we extract useful information (features) from a set of digital images, known as the training set, containing the textures we are studying. Once this task has been done, we proceed to classify any unknown texture into one of the known classes. This process can be summarized into the following steps:
1. Image (texture) acquisition and preprocessing
2. Feature extraction
3. Feature selection (optional)
4. Classification
From the earliest approaches to the problem, grayscale images have been widely used, primarily due to acquisition hardware limitations and/or limited processing power. In the recent past, much effort has been made to develop new feature extraction algorithms (also known as texture analysis algorithms) to take advantage of the extra information contained in color images. On the
other hand, many classical grayscale algorithms have been extended to process color textures [2][3][4]. Texture analysis algorithms can be divided into statistical and spectral ones. The former methods extract a set of statistical properties from the spatial distribution of intensities of a texture. Common examples of this approach are the histogram method and the family of algorithms based on cooccurrence matrices [5][6]. The latter techniques, on the other hand, compute a number of features obtained from the analysis of the local spectrum of the texture. In the following sections we will give an overview of the two spectral methods (Gabor filters and Wavelets) we have used for our tests.
2 Gabor Filters
Gabor filters have been extensively used for texture classification and segmentation of grayscale and color textures [7][2][4][8][9]. These filters are optimally localized in both space and spatial frequency and allow us to get a set of filtered images which correspond to a specific scale and orientation component of the original texture. There are two major approaches to texture analysis using Gabor filters. First, one can look for specific narrowband filters to describe a given texture class, while the other option is to apply a bank of Gabor filters over the image and process its outputs to obtain the features that describe the texture class.
2.1 2D Gabor Filterbank
The Gabor filter bank used in this work is defined in the spatial domain as follows:

f_{mn}(x, y) = \frac{1}{2\pi\sigma_m^2} \exp\Big( -\frac{x^2 + y^2}{2\sigma_m^2} \Big) \times \cos\big( 2\pi(u_m x \cos\theta + u_m y \sin\theta) \big)   (1)

where m and n are the indexes for the scale and the orientation, respectively, for a given Gabor filter. Depending on these parameters, the texture will be analyzed (filtered) at a specific detail level and direction. The half-peak radial and orientation bandwidths [7] are defined as follows:

B_r = \log_2 \frac{2\pi\sigma_m u_m + \sqrt{2\ln 2}}{2\pi\sigma_m u_m - \sqrt{2\ln 2}}   (2)

B_\theta = 2\tan^{-1}\Big( \frac{\sqrt{2\ln 2}}{2\pi\sigma_m u_m} \Big)   (3)

As in [2], we define a filterbank with three scales and four orientations. The bandwidth B_θ is taken to be 40° in order to maximize the coverage of the frequency domain and minimize the overlap between the filters.
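A direct transcription of equation (1) into NumPy is shown below; the per-scale values of σ_m and u_m, and the kernel support, are illustrative assumptions and not the parameters used in the experiments.

import numpy as np

def gabor_kernel(sigma_m, u_m, theta, size=None):
    # Spatial-domain Gabor filter of equation (1).
    if size is None:
        size = int(np.ceil(6 * sigma_m)) | 1            # odd support covering ~3 sigma
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma_m ** 2)) / (2 * np.pi * sigma_m ** 2)
    carrier = np.cos(2 * np.pi * u_m * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier

# a 3-scale x 4-orientation bank, as in Section 2.1 (parameter values are illustrative only)
bank = [gabor_kernel(sigma_m=2.0 * 2 ** m, u_m=0.25 / 2 ** m, theta=np.deg2rad(45 * n))
        for m in range(3) for n in range(4)]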
Fig. 1. Gabor filter computed by f1,45◦
2.2 Gabor Features
To obtain texture features we must filter the texture images using the generated filters. This is achieved by convolution in the frequency domain (4), due to the size of the filters used. For each filtered image, we extract a single feature μ_{mn} which represents its energy, as shown below:

G_{mn}(x, y) = I(x, y) * f_{mn}   (4)

\mu_{mn} = \sum_{x,y} \big( G_{mn}(x, y) \big)^2   (5)
This approach is only valid when grayscale images are used. If we want to filter a color image, we have to preprocess it before this method can be applied. The most obvious solution for this problem is to transform the image by a weighted average of the three color bands:

I(x, y) = a R(x, y) + b G(x, y) + c B(x, y)   (6)
The coefficients a, b, c from (6) can be selected to properly model the human eye's perception of color, since the numbers of cones sensitive to red, green and blue light are not equal [10]. For this reason, an adequate choice of these weights can be a = 0.65, b = 0.33, c = 0.02. Using this transformation, different colors can give the same grayscale intensity, so color information is lost. To overcome this obstacle, (4) can be applied on each of the RGB color bands of the image to obtain unichrome features [2]. With this approach, we obtain a set of energies from each spectral band, so the information extracted from textures grows by a factor of three. Another disadvantage of this technique is that color information is not correlated because it is simply concatenated. A good idea to overcome this independence was proposed by Palm et al. in [4]. They convert the RGB image to HSV, discard the intensity value, and take the Hue and Saturation to form a complex number, which can be used to compute the convolution between the image and the Gabor filter by means of a complex FFT.
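For the grayscale case, the energy features of equations (4)-(6) can be sketched as follows; SciPy's FFT-based convolution stands in for the frequency-domain filtering mentioned above, and the kernels are assumed to come from a bank such as the one defined in Section 2.1.

import numpy as np
from scipy.signal import fftconvolve   # frequency-domain convolution, suited to large filters

def gabor_energies(image_rgb, kernels, weights=(0.65, 0.33, 0.02)):
    # Reduce the RGB image to one band with the weighted average of equation (6);
    # the weights default to the illustrative values quoted in the text.
    a, b, c = weights
    gray = a * image_rgb[..., 0] + b * image_rgb[..., 1] + c * image_rgb[..., 2]
    feats = []
    for f in kernels:
        g = fftconvolve(gray, f, mode="same")   # G_mn = I * f_mn, equation (4)
        feats.append(np.sum(g ** 2))            # mu_mn = sum over (x, y) of G_mn^2, equation (5)
    return np.array(feats)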
3 Wavelets
3.1 Introduction
The name ”Wavelets” was first introduced by Morlet, a French geophysicist, in the early 80’s. The kind of data he was studying could not be properly analyzed
by Fourier analysis, due to the fast change of their frequency contents. For this reason, he looked for a family of functions suitable for the analysis of that kind of signals, and he found the wavelets. A wavelet family is a set of functions derived from a single function with special features, named the mother wavelet, by means of two parameters a and b:

\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\Big( \frac{t - b}{a} \Big)   (7)

The parameter a represents the dilation (which is inversely proportional to frequency) and b the displacement (time localization). Wavelets are a rather complex subject and we would require a complete book [11] to deal with them. In the following lines we will show only the basics of this kind of analysis, focused on texture feature extraction.
3.2 2D Discrete Wavelet Transform and Multiscale Analysis
Wavelets allow us to study a signal at different levels of detail. In the case of 2D signals, this can be interpreted as analyzing the images at different resolutions or scales. For instance, if we take an image of 512 × 512 pixels, we can study its frequency content at 256 × 256, 128 × 128, 64 × 64 and so on. In the special case of images containing textures, this is of vital importance since a texture varies significantly depending on the distance we are looking at it. If we take a of the form a = 2^m, the transform is known as the dyadic DWT and relies on a specific form of wavelet derived from a smoothing or scaling function represented by θ(t). From both the scaling function and the wavelet, we can derive two FIR filters that can be used to compute the DWT. These filters are commonly named h and g and are, respectively, a low-pass and a high-pass filter. The 2D transform can be represented as shown in Figure 2, where ∗ denotes the convolution operator and the subscript for each filter indicates whether it is being applied over the rows or the columns of the image. Finally, ↓2 denotes downsampling of the rows or columns by a factor of two. Looking at Figure 2, we can see that the DWT in 2D produces four images as output. In each filtering step, a low-resolution image L_{j+1} and three detail images are produced. The detail images D_j^{1..3} contain the details (high-frequency components) extracted from L_j that are not present in L_{j+1}. This scheme can be applied recursively until a given depth is reached.
3.3 Wavelet Features
Detail images obtained by applying the 2D DWT can be used as a source for extracting texture features. Since those images contain essentially edge information at a specific direction (horizontal, vertical and diagonal), their energy is a very good texture feature. It is defined as follows:

E_{ij} = \sum_{x,y} \big( D_{ij}(x, y) \big)^2   (8)

where i = 1 ... 3 and j = 0 ... depth − 1.
Fig. 2. 2D DWT using FIR filters
As in the case of Gabor filters, we need some mechanism to be able to process color images, such as grayscale conversion or independent color-band feature extraction. In this particular case we can use a correlation measure [3]. This feature is named the wavelet covariance signature and is defined as follows:

C_{ij}^{B_k B_l} = \sum_{x,y} D_{ij}^{B_k}(x, y)\, D_{ij}^{B_l}(x, y)   (9)

where B_k and B_l represent color bands, and k, l = 1, 2, 3, k ≤ l.
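A compact sketch of equations (8) and (9) using PyWavelets, an assumed dependency (any 2D DWT implementation would do); the wavelet name and the decomposition depth are placeholders.

import numpy as np
import pywt   # PyWavelets, assumed available

def wavelet_energies(band, wavelet="db2", depth=3):
    # Detail energies of equation (8) for one spectral band.
    coeffs = pywt.wavedec2(band, wavelet, level=depth)
    # coeffs = [cA_depth, (cH, cV, cD)_depth, ..., (cH, cV, cD)_1]
    return np.array([np.sum(d ** 2) for level in coeffs[1:] for d in level])

def covariance_signatures(image_rgb, wavelet="db2", depth=3):
    # Wavelet covariance signatures of equation (9): cross-band products of the
    # detail images, summed over space, for k <= l.
    details = []
    for k in range(3):
        coeffs = pywt.wavedec2(image_rgb[..., k], wavelet, level=depth)
        details.append([d for level in coeffs[1:] for d in level])
    feats = []
    for k in range(3):
        for l in range(k, 3):
            feats += [np.sum(dk * dl) for dk, dl in zip(details[k], details[l])]
    return np.array(feats)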
4 Band Ratioing
In the previous sections, we have given a brief introduction to two of the most commonly used feature extraction algorithms, and we have seen the way many authors are extending them to process color textures. We have previously presented an alternative approach based on band ratioing [12][13], but in this paper we will focus on the application of this technique to illumination-independent texture analysis.
4.1 Introduction
Band ratioing is an enhancement technique mainly used in the field of remote sensing. It is usually applied to process LANDSAT TM images¹ to enhance details such as vegetation, grass, soil, etc. It is defined as follows:

I(x, y) = \frac{B_1(x, y)}{B_2(x, y)}   (10)
where B1 (x, y) and B2 (x, y) are two different spectral bands of the color image. Its computation is extremely easy, but the bands involved must be processed to avoid the case when B2 (x, y) = 0. To accomplish this, we only have to increase every pixel from both bands by 1. Theoretically, ratios will be in the interval (0,256], but in practice most values will be rather small. For this reason, it is 1
http://rst.gsfc.nasa.gov/Sect1/Sect1 15.html
98
R. Mu˜ niz and J.A. Corrales
advisable to use logarithm compression to enhance small ratios over larger ones, so (10) can be rewritten as follows. B1 (x, y) . (11) I (x, y) = log B2 (x, y) It can be easily seen that this technique tends to enhance what is different in two spectral bands, and as it will be seen in the following section, its output is suitable for feature extraction.
Fig. 3. Ratio R/B and filtered ratio of Fabric0
4.2
Feature Extraction from Rationed Color Textures
In the previous section, we saw that Band Ratioing enhances what is different in two color bands. If a pixel contains a grayscale value (R = G = B), its ratio will be 1, but if at least two color components are not equal, the band ratio will encode the color information in a single value. This is very interesting for feature extraction from color textures, since we can directly use any grayscale feature extraction method available. In the following lines, we will show the way to enhance the Gabor filtering method using band ratioing. To apply Gabor filtering on a rationed image, we can combine (11) and (4) to get the following expression: B1 (x, y) Gmn (x, y) = log ∗ fmn . (12) B2 (x, y) Note that (12) directly convolves the band ratios with the Gabor filter, so it is not necessary to scale the ratios to fit in a byte value. In Fig. 3 we can see a band ratio of a color texture taken from the VisTex database2 and the result of filtering the ratio using a Gabor filter with parameters m = 1, n = 45◦ . A very interesting topic is about the implementation of the previous scheme. When we compute a band ratio, the operands are both eight-bit numbers, i.e, 2
http://www-white.media.mit.edu/vismod/imagery/VisionTexture/vistex.html
An Approach for Extracting Illumination-Independent Texture Features
99
R G
R/B
*
=
B Fig. 4. Gabor filtering of rationed images
ranging from 0 . . . 255, but the result will be a real number. There are two different ways to deal with this value: 1. Adjust the result to fit in a byte value (some information is lost). 2. Use the real ratio directly as input to the FFT. It is easy to see that the second approach is much better since no information is lost. In our experiments, we have observed a 10% performance increase compared to the case when byte values are used. Nevertheless, this option is only applicable to feature extraction algorithms that make use of the FFT, since we can not use real numbers to compute cooccurrence matrices, for example. 4.3
Illumination-Independent Texture Analysis
If we suppose that a spectral band b1 from an image fb1 (x, y) has been formed according to Lambert’s reflectivity model [14] then we can say that: fb1 (x, y) = Ib1 ∗ Rb1 ∗ cos Θ .
(13)
where Ib1 represents the intensity of the light source on the surface of the texture for the spectral band b1 and Rb1 the reflectivity coefficient of that surface. For another band b2 , the previous equation would be rewritten as: fb2 (x, y) = Ib2 ∗ Rb2 ∗ cos Θ .
(14)
The band ratio of bands b1 and b2 is defined as: BR1 =
Rb fb1 Ib Rb cos Θ =k 1 . = 1 1 fb2 Ib2 Rb2 cos Θ Rb2
(15)
If the lighting components Ib1 and Ib2 are modified we will get new ones, Ib1 and Ib2 . Then, the equation 15 would be rewritten as follows:
BR2 =
fb1 Ib 1 Rb1 cos Θ Rb = k 1 . = fb2 Ib 2 Rb2 cos Θ Rb2
(16)
If we want to calculate the difference between the rationed images BR1 and BR2 , we only have to subtract equation 15 from equation 16: BR1 − BR2 = k
Rb Rb1 Rb − k 1 = (k − k ) 1 . Rb2 Rb2 Rb2
(17)
100
R. Mu˜ niz and J.A. Corrales
If the variations from the light source are proportional to both bands we can state that k = k , then equation 17 could be rewritten as: BR1 − BR2 = k
Rb1 Rb Rb − k 1 = (k − k) 1 = 0 . Rb2 Rb2 Rb2
(18)
From the previous result we can state that if a surface can be expressed or modelled according to Lambert’s reflectivity model and if the lighting variations are proportional between bands, then the rationed images do not experiment changes. Obviously, the assumptions taken before are not always true in real-world applications, but our experiments have shown us that this approach performs very well if we compare it with regular texture analysis.
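A quick numerical check of this invariance under the stated assumptions (Lambertian formation and proportional lighting changes), with synthetic values of our own choosing:

import numpy as np

rng = np.random.default_rng(0)
R_b1 = rng.uniform(0.1, 1.0, size=(32, 32))     # reflectivity of the surface, band b1
R_b2 = rng.uniform(0.1, 1.0, size=(32, 32))     # reflectivity of the surface, band b2
cos_theta = 0.8

I_b1, I_b2 = 120.0, 90.0                        # original lighting intensities
scale = 0.4                                     # proportional dimming of both bands (k = k')

f_b1, f_b2 = I_b1 * R_b1 * cos_theta, I_b2 * R_b2 * cos_theta
g_b1, g_b2 = scale * I_b1 * R_b1 * cos_theta, scale * I_b2 * R_b2 * cos_theta

BR1, BR2 = f_b1 / f_b2, g_b1 / g_b2
print(np.max(np.abs(BR1 - BR2)))                # ~1e-16: the rationed images are unchanged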
5 Experiments
To analyze the performance of our technique we have taken shots of 6 real-world textures (Fig. 5), food and fabrics. We have changed the lighting conditions of the whole scene by operating a light regulator attached to a 150-watt halogen projector. The images have a resolution of 512×512 pixels each, and they are organized in 6 different sets representing 6 different illumination conditions, numbered from 0 (brightest) to 5 (darkest). We have performed two major tests:

1. For this experiment, we built a training set from the textures at lighting level 0 and used the remaining images for testing (levels 1..5).
2. In this case, the training set is composed of the textures at lighting level 5 and the remaining images are used for testing (levels 0..4).

These tests allow us to evaluate the performance of the system using images from the left end and the right end for training, i.e., the brightest and the darkest ones. We have computed both Gabor and Wavelet features from the training and testing sets. For this purpose the images were divided into 64 disjoint subimages of 64×64 pixels.

Fig. 5. Real-world textures taken under different lighting conditions

Fig. 6. Test 1 using Gabor features (classification rate vs. illumination level 1..5; curves: Grayscale, R+G+B, R/G & R/B ratios, R/G, R/B & G/B ratios, HS features)

Fig. 7. Test 2 using Gabor features (same curves as Fig. 6)
Fig. 8. Test 1 using Wavelet features (classification rate vs. illumination level; curves: Grayscale, R+G+B, R/G & R/B ratios, R/G, R/B & G/B ratios, Correlation signatures)

Fig. 9. Test 2 using Wavelet features (same curves as Fig. 8)
We have performed a number of experiments, applying different preprocessing techniques to be able to process color images. The charts shown in the next section gather the classifier performance in all cases.
6 Results
For evaluating the performance of the feature sets obtained in each case, we have used a k-NN classifier, taking K = 5. To measure the distance between two feature vectors in ℝⁿ, the 1-norm is used.
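A minimal sketch of such a classifier (K = 5, 1-norm distance); the majority-vote decision rule is not spelled out in the paper, so treat it as our assumption.

import numpy as np
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=5):
    """k-NN with the 1-norm (city-block) distance between feature vectors."""
    dists = np.abs(train_feats - query).sum(axis=1)   # L1 distances to all training samples
    nearest = np.argsort(dists)[:k]                   # indices of the K closest samples
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                 # majority vote (assumed)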
6.1 Gabor Results
Gabor features were extracted by filtering the textures at 3 scales and 4 different orientations. This gives us a total of 12 Gabor energies for each texture sample.
In Figs. 6 and 7 we have collected the results of these tests. The poor behaviour of the grayscale conversion can be noticed, as well as that of the features obtained from the concatenation of the RGB color bands, whereas the features computed from rationed images show a more stable performance, represented as an almost straight line. In addition, the HS features also perform very well, but only on the first test; on the second one they behave as badly as the others.
6.2 Wavelet Results
For Wavelet features we have set the analysis depth at 4, which produces a total of 12 features for each texture sample. The wavelet functions used for this analysis were the biorthogonal Wavelets Bior6.8 available in MATLAB v6.5. If we take a look at the charts shown in Figs. 8 and 9, we can see that only the features computed from rationed images perform stably. The other ones show clear performance degradation on both tests.
7 Conclusions
In this paper we have presented a new technique to overcome the limitations of the texture classification systems when the lighting is not stable. We have shown that if the surfaces can be modelled or approximated according to the Lambertian model our method performs very well, compared to the others. The band ratioing preprocessing technique has shown its benefits when applied to texture analysis and classification, especially in terms of classifier hit rate and computational efficiency.
References
[1] Drbohlav, O., Chantler, M.: Illumination-invariant texture classification using single training images. In: Texture 2005: Proceedings of the 4th International Workshop on Texture Analysis and Synthesis, pp. 31–36 (2005)
[2] Jain, A., Healy, G.: A multiscale representation including opponent color features for texture recognition. IEEE Transactions on Image Processing 7(1), 124–128 (1998)
[3] de Wouver, G.V.: Wavelets for Multiscale Texture Analysis. PhD thesis, University of Antwerp, Belgium (1998)
[4] Palm, C., Keysers, D., Lehmann, T., Spitzer, K.: Gabor filtering of complex hue/saturation images for color texture classification. In: JCIS 2000: Proceedings of the 5th Joint Conference on Information Science, Atlantic City, vol. 2, pp. 45–49 (2000)
[5] Haralick, R., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics 3(6), 610–621 (1973)
[6] Conners, R., McMillin, C.: Identifying and locating surface defects in wood: Part of an automated lumber processing system. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 573–584 (1983)
[7] Bovik, A., Clark, M.: Multichannel texture analysis using localized spatial filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(1), 55–73 (1990)
[8] Kruizinga, P., Petkov, N., Grigorescu, S.: Comparison of texture features based on Gabor filters. In: Proceedings of the 10th International Conference on Image Analysis and Processing, Venice, Italy, pp. 142–147 (1999)
[9] Dunn, D., Higgings, W., Wakeley, J.: Texture segmentation using 2-D Gabor elementary functions. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 130–149 (1994)
[10] Hecht, E.: Optics. Addison-Wesley, Reading (1987)
[11] Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1999)
[12] Muñiz, R., Corrales, J.A.: Use of band ratioing for color texture classification. In: Perales, F.J., Campilho, A., Pérez, N., Sanfeliu, A. (eds.) IbPRIA 2003. LNCS, vol. 2652, pp. 606–615. Springer, Heidelberg (2003)
[13] Muñiz, R., Corrales, J.A.: Novel techniques for color texture classification. In: Arabnia, H.R. (ed.) IPCV, pp. 114–120. CSREA Press (2006)
[14] Lambert, J.H.: Photometria sive de mensure de gratibus luminis, colorum umbrae. Eberhard Klett (1760)
Image Decomposition and Reconstruction Using Single Sided Complex Gabor Wavelets

Reza Fazel-Rezai and Witold Kinsner

Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, Canada R3T 5V6
{fazel, kinsner}@ee.umanitoba.ca
Abstract. This paper presents a scheme for image decomposition and reconstruction, using complex Gabor wavelets. Gabor functions have been used extensively in areas related to the human visual system due to their localization in space and bandlimited properties. However, since the standard two-sided Gabor functions are not orthogonal and lead to nearly singular Gabor matrices, they have been used in the decomposition and feature extraction of images rather than in image reconstruction. In an attempt to reduce the singularity of the Gabor matrix and produce reliable image reconstruction, in this paper, we show that a single-sided Gabor function can accomplish both, with the reconstruction residual error being very small (PSNR of at least 300 dB). Keywords: Image Decomposition, Image Reconstruction, Complex Wavelets, Gabor Wavelets.
1 Introduction

Since its first formulation in 1984 [1], the wavelet transform has become a common tool in signal processing, in that it describes a signal at different levels of detail in a compact and readily interpretable form [2]. Wavelet theory provides a unified framework for a number of techniques which had been developed independently for various signal processing applications [3]. Different views of signal theory include multiresolution signal processing as used in computer vision, subband coding as developed for speech and image compression, and wavelet series expansion as developed in applied mathematics [4]. In general, wavelets can be categorized in two types: real-valued and complex-valued. There are some benefits in using complex-valued wavelets. The Gabor wavelet is one of the most widely used complex wavelets. More than half a century ago, Gabor [5][6] developed a system for reducing the bandwidth required to transmit signals. Since then, the Gabor function has been used in different areas of research such as image texture analysis [7][8], image segmentation [9][10], motion estimation [11], image analysis [12], signal processing [13][14], and face authentication [15]. It should be noted that most of those areas rely on analysis and feature extraction, and not reconstruction. In 1977, Cowan proposed that since visual mechanisms are indeed effectively bandlimited and localized in space, Gabor functions are suitable for their
representation [16]. Other studies [17][18][19][20] assert that Gabor functions also well represent the characteristics of simple cortical cells, and present a viable model for such cells. Investigating "what does eye see best", Watson et al. have demonstrated that the pattern of two-dimensional Gabor functions is optimal [21]. The main difficulty with the Gabor functions is that they are not orthogonal. Therefore, no straightforward technique is available to extract the coefficients. This paper presents an analytical approach to overcome the difficulty with Gabor functions, and demonstrates their usefulness in the decomposition and reconstruction of still images.
2 The Gabor Decomposition

2.1 Notation

A one-dimensional (1D) signal u = [u1, u2, ..., uN]ᵀ is considered as an N-element complex column vector. Such signals are also considered as N-periodic sequences over the integers Z. If the kth coordinate of u is expressed as either uk or u(k), we have
u(i + kN) = u(i)    (1)
The norm of u is defined as the Euclidean norm
‖u‖ = ( Σ_{i=1}^{N} |ui|² )^{1/2}    (2)
in order to relate to the energy of the signal. The inner product of two signals u and v is defined as
⟨u, v⟩ = Σ_{i=1}^{N} ui vi*    (3)
where vi* denotes the conjugate of vi.

2.2 The Matrix Form of the Gabor Transform
For a given 1D signal s , a Gabor expansion (also known as the Gabor wavelet series) is expressed in terms of a Gabor function (mother wavelet) through time and frequency, that is
s(k) = Σ_{i=1}^{N} gi(k) c(i)    for k = 1, 2, ..., M; N ≤ M    (4)
where g i is the Gabor elementary function as discussed in Sec. 2.4 and c( i ) is the coefficient. In a matrix notation, the above expansion can be written as
s = Gc    (5)
where the signal vector ( s ), the Gabor matrix ( G ) and the coefficient vector ( c ) are given by
s = [ s(1), s(2), …, s(M) ]ᵀ    (6)

G =
⎡ g1(1)  g2(1)  ⋯  gN(1) ⎤
⎢ g1(2)  g2(2)  ⋯  gN(2) ⎥
⎢   ⋮       ⋮       ⋱      ⋮    ⎥
⎣ g1(M)  g2(M)  ⋯  gN(M) ⎦    (7)

and

c = [ c(1), c(2), …, c(N) ]ᵀ    (8)
In a two-dimensional (2D) case, the expansion of a 2D signal (for example, an image) into a set of 2D elementary functions is given by
S(k1, k2) = Σ_{i=1}^{N1} Σ_{j=1}^{N2} gij(k1, k2) C(i, j)    for k1 = 1, …, M1; k2 = 1, …, M2;  N1 ≤ M1, N2 ≤ M2    (9)
In a matrix notation, if the 2D elementary functions are separable, the 2D expansion can be written as

S = G1 C G2ᵀ    (10)
where S is the 2D data matrix, C is the 2D expansion coefficient matrix
S =
⎡ S(1,1)    S(1,2)    ⋯  S(1,M2)  ⎤
⎢ S(2,1)    S(2,2)    ⋯  S(2,M2)  ⎥
⎢   ⋮          ⋮          ⋱       ⋮      ⎥
⎣ S(M1,1)  S(M1,2)  ⋯  S(M1,M2) ⎦    (11)
C =
⎡ C(1,1)    C(1,2)    ⋯  C(1,N2)  ⎤
⎢ C(2,1)    C(2,2)    ⋯  C(2,N2)  ⎥
⎢   ⋮          ⋮          ⋱       ⋮      ⎥
⎣ C(N1,1)  C(N1,2)  ⋯  C(N1,N2) ⎦    (12)
and Gi is the same Gabor matrix as defined in Eq. (7).

2.3 Rotation and Modulation Operators
For any signal u, the signal v is said to be the cyclic rotation of u with rotation number r if the following holds
v = ℜr(u) = [ u_{N−r+1}, u_{N−r+2}, …, u_N, u_1, u_2, …, u_{N−r} ]ᵀ    (13)
where ℜr is the rotation operator with rotation number r. The modulation (or frequency translation) of a signal u is the signal w defined as
w = ℑk(u) = [ u1, u2 e^{−2πik/N}, u3 e^{−4πik/N}, u4 e^{−6πik/N}, …, uN e^{−2(N−1)πik/N} ]ᵀ    (14)
where ℑk is the modulation operator with frequency k . It can be shown that for every signal u and any value for l
F(ℜl(u)) = ℑl(F(u))    (15)

and

F(ℑ−l(u)) = ℜl(F(u))    (16)
where F denotes the Fourier transform operator defined as
F(u)(k) = (1/N) Σ_{n=0}^{N−1} u(n) e^{−2πi nk/N}    (17)
Based on these two operators, the Gabor matrix can be rewritten into a more convenient form as described in the next section.
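Before building the Gabor matrix, the two operators and the commutation property of Eqs. (15)-(17) can be checked numerically. The sketch below is our own and uses zero-based indexing (slightly different from the paper's one-based vectors) together with NumPy's FFT:

import numpy as np

def rotate(u, r):
    """Cyclic rotation operator R_r of Eq. (13): shift the signal by r samples."""
    return np.roll(u, r)

def modulate(u, k):
    """Modulation operator I_k of Eq. (14): multiply sample n by exp(-2*pi*i*n*k/N)."""
    n = np.arange(len(u))
    return u * np.exp(-2j * np.pi * n * k / len(u))

N, l = 16, 3
u = np.random.randn(N) + 1j * np.random.randn(N)
lhs = np.fft.fft(rotate(u, l)) / N        # F(R_l(u)), with the 1/N normalization of Eq. (17)
rhs = modulate(np.fft.fft(u) / N, l)      # I_l(F(u))
print(np.allclose(lhs, rhs))              # True, i.e., Eq. (15) holds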
2.4 The Gabor Matrix

Consider the Gabor mother function, g, as defined in the following form
gσ(n) = e^{−π (n/σ)²}    (18)
where the parameter σ determines the scale of the Gaussian in time domain. The Gabor matrix G with decomposition levels L is then defined as
G(L) = [ gσ ; ℜL(gσ) ; ℜ2L(gσ) ; … ; ℜN−1(gσ) ;
         ℑ1(gσ) ; ℑ1(ℜL(gσ)) ; ℑ1(ℜ2L(gσ)) ; … ; ℑ1(ℜN−1(gσ)) ;
         … ;
         ℑL−1(gσ) ; ℑL−1(ℜL(gσ)) ; ℑL−1(ℜ2L(gσ)) ; … ; ℑL−1(ℜN−1(gσ)) ]    (19)
where N is the size of the vector s to be expanded using the Gabor matrix. It should be noted that the size of matrix G is M × N , where the size of Gabor mother
function gσ is M . If N < M , the relation in Eq. (5) cannot be satisfied. Therefore, a criterion should be defined to find the best solution for the relation in Eq. (5). The most commonly used approach is the least-squares error criterion. Therefore, a 1D solution to Eq. (5) is the vector c = cˆ that minimizes the objective function
(Gc − s)ᵀ(Gc − s), which can be found as

ĉ = (GᵀG)⁻¹ Gᵀ s    (20)
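A small numerical sketch of Eqs. (5) and (20): a toy Gabor-like matrix is assembled from a Gaussian mother function by rotation and modulation, a signal is synthesized, and the coefficients are recovered by least squares. The sizes, σ and the column layout are arbitrary choices of ours, not the authors' configuration.

import numpy as np

M, N, L, sigma = 64, 32, 4, 8.0
n = np.arange(M)
g = np.exp(-np.pi * (n / sigma) ** 2)            # Gabor mother function, Eq. (18)

cols = []                                        # columns: rotated and modulated copies of g
for freq in range(L):
    for shift in range(0, M, M // (N // L)):
        cols.append(np.roll(g, shift) * np.exp(-2j * np.pi * n * freq / M))
G = np.array(cols[:N]).T                         # an M x N Gabor-like matrix (cf. Eq. 19)

c_true = np.random.randn(N) + 1j * np.random.randn(N)
s = G @ c_true                                   # synthesize a signal, Eq. (5)

# Least-squares estimate of Eq. (20); numpy's lstsq uses an SVD internally,
# which copes with a nearly singular G^T G (see Sec. 2.5)
c_hat, *_ = np.linalg.lstsq(G, s, rcond=None)
print(np.allclose(G @ c_hat, s))                 # the synthesized signal is recovered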
2.5 Singularity
The N × N matrix GᵀG cannot always be inverted as it can be singular or nearly singular. To avoid the problem of matrix inversion, the singular value decomposition (SVD) can be used. The SVD states that any M × N matrix A can be decomposed as the product of an M × N column-orthogonal matrix U, an N × N diagonal matrix W with positive or zero elements, and the transpose of an N × N matrix V, which can be written as
A = U W Vᵀ    (21)
with UᵀU = VᵀV = I. Then the inverse of A is defined as
A⁻¹ = V [diag(1/wi)] Uᵀ    (22)
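In code, this pseudo-inverse can be formed directly from the SVD. The sketch below (our own) also includes the zeroing of reciprocals of (near-)zero singular values anticipated later in the text, with a tolerance of our choosing:

import numpy as np

def svd_pinv(A, tol=1e-10):
    """Inverse of Eq. (22) built from the decomposition of Eq. (21); reciprocals
    of tiny singular values are replaced by zero, as discussed in the text."""
    U, w, Vt = np.linalg.svd(A, full_matrices=False)
    w_inv = np.where(w > tol * w.max(), 1.0 / w, 0.0)
    return Vt.T @ np.diag(w_inv) @ U.T

A = np.random.randn(6, 4) @ np.random.randn(4, 4)    # a possibly ill-conditioned matrix
x = svd_pinv(A) @ np.random.randn(6)                 # least-squares solution of A x = b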
In the 2D case, the solution to Eq. (10) is given by
Ĉ = (G1ᵀG1)⁻¹ G1ᵀ · S · G2 (G2ᵀG2)⁻¹    (23)
The SVD can also be applied to this case, and the solution is given by

Ĉ = V1 [diag(1/wi1)] U1ᵀ · S · U2 [diag(1/wi2)] V2ᵀ    (24)
If the matrix A is either singular or nearly singular, the diagonal matrix W will contain zero or very small values on its diagonal. Therefore, to invert the matrix, the correspondingly very large elements of diag(1/wi) should be replaced with zeros. However, this approximation affects the reconstructed results, and produces low values of the peak signal-to-noise ratio (PSNR) for the reconstructed images. In the next section, we introduce an approach to overcome this problem.

2.6 The Single-Sided Gabor Reconstruction
A typical Gabor function and its corresponding modulated signal are shown in Fig. 1.a and Fig.1.b. Since the Gabor matrix G developed from such a two-sided mother
Gabor function is very close to singular (as described in the previous section), a single-sided Gabor function (shown in Fig. 1.c and Fig. 1.d) was used to force the Gabor matrix to be no longer singular. Therefore, the single-sided Gabor function can be used for the decomposition and reconstruction of images to overcome the singularity problem that exists in the two-sided Gabor wavelet decomposition and reconstruction.
3 Experimental Results

The single-sided Gabor decomposition and reconstruction algorithms for images have been implemented in MATLAB. Two test images have been used, including Lena and a model plane (Fig. 2). The second test image is a computer generated image and was chosen because it has straight-line sharp edges which are not vertical and diagonal.
Fig. 1. Two-sided Gabor function: (a) mother function, (b) modulated function; single-sided Gabor function: (c) mother function, (d) modulated function
Fig. 2. Test images. (a) Lena as test image #1. (b) Model airplane as test image #2.
The Gabor matrix G (Eq. 19) has real and imaginary components. Both depend on the rotation number (L). The real and imaginary parts of the Gabor matrix G for gray-level images with a two-level decomposition (L = 2), four-level decomposition (L = 4) and eight-level decomposition (L = 8) are shown in Fig. 3. Note that the values have been scaled between 0 and 255. It can be seen from these figures that the number of white bands corresponds to the rotation number (L). It can be shown that this number determines the number of subimages in the decomposition.
Fig. 3. The real and imaginary structure of the Gabor matrix G, with L = 2, L = 4, and L = 8
Considering that the Gabor matrix (G) multiplies the image matrix from the left and right (Eq. 10), one would expect to have a result with four subimages, each similar to the original image. The results of applying the single-sided Gabor decomposition and reconstruction technique to the images are presented next. In the case of L = 2, the real and imaginary parts of the Gabor decomposition of the two test images are shown in Fig. 4. As expected, each of the real and imaginary images has four subimages. The corresponding reconstructed images using Eq. (23) are shown in Fig. 4. A very high PSNR for the reconstructed images was achieved, with 313.01 dB and 311.83 dB for test image #1 and test image #2, respectively. The PSNR was calculated as

PSNR = 10 log( 255² / σd² )    (25)

where σd² is the mean square error defined using the following equation:

σd² = (1/N) Σ_{n=1}^{N} ( S(n) − Ŝ(n) )²    (26)
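A direct transcription of Eqs. (25)-(26), with a base-10 logarithm and an 8-bit peak value of 255 assumed:

import numpy as np

def psnr(original, reconstructed):
    """PSNR of Eq. (25) with the mean square error of Eq. (26)."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    sigma_d2 = np.mean(diff ** 2)                 # sigma_d^2 of Eq. (26)
    return 10.0 * np.log10(255.0 ** 2 / sigma_d2) # Eq. (25)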
The corresponding residual errors shown in Fig. 4 also indicate a very good reconstruction in that there are no visible patterns that would correspond to the original images.
Fig. 4. Gabor decomposition of the test images with L = 2 (panels: real part of the decomposed image, imaginary part of the decomposed image, reconstructed image, residual error)
The remaining decomposition and reconstruction results are shown in Fig. 5 (for L = 4) and Fig. 6 (for L = 8). Again, residual errors appear to be randomly distributed, without displaying contours of the objects in the original images. This
Fig. 5. Gabor decomposition of the test images with L = 4 (panels as in Fig. 4)
Fig. 6. Gabor decomposition of the test images with L = 8 (panels as in Fig. 4)
confirms the suitability of the Gabor transform for decomposition and reconstruction of images for the human visual system (HVS).
4 Conclusions

There are some benefits in using complex-valued wavelets. It has been shown that since visual mechanisms are indeed effectively bandlimited and localized in space, Gabor functions are suitable for their representation. Despite all the good features of the Gabor functions, their main inconvenience is the fact that they are not orthogonal. Therefore, no straightforward method is available to extract the coefficients. In this paper, an analytical method for still image decomposition and reconstruction using single-sided Gabor functions has been presented. Experimental results have shown that one can achieve perfect reconstruction with PSNR of more than 300 dB.

Acknowledgments. We acknowledge financial support from the Telecommunication Research Laboratories (TRLabs) and Mecca Media Group, Edmonton, Canada, and the Natural Sciences and Engineering Research Council (NSERC) of Canada.
References 1. Grossmann, A., Morlet, J.: Decomposition of Hardy Functions into Square Integrable Wavelets of Constant Shape. SIAM Journal on Mathematical Analysis 15, 723–736 (2006) 2. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series In Applied Mathematics (1992) 3. Rioul, O., Vetterli, M.: Wavelets and signal processing. IEEE Signal Processing Magazine, pp. 15–38. IEEE Computer Society Press, Los Alamitos (1991) 4. Mallat, S.G.: A wavelet tour of signal processing. Academic Press, London (1999) 5. Gabor, D.: Theory of communication. J. of Industrial Electrical Engineering (London) 93, 445 (1946) 6. Gabor, D.: New possibilities in speech transmission. J. of Industrial Electrical Engineering (London) 94, 369 (1947) 7. Porat, M., Zeevi, Y.Y.: Localized texture processing in vision: analysis and synthesis in the Gaborian space. IEEE Trans. on Biomed. Engineering. 36, 115–129 (1989) 8. du Buf, J.M.H., Heitkamper, P.: Texture features based on Gabor phase. Signal Processing 23, 227–244 (1991) 9. Billings, A.R., Scolaro, A.: The Gabor compression-expansion system using non-Gaussian windows and its application to television coding. IEEE Trans. on Information Theory 22, 174–190 (1976) 10. Wiskoot, L.: Segmentation from motion: combining Gabor and Mallat wavelets to overcome aperture and correspondence problem. Internal Report 96-10, Institut für Neroinformatik (1996) 11. Magarey, J., Kingsbury, N.: Motion estimation using a complex-valued wavelet transform. IEEE Trans. on Signal Processing 46, 1069–1084 (1998) 12. Daugman, J.G.: Complete Discrete 2-D Gabor transform by neural networks for image analysis and compression. IEEE Trans. on Acoustics, Speech, and Signal Processing 36, 1169–1179 (1988) 13. Qiu, S.: Gabor-type matrix algebra and fast computation of dual and tight Gabor wavelets. Optical Engineering 36, 276–282 (1997) 14. Bastiaans, M.J.: A Sampling Theorem for the Complex Spectrogram, and Gabor’s Expansion of a Signal in Gaussian Elementary Signals. Opt.Eng. 20, 594–598 (1981)
15. Duc, B., Fischer, S., Bigüm, J.: Face authentication with Gabor information on deformable graphs. IEEE Trans. on Image Processing 8, 504–516 (1999) 16. Cowaan, J.D.: Some remarks on channel bandwidth for visual contrast detection. Nerosci. Res. Prog. Bull. 15, 1255–1267 (1973) 17. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Amer. 70, 1297–1300 (1980) 18. Kulikowski, J.J., Marcelja, S., Bishop, P.O.: Theory of spatial position and spacial frequency relations in the receptive field of simple cells in the visual cortex. Biol. Cybern. 43, 187–198 (1982) 19. Pollen, D.A., Ronner, S.F.: Visual sortical neurons as localized spatial frequency filter. IEEE Trans. Sys. Man, Cybern. 15, 91–101 (1985) 20. Jones, J.P., Palmer, L.A.: The two dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysio. 58, 1187–1211 (1987) 21. Watson, A.B., Barlow, H.B., Robson, J.G.: What does eye see best. Nature 302, 419–422 (1983)
Solving the Inverse Problem of Image Zooming Using "Self-Examples"

Mehran Ebrahimi and Edward R. Vrscay

Department of Applied Mathematics, Faculty of Mathematics, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
[email protected], [email protected]

Abstract. In this paper we present a novel single-frame image zooming technique based on so-called "self-examples". Our method combines the ideas of fractal-based image zooming, example-based zooming, and nonlocal-means image denoising in a consistent and improved framework. In Bayesian terms, this example-based zooming technique targets the MMSE estimate by learning the posterior directly from examples taken from the image itself at a different scale, similar to fractal-based techniques. The examples are weighted according to a scheme introduced by Buades et al. to perform nonlocal-means image denoising. Finally, various computational issues are addressed and some results of this image zooming method applied to natural images are presented.
1 Introduction
The idea of increasing the resolution of an image using its nonlocal self-similarities was initially proposed by the researchers in fractal image coding back in the 1990s; see, for example, [4,17,25,27,21]. In fractal image coding, one seeks to approximate a target image by the fixed point of a contractive, resolution-independent operator called the fractal transform. The essence of the fractal transform is to approximate smaller range subblocks of an image with modified copies of larger domain subblocks. Repeated action of the transform introduces finer details, implying that its fixed point can increase the resolution of images to any arbitrary degree. Because of the block coding involved in the construction of the fractal transform, however, the results are blocky and, in many ways, artificial. A more recent series of novel example-based methods that take advantage of the self-similarity of images show great promise for the solution of various inverse problems [6,7,9,14,32]. In [6,7], the authors demonstrated that their nonlocal-means (henceforth to be referred to as "NL-means") image denoising algorithm has the ability to outperform classical denoising methods. In this paper, we generalize their method to address the zooming problem. Similar to NL-means denoising, we compare neighborhoods in the image that will allow us to predict pixel values that reproduce local patterns and textures. Our algorithm compares
the neighborhood information of each pixel to the neighborhoods of the same size in the coarser scale over the entire image. Due to the reconstruction scheme, our results do not suffer from the blockiness generally inherent in fractal zooming. Moreover, the geometry of objects is well preserved for the test images presented. In Section 2, we introduce the problem of image zooming, and present a brief introduction to the ideas of interpolation, regularization and classical inverse theory techniques, and example-based methods. In Section 3, we review the role of self-similarity in various inverse problems, namely fractal-based methods and example-based approaches. We explain how our example-based approach in this paper takes advantage of self-similarity across scales similar to fractal-based methods. In Section 4, we introduce our method, and discuss some computational issues in Section 5. Finally, some concluding remarks are presented in Section 6.
2 Some Background on the Inverse Problem of Image Zooming
In the characterization of a visual image, the term "resolution" can be confusing since it involves a rather large number of competing terms and definitions. Spatial resolution refers to the spacing of pixels in an image. Many imaging devices, such as charge-coupled-device (CCD) cameras, consist of arrays of light detectors. A detector determines pixel intensity values depending upon the amount of light detected from its assigned area in the scene. The spatial resolution of images produced by these types of devices is proportional to the density of the detector array. In many applications, the imaging sensors have poor resolution output. The process of producing a high-resolution image from a single lower-resolution and distorted (e.g., blurred, noisy) image is called (single-frame) image zooming.
2.1 The Inverse Problem of Image Zooming
We consider the following degradation model [8,15,16]:

u = Hf + n ,    (1)
in which u ∈ l2 (Ω) is the low-resolution m × n-pixel observation, i.e., Ω = [1, . . . , m] × [1, . . . , n], and n ∈ l2 (Ω) is additive white, independent Gaussian noise with zero-mean and variance σ 2 . f ∈ l2 (Ψ ) is the high-resolution image to be recovered such that Ψ = [1, . . . , mz] × [1, . . . , nz], where z is a positive integer. The operator H = SB is the composition of a blurring operator B followed by a downsampling operator S of factor z in each direction. More technically, in order to be consistent with the CCD-array model,
we assume that the blurring operator B : l2(Ψ) → l2(Ψ) is the local averaging operator of length z, i.e., for any (p, q) ∈ Ψ,

(Bf)(p, q) = (1/z²) Σ_{0 ≤ p1, q1 < z} f(p + p1, q + q1) .    (2)

We can bypass this problem by defining w(x, y) = 1S(y) at the problematic points x ∈ P, where 1S(y) denotes the indicator function of y over the set S defined as follows:
S = { ys ∈ Q | ys = arg min_y ‖ u(N^d{Dx}) − H v(N_z^d{y}) ‖_{2,a} } .    (13)

This is equivalent to averaging over all v(y) for which y is a minimizer of the similarity distance.
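For concreteness, a minimal sketch of the degradation operator H = SB of Eqs. (1)-(2), i.e., z×z block averaging followed by downsampling by a factor z; this is the operator applied to the candidate patches v(·) in Eq. (13). The implementation details and names below are our own.

import numpy as np

def degrade(f, z=2):
    """H = S B: local averaging over z-by-z blocks (Eq. 2) followed by
    downsampling of factor z in each direction."""
    m, n = f.shape[0] // z, f.shape[1] // z
    blocks = f[:m * z, :n * z].reshape(m, z, n, z)
    return blocks.mean(axis=(1, 3))              # each low-resolution pixel is a block average

f = np.random.rand(128, 128)                     # a stand-in high-resolution image
u = degrade(f, z=2)                              # 64 x 64 noise-free observation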
5.2 Computational Results
In this section we present some results of our computational experiments. In all experiments we have chosen a zoom factor z = 2, and 3 × 3 neighborhoods as well as 6 × 6 extended-neighborhoods, i.e., d = 3. We also have chosen Ga = δ, i.e., simply ignoring the effect of Gaussian smoothing. As well, we have varied the parameter h with respect to each image individually. Figure 1 indicates the result of an experiment of Algorithm 1 to show that using irrelevant examples (middle column) may lead to irrelevant outputs (right column). In Figure 2, we present the outputs of Algorithm 2 for a noiseless (top left) and noisy (bottom left) image. It can be seen that the algorithm performs denoising in parallel with zooming. For purposes of comparison, the results of pixel replication and bilinear interpolation are also presented. In rows 1 and 3 of Figure 3, we present the result of Algorithm 2 when applied to a simple circular shape in both noiseless and noisy situations, respectively. Below each of these rows are displayed the Fourier spectra of each image. In the bottom two rows of Figure 3, we show the effect of Algorithm 2 on a Gaussian white noise sample, along with corresponding Fourier spectra. The value of parameter h was set in a way that the variance of noise was three-quarter of the variance of the input noise. However, the output is still noise-like, as opposed to the results of bilinear and pixel replication. It can be seen that the Fourier spectrum of the output is also distributed rather uniformly over the frequency domain, as opposed to the pixel replication and bilinear interpolation cases. In Figure 4, we present the output of Algorithm 2 on a noisy input image choosing different values of the filter parameter h. The original image in this
Fig. 1. Original, Example and the image reconstructed using Algorithm 1, h = 0.1 in all three cases
Fig. 2. Original, Pixel replication, Bilinear, Self-examples. For both rows h = 0.15 is applied. In the second row the standard deviation of noise is σ = 0.1.
Fig. 3. A comparison of the image zooming using self-examples, Algorithm 2, with other methods of zooming. Three input images are considered: Circular region (row 1), noisy circular region (row 3) with the noise of standard deviation σ = 0.1, pure noise image (row 5). The Fourier spectra of all images are also shown (rows 2,4 and 6). Starting at left: input image, zooming with pixel replication, zooming with bilinear interpolation, zooming with Algorithm 2. In rows 1,3,5 the value of h in the experiment are respectively 0.01, 0.5, and 0.05.
Fig. 4. A comparison of the results obtained by image zooming using self-examples for different values of the filter parameter h (panels: original, pixel replication, bilinear interpolation, noisy input, and outputs for h = 0.01, 0.05, 0.075, 0.1, 1, 10). A noisy input image is considered for which the standard deviation of noise is σ = 0.05.
figure is given only for comparison. The input image in the second row is the algorithm’s input. It can be observed that choosing small values of h will produce noisy outputs and as h grows the output will be smoothed out converging to a constant value image.
6 Conclusions
In this paper we have presented a novel single-frame image zooming method of self-examples, explaining how it combines the ideas of fractal-based image zooming, example-based zooming and nonlocal-means image denoising. Our framework implicitly defines a regularization scheme which exploits the examples taken from the image itself at a different scale in order to achieve image zooming. The method essentially extends the NL-means image denoising technique to the image zooming problem. Various computational issues and results were also presented, showing that frequency domain extrapolation is in fact possible with this method. Parameter selection, e.g., of the filter parameter h, is an important issue which has not been addressed in sufficient detail in this paper. Furthermore, the experiments in this paper were performed for the case of zooming factor z = 2 and fixed neighborhood size d = 3 according to the size of input images used. We are currently investigating adaptive parameter schemes that may be able to achieve superior results.

Acknowledgments. This research has been supported in part by the Natural Sciences and Engineering Research Council of Canada, in the form of a Discovery Grant (ERV). M.E. is supported by an Ontario Graduate Scholarship. We wish to thank Dr. C. Lemaire, Department of Physics, University of Waterloo, for providing the cross-sectional magnetic resonance images of a lime fruit used in Figures 1 and 2. These images were obtained from a Bruker micro-MRI spectrometer at UW which was purchased with funds provided by the Canada Foundation for Innovation (CFI) and the Ontario Innovation Trust.
References 1. Alexander, S.K.: Multiscale methods in image modelling and image processing, Ph.D. Thesis, Dept. of Applied Mathematics, University of Waterloo (2005) 2. Alexander, S.K., Vrscay, E.R., Tsurumi, S.: An examination of the statistical properties of domain-range block matching in fractal image coding (Preprint 2007) 3. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Patt. Analysis and Mach. Intel. 24(9), 1167–1183 (2002) 4. Barnsley, M.F.: Fractals Everywhere. Academic Press, New York (1988) 5. Bone, D.J.: Orthonormal fractal image encoding using overlapping blocks. Fractals 5(Suppl. Issue), 187–199 (1997) 6. Buades, A., Coll, B., Morel, J.M.: A nonlocal algorithm for image denoising. In: CVPR. IEEE International conference on Computer Vision and Pattern Recognition, San-Diego, California, June 20-25, 2005, vol. 2, pp. 60–65. IEEE Computer Society Press, Los Alamitos (2005)
7. Buades, A., Coll, B., Morel, J.M.: A review of image denoising algorithms, with a new one. SIAM Journal on Multiscale Modeling and Simulation (MMS) 4(2), 490–530 (2005) 8. Chaudhuri, S.: Super-resolution imaging. Kluwer, Boston, MA (2001) 9. Criminisi, A., Perez, P., Toyama, K.: Region filling and object removal by exemplarbased image inpainting. IEEE Trans. on Image Proc. 13(9), 1200–1212 (2004) 10. Datsenko, D., Elad, M.: Example-based single document image superresolution: A global MAP approach with outlier rejection. to appear in the Journal of Mathematical Signal Processing (2006) 11. Donoho, D.L, Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994) 12. Ebrahimi, M., Vrscay, E.R.: Fractal image coding as projections onto convex sets. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4141, pp. 493–506. Springer, Heidelberg (2006) 13. Ebrahimi, M., Vrscay, E.R.: Regularized fractal image decoding. In: Proceedings of CCECE ’06, Ottawa, Canada, May 7-10, 2006, pp. 1933–1938 (2006) 14. Efros, A.A., Leung, T.K.: Texture synthesis by non-parametric sampling. In: ICCV. IEEE International Conference on Computer Vision, Corfu, Greece, September 2025, 1999, pp. 1033–1038 (1999) 15. Elad, M., Datsenko, D.: Example-Based Regularization Deployed to SuperResolution Reconstruction of a Single Image. to appear in The Computer Journal 16. Elad, M., Feuer, A.: Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images. IEEE Trans. on Image Proc. 6(12), 1646–1658 (1997) 17. Fisher, Y. (ed.): Fractal image compression, theory and application. Springer, New York (1995) 18. Forte, B., Vrscay, E.R.: Theory of generalized fractal transforms. In: Fisher, Y. (ed.) Fractal image encoding and analysis, Springer, Heidelberg (1998) 19. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. Int. Journal Of Computer Vision 40(1), 25–47 (2000) 20. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Comp. Graphics And Appl. 22(2), 56–65 (2002) 21. Gharavi-Al., M., DeNardo, R., Tenda, Y., Huang, T.S.: Resolution enhancement of images using fractal coding. In: Visual Communications and Image Processing, San Jose, CA. SPIE Proceedings, vol. 3024, pp. 1089–1100 (1997) 22. Ghazel, M., Freeman, G., Vrscay, E.R.: Fractal image denoising. IEEE Trans. on Image Proc. 12(12), 1560–1578 (2003) 23. Haber, E., Tenorio, L.: Learning regularization functionals. Inverse Problems 19, 611–626 (2003) 24. Ho, H., Cham, W.: Attractor Image Coding using Lapped Partitioned Iterated Function Systems. In: ICASSP’97. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, p. 2917 (1997) 25. Lu, N.: Fractal Imaging. Academic Press, London (1997) 26. Nakagaki, R., Katsaggelos, A.K.: VQ-based blind image restoration algorithm. IEEE Trans. On Image Proc. 12(9), 1044–1053 (2003) 27. Polidori, E., Dugelay, J.-L.: Zooming using iterated function systems. Fractals 5(Suppl. Issue), 111–123 (1997) 28. Roth, S., Black, M.J.: Fields of experts: A framework for learning image priors. In: CVPR. IEEE Conference on Computer Vision and Pattern Recog, San-Diego, California, June 20-25, 2005, vol. 2, pp. 860–867 (2005)
29. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992) 30. Tikhonov, A.N., Arsenin, V.A.: Solution of Ill-posed Problems. Winston & Sons, Washington (1977) 31. Vrscay, E.R.: A generalized class of fractal-wavelet transforms for image representation and compression. Can. J. Elect. Comp. Eng. 23(1-2), 69–84 (1998) 32. Wei, L.Y., Levoy, M.: Fast texture synthesis using tree-structured vector quantization. In: Proc. of SIGGRAPH, New Oleans, Louisiana, pp. 479–488 (2000) 33. Weickert, J.: Anisotropic Diffusion in Image Processing, Teubner, Stuttgart. ECMI Series (1998) 34. Xu, W., Fussell, D.: IFS coding with multiple DC terms and domain blocks. Citeseer article 185324, available at: http://citeseer.ist.psu.edu/185324.html 35. Zhu, S.C., Mumford, D.: Prior learning and Gibbs reaction-diffusion. IEEE Trans. on Patt. Analysis and Machine, Intel. 19(11), 1236–1250 (1997)
A Run-Based Two-Scan Labeling Algorithm

Lifeng He¹,³,⁴, Yuyan Chao²,⁴, and Kenji Suzuki³

¹ Graduate School of Information Science and Technology, Aichi Prefectural University, Aichi 480-1198, Japan. [email protected]
² Graduate School of Environment Management, Nagoya Sangyo University, Owariasahi, Aichi 488-8711, Japan. [email protected]
³ Department of Radiology, Division of the Biological Sciences, The University of Chicago, Chicago, IL 60637, USA. {helifeng, suzuki}@uchicago.edu
⁴ Shannxi University of Science and Technology, Xian, Shannxi, China
Abstract. Unlike conventional raster-scan based connected-component labeling algorithms which detect the connectivity of object pixels by processing pixels in an image one by one, this paper presents an efficient run-based two-scan labeling algorithm: the run data obtained during the scan are recorded in a queue, and are used for detecting the connectivity later. Moreover, unlike conventional label-equivalence-based algorithms which resolve label equivalences between provisional labels that are assigned during the first scan, our algorithm resolve label equivalences between the representative labels of equivalent provisional label sets. In our algorithm, at any time, all provisional labels that are assigned to a connected component are combined in a set, and the smallest label is used as the representative label. The corresponding relation of a provisional label to its representative label is recorded in a table. Whenever different connected components are found to be connected, all provisional label sets concerned with these connected components are merged together, and the smallest provisional label is taken as the representative label. When the first scan is finished, all provisional labels that were assigned to each connected component in the given image will have a unique representative label. During the second scan, we need only to replace each provisional label with its representative label. Experimental results on various types of images demonstrate that our algorithm is the fastest of all conventional labeling algorithms. Keywords: Labeling algorithm, connected components, label equivalence, run data, linear-time algorithm, pattern recognition.
1 Introduction
Labeling of connected components in a binary image is one of the most fundamental operations in pattern analysis (recognition), computer (robot) vision
This work was partially supported by the TOYOAKI Scholarship Foundation, Japan, and the Research Abroad Project of Aichi Prefectural University, Japan.
[14,3,17]. Labeling is said to be more time-consuming than any other fundamental operations such as noise reduction, interpolation, thresholding, and edge detection. Many labeling algorithms have been proposed for addressing this issue. For ordinary computer architectures, there are four important approaches to component labeling. Multi-scan algorithms. Algorithms [5,7] scan an image in the forward and backward raster directions alternately to propagate the label equivalences until no label changes. The number of scans depends on the geometric complexity of connected components. For example, for an N -step stair-like connected component, 2(N -1) scans are necessary [20]. The order of the maximum time for labeling an N × N image is O(N 3 ) (where N is the matrix size). Two-scan algorithms. Algorithms[15,16,10,11,19,4,9,6,12] complete labeling in two scans: during the first scan, provisional labels are assigned to object pixels, and the label equivalences are stored in a one-dimensional (1D) or twodimensional (2D) table array. After the first scan or during the second scan, the label equivalences are resolved by use of some searching algorithms for resolving label equivalences. The resolved results are generally stored in a 1D table. During the second scan, the provisional labels are replaced by the smallest equivalent label with use of the table. The time for labeling with two-scan algorithms depends on the efficiency of the search algorithm for resolving label equivalences. The search algorithms proposed so far are quite complicated; some algorithms [4] need a very large memory for storing label equivalences. Hybrid algorithm. This algorithm [20] is a hybrid between multi-scan algorithms and two-scan algorithms. Like multi-scan algorithms, the hybrid algorithm scans an image in the forward and backward raster directions alternately. During the scans, as in two-scan algorithms, a 1D table is used for recording and resolving label equivalences. According to the results in Ref. [20], the hybrid algorithm was the fastest among the labeling algorithms of the first three types, and four is the upper limit on the number of scans. Tracing-type algorithms. These algorithms [16,17,1,2,8] avoid analysis of label equivalences by tracing the contours of objects (connected components) or by use of an iterative recursion operation. Because the tracing-type algorithms access pixels in an irregular way, depending on the shape of the connected components, they are not suitable for hardware and parallel implementation. In this paper, we propose a run-based, efficient two-scan connected-component labeling algorithm. The run data that are obtained during the first scan are recorded in a queue, and are used for detecting the connectivity in the further processing. At any time, all provisional labels that are assigned to a connected component found so far during the first scan are combined in a provisional label set, and the smallest label is used as their representative label. For each current run, if it is not connected to any previous runs in the row above the scan row, the pixels in the current run is assigned a new provisional label. If there are some runs in the row above the scan row that are connected to the current run, the pixels in the current run are assigned the same provisional label assigned
to the leftmost one of such runs, and all provisional label sets corresponding to these runs are merged with the smallest label as their representative label. Thus, when the first scan is finished, all provisional labels that were assigned to each connected component in the given image will be combined in the same provisional label set, and will have a unique representative label. During the second scan, each provisional label is replaced by its representative label. Like other raster-scan labeling algorithms (multi-scan algorithms, two-scan algorithms, and the hybrid algorithm), our algorithm is based on label equivalence. However, unlike these conventional algorithms, which resolve the equivalence between provisional labels, we resolve label equivalence between provisional label sets. This makes our algorithm more efficient than the others. A run-based connected component labeling algorithm was proposed in reference [18]. The algorithm is an improvement over the propagation-type algorithm proposed in [17] by block-sorting and tracing of runs, and by label propagation to the connected runs. It performs a searching step and a propagation step iteratively on the run data. In the searching step, the image is scanned until an unlabeled run is found, then the run is assigned a new label, which is propagated to neighbor runs above or below the current row until all the runs of the component are labeled with the same label. This algorithm needs two pre-processings: sorting runs on each row by block and adding the address of the next run data to the data for each run. Moreover, it is a tracing-type algorithm, and thus it is not suitable for parallel implementation. In comparison with this conventional run-based algorithm, our algorithm is much simpler, is raster-scan-based (i.e., an image is processed in a regular way), needs no pre-processing (i.e., it is straightforward for run-length encoding of images), and is suitable for parallel implementation. Experimental results on various types of images show that our algorithm is faster not only than the conventional run-based algorithm, but also than all other conventional labeling algorithms. The rest of this paper is organized as follows: We introduce our algorithm in the next section and consider its efficient implementation in section 3. In section 4, we show the experimental results. We give our concluding remarks in section 5.
2 Outline of Our Proposed Algorithm
For an N × N binary image (where the original image is binarized into the object and the background, and N is the matrix size), we use p(y × N + x) to denote the pixel value at (x, y) in the image, where 1 ≤ x ≤ N and 1 ≤ y ≤ N, and we use VO for the pixel value for the object (called object pixels) and VB for that for the background (called background pixels). We assume that VO and VB are larger than any possible number of provisional labels (a safe method, for example, is using a number larger than the largest possible number of provisional labels, i.e., N × N/4, for an N × N image), and VO < VB. As in most labeling algorithms, we assume that all pixels on the border of an image are background pixels.
A run is a block of contiguous object pixels in a row. A run from p(s) to p(e) in an N × N image is described by r(s, e). The run data can easily be obtained and recorded during the first scan. Let r(s, e) be the current run. Then, by eight-connected connectivity, all runs in the row above the scan row such that one of their pixels occurs between p(s − N − 1) and p(e − N + 1) are connected to the current run, as shown in Fig.1. In other words, a run r(a, b) such that b ≥ s − N − 1 and a ≤ e − N + 1 is eight-connected with the current run r(s, e). All such runs, including the current run, belong to the same connected component. On the other hand, if no such run exists in the previous row above the scan row, the current run is not connected to any previous run.
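In code, this connectivity test reduces to two comparisons on the recorded run data (a sketch of our own, zero-based; N is the image width and runs are given by their 1-D pixel numbers as in the text):

def eight_connected(prev_run, cur_run, N):
    """True if a run r(a, b) in the row above is eight-connected with the
    current run r(s, e), i.e., b >= s - N - 1 and a <= e - N + 1."""
    a, b = prev_run
    s, e = cur_run
    return b >= s - N - 1 and a <= e - N + 1

# Example on a 10-pixel-wide image: a run ending diagonally above-left of the
# current run's start is still eight-connected.
print(eight_connected((23, 25), (36, 38), N=10))   # True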
Fig. 1. Eight-connected connectivity for the current run. (a) Eight-connected pixel region for the current run; (b) and (c) samples of runs eight-connected with the current run; (d) a sample of no run eight-connected with the current run.
In our algorithm, at any point, all provisional labels that are assigned to a connected component found so far during the first scan are combined in a set corresponding to the connected component, and the smallest label is referred to as their representative label. The corresponding relationship between a provisional label and its representative label is recorded in a table, called the representative table. For convenience, we use R(l) = r to show that the representative label of the provisional label l is r, and we use S(r) for the set of provisional labels with r as the representative label. Moreover, if S(r) is the provisional label set corresponding to a connected component C, for any run R such that it belongs to C, we also say that S(r) is the provisional label set corresponding to R. In the first scan, from i = 0, our algorithm scans pixel p(i) one by one in the given image in the raster scan direction. When a new run r(s, e) is found, the run data are recorded. At the same time, the eight-connected area with the current run in the row above is detected. If there is no run eight-connected with the current run in the row above the scan row, the current run belongs to a new
connected component not found so far. All pixels in the current run are assigned a new label l, the provisional label set corresponding to the connected component, i.e., the current run, is established as S(l) = {l}, and the representative label of l is set to l, i.e., R(l) = l. On the other hand, if there are some runs eight-connected with the current run in the row above the scan row, suppose that S(r1), . . ., and S(rn) are the provisional label sets corresponding to those runs. Because S(r1), . . ., S(rn), and the current run belong to the same connected component, S(r1), . . ., and S(rn) are merged to S(r), where r = min{r1, . . . , rn}, and all pixels in the current run are assigned the same provisional label as the upper leftmost run. For example, Fig. 2 (a) shows a step before processing the current run, and Fig. 2 (b) shows the step after processing the current run.
Fig. 2. Operations for label equivalence resolving: (a) before processing the current run, with S(1)={1, 2}, S(3)={3, 4}, S(5)={5}; (b) after processing the current run, with S(1)={1, 2, 3, 4, 5}
Whenever temporary connected components are found to be connected, all corresponding provisional label sets are merged with a single representative label. Therefore, when the first scan is finished, all provisional labels that belong to a connected component in a given image are merged in a corresponding set, and they have the same representative label. Thus, during the second scan, we need only to replace each provisional label with its representative label.
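The first scan just outlined can be sketched compactly as follows (our own Python rendering, operating on a 2-D binary array rather than the paper's 1-D pixel numbering; the set merging is done here with a simple flattened table, while the constant-time list-based merging is described in Section 3):

import numpy as np

def first_scan(img):
    """Assign provisional labels run by run; img: 2-D array, nonzero = object pixel.
    Returns the provisional label image and the representative table."""
    labels = np.zeros(img.shape, dtype=np.int32)
    rtable = {}                       # provisional label -> representative label
    runs_prev, next_label = [], 1     # runs of the previous row: (start, end, label)

    def resolve(x, y):                # merge the sets containing labels x and y
        rx, ry = rtable[x], rtable[y]
        if rx != ry:
            lo, hi = min(rx, ry), max(rx, ry)
            for k in rtable:
                if rtable[k] == hi:
                    rtable[k] = lo

    for row in range(img.shape[0]):
        runs_cur, col = [], 0
        while col < img.shape[1]:
            if img[row, col]:
                s = col
                while col < img.shape[1] and img[row, col]:
                    col += 1
                e = col - 1
                # previous-row runs eight-connected with the current run r(s, e)
                connected = [r for r in runs_prev if r[1] >= s - 1 and r[0] <= e + 1]
                if not connected:
                    label = next_label          # a new connected component
                    rtable[label] = label
                    next_label += 1
                else:
                    label = connected[0][2]     # provisional label of the leftmost run
                    for r in connected[1:]:
                        resolve(label, r[2])
                labels[row, s:e + 1] = label
                runs_cur.append((s, e, label))
            else:
                col += 1
        runs_prev = runs_cur
    return labels, rtable

The second scan then simply replaces every nonzero entry of labels by its representative label, rtable[label].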
3 Implementation
In our algorithm, we need to merge provisional label sets. By use of the connection-list data structure, we can complete the merger of two sets in constant steps. Because, in our algorithm, every provisional label belongs to only one provisional label set, we can use two VO -sized 1D arrays to realize connection lists for all provisional label sets. For convenience, hereafter we will use provisional label lists instead of provisional label sets.
One array, denoted next[ ], is used to represent the label next to the previous label in the list. next[i]=j means that the label next to label i is j. In particular, we use next[i]=−1 to indicate that label i is the tail label of the list, i.e., there is no label next to label i. The other array, denoted tail[ ], is used for indicating the tail label of a provisional label set. tail[u]=v means that the tail label of the provisional label list S(u) is v. On the other hand, the representative label table can also be implemented easily by use of a VO-sized 1D array, rtable[ ]. Setting the representative label of a provisional label l to c can be done easily by the operation rtable[l]=c, and for any provisional label u, its representative label v can be found simply by the operation v=rtable[u].

When we connect two provisional label lists S(m) and S(n), if m < n, the head label of S(n) is connected to the tail label of S(m), and the resulting provisional label list is S(m)=S(m)∪S(n); otherwise, if m > n, the head label of S(m) is connected to the tail of S(n). Therefore, for any provisional label list S(u), the representative label u certainly occurs as its head label. In this way, with any member x of a provisional label list S(h), we can simply find its head label (the representative label) h by h=rtable[x], and its tail label t by t=tail[rtable[x]]. Moreover, from the head label h, by using next[ ], we can find all of the members in the list one by one until we reach −1.

With rtable[ ], next[ ] and tail[ ], the creation and connecting of equivalent label lists can be made easily. Creation of a new equivalent label list S(m)={m} can be made by the following operations: rtable[m] = m, next[m] = −1, tail[m] = m. On the other hand, to connect two provisional label lists, say, S(u) and S(v), without loss of generality, suppose that u < v, i.e., S(v) is merged into S(u); we should connect the head of S(v) to the tail of S(u), set the tail of S(u) as the tail of S(v), and set u as the representative label for every label in list S(v). For doing this, the following simple operations, denoted merge-operation(u, v), are performed:

i = v;
while (i ≠ −1) do
    rtable[i] = u; i = next[i];
end of while
next[tail[u]] = v;
tail[u] = tail[v].

For example, rtable[ ], next[ ] and tail[ ] for the equivalent connected component lists S(1)={1, 3, 7} and S(2)={2, 6, 4} are as shown in Fig. 3 (a). If there is a label equivalence between provisional labels 6 and 3, then the results of the above operations are as shown in Fig. 3 (b). Let x and y be the provisional labels assigned to pixels in two runs, r1 and r2, respectively. If r1 and r2 are found to be connected, the operations for resolving the equivalence of two provisional label sets that contain x and y, denoted resolve(x, y), are as follows:
(a) before connecting: rl_table, next_label and tail_label for S(1) = {1, 3, 7} and S(2) = {2, 6, 4}; (b) after connecting: S(1) = {1, 3, 7, 2, 6, 4}, with the changed entries marked.
Fig. 3. Operations for label equivalence resolving
u = rtable[x]; v = rtable[y];
if (u < v) then merge-operation(u, v);
else if (u > v) then merge-operation(v, u);
end of if
Notice that if u = v, which means that the two provisional labels x and y have the same representative label, i.e., they belong to the same provisional label set, nothing needs to be done. After the first scan, each object pixel p(i) is labeled with a provisional label l, i.e., p(i) = l. If we set rtable[VB] = VB in advance, the process of replacing a provisional label with its representative label in the second scan can be completed easily by the following operation: p(i) = rtable[p(i)]. Now we consider how to record and use run data. In our algorithm, because we scan an image in the raster direction, the runs that are found first will be used first. Therefore, we use queue data structures, s_queue[ ] (called start-queue) and e_queue[ ] (called end-queue), to record the starting pixel number and the ending pixel number of each run, respectively. Moreover, for the current run r(s, e), any run r(m, n) that ends before or at p(e − N), i.e., n ≤ e − N
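To make the preceding description concrete, the following minimal Python sketch mirrors the three arrays and the create, merge and resolve operations described above; the array size MAX_LABELS is a hypothetical bound on the number of provisional labels, everything else follows the text.

# Label-equivalence lists implemented with three flat arrays, as in the text.
MAX_LABELS = 1024                      # hypothetical upper bound on provisional labels
rtable = [0] * MAX_LABELS              # representative label of each provisional label
next_  = [-1] * MAX_LABELS             # next label in the list; -1 marks the tail
tail   = [0] * MAX_LABELS              # tail label of the list headed by a representative

def create(m):
    # Create a new equivalent label list S(m) = {m}.
    rtable[m] = m
    next_[m] = -1
    tail[m] = m

def merge(u, v):
    # merge-operation(u, v): append list S(v) to S(u); u < v is assumed.
    i = v
    while i != -1:                     # relabel every member of S(v) with representative u
        rtable[i] = u
        i = next_[i]
    next_[tail[u]] = v                 # connect the head of S(v) to the tail of S(u)
    tail[u] = tail[v]

def resolve(x, y):
    # Resolve a label equivalence between provisional labels x and y.
    u, v = rtable[x], rtable[y]
    if u < v:
        merge(u, v)
    elif u > v:
        merge(v, u)                    # if u == v the labels are already equivalent

For the example of Fig. 3, after building S(1) = {1, 3, 7} and S(2) = {2, 6, 4} with create and merge, the call resolve(6, 3) leaves rtable[2] = rtable[6] = rtable[4] = 1, as in panel (b).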
λ+ ∫in(C) |u0(x, y) − c+|² φ(x, y) H(a + φ(x, y)) dx dy − λ− ∫out(C) |u0(x, y) − c−|² φ(x, y) H(a − φ(x, y)) dx dy   (1)
where u0 : Ω → R is the given image, φ is the level-set function introduced in [11], C ( s ) : [0,1] → R 2 a piecewise parameterized curve, c + and c − are unknown constants representing the average value of u 0 inside and outside the curve, α is an arbitrary
small positive value, and parameters μ > 0 and λ+, λ− > 0 are weights for the regularizing term and the fitting terms, respectively. Keeping c+ and c− fixed, and minimizing F with respect to φ, we deduce the associated Euler-Lagrange equation for φ. Parameterizing the descent direction by an artificial time t ≥ 0, the equation in φ(t, x, y) (with φ(0, x, y) = φ0(x, y) defining the initial contour) is

∂φ/∂t = δ(φ) { μ · div(∇φ/|∇φ|) − λ+ |u0 − c+|² [H(a + φ) + φ H′(a + φ)] + λ− |u0 − c−|² [H(a − φ) + φ H′(a − φ)] }   (2)

φ(0, x, y) = φ0(x, y),  t ∈ (0, ∞), (x, y) ∈ Ω
The stationarity of the global minimum obtained at the convergence of the Lee-Seo model allows the imposition of a termination criterion. For example, as |φ(x, y)| converges to a, the Normalized Step Difference Energy (NSDE) can be defined as follows:

NSDE = |φ(x, y) − a|² / |φ(x, y)|²   (3)
The NSDE is calculated at each iteration and as soon as it becomes smaller than an experimentally determined constant, the algorithm is terminated.
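A minimal sketch of how such a termination test might be wired into the evolution loop follows (NumPy assumed; evolve_one_step stands for one update of Eq. (2), the tolerance eps is the experimentally determined constant, and reducing the NSDE field to a single scalar by summing over the image, using |φ| − a so the measure vanishes as |φ| approaches a, is our assumption about the intended reading of Eq. (3)).

import numpy as np

def evolve_until_stationary(phi, a, evolve_one_step, eps=1e-3, max_iter=500):
    # Iterate the level-set update until the NSDE drops below eps.
    for _ in range(max_iter):
        phi = evolve_one_step(phi)
        nsde = np.sum((np.abs(phi) - a) ** 2) / np.sum(np.abs(phi) ** 2)
        if nsde < eps:                 # experimentally determined constant
            break
    return phi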
3 Local Binary Patterns We adopt the formulation of the LBP operator defined in [8]. Let T be a texture pattern defined in a local neighborhood of a grey-level texture image as the joint distribution of the gray levels of P (P > 1) image pixels:
T = t(gc, g0, ..., gP−1)   (4)
where gc is the grey-level of the central pixel of the local neighborhood and gp (p = 0,…, P-1) represents the gray-level of P equally spaced pixels arranged on a circle of radius R (R > 0) that form a circularly symmetric neighbor set. Much of the information in the original joint gray level distribution T’ is conveyed by the joint difference distribution:
T′ ≈ t(g0 − gc, ..., gP−1 − gc)   (5)
This is a highly discriminative texture operator. It records the occurrences of various patterns in the neighborhood of each pixel in a P-dimensional vector. The signed differences gp-gc are not affected by changes in mean luminance, resulting in a joint difference distribution that is invariant against gray-scale shifts. Moreover, invariance with respect to the scaling of the gray-levels is achieved by considering just the signs of the differences instead of their exact values:
T″ ≈ t(s(g0 − gc), ..., s(gP−1 − gc))   (6)
where
s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0   (7)

For each sign s(gp − gc) a binomial factor 2^p is assigned. Finally, a unique LBPP,R value that characterizes the spatial structure of the local image texture is estimated by:

LBPP,R = Σ_{p=0}^{P−1} s(gp − gc) 2^p   (8)
The distribution of the LBPP,R values calculated over an image region comprises a highly discriminative feature vector for texture segmentation, as demonstrated in various studies [8]-[10].
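A minimal sketch of the operator of Eq. (8) is given below (NumPy assumed; off-grid neighbour samples are obtained by bilinear interpolation and border samples are clamped, which is our simplification).

import numpy as np

def lbp_image(img, P=8, R=1.0):
    # Per-pixel LBP_{P,R} code of Eq. (8); neighbours g_p are sampled on a circle
    # of radius R around the centre pixel g_c, with bilinear interpolation.
    g_c = img.astype(float)
    h, w = g_c.shape
    ys, xs = np.mgrid[0:h, 0:w]
    codes = np.zeros((h, w), dtype=np.int64)
    for p in range(P):
        ny = np.clip(ys - R * np.sin(2 * np.pi * p / P), 0, h - 1.001)
        nx = np.clip(xs + R * np.cos(2 * np.pi * p / P), 0, w - 1.001)
        y0, x0 = np.floor(ny).astype(int), np.floor(nx).astype(int)
        fy, fx = ny - y0, nx - x0
        g_p = ((1 - fy) * (1 - fx) * g_c[y0, x0] + (1 - fy) * fx * g_c[y0, x0 + 1] +
               fy * (1 - fx) * g_c[y0 + 1, x0] + fy * fx * g_c[y0 + 1, x0 + 1])
        codes += np.where(g_p - g_c >= 0, 1, 0) << p      # s(g_p - g_c) * 2^p
    return codes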
4 Proposed Approach The underlying idea of the proposed approach is to encode the spatial distribution of the most discriminative LBPP,R values of an input image into gray-level intensities so as to produce a new image that satisfies the assumption of approximately piecewise constant intensities. This is a basic assumption of the active contour model, which is subsequently applied on the new image. As this approach avoids the iterative calculation of active contour equation terms derived from textural feature vectors, the proposed approach can be more efficient than other active contour approaches. The proposed algorithm begins with the calculation of the LBPP,R values of all pixels of the input image I. A binary image is assigned to each of the existent LBP values. For each i= LBPP,R(x,y), the pixel (x,y) of the binary image Bi is labeled white, indicating the presence of the LBP value i, otherwise it is labeled black. In the sequel, each Bi is divided into constant-sized blocks and the number Pwhite(i,j) of white pixels contained in each block j of Bi is counted. Since white pixels in Bi indicate the presence of the LBP value i, their density may vary for regions of different texture, characterized by different LBP distributions. The maximum interblock difference of white pixel densities in Bi, as expressed by contrast index ξi:
ξi = max_j(Pwhite(i, j)) − min_j(Pwhite(i, j))   (9)
can be used for the discrimination of bimodal textures. The contrast index ξi is expected to be larger if the white pixels are mainly concentrated in some image blocks, indicating that the associated LBP value characterizes the texture of an image region. Otherwise, if the white pixels are entangled within the blocks and cannot be associated with the texture of an image region, the contrast index ξi is expected to be smaller. The binary images Bi, i = 1, 2, …, 2^P, are sorted according to their contrast index ξi, and the r top-ranked images are selected. The logical OR operator is applied on all combinations K of the selected r binary images. The resulting “cumulative” binary images CBk, k = 1, 2, …, K, contain information derived from subsets of the existent LBPs. This is in agreement with [8], according to which an appropriately selected subset of LBPs maintains most of the textural information associated with the set of
the existent LBPs. The “cumulative” binary image CBm with the maximum contrast index is selected, according to:

m = arg max_k (ξk)   (10)
This image will be comprised of regions characterized by distinguishable white pixel densities, associated with different textures. Equation (10) imposes that the binary images Bi used to generate CBm have their highest white pixel densities on regions of the same texture. In order to limit the effect of local variances in the spatial frequency of the LBPs, a Gaussian kernel WG is convolved with CBm. This results in a smoothed image CBG of nearly homogeneous image regions, which satisfy the assumption of the Lee-Seo model for piecewise constant intensities. Such smoothing operations have been proved to enhance texture discrimination, as the notion of texture is undefined at the single pixel level and is always associated with some set of pixels [12]. Moreover, convolution with Gaussian kernel WG ensures that the gray level of each pixel in the smoothed image CBG depends on the distances of the local binary patterns, which are present in the neighborhood of the pixel and have been associated with the texture of interest in the previous steps of the algorithm. Finally, it should be taken into account that smoothing accelerates the convergence of the subsequently applied active contour. Fig. 1 illustrates an example of an original image composed of Brodatz textures, along with the resulting smoothed image CBG. In the final step, the Lee-Seo model is applied to CBG. The region-based formulation of this active contour model enables the segmentation of an image into two discrete regions, even if these regions are not explicitly defined by high intensity gradients. In addition, its level set formulation allows the Lee-Seo model to adapt to topological changes, such as splitting or merging, in case regions of the same texture are interspersed in the image. Finally, the Lee-Seo model is guaranteed to converge to a stationary global minimum. The steps of the proposed algorithm can be summarized as follows:
1. Calculate LBP values: for each pixel (x, y) in I, calculate LBPP,R(x, y).
2. Generate binary images Bi, i = 1, 2, …, 2^P: initialize Bi(x, y) = 0; for each LBPP,R(x, y), set i = LBPP,R(x, y) and Bi(x, y) = 1.
3. Generate the “cumulative” binary image CBm: rank all Bi according to ξi; for each combination COMBk = {Bi1, Bi2, …, Bil}, k = 1, 2, …, K, l = card(COMBk), of the r top-ranked Bi, compute CBk = Bi1 OR Bi2 OR … OR Bil; find CBm using (10).
4. Smoothing and segmentation: CBG = CBm * WG; segment CBG using the Lee-Seo model.
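Steps 2–4 can be sketched as follows (NumPy and SciPy assumed; the block size and r follow the experimental settings in Section 5, while the Gaussian width sigma is a hypothetical choice).

import numpy as np
from itertools import combinations
from scipy.ndimage import gaussian_filter

def contrast_index(binary, block=16):
    # Maximum inter-block difference of white-pixel counts, Eq. (9).
    h, w = binary.shape
    counts = [binary[i:i + block, j:j + block].sum()
              for i in range(0, h - block + 1, block)
              for j in range(0, w - block + 1, block)]
    return max(counts) - min(counts)

def cumulative_binary_image(lbp_codes, P=8, r=5, block=16, sigma=4.0):
    # One binary image B_i per LBP value, ranked by contrast index.
    bis = [(lbp_codes == i).astype(np.uint8) for i in range(2 ** P)]
    top = sorted(bis, key=lambda b: contrast_index(b, block), reverse=True)[:r]
    best, best_xi = None, -1
    for l in range(1, r + 1):                      # all combinations of the r top-ranked B_i
        for combo in combinations(top, l):
            cb = np.bitwise_or.reduce(np.stack(combo), axis=0)
            xi = contrast_index(cb, block)
            if xi > best_xi:
                best, best_xi = cb, xi             # CB_m of Eq. (10)
    return gaussian_filter(best.astype(float), sigma)   # smoothed CB_G = CB_m * W_G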
5 Results Experiments were performed to investigate the performance of the proposed approach on texture segmentation. The dataset used is comprised of 18 composite texture images from the Brodatz album [13]. The proposed approach was implemented in Microsoft Visual C++ and executed on a 3.2 GHz Intel Pentium IV workstation. As a segmentation quality measure, the overlap q was considered:

q = (A ∩ G) / (A ∪ G)   (11)
where A is the region delineated by the approach and G is the ground truth region. The LBP operator considered in the experiments was LBP8,1. The block size used was set to 16×16, and a number of r=5 top-ranked binary images (see Section 4) was found to be sufficient for the performed segmentation tasks.
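For binary masks, this measure can be computed as in the following minimal sketch (NumPy assumed).

import numpy as np

def overlap(segmented, ground_truth):
    # q = |A ∩ G| / |A ∪ G| for boolean masks of the same shape, Eq. (11).
    a, g = segmented.astype(bool), ground_truth.astype(bool)
    union = np.logical_or(a, g).sum()
    return np.logical_and(a, g).sum() / union if union else 1.0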
Fig. 1. Example of a smoothed image CBG, generated in step 4 of the proposed algorithm, (a) original image composed of Brodatz textures, (b) corresponding image CBG
Fig. 2. Segmentation results of the application of the proposed approach for bimodal texture segmentation, (a,c,e,g) original images composed of Brodatz textures, (b,d,f,h) segmented images
Figure 1 illustrates an example of an original image composed of Brodatz textures, along with the resulting smoothed image CBG, which is generated in step 4 of the proposed algorithm. Figure 2 illustrates four examples of the application of the proposed approach for bimodal texture segmentation. The segmentation results obtained are very promising, with the frames composed of different texture patterns being successfully segmented. The overlaps measured are 99.1%, 96.9%, 99.7%, and 99.1% for the segmentations illustrated in Fig. 2b, 2d, 2f, and 2h respectively, whereas the average overlap obtained was 98.9±0.7%. The convergence times observed for the available images ranged between 2 and 3 seconds depending on the complexity of the target boundaries. These convergence times are an order of magnitude smaller than the ones reported in other active contour approaches for texture segmentation [2].
6 Conclusion In this paper, we presented a novel approach for bimodal texture segmentation. The proposed approach features a local binary pattern scheme to transform bimodal textures into bimodal gray-scale intensities, segmentable by the Lee-Seo active contour model. This process avoids the iterative calculation of active contour equation terms derived from textural feature vectors, thus reducing the associated computational overhead. In addition, the region-based, level-set formulation of the Lee-Seo model allows segmenting regions defined by smooth intensity changes, as well as adapting to topological changes. Finally, the Lee-Seo model is invariant to the initialization of the level-set function and guarantees convergence to a stationary global minimum. The stationarity of the associated solution facilitates the imposition of a reasonable termination criterion on the algorithm, contributing to its efficiency. In our experimental study, the proposed approach achieved very promising segmentation results, whereas the required convergence times were an order of magnitude smaller than the ones reported in other active contour approaches for texture segmentation [2]. The proposed approach in its current form is limited to bimodal segmentation. However, future perspectives of this work include an extension for multimodal texture segmentation, as well as applications on various domains, and incorporation of the uniform LBP operator, introduced in [8]. Acknowledgments. This work was funded by the Greek General Secretariat of Research and Technology and the European Social Fund, through the PENED 2003 program (grant no. 03-ED-662).
References 1. Paragios, N., Deriche, R.: Geodesic Active Contours for Supervised Texture Segmentation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2422–2427. IEEE Computer Society Press, Los Alamitos (1999) 2. Aujol, J.F., Aubert, G., Blanc-Feraud, L.: Wavelet-Based Level Set Evolution for Classification of Textured Images. IEEE Transactions on Image Processing 12(12), 1634– 1641 (2003)
3. Sagiv, C., Sochen, N.A., Zeevi, Y.: Integrated Active Contours for Texture Segmentation. IEEE Transactions on Image Processing 1(1), 1–19 (2004) 4. He, Y., Luo, Y., Hu, D.: Unsupervised Texture Segmentation via Applying Geodesic Active Regions to Gaborian Feature Space. IEEE Transactions on Engineering, Computing and Technology, 272–275 (2004) 5. Pujol, O., Radeva, P.: Texture Segmentation by Statistic Deformable Models. International Journal of Image and Graphics 4(3), 433–452 (2004) 6. Chan, T.F., Vese, L.A.: Active Contours Without Edges. IEEE Transactions on Image Processing 7, 266–277 (2001) 7. Lee, S.H., Seo, J.K.: Level Set-Based Bimodal Segmentation With Stationary Global Minimum. IEEE Transactions on Image Processing 15(9), 2843–2852 (2006) 8. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 9. Paclic, P., Duin, R., Kempen, G.V., Kohlus, R.: Supervised Segmentation of Textures in Backscatter Images. In: Proceedings of IEEE International Conference on Pattern Recognition, vol. 2, pp. 490–493 (2002) 10. Iakovidis, D.K., Maroulis, D.E, Karkanis, S.A.: A Comparative Study of Color-Texture Image Features. In: Proceedings of IEEE International Workshop on Systems, Signal and Image Processing, Halkida, Greece, pp. 205–209 (2005) 11. Osher, S., Sethian, J.: Fronts Propagating with Curvature- Dependent Speed: Algorithms Based on the Hamilton-Jacobi Formulations. Journal Of Computational Physics 79, 12–49 (1988) 12. Unser, M., Eden, M.: Nonlinear Operators for Improving Texture Segmentation Based on Features Extracted by Spatial Filtering. IEEE Transactions on Systems, Man and Cybernetics 20(4), 804–815 (1990) 13. Brodatz, P.: Textures: A Photographic Album for Artists and Designers. Dover Publications, New York (1996)
Face Detection Based on Skin Color in Video Images with Dynamic Background António Mota Ferreira and Miguel Velhote Correia Instituto de Engenharia Biomédica, Laboratório de Sinal e Imagem Universidade do Porto, Faculdade de Engenharia, Departamento de Engenharia Electrotécnica e de Computadores Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal {mota.ferreira, mcorreia}@fe.up.pt
Abstract. Face detection is a problem in image processing that has been extensively studied in the last decades. This is justified by the numerous applications this subject has in computer vision. In this paper we present a technique to detect a face in an image where several persons may be present. The images were recorded from news broadcasts from different networks. First, a skin color model based on the normalized RGB color space was built: skin regions were selected from several face images and statistical characteristics were extracted. Second, each pixel of the image was classified into two categories: skin and non-skin. Third, the face region was identified. A MATLAB function was created to detect faces of any size, achieving high detection rates. Keywords: Face detection, Skin detection, Skin color model, Color segmentation.
1 Introduction The work presented is part of a larger one, in which we intended to develop a system for audio-visual speech recognition. As one can easily perceive, in order to establish an automatic system for this purpose, considering a video sequence where someone speaks, we have to obtain several streams to extract the visual and acoustic characteristics of speech for further analysis and recognition. In this way, in order to do the lip reading, the system has to detect the lips and, before that, the speaker’s face. In order to achieve this goal, several techniques have been thoroughly studied and tried over the last 35 years [1]. The general problem of face detection can be put forward in the following way: while considering a picture or a video image, one has to detect and track an unknown number of faces in it. There are two main approaches in face detection: high-level methods based on the geometrical model (image-based techniques) and low-level, pixel-oriented models (feature-based techniques). Notwithstanding, there are also hybrid methods which use methodologies from both approaches and each one of them requires a previous knowledge of the characteristics that define a face. M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 254–262, 2007. © Springer-Verlag Berlin Heidelberg 2007
In image-based techniques, face detection is treated as if it was a pattern recognition problem. Training algorithms, that classify the parts of the images as faces and non-faces, are used [2]. In order to achieve this goal, all image windows of a predefined dimension are analysed and classified. Since it is possible to find faces in the image with a larger size than the specified kernel, the original image is successively reduced (by sub-sampling) and the filter is applied to each size [3, 4]. Techniques such as Neural Networks and Eigenfaces are just a few examples of highlevel processes. These methods are computationally expensive and slow as far as processing is concerned, thus not being adequate to systems aiming towards real time processing, as it is desirable in audio-visual speech recognition. This was the reason why we selected a low-level pixel-oriented method. Afterwards, we are going to describe the steps which led to the creation of an algorithm which performs face detection and tracking through several frames of a video sequence, in a fast and robust way, as long as certain conditions are taken into account. The used video sequences were taken from different news broadcasts recorded in digital format in a Sony DVD recorder (ref RDR-HX 710). The original images, recorded in PAL system (16×9) in VOB format files, were converted to TIF file format so that they could be read in MATLAB. Each one of the frames has a 384×512 dimension. The study we are now doing fits the context of this problem, because although one can find, most of the times, only one speaker in the images we gathered, there can be more on certain situations, and the developed algorithm segments the image into several ones, taking into account the number of people and working independently with each one of them.
2 Statistical Skin Colour Study In order to detect a face, we decided to do a pixel-oriented study, computationally less costly and one sufficiently strong and quick to implement an algorithm which could be applied in real time. The human skin has a characteristic colour which distinguishes it from any other colours. In this sense, we have selected some dozens of images containing human faces and manually cut out skin sections to subsequently elaborate a probabilistic distribution of the skin colours. We promptly realised that the RGB space colour wasn’t the most suitable due to the high correlation between the
Fig. 1. Pixel distribution in channel R
channels. Owing to the combination of chrominance and luminance, the RGB colour space thus becomes inadequate for establishing an algorithm to study and perform colour recognition [5]. In all, about 130,000 pixels were studied and, as one can easily see in the following histograms, the skin colour on channel R is concentrated in the [0.4, 0.8] interval, but reaches all values along the [0, 1] interval. An analogous comment can be made regarding the remaining channels. Extensive studies have been made by several authors [5, 6, 7], who reached the conclusion that the HSV and normalized RGB spaces are the most adequate for colour segmentation. Since the images we gathered were already in the RGB space, we decided to work in the normalized RGB space, which can be easily implemented. The normalized RGB space is obtained from the RGB space by simple transformations:
r = R/(R + G + B),  g = G/(R + G + B),  b = B/(R + G + B).
The obtained components are called “pure colours” by several authors because they are invariant to the luminance [5]. That is thus the reason why this space becomes adequate when the observed scene is not under a uniform lighting. Along with this, in this space, the sum of the three components equals one, a feature which allows us to describe the third colour in function of the other two. This means, for instance, that the space becomes totally defined if one knows the r and g components. As far as this space is concerned, the studied pixel r and g histograms are:
Fig. 2. Red pure (at the left) and green pure (at the right) color histograms
As one can easily see, the r and g components are not distributed throughout the [0, 1] interval; r is limited to between 0.3 and 0.7, whereas g lies between 0.2 and 0.4. We calculated the means and the respective standard deviations: μr = 0.454, σr = 0.04, μg = 0.313 and σg = 0.02. The gathered data comprised more than forty individuals of different races, from Europeans and Caucasian Americans to Asians, Africans and African-Americans. The obtained values were used to build a decision interval which allows the computer to classify a pixel as either being of human skin colour or not. To this end, we created an algorithm which checks whether the r and g values of each pixel of the image simultaneously satisfy the inequalities:
μr − k·σr < r < μr + k·σr   and   μg − k·σg < g < μg + k·σg   (1)
in which k is a real value that may change from one image to another, but which we consider equal to 1 by default. In the following figures, we can see the action of the algorithm, which leaves the pixels classified as skin tone unaltered and sets the remaining pixels to zero. We will call the application of this algorithm the “skin filter”.
Fig. 3. Pixel detection in the skin colour of an African-American woman
Fig. 4. Pixel detection in the skin colour of a European man
As one can notice, the used images have a dynamic background whose colours can change throughout time. Even so, the detection rate is fairly reasonable. It is often necessary to alter the value of k, not because the algorithm does not detect the skin of somebody in the image, but due to the fact that it finds in the background very similar colours to the skin ones. As the algorithm is used in successive frames of a video recording, we can use the gathered information from the previous and following frames to better identify the area one is interested in, with the help of the Kalman filter, thus strengthening the algorithm.
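A minimal sketch of the skin filter follows (NumPy assumed; the means and standard deviations are the values reported above, and the image is assumed to be an H×W×3 RGB array).

import numpy as np

MU_R, SIGMA_R = 0.454, 0.04
MU_G, SIGMA_G = 0.313, 0.02

def skin_filter(rgb, k=1.0):
    # Normalized (pure) colours r and g; the small epsilon avoids division by zero.
    rgb = rgb.astype(float)
    total = rgb.sum(axis=2) + 1e-6
    r, g = rgb[..., 0] / total, rgb[..., 1] / total
    mask = ((np.abs(r - MU_R) < k * SIGMA_R) &
            (np.abs(g - MU_G) < k * SIGMA_G))          # inequalities (1)
    return rgb * mask[..., None]                        # non-skin pixels set to zero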
3 Detection of an Area That Contains a Face
Since the underlying goal is an audio-visual speech recognition system which will be implemented for a single speaker, the images we gathered nearly always contain only one person. Even so, the developed algorithm detects and
identifies the faces of all the people in the scene, as long as they are on the same level, that is to say, at the same distance from the camera and placed next to each other. The identification of the face consists of the delimitation of its boundaries within a rectangle. Afterwards, we will describe how the algorithm functions when there is only one person in the scene. We have already noticed that after the application of the “skin filter” on the scene, one is left mainly with the skin pixels of the person whose face we wish to detect. Since the pixels of the skin areas come out grouped together, we can use this in the detection of a face. Sometimes, other skin areas are detected besides the face, such as the hands, the arms the neck and the chest area. The algorithm we are going to describe detects the biggest pixel agglomerate one can find in an area of a scene, ignoring the smaller pixel portions which become separately visible. In this way, we can exclude the hands and the most part of the pixels which, despite not being skin pixels, are classified as such. In the following figure, we can see some final results of the application of the mentioned algorithm.
Fig. 5. Results of the application of the algorithm
As one can see, the algorithm is more efficient when the person in the scene is a man, because men usually wear less low-cut clothing than women. Since the algorithm detects skin pixel agglomerates, in the case of the women who wore clothing with a low-cut neckline (as the ones in 5b, c, f, i, j and n), the rectangle is not restricted to the face, also including the areas of the neck and the sternum. In these cases, we can use a heuristic process to restrict the rectangle to the face, using the following procedure: generally, any rectangle which contains only one face in frontal position, as in figures 5m and 5o, is a golden rectangle, that is to say, if we calculate the quotient between the height and the width, we find the golden ratio (which corresponds to the number (1 + √5)/2 ≈ 1.6180). This means that
in the cases where the rectangle width restricts to the face and the top of the rectangle is over the forehead, in order to find its height we would simply have to multiply the width by the golden ratio. This process would achieve good results in figures 5b, c, f, j, and n, but it would fail in figure 5i.
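That heuristic amounts to the following small sketch (the (top, left, width) rectangle representation is an assumption made here for illustration).

GOLDEN_RATIO = (1 + 5 ** 0.5) / 2      # ≈ 1.6180

def restrict_to_face(top, left, width):
    # Height of a frontal-face rectangle estimated from its width.
    return top, left, width, int(round(width * GOLDEN_RATIO))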
4 Description of the Algorithm
After the application of the “skin filter”, we obtain images similar to the ones found in figures 3 and 4. Colour information is no longer relevant and we can work with the image on a grey scale, which occupies less memory and
Fig. 6. Detection of a vertical column where one can find a face
would be more promptly implemented. Since the skin areas form agglomerates, if we calculate the averages of the rows and columns of the matrix that defines the image, we find maxima in the areas that correspond to the vertical projection of the columns onto the first row, and to the horizontal projection of the rows onto the first column. We decided to start with the vertical projection. We then obtain a vector whose plot, after being filtered with a Gaussian filter, is similar to the one in figure 6d. This vector can be used to restrict the vertical area where one finds a face, since its maximum indicates the most probable position of the column which will be the axis of symmetry of the face, provided it is in a frontal position, as happens in this case. Following the same procedure for the vector which results from the horizontal projection, we can see the rows of the image in which we would most probably find a face. An analogous procedure is used when several faces are found in the same image.
Fig. 7. Detection of the faces in an image with two people, after the image segmentation
Fig. 8. Correct detection in 8d after changing the value of k to 0.5
The plot of the vertical projection vector then has several maxima, which identify the number of people involved, and the minima between these maxima identify the places where to segment the image in order to proceed afterwards to the isolated detection of each one of the faces.
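A minimal sketch of this projection analysis follows (NumPy and SciPy assumed; the smoothing width and the valley threshold are hypothetical choices).

import numpy as np
from scipy.ndimage import gaussian_filter1d

def face_columns(skin_gray, sigma=15, valley_ratio=0.2):
    # Vertical projection: mean of each column of the skin-filtered grey image.
    profile = gaussian_filter1d(skin_gray.mean(axis=0), sigma)
    peak = profile.max()
    # Local maxima locate probable face axes; deep local minima are cut points.
    axes = [j for j in range(1, len(profile) - 1)
            if profile[j] >= profile[j - 1] and profile[j] >= profile[j + 1]
            and profile[j] > valley_ratio * peak]
    cuts = [j for j in range(1, len(profile) - 1)
            if profile[j] <= profile[j - 1] and profile[j] <= profile[j + 1]
            and profile[j] < valley_ratio * peak]
    return axes, cuts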
5 Misclassification
Whenever the background contains many colours, it is sometimes necessary to restrict the value interval we defined for the skin pixels, in order to avoid the misclassification of background pixels as skin pixels. In these cases, we have to find the value of k in the inequalities defined in (1) experimentally; once we detect the face in one of the frames, we then achieve good results in the following ones.
Fig. 9. Sequence of 10 consecutive frames achieving good detections after the correction of the k value
Fig. 10. Ten consecutive correct detections in 2.73 s
We obtain good results even if we apply the skin filter only every three frames, thus reducing the required processing time.
6 Conclusions
The described method is easy to implement and generally produces good results. As we can easily confirm in figure 5, the algorithm is effective in the detection of faces, regardless of their size, orientation and skin tone. The algorithm was run in MATLAB 6.5 and took about 5.5 s for each 10 consecutive frames. We can also achieve good results and reduce the time spent searching for a face by using the Kalman filter, since sudden variations in the newscaster's position are rare, and thus the “skin filter” can be applied only every n frames. The presented method of face detection is therefore efficient for the extraction of characteristics for the implementation of an audio-visual speech recognition system.
References
1. Chellappa, R., Wilson, C., Sirohey, S.: Human and machine recognition of faces: A survey. Proceedings of the IEEE 83, 705–740 (1995)
2. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004)
3. Schneiderman, H., Kanade, T.: A Statistical Method for 3D Object Detection Applied to Faces and Cars. In: IEEE Conference on Computer Vision and Pattern Recognition, June 2000, IEEE Computer Society Press, Los Alamitos (2000)
4. Rowley, H., Baluja, S., Kanade, T.: Human face detection in visual scenes (November 1995)
5. Vezhnevets, V., Sazonov, V., Andreeva, A.: A Survey on Pixel-Based Skin Color Detection Techniques. In: Proceedings Graphicon-2003, Moscow, Russia, September 2003, pp. 85–92 (2003)
6. Crowley, J.L., Berard, F.: Multimodal Tracking of Faces for Video Communications. In: IEEE Proc. Conference on Computer Vision and Pattern Recognition, pp. 640–645. IEEE Computer Society Press, Los Alamitos (1997)
7. Garcia, C., Zikos, G., Tziritas, G.: Face detection in Color Images Using Wavelet Packet Analysis. In: ICMCS, vol. 1, pp. 703–708 (1999)
Data Segmentation of Stereo Images with Complicated Background Yi Wei, Shushu Hu, and Yu Li School of Automation, Wuhan University of Technology, Wuhan, 430070, P.R. China
[email protected],
[email protected],
[email protected] Abstract. With the development of computer science, there is an increasing demand on the object recognition in stereo images. As a binocular image pair contains larger and more complicated information than a monocular image, the stereo vision analysis has been a difficult task. Therefore how to extract the region of user’s interest is a vital step to reduce the data redundancy and improve the robustness and reliability of the analysis. The original stereo sequences used in the paper are obtained from two parallel video cameras mounted on a vehicle driving in a residential area. This paper targets the problem of data segmentation of those stereo images. It proposes a set of algorithms to separate the foreground from the complicated changing background. Experiments show that the whole process is fast and efficient in reducing the data redundancy, and improves the overall performance for the further obstacle extraction. Keywords: Stereo Image, Data Segmentation, Data Redundancy, Least Square Fitting.
1 Introduction Object recognition, as one of the most important research areas in the image analysis field, is a technique that automatically extracts various shapes from images by means of computers. With the development of computer technology, the stereo vision analysis has become a new focus, and the relevant topics are frequently discussed in the recent literature[1][2][4][7]-[10]. Stereo vision based object recognition is a closer simulation of human eyes to grasp visual information from the world. It deals with more complicated scenes such as changing background and foreground etc. In some hi-tech systems such as intelligent vehicles and autonomous navigation, objects extracted from the stereo images provide important visual evidence and thus play vital roles. However due to the complicated scenes and vast quantity of image data, the exaction of shapes from stereo images is still a difficult problem to the researchers. The data segmentation, a process that reduces the redundancy and sorts out the image data structure by separating the foreground from the background, becomes one of the key tasks prior to the object recognition. It is not difficult to find out that the large data redundancy in a stereo image is mainly caused by the complicated background. If an object extraction algorithm is directly applied to the whole image pair, both the accuracy and the speed must be affected by M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 263–270, 2007. © Springer-Verlag Berlin Heidelberg 2007
various background factors. Therefore it is crucial to separate the foreground from the background prior to the analysis in order to obtain a more reliable performance and a faster speed. The original stereo sequences used in this paper are obtained by two parallel stereo video cameras mounted on a vehicle driving in the real scenes. Fig. 1 shows some original image pairs and as can be seen the background is complicated and frequently changing. The paper intends to develop a whole process that can clearly divide images into foreground and background so that the following obstacle detection is only applied to the foreground part.
(a) Left image no.9
(b) Left image no.32
(c) Left image no. 100
Fig. 1. Stereo images with complicated background
This paper is organized as follows: the second part describes the segmentation algorithm in detail. The third part gives out the experimental results and analyses. Conclusions are drawn in the final part.
2 Data Segmentation
2.1 Preprocessing
Due to the physical structure of the lens, images captured directly by a video camera have a certain level of distortion. Although the distortion cannot always be recognized by human eyes, it has negative effects, especially on operations that are conducted pixel to pixel, such as disparity calculation. Therefore image rectification is carried out first. As can be seen from Fig. 3, suppose the rectangle in solid lines represents the original image, and the quadrilateral in dashed lines is the distorted image. P(x, y) is a pixel in the distorted image, and P′(x′, y′) is the new position of P after rectification.
Fig. 2. Data segmentation algorithm design: image rectification → edge detection → road border searching and fitting → data segmentation
Fig. 3. Image distortion
The detailed discussion on the image rectification can be found in [6]. The distortion error ∆r is directly applied here
Δr = K1·r + K2·r³ + K3·r⁵,  r = √(x² + y²)   (1)

where K1, K2, K3 are the system coefficients of the lens. Apply (1) to each pixel of
(2)
The Sobel operator is then applied on the rectified image to obtain edges and thus the preprocessing step has completed. Fig. 4 shows the preprocessing results.
(a) Original image
(b) Rectified image
(c) Sobel edge detection
Fig. 4. Preprocessing results
2.2 Road Border Detection and Fitting
As the background in the original images is complicated, data segmentation cannot be achieved only through the above preprocessing. Based on the driving knowledge, it is not difficult to understand that most potential obstacles that may cause the safety
266
Y. Wei, S. Hu, and Y. Li
problems appear either on the drive way or on the pedestrian road. Therefore the area containing both drive ways and pedestrian roads is the foreground we intend to extract. Road border is the border between the drive way and the pedestrian road. In our everyday life, usually there is a solid line or curb on each side of the drive way to mark the border with the pedestrian area. Or if not there is a color difference between these two areas. As a result, the road border may roughly appear as an edge after the edge detection. In order to extract two border lines on both sides of the road, firstly an initial search of all the possible pixels on each border line is conducted to build two pixel sets, see Fig. 5.
Fig. 5. Potential road border pixels
Since the pixels in each set are discontinuous and may contain those that do not belong to the border, an accurate fitting procedure must be carried out to locate the border line. Due to the advantage in the data processing, the least square method is adopted here to fit the border line. Suppose ( xi , yi )(i = 1, 2, " , N ) is the coordinate of a pixel on one border line, then the following requirement must be met
yi = k × xi + b .
(3)
where k and b are the tangent and intercept of the border line respectively. Apply all pixels in a set to (3) and there is
AX = B .
(4)
where
ª x1 °A = « ¬1 ° ° ® X = [k ° ° B = [ y1 ° ¯
"
xN º
T
" 1 »¼ b]
T
"
. yN ]
T
(5)
The least square solution of (4) is

X = (AᵀA)⁻¹ AᵀB
(6)
Apply the results in (6) to (3) and two border lines on both sides of the drive ways can be drawn. Fig. 6 is the fitting results of two border lines.
Fig. 6. Fitting results of road borders
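As an illustration of the fitting step, the following minimal sketch (NumPy assumed) recovers k and b from a set of candidate border pixels via the least-squares solution of Eq. (6).

import numpy as np

def fit_border(points):
    # points: array of (x_i, y_i) candidate border pixels; returns slope k and intercept b.
    pts = np.asarray(points, dtype=float)
    A = np.column_stack([pts[:, 0], np.ones(len(pts))])
    B = pts[:, 1]
    X = np.linalg.solve(A.T @ A, A.T @ B)            # X = (AᵀA)⁻¹ AᵀB, Eq. (6)
    return X[0], X[1]                                 # k, b of y = k·x + b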
It is worthwhile to mention that the purpose of the above searching and fitting is not accurate road detection but data segmentation. There are cases when the road is a curve, such as the one shown in Fig. 8(b), and the fitting process still yields a straight line as the result. To identify the foreground area, this straight line is directly used here to approximately locate the road and its direction, without applying further calculation to match the original curve.
2.3 Data Segmentation
It is well understood that when projected to a 2D image, two parallel lines in a 3D world must meet at an intersection. After the locations of the two road border lines are obtained, a triangle area is created from the two border lines and this triangle is the drive way area, the dotted area in Fig. 6. Then this triangle is expanded to a certain extent from both sides parallel to the border lines to build a new area. The new area, as can be seen from Fig. 7, is the foreground that contains obstacles. The remaining part of the image is classified as the background (black area in Fig. 7) and the further analysis is only applied to the foreground.
Fig. 7. Segmentation results of Fig.4(a)
3 Experimental Results
Based on the above data segmentation algorithms, experiments on a VC++ platform are conducted on 100 24-bit color stereo image pairs. The original images are all
1392×1040 in size. After the rectification, they become 1147×857. The final segmented images with only foreground are 1024×531, and they are almost half sizes of the original ones. Some segmentation results are shown in Fig. 8. It can be seen that the algorithms proposed in this paper perform well in separating the foreground from the background, and the data redundancy is greatly reduced. In addition, the paper designs a comparison experiments to demonstrate the advantages the data segmentation can bring to the further stereo analysis. As can be seen from Table 1, 10 original stereo images are chosen firstly as group A ((aA) and (bA)). Then the segmentation is applied to all these images and 10 segmented image pairs are obtained as group B ((aB) and (bB)). The Sum of Absolute Difference (SAD)[5] is adopted to calculate the dense disparity maps ((cA) and (cB)) of these two groups of image pairs. V-disparity[1][2] algorithm is then chosen to project the road plane and the obstacle planes into straight lines. Finally the Hough Transform is used to extract those lines ((dA) and (dB)) to obtain the exact numbers and location of all obstacles ((eA) and (eB)).
(a) Left image no. 10
(b) Left image no. 26
(d) Segmentation of (a)
(e) Segmentation of (b)
(c) Left image no. 66
(f) Segmentation of (c)
Fig. 8. Data segmentation results of two stereo image pairs
The above comparison experiments show the following results: (i) Stereo images in B group almost contain no background and the data quantity is only half of that in group A. (ii) In some cases, not all obstacles can be detected in group A, but all obstacles are extracted in group B. Table 1 gives one example: all three obstacles are detected in (eB) while only two are marked in (eA). (iii) The tests are run on the same PC (Intel P4 1.7GHz). The experiments on group A need average 75s per pair, while average 12s is needed for group B. The reason for (ii) and (iii) is that the complicated changing background in group A not only affects the speed, but also brings pollution to the dense disparity map where the calculation of the V-disparity map is based. Therefore the polluted V-disparity map cannot extract all the projection of the obstacles and causes the miss detection.
Table 1. Experimental result comparisons between original stereo images and segmented stereo images
Experimental results of group A
(aA) Left image no. 36
(bA) Right image no.36
Experimental results of group B
(aB)Left image no.36 after processing
(bB)Right image no.36 after processing
(cA) Dense disparity map (dA) projection results
(eA) Obstacle detection
(cB) Dense disparity map
(dB) projection lines
(eB) Obstacle detection results
4 Conclusions
Based on the applications of intelligent vehicles, this paper targets the problem of data segmentation in stereo images. Combining computer vision and data processing, the paper proposes algorithms that can separate the foreground from the complicated changing background. The segmentation largely reduces the redundancy and improves the overall performance for the further stereo analyses. Experiments show the robustness of the data segmentation algorithms. Acknowledgments. This work is funded by the Wuhan Municipal Young Scholar
Chenguang Supporting Plan (20055003059-2).
References 1. Labayrade, R., Aubert, D.: In-Vehicle Obstacle Detection and Characterization by Stereovision. In: The 1st Intl. Workshop on In-Vehicle Cognitive Computer Vision Systems, pp. 13–19 (2003) 2. Labayrade, R., Aubert, D., Tarel, J.P.: Real Time Obstacle Detection in Stereovision on Non Flat Road Geometry Through V-Disparity Representation. In: IEEE Symposium on Intelligent Vehicle, vol. 2, pp. 646–651 (2002) 3. Kazakova, N., Margala, M., Durdle, N.G.: Sobel Edge Detection Processor for a Real-time Volume Rendering System. In: IEEE International Symposium on Circuits and Systems, vol. 2, pp. 913–916 (2004) 4. Batavia, P.H., Singh, S.: Obstacle Detection Using Adaptive Color Segmentation and Color Stereo. In: Proceedings of the 2001 IEEE Intl. Conference on Robotics & Automation, pp. 705–710. IEEE Computer Society Press, Los Alamitos (2001) 5. Muhlmann, K., Maier, D., Hesser, J., Manner, R.: Calculating Dense Disparity Map from Color Stereo Images, an Efficient Implementation. In: Proceedings of 2001 Workshop on Stereo and Multi-Baseline Vision, December 2001, pp. 30–36 (2001) 6. Jia, H.T., Zhu, Y.C.: A Technology of Image Distortion Rectification for Camera. Journal of Electronic Measurement and Instrument 19(3), 46–49 (2005) 7. Ruicheck, Y.: Multilevel- and Neural-Network-Based Stereo-Matching Method for Real-Time Obstacle Detection Using Linear Cameras. IEEE Trans. on Intelligent Transportation Systems 6(1), 54–62 (2005) 8. Nakai, H., Takeda, N., Hattori, H., Okamoto, Y., Onoguchi, K.: A Practical Stereo Scheme for Obstacle Detection in Automotive Use. In: Proceedings of the 17th Intl. Conference on Pattern Recognition, pp. 346–350 (2004) 9. Batavia, P.H., Singh, S.: Obstacle Detection Using Adaptive Color Segmentation and Color Stereo. In: Proceedings of the 2001 IEEE Intl. Conference on Robotics & Automation, pp. 705–710. IEEE Computer Society Press, Los Alamitos (2001) 10. Jiang, G.Y., Yu, M., He, S.L., Shi, S.D., Liu, X.: A New Method for Obstacle Detection in ITS. In: Proceedings of 2002 Intl. Conference on Signal Processing, pp. 1770–1773 (2002)
Variable Homography Compensation of Parallax Along Mosaic Seams Shawn Zhang1 and Michael Greenspan1,2 1
Dept. Electrical and Computer Engineering, 2 School of Computing Queen’s University, Kingston, Ontario, Canada
[email protected] Abstract. A variable homography is presented which can improve mosaicing across image seams where parallax (i.e., depth variations) occur. Homographies are commonly used in image mosaicing, and are ideal when the acquiring camera has been rotated around its optical center, or when the scene being mosaiced is a plane. In most cases, however, objects in the overlapping areas of adjacent images have different depths and exhibit parallax, and so a single homography will not result in a good merge.To compensate for this, an algorithm is presented which can adjust the scale of the homography relating two views, based upon scene content. The images are first rectified so that their retinal planes are parallel. The scale of the homography is then estimated for each vertical position of a sliding window over the area of overlap, and the scaled homography is applied to this region. The scale is blended so that there are no abrupt changes across the homography field, and the images are finally stitched together. The algorithm has been implemented and tested, and has been found to be effective, fast, and robust.
1
Introduction
An image mosaic is a single image that is generated by transforming and merging a set of partially overlapping smaller images. A mosaic can be automatically generated by establishing correspondences from two or more images using either direct [12,18] or feature-based methods [10,7]. If an ordinal relationship between the images has been established (e.g., by manual interaction [13]) and the region of overlap is sufficiently large (∼25%) then the correspondences can be determined automatically using RANSAC methods. Brown and Lowe have recently described a method to automatically determine correspondences and subsequently mosaic a set of unordered images using rich feature descriptors [16]. When the images in the set have been acquired by rotating a camera about its optical center, then the region of overlap between any two images contains the same scene content, and their transformation is a 2D planar homography [10]. In practice, it is difficult to achieve this condition, using either a single or multiple cameras. There typically exists some small translational or off-center rotational component to the transformation. For mosaicing purposes, this transformation is nevertheless still approximated by a homography, often with the help of some M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 271–284, 2007. c Springer-Verlag Berlin Heidelberg 2007
post-processing refinement. When the imaged scene is far from the camera, then the weak perspective model holds, and the homography can provide a very good approximation of the transformation between images. If the weak perspective model does not hold, and there are scene elements that are relatively close in the region of overlap, then parallax occurs. A single 2D homography will then not be sufficient to describe the transformation between the affected images, and the result will be a poor mosaicing, either globally across the region of overlap, or locally in the vicinity of the offending scene elements.
2
Related Work
Many mosaicing methods have been developed in recent years, and the two main concerns that are considered in these works are image intensity alignment and structure alignment [9]. For structure alignment, the simplest case is when the acquiring camera has been rotated around its optical center, or when the scene being mosaiced is a plane. When a camera is rotated purely about its optical center, then the acquired images are related exactly by 2D homographies, and a panoramic image can be created by unwarping these images and projecting the scene onto a cylinder [15,4,26,23]. However, this is a rare case as pure rotation about the optical center is difficult to achieve. In most cases, camera motions involve not only rotation but also translation around the optical center. Affine transformations and planar projective transformations [11,21,18,22,17] have also been applied to image mosaicing. These methods can create significant distortions when the camera motion includes other rotations, however, or when the objects in the scene are at significantly different depths. The manifold projection method [25] was developed to overcome these difficulties by considering image structure alignment. To accomplish image mosaicing, manifold projection has three steps: image alignment, image cut and paste, and image blending. The projection is defined for almost any arbitrary camera motion and any scene structure, and there are no distortions caused by the alignment to a reference frame. However, this method is restricted to images which have a significant degree of overlap, such as those from dense video sequences. Other mosaicing methods focus on intensity alignment. In [19,24,3], approaches are presented to minimize the intensity differences across seams by smoothing the transition between the images. If the structure alignment has not been done properly, however, these methods leave artifacts in the mosaic, such as double edges. Stitching images along a seam where a minimum of intensity differences takes place in the area of overlap is introduced in [1,6]. Levin et al. proposed an algorithm that operates in the gradient domain [5]. The authors introduce a method called GIST (Gradient-domain Image STitching) to deal with the intensity parallax of stitched images. The method cannot, however, handle large structure misalignment. Two other methods are proposed in [14] and [2]. Agarwala et al. [2] describes a human interactive approach for combining parts of a set of images into a single composite picture. In [14], the authors achieve seamless image stitching
for eliminating visual artifacts caused by intensity discrepancy and structure misalignment. However, this method could not handle the case when there are no feature points between aligned images. A limitation of the above methods is that they do not perform well on images which have a fairly small area of overlap [20]. In the next sections, we demonstrate a reliable method to eliminate both structure and intensity parallax when mosaicing images which have a small area of overlap.
3
Variable Homography
To compensate for parallax effects across mosaic seams, we introduce the variable homography, whereby a single 2D planar homography is scaled according to the scene content.
Proposition: Let ΠA and ΠB be two planes which are within the field-of-view of two rectified image planes πl and πr. Assume that the transformation between ΠA and ΠB is a pure translation |ZB − ZA| along the mutual direction of the optical axes of πl and πr, as illustrated in Fig. 1. Let HA and HB be the 2D planar homographies that relate πl and πr over planes ΠA and ΠB, respectively. HA and HB are then related as:

HB = Kl HA Kr⁻¹   (1)

where

Ki = [ k  0  (1 − k)ui0/f ;  0  k  (1 − k)vi0/f ;  0  0  1 ],  i = l, r   (2)
In the above, (ui0, vi0) are the principal points of the left (i = l) and right (i = r) cameras, k = ZA/ZB is the ratio of the distances of the scene planes, and f is the focal length (which we presume to be common for both cameras).
Proof: Let PA be a point in ΠA and let lA and rA be the correspondences of PA in πl and πr respectively, i.e.:

lA = HA rA
(3)
Similarly, lB = HB rB for point PB on plane ΠB. We examine the relationship between lA and lB for a single pinhole camera system. Point PA = (XA, YA, ZA) is mapped to point lA on image plane πl, and PB = (XB, YB, ZB) is mapped to the point lB. For focal length f and principal point (u0, v0), under the projective model lA = (xA, yA, zA) = (f·XA/ZA + u0, f·YA/ZA + v0, f) and lB = (xB, yB, zB) = (f·XB/ZB + u0, f·YB/ZB + v0, f). Planes ΠA and ΠB are related by a translation along the Z-direction, so that XA = XB and YA = YB. Denoting the scale factor as k = ZA/ZB, the relationship between lA and lB can then be expressed as:

(xB, yB, zB)ᵀ = [ k  0  (1 − k)u0/f ;  0  k  (1 − k)v0/f ;  0  0  1 ] (xA, yA, zA)ᵀ   (4)
Fig. 1. Variable Homography for Rectified Image Pair
Extending this to the simple stereo system of Fig. 1, lB = KL lA and rB = KR rA. Substituting the expressions for lB and rB into Eq. 3 gives:

HB = KL HA KR⁻¹
(5)
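A minimal sketch (NumPy assumed) of rescaling a seed homography with the matrices of Eq. 5, written here for the single-camera case where KL = KR = K:

import numpy as np

def scale_matrix(k, u0, v0, f):
    # K of Eq. 2 for scale factor k = Z_A / Z_B and principal point (u0, v0).
    return np.array([[k, 0.0, (1.0 - k) * u0 / f],
                     [0.0, k, (1.0 - k) * v0 / f],
                     [0.0, 0.0, 1.0]])

def scaled_homography(H_A, k, u0, v0, f):
    K = scale_matrix(k, u0, v0, f)
    return K @ H_A @ np.linalg.inv(K)                # H_B = K H_A K^{-1} (Eq. 5 with K_L = K_R)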
If the left and right cameras have the same intrinsic parameters (i.e., they are a single camera at two positions at two different times), then K = KL = KR and Eq. 5 can be further simplified as:

HB = K HA K⁻¹   (6)

4 Mosaicing With Variable Homography
The results from the previous section can be applied in a three step algorithm to efficiently compensate for parallax that may occur across a mosaic seam from a stereo pair. We assume that the stereo images have been rectified so that their image planes are parallel, as in Fig. 1. In the first step, a seed homography is computed, which is equivalent to HA in Eq. 5. In the second step, one or more k values are estimated in the region of overlap. This can be done using either a feature-based method by evaluating the symmetric transfer error, or a correlation-based method by evaluating the Sum of Squared Differences (SSD) for varying values of k. In the final third step of the algorithm, the images are merged at the seam using Eq. 5. If multiple k values have been computed, then the values are blended smoothly using a second order Bezier polynomial. The following subsections describe the above three steps in greater detail. 4.1
4.1 Computing a Seed Homography
A seed homography H_A is estimated from the scene, either from naturally occurring scene content, such as a landscape at a distance, or from a planar target that has
been explicitly placed in the scene during a preprocessing stage. Corners are extracted from the overlap region using the Harris corner detector and are automatically matched using cross-correlation. A proportion of these correspondences will in fact be mismatches, and so a RANSAC method is used to estimate H_A, and in so doing identify a set of inliers consistent with this homography. The Levenberg-Marquardt (LM) algorithm is then employed as in [10] to further optimize H_A by minimizing the reprojection error cost function described in Eq. 7.
\min \sum_i d(x_i, \hat{x}_i)^2 + d(x'_i, \hat{x}'_i)^2 \quad \text{subject to} \quad \hat{x}'_i = \hat{H}_A \hat{x}_i \ \forall i    (7)
We define a measurement vector X_i = (x_i^T, x'^T_i)^T. In this case, the parameter vector P may be composed as P = (h^T, \hat{x}_1^T, \hat{x}_2^T, ..., \hat{x}_n^T)^T, where the values \hat{x}_i are the estimated values of the image points in the first image, and h is a vector of the entries of homography H_A:
h = \begin{pmatrix} h_1 \\ h_2 \\ h_3 \end{pmatrix}, \quad H_A = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}    (8)

where h_j^T denotes the jth row of H_A.
Thus, one must simultaneously estimate the homography H and the parameters of each point in the first image. The function f maps P to (\hat{X}_1, \hat{X}_2, ..., \hat{X}_n), where each \hat{X}_i = (\hat{x}_i, H_A \hat{x}_i) = (\hat{x}_i, \hat{x}'_i), with \hat{x}'_i = (\hat{x}', \hat{y}', \hat{w}')^T. The Jacobian matrices are:

\partial \hat{X}_i / \partial \hat{x} = \begin{bmatrix} I \\ \partial \hat{x}' / \partial \hat{x} \end{bmatrix}    (9)

where

\partial \hat{x}' / \partial \hat{x} = \frac{1}{\hat{w}'} \begin{bmatrix} h_1^T - \hat{x}' h_3^T \\ h_2^T - \hat{y}' h_3^T \end{bmatrix}    (10)

and

\partial \hat{X}_i / \partial h = \begin{bmatrix} 0 \\ \partial \hat{x}' / \partial h \end{bmatrix}    (11)

where

\partial \hat{x}' / \partial h = \frac{1}{\hat{w}'} \begin{bmatrix} \hat{x}^T & 0^T & -\hat{x}' \hat{x}^T \\ 0^T & \hat{x}^T & -\hat{y}' \hat{x}^T \end{bmatrix}    (12)
4.2 Estimating k
Once the seed homography HA has been calculated, it is possible to calculate one or more values of HB by estimating k over the region of overlap. If the overlap region is known to contain a single planar element, then a feature-based method can be used to estimate k using a RANSAC method. For each of N feature point correspondences, the scale factor k was estimated and HB was computed from Eq. 5 and applied to the remaining features. An error function was measured for each estimate of k based upon the symmetrical transfer error Es . This assumes
that errors have occurred in both images, and sums the geometric errors in both the forward (H_B) and backward (H_B^{-1}) directions:
E_s = \sum_{i=1,...,N} d(x_i, H_B^{-1} x'_i)^2 + d(x'_i, H_B x_i)^2    (13)
A value of Es was calculated for each of the N estimates of k that were determined for the N correspondences, and the HB associated with the smallest Es value was returned as the homography. The above feature-based method assumes that the scene content in the region of overlap is a single planar object, and is used for testing and validating the method. In the more general case, which is the main motivation for this approach, there exist a number of objects in the region of overlap, which are at different depths. A single homography is therefore not sufficient to model the transformation between the images, and it is necessary to estimate a number of values of k. A correlation-based approach is applied in this case to estimate a value of k at each vertical position within the region of overlap. A small (e.g., 10x10 pixel) mask is convolved horizontally across each image for each row. The minimal value of the Sum of Squared Differences is used to determine a unique value of k for each vertical position.
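A minimal sketch of the feature-based search follows. It is our own code, assumes the scale_homography helper from the earlier sketch, and uses a grid of candidate k values in place of the per-correspondence estimates described above; all names are hypothetical.

    import numpy as np

    def transfer(H, pts):
        # Apply a 3x3 homography to Nx2 inhomogeneous points.
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return ph[:, :2] / ph[:, 2:3]

    def symmetric_error(H, pts_l, pts_r):
        # Eq. 13: forward (H) and backward (H^{-1}) geometric errors.
        Hinv = np.linalg.inv(H)
        fwd = np.sum((pts_l - transfer(H, pts_r)) ** 2, axis=1)
        bwd = np.sum((pts_r - transfer(Hinv, pts_l)) ** 2, axis=1)
        return np.sum(fwd + bwd)

    def estimate_k(H_A, pts_l, pts_r, cam, k_candidates):
        # Pick the depth ratio k whose scaled homography H_B (Eq. 5/6)
        # minimizes the symmetric transfer error over the correspondences.
        return min(k_candidates,
                   key=lambda k: symmetric_error(
                       scale_homography(H_A, k, cam, cam), pts_l, pts_r))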
4.3 Image Blending
Histogram specification may be applied as a preprocessing step to compensate for radiometric variations across the images. Such variations can be caused by differences in the apertures of the two cameras, or varying lighting conditions if the images are acquired at different times. After the homography has been calculated, the right image is projected into the left image coordinate frame. The transformed area of overlap between the two images will not contain the exact same information, and so it is necessary to blend the two images to result in a seamless mosaic. One possible approach is to use multiband blending, as in [16]. In our case, we blend the two images at the column of minimum intensity difference for each row. In the overlapping area, we compute the intensity differences between each pair of overlapping pixels in each column j over every row i:
B_i = \arg\min_j (I_L(i,j) - I_R(i,j))    (14)

The main idea behind this blending is that, in many cases, the images share a pixel of similar value, or sometimes the exact same value, for every row in the overlap region. This straightforward method is able to produce a smoothly blended image without exhibiting any blurring effects which we observed in the multiband blending approach. When an image is projected to another image coordinate frame, the new coordinates in general will not be integers, which causes an aliasing effect. Bilinear interpolation is applied to the transformed image coordinates to resolve the aliasing problem.
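A sketch of this per-row seam selection is shown below (our own illustration; using the absolute difference is an assumption about how "minimum intensity difference" is measured in Eq. 14):

    import numpy as np

    def blend_rows(overlap_left, overlap_right):
        # For every row, stitch at the column where the two overlapping
        # images agree best (Eq. 14): left pixels before the seam, right after.
        rows = overlap_left.shape[0]
        out = np.empty_like(overlap_left)
        for i in range(rows):
            diff = np.abs(overlap_left[i].astype(np.int32) -
                          overlap_right[i].astype(np.int32))
            if diff.ndim > 1:              # colour images: sum over channels
                diff = diff.sum(axis=-1)
            b = int(np.argmin(diff))       # seam column B_i for this row
            out[i, :b] = overlap_left[i, :b]
            out[i, b:] = overlap_right[i, b:]
        return out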
Fig. 2. Images Merged with H_A and Computed H_B: (a) Left Image, (b) Right Image, (c) Merged with H_A, (d) Left Image, (e) Right Image, (f) Merged for Π_B with computed H_B
5 Experimental Results
We executed three experiments to test the algorithm. In the first experiment, the region of overlap was a single planar object, the chessboard of Fig. 5.1. This experiment used the feature-based k estimation method described above, and was executed to validate the method. The second and third experiments contained a number of objects at various depths in the region of overlap, and exhibited significant degrees of parallax if a single homography was applied. These two experiments used the correlation-based k estimation method.
5.1 Experiment 1
In this experiment, a single consumer-grade digital camera (Fuji 6800Z) was used to acquire two sets of stereo pair images of a planar chessboard object. A laser level was used to place the camera parallel to the 3D plane. The camera was first placed in the left position with the smallest possible focal length and an image was acquired (Fig. 5.1(a)). In this same position, the camera lens was then zoomed to its maximum value, and a second image was acquired (Fig. 5.1(d)). As only the focal length of the lens changed between these two images, the plane was effectively translated purely along the optical axis. These images therefore represented the left image planes π_l for target planes Π_A and Π_B, respectively. (It is arbitrary which is the near and which is the far plane.) The camera was then translated to the right, and this procedure was repeated for the right image pairs (Figs. 5.1(b) and 5.1(e)).
Fig. 3. Reprojection Error Minimization
From the left and right small focal length images, the seed homography HA was calculated. In this case, there was no need to rectify the images, as the mechanical alignment using the laser level resulted in πl and πr being parallel to a sufficient accuracy. The LM algorithm was next applied to optimize HA. Fig. 3 shows that the LM algorithm had a fast convergence, converging in ∼ 2 iterations. Following the LM algorithm, the resulting HA was used to project the right image to the left image, and the mosaiced result is shown in Fig. 5.1(c). For the longer focal length image pair (Figs. 5.1(d) and 5.1(e)), k was calculated using the feature-based method. The values of the symmetric error Es for the N correspondences are illustrated in Fig. 5. The minimal Es value is very distinct, and occurs when k = 1.16. This value was used to compute a new homography HB from Eq. 5. The mosaiced image for this homography value is shown in Fig. 5.1(f). It is clear from the visual quality of the merge that the correct homography has been computed for the second image pair, and the basic method is validated.
5.2 Experiment 2
In the second experiment, pairs of images were acquired of buildings at a distance with scene elements in the foreground. Fig. 4 shows an image pair with a building in the background and some tents in the foreground. The stereo pair was rectified, and the seed homography HA was computed from the method described in Sec. 4.1 using the feature points shown in Fig. 4(a) and (b). A column to merge the two images was manually selected, and the mosaiced image using HA is shown in Fig. 5.2(a). Because we used the feature points on the building to calculate HA , the building itself is merged very well; however, the tents are at different depths than the building, so when a single homography was used for the entire image, the tent in the region of overlap was not merged well. In particular, half of the mosaiced tent appeared to be missing due to parallax effects.
Fig. 4. Image Pair of Buildings with Tents in Foreground, with Feature Points
Fig. 5. Scale Factor k vs. Symmetric Error Es
As both the buildings and the tents were far from the camera, we regarded the building as being parallel to the tents. The scale factor k was then automatically computed and another homography HB was determined for each horizontal position across the region of overlap. The result is shown in Fig. 5.2(b). It can be seen that the new values of HB are more suitable for both the building in the background, and the tent in the foreground. The Bezier smoothing of the HB values that result from the various k values causes some global deformation along the building to the right of the left-most tent. This minor deformation would in many cases be preferable to the major parallax effect that occurs without the variable homography. There may be further post processing methods that could be applied to reduce the effects of such a deformation. This experiment was repeated using images of the Kingston court house in Fig. 7. In this case, the selected merge column intersected with one of the trees and a truck in the foreground. Fig. 8 shows the region of overlap mosaiced in three different ways. In (a), a single homography was computed from the background court house. The tree and truck (identified with an arrow) can be seen to be compressed due to parallax. In (b), a single homography was computed, this time from the foreground tree and truck objects. In this case, the tree and truck were merged correctly, but the background court house has been corrupted. In (c), the variable homography approach was applied, with HA computed from the background building. It can be seen that both the background and foreground objects were merged correctly. The complete mosaiced image resulting from the variable homography of (c) is illustrated in Fig. 9.
Fig. 6. Mosaiced Images Using Single and Variable Homographies: (a) Single Homography, (b) Variable Homography
Fig. 7. Kingston Court House
5.3 Experiment 3
In the third experiment, a surveillance system involving two cameras is installed in a vehicle. Such systems based upon a single camera are common in the taxi industry, where they are used to monitor passengers to improve the security of the drivers. In this case, the objective is to use two cameras to increase the field-of-view, while maintaining high resolution, so that passengers' facial features can be more accurately identified. The objective is to merge the resulting image pair to result in a single high resolution seamless mosaiced image. The region of overlap in this image pair is therefore very small, so that the resolution of each image can be used to maximum benefit. It was challenging to rectify the two images, because it was hard to find correspondences between two cameras with such a small region of overlap. To rectify this stereo system, we used another stereo system which had a significant overlap with the first system. We use Fusiello's method [8] to rectify the stereo pairs. It was easy to calibrate and rectify stereo system 2, as the overlap between the image pairs was large. It was also straightforward to calibrate the two new stereo systems which are made of the left two cameras and the right two cameras, respectively. We then could rectify the images taken by stereo system 1 to stereo system 2. This meant that the left camera in stereo system 1 was rectified to the left camera in stereo system 2, and the right camera in stereo system 1 was rectified to the right camera in stereo system 2. Finally, we rectified these two new images to make the two image planes in stereo system
Fig. 8. Overlap Region, Merged with: (a) Single HA Computed from the Building, (b) Single HA Computed from the Tree, (c) Variable Homography
1 collinear. Once the rectification transformations were obtained, they could be applied to all the images acquired by stereo system 1, so that we were then able to apply the algorithm from Sec. 3. We still used a chessboard to compute the seed homography HA for the image pair. In the former experiments, we used feature points to seek the most suitable scale factor k and homography HB for a new pair of images; however, it was different in this application, because it was not guaranteed that there were correspondences for every pair of images. Instead, we found that the correlation-based method was flexible in dealing with this situation. A pair of images from this experiment is shown in Fig. 10. Fig. 11 shows the scale factor k changing in the overlapping area. There are two dominant k values, one of which represents the back window region, and the other of which represents the bottom of the image which is the back seat area. Fig. 11 also illustrates that the back seat area is closer to the cameras than the back window area. Fig. 12(a) shows the overlap region merged with a single value of k, representing the rear window of the vehicle. Parallax effects can be seen over the
Fig. 9. The Complete Mosaic with Variable Homography
Fig. 10. Images Acquired Inside Vehicle
Fig. 11. Image Row vs. Scale Factor k
Fig. 12. Overlap Region, Mosaiced with: (a) One Homography, (b) Two Homographies, (c) Variable Homography
passenger. In Fig. 12(b), two homographies resulting from both values of k are used to mosaic the two images. The abrupt transition between the two values results in a “saw” effect over the entire image. We therefore use these two homographies only to stitch along the overlapping area, as in Fig. 12(c). We
Fig. 13. Variable Homography Merge Path
gradually merged the two homographies in the part of the image that is far from the scene so that the mosaiced image appears smooth. A third order Bezier polynomial was used, and the final blended image with the merge path highlighted is shown in Fig. 13.
6 Conclusions
The variable homography has been introduced as an approach to compensate for parallax in mosaiced images. By applying an appropriate scaling to a nominal seed homography, it is possible to merge objects at varying depths along the region of overlap. The method has been shown to be effective in a number of different scenarios, including outdoor panoramic scenes. It is more efficient than full stereovision reconstruction, and has the potential to be applied in real time.
Acknowledgements

The authors would like to thank Precarn Inc. and VerifEye Technologies Ltd. for their support of this work.
References

1. Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proceedings of SIGGRAPH 2001, pp. 341–346 (2001)
2. Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., Cohen, M.: Interactive digital photomontage. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH 2004), ACM Press, New York (2004)
3. Adelson, C.B., Anderson, E.H., Burt, J.: Pyramid method in image processing. In: RCA Engineer, pp. 33–41 (1984)
4. Krishnan, A., Ahuja, N.: Panoramic image acquisition. In: IEEE CVPR 1996, pp. 379–384. IEEE Computer Society Press, Los Alamitos (1996)
5. Levin, S.A., Zomet, A., Weiss, Y.: Seamless image stitching in the gradient domain. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 377–389. Springer, Heidelberg (2004)
6. Davis, J.: Mosaics of scenes with moving objects. In: CVPR 1998 (1998)
7. Capel, D., Zisserman, A.: Automated mosaicing with super-resolution zoom. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 885–891 (1998)
8. Fusiello, A., Trucco, E., Verri, A.: Rectification with unconstrained stereo geometry. In: Proc. Brit. Mach. Vis. Conf., September 1997, pp. 400–409 (1997)
9. Guerreiro, R.F.C., Aguiar, P.M.Q.: Global motion estimation: feature-based, featureless, or both?! In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4141, pp. 721–730. Springer, Heidelberg (2006)
10. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
11. Sawhney, S.H., Gorkani, M.: Model-based 2d and 3d dominant motion estimation for mosaicing and video representation. In: Fifth International Conference on Computer Vision, pp. 583–590 (1995)
12. Shum, H., Szeliski, R.: Construction and refinement of panoramic mosaics with global and local alignment. In: IEEE Int'l Conf. Computer Vision, pp. 953–958. IEEE Computer Society Press, Los Alamitos (1998)
13. http://www.realviz.com/
14. Jia, J., Tang, C.: Eliminating structure and intensity misalignment in image stitching. In: Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 2, pp. 1651–1658. IEEE Computer Society Press, Los Alamitos (2005)
15. McMillan, L., Bishop, G.: Plenoptic modeling: An image-based rendering system. In: SIGGRAPH (1995)
16. Brown, M., Lowe, D.G.: Recognising panoramas. In: ICCV 2003. Proceedings of the 9th International Conference on Computer Vision, pp. 1218–1225 (2003)
17. Hansen, M., Anandan, P., Dana, K., der Wal, G., Burt, P.: Real-time scene stabilization and mosaic construction. In: ARPA Image Understanding Workshop, pp. 457–465 (1994)
18. Irani, P.M., Hsu, S.: Mosaic based representative of video sequences and their applications. In: Proc. 5th Int'l Conf. on Computer Vision, Boston, pp. 605–611 (1995)
19. Uyttendaele, A.M., Szeliski, R.: Eliminating ghosting and exposure artifacts in image mosaics. In: CVPR 2001 (2001)
20. Pires, B.E., Aguiar, P.M.Q.: Registration of images with small overlap. In: MMSP'04. IEEE Intl. Workshop Multimedia Signal Proc., IEEE Computer Society Press, Los Alamitos (2004)
21. Jaillon, P., Montanvert, A.: Image mosaicking applied to three-dimensional surfaces. In: 12th International Conference on Pattern Recognition, pp. 253–257 (1994)
22. Burt, P.J., Anandan, P.: Image stabilization by registration to a reference mosaic. In: ARPA Image Understanding Workshop, pp. 457–465 (1994)
23. Mann, S., Picard, R.: Virtual bellows: Constructing high quality stills from video. In: First IEEE International Conference on Image Processing, IEEE Computer Society Press, Los Alamitos (1995)
24. Peleg, S.: Elimination of seams from photomosaics. In: CGIP, vol. 16, pp. 90–94 (1981)
25. Peleg, S., Herman, J.: Panoramic mosaics by manifold projection. In: IEEE CVPR 1997. Proceedings, pp. 338–343. IEEE Computer Society Press, Los Alamitos (1997)
26. Halfhill, T.R.: See you around. In: Byte Magazine, pp. 85–90 (1995)
Deformation Weight Constraint and 3D Reconstruction of Nonrigid Objects

Guanghui Wang1,2 and Q.M. Jonathan Wu1

1 Department of Electrical and Computer Engineering, The University of Windsor, 401 Sunset, Windsor, Ontario, Canada N9B 3P4
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100080, P.R. China
[email protected], [email protected]

Abstract. The problem of 3D reconstruction of nonrigid objects from uncalibrated image sequences is addressed in this paper. Our method assumes that the nonrigid object is composed of a rigid part and a deformation part. We first recover the affine structure of the object and separate the rigid features from the deformation ones. We then stratify the structure from affine to Euclidean space by virtue of the rigid features. The novelty of the paper lies in two aspects. First, we propose a deformation weight constraint for the problem and prove that the recovered shape bases and shapes are transformation invariant under this constraint. Second, we propose a constrained power factorization (CPF) algorithm to recover the deformation structure in affine space. The algorithm overcomes the limitations of previous SVD-based methods and can work with missing data in the tracking matrix. Extensive experiments on synthetic data and real sequences validate the proposed method and show improvements over existing solutions.
1 Introduction
Three dimensional reconstruction from image sequences is an important and essential task of computer vision. During the last two decades, many approaches have been proposed for different applications. Among them, factorization based methods are widely studied since these approaches deal uniformly with data sets in all images and achieve good robustness and accuracy [12,13,18]. The factorization method was first proposed by Tomasi and Kanade [14] in the early 90's for the orthographic projection model. The main idea of this algorithm is to factorize the tracking matrix into motion and structure matrices simultaneously by singular value decomposition (SVD) with a rank constraint. It was later extended to weak perspective in [12] and the perspective camera model in [5]. In the case of uncalibrated cameras, Quan [13] proposed a self-calibration algorithm for affine cameras. Triggs [16] proposed a rank-4 factorization scheme to achieve projective reconstruction. The method was further studied by many researchers [10]. Most previous algorithms are based on SVD factorization to find a low-rank approximation of the tracking matrix. Recently, Hartley and
Schaffalitzky [9] proposed to solve the problem by power factorization (PF). The method is derived from the Power method in matrix computation and the sequential factorization method proposed by Morita and Kanade [11]. Vidal and Hartley [17] also proposed to combine the PF algorithm with motion segmentation. One noteworthy advantage of the PF algorithm is that it can handle the missing data problem in the tracking matrix.

All the above methods work only for rigid objects and static scenes. In the real world, however, many scenarios are nonrigid or dynamic. Examples include human faces carrying different expressions, lip movements, a walking person, moving vehicles, etc. Many extensions stemming from the factorization algorithm were proposed to relax the rigidity constraint [1,6]. In the pioneering work on nonrigid factorization by Bregler et al. [4], it is demonstrated that the 3D shape of many nonrigid objects can be expressed as a weighted linear combination of a set of shape bases. Then the shape bases, weighting coefficients and camera motions were factorized simultaneously under the rank constraint of the tracking matrix. Following this idea, the problem came to the attention of many researchers and was widely studied, such as Brand [2,3], Del Bue et al. [7,8], Torresani et al. [15] and Xiao et al. [20,21]. Torresani et al. [15] introduced an iterative algorithm to optimize the recovered shape and motion. Brand [2] generalized the method via nonlinear optimization and proposed subspace flow to search for correspondences. These methods use the orthonormal (rotation) constraint for Euclidean reconstruction. This may cause ambiguity in the combination of shape bases. Xiao et al. [20] proposed the basis constraint to solve this ambiguity and provided a closed form solution to the problem. However, it is difficult to select the shape bases automatically for contaminated data. Del Bue et al. proposed to segment the rigid parts of the object directly from the tracking matrix either from a rank-3 constraint [7] or an epipolar constraint [8]. They then recover the nonrigid shape by a constrained nonlinear optimization process. However, the segmentation may be difficult in the case of small inter-frame movement.

In this paper, we try to solve the problem from a new viewpoint based on the uncalibrated affine camera model. Our main idea is to first reconstruct the object in affine space, then detect and separate the deformation parts from the rigid ones and upgrade the solution to the Euclidean space. The remaining parts of the paper are organized as follows: Some background on nonrigid factorization is reviewed in section 2. The deformation weight constraint is proposed and proved in section 3. The constrained power factorization algorithm is elaborated in section 4. Some test results are presented in section 5, followed by conclusions of this paper in section 6.
2 Background on Nonrigid Factorization
Under the assumption of an affine camera, the projection of a 3D point X_i to an image point x_i can be modelled as x_i = AX_i + t, where A is a 2 × 3 rank-2 matrix. The translation term t can be eliminated if we register the coordinates of all the
image points in each image to the centroid. Given a sequence of m video frames of a nonrigid object, let {x_i^{(j)} ∈ R^2}, i = 1,...,n, j = 1,...,m, be n tracked feature points across the sequence. We want to recover the shape S^{(j)} = {X_i^{(j)} | i = 1,...,n} ∈ R^{3×n}, i.e. the corresponding 3D coordinates of the features associated with each frame. Following the study of Bregler et al. [4], the nonrigid shape in the Euclidean space is approximated by a linear combination of k shape bases as

S^{(j)} = \sum_{l=1}^{k} ω_l^{(j)} S_l    (1)

where S_l ∈ R^{3×n}, l = 1,...,k, are the shape bases that embody the principal modes of the deformation, and ω_l^{(j)} ∈ R is the deformation weight. Suppose all the n features are tracked across the m frames. Then the factorization equation of the tracking matrix can be expressed as
W_{2m×n} = M_{2m×3k} B_{3k×n}    (2)

where

W = \begin{bmatrix} x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \quad B = \begin{bmatrix} S_1 \\ \vdots \\ S_k \end{bmatrix}, \quad M = \begin{bmatrix} ω_1^{(1)} A^{(1)} & \cdots & ω_k^{(1)} A^{(1)} \\ \vdots & \ddots & \vdots \\ ω_1^{(m)} A^{(m)} & \cdots & ω_k^{(m)} A^{(m)} \end{bmatrix}
It is easy to see from (2) that the rank of the tracking matrix W is at most 3k. By performing SVD on the tracking matrix and imposing the rank constraint, we have W = \hat{M}_{2m×3k} \hat{B}_{3k×n}. However, the decomposition is not unique since it is only defined up to a non-singular 3k × 3k linear transformation, as W = \hat{M}\hat{B} = (\hat{M}G)(G^{-1}\hat{B}). If matrix G is known, then A^{(j)}, S_l and ω_l^{(j)} can be recovered straightforwardly. The computation of the transformation G is an extremely important and difficult step. Many researchers [2,4,15] utilize the rotation constraint of the motion matrix. However, the rotation constraints alone may be insufficient when the object deforms at varying speed, thus a basis constraint is combined to solve the ambiguity [20]. Nevertheless, the nonrigid factorization algorithm does not work as perfectly as the rigid case due to the following reasons. First, it is difficult to select the set of bases automatically with noisy data. Second, the recovered matrix M = \hat{M}G usually does not conform to the replicated block structure as indicated in (2), therefore a Procrustes analysis is needed [2,4,15], which may introduce additional errors to the final results. Third, the camera intrinsic parameters are difficult to estimate for a nonrigid sequence.
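The rank-3k decomposition itself is a one-line SVD truncation. A minimal sketch of Eq. 2 (our own illustration, with hypothetical array names) is:

    import numpy as np

    def rank_3k_factorize(W, k):
        # W is the 2m x n tracking matrix; truncate its SVD to rank 3k to get
        # an (ambiguous) motion/shape factorization W ~= M_hat @ B_hat.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        r = 3 * k
        M_hat = U[:, :r] * np.sqrt(s[:r])            # 2m x 3k
        B_hat = np.sqrt(s[:r])[:, None] * Vt[:r]     # 3k x n
        return M_hat, B_hat

    # Note: M_hat and B_hat are only defined up to a 3k x 3k transformation G,
    # which is the ambiguity the rest of the paper avoids by working in affine space.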
3 Deformation Weight Constraint
Previous methods do not have any constraint on the deformation weight. We will prove in the following section that the recovered structure may not be transformation invariant (i.e. the relationship (1) between the shape bases, deformation weights and shapes does not hold true when they are under the same transformation) if there is no constraint on the recovered weight.
Theorem 1. Deformation weight constraint: When the shape S^{(j)} of each frame and the shape bases S_l are transformed by an arbitrary Euclidean transformation, the relationship (1) remains invariant if and only if the deformation weight satisfies the constraint \sum_{l=1}^{k} ω_l^{(j)} = 1.

Proof. The transformations in the Euclidean space can be termed as the Euclidean
transformation H_e = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} and the similarity transformation H_s = \begin{bmatrix} sR & t \\ 0^T & 1 \end{bmatrix}, where R is a 3 × 3 orthonormal rotation matrix, t is a 3D translation vector, 0 is a null 3-vector, and s is a similarity scalar. Without loss of generality, let us take H_s as an example. Suppose one set of the Euclidean shape bases and deformation weights are S_l and ω_l^{(j)} respectively; the corresponding shape of the jth frame S^{(j)} is given by (1). Under the similarity transformation H_s, the shape bases and shapes are transformed to S'_l and S'^{(j)} as

S'_l = sRS_l + T, \quad S'^{(j)} = sRS^{(j)} + T    (3)

where T = [t\ t\ \cdots\ t] is a 3 × n matrix. Combining (1) and (3), we have

S'^{(j)} = sRS^{(j)} + T = sR\left(\sum_{l=1}^{k} ω_l^{(j)} S_l\right) + T = \sum_{l=1}^{k} ω_l^{(j)} (sRS_l + T) + \left(1 - \sum_{l=1}^{k} ω_l^{(j)}\right) T    (4)
It is clear that the transformed shape bases and shapes also satisfy the relation of (1), i.e. S'^{(j)} = \sum_{l=1}^{k} ω_l^{(j)} S'_l, if and only if \sum_{l=1}^{k} ω_l^{(j)} = 1.

Theorem 1 tells us that if the recovered deformation weight satisfies the constraint, the relationship between the shape bases and object structure is invariant under any transformation in the Euclidean space.

Theorem 2. If the deformation weight constraint is satisfied, the structure in affine space can also be expressed as a linear combination of affine shape bases just as given in (1).

The transformation from Euclidean space to affine space can be modelled by the affine transformation H_a = \begin{bmatrix} P & t \\ 0^T & 1 \end{bmatrix}, where P is an invertible 3 × 3 matrix and t is a 3D translation vector. The proof for the affine transformation is similar to the Euclidean case in Theorem 1 and is omitted here. From Theorems 1 and 2, we can easily obtain the following corollary.

Corollary 1. Suppose the tracking matrix is decomposed in affine space, which gives a solution of affine motion matrix A^{(j)}, deformation weight ω_l^{(j)} and shape basis S_l. Then we can recover the affine structure S^{(j)} from (1). It is obvious that the solutions can be upgraded to the Euclidean space and retain the invariance of relationship (1) if ω_l^{(j)} satisfies the deformation weight constraint.

Corollary 1 suggests the feasibility of solving the problem by a stratification approach. In many cases, we have no knowledge about the cameras, thus it is
difficult to directly adopt the rotation constraint to compute the transformation matrix G and recover the Euclidean structure. However, as shown in the subsequent sections, we may first decompose the tracking matrix in affine space and then stratify the solution to the Euclidean space.
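The invariance claimed by Theorem 1 is easy to check numerically. The following sketch (our own illustration, not part of the paper) applies a random similarity transform to random shape bases and verifies that the weighted combination still reproduces the transformed shape when the weights sum to one:

    import numpy as np

    rng = np.random.default_rng(0)
    k, n = 3, 20
    bases = [rng.normal(size=(3, n)) for _ in range(k)]
    w = rng.random(k); w /= w.sum()            # weights satisfying sum(w) = 1

    # Random similarity transform: scale s, orthonormal R (via QR), translation t.
    s = 2.0
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    t = rng.normal(size=(3, 1))

    shape = sum(wi * S for wi, S in zip(w, bases))
    shape_t = s * R @ shape + t                # transformed shape
    bases_t = [s * R @ S + t for S in bases]   # transformed bases
    recombined = sum(wi * S for wi, S in zip(w, bases_t))

    print(np.allclose(shape_t, recombined))    # True, because sum(w) == 1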
4 CPF-Based Affine Reconstruction
The general power factorization method was proposed to find a low-rank approximation of the tracking matrix of a static scene [9]. In the case of nonrigid factorization, our objective is to recover the motion matrix M ∈ R^{2m×3k} and the shape matrix B ∈ R^{3k×n}. However, the motion matrix recovered by general power factorization does not observe the replicated block structure of the motion matrix in (2), thus it is defined up to a correction matrix G which is difficult to compute, as explained in section 2. We will now introduce a constrained power factorization (CPF) algorithm to solve this problem. Let us decompose the motion matrix of (2) into two parts as
M = Ω ⊗ Φ = \begin{bmatrix} ω_1^{(1)} E & \cdots & ω_k^{(1)} E \\ \vdots & \ddots & \vdots \\ ω_1^{(m)} E & \cdots & ω_k^{(m)} E \end{bmatrix} ⊗ \begin{bmatrix} A^{(1)} & \cdots & A^{(1)} \\ \vdots & \ddots & \vdots \\ A^{(m)} & \cdots & A^{(m)} \end{bmatrix}    (5)
where Ω denotes the weighting part, Φ denotes the affine motion part, E is a 2×3 matrix with unit entries, and ⊗ stands for element-by-element multiplication. The algorithm is summarized as follows.

Algorithm 1 (Constrained power factorization). Given the tracking matrix W ∈ R^{2m×n} and initial values Φ_0 ∈ R^{2m×3k} and B_0 ∈ R^{3k×n}, repeat the following 3 steps until the product (Ω_t ⊗ Φ_t)B_t converges.
1. Given Φ_{t-1} and B_{t-1}, find Ω_t to minimize ‖W − (Ω_t ⊗ Φ_{t-1})B_{t-1}‖_F, subject to Ω_t satisfying the deformation weight constraint;
2. Given Φ_{t-1} and Ω_t, find B_t to minimize ‖W − (Ω_t ⊗ Φ_{t-1})B_t‖_F;
3. Given Ω_t and B_t, find Φ_t to minimize ‖W − (Ω_t ⊗ Φ_t)B_t‖_F, subject to the constraint that Φ_t is a block replicated matrix.

The main difference of the algorithm from the general power factorization is that we divide the computation of M into two steps, Ω and Φ. This makes it possible to combine the weight constraint and the replicated block structure of the motion matrix in the computation. It is quite easy to combine the weight constraint into the minimization scheme in step 1, since we can always set ω_k^{(j)} = 1 − \sum_{l=1}^{k-1} ω_l^{(j)}. In order to observe the block replicated structure of Φ in step 3, denote W^{(j)} ∈ R^{2×n} as the jth two-row block of W, which is the tracking matrix of the jth frame; then we have

W^{(j)} = A^{(j)} S^{(j)}, \quad S^{(j)} = \sum_{l=1}^{k} ω_l^{(j)} S_l    (6)
where ω_l^{(j)} can be obtained from Ω_t in step 1, and S_l can be obtained from B_t in step 2. Therefore, A^{(j)} may be computed by least squares as

A^{(j)} = W^{(j)} S^{(j)T} \left( S^{(j)} S^{(j)T} \right)^{-1}    (7)

Then the block replicated matrix Φ_t can be reassembled from (5) and (7).

Initialization. The solution of the algorithm is not unique; it is defined up to an affine transformation. In the worst cases, the recovered affine structure may be stretched or squeezed greatly along a certain direction due to bad initialization. Usually, reasonable initial values of B_0 and Φ_0 can avoid the worst situation and improve convergence speed simultaneously. In our applications, we first utilize the rank-3 factorization method [12,13] to obtain a rigid approximation of the object as W = \hat{M}\hat{B}, where \hat{M} ∈ R^{2m×3} is the rigid motion matrix and \hat{B} ∈ R^{3×n} is the rigid shape of the object. Under this estimation, we compute the mean reprojection error of each point across the sequence and denote it as e. Then the initial values may be constructed as

Φ_0 = [\hat{M}, \cdots, \hat{M}], \quad B_0 = \begin{bmatrix} \hat{S}_1 + e^T ⊗ N_1 \\ \vdots \\ \hat{S}_1 + e^T ⊗ N_k \end{bmatrix}    (8)

where e^T ⊗ N_i is a reprojection-error-weighted shape balance matrix, and N_i ∈ R^{3×n} is a random matrix with small entries. This term ensures that the initial shape bases are independent of each other. Usually the motions recovered by rigid factorization are close to the real solutions, so we directly use \hat{M} to construct Φ_0 and update the rotation part at the last step in the algorithm, though the order of these steps can be interchanged. Experiments demonstrate that this initialization can usually give good results.

Working with missing data. It should be noted that each minimization step in the algorithm is equivalent to solving a set of equations by least squares; this allows us to deal with a tracking matrix with some entries unavailable. In the case of missing data, the cost function in the algorithm can be modified as
\min \sum_{(i,j)\in A} \left( W_{ij} − \left[ (Ω ⊗ Φ) B \right]_{ij} \right)^2    (9)
where W_{ij} denotes the (i, j)th element of the tracking matrix, and A stands for the set of available entries in the tracking matrix. Thus we can update Ω, B and Φ using only the available features in W according to (9). This is a very important attribute of the algorithm, since it is hard to have all the features tracked across the whole sequence due to self-occlusion in real applications. In contrast, it is difficult to deal with missing data using SVD-based methods.

Stratification to the Euclidean space. After recovering the 3D affine structure of the object, it is easy to separate the rigid features from the deformation ones by our earlier studies [19]. Then the whole structure can be stratified to the Euclidean space by virtue of the transformation computed from the rigid features. Please refer to [19] for more details.
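To make the alternation concrete, below is a compact sketch of one CPF iteration. It is our own simplification: it assumes complete data, solves each sub-problem with plain least squares, and all function and variable names are hypothetical.

    import numpy as np

    def cpf_step(W, Omega, Phi, B, m, k):
        # One pass of Algorithm 1 for a complete 2m x n tracking matrix W.
        # Omega: m x k weights, Phi: 2m x 3k replicated motion, B: 3k x n bases.
        S = [B[3*l:3*l+3] for l in range(k)]
        # Step 1: per-frame weight update with the constraint sum_l w_l = 1,
        # obtained by eliminating the last weight.
        for j in range(m):
            Wj, A = W[2*j:2*j+2], Phi[2*j:2*j+2, 0:3]
            P = np.stack([(A @ (S[l] - S[k-1])).ravel() for l in range(k-1)], axis=1)
            r = (Wj - A @ S[k-1]).ravel()
            w = np.linalg.lstsq(P, r, rcond=None)[0]
            Omega[j] = np.append(w, 1.0 - w.sum())
        # Step 2: update the shape bases B by linear least squares.
        M = np.hstack([np.repeat(Omega[:, l], 2)[:, None] * Phi[:, 0:3]
                       for l in range(k)])
        B = np.linalg.lstsq(M, W, rcond=None)[0]
        # Step 3: update each affine camera A^(j) by Eq. 7 and replicate it.
        for j in range(m):
            Sj = sum(Omega[j, l] * B[3*l:3*l+3] for l in range(k))
            Aj = W[2*j:2*j+2] @ Sj.T @ np.linalg.inv(Sj @ Sj.T)
            Phi[2*j:2*j+2] = np.tile(Aj, (1, k))
        return Omega, Phi, B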
5 Experimental Evaluations

5.1 Tests with Synthetic Data
We generated a synthetic cube with three visible surfaces in the space, whose dimension was 10 × 10 × 10 with 9 evenly distributed points on each edge. There were three sets of moving points (17 × 3 points) on the adjacent surfaces of the cube that move on the surfaces at constant speed, as shown in Fig.1. The object was composed of 90 rigid points and 51 deformation points. We generated 20 frames with different camera parameters by perspective projection. The image size was 500 × 500, while the distance of the camera to the object was set at about 12 times the object size such that the imaging condition was close to affine. During the test, 1-pixel Gaussian noise was added to the images. We recovered the 3D affine structure by the proposed CPF algorithm and stratified the solution to the Euclidean space. The result is shown in Fig.1(c), where all structures are registered to the first frame via the rigid parts [19]. One may see from the results that the deformation structure is correctly recovered.
Fig. 1. Synthetic data and reconstruction results. (a)&(b) the synthetic cubes correspond to the first and last frames, where rigid and moving points are denoted by circles and dots respectively; (c) the registered 3D Euclidean structures of the 20 frames.
Fig. 2. Performance evaluation results. We registered the reconstructed structure to the ground truth and compared the mean and standard deviation of the distances between the recovered structure and its ground truth associated with each frame.
It should be noted that the recovered structure in Fig.1(c) is defined up to a 3D similarity transformation with the ground truth. For evaluation, we
computed the similarity matrix by virtue of the point correspondences between the recovered structure and the ground truth, then registered the structure with the ground truth. We calculated the distances between all the corresponding point pairs. Fig.2 shows the mean and standard deviation of the distances associated with each frame. We also performed a comparison with the SVD-based method [20], as shown in Fig.2, from which we can see that the proposed method outperforms the SVD-based algorithm.

5.2 Tests with Real Sequences
The first test was performed on the Franck sequence, which was downloaded from the European working group on face and gesture recognition. We selected 60 frames with various facial expressions for the experiment. The resolution was 720 × 576 and there were 68 tracked features across the sequence. Fig.3 shows the reconstructed VRML models of two frames by the proposed method. We can see from the results that different facial expressions are correctly recovered. The results could be used for visualization and recognition. However, the structure of some feature points was not very accurate due to tracking errors. For comparison, the relative reprojection errors by the proposed method and the SVD-based algorithm are 6.4018 and 6.4246 respectively.
Fig. 3. Reconstruction results of Franck sequence. Left: two frames from the sequence overlaid with 68 tracked features; Right: the front and side views of the reconstructed VRML models and corresponding wireframes.
The second test was on the scarf sequence. There were 15 images, and the scarf was pressed during shooting so as to obtain a deformation sequence. The image resolution was 1024 × 768. We established the initial correspondences by the method of Yao and Cham [22] and interactively deleted some outliers. As shown in Fig.4, 2986 features were tracked across the sequence. Fig.4 shows the reconstructed VRML models and wireframes of two frames. The recovered structures were visually plausible and realistic.
Fig. 4. Reconstruction results of scarf sequence. Upper: two frames from the sequence and the tracked features with relative disparities; Lower: the reconstructed VRML models and corresponding wireframes from different viewpoints.
The background of the sequence is two orthogonal sheets with square grids which are used as ground truth for evaluation. We computed the angle between the two sheets, the length ratio of the two diagonals of each square and the angle formed by the two diagonals. The mean errors of the three values recovered by the proposed method are 2.176◦, 0.093 and 0.425◦ respectively. The errors by the SVD-based algorithm are 3.455◦, 0.126 and 0.624◦ respectively. The proposed method performs better in the test.
6 Conclusions
In this paper, we first proposed the deformation weight constraint and proved the invariant relationship between the recovered structure and shape bases under this constraint. Then we presented the constrained power factorization to recover the deformation structure in affine space. The algorithm overcomes some limitations of previous SVD-based methods; it is easy to implement and can work with missing data. The affine solution can be stratified to the Euclidean space by the transformation matrix computed from the rigid features. Experiments with synthetic data and real sequences validate the proposed algorithms and show the improvements over the SVD-based method. It is assumed in the paper that all features are correctly tracked. However, some features may not be located precisely or tracked correctly in practice, which may cause a failure of the algorithm. We are currently still working on this problem.
Acknowledgment

The work is supported in part by the Canada Research Chair program, the Ontario Centres of Excellence, and the National Natural Science Foundation of China under grant no. 60575015.
References

1. Bascle, B., Blake, A.: Separability of pose and expression in facial tracing and animation. In: Proc. of ICCV, pp. 323–328 (1998)
2. Brand, M.: Morphable 3D models from video. In: Proc. of CVPR, vol. 2, pp. 456–463 (2001)
3. Brand, M.: A direct method for 3D factorization of nonrigid motion observed in 2D. In: Proc. of CVPR, vol. 2, pp. 122–128 (2005)
4. Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3D shape from image streams. In: Proc. of CVPR, vol. 2, pp. 690–696 (2000)
5. Christy, S., Horaud, R.: Euclidean shape and motion from multiple perspective views by affine iterations. IEEE T-PAMI 18(11), 1098–1104 (1996)
6. Costeira, J., Kanade, T.: A multibody factorization method for independently moving objects. Intern. J. Comput. Vis. 29(3), 159–179 (1998)
7. Del Bue, A., Lladó, X., de Agapito, L.: Non-rigid face modelling using shape priors. In: Proc. of AMFG, pp. 97–108 (2005)
8. Del Bue, A., Lladó, X., de Agapito, L.: Non-rigid metric shape and motion recovery from uncalibrated images using priors. In: Proc. of CVPR, pp. 1191–1198 (2006)
9. Hartley, R., Schaffalitzky, F.: Powerfactorization: 3D reconstruction with missing or uncertain data. In: Proc. of Australia-Japan Advanced Workshop on Computer Vision (2003)
10. Heyden, A., Berthilsson, R., Sparr, G.: An iterative factorization method for projective structure and motion from image sequences. Image Vision Comput. 17(13), 981–991 (1999)
11. Morita, T., Kanade, T.: A sequential factorization method for recovering shape and motion from image streams. IEEE T-PAMI 19(8), 858–867 (1997)
12. Poelman, C., Kanade, T.: A paraperspective factorization method for shape and motion recovery. IEEE T-PAMI 19(3), 206–218 (1997)
13. Quan, L.: Self-calibration of an affine camera from multiple views. Intern. J. Comput. Vis. 19(1), 93–105 (1996)
14. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. Intern. J. Comput. Vis. 9(2), 137–154 (1992)
15. Torresani, L., Yang, D.B., Alexander, E.J., Bregler, C.: Tracking and modeling non-rigid objects with rank constraints. In: Proc. of CVPR, pp. 493–500 (2001)
16. Triggs, B.: Factorization methods for projective structure and motion. In: Proc. of CVPR, pp. 845–851 (1996)
17. Vidal, R., Hartley, R.: Motion segmentation with missing data using powerfactorization and gpca. In: Proc. of CVPR, vol. 2, pp. 310–316 (2004)
18. Wang, G.H., Tsui, H.T., Hu, Z.Y.: Structure and motion of nonrigid object under perspective projection. Pattern Recognition Letters 28(4), 507–515 (2007)
19. Wang, G.H., Wu, J.: Deformation detection and 3D Euclidean reconstruction of nonrigid objects from uncalibrated image sequences. Technical Report of the University of Windsor (November 2006)
20. Xiao, J., Chai, J.X., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 573–587. Springer, Heidelberg (2004)
21. Xiao, J., Kanade, T.: Uncalibrated perspective reconstruction of deformable structures. In: Proc. of ICCV, vol. 2, pp. 1075–1082 (2005)
22. Yao, J., Cham, W.: Feature matching and scene reconstruction from multiple widely separated views. Technical Report of Department of Electronic Engineering, The Chinese University of Hong Kong (August 2005)
Parallel Robot High Speed Object Tracking

J.M. Sebastián1, A. Traslosheros1, L. Ángel2, F. Roberti3, and R. Carelli3

1 Universidad Politécnica de Madrid, Spain {jsebas,atraslosheros}@etsii.upm.es
2 Universidad Pontificia Bolivariana, Bucaramanga, Colombia [email protected]
3 Universidad Nacional de San Juan, Argentina {rcarelli,froberti}@inaut.unsj.edu.ar

Abstract. This paper describes the visual control of a parallel robot called “RoboTenis”. The system has been designed and built in order to carry out tasks in three dimensions and dynamic environments; the system is capable of interacting with objects which move at up to 1 m/s. The control strategy is composed of two intertwined control loops. The internal loop is faster and uses the information from the joints; its sample time is 0.5 ms. The second loop, which is external to the first and represents our study purpose, is the visual servoing system; it is based on predicting the object velocity, which is obtained from visual information, and its sample time is 8.3 ms. Lyapunov stability analysis, system delays and saturation components have been taken into account.

Keywords: Parallel robot, visual control strategies, tracking, system stability.
1 Introduction

Vision systems are becoming more frequently used in robotics applications. Visual information makes it possible to know the position and orientation of the objects present in the scene and to describe the environment with relative precision. Despite the above advantages, the integration of visual systems in dynamic tasks still presents many issues which are not correctly solved yet; thus many important research centers [1] are motivated to investigate this field, such as the Tokyo University ([2] and [3]), where fast tracking (up to 2 m/s) strategies in visual servoing are developed. In order to study and implement the different strategies of visual servoing, the computer vision group of the UPM (Polytechnic University of Madrid) has decided to design RoboTenis. The implemented controller makes it possible to achieve high speed dynamic tasks. In this paper some experiments made with the system are described; the principal purpose is the high velocity (up to 1 m/s) tracking of a small object (a black ping pong ball) with three degrees of freedom. It should be mentioned that the system is designed with higher capabilities in order to respond to future works. Typically, parallel mechanisms possess the advantages of high stiffness, low inertia and large payload capacity; however, their principal weaknesses are the small useful workspace and design difficulties. As shown in Fig. 1, the mechanical structure of the RoboTenis system is inspired by the DELTA robot [4]. The kinematical model, Jacobian matrix and the optimized model of the RoboTenis system have been presented in other previous works [5]. Dynamical
analysis and the joint controller have been presented in [6] and [7]. The dynamical model is based upon Lagrangian multipliers; thus, it is possible to use forearms of non-negligible inertias in the development of control strategies. Two control loops are incorporated in the system: in the joint loop, a control signal is calculated every 0.5 ms, and at this level the dynamical model, the kinematical model and a PD action are fed back. The other loop is considered external since it is calculated every 8.33 ms; this loop uses the visual data and is described in detail along this work.
Fig. 1. RoboTenis System
This paper is organized as follows. After this introduction, Section 2 describes the RoboTenis system; Section 3 describes the visual servo control structure; Section 4 presents the experimental results; and finally, in Section 5 some concluding remarks are given.
2 Description of the System

This section describes the experimental environment, the elements that are part of the system and the functional characteristics of each element. For more information see [8].

2.1 Experimental Environment

The objective of the system resides in the tracking of a ping pong ball along 600 mm. Image processing is conveniently simplified using a black ball on a white background. The ball is moved by means of a thread (Fig. 2), and the ball velocity is close to 1 m/s.

2.2 Vision System

The RoboTenis system has a camera located on the end-effector, as Fig. 2 shows. This camera location merges two important aspects: when the robot and the object are relatively far away from each other, the field of view of the camera is wide but some errors are generated in the measurement of the ball position; when the ball and the robot are near each other, the field of view of the camera is narrow but the precision of the ball position measurements is increased. The above will be useful in future works such as catching or hitting the ball. Some characteristics of the components are the following.
Camera. The camera used is the SONY XC-HR50 and its principal characteristics are: high frame rate (8.33 ms) and 240 x 640 pixel resolution; the exposure time used was 1 ms; progressive scan CCD; relatively small size and weight: 29x29x32 mm and 50 g (see Fig. 3).
Fig. 2. Work environment
Fig. 3. System camera
Data Acquisition Card. The data acquisition system is composed of the Matrox Meteor 2-MC/4 card, which is responsible for acquiring the visual data and is used in double buffer mode: the current image is sampled while the previous image is being processed.

Image Processing. Once images are acquired, the visual system segments the ball on the white background. The centroid and diameter of the ball are calculated using a subpixel precision method. The 3D position of the ball can then be calculated through a preliminary camera calibration. The control system requires the velocity of the ball, which is estimated from the ball position through a Kalman filter ([9] and [10]).

2.3 System of Positioning Control

The positioning system is composed of the DSPACE 1103 card, which is responsible for the generation of the trajectories and the calculation of the kinematical models, dynamical models and control algorithms. The motion system is composed of AC brushless servomotors, AC drivers and gearboxes (for more information see [8]).

2.4 Characteristics and Functions

The visual controller is conditioned by some characteristics of the system and others inherited from the application itself; some of them are:

- High uncertainty in data from the visual system. The small sample time (8.333 ms) makes the errors of the velocity estimation larger. For example, if the ball is located 600 mm from the robot (camera), then the diameter of the ball will be measured as 20 pixels and, if the estimation of the position has an error of 0.25 pixels, then the estimated distance error will be approximately 8 mm; in the same order, the estimated ball velocity error will be 1 m/s. The errors cited above originate an elevated discontinuity and the required control action exceeds the robot's capacities; the Kalman filter implemented helps to solve the problem (see the sketch after this list).
- The target velocities for the robot should be given continuously. In order to avoid high accelerations, the trajectory planner needs continuity in the estimated velocity. For example, an 8 mm error in one sample period (8.333 ms) would demand an acceleration equivalent to 12 times gravity. The system guarantees a smooth tracking performance by means of a trajectory planner; the trajectory planner is specially designed for reference shifting, thus the target point can be changed at any moment.
- Some additional limitations have to be considered in the system. The delay between the visual acquisition and the joint motion is estimated as two visual servoing sample periods (16.66 ms). This delay is due to some principal causes such as image capturing, image transmission and image processing. The maximum velocity of the end-effector of RoboTenis is 2.5 m/s.
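As an illustration of the velocity estimation mentioned above, the following sketch (our own; the constant-velocity model and the noise values are assumptions, since the paper does not give the filter parameters) runs a per-axis Kalman filter on the measured ball positions:

    import numpy as np

    class BallKalman:
        # Constant-velocity Kalman filter for one axis of the ball position.
        def __init__(self, dt=0.008333, q=50.0, r=8.0):
            self.F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
            self.H = np.array([[1.0, 0.0]])              # only position is measured
            self.Q = q * np.array([[dt**3/3, dt**2/2], [dt**2/2, dt]])
            self.R = np.array([[r**2]])                  # measurement noise (mm^2)
            self.x = np.zeros((2, 1))                    # [position, velocity]
            self.P = np.eye(2) * 1e3

        def update(self, z):
            # Predict, then correct with the new position measurement z (mm).
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (np.array([[z]]) - self.H @ self.x)
            self.P = (np.eye(2) - K @ self.H) @ self.P
            return self.x[0, 0], self.x[1, 0]            # filtered position, velocity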
3 Visual Control of the RoboTenis System

The coordinate systems considered are shown in Fig. 4; they are Σw, Σe and Σc, which represent the world coordinate system, the robot end-effector system and the camera coordinate system respectively. Other notations defined are: ^c p_b represents the position of the ball in the camera coordinate system, and ^w p_e represents the position of the robot end-effector in the world coordinate system; ^w p_e is obtained by means of the direct kinematical model. The rotation matrices ^w R_e, ^w R_c and ^e R_c are constant and known; ^e T_c is obtained by means of the camera calibration.
Fig. 4. Coordinated systems considered

Fig. 5. Ball position used in visual servoing
Although several alternatives exist [11], the selected controller is based on the ball position. The control scheme can be appreciated in Fig. 5: the error function is obtained through the difference between the reference position (^c p_b^*(k)) and the measured position (^c p_b(k)); this difference must be constant because the goal is the tracking of the ball. Once the error is obtained, the controller calculates the velocity of the end-effector. By means of the trajectory planner and the Jacobian matrix, all the joint motions are calculated. Here k indicates the sample time considered.
3.1 Visual Servoing Model

From Fig. 5 it can be observed that the position error can be expressed as follows:

e(k) = ^c p_b^* − ^c p_b(k)    (1)

If the ball position is referenced in the coordinate system of the camera, then the position of the ball can be expressed as:

^c p_b(k) = ^c R_w (^w p_b(k) − ^w p_c(k))    (2)

If (2) is substituted into (1), then we obtain:

e(k) = ^c p_b^* − ^c R_w (^w p_b(k) − ^w p_c(k))    (3)
The system is supposed stable, and in order to guarantee that the error will decrease exponentially, we choose:

ė(k) = −λ e(k), with λ > 0    (4)

Deriving (2) and supposing that ^c R_w is constant, we obtain:

ė(k) = − ^c R_w (^w v_b(k) − ^w v_c(k))    (5)
Substituting (3) and (5) into (4), we obtain:

^w v_c(k) = ^w v_b(k) − λ ^c R_w^T (^c p_b^* − ^c p_b(k))    (6)

where ^w v_c(k) and ^w v_b(k) represent the camera and ball velocities respectively. Since ^w v_e(k) = ^w v_c(k), the control law can be expressed as:

u(k) = ^w v_b(k) − λ ^c R_w^T [^c p_b^* − ^c p_b(k)]    (7)
Equation (7) is composed of two components: a component which predicts the position of the ball (^w v_b(k)) and another which contains the tracking error ([^c p_b^*(k) − ^c p_b(k)]). The ideal control scheme (7) requires a perfect knowledge of all its components, which is not possible; a more realistic approach consists in generalizing the previous control as:

u(k) = ^w v̂_b(k) − λ ^c R_w^T [^c p_b^* − ^c p̂_b(k)]    (8)

As shown in (8), the estimated variables are represented by carets, which are used instead of the true terms. The basic controller is shown in Fig. 6.
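A direct transcription of the control law (8) is shown below. This is our own sketch; the frame conventions and variable names are assumptions, and the ball velocity estimate would come from the Kalman filter of Sec. 2.

    import numpy as np

    def visual_control(p_ball_cam, p_ref_cam, v_ball_world_est, R_cw, lam):
        # Eq. 8: commanded end-effector velocity in the world frame.
        #   p_ball_cam       -- estimated ball position in the camera frame (3,)
        #   p_ref_cam        -- desired (constant) ball position in the camera frame
        #   v_ball_world_est -- estimated ball velocity in the world frame
        #   R_cw             -- rotation from world to camera frame (3x3)
        #   lam              -- gain, lam = 1/(T*n) as derived in Sec. 3.2
        error = p_ref_cam - p_ball_cam
        u = v_ball_world_est - lam * (R_cw.T @ error)
        # Saturate to the physical limit of the end-effector (2.5 m/s).
        speed = np.linalg.norm(u)
        if speed > 2.5:
            u *= 2.5 / speed
        return u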
Fig. 6. Visual servoing control system architecture
3.2 Adjusting the λ Parameter
A fundamental aspect of the performance of the visual servoing system is the adjustment of λ: λ is calculated so that the control objective is reached in the smallest number of sample periods while respecting the system limitations. The algorithm is based on the future positions of the camera and the ball; this lets the robot reach the control objective (${}^{c}p_b(k) = {}^{c}p_b^*(k)$). The future position (at instant $k+n$) of the ball in the world coordinate system is:
$${}^{w}\hat{p}_b(k+n) = {}^{w}\hat{p}_b(k) + {}^{w}\hat{v}_b(k)\,T n. \qquad (9)$$
where $T$ is the visual servoing sample period (8.333 ms). The future position of the camera in the world coordinate system at instant $k+n$ is:
$${}^{w}p_c(k+n) = {}^{w}p_c(k) + {}^{w}v_c(k)\,T n. \qquad (10)$$
The control objective is to reach the target position in the shortest possible time; thus, substituting the target position and the future ball and camera positions into (2):
$${}^{c}p_b^* - {}^{c}R_w \left[{}^{w}\hat{p}_b(k+n) - {}^{w}p_c(k+n)\right] = 0. \qquad (11)$$
Substituting (9) and (10) into (11), we obtain (12):
$${}^{c}p_b^* = {}^{c}R_w \left[{}^{w}\hat{p}_b(k) + {}^{w}\hat{v}_b(k)\,T n - {}^{w}p_c(k) - {}^{w}v_c(k)\,T n\right]. \qquad (12)$$
It has been indicated that ${}^{w}v_e(k) = {}^{w}v_c(k)$; if (2) is considered, the control law becomes:
$$u(k) = {}^{w}\hat{v}_b(k) - \frac{1}{T n}\, {}^{c}R_w^T \left[{}^{c}p_b^* - {}^{c}\hat{p}_b(k)\right]. \qquad (13)$$
If (8) and (13) are compared, we can obtain the λ parameter as:
$$\lambda = \frac{1}{T n}. \qquad (14)$$
Equation (14) gives a criterion for adjusting λ as a function of the number of samples $n$ required to reach the control target.
3.3 Implemented Algorithm
The visual servoing architecture proposed in Fig. 6 does not consider the physical limitations of the system, such as delays and the maximum operating limits of its components. In Fig. 7 we propose the visual control structure that takes the above-mentioned limitations into account.
Fig. 7. Visual servoing control architecture proposed
The term $z^{-r}$ represents a delay of $r$ periods with respect to the control signals. If we consider that the visual information (${}^{c}p_b(k)$) has a delay of $r = 2$ sampling periods with respect to the joint information, then at instant $k+n$ the future position of the ball can be obtained by:
$${}^{w}\hat{p}_b(k+n) = {}^{w}\hat{p}_b(k-r) + {}^{w}\hat{v}_b(k-r)\,T(n+r). \qquad (15)$$
The future position of the camera is given by (10). Using (14) it is possible to adjust λ for the control law by considering the following aspects:
- The desired velocity of the end effector is given by (13). In physical systems the maximum velocity must be limited; in our system the maximum velocity directly determines the minimum number of sample periods in which the target can be reached. This number of sample periods is not the same in
all joints; the value for the most restrictive joint (the largest time) is used in the λ parameter calculation.
- It is desirable that the velocity of the robot be continuous.
3.4 System Stability
By means of a Lyapunov analysis it is possible to prove the stability of the system; it can be demonstrated that the error converges to zero if ideal conditions are assumed, and otherwise that the error remains bounded under the influence of estimation errors and unmodeled dynamics. For the stability analysis we consider (5) and (7) to obtain the closed-loop expression:
$$\dot{e}(k) + \lambda e(k) = 0. \qquad (16)$$
We choose a Lyapunov function as:
$$V = \tfrac{1}{2}\, e^T(k)\, e(k), \qquad (17)$$
$$\dot{V} = e^T(k)\,\dot{e}(k) = -e^T(k)\,\lambda\, e(k) < 0. \qquad (18)$$
Equation (18) implies that $e(k) \to 0$ when $k \to \infty$; but if ${}^{w}v_e \equiv u$ does not hold, then:
$${}^{w}\hat{v}_e(k) = u(k) + \rho(k), \qquad (19)$$
where $\rho(k)$ is the velocity error vector produced by poor velocity estimations and unmodeled dynamics. Considering (19), (16) can be written as:
$$\dot{e}(k) + \lambda e(k) = {}^{c}R_w\, \rho(k). \qquad (20)$$
We consider again a Lyapunov function as:
$$V = \tfrac{1}{2}\, e^T(k)\, e(k), \qquad (21)$$
$$\dot{V} = e^T(k)\,\dot{e}(k) = -e^T(k)\,\lambda\, e(k) + e^T(k)\, {}^{c}R_w\, \rho(k). \qquad (22)$$
The sufficient condition for $\dot{V} < 0$ is:
$$\|e\| > \frac{\|\rho\|}{\lambda}. \qquad (23)$$
If we consider that $\rho(k) \to 0$ (this has two meanings: the velocity controller is capable of making the system reach ${}^{w}v_e(k) \to u$, and there are no errors in the velocity estimations), then $e(k) \to 0$; otherwise we can conclude that, when (23) is not fulfilled, the error does not decrease and finally remains bounded by:
$$\|e\| \leq \frac{\|\rho\|}{\lambda}. \qquad (24)$$
4 Experimental Results
In this section, the experimental results concern the visual tracking of an object moving at up to 1 m/s. The results show the performance of the visual servoing algorithm proposed for the RoboTenis system. The control objective consists in preserving a constant distance ([0, 0, 600]^T mm) between the camera and the moving target. The ball hangs from the structure and is moved by dragging it manually (see Fig. 2). Different arbitrary trajectories have been performed. As an example, Fig. 8 represents an experiment showing the spatial evolution of the ball.
Fig. 8. 3D ball movements
4.1 Performance Tracking Indexes
The arbitrary movement of the ball complicates a systematic study of the experiments; in consequence, two tracking indexes have been defined according to the tracking error (the difference between the target and the real position) and the estimation of the ball velocity. The two indexes are:
Tracking relation (Table I), defined in (25) as the relation between the average of the tracking-error module and the average of the estimated ball-velocity module. This relation is expressed in mm/(mm/s) and allows isolating the result of each experiment from the particular features of the ball motion in other experiments.
$$\text{Tracking relation} = \frac{\frac{1}{N}\sum_{k=1}^{N} \|e(k)\|}{\frac{1}{N}\sum_{k=1}^{N} \|{}^{w}\hat{v}_b(k)\|}. \qquad (25)$$
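A minimal sketch of how the tracking relation of (25) could be computed from logged data; the array names are illustrative, since the original implementation is not shown in the paper.

```python
import numpy as np

def tracking_relation(errors, ball_velocities):
    """Tracking relation of Eq. (25): mean norm of the tracking error divided
    by the mean norm of the estimated ball velocity over the N samples of an
    experiment. errors and ball_velocities have shape (N, 3), in mm and mm/s."""
    mean_error = np.mean(np.linalg.norm(errors, axis=1))
    mean_speed = np.mean(np.linalg.norm(ball_velocities, axis=1))
    return mean_error / mean_speed
```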
Average of the tracking error as a function of the estimated ball velocity: in order to make comparison easier, the estimated ball velocity has been divided into 5 groups, for which the tracking error is shown in Table II.

Table I. Tracking relation using proportional and predictive control laws

Algorithm      Tracking relation
Proportional   40.45
Predictive     20.86

Table II. Grouped average tracking error

Velocity / Algorithm   Proportional   Predictive
V800                   32.5           13.5
4.2 Studied and Compared Control Laws
Two control laws have been used.
Proportional control law: it does not consider the predictive component of (8), thus:
$$u(k) = -\lambda\, {}^{c}R_w^T \left[{}^{c}p_b^*(k) - {}^{c}\hat{p}_b(k)\right]. \qquad (26)$$
Predictive control law: it considers the predictive component of (8), thus:
$$u(k) = {}^{w}\hat{v}_b(k) - \lambda\, {}^{c}R_w^T \left[{}^{c}p_b^*(k) - {}^{c}\hat{p}_b(k)\right]. \qquad (27)$$
Tables I and II show the index results when the two control laws are applied. The numerical results have been obtained by averaging 10 experiments for each control algorithm. The results show a better performance of the system when the predictive control algorithm is used: both a smaller tracking relation and a smaller grouped average tracking error are observed. Figs. 9 and 10 show a specific experiment; it can be observed that the predictive controller introduces a significant improvement, reducing the error to half of that obtained with the proportional controller.
Fig. 9. Proportional control law. (a) Tracking error. (b) Estimated velocity of the ball.
Fig. 10. Predictive control law. (a) Tracking error. (b) Estimated velocity of the ball.
5 Conclusions
In this paper a visual servoing structure has been presented; this control strategy is used in a parallel robot in order to achieve high-velocity tracking of an object that moves along unknown trajectories. The control strategy is based on reaching the control objective in the smallest possible number of sample periods. The existing delays in the system and the saturations in velocity and acceleration are considered in the control model; this consideration makes it possible for RoboTenis to move as fast as its characteristics allow. The Lyapunov stability of the proposed system was proved under ideal and non-ideal conditions: when the conditions are ideal the error converges to zero; otherwise the error remains finally bounded. The experiments carried out illustrate the high performance of the system; ball tracking was successfully achieved and the error was smaller than 20 mm, with a visual loop sample time of 8.33 ms. In future works, new control strategies will be considered in order to attain a tracking velocity of 2 m/s; interaction of the system with the environment and tasks such as catching or hitting the ball are also desirable. For more information: http://www.disam.upm.es/vision/projects/robotenis/
Acknowledgments. This work is part of the research project Teleoperation architectures in modelling dynamical environments (DPI 2004-07433-C02-02), funded by the Spanish Ministerio de Ciencia y Tecnología (Ministry of Science and Technology). Likewise, thanks are due to the Spanish international cooperation agency (AECI), whose Interuniversity Cooperation 2005 programme made possible the participation of foreign students and researchers.
References
1. Kragic, D., Christensen, H.I.: Advances in robot vision. Robotics and Autonomous Systems 52(1), 1–3 (2005)
2. Kaneko, M., Higashimori, M., Takenaka, R., Namiki, A., Ishikawa, M.: The 100 G capturing robot – too fast to see. IEEE/ASME Transactions on Mechatronics 8(1), 37–44 (2003)
3. Senoo, T., Namiki, A., Ishikawa, M.: High-speed batting using a multi-jointed manipulator. In: ICRA '04. IEEE International Conference on Robotics and Automation, 26 April – 1 May, 2004, vol. 2, pp. 1191–1196. IEEE Computer Society Press, Los Alamitos (2004)
4. Clavel, R.: DELTA: a fast robot with parallel geometry. In: 18th International Symposium on Industrial Robot, Sydney, Australia, pp. 91–100 (1988)
5. Ángel, L., Sebastián, J.M., Saltarén, R., Aracil, S.J.: RoboTenis: Optimal Design of a Parallel Robot with High Performance. In: IROS. IEEE/RSJ International Conference on Intelligent Robots and Systems, Alberta, Canada, August 2-6, 2005 (2005)
6. Ángel, L., Sebastián, J.M., Saltarén, R., Aracil, R., Gutiérrez, R.: RoboTenis: Design, Dynamic Modeling and Preliminary Control. In: IEEE/ASME AIM2005, Monterey, California, July 24-28, 2005 (2005)
7. Ángel, L., Sebastián, J.M., Saltarén, R., Aracil, R.: RoboTenis System. Part II: Dynamics and Control. In: CDC-ECC'05. 44th IEEE Conference on Decision and Control and European Control Conference, Sevilla (2005)
8. Ángel, L.: Control Visual de Robots Paralelos. Análisis, Desarrollo y Aplicación a la Plataforma Robotenis. Tesis Doctoral, Universidad Politécnica de Madrid (Diciembre 2005)
9. Gutiérrez, D.: Estimación de la posición y de la velocidad de un objeto móvil. Aplicación al sistema RoboTenis. Proyecto Fin de Carrera, E.T.S.I.I., Universidad Politécnica de Madrid
10. Gutiérrez, D., Sebastián, J.M., Ángel, L.: Estimación de la posición y velocidad de un objeto móvil. Aplicación al sistema Robotenis. XXVI Jornadas de Automática, Alicante (septiembre 6-8, 2005)
11. Hutchinson, S.A., Hager, G.D., Corke, P.I.: A tutorial on visual servo control. IEEE Trans. Robotics and Automation 12(5), 651–670 (1996)
12. Carelli, R., Santos-Victor, J., Roberti, F., Tosetti, S.: Direct visual tracking control of remote cellular robots. Robotics and Autonomous Systems 54, 805–814 (2006)
Robust Contour Tracking Using a Modified Snake Model in Stereo Image Sequences
Shin-Hyoung Kim and Jong Whan Jang
PaiChai University, Daejeon, South Korea, 302-435
{zeros, jangjw}@pcu.ac.kr
Abstract. In this paper, we present a robust contour tracking method using a modified snake model in stereo image sequences. The main obstacle preventing typical snake-based methods from converging to boundary concavities with gourd shapes is the lack of sufficient energy near the concavities. Moreover, previous methods suffer from drawbacks such as high computation cost and inefficiency with cluttered backgrounds. Our proposed method solves the problem by utilizing the binormal vector and disparity information. In addition, we apply an optimization scheme on the number of snake points to better describe the object's boundary, and we apply a region similarity energy to handle cluttered backgrounds. The proposed method can successfully define the contour of the object, and can track the contour in complex backgrounds. The performance of the proposed method has been verified with a set of experiments.
Keywords: Snake model, Contour tracking, Stereo image sequences, Region similarity
1 Introduction
Object tracking in image sequences is an important issue for many applications, including object-based compression for video coding, interactive multimedia systems and intelligent vehicle applications. In order to provide a content-based interactive multimedia service in stereo image sequences, object tracking is an essential starting point of video coding [1]. Among the various approaches which have been proposed for object tracking in 2D or stereoscopic images, we focus in this investigation on approaches based on the snake model [2-6]. In these approaches, the OOI is first segmented using a snake and then tracking is performed using techniques such as motion estimation. A snake is an energy-minimizing spline guided by internal forces that preserve its characteristics, and external forces that pull it toward image features such as lines and edges. The snake approach suffers from some challenges, such as snake points converging to local minima in the case of cluttered backgrounds, and the difficulty in progressing into boundary concavities [4]. In [5] Xu et al. proposed the GVF (gradient vector flow) snake to handle the concavity problem. Xu's GVF method uses a spatial diffusion of the gradient of the edge map of the image instead of using the edge map directly as an external force. The GVF has the advantages of low sensitivity to initialization and the ability to move into object
boundary concavities. However, when the boundary concavity has a gourd shape, the GVF method fails because of the concentration of GVF energy in the neck of the gourd. Recently, significant attention has been given to stereo image based object tracking and understanding for stereoscopic systems, video surveillance, and robotics [7-8]. Using disparity information, stereo object tracking methods have the advantage of separating the background from the object of interest (OOI). Nevertheless, most of these methods in fact track the area or position of the OOI rather than the object contour. In this paper we present a snake-based method for object contour tracking in stereo image sequences. The proposed snake operates in 3-D disparity space, and has modified curvature and external energies. The main concern of our method is accurate contour tracking for objects with boundary concavities in cluttered backgrounds. We address the above challenges through the optimization of the number of snake points: our optimization scheme manages the insertion of new snake points and the deletion of unnecessary points to better describe and track the object's boundary. The proposed algorithm can successfully describe and track object contours in complex backgrounds. This paper is organized as follows. Section 2 covers the background on the snake model. Our proposed method for object contour tracking in stereo image sequences is described in Section 3. Section 4 presents the experimental results for evaluating the performance of our method, and the conclusions are given in Section 5.
2 Basics of the Snake Model
The snake model was first introduced by Kass et al. [2], and has been applied to a variety of problems such as edge detection and tracking. The model is based on the definition of an energy function and a set of regularization parameters. In its discrete formulation, the active contour (snake) is represented as a set of points $v_i = (x_i, y_i)$ for $i = 0, \ldots, M-1$, where $x_i$ and $y_i$ are the x and y coordinates of a point, respectively, and $M$ is the total number of points. The energy function for a snake point is then typically written as follows:
$$E_{snake}(v_i) = E_{int}(v_i) + E_{ext}(v_i) \qquad (1)$$
The first term is called the internal energy and concerns properties of the contour such as its smoothness. The second term is the external energy which is typically defined in terms of the gradient of the image at the given snake point. Points are initialized around the OOI, and, through an iterative energy minimization process, they eventually converge onto the OOI's boundary.
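For illustration only, a greedy minimization step in the spirit of [3] can be sketched as follows; it is not the authors' exact procedure, and the energy function is passed in as a callable.

```python
def greedy_snake_step(points, energy):
    """One greedy iteration: each snake point moves to the position in its
    3x3 neighborhood that minimizes the combined internal + external energy.
    points: list of (x, y) tuples; energy(points, i, candidate) -> float."""
    moved = 0
    for i, (x, y) in enumerate(points):
        candidates = [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        best = min(candidates, key=lambda c: energy(points, i, c))
        if best != (x, y):
            points[i] = best
            moved += 1
    return moved  # zero moves means the contour has converged
```

Iterating this step until no point moves yields the usual convergence behaviour of the points onto the OOI boundary.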
3 Proposed Contour Tracking Method
Our proposed method modifies the conventional snake model and extends it to 3-D disparity space. Consider a 3-D disparity space defined by $x$, $y$, and a disparity dimension $d$. The stereo image description in the disparity space is illustrated in Fig. 1.
The figure shows the left and right stereo images, $I_l(x, y)$ and $I_r(x, y)$; the 2-D disparity map $I_{DM}(x, y)$ obtained by stereo matching; and an illustration of the objects A, B, and C separated in the disparity space. Even though an object, such as a sphere, may have varying disparity levels across its visible surface, we have explicitly assumed that the variation is limited and used an averaged disparity value $dis$ for each object. As shown in Fig. 1(c), we could successfully detach one object from another in the disparity space using disparity information. With the introduction of the disparity dimension $d$ and a time dimension $t$, a snake point at frame $t$ can now be defined as $v_{i,t}^d = (x_{i,t}, y_{i,t}, d_{i,t})$ [6].
Fig. 1. Image description in disparity space: (a) Stereo images (left image $I_l(x, y)$, right image $I_r(x, y)$). (b) Disparity map $I_{DM}(x, y)$. (c) Objects separated in disparity space.
3.1 Proposed Snake Model
Internal Energy Term. The internal energy term is typically the sum of two terms: continuity energy and curvature energy. The main role of the conventional continuity energy term is to make the spacing between snake points even by minimizing the difference between the average distance and the distance between neighboring snake points. In this arrangement point spacing is globally and uniformly constrained by the average distance. This is not suitable for objects with concavities, because the distance between snake points should be relaxed in boundary concavities. Therefore, in our method, point spacing in boundary concavities is allowed to differ from other parts of the snake. We redefine the normalized continuity energy as follows:
$$E_{con}(v_{i,t}^d) = \frac{\left|\, \|v_{i-1,t}^d - v_{i,t}^d\| - \|v_{i,t}^d - v_{i+1,t}^d\| \,\right|}{con_{max}} \qquad (2)$$
where $con_{max}$ is the maximum value in the search neighborhood to which the point may move. The proposed continuity energy can produce even spacing between neighboring snake points in concave parts of the OOI's boundary. The second internal energy term, the curvature energy, encourages a smooth curvature formation. Although our method preserves this basic effect of the curvature energy, it complements it so as to guide the curvature direction and ensure that snake points always follow the object's boundary. Usually, the initial snake points are set outside the object and shrink inwards; otherwise, they are set inside the object and dilate. However, during the tracking process (as a result of motion
estimation) some points may be located inside while others are outside the object. So, the moving direction of candidate snake points must be carefully determined based on whether the position of a candidate snake point is outside or inside the OOI, see Fig. 2. As shown in Fig. 2, a snake point is associated with three vectors (Frenet formulas): the tangent vector $T$, the normal vector $N = T'/\|T'\|$, and the binormal vector $B = T \times N$, where $\|T'\|$ is the length of the curvature vector; the sign of the z-dimension of $B$ is positive when $B$ points upward and negative when it points downward. To explain how the moving direction of a snake point is determined, we first examine the case when a candidate snake point is inside the OOI (as in Fig. 2(a) and 2(b)). As illustrated in the two figures, a point must take the direction of its normal vector $N$ when $B$ is downward and the opposite direction of $N$ when $B$ is upward. It follows that the curvature energy should seek to minimize the value of the curvature vector $T'$ when $B$ is downward, and vice versa.
Fig. 2. Movement of candidate snake point based on binormal vector
If the candidate snake point is outside the OOI, we simply apply the contrary rule. Therefore, we can make candidate snake points move to the boundary of the OOI regardless of their position, and we can solve the problem which occurs with concave regions as shown in Fig. 2(c). $T$, $N$, and $B$ are defined as follows:
$$T(v_{i,t}^d) = v_{i+1,t}^d - v_{i,t}^d \qquad (3)$$
$$N(v_{i,t}^d) \approx T'(v_{i,t}^d) = v_{i-1,t}^d - 2 v_{i,t}^d + v_{i+1,t}^d \qquad (4)$$
$$B(v_{i,t}^d) = T(v_{i,t}^d) \times N(v_{i,t}^d) \qquad (5)$$
We define the normalized curvature energy in terms of the tangent vector as Eq. (6), and the proposed total curvature energy $E_{cur}(v_{i,t}^d)$ as Eq. (7):
$$E_C(v_{i,t}^d) = \left\| \frac{T(v_{i,t}^d)}{\|T(v_{i,t}^d)\|} - \frac{T(v_{i-1,t}^d)}{\|T(v_{i-1,t}^d)\|} \right\| \qquad (6)$$
$$E_{cur}(v_{i,t}^d) = \beta \left( \lambda E_C(v_{i,t}^d) + (\lambda - 1) E_C(v_{i,t}^d) \right) \qquad (7)$$
The constant λ is set to 1 if the snake point is inside the object, and to 0 otherwise. The constant β can be set for different applications, depending on the direction of $B(v_{i,t}^d)$ on the contour. In case λ is set to 1, β carries the opposite sign of the z-dimension of $B(v_{i,t}^d)$; otherwise, β carries the same sign of the z-dimension as in Eq. (5).
External Energy Term. In conventional snake models, computation of the external energy involves edge information taken from a single image. However, when the OOI is influenced by other objects or background edges, negative effects begin to occur. In order to maintain the role of the external energy and to enhance the segmentation performance of the snake, we propose a better energy function. In our algorithm, to eliminate the effects of background clutter, we propose an edge energy based on the gradient of a layer image of the OOI, $E_{edge}(v_{i,t}^d) = -\|\nabla I_{DS}(x_{i,t}, y_{i,t}, d_{OOI,t})\| / e_{max}$, as introduced in [9]. The layer image is
based on a disparity map $I_{DM}(x, y)$ obtained by patch-based stereo matching [10]. The gradient of a layer image of the OOI cannot completely remove all foreign edges. To prevent snake points from converging on such edges, we propose a regional similarity energy that pushes the mean intensity of the polygon enclosed by the snake contour to be as close as possible to the mean intensity of the OOI. The smaller the intensity difference between the polygon and the object, the closer the snake approaches the object contour. As illustrated in Fig. 3, the window $W_{i,t}^d$ is searched for the point which produces the minimum polygon area. The contour created by such a point is referred to as $C_{min}$, and is regarded as the base area contour. The mean intensity of the area enclosed by $C_{min}$ is regarded as the base mean intensity $m_{i,t}$. Next, the window $W_{i,t}^d$ is searched for a point $p$ which yields a polygon with the minimum intensity
variance relative to $C_{min}$. For each candidate point in $W_{i,t}^d$, a second contour ($C_{area}$) is produced and the marginal area $C_{area} - C_{min}$ is obtained. The search seeks a point $p$ that yields a marginal area with minimum variance relative to $C_{min}$. We refer to this measure as the Regional Similarity Energy (RSE), defined as:
$$E_{RSE}(p) = \frac{1}{A}\left( \sum_{u}\sum_{v} \left( I(u, v) - m_{i,t} \right)^2 \right) \frac{1}{RSE_{max}}, \quad (u, v) \in C_{area} - C_{min} \qquad (8)$$
where $RSE_{max}$ is the maximum of the possible RSE values in the neighborhood, and $A$ is the area of ($C_{area} - C_{min}$). The complete proposed snake model can now be given as:
$$E_{snake}(v_t^d) = \sum_{i=0}^{M-1} \left( \alpha E_{con}(v_{i,t}^d) + \beta\left(\lambda E_C(v_{i,t}^d) + (\lambda - 1) E_C(v_{i,t}^d)\right) + \gamma E_{edge}(v_{i,t}^d) + \delta E_{RSE}(v_{i,t}^d) \right) \qquad (9)$$
The parameters α, β, γ and δ are selected to balance the relative influence of the four terms in the energy function. In our case α = 1, β = 1, γ = 1.2 and δ = 1.2.
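The following sketch illustrates Eqs. (3)-(6) for snake points in $(x, y, d)$ space; it is an illustration of the definitions above, not the authors' code.

```python
import numpy as np

def frenet_vectors(v_prev, v_cur, v_next):
    """Discrete tangent, normal and binormal of a snake point in (x, y, d)
    space, following Eqs. (3)-(5)."""
    T = v_next - v_cur                   # Eq. (3)
    N = v_prev - 2.0 * v_cur + v_next    # Eq. (4): approximation of T'
    B = np.cross(T, N)                   # Eq. (5); the sign of B[2] steers the point
    return T, N, B

def curvature_term(v_prev, v_cur, v_next):
    """E_C of Eq. (6): change of the unit tangent between consecutive points."""
    t_cur = v_next - v_cur
    t_prev = v_cur - v_prev
    return np.linalg.norm(t_cur / np.linalg.norm(t_cur)
                          - t_prev / np.linalg.norm(t_prev))
```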
Fig. 3. Regional similarity energy
3.2 Optimizing the Number of Snake Points
After the snake points converge to the boundary of the object in the first step, additional points are inserted and unnecessary points are deleted to better describe the boundary concavities. A newly inserted point is defined as $c_{i,t}^d = (v_{i,t}^d + v_{i+1,t}^d)/2$. This insertion and deletion mechanism is explained in detail in [11].
3.3 Motion Estimation
When an object moves in 3-D space, both its planar (x, y) and depth coordinates change. The planar translation ($\Delta x_{i,t}$ and $\Delta y_{i,t}$) of contour points can be estimated using a block-matching motion estimation technique. To estimate depth, a disparity histogram is calculated for the region within the tentative snake contour (as projected through the motion in x and y). The disparity histogram is defined as:
$$H_l(m) = \mathrm{Card}\left\{ p(x, y) \in R^s \mid f_{DM}(x, y) = m \right\} \qquad (10)$$
where $R^s$ is the set of all disparity pixels in the snake region, and $p(x, y)$ is a pixel in $R^s$. $\mathrm{Card}\{\}$ is the cardinality of the set, and $H_l(m)$ is the histogram of the tentative snake region ($m$ is an intensity value, $0 \leq m \leq 255$). The highest peak in the disparity histogram indicates the disparity value of the OOI, as shown in Fig. 4.
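A minimal sketch of the disparity histogram of Eq. (10) and of taking its highest peak as the OOI disparity; the mask and array names are assumptions of this sketch.

```python
import numpy as np

def ooi_disparity(disparity_map, snake_mask):
    """Disparity histogram of Eq. (10) over the tentative snake region R^s and
    its highest peak, taken as the disparity of the OOI.
    disparity_map: 2-D uint8 array; snake_mask: boolean array of the same shape."""
    values = disparity_map[snake_mask]          # pixels p(x, y) in R^s
    hist = np.bincount(values, minlength=256)   # H_l(m) for 0 <= m <= 255
    return int(np.argmax(hist)), hist
```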
Fig. 4. Estimation of the OOI’s disparity using disparity histogram. (a) Actual snake points at frame t-1. (b) Tentative snake region. (c) Disparity histogram of snake region.
4 Experimental Results
To verify the performance of the proposed method, a set of experiments has been performed. The algorithm was coded in Visual C++ 6.0, and the experiments were performed on a Pentium IV machine with 1 GByte of memory running at a 3 GHz clock speed. We used real stereo image sequences taken by two USB cameras separated by 4 cm along the baseline. All images have 320×240 pixels. The criterion used to verify the accuracy of the estimated snake region $R_{est}^s$, compared with the original object region $R_{ori}^s$, is the Relative Shape Distortion $RSD(R^s)$ measure, which is defined as:
$$RSD(R^s) = \left( \sum_{(x,y)\in I} R_{ori}^s(x, y) \oplus R_{est}^s(x, y) \right) \bigg/ \sum_{(x,y)\in I} R_{ori}^s(x, y) \qquad (11)$$
where ⊕ refers to the binary XOR operation. The experiments and their results are illustrated in Figs. 5-9, and the performance comparison in terms of RSD is listed in Tables 1 and 2. Fig. 5 shows the results of an experiment on a binary image and an ordinary image for an object with a gourd-shaped concavity. The greedy snake could not detect the boundary of the object, Fig. 5(a), and the GVF snake failed to proceed beyond the neck of the gourd, as shown in Fig. 5(b). However, with our proposed method the snake points converged onto the boundary concavity in only four optimization iterations, as shown in Fig. 5(c). Fig. 5(d) shows the results of our proposed modified snake model on an ordinary image. The performance comparison in terms of RSD and speed between the proposed method and the other two methods of Fig. 5 is summarized in Table 1.
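For clarity, the RSD measure of Eq. (11) can be computed from two binary masks as in the following sketch (not taken from the authors' implementation).

```python
import numpy as np

def relative_shape_distortion(r_ori, r_est):
    """RSD of Eq. (11): area of the XOR between the original object region and
    the estimated snake region, normalized by the original object area.
    r_ori, r_est: boolean masks of identical size."""
    return np.logical_xor(r_ori, r_est).sum() / r_ori.sum()
```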
Fig. 5. Results of experiment on binary image (gourd-shaped concavity): (a) Greedy snake, (b) GVF snake, (c),(d) Proposed modified snake model
Fig. 6 illustrates how the disparity of the OOI is estimated after motion estimation. Fig. 6(c) shows the disparity histogram of the snake region appearing in Fig. 6(b). The highest peak of 4699 pixels belongs to the object's pixels (white). Results for tracking objects with boundary concavities are shown in Figs. 7 and 8. The first and second rows show results for the methods of Pardas and Yokoyama [12], respectively. Our proposed method, as shown in the third row, successfully converged
Fig. 6. Disparity histogram for snake region. (a) Disparity map using patch based stereo matching, (b) Tentative snake region, (c) Disparity histogram.
Fig. 7. Results of experiment I: object with moderate boundary concavities. (a) Pardas’s method, (b) Yokoyama’s method, (c) Proposed method.
to the boundary of the OOI including concave parts using our proposed curvature energy term. The other two methods could not detect the boundary of the OOI due to the problem of poor energy in boundary concavities and edges from the background.
Fig. 8. Results of experiment II : Object with profound boundary concavities. (a) Pardas’s method, (b) Yokoyama’s method, (c) Proposed method.
Fig. 9. Result of experiment in presence of occlusion
Moreover, Fig. 8 shows that the proposed method could track the object even though its boundary has a profound concavity and has experienced some rotation. The other two methods failed to track the boundary of the OOI in this case. The performance comparison between the proposed method and the other two methods of Figs. 7 and 8 is summarized in Table 2 in terms of RSD. We also tested our method in the presence of partial object occlusion, as shown in Fig. 9. As the object's visible region became smaller, the number of snake points was optimized accordingly.
Table 1. Performance comparison in terms of RSD and computational speed
Fig. 5            (a) Greedy snake     (b) GVF snake        (c) Proposed method
RSD (pixel)       0.385 (5839/15143)   0.256 (3880/15143)   0.008 (128/15143)
Speed (second)    0.073                16.214               0.109
Table 2. Performance comparison in terms of RSD
Experiments   Number of snake points   Frame number   Pardas   Yokoyama   Proposed method
I             30                       3              0.18     0.10       0.00
I             30                       5              0.42     0.09       0.00
I             30                       8              0.54     0.13       0.00
II            30                       5              0.96     0.05       0.01
II            30                       22             1.02     0.17       0.01
II            30                       42             miss     0.34       0.02
5 Conclusions
In this paper, we have presented a snake-based object contour tracking method for stereo image sequences. The method handles objects with boundary concavities, optimizes the number of snake points based on local curvature, and ensures a sufficient number of points in concave parts of the contour. A set of experiments was performed to verify the performance of our snake model. When compared with the
conventional snake-based methods, our method has shown a superior object contour detection and tracking capability, at the cost of an increase in computation due to the use of stereo images. In order to be applicable to a real-time object-based system, our method should be followed by further study to explore ways to reduce the processing time.
References
1. ISO/IEC JTC/SC29/WG11/W4350: Information Technology – Coding of Audio-Visual Objects Part 2: Visual. ISO/IEC 14496-2 (2001)
2. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. Int'l J. Computer Vision 1(4), 321–331 (1987)
3. Williams, D.J., Shah, M.: A Fast Algorithm for Active Contours and Curvature Estimation. Computer Vision, Graphics, and Image Processing 55, 14–26 (1992)
4. Pardas, M., Sayrol, E.: Motion Estimation based Tracking of Active Contours. Pattern Recognition Letters 22, 1447–1456 (2001)
5. Xu, C., Prince, J.L.: Snakes, Shapes, and Gradient Vector Flow. IEEE Trans. Image Processing 7(3), 359–369 (1998)
6. Kim, S.H., Choi, J.H., Kim, H.B., Jang, J.W.: A New Snake Algorithm for Object Segmentation in Stereo Images. ICME 2004 1, 13–16 (2004)
7. Izquierdo, E.: Disparity/Segmentation Analysis: Matching with an Adaptive Window and Depth-Driven Segmentation. IEEE Trans. Circuits and Systems for Video Technology 9(4), 589–607 (1999)
8. Harville, M.: Stereo Person Tracking with Adaptive Plan-View Templates of Height and Occupancy Statistics. Image and Vision Computing 22, 127–142 (2004)
9. Kim, S.H., Alattar, A., Jang, J.W.: Snake-Based Objects Tracking in Stereo Sequences with the Optimization of the Number of Snake Points. ICIP 2006, 193–196 (2006)
10. Deng, Y., Yang, Q., Lin, X., Tang, X.: A Symmetric Patch-Based Correspondence Model for Occlusion Handling. ICCV 2005 2, 1316–1322 (2005)
11. Kim, S.H., Alattar, A., Jang, J.W.: Accurate Contour Detection Based on Snakes for Objects With Boundary Concavities. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4141, pp. 226–235. Springer, Heidelberg (2006)
12. Yokoyama, M., Poggio, T.: A Contour-Based Moving Object Detection and Tracking. In: IEEE International Conference on Computer Vision (ICCV 2005), IEEE Computer Society Press, Los Alamitos (2005)
Background Independent Moving Object Segmentation Using Edge Similarity Measure
M. Ali Akber Dewan, M. Julius Hossain, and Oksam Chae∗
Department of Computer Engineering, Kyung Hee University, 1 Seochun-ri, Kiheung-eup, Yongin-si, Kyunggi-do, South Korea, 446-701
∗ Corresponding author.
[email protected],
[email protected],
[email protected]
Abstract. Background modeling is one of the most challenging and time consuming tasks in moving object detection for video surveillance. In this paper, we present a new algorithm which does not require any background model. Instead, it utilizes the three most recent consecutive frames to detect the presence of moving objects by extracting moving edges. In the proposed method, we introduce an edge segment based approach instead of the traditional edge pixel based approach. We also utilize an efficient edge-matching algorithm which reduces the variation of edge localization in different frames. Finally, regions of the moving objects are extracted from the previously detected moving edges by using an efficient watershed based segmentation algorithm. The proposed method is characterized by robustness against random noise, illumination variations and quantization error, and is validated with extensive experimental results.
1 Introduction
Moving object detection has emerged as an important research issue in computer vision due to its applicability in diverse disciplines [1]. Important applications of moving object detection include automatic video monitoring systems, intelligent highway systems, intrusion detection systems, airport security systems and so on. Such applications use real-time moving object detection algorithms as their core step to identify people, cars or other objects that are moving in the scene. An extensive survey of moving object detection algorithms can be found in [2] and [3]. The most common approaches for moving object detection rely on background-subtraction-based techniques. Background modeling is an important and unavoidable part of this kind of method, needed to accommodate the illumination variation along with other changes of a dynamic background. However, most background-modeling methods are complex in computation and too time-consuming for real-time processing [4]. Moreover, most of the time these methods suffer from poor performance due to lack of compensation for the dynamism of the background scene [5]. Considering the above-mentioned problems, we present an edge segment based moving object detection approach which does not require maintaining any background model. In the first step of the proposed method, two difference image edge
maps are computed from the three most recent consecutive frames. Instead of using an edge differencing approach, we utilize the difference images of successive frames to compute the edge maps, which makes the system more robust against illumination changes as well as random noise [6]. Moreover, the fact that the algorithm works with the most recent successive frames, embodying the updated information, helps to reduce false detections due to illumination change. In the difference image edge maps, edges are represented as segments instead of pixels using an efficiently designed edge class [7]. In the segment based representation, one edge segment consists of a number of neighboring consecutive edge pixels. However, to make effective use of the edge segments for moving edge detection and to facilitate the edge matching procedure, we maintain some constraints when representing an edge as a segment, as follows:
a) If the edge segment contains multiple branches, then the branches are broken into several edge segments at the branching point.
b) If the edge segment bends more than a certain limit at any edge point, the edge segment is broken into two segments at that particular point.
c) If the length of an edge segment exceeds a certain limit, then it is divided into a number of segments of at most that maximum length.
This novel representation of edge segments facilitates our system mainly in two ways:
a) It allows us to incorporate an efficient and flexible edge-matching algorithm [8] in our method, which reduces the computation time significantly and performs effective matching between edges of different image edge maps.
b) This type of representation allows our method to make a decision about a complete edge segment at a time, instead of an individual edge pixel, when adding to or deleting from the edge list during matching for moving edge detection (Fig. 1). As a result it reduces the occurrence of scattered edge pixels in the detection result and, eventually, it significantly reduces the probability of generating false alarms.
Fig. 1. Pixel and segment based representation of edges. (a) Among A, B and C, edge A is common in both of the edge maps; (b) Result of matching in case of pixel based representation; (c) Result of matching in case of segment based representation which is free from scattered edge pixels.
After computing the difference image edge maps, the correspondence between edge segments is determined using the edge matching algorithm to detect moving edges. Finally, moving regions are extracted from the previously detected moving edges by employing a watershed based segmentation algorithm and the gradient information of the difference
image. Since we employ a segment based representation of edges, it is possible to incorporate knowledge or weights into edge segments for tracking. Moreover, the segmentation result of the proposed method tends to extract moving object regions more precisely, which can greatly facilitate further higher level processing for video surveillance such as recognition, detection of moving object interactions, event detection and so on. The rest of the paper is organized as follows. In Section 2, we describe the related works. The proposed method is described in Section 3. Experimental results are presented in Section 4 and finally, in Section 5, we draw the conclusions.
2 Related Works
A number of studies on moving object detection have been reported during the last few years. One of the typical approaches for moving object detection is background subtraction [9, 10]. This type of method works well if the illumination of the processed image is constant. Any change in illumination significantly affects intensities, and under such circumstances it is difficult to figure out whether the changes of intensities occurred due to the effect of illumination change or due to object motion. Moreover, in some circumstances, such as a wide surveillance area, a congested and busy road or public places, it is difficult or impossible to control the monitored area so as to obtain a background without the presence of any moving object. To resolve this problem, these methods need to incorporate a background modeling algorithm or adaptation methods [5, 11]. Most of the background estimation algorithms utilize temporal change information or optical flow based information to identify the appropriate pixel values in a time series for the background model. If the background contents are not visible for a certain period of time during computation, these methods may fail to generate an accurate background model. However, most of these methods use very complex computation and are not feasible for real-time processing. On the other hand, as the background is exposed to change, it needs to be updated periodically. Some optical flow based approaches are used for moving object detection [12, 13]. For these methods, intensity changes are important cues for locating moving objects in time and space. However, these methods may result in false detections if the temporal changes are generated by noise or other external factors like illumination drift due to weather change. Moreover, these methods are not computationally feasible. In [6] and [14], the authors propose edge based methods for moving object detection utilizing double edge maps. In [6], one edge map is generated from the difference image of the background and the nth frame, In. Another edge map is generated from the difference image of In and In+1. Finally, moving edge points are detected by applying a logical OR operation on these two edge maps. Due to illumination changes, random noise may vary in the background, which may cause false detection of edges in the edge map. If any false edge appears in either of the edge maps, it will be conveyed to the final detection result because of the logical OR operation on the edge maps. In [14], one edge map is computed from the difference image of In-1 and In, and another edge map is computed from In and In+1. Then the moving edges of In are extracted by a logical AND operation on these two edge maps. However, due to random noise, edge pixel positions may change to some extent in consecutive
frames. Hence, exact matching such as an AND operation is not sufficient to extract the complete edge segments of a moving object. The resultant moving edges may be scattered, which may create problems in the further procedure for extracting a reliable shape of the moving object. Moreover, pixel based processing of edge segments is not feasible in terms of computation. A pseudo-gradient based moving edge extraction method is proposed in [15]. Though this method is computationally faster, its background handling is not efficient enough to take care of the situation when a new object arrives in the scene and stops moving; in this case, the stopped object is continuously detected as a moving object. As no background update method is adopted, this method is also not very robust against illumination change. Additionally, it also suffers from scattered edge pixels of moving objects. In the proposed method, we employ an efficient segment based representation of moving edges that makes the system robust against illumination changes. It can extract a more accurate shape of the moving objects, reducing breaks and missing edge points significantly. In this approach we do not deal with each edge pixel independently; rather, all the pixels in an edge segment are processed together. This representation helps us to use the geometric information of an edge during matching. We adopt a flexible method for matching two edge segments that tolerates part of the camera calibration error and the quantization error, reducing the false alarm rate considerably. The segmentation result of the proposed method tends to extract moving object regions precisely because it uses the watershed algorithm [10]. It can extract the exact boundary of a moving object where most region and block based algorithms fail. Moreover, the proposed method is computationally fast and efficient, as it does not require any background model for moving object detection. However, the quality of the segmented moving object is degraded if the movement of the object is very small in successive frames.
3 Structure of the Proposed Method
The framework of the proposed model is shown in Fig. 2. The overall procedure of the proposed method is described in the following subsections.
Fig. 2. Flow diagram of the proposed method
3.1 Difference Image Generation
In order to detect moving edges, a set of operations is applied to three consecutive frames $I_{n-1}$, $I_n$ and $I_{n+1}$. While edge information plays a key role in moving object detection, simple edge differencing approaches suffer greatly from random noise [6]. This is due to the fact that the appearance of random noise in one frame differs from that in its successive frames, which results in slight changes of the edge locations in successive frames. Hence, instead of using a simple edge differencing approach, we employ two difference images for moving edge detection. An edge detection algorithm is then applied to the difference images to construct edge maps. The difference images $D_{n-1}$ and $D_n$ are computed with (1) and (2):
$$D_{n-1} = I_{n-1} - I_n \qquad (1)$$
$$D_n = I_n - I_{n+1} \qquad (2)$$
Edges extracted from the difference image of consecutive frames are much more robust to noise than edges extracted by differencing the edges of the frames [6]. Hence, in the next step the difference image edge maps are generated from $D_{n-1}$ and $D_n$ instead of using the traditional edge differencing approach.
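As an illustration, the difference images of Eqs. (1)-(2) and the edge maps of Eqs. (3)-(4), described next in Sect. 3.2, could be obtained with OpenCV as sketched below; the use of the absolute difference and the Canny threshold values are assumptions of this sketch, not values from the paper.

```python
import cv2

def difference_edge_maps(frame_prev, frame_cur, frame_next, low=50, high=150):
    """Difference images of Eqs. (1)-(2) and their edge maps DE_{n-1}, DE_n of
    Eqs. (3)-(4). Frames are greyscale uint8 images."""
    d_prev = cv2.absdiff(frame_prev, frame_cur)   # D_{n-1}
    d_cur = cv2.absdiff(frame_cur, frame_next)    # D_n
    # Gaussian smoothing corresponds to the filtering step of Sect. 3.2.
    de_prev = cv2.Canny(cv2.GaussianBlur(d_prev, (5, 5), 1.0), low, high)
    de_cur = cv2.Canny(cv2.GaussianBlur(d_cur, (5, 5), 1.0), low, high)
    return de_prev, de_cur
```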
Fig. 3. Difference image edge maps. (a) In-1; (b) In; (c) In+1; (d) DEn-1; (e) DEn.
3.2 Representation of Edge Maps
In this step, we apply the Canny edge detection algorithm [16] to the difference images $D_{n-1}$ and $D_n$ to generate the difference image edge maps. First, the difference images are convolved with a Gaussian mask to filter out the noise. The Gaussian mask is computationally attractive, since it can be implemented with local fixed-point shift and add operations; hence, we choose it for noise filtering of the difference image.
Then the gradient magnitude is computed from the filtered image, and non-maxima suppression is applied to thin the edges. After that, a thresholding operation with hysteresis is used to detect and link edges. This step generates the edge maps $DE_{n-1}$ and $DE_n$ of the difference images $D_{n-1}$ and $D_n$, respectively. The formulation is as follows:
$$DE_{n-1} = \varphi(\nabla G * D_{n-1}) \qquad (3)$$
$$DE_n = \varphi(\nabla G * D_n) \qquad (4)$$
where $\varphi$, $\nabla$ and $G$ represent the Canny edge detector operator, the gradient operator and the Gaussian mask for filtering the difference images, respectively. $DE_{n-1}$ and $DE_n$ are noise-robust edge maps of the difference images, because the Gaussian convolution included in the Canny operator suppresses the noise in the luminance difference. Thereafter, the resultant edge pixels are grouped and represented as segments using an efficiently designed edge class [7]. This representation helps the proposed system to use the geometric shape of edges for matching in the later procedure. It also helps to extract solid edge segments of moving objects instead of scattered or insignificantly small edges. In this case, no edge pixel is processed independently; rather, all the edge pixels in an edge segment are processed together for matching and all other operations. Fig. 3(d) and 3(e) show the difference image edge maps generated from Fig. 3(a-b) and Fig. 3(b-c), respectively.
3.3 Moving Edge Detection
The edge maps $DE_{n-1}$ and $DE_n$ are used in this step to extract moving edges from the video sequence. Since the edge map $DE_{n-1}$ is generated from the difference image of $I_{n-1}$ and $I_n$, it contains only the moving edges present in frames $I_{n-1}$ and $I_n$. Similarly, the edge map $DE_n$ contains the moving edges of frames $I_n$ and $I_{n+1}$. Hence, only the moving edge segments of frame $I_n$ are common to both edge maps. To find those common edge segments, one edge map is superimposed on the other and matches between them are sought. If two edge segments are of almost similar size and shape and are situated in similar positions in the edge maps, then they are considered moving edges of $I_n$. However, the appearance of random noise may differ between frames, which may cause slight changes in the position, size or even shape of the same edge segment in different edge maps. Hence, some variability needs to be allowed during matching. Considering all these issues, we have adopted an efficient edge-matching algorithm known as chamfer 3/4 matching [8]. In chamfer matching, a Distance Transform (DT) image is first generated from one edge map. Then the edges of the other edge map are superimposed on it and a matching confidence is computed. If the matching confidence satisfies a certain threshold, the edge segment is enlisted as a moving edge; this threshold provides the variability of the matching procedure. In our method, $DE_{n-1}$ is employed to generate the DT image and, thereafter, the edge segments of $DE_n$ are superimposed on it to find matches. The DT image generation and matching confidence computation procedures are described in the following subsections.
Distance Transform Image Generation. Because edge segments are discrete and influenced by noise, there may be a small deviation between the extracted locations of edge points and their actual locations in the continuous domain. Therefore, it is not reasonable to employ an expensive method calculating the exact Euclidean distances between two edge segments during matching [8]. In most digital image processing applications it is preferable to use integers to represent distance. We have applied the chamfer 3-4 distance, which is a good integer approximation of the Euclidean distance. The DT image is constructed with these integer approximations, where each pixel value represents the distance to the nearest edge of the edge map. A two-pass algorithm calculates the distance values of the DT image sequentially. Initially the edge pixels are set to zero and the remaining positions are set to infinity. Then the forward pass (first pass) modifies the distance image with (5):
$$v_{i,j} = \min(v_{i-1,j-1} + 4,\; v_{i-1,j} + 3,\; v_{i-1,j+1} + 4,\; v_{i,j-1} + 3,\; v_{i,j}) \qquad (5)$$
In the same way, the backward pass (second pass) modifies the DT image with (6):
$$v_{i,j} = \min(v_{i,j},\; v_{i,j+1} + 3,\; v_{i+1,j-1} + 4,\; v_{i+1,j} + 3,\; v_{i+1,j+1} + 4) \qquad (6)$$
8
7
4
3
0
3
7
4
3
0
3
4
4
3
0
3
4
7
0
3
0
3
4
7
8
0
3
0
3
4
7
10
4
3
0
3
6
9
(a)
0 0
0 0
(b)
Fig. 4. Distance transformation and matching. (a) DT image of DEn-1; (b) Edge matching using DT image. In this example, computed value for matching confidence is 0.91287.
Computation of Matching Confidence and Moving Edge Detection. For matching, edge segments from $DE_n$ are superimposed on the DT image generated in the previous step from edge map $DE_{n-1}$. A normalized average of the DT image pixel values that the edge segment from $DE_n$ hits is the measure of the matching confidence between edges of $DE_{n-1}$ and $DE_n$. A perfect match between edge segments results in a matching confidence of zero, but due to noise we need to allow some tolerance. A normalized root mean square value is computed with (7) to find the matching confidence of the $l$th edge segment of $DE_n$:
$$\text{Matching\_confidence}[l] = \frac{1}{3}\sqrt{\frac{1}{k}\sum_{i=1}^{k}\{dist(l_i)\}^2} \qquad (7)$$
where $k$ is the number of edge points in the $l$th edge segment of $DE_n$ and $dist(l_i)$ is the distance value at the $i$th edge point of the $l$th edge segment. The average is divided by 3 to compensate for the unit distance 3 of the chamfer 3-4 distance transformation. For moving edge detection, those edge segments which do not have a good match with the DT image are removed from $DE_n$. For conformity of an edge segment, each point of the edge segment is looked up in the DT image to compute $\text{Matching\_confidence}[l]$. The existence of a similar edge segment in both $DE_{n-1}$ and $DE_n$ produces a low matching confidence value. We allow some flexibility by introducing a disparity threshold τ; empirically we set τ = 1.3 in our implementation. We consider two edge segments matched if $\text{Matching\_confidence}[l] \leq \tau$; in this case, the corresponding edge segment is considered a moving edge and consequently added to the moving edge list. Therefore, the resultant edge list $ME_n$ contains the edge segments that belong to moving objects in $I_n$. Fig. 4(b) illustrates the procedure of computing the matching confidence using the DT image. Fig. 5(a) depicts the moving edges detected after matching.
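A minimal sketch of the matching confidence of Eq. (7) and of the moving-edge test with the threshold τ; the segment representation is simplified here to a list of pixel coordinates.

```python
import numpy as np

def matching_confidence(dt_image, segment_points):
    """Matching confidence of Eq. (7) for one edge segment of DE_n
    superimposed on the chamfer DT image built from DE_{n-1}.
    segment_points: list of (row, col) pixel coordinates."""
    dists = np.array([dt_image[r, c] for r, c in segment_points], dtype=float)
    return np.sqrt(np.mean(dists ** 2)) / 3.0  # divide by the unit distance 3

def is_moving_edge(dt_image, segment_points, tau=1.3):
    """An edge segment is kept as a moving edge if its confidence is <= tau."""
    return matching_confidence(dt_image, segment_points) <= tau
```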
3.4 Moving Object Segmentation
A watershed based segmentation algorithm is employed to segment out the regions of the moving objects. Using the moving edges detected in the previous step, a rectangular region of interest (ROI) is extracted from the current image and the watershed algorithm is applied on the ROI for segmentation. Considering accuracy and efficiency, we have adopted the Vincent-Soille watershed algorithm [17]. To alleviate the over-segmentation problem of the Vincent-Soille algorithm, we replace the gradient image by zero where its values are less than a particular threshold $T_1$. $T_1$ is determined as the mean of the gradient image minus one fourth of its standard deviation. Thus around fifty percent of the gradient values are replaced with zero, which reduces the over-segmentation problem significantly. The gradients of $D_n$ and $D_{n-1}$ are computed with (8) and (9) during difference image edge map generation, and are stored in $\nabla_{n-1}$ and $\nabla_n$, respectively. As the moving object in $I_n$ exists in both difference images, high gradient values exist at the boundary of the moving object in both images. The infimum gradient image $\nabla^{inf}$ is computed from $\nabla_{n-1}$ and $\nabla_n$ considering 8 neighbors using equation (10).
$$\nabla_{n-1} = \nabla G * D_{n-1} \qquad (8)$$
$$\nabla_n = \nabla G * D_n \qquad (9)$$
Extraction and matching parts of the DAD are linked in Fig. 2. The final discriminant analysis diagram defines exactly 6 ∗ 4 = 24 discriminant schemes. A particular application can be tuned against the DAD and the best path can be selected. In the next section we illustrate this DAD based recognition system optimization with an example face verification system.
Fig. 2. Complete discriminant analysis diagram
4 Experiments for Face Verification
In terms of operation compositions we define the following discriminant operation sequence (denoted here by DLDA), which also includes optimal weighting for matching:
$$W_{DLDA} := S_n^{(w)}\, N_n\, P_{q,n}^{(w)}\, S_q^{(b)}\, P_{N,q}^{(b)}\, C^{(g)} \qquad (11)$$
In this section we compare it with the more popular LDA scheme, improved by centering and normalization operations:
$$W_{LDA} := S_n^{(b)}\, N_n\, P_{q,n}^{(b)}\, S_q^{(w)}\, P_{N,q}^{(w)}\, C^{(g)} \qquad (12)$$
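As a rough sketch of how the operator sequence in (11) can be applied to a raw feature vector, assuming the operator matrices defined in the earlier sections of the paper are available; treating $C^{(g)}$ as grand-mean centering and $N_n$ as projection onto the unit sphere (cf. Sect. 4.3) are simplifications of this sketch, not the paper's exact definitions.

```python
import numpy as np

def dlda_extractor(grand_mean, P_Nq_b, S_q_b, P_qn_w, S_n_w):
    """Apply the operator sequence of Eq. (11) to a raw feature vector x.
    The matrices of the individual operators are assumed to be precomputed."""
    def extract(x):
        z = P_qn_w @ (S_q_b @ (P_Nq_b @ (x - grand_mean)))  # P_{q,n}^(w) S_q^(b) P_{N,q}^(b) C^(g)
        z = z / np.linalg.norm(z)                           # N_n: unit-sphere normalization
        return S_n_w @ z                                    # S_n^(w): final scaling
    return extract
```

Verification then reduces to comparing the distance between two extracted vectors against a threshold.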
4.1 Facial Databases Selected
We selected the normalized luminance facial images (46 × 56 resolution with the same position of eyes) from the following databases (cf. Fig. 3): Altkom (30 persons with 15 images each), Banca (52 persons with 10 images each), Feret (500 persons with 5 images and 375 persons with 4 images), Mpeg (635 persons with 5 images each), Xm2vts (295 persons with 10 images each).
Fig. 3. Fifteen pictures from the facial databases: Altkom, Banca, Feret, Mpeg, Xm2vts
4.2 LDA Versus DLDA
From the previous work described in [5] it is already known that, in the case of face verification, optimization of the inverse Fisher ratio (DLDA) leads to better results than optimization of the Fisher ratio (LDA). Figure 4 gives more insight into this phenomenon: DLDA eigenfaces (fisherfaces) are more contrasted and more focused on particular facial parts.
Fig. 4. First 15 eigenfaces (fisherfaces) from U (b) in LDA (upper part) and U (w) in DLDA (lower part)
To choose the best dimension of the intra-class subspace in DLDA and of the inter-class subspace in LDA, the Mpeg database was used. The best equal error rate (EER) is obtained for dimension 172 in LDA and 140 in DLDA (cf. Fig. 5).
Fig. 5. EER against tuning parameter q for LDA and DLDA
It is interesting that aggregating the DLDA and LDA verifiers by the maximum, the arithmetic mean, or the harmonic mean of distances gives intermediate results (in the ROC sense) between the best DLDA results and the significantly worse LDA results. However, for the Feret and Xm2vts databases, the geometric mean of both distances leads to slight improvements of EER and ROC over DLDA. In Figs. 6, 7, and 8 we compare these results on three facial databases in two scenarios: single image (Image ROC) and multi-image (Person ROC).
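A small sketch of the distance aggregation rules compared here (our illustration; the verifier distances d_lda and d_dlda are assumed to be non-negative scores produced by the two pipelines):

```python
import numpy as np

def fused_distance(d_lda, d_dlda, mode="geometric"):
    # Combine the LDA and DLDA verifier distances for one probe/gallery pair.
    if mode == "geometric":
        return float(np.sqrt(d_lda * d_dlda))
    if mode == "arithmetic":
        return 0.5 * (d_lda + d_dlda)
    if mode == "harmonic":
        return 2.0 * d_lda * d_dlda / (d_lda + d_dlda + 1e-12)
    return max(d_lda, d_dlda)              # maximum rule
```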
Fig. 6. ROC and EER for DLDA and combined LDA+DLDA for Feret facial database
Fig. 7. ROC and EER for DLDA and combined LDA+DLDA for Mpeg facial database
4.3 Scaling and Normalization
In Fig. 9 four possible combinations of scaling and normalization are compared. The best combination is Nn Sn. The phenomenon is explained by a weak correlation between the intra-class error and the inter-class error. Therefore, despite the comparable norm magnitudes of those errors, the projection onto the unit sphere separates them, while the final scaling respects the probability of all errors which are projected onto the same point of the unit sphere.
Fig. 8. ROC and EER for DLDA and combined LDA+DLDA for Xm2vts facial database
Fig. 9. ROC and EER for DLDA method with combinations of normalization (N) and scaling (S) operations
5 Conclusions
The discriminant analysis diagram (DAD), as a methodology for the design of error-based pattern recognition systems, has been presented. It generalizes, significantly modifies, and extends the DAD model preliminarily presented in the authors' recent paper [6] in the context of biometric verification systems. The DAD model is recommended for biometric pattern recognition, where the intra-class error is the natural feature for object discrimination. The DAD could be used
in practice as the basis for the systematic design of good LDA-type extraction and matching procedures. We show this possibility on an exemplary advanced face verification system. Moreover, the linear DAD model can be generalized to a nonlinear version by the well-known kernel trick. The research on the efficiency of the "kernelized" approach will be the subject of our future work. Acknowledgment. The work presented was developed within VISNET 2, a European Network of Excellence (http://www.visnet-noe.org), funded under the European Commission IST FP6 Programme.
References
1. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188 (1936)
2. Putter, J.: Orthonormal Bases of Error Spaces and Their Use for Investigating the Normality and Variances of Residuals. Journal of the American Statistical Association 62(319), 1022–1036 (1967)
3. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, London (1992)
4. Golub, G., Loan, C.: Matrix Computations. The Johns Hopkins University Press, Baltimore, MD (1989)
5. Skarbek, W., Kucharski, K., Bober, M.: Dual LDA for Face Recognition. Fundamenta Informaticae 61, 303–334 (2004)
6. Leszczyński, M., Skarbek, W.: Biometric Verification by Projections in Error Subspaces. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M., Cercone, N., Slezak, D. (eds.) Rough Set and Knowledge Technology. LNCS (LNAI), vol. 4481, pp. 166–174. Springer, Heidelberg (2007)
Robust Tensor Classifiers for Color Object Recognition

Christian Bauckhage
Deutsche Telekom Laboratories, 10587 Berlin, Germany
http://www.telekom.de/laboratories
Abstract. This paper presents an extension of linear discriminant analysis to higher order tensors that enables robust color object recognition. Given a labeled sample of training images, the basic idea is to consider a parallel factor model of a corresponding projection tensor. In contrast to other recent approaches, we do not compute a higher order singular value decomposition of the optimal projection. Instead, we directly derive a suitable approximation from the training data. Applying an alternating least squares procedure to repeated tensor contractions allows us to compute templates or binary classifiers alike. Moreover, we show how to incorporate a regularization method and the kernel trick in order to better cope with variations in the data. Experiments on face recognition from color images demonstrate that our approach performs very reliably, even if just a few examples are available for training.
1 Introduction and Related Work

Lately, tensor-based methods have attracted considerable interest in the image processing and computer vision communities. Two pioneering contributions that spawned this trend are due to Shashua and Levin [1] and Vasilescu and Terzopoulos [2]. Interpreting a set of intensity images {A1, A2, . . .} as a third-order tensor A, the authors of [1] consider rank-1 approximations of A. Their multi-matrix extension of the singular value decomposition (SVD) yields a set of matrices whose linear span includes the input data. Empirical results show that projecting the data into the corresponding subspace captures spatial and temporal redundancies in the input and therefore is well suited for image coding and subsequent classification. In [2], the authors represent a collection of intensity images of different faces with different facial expressions under different lighting conditions as a fifth-order tensor B. Applying the higher order singular value decomposition (HOSVD) [3] leads to a factorization of B that allows for computing eigen modes of the image ensemble and thus enables flexible, application-dependent dimension reduction: subspaces may be tailored that represent some of the independent modalities in the input data more faithfully than others. Most of the more recent papers on tensor methods for computer vision also emphasize dimension reduction. Shashua and Hazan [4] consider non-negative tensor factorizations which lead to semantically meaningful basis images. Wang and Ahuja [5] extend the HOSVD procedure and obtain sparse but faithful representations of video data. Dealing with video rendering, Vlasic et al. [6] apply the HOSVD to mesh models
of faces. Wang et al. [7] present a block-wise HOSVD that efficiently handles 6th- or 7th-order tensors representing texture data. Given the success of tensor methods in image coding and compression, there has been surprisingly little work on adopting multilinear techniques to image understanding. Tenenbaum and Freeman [8] devise bi-linear functions for a shape classification task where the input is formed by two independent factors. Kinenzle et al. [9] consider the problem of face detection in intensity images and introduce support vector machines whose support vectors are constrained to be rank deficient second-order tensors. Recently, several researchers independently extended linear discriminant analysis to second-order tensors [10,11,12,13]. As their methods are related to the matters treated here, we shall describe them in more detail later on. The higher order generalization of discriminant analysis presented in this paper extends ideas put forth in [10] and [11]. Towards the problem of color object recognition, we describe an alternating least squares procedure for simultaneously deriving parallel factor models of third-order approximations of discriminant classifiers. By replacing the ordinary least squares estimators in the original algorithm with regularized estimators or estimators that are based on the kernel trick, we compute higher order templates or classifiers. Practical experience shows that these are robust against outliers and substantial variation in the training data and perform well, even if only very few examples are available for training. Next, we will first introduce notation and basic definitions needed in the sequel. Section 3 will then present our algorithmic approach and contrast it with related recent proposals. In section 4, we apply our technique to the problem of face recognition and discuss our experimental results. A conclusion and an outlook to future work will close this contribution.
2 Basic Concepts and Notation

In the remainder of this paper, we will make use of definitions and notational conventions adopted from [14,15]. However, since our primary concern is object recognition from color images and since color images can be thought of as real valued third-order tensors (see Fig. 1(a)), we restrict our discussion to third-order tensors over R. If A is an m1 × m2 × m3 tensor over R, its elements are Aijk where the indices range from 1 to md, respectively. The inner product of two tensors A, B ∈ R^{m1 × m2 × m3} is defined as

A · B = Σ_{i=1}^{m1} Σ_{j=1}^{m2} Σ_{k=1}^{m3} Aijk Bijk.    (1)
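A small NumPy illustration of these contractions (an example of ours, not code from the paper); the einsum index strings mirror the index notation used here:

```python
import numpy as np

A = np.random.rand(4, 5, 3)           # a third-order tensor, e.g. a 4x5 RGB patch
B = np.random.rand(4, 5, 3)

inner = np.einsum('ijk,ijk->', A, B)  # Eq. (1): full contraction A . B

M = np.random.rand(4, 5)
u = np.random.rand(5)
v = np.einsum('ij,j->i', M, u)        # matrix-vector contraction v_i = M_ij u_j
```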
Using Einstein’s summation convention, in which we implicitly sum over repeated indices in products, we may simply write A · B = Aijk Bijk . Note that the inner product is a tensor contraction, i.e. a product of two tensors (of possibly different orders) that results in a lower order object. Another familiar example is the matrix-vector multiplication M u = v, where M ∈ Rm1 ×m2 and u ∈ Rm2 . The components of the resulting vector v ∈ Rm1 are given by vi = Mij uj . In Penrose’s abstract index notation, the indices assume the role of abstract markers in terms of
(a) A ∈ Rm1 ×m2 ×m3
(b) u ⊗ v ⊗ w = A
Fig. 1. 1(a) A color image can be thought of as a third-order tensor A ∈ R^{m1×m2×m3}, where m1 and m2 correspond to the x- and y-resolution, respectively, and m3 counts the number of color channels. 1(b) The outer product of three vectors u ∈ R^{m1}, v ∈ R^{m2}, and w ∈ R^{m3} results in a tensor A ∈ R^{m1×m2×m3}.
which the algebra is formulated. This introduces precious versatility into the writing of tensor equations. For instance, the following expressions become equally valid to denote the contraction in our example: M u = v ⇔ vi = Mij uj . A tensor A ∈ Rm1 ×m2 ×m3 is called a decomposed or rank-1 tensor, if it results from the outer product of three vectors u ∈ Rm1 , v ∈ Rm2 , and w ∈ Rm3 , i.e. A = u ⊗ v ⊗ w.
(2)
In this case, the elements of A are simply given by Aijk = ui vj wk. Third-order tensors that can be approximated by means of a sum of r = 1, . . . , ρ rank-1 tensors which minimizes the Frobenius norm

‖ A − Σ_{r=1}^{ρ} u^r ⊗ v^r ⊗ w^r ‖_F    (3)
are said to have a parallel factor (PARAFAC) model [15,16]. Two decomposed tensors A and B are orthogonal, if A · B = 0. They are completely orthogonal, if, in addition, ura ⊥ urb , v ra ⊥ v rb , and wra ⊥ w rb for 1 ≤ r ≤ ρ.
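A tiny NumPy illustration of a rank-1 (decomposed) tensor and of the PARAFAC objective in Eq. (3); this is our example rather than code from the paper:

```python
import numpy as np

def rank1(u, v, w):
    # Eq. (2): A_ijk = u_i v_j w_k
    return np.einsum('i,j,k->ijk', u, v, w)

def parafac_residual(A, factors):
    # Frobenius norm of Eq. (3) for a list of (u^r, v^r, w^r) triples.
    approx = sum(rank1(u, v, w) for (u, v, w) in factors)
    return np.linalg.norm(A - approx)
```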
3 Extending Discriminant Analysis to Higher Order Data

Dealing with view-based object detection or recognition, one is often given a training set of pairs {(X^l, y^l)}, l = 1, . . . , N, where the X^l are exemplary images and the labels y^l ∈ {+1, −1} indicate whether or not they show the intended object. Due to the conceptual simplicity of linear algebra, most approaches to appearance based object recognition transform color image patches X ∈ R^{m1×m2×m3} into vectors x ∈ R^{m1 m2 m3}. Linear discriminant analysis (LDA) is a well established technique for classifying data like this. In this section, we therefore briefly review traditional vector-based LDA and two recent multilinear extensions; afterwards we present our own generalization of LDA to tensor spaces.

3.1 The Generalized Eigenvalue Approach and Its Extensions to Tensors

Traditional LDA considers a set of (feature) vectors {x1, x2, . . . , xN} containing positive and negative examples. It seeks a projection w · x^l, l = 1 . . . N, of the samples
that maximizes the inter-class distance of the resulting scalars. In order to determine w, Fisher [17] proposed to maximize the Rayleigh quotient w^T S_b w / w^T S_w w. With S_b and S_w denoting the between-class and within-class scatter matrices of the data, w results from solving the generalized eigenvalue problem S_b w = α S_w w. However, note that because of the extreme dimension of vectorized image data, the LDA approach to object recognition usually requires dimensionality reduction in order for the training process to even become feasible. Recently, Ye et al. [12] therefore presented an extension of this approach to matrix spaces. Dealing with data given in the form of matrices X^l, they seek projection matrices L and R such that projecting the data according to L^T X^l R preserves the structure in the original higher-dimensional space. Their solution is an iterative procedure of solving generalized eigenvalue problems in the row- and column-spaces of the data, which are of considerably lesser size than the space that would result from vectorizing the matrices. More recently yet, Yan et al. [13] further extended the generalized eigenvalue approach to accomplish discriminant analysis in higher order tensor spaces. Their technique relies on the higher order SVD A_{i1 i2 ... in} = Z_{j1 j2 ... jn} U^1_{i1 j1} U^2_{i2 j2} . . . U^n_{in jn}. Here, the core tensor Z is of the same order and dimensions as the sample tensor A. The U^j are mutually orthogonal unitary matrices which are found by unfolding A along the jth mode. Each U^j is given by the left singular matrix of the corresponding unfolded matrix A^j. Higher order SVD of nth-order tensors thus requires n matrix SVDs of matrices of size mj × m1 m2 . . . mj−1 mj+1 . . . mn. Yan et al. develop an iterative procedure where they (i) initialize the projection matrices U^j to unity, (ii) compute HOSVD projections of the training data, (iii) compute the between- and within-class scatter matrices along all modes, and (iv) refine the corresponding matrices U^j by solving generalized eigenvalue problems and continue with the second step until the projection matrices converge to a stable solution. Both Ye et al. [12] and Yan et al. [13] evaluate their techniques on intensity image data. Both show that, for the task of face recognition, higher order LDA outperforms conventional linear discriminant analysis applied to vectorized image data.

3.2 The Least Mean Squares Approach

Fisher himself [17] pointed out that binary LDA is equivalent to ordinary least mean squares regression. Given a sample {(x^l, y^l)}, l = 1, . . . , N, a suitable projection direction w results from minimizing the error

E(w) = Σ_l ( w · x^l − y^l )^2 = ‖Xw − y‖^2    (4)
where the N × m sample matrix X consists of the samples x^l ∈ R^m and y ∈ R^N contains the corresponding labels. This is a convex optimization problem that has a closed form solution; from setting the gradient ∇_w E = 0 and some algebra, one obtains:

w = (X^T X)^{-1} X^T y.    (5)

Ordinary least squares regression is known to be sensitive against outliers in the training data. The ridge regression approach aims to alleviate this and to control overfitting by penalizing the norm of w. This is done by introducing a regularization term
into the error criterion: E(w) = ‖Xw − y‖^2 + λ‖w‖^2. Minimizing this error with respect to w is a convex problem, too, whose closed form solution is given by:

w = (X^T X + λI)^{-1} X^T y.    (6)

With some matrix algebra [18], one can show that w actually lies in the span of the training samples, i.e. w = X^T α, where α is called the dual vector. The error criterion may thus be cast as E(α) = ‖X X^T α − y‖^2 + λ‖X^T α‖^2, which is solved by α = (X X^T + λI)^{-1} y. Now the matrix X X^T of inner products between samples can be replaced by a kernel matrix K. Since the inner products in K can be inner products in any space, one can also consider nonlinear functions of the samples, for instance:

K_ij = exp( −‖x_i − x_j‖^2 / (2σ^2) ).    (7)

In terms of w, the kernel trick provides the solution:

w = X^T (K + λI)^{-1} y.    (8)
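For reference, a hedged NumPy sketch of the regularized and kernelized estimators of Eqs. (6)–(8); the default values of λ and σ are placeholders (the experiments in Fig. 8 explore values in {0.2, 0.5, 1.0}):

```python
import numpy as np

def ridge_ls(X, y, lam=0.5):
    # Eq. (6): w = (X^T X + lam I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kernel_ls(X, y, lam=0.5, sigma=0.5):
    # Eqs. (7)-(8): Gaussian kernel matrix, then w = X^T (K + lam I)^(-1) y
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return X.T @ alpha
```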
3.3 Extending the LMS Approach to Tensors

Motivated by problems in color object recognition, we propose to extend the LMS approach to LDA to higher orders by finding a projection tensor W that minimizes

E(W) = Σ_l ( W · X^l − y^l )^2.    (9)
If the samples are third-order tensors X^l ∈ R^{m1×m2×m3}, an unconstrained solution would still require solving for m1 m2 m3 free parameters. However, if we constrain W to be given by a ρ-term PARAFAC model

W = Σ_{r=1}^{ρ} u^r ⊗ v^r ⊗ w^r,    (10)

solving (9) would reduce to estimating ρ · (m1 + m2 + m3) parameters. If we choose ρ ≪ max{m1, m2, m3}, we will have ρ · (m1 + m2 + m3) ≪ m1 m2 m3 and training would accelerate considerably. Alas, minimizing (9) under the constraint in (10) is a nonlinear optimization problem for which there exists no closed form solution. However, as a least squares problem, it lends itself seamlessly to the following alternating least squares (ALS) solution. For convenience, let us first consider the derivation of a simple ρ = 1 term PARAFAC model. If, for instance, u and v were known, we could contract the training data over u and v and stack the resulting vectors x ∈ R^{m3} into a sample matrix X. Consequently, we can solve for w using either of the solutions (5), (6), or (8). This estimate of w can then be used to compute an updated estimate of v, which, in turn, would lead to a new estimate of u. Since in practice neither u nor v are known beforehand, ALS procedures usually start with random initializations, leading to the following overall procedure:
1. randomly initialize u ∈ R^{m1} and v ∈ R^{m2}
2. for l = 1, . . . , N compute the contractions x^l_k = X^l_{ijk} u_i v_j
3. solve for w(t) = argmin_w ‖Xw − y‖^2, where X = [x^1_k, . . . , x^N_k]^T
4. for l = 1, . . . , N compute the contractions x^l_j = X^l_{ijk} u_i w_k
5. solve for v(t) = argmin_v ‖Xv − y‖^2, where X = [x^1_j, . . . , x^N_j]^T
6. for l = 1, . . . , N compute the contractions x^l_i = X^l_{ijk} v_j w_k
7. solve for u(t) = argmin_u ‖Xu − y‖^2, where X = [x^1_i, . . . , x^N_i]^T
As the procedure starts with arbitrary initializations, steps 2 through 7 have to be iterated until convergence. Its convergence is guaranteed since each step of the procedure at best improves but never worsens the overall error. Although ALS procedures may not converge to the global minimum because they depend on a random initialization, extensive empirical tests revealed that PARAFAC fitting by means of ALS provides a very good trade-off between computation effort and solution quality [16]. For convergence testing, our implementation considers the refinement of the vector u. If, in iteration t, ‖u(t) − u(t − 1)‖ ≤ ε, the process is stopped. Practical experience shows that this converges quickly to useful results – usually in less than 10 iterations. Extending the procedure to the derivation of multi-term PARAFAC models is simple. If W = Σ_{r=1}^{k} u^r ⊗ v^r ⊗ w^r is a k-term solution for the LDA projection tensor, a k + 1 term representation can be found by minimizing E(u^{k+1}, v^{k+1}, w^{k+1}). In order to avoid the problem of factor degeneracy, where two or more factors per mode are almost collinear but of opposing sign so that they cancel out their contributions, it is appropriate to require that every newly found rank-1 tensor u^{k+1} ⊗ v^{k+1} ⊗ w^{k+1} be orthogonal to the ones derived so far [16]. This can be relaxed to requiring that 2 of the modes of the individual rank-1 tensors are completely orthogonal to the previously estimated ones [15]. Therefore, the (modified) Gram-Schmidt procedure is applied after steps 5 and 7 in the above algorithm. Although we derived the algorithm focusing on the case of 3rd-order tensors, its underlying principles immediately apply to tensors of arbitrary order. A summary of the general, nth-order form of the procedure, including stage-wise refinement and orthogonalization, is given in Figure 2. Dealing with two class problems, multilinear classification closely resembles the linear case. With ω+ denoting the class of intended object images and ω− denoting the class of all non-object images, the decision function is given by:

ω(X) = ω+ if W · X > θ, and ω− otherwise.    (11)

In the experiments reported in the next section, we estimated the classification threshold θ by means of projecting all positive training samples onto the discriminant direction and choosing the minimum value of the resulting scalars.
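A minimal NumPy sketch of the rank-1 ALS fit (steps 1–7) together with the convergence test and the threshold estimate described above; it plugs in the ridge estimator of Eq. (6), and the random seed, ε and the iteration cap are our own choices:

```python
import numpy as np

def als_rank1_classifier(tensors, labels, lam=0.5, eps=1e-4, max_iter=50):
    """Rank-1 ALS fit of a projection tensor W = u (x) v (x) w,
    using the ridge estimator of Eq. (6) in each least-squares step."""
    X = np.stack(tensors).astype(float)      # shape (N, m1, m2, m3)
    y = np.asarray(labels, dtype=float)
    N, m1, m2, m3 = X.shape
    rng = np.random.default_rng(0)
    u, v, w = rng.standard_normal(m1), rng.standard_normal(m2), rng.standard_normal(m3)

    def ridge(A, y):                         # Eq. (6)
        return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

    for _ in range(max_iter):
        u_old = u.copy()
        w = ridge(np.einsum('lijk,i,j->lk', X, u, v), y)   # steps 2-3
        v = ridge(np.einsum('lijk,i,k->lj', X, u, w), y)   # steps 4-5
        u = ridge(np.einsum('lijk,j,k->li', X, v, w), y)   # steps 6-7
        if np.linalg.norm(u - u_old) <= eps:               # convergence test
            break

    W = np.einsum('i,j,k->ijk', u, v, w)
    # threshold: smallest projection of the positive training samples
    theta = min(np.einsum('ijk,ijk->', W, t) for t, yl in zip(X, y) if yl > 0)
    return W, theta                          # Eq. (11): positive if W . X > theta
```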
4 Experimental Results and Discussion

All results presented in this section were obtained from experiments with the AR face database [19]. For the time being, we considered a subset of the database that contains
Input: a training set {X^l, y^l}, l = 1, . . . , N, of tensors X^l ∈ R^{m1×...×mn} with class labels y^l
Output: a rank-ρ approximation of an nth-order projection tensor, given by the terms u^r_1 ⊗ u^r_2 ⊗ . . . ⊗ u^r_n for r = 1, . . . , ρ

for r = 1, . . . , ρ
    t = 0
    for j = 1, . . . , n − 1
        randomly initialize u^r_j(t)
        orthogonalize u^r_j(t) w.r.t. {u^1_j, . . . , u^{r−1}_j}
    repeat
        t ← t + 1
        for j = n, . . . , 1
            for l = 1, . . . , N contract x^l_{ij} = X^l_{i1 ... ij−1 ij ij+1 ... in} u^r_{i1}(t) . . . u^r_{ij−1}(t) u^r_{ij+1}(t) . . . u^r_{in}(t)
            u^r_j(t) = argmin_{u^r_j} ‖X^T u^r_j − y‖^2, where X = [x^1_{ij}, . . . , x^N_{ij}] and y = [y^1, . . . , y^N]^T
            orthogonalize u^r_j(t) w.r.t. {u^1_j, . . . , u^{r−1}_j}
    until ‖u^r_1(t) − u^r_1(t − 1)‖ ≤ ε ∨ t > tmax
endfor
Fig. 2. Repeated alternating least squares scheme for computing an nth-order projection tensor W given as a sum of r = 1, . . . , ρ parallel factors ur1 ⊗ ur2 ⊗ . . . ⊗ urn
Fig. 3. Six training samples of a male face taken from the AR face database [19]
unobstructed pictures of 50 people with a total of 12 images per person. These 12 images are divided into two groups of 6 which show the same set of facial expressions and illumination conditions but were recorded 2 weeks apart. In each of our recognition experiments, some of the images from the one set were used for training and all of the images from the other set served as independent test cases. Typical examples of different facial expressions and various ambient lighting present in the training set are shown in Fig. 3; the size of these color image patches was set to 51 × 51 pixels. Figures 4 to 6 illustrate the outcome of several experiments meant to study the robustness of our ALS approach when applied to the problem of computing a tensorbased template of a set of color images. In each of the experiments, we considered the small sample of 6 images shown in Fig. 3. For the results in Figs. 4(a), 5(a), and 6(a), only the homogeneously illuminated images of different facial expressions were used for training; for the results in Figs. 4(b), 5(b), and 6(b), all six samples were taken into
(a) trained only with the left samples in Fig. 3
(b) trained with all samples in Fig. 3
Fig. 4. Tensor-based templates computed from the sample in Fig.3 where ordinary least squares estimators were used in our ALS procedure. The three results in 4(a) and 4(b) represent approximations of rank ρ ∈ {3, 6, 9}, respectively.
(a) trained only with the left samples in Fig. 3
(b) trained with all samples in Fig. 3
Fig. 5. Tensor-based templates computed from the sample in Fig.3 where regularized least squares estimators were used in our ALS procedure. The three results in 5(a) and 5(b) represent approximations of rank ρ ∈ {3, 6, 9}, respectively.
(a) trained only with the left samples in Fig. 3
(b) trained with all samples in Fig. 3
Fig. 6. Tensor-based templates computed from the sample in Fig.3 where kernel least squares estimators were used in our ALS procedure. The three results in 6(a) and 6(b) represent approximations of rank ρ ∈ {3, 6, 9}, respectively.
account. The templates were computed after normalizing the samples to unity and setting all labels y l = +1. For the results shown in Fig. 4, we applied ordinary least squares estimators in our ALS procedure for computing PARAFAC tensors. The templates in Figs. 5 and 6 result from applying ridge regression and kernel least squares, respectively. Finally, the number ρ of terms in the models was restricted to 3, 6, and 9. Obviously, for this setting, ordinary least squares estimators are not able to produce meaningful results. This is due to the extremely small sample size (3 or 6 examples) which is much smaller than the maximum spatial resolution of the images (51 pixels). Although ALS based tensor approximations can be shown to deal with small sample sizes [11], the unfavorable ratio of sample size to image dimension considered here renders the training task an under-determined optimization problem. Regularized or kernelized estimators, on the other hand, produce much more meaningful results, even for a tiny sample like this. In both cases, a higher rank ρ improves the quality of the template. It is noticeable that, in contrast to the templates that were derived using
(a) ordinary least squares   (b) regularized least squares   (c) kernel least squares
Fig. 7. Average face recognition error rates versus number of training images
regularized estimators, the templates resulting from plugging kernelized estimators into our ALS procedure do not only faithfully account for structural features but also capture color information contained in the sample. In a second series of experiments, we studied how tensor-based discriminant classifiers perform in a face recognition task. Encouraged by the results reported above, we opted for a realistic setting and only considered a few training examples1 . Each individual in the training set had to be recognized in the test set. In a series of eight experiments, the number of positive training samples (label = 1) was identically set to 6 but the number of (randomly selected) negative training samples (label −1) was varied between 0 and 114. Thus, in the most optimistic cases where most data was available for training, the sample size still was a mere 120. Again, we trained different classifiers where different least squares estimators and different maximal ranks ρ were considered. Figure 7 compares the average error rates yielded by classifiers of different ranks which were derived using the three different least squares estimation during training. Again, tensor-based classifiers based on ordinary least squares estimators cannot cope with small sample sizes. Only if the number of training examples exceeds the maximum dimension of the training images, does the recognition error drop to reasonable rates. With respect to classifiers resulting from training with kernel-based least squares estimators, it is interesting to note that we can actually observe a phenomenon of overfitting since the error rate rises again with a growing number of training samples; the effect is most concise for PARAFAC classifiers of low rank (ρ = 3). Figure 8 displays how different parameterizations of regularized- and kernel based estimators affect average error rates; in both cases, the rank ρ of the PARAFAC models was set to 6. It shows again that tensor-discriminant classifiers trained with regularized least squares estimators yield the lowest recognition error rates. Neither do they yield useless results nor do they seem to be prone to over-fitting. For still a small sample size of only a 120 training images, the average recognition actually drops below 3%. Finally, Fig. 9 confirms this observation for it presents a direct comparison of the best performing variants of the three approaches considered in our experiments. In the light of the above results, we shall stress some of the favorable properties of our approach to higher order discriminant analysis. The major advantage of PARAFAC 1
A setting like this is realistic for many applications, for instance in the area of surveillance. If the task is to recognize dubious or potentially dangerous individuals, the number of positive training samples is usually small. Likewise it is impossible to provide an exhaustive set of counterexamples that would represent all other human beings.
(a) regularized least squares   (b) kernel least squares
Fig. 8. Influence of the parameters λ and σ on the error rates obtained from PARAFAC classifiers of rank ρ = 6 trained with regularized and kernelized least squares estimators, respectively
Fig. 9. Comparison of recognition error rates versus number of training samples obtained from the best parameterizations of the different training approaches
models for tensor discriminant classification appears to be the considerably reduced number of free parameters. First of all, it drastically reduces training times. If multivariate data of size m1 × m2 × m3 were unfolded into vectors, conventional LDA based on LMS optimization or on solving a generalized eigenvalue problem would require the computation and inversion of covariance matrices of sizes m1 m2 m3 × m1 m2 m3 . Even for moderate values of m1 and m2 and not too many samples, this may become infeasible. However, the matrices that appear in the learning stage of our approach are of considerably reduced sizes m3 × m3 , m2 × m2 and m1 × m1 , respectively. We found that, compared to traditional LDA in high dimensional vector spaces, our approach shortens training by up to two orders of magnitude. Moreover, our PARAFAC model is not based on higher order SVD of an optimal projection tensor and does not require the expensive computation of an unconstrained tensor W. Instead, the presented ALS procedure directly derives a suitable classifier from training data. Also, our approach is less involved than recent approaches generalizing the Rayleigh criterion. While the latter require computing two correlation matrices for each mode in each iteration, the extension of the LMS approach requires only one such matrix per mode and iteration.
Finally, while the method by Yan et al. [13] is as general as our proposal, it necessitates frequent computations of SVDs of large matrices whereas, again, the ALS algorithm only deals with small matrices. Second of all, our experiments indicate that PARAFAC constrained multilinear predictors generalize well. The variants that apply regularized and kernel based least squares estimators during training yield reliable performance, even if the training sample is very small compared to the dimension of the samples. For the task of template learning, especially the kernel based version of our algorithm yields useful results. Even if there are only six training images that, in addition, are characterized by considerable variations in texture and illumination, we recognize meaningful structures in the resulting templates. For the task of view-based object recognition, the ridge regression variant of our algorithm performs more reliable. Even if there are considerable variations in the training data, it does not seem to suffer from over-fitting or adaption to noise. Altogether, concerning training time and effort, PARAFAC constrained tensor-based discriminant analysis yields good performance at a very low price. In general, it appears that multilinear discriminant classifiers resulting from more sophisticated training procedures capture important and salient visual structures contained in the training samples more faithfully than their counterparts using ordinary least squares. Since the effort for processing color data is small, our method provides an efficient approach to problems where the most salient features are not only correlated to texture variations in lower modes (i.e. in the x, y image plane) but also to variations along higher modes (e.g. along a color “direction”). PARAFAC tensor classifiers automatically account for multivariate variations and, as our experiments show, accomplish this on color image patches whose sizes are infeasible for most other methods.
5 Conclusion and Outlook

Towards the problem of color image analysis and classification, this paper presented an alternating least squares algorithm to learn PARAFAC models for higher order discriminant analysis. Since it does not rely on repeated singular value decompositions of large matrices, our technique is less involved than other recent extensions of discriminant analysis. Since the matrix operations that occur during training only apply to reasonably small matrices, our classifiers require fewer training data and train quickly. Also, we demonstrated how to incorporate regularization methods and the kernel trick into the training process. The resulting classifiers are robust against considerable variations in the training data and capture essential structural information even if only very few training samples are available. This effect was demonstrated by means of several template learning and face recognition experiments where we considered un-preprocessed RGB color images. As multilinear discriminant analysis of third-order image data accounts for salient information distributed across several modes, we achieved meaningful and reliable results. Currently, we are investigating the use of fourth-order tensor classifiers for color object detection and tracking in video data. Here it appears that the short training times of the tensorial approach offer interesting perspectives for interactive scenarios where online learning is pivotal.
References 1. Shashua, A., Levin, A.: Linear Image Coding for Regression and Classification using the Tensor-rank Principle. In: Proc. CVPR, vol. I, pp. 40–42 (2001) 2. Vasilescu, M., Terzopoulos, D.: Multilinear Analsysis of Image Ensembles: Tensorfaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002) 3. De Lathauwer, L., De Moor, B., Vanderwalle, J.: A Multilinear Singular Value Decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1235–1278 (2000) 4. Shashua, A., Hazan, T.: Non-Negative Tensor Factorization with Applications to Statistics and Computer Vision. In: Proc. ICML, pp. 792–799 (2005) 5. Wang, H., Ahuja, N.: Compact representation of multidimensional data using tensor rank-one decomposition. In: Proc. ICPR, vol. I, pp. 44–47 (2004) 6. Vlasic, D., Brand, M., Pfister, H., Popovi´c, J.: Face transfer with multilinear models. ACM Trans. on Graphics (Proc. SIGGRAPH’05) 24(3), 426–433 (2005) 7. Wang, H., Wu, Q., Shi, L., Yu, Y., Ahuja, N.: Out-of-core tensor approximation of multidimensional matrices of visual data. ACM Trans. on Graphics (Proc. SIGGRAPH’05) 24(3), 527–535 (2005) 8. Tenenbaum, J., Freeman, W.: Separating Style and Content with Bilinear Models. Neural Computing 12, 1247–1283 (2000) 9. Kienzle, W., Bakir, G., Franz, M., Sch¨olkopf, B.: Face Detection – Efficient and Rank Deficient. In: Proc. NIPS, pp. 673–680 (2005) 10. Bauckhage, C., Tsotsos, J.: Separable Linear Discriminant Classification. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) Pattern Recognition. LNCS, vol. 3663, pp. 318–325. Springer, Heidelberg (2005) 11. Bauckhage, C., K¨aster, T., Tsotsos, J.: Applying Ensembles of Multilinear Classifiers in the Frequency Domain. In: Proc. CVPR, vol. I, pp. 95–102 (2006) 12. Ye, J., Janardan, R., Li, Q.: Two-Dimensional Linear Discriminant Analysis. In: Proc. NIPS, pp. 1569–1576 (2005) 13. Yan, S., Xu, D., Zhang, L., Tang, X., Zhang, H.J.: Discriminant Analysis with Tensor Representation. In: Proc. CVPR, pp. 526–532 (2005) 14. Kolda, T.: Orthogonal Tensor Decompositions. SIAM J. Matrix Anal. Appl. 23(1), 243–255 (2001) 15. Zhang, T., Golub, G.: Rank-One Approximation to High Order Tensors. SIAM J. Matrix Anal. Appl. 23(2), 534–550 (2001) 16. Tomasi, G., Bro, R.: A Comparision of Algorithms for Fitting the PARAFAC Model. Comp. Statistics & Data Analysis 50, 1700–1734 (2006) 17. Fisher, R.: The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugenics 7, 179–188 (1936) 18. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006) 19. Mart´ınez, A.M., Kak, A.: PCA versus LDA. IEEE Trans. Pattern Anal. and Machine Intelli. 23(2), 228–233 (2001)
New Algorithm to Extract Centerline of 2D Objects Based on Clustering

Seifeddine Ferchichi¹, Shengrui Wang¹, and Sofiane Grira²

¹ Department of Computer Science, University of Sherbrooke, Quebec, Canada, J1K 2R1
{seifeddine.ferchichi, shengrui.wamg}@Usherbrooke.ca
² College of Computer Science, Abu Dhabi University, Al Ain campus, P.O 1790
[email protected]

Abstract. This paper presents a new algorithm to extract the centreline of 2D objects. The algorithm computes the centreline from all the points of the object in order to remain faithful to the structure of the shape. The idea is to cluster a data set constituted of the spatial position of each point composing the object. The centreline is derived from the set of computed cluster centres. The proposed method is accurate and robust to noisy boundaries.
1 Introduction

The shape is one of the essential characteristics of an object. In fact, on a daily basis our brain intuitively describes, analyzes, and classifies the different kinds of objects that surround us. But if we want to transpose this process to computer vision, we must find a way to represent the shape. Shape representation is a crucial element in various fields of computer vision: in object recognition; for solid modeling in design; in computer graphics, for rendering, animation, and mesh generation; in medical imaging, for representing branching structures; and in robotics, for path planning. In such contexts, the medial axis or shape skeleton has been a popular technique. This skeletal representation incorporates the advantages of both region-based and boundary-based descriptors and provides a compact and highly intuitive shape description, which reduces the gap between low-level raster-oriented shape analysis and semantic object description. The shape skeleton or medial axis approach has been used in several applications [8] such as object recognition and representation, inspection of printed circuit boards, analysis of medical imagery, optical character recognition and road extraction [9]. The concept of the skeleton was first introduced by Blum [2]; it is defined as the locus of the centers of maximal disks inscribed in the object and equidistant from at least two points on the boundary. In order to be efficient for shape description, the skeleton must be thin, centered, topologically accurate and reversible (allowing the reconstruction of the original object). In the literature [5], we find several skeletonization methods that satisfy a subset of these properties. Since the early work by Blum, a variety of methods have emerged, which can be broadly grouped [8] into the following four classes.
1.1 Topological Thinning Methods

Topological thinning methods are influenced by work on characterizing the topological properties of the object rather than the metric properties of its shape. The idea is to iteratively peel off boundary points by identifying simple points whose removal does not alter the object topology. Certain simple points which are end-points are preserved unchanged in order to conserve the extremities of the object. One characterization of an end-point in 2D is a point which is adjacent to just one other point. Thinning methods produce a connected skeleton that captures the object topology, but they do not guarantee a perfectly thinned output. They require a post-processing step to smooth and extract centerline segments for further analysis.

1.2 Extracting the Skeleton from the Distance Transform

The distance transform labels every point within the object with the minimum distance from that particular point to the nearest point on the border of the object. Points closest to the center of the object have the maximum distance transform value and are not removed. The centreline can be defined as a set of maximal disks. The idea of the distance transform method implies three stages: 1) compute the distance transform field, 2) detect all the local maximum points, and 3) connect the maximum points into a centreline. The computation of a correct Euclidean distance transform turned out to be neither particularly efficient nor algorithmically trivial [7]. Several algorithms have been proposed for computing the Euclidean distance transform [3] [4] [11]. Therefore, many skeletonization algorithms substituted the Euclidean metric with regular metrics such as "chessboard" or "city block" metrics. The use of regular metrics helps to avoid complex algorithms, but it introduces considerable deviations from the correct Euclidean distances into the distance transform [6]. Each point in the object is considered as the center of a disk, constructed with a radius equal to the distance transform value. The medial axis is defined as the locus of maximal disks. A maximal disk is one which is not contained in the disk of any other point. The extracted points do not guarantee a connected skeleton. Niblack et al. propose an efficient centerline method that identifies saddle points to obtain a connected skeleton. Distance transform methods permit object reconstruction from the skeletal points and their distance transform values.

1.3 Simulation of the Fire Propagation

The idea of fire propagation was proposed by Blum [2]. Considering an object made of dry grass, we start a fire on its contour. The wave front (fire) proceeds with constant speed toward the interior of the object [1]. When two fire fronts meet each other, the fire is quenched. The locus of the quench points forms the skeleton of the object. To compute the skeleton of a binary image, the existing methods simulate a wave front. Generally the wave is modelled by an active contour whose movement is controlled by the object distance transform. Beyond the requirement that the user must specify the starting and end points of a particular path, the algorithm requires little user interaction. The wave is implemented using a fast marching scheme and is hence computationally efficient.
1.4 Voronoi Diagrams The Voronoi Diagram (VD) is a well-known tool in computational geometry. Given a set S of n point(forming the object contour) in a plane. The VD of a point Pi is the polygon enclosing all points in the plane that are closer to Pi than to any other point in S. The VD is a collection of the Voronoi polygons of all the points in S. The VD of points on the object boundary will generate Voronoi edges which are equidistant from at least two boundary points. The skeleton of the object is a subset of the VD. Theoretically, Voronoi methods guarantee connected skeletons. Nevertheless, object with noisy and irregular boundary tend to have a dense VD [13]. Therefore, a significant pruning is needed to produce thin skeleton. The above methods emphasize certain of the properties mentioned previously, depending on the application. In fact, real applications often entail several conflicting requirements. For example, automatic navigation of medical data sets requires that the skeleton be as thin as possible. Other applications such as compression impose the condition that the object must be reconstructible from the skeleton. Moreover, if details in the object have to be captured, the resulting skeleton is thick and dense. Besides the problem of controlling the skeleton’s thickness, existing methods suffer from sensitivity to details of the boundary [10]. In fact, small changes or noise in the boundary could result in major changes in the skeleton. Boundary smoothing represents an intuitive solution to this problem. Unfortunately, it does not solve the problem in general. In this paper, we propose a new algorithm for extracting the centerlines of 2D objects. The algorithm is based on a clustering approach. The idea is to compute the maximal disks/balls defining the centerline by iteratively executing the K-means algorithm on the object-points. The proposed algorithm is detailed in Section 2. Experimental results are presented in Section 3. Finally the conclusion is drawn in Section 4.
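For concreteness, a minimal SciPy sketch of stages 1 and 2 of the distance-transform approach of Sect. 1.2 (the connection stage and any pruning are omitted; this is an illustration of ours, not one of the cited algorithms):

```python
import numpy as np
from scipy import ndimage

def distance_transform_ridge(mask):
    """Stage 1: Euclidean distance field of a binary object mask.
    Stage 2: keep the points that are local maxima of that field
    (candidate centreline points); stage 3 (connection) is omitted."""
    dt = ndimage.distance_transform_edt(mask)
    local_max = (dt == ndimage.maximum_filter(dt, size=3)) & (dt > 0)
    return dt, local_max
```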
2 Proposed Approach

The proposed algorithm implements Blum's definition, in which the skeleton is a centre set of maximal disks (Fig. 1(a)) inscribed in the object, where the centres are equidistant from its boundary. However, adopting this method does not solve the problem of intrinsic sensitivity to small changes in the object's surface (Fig. 1(b)). In order to surmount this difficulty, we propose a method that computes the maximal disks from the whole set of points composing the object and not only the boundary points, as in the aforementioned methods. The idea is to extract the centreline by clustering the points of the object, where each cluster represents a disk that describes a local symmetry within the shape. However, we have to take into consideration the problem of the optimal number of clusters. If we have too many clusters, the centres will not be aligned with the shape of the object (Fig. 2(a)). In the opposite case, it is difficult to define the shape of the object (Fig. 2(b)). The solution is to validate the clustering results. To determine the optimal number of clusters, the approach based on cluster validation proposes to optimize two distinct criteria: 1) compactness: the elements of each cluster should be as close to each other as possible; 2) separation: the clusters themselves should be widely spaced. In our case, we cannot adopt such a strategy because the separation criterion will be maximized.
(a) Maximal disks
(b) Sensitivity to slight boundary changes Fig. 1. Concept of maximal disks
(a)
(b)
Fig. 2. Problems of the number of clusters
Due to the fact that the object is too dense, a small number of clusters would be enough to fit the entire structure. To address these problems, we propose to optimize the alignment of the cluster centres with the local symmetry axes of the object. The idea consists of adding clusters between the centres aligned with the symmetry of the object. However, adding new centres does not guarantee that the clusters are aligned with the shape of the object; for this purpose we have to validate the alignment of the centres. In the following subsections we present the clustering algorithm and the method for discriminating non-centered clusters, to ensure centering and an intuitive connection of the clusters.

2.1 Computation of the Maximal Disks

In our approach we apply clustering in order to compute the maximal disks. Our purpose is to cover the whole object with centered maximal disks, not to obtain a "good" repartition of the pixels forming the object. In order to reach this goal we need
to run the clustering algorithm iteratively. For this purpose, we choose a fast algorithm, the K-means. The principle of K-means consists of optimizing an objective function:

Min J(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik ‖x_k − v_i‖^2    (1)
where c is the number of clusters and n the number of pixels, V = {v_1, v_2, . . . , v_c} is the matrix of cluster centres v_i ∈ R^2, and U = (u_ik)_{c×n} is the membership matrix of the vectors x_k for each cluster i. u_ik must be equal to 1 or 0 for k = 1, 2, . . . , n, and ‖x_k − v_i‖^2 is the (Euclidean) distance between x_k and the ith cluster. To minimise J(U, V), the cluster centre matrix v_i and the membership matrix u_ik must be computed according to the following equations:

u_ik = 1 if ‖x_k − v_i‖ ≤ ‖x_k − v_j‖ for all j = 1..c, j ≠ i, and u_ik = 0 otherwise    (2)
v_i = ( Σ_{k=1}^{n} u_ik x_k ) / ( Σ_{k=1}^{n} u_ik ),   i = 1, 2, . . . , c    (3)
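A compact NumPy sketch of one K-means iteration implementing Eqs. (2) and (3) (our illustration; the paper does not specify implementation details):

```python
import numpy as np

def kmeans_step(points, centres):
    """Eq. (2): assign each pixel to its nearest centre.
    Eq. (3): recompute each centre as the mean of its member pixels."""
    d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=-1)
    labels = d.argmin(axis=1)
    new_centres = np.array([points[labels == i].mean(axis=0)
                            if np.any(labels == i) else centres[i]
                            for i in range(len(centres))])
    return labels, new_centres
```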
Once the computation of the clusters is done, we have to find the centered clusters and to cover the object with sufficient clusters in order to draw an accurate centerline.

2.2 Detection of Centered Clusters

In this subsection we explain in more detail the detection of centered clusters and the definition of the centerline. The next step after clustering is to detect the clusters placed on the symmetry axes of the object. To achieve this goal, we have established a process to find these centered clusters. The idea is to find three centroids (cluster centres) that are aligned on the symmetry axes of the object and satisfy the following two conditions (Eqs. 4 and 5). First, for each centre v_i we have to find the two closest centres to it according to:

v_l* = min_{l = 1..c, l ≠ i} ‖v_i − v_l‖    (4)
Fig. 3. Principle of computing the alignment of centres
(a) Topological
(b) Distance map
(c) Fire propagation
(d) Voronoi
(e) Object
(f) Proposed algorithm
Fig. 4. Curved object
(a) Topological
(b) Distance map
(c) Fire Propagation
(d) Voronoi
(e) Object
(f) Proposed algorithm
Fig. 5. Object in S
(a) Topological
(b) Distance map
(c) Fire propagation
(d) Voronoi
(e) Object
(f) Proposed algorithm
Fig. 6. Object in vertical position
(a) Topological
(b) Distance map
(c) Fire propagation
(d) Voronoi
(e) Object
(f) Proposed algorithm
Fig. 7. Object in horizontal position
Second, the three centroids should form a large angle, i.e. one near π. We impose this condition because our objects have a tubular structure. We add a threshold in order to treat curved shapes: if the object tends to have a straight shape, the threshold will be small; in the opposite case we can increase it.

θ_i = ∠ v_{i−1} v_i v_{i+1},   threshold < θ_i ≤ π,   i = 2..c−1    (5)
The purpose of our algorithm is to cover the object with centered clusters, in a way that captures its structure. To achieve this goal, we execute the K-means with an initial
number of clusters given by the user. If the number of detected centered clusters is not sufficient to cover the whole object, we execute the K-means algorithm iteratively, each time placing new clusters in the regions of the object not covered by a centered cluster. The centeredness of the clusters is verified at each iteration, as described in Subsection 2.2. We stop the process when the whole object is covered by centered clusters or no new centered disks appear. The algorithm for centerline extraction is as follows:

Algo: CenterlineExtraction(O, Nbrclusters, Threshold)
1. While ((centered clusters do not cover object) And (new centered clusters appear))
   1.1 Apply K-means
   1.2 Detect centered clusters by verifying Eqs. 4 and 5
   1.3 Add centers
2. Connect the centers by the method in reference [12]
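The sketch below illustrates this outer loop; the alignment test of Eq. (5), the ordering of the centres, the coverage test and the seeding of new clusters are simplifications of ours, not the paper's exact procedure:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def centre_is_aligned(prev_c, c, next_c, threshold=2.5):
    # Eq. (5): angle at c formed with its two closest neighbours;
    # the numerical threshold (in radians) is an illustrative choice.
    a, b = prev_c - c, next_c - c
    cos_t = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return threshold < np.arccos(np.clip(cos_t, -1.0, 1.0)) <= np.pi

def extract_centerline(points, k=10, threshold=2.5, max_rounds=20):
    """Iterative clustering: cluster the object pixels, keep centres that
    pass the alignment test, and add clusters until no new centered
    clusters appear. The kept centres approximate the centreline."""
    best = np.empty((0, 2))
    n_centered = 0
    for _ in range(max_rounds):
        centres, _ = kmeans2(points.astype(float), k, minit='++')
        c = centres[np.argsort(centres[:, 0])]        # walk centres along the shape
        centered = [c[i] for i in range(1, len(c) - 1)
                    if centre_is_aligned(c[i - 1], c[i], c[i + 1], threshold)]
        if len(centered) <= n_centered:               # no new centered clusters
            break
        best, n_centered = np.array(centered), len(centered)
        k += 2                                        # place extra clusters next round
    return best                                       # connect e.g. with the method of [12]
```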
3 Comparison and Experimental Results

In this section we present the results obtained by our algorithm and compare them to the methods cited in Section 1. We test the algorithms on four objects that have different kinds of shapes. We notice that the centrelines obtained by the topological algorithms are not one pixel wide (Figs. 4(a), 5(a), 6(a), and 7(a)) and that there are many branches that must be removed. Concerning the distance map algorithm, the obtained centrelines are not always connected (Figs. 4(b), 5(b), 6(b), and 7(b)). The fire propagation has the same problem, and the extremities of the centerline (Figs. 4(c), 5(c), 6(c), and 7(c)) are missing. The Voronoi algorithm produces a better result, but a post-processing step is needed to remove the branches (Figs. 4(d), 5(d), 6(d), and 7(d)). Our method produces an accurate result without any post-processing (Figs. 4(f), 5(f), 6(f), and 7(f)). However, we notice that for objects whose shape changes rapidly (Fig. 5(e)) we find a centre that is not aligned with its neighborhood (Fig. 5(f)); this problem can be solved with a more tolerant threshold.
4 Conclusion

In this paper a new approach for extracting the centerlines of 2D objects is proposed. Several examples of 2D objects are presented (Fig. 4 and Fig. 5). The method is able to deal with objects with noisy boundaries. The extracted centerline is centered and reversible. The algorithm guarantees that the centerline is connected and one pixel wide, and it can be extended to treat 3D objects.
References 1. Leymarie, F., Levine, M.D.: Simulating the grassfire transform using an active contour model. IEEE Trans. on PAMI 14(1), 56–75 (1992) 2. Blum, H.: A Transformation for Extracting New Descriptors of Shape. In: Symp. Models Percep. Speech Visual Form, pp. 362–380 (1967)
3. Yamada, H.: Complete Euclidean Distance Transformation by Parallel Operation. In: Proc.7th intl. on Pattern Recognition, pp. 69–71 (1984) 4. Ragnemalm, I.: The Euclidean Distance Transformation in Arbitrary Dimension. Pattern Recognition Letters 4, 883–888 (1993) 5. Cornea, N.D., Silver, D., Min, P.: Curve-Skeleton Applications. In: IEEE Visualization Conference (2005) 6. Cuisenaire, O.: Distance Transformations: Fast Algorithms and Applications to Medical Image Processing. PhD thesis, Universit´e Catholique de Louvain (October 2001) 7. Ogniewicz, R.L., K¨ubler, O.: Hierarchic Voronoi skeletons. Pattern Recognition 28(3), 343– 359 (1995) 8. Singh, R., Cherkassky, V., Papanikolopoulos, N.: Self-Organizing Maps for the Skeletonization of Sparse Shapes. IEEE Trans. on Neural Networks 11(1), 241–248 (2000) 9. Ferchichi, S., Wang, S.: Optimization of Cluster Coverage for Road Centre-line Extraction in High Resolution Satellite Images. In: Proc. ICIP (2005) 10. Choi, S.W., Seidel, H.P.: Linear Onesided Stability of MAT for Weakly Injective 3D Domain. In: Proc. ACM SMA (2002) 11. Saito, T., Toriwaki, J.: New algorithms for Euclidean Distance Transformation of an nDimensional Digitized Picture with Applications. Pattern Recognition 27, 1551–1565 (1994) 12. Dey, T.K., Wenger, R.: Fast Reconstruction of Curves with Sharp Corners. Int. J. Computational Geometry and Applications 12(5), 353–400 (2002) 13. Lorensen, W.E., Cline, H.E.: Marching Cubes: A High Resolution 3D Surface Construction Algorithm. ACM SIGGARAPH Proc. 21(4), 163–166 (1987)
A Novel Bayesian Classifier with Smaller Eigenvalues Reset by Threshold Based on Given Database

Guorong Xuan¹, Xiuming Zhu¹, Yun Q. Shi², Peiqi Chai¹, Xia Cui¹, and Jue Li¹

¹ Dept. of Computer Science, Tongji University, Shanghai, China
² Dept. of ECE, New Jersey Institute of Technology, Newark, New Jersey, USA
[email protected]

Abstract. A novel Bayesian classifier with smaller eigenvalues reset by a threshold based on the database is proposed in this paper. The threshold is used to substitute the eigenvalues of the scatter matrices which are smaller than the threshold, so as to minimize the classification error rate on a given database, thus improving the performance of the Bayesian classifier. Several experiments have shown its effectiveness. The error rates of both handwritten numeral recognition on the MNIST database and Bengali handwritten digit recognition are small when using the proposed method. Steganalyzing JPEG images with the proposed classifier also performs well.
Keywords: Improved Bayesian classifier, threshold, eigenvalues, handwritten digit recognition, steganalysis.
1 Introduction

Research on improving classifiers has always been a hot subject in pattern recognition. Given the probability distribution, the Bayesian classifier is well known to be optimal for classification [1]. In addition, its discriminant function is rather simple. The Gaussian distribution is generally used to approximate the real distribution of samples, and it is determined only by the mean and covariance. The Bayesian classifier minimizes the classification error rate. However, if the probability distribution is not known, how to achieve a small error rate in classification is a practical issue that needs to be resolved. Furthermore, in the case of high dimensionality or of a small sample database, the covariance matrix is often not of full rank, i.e., there are zero and/or small eigenvalues, and it is difficult to obtain the discriminant function. This has limited the utilization of the Bayesian classifier. In [2], the principal component null space method was proposed. It keeps the zero-variance subspace. For some applications, it does achieve good performance. In fact, the null space method can be viewed as zero-variance resetting from the viewpoint of the proposed methodology. In this paper¹, a novel Bayesian classifier with smaller eigenvalues reset by a threshold which is determined on the given database is proposed. It is similar to
This research is supported partly by National Natural Science Foundation of China (NSFC) on the project (90304017).
the conventional Bayesian classifier based on the Gaussian distribution; however, it uses a threshold to reset the eigenvalues which are smaller than this threshold, so as to make the classification error rate for the given database as small as possible.
2 Bayesian Classifier

The Bayesian classifier can be expressed as follows:

i = arg max_i ( g_i(x) )    (1)

g_i(x) = −(x − μ_i)^t Σ_i^{-1} (x − μ_i) − ln|Σ_i|    (2)

where x is a sample with dimensionality d, g_i(x) is the discriminant function, μ_i is the mean vector for class i, and Σ_i is the covariance matrix for class i. The minimum error rate classification can be achieved by maximizing the discriminant functions under a Gaussian distribution and equal prior probabilities [1, p. 36]. The Bayesian classifier can also be expressed with the d×d eigenvalue matrix Λ_i and the d×d eigenvector matrix U_i as follows:

g_i(x) = −(x − μ_i)^t (U_i^t Λ_i U_i)^{-1} (x − μ_i) − ln|Λ_i|    (3)

where Σ_i = U_i^t Λ_i U_i, and the eigenvalue matrix Λ_i is

Λ_i = diag(λ_i1, ..., λ_ik, λ_i(k+1), ..., λ_id)    (4)

with

λ_i1 > ... > λ_ik > λ_i(k+1) > ... > λ_id    (5)

When the eigenvalues λ_i(k+1), ..., λ_id are very small or equal to zero, it is impossible to calculate g_i(x), because Σ_i^{-1} does not exist. Hence, the traditional Bayesian classifier cannot be used in such cases. In [2], the null sub-space method, named Principal Component Null Space Analysis (PCNSA), is proposed to handle matrices that are not of full rank. It utilizes the Euclidean distance in the 'null space' of each class instead of the discriminant function. In PCNSA the threshold is fixed at a constant 10^-4 times the class maximum eigenvalue, so it cannot adapt to different conditions. In [3] a threshold is selected under minimum error rate classification for a given database, but the method still utilizes the Euclidean distance. Instead of the Euclidean distance used in both [2] and [3], an improved Bayesian classifier is used in this paper, resulting in better results.
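As an illustration of (1)-(5) (a sketch we add here, not code from the paper; NumPy is assumed), the discriminant can be evaluated through the eigen-decomposition of each class covariance matrix, and it breaks down exactly when some eigenvalues are zero:

```python
import numpy as np

def gaussian_discriminant(x, mu, cov):
    """g(x) = -(x - mu)^t Sigma^{-1} (x - mu) - ln|Sigma|, evaluated through
    the eigen-decomposition of Sigma (cf. eqs. (2)-(5))."""
    lam, U = np.linalg.eigh(cov)            # eigenvalues and eigenvectors of Sigma
    if np.any(lam <= 0):
        raise ValueError("zero or negative eigenvalue: Sigma^{-1} does not exist")
    z = U.T @ (x - mu)                      # coordinates in the eigenbasis
    return -np.sum(z ** 2 / lam) - np.sum(np.log(lam))

def bayes_classify(x, means, covs):
    """Eq. (1): assign x to the class with the largest discriminant."""
    return int(np.argmax([gaussian_discriminant(x, m, c)
                          for m, c in zip(means, covs)]))
```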
3 Proposed Bayes Classifier with Smaller Eigenvalues Resetting

3.1 Concept

In this paper, a Bayes classifier in which eigenvalues smaller than a threshold are reset to that threshold is proposed, as an improvement over the traditional Bayes classifier under the normal distribution assumption. That is, assume all eigenvalues of the covariance matrix are listed in non-increasing order: λ_i1, ..., λ_ik, λ_i(k+1), ..., λ_id. Then λ_0, with λ_0 > λ_i(k+1), ..., λ_id, is selected to replace λ_i(k+1), ..., λ_id, and λ_0 is chosen so that the classification error rate is smallest. Some points about the proposed method are listed below. 1) The hypothesis of an approximately normal distribution for a large sample set is often reasonable, because the distribution of a large sample set is usually clustered around the mean vectors. 2) The expression is close to the traditional Bayesian classifier, but the calculation difficulty caused by matrices that are not of full rank is avoided. 3) The threshold λ_0 is kept larger than λ_i(k+1), ..., λ_id and smaller than λ_i1, ..., λ_ik. It need not be near zero, and sometimes it is a large number.

3.2 Bayes Classifier with Smaller Eigenvalues Reset

A threshold λ_0 is used to substitute those eigenvalues λ_i(k+1), ..., λ_id of the eigenvalue matrix Λ_i of the covariance matrix Σ_i which are smaller than λ_0:

λ_0 > λ_i(k+1), ..., λ_id    (6)

The Bayes classifier with smaller eigenvalues reset can be expressed by the classification formula (7) and the discriminant (8):

i = arg max_i ( g_Bi(x) )    (7)

g_Bi(x) = −(x − μ_i)^t (U_i^t Λ_Bi U_i)^{-1} (x − μ_i) − ln|Λ_Bi|    (8)

where

Λ_Bi = diag(λ_i1, ..., λ_ik, λ_0, ..., λ_0)    (9)

The threshold λ_0 of the Bayes classifier with smaller eigenvalues reset is carefully selected so that the recognition rate is maximal for the given database:

λ_0 = arg min_{λ_0 in DB_i} ( error_database )    (10)
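A minimal sketch of (6)-(10) follows (our own illustration rather than the authors' code; the function and variable names are ours): eigenvalues below λ_0 are replaced by λ_0 before the discriminant is evaluated, and λ_0 is chosen by scanning candidate values against the error rate on the given database.

```python
import numpy as np

def reset_discriminant(x, mu, cov, lam0):
    """g_B(x) of eq. (8): eigenvalues of Sigma below lam0 are reset to lam0
    (the Lambda_B of eq. (9)) before inverting."""
    lam, U = np.linalg.eigh(cov)
    lam_b = np.maximum(lam, lam0)
    z = U.T @ (x - mu)
    return -np.sum(z ** 2 / lam_b) - np.sum(np.log(lam_b))

def select_threshold(candidates, samples, labels, means, covs):
    """Eq. (10): keep the candidate lam0 with the lowest error rate
    on the given database."""
    labels = np.asarray(labels)
    def error_rate(lam0):
        pred = [np.argmax([reset_discriminant(x, m, c, lam0)
                           for m, c in zip(means, covs)]) for x in samples]
        return float(np.mean(np.asarray(pred) != labels))
    return min(candidates, key=error_rate)
```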
3.3 Discussion About Threshold λ0

As mentioned above, the value of λ_0 is mainly determined by the given database. Several cases can be considered.

Case 1: λ_0 ≤ min(λ_id)    (11)

λ_0 is smaller than all eigenvalues of all covariance matrices. In this case Λ_Bi = Λ_i, and the proposed Bayesian classifier is the same as the traditional Bayesian classifier.

Case 2: λ_0 ≥ max(λ_i1)    (12)

λ_0 is bigger than all eigenvalues of all covariance matrices. Then

g_Bi(x) = −(1/λ_0)(x − μ_i)^t (x − μ_i) − d·ln λ_0    (13)

In this case, the proposed Bayesian classifier depends only on the mean values and is in fact the same as the Euclidean distance classifier.

Case 3: max(λ_i1) > λ_0 > min(λ_id)    (14)

This is the most common case and the core of the proposed Bayesian classifier, which is supposed, and later shown in this paper, to perform better than the traditional Bayesian classifier.

3.4 Discussion About Essence

(1) If the normal distribution assumption is valid for a given database, the Bayesian classifier is optimal. When some eigenvalues are close to zero, the Bayesian classifier cannot be used; here, a threshold is introduced so that the Bayesian classifier can still be used. (2) If the assumption of a Gaussian distribution is not valid, the traditional Bayesian classifier is no longer optimal. The threshold, which is selected by minimizing the classification error rate, makes the proposed method act like a 'regulator' to reduce the error rate. (3) The proposed Bayesian classifier with smaller eigenvalues reset by a threshold determined from the given database divides all eigenvalues into two parts. In the first part, the eigenvalues are smaller than the threshold. It is known that these small eigenvalues contribute more to correct classification [3]. After their resetting with the selected threshold, the Euclidean distance is effectively used in this part. In the second part, the eigenvalues are
bigger than the threshold and contribute less to correct classification than the smaller eigenvalues. They are used in the same way as in the Bayesian classifier. (4) The proposed smaller-eigenvalue resetting with a threshold determined by the given database is different from principal component null space analysis (PCNSA) [2], classwise non-principal component analysis (CNPCA) [3], the Fisher classifier, the Mahalanobis distance based classifier, the Euclidean distance based classifier, and the Bayesian classifier after lowering dimensionality via principal component analysis (PCA) [1].

3.5 Classification Rejection Rate

From the classification formula (7) and the discriminant formula (8), we use two discriminants to obtain the classification rejection rate. Consider a test sample. If its feature vector makes the discriminant g_Bi1(x) reach the maximum among all classes, the sample is classified to class i1; similarly, if the feature vector makes g_Bi2(x) the maximum, the sample is classified to class i2. Now assume that, among all of the classes, the
discriminants g_Bi1(x) and g_Bi2(x) are the two largest values, and that the difference of these two largest values is smaller than a pre-defined threshold value R:

g_Bi1(x) − g_Bi2(x) ≤ R    (15)

We then claim that the input feature vector, and hence the corresponding sample, is rejected for classification. That is, the two discriminants which assume the largest values determine the so-called classification rejection threshold R. Obviously, the larger the rejection threshold R, the larger the rejection rate and the smaller the classification error rate.
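The rejection rule of (15) can be sketched as follows (our illustration, reusing reset_discriminant from the sketch above; R is the pre-defined rejection threshold):

```python
import numpy as np

def classify_with_rejection(x, means, covs, lam0, R):
    """Return the predicted class, or None when the two largest discriminants
    differ by no more than R (eq. (15))."""
    g = np.array([reset_discriminant(x, m, c, lam0)
                  for m, c in zip(means, covs)])
    order = np.argsort(g)
    best, second = order[-1], order[-2]
    if g[best] - g[second] <= R:
        return None                  # rejected as too ambiguous
    return int(best)
```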
4 Experimental Works

4.1 Experiment I: Classification of Handwritten Numbers on the MNIST Database

The MNIST handwritten digit database [4] is a well-known and commonly used database for the performance evaluation of pattern recognition algorithms; it provides an objective way to evaluate the performance of a classification algorithm, and different classification methods can be compared with each other objectively as well. Some samples from the MNIST database are shown in Figure 1. From these samples, it is observed that the classes lie rather close to one another, while the samples within each class are rather scattered. In addition, there are 60,000 samples. Table 1 shows the recognition error rates obtained with different thresholds λ_0; the smallest error rate is obtained with λ_0 = 5400. Table 2 shows the performance comparison between the proposed method and other commonly used recognition algorithms.
Fig. 1. Some samples in the MNIST database

Table 1. Experiment results on the MNIST database

Threshold λ0   Classification Error Rate   Comment
1000           0.0460
4000           0.0366
5400           0.0361                      minimum
6000           0.0367
7000           0.0363
9000           0.0368
11000          0.0372

Table 2. Performance comparison among different recognition algorithms

Recognition Algorithm                          Classification Error Rate
Proposed method                                0.0361
Principal component null space analysis        0.0473
Bayesian classifier after PCA preprocessing    0.0622
Euclidean distance                             0.1700
Because the covariance matrices are not of full rank for all of the classes, in order to use the traditional Bayesian classifier, PCA is first introduced to reduce the dimensionality to 500; the traditional Bayesian classifier is then applied. From Table 2, it is observed that the proposed Bayesian classifier with small eigenvalues reset by a threshold performs better than all of the other classification algorithms. Furthermore, the scheme's execution time is rather short.

4.2 Experiment II: Classification of Handwritten Bengali Numbers in Postcodes

For the purpose of automatic mail sorting, thousands of actually used envelopes were collected and scanned. The single digits on the envelopes were
segmented to form a database consisting of 24,876 separate handwritten Bengali digits. The size of each separate handwritten digit is normalized to 28x28 pixels; therefore, the feature dimensionality is d = 28x28 = 784. The total number of classes is 10 (from 0, 1, ..., up to 9). In this experiment, in order to lower the classification error rate, the concept of classification rejection described in Section 3.5 is used. The utilized rejection rule is defined as follows: if g_Bi1(x), g_Bi2(x) are the two largest discriminants among the 10 classes, they are used in Formula (15), where R is a predefined value. As discussed in Section 3.5, the larger the R value, the larger the rejection rate and the smaller the classification error rate.

Table 3. Bengali digits
Arabic digits:   0  1  2  3  4  5  6  7  8  9
Bengali digits:  (the corresponding handwritten Bengali glyphs are shown in the original table)

Table 4. Recognition results for Bengali handwriting postcodes

Classification rejection rate   0%       3.44%    7.9%     9.76%    18.27%
Total number of errors          176      101      46       35       12
Error rate                      0.0361   0.0207   0.0094   0.0072   0.0025

Table 5. Eigenvalues compared with threshold λ0 = 20000 (total dimensionality d = 784)

Digit                                         0     1     2     3     4     5     6     7     8     9
Number of eigenvalues ≥ threshold λ0          731   729   725   727   705   716   722   728   727   714
Number of eigenvalues < threshold λ0          53    55    59    57    79    68    62    56    57    70
The Bengali digits from 0, 1, up to 9 are shown in Table 3. The numbers of samples used for training and for testing are 20000 and 4876, respectively. The best threshold value is λ_0 = 20000. Table 4 shows the recognition results associated with different classification rejection rates. The numbers of eigenvalues that are smaller or larger than the threshold λ_0 are shown in Table 5. In Table 6, the test results for all 4876 test samples are listed; it is observed that the classification error rate is 3.61%. These recognition results are satisfactory, and the algorithm is now used in a practical recognition system.
Table 6. Handwritten Bengali digits classification with zero classification rejection rate

digit      0     1     2     3     4     5     6     7     8     9     total
0          1646  9     0     11    0     1     1     3     2     1     1674
1          0     957   12    0     0     0     4     1     0     33    1007
2          0     1     689   0     0     0     8     0     0     2     700
3          3     0     0     302   0     1     13    0     1     0     320
4          0     1     3     0     217   2     1     1     3     0     228
5          7     0     3     1     0     224   2     1     0     0     238
6          0     0     0     9     0     7     217   0     2     2     237
7          0     0     0     0     0     0     0     234   0     1     235
8          0     0     0     0     0     0     3     0     120   1     124
9          0     12    2     1     1     0     1     1     1     94    113
correct    1646  957   689   302   217   224   217   234   120   94    4700
error      10    23    20    22    1     11    33    7     9     40    176
4.3 Experiment III: JPEG Image Steganalysis

Steganalysis can be treated as a two-class or multi-class pattern recognition problem. It is noticed that the between-class distance is small, while the within-class distance is large; hence, our proposed new Bayesian classifier can play a role in steganalysis as well. Advanced JPEG steganographic techniques such as OutGuess [5], F5 [6] and MB (model-based steganography) [7] modify the LSB (least significant bits) of JPEG AC coefficients to embed data into JPEG images, while keeping the histogram unchanged. To defeat these steganographic schemes, Markov processes have been used in order to utilize 2nd-order statistics in steganalysis [8, 9]. In this subsection, we present a new scheme, which uses a bidirectional Markov model to form 600-dimensional feature vectors for steganalysis; we then utilize our proposed Bayesian classifier with small eigenvalues reset by a threshold. Experimental results show the success of the proposed Bayesian classifier in JPEG steganalysis.

4.3.1 Extraction of 600 Features

For each JPEG image, we apply entropy decoding to obtain the JPEG quantized 8x8 block discrete cosine transform (DCT) coefficients (referred to as JPEG coefficients). For each 8x8 block, we scan the JPEG coefficients in three different ways: zigzag [10], vertical and horizontal, as shown in Figure 2. Only the first 21 JPEG coefficients are scanned, because most high-frequency JPEG coefficients are zero after JPEG quantization. For each of these three scans, i.e., each 1-D sequence of JPEG coefficients, we apply a so-called one-step [11] bidirectional [12] Markov model, and the probability transition matrix is used to characterize the Markov model [11]. In order to reduce the dimensionality of the probability transition matrices, we use a thresholding technique so that all JPEG coefficients with absolute values larger than a threshold T are forced to equal T. In this way, the dimensionality of the transition matrix is (2T+1)x(2T+1). In our experimental work, we choose T = 7; hence the probability transition matrix is 15x15. Each 8x8 block of JPEG coefficients generates one transition matrix. Assume that we have M 8x8 blocks in total; we average all of these M transition matrices, resulting in one average matrix of size (2T+1)x(2T+1). Since we use a bidirectional Markov model, the matrix is symmetric. Hence, we only choose
Fig. 2. Three different scanning ways: (a) zigzag, (b) horizontal and (c) vertical orderings of the first 21 JPEG coefficients of an 8x8 block

Fig. 3. Probability transition matrix
the 120 elements of the transition matrix as features for steganalysis, as shown in Figure 3. Consequently, the three scans result in 120x3 = 360 features. In addition, we arrange the 1-D sequence resulting from the zigzag scan over all 8x8 blocks in two different ways, thus generating an additional 120x2 = 240 features. Finally, we have 360+240 = 600 features in total.

4.3.2 Selection of the Threshold in the Proposed Bayesian Classifier for Steganalysis

We used all 1096 CorelDraw images [13] in our experimental work; 896 images are used for training and the remaining 200 images are used for testing. Each CorelDraw image is of size either 512x768 or 768x512. In order to reduce the effect caused by double JPEG compression [8], we JPEG compress the CorelDraw BMP images with Q-75 and use them as the cover images. We apply OutGuess, F5, and MB without deblocking (MB1) and with deblocking (MB2) to the CorelDraw BMP images with Q-75 to embed data, and use these images as stego images. In the experiments, we embed into each CorelDraw image 1 kB (1024 bytes), 2 kB and 4 kB of data, corresponding to 0.021, 0.041 and 0.083 bits per pixel (bpp), respectively. We run 10 experiments; in each experiment, we randomly selected 896 image pairs (cover and stego) for training
and 200 image pairs for testing. The reported classification results are the arithmetic average of the 10 experimental results. Because most of the eigenvalues associated with the bidirectional Markov transition-matrix features are very small and many of them are equal to zero, a threshold can be used to substitute these small eigenvalues, indicating that the proposed Bayesian classifier lends itself well to this application.
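To make the feature construction of Section 4.3.1 concrete, the following sketch (our own simplified illustration, not the authors' implementation) builds a thresholded, symmetric transition-matrix feature from one 1-D scan of JPEG coefficients; the full 600-dimensional vector concatenates such 120-element vectors over the three scan orders and the two additional block-wise zigzag arrangements:

```python
import numpy as np

def transition_features(seq, T=7):
    """Thresholded bidirectional transition statistics of a 1-D sequence of
    JPEG coefficients; values with |value| > T are forced to T, giving a
    (2T+1)x(2T+1) = 15x15 matrix whose 120 upper-triangular entries are kept."""
    c = np.clip(np.asarray(seq, dtype=int), -T, T)
    size = 2 * T + 1
    counts = np.zeros((size, size))
    for a, b in zip(c[:-1], c[1:]):
        counts[a + T, b + T] += 1.0      # one-step transition a -> b
        counts[b + T, a + T] += 1.0      # bidirectional: also b -> a
    probs = counts / max(counts.sum(), 1.0)
    return probs[np.triu_indices(size)]  # 120 features for this scan
```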
Fig. 4. Adjusting the threshold in F5 (the horizontal axis is the minus logarithm of the threshold and the vertical axis is the detection ratio)
Table 7. Comparisons between different steganalysis methods

Data hiding   Amount of hidden   Detection rate              Detection rate
method        data (Bytes)       (Fridrich's method [8]) %   (proposed method) %
F5            1k                 74                          89
              2k                 87                          96
              4k                 96                          99
OutGuess      1k                 89                          97
              2k                 97                          100
              4k                 98                          100
MB1           1k                 66                          93
              2k                 86                          98
              4k                 90                          100
MB2           1k                 62                          94
              2k                 76                          99
              4k                 84                          100
Figure 4 displays the success rates in detecting stego images versus the adjusted threshold λ_0 when steganalyzing F5. It is observed that when λ_0 = 10^-4 the detection rate is the best. This is also true for detecting OutGuess, MB1 and MB2; hence, 10^-4 is chosen for λ_0 in the experiments. Table 7 shows the performance comparison between the proposed method and Fridrich's method [8] on the CorelDraw image database. From Table 7, we can see that the proposed method outperforms the prior art.
5 Conclusion

(1) Under the Gaussian distribution, the Bayesian classifier is known to be optimal. However, when there is a zero eigenvalue in the covariance matrix, it is impossible to obtain the inverse covariance matrix, especially in high-dimensional cases. In this paper, a threshold is proposed to substitute the smaller eigenvalues. On the one hand, the "modified" Bayesian classifier is no longer limited by high dimensionality; on the other hand, the threshold can be selected carefully in order to minimize the recognition error rate with respect to the given database.
(2) The proposed Bayesian classifier does not need PCA preprocessing and is more effective than the traditional Bayesian classifier with PCA preprocessing. This novel Bayesian classifier is suitable for classification situations in which the number of samples is large and the within-class distance is big.
(3) In many situations, when the assumption of a Gaussian distribution is not valid, the threshold can act as a 'regulator' to adjust the differences between the real distribution and the assumed distribution. This characteristic also contributes to better classification performance.
(4) Three experiments have demonstrated the effectiveness of the novel Bayesian classifier. Experiments on the MNIST database show that the proposed method outperforms the other classifiers considered. It has also been utilized in a practical Bengali handwritten postcode recognition system because the recognition results have met the requirements. Finally, the experiments on JPEG image steganalysis have shown that the proposed Bayesian classifier performs well for steganalysis.
Acknowledgement. This paper is partly supported by the National Natural Science Foundation of China. Thanks go to Professor Xuefeng Tong at the Department of Computer Science, Tongji University, and to Director Professor Zhikang Dai, Professor Yue Lv and Dr. Shujing Lv at the Shanghai Research Institute of China Post Bureau for their kind help and valuable comments.
References

[1] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, Chichester (2001)
[2] Vaswani, N., Chellappa, R.: Classification probability analysis of principal component null space analysis. In: International Conference on Pattern Recognition (ICPR) (2004)
[3] Xuan, G., Shi, Y.Q., Chai, P., Zhu, X., Yao, Q., Huang, C., Fu, D.: A novel pattern classification scheme: Classwise non-principal component analysis (CNPCA). In: International Conference on Pattern Recognition (ICPR), Hong Kong, China, August 20-24, 2006
[4] The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/
[5] Provos, N.: Defending against statistical steganalysis. In: 10th USENIX Security Symposium, Washington DC, USA (2001)
[6] Westfeld, A.: F5 - a steganographic algorithm: high capacity despite better steganalysis. In: 4th International Workshop on Information Hiding, Pittsburgh, PA, USA (2001)
[7] Sallee, P.: Model-based methods for steganography and steganalysis. International Journal of Image and Graphics 5(1), 167–190 (2005)
[8] Fridrich, J.: Feature-based steganalysis for JPEG images and its implications for future design of steganographic schemes. In: Fridrich, J. (ed.) IH 2004. LNCS, vol. 3200, pp. 67–81. Springer, Heidelberg (2004)
[9] Shi, Y.Q., Chen, C., Chen, W.: A Markov process based approach to effective attacking JPEG steganography. In: Information Hiding Workshop, Old Town Alexandria, VA, USA (2006)
[10] Shi, Y.Q., Sun, H.: Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms and Standards. CRC Press, Boca Raton, USA (1999)
[11] Leon-Garcia, A.: Probability and Random Processes for Electrical Engineering, 2nd edn. Addison-Wesley, Reading (1994)
[12] http://linuxbrit.co.uk/rbot/wiki/Ideas, http://www.informs-cs.org/wsc00papers/064.PDF
[13] CorelDraw Software, http://www.corel.com
Median Binary Pattern for Textures Classification

Adel Hafiane1, Guna Seetharaman2, and Bertrand Zavidovique3

1 Laboratoire Vision Robotique, ENSI de Bourges - Université d'Orléans, 88 Av. Lahitolle, 18020 Bourges Cedex, France
2 Department of Electrical and Computer Engineering, Air Force Institute of Technology, Dayton, OH 45433-7765, USA
3 Institut d'Electronique Fondamentale, Université de Paris-Sud, Bâtiment 220, F-91405 Orsay Cedex, France
[email protected]

Abstract. A texture classification method using a binary texture metric is presented. The method consists of extracting local structures and describing their distribution by a global approach. Texture primitives are determined by localized thresholding against the local median. The local spatial signature of the thresholded image is uniquely encoded as a scalar value, whose histogram helps characterize the overall texture. A multi-resolution approach has been tried to handle variations in scale. The encoding scheme also facilitates a rich class of equivalent structures related by image rotation. We then demonstrate, using a set of classification experiments, that the proposed method significantly improves the capability of texture recognition and outperforms classical algorithms.
1 Introduction

Texture plays a vital role in analyzing various types of images, including natural, aerial and medical imagery. The ability to recognize texture can furnish semantic information; thus, texture description and classification have received considerable attention. Several intuitive properties such as coarseness, contrast and regularity have often been associated with texture, yet there is no obvious common definition of textures. The problem is inherently complex, and there is no unique approach. There exists a variety of approaches for texture description and analysis [1], which can be divided into four categories: statistical methods, structural methods, model-based methods and transform-based methods. Texture characterization and similarity measures represent the main issues in texture classification [2]. Texture can be defined as a spatial arrangement of textons, a distribution of patterns, or specific spatial frequencies. For example, the popular Gray Level Cooccurrence Matrix (GLCM) [3] and its derivatives [4][5] define the joint probability that
This work was conducted at IEF and ENSI. The conclusions are those of the authors. They do not represent the views or policies of AFIT, USAF or US-DoD.
a given pair of gray levels co-occur at a specific relative distance in the image. The geometrical methods of analyzing spatial distribution of texture primitives are often used for periodic structures [6][7][8][9]. Texture is also modelled using Markov Random Fields (MRF) where the texture is defined as a random field with spatial context [10][11]. Other techniques suppose that human visual system transform a retinal image into spatial frequency representation [12]. Gabor filters [13][14] handle this property thereby using a bank of filters. Recent works which combine structural and statistical properties [15][16][17] show higher classification performances. The texture appearance is considered as a result of dominant neighborhood properties, and empirical distribution of intrinsic primitives. For this hybrid approach, geometrical or shape properties and spatial distribution patterns become very important. This work presents a modification of Local Binary Patten (LBP), the key difference being with the choice of threshold value. We use a two step method comprised of median based thresholding followed by extraction and analysis of the resulting binary patterns. The operator is restricted to a 3 × 3 neighborhood, and thus, yielding 512 possible distinct patterns. A unique scalar value is assigned to each pattern, and its histogram is used to measure the spatial distribution of the patterns over the image. Thus, the texture is characterized as a discrete probability function of a random binary pattern defined over compact neighborhoods, for example, 3 × 3 pixels. The rest of the paper is organized as follows. Section 2 presents the general methodologies of LBP. Section 3 defines the Median Binary Method method. Section 4 describes a statistical characterization. The modality of the performance evaluation are described in section 5, followed by the experiments and results in section 6. While the conclusion is presented in section 7.
2 Local Binary Pattern

The LBP [15] is one of the best texture descriptors in terms of performance and its highly discriminative ability. LBP is comprised of two steps: first, extracting texels or elementary structures in the local region; and second, analyzing their spatial distribution. The method is designed to enhance local properties by using spatial operators to transform the image into a new representation. It then seeks to code, characterize and quantify the spatial arrangement of the basic structures observed in the newly transformed realm. The basic LBP technique [15] uses local thresholding over the 3 × 3 neighborhood associated with each pixel. The central pixel intensity is chosen as the threshold value. A value of 1 is assigned to each neighboring pixel whose intensity is above or equal to the threshold, and 0 to the others. The resulting pattern is captured as an 8-bit binary number representing one of 256 distinct known patterns. The histogram is then computed for the transformed image and considered as a texture descriptor. For instance, in Fig. 1 the threshold value is 127, and the central pixel does not belong to the pattern.
Fig. 1. LBP operator performed over a 3×3 neighborhood (e.g., the patch [127 130 135; 120 127 88; 90 30 43] thresholded at the central value 127 yields the binary pattern [1 1 1; 0 · 0; 0 0 0]). The central pixel is ignored at the output.
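For reference, a small NumPy sketch of the basic 3x3 LBP operator just described (our illustration; the bit ordering of the eight neighbours is a convention we choose here):

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP: each neighbour is compared with the central pixel and
    the 8 resulting bits are packed into a code in [0, 255]."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=int)
    centre = img[1:h - 1, 1:w - 1]
    # clockwise order of the 8 neighbours around the centre (our convention)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neigh >= centre).astype(int) << bit
    return codes

def lbp_histogram(img):
    """Normalized 256-bin histogram used as the texture descriptor."""
    hist, _ = np.histogram(lbp_image(img), bins=256, range=(0, 256))
    return hist / hist.sum()
```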
3 Median Binary Pattern

The proposed Median Binary Pattern (MBP) derives the localized binary pattern by thresholding the pixels against their median value over a 3 × 3 neighborhood. A typical snapshot is shown in Fig. 2, where the median value is 120. The central pixel is included in this filtering process, therefore we obtain 2^9 possible structures. Then

MBP = Σ_{i=1}^{L} f(a_i) · 2^{i-1},    f(a_i) = 1 if a_i ≥ Med, 0 otherwise    (1)

where L is the number of pixels in the neighborhood and a_i the intensity value. The MBP operator is invariant to monotonic gray-scale changes since the threshold does not depend on the absolute intensity. The detected pattern is the result of the spatial interactions in the given locality; if there is no contrast in a given neighborhood, it is considered as a spot. Equation (1) produces an integer value in the closed interval [0 : 511]. Fig. 3 shows three examples of MBP in a 3 × 3 neighborhood.
Fig. 2. Example of MBP: for the patch [127 130 135; 120 127 88; 90 30 43] the median value is 120 and the resulting binary pattern is [1 1 1; 1 1 0; 0 0 0]
Here the MBP operator divides the image into two groups of pixels depending on the median value, yielding a specific configuration. The result is represented below each image; the hashed regions correspond to the value 1. We can notice that MBP captures the contrast between two intensity ranges, which also affects the local structure. These patterns form the basic element of our definition of texture.
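A corresponding sketch of the MBP operator (again our own illustration; the bit ordering over the nine pixels is an arbitrary choice): all nine pixels, centre included, are thresholded against the local median, producing codes in [0, 511].

```python
import numpy as np

def mbp_image(img):
    """3x3 Median Binary Pattern: all nine pixels (centre included) are
    thresholded against the local median, giving a code in [0, 511]."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    # stack of the nine shifted views covering every interior pixel
    stack = np.stack([img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
                      for dy, dx in offsets])
    med = np.median(stack, axis=0)              # local 3x3 median
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, patch in enumerate(stack):
        codes += (patch >= med).astype(int) << bit
    return codes

def mbp_histogram(img):
    """Normalized 512-bin histogram of MBP codes (see Sect. 4)."""
    hist, _ = np.histogram(mbp_image(img), bins=512, range=(0, 512))
    return hist / hist.sum()
```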
Fig. 3. Several 3 × 3 images and their corresponding MBP
Scale changes can influence local structures, and hence impact the MBP descriptor. To reduce this effect we decompose the image into several frequency ranges by a subsampling method. Let I be the original image of size M × N. The subsampling process can be described as follows:

I1(i, j) = I(2i, 2j),  I2(i, j) = I(2i+1, 2j),  I3(i, j) = I(2i, 2j+1),  I4(i, j) = I(2i+1, 2j+1)    (2)

where I1, I2, I3, I4 denote the four subimages, i = 0, 1, 2, ..., M/2 − 1 and j = 0, 1, 2, ..., N/2 − 1. These images help capture relationships between pixels that are not immediate neighbors when subjected to MBP. For complexity reasons, only one subimage is chosen for the next resolution level, and the same process is applied at each layer. The MBP is then computed for each level of resolution, searching for the best possible arrangement that matches the model of a given texture.
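Eq. (2) and the level selection can be sketched as follows (our illustration; the parameter `levels` plays the role of d):

```python
import numpy as np

def subsample(img):
    """Eq. (2): split an image into four half-resolution subimages."""
    img = np.asarray(img)
    return (img[0::2, 0::2], img[1::2, 0::2], img[0::2, 1::2], img[1::2, 1::2])

def multiresolution_pyramid(img, levels=3):
    """Keep one subimage per level (as in the paper, for complexity reasons)
    and return the images on which MBP histograms are computed."""
    out = [np.asarray(img)]
    for _ in range(levels - 1):
        out.append(subsample(out[-1])[0])   # first subimage chosen at every layer
    return out
```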
4 Statistical Characterisation

As described above, the statistical approach is based on the distribution of primitives, which can be scalars or more complex structures such as patterns. Previous work shows that histograms can be used as a powerful description of texture [18]. In our case we compute the histogram of the MBP image using 512 bins to handle all possible motifs. The effect of the analysis window size can be eliminated by histogram normalization. Also, in the case of multi-resolution texture analysis, the histogram is computed for each level, resulting in a d-dimensional descriptor of the ensemble. However, adding frequency ranges does not automatically lead to higher performance.
The optimal number of multi-resolution levels is difficult to determine automatically. For our application, we have chosen d = 3 levels, based on a trade-off between complexity and performance. We then define the MBP histogram with respect to scale as follows:

H_MBP = (H_MBP^1, H_MBP^2, ..., H_MBP^n)    (3)

where n is the number of subimages. The similarity measure consists in comparing histograms. In general there are many methods for representing the information in a histogram. The similarity between two MBPs is given by:

D_L2(H1, H2) = Σ_i (H1(i) − H2(i))^2    (4)

H_MBP is modelled as a joint distribution of d statistically independent features. The similarities can be computed for each resolution separately and the minimum is retained. The final distance is defined as:

D = min_{d,d'} D_L2(H1^d, H2^{d'})    (5)
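Putting (3)-(5) together (our sketch, reusing mbp_histogram and multiresolution_pyramid from the previous sketches):

```python
import numpy as np

def mbp_descriptor(img, levels=3):
    """H_MBP of eq. (3): one 512-bin histogram per resolution level."""
    return [mbp_histogram(level) for level in multiresolution_pyramid(img, levels)]

def l2_distance(h1, h2):
    """D_L2 of eq. (4): sum of squared bin differences."""
    return float(np.sum((np.asarray(h1) - np.asarray(h2)) ** 2))

def mbp_distance(desc1, desc2):
    """Eq. (5): minimum distance over all pairs of resolution levels."""
    return min(l2_distance(h1, h2) for h1 in desc1 for h2 in desc2)
```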
5 Performances Evaluation

Our goal is to check the discriminative ability of the proposed method to identify surfaces in images. In general, the evaluation scheme should take into account the following properties:

1. Illumination invariance: the process has to furnish the same texture indices despite gray-scale changes across the image. For example, the weather can modify the intensity in aerial images.
2. Geometric invariance: the classification should be consistent under geometric transformations introduced by the relative position of the camera.
3. Multiresolution invariance: the algorithm should be able to recognize the texture independently of the distance from the object.
4. Robustness: strength against noise and other artifacts is required. Minor changes in noise should not produce large variations in the output.
5. Window size: the minimum window size needed to capture the texture information.
6. Complexity: the computing time and resources should be minimal.

This work examines the first three properties using appropriate data sets. The performance is evaluated in terms of classification using training and test samples: if a test example is assigned to the correct class of the training set, the classification is correct. The performance depends on the selected features and the discrimination method. Evaluations are based on statistical techniques that can be parametric (Bayesian theory) or non-parametric, such as k-nearest neighbors (KNN) or neural networks and their derivatives. It is beyond the scope of this work to evaluate all known classification schemes; we focus on evaluating the discrimination properties of MBP. We chose a non-parametric technique, since there are very few tunable
parameters and low computational resource requirements. KNN is one of the most popular and simplest methods for pattern classification. The KNN rule assigns a given unlabelled example to the class that occurs most frequently among the k nearest training samples. It requires only an integer K, training data and a distance measure; we have chosen K = 3. The samples are then classified according to their feature vectors. The performance measure does not depend only on the classifier but also on the experimental platform, including reference algorithms and test images. The subjective nature of many experimental evaluations has raised critical concerns about the validity of tests [3]. To overcome this problem, some research teams have built publicly available test frameworks [19], considered as benchmarks for texture experiments and evaluations. MeasTex(1) is a significant evaluation system amongst such frameworks: it is an open source system containing an image database, a quantitative measurement framework for texture analysis algorithms, and implementations of the major texture classification paradigms. Nevertheless, the image set of this base is poor in terms of quality and quantity when compared with newer ones. The Outex(2) [20] texture base has recently become established as the best; it presents many options related to texture analysis and recognition. Outex is organized as train/test sets within several categories, each addressing a particular factor impacting the texture metrics, for instance illumination, resolution or geometric transformation. A few sample images from this database are shown in Fig. 4.
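The evaluation protocol can be sketched as a generic k-NN rule over texture descriptors (our illustration; `distance` would be the D of eq. (5) for MBP, or the Euclidean distance used for the other methods):

```python
import numpy as np

def knn_classify(test_feat, train_feats, train_labels, distance, k=3):
    """Assign the test descriptor to the class occurring most frequently
    among its k nearest training descriptors."""
    dists = np.array([distance(test_feat, f) for f in train_feats])
    votes = np.asarray(train_labels)[np.argsort(dists)[:k]]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```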
6 Experiments and Results

The experiments have been conducted over a large number of textured images. The goal was to benchmark the performance of the proposed MBP scheme in comparison with several established methods. We used the Outex database, where the tests are carried out on four texture categories that address standard classification, illumination, rotation and spatial resolution. Each category contains 24 texture classes, mainly obtained from canvas surfaces. Fig. 4 shows one sample of each class. The set represents rich surface aspects: regular motifs, random patterns, texture irregularity, defects, different granularities, spreading forms, blurring aspects, etc. The gray-level distribution tends to be similar across these images, and the discriminative factors are intrinsic to the patterns or motifs; spatial frequency is determinant in such cases. We expect techniques based on gray-level representation to fail for such images, whereas methods based on structural representation should do better. To check this expectation, MBP is compared with methods that use gray-scale based and other approaches. The MeasTex framework supports the major texture paradigms (GLCM, Gabor filters, Gaussian Markov Random Field (GMRF), ...) and many classifiers, particularly KNN. The success rate is computed by running the KNN procedure and counting the
http://www.cssip.elec.uq.edu.au/guy/meastex/meastex.html http://www.outex.oulu.fi
Fig. 4. Samples of Outex database
occurrence of the correct classes within the k neighbors. The parameters of the algorithms are set as follows:

- GLCM: the orientation between the two pixels involved in a gray-level co-occurrence is θ = 0°, 45°, 90°, 135°, and the spatial distance to the co-occurring pixel is set to 1. The GLCM technique selects features from the joint histogram; in our tests, the features are those proposed in [3].
- GABOR: features are extracted by computing the energy function for variance σ = 4, mask orientations θ = 0°, 45°, 90°, 135° and a filter size of 17 × 17 pixels. For each direction, we obtain the global energy by performing a convolution.
- GMRF: used with a fourth-order mask as described in [10].
- LBP: used with 8 neighbors as described in [15].
- MBP: computed for 3 levels of spatial resolution.

Note that all algorithms have been tested and compared under the same experimental conditions (classifier, dataset, ...). The Euclidean distance was employed for all methods. The success rate expresses the correct classification as a percentage.

6.1 General Textures
Here, the experiments are conducted on the Outex TC 00000 category. For each class, there are 20 monochrome images (128 × 128) with constant incandescent
illumination and a spatial resolution of 100 dpi. These images are divided into two groups of 10 test and 10 training images. Each test image is compared with the whole training set to produce a score, and the average of the test scores measures the correct classification for one class; a rate of 100% means all test images have found the right class according to KNN. Outex provides 100 couples of test and train files for this category, and we have randomly selected one couple. Table 1 shows the success rate for each texture class, where K is fixed at 3. The class names correspond to the images in Figure 4, ordered from left to right and from top to bottom. GLCM shows low success, followed by GMRF with 50.9%. Gabor filters do better, with a high success rate of 93.5%. The MBP and LBP methods give the best results, with 99.6% correct classification; however, both MBP and LBP have missed one sample of texture canvas033.

Table 1. Scores obtained over the Outex TC 00000 category

class       MBP    LBP    GLCM   GABOR  GMRF
canvas001   100.0  100.0  90.0   100.0  56.67
canvas002   100.0  100.0  20.0   100.0  63.33
canvas003   100.0  100.0  33.0   100.0  56.67
canvas005   100.0  100.0  10.0   100.0  50.0
canvas006   100.0  100.0  80.0   100.0  53.3
canvas009   100.0  100.0  100.0  100.0  73.3
canvas011   100.0  100.0  56.7   100.0  90.0
canvas021   100.0  100.0  3.3    100.0  43.3
canvas022   100.0  100.0  13.3   93.3   53.3
canvas023   100.0  100.0  16.7   100.0  63.3
canvas025   100.0  100.0  36.7   100.0  50.0
canvas026   100.0  100.0  13.3   100.0  50.0
canvas031   100.0  100.0  3.3    100.0  66.7
canvas032   100.0  100.0  43.3   80.0   20.0
canvas033   90.0   90.0   10.0   100.0  36.67
canvas035   100.0  100.0  6.7    80.0   10.0
canvas038   100.0  100.0  6.7    70.0   30.0
canvas039   100.0  100.0  6.7    100.0  63.3
carpet002   100.0  100.0  2.3    90.0   26.7
carpet004   100.0  100.0  0.0    100.0  3.0
carpet005   100.0  100.0  100.0  100.0  100.0
carpet009   100.0  100.0  6.7    100.0  30.0
tile005     100.0  100.0  6.7    50.0   73.3
tile006     100.0  100.0  6.7    80.1   57.9
Total       99.6   99.6   27.6   93.5   50.9

6.2 Rotation Effect
The human visual system easily and rapidly recognizes a scene or image under different affine transformations, which is not the case for machines. Rotation is more complex than translation and poses more difficulties for the analysis.
For this purpose, we have tested the algorithms under different orientations. The Outex TC 00010 category offers a set of rotated textures with angles 5°, 10°, 15°, 30°, 45°, 60°, 75° and 90°. An example of such rotations is presented in Fig. 5. Each rotation is performed for the 24 classes, yielding 8 × 20 test images per class; the original textures are assigned as the training set, so 3840 rotated images are compared to 480 non-rotated textures. We can notice from Table 2 that the performances of MBP, LBP and Gabor filters decrease dramatically. The rotations affect the local structures of the texture, and thus the local patterns and spatial frequencies are modified. In contrast, GLCM and GMRF stay stable due to their independence of the spatial structures. LBP shows the best results thanks to its circular neighborhood, which reduces the effect of rotation. The improved version of LBP [21], which takes rotation invariance into account, provides very good results; we expect that MBP could also provide very good results with such a rotation-invariant scheme.
Fig. 5. Samples of classes with rotation
6.3 Resolution Effect

The goal is to study the effect of spatial scale by testing textures captured at different resolutions. We used the Outex TC 00011 category, which contains images at two resolutions, 100 dpi and 120 dpi. The category classes are decomposed into two sets: 20 images at 120 dpi for testing and 20 images at 100 dpi for training. Fig. 6 shows an example of the two resolutions. The number of classes, image size, illumination conditions and K are the same as in the previous tests. Table 2 shows that the spatial resolution changes affect Gabor filters more than MBP, LBP, GLCM and GMRF.

6.4 Illumination Effect
This test checks the impact of lighting changes on the computed features. We use a set of images with different illuminations, captured using three simulated illumination sources: incandescent, horizon sunlight and fluorescent tl84. Fig. 7 presents an example of these illuminations. Outex TC 00014 provides 20 images per class for each type of light. The test data are the horizon and tl84 images (960 samples), compared to the training images with incandescent illumination (480 samples). The spatial resolution is 100 dpi for all images; the other conditions remain the same as in the previous experiments.
Compared to the reference techniques (Table 2), MBP maintains high performance under large variations in illumination. This is due to the nature of the method, where the local threshold is independent of the gray scale. The results show that median thresholding is more effective than thresholding against the central pixel value as in LBP.
Fig. 6. Same texture with two different resolutions: left 100 dpi, right 120 dpi
Fig. 7. Different illuminations: left incandescent, middle horizon and right tl84

Table 2. Total scores over 24 classes with KNN = 3

Class                     MBP    LBP    GLCM   GABOR  GMRF
Outex TC 00000            99.3   98.8   27.6   93.7   57.5
Outex TC 00010            47.9   51.6   28.2   36.9   47.5
Outex TC 00011            96.0   90.0   29.4   65.3   46.2
Outex TC 00014: tl84      97.3   96.1   11.3   46.2   39.7
Outex TC 00014: horizon   94.1   90.0   11.1   44.7   55.6

6.5 Effect of the Number of KNN
As described previously, KNN has one tunable parameter, the number of nearest neighbors. In order to study the effect of this parameter, we have tested the behaviour of MBP according to the number of nearest neighbors. The Outex TC 00000 category is used for this experiment. Table 3 shows the success rate of each technique for KNN = 1 to 9. The algorithms are run for 100 test/train couples, and the result presented here is the average over all couples. We can notice that the MBP classification rate stays very high, between 99.5% and 96.2%, providing the best performance.
Table 3. Classification rate according to the number of KNN

Algorithm   K=1    K=3    K=5    K=7    K=9
MBP         99.5   99.3   98.8   97.7   96.2
LBP         99.4   98.8   98.2   96.8   94.8
GABOR       96.0   94.7   93.4   92.3   91.0
GLCM        30.0   27.0   38.2   39.5   39.6
GMRF        61.7   57.5   55.0   52.5   51.0

7 Conclusion
We have proposed a characterization of textures based on local patterns. The texture attributes are modelled as a distribution of these patterns, and a multi-dimensional histogram is used as a cue. The median-based binary pattern demonstrates very good discriminative properties, and the results show high performance in texture classification. Different factors impacting the detection and recognition of textures are also addressed in this paper, including illumination, resolution and rotation. Rotation causes the most problems for MBP: although the median is invariant under rotation, and so is the histogram, the coding used to label the pattern is not rotation invariant. In future work we will study this problem to reduce the effect of affine transformations. Additionally, we consider extending the approach to applications in unsupervised segmentation and image retrieval.
References

1. Tuceryan, M., Jain, A.K.: Texture analysis. In: Handbook of Pattern Recognition & Computer Vision, pp. 235–276 (1993)
2. Randen, T., Husoy, J.H.: Filtering for texture classification: A comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(4), 291–310 (1999)
3. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Sys. Man. Cybern. 3(6), 610–621 (1973)
4. Davis, L.S., Clearman, M., Aggarwal, J.K.: An empirical evaluation of generalized cooccurrence matrices. IEEE Trans. on Pattern Analysis and Machine Intelligence 3(2), 214–221 (1981)
5. Gotlieb, C.C., Kreyszig, H.E.: Texture descriptors based on co-occurrence matrices. Comput. Vision Graph. Image Process. 51(1), 70–86 (1990)
6. Zucker, S.W.: Toward a model of texture. Computer Graphics and Image Processing 5, 190–202 (1976)
7. Vilnrotter, F., Nevatia, R.: Structural texture analysis applications. In: DARPA82, pp. 243–252 (1982)
8. Voorhees, H., Poggio, T.: Detecting textons and texture boundaries in natural images. In: Proceedings of the First International Conference on Computer Vision, pp. 250–258 (1987)
9. Blostein, D., Ahuja, N.: Shape from texture: Integrating texture-element extraction and surface estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(12), 1233–1251 (1989)
10. Chellappa, R., Chatterjee, S.: Classification of textures using Gaussian Markov random fields. IEEE Trans. Acoustics Speech Signal Process. 33, 959–963 (1985)
11. Comer, M., Delp, E.: Segmentation of textured images using a multiresolution Gaussian autoregressive model. IEEE Transactions on Image Processing 8(3), 408–420 (1999)
12. Campbell, F.W., Robson, J.G.: Application of Fourier analysis to the visibility of gratings. Journal Physiol. 197, 551–566 (1968)
13. Turner, M.R.: Texture discrimination by Gabor functions. Biological Cybernetics 55, 71–82 (1986)
14. Clark, M., Bovik, A.C.: Texture segmentation using Gabor modulation/demodulation. Pattern Recognition Letters 6(4), 261–267 (1987)
15. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29(1), 51–59 (1996)
16. Rushing, J., Ranganath, H., Hinke, T., Graves, S.: Using association rules as texture features. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(8), 845–858 (2001)
17. Hafiane, A., Zavidovique, B.: Local relational string for textures classification. In: IEEE ICIP, Atlanta, USA. IEEE Computer Society Press, Los Alamitos (2006)
18. Unser, M.: Sum and difference histograms for texture classification. IEEE Trans. Pattern Anal. Mach. Intell. 8, 118–125 (1986)
19. Smith, G., Burns, I.: Measuring texture classification algorithms. Pattern Recognition Letters 18(14), 1495–1501 (1997)
20. Ojala, T., Maenpaa, T., Pietikainen, M., Viertola, J., Kyllonen, J., Huovinen, S.: Outex - a new framework for empirical evaluation of texture analysis algorithms. In: Proc. 16th Intl. Conf. Pattern Recognition (2002)
21. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
A New Incremental Optimal Feature Extraction Method for On-Line Applications

Youness Aliyari Ghassabeh and Hamid Abrishami Moghaddam

Electrical Engineering Department, K. N. Toosi University of Technology, Tehran, Iran
[email protected], [email protected]

Abstract. In this paper, we introduce new adaptive learning algorithms to extract linear discriminant analysis (LDA) features from multidimensional data in order to reduce the data dimensionality. For this purpose, new adaptive algorithms for the computation of Σ^{-1/2}, the square root of the inverse covariance matrix, are introduced. The convergence of the new adaptive algorithms is proved by presenting the related cost function and discussing its initial conditions. The new adaptive algorithms are used before an adaptive principal component analysis algorithm in order to construct an adaptive multivariate multi-class LDA algorithm. The adaptive nature of the new optimal feature extraction method makes it appropriate for on-line pattern recognition applications. Both adaptive algorithms in the proposed structure are trained simultaneously, using a stream of input data. Experimental results using synthetic and real multi-class, multi-dimensional sequences of data demonstrate the effectiveness of the new adaptive feature extraction algorithm.

Keywords: Adaptive Learning Algorithm, Adaptive Linear Discriminant Analysis, Feature Extraction.
1 Introduction

Feature extraction is generally considered as a process of mapping the original measurements into a more effective feature space. When we have two or more classes, feature extraction consists of choosing the features which are most effective for preserving class separability, in addition to dimensionality reduction [1]. Linear discriminant analysis (LDA) has been widely used in pattern recognition applications such as feature extraction, face and gesture recognition [2-4]. LDA, also known as Fisher discriminant analysis (FDA), seeks directions for efficient discrimination during dimension reduction [1]. Typical implementations of this technique assume that a complete dataset for training is available and that learning is carried out in one batch. However, when we conduct LDA learning over datasets in real-world applications, we often confront situations where a complete set of training samples is not given in advance. Actually, in most cases, such as on-line face recognition and mobile robotics, data are presented as a stream. Therefore, the need for dimensionality reduction in real-time applications motivated researchers to introduce adaptive versions of LDA. Mao and Jain [5] proposed a two-layer network, each layer of which was an adaptive principal component
analysis (APCA) network. Chatterjee and Roychowdhury [6] presented adaptive algorithms and a self-organized LDA network for feature extraction from Gaussian data using the gradient descent optimization technique. They described algorithms and networks for (i) feature extraction from unimodal and multi-cluster Gaussian data in the multi-class case and (ii) multivariate linear discriminant analysis in the multi-class case. The approach presented in [6] suffers from a low convergence rate. To overcome this drawback, Abrishami Moghaddam et al. [7] derived accelerated convergence algorithms for adaptive LDA (ALDA), based on steepest descent, conjugate direction and Newton-Raphson methods. In this study, we present new adaptive learning algorithms for the computation of Σ^{-1/2}. Furthermore, we introduce a cost function related to these algorithms and prove their convergence by discussing its properties and initial conditions. Finally, we combine our Σ^{-1/2} algorithm with an APCA algorithm for ALDA. Each algorithm discussed in this paper considers a flow or sequence of inputs for training (input data are given one by one to the algorithm); therefore, there is no need for a large set of sample data. The memory and complexity reduction provided by the new ALDA algorithm make it appropriate for on-line pattern recognition applications [8, 9]. We will show the effectiveness of these new adaptive algorithms for extracting LDA features in different on-line experiments. The organization of the paper is as follows. The next section describes the fundamentals of LDA. Section 3 presents the new adaptive algorithms for the estimation of the square root of the inverse covariance matrix Σ^{-1/2} and analyzes their convergence; then, by combining this algorithm with an APCA algorithm in cascade, we implement an ALDA feature extraction algorithm. Section 4 is devoted to simulations and experimental results. Finally, concluding remarks are given in Section 5.
2 Linear Discriminant Analysis Fundamentals

Let {x_1, x_2, ..., x_N}, x ∈ ℜ^n, be N samples from L classes {ω_1, ω_2, ..., ω_L}. Let m and Σ denote the mean vector and covariance matrix of the samples, respectively. LDA searches for the directions of maximum discrimination between classes in addition to dimensionality reduction. To achieve this goal, within-class and between-class matrices are defined [1]. The within-class scatter matrix, denoted by Σ_w, is the scatter of the samples around their respective class means m_i. The between-class scatter matrix, denoted by Σ_b, is the scatter of the class means m_i around the mixture mean m. Finally, the mixture scatter matrix, denoted by Σ, is the covariance of all samples regardless of class assignment. In LDA, the optimum linear transform is composed of p (≤ n) eigenvectors of Σ_w^{-1}Σ_b corresponding to its p largest eigenvalues. Alternatively, Σ_w^{-1}Σ can be used for LDA; a simple analysis shows that Σ_w^{-1}Σ_b and Σ_w^{-1}Σ have the same eigenvector matrix. In general, Σ_b is not a full-rank matrix, hence we shall use Σ in place of Σ_b. The computation of the
eigenvector matrix Φ_LDA of Σ_w^{-1}Σ is equivalent to the solution of the generalized eigenvalue problem ΣΦ_LDA = Σ_w Φ_LDA Λ, where Λ is the generalized eigenvalue matrix. Under the assumption that Σ_w is a positive definite matrix, if we let Ψ = Σ_w^{1/2} Φ_LDA, there exists a symmetric Σ_w^{-1/2} such that the problem can be reduced to a symmetric eigenvalue problem [1]:

Σ_w^{-1/2} Σ Σ_w^{-1/2} Ψ = Ψ Λ    (1)
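For orientation, here is a batch (non-adaptive) NumPy sketch of this reduction, which the adaptive algorithms of the next section approximate on-line (our own illustration, not the paper's algorithm; the within-class scatter is formed as the prior-weighted sum of class covariances):

```python
import numpy as np

def batch_lda(X, y, p):
    """Solve eq. (1) in batch: whiten with Sigma_w^{-1/2}, take the leading p
    eigenvectors Psi of Sigma_w^{-1/2} Sigma Sigma_w^{-1/2}, and return
    Phi_LDA = Sigma_w^{-1/2} Psi."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Sigma = np.cov(X.T, bias=True)                       # mixture covariance
    Sw = sum(np.mean(y == c) * np.cov(X[y == c].T, bias=True)
             for c in np.unique(y))                      # within-class scatter
    d, U = np.linalg.eigh(Sw)
    Sw_isqrt = U @ np.diag(1.0 / np.sqrt(d)) @ U.T       # symmetric Sigma_w^{-1/2}
    evals, Psi = np.linalg.eigh(Sw_isqrt @ Sigma @ Sw_isqrt)
    Psi = Psi[:, np.argsort(evals)[::-1][:p]]            # p largest eigenvalues
    return Sw_isqrt @ Psi
```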
3 New Adaptive Learning Algorithms for the LDA Feature Extraction

We use two adaptive training algorithms in cascade for extracting optimal LDA features. The first algorithm, called the Σ^{-1/2} algorithm, computes the square root of the inverse covariance matrix. We prove the convergence of the new adaptive Σ^{-1/2} algorithms by introducing a cost function related to them; by minimizing this cost function using the gradient descent method, we obtain the new adaptive Σ^{-1/2} algorithms. The second algorithm is an APCA algorithm introduced by Sanger [10] and is used for the computation of the eigenvectors of the covariance matrix. We prove the convergence of the cascade architecture as an ALDA feature extractor.

3.1 New Adaptive Σ^{-1/2} Algorithm and Convergence Proof

We define the cost function J(W) with parameter W, J : ℜ^{n×n} → ℜ, as follows:

J(W) = tr(W^3 x x^t)/3 − tr(W)    (2)
The cost function J(W) is a continuous function with respect to W. If the sample vectors have zero mean, the expected value of J is given by:

E(J(W)) = tr(W^3 Σ)/3 − tr(W)    (3)

where Σ is the covariance matrix. The first derivative of (3) is computed as follows [11]:

∂E(J(W))/∂W = (W^2 Σ + Σ W^2 + W Σ W)/3 − I    (4)

If W is selected such that WΣ = ΣW, equating (4) to zero results in W = Σ^{-1/2}. Therefore, Σ^{-1/2} is a critical point (matrix) of (4). The second derivative of E(J) with respect to W is [11]:
∂²E(J(W))/∂W² = 2(I ⊗ ΣW) + 2(ΣW ⊗ I) + W ⊗ Σ + Σ ⊗ W    (5)

where it is assumed that W is symmetric and WΣ = ΣW. Substituting W = Σ^{-1/2} in (5) results in a positive definite matrix. The above analysis implies that if W is a symmetric matrix satisfying WΣ = ΣW, the cost function J(W) has a minimum that occurs at W = Σ^{-1/2} [11]. Using the gradient descent optimization method [12], we obtain the following adaptive equation for the computation of Σ^{-1/2}:

W_{k+1} = W_k + η (−∂J(W)/∂W)
        = W_k + η (I − (W_k^2 x_{k+1} x_{k+1}^t + x_{k+1} x_{k+1}^t W_k^2 + W_k x_{k+1} x_{k+1}^t W_k)/3)    (6)
where W_{k+1} is the estimate of Σ^{-1/2} at the (k+1)-th iteration, η is the step size and x_{k+1} is the input vector at iteration k+1. The only constraint on (6) is its initial condition: W_0 must be a symmetric, positive definite matrix satisfying W_0 Σ = ΣW_0. It is easy to prove that if W_0 is symmetric and positive definite, then every W_k (k = 1, 2, ...) will be symmetric and positive definite, and therefore the final estimate also has these properties. To simplify the choice of the initial value, we take W_0 equal to the identity matrix multiplied by a positive constant α (W_0 = αI).

3.2 Reduction of Computational Cost

As mentioned above, we take the initial condition equal to the identity matrix multiplied by a constant. For this initial condition we have W_0 Σ = ΣW_0; hence the expected value of (6) is equal to:
E(W_{k+1}) = W_k + η_k ( I − (W_k^2 Σ + W_k Σ W_k + Σ W_k^2)/3 ) .    (7)

It is quite easy to prove that if W_0 Σ = ΣW_0, then we will obtain:

E(W_{k+1}) = W_k + η_k (I − W_k^2 Σ) = W_k + η_k (I − W_k Σ W_k) = W_k + η_k (I − Σ W_k^2) .    (8)
Therefore, (6) can be simplified into three more efficient forms:

W_{k+1} = W_k + η_k (I − W_k^2 x_{k+1} x_{k+1}^t)    (9)

W_{k+1} = W_k + η_k (I − W_k x_{k+1} x_{k+1}^t W_k)    (10)

W_{k+1} = W_k + η_k (I − x_{k+1} x_{k+1}^t W_k^2)    (11)

Equations (9)-(11) have a lower computational cost than (6). The expected values of W_k as k → ∞ in (6) and (9)-(11) are all equal to Σ^{-1/2}, provided that W_0 Σ = ΣW_0.
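A minimal sketch of one of these updates (rule (10)), written in Python/NumPy; the fixed step size and the initial constant α are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def adaptive_inv_sqrt(X, eta=0.005, alpha=1.0):
    """Adaptive estimation of Sigma^{-1/2} with the simplified rule (10).

    X: (N, n) array of zero-mean training vectors, presented one at a time.
    W0 = alpha * I is symmetric, positive definite and commutes with Sigma,
    so the expected update matches Eq. (8) and W_k tends to Sigma^{-1/2}.
    Rules (9) and (11) replace the bracketed term with W @ W @ x @ x.T or
    x @ x.T @ W @ W, respectively.
    """
    n = X.shape[1]
    W = alpha * np.eye(n)
    I = np.eye(n)
    for x in X:
        x = x.reshape(-1, 1)
        W = W + eta * (I - W @ x @ x.T @ W)   # Eq. (10)
    return W
```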
3.3 Adaptive Computation of Eigenvectors

We use the following algorithm for the computation of the eigenvectors:

T_{k+1} = T_k + γ_k ( y_k x_k^t − LT[ y_k y_k^t ] T_k ) ,    (12)

where y_k = T_k x_k and T_k is a p × n matrix that converges to a matrix T whose rows are the first p eigenvectors of Σ. LT[.] sets all entries of its matrix argument which are above the diagonal to zero, and γ_k is a learning rate which meets Ljung's conditions [13]. The convergence of this algorithm has been proved by Sanger [10] using stochastic approximation theory. It has been shown that algorithm (12) computes the eigenvectors of the covariance matrix corresponding to its eigenvalues in descending order. Therefore, choosing the initial value as a random p × n matrix, algorithm (12) converges to a matrix T whose rows are the first p eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. There are different adaptive estimators of the mean vector; the following equation was used in [6, 7]:

m_{k+1} = m_k + η_{k+1} (x_{k+1} − m_k) ,    (13)
where η_{k+1} satisfies Ljung's assumptions [13].

3.4 New Adaptive LDA Algorithm

As discussed in Section 2, the LDA features are the significant eigenvectors of Σ_w^{-1}Σ. To compute them adaptively, we combine the two algorithms discussed in the previous subsections in cascade and show that this architecture asymptotically computes the LDA features. Consider the training sequence described at the beginning of Section 2. Furthermore, let m_k^i denote the estimated mean vector of class i (i = 1, 2, ..., L) at the k-th iteration and ω(x_k) denote the class of x_k. The training sequence {y_k} for the Σ^{-1/2} algorithm is defined by y_k = x_k − m_k^{ω(x_k)}. With the arrival of every training sample x_k, m_k^i is updated according to its class using (13). It is easy to show that the correlation matrix of the sequence {y_k} converges to the within-class scatter matrix Σ_w; therefore:
lim_{k→∞} E[(x_k − m_k^{ω(x_k)})(x_k − m_k^{ω(x_k)})^t] = lim_{k→∞} E[y_k y_k^t] = Σ_w .    (14)
Suppose the sequence {z_k} is defined by z_k = x_k − m_k, where m_k is the estimated mixture mean at the k-th iteration. We train the Σ^{-1/2} algorithm with the sequence {y_k} and use W_k from (9)-(11) to create the new sequence {u_k} defined by u_k = W_k z_k. The sequence {u_k} is then used to train algorithm (12). As mentioned before, the matrix T in algorithm (12) converges to the eigenvectors of the covariance matrix
of the input vectors, ordered by decreasing eigenvalues. Hence, (12) will converge to the eigenvectors of E(u_k u_k^t). It is quite easy to show that:

lim_{k→∞} E(u_k u_k^t) = Σ_w^{-1/2} Σ Σ_w^{-1/2} .    (15)
Our aim is to estimate the eigenvectors of Σ_w^{-1}Σ. Suppose Φ and Λ denote the eigenvector and eigenvalue matrices of Σ_w^{-1}Σ. The following equations hold [1]:

Σ_w^{-1} Σ Φ = ΦΛ ,    Σ_w^{-1/2} Σ Σ_w^{-1/2} Ψ = ΨΛ ,    (16)
where Ψ = Σ_w^{1/2}Φ. From (16), it follows that the eigenvector matrix of Σ_w^{-1/2}ΣΣ_w^{-1/2} is equal to Ψ. In other words, the matrix T^t in the second algorithm converges to Ψ, and the following equation holds:

lim_{k→∞} T_k^t = Ψ = Σ_w^{1/2} Φ .    (17)
Multiplying the outputs of the first and second algorithms as k → ∞ gives:

lim_{k→∞} W_k T_k^t = Σ_w^{-1/2} Σ_w^{1/2} Φ = Φ .    (18)

Therefore, the combination of the first and second algorithms converges to the desired matrix Φ, whose columns are the eigenvectors of Σ_w^{-1}Σ. As described in the previous sections, by choosing a random p × n matrix as the initial value of (12), the cascade of (9)-(11) with (12) converges to an n × p matrix composed of the p significant eigenvectors of Σ_w^{-1}Σ, ordered by decreasing eigenvalue. By the definition given in Section 2, these eigenvectors are used as the LDA features.
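Putting Sections 3.1-3.4 together, the sketch below (Python/NumPy) runs the two algorithms in cascade on a labelled data stream: the adaptive mean updates of (13) with η_k = 1/k, the Σ_w^{-1/2} rule (10) driven by y_k, and Sanger's rule (12) driven by u_k = W_k z_k. The learning rates, the initialization of T and the choice of rule (10) over (9) or (11) are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def ltri(M):
    """LT[.] of Eq. (12): set all entries above the main diagonal to zero."""
    return np.tril(M)

def adaptive_lda(stream, n, L, p, eta=0.005, gamma=0.002, alpha=1.0):
    """Cascade ALDA sketch. stream yields (x, label) with x in R^n and
    label in {0, ..., L-1}. Returns an (n, p) matrix whose columns estimate
    the leading eigenvectors of Sigma_w^{-1} Sigma (the LDA features)."""
    W = alpha * np.eye(n)              # estimate of Sigma_w^{-1/2}
    T = 0.1 * np.random.randn(p, n)    # APCA (Sanger) weight matrix
    m = np.zeros(n)                    # mixture mean estimate
    m_cls = np.zeros((L, n))           # class mean estimates
    cnt = np.zeros(L)
    k = 0
    for x, c in stream:
        k += 1
        cnt[c] += 1
        m_cls[c] += (x - m_cls[c]) / cnt[c]            # Eq. (13), class mean
        m += (x - m) / k                               # Eq. (13), mixture mean
        y = (x - m_cls[c]).reshape(-1, 1)              # trains the W update
        W = W + eta * (np.eye(n) - W @ y @ y.T @ W)    # Eq. (10)
        u = (W @ (x - m)).reshape(-1, 1)               # u_k = W_k z_k
        t = T @ u
        T = T + gamma * (t @ u.T - ltri(t @ t.T) @ T)  # Sanger's rule, Eq. (12)
    return W @ T.T                                     # Eq. (18): columns -> Phi
```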
Fig. 1. Sample covariance matrix used in the Σ^{-1/2} experiments
4 Simulation Results

We used the new adaptive algorithms given in (9)-(11) to estimate Σ^{-1/2}, and then used the combination of (9)-(11) and (12) in cascade to extract the LDA features. In all
experiments described in this section, we used a sequence of training data and trained each algorithm adaptively with it.

4.1 Experiments with the Σ^{-1/2} Algorithm

In these experiments, we compared the convergence rate of the new adaptive Σ^{-1/2} algorithm in 4-, 6-, 8- and 10-dimensional spaces. We used the first covariance matrix in [14], which is a 10 × 10 covariance matrix, multiplied by 20 (Fig. 1). The eigenvalues of this matrix in descending order are 117.996, 55.644, 34.175, 7.873, 5.878, 1.743, 1.423, 1.213 and 1.007. Three other matrices were selected as principal minors of this matrix. In all experiments, we chose the initial value W_0 equal to the identity matrix multiplied by 0.6, and then estimated Σ^{-1/2} using a sequence of Gaussian input data. For each experiment, at the k-th iteration, we used e(k) = norm(W_k − Σ_actual^{-1/2}) to compute the error e(k) between the estimated and actual Σ^{-1/2} matrices. Fig. 2 shows the values of the error over the iterations for each covariance matrix. The final values of the error after 500 samples are 0.1755 for d = 10, 0.1183 for d = 8, 0.1045 for d = 6 and 0.0560 for d = 4. As expected, the simulation results confirmed the convergence of (6) toward Σ^{-1/2}. We repeated the same experiment for (9)-(11) and obtained the same results in all cases.
Fig. 2. Convergence of the Σ^{-1/2} algorithm toward its final value, for different covariance matrices
4.2 Experiments on the Adaptive LDA Algorithm

We tested the performance of the new ALDA using (i) ten-dimensional, five-class Gaussian data and (ii) the PIE database.

4.2.1 Experiment with Ten-Dimensional Data
For this purpose, we generated 500 samples of 10-D Gaussian data from each of five classes with different mean vectors and covariance matrices. The means and covariances were obtained from [14], with the covariance matrices multiplied by 20. The eigenvalues of Σ_w^{-1}Σ_b are 10.84, 7.01, 0.98, 0.34, 0, 0, 0, 0, 0, 0. Thus, the data has an intrinsic dimensionality of four for classification, of which only the two features corresponding to the eigenvalues 10.84 and 7.01 are significant. We used the proposed ALDA to extract the relevant features for classification and compared these features with their actual values computed from the sample scatter matrices. The graph on the left side of Fig. 3 shows the convergence of the first algorithm; as mentioned before, through this algorithm W_k converges to the square root of the inverse within-class scatter matrix. The graph on the right side of Fig. 3 illustrates the convergence of the first and second feature vectors of Σ_w^{-1}Σ corresponding to the largest eigenvalues.
Fig. 3. Left: Convergence of the first algorithm toward the square root of the inverse within-class scatter matrix. Right: Convergence of the estimated first and second LDA features toward their final values
The normalized error E_φ is defined as E_φ = ||φ_i − φ̂_i|| / ||φ_i||, i = 1, 2, where φ_i is computed from the sample scatter matrices and φ̂_i is estimated using the proposed ALDA. It can be observed that the feature vectors computed by the new adaptive LDA algorithm converge to their actual values through the training process. The normalized errors at
the end of 2500 samples are E_φ1 = 0.0724 and E_φ2 = 0.0891. Figure 4 illustrates the distribution of the samples during the training process. The top-left graph of Fig. 4 shows the distribution of the training data in the estimated LDA feature space after 500 iterations, and the top-right graph shows the distribution after 1000 iterations. The bottom-left and bottom-right graphs of Fig. 4 show the distribution of the samples in the estimated LDA feature space after 1500 and 2500 iterations, respectively. The distribution of the data is not clearly separable in the first iterations; however, as the algorithm is trained, the samples separate into five (although overlapping) clusters using only two significant feature vectors. Fig. 4 verifies the ability of the proposed algorithm to perform adaptive dimension reduction while preserving separability.
Fig. 4. Top-left: distribution of data in the estimated LDA sub-space after 500 iterations. Top-right: after 1000 iterations. Bottom-left: after 1500 iterations. Bottom-right: after 2500 iterations.
4.2.2 Experiment on the PIE Database
This database contains images of 68 people under different poses and illuminations, with four different expressions. In this experiment, we chose three random subjects and considered 150 images for each subject. We manually cropped all images to a size of 40 × 40 in order to omit the background. Figure 5 shows some of the selected subjects under different poses and illuminations. We vectorized these images (every image produces a 1600 × 1 vector) and treated them as a sequence of data.
Fig. 5. Sample images from five subjects under different illuminations and poses
Fig. 6. Distribution of subject images in the estimated three-dimensional feature space after 100, 200, 300 and 450 iterations
Prior to our algorithm, we applied the PCA algorithm to the training images and, keeping the 60 most important eigenfaces, reduced the vector size to 60. We trained the proposed algorithm with this sequence of images and reduced the dimensionality of the feature space to three. Figure 6 shows the estimated Fisher faces [15] at the end of the process; since there are three subjects, the adaptive algorithm estimates two Fisher faces. Figure 6 shows the distribution of the images of each subject in the three-dimensional feature space. The top-left diagram shows the distribution after 100 iterations, and the other three diagrams show the distribution of the subject images in the feature space after 200, 300 and 450 iterations, respectively. It is clear from Figure 6 that the images are not clearly separable in the first iterations, but gradually, as the algorithm is trained, each subject separates from the others; at the end of the process (after 450 iterations), all of the subjects are linearly separable (although overlapping) in the three-dimensional estimated feature space.
5 Concluding Remarks

In this paper, a new ALDA feature extraction algorithm was presented. The new algorithm is a combination of a new adaptive Σ^{-1/2} algorithm in cascade with APCA. The convergence of the new adaptive algorithms was proved. Simulation results for LDA feature extraction using synthetic and real multidimensional data demonstrated the ability of the proposed algorithm to perform adaptive optimal feature extraction. The new adaptive algorithm can be used in many on-line pattern recognition applications, such as face and gesture recognition.

Acknowledgment. This project was partially supported by the Iranian Telecommunication Research Center (ITRC).
References 1. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, New York (1990) 2. Chen, L., Liao, H.M., Ko, M., Lin, J., Yu, G.: A new LDA based face recognition system which can solve the small sample size problem. Pattern Recognition 33(10), 1713–1726 (2000) 3. Chellappa, R., Wilson, C., Sirohey, S.: Human and machine recognition of faces: A survey. Proc. IEEE 83(5), 705–740 (1995) 4. Yu, H., Yang, J.: A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition 34(10), 2067–2070 (2001) 5. Mao, J., Jain, A.K.: Discriminant analysis neural networks. In: IEEE Int. Conf. on Neural Networks, CA, pp. 300–305 (1993) 6. Chatterjee, C., Roychowdhury, V.P.: On self-organizing algorithm and networks for class separability features. IEEE Trans. Neural Network 8(3), 663–678 (1997) 7. Abrishami Moghaddam, H., Matinfar, M., Sajad Sadough, S.M., Amiri Zadeh, Kh.: Algorithms and networks for accelerated convergence of adaptive LDA. Pattern Recognition 38(4), 473–483 (2005)
8. Hongo, H., Yasumoto, N., Niva, Y., Yamamoto, K.: Hierarchical face recognition using an adaptive discriminant space. In: TENCON’02. Proc. IEEE Int. Conf. Computers Communications, Control and Power Engineering, vol. 1, pp. 523–528. IEEE Computer Society Press, Los Alamitos (2002) 9. Rao, Y., Principe, N., Wong, J.C.: Fast RLS like algorithm for generalized eigen decomposition and its applications. Journal of VLSI Signal processing systems 37(3), 333– 344 (2004) 10. Sanger, T.D.: Optimal unsupervised learning in a single-layer linear feed forward neural network. Neural Networks 2, 459–473 (1989) 11. Magnus, J.R., Neudecker, H.: Matrix Differential Calculus. John Wiley, Chichester (1999) 12. Widrow, B., Stearns, S.: Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs (1985) 13. Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Automat Control 22, 551–575 (1977) 14. Okada, T., Tomita, S.: An Optimal orthonormal system for discriminant analysis. Pattern Recognition 18(2), 139–144 (1985) 15. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisher faces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intel. 19, 711– 720 (1997)
A View-Based 3D Object Shape Representation Technique

Yasser Ebrahim(1), Maher Ahmed(1), Siu-Cheung Chau(1), and Wegdan Abdelsalam(2)

(1) Wilfrid Laurier University, Waterloo ON N2L 3C5, Canada
(2) University of Guelph, Guelph ON N1G 2W1, Canada
Abstract. In this paper we present a novel approach to 3D shape representation and matching utilizing a set of shape representations for 2D views of the object. The proposed technique capitalizes on the localization-preserving nature of the Hilbert space-filling curve and the approximation capabilities of the Wavelet transform. Each 2D view of the object is represented by a concise 1D representation that can be used to search an image database for a match. The shape of the 3D image is represented by the set of 1D representations of its 2D views. Experimental results, on a subset of the Amsterdam Library of Object Images (ALOI) dataset, are provided.
1 Introduction
3D objects are usually represented using geometrical (shape) models. Although such models are readily available for manufactured objects in the form of CAD models, most objects of interest do not come with a CAD model. Because of the impracticability of manually creating such a model for each object of interest, some effort has been made to automatically create 3D shape models from 2D images. However, in many cases human intervention is still necessary. In this paper, a view-based (or appearance-based) 3D object shape representation technique is proposed. The technique represents the shape of a 3D object utilizing the shape representations of a set of 2D views of the object. The shape representation of a 2D view is obtained by scanning the image following the Hilbert Curve (HC) producing a 1D vector. The vector is then smoothed, using wavelet approximation, and sampled to produce a Shape Feature Vector (SFV) that represents the 2D view. This method is translation and scale invariant. To minimize the number of 2D representations necessary to represent a 3D object, a mechanism for approximating a 2D view SFV from that of its mirror image is proposed. The technique exploits the vertical symmetry of the HC. Experimental results showing the effect of the proposed approximation on retrieval accuracy are presented. Section 2 describes a number of approaches to 3D shape representations. The proposed approach is discussed in Section 3. Experimental results are presented in Section 4.
2 Previous Work
3D object representation methods follow one of two schools: object-centered representation and viewer-centered representation.

2.1 Object-Centered Methods
Object-centered 3D model retrieval methods can be classified into three categories:

Geometric Properties-Based Methods. Because 3D shapes can be described by some global geometric properties, such as volume, surface area, and concentricity, many 3D methods exploit these properties to represent 3D objects. Paquet et al. [1] have employed bounding boxes, cords, moments, and wavelet descriptors for 3D shape description. Zhang and Chen [2] have described efficient methods to compute global features such as volume, area, statistical moments, and Fourier transform coefficients. Vranic and Saupe [3] have suggested a method in which the feature vector is formed by a complex function on the sphere. Kazhdan et al. [4] have described a reflective symmetry descriptor. Their experimental results show that combining the reflective symmetry descriptor with existing methods provides better results.

Statistical Properties-Based Methods. To refine the previous approach, the distributions of the global features are employed to represent the 3D objects rather than the global features themselves. Ip et al. [5] have investigated the application of shape distributions in the realm of CAD and solid modelling. Their method can be applied to volume models, but not to polygonal soups. Robert et al. [6] have devised a method to represent the features of an object as a shape distribution, sampled from a shape function measuring the global geometric properties of an object. Osada et al. [7] have used shape distributions, which measure properties based on distance, angle, area, and volume measurements between random surface points. Similarity between objects is measured with a pseudo-metric that measures distances between distributions. Ohbuchi et al. [8] have devised shape histograms that are discretely parameterized along the principal axes of the inertia of the model.

Topology-Based Methods. A complex object can be described in terms of its constituent components [9]. A 3D object shape can be described by providing a data structure to represent these components and their relationships. The three data structures commonly used for this purpose are model graphs, Reeb graphs, and skeletons. Model graphs are typically used to describe 3D solid models, produced by CAD systems [10]. However, model graphs are not suitable for natural shapes such as humans and animals. Hilaga et al. [11] have developed a topological matching method, particularly suited for articulated objects. Their method consists of Reeb graphs, based on
a quotient function defined by an integral geodesic distance. Bespalov et al. [12] have presented a modification of Hilaga's method, which computes a scale-space decomposition of a shape, represented as a rooted undirected tree instead of a Reeb graph. This reduces the problem of comparing two 3D models to matching trees.

2.2 Viewer-Centered Methods
In the viewer-centered approach, a 3D object is modelled as a set of 2D images, one for each view of the object. In this way, 3D recognition is reduced to 2D recognition, except that a single 3D object representation might require a large set of 2D views. Appearance-based 3D object representations encode individual object views as points in one or more multidimensional spaces [13]. To obtain the bases for these spaces, a statistical analysis of the training images is conducted. Typically, identifying an object and its pose is performed by projecting that view into the space(s) along the stored basis vectors, and finding the closest projected view of the training image. Murase and Nayar [14] have introduced an appearance-based recognition and pose determination for general objects. Using two eigenspaces, the authors recognize and determine the pose for objects under varying lighting conditions. The universal eigenspace, U, is formed from all the training images, X, and is used to determine the identity of an object. The object eigenspace, Oj, is formed for each object j from all training images Xi for that object. The object eigenspace is used to establish the object's pose. However, this technique is not robust to occlusion. In addition, no indication is given to optimize the size of the database with respect to the types of objects considered for recognition and their respective eigenspace dimensionality [15]. Appearance-based approaches have become a popular method for object identification and pose location in intensity images [16,17,18,19]. This popularity is, in part, due to these methods' capability to handle the effects of pose and illumination variations for general objects. Cyr and Kimia [20] have adopted a concept, named aspect graph, to represent a 3D object with a minimal set of 2D views. To represent a 3D object with a minimal number of 2D views, view transitions, named visual events, must be determined precisely. Chen et al. [21] use ten different silhouettes to represent a symmetric 3D object from 20 different viewpoints. 3D objects' similarity is determined by comparing their light fields. A light field is a set of five 2D images viewed from a certain angle. The similarity between two objects is measured by the highest cross-correlation between two of their light fields.

3 Proposed Approach
In this section, a view-based technique is proposed to represent 3D objects by a set of 2D views. Although the technique is described by the views generated on one viewing plane, namely, those generated by walking around the object, this is not a fundamentally limiting assumption. Since the proposed 3D shape representation
is made out of a number of 2D ones, we start by introducing the 2D representation in Section 3.1, followed by the 3D representation in Section 3.2.

3.1 2D Shape Representation
A shape is distinguished from its background by its pixels' intensity variation. To capture this variation, which has the shape information embedded in it, the segmented-out object image is scanned following a Hilbert Curve (HC) of the proper level. Figure 1 depicts four HCs of levels one to four. The gray level of each visited pixel is saved in a vector, V. Because of the locality-preserving nature of the HC, the resulting vector reflects the clustering of pixels in the image. To smooth out noise while keeping the main shape features intact, the wavelet transform is applied to V, producing the vector WV, which is then sampled to obtain the vector SWV, which is normalized to produce the object's Shape Feature Vector (SFV). Figure 2 depicts these steps. At search time, the distance between the search and the database images' SFVs is computed by a distance measure.
Fig. 1. Four levels of the Hilbert space-filling curve
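The following sketch (Python, using NumPy and the PyWavelets package) illustrates the 2D representation pipeline just described: scan the cropped object image along a Hilbert curve, smooth the resulting vector with a wavelet approximation, then sample and normalize it into an SFV. The curve order, the wavelet family and the number of samples are illustrative choices of ours; the paper does not prescribe them here.

```python
import numpy as np
import pywt  # PyWavelets, assumed available for the wavelet approximation

def hilbert_d2xy(order, d):
    """Map a position d along a level-'order' Hilbert curve to (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def shape_feature_vector(img, order=6, wavelet="db2", level=3, m=128):
    """img: (2**order, 2**order) grey-level array of the cropped object."""
    n = (1 << order) ** 2
    V = np.array([img[hilbert_d2xy(order, d)] for d in range(n)], float)
    WV = pywt.wavedec(V, wavelet, level=level)[0]          # approximation of V
    idx = np.linspace(0, len(WV) - 1, m).astype(int)
    SWV = WV[idx]                                          # sampled vector
    rng = SWV.max() - SWV.min()
    return (SWV - SWV.min()) / rng if rng else SWV * 0     # normalized SFV
```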
3.2 3D Shape Representation

The idea of the proposed view-based 3D shape representation is to represent a 3D object with a set of Shape Feature Vectors (SFVs), V = {SFV_0, SFV_θ, SFV_2θ, ..., SFV_iθ}, where i ≤ n − 1, n is the cardinality of V, and θ is the angle between any two views. Figure 3 depicts a set of 2D views representing a 3D object using a θ of 45 degrees. In such a view-based representation, the objective is to effectively represent the object while keeping n to a minimum. For this to happen, the 2D representation must have two main characteristics: robustness to rotations around the vertical axis (vertical rotation) and the capability to approximate mirror images.

Robustness to Vertical Rotation. Robustness to rotations around the vertical axis is a function of
θ
,
(1)
A View-Based 3D Object Shape Representation Technique Step
415
Sample output
Original image
Crop and HC scan image 300
250
200
150
100
50
V
0
0
200
400
600
800
1000
1200
1400
1600
700
600
500
400
300
200
100
Approximate to produce W V
0
0
20
40
60
80
100
120
140
160
180
200
1.4
1.2
1
0.8
0.6
0.4
0.2
Sample W V to obtain SW V
0
0
50
100
150
200
250
300
0
50
100
150
200
250
300
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
Normalize SW V to obtain SFV
0
Fig. 2. Steps of creating the 2D shape representation
where C_{iθ} is the cross-correlation between SFV_j and SFV_{j+θ}. The more robust to vertical rotation the 2D representation is, the smaller the effect a vertical rotation exerts on the distance between two consecutive images. The angle between two consecutive images, θ, is set such that the cross-correlation (as a distance measure) between any SFV_{i+θ/2} and the two V images on either side of it, SFV_i and SFV_{i+θ}, is ≥ a threshold t. In other words, in the worst case, there exists a V image that has at least t cross-correlation with any search image of the same object. The higher the t, the smaller the θ and the more accurate the representation.
Fig. 3. View-based representation of a 3D object with θ = 45. Dashed lines indicate the approximated views.
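A small sketch of how the view spacing θ can be chosen from this criterion (Python/NumPy). It takes a set of SFVs of one object at a fine angular step, finds the largest angular offset δ for which views δ apart still correlate by at least t on average, and takes θ ≈ 2δ as in Section 4.2. Averaging per object rather than per class, and the helper names, are simplifications of ours.

```python
import numpy as np

def xcorr(a, b):
    """Normalized cross-correlation between two SFVs, used as a similarity."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def widest_half_angle(sfvs, step=5, t=0.5, max_angle=45):
    """Largest offset delta (multiple of 'step' degrees) for which views that
    are delta apart still have average cross-correlation >= t; the view
    spacing is then taken as theta ~= 2 * delta.

    sfvs: dict mapping angles (0, step, 2*step, ..., < 360) to their SFVs.
    """
    angles = sorted(sfvs)
    best = step
    for delta in range(step, max_angle + step, step):
        corrs = [xcorr(sfvs[a], sfvs[(a + delta) % 360]) for a in angles]
        if np.mean(corrs) >= t:
            best = delta
        else:
            break
    return best
```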
Fig. 4. Subset of ALOI used for the first experiment (five objects from each of the six classes: Box, Cup, Bottle, Vehicle, Jar and Can)
Handling Mirror Images. Because some views within V are mirror images (e.g., SFV_0 and SFV_180), the ability to approximate one from the other can result in a significant reduction in n, which translates into a more compact representation. Due to the U (or inverted U) shape of the HC, reversing the SFV of an image yields an approximation of the SFV of its mirror image. For this to be the case, two conditions must be satisfied: the object is vertically symmetric and the angle-0 view is taken along a plane that is orthogonal to the symmetry plane (see Figure 3). Also, both planes must share an edge along the vertical axis which passes through the center of the minimum bounding rectangle of the object's top view. Note that for this mirror image approximation to be useful, V must have an even cardinality. The maximum number of approximated images is always ≤ n/2. The 3D object representation in Figure 3 requires only five SFVs to represent the object. The remaining three views (indicated by the dotted lines) are approximated at comparison time.
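A sketch of this mirror-image approximation (Python/NumPy). The pairing of angles below follows the setup of Figure 3 (angle-0 view orthogonal to the symmetry plane), under which the view at angle a and the view at angle (180 − a) mod 360 are mirror images; the helper names are ours.

```python
import numpy as np

def approximate_mirror_sfv(sfv):
    """Approximate the SFV of a view's mirror image by reversing the SFV,
    exploiting the vertical symmetry of the Hilbert curve."""
    return np.asarray(sfv)[::-1]

def sfv_for_angle(views, angle):
    """Return the SFV for 'angle': a stored view when available, otherwise
    the reversed SFV of its mirror view at (180 - angle) mod 360.
    'views' maps stored angles (degrees) to their SFVs."""
    if angle in views:
        return views[angle]
    return approximate_mirror_sfv(views[(180 - angle) % 360])
```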
4 Experimental Results

4.1 The Dataset

The test dataset for this experiment is a subset of the Amsterdam Library of Object Images (ALOI) [22]. The dataset consists of six classes, each with five 3D objects, as shown in Figure 4. For each 3D object, there are 72 images shot at intervals of 5 degrees. While the images in the ALOI dataset are colored, a gray-scale version of the dataset is used. The objects in the test dataset are selected according to the following criteria:
1. The objects within each class are different in colour (in the original dataset), shape, and/or size, but humans would have no difficulty recognizing that they belong to the same class of objects.
2. The objects are symmetric around a vertical plane to some degree. Although the non-symmetries are usually intensity related (e.g., different writing on each side of a box), in some cases shape non-symmetries do exist (e.g., the white truck). Objects with incomplete symmetry are intentionally chosen to demonstrate the proposed technique's robustness.

4.2 Deciding on θ
As stated in Section 3.2, the value of θ is determined such that, on average, the cross-correlation between any two views of the same 3D object which are θ/2 apart is t or more. To find the largest θ that satisfies this condition, the average cross-correlation for each object is calculated at angles of 5, 10, 15, 20, 25, 30, 35, 40, and 45. The average for each class of objects is shown in Figure 5. As expected, the average cross-correlation between two consecutive views decreases as the angle increases. The degradation is less evident for rounded objects, such as bottles and cans, than for square ones, such as vehicles and boxes. In Figure 5, the widest angle where all classes have an average cross-correlation ≥ t = 0.5,
Fig. 5. Average cross-correlation at different angles (one curve per class: Box, Can, Cup, Vehicle, Jar, Bottle)

Fig. 6. Sample 3D representations from V 45 (views at angles 0-315 in steps of 45 degrees for Box 1, Cup 3, Bottle 4, Vehicle 3, Jar 1 and Can 4)

Fig. 7. Sample 3D representations from V 30 (views at angles 0-330 in steps of 30 degrees for the same objects)

Fig. 8. Average retrieval rate per class for (a) V 45 and V 30 and (b) Va 45 and Va 30
is found to be around 22. This means that θ = 2 × 22 ≈ 45 (the angles of the images in the ALOI dataset are multiples of 5). As t approaches 0.6, θ ≈ 30.

4.3 Experimental Setup

Two 3D object representations, V 45 and V 30, are created with a θ of 45 and 30 degrees, respectively. In V 45, each 3D object is represented by eight 2D SFVs, whereas twelve SFVs represent each 3D object in V 30. Figures 6 and 7 provide sample images of V 45 and V 30, respectively. With the approximation technique described in Section 3.2, approximated versions of V 45 and V 30 (named Va 45 and Va 30, respectively) are created. Va 45 has three approximated images at angles 135, 180, and 315, and Va 30 has five approximated images at angles 120, 150, 180, 300, and 330. Each of the four representations is searched with 2160 query images (30 objects × 72 images per object), and the average retrieval rate for each class is computed. The retrieval rate is calculated for two types of searches: same-object and same-class.
Fig. 9. Average retrieval rate for all four ALOI representations (Object-45, Class-45, Object-30 and Class-30; with and without approximation)
With the same-object search, the goal is that the highest cross-correlation view belongs to the same object as the search image. A same-class match indicates that the highest cross-correlation view found belongs to the same class as the search image, but not necessarily the same object.

4.4 Results
The average retrieval rates for V 45 and V 30 are shown in Figure 8(a). From the figure, it is clear that there is some correlation between the effect of the rotation angle, as exhibited in Figure 5, and the average retrieval rate for each class. The classes which are more robust to vertical rotation, such as Bottle and Cup, exhibit high retrieval rates, and vice versa. There is no significant difference between the same-object and same-class retrieval rates except for the Box and Jar classes. This is understandable given that jar-1 and jar-2 are almost identical, and box-1 and box-3 are similar in shape, size, and intensity levels. This causes these similar objects to be confused for each other, negatively affecting the same-object retrieval rate. The difference in θ has little effect, since the V 45 and V 30 average retrieval rates for same-object and same-class searches differ by only 4% and 3%, respectively. From the Va 45 and Va 30 results in Figure 8(b), the introduction of the approximation has a more noticeable impact on the Box and Vehicle classes, but little effect on the remaining classes. This is attributed to the weak symmetry of some objects in these two classes, be it in terms of shape (e.g., a truck with one rear door open) or intensity level (e.g., different writing on each side of a box). Overall, the average retrieval rate for the approximated representations varies by only 3% to 5% from that of the original representations, as depicted in Figure 9. This suggests that a reduction of approximately 40% in storage requirements through approximation costs only about one tenth of that amount in retrieval accuracy.
5 Conclusion
This paper describes a novel technique for representing the shape of a 3D object by a set of shape representations of 2D views of the object. Experimental results show that a 3D object can be represented with as few as five 2D view representations. This is possible because of the technique’s robustness to vertical rotation and because some of the views can be approximated from their mirror images (by reversing the SFV of the mirror image). The experimental results show that the proposed technique achieves excellent retrieval rates. The approximation of some 2D view representations had a modest negative effect on retrieval accuracy compared to the resulting significant reduction in the number of 2D representations for each object.
References 1. Paquet, E., Murching, A., Naveen, T., Tabatabai, A., Rioux, M.: Description of shape information for 2-d and 3-d objects. Signal Processing: Image Communication 16, 103–122 (2000) 2. Zhang, C., Chen, T.: Efficient feature extraction for 2d/3d objects in mesh representation. In: ICIP 2001 (2001) 3. Vranic, D.V., Saupe, D.: Description of 3d-shape using a complex function on the sphere. In: ICME 2002. IEEE International Conference on Multimedia and Expo., Lausanne, Switzerland, pp. 177–180 (2002) 4. Kazhdan, M., Chazelle, B., Dobkin, D., Funkhouser, T., Rusinkiewicz, S.: A reflective symmetry descriptor for 3d models. Algorithmica 38(1) (2004) 5. Ip, C.Y., Lapadat, D., Sieger, L., Regli, W.C.: Using shape distributions to compare solid models, pp. 273–280 (2002) 6. Robert, O., Thomas, F., Bernard, C., David, D.: Shape distribution. ACM Transactions on Graphics 21(4), 807–832 (2002) 7. Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM Transactions on Graphics 21(4), 807–832 (2002) 8. Ohbuchi, R., Otagiri, T., Ibato, M., Takei, T.: Shape-similarity search of three-dimensional models using parameterized statistics. Pacific Graphics 2002, pp. 265–274 (2002) 9. Biederman, I.: Recognition-by-components: A theory of human image understanding. Psychological Review 94(2), 115–147 (1987) 10. Tangelder, J., Veltkamp, R.: A survey of content based 3d shape retrieval methods. In: Shape Modeling International, pp. 145–156 (2004) 11. Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology matching for fully automatic similarity estimation of 3d shapes. In: SIGGRAPH 2001, Los Angeles, USA, pp. 203–212 (2001) 12. Bespalov, D., Shokoufandeh, A., Regli, W.C., Sun, W.: Scale-space representation of 3d models and topological matching. In: Solid Modeling 03, pp. 208–215 (2003) 13. Dickinson, S.: Object Representation and Recognition. Basil Blackwell (1999) 14. Murase, H., Nayar, S.K.: Visual learning and recognition of 3-d objects from appearance. International Journal of Computer Vision 14, 5–24 (1995) 15. Naga Jyothi, D.: Multi-view technique for 3-d robotic object recognition system using neuro-fuzzy method. http://www.gisdevelopment.net/technology/ip/mi04039.htm
16. Black, M.J., Jepson, A.D.: EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision 26, 63–84 (1998) 17. Ohba, K., Ikeuchi, K.: Detectability, uniqueness, and reliability of eigen windows for stable verification of partially occluded objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 1043–1048 (1997) 18. Rao, R.: Dynamic appearance-based recognition. In: IEEE Conf. Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 540–546 (1997) 19. Campbell, R., Flynn, P.: Eigenshapes for 3d object recognition in range data. In: IEEE Conf. Computer Vision and Pattern Recognition, Fort Collins, Colorado (1999) 20. Cyr, C.M., Kimia, B.B.: 3d object recognition using shape similarity-based aspect graph. In: 8th International Conference on Computer Vision, Vancouver, Canada, pp. 254–261 (2001) 21. Chen, D.-Y., Tian, X.-P., Shen, Y.-T., Ouhyoung, M.: On visual similarity based 3d model retrieval. In: Eurographics, Granada, Spain, pp. 223–232 (2003) 22. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam library of object images. International Journal of Computer Vision 61(1), 103–112 (2005)
A Novel Multi-scale Representation for 2-D Shapes

Kidiyo Kpalma and Joseph Ronsin

IETR (Institut d'Électronique et de Télécommunications de Rennes), UMR CNRS 6164, Groupe Image et Télédétection, Institut National des Sciences Appliquées (INSA) de Rennes
[email protected]

Abstract. We present an original approach for 2-D shape description. Based on a multi-scale analysis of closed contours, this method deals with the differential turning angle. The input contour is progressively low-pass filtered by decreasing the filter bandwidth. The output contour thus becomes increasingly smooth. At each iteration of the filtering we extract the essential points from the differential turning angle of the filtered contour to generate the d-TASS map. Experimental results show that the d-TASS map is closely related to the contour and that it is rotation, translation and scale change invariant. It is also shearing and noise resistant. This function is local-oriented and appears to be particularly suitable for pattern recognition even for those patterns that have undergone occultation.

Keywords: pattern recognition, planar object, contour smoothing, Gaussian filter, differential-turning angle, curvature scale space, intersection point map.
1 Introduction

The huge availability of digital data together with the rapid growth of computer systems render pattern recognition systems more and more necessary for browsing databases and finding the desired information within a reasonable time limit. Given this observation, systems like CBIR (Content-Based Image Retrieval), QBIC (Query By Image Content) and QBE (Query By Example) require more attention and are of increasing interest to researchers. An automatic pattern recognition system must be simple, fast and efficient since one deals, in general, with large databases. The method must be resistant to camera viewpoint (rotation, translation, scale change, shearing) and to noise. Moreover, it must also be robustly resistant to occultation [7, 11]. In this paper, we present an original representation called the d-TASS function (differential-Turning Angle Scale Space) that helps to assess the curvature [12] and which can be expressed in terms of turning angle [5, 8, 10, 13]. This CSS-like [1, 3, 9, 12] method is also based on the multi-scale smoothing of the boundary of a planar pattern. By definition, the method is translation insensitive, rotation invariant, scale change invariant and shearing resistant. It can also operate even when objects are partially occulted. Reference [13]
describes a method based on the (multiscale) turning angle in which features are defined by the turning angles at two different scales. In the new approach, we propose some essential characteristic points derived from the differential-Turning Angle (d-TA) function that yield representative features. Preliminary results obtained when applied to various contours show that the d-TASS map is very characteristic of the contour and discriminative features can therefore be derived from it.
2 Description of the Approach

The input contour is defined by a set of N points ordered counterclockwise in the plane. The number of points can be set to an arbitrary value by resampling the input contour: in our experiments, N is set to 360, as in the MSGPR method [1, 3]. These points are represented by their parameter-based coordinates x(u) and y(u), where u is the arc-length and corresponds to the curvilinear abscissa. This sequence is counterclockwise ordered. Given a starting point P_0, the points are numbered from P_0 to P_{N-1}. Now, let V_n be the vector defined by the two points P_n and P_{n+1}, originating at P_n and oriented towards P_{n+1}. Figure 1 illustrates an example of this.

Definition 1: The turning angle θ_n at point P_n is the angle between the vector V_n and the x-axis; it is also called the tangential angle. From this definition we can introduce a second one.

Definition 2: The differential turning angle (d-TA) is the angle between the two consecutive vectors V_n and V_{n-1}. In other words, the d-TA is defined by the relation:

φ_n = θ_n − θ_{n-1} .    (1)
By definition, the d-TA function is null on straight lines. It increases when turning left and decreases when turning right.

Fig. 1. Definition of a turning angle
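A direct transcription of Definitions 1 and 2 in Python/NumPy (function name is ours): the turning angle θ_n of each segment vector V_n and the differential turning angle φ_n of Eq. (1), computed for a closed, counterclockwise-ordered contour.

```python
import numpy as np

def differential_turning_angle(x, y):
    """Turning angle theta_n and differential turning angle phi_n (Eq. (1))
    for a closed contour given by counterclockwise-ordered points (x, y)."""
    dx = np.diff(np.r_[x, x[0]])            # V_n = P_{n+1} - P_n (closed contour)
    dy = np.diff(np.r_[y, y[0]])
    theta = np.arctan2(dy, dx)              # tangential angle at each point
    phi = np.diff(np.r_[theta[-1], theta])  # phi_n = theta_n - theta_{n-1}
    phi = (phi + np.pi) % (2 * np.pi) - np.pi   # wrap so left turns are positive
    return theta, phi
```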
Proposition: The differential turning angle at a convex point is positive. It is negative for concave points and null on a straight line. According to this proposition, the maximum d-TA value corresponds to a point at which the contour stops turning to the left on convex sections. Inversely, a minimum value corresponds to the point at which the contour stops turning to the right on a concave section of the contour. Thus, it appears that the locations of these extremes correspond to points with high curvature magnitude, constituting the essential points of the contour.
Fig. 2. Convexity/concavity on a planar object and corresponding differential turning angle
Figure 2 illustrates the following assertion: the high value of d-TA in (A) indicates a convex portion (a) of the contour. Part (c), which is a convex arc, gives a positive d-TA (C), while part (i), which is a concave arc, gives a negative d-TA (I). In this figure, it is easy to see that straight sections (f) give a null d-TA (F).

Definition 3: We define the essential points, which can be classified into one of two categories:
• α-points are those with a local minimum d-TA,
• β-points are those with a local maximum d-TA.

With this definition, β-points are the essential points corresponding to high positive curvature, or high convexity; α-points represent high concavity, corresponding to high negative curvature. In 1954, F. Attneave [2] suggested that information along a visual contour is concentrated in regions of high curvature magnitude rather than distributed uniformly along it. Starting from this suggestion, J. Feldman et al. demonstrated in [2, 4] that the sign of the curvature carries useful information. Moreover, they observed that negative curvatures carry more information than positive curvatures. The d-TA
function therefore provides a particularly suitable means of retrieving useful information along the contour. In [9, 12], the authors showed that scale-space analysis can be used to capture useful information about multiple scales simultaneously. In order to take advantage of this property, we derive a multi-scale version of the d-TA to define the differential turning angle scale space (d-TASS) function, which yields interesting features about a contour in a scale-space representation.

Definition 4: The d-TASS function is a 2-D function giving the differential turning angle. The horizontal axis corresponds to the space variable u, while the vertical axis is related to the scale variable σ. This function indicates the location of the essential points on the smoothed contour for each value of the smoothing filter standard deviation σ.

The principle of this approach can be described as follows (a short sketch of this construction follows Fig. 3):
1. the input contour is separated into its parameter-based coordinate functions x(u) and y(u),
2. both functions are smoothed by applying a Gaussian low-pass filter with standard deviation σ,
3. the d-TASS map is then generated by detecting the location of the essential points from the d-TASS function.

Definition 5: The d-TASS map is a 2-D representation containing only the essential points (see Definition 3) detected from the d-TASS function.

Since the bandwidth is inversely proportional to the standard deviation, as σ increases the Gaussian kernel cuts off lower and lower frequencies, so that the output functions become increasingly smooth. Figure 3.a shows the d-TASS function for a range of the filter standard deviation σ. In this representation, the y-axis corresponds to the number of filtering iterations and represents the scale variable σ, which is related to the filter bandwidth. At each iteration of the filtering process, the d-TA along the filtered contour is normalized to obtain gray levels between 0 and 255. As shown in this figure, the resulting image exhibits some particularly interesting shapes: dark areas indicate the location of concave sections and light areas indicate convex sections.
Fig. 3. The d-TA function vs. scale space: (a) the d-TASS function, (b) the d-TASS map
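A minimal sketch of the three-step construction above (Python, using SciPy's Gaussian filter and the differential_turning_angle() helper from the earlier sketch). The set of σ values and the simple three-point extremum test used to locate the α- and β-points are illustrative choices of ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def dtass_map(x, y, sigmas):
    """Smooth x(u), y(u) with Gaussian filters of increasing sigma, compute
    the d-TA along each smoothed contour, and record the alpha-points
    (local minima) and beta-points (local maxima) at every scale."""
    alpha_pts, beta_pts = [], []
    for row, s in enumerate(sigmas):
        xs = gaussian_filter1d(x, s, mode="wrap")   # circular filtering,
        ys = gaussian_filter1d(y, s, mode="wrap")   # since the contour is closed
        _, phi = differential_turning_angle(xs, ys)
        prev, nxt = np.roll(phi, 1), np.roll(phi, -1)
        alpha = np.where((phi < prev) & (phi < nxt))[0]   # local minima of d-TA
        beta = np.where((phi > prev) & (phi > nxt))[0]    # local maxima of d-TA
        alpha_pts.append((row, alpha))
        beta_pts.append((row, beta))
    return alpha_pts, beta_pts
```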
From this observation, we define the essential points required to characterize the contour. As explained above, the characteristic points extracted from the turning angle function are those corresponding to its extreme values (minima and maxima). By doing so, we obtain the map of the corresponding essential points, represented by the d-TASS map in figure 3.b. As seen in the following figures, the number of d-TASS points varies from one scale to another: as the scale variable σ increases, this number decreases. This happens in one of three ways:
• one β-point and one α-point merge and disappear,
• one β-point merges with two α-points to give one α-point,
• two β-points merge with one α-point to give one β-point.
Fig. 4. Low frequencies in the d-TASS map
Figures 4.b and 5.b show two d-TASS maps derived from the same shape. The difference between the two maps comes from the represented range of the scale variable σ. Figure 4.b covers a large range of the scale variable σ, from 0 to σ_MAX,0, and we have zoomed in on the part ranging from 0 to σ_MAX,1, leading to figure 5.b.
Fig. 5. High frequencies in the d-TASS map
Indeed, the highest value of σ (σ_MAX,0) in figure 4 is greater than the highest value (σ_MAX,1) in figure 5 for the same number of image rows. This means that the iteration step of σ is smaller in figure 5 than in figure 4, which corresponds to an enlargement of the y-axis. These figures show that scale information can be retrieved from the d-TASS map: fine details (or high frequencies) lie in the lower part of the d-TASS map, while coarse scales (or low frequencies) lie in the upper part.

2.1 Properties of the d-TASS Map

2.1.1 Shape Representativity
Figure 6 shows d-TASS maps derived from various shapes. It is evident that the representation of the turning angle in the scale space is characteristic of the contour. In other words, the d-TASS maps differ from one contour to another.
Fig. 6. The d-TASS map of different contours (a: carcer, b: kk20, c: kk707, d: kk711, e: beetle-05, f: bat-07)
2.1.2 Rotation Insensitivity
Figure 7 represents the d-TASS map derived from the same contour after subjecting it to various angles of rotation. It is clear that the d-TASS map is rotation insensitive: it is topologically identical whatever the angle of rotation. In the d-TASS map, the rotation is expressed by a simple circular shift related to the curvilinear abscissa of the starting point P_0.
Fig. 7. The d-TASS map vs. rotation (a: 0°, b: 45°, c: 90°, d: 135°, e: 180°, f: 225°)
2.1.3 Resistance to Scale Change
Figure 8 shows examples of d-TASS maps extracted from a contour that has undergone various scale changes. One can see that the d-TASS map remains practically the same whatever the scale factor.
Fig. 8. The d-TASS map vs. scaling (a: original, b: scaled up ×2, c-f: scaled down ×0.20, ×0.25, ×0.50 and ×0.75)
2.1.4 Resistance to Occultation
Occultation corresponds to a partially masked contour. Since the proposed approach deals with closed contours, if the input contour is occulted between points A and B, then it is considered to be closed by connecting point A to point B. Figure 9 shows an example of an occulted contour and the establishment of the new closed contour.
Fig. 9. a) the original contour, b) the occulted contour and c) the reconstructed contour
In figure 10.a we show the d-TASS map of a contour. Figures 10.b-d show the d-TASS maps derived from the same contour after various partial occultations. As we operate by resampling the remaining visible part to get the same number of points, the x-axis of the d-TASS map is rescaled.
Fig. 10. The d-TASS map vs. occultation (a: original, b: bottom occlusion, c: right-side occlusion, d: left-side occlusion; BO, RO and LO mark corresponding structures)
As is obvious from these figures, the d-TASS map of the remaining visible part (see figures 10.b-d) is the same as the part derived from the corresponding region of the original whole contour. In these figures, BO, RO and LO indicate structures in the d-TASS map of the remaining visible part and the corresponding structures in the d-TASS map derived from the whole contour. This property is particularly important and interesting since it makes it possible to recognize a pattern even if it is partially hidden.

2.1.5 Shearing Resistance
Figure 11 shows an example of the effect of shearing on the d-TASS map. In this figure we represent six contours with their corresponding d-TASS maps. Figure 11.a
Fig. 11. The d-TASS map vs. shearing (a: original, b-f: shear factors λ = 0.25, 0.5, 0.75, 1.0 and 1.5)
shows the original contour and its d-TASS map while Figures 11.b-f show the sheared copies and their d-TASS maps. In this study, the shearing transformation is performed as follows:
(x_s, y_s)^t = [1 λ; 0 1] (x_0, y_0)^t = (x_0 + λ y_0, y_0)^t ,    (2)
where (x_0, y_0) are the coordinates of the original contour and (x_s, y_s) those of the sheared equivalent. These figures show that the d-TASS map derived from the sheared contours is consistent with the map extracted from the original contour. Except for a slight loss in the height of some peaks, the d-TASS maps are topologically identical.
3 Application of the d-TASS Map

From the results obtained in the various experiments above, we observe that the d-TASS map is closely related to the contour on which it is based. In other words, the d-TASS map is characteristic of the shape under consideration. In addition, it has a number of very interesting properties such as rotation insensitivity, translation insensitivity, shearing resistance, scale change resistance, etc. We showed that for partially occulted contours, the visible region generates a d-TASS map that remains consistent with the corresponding region of the d-TASS map derived from the whole contour.
With all these interesting properties, we can define efficient features based on the d-TASS map to describe an object by analyzing its boundary. Possible features are the horizontal distances d_i (similar to the features proposed in [1, 3]) between two essential points (see Definition 3) at a given scale, and the locations (u_i, σ_i) of the points where essential points merge. Furthermore, as indicated in [6], the CSS representation gives the same CSS image for the two contours of figure 12: thus that approach fails to discriminate between these patterns.
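As one illustration of such a feature, the sketch below (Python/NumPy, building on dtass_map() from the earlier sketch) computes the horizontal distances d_i between consecutive essential points at a chosen scale row; normalizing by the contour length is our own choice, not one prescribed by the paper.

```python
import numpy as np

def scale_row_distances(alpha_pts, beta_pts, row, contour_len=360):
    """Horizontal distances d_i between consecutive essential points at one
    scale (one row of the d-TASS map), normalized by the contour length."""
    pos = np.sort(np.concatenate([alpha_pts[row][1], beta_pts[row][1]]))
    if len(pos) < 2:
        return np.array([])
    d = np.diff(np.r_[pos, pos[0] + contour_len])   # circular spacing
    return d / contour_len
```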
Fig. 12. Examples of contours (a and b)
Fig. 13. The d-TASS maps corresponding to the two contours of Fig. 12
As can be seen in figures 13.a and 13.b, the d-TASS maps of these contours are different from one another: thus, with this approach, it is possible to handle and discriminate this kind of pattern. According to its definition, the d-TASS function corresponds to another approach to measuring the curvature of a contour [5, 8, 10, 13]. Thus the zero-crossings of the d-TASS function correspond to another representation of the curvature in the scale space. With this observation, we define a third type of essential point. Since the zero-crossings of the d-TASS function appear to be characteristic of contours, like the well-known CSS image, we define the zero-crossing essential points.

Definition 6: The γ-points are those essential points that correspond to the zero-crossings of the d-TASS function.

Figures 14.a and 14.b show the γ-points corresponding to the contours of figures 12.a and 12.b, respectively. It is clearly noticeable that the γ-point map is discriminative for both contours. Superposing the three types of essential points in the same single image (see figures 14.c and 14.d) illustrates the ability of this representation to discriminate two different contours. Indeed, by comparing figures 14.c and 14.d point-to-point, it clearly appears that they are characteristic and representative of their respective contours, so one only needs to define some appropriate features to efficiently describe 2-D objects based on the d-TASS map of their contours. Some possible features that can be derived from the d-TASS map are the coordinates of various characteristic points of the essential-point curves.
Fig. 14. Examples of α-, β- and γ-points derived from two contours
4 Conclusion and Perspectives

It is well established that curvature carries useful information about contours. In addition, it is proven that the capacity of a system to simultaneously capture
information about various scales makes it easy to capture useful content information about an object. We have presented some application results of the d-TASS map (differential turning angle scale space) with regard to shape description. These results show that the d-TASS map has encouraging properties for planar shape description. Some of the main advantages of this method are its simplicity and its capacity to deal with occultation. Since the d-TASS map is characteristic of the shape and topologically identical under various transformations and for the visible part of an occulted contour, it offers interesting features for pattern description. The originality of this approach is its ability to deal with multiple scales at the same time by utilizing the differential turning angle. Using γ-points together with α- and β-points, one can define multi-scale features that are discriminative and can help to overcome the shortcoming of the CSS when facing contours like those represented in figure 12. In our future work we will initially focus on analyzing and proposing some effective features derived from the d-TASS map in order to apply them to planar object recognition. This will help to evaluate the new approach by comparing its recognition scores with those obtained from other existing approaches.
References 1. Kpalma, K., Ronsin, J.: Multiscale contour description for pattern recognition, Elsevier. Pattern Recognition Letters 27(13), 1545–1559 (2006) 2. Feldman, J., Singh, M.: THEORETICAL NOTE: Information Along Contours and Object Boundaries. Psychological Review, The American Psychological Association 112(1), 243– 252 (2005) 3. Kpalma, K., Ronsin, J.: A Multi-Scale curve smoothing for Generalised Pattern Recognition (MSGPR). In: ISSPA 2003, Paris, France, July 1-4, 2003 (2003) 4. Barenholtz, E., Cohen, E.H., Feldman, J., Singh, M.: Detection of change in shape: an advantage for concavities. Elsevier, Cognition 89, 1–9 (2003) 5. Zibreira, C., Pereira, F.: A Study of Similarity Measures for a Turning Angles-based Shape Descriptor. In: 3rd Conference on Telecommunications, conftele 2001, Figueira da Foz, April 23-24, 2001 (2001) 6. Latecki, L.J., Lakamper, R., Eckhardt, U.: Shape Descriptors for Non-rigid Shapes with a Single Closed Contour. In: CVPR. IEEE Conf. On Computer Vision and Pattern Recognition, pp. 424–429. IEEE Computer Society Press, Los Alamitos (2000) 7. Bruckstein, A.M., Rivlin, E., Weiss, I.: Recognizing objects using scale space local invariants. In: ICPR ’96. Proceedings of the 1996 International Conference on Pattern Recognition, Vienna, Austria, August 25-29, 1996, pp. 25–29 (1996) 8. Niblack, W., Yin, J.: A pseudo-distance measure for 2D shapes based on turning angle. In: ICIP’95. Proceedings of the International Conference on Image Processing, October 23-26, 1995, vol. 3, pp. 352–355 (1995) 9. Lindeberg, T.: Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Dordrecht, Netherlands (1994) 10. Scassellati, B.: Retrieving images by 2D shape: a comparison of computation methods with human perceptual judgments. In: Niblack, W., Jain, R.C. (eds.) Proc. of SPIE, Storage and Retrieval for Image and Vision Databases II, February 1994, pp. 2–14 (1994)
11. Bruckstein, A., Katzir, N., Lindenbaum, M., Porat, M.: Similarity-Invariant Signatures for Partially Occluded Planar Shapes. IJCV 7(3), 271–285 (1992) 12. Mokhtarian, F., Mackworth, A.K: A Theory of Multiscale, Curvature-Based Shape Representation for Planar Curves. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI, vol. 14(8) (1992) 13. Arkin, E.M., Paul Chew, L., Huttenlocker, D.P., Kedem, K., Mitchell, J.S.B.: An efficiently computable metric for comparing polygonal shapes. IEEE Transactions on PAMI 13(3), 209–216 (1991)
Retrieval of Hand-Sketched Envelopes in Logo Images
Naif Alajlan
Electrical Engineering Dept., King Saud University, P.O. Box 800, Riyadh 11421, Saudi Arabia
[email protected]
Abstract. This paper introduces an approach for retrieving envelopes of high-level object groupings in bi-level images with multiple objects. Motivated by studies in Gestalt theory, hierarchical clustering is used to detect the envelope and group its objects based on their spatial proximity, area, shape features and orientation. To decide the final grouping, the grouping outcomes are combined using an evidence accumulation method. The high-level boundary of the detected envelope is then extracted using morphological operations. For retrieval, the boundary of a query sketch is matched to the envelopes extracted from database images via dynamic space warping. Experiments on a set of bi-level logo images demonstrate the effectiveness of the approach.
1 Introduction
Recently, semantic content-based image retrieval (CBIR) has emerged from the observation that most users do not want to retrieve images based only on their low-level features [1]. Existing CBIR systems fall behind that target [2, 3, 4]. For example, multi-object images contain more shape information than the mere sum of the shape information of the individual objects: a group of objects can be spatially arranged such that their envelope has a semantically high-level shape. Fig. 1 shows two multi-object images containing the same objects with different spatial arrangements. Clearly, this difference significantly changes the way we perceive the images. Gestalt theory provides the four well-known main principles of human perception: proximity, similarity, continuity and closure [5]. A study by Biederman [6] suggests that the human visual system quickly assumes and uses collinearity, curvature, parallelism and adjacency of a group of objects in order to perceive them as a whole. The envelope concept originated with Eakins et al. in the context of their ARTISAN system for trademark retrieval [7]. They suggested grouping boundary segments based on their proximity and shape features as part of ongoing research; however, no detailed method was reported. In [8], an adaptive selection scheme between Zernike moments and geometric primitives, obtained using the Hough transform, is used for trademark feature extraction. The authors claim that these features capture Gestalt-based properties, which include symmetry,
Fig. 1. Two multi-object images containing the same object and with totally different perceived envelopes
continuity, proximity, parallelism and closure. However, this method does not detect high-level envelopes. In this paper, an approach for envelope detection and retrieval in multi-object shapes is introduced which is mainly motivated by Gestalt principles. First, hierarchical clustering is used to group objects based on each of proximity, area, shape similarity and orientation. To decide the final grouping of the objects, the concept of evidence accumulation is employed [9]. This method is more systematic and achieves better results than the heuristic-based method in [10]. Then, morphological operations are used to merge the objects in each of the identified groupings without changing their size. The envelope is then extracted by reconstructing the merged component using its concavity tree to eliminate odd concavities. For retrieval, a user provides a hand-sketched query shape and would like to retrieve envelopes in database images similar to the query. Therefore, the problem becomes matching the shape of the query with each identified envelope in the database images. There is a large number of methods in the literature for shape matching [11, 12]. In our application, the dynamic space warping (DSW) method for matching closed boundaries [13] is applied since it has achieved higher accuracy than many other methods in the literature.
2 The Proposed Method
Given a query image of a hand-sketched shape, the aim is to retrieve the envelopes formed by groups of objects in database images that are most similar to the query. This is achieved in three main steps. First, hierarchical clustering is used to group objects forming an envelope based on their proximity, area, shape similarity and orientation. In the second step, morphological operations are used to merge the objects in each grouping, and the envelope is then extracted by reconstructing the merged object from its concavity tree, thus removing any artifacts along the envelope boundary. The third step measures the similarity between the query shape and each identified envelope using the DSW method in [13]. The remainder of this section explains these steps in more detail.
2.1 Object Grouping
Brain research has shown experimentally that the processing of proximity and of other features is performed separately [5]. Therefore, in our approach, objects are grouped separately based on their proximity, area, shape similarity and orientation using hierarchical clustering. The feature extraction for proximity is done as described in [10], which measures the spatial distance between two objects as the symmetric Hausdorff distance [14] between their convex hulls. The features for shape similarity include eccentricity and solidity, and orientation is taken as the angle between the major axis and the horizontal line [15]. The final grouping is then decided based on a recent method for combining multiple clusterings called evidence accumulation [9].

For each distance matrix of proximity, area, shape similarity and orientation, a hierarchical clustering algorithm [16] is applied. The result is a hierarchical tree, called a dendrogram, which is not a single set of clusters but rather a multi-level hierarchy where clusters at one level are joined into clusters at the next higher level, as shown in Fig. 2. In our application, clusters are defined where there is a clear cut in the dendrogram. In this case, the compactness of a cluster is defined by how similar its members are. For the proximity, area and shape similarity groupings, cutting the dendrogram based on the maximum lifetime has a major drawback: a single large distance can dominate the decision when the inter-class or intra-class variation is large. Here, the cutting decision is based on statistics, namely the mean and the standard deviation, computed from the distances under each node in the dendrogram. To achieve scale invariance, the standard deviations of all nodes are normalized to have zero mean and unit variance. Intuitively, a node with a small normalized standard deviation most likely groups proximal (or similar) objects, and vice versa. Therefore, a threshold is set for cutting the dendrogram based on the normalized standard deviation (evaluated experimentally to be equal to one). For the orientation-based grouping, the dendrogram cut is made directly at the desired angle, which is taken to be ten degrees.

An illustrative example is shown in Fig. 2. Panel (a) shows the input multi-object shape. The results of hierarchical clustering based on proximity, shape similarity, and orientation are shown in panels (b), (d), and (f), respectively. The horizontal line in each dendrogram shows where the groups are decided. The final grouping of the image objects is made from these initial groupings. In [10], a set of heuristic rules is employed to combine the initial groupings into a final grouping. In this paper, a more systematic approach for deciding the final grouping is obtained using a recent method for combining multiple clusterings by evidence accumulation [9], as shown in the block diagram of Fig. 3. The outcomes of the four initial groupings are considered as evidence and accumulated in a new distance matrix. Then, hierarchical clustering is applied on this matrix to decide the final grouping. In the following, we describe the application of this method to our grouping problem.
The lifetime is the difference between the distances at two successive nodes.
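As a rough illustration of this grouping step, the sketch below applies average-link hierarchical clustering to one precomputed distance matrix and cuts the dendrogram with a simple statistics-based threshold. The cut rule used here (mean plus one standard deviation of the merge distances) is only a simplified stand-in for the per-node normalized-standard-deviation criterion described above, and the toy centroids are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_by_distance(dist_matrix, num_std=1.0):
    """Cluster objects from a precomputed distance matrix with average-link
    hierarchical clustering, cutting the dendrogram at
    mean + num_std * std of the merge distances (a simplified stand-in
    for the normalized-standard-deviation rule in the text)."""
    condensed = squareform(dist_matrix, checks=False)   # square -> condensed form
    tree = linkage(condensed, method='average')         # dendrogram (linkage matrix)
    heights = tree[:, 2]                                 # merge distances
    cut = heights.mean() + num_std * heights.std()
    return fcluster(tree, t=cut, criterion='distance')

# Toy example: two spatially separated groups of object centroids.
centroids = np.array([[0., 0.], [1., 0.], [0., 1.],        # group 1
                      [10., 10.], [11., 10.], [10., 11.]])  # group 2
proximity = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
print(group_by_distance(proximity))   # e.g. [1 1 1 2 2 2]
```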
Fig. 2. An example of object grouping and envelope extraction. See text for details.
Fig. 3. Block diagram of the object grouping based on evidence accumulation

Assume there are M initial groupings (or clusterings) of m image objects. Here, there are four initial clusterings based on proximity, area, shape similarity and orientation; therefore, M = 4. The clustering ensemble is defined as

    G_i = {C_1^i, C_2^i, ..., C_{k_i}^i},  i = 1, ..., M    (1)

where C_j^i is the jth cluster in the clustering G_i, which has k_i clusters. Then, a voting scheme combines the outcomes of the initial clusterings using a co-occurrence matrix D of object pairs defined as

    D(p, q) = m_pq / M    (2)

where m_pq is the number of times the objects p and q are assigned to the same cluster among the M initial clusterings. To decide the final grouping, hierarchical clustering is applied on D and the dendrogram cut is made based on the maximum lifetime. An example illustrating this process is shown in Fig. 4: the maximum lifetime in the co-occurrence matrix dendrogram of panel (b) partitions the image objects into two clusters, as shown in panel (c) of the same figure.
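The short sketch below illustrates equations (1) and (2): it accumulates the co-occurrence matrix from several labelings, converts it to a distance, and cuts the resulting dendrogram at the largest lifetime. It is a minimal reading of the evidence-accumulation idea of [9], not the authors' implementation, and the four toy labelings are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def combine_clusterings(labelings):
    """Evidence accumulation: build D(p, q) = m_pq / M from M labelings,
    use 1 - D as a distance, and cut the dendrogram at the largest lifetime
    (the gap between two successive merge distances)."""
    labelings = np.asarray(labelings)                # shape (M, m_objects)
    M, m = labelings.shape
    co = np.zeros((m, m))
    for labels in labelings:
        co += (labels[:, None] == labels[None, :])   # vote for co-assigned pairs
    co /= M                                          # co-occurrence matrix D
    dist = 1.0 - co
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method='average')
    heights = tree[:, 2]
    if len(heights) > 1:
        i = int(np.argmax(np.diff(heights)))         # largest lifetime
        cut = 0.5 * (heights[i] + heights[i + 1])
    else:
        cut = heights[-1]
    return fcluster(tree, t=cut, criterion='distance')

# Four initial groupings (proximity, area, shape, orientation) of 6 objects:
initial = [[1, 1, 1, 2, 2, 2],
           [1, 1, 2, 2, 2, 2],
           [1, 1, 1, 1, 2, 2],
           [1, 1, 1, 2, 2, 2]]
print(combine_clusterings(initial))   # e.g. two final clusters
```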
2.2 Envelope Extraction
This is the second stage towards the extraction of the semantic envelope. The input to this stage is the output of the object grouping stage; specifically, a labeled matrix of the object groupings in the image, where objects belonging to the same group are assigned the same label. There are two sub-steps in this stage.

The first sub-step is to morphologically merge the objects in each grouping by repeated dilation, using a 3×3 structuring element, until the resulting grouping has only one component. If the dilation operation was performed n times, we then need to shrink the merged component by n pixels, without splitting it, using an erosion with a (2n+1)×(2n+1) structuring element. A split will occur if the (square) structuring element cannot slip through a neck joining pairs of (original) components. To get around this problem, we morphologically close the merged component with a diamond-shaped structuring element with a main diagonal of 2n+1 pixels. This guarantees that the subsequent erosion will not split the merged component, as the square structuring element is now always able to pass through the necks. If we proceeded with the envelope extraction at this stage, the resulting envelope would have small odd concavities caused by the morphological operations. The second sub-step is therefore to extract the envelope of the merged objects; it could already be used as the envelope at this stage, but it still needs to be smoothed.
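A minimal sketch of the merging sub-step is given below, assuming scikit-image for the morphology. The structuring-element choices (3×3 square for dilation, diamond of main diagonal 2n+1 for closing, (2n+1)×(2n+1) square for erosion) follow the text; the toy image and the iteration cap are assumptions for illustration only.

```python
import numpy as np
from skimage.morphology import (binary_dilation, binary_erosion,
                                binary_closing, square, diamond)
from skimage.measure import label

def merge_group(mask, max_iter=50):
    """Merge the objects of one grouping into a single component:
    dilate with a 3x3 square until connected (n steps), close with a
    diamond of main diagonal 2n+1 to protect the necks, then erode with
    a (2n+1)x(2n+1) square to restore the original scale."""
    merged = mask.copy()
    n = 0
    while label(merged).max() > 1 and n < max_iter:
        merged = binary_dilation(merged, square(3))
        n += 1
    if n > 0:
        merged = binary_closing(merged, diamond(n))          # diagonal 2n+1
        merged = binary_erosion(merged, square(2 * n + 1))   # shrink back n pixels
    return merged

# Toy example: two 4x4 squares separated by a 3-pixel gap.
img = np.zeros((12, 16), dtype=bool)
img[4:8, 2:6] = True
img[4:8, 9:13] = True
print(label(img).max(), label(merge_group(img)).max())   # 2 components -> 1
```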
Fig. 4. Illustration of the object grouping using evidence accumulation. Input image (a), dendrogram of the co-occurrence matrix D (b), the final grouping (c), and the extracted envelope (d).
This task is delegated to a contour-based concavity tree extraction algorithm [17] that eliminates concavities smaller than a given threshold. This threshold varies with the gaps between the original components; it is currently set to four times the area of the structuring element used in the erosion step. Fig. 5 illustrates the merging and envelope extraction processes.
Fig. 5. Illustration of the merging and envelope extraction process. Input image (two groupings) (a), result of merging (b), concavity trees used to extract the envelope (c), resulting envelope (d), and output image (e).
Fig. 6. Samples of correct object grouping and envelope extraction. Input images (left column), grouping results (middle column) and extracted envelopes (right column).
Note that only two tree nodes (including the root) were used for the envelope reconstruction from the corresponding concavity tree in each of the two envelopes.

2.3 Envelope Retrieval
In image retrieval, a user provides a query image and the system then retrieves the images most similar to the query from an image database. In this work, the user provides a hand-sketched query shape, as in Fig. 7 (top row), and the system retrieves envelopes in database images similar to that query. Therefore, the problem becomes matching the shape of the query with each identified envelope in the database images.
Fig. 7. Illustration of hand-sketched queries (top row) and the top three retrieved images (second to fourth rows)
At this stage, any shape matching method can be used [11, 12]. In our application, the dynamic space warping (DSW) method for matching closed boundaries [13] is applied since it has achieved higher accuracy than many other methods in the literature. Note that the accuracy of the retrieval depends heavily on that of the envelope detection and extraction.
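Since the DSW matcher of [13] is not reproduced here, the sketch below uses the symmetric Hausdorff distance between sampled boundaries purely as a stand-in distance to illustrate how extracted envelopes can be ranked against a query sketch; the toy boundaries are invented.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def boundary_distance(query_pts, envelope_pts):
    """Symmetric Hausdorff distance between two sampled boundaries.
    Only a simple stand-in for the DSW matcher of [13]."""
    d1 = directed_hausdorff(query_pts, envelope_pts)[0]
    d2 = directed_hausdorff(envelope_pts, query_pts)[0]
    return max(d1, d2)

def rank_envelopes(query_pts, envelopes, top_k=3):
    """Rank extracted envelopes (arrays of boundary points) against a
    hand-sketched query boundary; smaller distance is a better match."""
    order = sorted(range(len(envelopes)),
                   key=lambda i: boundary_distance(query_pts, envelopes[i]))
    return order[:top_k]

# Toy example: a square query against an elongated rectangle and a shifted square.
square = np.array([[0, 0], [0, 10], [10, 10], [10, 0]], dtype=float)
rect = np.array([[0, 0], [0, 30], [5, 30], [5, 0]], dtype=float)
print(rank_envelopes(square, [rect, square + 1.0]))   # the square-like envelope ranks first
```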
3 Experimental Results
The performance of the proposed approach is demonstrated using a set of 110 logo images containing a varying number of objects with various orientations and shapes. The outcomes of the object grouping stage were subjectively correct for 93 images using the evidence accumulation approach. Moreover, when the envelope extraction algorithm was applied to each of the correctly grouped images, a subjectively correct envelope was extracted. Fig. 6 shows samples of correct grouping and envelope extraction. For retrieval, 10 hand-sketched shapes of envelopes are used as queries. Fig. 7 shows samples of these queries and the top three retrieved images. For 8 of the 10 queries, the system was able to retrieve images with subjectively similar envelopes.
4 Conclusions
Envelope extraction is a very important stage towards high-level shape representation and similarity matching. In this paper, an approach for envelope extraction and retrieval is proposed. The approach groups objects using their proximity, shape similarity and orientation, which hierarchical clustering makes possible. The fusion of the outcomes of the initial groupings is achieved using the evidence accumulation concept. Then, the envelope of each group of objects is approximated by means of morphological operations, and a contour-based approach for concavity tree reconstruction is employed to smooth the extracted envelope. For retrieval, the DSW method is employed to match a hand-sketched query shape to the identified envelopes in the database images. Experiments on a set of 110 logos demonstrate the feasibility of the approach.
References 1. Eakins, J.P.: Towards intelligent image retrieval. Pattern Recognition 35(1), 3–14 (2002) 2. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., Yanker, P.: Query by image and video content: The qbic system. IEEE Computer 28(9), 23–32 (1995) 3. Bach, J., Fuller, C., Gupta, A., Hampapur, A., Horowitz, B., Humphrey, R., Jain, R., Shu, C.: Virage image search engine: An open framework for image management. In: Storage and Retrieval for Image and Video Databases (SPIE), February 1996, pp. 76–87 (1996) 4. Pentland, A., Picard, R., Sclaroff, S.: Photobook: content-based manipulation of image databases. International Journal of Comput. Vision 18(3), 233–254 (1996) 5. Thorisson, K.: Simulated perceptual grouping: An application to human computer interaction. In: Proceedings of the 16th Annual Conference of Cognitive Science Society, Atlanta, GA, August 1994, pp. 876–881 (1994) 6. Biederman, I.: Recognition by components: A theory of human image understanding. Psychological Review 94(2), 115–147 (1987) 7. Eakins, J.P., Shields, K., Boardman, J.M.: Artisan: a shape retrieval system based on boundary family indexing. In: Storage and Retrieval for Still Image and Video Databases IV. Proceedings SPIE, vol. 2670, pp. 17–28 (1996) 8. Jiang, H., Ngo, C.W., Tan, H.K.: Gestalt-based feature similarity measure in trademark database. Pattern Recognition 39(5), 988–1001 (2006) 9. Fred, A.L.N., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6), 835–850 (2005) 10. Alajlan, N., Badawy, O.E., Kamel, M.S., Freeman, G.H.: Envelope detection of multi-object shapes. In: Kamel, M., Campilho, A. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 399–406. Springer, Heidelberg (2005) 11. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31(8), 983–1001 (1998)
12. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37(1), 1–19 (2004) 13. Alajlan, N., Elrube, I., Kamel, M.S., Freeman, G.: Shape retrieval using trianglearea representation and dynamic space warping. Pattern Recognition (in press, 2007) 14. Rote, G.: Computing the minimum hausdorff distance between two point sets on a line under translation. Information Processing Letters 38(3), 123–127 (1991) 15. Costa, L.F., Cesar, R.M.: Shape Analysis and Classification: Theory and Practice. CRC Press, Inc., Boca Raton, USA (2000) 16. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264–323 (1999) 17. Badawy, O.E., Kamel, M.S.: Hierarchical representation of 2-D shapes using convex polygons: a contour-based approach. Pattern Recognition Letters 26(7), 865–877 (2005)
Matching Flexible Polygons to Fields of Corners Extracted from Images*
Siddharth Manay and David W. Paglieroni
Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, CA 94551, USA
Abstract. We propose a novel efficient method that finds partial and complete matches to models for families of polygons in fields of corners extracted from images. The polygon models assign specific values of acuteness to each corner in a fixed-length sequence along the boundary. The absolute and relative lengths of sides can be either constrained or left unconstrained by the model. Candidate matches are found by using the model as a guide in linking corners previously extracted from images. Geometrical similarity is computed by comparing corner acutenesses and side lengths for candidate polygons to the model. Photometric similarity is derived by comparing directions of sides in candidate polygons to pixel gradient directions in the image. The flexibility and efficiency of our method is demonstrated by searching for families of buildings in large overhead images. Keywords: Corner Extraction, Polygon Extraction, Overhead Imagery.
1 Introduction

Broad area image search involves detecting localized objects (vehicles, buildings, storage tanks, parking lots, etc.) in overhead images with broad area coverage. Many objects of interest can be at least partially modeled as polygons (e.g., roofs of buildings or vehicles). In this paper, we propose a class of polygon models invariant to position / orientation and, if desired, also invariant to scale or even some shape distortions. We then construct similarity measures based on geometric and photometric considerations for computing the degree of match between objects in images and the polygon model. Our general idea is to extract corners from images using a robust technique previously developed in [1-2]. We then develop a highly efficient matching strategy that grows all possible polygon trees of corners from the field of corners detected in the image, and seeks tree branches corresponding to successful complete or partial matches to the polygon model. The matches are validated and redundancies are eliminated.
* Work performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48 (UCRL-CONF-226354).
Polygon matching methods based on polygon trees of corners provide several advantages: (1) They are highly efficient. (2) They are robust to photometric variation in the image, such as lighting changes and object contrast. (3) They are invariant to polygon position / orientation, and, if desired, they can also be made invariant to polygon scale or relative side lengths. (4) They are relatively insensitive to broken edges because they are not based on edge detection. (5) They can handle missing or occluded corners. Section 2 discusses our proposed method in the context of prior work on polygon detection. Section 3 reviews a previously developed corner detection algorithm that matches models of corners to gradient directions in the image. A very simple yet flexible way to model polygons is developed in Section 4. Geometric and photometric measures of similarity between polygon matches and polygon models are developed in Section 5. An efficient algorithm for growing polygon trees from fields of corners previously extracted from images is developed in Section 6. A method for eliminating redundant polygon matches is developed in Section 7. Then in Section 8, our approach is applied to the problem of finding buildings of certain general shapes in large overhead images.
2 Prior Work

Due to the polygonal nature of many objects constructed by humans that appear in images, polygon and rectangle detection are important topics of research. Although a complete literature survey is beyond the scope of this paper, selected key research is briefly reviewed below.
The search for polygonal objects in images is often performed in two stages: (1) search for primitive features, such as edges, lines, or corners, and (2) assembling collections of these features into polygons. Under specific assumptions about the desired shape (e.g. rectangles), features can be grouped by a variety of algorithms and constraints [3-9]. More generally, features can be linked into chains, trees, or graphs, which are then searched for matches to polygons [10-13]. Although the preferred features are edges or lines extracted with edge detectors or Hough methods, corner detectors were employed directly in [10]. Higher-dimensional Hough methods transform edges directly into parameterized rectangles and regular polygons [14-15]. Other methods that estimate parameters directly include genetic algorithms [16] and algorithms that estimate posterior probability in the parameter space [17]. Due to their reliance on parameterization, these methods do not extend well to general polygons. Alternatively, over-segmented regions can be grouped into polygonal regions [18-19].
Rather than using extracted features, the image can be searched directly for template matches to the desired shape [20-21]. Templates allow little flexibility in the scale or shape of the object. Invariant models are employed to provide such flexibility. Polygon models composed of ordered lists of attributes (such as interior angles) can be applied to prune or validate feature graphs [13], fit to features directly using least-squares [22], or fit to images directly using iterative optimization [23-24]. Other instances of invariant models include polygons with side normal vectors fit to image gradients via eigenanalysis [25], contour models fit to an over-segmentation [19], and triangulated models fit to image data with an inductive, brute force algorithm
[26]. These models allow affine shape distortions (e.g., in [26], the shapes are triangles). Like the attribute list models cited above, the polygonal model proposed in Section 4 is a special simplified case of invariant shape representations proposed by other authors (e.g., [27-29]), and shares many of their desirable properties, such as invariance to rotation, translation, scale, and other shape distortions. However, we do not use optimization techniques to search for polygons, as these can be slow, require good initialization, and frequently find local minima for the objective functions. Instead, we adopt a philosophy published in much of the rectangle and polygon literature in which polygons are assembled from lower level components detected in images [3,6]. However, we choose lower level components based on corners rather than edges, and are thus less subject to problems associated with missing or broken edges. This approach requires a robust method for corner detection, such as the method based on gradient direction matching reviewed in the next Section. The ability to detect and analyze incomplete or partial polygons provides additional robustness to missing corners, which frequently occur due to occlusion, poor image quality, etc. Other corner detectors, such as those in [30-31], compute only the corner location, and not the orientation and acuteness attributes required to efficiently construct polygon trees. In further contrast to methods in the literature, the proposed method leverages constraints in corner models and flexible polygon models to efficiently link corners into well-pruned polygon trees, and proposes a similarity measure for polygons, based on geometric and photometric factors, that captures the confidence in the detected polygon.
3 Modeling and Detecting Corners

As illustrated in Fig. 1, models of corners have several attributes: corner [column, row] pixel location [c, r], corner acuteness α ∈ (0, π), corner orientation θ, corner leg length L, and corner leg thickness T.
Fig. 1. Model of a corner
We use gradient direction matching (GDM) to detect corners in images [1,2]. Conceptually, a corner model is matched at all pixels and across a number of orientations to a field of pixel gradient directions using the GDM measure
    S(Δc, Δr) = Σ_{(c,r)∈C} w(c,r) · cos²[θ(c+Δc, r+Δr) − β(c,r)]    (1)
In equation (1), θ (c,r) is the image gradient direction at pixel (c,r), C is the set of all pixels on legs in the corner model (excluding the vertex itself, where normals are ambiguous), β(c,r) is the angle normal to the leg containing pixel (c,r), and w(c,r) is the weight assigned to leg pixel (c,r). S∈ [0,1] (1 for perfect matches) because the weights w(c,r) are non-negative and sum to one. w(c,r) decreases as the distance from the center of pixel (c,r) to the line segment representing its ideal leg increases. An efficient FFT-based algorithm for computing the best match across all corner orientations at each pixel is given in [1]. As shown in Fig. 2, corners map to unambiguous local maxima in the degree of match. Because GDM relies on a sum of gradient angle (and not gradient magnitude) terms, it is robust to noise; see [2] for an extended discussion.
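A minimal sketch of equation (1) is given below. It assumes that the model's leg pixels, leg normals and weights are already available (the paper derives the weights from the distance of each pixel to its ideal leg), and it does not reproduce the efficient FFT-based search over all orientations described in [1]; the toy arrays are invented to check the perfect-match case.

```python
import numpy as np

def gdm_score(grad_dir, leg_pixels, leg_normals, weights, dc, dr):
    """Gradient-direction-match score of equation (1) at offset (dc, dr):
    sum of w * cos^2(image gradient direction - leg normal) over the
    corner-model leg pixels.  grad_dir is an image of gradient directions
    (radians); leg_pixels is an (N, 2) array of (col, row) model offsets;
    leg_normals and weights are length-N arrays (weights sum to one)."""
    cols = leg_pixels[:, 0] + dc
    rows = leg_pixels[:, 1] + dr
    theta = grad_dir[rows, cols]
    return float(np.sum(weights * np.cos(theta - leg_normals) ** 2))

# Toy check: if the image gradients exactly match the leg normals, S = 1.
legs = np.array([[1, 0], [2, 0], [0, 1], [0, 2]])      # two short legs
normals = np.array([np.pi / 2, np.pi / 2, 0.0, 0.0])   # normal to each leg pixel
w = np.full(4, 0.25)
grad = np.zeros((5, 5))
grad[0, 1:3] = np.pi / 2                               # horizontal-leg pixels
grad[1:3, 0] = 0.0                                     # vertical-leg pixels
print(gdm_score(grad, legs, normals, w, 0, 0))         # -> 1.0
```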
Fig. 2. Corner detection example: (a) Optimal 90º corner match vs. location in an image. (b) Unambiguous local maxima.
4 Polygon Models

A polygon with n corners can be modeled as a sequence of parameter vectors for corners and side lengths in order of clockwise traversal along the boundary:

    P* = { [α*(k), L*(k)],  k = 0, ..., n−1 }    (2)

α*(k) ∈ (0, 2π) (exclusive of π) is the acuteness of polygon corner k. L*(k) ≥ 0 is a length parameter for side k, which connects corner k to corner ((k+1) mod n). α*(k) < π for convex corners and α*(k) > π for concave corners. The relative orientation from corner k to corner ((k+1) mod n) can be computed directly from the corner acutenesses as

    Δθ*(k) = θ*((k+1) mod n) − θ*(k) = π − [α*((k+1) mod n) + α*(k)] / 2    (3)

where θ*(k) ∈ [0, 2π) is the absolute orientation of corner k. Note that θ*(k) is not part of the model. If absolute side lengths are to be specified (the goal is to find polygons of specific shape and size at any position or rotation), then for k = 0, ..., n−1, specify L*(k) > 0 as the prescribed length for side k. If only relative side lengths are to be specified (the goal is to find polygons of specific shape but any size), then set L*(0) = 0 (undefined)
and for k = 1…n−1, specify L*(k) > 0 as the prescribed ratio of side k length to side 0 length. If side lengths are to be unconstrained, then for k = 0…n−1, set L*(k) = 0 (undefined). Note that the model in equation (2) is always invariant to polygon position and rotation, whether or not the side lengths are constrained.
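The following sketch is one possible container for the model of equations (2) and (3); it is an illustrative data structure, not the authors' code, and the example instantiates the L-shaped relative-side-length parameters listed later in Table 1 of Section 8.

```python
from dataclasses import dataclass, field
from typing import List
import math

@dataclass
class PolygonModel:
    """Rotation/translation-invariant polygon model of equation (2):
    clockwise corner acutenesses alpha[k] and side length parameters
    length[k] (absolute lengths, ratios to side 0, or 0 for unconstrained)."""
    alpha: List[float]                        # corner acuteness in (0, 2*pi), not pi
    length: List[float] = field(default_factory=list)

    def relative_orientation(self, k):
        """Equation (3): expected turn from corner k to corner (k+1) mod n."""
        n = len(self.alpha)
        return math.pi - (self.alpha[(k + 1) % n] + self.alpha[k]) / 2.0

# Six-sided L-shaped model (one concave corner, five right angles).
l_shape = PolygonModel(alpha=[3 * math.pi / 2] + [math.pi / 2] * 5,
                       length=[4, 2, 6, 3, 2, 1])
print(l_shape.relative_orientation(1))   # pi/2: the turn between two 90-degree corners
```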
5 Polygon Similarity Measures

Let P be a candidate match to polygon model P* (equation (2)) found by linking corners previously extracted from an image (Section 3). A method for using the model as a guide in linking corners previously extracted from images will be given in Section 6. In this Section, measures of similarity between P and P* based on geometric and photometric considerations are developed. P can be either a complete or partial match to P*. Although the similarity measures defined below allow complete and partial matches to be ranked together, they can harshly penalize partial matches relative to complete matches. As an alternative, one may therefore instead choose to rank matches with n′ ≤ n corners only against other matches with exactly n′ corners. The similarity measures defined below apply independent of n′.
First, geometric measures of similarity based on corner acuteness and relative or absolute side length are proposed. Acuteness similarity can be defined as

    S_α(P;P*) = 1 − (1 / 2πn) Σ_{k=0}^{n−1} | α(k) − α*(k) |  ∈ [0,1]    (4)

where α(k) is computed from the locations of corners k, ((k−1) mod n), and ((k+1) mod n) in P. If corner k is missing from P, set |α(k) − α*(k)| = 2π in equation (4). Side length similarity can be defined as

    S_L(P;P*) = 1 − [ Σ_{k=0}^{n−1} |L(k) − L*(k)| ] / [ Σ_{k=0}^{n−1} L*(k) ]  ∈ [0,1]    (5)

when the side lengths are constrained (set S_L(P;P*) = 1 when the side lengths are unconstrained). When side lengths are relative, L(k) is the ratio of side k to side 0. If side k is missing from P, use L(k) = 0 in equation (5).
A photometric measure of similarity between polygons based on gradient direction matching (GDM) is proposed. This gradient direction similarity measure is similar to the measure used for detecting corners in equation (1), except it is tailored to polygons:

    S_G(P;P*) = (1 / L) Σ_{(c,r)∈P} cos²[θ(c,r) − β(c,r)]  ∈ [0,1]    (6)
P can either be a complete or incomplete polygon match. However, the sum in equation (6) excludes polygon vertices, since normals are ambiguous there. Following equation (1), θ(c,r) is the image gradient direction at pixel (c,r), β(c,r) is the angle normal to the side of P containing pixel (c,r), and L is the perimeter of the polygon match extrapolated to completeness.
The overall similarity between P and P* combines the geometric and photometric similarities in equations (4)-(6):

    S(P;P*) = [w_α · S_α(P;P*)] * [w_L · S_L(P;P*)] * [w_G · S_G(P;P*)]    (7)

where the weights are non-negative and specified such that S(P;P*) ∈ [0,1]. "*" is some operator, such as "scalar sum", "scalar product", or "logical AND". For "logical AND" the weights are all 1 and the S values are set to 1 if greater than a threshold (0 otherwise). For match validation, we use "logical AND". For match comparison, we use w_α = 0, w_L = 0, and w_G = 1, and "scalar sum" (in which case S(P;P*) = S_G(P;P*)).
6 Polygon Trees A polygon tree is a tree whose nodes represent corners in polygon matches P to a polygon model P*, and whose branches represent connections between successive corners traversed in order along the boundary of P. As illustrated in Fig.3, each polygon tree has a single root node representing the corner from which traversal begins. A branch is a path from root to leaf. For polygon models with n corners, complete branches traverse n nodes, and partial branches traverse at least 2 nodes but less than n nodes. Valid branches can be extrapolated to valid complete polygon matches. Whether complete or partial, branches are always valid when side lengths are constrained by the polygon model in either a relative or absolute sense. Identifying invalid partial branches for a model with unconstrained side lengths is of lesser importance but a topic of future research. Valid partial branches correspond to partial matches, and valid complete branches correspond to complete matches. In Fig.3, branches ABCD and AJKL are complete and successful because corners D and L can be connected back to corner A in accordance with the polygon model constraints. Branch ABCE is complete but unsuccessful because E cannot be connected back to A. Branch AGH is partial and successful because H can be connected back to A through a missing corner H′. Branches AF and AGI are partial and unsuccessful because F and I cannot be connected back to A. In summary, ABCD and AJKL represent sequences of corners for a complete match, and AGH represents a sequence of corners for a partial match. A similarity to the model can be computed for each of these three matches using equation (7). A partial branch is said to be finished if it is as complete as possible. Thus, for finished partial branches, given matches to corner 0 and corner 1 of the polygon model, there is no match to corner n−1. All polygon matches of interest correspond either to valid complete branches or valid finished partial branches.
Matching Flexible Polygons to Fields of Corners Extracted from Images
453
Fig. 3. Polygon tree for a quadrllateral
6.1 Tests for Candidate Corners Let Ω be a field of corners with various positions, orientations, and acutenesses extracted from an image. Suppose v = [c , r , α , θ ] ∈ Ω was previously identified k
k
k
k
k
as a candidate for corner k in a match P to some polygon model P*. Consider another corner v = [c, r, α, θ ] ∈ Ω . We wish to devise tests based on corner acuteness, orientation, and location for determining whether or not v constitutes a candidate for corner k+1 ∈ [1,n) in a polygon match to P*. First, v cannot be a candidate unless it satisfies the acuteness test
α ∈ [ α*(k+1) - εα , α*(k+1) + εα ]
(8)
θ − θk ∈ [ Δθ*(k) − εθ , Δθ*(k) + εθ ]
(9)
and the orientation test
where Δ *(k) is given by equation (3). θ
If P* has absolute side length constraints, or P* has relative side length constraints and k > 0, then the length of side k in the polygon match is expected to be for absolute constraints ⎧L*(k) d = ⎨d · L*(k) for relative constraints k ⎩ 0
(10)
The angle of the ray emanating from v along side k is expected to be k
+Δ −Δ φk = θk = θk + αk / 2 or θk = θk − αk / 2
(11) −
+
Use φ = θk if [c , r ] is closer to the line from [c , r ] in direction θk than from k k−1 k−1 k k −
+
the line in direction θk . Otherwise, use φ = θk . v is expected to be at location k [c
,r
k+1
] = [c + d cosφ , r + d sinφ ]
k+1
k
k
k
k
k
k
(12)
In this case, v is a candidate for corner k+1 if it satisfies the location test || [c − c
,r−r
k+1
] || ≤ ε
k+1
R
(13)
454
S. Manay and D.W. Paglieroni
If P* has no side length constraints, or P* has relative side length constraints and k = 0, then the distance D from [c, r] to the line Lk emanating from [c , r ] in direction φ must k k k be small: D ( [c, r], L ) ≤ ε k R
(14)
For purposes of this paper, the angular and radial tolerances ε , ε , and ε are treated θ
α
R
as user-specified. For k = n−1, it is necessary to determine whether or not v (the root corner in the 0
polygon tree drawn from Ω) satisfies the conditions outlined in equations (8)-(14). A complete polygon match has been found only if it does. 6.2 Growing Polygon Trees Let us now consider the problem of extracting all complete and partial polygon matches to a polygon model P* from a field Ω of corners extracted from an image. Every corner in Ω with acuteness equal to α*(0) is an eligible polygon tree root. An attempt is made to grow a polygon tree from each eligible root. Each grown tree may contain multiple complete and partial polygon matches. For each eligible root, a polygon tree is grown by finding all corners that connect to the root corner using the tests given in Section 6.1. For each leaf, the process is repeated. The tree is grown in this way until it has n levels or cannot be grown further. This process finds all complete matches. However, if all model corners do not have the same acuteness, the only way to guarantee that all finished partial matches will be found is to grow polygon trees as matches to every circular shift of the sequence of model corners. In the end, each valid complete branch and each valid finished partial branch is catalogued for each polygon tree. If each corner in a polygon tree can be linked to on average k candidate corners, n each n-level polygon tree will have on average k branches. Values of n are typically between 4 (for a rectangle) and 10. The field of corners Ω contains n corners,
Ω
where n
Ω
is typically between 1000-8000. A very naïve attempt to grow polygon
trees by assuming each corner may link to any other corner would result in n branches for each root corner and an O(n
Ω
Ω
n
n+1 ) extraction algorithm. However, by
leveraging corner models that include acuteness and orientation, polygon models, and the tests in Section 6.1 to restrict the search, k is reduced to a value much smaller than n n . The result is an O(n k ) extraction algorithm.
Ω
7 Match Validation and Redundancy Polygon matches for which acuteness similarity S (P;P*), side length similarity α
S (P;P*), or gradient direction similarity S (P;P*) are smaller than their respective L
G
Matching Flexible Polygons to Fields of Corners Extracted from Images
455
user-specified similarity thresholds can be discarded. Additional matches can be discarded by applying user-specified size-dependent constraints on area, perimeter, minimum numbers of corners in partial matches, or ranges of admissible side lengths. The surviving matches are ranked. However, some may be redundant in the sense that they overlap or are too close together. The idea behind match disambiguation is that in a set of overlapping polygon matches, only the match P of greatest similarity S(P;P*) = S (P;P*) to P* survives. G
The distance between two polygons can be estimated using the symmetric (twosided) Hausdorff distance D between two polygon matches P and P with n and n 1
2
1
2
corners respectively: D(P ,P ) = min [ D 1
D where D
ASYM
ASYM
2
(P ,P ) = 1
2
max
(P ,P ), D
ASYM
(
i = 0...n −1 1
1
min
2
ASYM
(P ,P ) ] 2
(15)
1
|| [c (i) − c (j), r (i) − r (j)] ||
j = 0...n −1 2
1
2
1
2
) (16)
is the asymmetric (one-sided) Hausdorff distance [32]. Two polygon
matches are said to be redundant if the symmetric Hausdorff distance between their corners is less than some threshold. For efficiency, polygons are spatially indexed, and for each polygon, the distance is computed only to polygons that likely overlap.
8 Experimental Results Results of a polygon search experiment involving a large overhead satellite image are presented in this Section. Fig.4 shows an 4096 x 4096 pixel section of a satellite image of Stockton, CA (courtesy of CASIL [33]). We searched for L-shaped buildings. Ranked results for three searches are shown: one for no side length constraints in Fig.5, one for relative side length constraints in Fig.6, and one for absolute side length constraints in Fig.7. Six sided polygon models for these three cases are given in Table 1. Table 1. Specifications for L-Shaped Building Models Model No side length Relative side length Absolute side length
Parameters α* = {3π/2, π/2, π/2, π/2, π/2, π/2} α* = {3π/2, π/2, π/2, π/2, π/2, π/2}, L* = {60, 30, 90, 45, 30, 15} α* = {3π/2, π/2, π/2, π/2, π/2, π/2}, L* = {4, 2, 6, 3, 2, 1}
As shown in Fig.5, polygon trees constrained by the model with no side length constraints detect L-shaped polygons of several sizes and proportions. For this experiment, partial polygons with less than four corners were discarded, while polygons with five corners are completed by extrapolating to compute the sixth
456
S. Manay and D.W. Paglieroni
vertex. The highest-ranked thumbnails are subjectively good fits to either discrete L-shaped buildings or sets of buildings in an L-shaped arrangement. There were some missed detections, and these are primarily attributed to inaccurate corners, missing corners (due to poor contrast or occlusion), or poor pixel gradient correlations (due to poor contrast, noise, or scene clutter). In Fig. 6-7, absolute and relative side length constraints were imposed by the model to find L-shaped buildings with specific dimensions (reflected by the first thumbnail in each figure). In these cases, successful partial matches with as few as four corners were allowed.
Fig. 4. 4096 × satellite image
Fig. 5. Thumbnails of top 10 matches to L-shaped mdel with no side length constraints
Matching Flexible Polygons to Fields of Corners Extracted from Images
457
Fig. 6. Thumbnails of top 10 matches to L-shaped mdel with relative side length constraints
Fig. 7. Thumbnails of top 5 matches to L-shaped mdel with absolute side length constraints
Table 2 contains execution times for the above experiments. Note that the corners are pre-computed, so the corner extraction time is reported separately. For processing, the image is divided into 1024 x 1024 pixel image blocks. Reported times are the average time for the image blocks. All experiments were performed on a single core of an Intel Xeon 2.66GHz processor. The table demonstrates the relationship between the number of constraints in the model and the execution time of the algorithm. Table 2. Algorithm Execution Time Algorithm Corner Extraction Polygon Extraction w/ no side length model Polygon Extraction w/ Relative side length model Polygon Extraction w/ Absolute side length model
Avg. Time per Image Block 47.4s 8.8s 1.9s 0.7s
9 Conclusion A novel efficient method that matches models for families of polygons to fields of corners extracted from images has been developed. The flexibility of our method was demonstrated by searching for buildings of prescribed shape in large overhead images, subject to various degrees of constraint on lengths of the polygon sides. The method requires all possible polygon trees to be grown from the field of corners, subject to model constraints. Each tree can contain several valid complete and partial matches. However, the similarity measures for polygons in equations (4)(6) harshly penalize partial matches relative to complete matches, thus making it
458
S. Manay and D.W. Paglieroni
problematic to rank partial and complete matches together. We thus plan to extend our work by using gradient direction matches for corners and sides as random state variables in Hidden Markov Model representations for polygon matches so that complete and partial matches can be ranked together in a more even-handed statistically rigorous way.
References 1. Paglieroni, D., Eppler, W., Poland, D.: Phase sensitive cuing for 3D objects in overhead images. In: SPIE Defense and Security Symposium: Signal Processing, Sensor Fusion and Target Recognition XIV, Proc. SPIE, March 2005, vol. 5809, pp. 28–30 (2005) 2. Paglieroni, D., Manay, S.: Signal-to-noise behavior of corners modeled as image gradient direction fields. In: Proc. SPIE. Defense and Security Symposium (April 2007) 3. Kim, H., Choi, W., Kim, H.: A hierarchical approach to extracting polygons based on perceptual grouping. In: Sys.Man, and Cybernetics. IEEE Int’l Conf. on., October 1994, vol. 3, pp. 2402–2407. IEEE Computer Society Press, Los Alamitos (1994) 4. Lin, C., Nevatia, R.: Building detection and description from a single intensity image. Comp. Vis and Image Understanding 72(2), 101–121 (1998) 5. Lagunovsky, D., Ablameyko, S.: Straight-line-based primitive extraction in gray-scale object recognition. Pat. Rec. Let. 20(10), 1005–1014 (1999) 6. Tao, W., Tian, J., Liu, J.: A new approach to extract rectangular buildings from aerial urban images. In: Sig. Proc., Int’l Conf. on., vol. 1, pp. 143–146 (2002) 7. Yu, Z., Bajaj, C.: Detecting circular and rectangular particles based on geometric feature detection in electron micrographs. J. Structural Biology 145, 168–180 (2004) 8. Liu, D., He, L., Carin, L.: Airport detection in large aerial optical imagery. In: Acoustics, Speech, and Signal Proc. IEEE Int’l Conf. on, May 2004, vol. 5, IEEE Computer Society Press, Los Alamitos (2004) 9. Liu, Y., Ikenaga, T., Goto, S.: Geometrical physical and text/symbol analysis based approach of traffic sign detection system. In: Intell. Vehicles Symp., September 2006, pp. 238–243. IEEE Computer Society Press, Los Alamitos (2006) 10. Jaynes, C.O., Stolle, F., Collins, R.T.: Task driving perceptual organization for extraction of rooftop polygons. In: Appl. of Comp. Vis., Proc. IEEE Workshop on., December 1994, pp. 152–159. IEEE Computer Society Press, Los Alamitos (1994) 11. Gates, J.W., Haseyama, M., Kitajima, H.: Real-time polygon extraction from complex images. In: Circuits and Systems. Proc. IEEE Int’l Symp. on., May 2000, vol. 5, pp. 309– 312. IEEE Computer Society Press, Los Alamitos (2000) 12. Ferreira, A., Fonseca, M.J., Jorge, J.A.: Polygon detection from a set of lines. In: Encontro Portugues de Computacao Grafica (October 2003) 13. Croitoru, A., Doytsher, Y.: Right-angle rooftop polygon extraction in regularized urban areas: Cutting the corners. Photogrammetric Record 19(108), 311–341 (2004) 14. Zhu, Y., Carragher, B., Mouche, F., Potter, C.: Automatic particle detection through efficient Hough transforms. Medical Imaging, IEEE Trans. on 22(9), 1053–1062 (2003) 15. Shaw, D., Barnes, N.: Regular polygon detection as an interest point operator for SLAM. In: Australasian Conference on Robotics and Automation (2004) 16. Lutton, E., Martinez, P.: A genetic algorithm for the detection of 2D geometric primitives in images. In: Pat. Rec., Int’l Conf. on., vol. 1, pp. 526–528 (1994) 17. Barnes, N., Loy, G., Shaw, D., Robles-Kelly, A.: Regular polygon detection. In: Comp. Vis. IEEE Int’l Conf. on., IEEE Computer Society Press, Los Alamitos (2005)
Matching Flexible Polygons to Fields of Corners Extracted from Images
459
18. Macieszczak, M., Ahmad, M.O.: Extraction of shape elements in low-level processing for image understanding systems. In: Elec. and Comp. Eng., Canadian Conf. on., September 1995, vol. 2, pp. 1184–1187 (1995) 19. Liu, L., Sclaroff, S.: Deformable shape detection and description via model-based region grouping. In: Comp. Vis and Pat. Rec., IEEE Conf. on., June 1999, vol. 2, IEEE Computer Society Press, Los Alamitos (1999) 20. Moon, H., Chellappa, R., Rosenfeld, A.: Optimal edge-based shape detection. In: Image Proc., IEEE Trans. on, November 2002, vol. 11(11), pp. 1209–1227 (2002) 21. Amit, Y., Geman, D., Fan, X.: A coarse-to-fine strategy for multi-class shape detection. Pat. Anal. and Mach. Intell., IEEE Trans on 28, 1606–1621 (2004) 22. Gidney, M.S., Diduch, C.P., Doraiswami, R.: Real-time post determination of polygons in a 2-D image. In: Elec. and Comp. Eng., Canadian Conf. on., vol. 2, pp. 913–915 (1995) 23. Miller, J.V., Breen, D.E., Wozny, M.J.: Extracting geometric models through constraint minimization. In: Visualization. In: IEEE Conf. on., October 1990, pp. 74–82. IEEE Computer Society Press, Los Alamitos (1990) 24. Mayer, S.: Constrained optimization of building contours from high-resolution orthoimages. In: Image Proc., IEEE Int’l Conf. on, October 2001, vol. 1, pp. 838–841. IEEE Computer Society Press, Los Alamitos (2001) 25. Wang, Z., Ben-Arie, J.: Detection and segmentation of generic shapes based on affine modeling of energy in Eigenspace. In: Image Proc, IEEE Trans on., November 2001, vol. 10(11) (2001) 26. Felzenszwalb, P.: Representation and detection of deformable shapes. Pat. Anal. and Mach. Intell., IEEE Trans on 27(s), 208–220 (2005) 27. Kanatani, K.: Group Theoretical Methods in Image Understanding. Springer, New York (1990) 28. Pikaz, A., Dinstein, I.: Matching of partially occluded planar curves. Pat. Rec. 28(2), 199– 209 (1995) 29. Sebastian, T., Klein, P., Kimia, B.: On aligning curves. Trans. Pat. Anal. and Mach. Intell. 25(1), 116–125 (2003) 30. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, Proc, pp. 147–151 (1988) 31. Smith, S., Brady, J.: SUSAN - A new approach to low level image processing. Int’l J. Comp. Vis. 23(1), 45–78 (1997) 32. Huttenlocher, D., Kladerman, G., Rucklidge, W.: Comparing images using the Hausdorff distance. Trans. Pat. Anal. and Mach. Intell. 15(9), 850–863 (1993) 33. CASIL: California Spatial Information Library (Last visited November 17, 2006), http://archive.casil.ucdavis.edu/casil/
Pattern Retrieval from a Cloud of Points Using Geometric Concepts L. Idoumghar and M. Melkemi Universit´e Haute Alsace, Laboratoire de Math´ematiques Informatique et Applications. 4 rue des fr`eres Lumi`ere, 68093, Mulhouse, France
[email protected],
[email protected] Abstract. In this article, we propose a generalization of the Euclidean α-shape, and through it we present an algorithm to retrieval shapes that are similar to a query shape from a finite set of points. Similarity means that we look for patterns identical, according to a given measure, to a query shape independently of translation, rotation and scaling transforms.
1
Introduction
In this article, we are interested in the problem of detecting patterns embedded in a cloud of points and similar to a query shape. In contrast with existing approaches [1], [3], [4], our method uses only geometric algorithms and treats cloud of points. According to a defined similarity measure, the algorithm detects accurately the overall patterns similar to the query shape. It is based on a mathematical formalism that we considered it as a generalization of the α-shape concept [2] and it is implemented using the Voronoi diagram [5]. In our approach, the query shape is defined by a union of discs arranged according to a fixed configuration. It depends only on three parameters, two points and a positive real number, they, respectively, allow to move and to dilate/shrink the query shape, some examples are shown in Figs. 1, 2 and 3. Given a query shape represented by a union of discs and a set of points, the presented algorithm computes the gaps embedded in the set of points and similar to the query shape. This article is organized as follows: first, we present geometrical concepts related to the concept we introduce and its implementation. In section 3 and 4 we present the main concepts we introduce; the query shape definition, the α-shape generalization and the algorithms.
2
Related Geometrical Concepts
Let P = {p1 , p2 , ..., pn } be a set of n distinct points of the plan. In this paper we assume that the points of S are in general-position. This means that no four points lie on a common circle. The general-position assumption simplifies the forthcoming definitions. Even if the points are not in general position, there exists many efficient algorithms to compute the Voronoi diagram of a set of points [5]. M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 460–468, 2007. c Springer-Verlag Berlin Heidelberg 2007
Pattern Retrieval from a Cloud of Points Using Geometric Concepts
2.1
461
Definition of the Voronoi Diagram
Let pq be the Euclidean distance between two points p and q. Usually, we define 2 the Voronoi polygon R(pi ) of pi as the set of points p ∈ R so that ppi ≤ ppj , for all pj ∈ P . The Voronoi diagram of P is the set V (P ) = {R(p1 ), · · · , R(pn )}. The dual of the Voronoi diagram, obtained by connecting the points pi to their neighbors in V (P ), is the Delaunay triangulation of P and noted D(P ). To emphasize the morphological aspect of the Voronoi diagram, we rewrite the def2 2 inition of R(pi ) as follows: let x ∈ R and p ∈ R be two points and r a positive real number. We note bp (x) an open disc centered at p and the point x is on its boundary (x ∈ ∂bp (x), where ∂S denotes the boundary of a region S). The definition of R(Spi ) can be generated using discs bp (pi ). The polygon R(pi ) cor2 responds to the set: R(pi ) = {p ∈ R | bp (pi ) ∩ P = ∅}. 2.2
The Euclidean α-Shape of a Point Set
Let bα be an open disc, of radius α. bα is said to be empty if and only if bα ∩P = ∅. The α-shape of P , we note Fα (P ), is the set of the edges [pi pj ] such that there exists an empty disc bα , where {pi , pj } ⊂ ∂bα . The α-shape is a sub-graph of the Delaunay triangulation. An edge s = [pi pj ] of D(P ) is an element of Fα (P ) if and only if αmin (s) ≤ α ≤ αmax (s), where αmin (s) and αmax (s) are two positive real numbers computed from the edges of V (P ). Indeed, the αmin (s) is the minimum distance of pi to the Voronoi edge separating pi from pj , the αmax (s) corresponds to the maximum distance (Edelsbrunner and M¨ uke, 1994). In the next section, we show how to represent a query-shape by a union of discs.
3 3.1
The Query-Shape Definition A Unitary Query-Shape
Consider a set of points P and denote b(c, r) an open disc of radius r and centered at c. Let S1m be a connected planar graph, its set of m vertices is denoted C1m . We associate to each vertex c ∈ C1m a weight c > 0. The graph S1m has the following properties: i) The vertices of each edge [cd] of S1m verify: d ∈ ∂b(c, c ) if d ≤ c (A) c ∈ ∂b(d, d ) else where c and d are the weights associated, respectively, with the vertices c and d. ii) There exists in C1m a particular vertex c0 so that its weight c0 is equal to 1.
Based on the previous properties, we define a unitary query-shape as follows:

Definition 1. The unitary query-shape M1m associated with a graph S1m is the set M1m = ∪_{c∈C1m} b(c, ρc).

The vertex c0 whose weight is ρc0 = 1, see item ii), is called the first center of the query-shape M1m. We choose a vertex c1 adjacent to c0 in S1m; it plays the role of the second center of M1m. The weight ρc0 associated with the first center c0 is called the size of the query-shape. In Definition 1, the term unitary is used because the radius ρc0 of M1m is equal to 1. An example of a unitary query-shape is illustrated by Fig. 1. It is a non-convex region covered by 22 discs verifying relationship (A) of item i); the dotted lines represent the graph S1m, the centers of the discs are the vertices of S1m, and their radii are the weights associated with the vertices. In this figure, many weights (radii of the discs) are equal to 1, specifically those associated with the vertices 1, 10, 14, 20 and 22. Thus, several pairs of vertices can be considered as first and second centers (c0, c1) of the displayed query-shape; these pairs are (1, 2), (10, 9), (14, 13), (20, 19) and (22, 21). The choice of one of them to play the role of the centers of the query-shape is fixed by the user according to the intended goal.
Fig. 1. A unitary query shape (m = 22). If the center 1 is fixed as the first center of this query shape, the point 2 is its second center.
3.2
A Query-Shape of Size r
A query-shape of size r is denoted Mrm, where r is a positive real number. When r is less than 1, the query-shape Mrm shrinks the unitary one M1m; otherwise it corresponds to a dilation of M1m. Mrm is associated with a graph Srm, which is computed by
using the graph S1m. Each vertex T(c) of Srm is obtained from a vertex c of the graph S1m. Formally, we define the graph Srm as follows:

(i) Consider again the first and second centers c0 and c1 of the unitary query-shape M1m (Sect. 3.1, after Definition 1). Let T(c0) and T(c1) be two points such that ‖T(c0)T(c1)‖ = r‖c0c1‖. T(c0) and T(c1) are the first and second vertices of Srm.

(ii) The remaining vertices of Srm are defined by the following iterative process. Let c be a vertex of S1m and suppose that T(c) is a known vertex of Srm (initially, c = c0). Then, for each vertex d connected to c ([cd] is an edge of S1m), we compute the vertex T(d) of Srm using the following properties: T(d) is such that the angle between the vectors [cd] and [T(c)T(d)] is equal to the angle between the vectors [c0c1] and [T(c0)T(c1)], and
   ‖T(c)T(d)‖ = rρc if ρd ≤ ρc, and ‖T(c)T(d)‖ = rρd otherwise.
The radius of the disc centered at T(d) is ρT(d) = rρd.

Definition 2. The query-shape Mrm of size r, defined from the unitary query-shape M1m and associated with the graph Srm, is Mrm = ∪_{d∈Crm} b(d, ρd), where Crm is the set of the vertices of Srm and ρd is the radius of the disc centered at d. The first center and the second center of Mrm are, respectively, the vertices T(c0) and T(c1), where c0 and c1 are, respectively, the first and the second centers of M1m.

Fig. 2(a) illustrates a unitary query-shape, and Fig. 2(b) shows a shape of size r = 4 which dilates the unitary query-shape of Fig. 2(a). Fig. 2(c) displays a shape of size r = 0.4, which shrinks the unitary query-shape of Fig. 2(a). Finally, Fig. 2(d) illustrates a rotated shape. The query-shape representation is a generalization of the disc concept: a disc of radius r can be seen as a query-shape Mr1 of size r consisting of one disc. A query-shape describes the shape of a connected region; it has a size and two centers. Varying the centers moves the query-shape; increasing or decreasing its size dilates or shrinks it. The different parameters needed to compute Mrm are:
1. Constants: shape parameters. These parameters define the shape of the pattern M1m. They are fixed real numbers: the coordinates of the centers c and the radii ρc of the discs composing M1m.
2. Variables: dilating, shrinking and moving parameters. These parameters are the size r and the first and second centers of the shape Mrm.
Knowing the shape parameters and the first two centers of Mrm is enough to determine the rest of the centers of its discs; see the iterative process used to define Srm.
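A minimal sketch of this iterative construction, under our own naming conventions (the vertex weights written ρ above appear as `weights` below); it assumes the unitary graph is given as vertex coordinates, weights and an edge list, and traverses it breadth-first from the first center.

```python
import numpy as np
from collections import deque

def build_srm(vertices, weights, edges, c0, c1, T_c0, T_c1, r):
    """Vertex positions and disc radii of S_r^m, built from the unitary graph S_1^m.

    vertices : dict vertex -> np.array([x, y])   (vertices of S_1^m)
    weights  : dict vertex -> positive float     (disc radii of M_1^m)
    edges    : list of vertex pairs              (edges of S_1^m)
    c0, c1   : first and second centers of M_1^m (c1 adjacent to c0)
    T_c0, T_c1 : chosen first and second centers of M_r^m
    r        : size of the query shape
    """
    # rotation taking the direction of c0c1 onto the direction of T(c0)T(c1)
    u = vertices[c1] - vertices[c0]
    w = np.asarray(T_c1, float) - np.asarray(T_c0, float)
    theta = np.arctan2(w[1], w[0]) - np.arctan2(u[1], u[0])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    adjacency = {v: [] for v in vertices}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    T = {c0: np.asarray(T_c0, float)}
    queue = deque([c0])
    while queue:
        c = queue.popleft()
        for d in adjacency[c]:
            if d in T:
                continue
            direction = vertices[d] - vertices[c]
            direction = R @ (direction / np.linalg.norm(direction))
            # ||T(c)T(d)|| = r*rho_c if rho_d <= rho_c, else r*rho_d
            length = r * (weights[c] if weights[d] <= weights[c] else weights[d])
            T[d] = T[c] + length * direction
            queue.append(d)
    radii = {v: r * weights[v] for v in T}       # rho_{T(d)} = r * rho_d
    return T, radii
```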
Fig. 2. (a) A unitary query-shape composed by 22 discs. (b) Dilation of the query shape presented in (a). (c) A shape obtained by shrinking the query shape (a). (d) Translation and rotation of the query shape presented in (a).
4
Retrieval Patterns from a Finite Set of Points
To detect shapes similar to a query-shape (a known region), we introduce a generalization of the Euclidean α-shape.

4.1 (α, ε)-Shape Based on a Query-Shape
The goal is to retrieve, from a set of points P, the patterns similar to a query shape, independently of translation, rotation and scale-change transformations. The proposed algorithm is based on a generalization of the Euclidean α-shape, which we
Fig. 3. Examples of four query shapes
Fig. 4. A shape and its associated hull
call the (α, ε)-shape based on a query shape. Before defining it, we need to introduce the following definitions. Consider a pattern Mrm associated with the graph Srm; we say that Mrm is empty if and only if P ∩ Mrm = ∅. We call the hull σ(Mrm) of an empty Mrm the open polygon with vertices in P that contains the graph Srm and satisfies σ(Mrm) ∩ P = ∅, see Fig. 4. Let K denote the vertices of σ(Mrm).

Definition 3. For two fixed positive real numbers α and ε, the (α, ε)-shape of P based on M1m is the set of the hulls σ(Mαm) associated with the empty Mαm verifying:
i) at least one point of P is on the boundary of Mαm;
ii) max_{v∈K} d_e(v, ∂Mαm) ≤ ε.

When the query shape M1m is a disc and ε is nil, the (α, ε)-shape of P based on M1m becomes the Euclidean α-shape of P. Properties i) and ii) of Definition 3 act as the measure of similarity between the query shape M1m and the empty patterns Mαm. The parameter α fixes the level of shrinking or dilation and
Fig. 5. (a) illustrates four patterns which are input of our algorithm. After the sampling step, we obtain a cloud of points given in (b).
Algorithm: Retrieval of patterns

Input: the set of points P, the query shape M1m = ∪_{k=0}^{m−1} b(ck, ρck), the parameters ε and α, and the parameter t which determines the directions in which the empty patterns Mαm will be constructed. Specifically, the interval [0, 2π) is divided into t intervals [ℓa0, (ℓ+1)a0), with t·a0 = 2π; for each ℓ = 0, ..., t−1, every angle in the range [ℓa0, (ℓ+1)a0) is represented by the angle ℓa0.

Output: the empty patterns Mαm = ∪_{k=0}^{m−1} b(T(ck), αρck) detected in the directions ℓa0, ℓ = 0, ..., t−1.

1. Compute the Voronoi diagram of P.
2. Compute a first empty disc of Mαm. For each Delaunay edge [pq] such that there exists an empty disc bα of radius α with ∂bα ∩ P = {p, q}: there are at most two such discs bα, and their centers lie on the Voronoi edge dual to [pq]. These two empty discs are the first discs of the two possible empty patterns.
3. For each computed empty disc bα whose center is denoted d0 = T(c0), and for each number ℓ ∈ {0, 1, ..., t−1}:
   3.1. Compute a second empty disc of Mαm. Compute the point d1 = T(c1) on the straight line of slope tan(ℓa0) passing through d0 and verifying ‖d0d1‖ = αρc0 = α if ρc0 > ρc1, else ‖d0d1‖ = αρc1 (see Sect. 3.2, item (ii)).
   3.2. Verify that the second disc centered at d1 is empty. Compute the point s ∈ P such that d1 lies in the Voronoi region R(s). If s is not inside the disc b(d1, ρc1·α), then the second disc b(d1, ρc1·α) is empty (see Theorem 1); continue the construction of Mαm (next step 4). Otherwise, give up the construction of Mαm in the direction ℓa0.
4. Compute the remaining empty discs of the pattern Mαm. For k ∈ {2, ..., m−1}:
   4.1. Compute the center T(ck) of the k-th disc of Mαm, i.e., the point T(ck) such that the angle between the vectors [ck−1ck] and [T(ck−1)T(ck)] is equal to the angle between the vectors [c0c1] and [d0d1].
   4.2. Let s be the point of P such that the computed T(ck) lies in the Voronoi region R(s). If s is not inside the disc b(T(ck), ρck·α), then this disc is empty; continue the construction of Mαm. Otherwise, give up the construction of Mαm in the direction ℓa0.

the parameter ε measures the similarity between the shape of an empty Mαm and the query-shape M1m. Specifically, it is decided that an empty pattern Mαm is similar to the query shape when the largest distance from the vertices of the hull σ(Mαm) to the boundary of Mαm is less than the tolerance ε.
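Steps 3.2 and 4.2 only have to decide whether a candidate disc contains a sample of P; as the algorithm notes (and as Theorem 1 below formalises), it suffices to inspect the sample whose Voronoi region contains the disc centre, i.e., the nearest sample. A minimal sketch, with a KD-tree standing in for the Voronoi diagram (names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def disc_is_empty(tree, center, radius):
    """True iff the open disc b(center, radius) contains no sample point.

    The nearest sample is the site whose Voronoi region contains `center`,
    so a single nearest-neighbour query suffices (cf. Theorem 1)."""
    dist, _ = tree.query(center)
    return dist >= radius

# tree = cKDTree(P) is built once; every emptiness test is then a single query.
```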
Fig. 6. This figure shows the results obtained by using sequentially the four query shapes given in Fig. 3. The used parameters are αw = 8.2, αtwo = 34, αdog = 10, αfish = 31.5, and ε = 0.6.
Fig. 7. A set of points and the detected patterns. (a) A set of points; (b) retrieval of the patterns similar to the query shapes given in Figs. 3(a), 3(b), 3(c) and 3(d) from the set illustrated in (a); the used parameters are αtwo = 34, αdog = 9.8, αfish = 31.5, and ε = 0.6.
4.2
An Algorithm to Retrieve Patterns from a Cloud of Points
Knowing a query shape M1m, the presented algorithm extracts from a set of points P the empty patterns Mαm that can be embedded in its gaps and that verify the similarity measure defined in items i) and ii) of Definition 3. This algorithm computes the (α, ε)-shape of P using the Voronoi diagram and the emptiness property of discs. The following observation shows how we verify that a disc contains no point of P.
Theorem 1. Let b(c, r) be a disc centered at c and of radius r, and let p be the point of P such that c ∈ R(p). Then b(c, r) ∩ P = ∅ if and only if p ∉ b(c, r).

Proof. p ∉ b(c, r) if and only if ‖pc‖ ≥ r. Since p is the point of P closest to c, for all q ∈ P we have r ≤ ‖pc‖ ≤ ‖qc‖. Consequently q ∉ b(c, r), which means b(c, r) ∩ P = ∅. The converse is immediate since p ∈ P.

To simplify the presentation of the algorithm, without loss of generality, we suppose that the graph S1m of the query shape M1m is a polygonal line. The vertices of S1m are denoted c0, c1, ..., cm−1 and M1m is equal to ∪_{k=0}^{m−1} b(ck, ρck), where the ρck are the radii of the discs of M1m. The main steps of the algorithm are summarized in the algorithm given above, and some examples are shown in Figs. 5, 6 and 7.
5
Conclusion
In this paper, we have proposed a generalization of the α-shape concept. Through this new concept we have presented a geometric algorithm extracting from a cloud of points shapes that are similar to a query shape. The algorithm is accurate and it detects the overall shapes similar to the query shape. As a continuation of this work, we focus our interest on the extension of the presented algorithm to the three dimensional sets of points.
References
1. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981)
2. Edelsbrunner, H., Mücke, E.P.: Three-dimensional alpha shapes. ACM Trans. on Graphics 13, 43–70 (1994)
3. Khotanzad, A., Hong, Y.H.: Image invariant recognition by Zernike moments. IEEE Trans. Pattern Analysis and Machine Intelligence 12(5), 489–497 (1990)
4. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31(8), 983–1001 (1998)
5. Okabe, A., Boots, B., Sugihara, K.: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley and Sons, England (1992)
Real-Time Vehicle Ego-Motion Using Stereo Pairs and Particle Filters Fadi Dornaika1 and Angel D. Sappa2 1
Institut Géographique National 94165 Saint Mandé, France
[email protected] 2 Computer Vision Center 08193 Bellaterra, Barcelona, Spain
[email protected] Abstract. This paper presents a direct and stochastic technique for real time estimation of on board camera position and orientation—the ego-motion problem. An on board stereo vision system is used. Unlike existing works, which rely on feature extraction either in the image domain or in 3D space, our proposed approach directly estimates the unknown parameters from the brightness of a stream of stereo pairs. The pose parameters are tracked using the particle filtering framework which implicitly enforces the smoothness constraints on the dynamics. The proposed technique can be used in driving assistance applications as well as in augmented reality applications. Experimental results and comparisons on urban environments with different road geometries are presented.
1
Introduction
In recent years, several vision based techniques were proposed for advanced driver assistance systems [1,2,3,4]. They can be broadly classified into two different categories: highways and urban. Most of the techniques proposed for highways environments are focused on lane and car detection, looking for an efficient driving assistance system. On the other hand, in general, techniques for urban environments are focused on collision avoidance or pedestrian detection. Of particular interest is the estimation of on board camera position and orientation related to the current 3D road plane parameters—the ego-motion problem. Note that since the 3D plane parameters are expressed in the camera coordinate system, the camera position and orientation are equivalent to the plane parameters. Algorithms for fast road plane estimation are very useful for driver assistance applications as well as for augmented reality applications.
This work was supported in part by the MEC project TRA2004-06702/AUT and The Ramón y Cajal Program.
The prior knowledge of the environment is a source of information generally involved in the proposed solutions. For instance, highway driver assistance systems are based on assumptions such as: i) the vehicle is driven along two parallel lane markings [5], ii) lane markings, or the road itself, have a constant width [6], and iii) the camera position as well as its pitch angle (camera orientation with respect to the road plane) are constant values [7]. Similarly, vision-based urban driver assistance systems, also propose to use the prior knowledge of the environment to simplify the problem. Some of the aforementioned assumptions are also used on urban environment, together with additional assumptions related to urban scenes. In summary, scene prior knowledge has been extensively used. However, making assumptions cannot always solve problems. It can sometimes provide erroneous results. For instance, constant camera position and orientation, which is a generally used assumption on highways, is not so valid in an urban scenario. In the latter, camera position and orientation are continuously modified by factors such as: road imperfections or artifacts (e.g., rough road, speed bumpers), car accelerations, uphill/downhill driving, among others. [6] introduces a technique for estimating vehicle yaw, pitch and roll. It is based on the assumption that some parts of the road have a constant width (e.g., lane markings). A different approach was presented in [8]. The authors propose an efficient technique able to cope with uphill/downhill driving, as well as dynamic pitching of the vehicle. It is based on a v -disparity representation and Hough transform. The authors propose to model not only a single plane road—a straight line—but also a non-flat road geometry—a piecewise linear curve. This method is also limited since a longitudinal profile of the road should be extracted for computing the v -disparity representation. In this paper, a new approach based on raw stereo images provided by a stereo vision system is presented. It aims to compute camera position and orientation, avoiding most of the assumptions mentioned above. Since the aim is to estimate the pose of an on board stereo camera from stereo pairs arriving in a sequential fashion, the particle filtering framework seems very useful. In other words, we track the pose of the vehicle (stereo camera) given the sequence of stereo pairs. The proposed technique could be indistinctly used for urban or highway environments, since it is not based on a specific visual traffic feature extraction neither in 2D nor in 3D. Our proposed method has a significant advantage over existing methods since it does not require road segmentation nor dense matching—two difficult tasks. Moreover, to the best of our knowledge, the work presented in this paper is the first work estimating road parameters directly from the rawbrightness images using a particle filter. The rest of the paper is organized as follows. Section 2 describes the problem we are focusing on. Section 3 briefly describes a 3D data based method. Section 4 presents the proposed stochastic technique. Section 5 gives some experimental results and method comparisons. In the sequel, the “road plane parameters” and the“camera pose parameters” will refer to the same entity.
Fig. 1. (a) On board stereo vision sensor with its corresponding coordinate system. (b) The time-varying road plane parameters d and u.
2
Problem Formulation
Experimental setup. A commercial stereo vision system (Bumblebee from Point Grey1 ) was used. It consists of two Sony ICX084 color CCDs with 6mm focal length lenses. Bumblebee is a pre-calibrated system that does not require in-field calibration. The baseline of the stereo head is 12cm and it is connected to the computer by a IEEE-1394 connector. Right and left color images can be captured at a resolution of 640×480 pixels and a frame rate near to 30 fps. This vision system includes a software able to provide the 3D data. Figure 1(a) shows an illustration of the on board stereo vision system as well as its mounting device. The problem we are focusing on can be stated as follows. Given a stream of stereo pairs provided by the on board stereo head we like to recover the parameters of the road plane for every captured stereo pair. Since we do not use any feature that is associated with road structure, the computed plane parameters will completely define the pose of the on board vision sensor. This pose is represented by the height d and the plane normal u = (ux , uy , uz )T (See Figure 1(b)). Due to the reasons mentioned above, these parameters are not constant and should be estimated online for every time instant. Image transfer function. It is well-known [9] that the projections of 3D points belonging to the same plane onto two different images are related by a 2D projective transform having 8 independent parameters—homography. In our setup, the right and left images are horizontally rectified2 . Let pr (xr , yr ) and pl (xl , yl ) be the right and left projection of an arbitrary 3D point P belonging to the road plane (d, ux , uy , uz ) (see Figure 2). In the case of a rectified stereo pair where the left and right images have the same intrinsic parameters, the right and left coordinates of corresponding pixels belonging to the road plane 1 2
[www.ptgrey.com] The use of non-rectified images will not have any theoretical impact on our developed method. However, the image transfer function will be given by a general homography.
Fig. 2. The mapping between corresponding left and right road pixels is given by a linear transform
are related by the following linear transform, i.e., the homography reduces to a linear mapping:

    x_l = h1 x_r + h2 y_r + h3    (1)
    y_l = y_r                     (2)

where h1, h2, and h3 are functions of the intrinsic and extrinsic parameters of the stereo head and of the plane parameters. For our setup (rectified images with the same intrinsic parameters), those parameters are given by:

    h1 = 1 + b ux / d                                    (3)
    h2 = b uy / d                                        (4)
    h3 = −b u0 ux / d − b v0 uy / d + α b uz / d         (5)
where b is the baseline of the stereo head, α is the focal length in pixels, and (u0 , v0 ) is the image center—the principal point.
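As an illustration of how Eqs. (1)–(5) are used, the short sketch below (our own code, not taken from the paper; all names are ours) evaluates the three coefficients for a plane hypothesis (u, d) and transfers right-image road pixels to the left image.

```python
import numpy as np

def plane_to_h(baseline, alpha, u0, v0, u, d):
    """Coefficients of the rectified road-plane mapping x_l = h1*x_r + h2*y_r + h3 (Eqs. 3-5)."""
    ux, uy, uz = u
    h1 = 1.0 + baseline * ux / d
    h2 = baseline * uy / d
    h3 = -baseline * u0 * ux / d - baseline * v0 * uy / d + alpha * baseline * uz / d
    return h1, h2, h3

def transfer_right_to_left(xr, yr, h1, h2, h3):
    """Left-image coordinates of right-image road pixels (Eqs. 1-2)."""
    return h1 * xr + h2 * yr + h3, yr
```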
3
3D Data Based Approach
In [10], we proposed an approach for on-line vehicle pose estimation using the above commercial stereo head. The camera position and orientation—the road plane parameters—are estimated from raw 3D data. The proposed technique consists of two stages. First, a dense depth map of the scene is computed by the provided dense matching technique. Second, the parameters of a plane fitting to the road are estimated using a RANSAC based least squares fitting. Moreover, the second stage includes a filtering step that aims at reducing the number of 3D points that are processed by the RANSAC technique. The proposed technique could be indistinctly used for urban or highway environments, since it is not based on a specific visual traffic feature extraction but on raw 3D data points. This technique has been tested on different urban environments. The proposed algorithm took, on average, 350 ms per frame. The main drawback of the proposed 3D data technique is its high CPU time. Moreover, it requires a dense 3D reconstruction of
the captured images. In the current study, this method is used for comparisons with the proposed stochastic technique. It can be used to initialize the proposed approach—providing the solution for the first video frame.
4
A Direct and Stochastic Approach
Our aim is to estimate the pose parameters from the stream of stereo pairs. In other words, we track the pose over time. In this section, we propose a novel approach that directly infers the plane parameters from the stereo pair using the particle filtering framework. The idea of a particle filter (also known as a Sequential Monte Carlo (SMC) algorithm) was independently proposed and used by several research groups. These algorithms provide flexible tracking frameworks as they are neither limited to linear systems nor require the noise to be Gaussian, and they have proved to be more robust to distracting clutter since the randomly sampled particles make it possible to maintain several competing hypotheses of the hidden state. Therefore, the main advantage of particle filtering methods is the fact that a temporary loss of track will not lead to a permanent loss of the object. Note that when the noise can be modelled as Gaussian and the observation model is linear, the solution is given by the Kalman filter. Particle filtering is an inference process which aims at estimating the unknown time-t state bt from a set of noisy observations (images), z1:t = {z1, ..., zt}, arriving in a sequential fashion [11,12]. Two important components of this approach are the state transition and observation models. The particle filter approximates the posterior distribution p(bt|z1:t) by a set of weighted particles or samples {bt^(j), πt^(j)}, j = 1, ..., N. Each element bt^(j) represents a hypothetical state of the object and πt^(j) is the corresponding discrete probability. Then, the state estimate can be set, for example, to the minimum mean square error estimate or to the maximum a posteriori (MAP) estimate: arg max_{bt} p(bt|z1:t). Based on such generative models, the particle filtering method is a Bayesian filtering method that recursively evaluates the posterior density of the target state at each time step conditionally on the history of observations up to the current time.

4.1 Algorithm
Dynamics model. At any given time, the road plane parameters are given by the plane normal ut, a unit vector, and the distance dt between the camera center and the plane. These parameters can be encapsulated into the 3-vector ut/dt. Therefore, the state vector bt representing the plane parameters is given by

    bt = (b_x(t), b_y(t), b_z(t))^T = ( u_x(t)/dt , u_y(t)/dt , u_z(t)/dt )^T    (6)

Note that the vector bt fully describes the current road plane parameters since the normal vector is a unit vector. Since the camera height and orientation, the
Fig. 3. The region of interest associated with the right image. In this example, its height is set to one third of the image height.
plane parameters, are ideally constant, the dynamics of the state vector bt can be well modelled by a Gaussian noise:

    b_x(t) = b_x(t−1) + ε_t    (7)
    b_y(t) = b_y(t−1) + ε_t    (8)
    b_z(t) = b_z(t−1) + ε_t    (9)

where ε_t is a noise (scalar) drawn from a centered Gaussian distribution N(0, σ). The standard deviation of the noise can be computed from previously recorded camera pose variations. However, we believe that fixed standard deviations or context-based standard deviations are more appropriate since they are directly related to the kind of perturbations and to the video rate.

Observation model. The observation model should relate the state bt (plane parameters) to the measurement zt (stereo pair). We use the following fact: if the state vector encodes the actual values of the plane distance and of its normal, then the registration error between corresponding road pixels in the right and left images should correspond to a minimum. In our case, the measurement zt is given by the current stereo pair. The registration error is simply the Sum of Squared Differences between the right image and the corresponding left image computed over a given region of interest. The registration error is given by:

    e(b) = (1/Np) Σ_{(xr, yr) ∈ ROI} ( I_r(xr, yr) − I_l(h1 xr + h2 yr + h3, yr) )²    (10)
where Np is the number of pixels contained in the region of interest. The above summation is carried out over the right region of interest. The corresponding left pixels are computed according to the affine transform (1). The computed xl = h1 xr + h2 yr + h3 is a non-integer value. Therefore, Il (xl ) is set to a linear interpolation between two neighboring pixels. Note that the region of interest is a user-defined region. Ideally, this region should not include non-road objects but as will be seen in the experiments this is not a hard constraint because we use a stochastic tracking technique.
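A possible implementation of the registration error (10) is sketched below; the sub-pixel left abscissa is handled by the linear interpolation mentioned above. The I[y, x] indexing convention, the ROI encoding and the function name are our assumptions.

```python
import numpy as np

def registration_error(I_r, I_l, roi, h1, h2, h3):
    """Mean SSD between right-ROI pixels and their transferred left positions (Eq. 10).

    I_r, I_l : 2-D grayscale images indexed as I[y, x]
    roi      : (x0, y0, width, height) rectangle in the right image
    """
    x0, y0, w, h = roi
    yr, xr = np.mgrid[y0:y0 + h, x0:x0 + w]
    xl = h1 * xr + h2 * yr + h3                      # non-integer left column
    xl = np.clip(xl, 0, I_l.shape[1] - 1.001)
    x_lo = np.floor(xl).astype(int)
    frac = xl - x_lo
    # linear interpolation between the two neighbouring left pixels on the row
    left = (1.0 - frac) * I_l[yr, x_lo] + frac * I_l[yr, x_lo + 1]
    diff = I_r[yr, xr].astype(float) - left
    return float(np.mean(diff ** 2))
```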
In our study, the ROI is set to a rectangular window that roughly covers the lower part of the original image (one third). Figure 3 illustrates a typical region of interest. The observation likelihood is given by

    p(zt | bt) = (1 / (√(2π) σe)) exp( − e(bt) / (2 σe²) )    (11)
where σe is a parameter controlling the aperture of the Gaussian distribution. Computing the state bt from the previous posterior distribution p(bt−1 |z1:t−1 ) is carried out using the particle filtering framework described in Figure 4.
1. Initialization t = 0: Generate N state samples a0^(1), ..., a0^(N) according to some prior density p(b0) and assign them identical weights, w0^(1) = ... = w0^(N) = 1/N.
2. At time step t, we have N weighted particles (a_{t−1}^(j), w_{t−1}^(j)) that approximate the posterior distribution of the state p(b_{t−1}|z_{1:(t−1)}) at the previous time step.
   (a) Resample the particles proportionally to their weights, i.e., keep only particles with high weights and remove particles with small ones. Resampled particles have the same weights.
   (b) Draw N particles according to the dynamic model p(bt | b_{t−1} = a_{t−1}^(j)), Eqs. (7), (8), and (9). These particles approximate the predicted distribution p(bt|z_{1:(t−1)}).
   (c) Weight each new particle proportionally to its likelihood:
       w_t^(j) = p(zt | bt = a_t^(j)) / Σ_{m=1}^{N} p(zt | bt = a_t^(m)),
       where p(zt|bt) is given by (11). The set of weighted particles approximates the posterior p(bt|z_{1:t}).
   (d) Give an estimate of the state b̂_t as the MAP: b̂_t = arg max_{bt} p(bt|z_{1:t}) ≈ arg max_{a_t^(j)} w_t^(j).
Fig. 4. Particle filter algorithm
Initialization. Note that the initial distribution p(b0 ) can be either a Dirac or Gaussian distribution centered on a provided solution. We have used two methods for estimating this solution: i) the 3D data-based algorithm, and ii) the differential evolution algorithm [13] which aims at minimizing the registration error (10).
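Putting the pieces together, the following sketch mirrors the loop of Fig. 4 for the road-plane state b = u/d. It is illustrative only: `registration_error` is the helper sketched above, the Dirac initialisation and the MAP read-out follow the text, and all names and default values are ours.

```python
import numpy as np

def track_plane(stereo_pairs, b0, roi, baseline, alpha, u0, v0,
                N=200, sigma=0.002, sigma_e=1.0, seed=0):
    """Particle-filter tracking of the state b_t = (ux/d, uy/d, uz/d)."""
    rng = np.random.default_rng(seed)
    particles = np.tile(np.asarray(b0, float), (N, 1))   # Dirac prior on b_0
    weights = np.full(N, 1.0 / N)
    estimates = []
    for I_r, I_l in stereo_pairs:
        # (a) resample proportionally to the weights
        particles = particles[rng.choice(N, size=N, p=weights)]
        # (b) propagate with the Gaussian random-walk model (7)-(9)
        particles = particles + rng.normal(0.0, sigma, particles.shape)
        # (c) weight each particle by the brightness likelihood (11)
        lik = np.empty(N)
        for j, (bx, by, bz) in enumerate(particles):
            h1 = 1.0 + baseline * bx
            h2 = baseline * by
            h3 = -baseline * u0 * bx - baseline * v0 * by + alpha * baseline * bz
            e = registration_error(I_r, I_l, roi, h1, h2, h3)
            lik[j] = np.exp(-e / (2.0 * sigma_e ** 2))    # normalisation constant cancels
        weights = lik / lik.sum()
        # (d) MAP estimate: the particle carrying the largest weight
        estimates.append(particles[np.argmax(weights)].copy())
    return estimates
```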
5 Experiments
The proposed technique has been tested on different urban environments. First experiment. The first experiment has been conducted on a short sequence of stereo pairs corresponding to a typical urban environment (see Figure 3). The stereo pairs are of resolution 320 × 240. Here the road is almost flat and the perturbations are due to accelerations and decelerations of the car. Figures 5(a) and 5(b) depict the estimated camera height and orientation as a function of the sequence frames, respectively. The plotted solutions correspond to the Maximum a Posteriori solution. The solid curves corresponds to an arbitrary ROI of size 270×80 pixels centered at the bottom of the image. The dotted curves correspond to a ROI covering the road region only (here the ROI is manually set to 200 × 80 pixels centered at the bottom of the image). The arbitrary ROI includes some objects that do not belong to the road plane—the vehicles on the right bound. As can be seen, the estimated solutions associated with the two cases are almost the same, which suggests that the obstacles will not have a big impact on the solution. In this experiment the number of particles N was set to 200 and the parameters were as follows σ = 0.002 and σe = 1. In the literature, the pose parameters (road plane parameters) can be represented by the horizon line. Figure 5(c) depict the vertical position of the horizon line as a function of the sequence frames. Figure 5(d) illustrates the computed horizon line for frames 55 and 182. In order to study the algorithm behavior in the presence of significant occlusions or camera obstructions, we conducted the following experiment. We used the same sequence of Figure 3. We run the proposed technique described in Section 4 twice. In the first run the stereo images were used as they are. In the second run, the right images were modified to simulate a significant occlusion. To this end, we set the vertical half of a set of 20 right frames to a constant color. The left frames were not modified3 . Figure 6 compares the pose parameters obtained in the two runs. The solid curves were obtained with the non-occluded images (first run). The dotted curves were obtained in the second run. The occlusion starts at frame 40 and ends at frame 60. The upper part of the figure illustrates the stereo pair 40. As can be seen, the discrepancy that occurs at the occlusion is considerably reduced when the occlusion disappears. Second experiment and method comparison. The second experiment has been conducted on another short sequence corresponding to an uphill driving. The stereo pairs are of resolution 160 × 120. Figures 7(a) and 7(b) depict the estimated camera height and orientation as a function of the sequence frames, respectively. The solid curves correspond to the developed stochastic approach.The dashed curves correspond to the 3D data based approach 3
Therefore, there is a sudden increase in the registration error.
(Figure 5 comprises three plots against the frame number: (a) the camera height in meters, (b) the camera pitch angle in degrees, and (c) the horizon line position in pixels, each with one curve for the arbitrary ROI and one for the road-region ROI.)

Fig. 5. Camera position and orientation, estimated by the particle filter. The plotted solutions correspond to the Maximum a Posteriori (MAP) solution. The solid curves correspond to a ROI having an arbitrary width. The dotted curves correspond to a ROI containing the road image only.

obtained with full resolution images, i.e., 640 × 480 [10]. Figure 7(c) depicts the estimated position of the horizon line obtained by three methods: the above two methods (solid and dashed curves) and a manual method (dotted curve) based on the intersection of two parallel lines. As can be seen, the horizon

(Figure 6 shows stereo pair 40 together with two plots against the frame number, the camera height in meters and the camera pitch angle in degrees, each with one curve for the non-occluded run and one for the temporary-occlusion run.)
Fig. 6. Comparing the pose parameters when a significant occlusion or camera obstruction occurs. This occlusion starts at frame 40 and ends at frame 60.
line estimated by the proposed featureless approach is closer to the manually estimated horizon line—assumed to be very close to the ground-truth data. Figure 7(d) displays the horizon line associated with frame 160. The white line corresponds to the proposed technique and the black one to the 3D data based technique. Note that even the first stereo frame was initialized with the 3D data-based approach, the solid and dashed curves (the two methods) do not coincide at the first frame (see Figure 7(a)). This is because only the MAP solution is plotted. According to the obtained results, the average discrepancy in the height was about 10 cm4 and in the pitch angle was smaller than one degree. Figure 8 displays the estimated camera height associated with the sequence of Figure 7 when the number of particles was set to 300, 200, 100, and 50. As can be seen, the estimated parameters were very consistent. A similar behavior was obtained with the pitch angle. A non-optimized C code processes one stereo pair in 30 ms assuming the size of the ROI is 6000 pixels and the number of particles is 100. The proposed approach runs almost twelve times faster than the 3D data-based approach (Section 3). Moreover, our stochastic approach is faster than many approaches based on elaborated road segmentation and detection. 4
By assuming that this discrepancy is an upper bound of the camera height error, the latter can be considered small given the fact that the camera height was estimated with a small focal length (200 pixels) and with a small baseline.
(Figure 7 comprises plots of (a) the camera height, (b) the pitch angle and (c) the horizon line position against the frame number, comparing the stochastic method, the 3D data based method and a manual method, together with (d) an image overlaid with the estimated horizon lines.)

Fig. 7. Method comparison for on board camera pose estimation. The solid curves correspond to the developed stochastic approach and the dashed curves to the 3D data based approach obtained with full resolution images, i.e., 640 × 480. (d) displays the horizon line associated with frame 160 obtained with two automatic methods: the proposed technique (white) and the 3D data based technique (black).

(Figure 8 consists of four plots of the estimated camera height against the frame number, one for each of N = 300, 200, 100, and 50 particles.)
Fig. 8. The estimated camera height obtained by the proposed stochastic approach for different numbers of particles. From up to bottom N = 300, 200, 100, and 50.
6 Conclusion
A featureless and stochastic technique for real-time ego-motion estimation of an on board vision system has been presented. The method adopts a particle filtering scheme that uses the images' brightness in its observation likelihood. The advantages of the proposed technique are as follows. First, the technique does not need any feature extraction, either in the image domain or in 3D space. Second, the technique inherits the strengths of stochastic tracking approaches. A good performance has been shown in several scenarios: uphill, downhill and flat roads. Furthermore, the technique can handle significant occlusions. Although it has been tested on urban environments, it could also be useful in highway scenarios.
References
1. Broggi, A., Bertozzi, M., Fascioli, A., Sechi, M.: Shape-based pedestrian detection. In: Procs. IEEE Intelligent Vehicles Symposium, Dearborn, pp. 215–220. IEEE Computer Society Press, Los Alamitos (2000)
2. Labayrade, R., Aubert, D.: A single framework for vehicle roll, pitch, yaw estimation and obstacles detection by stereovision. In: IEEE Intelligent Vehicles Symposium. IEEE Computer Society Press, Los Alamitos (2003)
3. Lefée, D., Mousset, S., Bensrhair, A., Bertozzi, M.: Cooperation of passive vision systems in detection and tracking of pedestrians. In: Proc. IEEE Intelligent Vehicles Symposium, Parma, Italy, pp. 768–773. IEEE Computer Society Press, Los Alamitos (2004)
4. Liu, X., Fujimura, K.: Pedestrian detection using stereo night vision. IEEE Trans. on Vehicular Technology 53(6), 1657–1665 (2004)
5. Liang, Y., Tyan, H., Liao, H., Chen, S.: Stabilizing image sequences taken by the camcorder mounted on a moving vehicle. In: Procs. IEEE Intl. Conf. on Intelligent Transportation Systems, Shanghai, China, pp. 90–95. IEEE Computer Society Press, Los Alamitos (2003)
6. Coulombeau, P., Laurgeau, C.: Vehicle yaw, pitch, roll and 3D lane shape recovery by vision. In: Proc. IEEE Intelligent Vehicles Symposium, Versailles, France, pp. 619–625. IEEE Computer Society Press, Los Alamitos (2002)
7. Bertozzi, M., Broggi, A.: GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection. IEEE Trans. on Image Processing 7(1), 62–81 (1998)
8. Labayrade, R., Aubert, D., Tarel, J.: Real time obstacle detection in stereovision on non flat road geometry through "V-disparity" representation. In: Proc. IEEE Intelligent Vehicles Symposium, Versailles, France, pp. 646–651. IEEE Computer Society Press, Los Alamitos (2002)
9. Faugeras, O.: Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press, Cambridge (1993)
10. Sappa, A., Gerónimo, D., Dornaika, F., López, A.: On-board camera extrinsic parameter estimation. Electronics Letters 42(13) (2006)
11. Blake, A., Isard, M.: Active Contours. Springer, Heidelberg (2000)
12. Doucet, A., Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York (2001)
13. Storn, R., Price, K.: Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)
Enhanced Cross-Diamond Search Algorithm for Fast Block Motion Estimation Gwanggil Jeon, Jungjun Kim, and Jechang Jeong Department of Electronics and Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea {windcap315, kimjj79, jjeong}@ece.hanyang.ac.kr
Abstract. A new fast motion estimation algorithm is presented. Our goal was to find the best block-matching results in a computation-limited and varying environment, and the conventional Diamond Search (DS) algorithm, though faster than most known algorithms, was found to be not very robust in terms of objective and subjective qualities for several sequences. The proposed algorithm, Enhanced Cross-Diamond Search (ECDS), employs a small cross search in the initial step. The large/small DS patterns are utilized as subsequent steps for fast block motion estimation. Experiment results show that the proposed algorithm outperforms other popular fast motion estimation algorithms in terms of PSNR and search speed, especially for sequences having random or fast motions. Keywords: Block motion estimation, fast-search algorithm, search pattern, enhanced cross-diamond search.
1 Introduction

In video coding standards such as MPEG-1/2/4 and H.264, the block matching algorithm (BMA) is usually employed to reduce temporal redundancy. In the BMA, the current frame to be encoded is divided into non-overlapping blocks and, for each block, the best matching block is sought in a reference frame. Motion estimation (ME) is a multi-step process that involves a combination of techniques, including the motion starting point, motion search patterns, and the avoidance of searching stationary regions. The Full Search (FS) algorithm can find the optimal solution for the criterion used by exhaustively searching all of the possible blocks. The FS algorithm searches each block candidate for the closest match within the entire search region, in order to minimize the block distortion measure (BDM). The BDM of the image blocks may be measured using various criteria such as the mean absolute error (MAE), the mean square error (MSE), and the matching pixel count (MPC) [1]. It can be seen that this algorithm is rather computationally intensive, thus making it difficult to apply in real-time video compression, particularly for software-based implementation. For this reason, during the past two decades many fast search algorithms, which reduce the computation time by searching only a subset of the eligible candidate blocks, have been developed.
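For reference, a straightforward full-search block matcher with the MAE criterion might look as follows (an illustrative sketch, not code from the paper; names are ours):

```python
import numpy as np

def full_search(cur, ref, bx, by, block=16, search=15):
    """Best motion vector for the block at (bx, by) by exhaustive search with MAE."""
    target = cur[by:by + block, bx:bx + block].astype(float)
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + block > ref.shape[1] or y + block > ref.shape[0]:
                continue
            cand = ref[y:y + block, x:x + block].astype(float)
            cost = np.mean(np.abs(target - cand))   # mean absolute error
            if cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```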
Many efficient fast block motion estimation algorithms have been proposed. In particular, three categories of algorithms have been identified, which are characterized by the particular strategy adopted to speed up the search process. First, there are algorithms that limit the number of search candidates, such as the 2-D Logarithmic Search (2DLOG) [2], Three-Step Search (TSS) [3], Four-Step Search (FSS) [4], New Three-Step Search (NTSS) [5], Cross-Diamond Search (CDS) [6], the prediction search algorithm [7], and the New Cross-Diamond Search (NCDS) [8]. Second, algorithms can sub-sample candidates in the motion vector field. These kinds of search algorithms are based on the assumption that the BDM increases as the checking points move away from the global minima. Such assumptions are reasonable for certain applications; for example, in video-conferencing, where the motion is neither very fast nor complicated. The third class refers to subsampling in the spatial domain [9]. A drawback of this approach, however, is that the reduction in search complexity is often inadequate for real-time applications and is therefore difficult to embed within algorithms such as the TSS and NTSS. These algorithms reduce the number of computations required by calculating the sum of absolute difference (SAD) matching criterion at locations coarsely spread over the search window according to a pattern. The procedure is then repeated with finer resolution around the location with the minimum SAD that was found in the preceding step. The DS algorithm can in most cases significantly reduce the complexity without much reduction in quality compared with FS. Unfortunately, it has been shown that it does not perform as well in some cases, such as for sequences with either significant global motion or scene variations, and it can be less robust. The proposed ECDS is based on DS and NCDS; it employs four points in Step 2 instead of three points, and two points in Step 3 instead of three points. This modification provides better performance than that of NCDS. In Section 2, a brief review of the DS algorithm is provided. In Section 3, the ECDS-based fast block matching motion estimation algorithm is presented. In Section 4, experimental results and performance analysis are provided to show the feasibility of the proposed approach. These results are compared to other well-known motion search algorithms. Finally, the conclusions are presented in Section 5.
2 Remarks on the DS Pattern

It is known that the search pattern has an important influence on the speed and amount of distortion performance in block motion estimation. The DS algorithm is mainly based on the assumption that motion vectors are, in general, center biased [8]. The algorithm always starts searching from the center of the search area (that is, from (0,0)), by examining nine check points as shown in Fig. 1(a). If the minimum is found at the center, then four additional check points are examined and the search stops as shown in Fig. 1(b). Otherwise, depending on the position of the current minimum, additional points will be examined, as shown in Fig. 1(c) and Fig. 1(d). By taking the current minimum as the new center of this new large diamond that is created, the process iterates until the minimum is found to be in the center, where again, the smaller diamond with four additional check points is examined.
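The DS iteration described above can be sketched as follows (illustrative code; the block-cost callback `cost(dx, dy)` and the pattern lists are our own naming):

```python
LDSP = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1),
        (0, 2), (-1, 1), (-2, 0), (-1, -1)]           # large diamond (9 points)
SDSP = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]      # small diamond (5 points)

def diamond_search(cost, search=15):
    """cost(dx, dy) returns the block distortion of candidate displacement (dx, dy)."""
    cx, cy = 0, 0
    visited = {}

    def c(dx, dy):
        if abs(dx) > search or abs(dy) > search:
            return float("inf")
        if (dx, dy) not in visited:
            visited[(dx, dy)] = cost(dx, dy)
        return visited[(dx, dy)]

    while True:
        # examine the large diamond centred on the current minimum
        best = min(((c(cx + dx, cy + dy), (cx + dx, cy + dy)) for dx, dy in LDSP),
                   key=lambda t: t[0])
        if best[1] == (cx, cy):
            break                        # minimum at the centre: switch to the SDSP
        cx, cy = best[1]
    best = min(((c(cx + dx, cy + dy), (cx + dx, cy + dy)) for dx, dy in SDSP),
               key=lambda t: t[0])
    return best[1]
```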
Fig. 1. Definition of the DS algorithm: (a) LDSP; (b) SDSP; (c) LDSP along a corner point; (d) LDSP along an edge point
In this study, besides existing fast algorithms, an Enhanced Cross-Diamond Search Pattern (ECDS) was devised utilizing both small diamond search pattern (SDSP) and large diamond search patterns (LDSP), rather than using only the LDSP, as in the algorithm DS shown in Fig. 1.
3 Enhanced Cross-Diamond Search Pattern

In general, the neighboring blocks are highly correlated with each other. The SAD value of a current block is expected to be similar to those of the neighboring blocks. In the proposed method, the existence of a motion vector (MV) using the SDSP can be examined by comparing the SAD value of the current block. In reality, the distortion of an object in a video frame is proportional to its velocity. Therefore, as the length of a motion vector increases so does the amount of error in the block difference. In [1], the authors addressed this issue by modifying the FS algorithm to incorporate variable distance-dependent thresholds for fast and robust true motion vector estimation in object-based indexing applications. In this paper, the principles are further extended so they can be applied in real-time video coding. Compared with the TSS and NTSS algorithms, the proposed technique is more robust, since it visits all candidates around the center, tracing a concentric-square arrangement, and hence significantly reduces the probability of being trapped in a local minimum. The proposed ECDS can be used in regions with a large motion vector, as well as in fast motion estimation using a small motion vector.
The search point configuration used in the ECDS algorithm is shown in Fig. 2. Fig. 2(a) shows the cross-shaped pattern. It was found that most of the blocks can be regarded as stationary. Since most real world sequences have a high cross center-biased property, the cross-shaped pattern was used for the first two steps of the ECDS.
Fig. 2. Example of ECDS; (a) Step 1 stop with MV(0,0); (b) Step 2 stop with MV(1,0); (c) Step 3 stop with MV(2,0) ; (d) Step 3 stop with MV(1,-1); (e) Step 3 stop with MV(2,0) ; (f) Step 3 stop with MV(0,-3); (g, h) examples of Step 4 and Step 5
The proposed ECDS maintains a small cross-shaped pattern (SCSP), which minimizes the number of search points for stationary blocks. The second step of the NCDS was also modified. This modification was obtained empirically and works very well for actual motion estimation. The details and analysis of the algorithm are given below.

Step 1. A minimum SAD is found from the five search points of the SCSP. If the minimum SAD point occurs at the center of the SCSP, the search stops, as shown in Fig. 2(a). Otherwise, the algorithm proceeds to Step 2.

Step 2. If the winning point is (1,0) (or (-1,0), (0,1), (0,-1)), four additional search points (2,0), (3,0), (1,-1), and (1,1) are checked. If the winning point is still (1,0), the search stops, as shown in Fig. 2(b). Otherwise, the algorithm proceeds to Step 3.

Step 3.
Case 1: If the winning point is (2,0) (or (-2,0), (0,2), (0,-2)), two additional search points (2,-1) and (2,1) are checked. If the winning point is still (2,0), the search stops, as shown in Fig. 2(c). Otherwise, the algorithm proceeds to Step 4.
Case 2: If the winning point is (1,-1) (or (-1,-1), (-1,1), (1,1)), two additional search points (1,-2) and (2,-1) are checked. If the winning point is still (1,-1), the search stops, as shown in Fig. 2(d). Otherwise, the algorithm proceeds to Step 4.
Case 3-1: If the winning point is (3,0) (or (-3,0), (0,3), (0,-3)), three additional search points (0,-3), (-3,0), and (0,3) are checked, as shown in Fig. 2(e). If the winning point is still (3,0), the algorithm proceeds to Step 4.
Case 3-2: If the winning point is (0,-3) (or (-3,0), (0,3)), one additional search point (0,-2) is checked, as shown in Fig. 2(f). If the winning point is still (0,-3), the algorithm proceeds to Step 4. If the winning point is (0,-2), two additional search points (1,-2) and (-1,-2) are checked. If the winning point is still (0,-2), the search stops. Otherwise, the algorithm proceeds to Step 4.

Step 4. The minimum SAD point found in the previous step is taken as the center of the LDSP. If the minimum SAD point occurs at the center of the LDSP, continue to Step 5. Otherwise, Step 4 is repeated.

Step 5. The minimum SAD point found in the previous step is taken as the center of the SDSP.

Figures 2(g) and 2(h) show the cases for Step 4 and Step 5. The main improvement of this algorithm over previous versions is in speed, due to a reduction in the number of searching points when there are stationary blocks or quasi-stationary blocks. In
order to fit the cross center-biased MV distortion characteristics, the proposed algorithm provides more opportunities to skip searching points for motion vectors.
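The early-termination behaviour of Steps 1 and 2 can be sketched as follows (illustrative code covering only the SCSP front end; the Step 3 cases and the LDSP/SDSP refinement of Steps 4-5 are left to the caller, and the `cost(dx, dy)` callback is assumed):

```python
SCSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]      # small cross-shaped pattern

def ecds_front_end(cost):
    """Steps 1 and 2 of the ECDS search.

    cost(dx, dy) is the block distortion of candidate (dx, dy). Returns (mv, done);
    when done is False the caller continues with the later ECDS steps."""
    # Step 1: small cross around the origin
    best = min(SCSP, key=lambda p: cost(*p))
    if best == (0, 0):
        return (0, 0), True                    # stationary block

    # Step 2: four additional points in the winning direction,
    # e.g. for (1, 0): (2, 0), (3, 0), (1, -1), (1, 1)
    sx, sy = best
    if sy == 0:
        extra = [(2 * sx, 0), (3 * sx, 0), (sx, -1), (sx, 1)]
    else:
        extra = [(0, 2 * sy), (0, 3 * sy), (-1, sy), (1, sy)]
    best2 = min([best] + extra, key=lambda p: cost(*p))
    if best2 == best:
        return best, True                      # quasi-stationary block
    return best2, False                        # continue with Step 3 onwards
```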
4 Performance Analysis

This section includes the performance of the proposed ECDS algorithm and its comparison with conventional methods.

Table 1. Performance comparison of average PSNR (dB). Each row lists one algorithm; the columns are the test sequences FL, AK, WA, FO, HA, MD, NE, SI, TE, PA, CG, BR, CH, CO, ST, followed by the average (AV) and the difference of that average from the ECDS average.

TSS    25.65 42.65 35.63 32.82 34.36 40.08 38.10 35.60 27.27 30.30 29.33 38.41 27.29 38.35 25.27 | 33.410 | -0.005
FSS    25.93 42.71 35.64 33.01 34.36 40.11 38.09 35.48 27.21 30.32 29.32 38.41 27.08 38.35 24.70 | 33.380 | -0.035
NTSS   26.19 42.78 35.64 33.32 34.37 40.15 38.14 35.62 27.29 30.53 29.49 38.41 27.36 38.35 25.26 | 33.530 | 0.115
BBGDS  26.17 42.78 35.64 33.42 34.31 40.23 38.15 35.56 27.32 30.43 29.30 38.41 27.19 38.35 24.23 | 33.430 | 0.015
HEX    25.82 42.50 35.61 32.50 34.28 40.06 38.01 35.59 27.33 30.23 29.26 38.40 27.15 38.35 24.59 | 33.310 | -0.105
DS     26.17 42.76 35.64 33.31 34.38 40.23 38.16 35.66 27.36 30.49 29.45 38.41 27.39 38.35 24.72 | 33.500 | 0.085
CDS    26.18 42.74 35.64 33.17 34.33 40.16 38.15 35.57 27.33 30.42 29.44 38.41 27.33 38.35 24.62 | 33.460 | 0.045
NCDS   26.17 42.74 35.64 33.18 34.29 40.14 38.14 35.45 27.31 30.37 29.34 38.40 27.24 38.35 24.38 | 33.410 | -0.005
ECDS   26.18 42.75 35.64 33.13 34.30 40.15 38.14 35.45 27.32 30.38 29.34 38.41 27.28 38.36 24.39 | 33.415 | 0.000
Table 2. Performance comparison of average MSE. Rows and columns as in Table 1.

TSS    182.11 4.18 18.90 35.50 26.24 8.65 13.07 22.05 126.09 62.70 89.29 9.40 197.64 9.54 222.90 | 68.551 | -2.259
FSS    170.99 4.10 18.88 33.97 26.30 8.65 13.09 22.83 127.84 62.15 91.21 9.40 216.84 9.54 262.65 | 71.896 | 1.086
NTSS   161.79 4.03 18.88 31.76 26.24 8.56 12.91 21.32 125.52 59.18 84.10 9.40 197.36 9.54 227.41 | 66.533 | -4.277
BBGDS  162.32 4.03 18.88 30.85 26.72 7.92 12.89 21.87 124.01 60.58 90.49 9.41 205.31 9.54 298.13 | 72.197 | 1.387
HEX    175.32 4.35 19.00 38.13 26.89 8.23 13.35 21.43 123.61 63.55 91.99 9.41 197.86 9.54 268.76 | 71.428 | 0.618
DS     162.53 4.05 18.88 31.55 26.15 7.87 12.86 21.22 122.58 59.75 85.90 9.40 191.07 9.54 259.05 | 68.160 | -2.650
CDS    162.07 4.08 18.89 32.66 26.52 8.08 12.90 21.66 123.42 60.80 86.29 9.40 193.85 9.54 267.22 | 69.159 | -1.651
NCDS   162.36 4.08 18.89 32.57 26.87 8.14 12.91 22.33 124.07 61.36 88.01 9.42 199.66 9.54 284.43 | 70.976 | 0.166
ECDS   162.36 4.07 18.88 33.01 26.87 8.17 12.93 22.29 124.26 61.46 88.66 9.41 195.53 9.53 284.66 | 70.810 | 0.000
Table 3. Performance comparison of the average number of search points. Rows and columns as in Table 1.

TSS    23.31 23.21 23.21 23.28 23.25 23.29 23.21 23.21 23.23 23.22 23.27 23.25 23.23 23.23 23.30 | 23.247 | 15.874
FSS    20.33 15.89 15.89 19.49 16.59 17.33 16.14 16.58 16.28 16.69 20.80 16.89 16.98 15.99 19.07 | 17.396 | 10.023
NTSS   22.55 16.08 16.28 22.59 17.55 18.84 16.50 17.43 17.30 17.53 22.55 17.89 18.09 16.18 23.21 | 18.705 | 11.332
BBGDS  12.51 8.56 8.67 13.76 9.35 10.14 8.85 10.17 9.58 9.59 14.15 9.50 11.40 8.58 14.53 | 10.623 | 3.250
HEX    12.58 10.35 10.32 12.50 10.58 11.03 10.51 11.03 10.72 10.79 13.31 10.86 11.54 10.42 13.36 | 11.327 | 3.954
DS     12.76 12.29 12.28 16.10 12.97 13.53 12.53 13.49 12.96 13.06 17.18 13.15 14.33 12.37 17.12 | 13.741 | 6.368
CDS    13.99 8.74 8.83 14.42 9.81 10.59 9.11 10.34 9.77 9.93 16.21 10.13 11.55 8.84 16.09 | 11.223 | 3.850
NCDS   11.21 4.99 5.13 12.13 6.21 7.06 5.45 6.92 6.27 6.50 14.64 6.11 8.44 5.03 14.44 | 8.035 | 0.662
ECDS   9.63 4.98 5.13 10.66 6.01 6.77 5.33 6.71 6.18 6.19 11.25 5.96 7.93 5.02 12.85 | 7.373 | 0.000
The performance of the ECDS algorithm for video coding was evaluated using the luminance (Y-component) signal of a number of standard test video sequences, including "FL: flower," "AK: akiyo," "WA: waterfall," "FO: foreman," "HA: hall monitor," "MD: mother and daughter," "NE: news," "SI: silent," "TE: tempete," "PA: Paris," "CG: coastguard," "BR: bridge," "CH: children," "CO: container," and "ST: Stefan." The size of each individual frame is 352×288 pixels. The mean absolute error (MAE) distortion function is used as the BDM. The maximum displacement in the search area is ±15 pixels in both the horizontal and vertical directions for a 16×16 block size. To quantitatively evaluate the video coding performance of the ECDS algorithm, the following two measures were considered: 1) the average peak signal-to-noise ratio (PSNR) and the average mean square error (MSE) after picture reconstruction, and 2) the average number of search points, as a measure of computational complexity. The performance comparison of the TSS, FSS, NTSS, BBGDS, HEX, DS, CDS, NCDS, and the proposed ECDS in terms of PSNR and MSE between the estimated 80 frames and the original frames is shown in Tables 1 and 2. The average number of search points used to estimate the motion vectors is presented in Table 3. Table 3 also shows that the ECDS outperformed other algorithms whether the image sequence contained fast or slow motion. It was observed that the PSNR value for the ECDS algorithm is almost the same as that of DS when the number of search points is 53.65% or below. At higher search speeds, where the improvement factor was between 15.874 and 0.662, the average PSNR for the ECDS algorithm was very close to the optimal average PSNR value of the DS algorithm. It is clear that the ECDS algorithm was faster by a factor of at least 0.662, and in the "Akiyo" sequence, more than 4.66 times faster than the TSS algorithm, while the PSNR remained comparable with the TSS and NTSS algorithms. The
results also proved that by choosing a suitable constant for the selected threshold function, the average number of search points required by the ECDS algorithm was considerably less, while concomitantly having a significantly higher average PSNR. Figure 3 shows the 70th image of the coastguard sequence and its reconstructed images for various ME algorithms. The original image is given in (a), and (b)-(f) show the motion-compensated images using FS, DS, CDS, NCDS, and the proposed ECDS algorithm, respectively. The figures show that the DS, CDS, and NCDS fail to
Fig. 3. The 70th image of the coastguard sequence reconstructed by using various motion estimation algorithms: (a) original; (b) FS; (c) DS; (d) CDS; (e) NCDS; (f) ECDS.
find the ship's side deck within the given search area, while FS and the proposed algorithm successfully find it. This demonstrates that the supplementary scheme adopted in the proposed algorithm works properly in terms of finding random MVs. However, the ECDS algorithm has a limitation as well. As can be seen in Tables 1 and 2, the ECDS method does not outperform the NTSS, DS, and CDS methods in average PSNR and average MSE. Nevertheless, the concepts studied here may in the future be used to develop a method that significantly reduces computational requirements.
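For reference, the two quality measures used throughout this evaluation (the MAE block distortion measure, BDM, and the PSNR of the reconstructed frames) can be sketched as follows. This is a minimal Python/NumPy illustration under our own assumptions; the function names and the 8-bit peak value are ours, not the authors' implementation.

```python
import numpy as np

def mae_bdm(block_cur, block_ref):
    """Mean absolute error (MAE) block distortion measure between two blocks."""
    diff = block_cur.astype(np.float64) - block_ref.astype(np.float64)
    return float(np.mean(np.abs(diff)))

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio of a motion-compensated (reconstructed) frame."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```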
5 Conclusion

ECDS outperforms many other conventional methods, having a faster speed while yielding picture quality similar to that of a full search technique. The small cross search pattern is employed in the initial step, and large/small diamond search patterns are utilized in the subsequent steps for estimating fast block motion. Theoretical analysis shows that a speed improvement of up to 46.35% over the DS algorithm can be obtained by locating some motion vectors in certain scenarios. The experiments show that ECDS copes well both with sequences containing large, dynamic motion variations and with simple, uniform-motion videos. ECDS is therefore very suitable for real-time, high-quality MPEG-4 video encoding.
Acknowledgment
This research was supported by the Seoul Future Contents Convergence (SFCC) Cluster established by the Seoul R&BD Program.
Block-Based Motion Vector Smoothing for Periodic Pattern Region
Young Wook Sohn and Moon Gi Kang
Yonsei University, Department of Electrical and Electronic Engineering, 134 Shinchon-Dong, Seodaemun-Ku, Seoul, 120-749, South Korea
[email protected],
[email protected]
Abstract. When finding true motion vectors in video sequences, multiple local minima areas such as periodic patterns cause severe motion errors. There have been efforts to reduce motion errors in such regions, but they use an exhaustive full-search motion estimation scheme to analyze the region or to find a solution for it. To find robust motion vectors in periodic pattern regions using a non-exhaustive motion estimator, we propose a recursive motion vector smoothing method. Recursively averaged vectors are used for periodic pattern regions and the input vectors from a conventional search method are used for other regions, controlled by a weighting parameter. The properties of periodic patterns are considered in calculating the parameter, which adaptively weights the input or the mean vectors. Experimental results show motion vector improvements in periodic pattern regions with input vectors from a non-exhaustive search method.
1 Introduction
True motion vectors are used in frame-rate up-conversion, film judder compensation, and de-interlacing algorithms, where they are intended to represent motion as perceived by the human eye. Since conventional MPEG-based applications use motion vectors that minimize residual data, those vectors can differ in direction or magnitude from the perceived motion in real sequences. Tracking true motion with conventional estimation algorithms is problematic in periodic pattern regions. Such a region has multiple local minima points in the search area whose sum-of-absolute-difference (SAD) values are almost identical, and the motion vectors show randomness due to the similar local minima points. Recent motion estimation methods utilize spatio-temporal vector consistency to smooth motion vectors when their SAD values are similar [1][2]. Increasing the consistency reduces visual artifacts when there are similar SAD values in the search range. However, the randomness of the motion vectors in the periodic pattern region is not completely removed by the consistency condition. To find robust motion vectors in the region, Lee et al. [3] analyzed the properties of the periodic pattern to determine whether the current block has the pattern
or not. If the current block is static and is considered to have a periodic pattern, the motion vector of the block is taken as the shortest (minimum-magnitude) local minimum point in the periodic pattern. If the block is moving, the motion vector of a non-periodic neighboring block is used. Han et al. [4] used a pseudo M-estimator and a spatial smoothing constraint to remove motion errors in periodic patterns and the halo effect. Their smoothing constraint minimizes vector differences between the current and neighboring blocks, and the final vector is selected from the candidates of the full-search method. Yu et al. [5] used variable block sizes to reduce visual artifacts and to find true motion vectors in a de-interlacing algorithm. They observed that motion estimation with a large block size can find accurate motion, but the large block size causes severe occlusion in boundary regions. On the other hand, if a smaller block size is used, occlusion artifacts are reduced, but more motion errors appear in other regions due to the increased possibility of similar SAD values. They selected one block size from 32 × 16, 16 × 8, and 8 × 4 pixels according to the current block condition. Motion errors in periodic patterns are removed by selecting the 32 × 16 block size, while occlusion is handled by the smaller block sizes. We focus on reducing motion errors in the periodic pattern region using non-exhaustive search schemes. The previous works use the exhaustive full-search method to analyze the pattern or to find the solution. We note that the motion vectors in the periodic pattern region exhibit high randomness and that the SAD values at the multiple local minima are similar. If the random vectors in the periodic pattern region are recursively smoothed, the smoothed vector can represent robust motion of the region, whether the region is moving or static. The external vectors and the recursively averaged ones can be properly weighted by measuring the vector activity (randomness) and by comparing the SAD values of the vectors at the local minima. We derive an adaptive parameter to weight the vectors. The organization of the paper is as follows. Section 2 introduces the previous works for the periodic pattern region and the proposed method is described in Sec. 3. Experimental results are shown in Sec. 4, followed by conclusions in Sec. 5.
2 Previous Works for Periodic Pattern Region
Figure 1 shows examples of a periodic pattern and its SAD distributions. Lee et al. [3] analyzed the SAD distribution to find periodic pattern blocks. The SAD value for an M × N block is defined as

SAD(x, y) = Σ_{i=1}^{M} Σ_{j=1}^{N} |f^{n−1}(x + i, y + j) − f^{n}(x + i + v_x, y + j + v_y)|,   (1)
where v_x and v_y are the x- and y-axis offset vectors, and f^{n−1} and f^{n} denote the pixel values in the (n − 1)th and nth frames. It is observed that the SAD distribution shows multiple local minima in a periodic pattern region, and that its x- and y-axis projection arrays show a longer envelope than those of a non-periodic area, where the x-axis envelope is illustrated as the length of the SAD graph in Fig. 1(c).
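A minimal sketch of the block SAD of Eq. (1) is given below, assuming 8-bit grayscale frames stored as NumPy arrays indexed as [row, column]; the function name, the default block size and the absence of bounds checking are our simplifications, not part of the original method.

```python
import numpy as np

def block_sad(prev_frame, cur_frame, x, y, vx, vy, M=16, N=16):
    """SAD of Eq. (1): the M x N block at (x, y) in frame n-1 is compared with
    the block displaced by the candidate vector (vx, vy) in frame n."""
    ref = prev_frame[y:y + N, x:x + M].astype(np.float64)
    cand = cur_frame[y + vy:y + vy + N, x + vx:x + vx + M].astype(np.float64)
    return float(np.sum(np.abs(ref - cand)))
```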
Fig. 1. Periodic pattern and its SAD distributions: (a) periodic pattern, (b) SAD distribution, (c) SAD distribution projected on the x-axis.
To determine whether a block has a periodic pattern, the number of local minima in the SAD domain is counted and the envelope length of the projection arrays is measured. The number of local minima is calculated as

φ = (S_n − S_{n−1}) · (S_{n+1} − S_n),   (2)

the number of local minima = the number of negative φ's,   (3)
where S_n is the nth SAD projection value along the x- or y-axis. The envelope length is given as

L = Σ_n |S_{n−1} − S_n|.   (4)

If both measures exceed their corresponding thresholds, the block is considered to have a periodic pattern. After the detection of a periodic pattern block, the block is classified as static or moving. If the block is static, the motion vector is set to the minimum-magnitude motion vector among the vectors with similar SAD values. If it is a moving block, the vector is replaced by one of the vectors of the surrounding non-periodic pattern blocks. Han et al. [4] utilized a pseudo M-estimator in a cost function to remove outliers in the SAD domain caused by motion boundaries and periodic patterns. The cost function is proposed as

E = ϕ_d(f_{n−1}(x) − f_n(x − v)) + λ Σ_i ϕ_s(v_i − v),   (5)
where ϕ_d and ϕ_s represent pseudo M-estimators and v_i is one of the neighboring motion vectors. λ is a weighting parameter that controls the smoothness of the output vector. Both of these algorithms are based on the full-search method. The SAD distribution used in Eqs. (2)-(4) needs SAD values for all search positions, and in Eq. (5) the full search is required to find the lowest-cost vector. If the full-search method is not used, the SAD distribution and the cost function cannot be used.
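The periodicity test of Eqs. (2)-(4) can be sketched as follows. This is our own illustrative Python/NumPy rendering of the formulas as stated; the threshold values and the way the two projection axes are combined are placeholders, not the settings used in [3].

```python
import numpy as np

def count_local_minima(s):
    """Count negative values of phi = (S[n]-S[n-1]) * (S[n+1]-S[n]) over a
    1-D SAD projection array, as in Eqs. (2)-(3)."""
    s = np.asarray(s, dtype=np.float64)
    phi = (s[1:-1] - s[:-2]) * (s[2:] - s[1:-1])
    return int(np.sum(phi < 0))

def envelope_length(s):
    """Envelope length L of the projection array, Eq. (4)."""
    s = np.asarray(s, dtype=np.float64)
    return float(np.sum(np.abs(np.diff(s))))

def is_periodic_block(proj_x, proj_y, min_count=4, min_length=1000.0):
    """Flag a block as periodic if both measures exceed their thresholds on
    either projection axis (combination rule and thresholds are assumptions)."""
    return ((count_local_minima(proj_x) >= min_count and envelope_length(proj_x) >= min_length) or
            (count_local_minima(proj_y) >= min_count and envelope_length(proj_y) >= min_length))
```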
3 The Proposed Method
3.1 Vector Constraints
To acquire a smooth motion vector field from the input motion vectors, we set a cost function that follows the input vector while minimizing the spatial variance:

J = α̃ ||d_uc − d||² + Σ_{i=1}^{N} ||d_i^p − d||²,   (6)
where d_uc is the input vector from a conventional motion estimation method, d is the output vector, and d_i^p is a surrounding vector of the current block in the previous frame. The neighboring vectors in the previous frame are used to represent the recursive process. Taking the derivative of the function with respect to d and setting it to zero, we get

(α̃ + N) d = α̃ d_uc + Σ_{i=1}^{N} d_i^p,

d = (α̃ / (α̃ + N)) d_uc + (1 / (α̃ + N)) Σ_{i=1}^{N} d_i^p.   (7)

Setting α̃ / (α̃ + N) = α, the equation is rewritten as

d = α d_uc + (1 − α) (Σ_{i=1}^{N} d_i^p) / N,   0 ≤ α ≤ 1.   (8)

The α in the above equation weights the unconstrained vector d_uc and the term (Σ_{i=1}^{N} d_i^p) / N. As the latter term represents the mean vector of the d_i^p, we can replace it with m^p, giving

d = α d_uc + (1 − α) m^p,   (9)

where

m^p = (1 / N) Σ_{i=1}^{N} d_i^p.   (10)

The d_uc from a conventional motion estimator follows new motion when objects change their direction or speed. If the unconstrained motion vector is reliable, it should be weighted to follow the movement of the current frame. However, a periodic pattern can cause d_uc to indicate a wrong position. The result can be regularized by weighting the mean vector m^p for such a region through a proper setting of the α value. The function α should weight the input vector to follow objects' motion changes, and the recursive mean vector should be weighted if the input vector contains motion errors.
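A minimal sketch of the recursive smoothing of Eqs. (9)-(10) follows, assuming motion vectors are given as 2-D tuples or NumPy arrays; the function name is ours.

```python
import numpy as np

def smooth_vector(d_uc, prev_neighbor_vectors, alpha):
    """Blend the unconstrained input vector d_uc with the mean of the
    neighboring vectors from the previous frame (Eqs. (9)-(10))."""
    m_p = np.mean(np.asarray(prev_neighbor_vectors, dtype=np.float64), axis=0)  # Eq. (10)
    return alpha * np.asarray(d_uc, dtype=np.float64) + (1.0 - alpha) * m_p     # Eq. (9)
```

For example, smooth_vector((3, -1), [(2, 0), (2, -1), (3, 0)], alpha=0.25) returns a vector pulled most of the way toward the neighborhood mean, as a small α favors the recursive mean over the possibly erratic input vector.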
3.2 Weighting Function α
We assume that periodic patterns have the following properties: 1) there are similar SAD values at several local minima positions in the area; 2) the motion vectors in the region have high randomness. The first property implies that the ratio of the SAD values of the input vectors is almost unity,

SAD(d_{1,uc}) / SAD(d_{2,uc}) ≈ 1,   (11)

where d_{1,uc} and d_{2,uc} denote two arbitrary vectors representing local minima points. If the previous mean vector m^p in Eq. (10) represents another local minimum point, the ratio of the input and the mean vector also satisfies

SAD(d_{1,uc}) / SAD(m^p) ≈ 1,   (12)

and the ratio can be rewritten as

1 − SAD(d_{1,uc}) / SAD(m^p) ≈ 0.   (13)

The α in Eq. (8) should minimize the weight on d_uc in the above case, which represents one of the periodic pattern properties. The α can therefore be set as

α ∝ max(0, 1 − SAD(d_{1,uc}) / SAD(m^p)).   (14)

The second property follows from the first, as the similar SAD values at several local minima positions lead to ambiguity in selecting the true motion vector in the region. To suppress the randomness of the input vectors, we model α as

α ∝ k / Σ_{i=1}^{N} ||d_{i,uc} − d||²,   (15)

where d_{i,uc} is one of the input motion vectors in the surrounding blocks. Incorporating the two models, the α can be derived as

α = min(1, max(0, 1 − SAD(d_{1,uc}) / SAD(m^p)) · k / Σ_{i=1}^{N} ||d_{i,uc} − d||²).   (16)
The α is decreased when the activity of the input vectors is high or when there are multiple similar SAD values, weighting more on the recursive mean vector so as to suppress the randomness of the input vectors in the periodic pattern region.
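The weighting function of Eq. (16) could be computed as sketched below. The default k = 4 follows the experimental setting reported later in the paper, while the small eps guard against division by zero and the interpretation of d as the current (or previous-iteration) output vector are our own assumptions.

```python
import numpy as np

def weighting_alpha(sad_duc, sad_mp, input_neighbor_vectors, d, k=4.0, eps=1e-6):
    """Adaptive weight alpha of Eq. (16): small when the block looks periodic
    (SAD ratio near 1) or when the surrounding input vectors are erratic."""
    sad_term = max(0.0, 1.0 - sad_duc / max(sad_mp, eps))                       # Eq. (14)
    d = np.asarray(d, dtype=np.float64)
    activity = sum(float(np.sum((np.asarray(v, dtype=np.float64) - d) ** 2))    # Eq. (15) denominator
                   for v in input_neighbor_vectors)
    return min(1.0, sad_term * k / max(activity, eps))                          # Eq. (16)
```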
4 Experimental Results
Four test sequences were used for simulation, as shown in Fig. 2. Zooming grates and Moving and zooming grates were used to test quasi-stationary and globally moving periodic patterns. The first sequence contains a periodic pattern with
Fig. 2. Test sequences: (a) Zooming grates, (b) Moving and zooming grates, (c) Mobile, (d) Football, (e) Nebski, (f) Siegfried.
slowly zooming out, and the second sequence has the same objects moving eight pixels to the left per frame. The Mobile sequence has a local periodic pattern, and the performance for non-periodic pattern regions was tested with the Football sequence. To test complex motion with several periodic patterns, the Nebski and Siegfried sequences were used. A 16 × 8 block size was used for motion estimation with a 32 × 16 search range. Output frames were generated by compensating at the half position between the (n − 1)th and nth frames, implementing double-rate frame up-conversion. The constant k in Eq. (16) was set to 4. The three-step search (TSS) was used as the non-exhaustive input vector generator in the proposed method. Figure 3 shows the compensation results for the Zooming grates sequence. The proposed method found robust motion vectors in the periodic pattern region, while the motion vectors from the TSS method showed high randomness in that region. Lee et al.'s [3] and Han et al.'s [4] methods, denoted as pattern detection (PD)
Fig. 3. Motion compensated results with the Zooming grates sequence, the 12th frame: (a) TSS, (b) PD, (c) RFS, (d) the proposed method.
and robust full-search (RFS), respectively, also showed reduced motion errors in the region. Note that the proposed method showed more regularized motion vectors than those two methods, although it used the input motion vectors from the TSS method. With the Moving and zooming grates sequence in Fig. 4, the vectors of the proposed method followed the global movement of the periodic regions, while the vectors from the TSS method still showed randomness. The PD and RFS methods show more errors in this region. With the Mobile sequence, both the conventional and the proposed methods successfully removed error vectors, as shown in Fig. 5. There are periodic patterns in the calendar object, where the TSS method produced several random vectors in
Fig. 4. Motion compensated results with the Moving and zooming grates sequence, the 11th frame: (a) TSS, (b) PD, (c) RFS, (d) the proposed method.
Fig. 5. Motion compensated results with the Mobile sequence, the 16th frame: (a) TSS, (b) PD, (c) RFS, (d) the proposed method.
the region. The random vectors were suppressed and the recursive mean vector was used in that region by the proposed method. For the evaluation of motion vectors in non-periodic pattern blocks, the Football sequence was tested. The compensated result of the proposed method is compared with that of the TSS method in Fig. 6. The motion vectors are overlaid on the compensated frame for better comparison. The proposed method produced smoother motion vectors than the TSS method in regions where there are non-rigid moving objects instead of periodic patterns. The non-rigid objects caused random vectors, and the proposed method suppressed these while preserving the other, non-random vectors that have spatial consistency. In Fig. 7 and Fig. 8, there are complex motions that cause several inconsistent vectors with the TSS method; these were also suppressed by the proposed method. Note that the random vectors both inside and outside
Fig. 6. Motion compensated results with the Football sequence, overlaid with motion vectors, the 25th frame: (a) TSS, (b) the proposed method.
Fig. 7. Motion compensated results with the Nebski sequence, overlaid with motion vectors, the 57th frame: (a) TSS, (b) PD, (c) RFS, (d) the proposed method.
Fig. 8. Motion compensated results with the Siegfried sequence, overlaid with motion vectors, the 32nd frame: (a) TSS, (b) PD, (c) RFS, (d) the proposed method.
of periodic pattern regions were considered as errors by the proposed method, due to the assumed properties of periodic pattern regions. Although the objects were moving in several directions, the proposed method suppressed the vectors toward zero. As complex motion with errors increases the vector variance, the α weighted more on the recursive mean vectors than on the input vectors. The PD method in Fig. 8 kept the original vectors unchanged, as periodic pattern blocks were not detected in the motion error regions. If the input vectors are spatially consistent, the proposed method keeps them unchanged. However, if the true motions themselves have multiple directions and show inconsistency, the proposed method can suppress the vectors due to the consistency assumption of the weighting parameter α. Although the trade-off between periodic and non-periodic patterns can be adjusted through the constant k of the α, additional modeling of the α would improve the preservation of vectors outside periodic pattern regions.
5 Conclusions
We proposed a motion vector regularization method for periodic pattern regions using a non-exhaustive motion estimator. The mean vectors from the previous frame were used to acquire smooth motion vectors in those regions, while the conventional vectors were used for non-periodic pattern regions. Using the properties of periodic patterns, the weighting parameter was set adaptively to control the balance between the input and the recursively averaged vectors. The results showed smoothed motion vectors in both periodic and non-periodic pattern regions, following the motions of the objects. Future work will focus on a more accurate calculation of the α for periodic pattern regions.
References
1. Tourapis, A., Au, O., Liou, M.: Highly efficient predictive zonal algorithms for fast block-matching motion estimation. IEEE Trans. Circuits and Syst. for Video Technology 12, 934–937 (2002)
2. Nie, Y., Ma, K.: Adaptive Irregular Pattern Search with matching prejudgment for fast block-matching motion estimation. IEEE Trans. Circuits and Syst. for Video Technology 15, 789–974 (2005)
3. Lee, S.-H., Kwon, O., Park, R.-H.: Motion vector correction based on the pattern-like image analysis. IEEE Trans. Consumer Electronics 49, 479–484 (2003)
4. Han, S.-H., Kim, H.-K., Lee, Y.-H., Yang, S.: Converting the interlaced 3:2 pulldown film to the NTSC video without motion artifacts. In: Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 1054–1057 (2005)
5. Chang, Y.-L., Chen, C.-Y., Lin, S.-F., Chen, L.-G.: Four field variable block size motion compensated adaptive de-interlacing. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 913–916 (2005)
A Fast and Reliable Image Mosaicing Technique with Application to Wide Area Motion Detection
Alessandro Bevilacqua(1,2) and Pietro Azzari(2)
(1) DEIS - Department of Electronics, Computer Science and Systems, University of Bologna, Viale Risorgimento 2, Bologna, 40136, Italy
(2) ARCES - Advanced Research Center for Electronic Systems, University of Bologna, Via Toffano 2/2, Bologna, 40125, Italy
{abevilacqua,pazzari}@arces.unibo.it
Abstract. Image mosaicing is stirring up a lot of interest in the research community for both its scientific significance and its potential spin-off in real-world applications. Being able to perform automatic image alignment in a common tonal and spatial reference can trigger a wide range of higher-level image processing tasks such as panoramic image construction, scene depth computation, resolution enhancement, and motion detection and tracking using a non-stationary camera. In this work we propose a fully automated, real-time, on-line mosaicing algorithm able to build high-quality seam-free panoramic images. Moreover, the whole approach does not exploit any a priori information regarding scene geometry, acquisition device properties or feedback signals, thus resulting in a fully image-based solution. Extensive experiments have been accomplished to assess the quality of the attained mosaics by using them as the background to perform motion detection and tracking with a Pan Tilt Zoom camera.
1 Introduction
Image mosaicing is a popular method to effectively increase the field of view of a camera by allowing several views of the same scene to be combined into a single image. Such a technique enables the creation of a camera with a larger virtual field of view, preserving the original resolution and without introducing undesirable lens deformations (e.g. from wide-angle lenses). Mosaicing methods ought to be robust with respect to illumination changes, scene multimodality (i.e. waving trees and hedges), moving objects, and imaging device noise. Furthermore, consistent handling of arbitrary camera rotations and zoom is indeed a desirable property. Other camera-related aspects such as the intrinsics (focal length, principal point location) should not significantly degrade the performance of the system, and neither should demanding scene geometry (very close or structured environments). Finally, inherently real-time applications, such as visual surveillance, require the method to perform on line at acquisition rate;
global coherency through a sequential processing scheme is still an unanswered question (namely, frames are added incrementally according to their temporal ordering). The mosaicing method proposed in this paper tries to address all of the above-mentioned issues. Being able to perform automatic image alignment in a common tonal and spatial reference can trigger a wide range of higher-level image processing tasks such as panoramic image construction, scene depth computation, resolution enhancement, and motion detection and tracking using a non-stationary camera. Using the background subtraction technique, with a precomputed mosaic of the scene as background, is likely to offer the best tradeoff between the quality of the detected moving masks and the computational cost, as happens when using conventional stationary cameras. Using this approach basically means comparing the current frame with a reference scene (a previously computed background). Moving "blobs" (aggregates of pixels) are identified by thresholding these differences. Many attempts to improve the overall performance of motion detection systems have been made, by improving the background detection techniques, exploiting color information or using PTZ cameras to widen the surveyed area. There are many good previous methods for performing background difference, but none of them can be trivially extended to work with hinged Pan-Tilt-Zoom cameras, which are allowed to pan, tilt, rotate and/or zoom about their optical center. Extending background subtraction algorithms requires having a sound and reliable background mosaic at one's disposal and, obviously, a robust technique to construct and maintain it. This work describes a real-time image mosaicing technique devised to construct high-quality mosaics from video sequences. This has been realized by developing an effective yet efficient color-based background mosaicing algorithm. The algorithm has been conceived to be completely image based, so that it does not rely on any prior information regarding camera intrinsics (focal length, distortion coefficients), scene geometry or feedback signals coming from the imaging device (pan/tilt angular movements, exposure settings). This choice makes the algorithm hardware independent, thus resulting in a general purpose approach. The quality of the attained mosaics has been verified in an automated visual surveillance context, where the algorithm has been employed to extend the functionalities of a stationary camera motion detector to work with PTZ (Pan Tilt Zoom) cameras. Experimental results prove the effectiveness of the solution even in such a demanding case study. This paper is organized as follows. Section 2 provides an overview of the state of the art in the image mosaicing research field. Section 3 describes the mosaic building phase, detailing both spatial and tonal alignment methods. Some experimental results on real sequences are reported in Section 4, followed by our concluding remarks and some directions for future work in Section 5.
2 Previous Work
During the last decades a large number of publications have addressed the broad topic of image registration; for a comprehensive survey refer to [1]. In their
diversity, the published methods share a lot of theoretical and technical aspects and can be roughly classified into two main classes: direct methods and feature-based methods. Inside these super-classes, methods can be further distinguished according to the preferred geometrical model, the illumination variation model and the allowance for moving objects. Moreover, the use of procedures that account for optical distortion and vignetting degradation should be considered as distinctive features as well. It goes without saying that all of them have their pros and cons, and the choice is driven by the specific context.
– Direct methods: Direct methods usually attempt to iteratively estimate the transformation parameters by minimizing an error function based on the intensity difference in the area of overlap [2,3,4]. The advantage of direct methods is that very accurate registration can be achieved since all the available data are used, and many physical phenomena in the image formation process (such as illumination changes [2], lens distortion [4] and vignetting degradations) can easily be introduced into the objective function. The main drawback of this class of methods is the high computational cost, because the non-linear nature of the error function to be minimized requires a complex iterative algorithm. Of course, to avoid local minima, an initial guess for the transformation is required. Since the use of direct methods usually prevents real-time capabilities, they are often employed in global registration methods. Such an option leads to globally coherent mosaics but also requires having the whole sequence in advance, ruling out even the possibility of performing slowly but on line [2,3]. Furthermore, the presence of moving objects in the scene can cause serious problems because all the pixel values are taken into account [2,3].
– Feature-based methods: Instead of using all the available data, feature-based methods try to establish feature point correspondences between the images to be registered [5,6,7,8,9,10]. Many different features have been used in the literature, including region, line and point features; the last are the most used and are known as corner points ([11,12]). After the corners have been found, they are matched by utilizing a correlation measure in the local neighborhood. Unlike direct methods, they do not require the initialization step. By selecting appropriate features [7,10], these methods can be very robust to illumination changes, image rotations and zooming. Furthermore, moving objects are definitely allowed in the scene, since outlier feature points can be robustly detected and removed with appropriate methods [5,8].
3 Image Mosaicing Method
3.1 Geometric Alignment
A mosaic is a compound image built by properly composing (aligning) a large number of frames and warping them into a common reference coordinate system, both spatial and tonal. The result consists of a single image of greater resolution or spatial extent. Usually, mosaicing techniques are concerned with collections of frames which do not exhibit parallax effects, so that seamless
stitching can be accomplished without recovering a precise 3D scene structure. This requirement is known to be satisfied if the images are taken according to the following assumptions:
– images of an arbitrary scene, acquired with a camera free to pan, tilt and rotate about its optical axis, are taken from the same location (to the authors' knowledge this is the case in most wide-area surveillance applications);
– images of a planar scene are taken from arbitrary locations;
– images are acquired using an ideal pin-hole camera (namely, near-zero lens distortion coefficients are assumed).
Under these assumptions, planar perspective projections (homographies) represent the relationship between overlapping frames captured by the imaging device:

x_i = (A x_{i−1} + b) / (c^T x_{i−1} + 1) = H x_{i−1},   (1)

where A = [a_11, a_12; a_21, a_22] is a 2 × 2 scaling and rotation matrix, b = [b_1, b_2] is a 2 × 1 translation vector and c = [c_1, c_2] is a chirping, or "perspective," vector. x_i = [x_i, y_i]^T and x_{i−1} = [x_{i−1}, y_{i−1}]^T denote the spatial coordinates of the matched points between the two images. The estimation of the homography matrix H from a couple of temporally adjacent frames is known as frame-to-frame (F2F) or pairwise alignment. The frame projection matrices {Q_1, Q_2, ..., Q_i, ...}, relating an arbitrary frame I_i to the reference frame of the mosaic (which corresponds to that of the unwarped frame I_0), are obtained by concatenating the previous pairwise transformations:

Q_i = H_i Q_{i−1} = ∏_{k=0}^{i} H_k,   (2)
where H_0 is the identity matrix. The correctness of this method is guaranteed by the fact that homographic mappings constitute a group structure, so that the functional composition of two (or more) homographic transformations is again a homographic mapping. Using the frame projection matrices, an image mosaic can be constructed by projecting all frames onto the mosaic plane. Because the method implements a sequential structure in temporal order, the first frame is selected as the reference frame; however, this constraint can easily be relaxed using the same theoretical construction. To estimate the projective transformation, a pure feature-based approach has been adopted. Feature detection and matching is the most critical stage of the proposed method. Since the method must work reliably in real-world applications, feature points have to be highly distinctive in order to be correctly tracked even in demanding natural environments. In fact, system performance is strongly affected by how accurately the feature points can be detected and matched. Different approaches presented in the literature have been tested, including the Kanade-Lucas-Tomasi tracker (KLT) [12], matching of Harris corners, and the more recent Scale and Rotation Invariant (SIFT) features [11]. The only
one able to meet our requirements has been the KLT tracker. Such features show a suitable degree of robustness with respect to large inter-frame deformations (shift, rotation and perspective warp) and illumination changes when continuous video streams are processed, while preserving real-time performance. An initial guess estimator, relying on a phase correlation approach [13], has been introduced to assist the KLT tracker in difficult situations. Such a prior step represents a serious improvement over the stand-alone approach, permitting the field of applicability to be extended to the case of large camera shifts. This initial step determines a coarse estimate of the movement exhibited by the observer so as to provide an initial guess to the feature tracker. This solution allows large camera rotations to be handled using small search and correlation windows, once again with great benefits in terms of soundness and performance. After the feature matching stage, we have a set of feature correspondences. Due to the scene structure or to moving objects, some features may not be tracked correctly, thus originating wrong matches. These wrong correspondences (hereinafter model outliers) cannot be explained by the true model parameters. Accordingly, when they are employed to perform the model estimation they could bias the final result, leading to a poor (homographic) fitting. Several methods are known for estimating the parameters of a mathematical model from a set of observed data which contains outliers (e.g. RANSAC, LMedS). In this work we have decided to utilize an enhanced version of the original RANSAC (RANdom SAmple Consensus) algorithm, similar to that suggested in [9]. The former method considers many random data subsets, each containing the minimum number of samples required to compute the model parameters exactly, and selects the parameter set which has the largest number of inliers. The ill-conditioned nature of the problem suggests refining the model parameters using as large an amount of data as possible, namely every point correspondence deemed to be an inlier. The refined homography estimate returned by the previous estimation stage is used to guide a search for more interest point correspondences. The search procedure is similar to that used to obtain the initial correspondences: given an interest point x in the first image, a match is sought in a search window centred on the expected position x' = Hx in the second image. Because the search is now guided, the size of the search window can be greatly reduced. The new set of inliers is again used to refine the estimate of H. The estimation and guided matching stages are repeated until the number of inliers stabilizes. This procedure achieves a fast and accurate pairwise alignment even in the presence of moving objects; the next challenge is how to obtain a globally coherent mosaic. As mentioned before, when several frames overlap in mosaic space, global registration is indispensable to minimize the accumulated projection errors due to the pairwise registration. Usually, mosaicing methods require that the images to be mosaiced are captured in advance [2,3,4,7,9,10]. Taking the whole set of images as input, a global registration procedure using non-linear objective function minimization is performed. Global registration reduces the drift error by simultaneously minimizing the misregistration between all overlapping pairs of images. Such an approach has been proposed
using both direct [2,4] and feature-based frameworks [7,9,10]; however, these methods are computationally very intensive and cannot be trivially extended to the case of moving objects in the scene. Given the objectives of the proposed approach, only past information should be used to predict the transformation of the current frame. Besides, since we want our method to perform on line in a sequential manner while preserving real-time capabilities, global registration methods based on bundle adjustment are to be discarded. Here, sequential means that all frames are inserted into the mosaic in temporal order and a sort of global registration should be performed whenever an image is added. The problem with pairwise registration stems from its local optimality, insofar as only temporal contiguity is accounted for, leaving spatial adjacency out of consideration. To overcome this restriction, the method has been designed to be self-correcting by taking into account the history of the pixel values during spatial registration, using a technique known as frame-to-mosaic (F2M) alignment [3]. Since the current frame and its corresponding sub-mosaic region can be fairly different in terms of exposure, our method features an earlier frame-to-frame alignment to exploit the similarity due to the temporal adjacency of frames coming from a video sequence. This hybrid approach gives the ability to align groups of frames by considering both their spatial (F2M) and temporal (F2F) contiguity, yielding a sensible improvement in the global coherency of the mosaic. The feedback-corrected projection matrices {M_1, M_2, ..., M_i, ...}, relating an arbitrary frame I_i to the reference frame of the mosaic, are now computed through the composition of the corresponding pairwise projection matrices H_i with those obtained by F2M alignment, H'_i:
M_i = H'_i H_i M_{i−1} = ∏_{k=0}^{i} H'_k H_k,   (3)
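A minimal sketch of how such 3 × 3 projection matrices might be applied and chained (Eqs. (1)-(3)) is given below; the homogeneous-coordinate convention and the function names are our assumptions, not the authors' implementation.

```python
import numpy as np

def warp_point(H, x, y):
    """Apply a 3x3 planar homography to an image point, as in Eq. (1)."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

def accumulate_projection(M_prev, H_f2f, H_f2m=None):
    """Chain the pairwise (F2F) homography and, optionally, the frame-to-mosaic
    (F2M) correction onto the previous frame-to-mosaic projection (Eqs. (2)-(3))."""
    M = H_f2f @ M_prev
    if H_f2m is not None:
        M = H_f2m @ M
    return M
```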
Furthermore, the method is able to detect situations where the image cannot be registered reliably. The images that cannot be suitably aligned are skipped, based on a similarity metric computed between a frame and its corresponding mosaic region and on some other diagnostic parameters extracted from the estimated homography matrix.
3.2 Photometric Alignment Estimation
Tonal misalignments, due to global illumination changes or fluctuations in the images’ exposure, are common occurrences with digital photographs. If they are not properly taken into account the resulting panorama will appear to have seams which do not correspond to any physical structure of the scene, even when the images are blended in overlapping regions. These color gradients can strongly affect the registration process as well as further processing involving the mosaic. For example, in a typical visual surveillance system, the motion detector based on background subtraction may erroneously interpret these artifacts as moving objects, thus generating false alarms. As a consequence, a mosaicing technique
must face the problem of illumination changes. Since we assume that any remaining discrepancies between corresponding pixels after a successful spatial registration are due to photometric misalignments, tonal alignment is performed after the spatial registration stage. The illumination changes we are interested in are:
– automatic camera exposure adjustments (e.g. AGC);
– environmental illumination changes (e.g. daytime changes).
Many methods have been published on the topic of exposure equalization of partially overlapping frames; most of them do not explicitly model the physical phenomena that cause such intensity variations among corresponding pixels. The works in [7,10] face this problem using spatially-varying weighting functions (feathering) and smart seam placement to minimize the visible impact. The proposal in [14] can be considered as the pioneering work from which others take inspiration. Instead of using a single transition width, a frequency-adaptive width is used by creating a band-pass (Laplacian) pyramid and making the transition widths a function of the pyramid level. However, all these efforts aim to find the best way to somewhat hide such misalignments rather than to correct for them. Pyramid and gradient domain blending can do a good job of compensating for moderate amounts of exposure difference between images. However, when the exposure differences become large, alternative approaches may be necessary. The method in [2] yields good results in photometric alignment by approximating the camera's nonlinear comparametric function with a constrained piecewise linear function. The main drawback is the high computational cost involved in such an estimation, which makes it incompatible with real-time processing. A more principled approach to exposure compensation is to estimate a single high dynamic range (HDR) radiance map from the differently exposed images [15,16]. Once again, these proposals are too time-consuming to be considered as a suitable solution. Besides time performance considerations, the preferred candidate method should be tolerant with respect to a wide variety of issues. Spatial registration inaccuracies (e.g. due to homographies not modeling the camera motion, or small alignment errors) and the presence of moving objects represent issues to account for. The previous considerations have prompted us to exploit a histogram-based approach, which allows us to partly overcome the above-mentioned inaccuracies. The histogram specification technique is a histogram-based approach that aims to find an intensity mapping function able to transform a given cumulative histogram H_1 into a target cumulative histogram H_2 belonging to a reference image (see [17] for further details). Practically speaking, in the case of gray-level images only 256 pairs of points (u_1, u_2) derived from H_1 and H_2 are considered, and the outcome is simply a look-up table (LUT):

u_2 = H_2^{−1}(H_1(u_1)).   (4)
This method was conceived to work with gray-scale images, and porting it to color image processing requires handling the correlation between color channels, which are perceptually non-uniform. More details are given in [8].
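A sketch of the histogram specification LUT of Eq. (4) for 8-bit grayscale (uint8) images follows; the searchsorted-based approximation of H_2^{-1} and the function name are our assumptions, not the authors' implementation.

```python
import numpy as np

def histogram_specification_lut(source, reference, levels=256):
    """Build the 256-entry LUT u2 = H2^{-1}(H1(u1)) mapping the cumulative
    histogram of `source` onto that of `reference` (Eq. (4))."""
    h1 = np.cumsum(np.bincount(source.ravel(), minlength=levels).astype(np.float64))
    h2 = np.cumsum(np.bincount(reference.ravel(), minlength=levels).astype(np.float64))
    h1 /= h1[-1]
    h2 /= h2[-1]
    # For each source level, find the reference level with the closest cumulative value.
    lut = np.searchsorted(h2, h1).clip(0, levels - 1).astype(np.uint8)
    return lut

# Example of applying the tonal alignment to a frame:
# aligned = histogram_specification_lut(frame_gray, mosaic_region_gray)[frame_gray]
```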
4 Experimental Results
We have carried out extensive tests of the proposed method on several video sequences captured from real-world scenes in order to evaluate the performance of the system in terms of the accuracy of the resulting mosaics. Moreover, the proposed mosaicing algorithm has been integrated into a system that detects and tracks moving objects using a stationary camera (see [18] for details), in order to assess the reliability of the detected moving masks when using a PTZ camera. To this purpose, three challenging indoor and outdoor sequences have been considered, differing in illumination, scene structure, depth of field of view, number of moving objects in the scene and imaging device. Both sequences are of 320 × 240 pixel resolution. The camcorder panned by hand across the scene back and forth many times, to emphasize the usual looping path problem suffered by many mosaicing algorithms. The camcorder was hinged on a tripod in order to make it rotate about its optical center; however, no great care was taken to assure this property. The target machine is a standard PC equipped with an AMD 2000+ and 1 GB of RAM. All the code is written in the C++ programming language. The first sequence S1 is 1121 frames long and consists of a wide field of view capture of the interior of our lab. As one can see in Fig. 1, both the mosaic (top) and the planimetry (bottom) highlight the difficulties involved in processing such a sequence. The very close environment, together with the barely fulfilled assumption of a planar scene, can emphasize even slight off-center rotations, highlighting parallax effects. In particular, such concerns are likely to emerge next to significant relative depth variations (near the red door and the wall on both sides). Moreover, the vicinity of the foreground object requires fast camera rotations to perform tracking operations, leading to large inter-frame shifts. Despite such strong deviations from the ideal settings, the system performs consistently and accurately, as one can see from the samples shown in Fig. 2. Here the detected motion masks are superimposed on the original frames to ease visual inspection; they always adhere to the real body shape of the moving person irrespective of its position inside the scene (namely the mosaic) or its distance from the observer. The second sequence S2 is composed of 809 frames and deals with an outdoor environment with a very close building and, again, a walking person (see Fig. 3). Here the problems are no longer related to the structure of the scene, which can substantially satisfy the assumption of planarity. Rather, the proximity of the moving person (and the inter-frame displacements it causes) and the homogeneous wall represent a hard testing ground for single-resolution patch-based features, which can lead to erroneous motion parameter estimation and, accordingly, to alignment errors. The experimental results regarding this sequence emphasize the feature matching improvement granted by the use of an early displacement estimate provided by the phase correlation module together with a robust estimator such as the modified RANSAC. The conventional KLT tracker is initialized with the phase
Fig. 1. Mosaic created through processing the indoor sequence S1 (top), planimetry of the environment and cone of view (bottom)
Fig. 2. Motion detection sample frames from the indoor sequence S1
Fig. 3. Mosaic created through processing the outdoor sequence S2 (top), planimetry of the environment and cone of view (bottom)
correlation guess, and then the model is determined and refined by pruning badly tracked features through the RANSAC consistency check. Once again the detected moving masks reflect the presence and the real shape of the moving objects, as one can see from the samples shown in Fig. 2. As one can see in Fig. 2, the achieved moving masks denote the presence of, and adhere to the silhouettes of, the moving objects and their cast shadows. The motion detector performs well regardless of the lighting conditions and the distance of the moving objects. The ability to process color images without significant performance loss enables straightforward methods to be employed to remove annoying shadows using intensity-chromaticity color spaces.
Fig. 4. Motion detection sample frames from the outdoor sequence S2
5 Summary and Future Work
In this work we have presented a fully automatic, real-time and general purpose image mosaicing algorithm. The proposed method is able to perform consistently in a wide range of real-world contexts, e.g. indoor and outdoor scenes, due to the joint spatial and tonal relationship modeling implemented in the algorithm. In addition, the system works independently of the imaging device, since it is completely image-based and does not rely on any a priori assumption regarding the intrinsic parameters of the camera. The dual alignment stage prevents error accumulation and allows globally coherent mosaics to be constructed, leading to near optimal results even without resorting to computationally intensive global adjustment procedures. High-speed performance is achieved by exploiting fast feature-based techniques, even though these were not originally designed for registration purposes and are affected by projective and illumination changes. To cope with this, we introduce an initial guess for registration provided by the efficient phase correlation method, which enables us to handle large and complex camera motions (shift, rotation and exposure changes). The accuracy and the high processing speed allow the algorithm to be used in visual surveillance systems performing on-line, high-quality motion detection (i.e. using the well-established background difference approach). The effectiveness of the method we devised has been proved through extensive experiments carried out using challenging real-world video sequences. As for future work, the general purpose aspect of the system will be improved by addressing issues regarding the identification of the optical properties of the imaging device from the video sequence only. In particular, we are referring to the estimation of the most important intrinsic parameters (focal length and principal point) and of the lens distortion coefficients. This will hopefully lead to complete independence from the imaging device and a significant refinement of both the spatial and the tonal alignment. Finally, a fast implementation of scale and rotation invariant features will provide a more accurate mosaicing method, yielding, accordingly, a more reliable motion detection.
References
1. Zitova, B., Flusser, J.: Image registration methods: a survey. International Journal of Image and Vision Computing 21(11), 977–1000 (2003)
2. Candocia, F.M.: Jointly registering images in domain and range by piecewise linear comparametric analysis. IEEE Transactions on Image Processing 12(4) (2003)
3. Winkelman, F., Patras, I.: Online globally consistent mosaicing using an efficient representation. In: Proceedings of the IEEE International Conference on Multimedia Computing and Systems, October 2004, pp. 3116–3121. IEEE Computer Society Press, Los Alamitos (2004)
4. Sawnhey, H.S., Kumar, R.: True multi-image alignment and its applications to mosaicing and lens distortion correction. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(3), 235–243 (1999)
5. Bevilacqua, A., Di Stefano, L., Azzari, P.: An effective real-time mosaicing algorithm apt to detect motion through background subtraction using a PTZ camera. In: Proceedings of IEEE Conference on Advanced Video and Signal based Surveillance, vol. 1, pp. 511–516. IEEE Computer Society Press, Los Alamitos (2005)
6. Zhu, Z., Xu, G., Riseman, E.M., Hanson, A.R.: Fast generation of dynamic and multi-resolution 360° panorama from video sequences. In: Proceedings of the IEEE International Conference on Multimedia Computing and Systems, July 1999, pp. 400–406. IEEE Computer Society Press, Los Alamitos (1999)
7. Brown, M., Lowe, D.: Recognising panoramas. In: ICCV'03. Proceedings of the International Conference on Computer Vision, pp. 1218–1225 (2003)
8. Azzari, P., Bevilacqua, A.: Joint spatial and tonal mosaic alignment for motion detection with PTZ camera. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4142, Springer, Heidelberg (2006)
9. Capel, D.P.: Image Mosaicing and Super-Resolution. University of Oxford, Oxford (2001)
10. Eden, A., Uyttendaele, M., Szeliski, R.: Seamless image stitching of scenes with large motions and exposure differences. In: CVPR'06. Proceedings of the International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2498–2505 (2006)
11. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
12. Tomasi, C., Shi, J.: Good features to track. In: CVPR'94. Proceedings of IEEE Computer Vision and Pattern Recognition, pp. 593–600. IEEE Computer Society Press, Los Alamitos (1994)
13. Zokai, S., Wolberg, G.: Image registration using log-polar mappings for recovery of large-scale similarity and projective transformations. IEEE Transactions on Image Processing 14(10), 1422–1434 (2005)
14. Burt, P.J., Adelson, E.H.: A multiresolution spline with application to image mosaics. ACM Transactions on Graphics 2, 217–236 (1983)
15. Kim, J., Pollefeys, M.: Radiometric self-alignment of image sequences. In: CVPR'04. Proceedings of the International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 645–651 (2004)
16. Grossberg, M.D., Nayar, S.K.: Determining the camera response function from images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(11) (2003)
17. Grundland, M., Dogson, N.A.: Color histogram specification by histogram warping. In: Proceedings of SPIE Color Imaging X: Processing, Hardcopy, and Applications, January 2005, pp. 610–621 (2005)
18. Bevilacqua, A., Di Stefano, L., Lanza, A.: An efficient motion detection algorithm based on a statistical non parametric noise model. In: ICIP 2004. Proceedings IEEE ICIP, October 2004, pp. 2347–2350. IEEE Computer Society Press, Los Alamitos (2004)
High Accuracy Optical Flow Method Based on a Theory for Warping: Implementation and Qualitative/Quantitative Evaluation
Mohammad Faisal and John Barron
Dept. of Computer Science, The University of Western Ontario, London, Ontario, Canada, N6A 5B7
{mfaisal,barron}@csd.uwo.ca
Abstract. We describe the implementation of a 2D optical flow algorithm published at the European Conference on Computer Vision (ECCV 2004) by Brox et al. [1] (best paper award) and a qualitative and quantitative evaluation of it on a number of synthetic and real image sequences. Their optical flow method combines three assumptions: a brightness constancy assumption, a gradient constancy assumption and a spatio-temporal smoothness constraint. A numerical scheme based on fixed point iterations is used. Their method uses a coarse-to-fine warping strategy to measure larger optical flow vectors. We have investigated the algorithm in detail and our evaluation of the method demonstrates that it produces very accurate optical flow fields from only 2 input images. Keywords: optical flow, regularization, warping, multiscale pyramid, brightness/gradient, smoothing constraints.
1 Introduction
Optical flow estimation is still an open research area in computer vision. While many methods have been proposed, Brox, Bruhn, Papenberg and Weickert [1] presented a variational approach at ECCV 2004 that they claimed gave the best quantitative flow (up to that time). Later, Papenberg, Bruhn, Brox, Didas and Weickert [2] added a few additional constraints to this algorithm and got even better results. We implemented Brox et al.'s algorithm (we started before Papenberg et al.'s algorithm was published) and investigated why it produces such good flow fields using only 2 frames and with such poor temporal intensity differentiation (simple pixel differences). We quantitatively evaluated our implementation on a number of image sequences, including the Yosemite Fly-Through sequence made in 1988 by Lynn Quam at SRI International [3]. It consists of 15 textured depth maps of a mountain range and a valley in Yosemite National Park (with the camera moving towards a point in the valley to create a diverging flow field) and fractal-generated clouds that move left to right at 2 pixels per frame. Given a depth map and the instantaneous camera motion, the correct image velocity field can be computed [4]. This sequence is perhaps one of the
Fig. 1. (a) The middle frame of the cloudy Yosemite sequence and (b) the correct flow field for (a). (c) The middle frame of the cloudless Yosemite sequence and (d) the correct flow field for (c).
most complex synthetic image sequences freely available, as the mountain range has varying depth and the clouds are deformable objects. Michael Black suggests that because there is no real ground truth for the cloud motion (the clouds are modelled as fractals undergoing Brownian motion; see http://www.cs.brown.edu/black/), reporting errors for the cloud region is a bit meaningless. Thus, the sequence is often evaluated without the clouds; in this paper we evaluate the flow for both sequences. Figure 1 shows the central cloudy and cloudless frames of this sequence plus their correct flows.
2 Relevant Literature
Here, we briefly review the papers whose cloudy/cloudless Yosemite results were quoted by Brox et al. [1]. These techniques comprise the best contemporary
optical flow methods and all give a quantitative analysis of the cloudy and/or cloudless Yosemite sequences. The classical Horn and Schunck optical flow algorithm [5] is a global regularization of a data term consisting of the motion constraint equation and a smoothness term constraining the velocity to vary smoothly everywhere. Lucas and Kanade [6] assume the velocity is constant in local neighbourhoods and formulate a least squares calculation of the velocity for each neighbourhood. Ju, Black and Jepson [7] propose a "Skin and Bones" model to compute optical flow using a parametrized (affine) flow model with a smoothness constraint on the flow parameters to ensure continuity of motion between patches. Lai and Vemuri [8] propose a gradient-based regularization method that includes a contour-based motion constraint equation that is enforced only at zero-crossings. Bab-Hadiashar and Suter [9] formulate the measurement of optical flow as an over-determined set of linear equations. They use 2 robust estimation techniques: Least Median of Squares and Least Median of Squares Orthogonal Distances. Alvarez, Weickert and Sánchez [10] introduce improvements to Nagel and Enkelmann's [11] optical flow work that include hierarchical processing and an energy term that is invariant under linear brightness changes. Farnebäck's [12] algorithm has 3 distinct components: estimation of spatio-temporal tensors, estimation of parametric motion models and simultaneous segmentation of the motion field. Mémin and Pérez [13] propose a robust energy-based model for the incremental estimation of optical flow in a hierarchical piece-wise parametric minimization of an energy functional in regular or adaptive meshes at each hierarchical level, from the coarsest to the finest. Bruhn, Weickert and Schnörr [14,15] propose a method that combines local and global methods, in particular those of Horn and Schunck and of Lucas and Kanade. The data term in the Horn and Schunck regularization is now replaced by the least squares Lucas and Kanade constraint.
3 3D Angular Error
All authors used the angular error measure proposed by David Fleet (see Barron et al. [16]). Velocity may be viewed as a space-time direction vector (u, v, 1) in units of (pixel, pixel, frame), so errors can be measured as angular deviations from the correct space-time orientation. We write velocities as normalized 3D vectors, $\hat{v} \equiv \frac{1}{\sqrt{u^2 + v^2 + 1}}(u, v, 1)^T$. The angular error between the correct velocity $v_c$ and an estimate $v_e$ is then $\psi_E = \arccos(\hat{v}_c \cdot \hat{v}_e)$. This metric takes both direction and magnitude error into account as a single number.
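As a concrete illustration, this error measure can be evaluated directly over per-pixel flow arrays; the sketch below (NumPy, with our own function and variable names) is one minimal way to do so and is not taken from the paper.

```python
import numpy as np

def angular_error(u_est, v_est, u_true, v_true):
    """Fleet/Barron angular error (degrees) between an estimated and the correct flow.

    Each argument is a 2D array of horizontal/vertical velocities in pixels
    per frame; the result is a per-pixel error map.
    """
    # Extend each velocity to a 3D space-time direction (u, v, 1) and normalize it.
    norm_est = np.sqrt(u_est**2 + v_est**2 + 1.0)
    norm_true = np.sqrt(u_true**2 + v_true**2 + 1.0)
    # Dot product of the normalized direction vectors.
    cos_psi = (u_est * u_true + v_est * v_true + 1.0) / (norm_est * norm_true)
    cos_psi = np.clip(cos_psi, -1.0, 1.0)   # guard against round-off
    return np.degrees(np.arccos(cos_psi))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u_t, v_t = rng.normal(size=(2, 64, 64))
    u_e = u_t + 0.05 * rng.normal(size=(64, 64))
    v_e = v_t + 0.05 * rng.normal(size=(64, 64))
    err = angular_error(u_e, v_e, u_t, v_t)
    print(err.mean(), err.std())   # mean and standard deviation as reported in the tables
```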
4 Brox et al.'s Variational Model
The Brox et al. [1] method is composed of several constraints. The Grayvalue Constancy Assumption requires that the grayvalue of a pixel does not change as it undergoes motion: $I(x, y, t) = I(x + u, y + v, t + 1)$. A 1st order Taylor series expansion leads to the standard motion constraint equation $I_x u + I_y v + I_t = 0$.
The Gradient Constancy Assumption requires that the gradient of the image grayvalues not vary due to the displacement: $\nabla I(x, y, t) = \nabla I(x + u, y + v, t + 1)$. Brox et al.'s Smoothness Assumption is the standard Horn and Schunck smoothness constraint. Lastly, Brox et al. use a multiscale hierarchical (pyramid) approach, which we describe below. Brox et al.'s energy function penalizes deviation from their model assumptions. Their data term is:
$$E_{Data}(u, v) = \int_{\Omega} \Psi\big(|I(\mathbf{x}+\mathbf{w}) - I(\mathbf{x})|^2 + \gamma\,|\nabla I(\mathbf{x}+\mathbf{w}) - \nabla I(\mathbf{x})|^2\big)\,d\mathbf{x}, \qquad (1)$$
where $\Omega$ is the region of interest (the image) over which the minimization is done and $\Psi(s^2) = \sqrt{s^2 + \epsilon^2}$. The small positive constant $\epsilon$ keeps $\Psi(s^2)$ convex, which helps the overall minimization process as there is then a unique global minimum for most images (consider two constant grayvalue images as one possible exception). Brox et al. use $\epsilon = 0.0001$. A Horn and Schunck smoothness term is also used:
$$E_{smooth}(u, v) = \int_{\Omega} \Psi\big(|\nabla_3 u|^2 + |\nabla_3 v|^2\big)\,d\mathbf{x}, \qquad (2)$$
with the same function $\Psi$. The spatio-temporal gradient $\nabla_3 = (\partial_x, \partial_y, \partial_t)$ indicates that a spatio-temporal smoothness assumption is involved. Here, we have only used two images, so the spatio-temporal gradient ($\nabla_3$) becomes the spatial gradient ($\nabla_2$). The total energy is the weighted sum of the data term and the smoothness term,
$$E(u, v) = E_{Data} + \alpha E_{smooth}, \qquad (3)$$
for some regularization parameter $\alpha > 0$. Note that $E(u, v)$ is highly nonlinear. The goal is to find the u and v values that minimize this energy over the whole image. We use the same mathematical abbreviations that Brox et al. used in their paper (via their Equation (8) in the algorithm description [1]): $I_x = \partial_x I(\mathbf{x}+\mathbf{w})$, $I_y = \partial_y I(\mathbf{x}+\mathbf{w})$, $I_z = I(\mathbf{x}+\mathbf{w}) - I(\mathbf{x})$, $I_{xx} = \partial_{xx} I(\mathbf{x}+\mathbf{w})$, $I_{xy} = \partial_{xy} I(\mathbf{x}+\mathbf{w})$, $I_{yy} = \partial_{yy} I(\mathbf{x}+\mathbf{w})$, $I_{xz} = \partial_x I(\mathbf{x}+\mathbf{w}) - \partial_x I(\mathbf{x})$ and $I_{yz} = \partial_y I(\mathbf{x}+\mathbf{w}) - \partial_y I(\mathbf{x})$. The functional to be minimized is:
$$f = \Psi\big(|I(\mathbf{x}+\mathbf{w}) - I(\mathbf{x})|^2 + \gamma|\nabla I(\mathbf{x}+\mathbf{w}) - \nabla I(\mathbf{x})|^2\big) + \alpha\,\Psi\big(|\nabla_3 u|^2 + |\nabla_3 v|^2\big) = \Psi\big(I_z^2 + \gamma(I_{xz}^2 + I_{yz}^2)\big) + \alpha\,\Psi\big(u_x^2 + u_y^2 + v_x^2 + v_y^2\big). \qquad (4)$$
We compute $u^{k+1} = u^k + du^k$ and $v^{k+1} = v^k + dv^k$ from $(u^k, v^k)$ and unknown correction terms $(du^k, dv^k)$. Brox et al. use an arbitrary reduction factor $\eta \in (0, 1)$ when downsampling in the pyramid, typically $\eta \in [0.80, 0.95]$, which allows smooth flow projections between adjacent images in the pyramid. An outer fixed point iteration is used to solve a linear system of equations in terms of $du^{k,l+1}$ and $dv^{k,l+1}$ at level $k+1$, holding $(\Psi)^{k,l}_{Data}$ and $(\Psi)^{k}_{Smooth}$ constant. A second inner fixed point iteration is used to handle the nonlinearity due to $(\Psi)^{k,l}_{Data}$ and
$(\Psi)^{k}_{Smooth}$ by updating these values each time the outer iteration converges. Once both inner and outer iterations have converged we have the flow at level k, giving us $w^k = (u^k, v^k)$. This is then used as the initial solution for $w^{k+1}$ on the next finer level. The iterative equations that minimize the Euler-Lagrange equations are:
$$(\Psi)^{k,l}_{Data}\Big(I_x^k(I_z^k + I_x^k\,du^{k,l+1} + I_y^k\,dv^{k,l+1}) + \gamma\big[I_{xx}^k(I_{xz}^k + I_{xx}^k\,du^{k,l+1} + I_{xy}^k\,dv^{k,l+1}) + I_{xy}^k(I_{yz}^k + I_{xy}^k\,du^{k,l+1} + I_{yy}^k\,dv^{k,l+1})\big]\Big) - \alpha\,\mathrm{Div}\big((\Psi)^{k}_{Smooth}\,\nabla_3(u^k + du^{k,l+1})\big) = 0 \qquad (5)$$
and
$$(\Psi)^{k,l}_{Data}\Big(I_y^k(I_z^k + I_x^k\,du^{k,l+1} + I_y^k\,dv^{k,l+1}) + \gamma\big[I_{yy}^k(I_{yz}^k + I_{xy}^k\,du^{k,l+1} + I_{yy}^k\,dv^{k,l+1}) + I_{xy}^k(I_{xz}^k + I_{xx}^k\,du^{k,l+1} + I_{xy}^k\,dv^{k,l+1})\big]\Big) - \alpha\,\mathrm{Div}\big((\Psi)^{k,l}_{Smooth}\,\nabla_3(v^k + dv^{k,l+1})\big) = 0. \qquad (6)$$
To compute the $I_*^{k+1}$ terms from the $I_*^{k}$ terms we use:
$$I_z^{k+1} \approx I_z^k + I_x^k\,du^k + I_y^k\,dv^k, \qquad (7)$$
$$I_{xz}^{k+1} \approx I_{xz}^k + I_{xx}^k\,du^k + I_{xy}^k\,dv^k \qquad (8)$$
and
$$I_{yz}^{k+1} \approx I_{yz}^k + I_{xy}^k\,du^k + I_{yy}^k\,dv^k. \qquad (9)$$
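To make the role of the robust penalty $\Psi$ and the two energy terms concrete, the sketch below evaluates the two-frame (spatial) version of the energy in (1)-(4) for a given flow field. The function and variable names are our own, the warp uses SciPy's bilinear interpolation, and discrete sums stand in for the integrals; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

EPS = 1e-4  # the small constant epsilon used inside Psi

def psi(s2):
    # Robust penalizer Psi(s^2) = sqrt(s^2 + eps^2); keeps the energy convex.
    return np.sqrt(s2 + EPS**2)

def warp(img, u, v):
    # Sample I(x + w) with bilinear interpolation, w = (u, v).
    h, w_ = img.shape
    yy, xx = np.mgrid[0:h, 0:w_].astype(np.float64)
    return map_coordinates(img, [yy + v, xx + u], order=1, mode='nearest')

def brox_energy(I0, I1, u, v, alpha=80.0, gamma=100.0):
    """Two-frame energy E = E_Data + alpha * E_Smooth for a candidate flow (u, v)."""
    Iw = warp(I1, u, v)
    # Spatial gradients of the warped second frame and of the first frame.
    Iy0, Ix0 = np.gradient(I0)
    Iyw, Ixw = np.gradient(Iw)
    data = psi((Iw - I0)**2 + gamma * ((Ixw - Ix0)**2 + (Iyw - Iy0)**2)).sum()
    uy, ux = np.gradient(u)
    vy, vx = np.gradient(v)
    smooth = psi(ux**2 + uy**2 + vx**2 + vy**2).sum()
    return data + alpha * smooth
```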
We have not been able to solve the linear system of equations using common numerical methods such as Gauss-Seidel or Successive Over-Relaxation (SOR) as Brox et al. did (a typical experience was convergence for the first 20-30 iterations followed by rapid divergence). Instead, we make two modifications to the equations to obtain two variants of Brox et al.'s algorithm, which we present in the next two sections. First, we adopt a different approach to solve the set of equations using standard averaging as Horn and Schunck did, as that leads to convergence (7 × 7 averaging was found to be the best). Second, we adopted Brox's technique from his Ph.D. thesis [17] to compute the smoothing term but then used iterations of Cramer's rule rather than directly using Gauss-Seidel or SOR (this is our 4-pt algorithm) to get convergence.

4.1 7 × 7 Averaging
We can express Equations (5) and (6) as a linear system of equations:
$$A\,du^{k,l+1} + B\,dv^{k,l+1} = E(du^{k,l+1}), \qquad (10)$$
$$C\,du^{k,l+1} + D\,dv^{k,l+1} = F(dv^{k,l+1}), \qquad (11)$$
where:
$$A = (\Psi)^{k,l+1}_{Data}\big[I_x^k I_x^k + \gamma(I_{xx}^k I_{xx}^k + I_{xy}^k I_{xy}^k)\big], \qquad (12)$$
$$B = (\Psi)^{k,l+1}_{Data}\big[I_x^k I_y^k + \gamma(I_{xx}^k I_{xy}^k + I_{xy}^k I_{yy}^k)\big], \qquad (13)$$
$$C = (\Psi)^{k,l+1}_{Data}\big[I_y^k I_x^k + \gamma(I_{yy}^k I_{xy}^k + I_{xy}^k I_{xx}^k)\big], \qquad (14)$$
$$D = (\Psi)^{k,l+1}_{Data}\big[I_y^k I_y^k + \gamma(I_{yy}^k I_{yy}^k + I_{xy}^k I_{xy}^k)\big], \qquad (15)$$
$$E = \alpha\,(\Psi)^{k,l}_{Smooth}\,\mathrm{Div}\big(\nabla_3(u^k + du^{k,l+1})\big) - (\Psi)^{k,l}_{Data}\big[I_x^k I_z^k + \gamma(I_{xx}^k I_{xz}^k + I_{xy}^k I_{yz}^k)\big] \qquad (16)$$
and
$$F = \alpha\,(\Psi)^{k,l}_{Smooth}\,\mathrm{Div}\big(\nabla_3(v^k + dv^{k,l+1})\big) - (\Psi)^{k,l}_{Data}\big[I_y^k I_z^k + \gamma(I_{yy}^k I_{yz}^k + I_{xy}^k I_{xz}^k)\big]. \qquad (17)$$
After some algebra and using the standard approximation $X_{xx} + X_{yy} \approx \bar{X} - X$ as in [5], we can write:
$$E = \alpha\,(\Psi)^{k,l}_{Smooth}\big[(\bar{u}^k - u^k) + (\overline{du}^{k,l+1} - du^{k,l+1})\big] - e, \qquad (18)$$
where
$$e = (\Psi)^{k,l}_{Data}\big[I_x^k I_z^k + \gamma(I_{xx}^k I_{xz}^k + I_{xy}^k I_{yz}^k)\big], \qquad (19)$$
and
$$F = \alpha\,(\Psi)^{k,l}_{Smooth}\big[(\bar{v}^k - v^k) + (\overline{dv}^{k,l+1} - dv^{k,l+1})\big] - f, \qquad (20)$$
where
$$f = (\Psi)^{k,l}_{Data}\big[I_y^k I_z^k + \gamma(I_{yy}^k I_{yz}^k + I_{xy}^k I_{xz}^k)\big]. \qquad (21)$$
We can solve for $du^{k,l+1}$ and $dv^{k,l+1}$ using Cramer's rule as:
$$\begin{pmatrix} du^{k,l+1} \\ dv^{k,l+1} \end{pmatrix} = \frac{1}{\det} \begin{pmatrix} D\alpha(\Psi)^{k,l}_{Smooth}(\bar{u}^k + \overline{du}^{k,l} - u^k) - De + \alpha^2((\Psi)^{k,l}_{Smooth})^2(\bar{u}^k + \overline{du}^{k,l} - u^k) - \alpha(\Psi)^{k,l}_{Smooth}\,e - B\alpha(\Psi)^{k,l}_{Smooth}(\bar{v}^k + \overline{dv}^{k,l} - v^k) + Bf \\ A\alpha(\Psi)^{k,l}_{Smooth}(\bar{v}^k + \overline{dv}^{k,l} - v^k) - Af + \alpha^2((\Psi)^{k,l}_{Smooth})^2(\bar{v}^k + \overline{dv}^{k,l} - v^k) - \alpha(\Psi)^{k,l}_{Smooth}\,f - C\alpha(\Psi)^{k,l}_{Smooth}(\bar{u}^k + \overline{du}^{k,l} - u^k) + Ce \end{pmatrix}, \qquad (22)$$
where $\det = AD + A\alpha(\Psi)^{k,l}_{Smooth} + D\alpha(\Psi)^{k,l}_{Smooth} + \alpha^2((\Psi)^{k,l}_{Smooth})^2 - BC$. We initialize $du^{k,1}$ and $dv^{k,1}$ to 0. Then we solve for the next set $(du^{k,l+1}, dv^{k,l+1})$, etc., by using these two equations and computing $\overline{du}^{k,l}$, $\overline{dv}^{k,l}$, $\bar{u}^k$, $\bar{v}^k$ as the average of a 7 × 7 window around each pixel.
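A minimal sketch of one sweep of this scheme is given below. It uses SciPy's uniform filter for the 7 × 7 window averages and applies Eq. (22) in an algebraically equivalent factored form; function and argument names are ours and this is illustrative only.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_average(field, size=7):
    # 7x7 window average around each pixel, as used for u-bar, v-bar, du-bar, dv-bar.
    return uniform_filter(field, size=size, mode='nearest')

def solve_increment(A, B, C, D, e, f, u, du, v, dv, psi_s, alpha):
    """One Cramer's-rule sweep of Eq. (22); all inputs are per-pixel arrays."""
    rhs_u = alpha * psi_s * (local_average(u) + local_average(du) - u) - e
    rhs_v = alpha * psi_s * (local_average(v) + local_average(dv) - v) - f
    det = A * D + (A + D) * alpha * psi_s + (alpha * psi_s)**2 - B * C
    du_new = ((D + alpha * psi_s) * rhs_u - B * rhs_v) / det
    dv_new = ((A + alpha * psi_s) * rhs_v - C * rhs_u) / det
    return du_new, dv_new
```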
4.2 4-Point Differences
Brox [17] suggests using finite differences to calculate $\nabla u$:
$$|\nabla u_{i,j}|^2 = \left(\frac{u_{i+1,j} - u_{i-1,j}}{2}\right)^2 + \left(\frac{u_{i,j+1} - u_{i,j-1}}{2}\right)^2. \qquad (23)$$
Once we have the gradient magnitude $|\nabla u|$, we can calculate the div term for u as:
$$\partial_x\big(\Psi(|\nabla u|^2)\,u_x\big) + \partial_y\big(\Psi(|\nabla u|^2)\,u_y\big) = \frac{\Psi_{i+1,j}+\Psi_{i,j}}{2}(u_{i+1,j} - u_{i,j}) - \frac{\Psi_{i-1,j}+\Psi_{i,j}}{2}(u_{i,j} - u_{i-1,j}) + \frac{\Psi_{i,j+1}+\Psi_{i,j}}{2}(u_{i,j+1} - u_{i,j}) - \frac{\Psi_{i,j-1}+\Psi_{i,j}}{2}(u_{i,j} - u_{i,j-1}), \qquad (24)$$
where $\Psi$ refers to $(\Psi)^{k,l}_{Smooth} = \Psi\big(|\nabla(u^k + du^{k,l})|^2 + |\nabla(v^k + dv^{k,l})|^2\big)$. We replace the $\Psi$'s by weight coefficients $w_i$ as follows: $w_1 = \frac{\Psi_{i+1,j}+\Psi_{i,j}}{2}$, $w_2 = \frac{\Psi_{i-1,j}+\Psi_{i,j}}{2}$, $w_3 = \frac{\Psi_{i,j+1}+\Psi_{i,j}}{2}$, $w_4 = \frac{\Psi_{i,j-1}+\Psi_{i,j}}{2}$ and $w = w_1 + w_2 + w_3 + w_4$. We can write a similar equation for $\partial_x(\Psi(|\nabla v|^2)v_x) + \partial_y(\Psi(|\nabla v|^2)v_y)$. The div terms in Equations (5) and (6) now become:
$$w_1 u_{i+1,j} + w_2 u_{i-1,j} + w_3 u_{i,j+1} + w_4 u_{i,j-1} - w\,u_{i,j} + w_1\,du_{i+1,j} + w_2\,du_{i-1,j} + w_3\,du_{i,j+1} + w_4\,du_{i,j-1} - w\,du_{i,j} \qquad (25)$$
and
$$w_1 v_{i+1,j} + w_2 v_{i-1,j} + w_3 v_{i,j+1} + w_4 v_{i,j-1} - w\,v_{i,j} + w_1\,dv_{i+1,j} + w_2\,dv_{i-1,j} + w_3\,dv_{i,j+1} + w_4\,dv_{i,j-1} - w\,dv_{i,j}. \qquad (26)$$
The iterative solution is almost the same as for the averaging method, but we now replace the $(\Psi)_{Smooth}$ terms with the appropriate weights $w_*$. Again, using Cramer's rule we obtain:
$$\begin{pmatrix} du^{k,l+1} \\ dv^{k,l+1} \end{pmatrix} = \frac{1}{\det} \begin{pmatrix} D\alpha(\bar{u}^k + \overline{du}^{k,l} - w u^k) - De + \alpha^2 w(\bar{u}^k + \overline{du}^{k,l} - w u^k) - \alpha w e - B\alpha(\bar{v}^k + \overline{dv}^{k,l} - w v^k) + Bf \\ A\alpha(\bar{v}^k + \overline{dv}^{k,l} - w v^k) - Af + \alpha^2 w(\bar{v}^k + \overline{dv}^{k,l} - w v^k) - \alpha w f - C\alpha(\bar{u}^k + \overline{du}^{k,l} - w u^k) + Ce \end{pmatrix}, \qquad (27)$$
where the barred quantities are now the $w_1$-$w_4$ weighted sums of the four neighbours rather than 7 × 7 averages.
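For illustration, the weights $w_1$-$w_4$ and the weighted 4-neighbour divergence of Eqs. (24)-(26) can be assembled as in the sketch below (our own function name; boundary pixels are handled by edge replication, which is one possible realization of the reflecting boundary conditions).

```python
import numpy as np

def four_point_div(psi_s, field):
    """Weighted 4-neighbour divergence term of Eq. (24) for one channel (u or v).

    psi_s : per-pixel smoothness Psi values
    field : the flow component (or its increment) being diffused
    """
    # Replicate the border so the i+/-1, j+/-1 neighbours exist everywhere.
    P = np.pad(psi_s, 1, mode='edge')
    F = np.pad(field, 1, mode='edge')
    # w1..w4: averaged Psi between the centre pixel and each of its four neighbours.
    w1 = 0.5 * (P[2:, 1:-1] + psi_s)    # (i+1, j)
    w2 = 0.5 * (P[:-2, 1:-1] + psi_s)   # (i-1, j)
    w3 = 0.5 * (P[1:-1, 2:] + psi_s)    # (i, j+1)
    w4 = 0.5 * (P[1:-1, :-2] + psi_s)   # (i, j-1)
    w = w1 + w2 + w3 + w4
    return (w1 * F[2:, 1:-1] + w2 * F[:-2, 1:-1] +
            w3 * F[1:-1, 2:] + w4 * F[1:-1, :-2] - w * field)
```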
5 Experimental Results
Brox et al. used a 5-tap filter to compute the 1st order spatial derivatives, with coefficients {0.8666, -0.6666, 0.0, 0.6666, -0.8666}. They used a 3-tap filter with coefficients {-0.5, 0.0, 0.5} on the 1st order spatial derivatives to compute the 2nd order spatial derivatives. The input images were initially presmoothed with a 2D Gaussian with σ = 1.3. Since the pyramid levels have different widths and heights, when the velocities are projected down from a coarse level to a finer level we distribute the velocities in the finer level by multiplying the integer pixel coordinates by 1/η. This leaves some holes (along lines) in the flow field; these were filled by averaging. We coded our algorithm in Tinatool (version 5.0) [18,19], an open-source X Windows-based software package for computer vision algorithms.
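The differentiation step can be reproduced with separable 1D correlations; the sketch below uses the coefficient sets quoted above together with SciPy's Gaussian and 1D correlation filters (function names are ours, and reflecting boundaries are assumed as suggested by Brox).

```python
import numpy as np
from scipy.ndimage import correlate1d, gaussian_filter

# Coefficient sets quoted in the text (5-tap first derivative, 3-tap second derivative).
DERIV_5TAP = np.array([0.8666, -0.6666, 0.0, 0.6666, -0.8666])
DERIV_3TAP = np.array([-0.5, 0.0, 0.5])

def spatial_derivatives(img, sigma=1.3):
    """Presmooth with a 2D Gaussian, then compute Ix, Iy, Ixx, Ixy, Iyy."""
    I = gaussian_filter(img.astype(np.float64), sigma)
    Ix = correlate1d(I, DERIV_5TAP, axis=1, mode='reflect')
    Iy = correlate1d(I, DERIV_5TAP, axis=0, mode='reflect')
    # Second-order derivatives: 3-tap filter applied to the first-order results.
    Ixx = correlate1d(Ix, DERIV_3TAP, axis=1, mode='reflect')
    Ixy = correlate1d(Ix, DERIV_3TAP, axis=0, mode='reflect')
    Iyy = correlate1d(Iy, DERIV_3TAP, axis=0, mode='reflect')
    return Ix, Iy, Ixx, Ixy, Iyy
```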
5.1 Synthetic Results
Figure 2 shows the cloudy and cloudless Yosemite flows computed by our implementation. We used the same parameter settings as Brox et al. [1], namely σ = 1.3, α = 80 and γ = 100, 10 inner and outer iterations, a reduction factor of η = 0.95 and 77 levels in the pyramid. We can see from the error intensity images that most of the error occurs at the discontinuity between the mountain and the clouds and in the clouds themselves (for the cloudy data). The latter is to be expected, as optical flow requires a rigidity assumption and clouds are non-rigid.
Table 1. Comparison of results from the literature for the cloudy Yosemite data

Technique                      Angular Error   St. Dev.
Horn-Schunck, modified [16]    9.78°           16.19°
Alvarez et al. [10]            5.53°            7.40°
Weickert et al. [14]           5.18°            8.68°
Brox et al. 4-pt               5.07°            9.90°
Mémin-Pérez [13]               4.69°            6.89°
Brox et al. 7 × 7              3.60°           10.2°
Brox et al. [1]                2.46°            7.31°
Table 2. Comparison of results from the literature for the cloudless Yosemite data

Technique                      Angular Error   St. Dev.
Brox et al. 4-pt               2.63°           11.68°
Ju et al. [7]                  2.16°            2.00°
Bab-Hadiashar-Suter [9]        2.05°            2.92°
Lai-Vemuri [8]                 1.99°            1.41°
Brox et al. [1]                1.59°            1.39°
Mémin and Pérez [13]           1.58°            1.21°
Weickert et al. [14]           1.46°            1.50°
Farnebäck [12]                 1.14°            2.14°
Brox et al. 7 × 7              0.54°            5.37°
Table 1 shows that our best results for the cloudy Yosemite data are not quite as good as Brox et al.'s but are better than the other results. We tested the variation of the basic parameters of the algorithm (see Table 3 for some results). Table 2 shows that, while our 4-pt results for the cloudless Yosemite data are the worst, our 7 × 7 results are the best, even better than Brox et al.'s result. One of the main strengths of Brox et al.'s method is the use of slowly changing pyramidal images to compute good temporal derivatives. We always obtained better results with higher reduction factors (η). Figure 3 shows the evolution of I(x + w) into I(x). By gradually changing the image size between adjacent levels in the pyramid it is possible to maintain good flow until the final image is reached. We tested Brox et al.'s differentiation against Simoncelli derivatives [20] (which are known to be good) by keeping everything but the differentiation the same. At the initial level in the pyramid both differentiation methods produced similar results (12.57° for Simoncelli and 11.25° for Brox et al.). Brox et al.'s derivatives were slightly better here than Simoncelli derivatives because it is difficult to warp the 7 images required by Simoncelli filtering; we have to warp 3 images to the left and to the right of the middle image. For lower-level images in the pyramid, the error from bilinear interpolation and violation of the implicit constant motion assumption became very significant. For example, by 10 levels down the pyramid, Brox et al.'s accuracy was a bit more than 3 times better than Simoncelli's.
Fig. 2. (a) Frame 8 of the cloudy Yosemite sequence. (b) Frame 8 of the cloudless Yosemite sequence. (c)-(d) Best flow fields for (a) and (b). (e)-(f) Error images for (c) and (d).
As can be seen in Table 3, increased smoothing (α) tends to produce more accurate results for the Yosemite sequence. If α is set to 0 we obtain very poor results; the smoothness term is required, as one would expect. Higher values of γ weight the gradient constraint more heavily and usually also reduce the error. Turning off γ resulted in less accurate flow (the method then effectively becomes hierarchical Horn and Schunck). Similarly, more presmoothing (σ) produced less error at first and then more error as σ increased [16].
Fig. 3. (a)-(b) Frames 8 and 9 of the cloudy Yosemite sequence at level 0 of the pyramid. (c) I(x + w) and (d) Iz after computation of the flow at the top level and projection to level 1. (e)-(f) The transformation of I(x + w) and Iz at level 5: I(x + w) is essentially I(x) and Iz is close to 0.
We found that the value of the offset around the flow fields was significant even though we use reflecting boundary conditions in our differentiation, as suggested by Brox. By imposing a 2 or 6 pixel wide border around the image and thresholding away very small velocities (keeping magnitudes > 10^-1 or > 10^-3) we were able to improve accuracy by up to 10% with slightly less than 100% density (about 99%+).

5.2 Real Image Results
We tested our implementation on frames 5 and 6 of the Ettlinger Tor traffic sequence, as shown in Figure 4.
Table 3. Error analysis for various parameter settings

Parameters (σ, α, γ)   Angular Error 7 × 7   Angular Error 4-pt
1.3  80  100            4.69°                 5.16°
1.3  60  120            5.45°                 5.32°
2.0  90  130            5.26°                 5.44°
2.0  25   20            6.16°                 6.56°
0.6  70   80            8.44°                 6.04°
3.0  60   20            8.92°                12.72°
1.3  80    0            8.54°                14.94°
0.0  80  100           20.8°                 10.70°
1.3   0  100           77.02°                77.02°
Fig. 4. (a) Ettlinger Tor traffic sequence, where the cars and the bus are in motion. (b) Computed flow field for the Ettlinger Tor traffic sequence.
The flow in Figure 4(b) qualitatively looks the same as in Brox et al. [1], except that here we do not get flow for the part of the bus roof that has uniform white intensity.
6 Discussion
We have verified that Brox et al.'s 2-frame optical flow is better than all other existing methods for both versions of the Yosemite fly-through sequence. The combination of their hierarchical method, where adjacent images in the pyramid only change their size slowly, and their gradient constraint data term explains their superior performance. We tested Brox et al.'s implementation (an executable only was provided by Thomas Brox) and found our code often ran significantly slower than theirs. 7 × 7 averaging may give better results but it requires considerably
more computation. We found the implementation of this algorithm to be quite challenging, as the original paper did not give all the details about some key issues that are essential from a programming point of view. In particular, the details of how to calculate the smoothing term in the linear system of equations were not given. We had to resort to Brox's Ph.D. thesis [17] to find a way to compute Div(Ψ(∗)) (at Brox's suggestion) and we still could not reproduce exactly his solution. Our failure to get Brox et al.'s SOR relaxation to work might explain the slightly higher error we sometimes get, compared to their results.
7 Future Work
We would like to improve our implementation of Brox et al.'s method to the point where we reproduce their results. We would also like to implement their 3-frame variant using temporal differentiation. Papenberg et al. [2] proposed adding two additional constraints to the model and obtained even better Yosemite error results; it would be interesting to incorporate these constraints into our implementation. At that point, an investigation into the automatic determination of the weights of the various terms in the regularization would be interesting. Lastly, it would be interesting to extend a Papenberg et al. implementation to 3D to allow, for example, the computation of accurate 3D optical flow from two/three volumes of gated MRI data. Acknowledgements. We gratefully acknowledge financial support from NSERC through a Discovery grant. We also thank Thomas Brox for answering our many questions.
References
1. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 25–36. Springer, Heidelberg (2004)
2. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision 67(2), 141–158 (2006)
3. Black, M., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise smooth flows. Computer Vision and Image Understanding 64(1), 71–104 (1996)
4. Longuet-Higgins, H.C., Prazdny, K.: The interpretation of a moving retinal image. In: Proc. Royal Society London, vol. B208, pp. 385–397 (1980)
5. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17(1-3), 185–203 (1981)
6. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) (DARPA Image Understanding Workshop, pp. 121–130)
7. Ju, S.X., Black, M.J., Jepson, A.D.: Skin and bones: Multi-layer, locally affine, optical flow and regularization with transparency. In: Proceedings of Computer Vision and Pattern Recognition Conference, San Francisco, pp. 307–314 (June 1996)
8. Lai, S.H., Vemuri, B.C.: Reliable and efficient computation of optical flow. International Journal of Computer Vision 29(2), 87–105 (2005)
9. Bab-Hadiashar, A., Suter, D.: Robust optic flow computation. International Journal of Computer Vision 29(1), 59–77 (1998)
10. Alvarez, L., Weickert, J., Sanchez, J.: Reliable estimation of dense optical flow fields with large displacements. International Journal of Computer Vision 39(1), 41–56 (2000)
11. Nagel, H.H., Enkelmann, W.: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Analysis and Machine Intelligence 1, 523–530 (1986)
12. Farnebäck, G.: Very high accuracy velocity estimation using orientation tensors, parametric motion, and simultaneous segmentation of the motion field. In: Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, Canada, July 2001, vol. I, pp. 171–177. IEEE Computer Society Press, Los Alamitos (2001)
13. Mémin, E., Pérez, P.: Hierarchical estimation and segmentation of dense motion fields. International Journal of Computer Vision 46(2), 129–155 (2002)
14. Weickert, J., Bruhn, A., Schnörr, C.: Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. Technical report, Dept. of Mathematics, Saarland University, Saarbrücken, Germany (April 2003)
15. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision 61(3), 1–21 (2005)
16. Barron, J., Fleet, D., Beauchemin, S.: Performance of optical flow techniques. International Journal of Computer Vision 12(1), 43–77 (1994)
17. Brox, T.: From Pixels to Regions: Partial Differential Equations in Image Analysis. Ph.D. thesis, Mathematical Image Analysis Group, Dept. of Mathematics and Computer Science, Saarland University, Saarbrücken, Germany (April 2005)
18. Pollard, S., Porrill, J., Thacker, N.: Tina programmer's guide. Technical report, Medical Biophysics and Clinical Radiology, University of Manchester (1999), www.niac.man.ac.uk/Tina/docs/programmers/programmers guide.html
19. Barron, J.: The integration of optical flow into Tinatool. Technical report TR601, Dept. of Computer Science, The Univ. of Western Ontario (report for OSMIA (Open Source Medical Image Analysis), an EU 5th Framework Programme project) (2003)
20. Simoncelli, E.: Design of multi-dimensional derivative filters. In: IEEE Int. Conf. Image Processing, pp. 790–793. IEEE Computer Society Press, Los Alamitos (1994)
A Sequential Monte-Carlo and DSmT Based Approach for Conflict Handling in Case of Multiple Targets Tracking Yi Sun and Layachi Bentabet Department of Computer Science, Bishop’s University {ysun053, lbentabe}@ubishops.ca
Abstract. In this paper, we propose an efficient and robust multiple-target tracking method based on particle filtering and the Dezert-Smarandache theory. A model of cue combination is designed with plausible and paradoxical reasoning. The proposed model can resolve the conflicts and paradoxes that arise between measured cues due to partial or total occlusion. Experimental results demonstrate the efficiency and accuracy of the model in the case of tracking with multiple cues. Keywords: Multiple Targets Tracking, Sequential Monte-Carlo, Dezert-Smarandache Theory, Cue Combination.
1 Introduction
Visual tracking of moving targets has become one of the primary research issues in computer science, due to the increasing demand for reliable activity monitoring and surveillance systems. In order to achieve accurate tracking, several cues have been explored in the literature. These cues include color histograms, edges, motion, camera geometry (field of view), or velocity. As pointed out by [1], individual cues can potentially fail or provide paradoxical interpretations due to the occlusion problem in cluttered scenes and the changes in illumination which are unavoidable in the real world. Numerous methods suggest the use of multiple cues to increase visual robustness in complex scenes. The integration of the extracted cues into an object representation has been performed using probabilistic methods [1][2] as well as non-probabilistic methods [3]. With knowledge of a priori distributions and conditional probabilities, the probabilistic methods, and especially Bayesian inference, offer the most complete, scalable and theoretically justifiable approach for data fusion. However, in real complex scenes such complete knowledge is difficult to obtain due to high occlusion, background clutter, illumination and camera calibration problems. In [4], an alternative approach is proposed for data fusion, namely the Dempster-Shafer (DST) theory. DST theory does not require the knowledge of prior probabilities. The uncertainty and imprecision of a body of knowledge are represented via the notion of confidence values that are committed to a single hypothesis or a union of hypotheses. The orthogonal sum rule of DST theory allows the integration of
information from different sources into a single, overall representation. Unfortunately, Bayesian inference and Dempster-Shafer theories fail to provide a satisfactory way of modeling the conflicts and paradoxical interpretations arising between the different information sources. Bayesian inference assumes that all sources provide bodies of evidence using the same objective and universal interpretation of the phenomena under consideration; therefore, it cannot handle conflicts [5]. In most practical fusion applications based on DST theory, ad-hoc or heuristic techniques must be added to the fusion process to manage or reduce the possibility of a high degree of conflict between the sources. Otherwise, the fusion result leads to false conclusions or cannot provide a reliable result at all. To overcome these limitations, a recent theory of plausible and paradoxical reasoning has been developed in [5]. The Dezert-Smarandache Theory (DSmT) can be considered as a generalization of the DST theory. In this paper, we propose a sequential particle filter approach for multiple-target tracking using multiple cues. The different cues are combined using DSmT. If the targets are partially or completely occluded, the conflicts and paradoxes that arise between the measured cues are assessed and used in the tracking process. The proposed scheme is simple and provides effective tracking in cluttered scenes.
2 The Dezert-Smarandache Theory
The Dezert-Smarandache Theory (DSmT) of plausible and paradoxical reasoning [5] is a generalization of the classical Dempster-Shafer theory (DST), which allows the formal combination of rational, uncertain and paradoxical sources. The DSmT is able to solve complex fusion problems where the DST usually fails, especially when conflicts between sources become large. In this section, we first review the principle of the DST before discussing the fundamental aspects of the DSmT.
2.1 Principle of Dempster-Shafer Theory
The DST makes inferences from incomplete and uncertain knowledge by combining additional sources of confidence, even in the presence of partially contradictory sensors. The DST contains the Bayesian theory of partial belief as a special case. In the DST, there is a fixed set of mutually exclusive and exhaustive elements, called the frame of discernment, which is symbolized by Θ = {θ1, θ2, …, θN}. The frame of discernment Θ defines the propositions for which the sources can provide confidence. Information sources can distribute mass values on subsets of the frame of discernment, $A_i \in 2^\Theta$. If an information source cannot distinguish between two propositions $A_i$ and $A_j$, it assigns a mass value to the set including both hypotheses ($A_i \cup A_j$). The mass distribution over all hypotheses has to fulfill the following conditions:
$$0 \le m(A_i) \le 1, \qquad m(\emptyset) = 0, \qquad \sum_{A_i \in 2^\Theta} m(A_i) = 1. \qquad (1)$$
Mass distributions m1, m2, …, md from d different sources are combined with Dempster's orthogonal rule. The result is a new distribution, m = m1 ⊕ m2 ⊕ … ⊕ md, which carries the joint information provided by the d sources:
$$m(A) = (1 - K)^{-1} \sum_{A_1 \cap \ldots \cap A_d = A} \left[\prod_{i=1}^{d} m_i(A_i)\right], \qquad (2)$$
where
$$K = \sum_{A_1 \cap \ldots \cap A_d = \emptyset} \left[\prod_{i=1}^{d} m_i(A_i)\right].$$
K is a measure of conflict between the sources and is introduced as a normalization factor. The larger K is, the more the sources are conflicting and the less the combination makes sense. Two functions can be evaluated to characterize the uncertainty about a hypothesis A. The belief function, Bel, measures the minimum uncertainty value about A, whereas the plausibility function, Pls, reflects the maximum uncertainty value. Belief and plausibility functions are defined from $2^\Theta$ to [0, 1]:
$$Bel(A) = \sum_{A_i \subseteq A} m(A_i), \qquad Pls(A) = \sum_{A_i \cap A \ne \emptyset} m(A_i). \qquad (3)$$
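A small sketch of Dempster's orthogonal rule and the conflict factor K is given below; it uses our own dictionary-of-frozensets representation of mass functions over a toy two-element frame and is only an illustration of Eq. (2).

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined, K = {}, 0.0
    for (A, mA), (B, mB) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mA * mB
        else:
            K += mA * mB  # mass falling on the empty set measures the conflict
    # Normalize by (1 - K) as in Eq. (2).
    return {A: v / (1.0 - K) for A, v in combined.items()}, K

# Two sources over the frame {t1, t2}; each can hedge on the union t1 U t2.
m1 = {frozenset({'t1'}): 0.6, frozenset({'t2'}): 0.1, frozenset({'t1', 't2'}): 0.3}
m2 = {frozenset({'t1'}): 0.2, frozenset({'t2'}): 0.5, frozenset({'t1', 't2'}): 0.3}
m, K = dempster_combine(m1, m2)
print(K, m)   # K grows as the two sources disagree more strongly
```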
2.2 The Dezert-Smarandache Theory (DSmT)
While the DST considers Θ as a set of exclusive elements, the DSmT relaxes this condition and allows for overlapping and intersecting hypotheses. This allows quantifying the conflict that might arise between the different sources through the assignment of non-null confidence values to the intersections of distinct hypotheses. Let Θ = {θ1, θ2, …, θN} be a set of N elements which can potentially overlap. The hyper-power set $D^\Theta$ is defined as the set of all composite hypotheses obtained from Θ with the ∪ and ∩ operators such that
1. $\emptyset, \theta_1, \theta_2, \ldots, \theta_N \in D^\Theta$.
2. If $A, B \in D^\Theta$, then $(A \cup B) \in D^\Theta$ and $(A \cap B) \in D^\Theta$.
3. No other elements belong to $D^\Theta$, except those defined in 1) and 2).
As in the DST, the DSmT defines a map $m(.): D^\Theta \rightarrow [0, 1]$. This map defines the confidence level that each sensor associates with the elements of $D^\Theta$. This map supports paradoxical information, and
$$\sum_{A \in D^\Theta} m(A) = 1. \qquad (4)$$
The belief and plausibility functions are defined in the same way as for the DST. The DSmT rule of combination of conflicting and uncertain sources is given by
$$m(A) = \sum_{\substack{A_1, A_2, \ldots, A_d \in D^\Theta \\ A_1 \cap A_2 \cap \ldots \cap A_d = A}} \; \prod_{i=1}^{d} m_i(A_i). \qquad (5)$$
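In contrast to the DST sketch above, a DSm combination keeps the mass that falls on intersections instead of renormalizing it away. The sketch below is one way to illustrate Eq. (5) for two sources: each hypothesis is encoded (our own device) by the set of elementary regions it covers, so the intersection of distinct hypotheses stays non-empty and retains its mass.

```python
from itertools import product

# Overlapping hypotheses represented by the elementary regions they cover
# (free DSm model: theta1 and theta2 share the region 'r12').
THETA1 = frozenset({'r1', 'r12'})
THETA2 = frozenset({'r2', 'r12'})

def dsm_combine(m1, m2):
    """DSmT rule (Eq. (5)) for two sources: product masses summed over intersections,
    with no renormalization -- the conflict remains as mass on theta1 ∩ theta2."""
    out = {}
    for (A, mA), (B, mB) in product(m1.items(), m2.items()):
        out[A & B] = out.get(A & B, 0.0) + mA * mB
    return out

m1 = {THETA1: 0.7, THETA2: 0.2, THETA1 | THETA2: 0.1}
m2 = {THETA1: 0.3, THETA2: 0.6, THETA1 | THETA2: 0.1}
for hyp, mass in dsm_combine(m1, m2).items():
    print(sorted(hyp), round(mass, 3))
# theta1 ∩ theta2 appears as the region {'r12'}, carrying the paradoxical mass.
```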
3 Sequential Monte Carlo
Sequential Monte Carlo techniques, also known as particle filtering, were introduced to track multiple objects in cluttered scenes [6]. In the following, let's consider $X_t = (x_1, x_2, \ldots, x_t)$ as the state vector (location, size, etc.) describing the target and $Z_t = (z_1, z_2, \ldots, z_t)$ as the vector of measurements (color, texture, etc.) up to time t. The tracking is based on the estimation of the posterior state distribution $p(x_t|Z_t)$ at each time step. The estimation is performed using a two-step Bayesian recursion. The first step is prediction,
$$p(x_t | Z_{t-1}) \propto \int p(x_t | x_{t-1})\, p(x_{t-1} | Z_{t-1})\, dx_{t-1}, \qquad (6)$$
and the second step is filtering,
$$p(x_t | Z_t) \propto p(z_t | x_t)\, p(x_t | Z_{t-1}). \qquad (7)$$
This recursion requires the specification of the state evolution $p(x_t|x_{t-1})$ and a measurement model linking the state and the current measurement, $p(z_t|x_t)$. The basic idea behind the particle filter is very simple. Starting with a weighted set of samples
$$S_{t-1} = \{s_{t-1}^{(n)}, \pi_{t-1}^{(n)} \mid \textstyle\sum_{n=1}^{N} \pi_{t-1}^{(n)} = 1\}, \qquad (8)$$
which describe target candidates and are distributed according to $p(x_{t-1}|z_{t-1})$, new samples are obtained by propagating each sample according to the target's state model, $p(x_t|x_{t-1})$. In the filtering step, each sample is weighted given the observation and N samples are drawn with replacement according to $\pi_t = p(z_t|x_t)$. The value given by $E[S_t] = \sum_{n=1}^{N} \pi_t^{(n)} s_t^{(n)}$ represents the best estimate of the target.
Particle filtering has proven to be very successful for non-linear and non-Gaussian estimation problems. The tracking iteration using particle filtering can be summarized as follows.
Step 1: Select N samples
$$S_{t-1} = \{s_{t-1}^{(n)}, \pi_{t-1}^{(n)} \mid \textstyle\sum_{n=1}^{N} \pi_{t-1}^{(n)} = 1\}. \qquad (9)$$
Step 2: Propagate each sample according to
$$s_t^{(n)} = A \cdot s_{t-1}^{(n)} + w_{t-1}^{(n)}, \qquad (10)$$
where A is a square matrix defining the deterministic component of the target's motion model and $w_{t-1}^{(n)}$ is the random component of the target's motion model.
Step 3: Observe the samples and calculate $\pi_t^{(n)}$.
Step 4: Estimate the mean state of $S_t$:
$$E[S_t] = \sum_{n=1}^{N} \pi_t^{(n)} s_t^{(n)}. \qquad (11)$$
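A compact sketch of one such iteration (selection, propagation with a linear motion model plus noise, weighting by an observation likelihood, and mean-state estimation) is given below; the likelihood is a user-supplied placeholder and all names are our own.

```python
import numpy as np

def particle_filter_step(samples, weights, A, noise_std, likelihood):
    """One tracking iteration: select, propagate with s' = A s + w, reweight, estimate.

    samples    : (N, d) array of state vectors
    weights    : (N,) normalized weights from the previous frame
    A          : (d, d) deterministic motion matrix
    likelihood : callable mapping an (N, d) array of states to (N,) observation scores
    """
    N = len(samples)
    # Step 1: select N samples with replacement according to their weights.
    idx = np.random.choice(N, size=N, p=weights)
    s = samples[idx]
    # Step 2: propagate each sample with the motion model.
    s = s @ A.T + np.random.normal(scale=noise_std, size=s.shape)
    # Step 3: observe the samples and compute normalized weights.
    w = likelihood(s)
    w = w / w.sum()
    # Step 4: mean-state estimate E[S_t].
    estimate = (w[:, None] * s).sum(axis=0)
    return s, w, estimate
```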
4 DSmT-Based Tracking
In this section, we describe a general framework for multiple-target tracking using multiple cues. This approach uses the DSmT combination rule to fuse the information provided by the cues into a single representation. The latter takes into account the conflicts between the cues that might arise due to occlusion. Let's assume that the number of targets, τ, and the number of cues, c, are known. Up to time t-1, each target is associated with a track $\{\theta_j\}_{j=1}^{\tau}$. At time t, an image frame is extracted from the video sequence and a number of measurements are obtained for each target candidate. The objective is thus to combine these measurements in order to determine the best track for each candidate. It is important to notice that a target candidate, in this paper, refers to a particle sample. The hyper-power set $D^\Theta$ defines the set of hypotheses for which the different cues can provide confidence values. These hypotheses can correspond to: 1) individual tracks $\theta_j$; 2) unions of tracks $\theta_r \cup \ldots \cup \theta_s$, which symbolize ignorance; 3) intersections of tracks $\theta_r \cap \ldots \cap \theta_s$, which symbolize conflict; or 4) any track combination obtained with the ∪ and ∩ operators. The confidence level is expressed in terms of mass functions $\{m_{t,l}^{(n)}(.)\}_{l=1}^{c}$ that are committed to each hypothesis and satisfy the condition in (4). Given this framework, $m_{t,l}^{(n)}(A)$ expresses the confidence with which cue l associates particle n to hypothesis A at time t. According to the DSmT combination rule in (5), a single map function $m_t^{(n)}(.)$ can be derived as follows:
$$m_t^{(n)}(A) = m_{t,1}^{(n)}(.) \oplus m_{t,2}^{(n)}(.) \oplus \cdots \oplus m_{t,c}^{(n)}(.), \qquad (12)$$
where $m_t^{(n)}(A)$ is the overall confidence level with which all cues associate particle n to hypothesis A at time t. Since the target candidates must be associated to individual tracks, the information contained in compound hypotheses is transferred into single hypotheses (i.e. single tracks) through the notions of the belief or plausibility functions:
$$Bel_t^{(n)}(\theta_j) = \sum_{\substack{A \in D^\Theta \\ \theta_j \subseteq A}} m_t^{(n)}(A), \qquad Pls_t^{(n)}(\theta_j) = \sum_{\substack{A \in D^\Theta \\ \theta_j \cap A \ne \emptyset}} m_t^{(n)}(A), \qquad (13)$$
where $Bel_t^{(n)}(\theta_j)$ (resp. $Pls_t^{(n)}(\theta_j)$) quantifies the confidence with which particle n is associated to $\theta_j$ at time t using the notion of belief (resp. plausibility). The
confidence levels in (13) are not used to determine whether a given candidate is the best estimate of the target or not; they are rather used to quantify the weight of the candidate (or particle) as a sample of the state posterior distribution $p(x_t|Z_t)$. The DSmT-based particle filtering algorithm implemented in this paper is given below.
Step 1: Initialization
- Generate N samples $S_{t,j} = \{s_{t,j}^{(n)}, \pi_{t,j}^{(n)}\}_{n=1}^{N}$ for each target j = 1, …, τ independently, with $\pi_{t,j}^{(n)} = \frac{1}{N}$. Set t = 1.
Step 2: Propagation
- $s_{t*,j}^{(n)} = A \cdot s_{t-1,j}^{(n)} + w_{t-1,j}^{(n)}$
Step 3: Observation (for each particle)
- Compute $\{m_{t*,l}^{(n)}(A)\}_{l=1}^{c}$ for $A \in D^\Theta$
- Compute $m_{t*}^{(n)}(A)$ for $A \in D^\Theta$ according to (5)
- Calculate the particle weight $\pi_{t*,j}^{(n)} = Bel_{t*}^{(n)}(\theta_j)$ (or $\pi_{t*,j}^{(n)} = Pls_{t*}^{(n)}(\theta_j)$)
- Normalize the weight: $\tilde{\pi}_{t*,j}^{(n)} = \pi_{t*,j}^{(n)} / \sum_{n=1}^{N} \pi_{t*,j}^{(n)}$
Step 4: Estimation
- Target j = 1, …, τ is given by $E[S_{t*,j}] = \sum_{n=1}^{N} \tilde{\pi}_{t*,j}^{(n)} s_{t*,j}^{(n)}$.
Step 5: Resampling (for each target)
- Generate $S_{t,j} = \{s_{t,j}^{(n)}, \pi_{t,j}^{(n)}\}_{n=1}^{N}$ by resampling N times from $S_{t*,j}$, where $p(s_{t,j}^{(n)} = s_{t*,j}^{(n)}) = \tilde{\pi}_{t*,j}^{(n)}$.
Step 6: Incrementing
- t = t + 1, go to Step 2.
5 Tracking Two Targets Using Location and Color
For two targets, we can define Θ as follows:
$$\Theta = \{\theta_1, \theta_2, \theta_1 \cup \theta_2\}. \qquad (14)$$
In (14), θ1 refers to the first target, θ2 refers to the second target, and $\theta_1 \cup \theta_2$ refers to the rest of the scene. Actually, the hypothesis $\theta_1 \cup \theta_2$ can refer to the background information; however, since the latter can change during tracking, we will refer to $\theta_1 \cup \theta_2$ as the false alarm hypothesis. Besides, $\theta_1 \cap \theta_2 \ne \emptyset$ due to possible occlusion, and $\theta_j \cap \theta_1 \cup \theta_2 = \emptyset$ for j = 1, 2.
5.1 The Location Cue
The target locations at time t-1 are known and given by $(x_{t-1,1}, y_{t-1,1})$ and $(x_{t-1,2}, y_{t-1,2})$. At time t, the probability that a particle $s_{t,j}^{(n)}$ located at $(x_{t,j}^{(n)}, y_{t,j}^{(n)})$ belongs to target j = 1, 2 according to the location information is defined from a Gaussian pdf as follows:
$$p_{t,j}^{(n)} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_{t,j}^{(n)} - x_{t-1,j})^2 + (y_{t,j}^{(n)} - y_{t-1,j})^2}{2\sigma^2}}, \qquad (15)$$
where σ is a bandwidth parameter. Similarly, the probability that a given particle does not belong to θ1 and θ2 is inversely proportional to the distance between the particle and both targets. Since Θ is exhaustive, a particle that does not belong to θ1 and θ2 does belong to $\theta_1 \cup \theta_2$. This leads us to the definition of a new pdf, $p_{t,FA}^{(n)}$, which measures the membership of a particle n = 1, …, N to the false alarm hypothesis:
$$p_{t,FA}^{(n)} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(d_{max} - d_{1-2}^{(n)})^2}{2\sigma^2}}, \qquad (16)$$
where σ is the bandwidth parameter, $d_{max}$ is the radius of a circle centered on the mid-point of targets 1 and 2 and containing all the particles used for tracking at time t-1, and $d_{1-2}^{(n)}$ is the distance separating particle n from the mid-point:
$$d_{1-2}^{(n)} = \sqrt{\left(x_{t,1}^{(n)} - \frac{x_{t-1,1} + x_{t-1,2}}{2}\right)^2 + \left(y_{t,1}^{(n)} - \frac{y_{t-1,1} + y_{t-1,2}}{2}\right)^2}. \qquad (17)$$
The mass functions of particle n according to its location are given by
$$m_{t,1}^{(n)}(\theta_j) = \frac{p_{t,j}^{(n)}}{p_{t,1}^{(n)} + p_{t,2}^{(n)} + p_{t,FA}^{(n)}}, \quad j = 1, 2, \qquad (18)$$
$$m_{t,1}^{(n)}(\theta_1 \cup \theta_2) = \frac{p_{t,FA}^{(n)}}{p_{t,1}^{(n)} + p_{t,2}^{(n)} + p_{t,FA}^{(n)}}. \qquad (19)$$
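A per-particle sketch of Eqs. (15)-(19) is given below; the function name, arguments and the choice of σ are ours, and the mid-point formula follows the textual description above.

```python
import numpy as np

def location_masses(px, py, prev1, prev2, d_max, sigma=10.0):
    """Location-cue masses for one particle at (px, py).

    prev1, prev2 : (x, y) positions of targets 1 and 2 at time t-1
    d_max        : radius of the circle containing all particles at t-1
    Returns (m(theta1), m(theta2), m(false alarm)).
    """
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    def gauss(d2):
        return norm * np.exp(-d2 / (2.0 * sigma**2))
    p1 = gauss((px - prev1[0])**2 + (py - prev1[1])**2)            # Eq. (15), j = 1
    p2 = gauss((px - prev2[0])**2 + (py - prev2[1])**2)            # Eq. (15), j = 2
    mid = (0.5 * (prev1[0] + prev2[0]), 0.5 * (prev1[1] + prev2[1]))
    d12 = np.hypot(px - mid[0], py - mid[1])                        # Eq. (17)
    p_fa = gauss((d_max - d12)**2)                                  # Eq. (16)
    total = p1 + p2 + p_fa
    return p1 / total, p2 / total, p_fa / total                     # Eqs. (18)-(19)
```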
5.2 The Color Cue
Let’s assume that both target models are known and given by normalized color histograms {q j (u )}um=1 , where u is a discrete color index and m is the number of histogram bins. At time t, the normalized color histogram of particle s t(,nj) is given by {ht(,nj) (u )}um=1 . The probability that particle s t(,nj) belongs to target j=1,2 according to the color histogram is derived from the following Gaussian pdf.
A Sequential Monte-Carlo and DSmT Based Approach for Conflict Handling
p
(n) t, j
=
1
e
2π σ
−
533
( d t(,nj) ) 2 2σ 2
, j=1,2
(20)
where σ is a color bandwidth parameter, d t(,nj) is the Bhattacharyya distance between ht(,nj) (u ) and q j (u ) at time t d t(,nj) = 1 − ∑u =1 ht(,nj) (u )q j (u ) m
(21)
Let’s define {q FA (u )}um=1 as the histogram of the scene from which we subtract the histogram of targets 1 and 2. q FA (u ) = max{q scene (u ) − q1 (u ) − q 2 (u ),0}
(22)
The probability that s t(,nj) belongs to the false alarm hypothesis will be given by p
(n) t , FA
1
=
2π σ
e
−
) 2 ( d t(,nFA )
2σ 2
(23)
where d t(,nFA) = 1 − ∑u =1 ht(,nj) (u )q FA (u ) . m
(24)
The mass functions of particle n according to color can be evaluated as follows mt(,n2) (θ j ) =
pt(,nj) p t(,n1) + pt(,n2) + pt(,nFA)
, j=1,2
(25)
pt(,nFA)
mt(,n2) (θ 1 ∪ θ 2 ) =
(26)
pt(,n1) + pt(,n2) + pt(,nFA)
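The color-cue computation of Eqs. (20)-(26) can be sketched as below (our own function names; the renormalization of the false-alarm histogram is an assumption we add so that it stays a distribution).

```python
import numpy as np

def bhattacharyya_distance(h, q):
    """d = sqrt(1 - sum_u sqrt(h(u) q(u))) for two normalized histograms (Eq. (21))."""
    bc = np.sum(np.sqrt(h * q))
    return np.sqrt(max(1.0 - bc, 0.0))

def color_masses(h_particle, q1, q2, q_scene, sigma=0.2):
    """Color-cue masses of one particle; all histogram arrays are normalized."""
    q_fa = np.maximum(q_scene - q1 - q2, 0.0)                       # Eq. (22)
    if q_fa.sum() > 0:
        q_fa = q_fa / q_fa.sum()                                    # our added renormalization
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    def gauss(d):
        return norm * np.exp(-d**2 / (2.0 * sigma**2))
    p1 = gauss(bhattacharyya_distance(h_particle, q1))              # Eq. (20), j = 1
    p2 = gauss(bhattacharyya_distance(h_particle, q2))              # Eq. (20), j = 2
    p_fa = gauss(bhattacharyya_distance(h_particle, q_fa))          # Eqs. (23)-(24)
    total = p1 + p2 + p_fa
    return p1 / total, p2 / total, p_fa / total                     # Eqs. (25)-(26)
```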
5.3 The Cue Combination
The combination rule leads to the mass functions $m_t^{(n)}(.)$ that are defined in Table 1.

Table 1. Combination rule for color and location

                                             Color cue
Location cue                    $m_{t,2}^{(n)}(\theta_1)$            $m_{t,2}^{(n)}(\theta_2)$            $m_{t,2}^{(n)}(\theta_1 \cup \theta_2)$
$m_{t,1}^{(n)}(\theta_1)$            $m_t^{(n)}(\theta_1)$                $m_t^{(n)}(\theta_1 \cap \theta_2)$   $\emptyset$
$m_{t,1}^{(n)}(\theta_2)$            $m_t^{(n)}(\theta_1 \cap \theta_2)$   $m_t^{(n)}(\theta_2)$                $\emptyset$
$m_{t,1}^{(n)}(\theta_1 \cup \theta_2)$   $\emptyset$                          $\emptyset$                          $m_t^{(n)}(\theta_1 \cup \theta_2)$
where
$$m_t^{(n)}(\theta_1) = m_{t,1}^{(n)}(\theta_1) \cdot m_{t,2}^{(n)}(\theta_1), \qquad (27)$$
$$m_t^{(n)}(\theta_2) = m_{t,1}^{(n)}(\theta_2) \cdot m_{t,2}^{(n)}(\theta_2), \qquad (28)$$
$$m_t^{(n)}(\theta_1 \cap \theta_2) = m_{t,1}^{(n)}(\theta_1) \cdot m_{t,2}^{(n)}(\theta_2) + m_{t,1}^{(n)}(\theta_2) \cdot m_{t,2}^{(n)}(\theta_1), \qquad (29)$$
$$m_t^{(n)}(\theta_1 \cup \theta_2) = m_{t,1}^{(n)}(\theta_1 \cup \theta_2) \cdot m_{t,2}^{(n)}(\theta_1 \cup \theta_2), \qquad (30)$$
$$m_t^{(n)}(\emptyset) = m_{t,1}^{(n)}(\theta_1 \cup \theta_2)\big(m_{t,2}^{(n)}(\theta_1) + m_{t,2}^{(n)}(\theta_2)\big) + m_{t,2}^{(n)}(\theta_1 \cup \theta_2)\big(m_{t,1}^{(n)}(\theta_1) + m_{t,1}^{(n)}(\theta_2)\big). \qquad (31)$$
Equation (27) is the confidence level with which both cues associate $s_{t,j}^{(n)}$ to target 1. Equation (28) is the confidence level with which both cues associate $s_{t,j}^{(n)}$ to target 2. Equation (29) is the conflict value between the cues for the membership of $s_{t,j}^{(n)}$ to target 1 or target 2. Equation (30) expresses the confidence value with which both cues agree that the particle corresponds to a false alarm. Equation (31) quantifies the conflict between the targets and the false alarm hypothesis. The weight of particle $s_{t,j}^{(n)}$ within the posterior distribution $p(X_t|Z_t)$, for target j, is calculated using the belief (or plausibility) function, so that
$$\pi_{t,j}^{(n)} = Bel_t^{(n)}(\theta_j) = m_t^{(n)}(\theta_j) + m_t^{(n)}(\theta_1 \cap \theta_2), \quad j = 1, 2. \qquad (32)$$
The generalization of the tracking scheme described in this section to τ targets can be carried out by defining a frame of discernment $\Theta = \{\theta_1, \ldots, \theta_\tau, \theta_1 \cup \cdots \cup \theta_\tau\}$, where the $\theta_j$ are the individual targets and $\theta_1 \cup \cdots \cup \theta_\tau$ is the false alarm hypothesis.
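For the two-target case, Eqs. (27)-(32) amount to a few products and sums per particle; the sketch below shows one possible realization (dictionary keys and function names are ours).

```python
def combine_cues(loc, col):
    """Eqs. (27)-(31): combine location and color masses for one particle.

    loc, col : tuples (m(theta1), m(theta2), m(theta1 U theta2)) from each cue.
    Returns a dict of combined masses, including the paradoxical mass theta1 ∩ theta2.
    """
    l1, l2, lu = loc
    c1, c2, cu = col
    return {
        'theta1':        l1 * c1,                            # Eq. (27)
        'theta2':        l2 * c2,                            # Eq. (28)
        'theta1^theta2': l1 * c2 + l2 * c1,                  # Eq. (29) - conflict between cues
        'theta1Utheta2': lu * cu,                            # Eq. (30)
        'empty':         lu * (c1 + c2) + cu * (l1 + l2),    # Eq. (31)
    }

def particle_weight(m, target):
    """Eq. (32): belief of an individual track, used as the particle weight."""
    key = 'theta1' if target == 1 else 'theta2'
    return m[key] + m['theta1^theta2']
```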
6 Experimental Results
The tracking approach is tested on a cluttered scene with intersecting targets. Two targets are selected from frame 1, as shown in Figure 1. Since the purpose of this paper is tracking and not target detection, both targets are selected manually. Each target is tracked using only 20 particles. An increased number of particles would result in smoother tracking, while increasing the processing time. For the sake of clarity, the person on the right is denoted target 1 and the person on the left is denoted target 2.
Fig. 1. First row: Tracking during the pre-occlusion phase. Second row: Tracking during the occlusion phase. Third row: Tracking during the post-occlusion phase.
The tracking sequence is divided into three phases. Phase 1 is the pre-occlusion sequence, phase 2 corresponds to the occlusion sequence, and phase 3 is the post-occlusion sequence. Tracking in phase 2 is challenging due to the closeness of the targets, which perturbs the measured cues and might lead to a false identification. The location cue gradually loses its ability to separate targets 1 and 2 as they converge to the intersection point. However, the location cue remains a valid measurement because it is independent of the relative location of the targets with respect to the camera (occluding or occluded). The color cue is extremely sensitive to the occlusion. During phase 2, target 1 is partially or totally occluded by target 2. As a result, the color measurement for particles associated with target 1 is corrupted by the presence of target 2. When the occlusion is total, target 1 disappears from the scene and the color measurement becomes invalid. The occlusion also affects the behavior of particles associated with target 2, since the presence of target 1 in its neighborhood is interpreted by the algorithm as a rapid change in the background information. The tracking performance in phase 3 depends on the outcome of tracking during phase 2. A successful tracking would result in a correct identification, whereas a failure would result in a bad identification of the occluded target (target 1). As shown in Figure 1, the approach proposed in this paper accurately identifies the targets during the three phases of the tracking. This is due to the effective handling by the DSmT model of the conflicting information provided by the location and color cues during the second phase of tracking. Figure 2 shows the variation of the average value of the confidence levels over all particles during the tracking.
Fig. 2. The variation of: (a) $m_{avg}(\theta_1)$, (b) $m_{avg}(\theta_1 \cap \theta_2)$, (c) $Bel_{avg}(\theta_1)$ for the occluded target, and (d) $m_{avg}(\theta_1)$, (e) $m_{avg}(\theta_1 \cap \theta_2)$, (f) $Bel_{avg}(\theta_1)$ for the occluding target.
Here $m_{avg}(\theta_j) = \frac{1}{N}\sum_{n=1}^{N} m_t^{(n)}(\theta_j)$, $m_{avg}(\theta_1 \cap \theta_2) = \frac{1}{N}\sum_{n=1}^{N} m_t^{(n)}(\theta_1 \cap \theta_2)$, and $Bel_{avg}(\theta_j) = \frac{1}{N}\sum_{n=1}^{N} Bel_t^{(n)}(\theta_j)$.
The confidence level for the occluded target, $m_{avg}(\theta_1)$, is high during phases 1 and 3, but it decreases in phase 2 (see Figure 2-a). Indeed, in phases 1 and 3 the color and location cues both agree on the identity of the target. However, in phase 2 the target is occluded and this reduces the confidence value provided by the color cue. During the same phase, the location confidence remains high, which explains the increase in the conflict, $m_{avg}(\theta_1 \cap \theta_2)$, as shown in Figure 2-b. The belief function $Bel_{avg}(\theta_1)$ is given in Figure 2-c. This curve shows the high confidence with which the target is located despite the occlusion. This is mainly due to the introduction of the conflict information through the DSmT model. Figures 2-d, 2-e and 2-f show that the effect of the occlusion on the occluding target is small in comparison with its effect on the occluded target. The existence of such an effect can be justified by the presence of target 1 in the immediate neighborhood of target 2, which rapidly modifies the color measurement for some particles.
7 Conclusion
In this paper, we addressed the problem of tracking multiple targets in a cluttered scene using multiple cues. Within this framework, we developed a model that combines the classical particle filtering approach and the novel DSmT theory. A set of particles is used to track each target. The DSmT model assigns confidence level values for the membership of each particle. This membership takes into consideration the conflict between the cues during the occlusion phase, thus allowing better
tracking. The experimental results demonstrated the effectiveness of the model in the case of multiple-target tracking using location and color, and its interest in cluttered scenes. We also showed how our approach can easily be generalized to deal with additional cues and targets.
References
1. Chen, C., Lin, X., Shi, Y.: Moving object tracking under varying illumination conditions. Pat. Rec. Let. 27, 1632–1643 (2006)
2. Ozyilidiz, E., Krahnstover, N., Sharma, R.: Adaptive texture and color segmentation for tracking moving objects. Pat. Rec. 35(10), 2013–2029 (2002)
3. McCane, B., Galvin, B., Novins, K.: Algorithmic fusion for more robust feature tracking. Int. J. of Com. Vis. 49(1), 79–89 (2002)
4. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
5. Smarandache, F., Dezert, J.: Applications and Advances of DSmT for Information Fusion. Am. Res. Press, Rehoboth (2004)
6. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Trans. on Sig. Proc. 50(2), 174–188 (2002)
Incremental Update of Linear Appearance Models and Its Application to AAM: Incremental AAM Sangjae Lee, Jaewon Sung, and Daijin Kim Pohang University of Science and Technology Department of Computer Science and Engineering San 31, Hyoja-Dong, Nam-Gu, Pohang, 790-784, Korea {lee7067,jwsung,dkim}@postech.ac.kr
Abstract. Because many model-based object representation approaches, such as active appearance models (AAMs), use a fixed linear appearance model, they often fail to fit a novel image that is captured under a different imaging condition from that of the training images. To alleviate this problem, we propose to use an adaptive linear appearance model that is updated by incremental principal component analysis (PCA). Because the incremental update algorithm uses new appearance data obtained in an on-line manner, a reliable method to measure the quality of the new data is required so as not to break the integrity of the appearance model. For this purpose, we modified the adaptive observation model (AOM), which has been used to model the varying appearance of a target object using statistical models such as Gaussian mixtures. Experimental results show that the incremental AAM, which uses an adaptive linear appearance model, greatly improves the robustness to varying illumination conditions when compared to the traditional AAM.
1 Introduction
To represent the texture variation of a certain object class, many approaches, such as active appearance models [1],[2] and morphable models [3], have used linear appearance models. However, just using a fixed appearance model often brings unsatisfactory fitting results, especially when the imaging condition of the target image is different from that of the training images. One possible solution to alleviate this problem is to adaptively update the appearance model as the imaging condition changes. For example, McKenna et al. [4], Jepson et al. [5], and Zhou et al. [6] used Gaussian mixture models to model the changing appearance of the target object. In the case of linear appearance models, the concept of adaptive update can be incorporated by adaptively updating them using an incremental subspace learning technique such as incremental principal component analysis (PCA), which updates the mean and bases incrementally as new data become available [7],[8],[9], while the traditional PCA learns the mean and bases from a set of training data in a batch process.
In this paper, we propose an incremental AAM that uses incremental PCA to adaptively update its appearance bases. AAMs represent the variation of the appearance and shape of a certain non-rigid object using linear models [1]. Although AAMs are frequently used for face image analysis problems these days, the fitting performance of AAMs is degraded when the lighting condition of an input image is significantly different from that of the training images. This problem could be solved by collecting a large number of training images that contain every possible lighting condition. However, it is impossible to collect such a large number of training images, and the number of appearance bases may increase too much. Instead, adaptively modifying the linear appearance model so that it can represent the current imaging condition is a better choice. On the other hand, an unconditional update of the appearance model using the fitting result can make the fitting performance worse than that of the original AAM, because invalid appearance data can break the integrity of the appearance model. Hence, we need a measure to determine whether to update the appearance model or not when a new fitting result is given. Unfortunately, the fitting error of the AAM is not a good measure for this purpose because a small fitting error can be observed even when the fitting fails. Instead of using the fitting error of the AAM, we measured the percentage of outlier pixels using a modified Adaptive Observation Model (AOM) [6], which is an appearance modeling technique that has been used for stable on-line object tracking, where the varying appearance of an object is represented by Gaussian mixture models¹. Using the modified AOM, we conditionally updated the appearance bases as follows. First, we measured the goodness of the AAM fitting, defined by the percentage of outliers, using the modified AOM. Second, we computed the goodness of the appearance bases, defined by the magnitude of the AAM error, and updated the appearance bases by the incremental PCA only when the current appearance bases produced a large AAM error. This conditional update of the appearance bases stabilized the AAM fitting and improved the face tracking performance over changes of the illumination condition. This paper is organized as follows. Section 2 briefly reviews related research such as active appearance models, incremental PCA, and the AOM. Section 3 explains the proposed outlier detection algorithm using the modified AOM and the incremental AAM. Section 4 shows experimental results. Finally, Section 5 presents a conclusion.
2 Theoretical Backgrounds
2.1 Active Appearance Models
In 2D AAMs [1], [2], the 2D shape is represented by a triangulated 2D mesh with l vertices, and the 2D coordinates of the vertices make up the shape vector
¹ Note that these kinds of appearance models, including the AOM, are quite different from the linear appearance model of the AAMs. Usually, they aim to model the unpredictably varying appearance in the on-line tracking phase, while the AAMs aim to learn a set of manually verified legal variations of the appearance in the training phase.
$s = (x_1, y_1, \ldots, x_l, y_l)^T$. The shape variation is expressed by a linear combination of a mean shape $s_0$ and n shape bases $s_i$ as
$$s = s_0 + \sum_{i=1}^{n} p_i s_i, \qquad (1)$$
where $p_i$ are the shape parameters. A standard approach to obtain the linear shape model is to apply principal component analysis (PCA) to a set of aligned shape vectors gathered from the manually landmarked training images, where the i-th shape basis $s_i$ is the i-th eigenvector corresponding to the i-th largest eigenvalue. The linear model can be extended to represent the shape variation due to a 2D similarity transformation by adding four more shape bases [2]. Once a mean shape $s_0$ is obtained, the training images are warped to the mean shape using the piece-wise affine warp that is defined between the corresponding triangles in the landmarked shape of the training images and the mean shape. Then, we can define the appearance as a shape-normalized image $A(x)$ over the pixels x that belong to the inside of $s_0$. The appearance variation is expressed by a linear combination of a mean appearance $A_0(x)$ and m appearance bases $A_i(x)$ as
$$A(x) = A_0(x) + \sum_{i=1}^{m} \alpha_i A_i(x), \qquad (2)$$
where $\alpha_i$ are the appearance parameters. As with the shape model, the appearance model can be obtained by applying PCA to a set of shape-normalized images warped from the manually landmarked training images, where the i-th appearance basis image $A_i(x)$ is the i-th eigenvector corresponding to the i-th largest eigenvalue. The goal of AAM fitting is to minimize the difference between the target image and the model image that is synthesized from the model parameters (the shape and appearance parameters). Thus, fitting a 2D AAM to a target image can be formulated as finding the model parameters of an AAM that minimize the following error:
$$E = \sum_{x \in s_0} \left[\sum_{i=0}^{m} \lambda_i A_i(x) - I(W(x; p))\right]^2, \qquad (3)$$
and various gradient-based fitting algorithms [2],[10],[11] have been proposed by extending the Lucas-Kanade (LK) image matching algorithm [12].
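For illustration, given learned bases the model instance of Eqs. (1)-(2) and the fitting error of Eq. (3) can be evaluated as in the sketch below; array names are ours and the piece-wise affine warp is assumed to have been applied elsewhere, so the input pixels arrive already warped to the template.

```python
import numpy as np

def synthesize(s0, S, p, A, lam):
    """Model instance from shape parameters p and appearance parameters lam.

    s0 : (2l,) mean shape;  S : (n, 2l) shape bases
    A  : (m+1, npix) appearance bases with A[0] the mean appearance (lam[0] = 1)
    """
    shape = s0 + S.T @ p          # Eq. (1)
    appearance = A.T @ lam        # Eq. (2) / model term of Eq. (3)
    return shape, appearance

def fitting_error(A, lam, warped_image):
    """Eq. (3): squared difference between the model appearance and I(W(x; p)),
    where warped_image holds the input pixels already warped back onto s0."""
    residual = A.T @ lam - warped_image
    return float(np.sum(residual**2))
```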
2.2 Incremental Principal Component Analysis
To incrementally update the appearance bases such that the updated linear appearance model can represent new appearance data, the traditional PCA would require keeping all the training images, which is very inefficient. Instead, we use incremental subspace learning algorithms [7], [8], [9] that are more efficient than traditional PCA. Among them, we used the method proposed by Hall et al. [8], which updates the mean and bases efficiently.
Suppose that a set of d-dimensional data vectors is $D = \{d_1, \ldots, d_N\}$. The eigenspace of the data set can be obtained by solving the singular value decomposition (SVD) of the covariance matrix C. Then, the given data set can be represented by k(< d)-dimensional coefficient vectors $a_i$ by projecting each data vector $d_i$ onto the subspace spanned by the k eigenvectors corresponding to the k largest eigenvalues. For ease of explanation, we will use a matrix $U = [u_1 \cdots u_k] \in R^{d \times k}$ that contains the k eigenvectors and a diagonal matrix $\Lambda \in R^{k \times k}$ that contains the k largest eigenvalues as its diagonal elements in descending order. When a new data vector $d_{N+1}$ is given, the incremental PCA updates the mean and the basis vectors as follows. Since the total amount of data has changed, we should update the mean and the basis vectors to represent the data including the new data vector. The mean is updated as
$$\bar{d}' = \frac{1}{N+1}\big(N\bar{d} + d_{N+1}\big). \qquad (4)$$
Then, the orthogonal residual vector $b_{N+1}$ is computed as $b_{N+1} = (U a_{N+1} + \bar{d}) - d_{N+1}$ and its normalized vector is $\hat{b}_{N+1} = \frac{b_{N+1}}{\|b_{N+1}\|_2}$. We acquire the new basis set $U'$ by rotating the basis set $[U\ \hat{b}_{N+1}]$ so that the i-th basis of the new basis set represents the i-th largest maximal variance:
$$U' = [U\ \hat{b}_{N+1}]\, R. \qquad (5)$$
The rotation matrix R can be obtained by solving the SVD of the matrix $D'$:
$$D' R = R \Lambda', \qquad (6)$$
where we compose $D' \in R^{(k+1)\times(k+1)}$ as
$$D' = \frac{N}{N+1}\begin{bmatrix} \Lambda & 0 \\ 0^T & 0 \end{bmatrix} + \frac{N}{(N+1)^2}\begin{bmatrix} a a^T & \beta a \\ \beta a^T & \beta^2 \end{bmatrix}, \qquad (7)$$
where $\beta = \hat{b}_{N+1}^T(d_{N+1} - \bar{d})$ and $a = U^T(d_{N+1} - \bar{d})$.
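A compact sketch of one such update, following Eqs. (4)-(7) with our own variable names, is given below; the eigen-decomposition of the small (k+1)x(k+1) matrix stands in for the SVD of Eq. (6), and truncation back to k components is left to the caller.

```python
import numpy as np

def incremental_pca_update(mean, U, Lam, N, d_new):
    """Hall-style incremental PCA update for one new data vector d_new.

    mean : (d,) current mean;  U : (d, k) eigenvectors;  Lam : (k,) eigenvalues;
    N    : number of samples seen so far.
    Returns the updated (mean, U, eigenvalues) with k+1 components.
    """
    a = U.T @ (d_new - mean)                        # coefficients of the new sample
    residual = (U @ a + mean) - d_new               # orthogonal residual b_{N+1}
    r_norm = np.linalg.norm(residual)
    b_hat = residual / r_norm if r_norm > 1e-12 else np.zeros_like(residual)
    beta = b_hat @ (d_new - mean)
    k = len(Lam)
    Dp = np.zeros((k + 1, k + 1))                   # small problem of Eq. (7)
    Dp[:k, :k] = np.diag(Lam) * (N / (N + 1.0))
    outer = np.zeros((k + 1, k + 1))
    outer[:k, :k] = np.outer(a, a)
    outer[:k, k] = beta * a
    outer[k, :k] = beta * a
    outer[k, k] = beta**2
    Dp += outer * (N / (N + 1.0)**2)
    evals, R = np.linalg.eigh(Dp)                   # rotation of Eq. (6)
    order = np.argsort(evals)[::-1]
    evals, R = evals[order], R[:, order]
    U_new = np.hstack([U, b_hat[:, None]]) @ R      # Eq. (5)
    mean_new = (N * mean + d_new) / (N + 1.0)       # Eq. (4)
    return mean_new, U_new, evals
```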
2.3 Adaptive Observation Model
Because the AAM fitting error in Eq. (3) is not a good measure to determine the fitting quality, we measure the percentage of outlier pixels. A standard approach to detect outlier pixels is to apply robust statistics to the AAM fitting error at each pixel. However, this approach cannot discriminate between two types of outlier pixels. The first type occurs when the fitting is incorrect, where incorrectly matched pixels have large error values. The second type occurs even when the fitting is correct: because AAMs use a fixed linear appearance model, untrained lighting conditions such as shading or highlighting can cause correctly matched pixels to have large error values. On the other hand, the AOM is an appearance modeling technique that adaptively updates the statistics of the pixel intensities to incorporate the varying appearance of the object. Hence, the AOM
provides more robust outlier detection results than the AAM. The AOM is defined by a mixture of Gaussians at each pixel of the template. It assumes that the probability of an observed appearance can be explained by the mixture density of three components Ω = {S, W, F} that correspond to three causes. The S component depicts the stable structure within all past observations that changes slowly, the W component characterizes the variations between two successive frames, and the F component accounts for a fixed template. The AOM assumes that each component obeys a Gaussian distribution and that the d pixels constituting the AOM are independent of each other. At time t, the AOM $\Omega_t = \{S_t, W_t, F_t\}$ is defined by the mixture centers $\{\mu_{i,t};\ i = s, w, f\}$, their corresponding variances $\{\sigma_{i,t}^2;\ i = s, w, f\}$, and the mixing probabilities $\{m_{i,t};\ i = s, w, f\}$. Once the AOM $\Omega_t$ is given, an image patch $\hat{Z}_t$ that best matches this model is found, and the model parameters are updated to $\{m_{i,t+1}, \mu_{i,t+1}, \sigma_{i,t+1}^2;\ i = s, w, f\}$ for the next frame. To update the model parameters, first the posterior responsibility probability $\omega_{i,t}(j)$ for the j-th pixel (j = 1, …, d) is computed as $\omega_{i,t}(j) \propto m_{i,t}(j)\, N(\hat{Z}_t(j); \mu_{i,t}(j), \sigma_{i,t}^2(j))$, $i = w, s, f$, where $\omega_{i,t}$ is normalized such that $\sum_{i=s,w,f} \omega_{i,t} = 1$ and N is the normal distribution
N(x; μ, σ²) = (1 / √(2πσ²)) exp(−(1/2) ((x − μ)/σ)²).   (8)

Second, the new mixing probabilities are obtained as

mi,t+1(j) = α ωi,t(j) + (1 − α) mi,t(j);   i = w, s, f,   (9)
where α is used to make sure that the past images are exponentially forgotten with respect to their contributions to the current appearance; it is defined as α = 1 − exp(−(nh / log 2)^{−1}), where nh is the half-life in frames. Third, the first- and second-moment images {M^p_{t+1}; p = 1, 2} are evaluated as

M^p_{t+1}(j) = α Ẑ^p_t(j) ωs,t(j) + (1 − α) M^p_t(j);   p = 1, 2.   (10)
Once ωi,t, mi,t+1, and M^p_{t+1} are computed, the mixture centers and the variances are simply updated with the following equations:

Wt+1(j):  μw,t+1(j) = Ẑt(j),  σ²w,t+1(j) = σ²w,1(j),   (11)

St+1(j):  μs,t+1(j) = M¹t+1(j) / ms,t+1(j),  σ²s,t+1(j) = M²t+1(j) / ms,t+1(j) − μ²s,t+1(j),   (12)

Ft+1(j):  μf,t+1(j) = F1(j),  σ²f,t+1(j) = σ²f,1(j).   (13)
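One full AOM update step, Eqs. (8)-(13), can be sketched as follows. This is an illustrative implementation under our own naming conventions; the state dictionary, the small epsilon terms and the handling of degenerate variances are assumptions, not part of the original formulation.

```python
import numpy as np

def aom_update(Z, state, alpha):
    """One per-pixel AOM update following Eqs. (8)-(13).

    Z : (d,) warped observation. state holds (d,) arrays:
      m_i, mu_i, s2_i for i in {s, w, f}; running moments M1, M2;
      and the fixed initial values mu_f1, s2_f1, s2_w1.
    """
    comps = ('s', 'w', 'f')
    # Eq. (8): responsibilities, normalized over the three components per pixel
    lik = {i: state['m_' + i]
              * np.exp(-0.5 * (Z - state['mu_' + i]) ** 2 / state['s2_' + i])
              / np.sqrt(2 * np.pi * state['s2_' + i]) for i in comps}
    total = sum(lik.values()) + 1e-12
    w = {i: lik[i] / total for i in comps}

    # Eq. (9): mixing probabilities with exponential forgetting
    for i in comps:
        state['m_' + i] = alpha * w[i] + (1 - alpha) * state['m_' + i]

    # Eq. (10): first and second moments driven by the S responsibility
    state['M1'] = alpha * Z * w['s'] + (1 - alpha) * state['M1']
    state['M2'] = alpha * Z ** 2 * w['s'] + (1 - alpha) * state['M2']

    # Eq. (11): the W component tracks the latest frame
    state['mu_w'] = Z.copy()
    state['s2_w'] = state['s2_w1'].copy()

    # Eq. (12): the S component follows the running moments
    state['mu_s'] = state['M1'] / state['m_s']
    state['s2_s'] = state['M2'] / state['m_s'] - state['mu_s'] ** 2

    # Eq. (13): the F component stays at its initial template
    state['mu_f'] = state['mu_f1'].copy()
    state['s2_f'] = state['s2_f1'].copy()
    return state
```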
3 Incremental Active Appearance Model

3.1 Modified AOM and Outlier Detection
Among the three components of the AOM, the fixed F component is not appropriate to be used with AAM because the appearance of the face changes
severely due to inter-person variation, pose change, illumination change and so on. Therefore, we propose to use a modified AOM, where the F component is replaced with an R component. As with the other components, the R component has mean and variance parameters for each appearance pixel, where the mean is always set to the model appearance A, and the variance is set to that of the S component:

μr,t+1(j) = A(j),  σ²r,t+1(j) = σ²s,t+1(j).   (14)

The R component helps the AOM adapt to the appearance variation of the face more flexibly than the F component. After fitting the AAM to the input image, outlier pixels in the warped input image Ĩ(x), which is warped to the shape template s0, are detected using the modified AOM as follows. For the j-th pixel, the deviation of the pixel intensity from each of the current modified AOM's three components is compared to a threshold θ1 to decide whether the pixel is an outlier for that component:

pixel j is an outlier for component i, if |Ĩ(j) − μi,t(j)| / σi,t(j) > θ1,   i ∈ {s, w, r},   (15)

and the numbers of outlier pixels of the three components, noutlier,i, are counted. Once all the outlier pixels are identified, the fitting result is determined to be successful if the percentage of outlier pixels r is smaller than a certain threshold value θ2:

Fitting result is successful, if r < θ2,   (16)

where the fitting quality r is computed by Σ_{i∈{s,w,r}} ωi · noutlier,i, and the ωi are weighting factors such that Σ_{i∈{s,w,r}} ωi = 1.
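A minimal sketch of the outlier test of Eq. (15) and the quality decision of Eq. (16) is shown below. The dictionary layout, and the normalization of the weighted outlier count by the number of pixels d to obtain a percentage, are our assumptions.

```python
import numpy as np

def fitting_is_successful(I_warp, aom, weights, theta1, theta2):
    """Count per-component outliers (Eq. (15)) and test the quality (Eq. (16)).

    I_warp  : (d,) input image warped to the shape template s0
    aom     : dict with aom['mu'][i] and aom['sigma'][i] for i in {'s','w','r'}
    weights : dict of component weights w_i that sum to one
    """
    d = I_warp.size
    r = 0.0
    for i in ('s', 'w', 'r'):
        dev = np.abs(I_warp - aom['mu'][i]) / aom['sigma'][i]
        n_outlier = np.count_nonzero(dev > theta1)   # Eq. (15)
        r += weights[i] * n_outlier / d              # weighted outlier ratio
    return r < theta2                                # Eq. (16)
```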
3.2 Active Appearance Model with Incremental Subspace Learning
Once we have judged the quality of the fitting result, we must decide whether or not to update the appearance bases of the AAM and the AOM parameters. When the fitting result is good, it is obvious that we should update the AAM and AOM. When the fitting result is bad, we can choose one of two policies for the AOM update:

1. Policy 1: update the AOM only at the inlier pixels,
2. Policy 2: do not update the AOM.

Under policy 1, the parameters of the AOM are updated even though the AAM fitting has failed. To prevent this erroneous update, we decided to use policy 2. The proposed incremental AAM (adAAM) fitting algorithm is summarized in Fig. 1. For a new image, the AAM is initialized using a face and eye detector and fitted to the input image. Next, the goodness of the fitting result is determined as explained in Section 3.1. If the fitting result is good, the AOM parameters are updated and the new appearance bases of the AAM are computed using the incremental PCA algorithm. The updated AOM and AAM are then used for the next input image.
Fig. 1. A procedure of the incremental AAM fitting algorithm
4 Experiment Results

4.1 Data Set
To prove the effectiveness of the incremental AAM, we recorded two image sequences under different lighting conditions. Fig. 2 shows example images of the two sequences: (a) a sequence recorded under a normal illumination condition and (b) a sequence recorded under a varying illumination condition, where the lighting direction changes among front, left, right and upper side. Two sets of images were extracted from the first and second image sequences: 100 images from the first sequence and 250 images from the second sequence, used as the training and test set, respectively, and manually landmarked. From the landmarked training images, a 2D AAM was built.

(a) Training image sequence (normal illumination). (b) Test image sequence (varying illumination condition).

Fig. 2. Example images of the two image sequences
4.2 Comparison of Face Tracking Result Using the Incremental AAM and the Original AAM
We compared the face tracking performance of the original AAM and the proposed incremental AAM using the test image sequence. At the first frame, the original AAM and the incremental AAM were initialized using the landmarked ground truth so that both AAMs began from good initial conditions. In the remaining frames, the incremental AAM dynamically updated the appearance bases and AOM parameters as explained in Section 3. In this experiment, the parameters ωw, ωs, ωf, θ1, and θ2 were set to 0.35, 0.3, 0.35, 1.55, and 0.135, respectively. Fig. 3 shows the AAM fitting errors, measured by the sums of squared errors, of the original AAM and the incremental AAM. The figure shows that the fitting error of the original AAM began to increase drastically after the 25th frame, where the lighting condition began to change, while the fitting error of the incremental AAM remained small. The fitting error of the incremental AAM occasionally dropped at some frames, where the appearance bases of the adAAM were incrementally updated. To compare the tracking performance of the two methods quantitatively, the fitting results are classified into one of three error types (small-error, medium-error, large-error) according to the magnitude of the fitting error, and the number of fitting results belonging to each error type is counted. Table 1 summarizes the relative percentages of the error types. The table shows that the fitting of the adAAM was successful at most of the frames; small fitting errors were measured at 87.5% of the frames and medium fitting errors at 12.4%, while the fitting of the original AAM was unsuccessful at many frames; small fitting errors were measured at 53.4% of the frames, medium fitting errors at 20.9%, and large fitting errors at 25.7%.
Fig. 3. A comparison of the sums of squared errors
Table 1. A comparison of the tracking errors
                  Small-error   Medium-error   Large-error
Original AAM         53.4%          20.9%          25.7%
Incremental AAM      87.6%          12.4%            0%
Fig. 4 shows some fitting results of the original AAM (Fig. 4 (a)) and the incremental AAM (Fig. 4 (b)), where the three images in a row are the fitting result, the backward-warped input image, and the outlier detection result, respectively. The test sequence began with a normal lighting condition, as shown in the first row, and both algorithms fitted well. As the lighting condition began to change, the fitting of the original AAM began to wander around the face, while the fitting of the incremental AAM was very stable throughout the whole sequence, as shown in the remaining rows.
(a) Original AAM
(b) Incremental AAM
Fig. 4. A comparison of tracking results of the (a) original AAM and (b) proposed incremental AAM
5 Conclusion
The Active Appearance Model is sensitive to varying illumination conditions, which results in unstable operation of on-line applications. To alleviate this problem, we proposed an incremental AAM that updates the appearance bases to adapt itself to the varying environment. However, an update with a bad fitting result can make the model worse than before. Therefore, it is necessary
to measure the goodness of the fitting, and for this we proposed a modified appearance observation model. The experimental results showed that the proposed appearance observation model detects outlier pixels well and provides a good guide for judging the goodness of the fitting result. We also showed the superiority of the proposed incremental AAM algorithm by comparing the tracking results of the incremental AAM and the traditional AAM.
Acknowledgement This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University. It was also financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency.
Robust Face Tracking Using Motion Prediction in Adaptive Particle Filters

Sukwon Choi and Daijin Kim

Intelligent Multimedia Laboratory, Dept. of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea
{capriso, dkim}@postech.ac.kr
Abstract. We propose an efficient real-time face tracking system that can follow fast movements. For face tracking, we use a particle filter, which can handle arbitrary distributions. To track fast movements, we predict the motion using motion history and motion estimation, so we can find the face with fewer particles. For the observation model, we use the active appearance model (AAM) to obtain an accurate face region, and update the model using incremental principal component analysis (IPCA). Our occlusion handling scheme incorporates the motion history to handle a moving face under occlusion. We present several experimental results to show that our system performs better than previous works.
1 Introduction
Many kinds of Bayesian filters are widely used in the tracking area, because they estimate the sequence of hidden states. In particular, the particle filter [1], [2], [3] is a widely used technique among the Bayesian filters because it can handle arbitrary distributions. In this paper, we apply the particle filter framework to solve the face tracking problem. The particle filter consists of an observation model and a state transition model. The observation model has to determine how well the face is tracked. For this purpose, we need the appearance of the face. There are several researches on face modeling algorithms that construct the appearance of the face. Turk et al. [4] use principal component analysis (PCA) for eigenface analysis, while Bartlett et al. [5] perform independent component analysis (ICA)-based representations for modeling faces. Jepson et al. [6] introduce an online learning algorithm, namely the online appearance model (OAM), which assumes that each pixel can be explained with a mixture of Gaussians, and Zhou et al. [7] modify it. However, these approaches cannot overcome facial variations such as illumination and expression. Cootes et al. [8] design the active appearance model (AAM), which is expanded from the active shape model [9]. Although the fitting process of the AAM takes time, it provides robustness and flexibility. However, the drawback of the AAM is that not all variations can be trained during the training phase. Hamlaoui et al. [10] expand Zhou's work and use the AAM as their appearance model, but there is no method for handling the change of illumination.
For the state transition model of the particle filter, the zero-velocity model is not appropriate for tracking fast moving objects. For object tracking, the image registration technique introduced by Lucas and Kanade [11] can be used to make up for the weakness of the zero-velocity model if the movement is fast. However, their method is a gradient-based optimization, hence it can easily be trapped in a local minimum. Zhou et al. [7] combine first-order linear approximation and least-square estimation to track objects effectively. However, if the number of particles is insufficient or the variance is relatively small, fast movements cannot be tracked in a real application. Our system aims to make both the observation model and the state transition model adaptive. For this purpose, we incorporate incremental principal component analysis (IPCA) [12], [13], [14] with the AAM, and motion prediction with least-square estimation. This paper is organized as follows: We review the particle filter in Section 2. In Section 3 and Section 4, we explain our observation model and state transition model. We present how to handle occlusion while the face is moving in Section 5. Next, experimental results are presented in Section 6 and the conclusion is drawn in Section 7.
2 Particle Filter
The purpose of the particle filter is to estimate the states {θ1, . . . , θt} recursively using a sampling technique. To estimate the states, the particle filter approximates the posterior distribution p(θt | Y1:t) with a set of samples {θt^(1), . . . , θt^(P)} and a noisy observation {Y1, . . . , Yt}. The particle filter consists of two components, the observation model and the state transition model. They can be defined as:

State Transition Model: θt = Ft(θt−1, Vt),
Observation Model: Yt = Ht(θt, Wt).   (1)
The transition function Ft approximates the dynamics of the object being tracked using the previous state θt−1 and the system noise Vt . The measurement function Ht models a relationship among the noisy observation Yt , the hidden state θt , and the observation noise Wt . We can characterize transition probability p(θt |θt−1 ) with the state transition model, and likelihood p(Yt |θt ) with the observation model. We use maximum a posteriori (MAP) estimate to obtain the state estimate θˆt . The particle filter algorithm is performed as in Table 1.
3 Observation Model
The observation model in the particle filter captures the relationship between the observed data and the state. We therefore design the observation likelihood function with the AAM to describe how the noisy observations deviate from the true state. To make the observation model adaptive to illumination change, we use IPCA to update the appearance model.
Table 1. Brief algorithm of the particle filter

Initialization Step: Generate P weighted samples {(θ0^(1), 1), . . . , (θ0^(P), 1)} according to the prior density p(θ0).
For t = 1, 2, . . .
  For p = 1, 2, . . . , P
    (a) Resampling Step: Resample a particle based on its weight as (θt−1^(p), 1).
    (b) Prediction Step: Draw a sample according to the state transition model θt^(p) = Ft(θt−1^(p), Vt^(p)).
    (c) Updating Step: Update the weight ωt^(p) to its likelihood probability p(Yt | θt^(p)).
  End
  Normalization Step: Normalize the weights ωt^(p) = p(Yt | θt^(p)) / Σ_{i=1}^{P} p(Yt | θt^(i)).
  Estimation Step: Estimate the state using MAP: θ̂t = arg max_{θt} ωt^(p).
End
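The loop of Table 1 can be sketched generically as below. This is an illustrative sketch, not the authors' implementation: the transition and likelihood functions are passed in as callables, and for compactness the resampling is performed at the end of the iteration for the next step rather than at the beginning.

```python
import numpy as np

def particle_filter_step(particles, transition, likelihood, rng):
    """One iteration of the particle filter of Table 1.

    particles  : (P, n) array of the previous states theta_{t-1}^{(p)}
    transition : callable theta -> theta_t, drawing the system noise internally
    likelihood : callable theta_t -> p(Y_t | theta_t)
    """
    P = particles.shape[0]
    # Prediction step: propagate each particle through the transition model
    predicted = np.array([transition(p) for p in particles])
    # Updating step: weight each particle by its observation likelihood
    weights = np.array([likelihood(p) for p in predicted])
    # Normalization step
    weights = weights / weights.sum()
    # Estimation step: MAP estimate over the weighted particles
    estimate = predicted[np.argmax(weights)]
    # Resampling for the next iteration
    idx = rng.choice(P, size=P, p=weights)
    return predicted[idx], estimate
```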
3.1 Observation Model Using AAM
In the observation model, we propose to use the AAM to measure the degree of deviation. The AAM tries to minimize the AAM error, which is the mean squared distance between the reconstruction image, IR,t, and the warped image, IW,t. Thus, we approximate the AAM error with the Mahalanobis distance between IR,t−1 and IW,t to measure each particle. We do not want the observation likelihood value of a good particle to be spoiled by a few outlier pixels. Hence, we decrease the weight of outliers using robust statistics [15], and we have the observation likelihood as:

p(It | θt) ∝ exp( − Σ_{l=1}^{N} ρ( (IW,t(l) − IR,t−1(l)) / σR(l) ) ),   (2)

where

ρ(x) = (1/2) x²,  if |x| < ξ;   ρ(x) = ξ|x| − (1/2) ξ²,  if |x| ≥ ξ,   (3)
where N is the number of pixels in the template image, l is the pixel index, σR is the standard deviation of the reconstruction image, and ξ is a threshold that determines whether a pixel is an outlier or not.
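A minimal sketch of the robust likelihood of Eqs. (2)-(3) is shown below; the log of the likelihood is returned up to an additive constant, and the function name and arguments are our own.

```python
import numpy as np

def observation_log_likelihood(I_warp, I_recon, sigma_R, xi):
    """Robust AAM observation likelihood of Eqs. (2)-(3), up to a constant.

    I_warp  : (N,) warped input image I_{W,t}
    I_recon : (N,) previous reconstruction image I_{R,t-1}
    sigma_R : (N,) per-pixel standard deviation of the reconstruction image
    xi      : outlier threshold of the robust rho function
    """
    x = (I_warp - I_recon) / sigma_R
    rho = np.where(np.abs(x) < xi,
                   0.5 * x ** 2,                    # quadratic for inliers
                   xi * np.abs(x) - 0.5 * xi ** 2)  # linear for outliers
    return -np.sum(rho)   # exponentiate if an (unnormalized) density is needed
```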
3.2 Update Appearance Model
After estimating the state, we need to update our appearance model to cover the illumination change. Before we update the appearance model, we need to check whether the fitting is performed well or not, and whether we really need
to update the appearance model or not. To check these conditions, we use the OAM outliers and the AAM error. To determine the goodness of the fitting, we obtain the number of outliers as in [7]. We get the number of outliers for each component and take the average number of outliers, Nout. If Nout is smaller than a certain threshold, then we update our OAM. If Nout is small but the AAM error is large, then we need to apply the IPCA, because the current set of bases cannot express the new input IW,t properly.
4 State Transition Model
The state transition model describes the dynamics of a moving object. Rather than using the zero-velocity model, we combine the results from motion history and motion estimation, and then apply the least-square estimation technique.
4.1 Velocity Model with Motion History
To track an object, Zhou et al. [7] use least-square estimation to obtain the predicted shift. The detailed derivation of the least-square estimation is explained in [7]. This velocity model computes the predicted shift based on the input image I and the initial starting point θ∗. We can write the predicted shift in functional form as VLS{I; θ∗}. With a sufficient number of particles, least-square estimation hardly ever fails to track a fast moving face. However, it is time consuming to obtain VLS{I; θ∗}, because the number of particles determines the computational load. Hence, we measured the frame rate and the tracking performance for a fast moving face while reducing the number of particles, as shown in Fig. 1. We can see that the ability and
(a) Frame rate. (b) Tracking failure.
the complexity of the algorithm depend on the number of particles. The graph in Fig. 2 shows how fast the face moves in a 320 × 240 image frame. In the case of slow moving face, as in Fig. 2-(b), the algorithm successfully tracks the face with only 15 particles. As we can see in Fig. 2, the maximum velocity of the fast moving face is about 30 pixels per frame. On the other hand, the maximum velocity of the slow moving face is about 13∼14 pixels per frame. Thus, if the
552
S. Choi and D. Kim Velocity change of the face movement in image sequence 40
35
35
30
30 Velocity (pixel per frame)
Velocity (pixel per frame)
Velocity change of the face movement in image sequence 40
25 20 15 10
20 15 10
5 0
25
5
0
100
200
300
400
500
0
0
100
Frame
200
300
400
500
Frame
(a) The velocity of fast moving face (b) The velocity of slow moving face Fig. 2. The velocity of face in image sequences
distance to be shifted by the least-square estimation is less than 13∼14 pixels, then we can track the face with a small number of particles. Here, we propose to shift the initial point of each frame using the motion history to track the fast movement of the human face. Using the knowledge of the motion vector of the previous frame, we can shift the initial starting point somewhere close to the true solution. We define a new velocity term vt^MH and an acceleration term at^MH. We predict vt^MH and at^MH as weighted terms of the previous motion vectors vt−1 and vt−2 as:
vt^MH = α^MH vt−1 = α^MH (θ̂t−1 − θ̂t−2),   at^MH = α^MH (vt−1 − vt−2),   (4)

where α^MH ∈ [0, 1]. We apply Eq. (5) to shift the initial point with the motion history, perform the least-square based velocity model, and construct the particles as:

θ̂t^MH = θ̂t−1 + vt^MH + (1/2) at^MH,
θt^(p) = θ̂t^MH + VLS{It; θ̂t^MH} + Vt^(p),   (5)

where Vt^(p) is a system noise. After applying the motion history to an image sequence with a fast moving face, we calculate the distance to be shifted by the least-square estimation. As shown in Fig. 3, the maximum distance is less than 10 pixels, hence the face can be tracked with a small number of particles.
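The motion-history part of Eqs. (4)-(5) can be sketched as follows; the least-square shift VLS and the system noise are added afterwards when the particles are constructed. Names and the use of three past MAP estimates are our own conventions for the sketch.

```python
import numpy as np

def motion_history_prediction(theta_prev, theta_prev2, theta_prev3, alpha_mh):
    """Shifted initial point of Eqs. (4)-(5), before the least-square refinement.

    theta_prev, theta_prev2, theta_prev3 : MAP estimates at t-1, t-2, t-3
    alpha_mh : weight in [0, 1]
    """
    v_prev = theta_prev - theta_prev2           # motion vector at t-1
    v_prev2 = theta_prev2 - theta_prev3         # motion vector at t-2
    v_mh = alpha_mh * v_prev                    # Eq. (4), velocity term
    a_mh = alpha_mh * (v_prev - v_prev2)        # Eq. (4), acceleration term
    return theta_prev + v_mh + 0.5 * a_mh       # Eq. (5), shifted initial point
```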
4.2 Velocity Model with Motion Prediction
We suggest a motion prediction model that combines the motion estimation concept [16] and the motion history information to find a better initial point. The motion estimation technique tries to find the macroblock of the reference frame in order to predict the motion vector. We use Adaboost as the search algorithm of the motion estimation to produce reliable results. The search space of the motion estimation has to be very large, as much as twice the actual face size, if the face moves very fast. To reduce the search complexity, we can incorporate the motion history information into the motion estimation
Fig. 3. The distance to be shifted with least-square estimation
algorithm. Combining these two methods keeps the candidate face region small around θ̂t^MH, so we do not need to search a large candidate region. To obtain the predicted motion vector, we combine the result from the motion history, θ̂t^MH, and the result from Adaboost, θ̂t^Ada. The predicted state, θ̂t^MP, and a new set of particles can be constructed as follows:

θ̂t^MP = α^MP θ̂t^Ada + (1 − α^MP) θ̂t^MH,
θt^(p) = θ̂t^MP + VLS{It; θ̂t^MP} + Vt^(p),   (6)
where α^MP is a weight for the predicted parameters of the face, which is in the range [0, 1]. We check the tracking performance of the motion history model, the motion estimation model, and the motion prediction model as in Table 2. We calculated the average distance of the horizontal/vertical translation parameters between the ground truth and the tracking result for each technique.

Table 2. Average distance between ground truth and tracking results from each model
                           Motion History   Motion Estimation   Motion Prediction
Average Distance (pixels)       0.46              0.97                0.41
After obtaining the MAP estimate θ̂t based on Eq. (2), we perform the AAM fitting at θ̂t and decide whether we need to update the appearance model or not.
4.3 Noise Variance and Number of Particles
We adaptively change the noise variance and the number of particles. They are proportional to the Mahalanobis distance, Dt, between IR,t−1 and IW,t:

σt = σ0 × Dt / D0,   Pt = P0 × Dt / D0.   (7)
We restrict the range of the standard deviation to [σmin, σmax] and the number of particles to [Pmin, Pmax] to ensure a certain degree of frame rate and performance.
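A short sketch of this adaptation, Eq. (7) plus the clamping described above, is shown below under our own naming.

```python
import numpy as np

def adapt_filter_parameters(D_t, D_0, sigma_0, P_0,
                            sigma_min, sigma_max, P_min, P_max):
    """Adaptive noise variance and particle count of Eq. (7), with clamping."""
    sigma_t = float(np.clip(sigma_0 * D_t / D_0, sigma_min, sigma_max))
    P_t = int(np.clip(P_0 * D_t / D_0, P_min, P_max))
    return sigma_t, P_t
```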
5 Occlusion Handling
Our system determines occlusion with the AAM error and the number of outliers. If the face is occluded, then the AAM error EAAM becomes higher than a certain threshold, ηAAM. We also check whether Nout is larger than a certain threshold ηocc or not, because occlusion produces a large number of outliers. When occlusion is declared, least-square estimation shows poor tracking performance. Hence, we stop using it during occlusion and use the motion history information only:

θt^(p) = θ̂t^MH + Vt^(p).   (8)

However, it is insufficient to use only the motion history information to track the movement successfully, thus we need to maximize the variance and the number of particles. We construct a new set of particles based on Eq. (8).
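The two proposal branches, Eq. (6) for normal operation and Eq. (8) under occlusion, can be sketched as one particle-proposal routine. The occlusion test itself (EAAM > ηAAM and Nout > ηocc) is assumed to have been done by the caller; the isotropic Gaussian noise is our simplifying assumption.

```python
import numpy as np

def propose_particles(theta_mh, theta_mp, v_ls, noise_sigma, P, occluded, rng):
    """Particle proposal with the occlusion branch of Eq. (8).

    theta_mh : motion-history prediction, theta_mp : motion-prediction state,
    v_ls     : callable computing the least-square shift V_LS{I_t; theta}.
    """
    if occluded:
        # Eq. (8): least-square estimation is unreliable under occlusion,
        # so rely on the motion history alone (with an enlarged spread).
        center = theta_mh
    else:
        center = theta_mp + v_ls(theta_mp)      # Eq. (6) style proposal
    noise = rng.normal(0.0, noise_sigma, size=(P, center.size))
    return center + noise
```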
6 Experimental Results
We implemented the proposed system in a Windows C++ environment. Our system processes at least 15 frames per second on a Pentium 4 CPU with 3.0 GHz and 2 GB RAM. We detected the face with Adaboost for initialization.
6.1 Tracking Fast Moving Face
We tracked the fast moving face using two methods, the velocity model with least-square estimation and the one with motion prediction. Fig. 4 shows the velocity of the face at each frame, and the circles mark the frames where the least-square estimation failed to track the face. It is clear that the face is moving very fast whenever the least-square estimation fails to track it. Fig. 5 shows the overall tracking process. After obtaining frame t + 1, we can see that the movement is very large. However, by applying the motion prediction model, we can decrease the distance to be tracked, hence we can track the face successfully. We used only 12∼25 particles for tracking, and the frame rate of the system stays at about 15 frames per second, as shown in Fig. 1.
Fig. 4. The velocity of the face at each frame
(a) AAM fitting at frame t. (b) Obtaining frame t + 1. (c) Motion prediction at frame t + 1. (d) AAM fitting at frame t + 1.

Fig. 5. Overall fitting process
Also, we analyzed the performance of each method using the AAM error. Fig. 6 compares the AAM error between the velocity model with least-square estimation and the one with motion prediction. As shown in the graph, the AAM error of the velocity model with least-square estimation is very large and unstable. On the other hand, the velocity model with motion prediction shows a relatively small error and stable performance. A small AAM error means that the fitting result is close to the space spanned by the eigenvectors of the AAM. Hence, we can say that the particle filter with the motion prediction model produces better fitting results. Fig. 7 compares the tracking results of the least-square estimation and the motion prediction model. We can see that the tracking results without the motion prediction model are very unstable.
Fig. 6. Comparing AAM error of least-square estimation and motion prediction
Fig. 7. Tracking results with least-square estimation (1st row) and motion prediction (2nd row) at frame 120, 150, 195
6.2 Occlusion Handling
In this section, we present the performance of our occlusion handling. As shown in Fig. 8, the moving face is heavily occluded. Our occlusion detection scheme detected outliers and occlusion successfully, and handled them properly, thus the shape of the AAM was not corrupted. On the upper right corner of each image, we can see small template image that displays outliers with white pixels. Since we incorporated the motion history in the state transition model for occlusion, the tracker follows the movement of the heavily occluded face. Therefore, this experiment shows that our occlusion handling algorithm can approximate the dynamics of the occluded face.
Fig. 8. Occlusion handling for moving face at frame 43, 56, 65 in the 1st row, frame 277, 281, 285 in the 2nd row, and frame 461, 463, 465 in the 3rd row
7 Conclusion
In this paper, we presented a face tracking system that can track fast movements. The success of our face tracking system lies in the assumption that if we can shift the initial point to somewhere near the solution, then we can successfully find the face with a small number of particles and a small variance. To predict the motion vector more precisely, we reduced the search space with the motion history and applied the motion estimation algorithm. Then, we obtained the motion vector by combining the results from the motion history and the motion estimation. Our occlusion handling algorithm is designed to handle occlusion while the face is moving. Under occlusion, the state transition model incorporates only the motion history, and our system can still approximate the dynamics of the occluded face. Our system was implemented and tested in a real-time environment. The experimental results showed that the position of the face was predicted very precisely with a small number of particles. Therefore, we found that this approach generates particles more efficiently, avoids unnecessary computational load, and improves tracking performance.
Acknowledgement This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University. Also, it was partially supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea.
References 1. Doucet, A., Godsill, S.J., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10, 197–209 (2000) 2. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50, 174–189 (2002) 3. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29, 5–28 (1998) 4. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 72–86 (1991) 5. Bartlett, M.S., Lades, H.M., Sejnowski, T.J.: Independent component representations for face recognition. In: Proceedings of the SPIE Symposium of Electronic Imaging III, San Jose, CA, vol. 3299, pp. 528–539 (1998) 6. Jepson, A.D., Fleet, D.J., El-Maraghi, T.: Robust online appearance model for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1296–1311 (2003) 7. Zhou, S., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing 13, 1491–1506 (2004) 8. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 484–498. Springer, Heidelberg (1998) 9. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Journal of Computer Vision and Image Understanding 61, 38–59 (1995) 10. Hamlaoui, S., Davoine, F.: Facial action tracking using particle filters and active appearance models. In: Joint sOc-EUSAI conference, pp. 165–169 (2005) 11. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 12. Hall, P., Marshall, D., Martin, R.: Incremental eigenanalysis for classification. In: Proceedings of British Machine Vision Conference, pp. 286–295 (1998) 13. Artac, M., Jogan, M., Leonardis, A.: Incremental PCA for on-line visual learning and recognition. In: International Conference on Pattern Recognition, pp. 781–784 (2002) 14. Ross, D.A., Lim, J., Yang, M.H.: Adaptive probabilistic visual tracking with incremental subspace update. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 470–482. Springer, Heidelberg (2004) 15. Huber, P.J.: Robust statistics. John Wiley, Chichester (1982) 16. Bhaskaran, V., Konstantinides, K.: Image and Video Compression Standards. Kluwer Academic Publishers, Dordrecht (1997)
A Simple Oriented Mean-Shift Algorithm for Tracking

Jamil Drar´eni and S´ebastien Roy

DIRO, Universit´e de Montr´eal, Canada
{drarenij,roys}@iro.umontreal.ca
Abstract. Mean-Shift tracking gained a lot of popularity in computer vision community. This is due to its simplicity and robustness. However, the original formulation does not estimate the orientation of the tracked object. In this paper, we extend the original mean-shift tracker for orientation estimation. We use the gradient field as an orientation signature and introduce an efficient representation of the gradient-orientation space to speed-up the estimation. No additional parameter is required and the additional processing time is insignificant. The effectiveness of our method is demonstrated on typical sequences.
1 Introduction
Object tracking is a fundamental and challenging task in computer vision. It is used in several applications such as surveillance [1], eye tracking [2] and object-based video compression/communication [3]. Although many tracking methods exist, they generally fall into two classes, bottom-up and top-down [4]. In a bottom-up approach, objects are first identified and then tracked. The top-down approach instead uses hypotheses or signatures that discriminate the object of interest. The tracking is then performed by hypothesis satisfaction. Recently, a top-down algorithm based on mean-shift was introduced for blob tracking [5]. This algorithm is non-parametric and relies solely on intensity histograms. The tracking is performed by finding the mode of a statistical distribution that encodes the likelihood of the original object's model and the model at the probing position. Because it is a top-down approach and it does not rely on a specific model, the mean-shift tracker is well adapted for real-time applications and robust to partial occlusions. In [6], an extension was proposed to cope with scale variation. However, little has been done to extend the tracker to rotational motions [7]. In fact, the original mean-shift tracker as proposed in [5] is invariant to rotations and thus does not provide information on the target's orientation. This property is induced by the inherently spaceless nature of the histograms. While this may not be problematic for objects with symmetrical dimensions like circles or squares, it is no longer valid when the tracked objects are "thin" [7]. An example of a tracked thin object (an arm) is illustrated in Fig. 1.
Fig. 1. Result of tracking an arm using the presented oriented mean-shift tracker
In [7], the authors used a simplified form of correlograms to encode pixel positions within the region of interest. Pairs of points at an arbitrarily fixed distance along the principal axis vote with their joint intensities and their angle relative to the patch's origin to generate an orientation-intensity correlogram. Once the correlogram is estimated, it is used in the mean-shift's main loop just like a regular histogram. Unfortunately, no method was proposed to automatically select the fixed distance for pair sampling. Furthermore, since the pairs are only picked along the principal axis of the object, the generated correlogram does not encode a global representation of the object. In this paper, we propose a fast and simple algorithm for oriented mean-shift tracking. We use the original mean-shift formulation to estimate the object's translation; for the rotational part, a histogram of the orientations of the spatial gradients (within the region of interest) is used to assign an orientation according to a previously computed set of histograms of possible orientations. The effectiveness of the proposed method is demonstrated in experiments with various types of images. Our method can also be applied to video stabilization, as our experiments will show. Real-time applications are still possible since the additional processing time is negligible. The rest of the paper is organized as follows. In Section 2, mean-shift tracking is summarized. Section 3 presents the gradient-orientation representation using a histogram LUT. The implementation of the proposed method is described in Section 4. The experiments and results are reported in Section 5 and we finally summarize our conclusion in Section 6.
2 Mean-Shift and Limitations
The mean-shift algorithm, as initially proposed in [8], is a non-parametric method to estimate the mode of a density function. Let S = {xi}i=1..n be a finite set of r-dimensional data and K(x) a multivariate kernel with window radius h. The sample mean at x is defined as:

m(x) = Σ_{xi∈S} K(x − xi) xi / Σ_{xi∈S} K(x − xi).

The quantity m(x) − x is called the mean-shift vector. It has been proven that if K(x) is an isotropic kernel with a convex and monotonically decreasing profile, the
mean-shift vector always points in the direction of the maximum increase in the density. Thus, following this direction recursively leads to the local maximum of the density spanned by S. Examples of such kernels are the Gaussian and Epanechnikov kernels. The reader is referred to [8, 9, 5] for further details on the mean-shift algorithm and the related proof of convergence.
2.1 Mean-Shift for Tracking
Comaniciu et al. [5] took advantage of this property of the mean-shift and proposed an elegant method to track blobs based on intensity histograms. The algorithm finds the displacement y of the object of interest S as a weighted sum:

y = Σ_{xi∈S} wi K(x − xi) xi / Σ_{xi∈S} wi K(x − xi),

where the wi are weights related to the likelihood of the model and the target's intensity histograms. The estimation is repeated recursively until the displacement's magnitude ||y|| vanishes (or reaches a predefined value). Unfortunately, the mean-shift tracker cannot infer the orientation of an object based on its intensity histogram. To overcome this limitation, the tracker must use cues related to the spatial organization of the pixels or parameters that describe textures. Among those cues, image gradients are good candidates because their orientations vary when the image undergoes a rotation and they are easy to compute.
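A compact sketch of one weighted mean-shift step is shown below. It is illustrative only: an Epanechnikov-profile kernel and the variable names are our assumptions, and the per-pixel weights w_i are expected to come from the histogram back-projection described above.

```python
import numpy as np

def mean_shift_displacement(x, positions, weights, bandwidth):
    """One weighted mean-shift step with an Epanechnikov-profile kernel.

    x         : (2,) current window center
    positions : (n, 2) pixel coordinates inside the window
    weights   : (n,) per-pixel weights w_i from the histogram likelihood
    """
    d2 = np.sum((positions - x) ** 2, axis=1) / bandwidth ** 2
    k = np.maximum(1.0 - d2, 0.0)              # kernel profile K(x - x_i)
    wk = weights * k
    y = (wk[:, None] * positions).sum(axis=0) / (wk.sum() + 1e-12)
    return y - x                                # the mean-shift vector
```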
2.2 Gradients and Gradients Histogram
Let I be an image. The first-order gradient of I at position (x, y), noted ∇Ixy, is defined as:

∇Ixy = [Ix, Iy]^T = [ I(x + 1, y) − I(x − 1, y),  I(x, y + 1) − I(x, y − 1) ]^T.   (1)

The orientation and the magnitude of the gradient vector ∇Ixy are given by:

θ(x, y) = tan^{−1}(Iy / Ix);   mag(x, y) = √(Ix² + Iy²).   (2)

It is clear that the orientation is independent of image translation. However, a rotation of the image yields a rotation of the gradient field by the same amount. This property can be used to assign an orientation to the object of interest. Instead of keeping track of the gradient field itself, it is more convenient to build a histogram of gradient orientations. This representation has been used in Lowe's SIFT [10] to assign an orientation to the keypoints. In the present work, the m-bin orientation histogram O of an object is computed as:

Om = C Σ_{i=1}^{n} mag(pi) · δ[θ(pi) − m],   (3)
where p0, p1, . . . , pn are the n pixels of the object of interest and the normalization constant C is computed so as to ensure that Σ_{u=1}^{m} Ou = 1. θ(pi) and mag(pi) are functions that return the orientation and the magnitude of the gradient at pixel pi as defined in (2). As opposed to a regular intensity histogram, each sample modulates its contribution with its magnitude. The reason behind this choice is two-fold: first, we generally observe that gradients with larger magnitudes tend to be more stable; second, the gradient is known to be very sensitive to noise, thus weighting the votes with their magnitudes amounts to privileging samples with a good signal-to-noise ratio. As opposed to [10], we do not extract a dominant orientation from the histogram. Rather, we keep the whole histogram as an orientation signature.

Histograms and bin width. One of the major problems that arise when estimating a histogram (or any density function) from a finite set of data is determining the bin width of the histogram. A large bin width gives an over-smoothed histogram with a coarse block look, whereas a small bin width results in an under-smoothed and jagged histogram [11]. In [12], Scott showed that the optimal bin width W, which provides an unbiased estimation of the probability density, is given by:

W = 3.49 · σ · N^{−1/3},   (4)

where N is the number of samples and σ is the standard deviation of the distribution. We used a more robust formulation described in [13]:

W = 2 · IQR · N^{−1/3}.   (5)
The interquartile range (IQR) is the difference between the 75th and 25th percentile of the distribution. Note that (5) does not contain σ, thereby reducing the risk of bias. The bin width computed with (5) is the one we use throughout our experiments.
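The magnitude-weighted orientation histogram of Eqs. (1)-(3), with the IQR-based bin width of Eq. (5), can be sketched as follows. Finite differences via np.gradient, the degree parameterization and the clamping of the bin count are our assumptions for this sketch.

```python
import numpy as np

def orientation_histogram(patch):
    """Magnitude-weighted gradient-orientation histogram (Eqs. (1)-(3), (5))."""
    gy, gx = np.gradient(patch.astype(float))          # first-order gradients
    theta_deg = np.degrees(np.arctan2(gy, gx)) % 360.0  # orientation, Eq. (2)
    mag = np.hypot(gx, gy)                               # magnitude, Eq. (2)

    # Eq. (5): bin width from the interquartile range of the orientations
    n = theta_deg.size
    iqr = np.subtract(*np.percentile(theta_deg, [75, 25]))
    width = max(2.0 * iqr * n ** (-1.0 / 3.0), 1.0)
    n_bins = max(int(round(360.0 / width)), 1)

    hist, _ = np.histogram(theta_deg, bins=n_bins, range=(0.0, 360.0),
                           weights=mag)                  # Eq. (3)
    return hist / (hist.sum() + 1e-12)                   # normalize to sum 1
```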
3 Tracking with Gradient Histograms
A single orientation histogram encodes only the gradient distribution for one specific orientation. Thus, to infer the orientation from a gradient histogram, a LUT of gradient histograms corresponding to all image orientations must be built beforehand (at the initialization step). During the tracking process, the gradient histogram of the object is compared against the histograms in the LUT. The sought orientation is the one that corresponds to the closest histogram in the LUT. A histogram likelihood can be computed in different ways. We used the histogram intersection as introduced in [14] for its robustness and ease of computation. The intersection of two m-bin histograms h1 and h2 is defined as:

h1 ∩ h2 = Σ_{i=1}^{m} Min(h1[i], h2[i]),   (6)
where Min() is a function that returns the minimum of its arguments. It is clear that the closer the histograms, the bigger the intersection score. The lookup table of histograms captures the joint orientation-gradient space of the object and can also be seen as a 2D histogram. In the following subsections, two methods are introduced to construct the gradient histogram table: Image-Rotation Voting and Gradient-Rotation Voting.
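The intersection of Eq. (6) and the LUT lookup can be sketched as below. This is illustrative only; the optional restriction of the search to a window around the previous orientation anticipates the implementation note in Section 4, and the function names are ours.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Intersection of two normalized histograms (Eq. (6))."""
    return np.minimum(h1, h2).sum()

def estimate_orientation(hist, lut, rot_step, prev_angle=None, window=20.0):
    """Pick the LUT entry whose histogram best intersects the observed one.

    lut is a list of histograms, entry k corresponding to k * rot_step degrees;
    optionally restrict the search to +/- window degrees around prev_angle.
    """
    angles = np.arange(len(lut)) * rot_step
    if prev_angle is not None:
        diff = (angles - prev_angle + 180.0) % 360.0 - 180.0
        candidates = np.where(np.abs(diff) <= window)[0]
    else:
        candidates = np.arange(len(lut))
    scores = [histogram_intersection(hist, lut[k]) for k in candidates]
    return angles[candidates[int(np.argmax(scores))]]
```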
3.1 Image-Rotation Voting
This is the simplest way to gather histograms of gradients for different orientations. The image of the tracked object is rotated by 360◦ around its center. The rotation is performed with a user-defined step (2◦ in our experiments) and an orientation histogram is computed at each step. The resulting histograms are stored in a stack and form the gradient histogram LUT. To reduce the noise due to intensity aliasing, rotations are performed with bi-cubic interpolation. This method is outlined in the algorithm below:

Given: Original image, target's pixels {pi}i=1...n and a rotation step rot.
1. step ← 0, ndx ← 0.
2. Apply a gaussian filter on {pi}i=1...n to reduce noise (typically 3 × 3).
3. Compute {magi}i=1...n and {θi}i=1...n, the orientations and magnitudes of the gradients at {pi}i=1...n, according to (1).
4. Derive the orientation histogram Om using {magi} and {θi} according to (3).
5. LUT[ndx] ← Om
6. ndx ← ndx + 1, step ← step + rot.
7. Rotate {pi}i=1...n by step degrees.
8. If step < 360 go to step 3.
9. return LUT
3.2 Gradient-Rotation Voting
The second method is faster and produces better results in practice. Instead of rotating the image itself, the computed gradient field of the original image is incrementally rotated and the result of each rotation votes in the proper histogram. Note that due to histogram discretization, rotating a gradient field is not exactly equivalent to shifting the histogram by the same amount. This is because the histogram sampling is generally not the same as the rotation sampling. For instance, after rotating the gradient field, some samples that voted for a specific bin might still vote for the same bin whereas others may jump to an adjacent bin. The two operations would be equivalent if the gradient histogram had a bin width of 1 (i.e. 360 bins), which is not the case in practice. The gradient-rotation voting is outlined below; a code sketch of this procedure follows the listing:

Given: Original image, target's pixels {pi}i=1...n and a rotation step rot.
1. step ← 0, ndx ← 0.
2. Apply a gaussian smoothing on {pi}i=1...n to reduce noise.
3. Compute {magi}i=1...n and {θi}i=1...n, the orientations and magnitudes of the gradients at {pi}i=1...n, according to (1).
4. Derive the orientation histogram Om using {magi} and {θi} according to (3).
5. LUT[ndx] ← Om
6. ndx ← ndx + 1, step ← step + rot.
7. For each {θi}i=1...n do θi ← θi + rot.
8. If step < 360 go to step 4.
9. return LUT
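Assuming the gradient orientations and magnitudes of the target pixels have already been computed once (steps 1-3), the loop of steps 4-9 can be sketched as follows; adding a cumulative offset to the orientations is equivalent to the in-place rotation of step 7.

```python
import numpy as np

def build_gradient_rotation_lut(theta_deg, mag, n_bins, rot_step=2.0):
    """Gradient-rotation voting: one orientation histogram per rotation step.

    theta_deg, mag : gradient orientations (degrees) and magnitudes of the
                     target pixels, computed once from the original image.
    """
    lut = []
    step = 0.0
    while step < 360.0:
        hist, _ = np.histogram((theta_deg + step) % 360.0, bins=n_bins,
                               range=(0.0, 360.0), weights=mag)
        lut.append(hist / (hist.sum() + 1e-12))
        step += rot_step
    return lut
```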
4 Implementation
We implemented the proposed oriented mean-shift tracker as an extension of the original mean-shift tracker. The user supplies the initial location of the object to track, along with its bounding box and an initial orientation. Images are first smoothed with a gaussian filter to reduce the noise (typically a 3 × 3 gaussian mask). Note that the smoothing is only applied in the neighborhood of the object. The orientation look-up table is generated using the gradient-rotation method with a 2◦ step. The orientation estimation can either be nested within the original mean-shift loop or performed separately after the estimation of the translational part. The complete algorithm is outlined below:

Given: The original sequence, the initial object's position (y0) and orientation (θ0).
1. Compute the LUT of orientation histograms at y0 (see Section 3).
2. Initialize the mean-shift algorithm.
3. For each frame fi:
   (a) Update the object's position using the original mean-shift.
   (b) Compute the gradient and estimate H, the orientation histogram, using (3).
   (c) hmax ← Max [H ∩ hi]; hi ∈ LUT.
   (d) Update the object's orientation to the orientation that corresponds to hmax.
Notice that the orientations are estimated relative to the initial orientation θ0. Even though histogram intersection is a fast operation, processing time can still be saved at the intersection step (step 3.c) by limiting the search to a specific range instead of the entire LUT. A typical range is ±20◦ around the object's previous orientation.
5 Experimental Results
We tested the proposed oriented mean-shift algorithm on several motion sequences. Since we propose an orientation upgrade to the original mean-shift tracker, we mostly considered sequences with dominant rotational motion. We first ran our tracker on a synthetic sequence that was generated by fully rotating a real image (a chocolate box). Fig. 2 shows some frames from the synthetic sequence.
Fig. 2. Some frames from the manually rotated sequence (with a fixed background)
Fig. 3. Errors in orientation estimation as a function of histogram samples
Fig. 4. Results of tracking a rotating face. Sample frames: 78, 164 and 257.
The error of the rotation estimation using different bin sizes is reported in Fig. 3. The green curve represents the error with a LUT generated by the image-rotation method, whereas the red curve is the error using a LUT generated by the gradient-rotation method. In both cases the computed optimal bin size was 14. As we can see, the gradient-rotation method gives better results and is less
Fig. 5. Estimated orientation for the rotating face sequence
Fig. 6. Tracking results for the car pursuit sequence
Fig. 7. Estimated orientation for the shelf sequence
Fig. 8. Results of tracking and rectifying images from a rolling camera sequence. left) results of the original tracking. right) rectified sequence after rotation cancellation. Shown frames are 0, 69,203,421,536 and 711.
sensitive to bin size variation. For the rest of the experiments, the gradient LUTs were generated using the gradient-rotation method. We further tested our method for face tracking. As the face underwent an almost perfect roll, we computed the orientation estimates at each frame. We observe that the face is tracked accurately, although no exact ground truth is available in this case. The results are shown in Fig. 4 and Fig. 5. Aerial surveillance is another field where tracking is useful. Due to the rectangular shape of common vehicles, oriented tracking is suitable, as shown in Fig. 6. However, notice that the orientation is not truly 2D, as the viewing angle induces some perspective distortions that are not handled by our method. Finally, we illustrate the effectiveness of the proposed method for video rectification. A hand-held camera was rotated by hand around its optical axis while gazing at a static scene (see Fig. 8, left column). We tracked a rigid object attached to the scene and used the recovered motion to rectify and cancel the rotation in the video sequence. The results of tracking/rectification are shown in Fig. 8 and the estimated orientations are plotted in Fig. 7. The rotation is well recovered, as can be seen in the estimated curve of Fig. 7 and the rectified images of Fig. 8. Notice that the rectified images are sometimes distorted by parallax effects that are not modelled by our algorithm.
6 Conclusion
We have presented a fast and simple extension of the original mean-shift tracker that allows the estimation of the orientation. This rotation parameter is crucial when the tracked objects have a "thin" shape. We introduced the idea of the gradient-orientation space represented by gradient look-up tables. Of course, the LUT can be extended to other cues related to texture or pixel positions. This representation proved to be efficient, as our experiments showed. The proposed method ran comfortably on a regular PC in real time. Tracking was performed at 10-25 frames per second for typical 2000-pixel objects. In our implementation, the orientation was estimated independently of the translation shift. However, performing a combined mean-shift on intensity histograms and the gradient LUT is possible. In the future, we plan on adding support for perspective deformation to better handle different types of rotations.
References [1] Segen, J., Pingali, S.G.: A camera-based system for tracking people in real time. In: International Conference on Pattern Recognition, p. 63 (1996) [2] Zhu, Z., Ji, Q., Fujimura, K.: Combining kalman filtering and mean shift for real time eye tracking under active ir illumination (oral) (2002) [3] Crowley, J., Schwerdt, K.: Robust tracking and compression for video communication. In: IEEE Computer Society Conference on Computer Vision, Workshop on Face and Gesture Recognition, IEEE Computer Society Press, Los Alamitos (1999)
[4] Nummiaro, K., Koller-Meier, E., Van Gool, L.J.: An adaptive color-based particle filter. Image Vision Comput. 21(1), 99–110 (2003) [5] Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: Conference on Computer Vision and Pattern Recognition, pp. 142–151 (2000) [6] Collins, R.T.: Mean-shift blob tracking through scale space. In: Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 234–240 (2003) [7] ZHao, Q., Tao, H.: Object tracking using color correlogram. In: Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 263–270. IEEE Computer Society Press, Los Alamitos (2005) [8] Fukunaga, K., Hostetler, L.: The estimation of the gradient of a density function with applications in pattern recognition. In: IEEE Transactions on Information Theory, pp. 32–40. IEEE Computer Society Press, Los Alamitos (1975) [9] Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Analysis and Machine Intelligence 17(8), 790–799 (1995) [10] Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, vol. 20, pp. 91–110 (2003) [11] Wand, M.P.: Data-based choice of histogram bin width. The American Statistician 51(1), 59 (1997) [12] Scott, D.: On optimal and data-based histograms. Biometrika, 66 (1979) [13] Izenman, A.J.: Recent developments in nonparametric density estimation, 86 (1991) [14] Swain, M.J., Ballard, D.H.: Color indexing. International Journal of Computer Vision 7(1), 11–32 (1991)
Real-Time 3D Head Tracking Under Rapidly Changing Pose, Head Movement and Illumination

Wooju Ryu and Daijin Kim

Intelligent Multimedia Laboratory, Dept. of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea
{wjryu,dkim}@postech.ac.kr
Abstract. This paper proposes a fast 3D head tracking method that works robustly under a variety of difficult conditions such as rapidly changing pose, head movement and illumination. First, we obtain pose robustness by using the 3D cylindrical head model (CHM) and a dynamic template. Second, we obtain robustness to fast head movement by using the dynamic template. Third, we obtain illumination robustness by modeling illumination basis vectors and adding them to the previous input image to adapt to the current input image. Additionally, to make the tracker more robust, we use a re-registration technique that takes a stored reference image as the template when the registration error becomes large. Experimental results show that the proposed head tracking method outperforms tracking with a fixed or dynamic template alone, in terms of a smaller pose error and a higher successful tracking rate, and that it tracks the head successfully even if the head moves fast under rapidly changing poses and illuminations, at a speed of 10-15 frames/sec.
1 Introduction
For 3D head tracking, many researchers have used simple geometric head models such as a cylinder [1], [2], an ellipsoid [3], or a head-like 3D shape [4] to recover the global head motion. They assume that the shape of the head model does not change during tracking, which means that it has no shape parameters. The global head motion can be represented by a rigid motion, which can be parameterized by 6 parameters: three for the 3D rotation and three for the 3D translation. Therefore, the number of model parameters is only 6. Among the three different geometric 3D head models, we take the cylindrical head model due to its robustness to pose variation, its general applicability and its simplicity. It approximates the 3D shape of generic faces better than the ellipsoid model. Also, it requires a small number of parameters and its fitting performance is less sensitive to their initialization than the head-like 3D shape model. To be more robust to extreme poses and fast head movement, a dynamic template technique has been proposed [2] that uses the previous input image
patch as the current template. While a fixed template cannot cover a large rotation and fast head movement, the dynamic template is robust to them. Although the dynamic template technique can handle extreme head movement, it has the problem that the head tracking can fail due to the accumulated fitting error. To remedy this problem, [2] proposed a re-registration technique that stores reference frames and uses them when the fitting error becomes large. The dynamic template may cover gradually changing illumination because the template is updated every frame. However, it cannot cover all kinds of illumination changes, specifically rapidly changing illumination. [1] removed the illumination effects by adding illumination basis vectors to the template. This approach can cover rapidly changing illumination. We derive a novel full-motion recovery under perspective projection that combines the dynamic template and re-registration technique [2] with the removal of the illumination effect by adding illumination basis vectors [1]. Also, we update the reference frames according to the illumination condition of the input image. This approach provides a new head tracking method which is robust to extreme head poses, fast head movement, and rapidly changing illumination.
2 Image Registration

2.1 The Lucas-Kanade Algorithm
The LK algorithm is a well-known technique for the image registration or alignment problem [5,6,7]. It aligns a template image T to an input image I by minimizing the sum of squared errors E:

E = \sum_x \big[ T(x) - I(W(x; p)) \big]^2,    (1)
where x = (x, y)^T is a column vector representing the pixel coordinate, W(x; p) is a parameterized warping function from R^2 to R^2, and p = [p_1, ..., p_n]^T is a vector of parameters.

2.2 Full Motion Recovery
The cylinder head model assumes that the head is shaped as a cylinder and that the face is approximated by the cylinder surface. A 3D cylinder surface point is represented as x = [x y z]^T and the 2D image pixel coordinate is represented as u = [u v]^T. Under the perspective projection function, the 2D image pixel coordinate u is given by

u = P(x) = \frac{f_L}{z} [x \; y]^T,    (2)

where f_L is the focal length. When the cylinder surface point x is transformed by the rigid motion vector p, the rigid transformation function M(x; p) of x can be represented by

M(x; p) = Rx + T,    (3)

where R \in R^{3 \times 3} and T \in R^{3 \times 1} are the 3D rotation matrix and the 3D translation vector, respectively. We take the twist representation [8], whose detailed derivation is given in [9]. According to the twist representation, the 3D rigid motion model M(x; p) is given by

M(x; p) = \begin{pmatrix} 1 & -w_z & w_y \\ w_z & 1 & -w_x \\ -w_y & w_x & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix},    (4)

where p = [w_x \; w_y \; w_z \; t_x \; t_y \; t_z]^T is the 3D full head motion parameter vector. The warping function W(x; p) in Eq. (1) is completely defined using P(x) and M(x; p) as

W(x; p) = P(M(x; p))    (5)
        = \frac{f_L}{-x w_y + y w_x + z + t_z} \begin{pmatrix} x - y w_z + z w_y + t_x \\ x w_z + y - z w_x + t_y \end{pmatrix}.    (6)

3 Illumination Modeling

3.1 Texture Map Image
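For readers who want to experiment with the warp, the following is a minimal NumPy sketch of Eqs. (4)-(6); the function name, the argument layout p = (w_x, w_y, w_z, t_x, t_y, t_z) and the sample values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def warp(points, p, focal_length):
    """Project 3D cylinder-surface points into the image after a small rigid motion.

    points: (N, 3) array of 3D points [x, y, z] on the cylinder surface.
    p: rigid motion vector (wx, wy, wz, tx, ty, tz) from the twist representation (Eq. 4).
    Returns an (N, 2) array of warped pixel coordinates (Eqs. 5-6).
    """
    wx, wy, wz, tx, ty, tz = p
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Linearized rotation plus translation (Eq. 4).
    xm = x - y * wz + z * wy + tx
    ym = x * wz + y - z * wx + ty
    zm = -x * wy + y * wx + z + tz
    # Perspective projection (Eq. 2) applied to the moved points (Eq. 6).
    return focal_length * np.stack([xm / zm, ym / zm], axis=1)

# Example: a single surface point under a small yaw and forward translation.
print(warp(np.array([[0.1, 0.0, 1.0]]), (0.0, 0.05, 0.0, 0.0, 0.0, 0.02), 500.0))
```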
The texture map image introduced in [1] is obtained by projecting the 2D input image onto the 3D cylinder surface and unfolding the 3D texture image. Fig. 1 shows the process of building the texture map image. We build the illumination model as follows. First, we normalize the 2D input image to a 25 × 25 image. Second, we build the texture map image. Third, we mask the texture map image to select the face part. We describe the dynamic template on the 2D input image and construct the illumination basis vectors on the texture map image.

3.2 Illumination Basis Vectors
Although the dynamic template approach can cover gradually changing illumination, it cannot cover rapidly changing illumination effectively. To tackle rapidly changing illumination, we propose to use a linear model. To build the illumination model, we generate head images whose illumination is changed in five different directions (left, right, up, down, and front), collect their texture map images, and apply principal component analysis (PCA) to the collected texture map images after subtracting the mean texture map image. Fig. 2 shows the first 10 basis vectors of the PCA. Using the illumination basis vectors, we can remove the illumination effect from the template image as

\tilde{T}(x) = T(x) - \sum_{i=1}^{b_N} q_i b_i(x),    (7)

where T(x) and \tilde{T}(x) are the current template image and the current template image with the illumination effect removed, respectively, b_i(x) and q_i are the i-th illumination basis vector and the i-th illumination coefficient, respectively, and b_N is the number of illumination basis vectors used.
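A possible way to build such a basis offline is sketched below; the stack of masked texture map images, the number of retained vectors (10) and all variable names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def illumination_basis(texture_maps, num_basis=10):
    """Compute illumination basis vectors from texture map images.

    texture_maps: (num_images, H, W) array of texture map images collected
    under different lighting directions. Returns (num_basis, H*W) basis vectors
    and the mean texture map, following the PCA construction described above.
    """
    n, h, w = texture_maps.shape
    data = texture_maps.reshape(n, h * w).astype(np.float64)
    mean = data.mean(axis=0)
    centered = data - mean                      # subtract the mean texture map
    # SVD of the centered data gives the principal components as rows of vt.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_basis], mean

maps = np.random.rand(50, 25, 25)               # stand-in for collected texture maps
basis, mean_map = illumination_basis(maps, num_basis=10)
print(basis.shape)                               # (10, 625)
```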
Fig. 1. The process of building the texture map image. (a) 2D input image, (b) 3D cylinder texture, (c) texture map image.
Fig. 2. The first 10 basis vectors of the PCA result
4 Full Motion Recovery Under the Rapidly Changing Illuminations
When we consider the illumination basis vectors, the objective function for robust 3D head tracking is given by

\min \sum_x \Big[ I(W(x; p), t) - \sum_{i=1}^{b_N} (q_i + \Delta q_i) b_i(x) - I(W(x; p + \Delta p), t + 1) \Big]^2,    (8)

where p and q_i are the 3D rigid motion parameters and the i-th illumination coefficient, respectively, and \Delta p and \Delta q_i are the parameter updates computed by solving the optimization problem.

4.1 Linear Approximation
To solve Eq. (8), we approximate the nonlinear expression by one that is linear in \Delta p and \Delta q:

I(W(x; p), t) - \sum_{i=1}^{b_N} (q_i + \Delta q_i) b_i(x) - I(W(x; p + \Delta p), t + 1)
\approx I(W(x; p), t) - I(W(x; p), t + 1) - \sum_{i=1}^{b_N} q_i b_i(x) - \sum_{i=1}^{b_N} \Delta q_i b_i(x) - \nabla I \frac{\partial W}{\partial p} \Delta p.    (9)
Let us define the error image, the steepest descent image, and the Hessian matrix H as

E(x) = I(W(x; p), t) - I(W(x; p), t + 1) - \sum_{i=1}^{b_N} q_i b_i(x),    (10)

SD(x) = \Big[ \nabla I \frac{\partial W}{\partial p_1}, \ldots, \nabla I \frac{\partial W}{\partial p_6}, b_1(x), \ldots, b_{b_N}(x) \Big],    (11)

H = \sum_x SD(x)^T SD(x).    (12)

Then, the model parameters [\Delta p \; \Delta q]^T can be obtained as

\begin{pmatrix} \Delta p \\ \Delta q \end{pmatrix} = H^{-1} \sum_x SD(x)^T E(x).    (13)
4.2 Parameter Update
At every frame, we iteratively update the parameters \Delta p and \Delta q simultaneously. Before the iterative process, we set the previous input image patch as the current template and set the initial parameters p = p_0 and q = 0, where p_0 holds the previous motion parameters. At every iteration, we compute the new error image and the steepest descent image. When computing the new error image, the parameter p in the first term I(W(x; p), t) should be kept at p_0 because this term is used as the template image. Table 1 summarizes the overall process of the iterative parameter update of p and q, where \epsilon_1 and \epsilon_2 are threshold values.

Table 1. Overall process of the parameter update
(1) Set the previous input image patch as the template image.
(2) Initialize the parameters as p = p_0 and q = 0.
(3) Compute E(x) and SD(x).
(4) Compute the Hessian matrix H.
(5) Compute the incremental parameters \Delta p and \Delta q.
(6) Update the parameters by p \leftarrow p + \Delta p and q \leftarrow q + \Delta q.
(7) If \|\Delta p\| < \epsilon_1 and \|\Delta q\| < \epsilon_2 then stop; otherwise go to (3).
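The update loop of Table 1 can be sketched as follows; error_image and steepest_descent are hypothetical callbacks standing in for Eqs. (10) and (11), and the stopping thresholds are placeholder values, so this is a minimal sketch rather than the authors' implementation.

```python
import numpy as np

def update_parameters(error_image, steepest_descent, p0, num_coeffs,
                      eps_p=1e-4, eps_q=1e-4, max_iters=20):
    """Iterative update of the motion parameters p and illumination coefficients q (Table 1).

    error_image(p, q)      -> (N,) residuals E(x) over the template pixels (Eq. 10)
    steepest_descent(p, q) -> (N, 6 + num_coeffs) matrix whose rows are SD(x) (Eq. 11)
    """
    p = np.asarray(p0, dtype=np.float64).copy()
    q = np.zeros(num_coeffs)
    for _ in range(max_iters):
        e = error_image(p, q)
        sd = steepest_descent(p, q)
        hessian = sd.T @ sd                         # Eq. (12)
        delta = np.linalg.solve(hessian, sd.T @ e)  # Eq. (13)
        dp, dq = delta[:6], delta[6:]
        p += dp
        q += dq
        if np.linalg.norm(dp) < eps_p and np.linalg.norm(dq) < eps_q:
            break
    return p, q
```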
5 Robust 3D Head Tracking
The dynamic template and re-registration method was first suggested in [2] for robust head tracking. In this section we review the dynamic template and explain the modified re-registration algorithm under rapidly changing illumination.

5.1 Dynamic Template
We briefly review the dynamic template method for robust head tracking [2]. Since the fixed template cannot cover all kinds of head motions, we consider
the dynamic template to obtain long-term robustness to head motion. Because the cylindrical model cannot represent the head shape exactly, the template from the initial frame cannot cover the current input image when the head pose changes drastically. The dynamic template method takes the previous input image patch as the current template image.

5.2 Re-registration
We describe which frames are stored as references and how the re-registration process is executed when the illumination vectors are considered. While tracking the head motion, the fitting error accumulates. Therefore, when the fitting error exceeds a certain threshold value, we need to re-register to prevent the accumulated error. In the early stage of tracking, before the accumulated error exceeds the threshold value, we record the input image I and its motion parameter p as reference frames (the reference DB). The reference frames are classified by the head pose (w_x, w_y, w_z), and each class in the reference DB is represented by one reference frame. When re-registration is executed, the reference frame corresponding to the current head pose is selected from the reference DB. If the illumination changes, we cannot use the old reference DB because the current input image has a different illumination condition from the reference frame. Therefore we update the reference DB when the norm of the illumination parameter q is larger than a threshold value, after the re-registration is performed.
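One illustrative way to organize such a pose-indexed reference DB is sketched below, assuming a simple binning of the pose angles; the bin size, the illumination threshold and all names are arbitrary example choices, not those used by the authors.

```python
import numpy as np

class ReferenceDB:
    """Pose-indexed store of reference frames for re-registration (illustrative sketch)."""

    def __init__(self, bin_size_deg=10.0):
        self.bin_size = np.radians(bin_size_deg)
        self.frames = {}                    # pose bin -> (image patch, motion parameters)

    def _bin(self, p):
        wx, wy, wz = p[:3]                  # head pose angles from the motion vector
        return tuple(int(round(a / self.bin_size)) for a in (wx, wy, wz))

    def add(self, image, p):
        self.frames[self._bin(p)] = (image, np.array(p))

    def lookup(self, p):
        """Return the stored frame whose pose bin is closest to the current pose."""
        if not self.frames:
            return None
        target = np.array(self._bin(p))
        key = min(self.frames, key=lambda k: np.sum((np.array(k) - target) ** 2))
        return self.frames[key]

    def refresh_if_needed(self, image, p, q, q_threshold=0.5):
        # Rebuild the entry for this pose when illumination has changed noticeably.
        if np.linalg.norm(q) > q_threshold:
            self.add(image, p)
```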
6 Experimental Results
We developed the illumination-robust 3D head tracking system using the real-time algorithm explained above, running on a desktop PC (Pentium IV, 3.2 GHz) with a Logitech web camera. The average tracking speed is about 10 frames per second when the illumination basis vectors are used and about 15 frames per second when they are not used. For automatic initialization, we used a face and eye detector based on MCT+Adaboost [10] and an AAM face tracker to find the face boundary and facial feature points.

6.1 Extreme Pose and Fast Head Movement
In this experiment, we compared 3D head tracking using the fixed and the dynamic template. First, we captured a test sequence with extreme head motion in tilting and yawing. Fig. 3-(a) shows the result of the 3D head tracking, where the two rows correspond to tracking with the fixed template and tracking with the dynamic template, respectively. As can be seen, tracking with the fixed template starts to fail at the 48th frame and loses the head completely at the 73rd frame; after the 73rd frame, tracking cannot be recovered. In contrast, we obtain stable head tracking throughout the entire sequence using the dynamic template. Fig. 3-(b) compares the measured 3D head poses (tilting, yawing, rolling) when the poses
Fig. 3. Comparison of the head tracking results and the measured 3D head poses in the case of changing poses
Fig. 4. Comparison of head tracking results and the measured 3D head poses in the case of the fast head movement
changed much, where the left and right column correspond to the fixed template and the dynamic template, respectively, and the ground truth poses are denoted as the dotted line.
Second, we evaluated our head tracking system when the head moves very fast. The test sequence has 90 frames with one tilting and two yawing motions. Fig. 4-(a) shows the result of 3D head tracking on this test sequence. While tracking with the fixed template fails at frame 31, tracking with the dynamic template succeeds throughout the whole sequence. Fig. 4-(b) compares the measured 3D head poses when the head moves fast. As the above experiments show, the fixed template produces large measurement errors in the 3D head poses, whereas the dynamic template produces very accurate 3D head pose measurements.

6.2 Rapidly Changing Illumination
We tested how the proposed head tracking system behaves when the illumination condition changes rapidly. The test sequence has three rapidly changing illuminations (front, left, and right). Fig. 5 compares the head tracking results, where the two rows correspond to tracking with the dynamic template only and tracking with the dynamic template plus the illumination basis vectors. When we use the dynamic template only, tracking is very unstable under the rapidly changing illumination. On the other hand, when we use the dynamic template with the illumination basis vectors, the tracking results are very stable throughout the entire sequence even though the illumination condition changes rapidly.
Fig. 5. Comparison of the tracking results under the rapidly changing illumination
6.3 Rapidly Changing Pose, Head Movement and Illumination
To test the proposed head tracking method more thoroughly, we built a head movement database¹ of 15 different people which includes five different illuminations (left, right, up, down, and front) and the ground truth of each rotation. The ground truth of the head rotation was measured by a 3D object tracker (Fastrak system). We also used the head database from Boston University [1]. Figs. 6, 7, and 8 show the head tracking results and the measured 3D poses when the fixed template, the dynamic template, and the dynamic template with the illumination basis vectors are used, respectively; the ground truths of the real 3D poses are denoted by the dotted line. As can be seen, the dynamic
We call this database the IMH DB.
Fig. 6. Tracking results and the measured 3D poses when the fixed template is used
Fig. 7. Tracking results and the measured 3D poses when the dynamic template is used
Fig. 8. Tracking results and the measured 3D poses when the dynamic template with the illumination basis vectors is used
Fig. 9. Head tracking results using the different head image sequences

Table 2. The tracking results on the IMH DB
Seq. (frames)  | Fixed template | Dynamic template | Dynamic template with the illumination basis vectors
Seq. 1 (679)   | rtrack = 0.05  | rtrack = 0.89    | rtrack = 1, Ep = 3.89
Seq. 2 (543)   | rtrack = 0.1   | rtrack = 0.54    | rtrack = 1, Ep = 3.34
Seq. 3 (634)   | rtrack = 0.06  | rtrack = 0.26    | rtrack = 1, Ep = 3.75
Seq. 4 (572)   | rtrack = 0.07  | rtrack = 0.44    | rtrack = 1, Ep = 4.81
Seq. 5 (564)   | rtrack = 0.06  | rtrack = 0.9     | rtrack = 1, Ep = 5.75
Seq. 6 (663)   | rtrack = 0.08  | rtrack = 0.39    | rtrack = 1, Ep = 4.41
Seq. 7 (655)   | rtrack = 0.07  | rtrack = 0.48    | rtrack = 1, Ep = 3.89
Seq. 8 (667)   | rtrack = 0.05  | rtrack = 0.88    | rtrack = 1, Ep = 4.29
Seq. 9 (588)   | rtrack = 0.09  | rtrack = 0.38    | rtrack = 1, Ep = 4.49
Seq. 10 (673)  | rtrack = 0.23  | rtrack = 0.48    | rtrack = 1, Ep = 2.92
Seq. 11 (672)  | rtrack = 0.07  | rtrack = 0.42    | rtrack = 1, Ep = 5.30
Seq. 12 (504)  | rtrack = 0.04  | rtrack = 0.59    | rtrack = 1, Ep = 8.21
Seq. 13 (860)  | rtrack = 0.37  | rtrack = 0.53    | rtrack = 1, Ep = 5.38
Seq. 14 (694)  | rtrack = 0.13  | rtrack = 0.72    | rtrack = 1, Ep = 2.19
Seq. 15 (503)  | rtrack = 0.25  | rtrack = 0.58    | rtrack = 1, Ep = 3.79
template with the illumination basis vectors shows very stable tracking results and the smallest measured 3D pose errors among the three methods. Fig. 9 shows more examples of the head tracking results on the IMH DB and the database from Boston University. These tracking results show that the proposed tracking method is very robust to rapidly changing pose, head movement, and illumination. Table 2 summarizes the head tracking experiments on the IMH
DB, where the number in parentheses for each test sequence denotes the number of frames, rtrack denotes the ratio of the number of successfully tracked frames to the total number of frames, and the average pose error Ep is computed by summing the tilting, yawing, and rolling pose errors between the ground truth and the estimated angles in every frame and averaging these sums over the entire sequence. From this table, we see that the proposed tracking method using the dynamic template with the illumination basis vectors successfully tracks all image sequences in the IMH DB, while the other methods fail at some frame of each sequence.
7 Conclusion
We proposed a new framework for 3D head tracking using the 3D cylinder head model, which combines several techniques: the dynamic template, re-registration, and the removal of illumination effects using illumination basis vectors. We also proposed a new objective function that adds a linear illumination model to the existing objective function based on LK image registration, and derived an iterative update formula that estimates the model parameters, namely the rigid motion parameter p and the illumination coefficient vector q, at the same time. We modified the overall process of the existing re-registration technique so that the reference frames can be updated when the illumination condition changes rapidly. We performed intensive 3D head tracking experiments on the IMH DB and evaluated the head tracking performance in terms of the pose error and the successful tracking rate. The experimental results showed that the proposed head tracking method was more accurate and stable than the other tracking methods using the fixed and dynamic templates.
Acknowledgement This work was financially supported by the Ministry of Education and Human Resources Development(MOE), the Ministry of Commerce, Industry and Energy(MOCIE) and the Ministry of Labor(MOLAB) through the fostering project of the Lab of Excellency. Also, it was partially supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea.
References 1. Cascia, M., Sclaroff, S., Athitsos, V.: Fast, reliable head tracking under varying illumination: An approach based on robust registration of texture-mapped 3d models. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 22, 322–336 (2000) 2. Xiao, J., Moriyama, T., Kanade, T., Cohn, J.: Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology 13, 85–94 (2003)
3. Basu, S., Essa, I., Pentland, A.: Motion regularization for model-based head tracking. In: ICPR. Proceedings of the International Conference on Pattern Recognition, vol. 3, p. 611 (1996) 4. Malciu, M., Preteux, F.: A robust model-based approach for 3d head tracking in video sequences. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition., p. 169. IEEE Computer Society Press, Los Alamitos (2000) 5. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI. Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 6. Baker, S., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework: Part 1, tech. report CMU-RI-TR-02-16, Technical Report. Robotics Institute, Carnegie Mellon University (2002) 7. Baker, S., Gross, R., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework: Part 3, tech. report CMU-RI-TR-03-35, Technical Report. Robotics Institute, Carnegie Mellon University (2003) 8. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: CVPR. IEEE Conference on Computer Vision and Pattern Recognition, pp. 8– 15 (1998) 9. Murray, R., Li, Z., Sastry, S.: A Mathematical Introduction to Robotic Manipulation. CRC Press, New York (1994) 10. Froba, B., Ernst, A.: Face detection with the modified census transform. In: Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, pp. 91–96. IEEE Computer Society Press, Los Alamitos (2004) 11. Ryu, W., Sung, J., Kim, D.: Asian Head Movement Video Under Rapidly Changing Pose, Head Movement and Illumination AHM01, Technical Report. Intelligent Multimedia Lab, Dep. of CSE, POSTECH (2006)
Tracking Multiple People in the Context of Video Surveillance B. Boufama and M.A. Ali School of Computer Science University of Windsor Windsor, ON, Canada, N9B 3P4 {boufama, ali12w}@uwindsor.ca
Abstract. This paper addresses the problem of detecting and tracking multiple moving people when the scene background is not known in advance. We propose a new background detection technique for dynamic environments that learns and models the scene background based on the K-means clustering technique and pixel statistics. The background detection uses the first frames of the scene, where the number of frames needed depends on how dynamic the observed environment is. We also propose a new feature-based framework, which requires feature extraction and feature matching, for tracking moving people. We consider color, size, blob bounding box and motion information as features of people. In our feature-based tracking system, we use the Pearson correlation coefficient for matching feature-vectors with temporal templates. The occlusion problem is solved by sub-blobbing. Our tracking system is fast and free from assumptions about human structure. The tracking system has been implemented using Visual C++ and OpenCV and tested on real-world videos. Experimental results suggest that our tracking system achieves good accuracy and can process videos close to real time.
1 Introduction
Intelligent video surveillance has become one of the most important problems in computer vision, with applications in security, vehicle guidance [1], people and traffic monitoring [3], patient observation, etc. Automatic visual surveillance in dynamic scenes has recently attracted a great deal of research interest [8]. Technology has reached a stage where mounting a video camera has become so cheap that cameras are widely deployed in public and private areas [19]. Finding available human resources to supervise the videos is too expensive for most organizations [4]. Moreover, surveillance by operators is error-prone due to fatigue and negligence. Therefore, it is important to develop accurate and efficient automatic video analysis systems for monitoring human activity, which will create enormous business opportunities. Such systems allow the detection of unusual events in the scene and warrant the attention of security officers to take preventive actions [19]. The purpose of visual
surveillance is not to replace human eyes with cameras, but to make most of the surveillance task as automatic as possible[8]. Other applications of automatic video surveillance include preventing theft at parking and shopping areas[19], detecting robbery in bank and secured places[4], detecting camouflage[17], etc. Automatic video surveillance systems consist of two major components, one for the detection of moving objects and the other one for the tracking. The accuracy of these components largely affects the accuracy of overall surveillance results. Detecting moving regions in the scene and separating them from background image is a challenging problem. In the real world, some of the challenges associated with foreground object segmentation are illumination changes, shadows, camouflage in color, dynamic background and foreground aperture[18]. Foreground object segmentation can be done by three basic approaches: frame differencing, background subtraction and optical flow. Frame differencing technique does not require any knowledge about the background and is suitable for dynamic environments[4]. However, it suffers from the problem of foreground aperture due to homogeneous color of moving object. Background subtraction technique can extract all moving pixels when the perfect background model is available. Furthermore, it is extremely sensitive to scene changes due to lighting and movement of background objects. Optical flow technique is more robust than the previous two as it is able to detect all moving objects, even in the presence of camera motion. However, it is computationally expensive and cannot be used for real-time systems. On the other hand, once objects have been extracted, they have to be tracked across video frames. The challenges associated with tracking are similarity of people in shape, color and size, proximity of other people and occlusion by other people or background component. Tracking also requires proper management of appearance or disappearance of objects (which changes total number of objects being tracked). Object tracking methods can be divided into 4 categories[8], region-based, active-contour-based, feature-based and model-based. In the region-based approach[13,4], tracking is performed based on the variation of the image regions in motion. Although this approach neither requires the computation of image blobs nor the extraction of the features, it suffers from high complexity, as it matches a window with all candidate windows in the next frame. Moreover it cannot reliably handle occlusions[8] and it fails to a match an object that has moved beyond a region. In contrast to region-based tracking, objects description is much simpler in the active contour-based tracking methods [14,10]. In particular, the bounding contours are used to represent object’s outline, which are updated dynamically in the successive frames [8]. This approach is sensitive to the initialization and is limited to the tracking precision. Model-based approach [12] requires the development of 2D or 3D models of humans then it performs the tracking of the models’ components. This is a robust approach for tracking and performs well under occlusion, However, in addition to the models requirement, its computational cost is very high. In feature-based tracking [15], features of image blobs are extracted for matching between frames. In this method, several features of blobs are used for matching. Such features
could include size, position, velocity, ratio of major axis of best-fit ellipse [19], orientation, coordinates of bounding box, etc. The feature-vectors can be compared using several techniques such as Euclidean distance [19] or correlation-based approach [7]. The histogram of the RGB color components of image blobs can also be used as a feature and compared for matching [5]. In this paper, we propose a new method for matching features of blobs in conjunction with a tracking system. The background is modeled by statistical method and updated continuously. Foreground object segmentation is performed by background subtraction and K-means clustering. We have used the HueSaturation-Value(HSV) color space to minimize the effect of cast shadows. After finding the blobs, features are extracted and compared with features of blobs in the previous frame using Pearson correlation-based approach. Best matched blob is identified by considering maximum correlation coefficient.
2 Background Modeling and Foreground Object Segmentation
Feature-based tracking of multiple people starts with the detection of people in the scene. In our method, we use a combination of background subtraction and clustering techniques for human detection. Background subtraction requires an ideal background and an appropriate threshold value to segment the foreground region. In the case of video surveillance in a busy dynamic area, for instance shopping malls or city streets, a perfect background may not be available. To overcome this problem, we use a statistical method where the background model is built from the values each pixel takes across several frames. Table 1 describes the algorithm for modeling the background by this statistical method.

Table 1. Background modeling by statistical method
Input: Sequence of frames (RGB color space)
Output: A background model (in HSV color space)
1. Convert frames from RGB to HSV color space
2. Calculate the histogram of the H, S and V components for every pixel
3. Find the histogram bin with the highest count and assign the median of this bin to the H, S and V components of the pixel in the background model
4. Repeat steps 1, 2 and 3 for all the pixels until the pixel values become stable
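A minimal sketch of this histogram-mode background model is given below; the assumptions of 8-bit HSV values and 32 histogram bins are illustrative choices, not the authors' exact settings.

```python
import numpy as np

def build_background(frames_hsv, num_bins=32):
    """Statistical background model: per pixel, take the most frequent value (Table 1 sketch).

    frames_hsv: (T, H, W, 3) stack of frames already converted to HSV (values in 0-255).
    Returns an (H, W, 3) background model holding, for each pixel and channel,
    the median of the most populated histogram bin.
    """
    t, h, w, c = frames_hsv.shape
    data = frames_hsv.reshape(t, -1).astype(np.float64)   # (T, H*W*3)
    edges = np.linspace(0, 256, num_bins + 1)
    background = np.empty(h * w * c)
    for i in range(data.shape[1]):
        hist, _ = np.histogram(data[:, i], bins=edges)
        b = int(np.argmax(hist))                          # most populated bin
        in_bin = data[:, i][(data[:, i] >= edges[b]) & (data[:, i] < edges[b + 1])]
        background[i] = np.median(in_bin) if in_bin.size else edges[b]
    return background.reshape(h, w, c)
```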
To keep our background model up-to-date, we have updated our background model continuously using every new frame by the following equation[13]: BGi,t = αBGi,t−1 + (1 − α)xi,t
where BG_{i,t} is the new value of pixel i in the background model and x_{i,t} is the value of pixel i in the current frame t. The value of α indicates how quickly we want to change the background model; in particular, α should be less than 0.5 for a very dynamic scene. Once the background model has been calculated, background subtraction requires finding a good threshold value to be used for segmentation. Given the many drawbacks related to the use of thresholds for segmentation, we use a clustering-based segmentation technique that avoids the explicit use of thresholds. In the proposed algorithm, summarized in Table 2, the k-means clustering technique is applied to the image difference to find the foreground and background clusters.

Table 2. K-means clustering for segmenting the foreground region
Input: current background model and a video frame
Output: Segmentation of all pixels into foreground and background clusters
1. Convert the video frame into HSV color space
2. Create the image difference by subtracting the H, S and V components of the background model from the H, S and V components of the video frame
3. Find the minimum and maximum values in the image difference. The minimum value is the seed of the background cluster, denoted M1, and the maximum value is the seed of the foreground cluster, denoted M2
4. For every pixel i in the video frame, calculate D_{ij} = |x_i - M_j|, where j = 1, 2 and x_i is the value of pixel i in the frame difference
5. If D_{i1} > D_{i2}, assign i to the foreground cluster; otherwise assign i to the background cluster
6. Calculate the mean of the background cluster, M1, and the mean of the foreground cluster, M2
7. Repeat steps 4 and 5 until M1 and M2 do not change significantly
8. Report the pixels in the foreground cluster as the foreground region and vice versa
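The two-cluster segmentation of Table 2 might be sketched as follows; for brevity the H, S and V differences are collapsed into a single difference value per pixel, which is a simplification of the per-channel procedure described above.

```python
import numpy as np

def segment_foreground(frame_hsv, background_hsv, max_iters=20, tol=1e-3):
    """Two-cluster (foreground/background) segmentation of the frame difference (Table 2 sketch)."""
    diff = np.abs(frame_hsv.astype(np.float64) - background_hsv.astype(np.float64)).sum(axis=2)
    m1, m2 = diff.min(), diff.max()                  # background and foreground seeds
    for _ in range(max_iters):
        fg = np.abs(diff - m1) > np.abs(diff - m2)   # pixel is closer to the foreground seed
        new_m1 = diff[~fg].mean() if np.any(~fg) else m1
        new_m2 = diff[fg].mean() if np.any(fg) else m2
        if abs(new_m1 - m1) < tol and abs(new_m2 - m2) < tol:
            break
        m1, m2 = new_m1, new_m2
    return fg                                        # boolean mask of foreground pixels
```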
K-means clustering has a drawback for segmenting the foreground region: when there are no foreground objects in the scene, it may produce unpredictable results. We can solve this problem by using a threshold value on the difference image or by checking the size of each cluster. If one cluster is too big compared to the other, we assume that there is no foreground object in the scene and skip further processing of that frame. In ideal situations we would get perfect foreground regions; however, this is rarely the case in practice. In particular, noise appearing in foreground regions should be removed before the tracking phase. We have removed noise using the
two morphological operations erosion and dilation [19]. Another problem we encountered was shadow. As suggested by many researchers [6], we take advantage of the HSV color space, which eliminates most of the shadow. The next step is the construction of blobs to be tracked. We apply the popular connected component analysis (CCA) for blob construction [19]. CCA detects all kinds of blobs in the scene, as shown in Figure 1.a.
Fig. 1. a) Different representations of blobs, b) types of blob bounding box overlap
Morphological operations can remove isolated noise but fail to remove whole blobs of noise. Before further processing, these noise blobs should be removed by a size threshold so that only legitimate blobs remain. In our case, we delete small blobs whose size is less than a threshold, set to 10% of the average size of all blobs. We need the initial sizes of the blobs for this method to work; one can use information such as the size of the image, the scene-camera distance and the focal length of the camera to get a good guess of the initial blob size threshold. In our case, a threshold of between 50 and 100 pixels worked well. Blob merging is another task we considered, as a moving person very often results in two blobs instead of one. We solve this problem by merging pairs of blobs when their bounding boxes overlap vertically by at least 25% [2]. Figure 1.b shows different types of blobs and their overlaps.
3 Tracking Multiple People
A feature-based tracking system requires feature extraction, feature matching, maintenance of the total number of objects, and occlusion handling. The following paragraphs describe each of these tasks.

3.1 Feature Extraction and Feature Matching
A feature is a persistent characteristic of a blob/object. We have considered the average RGB color, the size, blob bounding box and motion vector to describe each blob. This description of a blob is represented by a feature-vector consisting of the blob size, motion vector, color and the bounding box size. Note that the
HSV color space has been used only for foreground region segmentation and shadow removal. For blob description and histogram analysis, we have used RGB color space. Although HSV color space can separate color component from intensity, RGB color space can represent the distribution of color of a blob more accurately. In a tracking system, a combination of spatial and temporal information can increase tracking accuracy. The spatial information says where a person is in the current scene whereas the temporal information describes where a person was in the previous frame and where it is right now. In our feature vector, the blob bounding box contains the spatial information and the motion vector contains the temporal information. The features we have used are the most popular features in the literature [20,19,11]. We have calculated the feature vector of a blob in every frame to get the most recent information of a blob. After constructing an object’s feature-vector, it will be compared to the temporal templates to find its location. Temporal templates are lists of feature vectors of all the people currently being tracked. A temporal template is maintained for each person and it is updated in every frame. When a feature vector best matches a temporal template of a person, then that blob is considered as the new position of that person and the template is updated with the new information. Matching can be done exhaustively or selectively with templates in the predicted region. To reduce the number of comparisons, we have assumed that there is a maximum distance a blob can move between frames [9]. This distance is updated dynamically from the motion vectors of the blobs. We have used an N xM score matrix S to store the result of matching between objects and candidate blobs, assuming N temporal templates and M blobs in the current frame. The matching between a feature vector and a temporal template uses a combination of Euclidean distance [19] and Pearson correlation coefficient. The Euclidean distance measures the amount of difference between two vectors, whereas the correlation coefficient calculates the measure of similarity. The following equation has been used for calculating measure of similarity between a feature vector and a temporal template. SIM (Fi , Fj ) = w1 C(Fi , Fj ) + w2 (1 − D(Fi , Fj )) where
C(F_i, F_j) = \frac{1}{d} \sum_{k=1}^{d} \frac{(F_{ik} - \bar{F}_i)(F_{jk} - \bar{F}_j)}{\sigma_{F_i} \sigma_{F_j}}, \qquad D(F_i, F_j) = \sum_{k=1}^{d} (F_{ik} - F_{jk})^2,    (1)

d is the dimension of the feature-vector, \bar{F} is the mean of a feature-vector and σ is its standard deviation. The Euclidean distance and the correlation score can be given different weight factors w1 and w2 during the comparison. The measure of similarity between template i and feature-vector j is stored in S_{ij}. The value of the correlation coefficient varies between -1.0 and 1.0. The advantage of the correlation coefficient is that the
value of the correlation does not depend on the units of measurement of the vectors; our feature-vector contains values with different dimensions and units. Another strength of correlation is that a value of more than 0.8 indicates strong correlation and a value of less than 0.5 indicates weak correlation. In the case of the Euclidean distance, the feature vectors are normalized before taking the difference to eliminate the scaling problem. In our experiments, we found that the color histogram of a blob is a strong feature and does not change much under rotation and scaling [16]. So we calculate the color histograms of blobs in RGB color space and use them during matching. The histograms are matched by correlation using

Hist(H_i, H_j) = \frac{\sum_{k=1}^{b} H_i(k) H_j(k)}{\sqrt{\sum_{k=1}^{b} (H_i(k))^2 \sum_{k=1}^{b} (H_j(k))^2}},    (2)

where H_i is the histogram of template i, H_j is the histogram of blob j and b is the total number of histogram bins. We use a typical value [13] for the number of bins (16 in this case) for each color channel. The final equation for calculating the similarity score between a feature-vector and a temporal template is

SIM(F_i, F_j) = w_1 C(F_i, F_j) + w_2 (1 - D(F_i, F_j)) + w_3 Hist(H_i, H_j),    (3)
where w_1, w_2 and w_3 are the weights given to the three measurements. The value of this similarity measure ranges from -1.0 to 1.0. The best match is found by checking the values in the score matrix S. However, in some situations a blob has no match in the score matrix S. Such a blob is either a new object or an object that has become occluded. To distinguish between a new object and an occluded one, we verify whether the blob bounding box overlaps with any other bounding box. A non-matched blob is considered a new object only if its bounding box has no overlaps. The handling of occlusions is described in the next section.
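As an illustration of how an entry S_ij of the score matrix could be computed from Eqs. (1)-(3), consider the sketch below; the weight values are arbitrary examples, and the histogram inputs are assumed to be flattened RGB histograms, so this is a hedge-worthy sketch rather than the authors' exact code.

```python
import numpy as np

def similarity(feature, template, hist_f, hist_t, w1=0.4, w2=0.3, w3=0.3):
    """Similarity between a blob feature-vector and a temporal template (Eqs. 1-3 sketch)."""
    f = np.asarray(feature, dtype=np.float64)
    t = np.asarray(template, dtype=np.float64)
    d = f.size
    # Pearson correlation between the two feature-vectors (Eq. 1, C term).
    c = np.sum((f - f.mean()) * (t - t.mean())) / (d * f.std() * t.std())
    # Euclidean distance on normalized vectors, so feature scale does not dominate (Eq. 1, D term).
    fn, tn = f / np.linalg.norm(f), t / np.linalg.norm(t)
    dist = np.sum((fn - tn) ** 2)
    # Histogram correlation (Eq. 2).
    hist = np.sum(hist_f * hist_t) / np.sqrt(np.sum(hist_f ** 2) * np.sum(hist_t ** 2))
    return w1 * c + w2 * (1.0 - dist) + w3 * hist
```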
3.2 Maintenance of Total Objects
According to Collins et al. [4], any of the following five possible scenarios might occur during tracking: (1) continuation of an object, (2) disappearance of an object, (3) appearance of a new object, (4) two or more objects merge or (5) separation of an object into several objects. Out of these 5 possibilities, (4) requires the most consideration. A continuation of an object occurs when a perfect match is found. To find a temporal consistency of an object, we have maintained an age and lost counter in our data structure for each template [19,13]. In some cases a cluster of noisy pixels could form a blob. However, such a false blob disappears from the scene very fast. We have solved this problem by considering as legitimate only those blobs that have survived at least N frames. The number N could be set to a small value like 4. Similarly, we consider an object to be lost when it disappears from at least M consecutive frames, in which case, the object will be deleted from the matrix S. The value of M depends on of the kind
of application and the kind of moving objects. In our experiments on tracking people, M was set to 50 frames which was large enough to keep an object after short occlusions.
4 Handling Occlusions by Sub-blobbing
Occlusion detection is required to avoid unnecessary template-blob matching. On the other hand, occlusion can only be detected properly after matching all the templates with the blobs in the current frame. In fact, occlusion detection and template-blob matching is kind of chicken and egg problem. Occlusion detection is required before template-blob matching, but accurate detection of occlusion is only available after template-blob matching. To detect occlusions, we first need to know which persons (or templates) are occluded in the current frame and which blob contains those templates. Two or more occluding templates form a merged blob and it does not clearly indicate where exactly those templates are. That is why we need to use more information to track a person during occlusions. To detect whether a template is occluding other templates or not, we have used the following clues for occlusion detection: – Prediction of the bounding box of a template in the current frame: this can be estimated by using motion information of a template. If the predicted bounding box of two or more templates overlaps, then there is a possible occlusion in the current frame. – Increase of size of merged blob: if two or more template’s predicted bounding box overlaps with a blob and its size is much bigger than the size of templates, then the templates might be occluded. We have proposed a new technique based on sub-blobbing to deal with occlusions during tracking. Let’s start with the easy case of two colliding blobs as shown on Figure 2. In this case, one blob is at the front and is totally visible whereas, the other blob is in the back and is partially visible. As a blob starts
Fig. 2. (a)The person at the right (target) is getting occluded by another person, (b) the merged blob is cut from left to find target, (c) Cutting merged blob from right to find target
Fig. 3. a) The person at the right and middle (target) is getting occluded by the person coming from left, b) cutting merged blob from left to find target, c) cutting merged blob from right to find target and, d) cutting merged blob from middle to find target
properly either from the left or from the right, we try both sides by assuming the non-occluded blob is either on the left (Figure 2.b) or on the right (Figure 2.c). For the blob size we use the one from the previous frame as an estimate. Then, for each case, we calculate the features of the blob and find the best template match using the matrix S. The blob with the higher score is considered to be the one at the front, i.e., the one that is occluding the other. If a person is significantly occluded by another person, then the size of the occluded blob would be too small to be matched with a template from the matrix S. In this situation, we use blind tracking based on the previous motion information of the template, assuming that the occluded object will reappear in the following frames. In the more complex situation of three people in the scene, we have to cut the merged blob into three portions: left, right and middle (see Figure 3). We first need to find the non-occluded blob that is in front of the other two. This blob could be located at the left, at the right or in the middle. We try the left and the right cases as we did for the two-people case. If we fail to find a complete blob-template match, then the non-occluded blob must be in the middle; in this case, we start at the center of the merged blob to identify the object. Once this is done, the case of the remaining two blobs is much easier and is processed as in the two-people case. Most occlusions occur between two or three people. The case of more than three people can be solved in a similar way by first identifying the non-occluded blob and then processing the remaining ones.
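A compact sketch of the two-person sub-blobbing test is shown below; the score callback, which would extract features from a candidate sub-blob and look up the best value in the matrix S, is hypothetical and stands in for the matching described above.

```python
def resolve_two_person_occlusion(merged_box, prev_width, score):
    """Try both sides of a merged blob to find the non-occluded (front) person.

    merged_box: (x, y, w, h) bounding box of the merged blob.
    prev_width: width of the target blob in the previous frame (used as the cut size).
    score(box): hypothetical callback returning the best template-matching score
    for the sub-blob inside `box`.
    Returns the sub-blob box with the higher score, i.e. the person at the front.
    """
    x, y, w, h = merged_box
    cut = min(prev_width, w)
    left_box = (x, y, cut, h)                 # assume the front person is on the left
    right_box = (x + w - cut, y, cut, h)      # or on the right
    return max((left_box, right_box), key=score)
```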
5 Experimental Results
We carried out numerous experiments, but only a subset of them is shown here. Our current system is not real-time yet, as it is capable of processing only 5 frames per second. All our tests with two people were successful: we were able to track and to overcome all occlusions (see Figure 4). Although we were also successful with three people in our tests (see Figure 5), in some situations our tracker dropped one or two people during collisions but was able to pick them
Fig. 4. Experimental Result 1: Two People
Fig. 5. Experimental Result 2: Three People
up again after the collisions. This situation cannot be avoided, as we have set fairly high thresholds for the matching. Note that the background was not assumed to be known but was constructed and maintained using our K-means algorithm.
6 Conclusion
Intelligent Video Surveillance (IVS) has become a major research area in computer vision. In particular, tracking and identifying humans in the scene is a central problem in IVS. We have proposed solutions to several problems related to tracking people in a dynamic scene with an unknown background and without prior assumptions. We proposed an algorithm based on the K-means technique for the identification of the background. We also developed an algorithm for feature-based tracking of multiple people; in particular, we used a combination of the Euclidean distance and the Pearson correlation for the matching of feature vectors and histograms. We tackled the difficult problem of collisions between moving people and proposed an algorithm based on sub-blobbing that solves this problem with a lot of success. Future work includes a more detailed investigation of scene-people occlusion. The sub-blobbing idea we proposed could also be used in a different way, where we would save sub-blob information in the matrix S instead of the whole-blob vectors; this would allow us to compare pieces of blobs together and could make the identification more efficient.
References 1. Elgamal, A., Duraiswami, R., Harwood, D., Davis, L.: Background and foreground modelling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90(7), 1151–1163 (2002) 2. Bunyak, F., Ersoy, I., Subramanya, S.R.: A Multi-Hypothesis Approach for Salient Object Tracking in Visual Surveillance. In: IEEE International Conference on Image Processing vol. 2, pp. 446 (2005) 3. Chen, B., Lei, Y.: Indoor and outdoor people detection and shadow suppression by exploiting HSV color information. In: 4th International Conference on Computer and Information Technology, September 14-16, 2004, pp. 137–142 (2004) 4. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O.: A system for video surveillance and monitoring. Technical Report CMU-RI-TR-00-12 (2000) 5. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 142–149 (2000) 6. Grana, C., Prati, A., Cucchiara, R., Vezzani, R.: Computer vision system for inhouse video surveillance. In: IEEE Proceedings - Vision, Image and Signal Processing, vol. 152(2), pp. 242–249 (2005) 7. Haritaoglu, I., Harwood, D., Davis, L.S.: Hydra: multiple people detection and tracking using silhouettes. In: International Conference on Image Analysis and Processing, September 27-29, 1999, pp. 280–285 (1999) 8. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. Systems, Man and Cybernetics, Part C 34(3) (2004) 9. Davis, J.W., Intille, S.S., Bobick, A.F.: Real-time closed-world tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 1997, pp. 697–703. IEEE Computer Society Press, Los Alamitos (1997)
10. Isard, M., Blake, A.: Condensation-Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 11. Ivanovic, A., Huang, T.S.: A probabilistic framework for segmentation and tracking of multiple non rigid objects for video surveillance. International Conference on Image Processing 1, 353–356 (2004) 12. Karaulova, I.A., Hall, P.M., Marshall, A.D.: A hierarchical model of dynamics for tracking people with a single video camera. In: British Machine Vision Conference, pp. 262–352 (2000) 13. McKenna, S.J., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Tracking groups of people. Computer Vision and Image Understanding 80, 42–56 (2000) 14. Paragios, N., Deriche, R.: Geodesic Active Regions for Motion Estimation and Tracking. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision, September 20-27, 1999, vol. 1, pp. 688–694 (1999) 15. Polana, R., Nelson, R.: Low level recognition of human motion (or how to get your man without finding his body parts). In: IEEE Workshop on Motion of Non-Rigid and Articulated Objects, IEEE Computer Society Press, Los Alamitos (1994) 16. Swain, M., Ballard, D.: Color indexing. International Journal of Computer Vision 7(1), 11–32 (1991) 17. Tankus, A., Yeshurun, Y.: Convexity-based visual camouflage breaking. Computer Vision and Image Understanding 82(3), 208–237 (2001) 18. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. In: 7th IEEE International Conference on Computer Vision, September 20-27, 1999, vol. 1, pp. 255–261 (1999) 19. Xu, L., Landabaso, J.L., Lei, B.: Segmentation and tracking of multiple moving objects for intelligent video analysis. BT Technology Journal 22(3) (2004) 20. Zhou, Q., Aggarwal, J.K.: Tracking and classifying moving objects from video. In: 2nd IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, December 9, 2001, IEEE Computer Society Press, Los Alamitos (2001)
Video Object Tracking Via Central Macro-blocks and Directional Vectors B. Zhang1, J. Jiang1,2, and G. Xiao1 1
Faculty of Informatics & Computing, Southwest University, Chongqing, China
[email protected],
[email protected] 2 Department of EIMC, University of Bradford, UK
Abstract. While existing video object tracking is sensitive to the accuracy of object segmentation, we propose a central point based algorithm in this paper to allow inaccurately segmented objects to be tracked inside video sequences. Since object segmentation remains to be a challenge without any robust solution to date, we apply a region-grow technique to further divide the initially segmented object into regions, and then extract a central point within each region. A macro-block is formulated via the extracted central point, and the object tracking is carried out through such centralized macroblocks and their directional vectors. As the size of the macroblock is often much smaller than the segmented object region, the proposed algorithm is tolerant to the inaccuracy of object segmentation. Experiments carried out show that the proposed algorithm works well in tracking video objects measured by both efficiency and effectiveness.
1 Introduction

Traditional video-coding standards, such as MPEG-1/MPEG-2 and H.261/H.263, lack high-level interpretation of video contents. The MPEG-4 and MPEG-7 standards introduce the concept of a video object layer (VOL) to support content-based functionality. Object-based representations of multimedia content provide the user with flexibility in content-based access and manipulation. As a result, there has been a lot of interest in video processing based on moving objects, such as object-based video coding (MPEG-4), object-based video content analysis, retrieval, and video object tracking [1-9]. In video object segmentation, the aim is to identify semantic objects in the scene and separate them from their background. In video object tracking, the aim is to follow the video object and update its two-dimensional (2-D) shape from frame to frame. Existing efforts on video object tracking can be classified into four categories: model-based, appearance-based, contour/mesh-based, and feature-based. Model-based tracking methods exploit a priori knowledge of the shape of objects in a given scene [1]. This approach is computationally expensive and lacks generality. Appearance-based methods track connected regions that roughly correspond to the 2D shapes of video objects based on their dynamic model. The tracking strategy relies on information provided by the entire region [2]-[5], such as motion, color, and texture. These methods cannot usually cope with complex
deformations. Contour-based methods track objects purely based on the contour of object, where motion information is exploited to project the contour, and adapt it to the object detected in the next frame [6]. In such contour-based tracking methods, the computational complexity is often high, and large non-rigid movements cannot be handled by these methods. This difficulty is due to the rigid body motion projection followed by adjustments. One improvement of the previous method is to use a deformable object motion model, such as active contour models (snakes) [7]–[9], or meshes [10], [11]. Feature-based methods uses features of a video object to track parts of the object. Several feature-based tracking techniques have been proposed, but they are not specifically designed for video object tracking. An adaptation to object tracking is presented in [12]. However, the problem of grouping the features to determine which of them belong to the same object is a major drawback of these approaches. In this paper, we propose a technique to automatically tracking video objects in cluttered background via object segmentation, regional division, central point extraction, central macroblock construction, and directional vector validation This algorithm uses central points and their directional vectors to solve the correspondence problem. The simplicity comes from the fact that instead of projecting the entire region into the next frame, only a central macroblock of 16x16 pixels needs to be processed. Therefore, there is no need for computationally expensive motion models. The remains of this paper is organized as follows: Section 2 mainly describes the proposed algorithm design, including construction of central macroblocks, formulation of directional vectors, and the interaction between center and directional vectors to complete the video object tracking. Section 3 reports relevant experimental results, and section 4 provides some concluding remarks.
2 The Proposed Algorithm Design

Given an input video sequence {I_0, I_1, ..., I_{i-1}, I_i, ..., I_n}, we use the technique proposed in [13] and improved in [14] to segment the semantic objects via the principle of change detection and motion information exploitation. When the current frame is denoted I_i, the corresponding segmented jth object is denoted O_{i,j} (j = 0, 1, ..., N_F^i - 1), where N_F^i stands for the total number of objects inside the ith frame. As object segmentation can never be 100% accurate [14], especially when one object disappears and another comes in, producing inevitable occlusions, there exist problems of under-segmentation and over-segmentation, where two overlapped objects may be segmented into a single object, or part of an object may be missing. To minimize this negative effect upon object tracking, we further divide each object into texture-consistent regions, so that the tracking can be designed in terms of regions, where each region plays an autonomous role and contributes to the overall object tracking to compensate for the inaccuracy of the object segmentation. In our proposed algorithm, the region growing technique described in [15] is adopted. Given the objects inside a video frame, the jth object is further divided into N_i^j non-overlapping regions, where i denotes the ith video frame and R_{i,j}^k denotes the kth region, k = 0, 1, 2, ..., N_i^j - 1.
As regions may partially be located outside the segmented object, we only select a representative part of the region to do the tracking, where such representative part should be located at the center of the region. To this end, a central point needs to be extracted out of each region. For the kth region inside the jth object located in the ith frame, Rik, j , its central point, C ik, j , can be extracted as follows [16].
\mu_{i,j}^k = \frac{1}{M} \sum_{l=0}^{M-1} P_{i,j,l}^k,    (1)

where \mu_{i,j}^k is the mean intensity value of all the boundary pixels P_{i,j,l}^k surrounding the region, and M represents the total number of pixels on the boundary. By using \mu_{i,j}^k as a threshold, a binary image g(x, y) can be produced as given below:

g(x, y) = \begin{cases} 1 & \text{if } P(x, y) \ge \mu_{i,j}^k \\ 0 & \text{otherwise} \end{cases}    (2)
Where P(x, y) represents the pixels inside the frame I(x, y). Finally, the central point for the region can be extracted via the following two equations:
m_{p,q} = \sum_x \sum_y g(x, y) \, x^p y^q,    (3)

C_{i,j}^k = \left( \frac{m_{1,0}}{m_{0,0}}, \frac{m_{0,1}}{m_{0,0}} \right).    (4)
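Eqs. (1)-(4) amount to a thresholded-moment centroid, as in the following sketch; the region patch and boundary mask are assumed to be supplied by the earlier region-growing step, and the names are illustrative.

```python
import numpy as np

def region_central_point(patch, boundary_mask):
    """Central point of a region via Eqs. (1)-(4).

    patch: 2D intensity array covering the region.
    boundary_mask: boolean array marking the region's boundary pixels.
    """
    mu = patch[boundary_mask].mean()                  # Eq. (1): mean boundary intensity
    g = (patch >= mu).astype(np.float64)              # Eq. (2): binary image
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    m00 = g.sum()                                     # Eq. (3): geometric moments
    m10 = (g * xs).sum()
    m01 = (g * ys).sum()
    return m10 / m00, m01 / m00                       # Eq. (4): centroid (x, y)
```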
Around the central point C_{i,j}^k, a macroblock of 16 × 16 pixels is constructed to represent the region, and hence the region tracking that contributes to the overall object tracking is carried out via this macroblock. To ensure that the macroblock-based tracking can make a useful contribution to the overall object tracking, the region must be sufficiently large, i.e., at least larger than the macroblock. Hence, any region smaller than the macroblock is ignored, or regarded as unable to contribute meaningfully. To track the object from the (i-1)th frame to the ith frame, we calculate the value of SAD (sum of absolute differences) between the representative macroblocks of all regions to establish the correspondence between regions. As the total number of regions inside the (i-1)th frame is

Num_{i-1} = \sum_{j=0}^{N_F^{i-1}-1} N_{i-1}^j,

and the total number of regions inside the ith frame is

Num_i = \sum_{j=0}^{N_F^{i}-1} N_{i}^j,

an array with Num_{i-1} \times Num_i SAD elements can be established as follows:

Matrix(i-1, i) = \begin{pmatrix} SAD_{0,0} & SAD_{0,1} & \cdots & SAD_{0, Num_i - 1} \\ SAD_{1,0} & SAD_{1,1} & \cdots & SAD_{1, Num_i - 1} \\ \vdots & \vdots & \ddots & \vdots \\ SAD_{Num_{i-1}-1, 0} & SAD_{Num_{i-1}-1, 1} & \cdots & SAD_{Num_{i-1}-1, Num_i - 1} \end{pmatrix},    (5)

where SAD = \frac{1}{16 \times 16} \sum_{x=0}^{15} \sum_{y=0}^{15} \left| P_i(x, y) - P_{i-1}(x, y) \right|.
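A direct sketch of building Matrix(i-1, i) from the central macroblocks is given below; the random stand-in blocks are only there to make the example runnable and are not part of the method.

```python
import numpy as np

def sad(block_a, block_b):
    """Mean absolute difference between two 16x16 macroblocks (the SAD of Eq. 5)."""
    return np.abs(block_a.astype(np.float64) - block_b.astype(np.float64)).mean()

def sad_matrix(blocks_prev, blocks_cur):
    """Matrix(i-1, i): SAD between every central macroblock of frame i-1 and frame i."""
    return np.array([[sad(a, b) for b in blocks_cur] for a in blocks_prev])

prev = [np.random.randint(0, 256, (16, 16)) for _ in range(3)]   # stand-in macroblocks
cur = [np.random.randint(0, 256, (16, 16)) for _ in range(4)]
print(sad_matrix(prev, cur).shape)                               # (3, 4)
```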
This array, Matrix(i-1, i), measures the transition process from the objects of the (i1)th frame to the objects of the ith frame. Essentially, our strategy is to establish the object correspondence between the two adjacent frames in terms of regions, yet the region correspondence across the two frames is established in terms of central macroblocks. Specifically, given the jth object inside the (i-1)th frame, Oi −1, j , all its macroblocks should find their corresponding macroblocks inside every object of the ith frame. Such correspondence can be described via the following equations:
\Phi(O_{i-1,j}, O_{i,j}) = \{ MAC_0, MAC_1, \ldots, MAC_{N_{i-1}^j - 1} \}, \qquad j \in [0, N_F^i - 1],    (6)

where MAC_k,
k ∈ [0, N i −j 1 − 1] represents the best matching macroblock inside the
j th object of the ith frame, which corresponds to the minimum SAD value between the kth macroblock of the (i-1)th frame and all other macroblocks inside the j th object of ith frame. As tracking of each object inside the (i-1)th frame will be considered among all objects inside the ith frame, we have N Fi correspondences measured by equation (6) for the object Oi −1, j . As a result, to measure the possibility of the object Oi −1, j being
tracked as the object O , we add all the corresponding minimum SAD values in (6) i, j together to produce a single value to measure the cost of establishing such a correspondence.
$$Sum_{j,\bar{j}} = \frac{1}{N_{i-1}^j}\sum_{l=0}^{N_{i-1}^j - 1} SAD_l \qquad (7)$$
Therefore, between the $(i-1)$th frame with $N_F^{i-1}$ objects and the $i$th frame with $N_F^{i}$ objects, we have the following cost matrix to measure the correspondence among all the objects inside both frames:

$$Cost_{i-1,i} = \begin{pmatrix} Sum_{0,0} & Sum_{0,1} & \ldots & Sum_{0,l} & \ldots & Sum_{0, N_F^i} \\ Sum_{1,0} & Sum_{1,1} & \ldots & Sum_{1,l} & \ldots & Sum_{1, N_F^i} \\ \ldots & \ldots & \ldots & Sum_{k,l} & \ldots & \ldots \\ Sum_{N_F^{i-1}, 0} & Sum_{N_F^{i-1}, 1} & \ldots & Sum_{N_F^{i-1}, l} & \ldots & Sum_{N_F^{i-1}, N_F^i} \end{pmatrix} \qquad (8)$$

where the first row represents the cost of the 0th object inside the $(i-1)$th frame being tracked into all objects inside the $i$th frame. In other words, the row index covers the objects inside the $(i-1)$th frame, and the column index covers the objects inside the $i$th frame. Equation (8) indicates that the minimum cost value $Sum_{k,l}$ in each row corresponds to the largest possibility that the $k$th object inside the $(i-1)$th frame should be tracked as the $l$th object inside the $i$th frame in terms of the SAD values. However, the SAD values only indicate the best possible match between the texture of the two
macroblocks. As the macroblock is the representative part of each region located at its central point, such a texture match only indicates a local similarity between the two objects at the regional level. To measure the similarity at the object level, all regions should be considered together. To this end, we propose a measurement of structural similarity between the objects; details are given below. To represent the internal structure of each object, we produce a sequence of directional vectors via all the central points $C_{i,j}^k$, $k \in [0, 1, \ldots, N_i^j - 1]$. Given the $j$th object, its sequence of directional vectors can be constructed as follows:

$$DV_i^j = \left\{ D_0, D_1, \ldots, D_{N_i^j - 2} \right\} = \left\{ \left(C_{i,j}^1 - C_{i,j}^0\right), \left(C_{i,j}^2 - C_{i,j}^1\right), \ldots, \left(C_{i,j}^{N_i^j - 1} - C_{i,j}^{N_i^j - 2}\right) \right\} \qquad (9)$$
In this way, the structural similarity between any two objects can be measured by the direction of each vector. To reduce the complexity and speed up the matching process, we divide all the possible directions into eight quantized directions $Q_0, Q_1, \ldots, Q_7$, as illustrated in Figure 1.

Fig. 1. Illustration of quantized directions
Given two objects with two sequences of directional vectors, each directional vector inside one object is compared with its corresponding directional vector inside the other object to see whether they both fall into the same quantized direction, and the corresponding cost value is adjusted accordingly:

$$Sum_{k,l} = \begin{cases} Sum_{k,l} - \alpha & \text{if } D_{i,j}^0 = D_{i-1,j}^0 = Q_h \\ Sum_{k,l} & \text{otherwise} \end{cases} \qquad (10)$$

where $\alpha$ represents a step score used to reduce the cost value, designed to lie within the range from 1 to 10.
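As a concrete illustration of the direction check behind Eq. (10), the sketch below quantizes a 2-D directional vector into eight uniform 45-degree bins (our reading of Figure 1) and lowers the cost once for every pair of corresponding vectors that shares a bin; both the per-pair application and the uniform bin layout are assumptions on our part rather than details stated in the paper.

```python
import numpy as np

def quantize_direction(vector):
    """Map a 2-D directional vector onto one of eight bins Q0..Q7 (45 degrees each)."""
    angle = np.arctan2(vector[1], vector[0]) % (2 * np.pi)
    return int(angle // (np.pi / 4)) % 8

def adjust_cost(sum_kl, dv_prev, dv_cur, alpha=5.0):
    """Reduce the cost Sum_{k,l} by the step score alpha (Eq. 10) whenever two
    corresponding directional vectors fall into the same quantized direction."""
    for d_prev, d_cur in zip(dv_prev, dv_cur):
        if quantize_direction(d_prev) == quantize_direction(d_cur):
            sum_kl -= alpha
    return sum_kl
```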
After all the cost values inside (8) are adjusted by the structural similarity measurement in (10), the object tracking can be finalized by the minimum value in each row of (8), where $\min_{j \in [0, N_F^i]} (Sum_{i,j}) = Sum_{h,g}$ indicates that the $h$th object inside the $(i-1)$th frame is tracked as the $g$th object inside the $i$th frame. From the above description, it can be summarised that the proposed object tracking is carried out on two levels. Firstly, we extract the correspondence between macroblocks as shown in (5) to measure the local similarity between regions inside the objects. Secondly, we check the structural similarity via the sequence of directional vectors inside each object to ensure that the correspondence is established for the entire object rather than for local regions.
3 Experimental Results

To evaluate the proposed algorithm, we carried out extensive experiments on a number of test video clips. The computing environment was a PC with a 1.73 GHz processor, 512 MB of memory and Windows XP, and the algorithm was implemented in VC++. Figure 2 illustrates the tracking results for the video clip Hall Monitor, which includes 300 frames altogether, each with a resolution of 352 × 240. The results show that object A is tracked starting from frame 17, when it first appears (see part (a) of Figure 2); the tracked object is marked by a red square. At frame 80, when the second object B appears, part (b) of Figure 2 clearly illustrates that this object is also tracked, inside the pink square. At frame 111, object A has changed its size as the person bends down to pick something up, as shown in part (c) of Figure 2. The results illustrate that the proposed algorithm successfully tracks the deformed object A. In frame 249, object A disappears and only object B remains. This is illustrated in part (d) of Figure 2, from which it can be seen that the proposed algorithm is capable of tracking object B, as marked by the pink square. In summary, Figure 2 illustrates that the proposed algorithm is able to track objects under a range of different circumstances, which include: (i) object A varies from small to large and then from large to small, presenting a gradual changing process inside the video clip; (ii) one object disappears or appears while another stays inside the frame; (iii) one object is subject to deformation. To test the robustness of the proposed algorithm against possible inaccurate object segmentation, we created some video clips in which objects have colours similar to the background. The video clip was produced with a resolution of 352 × 234 pixels and contains 229 frames. The tracking results are shown in Figures 3 and 4.
Fig. 2. Object tracking experiments for Hall Monitor: (a) Frame 17, where object A starts to appear; (b) Frame 80, where object B appears; (c) Frame 111, where object A deforms; (d) Frame 249, where object A disappears
Fig. 3. Tracking of objects with similar colour to backgrounds: Frame 37, where object A appears; Frame 89, where object B appears; Frame 109, where two objects coexist; Frame 132, where object A disappears
Fig. 4. Illustration of further experimental results on object tracking: (a) Frame 43, where object A appears; (b) Frame 85, where object B appears; (c) Frame 110, where two objects coexist; (d) Frame 149, with object A disappeared
4 Conclusions

In this paper, we proposed a new algorithm for object tracking with two main contributions. Firstly, we introduced central-point-based macroblocks to track objects at the regional level, overcoming the problem of inaccuracy in object segmentation. Secondly, we measured the structural similarity between objects at the global level via the introduction of directional vectors inside the objects. These two levels of measurement are integrated through a cost matrix to establish the correspondence of all objects between two adjacent frames. Extensive experiments show that the proposed algorithm is capable of tracking objects under a range of circumstances. Finally, the authors wish to acknowledge the financial support under the EU IST FP-6 Research Programme with the integrated project LIVE (Contract No. IST-4-027312).
References 1. Koller, D., Danilidis, K., Nagel, H.: Model-based object tracking in monocular image sequences of road traffic scenes. Int. J. Comput. Vis. 10(3), 257–281 (1993) 2. Meier, T., Ngan, K.: Automatic segmentation of moving objects for video object plane generation. IEEE Trans. Circuits Syst. Video Technol. 8(5), 525–538 (1998) 3. Wang, D.: Unsupervised video segmentation based on watersheds and temporal tracking. IEEE Trans. Circuits Syst. Video Technol. 8(5), 539–546 (1998)
4. Marcotegui, B., Zanoguera, F., Correia, P., Rosa, R., Marques, F., Mech, R., Wollborn, M.: A video object generation tool allowing friendly user interaction. In: Proc. IEEE Int. Conf. Image Process., pp. 391–395 (1999) 5. Tao, H., Sawhney, H.S., Kumar, R.: Object tracking with Bayesian estimation of dynamic layer representation. IEEE Trans. Pattern Anal. Machine Intell. 24(1), 75–89 (2002) 6. Gu, C., Lee, M.-C.: Semiautomatic segmentation and tracking of semantic video objects. IEEE Trans. Circuits Syst. Video Technol. 8(5), 572–584 (1998) 7. Paragios, N., Deriche, R.: Geodesic active regions for motion estimation and tracking. In: ICCV. Proc. 7th Int. Conf. Computer Vision, pp. 224–240 (1999) 8. Peterfreund, N.: Robust tracking of position and velocity with Kalman snakes. IEEE Trans. Pattern Anal. Machine Intell. 21(6), 564–569 (1998) 9. Sun, S., Haynor, D.R., Kim, Y.: Semiautomatic video object segmentation using VSnakes. IEEE Trans. Circuits Syst. Video Technol. 13(1), 75–82 (2003) 10. Günsel, B., Tekalp, A.M., van Beek, P.J.: Content-based access to video objects: temporal segmentation, visual summarization, and feature extraction. Signal Processing 66(2), 261–280 (1998) 11. Zhao, J.W., Wang, P., Liu, C.Q.: An object tracking algorithm based on occlusion mesh model. In: Proc. Int. Conf. Machine Learning and Cybernetics, pp. 288–292 (2002) 12. Beymer, D., McLauchlan, P., Coifman, B., Malik, J.: A real-time computer vision system for measuring traffic parameters. In: CVPR. Proc. Computer Vision and Pattern Recognition, pp. 495–501 (1997) 13. Kim, C., Hwang, J.N.: Fast and automatic video object segmentation and tracking for content-based applications. IEEE Transactions on Circuits and Systems for Video Technology 12(2) (2002) 14. Gao, L., Jiang, J., Yang, S.Y.: Constrained region-grow for semantic object segmentation. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2006. LNCS, vol. 4179, pp. 323–331. Springer, Heidelberg (2006) 15. Adams, R., Bischof, L.: Seeded region growing. IEEE Trans. Pattern Anal. Machine Intell. 16(6), 641–647 (1994) 16. Kirishima, T., Sato, K., Chihara, K.: Real-time gesture recognition by learning and selective control of visual interest points. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3) (2005)
Tracking of Multiple Targets Using On-Line Learning for Appearance Model Adaptation Franz Pernkopf Graz University of Technology, Inffeldgasse 16c, A-8010 Graz, Austria
[email protected] Abstract. We propose visual tracking of multiple objects (faces of people) in a meeting scenario based on low-level features such as skin-color, target motion, and target size. Based on these features automatic initialization and termination of objects is performed. Furthermore, on-line learning is used to incrementally update the models of the tracked objects to reflect the appearance changes. For tracking a particle filter is incorporated to propagate sample distributions over time. We discuss the close relationship between our implemented tracker based on particle filters and genetic algorithms. Numerous experiments on meeting data demonstrate the capabilities of our tracking approach.
1 Introduction
Visual tracking of multiple objects is concerned with maintaining the correct identity and location of a variable number of objects over time irrespective of occlusions and visual alterations. Lim et al. [1] differentiate between intrinsic and extrinsic appearance variability including pose variation, shape deformation of the object and illumination change, camera movement, and occlusions, respectively. In the past few years, particle filters have become the method of choice for tracking. Isard and Blake introduced particle filtering (Condensation algorithm) [2]. Many different sampling schemes have been suggested in the meantime. An overview about sampling schemes of particle filters and the relation to Kalman filters is provided in [3]. Recently, the main emphasis is on tracking multiple objects simultaneously and on on-line learning to adapt the reference models to the appearance changes, e.g., pose variation, illumination change. Lim et al. [1] introduce a single object tracker where the target representation is incrementally updated to model the appearance variability. They assume that the target region is initialized in the first frame. For tracking multiple objects most algorithms belong to one of the following three categories: (i) Multiple instances of a single object tracker are used [4]. (ii) All objects of interest are included in the state space [5]. A fixed number of objects is assumed. Varying number of objects result in a dynamic change of the dimension of the state space. (iii) Most recently, the framework of particle filters is extended to capture multiple targets using a mixture model [6].
This mixture particle filter - where each component models an individual object - enables interaction between the components by the importance weights. In [7] this approach is extended by the Adaboost algorithm to learn the models of the targets. The information from Adaboost enables detection of objects entering the scene automatically. The mixture particle filter is further extended in [8] to handle mutual occlusions. They introduce a rectification technique to compensate for camera motions, a global nearest neighbor data association method to correctly identify object detections with existing tracks, and a mean-shift algorithm which accounts for more stable trajectories for reliable motion prediction. In this paper, we do both tracking of multiple persons in a meeting scenario and on-line adaptation of the models to account for appearance changes during tracking. The tracking is based on low-level features such as skin-color, object motion, and object size. Based on these features automatic initialization and termination of objects is performed. The aim is to use as little prior knowledge as possible. For tracking a particle filter is incorporated to propagate sample distributions over time. Our implementation is related to the dual estimation principle [9], where both the states of multiple objects and the parameters of the object models are estimated simultaneously given the observations. At every time step, the particle filter estimates the states using the observation likelihood of the current object models while the on-line learning of the object models is based on the current state estimates. Additionally, we discuss the similarity between our implemented tracker based on particle filters and Genetic Algorithms (GA). Numerous experiments on meeting data demonstrate the capabilities of our tracking approach. The proposed approach differs from previous methods in several aspects. Recently, much work has been done on multiple object tracking on the one hand and on appearance model adaptation for a single object tracker on the other. In this paper, we do both tracking of multiple objects and on-line learning to incrementally update the representation of the tracked objects to model appearance changes. We automatically initialize and terminate tracking of individual objects based on low-level features, i.e. face color, face size, and object movement. Unlike our approach, many methods initialize the target region in the first frame. The paper is organized as follows: Section 2 introduces the particle filter for multiple object tracking, the state space dynamics, the observation model, automatic initialization and termination of objects, and the on-line learning of the models for the tracked objects. Section 2.7 summarizes the implemented tracker on the basis of pseudo-code. Section 3 sketches the relationship to GA. The tracking results are presented in Section 4. Section 5 concludes.
2 Tracking Using Particle Filters

2.1 Particle Filter
A particle filter is capable of dealing with non-linear, non-Gaussian processes and has become popular for visual tracking. For tracking, the probability distribution
that the object of interest is in state $x_t$ at time $t$ given the observations $y_{0:t}$ up to time $t$ is of interest. Hence, $p(x_t|y_{0:t})$ has to be constructed starting from the initial distribution $p(x_0|y_0) = p(x_0)$. In Bayesian filtering this can be formulated as an iterative recursive process consisting of the prediction step

$$p(x_t|y_{0:t-1}) = \int p(x_t|x_{t-1})\, p(x_{t-1}|y_{0:t-1})\, dx_{t-1} \qquad (1)$$

and of the filtering step

$$p(x_t|y_{0:t}) = \frac{p(y_t|x_t)\, p(x_t|y_{0:t-1})}{\int p(y_t|x_t)\, p(x_t|y_{0:t-1})\, dx_t}, \qquad (2)$$

where $p(x_t|x_{t-1})$ is the dynamic model describing the state space evolution which corresponds to the evolution of the tracked object (see Section 2.2) and $p(y_t|x_t)$ is the likelihood of an observation $y_t$ given the state $x_t$ (see observation model in Section 2.3).

In particle filters $p(x_t|y_{0:t})$ of the filtering step is approximated by a finite set of weighted samples, i.e. the particles, $\{x_t^m, w_t^m\}_{m=1}^M$, where $M$ is the number of samples. Particles are sampled from a proposal distribution $x_t^m \sim q(x_t|x_{t-1}, y_{0:t})$ (importance sampling) [3]. In each iteration the importance weights are updated according to

$$w_t^m \propto \frac{p(y_t|x_t^m)\, p(x_t^m|x_{t-1}^m)}{q(x_t^m|x_{t-1}^m, y_{0:t})}\, w_{t-1}^m \quad \text{and} \quad \sum_{m=1}^M w_t^m = 1. \qquad (3)$$

One simple choice for the proposal distribution is to take the prior density $q(x_t^m|x_{t-1}^m, y_{0:t}) = p(x_t^m|x_{t-1}^m)$ (bootstrap filter). Hence, the weights are proportional to the likelihood model $p(y_t|x_t^m)$

$$w_t^m \propto p(y_t|x_t^m)\, w_{t-1}^m. \qquad (4)$$

The posterior filtered density $p(x_t|y_{1:t})$ can be approximated as

$$p(x_t|y_{1:t}) \approx \sum_{m=1}^M w_t^m\, \delta(x_t - x_t^m), \qquad (5)$$

where $\delta(x_t - x_t^m)$ is the Dirac delta function with mass at $x_t^m$. We use resampling to reduce the degeneracy problem [10]. We resample the particles $\{x_t^m\}_{m=1}^M$ with replacement $M$ times according to their weights $w_t^m$. The resulting particles $\{x_t^m\}_{m=1}^M$ have uniformly distributed weights $w_t^m = \frac{1}{M}$. Similar to the Sampling Importance Resampling Filter [3], we resample in every time step. This simplifies Eqn. 4 to $w_t^m \propto p(y_t|x_t^m)$ since $w_{t-1}^m = \frac{1}{M}$ $\forall m$.

In the meeting scenario, we are interested in tracking the faces of multiple people. We treat the tracking of multiple objects completely independently, i.e., we assign a set of $M$ particles to each tracked object $k$ as $\left\{\{x_t^{m,k}\}_{m=1}^M\right\}_{k=1}^K$, where $K$ is the total number of tracked objects which changes dynamically over time. Hence, we use multiple instances of a single object tracker similar to [4].
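A compact sketch of one bootstrap-filter iteration for a single object (Eqs. 3-5 with the prior as proposal) is given below; `propagate` and `likelihood` stand in for the dynamic and observation models of Sections 2.2 and 2.3 and are placeholders, not functions defined in the paper.

```python
import numpy as np

def particle_filter_step(particles, propagate, likelihood, rng=np.random.default_rng()):
    """One predict-weight-resample iteration of the SIR (bootstrap) particle filter.

    particles  : (M, d) array of states x_t^m
    propagate  : callable mapping the particle set to the predicted set (dynamic model)
    likelihood : callable returning p(y_t | x_t^m) for each particle
    """
    particles = propagate(particles)           # sample from the proposal p(x_t | x_{t-1})
    weights = likelihood(particles)             # w_t^m proportional to p(y_t | x_t^m), Eq. (4)
    weights = weights / weights.sum()           # normalise so the weights sum to 1
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]                       # resampling with replacement
```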
2.2 State Space Dynamics
The state sequence evolution $\{x_t; t \in \mathbb{N}\}$ is assumed to be a second-order autoregressive process, which is used instead of the first-order formalism ($p(x_t|x_{t-1})$) introduced in the previous subsection. The second-order dynamics can be written as first-order by extending the state vector at time $t$ with elements from the state vector at time $t-1$. We define the state vector at time $t$ as $x_t = [x_t\ y_t\ s_t^x\ s_t^y]^T$. The location of the target at $t$ is given as $x_t$, $y_t$, respectively, and $s_t^x$, $s_t^y$ denote the scale of the tracked region in the $x \times y$ image space. In our tracking approach, the transition model corresponds to

$$x_{t+1}^{m,k} = x_t^{m,k} + C v_t + \frac{D}{2M}\sum_{m'=1}^{M}\left(x_t^{m',k} - x_{t-1}^{m',k}\right), \qquad (6)$$

where $v_t \sim \mathcal{N}(0, I)$ is a simple Gaussian random noise model and the term $\frac{1}{2M}\sum_{m'=1}^{M}(x_t^{m',k} - x_{t-1}^{m',k})$ captures the linear evolution of object $k$ from the particles of the previous time step. The factor $D$ models the influence of the linear evolution, e.g., $D$ is set to 0.5. The parameters of the random noise model are set to $C = \mathrm{diag}([10\ 10\ 0.3\ 0.3])$ with the units of [pixels/frame], [pixels/frame], [1/frame], and [1/frame], respectively.
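A sketch of the transition model of Eq. (6), assuming the particle arrays of the current and previous time step are available as NumPy arrays:

```python
import numpy as np

def propagate(particles_t, particles_tm1, D=0.5,
              C=np.diag([10.0, 10.0, 0.3, 0.3]), rng=np.random.default_rng()):
    """Eq. (6): Gaussian diffusion plus the mean linear drift of the particle cloud.

    particles_t, particles_tm1 : (M, 4) arrays of states at t and t-1.
    """
    M = len(particles_t)
    drift = (D / (2 * M)) * (particles_t - particles_tm1).sum(axis=0)
    noise = rng.standard_normal(particles_t.shape) @ C.T   # C * v_t with v_t ~ N(0, I)
    return particles_t + noise + drift
```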
2.3 Observation Model
The shape of the tracked region is determined to be an ellipse [11], since the tracking is focused on the faces of the individuals in the meeting. We assume that the principal axes of the ellipses are aligned with the coordinate axes of the image. Similarly to [12], we use color histograms for modelling the target regions. Therefore, we transform the image into the hue-saturation-value (HSV) space [13]. For the sake of readability we abuse the notation and write the particle $x_t^{m,k}$ as $x_t$ in this subsection. We build an individual histogram for hue (H) $h_H^{x_t}$, saturation (S) $h_S^{x_t}$, and value (V) $h_V^{x_t}$ of the elliptic candidate region at $x_t$. The lengths of the principal axes of the ellipse are $A_{ref}^k s_t^x$ and $B_{ref}^k s_t^y$, respectively, where $A_{ref}^k$ and $B_{ref}^k$ are the lengths of the ellipse axes of the reference model of object $k$. The likelihood of the observation model (likelihood model) $p(y_t^{m,k}|x_t^{m,k})$ must be large for candidate regions with a histogram close to the reference histogram. Therefore, we introduce the Jensen-Shannon (JS) divergence [14] to measure the similarity between the normalized candidate and reference histograms, $h_c^{x_t}$ and $h_{c,ref}^k$, $c \in \{H, S, V\}$, respectively. Since the JS-divergence is defined for probability distributions, the histograms are normalized, i.e. $\sum_N h_c^{x_t} = 1$, where $N$ denotes the number of histogram bins. In contrast to the Kullback-Leibler divergence [15], the JS-divergence is symmetric and bounded. The JS-divergence between the normalized histograms is defined as

$$JS_\pi\left(h_c^{x_t}, h_{c,ref}^k\right) = H\left(\pi_1 h_c^{x_t} + \pi_2 h_{c,ref}^k\right) - \pi_1 H\left(h_c^{x_t}\right) - \pi_2 H\left(h_{c,ref}^k\right), \qquad (7)$$
where $\pi_1 + \pi_2 = 1$, $\pi_i \ge 0$, and the function $H(\cdot)$ is the entropy [15]. The JS-divergence is computed for the histograms of the H, S, and V space and the observation likelihood is

$$p\left(y_t^{m,k}|x_t^{m,k}\right) \propto \exp\left(-\lambda \sum_{c \in \{H,S,V\}} JS_\pi\left(h_c^{x_t^{m,k}}, h_{c,ref}^k\right)\right), \qquad (8)$$

where the parameter $\lambda$ is chosen to be 5 and the weights $\pi_i$ are uniformly distributed. The number of bins of the histograms is set to $N = 50$.
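A sketch of the observation likelihood of Eqs. (7)-(8) for normalised H, S and V histograms; the histogram extraction from the elliptic region is omitted, and the function names are ours.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a normalised histogram (zero bins are skipped)."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def js_divergence(h_cand, h_ref, pi1=0.5, pi2=0.5):
    """Eq. (7): Jensen-Shannon divergence between two normalised histograms."""
    return entropy(pi1 * h_cand + pi2 * h_ref) - pi1 * entropy(h_cand) - pi2 * entropy(h_ref)

def observation_likelihood(cand_hists, ref_hists, lam=5.0):
    """Eq. (8): p(y|x) proportional to exp(-lambda * sum of JS divergences over H, S, V)."""
    total = sum(js_divergence(cand_hists[c], ref_hists[c]) for c in ("H", "S", "V"))
    return np.exp(-lam * total)
```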
2.4 Automatic Initialization of Objects
If an object enters the frame, a set of $M$ particles and a reference histogram for this object have to be initialized. Basically, the initialization of objects is performed automatically using the following simple low-level features (a sketch combining them is given after the list):

– Motion: The images are transformed to gray scale $I^G_{x_t,y_t}$. The motion feature is determined for each pixel located at $x, y$ by the standard deviation over a time window $T_w$ as $\sigma_{x,y}^t = \sigma\left(I^G_{x_{t-T_w:t},\, y_{t-T_w:t}}\right)$. Applying an adaptive threshold $T_{motion} = \frac{1}{10}\max_{x,y \in I^G} \sigma_{x,y}^t$, pixels with a value larger than $T_{motion}$ belong to regions where movement happens. However, $\max_{x,y \in I^G} \sigma_{x,y}^t$ has to be sufficiently large so that motion exists at all. A binary motion image $I^{B_{motion}}_{x_t,y_t}$ after morphological closing is shown in Figure 1.

– Skin Color: The skin color of the people is modeled by a Gaussian mixture model [16] in the HSV color space. A Gaussian mixture model $p(z|\Theta)$ is the weighted sum of $L > 1$ Gaussian components, $p(z|\Theta) = \sum_{l=1}^{L} \alpha_l \mathcal{N}(z|\mu_l, \Sigma_l)$, where $z = [z_H, z_S, z_V]^T$ is the 3-dimensional color vector of one image pixel and $\alpha_l$ corresponds to the weight of each component $l = 1, \ldots, L$. These weights are constrained to be positive, $\alpha_l \ge 0$, and $\sum_{l=1}^{L} \alpha_l = 1$. The Gaussian mixture is specified by the set of parameters $\Theta = \{\alpha_l, \mu_l, \Sigma_l\}_{l=1}^{L}$. These parameters are determined by the EM algorithm [17] from a face database. Image pixels $z \in I^{HSV}_{x_t,y_t}$ are classified according to their likelihood $p(z|\Theta)$ using a threshold $T_{skin}$. The binary map $I^{B_{skin}}_{x_t,y_t}$ filtered with a morphological closing operator is presented in Figure 1.

– Object Size: We initialize a new object only for skin-colored moving regions with a size larger than $T_{Area}$. Additionally, we do not allow initialization of a new set of particles in regions where an object is currently tracked. To this end, a binary map $I^{B_{prohibited}}_{x_t,y_t}$ represents the areas where initialization is prohibited. The binary combination of all images, $I^{B}_{x_t,y_t} = I^{B_{motion}}_{x_t,y_t} \cap I^{B_{skin}}_{x_t,y_t} \cap I^{B_{prohibited}}_{x_t,y_t}$, is used for extracting regions with an area larger than $T_{Area}$. Target objects are initialized for those regions, i.e., the ellipse size ($A_{ref}^k$, $B_{ref}^k$) and the histograms $h_{c,ref}^k$, $c \in \{H, S, V\}$, are determined from the region of the bounding ellipse.
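The sketch below combines the three cues into candidate regions for initialization. The thresholds, the omission of the morphological closing step, and the exact form of the prohibited map are our assumptions; the skin-likelihood map is assumed to come from a previously trained GMM.

```python
import numpy as np
from scipy.ndimage import label

def candidate_regions(gray_window, skin_likelihood, prohibited,
                      T_skin=1e-3, T_area=400):
    """Candidate regions for object initialization from motion, skin color and size.

    gray_window     : (T_w, H, W) gray-scale frames used for the motion feature
    skin_likelihood : (H, W) GMM likelihood p(z|Theta) of the current frame
    prohibited      : (H, W) boolean map of areas where initialization is not allowed
    """
    sigma = gray_window.std(axis=0)               # per-pixel std over the time window
    T_motion = sigma.max() / 10.0                 # adaptive threshold (1/10 of the maximum)
    motion = sigma > T_motion
    skin = skin_likelihood > T_skin
    mask = motion & skin & ~prohibited            # binary combination of all maps
    labels, n = label(mask)                       # connected components
    return [labels == i for i in range(1, n + 1)
            if (labels == i).sum() > T_area]      # keep regions larger than T_area
```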
Figure 1 shows an example of the initialization of a new object. The original image $I^{HSV}_{x_t,y_t}$ is presented in (a). The person entering from the right side should be initialized. A second person in the middle of the image is already tracked. The binary images of the thresholded motion $I^{B_{motion}}_{x_t,y_t}$ and of the skin-colored areas $I^{B_{skin}}_{x_t,y_t}$ are shown in (b) and (c), respectively. The reflections at the table and the movement of the curtain produce noise in the motion image. The color of the table and chairs intersects with the skin-color model. To guarantee successful initialization, the lower part of the image - the region of the chairs and desk - has to be excluded. This is reasonable since nobody can enter in this area. Also, tracking is performed in the area above the chairs only. Finally, the region of the newly initialized object is presented as an ellipse in (d). Resizing of the images is performed for computing the features to speed up the initialization of objects.
Fig. 1. Initialization of a new object: (a) Original image with one object already tracked, (b) Binary image of the thresholded motion $I^{B_{motion}}_{x_t,y_t}$, (c) Binary image of the skin-colored areas $I^{B_{skin}}_{x_t,y_t}$, (d) Image with region of initialized object
Shortcomings: The objects are initialized when they enter the image. The reference histogram is taken during this initialization. There are the following shortcomings during initialization: – The camera is focused on the people sitting at the table and not on people walking behind the chairs. This means that walking persons appear blurred. – Entering persons are moving relatively fast. This also results in a degraded image quality (blurring). – During initialization, we normally get the side view of the person’s head. When the person sits at the table the reference histogram is not necessarily a good model for the frontal view.
To deal with these shortcomings, we propose on-line learning to incrementally update the reference models of the tracked objects over time (see Section 2.6). We perform this only in cases where no mutual occlusions between the tracked objects exist.

2.5 Automatic Termination of Objects
Termination of particles is performed if the observation likelihood $p(y_t^{m,k}|x_t^{m,k})$ at state $x_t^{m,k}$ drops below a predefined threshold $T_{Kill}$ (e.g. 0.001), i.e., $p(y_t^{m,k}|x_t^{m,k}) = 0$ if $p(y_t^{m,k}|x_t^{m,k}) < T_{Kill}$. Particles with zero probability do not survive during resampling. If the tracked object leaves the field of view, all $M$ particles of an object $k$ are removed, i.e., $p(y_t^{m,k}|x_t^{m,k}) = 0$ for all particles of object $k$.

2.6 Incremental Learning of Object Models
To handle the appearance change of the tracked objects over time, we use on-line learning to adapt the reference histograms $h_{c,ref}^k$, $c \in \{H, S, V\}$, and the ellipse size $A_{ref}^k$ and $B_{ref}^k$. Therefore, a learning rate $\alpha$ is introduced and the model parameters for target object $k$ are updated according to

$$h_{c,ref}^k = \alpha \hat{h}_c^k + (1 - \alpha) h_{c,ref}^k, \qquad c \in \{H, S, V\}, \qquad (9)$$

$$A_{ref}^k = \alpha \hat{A}^k + (1 - \alpha) A_{ref}^k, \qquad (10)$$

$$B_{ref}^k = \alpha \hat{B}^k + (1 - \alpha) B_{ref}^k, \qquad (11)$$

where $\hat{h}_c^k$ denotes the histogram and $\hat{A}^k$ and $\hat{B}^k$ are the principal axes of the bounding ellipse of the non-occluded (i.e. no mutual occlusion between tracked objects) skin-colored region of the corresponding tracked object $k$ located at $\{x_t^{m,k}\}_{m=1}^{M}$. Again, these regions have to be larger than $T_{Area}$; otherwise, no update is performed.
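A minimal sketch of the exponential update of Eqs. (9)-(11); the value of the learning rate here is only illustrative.

```python
def update_model(ref_hists, ref_axes, new_hists, new_axes, alpha=0.1):
    """Eqs. (9)-(11): blend the newly measured histograms and ellipse axes into the
    reference model of a tracked object with learning rate alpha."""
    for c in ("H", "S", "V"):
        ref_hists[c] = alpha * new_hists[c] + (1 - alpha) * ref_hists[c]
    A = alpha * new_axes[0] + (1 - alpha) * ref_axes[0]
    B = alpha * new_axes[1] + (1 - alpha) * ref_axes[1]
    return ref_hists, (A, B)
```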
2.7 Implemented Tracker
In the following, we sketch our tracking approach for multiple objects. The binary variable InitObject denotes that a new object for tracking has been detected. KillObject is set if an object should be terminated. OnlineUpdate indicates that object $k$ located at $\{x_t^{m,k}\}_{m=1}^{M}$ is non-occluded and that the area of the skin-colored region is larger than $T_{Area}$, i.e., we perform on-line learning for object model $k$. Our implementation is related to the dual estimation problem [9], where both the states of multiple objects $x_t^{m,k}$ and the parameters of the object models are estimated simultaneously given the observations. At every time step, the particle
filter estimates the states using the observation likelihood of the current object models while the on-line learning of the object models is based on the current state estimates.

procedure Tracking
Input: Color image sequence $I^{HSV}_{x_{0:T},y_{0:T}}$ (0 : T), skin color model $\Theta$
Parameters: $M, N, \lambda, C, D, T_w, T_{motion}, T_{skin}, T_{Area}, T_{Kill}, \alpha$
Output: $\{x_{0:T}^{m,k}\}_{m=1}^{M}$ $\forall k$ (set of particle sequences 0 : T)
begin
  $t \leftarrow 0$, $k \leftarrow 0$
  while InitObject (motion feature is computed over future frames)
    $k \leftarrow k + 1$
    Determine: $h_{c,ref}^k$, $c \in \{H,S,V\}$, $A_{ref}^k$, $B_{ref}^k$, $x_{ref}^k$
    $x_t^{m,k} \leftarrow x_{ref}^k + C v_t$   $\forall m = 1, \ldots, M$ (generate particles)
  end
  $K \leftarrow k$
  for $t \leftarrow 1 : T$ (loop over image sequence)
    $w_t^{m,k} \propto p(y_t^{m,k}|x_t^{m,k})$   $\forall k = 1, \ldots, K$, $\forall m = 1, \ldots, M$
    while KillObject
      $k \leftarrow$ Determine object to terminate
      Remove $M$ particles $x_t^{m,k}$ of object $k$
      Remove reference histogram and ellipse size: $h_{c,ref}^k$, $c \in \{H,S,V\}$, $A_{ref}^k$, $B_{ref}^k$
      $K \leftarrow K - 1$
    end
    for $k \leftarrow 1 : K$ (loop over objects)
      $w_t^{m,k} \leftarrow \frac{w_t^{m,k}}{\sum_{m'=1}^{M} w_t^{m',k}}$   $\forall m = 1, \ldots, M$ (normalize weights)
      $\{x_t^{m,k}\}_{m=1}^{M} \leftarrow$ Resampling (with replacement): $\{x_t^{m,k}, w_t^{m,k}\}_{m=1}^{M}$
      $x_{t+1}^{m,k} \leftarrow x_t^{m,k} + C v_t + \frac{D}{2M}\sum_{m'=1}^{M}\left(x_t^{m',k} - x_{t-1}^{m',k}\right)$   $\forall m = 1, \ldots, M$
      if OnlineUpdate
        Determine: $\hat{h}_c^k$, $c \in \{H,S,V\}$, $\hat{A}^k$, $\hat{B}^k$
        $h_{c,ref}^k \leftarrow \alpha \hat{h}_c^k + (1 - \alpha) h_{c,ref}^k$, $c \in \{H,S,V\}$
        $A_{ref}^k \leftarrow \alpha \hat{A}^k + (1 - \alpha) A_{ref}^k$
        $B_{ref}^k \leftarrow \alpha \hat{B}^k + (1 - \alpha) B_{ref}^k$
      end
    end
    while InitObject
      $K \leftarrow K + 1$
      Determine: $h_{c,ref}^K$, $c \in \{H,S,V\}$, $A_{ref}^K$, $B_{ref}^K$, $x_{ref}^K$
      $x_t^{m,K} \leftarrow x_{ref}^K + C v_t$   $\forall m = 1, \ldots, M$ (generate particles)
    end
  end
end
3 Relationship to Genetic Algorithms
Genetic algorithms are optimization algorithms founded upon the principles of natural evolution discovered by Darwin. In nature, individuals have to adapt to their environment in order to survive in a process of further development. An introduction to genetic algorithms can be found in [18]. GA are stochastic procedures which have been successfully applied in many optimization tasks. GA operate on a population of potential solutions, applying the principle of survival of the fittest individual to produce better and better approximations to the solution. At each generation, a new set of approximations is created by the process of selecting individuals according to their level of fitness in the problem domain and assembling them together using operators inspired from nature. This leads to the evolution of individuals that are better suited to their environment than the parent individuals they were created from. GA model the natural processes, such as selection, recombination, and mutation. Starting from an initial population $P(0)$, the sequence $P(0), P(1), \ldots, P(t)$ is called population sequence or evolution. The end of an artificial evolution process is reached once the termination condition is met, and the result of the optimization task is available. In this section, we want to point to the close relationship between GA and our particle filter for tracking. This analogy has been mentioned in [19]. As suggested in Section 2, we treat the tracking of multiple objects completely independently, i.e., we have a set of $M$ particles for each object $k$. In the GA framework, we can relate this to $k$ instantiations of GA, one for each tracked object. Hence, each particle $x_t^m$ of object $k$ represents one individual in the population $P(t)$, which is value encoded. The population size is $M$. A new genetic evolution process is started once a new object is initialized for tracking (InitObject). The evolution process of the GA is terminated either at the end of the video ($t = T$) or when the set of individuals is not supported by the fitness value (KillObject). The observation likelihood $p(y_t^{m,k}|x_t^{m,k})$ denotes the fitness function to evaluate the individuals. However, the scope of GA for tracking is slightly different. GA are generally used to find a set of parameters for a given optimization task, i.e., the aim is to find the individual with the best fitness after the termination of the evolution, whereas in the tracking case the focus lies on the evolution of the individuals, i.e., the evolution of the tracked object. The selection operator directs the search toward promising regions in the search space. Roulette Wheel Selection [20] is a widely used selection method which is very similar to sampling with replacement. To each individual a reproduction probability according to $w_t^m \leftarrow \frac{w_t^m}{\sum_{m'=1}^{M} w_t^{m'}}$ is assigned. A roulette wheel is constructed with a slot size proportional to the individual's reproduction probability. Then $M$ uniformly distributed random numbers on the interval $[0, 1]$ are drawn and distributed according to their value around the wheel. The slots where they are placed compose the subsequent population $P(t)$. The state space dynamics of the particle filter (see Section 2.2) is modeled by the recombination and mutation operators.
The framework of the GA for tracking one object $k$ is presented in the following. The incremental learning of the object model is omitted.

procedure GA
Input: Color image sequence $I^{HSV}_{x_{t:T},y_{t:T}}$ (t : T), $h_{c,ref}$, $c \in \{H,S,V\}$, $A_{ref}$, $B_{ref}$, $x_{ref}$
Parameters: $M, N, \lambda, C, D, T_{Kill}$
Output: $\{x_{t:T}^{m}\}_{m=1}^{M}$ (set of particle sequences t : T)
begin
  Initialize population $P(t)$: $x_t^m \leftarrow x_{ref} + C v_t$   $\forall m = 1, \ldots, M$
  while KillObject $\cap$ $t < T$ (loop over image sequence)
    Evaluate individuals: $w_t^m \leftarrow p(y_t^m|x_t^m)$   $\forall m = 1, \ldots, M$
    Selection $P(t)$: $\{x_t^m\}_{m=1}^{M} \leftarrow$ (Sampling with replacement): $\{x_t^m, w_t^m\}_{m=1}^{M}$
    Recombination $P(t+1)$: $x_{t+1}^m \leftarrow x_t^m + \frac{D}{2M}\sum_{m'=1}^{M}\left(x_t^{m'} - x_{t-1}^{m'}\right)$   $\forall m = 1, \ldots, M$
    Mutation $P(t+1)$: $x_{t+1}^m \leftarrow x_{t+1}^m + C v_t$   $\forall m = 1, \ldots, M$
    $t \leftarrow t + 1$
  end
end

4 Experiments
For testing the performance of our tracking approach 10 videos with ∼ 7000 frames have been used. The resolution is 640 × 480 pixels. The meeting room is equipped with a table and three chairs. We have different persons in each video. The people are coming from both sides into the frame moving to chairs and sit down. After a short discussion people are leaving the room sequentially, are coming back, sit down at different chairs and so on. At the beginning, people may already sit at the chairs. In this case, we have to initialize multiple objects automatically at the very first frame. The strong reflections at the table, chairs, and the white board cause noise in the motion image. Therefore, we initialize and track objects only in the area above the chairs. The appearance of an object changes over time. When entering the frame, we get the side view of the person’s head. After sitting down at the table, we have a frontal view. Hence, we update the reference histogram incrementally during tracking. We perform this only in the case where no overlaps with other tracked objects are existent. This on-line learning enables a more robust tracking. The participants were successfully tracked over long image sequences. Figure 2 shows the result of the implemented tracker for one video. All the initializations and terminations of objects are performed automatically. First the person on the left side stands up and leaves the room on the right side (frame 416 - 491). When walking behind the two sitting people partial occlusions occur which do not cause problems. Next, the person on the right (frame 583 - 637) leaves the room on the left side. His face is again partially occluded by the person in the middle. Then the person on the center chair leaves the room (frame 774). After that a person on the right side enters and sits at the left chair (frame 844). At frame 967 a small person is entering and moving to the chair in the middle. Here, again a partial occlusion occurs at frame 975 which is
Fig. 2. Tracking of people. Frames: 1, 416, 430, 449, 463, 491, 583, 609, 622, 637, 774, 844, 967, 975, 1182, 1400 (the frame number is assigned from left to right and top to bottom).
Fig. 3. Partial occlusions. Frames: 468, 616, 974, 4363.
also tackled. Finally, a person enters from the right and sits down on the right chair (frame 1182, 1400). The partial occlusions are shown in Figure 3. Also the blurred face of the moving person in the back can be observed in this figure. If we do not update the models of the tracked objects over time the tracking fails in case of partial occlusions.
5 Conclusion
We propose a robust visual tracking algorithm for multiple objects (faces of people) in a meeting scenario based on low-level features such as skin-color, target motion, and target size. Based on these features automatic initialization and termination of objects is performed. For tracking, a sampling importance resampling particle
filter has been used to propagate sample distributions over time. Furthermore, we use on-line learning of the target models to handle the appearance variability of the objects. This enables a more robust tracking. We discuss the similarity between our implemented tracker and genetic algorithms. Each particle represents an individual in the GA framework. The evaluation function incorporates the observation likelihood model and the individual selection process maps to the resampling procedure in the particle filter. The state space dynamics is incorporated in the recombination and mutation operator of the GA. Numerous experiments on meeting data show the capabilities of the tracking approach. The participants were successfully tracked over long image sequences. Partial occlusions are handled by the algorithm.
References 1. Lim, J., Ross, D., Lin, R.S., Yang, M.H.: Incremental learning for visual tracking. In: Advances in Neural Information Processing Systems, vol. 17 (2005) 2. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 3. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002) 4. Dockstader, S.L., Tekalp, A.: Tracking multiple objects in the presence of articulated and occluded motion. In: Workshop on Human Motion, pp. 88–98 (2000) 5. Hue, C., Le Cadre, J.P., Pérez, P.: Tracking multiple objects with particle filtering. IEEE Transactions on Aerospace and Electronic Systems 38(3), 791–812 (2002) 6. Vermaak, J., Doucet, A., Pérez, P.: Maintaining multi-modality through mixture tracking. In: ICCV. International Conference on Computer Vision, pp. 1110–1116 (2003) 7. Okuma, K., Taleghani, A., de Freitas, N., Little, J., Lowe, D.: A boosted particle filter: Multitarget detection and tracking. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, Springer, Heidelberg (2004) 8. Cai, Y., de Freitas, N., Little, J.: Robust visual tracking for multiple targets. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, Springer, Heidelberg (2006) 9. Haykin, S.: Kalman filtering and neural networks. John Wiley & Sons, Chichester (2001) 10. Doucet, A.: On sequential Monte Carlo sampling methods for Bayesian filtering. Technical Report CUED/F-INFENG/TR. 310, Cambridge University, Dept. of Eng. (1998) 11. Jepson, A., Fleet, D.J., El-Maraghi, T.: Robust online appearance models for visual tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(10), 1296–1311 (2003) 12. Pérez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, Springer, Heidelberg (2002) 13. Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis, and machine vision. International Thomson Publishing (1999) 14. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. on Inf. Theory 37(1), 145–151 (1991)
15. Cover, T., Thomas, J.: Elements of information theory. John Wiley & Sons, Chichester (1991) 16. Duda, R., Hart, P., Stork, D.: Pattern classification. John Wiley & Sons, Chichester (2000) 17. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistic Society 30(B), 1–38 (1977) 18. Goldberg, D.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading (1989) 19. Higuchi, T.: Monte Carlo filter using the Genetic Algorithm operators. J. Statist. Comput. Simul 59, 1–23 (1997) 20. Baker, L.: Reducing bias and inefficiency in the selection algorithm. In: International Conference on Genetic Algorithms and their Applications, pp. 14–21 (1987)
Using Visual Dictionary to Associate Semantic Objects in Region-Based Image Retrieval Rongrong Ji, Hongxun Yao, Zhen Zhang, Peifei Xu, and Jicheng Wang School of Computer Science and Engineering Harbin Institute of Technology P.O. BOX 321, West Dazhi Street, Harbin, 150001, China {rrji, yhx, zzhang, pfx, jcwang}@vilab.hit.edu.cn
Abstract. Besides inaccurate segmentation, the performance of region-based image retrieval is still restricted by the diverse appearances of semantically similar objects. In contrast, humans' linguistic descriptions of image objects can reveal object information at a higher level. Using a partially annotated region collection as a "visual dictionary", this paper proposes a semantic-sensitive region retrieval framework using a middle-level visual and textual object description. To achieve this goal, firstly, a partial image database is segmented into regions, which are manually annotated with keywords to construct a visual dictionary. Secondly, to associate appearance-diverse but semantically similar objects together, a Bayesian reasoning approach is adopted to calculate the semantic similarity between two regions. This inference method utilizes the visual dictionary to bridge un-annotated image regions together at the semantic level. Based on this reasoning framework, both query-by-example and query-by-keyword user interfaces are provided to facilitate user queries. Experimental comparisons of our method against a visual-only region matching method indicate its effectiveness in enhancing the performance of region retrieval and bridging the semantic gap. Keywords: Image Retrieval, Region Matching, Visual Dictionary, Bayesian Inference.
1 Introduction

Content-based image retrieval (CBIR) is a technique that effectively retrieves images based on their visual contents. It has been a long-standing research focus over the past decade [1, 2]. However, the performance of state-of-the-art CBIR systems is still unsatisfactory due to the so-called semantic gap between image content and human perception [1]. Region-based image representation and understanding is a feasible solution to address the semantic gap in content-based image retrieval. Humans tend to discern image information from the objects an image contains [3-6]. Consequently, object-level indexing can largely narrow the semantic gap in CBIR. For instance, the Berkeley BlobWorld system [3] utilized object-level feature extraction for retrieval. By automatically segmenting images into regions which roughly correspond to objects or parts of objects, BlobWorld allows users to query photographs based on regions of interest. The Stanford SIMPLIcity system [16] adopted a fuzzy logic approach, unified
feature matching (UFM), for region retrieval, which alleviated imprecise segmentation by fuzzy matching of regions. These methods utilize image segmentation to partition images into objects. The main problem is inaccurate image segmentation, which limits the performance of region-based image representation. Some methods have been proposed to reduce the negative effect of inaccurate segmentation. For instance, integrated region matching (IRM) [7] is considered one of the most effective region matching methods. It maps a region of one image to several regions of another image, so the similarity between two images is the weighted sum of region distances. Compared to traditional single region-to-region approaches, IRM largely alleviates inaccurate segmentation by smoothing over the distance imprecision. Since IRM does not take into consideration the spatial relationships of regions, Wang [17] proposed a constraint-based region matching (CRM) approach to integrate spatial information of regions into IRM. However, in many cases, perceptually similar objects may have diverse visual appearances. To associate semantically similar regions together, visual-only region representation is insufficient. In contrast, humans' linguistic descriptions of image objects can reveal object information at a higher level. In the CBIR research community, increasing research has focused on integrating linguistic supervision into image representation and similarity ranking [8, 9]. For instance, Li et al. [8] utilized a concept dictionary to supervise similarity ranking. They modeled images of any given concept as instances of a concept-based stochastic process (2D-MHMM), which is trained using pre-labeled category images. Based on such a process, the likelihood of image occurrence is computed to measure the relevancy between an image and the textual description of each concept. However, their method is applied at a global scale and is hence unsuitable for object-based retrieval. Duygulu [9] utilized region segmentation and region-based annotation for similarity ranking. But their method strongly relies on accurate segmentation, which is unavailable for real-world images that contain complex backgrounds and multiple objects.
The rest of this paper is organized as follows: Section 2 presents our image segmentation and region-level visual dictionary construction method. Section 3 describes our Bayesian reasoning framework. The experimental result and discussions are presented in Section 4. And finally this paper is concluded in Section 5 and discusses our future research directions.
2 Image Segmentation and Region-Based Visual Dictionary Construction

Our segmentation algorithm adopts a clustering procedure to automatically assemble pixels into regions. For computational efficiency, the image is divided into 4×4 blocks, for which an 8-dimensional feature vector is constructed as follows. The first three dimensions are the averages of the H, S and I components in the HSI color space. Similar to Chen [18], the next three are the variances of the HH, HL, and LH high-frequency coefficients of the Daubechies-4 wavelet bands of the I component in each block. Based on this six-dimensional feature vector, a k-means clustering process is conducted to assemble blocks into regions (and hence segment the image into regions). This procedure does not specify the cluster number, but gradually iterates this number from 2 to 6, seeking the best image partition by maximizing the following evaluation criterion:

$$N_{Selected} = \arg\max_{i} \left( \frac{1}{N_i} \sum_{j=1}^{8} \left( F_j^{Query} - F_j^{Image} \right)^2 \right) \qquad (1)$$

where $F_j^{Query}$ is the $j$th feature of the query image, $F_j^{Image}$ is the $j$th feature of the image in the current segmentation procedure, and $N_i$ is the $i$th iteration (from 2 to 6). The query image is pre-segmented into 4 regions. The similarity between two regions is defined as their Euclidean distance in the feature space. Different from the features used for segmentation, the features utilized for region description can contain more perceptually meaningful information. To describe the region more effectively, our algorithm extends Chen's feature extraction method by combining shape information: the region's geometrical circularity and center scatter:
$$\alpha = \frac{4\pi S}{P^2} \qquad (2)$$

$$\mu_{p,q} = \frac{1}{n_{Block}} \sum_{(x,y) \in R} (x - x_c)^2 (y - y_c)^2 \qquad (3)$$

In Eq. 2, $\alpha$ represents the geometrical circularity of the region; $S$ and $P$ are its size and perimeter, respectively. In Eq. 3, $\mu_{p,q}$ is its center scatter; $x_c$, $y_c$ and $n_{Block}$ represent the geometric center and the number of blocks in the region, respectively. A pre-segmented and pre-labeled image collection is provided to construct a region-level visual dictionary. Each region of an image is manually labeled with a specified keyword. For instance, Fig. 1(b)-(d) shows the segmentation and labeling result of Fig. 1(a). Each keyword $K_i$ in our keyword set is assigned a weight $P(K_i)$. Two assumptions are made about our visual dictionary: 1. each region is mapped to one keyword; 2. regions in an image are conditionally independent. Both are based on computational efficiency considerations. The semantic revealing ability of this visual dictionary depends largely on the volume of the image collection. In our experiment, 200 images are segmented to form the region set for labeling.
R. Ji et al.
keyword; 2. Regions in image are conditional independent. They are based on the computation efficiency consideration. The semantic revealing ability of this visual dictionary depends largely on the volume of the image collection. In our experiment, 200 images are segmented to form the region set for labeling.
Fig. 1. (a): Original image “rose” in COREL. (b-d): Each of the segmentation regions is manually labeled a keyword: grass (b), flower (c), leaf (d)
3 Semantic-Similar Object Association by Bayesian Reasoning Using the Visual Dictionary

To associate perceptually similar but visually diverse regions together, as well as regions that are both visually and perceptually similar, we adopt Bayesian reasoning to rank image similarities using the visual dictionary. From a Bayesian point of view, the query image provides the prior for similarity inference, while the image similarity between the query and each image in the database is the posterior probability of that image given the query. Different from traditional Bayesian reasoning strategies based solely on visual features [10-12], our approach combines region-based image features with keyword-based image perception to add semantic information into the similarity calculation and ranking procedure. This combination has received little attention from researchers but is of great importance for bridging the semantic gap. By using a keyword-to-region strategy, our algorithm reveals users' retrieval purpose more accurately and hence enhances retrieval performance. The images in the database are segmented offline using the algorithm described in Section 2. Our method supports both query-by-example and query-by-keyword scenarios. Based on these two query patterns, we present our Bayesian reasoning process in detail below.

3.1 Similarity Inference by Bayesian Reasoning for the Query-by-Example Interface

We consider the query-by-example retrieval task as a procedure of inducing the conditional probability $P(I|I_Q)$ of each image $I$ in the database given a query example $I_Q$. This probability measures to what degree an image $I$ belongs to the user's query concept given the query example $I_Q$. Consequently, the similarity ranking process is conducted by ranking the conditional probability of each image given the
query example. To measure this probability, an equivalent transformation is made by replacing the query image $I_Q$ by its regions $R_1, R_2, \ldots, R_{query}$:

$$P(I|I_Q) = P(I|R_1, R_2, \ldots, R_{query}) \qquad (4)$$
Using the visual dictionary, the Bayesian posterior probability of semantic similarity between the query regions and an image in the database can be expanded into Eq. 5, which is a discrete sum over the keyword variable $K$:

$$P(I|I_Q) = \sum_{K=1}^{K_{dic}} P(I, K | R_1, R_2, \ldots, R_{query}) \qquad (5)$$
Expanded by Bayesian inference, Eq. 5 can be rewritten as follows:

$$P(I|I_Q) = \sum_{K=1}^{K_{dic}} P(I | K, R_1, R_2, \ldots, R_{query})\, P(K | R_1, R_2, \ldots, R_{query}) \qquad (6)$$
The second component of Eq. 6 is further transformed by another Bayesian inference step:

$$P(K | R_1, R_2, \ldots, R_{query}) = \frac{P(R_1, R_2, \ldots, R_{query} | K)\, P(K)}{P(R_1, R_2, \ldots, R_{query})} \qquad (7)$$
Since the regions are conditionally independent given their keyword annotations, using the Bayesian optimal classifier the probability $P(R_1, R_2, \ldots, R_{query}|K)$ can be written as the product of the individual keyword-to-region probabilities:

$$P(R_1, R_2, \ldots, R_{query} | K) = \prod_{i=1}^{query} P(R_i | K) \qquad (8)$$
For each region in the query image, its eight-dimensional feature vector is extracted using the method described in Section 2 (this is done offline, and the first 6 feature dimensions are used for segmentation). For each keyword, the most similar region carrying this keyword annotation in the visual dictionary is chosen to calculate the probability $P(R_i|K)$ in Eq. 8 as follows:

$$P(R_i | K) = e^{-\left(\frac{1}{8}\sum_{j=1}^{8}\left(Feature_j^{R_i} - Feature_j^{R_Q}\right)^{8}\right)^{1/8}} \qquad (9)$$
The same reasoning procedure is applied to $P(I|K, R_1, R_2, \ldots, R_{query})$:

$$P(I | K, R_1, R_2, \ldots, R_{query}) = P(I | K) = \frac{P(R_1, R_2, \ldots, R_n | K)\, P(K)}{P(R_1, R_2, \ldots, R_n)} \qquad (10)$$
$$P(R_1, R_2, \ldots, R_n | K) = \prod_{i=1}^{n} P(R_i | K) \qquad (11)$$
In Eq. 10, once a specific keyword is given in the similarity probability calculation, we assume that the probability of image $I$ being similar to the user's retrieval concept given the query image $I_Q$ is no longer related to the query image (in this case, the query image is simply an instance that inspires the keyword-based Bayesian reasoning). Hence, a conditional independence assumption is made between $I$ and $I_Q$ once a specific $K$ is given. In Eq. 11, the regions in one image are assumed to be conditionally independent, and hence the Bayesian optimal classifier is utilized, just as in Eq. 8. Finally, these calculations (Eq. 5 - Eq. 11) are integrated into Eq. 4 to induce the semantic similarity of an image $I$ given the query image $I_Q$. The top 50 images with the highest posterior probability are returned to the user as the retrieval result.

3.2 Similarity Inference by Bayesian Reasoning for the Query-by-Keyword Interface

In the query-by-keyword scenario, the Bayesian reasoning process can be described as follows:
$$P(I | K) = P(R_1, R_2, \ldots, R_I | K) = \prod_{i=1}^{I} P(R_i | K) \qquad (12)$$
The probability $P(I|K)$ measures the degree to which image $I$ matches the user's query concept given the query keyword $K$, and it is expressed through the image's regions in Eq. 12. As stated in assumption 2 of Section 2, the regions in an image are conditionally independent, which allows the transformation of $P(R_1, R_2, \ldots, R_I|K)$ into the product of $P(R_i|K)$ terms using the Bayesian optimal classifier in Eq. 12. $P(R_i|K)$ is the probability that region $R_i$ reveals the retrieval semantic concept given keyword $K$; it is calculated using the same method as in Eq. 9. Finally, the top 50 images with the highest posterior probability are returned to the user as the retrieval result. It should be noted that the keyword input must be constrained to our annotation keyword set. Our method has three main advantages over traditional region-based retrieval strategies:
1. It utilizes keyword-to-region annotation to reduce the ambiguity of keyword annotation at the global image scope.
2. By annotating different keywords to semantically different regions, it avoids mistaken region matching caused by similar visual features of two semantically diverse regions.
3. By assigning semantically similar regions the same keywords, it enhances their correlation even when their visual features are diverse.
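As an illustration of the query-by-keyword inference of Eqs. (9) and (12), the sketch below scores images against a keyword using the visual dictionary. The per-region distance is a simple stand-in, not the exact expression of Eq. (9), and the dictionary representation (feature, keyword) pairs is our own assumption.

```python
import numpy as np

def region_keyword_prob(region_feat, dictionary, keyword):
    """Approximate P(R_i | K): exponential of the (negative) distance to the most
    similar dictionary region annotated with the keyword (cf. Eq. 9)."""
    candidates = [f for f, k in dictionary if k == keyword]
    if not candidates:                         # keyword not in the dictionary
        return 0.0
    dists = [np.abs(region_feat - f).mean() for f in candidates]   # assumed distance
    return float(np.exp(-min(dists)))

def image_keyword_score(image_region_feats, dictionary, keyword):
    """Eq. (12): P(I | K) as the product of the per-region probabilities."""
    return float(np.prod([region_keyword_prob(r, dictionary, keyword)
                          for r in image_region_feats]))
```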
4 Experimental Results and Discussion To validate our Bayesian reasoning framework, a demo system, "ImageSeek", was developed; it provides both query-by-example and query-by-keyword interfaces for user interaction. A view of "ImageSeek" is presented in Fig. 2.
Fig. 2. “ImageSeek” System Interface
4.1 Visual Dictionary Construction 200 images from 20 categories (10 from each category) are randomly selected to form the candidate training set. Using the segmentation algorithm in Section 2, these images are segmented into distinct regions. Manual labeling is then performed to annotate each region with a unique keyword; these annotations form our visual dictionary. Our keyword set uses 50 keywords to index these 200 images. Table 1 presents a fraction of our annotation keyword set. Table 1. Fraction of Our Keyword Dictionary
1.sky 2.tree 3.water 4.grass 5.eagle 6.bear 7.snow 8.building 9.house 10.beach
11.elephant 12.desert 13.car 14.jet 15.leaf 16.flower 17.mushroom 18.fish 19.coral 20.mountain
4.2 Retrieval Performance Evaluation Our experiments are conducted on a subset of the COREL image database, which contains over 2000 general-purpose images in 20 categories (100 per category). Eight images from each category are randomly selected to form a test set (160 images in total). To verify the efficiency of our keyword-integrated Bayesian reasoning (KIBR) strategy, the traditional single region matching (SRM) strategy is used as the baseline; its performance serves as the reference for our keyword-integrated Bayesian reasoning. In SRM, the distance between two images is calculated as the sum of differences between every region in one image and its most similar region in the other image.
Fig. 3. Precision-Recall Curve Comparison of KIBR and SRM
Fig. 4. KIBR Retrieval Results in Top20-100 of Ten Representative Image Classes in COREL
Fig. 5. SRM Ranking Results of a Query Image (Top-Left), in which the Returned Results are Visually Similar in Color & Texture Features
Fig. 6. KIBR Ranking Results of a Query Image (Top-Left), in which the Returned Results are Visually Diverse in Color & Texture Features but Semantically Similar
Fig. 3 compares the retrieval results of KIBR and SRM. As the number of returned images (n) increases, the retrieval precision decreases while the retrieval recall increases; the two curves meet at n = 100.

\delta_{dir}(e_{x,y}) =
\begin{cases}
1 & \text{if } e_{x,y} > e \text{ and } e_{x,y} \text{ falls into the direction bin,} \\
0 & \text{otherwise}
\end{cases}    (7)

where the threshold e is used to determine whether e_{x,y} is a non-directional edge. In addition to the local histograms of the 4x4 blocks, [6,3] also suggest constructing semi-global and global histograms, as shown in figure 2. The semi-global and global histograms can be computed efficiently by accumulating the local histograms; that is, the edge direction histogram feature vector v_ed consists of 64 local histogram bins ehist^l_i, 52 semi-global histogram bins ehist^s_i and 4 global histogram bins ehist^g_i, where i denotes the i-th histogram bin of the local, semi-global or global blocks.
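As a rough illustration of how the local, semi-global and global bins can be accumulated, consider the following Python sketch. It is not the implementation of [6,3]; the per-block histogram layout and the particular grouping of rows, columns and five neighboring 2x2 clusters into the 13 semi-global sub-images are assumptions based on the description above and on Fig. 2.

```python
import numpy as np

def edge_direction_feature(local_hist):
    """Accumulate a 64-d local / 52-d semi-global / 4-d global EHD feature.

    local_hist: 4x4x4 array -- a 4-bin direction histogram per block (assumed input).
    Semi-global groups follow Fig. 2: 4 rows, 4 columns and 5 neighboring 2x2
    clusters (four corners and the center) -- this grouping is an assumption.
    """
    h = np.asarray(local_hist, dtype=float).reshape(4, 4, 4)
    local = h.reshape(-1)                                    # 16 blocks x 4 bins = 64
    rows = [h[r, :, :].sum(axis=0) for r in range(4)]        # 4 horizontal sub-images
    cols = [h[:, c, :].sum(axis=0) for c in range(4)]        # 4 vertical sub-images
    corners = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]       # 5 neighboring clusters
    neigh = [h[r:r + 2, c:c + 2, :].sum(axis=(0, 1)) for r, c in corners]
    semi_global = np.concatenate(rows + cols + neigh)        # 13 x 4 = 52 bins
    global_hist = h.sum(axis=(0, 1))                         # 4 bins, whole box
    return np.concatenate([local, semi_global, global_hist])  # 64 + 52 + 4 = 120
```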
4 Video Retrieval The system performs object-based video retrieval by example. Users can search for objects that appear in the video clips by providing snapshots or hand-drawn images as queries. Some sample query images are shown in figure 3. The system extracts the set of feature vectors V^q = {v^q_dc, v^q_ed} of the input image as described in section 3.1. V^q is then matched against the feature vector sets V(n) = {v_dc(n), v_ed(n)} of each segmented object n already stored in the database.
Fig. 2. Bounding box of the segmented object is divided into 4x4 blocks for construction of edge direction histograms. (a) 16 local blocks, (b)(c)(d) 4 horizontal, 4 vertical and 5 neighboring subimages of semi-global blocks respectively, (e) the whole bounding box image is used for global edge direction histograms.
Fig. 3. Sample query images (a) snapshots/photos (b) hand-drawn images
4.1 Similarity Measurements To measure whether an object matches the query, the dominant color distance Ddc and edge direction histograms distance Ded are defined:
D_{dc}^2(v_{dc}, v'_{dc}) = \sum_{i=1}^{3} p_i^2 + \sum_{j=1}^{3} p'^2_j - \sum_{i=1}^{3}\sum_{j=1}^{3} 2\, a_{i,j}\, p_i\, p'_j    (8)

D_{ed}(v_{ed}, v'_{ed}) = \sum_{i=1}^{64} |ehist^l_i - ehist'^l_i| + 5 \times \sum_{i=1}^{4} |ehist^g_i - ehist'^g_i| + \sum_{i=1}^{52} |ehist^s_i - ehist'^s_i|    (9)

In (8), a_{i,j} is the similarity coefficient between colors c_i and c_j, such that a_{i,j} = 1 − d_{i,j}/d_max, where d_{i,j} = |c_i − c_j| and d_max is the maximum allowable distance. Here, c_i and p_i are the color and percentage components of the i-th dominant color. An object n in the database is considered a match if D_dc(v_dc(n), v^q_dc) < ε_dc and D_ed(v_ed(n), v^q_ed) < ε_ed, where ε_dc and ε_ed are pre-defined thresholds for the dominant color and edge direction histogram distances, respectively.
The overall distance D_overall is calculated as D_overall = α_dc D_dc + α_ed D_ed, where α_dc and α_ed are pre-defined weights. The relevance R between a retrieved object and the query is then defined as:

R =
\begin{cases}
\left(1 - \frac{D_{overall}}{D_{max}}\right) \times 100\% & \text{if } D_{overall} \le D_{max}, \\
0\% & \text{otherwise}
\end{cases}    (10)

where D_max is the maximum allowable distance.
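A minimal sketch of the matching rules in Eqs. 8-10 is given below, assuming each dominant color descriptor is stored as arrays of scalar color values and percentages, and each edge descriptor as a 120-bin vector ordered local/semi-global/global as in the earlier sketch; the weights, thresholds and storage order are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def dominant_color_dist(c1, p1, c2, p2, d_max):
    """Eq. 8: quadratic dominant-color distance using similarity coefficients a_ij
    (colors assumed scalar here for simplicity)."""
    c1, p1, c2, p2 = map(np.asarray, (c1, p1, c2, p2))
    d = np.abs(np.subtract.outer(c1, c2))        # pairwise color differences d_ij
    a = np.clip(1.0 - d / d_max, 0.0, None)      # a_ij = 1 - d_ij / d_max
    return float(np.sum(p1 ** 2) + np.sum(p2 ** 2) - 2.0 * p1 @ a @ p2)

def edge_hist_dist(e1, e2):
    """Eq. 9: L1 distances of local (64), semi-global (52) and global (4) bins,
    with the global part weighted by 5 (assumed bin ordering: local/semi/global)."""
    d = np.abs(np.asarray(e1) - np.asarray(e2))
    return float(d[:64].sum() + d[64:116].sum() + 5.0 * d[116:120].sum())

def relevance(d_dc, d_ed, w_dc=0.5, w_ed=0.5, d_max=1.0):
    """Eq. 10: fuse both distances and map to a relevance percentage."""
    d_overall = w_dc * d_dc + w_ed * d_ed
    return (1.0 - d_overall / d_max) * 100.0 if d_overall <= d_max else 0.0
```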
5 Experiments and Results

The proposed object-based indexing and retrieval system is evaluated on 19 surveillance videos with D1 resolution (720x576 pixels) captured by static cameras. Without any optimization, the indexing process already runs mostly in real time at about 24-25 frames per second on a PC with an Intel Core(TM)2 CPU 6300 @ 1.86 GHz and 1 GB of RAM. The process can be even faster when the videos are down-sampled to CIF resolution (352x288 pixels) without significant degradation of the retrieval results. In total, 1421 objects are segmented from the videos, and all their associated features are indexed into the database. The objects are mainly human beings, vehicles, and very few background objects such as swaying trees caused by background movement and camera vibrations. Figure 4 shows the user interface of the retrieval system. The left column shows snapshots of the retrieved objects. Users can select any retrieved object and play back the video clip that contains the object. The corresponding object trajectory and location are also overlaid in the video as a red line and a blue bounding box.
Fig. 4. User interface of the proposed video retrieval system
Fig. 5. Retrieval results by sample snapshot/photo images. The query image is shown at the bottom-left of each sub-figure, and the snapshots of the 12 retrieved objects with the highest relevance R as defined in (10) are shown in the next 3 columns, displayed top-down and then left-right according to R.
Fig. 6. Retrieval results by sample hand-drawn images. The query image is shown at the bottom-left of each sub-figure, and the snapshots of the 12 retrieved objects with the highest relevance R as defined in (10) are shown in the next 3 columns, displayed top-down and then left-right according to R.
Figures 5 and 6 show examples of retrieval results. The query images are shown at the bottom-left of each sub-figure, and the snapshots of the 12 retrieved objects with the highest relevance R as defined in (10) are shown in the next 3 columns, displayed top-down and then left-right according to R. In our experiments, the foreground of a query image usually occupies more than 80% of the whole image area. The retrieval results are even better when the background is masked out. On the other hand, if the background becomes cluttered or its area increases, the results degrade gradually. Three of the query images in Fig. 5 are randomly sampled from the 19 indexed videos; the system is able to retrieve the corresponding objects and also objects similar to them. The other two query images in Fig. 5 are unseen objects taken by a digital camera; the system is also able to retrieve relevant objects, as shown. The results show that the system performs retrieval by snapshot/photo images quite well, especially when the objects have sharp dominant colors. However, if the dominant colors tend to be white/black/grey, the system may not be able to distinguish them due to the limitation of the hue component in differentiating gray levels. As shown in Fig. 5, objects with dominant black/grey colors are also retrieved. Simply using another color space such as RGB or CIE LUV, or adding the saturation/intensity components, may help to handle this kind of retrieval, but it also makes the system very sensitive to lighting changes or camera settings; slight changes in lighting or camera saturation may cause the retrieval to fail. Further research is needed for color matching that is robust yet efficient under different lighting and camera settings. Figure 6 shows the retrieval results using hand-drawn query images drawn with MS Paint. For figures 6(d) and 6(e), only the edge direction features are used in retrieval since there is no color information in the query images, while the other three use both dominant color and edge direction histograms. The results show that the system can retrieve objects using hand-drawn images very well, which demonstrates an interesting ability of the system: users can perform video retrieval by drawing the desired objects even when snapshots/photos of the objects are not available.
6 Conclusion and Future Works The proposed system demonstrates an efficient and interesting object-based video indexing and retrieval approach for retrieving moving objects that are similar to a query image. Unlike class-based video retrieval, the proposed system allows users more freedom in specifying the search query. Knowledge-based information such as "what is the object" can also be obtained by further post-processing and analysis of the indexed metadata. Furthermore, the proposed system demonstrates its potential in the digital surveillance industry. For example, operators may take hours or even days to analyze recorded videos to search for a suspect car or person that passed through certain monitored areas. With the help of the proposed system, operators may search for the suspect by simply providing a snapshot, photo, or drawing of the suspect, and all the video clips that captured the suspect can then be retrieved. The time spent on labor-intensive search operations can therefore be significantly reduced.
The current proposed system targets the extraction and indexing of moving objects in the scene. For static background objects, another object extraction method is needed that can extract those static objects efficiently. One possible solution is to perform general image segmentation on the background image given by the GMM background modeling. In this way, the computationally intensive segmentation process is required only once at the end of each scene, and the processing time for video indexing can therefore be greatly reduced. Besides, there are two major directions for future work. First, the system currently uses MySQL version 5.0.2 to provide the database and indexing operations. This can cause the retrieval process to become very slow as the database size grows, due to the curse of dimensionality of the image feature vectors as stated in [5]. Traditional R-Trees, SR-Trees, and SS-Trees will not work either for high-dimensional image feature vectors. As a result, further research effort is required on the video database structure to speed up the retrieval mechanism. Second, the current system performs video retrieval based on matching of elementary features. These features, as stated earlier, can be further analyzed to give knowledge-based information. This motivates us to continue research on knowledge-based video retrieval, which allows higher-level queries like "Find a girl in red clothes who is skating". In addition, we will also continue to search for more robust and reliable elementary features, especially a more reliable color feature, to further enhance retrieval accuracy.
Acknowledgment This research was jointly sponsored by Verint Systems (Asia Pacific) Limited and the Innovation and Technology Commission of the Government of the Hong Kong Special Administrative Region, under the Grant ITS/072.
References 1. http://video.google.com 2. http://www.youtube.com 3. Iso/iec 15938-3 information technology – multimedia content description interface – part 3, visual (May 2002) 4. Chung, R.H.Y., Chin, F.Y.L., Wong, K.-Y.K., Chow, K.P., Luo, T., Fung, H.S.K.: Efficient block-based motion segmentation method using motion vector consistency. In: Proc. IAPR Conference on Machine Vision Applications, Tsukuba Science City, Japan, May 2005, pp. 550–553 (2005) 5. Fan, S., Zhu, X.Q., Elmagarmid, A.K, Aref, W.G., Wu, L.: Classview: Hierarchical video shot classification, indexing, and accessing. IEEE Transactions on Multimedia 6(1), 70–86 (2004) 6. Manjunath, B.S., Salembier, P., Sikora, T.: Introduction to MPEG-7, Multimedia Content Description Interface. Wiley, Chichester (2002) 7. Oh, J.H., Baskaran, N.: An efficient technique for video shot indexing using low level features. In: SOFSEM 2002. Proc. of WORKSHOP ON MULTIMEDIA SEMANTICS 2002 in conjunction with 29th Annual Conference on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic (November 2002)
8. Smith, M., Khotanzad, A.: An object-based approach for digital video retrieval. In: ITCC 2004. Proceedings of the International Conference on Information Technology: Coding and Computing, Los Alamitos, CA, April 2004, vol. 1, pp. 456–459 (2004) 9. Taskiran, C., Chen, J.-Y., Albiol, A., Torres, L., Bouman, C.A., Delp, E.J.: Vibe: a compressed video database structured for active browsing and search. IEEE Transactions on Multimedia 6(1), 103–118 (2004) 10. Yi, H., Rajan, D., Chia, L.T.: A motion based scene tree for browsing and retrieval of compressed videos. In: 2nd ACM International Workshop on Multimedia Databases, Washington DC, November 2004, pp. 10–18. ACM Press, New York (2004)
Image Retrieval Using Transaction-Based and SVM-Based Learning in Relevance Feedback Sessions Xiaojun Qi and Ran Chang Computer Science Department, Utah State University, Logan, UT 84322-4205 {xqi, ranchang}@cc.usu.edu
Abstract. This paper introduces a composite relevance feedback approach for image retrieval using transaction-based and SVM-based learning. A transaction repository is dynamically constructed by applying these two learning techniques on positive and negative session-term feedback. This repository semantically relates each database image to the query images having been used to date. The query semantic feature vector can then be computed using the current feedback and the semantic values in the repository. The correlation measures the semantic similarity between the query image and each database image. Furthermore, the SVM is applied on the session-term feedback to learn the hyperplane for measuring the visual similarity between the query image and each database image. These two similarity measures are normalized and combined to return the retrieved images. Our extensive experimental results show that the proposed approach offers average retrieval precision as high as 88.59% after three iterations. Comprehensive comparisons with peer systems reveal that our system yields the highest retrieval accuracy after two iterations. Keywords: Content-based image retrieval, transaction repository, support vector machines.
1 Introduction Content-based image retrieval (CBIR) has attracted a broad range of research interests in many computer communities in the past decade [1]. However, finding desired images from multimedia databases still remains a challenging issue. The sensor gap between the real world object and the information represented by computers and the semantic gap between low-level features and high-level semantics are two major unsolved issues. This paper will focus on bridging the semantic gap in CBIR via relevance feedback sessions by combining transaction-based and support vector machine (SVM)-based learning. Present CBIR systems use relevance feedback and machine learning techniques to mitigate the semantic gap issue. Here we briefly review these two techniques and refer the interested readers to three survey papers [1], [2], [3] for a more detailed study. Relevance feedback techniques have recently been extensively studied to narrow the semantic gap. They will first solicit the user's relevance judgments on the retrieved images returned by CBIR systems. They will then refine retrieval results by learning query concept (i.e., the query targets) based on provided relevance
information. In general, relevance feedback techniques can be classified into three categories: query reweighting, query shifting, and query expansion. Both query reweighting and query shifting apply a nearest-neighbor sampling approach to refine query concept using the user’s feedback. Specifically, query reweighting [4], [5], [6] assigns a new weight to each feature of the query, and query shifting [7], [8], [9] moves the query to a new point in the feature space. Query expansion [10], [11] uses a multiple-instance sampling approach to learn from samples around the neighborhood of positive labeled instances. However, most approaches require seeding a query with appropriate positive examples and do not effectively use negative examples. These systems also focus on the short-term learning by refining low-level features using current feedback. They do not utilize any previous feedback, which contains semantic relevance between each query and its retrieved images returned by CBIR systems. Machine learning techniques have also been widely studied for updating classifiers to learn a boundary to separate positive and negative examples labeled by the user. Different learning techniques [12], [13], [14], [15], [16], [17], [18], [19] are introduced for classification using a limited number of positive and negative samples. That is, a classifier is updated after each round of labeling, where each returned image is labeled as positive or negative by the user. However, these learning techniques work well when a sufficient number of positive images are returned at each round, the images with the same labels are clustered, and the query concept is not diverse. Several ensemble techniques such as cascade linear utility model based SVM [20], SVM ensembles [21], and Adaboost [22] have been proposed to further improve classification accuracy by introducing learning redundancy and reducing the bias. However, these ensemble techniques perform well with a large number of training instances and a sufficient amount of training time. To address the limitations of current retrieval systems, we propose a learning approach to retrieving the desired images using an integrated transaction-based and SVM-based learning technique. To this end, we dynamically construct a transaction repository (TR) by recording each session-term feedback, which contains the accumulated collection of all short-term positive and negative feedbacks for each query. This TR remembers the user’s intent and stores a semantic representation of each database image in terms of presence or absence of the semantics of each query image. We then refine the query semantic features using the current feedback and the TR, and measure the semantic similarity between the query and each database image using a correlation measure. In the meantime, the SVM is applied after each round of labeling to update a boundary for separating accumulated short-term positively and negatively labeled images. We finally return retrieval results by combining the normalized similarity scores computed from both transaction-based and SVM-based learning. To our knowledge, two image retrieval systems, the semantic-space-based [23] and the logbased [24], are the only ones using the on-the-fly transactions to learn query concept. However, our system improves these two systems from the following two aspects. First, we integrate both positive and negative examples and SVMs to improve the semantic-space-based approach, where only positive examples are used for learning. 
Second, we refine the query semantic features to improve the log-based system, where a computationally intensive overall relevance score is calculated for measuring the similarity. The remainder of the paper is organized as follows: Section 2 presents
our proposed approach in detail. Section 3 discusses several experimental results. Section 4 concludes the paper and shows the direction for future work.
2 The Proposed Transaction-Based and SVM-Based Learning

The block diagram of our proposed method is shown in Fig. 1. The system first returns the top 20 images based on the low-level similarity between each database image and the query image. The user then indicates relevant and non-relevant images from the returned pool. This is designated as short-term feedback and is performed at each iteration step. The transaction-based semantic learning technique uses session-term feedback and the TR to estimate the query semantic feature (i.e., the query concept in terms of semantic meaning). The SVM-based learning also uses session-term feedback to find the optimal hyperplane for separating positively and negatively labeled images. The system then returns the top 20 images ranked by fusing the normalized scores computed from both techniques. The user labels each returned image for the next iteration until he is satisfied with the results. The following subsections explain each component of the proposed system in detail.
Fig. 1. The block diagram of our proposed system
2.1 Feature Extraction

Image representation is essential to learning efficiency and the number of iteration steps. Two sets of features are extracted to represent the images for transaction-based and SVM-based learning, respectively. The first set contains three different sets of short features, namely, 9-dimensional color features, 18-dimensional edge features, and 9-dimensional texture features. Specifically, we use the color moments (mean,
variance, and skewness) in HSV color space to represent color features since they closely model natural human perception and have been proven effective in many previous research studies. We use the edge direction histogram to represent the edge features. That is, we convert the color image to a grayscale image and apply the Sobel edge detector to the grayscale image to obtain its edge image. We quantize each edge direction into one of 18 bins, where each bin covers 20 degrees. Last, we use the entropy of each of the 9 detail subband images (i.e., all the subbands except the approximation subband) obtained by a 3-level Daubechies-4 wavelet transform to represent the texture features. The second set contains two different sets of long features, namely, 64-dimensional color features and 150-dimensional edge features. Specifically, we use the expanded MPEG-7 edge histogram descriptor (EHD) and the 64-bin (8×2×4) HSV-based scaled color descriptor (SCD) to extract low-level features for each database image. The expanded EHD is constructed by appending the conventional 80-bin EHD with 5-bin global and 65-bin semi-global edges, which represent the sum of the five edge types over all blocks and over 13 different block groupings, respectively. This choice is mainly adapted to the characteristics of the two integrated techniques. That is, the transaction-based relevance feedback technique requires a short feature to quickly find the index in the TR to construct the query semantic features and compute the semantic similarities. The SVM-based machine learning technique requires a long feature to find a more accurate boundary to separate the user's labeled positive and negative images and measure the visual similarities. Furthermore, the first set of features can be treated as a coarse representation of the image, whereas the second set can be treated as a fine representation. The use of different feature extraction techniques ensures that an image is efficiently represented by complementary features and that the shortcomings of different features are compensated.

2.2 Initial Retrieval Using Low-Level Features

We use the first set of features to quickly measure the similarity between the query and each image in the low-level feature database. The inverted Euclidean distance is computed for the color and texture features, and histogram intersection is computed for the edge histogram. This similarity measure ensures that a higher similarity score corresponds to greater similarity or stronger relevance.

2.3 Transaction Repository Construction

The transaction repository (TR) stores semantic relationships between database images and each query image having been used to date. The query images correspond to rows and the database images correspond to columns of the TR. That is, the TR can be treated as an M×N matrix, where M is the number of query images used to date and N is the number of database images. The TR is dynamically constructed as follows:
1. Initialize the transaction repository T as empty.
2. For each query image q:
   2.1 Append a new row to T.
   2.2 Empty the session-term feedback database.
   2.3 Perform initial retrieval (Section 2.2) to return the x (e.g., 20 or 30) images most similar to query image q.
   2.4 Let the user select relevant images, which are most similar to the user's query concept, while regarding the remaining returned images as non-relevant.
   2.5 Add the new relevant and non-relevant images, which have not been retrieved in previous iterations, to the session-term feedback database.
   2.6 Compute the visual similarity score between query image q and each database image using the SVM-based learning technique (Section 2.4).
   2.7 Compute the semantic similarity score between query image q and each database image using the transaction-based semantic learning technique (Section 2.5).
   2.8 Combine the normalized visual and semantic similarity scores to obtain the final similarity measure and return the x highest-ranked images (Section 2.6).
   2.9 Repeat steps 2.4 through 2.9 until the user is satisfied with the retrieval results or the maximum iteration is reached.
3. Update the columns corresponding to the relevant and non-relevant images in the session-term feedback database at the last row as 1's and -1's, respectively.

In our implementation, we ensure that non-relevant images labeled by the user are excluded from the following searches to increase the learning speed. Specifically, the ith column of the TR is the dynamic semantic feature vector of database image x_i.

2.4 SVM-Based Learning and Search

The SVM-based learning and search starts with training on the relevant and non-relevant images labeled in the initial retrieval, using their second set of features. The SVM finds a hyperplane that separates the training data by a maximal margin. That is, given m training data {x_i, y_i}, where x_i ∈ R^n and y_i ∈ {−1, 1}, the SVM solves the following optimization problem:

\min_{\omega, b, \xi} \; \frac{1}{2}\,\omega^{T}\omega + C\sum_{i=1}^{l}\xi_i
\quad \text{subject to} \quad y_i\,(\omega^{T}\phi(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0    (1)
where C is the penalty parameter of the error term and K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j) is the kernel function. The non-linear SVMs with the Gaussian radial basis function (RBF) kernel are used in our system since they yield excellent results compared with linear and polynomial kernels [25]. This RBF kernel is defined as:

K(x_i, x_j) = \exp\left(-\gamma \left\| x_i - x_j \right\|^{2}\right), \quad \gamma > 0    (2)
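For illustration only, the sketch below trains such an RBF-kernel SVM on the session-term feedback and converts signed distances to the hyperplane into visual-similarity scores in [0, 1], as described later in this section. scikit-learn is used as an assumed implementation, and the parameter values shown are placeholders rather than the authors' choices.

```python
import numpy as np
from sklearn.svm import SVC

def train_visual_svm(feats, labels, C=1.0, gamma=0.01):
    """Fit a soft-margin RBF-kernel SVM (Eqs. 1-2) on labeled feedback images.
    feats: n x d array of long features; labels: +1 / -1. C and gamma are
    placeholders; the paper selects them by 3-fold cross-validated grid search."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    clf.fit(feats, labels)
    return clf

def visual_similarity(clf, database_feats):
    """Signed distance of every database image to the hyperplane, min-max
    normalized to [0, 1] so that larger means more visually similar."""
    d = clf.decision_function(database_feats)
    return (d - d.min()) / (d.max() - d.min() + 1e-12)
```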
As a result, two SVM-related parameters, C and γ, need to be predetermined. We combine the 3-fold cross-validation and grid-search algorithms [26] to find the best C and γ by testing exponentially growing sequences of C = 2^{-5}, 2^{-3}, ..., 2^{15} and
γ = 2^{-15}, 2^{-13}, ..., 2^{3} on several sets of pre-labeled training images. The pair that gives
the minimum 3-fold cross-validation error is selected as the optimal parameters and is used in our proposed system. The distance of each database image to the hyperplane will be computed and normalized to the range of [0, 1] to rank the visual similarity between the query and each database image. For the following feedback iterations, the relevant and non-relevant images stored in the session-term feedback database will be used to train the SVM for finding the optimal hyperplane to compute the visual similarity measure.

2.5 Transaction-Based Semantic Learning and Search
The transaction-based semantic learning and search starts with finding the semantic columns corresponding to the relevant and non-relevant images labeled in the initial retrieval. The query's semantic feature vector is initialized as:

q_k(t) = \frac{1}{N_r} \sum_{i=1}^{N_r} s_k^{R,i}    (3)
where q_k(t) is the kth element of the query semantic feature vector, s_k^{R,i} is the kth element of the semantic feature vector of the ith relevant image, N_r is the number of relevant images, and k ranges over the rows of the TR. The s_k^{R,i}'s with a value of -1 are treated as 0's in this computation, since such a value indicates that relevant image i does not possess the semantic meaning of query k. Therefore, this initial semantic feature vector stores an estimated semantic similarity to each previously employed query image, which is recorded in the TR in terms of the semantic similarity with the retrieved images in each relevance feedback process. After estimating the query semantic feature vector using the labeled images in the initial retrieval, we use a correlation measurement to compute the semantic similarity between query q(t) and each database image x_i by:

S_{x_i}^{Sem} = x_i \cdot q(t) = \sum_{k} x_{i,k}\, q_k(t)    (4)
where x_i corresponds to the ith column in the TR. The higher the similarity score, the more semantically relevant the image is to the query. These scores are normalized to the range [0, 1] to rank the semantic similarity between the query and each database image. For the following feedback iterations, relevant images reinforce the semantically relevant features of the query semantic feature vector and non-relevant images suppress the non-relevant features. This process is summarized by:

q_k(t+1) =
\begin{cases}
\alpha\, q_k(t) & (s_k^R = 1 \text{ or } s_k^N = -1),\; q_k(t) \ne 0 \\
1 & (s_k^R = 1 \text{ or } s_k^N = -1),\; q_k(t) = 0 \\
q_k(t) & s_k^R = 0 \text{ or } s_k^N = 0 \\
q_k(t)/\alpha & s_k^R = -1 \text{ or } s_k^N = 1
\end{cases}    (5)
where q_k(t+1) is the kth element of the updated query semantic feature vector, and s_k^R and s_k^N correspond to the kth element of the semantic feature vectors of the relevant and non-relevant images, respectively. The parameter α is the adjustment rate and is empirically set to 1.1.

2.6 Combined Retrieval
Retrieval results after the first relevance feedback process are obtained by combining the normalized SVM-based visual similarity and transaction-based semantic similarity scores. In our system, we give equal weights (i.e., 0.5) to both similarity measures.
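Putting Sections 2.3-2.6 together, a compact sketch of the semantic side and the final fusion might look as follows. The TR layout (rows indexed by past queries, columns by database images, entries in {1, 0, -1}) follows Section 2.3; the vectorized update, which applies Eq. 5 once per feedback round over all labeled images, and the min-max normalization are simplifying assumptions, not the authors' code.

```python
import numpy as np

def init_query_semantics(tr, relevant_idx):
    """Eq. 3: average the TR columns of the relevant images (treating -1 as 0)."""
    return np.clip(tr[:, relevant_idx], 0, None).mean(axis=1)

def update_query_semantics(q, tr, relevant_idx, nonrelevant_idx, alpha=1.1):
    """Eq. 5 (vectorized simplification): reinforce or suppress each element."""
    q = q.copy()
    boost = (tr[:, relevant_idx] == 1).any(axis=1) | (tr[:, nonrelevant_idx] == -1).any(axis=1)
    damp = (tr[:, relevant_idx] == -1).any(axis=1) | (tr[:, nonrelevant_idx] == 1).any(axis=1)
    q[boost & (q != 0)] *= alpha
    q[boost & (q == 0)] = 1.0
    q[damp] /= alpha
    return q

def combined_scores(tr, q, visual_scores):
    """Eq. 4 plus Section 2.6: correlation-based semantic scores, min-max
    normalized and fused with the SVM scores using equal weights (0.5 each)."""
    sem = tr.T @ q                                     # Eq. 4 for every database image
    sem = (sem - sem.min()) / (sem.max() - sem.min() + 1e-12)
    return 0.5 * sem + 0.5 * visual_scores
```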
3 Experimental Results

To date, we have tested our retrieval system on 6000 images from COREL. These images fall into 60 distinct semantic categories with 100 images in each. A retrieved image is considered relevant if it belongs to the same category as the query image. To facilitate the evaluation process, we designed an automatic feedback scheme to dynamically construct the TR and model the integrated transaction-based and SVM-based query session. The retrieval accuracy is computed as the ratio of relevant images to the total returned images. Several experiments have been specifically designed to evaluate the performance using different TRs, which are constructed before computing the retrieval accuracy by varying two factors: the number of query images and the number of returned images per iteration. These two factors control the size and the fillings (i.e., the number of non-zeros) of the initial TR. In our experiments, we build a variety of initial TRs with different sizes using our proposed transaction-based and SVM-based learning techniques. Specifically, initial TRs whose rows cover 0% (empty), 1%, 3%, 6%, 9%, and 12% of the database images are constructed. The fillings using different numbers of returned images (i.e., 20 and 30) per iteration are also used to construct the initial TRs using up to 6 iterations. For example, the largest initial TR has the size of 6000×720, where 12% of the database images in each category are chosen as the query. The initial Euclidean-distance-based retrieval accuracy for this database and its subset (20-category) is 10.89% and 15.38%, respectively. These values may be omitted from some figures to ensure readability of subsequent iterations. This initial accuracy can be improved by using more powerful features such as the second set of features, which achieve average retrieval accuracies of 24.80% and 38.02% for the 60-category database and its subset, respectively. The use of an initial non-empty TR and more returned images per iteration in the feedback steps can also substantially improve the retrieval accuracy.

3.1 Evaluation of Our Proposed Learning Technique and Its Three Variants
To evaluate the transaction-based and SVM-based learning, several experiments have been tested on 2000 images in the 20-category COREL database. Fig. 2 summarizes the average retrieval accuracy on this subset of the COREL database using our proposed approach (i.e., integrating transaction-based learning using the first set of short features and SVM-based learning using the second set of long features) and its three
variants, namely, transaction-based learning, SVM-based learning using the first set of short features, and SVM-based learning using the second set of long features. It clearly shows that both our proposed approach and the transaction-based learning achieve substantially better retrieval accuracy than the two SVM-based learning schemes. For the SVM-based learning, training on the second set of features achieves better retrieval accuracy than training on the first set. Our proposed scheme achieves better performance on the 2nd iteration and achieves comparable retrieval accuracy for the remaining relevance feedback iterations when compared to the transaction-based learning. This improvement is preferred in image retrieval since the user wants to retrieve the desired images in as few feedback steps as possible.
Fig. 2. Comparisons of our proposed approach and its three variants
3.2 Evaluation of Different Initial Transaction Repositories
To evaluate the effect of different initial TRs, several experiments have been tested on the 20-category COREL images with different pre-built TRs. Fig. 3 shows the average retrieval accuracy after building TRs using different numbers of queries. That is, 1%, 3%, 6%, 9%, and 12% of the database images in each of the 20 categories are used as queries to dynamically pre-construct TRs of different sizes. The queries used
Fig. 3. Effect of different pre-built transaction repositories on (a) Transaction-based approach. (b) Our proposed approach.
for constructing these TRs are excluded in later retrievals and therefore are not involved in computing the average retrieval accuracy. Specifically, Fig. 3(a) and Fig. 3(b) compare the average retrieval accuracy of applying the transaction-based learning and our proposed combined learning on several pre-built TRs, respectively. These two figures also show the average retrieval accuracy with an empty TR for easy comparison. It clearly shows that the larger the pre-built TR is, the better the average retrieval accuracy. Fig. 4 further compares the retrieval accuracy for the transaction-based approach and our proposed approach on an empty TR and a 9% pre-built TR. It clearly shows our approach achieves better accuracy for the first two iterations and achieves comparable high accuracy of above 90% for the following iteration steps on both initial TRs.
Fig. 4. Comparison of transaction-based approach and our proposed approach on an empty transaction repository and a 9% pre-built transaction repository
3.3 Comparisons with Peer Retrieval Systems
To evaluate the scalability of the proposed system, we compare our proposed approach with its two variants (i.e., transaction-based learning and SVM-based learning), k-nearest-neighbor-based approach [27], and MARS-1 [4] on 6000 COREL images. Fig. 5 summarizes the average retrieval accuracy of these five systems for each of 6 iterations. Here, our proposed approach and the transaction-based approach
Fig. 5. Comparisons with other CBIR systems
use the empty TR for retrieval. It shows that our proposed approach achieves the highest retrieval accuracy (64.54%) after the 2nd relevance feedback session. The inferior retrieval performance at the initial and first feedback is mainly because we use a significantly shorter feature vector (36-dimensional) for fast retrieval. This can be easily improved by using a more powerful low-level feature vector to represent an image. The retrieval accuracy can also be improved by retrieving images on the pre-built TR and increasing the number of returned images per iteration. Fig. 6 demonstrates the average retrieval accuracy of two improved systems versus our proposed system. Here, our proposed system retrieves the images on the empty TR with 20 returned images per iteration. The first improved system applies our proposed approach on a 6% pre-built TR with 20 returned images per iteration. The second improved system applies our proposed approach on a 6% pre-built TR with 30 returned images per iteration. It clearly shows these two improvements dramatically increase the retrieval accuracy of the first two feedbacks and maintain a high retrieval accuracy of above 90% for the following feedbacks. Specifically, the second improved system increases the average retrieval accuracy of our proposed system from 30.96% to 57.83% for the first feedback and from 64.54% to 75.92% for the second feedback.
Fig. 6. Comparisons of our proposed system with two improved systems
4 Conclusions and Future Work

This paper introduces a novel retrieval system with relevance feedback. The proposed transaction-based and SVM-based learning system provides a flexible learning approach allowing users to benefit from past search results. It also returns a high percentage of relevant images after two feedback iterations. Major contributions are:
1. Construct a dynamic TR to learn the user's query intention;
2. Learn the semantic meaning of each database image using the query images;
3. Estimate the query semantic feature vector using the retrieved images labeled by the user and the TR;
4. Apply the SVM-based learning to find an optimal boundary for locating images which are visually similar to the query image.
5. Combine both transaction-based and SVM-based learning to retrieve images both semantically and visually similar to the query with fewer than 3 feedback iterations. Experimental results show the proposed system outperforms peer systems and achieves remarkably high retrieval accuracy for a large database after the first three iterations. The principal component analysis or the singular value decomposition will be considered to update the TR. An adaptive weight combination will be studied to further improve the retrieval results. Different low-level features will be explored for the SVM-based learning to further improve the retrieval accuracy in later feedback sessions. Different learning approaches will be explored to construct the TR in a more systematic manner. The performance in a multi-class database, where an image may belong to multiple semantic classes, will be evaluated.
References 1. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000) 2. Antani, S., Kasturi, R., Jain, R.: A Survey on the Use of Pattern Recognition Methods for Abstraction, Indexing, and Retrieval of Images and Video. Pattern Recognition. 35(4), 945–965 (2002) 3. Zhou, X.S., Huang, T.S.: Relevance Feedback for Image Retrieval: a Comprehensive Review. Journal of ACM Multimedia Systems 8(6), 536–544 (2003) Special Issues on CBIR 4. Rui, Y., Huang, T.S., Ortega, M., Mahrota, S.: Relevance Feedback: A Powerful Tool for Interactive Content-based Image Retrieval. IEEE Trans. on Circuits and Video Technology 8(5), 644–655 (1998) 5. Aggarwal, G., Ashwin, T.V., Sugata, G.: An Image Retrieval System with Automatic Query Modification. IEEE Trans. on Multimedia. 4(2), 201–214 (2002) 6. Kushki, A., Androutsos, P., Plataniotis, K.N., Venetsanopoulos, A.N.: Query Feedback for Interactive Image Retrieval. IEEE Trans. on Circuits and Systems for Video Technology 14(5), 644–655 (2004) 7. Rui, Y., Huang, T.S., Mahrota, S.: Content-based Image Retrieval with Relevance Feedback in MARS. In: Proc. of Int. Conf. on Image Processing, pp. 815–818 (1997) 8. Giacinto, G., Roli, F., Fumera, G.: Comparison and Combination of Adaptive Query Shifting and Feature Relevance Learning for Content-based Image Retrieval. In: Proc. of 11th Int. Conf. on Image Analysis and Processing, pp. 422–427 (2001) 9. Muneesawang, P., Guan, L.: An Interactive Approach for CBIR Using a Network of Radial Basis Functions. IEEE Trans. on Multimedia 6(5), 703–716 (2004) 10. Porkaew, K., Mehrota, S., Ortega, M.: Query Reformulation for Content-Based Multimedia Retrieval in MARS. In: Proc. of Int. Conf. on Multimedia Computing and Systems, vol. 2, pp. 747–751 (1999) 11. Widyantoro, D.H., Yen, J.: Relevant Data Expansion for Learning Concept Drift from Sparsely Labeled Data. IEEE Trans. on Knowledge and Data Engineering 17(3), 401–412 (2005) 12. Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval. In: Proc. of ACM Int. Conf. on Multimedia, pp. 107–118. ACM Press, New York (2001)
13. Huang, T.S., Zhou, X.S., Nakazato, M., Yu, Y., Cohen, I.: Learning in Content-Based Image Retrieval. In: Proc. of the 2nd Int. Conf. on Development and Learning, pp. 155–162 (2002) 14. Brinker, K.: Incorporating Diversity in Active Learning with Support Vector Machines. In: Proc. of the 20th Int. Conf. on Machine Learning, pp. 59–66 (2003) 15. Tao, D., Tang, X.: Nonparametric Discriminant Analysis in Relevance Feedback for Content-Based Image Retrieval. In: Proc. of IEEE Int. Conf. on Pattern Recognition, vol. 2, pp. 1013–1016 (2004) 16. Hoi, S.C.H., Lyu, M.R.: Biased Support Vector Machine for Relevance Feedback in Image Retrieval. In: Proc. of Int. Conf. on Neural Network, pp. 3189–3194 (2004) 17. Dong, A., Bhanu, B.: Active Concept Learning in Image Databases. IEEE Trans. on Systems, Man,and Cybernetics-Part B: Cybernetics 35(3), 450–466 (2005) 18. Hsu, C.T., Li, C.Y.: Relevance Feedback Using Generalized Bayesian Framework with Region-Based Optimization Learning. IEEE Trans. on Image Processing 14(10), 1617–1631 (2005) 19. Zhou, X.S., Garg, A., Huang, T.S.: Nonlinear Variants of Biased Discriminants for Interactive Image Retrieval. In: IEE Proc. of Vision, Image and Signal Processing, vol. 152(6), pp. 927–936 (2005) 20. Wu, H., Lu, H.Q., Ma, S.D.: A Practical SVM-Based Algorithm for Ordinal Regression in Image Retrieval. In: Proc. of the 11th ACM Int. Conf. on Multimedia, pp. 612–621. ACM Press, New York (2003) 21. Hoi, S.C.H., Lyu, M.R.: Group-Based Relevance Feedback with Support Vector Machine Ensembles. In: Proc. of the 17th Int. Conf. on Pattern Recognition, pp. 874–877 (2004) 22. Tieu, K., Viola, P.: Boosting Image Retrieval. In: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 228–235 (2000) 23. He, X., King, O., Ma, W.Y., Li, M.L., Zhang, H.J.: Learning a Semantic Space From User’s Relevance Feedback for Image Retrieval. IEEE Trans. on Circuits and Systems for Video Technology 13(1), 39–48 (2003) 24. Hoi, S.C.H., Lyu, R., Jin, R.: A Unified Log-Based Relevance Feedback Scheme for Image Retrieval. IEEE Trans. on Knowledge and Data Engineering 18(4), 509–524 (2006) 25. Scholkopf, B., Sung, S., Burges, C., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V.: Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers. In: A.I. Memo 1599, MIT Press, Cambridge (1996) 26. Hsu, C., Chang, C., Lin, C.: A Practical Guide to Support Vector Classification (2003), http://www.csie.ntu.edu.tw/c̃jlin/papers/guide/guide.pdf 27. Giacinto, G., Roli, F.: Instance-Based Relevance Feedback for Image Retrieval. In: Proc. of Advances in Neural Information Processing Systems, vol. 17, pp. 489–496 (2005)
Shape-Based Image Retrieval Using Pair-Wise Candidate Co-ranking A. El-ghazal1, O. Basir1, and S. Belkasim2 1
Electrical and Computer Engineering, University of Waterloo, ON., Canada 2 Computer Science, Georgia State University, Atlanta, GA., USA
Abstract. Shape-based image retrieval is one of the most challenging aspects in Content-Based Image Retrieval (CBIR). A variety of techniques are reported in the literature that aim to retrieve objects based on their shapes; each of these techniques has its advantages and disadvantages. In this paper, we propose a novel scheme that exploits complementary benefits of several shape-based image retrieval techniques and integrates their assessments based on a pairwise co-ranking process. The proposed scheme can handle any number of CBIR techniques; however, three common techniques are used in this study: Invariant Zernike Moments (IZM), Multi-Triangular Area Representation (MTAR), and Fourier Descriptor (FD). The performance of the proposed scheme is compared with that of each of the selected techniques. As will be demonstrated in this paper, the proposed co-ranking scheme exhibits superior performance. Keywords: Co-ranking, Image retrieval, Shape.
1 Introduction Recent years have witnessed increased interest in digital imaging. The ease and convenience of capturing, transmitting, and storing digital images in personal and public image databases are considered contributing factors behind this increased interest. The internet is already rich in image depositories that cover a wide range of applications, including biomedical, multimedia, geological, and astronomical ones. Although the storage format of image data is relatively standardized, the retrieval process of images from an image depository tends to be quite complex, and thus it has become a limiting factor that needs to be addressed. Typically, images in a database are retrieved based on either textual information or content information. Early retrieval techniques were based on textual annotation of images. Images were first annotated with text and then searched based on their textual tags. However, text-based techniques have many limitations, including their reliance on manual annotation, which can be a difficult and tedious process for large image sets. Furthermore, the rich content typically found in images and the subjectivity of the human perception make the task of describing images using words a difficult if not an impossible task. To overcome these difficulties, Content Based Image Retrieval (CBIR) was proposed [1]. This approach relies on the visual content of images, rather than textual annotations, to search for images; and hence, has the potential to be more effective in
responding to more specific user queries. CBIR techniques use visual contents such as color, texture, and shape to represent and index the image. Visual contents such as color and texture have been explored more thoroughly than shape. The increasing interest in using the shape features of objects for CBIR is not surprising, given the considerable evidence that natural objects are recognized primarily by their shape [2, 3]. A survey of users on the cognitive aspects of image retrieval confirmed that users prefer retrieving images based on shape over color and texture [4]. However, retrieval based on shape content remains more difficult than image retrieval based on other visual features [2]. During the last decade, significant progress has been made in both the theoretical and practical research aspects of shape-based image retrieval [5, 6]. Shape representation approaches can be divided into two main categories, namely region-based approaches and boundary-based approaches (also known as contour-based approaches). Region-based approaches often use moment descriptors to describe shapes. These descriptors include geometrical moments [7], Zernike moments [8, 9], pseudo-Zernike moments [10], and Legendre moments [8]. Although region-based approaches are global in nature and can be applied to generic shapes, they fail to distinguish between similar objects [11]. In many applications, the internal content of the shape is not as important as its boundary. Boundary-based techniques tend to be more efficient for handling shapes that are describable by their object contours [11]. Many boundary-based techniques have been proposed in the literature, including Fourier descriptors [12], Curvature Scale Space (CSS) [13], wavelet descriptors [14], and contour displacement [15]. Though the number of available shape-based image retrieval techniques has been increasing rapidly, such techniques still exhibit shortcomings. Techniques that have demonstrated reasonable robustness often tend to be computationally complex [15]. In this paper, we propose to integrate several of the existing techniques in a pair-wise co-ranking scheme so as to obtain superior shape-based image retrieval performance. The remainder of the paper is organized as follows: Section 2 introduces the problem formulation, while Section 3 discusses the proposed pair-wise co-ranking scheme. The process of formulating the final ranking decision of the group is presented in Section 4. Section 5 describes experiments to evaluate the performance of the proposed shape retrieval scheme. Conclusions from this study and suggestions for future work are presented in Section 6.
2 Problem Formulation

We consider a group of image retrieval techniques with complementary image representation capabilities. These techniques are viewed as agents cooperating to determine the best candidate image x from an ensemble of images Θ that matches a query image. This group of image retrieval agents is indexed by the set IRA = {IRA_1, IRA_2, ..., IRA_M}. Each agent IRA_i uses a feature extraction scheme F_i and a matching strategy Γ_i to determine a similarity measure S_i between query image x and all images y ∈ Θ; that is,
S_i(x, y) = Γ_i(F_i(y), F_i(x)), \quad ∀ y ∈ Θ    (1)
Each agent IRA_i establishes a ranking R_i(y | F_i(y)) ∈ {1, 2, ..., N}, ∀ y ∈ Θ, such that R_i(y | F_i(y)) ≤ R_j(z | F_j(z)) implies S_i(x, y) ≥ S_j(x, z). Without loss of generality, R_i(y | F_i(y)) can be viewed as an index set from 1 to N, where index 1, for example, points to the closest candidate image to the query image x, and index N points to the candidate image most dissimilar to the query image x. In general, index l in this set points to the image that is preceded by l−1 candidates; these candidates are viewed by IRA_i as better candidates for the query image x than the candidate given rank l. Since each agent uses a different feature extraction scheme, it is expected that the agents reach different ranking decisions for the different images in the set. Each of the CBIR techniques must demonstrate reasonable performance before it is selected as a member of the group. It is thus reasonable to expect that good image candidates will cluster at the top of the rankings of all agents.
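For concreteness, the marginal ranking of one agent can be derived from its similarity scores as sketched below; this is only an illustration of the definition above, with an assumed array-based interface.

```python
import numpy as np

def marginal_ranking(similarities):
    """Rank candidates 1..N so that rank 1 is the most similar image.
    similarities: array of S_i(x, y) for every candidate y in the ensemble."""
    order = np.argsort(-np.asarray(similarities))   # indices by descending similarity
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)     # candidate index -> rank
    return ranks
```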
3 Pair-Wise Co-ranking Scheme
In the following discussion, we propose an information exchange scheme between the image retrieval agents so as to assist each in determining the best image as a candidate matching the user query. This scheme allows for exploiting the relative advantages and disadvantages of each agent.
Fig. 1. Information exchange between two image retrieval agents regarding the ranking
Pair-wise co-ranking revolves around the hypothesis that each agent IRA_j may readjust its ranking of a candidate image if it is exposed to the ranking of another agent IRA_i. The communication process between two agents is depicted in Fig. 1. The initial ranking of an agent is referred to as the marginal ranking; it reflects the agent's assessment of the closeness of each image to the query image. On the other hand, the revised ranking of an agent is referred to as the conditional ranking; it reflects how the assessment of an agent is influenced by other agents. To set up the process of information exchange among the agents, the ranking set of each agent IRA_i is divided into two partitions: an Elite Candidates Partition (ECP_i) and a Potential Candidates Partition (PCP_i). It is expected that good matches to the query will cluster in the elite partitions. ECP contains the first m candidates; PCP contains the last N−m candidates. Thus, the marginal ranking of agent IRA_i can be viewed as a concatenation of two ranking sets; that is, R_i = {R_i^ECP, R_i^PCP}, where R_i^ECP is the ranking for R_i ≤ m, and R_i^PCP is the ranking for R_i > m.
Fig. 2. Illustration of the pair-wise co-ranking scheme
Analogous to the marginal ranking, we now introduce the conditional ranking and then illustrate it with an example. Fig. 2 depicts the process of forming conditional elites, where agent IRA_j revises its conditional elite partition ranking R_ji^ECP based on the marginal ranking information provided by agent IRA_i. Here, image z is seen as an elite image by agent IRA_i. Agent IRA_j uses its own feature extraction scheme F_j to determine the rank of image z in its conditional elite candidates partition; that is,

R_{ji}^{ECP}(z \mid F_j(z), R_i(z \mid F_i(z)) \le m), \quad z ∈ ECP_i    (2)
This formula can be read as follows: agent IRAj re-ranks image z based on its feature extraction scheme Fj(z), given that image z has been ranked as an elite candidate by agent IRAi based on the feature extraction scheme Fi(z). The fact that image z is placed in the conditional elite partition of agent IRAj does not necessarily imply that image z is in the marginal elite partition of IRAj. Similarly, the conditional ranking of the potential candidates partition is computed as

Rji^PCP(z | Fj(z), Ri(z | Fi(z)) > m),  z ∈ PCPi    (3)
The conditional ranking of agent IRAj, based on the information received from agent IRAi, is viewed as a concatenation of two ranking sets; that is,

Rji = {Rji^ECP, Rji^PCP}    (4)

The results of the above process are summarized in Table 1.
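To make the mechanics of Eqs. (2)-(4) concrete, the following is a minimal sketch (not the authors' implementation) of how one agent could re-rank another agent's partitions; the function name, the representation of a ranking as an ordered list of image ids, and the sim_j similarity map are assumptions made purely for illustration.

```python
# Illustrative sketch (not the authors' code): pair-wise conditional re-ranking.
# Agent i supplies a marginal ranking (a list of image ids, best match first);
# agent j re-orders each partition using its own similarity scores sim_j.

def conditional_ranking(marginal_i, sim_j, m):
    """Return R_ji: agent j's re-ranking of agent i's elite and potential partitions.

    marginal_i : list of image ids ordered by agent i (rank 1 first).
    sim_j      : dict mapping image id -> similarity to the query under agent j's features.
    m          : size of the elite candidates partition (ECP).
    """
    ecp_i, pcp_i = marginal_i[:m], marginal_i[m:]
    # Agent j re-ranks each partition by its own similarity (higher is better),
    # but candidates never cross the ECP/PCP boundary fixed by agent i.
    ecp_ji = sorted(ecp_i, key=lambda z: sim_j[z], reverse=True)
    pcp_ji = sorted(pcp_i, key=lambda z: sim_j[z], reverse=True)
    return ecp_ji + pcp_ji          # concatenation R_ji = {R_ji^ECP, R_ji^PCP}
```

Under this sketch, Rij and Rji are obtained by calling the function twice with the roles of the two agents swapped.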
Table 1. Marginal and Pair-wise Conditional Rankings
Marginal ranking of IRAi: Ri = {Ri^ECP, Ri^PCP}, with Ri^ECP(z|Fi(z)) for Ri ≤ m and Ri^PCP(z|Fi(z)) for Ri > m.
Marginal ranking of IRAj: Rj = {Rj^ECP, Rj^PCP}, with Rj^ECP(z|Fj(z)) for Rj ≤ m and Rj^PCP(z|Fj(z)) for Rj > m.
Conditional ranking of IRAi given IRAj: Rij = {Rij^ECP, Rij^PCP}, with Rij^ECP(y|Fi(y), Rj(y|Fj(y)) ≤ m) for Rij ≤ m and Rij^PCP(y|Fi(y), Rj(y|Fj(y)) > m) for Rij > m.
Conditional ranking of IRAj given IRAi: Rji = {Rji^ECP, Rji^PCP}, with Rji^ECP(y|Fj(y), Ri(y|Fi(y)) ≤ m) for Rji ≤ m and Rji^PCP(y|Fj(y), Ri(y|Fi(y)) > m) for Rji > m.
Example
The database, described in Section 5, consists of 70 classes, each class containing 20 objects. The goal is to have the relevant images ranked in the top twenty positions. Figs. 3 and 4 display the results produced by agents IRA1 and IRA2, respectively.
Fig. 3. Retrieved results of IRA1 (R1)
Fig. 4. Retrieved results of IRA2 (R2)
Fig. 5. Pair-wise conditional ranking of IRA1 based on IRA2 (R21)
In both figures the top left shape is the query shape (Frog); the retrieved shapes are arranged in descending order according to each shape’s similarity to the query shape.
Fig. 3 indicates that the first agent IRA1 has managed to correctly rank 9 of 20 shapes (45%). Furthermore, most of the irrelevant shapes, indicated by the dashed frame, are objects that belong to the same class (Bell). In Fig. 4, it is evident that the irrelevant shapes of agent IRA2, indicated by the dashed frame, differ from those of agent IRA1. Fig. 5 exhibits the conditional ranking of agent IRA1 based on the ranking information it receives from the second agent IRA2. Referring to Fig. 5, the irrelevant shapes of the first agent, indicated by the dashed frame in Fig. 3, do not appear at top-rank positions. The positions of the relevant shapes, encircled by the ellipses in Fig. 3, produced by the first agent IRA1, are repositioned to the top 20 positions in R21, resulting in an accuracy of 75% (15 out of 20). It is clear that the pair-wise conditional ranking R21 is much better than the marginal rankings R1 and R2.
4 Reaching a Decision

The main motivation behind this work is to identify image retrieval techniques that capture different characteristics of an image and to combine them so as to achieve enhanced retrieval performance. These techniques are considered as a team of agents that cooperate to determine the best image in the database matching the query image. This collaboration is accomplished by exchanging candidate ranking information. Each agent uses its feature extraction scheme to compute a ranking Rii (double indexing is used to simplify the mathematical notation) that reflects the technique's preference for the best candidate to match the query image (i.e., the marginal ranking). Furthermore, each agent is exposed to the ranking information of the other agents so as to compute a conditional ranking for each candidate image, as explained above. Therefore, M retrieval agents yield M² rankings: M marginal rankings plus M(M-1) conditional rankings. Fig. 6 depicts the ranking sets of a three-agent system.
Fig. 6. Information exchange on the ranking of three image retrieval agents
For M agents, the ranking sets can be organized in a matrix format:

R = [ R11  R12  ...  R1M
      R21  R22  ...  R2M
      ...               
      RM1  RM2  ...  RMM ]
Here Rij is the conditional ranking of the ith agent given the ranking of the jth agent, and Rii is the marginal ranking of the ith agent. Fig. 7 portrays the rank information exchange that yields the group's rank decision on all candidates.
Fig. 7. Steps of applying the pair-wise co-ranking
The group rank of image z is defined as

Score(z) = Σ_{i=1}^{M} Σ_{j=1, j≠i}^{M} Rij(z | Fi(z), Rj(z | Fj(z)))    (5)

The ranking scheme of all candidates is defined as

RG = sort(Score(z)),  ∀z ∈ Θ    (6)
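As an illustration of Eqs. (5) and (6), the sketch below pools the conditional rankings of M agents into a single group ordering; it is a hypothetical implementation, and the nested-dictionary layout of `rankings` is an assumption rather than the authors' data structure.

```python
# Illustrative sketch (not the authors' code): pooling pair-wise conditional rankings.
# rankings[i][j] is assumed to hold R_ij as a dict: image id -> rank position
# (1 = best); the diagonal entries R_ii (marginal rankings) are skipped, as in Eq. (5).

def group_ranking(rankings, candidates):
    scores = {}
    for z in candidates:
        scores[z] = sum(rankings[i][j][z]
                        for i in range(len(rankings))
                        for j in range(len(rankings))
                        if i != j)
    # A smaller accumulated rank means the group considers z a better match.
    return sorted(candidates, key=lambda z: scores[z])
```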
Example
Fig. 8 displays the results produced by agent IRA3 for the same query, the frog. As seen in this figure, agent IRA3 has managed to rank 12 of the 20 shapes correctly (60%). Furthermore, one can notice that the images indicated by the dashed frame in the top 20 positions are identified by the other two agents, IRA1 and IRA2, as being irrelevant. Thus, the conditional ranking of IRA1, based on the ranking information it receives from IRA3, enables this agent to revise its assessment of the relevance of the candidates. The relevant images produced by IRA1, encircled by ellipses, have moved to the top rank positions in R31, resulting in an accuracy of 75% (15 out of 20), as illustrated in Fig. 9. After each agent revises its results based on the ranking information it receives from the other two agents, a group ranking RG of all pair-wise conditional rankings is obtained using the formulation given by Equation (6). Fig. 10 exhibits the final ranking of the proposed pair-wise co-ranking scheme. This figure makes it clear that the exchange of ranking information enables the three agents to converge to better ranking decisions (an accuracy of 90%) than those of any individual agent, with IRA1, IRA2 and IRA3 achieving accuracies of 45%, 70%, and 60%, respectively.
Fig. 8. Retrieved results of IRA3 (R33)
Fig. 9. Pair-wise conditional ranking of IRA1 based on IRA3 (R31)
Fig. 10. The final ranking of the pair-wise co-ranking scheme RG
5 Experimental Results

We recognize the lack of a standard database for evaluating the different shape-based image retrieval techniques. Researchers in this field tend to develop their own databases, which are often limited in size and/or application scope. The MPEG-7 developers have built a database of reasonable size and generality [11]. This database consists of three main sets: set A, set B, and set C. Set A consists of two subsets, A1 and A2, each containing 420 images; A1 is used to test scale invariance and A2 is used to test rotation invariance. Set B consists of 1400 images classified into 70 classes, each class having 20 images. Set B is used to test similarity-based retrieval performance and to test the shape descriptors for robustness to various arbitrary shape distortions, including rotation, scaling, arbitrary skew, stretching, defection, and indentation. For these reasons, set B is selected to evaluate the performance of the proposed pair-wise co-ranking scheme. Set C consists of 200 affine-transformed Bream fish and 1100 unclassified marine fish; the 200 Bream fish are designated as queries. This set is used to test the shape descriptors for robustness to non-rigid object distortions. Samples of the images from this database are depicted in Fig. 11.

Fig. 11. Samples from set B of the MPEG-7 database

To evaluate the image retrieval performance of the different techniques, a performance measure is required. Precision and recall are the most commonly used measures: precision measures the retrieval accuracy, whereas recall measures the capability to retrieve relevant items from the database [16]. To evaluate the performance of the proposed pair-wise co-ranking scheme, experiments are conducted using set B of the MPEG-7 database. All the contours of the images are resampled so that each contour consists of 128 points. To implement and test the newly developed scheme, three techniques are selected. The first technique is the Invariant Zernike Moment (IZM) method [9], a region-based technique that provides the global characteristics of an image. Zernike moments are effective both for image representation and as a global shape descriptor [17]; this technique is adopted by MPEG-7 as its standard region-based descriptor. The second technique is the Multi-Triangular Area Representation (MTAR) [18], a boundary-based technique used to provide the local characteristics of an image for the proposed co-ranking scheme; it is designed to capture local information using triangular areas. El-Rube et al. [18] have demonstrated that the MTAR technique outperforms, in terms of accuracy, the curvature scale space (CSS) technique adopted by MPEG-7 as its standard boundary-based descriptor. Moreover, the number of iterations of the MTAR image is limited to n/2+1, where n is the number of boundary points of an image, whereas the number of iterations of the CSS image is not limited to a certain number; this renders the MTAR representation faster to compute than that of the CSS. The third technique is the Fourier Descriptor (FD) method based on the Centroid Distance (CD) signature. According to Zhang [16], the CD signature outperforms the Complex Coordinate (CC), Triangular Centroid Area (TCA) and Chord-Length Distance (CLD) signatures. The FD is selected because it is a boundary-based technique that can capture both local and global characteristics: the low frequencies provide the global characteristics and the high frequencies capture the local characteristics of an image [16]. To fuse the three techniques, the first 50 (m=50) retrieved results of each technique are considered the elite candidates partition in the revision stage. Two experiments are conducted to evaluate the performance of the new co-ranking scheme. In the first experiment, the group-ranking scheme is compared with each individual technique. The precision-recall curves of the proposed approach and of each of the three techniques are shown in Fig. 12. The precision value at a certain recall is the average of the precision values of all the database shapes at that recall.
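For reference, a precision-recall point for a single query can be computed as in the following sketch; this is a standard formulation rather than code from the paper, and the variable names are illustrative.

```python
# Illustrative sketch: precision and recall at a given cut-off for one query.
# `retrieved` is the ranked list returned by a scheme; `relevant` is the set of
# images from the query's class (20 per class in set B of the MPEG-7 database).

def precision_recall(retrieved, relevant, cutoff):
    top = retrieved[:cutoff]
    hits = sum(1 for img in top if img in relevant)
    precision = hits / float(cutoff)
    recall = hits / float(len(relevant))
    return precision, recall
```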
Fig. 12. Precision-recall curves of the proposed pair-wise co-ranking scheme, IZM, FD, and MTAR techniques
The precision-recall curves in Fig. 12 show that the proposed approach outperforms each of the three techniques: IZM, FD, and MTAR. In the second experiment, we compare the relative performance of the proposed co-ranking scheme under two conditions: 1) group ranking based on pooling the pair-wise conditional rankings of the individual techniques; 2) group ranking based on pooling the marginal rankings of the individual techniques. As can be seen in Fig. 13, there are six pair-wise conditional rankings (Rij, i ≠ j, i,j = 1,2,3) and three marginal rankings (Rij, i = j, i = 1,2,3). Fig. 14 depicts the precision-recall curves for pooling the pair-wise conditional rankings and for pooling the marginal rankings. It is evident from this figure that the performance of the pair-wise co-ranking scheme is superior to that of pooling only the marginal rankings. This demonstrates the value of communication among the three techniques in selecting the best match to the query image.
Fig. 13. Pooling the pair-wise conditional rankings and pooling the marginal rankings of the individual techniques
Table 2 summarizes the average retrieval accuracy over the closest shapes retrieved from the 1400-image database for IZM, FD, MTAR, the pooled marginal rankings, and the pair-wise conditional co-ranking.
Fig. 14. Precision-recall curves of the pair-wise co-ranking and the pooled marginal ranking

Table 2. Retrieval performance of the IZM, FD, MTAR, pooled marginal ranking, and the pooled pair-wise conditional ranking

Method      IZM     FD      MTAR    Pooled marginal ranking   Pair-wise conditional co-ranking
Accuracy %  59.16   54.61   60.01   67.32                     71.35
It is clear from Table 2 that the pair-wise conditional co-ranking scheme (with an accuracy of 71.35%) outperformed the pooling of the marginal rankings (67.32%) as well as IZM (59.16%), FD (54.61%), and MTAR (60.01%).
6 Conclusions

This paper presents a new pair-wise co-ranking scheme that aims to improve retrieval performance by combining the ranking decisions of a group of shape-based image retrieval techniques. Experimental results are presented to investigate the performance of the proposed scheme using set B of the widely used MPEG-7 database. The experimental results demonstrate that the proposed pair-wise co-ranking scheme yields better results than each of the three selected techniques. Furthermore, the performance of the proposed co-ranking scheme is superior to that of pooling the three individual rankings. This improvement is accomplished by allowing the individual retrieval techniques to exchange their ranking information in a pair-wise manner. The overall reported experimental results demonstrate the effectiveness of the proposed co-ranking scheme for image retrieval applications. In future work, the complexity of the proposed co-ranking scheme should be investigated and compared with other techniques.
References 1. Hirata, K., Kato, T.: Query by visual example - content-based image retrieval. In: Pirotte, A., Delobel, C., Gottlob, G. (eds.) EDBT 1992. LNCS, vol. 580, pp. 56–71. Springer, Heidelberg (1992) 2. Mumford, D.: Mathematical theories of shape: Do they model perception? In: Proceedings of SPIE Conference on Geometric Methods in Computer Vision, San Diego, California, vol. 1570, pp. 2–10 (1991) 3. Biederman, I.: Recognition-by-components: a theory of human image understanding. Psychological Review 94, 115–147 (1987) 4. Schomaker, L., Leau, E.D., Vuurpijl, L.: Using Pen-Based Outlines for Object-Based Annotation and Image-Based Queries. Visual Information and Information Systems, 585– 592 (1999) 5. Wang, Z., Chi, Z., Feng, D.: Shape based leaf image retrieval. In: IEE Proceedings, Vision Image and Signal Processing, vol. 150, pp. 34–43 (2003) 6. Celebi, M.E., Aslandogan, Y.A.: A Comparative Study of Three Moment-Based Shape Descriptors. In: ITCC’05. International Conference on Information Technology: Coding and Computing, pp. 788–793 (2005) 7. Hu, M.: Visual Pattern Recognition by Moment Invariants. IRE Transactions on Information Theory 8, 179–187 (1962) 8. Teague, M.: Image analysis via the general theory of moments. Journal of the Optical Society of America 70, 920–930 (1980) 9. Khotanzad, A.: Invariant Image Recognition by Zernike Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 489–497 (1990) 10. Belkasim, S.O., Shridhar, M., Ahmadi, M.: Pattern Recognition with Moment Invariants: A Comparative Study and New Results. Pattern Recognition 24, 1117–1138 (1991) 11. Mokhtarian, F., Bober, M.: Curvature scale space representation: Theory application and MPEG-7 standardization. Kluwer Academic Publishers, Boston (2003) 12. Wallace, T.P., Wintz, P A: An efficient three dimensional aircraft recognition algorithm using normalized Fourier descriptors. Computer Graphics and Image Processing 13, 99– 126 (1991) 13. Abbasi, S., Mokhtarian, F., Kittler, J.: Curvature Scale Space Image in Shape Similarity Retrieval. Multimedia Systems 7, 467–476 (1999) 14. Chauang, G., Kuo, C.: Wavelet descriptor of planar curves: Theory and applications. IEEE Transaction on Image Processing 5, 56–70 (1996) 15. Adamek, T., O’Connor, N.E.: A Multiscale Representation Method for Nonrigid Shapes with a Single Closed Contour. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 742–753 (1996) 16. Zhang, D.S., Lu, G.: Study and Evaluation of Different Fourier Methods for Image Retrieval. International Journal of Computer Vision 23, 33–49 (1996) 17. Kim, W., Kim, Y.: A region-based shape descriptor using Zernike moments. Signal Processing: Image Communication 16, 95–102 (1996) 18. El Rube, I., Alajlan, N., Kamel, M.S., Ahmed, M., Freeman, G.: Efficient Multiscale Shape-Based Representation and Retrieval. In: Kamel, M., Campilho, A. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 415–422. Springer, Heidelberg (2005)
Gaussian Mixture Model Based Retrieval Technique for Lossy Compressed Color Images Maria Luszczkiewicz and Bogdan Smolka Silesian University of Technology, Department of Automatic Control, Akademicka 16 Str, 44-100 Gliwice, Poland {maria.luszczkiewicz, bogdan.bsmolka}@polsl.pl
Abstract. With the explosive growth of the World Wide Web and rapidly growing number of available digital color images, much research effort is devoted to the development of efficient content-based image retrieval systems. In this paper we propose to apply the Gaussian Mixture Model for color image indexing. Using the proposed approach, the color histograms are being modelled as a sum of Gaussian distributions and their parameters serve as signatures, which provide for fast and efficient color image retrieval. The results of the performed experiments show that the proposed approach is robust to color image distortions introduced by lossy compression artifacts and therefore it is well suited for indexing and retrieval of Internet based collections of color images stored in lossy compression formats.
1 Introduction
The rapid developments in communication and information technologies lead to an exponentially growing number of images being captured, stored and made available on the Internet. However, managing this vast amount of visual information remains a difficult task. Image retrieval systems can be classified into two main categories: metadata-based and content-based approaches. The techniques based on metadata utilize keywords describing each image, and retrieval is performed by keyword matching. The great advantage is the ease of specifying abstract queries like "sunset" or "red rose", but on the other hand this method is laborious and performs well only if the textual descriptions are accurate. The second group of techniques utilizes generic image attributes such as color, shape and texture as the basis for retrieval. The extraction of image features is done automatically, but it can be complicated to specify a query in terms of a set of parameters. In this paper we present an approach which belongs to the second class of retrieval methods. The problem addressed in this paper is as follows: given a query image, the system should retrieve all images whose color structure is similar to that of the query image, independently of the applied lossy coding. Although color-based retrieval methods are generally quite effective, many techniques are storage- and memory-consuming. Therefore, we specify a model describing the color image content efficiently, which can be stored as a set of parameters and used as an image index. In the proposed method we introduce the Gaussian Mixture Model (GMM) as a descriptor of the image color distribution. Its main advantage is that it overcomes the problems connected with the high dimensionality of standard color histograms. Additionally, the proposed method, based on weighted two-dimensional Gaussians, is robust to distortions introduced by compression techniques and therefore it can be used for the retrieval of images contained in web-based databases, which very often store the images in lossy compression formats, like GIF and JPEG. The paper is organized as follows. In the next section we introduce the color histograms together with the Gaussian Mixture Model and describe the way we found the optimal GMM parameters. Section 3 is devoted to the analysis of the retrieval results. In the last section conclusions are drawn and some remarks on future work are given.
2 Modelling Color Image Histograms
Operating in the RGB color space, the histogram is obtained by grouping color image pixels into bins and counting the number of pixels belonging to each bin. In this work we operate in the normalized rgb space, which is independent of the color intensity I: Iij = Rij + Gij + Bij, rij = Rij/Iij, gij = Gij/Iij, bij = Bij/Iij, where i, j denote the image pixel coordinates. The histogram Φ(x, y) in the r-g chromaticity space is defined as Φ(x, y) = N⁻¹ card{(i, j) : rij = x, gij = y}, where Φ(x, y) denotes a specified bin of the two-dimensional histogram with r-component equal to x and g-component equal to y, card{·} denotes the number of elements in the bin, and N is the number of image pixels. Image compression can significantly influence the properties of the r-g histogram, as can be seen in Fig. 1, because of the distortions of color information introduced by lossy compression techniques [1]. This figure illustrates the urgent need for histogram correction in order to diminish the negative influence of the compression process, as an effective image retrieval system based on histogram modelling should be robust to compression-induced artifacts and retrieve all relevant images.
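A minimal sketch of the r-g histogram computation described above is given below (assuming an 8-bit RGB input and η = 256 bins; this is an illustration, not the authors' MATLAB/C code):

```python
# Illustrative sketch: normalized r-g chromaticity histogram Phi(x, y).
import numpy as np

def rg_histogram(rgb, bins=256):
    """Chromaticity histogram of an H x W x 3 uint8 image."""
    rgb = rgb.astype(np.float64)
    intensity = rgb.sum(axis=2)
    intensity[intensity == 0] = 1.0            # guard against black pixels
    r = rgb[..., 0] / intensity                # r = R / (R + G + B)
    g = rgb[..., 1] / intensity                # g = G / (R + G + B)
    hist, _, _ = np.histogram2d(r.ravel(), g.ravel(),
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / rgb[..., 0].size             # normalize by the number of pixels N
```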
2.1 Gaussian Mixture Model
To alleviate the problems connected with the distortions of the color histograms introduced by lossy compression, we propose to approximate the color histogram built in the r-g chromaticity space by a Gaussian Mixture Model, whose parameters are estimated via the Expectation-Maximization (EM) algorithm. The input data are realizations of a random variable X, namely the (r, g) pairs of the color image pixels: χ = (x1,1, ..., xn,m), where n, m denote the image dimensions. The probability density function p(χ|Θ) is governed by a set of parameters Θ. The GMM parameters are the mean and covariance matrix of each Gaussian component.
Fig. 1. Comparison of r-g histograms and their GMM representations for original and compressed images
We assume that the data items are independent and identically distributed, so that the resulting distribution density p of the data samples is p(χ|Θ) = Π_{i=1}^{N} p(xi|Θ) = L(Θ|χ). The function L(Θ|χ) is the likelihood function, or likelihood of the parameters given the data. The goal of the estimation procedure is to find Θ* which maximizes L: Θ* = arg max_Θ L(Θ|χ). Usually we maximize log L(Θ|χ), as it is analytically less complicated. The set of parameters Θ* can be found using the Expectation-Maximization algorithm [2,3]. In general, the EM algorithm is a method for evaluating the maximum-likelihood estimate of the parameters of an underlying distribution when the data set is given but is incomplete or has missing values. Let us assume that the data set χ is observed and was generated by some distribution, and that there exists some hidden information Y. The realizations of X are called incomplete data. Consequently, the complete data is denoted as Z = (χ, Y) and the joint density function is specified as p(z|Θ) = p(x, y|Θ) = p(y|x, Θ)p(x|Θ). We use the joint density, since a joint relationship between the missing and observed values must be assumed. For this density function a new likelihood function is defined as L(Θ|Z) = L(Θ|χ, Y) = p(χ, Y|Θ), called the complete-data likelihood. It is worth mentioning that the random variable Y is unknown and presumably governed by an underlying distribution; because of that, L(Θ|χ, Y) is a function of the constants Θ and χ, while Y is a random variable. The likelihood L(Θ|χ) is called the incomplete-data likelihood function.
2.2 Expectation-Maximization
The EM algorithm finds the expected value of the complete-data log-likelihood log p(χ, Y|Θ), given the observed data χ and the current parameter estimates, as

Q(Θ, Θ^(k-1)) = E[ log p(χ, Y|Θ) | χ, Θ^(k-1) ],

where the current parameter estimates Θ^(k-1) are used to compute the expectation and the new parameters Θ are chosen to increase the value of Q. This step is called the E-step. The second step, called the M-step, maximizes the previously computed expectation:

Θ^(k) = arg max_Θ Q(Θ, Θ^(k-1)).

These two steps are repeated as long as necessary. It is important to underline that each iteration is guaranteed to increase the value of the log-likelihood, and the EM algorithm is guaranteed to converge to a local maximum of the likelihood function, which very often is also a global maximum [4,5]. The estimation of the parameters of mixture models is probably one of the most widely used applications of the Expectation-Maximization algorithm. For this purpose let us assume the following probabilistic model:

p(x|Θ) = Σ_{m=1}^{M} αm pm(x|θm),    (1)

which is composed of M components and whose parameters are defined as Θ = (α1, ..., αM, θ1, ..., θM), with Σ_{m=1}^{M} αm = 1. Moreover, each pm is a probability density function parameterized by θm. The analyzed model thus consists of M components with M weighting coefficients αm, and the incomplete-data log-likelihood expression for this density is given by

log L(Θ|χ) = Σ_{i=1}^{N} log p(xi|Θ) = Σ_{i=1}^{N} log ( Σ_{m=1}^{M} αm pm(xi|θm) ),    (2)

which unfortunately is difficult to optimize, as it contains the logarithm of a sum. As assumed previously, χ is incomplete data, and additional information Y = (yi)_{i=1}^{N} about which component "generated" each data item greatly simplifies the problem. In more detail, yi ∈ {1, ..., M} for each i, and yi = k when the ith sample was generated by the kth mixture component. If we assume that the values of Y are known, the likelihood function becomes

log L(Θ|χ, Y) = Σ_{i=1}^{N} log( p(xi|yi, θyi) pyi ) = Σ_{i=1}^{N} log( αyi · pyi(xi|θyi) ).    (3)

Naturally, the values of Y are not known, but, for example, a random vector can be used to start the estimation process; the other mixture parameters Θ^g = (α1^g, ..., αM^g, θ1^g, ..., θM^g) can also be assigned randomly. For further computation we use the Bayes rule:

p(yi|xi, Θ^g) = αyi^g pyi(xi|θyi^g) / p(xi|Θ^g) = αyi^g pyi(xi|θyi^g) / Σ_{k=1}^{M} αk^g pk(xi|θk^g),    (4)

where Θ^g denotes the set of model parameters evaluated in the g-th iteration. After further derivation [3], the model parameter updates are defined as

αm^(l+1) = (1/N) Σ_{i=1}^{N} p(m|xi, Θ^l),    μm^(l+1) = Σ_{i=1}^{N} xi · p(m|xi, Θ^l) / Σ_{i=1}^{N} p(m|xi, Θ^l),    (5)

Σm^(l+1) = Σ_{i=1}^{N} p(m|xi, Θ^l)(xi - μm^(l+1))(xi - μm^(l+1))^T / Σ_{i=1}^{N} p(m|xi, Θ^l),    (6)

where m is the index of the model component and l denotes the iteration number. The E and M steps are performed simultaneously according to (5) and (6); in each iteration the parameters derived in the previous one are used as the input data. For any iterated algorithm there arises the problem of deciding when the estimation process should be terminated. In general, we are interested in finding an estimate θ̂ of the parameter value θ for which |θ̂i - θi| ≤ ε holds for an a priori chosen value ε. However, there are many approaches to choosing the stopping condition. For example, it is possible to stop the iteration process after a certain number of iterations, or to observe the change of the parameter values in each iteration until |θ̂i+1 - θ̂i| ≤ τ for a certain a priori defined value τ. The second method suffers from an unknown number of necessary iterations; moreover, when the input data structure is unknown and varies between different data sets it is, in general, impossible to make any assumption on the time needed to find the model estimates. To alleviate this problem we decided to adopt the first method. In order to find the number of necessary iterations we chose a set of natural color images having various r-g histogram structures and applied GIF and JPEG compression with various quality levels, denoted below as subscripts.
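A compact NumPy sketch of the EM recursion of Eqs. (4)-(6) for a two-dimensional mixture is shown below; it is illustrative only (the regularization term, the random initialization and the function names are our assumptions, not details taken from the paper):

```python
# Minimal EM sketch for a 2-D Gaussian mixture fitted to (r, g) pixel pairs.
import numpy as np

def gaussian_pdf(x, mean, cov):
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.sum(d @ inv * d, axis=1))

def fit_gmm(x, n_components=7, n_iter=75, seed=0):
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    alpha = np.full(n_components, 1.0 / n_components)
    means = x[rng.choice(n, n_components, replace=False)]
    covs = np.array([np.cov(x.T) + 1e-6 * np.eye(2)] * n_components)
    for _ in range(n_iter):
        # E-step: responsibilities p(m | x_i, Theta), Eq. (4)
        resp = np.stack([alpha[m] * gaussian_pdf(x, means[m], covs[m])
                         for m in range(n_components)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: update weights, means and covariances, Eqs. (5)-(6)
        nk = resp.sum(axis=0) + 1e-12
        alpha = nk / n
        means = (resp.T @ x) / nk[:, None]
        for m in range(n_components):
            d = x - means[m]
            covs[m] = (resp[:, m, None] * d).T @ d / nk[m] + 1e-6 * np.eye(2)
    return alpha, means, covs
```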
2.3 Experimental Estimation of GMM Parameters
Having built the Gaussian Mixture Models of the original and compressed image histograms, we compared them with the histogram of the original image. For that purpose we used several widely known distance measures, denoted as d, computed between the histogram of the evaluated image (denoted as H) and the surface generated by the model of its histogram (denoted as C):
– L1 distance: d(H, C) = Σ_{i=1}^{η} Σ_{j=1}^{η} |Hi,j - Ci,j|,
– L2 distance: d(H, C) = ( Σ_{i=1}^{η} Σ_{j=1}^{η} |Hi,j - Ci,j|² )^{1/2},
– Bhattacharyya distance (BD): d(H, C) = Σ_{i=1}^{η} Σ_{j=1}^{η} √(Hi,j · Ci,j),
– Peak signal-to-noise ratio (PSNR) distance: d(H, C) = -10 log10{MSE}, where MSE = (1/η²) Σ_{i=1}^{η} Σ_{j=1}^{η} |Hi,j - Ci,j|²,
with η = 256. We conducted extensive experiments to estimate the number of necessary iterations of the EM algorithm. For each histogram structure we evaluated 31 experiments, for 7 model components (model complexity is discussed below), setting various starting conditions of the algorithm generated from a normal pdf. The coefficients αm were assigned randomly.
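The four dissimilarity measures listed above can be computed directly from the histogram H and the model surface C, for example as in the following hedged sketch (H and C are assumed to be η × η NumPy arrays):

```python
# Illustrative sketch of the four histogram comparison measures used above.
import numpy as np

def l1_distance(h, c):
    return np.abs(h - c).sum()

def l2_distance(h, c):
    return np.sqrt(((h - c) ** 2).sum())

def bhattacharyya(h, c):
    # as defined in the text: sum of sqrt(H * C), larger means more similar
    return np.sqrt(h * c).sum()

def psnr_distance(h, c):
    mse = ((h - c) ** 2).mean()
    return -10.0 * np.log10(mse) if mse > 0 else float("inf")
```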
Fig. 2. Dependence of the L1 and BD distance between the histogram of the test image depicted in Fig. 1 and its approximation obtained through the GMM on the number of iterations (a, b) and number of GMM components (c, d)
Fig. 3. Query results for various similarity measures and images depicting Architecture, Horses, Roses and Elephants
The results of the experiments lead to the conclusion that about 75 iterations (in some cases even 50) are sufficient to reflect the histogram structure, independently of the compression scheme used. Increasing the number of iterations is time consuming and does not significantly improve the fit of the model to the given data. The plots depicting the L1 and BD distances between the test image histogram (Fig. 1) and its approximation obtained via the GMM are presented in Fig. 2a) and b). Choosing the proper number of model components is also crucial for the accuracy of the evaluated model and the final efficiency of the image retrieval. Increasing the complexity of the Gaussian Mixture Model, we obtain a surface reflecting more details of the r-g histogram of an image. However, one must be aware that in the case of compressed images extremely complex models produce surfaces dissimilar to the histogram of the original, uncompressed image, exaggerating discontinuities in the color structure caused by lossy compression. For the purpose of our technique, we posed the hypothesis that there exists a minimal model structure which is sufficient to accurately reflect the distribution of the r-g histogram. To experimentally verify that statement we conducted experiments similar to the previous ones. When increasing the model complexity, there is no significant improvement in the model fitness, whatever distance measure we use. Thus, it seems reasonable to build models consisting of 7 components. This is illustrated in Figs. 2c) and d), which show the dependence of the L1 and BD dissimilarity measures between the histogram of the test image from Fig. 1 and its approximation on the number of GMM components. The GMM parameters of all images in a database are computed and saved in advance, so that only a similarity matrix is used for retrieving the most similar answers to a query among all images in the database. Each signature (the GMM parameters) associated with a specific image was saved as a Matlab file of 600 B. All procedures were realized using MATLAB and C and were performed on a PC with a 2.8 GHz Pentium IV CPU and 512 MB of primary memory. The images used for evaluating the efficiency of the proposed technique were taken from the database of Wang (http://wang.ist.psu.edu/docs/related/). The estimation of the GMM for 7 components with 75 iterations, including the r-g histogram computation and model surface generation, can be done in about 10 seconds. However, the computational speed can be significantly improved if our implementation is optimized and written in a faster programming environment. The transformation to GIF and JPEG compression was performed using the IrfanView software (http://www.irfanview.com/).
3 Image Retrieval
The retrieval technique based on the Gaussian Mixture Model described in this paper was tested on the database of Wang [6], consisting of 1000 color images divided into 10 distinct categories (e.g., 'Buses', 'Beach', 'Horses', 'Flowers') with 100 images in each category. This collection of images was constructed as a set
of images related by semantic similarity (i.e., in the category 'Buses' one can find green buses as well as red ones), which in some cases excludes the relation between an object and a unique color representation.

3.1 Histogram Distance Measures

In order to test the retrieval methodology presented in this paper, extensive experiments were conducted. At the beginning, the model of a mixture of normal densities was built using the EM algorithm; we estimated the model parameters using the median over 5 sets of starting values. Having the model parameters computed, a comparison between the histogram of the query image and those of the images belonging to the database was performed, and the images were ordered according to the values of the previously described L1, L2, BD and PSNR distance measures. Additionally we used:
– Mean Magnitude Error (MME): d(H, C) = 1 - Σ_{i=1}^{n1} Σ_{j=1}^{n2} |Hi,j - Ci,j| / (n1 · n2),
– Histogram Intersection (HI) [7]: d(H, C) = 1 - Σ_{i=1}^{n1} Σ_{j=1}^{n2} min(Hi,j, Ci,j),
– Earth Mover's Distance (EMD) [8]. The Earth Mover's Distance is based on the assumption that one of the histograms reflects "hills" and the second represents "holes" in the ground of the histogram; the measured distance is defined as the minimum amount of work needed to transform one histogram into the other using the "soil" of the first histogram. As this method operates on signatures and their weights [8], in the approach based on the Gaussian Mixture Models we assigned as signature values the mean of each component, and as the signature weight the surface occupied by an ellipse representing 95% of the probability mass of each Gaussian function.

3.2 Retrieval Efficiency

As mentioned previously, a well designed image retrieval system should be robust to image distortions introduced by the compression methods widely used for the presentation of images on the Internet. Figure 1 shows the r-g histograms and the visualizations of the obtained GMMs for images compressed with JPEG and GIF. The comparison of the histograms of the true-color original and the compressed images with the histograms approximated by the GMM reveals a good fit to the original histogram, due to the ability of the GMM to smooth histogram distortions caused by lossy compression. Figure 3 presents four sample queries with the five highest ranked answers of the retrieval system using different distance metrics. Of course, retrieval success depends not only on the effectiveness of the similarity criteria, but also on the database content. In order to evaluate the efficiency of the proposed indexing method we validated its accuracy through Recall and Precision. Recall is the fraction of relevant images in the database that have been retrieved in response to a query, whereas Precision is defined as the fraction of the retrieved images that are relevant to the query image. As mentioned before, the database of Wang consists of 10 image categories. Some of them are very homogenous (i.e., the category Horses) and some are versatile (i.e., the
category Architecture groups different kinds of architectural items: buildings, remnants, monuments, sculptures, city sceneries, bridges and paintings; see Fig. 3). In order to evaluate the retrieval efficiency of the proposed solution, and also to describe the structure of the image groups, we constructed the Precision vs. Recall plots (Fig. 4) for various image groups. As can be seen, the L1, HI and MME measures (which are linearly dependent) and the BD measure of histogram dissimilarity perform quite well. The effectiveness of retrieval using the L2 and EMD methods is lower, and therefore for inspecting the results of retrieval obtained using the GMM we have chosen the L1 distance.

Fig. 4. Comparison of retrieval performance using Precision vs. Recall plots for different classes of the Wang database: Elephants (a), Flowers (b) and Horses (c). The left column depicts the plots reflecting the dependence on the used distance measures, and the right column shows the results obtained using the L1 distance for the images compressed with GIF and JPEG.
Table 1. Performance of various histogram distance measures for a collection of 50 images with 6 having multiple query answers, (see Fig. 5)
                   distance measure
                 L1 D    L2 D    BD      EMD
average recall   1.73    9.79    1.84    16.47
ρ                0.83    0.56    0.83    0.30
Υ                0.85    0.69    0.85    0.48
Fig. 5. Collection of images chosen for the evaluation of the retrieval efficiency summarized in Tab. 1. First group (a) consists of query images and corresponding unique answers. The second group (b) contains query images (first of each set) and their multiple answers.
As can be observed, the efficiency of retrieval is not affected by the compression artifacts introduced to the images from the database, and the usefulness of the GMM approximation for histogram-based image retrieval is the main contribution of the paper. The overall behavior of our retrieval method can be specified by the average recall and the average precision ρ. The average recall is defined as the sum of the ranks of the correct answers Oi over all queries divided by the number of queries q, i.e., (1/q) Σ_{i=1}^{q} rank(Oi), and the average precision is defined as ρ = (1/q) Σ_{i=1}^{q} 1/rank(Oi). Highly ranked images contribute more to the ρ measure. As a consequence, the smaller the average recall measure, the better; there is an opposite relation for ρ: if a method performs better, then ρ is higher. In order to describe the retrieval efficiency we also applied the Recall vs. Scope analysis [9]. Let us assume that for each query image Oi (called a multiple query) there are multiple answers Ai^1, ..., Ai^ξ. The Recall measure is defined for a scope s > 0 (i.e., the number of relevant images ranked below s) as Υi(s) = card{Ai^j | rank(Ai^j) ≤ s}/ξ. The average recall measure Υ is evaluated by taking the average over all query images in the database.
The results of the experiments performed on a collection of 50 images consisting of pictures with unique and multiple answers, (see Fig. 5) are shown in Tab. 1. The results obtained for scope s = 6 confirm that the L1 and BD distances yield much better retrieval results than EMD and L2 dissimilarity measures.
4 Conclusions and Future Work
In this paper we presented a novel image indexing method that enhances the capabilities of retrieval systems utilizing the color histograms. Our technique extracts the image color structure in the form of parameters of a Gaussian Mixture Model. The experiments revealed that 7 components of the GMM and about 75 iterations of the EM algorithm assure satisfactory retrieval results. The main advantage of the proposed method is its robustness to artifacts introduced by lossy compression. The performed tests show that the proposed framework is a useful and robust tool that can be used for the retrieval of images in web based databases. In the future we want to apply other color spaces and test the proposed technique on other databases. The important part of further research efforts will be focused on the comparison of our technique with other histogram based solutions. Additionally we intend to incorporate some nonparametric estimation techniques for the restoration of color image histograms distorted by compression artifacts.
References 1. Smolka, B., Szczepanski, M., Lukac, R., Venetsanoloulos, A.: Robust color image retrieval for the World Wide Web. In: Proceedings of ICASSP, pp. 461–464 (2004) 2. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data. J. Royal Stat. Soc. 39, 1–38 (1997) 3. Bilmes, J.: A gentle tutorial on the EM algorithm and its application to parameter estimation for gaussian mixture and hidden markov models. Technical Report ICSITR-97-021, University of Berkeley (1997) 4. Wu, J.: On the convergence properties of the EM algorithm. The Annals of Statistics 11(1), 95–103 (1983) 5. Xu, L., Jordan, M.I.: On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation 8, 129–151 (1996) 6. Wang, J.Z., Li, J., Wiederhold, G.: Simplicity: semantics-sensitive integrated matching for picture libraries. Pattern Analysis and Machine Intelligence, IEEE Transactions 23, 947–963 (2001) 7. Swain, M.J., Ballard, D.H.: Color indexing. Int. J. Comput. Vision 7, 11–32 (1991) 8. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: ICCV, pp. 59–66 (1998) 9. Huang, J., Kumar, S.R., Mitra, M., Zhu, W.-J., Zabih, R.: Image indexing using color correlograms. In: Computer Vision and Pattern Recognition, 1997 IEEE Computer Society Conference, pp. 762–768. IEEE Computer Society Press, Los Alamitos (1997)
Logo and Trademark Retrieval in General Image Databases Using Color Edge Gradient Co-occurrence Histograms Raymond Phan and Dimitrios Androutsos Department of Electrical and Computer Engineering Ryerson University 350 Victoria St., Toronto, Ontario, M5B 2K3, Canada {rphan,dimitri}@ee.ryerson.ca
Abstract. In this paper, we present a logo and trademark retrieval system for general, unconstrained, color image databases that extends the Color Edge Co-occurrence Histogram (CECH) object detection scheme. We introduce more accurate information to the CECH, by virtue of incorporating color edge detection using vector order statistics. This produces a more accurate representation of edges in images, in comparison to the simple color pixel difference classification of edges as seen in the CECH. As such, we call this the Color Edge Gradient Co-occurrence Histogram (CEGCH). We use this as the main mechanism for our retrieval scheme. Results illustrate that the proposed retrieval system retrieves logos and trademarks with good accuracy. Keywords: Color Edge Gradient Co-occurrence Histogram (CEGCH), logo and trademark retrieval, color edge detection, pattern recognition, object detection.
1 Introduction
Content Based Image Retrieval (CBIR) is a wide area of multimedia processing with many applications. Much research has been devoted to CBIR and many solutions have been proposed. However, many avenues of CBIR exist; in this paper, we therefore focus solely on logo and trademark retrieval, as it is seen as an increasingly vital tool for industry and commerce [1], and also for sports entertainment [2]. As a result, a wide variety of methods have been proposed to perform this task. Neural network methods have been investigated [3], as well as methods employing shape information [1]. Contour- and segmentation-based techniques have also been proposed [4]. Before research in logo and trademark retrieval began, Swain and Ballard [5] proposed a method for color image retrieval employing color histograms. Unfortunately, this method only captures the color content of an image, thus misclassifying images having similar color distributions but different spatial statistics. As a result, some methods were proposed to resolve this problem. Jain and
Vailaya [6] proposed the Edge Directional Histogram (EDH), which is a means of describing an image based on significant edges considered at multiple angles. The use of edge directions ignores color information, but captures global shape information based on edges and is invariant to translation. However, the drawback of the EDH is that it is not rotationally invariant: rotating an image causes the edge directions to change, which changes the bins of the EDH. Also, images can be retrieved that have similar EDHs but look entirely different. Therefore, a more reliable method should be used instead. Building on color histograms, methods have been researched that introduce texture features into color histograms for use in image retrieval. Chang and Krumm [7] proposed the Color Co-occurrence Histogram (CCH), which incorporates texture information by capturing spatial relationships between color image pixels. The CCH is a three-dimensional histogram which contains the number of color pixel pairs, c1 and c2, that occur at different spatial distances, d. Figure 1 illustrates a graphic representation of the construction of a CCH (diagram courtesy of Dr. Thomas El-Maraghi; used with permission).
Fig. 1. Graphic representation of the construction of the CCH
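A brute-force sketch of how a CCH of this form can be accumulated from a color-quantized image is shown below; it is an illustration of the idea only (the windowed search and the quantized Euclidean distance are assumptions, and this is not the code of [7]):

```python
# Illustrative sketch: accumulating a Color Co-occurrence Histogram (CCH).
import numpy as np

def color_cooccurrence_histogram(quantized, n_colors, max_d):
    """quantized : H x W array of color indices; max_d : largest separation considered."""
    h, w = quantized.shape
    cch = np.zeros((n_colors, n_colors, max_d + 1), dtype=np.int64)
    for y1 in range(h):
        for x1 in range(w):
            c1 = quantized[y1, x1]
            # only visit pixels inside a (2*max_d + 1) window to bound the cost
            for y2 in range(max(0, y1 - max_d), min(h, y1 + max_d + 1)):
                for x2 in range(max(0, x1 - max_d), min(w, x1 + max_d + 1)):
                    d = int(np.floor(np.hypot(y2 - y1, x2 - x1)))
                    if d <= max_d:
                        cch[c1, quantized[y2, x2], d] += 1
    return cch
```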
The main goal of the CCH is to capture the spatial correspondence between colour pixels, not only by focusing on the frequency on colour pixel pairs that occur in the image, but on colour pixels that surround a given pixel, effectively modelling texture. Chang and Krumm used CCHs as the mechanism for their object detection scheme, and proved that it was quite an effective method. However, Luo and Crandall [8] discovered a fundamental flaw with the CCH. Regions in the image of a uniform color contribute a disproportionate amount of energy to the CCH, overwhelming the comparison metrics. Therefore, the CCH would demonstrate similarity between images that contain similar solid regions of color, but arranged quite differently. As such, Luo proposed a modification to the CCH named the Color Edge Co-occurrence Histogram (CECH), where the aim is to resolve the solid color contribution problem to the CCH. The CECH only captures the separation of pairs of color pixels at different spatial distances when these pixels lie in edge neighborhoods, alleviating the disproportionate energy contributions a single color would have on the CCH. The CECH is the mechanism used in the object detection scheme proposed by Luo. This use of edge information is more suitable for retrieval as opposed to the EDH, for the EDH only considers edges and not the spatial correspondence between edge points; the EDH does not consider edges within edge neighborhoods. The CECH effectively 1
Diagram courtesy of Dr. Thomas El-Maraghi; used with permission.
676
R. Phan and D. Androutsos
uses the spatial correspondence between edges. In Luo’s paper, the construction of the CECH dealt with quantizing the query and search images using a two-stage color quantization process. The first stage involves quantizing an image using the color set defined by the ISCC-NBS, which is a set of 267 colors. The second stage performs further quantization down to a set of 11 colors: S = {black, white, gray, brown, yellow, orange, pink, red, purple, blue, green}. After, CECHs are built using information from the input query and search images. Luo’s definition of an edge pixel is any pixel having a different color from any of its 8 connected neighbors. This implies that only pixels that are on, or very close to, an edge are included in the CECH, which tend to be pixels containing the most important spatial-color information. It should be noted that the majority of the previous attempts for logo and trademark retrieval focused solely on the retrieval of logos and trademarks themselves in image databases. This is sufficient for applications in copyright protection and trademark registration, as seen in [1]. However, with the advent of the internet, and the low cost of scanners and storage devices, digital images are a vital role for the dissemination of pictorial information, introducing the problem of unconstrained images. Unconstrained images are those that have been captured by an acquisition device in natural environments where there are factors beyond one’s control. For these images, the objects of interest can be subject to many uncontrollable factors. High amounts of deformation, and large variations of color depending on illumination conditions are inevitable in unconstrained images. The objects to detect can also be of an arbitrary size, and can be in any location when the image is captured. The previous attempts do not address the case of unconstrained image retrieval, for the underlying image database already consists of logos and trademarks to be retrieved, and the acquisition of the images that populate the databases are under controllable conditions. Luo’s work concentrates on detecting objects in unconstrained color images, subject to the uncontrollable factors stated previously. Though the object detection results in [8] were very good, we feel that a more appropriate definition of what an edge pixel is should be employed, as opposed to the simple color pixel difference for edge classification introduced in Luo’s work. To this end, we present a paper that extends Luo’s work using CECHs, extending Luo’s work for object detection for use in logo and trademark retrieval in general, unconstrained, color images. Color is one feature that is most important, for it is considered a powerful visual cue, and is one of the most discriminatory and commonly used features in image retrieval [5]. The focus of this work is solely on color images, as opposed to more general cases, such as binary and gray-level images. We introduce color edge detection to the CECH, which produces an edge map determining valid edge points with greater accuracy. This edge map is used with the CECHs of the input query and search images to perform logo and trademark retrieval in general, unconstrained, color images. The inclusion of this information to the CCH thus makes it dependent on the edge gradient, and so we name this the Color Edge Gradient Co-occurrence Histogram (CEGCH).
The rest of this paper presents our retrieval system and findings. Section 2 describes our methodology, including the color edge detection algorithm used and the mechanics used for logo and trademark retrieval in general, unconstrained, color image databases. Section 3 illustrates retrieval results using the proposed framework applied to general, unconstrained, color image databases. Finally, Section 4 will present our conclusions and future research in this area.
2 Method
As with Luo’s method, the input to the retrieval system is the logo or trademark of interest to be retrieved. We also work directly in the RGB color space. The following is an overview of the algorithm: 1. Processing of the Input Image: The input image is read in and quantized using a uniform quantization of 32 bins. This method of color quantization was chosen, for it is a quick way to verify that our retrieval scheme works. Next, multiple images at different scale factors are produced by subsampling the quantized input image, for multiple sizes of the object to be found can exist in arbitrary images. Next, color edge detection is performed on each subsampled, quantized image, producing edge maps. After, these are used to build the CEGCHs for each subsampled and quantized version of the input image. 2. Processing the database images: To reduce computation time, each database image is subsampled to a resolution of 256 x 384 pixels, as done in Luo’s paper in [8], before further processing and the database images are also quantized using uniform color quantization. 3. Image Retrieval: Finally, overlapping search windows of different scale factors are used to search through a database image. The CEGCH of a search window at a scale factor is computed and compared to each of the CEGCHs created from the input image. This is repeated for the rest of the scale factors for this search window and then for all other search windows in the database image. The search window within this image which produced the greatest similarity value is used as the overall similarity for this image. Also, the coordinates of this search window are recorded, and a bounding box is drawn around it, illustrating that the system found the logo or trademark to be within this search window. This is repeated for all database images. Once overall similarity values are generated for each database image, these are used to rank the images and the similarity values are sorted in descending order. Thumbnails are presented to the user from left to right, top to bottom, 40 per page. The image that appears on the first page at the top left corner of the display denotes that there is a high probability that the logo exists in this image. Similarly, the image that appears on the last page at the bottom right corner of the display denotes that there is a very low probability. It should be noted that our current algorithm does not have the ability to detect multiple logos or trademarks, or those that are subject to large amounts
of deformation other than uniform scaling in an image. However, this is currently an area of research we are investigating. An analysis of the computational complexity and of possible optimizations is not the objective of this paper; rather, the goal is to demonstrate that the concept of the CEGCH is a viable mechanism to facilitate logo and trademark retrieval in general, unconstrained, color images. These two factors are areas of research we are currently investigating.
2.1 Color Edge Detection
There exists a wide variety of color edge detection methods in the literature. The gradient operators proposed for edge detection in gray-level images can be extended to color by taking the vector sum of the gradients of the individual components of the image [9]. Another approach is to perform edge detection in the vector space, employing vector gradient operators as seen in [10]. However, the approach that we take is to perform color edge detection, also in the vector space, using vector order statistics, as proposed by Trahanias and Venetsanopoulos [11][12]. As stated in [12], this family of color edge detectors gives extremely good results by retaining edges, and is robust against the presence of noise. In the vector space, color pixels are treated as three-dimensional vectors. Pixels in a neighborhood (e.g. 3 × 3, 5 × 5, ...) are ordered based on some criterion. We employ reduced ordering to order these color pixels: each color pixel in the neighborhood is reduced to a scalar value, determined by a distance measure based on color pixel values, known as the reduction function; the ordering of the color pixels is then based on these scalar values. The Euclidean distance, or L2 norm, between each color pixel in the neighborhood and the mean color pixel of this neighborhood was the reduction function of choice. This is defined in Equation 1 as

f(c, w) = √((c1 - w1)² + (c2 - w2)² + (c3 - w3)²),    (1)

where c1, c2 and c3 represent the RGB color pixel values and w1, w2 and w3 represent the mean RGB color pixel values for the red, green and blue channels, respectively. The color edge detector we chose from the vector order statistics family is the minimum vector dispersion (MVD) edge detector. Let X = {X(1), X(2), X(3), ..., X(n)} represent the color pixels within a pixel neighborhood, sorted in ascending order after the reduction function has been applied. The output of the MVD edge detector within a pixel neighborhood is

MVD = min_j ‖ X(n-j+1) - (1/m) Σ_{i=1}^{m} X(i) ‖,    (2)

where j = 1, 2, ..., k and k, m < n. Equation 2 averages the m lowest ranked color pixels, creating an average color pixel within the pixel neighborhood. This new average color pixel is compared to the k highest ranked color pixels, where noise tends to exist. The Euclidean distance is calculated to each of these k color pixels, and the output of the color
Thresholding these values creates an edge map. There is no need to pre-filter the images for noise, because this edge detector performs noise filtering automatically [11][12]: averaging the m lowest-ranked color pixels reduces short-tailed (Gaussian) noise, while taking the minimum of the distances between the k highest-ranked pixels and the average color pixel eliminates impulsive noise. Two different methods of noise suppression are thus performed within a single edge detector, making its selection a robust choice. This method was used to create edge maps for the subsampled input images and the database images, and their CEGCHs were calculated from these edge maps. The detector more accurately represents what is considered an edge in an image and is also very robust. In addition, the parameters k, m, and the size of the pixel neighborhood of the MVD can be tuned to produce the best possible edge detection results for different images in applications where accuracy needs to be controlled. Increasing k decreases the likelihood of impulsive noise appearing as an edge in the output, but may introduce short-tailed noise at the output, since for high k the output may be selected from lower-ranked pixels; decreasing k increases the likelihood of impulsive noise appearing at the output, but a pixel corrupted by short-tailed noise is less likely to be chosen. Increasing m increases the amount of averaging, reducing the short-tailed noise seen at the output, but creates less accurate edge points; decreasing m reduces the suppression of short-tailed noise, but edge points become more accurate.
Though the definition of an edge in [8] produces very good results, their object detection system was applied to noise-free images whose quality was not compromised. Consider a uniform region of color in an image: if this smooth area varies slightly in color, the edge map definition of Luo would consider the area as containing edges, due to their proposed two-stage color quantization scheme. Similarly, if the quality of an image were degraded by a substantially lossy compression algorithm, the resulting blocking artifacts would produce misclassified edges with the edge map definition in [8], as would image noise. We consciously chose color edge detection using vector order statistics for its resilience to noise and because it inherently performs noise filtering.
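As an illustration, the following Python sketch computes the reduced ordering and the MVD response of Equations 1 and 2 for a single pixel neighborhood. It is a minimal version assuming the neighborhood is passed in as an n x 3 RGB array, not the authors' implementation; the default k and m are arbitrary illustrative values.

# Minimal sketch of the MVD edge detector response for one neighborhood
# (Equations 1 and 2). Assumes `neigh` is an (n, 3) float array of RGB
# pixels from, e.g., a 5 x 5 window; not the authors' original code.

import numpy as np

def mvd_response(neigh, k=2, m=4):
    mean_pixel = neigh.mean(axis=0)
    # Reduction function: Euclidean distance to the neighborhood mean (Eq. 1).
    reduction = np.linalg.norm(neigh - mean_pixel, axis=1)
    order = np.argsort(reduction)              # reduced ordering (ascending)
    sorted_pixels = neigh[order]
    low_avg = sorted_pixels[:m].mean(axis=0)   # average of m lowest-ranked pixels
    highest = sorted_pixels[-k:]               # k highest-ranked pixels
    # MVD output: minimum distance between the highest-ranked pixels and the
    # low-rank average (Eq. 2); thresholding this value yields the edge map.
    return np.min(np.linalg.norm(highest - low_avg, axis=1))

# Example: a flat (uniform) neighborhood gives a near-zero response.
flat = np.full((25, 3), 120.0)
print(mvd_response(flat))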
2.2 Similarity Measure
The similarity measure used to compare CEGCHs, and to obtain an overall similarity value for a database image, is histogram intersection, introduced in Swain's work [5] for image retrieval. For CEGCHs, the histogram intersection is defined as:

s(h_s, h_i) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{c} \sum_{k=1}^{d} \min(h_s[i][j][k], h_i[i][j][k])}{\sum_{i=1}^{c} \sum_{j=1}^{c} \sum_{k=1}^{d} h_s[i][j][k]},   (3)
where h_s is the CEGCH within a search window of a database image at a given scale factor, and h_i is the CEGCH of the input image at a given scale factor. c is the total number of colors (in our case, 32) and d is the maximum spatial distance by which two color pixels can be separated. The spatial distance metric we used was the quantized Euclidean distance, as in [8], defined as the mathematical floor of the Euclidean distance between two color pixels in the image; this maintains rotational invariance in the retrieval process. The closer the similarity measure is to 1, the more similar the two CEGCHs are. We chose histogram intersection as the comparison metric because it is very simple to implement and provides a quick way of determining whether our modification of the CECH proposed in [8] yields acceptable retrieval results.
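As a concrete illustration, the following Python sketch evaluates the histogram intersection of Equation 3 for two CEGCHs; the c x c x d array layout is our own assumption for illustration, not a prescribed data structure.

# Minimal sketch of the CEGCH histogram intersection (Equation 3).
# CEGCHs are assumed to be stored as c x c x d arrays indexed by
# [color1][color2][quantized distance]; this layout is an assumption.

import numpy as np

def cegch_intersection(h_s, h_i):
    # Similarity in [0, 1] between a search-window CEGCH h_s and a query CEGCH h_i.
    numerator = np.minimum(h_s, h_i).sum()
    denominator = h_s.sum()
    return 0.0 if denominator == 0 else float(numerator / denominator)

# Example with c = 32 colors and d = 7 quantized distances (a 5 x 5 window).
rng = np.random.default_rng(0)
h_s = rng.integers(0, 10, size=(32, 32, 7)).astype(float)
h_i = rng.integers(0, 10, size=(32, 32, 7)).astype(float)
print(cegch_intersection(h_s, h_i))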
3 Experimental Results
To test our retrieval scheme, the two input logos used to generate retrieval results were the Ferrari and Alfa Romeo logos, shown in Figures 2a) and 2b) respectively. Figures 2a) and 2b) have dimensions of 185 x 250 pixels and 200 x 200 pixels respectively.
Fig. 2. Original input (a) Ferrari and (b) Alfa Romeo logos
Three tests were performed to verify that the proposed retrieval scheme functions correctly. The image databases for each trial were subsets of a 5,000-image general, unconstrained, color image database we compiled. The first test had 100 images in the database, 7 of which contained the Ferrari logo. The second test had 500 images, 11 of which contained the Alfa Romeo logo, and the third test had 1,000 images, 20 of which contained the Ferrari logo. Each test had an increasing number of database images to demonstrate that results can reliably be generated as the database size grows. To quickly produce preliminary results verifying that the proposed system generates accurate retrievals, overlapping search windows of a single scale factor were passed over the database images, and a single scale factor was used for the input image. The scale factor for each input image was determined experimentally, with repeated trials until satisfactory results were obtained. The subsampled inputs used to generate the retrieval results were 0.6 of the height and width of the logos, making the scale factor 0.6, and the search windows were the same size as the subsampled input image. The edge
maps created for the CEGCHs in the tests used the MVD edge detector with k = 8, m = 12, and a 5 x 5 pixel neighborhood, as in [11]. These parameters were chosen based on the results in [11], where they demonstrated excellent edge detection under different combinations of short-tailed and impulsive noise. The gradient threshold for the MVD edge detector was set to 96. The maximum spatial distance over which two color pixels can be separated in an image to count co-occurrences was 7, corresponding to a 5 x 5 window.
3.1 First Test: 100 Image Database
Figures 3a) and 3b) illustrate the first two pages of ranked retrieval results that the system produced for the 100 image database for the first test, displaying them 40 per page. All of the relevant Ferrari images are shown within the first page, seen in Figure 3a).
Fig. 3. First and second page of the retrieval results from the 100 image database
To verify that the retrieval system also correctly locates the logo within an image, Figures 4a) and 4b) illustrate where the system located the Ferrari logo for two of the retrieved images shown on the first page of results. Although we stated previously that our retrieval system cannot handle heavy deformation or affine transformations, Figure 4b) shows that the system is still able to correctly identify where the logo is in an image with partial deformation.
3.2 Second Test: 500 Image Database
Figures 5a) and 5b) illustrate the first two pages of ranked retrieval results produced by the system for the second test. As can be seen in Figure 5, only 3 of the 10 images containing the Alfa Romeo logo appeared within the 40 highest-ranked images. One reason could be the uniform color quantization method employed in the retrieval system. Though computationally much faster than other methods, uniform quantization does not model human color perception well, and contouring effects can result because many of the colors in the full color map prior to quantization are not considered [13]. However, uniform color quantization was the method used to quickly verify that the proposed modification to the CECH, introduced in [8], generates correct retrieval results.
Fig. 4. Two of the retrieval results from the 100 image database

Fig. 5. First and second page of the retrieval results from the 500 image database
If a color quantization method that better captures human color perception were used, retrieval results would improve. The similarity measure described in Section 2.2 is also prone to false positives, as stated by Luo in [8]; a more sophisticated similarity measure would likewise increase accuracy. Another reason for the decreased accuracy of the retrieval results in Figure 5 is that only one scale factor was passed over each database image. As can be seen in some of the results in Figure 5, the bounding boxes drawn are larger than the actual size of the logo, lowering the rank of some of the images containing the Alfa Romeo logo: an oversized bounding box introduces extra co-occurrences into the CEGCH. Consequently, for an image in which the logo of interest is significantly smaller than the scale factor under consideration, inaccuracy will result. This does not matter for database images whose color distribution is very different from that of the query logo (apart from the appearance of the logo itself), but images with a color distribution similar to the query logo have a higher tendency to produce false positives. If multiple scale factors were used for the logo retrieval, greater accuracy would be achieved. Figures 6a) and 6b) illustrate where the system located the Alfa Romeo logo for two of the retrieved images shown on the first page of results. As these images show, the system is still able to correctly identify where the logo is in an image with partial deformation and some rotation.
Fig. 6. Two of the retrieval results from the 500 image database
3.3 Third Test: 1,000 Image Database
Figures 7a) and 7b) illustrate the first two pages of ranked retrieval results that the system produced for the third test. Figures 8a) and 8b) illustrate where the system located the Ferrari logo for two of the retrieved images shown on the first page of results. As in the second test, the system is still able to correctly identify where the logo is in an image with partial deformation and a larger amount of rotation. However, in Figure 7, only 7 of the 20 images containing the Ferrari logo appeared within the 40 highest-ranked images. As in the second test, the decreased accuracy can be attributed to the reasons stated in Section 3.2.
Fig. 7. First and second page of the retrieval results from the 1,000 image database

Fig. 8. Two of the retrieval results from the 1,000 image database
3.4 Precision-Recall Graphs
Figures 9a), 9b) and 9c) show Precision-Recall graphs evaluating the retrieval results for the tests described above. The retrieval results for the 100 image database exhibit high precision and recall; although the image database for this trial was small compared with the second and third trials, the system performed very well. In Figure 9b), precision and recall change substantially compared with Figure 9a), for the reasons stated in Section 3.2, and Figure 9c) shows a similar Precision-Recall curve. By introducing a better color quantization method, considering multiple scale factors of the query image, and successfully handling deformations and affine transformations, recall and precision should improve as the size of the image database increases.

Fig. 9. Precision-Recall graphs (precision vs. recall) of all three retrieval tests: (a) 100 image database, (b) 500 image database, (c) 1,000 image database
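For readers who wish to reproduce such an evaluation, the following Python sketch computes precision and recall points from a ranked result list; it is an illustrative helper, not part of the original system, and the variable names are our own.

# Minimal sketch of Precision-Recall computation over a ranked result list.
# `ranked_ids` is the list of database image ids ordered by similarity and
# `relevant_ids` is the set of ground-truth images containing the query logo.

def precision_recall_curve(ranked_ids, relevant_ids):
    points = []
    hits = 0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            hits += 1
        precision = hits / rank
        recall = hits / len(relevant_ids)
        points.append((recall, precision))
    return points

# Example: 7 relevant images in a 100-image database.
ranked = list(range(100))
relevant = {0, 1, 2, 5, 8, 13, 21}
print(precision_recall_curve(ranked, relevant)[:5])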
4 Conclusions and Future Work
We presented an extension of the color object detection method devised by Luo, which focuses on detecting objects in general, unconstrained, color images. We used a modified version of their CECH, intended for logo and trademark retrieval in general, unconstrained, color image databases. The method in [8] erroneously classifies edges in the edge map; therefore, we replaced their edge map definition with a more reliable color edge detection scheme based on vector order statistics. The retrieval results illustrate that the proposed system is able to determine the location of the logo or trademark of interest in spite of partial deformations. However, these are preliminary results, and there are several aspects of our system that could be improved, such as the use of uniform color quantization and the choice of histogram intersection as the similarity measure. Better retrieval results can be achieved by choosing a color quantization method that models human color perception well and a more robust similarity measure. We are also investigating pre-screening approaches, so that database images whose spatial-color statistics do not match those of the input logo are not searched. This will reduce computation time by effectively narrowing the search to images that may contain the logo of interest,
as opposed to searching every image in the database. We are also investigating the detection of logos and trademarks under significant deformations, and the detection of multiple instances of a logo within a database image.
References
1. Eakins, J.P., Boardman, J.M., Graham, M.E.: Similarity retrieval of trademark images. IEEE Multimedia 5(2), 53-63 (1998)
2. Pan, H., Li, B., Sezan, M.I.: Automatic detection of replay segments in broadcast sports programs by detection of logos in scene transitions. In: ICASSP. Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, vol. 4, pp. 13-17 (2002)
3. Zyga, K., Price, R., Williams, B.: A generalized regression neural network for logo recognition. In: Proc. IEEE Intl. Conf. on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, vol. 2, pp. 475-478 (2000)
4. Leung, W.H., Chen, T.: Trademark retrieval using contour-skeleton stroke classification. In: ICME. Proc. IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp. 517-520 (2002)
5. Swain, M.J., Ballard, D.H.: Color Indexing. Intl. Journal of Computer Vision 7(1), 11-31 (1991)
6. Jain, A.K., Vailaya, A.: Image retrieval using color and shape. Pattern Recognition 29(8), 1233-1244 (1996)
7. Chang, P., Krumm, J.: Object recognition with color cooccurrence histograms. In: CVPR. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 23-25 (1999)
8. Luo, J., Crandall, D.: Color object detection using spatial-color joint probability functions. IEEE Trans. on Image Processing 15(6), 1443-1453 (2006)
9. Di Zenzo, S.: A note on the gradient of a multi-image. Computer Vision, Graphics, and Image Processing 33, 116-125 (1986)
10. Cumani, A.: Edge detection in multispectral images. CVGIP: Graphical Models and Image Processing 53, 40-51 (1991)
11. Trahanias, P., Venetsanopoulos, A.N.: Color edge detection using vector order statistics. IEEE Trans. on Image Processing 2(2), 259-264 (1993)
12. Plataniotis, K.N., Venetsanopoulos, A.N.: Color Image Processing and Applications. Springer, Heidelberg (2000)
13. Heckbert, P.: Color image quantization for frame buffer display. In: SIGGRAPH. Proc. ACM 9th Annual Conf. on Computer Graphics and Interactive Techniques, pp. 297-307. ACM Press, New York (1982)
Use of Adaptive Still Image Descriptors for Annotation of Video Frames

Andrea Kutics (1), Akihiko Nakagawa (2), and Kazuhiro Shindoh

(1) Tokyo University of Technology, 1404-1 Katakura, Hachioji-shi, Tokyo, 192-0982 Japan
(2) University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo, 182-8585 Japan
[email protected],
[email protected]

Abstract. This paper presents a novel method for annotating videos taken from the TRECVID 2005 data using only static visual features and metadata of still image frames. The method is designed to provide the user with annotation or tagging tools to incorporate multimedia data such as video or still images, as well as text, into searching or other combined applications running either on the web or on other networks. It mainly uses MPEG-7-based visual features and metadata of prototype images and allows the user to select either a prototype or a training set. It also adaptively adjusts the weights of the visual features the user finds most adequate to bridge the semantic gap. The user can also detect relevant regions in video frames by using a self-developed segmentation tool and can carry out region-based annotation on the same video frame set. The method provides satisfactory results even though the annotations of the TRECVID 2005 video data vary greatly in the semantic level of their concepts. It is simple and fast, uses a very small set of training data, and requires little or no user intervention. It also has the advantage that it can be applied to any combination of visual and textual features.
1 Introduction

There is a huge amount of multimedia content, including video, audio, still images, 3D as well as text, produced by professionals and amateurs, and it is clear that the amount of data is going to increase greatly in coming years. Thus, to be able to categorize, search, personalize, summarize, authorize and integrate these data into existing search engines, we need to provide appropriate tags, or in other words to annotate them, based either on standardized ontologies or even folksonomies, or simply to provide some simple concepts or concept hierarchies. A number of applications on the internet urgently need such techniques; well-known examples are Flickr (http://www.flickr.com) and YouTube (http://www.youtube.com). However, automatic annotation is extremely difficult, and to accomplish it we have to solve problems such as image segmentation and object recognition in general. The key problem here is to bridge the so-called "semantic gap" that exists between automatically extracted visual properties of the given multimedia data and high-level concepts
based on the user's understanding of the multimedia content. Several research activities are presently focusing on this field. They can be separated into two main, closely related areas: (1) finding standard ontologies with which users can tag their multimedia content and thus make searching, summarizing, etc., more effective; (2) annotating the given content as precisely as possible using state-of-the-art research to bridge the semantic gap. Both of these research topics are extremely challenging, especially the second one, and it will probably take several years to provide a suitable solution. A number of multi-modal annotation methods have been developed to overcome semantic gap problems by combining keywords with other low-level visual features of the multimedia content. These methods include approaches drawn from the field of information retrieval, such as latent semantic indexing [1], relevance feedback-based methods [2], various statistical learning techniques based on Gaussian mixture models [3, 4], HMM-based methods [5], several neural-network-based approaches [6] and a vast number of methods using Support Vector Machines (SVM) [7, 8, 9]. Most of these approaches extract histogram-based color features and texture descriptions obtained over the frequency domain, and then unify them into composite feature vectors representing the whole image or evenly divided image blocks. These feature vectors are then used to annotate images by means of pre-annotated training data and various learning approaches. A very promising method, described in [3, 4], matches image regions with keywords by using a joint clustering approach on both feature spaces, based on unsupervised learning with Expectation Maximization (EM). There are also several new promising image annotation techniques that emphasize various machine learning methods [10, 11, 12]. However, these vary greatly in their vocabularies or are set up for specific image sets, and are also sometimes restricted to very simple visual features. Furthermore, several vocabularies have been proposed for multimedia, such as the MPEG-7 multimedia description schemes [13], TV-Anytime, as well as TRECVID-based ontologies such as LSCOM (Large-Scale Concept Ontology for Multimedia) [14]. This ontology contains as many as 857 concepts, and a whole set of frames can be annotated with 449 of them. There are also a large number of specific or widely usable ontologies for still images and video, and the MUSCLE project [15] is trying to incorporate them. Some are quite detailed and contain a large number of concepts. However, vocabularies and even ontologies tend to select concepts expressing everyday objects at higher levels in the concept hierarchy, because it is easier to provide them with representative images or objects, even though this has the tradeoff of leaving specific images or objects out of these concept clusters. A detailed analysis of currently used keywords, vocabularies and ontologies can be found in [16]. Here we have to keep in mind that in recent years there has been a very important change in users' viewpoints on multimedia content. With internet technologies evolving from Web 1.0 to Web 2.0, the user has a larger role in dynamically reforming the web. In Web 2.0, the user is much more involved in the creation of new individual or community sites or virtual networks, and has stopped being only a web content consumer to become a web content producer.
However, semantic multimedia content classification or annotation, as well as organized multimedia authoring, have yet to be included in this new type of web community. A survey and discussion on this issue can be found in the IEEE Multimedia Magazine [17].
Obviously, there is a big gap between the goals and the results achieved by the research community in bridging the semantic gap to provide precise automatic annotation methods for multimedia content, for object recognition or content-based object searching purposes. There is still an urgent need, by a huge number of users on the web, for tools that allow even an approximate integration of multimedia content into web communities and that enable new applications using these contents via mash-ups. This paper addresses this problem and introduces a simple method that uses visual features and simple metadata, similar to the subset of LSCOM-lite defined for news video, to annotate dynamic video content using annotated still image frames extracted from the video as prototype data. We emphasize that we extracted only static visual features from the image frames, in order to show that satisfactory results can be obtained without involving motion information. The method mainly uses MPEG-7 visual features and metadata, and allows the user both to select the prototype set and to adaptively weight the visual features that seem most adequate. In our experiments, we used 17,269 video frame images taken as shot boundary frames from the TRECVID 2005 data [18] as ground truth. We extracted the most important regions in the video frames using a self-developed multi-scale segmentation framework and carried out experiments with the same video frame set. We also carried out experiments using still images only: 12,500 images from the Corel Gallery collection and 5,000 from home image or video data. We verified that satisfactory results could be obtained even though the annotations of the TRECVID 2005 video data vary widely in the semantic level of their concepts. To address this problem we also provided concept trees for category-based search.
2 Selecting Video Frames and Detecting Their Visual Features

2.1 Frame Selection

We used MPEG-1 format video data selected from the TRECVID 2005 video data collection. We selected frames at changes of video scenes, i.e., at the so-called shot boundaries. This selection is usually carried out in the (YCbCr) color space by calculating differences between the normalized histograms of consecutive frames, determined for the (YCbCr) color channels. A frame is defined as a shot boundary frame, or scene change, and saved whenever the histogram differences reach a significant value. However, in our experiments we used the shot boundaries and their assigned annotations provided by the TRECVID 2005 Workshop.

2.2 Visual Features

With this method we extract discrete still images, such as video frames, from continuous video data. As mentioned before, for computational reasons we selected only 17,269 frames from the total of about 70,000 for our experiments. Next, we extract MPEG-7 compliant visual features for each image frame. In our experiments, we used three main visual features: MPEG-7's edge histogram descriptor, its color layout descriptor [19], and another descriptor called a color correlogram [20]. The edge histogram descriptor (called here ehd) is a powerful texture descriptor. To extract it, the image is first divided into 4x4 blocks and the resulting sub-images
are further divided into small sub-blocks. Four directional edge detectors and a non-directional one are then applied to the sub-blocks to determine a histogram consisting of 80 bins, according to the 5 edge types calculated over the sub-images. An extended histogram is also computed by grouping the image blocks and calculating semi-global and global histograms. Finally, similarity is measured with a weighted L1 distance, applying a larger weight to the global histogram. For the color layout descriptor (called here cls), the (YCbCr) color space is used: the image is divided into 8x8 image blocks and the average color of each block is determined. An 8x8 DCT is then carried out, and the quantized DCT coefficients are used to determine the similarity between two images using the L2 distance, with larger weights assigned to the lower-frequency coefficients. Several other MPEG-7 compliant features, as well as extensions and newly defined visual features, are also available to the user. For example, we use color correlograms to capture the relation between similar color agglomerates inside the image, thereby exploiting the spatial distribution of colors, which is lost when using normalized color histograms only. This is a very powerful descriptor, but it is computationally heavy; in our experiments we therefore applied a simplification, using autocorrelograms and the L1 distance for measuring similarity between images.
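To make the descriptor comparison concrete, the following Python sketch computes a simplified color autocorrelogram on a quantized image and compares two such descriptors with the L1 distance. The quantization to 64 colors and the distance set {1, 3, 5, 7} are illustrative choices, not the exact parameters used by the authors.

# Minimal sketch of a simplified color autocorrelogram with L1 comparison.
# `img` is assumed to be a 2-D array of quantized color indices; the number
# of colors and the distance set are illustrative assumptions.

import numpy as np

def autocorrelogram(img, n_colors=64, distances=(1, 3, 5, 7)):
    h, w = img.shape
    desc = np.zeros((n_colors, len(distances)))
    counts = np.bincount(img.ravel(), minlength=n_colors).astype(float)
    for di, d in enumerate(distances):
        for dy, dx in ((0, d), (d, 0)):   # horizontal and vertical offsets of size d
            a = img[:h - dy, :w - dx]
            b = img[dy:, dx:]
            match = (a == b)
            for c in range(n_colors):
                desc[c, di] += np.logical_and(match, a == c).sum()
        desc[:, di] /= np.maximum(counts, 1)   # normalize per color
    return desc

def l1_distance(d1, d2):
    return np.abs(d1 - d2).sum()

# Example on two small random "images" of quantized color indices.
rng = np.random.default_rng(1)
img1 = rng.integers(0, 64, size=(32, 32))
img2 = rng.integers(0, 64, size=(32, 32))
print(l1_distance(autocorrelogram(img1), autocorrelogram(img2)))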
3 Annotating Video Via Prototype Video Frames

3.1 Keywords and Frame or Image Prototypes

As mentioned before, we used LSCOM-lite-like annotation keywords in our experiments. These include "fire", "map", "flag", "building", "waterscape/waterfront", "mountain", "basketball", "soccer", "car", and "people walking/running". We dropped those that can be detected via motion features only, such as running or walking, because our purpose here is to show the power of discrete multimedia content types, such as still images and their related text, in video annotation. Prototype or example video frames can either be selected randomly or determined from user preferences. To enable the latter, we provide a browsing tool capable of both visual and textual browsing, which makes it easier for the user to find example or prototype images. For visual browsing, the user can select images using any device or tool on his/her computer and then repeatedly carry out searches based on the visual features of the selected image. For keyword-based browsing, the user is given a tree of concepts, built using the WordNet lexical database [21], which he/she can browse to find sample or prototype images for both keyword-based searching and automatic annotation, as explained below.

3.2 Annotation by Video Frame or Image Prototypes

With our annotation method, we can either select prototypes of a given concept or category randomly, or leave it to the user to specify his or her desired frames or images. The user can select one or multiple images for the annotation task, depending on his/her requirements. We then use these prototype or training images to annotate video frames without using any metadata or human intervention. We proposed and carried out the following annotation scheme:
Step 1. We extracted visual features, as explained above, for all 17,269 TRECVID 2005 shot boundary frames and also for the 20,000 still images: 12,500 from the Corel Gallery collection and the rest from home photo or video images. For the TRECVID 2005 video data, we used the annotations provided for the shot boundary images and set the number of prototype frames to between 3 and 5. For the Corel and home image collection, we used the same number of image prototypes or training data, but these provide only very vague annotations, such as Corel category names or a few specific annotations provided by the producer of the given image.

Step 2. Next we used a self-developed searching tool to find video frames or images that are visually similar to the prototypes and can thus be used for annotation. This process is very similar to vector quantization with multiple examples selected as prototypes. In this step, we tried two simple algorithms: (1) retrieving images on the basis of their combined similarities over the feature spaces, using a 95% similarity ratio; (2) retrieving 60 similar images for each prototype video frame or image.

Step 3. In both cases, we merged the frames or images according to their ranking for each visual feature. In this way, the frames or images that are most similar to the prototypes across all of the visual features, i.e., those closest to the cluster centers represented by the prototype frames or images, are ranked highest. The similarity value and the number of prototype images were selected empirically by several users.

Step 4. As it is desirable and also possible to select several prototype images to represent a given content, we let the user select multiple prototype images, or select them randomly using the keyword-based browsing tool and the generated concept trees. In this way, we obtain several ranked lists, one per prototype frame or image.

Step 5. The next problem is how to obtain the final ranked results for the retrieved but not yet annotated frames or images. In our experiments this is done by calculating the minimum rank over each feature for each prototype image or frame. The final ranked list is then created by merging the ranking lists obtained for each prototype into a combined one (a minimal sketch of this merging step is given after Table 2). A problem arises when an image does not appear in the result list of a given visual feature; we overcome this by assigning a relatively large penalty rank (set to 100 in our experiments) to those frames or images. It is also possible for a given frame or image to appear several times, or for a conflict to arise between ranks (two frames or images assigned the same rank). Frames or images that appear more than once are handled by using their higher overall rank. For frames or images with conflicting ranks, we use a "first calculated, first put" rule.
Fig. 1. Overview of annotation process by selecting multiple frames as prototypes
Table 1. Detected image regions selected from different video categories (basketball, building, mountain, flag, map, waterfront)
Step 6. The user is also provided with a tool to set weights for the features he/she prefers most. In this case, he/she can adjust the weights of the individual features (or even the weights of the prototype images chosen by the user or by the system itself). The ranking is then calculated by applying the selected weight to each feature or frame/image, and the final ranked results are generated accordingly.

Step 7. Steps 1 to 6 are repeated until all of the frames or images have been annotated and are thus provided with metadata in the system. These annotations can then be reassigned by further recalculation or step repetitions.

An overview of the annotation process is shown in Figure 1. We have also set up a demonstration site at the URL http://www.teu.ac.jp/akm/, which shows annotation results for several concepts as html files. As the TRECVID 2005 data is available only to registered members, those results cannot be shown publicly.

Table 2. Precision obtained for annotations

Annotation  | Precision (top 30th) | Precision       | Ground truth
basketball  | 87% (26/30)          | 51% (214/418)   | 556
car         | 63% (19/30)          | 73% (371/508)   | 3,390
fire        | 50% (15/30)          | 34% (170/494)   | 748
flags       | 47% (14/30)          | 24% (80/329)    | 2,144
golf        | 67% (20/30)          | 17% (84/493)    | 173
maps        | 60% (18/30)          | 17% (82/496)    | 910
mountain    | 60% (18/30)          | 33% (164/503)   | 626
soccer      | 67% (20/30)          | 27% (128/468)   | 927
waterfront  | 60% (18/30)          | 42% (218/519)   | 1,080

Precision (top 30th) = no. of frames with matching annotation among the top 30 ranked frames. Precision = no. of frames with matching annotation / no. of final ranked frames for the topic. Ground truth = no. of frames with ground truth for the given topic among the 17,269 frames.
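The following Python sketch illustrates the rank-merging of Steps 3 to 6 under simplifying assumptions: per-feature, per-prototype rank lists are merged by taking minimum ranks, a penalty rank of 100 is assigned to missing entries, and optional feature weights are applied. The variable names and data layout are our own illustrative choices, not taken from the original system.

# Minimal sketch of the rank-merging step (Steps 3-6), under simplifying
# assumptions. rank_lists[prototype][feature] maps image ids to ranks;
# missing images receive a penalty rank of 100; feature weights are optional.

PENALTY_RANK = 100

def merge_ranks(rank_lists, all_image_ids, feature_weights=None):
    features = next(iter(rank_lists.values())).keys()
    merged = {}
    for image_id in all_image_ids:
        score = 0.0
        for feature in features:
            # Minimum rank over all prototypes for this feature (Step 5).
            best = min(proto_feats[feature].get(image_id, PENALTY_RANK)
                       for proto_feats in rank_lists.values())
            weight = 1.0 if feature_weights is None else feature_weights.get(feature, 1.0)
            score += weight * best
        merged[image_id] = score
    # Lower combined score = higher final rank; ties are broken by insertion
    # order ("first calculated, first put").
    return sorted(merged, key=merged.get)

# Example with two prototypes, two features and three candidate images.
rank_lists = {
    "proto_a": {"ehd": {"img1": 3, "img2": 7}, "cls": {"img1": 5}},
    "proto_b": {"ehd": {"img2": 2}, "cls": {"img2": 4, "img3": 9}},
}
print(merge_ranks(rank_lists, ["img1", "img2", "img3"]))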
3.3 Region Detection and Region-Based Annotation

Segmentation of frames or images is carried out with a self-developed multi-scale segmentation model based on both color and texture properties, defined in one of our previous works [22]. This nonlinear inhomogeneous diffusion segmentation model is defined as:

\frac{\partial I(x, y, t)}{\partial t} = \mathrm{div}\left( c(x, y, t)\, \mathrm{grad}\{ I(x, y, t) \} \right), \qquad I(x, y, 0) = I_0(x, y),   (1)

where I denotes the color feature vector and the function c(x, y, t) describes the diffusivity, defined as c(x, y, t) = 1 / (1 + (\| \mathrm{grad}\{I\} \| / K)^2), where K is the conductance parameter. We modify this model by defining the conductance parameter K as a function of the texture gradient. In this way, the boundaries of both colored and textured areas are preserved during the whole diffusion process, producing adequate image regions. The reader is referred to [22] for a detailed explanation of the above inhomogeneous diffusion model and of the image segmentation process. An online demonstration of the image segmentation can be found at the same URL: http://www.teu.ac.jp/akm/. In further processing steps, we select the segments that are most relevant to retrieval, typically 4-6 segments per image. These are determined on the basis of their size, geometric properties, and layout features, or on the saliency of their features. Next, we create a visual description of each segment by calculating MPEG-7 compliant visual features for each segment, such as color layout descriptors (cls) and edge histogram descriptors (ehd). Examples of the segmentation process are illustrated in Table 1.
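As an illustration of Equation 1, the following Python sketch performs explicit diffusion updates on a gray-scale image with the diffusivity given above; it ignores the texture-gradient modification of K and the color feature vector, and is not the authors' multi-scale implementation.

# Minimal sketch of one explicit step of the inhomogeneous diffusion in
# Equation 1 on a gray-scale image, with c = 1 / (1 + (|grad I| / K)^2).
# The texture-dependent conductance of the paper is not modelled here.

import numpy as np

def diffusion_step(I, K=10.0, dt=0.2):
    # Forward differences for the gradient, backward differences for the divergence.
    gx = np.diff(I, axis=1, append=I[:, -1:])
    gy = np.diff(I, axis=0, append=I[-1:, :])
    c = 1.0 / (1.0 + (np.hypot(gx, gy) / K) ** 2)
    div = (np.diff(c * gx, axis=1, prepend=(c * gx)[:, :1])
           + np.diff(c * gy, axis=0, prepend=(c * gy)[:1, :]))
    return I + dt * div

# Example: a noisy step edge is smoothed inside regions but kept at the edge.
rng = np.random.default_rng(0)
img = np.hstack([np.zeros((32, 16)), np.full((32, 16), 100.0)])
img += rng.normal(0, 2, img.shape)
for _ in range(20):
    img = diffusion_step(img)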
Fig. 2. Example list of frames obtained for annotation “basketball” (TRECVID 2005 data)
Fig. 3. Example list of images obtained for annotation “painting” (Corel Gallery)
4 Experiments with Both TRECVID Video Frames and Images

To evaluate our annotation system, we use video frames from the TRECVID 2005 video data collection, which contains about 70,000 video frames, according to the 10 concepts chosen for the TRECVID 2005 evaluation: People walking/running, Explosion or fire, Map, US flag, Building exterior, Waterscape/waterfront, Mountain, Prisoner, Sports, and Car. As previously mentioned, we chose 17,269 frames from this large data set. We use these data and their assigned annotations as ground truth and then analyze how well we can approximate or re-annotate these images by choosing example images, either selected by the user or randomly. We also segment these images and check the annotation results for the regions in the same way. Precision results for the proposed annotation method are shown in Table 2. A ranked video frame list example for the keyword "basketball" (Sports), using the TRECVID video frames, is shown in Figure 2. Figure 3 illustrates a similar list example using the Corel Gallery collection for the keyword "painting". Figure 4 shows an
example frame list using segment features for the keyword "announcer". We obtain an average precision over our nine concepts of about 62% when using the first 30 resulting images, and about 36% when using all the images annotated with a given concept in only one cycle. The results show that relatively high precision and nearly satisfactory results can be obtained for the TRECVID 2005 data using three to five example or prototype images and their ranked visual search results, with 60 to 100 result sets according to the given concepts. We also tried the same process with the Corel Gallery collection and with our home images, and obtained even higher precision, mainly due to the high quality of the images. For the region-based annotation, the precision drops due to the very large number of regions and their complex structures within the frames and images; here we have to tune the method and incorporate more structural features of the regions. However, the annotation results obtained for whole frames at video shot boundaries illustrate that automatic annotation of video frames provides good results when using discrete multimedia content, such as still images and textual information, for video annotation.
Fig. 4. An example list of frames obtained by region-based annotation for the keyword “announcer” (TRECVID 2005 data)
5 Conclusions

This work has proposed a new approach for annotating videos taken from the TRECVID 2005 data using static visual features and metadata of still image frames. The purpose of the developed method is to provide annotation or tagging tools that let users incorporate multimedia data, such as video or still images as well as text, into searching or other combined applications running either on the web or on other networks. We use MPEG-7 compliant visual features as well as textual information and
allow the user to select both the prototype or training set and to adaptively adjust the weights of the visual features that seem most adequate to him/her, thus finding a way to bridge the so-called semantic gap. The annotation results obtained for video shot boundaries illustrate that automatic annotation of video frames can be accomplished satisfactorily using discrete multimedia content such as still images and textual information. The proposed method is simple and fast to implement, uses a very small set of training data, and requires little or no user intervention. It also has the advantage that it can be applied to any combination of visual and textual features. It can also easily be parallelized and thus applied to a cluster or grid of computers, yielding a robust and fast tool even for handling a huge set of multimedia content.

Acknowledgments. The authors would like to thank PE. D. Dawson of KIA CONSULTANTS, Mr. K. Matsumoto and Dr. M. Naitoh of KDDI R&D Laboratories Inc. for their invaluable support of this research.
References
1. Zhao, R., Grosky, W.I.: Negotiating the semantic gap: from feature maps to semantic landscapes. Pattern Recognition 35, 593-600 (2002)
2. Zhou, X.S., Huang, T.S.: Unifying Keywords and Visual Contents in Image Retrieval. IEEE Multimedia 9(2), 23-33 (2002)
3. Barnard, K., et al.: Matching Words and Pictures. Journal of Machine Learning Research 3, 1107-1135 (2003)
4. Hofmann, T.: Learning and Representing Topic: A Hierarchical Mixture Model for Word Occurrences in Document Databases. In: Proc. of CONALD, Pittsburgh (1998)
5. Wang, J.Z., Li, J.: Learning-based linguistic indexing of pictures with 2-D MHMMs. In: Proc. ACM Multimedia, pp. 436-445. ACM Press, New York (2002)
6. Lim, J.-H., Tian, Q., Mulhem, P.: Home Photo Content Modeling for Personalized Event-Based Retrieval. IEEE Multimedia 9(2), 28-37 (2003)
7. Matsumoto, K., et al.: SVM-based Shot Boundary Detection with a Novel Feature. In: ICME. Proc. IEEE International Conference on Multimedia and Expo, pp. 1837-1840 (2006)
8. Qi, G., et al.: Video Annotation by Active Learning and Cluster Tuning. In: CVPRW. Proc. of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, pp. 114-121. IEEE Computer Society (2006)
9. Vapnik, V.N.: The Nature of Statistical Learning Theory, 2nd edn. Springer, Heidelberg (1999)
10. Carbonetto, P., de Freitas, N., Barnard, K.: A Statistical Model for General Contextual Object Recognition. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 350-362. Springer, Heidelberg (2004)
11. Gabrilovich, E., Markovitch, S.: Feature Generation for Text Categorization Using World Knowledge. In: Proc. of the 19th International Joint Conference on Artificial Intelligence (2005)
12. Metzler, D., Manmatha, R.: An inference network approach to image retrieval. In: Enser, P.G.B., Kompatsiaris, Y., O'Connor, N.E., Smeaton, A.F., Smeulders, A.W.M. (eds.) CIVR 2004. LNCS, vol. 3115, pp. 42-50. Springer, Heidelberg (2004)
13. Manjunath, B.S., et al.: Introduction to MPEG-7. Wiley, Chichester (2002)
14. Smith, J.R., et al.: Large-Scale Concept Ontology for Multimedia. IEEE Multimedia 14(1), 86-90 (2007)
15. Hanbury, A.: MUSCLE, Guide to annotation, Version 2.12. Tech. Univ. of Vienna (2006)
16. Hanbury, A.: Analysis of Keywords Used in Image Understanding Tasks. In: Proceedings of the OntoImage Workshop, Genoa, Italy (2006)
17. Boll, S.: MultiTube - Where Web 2.0 and Multimedia Could Meet. IEEE Multimedia 14(1), 9-13 (2007)
18. Paul, O.: Guidelines for the TRECVID 2005 Evaluation (2006), http://wwwnlpir.nist.gov/projects/tv2005/
19. Manjunath, B.S., et al.: Color and Texture Descriptors. IEEE Trans. on Circuits and Systems for Video Technology 11(6), 703-715 (2001)
20. Huang, J., et al.: Image indexing using color correlograms. In: Proc. IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, pp. 762-768. IEEE Computer Society Press, Los Alamitos (1997)
21. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
22. Kutics, A., Nakagawa, A.: Detecting Prominent Objects for Image Retrieval. In: Proc. IEEE International Conference on Image Processing (ICIP), vol. 3, pp. 445-448 (2005)
Data Hiding on H.264/AVC Compressed Video

Sung Min Kim, Sang Beom Kim, Youpyo Hong, and Chee Sun Won

Dept. of Electronic Eng., Seoul, 100-715, South Korea
MMChips Co., Ltd.
[email protected]

Abstract. An important issue in embedding watermark bits in a compressed video stream is to keep the bit-rate unchanged after watermarking. This is a very difficult problem for highly efficient compression methods such as H.264/AVC, because altering even one bit in a highly compressed bit-stream may affect the video content widely. In this paper we solve this problem by embedding the watermark bit into the sign bit of the Trailing Ones in the Context Adaptive Variable Length Coding (CAVLC) of H.264/AVC. The algorithm yields no bit-rate change after the data hiding. Also, we can easily balance the capacity of the watermark bits against the fidelity of the video. The simplicity of the proposed algorithm is an added bonus for real-time applications. Our experiments show that the PSNRs of the video sequences after data hiding are higher than 43 dB.
1 Introduction
In the past few years, the need for embedding digital messages into digital video has attracted a great deal of interest in digital TV applications such as fingerprinting, authentication, broadcast monitoring, and the broadcasting of auxiliary information. Because most digital video content is distributed and stored in the form of a compressed bit-stream, data embedding is normally done in the compression domain with minimal decoding. H.264/AVC [1] is the most recent video coding standard and offers much higher compression efficiency. It is expected that H.264/AVC will become the most popular coding standard for broadcasting over wireless channels and internet media. Also, since the next-generation broadcasting systems such as MVC (Multi-view Video Coding) [2] and SVC (Scalable Video Coding) [3], which are being standardized by the JVT (Joint Video Team) of ISO/IEC MPEG and ITU-T VCEG, are based on H.264, it is necessary to devise a watermarking algorithm for the H.264/AVC compressed bit-stream. In H.264/AVC, however, to achieve higher compression efficiency, new compression features such as variable block-size motion compensation, directional spatial prediction and context-adaptive entropy coding have been adopted. Because of these new features, the conventional data hiding techniques proposed for the compressed bit-streams of MPEG-2/4 can cause unexpected problems when applied directly to H.264/AVC compressed video. Specifically, since they basically modify the quantized DCT (QDCT) coefficients, the compressed
bit-rate is prone to change after the modification and the visual quality can deteriorate. Data hiding solutions have been proposed for the MPEG-2/4 compression domains. In [4], perceptual models are adopted to mitigate the visual artifacts. In [5] and [6], the authors embed watermarks into the QDCT coefficients only when the bit-rate after watermarking becomes equal to or less than the original one. In [7] and [8], watermarks are embedded into the middle frequency areas of the 8x8 DCT domain, because watermarking the lower frequency areas may cause severe visual artifacts and a bit-rate increase. The modification of QDCT coefficients in the MPEG-2/4 compression domains has been used in various applications. In [9], a small picture is embedded into the original source video using frequency masking of the QDCT coefficients. In [10], features representing the content of a block are hidden in the QDCT coefficients of its companion block for error concealment. In [11], a content-associative signature is inserted in consideration of channels with burst errors. The authors of [12] proposed a method in which information is embedded into the QDCT coefficients to highlight regions of the frame corrupted by transmission errors. Recently, in [13], a data hiding technique was adopted for video quality estimation: a binary image adjusted by a strength factor is inserted into the QDCT coefficients for the assessment of video quality at the receiver. For all of the above techniques, the main issues were visual artifacts and bit-rate increase due to the coefficient modifications; that is, the methods mentioned above were designed to prevent the visual degradation and bit-rate increase induced by modifying the QDCT coefficients. Now, our concern is whether those previous methods are still applicable in the H.264/AVC compression domain. Unfortunately, they are not suitable for H.264/AVC bit-streams, simply because H.264/AVC adopts a smaller DCT block (i.e., 4x4) and intra prediction, which can cause serious visual artifacts when modified. Moreover, since the QDCT coefficients in H.264/AVC are coded via context adaptive entropy coding, which adopts numerous different VLC (Variable Length Coding) tables, the bit-rate after QDCT modification can change significantly. With these issues in mind, we investigate which elements of the QDCT coefficients in the H.264/AVC bit-stream can minimize the visual artifacts and keep the original bit-rate after the QDCT modifications. Note that keeping the bit-rate unchanged is a necessary requirement in a broadcasting environment, since no re-packetizing is then required. Thus, our goal is to devise a data hiding algorithm in the H.264/AVC compression domain that causes no bit-rate change. Note that we are dealing with bit modifications in the compression domain in this paper. Thus, the survival of the embedded watermark against various attacks is definitely dependent on the structure and the parameters of the H.264 encoder. This implies that this data embedding may not be robust against attacks such as re-compression with different compression parameters and some signal processing operations [14]. Thus, our data embedding in the QDCT coefficients should be regarded as fragile watermarking for video authentication, or as a container for extra bit transmission (e.g., see the applications in [10]-[13]).
This paper is organized as follows: Section 2 demonstrates the problems that arise when previous watermarking or data hiding methods are directly applied to H.264/AVC compressed data. In Section 3, the CAVLC process used in H.264/AVC and its characteristics are briefly analyzed. The proposed novel data embedding algorithm is explained in detail in Section 4. The experimental results are presented in Section 5, followed by the conclusion in Section 6.
2 Applying Previous Watermarking Methods to H.264/AVC

2.1 Visual Degradations
To demonstrate the visual degradation caused by modifying QDCT coefficients in an H.264/AVC bit-stream, we alter the LSBs (Least Significant Bits) of the 4x4 QDCT coefficients at positions (0,0), (1,1), (2,2) and (3,3) of the Y-picture. In our experiments, we used the JVT reference model JM9.8 [15], baseline profile, level 2.0, I sequences only, 1 slice per frame and 30 frames per second. The video data used in our experiments are Claire and Foreman, with 100 frames. Fig. 1 shows the results for different QPs (Quantization Parameters) at CIF (352x288) resolution. As one can see, the PSNRs after altering the LSBs of the QDCT coefficients of DC, ac(1,1), ac(2,2), and ac(3,3) drop significantly; the average PSNR after modification is only 19.04 dB. These severe degradations are caused by intra prediction. This shows that direct modification of the QDCT coefficients is not desirable for data embedding in the H.264/AVC compression domain.
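The kind of modification tested here can be pictured with the following Python sketch, which flips the least significant bit of the QDCT coefficients at positions (0,0), (1,1), (2,2) and (3,3) of a 4 x 4 block. It is an illustration only; the actual experiment was run inside the JM reference encoder.

# Minimal sketch of the LSB-flipping experiment on a 4 x 4 QDCT block
# (positions (0,0), (1,1), (2,2), (3,3)). Illustrative only.

import numpy as np

def flip_lsbs(qdct_block, positions=((0, 0), (1, 1), (2, 2), (3, 3))):
    modified = qdct_block.copy()
    for r, c in positions:
        modified[r, c] ^= 1   # flip the least significant bit
    return modified

block = np.array([[12, 3, 0, 0],
                  [5, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 0]])
print(flip_lsbs(block))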
Fig. 1. PSNR variations after the modifications of various QDCT coefficients: (a) Claire and (b) Foreman
2.2 Bit-Rate Increase
Another problem of data embedding in QDCT coefficients is the bit-rate increase. This can have serious consequences, because it can destroy the synchronization between audio and video. Once the synchronization is destroyed, the video elementary bit-stream needs to be re-packetized for transmission, which is an obvious computational overhead. Fig. 2 shows the bit-rate variations when the bit-modification method used in MPEG-2 and 4 is directly applied to the H.264/AVC bit-stream. As one can see in Fig. 2, modifications of the QDCT coefficients in H.264/AVC affect the bit-rates seriously. Note that there are new elements in H.264/AVC, namely trailing ones (T1) and total coefficients (TC), which are quite sensitive to variations of the QDCT coefficients. The information about T1 and TC is coded via CAVLC (Context Adaptive Variable Length Coding), which makes use of various VLC tables according to the statistical features of previously coded adjacent blocks.
Fig. 2. Bit-rate variations when the bit-modification methods used in MPEG-2 and 4 are directly applied for DCT coefficients (i.e., DC, ac(1,1), ac(2,2), and ac(3,3)) in the H.264/AVC bit-stream: (a) Claire and (b) Foreman
3 CAVLC in H.264/AVC

3.1 The Feature Elements for CAVLC
Fig. 3 illustrates the CAVLC process for 4x4 QDCT residuals. First, the QDCT coefficients are scanned in a zig-zag order and five elements for CAVLC are extracted from the zig-zag scanned data: "Coefficient token", "Sign of T1", "Level", "Total zeros", and "Run before", as shown in Fig. 3. "Coefficient token" represents the number of non-zero coefficients (TC: Total Coefficients) and trailing ones (T1). There are four look-up tables to be selected for encoding the "Coefficient token" of each 4x4 block: three variable length code tables and a
Fig. 3. The elements for CAVLC [1]
fixed one. "Level" denotes the values of the remaining nonzero coefficients other than the T1s in the block, and "Run before" is the number of zeros preceding each non-zero coefficient in the reverse zig-zag order. Note that the four elements other than "Level" (i.e., "Coefficient token", "Sign of T1", "Total zeros" and "Run before") are closely related to TC and T1; that is, variations of TC and T1 affect the overall bit-rate. Therefore, our strategy for data hiding is not to change the TC and T1 values. A small sketch of how TC and T1 are obtained from a zig-zag scanned block is given below.
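To make the roles of TC and T1 concrete, the following Python sketch counts the total non-zero coefficients and the trailing ones of a zig-zag scanned 4 x 4 block; it is an illustrative helper, not taken from the H.264/AVC reference software.

# Minimal sketch: count TC (total non-zero coefficients) and T1 (trailing
# ones, at most 3) of a zig-zag scanned 4 x 4 QDCT block. Illustrative only.

def tc_and_t1(zigzag_coeffs):
    nonzero = [c for c in zigzag_coeffs if c != 0]
    tc = len(nonzero)
    t1 = 0
    # Count trailing +/-1 values in reverse scan order, up to a maximum of 3.
    for c in reversed(nonzero):
        if abs(c) == 1 and t1 < 3:
            t1 += 1
        else:
            break
    return tc, t1

# Example like the block in Fig. 3: five non-zero coefficients, the last
# three of which have magnitude 1 in scan order.
coeffs = [3, -2, 0, 1, 0, 1, -1] + [0] * 9
print(tc_and_t1(coeffs))   # (5, 3)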
3.2 Codebook for the CAVLC
As shown in Fig. 3, the CAVLC used in H.264/AVC has five elements to be coded. Unlike MPEG-2, each element has multiple codebooks. Therefore, it is more difficult to predict the bit-rate variations after modifications of the QDCT coefficients in H.264/AVC. There are four codebooks for the "Coefficient token" (the fourth codebook assigns a fixed 6-bit code to every combination of TC and T1). Fig. 4 illustrates the length of the "Coefficient token" for each codebook. The column represents TC (0-16) and
Fig. 4. Code length table for “Coefficient token”
the row is T1 (0-3). For example, in Fig. 3, since there are 5 nonzero coefficients including the last three consecutive 1's, TC and T1 have the values 5 and 3, respectively. So, as indicated by the bold blocks in Fig. 4, the possible code lengths of the "Coefficient token" are 7, 6, and 4 bits. Among them, the final bit length is determined by the number of nonzero coefficients in the previously coded left and upper blocks [1]. That is, since the TC and T1 of the current block affect the selection of the codebook for the next block, any change in TC can affect the code length of the next block. Also, any change of the T1 value due to a modification of the QDCT coefficients can easily lead to a different code length, as shown in Fig. 4, and eventually to a bit-rate change. This shows that the best strategy is not to change the T1 value. Fortunately, the "Sign of T1" has nothing to do with the code length. This implies that the sign information is a good candidate for data embedding, and we exploit this fact for our data hiding. The sign and magnitude of each remaining nonzero coefficient (i.e., "Level") in the block are encoded in the reverse scan order with a prefix and a suffix [1]. Note that we might also choose the "Level" to be modified for the embedding. However, one has to be careful when embedding a bit into the "Level", because the length of the suffix may be between 0 and 6 bits and is adapted depending on the magnitude of each successively coded "Level" (refer to [1] for more detail).
Table 1. The characteristics of each coding parameter

Coding parameter        | Values to be coded                                                   | Bit-length | Direct propagation due to coded bits  | Frequency components
Coeff token             | TC & T1                                                              | Variable   | Yes, to next block                    | All
Trailing ones sign flag | Signs of T1                                                          | Fixed      | -                                     | Highest
Level prefix            | Non-zero coefficients excluding T1                                   | Variable   | Yes, to the next low-frequency coeff. | Medium to low
Level suffix            | Non-zero coefficients excluding T1                                   | Variable   | Yes, to the next low-frequency coeff. | Medium to low
Total zeros             | Total number of zeros occurring after the first non-zero coefficient | Variable   | -                                     | NA
Run before              | Number of zeros preceding each non-zero coefficient                  | Variable   | -                                     | NA
Note that "Total zeros" is related to TC. Therefore, if the bit modification for the data embedding changes TC or T1, it affects the value of "Total zeros", and consequently the overall bit-rate may deviate from the original one. The number of zeros preceding each non-zero coefficient (i.e., "Run before") is encoded in the reverse scan order. If the bit modification for the data embedding changes TC, the overall bit-rate may again deviate from the original one because TC is also related to the number of zeros.
4 Proposed Bit-Embedding Method
As explained in Section 3, one should be careful in modifying QDCT coefficients of H.264/AVC when embedding bits. Again, any change of TC and T1s due to the bit-embedding can cause a bit-rate change and visual degradation, which may propagate to the next blocks and frames. Moreover, since there are multiple codebooks for TC and T1, it would be relatively difficult to single out the modified QDCT elements for the watermark detection in H.264/AVC. To alleviate these problems, the bit-embedding method for the QDCT should satisfy the following requirements: (i) no bit-rate change: fixed-length coding parameters are preferable; (ii) minimization of visual degradation: high-frequency components are preferable. Now we can determine the best coding parameter for the bit-embedding that satisfies the above requirements. As shown in Table 1, the "trailing-ones-sign-flag" appears to fulfill the above requirements. That is, each sign bit of the trailing ones is encoded by a fixed single bit such that "+1" is coded to a binary number "0" and "−1" to "1". This is a fixed-length (one-bit) coding. Also, the sign bits belong to the trailing ones, and they are the QDCT coefficients at the highest
frequency components of the 4 × 4 QDCT. Thus, if we change the sign bit of the "trailing ones" for the bit-embedding, then the two requirements mentioned above are satisfied. Let us denote the coded sign bits arranged in the reverse zig-zag order of the i-th DCT block as S_{i1}, S_{i2}, and S_{i3}. Since the maximum number of T1s is 3, we have three sign bits at most. Also, S_{i1} is the sign bit of the first trailing one in the reverse order of the zig-zag scan, which implies that S_{i1} is the highest frequency component among the non-zero coefficient values. Now, denoting the binary bit to be embedded at the i-th 4 × 4 DCT block as b_i, for the number of T1 = k, we can embed the watermark bit b_i into the sign bit S_{i1} of the first T1 in the reverse order as follows:

S'_{i1} = \begin{cases} \bar{S}_{i1}, & \text{if } (b_i \neq S_{i1}) \text{ and } (k \geq 1) \\ S_{i1}, & \text{otherwise,} \end{cases} \quad (1)

where \bar{S}_{i1} represents the complement of the original sign bit S_{i1} and S'_{i1} denotes the modified sign bit. As a result of the embedding in (1), if k ≥ 1, the first sign bit of the T1s will be equivalent to the watermark bit to be embedded. Thus, only the last (and highest) frequency component of the zig-zag scanned QDCT will be altered due to the bit-embedding, if there is at least one T1. The watermark extraction on the receiver side is simply expressed as

\hat{b}_i = \hat{S}_{i1}, \quad \text{if } (k \geq 1), \quad (2)

where \hat{b}_i represents the extracted watermark bit at the i-th DCT block and \hat{S}_{i1} is the received first sign bit of the trailing ones.
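A minimal sketch of Eqs. (1) and (2) is given below. It assumes the T1 sign bits of one block are available as a list in reverse zig-zag order, coded as in the text ("+1" -> 0, "−1" -> 1); re-encoding the modified bit into the CAVLC bit-stream is not shown, and the function names are illustrative.

```python
def embed_bit(t1_signs, b):
    """Eq. (1): force the sign bit S_i1 of the first trailing one to equal
    the watermark bit b, when the block has at least one trailing one."""
    signs = list(t1_signs)
    if len(signs) >= 1 and signs[0] != b:
        signs[0] = 1 - signs[0]        # complement S_i1 so that S_i1 == b
    return signs

def extract_bit(t1_signs):
    """Eq. (2): the extracted bit is the received S_i1 (None if k = 0)."""
    return t1_signs[0] if len(t1_signs) >= 1 else None
```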
5 Experiments
In our experiments, we altered all Si1's to check the worst case. Fig. 5 shows the average PSNR and payload of one frame for various bit-rates at QCIF (176×144) resolution. As one can see, all PSNRs are higher than 43 dB even in the 1000-bit embedding case. Table 2 shows the comparison between our proposed method and the method of [16]. As shown in Table 2, there is no bit-rate increase for the proposed method after the bit embedding. Also, the payload is larger than that of [16]. However, although the PSNR values for individual watermarked intra frames are higher than 43 dB, a flickering artifact may occur in the temporal direction of the video sequences. This subjective visual artifact gets worse as the image size grows, because the errors due to the bit modifications are accumulated throughout the intra frame in raster scan order by the intra predictions. Thus, the simplest way to alleviate this artifact is to reduce the watermark payload. In our experiments, the visual artifacts are not noticeable when a 1-bit watermark signal per macroblock is embedded.
Fig. 5. The variations of (a) PSNR and (b) payload with various bit-rates
Table 2. Performance comparisons

Video sequence | Method of [16]: embedded bits / bit-rate increase | Proposed method: embedded bits / bit-rate increase / PSNR
Carphone       | 44 / 0.80%                                        | 869 / 0% / 43.3 dB
Claire         | 22 / 0.44%                                        | 933 / 0% / 43.7 dB
Mobile         | 85 / 0.23%                                        | 229 / 0% / 51.7 dB
Mother         | 42 / 0.69%                                        | 767 / 0% / 44.5 dB
Tempete        | 81 / 0.44%                                        | 345 / 0% / 47.6 dB

6 Conclusion
The modification of DCT coefficients is the most frequently used scheme for watermarking or data hiding in the MPEG compression domain. Specifically, the QDCT coefficients of the middle frequency area are often used for the data embedding. However, if we adopt the conventional data embedding schemes developed for MPEG-2 and 4 for the H.264/AVC bit-stream, they can cause severe visual degradation and bit-rate increase. That is, the modification of the QDCT in the H.264/AVC compression domain can alter TC and T1 values, which in turn distorts the visual quality of the corresponding block and deteriorates the other blocks as well because of the intra prediction adopted in H.264/AVC. Also, since various VLC tables are used for TC and T1, the bit-rate can be altered by modifications of TC and T1. To solve this problem, we propose a new data embedding scheme for the H.264/AVC bit-stream, which uses the sign of the first T1 in reverse zig-zag order and never alters TC and the T1s. Experimental results show that the proposed method maintains a constant bit-rate after the modification. However, one thing to be further improved is the subjective visual quality degradation due to the error propagation of the intra prediction.
Acknowledgement. This research was partially supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) (IITA-2006-(C1090-0603-0011)).
References

1. Richardson, I.E.G.: H.264/AVC and MPEG-4 Video Compression. Wiley, Chichester (2003)
2. Vetro, A., Su, Y., Kimata, H., Smolic, A.: Joint Multiview Video Model (JMVM) 2.0, Output Document of JVT Meeting at Hangzhou, N8459 (October 2006)
3. Wiegand, T., Sullivan, G., Reichel, J., Schwarz, H., Wien, M.: Joint Draft 8 of SVC Amendment, Output Document of JVT Meeting at Hangzhou, JVT-U201 (October 2006)
4. Alattar, M.: Digital Watermarking of Low Bit-Rate Advanced Simple Profile MPEG-4 Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology 13(8), 787–800 (2003)
5. Simitopoulos, D., Tsaftaris, S.A., Boulgouris, N.V., Strintzis, M.G.: Fast MPEG Watermarking for Copyright Protection. In: IEEE International Conference on Electronics, Circuits and Systems, September 2002, vol. 3, pp. 1027–1030. IEEE Computer Society Press, Los Alamitos (2002)
6. Hartung, F., Girod, B.: Digital Watermarking of MPEG-2 Coded Video in the Bit Stream Domain. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1997, pp. 21–24 (1997)
7. Hsu, C.T., Wu, J.L.: DCT-Based Watermarking for Video. IEEE Transactions on Consumer Electronics 44, 206–216 (1998)
8. Barni, M., Bartolini, F., Checcacci, N.: Watermarking of MPEG-4 Video Objects. IEEE Transactions on Multimedia 7, 23–32 (2005)
9. Chae, J.J., Manjunath, B.S.: Data Hiding in Video. In: Proceedings, International Conference on Image Processing, vol. 1, pp. 311–315 (1999)
10. Yin, P., Liu, B., Yu, H.H.: Error Concealment using Data Hiding. In: IEEE International Conference on Acoustics, Speech and Signal Processing, May 2001, vol. 3, pp. 1453–1456. IEEE Computer Society Press, Los Alamitos (2001)
11. Yu, H.H., Peng, Y.: Multimedia Data Recovery using Information Hiding. In: IEEE Global Telecommunications Conference, vol. 3, pp. 1344–1348. IEEE Computer Society Press, Los Alamitos (2000)
12. Bartolini, F., Manetti, A., Piva, A., Barni, M.: A Data Hiding Approach for Correcting Errors in H.263 Video Transmitted over a Noisy Channel. In: IEEE Fourth Workshop on Multimedia Signal Processing, pp. 65–70. IEEE Computer Society Press, Los Alamitos (2001)
13. Farias, M.C.Q., Marco, C., Sanjit, K.M.: Objective Video Quality Metric based on Data Hiding. IEEE Transactions on Consumer Electronics 51, 938–992 (2005)
14. Wu, G.-Z., Wang, Y.-J., Hsu, W.H.: Robust watermark embedding/detection algorithm for H.264 video. Journal of Electronic Imaging 14, 13013 (2005)
15. JVT reference software home page - http://iphome.hhi.de/suehring/tml
16. Noorkami, M., Mersereau, R.M.: Compressed-domain Video Watermarking for H.264. In: IEEE International Conference on Image Processing, September 2005, vol. 2, pp. 11–14. IEEE Computer Society Press, Los Alamitos (2005)
Reduced Uneven Multi-hexagon-grid Search for Fast Integer Pel Motion Estimation in H.264/AVC

Cheong-Ghil Kim, In-Jik Lee, and Shin-Dug Kim

Dept. of Computer Science, Yonsei University, 134 Shinchon-Dong, Seodaemun-Ku, Seoul, Korea
{cgkim, ijlee, sdkim}@parallel.yonsei.ac.kr
Abstract. A reduced uneven multi-hexagon-grid search algorithm for fast integer pel motion estimation in H.264/AVC is presented. The objective is to reduce the number of candidates for block matching by predicting the likely area in which the minimum sum of absolute differences (SAD) can be found. For this purpose, the proposed algorithm employs directionally partial hexagon search patterns utilizing the motion vectors computed in previous stages, which supply the spatial correlation characteristics between adjacent macro blocks and the temporal ones between video frames. Experimental results show that the proposed method can save 39%~69% of the computational complexity compared with the original one at the cost of negligible degradation in RD performance. Keywords: H.264/AVC, motion estimation, multi-hexagon-grid search algorithm, uneven multi-hexagon-grid search.
1 Introduction

H.264/AVC, the latest video coding standard of the Joint Video Team (JVT) formed by ISO/IEC MPEG and ITU-T VCEG [1], has been applied to a wide range of applications, from video conferencing to high-quality broadcasting [2]. In Korea, it has been accepted as a new national standard for terrestrial Digital Multimedia Broadcasting (DMB) [3]. H.264/AVC is based on hybrid video coding and is similar in spirit to previous standards such as H.263 and MPEG-4, but achieves considerably higher coding efficiency, with up to 50% bit-rate saving at the same video quality. This is mainly due to the enhanced motion estimation (ME) comprising variable block-size ME (16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4), sub-pixel ME (1/2, 1/4, and 1/8-pel accuracy), and multiple frame referencing, which may consume up to 60% of the total encoding time with one reference frame and 80% with 5 reference frames [2]. Therefore, an efficient predictive fast ME algorithm is highly desirable to reduce the computational complexity while preserving good coding performance. In response to this requirement, various fast motion estimation (FME) algorithms have been proposed to reduce the encoding computational complexity of ME [4], [5] and implemented in the JM reference software [6]: a hybrid unsymmetrical-cross multi-hexagon-grid search (UMHexagonS) [7], a simplified UMHexagonS [8], an enhanced predictive zonal search (EZPS) [9], and a modified dynamic search range (DSR) [10].
These algorithms adopt several techniques to reduce the complexity of ME: search point predictions, early termination by predicted thresholds, and hybrid search patterns for UMHexagonS, simplified UMHexagonS, and EZPS, and reducing the search range using the motion vectors of neighboring blocks for the modified DSR. According to the simulation results of [11], the modified DSR is useful in some cases without any modification of the original ME structure, even though it does not save much time. With DSR, UMHexagonS and EZPS show the same ME execution time and RD (rate distortion) performance for integer-pel search, but the RD performance of simplified UMHexagonS with DSR is worse even though it is faster than the others. Also, EZPS requires more switches than UMHexagonS for one block [11]. To reduce the complexity of integer pel ME, which occupies about 70% of the whole ME computation in the H.264 reference software JM11.0 [6], this paper proposes a fast algorithm, called a reduced uneven multi-hexagon-grid search, which modifies one stage of UMHexagonS. The proposed algorithm achieves a computational reduction by skipping some search points based on prediction while maintaining the PSNR and bit rate at the same level as the original. This paper is organized as follows. The unsymmetrical-cross multi-hexagon-grid search algorithm is reviewed in Section 2. The proposed algorithm is described in Section 3. Experimental results are shown in Section 4. Finally, conclusions are given in Section 5.
2 Unsymmetrical-Cross Multi-hexagon-grid Search

UMHexagonS (unsymmetrical-cross multi-hexagon-grid search) is the integer pel ME employed in the latest H.264 reference software, JM11.0 [6]. This algorithm conducts the motion search hierarchically in 4 steps: a prediction of the current block's motion vector (MV), an unsymmetrical cross search, an uneven multi-hexagon-grid search, and an extended hexagon-based search. The first step determines the starting point of the search using a median predictor, which is the median value of the three motion vectors of the adjacent blocks (left, top, and top-right or top-left) of the current block. The unsymmetrical cross search, the second step, uses the prediction vector as the search center and extends in the horizontal and vertical directions, respectively. Here, the original cross search is modified asymmetrically: the horizontal search range equals W and the vertical search range equals W/2. The third stage, the uneven multi-hexagon-grid search, includes two sub-steps: first, a full search is carried out around the search center, and then a 16-point multi-hexagon-grid search strategy is taken. More details follow in the next section. Finally, the extended hexagon-based search (EHS) is used as a center-biased search algorithm, including a hexagon search and a diamond search in a small range, to refine the motion vector more accurately. Throughout these steps, the best search point generated by the previous step is used as the search center for the current step. The performance gains of UMHexagonS over fast full search in computation time without DSR are about 59.99% and 91.69% for total encoding and integer ME, respectively [10]. They rise to 61.88% and 94.24% with DSR [10]. Especially, with the help of the second and third steps, the motion estimation accuracy can be maintained as high as that of fast full search. Even though the performance gain of UMHexagonS is remarkable in computational complexity, this paper proposes a method to reduce the number of candidates for block matching even further in the stage of the uneven multi-hexagon-grid search.
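The median predictor of the first step can be sketched as follows; the neighbouring motion vectors are assumed to be available as (x, y) tuples, and the handling of unavailable neighbours (picture borders, top-left substitution) is omitted.

```python
def median_mv_predictor(mv_left, mv_top, mv_topright):
    """Component-wise median of the three neighbouring motion vectors,
    used as the starting point of the UMHexagonS search."""
    xs = sorted(v[0] for v in (mv_left, mv_top, mv_topright))
    ys = sorted(v[1] for v in (mv_left, mv_top, mv_topright))
    return (xs[1], ys[1])   # the middle of three values is the median
```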
3 Reduced Uneven Multi-hexagon-grid Search

The uneven multi-hexagon-grid search, the third stage of UMHexagonS, is optimized to cover large and irregular motions and is performed in two sub-steps. The first is a full search pattern within a 5×5 square around the search center given by the unsymmetrical-cross search, shown as step 1 in Fig. 1(a); the second is a 16-point hexagon pattern, extending with different scales from one to W/4 from the inner to the outer hexagon, shown as step 2 of Fig. 1(a). Here, W is defined as the search range. Consequently, the number of search points is (25 + 16×W/4). The point with the minimum SAD value will be used as the starting search point of the next step, EHS. The proposed algorithm, a reduced uneven multi-hexagon-grid search, devises directionally partial hexagon search patterns which take only selected points for block matching instead of the full 16 points, by predicting the likely area in which the minimum SAD can be found, while preserving the initial full square search. Fig. 1(b) depicts the new search patterns indicated on a coordinate plane that is divided into four quadrants by the x-axis and y-axis intersecting at the 0 point (the search center). Here, the hexagon search pattern is divided into 8 sub-patterns: 4 quarter-hexagon searches (QHSs), which belong to Quadrants I, II, III, and IV, respectively, and 4 half-hexagon searches (HHSs), which combine two adjacent quadrants either vertically or horizontally and lie to the right and left of the y-axis, and above and below the x-axis. A QHS selects only 5 points and an HHS only 9 out of the 16 points for block matching. As a result, the computational complexity of integer pel ME can be reduced when half- and quarter-hexagon searches occur frequently with the help of accurate predictions.
Fig. 1. (a) Uneven multi-hexagon-grid search patterns, (b) reduced uneven hexagon grid search patterns
Algorithm. The fundamental idea comes from the spatial correlation between adjacent macro blocks and the temporal correlation between video frames. Therefore, the prediction is based on the direction and magnitude of the MV from the up-layer and the neighboring reference frame. Here, up-layer means the block just above in the partition hierarchy (for example, the 16×16 block is the up-layer of the 16×8 and 8×16 blocks, 16×8 is the up-layer of the 8×8 block, etc.) because the search order in JM11.0 is hierarchical from 16×16 down to 4×4. Up-layer prediction uses the relationship among the 7 modes classified by block size, i.e., 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. It uses the motion vector of the larger block to get a starting search point for its associated smaller block. The proposed algorithm can be summarized as follows.

Step 1: The initial hexagon is centered on the origin of the search window generated by the previous step after completing a full square search.
Step 2: Refer to MV(x, y) from either the previous reference frame or the up-layer according to the macro block size. If the block size is 16×16, the MV comes from the previous reference frame because there is no up-layer; otherwise it comes from the up-layer.
Step 3: Check for the case of a zero MV; if so, execute only the innermost 16-point hexagon search and skip Step 4. Otherwise, go to the next step.
Step 4: Check the signs of the x and y components of the MV to select a quadrant; at this moment, one of the four QHSs is taken. Further, compare the absolute values of x and y. Unless they are equal, the pattern is enlarged either vertically (|x| < |y|) or horizontally (|x| > |y|) to take one of the four HHSs for a more accurate solution. Otherwise (|x| = |y|), the QHS is maintained.

Once the search pattern has been decided, the early termination policy of UMHexagonS can be applied to the 16-point multi-hexagon-grid searches extending from the inner to the outer hexagon for further reduction.
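The pattern decision of Steps 2-4 can be sketched as below. Only the choice of the search pattern is shown; the SAD evaluation, the full 5×5 square search and the early termination are omitted, and all names are illustrative.

```python
def choose_search_pattern(block_size, uplayer_mv, ref_frame_mv):
    """Return which part of the 16-point hexagon grid should be searched."""
    # Step 2: 16x16 blocks have no up-layer, so use the reference-frame MV.
    x, y = ref_frame_mv if block_size == (16, 16) else uplayer_mv
    # Step 3: zero MV -> search only the innermost 16-point hexagon.
    if x == 0 and y == 0:
        return "innermost hexagon only"
    # Step 4: equal magnitudes -> quarter-hexagon search in the quadrant
    # selected by the signs of x and y (5 of the 16 points).
    if abs(x) == abs(y):
        quadrant = ("I" if x > 0 else "II") if y > 0 else ("IV" if x > 0 else "III")
        return f"QHS, quadrant {quadrant}"
    # Otherwise enlarge vertically (|x| < |y|) or horizontally (|x| > |y|)
    # into a half-hexagon search (9 of the 16 points).
    if abs(x) < abs(y):
        return "HHS, " + ("above" if y > 0 else "below") + " the x-axis"
    return "HHS, " + ("right" if x > 0 else "left") + " of the y-axis"
```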
4 Experimental Results

To evaluate the performance of the proposed algorithm by comparing it with UMHexagonS, experiments have been carried out with JM 11.0 [6], the reference encoder software distributed by the JVT. The major parameters for encoding are as follows: the GOP (group of pictures) structure is IPPP; the QPs (quantization parameters) are 8, 18, 28, and 38; the number of reference frames is 5; the search range is 32; the Hadamard transform and RD optimization are enabled. All other parameters are set as for the Baseline profile. The benchmarks for the experiments are QCIF and CIF test sequences that are generally employed in video encoding tests to represent varying movements from small to large. They are encoded at 30 frames per second. Furthermore, because the major purpose of the tilt is only to decide the direction of the motion vector, we simplify this calculation by removing the division operation. Fig. 2 shows the average distribution ratio of each search pattern of the proposed method. Half-hexagon searches are evenly distributed; on the other hand, quarter-hexagon searches are rarely found compared with half-hexagon searches. The PSNR
Fig. 2. Distribution ratio of each search pattern
and bitrate comparisons between the original encoder and the one with the proposed algorithm are tabulated in Table 1. In this table, the execution time is normalized by that of UMHexagonS for the Total, ME, and Proposed columns using Eq. (1). The notations Proposed, Total, and ME denote the execution times of the proposed method, the full encoding, and the total ME, respectively. PSNR and average bitrate are calculated as the difference between the proposed algorithm and UMHexagonS. If the PSNR is a negative value, the quality of the video frames encoded with the proposed algorithm is degraded. On the other hand, if the bitrate is negative, the proposed algorithm needs fewer bits than the UMHexagonS algorithm to encode the video frames.

\text{Proposed}(\%) = \frac{\text{Time}_{\text{Proposed}}}{\text{Time}_{\text{UMHexagonS}}} \times 100 \quad (1)
The coding results are very similar. Compared with the original encoder, the PSNR degradations are negligible, at the cost of a 0.78 kbits/sec bitrate increase on average. The computational complexity comparisons are also given in Table 1. Instead of always taking the full 16-point hexagon searches, QHS and HHS were employed based on the prediction. Thus, up to 69% of the computations are saved in the uneven multi-hexagon-grid search, which brings up to 13.5% computational reduction in the total integer pel ME. The degradation of the overall average PSNR is almost zero, and, in the worst case, the degradation of Y in PSNR, which is the most perceptually sensitive component, is at most 0.01 dB. According to the experimental results, the quality of the encoded benchmarks with high and complex motion, such as football, mobile, and stefan, is better than that of the others even though the proposed algorithm prunes many search areas. This is because those reductions come from the prediction based on the directional relationships in the video sequences. We can also see that the normalized encoding time of the CIF-sized benchmarks is smaller than that of the QCIF-sized ones. Therefore, if a larger search range is applied to a larger video sequence, we can expect the encoding time to be reduced further.
Table 1. Simulation results (5 reference frames, search range = 32, and frequency = 30 Hz)

Benchmark (type, frames) | QP | Total (%) | ME (%) | Proposed (%) | PSNR Y (dB) | PSNR U (dB) | PSNR V (dB) | Avg. bitrate (kbits/sec)
mthr_dotr (QCIF, 299)    |  8 | 97.59 | 95.99 | 50.22 |  0.00 |  0.00 |  0.00 |  0.57
                         | 18 | 96.94 | 93.27 | 60.62 |  0.00 |  0.00 |  0.02 |  0.80
                         | 28 | 95.22 | 91.89 | 57.52 |  0.01 | -0.02 |  0.00 |  0.19
                         | 38 | 94.95 | 90.67 | 54.54 |  0.02 | -0.01 |  0.06 |  0.23
foreman (QCIF, 299)      |  8 | 98.72 | 96.12 | 53.08 |  0.00 | -0.01 |  0.00 | -0.52
                         | 18 | 98.52 | 93.35 | 45.12 | -0.01 | -0.01 |  0.00 | -0.40
                         | 28 | 98.23 | 93.62 | 49.73 |  0.01 |  0.00 |  0.00 |  0.11
                         | 38 | 97.28 | 94.20 | 30.78 |  0.03 | -0.21 | -0.08 |  0.02
silent (QCIF, 299)       |  8 | 98.67 | 93.81 | 51.44 |  0.00 |  0.00 |  0.01 |  0.03
                         | 18 | 98.57 | 91.23 | 45.61 |  0.00 | -0.01 | -0.02 | -0.18
                         | 28 | 98.54 | 92.30 | 52.27 |  0.00 |  0.01 | -0.02 |  0.05
                         | 38 | 96.74 | 92.54 | 36.75 |  0.04 |  0.06 |  0.00 |  0.23
football (QCIF, 90)      |  8 | 98.54 | 94.57 | 55.80 |  0.01 |  0.00 |  0.03 | -0.63
                         | 18 | 98.03 | 93.59 | 50.02 | -0.01 | -0.03 | -0.04 | -0.02
                         | 28 | 97.09 | 95.60 | 52.95 |  0.00 |  0.00 |  0.06 |  0.29
                         | 38 | 96.48 | 95.86 | 43.12 | -0.01 | -0.01 | -0.03 |  0.10
mobile (QCIF, 299)       |  8 | 96.54 | 93.02 | 57.36 |  0.00 |  0.01 |  0.00 |  7.26
                         | 18 | 95.83 | 92.26 | 50.44 |  0.01 | -0.01 |  0.01 |  4.57
                         | 28 | 94.62 | 90.50 | 52.60 |  0.01 |  0.07 |  0.01 |  3.39
                         | 38 | 91.83 | 86.50 | 45.56 |  0.02 |  0.06 | -0.02 |  3.20
news (CIF, 399)          |  8 | 96.56 | 90.54 | 58.59 |  0.00 |  0.00 |  0.00 | -1.60
                         | 18 | 93.00 | 89.67 | 56.22 |  0.00 |  0.00 |  0.01 | -0.08
                         | 28 | 94.21 | 88.85 | 55.94 |  0.00 |  0.00 |  0.01 |  0.63
                         | 38 | 93.46 | 88.18 | 59.61 |  0.01 |  0.00 |  0.03 |  0.13
paris (CIF, 299)         |  8 | 98.61 | 93.77 | 49.08 |  0.00 |  0.00 |  0.01 | -2.43
                         | 18 | 98.18 | 94.75 | 52.58 |  0.00 | -0.02 |  0.01 |  1.49
                         | 28 | 98.16 | 94.59 | 53.53 | -0.01 |  0.00 | -0.02 |  0.33
                         | 38 | 96.88 | 92.83 | 41.63 |  0.00 |  0.05 | -0.06 | -0.03
stefan (CIF, 90)         |  8 | 96.92 | 92.42 | 56.98 |  0.00 |  0.00 |  0.00 |  3.17
                         | 18 | 95.85 | 90.20 | 57.59 |  0.00 |  0.00 |  0.00 |  4.45
                         | 28 | 94.98 | 90.13 | 57.18 | -0.01 |  0.02 |  0.01 |  1.51
                         | 38 | 93.88 | 90.03 | 52.11 |  0.01 |  0.02 |  0.03 |  0.67
tempete (CIF, 259)       |  8 | 97.30 | 93.76 | 59.21 |  0.00 |  0.01 |  0.00 |  1.63
                         | 18 | 96.18 | 92.01 | 55.54 |  0.00 |  0.00 |  0.00 |  0.76
                         | 28 | 95.01 | 90.50 | 55.61 | -0.01 | -0.01 |  0.00 | -2.31
                         | 38 | 94.38 | 90.74 | 56.01 |  0.00 | -0.01 | -0.01 |  0.44
Average                  |  8 | 97.72 | 93.78 | 54.64 |  0.00 |  0.00 |  0.01 |  0.83
                         | 18 | 96.79 | 92.26 | 52.64 |  0.00 | -0.01 |  0.00 |  1.27
                         | 28 | 96.23 | 92.00 | 54.15 |  0.00 |  0.01 |  0.01 |  0.47
                         | 38 | 95.10 | 91.28 | 46.68 |  0.01 | -0.01 | -0.01 |  0.55
5 Conclusion

This paper proposes a reduced uneven multi-hexagon-grid search algorithm for fast integer pel motion estimation in H.264, which decreases the number of block matching points. It predicts the possible area where the minimum SAD can be found by referencing motion vectors computed in previous stages, and employs directionally partial hexagon search patterns. Experimental results show that the proposed algorithm can efficiently save up to 69% of the computation cost with negligible degradation of RD performance.
References

1. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification. ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC (2003)
2. Richardson, I.E.G.: H.264 and MPEG-4 Video Compression. Video Coding for Next-generation Multimedia. Wiley (2004)
3. Radio Broadcasting Systems. Specification of the video services for VHF digital multimedia broadcasting (DMB) to mobile, portable and fixed receivers. Telecommunications Technology Association (TTA) of Korea TTAS.KO-07.0026 (August 2004)
4. Yang, L., Yu, K., Li, J., Li, S.: An effective variable block-size early termination algorithm for H.264 video coding. IEEE Trans. Circuits and Syst. Video Technology 15(6), 784–788 (2005)
5. Shimizu, T., Yoneyama, A., Yanagihara, H., Nakajima, Y.: A two-stage variable block size motion search algorithm for H.264 encoder. In: Proc. Int. Conf. on Image Processing, vol. 3, pp. 1481–1484 (2004)
6. H.264/AVC Reference Software JM11.0 (August 2006), [Online]. Available: http://iphome.hhi.de/suehring/tml/
7. JVT-F017, Fast integer pel and fractional pel motion estimation (December 2002)
8. JVT-P021, Improved and simplified fast motion estimation for JM (July 2005)
9. JVT-E023, Fast Motion Estimation within the JVT codec (October 2002)
10. JVT-Q088, Modification of Dynamic Search Range for JVT (October 2005)
11. JVT-Q089, Comments on Motion Estimation Algorithms in Current JM Software (October 2005)
Reversible Data Hiding for JPEG Images Based on Histogram Pairs

Guorong Xuan1, Yun Q. Shi2, Zhicheng Ni2, Peiqi Chai1, Xia Cui1, and Xuefeng Tong1

1 Dept. of Computer Science, Tongji University, Shanghai, China
2 Dept. of ECE, New Jersey Institute of Technology, Newark, New Jersey, USA
[email protected], [email protected]

Abstract. This paper proposes a lossless data hiding technique for JPEG images based on histogram pairs. It embeds data into the JPEG quantized 8x8 block DCT coefficients and can achieve good performance in terms of PSNR versus payload by manipulating histogram pairs with an optimum threshold and an optimum region of the JPEG DCT coefficients. It can obtain a higher payload than the prior arts. In addition, the increase of the JPEG file size after data embedding remains unnoticeable. These points have been verified by our extensive experiments. (This research is supported partly by the National Natural Science Foundation of China (NSFC) under project 90304017.)
1 Introduction

Reversible, also referred to as invertible or lossless, image data hiding can imperceptibly hide data into digital images and can reconstruct the original image without any distortion after the hidden data have been extracted. Linking a group of data to a cover image in a reversible way is particularly critical for medical images (for legal considerations) as well as for military, remote sensing and high-energy physics images (where high accuracy matters and the original image is of importance). Reversible data hiding using the histogram shifting technique in the spatial domain was first reported in [1] and in [2], independently. It has been applied to the histogram of the integer DCT domain [3] and the integer wavelet transform domain [4,5]. In general, the histogram shifting technique has achieved dramatically improved performance in terms of embedding capacity versus visual quality of the stego image measured by PSNR (peak signal-to-noise ratio). Among the various digital image formats, the JPEG (Joint Photographic Experts Group) format is by far the most often used currently. How to reversibly hide data into a JPEG image file is hence important and useful for many applications, including image authentication, secure data systems, and linking annotation data to the host image. However, not many reversible data hiding techniques have been developed for JPEG images so far. Some techniques are reported in [6,7]. In [6], the least significant bit plane of some selected JPEG mode coefficients is losslessly
Fig. 1. Block diagram of the proposed lossless data embedding for JPEG images: (a) data embedding, (b) data extraction
compressed, thus leaving space for reversible data embedding. Consequently the payload is rather limited. In [7], some schemes have been reported which aim at keeping the size of the JPEG file after lossless data hiding unchanged. However, the payload is still rather limited (the highest payload among the various experimental results reported in [7] is 0.0176 bpp (bits per pixel)). In this paper, we present a novel technique based on histogram pairs applied to some mid- and low-frequency JPEG quantized 8x8 block DCT (discrete cosine transform) coefficients (sometimes referred to as JPEG coefficients in this paper for short) for reversible data hiding. A block diagram of data embedding and extraction is shown in Figure 1. Experimental results are presented to demonstrate its effectiveness. The data embedding capacity ranges from 0.0004, to 0.001, 0.1, and up to 0.5 bpp (bits per pixel) for one-time (or, one-loop) reversible data hiding, while the visual quality of images with hidden data, measured both subjectively and objectively (PSNR), remains high. The increase of the image file size due to data hiding is not noticeable, and the shape of the histogram of the mid- and low-frequency DCT coefficients remains similar. The proposed technique works for various Q-factors. The rest of this paper is organized as follows. The principle of histogram pair based lossless data hiding is presented in Section 2. The concept of thresholding is discussed in Section 3. The lossless data embedding and extraction algorithm is described in Section 4. Optimality is analyzed in Section 5. Section 6 contains experimental results. Conclusions and discussion are in Section 7.
2 Principle of Histogram Pair Based Lossless Data Embedding

2.1 Definition of Histogram Pair

Histogram, h(x), is the number of occurrences (i.e., the frequency) of feature x within a set of samples X. Here the samples X in this paper are some selected JPEG quantized 8×8 DCT coefficients, and the feature x is the JPEG coefficient's value. The x is either a positive or negative integer, or zero, e.g., x ∈ {-2,-1,0,1,2,3}.
Histogram pair is defined as a part of the histogram, denoted by h=[m,n], where m and n are, respectively, the frequencies of two immediately neighboring feature values x ∈ {a,b} with a < b.

For data embedding, a threshold T > 0 is selected so that only the JPEG coefficients with |x| ≤ T are used for embedding. In addition, it must be ensured that the small JPEG coefficients after data embedding will not conflict (will not be confused) with the large JPEG coefficients (|x| > T). That is, for the JPEG coefficients satisfying |x| ≤ T, histogram pair based data embedding is applied, and it is required that after data embedding the coefficients between -T ≤ x ≤ T remain separable from the coefficients with |x| > T. This simple thresholding divides the whole histogram into two parts: 1) the data-to-be-embedded region, where the JPEG coefficients' absolute value is small; 2) the region with no data to be embedded, named the end regions, where the JPEG coefficients' absolute value is large.

3.2 Optimum Thresholding
Our experimental work has indicated that the smallest threshold T does not necessarily lead to the highest PSNR for a given data embedding capacity. Instead, it is found that for a given data embedding capacity there is an optimum value of T. This can be justified as follows. If a smaller threshold T is selected, the number of coefficients with |x|>T will be larger. This implies that more coefficients with |x|>T need to be moved away from 0 in order to create histogram pair(s) to losslessly embed data. This may lead to a lower PSNR and more side information (hence a smaller embedding capacity). Therefore, in the proposed optimum histogram pair lossless embedding, the best threshold T for a given data embedding capacity is selected to achieve the highest PSNR. A discussion of the optimum parameters is presented in Section 5, and experimental results are shown in Section 6.
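The optimum threshold can therefore be chosen by a simple search over candidate values, as sketched below; `embed` and `psnr` are assumed helper functions (not defined in this paper's notation) that perform the histogram-pair embedding at a given T and measure the resulting fidelity.

```python
def choose_optimum_threshold(coeffs, payload_bits, candidate_Ts, embed, psnr):
    """Pick, among candidate thresholds, the T giving the highest PSNR
    for a fixed payload (the optimum-thresholding idea of this section)."""
    best_T, best_q = None, float("-inf")
    for T in candidate_Ts:
        marked = embed(coeffs, payload_bits, T)   # histogram-pair embedding at T
        q = psnr(coeffs, marked)                  # fidelity of the marked data
        if q > best_q:
            best_T, best_q = T, q
    return best_T, best_q
```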
4 Histogram Pair Based Reversible Data Hiding Algorithm

4.1 Data Embedding Algorithm
Assume the length of the to-be-embedded data is L. The data embedding steps are listed below (see Figure 3(a)).
(1) Set a threshold T > 0 such that the number of mid- and low-frequency JPEG coefficients within [-T, T] is greater than L, and set P ← T.
(2) In the JPEG coefficient histogram, move the portion of the histogram with coefficient values greater than P to the right-hand side by one unit to make the histogram at P+1 equal to zero (P+1 is called a zero-point). Then embed data into P and P+1 according to whether the to-be-embedded bit is 0 or 1, respectively.
(3) If some of the to-be-embedded bits have not been embedded at this point, let P ← (-P), and move the histogram (less than P) to the left-hand side by one unit to leave a zero-point at the value (P-1). Then embed data into P and (P-1) according to whether the to-be-embedded bit is 0 or 1, respectively.
(4) If all the data have been embedded, then stop embedding and record the P value as the stop value, S. Otherwise, P ← (-P-1), go back to step (2) to continue to embed the remaining to-be-embedded data.
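The core of steps (2) and (3) is a single histogram-pair pass at one level P. The sketch below shows such a pass on a flat list of selected coefficients for a positive P, together with its exact inverse; the scan over all levels from T down to the stop value S, the negative side, and the bookkeeping of S are omitted, and the function names are illustrative.

```python
def embed_pass(coeffs, bits, P):
    """One embedding pass at positive level P: shift values above P to
    create a zero-point at P+1, then hide one bit in each value equal to P."""
    out, it = [], iter(bits)
    for c in coeffs:
        if c > P:
            out.append(c + 1)          # make room: zero-point at P+1
        elif c == P:
            b = next(it, 0)            # positions left over carry bit 0
            out.append(P + b)          # bit 0 -> stays at P, bit 1 -> P+1
        else:
            out.append(c)
    return out

def extract_pass(coeffs, P):
    """Inverse pass: read bits back from the pair (P, P+1) and undo the shift."""
    bits, out = [], []
    for c in coeffs:
        if c in (P, P + 1):
            bits.append(c - P)
            out.append(P)              # original value was P
        elif c > P + 1:
            out.append(c - 1)          # undo the histogram shift
        else:
            out.append(c)
    return bits, out
```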
Fig. 3. Flowchart of the proposed lossless data embedding and extracting
4.2 Data Extraction Algorithm
The data extraction is the reverse of the data embedding. Without loss of generality, assume the stop position of data embedding is S (positive). The data extraction steps are as follows (refer to Figure 3(b)).
(1) Set P ← S.
(2) Decode with the stopping value P and the value (P+1). Extract all the data until P+1 becomes a zero-point. Move the histogram of all DCT coefficients greater than P+1 towards the left-hand side by one unit to eliminate the zero-point.
(3) If the amount of extracted data is less than C, set P ← (-P-1). Continue to extract data until (P-1) becomes a zero-point. Then move the histogram (less than P-1) to the right-hand side by one unit to eliminate the zero-point.
(4) If all the hidden bits have been extracted, stop. Otherwise, set P ← -P and go back to (2) to continue to extract the data.
In this simple but complete example, the to-be-embedded bit sequence D=[1 10 001] has six bits and will be embedded into an image by using the proposed histogram pair scheme with threshold T=3, and stop value S= 2. The dimensionality of the image is 5×5 as shown in Figure 4 (a). The image has 12 distinct feature (grayscale) values, i.e., x ∈ {-5,-4,-3,-2,-1,0,1,2,3,4,5,6}. The grayscale values of this image have the histogram h0= [ 0,1,2,3,4,6,3,3,1,2,0,0] (as shown in 1st row of Figure 5). As said
before, for x ≥ 0 the histogram pair is of the form h=[m,0], and for x < 0 it is of the form h=[0,m].

Ingredient Separation of Natural Images

I = WX + V, (1)

where W ∈ R^{m×n} (n > m) is an overcomplete basis function dictionary and X ∈ R^n is its coefficient matrix, and V ∈ R^m is independent, identically distributed Gaussian noise with standard deviation σ. With W fixed, X in (1) is not unique because the number of basis functions is greater than the dimension of the input space. A method to solve such an underdetermined inverse problem is to choose a prior probability of the coefficient matrix P(X), which specifies the probability of the alternative representation [2] [5].
Recently, much work has been done to find adaptive basis systems in which the corresponding decomposition coefficients are the sparsest [1]-[8]. These papers mainly focused on how an adaptive basis can be obtained through learning driven by the 'sparse coding' strategy used in V1. Mathematically, the degree to which a system can sparsely represent signals or functions can be quantitatively characterized using the theory of nonlinear approximation. The better the nonlinear approximation ability, the fewer transform coefficients are needed to represent a signal within a given error level. More recently, several new systems have been developed in the harmonic analysis field that attain the optimal nonlinear approximation rate for some special function classes [9]-[11]. For example, the ridgelet and ridgelet frame transforms can optimally represent functions smooth away from straight singularities; the curvelet transform can optimally represent functions smooth away from curved singularities; and so on. Generally, a given system can effectively represent a special function class. For example, it is well known that the wavelet system is only able to deal with functions smooth away from point singularities. The same holds for ridgelet, ridgelet frame and curvelet: they cannot effectively approximate functions with point singularities. These newly developed systems, hence, make it possible to choose a special system for a special application task. Up to now, how to optimally approximate complicated function classes that have more than one kind of singularity has remained a very difficult problem. Such complicated functions, such as natural images, commonly contain multiple ingredients. An interesting solution to this problem may be to use several systems together to deal with these complicated function classes. In this paper, we focus on such a problem, i.e., when several systems are at hand, how can we combine them to deal with natural images. This problem, essentially, follows a line contrary to those described in [1]-[8].
2 Multiple Transform Domain Image Model

Let I in (1) be an image consisting of m pixels and let the matrix W be constructed by combining k different transforms. Rewriting equation (1), we obtain the multiple transform domain image model

I = WX(I) + V = \sum_{i=1}^{k} W_i X_i(I) + V = W_1 X_1(I) + W_2 X_2(I) + \ldots + W_k X_k(I) + V, \quad (2)
where each W_i ∈ R^{m×n_i} corresponds to a complete or over-complete transform, namely, for all F ∈ R^m we have F = W_i X_i(F). For example, one can assign W_1 to be the wavelet transform matrix, W_2 to be the local cosine transform matrix, and so on. Clearly, the problem of determining the coefficients X(I) in (2) is ill-posed. Given only information about the image I, there will be multiple states of X(I) that can correspond to the same image. X(I) must be inferred from I. For this purpose, we maximize the conditional probability distribution of X(I) given I, P(X | W, I), which can be expressed by means of Bayes's rule as

P(X | W, I) ∝ P(I | W, X) P(X | W). \quad (3)
The first term in (3) is the likelihood of the image under the model for a given state of the coefficient matrix X(I). If the noise is assumed to be Gaussian, this likelihood is given by P(I | W, X) ∝ exp[−(λ/2) |I − WX|^2], where λ = 1/σ^2. The second term is the prior probability distribution over the decomposition coefficient matrix X(I). We assume that this distribution is factorial, such that P(X | W) = \prod_{i=1}^{k} P(X_i | W). Furthermore, assuming that the image I is exactly generated by combining the components from W_1, W_2, ..., W_k and that X(I) is exactly the ideal solution of (2), then P(X_i | W) ∝ P_i(X_i | W_i). Maximizing the posterior distribution P(X | W, I) thus presents us with the following problem:
\hat{X} = (\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_k) = \arg\max_X P(X | W, I) = \arg\max_X P(I | W, X) P(X | W)
        = \arg\max_X P(I | W, X) P(X | W_1, W_2, \ldots, W_k)
        = \arg\max_X \exp[−(λ/2) |I − WX|^2] \prod_{i=1}^{k} P_i(X_i | W_i)
        = \arg\min_X (λ/2) |I − WX|^2 + \sum_{i=1}^{k} g_i(X_i | W_i), \quad (4)

where g_i(·) = −log P_i(·). A local minimum may be obtained by using a gradient descent algorithm, which yields

ΔX ∝ λ [W_1, W_2, \ldots, W_k]^T e − [g_1'(X_1)^T, \ldots, g_k'(X_k)^T]^T, \quad (5)

e = I − [W_1, \ldots, W_k] [X_1^T, \ldots, X_k^T]^T = I − WX. \quad (6)
It is worth emphasizing that the multiple transform domain image model allows one to assign a different prior probability distribution to each transform coefficient matrix X_i. This is distinctly different from previous research in [1]-[8], which assigned the same prior probability distribution to all coefficients even if basis function classes of different properties were used in W. The multiple transform domain image model proposed here can be considered a generalization of those in [1]-[8]. Furthermore, previous papers only focused on the global property of the dictionary W in (1). In contrast, the multiple transform domain image model not only considers the global property, namely I = WX + V, but also considers the local properties of the partial basis function classes W_i.
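A single update of Eqs. (5)-(6) can be sketched as follows, assuming for simplicity that the image is a vector of m pixels, that every W_i is available as a dense m x n_i array (in practice the transforms are applied implicitly), and that each element of g_primes computes the corresponding g_i'; all names and the step size are illustrative.

```python
import numpy as np

def gradient_step(I, Ws, Xs, lam, g_primes, step=1e-3):
    """One gradient-descent update of the coefficient vectors X_1..X_k."""
    # Eq. (6): residual between the image and its current reconstruction.
    e = I - sum(W @ X for W, X in zip(Ws, Xs))
    new_Xs = []
    for W, X, g_prime in zip(Ws, Xs, g_primes):
        # Eq. (5): back-projected residual minus the prior's derivative.
        dX = lam * (W.T @ e) - g_prime(X)
        new_Xs.append(X + step * dX)
    return new_Xs
```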
3 Candidate Transforms

To date, many systems for signal and function decomposition have been developed. Commonly, these systems are complete or over-complete. To construct W in the multiple
transform domain image model, we combine several systems together. These systems are described below.
• Wavelet system: The wavelet system has achieved tremendous success in many fields of contemporary science and technology, especially in signal and image processing applications. The success of the wavelet system mainly arises from its ability to provide an optimally sparse representation for functions that are smooth away from point singularities.
• Ridgelet frame: The ridgelet frame can be constructed by using an orthogonal wavelet basis in the Radon domain [10]. It can be thought of as an extension of the notion of the orthonormal ridgelet [9]. The ridgelet frame first transforms straight singularities to point singularities and then deals with the resulting point singularities using the wavelet system. Therefore, the ridgelet frame is good at dealing with line structures in images.
• Local Cosine Transform: The local cosine transform provides an adaptive segmentation of the spatial domain in terms of oscillating patterns. An image is decomposed into blocks of different sizes within which a local Fourier expansion is performed. The local cosine transform is appropriate for a sparse representation of periodic behavior such as texture.
Other complete systems are also available. For example, the brushlet transform is designed to be suited for representing texture and the contourlet is designed to deal with straight 'contours' [12] [13]. All these transforms make it possible to flexibly construct a large basis function dictionary by combining them together. In the scope of this paper, we only use the wavelet, ridgelet frame and local cosine systems to construct W.
4 Parameterization of Prior Probability Distribution

To use the estimation defined by (4), (5) and (6), the prior probability distribution of X(I) needs to be characterized with a model that is flexible enough. Previous studies used the same prior probability distribution for all coefficients and did not consider which kind of basis function class of the dictionary W they arise from. The multiple transform domain image model offers the advantage that one can assign a different prior probability distribution to each coefficient matrix X_i belonging to a different transform W_i. In fact, in an over-complete dictionary, basis function classes of different properties commonly have different approximation efficiency for particular image ingredients. Given that a natural image consists of ingredients of different properties, different X_i should therefore be assigned different prior probability distributions in the context of image processing. As an example, Fig. 1(a) shows a histogram from a single subband of a wavelet transform built on a natural image, and Fig. 1(b) shows a histogram from the corresponding single subband of a ridgelet frame transform built on the same image. As Figure 1 shows, both the wavelet and ridgelet frame coefficients are sharply peaked at zero with broad tails, but the histogram of the latter has a much higher kurtosis, 280.2579, than that of the wavelet. So, there is no reason for us to depict different
types of coefficients using the same prior probability distribution, as was done in previous studies. For our purpose here, we use a two-parameter generalized Laplacian distribution, P(X_i) ∝ exp(−|X_i / s_i|^{p_i}). We specify different values of the parameters s_i and p_i for different coefficient matrices X_i. In practice the parameters s_i and p_i can be adaptively estimated from the signal corrupted by additive Gaussian white noise [14]:

σ^2 = σ_n^2 + \frac{s^2 Γ(3/p)}{Γ(1/p)}, \quad (7)

m_4 = 3σ_n^4 + \frac{6 σ_n^2 s^2 Γ(3/p)}{Γ(1/p)} + \frac{s^4 Γ(5/p)}{Γ(1/p)}, \quad (8)
where Γ is the Gamma function, m4 is the fourth moment of the generalized Laplacian distribution signal corrupted by additive Gaussian white noise with standard deviation σ .
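Equations (7) and (8) can be inverted numerically for s and p by matching the sample variance and fourth moment, as in the following sketch; it assumes zero-mean coefficients, a known noise standard deviation sigma_n, and that the empirical kurtosis ratio lies inside the chosen bracket for p.

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def fit_generalized_laplacian(coeffs, sigma_n):
    """Estimate (s, p) of P(x) ~ exp(-|x/s|^p) from noisy coefficients
    by matching the moments of Eqs. (7)-(8)."""
    x = np.asarray(coeffs, dtype=float)
    var, m4 = x.var(), np.mean(x ** 4)
    v = var - sigma_n ** 2                            # = s^2 Gamma(3/p)/Gamma(1/p)
    q = m4 - 3 * sigma_n ** 4 - 6 * sigma_n ** 2 * v  # = s^4 Gamma(5/p)/Gamma(1/p)
    ratio = q / v ** 2                                # depends on p only
    f = lambda p: gamma(5.0 / p) * gamma(1.0 / p) / gamma(3.0 / p) ** 2 - ratio
    p = brentq(f, 0.1, 4.0)                           # solve for the shape parameter
    s = np.sqrt(v * gamma(1.0 / p) / gamma(3.0 / p))  # back-substitute into Eq. (7)
    return s, p
```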
5 Image Ingredient Separation Using the Multiple Transform Domain Method

To demonstrate the ability of the proposed multiple transform domain image model to capture typical structure in natural images, we applied it to an image processing application: ingredient separation. Assume that a natural image consists of different ingredients and that each ingredient is the linear superposition of basis function classes belonging to one particular transform of the dictionary W, such as W_1, W_2, ..., W_k. The task of image ingredient separation is to decompose the given image into its actual ingredients. Such a separation is important for applications such as image coding and image analysis. Previous studies mainly focused on decomposing a given image into texture and piecewise smooth additive ingredients [15] [16]. The multiple transform domain image model allows us to decompose an image into more types of ingredients depending on the combination of different transforms in W. The basic idea is that the ideal decomposition coefficients should be sparse in the multiple transform domain image model. For our purpose, on each image the coefficients are fitted so as to maximize the posterior distribution P(X | W, I), with a generalized Laplacian prior on the coefficients X(I), as expressed in (4). Then the different ingredients are obtained by

\hat{I}_i = W_i \hat{X}_i, \quad i = 1, \ldots, k. \quad (9)

Notice that the term |I − WX|^2 in (4) guarantees

I ≈ \hat{I}_1 + \hat{I}_2 + \ldots + \hat{I}_k. \quad (10)
Fig. 1. (a) Histogram for a HH wavelet subband built on a 256*256 natural image, kurtosis=16.0825. (b) Histogram of the corresponding subband in ridgelet frame domain built on the same image, kurtosis=280.2579.
6 Experiments

We performed a set of experiments to explore the utility of the proposed ingredient separation method using the multiple transform domain image model. The test data is a pre-whitened 256*256 natural image, shown in Figure 2(a), which was used in previous research [2]. The noise parameter λ is 200, corresponding to a noise variance of 0.005. In our experiments, we decompose a natural image into three types of ingredients, namely a point-type ingredient, a line-type ingredient and a texture ingredient. Hence, we combine the wavelet system, the local cosine system and the ridgelet frame to construct the basis function dictionary W in (2). In particular, we assign the wavelet system to W_1 for extracting the point-type ingredient, the ridgelet frame to W_2 for extracting the line-type ingredient, and the local cosine system to W_3 for extracting the texture ingredient. Note that the ridgelet frame we used is constructed using the Symlet-8 wavelet and is a redundant transform with redundancy factor 4. So W is a 6×-over-complete dictionary consisting of three types of basis function classes. The parameters s_i and p_i were estimated by first decomposing the original image into wavelet coefficients, local cosine coefficients and ridgelet frame coefficients, respectively, and then computing s_i and p_i according to (7) and (8). For the image in Fig. 2(a), the parameter values s_i and p_i are 4.39 and 0.85 for the wavelet transform, 2.12 and 0.8 for the local cosine transform, and 3.43 and 0.3 for the ridgelet frame transform, respectively. This accords with our previous analysis that the different coefficient matrices in the multiple transform domain image model have different prior probability distributions. To improve the convergence rate to a stable solution, we initialize the coefficient matrices X_1, X_2 and X_3 by keeping the few coefficients of largest magnitude in the wavelet, local cosine and ridgelet frame decompositions of the original image, so that each partial initialization reconstruction image retains 5% of the energy of the original image.
Fig. 2. (a) pre-whitened original 256*256 natural image; (b) initialization combining image resulted by add (d) and (e) to (f), here MSE is 0.2207; (c) combining image resulted by add (g) and (h) to (i) after 5 iterations, here MSE is 0.0014; (d) partial reconstruction image using the initialization wavelet coefficients, which has 5% energy of the original image; (e) partial reconstruction image using the initialization local cosine coefficients, which have 5% energy of the original image; (f) partial reconstruction image using the initialization ridgelet frame coefficients, which have 5% energy of the original image; (g) partial reconstruction image using the learning wavelet coefficients after 5 iterations; (h) partial reconstruction image using the learning local cosine coefficients after 5 iterations; (i) partial reconstruction image using the learning ridgelet frame coefficients after 5 iterations
A stable solution began to emerge after about 3 minutes, in which 5 iterations were completed (Pentium 4, 2.7 GHz). The results are shown in Fig. 2. The reconstructed image, Fig. 2(c), after 5 iterations is visually indistinguishable from the original one. As mentioned above, the wavelet transform extracts the
point-type ingredient of the original image; this can be checked, for example, on the stone at the top right corner in Fig. 2(g). Most line-type ingredients are extracted into Fig. 2(i), the partial reconstruction image resulting from the ridgelet frame transform. Few line-type ingredients survive in the partial reconstruction images resulting from the wavelet transform and the local cosine transform. In addition, Fig. 2(e) exhibits some oscillating patterns.
7 Conclusion

We have shown in this paper how multiple transforms can be used to construct a multiple transform domain image model and how such a model may be used to separate different image ingredients based on the 'sparse coding' strategy. The proposed model allows for assigning different prior probability distributions to the different transform coefficient matrices and for extracting image ingredients of special interest. Although only initial experimental results are demonstrated, we believe the multiple transform domain image model described here has the potential to be used in the data analysis and image processing fields.
References

1. Olshausen, B.A., Field, D.J.: Sparse Coding of Sensory Inputs. Current Opinion in Neurobiology 14, 481–487 (2004)
2. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996)
3. Lewicki, M.S., Sejnowski, T.J.: Learning Overcomplete Representations. Neural Comput. 12, 337–365 (2000)
4. Olshausen, B.A., Field, D.J.: Sparse Coding With an Overcomplete Basis Set: A Strategy Employed By V1? Vision Res. 37, 3311–3325 (1997)
5. Lewicki, M.S., Olshausen, B.A.: A Probabilistic Framework for the Adaptation and Comparison of Image Codes. J. Opt. Soc. Am. A 16, 1587–1601 (1999)
6. David, B.G., Rao, P.N.: Bilinear Sparse Coding for Invariant Vision. Neural Computation 17, 47–73 (2005)
7. Bell, A.J., Sejnowski, T.J.: The Independent Components of Natural Scenes Are Edge Filters. Vision Research 37, 3327–3338 (1997)
8. Comon, P.: Independent Component Analysis: A New Concept? Signal Processing 36, 287–314 (1994)
9. Donoho, D.L.: Orthonormal Ridgelet and Linear Singularities. SIAM J. Math. Anal. 31, 1062–1099 (2000)
10. Tan, S., Jiao, L.C.: Ridgelet Bi-frame. Appl. Comput. Harmon. Anal. 20, 391–402 (2006)
11. Candès, E.J., Donoho, D.L.: Curvelets—A Surprisingly Effective Nonadaptive Representation for Objects with Edges. In: Cohen, A., Rabut, C., Schumaker, L.L. (eds.) Curve and Surface Fitting: Saint-Malo 1999, Vanderbilt Univ., Nashville, TN (1999)
12. Meyer, F.G., Coifman, R.R.: Brushlets: A Tool for Directional Image Analysis and Image Compression. Applied and Computational Harmonic Analysis 4, 147–187 (1997)
13. Do, M.N., Vetterli, M.: Contourlets. In: Stoeckler, J., Welland, G.V. (eds.) Beyond Wavelet, Academic Press, San Diego (2002)
14. Simoncelli, E.P., Adelson, E.H.: Noise Removal via Bayesian Wavelet Coring. In: Proc. IEEE Int. Conf. Image Processing (ICIP), IEEE Computer Society Press, Los Alamitos (1996)
15. Bertalmio, M., Vese, L., Sapiro, G., Osher, S.: Simultaneous Structure and Texture Image Inpainting. IEEE Trans. on Image Processing 12, 882–889 (2003)
16. Aujol, J., Aubert, G., Blanc-Feraud, L., Chambolle, A.: Image Decomposition: Application to Textured Images and SAR Images. Tech. Rep. ISRN I3S/RR-2003-01-FR, INRIA – Project ARIANA, Sophia Antipolis (2003)
Optimal Algorithm for Lossy Vector Data Compression

Alexander Kolesnikov

Department of Computer Science, University of Joensuu, P.O. Box 111, 80101 Joensuu, Finland
Institute of Automatics and Electrometry, Pr. Koptyuga 1, 630090 Novosibirsk, Russia
[email protected]

Abstract. An algorithm for lossy compression of vector data (vector maps, vector graphics, contours of shapes) was developed. The algorithm is based on optimal polygonal approximation for the error measure L2 and dynamic quantization of the vector data. The algorithm includes optimal distribution of the approximation line segments among the vector objects, optimal polygonal approximation of the objects with dynamic quantization, and construction of the optimal variable-rate vector quantizer. The developed algorithm can be used for lossy compression of one-dimensional signals and multidimensional vector data.

Keywords: Polygonal approximation, vector data reduction, vector data compression, vector map compression, quantization.
1 Introduction

The compression of vector data (vector maps, vector graphics, contours of shapes, signals) is necessary for the efficient storage or transmission of the data to the user [2], [8], [10−13], [19−21]. Vector data can be compressed with lossless algorithms, so that the information can be restored without losses. In some applications a loss of information is acceptable if it reduces resource demands and the dropped part of the information is not critical for the application (see Fig. 1). To satisfy constraints on storage and transmission resources, our task is to reduce the distortion to a minimum or to achieve the maximum compression ratio. The problem of lossy vector data compression can be formulated in two ways: 1) Bitrate-constrained (min-distortion or min-ε) problem: given a constraint on the bitrate, compress the vector data in such a way that the distortion is minimal; 2) Distortion-constrained (min-bitrate or min-#) problem: given a constraint on the distortion, compress the vector data in such a way that the bitrate is minimal. The amount of vector data can be reduced by polygonal approximation and quantization of the approximation data; the result can then be compressed with a lossless entropy encoder. Polygonal approximation of the vector data can be performed with heuristic [5] or optimal algorithms [3], [4], [7], [14], [16−18]. Quantization of the vector data [9], [15], [21] can be applied to reduce the data at the cost of spatial resolution.
Fig. 1. Test vector data # 1 (fragment), file size is 547 Kbytes (left), and results of the vector data compression with polygonal approximation and dynamic quantization (fragment): file size is 40.1 Kbytes for RMSE=0.15 m (center) and file size is 21.7 Kbytes for RMSE=0.95 m (right)
Optimal algorithms for lossy compression of digitized curves with polygonal approximation were introduced in [8], [19], [20]. The solution is found as the shortest path in a weighted graph constructed on the vertices of the curves. In the case of real-valued coordinates, the approximation data should be quantized for better compression performance. The vector data can be quantized before or after the process of polygonal approximation (the pre- and post-quantization approaches, respectively). In [11], an algorithm for optimal polygonal approximation with dynamic quantization was introduced, where the approximation is constructed by taking into consideration the vector quantization of the approximation line segments. Later this approach was adopted for min-bitrate (distortion-constrained) compression of vector maps [12] with the L∞ error measure. In the current paper we consider the min-bitrate and min-distortion problems for the L2 error measure, known as the integral square error (ISE). From a practical point of view, the constraint on the bitrate for the min-distortion problem can be given as a maximum file size. As for the min-bitrate (distortion-constrained) problem, we define the constraint on the total distortion D based on a desired standard deviation σ: D ≤ σ²N. To solve the problems of vector data compression with the L2 error measure we follow the algorithm for vector data reduction introduced in [13], using the dynamic quantization approach [11]. We are looking for such a value of the total number of approximation nodes, M, and for such a variable-rate quantizer that the objective function (bitrate or distortion) is minimal under the given constraints on distortion or bitrate, respectively. The solution of the problem includes: a) optimal distribution of the line segments among the vector objects, b) optimal polygonal approximation of the objects with dynamic quantization for the found number of segments, c) construction of the optimal variable-rate quantizer, and d) lossless compression of the data with an entropy-based coder. The rest of the paper is organized as follows. In Sections 2 and 3, we present optimal algorithms for minimal-distortion polygonal approximation with dynamic quantization for a single vector object and for many vector objects, respectively. Section 4 is dedicated to the design of the optimal variable-rate quantizer, and Section 5 to the selection of optimal parameters. Experiments and discussions are given in Section 6, and conclusions are drawn in Section 7.
2 Optimal Single-Object Polygonal Approximation

2.1 Problem Formulation

An open N-vertex polygonal curve P in 2-dimensional space is represented as the ordered set of vertices P = {p(1), …, p(N)} = {(x_1, y_1), …, (x_N, y_N)}. The single-object min-distortion problem is stated as follows: approximate the polygonal curve P by another polygonal curve S with a given number of linear segments M and a given vector quantizer so that the total approximation error E(M) is minimized. The output curve S consists of (M+1) vertices: S = {s_1, …, s_{M+1}}. The approximation line segment (s_m, s_{m+1}) of S for a curve segment {p(i), …, p(j)} of P is defined by the quantized values of the vertices p_r(i) and p_r(j). The two-dimensional relative coordinates Δp_{i,j} for the vertices p(i) and p(j) are defined as the difference between the vertices: Δp_{i,j} = p(j) − p(i). To avoid accumulation of the quantization error, the relative coordinates Δp_{i,j} are instead defined as the difference between the current vertex p(j) and the restored value p_r(i) of the vertex p(i): Δp_{i,j} = p(j) − p_r(i). In turn, the restored point p_r(j) is defined by the restored vertex p_r(i) and by the quantized value of the relative coordinates: p_r(j) = p_r(i) + Z[p(j) − p_r(i)], where Z[x] denotes the quantized value of x (see Fig. 2).
Fig. 2. A curve segment {p(i), …, p(j)} (white circles) is approximated by a line segment, L(pr(i), pr(j)), defined by the restored vertices pr(i) and pr (j) (black circles)
Quantization of the relative coordinates induces a displacement of the restored approximation nodes p_r(j) relative to the original vertices p(j) and affects the error of the polygonal approximation. The distance from a vertex p(k) = (x_k, y_k) ∈ P to the approximating line L(p_r(i), p_r(j)) passing through the restored vertices p_r(i) and p_r(j) is given by the following expression:

d(p(k); (p_r(i), p_r(j))) = \frac{a x_k + b y_k + c}{\sqrt{a^2 + b^2}},   (1)
where the coefficients a, b and c of the line can be computed from the restored values p_r(i) and p_r(j). The approximation error e_2(p_r(i), p_r(j)) for a curve segment {p(i), …, p(j)} with the L_2 measure is defined as the sum of squared deviations of the vertices p(k) of the curve segment from the approximating line L(p_r(i), p_r(j)):

e_2(p_r(i), p_r(j)) = \sum_{k=i}^{j} d^2(p(k); (p_r(i), p_r(j))).   (2)
The complexity of the algorithm for the calculation of this error is O(1). The distortion E(M) for the input curve P approximated by a curve with M line segments is defined as the sum of the approximation errors over all segments:

E(M) = \sum_{m=1}^{M} e_2(p_r(i_m), p_r(i_{m+1})),   (3)
where p_r(i_m) and p_r(i_{m+1}) are the restored values of the end points of the curve segment {p(i_m), …, p(i_{m+1})}.

2.2 Algorithm for the Single-Object Min-Distortion Problem

To obtain the optimal polygonal approximation we have to find the set of vertices {p_r(i_1), …, p_r(i_{M+1})} that minimizes the total distortion E(M) for a given number of segments M:
E(M) = \min_{\{i_m\}} \left\{ \sum_{m=1}^{M} e_2(p_r(i_m), p_r(i_{m+1})) \right\}.   (4)
The optimization problem can be solved as a shortest-path search in the state space [16]. Let us introduce a two-dimensional state space (see Fig. 3) and define the cost function on this space:
D(n, m) = \min_{\{i_t\}} \left\{ \sum_{t=1}^{m} e_2(p_r(i_t), p_r(i_{t+1})) \right\}.   (5)
The solution of the optimization problem is found with the following recurrent equation:
D(n, m) = \min_{L(m-1) \le j < n} \left\{ D(j, m-1) + e_2(p_r(j), p_r(n)) \right\},   (6)

where n = 1, …, N; m = n+1, …, N; and L(m) is the left bound of the state space, see Fig. 3 (left).
Fig. 3. Illustration of the state space for a sample problem with N_k = 34 (left), and the bounding corridor in the state space for a sample problem with N_k = 34 and M_k = 12 using corridor width W = 3 (right). The reference path is marked with dark gray circles, and the three goal states with gray squares.
To improve the time performance we use the reduced-search approach [14]. According to this method, a reference solution is first obtained with a fast heuristic polygonal approximation algorithm, which defines a reference path in the state space. A bounding corridor of width W is constructed in the space along the reference path (see Fig. 3). The dynamic programming search is then performed inside the bounding corridor. The obtained solution is used to construct a new reference path in the space with the corresponding bounding corridor for a new run of the search. With this iterative reduced dynamic programming search in the state space we obtain a practically optimal solution in a few iterations.
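To make the recursion of Eqs. (4)–(6) concrete, the following minimal sketch implements the full dynamic programming search for one curve in Python. The names are ours rather than the paper's; the sketch omits the dynamic quantization of the end points and the reduced-search bounding corridor, and it evaluates the segment error directly from the vertices instead of the O(1) evaluation used in the paper.

```python
import numpy as np

def segment_error(points, i, j):
    """Sum of squared distances of points[i..j] to the line through points[i] and points[j]
    (Eq. (2), but with the original vertices instead of the restored, quantized ones)."""
    p, q = points[i], points[j]
    d = q - p
    norm = np.hypot(d[0], d[1])
    if norm == 0.0:
        return float(np.sum((points[i:j + 1] - p) ** 2))
    seg = points[i:j + 1]
    dist = (d[0] * (seg[:, 1] - p[1]) - d[1] * (seg[:, 0] - p[0])) / norm
    return float(np.sum(dist ** 2))

def min_distortion_polyline(points, M):
    """Full DP over the state space of Eq. (6): D[n, m] is the minimal distortion of
    approximating the first n+1 vertices with m segments."""
    N = len(points)
    D = np.full((N, M + 1), np.inf)
    parent = np.full((N, M + 1), -1, dtype=int)
    D[0, 0] = 0.0
    for n in range(1, N):
        for m in range(1, min(n, M) + 1):
            for j in range(m - 1, n):
                cand = D[j, m - 1] + segment_error(points, j, n)
                if cand < D[n, m]:
                    D[n, m], parent[n, m] = cand, j
    nodes, n, m = [N - 1], N - 1, M   # backtrack the optimal approximation nodes
    while m > 0:
        n = parent[n, m]
        nodes.append(n)
        m -= 1
    return nodes[::-1], float(D[N - 1, M])
```

The reduced-search variant restricts the inner loop over j to the bounding corridor around the reference path, which is what makes the method practical for long curves.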
3 Optimal Multi-object Polygonal Approximation

3.1 Problem Formulation

Let us consider the problem of joint polygonal approximation with dynamic quantization of K vector objects P_1, …, P_K; the total number of vertices is N = Σ N_k, where N_k is the number of vertices in the object P_k. We have to approximate this set of polygonal curves by another set of polygonal curves S_1, …, S_K; the total number of approximation line segments is Σ M_k, where M_k is the number of segments allocated to the approximation of a single polygonal curve S_k. The total distortion D(K, M) for the set of input curves {P_1, …, P_K} approximated by the K curves {S_1, …, S_K} is defined as the sum of the distortions E_k(M_k) over all curves:

D(K, M) = \sum_{k=1}^{K} E_k(M_k),   (7)
where the distortion for one curve is given by formula (4). The min-distortion problem for multiple objects can be formulated as the following optimization task: find the optimal approximation of the curves P_1, …, P_K by polygonal curves S_1, …, S_K with minimum error G(K, M) under the given constraint on the total number of segments, Σ M_k ≤ M:

G(K, M) = \min_{\{M_k\}} \left\{ \sum_{k=1}^{K} E_k(M_k) \right\}   (8)

subject to: \sum_{k=1}^{K} M_k \le M.
Given the budget, i.e. the total number of line segments M, we have to find the optimal number of line segments for each curve and construct such a polygonal approximation of the curves with the found numbers of segments that the total distortion (integral square error) is minimal. To solve the problem we follow the algorithm [13] developed earlier for the multi-object min-ε problem.

3.2 Algorithm for the Multi-Object Min-Distortion Problem

Let us introduce the rate-distortion function G(k, m) as the minimal polygonal approximation error of k objects with m line segments [13]:
G(k, m) = \min_{\{m_t\}} \left\{ \sum_{t=1}^{k} E_t(m_t) \right\}.   (9)
The solution of the optimization problem can be found by a dynamic programming algorithm for the optimal allocation of the constrained resource (m line segments) using the following recursive equation:

G(k, m) = \min_{a_k \le u \le b_k} \left\{ G(k-1, m-u) + E_k(u) \right\}, \quad \text{where} \; \sum_{i=1}^{k-1} a_i \le m \le \sum_{i=1}^{k-1} b_i,   (10)
for k = 1, …, K and m = 1, …, M; here [a_k, b_k] is the range of possible values of the number of segments for the k-th object, defined by the initial number of segments m_k for the object and the corridor width W: a_k = m_k − ⌊W/2⌋ and b_k = a_k + W − 1. We initialize the recursive calculations for k = 1 with G(1, m) = E_1(m), where m = 1, …, M. The total distortion for all K objects approximated with M linear segments is given by the cost function as G(K, M). Finally, for every curve P_k we create the polygonal approximation solution with the found number of segments M_k. The found numbers of segments are restricted to the ranges [a_k, b_k]. To find the globally optimal allocation of the resource, iterations are necessary, using the output solution of the previous iteration as the reference solution in the next run. The initial allocation of the segment numbers {m_k} is found with a fast heuristic merge-based algorithm [13] for polygonal approximation.
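A sketch of the resource-allocation recursion (10) is given below. It assumes the per-object error functions E_k(u) have already been computed (for example, by the single-object algorithm above); the names and the dictionary-based state representation are ours, not the paper's.

```python
def allocate_segments(E, M, windows):
    """E[k] maps a segment count u to the distortion E_k(u) of object k;
    windows[k] = (a_k, b_k) is the search range for object k; M is the total budget.
    Returns the optimal per-object segment counts and the total distortion G(K, M)."""
    K = len(E)
    G = [{0: 0.0}] + [dict() for _ in range(K)]   # G[k][m] as in Eqs. (9)-(10)
    choice = [dict() for _ in range(K + 1)]
    for k in range(1, K + 1):
        a, b = windows[k - 1]
        for m_prev, g_prev in G[k - 1].items():
            for u in range(a, b + 1):
                m = m_prev + u
                if m > M:
                    continue
                cand = g_prev + E[k - 1][u]
                if cand < G[k].get(m, float("inf")):
                    G[k][m] = cand
                    choice[k][m] = u
    m_best = min(G[K], key=G[K].get)              # best reachable total within the budget
    alloc, m = [], m_best
    for k in range(K, 0, -1):
        u = choice[k][m]
        alloc.append(u)
        m -= u
    return alloc[::-1], G[K][m_best]
```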
4 Optimal Variable-Rate Vector Quantization

So far we have considered the algorithm for polygonal approximation with dynamic vector quantization under the assumption that the quantizer is given. The compression performance of the algorithm in question depends on the properties of the vector quantizer. For optimal entropy-based encoding of the quantized data we have to construct such a variable-rate quantizer [6] that provides the minimal bitrate for a given constraint on the quantization error (min-bitrate problem) or the minimal distortion for a given constraint on the bitrate (min-distortion problem), respectively. In the general case, if the statistics of the input data to be quantized are known, an optimal variable-rate vector quantizer with the L2 error measure can be constructed with the iterative algorithm introduced in [4]. To collect the statistics we first have to perform the polygonal approximation, but to construct the polygonal approximation with embedded quantization we should know the properties of the quantizer. To solve the problem, let us consider the following simple model of a variable-rate quantizer. First, analysis of the vector data showed that the relative coordinates Δx and Δy are practically uncorrelated; the coefficient of correlation for the test vector data is less than 5%. In other words, the relative coordinate on one axis practically does not depend on that on the other one. For quantization of the relative coordinates Δx and Δy we can therefore use two variable-rate scalar quantizers. Furthermore, it is known that uniform scalar quantization is asymptotically best for variable-rate quantization at high rates [6]. Moreover, in [1] it was shown that the optimality of uniform scalar quantization holds approximately even when the rate is not large, and holds exactly for exponential densities. If we take into consideration the cost of storing the codebook, the situation becomes even more beneficial for the uniform quantizer because the quantizer is defined by only one parameter, the
quantization step Q. We use the same quantization step Q for both dimensions because the test vector data is anisotropical. The first and last vertices of the curves are quantized with the same step Q to ensure continuity of the vector data in maps. So, the problem of construction of the optimal variable-rate vector quantizer is reduced to that of selecting the quantization step Q for the uniform scalar quantizer.
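The sketch below illustrates the uniform dynamic quantization of the relative coordinates described above: each approximation node is encoded as the quantized difference to the previously restored node, so quantization errors do not accumulate. The function name and the treatment of the first vertex are our assumptions.

```python
import numpy as np

def quantize_nodes(nodes, Q):
    """nodes: (M+1) x 2 array of selected approximation vertices; Q: uniform step.
    Returns the integer symbols to be entropy coded and the restored vertices."""
    first = np.round(nodes[0] / Q).astype(int)   # first vertex quantized with the same step Q
    restored = [first * Q]
    symbols = [tuple(first)]
    for p in nodes[1:]:
        delta = p - restored[-1]                 # relative coordinates w.r.t. the restored node
        idx = np.round(delta / Q).astype(int)
        symbols.append(tuple(idx))
        restored.append(restored[-1] + idx * Q)
    return symbols, np.vstack(restored)
```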
5 Selection of Parameters

5.1 Accuracy of Polygonal Approximation

To solve the min-distortion (bitrate-constrained) problem we should be given a threshold for the bitrate, which can be defined as a maximum file size R_0. Then we can find such a number of segments M and such a quantization step Q that the file size does not exceed the threshold R_0 and the total distortion is minimal. The file size of the compressed vector data is calculated by compressing the approximated and quantized data with an arithmetic coder. Min-distortion polygonal approximation with embedded quantization is given by the algorithm presented in Section 3. As for the min-bitrate (distortion-constrained or min-#) problem, we should be given a threshold for the total approximation error. For the L∞ error measure this means that we have an upper bound for the maximum deviation of the vertices from the approximation line [12]. In the case of the additive error measure L2, the situation is more complicated. The integral square error is widely used as the fitness function for the min-ε problem, but it can hardly be used as an input parameter in practical applications for the min-# problem. To solve the problem, we use the standard deviation σ of the vertices from the approximation lines as the input parameter to define the constraint on the total distortion (ISE): D ≤ Nσ². The standard deviation is estimated as the root mean-square error (RMSE). Instead of a strict constraint on the approximation error we can consider the accuracy of approximation, which is defined as the value ε_0 such that F percent of the vertices have an absolute deviation equal to or smaller than ε_0. The parameter F is called the confidence level. For a normally distributed random variable the confidence level F = 95% is ensured by two standard deviations: ε_0 = 2σ. The distribution of deviations for the test data is represented in Fig. 4; the confidence level F was computed for two standard deviations.
Fig. 4. Distribution of absolute value of deviation for the test dataset: a) Elevation lines, RMSE=1.8 m, F=95.2% (left) and b) Europe, RMSE=2.7⋅10−3 degrees; F=95.5% (right)
5.2 Optimal Values for the Number of Segments and the Quantization Step

Using the presented algorithm for the multi-object min-distortion problem, we construct a set of rate-distortion curves for a set of quantization steps Q, varying the value of the total number of segments M in a certain range. The bitrate is calculated by compressing the output vector data, and the total distortion is calculated in the process of the search for the optimal polygonal approximation. These rate-distortion curves provide a local minimum of the total distortion. The globally optimal rate-distortion function for the vector data compression is calculated as the envelope of the rate-distortion curves computed for all values of M and Q (see Figs. 5 and 6). Searching for the optimal values of the parameters for a given threshold on file size or distortion is a time-consuming task. To estimate the quantization step Q* based on the current value of the number of segments, we offer the following heuristic idea. The quantization error of a uniform scalar quantizer is Q²/12. Let us assume that the total quantization error for N points should be equal to that of the polygonal approximation without quantization, D_0(K, M), obtained for Q = 0:

2N \frac{Q^2}{12} = D_0(K, M).   (11)
Following this error balance principle we get the following estimation of the quantization step Q* for the current number of segments M:

Q^* = \sqrt{\frac{6 D_0(K, M)}{N}}.   (12)
With this estimation we can find the optimal value of the number of segments M by binary search.
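The following sketch combines the error-balance estimate (12) with a binary search over the segment budget M for the bitrate-constrained problem. The `encode` callback is a placeholder for the full pipeline (multi-object approximation, quantization, arithmetic coding); its interface, and the assumption that the file size grows monotonically with M, are ours.

```python
import math

def estimate_step(D0, N):
    """Eq. (12): quantization step balancing 2N*Q^2/12 against the approximation error D0(K, M)."""
    return math.sqrt(6.0 * D0 / N)

def search_budget(encode, max_file_size, N, M_min, M_max):
    """encode(M, Q) is assumed to return (file_size, distortion) for a given budget and step."""
    best = None
    lo, hi = M_min, M_max
    while lo <= hi:
        M = (lo + hi) // 2
        _, D0 = encode(M, 0.0)                  # approximation without quantization gives D0(K, M)
        size, dist = encode(M, estimate_step(D0, N))
        if size <= max_file_size:
            best = (M, size, dist)
            lo = M + 1                          # a larger budget lowers distortion; try to use it
        else:
            hi = M - 1
    return best
```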
6 Experiments and Discussions

A test set of vector data, the vector maps Elevation lines and Europe, is presented in Fig. 5. The data are given in the ArcShape file format with an accuracy of 16 bytes per vertex.
Fig. 5. Test set of vector data: 1) Elevation lines, N=31260 vertices, file size 547 Kbytes; 2) Europe, N=169673 vertices, file size 2671 kBytes
The output file includes the beginnings of the curves and the relative coordinates of the approximation nodes. The data reduction was performed by reducing the number of points in the curves through polygonal approximation and through the dynamic quantization of the coordinates. The rate-distortion curves for the test vector data for different values of the number of segments M and the quantization step Q are represented in Figs. 6 and 7. Distortion is calculated as RMSE. The trade-off between bitrate (file size) and distortion is controlled by two parameters, the total number of segments M and the quantization step Q. To achieve a better compression ratio for some M we have to use a larger quantization step. However, if the quantization step is too large, the distortion cannot be decreased below a certain level regardless of the number of segments M. So, we have to use a smaller quantization step to obtain a solution with smaller distortion. The optimal rate-distortion curve is given by the envelope of the rate-distortion curves constructed for all values of M and Q, see Figs. 6 and 7. The confidence level for the polygonal approximation, calculated for the spatial accuracy ε_0 = 2σ, is at least 90%. Samples of the restored data after lossy compression are shown in Figs. 1 and 8. Experiments showed that the simple formula (12) gives a quite good estimation of the optimal quantization step: the rate-distortion curve for the quantization step Q* is very close to that obtained as the envelope of the rate-distortion curves (see Figs. 6 and 7). We cannot expect that the formula will give a good approximation for all vector maps, but this evaluation can give a quite good initial value of the parameters for a further search.
Fig. 6. Test vector data: a) the rate-distortion curves for Q=0.25÷2.0 m (step 0.25), 2.5 m, 3.0÷11 m (step 1.0), from right to left; b) the optimal rate-distortion curve as envelope for the curves and heuristic estimation of the optimal rate-distortion curve
The total distortion is a monotonic function of the number of segments M, so for given constraints on bitrate or distortion we can find the parameter M by binary search using the estimate (12) for the parameter Q. Using the estimate for the quantization step Q, the processing time was reduced considerably. The processing time is less than one minute for test vector data #1 and about five minutes for test vector data #2.
Fig. 7. Test vector data #2: a) the rate-distortion curves for Q=0.0025, 0.0030, 0.0040, 0.0050, 0.0075, 0.0100, 0.0150, 0.0200, from right to left; b) the optimal rate-distortion curve as envelope for the curves and heuristic estimation of the optimal rate-distortion curve
Fig. 8. Test vector data #2 (fragment), file size is 2671 Kbytes (left), and results of the vector data compression with polygonal approximation and dynamic quantization: file size is 105.8 Kbytes for RMSE=1.3⋅10−3 degrees (center) and file size is 52.2 Kbytes for RMSE=2.7⋅10−3 degrees (right)
7 Conclusions

To solve the problem of lossy compression of vector data, an optimal algorithm was developed. The algorithm is based on the optimal allocation of the number of segments among the vector objects and on optimal polygonal approximation with dynamic quantization of the vector data. An asymptotically optimal variable-rate scalar quantizer was constructed for the dynamic quantization of the relative coordinates. A simple and efficient heuristic rule for estimating the quantization step of the uniform scalar quantizer was proposed. The developed algorithm can be used for lossy compression of one-dimensional signals and multidimensional vector data because the core algorithm does not depend on the dimensionality of the input signal.
References 1. Berger, T.: Optimum Quantization and Permutation Codes. IEEE Trans. on Information Theory 18, 759–765 (1972) 2. le Buhan, C., Ebrahimi, T.: Progressive Polygonal Encoding of Shape Contours. In: Proc. of the Int. Conference on Image Processing and its Applications, vol. 1, pp. 17–21 (1997) 3. Chan, W.S., Chin, F.: On approximation of polygonal curves with minimum number of line segments or minimum error. International Journal on Comput. Geometry and Applications 6, 59–77 (1997) 4. Chou, P., Lookabaugh, T., Gray, R.: Entropy-Constrained Vector Quantization. and Signal Processing 37, 31–42 (1989) 5. Douglas, D.H., Peucker, T.K.: Algorithm for the Reduction of the Number of Points Required to Represent a Line or its Caricature. The Canadian Cartographer 10, 112–122 (1973) 6. Gray, R.M., Neuhoff, D.L.: Quantization. IEEE Trans. on Information Theory 44, 2325– 2383 (1998) 7. Imai, H., Iri, M.: Computational-geometric Methods for Polygonal Approximations of a Curve. Computer Vision and Image Processing 36, 31–41 (1986) 8. Katsaggelos, A.K., Kondi, L., Meier, F., Ostermann, J., Schuster, G.: MPEG-4 and RateDistortion-Based Shape-encoding Techniques. Proc. of the IEEE 86, 1126–1154 (1998) 9. Khoshgozaran, A., Khodaei, A., Sharifzadeh, M., Shahabi, C.: A Multi-Resolution Compression Scheme for Efficient Window Queries over Road Network Databases. In: Proc. of Sixth IEEE International Conference on Data Mining-Workshops, pp. 355–360. IEEE Computer Society Press, Los Alamitos (2006) 10. Kim, K.J., Lim, C.W., Kang, M.G., Park, K.T.: Adaptive Approximation Bounds for Vertex Based Contour Encoding. IEEE Trans. on Image Processing 8, 142–147 (1999) 11. Kolesnikov, A.: Optimal Encoding of Vector Data with Polygonal Approximation and Vertex Quantization. In: Kalviainen, H., Parkkinen, J., Kaarna, A. (eds.) SCIA 2005. LNCS, vol. 3540, pp. 1186–1195. Springer, Heidelberg (2005) 12. Kolesnikov, A., Akimov, A.: Distortion-Constrained Compression of Vector Maps. In: Proc. of ACM Symposium on Applied Computing-SAC’07, vol. 1, pp. 8–12 (2007) 13. Kolesnikov, A., Fränti, P.: Data Reduction of Large Vector Graphics. Pattern Recognition 38, 381–394 (2005) 14. Kolesnikov, A., Fränti, P.: Reduced Search Dynamic Programming for Approximation of Polygonal Curves. Pattern Recognition Letters 24, 2243–2254 (2003) 15. Li, Z., Oppenshaw, S.: A Natural Principle for the Objective Generalization of Digital Maps. Cartography and Geographical Information Systems 20, 19–29 (1993) 16. Perez, J.C., Vidal, E.: Optimum Polygonal Approximation of Digitized Curves. Pattern Recognition Letters 15, 743–750 (1994) 17. Salotti, M.: An Efficient Algorithm for the Optimal Polygonal Approximation of Digitized Curves. Pattern Recognition Letters 22, 215–221 (2001) 18. Salotti, M.: Optimal Polygonal Approximation of Digitized Curves Using the Sum of Square Deviations Criterion. Pattern Recognition 35, 435–443 (2002) 19. Schuster, G.M., Katsaggelos, A.K.: An Optimal Polygonal Boundary Encoding Scheme in the Rate-Distortion Sense. IEEE Trans. on Image Processing 7, 13–26 (1998) 20. Schuster, G.M., Melnikov, G., Katsaggelos, A.K.: Operationally Optimal Vertex-based Shape Coding. IEEE Signal Processing Magazine 15, 91–108 (1998) 21. Shekhar, S., Huang, Y., Djugash, J.: Dictionary Design Algorithms for Vector Map Compression. In: Proc. of Data Compression Conference, p. 471 (2002)
MPEG Video Watermarking Using Tensor Singular Value Decomposition Emad E. Abdallah, A. Ben Hamza, and Prabir Bhattacharya Concordia Institute for Information Systems Engineering Concordia University, Montr´eal, Qu´ebec, Canada {ee abdal,hamza,prabir}@encs.concordia.ca
Abstract. In this paper, we introduce a new watermarking algorithm to embed an invisible watermark into the intra-frames of an MPEG video sequence. Unlike previous methods where each video frame is marked separately, our proposed technique uses high-order tensor decomposition of videos. The key idea behind our approach is to represent a fixed number of the intra-frames as a 3D tensor with two dimensions in space and one dimension in time. Then we modify the singular values of the 3D tensor, which have a good stability and represent the video properties. The main attractive features of this approach are simplicity and robustness. The experimental results show the robustness of the proposed scheme against the most common attacks including geometric transformations, adaptive random noise, low pass filtering, histogram equalization, frame dropping, frame swapping, and frame averaging. Keywords: Video watermarking, tensor singular value decomposition.
1 Introduction
The growing popularity of digital video applications such as video-conferencing, digital television, digital cinema, distance learning, videophone, and video-on-demand has increased the demand for copyright protection. Today it is much easier for digital data owners to transfer multimedia documents across the Internet, and the data can be perfectly duplicated and rapidly redistributed on a large scale. Digital watermarking is an effective way to protect the copyright of multimedia data even after its transmission. Watermarking refers to the process of adding a hidden structure, called a watermark, into a multimedia document, which carries either information about the owner of the cover or the recipient of the original data object. This copyright information is permanently tied to the multimedia document. Such watermark information can be used for several purposes including copyright protection and fingerprinting [1]. A variety of watermarking techniques have been proposed for digital images [2, 3, 4, 5, 6]. These techniques can be divided into two main categories according to the embedding domain of the cover image: spatial domain methods and transform domain methods. The spatial domain methods are the earliest and simplest watermarking techniques but have a low information hiding capacity, and the watermark can also be easily erased by lossy image compression. On
the other hand, the transform domain approaches [4, 5, 6] insert the watermark into the transform coefficients of the image cover, yielding more information embedding and more robustness against watermarking attacks. Apparently any image watermarking technique can be extended to watermark videos, but in reality video watermarking techniques need to meet other challenges, such as the large volume of inherently redundant data between frames, the imbalance between motion and motionless regions [7], and the real-time requirements in video broadcasting, which make video signals highly susceptible to pirate attacks including frame averaging, frame dropping, frame swapping, and statistical analysis [17]. Applying a fixed image watermark to each frame in the video leads to problems of maintaining statistical and perceptual invisibility. Furthermore, applying independent watermarks to each frame yields a frame averaging problem, which is usually exploited by attackers to remove the watermark [7, 11]. In the sequel we briefly review some previous work on video watermarking approaches. In [14] the authors generalize the idea of embedding a binary pattern in the wavelet domain for images by modifying all four sub-bands of the discrete wavelet transform (DWT) for MPEG video sequences. In [9] a real-time digital video watermarking scheme has been proposed to embed the watermark in intra-pictures of the MPEG video sequence by modifying variable length codes (VLCs) directly, to avoid inverse quantization. The advantage of modifying the VLCs is to minimize the perceptual degradation of video quality caused by the embedded watermark. In [12] a blind MPEG-2 video watermarking technique was proposed focusing on geometric attacks. The discrete Fourier transform of 3D chunks of a video scene was used in [8] for video watermarking, where the embedding and the extraction algorithms work with uncompressed video data. In [20] the authors mark only the discrete cosine transform (DCT) coefficients of the I-frames in the MPEG compressed video. A spread spectrum signal is used as the copyright information, which is added to the non-zero DCT coefficients under the condition of not increasing the bit rate. Embedding the watermark in the uncompressed domain was proposed in [10], where the authors use the I-frames to embed the watermark. They adopt the block matching algorithm to find the motion vector of each block and use the motion feature to embed the watermark. A complete overview of video watermarking techniques can be found in [15, 16, 17]. Motivated by the need for more robustness against attacks, especially frame dropping, swapping, averaging and geometric attacks, we propose in this paper a new robust imperceptible video watermarking approach based on 3D tensor singular value decomposition (SVD). Our approach generalizes the SVD-based image watermarking method [6] by applying the SVD technique to the video frames viewed as a 3D tensor with two dimensions in space and one dimension in time. Extensive numerical experiments are performed to validate the robustness of our technique. The remainder of this paper is organized as follows. In Section 2, we briefly review some background material and describe the multidimensional tensor singular value decomposition. In Section 3, we introduce the proposed approach and describe in detail the watermark embedding and extraction algorithms.
Experimental results are presented in Section 4 to demonstrate the performance of the proposed watermarking scheme. Finally, we conclude and point out some future directions in Section 5.
2 Tensor Decomposition
Tensor decomposition is used in many applications including image compression, data mining and psychometrics [18]. In this section we give a general overview of the multidimensional generalization of the singular value decomposition.

2.1 Singular Value Decomposition of 2D Image
The SVD of a 2D image C of size m × n is given by C = U Σ V^T, where U is an orthogonal matrix (U^T U = I), Σ = diag(λ_i) is a diagonal matrix of singular values λ_i, 1 ≤ i ≤ m, arranged in decreasing order, and V is an orthogonal matrix (V^T V = I). The columns of U are the left singular vectors, whereas the columns of V are the right singular vectors of the image C.

2.2 Multidimensional Tensor
Transforming a 3D tensor into a matrix is usually referred to as a "matricization" process [18, 19]. A tensor A of size m × n × p can be rearranged into a matrix in three different ways: a left-right matrix A_1, a front-back matrix A_2, and a top-bottom matrix A_3, as shown in Fig. 1. Clearly, the number of elements in the matrices A_1, A_2 and A_3 must be the same as the number of elements in the tensor A.

2.3 Multidimensional Tensor Singular Value Decomposition
Extending matrix decompositions such as the SVD to higher-order tensors has proven to be quite difficult [18]. Given an m × n × p tensor A, the Tucker decomposition (see Fig. 2) is given by:

A = Σ ×_1 U ×_2 V ×_3 W = \sum_{i=1}^{r_1} \sum_{j=1}^{r_2} \sum_{k=1}^{r_3} σ_{ijk} (u_i ⊗ v_j ⊗ w_k)   (1)
where r_1 ≤ m, r_2 ≤ n, r_3 ≤ p and the columns of U, V, and W are the left singular vectors of the matrices A_1, A_2 and A_3. The tensor Σ = (σ_{ijk}) is called the core tensor and is given by:

Σ = A ×_1 U^T ×_2 V^T ×_3 W^T   (2)
The core tensor does not always need to have the same dimensions as A. In general, we can have either orthogonal columns of U, V, and W or a diagonal core tensor Σ [13].
Fig. 1. Illustration of matricizing a third-order tensor
Applying the SVD to the matrices A_1, A_2 and A_3 yields:

A_1 = U D_1 G_1^T, \quad A_2 = V D_2 G_2^T, \quad A_3 = W D_3 G_3^T,   (3)

where the columns of G_1, G_2, and G_3 are the right singular vectors of A_1, A_2 and A_3, respectively. Moreover, we have

A_1 = U Σ_1 (V ⊗ W)^T, \quad A_2 = V Σ_2 (W ⊗ U)^T, \quad A_3 = W Σ_3 (U ⊗ V)^T,   (4)

where Σ_1 = D_1 G_1^T (V ⊗ W), Σ_2 = D_2 G_2^T (W ⊗ U) and Σ_3 = D_3 G_3^T (U ⊗ V).
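A compact numerical sketch of the matricization and the Tucker factors of Eqs. (1)–(4) is shown below (NumPy). The unfolding conventions (column orderings) are one common choice and may differ from the figure; the function names are ours.

```python
import numpy as np

def unfold(A):
    """Three matricizations of an m x n x p tensor (Section 2.2)."""
    m, n, p = A.shape
    A1 = A.reshape(m, n * p)                     # left-right:  m x (n*p)
    A2 = A.transpose(1, 2, 0).reshape(n, p * m)  # front-back:  n x (p*m)
    A3 = A.transpose(2, 0, 1).reshape(p, m * n)  # top-bottom:  p x (m*n)
    return A1, A2, A3

def hosvd(A):
    """U, V, W are the left singular vectors of the unfoldings (Eq. (3));
    the core tensor is A x1 U^T x2 V^T x3 W^T (Eq. (2))."""
    A1, A2, A3 = unfold(A)
    U = np.linalg.svd(A1, full_matrices=False)[0]
    V = np.linalg.svd(A2, full_matrices=False)[0]
    W = np.linalg.svd(A3, full_matrices=False)[0]
    core = np.einsum('ijk,ia,jb,kc->abc', A, U, V, W)
    return core, U, V, W
```

Reconstructing A from the factors amounts to the reverse mode products, A = core ×_1 U ×_2 V ×_3 W, i.e. the Tucker decomposition of Eq. (1).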
3 Proposed Watermarking Scheme
In this section, we describe the main steps of the proposed watermark embedding and extraction methods.
Fig. 2. Tucker decomposition of a 3D tensor
3.1 Watermark Embedding Process
In an MPEG system, there are three kinds of coded pictures in each group of pictures: I-frames, P-frames, and B-frames. The P-frames are forward predicted from the last I-frame or P-frame, and it may not be possible to reconstruct them without the data from another I- or P-frame. The B-frames are forward and backward predicted from the last/next I-frame or P-frame. In order to improve the robustness of the proposed scheme we only use the I-frames to embed the watermark. Partitioning the MPEG video into scenes is done by counting the percentage of each type of block in a frame. For example, a large percentage of intra-blocks in a P-frame implies the beginning of a new scene [12]. The watermark embedding process is described in Algorithm 1. We used an identical watermark for each group of I-frames in the same scene, and different watermarks for different scenes.

3.2 Watermark Extraction Process
In order to extract the watermark, our algorithm requires the original video sequence as well as the watermarked video sequence. The watermark extraction process description is shown in Algorithm 2. Fig. 5 shows an example of video watermarking using our proposed scheme. Clearly the difference between the original and the watermarked videos is not noticeable to the human observer.
4 Experiment Results
We tested the performance of our proposed watermarking on a variety of video sequences. Due to the space limitation we show our experiments in one video
Algorithm 1. Watermark embedding algorithm
Input: MPEG video sequence V, and a watermark image W of size m × m.
Output: Watermarked video V′.
For each scene:
1- Convert the I-frames from RGB to YUV.
2- Divide the converted luminance layers (Y) of the I-frames into chunks (groups) of fixed length (10 frames). Each group of I-frames is represented by a 3D tensor as shown in Fig. 3.
3- Matricize the tensor in three different ways (left-right, front-back, and top-bottom) to obtain A_1, A_2, A_3, respectively.
4- Apply the SVD to the matrices A_1, A_2, and A_3, that is, A_1 = U D_1 G_1^T, A_2 = V D_2 G_2^T, and A_3 = W D_3 G_3^T.
5- Calculate the 3D singular values (SVs) matrix Σ_3D = A ×_1 U^T ×_2 V^T ×_3 W^T.
6- Apply the SVD to W, that is: W = U_w Σ_w V_w^T.
7- Modify the largest SVs of Σ_3D with the SVs of W using λ′_i = λ_i + α λ^w_i, where λ_i and λ^w_i, 1 ≤ i ≤ m, are the singular values of Σ_3D and Σ_w respectively, λ′_i denotes the distorted SVs, and α is a constant scaling factor.
8- Produce the watermarked tensor A_w = Σ′_3D ×_1 U ×_2 V ×_3 W, where Σ′_3D is the modified 3D core tensor.
9- Finally use the modified I-frames to produce the watermarked cover video sequence as depicted in Fig. 4.
Algorithm 2. Watermark extraction algorithm
Input: Original and watermarked MPEG video sequences.
Output: Extracted watermark image Ŵ.
For each scene:
1- Apply the first five steps of the watermark embedding process to the original and the watermarked video sequences.
2- Extract the singular values of the visual watermark as follows: λ^w_i = (λ′_i − λ_i)/α, where λ_i and λ′_i are the original and the watermarked singular values, respectively.
3- Construct the watermark image using the extracted singular values: Ŵ = U_w Σ̂_w V_w^T, where U_w and V_w are the left and right singular vectors of W respectively, and Σ̂_w = diag(λ^w_i) is the extracted matrix of SVs.
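The singular-value embedding and extraction steps (steps 6–7 of Algorithm 1 and step 2 of Algorithm 2) amount to a few lines of linear algebra; the sketch below shows them in isolation. It assumes `core_svs` is a 1-D array holding the largest singular values of the 3D core tensor (at least m of them, sorted in decreasing order); the function names are ours.

```python
import numpy as np

def embed_watermark_svs(core_svs, watermark, alpha=0.2):
    """Distort the leading core-tensor SVs with the SVs of the m x m watermark image."""
    Uw, sw, Vwt = np.linalg.svd(watermark, full_matrices=False)
    marked = core_svs.copy()
    marked[:sw.size] = core_svs[:sw.size] + alpha * sw
    return marked, (Uw, Vwt)          # Uw, Vwt are kept as side information for extraction

def extract_watermark(marked_svs, original_svs, Uw, Vwt, alpha=0.2):
    """Recover the watermark SVs and rebuild the watermark image (Algorithm 2)."""
    m = Uw.shape[0]
    sw = (marked_svs[:m] - original_svs[:m]) / alpha
    return Uw @ np.diag(sw) @ Vwt
```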
sequence called “Claire” which has 154 frames. The experimental tests are mainly performed to verify the robustness against attacks including rescaling, rotation, Gaussian noise, histogram equalization, gamma correction, low-pass filtering, sharpening, motion blurring, frame compression, frame dropping, frame swapping, frame averaging, cropping, and also combinations of these attacks. For all the attacks, we show one of the attacked I-frames and the extracted watermark. Experimentally a constant scaling factor α = 0.2 was chosen such that it is small enough to keep the watermark imperceptible to the human observer,
Fig. 3. Illustration of a multidimensional tensor produced from one group of the I-frames
Fig. 4. Producing the watermarked cover video sequence diagram
and large enough to resist as many attacks as possible. Fig. 6 shows one of the watermarked I-frames with different kinds of attacks. Fig. 7 depicts their corresponding extracted watermarks. The title below each subfigure in Fig. 7 displays the correlation coefficient between the original and the best extracted watermark. Fig. 8 depicts the correlation coefficients for different video sequences under the
Fig. 5. (a) Original video I-frame, (b) Watermarked video frame, (c) Original watermark image, and (d) Extracted watermark image from the 3D tensor containing the video frame shown in (a)
scaling attack with ratios from 0.1 to 0.9. The correlation coefficient ρ between the original watermark image W and the extracted watermark Ŵ is given by

ρ = \frac{\sum_{i,j=1}^{m} (W_{ij} - \bar{W})(\hat{W}_{ij} - \bar{\hat{W}})}{\sqrt{\sum_{i,j=1}^{m} (W_{ij} - \bar{W})^2 \; \sum_{i,j=1}^{m} (\hat{W}_{ij} - \bar{\hat{W}})^2}}

where \bar{W} and \bar{\hat{W}} are the mean values of W and Ŵ, respectively. The results obtained from our experiments clearly indicate the robustness of the proposed algorithm against the attacks commonly used on videos. For the frame dropping attack, the watermark is embedded into the frames of a scene, and due to the large amount of redundancy between frames, the calculated SVD of the 3D tensor will not change significantly when dropping up to 50% of those highly correlated frames. Similar results were obtained under the frame swapping attack. For the frame averaging attack, the attackers can use multiple frames and try to eliminate the watermark by statistical averaging of the watermarked video frames [10, 7]. We used different watermarks for each scene and an identical watermark within the same scene, which prevents the attackers from statistically comparing frames and removing the watermark from the motionless regions in successive video frames. To measure the perceptual quality of the watermarked video, we calculate the peak signal-to-noise ratio (PSNR), which is used to estimate the quality of the watermarked frames in comparison with the original ones. Fig. 9 depicts the PSNR experimental results.
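For completeness, the detection measure used in the experiments is the normalized correlation between the original and extracted watermark images; a one-function sketch (name ours):

```python
import numpy as np

def watermark_correlation(w, w_hat):
    """Correlation coefficient between the original and extracted watermark images."""
    a = w.astype(float) - w.mean()
    b = w_hat.astype(float) - w_hat.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```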
Fig. 6. Claire video sequence I-frames distorted by different attacks. (a) Rescaling 100% − 50% − 100%, (b) Salt & pepper noise (2%), (c) Histogram equalization, (d) Gamma correction, (e) Low-pass filter (3×3), (f) Sharpening α = 0.2, (g) Rotation 20°, (h) JPEG compression Q = 25, (i) Cropping 8% from the top and 8% from the bottom, (j) Cropping 8% from the left and 8% from the right, (k) Cropping 5% from the top and 5% from the bottom and dropping 50% of the frames, (l) Rescaling and dropping 50% of the frames.
Fig. 7. Extracted watermarks from the tensors containing the attacked I-frames shown in Fig. 6; the correlation coefficients with the original watermark are: (a) 0.9852, (b) 0.7537, (c) 0.9128, (d) 0.901, (e) 0.9071, (f) 0.7267, (g) 0.7141, (h) 0.9412, (i) 0.9783, (j) 0.9905, (k) 0.9591, (l) 0.9759.
Fig. 8. Robustness of the proposed watermarking scheme against the rescaling attack for different video sequences (Claire, tennis, skiing, and airplane): correlation coefficient versus downsizing ratio.
Fig. 9. PSNR of the watermarked frames as a function of the frame number for different scaling factors α = 0.1, 0.15, 0.2, 0.25, 0.3.
5 Conclusion
In this paper, we proposed an efficient tensor SVD-based video watermarking scheme. The key idea is to embed a watermark image into stable coefficients of the video frames. This is done by modifying the singular values of the high-order tensor built from the intra-frames of the video. The performance of the proposed method was evaluated through extensive experiments, which clearly show good robustness against a variety of attacks including frame dropping, frame averaging, image processing and geometric attacks. The proposed watermarking scheme may be further improved by applying block-based watermarking techniques to the 3D tensor.
References 1. Cox, I.C., Miller, M.L., Bloom, J.A.: Digital Watermarking. San Francisco (2001) 2. Hartung, F., Kutter, M.: Multimedia watermarking techniques. Proc. IEEE 87, 1079–1107 (1999) 3. Memon, N., Wong, P.: Digital watermarks: protecting multimedia content. Comm. of the ACM 47, 35–43 (1998) 4. Cox, I.J., Kilian, J., Leightonm, T., Shamoon, T.: Secure spread spectrum satermarking for multimedia. IEEE Trans. Image Processing 6, 1673–1687 (1997) 5. Ganic, E., Eskicioglu, A.M.: Robust DWT-SVD domain image watermarking: embedding data in all frequencies. In: Proc. ACM Multimedia and Security Workshop, pp. 166–174. ACM Press, New York (2004) 6. Liu, R., Tan, T.: A SVD-based watermarking scheme for protecting rightful ownership. IEEE Trans. Multimedia 4, 121–128 (2002)
7. Swanson, M.D., Zhu, B., Tewfik, A.H.: Multiresolution scene-based video watermarking using perceptual models. IEEE Jour. on Selected Areas in Commun. 16, 540–550 (1998) 8. Deguillaume, F., Csurka, G., ORuanaidh, J., Pun, T.: Robust 3D DFT video watermarking. In: Proc. SPIE, Sec. and Wat. of Mult. Content, vol. 3971, pp. 346–357 (2000) 9. Liu, H., Chang, L.: Real time digital video watermarking for digital rights management via modification of VLCS. In: Proc. Internat. Conf. on Parallel and Distributed Systems, vol. 2, pp. 295–299 (2005) 10. Lin, Y.R., Hsu, W.H.: An embedded watermark technique in video for copyright protection. In: Proc. Internat. Conf. Pattern Recog., vol. 4, pp. 795–798 (2006) 11. Chan, P.W., Lyu, M.R., Chin, R.T.: A Novel scheme for hybrid digital video watermarking: approach evaluation and experimentation. IEEE Trans. on Circuits and Sys. for Video Technology 15, 1638–1649 (2005) 12. Wang, Y., Pearmain, A.: Blind MPEG-2 video watermarking robust against geometric attacks: a set of approaches in DCT domain. IEEE Trans. on Image Processing 15, 1536–1543 (2006) 13. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Application 21, 1253–1278 (2000) 14. Elbasi, E., Eskicioglu, A.M.: MPEG-1 video semi-blind watermarking algorithm in the DWT domain. IEEE Internat. Symposium on Broadband Multimedia Sys. and Broadcasting (2006) 15. Langelaar, G.C., Setyawan, I., Lagendijk, R.L.: Watermarking digital image and video data. A state-of-the-art overview. IEEE Signal Process. Mag. 17, 20–46 (2000) 16. Gwenael, A.D., Dugelay, J.L.: A guide tour of video watermarking. Signal Process.: Image Commun. 18, 263–282 (2003) 17. Bhattacharya, S., Chattopadhyay, T., Pal, A.: A Survey on different video watermarking techniques and comparative analysis with reference to H.264/AVC. In: IEEE Internat. Symposium on Consumer Electronics, pp. 1–6. IEEE Computer Society Press, Los Alamitos (2006) 18. Bader, B.W., Kolda, T.G.: MATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software 32, 635–653 (2006) 19. Kolda, T.G.: MATLAB Tensor Toolbox. Version 2.2 (2007), http://csmr.ca.sandia.gov/∼ tgkolda/TensorToolbox/ 20. Hartung, F., Girod, B.: Watermarking of uncompressed and compressed video. IEEE Trans. Signal Processing 66, 283–301 (1998)
Embedding Quality Measures in PIFS Fractal Coding Andrea Abate, Michele Nappi, and Daniel Riccio Universit´ a Degli Studi di Salerno, via Ponte Don Melillo, 84084, Fisciano, Salerno, Italy {abate, mnappi, driccio}@unisa.it
Abstract. Fractal image coding is a relatively recent technique based on the representation of an image by a map of self-similarities. In recent years, most researchers have focused their attention on the problem of speeding up the fractal coding process, while paying little attention to possible improvements of the objective and subjective image quality. In this paper, we investigate image quality measures that could represent a reasonable alternative to the RMSE when finding a suitable map of similarities. Subjective assessments have been performed in order to compare the performance of the selected quality metrics. Experimental results bear witness to the superiority of a quality metric based on Fourier coefficients.
1 Introduction
Many objective quality measures [1] have been defined to replace subjective evaluations by retaining, as much as possible, fidelity to the human perception of the image distortions introduced by coding schemes. The most common measures are undoubtedly the RMSE (Root Mean Square Error) and the PSNR (Peak Signal to Noise Ratio). They owe their wide diffusion to the fact that they work well on average while having a very low computational cost. However, there are cases in which the quality estimates given by the PSNR are very far from human perception, and this led many researchers to define new quality metrics providing better performance in terms of distortion measurement, even if at a higher computational cost. Some examples are given by the Human Visual System [7] or the FFT Magnitude Phase Norm [1]. All these measures have been largely used to assess the global quality of the decoded image after a coding process has been applied; in other words, the original image is compressed/decompressed by means of an encoder and then the overall amount of distortion introduced by the coding scheme is measured. Thus, objective measures represent an effective way to compare different coding schemes in terms of the percentage of distortion introduced for a fixed compression ratio. The key idea of this paper is to embed quality measures into the coding process, rather than confining them to the role of a mere analysis tool. The compression scheme we adopted for this study is the quad-tree based PIFS (Partitioned Iterated Function System), whose scheme is represented in Fig. 1. It lends itself to a direct replacement of the RMSE by other quality measures.
Fig. 1. The architecture of our fractal coder
Indeed, fractal-based approaches partition the whole image into two different sets of square regions: ranges and domains. Ranges represent a covering of the image; they are the regions we want to encode. Domains, on the contrary, are overlapping regions used to encode them. Each range is approximated by applying an affine transformation to a domain, which is selected as the region minimizing the approximation error. The RMSE is adopted as the approximation error measure in the PIFS approach proposed by Jaquin [4]; in this paper many alternative measures are investigated. Because a full exhaustive search for the best range/domain matching is prohibitively costly, we reduce the search time by pruning the domain pool through the DRDC (Deferring Range/Domain Comparison) technique using a pre-classification criterion, in this case the approximation error with respect to a reference block. This criterion is formally justified by a lemma in [8]. Thus, many papers have faced the problem of reducing the time complexity of the compression phase [4], while little effort has been spent on modifying the fractal coding scheme to improve the quality of the decoded images [3]. Since objective quality measures themselves are the subject of this study, the PSNR is not very suitable as an evaluation criterion; therefore, subjective assessments are given as experimental results. For the subjective evaluations the ITU-R recommendation has been chosen as the assessment methodology [6], in particular for the viewing conditions and testing protocols. Two kinds of experiments have been conducted, according to the Direct Comparison (DC) and Forced Choice methods. The direct comparison is suitable to compare the quality of alternative systems when applied to the same set of input images, while the forced choice is applied to establish the point at which an impairment becomes visible for a given system.
2 Quality Measures
A major problem in evaluating lossy techniques is the extreme difficulty in describing the type and amount of degradation in reconstructed images. Because of the inherent drawbacks associated with the subjective measures of image quality, there has been a great deal of interest in developing quantitative measures
that can consistently be used as a substitute. In this paper we selected a limited number of seven quality measures from the literature (see Table 1); they are all discrete and bivariate, while R(j, k) and R̂(j, k) denote the samples of the original and approximated range block.

Table 1. Quality Measures

Name | Definition
Average Difference | AD = \frac{1}{n^2} \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} [R(j,k) − R̂(j,k)]
Image Fidelity | IF = 1 − \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} [R(j,k) − R̂(j,k)]^2 \, / \, \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} R(j,k)^2
Maximum Difference | MD = \max_{j,k} |R(j,k) − R̂(j,k)|
RMSE | RMSE = \sqrt{\frac{1}{n^2} \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} [R(j,k) − R̂(j,k)]^2}
PMSE | PMSE = \frac{1}{n^2 \max_{j,k}[R]^2} \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} [R(j,k) − R̂(j,k)]^2
HVS | Human Visual System (see Eq. (2))
MF | Real and imaginary parts of the FFT coefficients (see Eq. (3))
Human Visual System: The U{·} operator expresses an image processed through a DCT spectral mask and then inverse discrete cosine transformed, and is defined as follows:

U\{R(j, k)\} = R(j, k) − DCT^{-1}\left[ H(\sqrt{u^2 + v^2}) \, Ω(u, v) \right],   (1)

where Ω(u, v) denotes the 2D DCT of the image, DCT^{-1} is the 2D inverse DCT, and H(\sqrt{u^2 + v^2}) is Nill's function defined in [7]. The quality measure is defined as follows:

HVS = \sum_{j,k=0}^{n-1} \left( U\{R(j, k)\} − U\{R̂(j, k)\} \right)^2,   (2)
FFT-based quality measure: let Γ(u, v) and Γ̂(u, v) be the DFTs of the original and coded blocks; the MF distance function can be defined as follows:

MF = \frac{1}{n^2} \sum_{u,v=0}^{n-1} \left( Ψ(u, v) − Ψ̂(u, v) \right)^2,   (3)

where Ψ is defined as Ψ(u, v) = |Re(Γ(u, v))| + |Im(Γ(u, v))|.

In PIFS coding the whole image is partitioned into a set of ranges (as described in Section 1). For each range, the coding scheme looks for an approximating domain to assign to it; this domain is mapped into the corresponding range by
an affine transformation. For a given range R, PIFS associates the domain providing the smallest approximation error in a root-mean-square sense, so it is exactly at that point that it is possible to embed a quality measure different from the RMSE to decide the best range/domain association. The key idea underlying this strategy is that quality measures outperforming the RMSE from a subjective point of view can improve the subjective appearance of the whole image by improving the quality of the single ranges. In other words, in the original definition of the PIFS coding scheme proposed by Jaquin, the range is approximated by the transformation R̂ = α · D + β minimizing the error function ‖R − (α · D + β)‖_2. In order to embed new quality measures in PIFS coding, α and β are still computed by solving a mean-square-error problem, while the distance between the original and the transformed range is measured by a new quality measure f(R, R̂).
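The modification to the coder is therefore local: the contrast/brightness pair (α, β) is still fitted by least squares, and only the range/domain comparison uses the alternative metric. The following sketch illustrates this with the MF measure of Eq. (3); the function names are ours, and the real coder additionally performs quad-tree splitting, domain classification (DRDC/KD-tree) and the usual isometries, all omitted here.

```python
import numpy as np

def fit_alpha_beta(range_block, domain_block):
    """Least-squares contrast (alpha) and brightness (beta) mapping the domain onto the range."""
    d = domain_block.ravel().astype(float)
    r = range_block.ravel().astype(float)
    A = np.column_stack([d, np.ones_like(d)])
    (alpha, beta), *_ = np.linalg.lstsq(A, r, rcond=None)
    return alpha, beta

def mf_distance(r, r_hat):
    """MF measure of Eq. (3): mean squared difference of |Re| + |Im| of the block DFTs."""
    psi = lambda x: np.abs(np.fft.fft2(x).real) + np.abs(np.fft.fft2(x).imag)
    return float(np.mean((psi(r) - psi(r_hat)) ** 2))

def best_domain(range_block, domain_blocks, error_fn=mf_distance):
    """Range search with a pluggable quality measure f(R, R_hat)."""
    best = None
    for idx, d in enumerate(domain_blocks):
        alpha, beta = fit_alpha_beta(range_block, d)
        approx = alpha * d + beta
        err = error_fn(range_block, approx)
        if best is None or err < best[0]:
            best = (err, idx, alpha, beta)
    return best
```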
3 Image Set and Testing Criteria
Tests have been conducted on a dataset of fifty-six images: twelve of them come from the Waterloo BragZone standard database [5], thirty-six from the set of natural images in [9], and the remaining eight from the web. A large variability in testing conditions has been ensured by selecting test images containing patterns, smooth regions and details. They are all 8-bit grayscale images at a resolution of 512 × 512 pixels. The RMSE has been replaced by the 7 quality measures described in Section 2. The grayscale image data set was then obtained by coding and decoding the whole image set with each of the selected quality measures (one at a time) embedded in the PIFS compression scheme. For each test image, twenty different compression ratios were selected for degradation. They range from 4.5:1 to 50:1 with an increment of about 4. The resulting images have undergone subjective quality assessments according to the ITU (International Telecommunications Union) recommendation. Indeed, there have been efforts by the ITU to establish an objective measurement of image and video quality. In particular, general viewing conditions, monitor resolution and contrast for subjective assessments are given. This document [6] also accurately describes the criteria and the test material selection, the number and characteristics of the observers, as well as the organization and the duration of the sessions. In this study, two groups of observers, A and B, have been used. Group A consists of 20 people, who are all non-expert and not experienced assessors. Group B contains 15 people, who are all skilled and to some degree familiar with image processing applications (e.g., radiologists, photographers, ...). People in both A and B have been carefully introduced to the method of assessment, the quality factors likely to occur, the grading scale, the sequence and the timing. Training sequences demonstrating the range and the type of the impairments to be assessed have been used, with illustrating pictures other than those used in the test but of comparable sensitivity. The session duration is about half an hour. At the beginning of the first session, two "dummy presentations" have
been introduced to stabilize the observers' opinion. The data issued from these presentations have not been taken into account in the results of the test. In general, particular methods should be used to address particular assessment problems. We chose two different assessment methods: the direct comparison and the forced choice.

3.1 Direct Comparison
It compares the quality of alternative systems. Pictures are coded/decoded at the same compression rate, but with different quality measures. In our case, five compression ratios have been considered (8, 12, 20, 32, 44). In each test session only four pictures have been randomly selected from the original test set, in order to keep the session time from exceeding half an hour, while all comparisons of distinct pairs of quality measures are performed. In other words, let I_j and I_k be the test image I decoded with the quality measures m_j and m_k; only the pair (I_j, I_k) is visualized, while (I_k, I_j) is discarded; conversely, the pair (I_j, I_k) with j = k is admitted. At each time, two different images are presented to the observer, who has to rate the one providing the better subjective quality. Three rating choices are left to the observer: a) the image I_j shows a better quality than I_k, b) I_k appears better than I_j, c) there is no difference. In this way, the ratings of observer i can be represented by means of a matrix R_i^I, which is R_i^I(j, k) = 1, R_i^I(k, j) = 0 in the first case, R_i^I(j, k) = 0, R_i^I(k, j) = 1 in the second case, and R_i^I(j, k) = 0, R_i^I(k, j) = 0 otherwise. Admitting comparisons of a test image to itself ((I_j, I_k) with j = k) allows us to define a criterion w(R) to evaluate the reliability of the rating session. Indeed, we can make the assumption that ratings of an observer pinpointing identical images are more reliable than those of someone who rates for one of them (when j = k). In particular, a reliability index can be defined as follows:

w_i = 1 − \frac{1}{n} \sum_{j=0}^{n-1} R_i^I(j, j).   (4)
Thus, the average matrix of ratings R̄^I is computed for each test image I as the weighted sum:

$$ \bar{R}^I(j,k) = \frac{1}{|U_I|}\sum_{i \in U_I} w_i R_i^I(j,k) \qquad (5) $$
where U_I is the set of observers who tested the image I. The matrix R̄^I is computed for each of the five compression ratios and then averaged in order to obtain a single n × n matrix. For a given quality measure, the score is extracted from the matrix R̄^I(j,k) by summing over its columns. A further way to evaluate the performance of the selected quality measures is to estimate how many times the observer found no difference between two images coded at the same compression ratio but with different quality measures. This is indicative of how significant the gap between the scorings of the quality measures really is. Thus, this parameter represents a sound way to quantify the
uncertainty of an observer in distinguishing images coded by different quality measures. It can be easily computed from the R̄^I(j,k) matrix by counting the pairs (j, k) for which R̄^I(j,k) = R̄^I(k,j) = 0, with j ≠ k.
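For illustration, the following is a minimal sketch (our own, with hypothetical variable names) of how the reliability index of Eq. (4), the weighted average rating matrix of Eq. (5), the per-measure scores and the uncertainty count could be computed from a list of per-observer rating matrices; the win convention assumed here is R(j, k) = 1 when measure j is preferred over measure k.

```python
import numpy as np

def reliability(R_i):
    # Eq. (4): w_i = 1 - (1/n) * sum_j R_i(j, j); observers who vote for one of
    # two identical images (nonzero diagonal entries) are considered less reliable.
    n = R_i.shape[0]
    return 1.0 - np.trace(R_i) / n

def average_ratings(ratings):
    # Eq. (5): reliability-weighted average of the n x n rating matrices
    # collected from the |U_I| observers who rated test image I.
    weights = [reliability(R) for R in ratings]
    return sum(w * R for w, R in zip(weights, ratings)) / len(ratings)

def scores_and_uncertainty(R_bar):
    # Score of measure j: sum of its wins against every other measure (row sums,
    # under the assumed convention above).
    scores = R_bar.sum(axis=1)
    # Uncertainty: distinct pairs (j, k), j != k, never told apart by any observer.
    n = R_bar.shape[0]
    undecided = sum(1 for j in range(n) for k in range(j + 1, n)
                    if R_bar[j, k] == 0 and R_bar[k, j] == 0)
    return scores, undecided
```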
3.2 Forced Choice
It establishes the point at which an impairment becomes visible. At the beginning of a session, the original test image is presented to the observer for 1 minute; a delay of 15 seconds follows before the test starts. A sequence of images decoded at increasing compression ratios is presented to the observer, who stops the presentation when the image distortion is perceived as annoying. The compression ratio at which the observer stopped the presentation is then reported as the highest that the current quality measure can reach for that test image. In order to limit the presentation time, only four test images per subject have been randomly selected from the whole test set.
4 Results and Discussion
In this first experiment, only people from group A have been involved, and they have been asked to rate a subset of the whole image database (the twelve BragZone and eight web images). Figure 3 (a) shows the average scores obtained by the quality measures on this test set. Several aspects are immediately clear from Figure 3. First of all, the Average Difference obtains the worst performance, due to the smudgy distortion it introduces in decoded images. Furthermore, Image Fidelity and Maximum Difference reveal a behavior that is not very stable across different images and compression ratios. On the contrary, PMSE and RMSE give almost comparable performances, which are better than those provided by the HSV in some cases. The only quality measure showing a high rating for all compression ratios is the MF (Magnitude of the FFT). The main reason is that, from a subjective point of view, image distortions are uniformly distributed in pictures coded with this measure. The reason why this happens resides in the image partitioning process. In other words, in the quad-tree PIFS coding scheme, a range is coded by the best approximating domain if the approximation error is lower than a given threshold, and is further partitioned otherwise, as sketched in the code fragment below. The RMSE and MF generally give different quad-tree partitionings for the same image. In particular, the MF partitioning is more balanced, favoring midsize blocks. Figure 2 (a) shows two different quad-tree decompositions for the Lena image, the former obtained with the RMSE measure and the latter with the MF. Another example of the improvement in subjective quality provided by the MF measure is given in Figure 2 (b), which reports the eye region of the mandrill image coded with the MF, HSV, RMSE and PMSE quality measures at a compression ratio of 20:1. This picture shows that the overall quality reached by the MF exceeds that of all the others. Figure 3 (b) reports the average degree of uncertainty for different compression ratios.
Fig. 2. a) MF and RMSE quad-tree partitioning of the Lena image; b) detail of the mandrill image decoded at the same compression ratio (20:1) with the MF, RMSE, HSV and PMSE quality measures
An interesting observation immediately emerges from Figure 3 (b): HSV, PMSE and RMSE share almost even values of uncertainty, which is consistent with what we found when analyzing their rating scores. On the contrary, AD, IF and MD all show a small degree of uncertainty, but this is connected to a scant effectiveness in terms of quality scores. Essentially, the MF remains the best candidate as an alternative quality measure to be embedded in the PIFS coding process: it provides higher quality scores as well as a small degree of uncertainty. In Figure 3 (c), for some sample images, the highest compression ratios have been averaged over all the observers rating those images, while the last row shows the mean value of each column. The HSV and RMSE are almost comparable in performance, while MF significantly outperforms both. There are two main reasons motivating the superiority of the MF: a) the FFT retains most of the image information in its first coefficients, which makes it more robust than the RMSE with respect to small changes in details, by principally characterizing low frequencies; b) the ease of computing it by summing squared differences, without applying any filter as in the case of the HSV. Indeed, even though the HSV is based on the DCT (Discrete Cosine Transform), which often provides better performance in several image processing applications (image coding, filtering, indexing), it is less effective than the MF, probably due to the complexity of the model.
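The quad-tree decision referred to above can be summarized with the following simplified sketch; `quality_error` stands in for any of the measures compared here (RMSE, MF, ...), the domain search omits isometries and luminance fitting, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def rmse_error(rng, dom):
    # Baseline measure: root mean squared error between a range and a (pre-scaled) domain.
    return float(np.sqrt(np.mean((rng.astype(float) - dom.astype(float)) ** 2)))

def encode_range(image, x, y, size, domains, quality_error=rmse_error,
                 threshold=8.0, min_size=4):
    # Quad-tree PIFS step: keep the best approximating domain if its error is
    # below the threshold, otherwise split the range into four sub-ranges.
    rng = image[y:y + size, x:x + size]
    errors = [quality_error(rng, d[:size, :size]) for d in domains]
    best = int(np.argmin(errors))
    if errors[best] < threshold or size <= min_size:
        return [(x, y, size, best)]          # leaf: store the range-domain mapping
    half = size // 2
    mappings = []
    for dy in (0, half):                     # recurse on the four quadrants
        for dx in (0, half):
            mappings += encode_range(image, x + dx, y + dy, half, domains,
                                     quality_error, threshold, min_size)
    return mappings
```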
[Figure panels: (a) average scores of the quality measures; (b) average uncertainty vs. compression ratio; (c) table of the highest compression ratios reached by each measure (AD, HSV, IF, MD, MF, PMSE, RMSE) on the Character, Elaine, Cat, Boy and Goldhill images, with column means in the last row.]
Fig. 3. The average uncertainty and scores for the group A
In the second experiment, skilled people from group B rated randomly selected images from the whole dataset (fifty-six images). Several interesting observations emerge from Figure 4. First of all, the histogram in Figure 4 (a) shows that the general trend for all the quality measures is consistent with that given in Figure 3, even if some of them narrow the gap (PMSE and RMSE, IF and HSV). The MF outperforms all the others, with a margin of about 0.38 over the PMSE, which obtains the second best rate. Figure 4 (b) also discloses an important aspect related to the expertise of the people in group B: the overall uncertainty gets lower. The reason this happens is that the probability that people in B vote for one image in the test couple increases, thus lowering the number of equally rated couples (R_i^I(j,k) = 0, R_i^I(k,j) = 0). This also results in a more accurate performance estimation, as further confirmed by Figure 4 (c). Indeed, the overall best compression ratio achieved by the quality metrics in the forced choice experiment (group B) uniformly decreases, in particular for AD and MD. Figure 4 (c) also shows a gap of about 4.7 between MF and HSV, further supporting the MF measure as a valid alternative to the RMSE in the PIFS coding scheme.
[Figure panels: (a) average scores; (b) average uncertainty; (c) table of the highest compression ratios per image and quality measure for group B.]
Fig. 4. The average uncertainty and scores for the group B
5 Conclusions
Most researchers have focused their attention on the problem of speeding up the fractal coding process, while little effort has been spent on modifying the fractal coding scheme to improve the quality of decoded images. In this paper, we selected 7 quality measures and embedded them in the quad-tree based fractal coding scheme. The performance of these quality metrics has been assessed from a subjective point of view. In particular, a framework suggested by the standard ITU-R BT-500-11 Recommendation has been used for the subjective assessments. Experimental results pointed out that the RMSE and HSV performances are quite comparable and generally satisfactory. They also bear witness to the superiority of a Fourier-based quality metric.
References
1. Avcibas, I., Sankur, B., Sayood, K.: Statistical evaluation of image quality measures. Journal of Electronic Imaging 11(2), 206–223 (2002)
2. Ahmed, N., Natarajan, T., Rao, K.R.: Discrete Cosine Transform. IEEE Transactions on Computers, pp. 90–93. IEEE Computer Society Press, Los Alamitos (1974)
3. De Oliveira, J.F.L., Mendonça, G.V., Dias, R.J.: A Modified Fractal Transformation to Improve the Quality of Fractal Coded Images. In: IEEE Signal Processing Society 1998 International Conference on Image Processing, pp. 756–759. IEEE Computer Society Press, Los Alamitos (1998)
4. Fisher, Y.: Fractal Image Compression: Theory and Application. Springer-Verlag, New York (1994)
5. Kominek, J.: Waterloo BragZone and Fractals Repository (January 25, 2007), http://links.uwaterloo.ca/bragzone.base.html
6. Recommendation ITU-R BT-500-11: Methodology for the subjective assessment of the quality of television pictures, Geneva (2000)
7. Nill, N.B.: A visual model weighted cosine transform for image compression and quality assessment. IEEE Trans. Commun. 3(6), 551–557 (1985)
8. Distasi, R., Nappi, M., Riccio, D.: A Range/Domain Approximation Error Based Approach for Fractal Image Compression. IEEE Transactions on Image Processing 15(1), 89–97 (2006)
9. Starosolski, R.: Performance evaluation of lossless medical and natural continuous tone image compression algorithms. In: Proceedings of the SPIE Conference, vol. 59, pp. 116–127 (2005), http://sun.iinf.polsl.gliwice.pl/rstaros/mednat/index.html
Comparison of ARTMAP Neural Networks for Classification for Face Recognition from Video

M. Barry and E. Granger

Laboratoire d'imagerie, de vision et d'intelligence artificielle
École de technologie supérieure, 1100 Notre-Dame Ouest, Montreal, Quebec, H3C 1K3, Canada
Phone: 1-514-396-8650; Fax: 1-514-396-8595
[email protected], [email protected]

Abstract. In video-based face recognition applications, the What-and-Where Fusion Neural Network (WWFNN) has been shown to reduce the generalization error by accumulating a classifier's predictions over time, according to each individual in the environment. In this paper, three ARTMAP variants – fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC – are compared for the classification of faces detected in the WWFNN. ART-EMAP (Stage 1) and ARTMAP-IC expand on the well-known fuzzy ARTMAP by using distributed activation of category neurons, and by biasing distributed predictions according to the number of times these neurons are activated by training set patterns. The average performance of the WWFNNs with each ARTMAP network is compared to the WWFNN with a reference k-NN classifier in terms of generalization error, convergence time and compression, using a data set of real-world video sequences. Simulation results indicate that when ARTMAP-IC is used inside the WWFNN, it can achieve a generalization error that is significantly higher (about 20% on average) than when fuzzy ARTMAP or ART-EMAP is used. Indeed, ARTMAP-IC is less discriminant than the two other ARTMAP networks in cases with complex decision boundaries, when the training data is limited and unbalanced, as found in complex video data. However, ARTMAP-IC can outperform the others when classes are designed with a larger number of training patterns.
1 Introduction

Face recognition has received considerable attention over the past decade because of the wide range of commercial and law enforcement applications and the availability of affordable technology. In addition, face acquisition does not depend on the cooperation of individuals. As a result, face recognition remains a powerful tool in spite of the existence of other very reliable characteristics for biometric recognition such as iris scans and fingerprint analysis [12]. A video-based face recognition system applied to the identification of individuals will first perform segmentation to locate and isolate regions of interest (ROI) in successive video frames. Then, invariant and discriminant features will be extracted from the
ROIs, and used by the recognition system to assign a class label to individuals. Face recognition from video is however a very challenging problem, since the frames provided by video sequences are typically of low quality and generally small. Furthermore, images acquired in uncontrolled environments present several technical challenges such as changes in illumination, pose and occlusion. A typical approach to recognizing faces in video consists of applying techniques developed for static images once face detection has been performed. Over the last few years, several techniques have been proposed to recognize faces in static images. These approaches yield a high level of accuracy when operational environments are constrained, where many assumptions may be made about pose, illumination, facial expression, orientation and occlusion to provide accurate recognition. However, the performance of these techniques can degrade considerably when they are applied in unconstrained environments, as found in many video surveillance applications. This may be tied to the limited amount of training data from which recognition systems are designed. More recently, some authors [1,6,10,13] have attempted to exploit both the spatial and the temporal information contained in video sequences to provide a higher level of accuracy in unconstrained environments. For instance, a time-series state space model has been proposed by Zhou et al. [13] to fuse temporal information in video, which simultaneously characterizes the kinematics and identity of individuals in a probabilistic framework. A distributed sensor network (DSN) is proposed by Foresti [6] as a solution to the problem of partial occlusion that occurs in dynamic environments. Li and Chellappa [10] have introduced a face verification system for video sequences that exploits the trajectories of Gabor facial features. Finally, Barry and Granger [1] have applied the What-and-Where fusion neural network (WWFNN) to video-based face recognition. The WWFNN performs recognition by accumulating the responses of a classifier over time according to each individual in the environment. The prediction of the WWFNN is therefore the result of one or multiple responses by the classifier. From previous experiments on real video data [1], such accumulation has been shown to significantly reduce the generalization error. In previous work, a particular realization of the WWFNN was considered for video-based face recognition. It used the fuzzy ARTMAP neural network for classification, and a bank of Kalman filters for tracking. This paper compares three ARTMAP variants – fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC – for classification of faces detected in video sequences within the WWFNN. ARTMAP refers to a family of neural network architectures that can perform fast, stable, on-line, supervised or unsupervised, incremental learning and classification. An attractive feature of ARTMAP neural networks is the ability to learn new data incrementally, without having to retrain on all cumulated training data, as would be the case with the Multi-Layer Perceptron. ARTMAP networks can also represent individual classes by one or more prototypes, and lend themselves well to high-speed processing, which makes them suitable for both resource-limited and real-time face applications. The performance of the ARTMAP neural networks is assessed through computer simulation on complex real-world video data.
The data set has been collected by the National Research Council of Canada [7], and corresponds to video sequences that display the face of a single person under different scenarios such as partial occlusion, pose, facial
expression, motion, resolution and proximity. The average performance is assessed in terms of the resources required during training and the generalization error during operation. A WWFNN with a k-Nearest-Neighbor classifier is also included for reference. This paper is organized as follows. In Section 2, the WWFNN applied to face recognition is briefly described. In Section 3, the ARTMAP neural network and the three variants considered in this study are briefly reviewed. In Section 4, the experimental methodology (data set, protocol and performance measures) employed to evaluate and compare performance is presented. Finally, in Section 5, simulation results are presented and discussed.
2 What-and-Where Fusion Neural Network

The What-and-Where fusion neural network [1] applied to face recognition from video is presented in Fig. 1. It is composed of 3 modules: a classifier, a tracker and an evidence accumulation module. During operations, the recognition system receives information provided by ROIs of successive video frames, which is then partitioned into two data streams called What and Where, and fed to the classification and tracking systems, respectively. The What parameters of an ROI characterize the intrinsic properties of a face. In this paper, the What parameters are represented by the vectorized form of ROIs, I = [I_1, ..., I_l, ..., I_p], where I_l corresponds to the gray level intensity of a pixel, and where p = w × h is the total number of pixels in an ROI of height h and width w. In contrast, the Where data stream of an ROI indicates the position of the face in the environment, and it is represented by a vector b = [C, S], where C and S are the centroid and the size of the blob. The centroid C(x, y) of a blob in a frame is defined by its 2D spatial coordinates, x and y, and the size S(w, h) of a blob by its width w and its height h. It is important to note that the Where parameters are useless for face identification but necessary to resolve ambiguities such as occlusions in complex scenes. For each ROI, the evidence accumulator receives the output activation pattern y^ab from the classifier and the track number h furnished by the tracking sub-system. Based on these two responses, the accumulator will provide the most likely identity y^e of the faces detected in a scene. The tracking system uses Where parameters to pursue faces in a given scene over successive frames. While color information, appearance, shape and facial features have been used to track faces in video, a blob-based tracking scheme has been used to pursue faces in this work. A new track is initialized for each newly-detected blob, and deleted whenever a person leaves the scene. The tracking system computes the most likely position of each face in the next frame according to previous observations. For each new frame, the tracker computes the distance between estimated and actual blob coordinates, and then provides the track number h associated with that face. Prior to operations, a neural network classifier is trained in supervised learning mode with a representative data set. This data set consists of a variety of ROIs extracted from video frames for each individual. During operations, the classifier uses What parameters to predict the class associated with an input ROI. The output y^ab is a binary pattern of activity.
Fig. 1. A What-and-Where fusion neural network for face recognition
Predictions of the What-and-Where fusion neural network are provided via evidence accumulation. The accumulation module exploits the result of the tracking module to accumulate the responses of the neural network classifier. That is, the evidence accumulation module accumulates the classifier's responses over time according to each track. The prediction provided by the What-and-Where fusion neural network is therefore the result of one or multiple responses by the classifier. Evidence accumulation is implemented by means of identical evidence accumulation fields F^e_1, F^e_2, ..., F^e_R, where each field F^e_h is connected to a track h, and replicates the neural network classifier's output field, that is, contains L nodes, one per class. The classifier's output nodes are linked to their respective response nodes in all fields F^e_h, h = 1, 2, ..., R. Each field F^e_h is characterized by a field accumulation pattern T^e_h = (T^e_h1, T^e_h2, ..., T^e_hL). Upon initiation of track h, T^e_h is set equal to 0. When track h = H is assigned to an ROI, F^e_H becomes active. The activity pattern y^ab output by the classifier accumulates onto F^e_H according to:

$$ \mathbf{T}^e_H = \mathbf{T}^e_H + \mathbf{y}^{ab} \qquad (1) $$

Accumulation of ROI activity patterns in F^e_h continues until track h is deleted. For a given ROI, the activity pattern y^e output from evidence accumulation is equal to T^e_H, and the individual is predicted to be:

$$ K^e = \arg\max_{k^e} \{ T^e_{Hk^e} : k^e = 1, 2, \ldots, L \} \qquad (2) $$
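A minimal sketch of the accumulation rule in Eqs. (1)–(2) is given below; it assumes one accumulator vector per active track and a classifier that outputs a binary activation pattern y^ab per ROI, and the class and variable names are illustrative rather than the authors' implementation.

```python
import numpy as np

class EvidenceAccumulator:
    """What-and-Where evidence accumulation: sum classifier activations per track."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.T = {}                            # track id h -> accumulated pattern T^e_h

    def reset_track(self, h):
        self.T[h] = np.zeros(self.n_classes)   # T^e_h is set to 0 when track h starts

    def update(self, h, y_ab):
        if h not in self.T:
            self.reset_track(h)
        self.T[h] += y_ab                      # Eq. (1): accumulate the activity pattern
        return int(np.argmax(self.T[h]))       # Eq. (2): most likely identity K^e
```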
3 Classification with ARTMAP Networks

ARTMAP networks self-organize stable recognition categories in response to arbitrary sequences of input patterns [2]. ARTMAP is often applied using the simplified version shown in Fig. 2, which is obtained by combining an ART unsupervised neural network
Fig. 2. ARTMAP neural network
with a map field. The ART neural network consists of two fully connected layers of nodes: an M-node input layer, F1, and an N-node competitive layer, F2. A set of real-valued weights W = {w_ij ∈ [0,1] : i = 1, 2, ..., M; j = 1, 2, ..., N} is associated with the F1-to-F2 layer connections. Each F2 node j represents a recognition category that learns a prototype vector w_j = (w_1j, w_2j, ..., w_Mj). The F2 layer is connected, through learned associative links, to an L-node map field F^ab, where L is the number of classes in the output space. A set of binary weights W^ab = {w_jk^ab ∈ {0,1} : j = 1, 2, ..., N; k = 1, 2, ..., L} is associated with the F2-to-F^ab connections. The vector w_j^ab = (w_j1^ab, w_j2^ab, ..., w_jL^ab) links F2 node j to one of the L output classes. During training, ARTMAP classifiers perform incremental supervised learning of the mapping between training set vectors a = (a_1, a_2, ..., a_M) and output labels t = (t_1, t_2, ..., t_L), where t_K = 1 if K is the target class label for a, and zero elsewhere. Although the first ARTMAP [2] classifier is limited to processing binary-valued input patterns, ART-EMAP (Stage 1) [4], ARTMAP-IC [5] and fuzzy ARTMAP [3] can process both analog and binary-valued input patterns by employing fuzzy ART as the ART network. With the fuzzy ART network, each input pattern goes through a transformation called complement coding, which doubles its number of components: A = (a, a^c) = (a_1, a_2, ..., a_M; a_1^c, a_2^c, ..., a_M^c), where a_i^c = 1 − a_i and a_i ∈ [0,1]. With complement coding and fast learning (β = 1), fuzzy ART represents category j as a hyperrectangle R_j that just encloses all the training patterns a to which it has been assigned. ART-EMAP (Stage 1) and ARTMAP-IC are extensions of fuzzy ARTMAP that produce a binary winner-take-all pattern y during training, but use distributed activation of coded F2 nodes during testing. ARTMAP-IC goes further by biasing distributed test set predictions according to the number of times F2 nodes are assigned to training set patterns. Table 1 presents the equations used by the three ARTMAP networks for the different steps of the algorithm.
ARTMAP-IC also involves Negative Match Tracking (MT−), where the match tracking parameter takes small negative values, ε ≤ 0. MT− has not been included in this study since it has been associated with a higher generalization error in some applications [9].
Table 1. Equations used by the three ARTMAP networks: |·| is the norm operator (|w_j| ≡ Σ_{i=1}^{2M} |w_ij|), ∧ is the fuzzy AND operator ((A ∧ w_j)_i ≡ min(A_i, w_ij)), α is the choice parameter, β is the learning rate parameter, ε is the match tracking parameter, ρ̄ is the baseline vigilance parameter, Q is the number of F2 categories with the greatest activation T_j, and c_j is the number of training patterns that activate F2 node j.

Algorithmic step        | Training phase                                  | Testing phase
1. Initialization       | α > 0, β ∈ [0,1], ρ̄ = 0, ε = 0+                | N/A
2. Complement coding    | A = (a, a^c)  (M' = 2M)                         | A = (a, a^c)  (M' = 2M)
3. Prototype selection
 - choice function      | T_j(A) = |A ∧ w_j| / (α + |w_j|)                | T_j(A) = |A ∧ w_j| / (α + |w_j|)
 - vigilance test       | |A ∧ w_J| ≥ ρM                                  | N/A
 - F2 activation        | y_j = 1 only if j = J                           | y_j = 1 only if j = J
4. Class prediction
 - FAM                  | S_k^ab(y) = Σ_{j=1..N} y_j w_jk^ab              | S_k^ab(y) = Σ_{j=1..N} y_j w_jk^ab
 - ART-EMAP             | S_k^ab(y) = Σ_{j=1..N} y_j w_jk^ab              | S_k^ab(y) = Σ_{j∈Q} w_jk^ab T_j / Σ_{k'=1..L} Σ_{j∈Q} w_jk'^ab T_j
 - ARTMAP-IC            | S_k^ab(y) = Σ_{j=1..N} y_j w_jk^ab              | S_k^ab(y) = Σ_{j∈Q} w_jk^ab c_j T_j / Σ_{k'=1..L} Σ_{j∈Q} w_jk'^ab c_j T_j
5. Learning
 - match tracking       | ρ = (|A ∧ w_J| / M) + ε                         | N/A
 - prototype update     | w_J' = β(A ∧ w_J) + (1 − β) w_J                 | N/A
 - instance counting    | c_J = c_J + 1                                   | N/A
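To make the table concrete, here is a simplified sketch (our own, not the authors' code) of complement coding, the fuzzy ART choice and vigilance functions, and the ARTMAP-IC distributed prediction with the Q-max rule; learning, match tracking and many other details from Table 1 are omitted.

```python
import numpy as np

def complement_code(a):
    # A = (a, 1 - a): doubles the dimension; |A| = M for every input pattern.
    return np.concatenate([a, 1.0 - a])

def choice(A, W, alpha=0.001):
    # T_j(A) = |A ^ w_j| / (alpha + |w_j|), with ^ the component-wise minimum.
    fuzzy_and = np.minimum(A, W)                 # W: one prototype w_j per row
    return fuzzy_and.sum(axis=1) / (alpha + W.sum(axis=1))

def passes_vigilance(A, w_j, rho):
    # Vigilance test: |A ^ w_J| >= rho * |A| (|A| equals M under complement coding).
    return np.minimum(A, w_j).sum() >= rho * A.sum()

def predict_artmap_ic(A, W, W_ab, counts, alpha=0.001, Q=3):
    # Test-time prediction: distribute activation over the Q most active F2 nodes,
    # weight it by the instance counts c_j, and sum into class scores S_k.
    T = choice(A, W, alpha)
    top = np.argsort(T)[-Q:]                     # Q-max rule
    y = np.zeros_like(T)
    y[top] = counts[top] * T[top]
    y /= y.sum()                                 # normalized distributed pattern
    S = y @ W_ab                                 # S_k = sum_j y_j * w_jk^ab
    return int(np.argmax(S))                     # predicted class K
```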
4 Experimental Methodology

In order to compare the performance of WWFNNs using fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC classifiers, computer simulations were performed using a complex real-world database of video streams. Prior to the simulation trials, this dataset was normalized using the min-max technique, and partitioned into two parts – training and test subsets. Each subset contains an equal proportion of images from each class. The experimental protocol used in this paper is 10-fold cross-validation. This strategy partitions the training subset into 10 equal folds. Over 10 trials, each fold
is successively used as a validation subset, while the 9 remaining folds are used for training. The fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC neural networks are trained using the particle swarm optimization (PSO) learning strategy [8]. This learning strategy selects the network parameter values that minimize the generalization error. The 4-dimensional search space for the PSO learning strategy was set to the following ranges: β ∈ [0,1], α ∈ [0.00001, 1], ρ̄ ∈ [0, 0.999], and ε ∈ [−1, 1]. Each simulation trial was performed with 60 particles, and ended after 100 iterations, or when the best particle position remained unchanged for 5 consecutive iterations. Finally, a bank of Kalman filters was used for tracking faces in the WWFNN. Average results, with corresponding standard error, are always obtained as a result of the 10 independent simulation trials. The non-parametric k-Nearest-Neighbor (k-NN) classifier is included for reference. For each computer simulation, the value of k employed with k-NN was selected among 1 through 10, using 10-fold validation. During each simulation trial, the performance of the k-NN classifier and of the fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC neural networks is compared for different ROI scaling sizes. To assess the impact of ROI normalization on performance, the ROI containing the face furnished by the detection is scaled using bilinear interpolation, from an ROI of 10x10 to an ROI of 60x60. The dataset used for the computer simulations was collected by the National Research Council (NRC) [7]. It contains 22 video sequences captured with an Intel webcam mounted on a computer monitor. Each sequence has an average duration of 12 seconds, and contains an average of 300 frames. Each sequence contains the face of one among eleven individuals sitting in front of a computer and exhibiting a wide range of facial expressions, poses and motions. The detection process yields about 300 ROIs per individual, ranging in size from 29x18 to 132x119 pixels. There are two sequences per individual, one dedicated to training and the other to testing. The video sequences are taken under approximately the same illumination conditions (no sunlight, only ceiling light evenly distributed over the room), the same setup and almost the same background, for all persons in the database. Furthermore, the video captures have two different resolutions, 160x120 and 320x240, and each face occupies between 1/4 and 1/8 of the image. Finally, this dataset contains a variety of challenging scenarios such as low resolution, motion blur, out-of-focus frames, facial orientation, facial expression and occlusion. The average performance of the classifiers was assessed in terms of the resources required during training and the generalization error on the test set. The amount of resources required during training is measured by compression and convergence time. Compression refers to the average number of training patterns per category prototype created in the competition layer F2. Convergence time is the number of epochs required to complete learning for a learning strategy. It does not include presentations of the validation subset used to perform hold-out validation. Generalization error is estimated as the ratio of incorrectly classified test subset patterns over all test set patterns. The combination of compression and convergence time provides useful insight into the amount of processing during the training phase required to produce the best asymptotic generalization error.
Compression is directly related to memory resources required for recognition, and to the computational time during operation.
Note that faces in frames were detected using the Haar coefficient approach [11].
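As a sketch of the pre-processing described above, the fragment below applies per-feature min–max scaling computed on the training portion and a self-contained bilinear resize for the ROI scaling; equivalent library routines could be used instead, and the function names are ours.

```python
import numpy as np

def min_max_normalize(X_train, X_test):
    # Min-max normalization: scale each feature to [0, 1] using training statistics.
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X_train - lo) / span, (X_test - lo) / span

def resize_bilinear(img, out_h, out_w):
    # Rescale a grayscale ROI with bilinear interpolation (e.g. to 10x10 ... 60x60).
    h, w = img.shape
    ys, xs = np.linspace(0, h - 1, out_h), np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```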
5 Simulation Results

Fig. 3 presents the average performance of the fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC neural networks and of the k-NN classifier, alone and within the WWFNN, as a function of the ROI scaling size. Simulation trials (not presented in this paper) have shown that, with this dataset, the Q-max rule [5] for distributing F2 layer activation in the ART-EMAP and ARTMAP-IC classifiers gives better results than the threshold rule [4]. The choice of Q that was found to give good results for the video data is Q = min(N_c/L, 3), where L = 11 is the number of classes and N_c is the number of committed F2 nodes. In particular, higher values of Q tend to degrade performance considerably. Fig. 3(a) indicates that fuzzy ARTMAP and ART-EMAP perform better than ARTMAP-IC for this data set, and yield an average generalization error of about 13% over ROI sizes, in comparison to 22% for ARTMAP-IC. The k-NN yields the lowest average generalization error, of about 12%. On average, the three ARTMAP networks converge after about 552 epochs of the PSO training strategy (60 particles x 9.2 iterations x 1 epoch) and give a compression that varies between 15 (for an ROI of 10x10) and 50 (for an ROI of 60x60) training patterns per F2 node. ARTMAP networks represent a good alternative when the amount of resources and computational time is limited. For example, with an ROI of 60x60, ARTMAP requires about 40 times less memory and matching operations than k-NN. The WWFNN allows an average generalization error to be achieved that is significantly lower than that of the classifiers alone. This error is about 6% for fuzzy ARTMAP and ART-EMAP, which corresponds to a 50% improvement in accuracy over these networks alone when the ROI scaling size is 60x60. In contrast, the generalization error tends toward 1% for k-NN as the ROI size grows. Finally, the WWFNN with ARTMAP-IC yields an error of about 10%. This lower accuracy of ARTMAP-IC is caused by the use of the frequency information (the c_j values) when computing predictions. Note that the dataset used to evaluate the performance is unbalanced (i.e., the training set has a different number of ROIs per class, due to limitations in the detection process). Table 2 reveals that ARTMAP-IC tends to favor classes with more training patterns. This table presents the average confusion matrix associated with the k-NN, fuzzy ARTMAP, ART-EMAP and ARTMAP-IC classifiers when used inside the WWFNN. In this table, most of the test patterns are assigned to classes 3, 5, 7 and 10, which have at least 160 training patterns. Note that in Table 2, for classes 2 and 8, the average generalization errors per class are 26.6% and 37.8%, respectively, and both have been trained with fewer than 100 patterns. However, given the cost of the data collection process and the shortcomings of face detection (segmentation) algorithms, limited and unbalanced data sets may be common. Moreover, the data set used for training is relatively small when one considers the complexity of the environment and the number of features typically used with video data. Biasing class predictions in video data with unbalanced training data classes does not appear to yield any performance improvement. Fig. 4(a)(b) shows the occurrence of prediction errors over time associated with the WWFNN using the three ARTMAP neural networks in this study, for an ROI size of 60x60, and for videos 2 and 8.
The classes corresponding to these two videos are designed with fewer training patterns (36 patterns for class 2 and 90 patterns for class 8) than the other videos. In Fig. 4(a)(b), the fuzzy ARTMAP and ART-EMAP give
[Figure panels: (a) Generalization error vs. ROI size; (b) Compression vs. ROI size; legend: k-NN, fuzzy ARTMAP, ART-EMAP and ARTMAP-IC, alone and within the WWFNN.]
Fig. 3. Average performance of fuzzy ARTMAP, ART-EMAP, ARTMAP-IC and k-NN classifiers, alone and within the WWFNN, versus the ROI scaling size for all NRC training data. Note that fuzzy ARTMAP, ART-EMAP and ARTMAP-IC provide the same compression and convergence time (error bars are standard error).
erroneous predictions on only a few frames. For example, as shown in Fig. 4(b), in video 8 the individual makes abrupt motions between frames 60 and 100. The impact of these patterns is reduced by the WWFNN, and completely attenuated in the case of video 2, as shown in Fig. 4(a). However, as shown in Fig. 4(a)(b), ARTMAP-IC misclassifies most of the test patterns over time. Since the WWFNN prediction is based on the accumulation of ARTMAP-IC predictions according to track, successive prediction errors by ARTMAP-IC will tend to accumulate over time.
Table 2. Average confusion matrix associated with a What-and-Where fusion neural network using fuzzy ARTMAP, ART-EMAP (stage 1), ARTMAP-IC and k-NN classifiers obtained on test data for an ROI size of 60x60 Classifiers True in Classes WWFNN k-NN fuzzy ARTMAP
1
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
2
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
3
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
4
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
5
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
6
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
7
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
8
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
9
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
10
ART-EMAP ARTMAP-IC k-NN fuzzy ARTMAP
11
ART-EMAP ARTMAP-IC
1 130.1 142 142 142 0 0 0 9.6 0 0 0 0 0 0 0 0.4 0 0 0 0 0 0 0 0 0 0.4 0.5 0.5 0 0 0 0 0 0.6 0.6 1 0 0 0 0 0 0 0 7.3
2 5.2 0 0 0 40.9 41 41 30.1 0 0 0 0 0.1 0 0 0.1 0 0 0 0 1.3 0 0 0.1 0 0 0 0 0 0 0 0.1 0 0 0 0 1.7 0.3 0.3 0.4 0 0 0 0
3 0.8 0 0 0 0 0 0 0.2 157.6 160 160 160 0 0 0 0 0 0.3 0.5 1.6 0.3 0 0 0.3 0 0 0 0 0 9.8 9.8 36.6 0 0 0 0 0.4 3.6 3.6 2.7 0 0 0 0.1
4 1.7 0 0 0 0 0 0 0 0 0 0 0 109 131 131 130 0.2 0 0 0 0.8 0.9 1 4.4 0 0 0 0 0 0 0 0 0 0.5 0.2 0 0 0 0 0 0 0 0 0
Predicted classes 5 6 7 8 0.8 0.1 0.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.9 0 0 0 0 0 0 0 0 0 0 0 0 0 0.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 179.6 0.1 10.5 0 193.7 0 0 0 193.5 0 0 0 192.3 0 0.1 0 0.2 121.5 4.7 0 2.8 130 0 0 2.7 130 0 0 0.5 127.9 0 0 2.1 0 191.8 0 20.1 0.2 173.3 0 19.5 0.2 173.8 0 4.1 0 189.4 0 0 0 0 99.5 0 0 0 90.2 0 0 0 90.2 0 0 0.2 62.2 0 0 0 0 0 0.3 0 0 0 0.4 0 0 0 0.6 0 0 0 1 0 3.3 2 52 0 3.2 1.7 52.1 0 0.3 0 34 0 2.3 0 0.4 0.1 0.4 3.3 0 4.9 0 3 0 4.9 0 0.8 0.2 27.1 0
9 0.2 0 0 0 0.1 0 0 0.1 0 0 0 0 0 0 0 0 3.6 0 0 0 2.4 0 0 0.1 0.1 0 0 0 0.5 0 0 0.3 183.9 180.1 181.7 186.3 0.7 0.4 0.4 3.5 0 0 0 2.4
10 0 0 0 0 0 0 0 0 0.5 0 0 0 3.2 0 0 0 0 0 0 0 1.3 0.3 0.3 0.7 0 0 0 0 0 0 0 0.5 0 0.9 0.6 0 159.2 106.9 107.1 124.8 0.2 0 0 0
11 2.3 0 0 0 0 0 0 1 0 0 0 0 18.4 0 0 0.5 0 0 0 0 1.5 0 0 0 0 0 0 0 0 0 0 0.1 4.1 5.6 4.5 0.1 2.7 0.6 0.5 1.3 145.9 138.8 139.1 109.1
Error/ class (%) 8.4 0 0 0 0.2 0 0 26.6 1.5 0 0 0 16.8 0 0 0.7 7.4 0.2 0.3 0.9 9.3 3 3 4.6 1.1 10.7 10.4 2.4 0.5 9.8 9.8 37.8 2.2 4.2 3.4 0.9 5.6 36.7 36.6 26.2 0.7 5.6 5.3 25.8
[Figure panels: (a) video sequence #2; (b) video sequence #8; (c) video sequence #7; (d) video sequence #10 — prediction errors per frame for fuzzy ARTMAP, ART-EMAP and ARTMAP-IC, alone and within the WWFNN, with track resets marked.]
Fig. 4. An example of the distribution of prediction errors over time with the WWFNN and the three ARTMAP variants alone when using an ROI scaling size of 60x60
Although weighting predictions according to the number of training set patterns may not be appropriate in certain contexts, it may be useful when the training data is larger and more representative of the environment. As shown in Table 2, ARTMAP-IC performs better than the two other ARTMAP networks for videos 7 and 10. The classes defined by these two videos are designed with a high number of training patterns. Fig. 4(c)(d) presents the occurrence of errors over time associated with the WWFNN using the three ARTMAP networks, for an ROI size of 60x60, and for videos 7 and 10. In these two videos, the individuals vary their proximity to the camera, show different facial expressions and orientations, and go out of focus. In video 10, the individual makes abrupt motions yielding partial face patterns. As shown in Fig. 4(c)(d), ARTMAP-IC considerably reduces the impact of this variability. For video 7, the ARTMAP-IC prediction error is almost 0%. ARTMAP-IC provides a lower generalization error than fuzzy ARTMAP and ART-EMAP when the classes are well defined by the training data.
6 Conclusion

A particular realization of the What-and-Where Fusion Neural Network was previously applied to face recognition in video sequences. In this paper, three ARTMAP neural
networks – fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC – have been compared for the classification of detected faces within this framework. Their performance has been assessed in terms of the resources required during training and the generalization error achieved during operations, through computer simulation on complex real-world video data. Simulation results indicate that fuzzy ARTMAP and ART-EMAP (Stage 1) yield a significantly lower generalization error than ARTMAP-IC over a wide range of ROI scaling sizes. The distributed activation of coded F2 category nodes has no significant effect on the test set predictions on this data set, since ART-EMAP performs as well as fuzzy ARTMAP. However, the instance counting procedure used in ARTMAP-IC increases the generalization error considerably on the video data. A more detailed analysis of errors has revealed that this is linked to the use of a limited and unbalanced training data set. In fact, ARTMAP-IC tends to outperform the others when classes are designed with a larger number of training patterns.
References
1. Barry, M., Granger, E.: Face Recognition in Video Using a What-and-Where Fusion Neural Network. In: Int'l Joint Conference on Neural Networks, Orlando, USA, August 2-17, 2007 (to appear, 2007)
2. Carpenter, G.A., Grossberg, S., Reynolds, J.H.: ARTMAP: Supervised Real-Time Learning and Classification of Nonstationary Data by a Self-Organizing Neural Network. Neural Networks 4, 565–588 (1991)
3. Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., Rosen, D.B.: Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Trans. Neural Networks 3, 698–713 (1992)
4. Carpenter, G.A., Ross, W.D.: ART-EMAP: A Neural Network Architecture for Object Recognition by Evidence Accumulation. IEEE Trans. on Neural Networks 6(4), 805–818 (1995)
5. Carpenter, G.A., Markuzon, N.: ARTMAP-IC and Medical Diagnosis: Instance Counting and Inconsistent Cases. Neural Networks 11(1), 323–336 (1998)
6. Foresti, G.L.: Detecting Multiple Objects under Partial Occlusion by Integrating Classification and Tracking Approaches. Journal of Imaging Systems and Technology 11, 263–276 (2000)
7. Gorodnichy, D.: Video-Based Framework for Face Recognition in Video. In: Computer and Robot Vision, Victoria, Canada, pp. 330–338 (May 9-11, 2005)
8. Granger, E., Henniges, P., Sabourin, R., Oliveira, L.S.: Particle Swarm Optimisation of Fuzzy ARTMAP Parameters. In: Int'l Joint Conference on Neural Networks, Vancouver, Canada (July 16-21, 2006)
9. Henniges, P., Granger, E., Sabourin, R., Oliveira, L.S.: Impact of Fuzzy ARTMAP Match Tracking Strategies on the Recognition of Handwritten Digits. In: Artificial Neural Networks in Engineering Conf., St-Louis, USA (November 5-8, 2006)
10. Li, B., Chellappa, R.: Face Verification through Tracking Facial Features. Journal of the Optical Society of America 18, 530–544 (2001)
11. Lienhart, R., Maydt, J.: An Extended Set of Haar-like Features for Rapid Object Detection. In: Int'l Conf. on Image Processing, vol. 1, pp. 900–903 (2002)
12. Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face Recognition: A Literature Survey. ACM Computing Surveys 35(4), 399–458 (2003)
13. Zhou, S.K., Krueger, V., Chellappa, R.: Probabilistic Recognition of Human Faces from Video. Computer Vision and Image Understanding 91, 214–245 (2003)
Face Recognition by Curvelet Based Feature Extraction

Tanaya Mandal^1, Angshul Majumdar^2, and Q.M. Jonathan Wu^1

^1 Department of Electrical and Computer Engineering, University of Windsor, Canada
^2 Pricewaterhouse Coopers, India
[email protected], [email protected], [email protected]
1 Introduction Face Recognition has been studied for over 20 years in computer vision. Since the beginning of the 1990s, the subject has become a major issue, mainly due to the important real-world applications of face recognition like smart surveillance, secure access, telecommunications, digital libraries and medicine. Faces are very specific objects whose most common appearance (frontal faces) roughly look alike but subtle changes make the faces different. The different face recognition techniques have been discussed in the work of Zhao et al [1]. In the recent years, the success of wavelets in other branches of computer vision, inspired face recognition researchers to apply wavelet based multiresolution techniques for face recognition [2, 3]. Over the past two decades, following wavelets, other multiresolution tools like contourlets [4] , ridgelets [5] and curvelets [6], to name a few, were developed. These tools have better directional decomposition capabilities than wavelets. These new techniques were used for image processing problems like image compression [7] and denoising [8], but not for addressing problems related to computer vision. In some recent works, Majumdar showed that a new multiresolution tool – curvelets can serve as bases for pattern recognition problems. Using curvelets, he obtained very good results for character recognition [9]. In a comparative study [10] Majumdar showed that curvelets can indeed supersede wavelets as bases for face recognition. In this paper we propose to go one step further towards a curvelet based M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 806–817, 2007. © Springer-Verlag Berlin Heidelberg 2007
face recognition system by fusing the results from multiple classifiers trained with curvelet coefficients from images having different gray scale resolutions. In Section 2, the curvelet transform and our proposed feature extraction technique are discussed. A brief overview of the Support Vector Machine (SVM), the classifier we have used, is given in Section 3. Section 4 covers the three databases on which we have carried out our experiments. Finally, Section 5 lists the experimental results and Section 6 concludes with the future prospects of this technique.
2 Curvelet Based Feature Extraction

Wavelets and related classical multiresolution ideas exploit a limited dictionary made up of roughly isotropic elements occurring at all scales and locations. These dictionaries do not exhibit highly anisotropic elements, and there are only a fixed number of directional elements (the usual orthogonal wavelet transforms have wavelets with primarily vertical, primarily horizontal and primarily diagonal orientations), independent of scale. Images do not always exhibit isotropic scaling and thus call for other kinds of multi-scale representation. Computer vision researchers of the '80s and early '90s [11, 12] were inspired by two biological properties of the visual cortex: that it functions in a) multi-scale and b) multi-orientational mode. The multi-scale aspect has been captured by scale-space analysis as well as the wavelet transform. However, standard wavelet transforms for two-dimensional functions f(x1, x2) have only very crude capabilities to resolve directional features. The limitations of the wavelet transform inspired vision researchers to propose new transforms with improved directional representation, such as the 'Steerable Pyramids' and 'Cortex Transforms'. The curvelet transform by Candes and Donoho [6] is the latest multi-directional multi-scale transform. Field and Olshausen [13] set up a computer experiment for empirically discovering the basis that best represents a database of 16 by 16 image patches. Although this experiment is limited in scale, they discovered that the best basis is a collection of needle-shaped filters occurring at various scales, locations and orientations. The interested reader will find a stark similarity between curvelets, which derive from mathematical analysis, and these empirical basis elements arising from data analysis [14]. It is not possible to go into the details of the digital curvelet transform within this paper; the interested reader can refer to the works of Candes and Donoho [16]. A brief procedural definition of the curvelet transform is provided here for ready reference. The detailed discussion can be found in [15].

Ridgelet Transform – A basic tool for calculating ridgelet coefficients is to view ridgelet analysis as a form of wavelet analysis in the Radon domain. The Radon transform
$R : L^2(\mathbb{R}^2) \rightarrow L^2([0, 2\pi], L^2(\mathbb{R}))$ is defined by

$$ Rf(\theta, t) = \int f(x_1, x_2)\,\delta(x_1\cos\theta + x_2\sin\theta - t)\,dx_1\,dx_2 \qquad (1) $$

where $\delta$ is the Dirac delta. The ridgelet coefficients $\mathcal{R}_f(a, b, \theta)$ of an object $f$ are given by analysis of the Radon transform via
$$ \mathcal{R}_f(a, b, \theta) = \int Rf(\theta, t)\, a^{-1/2}\,\psi\!\left(\frac{t - b}{a}\right) dt \qquad (2) $$
Hence, the ridgelet transform is precisely the application of a one-dimensional wavelet transform to the slices of the Radon transform where the angular variable $\theta$ is constant and $t$ is varying.

Discrete Ridgelet Transform – A basic strategy for calculating the continuous ridgelet transform is first to compute the Radon transform $Rf(\theta, t)$ and second to apply a one-dimensional wavelet transform to the slices $Rf(\theta, t)$. The projection formula [16] is a fundamental fact about the Radon transform,
$$ \hat{f}(\omega\cos\theta, \omega\sin\theta) = \int Rf(\theta, t)\, e^{-2\pi i\omega t}\, dt \qquad (3) $$
This says that the Radon transform can be obtained by applying the one-dimensional inverse Fourier transform to the two-dimensional Fourier transform restricted to radial lines through the origin. This enables one to arrive at an approximate Radon transform for digital data based on the FFT. The steps to obtain this are as follows:
1. 2D-FFT – Compute the two-dimensional Fast Fourier Transform (FFT) of f.
2. Cartesian to polar conversion – Using an interpolation scheme, substitute the sampled values of the Fourier transform obtained on the square lattice with sampled values of $\hat{f}$ on a polar lattice: that is, on a lattice where the points fall on lines through the origin.
3. 1D-IFFT – Compute the one-dimensional Inverse Fast Fourier Transform (IFFT) on each line, i.e., for each value of the angular parameter.
A code sketch of these three steps is given after Fig. 1.

Digital Curvelet Transform – The digital curvelet transform of f is achieved by the following implementation steps.

Subband Decomposition: A bank of filters $P_0, (\Delta_s, s > 0)$ is defined. The image $f$ is filtered into subbands with the à trous algorithm

$$ f \rightarrow (P_0 f, \Delta_1 f, \Delta_2 f, \ldots) \qquad (4) $$

The different sub-bands $\Delta_s f$ contain detail about $2^{-2s}$ wide.
Smooth Partitioning: Each subband is smoothly windowed into "squares" of an appropriate scale

$$ \Delta_s f \rightarrow (w_Q \Delta_s f)_{Q \in \mathcal{Q}_s} \qquad (5) $$

where $w_Q$ is a collection of smooth windows localized around the dyadic squares
$$ Q = [k_1/2^s, (k_1+1)/2^s] \times [k_2/2^s, (k_2+1)/2^s] $$
Renormalization: Each resulting square is renormalized to unit scale

$$ g_Q = (T_Q)^{-1}(w_Q \Delta_s f), \quad Q \in \mathcal{Q}_s \qquad (6) $$
Fig. 1. Overview of Curvelet Transform
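As referenced above, the following is a crude, illustrative sketch of the FFT-based approximate Radon transform (steps 1–3) followed by a single-level Haar wavelet along each slice; it uses nearest-neighbour resampling onto the polar grid and is meant only to convey the procedure, not to reproduce the full digital curvelet transform used in this work.

```python
import numpy as np

def radon_via_fft(img, n_angles=64):
    # Approximate Radon transform: 2D FFT, resample onto radial lines through the
    # origin (nearest-neighbour here for simplicity), then 1D inverse FFT per line.
    n = img.shape[0]                        # assume a square n x n image
    F = np.fft.fftshift(np.fft.fft2(img))   # centred 2D spectrum
    c = n // 2
    radii = np.arange(-c, n - c)            # frequency samples along each radial line
    slices = []
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        x = np.clip(np.round(c + radii * np.cos(theta)).astype(int), 0, n - 1)
        y = np.clip(np.round(c + radii * np.sin(theta)).astype(int), 0, n - 1)
        line = F[y, x]                                       # radial slice of the spectrum
        slices.append(np.fft.ifft(np.fft.ifftshift(line)).real)
    return np.array(slices)                 # one (approximate) projection per angle

def haar_step(sig):
    # One level of a 1D Haar wavelet transform applied to a Radon slice.
    sig = sig[: len(sig) // 2 * 2].reshape(-1, 2)
    return (sig[:, 0] + sig[:, 1]) / np.sqrt(2), (sig[:, 0] - sig[:, 1]) / np.sqrt(2)

def ridgelet_features(img, n_angles=64):
    # Crude ridgelet-style features: Haar detail coefficients of each Radon slice.
    return np.array([haar_step(s)[1] for s in radon_via_fft(img, n_angles)])
```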
In the following images, the curvelet transform coefficients of a face from the AT&T database are shown, at one approximation and eight detailed decompositions.
Fig. 2. Curvelet Transform of face images. The first one is that of approximate coefficients. The rest are detailed coefficients at eight different angles.
Curvelets are good at representing edge discontinuities in two-dimensional functions. In this work we exploit this property of curvelets in a novel way for facial feature extraction. Human faces are three-dimensional objects that are represented in two dimensions in ordinary images. As a result, when a face is photographed, different parts of the face reflect the incident light differently and we find differential shades in the face image. We human beings are able to get a rough idea of the three-dimensional structure from these differential shades. Black and white digital images are represented in 8 bits or 16 bits, resulting in 256 or 65536 gray levels. Let us suppose that the images are represented by 256 gray levels (the image databases we used are indeed all 8-bit images). In such an image, two very near regions can have differing pixel values. Such a gray scale image will have a lot of "edges" – and consequently the curvelet transform will capture this edge information. But if we quantize the gray levels, say to 128 or 64, nearby regions that had very small differences in pixel values and formed edges in the original 256-level image will be merged, and as a result only the bolder edges in the face image will be represented. Now if these gray-level quantized images are curvelet transformed, the transformed domain coefficients will contain information about these bolder curves. Images of the same person from the AT&T face database, quantized to 4 bits and 2 bits from the original 8-bit representation, are shown below.
Fig. 3. The first images of the rows are the original 8 bit representations. The second images are the 4 bit images while the last ones are the 2 bit images of the original 8 bit images.
From the above images it can be clearly seen how the edge information is varied by quantizing the gray scale of the face images. In this work we vary the gray scale resolution from 256 to 16 and 4 levels. As the images are quantized, only the bolder curves of the face image remain. When we take the curvelet transform of these 8-bit, 4-bit and 2-bit images, the bolder curves in a person's face are captured. We train three classifiers with the curvelet coefficients from the three gray scale representations of the images. During testing, the test images are quantized in the same manner, and the quantized test images are classified by the three corresponding classifiers. Finally, the outputs of these three classifiers are fused to arrive at the final decision. The idea behind this scheme is that, even if a person's face fails to be recognized by the fine curves present in the original 8-bit image, it may be recognized by the bolder curves at a lower bit resolution. To show how the curvelet based scheme fares compared with wavelets, the same exercise as depicted in the preceding paragraphs is also carried out in the wavelet domain, i.e. instead of using the curvelet transform on the bit-quantized images, the wavelet transform is used. The results of the two schemes will be compared.
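The quantization and fusion scheme just described can be summarized with the following sketch; the classifiers are placeholders for the three SVMs trained on curvelet coefficients of the 8-bit, 4-bit and 2-bit versions (scikit-learn-style `predict` is assumed), and the feature extractor is left abstract.

```python
import numpy as np

def quantize(img, bits):
    # Requantize an 8-bit grayscale image to the given number of bits, keeping
    # pixel values in the 0..255 range so that only bolder edges remain.
    step = 2 ** (8 - bits)
    return (img // step) * step

def fused_prediction(img, extract_features, classifiers):
    # Classify the 8-, 4- and 2-bit versions separately and fuse by majority vote;
    # reject (return None) when there is no clear winner.
    votes = []
    for bits, clf in zip((8, 4, 2), classifiers):
        feats = extract_features(quantize(img, bits))
        votes.append(clf.predict([feats])[0])
    labels, counts = np.unique(votes, return_counts=True)
    if np.sum(counts == counts.max()) > 1:      # e.g. all three classifiers disagree
        return None                             # rejected
    return labels[np.argmax(counts)]
```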
3 Support Vector Classification

Support Vector Machine (SVM) [17] models are a close cousin of classical neural networks. Using a kernel function, SVMs are an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training. The two most popular multi-class approaches are the One-Against-All (OAA) method and the One-Against-One (OAO) method. For our purpose we used a One-Against-All (OAA) SVM because it constructs g binary classifiers, as opposed to the g(g−1)/2 classifiers required by a One-Against-One SVM, when addressing a g-class problem.
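As an illustration of the one-against-all strategy (not the exact setup used in this work), a g-class problem can be handled with g binary SVMs, for example with scikit-learn; the data below is a random stand-in for curvelet coefficient vectors.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Illustrative stand-in data: 200 feature vectors for a g = 10 class problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.integers(0, 10, size=200)

# One-against-all: g binary SVMs, each separating one class from the rest,
# instead of the g(g - 1)/2 classifiers needed by one-against-one.
oaa_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
oaa_svm.fit(X[:150], y[:150])
print(oaa_svm.predict(X[150:]))
```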
4 Databases

We tried our face recognition approach on several well-known databases. A brief discussion of the databases used follows.
Georgia Tech Face Database [18]. This database contains images of 50 people taken in two or three sessions between 06/01/99 and 11/15/99 at the Center for Signal and Image Processing at the Georgia Institute of Technology. All people in the database are represented by 15 color JPEG images with cluttered background, taken at a resolution of 640x480 pixels. The average size of the faces in these images is 150x150 pixels. The pictures show frontal and/or tilted faces with different facial expressions, lighting conditions and scale. Each image is manually labeled to determine the position of the face in the image. Of these 15 images, 9 were used as the training set and the remaining 6 as the testing set. The division into training and testing sets was random, and was repeated thrice.
Fig. 4. Faces from the Georgia Tech Face Database
AT&T "The Database of Faces" [19] This database contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). For this database 6 images per person served as the training set and the remaining 4 consisted of the testing set. This random segregation into training and testing set was done thrice.
Fig. 5. Faces from the AT&T Face Database
Essex Face Database [20] A sequence of 20 images each for 18 individuals consisting of male and female subjects was taken, using a fixed camera. The images are of size 180 x 200 pixels. During the sequence the subject moves his/her head and makes grimaces which get more extreme towards the end of the sequence. There is about 0.5 seconds between successive frames in the sequence. Of the 20 images of each individual, 12 images are randomly chosen for the training set and the rest for testing. This random segregation into training and testing set was done thrice.
Fig. 6. Faces from the Essex Grimace Face Database
5 Results While doing this work, we converted the colour images from the Georgia Tech database and the Essex Grimace database to black and white. Apart from this we did not apply any pre-processing step such as cropping or slant correction. However, [21] showed that the recognition accuracy of face images does not degrade if the size of the image is reduced before any feature extraction is done. Following this work, as a pre-processing step, we reduced all the images by four times in length and width. Images from the databases were divided into training and testing sets randomly. For each database, the segregation into testing and training sets was done randomly thrice. For each of the three pairs of training and testing sets thus obtained, three sets of experiments were performed. As discussed in Section 2, we converted each of the images to two other gray scale resolutions. All the images were 8 bit images; we converted them to 4 bit and 2 bit images. With these three versions of the training images we trained three SVM classifiers, one each for the 8 bit, 4 bit and 2 bit images. The test images were also converted to 4 bit and 2 bit versions apart from the original 8 bit one, and the separate versions of the test images thus obtained were classified with the three corresponding classifiers. The final decision was made by fusing the outputs from these three classifiers using simple majority voting. If there was no clear winner, the image was rejected. As was mentioned at the end of Section 2, a comparative study between wavelets and curvelets is presented here. The results of the two schemes on the Georgia Tech database can be compared in Tables 1-6.
Table 1. Curvelet based Results of Set 1
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     83.3   83.3   79.7
  Accuracy after majority voting:  88.7
  Rejection rate:                  7
  Incorrect classification rate:   4.3
Table 2. Wavelet based Results of Set 1
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     82.3   82.7   78
  Accuracy after majority voting:  86
  Rejection rate:                  8.7
  Incorrect classification rate:   5.3
Table 3. Curvelet based Results of Set 2
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     78.7   78     74
  Accuracy after majority voting:  81.3
  Rejection rate:                  10.7
  Incorrect classification rate:   8
Table 4. Wavelet based Results of Set 2
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     78.0   76.7   73.3
  Accuracy after majority voting:  80.7
  Rejection rate:                  11
  Incorrect classification rate:   8.3
Table 5. Curvelet based Results of Set 3
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     85.7   85.3   84.3
  Accuracy after majority voting:  89.7
  Rejection rate:                  5.7
  Incorrect classification rate:   4.6
Table 6. Wavelet based Results of Set 3
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     84.3   84.0   83.7
  Accuracy after majority voting:  87.3
  Rejection rate:                  6.3
  Incorrect classification rate:   6
From the above six tables it can be seen that the curvelet based scheme shows better results than the wavelet based scheme. The experimental results on the AT&T database are tabulated in Tables 7, 8 and 9.
Table 7. Results of Set 1
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     96.9   95.6   93.7
  Accuracy after majority voting:  98.8
  Rejection rate:                  1.2
  Incorrect classification rate:   0
Table 8. Results of Set 2
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     98.8   98.1   97.5
  Accuracy after majority voting:  99.4
  Rejection rate:                  0.6
  Incorrect classification rate:   0
Table 9. Results of Set 3
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     96.9   96.2   95.6
  Accuracy after majority voting:  100
  Rejection rate:                  0
  Incorrect classification rate:   0
The following three tables tabulate the results on the Essex Grimace Database.
Table 10. Results of Set 1
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     100    100    100
  Accuracy after majority voting:  100
  Rejection rate:                  0
  Incorrect classification rate:   0
Table 11. Results of Set 2
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     95.8   95.8   94.4
  Accuracy after majority voting:  97.2
  Rejection rate:                  1.4
  Incorrect classification rate:   1.4
Table 12. Results of Set 3
  No. of bits in image:            8      4      2
  Accuracy of each classifier:     96.7   95.6   95.6
  Accuracy after majority voting:  97.2
  Rejection rate:                  2.1
  Incorrect classification rate:   0.9
6 Conclusion The technique introduced in our paper appears to be robust to the changes in facial expression as it shows good results for the Essex and the AT&T database. However,
we are trying to improve the recognition accuracy for sideways tilted images, like those in the Georgia Tech database. Further work is suggested in improving the recognition accuracy by cropping images, making tilt corrections and using other voting schemes.
Acknowledgement The work is supported in part by the Canada Research Chair program, and the Natural Science and Engineering Research Council of Canada.
References 1. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P.J.: Face Recognition: A Literature Survey. ACM Computing Surveys, 399–458 (2003) 2. Chen, C.F, Tseng, Y.S., Chen, C.Y.: Combination of PCA and Wavelet Transforms for Face Recognition on 2.5D Images. In: Proc. Image and Vision Computing NZ, pp. 343– 347 (2003) 3. Tian, G.Y., King, S., Taylor, D., Ward, S.: Wavelet based Normalisation for Face Recognition. In: Proceedings of International Conference on Computer Graphics and Imaging (CGIM) (2003) 4. Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions Image on Processing 14(12), 2091–2106 (2005) 5. Do, M.N., Vetterli, M.: The finite ridgelet transform for image representation. IEEE Trans. Image Processing 12(1), 16–28 (2003) 6. Donoho, D.L, Duncan, M.R.: Digital curvelet transform: strategy, implementation and experiments. Tech. Rep., Department of Statistics, Stanford University (1999) 7. Belbachir, A.N., Goebel, P.M.: The Contourlet Transform for Image Compression. Physics in Signal and Image Processing, Toulouse, France (January 2005) 8. Li, A., Li, X., Wang, S., Li, H.: A Multiscale and Multidirectional Image Denoising Algorithm Based on Contourlet Transform. In: International Conference on Intelligent Information Hiding and Multimedia, pp. 635–638 (2006) 9. Majumdar, A.: Bangla Basic Character Recognition using Digital Curvelet Transform. Journal of Pattern Recognition Research (accepted for publication) 10. Majumdar, A.: Curvelets: A New Approach to Face Recognition. Journal of Machine Learning Research (submitted) 11. Simoncelli, E.P., Freeman, W.T., Adelson, E.H., Heeger, D.J.: Shiftable multi-scale transforms. IEEE Trans. Information Theory, Special Issue on Wavelets 38(2), 587–607 (1992) 12. Watson, A.B.: The cortex transform: rapid computation of simulated neural images. Computer Vision, Graphics, and Image Processing 39(3), 311–327 (1987) 13. Olshausen, A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996) 14. Candès, E.J., Guo, F.: New Multiscale Transforms, Minimum Total Variation Synthesis: Applications to Edge-Preserving Image Reconstruction. Signal Processing 82, 1519–1543 (2002) 15. Candes, E., Demanet, L., Donoho, D., Ying, L.: Fast Discrete Curvelet Transforms, http://www.curvelet.org/papers/FDCT.pdf
16. Deans, S.R.: The Radon transform and some of its applications. John Wiley Sons, New York (1983) 17. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20(3), 273–297 (1995) 18. http://www.anefian.com/gt_db.zip 19. http://www.cl.cam.ac.uk/Research/DTG/attarchive:pub/data/att_faces.zip 20. http://cswww.essex.ac.uk/mv/allfaces/grimace.zip 21. Ruiz-Pinales, J., Acosta-Reyes, J.J., Salazar-Garibay, A., Jaime-Rivas, R.: Shift Invariant Support Vector Machines Face Recognition System. Transactions on Engineering, Computing And Technology 16, 167–171 (2006)
Low Frequency Response and Random Feature Selection Applied to Face Recognition Roberto A. Vázquez, Humberto Sossa, and Beatriz A. Garro Centro de Investigación en Computación – IPN Av. Juan de Dios Batiz, esquina con Miguel de Othon de Mendizábal Ciudad de México, 07738, México
[email protected],
[email protected],
[email protected] Abstract. A novel method for face recognition based on some biological aspects of infant vision is proposed in this paper. The biological hypotheses of this method are based on the role of the response to low frequencies at early stages, and some conjectures concerning how an infant detects subtle features (stimulating points) from a face. In order to recognize a face from different images of it we make use of a bank of dynamic associative memories (DAM). As the infant vision responds to low frequencies of the signal, a low-filter is first used to remove high frequency components from the image. We then detect subtle features in the image by means of a random feature selection detector. At last, the network of DAMs is fed with this information for training and recognition. To test the accuracy of the proposal a benchmark of faces is used.
1 Introduction One of the problems with high-dimensional datasets such as faces is that, in many cases, not all the measured variables or pixels are "important" for recognizing a face. Several statistical, computationally expensive techniques, such as principal component analysis [4] and factor analysis, have been proposed for solving the face recognition problem. These techniques are the most widely used linear dimension reduction methods based on second-order statistics. Instead of using the complete version of the describing pattern X of any face, a simplified version of the describing pattern X could be used to recognize the face. In many papers, authors have used PCA to perform face recognition and other tasks; refer for example to [6], [7], [8], [10] and [11]. Most research in vision systems has been focused on the fully developed visual system of adult humans. During early developmental stages, there are communication pathways between the visual and other sensory areas of the cortex, showing how the biological network is self-organizing. Within a few months of birth, the baby is able to differentiate one face from another. It has been hypothesized that the functional role of perception is to capture the statistical structures of the sensory stimuli such that corresponding actions could be taken to maximize the chances of survival
(see [15] for details). Barlow hypothesized that for a neural system one possible way of capturing the statistical structure was to remove the redundancy in the sensory outputs [1], [5]. Taking into account the theory of Barlow, we propose a novel method for face recognition based on some biological aspects of infant vision. The biological hypotheses of this proposal are based on the role of the response to low frequencies at early stages, and some conjectures concerning how an infant detects subtle features (stimulating points (SP)) in a face or object [13], [16], [21], [23]. The proposal consists of a network of dynamic associative memories (nDAM). As infant vision responds to the low frequencies of the signal, a low-pass filter is first used to remove high frequency components from the image. Then we detect subtle features in the image by means of a random selection of stimulating points. The preprocessing used to remove high frequencies and the random selection of stimulating points contribute to eliminating redundant information and help the nDAM to learn the faces efficiently. At last, the nDAM is fed with this information for training and recognition.
2 Dynamic Associative Memory The proposed model is not an iterative model like Hopfield's model [2]. Our model emerges as an improvement of the model proposed in [18], which is not an iterative model. Let x ∈ R^n and y ∈ R^m be an input and an output pattern, respectively. An association between input pattern x and output pattern y is denoted as (x^k, y^k), where k is the corresponding association. The associative memory W is represented by a matrix whose components w_ij can be seen as the synapses of the neural network. If x^k = y^k ∀k = 1,…,p then W is auto-associative, otherwise it is hetero-associative. A distorted version of a pattern x to be recalled will be denoted as x̃. If an associative memory W is fed with a distorted version of x^k and the output obtained is exactly y^k, we say that recalling is robust. 2.1 Building the Associative Memory The brain is a dynamic, changing neural network that adapts continuously to meet the demands of communication and computational needs [17]. This fact suggests that some connections of the brain could change in response to some input stimuli. Humans, in general, do not have problems recognizing patterns even if these are altered by noise. Several parts of the brain interact together in the process of learning and recalling a pattern. For example, when we read a word the information enters the eye and the word is transformed into electrical impulses. Then electrical signals are passed through the brain to the visual cortex. After that, specific information about the patterns passes on to other areas of the cortex. From here information passes through the arcuate fasciculus; paths of this pathway connect
language areas with other areas involved in cognition, association and meaning; for details see [3] and [14]. Based upon the above example we have defined in our model several interacting areas, one per association we would like the memory to learn. Also we have integrated the capability to adjust synapses in response to an input stimulus. As we could appreciate from the previous example, before an input pattern is learned or processed by the brain, it is hypothesized that it is transformed and codified by the brain. In our model, this process is simulated using the following procedure recently introduced in [19]:
Procedure 1. Transform the fundamental set of associations into codified and de-codifying patterns:
Input: FS, the fundamental set of associations.
1. Make d = const and make (x̄^1, ȳ^1) = (x^1, y^1).
2. For the remaining couples do: For k = 2 to p { For i = 1 to n { x̄_i^k = x̄_i^(k-1) + d; x̂_i^k = x̄_i^k − x_i^k; ȳ_i^k = ȳ_i^(k-1) + d; ŷ_i^k = ȳ_i^k − y_i^k } }
Output: Set of codified and de-codifying patterns.
This procedure computes codified patterns, denoted by x̄ and ȳ, from the input and output patterns x and y; x̂ and ŷ are the de-codifying patterns. Codified and de-codifying patterns are allocated in different interacting areas, and d defines how much these areas are separated. On the other hand, d determines the noise supported by our model. In addition, a simplified version of x̄^k, denoted by s^k, is obtained as:
s^k = s(x̄^k) = mid x̄^k     (1)
where the mid operator is defined as mid x = x_((n+1)/2). When the brain is stimulated by an input pattern, some regions of the brain (interacting areas) are stimulated and synapses belonging to those regions are modified. In our model, the most excited interacting area is called the active region (AR) and can be estimated as follows:
ar = r(x) = arg min_(i=1,…,p) ‖ s(x) − s^i ‖     (2)
Once the codified patterns, the de-codifying patterns and s^k have been computed, we can build the associative memory. Let {(x̄^k, ȳ^k) | k = 1,…,p}, x̄^k ∈ R^n, ȳ^k ∈ R^m, be a fundamental set of associations (codified patterns). The synapses of the associative memory W are defined as:
w_ij = y_i − x_j     (3)
Once the codified and de-codifying patterns have been computed, the reader can easily corroborate that any association can be used to compute the synapses of W without modifying the results. In short, building of the associative memory can be performed in three stages:
1. Transform the fundamental set of associations into codified and de-codifying patterns by means of the previously described Procedure 1.
2. Compute simplified versions of the input patterns by using equation 1.
3. Build W in terms of the codified patterns by using equation 3.
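These three stages can be summarized in the following sketch (our own illustration, assuming the patterns are stacked as real-valued NumPy arrays; the variable names are not from the paper):

```python
import numpy as np

def build_dam(X, Y, d=10.0):
    """Sketch of the DAM building stage (Procedure 1, eqs. 1 and 3).
    X: (p, n) input patterns, Y: (p, m) output patterns, stacked row-wise."""
    p, n = X.shape
    shift = d * np.arange(p)[:, None]
    Xc, Yc = X[0] + shift, Y[0] + shift      # codified patterns: first pair + (k-1)d
    Xh, Yh = Xc - X, Yc - Y                  # de-codifying patterns
    s = Xc[:, (n + 1) // 2 - 1]              # eq. 1: mid of each codified input
    W = Yc[0][:, None] - Xc[0][None, :]      # eq. 3: w_ij = y_i - x_j (any k gives the same W)
    return W, Xh, Yh, s
```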
2.2 Modifying Synapses of the Associative Model As we have already mentioned, synapses could change in response to an input stimulus; but which synapses should be modified? For example, a head injury might cause a brain lesion killing hundreds or even thousands of neurons; this entails some synapses reconnecting with other neurons. This reconnection or modification of the synapses might cause the information allocated in the brain to be preserved or lost; the reader can find more details concerning this topic in [12] and [22]. This fact suggests that there are synapses that can be drastically modified without altering the behavior of the associative memory. On the contrary, there are synapses that can only be slightly modified so as not to alter the behavior of the associative memory; we call this set of synapses the kernel of the associative memory, denoted by K_W. In the model we find two types of synapses: synapses that can be modified without altering the behavior of the associative memory, and synapses belonging to the kernel of the associative memory. These last synapses play an important role in recalling patterns altered by some kind of noise. Let K_W ∈ R^n be the kernel of an associative memory W. A component of vector K_W is defined as:
kw_i = mid(w_ij), j = 1,…,m     (4)
According to the original idea of our proposal, synapses that belong to K_W are modified as a response to an input stimulus. Input patterns stimulate some ARs, interact with these regions and then, according to those interactions, the corresponding synapses are modified. Synapses belonging to K_W are modified according to the stimulus generated by the input pattern. This adjusting factor is denoted by Δw and can be computed as:
Δw = Δ(x) = s(x^ar) − s(x)     (5)
where ar is the index of the AR. Finally, synapses belonging to K_W are modified as:
K_W = K_W ⊕ (Δw − Δw_old)     (6)
where the operator ⊕ is defined as (x ⊕ e)_i = x_i + e ∀i = 1,…,m. As can be appreciated, the modification of K_W in equation 6 depends on the previous value of Δw, denoted by
Δw_old, obtained with the previous input pattern. Once the DAM has been trained, the first time it is used the value of Δw_old is set to zero. 2.3 Recalling a Pattern Using the Proposed Model Once the synapses of the associative memory have been modified in response to an input pattern, every component of vector y can be recalled by using its corresponding input vector x as:
y_i = mid(w_ij + x_j), j = 1,…,n     (7)
In short, pattern y can be recalled by using its corresponding key vector x or a distorted version x̃ in six stages as follows:
1. Obtain the index of the active region ar by using equation 2.
2. Transform x^k using the de-codifying pattern x̂^ar by applying the transformation x̄^k = x^k + x̂^ar.
3. Compute the adjustment factor Δw = Δ(x̄^k) by using equation 5.
4. Modify the synapses of the associative memory W that belong to K_W by using equation 6.
5. Recall pattern ȳ^k by using equation 7.
6. Obtain y^k by transforming ȳ^k with the de-codifying pattern ŷ^ar, applying the transformation y^k = ȳ^k − ŷ^ar.
The formal set of propositions that support the correct functioning of this dynamic model and its main advantages over other classical models can be found in [24]. In general, we distinguish two main parts of the model: a part concerning the determination of the active region (PAR) and a part concerning pattern recall (PPR). PAR (the first step during the recall procedure) sends a signal to PPR (the remaining steps of the recall procedure). This indicates the region activated by the input pattern. A schematic figure of this model is shown in Fig. 2(a).
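A rough sketch of these recall stages (again our illustration, reusing the arrays returned by the building sketch above and the mid operator exactly as defined after equation 1; the kernel update of equations 5-6 is left out for brevity):

```python
import numpy as np

def recall_dam(x, W, Xh, Yh, s):
    """Recall an output pattern for the (possibly distorted) key vector x."""
    mid = lambda v: v[(len(v) + 1) // 2 - 1]                      # mid x = x_((n+1)/2)
    ar = int(np.argmin(np.abs(s - mid(x))))                       # stage 1, eq. 2
    xc = x + Xh[ar]                                               # stage 2: codify the input
    yc = np.array([mid(W[i] + xc) for i in range(W.shape[0])])    # stage 5, eq. 7
    return yc - Yh[ar]                                            # stage 6: de-codify the output
```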
3 Description of the Proposal Authors in [13], [16] suggest that infant vision responds in an accurate way to low frequencies. Within a few months of birth, the brain can differentiate faces from other faces. On the other hand, young babies are not capable of detecting subtle features in a face [21], [23]. This fact could suggest that the baby uses only a small amount of information from the face in order to recognize it. The proposal consists of an nDAM used to recognize different images of faces. As infant vision responds to the low frequencies of the signal, a low-pass filter is first used to remove high frequency components from the image. After that, we divide the image into different parts (sub-patterns). Then, over each sub-pattern, we detect subtle features by means of a random selection of SPs. At last, each DAM of the nDAM is fed with these sub-patterns for training and recognition.
3.1 Response to Low Frequencies It is important to mention that instead of using a filter that exactly simulates the behavior of the infant vision system at any stage, we use a low-pass filter to remove high frequencies. This kind of filter can be seen as a rough approximation of the infant vision system, since it eliminates high frequency components from the pattern. For simplicity, we used an average filter. If we apply this filter to an image, the resulting image can hypothetically be seen as the image that an infant perceives at a specific stage of its life. For example, we could associate the size of the mask used in the filter with a specific stage: if the size of the mask is one, we could say that the resulting image corresponds to one year of age; if the size of the mask is the biggest, we could say that the resulting image corresponds to a newborn. Fig. 1 shows some images filtered with masks of different sizes.
Fig. 1. Images filtered with masks of different sizes (1x1 for one year, 5x5 for six months, 15x15 for three months). Each group could be associated with different stages of the infant vision system.
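For instance, the average filtering can be done with a uniform (box) filter; the mask sizes below follow the stages suggested in Fig. 1 (a sketch, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def infant_view(img, mask_size):
    """Low-pass (average) filtering with a square mask; larger masks remove
    more high frequencies, loosely mimicking earlier stages of infant vision."""
    return uniform_filter(img.astype(float), size=mask_size)

# one_year, six_months, three_months = (infant_view(face, s) for s in (1, 5, 15))
```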
3.2 Random Selection In the DAM model, the simplified version of an input pattern is the middle value of the input pattern, computed by using the mid operator. In order to simulate the random selection of the infant vision system we substitute the mid operator with the rand operator, defined as follows:
rand x = x_sp     (8)
where sp = random(n) is a random number between zero and the length of the input pattern. This represents a stimulating point. sp is a constant value computed at the beginning of the building phase; during the recalling phase sp takes the same value. The rand operator uses a uniform random generator to select a component over each part of the pattern. We adopt this operator based on the hypothetical idea that infants are interested in sets of features, where each set is different with some intersection among them. By selecting features at random, we conjecture that we select at least one feature belonging to these sets.
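The rand operator only fixes one stimulating point per sub-pattern at building time and reuses it at recall; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng()

def make_rand_operator(n):
    """Draw the stimulating point sp once (eq. 8) and keep it, so that the
    same component is extracted during building and recalling."""
    sp = int(rng.integers(0, n))
    return lambda x: x[sp]
```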
3.3 Implementation of the Proposal During recalling, each DAM recovers a part of the image based on its AR. However, a part of the image could be wrongly recalled because its corresponding AR could be wrongly determined, since some patterns do not satisfy the propositions that guarantee perfect recall. To avoid this, we use an integrator. Each DAM determines an AR and sends its index to the integrator; the integrator determines the most voted region and sends its index back to the DAMs (the new AR). In Fig. 2(b) the general architecture of the network of DAMs is shown.
Fig. 2. (a) Main parts of the DAM. (b) Architecture of a network of DAMs.
In order to interconnect several DAMs we disconnect the PAR from its corresponding PPR. The output of the PAR is sent to an external integrator. This integrator receives several signals from different DAMs indicating the AR. At last, based on a voting technique, the integrator "tells" each DAM which is the most voted AR. A schematic representation of the building and recalling phases is shown in Fig. 3. Building of the nDAM is done as follows. Let [I_x^k]_(a×b) and [I_y^k]_(c×d) be an association of images and r be the number of DAMs.
1. Select the filter size and apply it to the images.
2. Transform the images into a vector pair (x^k, y^k) by means of the standard image scan method, where the vectors are of size a×b and c×d respectively.
3. Decompose x^k and y^k into r sub-patterns of the same size.
4. Take each sub-pattern (from the first one to the last one, r), then take at random a stimulating point sp_i, i = 1,…,r, and extract the value at that position.
5. Train the r DAMs as in the building procedure, taking each sub-pattern (from the first one to the last one, r) using the rand operator.
Pattern I_y^k can be recalled by using its corresponding key image I_x^k or a distorted version of it as follows:
1. Select the filter size and apply it to the images.
2. Transform the images into a vector by means of the standard image scan method.
3. Decompose x^k into r sub-patterns of the same size.
4. Use the stimulating points sp_i, i = 1,…,r, computed during the building phase and extract the value of each sub-pattern.
5. Determine the most voted active region using the integrator.
6. Substitute the mid operator with the rand operator in the recalling procedure and apply steps two to six as described in the recalling procedure on each memory.
7. Finally, put together the recalled sub-patterns to form the output pattern.
(b) Fig. 3. (a) Schematic representation of building phase. (b) Schematic representation of the recalling phase.
While PCA dimension reduction techniques require the covariance matrix to build an Eigenspace, then to project patterns using this space to eliminate redundant information, our proposal only requires removing high frequencies by using a filter and a
826
R.A. Vázquez, H. Sossa, and B.A. Garro
random selection of stimulating points. This approach contributes to eliminating redundant information, it is less computationally expensive than PCA, and helps the nDAMs or other classification tools to learn efficiently the faces. The main reason to use a DAM (in our proposal) is to demonstrate that, as equal as neural networks and other classifiers, DAMs can be used to solve complex problems.
4 Experimental Results To test the efficiency of the proposal we have used the benchmark of faces given in [9]. This database contains twenty photos of fifteen different people. Each photo is in colour and of 180 × 200 pixels. Furthermore, people in images appear with different gesticulations which nowadays is still a challenging problem in face recognition. Due to the level of complexity of this benchmark, we decided to use it to test the accuracy of the proposal. The database of images was divided into two sets of images. First photo of each person (15 in total) was used to train the network of DAMs. The remaining 285 photos (19 for each person) were used to test the efficiency of the proposal. To train the nDAM, only one image of the fifteen face classes was used along with the building procedure described in section 3.3. Something important to mention is that each DAM belonging to the nDAM was trained into its auto-associative version, i.e. I kx = I ky . During recalling phase the second set of images and the recalling procedure described in section 3.3 were used. The accuracy of the proposal was tested using several configurations, starting with networks composed by one DAM until 1100 DAMs and changing the size of the filter. Because of stimulating points (pixels) were randomly selected, we decided to test the stability of proposal with the same configuration 20 times. As can appreciated in Fig. 4(a), in average the accuracy of the proposal oscillates between 33% and 54 % by using only one stimulating point and different filter sizes (from 1 to 119); if we increase the filter size the accuracy increases. In the contrary, as you can appreciate from Figs. 4(b) to 6(f), after a certain filter size if we increase the filter size the accuracy decreases. In average the accuracy of the proposal, when the number of stimulating points goes from 51 to 1051, oscillates between 80% and 99%. Through several experiments we have observed that after applying a filter of size 29 the accuracy of the proposal tends to diminish. In average, the accuracy of the proposal, before surpassing a size of 29 and using different stimulating points, increases from 93% to 99%. In Fig. 5, you can appreciate the accuracy and stability of the proposal. The results obtained with the proposal through several experiments were comparable with those obtained by means of a PCA-based method. Although PCA is a powerful technique it consumes a lot of time to reduce the dimensionality of the data. Our proposal, because of its simplicity in operations, is not a computationally expensive technique and the results obtained are comparable to those provided by PCA (99% of recognition). In addition we evaluate the accuracy of the proposal by substituting the associative memories with other distance classifiers (with the stimulating points as input). In this case the Modified Manhattan distance classifier, Chi distance classifier, Canberra
Low Frequency Response and Random Feature Selection Applied to Face Recognition
827
distance classifier, Manhattan distance classifier, Euclidian distance classifier, Chevyshev distance classifier [20] were used. In average the accuracy of the proposal using these classifiers was of 84%, 84%, 95%, 94%, 91% and 72% respectively. Although the results obtained with the proposal using these classifiers was acceptable, the use of the nDAM provides better results. Finally, the accuracy of the proposal was tested using the nDAM when images were partially occluded. For this some parts of each image (of the second set of images) were manually occluded with regions of different forms and sizes, see Fig. 6.
(a)
(b)
(c)
(d)
(e)
(f)
Fig. 4. (a-f)Accuracy of the proposal using different filter size and stimulating points. Maximum, average and minimum accuracy are sketched.
828
R.A. Vázquez, H. Sossa, and B.A. Garro
Fig. 5. Average accuracy of the proposal before surpassing a filter of size 29
Fig. 6. The first four images of two people used for testing under occlusions
In general, the results obtained in these experiments (images under occlusions) were around 80 % of recognition. We have tested the efficiency of the proposal. We have verified that the worst performance was obtained with the images in presence of occlusions. The results obtained with the proposal in the first set of experiments were comparable with those obtained by means of a PCA-based method. In addition we observed that the results obtained with other classifiers were not comparable with those obtained with nDAM. By removing high frequencies and by random selecting of stimulating points contribute to eliminating redundant information and due to of its simplicity in operations, is not a computationally expensive technique as PCA.
5 Conclusions In this paper we have proposed a novel method for face recognition based on some biological aspects of infant vision. We have shown that by applying some aspects of the infant vision system it is possible to enhance the performance of an associative memory (or other distance classifiers) and also make possible its application to complex problems such as face recognition. The biological hypotheses of this method are based on the role of the response to low frequencies at early stages, and some conjectures concerning to how an infant detects subtle features (stimulating points) in a face [13], [16] [21] [23]. In order to recognize different images of face we have used a network of dynamic associative memories (DAMs). As the infant vision responds to low frequencies of the signal, a low-filter is first used to remove high frequency components from the image. Then we detected subtle features in the image by means of a random selection of
stimulating points. At last, the network of DAMs was fed with this information for training and recognition. Through several experiments we have shown the accuracy and the stability of the proposal, even under occlusions. On average the accuracy of the proposal oscillates between 96% and 99%. It is important to mention that, to our knowledge, nobody has reported results of this type using an associative memory for face recognition. The results obtained with the proposal were comparable with those obtained by means of a PCA-based method. Although PCA is a powerful technique, it consumes a lot of time to reduce the dimensionality of the data. Our proposal, because of its simplicity of operations, is not a computationally expensive technique and the results obtained are comparable to those provided by PCA. Acknowledgments. This work was economically supported by SIP-IPN under grant 20071438 and CONACYT under grant 46805.
References [1] Barlow, H.B.: Possible Principles Underlying the Transformations of Sensory Messages. In: Rosenblith, W.A. (ed.) Sensory Comm, pp. 217–234. MIT Press, Cambridge (1961) [2] Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. In: Proc. of the Nat. Academy of Sciences, vol. 79, pp. 2554–2558 (1982) [3] Kutas, M., Hillyard, S.A.: Brain potentials during reading reflect word expectancy and semantic association. Nature 307, 161–163 (1984) [4] Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986) [5] Barlow, H.B.: Unsupervised Learning. Neural Computation 1, 295–311 (1989) [6] Kirby, M., Sirovich, L.: Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Trans. Pattern Analysis and Machine Intelligence 12(1), 103–108 (1990) [7] Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) [8] Shapiro, L.S., Brady, J.M.: Feature-based correspondence: an eigenvector approach. Image and Vision Computing 10(5), 283–288 (1992) [9] Spacek, L.: Collection of facial images: Grimace. Available from (1996), http://cswww.essex.ac.uk/mv/allfaces/grimace.html [10] Swets, D., Weng, J.: Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence 18(8), 831–836 (1996) [11] Valentin, D., et al.: Principal component and neural network analysis of face images: what can be generalized in gender classification. Journal of Mathematical Psychology 41(4), 398–413 (1997) [12] Reinvan, I.: Amnestic disorders and their role in cognitive theory. Scandinavian Journal of Psychology 39(3), 141–143 (1998) [13] Mondloch, C.J., et al.: Face Perception During Early Infancy. Psychological Science 10(5), 419–422 (1999) [14] Price, C.J.: The anatomy of language: contributions from functional neuroimaging. Journal of Anatomy 197(3), 335–359 (2000) [15] Barlow, H.B.: Redundancy Reduction Revisited. Network: Computation in Neural Systems 12, 241–253 (2001)
[16] Acerra, F., Burnod, Y., Schonen, S.: Modelling aspects of face processing in early infancy. Developmental science 5(1), 98–117 (2002) [17] Laughlin, S.B., Sejnowski, T.J.: Communication in neuronal networks. Science 301, 1870–1874 (2003) [18] Sossa, H., Barron, R.: New associative model for pattern recall in the presence of mixed noise. In: Proc. of the fifth IASTED-SIP2003, vol. 399, pp. 485–490. Acta Press (2003) [19] Sossa, H., Barrón, R., Vázquez, R.A.: Transforming Fundamental set of Patterns to a Canonical Form to Improve Pattern Recall. In: Lemaître, C., Reyes, C.A., González, J.A. (eds.) IBERAMIA 2004. LNCS (LNAI), vol. 3315, pp. 687–696. Springer, Heidelberg (2004) [20] Perlibakas, V.: Distance measures for PCA-based face recognition. Pattern Recognition Letters 25(6), 711–724 (2004) [21] Slaughter, V., Stone, V.E., Reed, C.: Perception of Faces and Bodies Similar or Different? Current Directions in Psychological Science 13(9), 219–223 (2004) [22] Jovanova-Nesic, K.D., Jankovic, B.D.: The Neuronal and Immune Memory Systems as Supervisors of Neural Plasticity and Aging of the Brain: From Phenomenology to Coding of Information. Annals of the New York Acad.of Sci. 1057, 279–295 (2005) [23] Cuevas, K., Rovee-Collier, C., Learmonth, A.E.: Infants Form Associations Between Memory Representations of Stimuli That Are Absent. Psychological Science 17(6), 543– 549 (2006) [24] Vazquez, R.A., Sossa, H.: A new associative memory with dynamical synapses. Submitted to ICANN 2007
Facial Expression Recognition Using 3D Facial Feature Distances Hamit Soyel and Hasan Demirel Department of Electrical and Electronic Engineering, Eastern Mediterranean University, Gazimağusa, KKTC, via Mersin 10- Turkey {hamit.soyel. hasan.demirel}@emu.edu.tr
Abstract. In this paper, we propose a novel approach for facial expression analysis and recognition. The proposed approach relies on the distance vectors retrieved from 3D distribution of facial feature points to classify universal facial expressions. Neural network architecture is employed as a classifier to recognize the facial expressions from a distance vector obtained from 3D facial feature locations. Facial expressions such as anger, sadness, surprise, joy, disgust, fear and neutral are successfully recognized with an average recognition rate of 91.3%. The highest recognition rate reaches to 98.3% in the recognition of surprise. Keywords: 3D Facial Expression Analysis, Facial Feature Points, Neural Networks.
1 Introduction Facial expression analysis has attracted many computer vision researchers over the last two decades. The studies of human facial expressions performed by Ekman [1, 2] gave evidence for the classification of the basic facial expressions. According to these studies, the basic facial expressions are those representing happiness, sadness, anger, fear, surprise, disgust and neutral. The Facial Action Coding System (FACS) was developed by Ekman and Friesen [1] to code facial expressions, in which the movements on the face are described by action units. This work inspired many researchers to analyze facial expressions in 2D by means of image and video processing, where by tracking facial features and measuring the amount of facial movement they attempt to classify different facial expressions. Recent work on facial expression analysis and recognition [3, 4, 5, 6, and 7] has used these basic expressions or a subset of them. Two recent surveys on facial expression analysis [8, 9] provide a detailed review of much of the research done in recent years. Almost all of the methods developed use the 2D distribution of facial features as input into a classification system, and the outcome is one of the facial expression classes. They differ mainly in the facial features selected and in the classifiers used to distinguish between the different facial expressions. Information extracted from 3D face models is rarely used in facial feature analysis. In this paper we
use distance measures extracted from 3D face vectors in order to determine the facial expression of a given face. One of the major contributions of this work is to use the 3D facial expression database BU-3DFE [11] for facial expression analysis in a 3D space by exploring the 3D distance vectors. The distance measures extracted from the 3D facial features provide very reliable and valuable information for robust recognition of facial features. The proposed system uses a 3D database for designing a facial expression recognition methodology based on 3D distributions of facial feature points. Among the 84 points in the 3D space describing the 3D faces [11], only six distance measures maximizing the differences between facial expressions are selected to represent each facial expression to be classified. These six distance values form a 3D distance vector for the representation of facial expressions that is capable of identifying the fundamental facial expressions, which are defined by the MPEG-4 Facial Definition Parameter Set (FDP) [10]. The results obtained from a neural network classifier using the proposed 3D distance vectors reach up to 98.3% in the recognition of the surprise facial expression, while the average recognition performance is 91.3%. The performance results are compared with the results reported by Wang et al. [13], which use the same 3D face database for performance evaluation. In their paper, among the four classifiers, the LDA based system provides 83.6% recognition of the facial expressions, which is 7.7% less than the performance of the system proposed in this paper. The proposed system is presented briefly in Section 3, and an extensive evaluation of the neural network classifier using the BU-3DFE database [11] is presented in Section 4. The concluding remarks are presented in Section 5.
2 Information from 3D Models for Facial Expression Analysis In this paper, 3D distribution of the facial feature points are used for the estimation of six characteristic distances in order to represent the facial expressions. The proposed six distances are adopted based on the visual analysis of facial expressions on many images. These distance values are used to describe the fundamental facial expressions including surprise, joy, happiness, sadness, anger, and neutral. The definitions of facial feature expressions that are used in this paper are given in Table 1. After a series of analysis on faces we have concluded that mainly 15 FAP’s are affected by these expressions [13]. These facial features are moved due to the contraction and expansion of facial muscles, whenever a facial expression is changed. Table 2 illustrates the description of the basic expressions using the MPEG-4 FAPs terminology [10]. MPEG-4 specifies 84 feature points on the neutral face. The main purpose of these feature points is to provide spatial references to key positions on a human face. These 84 points were chosen to best reflect the facial anatomy and movement mechanics of a human face. The location of these feature points has to be known for any MPEG-4 compliant face model. The Feature points on the model should be located according to the Figure 1.
Table 1. Primary Facial Expressions as defined for FAP [12]
  Anger:    The inner eyebrows are pulled downward and together. The eyes are wide open. The lips are pressed against each other or opened to expose the teeth.
  Sadness:  The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is relaxed.
  Surprise: The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. The jaw is opened.
  Joy:      The eyebrows are relaxed. The mouth is open and the mouth corners pulled back toward the ears.
  Disgust:  The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
  Fear:     The eyebrows are raised and pulled together. The inner eyebrows are bent upward. The eyes are tense and alert.
Table 2. Muscle Actions involved in the six basic expressions
Expression Anger
Sadness
Surprise
Joy
Disgust
Fear
squeeze_l_eyebrow (+) lower_t_midlip (-) raise_l_i_eyebrow (+) close_t_l_eyelid (-) close_b_l_eyelid (-) raise_l_i_eyebrow (+) close_t_l_eyelid (+) raise_l_m_eyebrow (-) raise_l_o_eyebrow (-) close_b_l_eyelid (+) raise_l_o_eyebrow (+) raise_l_i_eyebrow (+) raise_l_m_eyebrow (+) squeeze_l_eyebrow (-) open_jaw (+) close_t_l_eyelid (+) close_b_l_eyelid (+) stretch_l_cornerlip (+) raise_l_m_eyebrow (+) close_t_l_eyelid (+) close_b_l_eyelid (+) lower_t_midlip (-) open_jaw (+) raise_l_o_eyebrow (+) raise_l_m_eyebrow(+) raise_l_i_eyebrow (+) squeeze_l_eyebrow (+) open_jaw (+)
Muscle Actions squeeze_r_eyebrow (+) raise_b_midlip (+) raise_r_i_eyebrow (+) close_t_r_eyelid (-) close_b_r_eyelid (-) raise_r_i_eyebrow (+) close_t_r_eyelid (+) raise_r_m_eyebrow (-) raise_r_o_eyebrow (-) close_b_r_eyelid (+) raise_r_o_eyebrow (+) raise_r_i_eyebrow (+) raise_r_m_eyebrow (+) squeeze_r_eyebrow (-) close_t_r_eyelid (+) close_b_r_eyelid (+) stretch_r_cornerlip (+) raise_r_m_eyebrow (+) close_t_r_eyelid (+) close_b_r_eyelid (+) raise_b_midlip (+) raise_r_o_eyebrow raise_r_m_eyebrow raise_r_I_eyebrow squeeze_r_eyebrow
Fig. 1. The 3D feature points of the FDP set [12]
By using the symmetry of the human face we have reduced the number of facial feature points used to 11. Using the distribution of these 11 facial feature points from the 3D facial model we extract six characteristic distances that serve as input to the neural network classifier used for recognizing the different facial expressions; the points are shown in Figure 2.
Fig. 2. 11-facial feature points: 1-Left corner of outer-lip contour, 2-Right corner of outer-lip contour, 3-Middle point of outer upper-lip contour, 4- Middle point of outer lower-lip contour, 5-Right corner of the right eye, 6-Left corner of the right eye, 7-Center of upper inner-right eyelid, 8-Center of lower inner-right eyelid, 9-Uppermost point of the right eyebrow, 10Outermost point of right-face contour, 11- Outermost point of left-face contour
3 3D Distance Vector Based Facial Expression Recognition System By using the information introduced in the previous section, we achieve 3D facial expression recognition in the following phases. First, we extract the characteristic distance vectors as defined in Table 3. Then, we classify a given distance vector with a previously trained neural network. The sixth distance, D6, is used to normalize the first five distances. The neural network architecture consists of a multilayered perceptron of input, hidden and output layers that is trained using the Backpropagation algorithm. The input layer receives a vector of six distances and the output layer represents the 7 possible facial expressions mentioned in the preceding sections. We used the BU-3DFE database [11] in our experiments to train and test our model. The database we have used contains 7 facial expressions for 60 different people. We arbitrarily divided the 60 subjects into two subsets: one with 54 subjects for training and the other with 6 subjects for testing. During the recognition experiments, a distance vector is derived for every 3D model. Consecutive distance vectors are assumed to be statistically independent, as well as the underlying class sequences. The vector is eventually assigned to the class with the highest likelihood score.
Table 3. Six characteristic distances
  D1  Eye Opening     - Distance between the right corner of the right eye and the left corner of the right eye.
  D2  Eyebrow Height  - Distance between the center of upper inner-right eyelid and the center of lower inner-right eyelid.
  D3  Mouth Opening   - Distance between the left corner of outer-lip contour and the right corner of outer-lip contour.
  D4  Mouth Height    - Distance between the middle point of outer upper-lip contour and the middle point of outer lower-lip contour.
  D5  Lip Stretching  - Distance between the right corner of the right eye and the right corner of outer-lip contour.
  D6  Normalization   - Distance between the outermost point of right-face contour and the outermost point of left-face contour.
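Given the 11 feature points of Fig. 2, the normalized distance vector of Table 3 can be computed as below (a sketch only; the point pairs for D1 and D2 follow the table as printed above):

```python
import numpy as np

def distance_vector(pts):
    """pts: dict mapping the point labels of Fig. 2 (1..11) to 3D coordinates.
    Returns (D1..D5) normalized by D6, the face width."""
    d = lambda a, b: float(np.linalg.norm(np.asarray(pts[a]) - np.asarray(pts[b])))
    D1 = d(5, 6)      # eye opening:    right-eye corners
    D2 = d(7, 8)      # eyebrow height: upper / lower inner-right eyelid
    D3 = d(1, 2)      # mouth opening:  lip corners
    D4 = d(3, 4)      # mouth height:   mid upper / lower outer lip
    D5 = d(5, 2)      # lip stretching: right eye corner to right lip corner
    D6 = d(10, 11)    # normalization:  face width
    return np.array([D1, D2, D3, D4, D5]) / D6
```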
4 3D Facial Expression Analysis Experiments We have tested our neural network setup on the BU-3DFE database [11], which contains posed emotional facial expression images with seven fundamental emotional states, Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral see Figure 3. In our experiment, we used the data captured from 60 subjects for each expression. The test is based on the seven fundamental expressions. The 3D distribution of the 84 feature vertices was provided for each facial model. Table 4
gives a summary of the data set that we used in our experiments. A detailed description of the database construction, post-processing, and organization can be found in [11]. Our facial expression analysis experiments are carried out in a person-independent manner, which is thought to be more challenging than a person-dependent approach. We arbitrarily divided the 60 subjects into two subsets: one with 54 subjects for training and the other with 6 subjects for testing. The experiments assure that any subject used for testing does not appear in the training set, because the random partition is based on the subjects rather than the individual expressions. The tests are executed 10 times with different partitions to achieve a stable generalization recognition rate. The entire process assures that every subject is tested at least once for each classifier. For each round of the test, all the classifiers are reset and re-trained from the initial state. We show the results for all the neural network classifiers in Table 4. Note that most of the expressions are detected with high accuracy and the confusion is larger for the Neutral and Anger classes. One reason why Anger is detected with only 85% is that in general this emotion's confusion with Sadness and Neutral is much larger than with the other emotions. When we compared the proposed 3D Distance Vector based Facial Expression Recognition method (3D-DVFER) with the 2D appearance feature based Gabor-wavelet (GW) approach [12], we found that the Gabor-wavelet approach performs poorly, with an average recognition rate around 80%. Compared to the performance shown in Table 4, the 3D-DVFER method is therefore superior to the 2D appearance feature based methods when classifying the seven prototypic facial expressions.
Table 4. Average confusion matrix using the NN classifier (BU-3DFE database)
Input/Output Neutral Happy Fear Surprise Sadness Disgust Anger
Neutral 86.7% 0.0% 0.0% 0.0% 6.7% 1.7% 5.0%
Happy 0.0% 95.0% 3.3% 0.0% 0.0% 1.7% 0.0%
Fear 1.7% 3.3% 91.7% 0.0% 1.7% 0.0% 1.7%
Surprise 0.0% 0.0% 1.7% 98.3% 0.0% 0.0% 0.0%
Sadness 3.7% 0.0% 0.0% 0.0% 90.7% 1.9% 3.7%
Disgust 1.7% 5.0% 1.7% 0.0% 0.0% 91.7% 0.0%
Anger 6.7% 3.3% 0.0% 0.0% 5.0% 0.0% 85.0%
When we compare the results of the proposed system with the results reported in [13] which use the same 3D database through an LDA classifier, we can see that our method outperforms the recognition rates in Table 5 for all of the facial expressions except the Happy case. Both systems give the same performance for the Happy facial expression. Note that the classifier in [13] does not consider the Neutral case as an expression, which gives an advantage to the approach. The average recognition rate of the proposed system is 91.3% where the average performance of the method given in [13] stays at 83.6% for the recognition of the facial expressions that uses the same 3D database.
Table 5. Average confusion matrix using the LDA based classifier in [13]
Input/Output Happy Fear Surprise Sadness Disgust Anger
Happy 95.0% 12.5% 0.0% 0.0% 3.8% 0.0%
Fear 3.8% 75.0% 1.2% 2.9% 4.2% 6.3%
Surprise 0.0% 2.1% 90.8% 5.8% 0.4% 0.8%
Sadness 0.4% 7.9% 5.4% 80.4% 6.7% 11.3%
Disgust 0.8% 2.5% 0.8% 2.5% 80.4% 1.7%
Anger 0.0% 0.0% 1.7% 8.3% 4.6% 80.0%
5 Conclusion In this paper we have shown that a multilayered perceptron based neural network classifier can be used for the 3D analysis of facial expressions without relying on all of the 84 facial features or on an error-prone face pose normalization stage. Face deformation as well as facial muscle contraction and expansion are important indicators of facial expression, and by using only 11 facial feature points and the symmetry of the human face, we are able to extract enough information from a face image. Our results show that 3D distance vector based recognition outperforms the facial expression recognition results of similar systems using 2D and 3D facial feature analysis. The average facial expression recognition rate of the proposed system reaches up to 91.3%. The quantitative results clearly suggest that the proposed approach produces encouraging results and opens a promising direction for higher rate expression analysis.
References 1. Ekman, P., Friesen, W.: The facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press, San Francisco (1978) 2. Ekman, P., Huang, T.S., Sejnowski, T.J., Hager, J.C. (eds.): Final Report to NSF of the Planning Workshop on Facial Expression Understanding. Human Interaction Lab. Univ. California, San Francisco (1993) 3. Bartlett, M., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Recognizing facial expressions: machine learning and application to spontaneous behavior. In: IEEE CVPR 2005, San Diego, CA, vol. 2, pp. 568–573 (2005) 4. Braathen, B., Bartlett, M., Littlewort, G., Smith, E., Movellan, J.: An approach to automatic recognition of spontaneous facial actions. In: Proc. of Int. Conf. on FGR, USA, pp. 345–350 (2002) 5. Cohen, I., Sebe, N., Garg, A., Chen, L.S., Huang, T.S.: Facial expression recognition from video sequences: temporal and static modeling. In: CVIU, vol. 91, pp. 160–187 (2003) 6. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.: Emotion recognition in human computer interaction. IEEE Signal Processing Magazine 18(1), 32–80 (2001) 7. Pantic, M., Rothkrantz, L.: Facial action recognition for facial expression analysis from static face images. IEEE Trans. on SMC-Part B: Cybernetics 34, 1449–1461 (2004) 8. Pantic, M., Rothkrantz, L.J.M.: Automatic analysis of facial expressions: The state of the art. IEEE Trans. on PAMI 22(12), 1424–1445 (2000)
9. Fasel, B., Luettin, J.: Automatic facial expression analysis: A survey. Pattern Recognition 36, 259–275 (2003) 10. Pandzic, I., Forchheimer, R.: MPEG-4 Facial Animation: the Standard, Implementation and Applications. Wiley, Chichester (2002) 11. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.: A 3d facial expression database for facial behavior research. In: Proc. of Int. Conf. on FGR, UK, pp. 211–216 (2006) 12. Parke, F., Waters, K.: Computer Facial Animation (1996) 13. Wang, J., Yin, L., Wei, X., Sun, Y.: 3D Facial Expression Recognition Based on Primitive Surface Feature Distribution. In: IEEE CVPR’06, vol. 2, pp. 1399–1406 (2006)
Locating Facial Features Using an Anthropometric Face Model for Determining the Gaze of Faces in Image Sequences Jorge P. Batista ISR-Institute of Systems and Robotics, DEEC/FCT University of Coimbra, Coimbra - PORTUGAL
[email protected] Abstract. This paper presents a framework that combines a robust facial features location with an elliptical face modelling to measure user’s intention and point of attention. The most important facial feature points are automatically detected using a statistically anthropometric face model. After observing the structural symmetry of the human face and performing some anthropometric measurements, the system is able to build a model that can be used in isolating the most important facial feature areas: mouth, eyes and eyebrows. Combination of different image processing techniques are applied within the selected regions for detecting the most important facial feature points. A model based approach is used to estimate the 3D orientation of the human face. The shape of the face is modelled as an ellipse assuming that the human face aspect ratio (ratio of the major to minor axes of the 3D face ellipse) is known. The elliptical fitting of the face at the image level is constrained by the location of the eyes which considerable increase the performance of the system. The system is fully automatic and classifies rotation in all-view direction, detects eye blinking and eye closure and recovers the principal facial features points over a wide range of human head rotations. Experimental results using real images sequences demonstrates the accuracy and robustness of the proposed solution.
1 Introduction
Since a person's head pose and gaze direction are deeply related to his/her intention and attention, the ability to detect the presence of visual attention, and/or determine what a person is looking at by estimating the direction of eye gaze and face orientation, is useful in many applications. Knowledge about gaze direction can be used in human-computer interfaces, video compression, active video-conferencing and face recognition systems [9,11,14]. In behavioral studies gaze detection and estimation are also invaluable. The goal of our work is to estimate user head pose non-invasively and robustly in real-time. One of the main purposes is to recover and track the three degrees
This work was funded by FCT Project POSI/EEA-SRI/61150/2004.
of rotational freedom of a human head, without any prior knowledge of the exact shape of the head and face being observed. This automatic orientation estimation algorithm is combined with a robust facial feature detection and tracking to provide additional constraints to estimate the sequence of head orientations observed. Although anthropometric measurements of the human face provide useful information about the location of facial features, they have rarely been used in their detection and localization [7,8]. In this paper, we have explored the approach of using a statistically anthropometric face model for the localization of the facial feature areas as well as the detection of the most important facial feature points from these areas using hybrid image processing techniques [1,5]. Since human faces can be accurately modelled with an ellipse, and such a model is less sensitive to facial expression changes, an elliptical image face modelling is used to obtain the head (face) orientation [4]. To robustly recover the 3D orientation of a human face from a single image, the ellipse modelling requires an accurate detection of both eyes' pupil centers and assumes that the ratio of the major and minor axes of the 3D face ellipse is known [12]. This ratio is obtained through anthropometric face statistics. Our purpose is to recover the three angles of rotation: yaw, pitch and roll.
2 Anthropometric Face Model
Face shape is dynamic, due to the many degrees of articulative freedom of the human head, and the deformations of the face and its parts induced by muscular action. Face shape variability is also highly limited by both genetic and biological constraints, and is characterized by a high degree of (approximate) symmetry and (approximate) invariants of face length scales and ratios. Anthropometry is a biological science that deals with the measurements of the human body and its different parts [10]. It is concerned with tabulation and modelling of the distributions of these scales and ratios, and can serve as a useful source of shape and parts location constraints for analyzing image sequences of human faces. After performing anthropometric measurements on several frontal face images taken from different human subjects, an anthropometric model of the human

Table 1. Proportion ratios obtained from the anthropometric measurements Di

Ratios     Description                                                              Value
D4,5/D1    Proportion of the eye-eyebrow distance to the mouth width                ≈ 0.35
D3/D1      Proportion of the nose tip to mouth center distance to the mouth width   ≈ 0.65
D2/D1      Proportion of the distance between the midpoint of the eyes center
           and the mouth center to the mouth width                                  ≈ 1.40
D6/D1      Proportion of the inter-eyes distance to the mouth width                 ≈ 1.20
Fig. 1. Anthropometric face model used for facial feature area localization. Facemarks (Pi ) and anthropometric measurements (Di ) of our anthropometric face model are displayed.
face is built that can be used to locate the most important facial feature areas from face images [8]. The facemark points that have been measured to build the face anthropometric model are represented in Fig. 1. Some statistics of proportion were obtained from these points, and the mouth statistics serve as the principal parameter for measuring the center location and size of the other facial feature regions. It has been observed that the distance between each eye and eyebrow (D4, D5) and the distances between the nose tip and the midpoint of the eyes to the mouth center (D3 and D2, respectively) are related to the inter-eyes distance (D6) and the mouth width (D1). Table 1 shows the proportions of the distances Di/D1, taking the mouth endpoint distance as the principal parameter of measurement. The mouth statistics are used, instead of the inter-eyes statistics, mainly because the eye centers (pupils) cannot be detected when the eyes are shut.
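To make the use of these proportions concrete, the following short Python sketch (not part of the original work; function and variable names are hypothetical, and the ratio values are the approximate ones of Table 1) derives candidate centers for the facial feature regions from the two detected mouth corner points of a frontal, roll-compensated face.

import numpy as np

# Approximate proportion ratios from Table 1 (all relative to the mouth width D1).
RATIOS = {"D2/D1": 1.40,    # midpoint of the eyes -> mouth center
          "D3/D1": 0.65,    # nose tip -> mouth center
          "D4,5/D1": 0.35,  # eye -> eyebrow
          "D6/D1": 1.20}    # inter-eyes distance

def feature_region_centers(p_left_mouth, p_right_mouth):
    """Estimate facial feature locations from the mouth corners of a frontal,
    roll-compensated face (y axis pointing downwards in image coordinates)."""
    pl, pr = np.asarray(p_left_mouth, float), np.asarray(p_right_mouth, float)
    mouth_center = (pl + pr) / 2.0
    d1 = np.linalg.norm(pr - pl)                     # mouth width D1
    up = np.array([0.0, -1.0])                       # towards the forehead

    eyes_mid = mouth_center + RATIOS["D2/D1"] * d1 * up
    nose_tip = mouth_center + RATIOS["D3/D1"] * d1 * up
    half_eye = 0.5 * RATIOS["D6/D1"] * d1
    left_eye = eyes_mid + np.array([-half_eye, 0.0])
    right_eye = eyes_mid + np.array([+half_eye, 0.0])
    brow_off = RATIOS["D4,5/D1"] * d1 * up
    return {"mouth_center": mouth_center, "nose_tip": nose_tip,
            "left_eye": left_eye, "right_eye": right_eye,
            "left_eyebrow": left_eye + brow_off,
            "right_eyebrow": right_eye + brow_off}

print(feature_region_centers((100, 200), (160, 200)))

In the actual system the feature regions are rectangles whose sizes are also derived from D1; the sketch only places their centers.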
3 Image Face Detection
Though people differ in the color and length of their hair, it is reasonable to assume that the amount of skin that can be seen and the position of the skin pixels within the frame are a relatively invariant cue for a person's face detection in a static image. Many researchers have exploited the relative uniqueness of skin color to detect and track faces [4,5]. To automatically select the face skin pixels on the image, a color skin histogram is defined taking samples from different scenarios [15]. A hand-selected face skin region is used on a sequence of learning frames to compute a (normalized) skin color histogram model in RGB space. We then compute the probability that every pixel in the face images is drawn from the predefined skin histogram model. Using the skin histogram, and taking the approach proposed in [15], each pixel in each face image is drawn from a specific RGB bin and so is assigned the relevant weight, which can be interpreted as a probability that the pixel comes from the skin model. Figure 2 shows the results of the skin color detection on a human face image. Using the Birchfield [4] solution, which combines intensity gradients and color histograms, the face projection on the image plane is modelled as an ellipse. Fig. 2 shows the filtered skin
Fig. 2. The image face skin color detection (top row). The face detected area and the fitted ellipse using intensity gradients and color histograms (bottom row).
detected area and the ellipse fitted to that area. This ellipse is the starting point for our automatic facial features detection algorithm. The ellipse center and the ellipse’s axis will be used as the new image face coordinate system.
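A minimal sketch of the skin-probability step described above, in the spirit of the histogram back-projection of [15]: the bin count (8 per RGB channel) and the acceptance threshold are illustrative choices, not values taken from the paper, and all identifiers are hypothetical.

import numpy as np

BINS = 8  # bins per RGB channel (assumed; the paper does not state the bin count)

def skin_histogram(skin_pixels):
    """Build a normalized RGB histogram from hand-selected skin samples (Nx3, uint8)."""
    idx = (skin_pixels // (256 // BINS)).astype(int)           # per-channel bin index
    hist = np.zeros((BINS, BINS, BINS), float)
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return hist / hist.sum()                                    # relative frequencies

def skin_probability(image, hist):
    """Back-project the histogram: each pixel gets the weight of its RGB bin."""
    idx = (image // (256 // BINS)).astype(int)
    prob = hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    return prob / prob.max() if prob.max() > 0 else prob        # normalize to [0, 1]

# toy example: reddish samples as "skin", then classify a small image
samples = np.random.randint(150, 256, (1000, 3)).astype(np.uint8)
samples[:, 1:] //= 2
hist = skin_histogram(samples)
image = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
mask = skin_probability(image, hist) > 0.1                      # illustrative threshold
print(mask)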
4 Facial Feature Points Detection
Since the proposed anthropometric face model was obtained for frontal face images, a starting frontal head pose is required to ensure a robust detection of the facial feature points. To robustly detect the facial features over a sequence of human face images taken from different head poses, Kalman filter feature tracking is combined with the proposed anthropometric face model feature detection.
4.1 Identification of Facial Feature Areas
The rotation angle (θ) between the frame coordinate system and the ellipse coordinate system (see fig. 2) encodes (at this stage) the roll rotation of the face, and to fit the anthropometric face model with the image face the whole image is rotated by this amount. Since the mouth width D1 serves as the principal parameter for measuring the center locations of the other facial feature regions, the implementation of our automatic facial feature point detection begins with the detection of the mouth area. Once the mouth area is correctly detected, points P2 and P3 can be accurately detected and distance D1 used to locate points P5 , P6 , P7 and P8 by calculating the distances D2 , D4 , D5 and D6 using the proportionality ratios proposed by the anthropometric face model. Rectangular areas for confining the facial feature regions are then approximated using the mouth width as the measurement criteria.
4.2 Mouth Features Detection
Once the face is aligned to fit with the anthropometric model, the mouth detection starts by searching, over the bottom area of the ellipse, for the region not segmented as skin that is closest to the major ellipse axis. The end points of this area and the maximum thickness value are used to accurately define the lip corner points and the inter-lips line. Taking the coordinates of the end points (Pl, Pr) of the detected area as the initial lip corner points, the location of the correct lip corners is obtained by using the following approach:
1. Convert from color to gray scale and apply a contrast stretching to increase contrast;
2. Let Pl and Pr represent the coordinates of the initial lip corner points, and let Tck represent the maximum thickness of the detected area;
3. For each column extending beyond both lip corners, consider a vertical line c (of height Tck pixels and centered on the previously detected lip corner) and find the darkest pixel on this vertical line [9]. The darkest pixel will generally be a pixel in the gap between the lips.
4. To determine where the lip corners are, the system obtains

f(x, y) = 1/(1 + D(x, y)) · 1/I(x, y),   subject to D(x, y) < 3    (1)

where D(x, y) is the distance of a pixel (x, y) from the closest corner of the mouth, and I(x, y) is the gray level intensity at (x, y). This will give a pixel that is close to the previous lip corner and that is not too bright. The function maximum is the new lip corner.
5. The searching process is stopped when the gradient along the vertical line is below a certain threshold, |∇c| < 0.1.
Figure 3 shows the results of the proposed algorithm for lip corner location and inter-lips line detection. This solution has proven to be very robust and accurate.
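The following sketch illustrates steps 3 and 4 of the corner refinement, using the reconstructed form of Eq. (1), f(x, y) = 1/(1 + D(x, y)) · 1/I(x, y) with D(x, y) < 3. It is a simplified, hypothetical implementation: the gradient-based stopping rule of step 5 is replaced by a plain score test, and all identifiers are illustrative.

import numpy as np

def refine_lip_corner(gray, corner, thickness, direction, max_cols=30):
    """Walk column by column beyond an initial lip corner, picking the darkest
    pixel on a short vertical segment and scoring it with
    f(x, y) = 1/(1 + D(x, y)) * 1/I(x, y)  subject to  D(x, y) < 3.
    `direction` is +1 (to the right) or -1 (to the left)."""
    h, w = gray.shape
    y, x = corner
    half = max(1, thickness // 2)
    for _ in range(max_cols):
        nx = x + direction
        if nx < 0 or nx >= w:
            break
        col = gray[max(0, y - half):min(h, y + half + 1), nx].astype(float)
        ny = max(0, y - half) + int(np.argmin(col))          # darkest pixel in the gap
        dist = np.hypot(nx - x, ny - y)
        if dist >= 3:                                        # constraint D(x, y) < 3
            break
        score = 1.0 / (1.0 + dist) * 1.0 / (gray[ny, nx] + 1.0)
        if score < 1e-3:                                     # too bright or too far: stop
            break
        y, x = ny, nx                                        # accept as the new corner
    return (y, x)

gray = np.full((60, 80), 200, dtype=np.uint8)
gray[30, 20:60] = 40                                         # a dark inter-lips line
print(refine_lip_corner(gray, (30, 40), 6, +1))              # walks towards the right corner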
Fig. 3. Lips corner detection. Left to right: Grey level image after contrast stretching with the initial lip corners superimposed; The searching area; The correctly located lip corners and the extracted inter-lips line; The results superimposed on the color face image.
4.3 Eyebrows Features Detection
The detection of the eyebrow feature points is accomplished prior to the eye pupil and eye corner detection for two basic reasons: i) to take into account the
eyes-shut situations and ii) because the correct detection of the eye corners is difficult to achieve for large yaw head rotation angles due to occlusions. Since the anthropometric proportion ratios fail for non-frontal face images, additional constraints must be adopted to deal with large yaw and pitch head rotations. The following strategy is used to robustly detect both eyebrows:
1. Using the face model, estimate the location and size of both eyebrow feature regions. The dimension of the feature region is related to the anthropometric measure D1, being defined as: Width = 1.25 × D1, Height = 0.8 × D1.
2. Perform the eyebrow detection within each of the feature regions:
(a) Convert from color to gray scale and apply a contrast stretching to increase contrast;
(b) Detect horizontal features using a vertical gradient mask (e.g., Sobel) and threshold the outcome;
(c) Perform some noise filtering;
(d) Select the lowest extreme points of the segmented area as the eyebrow corner points;
3. To accomplish the correct detection of the end-points, especially for large yaw head rotations, adjust the size of the feature region (in 10% steps) and repeat from step 2;¹
4. Stop the process when the end-point locations remain unchanged.
Figure 4 shows the defined eyebrow regions for two yaw head gaze rotations with the eyebrow segmentation results superimposed.
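A possible rendering of step 2 of the eyebrow strategy is sketched below. It applies a vertical Sobel mask, thresholds the gradient magnitude and returns the extreme points of the segmented area; the threshold ratio and the choice of leftmost/rightmost (rather than lowest) extreme points are simplifications of this sketch, not the paper's exact rule.

import numpy as np

def detect_eyebrow_corners(gray_region, thresh_ratio=0.5):
    """Enhance horizontal structures with a vertical Sobel mask, threshold, and
    take the leftmost and rightmost segmented points as the eyebrow corners."""
    g = gray_region.astype(float)
    p = np.pad(g, 1, mode="edge")
    # vertical Sobel gradient computed with explicit shifts (no extra libraries)
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    mag = np.abs(gy)
    mask = mag > thresh_ratio * mag.max()
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    left = (ys[np.argmin(xs)], xs.min())        # leftmost extreme point
    right = (ys[np.argmax(xs)], xs.max())       # rightmost extreme point
    return left, right

region = np.full((40, 80), 180, dtype=np.uint8)
region[18:22, 10:70] = 60                       # a dark horizontal eyebrow stripe
print(detect_eyebrow_corners(region))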
Fig. 4. Eyebrows detection. top: Defined eyebrow regions for different head rotations with the segmented eyebrow superimposed; bottom: The eyebrow segmentation and eyebrow corner detection.
4.4 Eye Features Detection
Eye feature detection is the most challenging task due to the variability of eye shapes. The eye region is composed of an upper eyelid with eyelashes, a lower eyelid, the pupil, the bright sclera and the skin region that surrounds the eye. Because both the pupil and the sclera change their shape with the various possible appearances of the eyes, especially when the eye is closed or partially closed, robust detection of the pupil
¹ The black box shown in Fig. 4 represents the adjusted eyebrow region.
Fig. 5. The eyes features variability
center and eye corners is not an easy task. Most of the approaches found in the literature model the eyelid and detect the eye features (pupil center and eye corners) mainly for frontal face images [17,18,19]. Fig. 5 shows the variability of eye shapes under different head poses. Our purpose is to be able to accurately detect the most important features of the eye (pupil center and eye corners) for the large set of different appearances of an eye that can occur within our working scenario. The corners of each eyebrow are used as the starting point for an accurate selection of the eye region. Taking the anthropometric face model, the location of the eye region is placed immediately above the eyebrow region. To deal with non-frontal face images, the size of each eye region is constrained by the size of the corresponding eyebrow. Denoting by Deyebrow the distance between both eyebrow corners, the width (M) and height (N) of each eye region are defined as M = Deyebrow + 0.4 × max(D1, Deyebrow)
N = 0.7 × max(D1 , Deyebrow ).
Eye features detection is accomplished in two steps. In the first step, the presence of the iris is checked on each eye region; if the eye is labelled as an open eye, then a second processing step is applied to find the pupil center and to fit an ellipse to the iris shape.
Eyes Open vs. Eyes Shut. The detection of eyes open vs. eyes shut is accomplished by computing the variance projection function [16] on each eye feature region. The variance projection function (VPF) is defined as

σ_H^2(y) = (1/M) ∑_{i=1}^{M} [I(x_i, y) − H(y)]^2    (2)
where H(y) is the mean intensity value for the row y and I(x, y) is the intensity of pixel (x, y). This variance projection function is applied in both directions and it is used to bound the area of the iris (fig. 6). To increase the robustness of the iris checking process, the detected iris is cross-checked based on size and proportion constraints. Pupil Center Detection and Iris Shape Modelling. Independently of the color of an iris, the pupil will always be its darkest region. The use of this cue to locate the pupil center is constrained by the level of detail of the eye region
Fig. 6. The Variance Projection Function used to bound the iris area
and by the color or gray level quantization of the image. For our purposes, the precise location of the pupil center is not a major issue, and the pupil center was considered to be coincident with the iris center. The following approach was used to detect the pupil center and to model the iris shape:
1. Convert from color to gray scale and apply a contrast stretching to increase contrast;
2. Threshold the bounded iris area and consider the center of mass of the segmented region as the iris center;
3. Obtain the iris contour points by measuring the gradient maximum values along radial lines starting at the iris center;
4. Fit an ellipse to the detected contour points.
Fig. 7 shows the results obtained with the proposed approach.
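The sketch below combines the two pieces just described: the variance projection function of Eq. (2), used to bound the iris area, and the threshold-plus-centroid estimate of the pupil center (steps 1 and 2 above). The VPF ratio and the darkness threshold are illustrative values; the radial contour sampling and ellipse fitting of steps 3 and 4 are omitted.

import numpy as np

def vpf_rows(gray):
    """Variance projection function over rows, as in Eq. (2)."""
    g = gray.astype(float)
    return ((g - g.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

def vpf_cols(gray):
    g = gray.astype(float)
    return ((g - g.mean(axis=0, keepdims=True)) ** 2).mean(axis=0)

def bound_iris(gray, ratio=0.5):
    """Rows/columns whose VPF exceeds a fraction of the maximum bound the iris."""
    rv, cv = vpf_rows(gray), vpf_cols(gray)
    ys = np.nonzero(rv > ratio * rv.max())[0]
    xs = np.nonzero(cv > ratio * cv.max())[0]
    if len(ys) == 0 or len(xs) == 0:
        return None
    return ys.min(), ys.max(), xs.min(), xs.max()

def pupil_center(gray, dark_thresh=80):
    """Center of mass of the dark pixels inside the bounded iris area."""
    box = bound_iris(gray)
    if box is None:
        return None
    y0, y1, x0, x1 = box
    sub = gray[y0:y1 + 1, x0:x1 + 1]
    ys, xs = np.nonzero(sub < dark_thresh)
    if len(ys) == 0:
        return None
    return (y0 + ys.mean(), x0 + xs.mean())

eye = np.full((40, 60), 190, dtype=np.uint8)
yy, xx = np.ogrid[:40, :60]
eye[(yy - 20) ** 2 + (xx - 30) ** 2 < 64] = 30        # a dark synthetic iris/pupil
print(bound_iris(eye), pupil_center(eye))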
Fig. 7. Pupil center detection and iris elliptical shape model
Eyes Corners Detection. The detection of the corners of the eye is a challenging task, especially for large yaw head rotations. In these cases, the correct location of the eye corners is extremely difficult, mainly due to occlusions. The solution we propose tries to overcome this problem by detecting the eye corners both in the presence of open eyes and of shut eyes.
Eyes-Shut: To detect the corners of a closed eye we take advantage of the dark region that is created by the union of both eyelashes. In these cases, the following algorithm is used:
1. Convert from color to gray scale and apply a contrast stretching to increase contrast;
2. Apply a vertical gradient mask to enhance the horizontal edges and threshold;
Fig. 8. Eye-Shut corners detection. Top row: Evolution of the proposed algorithm; Bottom row: Corners detected on several Eyes-Shut images.
3. Obtain the skeleton of the segmented region followed by a pruning operation;
4. Select the end-points of the skeleton as the eye corners.
Figure 8 presents the evolution of the algorithm and a few results.
Eyes-Open: For open eyes, we define the corner as the farthest transition point between the sclera (brightest pixels) and the skin (darkest pixels). This cue is easily visible for frontal face images, but lacks visibility when the eyes are gazing to the sides and when the face image is not frontal. To overcome this lack of visibility an additional cue has been added. The upper eyelid and eyelashes are normally associated with the darkest region just above the iris, and the proposed solution takes advantage of this fact to extract the shape of the upper eyelid. The shape of the eyelid is used to constrain the location area for the corners. The following strategy is used to locate the eye corners:
1. Convert from color to gray scale and apply a contrast stretching to increase contrast;
2. Taking the previously segmented iris, obtain the VPF values in the horizontal vicinity of the iris; these values encode the gray level variability that exists between the dark regions of the eyelids and the bright area of the sclera.
3. Obtain an estimate for the location of the eye corners by thresholding the VPF values;
4. Threshold the image in order to enhance the iris and upper eyelid areas, and remove the pixels that belong to the previously segmented area of the iris.
5. Obtain the skeleton of the remaining region and fit a polynomial function to the skeleton;
6. The end-points of the skeleton polynomial function are used to locate the eye corners.
7. Combine the information supplied by both approaches.
Figure 9 shows the detection results of the proposed algorithm.
4.5 Experimental Results
The performance of the proposed solutions for facial feature location was evaluated using five video sequences of a human face gazing at different 3D points and blinking its eyes. The algorithm was tested with different human subjects and
Fig. 9. Eye-Open corners detection. Top row: Evolution of the proposed algorithm; Bottom row: Corners detected on several Eyes-open images.
Fig. 10. Facial feature detection for different subject and with different face gaze orientation. The feature regions obtained using the anthropometric face model and the detected features are superimposed on the images.
under different image scales. Table 2 presents the detection accuracy of the automatic facial feature point detector described in this paper, and Fig. 10 shows the facial feature detection on different human face images extracted from the image sequence database.
5 Human Head Orientation
The presented approach models the shape of the human face with an ellipse, since human faces can be accurately modelled with an ellipse and such a model is less sensitive to facial expression changes. To recover the 3D face pose from a single image, it is assumed that the ratio of the major and minor axes of the 3D face ellipse is known. This ratio is obtained through the anthropometric face statistics. Our purpose is to recover the three angles of rotation: yaw (around the vertical axis), pitch (around the horizontal axis) and roll (around the optical axis).
5.1 Image Face Ellipse Detection and Tracking
In order to correctly detect the face ellipse, some constraints must be considered, in particular size, location and orientation. The distance between the detected pupils
Table 2. Detection accuracy (in percent) of the Automatic Facial Feature Point Detector

Facial Feature Points Detection Rate (%)
Seq. no.   # frames   % Mouth   % eyebrows (left-right)   % eyes corners (left-right)
1          360        99%       100%-99%                  96.6%-98.3%
2          360        100%      98.2%-98%                 97.7%-96.6%
3          360        100%      100%-96.6%                98.0%-96.6%
4          360        97%       99.4%-98.7%               95.5%-88.3%
5          346        99%       98.2%-100%                96.5%-98.3%

Eyes-Shut and Eyes-Open Detection Rates
Seq. no.   # Eyes-Shut   False Positives (left-right)   % (left-right)   # Eyes-Open   False Negatives (left-right)   % (left-right)
1          95            3-1                            96.8%-99.0%      265           1-1                            99.6%-99.6%
2          10            1-0                            99%-100%         350           1-3                            99.7%-99.1%
3          90            4-4                            95.5%-95.5%      270           2-1                            99.2%-99.6%
4          47            13-3                           72.3%-93.6%      313           1-7                            99.6%-97.7%
5          105           3-1                            97.1%-99.0%      241           2-5                            99.1%-97.9%
and their location are used to constrain the size and location of the image face ellipse. The orientation of the line that passes through both pupils is directly related to the 3D face roll rotation. For roll-free face poses this line remains horizontal, which means that it is invariant to the yaw and pitch rotations. Under these constraints, the roll angle (γ) is defined by γ = atan[(y_pl − y_pr)/(x_pl − x_pr)], where Pl = (x_pl, y_pl) and Pr = (x_pr, y_pr) are the image locations of the left and right detected pupils, respectively. For frontal orientation, a weak perspective projection can be assumed and the face symmetry for the location of the eyes within the 3D face ellipse holds for the image face ellipse. This means that the major axis of the face ellipse is normal to the line connecting the two eyes and passes through the center of the line. In fact, these constraints do not hold for non-frontal orientations and the orientation of the major axis is not normal to the connecting line. However, the adopted solution keeps the constraint that the major axis of the ellipse passes through the center of the line, considering the existence of an angle α between the major axis and the normal to the line that connects the two eyes. Assuming the existence of an ellipse coordinate frame located at the middle point of the eyes connecting line, with the X and Y axes aligned with the minor and major axes of the ellipse, respectively, the image face ellipse is characterized by a 4-tuple e = (mi, ni, d, α), where mi and ni are the lengths of the major and minor semi-axes of the ellipse, respectively, d is the distance to the image ellipse center and α is the rotation angle. Taking the approach proposed by Birchfield [4], the image face ellipse can be detected as the one that maximizes the normalized sum of the gradient magnitude projected along the directions orthogonal to the ellipse around the perimeter of
the ellipse and the color histogram intersection of the face's interior. This can be formulated as

ε_g(e) = (1/N) ∑_{i=1}^{N} |n(i) · g(i)|

where n(i) is the unit vector normal to the ellipse at pixel i, g(i) is the pixel intensity gradient and (·) denotes the dot product, and

ε_c(e) = ∑_{i=1}^{N} min(I_e(i), M(i)) / ∑_{i=1}^{N} I_e(i)

where I_e(i) and M(i) are the numbers of pixels in the i-th bin of the histograms, and N is the number of bins. The best face ellipse is

χ = arg max_{e∈E} ( ε̄_g(e) + ε̄_c(e) )

where the search space E is the set of possible ellipses produced by varying the 4-tuple parameters of the ellipse, and ε̄_g and ε̄_c are normalized values. The 4-tuple parameters of the ellipse are filtered via a Kalman filter and the a posteriori estimated tuple is used to define an initial estimate for the best face ellipse search. Figure 11 shows several results of the image face ellipse fitting.
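A simplified sketch of the two goodness terms is given below. For brevity it scores a single candidate ellipse on a grayscale image (the paper uses a color histogram for ε_c and searches the whole space E); the sampling density and the number of histogram bins are assumptions of this sketch.

import numpy as np

def ellipse_points_normals(cx, cy, a, b, alpha, n=72):
    """Sample points on an ellipse (semi-axes a, b, rotation alpha) and their
    outward unit normals."""
    t = np.linspace(0, 2 * np.pi, n, endpoint=False)
    ca, sa = np.cos(alpha), np.sin(alpha)
    x = cx + a * np.cos(t) * ca - b * np.sin(t) * sa
    y = cy + a * np.cos(t) * sa + b * np.sin(t) * ca
    nx0, ny0 = np.cos(t) / a, np.sin(t) / b          # normal of the axis-aligned ellipse
    nx, ny = nx0 * ca - ny0 * sa, nx0 * sa + ny0 * ca
    norm = np.hypot(nx, ny)
    return x, y, nx / norm, ny / norm

def gradient_goodness(gray, cx, cy, a, b, alpha):
    """eps_g: mean |n(i) . g(i)| of the image gradient projected on the normals."""
    gy, gx = np.gradient(gray.astype(float))
    x, y, nx, ny = ellipse_points_normals(cx, cy, a, b, alpha)
    xi = np.clip(np.round(x).astype(int), 0, gray.shape[1] - 1)
    yi = np.clip(np.round(y).astype(int), 0, gray.shape[0] - 1)
    return np.mean(np.abs(nx * gx[yi, xi] + ny * gy[yi, xi]))

def color_goodness(gray, model_hist, cx, cy, a, b, alpha, bins=16):
    """eps_c: histogram intersection between the ellipse interior and the model."""
    yy, xx = np.ogrid[:gray.shape[0], :gray.shape[1]]
    ca, sa = np.cos(alpha), np.sin(alpha)
    u, v = (xx - cx) * ca + (yy - cy) * sa, -(xx - cx) * sa + (yy - cy) * ca
    inside = (u / a) ** 2 + (v / b) ** 2 <= 1.0
    h, _ = np.histogram(gray[inside], bins=bins, range=(0, 256))
    return np.minimum(h, model_hist).sum() / max(h.sum(), 1)

img = np.random.randint(0, 256, (80, 80)).astype(np.uint8)
model, _ = np.histogram(img[30:50, 30:50], bins=16, range=(0, 256))
print(gradient_goodness(img, 40, 40, 15, 20, 0.1),
      color_goodness(img, model, 40, 40, 15, 20, 0.1))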
5.2 Face Orientation
Consider an object coordinate frame attached to the 3D face ellipse, with its origin located at the center of the ellipse and its X and Y axes aligned with the major and minor axes of the ellipse. The Z axis is normal to the 3D ellipse plane. The camera coordinate frame is located at the camera optical center, with Xc and Yc aligned with the image directions and Zc along the optical axis. Since the 3D face ellipse is located on the plane Z = 0, the projection equation that characterizes the relationship between an image face ellipse point pi = (x, y, 1)^T and the corresponding 3D face ellipse point Pi = (X, Y, 1)^T is given by pi = βK[R|t]Pi, where K represents the camera intrinsic parameter matrix, M = [R|t] = [r1 r2 |t] is the extrinsic parameter matrix and β = λ/f is an unknown scalar. Representing by

[x y 1] · [[a, c/2, d/2], [c/2, b, e/2], [d/2, e/2, f]] · [x y 1]^T = 0    (3)

the generic matricial formula of an ellipse, the 3D face ellipse and the image face ellipse can be defined, respectively, as

[X Y 1] Q [X Y 1]^T = 0,        [x y 1] A [x y 1]^T = 0.    (4)

Substituting pi = βKMPi into Eq. 4 leads to

[X Y 1] (βM^T K^T A K M) [X Y 1]^T = 0.    (5)

Denoting B = K^T A K, the 3D ellipse matrix Q yields Q = βM^T B M. Let the lengths of the major and minor axes of the 3D face ellipse be m and n, respectively; since the object frame is located at the center of the ellipse, the ellipse matrix Q is parameterized as

Q = diag(1/m^2, 1/n^2, −1)    (6)
Fig. 11. The elliptical face model and 3D face gaze estimation. Results concerning the 3D face orientation estimation are superimposed on the images (τ : yaw; σ: pitch; γ: roll).
resulting in the equation

diag(1/m^2, 1/n^2, −1) = β · [[r1^T B r1, r1^T B r2, r1^T B t], [r2^T B r1, r2^T B r2, r2^T B t], [t^T B r1, t^T B r2, t^T B t]]    (7)

Due to the symmetry of the matrix, there are only six equations (constraints) for a total of nine unknowns. Since the roll angle was already obtained, the face orientation can be defined just by the yaw and pitch rotations. Assuming a null translation vector, the rotation matrix obtained from the yaw and pitch rotations is

R = Rσ Rυ = [r1 r2 r3] = [[cos(σ), sin(σ)sin(υ), −sin(σ)cos(υ)], [0, cos(υ), sin(υ)], [sin(σ), −cos(σ)sin(υ), cos(σ)cos(υ)]].    (8)

Assuming that the ratio between the major and minor axes of the 3D face ellipse is known from anthropometric face analysis [10], and letting c = m^2/n^2 represent this ratio, the 2 × 2 sub-matrix yields

β · [[r1^T B r1, r1^T B r2], [r2^T B r1, r2^T B r2]] = [[1/m^2, 0], [0, 1/n^2]]    (9)

resulting in the following constraint equations

r1^T B r2 = 0    (10)

β r1^T B r1 / (1/m^2) = β r2^T B r2 / (1/n^2)  ⇔  r1^T B r1 = (n^2/m^2) r2^T B r2  ⇔  r2^T B r2 − c · r1^T B r1 = 0.    (11)

Using these two equations it is possible to solve for the pitch and yaw iteratively. Initial estimates of 0° for both angles were used for the first frame of the sequence.
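The following sketch solves the two constraints (10) and (11) numerically with the rotation parameterization of Eq. (8). A coarse grid search over (σ, υ) stands in for the iterative scheme used in the paper; B and c are assumed to be available from the fitted image ellipse and the anthropometric ratio, and the toy values below are illustrative.

import numpy as np

def rotation_columns(sigma, upsilon):
    """First two columns r1, r2 of the rotation matrix of Eq. (8)."""
    cs, ss = np.cos(sigma), np.sin(sigma)
    cu, su = np.cos(upsilon), np.sin(upsilon)
    r1 = np.array([cs, 0.0, ss])
    r2 = np.array([ss * su, cu, -cs * su])
    return r1, r2

def solve_yaw_pitch(B, c, grid=np.deg2rad(np.arange(-60, 61, 1.0))):
    """Pick the (sigma, upsilon) pair that best satisfies r1^T B r2 = 0 and
    r2^T B r2 - c * r1^T B r1 = 0 (Eqs. 10-11), by plain grid search."""
    best, best_err = (0.0, 0.0), np.inf
    for s in grid:
        for u in grid:
            r1, r2 = rotation_columns(s, u)
            e1 = r1 @ B @ r2
            e2 = r2 @ B @ r2 - c * (r1 @ B @ r1)
            err = e1 ** 2 + e2 ** 2
            if err < best_err:
                best, best_err = (s, u), err
    return np.rad2deg(best[0]), np.rad2deg(best[1]), best_err

# toy example: B built from an arbitrary image ellipse matrix A and K = identity
A = np.diag([1 / 20.0 ** 2, 1 / 28.0 ** 2, -1.0])
K = np.eye(3)
B = K.T @ A @ K
print(solve_yaw_pitch(B, c=1.4 ** 2))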
Given r1 and r2, the translation T can be calculated up to a scale factor by using

[0, 0]^T = β · [r1^T B T, r2^T B T]^T.    (12)

Let T = (tx, ty, tz); the ratios tx/tz and ty/tz can be solved analytically by using the above equation. This approach was tested with several real images with good results. However, the accuracy obtained with this approach is highly dependent on the image face ellipse obtained. A robust and accurate detection of the eye pupils is fundamental for the overall performance of this approach. Figure 11 shows the results obtained with the face orientation estimation approach.
6 Conclusions
A framework that combines a robust facial feature location with an elliptical face modelling to determine the gaze of faces in image sequences was presented. The detection of facial feature points uses an anthropometric face model to locate the feature regions in the face. The statistical parameters of the mouth were used to model a set of face proportions that help in the localization of the facial feature regions, introducing a considerable reduction in computational time. The proposed solution is independent of the scale of the face image and performs well even with yaw and pitch rotations of the head. To determine the gaze of the face, an elliptical face modelling was used, taking the eye pupil locations to constrain the shape, size and location of the ellipse. A robust and accurate detection of the eye pupils is fundamental for the overall performance of this approach. This solution has proven to work relatively well, despite the iterative approach that was used to obtain the 3D face gaze parameters.
References 1. Rizon, M., Kawaguchi, T.: Automatic Eye Detection using Intensity and Edge Information. In: Proc. TENCON, vol. 2 (2000) 2. Yilmaz, A., et al.: Automatic feature detection and pose recovery for faces. In: ACCV2002, Melbourne, Australia (2002) 3. Phimoltares, S., et al.: Locating Essential Facial Features using Neural Visual Model. In: Int. Conf. on Machine Learning and Cybernetics (2002) 4. Birchfield, S.: Elliptical head tracking using intensity gradients and color histograms. In: IEEE CVPR (1998) 5. Xin, Z., Yanjun, X., Limin, D.: Locating Facial Features with Color Information. In: IEEE Int. Conf. on Signal Processing, vol. 2 (1998) 6. Gee, A.H., et al.: Determining the gaze of faces in images. Image and Vision Computing 12(10) (1994) 7. Horprasert, A.T., et al.: Computing 3D head orientation from a monocular image sequence SPIE. In: 25th AIPR workshop: Emerging Applications of Computer Vision, vol. 2962 (1996)
8. Sohail, A., Bhattacharya, P.: Localization of Facial Feature Regions using Anthropmetric Face Model. In: Int. Conf. on Multidisciplinary Information Sciences and Technologies (2006) 9. Smith, P., et al.: Determining Driver Visual Attention with one camera. IEEE Trans. on Intelligent Transportation Systems 4(4) (2003) 10. Farkas, L.: Anthropometry of the Head and Face. Raven Press, New-York (1994) 11. Matsumoto, Y., Zelinsky, A.: An Algorithm for Real-time Stereo Vision Implementation of Head Pose and Gaze Direction Measurement. In: Int. Conf. on Automatic Face and Gesture Recognition (2000) 12. Ji, Q., et al.: 3D pose estimation and tracking from a monocular camera, Image and Vision Computing (2002) 13. Wang, J.G., et al.: Study on Eye Gaze Estimation. IEEE Trans. on Systems, Man, and Cybernetic PART B: Cybernetics 32(3) (2002) 14. Ho, S.Y., et al.: An analytic solution for the pose determination of human faces from a monocular image. In: Pattern Recognition Letters, vol. 19 (1998) 15. Swain, M., Ballard, D.: Color Indexing. Int. Journal of Computer Vision 7(1) (1991) 16. Feng, G., Yuen, P.: Variance Projection Function and its application to eye detection for human face recognition. Pattern Recognition Letters 19(9) (1998) 17. Vezhnevets, V., Degtiareva, A.: Robust and Accurate Eye Contour Extraction. In: Graphicon-2003, Moscow, Russia (2003) 18. Kuo, P., Hanna, J.: An Improved Eye Feature Extraction Algorithm based on Deformable Templates. In: ICIP (2005) 19. Wu, Y., Liu, H., Zha, H.: A new method of human eyelids detection based on deformable templates. In: IEEE Int Conf. on Systems, Man and Cybernetics (SMC’04), Hague, Netherlands (2004)
Iris Recognition Based on Zigzag Collarette Region and Asymmetrical Support Vector Machines Kaushik Roy and Prabir Bhattacharya Concordia Institute for Information Systems Engineering Concordia University, Montreal, Quebec, Canada. H3G 1M8 {kaush_ro, prabir}@ciise.concordia.ca
Abstract. This paper presents an iris recognition technique based on the zigzag collarette region for segmentation and asymmetrical support vector machine to classify the iris pattern. The deterministic feature sequence extracted from the iris images using the 1D log-Gabor filters is applied to train the support vector machine (SVM). We use the multi-objective genetic algorithm (MOGA) to optimize the features and also to increase the overall recognition accuracy based on the matching performance of the tuned SVM. The traditional SVM is modified to an asymmetrical SVM to treat the cases of the False Accept and the False Reject differently and also to handle the unbalanced data of a specific class with respect to the other classes. The proposed technique is computationally effective with recognition rates of 97.70 % and 95.60% on the ICE (Iris Challenge Evaluation) and the WVU (West Virginia University) iris datasets respectively. Keywords: Biometrics, asymmetrical support vector machine, multi-objective genetic algorithm, zigzag collarette region.
1 Introduction Iris recognition has been regarded as one of the most reliable biometric technologies in recent years [9], [10]. The human iris, an annular part between the pupil and the white sclera (see Fig. 1), has an extraordinary structure and provides many interlacing minute characteristics such as freckles, coronas, stripes and the zigzag collarette region, which are unique to each subject [10]. Iris-based recognition systems can be noninvasive to the users, since the iris is an internal organ as well as externally visible, which is of great importance for real-time applications [3], [4], [5]. Iris scans are nowadays used in several international airports for the rapid processing through immigration of passengers who have pre-registered their iris images [11], and they have also been used by the United Nations High Commissioner for Refugees (UNHCR) at the Pakistan–Afghanistan border for the returning Afghan refugees before receiving their travel grants [18]. A new technology development project for iris recognition, namely the Iris Challenge Evaluation (ICE), is being conducted by the National Institute of Standards and Technology (NIST) [17]. The usage of iris patterns for personal identification started in the late 19th century; however, the major works on iris
recognition were started in the last decade. In [3], [4], [5], integrodifferential operator was proposed to localize the iris regions along with the eyelids removal technique. In [15], the Hough transform was applied for iris localization, a Laplacian pyramid was used to represent the distinctive spatial characteristics of the human iris, and a modified normalized correlation was applied for the matching process. In [9], a personal identification method based on iris texture analysis was described. An algorithm was proposed for iris recognition by characterizing the key local variations in [10]. In [8], an iris segmentation method was proposed based on the crossed chord theorem and the zigzag collarette area. In [13], the collarette boundary is detected by applying the histogram equalization technique and a high pass filter on the iris image. In [14], the SVM approach for comparing the similarity between the similar and the distinct irises was used to assess the discriminative power of the feature.
Fig. 1. Samples of iris images from the ICE dataset
2 Iris Image Preprocessing The iris is surrounded by various non-relevant regions such as the pupil, the sclera and the eyelids, and also has some noise sources that include the eyelashes, the eyebrows, the reflections and the surrounding skin [9], [10]. We need to remove the noise from the iris image to improve the recognition accuracy.
2.1 Localization The iris is an annular portion between the pupil (inner boundary) and the sclera (outer boundary) [10]. Both the inner boundary and the outer boundary of a typical iris can be approximately taken as circles. However, these two circles are not usually concentric [2], [9], [10]. We use the following approach to detect the iris and pupil boundaries from the digital eye image:
1. First, we apply a morphological operation, namely opening, on the input image to remove noise. A simple intensity thresholding technique is applied to isolate the pupil area and approximate the pupil center (Xp, Yp) and radius rp.
2. The Canny edge detection technique is applied on a circular region centered at (Xp, Yp) with radius rp + 25. We apply the circular Hough transform to detect the pupil/iris boundary.
3. In order to detect the iris/sclera boundary, we repeat step 2 with the neighborhood region replaced by an annular band of width R outside the pupil/iris boundary. The edge detector is adjusted to the vertical direction to minimize the influence of the eyelids.
4. The specular highlight that typically appears in the pupil region is one source of edge pixels. These can be generally eliminated by removing Canny edges at pixels with a high intensity value (for the ICE and WVU datasets, these values are 245 and 240 respectively). Edge pixels inside the iris region can also contribute to pulling
the Hough transform result away from the correct result. Therefore, these can generally be eliminated by removing the edges at pixels with an intensity below some predetermined value, such as 32 and 34 for the ICE and WVU datasets respectively. Fig. 3 (b, f) shows the localized pupils of the input iris images.
2.2 Collarette Boundary Detection The iris complex pattern provides many distinguishing characteristics such as arching ligaments, furrows, ridges, crypts, rings, freckles, coronas, stripes, and the zigzag collarette area [13]. The zigzag collarette area is one of the most important parts of the iris complex pattern (see Fig. 2), as it is usually insensitive to pupil dilation and less affected by the eyelids and the eyelashes unless the iris is partly occluded, since it is close to the pupil. From the empirical study, it is found that the zigzag collarette region is generally concentric with the pupil, and the radius of this area is restricted to a certain range. The zigzag collarette area is detected using the previously obtained center and radius values of the pupil, as shown in Fig. 3 for both the ICE and WVU datasets.
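The sketch below illustrates the two ideas of Sects. 2.1 and 2.2 in their simplest form: a dark-threshold estimate of the pupil center and radius (the Canny/Hough refinement is omitted), followed by isolation of a collarette band concentric with the pupil. The intensity threshold and the radius offset are illustrative, not the values used in the paper.

import numpy as np

def estimate_pupil(gray, thresh=60):
    """Rough pupil estimate: threshold the dark region and take its centroid and
    equivalent radius (the Canny + circular Hough refinement is omitted)."""
    ys, xs = np.nonzero(gray < thresh)
    if len(ys) == 0:
        return None
    cy, cx = ys.mean(), xs.mean()
    r = np.sqrt(len(ys) / np.pi)            # radius of a disc with the same area
    return cy, cx, r

def collarette_mask(shape, cy, cx, rp, offset=25):
    """Collarette area taken as concentric with the pupil, between rp and rp + offset
    (the offset in pixels is an assumption of this sketch)."""
    yy, xx = np.ogrid[:shape[0], :shape[1]]
    d2 = (yy - cy) ** 2 + (xx - cx) ** 2
    return (d2 >= rp ** 2) & (d2 <= (rp + offset) ** 2)

eye = np.full((120, 160), 150, dtype=np.uint8)
yy, xx = np.ogrid[:120, :160]
eye[(yy - 60) ** 2 + (xx - 80) ** 2 < 20 ** 2] = 20     # synthetic dark pupil
cy, cx, rp = estimate_pupil(eye)
mask = collarette_mask(eye.shape, cy, cx, rp)
print(round(rp, 1), mask.sum())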
Fig. 2. (a) Circle around the pupil indicates the zigzag collarette region (b) zigzag collarette area is occluded by eyelashes and (c) upper eyelid occludes the collarette region
2.3 Eyelids and Eyelashes Detection Eyelids can be approximated as parabolic curves. Therefore, we use parabolic curves to detect the upper and lower eyelids. Generally, the eyelids occur in the upper and lower portions around the pupil center; thus, we restrict our search only to those areas. The parametric equation of a parabola is applied to construct parabolas of different sizes and shapes. An abrupt change in the summation of the gray scale values over the shape of a parabola is estimated for various shapes and sizes of parabolas. This results in the detection of the upper and lower eyelid boundaries in the eye image. Figure 3 (c, g) shows the detected eyelids for the ICE and WVU datasets. Separable eyelashes are detected using 1D Gabor filters, since a low output value is produced by the convolution of a separable eyelash with the Gaussian smoothing function. Thus, if a resultant point is smaller than a threshold, it is noted that this point belongs to an eyelash. Multiple eyelashes are detected using the variance of intensity: if the values in a small window are lower than a threshold, the centre of the window is considered as a point in an eyelash, as shown in Fig. 3.
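As a rough illustration of the parabola search, the sketch below sums gray values along parabolas of varying curvature and vertex height and reports the parabola where the summed intensity changes most abruptly between neighbouring vertex positions. The parameter ranges and the exact "abrupt change" criterion are assumptions of this sketch, not values from the paper.

import numpy as np

def parabola_sum(gray, vertex_y, a):
    """Sum of gray values along y = vertex_y + a * (x - cx)^2 across the image."""
    h, w = gray.shape
    xs = np.arange(w)
    ys = np.clip(np.round(vertex_y + a * (xs - w / 2.0) ** 2).astype(int), 0, h - 1)
    return gray[ys, xs].sum()

def detect_eyelid(gray, a_values=(0.002, 0.004, 0.008), step=2):
    """Scan vertex heights for each curvature and report the parabola where the
    summed intensity changes most abruptly between neighbouring vertex rows."""
    best, best_jump = None, -1.0
    for a in a_values:
        sums = np.array([parabola_sum(gray, v, a) for v in range(0, gray.shape[0], step)])
        jumps = np.abs(np.diff(sums))
        if jumps.size and jumps.max() > best_jump:
            best_jump = jumps.max()
            best = (int(np.argmax(jumps)) * step, a)
    return best, best_jump

eye = np.full((60, 100), 200, dtype=np.uint8)
eye[25:, :] = 90                                   # darker area below the "eyelid"
print(detect_eyelid(eye))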
Fig. 3. Image preprocessing: (a) and (e) are the original iris images from the ICE and WVU datasets respectively, (b), (f) show the localized pupil, iris and collarette boundary, (c), (g) reveal the eyelids detection and (d), (h) show the segmented images after the eyelids, eyelashes and reflections detection for the ICE and WVU datasets respectively
2.4 Normalization and Enhancement We use the rubber sheet model [4] to normalize or unwrap the isolated zigzag collarette area. The normalization approach produces a 2D array, with the horizontal dimension of angular resolution and the vertical dimension of radial resolution, from the circular-shaped zigzag collarette area. Fig. 4 shows the unwrapping procedure. Fig. 5 I (a, b) and 5 II (a, b) show the normalized images after the isolation of the zigzag collarette area from the iris images of the ICE and WVU datasets respectively. Since the normalized iris image has relatively low contrast and may have non-uniform intensity values due to the position of the light sources, a local intensity based histogram equalization technique is applied to enhance the contrast of the normalized iris image and improve the subsequent recognition accuracy. Fig. 5 I (c, d) and 5 II (c, d) also show the effect of enhancement on the normalized iris images for the two datasets.
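A minimal sketch of the rubber-sheet unwrapping, assuming concentric pupil and collarette circles and nearest-neighbour sampling; the paper's handling of non-concentric boundaries and of the noise mask is omitted, and the resolutions below are illustrative.

import numpy as np

def rubber_sheet(gray, cy, cx, r_in, r_out, radial_res=20, angular_res=120):
    """Unwrap the annulus between r_in and r_out into a (radial x angular) array."""
    out = np.zeros((radial_res, angular_res), dtype=gray.dtype)
    thetas = np.linspace(0, 2 * np.pi, angular_res, endpoint=False)
    radii = np.linspace(r_in, r_out, radial_res)
    for i, r in enumerate(radii):
        ys = np.clip(np.round(cy + r * np.sin(thetas)).astype(int), 0, gray.shape[0] - 1)
        xs = np.clip(np.round(cx + r * np.cos(thetas)).astype(int), 0, gray.shape[1] - 1)
        out[i] = gray[ys, xs]
    return out

img = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
strip = rubber_sheet(img, cy=60, cx=80, r_in=20, r_out=45)
print(strip.shape)   # (20, 120): radial x angular resolution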
Fig. 4. (a) Unwrapping an iris image (b) noise area is marked for the corresponding unwrapped iris image
Fig. 5. (I) (a, b) shows the unwrapped collarette region before the enhancement for ICE dataset and (c, d) reveals the corresponding collarette region after enhancement and (II) (a, b) shows the unwrapped collarette region before the enhancement from WVU dataset and (c, d) reveals the corresponding collarette region after enhancement
3 Feature Extraction and Encoding In order to extract the discriminating features from the normalized zigzag collarette region, the normalized pattern is convolved with 1-D log-Gabor wavelets. First, the 2-D normalized pattern is decomposed into a number of 1-D signals, and then these 1-D signals are convolved with the 1-D log-Gabor wavelets. The phase-quantization approach proposed in [5] is applied to four levels on the outcomes of the filtering, with each filter producing two bits of data for each phasor.
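The following sketch shows one plausible realization of this encoding: a 1-D log-Gabor filter applied in the frequency domain to each row of the normalized pattern, with the phase of the complex response quantized to two bits per sample. The center frequency and bandwidth ratio are illustrative, not the values used in the paper.

import numpy as np

def log_gabor_1d(n, f0=0.1, sigma_ratio=0.55):
    """Frequency response of a 1-D log-Gabor filter (zero response at DC)."""
    f = np.fft.fftfreq(n)
    g = np.zeros(n)
    pos = f > 0
    g[pos] = np.exp(-(np.log(f[pos] / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    return g

def encode_rows(pattern, f0=0.1):
    """Filter each row of the normalized collarette pattern and quantize the
    phase of the complex response into two bits per sample (four quadrants)."""
    rows, cols = pattern.shape
    g = log_gabor_1d(cols, f0)
    bits = np.zeros((rows, cols, 2), dtype=np.uint8)
    for r in range(rows):
        response = np.fft.ifft(np.fft.fft(pattern[r].astype(float)) * g)
        bits[r, :, 0] = (response.real >= 0).astype(np.uint8)
        bits[r, :, 1] = (response.imag >= 0).astype(np.uint8)
    return bits.reshape(rows, -1)            # one binary feature row per 1-D signal

pattern = np.random.randint(0, 256, (20, 120), dtype=np.uint8)
code = encode_rows(pattern)
print(code.shape, code.dtype)                # (20, 240) binary features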
4 Feature Selection Using MOGA We propose MOGA to select the optimum set of features [6]. In this paper, we prefer to use the wrapper approach since an optimal selection of features is dependent on the inductive and representational biases of the learning algorithm which is used to build the classifier [6]. The goal of feature subset selection is to use fewer features to achieve the better performance. Therefore, the fitness evaluation contains two terms: accuracy from the validation data and number of features used. Only the features in the parameter subset encoded by an individual are used in order to train the SVM classifier. The performance of the SVM classifier is estimated using a validation data set and used to guide the genetic algorithm. Each feature subset contains a certain number of parameters. Between accuracy and feature subset size, accuracy is our major concern. Combining these two terms, the fitness function is given by

Fitness = (10^4 × Accuracy + (10^2 × NOZ/NOF)) / 10^4    (1)
where Accuracy denotes the accuracy rate that an individual achieves, NOZ is the number of zeros in the chromosome and NOF represents the number of features used for optimization. The accuracy ranges roughly from 0.7 to 1 (i.e., the first term on the right hand side of (1) assumes values in the interval [7000, 10000]). NOZ and NOF range from 0 to l, where l is the length of the chromosome code. Therefore, the second term on the right hand side of (1) takes values in the interval [0, 60000]. Overall, a higher accuracy implies a higher fitness. Also, if fewer features are
used it would imply a greater number of zeros, and as a result, the fitness would increase. It can be noticed that the individuals with a higher degree of accuracy would outweigh the individuals with a lower accuracy, no matter how many features they contain.
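A direct transcription of the fitness of Eq. (1) is sketched below. The accuracy is assumed to come from an external evaluation of the tuned SVM on the validation set, and the reading of the second term as 10^2 × NOZ/NOF follows the reconstructed formula above; function and variable names are hypothetical.

def fitness(accuracy, chromosome):
    """Fitness of Eq. (1): (10^4 * Accuracy + 10^2 * NOZ / NOF) / 10^4,
    where NOZ is the number of zeros in the chromosome (features switched off)
    and NOF is the number of features being optimized."""
    nof = len(chromosome)
    noz = sum(1 for gene in chromosome if gene == 0)
    return (1e4 * accuracy + 1e2 * noz / nof) / 1e4

# an individual using 420 of 600 features with 95% validation accuracy
chromosome = [1] * 420 + [0] * 180
print(round(fitness(0.95, chromosome), 4))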
5 Asymmetrical Support Vector Machines The support vector machine (SVM) is a widely used approach for classification due to its attractive features and promising performance. A detailed description of the SVM can be found in [14]. In this work, we apply the SVM for iris pattern classification. We use four basic kernel functions for experimentation and the most favorable one is selected for prediction purposes. The adjustment of the penalty parameter for the error term, C, and of the kernel parameters is important to improve the generalization performance. A careful selection of a training subset and a validation set with a small number of classes is required to avoid training the SVM with all classes and to evaluate its performance on the validation set, due to the high computational cost when the number of classes is large. A modified approach reported in [7] is applied here to reduce the cost of the selection procedure as well as to tune the parameters of the SVM. Here, a Fisher's least square linear classifier is used with a low computation cost for each class. After a careful selection of C and the kernel parameters, the whole training set with all classes is trained. The main drawback of the traditional SVM is that it does not differentiate between the False Accept and the False Reject, which play a very important role in meeting different security requirements. Another issue is to control the unbalanced data of a specific class with respect to other classes. Therefore, the traditional SVM is modified to an asymmetrical SVM [14] to satisfy several security demands for specific applications and also to handle the unbalanced data, as follows:
1. In order to change the traditional SVM to an asymmetrical SVM, a constant g, called the asymmetrical parameter, is used to adjust the decision hyperplane H. Thus the hyperplane becomes

w · x + b + g = 0    (2)

2. Therefore, the decision function changes to

f(x) = ∑_{i=1}^{N} α_i y_i K(x, x_i) + b + g    (3)
When g > 0, the classification hyperplane is closer to the positive samples. By changing the value of g, the value of the False Accept can be reduced, and the statistically under-represented data of a class with respect to the other classes can be controlled by varying the value of the penalty parameter C.
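The sketch below shows the shifted decision function of Eq. (3) with an RBF kernel, assuming that the support vectors, the products α_i y_i and the bias b come from an already trained SVM; the asymmetrical parameter g simply offsets the decision value, which lets the operating point trade False Accepts against False Rejects. The toy model and names are illustrative only.

import numpy as np

def rbf_kernel(x, xi, gamma=0.6):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def asymmetric_decision(x, support_vectors, alpha_y, b, g=0.0, gamma=0.6):
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + b + g  (Eq. 3).
    g offsets the decision boundary of the standard SVM."""
    f = sum(ay * rbf_kernel(x, sv, gamma) for sv, ay in zip(support_vectors, alpha_y))
    return f + b + g

# toy trained model: two support vectors with opposite labels
svs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alpha_y = [1.0, -1.0]
x = np.array([0.2, 0.1])
for g in (0.0, 0.3):
    score = asymmetric_decision(x, svs, alpha_y, b=0.0, g=g)
    print(g, round(score, 3), "accept" if score > 0 else "reject")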
6 Experimental Results We conduct the experimentation on two non-ideal iris datasets, namely the ICE (Iris Challenge Evaluation) dataset created by the University of Notre Dame, USA [17] and the WVU (West Virginia University) dataset [19]. The ICE dataset is divided into two categories: the “gallery” images, which are considered as good quality images, and
the “probe” images that represent iris images of varying quality levels. The iris images are intensity images with a resolution of 640 × 480 and the average diameter of an iris is 228 pixels. The ICE database consists of left and right iris images for experimentation. We consider only the left iris images in our experiments. There are 1528 left iris images corresponding to the 120 subjects in our experiments. We also evaluated the performance of the proposed iris recognition scheme on the WVU dataset. The WVU iris dataset has a total of 1852 iris images from 380 different persons. The number of iris images for each person ranges from 3 to 6 in this database. We evaluate the success rate for the proposed method on the ICE and WVU datasets by detecting the pupil boundary and the zigzag collarette region, and the obtained success rates are 98.70% and 97.20% for the ICE and WVU datasets respectively. From the experimental results, it is found that a reasonable recognition accuracy is achieved when the zigzag collarette area is isolated by increasing the previously detected radius value of the pupil up to a certain number of pixels. In order to reduce the computational cost and speed up the classification process, the Fisher least square linear classifier is used as a low-cost pre-classifier so that a reasonable cumulative recognition accuracy can be achieved by including the true class label in a small number of selected candidates. For the ICE dataset, we apply the Fisher least square linear classifier to choose ten candidates, and the cumulative recognition accuracy at rank 10 was 98.84%. In this paper, the selected cardinal number of the set found from the experimentation is 32, using the tuning algorithm for SVM parameter selection. As a result, the sizes of the training and validation sets for selecting the optimal values of the parameters C and γ are 112 (=32×5×70%) and 48 (=32×5×30%), respectively. The parameter γ is set at 0.6 and C at 20, for which the highest accuracy on the validation set was achieved with the RBF kernel for the ICE iris database. In this paper, we only consider those classes of the ICE database which have at least 5 probe images in order to select the optimal parameter values. The SVM parameters are also tuned for the WVU iris database. Here, twenty candidates are selected using the Fisher least square linear classifier, and the cumulative recognition accuracy found at rank 20 is 98.32%. The cardinal number of the set obtained for this dataset is 60. Therefore, the sizes of the training and validation sets for selecting the optimal values of the parameters C and γ are 168 (=60×4×70%) and 72 (=60×4×30%), respectively. Finally, C is set at 100 and the optimal value of γ remains the same as for the ICE database. For the WVU iris dataset, only those iris classes which have at least four images are considered. Table 1 shows the performance of the different kernel functions for the purpose of tuning the SVM parameters.

Table 1. Efficiency of the various kernel functions

Kernel Type    Classification Accuracy (%) for ICE Dataset    Classification Accuracy (%) for WVU Dataset
Polynomial     91.1                                           89.2
RBF            94.3                                           92.4
Sigmoid        91.2                                           90.6
Since the highest classification accuracy is obtained by RBF kernel, this kernel is used in our system for iris pattern classification. In order to evaluate the matching accuracy, the zigzag collarette area is used here. For each person of the ICE database, one iris sample is
used to build the template, and the remaining irises from the probe image set are applied for testing. For the WVU iris database, we use two iris samples for training and the remaining samples of each class for testing. In order to show the effectiveness of the SVM as a classifier, the extracted features are also provided as input to feed-forward neural networks using the backpropagation (FFBP) rule and the Levenberg-Marquardt rule (FFLM), and to the k-nearest neighbor (k-NN) method. The classification and the recognition accuracy for different numbers of classes among FFBP, FFLM, SVM, and k-NN (k=3) are shown in Fig. 6 for both datasets. After conducting several experiments, the optimal values of the parameters for FFBP and FFLM are set as follows:

Table 2. The Selected Values of the Parameters of FFBP and FFLM for ICE and WVU Datasets

                                  ICE Dataset          WVU Dataset
Parameters                        FFBP     FFLM        FFBP     FFLM
No. of nodes in hidden layers     170      310         150      270
Iteration epoch                   120      180         120      200
Fig. 6. Comparison of classification accuracy among SVM, FFBP, FFLM and k-NN for 600-bit feature sequence on (a) ICE and (b) WVU datasets
From Fig. 6, it is clear that the performance of the SVM as an iris classifier is better than the other classical methods, though the classification accuracy decreases as the number of classes increases. From the experimentation, we observe that the SVM still performs reasonably well in the case of a limited sample size. Fig. 7 shows the comparison of the feature dimension versus recognition accuracy among the Hamming Distance (HD), FFBP, FFLM, k-NN and SVM. In this case, only the RBF kernel is considered due to its reasonable classification accuracy for the SVM, as mentioned earlier. From Fig. 7, we can see that with increasing dimensionality of the feature sequence, the recognition rate also increases rapidly for all similarity measures. However, when the dimensionality of the feature sequence is up to 600 or higher, the recognition rate starts to level off at an encouraging
rate of about 97.31% and 95.13% for the ICE and WVU datasets respectively. In order to select the optimum features for the improvement of the matching accuracy, MOGA involves running the genetic process for several generations. From the experimentation, it is found that the highest matching accuracy of 97.70% is achieved at generation 190 on the ICE dataset, and 95.6% accuracy is found at generation 210 when the WVU dataset is considered for experimentation. The evolution of the MOGA can be observed in Fig. 8. It is noticeable from this figure that the MOGA shows a good convergence, since the average fitness of the population approaches that of the maximum individual along the generations for the two datasets. We conducted several experiments and the arguments of the MOGA are set as shown in Table 3. The recognition accuracy is compared between the proposed method using the zigzag collarette area and the traditional approach, denoted as 'previous approach', where the complete iris information between the pupil and sclera boundaries is considered for recognition. Fig. 9 shows the efficiency of the current approach with and without using MOGA in comparison with the previous method. It is noticeable from this figure that the proposed method performs better, with an accuracy of 97.70% on the ICE and 95.60% on the WVU datasets. Therefore, we see that the performance of our approach is increased when we use the MOGA for the feature selection. The proposed approach leads to a reasonable recognition accuracy in the cases where eyelashes and eyelids occlude the iris badly and the pupil becomes only partly visible. The adjustment of the asymmetrical parameter g can be made to satisfy several security issues in iris recognition applications and to control the unbalanced data of a specific class with respect to other classes. The performance of a verification system is evaluated using the ROC curve (see Fig. 10), which demonstrates how the Genuine Acceptance Rate (GAR) changes with a variation in FAR for the previous and proposed methods. From the experimentation, we find that the proposed approach reduces the EERs from 0.72% to 0.43% for the ICE dataset and from 1.60% to 0.90% for the WVU dataset, which represents a good improvement of our proposed method. Fig. 11 shows that the proper selection of the asymmetrical parameter g leads to a low recognition error both for the proposed and the previous methods on the ICE and WVU datasets.
Fig. 7. Comparison of recognition accuracy of SVM with FFBP, FFLM, k-NN and HD for different feature dimension on (a) ICE and (b) WVU datasets
Fig. 8. Variation of fitness values with generation on (a) ICE and (b) WVU datasets

Table 3. The Selected Values of the Arguments of MOGA for ICE and WVU Datasets

Parameters                    ICE Dataset                                          WVU Dataset
Population Size               120 (the scale of iris sample)                       380 (the scale of iris sample)
Length of chromosome code     600 (selected dimensionality of feature sequence)    600 (selected dimensionality of feature sequence)
Crossover probability         0.70                                                 0.63
Mutation probability          0.005                                                0.005
Number of generations         220                                                  300
Fig. 9. Comparison of recognition accuracy between the previous and proposed approaches on (a) ICE and (b) WVU datasets
Fig. 10. ROC curve shows the comparison between GAR and FAR for the previous and proposed approaches for (a) ICE and (b) WVU datasets

[Figure 11 (plot): recognition error (%), roughly between 2% and 5%, versus the asymmetrical parameter value g from 0 to 0.9, for the previous and proposed approaches on the WVU and ICE datasets.]
Fig. 11. Comparison of recognition error between the proposed method and previous method with different values of asymmetrical parameter, g
7 Conclusions In this paper, an iris recognition method is proposed using an efficient iris segmentation approach based on the zigzag collarette area with the incorporation of eyelashes and eyelids detection techniques. MOGA is applied for feature subset selection. In order to increase the feasibility of the SVM in biometric applications, the SVM is modified into an asymmetrical SVM. This asymmetrical SVM approach can be applied to a wide range of security-related application fields. Although it seems that the proposed method with the zigzag collarette region has less information available compared to the previous method using the entire iris information, the proposed method can be used effectively for personal identification since the zigzag collarette
area contains enough discriminating features, and it is less affected by the eyelids and the eyelashes. The experimental results of the proposed method exhibit an encouraging performance in both the accuracy and the speed for the non-ideal iris data sets.
Acknowledgments We have used by kind permission the database “Iris Challenge Evaluation” (ICE) [17] owned by the Department of Computer Science, the University of Notre Dame, USA that was provided to our institution under a software license agreement. We have also used by kind permission the WVU data set owned by West Virginia University, USA, under a software license agreement [19].
References 1. Boles, W., Boashash, B.: A human identification technique using images of the iris and wavelet transform. IEEE Trans. Signal Processing 46, 1185–1188 (1998) 2. Chen, Y., Dass, S.C., Jain, A.K.: Localized Iris Image Quality Using 2-D Wavelets. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, pp. 373–381. Springer, Heidelberg (2005) 3. Daugman, J.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Machine Intell. 15, 1148–1161 (1993) 4. Daugman, J.: Statistical richness of visual phase information: update on recognizing persons by iris patterns. Int. J. Comput. Vis. 45, 25–38 (2001) 5. Daugman, J.: Demodulation by complex-valued wavelets for stochastic pattern recognition. Int. J. Wavelets, Multi-Res. and Info. Processing 1, 1–17 (2003) 6. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. J. Wiley Ltd., West Sussex (2004) 7. Dong, J., Krzyzak, A., Suen, C.Y.: An improved handwritten Chinese character recognition system using support vector machines. Pattern Recog. Lett. 26, 1849–1856 (2005) 8. He, X., Shi, P.: An efficient iris segmentation method for recognition. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3687, pp. 120–126. Springer, Heidelberg (2005) 9. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personal identification based on iris texture analysis. IEEE Trans. Pattern Anal. Machine Intell. 25, 1519–1533 (2003) 10. Ma, L., Tan, T., Wang, Y., Zhang, D.: Efficient iris recognition by characterizing key local variations. IEEE Trans. Image Processing 13, 739–750 (2004) 11. Seiberg, D.: Iris recognition at airports uses eye-catching technology. CNN.com (2000) 12. Son, B., Won, H., Kee, G., Lee, Y.: Discriminant iris feature and support vector machines for iris recognition. In: Int. Conf. on Image Processing, vol. 2, pp. 865–868 (2004) 13. Sung, H., Lim, J., Park, J., Lee, Y.: Iris Recognition Using Collarette Boundary Localization. In: Int. Conf. on Pattern Recog., vol. 4, pp. 857–860 (2004) 14. Vapnik, V.N.: Statistical Learning Theory. J. Wiley Ltd., New York (1998) 15. Wildes, R., Asmuth, J., Green, G., Hsu, S., Kolczynski, R., Matey, J., McBride, S.: A machine-vision system for iris recognition. Machine Vision Applications 9, 1–8 (1996) 16. http://www.csie.ntu.edu.tw/c̃jlin/libsvm 17. http://iris.nist.gov/ICE/ 18. United Nations Refugee Agency (UNHCR), http//www.unhrc.org/cgi-bin/texis/ country?iso=afg 19. Iris Dataset obtained from West Virginia University (WVU), http://www.wvu.edu/
A Modified Fuzzy C-Means Algorithm for MR Brain Image Segmentation

László Szilágyi1,2, Sándor M. Szilágyi2, and Zoltán Benyó1

1 Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Budapest, Hungary
2 Sapientia - Hungarian Science University of Transylvania, Faculty of Technical and Human Science, Târgu-Mureş, Romania
[email protected]

Abstract. Automated brain MR image segmentation is a challenging pattern recognition problem that has received significant attention lately. The most popular solutions involve fuzzy c-means (FCM) or similar clustering mechanisms. Several improvements have been made to the standard FCM algorithm, in order to reduce its sensitivity to Gaussian, impulse, and intensity non-uniformity noises. This paper presents a modified FCM-based method that targets accurate and fast segmentation in case of mixed noises. The proposed method extracts a scalar feature value from the neighborhood of each pixel, using a context dependent filtering technique that deals with both spatial and gray level distances. These features are clustered afterwards by the histogram-based approach of the enhanced FCM algorithm. Results were evaluated based on synthetic phantoms and real MR images. Test experiments revealed that the proposed method provides better results compared to other reported FCM-based techniques. The achieved segmentation and the obtained fuzzy membership values represent excellent support for deformable contour model based cortical surface reconstruction methods.

Keywords: fuzzy c-means algorithm, image segmentation, noise elimination, feature extraction, context dependent filter, magnetic resonance imaging.
1 Introduction
The segmentation of an image represents the separation of its pixels into non-overlapping, consistent regions, which appear to be homogeneous with respect to some criteria concerning gray level intensity and/or texture. The fuzzy c-means (FCM) algorithm is one of the most widely used methods for data clustering, and probably also for brain image segmentation [2,6]. However, in this latter case, standard FCM is not efficient by itself, as it fails to exploit a significant property of images: neighboring pixels are strongly correlated. Ignoring this specificity leads to strong noise sensitivity and several other imaging artifacts.
Recently, several solutions were given to improve the performance of segmentation. Most of them involve using local spatial information: a pixel's own gray level is not the only information that contributes to its assignment to the chosen cluster; its neighbors also influence the label it receives. Pham and Prince [9] modified the FCM objective function by including a spatial penalty, enabling the iterative algorithm to estimate spatially smooth membership functions. Ahmed et al. [1] introduced a neighborhood averaging additive term into the objective function of FCM, calling the algorithm bias corrected FCM (BCFCM). This approach has its own merits in bias field estimation, but it gives the algorithm a serious computational load by computing the neighborhood term in every iteration step. Moreover, the zero gradient condition at the estimation of the bias term produces a significant amount of misclassifications [12]. Chuang et al. [5] proposed averaging the fuzzy membership function values over a predefined neighborhood and reassigning them according to a tradeoff between the original and averaged membership values. This approach can produce accurate clustering if the tradeoff is well adjusted empirically, but it is enormously time consuming. Aiming at reducing the execution time, Szilágyi et al. [13], and Chen and Zhang [4] proposed to evaluate the neighborhoods of each pixel as a pre-filtering step, and perform FCM afterwards. The averaging and median filters, followed by FCM clustering, are referred to as FCM S1 and FCM S2, respectively [4]. Once the neighbors have been evaluated, and thus a scalar feature value has been extracted for each pixel, FCM can be performed on the basis of the gray level histogram, clustering the gray levels instead of the pixels, causing a significant reduction of the computational load, as the number of gray levels is generally smaller by orders of magnitude [13]. This latter quick approach, combined with an averaging pre-filter, is referred to as enhanced FCM (EnFCM) [3,13]. BCFCM, FCM S1, and EnFCM all suffer from the presence of a parameter denoted by α, which controls the strength of the averaging effect, balances between the original and averaged image, and whose ideal value unfortunately can be found only experimentally. Another disadvantage emerges from the fact that averaging and median filters, besides eliminating salt-and-pepper and Gaussian noises, also blur relevant edges. Due to these shortcomings, Cai et al. [3] introduced a new local similarity measure, combining spatial and gray level distances, and applied it as an alternative pre-filtering to EnFCM, calling this approach fast generalized FCM (FGFCM). This approach is able to extract local information causing less blur than the averaging or median filters, but failed to eliminate the experimentally adjusted parameter, denoted here by λg, which controls the effect of gray level differences. Another remarkable approach, proposed by Pham [10], modifies the objective function of FCM by means of an edge field, in order to exclude the filters that produce edge blurring. This method is also significantly time consuming, because the estimation of the edge field has no analytical solution. In this paper we propose a novel method for MR brain image segmentation that simultaneously targets high accuracy in image segmentation, low noise
sensitivity, and high processing speed. The remainder of the paper is organized as follows. Section 2 gives a detailed presentation of background works, including standard and spatially constrained FCM. Section 3 deals with the proposed context sensitive filtering and segmentation method; some considerations regarding partial volume artifacts are also presented there. The performance evaluation via experimental comparison is presented in Section 4, while Section 5 gives conclusions and several topics for future work.
2 Background
The fuzzy c-means algorithm has successful applications in a wide variety of clustering problems. The traditional FCM partitions a set of object data into a number of c clusters based on the minimization of a quadratic objective function. When applied to segment gray level images, FCM clusters the scalar intensity level of all pixels (x_k, k = 1 . . . n). The objective function to be minimized is:

J_{FCM} = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} (x_k - v_i)^2 ,    (1)
where m > 1 is the fuzzification parameter, v_i represents the prototype value of cluster i, and u_{ik} ∈ [0, 1] is the fuzzy membership function showing the degree to which pixel k belongs to cluster i. According to the definition of fuzzy sets, for any pixel k, we have \sum_{i=1}^{c} u_{ik} = 1. The minimization of the objective function is reached by alternately applying the optimization of J_{FCM} over {u_{ik}} with v_i fixed, i = 1 . . . c, and the optimization of J_{FCM} over {v_i} with u_{ik} fixed, i = 1 . . . c, k = 1 . . . n [6]. During each cycle, the optimal values are computed from the zero gradient conditions, and obtained as follows:

u_{ik} = \frac{(v_i - x_k)^{-2/(m-1)}}{\sum_{j=1}^{c} (v_j - x_k)^{-2/(m-1)}}    ∀ i = 1 . . . c, ∀ k = 1 . . . n ,    (2)

v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}    ∀ i = 1 . . . c .    (3)
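A minimal numerical sketch of this alternating scheme is given below; it is our own illustration (1-D array of gray values, hypothetical helper name), not code from the paper.

```python
import numpy as np

def fcm_step(x, v, m=2.0):
    """One alternating-optimization cycle of standard FCM, following Eqs. (2) and (3).
    x: 1-D array of pixel intensities, v: 1-D array of the c cluster prototypes."""
    d = np.abs(x[None, :] - v[:, None]) + 1e-10      # (c, n) distances |x_k - v_i|
    u = d ** (-2.0 / (m - 1.0))
    u /= u.sum(axis=0, keepdims=True)                # memberships of each pixel sum to 1, Eq. (2)
    um = u ** m
    v_new = (um * x[None, :]).sum(axis=1) / um.sum(axis=1)   # prototype update, Eq. (3)
    return u, v_new
```

Iterating fcm_step until the prototypes stop changing reproduces the alternating optimization described above.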
FCM has invaluable merits in producing optimal clusters, but in image processing it has severe deficiencies, such as failing to take into consideration the position of pixels, which is also relevant information in image segmentation. This drawback led to the introduction of spatial constraints into fuzzy clustering. Ahmed et al. [1] proposed a modification to the objective function of FCM, in order to allow the labeling of a pixel to be influenced by its immediate neighbors. This neighboring effect acts like a regularizer that biases the solution towards a piecewise homogeneous labeling [1]. The objective function of BCFCM is:

J_{BCFCM} = \sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik}^{m} (x_k - v_i)^2 + \frac{\alpha}{n_k} u_{ik}^{m} \sum_{r \in N_k} (x_r - v_i)^2 \right] ,    (4)
where x_r represents the gray level of pixels situated in the neighborhood N_k of pixel k, and n_k is the cardinality of N_k. The parameter α controls the intensity of the neighboring effect, and unfortunately its optimal value can be found only experimentally. Having the neighborhood averaging terms computed in every computation cycle, this iterative algorithm performs extremely slowly. Chen and Zhang [4] reduced the time complexity of BCFCM by computing the neighborhood averaging term in advance, or by replacing it with a median filtered term, calling these algorithms FCM S1 and FCM S2, respectively. These algorithms outperformed BCFCM, at least from the point of view of time complexity. Szilágyi et al. [13] proposed a regrouping of the processing steps of BCFCM. In their approach, an averaging filter is applied first, similarly to the neighboring effect of Ahmed et al. [1]:

\xi_k = \frac{1}{1+\alpha} \left( x_k + \frac{\alpha}{n_k} \sum_{r \in N_k} x_r \right) ,    (5)
followed by an accelerated version of FCM clustering. The acceleration is based on the idea that the number of gray levels is generally much smaller than the number of pixels. To this end, the histogram of the filtered image is computed, and not the pixels but the gray levels are clustered [13], by minimizing the following objective function:

J_{EnFCM} = \sum_{i=1}^{c} \sum_{l=1}^{q} h_l u_{il}^{m} (l - v_i)^2 ,    (6)
where h_l denotes the number of pixels with gray level equaling l, and q is the number of gray levels. The optimization formulas in this case will be:

u_{il} = \frac{(v_i - l)^{-2/(m-1)}}{\sum_{j=1}^{c} (v_j - l)^{-2/(m-1)}}    ∀ i = 1 . . . c, ∀ l = 1 . . . q ,    (7)

v_i = \frac{\sum_{l=1}^{q} h_l u_{il}^{m} l}{\sum_{l=1}^{q} h_l u_{il}^{m}}    ∀ i = 1 . . . c .    (8)
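A compact sketch of this histogram-based update, written as an illustrative reading of Eqs. (7) and (8) rather than the authors' code, could look as follows.

```python
import numpy as np

def enfcm_step(h, v, m=2.0):
    """One update cycle of histogram-based (EnFCM-style) clustering, Eqs. (7) and (8).
    h: histogram counts h_l for gray levels l = 0..q-1, v: the c cluster prototypes."""
    l = np.arange(h.size, dtype=np.float64)
    d = np.abs(l[None, :] - v[:, None]) + 1e-10      # (c, q) distances |l - v_i|
    u = d ** (-2.0 / (m - 1.0))
    u /= u.sum(axis=0, keepdims=True)                # gray-level memberships, Eq. (7)
    w = h[None, :] * u ** m                          # histogram-weighted memberships
    v_new = (w * l[None, :]).sum(axis=1) / w.sum(axis=1)   # prototype update, Eq. (8)
    return u, v_new
```

Because the loop runs over q gray levels rather than n pixels, each iteration is orders of magnitude cheaper than pixel-wise FCM, which is exactly the speed-up exploited by EnFCM.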
EnFCM drastically reduces the computational complexity of BCFCM and its relatives [3,13]. If the averaging pre-filter is replaced by a median filter, the segmentation accuracy also improves significantly [3,14]. Based on the disadvantages of the aforementioned methods, but inspired by their merits, Cai et al. [3] introduced a local (spatial and gray) similarity measure that they used to compute weighting coefficients for an averaging pre-filter. The filtered image is then subject to EnFCM-like histogram-based fast clustering. The similarity between pixels k and r is given by the following formula:

S_{kr} = \begin{cases} s_{kr}^{(s)} \cdot s_{kr}^{(g)} & \text{if } r \in N_k \setminus \{k\} \\ 0 & \text{if } r = k \end{cases} ,    (9)

where s_{kr}^{(s)} and s_{kr}^{(g)} are the spatial and gray level components, respectively. The spatial term s_{kr}^{(s)} is defined as the L_\infty-norm of the distance between pixels k and r.
The gray level term is computed as s_{kr}^{(g)} = \exp[-(x_k - x_r)^2 / (\lambda_g \sigma_k^2)], where σ_k denotes the average quadratic gray level distance between pixel k and its neighbors. Segmentation results are reported to be more accurate than in any previously presented case [3].
3 Methods
Probably the most relevant problem of all techniques presented above, BCFCM, EnFCM, FCM S1, and FGFCM, is the fact that they depend on at least one parameter whose value has to be adjusted experimentally. The zero value in the second row of Eq. (9) implies that in FGFCM the filtered gray level of any pixel is computed as a weighted average of its neighbor pixel intensities alone. Discarding the original intensity of the current pixel, even if it is a reliable, noise-free value, unavoidably introduces some extra blur into the filtered image. Accurate segmentation requires this kind of effect to be minimized [10].

3.1 Context Dependent Filtering
In this paper we propose a set of modifications to EnFCM/FGFCM, in order to improve the accuracy of segmentation without renouncing the speed of histogram-based clustering. In other words, we need to define a complex filter that can extract relevant feature information from the image while applied as a pre-filtering step, so that the filtered image can be clustered fast afterwards based on its histogram. The proposed method consists of the following steps:

A. As we are looking for the filtered value of pixel k, we define a small square or diamond-shape neighborhood N_k around it. Square windows of size 3 × 3 and 5 × 5 were used throughout this study, but other window sizes and shapes are also possible.

B. We search for the minimum, maximum, and median gray value within the neighborhood N_k, and denote them by min_k, max_k and med_k, respectively.

C. We replace the gray level of the maximum and minimum valued pixels with the median value (if there are several maxima or minima, we replace them all), unless they are situated in the middle pixel k. In this latter case, pixel k remains unchanged, but is labeled as an unreliable value.

D. Compute the average quadratic gray level difference of the pixels within the neighborhood N_k, using the formula

\sigma_k = \sqrt{\frac{\sum_{r \in N_k \setminus \{k\}} (x_r - x_k)^2}{n_k - 1}} .    (10)

E. The filter coefficients will be defined as:

C_{kr} = \begin{cases} c_{kr}^{(s)} \cdot c_{kr}^{(g)} & \text{if } r \in N_k \setminus \{k\} \\ 1 & \text{if } r = k \wedge x_k \notin \{\max_k, \min_k\} \\ 0 & \text{if } r = k \wedge x_k \in \{\max_k, \min_k\} \end{cases} .    (11)
The central pixel k will have coefficient 0 if its value was found unreliable; otherwise it has a unitary coefficient. All other neighbor pixels will have coefficients C_{kr} ∈ [0, 1], depending on their spatial distance and gray level difference from the central pixel. In case of both terms, higher distance values push the coefficients towards 0.

F. The spatial component c_{kr}^{(s)} is a negative exponential of the Euclidean distance between the two pixels k and r: c_{kr}^{(s)} = \exp(-L_2(k, r)). The gray level term is defined as follows:
c_{kr}^{(g)} = \begin{cases} \frac{1}{2} \left[ 1 + \cos\left( \pi \frac{x_r - x_k}{4\sigma_k} \right) \right] & \text{if } |x_r - x_k| \le 4\sigma_k \\ 0 & \text{if } |x_r - x_k| > 4\sigma_k . \end{cases}    (12)

The above function has a bell-like shape within the interval [−4σ_k, 4σ_k], and is constant zero outside the interval.

G. The extracted feature value for pixel k, representing its filtered intensity value, is obtained as a weighted average of its neighbors:

\xi_k = \frac{\sum_{r \in N_k} C_{kr} x_r}{\sum_{r \in N_k} C_{kr}} .    (13)

Algorithm. We can summarize the proposed method as follows:
1. Pre-filtering step: for each pixel k of the input image, compute the filtered gray level value ξ_k, using Eqs. (10), (11), (12), (13).
2. Compute the histogram of the pre-filtered image, obtain the values h_l, l = 1 . . . q.
3. Initialize v_i with valid gray level values, differing from each other.
4. Compute new u_{il} fuzzy membership values, using Eq. (7).
5. Compute new v_i prototype values for the clusters, using Eq. (8).
6. If there is relevant change in the v_i values, go back to step 4. This is tested by comparing any norm of the difference between the new and the old vector v with a preset small constant ε.

The algorithm converges quickly; however, the number of necessary iterations depends on ε and on the initial cluster prototype values.
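The following sketch illustrates steps A–G for a single neighborhood window. It is an illustrative reading of the method; the array layout, helper name and tie-breaking details are our own assumptions, not the authors' implementation.

```python
import numpy as np

def context_dependent_filter(win):
    """Filtered value xi_k (Eq. 13) for one square window N_k centred on pixel k,
    following steps A-G of the proposed pre-filter."""
    c = win.shape[0] // 2
    x_k = float(win[c, c])
    w = win.astype(np.float64).copy()
    med, mn, mx = np.median(w), w.min(), w.max()          # step B
    unreliable = x_k in (mn, mx)                          # step C: central pixel is an extreme
    neigh = np.ones_like(w, dtype=bool)
    neigh[c, c] = False
    w[neigh & ((w == mn) | (w == mx))] = med              # replace non-central extrema by the median
    n_k = w.size
    sigma = np.sqrt(((w[neigh] - x_k) ** 2).sum() / (n_k - 1))   # Eq. (10)
    yy, xx = np.mgrid[-c:c + 1, -c:c + 1]
    c_s = np.exp(-np.sqrt(xx ** 2 + yy ** 2))             # spatial term, step F
    arg = (w - x_k) / (4.0 * sigma + 1e-12)
    c_g = np.where(np.abs(w - x_k) <= 4.0 * sigma,
                   0.5 * (1.0 + np.cos(np.pi * arg)), 0.0)       # Eq. (12)
    C = c_s * c_g                                          # neighbour coefficients, Eq. (11)
    C[c, c] = 0.0 if unreliable else 1.0                   # central coefficient, Eq. (11)
    return float((C * w).sum() / (C.sum() + 1e-12))        # Eq. (13)
```

Sweeping this function over the image and feeding the histogram of the result to the EnFCM-style update sketched earlier reproduces steps 1–6 of the summarized algorithm.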
3.2 Handling Partial Volume Effect
Whatever resolution an MR scanner may have, the scanned images will contain pixels in which more than one tissue class is present. This phenomenon is referred to as partial volume effect (PVE). Although it is not guaranteed, it is reasonable to assume that within a given pixel, PVE only occurs over two classes [8]. Pixels involved in PVE are generally modeled using the mixel model [11], which states that the gray level intensity of pixel k is given by:

x_k = \alpha_k \cdot v_\mu + (1 - \alpha_k) \cdot v_\nu + \eta_k ,    (14)
where η_k represents the noise of pixel k, which will be ignored after context dependent filtering, while v_μ and v_ν are the centroids of the two involved classes, assuming v_μ ≤ x_k ≤ v_ν. Fuzzy membership values given by FCM-based clustering techniques are reported to give a good estimate of the partial volumes [15]. Let us now inspect, on a theoretical basis, under what circumstances the fuzzy memberships satisfy Eq. (14). To this end, we would like to have

\frac{u_{\mu k}}{u_{\nu k}} = \frac{\alpha_k}{1 - \alpha_k} .    (15)
By applying Eqs. (2) and (14), we obtain

\frac{u_{\mu k}}{u_{\nu k}} = \left( \frac{x_k - v_\mu}{v_\nu - x_k} \right)^{-\frac{2}{m-1}} = \left( \frac{(1 - \alpha_k)(v_\nu - v_\mu)}{\alpha_k (v_\nu - v_\mu)} \right)^{-\frac{2}{m-1}} = \left( \frac{\alpha_k}{1 - \alpha_k} \right)^{\frac{2}{m-1}} ,    (16)
which equals the desired value shown in Eq. (15) if and only if m = 3. Consequently, if FCM is required to give an estimate of partial volume ratios, the use of the fuzzification exponent m = 3 is recommended.
4 Results and Discussion
In this section we test and compare the accuracy of four algorithms: BCFCM, EnFCM, FGFCM, and the proposed method, on several synthetic and real images. All the following experiments used 3 × 3 or 5 × 5 window size for all kinds of filtering.
Fig. 1. Filter mask coefficients in case of a reliable pixel intensity value (a), and a noisy one (b). The upper number in each cell represents the intensity value, while the lower number shows the obtained weight. The arrows indicate that the coefficients of extreme intensities are contributed to the median valued pixel.
Fig. 2. Segmentation results on phantom images: (a) original, (b) segmented with traditional FCM, (c) segmented using BCFCM, (d) segmented using FGFCM, (e) filtered using the proposed pre-filtering, (f) result of the proposed segmentation
Fig. 3. A comparison of the numbers of misclassifications at rising noise level (from left to right)
The proposed filtering technique uses a convolution mask whose coefficients are context dependent, and thus computed for the neighborhood of each pixel. Fig. 1 presents the obtained coefficients for two particular cases. Fig. 1(a) shows the case when the central pixel is not significantly noisy, but some pixels in the neighborhood might be noisy or might belong to a different cluster. Under such circumstances, the three pixels on the left side, whose gray levels are distant from the value of the central pixel, receive small weights and thus hardly contribute to the filtered value. Fig. 1(b) presents the case of an isolated noisy pixel situated in the middle of a relatively homogeneous window. Even though all computed coefficients are low, the noise is eliminated, resulting in a convenient filtered value of 76. The arrow-indicated migration of weights from the local maximum and minimum towards the median valued pixel, caused by step C of the filtering method, is relevant in the second case and useful in the first. The noise removal performances were compared using a 256 × 256-pixel synthetic test image taken from IBSR [7] (see Fig. 2(a)). The rest of Fig. 2 also shows the degree to which these methods were affected by a high-magnitude mixed noise. Visually, the proposed method achieves the best results, slightly over FGFCM, and significantly over all others. Fig. 3 shows the evolution of misclassifications obtained using three of the presented methods, while segmenting the phantom shown in Fig. 2(a), corrupted by an increasing amount of mixed noise (Gaussian noise, salt-and-pepper impulse noise, and mixtures of these). Moreover, not only is an extra amount of noise added to the image step by step, but the original cluster centroids (the base intensities of the clusters) are also moved closer and closer to each other. This complex effect is obtained using a variably weighted sum of three different noisy versions of the same image (all available at IBSR). Fig. 3 reveals that the proposed filter performs best at removing all these kinds of noises. Consequently, the proposed method is suitable for segmenting images corrupted with unknown noises, and in all cases it performs at least as well as its predecessors. We applied the presented filtering and segmentation techniques to several T1-weighted real MR images. A detailed view, containing numerous segmentations, is presented in Fig. 4. The original slice (a) is taken from IBSR. We produced several noisy versions of this slice, by artificially adding salt-and-pepper impulse noise and/or Gaussian noise, at different intensities. Some of these noisy versions are visible in Fig. 4 (d), (g), (j), (m). The filtered versions of the five above mentioned slices are presented in the middle column of Fig. 4. The segmentation results are shown in Fig. 4 (c), (f), (i), (l), (o), accordingly. From the segmented images we can conclude that the proposed filtering technique is efficient enough to allow proper segmentation of any likely-to-be-real MRI images in clinical practice, at least from the point of view of Gaussian and impulse noises. Table 1 summarizes the behavior of the three mentioned segmentation techniques, in case of different noise types and intensities, computed by averaging the misclassifications on 12 different T1-weighted real MR brain slices. The proposed algorithm has the lowest misclassification rates in most of the cases.
Fig. 4. Filtering and segmentation results on real T1-weighted MR brain images, corrupted with different kinds and levels of artificial noise. Each row contains an original or noise-corrupted brain slice on the left side, the filtered version (using the proposed method) in the middle, and the segmented version on the right side. Row (a)-(c) comes from record number 1320 2 43 of IBSR [7], row (d)-(f) is corrupted with 10% Gaussian noise, while rows (g)-(i), (j)-(l), and (m)-(o) contain mixed noise of 3% impulse + 5% Gaussian, 3% impulse + 10% Gaussian, and 5% impulse + 5% Gaussian, respectively.
Fig. 5 presents one slice of a real, T2-weighted MR brain image, and its segmentation using the proposed method. Visual inspection shows that our segmentation results are very close to the IBSR expert's manual segmentation. We applied the proposed segmentation method to several complete head MR scans in IBSR. The dimensions of the image stacks were 256 × 256 × 64 voxels. The average total processing time for one stack was around 10 seconds on a 2.4 GHz Pentium 4.

Table 1. Misclassification rates in case of real brain MR image segmentation

Noise type                  | EnFCM   | FGFCM   | Proposed
Original, no extra noise    | 0.767 % | 0.685 % | 0.685 %
Gaussian 4 %                | 1.324 % | 1.131 % | 1.080 %
Gaussian 12 %               | 4.701 % | 2.983 % | 2.654 %
Impulse 3 %                 | 1.383 % | 0.864 % | 0.823 %
Impulse 5 %                 | 1.916 % | 1.227 % | 0.942 %
Impulse 10 %                | 3.782 % | 1.268 % | 1.002 %
Impulse 5 % + Gaussian 4 %  | 2.560 % | 1.480 % | 1.374 %
Impulse 5 % + Gaussian 12 % | 6.650 % | 4.219 % | 4.150 %
Fig. 5. Segmentation results on real T2-weighted MR images: (a) original, (b) filtered using the proposed method, (c) result of the proposed segmentation
5 Conclusions
We have developed a modified FCM algorithm for automatic segmentation of MR brain images. The algorithm was presented as a combination of a context dependent pre-filtering technique and an accelerated FCM clustering performed over the histogram of the filtered image. The pre-filter uses both spatial and gray level criteria, in order to efficiently eliminate Gaussian and impulse noises without significantly blurring the real edges. Several test series were carried out using synthetic brain phantoms and real MR images. These investigations revealed that our proposed technique accurately segments the different tissue classes under serious noise contamination. We compared our results with other recently reported methods. Test results
revealed that our approach outperformed these methods in many aspects, especially in the accuracy of segmentation and processing time. Further works target more precise treatment of partial volume artifacts, removal of intensity non-uniformity noises, and adaptive determination of the optimal number of clusters. Acknowledgments. This research was supported by the Hungarian National Office for Research and Technology (RET-04/2004), the Sapientia Institute for Research Programmes, and the Communitas Foundation.
References
1. Ahmed, M.N., Yamany, S.M., Mohamed, N., Farag, A.A., Moriarty, T.: A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Trans. Med. Imag. 21, 193–199 (2002)
2. Bezdek, J.C., Pal, S.K.: Fuzzy models for pattern recognition. IEEE Press, Piscataway, NJ (1991)
3. Cai, W., Chen, S., Zhang, D.Q.: Fast and robust fuzzy c-means algorithms incorporating local information for image segmentation. Patt. Recogn. 40, 825–838 (2007)
4. Chen, S., Zhang, D.Q.: Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure. IEEE Trans. Syst. Man Cybern. Part B 34, 1907–1916 (2004)
5. Chuang, K.S., Tzeng, H.L., Chen, S., Wu, J., Chen, T.J.: Fuzzy c-means clustering with spatial information for image segmentation. Comp. Med. Imag. Graph. 30, 9–15 (2006)
6. Hathaway, R.J., Bezdek, J.C., Hu, Y.: Generalized fuzzy c-means clustering strategies using Lp norm distances. IEEE Trans. Fuzzy Syst. 8, 576–582 (2000)
7. Internet Brain Segmentation Repository, http://www.cma.mgh.harvard.edu/ibsr
8. Pham, D.L., Prince, J.L.: Partial volume estimation and the fuzzy c-means algorithm. In: Proc. Int. Conf. Imag. Proc., pp. 819–822 (1998)
9. Pham, D.L., Prince, J.L.: Adaptive fuzzy segmentation of magnetic resonance images. IEEE Trans. Med. Imag. 18, 737–752 (1999)
10. Pham, D.L.: Unsupervised tissue classification in medical images using edge-adaptive clustering. In: Proc. Ann. Int. Conf. IEEE EMBS, Cancún, vol. 25, pp. 634–637 (2003)
11. Ruan, S., Jaggi, C., Xue, J.H., Fadili, M.J., Bloyet, D.: Brain tissue classification of magnetic resonance images using partial volume modeling. IEEE Trans. Med. Imag. 19, 1179–1187 (2000)
12. Siyal, M.Y., Yu, L.: An intelligent modified fuzzy c-means based algorithm for bias field estimation and segmentation of brain MRI. Patt. Recogn. Lett. 26, 2052–2062 (2005)
13. Szilágyi, L., Benyó, Z., Szilágyi, S.M., Adam, H.S.: MR brain image segmentation using an enhanced fuzzy C-means algorithm. In: Proc. Ann. Int. Conf. IEEE EMBS, Cancún, vol. 25, pp. 724–726 (2003)
14. Szilágyi, L.: Medical image processing methods for the development of a virtual endoscope. Period. Polytech. Ser. Electr. Eng. 50(1–2), 69–78 (2006)
15. Zhang, Y., Brady, M., Smith, S.: Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imag. 20, 45–57 (2001)
Evaluation of Contrast Enhancement Filters for Lung Nodule Detection

Carlos S. Pereira1,2, Ana Maria Mendonça1,2, and Aurélio Campilho1,2

1 INEB - Instituto de Engenharia Biomédica, Laboratório de Sinal e Imagem, Campus FEUP (Faculdade de Engenharia da Universidade do Porto), Rua Roberto Frias, s/n, 4200-465 Porto, Portugal
2 Universidade do Porto, Faculdade de Engenharia, DEEC
{cmsp,amendon,campilho}@fe.up.pt
Abstract. The aim of this paper is to evaluate and compare the performance of three convergence index (CI) filters when applied to the enhancement of chest radiographs, aiming at the detection of lung nodules. One of these filters, the sliding band filter (SBF ), is the proposal of an innovative operator, while the other two CI class members, the iris filter (IF ) and the adaptive ring filter (ARF ), are already known in this application. To demonstrate the adequacy of the new filter for the enhancement of chest x-ray images, we calculated several figures of merit with the goal of comparing (i) the contrast enhancement capability of the filters, and (ii) the behavior of the filters for the detection of lung nodules. The results obtained for 154 images with nodules of the JSRT database show that the SBF outperforms both the IF and ARF . The proposed filter demonstrated to be a promising enhancement method, thus justifying its use in the first stage of a computer-aided diagnosis system for the detection of lung nodules.
1 Introduction
The contrast in chest radiograph images is usually low and the noise level is frequently high, due primarily to the limitations placed on x-ray dose. These characteristics are naturally intrinsic to all the structures that can be found in these images, and in particular to lung nodules, which normally appear as local low-density rounded areas, exhibiting very weak contrast against their background. As the lack of contrast was found to be a serious drawback for the effective use of image detection methods based on the magnitude of spatial differences, Kobatake and Murakami [1] proposed an adaptive filter to enhance rounded convex regions, the iris filter, which evaluates the degree of convergence of gradient vectors in the neighbourhood of each pixel of interest. This concept was further extended to generate a class of new filters, the convergence index (CI) filters [2]-[4], mainly differing in the region of support used for calculating the convergence degree of the gradient vectors. In this paper we present an innovative operator, the sliding band filter (SBF), which belongs to the class of convergence index filters. In order to evaluate
the adequacy of the new filter for the enhancement of chest x-ray images in a computer-aided nodule detection framework, several figures of merit of enhancement are applied to the output of different CI filters. A statistical analysis of the results obtained in the images of a public database, the JSRT database [5], allows the conclusion that the new filter outperforms both the iris and adaptive ring filters. The structure of this paper is as follows. The next section is dedicated to a brief description of filters belonging to the convergence index class, and the new approach, the sliding band filter, is presented in section 3. Section 4 is devoted to the description of the contrast enhancement evaluation measures, whose results are shown and discussed in section 5. Section 6 contains some conclusions and guidelines for future work.
2 Convergence Index Filters
The main concept underlying the convergence index filter set is the convergence degree of a gradient vector, which is calculated from the angle θ_i(k, l) between the orientation of the gradient vector at point (k, l) and a reference line with direction i. If we consider a pixel of interest P(x, y), the convergence index calculated in the neighbourhood of P that is the region of support of the filter, R, is defined by the average of the convergence indices at all M points in R, as shown in equation (1):

C(x, y) = \frac{1}{M} \sum_{(k,l) \in R} \cos \theta_i(k, l) .    (1)
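As a concrete illustration (our own sketch, not code from the cited works), the convergence index of Eq. (1) can be evaluated by comparing, at every support point, the gradient orientation with the direction of the line that joins that point to P:

```python
import numpy as np

def convergence_index(grad_angle, px, py, support):
    """Convergence index at a pixel of interest P = (px, py), Eq. (1).
    grad_angle: per-pixel gradient orientation in radians (e.g. from Sobel derivatives);
    support: iterable of (k, l) pixel coordinates forming the region of support R."""
    indices = []
    for k, l in support:
        toward_p = np.arctan2(py - l, px - k)          # direction from (k, l) towards P
        indices.append(np.cos(grad_angle[l, k] - toward_p))
    return float(np.mean(indices))
```

The CI filters discussed next differ only in how the region of support passed to such a routine is chosen and adapted.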
In [2] a set of filters belonging to the CI class was proposed for detecting lung nodule candidates on chest x-ray images. In that work, the region to evaluate the convergence degree consists of N half-lines radiating from the pixel of interest, which are defined over a circular convex region. When different criteria are established for selecting the points on the ith half-line that are used for calculating the convergence index for direction i, distinct types of filters can be obtained. The coin filter (CF) has a fixed region of support formed by all the points in the complete set of half-lines, while the iris filter (IF) is an adaptive coin filter whose region of support can change in each direction. The adaptive ring filter (ARF) uses a ring-shaped region whose radius changes adaptively. The iris filter automatically adjusts the length of each radial line used for measuring the averaged convergence index for direction i along the n pixels (R_min ≤ n ≤ R_max) away from P(x, y), as defined by equation 3, aiming at the maximization of this value. The output of the iris filter, IF(x, y), is the average of the maximal convergence indices for the N half radial directions, as defined by equation 2:

IF(x, y) = \frac{1}{N} \sum_{i=0}^{N-1} C_i^{max} .    (2)
C_i^{max} = \max_{R_{min} \le n \le R_{max}} \left[ \frac{1}{n} \sum_{m=1}^{n} \cos \theta_{im} \right] .    (3)
A slightly different implementation of the iris filter is presented in [6], where the parameter R_min also establishes the inner limit of the filter support region. The adaptive ring filter has a ring-shaped region of support whose radius changes adaptively according to equations 4 and 5:

ARF(x, y) = \max_{0 \le r \le r_0} \left[ \frac{1}{N} \sum_{i=0}^{N-1} c_i \right] ,    (4)

c_i = \frac{1}{d} \sum_{m=r+1}^{r+d} \cos \theta_{im} .    (5)
The parameters r, r0 and d define the radius of the inner circle, the maximum radius of the outer circle and the width of the ring, respectively.
3 Sliding Band Filter
The sliding band filter (SBF) is also a member of the CI filter class, as its output also measures the degree of convergence of gradient vectors. The main difference between this new filter and the iris filter is that the SBF searches in each radial direction the band of fixed width that corresponds to the maximum degree of convergence, while in the IF the radial line used for calculation purposes always begins in the point of interest P and the number of pixels can vary from a minimum to a maximum value. As in the ARF, the width of the band used for calculating the convergence index is equal in each half-line, but its position is variable in the SBF. The region of support of the proposed filter is represented in Fig. 1. The output of the SBF at a pixel of interest P(x, y) is defined by equations 6 and 7 as follows:

SBF(x, y) = \frac{1}{N} \sum_{i=0}^{N-1} C_i^{max} ,    (6)

C_i^{max} = \max_{R_{min} \le n \le R_{max}} \left[ \frac{1}{d} \sum_{m=n}^{n+d} \cos \theta_{im} \right] ,    (7)

where N is the number of radial directions leading out from P, d represents the fixed width of the band (in pixels), θ_im is the angle of the gradient vector at the point m pixels away from P with direction i, and R_min and R_max represent, respectively, the inner and outer sliding limits of the band, as illustrated in Fig. 1. When compared with the IF, this new approach has a more selective response for those image areas whose central part has a more random degree of convergence, as only the image band with the highest convergence indices is considered
Fig. 1. Region of support of the sliding band filter
for computation purposes. The sliding band filter also has the advantage of being more flexible than the ARF when the shape of the region to detect differs from the expected rounded area.
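To make the band-sliding step concrete, the following sketch (our own illustration, not the authors' implementation) evaluates Eqs. (6) and (7) at one pixel, assuming the convergence indices cos θ_im have already been computed from the image gradient field, e.g. with a routine like the convergence-index sketch above.

```python
import numpy as np

def sbf_response(cos_theta, d=5, r_min=2, r_max=21):
    """Sliding band filter output at one pixel of interest, Eqs. (6) and (7).
    cos_theta[i, m]: convergence index of the point m pixels away from P along
    radial direction i; d: band width; r_min, r_max: sliding limits of the band."""
    n_dirs, n_radii = cos_theta.shape
    best = np.full(n_dirs, -np.inf)
    for n in range(r_min, min(r_max, n_radii - d) + 1):
        band_mean = cos_theta[:, n:n + d].mean(axis=1)   # band of width d starting at radius n
        best = np.maximum(best, band_mean)               # keep the best band per direction
    return float(best.mean())                            # average over the N directions
```

With the parameter values reported later in the experiments (N = 256, d = 5, R_min = 2, R_max = 21), this is the response computed at every pixel of the lung field.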
4 Measures for the Evaluation of Contrast Enhancement Filters
The convergence index filters described in the two previous sections, with support regions dominantly circular, are particularly adequate for the enhancement of round-shaped image regions. In order to evaluate the contrast enhancement capability of these filters, we selected a set of figures of merit based on: 1) the calculation of the contrast-to-noise ratio (CNR) and 2) a combined measure for enhancement assessment (D). The contrast-to-noise, CNR, and combined, D, measures were selected based on the intrinsic characteristics of the x-ray chest images, and particularly of the nodules that we want to enhance, which are normally rounded regions exhibiting very weak contrast against their background. We can define contrast as the difference of the average intensities in two adjacent regions of an image; the contrast-to-noise ratio (CNR) is given by the quotient of the contrast and noise, where noise can be measured by the averaged variation in the two considered regions [7]. Although there are several options for the choice of the background for a circular shaped area corresponding to a lung nodule, we have decided on the selection of a surrounding ring with area identical to the nodular circle. For this solution, the CNR value is defined by equation 8:

CNR = \frac{\mu_{nodule} - \mu_{ring}}{\left[ (\sigma_{nodule}^2 + \sigma_{ring}^2)/2 \right]^{1/2}} .    (8)
where μ_nodule is the mean intensity in the central nodular area, μ_ring is the mean value of the ring-shaped background, and σ_nodule and σ_ring are their corresponding standard deviations. Because the areas of the two regions are equal, their contributions to the noise calculation are also identical. The best performance of the filtering method corresponds to the highest values of the CNR figure.
The second quantitative figure of merit, as defined in [8], is a combination of three different measures, namely the distribution separation measure (DSM) and two target-to-background contrast enhancement measurements, one based on the standard deviation (TBC_s) and the other based on the entropy (TBC_e). The distribution separation measure (DSM) aims at evaluating the reduction of overlapping between the intensity distributions of the nodular area and background, as a consequence of the enhancement operation. To calculate the DSM value, we can consider the distributions of the intensities of the original nodular image region and its background as two normal probability density functions with means \mu_{nodule}^{original}, \mu_{ring}^{original} and standard deviations \sigma_{nodule}^{original}, \sigma_{ring}^{original}, respectively. After the application of the enhancement filter, the new values for the two distribution parameters are \mu_{nodule}^{filtered}, \mu_{ring}^{filtered} and \sigma_{nodule}^{filtered}, \sigma_{ring}^{filtered}, respectively. This measure is defined as:

DSM = \left| \mu_{nodule}^{filtered} - \mu_{ring}^{filtered} \right| - \left| \mu_{nodule}^{original} - \mu_{ring}^{original} \right| .    (9)

The highest values of DSM correspond to a better separation between the distributions, and, therefore, to improved enhancement results. The target-to-background contrast enhancement measurement based on standard deviation (TBC_s) maximizes the difference between the nodular and background mean gray levels and, in addition, reduces the spread of intensities in the enhanced nodular region when compared with the similar region in the original image, as defined in equation (10):

TBC_s = \left( \frac{\mu_{nodule}^{filtered}}{\mu_{ring}^{filtered}} - \frac{\mu_{nodule}^{original}}{\mu_{ring}^{original}} \right) \Big/ \left( \frac{\sigma_{nodule}^{filtered}}{\sigma_{nodule}^{original}} \right) .    (10)

Singh and Bovis [8] proposed an extension of the concept of TBC_s by replacing the standard deviation in the denominator of equation (10) with the entropy ε of the nodular region. The TBC_e measure is defined by

TBC_e = \left( \frac{\mu_{nodule}^{filtered}}{\mu_{ring}^{filtered}} - \frac{\mu_{nodule}^{original}}{\mu_{ring}^{original}} \right) \Big/ \left( \frac{\varepsilon_{nodule}^{filtered}}{\varepsilon_{nodule}^{original}} \right) .    (11)

For these last two figures, a good enhancement capability is associated with a positive value. After scaling the DSM, TBC_s and TBC_e values to the range [0, 1], the Euclidean distance between the corresponding point in the 3-D coordinate space
and the (1, 1, 1) point is calculated. This distance, called the combined enhancement measure (D), is given by Eq. (12):

D = \sqrt{ (1 - DSM)^2 + (1 - TBC_s)^2 + (1 - TBC_e)^2 } .    (12)

As expected, the lowest value for D corresponds to the method with the greatest enhancement capability.
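A hedged sketch of how these figures of merit could be computed for one nodule/background pair is given below. It is our own illustration: the histogram-based entropy estimator is an assumption, and the data-set-wide rescaling of DSM, TBC_s and TBC_e to [0, 1] required before Eq. (12) is omitted for brevity.

```python
import numpy as np

def entropy(x, bins=256):
    """Shannon entropy of a region's intensity histogram (assumed estimator)."""
    p, _ = np.histogram(x, bins=bins, density=True)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def enhancement_measures(nod_o, ring_o, nod_f, ring_f):
    """CNR (Eq. 8) on the filtered image, and DSM, TBC_s, TBC_e, D (Eqs. 9-12).
    Inputs: 1-D arrays of nodule / background-ring intensities, original (o) and filtered (f)."""
    cnr = (nod_f.mean() - ring_f.mean()) / np.sqrt((nod_f.var() + ring_f.var()) / 2.0)
    dsm = abs(nod_f.mean() - ring_f.mean()) - abs(nod_o.mean() - ring_o.mean())
    gain = nod_f.mean() / ring_f.mean() - nod_o.mean() / ring_o.mean()
    tbc_s = gain / (nod_f.std() / nod_o.std())
    tbc_e = gain / (entropy(nod_f) / entropy(nod_o))
    # In practice DSM, TBC_s and TBC_e are rescaled to [0, 1] over the whole image set
    # before computing D; the raw values are used here only to show the structure of Eq. (12).
    d = np.sqrt((1 - dsm) ** 2 + (1 - tbc_s) ** 2 + (1 - tbc_e) ** 2)
    return cnr, dsm, tbc_s, tbc_e, d
```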
5 Experimental Results and Discussion
Most of the proposed computer-aided diagnosis (CAD) systems for lung nodule detection adopt a two-stage approach, with an initial selection of nodule candidates, followed by the reduction of false positives [6], [9]-[10]. The selection of nodule candidates is frequently preceded by contrast enhancement operations using adequate techniques [9]. The new filter described in section 3 and two other CI filters, namely the iris filter and the adaptive ring filter, were used for enhancing the chest radiographs of the JSRT database containing lung nodules. For the SBF output computation, the parameter values were set to N = 256, d = 5, Rmin = 2 and Rmax = 21. This last value was selected based on the maximal nodule size present in the database; the other parameters were adjusted empirically to maximize the nodule detection rate. The same parameter values were used with the two other filtering methods. The results of these filtering operations on one of those images are depicted in Fig.2.
Fig. 2. a) Original lung field image; Enhanced images using: b) IF , c) ARF , d) SBF
The performance of the three filters was evaluated based on the calculation of the measures described in section 4. For this purpose a set of image regions of 97 × 97, centred in a true nodule, were extracted from chest radiographs of the JSRT database. The size of these regions was established based on the maximal
Fig. 3. a) Original images; b-d) Results of the enhancement operation: b) IF , c) ARF , d) SBF
radius of the nodules that are present in the database and its background ring. Figure 3 shows five of these regions and the results achieved with the contrast enhancement operation for each one of the three filters. All the images of this set have a different value for the nodule subtleness, as defined in [5]. The values of the two figures of merit, calculated for the five regions shown in Fig. 3, are presented in table 1. The average values computed over all filtered image regions of the database are summarized in table 2. As shown in this table, the SBF presents the best results for the two measures when compared with the other filters. To support the performance evaluation of the nodular enhancing filters, we extracted an additional nonparametric statistical measure, the Mann-Whitney U test, defined by the null hypothesis that there are no differences between the two filtering methods. For each figure of merit, CN R and D, we compared the results of the SBF with those of the IF and ARF . The U statistic values for the CN R measure are 17691 and 13650 for the pairs SBF -IF and SBF -ARF , respectively; the results for the D measure are 10808 and 9281, respectively. These tests revealed, with a .20 level of significance (corresponding to a 10 percent level of significance in each tail of the curve), a statistically significant difference between the SBF and each of the two other filters, as all the U statistic values are positioned outside the acceptance region, whose lower and upper limits are 10857.7 and 12858.3, respectively, for a normal distribution with parameters, N (μ, σ) = (11858, 781.5).
Table 1. Values of the selected measures for contrast enhancement evaluation for five image regions of the JSRT database

Image number | Figure of merit | Iris filter | Adaptive ring filter | Sliding band filter
4   | CNR | 0.74 | 1.85 | 1.92
4   | D   | 1.13 | 1.11 | 0.83
24  | CNR | 1.27 | 2.03 | 2.36
24  | D   | 1.35 | 1.11 | 0.75
59  | CNR | 0.2  | 0.33 | 0.68
59  | D   | 1.05 | 0.96 | 0.66
109 | CNR | 0.61 | 1.2  | 1.6
109 | D   | 1.37 | 1.39 | 1.29
146 | CNR | 0.16 | 0.43 | 0.66
146 | D   | 1.37 | 1.18 | 1.17
Table 2. Average values calculated over all filtered image regions

Feature | Iris filter | Adaptive ring filter | Sliding band filter
CNR     | 0.62 | 0.98 | 1.15
D       | 1.04 | 1.09 | 0.99
The influence of the initial enhancement step on the results of the methodology for detecting lung nodule candidates described in [11] was also evaluated. In this system, the selection of a set of suspicious regions is performed in two steps: in the first step, the image is enhanced and multiplied by a probability mask; then, the resulting image is processed with a morphological watershed segmentation operator which divides each lung field into a set of non-overlapping regions, each one considered as a potential lung nodule candidate. For each one of these regions, all local maxima are discarded except the highest one, whose value is used for characterizing the region as a lung nodule candidate. The final result is an ordered set of candidate image areas, corresponding to the highest filter responses. The results of the watershed segmentation for the three enhancement filters are illustrated in Fig. 4a-c. The output of this algorithm using the SBF is shown in Fig. 4d. Based on the results of the proposed methodology, with the enhancement step performed by each one of the three CI filters, we have calculated three other quantitative features, namely the average number of local maxima per image, F1, the average number of regions per image resulting from the watershed segmentation, F2, and the average rank of the true nodule in the final ordered list of lung nodule candidate areas, F3. The results that were obtained for the 154 images with nodules of the JSRT database are presented in Table 3. In order to assess the influence of the enhancement operation on the images without nodules, the values of features F1 and F2 were calculated for the 93
Fig. 4. Watershed segmentation results: a) IF b) ARF c) SBF; d) Output of the algorithm showing the set of selected candidate nodules

Table 3. Influence of the enhancement filters on the detection of lung nodule candidates

Feature | Iris filter | Adaptive ring filter | Sliding band filter
F1 | 871.9 | 467.4 | 243.2
F2 | 499.1 | 130.5 | 54.8
F3 | 37.2  | 12.7  | 6.7
Table 4. Influence of the enhancement filters on images without nodules

Feature | Iris filter | Adaptive ring filter | Sliding band filter
F1 | 847.2 | 479.9 | 247.2
F2 | 490.3 | 149.4 | 54.6
non-nodular images of the JSRT database. As expected, the results shown in Table 4 are very similar to those obtained for the nodular images. These results clearly demonstrate that the new CI filter outperforms the two other CI class members for this kind of application. The number of local maximum values that result directly from the filtering operation is much lower, and the watershed segmentation outputs are also superior for the SBF. Concerning the discrimination between true and false positives, a significant improvement was also achieved, as demonstrated by the values obtained for the final rank of the true nodule regions. The algorithms developed for filter implementation were tested on a Pentium M processor, running at 1.60 GHz, with 512 MB of RAM. The ARF presented the lowest execution time, followed by the SBF and the IF, which showed an increase of 43.6% and 64.2%, respectively. All three algorithms have time complexity O(n) for n > 64000.
6 Conclusions
In this paper we compared the performance of three convergence index (CI) filters when applied to the enhancement of x-ray chest images aiming at the detection of lung nodules. One of these filters, the sliding band filter, was proposed by the authors, while the other two operators, the iris filter and the adaptive ring filter, were already known for this kind of application. In order to evaluate the contrast enhancement capability of the filters, we selected two figures of merit, namely the contrast-to-noise ratio and a combined enhancement measure. These measures were calculated for all the nodular images of the JSRT database, and the obtained results, shown in Tables 1 and 2, clearly demonstrate that the SBF outperforms the two other filters. Finally, we have analyzed the influence of the three enhancement methods on the detection of an initial set of lung nodule candidates. The best performance of the new operator was once again confirmed by the obtained results, as shown in Tables 3 and 4. In our opinion, the proposed filter proved to be a promising enhancement method, thus justifying its use in the first stage of a computer-aided diagnosis system for the detection of lung nodules. Right now, the output of this system is just a set of nodule candidate regions that are intended to be the input for a false positive reduction stage, aiming at reducing the number of final probable nodular regions to be presented to the specialist.
References
1. Kobatake, H., Murakami, M.: Adaptive filter to detect rounded convex regions: Iris filter. In: Int. Conf. Pattern Recognition, vol. II, pp. 340–344 (1996)
2. Kobatake, H., Hashimoto, S.: Convergence index filter for vector fields. IEEE Transactions on Image Processing 8(8) (1999)
3. Wei, J., Hagihara, Y., Kobatake, H.: Detection of cancerous tumors on chest X-ray images - Candidate detection filter and its evaluation. In: ICIP99, vol. 27AP4.2 (1999)
4. Kobatake, H., Wei, J., Yoshinaga, Y., Hagihara, Y., Shimizo, A.: Nonlinear Adaptive Convergence Index Filters and Their Characteristics. In: Int. Conf. Pattern Recognition, pp. 522–525 (2000)
5. Shiraishi, J., Katsuragawa, S., Kezoe, S., Matsumoto, T., Kobayashi, T., Komatsu, K., Matsui, M., Fujita, M., Kodera, Y., Doi, K.: Development of a Digital Image Database for Chest Radiographs With and Without a Lung Nodule: Receiver Operating Characteristic Analysis of Radiologists' Detection of Pulmonary Nodules. American Journal of Roentgenology 174, 71–74 (2000)
6. Keserci, B., Yoshida, H.: Computerized detection of pulmonary nodules in chest radiographs based on morphological features and wavelet snake model. Medical Image Analysis 6, 431–447 (2002)
7. Song, X., Pogue, B., Jiang, S., Doyley, M., Dehghani, H., Tosteson, T., Paulsen, K.: Automated region detection based on the contrast-to-noise ratio in near-infrared tomography. Applied Optics 43(5), 1053–1062 (2004)
8. Singh, S., Bovis, K.: An evaluation of contrast enhancement techniques for mammographic breast masses. IEEE Transactions on Information Technology in Biomedicine 9(1), 109–119 (2005)
9. Wei, J., Hagihara, Y., Shimizu, A., Kobatake, H.: Optimal image feature set for detecting lung nodules on chest X-ray images. In: CARS 2002 (2002)
10. Suzuki, K., Shiraishi, J., Abe, H., MacMahon, H., Doi, K.: False-positive reduction in computer-aided diagnostic scheme for detecting nodules in chest radiographs by means of massive training artificial neural network. Academic Radiology 12, 191–201 (2005)
11. Pereira, C., Fernandes, H., Mendonça, A., Campilho, A.: Detection of lung nodule candidates in chest radiographs. In: Martí, J., et al. (eds.) IbPRIA 2007. LNCS, vol. 4478, pp. 170–177. Springer, Heidelberg (2007)
Robust Coronary Artery Tracking from Fluoroscopic Image Sequences Pascal Fallavollita1 and Farida Cheriet1,2 1 Institute of Biomedical Engineering, Department of Computer Engineering École Polytechnique de Montréal, Campus de l'Université de Montréal, 2500, chemin de Polytechnique, Montréal, H3T 1J4 pascal.fallavollita,
[email protected] 2
Abstract. This paper presents a new method to temporally track the principal coronary arteries in an X-ray fluoroscopy setting. First, the principal coronary artery centerlines are extracted at a first time instant. Secondly, in order to estimate the centerline coordinates in subsequent time frames, a pyramidal Lucas-Kanade optical flow approach is used. Finally, an active contour model coupled with a gradient vector flow (GVF) formulation is used to deform the estimated centerline coordinates towards the actual medial axis positions. The algorithm’s effectiveness has been evaluated on 38 monoplane images from three patients and the temporal tracking lasted over one cardiac cycle. The results show that the centerlines were correctly tracked in 92% of the image frames. Keywords: Coronary Artery Tracking, Pyramidal Optical Flow, Gradient Vector Flow, Image Segmentation.
1 Introduction

Coronary heart disease (CHD) is still a leading cause of death in North America today. CHD is caused by atherosclerosis, the narrowing of the coronary arteries due to fatty build-ups of plaque. This leads to artery stenosis and it is likely to produce heart attack, angina pectoris, or both. Thus, artery motion can play an important part in the pathogenesis of atherosclerosis by altering the vessel wall mechanics. The dynamic information from the coronary arteries is of great interest to cardiologists and has attracted attention in cardiac research recently. X-ray fluoroscopic imaging remains the method of choice for interventionists for proper diagnosis of CHD. Acquiring perpendicular biplane images, over a cardiac phase, facilitates the three dimensional reconstruction of the coronary arteries and enables a thorough analysis of motion dynamics of the vascular structure. This not only aids in the diagnosis of stenosis or vascular pathologies, but also paves the way for clinical breakthroughs in interventional procedures. Many rigorous techniques have been developed by researchers for the purpose of temporally tracking the coronary arteries, heart chamber contours or even guidewires. In a 2D framework, the
authors in [1] presented a method combining active contours and image template matching techniques for tracking the coronary arteries. The active contour is modeled as a string of templates and is no longer a simple curve. In [2], a method is presented to extract and track the position of a guide wire during endovascular interventions. A rough estimate of the displacement of the guidewires is obtained using a template matching procedure and the position of the guide wire is determined by fitting a Bspline to a feature image using 4-5 control points. In these methods, large deformations may hinder the reliability of the template position estimates. Optical Flow formulations for motion estimation and tracking of the epicardial surface and right ventricle were respectively presented in [3-4]. The calculation of velocity fields was performed using correlation in a small region of interest in [3]. To improve tracking speed, the authors in [4] extended the formulation of active contours by including an additional force, derived from the optical flow field. However in both studies, the traditional optical flow methods (block correlation & Horn and Schunk) worked well since the overall inter-frame movement of the cardiac surface was small. Making the transition to 3D tracking algorithms, in [5], the authors propose combining optical flow derived displacement information from two projection images to displace an existing 3-D coronary tree model. A deformable models approach for segmenting objects using parametric curves was applied to segmenting the coronary arteries with 3-D B-spline curves in [6]. In [7], a registration framework recovers the motion one angiogram at a time, using a set of coarse-to-fine motion models. This 3D approach requires an a priori 3-D model of the coronary tree at the first time instant before tracking begins. The main objective of these methods is to evaluate the functionality of the overall coronary motion in three dimensions and the use of many 3D control points increases not only the complexity of the algorithms but the extrapolation errors due to local deformations. The main objective of our work is to temporally track the principal coronary arteries in a 2D single-view setting for the purpose of 3D reconstruction from biplane images (right anterior and left anterior oblique). Valid reconstruction relies on the successful correspondence of appropriate landmarks obtained from a pair of biplane images. In general, correspondence is difficult to achieve if we only rely on a single pair of biplane images. Authors in [8] have demonstrated that self-calibration (and 3D reconstruction) of the coronary arteries also becomes possible using a spatio-temporal matching technique that uses as landmarks the coronary bifurcation points across a sequence of biplane image frames. However, as the number of bifurcation points may be limited on a pair of biplane images and across an angiographic sequence, it is essential to find, in sufficient number, other appropriate landmarks for the biplane correspondence problem. In [9], we have developed a method to extract the principal coronary artery centerlines on a single fluoroscopic image and the seed points making up the centerline will be used, not only for the purpose of biplane reconstruction, but for the correspondence problem as well. 
The focus of this paper will be on the spatiotemporal tracking of the principal coronary arteries and the method proposed is an extension of the algorithms used in [4], but with two significant modifications. To account for large deformations of the arteries, we implement a multiresolution optical flow approach that has been vastly used in computer vision applications but not in the angiographic framework. Secondly, we do not include the optical flow force in the active contour model; instead we use a gradient vector flow (GVF) snakes method as it reduces the number of parameters to be considered.
2 Methodology

2.1 Image Segmentation and Centerline Extraction

Our research group has recently developed a simple 4-step filter method in order to extract the principal coronary arteries. In short, a combination of four filters is used in order to enhance the main coronary arteries and produce a final cost image in order to initialize a two-click Fast Marching Method (FMM) for centerline extraction purposes [9]. Following is a brief description of the 4-step filter implemented.

2.1.1 Homomorphic Filtering
A homomorphic filter is first used to denoise the fluoroscopic image. The illumination component of an image is generally characterized by slow spatial variation. The reflectance component of an image tends to vary abruptly. The homomorphic filter tends to decrease the contribution made by the low frequencies and amplify the contribution of high frequencies. The result is simultaneous dynamic range compression and contrast enhancement. The homomorphic filter is given by
H(u,v) = (\gamma_H - \gamma_L)\left(1 - e^{-c\,(D^2(u,v)/D_0^2)}\right) + \gamma_L    (1)

with γ_L < 1 and γ_H > 1. The coefficient c controls the sharpness of the slope at the transition between high and low frequencies, whereas D_0 is a constant that controls the shape of the filter and D(u,v) is the distance in pixels from the origin of the filter.
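For illustration, a minimal NumPy sketch of this step is given below. The gamma, c and D_0 values are placeholder assumptions chosen for the example, not the settings used by the authors.

import numpy as np

def homomorphic_filter(img, gamma_l=0.5, gamma_h=2.0, c=1.0, d0=30.0):
    """Apply the homomorphic filter of Eq. (1) to a grayscale image.

    The image is taken to the log domain, filtered in the frequency
    domain with H(u, v), and mapped back with the exponential.
    """
    img = img.astype(np.float64)
    rows, cols = img.shape

    # Log transform (log1p avoids log(0)), then centered 2-D FFT.
    spectrum = np.fft.fftshift(np.fft.fft2(np.log1p(img)))

    # D(u, v): distance of each frequency sample from the filter origin.
    u = np.arange(rows) - rows // 2
    v = np.arange(cols) - cols // 2
    d2 = u[:, None] ** 2 + v[None, :] ** 2

    # Eq. (1): attenuate low frequencies (gamma_l), boost high ones (gamma_h).
    h = (gamma_h - gamma_l) * (1.0 - np.exp(-c * d2 / d0 ** 2)) + gamma_l

    filtered = np.fft.ifft2(np.fft.ifftshift(h * spectrum))
    return np.expm1(np.real(filtered))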
2.1.2 Anisotropic Diffusion

Anisotropic systems are those that exhibit a preferential spreading direction, while isotropic systems are those that have no preferences. The Perona-Malik anisotropic diffusion method was implemented here in order to reduce and remove both noise and texture from the image, as well as to preserve and enhance structures.
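The Perona-Malik scheme itself is not spelled out in the paper; the following is a minimal sketch of the standard explicit 4-neighbour formulation, with an illustrative conduction constant, step size and iteration count that are assumptions of this example.

import numpy as np

def perona_malik(img, n_iter=15, kappa=30.0, dt=0.2):
    """Perona-Malik anisotropic diffusion, explicit 4-neighbour scheme.

    Large gradients (edges) diffuse slowly, flat regions diffuse quickly,
    so noise and texture are removed while structures are preserved.
    Note: np.roll wraps around at the image border; a replicated border
    would be slightly more careful.
    """
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # Differences towards the four neighbours.
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u

        # Conduction function g(s) = exp(-(s / kappa)^2).
        u += dt * (np.exp(-(dn / kappa) ** 2) * dn +
                   np.exp(-(ds / kappa) ** 2) * ds +
                   np.exp(-(de / kappa) ** 2) * de +
                   np.exp(-(dw / kappa) ** 2) * dw)
    return u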
2.1.3 Complex Shock Filtering

Gilboa et al. have recently developed a filter coupling shock and linear diffusion in the discrete domain, showing that the process converges to a trivial constant steady state. To regularize the shock filter, the authors suggest adding a complex diffusion term and using the imaginary value as the controller for the direction of the flow instead of the second derivative. The complex shock filter is given by

I_t = -\frac{2}{\pi}\arctan\!\left(a\,\mathrm{Im}\!\left(\frac{I}{\theta}\right)\right)\,|\nabla I| + \lambda I_{\eta\eta} + \tilde{\lambda} I_{\xi\xi}    (2)
where a is a parameter that controls the sharpness of the slope, λ = re^{iθ} is a complex scalar, λ̃ is a real scalar, ξ is the direction perpendicular to the gradient and η is the direction of the gradient. The complex shock filter results in a robust and stable deblurring process that can still be effective in noisy environments such as fluoroscopy, given the low signal-to-noise ratio of these images.
2.1.4 Morphological Operation

Morphological filtering was applied as a final image processing step in order to eliminate background elements around the primary coronary arteries. The structuring element will suppress the background (black) and enhance the arteries (grayscale). We chose a disk structuring element of a few pixels' radius, as the contours of the coronary arteries can be modeled as a collection of disks whose radius varies slowly, with the centerline being the union of the centers of these disks.

2.2 Pyramidal Lucas-Kanade Optical Flow

The traditional Lucas-Kanade optical flow technique proposes to solve the brightness constancy constraint equation by assuming constant flow over a fixed neighborhood region, w:
\begin{bmatrix} \sum w I_x^2 & \sum w I_x I_y \\ \sum w I_x I_y & \sum w I_y^2 \end{bmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = - \begin{pmatrix} \sum w I_x I_t \\ \sum w I_y I_t \end{pmatrix}    (3)
Typically, the optical flow solution is obtained by defining a small 5x5 window region, w, assigning a high weight to the center pixel. The solution obtained (u, v) is valid for objects experiencing small deformations. A modification has been proposed to remedy this problem and is known as multi-resolution optical flow. This consists of the following three steps: a) construction of Gaussian pyramids for each image, b) computation of optical flow at a coarse scale, and c) propagation of that optical flow to the next level of the pyramid. The first step is to create a Gaussian pyramid containing images at different scales. Each level of the pyramid should contain images scaled by a factor of two (and anti-aliased by virtue of the Gaussian filtering). Iteration over each level in the pyramid follows, and the optical flow is updated at each step. Once the optical flow is computed at a given resolution, it can be propagated to the next resolution by interpolation. The flow values also need to be scaled to account for resizing. The objective is to find the corresponding pixel location in a second image, P2, from a pixel position in the first image, P1. We achieve this by using the pyramidal approach and calculating the final optical flow vector, d, yielding P2 = P1 + d.
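The pyramidal scheme is available off the shelf in OpenCV; the sketch below propagates the centerline seed points of one frame to the next. The window size and the number of pyramid levels mirror the values reported later in Sec. 2.4, but the function and argument names are OpenCV's, not the authors' MATLAB implementation.

import cv2
import numpy as np

def track_centerline(prev_frame, next_frame, centerline_pts,
                     win_size=(10, 10), pyramid_levels=4):
    """Propagate centerline seed points P1 to the next frame: P2 = P1 + d.

    prev_frame / next_frame: 8-bit grayscale images.
    centerline_pts: (N, 2) array of (x, y) seed points in the first frame.
    Returns the displaced points and a per-point tracking status flag.
    """
    p1 = np.asarray(centerline_pts, dtype=np.float32).reshape(-1, 1, 2)
    p2, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_frame, next_frame, p1, None,
        winSize=win_size,
        maxLevel=pyramid_levels - 1,  # maxLevel is zero-based
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    return p2.reshape(-1, 2), status.ravel()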
2.3 Gradient Vector Flow and Snakes

Active contour models have been widely used in many tracking and segmentation problems. They are represented by a parametric curve x = [x(s), y(s)] that deforms through the image domain to minimize an energy function, and they minimize the following Euler equation:

\alpha x'' - \beta x'''' - \nabla E_{ext} = 0    (4)
The first two parameters α and β are weights explicitly used for the internal bending energy and ∇ is a gradient operator. The gradient vector flow field (GVF) is not formulated using the standard energy minimization framework, but is constructed
by a force balance condition. As developed in [10], the GVF is a vector field ν = (u, v) that minimizes the energy function

E = \iint \mu\left(|\nabla u|^2 + |\nabla v|^2\right) + |\nabla f|^2\,|\nu - \nabla f|^2 \; dx\,dy    (5)
where ν is the vector field and f is the edge map of the input image with µ being a parameter controlling the smoothness of the GVF field. Traditional edge maps are computed using Gaussian convolutions at a specified scale, σ. The external energy term in (4) is replaced by (5) to yield the final active contour formulation:
\alpha x'' - \beta x'''' + \nu = 0    (6)
where x is the active contour curve. Lastly, we introduce a third and final parameter, τ, which is defined as the iteration time step. By normalizing the external force, each contour point will move at most τ pixels per iteration (for example, 0.1-pixel time steps).

2.4 Evaluation

A Toshiba digital fluoroscopic system was used to acquire 512x512 pixel angiographic images with a frame rate of 15 fps. Three angiographic datasets of different patients and 38 total image frames were used for evaluation of the tracking procedure. For each individual dataset, we extracted 15, 10, and 13 frames respectively that spanned one cardiac cycle. We first applied our 4-step filter to the diastolic images and then extracted the principal coronary artery centerline using a two-click FMM approach. Secondly, the pyramidal Lucas-Kanade algorithm was applied in order to estimate the corresponding centerline points in subsequent images. A window region of 10 pixels was selected for the optical flow estimations to account for possible large deformations of the arteries, and 4 pyramid levels were used for interpolation purposes to obtain a reliable flow estimate. For these centerlines to converge to their true positions, the GVF was implemented for all experiments using a regularization coefficient set to µ = 0.2, as in [10], and the number of iterations was set to 15 so as to produce the desired field vectors pointing inwards towards the center of the artery (instead of pointing towards the artery wall contour, as in the traditional GVF formulation). The active contour parameters were empirically set to α = 1, β = 0.5, and τ = 0.1, as they yielded stable convergence of the coronary centerline throughout the total number of images tracked.
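To make the update rules concrete, the sketch below first iterates the GVF field of Eq. (5) and then performs one semi-implicit snake update for Eq. (6), using the parameter values quoted above (µ = 0.2, 15 iterations, α = 1, β = 0.5, τ = 0.1). It is a simplified closed-contour formulation under assumed boundary conditions; an open centerline, as used in the paper, would need adjusted end conditions.

import numpy as np
from scipy.ndimage import laplace, map_coordinates

def gvf_field(edge_map, mu=0.2, n_iter=15, dt=0.5):
    """Iterate the GVF field (u, v) of Eq. (5) for an edge map f in [0, 1]."""
    f = edge_map.astype(np.float64)
    fy, fx = np.gradient(f)            # np.gradient returns (d/dy, d/dx)
    mag2 = fx ** 2 + fy ** 2
    u, v = fx.copy(), fy.copy()
    for _ in range(n_iter):
        u += dt * (mu * laplace(u) - (u - fx) * mag2)
        v += dt * (mu * laplace(v) - (v - fy) * mag2)
    return u, v

def snake_step(x, y, u, v, alpha=1.0, beta=0.5, tau=0.1):
    """One semi-implicit update of a closed contour driven by the GVF force.

    Solves (I - tau*A) x_new = x_old + tau * F_ext, where A encodes the
    internal terms alpha*x'' - beta*x'''' of Eq. (6) with circular
    (closed-curve) boundary conditions.
    """
    n = len(x)
    eye = np.eye(n)
    d2 = -2 * eye + np.roll(eye, 1, axis=0) + np.roll(eye, -1, axis=0)
    a = alpha * d2 - beta * (d2 @ d2)
    solve = np.linalg.inv(eye - tau * a)

    # Sample the external GVF force at the current contour positions.
    fx = map_coordinates(u, [y, x], order=1)
    fy = map_coordinates(v, [y, x], order=1)
    return solve @ (x + tau * fx), solve @ (y + tau * fy)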
3 Results and Discussion

These experiments were conducted on an HP workstation with a P4 2.8 GHz CPU, 2 GB RAM and MATLAB code. Figure 1 shows the centerline extraction method on diastolic images. The 4-step filter successfully enhances the main coronary arteries and produces a suitable cost function to initialize the two-click fast marching method. The centerline coordinates obtained in this first frame will be used to estimate the centerline positions in subsequent frames by using the multi-resolution optical flow method. The pyramidal optical flow approach took on average 14 minutes/frame to obtain estimated centerlines when using two consecutive images. Our main focus was
not directed toward the optimality of our code; hence future work should revolve around coding the algorithm using C++. The mean displacement (± standard deviation) of the principal coronary artery between successive frames was: 7.57 (± 2.44), 12.66 (± 6.42) and 10.54 (± 3.88) pixels.
Fig. 1. (Top Row) Cost images used to initialize the fast marching method algorithm. The final images are obtained using the sequential application of the following four filters: (i) homomorphic, (ii) anisotropic diffusion, (iii) complex shock filter and (iv) morphological. (Bottom Row) Example images of extracted coronary artery centerlines using a two-click method.
Figure 2 illustrates the difference between the traditional and pyramidal Lucas-Kanade optical flow approaches. In both Figures 2a-2b (top rows), the traditional implementation of the optical flow fails to produce reliable centerline estimates due to large deformations of the artery. If deformations are larger than 10 pixels, the pyramidal optical flow produces reliable centerline position estimates but also shows signs of unreliability, as seen by the elliptical region of interest in Figure 2b (bottom). We have also proposed to change the initial image used for the GVF field calculations. This also affects the external energy formulation for the active contour approach. In Figure 3 (left-top), it is important to select an appropriate scale, σ, when applying a Gaussian filter to the input image, otherwise the GVF vector fields will show some discrepancies near the distal part of the artery, as pointed out by the arrow in the figure. This might be tricky, as the scale can change value depending on the amount of noise present in the dataset. Therefore, we found that by choosing our 4-step filter over all experiments (Figure 3 (left-bottom)), we avoid having to determine a suitable scale for the edge map computation. The computation becomes automatic, thereby reducing simulation time. An example of the active contour convergence using the two different edge maps is shown in Figure 3 (right). The active contour coordinates converged properly at the distal end of the artery when using our proposed edge map versus the Gaussian convolution edge map.
Fig. 2a. (Top row) Results for traditional Lucas-Kanade Optical Flow applied to frame number 2, 3 and 4 of the first angiographic database. The technique fails when large deformations are experienced by the object being tracked. (Bottom row) The pyramidal implementation of Lucas-Kanade rectifies the problem for this particular angiographic sequence.
Fig. 2b. (Top row) Results for traditional Lucas-Kanade Optical Flow applied to frame number 2, 3 and 4 of the first angiographic database. (Bottom row) The pyramidal implementation of Lucas-Kanade performs relatively well; however, it begins to falter for deformations larger than 10 pixels as shown by the elliptical region.
Fig. 3. (Left-top) Region of interest representing distal part of coronary artery. The gradient vector flow is computed using the traditional convolution technique with a user selected scale σ. The arrow shows that the GVF fields do not converge near the distal end, but continue towards the bottom left of the image. (Left -bottom) By using the 4-step filter, the GVF fields converge correctly towards the distal end of the artery. (Right) The circle (o) centerline is obtained using the GVF field produced by the 4-step filter and depicts correctly the artery center axis, whereas the dotted line (.) centerline is a misrepresentation using the convolution image for GVF calculations.
Our algorithm correctly tracks the principal coronary arteries in 35 out of 38 images. Figure 4 shows typical tracking results. As presented in Table 1, the average number of iterations for convergence using our empirically determined parameters was 12, 28 and 196 respectively for the three datasets. The last three image frames from the third dataset yielded incorrect centerline convergence as some contour coordinates were “attached” to an external structure (rib cage) and hence were not attracted to the actual coronary artery centerline. Prior to these three images, the estimated contour coordinates were slowly being attracted to the rib cage and hence farther away from the coronary artery center axis. This is reflected in the large number of average iterations (= 196.1) calculated for this dataset. We noticed as well that by increasing gradually the value of the iteration step τ ∈ [0.1,1], the number of iterations decreased significantly by almost half. A larger τ accelerates the convergence process, however at the cost of instability.
Fig. 4. Tracking results for three X-ray angiographic databases. (Left to right columns) Respectively the 2nd, 5th and 8th image frame number.

Table 1. Number of iterations required for interframe convergence

Dataset Number   Tracked Frames   Mean Number of Iterations/Frame
1                15               12.2 ± 3.23
2                10               27.6 ± 8.44
3                13               196.1 ± 262.57
The purpose of our work was to test the feasibility of our tracking technique applied to the coronary arteries. We have proposed two improvements to traditional single-view tracking algorithms. First, we have applied a multi-resolution optical flow formulation to take into consideration possible large deformations of the arteries, and second, we have modified the edge map used for the GVF calculation in order to obtain correct flow fields pointing towards the artery center axis. When compared to previous works, our technique is also advantageous as it only requires a two-click initialization step and is completely free of user interaction the rest of the way. Finally, we do not rely on having an a priori 3D model of the arteries for the purpose of tracking them over a sequence of image frames, as some authors have suggested.
4 Conclusion

To our knowledge, we are the first to incorporate both pyramidal optical flow and the gradient vector flow field in an attempt to resolve the problem of large deformations in angiography, and to obtain a suitable centerline at a certain time frame. From an academic standpoint, the mathematics and concepts presented in this article are straightforward to implement. From a clinical standpoint, our method is almost fully automatic and does not require the selection of many control points, as some authors have suggested in previous works.

Acknowledgment. This work was supported in part by NSERC (Natural Sciences and Engineering Research Council).
References
1. Zhu, H., Friedman, M.H.: Vessel Tracking by Template String in Angiography, in Medical Image Acquisition and Processing. In: Proc. SPIE Conf. Multispectral Image Processing and Pattern Recognition, Wuhan, China, pp. 29–33 (October 2001)
2. Baert, S.A.M., Viergever, M.A., Niessen, W.J.: Guide wire tracking in endovascular interventions. IEEE Transactions on Medical Imaging 22, 965–972 (2003)
3. Meunier, J., Bourassa, M., Bertrand, M., Verreault, M., Mailloux, G.: Regional epicardial dynamics computed from coronary cineangiograms. Comput. Cardiol., 307–310 (1989)
4. Hamarneh, G., Althoff, K., Gustavsson, T.: Snake Deformations Based on Optical Flow Forces for Contrast Agent Tracking in Echocardiography. In: Swedish Symposium on Image Analysis, pp. 45–48 (2000)
5. Ruan, S., Bruno, A., Collorec, R., Coatrieux, J.: 3-D motion and reconstruction of coronary networks. In: Proc. IEEE EMBS, vol. 5, pp. 2048–2049 (1992)
6. Cañero, C., Radeva, P., Toledo, R., Villanueva, J., Mauri, J.: 3-D curve reconstruction by biplane snakes. In: Proc. ICPR, vol. 4, pp. 563–566 (2000)
7. Shechter, G., Devernay, F., Coste-Manière, E., Quyyumi, A., McVeigh, E.R.: Three-Dimensional Motion Tracking of Coronary Arteries in Biplane Cineangiograms. IEEE Transactions on Medical Imaging 22(4), 493–503 (2003)
8. Cheriet, F., Meunier, J., Lesperance, J., Bertrand, M.: 3D motion/structure estimation using temporal and stereoscopic point matching in biplane cineangiography. Computers in Cardiology, pp. 409–412 (September 8-11, 1996)
9. Fallavollita, P., Cheriet, F.: Towards an Automatic Coronary Artery Segmentation Algorithm. In: Proceedings of the 28th IEEE EMBS Annual International Conference, pp. 3037–3040 (2006)
10. Xu, C., Prince, J.L.: Generalized gradient vector flow external forces for active contours. Signal Processing 71, 131–139 (1998)
Classification of Breast Tissues in Mammogram Images Using Ripley's K Function and Support Vector Machine

Leonardo de Oliveira Martins(1), Geraldo Braz Junior(1), Erick Corrêa da Silva(1), Aristófanes Corrêa Silva(1), and Anselmo Cardoso de Paiva(2)

(1) Federal University of Maranhão - UFMA, Department of Electrical Engineering, Av. dos Portugueses, SN, Campus do Bacanga, Bacanga 65085-580, São Luís, MA, Brazil
[email protected], [email protected], [email protected], [email protected]
(2) Federal University of Maranhão - UFMA, Department of Computer Science, Av. dos Portugueses, SN, Campus do Bacanga, Bacanga 65085-580, São Luís, MA, Brazil
[email protected]

Abstract. Female breast cancer is a major cause of death in western countries. Several computer techniques have been developed to aid radiologists in improving their performance in the detection and diagnosis of breast abnormalities. In Point Pattern Analysis, there is a statistic known as Ripley's K function that is frequently applied to Spatial Analysis in Ecology, for example for mapping specimens of plants. This paper proposes a new way of applying Ripley's K function in order to distinguish Mass and Non-Mass tissues in mammogram images. The features of each image are obtained through the calculation of that function. Then, the samples obtained are classified by a Support Vector Machine (SVM) as Mass or Non-Mass tissues. SVM is a machine-learning method, based on the principle of structural risk minimization, which performs well when applied to data outside the training set. Another way of computing Ripley's K function, using concentric rings instead of a circle, is also examined. The best result achieved was 94.25% accuracy, 94.59% sensitivity and 94.00% specificity.

Keywords: Mammogram, Breast Tissues Classification, Ripley's K Function, Texture Analysis, SVM, Point Pattern Analysis.
1 Introduction
Breast cancer is the major cause of death by cancer in the female population. It is known that the best prevention method is early diagnosis, which lessens the mortality and improves the treatment [1]. According to the American National Cancer Institute [2], it is estimated that every three minutes a woman is diagnosed with breast cancer and every 13 minutes a woman dies from the disease.
Mammography is currently the best technique for reliable detection of early, non-palpable, potentially curable breast cancer [1]. In 1995, the mortality rate from this disease decreased for the first time, due in part to the increasing utilization of screening mammography [1]. However, the interpretation of the image is a repetitive task that requires much attention to minute detail, and radiologists vary in their interpretation of mammograms. Digital mammography represents an enormous advance in the detection and diagnosis of breast abnormalities. Through image processing techniques, it is possible to enhance the contrast, color, and sharpness of a digital mammogram. Thus, several possible breast abnormalities can become visible to human beings. Therefore, in the past decade there has been tremendous interest in the use of image processing and analysis techniques for Computer Aided Detection (CAD)/Diagnostics (CADx) in digital mammograms. The goal has been to increase diagnostic accuracy as well as the reproducibility of mammographic interpretation. CAD/CADx systems can aid radiologists by providing a second opinion and may be used in the first stage of examination in the near future, providing the reduction of the variability among radiologists in the interpretation of mammograms. Many techniques of texture analysis have been developed in order to identify suspicious regions on mammograms. The principal difficulty in this task is the non-existence of a general algorithm that produces good results for all images. Wei and Chan [3] investigated the feasibility of using multiresolution texture analysis for differentiation of masses from normal breast tissue on mammograms, using texture features based on wavelet coefficients and variable distances. They reached 89% and 86% for the training and test groups, respectively. Recently, Silva et al. [4], [5] showed that a geostatistical function such as the semivariogram supplies good results to discriminate malignant from benign lung nodules. In [6], the semivariogram function was also applied to the characterization of breast tissue as malignant or benign in mammographic images, with a reported sensitivity of 92.8%, specificity of 83.3% and accuracy above 88.0%. A computer-aided neural network classification of suspicious regions on digitized mammograms is presented by Christoyianni et al. [7], where a Radial Basis Function Neural Network (RBFNN) is used to complete the classification, fed by features selected by a technique based on independent component analysis. Experiments on the mini-MIAS database have shown a recognition accuracy of 88.23% in the detection of all kinds of abnormalities and 79.31% in the task of distinguishing between benign and malignant regions, outperforming in both cases standard textural features widely used for cancer detection in mammograms [7]. Also, Land et al. [8] explore the use of different Support Vector Machine (SVM) kernels, and combinations of kernels, to ascertain the diagnostic accuracy of a screen film mammogram data set, improving the average sensitivity by about 4% and the average specificity by 18%, and reaching 100% sensitivity and 98% specificity. Traditionally, texture analysis is accomplished with classical image processing measures, like the histogram, the Spatial Gray Level Dependence Method, the Gray Level
Difference Method, etc. Ripley's K function is frequently applied to Spatial Analysis in Ecology, for example for mapping specimens of plants. In this paper, we intend to investigate the effectiveness of a classification methodology that uses Ripley's K function to calculate input measures for a Support Vector Machine, with the purpose of discriminating tissues in mammographic images into two types, Mass or Non-Mass. Thus, this work's contribution is the modification of the classical application of Ripley's K function and its experimentation for breast tissue characterization, using a Support Vector Machine. Also, we examine another way of computing the local form of Ripley's K function from breast tissue images, using the region between two concentric rings, instead of circles, from a central reference point. This work is organized as follows. In Section 2, we present the techniques for feature extraction and mass diagnosis. Next, in Section 3, the results are shown and we discuss the application of the techniques under study. Finally, Section 4 presents some concluding remarks.
2 Material and Methods
The methodology proposed in this work intends to classify breast tissues on mammograms into Mass and Non-Mass using two versions of the local Ripley's K function to extract texture measures and Support Vector Machines to separate the samples. The first step is image acquisition, which is done by obtaining mammograms and manually selecting regions that correspond to Mass and Non-Mass. After this, a histogram equalization of each extracted region is performed to emphasize characteristics not shown in the previous images. The next step is texture characterization. In this work, we propose the application of Ripley's K function to characterize the texture. In the last step, we use the measures obtained from texture characterization to classify the taken samples into two classes, Mass and Non-Mass. Herein, we propose the use of a Support Vector Machine to do this.
2.1 Image Acquisition
For the development and evaluation of the proposed methodology, we used a publicly available database of digitized screen-film mammograms called the Digital Database for Screening Mammography (DDSM) [9]. It contains 2620 cases acquired from Massachusetts General Hospital, Wake Forest University, and Washington University in St. Louis School of Medicine. The data are comprised of studies of patients from different ethnic and racial backgrounds. The DDSM contains descriptions of mammographic lesions in terms of the American College of Radiology breast imaging lexicon called the Breast Imaging Reporting and Data System (BI-RADS) [9]. Mammograms in the DDSM database were digitized by different scanners depending on the institutional source of the data. A subset of DDSM cases was selected for this
Fig. 1. Process used for acquisition of Mass region on the mammogram. The same is applied to select Non-Mass regions.
study. Cases with Mass lesions were chosen by selecting reports that only included the BI-RADS descriptors for Mass margin and Mass shape. From the 2620 cases, 433 images were selected based on this criterion. From this subset, 868 regions of interest were selected manually, of which 394 represent Mass regions (benign or malignant) and 474 Non-Mass. All Non-Mass regions selected are from images that have at least one Mass region.
2.2 Texture Analysis
Texture can be understood as tonal variations in the spatial domain and determines the overall visual smoothness or coarseness of image features. It reveals important information about the structural arrangements of the objects in the image and their relationship to the environment. Consequently, texture analysis provides important discriminatory characteristics related to variability patterns of digital classifications. Texture processing algorithms are usually divided into three major categories: structural, spectral and statistical [10]. Structural methods consider textures as repetitions of basic primitive patterns with a certain placement rule [11]. Spectral methods are based on the Fourier transform, analyzing the power spectrum [11]. The third and most important group in texture analysis is represented by statistical methods, which are mainly based on statistical parameters such as the Spatial Gray Level Dependence Method-SGLDM, Gray Level Difference Method-GLDM, Gray Level Run Length Matrices-GLRLM [12], [13], [14]. In practice, some of the most usual terms used by interpreters to describe textures, such as smoothness or coarseness, bear a strong degree of subjectivity and do not always have a precise physical meaning. Analysts are capable of visually extracting textural information from images, but it is not easy for them to establish an objective model to describe this intuitive concept. For this reason, it has been necessary to develop quantitative approaches to obtain texture
descriptors. Thus, in a statistical context, textures can be described in terms of an important conceptual component associated with pixels (or other units): their spatial association. This component is frequently analyzed at the global level by quantifying the aggregation or dispersion of the element under study [15]. In this work, the texture analysis is done by quantifying the spatial association between individual pixel values from the tissue image by applying the local form of Ripley's K function, which will be discussed in a following subsection.

Ripley's K Function

Patterns of point-based objects in two or three dimensions, or on the surface of the terrestrial or celestial spheres, are commonplace; some examples are towns in a region, trees in a forest and galaxies in space. Other spatial patterns, such as a sheet of biological cells, can be reduced to a pattern of points [16]. Most systems in the natural world are not spatially homogeneous but exhibit some kind of spatial structure. As the name suggests, point pattern analysis comprises a set of tools for looking at the distribution of discrete points [17], for example individual pixels in an image that have been mapped to Cartesian coordinates (x, y). Point pattern analysis has a long history in statistics, and the great majority of its methods focus on a single distance measurement. There are many indices, most of them using the Poisson distribution [18] as the underlying model for inferences about pattern, used to quantify the intensity of pattern at multiple scales. Point patterns can be studied by first-order and second-order analysis. The first-order approach uses point-to-point mean distance or derives a mean area per point, and then inverts this to estimate a mean point density from which the test statistics about the expected point density are derived [17]. Second-order analysis looks at a larger number of neighbors beyond the nearest neighbor. This group of methods is used to analyze the mapped positions of objects in the plane or space, such as the stems of trees, and assumes a complete census of the objects of interest in the zone (area or volume) under study [17]. One of the most commonly used second-order methods is Ripley's K function. Ripley's K function is a tool for the analysis of completely mapped spatial point process data, i.e. data on the locations of events. These are usually recorded in two dimensions, but they may be locations along a line or in 3D space. Completely mapped data include the locations of all events in a predefined study area. Ripley's K function can be used to summarize a point pattern, test hypotheses about the pattern, estimate parameters and fit models [16]. Ripley's K method is based on the number of points tallied within a given distance or distance class. Its typical definition for a given radius, t, is

K(t) = \frac{A}{n^2} \sum_{i} \sum_{j} \delta(d_{ij}), \quad i \neq j    (1)

where A is the area sampled, n is the total number of points and δ is an indicator function that is 1 if the distance d_{ij} between the points at locations i and j is lower than the radius t, and 0 otherwise. In other words, this method counts the number of points within a circle of radius t of each point.
It is usual to assume isotropy, i.e. that one unit of distance in the vertical direction has the same effect as one unit of distance in the horizontal direction. Although it is usual to assume stationarity (the minimal assumption under which inference is possible from a single observed pattern), K(t) is interpretable for nonstationary processes because K(t) is defined in terms of a randomly chosen event. As every point in the sample is taken once to center a plot circle, Ripley's K function provides an inference at the global level of the element under study. However, this measure can also be considered in a local form for the i-th point [19]:

K_i(t) = \frac{A}{n} \sum_{i \neq j} \delta(d_{ij})    (2)
In this work, we also propose a different way of computing the local form of Ripley's K function from breast tissue images. It basically consists in using Equation 2 while taking as the sampling area A the region between two concentric rings, as suggested in Figure 2. This simple modification has the advantage that one can examine pixels at a specific distance more precisely, instead of collecting information from the whole circle. In Section 3, we show the results obtained with the traditional Ripley's K function and with the proposed modification, so we can determine the most effective approach for the discrimination of breast tissues into Mass and Non-Mass regions.
Fig. 2. Schematic illustration of a modified computation of local Ripley’s K function
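As an illustration of the two sampling geometries, a small NumPy sketch follows. It reflects one plausible reading of Eq. (2) and of Fig. 2 (circles versus rings centered on a reference pixel, normalization A/n as in the text, and one value per gray level and radius as described later in Sec. 3); the function names and the per-gray-level loop are assumptions of this example, not the authors' code.

import numpy as np

def local_k(points, center, t, t_prev=None):
    """Local Ripley K at `center` for an array of (row, col) points.

    With t_prev=None the sampling region is the circle of radius t
    (A = pi*t^2, Eq. 2); otherwise it is the ring between t_prev and t,
    as in the proposed modification.
    """
    n = len(points)
    if n == 0:
        return 0.0
    d = np.hypot(points[:, 0] - center[0], points[:, 1] - center[1])
    if t_prev is None:
        area = np.pi * t ** 2
        count = np.count_nonzero(d < t)
    else:
        area = np.pi * (t ** 2 - t_prev ** 2)
        count = np.count_nonzero((d >= t_prev) & (d < t))
    return area * count / n

def k_per_gray_level(img, radii, use_rings=False):
    """One K value per (gray level, radius), centered on the central pixel."""
    center = (img.shape[0] / 2.0, img.shape[1] / 2.0)
    rows, cols = np.indices(img.shape)
    feats = []
    for g in range(1, int(img.max()) + 1):
        pts = np.column_stack((rows[img == g], cols[img == g]))
        prev = 0.0
        for t in radii:
            feats.append(local_k(pts, center, t, prev if use_rings else None))
            prev = t
    return np.asarray(feats)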
2.3 Support Vector Machine
The Support Vector Machine (SVM), introduced by V. Vapnik in 1995, is a method to estimate a function classifying the data into two classes [20], [21]. The basic idea of the SVM is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized. The SVM term comes from the fact that the points in the training set which are closest to the decision surface are called support vectors. SVM achieves this by the structural risk minimization principle, which is based on the fact that the error rate of a learning machine on the test data is bounded by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis (VC) dimension.
The process starts with a training set of points x_i ∈ R^n, i = 1, 2, ..., l, where each point x_i belongs to one of two classes identified by the label y_i ∈ {−1, 1}. The goal of maximum margin classification is to separate the two classes by a hyperplane such that the distance to the support vectors is maximized. The construction can be thought of as follows: each point x in the input space is mapped to a point z = Φ(x) of a higher dimensional space, called the feature space, where the data are linearly separated by a hyperplane. The nature of the data determines how the method proceeds. There are data that are linearly separable, nonlinearly separable and with impossible separation. This last case can still be treated by the SVM. The key property in this construction is that we can write our decision function using a kernel function K(x, y) which is given by the function Φ(x) that maps the input space into the feature space. Such a decision surface has the equation

f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b    (3)
where K(x, x_i) = Φ(x)·Φ(x_i), and the coefficients α_i and b are the solutions of a convex quadratic programming problem [20], namely

\min_{W,b,\xi} \; \frac{1}{2} W^T W + c \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i\left(W^T \phi(x_i) + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0    (4)

where c > 0 is a parameter to be chosen by the user, which corresponds to the strength of the penalty on errors. The ξ_i's are slack variables that penalize training errors and W = {w_1, w_2, ..., w_n} is the weight vector of the hyperplane. Classification of a new data point x is performed by computing the sign of the right side of Equation 3. An important family of kernel functions is the Radial Basis Function, more commonly used for pattern recognition problems, which has been used in this paper, and is defined by

K(x, y) = e^{-\gamma \|x - y\|^2}    (5)
where γ > 0 is a parameter that is also defined by the user.

2.4 Validation of the Classification Methods
In order to evaluate the classifier with respect to its differentiation ability, we have analyzed its sensitivity, specificity and accuracy. Sensitivity is defined by TP/(TP + FN), specificity is defined by TN/(TN + FP), and accuracy is defined by (TP + TN)/(TP + TN + FP + FN), where TP is true-positive, TN is true-negative, FN is false-negative, and FP is false-positive. Herein, true-positive means Mass samples were correctly classified as Mass. The meaning of the other ones is analogous.
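A compact sketch of this classification and validation step is shown below; it uses scikit-learn's SVC as a stand-in for the LIBSVM library cited in Sec. 3 (scikit-learn wraps LIBSVM), with the c and γ values quoted there for the traditional-K data set, and it assumes label 1 for Mass and 0 for Non-Mass samples.

import numpy as np
from sklearn.svm import SVC

def evaluate(x_train, y_train, x_test, y_test,
             c=512.0, gamma=3.0517578125e-5):
    """Train an RBF-kernel SVM and report sensitivity, specificity, accuracy."""
    clf = SVC(C=c, kernel="rbf", gamma=gamma).fit(x_train, y_train)
    pred = clf.predict(x_test)

    tp = np.sum((pred == 1) & (y_test == 1))
    tn = np.sum((pred == 0) & (y_test == 0))
    fp = np.sum((pred == 1) & (y_test == 0))
    fn = np.sum((pred == 0) & (y_test == 1))

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy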
3 Results
The regions of interest (ROI) were manually extracted from each image based on the information provided by the database DDSM. The ROIs are square region
sub-images defined to completely enclose the DDSM-described abnormality. To perform the experiments, we took 394 ROIs representing Masses and 474 representing Non-Masses. We generated two data sets: one was generated by applying the traditional K_i function, and the other was generated by applying the modified version of this function. First, we applied the traditional form of the local Ripley's K function. In order to find the maximum possible information about the masses, we used the original images and also quantized them to 3, 4, 5, 6, and 7 bits (or 8, 16, 32, 64 and 128 gray levels, respectively). For each quantization level we applied Equation 2 to each individual pixel value, with the area A = π × t². For example, for an image quantized to 8 bits, we obtained the 256 function values of K_i(t); K_i(t) was first obtained for the pixels with density equal to 1, then for the pixels with density equal to 2, and so on, until K_i(t) was obtained for the pixels with density equal to 255. For the purpose of carrying out the analysis along the entire tissue image, we performed the analysis using six different values of the radius t. So, in order to find the maximum radius value, we took the image's central pixel and then found the one farthest from it. Each circle radius t_i, i = 1, ..., 6, can assume the values t = {(1/6)d, (1/3)d, (1/2)d, (2/3)d, (5/6)d, d}, where d is the distance between the central pixel and the farthest one. This approach makes it possible to observe the spatial association among individual pixel values at different locations, from central to peripheral zones of the tissues. Thus, we obtained a set of 3024 (equal to 8 + 16 + 32 + 64 + 128 + 256 gray levels × 6 concentric circles) different values for each study sample. To make the computation feasible we needed to select, from all the obtained measures, the minimum set that has the power to discriminate Mass from Non-Mass tissues. To do so, we used the stepwise selection [22] technique, which reduced the number of variables from 3024 to 77. Next, we generated the samples using the proposed modification of the local Ripley's K function. Again, original and quantized images were used, with six different values of the radius t, totaling 3024 variables. However, in this new approach, the area is no longer given by a circle with radius t, but by a region formed by two concentric rings taken from a central point, in accordance with Section 2.2. We use six different regions formed by the points between two concentric rings of radii r_1 = t_i and r_0 = t_{i-1}, where r_0 = 0 for i = 1. The stepwise technique was used again, and the number of relevant variables decreased to 102. Figure 3 shows the application of the K_i function and its modified computation to two samples of Mass and Non-Mass tissues, using a subset of 16 K_i function values from the total of relevant variables, representing the typical behaviour of the measures. It was observed that for this subset the Non-Mass values present a smaller variance. Differently, for the Mass ones, the K function presents high ratios for some individual variable values. The next step was the classification of each sample, using each data set independently, through an SVM classifier. A library for Support Vector Machines,
Fig. 3. Ripley’s K function applied to two samples of Non-Mass and Mass tissues
called LIBSVM [23], was used for training and testing the SVM classifier. We used the Radial Basis Function as kernel. For each data set, we generated several pairs of subsets with 694 samples for training and 174 samples for testing. In order to show the best performance results, we selected only the five pairs of subsets that present the best test accuracy, i.e., the ones that produce the least generalization error. Considering only the data generated by the traditional Ripley's K function, Table 1 shows the performance measures for each experiment. The SVM parameters used here were c = 512 and γ = 3.0517578125 × 10⁻⁵.

Table 1. Results from running SVM with traditional Ripley's K function

Experiment   Specificity (%) Train/Test   Sensitivity (%) Train/Test   Accuracy (%) Train/Test   Global Accuracy (%)
1            88.98 / 87.25                91.93 / 90.28                90.35 / 88.51             89.98
2            90.82 / 85.88                90.82 / 89.89                90.49 / 87.93             89.98
3            88.77 / 84.00                90.00 / 93.24                89.34 / 87.93             89.06
4            89.53 / 91.30                90.06 / 84.15                89.77 / 87.93             89.40
5            90.36 / 91.11                91.61 / 84.52                90.92 / 87.93             90.32
Analyzing only the test results, we can see that the best result was 88.51% accuracy, 90.28% sensitivity and 87.25% specificity. A more detailed analysis of the test results shows that the methodology presents an average sensitivity of 88.42%, an average specificity of 87.91% and an average accuracy of 88.04%. The global accuracy is the sum of all true positives and true negatives detected by the classification divided by the total number of samples, and its average value is equal to 89.75%. For the modified version of the K_i function, once more, only the five pairs of subsets that present the best test accuracy in SVM classification were selected. The SVM
Table 2. Results from running SVM with the modified Ripley's K function

Experiment   Specificity (%) Train/Test   Sensitivity (%) Train/Test   Accuracy (%) Train/Test   Global Accuracy (%)
1            95.72 / 94.00                95.31 / 94.59                95.53 / 94.25             95.28
2            95.81 / 94.57                96.15 / 90.24                95.97 / 92.53             95.28
3            95.56 / 96.70                96.14 / 87.95                95.82 / 92.53             95.16
4            95.44 / 94.06                95.95 / 90.41                95.68 / 92.53             95.05
5            96.04 / 96.84                96.51 / 87.34                96.25 / 92.53             95.51
parameters used in these experiments were c = 32768 and γ = 0.00048828125. Table 2 shows the performance measures for each experiment, using this function. For the modified function, the best test result was 94.25% accuracy, 94.59% sensitivity and 94.00% specificity. This approach presents an average sensitivity of 90.11%, an average specificity of 95.23% and an average accuracy of 92.87%. The average global accuracy is equal to 95.25%. Although the two approaches present good results, we can see clearly that, for the specific problem at hand, the modified Ripley's K function presents the best discriminatory power. This can be explained, as we saw in Section 2.2, by the fact that the use of circles as the region of study provides cumulative information from the central to the peripheral zones of the image. In contrast, the use of concentric rings eliminates possible interference from central regions, giving more precise information about elements located at a larger distance from the center. Thus, the local Ripley's K function and its proposed modification described in this paper provide good support for breast tissue characterization. Also, the use of SVM classification with these measures brings good generalization power from training to test data. The success of these experiments is very encouraging for further investigation and utilization of more complicated databases, with a larger number of samples.
4 Conclusion
This paper has presented a new application of a function used traditionally in point pattern analysis, with the purpose of characterizing breast tissues as Mass or Non-Mass. The measures extracted from Ripley’s K function were analyzed and had great discriminatory power, through Support Vector Machine classification. The best test result was 88.51% of accuracy, 90.28% of sensitivity and 87.25% of specificity. In addition, we show that a simple modification in Ripley’s K function computation can increase its discriminatory power, through the use of concentric rings instead of a circle as sampling region. Using this approach, the best test result was 94.25% of accuracy, 94.59% of sensitivity and 94.00% of specificity. The presented results are very encouraging, and they constitute strong evidence that Ripley’s K function is an important measure to incorporate into a CAD software in order to distinguish Mass and Non-Mass tissues. Nevertheless,
there is the need to perform tests with other databases, with more complex cases, in order to obtain a more precise behavior pattern. As future work, we propose the use of other textural measures jointly with Ripley's K function, in order to try to reduce the number of false negatives to zero and to find other possible patterns of Mass and Non-Mass tissues. We also suggest the application of the function to provide features for the automatic detection of masses. Finally, we suggest applying the proposed methodology to the calcification detection and diagnosis problem, using SVM classification.
Acknowledgements

Leonardo de Oliveira Martins acknowledges receiving scholarships from CAPES (Coordination for the Development of Graduate and Academic Research). Geraldo Braz Junior and Erick Corrêa da Silva acknowledge receiving scholarships from the CNPq (National Research Council of Brazil). The authors also acknowledge CAPES for financial support (process number 0044/05-9).
References 1. (AMS), A.C.S.: Learn about breast cancer (2006), available at http:// www.cancer.org 2. (NCI), N.C.I.: Cancer stat fact sheets: Cancer of the breast (2006), available at http://seer.cancer.gov/statfacts/html/breast.html 3. Wei, D., Chan, H., Helvie, M., Sahiner, B., Petrick, N., Adler, D., Goodsitt, M.: Classification of mass and normal breast tissue on digital mammograms: Multiresolution texture analysis. Medical Physics 22, 1501 (1995) 4. Silva, A.C., Carvalho, P.C.P., Gattass, M.: Analysis of spatial variability using geostatistical functions for diagnosis of lung nodule in computerized tomography images. Pattern Analysis & Applications 7(3), 227–234 (2004) 5. Silva, A., Carvalho, P., Gattass, M.: Diagnosis of lung nodule using semivariogram and geometric measures in computerized tomography images. Computer Methods and Programs in Biomedicine 79(1), 31–38 (2005) 6. Paiva Jr., V.R.S., Silva, A.C., Oliveira, A.C.M.: Semivariogram applied for classification of benign and malignant tissues in mammography. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4142, pp. 570–579. Springer, Heidelberg (2006) 7. Christoyianni, I., Koutras, A., Dermatas, E., Kokkinakis, G.: Computer aided diagnosis of breast cancer in digitized mammograms. Comput. Med. Imaging Graph 26(5), 309–319 (2002) 8. Land Jr., W., Wong, L., McKee, D., Embrechts, M., Salih, R., Anderson, F.: Applying support vector machines to breast cancer diagnosis using screen film mammogram data. In: Computer-Based Medical Systems. CBMS 2004. Proceedings. 17th IEEE Symposium, pp. 224–228. IEEE Computer Society Press, Los Alamitos (2004) 9. Heath, M., Bowyer, K., Kopans, D.: Current status of the digital database for screening mammography. In: Digital Mammography, pp. 457–460. Kluwer Academic Publishers, Dordrecht (1998)
10. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Addison-Wesley, Reading, MA, USA (1992) 11. Meyer-Baese, A.: Pattern Recognition for Medical Imaging. Elsevier, Amsterdam (2003) 12. Kovalev, V.A., Kruggel, F., Gertz, H.J., Cramon, D.Y.V.: Three-dimensional texture analysis of MRI brain datasets. IEEE Transactions on Medical Imaging 20(5), 424–433 (2001) 13. Li, X.: Texture analysis for optical coherence tomography image. Master’s thesis, The University of Arizona (2001) 14. Mudigonda, N.R., Rangayyan, R.M., Desautels, J.E.L.: Gradient and texture analysis for the classification of mammographic masses. IEEE Transactions on Medical Imaging 19(10), 1032–1043 (2000) 15. Scheuerell, M.D.: Quantifying aggregation and association in three dimensional landscapes. Ecology 85, 2332–2340 (2004) 16. Ripley, B.D.: Modelling spatial patterns. J. Roy. Statist. Soc. B 39, 172–212 (1977) 17. Urban, D.L.: Spatial analysis in ecology - point pattern analysis, available at (2003), http://www.nicholas.duke.edu/lel/env352/ripley.pdf 18. Papoulis, A., Pillai, S.U.: Probability, Random Variables and Stochastic Processes, 4th edn. McGraw-Hill, New York (2002) 19. Dale, M.R.T., Dixon, P., Fortin, M.J., Legendre, P., Myers, D.E., Rosenberg, M.S.: Conceptual and mathematical relationships among methods for spatial analysis. ECOGRAPHY 25, 558–577 (2002) 20. Haykin, S.: Redes Neurais: Princ´ıpios e Pr´ atica, 2nd edn. Bookman, Porto Alegre (2001) 21. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Kluwer Academic Publishers, Dordrecht (1998) 22. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. WileyInterscience Publication, New York (1973) 23. Chang, C.C., Lin, C.J.: LIBSVM – a library for support vector machines (2003), Available at http://www.csie.ntu.edu.tw/cjlin/libsvm/
Comparison of Class Separability, Forward Sequential Search and Genetic Algorithms for Feature Selection in the Classification of Individual and Clustered Microcalcifications in Digital Mammograms

Rolando R. Hernández-Cisneros, Hugo Terashima-Marín, and Santiago E. Conant-Pablos

Tecnológico de Monterrey, Campus Monterrey, Centro de Sistemas Inteligentes, Av. Eugenio Garza Sada 2501, Monterrey, N.L. 64849 Mexico
{a00766380,terashima,sconant}@itesm.mx
Abstract. The presence of microcalcification clusters in digital mammograms is a primary indicator of early stages of malignant types of breast cancer and its detection is important to prevent the disease. This paper uses a procedure for the classification of microcalcification clusters in mammograms using sequential Difference of Gaussian filters (DoG) and feedforward Neural Networks (NN). Three methods using class separability, forward sequential search and genetic algorithms for feature selection are compared. We found that the use of Genetic Algorithms (GAs) for selecting the features from microcalcifications and microcalcification clusters that will be the inputs of a feedforward Neural Network (NN) results mainly in improvements in overall accuracy, sensitivity and specificity of the classification.
1 Introduction
Breast cancer is one of the main causes of death in women and early diagnosis is an important means to reduce the mortality rate. Mammography is one of the most common techniques for breast cancer diagnosis, and microcalcifications are one among several types of objects that can be detected in a mammogram. Microcalcifications are calcium accumulations typically 100 microns to several mm in diameter, and they sometimes are early indicators of the presence of breast cancer. Microcalcification clusters are groups of three or more microcalcifications that usually appear in areas smaller than 1 cm2 , with a probability of becoming a malignant lesion. However, the predictive value of mammograms is relatively low, compared to biopsy. This low sensitivity is caused by the low contrast between the cancerous tissue and the normal parenchymal tissue, the small size of microcalcifications and possible deficiencies in the image digitalization process. The sensitivity may be improved having each mammogram checked by two or more radiologists,
with the consequence of making the process inefficient by reducing the individual productivity of each specialist. A viable alternative is replacing one of the radiologists by a computer system, giving a second opinion [1]. The method we selected for the detection of suspicious points is the Difference of Gaussian Filters (DoG). DoG filters are adequate for the noise-invariant and size-specific detection of spots, resulting in a DoG image. This DoG image represents the microcalcifications if a thresholding operation is applied to it. Neural networks (NNs) have been successfully used for classification purposes in medical applications. Unfortunately, for a NN to be successful in a particular domain, its architecture, training algorithm and the domain variables selected as inputs must be adequately chosen. Designing a NN architecture is a trial-and-error process; several parameters must be tuned according to the training data when a training algorithm is chosen and, finally, a classification problem could involve too many variables (features), most of them not relevant at all for the classification process itself. Genetic algorithms (GAs) may be used to address the problems mentioned above, helping to obtain more accurate NNs with better generalization abilities. GAs have been used for searching the optimal weight set of a NN, for designing its architecture, and for finding its most adequate parameter set (number of neurons in the hidden layer(s), learning rate, etc.), among other tasks. An exhaustive review of evolutionary artificial neural networks (EANNs) is presented by Yao [2]. In particular, this paper describes the use of GAs for selecting the most relevant features extracted from both individual microcalcifications and microcalcification clusters, which will become the inputs of two simple feedforward NNs for their classification, with the expectation of improving their accuracy. We compare this approach to feature selection with one of the methods used in [3] and [4], ordering features according to their class separability and then selecting the most relevant ones as inputs to a NN. We also compare the GA approach with the method we used in [5], which was based on a forward sequential search, sequentially adding inputs to the NN while its error decreases and stopping when it starts to increase. The GA provides a broader, parallel search in the feature space, simultaneously managing a population of feature subsets. The rest of this document is organized as follows. In the second section, the methodology is discussed. The third section deals with the experiments and the main results of this work. Finally, in the fourth section, the conclusions are presented, and some comments about future work are also made.
2 Methodology
The mammograms used in this project were provided by The Mammographic Image Analysis Society (MIAS) [6]. The MIAS database contains 322 images, all medio-lateral (MLO) view, digitized at resolutions of 50 microns/pixel and 200 microns/pixel. In this work, the images with a resolution of 200 microns/pixel were used. The data has been reviewed by a consultant radiologist and all the abnormalities have been identified and marked. The truth data consists of the
location of the abnormality and the radius of a circle which encloses it. From the totality of the database, only 25 images contain microcalcifications. Among these 25 images, 13 cases are diagnosed as malignant and 12 as benign. The general procedure receives digital mammograms as input and is composed of five stages: pre-processing, detection of potential microcalcifications (signals), classification of signals into real microcalcifications, detection of microcalcification clusters, and classification of microcalcification clusters into benign and malignant.
2.1 Pre-processing
This stage has the aim of eliminating those elements in the images that could interfere in the process of identifying microcalcifications. A secondary goal is to reduce the work area only to the relevant region that exactly contains the breast. The procedure receives the original images as input. A median filter is applied in order to eliminate the background noise and then an automatic cropping procedure deletes the background marks and the isolated regions, so the image will contain only the region of interest. The result of this stage is a smaller image, with less noise.
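A minimal sketch of this pre-processing step follows. The median window size, the background threshold, and the "largest connected component" rule used to isolate the breast are illustrative assumptions, not the exact procedure of the authors.

import numpy as np
from scipy import ndimage

def preprocess(mammogram, median_size=5, background_level=10):
    """Median-filter the mammogram and crop it to the breast region.

    Assumes the breast is the largest connected component brighter than
    `background_level`; film markers and isolated regions are discarded.
    """
    denoised = ndimage.median_filter(mammogram, size=median_size)

    mask = denoised > background_level
    labels, n_regions = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=range(1, n_regions + 1))
    breast = labels == (int(np.argmax(sizes)) + 1)

    # Crop to the bounding box of the breast component.
    rows = np.any(breast, axis=1)
    cols = np.any(breast, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return denoised[r0:r1 + 1, c0:c1 + 1]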
2.2 Detection of Potential Microcalcifications (Signals)
The main objective of this stage is to detect the mass centers of the potential microcalcifications in the image (signals). The pre-processed image of the previous stage is the input of this procedure. The optimized difference of two gaussian filters (DoG) is used for enhancing those regions containing bright points. A gaussian filter is obtained from a gaussian distribution, and when it is applied to an image, eliminates high frequency noise, acting like a smoothing filter. A DoG filter is built from two simple gaussian filters. These two smoothing filters must have different variances. When two images, obtained by separately applying each filter, are subtracted, then an image containing only the desired frequency range is obtained. The DoG filter is obtained from the difference of two gaussian functions, as it is shown in equation (1), where x and y are the coordinates of a pixel in the image, k is the height of the function and σ1 and σ2 are the standard deviations of the two gaussian filters that construct the DoG filter.

DoG(x, y) = k_1 e^{-(x^2 + y^2)/2\sigma_1^2} - k_2 e^{-(x^2 + y^2)/2\sigma_2^2}    (1)
The resultant image after applying a DoG filter is globally binarized, using a certain threshold. In Figure 1, an example of the application of a DoG filter is shown. A region-labeling algorithm allows the identification of each one of the points (defined as high-contrast regions detected after the application of the DoG filters, which cannot be considered microcalcifications yet). Then, a segmentation algorithm extracts small 9x9 windows containing the region of interest whose centroid corresponds to the centroid of each point. The size of the windows is adequate for containing the signals, given that at the current
Fig. 1. Example of the application of a DoG filter (5x5, 7x7)
resolution of 200 microns, the potentially malignant microcalcifications (whose diameter is typically 100 microns to several mm) have an area of 5x5 pixels on average [7]. In order to detect the greatest possible number of points, six gaussian filters of sizes 5x5, 7x7, 9x9, 11x11, 13x13 and 15x15 are combined, two at a time, to construct 15 DoG filters that are applied sequentially. Each one of the 15 DoG filters was applied 51 times, varying the binarization threshold. The points obtained by applying each filter are added to the points obtained by the previous one, deleting the repeated points. The same procedure is repeated with the points obtained by the remaining DoG filters. All of these points are passed later to three selection procedures. These three selection methods are applied in order to transform a point into a signal (potential microcalcification). The first method performs selection according to the object area, choosing only the points with an area between a predefined minimum and a maximum. For this work, a minimum area of 1 pixel (0.0314 mm²) and a maximum of 77 pixels (3.08 mm²) were considered. The second method performs selection according to the gray level of the points. Studying the mean gray levels of the pixels surrounding real identified microcalcifications, it was found that they have values in the interval [102, 237] with a mean of 164. For this study, we set the minimum gray level for points to be selected to 100. Finally, the third selection method uses the gray gradient (or absolute contrast, the difference between the mean gray level of the point and the mean gray level of the background). Again, studying the mean gray gradients of the points surrounding real identified microcalcifications, it was found that they have values in the interval [3, 56] with a mean of 9.66. For this study, we set the minimum gray gradient for points to be selected to 3. The result of these three selection processes is a list of signals (potential microcalcifications) represented by their centroids.
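To make the DoG step concrete, the following sketch builds one DoG response with SciPy Gaussian filters, binarizes it, and extracts candidate centroids. The σ values and relative threshold are illustrative assumptions (the paper sweeps 15 filter pairs and 51 thresholds), and the three selection rules above are only hinted at here.

import numpy as np
from scipy import ndimage

def dog_candidates(img, sigma_small=1.0, sigma_large=1.5, rel_threshold=0.2):
    """Detect bright-spot candidates with one Difference-of-Gaussians filter.

    Returns centroids of the thresholded DoG regions; in the full method
    these would then go through the area / gray-level / gradient selection.
    """
    img = img.astype(np.float64)
    dog = (ndimage.gaussian_filter(img, sigma_small) -
           ndimage.gaussian_filter(img, sigma_large))

    # Global binarization of the DoG response, then region labeling.
    mask = dog > rel_threshold * dog.max()
    labels, n_regions = ndimage.label(mask)
    return ndimage.center_of_mass(mask, labels, index=range(1, n_regions + 1))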
2.3 Classification of Signals into Real Microcalcifications
The objective of this stage is to identify whether an obtained signal corresponds to an individual microcalcification or not. With this in mind, a set of features related to its contrast and shape is extracted from the signal. From each signal, 47 features are extracted: seven related to contrast, seven related to background
contrast, three related to relative contrast, 20 related to shape, six related to the moments of the contour sequence and the first four Hu invariants. In order to process signals and accurately classify the real microcalcifications, we decided to use NNs as classifiers. In the first section, we mentioned that one of the difficulties of working with conventional feedforward NNs is that a classification problem could involve too many variables (features), and most of them may not be relevant at all for the classification process itself. In [3] and [4], a filter is used to estimate how well a feature separates the data into different classes using the Kullback-Leibler (KL) distance between histograms of feature values. A histogram for each class is obtained, for each feature. Features are discretized using b equally spaced bins, where b = |D|/2, and |D| is the size of the training data. The histograms are later normalized by dividing the number of elements in each bin by the total number of elements, so the probability that the j-th feature takes a value in the i-th bin of the histogram, given a class n, pj(d = i|n), is obtained. For each feature j, the class separability is then calculated as

Δj = Σ_{m=1}^{c} Σ_{n=1}^{c} δj(m, n)    (2)
where c is the number of classes and δj (m, n) is the KL distance between histograms corresponding to classes m and n and b is the number of bins in the histograms. The features are finally ranked by sorting them in descending order of the distances Δj (larger distances mean better separability). In this method, it is considered heuristically that two features are redundant if their distances differ by less than 0.0001, and the feature with the smallest distance is eliminated. Other irrelevant non-discriminative features with Δj distances less than 0.001 are eliminated also. After this process, only 5 features were selected (in decreasing order of the distance Δj ): median gray level, mean gray level, minimum gray level, background maximum gray level, and background mean gray level. In [5], we used a method which consisted of two feature selection processes [8]: the first process attempts to delete the features that present high correlation with other features, and the second process uses a derivation of the forward sequential search algorithm, which is a sub-optimal search algorithm, adding features to a NN while the error is decreasing and stopping when it increases again. After these processes were applied, only three features were selected and used for classification: absolute contrast (the difference between the mean gray levels of the signal and its background), standard deviation of the gray level of the pixels that form the signal and the third moment of contour sequence. Moments of contour sequence are calculated using the signal centroid and the pixels in its perimeter, and are invariant to translation, rotation and scale transformations [9]. Expecting to achieve greater accuracy in the classification, we use a different method, this being based on a GA for selecting features. The chromosomes of the individuals in the GA contain 47 bits, one bit for each extracted feature, and the value of the bit determines whether that feature will be used in the classification or not [10]. The individuals are evaluated by constructing and training
a feedforward NN (with a predetermined structure), and the number of inputs of this NN is determined by the subset of features to be included, coded in the chromosome. The accuracy of each network is used to determine the fitness of each individual. When the GA stops, either because the generations limit has been reached or because no improvement in the evaluation of the best individual has been observed during five consecutive generations, we obtain the NN with the best performance in terms of the overall accuracy, and the subset of features that are relevant for the classification.
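For illustration, a hedged sketch of the class-separability filter of [3] and [4] as described in this section (b = |D|/2 bins, normalized histograms, summed pairwise KL distances). The function name and the numerical guards are illustrative, not taken from the paper:

```python
import numpy as np

def kl_separability(X, y, eps=1e-12):
    """Score each feature by summed pairwise KL distances between class histograms.

    X: (n_samples, n_features) feature matrix; y: class labels.
    Larger scores mean better class separability; sort descending to rank features.
    """
    classes = np.unique(y)
    bins = max(2, X.shape[0] // 2)                 # b = |D| / 2
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        hists = []
        for c in classes:
            h, _ = np.histogram(X[y == c, j], bins=edges)
            hists.append(h / max(h.sum(), 1))      # normalized histogram per class
        for m in range(len(classes)):
            for n in range(len(classes)):
                p, q = hists[m] + eps, hists[n] + eps
                scores[j] += np.sum(p * np.log(p / q))   # KL(p || q)
    return scores
```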
2.4 Detection of Microcalcification Clusters
During this stage, the microcalcification clusters are identified. The detection and subsequent consideration of every microcalcification cluster in the images may produce better results in a later classification process, as shown in [5]. Because of this, an algorithm was developed for locating microcalcification cluster regions where the quantity of microcalcifications per cm2 (density) is high. This algorithm keeps adding microcalcifications to their closest clusters, at a reasonable distance, until no microcalcifications are left or the remaining ones are too distant to be considered part of a cluster. Every detected cluster is then labeled.
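The paper does not spell out the grouping rule beyond this description, so the following is only one plausible greedy implementation; the distance threshold max_dist is a hypothetical parameter of this sketch:

```python
import numpy as np

def group_into_clusters(centroids, max_dist):
    """Greedily group microcalcification centroids into clusters.

    Each point joins the nearest existing cluster if its distance to that
    cluster's running centroid is below max_dist; otherwise it starts a
    new cluster. Returns a list of clusters as lists of point indices.
    """
    points = np.asarray(centroids, dtype=float)
    clusters = []          # lists of point indices
    centers = []           # running cluster centroids
    for i, p in enumerate(points):
        if centers:
            d = np.linalg.norm(np.array(centers) - p, axis=1)
            k = int(np.argmin(d))
            if d[k] <= max_dist:
                clusters[k].append(i)
                centers[k] = points[clusters[k]].mean(axis=0)
                continue
        clusters.append([i])
        centers.append(p)
    return clusters
```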
2.5 Classification of Microcalcification Clusters into Benign and Malignant
This stage has the objective of classifying each cluster into one of two classes: benign or malignant. This information is provided by the MIAS database. From every microcalcification cluster detected in the mammograms in the previous stage, a cluster feature set is extracted. The feature set consists of 30 features: 14 related to the shape of the cluster, six related to the area of the microcalcifications included in the cluster and ten related to the contrast of the microcalcifications in the cluster. In order to process microcalcification clusters and accurately classify them into benign or malignant, we decided again to use NNs as classifiers. To determine which of the 30 features extracted from the clusters are relevant for their classification, the first method used in [3] and [4] and the method we used in [5] were performed in this work. The first procedure uses a filter to estimate how well a feature separates the data into different classes using the Kullback-Leibler (KL) distance between histograms of feature values, as described in a previous section. The features selected by this method were maximum radius, convex perimeter, standard deviation of the distances between microcalcifications, minimum absolute contrast and the standard deviation of the mean gray level of the microcalcifications in a cluster. The second method consisted of two feature selection processes [8]: the first process attempts to delete the features that present high correlation with other features, and the second process uses a derivation of the forward sequential search algorithm, which is a sub-optimal search algorithm, adding features to
a NN while the error is decreasing and stopping when it increases again. After these processes were applied, only three cluster features were selected for the classification process: minimum diameter, minimum radius and mean radius of the clusters. The minimum diameter is the maximum distance that can exist between two microcalcifications within a cluster such that the line connecting them is perpendicular to the maximum diameter, defined as the maximum distance between two microcalcifications in a cluster. The minimum radius is the shortest of the radii connecting each microcalcification to the centroid of the cluster, and the mean radius is the mean of these radii. Trying to improve the accuracy of the classification of the microcalcification clusters, we also applied GAs to the feature selection task. The chromosomes of the individuals in this GA contain 30 bits, one bit for each feature extracted from the clusters, and the value of the bit determines whether that feature will be used in the classification or not. The individuals are evaluated by constructing and training a feedforward NN, where the number of inputs of the NN is determined by the subset of features to be included, which is coded in the chromosome. For solving nonlinearly separable problems, at least one hidden layer is recommended in the network; according to Kolmogorov's theorem [11], and considering the number of inputs (n), the hidden layer contains 2n + 1 neurons. The output layer has only one neuron. The accuracy of each network is used to determine the fitness of each individual. When the GA is stopped, either because the generations limit has been reached or because no improvement in the evaluation of the best individual has been observed during five consecutive generations, we obtain the NN with the best performance (in terms of the overall accuracy) and the subset of cluster features that are the most relevant for the classification.
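A sketch of how one GA individual could be evaluated under the scheme described above (bit mask selects a feature subset, which feeds a network with 2n + 1 hidden tanh units; fitness is the resulting accuracy). scikit-learn's MLPClassifier stands in for the authors' backpropagation network and is an assumption of this sketch:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fitness(chromosome, X_train, y_train, X_test, y_test):
    """Evaluate one GA individual (a bit mask over the feature set)."""
    mask = np.asarray(chromosome, dtype=bool)
    n = int(mask.sum())
    if n == 0:
        return 0.0                                  # empty subsets get the worst fitness
    net = MLPClassifier(hidden_layer_sizes=(2 * n + 1,),   # Kolmogorov 2n+1 rule
                        activation='tanh', max_iter=200)
    net.fit(X_train[:, mask], y_train)
    return net.score(X_test[:, mask], y_test)       # overall accuracy as fitness
```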
3 Experiments and Results
3.1 From Pre-processing to Feature Extraction
As mentioned in the previous section, only 25 images from the MIAS database contain microcalcifications. Among these 25 images, 13 cases are diagnosed as malignant and 12 as benign. Three images were discarded because the positions of the microcalcification clusters, marked in the additional data that comes with the database, were outside the boundaries of the breast. So, only 22 images were finally used for this study, and they were passed through the pre-processing stage first (application of a median filter and trimming). In the second phase, six Gaussian filters of sizes 5x5, 7x7, 9x9, 11x11, 13x13 and 15x15 were combined, two at a time, to construct 15 DoG filters that were applied sequentially. Each one of the 15 DoG filters was applied 51 times to the pre-processed images, varying the binarization threshold in the interval [0, 5] in increments of 0.1. The points obtained by applying each filter were added to the points obtained by the previous one, deleting the repeated points. The same procedure was repeated with the points obtained by the remaining DoG filters. These points passed through the three selection methods for selecting signals
(potential microcalcifications), according to region area, gray level and gray gradient. The result was a list of 1,242,179 signals (potential microcalcifications) represented by their centroids. The additional data included with the MIAS database define, with centroids and radii, the areas in the mammograms where microcalcification clusters are located. It is supposed that signals within these areas are mainly microcalcifications, but there are many signals that lie outside the marked areas. With these data and the support of expert radiologists, all the signals located in these 22 mammograms were pre-classified into microcalcifications and not-microcalcifications. From the 1,242,179 signals, only 4,612 (0.37%) were microcalcifications, and the remaining 1,237,567 (99.63%) were not. Because of this imbalanced distribution of examples of each class, an exploratory sampling was made. Several samplings with different proportions of each class were tested, and finally we decided to use a sample of 10,000 signals, including 2,500 real microcalcifications (25%). After the 47 microcalcification features were extracted from each signal, the first method for feature selection, based on class separability for ranking the features, reduced the relevant features to five: median gray level, mean gray level, minimum gray level, background maximum gray level, and background mean gray level. A transactional database was obtained, containing 10,000 signals (2,500 of them being real microcalcifications randomly distributed) and five features describing each signal. The second approach, based on the forward sequential search, reduced the relevant features to only three: absolute contrast, standard deviation of the gray level of the signal and the third moment of contour sequence. Again, a transactional database was obtained, containing 10,000 signals including 2,500 real microcalcifications randomly distributed, and three features describing each signal. For the third approach, based on the GA, the original transactional database with all 47 features was used.
3.2 Classification of Signals into Microcalcifications
For testing the first two feature selection methods, based on class separability (CS) and forward sequential search (FSS), simple feedforward NNs with the corresponding number of inputs were trained and tested. The architecture of these NNs consisted of five and three inputs respectively, 2n + 1 neurons in the hidden layer (where n is the number of inputs) and one output. All the units had the sigmoid hyperbolic tangent function as the transfer function. The data (input and targets) were scaled in the range [-1, 1] and divided into ten non-overlapping splits, each one with 90% of the data for training and the remaining 10% for testing. Ten-fold crossvalidation trials were performed; that is, the NNs were trained ten times, each time using a different split on the data and the averages of the overall performance, sensitivity and specificity were reported. These results are shown in Table 1, representing the NNs that had the best performance in terms of overall accuracy (percentage of correctly classified microcalcifications). The sensitivity (percentage of true positives or correctly classified microcalcifications)
and specificity (percentage of true negatives or correctly classified objects that are not microcalcifications) of these NNs are also shown. Also, a GA was combined with NNs to select the features to train them, as described earlier. The GA had a population of 50 individuals, each one with a length of l = 47 bits, representing the inclusion (or exclusion) of each one of the 47 features extracted from the signals. We used a simple GA, with Gray encoding, stochastic universal sampling selection, single-point crossover, fitness-based reinsertion and a generational gap of 0.9. The probability of crossover was 0.7 and the probability of mutation was 1/l, where l is the length of the chromosome (in this case, 1/l = 1/47 = 0.0213). The initial population of the GA was always initialized uniformly at random. All the NNs constructed by the GA are feedforward networks with one hidden layer. All neurons have biases with a constant input of 1.0. The NNs are fully connected, and the transfer function of every unit is the sigmoid hyperbolic tangent function. The data (input and targets) were normalized to the interval [-1, 1]. For the targets, a value of "-1" means "not-microcalcification" and a value of "1" means "microcalcification". For training each NN, backpropagation was used, only one split of the data was considered (90% for training and 10% for testing) and the training stopped after 20 epochs. The GA ran for 50 generations, and the results of this experiment are shown in Table 1 on the row "GA".

Table 1. Results of the classification of individual microcalcifications

Method   Sensitivity (%)   Specificity (%)   Overall (%)
CS             50.91            95.94            84.56
FSS            76.21            81.92            81.33
GA             83.33            94.87            95.40
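For illustration, a simplified generational GA with the parameters quoted above (population 50, single-point crossover with probability 0.7, bit-flip mutation with probability 1/l, stochastic universal sampling). Gray encoding, fitness-based reinsertion and the 0.9 generational gap are omitted for brevity, so this is a sketch rather than the authors' exact algorithm; eval_fn could be the fitness sketch given earlier:

```python
import numpy as np

def run_ga(eval_fn, n_bits=47, pop_size=50, generations=50, p_cross=0.7,
           rng=np.random.default_rng(0)):
    """Simple generational GA over feature-selection bit strings."""
    p_mut = 1.0 / n_bits
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(generations):
        fit = np.array([eval_fn(ind) for ind in pop])
        probs = fit / fit.sum() if fit.sum() > 0 else np.full(pop_size, 1 / pop_size)
        # stochastic universal sampling: evenly spaced pointers over cumulative fitness
        start = rng.random() / pop_size
        picks = np.searchsorted(np.cumsum(probs), start + np.arange(pop_size) / pop_size)
        parents = pop[np.minimum(picks, pop_size - 1)]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):           # single-point crossover
            if rng.random() < p_cross:
                cut = int(rng.integers(1, n_bits))
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        flip = rng.random(children.shape) < p_mut     # bit-flip mutation
        pop = np.where(flip, 1 - children, children)
    fit = np.array([eval_fn(ind) for ind in pop])
    return pop[int(np.argmax(fit))]                   # best feature subset found
```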
The best solution found is a NN with 23 inputs (five related to the contrast of the signal, four related to the background contrast, two related to the relative contrast, seven related to the shape, four moments of the contour sequence and only one of the invariant geometric moments), corresponding to 48.94% of the original 47 extracted features. All the NNs coded in the chromosomes of the final population of the GA use 20.02 inputs on average, that is, the NNs with the best performance need only 42.60% of the original 47 features extracted from the microcalcifications.
3.3 Microcalcification Clusters Detection and Classification
The process of cluster detection and the subsequent feature extraction phase generates another transactional database, this time containing the information of every microcalcification cluster detected in the images. A total of 40 clusters were detected in the 22 mammograms from the MIAS database that were used in this study. According to MIAS additional data and the advice of expert radiologists, 10 clusters are benign and 30 are malignant. The number of features extracted
from them is 30, but after the two feature selection processes already discussed in previous sections, the number of relevant features we considered relevant was five, in the case of the class separability (CS) procedure, and three in the case of the forward sequential search method (FSS). As in the stage of signal classification, simple feedforward NNs with five and three inputs respectively (corresponding to the number of features from the clusters selected by the first two processes) were trained and tested. The architecture of these NNs had five/three inputs, 2n + 1 neurons in the hidden layer (where n is the number of inputs) and only one output. The sigmoid hyperbolic tangent function was used as the transfer function for every neuron. The data (input and targets) were scaled in the range [-1, 1] and divided into ten non-overlapping splits, each one with 90% of the data for training and the remaining 10% for testing. Ten-fold crossvalidation trials were performed; that is, the NNs were trained ten times, each time using a different split on the data and the averages of the overall performance, sensitivity and specificity were reported. These results are shown in Table 2, representing the NNs that had the best performance in terms of overall accuracy (percentage of correctly classified clusters). The sensitivity (percentage of true positives or correctly classified malignant clusters) and specificity (percentage of true negatives or correctly classified benign clusters) of these NNs are also shown. A GA was also used to select the features for training ANNs, as described earlier. In this case, the transactional database containing the 30 features extracted from the clusters was used. The GA had a population of 50 individuals, each one with a length of l = 30 bits, representing the inclusion (or exclusion) of each one of the 30 features extracted from the clusters. We used a simple GA, with gray encoding, stochastic universal sampling selection, single-point crossover, fitness based reinsertion and a generational gap of 0.9. The probability of crossover was 0.7 and the probability of mutation was 1/l = 1/30 = 0.0333. The initial population of the GA was initialized uniformly at random. All the NNs constructed by the GA are feedforward networks with one hidden layer. All neurons have biases with a constant input of 1.0. The NNs are fully connected, and the transfer functions of every neuron is the sigmoid hyperbolic tangent function. The data (input and targets) were normalized to the interval [-1, 1]. For the targets, a value of “-1” means that the cluster is “benign” and a value of “1” means “malignant”. For training each NN, backpropagation was used, considering 10 splits of the data as in the previous experiment (90% for training and 10% for testing) and the training stopped after 20 epochs. The GA ran for 50 generations, and the results of this experiment are shown in Table 2 on the row “GA”. The best solution has 9 inputs, corresponding to 30% of the original cluster feature set (five features related to the shape of the cluster, one related to the area of the microcalcifications and three related to the contrast of the microcalcifications). On average, the chromosomes of the last generation coded 14.03 inputs, that is, the NNs with the best performance only receive 46.76% of the original features extracted from the microcalcification clusters.
Table 2. Results of the classification of microcalcification clusters

Method   Sensitivity (%)   Specificity (%)   Overall (%)
CS             50.91            95.94            84.56
FSS            53.85            88.89            77.50
GA            100.00           100.00           100.00
4
Conclusions and Future Work
We found that the use of GAs combined with NNs greatly improves the overall accuracy, the specificity and the sensitivity of the classification when signals are classified. The best solution found is a NN with 23 inputs, corresponding to 23 extracted features (five related to the contrast of the signal, four related to the background contrast, two related to the relative contrast, seven related to the shape, four moments of the contour sequence and only one of the invariant geometric moments). We also found that all the NNs coded in the chromosomes of the final population of the GA use 20.02 inputs on average; that is, the NNs with the best performance need only 42.60% of the original 47 features. As an additional note, the first method, based on ranking features by class separability, had similar results in the case of the specificity, and a good overall performance, but had poor sensitivity. In the case of the classification of microcalcification clusters, we observed that the use of a GA greatly improved the overall accuracy, the sensitivity and the specificity, achieving values of 100%. The best solution has 9 inputs, corresponding to 9 features extracted from the clusters (five related to the shape of the cluster, one related to the area of the microcalcifications and three related to the contrast of the microcalcifications). On average, the best NN architectures receive 14.03 inputs, that is, they only receive 46.76% of the 30 original cluster features as inputs. Nevertheless, only 40 microcalcification clusters were detected in the 22 mammograms used in this study. The test sets used in the ten-fold crossvalidation trial were very small and, in some splits, all the examples belonged to only one of the two classes, so either sensitivity or specificity could not be calculated. These splits were ignored in the calculation of the respective mean. On the other hand, the first two methods, based on ranking features by class separability and forward sequential search, had similar performance in terms of specificity and overall performance, but both showed deficient results for the sensitivity of the classification. As future work, it would be useful to include and process other mammography databases, in order to have more examples and produce more balanced and complete transactional feature databases, and also to test how different resolutions could affect system effectiveness. The size of the Gaussian filters could be adapted depending on the size of the microcalcifications to be detected and the resolution of the images. Different new features could be extracted from the microcalcifications in the images and tested as well. In this study, simple GAs and NNs were used, and more sophisticated versions of these methods could produce better results. The
inclusion of simple backpropagation training in the EANNs results in longer computation times, so alternatives to backpropagation should be tested in order to reduce time costs.
References 1. Thurfjell, E.L., Lernevall, K.A., Taube, A.A.S.: Benefit of independent double reading in a population-based mammography screening program. Radiology 191, 241– 244 (1994) 2. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87, 1423–1447 (1999) 3. Cant´ u-Paz, E.: Feature subset selections, class separability and genetic algorithms. Lawrence Livermore National Laboratory, Center for Applied Scientific Computing (2004) 4. Cant´ u-Paz, E., Newsam, S., Kamath, C.: Feature selection in scientific applications. Lawrence Livermore National Laboratory, Center for Applied Scientific Computing (2004) 5. Oporto-D´ıaz, S., Hern´ andez-Cisneros, R.R., Terashima-Mar´ın, H.: Detection of microcalcification clusters in mammograms using a difference of optimized gaussian filters. In: Kamel, M., Campilho, A. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 998– 1005. Springer, Heidelberg (2005) 6. Suckling, J., Parker, J., Dance, D., et al.: The Mammographic Image Analysis Society digital mammogram database. In: Exerpta Medica International, Congress Series, vol. 1069, pp. 375–378 (1994) 7. Oporto-D´ıaz, S.: Automatic detection of microcalcification clusters in digital mammograms. Master thesis, Center for Intelligent Systems, Tecnol´ ogico de Monterrey, Campus Monterrey, Monterrey, Mexico (2004) 8. Kozlov, A., Koller, D.: Nonuniform dynamic discretization in hybrid networks. In: Proceedings of the 13th Annual Conference of Uncertainty in AI (UAI), Providence, RI, USA, pp. 314–325 (2003) 9. Gupta, L., Srinath, M.D.: Contour sequence moments for the classification of closed planar shapes. Pattern Recognition 20(3), 267–272 (1987) 10. Cant´ u-Paz, E., Kamath, C.: Evolving neural networks for the classification of galaxies. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), San Francisco, CA, USA, pp. 1019–1026 (2002) 11. Kurkova, V.: Kolmogorov’s Theorem. In: The Handbook of brain theory and neural networks, pp. 501–502. MIT Press, Cambridge, MA (1995)
Contourlet-Based Mammography Mass Classification Fatemeh Moayedi, Zohreh Azimifar, Reza Boostani, and Serajodin Katebi Computer Science & Engineering Department, Shiraz University, Shiraz, Iran
[email protected] Abstract. The research presented in this paper is aimed at the development of an automatic mass classification of mammograms. This paper focuses on using contourlet-based multi-resolution texture analysis. The contourlet transform is a new two-dimensional extension of the wavelet transform using a multi-scale framework as well as directional filter banks. The proposed method consists of three steps: removing the pectoral muscle and segmenting regions of interest, extracting the most discriminative texture features based on the contourlet coefficients, and finally creating a classifier which identifies various tissues. In this research, classification is performed based on the idea of Successive Enhancement Learning (SEL) weighted Support Vector Machine (SVM). The main contribution of this work is exploiting the superiority of the contourlets over the state-of-the-art multi-scale techniques. Experimental results show that contourlet-based feature extraction in conjunction with the SEL weighted SVM classifier significantly improves breast mass detection. Keywords: Mammography, Mass Detection, Contourlet, Support Vector Machine.
1
Introduction
Breast cancer is claimed as the second most deadly cancer in the world, on which public awareness has been increasing during the last few decades [1]. Early detection can play an effective role in the prevention, particularly by the most reliable detection technology known as mammography. Unfortunately, at the early stages of breast cancer, the signs are very subtle and vary in appearance, making diagnosis difficult even for specialists. Therefore, automatic reading of medical images becomes highly desirable. It has been proven that double reading of a mammogram, by two radiologists, reduces missed detection rate, but at a very high cost. Therefore, the motivation of the computer-aided diagnosis is to assist medical staff in achieving higher efficiency and accuracy. The objective is to develop an automated imaging system for mass classification of digital mammograms. Mass classification requires a preprocessing step that segments the input image into different areas, such as the breast region, background, and redundant labels and text. A significant body of research has M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 923–934, 2007. c Springer-Verlag Berlin Heidelberg 2007
already been devoted to breast segmentation including point operations, histogramming, gradient-based approaches, region growing, polynomial modeling, active contours, and classifier-based techniques [2]. The second step (as the most effective stage) is feature extraction. Texture is a commonly used feature in the analysis and interpretation of images. Oliver [3] distinguishes textures employed in mammography according to three main extraction methods: 1. Statistical methods: The extracted features of this class include those obtained from co-occurrence matrices [3], from surface variation measurements (smoothness, coarseness and regularity) [3], and run-length statistics [2]. 2. Model-based methods: The analysis of texture features in this class is based on prior models such as Markov random fields [4], auto-regressive models and fractals [3]. 3. Signal processing methods: In this class, texture features are obtained according to either pixel characteristics or image frequency spectrum including Laws energy filtering [3], Gabor filtering [3], wavelets [5,6,7], ridgelets [6], and curvelets [8,9,10]. Finally, an appropriate classifier is utilized for mass classification. Various methodologies have been proposed with the most efficient ones known as Bayesian classifier [11], multilayer perceptron [4,12], ANFIS classifier [7], Radial Basis Function (RBF) [12], k-Nearest Neighbors, ID3 decision tree classifier [13], and Support Vector Machine (SVM) [1]. This work is motivated to improve performance of the current mass classification methods using the recently proposed image transforms and classifiers. The novelty of this research is exploiting the superiority of contourlets in representing line singularities over wavelets as well as the structural risk minimization property of SEL weighted SVM to achieve a more efficient mammogram mass classification.
2 Background Review
2.1 Contourlet Transform
An image can be represented in different scales by multi-resolution analysis. The wavelet transform is used to represent 1-D piecewise smooth signals. The structure of medical images, however, is not as simple as a 1-D piecewise smooth line. Candes and Donoho [10] show that wavelets perform well for objects with point singularities in 1-D and 2-D. Orthogonal wavelets capture only horizontally, vertically, and diagonally directed discontinuities. These orientations may not preserve enough directional information in medical images. Ridgelet analysis, on the other hand, is a suitable transform to catch radial directional details in the frequency domain. Ridgelets are very effective in detecting linear radial structures, but those structures are not dominant in medical images [8].
Fig. 1. Top row: wavelets versus contourlets; contourlets effectively represent a smooth contour with fewer coefficients. Bottom row: contourlet filter bank. First, a multi-scale decomposition into octave bands by the Laplacian pyramid is computed, and then a directional filter bank is applied to each bandpass channel.
An extension of the ridgelet transform introduced by Candes and Donoho [10] is the curvelet transform. Curvelets are very successful in detecting image activities along curves, while analyzing images at multiple scales, locations, and orientations [8]. Multi-scale decomposition captures point discontinuities, while directional decomposition links point discontinuities into linear structures. The contourlet transform overcomes the lack of directionality of 2-D wavelets by geometrically representing the smoothness of contours. In other words, contourlets can represent a smooth contour with fewer coefficients than wavelets do, as illustrated in Figure 1. Wavelet basis functions are isotropic, thus they cannot adapt to geometrical structures. The contourlet transform not only exploits the multi-scale and time-frequency localization properties of wavelets, but also offers a high degree of directionality (i.e., it is locally adaptive) and anisotropy [10]. A curvelet is indexed by three parameters: a scale parameter a, 0 < a < 1; an orientation θ, θ ∈ [−π/2, π/2); and a location parameter b, b ∈ R2. At scale a, the family of curvelets is generated by translation and rotation of a basic element ϕa:

ϕa,b,θ(x) = ϕa(Rθ(x − b))    (1)

where Rθ is a rotation by θ radians and ϕa is a type of directional wavelet with spatial width a and spatial length √a, with its minor axis pointing towards the horizontal direction:

ϕa(x) ≈ ϕ(Da x),   Da = diag(1/a, 1/√a)    (2)

where Da is a parabolic scaling matrix.
Fig. 2. Diagrams describing: (a) the proposed methodology, (b) segmentation stage
Do and Vetterli [9] construct a discrete-domain multi-resolution and multi-direction expansion using non-separable filter banks. This implementation is called the contourlet transform, shown in Figure 1(b). Our objective is to study the performance of the SEL weighted SVM classifier based on the contourlet-domain extracted features.
2.2 Support Vector Machine
Many learning techniques attempt to minimize the classification error in the training phase, only. This, however, does not guarantee a low error rate in the testing phase. In statistical learning theory, the SVM is claimed to overcome this issue [14]. The SVM performs structural risk minimization and creates a classifier with minimized Vapnik-Chervonenkis (VC) dimension. The lower the VC dimension, the smaller the expected probability of error. Consequently, a good generalization is obtained if the VC dimension is low. The dual quadratic optimization of SVM is performed in order to obtain an optimal hyperplane for the higher dimensional space. The optimization is done by maximizing

L(α) = Σ_{i=1}^{ν} αi − (1/2) Σ_{i,j=1}^{ν} yi yj αi αj K(xi, xj),   0 ≤ αi ≤ C    (3)

subject to

Σ_{i=1}^{ν} yi αi = 0,   w = Σ_{i=1}^{N} αi yi xi,   αi [yi (w^T xi + b) − 1 + ξi] = 0
where K(xi , xj ) is the kernel of SVM, ν shows the number of total samples, and C is a user-specified positive parameter to control the tradeoff between the
SVM complexity and the number of non-separable points. This quadratic optimization problem is solved and a solution α^0 = (α^0_1, α^0_2, . . . , α^0_nsv) is obtained, where each α^0_i is a Lagrange coefficient, and nsv denotes the number of support vectors. The slack variables ξi are used to relax the constraints of the canonical hyperplane equation. A Radial Basis Function (RBF) kernel is utilized in this stage.
3
Methodology
As mentioned above, the proposed system is developed upon the contourlet analysis to extract features as well as SEL weighted SVM to classify the breast abnormalities. The proposed system consists of three main stages as shown in Figure 2(a): Segmentation, Feature extraction and reduction, and classification processes. Next, we review and discuss each step separately.
3.1 Segmentation Stage
Mammograms are difficult to interpret; thus, preprocessing is necessary to improve the image quality and to make the feature extraction phase more reliable. The preprocessing consists of two main phases, which are not necessarily independent. First, remove the background, pectoral muscle, and tagged labels from the image. Then, enhance the contrast and extract the suspicious area in the image. Figure 2(b) summarizes all steps of the segmentation process with the following description. Removing the Pectoral Muscle: In the MIAS1 database, almost 50% of the whole image consists of a dark background with significant structural noise. Cropping can remove the unnecessary parts of the image before further processing. The cropping operation was done automatically by sweeping through the image and cutting horizontally and vertically oriented regions with mean and variance less than a certain threshold. The database contains images of either the left or the right breast; in other words, the breast appears on either the left or the right side of the picture. We find this orientation automatically and align all breasts to the left side of the image. Breast texture separation from the background was done according to the difference of local variances and other point operations such as histogram stretching. Because the gray level of the pectoral muscle is similar to that of the mass, it is better to remove this muscle (i.e., pectoral muscle suppression) to avoid increasing the false detection rate. This operation is usually necessary for images captured in the MLO2 view, wherein the pectoral muscle looks slightly brighter than the rest of the breast tissue. Here, breast segmentation was done using the logarithm of the pixel energy and a region growing algorithm based on vertical and horizontal gradients and morphological operations. The above processes are illustrated in Figure 3.
1 Mammographic Image Analysis Society.
2 Medio Lateral-Oblique.
Fig. 3. First stage of preprocessing: (a) original image, (b) image cropped, background and annotation removed, (c) logarithm of pixels energy, (d) pectoral muscle removed
Contrast Enhancement: Many medical images such as mammograms look blur and fuzzy and suffer from low contrast. Therefore, contrast enhancement is necessary before any further processing or analysis is conducted. We used a transfer function for each pixel based on its local statistics. A new intensity is assigned to each pixel according to an adaptive transfer function that is designed on the basis of local minimum/maximum/average intensity. This method follows the idea of adaptive histogram equalization and retinex model of Yu and Bajaj [15]. Region of Interest segmentation: Following preprocessing, the Regions of Interest (ROIs) in each image are found and localized to be processed by the subsequent operations. This approach reduces the computational burden of the system. Masses usually hide themselves in dense tissues with higher intensity values. Thus, pixels with very low intensity within the tissue should not be considered suspicious. Only, regions whose pixels intensity is close to that of the mass must be considered for further processing. The following observations have effectively directed this segmentation phase: – Mammographic masses have approximately uniform textures across their interiors, i.e., low intensity variance with high frequencies close to zero. – Another characteristic of masses is that energy content at their boundaries is relatively higher than other areas. The thresholded image (i.e., suspicious areas) was decomposed by overcomplete wavelet transform. If a mass exists in the suspicious area, then the high frequency signal at the corresponding location is close to zero and the energy parameter at its boundary is high. All the locations with high frequencies close to zero should be registered into a map representing all mass region candidates. Next, every region must be separated perfectly from the original image in order to extract its features. A refinement step is necessary in order to select only segmented regions whose abnormality likelihood is relatively high. Therefore,
Fig. 4. Three mammograms segmented according to the proposed approach: (a) original image (red curve shows the mass), (b) pectoral muscle removed, (c) map of the connected components, and (d) ROIs, extracted based on Eq. 4
region labeling is done, such that pixels that are connected to each other (Connected Components, CC), based on some morphological operation, are similarly labeled. According to the mammogram pixel resolution, we empirically observed that the CCs with the 25% largest areas are plausible candidates for a suspicious region, i.e., the ROI label is defined as:

label(CCi) = 1 if areai is among the top 25% of areas, and 0 otherwise    (4)

Every CCi with label 1 is assumed to be an ROI, i.e., a candidate for feature extraction. Figure 4 shows three examples of ROI determination.
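A compact sketch of this selection rule using SciPy's connected-component labeling. The function name and the use of a 75th-percentile cutoff to express "top 25% of areas" are assumptions of this sketch:

```python
import numpy as np
from scipy import ndimage

def extract_rois(candidate_mask):
    """Label connected components and keep the 25% with the largest areas."""
    labels, n = ndimage.label(candidate_mask)
    if n == 0:
        return []
    areas = ndimage.sum(candidate_mask, labels, index=np.arange(1, n + 1))
    cutoff = np.quantile(areas, 0.75)           # area threshold for the top 25%
    return [labels == i + 1 for i in range(n) if areas[i] >= cutoff]
```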
3.2 Feature Extraction and Reduction
When all ROIs are extracted, their texture is analyzed through feature extraction. As shown in Figure 5, in this study three approaches were employed for texture determination: 1. Contourlet features: combination of k maxima, mean, standard deviation, energy, entropy, and skewness parameters from 4-level contourlet transform was
Fig. 5. Feature extraction stage
examined. Contourlets capture structural information along multiple scales, locations, and orientations. Also, contourlet transform can effectively represent a smooth contour with fewer coefficients than the wavelet transform. These features are useful and important for mass classification. Although, theoretically, contourlets are more complex than wavelets, there is no significant difference in their computational complexity. An example of contourlet transform for one ROI is shown in Figure 6. The image was decomposed into four pyramidal levels, resulting in four, eight and sixteen directional sub-bands. To the best of our knowledge, contourlet-based feature extraction of mammograms accounts for the novelty of this work with effective improvement. 2. Co-occurrence matrix features: co-occurrence matrices are essentially 2-D histograms of occurrence of grey-level pairs for a given displacement vector. Formally, the co-occurrence P of two intensity values can be specified as a matrix of relative gray levels i and j. Co-occurrence matrices are not generally used as features; rather, various statistical features are derived from the matrix entities Pi,j (d, θ) concerning two pixels separated by a predefined distance d and an angle θ [3]. Here we use four different directions: 0◦ , 45◦ , 90◦ and 135◦ and two distances: d = 1 and d = 2 pixels. From each co-occurrence matrix a feature set including energy, correlation, inertia, entropy, inverse difference moment, sum average, sum variance, sum entropy, difference average, difference variance, difference entropy, and information measure of correlation is extracted. 3. Geometrical features: area, center of mass, and orientation are considered in this study. In this research, genetic algorithm is used for feature selection based on neural network pattern classification, in which every feature is weighted by genetic algorithm. Finally, features with weights lower than a threshold are eliminated. The cross-over rate and mutation rate 0.8 and 0.2 are applied, respectively. The chromosome fitness is calculated according to the classification rate by a weighted SVM classifier. The above features are passed to the SEL weighted SVM classifier, explained next.
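Since the contourlet transform itself is not part of SciPy, the sketch below assumes a list of 2-D subband coefficient arrays produced by some contourlet implementation, and only illustrates the per-subband statistics named in item 1 above (k maxima, mean, standard deviation, energy, entropy, skewness); names and constants are illustrative:

```python
import numpy as np
from scipy.stats import skew

def subband_features(subbands, k=3):
    """Texture descriptors computed from contourlet subband coefficients."""
    feats = []
    for band in subbands:
        c = np.abs(band).ravel()
        p = c / c.sum() if c.sum() > 0 else np.full(c.size, 1.0 / c.size)
        feats.extend(np.sort(c)[-k:])                    # k largest magnitudes
        feats.extend([c.mean(), c.std(),
                      np.sum(c ** 2),                    # energy
                      -np.sum(p * np.log2(p + 1e-12)),   # entropy
                      skew(c)])                          # skewness
    return np.asarray(feats)
```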
3.3 Mass Classification
The extracted ROIs are labeled as either containing a mass or normal tissue. Each ROI, denoted by xi, is assumed to be a pattern belonging to the "mass present" class (yi = +1) or the "mass absent" class (yi = −1). The labeling yi is done based on the Euclidean distance between the mass center3 and that of every ROI, i.e., yi = +1 if this distance is less than the mass radius, otherwise yi = −1. In general, for a mammogram, many ROIs are classified as "mass absent", but very few are labeled as "mass present". Therefore, and due to the random selection of samples, it is highly probable that a significantly small (even zero) fraction of the training set is occupied by the "mass present" class. To address this issue, we propose to use the weighted SVM in an SEL scheme to examine all the available "mass present" samples. The basic idea of the SEL approach is to iteratively select the most representative "mass present" examples from all the training images while keeping the total number of training examples small [16]. Such a scheme improves the generalization ability of the trained weighted SVM classifier. Weights are constructed based on the number of "mass present" and "mass absent" samples. The weight of each class is defined as:

W1 / W2 = L2 / L1,   W1 + W2 = 1    (5)

where W1, W2 and L1, L2 are the weights and sample numbers in the "mass absent" and "mass present" classes, respectively. To evaluate the proposed classifier accuracy as well as its generalization capability, a validation procedure was used, whose reliability was improved by employing ten-fold cross-validation repeated ten times.
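A hedged sketch of the class-weighting rule of Eq. (5) applied to an RBF SVM. scikit-learn's SVC is used here as a stand-in for the authors' weighted SVM, the C and gamma values are those reported in the experiments below, and the iterative SEL selection of "mass present" examples is not implemented in this sketch:

```python
import numpy as np
from sklearn.svm import SVC

def fit_weighted_svm(X, y, C=27.0, gamma=1e-6):
    """RBF-kernel SVM with class weights set as in Eq. (5).

    y uses +1 for "mass present" and -1 for "mass absent"; the minority
    class receives the larger weight since W1/W2 = L2/L1 and W1 + W2 = 1.
    """
    l1 = np.sum(y == -1)            # "mass absent" sample count
    l2 = np.sum(y == +1)            # "mass present" sample count
    w1 = l2 / (l1 + l2)
    w2 = l1 / (l1 + l2)
    clf = SVC(kernel='rbf', C=C, gamma=gamma, class_weight={-1: w1, +1: w2})
    return clf.fit(X, y)
```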
4
Results and Discussions
This section presents and evaluates results of the experiments carried out according to the three stages of mammogram examination introduced in this article: 1) image segmentation, 2) ROI feature extraction and reduction, and 3) mass classification. A large collection of mammograms (including both normal and abnormal cases) was obtained from the MIAS database. The images were in 8-bit gray resolution format and of size 1024 × 1024 pixels. First, pectoral muscle was successfully separated from the rest of the input image. The segmentation result was evaluated by a mammography radiologist, who scored our results as 97.5% accurate. The ROIs were defined and extracted from an enhanced version of the mammogram. In the second stage, the suspicious regions were projected into the contourlet domain and the nominated features were extracted, in addition to a number of geometrical features obtained in the spatial domain. Next, feature selection was performed based on the genetic algorithm. The last step, i.e., SEL weighted SVM classification was accomplished 3
The mass center and radius are available in MIAS dataset.
Fig. 6. Contourlet transform of an ROI: (a) ROI sample, (b) Four-level pyramidal decomposition. Dark gray value shows large valued coefficients.
by using the RBF kernel, and adaptively tuning the parameter C and the kernel parameter to 27 and 10−6, respectively. To evaluate the performance of each classifier, the specificity, sensitivity, and accuracy rates were then calculated from each of the misclassification matrices, as seen in Table 1. In medical imaging, the most important performance measures are specificity and sensitivity. Ideally, one wishes both high specificity and high sensitivity measures. Theoretically, however, these two measures are inversely proportional (i.e., anti-correlated). Since accuracy is a function of the sensitivity and specificity measures together, this descriptor was selected to determine the overall correctness of the classifier. Experimental results and comparisons between different features and classifiers are presented in Tables 2 and 3. As Table 2 shows, our approach successfully distinguishes between normal and abnormal tissues while outperforming current methods. Table 3 indicates that the proposed contourlet-based features also offer an improvement in mass classification over the state-of-the-art algorithms.

Table 1. Classifier performance measures

Measures      Definition
Sensitivity   True Positives / Total Positives
Specificity   True Negatives / Total Negatives
Accuracy      (True Positives + True Negatives) / Total Samples
Table 2. Performance comparison of weighted SVM and SEL weighted SVM classifiers using different features in detecting normal and abnormal tissues
Classifier         Features                                    Mean Sensitivity   Mean Specificity   Mean Accuracy
Weighted SVM       co-occurrence                               0.375              0.913              0.738 ±0.05
Weighted SVM       wavelets                                    0.400              0.95               0.78 ±0.040
Weighted SVM       contourlets                                 0.457              0.967              0.81 ±0.040
Weighted SVM       contourlets & co-occurrence & morphology    0.473              0.935              0.859 ±0.037
SEL weighted SVM   co-occurrence                               0.600              0.921              0.85 ±0.035
SEL weighted SVM   wavelets                                    0.600              0.953              0.87 ±0.030
SEL weighted SVM   contourlets                                 0.600              0.958              0.900 ±0.030
SEL weighted SVM   contourlets & co-occurrence & morphology    0.980              1                  0.966 ±0.029
Table 3. Performance comparison of weighted SVM and SEL weighted SVM classifiers using different features in detecting benign and malignant masses
Classifier         Features                                    Mean Sensitivity   Mean Specificity   Mean Accuracy
Weighted SVM       co-occurrence                               0.450              0.953              0.766 ±0.01
Weighted SVM       wavelets                                    0.55               0.96               0.825 ±0.007
Weighted SVM       contourlets                                 0.550              0.979              0.833 ±0.007
Weighted SVM       contourlets & co-occurrence & morphology    0.600              0.988              0.850 ±0.008
SEL weighted SVM   co-occurrence                               1                  0.958              0.966 ±0.003
SEL weighted SVM   wavelets                                    1                  0.980              0.975 ±0.002
SEL weighted SVM   contourlets                                 1                  0.983              0.989 ±0.003
SEL weighted SVM   contourlets & co-occurrence & morphology    1                  0.992              1 ±0

5
Conclusions
In this paper, we exploited the advantages of contourlet-based texture analysis and employed the SEL weighted SVM method to detect and classify the breast masses into benign and malignant cases. It was shown that the curvelet transform (i.e., its discrete implementation, contourlets) can successfully capture structural information along multiple scales, locations, and orientations. This representation offers an improvement over the separable 2-D wavelet transform, which has limitations in the directional analysis of textures. We are currently considering the effect of recent progress in SVMs, such as SVM-based fuzzy neural networks, to improve mammographic classification.
References 1. Mu, T., Nandi, A.K.: Detection of breast cancer using v-SVM and RBF networks with self organization selection of centers. In: Third IEEE International Seminar on Medical Applications of Signal Processing (2005)
2. Raba, D., Oliver, A., Marti, J., Peracaula, M., Espunya, J.: Breast segmentation with pectoral muscle suppression on digital mammograms. In: Marques, J.S., P´erez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3523, pp. 471–478. Springer, Heidelberg (2005) 3. Malagelada, A.O.: Automatic mass segmentation in mammographic images. PhD thesis, Universitat de Girona (2004) 4. Roffilli, M.: Advanced machine learning techniques for digital mammography. Technical Report, Department of Computer Science University of Bologna, UBLCS2006-12 (2006) 5. Semler, L., Dettori, L., Furst, J.: Wavelet-based texture classification of tissues in computed tomography. In: IEEE International Symposium on Computer-Based Medical Systems (2005) 6. Semler, L., Dettori, L.: A comparison of wavelet-based and ridgelet-based texture classification of tissues in computed tomography. In: International Conference on Computer Vision Theory and Applications (2006) 7. Mousa, R., Munib, Q., Mousa, A.: Breast cancer diagnosis system based on wavelet analysis and fuzzy-neural network. IEEE Trans. on Image Processing 28(4), 713– 723 (2005) 8. Semler, L., Dettori, L.: Curvelet-based texture classification of tissues in computed tomography. In: IEEE International Conference on Image Processing (2006) 9. Do, M.N., Vetterli, M.: The contourlet transform: An efficient directional multiresolution image representation. IEEE Trans. on Image Processing 14(12), 2091– 2106 (2005) 10. Candes, E.J., Donoho, D.L.: Curvelets: A surprisingly effective non adaptive representation for objects with edges. In: Saint-Malo Proceedings, pp. 1–10. Vanderbilt Univ., Nashville, TN (2000) 11. Fischer, E.A., Lo, J.Y., Markey, M.K.: Bayesian networks of BI-RADS descriptors for breast lesion classification. IEEE EMBS San Francisco 4, 3031–3034 (2004) 12. Bovis, K., Singh, S., Fieldsend, J., Pinder, C.: Identification of masses in digital mammograms with MLP and RBF nets. IEEE Trans. on Image Processing 1, 342– 347 (2005) 13. Freixenet, O.J., Bosch, A., Raba, D., Zwiggelaar, R.: Automatic classification of breast tissue. In: Marques, J.S., P´erez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3523, pp. 431–438. Springer, Heidelberg (2005) 14. Lin, C., Yeh, C., Liang, S., Chung, J., Kumar, N.: Support vector based fuzzy neural network for pattern classification. IEEE Trans. on Fuzzy Systems 14(1), 31–41 (2006) 15. Yu, Z., Bajaj, C.: A fast and adaptive for image contrast enhancement. In: IEEE International Conference on Image Processing (2004) 16. El-Naqa, I., Yang, Y., Wernick, M., Galatsanos, N., Nishikawa, R.: A Support Vector Machine approach for detection of microcalcifications. IEEE Trans. on Medical Imaging 21(12), 1552–1563 (2002)
Fuzzy C-Means Clustering for Segmenting Carotid Artery Ultrasound Images Amr R. Abdel-Dayem and Mahmoud R. El-Sakka Computer Science Department University of Western Ontario London, Ontario, Canada {amr,elsakka}@csd.uwo.ca
Abstract. This paper introduces a fully automated segmentation scheme for carotid artery ultrasound images. The proposed scheme is based on fuzzy c-means clustering. It consists of four major stages: preprocessing, feature extraction, fuzzy c-means clustering, and finally boundary extraction. Experimental results demonstrated the efficiency of the proposed scheme in segmenting carotid artery ultrasound images.
1 Introduction Vascular plaque, a consequence of atherosclerosis, results in an accumulation of lipids, cholesterol, smooth muscle cells, calcifications and other tissues within the arterial wall. It reduces the blood flow within the artery and may completely block it. As plaque builds up, it can become either stable or unstable layers. Unstable plaque layers in a carotid artery can be a life-threatening condition. If a plaque ruptures, small solid components (emboli) from the plaque may drift with the blood stream into the brain. This may cause a stroke. Early detection of unstable plaque plays an important role in preventing serious strokes. Currently, carotid angiography is the standard diagnostic technique to detect carotid artery stenosis and the plaque morphology on artery walls. This technique involves injecting patients with an X-ray dye. Then, the carotid artery is examined using X-ray imaging. However, carotid angiography is an invasive technique. It is uncomfortable for patients and has some risk factors, including allergic reaction to the injected dye, kidney failure and the exposure to X-ray radiation. Ultrasound imaging provides an attractive tool for carotid artery examination. The main drawback of ultrasound imaging is the poor quality of the produced images. It takes considerable effort from radiologists to extract significant information about carotid artery contours and the possible existence of plaque layers that may exist. This task requires a highly skilled radiologist. Furthermore, manual extraction of carotid artery contours generates a result that is not reproducible. Hence, a computer aided diagnostic (CAD) technique for segmenting carotid artery contours is highly needed. 1.1 Carotid Artery Segmentation Mao et al. [1] proposed a scheme for extracting the carotid artery lumen from ultrasound images. The scheme uses a deformable model to approximate the artery M. Kamel and A. Campilho (Eds.): ICIAR 2007, LNCS 4633, pp. 935–948, 2007. © Springer-Verlag Berlin Heidelberg 2007
lumen. The user has to specify a seed point inside the artery. The initial contour shape is estimated from the image entropy map. However, the result accuracy depends, to a large extent, on the appropriate estimation of the initial contour. Furthermore, the deformable model takes a considerable amount of time to approach the equilibrium state. It is worth mentioning that the equilibrium state of a deformable model does not guarantee the optimal state or contour shape. Abolmaesumi et al. [2] proposed a scheme for tracking the centre and the lumen of the carotid artery in real-time. The scheme uses an improved star algorithm with temporal and spatial Kalman filters. The major drawback of this scheme is the estimation of the weight factors used by Kalman filters. In the proposed scheme, these factors are estimated from the probability distribution function of the boundary points. In practice, this distribution is usually unknown. Da-chuan et al. [3] proposed a method for automatic detection of intimal and adventitial layers of the common carotid artery lumen in ultrasound images using a snake model. The proposed method modified the Cohen’s snake [4] by adding spatial criteria to obtain the contour with a global maximum cost function. The proposed snake model was compared with a ziplock snake [5] and was found to give superior performance. However, the computational time for the proposed model was significantly high. It took a long amount of time for the snake to reach the optimum shape. Hamou et al. [6] proposed a segmentation scheme for carotid artery ultrasound images. The scheme is based on Canny edge detector [7]. The scheme requires three parameters. The first parameter is the standard deviation of the Gaussian smoothing kernel used to smooth the image before applying edge detection process. The second and the third parameters are upper and lower bound thresholds to mask out the insignificant edges from the generated edge map. The authors empirically tuned these parameters, based on their own database of images. This makes the proposed scheme cumbersome when used with images from different databases. Abdel-Dayem et al. [8] proposed a scheme for carotid artery contour extraction. The proposed scheme uses a uniform quantizer to cluster the image pixels into three major classes. These classes approximate the area inside the artery, the artery wall and the surrounding tissues. A morphological edge extractor is used to extract the edges between these three classes. The system incorporates a pre-processing stage to enhance the image quality and to reduce the effect of the speckle noise in ultrasound images. A post-processing stage is used to enhance the extracted contours. This scheme can accurately outline the carotid artery lumen. However, it cannot differentiate between relevant objects with small intensity variations within the artery tissues. Moreover, it is more sensitive to noise. Abdel-Dayem et al. [9] used the watershed segmentation scheme [10] to segment the carotid artery ultrasound images. Watershed segmentation schemes usually produce over-segmented images. Hence, a region merging stage is used to merge neighbouring regions based on the difference on their average pixel intensity. A single global threshold is needed during the region merging process. If this threshold is properly tuned, the proposed scheme produces accurate segmentation results.
Abdel-Dayem et al. [11] integrated multi-resolution analysis with their watershed-based segmentation scheme [9] to reduce the computational cost of the segmentation process and, at the same time, reduce the sensitivity of the results with respect to noise. In this scheme, the image is decomposed into a pyramid of images at different resolutions using the wavelet transform. Then, the lowest resolution image is segmented using the Abdel-Dayem et al. segmentation scheme [9]. Finally, the segmented image is projected back to produce the full resolution image.

Abdel-Dayem et al. [12] proposed a scheme for segmenting carotid artery ultrasound images using the fuzzy region growing technique. Starting from a user-defined seed point within the artery, the scheme creates a fuzzy connectedness map for the image. Then, the fuzzy connectedness map is thresholded using an automatic threshold selection mechanism to segment the area inside the artery. The proposed scheme is a region-based scheme. Hence, it is robust to noise and produces accurate contours. This gain can be attributed to the fuzzy nature of the objects within the ultrasound images. Moreover, it is insensitive to the seed point location, as long as it is located inside the artery. However, the calculation of the fuzzy connectedness map is a computationally expensive process. To overcome this problem, Abdel-Dayem et al. [13] applied their fuzzy region growing scheme at multiple resolutions. In this configuration, the computational complexity is reduced, as the fuzzy connectedness map is calculated for the lowest resolution image, which has a size of 1/4^N of the original image size, where N is the number of decomposition levels.

1.2 Fuzzy C-Means Clustering

Clustering is the process of partitioning a data set into different classes, so that the data in each class share common features according to a defined distance measure. The standard crisp c-means clustering scheme is very popular in the field of pattern recognition [15]. However, this scheme uses hard partitioning, in which each data point belongs to exactly one class. The Fuzzy C-Means (FCM) is a generalization of the standard crisp c-means scheme, in which a data point can belong to all classes with different degrees of membership [16]. The scheme is based on optimizing the following objective function:

J_{fuzzy} = \sum_{i=1}^{n} \sum_{j=1}^{m} \mu_{ij}^{k} \, d_{ij}^{2},   (1)
where n is the number of data points, m is the number of classes, and k is the fuzzy index used to adjust the "blending" of the different clusters. As k approaches one, the FCM approach reduces to the standard crisp c-means algorithm. d_{ij} is a distance measure between the i-th data point and the j-th class center, i.e.,

d_{ij} = \| x_i - c_j \|,   (2)

where \| \cdot \| denotes the Euclidean norm, and \mu_{ij} is the membership (degree of belonging) of data point i to class j. The memberships \mu_{ij} must satisfy the following two conditions:

\mu_{ij} \in [0, 1], \quad \forall i = 1, \ldots, n, \; \forall j = 1, \ldots, m,   (3)

and

\sum_{j=1}^{m} \mu_{ij} = 1, \quad \forall i = 1, \ldots, n.   (4)
Minimizing (1) requires

\frac{\partial J_{fuzzy}}{\partial c_j} = 0 \quad \text{and} \quad \frac{\partial J_{fuzzy}}{\partial \mu_{ij}} = 0.   (5)
These lead to the solution [15]:

c_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{k} \, x_i}{\sum_{i=1}^{n} \mu_{ij}^{k}},   (6)

and

\mu_{ij} = \frac{1}{\sum_{r=1}^{m} \left( d_{ij} / d_{ir} \right)^{2/(k-1)}}.   (7)
The class centers c_j and the membership values \mu_{ij} are iteratively calculated according to the algorithm shown in Fig. 1. Since objects in ultrasound images do not have well-defined boundaries (they are fuzzy objects with diffused boundaries), the fuzzy c-means clustering technique is a good candidate for extracting such objects. Thus, we adopted this technique in our segmentation scheme.

1.3 Paper Layout
This paper is organized as follows. Section 2 describes the proposed scheme in detail. Section 3 presents the results and the objective analysis of the proposed system. Section 4 discusses the results and compares them with other methods. Finally, Section 5 offers the conclusions of this paper.
Module Name: Fuzzy C-Means Clustering
Input:    X = {x1, x2, …, xn}   /* Collection of data points */
          m                     /* Number of classes */
Output:   L = {l1, l2, …, ln}   /* li represents a label for point xi */
Variable: Matrix μ              /* Membership matrix with size n×m */
Begin
  Step 1: /* Initialization step */
    Initialize the membership matrix μ with random values that satisfy the constraints in (3) and (4).
  Step 2:
    Repeat
      Calculate the fuzzy centers ci, i = 1..m, using (6)
      Calculate the memberships μij, i = 1..n, j = 1..m, using (7)
      Calculate the cost function Jfuzzy using (1) and (2)
    Until Jfuzzy < ε
  Step 3: /* Defuzzification step */
    For i = 1 to n
      li = k, where μik = max_j μij
    Return L
End

Fig. 1. The fuzzy c-means clustering algorithm
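As a concrete illustration of the algorithm in Fig. 1 and the update rules (6) and (7), a minimal NumPy sketch is given below. The function name, the random initialization, and the convergence test on the change of J_fuzzy are illustrative assumptions, not details of the authors' implementation.

```python
import numpy as np

def fuzzy_c_means(X, m=3, k=2.0, eps=1e-5, max_iter=100, seed=None):
    """Cluster the rows of X into m fuzzy classes (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: random memberships satisfying constraints (3) and (4)
    mu = rng.random((n, m))
    mu /= mu.sum(axis=1, keepdims=True)
    j_old = np.inf
    for _ in range(max_iter):
        w = mu ** k                                    # mu_ij^k
        centers = (w.T @ X) / w.sum(axis=0)[:, None]   # eq. (6)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # eq. (2)
        d = np.fmax(d, 1e-12)                          # avoid division by zero
        mu = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (k - 1))).sum(axis=2)  # eq. (7)
        j_new = np.sum((mu ** k) * d ** 2)             # eq. (1)
        if abs(j_old - j_new) < eps:                   # convergence test (assumed)
            break
        j_old = j_new
    labels = mu.argmax(axis=1)                         # Step 3: defuzzification
    return labels, centers, mu
```

Applied to the per-pixel feature vectors described in Section 2.2 with m = 3 and k = 2, the returned labels can be reshaped to the image grid to obtain the three-class segmentation.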
2 Proposed Solution

We propose a scheme for segmenting carotid artery ultrasound images. The proposed scheme consists of four major stages: pre-processing, feature extraction, fuzzy c-means clustering and, finally, boundary extraction. Fig. 2 shows the block diagram of the proposed method. In the following subsections, a detailed description of each stage is presented.
[Block diagram: Input Image → Pre-Processing (Histogram Equalization, Median Filter) → Feature Extraction → Fuzzy C-Means Clustering → Boundary Extraction → Constructed Image]
Fig. 2. The block diagram of the proposed scheme
2.1 Pre-processing
Ultrasound images suffer from several drawbacks. One of these drawbacks is the presence of random speckle noise, caused by the interference of the reflected ultrasound waves. Another severe problem is that ultrasound images have relatively low contrast. These factors severely degrade any automated processing and analysis of the images. Hence, it is crucial to enhance the image quality prior to any further processing. In this stage, we overcome these problems by performing two pre-processing steps. The first is a histogram equalization step [14] to increase the dynamic range of the image gray levels. In the second step, the histogram equalized image is filtered using a median filter to reduce the amount of speckle noise in the image. It was empirically found that a 3×3 median filter removes most of the speckle noise without affecting the quality of the edges in the image.
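A minimal sketch of this pre-processing stage, assuming an 8-bit grayscale input and using OpenCV, is shown below; the function wrapper is illustrative only.

```python
import cv2

def preprocess(us_image):
    """Histogram equalization followed by a 3x3 median filter (illustrative sketch)."""
    equalized = cv2.equalizeHist(us_image)   # stretch the dynamic range of the gray levels
    denoised = cv2.medianBlur(equalized, 3)  # 3x3 median filter to suppress speckle noise
    return denoised
```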
2.2 Feature Extraction

The objective of this stage is to formulate a feature vector for every pixel. This vector will be utilized during the clustering process. The feature vector consists of the pixel intensity, and the mean and standard deviation of a 5×5 image block centered on the pixel in question. This way, local neighborhood information is considered during the clustering operation.
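One way to build this per-pixel feature vector is sketched below, using uniform (box) filtering to obtain the local mean and standard deviation of every 5×5 neighbourhood; the helper name and the use of SciPy are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def pixel_features(img, size=5):
    """Per-pixel features: intensity, and the mean and std of a size x size block."""
    img = img.astype(np.float64)
    local_mean = uniform_filter(img, size=size)
    local_sq_mean = uniform_filter(img ** 2, size=size)
    local_std = np.sqrt(np.maximum(local_sq_mean - local_mean ** 2, 0.0))
    # Stack into an (n_pixels, 3) array, one feature vector per pixel
    return np.stack([img.ravel(), local_mean.ravel(), local_std.ravel()], axis=1)
```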
2.3 Fuzzy C-Means Clustering

In this stage, the pre-processed image is segmented using the fuzzy c-means clustering algorithm described in Section 1.2. The number of classes is set to three. These three classes approximate: 1) the area inside the artery, 2) the artery wall, and 3) the surrounding tissues. The fuzzy index k in (1) is empirically set to two. During the next stage, the boundaries between the segmented areas will be extracted.
2.4 Boundary Extraction
The objective of this stage is to extract the boundaries of the segmented regions. Various edge detection schemes can be used for this purpose [14]. In our system, we use a morphological-based contour extraction mechanism [14]. First, the image produced by the previous stage is morphologically eroded using a 3×3 rounded square structuring element. Then, the eroded image is subtracted from the non-eroded image to obtain the boundary of the segmented region, which represents the artery lumen. This operation can be described by the following equation:

Boundary(A) = A - (A ⊖ B),   (8)

where A is the post-processed image, B is the structuring element, and ⊖ is the erosion operator. Finally, the extracted contour is superimposed on the histogram equalized image to produce the final output of the proposed scheme.
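A short OpenCV sketch of this boundary extraction, following equation (8), is given below; the elliptical kernel is used here as an approximation of the 3×3 rounded square structuring element.

```python
import cv2

def extract_boundary(segmented):
    """Boundary(A) = A - (A eroded by B), cf. equation (8)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))  # structuring element B
    eroded = cv2.erode(segmented, kernel)
    return cv2.subtract(segmented, eroded)  # pixels removed by the erosion form the contour
```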
3 Results

The proposed system was tested using a set of 50 B-mode ultrasound images. These images were obtained using an ultrasound acquisition system (Ultramark 9 HDI US machine and L10-5 linear array transducer) and digitized with a video frame grabber before being saved to disk. These images were carefully inspected by an experienced clinician and the artery contours were manually highlighted to represent gold standard images. These gold standard images are used to validate the results produced by the proposed system.

As a representative case, Fig. 3(a) shows one of the original ultrasound images that are used as an input to our system. Fig. 3(b) shows the output after the histogram equalization step. Comparing the two images, shown in Fig. 3(a) and Fig. 3(b), reveals that the histogram equalization step increases the dynamic range and the contrast of the image. Unfortunately, the histogram equalization step increases the speckle noise that exists in the ultrasound images. However, the next step in the pre-processing stage will compensate for this drawback. Fig. 3(c) shows the image produced by applying a 3×3 median filter to the histogram equalized image shown in Fig. 3(b). This image represents the output from the pre-processing stage. In order to show the importance of applying the median filter, in Fig. 3(d) and (e) we magnify the area inside the artery (region of interest) before and after this step. Comparing Fig. 3(d) and Fig. 3(e) reveals that the amount of noise within the artery is reduced. This noise reduction step has a great impact on the accuracy of the segmentation results during the following stages of the proposed scheme. Fig. 3(f) shows the segmented area produced by applying the fuzzy c-means clustering stage. The image contains some objects that are outside the region of interest (ROI). By selecting a seed point within the artery, the proposed scheme can extract the ROI, while neglecting all other components. This result is shown in Fig. 3(g). Fig. 3(h) shows the final output of the proposed scheme, where the extracted contour is superimposed on the histogram equalized image.
Fig. 3. Experimental results: (a) Original ultrasound image; (b) The image after applying the histogram equalization step; (c) The image after applying the 3×3 median filter (the output of the pre-processing stage); (d) The region of interest (ROI) in the histogram equalized image; (e) The ROI after the noise removal step; (f) The segmented image produced by applying the fuzzy c-means clustering algorithm; (g) The area extracted by selecting a seed point within the ROI; (h) The final output of the proposed scheme
3.1 Objective Analysis
Typically, the terms true positive (TP), false positive (FP), and false negative (FN) are used to compare the gold standard to the computer segmented images. In these terms, positive or negative describes the decision made by the detection system, whereas true or false indicates whether the decision of the detection system agrees with the clinician's opinion. From these terms, the sensitivity and the specificity of the system are defined as

Sensitivity (a.k.a. the true positive fraction) = \frac{TP}{TP + FN},   (9)

and

Specificity (a.k.a. the true negative fraction) = \frac{TN}{TN + FP}.   (10)

Sometimes the false positive fraction is reported instead of the specificity, where

False positive fraction = \frac{FP}{TN + FP} = 1 - Specificity.   (11)
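For completeness, these fractions can be computed directly from the pixel counts, as in the generic helper below (not the authors' evaluation code).

```python
def detection_fractions(tp, fp, tn, fn):
    """Sensitivity, specificity, and false positive fraction, cf. equations (9)-(11)."""
    sensitivity = tp / (tp + fn)   # true positive fraction
    specificity = tn / (tn + fp)   # true negative fraction
    fp_fraction = fp / (tn + fp)   # equals 1 - specificity
    return sensitivity, specificity, fp_fraction
```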
Over the fifty test images, the proposed system achieved 95% and 99% confidence intervals of [0.9808, 0.9909] and [0.9793, 0.9925], respectively, for the true positive fraction. It also achieved 95% and 99% confidence intervals of [0.0121, 0.0159] and [0.0115, 0.0165], respectively, for the false positive fraction.
4 Comparisons and Discussions

To evaluate the performance of the proposed system, we compare it with four schemes found in the literature. The first scheme was proposed by Hamou et al. in [6], while the second through fourth schemes were proposed by Abdel-Dayem et al. in [8][9][12]. We selected the image shown in Fig. 4 as input to the four algorithms. This image has several noise pixels within the artery area. Moreover, the contour of the artery (as it visually appears) is not completely closed at the lower right part. Hence, this image represents a comprehensive test of the robustness of our newly proposed scheme. The output from the proposed scheme is shown in Fig. 5(a), while Fig. 5(b)-(e) show the outputs produced by the schemes of Hamou et al. [6] and Abdel-Dayem et al. [9][8][12], respectively. Note that, in Fig. 5(b), there are several false contours outside the region of interest. However, this is irrelevant to our analysis.
Fig. 4. Carotid artery ultrasound image, used to compare the proposed scheme with others; (a) the original image, (b) the pre-processed image
The comparison between Fig. 5(a) and Fig. 5(b) reveals that the first scheme [6] failed to produce an accurate contour for the carotid artery. The second scheme [9], Fig. 5(c), produces a reasonable contour. The third and fourth schemes [8][12], Fig. 5(d)-(e), fail to produce a closed contour for the artery. However, the proposed scheme is able to produce an accurate contour. The utilization of local statistics during the segmentation process makes our system more robust to noise. Moreover, the use of fuzzy clustering allows for smooth transitions between objects with diffused and not well defined boundaries. To demonstrate the power of fuzzy clustering, we present the result of segmenting the same image using the standard crisp c-means clustering algorithm; see Fig. 5(f). The standard crisp c-means clustering algorithm also fails to produce a closed contour.

A significant feature of the proposed scheme is that it is fully automated. The segmentation process does not require any input from the user. Note that the selection of the seed point at the end of the proposed scheme is just to neglect the false contours that exist outside the region of interest. The seed point selection has no effect on the segmentation process. To appreciate the significance of this feature, we present the output of segmenting the same test image, shown in Fig. 4, using an active contour model [17]. It is worth mentioning that active contour models are widely used in various clinical applications, and most clinicians are familiar with and trained to use such systems. We performed three different experiments to show the pitfalls of active contours. In the first experiment, Fig. 6(c), we defined the initial contour by selecting four points within the artery area. This experiment represents the case where the artery contour is underestimated. In this experiment, the contour collapses and the system fails to detect the artery lumen, Fig. 6(d). In the second experiment, we defined the initial contour to be larger than the artery area (i.e., the artery contour is overestimated), Fig. 6(e). In this case the contour converges, as shown in Fig. 6(f), to an area that is far from the actual artery lumen. In the third experiment, we set our goal to produce a contour that is close to the actual one. To achieve this goal, we had to increase the number of initial points (or controlling points) to eight and, at the same time, place the initial contour very close to the actual one, Fig. 6(g). Note that the overhead encountered in defining a good initial contour is comparable to the overhead of extracting the contour manually. Meanwhile, the converged contour, Fig. 6(h), produces a result that is comparable to the output of the proposed scheme, Fig. 6(b).
Fig. 5. (a) The output produced by the proposed scheme for the images in Fig. 4, (b) The output from the system proposed by Hamou et al. [6], (c)(d) and (e) the output from the system proposed by Abdel-Dayem et al.[9][8][12], (f) the output produced by the c-means clustering.
These experiments demonstrate that active contour models are highly user-dependent. They not only consume considerable clinician time to plot initial contours, but also require skilled clinicians to estimate good contours to start with. Moreover, active contour models may collapse, especially when the artery area contains noise pixels (which are common in most carotid artery ultrasound images). A major drawback of our scheme is that the contour may leak in some images that suffer from the shadowing effect. Fig. 7(a) shows a carotid artery ultrasound image with a dark area to the lower right of the artery. This area is not part of the artery and should be excluded from the final segmentation output. However, our system fails to differentiate between the shadow and the area inside the artery, Fig. 7(b). The same situation occurred in around 13.75% of our test cases. It is worth mentioning that the shadowing effect can be eliminated during the image capturing process by capturing the image from a different angle.
Fig. 6. (a) The test image, shown in Fig. 4, (b) The output produced by the proposed scheme for the image in (a), (c) Initial contour with four controlling points used as input for the active contour model (initial contour under estimates the artery area), (d) The active contour collapses (with initial contour in (c)), (e) Initial contour with four controlling points which over estimates the artery area, (f) The active contour converges to an area larger than the actual artery (with initial contour in (e)), (g) Initial contour with eight controlling points, (h) The active contour improves with the improvement in the initial contour (with initial contour in (g))
Fig. 7. (a) A test image with shadow in the lower right corner of the artery, (b) The output of the proposed scheme for the image shown in (a), where the contour leaks
5 Conclusions In this paper, we proposed a scheme for extracting the carotid artery contour using ultrasound images. The proposed scheme uses the fuzzy c-means clustering algorithm to cluster the image pixels into three classes, representing the area inside the artery, the artery wall and the surrounding tissues. Local statistics, extracted from a 5×5 image block centred on every pixel, are employed during the clustering process. Visual inspection of the results showed that the proposed scheme provides a good estimation of the carotid artery contours. It is robust to noise and it can successfully manage fuzzy object boundaries that exist in ultrasound images. Moreover, the proposed scheme is fully automated (there is no input from the user). This feature makes the proposed scheme more convenient for clinicians, compared to the active contour models. The objective analysis over the entire set of fifty test images revealed that the proposed system achieved 95% and 99% confidence intervals of [0.9808, 0.9909] and [0.9793, 0.9925], respectively, for the true positive fraction. It also achieved 95% and 99% confidence intervals of [0.0121, 0.0159] and [0.0115, 0.0165], respectively, for the false positive fraction.
References 1. Mao, F., Gill, J., Downey, D., Fenster, A.: Segmentation of carotid artery in ultrasound images. In: Proceedings of the 22nd IEEE Annual International Conference on Engineering in Medicine and Biology Society, July 2000, vol. 3, pp. 1734–1737 (2000) 2. Abolmaesumi, P., Sirouspour, M., Salcudean, S.: Real-time extraction of carotid artery contours from ultrasound images. In: Proceedings of the 13th IEEE Symposium on Computer-Based Medical Systems, June 2000, pp. 81–186 (2000) 3. Da-chuan, C., Schmidt-Trucksass, A., Kuo-Sheng, C., Sandrock, M., Qin, P., Burkhardt, H.: Automatic detection of the intimal and the adventitial layers of the common carotid artery wall in ultrasound B-mode images using snakes. In: Proceedings of the International Conference on Image Analysis and Processing, September 1999, pp. 452–457 (1999) 4. Cohen, L.: On active contour models and balloons. Computer Vision, Graphics, and Image Processing: Image Understanding 53(2), 211–218 (1991)
5. Neuenschwander, W., Fua, P., Iverson, L., Szekely, G., Kubler, O.: Ziplock snake. International Journal of Computer Vision 25(3), 191–201 (1997) 6. Hamou, A., El-Sakka, M.: A novel segmentation technique for carotid ultrasound images. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, May 2004, vol. 3, pp. 521–524 (2004) 7. Canny, J.: Computational Approach To Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986) 8. Abdel-Dayem, A., El-Sakka, M.: A novel morphological-based carotid artery contour extraction. In: Proceedings of the Canadian Conference on Electrical and Computer Engineering, May 2004, vol. 2, pp. 1873–1876 (2004) 9. Abdel-Dayem, A., El-Sakka, M., Fenster, A.: Watershed segmentation for carotid artery ultrasound images. In: Proceedings of the IEEE International Conference on Computer Systems and Applications, January 2005 (2005) 10. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(6), 583–598 (1991) 11. Abdel-Dayem, A., El-Sakka, M.: Carotid Artery Contour Extraction from Ultrasound Images Using Multi-Resolution-Analysis and Watershed Segmentation Scheme. ICGST International Journal on Graphics, Vision and Image Processing 5(9), 1–10 (2005) 12. Abdel-Dayem, A., El-Sakka, M.: Carotid Artery Ultrasound Image Segmentation Using Fuzzy Region Growing. In: Kamel, M., Campilho, A. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 869–878. Springer, Heidelberg (2005) 13. Abdel-Dayem, A., El-Sakka, M.: Multi-Resolution Segmentation Using Fuzzy Region Growing for Carotid Artery Ultrasound Images. In: Proceedings of the IEEE International Computer Engineering Conference, 8 pages (December 2006) 14. Gonzalez, G., Woods, E.: Digital image processing, 2nd edn. Prentice Hall, Englewood Cliffs (2002) 15. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn., ch. 2, pp. 20–63. John Wiley & Sons, Chichester (2001) 16. Bezdek, J.: Fuzzy mathematics in pattern classification. Ph.D. thesis, Applied Mathematic Center, Cornell University, Ithaca, NY (1973) 17. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. In: Proceedings of the 1st International Conference on Computer Vision, pp. 259–268. IEEE Computer Society Press, Los Alamitos (1987)
A Novel 3D Segmentation Method of the Lumen from Intravascular Ultrasound Images Ionut Alexandrescu1, Farida Cheriet1,2, and Sébastien Delorme3 1
Institute of Biomedical Engineering Department of Computer Engineering École Polytechnique de Montréal, Campus de l'Université de Montréal, 2500, chemin de Polytechnique, Montréal, QC, H3T 1J4 3 Industrial Materials Institute, 75 de Mortagne, Boucherville, QC, J4B 6Y4 {ionut.alexandrescu, farida.cheriet}@polymtl.ca,
[email protected] 2
Abstract. In this paper, a novel method that automatically detects the lumen-intima border in an intravascular ultrasound (IVUS) sequence is presented. First, a 3D co-occurrence matrix was used to efficiently extract the texture information of the IVUS images through the temporal sequence. By extracting several co-occurrence matrices, a complete characterization feature space was determined. Secondly, using a k-means algorithm, all the pixels in the IVUS images were classified by determining whether they belong to the lumen or to the other vessel tissues. This enables automatic clustering, and therefore no learning step is required. The classification of the pixels within the feature space was obtained using 3 clusters: two clusters for the vessel tissues and one cluster for the lumen, while the remaining pixels are labeled as unclassified. Experimental results show that the proposed method is robust to noisy images and yields segmented lumen-intima contours validated by an expert in more than 80% of a total of 300 IVUS images.

Keywords: IVUS segmentation, 3D co-occurrence matrix, k-means classification, swap heuristics, texture analysis.
1 Introduction

Intravascular ultrasound (IVUS) is a catheter-based imaging technique that enables clinicians to see blood vessel cross-sections from the inside out (see figure 1), including the blood flowing through the lumen of the vessel and the three layers of the vessel wall (intima, media and adventitia). Coronary arteries are the most frequent imaging target for IVUS. This technology is used in coronary arteries to assess the spatial distribution of atheromatous plaque (also known as atherosclerosis) in the coronary tree. Plaque buildup causes stenosis – a narrowing of the artery – and reduces blood flow. IVUS is more precise than angiography for determining the degree of stenosis because of its image resolution, and because angiographic images are just 2D projections of a 3D shape which is not necessarily symmetrical. IVUS is often used as a research tool to improve atherosclerosis diagnosis, treatment planning and assessment of
Fig. 1. (Left) A traditional IVUS image illustrating three layers around the lumen (iii): the intima (ii), the media (iv) and the adventitia (v). The IVUS catheter is in the middle (i). (Right) Plaque composition consists of fibro-lipid (ii) and calcium deposits (i). The lumen is the blood (iii).
treatment efficiency, at the cost of lengthy manual segmentation of the lumen-intima and the media-adventitia borders. Fast and automatic IVUS image segmentation would improve the clinical usefulness of IVUS.

Much research work has been done on IVUS segmentation, including the development of several methods to obtain the desired borders. In [1], a combination of several techniques is used by minimizing a global cost function calculated using: (i) the intensity patterns learned from examples, (ii) statistical properties of ultrasound, and (iii) area homogeneity. This technique requires the construction of patterns from manually segmented images. This stage is very difficult to accomplish properly due to the large variability of IVUS images from one patient to another and the intensity discontinuities between the lumen and the intima. An automatic segmentation approach using a level set (fast marching method) is presented in [2, 3]. To find the initial parameters of the level set, Rayleigh distributions are searched for in the histogram of the sequence of images. However, this method was only tested on images obtained from femoral arteries, which therefore contain less motion blur and are of better quality when compared to coronary IVUS images.

Popular methods rely on both texture analysis and classification of the IVUS images in order to segment the various vessel borders. Local Binary Patterns [4, 5], 2D co-occurrence matrices [5, 6], or both methods [7] have been used mainly for texture analysis. An analysis of cumulative moments was used in [5], and a fuzzy k-means approach is used in [8], where several clusters are defined for the various layers of the vessel. All these methods are implemented in two dimensions and hence the results are calculated from a single static image only. Once texture characterization is finalized, a second phase is implemented that consists in the use of a classifier in order to find the desired vessel border. In [4, 5], an Adaboost or a k-nearest neighbors
classification is used to classify the pixels of an IVUS image. These two methods require a training phase, which requires a large image database in order to be efficient and effective. In [6, 7], IVUS segmentation is accomplished using a deformable contour analysis, such as Snakes, in combination with a texture analysis procedure. This technique assumes that the extraction of characteristics behaves like a filter in order for the contours to evolve and converge correctly, which is not always the case with noisy images. This paper presents a novel approach to automatically find the lumen-intima border using a 3D co-occurrence matrix feature extraction, and a k-means classifier on IVUS coronary images. A major advantage of the proposed method is that it does not involve any supervised training and it exploits the temporal information contained in the IVUS sequence.
2 Methodology

The proposed method is mainly based on a 3D texture analysis performed on a specified number of IVUS images. Those images are recorded at 30 fps with an ultrasound catheter pulled back at a controlled speed of 0.5 mm/s, over a few minutes. In coronary arteries, the heart produces significant motion and deformation of the artery cross-section between each heart beat. Hence, the temporal information of the IVUS sequence is taken into account. First, three-dimensional co-occurrence matrices were computed to extract 3D characteristics. Secondly, the classification was performed using a k-means algorithm with a slight modification: a hybrid method between Lloyd's algorithm and a heuristic swap was used [10, 11] to avoid local minima during the optimization process. Finally, a three-cluster k-means was used: one cluster represents the lumen and the two others represent the artery itself. The unclassified pixels were identified and the border was then detected accordingly.

2.1 Preprocessing

An IVUS film lasting several minutes is usually composed of a few thousand images. The segmentation process can be sped up by segmenting a subsample of the images and interpolating between the found contours. By extracting images synchronized with the electrocardiogram (ECG), one image for each cardiac cycle, the motion blur due to the cycle-to-cycle variation of blood pressure was partly eliminated, while reducing the total number of images to segment in a sequence. To make the extraction of characteristics optimal, the images were transformed into polar coordinates using a bi-cubic interpolation (see Figure 2). When the transformation is made, image artifacts such as the calibration markers visible on every image are eliminated by a thresholding followed by an interpolation procedure. The catheter, which appears as a black band between two white bands (peaks) on the polar images, must be removed. The catheter covers almost the same region on all the images. The dimension of this region can be identified by using a horizontal scan line algorithm followed by a threshold. A threshold of 75% of the maximum intensity value of the second peak was used to remove the catheter.
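One way to realize this Cartesian-to-polar resampling with bi-cubic interpolation is sketched below using SciPy; the output size of 240 samples along r and 360 along θ follows Section 2.6, and the assumption that the image centre coincides with the catheter position is made for illustration.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def to_polar(img, n_r=240, n_theta=360):
    """Resample a square IVUS frame onto an (r, theta) grid with bi-cubic interpolation."""
    rows, cols = img.shape
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0        # assumed centre of the image
    max_r = min(cy, cx)
    r = np.linspace(0, max_r, n_r)
    theta = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(r, theta, indexing="ij")       # (n_r, n_theta) sampling grid
    coords = np.stack([cy + rr * np.sin(tt), cx + rr * np.cos(tt)])
    # order=3 gives bi-cubic interpolation of the source pixels
    return map_coordinates(img.astype(np.float64), coords, order=3, mode="nearest")
```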
Fig. 2. (Left) Original IVUS image. (Right) IVUS image transformed in polar coordinates using bi-cubic interpolation.
2.2 3D Co-occurrence Matrices Co-occurrence matrices [12] are a statistical tool which allows the extraction of second-order texture information from an image. The frequency between pairs of pixels i and j of various gray scales, to a certain shift, is calculated and placed in a matrix C. The matrix of co-occurrence C (i,j,d,θ), for a radius d and an angle θ is defined by:
C(i, j, d, \theta) = \{(r, c) \mid I(r, c) = i \text{ and } I(r + d\cos\theta, \, c + d\sin\theta) = j\}   (1)

Matrix C indicates the number of times the value of pixel i is found in a given spatial relation to the value of pixel j. This spatial relation is described by d and θ, which define respectively the distance and the orientation between the two pixels. For a 2D image, matrix C is a square matrix whose size is the number of gray levels in the image. In this paper, 3D co-occurrence matrices for volumetric data [9] were used. These 3D matrices are able to identify the spatial dependence of the gray levels through the IVUS sequence. They are similar to the 2D matrices, but the calculation of the C matrix is made from a cubic window rather than a square window. Therefore, the spatial shift between two pixels is in three dimensions. Equation (1) becomes:
C(i, j, d, \theta, \phi) = \{(r, c, t) \mid I(r, c, t) = i \text{ and } I(r + d\cos\theta\sin\phi, \, c + d\sin\theta\sin\phi, \, t + d\cos\phi) = j\}   (2)

Once matrix C is found, the analysis is similar to the 2D case. The co-occurrence matrix can be normalized with respect to the characteristics to be extracted. The characteristics used to identify features in the image texture are similar to those described in [12]. This extraction is time consuming; hence, the most important characteristics describing the co-occurrence matrices are presented in Table 1. To extract the characteristics from an image, the 8 characteristics of Table 1 are calculated on cubes centered on each pixel of the image. The size of the cube must be chosen, as well as the angles of the C matrix. There are 13 possible pairs of angles, represented by the following directions: d(1,0,0); d(0,1,0); d(0,0,1); d(1,1,1); d(1,-1,-1); d(1,-1,1); d(1,1,-1); d(0,1,1); d(1,0,1); d(1,1,0); d(1,0,-1); d(0,1,-1); d(1,-1,0) [9] (see Figure 3).
Table 1. Co-occurrence Matrices Characteristics

Characteristic                        Formula
Energy                                \sum_{i=1}^{M} \sum_{j=1}^{L} N(i,j)^2
Shade                                 \sum_{i=1}^{M} \sum_{j=1}^{L} ((i - \mu_i) + (j - \mu_j))^3 N(i,j)
Prominence                            \sum_{i=1}^{M} \sum_{j=1}^{L} ((i - \mu_i) + (j - \mu_j))^4 N(i,j)
Correlation                           \sum_{i=1}^{M} \sum_{j=1}^{L} (i - \mu_i)(j - \mu_j) N(i,j) / \sigma^2
Inverse Difference Moment (IDM)       \sum_{i=1}^{M} \sum_{j=1}^{L} N(i,j) / (1 + (i - j)^2)
Inertia                               \sum_{i=1}^{M} \sum_{j=1}^{L} (i - j)^2 N(i,j)
Entropy                               - \sum_{i=1}^{M} \sum_{j=1}^{L} N(i,j) \log N(i,j)
Rayleigh density

N(i,j)   - Normalized co-occurrence matrix.
M, L     - Number of columns and lines of the matrix N.
\mu_i    - Vector representing the mean of each column.
\mu_j    - Vector representing the mean of each line.
\mu      - Mean of the matrix N.
\sigma   - Standard deviation of the matrix N.
Fig. 3. The 13 directions given by the possible angles of the co-occurrence matrices
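To make the construction concrete, the sketch below computes a normalized 3D co-occurrence matrix for one displacement direction over a cube of the IVUS volume and derives a few of the Table 1 characteristics from it. The quantization of the 8-bit gray values to a small number of levels is an assumption added for tractability, not a detail taken from the paper.

```python
import numpy as np

def cooccurrence_3d(cube, direction, levels=16):
    """Normalized 3D co-occurrence matrix N(i, j) for one displacement (dz, dy, dx)."""
    # Assume 8-bit gray values (0-255), quantized to `levels` gray levels
    q = np.clip((cube.astype(np.float64) / 256.0 * levels).astype(int), 0, levels - 1)
    dz, dy, dx = direction
    z0, z1 = max(0, -dz), q.shape[0] - max(0, dz)
    y0, y1 = max(0, -dy), q.shape[1] - max(0, dy)
    x0, x1 = max(0, -dx), q.shape[2] - max(0, dx)
    src = q[z0:z1, y0:y1, x0:x1]
    dst = q[z0 + dz:z1 + dz, y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    # Count each (i, j) pair of gray levels, then normalize to a probability matrix
    c = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(c, (src.ravel(), dst.ravel()), 1.0)
    return c / c.sum()

def cube_features(n):
    """A few of the Table 1 characteristics computed from N(i, j)."""
    i, j = np.indices(n.shape)
    energy = np.sum(n ** 2)
    entropy = -np.sum(n[n > 0] * np.log(n[n > 0]))
    inertia = np.sum((i - j) ** 2 * n)
    idm = np.sum(n / (1.0 + (i - j) ** 2))
    return np.array([energy, entropy, inertia, idm])
```

In the full scheme, such features would be computed for each pixel from a 9x9x9 or 17x17x17 cube and for each of the selected directions, then concatenated into that pixel's feature vector.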
2.3 Features Space

A features space (see figure 4a) was created using the 3D co-occurrence matrices. Several parameter values must be selected: the cube size, the direction (angle) of the calculation of the co-occurrence matrices, and the number of characteristics to be extracted from the matrices. Since the algorithm needs images before and after the image being analyzed, the first and the last images were analyzed by extrapolating the border (see figure 4b). For the same reason, a part of the top and a part of the bottom of the image cannot be analyzed. Since the images are in polar coordinates, we can pad circularly
to fill the missing pixels on the left and right side of the images. Selecting the size of the cube to be smaller than the lumen enables the contour between the lumen and the intima to be inside the part to be analyzed. With respect to the angle, a circular padding is used to fill the missing region. Lastly, a principal components analysis (PCA) [13] is performed to reduce the space dimension by eliminating the correlation between the characteristics. The feature space obtained by the co-occurrence matrices contains many features that have the same information but with some slight differences. Analyzing the entire feature space is very time consuming. By applying the PCA, we eliminate the redundant information and decrease considerably the analysis time.
Fig. 4. (a left) For each pixel we create a feature space. (b right) The characteristics of the images in stipple and the pixels in gray can’t be extracted.
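A standard way to perform this reduction is PCA on the stacked per-pixel feature vectors, keeping three components; the scikit-learn-based helper below is an illustrative sketch, not the authors' code.

```python
from sklearn.decomposition import PCA

def reduce_features(feature_matrix, n_components=3):
    """Project the (n_pixels, n_features) co-occurrence features onto 3 principal components."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(feature_matrix)
    # Fraction of the original variance kept by the retained components (~85% in the paper)
    retained = pca.explained_variance_ratio_.sum()
    return reduced, retained
```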
2.4 k-Means Classifier

The goal of the classifier is to associate each data point from a distribution to a cluster. The k-means classifier has the advantage of being completely automatic; therefore, it does not need any learning phase. Once the number of clusters is specified, the classifier finds the centers of each one. The distribution can be of any dimension. The hybrid k-means used in this paper is a combination of the Lloyd method and a heuristic swap [10, 11]. This method starts by generating an initial solution and then tries to improve it by using the Lloyd algorithm followed by a heuristic swap algorithm. This approach allows the optimization process to avoid local minima. The Lloyd algorithm, also known as the k-means algorithm, takes initial centers and calculates the neighborhood of each center. Each initial center then moves towards the centroid of its neighborhood until a convergence criterion is satisfied. This algorithm converges towards a solution, which could be a local minimum. At this point, the swap heuristic algorithm selects a set of k initial centers S from the potential centers R. The algorithm then tries to improve the solution by removing a center s ∈ S and replacing it with another s′ ∈ R − S. Let S′ = S − {s} ∪ {s′} be the new set of centers. If the modified solution has a smaller distortion than the existing solution, then S is replaced by S′; otherwise S does not change. This swapping is repeated until the distortion no longer changes after several swaps. The decision to apply several steps of the Lloyd algorithm is based on
the amount of update from one step to another. This quantity is defined by determining an empiric threshold value. The new solution is accepted if it is better than the old one [10, 11]. By using the algorithm we are able to find the centers of the clusters from a given distribution. In our case, the distribution is composed of pixels characteristics obtained from the 3D co-occurrence matrices. Once the centers of the clusters are found, the pixels are classified according to the three clusters. As mentioned previously, one cluster represents the lumen and the two others represent the remaining vessel structures. The vessel has several textures representing the different tissues which are regrouped in the two classes. During the classification, the pixels which are at a similar distance from the lumen center to one of the vessel centers are identified as unclassified. 2.5 Postprocessing After classification, the pixels are assigned to three distinct classes: those who belong to the lumen and those who belong to the two vessel clusters. During the classification, some pixels will be identified as uncertain. A segmented image of this is produced by assigning for each class a number n such that n = [0, 3]. This image is filtered with a “closing” morphological filter to eliminate isolated pixels [14]. The two classes of the vessel are merged so that the final segmented image has only three classes, representing the lumen, the vessel, and an uncertain region between them. Then, the lumen-vessel border is found in the center of the uncertain area. The border is then transformed from polar coordinates back into the format of the initial IVUS image. In order to correct errors that might occur during this postprocessing step, an active contour methods can be used, such as Snakes [15] with an external energy taken from the gradient of the original image filtered with a Gaussian filter. There is a slight offset between the found border and the actual one due to the size of the cube. The active contour method was used to minimize this offset. The snake external energy is given by the original IVUS image smoothed with an anisotropic filter. 2.6 Evaluation IVUS images obtained with a rotating catheter, at 20 MHz of ultrasound frequency, are used to evaluate the proposed method. The images have an original size of 480 x 480 pixels and are converted in polar coordinates (r, θ), having a size of 360 pixels along θ and 240 pixels along r. Eight characteristics were extracted for several sizes of cubes with various angle directions. A 17x17x17 cube was selected with characteristics calculated in the following directions: d(1,1,1); d(1,0,0); d(0,1,0); d(0,0,1); d(1,1,0); d(0,1,1); d(1,0,1); and a 9x9x9 cube, with the characteristics calculated in the directions: d(1,1,1); d(1,0,0); d(0,1,0); d(0,0,1); d(1,1,0); d(0,1,1); d(1,0,1); d(3,3,3); d(3,0,0); d(0,3,0); d(0,0,3); d(3,3,0); d(0,3,3); d(3,0,3). By doing so the total volume of pixels needed for the calculation is 18x18x18 (i.e. cube size plus one of the above directions). The final number of characteristics is 168
(8 characteristics × 7 angle directions for the 17x17x17 cube + 8 characteristics × 14 angle directions for the 9x9x9 cube). By applying a PCA, the total number is reduced to 3. This value is also used as the number of clusters for the k-means analysis. The border is afterward found on the segmented polar image and transformed back into the original image format. The method was tested on three patient databases that included 4 IVUS sequences, each one having approximately 3000 images, of which 75 images were analyzed.
3 Results and Discussion

Initially, before the IVUS images are transformed to polar coordinates, secondary artifacts like calibration markers should be removed. Unfortunately, the transformation to polar coordinates reduces the quality of the images. Hence, to alleviate this problem as much as possible, the polar images were sized to 240 × 360 pixels. The second phase consists of extracting the characteristics of the images using the 3D co-occurrence matrices. This process, implemented in C++, takes several minutes per image on a P4 2.6 GHz computer. Since this step is very time-consuming, it is necessary to choose strategically the total number of characteristics for the 3D co-occurrence matrix and k-means analysis. The chosen cube size is of primary importance, as it is not possible to extract all the characteristics on the whole IVUS sequence, as seen in Figure 4b. Based on a priori testing with different cube sizes, the 9x9x9 and 17x17x17 sizes were selected, as they contain enough image texture information without being hindered by possible artifacts such as noise or motion. The cube size is very important. The co-occurrence matrices act like an image filter. If the cube size is too small, we lose a lot of information and the offset between the found border and the actual border will be significant. If the cube size is too large, the processing time will increase considerably. We determined empirically that a good compromise is found by using the above cube sizes.

The methodology was modified to reduce computing time. First, the characteristics are extracted only at 4° intervals and an inter-frame interpolation was carried out to complete the missing data. Secondly, a maximum limit for the value of the radius r can be given, as the lumen does not span the entire image. With respect to the direction, it is necessary to choose orientations which cover the entire space uniformly. In this case, we deemed that the main 7 directions (major axes and corner diagonals) are sufficient for the analysis. The dimension of the direction is also important, since we need to consider the edge (see figure 4b). A dimension equal to 1 was chosen for the 17x17x17 cube and dimensions equal to 1 and 3 were assigned for the 9x9x9 cube. Because of that, 9 pixels are lost at the top and bottom of the analyzed image and the first 9 images at the beginning and end of the sequence are not used for the analysis.

The third step consists in the reduction of the features space. The PCA algorithm reduced the features space to three so as to remove the correlation between the characteristics and to reduce the redundancy of information (see figure 5). The algorithm keeps approximately 85% of the information given by the 168 characteristics. Hence, each pixel of the image has 3 characteristics.
Fig. 5. The three characteristics images obtained after the PCA space reduction (ordered from left to right)
The fourth phase consists in the classification of the pixels. The pixel features in the characteristic space were fed into the k-means classifier with 3 centers. A significant problem with this algorithm is that the representative distribution obtained is usually large. To remedy this, image subsections or patches which do not contain relevant texture information must be removed. In order to do this, a binary mask representing the chosen pixels was created (see figure 6a). If the region of the mask is black, then the associated pixels are not used as input for the classifier. The mask is formed by creating 5 horizontal white bands (~20% of the image) on a black background. The characteristic energy of the co-occurrence matrices is subtracted from the mask. This signifies that the uniform texture regions contained in every IVUS image are not included during the classification stage. This is not a problem for the analysis, as the uniform region is usually located at the bottom of the polar image, whereas the borders of interest are located at the top of the polar image. Some of the lumen and vessel tissue pixels were still fed into the classifier so that the classification was more accurate (see figure 6b). Although the choice of the initial centers for the k-means classifier is important, they can be chosen randomly because the k-means algorithm does not remain trapped in local minima, as explained in the previous section. Once the three cluster centers were found, the pixels were associated to the nearest center. The two vessel clusters were merged to obtain only two classes, because the clusters belonging to the vessel structure have a larger standard deviation than the lumen. The lumen cluster is smaller than the intima, media and adventitia clusters, because its pixels represent only one tissue. Sometimes the adventitia has a similar texture to the lumen. This is not a problem, since we want to find the border between the lumen and the intima, and so the adventitia has no influence on this final border. The unclassified pixels are determined by applying a threshold of 20% of the minimal distance between the lumen cluster and the vessel clusters. After the classification, a segmentation image was created. This image has two classes, the lumen and the vessels. The image also includes some uncertain pixels which are between the classes (see figure 6d).
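One possible reading of this assignment and of the 20% rejection rule is sketched below: each pixel goes to its nearest centre, and pixels whose distances to the lumen centre and to the nearest vessel centre differ by less than 20% of the smaller of the two are flagged as unclassified. The exact form of the threshold test is an assumption; the paper only states that a 20% threshold on the minimal distance is applied.

```python
import numpy as np

def assign_pixels(features, lumen_center, vessel_centers, rel_threshold=0.2):
    """Label each pixel as lumen (0), vessel (1), or unclassified (2)."""
    d_lumen = np.linalg.norm(features - lumen_center, axis=1)
    d_vessel = np.min([np.linalg.norm(features - c, axis=1) for c in vessel_centers], axis=0)
    labels = np.where(d_lumen < d_vessel, 0, 1)
    # Nearly equidistant pixels are left unclassified (assumed interpretation of the 20% rule)
    uncertain = np.abs(d_lumen - d_vessel) < rel_threshold * np.minimum(d_lumen, d_vessel)
    labels[uncertain] = 2
    return labels
```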
Fig. 6. (a top-left) The feature space image mask; in white, the pixels that are given to the classifier. (b top-right) The distribution that is given to the k-means classifier. (c bottom-left) The 3 clusters given by the classifier. The front distribution represents the lumen, the far distribution represents the vessel, and the uncertain pixels are located between them. (d bottom-right) The segmentation image that incorporates the lumen class in black, the vessel tissue class in white, and the unclassified pixels in gray.
The last phase consists in finding the separation border between the classes, taking the unclassified pixels into account. The border has to pass through the center of the uncertain area. Once the border was identified, it was processed using a Gaussian smoothing filter. The border was then scaled back onto the original IVUS image by using a cubic interpolation. It is necessary to consider the angle sampling and the radius trim. All of the transformations presented in the methodology introduce approximation errors in the order of a few pixels. The lumen-intima border found on the IVUS images by our method (see figure 7, right) is close to the border identified by an expert (see figure 7, left). Results show that 80% of the 300 IVUS images used in the analysis are adequately segmented. In the worst case, at least 70% of the lumen contour is segmented properly, with a few minor gaps. In those gap regions, the difference between the found border and the actual one is larger than 5 pixels. The remaining 60 images are not correctly segmented, because the textures in the lumen and the vessel are very similar and the algorithm cannot find any separation between them. The identification of the two vessel clusters can also produce errors, by associating the lumen cluster to the vessel and identifying one of the vessel clusters as the lumen.
Fig. 7. (left) The lumen-intima border identified manually and (right) determined by the proposed method
As most IVUS images are difficult to segment manually by an expert, we believe the results obtained from our proposed method will speed up the segmentation process. Furthermore, the gaps seen in some of the borders are created by the guidewire echo or because the catheter touches the intima. Imperfections also arise when there are vessel bifurcations in the IVUS images or when the lumen-intima border is too convex. Future work should focus on the removal of the echo created by the guidewire. Co-occurrence matrices are a powerful texture operator, but the parameters must be found experimentally. Tests could be carried out by choosing other cube sizes, other orientations or even other characteristics. The k-means classifier could be used with more than 3 classes to possibly identify the border between the media and the adventitia.
4 Conclusion

The method suggested in this paper is a new approach to IVUS image segmentation, specifically targeting poor quality images. Automatic identification of the lumen-intima border is effective on the majority of the IVUS images, in spite of the significant noise present in most of them. The 3D co-occurrence matrices take into account the spatiotemporal texture information in the IVUS sequence. By using a k-means classifier, we can segment the images without any training phase or prior knowledge. The full characterization of the morphology of the vessel is the future prospect for this method.
References 1. Olszewski, M., Wahle, A., Vigmostad, S., Sonka, M.: Multidimensional segmentation of coronary intravascular ultrasound images using knowledge-based methods. In: Fitzpatrick, J.M., Reinhardt, J.M. (eds.) Medical Imaging 2005: Image Processing. Proceedings of the SPIE, vol. 5747, pp. 496–504 (2005) 2. Cardinal, M.R., Meunier, J., Soulez, G., Therasse, E., Cloutier, G.: Intravascular Ultrasound Image Segmentation: A Fast-Marching Method. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2879, pp. 432–439. Springer, Heidelberg (2003) 3. Cardinal, M.R., Meunier, J., Soulez, G., Therasse, E., Cloutier, G.: Automatic 3D Segmentation of Intravascular Ultrasound Images Using Region and Contour Information. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3750, pp. 319–326. Springer, Heidelberg (2005) 4. Pujol, O., Rotger, D., Radeva, P., Rodriguez, O., Mauri, J.: Near real-time plaque segmentation of IVUS. Computers in Cardiology, 66–72 (2003) 5. Pujol, O., Radeva, P.: On the assessment of texture feature descriptors in intravascular ultrasound images: A boosting approach to a feasible plaque classification. Plaque Imaging: Pixel to Molecular Book, 276-299 (2005) 6. Pujol, O., Radeva, P.: Lumen detection in IVUS images using snakes in a statistical framework. Congreso Anual de la Sociedad Española de Ingeniería Biomédica, Zaragoza (2002) 7. Brunenberg, E., Pujol, O., Romeny, B., Radeva, P.: Automatic IVUS Segmentation of Atherosclerotic Plaque with Stop & Go Snake. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 9–16. Springer, Heidelberg (2006) 8. Santos Filho, E., Yoshizawa, M., Tanaka, A., Saijo, Y., Iwamoto, T.: Moment-based texture segmentation of luminal contour in intravascular ultrasound images. Journal of Medical Ultrasonics, 91–99 (2005) 9. Kurani, A., Xu, H.-H., Fustm, J., Raicu, D.: Co-occurrence Matrices for Volumetric Data. Computer Graphics and Imaging, 426–485 (2004) 10. Kanungo, T., Mount, D., Netanyahu, N., Piatkpo, C., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Analysis and Machine Intelligence, 881–892 (2002) 11. Kanungo, T., Mount, D., Netanyahu, N., Piatkpo, C., Silverman, R., Wu, A.Y.: A Local Search Approximation Algorithm for k-Means Clustering. Computational Geometry: Theory and Applications, 89–112 (2004) 12. Haralick, R., Shabmugam, K., Dinstein, I.: Textural Features for image classification. IEEE transactions on systems and cybernetics 3, 610–621 (1973) 13. Duda, R., Hart, P., Stork, D.: Pattern Classifications. Wiley-Interscience Publication, Chichester, 654 pages (2001) 14. Gonzales, R.C., Woord, R.E.: Digital image processing, 2nd edn. 793 pages, Prentice Hall, Upper saddle river,NJ (2002) 15. Kass, M., Witkin, A., Terzepoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1987)
Carotid Ultrasound Segmentation Using DP Active Contours Ali K. Hamou1, Said Osman2, and Mahmoud R. El-Sakka1 1 Computer Science Department University of Western Ontario, London, ON, Canada {ahamou, elsakka}@csd.uwo.ca 2 St. Josephs Health Sciences Centre, London, ON, Canada
[email protected]

Abstract. Ultrasound provides a non-invasive means for visualizing various tissues within the human body. However, these visualizations tend to be filled with speckle noise and other artifacts, due to the sporadic nature of high frequency sound waves. Many techniques for segmenting ultrasound images have been introduced in order to deal with these problems. One such technique is active contouring. In this paper, two proposed alterations to the dynamic programming parametric active contour model (or snake) are introduced. The first alteration allows the snake to converge to the one-response result of a modified Canny edge detector. The second provides a function that allows a user to preset a priori knowledge about a given object being detected, by means of curve fitting and energy modification. The results yield accurate segmentations of cross-sectional transverse carotid artery ultrasound images that are validated by an independent clinical radiologist. Utilizing the proposed alterations leads to a reduction of clinician interaction time while maintaining an acceptable level of accuracy for varying measures such as percent stenosis.

Keywords: Ultrasound Imaging, Carotid Stenosis, Image Segmentation, Active Contour Models.
1 Introduction

Carotid ultrasound (US) is considered an inexpensive, non-invasive clinical procedure for evaluating many carotid artery diseases [10]. It provides flexibility for efficient offline review of patient records, i.e., the US images can be acquired by a clinician and stored for later testing and analysis, with minimal user input. A brightness-mode (or B-mode) US image is one of the two major types of US used on the carotid. This study will primarily focus on B-mode images since they provide an adequate means for efficient clinical measurement of patients that are susceptible to plaque accumulation (whether from diet, genetics, obesity, etc.). A B-mode US image is a two-dimensional ultrasound display composed of bright pixels representing the ultrasound echoes and reverberations [16]. The echo amplitude determines the brightness of each pixel.
A high degree of carotid artery stenosis, which can be used as a pre-clinical disease marker, is an early indicator of a possible stroke. Stenosis is the abnormal narrowing of a blood vessel. It is generally caused by plaque build up [10]. Atheromatous plaque is an accumulation of fatty deposits that form inflamed patches within the inner lining of the arteries. In ultrasound images, hard (calcified) and soft plaque are represented as a range of high and low contrast pixels. However, these plaque areas have virtually the same levels of contrast as the surrounding soft tissue due to the lack of texture resolution and the high levels of noise inherent to US images. This poses many difficult problems for plaque and lumen detection algorithms, since experienced clinicians subjectively dictate the amount of plaque build up based on knowledge of image structures and prior training. Boundary detection is one of the most important elements of computer vision. Many techniques have been introduced to accomplish this task. One such example is the active contour model. Kass et al. [11] first proposed the original active contour model, also commonly known as a snake or a deformable model. In their formulation, image segmentation was posed as an energy minimization problem. Active contours treat the surface of an object as an elastic sheet that stretches and deforms when external and internal forces are applied to it. These models are physically-based, since their behaviour is designed to mimic the physical laws that govern real-world objects [8]. This model helps solve the problems of edge-based segmentation techniques, such as bleeding of the contour and unstable borders as exhibited in [1]. Active contour models have two inherent problems: • the proximity limitations of attaining a true boundary, especially when coupled with poor initialization plots •
the inability of traversing boundary concavities due to strong surface tension in the curve. The former is rarely an issue in the medical arena since clinicians are generally accepted as experts. Conversely, several studies have been conducted to overcome the latter, while also improving the accuracy and the speed of the original design. Of these are Dynamic Programming (DP) [5], level sets [18] and Gradient Vector Flow (GVF) models [19] just to name a few. The DP parametric active contour provides an efficient algorithm to quickly attain a desired curve. In this study, we have chosen this model for this attractive feature. Within this model, we propose a new external energy to be made up of two distinct elements. The first is to be composed of an edge map resulting from a modified Canny edge detector. Canny edge detection will maintain the one-edge response and the local connectedness property of an image, whereas the active contour will ensure that our entire energy is minimized over the detected border [5]. The second will be the curve-fitted data built upon the user’s initial plots and incorporated into the energy. This will amplify the amount of knowledge received from the clinician in acquiring a desired boundary. The proposed segmentation algorithm is used to extract a cross-sectional area of the carotid artery at two specified positions (see Section 2.3), which is then used to calculate a percent stenosis that is less prone to user bias between clinicians. It is important to note that measuring the percent stenosis is usually done by manually estimating the diameter of the carotid via catheter angiography. This introduces large
user bias into the calculations, while being quite costly from a clinician’s time standpoint. Abolmaesumi et al. [4] designed an efficient Kalman-star algorithm to estimate the location of the arterial wall. However, their algorithm still crosses the external boundaries of the wall due to the use of first degree gradient estimation. Mao et al. [13] used a contour probability distribution (CPD) function in order to evaluate the accuracy of their segmentation algorithm. Their algorithm employed entropy mapping and morphological operations based on gray-level values within an image. These mappings create an accurate identification of the arterial wall. However they are strenuous on computer systems resulting in long execution times. A class of schemes [1][2][3][9] for carotid artery contour extraction that are based on the use of local edge detection techniques are proposed. However, these schemes are used globally throughout an image, and hence lack localization. All of these techniques are susceptible to leaking (or contour bleeding), especially if the image quality is not suitable. Hence other means are needed to properly segment the carotid and calculate its percent stenosis. The rest of this paper is organized as follows. In Section 2, the proposed scheme is presented. The results of the scheme are shown in Section 3, and Section 4 contains conclusions.
2 System and Methods

2.1 External Energy Formulation

A snake is an energy minimization problem. Its energy is represented by two forces (internal energy, E_in, and external energy, E_ex) which work against each other. The total energy should converge to a local minimum – in the perfect case – at the desired boundary. The snake is defined as X(s), where s belongs to the interval [0,1]. Hence, the total energy to be minimized to give the best fit between a snake and a desired object shape is:

E_{total} = \int_{s=0}^{s=1} \{ E_{in}(X(s)) + E_{ex}(X(s)) \} \, ds    (1)
The internal energy would force the curve to be smooth (by incorporating both elasticity and stiffness); whereas the external energy would force the curve towards image structures (image edges). The internal energy of the active contour formulation is further defined as:
E_{in}(X(s)) = \alpha \times \left| \frac{dX(s)}{ds} \right|^2 + \beta \times \left| \frac{d^2 X(s)}{ds^2} \right|^2    (2)
where α and β are the weights of elasticity and stiffness, respectively. The first order term makes the snake’s surface act like a membrane. The weight α controls the tension along the spine (stretching a balloon or elastic band). The second order term makes the snake act like a thin plate. The weight β controls the rigidity of the spine (bending a thin plate or wire) [8]. We define our external energy as:
E_{ex} = \gamma \times Canny(f(x,y), \sigma, l, u) + \tau \times Shape(X(s))    (3)
where γ and τ are the weights for the two proposed external energy elements, Canny and prior shape knowledge, respectively, and f(x,y), σ, l, and u represent the original image and the parameters of the Canny operator: the degree of Gaussian blur, a lower threshold and an upper threshold. The weights γ and τ are important factors in governing whether the clinician's knowledge or the modified Canny edge map takes precedence in the converging snake model. In general, active contour internal and external energies need to be balanced properly in order to converge to the desired edge in question. In order to properly tune the given curve with the proper weights, values were adjusted until the snake converged to the edges chosen by the radiologist. Samples were taken on both normal and diseased images in order to properly validate the curve with the chosen weights. Initially, equal weights were assigned to each energy. Gradually, weights were shifted until acceptable edge convergence was achieved. The weights α, β, γ and τ were set to 0.09, 0.01, 0.5, and 0.4, respectively. This is justified for the US images, since the modified Canny edge detector and the a-priori knowledge reside in the external energy formulation. Hence more weight should be shifted towards the external energy.

2.2 Canny Edge Detection

The Canny algorithm first smooths the data (by means of a Gaussian filter) in order to remove speckle noise. Then edge detection is performed. Canny optimizes the edge response with respect to three criteria:
• Detection: real edges should not be missed, and false edges should be avoided.
• Location: the edge response should lie as close as possible to its true location.
• One-response: a true edge should not give rise to more than one response.
Canny is particularly sensitive to its input parameters [6]. These parameters allow for fine-tuning of the edge-detection mechanism, permitting the user to acquire the desired results on an image data set. Canny uses a technique called "non-maximum suppression" that relies on two thresholding values (lower and upper). The lower defines a threshold for all points that should not be considered as lying on an edge. The upper defines a threshold for all the points guaranteed to be lying on an edge [6]. To emphasize the main edges in an image, the two thresholds should be large. To find the details of an image, they should be small. The Canny edge detector was applied with increasing lower and upper thresholds, and the resulting images were probed where expected edges should exist within the US images. Canny recommends that the lower threshold should be one third of the upper threshold. Our trials are consistent with the recommended values, yielding a 0.2 lower threshold and a 0.6 upper threshold. We also found these thresholds reduce the creation of false edges due to speckle noise in the image without degradation.
Fig. 1. Example of Edge detection. (a) Original carotid US image. (b) Canny edge detected binary image (c) Modified Canny edge detected grayscale image (d) Distance transformed modified Canny edge detected image.
Modification to the Canny edge detection algorithm was necessary in order to more accurately integrate this measure into the external energy. The standard Canny edge detector outputs all edges as a binary result. Each point that is lying on a Canny detected edge will not have the same weight following detection; rather it will be replaced with its relative gradient. Fig. 1(a) shows an example of a carotid artery US image. Fig. 1(b) shows the US image following the application of a normal Canny edge detector. The result is a binary image with all points having the same weight. Fig. 1(c) shows the modified Canny edge detector where each point has a designated weight associated with its gradient – hence a gray scale image. Each gradient point is incorporated into the external energy by means of a Manhattan distance transform [17], as shown in Fig. 1(d). The Manhattan distance transform provides a set of gray level intensities that show the distance to the closest boundary from each point. This provides a path to a better local minimum by avoiding small hazards of unwanted noise. The Manhattan distance of the modified Canny edge map was sufficient for the purpose of this study due to its simplicity and elegance of implementation. The energy will also converge such that it will mimic the behaviour of GVF snakes, since the curve will be forced into boundary concavities.
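As a rough illustration of this step (a sketch under our own naming assumptions, not the authors' exact implementation), the following Python snippet builds a gradient-weighted Canny edge map and its Manhattan (city-block) distance transform; σ and the thresholds mirror the values quoted above, and the image is assumed to be a float array scaled to [0, 1].

import numpy as np
from scipy import ndimage
from skimage.feature import canny

def modified_canny_energy(image, sigma=2.0, low=0.2, high=0.6):
    """Gradient-weighted Canny edge map plus its Manhattan distance transform."""
    # Binary Canny edges (thresholds are absolute values on the [0, 1] image).
    edges = canny(image, sigma=sigma, low_threshold=low, high_threshold=high)
    # Replace each binary edge point by its gradient magnitude ("modified" Canny).
    grad = ndimage.gaussian_gradient_magnitude(image, sigma=sigma)
    weighted_edges = np.where(edges, grad, 0.0)
    # City-block (taxicab) distance from every pixel to the nearest edge point;
    # this distance map is what the external energy of the snake is built on.
    dist = ndimage.distance_transform_cdt(~edges, metric='taxicab').astype(float)
    return weighted_edges, dist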
2.3 A-Priori Knowledge

It is desirable to incorporate some means of shape into the model without directly involving the user. Automatic shape detection takes place on the initial plots positioned by the user. This would force the active contour to converge closer to the user's first delineated points. This is especially useful in the medical arena, where a specialist or technician has a clear understanding of the underlying structure being detected, such as a liver, an artery, or a heart. Traditional shape priors in snake models tend to be placed in the internal energy [12]. However, this knowledge has been incorporated into the external energy in order to influence the shape and the positional information as well – since it is readily available from the user. This shape and position information will also be represented by the Manhattan distance transform in order to properly integrate it within the external energy. For carotid images, the prior was set to ellipses. A least squares ellipse fitting technique is performed on the points of the initial contour. The initial ellipse is defined using the initial contour placement points of the snake. Continual refinement takes place by attempting to minimize the geometric error of a series of approximating curves to the initial points. This process continues until a curve with minimal deviation from all snake contour points is found. Fig. 2(a) and 2(b) show the active contour being used without and with a-priori knowledge, respectively. Fig. 2(a) depicts an improper segmentation of the boundaries, most likely due to poor visualization in the lower region of the image. Fig. 2(b) takes the same initial plots and builds knowledge around the data by ellipse fitting in order to prevent the collapse of the curve and improve alignment with the true lumen.
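A minimal algebraic least-squares conic fit of this kind could look as follows (a sketch, not necessarily the fitting routine used here). It assumes at least five distinct points, which is satisfied once the equally spaced points interpolated between the clinician's four plots are included.

import numpy as np

def fit_ellipse_lsq(x, y):
    """Least-squares fit of a general conic a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0.

    For well-spread points sampled from a roughly elliptical contour the
    resulting conic is an ellipse; x and y are 1-D arrays of >= 5 point coordinates.
    """
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    # The coefficient vector is the right singular vector of D with the smallest
    # singular value, i.e. the least-squares solution of D @ p = 0 with ||p|| = 1.
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    return vt[-1]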
Fig. 2. Active contour on a carotid ultrasound image. (a) without a-priori knowledge. (b) with a-priori knowledge.
2.4 Clinical Metrics

The desire to use the proposed technique in the clinical arena is paramount. One such metric is the North American Symptomatic Carotid Endarterectomy Trial (NASCET). This is a confirmed clinical test that may benefit patients with high-grade carotid stenosis [14]. The NASCET percent stenosis criterion is measured as: the
NASCET: percent stenosis = ((B − A) / B) × 100%
Fig. 3. (a) Drawing of the NASCET criteria for evaluation of carotid stenosis: A is the diameter of the residual lumen at the point of maximal stenosis (vertical lines representing plaque); B is the diameter of the normal artery distal to the stenosis. (CCA, ECA and ICA stand for Common, External and Internal Carotid Artery, respectively) (b) Ultrasound image of a cross-sectional slice, at line L in (a), showing the area of the carotid. Image is histogram equalized for better visualization.
diameter of the lumen at the point of maximal stenosis, subtracted from the diameter of the lumen of the normal internal carotid artery, divided by the diameter of the normal internal carotid artery, multiplied by 100% [14]. The diameter of maximal stenosis is most often found just above the carotid bulb, near the bifurcation of the internal and external carotid arteries. A visual representation of this is shown in Fig. 3(a). This test is generally performed by catheter angiography – an invasive technique requiring surgery. In this paper, we propose an alternative method of measurement from a transverse segmented US image. While angiography gives a 2D representation of a longitudinal view of any given artery, it will not reveal any information on the depth of that artery. Depth can be measured by taking a cross-sectional area (transverse) slice of the carotid, as shown in Fig. 3(b). This additional information should improve the traditional NASCET measurements. However, a slight modification to our measurements is needed in order to conform to the NASCET standard. Since NASCET percent stenosis is based on artery diameters, whereas the proposed measure is based on (generally elliptic) areas, the area of each contour is square rooted before the ratio is calculated. This helps to preserve the ratio of percent stenosis in a range that traditional clinicians are already familiar with, rather than introducing an entirely new range for stenosis.
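In code, the area-based variant described above amounts to a few lines (a sketch; the two segmented areas are assumed to be given in consistent units):

import math

def percent_stenosis_from_areas(area_stenosis, area_normal):
    """NASCET-style percent stenosis computed from cross-sectional lumen areas.

    Square-rooting each area converts it to an effective diameter, so the ratio
    stays in the range clinicians know from the diameter-based NASCET criterion.
    """
    d_stenosis = math.sqrt(area_stenosis)   # effective diameter at maximal stenosis (A)
    d_normal = math.sqrt(area_normal)       # effective diameter of the normal distal artery (B)
    return (d_normal - d_stenosis) / d_normal * 100.0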
3 Results

In this study, 91 carotid artery US images are used. These images include normal and diseased carotid arteries at different positions along the carotid artery, e.g., common
Fig. 4. Transverse carotid artery US images of the carotid bulb. (a) the original image. (b) the original image manually segmented by a clinical radiologist. (c) original image with four points – white circles – placed by the clinical radiologist as an initial value for the algorithm and the extracted prior ellipse as a dashed line. (d) the resulting segmentation by the proposed active contour algorithm.
carotid, carotid bulb, and internal carotid. Manual carotid lumen segmentation was performed on all of the images by a clinical radiologist. The images were then segmented by the proposed snake algorithm. These images were acquired using a SONOS 5500 by Philips Medical System. Fig. 4(a) shows a transverse 2-dimensional US image of the carotid bulb. Fig. 4(b) represents the manually segmented region of the carotid bulb by a clinical radiologist at the given A position designated by the NASCET measure (as shown in Fig. 3(a)). Fig. 4(c) shows an example set of the four initial points plotted by the radiologist as an input to the active contour and the result of initial ellipse extraction. These points should be placed anywhere on the carotid lumen edge. They are required in order to achieve an elliptical approximation and the initial contour. Each point placed is automatically connected to its closest neighbour by means of a direct line of equally distanced points. Fig. 4(d) shows the result of the active contour following convergence. Similarly, Fig. 5(a), (b), (c) and (d) show the same for an internal
carotid artery image (at location B designated by the NASCET measure). These results were ascertained by using the proposed active contour with parameters σ, l, u, α, β, γ, and τ set to 2.0, 0.2, 0.6, 0.09, 0.01, 0.5 and 0.4, respectively (as defined in Section 2.1). All images processed by the proposed algorithm were examined and reviewed by a radiologist in a double blind study in order to ensure the segmentations were accurate. Due to the subjectivity in estimating the borders, the radiologist reported that the proposed technique would objectify these estimates and reduce the inconsistencies among clinical practitioners. This will serve to improve the computation of percent stenosis and other measures used in the clinical arena.
Fig. 5. Transverse carotid artery US images of the internal carotid artery. (a) the original image. (b) the original image manually segmented by a clinical radiologist. (c) original image with four points – white circles – placed by the clinical radiologist as an initial value for the algorithm and the extracted prior ellipse as a dashed line. (d) the resulting segmentation by the proposed active contour algorithm.
Overall, accuracy of the proposed system was measured by comparing the 91 cases to the expert manual segmentations by the radiologist. These measurements include both type I and type II errors as defined in [15]. Since the images at hand were mainly small segmented foregrounds against vast backgrounds, the system would best be
measured by means of its sensitivity and precision rate. Sensitivity is the number of true positives divided by the number of true positives plus false negatives; in other words, it quantifies how well a binary classification test correctly identifies a condition. Precision rate is the number of true positives divided by the number of true positives plus false positives; in other words, it quantifies what fraction of the detected results is actually correct. The sensitivity of the system, given a 95% confidence interval, is 0.871 – 0.916, whereas the precision rate, given the same confidence interval, is 0.866 – 0.903. It is worth mentioning that these reported performance measures are quite high. This can be attributed to the fact that the initial points to the proposed system were plotted by the specialist.
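The per-image measures themselves are straightforward to compute from binary masks; a minimal sketch, assuming the algorithmic and manual segmentations are given as same-sized binary arrays:

import numpy as np

def sensitivity_precision(segmentation, reference):
    """Pixel-wise sensitivity and precision of a binary segmentation vs. a manual reference."""
    seg = segmentation.astype(bool)
    ref = reference.astype(bool)
    tp = np.count_nonzero(seg & ref)    # true positives
    fp = np.count_nonzero(seg & ~ref)   # false positives
    fn = np.count_nonzero(~seg & ref)   # false negatives
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return sensitivity, precision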
4 Conclusions

In this paper, a modified DP snake algorithm helped alleviate the inherent difficulties in segmenting ultrasound carotid artery images, such as speckle noise and contour bleeding. The proposed alterations utilized a modified Canny edge detector and a-priori knowledge. Experimental results show that these alterations tune the original DP snake model so that it can be used in the medical arena and clinical analysis results can be improved. A radiologist verified that this objective segmentation scheme would improve the calculation of different clinical measures, such as percent stenosis, by reducing the inconsistencies and variability between clinicians while reducing the time for clinician interaction. We have shown that the segmentation of the carotid artery is robust enough to be used in a clinical setting for specific domain applications.

Contributions

Author contributions are allocated as follows: research, development, implementation and paper writing were done by Ali Hamou, M.Sc., and Mahmoud El-Sakka, Ph.D. Result validation was done by Said Osman, M.D., FRCPC.
Acknowledgements This research is partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). This support is greatly appreciated.
References [1] Abdel-Dayem, A., El-Sakka, M.: A Novel Morphological-based Carotid Artery Contour Extraction. In: Canadian Conference on Electrical and Computer Engineering, pp. 1873– 1876. IEEE Computer Society Press, Los Alamitos (2004) [2] Abdel-Dayem, A., El-Sakka, M., Fenster, A.: Watershed Segmentation for Carotid Artery Ultrasound Images. In: International Conference on Computer Systems and Applications, pp. 131–139. IEEE Computer Society Press, Los Alamitos (2005) [3] Abdel-Dayem, A., El-Sakka, M.: Carotid Artery Ultrasound Image Segmentation Using Fuzzy Region Growing. In: Kamel, M., Campilho, A. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 869–878. Springer, Heidelberg (2005)
[4] Abolmaesumi, P., Sirouspour, M., Salcudean, S.: Real-Time Extraction of Carotid Artery Contours from Ultrasound Images. In: Computer Based Medical Systems 2000, pp. 181– 186. IEEE Computer Society Press, Los Alamitos (2000) [5] Amini, A., Tenran, S., Weymouth, T.: Using dynamic programming for minimizing the energy of active contours in the presence of hard constrains. In: 2nd International Conference of Computer Vision, pp. 95–99. IEEE Computer Society Press, Los Alamitos (1988) [6] Bresson, X., Vandergheynst, P., Thiran, J.: A priori information in image segmentation: energy functional based on shape statistical model and image information. In: International Conference on Image Processing, vol. 3, pp. 425–428. IEEE Computer Society Press, Los Alamitos (2003) [7] Canny, J.: Finding lines and edges in images. MIT Press, Cambridge (1983) [8] Cohen, I.: On active contour models and balloons. Image Understanding 53, 211–218 (1991) [9] Hamou, A., El-Sakka, M.: A Novel Segmentation Technique For Carotid Ultrasound Images. In: International Conference on Acoustics, Speech and Signal Processing, pp. 521–524. IEEE Computer Society Press, Los Alamitos (2004) [10] Jespersen, S., Gronholdt, M., Wilhjelm, J., Wiebe, B., Hansen, L., Sillesen, H.: Correlation Between Ultrasound B-mode Images of Carotid Plaque and Histological Examination. In: Ultrasonics Symposium, pp. 1065–1068. IEEE Computer Society Press, Los Alamitos (1996) [11] Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. In: 1st International Conference on Computer Vision, pp. 259–268. IEEE Computer Society Press, Los Alamitos (1987) [12] Leventon, M., Grimson, W., Faugeras, O.: Statistical Shape Influence in Geodesic Active Contours. In: Computer Vision and Pattern Recognition Conference, pp. 316–323. IEEE Computer Society Press, Los Alamitos (2000) [13] Mao, F., Gill, J., Downey, D., Fenster, A.: Segmentation of Carotid Artery in Ultrasound Images. In: Engineering in Medicine and Biology Society Conference, pp. 1734–1737. IEEE Computer Society Press, Los Alamitos (2000) [14] NASCET Team: North American Symptomatic Carotid Endarterectomy Trial (NASCET). Stroke, vol. 22, pp. 816–817 (1991) [15] Neyman, J., Pearson, E.S: On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I. Joint Statistical Papers, pp. 1–66. Cambridge University Press, Cambridge (1967) (originally published in 1928) [16] O’Donnell Jr., T., Erdoes, L., Mackey, W., McCullough, J., Shepard, A., Heggerick, P., Isner, J., Callow, A.: Correlation of B-mode Ultrasound Imaging and Arteriography with Pathologic Findings at Carotid Endarterectomy. Arch Surgery 120, 443–449 (1985) [17] Rosenfeld, A., Pfaltz, J.: Distance Functions in Digital Pictures. Pattern Recognition 1, 33–61 (1968) [18] Sethian, J.: Level set methods and fast marching methods: evolving interfaces in computational geometry. In: Fluid Mechanics, Computer Vision, and Materials Science, 2nd edn., Cambridge University Press, Cambridge (1999) [19] Xu, C.: Deformable Models with Application to Human Cerebral Cortex Reconstruction from Magnetic Resonance Images. Ph.D. Dissertation, Baltimore Maryland (1999)
Bayesian Reconstruction Using a New Nonlocal MRF Prior for Count-Limited PET Transmission Scans Yang Chen1,2 , Wufan Chen1 , Pengcheng Shi1 , Yanqiu Feng1 , Qianjin Feng1 , Qingqi Wang2 , and Zhiyong Huang2 1
School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China, Tel.:+086-02061648285-82; Fax:+086-02061648294 {chenwf,shipch,foree,fqianjin}@fimmu.com 2 The 113 Hospital of People’s Liberation Army, Ningbo, 315040, China, Tel.:+086-057427754036 {chenyang20021979,wangqingqi,jd21}@163.com
Abstract. Transmission scans are performed to provide attenuation correction factor (ACF) information for positron emission tomography (PET). Long acquisition or scan times for transmission tomography, although alleviating the noise effect of the count-limited scans, cause patient discomfort and movement. Consequently, the quality of transmission tomography from short scan times often suffers heavily from noise and limited counts. Bayesian approaches, or maximum a posteriori (MAP) methods, have been accepted as an effective solution to the ill-posed problem of count-limited transmission tomography. Based on Bayesian and Markov Random Field (MRF) theories, prior information about the objective image can be incorporated to improve the reconstructions from count-limited and noise-contaminated transmission scans. However, the information of traditional priors comes from simply weighted differences between the pixel densities within local neighborhoods, so only limited prior information can be provided for reconstruction. In this paper, a novel nonlocal MRF prior, which is able to exploit global information of the image by choosing large neighborhoods and a new weighting method, is proposed. A two-step monotonic reconstruction algorithm is also given for PET transmission tomography. Experiments show that reconstructions using the nonlocal prior produce better transmission images and overcome the ill-posed problem even when the scan time is relatively short. Keywords: Positron Emission Tomography (PET), Transmission Tomography, Markov Random Fields (MRF), nonlocal prior, local prior.
1 Introduction

In positron emission tomography (PET), not all annihilations that result in a photon pair heading towards two detectors are actually detected. Often, one or both of the photons get scattered or absorbed by the patient body, resulting in no
detection. The survival probability of an annihilation event is determined by the length and the type of the tissue that the photons traverse. This effect is called attenuation. The attenuation map or attenuation correction factors (ACFs) express spatially the mass absorption coefficients for the transaxial slice of the body. Such attenuation is different for different tissue types, hence the measurements should be compensated for attenuation for each ray (or line integral) [1]. In clinical practice, short transmission scan times with precise attenuation correction are requested because statistically desired long acquisition times for transmission are clinically impractical and inconvenient to the patient. However, short scan times with low limited total counts levels often bring statistical noise to reconstructions. So, the short scan data are conventionally smoothed with the resulting unfavorable blurring of details propagating to the emission sinogram [2-3]. Attenuation correction can also be performed by computational methods [3]. The attenuating matter is approximated by an area with a uniform value of linear mass absorption coefficient f. The ACFs can be computed by projecting the graphically defined f-image and taking the exponential:

ACF = e^{\int_{l} f(x) \, dx}    (1)
For the above computational methods, when the analytic filtered back projection (FBP) method is applied, noise and some bias may be introduced in the transmission image. Bayesian statistical methods, which have been widely accepted for producing better reconstructions, are recommended for transmission tomography [4-8]. However, in the case of a short transmission scan, when the noise level is relatively significant, the simple quadratic membrane (QM) smoothing prior tends to produce an unfavorable oversmoothing effect, and edge-preserving nonquadratic priors tend to produce blocky piecewise regions or so-called staircase artifacts. What is more, the application of edge-preserving Bayesian methods is often hindered by the annealing procedures and complicated estimation of some built-in parameters that they require. Furthermore, the convexity of the prior energy cannot be preserved for some edge-preserving nonquadratic priors, which might endanger the concavity of their overall objective posterior energy functions [4-5]. Both the annoying oversmoothing effect of quadratic priors and the staircase effect of nonquadratic priors can be ascribed to the fact that the simple weighting strategy and small local neighborhoods can only provide limited prior information. None of the above traditional priors address the information of global connectivity and continuity in the objective image, and we term these traditional priors local priors. In this paper, we devise a novel nonlocal MRF quadratic prior model for PET transmission tomography. The nonlocal quadratic prior can greatly improve the reconstruction by exploiting not only density-difference information between individual pixels but also global connectivity and continuity information in the objective image. Relevant comparisons show the proposed nonlocal prior's good properties in lowering the noise effect and preserving edges for transmission data from different scan times.
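As a rough illustration of equation (1), attenuation correction factors can be obtained by forward-projecting an attenuation map and exponentiating the line integrals. The sketch below uses the Radon transform as a stand-in for the scanner geometry; the function name and the sampling choices are our own assumptions, not the system model used later in this paper.

import numpy as np
from skimage.transform import radon

def acf_from_attenuation_map(mu_map, n_angles=128):
    """ACF = exp(line integral of the attenuation map) for every (angle, offset) ray."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    # radon() returns the sinogram of line integrals, one column per projection angle;
    # the pixel size is implicitly 1, so mu_map must be given in units of 1/pixel and
    # is assumed to be zero outside the inscribed reconstruction circle.
    line_integrals = radon(mu_map, theta=theta)
    return np.exp(line_integrals)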
2 Theory of the Proposed Nonlocal MRF Prior Model
Based on Bayesian and Markov Random Field (MRF) theory, regularization from prior information can be imposed on image reconstruction to suppress the noise effect through the following posterior probability P(f|g):

P(f|g) \propto P(g|f) \, P(f)    (2)

P(f) = Z^{-1} \times e^{-\beta U(f)} = Z^{-1} \times e^{-\beta \sum_{j} U(f,j)}    (3)

where P(g|f) is the likelihood distribution and P(f) is the MRF prior distribution. The partition function Z is a normalizing constant. U(f) is the MRF prior energy function, and U(f,j) is the notation for the value of the energy function U evaluated on f at pixel index j. β is the global hyperparameter that controls the degree of the MRF prior's influence on the image f. The energy function in (3) attains its minimum, and the corresponding prior distribution (3) attains its maximum, when the image meets the prior assumptions ideally [5]. And we can build the posterior energy function as

\psi_{\beta}(f) = \log P(f|g) = L(g,f) - \beta U(f)    (4)
where L(g,f) represents the likelihood energy function. The reconstructed image f can be obtained through maximization of the function ψ_β(f) by an iterative procedure.

2.1 The Traditional Local Prior Model
Traditionally, the prior energy U(f,j) in (3) is commonly computed through a simply weighted sum of potential functions of the differences between the pixels in neighborhood N_j:

U(f,j) = \beta \sum_{b \in N_j} w_{bj} \, v(f_b - f_j)    (5)
Generally, different choices of the potential function v lead to different priors. The prior is the space-invariant QM prior when the potential function has the form v(t) = t^2/2. We can also choose edge-preserving nonquadratic priors by adopting a nonquadratic potential function for v, such as the Huber potential function:

v(t) = \begin{cases} t^2/2, & |t| \le \delta \\ \delta |t| - \delta^2/2, & |t| > \delta \end{cases}    (6)

where δ is the threshold parameter [4]. Such a Huber prior preserves the edge information by choosing a nonquadratic potential function that increases less when the differences between adjacent pixels become bigger. And in the iterative reconstruction algorithm using MRP, the medians of the pixels within local neighborhoods are incorporated into the iterative algorithm [4-6]. Weight w_{bj} is a positive value that denotes the interaction degree between pixel b and pixel j. And in the local prior model, it is usually considered to be in
inverse proportion to the distance between pixel b and pixel j. So on a square lattice of image f, in which the 2D positions of pixel b and pixel j (b ≠ j) are respectively (b_x, b_y) and (j_x, j_y), w_{bj} can be calculated by the formula 1 / \sqrt{(b_x - j_x)^2 + (b_y - j_y)^2} or some other simple forms. A typical normalized eight-neighborhood weighting map and a typical four-neighborhood weighting map for local priors are as follows:

\frac{1}{4 \times 1 + 4 \times \frac{1}{\sqrt{2}}} \times \begin{pmatrix} 1/\sqrt{2} & 1 & 1/\sqrt{2} \\ 1 & 0 & 1 \\ 1/\sqrt{2} & 1 & 1/\sqrt{2} \end{pmatrix}, \qquad \frac{1}{4} \times \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
Such kinds of simple weight-determining methods have been widely used in Bayesian image reconstructions and restorations [4-8]. From the above we can see that local priors can only provide prior information within a fixed local neighborhood for image reconstruction.

2.2 The Proposed Nonlocal Prior Model
Elicited by the nonlocal idea in [9], we devised a novel nonlocal prior for Bayesian image reconstruction. In the building of such a nonlocal prior, a large neighborhood Ne is set to incorporate global geometrical configuration information in the image. Rather than just a simple inverse-proportional calculation of the spatial distance between two individual pixels, the nonlocal prior weight w_{bj}^{NL} between pixel b and pixel j (in (8)) is estimated through a similarity measure of the respective neighborhoods centered on pixel b and pixel j. This neighborhood-based similarity is computed as a decreasing function of the Gaussian-weighted Euclidean distance between all the pixel densities within the two neighborhoods. The objective image estimate f for Bayesian reconstruction using such a nonlocal prior can be obtained through maximization of the posterior energy function built from the following nonlocal prior energy:

U_{NL}(f) = \sum_{j} U_{NL}(f,j) = \sum_{j} \sum_{b \in Ne_j} w_{bj}^{NL}(f) \, (f_b - f_j)^2    (7)

w_{bj}^{NL}(f) = \exp\left( - \frac{\| V(n_b) - V(n_j) \|^2}{h^2} \right) \Big/ \, dis_{bj}    (8)

V(n_b) = (f_q, \; q \in n_b)    (9)

V(n_j) = (f_p, \; p \in n_j)    (10)
dis_{bj} = \sqrt{(b_x - j_x)^2 + (b_y - j_y)^2}    (11)
Here V(n_b) and V(n_j) are the two pixel density vectors in the two square comparing neighborhoods n_b and n_j that are centered at pixel b and pixel j, respectively. ‖A − B‖² denotes the sum of Gaussian-weighted Euclidean distances between the corresponding pixel densities in vectors A and B. dis_{bj} denotes the 2-D spatial distance between pixel b and pixel j, whose 2-D coordinates are (b_x, b_y) and (j_x, j_y), respectively. Parameter h controls the decay of the exponential function in (8) of the weights for the nonlocal prior. The neighborhood Ne is set to be a large one, covering a large region, in order to incorporate more information on the geometrical configuration into the nonlocal prior. Through the computation of the similarities of the comparing neighborhoods centered on each pair of pixels in the neighborhood Ne, the weights are distributed across the more similar configurations, which can be easily seen in Fig. 1. So we conclude that the proposed nonlocal prior is able to accomplish a systematic use of global self-prediction configuration information in the objective image f.
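A direct (unoptimized) sketch of the weight in (8)-(11) for a single pixel pair might read as follows; the Gaussian kernel width, the assumption that both patches lie fully inside the image, and the variable names are illustrative choices of ours.

import numpy as np

def nonlocal_weight(image, b, j, half_patch=3, h=1.0):
    """Nonlocal prior weight between two distinct pixels b and j (2-D index tuples), eqs. (8)-(11)."""
    def patch(p):
        r, c = p
        return image[r - half_patch:r + half_patch + 1,
                     c - half_patch:c + half_patch + 1]
    diff = patch(b) - patch(j)
    # Gaussian-weighted Euclidean distance between the two patch density vectors.
    gy, gx = np.mgrid[-half_patch:half_patch + 1, -half_patch:half_patch + 1]
    gauss = np.exp(-(gx ** 2 + gy ** 2) / (2.0 * half_patch ** 2))
    patch_dist = np.sum(gauss * diff ** 2)
    # Spatial distance between the two pixel positions, eq. (11); b != j so dis > 0.
    dis = np.hypot(b[0] - j[0], b[1] - j[1])
    return np.exp(-patch_dist / h ** 2) / dis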
Fig. 1. The illustrations of the weight distributions for the pixels (pointed by the arrows) in the corresponding objective images for the proposed nonlocal prior model. The size of neighborhood N e and n is 21×21 and 7×7, respectively. (a) The weights are distributed in the direction of the straight edge line when the pixel is in a straight edge. (b) The weight distributions favor pixels belonging to the same curved line when the pixel is in a curved edge. (c) The illustrations of the weight distributions for two pixels in edge region and background region, respectively.
3 Statistical Model and Two-Step Reconstruction Algorithm
In PET statistical reconstruction, transmission scan measurement data vectors can be independently modeled as Poisson random variables. gi , the measurement detected by the detector pair i from a transmission scan, can be well modeled
as Poisson random variables with expectation ḡ_i as a function of the underlying attenuation map for transmission tomography [9-14]. In transmission tomography, P_L(g|f) (i.e., the likelihood function) is the probability of obtaining the measurement vector g if the objective transmission attenuation map is f. Based on the above, we can get the following formulas:

g_i \sim Poisson\{ \bar{g}_i(f) \}, \quad i = 1, \ldots, N    (12)

\bar{g}_i(f) = d_i \exp\left( - \sum_{j=1}^{M} a_{ij} f_j \right) + r_i^T    (13)

P_L(g|f) = \prod_{i=1}^{N} \frac{\bar{g}_i(f)^{g_i}}{g_i!} \exp(-\bar{g}_i(f))    (14)

where N is the number of detector pairs, M is the number of objective image pixels, and a_{ij} is the geometric probability that a transmission from image pixel j is detected by the detector pair i in ideal conditions. The d_i factors incorporate calibration factors of scan time, detector efficiencies, attenuation factors and possibly deadtime correction factors for detector pair i. The r_i^T factors represent the total detected random and scattered event counts for detector pair i in the transmission scan. The corresponding log-likelihood function for estimating f from g is:

L(f) = \log P_L(g|f) = \sum_{i} \left( g_i \log \bar{g}_i(f) - \bar{g}_i(f) - \log g_i! \right)    (15)
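A compact sketch of this forward model and log-likelihood, assuming the system matrix A (the a_ij), the calibration factors d_i and the randoms/scatter estimates r_i are available as NumPy arrays:

import numpy as np

def expected_counts(f, A, d, r):
    """Mean transmission counts  g_bar_i = d_i * exp(-(A f)_i) + r_i,  eq. (13)."""
    return d * np.exp(-A.dot(f)) + r

def log_likelihood(g, f, A, d, r):
    """Poisson log-likelihood of eq. (15), dropping the constant log(g_i!) terms."""
    g_bar = expected_counts(f, A, d, r)
    return np.sum(g * np.log(g_bar) - g_bar)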
Choosing the nonlocal prior, the objective image f can be obtained through an iterative maximization of the posterior energy function ψ_β(f|g):

\hat{f} = \arg\max_{f} \psi_{\beta}(f|g) = \arg\max_{f} \left( L(g,f) - \beta U_{NL}(f) \right)    (16)

However, from (7)-(11), we find that every weight term w_{bj}^{NL} is determined by the objective image f, and the concavity of the above posterior energy function can not be preserved. So we apply the following alternating two-step updating iterative algorithm for transmission tomography:

a. Weight update: After setting an initial image estimate f^0 for the iteration, for every pixel pair (f_b, f_j) in image f, we compute w_{bj}^{NL} using (7)-(10) and the current image estimate \hat{f}.

b. Image update: It has been shown in [4] that

\frac{\partial^2 \psi_{\beta}}{\partial f_b \partial f_j} = \frac{\partial^2 L(f)}{\partial f_b \partial f_j} - \beta \, \frac{\partial^2}{\partial f_b \partial f_j} \sum_{j} \sum_{b \in Ne_j} w_{bj}^{NL}(\hat{f}) \, (f_b - f_j)^2    (17)

The matrix \partial^2 L(f) / \partial f_b \partial f_j, with L(f) given by (15), is strictly negative definite [4]. And from (7)-(10) we know

0 < w_{bj}^{NL}(\hat{f}) = \frac{1}{Z} \exp\left( - \frac{\| v(\hat{n}_b) - v(\hat{n}_j) \|^2}{h^2} \right) \frac{1}{dis_{bj}} \le \frac{1}{Z \times dis_{bj}}    (18)

where \hat{n}_b and \hat{n}_j are both computed from the current image estimate \hat{f}.
Considering the ensured convexity of (f_b − f_j)², we can conclude that at every iteration the posterior energy ψ_β(f|g), whose second derivatives are given in (17), is concave given the estimated w_{bj}^{NL}. Thus we can monotonically maximize the posterior energy function ψ_β(f|g) using iterative algorithms such as the paraboloidal surrogate coordinate ascent (PSCA) algorithm [10] and the well-known conjugate gradient algorithm [11].
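In outline, the alternating scheme could be organized as below. This is only a skeleton: the two callables are hypothetical stand-ins, the first evaluating eqs. (7)-(10) on the current estimate and the second running a monotonic optimizer (e.g. PSCA or conjugate gradient) on L(g, f) − βU_NL(f) with the weights held fixed.

def two_step_reconstruction(g, f0, compute_weights, maximize_posterior, n_outer=20, beta=1.0):
    """Alternating nonlocal-weight / image updates for the posterior of eq. (16).

    g: measured sinogram, f0: initial image (e.g. an FBP reconstruction),
    compute_weights(f) -> nonlocal weights, maximize_posterior(g, f, w, beta) -> new image.
    """
    f = f0.copy()
    for _ in range(n_outer):
        weights = compute_weights(f)                    # a. weight update, eqs. (7)-(10)
        f = maximize_posterior(g, f, weights, beta)     # b. image update, concave subproblem
    return f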
4 Simulated Experiments
In this experiment, synthetic simulated phantom data with 128 × 128 square pixels are used for transmission reconstruction. Fig. 2 shows the simulated 2D PET attenuation map. The attenuation map comprises four regions which correspond to different substances with assumed attenuation coefficient levels (air = 0 cm^{-1}, lungs = 0.02 cm^{-1}, soft tissue (water) = 0.096 cm^{-1}, bone (teflon) = 0.2 cm^{-1}), and the pixel values in the map are set to 0 for air, 0.02 for lungs, 0.096 for soft tissue and 0.2 for bone tissue, respectively. To evaluate the proposed nonlocal prior for transmission tomography with different scan times, we simulate three sinograms with total counts amounting to 3 × 10^5, 5 × 10^5 and 1 × 10^6 from the above synthetic phantom. The simulated data in the three sinograms are all Poisson distributed, and the percentage of simulated delayed coincidences (the r_i^T factors) in all counts is set to 5%. All the d_i factors are generated using pseudo-random log-normal variates with a standard deviation of 0.3 to account for detector efficiency variations. The transition probability matrix used in the reconstructions corresponds to parallel strip-integral geometries with 128 radial samples and 128 angular samples distributed uniformly over 180 degrees. A PC with an Intel P4 CPU and 512 MB RAM is used as the workstation for reconstruction. The above two-step algorithm is applied in the Bayesian reconstruction using the proposed nonlocal prior, and the PSCA algorithm is used in the Bayesian reconstructions using the QM and Huber priors. For the Bayesian reconstruction using MRP, the iterative algorithm in [7] is chosen. The reconstructions from the FBP method are used as the initial images in all iteration procedures. The number of iterations for the Bayesian reconstructions is set to 150.
Fig. 2. The 128 × 128 synthetic phantom image data used in the experiments for the simulated data case
Fig. 3. 1)-3) rows: FBP reconstruction and Bayesian reconstructions using different priors for the three simulated sinograms with different total counts levels, (a) FBP reconstruction (b) QM prior reconstruction (c) Nonquadratic Huber prior reconstruction (d) MRP reconstruction (e) The proposed nonlocal prior reconstruction. 4)-5) rows: Image differences and the corresponding image distances (in the right parentheses below the corresponding image differences) between the reconstructions for the three different sinograms.
FBP reconstructions and Bayesian reconstructions using the QM prior, the Huber prior, the median root prior (MRP) and the proposed nonlocal prior are all performed. For the FBP method, we choose a ramp filter with cutoff frequency equal to the Nyquist frequency. The hyperparameter β for all Bayesian reconstructions, the threshold parameter δ for the Huber prior and the threshold parameter h for the nonlocal prior are chosen by hand to give the best stable reconstructed images in terms of the signal-to-noise ratio (SNR), which is calculated by the following formula:

SNR = 10 \log_{10} \left( \frac{ \sum_{x,y} (f(x,y) - \bar{f})^2 }{ \sum_{x,y} (f(x,y) - f_{phantom}(x,y))^2 } \right)    (19)
where f(x,y), f̄ and f_phantom(x,y) denote the reconstructed image, the mean of f and the original true phantom image in Fig. 2, respectively. For all the Bayesian reconstructions using the traditional local priors (QM prior, Huber prior and MRP), the typical eight-neighborhood together with the corresponding weighting map mentioned in the second section are used. The sizes of the neighborhoods Ne_j in (7) are all set to 11×11. The two comparing neighborhoods n_b and n_j in (8) are both set to be 7×7 neighborhoods. The upper three rows of Fig. 3 show the reconstructed images from the different methods for the three sinograms; rows 4) and 5) of Fig. 3 represent the image differences and the corresponding numerical distances between the reconstructed images for the three different sinograms. Obviously, reconstructions using the nonlocal prior are able to overcome the oversmoothing effect of the QM prior and the staircase artifacts of the Huber prior. We can also find in rows 4) and 5) of Fig. 3 that, when the total detected counts decrease, the reconstructions using the nonlocal prior produce images with smaller differences or deterioration than the reconstructions using the QM prior, Huber prior and MRP. So we can conclude that the nonlocal prior has a more robust performance when the scan times decrease. Table 1 displays SNR comparisons with respect to the true phantom data in Fig. 2 for the above reconstructions. We can notice that, for the same total counts levels, the reconstructions using the proposed nonlocal prior produce images with the highest SNRs.

Table 1. SNRs for the reconstructed images in the upper three rows of Fig. 3

                                         FBP     QM prior   Huber prior   MRP     Nonlocal prior
SNR-Sinogram (3 × 10^5 total counts)     6.80    11.55      12.34         12.82   13.49
SNR-Sinogram (5 × 10^5 total counts)     8.89    12.21      13.77         13.66   14.23
SNR-Sinogram (1 × 10^6 total counts)     10.59   12.98      15.10         14.79   15.28
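For reference, the figure of merit in (19) amounts to a few lines of NumPy (a sketch assuming same-sized float arrays for the reconstruction and the phantom):

import numpy as np

def snr_db(recon, phantom):
    """SNR of eq. (19): variance of the reconstruction over its squared error, in dB."""
    signal = np.sum((recon - recon.mean()) ** 2)
    error = np.sum((recon - phantom) ** 2)
    return 10.0 * np.log10(signal / error)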
5 Conclusions
Stemming from MRF theory, the construction of the proposed nonlocal prior is theoretically sound and straightforward. From the above analyses and experiments, we can see that the proposed nonlocal prior, which is devised to exploit the global connectivity information in the objective image, can impose a more effective and robust regularization on PET transmission tomography than the other methods. Nevertheless, building the proposed prior requires a computation over a large neighborhood and introduces several new hand-adjusted parameters to the reconstruction procedure, which increases the computational cost of the reconstruction. For the reconstruction using the nonlocal prior, further work includes finding effective ways to lower the computational cost and testing the nonlocal prior's application to real transmission data reconstructions.
References 1. Bailey, D.: Transmission scanning in emission tomography. European Journal of Nuclear Medicine 25(7), 774–787 (1998) 2. Meikle, S., Dahlbom, M., Cherry, S.: Attenuation correction using countlimited transmission data in positron emission tomography. Journal of Nuclear Medicine 34, 143–150 (1993) 3. Fessler, J.: Hybrid Poisson/polynomial objective functions for tomographic image reconstruction from transmission scans. IEEE Transactions on Image Processing 4, 1439C50 (1995) 4. Lange, K.: Convergence of EM image reconstruction algorithms with Gibbs smoothness. IEEE Transactions on Medical Imaging 9, 439–446 (1990) 5. Li, S.Z.: Markov Random Field Modeling in Image Analysis. Springer, Tokyo (2001) 6. Fessler, J., Ficaro, E., Clinthorne, N., Lange, K.: Grouped-coordinate ascent algorithms for penalized likelihood transmission image reconstruction. IEEE Transactions on Medical Imaging 16, 166–175 (1997) 7. Alenius, S., Ruotsalainen, U., Astola, J.: Attenuation Correction for PET Using Count-Limited Transmission Images Reconstructed with Median Root Prior. IEEE Transactions on Nuclear Science 46, 646–651 (1999) 8. Yu, D.F., Fessler, J.A.: Edge-Preserving Tomographic Reconstruction with Nonlocal Regularization. IEEE Transactions on Medical Imaging 21(2), 159–173 (2002) 9. Buades, A., Coll, B., Morel, J.M.: A nonlocal algorithm for image denoising. In: Proc. IEEE Int. Conf. Computer Vision Pattern Recognition, vol. 2, pp. 60–65 (2005) 10. Erdo˘ gan, H., Fessler, J.A.: Monotonic algorithms for transmission tomography. IEEE Transactions on Medical Imaging 18, 801–814 (1999) 11. Lalush, D.S., Tsui, B.M.W.: A fast and stable maximum a posteriori conjugate gradient reconstruction algorithm. Med. Phys. 22, 1273–1284 (1995)
Medical Image Registration Based on Equivalent Meridian Plane Zhentai Lu, Minghui Zhang, Qianjin Feng, Pengcheng Shi, and Wufan Chen* School of Biomedical Engineering, Southern Medical University, Guang Zhou, 510515, China
[email protected] ,
Abstract. This paper presents a new 3-D image registration method based on the equivalent meridian plane and mutual information named as EMP-MI algorithm. Compared with MI based registration method using the whole volume intensity information, our approach utilizes the equivalent meridian plane information. First, principal component analysis (PCA) is applied to determine the equivalent meridian plane. The volume is roughly aligned by taking advantage of the EMP. Second, registration is refined via maximizing the MI of the EMP. We evaluate the effectiveness of the EMP-MI approach by applying it to the simulated and real brain image data (CT, MR, PET, and SPECT). The experimental results indicate that the algorithm is effective in reducing computation time as well as in helping to avoid local minima. Keywords: image registration, equivalent meridian plane, mutual information, principal component analysis.
1 Introduction

Image registration is the process of finding the transformation that aligns one image to another. The geometric alignment or registration of multimodality images is a fundamental task in numerous applications in 3-D medical image processing. It has become an important research topic because of its great value. The reason for this is clear: there are numerous applications in diagnostic as well as treatment settings that benefit from integrating the complementary character of multimodal images. For example, computed tomography (CT) and magnetic resonance (MR) imaging primarily provide anatomic information while single photon emission computed tomography (SPECT) and positron emission tomography (PET) provide functional and metabolic information. Notable application fields include neurosurgery and radiation therapy planning. Because of the challenges it poses, biomedical registration remains an active research endeavor. A number of image registration methods have been presented in the literature [1], [2]. Existing image registration techniques can be broadly classified into two categories: feature based and intensity based methods. A feature-based method
Corresponding author.
requires the extraction of features common in both images. Common features include corresponding points, edges, contours or surfaces. The transformation is estimated form these features [3]. Feature-based approaches have the advantage of greatly reducing computational complexity. However, these approaches depend on the feature extraction process, so they are highly sensitive to the accuracy of the feature extraction. The location of features might be very complicated. For example, even experts are not always able to characterize anatomical landmarks in medical images mathematically. In contrast, intensity based registration techniques optimize a function measuring the similarity of all geometrically corresponding voxel pairs, and then obtain the transformation between the entire intensity images. Mutual information (MI) is one of the few measures that are well suited to registration of multi-modal images. It is an automatic, intensity-based measure, which does not specify the landmarks or features and can also be applied in retrospect. Unlike the measures based on correlation of gray values or sum of absolute differences of gray values, MI needn’t assume any linear relationship among the gray values in the image. This technique was originally proposed by Viola and Wells [4, 5]. The authors described the application of MI for the registration of magnetic resonance images as well as for the 3D object model matching to the real scene. Studholme et al. [6] proposed normalized mutual information (NMI), which is less sensitive to changes in overlap. They found a distinct improvement in the behavior of the normalized measure for rigid registration of MR-CT and MR-PET images. Maes et al. optimized the MI by means of the Brent’s method and the Powell’s multidimensional direction set method to register MR, CT, and PET images of a human brain that differ by similarity transform [7]. Many studies have compared various measures of voxel similarity and concluded that MI is the most accurate measure for 3D image registration. However, the MI methods work with the entire image data and directly with image intensities, and are prone to fall into local minima and the computation cost is too expensive. We extend the meridian plane (MP) to 3D rigid medical image registration; a new concept of equivalent meridian plane (EMP) is proposed. The rough registrations of those 2D planes are to be realized at six freedom degrees; the refine registrations can be completed using MI in a small neighboring region. This method is called as EMP based MI registration technique, which combines the feature and intensity information. This paper is organized as follows. In section 2, we originally propose the EMP concept. In section 3, the implementation of the EMP based MI registration algorithm is described in detail. In section 4, the results in clinical medical images are illustrated. In section 5, we present our conclusions and give some direction for further work.
2 Equivalent Meridian Plane

As is well known, a meridian plane is any plane perpendicular to the celestial equator that passes through the Earth's axis of rotation (the Earth's North and South poles). For a three-dimensional medical image, it is necessary to propose a new equivalent meridian plane concept, since estimating the meridian plane is not always feasible in practice.
984
Z. Lu et al.
Definition: For a three-dimensional irregular volume, a set of orthogonal principal axes can always be found, by which a family of orthogonal planes can be determined. One of these planes, containing the first and the second principal axis, is the EMP.

The next problem is how to find the EMP. We suggest using the classical PCA method [8] for determining the EMP. PCA produces a single best line that satisfies the following condition: the sum of the squares of the perpendicular distances from the sample points to the line is a minimum. The variable defined by the line of best fit is the first principal component, which indicates the greatest amount of variation. The second principal component is the variable defined by the line that is orthogonal to the first. The center of the data set is the intersection of the two axes (see Fig. 1.a). The PCA approach describes an object by forming vectors from the coordinates of the object. The resulting vectors are then treated as a population of random vectors. In other words, each pixel in the object is treated as a 3-D vector X = \{ (x_i, y_i, z_i)^T \mid i = 1, \ldots, n \}, and the mean vector of the population and the covariance matrix can be estimated by

u = \frac{1}{n} \sum_{i=1}^{n} X_i    (1)

and

C = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T - u u^T    (2)

Since the covariance matrix is real, symmetric, and positive semi-definite, it is always possible to find a set of orthonormal eigenvectors. If E is a matrix whose rows are the eigenvectors of C, arranged in order so that the first row of E is the eigenvector that corresponds to the largest eigenvalue of C and the last row corresponds to its smallest eigenvalue, then

Y = E(X - u)    (3)

is the PCA transform. The first principal component is the linear combination that accounts for the maximum variance. Geometrically, it corresponds to the direction of the longest axis through the scatter of data points. The effects of the PCA transform on the set of
Fig. 1. The red, green, and blue lines represent the directions of the first, the second, and the third principal component, respectively. (a) Principal components line (b) The equivalent meridian plane is determined by the first and second principal axis.
points of a given image are both a translation and a rotation. The centroid of the volume is translated to the origin of the global coordinate system after translation. The principal axes of the volume will be coincident with the x-, y- and z- coordinate axes after rotation by the eigenvector matrix. By equating the eigenvector matrix of C
E = \begin{pmatrix} e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \\ e_{31} & e_{32} & e_{33} \end{pmatrix}

to the rotation matrix R = R_x(\theta_x) \cdot R_y(\theta_y) \cdot R_z(\theta_z), where \theta_x, \theta_y and \theta_z are the rotation angles with respect to the x-, y- and z-axes, respectively, we find that

\theta_y = \arcsin(e_{31}), \quad \theta_x = \arcsin(-e_{32} / \cos\theta_y), \quad \theta_z = \arcsin(-e_{21} / \cos\theta_y)    (4)

As shown in Fig. 2, the net effect of using the PCA transform is to establish a new coordinate system whose origin is at the coordinates of the mean vector and whose axes are in the direction of the eigenvectors of C. This coordinate system clearly shows that the PCA transformation is a translation and rotation transformation that aligns the data with the eigenvectors. The effects of translation are accounted for by centering the object about its mean.
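A compact NumPy sketch of this alignment step (an illustration only: the voxel axis ordering, the covariance normalization and the handling of eigenvector sign ambiguity are assumptions of ours):

import numpy as np

def emp_pca_transform(volume_mask):
    """PCA on the coordinates of the foreground voxels, following eqs. (1)-(4).

    Returns the centroid u, the eigenvector matrix E (rows sorted by decreasing
    eigenvalue) and the rotation angles; volume_mask is a binary 3-D array.
    """
    X = np.argwhere(volume_mask).astype(float)     # n x 3 foreground voxel coordinates
    u = X.mean(axis=0)                             # centroid, eq. (1)
    C = np.cov((X - u).T)                          # sample covariance (normalization differs from eq. (2) by n/(n-1))
    eigvals, eigvecs = np.linalg.eigh(C)           # ascending eigenvalues
    E = eigvecs[:, ::-1].T                         # rows = eigenvectors, largest eigenvalue first
    # Rotation angles of eq. (4), assuming the factorization R = Rx * Ry * Rz.
    ty = np.arcsin(E[2, 0])
    tx = np.arcsin(-E[2, 1] / np.cos(ty))
    tz = np.arcsin(-E[1, 0] / np.cos(ty))
    return u, E, (tx, ty, tz)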
Fig. 2. Geometrical Representation of PCA Transform: the equivalent meridian plane is coincident with the coordinate plane (XY) after PCA Transform. (a) Object I (b) Object II (c) Superposition of I and II. (d) Object I after PCA transform (e) Object II after PCA transform (f) Superposition of I and II after registration.
3 EMP Based MI Registration

3.1 Mutual Information

Mutual information is a basic concept from information theory. It is obtained by measuring the statistical dependence between two random variables, or the amount of information that one variable contains about the other. Mutual information considers the individual entropies H(A), H(B) and the joint entropy H(A,B):

H(A) = -\sum_{a} p_A(a) \log p_A(a)    (6)

H(B) = -\sum_{b} p_B(b) \log p_B(b)    (7)

H(A,B) = -\sum_{a,b} p_{AB}(a,b) \log p_{AB}(a,b)    (8)
where pA and pB denote the marginal probability of A and B, respectively, pAB is the joint probability. The mutual information is defined as:
MI(A,B) = H(A) + H(B) - H(A,B)    (9)
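In practice, (6)-(9) are typically estimated from a joint grey-level histogram of the overlapping voxels; a minimal sketch follows (the 64-bin choice echoes the experimental setup reported later, and the function name is ours):

import numpy as np

def mutual_information(a, b, bins=64):
    """Mutual information of eqs. (6)-(9) estimated from a joint grey-level histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)
    p_b = p_ab.sum(axis=0)
    nz = p_ab > 0                                   # avoid log(0)
    h_a = -np.sum(p_a[p_a > 0] * np.log(p_a[p_a > 0]))
    h_b = -np.sum(p_b[p_b > 0] * np.log(p_b[p_b > 0]))
    h_ab = -np.sum(p_ab[nz] * np.log(p_ab[nz]))
    return h_a + h_b - h_ab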
As a similarity measure, MI has enjoyed a great deal of success, particularly in the medical imaging domain. It is robust to outliers and efficient to calculate. Mutual information generally provides a smooth cost function which is used for optimization.

3.2 Binaryzation

In order to adequately utilize the knowledge of the geometric shape of the 3D object, it is essential to binaryze it so that the PCA transform can be computed quickly and correctly. In fact, the binaryzation quality also determines the registration accuracy. There are various binaryzation algorithms. However, most image binaryzation algorithms rely on statistical methods, without taking into account the relationships between the pixels. So they seldom yield a satisfactory binaryzation result when applied to images with uneven illumination or low contrast. Several of these algorithms need additional user interaction. In this paper, the Otsu binaryzation algorithm [9] is combined with the mutual information technique so that an automated binaryzation segmentation can be guaranteed for the 3D object. In other words, the initial threshold is chosen by the Otsu algorithm, and in the iteration process, an optimal threshold is determined by maximizing the MI between the original volume and the thresholded volume. Refer to our earlier work for more detail [10]. In Fig. 3, we show the binaryzation results for a brain volume by Otsu's method and our method. The background (outside the object) was segmented by thresholding, and the foreground was used to compute the centroid and covariance matrix.
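One way this refinement could be organized is sketched below; the candidate grid around the Otsu value, the bin count and the assumption of a non-negative volume are our own illustrative choices rather than the procedure of [10].

import numpy as np
from skimage.filters import threshold_otsu

def mi_refined_threshold(volume, bins=64, n_candidates=41, spread=0.3):
    """Binaryzation threshold: Otsu initialization, refined by maximizing the MI
    between the grey-level volume and its thresholded (binary) version."""
    def mi(a, b):
        joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
        p = joint / joint.sum()
        pa, pb = p.sum(axis=1), p.sum(axis=0)
        nz = p > 0
        return (-np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
                - np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
                + np.sum(p[nz] * np.log(p[nz])))
    t0 = threshold_otsu(volume)
    candidates = t0 * np.linspace(1.0 - spread, 1.0 + spread, n_candidates)
    return max(candidates, key=lambda t: mi(volume, (volume > t).astype(float)))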
Fig. 3. Coronal (first row), axial (second row) and sagittal (third row) views of the binarization results of a brain volume. (a) Original image (b)-(c) Binarization by Otsu's method and our method, respectively (d) Contours (red) extracted from the images shown in (c) are superimposed on the original images to better visualize the quality of the binarization.
3.3 Registration Based on EMP

For image registration, the primary drawback of the usual optimization approaches is that they may fail unless the two volumes are misaligned by only a moderate difference in rotation and translation. To address this problem, we first bring the volumes into approximate alignment with the PCA transform. The geometric effect of this transformation on the set of points of a given volume is both a translation and a rotation: the centroid of the volume is translated to the origin of the global coordinate system (X, Y, Z), and at the same time its equivalent meridian plane is positioned on the coordinate plane (X, Y), as shown in Fig. 4. In other words, PCA automatically removes most of the translation and rotation and returns the object approximately to a canonical position and orientation. The PCA transform thus provides a rough registration estimate in a very short time; the final, refined registration is then obtained with EMP-MI. Because the EMP is coincident with the coordinate plane (X, Y), one of the two volumes is fixed as the target and the other as the source. Only small corrections of the source, in the three translations and three rotations, are needed; they are found by computing the MI between the EMP of the target and the plane (X, Y) of the source.
The MI reaches its maximum when the two volumes are correctly registered. During the registration process, Powell's method is used for the refined search; Powell's method [11] converges quickly and accurately when the initial point is close to the optimum. For volumes A and B, the rough translations and rotation angles are obtained from the PCA transform through formulas (1) and (3):
t1 = uA − uB,  θ1 = θA − θB    (10)

where uA and uB denote the centroids of volumes A and B, respectively, and θA and θB denote the rotation angles of volumes A and B about the x-, y- and z-axes, respectively. The refined registration yields additional translations t2 and rotation angles θ2. Finally, the total translations and rotation angles are obtained as

t = t1 + t2,  θ = θ1 + θ2    (11)
Fig. 4. The equivalent meridian plane after PCA transforms
The new registration technique consists of the following main steps (a rough sketch of the pipeline is given after the list):

Step 1. Binarize both volumes and form three-dimensional vectors from the coordinates of the object;
Step 2. Compute the centroid and covariance matrix using formulas (1) and (2);
Step 3. Compute the PCA transformation;
Step 4. Bring the original volumes into a canonical coordinate frame using the PCA transform;
Step 5. Refine the registration by maximizing the MI between the EMP of the target and the coordinate plane (XY) of the source.
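The sketch below chains Steps 1-5 using the helpers introduced earlier (pca_pose, mi_refined_threshold, mutual_information). It is only an outline under simplifying assumptions: the refinement maximizes full-volume MI rather than the plane-restricted EMP-MI measure, the resampling conventions in apply_rigid are illustrative, and SciPy's Powell optimizer stands in for the paper's implementation.

import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import affine_transform

def rotation_matrix(tx, ty, tz):
    """R = Rx(tx) @ Ry(ty) @ Rz(tz), the factorization used for Eq. (4)."""
    cx, sx, cy, sy, cz, sz = np.cos(tx), np.sin(tx), np.cos(ty), np.sin(ty), np.cos(tz), np.sin(tz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def apply_rigid(volume, angles, shift):
    """Resample `volume` under a rigid transform (sign/centering conventions illustrative)."""
    R = rotation_matrix(*angles)
    centre = (np.array(volume.shape) - 1) / 2.0
    offset = centre - R @ centre + np.asarray(shift)
    return affine_transform(volume, R, offset=offset, order=1)

def register(source, target):
    """Steps 1-5: PCA-based rough alignment followed by MI-based refinement (Powell)."""
    src_pose = pca_pose(source > mi_refined_threshold(source))   # Steps 1-3
    tgt_pose = pca_pose(target > mi_refined_threshold(target))
    t1 = tgt_pose[0] - src_pose[0]                               # Eq. (10)
    a1 = np.array(tgt_pose[1]) - np.array(src_pose[1])

    def cost(p):                                                 # Step 5: refine the 6 parameters
        moved = apply_rigid(source, a1 + p[:3], t1 + p[3:])
        return -mutual_information(target, moved)

    res = minimize(cost, np.zeros(6), method="Powell")
    return a1 + res.x[:3], t1 + res.x[3:]                        # Eq. (11)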
4 Experiments

To test the performance of the proposed EMP-MI, we compare its computation cost and accuracy with those of the standard MI-based method. For 3-D medical image rigid registration, the search space is six-dimensional, i.e., three rotations and three translations about the image center. As a local optimization
scheme, Powell's method is initialized with all transformation parameters set to zero. In all of the experiments, the number of histogram bins used to compute the MI measure was 64 for both volumes, and tri-linear interpolation was used for the non-integer grid positions produced by the translations and rotations. The platform was Matlab 7.0 on Windows XP SP2 with a 3.0 GHz Intel Pentium 4 CPU and 512 MB RAM.

4.1 Data Set Description

The image files were provided by the General Hospital of Tianjin Medical University, China. Imaging and data acquisition were performed on a combined CT-PET inline system (GE Medical Systems), which acquires CT images and PET data from the same patient in one session. The CT and PET images were acquired simultaneously and are therefore intrinsically registered. This intrinsic registration provides an ideal reference for evaluating registration accuracy. All pertinent image file information is tabulated in Table 1.

Table 1. Image file descriptions

Experiment  Modality  Dimension   Voxel (mm³)
1           CT        512² × 35   0.59² × 5.0
            PET       128² × 35   2.34² × 4.3
2           CT        512² × 35   0.59² × 5.0
            PET       128² × 35   2.34² × 4.3
3           CT        512² × 35   0.98² × 5.0
            PET       128² × 35   3.91² × 4.3
The images used in the registration experiments are shown in Figs. 5-7.
Fig. 5. Brain images for registration experiments. Top row: the CT transverse slices, 6, 12, 18 and 24, respectively. Bottom row: the corresponding PET slices.
Fig. 6. Brain images for registration experiments. Top row: the CT transverse slices, 6, 12, 18 and 24, respectively. Bottom row: the corresponding PET slices.
Fig. 7. Cardiac images for registration experiments. Top row: the CT transverse slices, 7, 15, 23 and 31, respectively. Bottom row: the corresponding PET slices.
4.2 Experiment Setup

To generate randomly misregistered pairs, one volume was rotated sequentially about the x-, y- and z-axes by three different angles, and the rotated image was then translated to a new position. The translations and angles were drawn from uniform distributions over a specified range. In every experiment, a set of 100 randomly misregistered volume pairs was registered. A registration is considered successful if the translation error is less than one pixel and the rotation error is less than one degree.

4.3 Experiment Results

The PET volume was used as the target and the CT volume as the source image. The rotation angles were uniformly distributed over [-20, 20] degrees and the translations were
uniformly distributed over [-10, 10] pixels. The differences between the resulting registration parameters and the true ones were then statistically analyzed in Tables 2-4. The advantages of our algorithm over the MI-based method are execution time and accuracy: our method is faster than the MI-based algorithm by a factor of 5 or more. Cardiac image registration is a more complex problem than brain image registration, particularly because of the non-rigid and mixed motions of the heart and the thorax structures. The success ratio of the MI-based method is only 63%, and that of EMP-MI is 81%, which is lower than in the previous two experiments (>90%).

Table 2. Registration results for brain PET-CT image pairs

Alg   tx    ty    θz    θx    θy    tz    Time(m)  Success
MI    0.24  0.12  0.07  0.12  0.22  1.06  5.06     63%
EMP   0.27  0.16  0.16  0.35  0.34  0.48  1.01     81%
Table 3. Registration results for brain PET-CT image pairs

Alg   tx    ty    θz    θx    θy    tz    Time(m)  Success
MI    0.47  0.42  0.43  0.64  0.34  0.36  4.84     63%
EMP   0.51  0.09  0.09  0.34  0.28  0.31  1.00     81%
Table 4. Registration results for cardiac PET-CT image pairs

Alg   tx    ty    θz    θx    θy    tz    Time(m)  Success
MI    0.49  0.12  0.30  0.23  0.21  0.42  4.98     63%
EMP   0.36  0.14  0.39  0.40  0.29  0.44  1.04     81%
5 Conclusion and Future Work

In this paper, we defined the new concept of the EMP and presented a two-stage registration scheme for 3-D multi-modality medical images. In stage 1, the PCA transform determines the EMP and the two other orthogonal planes and brings the corresponding planes of the two volumes into close alignment. In stage 2, a small adjustment within a narrow neighborhood completes the refined registration. All of the experiments, on both simulated and clinical data, show that EMP-MI is not only more robust and faster but also largely automatic; in practice, it has been well received and used by clinical experts. In some special cases the data may be incomplete (for example, structures visible in one modality may be absent in the other). We are developing a new EMP-MI-based method to handle this problem.

Acknowledgments. This work was supported in part by the National Basic Research Program of China (No. 2003CB716103).
References
1. Antoine Maintz, J.B., Viergever, M.A.: A survey of medical image registration. Medical Image Analysis 2(1), 1-36 (1998)
2. Zitova, B., Flusser, J.: Image Registration Methods: A Survey. Image and Vision Computing 21(11), 977-1000 (2003)
3. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. on PAMI 14(2), 239-256 (1992)
4. Viola, P., Wells, W.: Alignment by maximization of mutual information. Int. J. Comput. Vis. 24(2), 137-154 (1997)
5. Collignon, A., Maes, F., Delaere, D., et al.: Automated multi-modality image registration based on information theory. Information Processing in Medical Imaging, pp. 263-274. Kluwer Academic Publishers, Dordrecht (1995)
6. Studholme, C., Hill, D.L.G., Hawkes, D.J.: An overlap invariant entropy measure of 3D medical image alignment. Pattern Recogn. 32, 71-86 (1999)
7. Maes, F., Collignon, A., Vandermeulen, D., et al.: Multimodality Image Registration by Maximization of Mutual Information. IEEE Trans. Med. Imag. 16(2), 189-198 (1997)
8. Gonzales, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Prentice Hall, Englewood Cliffs (2002)
9. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Systems, Man, and Cybernetics 9(1), 62-66 (1979)
10. Qingwen, L., Wufan, C.: Unsupervised Segmentation of Medical Image Based on Difference of Mutual Information. Science in China Series F-Information Sciences 49(4), 484-493 (2006)
11. Maes, F., Vandermeulen, D., Suetens, P.: Comparative evaluation of multiresolution optimization strategies for multimodality image registration by maximization of mutual information. J. Med. Image Anal. 3(4), 373-386 (1999)
Prostate Tissue Texture Feature Extraction for Cancer Recognition in TRUS Images Using Wavelet Decomposition J. Li, S.S. Mohamed, M.M.A. Salama, and G.H. Freeman Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada {smohamed,msalama}@hivolt.uwaterloo.ca
Abstract. In this paper, a wavelet based approach is proposed for the detection and diagnosis of prostate cancer in Trans Rectal UltraSound (TRUS) images. A texture feature extraction filter was implemented to extract textural features from TRUS images that characterize malignant and benign tissues. The filter is based on the wavelet decomposition. It is demonstrated that the wavelet decomposition reveals details in the malignant and benign regions in TRUS images of the prostate which correlate with their pathological representations. The wavelet decomposition is applied to enhance the visual distinction between the malignant and benign regions, which could be used by radiologists as a supplementary tool for making manual classification decisions. The proposed filter could be used to extract texture features which linearly separate the malignant and benign regions in the feature domain. The extracted feature could be used as an input to a complex classifier for automated malignancy region classification.
1 Introduction

Prostate carcinoma is among the highest cancer risks in men, and it also ranks among the leading causes of cancer death. Prostate carcinoma is only curable at an early stage; therefore early detection is highly recommended [1]. The existing prostate cancer screening methods, such as the Digital Rectal Exam (DRE), TRUS imaging and analysis of the Prostate Specific Antigen (PSA) value together with the prostate volume, clearly lack reliability. DRE yields a high rate of false negatives, and PSA may be elevated as a result of other conditions such as BPH or prostatitis [2]. Moreover, TRUS images only reveal two thirds of the tumors, while the remaining third appears isoechoic and cannot be identified as cancer [3]. Prostate biopsy is usually performed for a conclusive diagnosis of the disease. However, a biopsy may miss cancer cases because of inaccurate biopsy locations; this research therefore aims to assist in determining the biopsy locations in order to decrease false negative results and ultimately enhance diagnostic accuracy. Guiding the biopsy is typically accomplished by an expert radiologist who marks the suspicious regions in TransRectal UltraSound (TRUS) images.
However, manual segmentation is a time-consuming task, as there are often dozens of 2D TRUS slices associated with a single clinical case, and it also depends on the skill of the conducting radiologist. The goal of this research is to design an automated algorithm to segment the suspicious regions within the prostate on TRUS images. Automated region segmentation could improve both the efficiency and the effectiveness of the prostate cancer diagnostic process. To construct a successful Region of Interest (ROI) segmentation algorithm, appropriate features that discriminate between the different textures should be used. In this work, a novel approach is proposed to extract the local texture features of the malignant and benign regions within the TRUS image of the prostate with the aid of the wavelet decomposition. The local texture in the TRUS image is used by radiologists to determine malignant regions in conjunction with intensity and spatial features. The malignant and benign tissues have differing parenchyma, which reflect and scatter the acoustic energy differently [4]; part of this difference likely manifests as local texture variation in TRUS images.

Regular textures are usually defined by a texture primitive or textel (orientation parameter) of a specific size (scale parameter) that is repeated at a certain spatial frequency (periodicity parameter). Simple statistical approaches include applying standard deviation, uniformity, and entropy measures to the local histogram. These measures effectively describe the coarseness or smoothness of the texture [5]; they characterize the periodicity parameter well, but not explicitly the orientation and scale parameters. Spectral approaches such as Fourier spectrum analysis are effective at characterizing the periodicity parameter; however, the results of the transform are not localized in the spatial domain, so the method is ineffective for identifying texture regions in an image. Popular textural feature extraction techniques such as the co-occurrence matrix and the recurrent random pulsed neural network (RNN) [6] capture both the orientation and periodicity characteristics of a texture. To capture the local texture features in ultrasonic images of the prostate, co-occurrence parameters have been applied by Scheipers et al. in [7]. The problem with these textural feature extraction techniques is that they might not effectively distinguish between textures whose primitives are located at different scales. This is in part because these approaches cannot adaptively vary the size of the texture-capturing window to ensure that a single textel is captured exactly within that window. Furthermore, intensity inhomogeneity or illumination variation usually occurs at a coarse scale, whereas the textures of interest are usually located at a finer scale.

Multi-resolution analysis has been shown to be effective for texture segmentation tasks. It has been applied successfully in [8] to internally segment the prostate gland from TRUS images using Gabor filters. Wavelet analysis was used earlier for texture segmentation in [9], where 25 natural textures were segmented with almost perfect accuracy. Intuitively, wavelet decomposition should be able to uniquely characterize textures as it takes into account all of the orientation, scale, and periodicity characteristics of a texture.
In the area of diagnostic medical imaging, the wavelet approach has been applied to liver and breast cancer detection in ultrasound images and mammography [10, 11, 12]. The basis for this analysis lies in the argument that most relevant tissue textures are to be located at specific and finite
resolution ranges, and that textures at other resolution ranges are most likely noise or other scales of detail not relevant to the analysis at hand. Wavelet decomposition provides the tool with which to zoom into the resolution of interest. In section two, the proposed multi-resolution non-linear texture feature extraction filter is presented. In section three, results of the proposed approach when applied to 24 TRUS images are presented and the major issues with the proposed approach are discussed. The major findings of this work are summarized in section four. The ultrasound images used in the subsequent analysis were obtained during clinical TRUS sessions using an Aloka 2000 ultrasound machine with a broadband 7 MHz linear transducer and a field of view of approximately 6 cm. An experienced radiologist manually outlined the malignant and benign regions; this manual segmentation was used as the gold standard in the performance evaluation of the proposed method. Before tissue segmentation, the contour of the prostate was segmented using the deformable model approach [13]. The outlined prostate was used as the input for the proposed texture feature extraction filter to characterize and identify the malignant and benign regions.
2 Texture Identification

In medical images it is important to distinguish between regions containing different textures. For texture definition, the proposed method should be capable of capturing the orientation, scale, and periodicity features associated with the texture. Figure 1 illustrates the typical problem: the prostate region of the TRUS image is used as input to the system, and texture features should then be obtained at the output of the proposed filtering system, which would ideally have linearly separable malignant and benign distributions for good classification results.
Fig. 1. Ideal texture mapping to linearly separable feature space
It is well known that complex textures rarely have regular patterns; hence, it is hard to characterize these textures by regular measures such as orientation, scale and periodicity. However, any texture can be viewed as a combination of sub-textures on several scales. When a texture is decomposed into components on different scales, regular sub-texture patterns characterizing the original texture on a particular scale might emerge. It is assumed in this work that when the original texture is decomposed into its components, regular patterns characterizing this texture can be observed at some level of the decomposition. It is expected that the decomposition of a complex irregular texture reveals some regular sub-textures on certain levels of decomposition. However, at
other levels of the decomposition, the sub-textures are irregular. Hence, the major challenge in characterizing a complex texture using this approach consists of:

• Identifying the scales at which its sub-textures are regular and can be characterized
• Finding a textural feature measure that takes into account the orientation and periodicity characteristics of the texture.

The proposed approach uses wavelet decomposition to identify the characteristic scale and to capture the periodicity feature of the texture. A local texture feature is then defined to capture the orientation of the texture primitives.
3 Textural Feature Extraction Method

The proposed non-linear filter consists of three levels of sub-filters, illustrated in Figure 2.
[Block diagram: the input image f(m,n) in the intensity domain passes through the wavelet decomposition filter, producing details w1(m,n)-w4(m,n) in the wavelet domain, which are mapped by the local feature filters to g1(m,n)-g4(m,n) in the local feature domain.]
Fig. 2. The proposed textural feature extraction filter
The first level of the proposed filter consists of a wavelet decomposition sub-filter, which transforms the original image in the intensity domain into wavelet decomposition details wi(m,n) in the wavelet domain. The subscript i represents the different scale levels of the wavelet decomposition, with i=1 being the finest scale. The wavelet decomposition details are then passed through a bank of local feature sub-filters; in this work, the local feature computed is an energy measure. The local feature sub-filters map the wavelet decomposition details to the local feature domain, while keeping the local feature values localized in the spatial domain. The local feature values at the different scales of the wavelet decomposition form the output of the filter and are used to characterize and identify the individual textures in the spatial domain. The following section describes the different levels of sub-filters in more detail.
3.1 Wavelet Decomposition Sub-filter

The wavelet decomposition expresses the image data as a superposition of scaling functions, which give a coarse representation of the image, and wavelets, which represent the details of the image. The common expression for the discrete wavelet transform (DWT) in 2-D is:

f(m, n) = Σ_k Σ_l c_imax(k, l) 2^(imax/2) φ(2^(imax/2) m − k, 2^(imax/2) n − l)
        + Σ_{i=1}^{imax} Σ_k d_i(k) 2^(i/2) ψ(2^(i/2) m − k) + Σ_{i=1}^{imax} Σ_l d_i(l) 2^(i/2) ψ(2^(i/2) n − l)
        + Σ_{i=1}^{imax} Σ_k Σ_l d_i(k, l) 2^(i/2) ψ(2^(i/2) m − k, 2^(i/2) n − l)    (1)

where c_imax(k, l) are the approximation coefficients, d_i(k, l) are the detail (difference) coefficients, φ(m, n) is the scaling function and ψ(m, n) is the wavelet. The first line on the right-hand side of Equation 1 represents the approximation components, the two terms on the second line represent the vertical and horizontal detail components at all scales from i=1 to i=imax, and the term on the last line represents the diagonal detail components at all scales.

The proposed wavelet decomposition filter takes the original image as input and outputs the wavelet decomposition details at scale levels from i=1 (finest scale) to i=p, where p is the coarsest scale level required to describe the texture. The wavelet decomposition output wi(m,n) at a specific scale level i is the average of the wavelet decomposition detail differences in the horizontal, vertical, and diagonal directions at that scale level. The wavelet decomposition is accomplished with the discrete filter D9 from Daubechies [14]. This filter is implemented with a DWT quadrature mirror filter (QMF) as shown in Figure 3. To maintain the resolution at each scale, the filtered results are not downsampled, as is the norm in common QMF designs. The filter h0 is a low-pass filter whose impulse response is characterized by the scaling function coefficients s(m, n), while h1 is a high-pass filter whose impulse response is characterized by the wavelet coefficients z(m, n).
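As a rough illustration of this undecimated decomposition, the sketch below uses PyWavelets' stationary wavelet transform in place of the un-downsampled QMF of Fig. 3 and averages the detail magnitudes per scale as described above; 'db4' is only a stand-in for the Daubechies D9 filter, and the image sides are assumed to be divisible by 2**levels.

import numpy as np
import pywt

def wavelet_details(image, levels=4, wavelet="db4"):
    """Undecimated decomposition details w_i(m,n), i = 1 (finest) .. levels:
    the average of the horizontal, vertical and diagonal detail magnitudes per scale."""
    coeffs = pywt.swt2(image.astype(float), wavelet, level=levels)
    details = []
    # pywt.swt2 lists coefficients from the coarsest to the finest level;
    # reverse so that details[0] corresponds to w_1 (finest scale).
    for cA, (cH, cV, cD) in reversed(coeffs):
        details.append((np.abs(cH) + np.abs(cV) + np.abs(cD)) / 3.0)
    return details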
[Diagram: f(m,n) is filtered along rows and columns with h0 (low-pass) and h1 (high-pass) to produce cA1, cH1, cV1 and cD1, the details at scale i = 1.]
Fig. 3. QMF for implementation of Wavelet decomposition
3.2 Local Feature Filter
In order to construct a local feature filter that maximally linearly separates the malignant and benign regions, we must identify textural characteristics that distinguish the two regions. Prostate biopsy is usually used as the final stage for prostate cancer diagnostics. In the biopsy operation, tissue samples are obtained from suspicious regions of the prostate. The tissue samples are then observed and graded by a pathologist. The prostate cancer grading system was developed by Gleason [15]. It is observed that, pathologically, the more serious the tumour, the more the glands will lose shape and structure and start to fuse. Pathologic samples of malignant and benign tissues of the prostate are displayed in Figure 4.
[Figure panels: benign tissue compared with malignant tissue of grades 1-3, and benign tissue compared with malignant tissue of grades 4-5.]
Fig. 4. Pathologic samples of malignant and benign tissues of the prostate [16]
It is expected that part of this textural pattern is reflected in the fine-resolution speckles of the TRUS image, owing to the differing parenchyma of the malignant and benign tissues. Passing an input image through the wavelet decomposition filter yields the detail differences on the two finest resolution scales, displayed in Fig. 5.
Fig. 5. Wavelet decomposition details at the two finest scales
From Figure 5 it is clear that the texture in the benign regions has a more ordered appearance, while the textures in the malignant regions are more disordered and granular. This is consistent with the pathological description of the tumour, with the benign regions consisting of uniform glands and the malignant regions exhibiting destroyed uniform structures. In Figure 6, the histogram of the wavelet decomposition detail difference W1(m,n) is displayed.
[Histogram of level 1 detail difference values (x-axis: feature values, y-axis: frequency) for the benign and malignant regions; panel annotation a = 3.5, b = 5.]
Fig. 6. Wavelet decomposition level 1 detail difference distribution in the malignant and benign region of the training image
From Figure 6, it is clear that there exists a threshold σ that can be chosen to separate the malignant and benign regions, since the distributions show that for detail difference values below σ, the associated pixel is much more likely to be malignant than benign.
Fig. 7. Wavelet decomposition level 1 detail differences
Figure 7 shows the wavelet decomposition level 1 detail difference. It can be noted that, in this fine-scale detail image, the benign region is characterized by almost uniform horizontal lines. In the proposed algorithm, an attempt is made to capture this characteristic by counting the number of horizontally connected image elements with values below the threshold σ in a local neighborhood. The degree of connectivity is defined by a parameter b, such that if b is set to 3 and there are three horizontally connected pixels below the threshold σ in the local neighborhood, the local feature counter is increased by one. To find the optimal values of σ and b for separating the malignant and benign distributions in the local feature domain, the neighborhood within which the local feature values are counted is set to 24x24 image elements. Then, to find the optimal value for the threshold σ, b is held constant at 2, and the distributions of the local feature values in the malignant and benign regions are examined with σ = 10, σ = 5, σ = 3.5, and σ = 2. It was found that the threshold yielding the highest linear separation of the local feature values in the malignant and benign regions is
σ = 3.5. At this threshold, the modes of the malignant and benign distributions are separated by a feature value distance of 25, as shown in Figure 7, which covers approximately half of the malignant distribution range. Likewise, the optimal degree of connectivity b was determined by examining the feature value distributions when b is varied with σ held constant at 3.5, with b set to 1, 2, 3, and 5, respectively. It was found that for b = 5 the modes of the distributions are considerably separated; therefore, a characteristic degree of connectivity value of 5 is selected for the subsequent analysis. A sketch of this local feature computation is given below.
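A minimal sketch of the local feature, assuming the conventions above; how overlapping runs are counted is not fixed by the text, so counting one unit per run starting position is an assumption, and the names are ours.

import numpy as np
from scipy.ndimage import uniform_filter

def local_feature(w1, sigma=3.5, b=5, win=24):
    """Count, in a win x win neighborhood of each pixel, the horizontal runs of b
    connected elements of the level-1 detail image w1 whose values lie below sigma."""
    below = (w1 < sigma).astype(int)
    run = np.ones_like(below)
    for k in range(b):                      # run[m, n] = 1 iff below[m, n..n+b-1] are all 1
        shifted = np.zeros_like(below)
        shifted[:, : below.shape[1] - k] = below[:, k:]
        run &= shifted
    # windowed sum of run indicators (uniform_filter gives the local mean)
    return uniform_filter(run.astype(float), size=win, mode="constant") * (win * win)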
3.3 Identifying Regions with the Proposed Texture Feature Extraction Filter

With the local feature distributions characterized by the parameters σ and b, TRUS images suspected to contain a malignant tumour can be evaluated with the following steps:

• Apply the proposed method to the TRUS image: the input image is transformed to the wavelet domain and then to the local feature domain, and the local feature values are calculated for each discrete image element.
• Classify the image elements using the local feature values as input to a classifier.

The classifier used in the second step could be a complex classifier such as a neural network or a fuzzy inference system. However, to demonstrate the effectiveness of the textural feature extracted with the proposed algorithm, the simplest classifier, thresholding, is used in the subsequent analysis. Figure 8 shows the classification result obtained by setting the local feature value thresholds to 8 (dark grey) and 25 (light grey) for the shown image (white represents the benign region). It is clear from the figure that the classification result compares favorably with the desired result in Figure 1.
Fig. 8. Sample Classification result 1 with local feature value thresholds of 8 and 25
4 Results and Discussion

In this section, the effectiveness of the extracted textural feature is examined by applying the proposed approach to 23 TRUS images distinct from the training image used to characterize the local textural feature distributions. Thresholding is used as the
classification method. In Figures 9, 10, 11 and 12, the desired classification, the enhanced level 1 detail difference image, and the resulting classification of 4 of the evaluation images are displayed. For the resulting classification images, feature value thresholds of 8 and 25 are used.
Fig. 9. Sample Classification result 1 with local feature value thresholds of 8 and 25
Fig. 10. Classification result 2 with local feature value thresholds of 8 and 25
Fig. 11. Classification result 3 with local feature value thresholds of 8 and 25
Fig. 12. Classification result 4 with local feature value thresholds of 8 and 25
The proposed textural feature could effectively identify most of the malignant regions in the evaluation images. Most of the segmentation results are already apparent in the enhanced level 1 detail difference image. In fact, the enhanced detail image contains lots of information that could be effective as a decision support tool for radiologists for manual region segmentation. The value of the extracted textural
features lies in their ability to be used as input to complex classifiers for robust automated malignant and benign region identification. The receiver operating characteristic (ROC) analysis in Figure 13 compares the sensitivity and specificity performance of the extracted textural feature over all 23 evaluation images. The ROC curves are obtained by continuously varying the feature value threshold. The area under the ROC curve (AUC) provides a good summary of the accuracy of the test, taking into account both sensitivity and specificity as well as the full range of possible classification results [18]. The classification rate (CR), sensitivity, and specificity are defined by:
CR = (tp + tn) / (tp + fn + fp + tn),  sensitivity = tp / (tp + fn),  specificity = tn / (fp + tn)
where tp denotes true positives (malignant block correctly detected), fn denotes false negatives (malignant block undetected), fp denotes false positives (benign block wrongly detected as malignant), and tn denotes true negatives (benign block correctly classified as such). The area under the curve (AUC) when applying linear thresholding to the extracted textural feature is Az = 0.72.
Fig. 13. ROC analysis for extracted textural feature
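For reference, a minimal sketch of these measures and of the threshold-sweeping ROC analysis is given below; predicting malignancy where the feature value falls below the threshold is an assumed orientation, and the function names are ours.

import numpy as np

def metrics(pred_malignant, truth_malignant):
    """Classification rate, sensitivity and specificity from the definitions above."""
    tp = np.sum(pred_malignant & truth_malignant)
    fn = np.sum(~pred_malignant & truth_malignant)
    fp = np.sum(pred_malignant & ~truth_malignant)
    tn = np.sum(~pred_malignant & ~truth_malignant)
    cr = (tp + tn) / (tp + fn + fp + tn)
    return cr, tp / (tp + fn), tn / (fp + tn)

def roc_auc(feature, truth_malignant, n_steps=200):
    """Sweep the feature-value threshold and integrate the ROC curve (trapezoidal AUC)."""
    sens, fpr = [], []
    for t in np.linspace(feature.min(), feature.max(), n_steps):
        _, se, sp = metrics(feature < t, truth_malignant)
        sens.append(se)
        fpr.append(1.0 - sp)
    order = np.argsort(fpr)
    return np.trapz(np.array(sens)[order], np.array(fpr)[order])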
Figure 14 shows the malignant and benign distributions over all the images used in this study. It is clear from the figure that there is a marked difference between the two distributions, which demonstrates the effectiveness of the proposed texture feature filter.
Fig. 14. Malignant and benign distributions of the extracted texture feature values for all images
5 Conclusions

In this paper, a novel non-linear filter is proposed to extract a texture feature that linearly separates the benign and malignant regions within the prostate in TRUS images. The filter is based on the wavelet decomposition, which separates the complex texture details in the TRUS image into sub-components at different resolution scales. It is demonstrated that the decomposition details on the finest scale are a coarse representation of the pathological inter-nodule arrangements indicative of the degree of malignancy. The level 1 decomposition details are shown to visually demonstrate the distinction between malignant and benign regions and could be used as a decision support tool by radiologists for making manual diagnostic decisions. Moreover, a single textural pattern is extracted with the proposed textural extraction filter from the level 1 decomposition details in order to construct a local texture feature that separates the malignant and benign feature values. The effectiveness of this texture feature is verified by applying it to classify malignant regions in 23 TRUS images using the simple thresholding approach; the AUC of the ROC for the proposed feature is found to be Az = 0.72. A further advantage of the proposed approach is the removal of the effects of intensity inhomogeneity on the texture characterization process.
References
1. National Cancer Institute of Canada: Canadian cancer statistics 2002. Toronto, Canada (2002)
2. Paul, B., Dhir, R., Landsitte, D., Hitchens, M.R., Getzenberg, R.H.: Detection of Prostate Cancer with a Blood-Based Assay for Early Prostate Cancer Antigen. Cancer Research 65 (May 2005)
3. Scardino, P.T.: Early detection of prostate cancer. Urol Clin North Am. 16(4), 635-655 (1989)
4. Bushberg, J.T., Seibert, J.A., Leidholdt Jr., E.M., Boone, J.M.: The essential physics of medical imaging, 2nd edn., pp. 469-553. Lippincott, Williams & Wilkins, Philadelphia, Penn (2002)
5. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. ch. 7, Prentice Hall, Upper Saddle River, New Jersey (2002)
6. Gelenbe, E., Feng, Y.T., Ranga, K., Krishnan, R.: Neural network methods for volumetric magnetic resonance imaging of the human brain. Proc. IEEE 84, 1488-1496 (1996)
7. Scheipers, U., Ermert, H., Sommerfeld, H.-J., Garcia-Schurmann, M., Senge, T., Philippou, S.: Ultrasonic multifeature tissue characterization for prostate diagnostics. Ultrasound in Med. and Biol. 29(8), 1137-1149 (2003)
8. Mohamed, S.S., Salama, M.M.A., Kamel, M., El-Sadaany, E.F., Rizkalla, K., Chin, J.: Prostate cancer multi-feature analysis using TRUS images. Physics in Medicine and Biology 50(15), N175-N185 (2005)
9. Laine, A., Fan, J.: Texture classification by Wavelet packet signatures. IEEE Trans. Pattern Ana. Mach. Intel. 15(11), 1186-1191 (1993)
10. Georgiou, G., Cohen, F.S.: Is early detection of liver and breast cancers from ultrasound scans possible? Patt. Rec. Let. 24, 729-739 (2003)
11. Chen, D.-R., Chang, R.-F., Kuo, W.-J., Chen, M.-C., Huang, Y.-L.: Diagnosis of breast tumors with sonographic texture analysis using wavelet transform and neural networks. Ultrasound in Med. & Bio. 28(10), 1301-1310 (2002)
12. Lee, W.-L., Chen, Y.-C., Hsieh, K.-S.: Ultrasonic liver tissues classification by fractal feature vector based on M-band wavelet transform. IEEE Trans. Med. Imaging 22(3), 382-393 (2003)
13. Chiu, B., Freeman, G.H., Salama, M.M.A., Fenster, A.: Prostate segmentation algorithm using dyadic wavelet transform and discrete dynamic contour. Phys. Med. Biol. 49(21), 4943-4960 (2004)
14. Daubechies, I.: Orthonormal basis for compactly supported wavelets. Commun. Pure Appl. Math. XLI, 909-996 (1988)
15. Gleason, D.F.: Classification of prostatic carcinomas. Cancer Chemother. Rep. 50, 125-128 (1966)
16. Lawton, C.A., Grignon, D., Newhouse, J.H., Schellhammer, P.F., Kuban, D.A.: Oncodiagnosis panel: 1997 Prostatic Carcinoma. Radiographics 19, 185-203 (1999)
17. Loch, T., Leuschner, I., Genberg, C., Weichert-Jacobsen, K., Kuppers, F., Retz, M., Lehmann, J., Yfantis, E., Evans, M., Tsarev, V., Stockle, M.: Future trends in transrectal ultrasound. Artificial Neural Network Analysis (ANNA) in the detection and staging of prostate cancer. Der Urologe A39(4), 341-347 (2000)
18. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143, 29-36 (1982)
Multi-scale and First Derivative Analysis for Edge Detection in TEM Images Nicolas Coudray, Jean-Luc Buessler, Hubert Kihl, and Jean-Philippe Urban Laboratoire MIPS, Groupe TROP, 4 rue des Frères Lumière, 68093 Mulhouse, France {nicolas.coudray,jean-luc.buessler, hubert.kihl,jean-philippe.urban}@uha.fr
Abstract. Transmission electron microscope images of biological membranes are difficult to segment because they are low-contrasted images with heterogeneous gray levels. Added to that are the many possible types of membranes, the variable degree of aggregation, and the negative staining of the sample. We therefore develop a multi-scale approach to detect the edges at the appropriate scales. For these images, the study of the amplitude of the first derivative through the scales simplifies the feature tracking and the scale selection. A scale-adapted threshold is automatically applied to gradient images to progressively segment edges through the scales. The edges found at the different scales are then combined into a gradient-like image. The watershed algorithm is finally applied to segment the image into homogeneous regions, automatically selecting the edges found at the finest resolution. Keywords: multi-scale algorithm, edge detection, first derivative method, TEM images, segmentation.
1 Introduction

1.1 Context
The automatic segmentation of biological specimens for further analysis is of major importance, considering the large number of images acquired with automatic or semi-automatic screening packages (for instance the TOM Toolbox [5] or the Leginon system [8]). This study is part of a project for an automatic and on-line analysis of biological specimens acquired with a Transmission Electron Microscope (TEM). The specimens are 2D-crystal membrane-embedded proteins. 2D-crystallization arranges the proteins into lipid layers (artificial membranes) to help the study of their structure and understand their functions [7]. A large number of experiments are however needed to determine the optimal conditions for crystallization. It is therefore worth developing a tool that can routinely assess the results and the quality of the crystals for 2D crystallization, similarly to the automation developed for other protein analysis techniques (for example: [11] for 3D-crystallization, [6] and [13] for single particles).
The crystallization of the specimen is assessed in a 3-step top-down method. A low-magnification screening and analysis evaluates the quality of the grid and locates the appropriate grid squares. A medium-magnification step on the located squares identifies and characterizes the membranes. Finally, a high-magnification step, applied to the most interesting membranes, evaluates the crystallization by analyzing the power spectrum of the image. This paper deals with images at a medium magnification of about x5,000 (about 5 nm/pixel). Building on the segmentation proposed to fragment such images [1], we outline in this paper the main concepts of the analysis and the use of the first derivative through the scales.

1.2 Image Characterization
Biological specimen images are low-contrasted. Although negative staining with heavy metal is used, the contrast is still not sufficient to clearly separate the membranes from the background with a simple threshold method. The stain tends to outline the structure of the specimen, but the gray level of the membrane itself remains very low unless it is coated with stain or aggregated; the latter case is uninteresting for high-magnification study. The most interesting areas, the crystal-like non-aggregated membranes to be studied at high magnification, are also the most difficult to identify. Among the drawbacks of staining, artifacts and uneven staining can complicate the segmentation at medium magnification, leading to a heterogeneous gray level of the background and an uneven contrast and width of the edges. Finally, the size and the shape of the objects can vary greatly, so the identification of the membranes cannot rely on any pattern-based approach.

1.3 Gradient-Based and Multi-scale Methods
Negative staining enhances the contrast locally and therefore leads to gradient-based segmentation methods. However, because of the noise, the uneven contrast and the variable edge width, no single kernel size and threshold value will be adequate for all edges in an image. We therefore consider a multi-scale approach to adjust the parameters automatically. Edge detection combined with multi-scale approaches is often considered from the second-derivative point of view, and the loss of features is evaluated from the finest to the coarsest scale. However, this approach is not appropriate for our images, as will be explained in the following section. The study presented in this paper focuses on the first-derivative amplitude of the images as the scale increases. The degradation of the features can be analyzed with the amplitude of the first derivative: since blurring does not affect the edges as much as it affects the noise, it should be possible to discriminate edges from noise by analyzing the evolution of their gradient. Section 2 develops the principles of the proposed method based on a multi-scale analysis of the first derivative. The paper concludes with a discussion of the results.
2 Methodology of the Multi-scale Approach
In this section, the first part introduces the general principle of the multi-scale approach and underlines the main difficulties of using a coarse-to-fine methodology with the zero-crossings of the second derivative. Then, the outline of the chosen method, based on the first derivative, is presented. In the third part, details on the edge identification process are given.

2.1 General Principle of the Multi-scale Approach
The multi-scale approach, primarily developed by Witkin [12], relies on the idea that some information can only be observed at certain scales. The structures observed at the coarser scales obtained by low-pass filtering are simplifications of those at the finer scales. From Lindeberg [4], the scale-space representation L : Z² × R+ → R can be defined for a 2-dimensional discrete signal f : Z² → R as:

L(x, y, 0) = f(x, y)    (1)

at scale s = 0 (initial image, finest scale). At coarser scales (s > 0):

L(x, y, s) = Σ_{m=−k}^{+k} Σ_{n=−k}^{+k} T(m, n, s) f(x − m, y − n)    (2)
where T(m, n, s) is a family of kernels of size (2k + 1) × (2k + 1); generally, T(m, n, s) corresponds to discrete Gaussian kernels where s is the variance. Edge detectors typically operate at a single spatial scale; thus, to enhance their performance, they are often combined with multi-scale processes where the image is analyzed at different spatial scales [2]. The finest details are detected with precision at the finer scales, and the more interesting edges are assumed to be those found at the coarsest scales, but with less precision [9]. That is why the edges found at the coarser scales are often used to select the main edges, which are then tracked back to the finest scale to determine their precise position. This requires edge tracking and scale selection, which is a challenging process to develop. The approach is often described from the standpoint of the zero-crossings of the second-order derivative, which is not well suited to our images. Firstly, the locations of the edges found at the finest scale by the second derivative are strongly affected by the texture of the noise, and the positions of the edges at the finest scale cannot always be considered the best ones (Fig. 1(a) and Fig. 1(b)). Some edges found at coarser scales can be considered better located (Fig. 1(c)): the edges near the bright membrane surround smoothly the edges at scale 5 (label 2), whereas they are not continuous and are distorted by noise at the finest scale. Nevertheless, scale 5 is not the best choice for all the edges, since those near the darkest part at the top left of the image (label 1) are detected with less precision. Also, there is no unique scale suitable for all the membranes. Moreover, the second derivative is too noise-sensitive, and Fig. 1(d) confirms that it is more efficient to threshold the gradient of the first derivative to remove the noise before applying the second derivative.
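For concreteness, the Gaussian scale-space of Eqs. (1)-(2) can be built as in the following sketch, where the kernel T(m, n, s) is taken as a discrete Gaussian of variance s (so the filter standard deviation is sqrt(s)); the function name is ours.

import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, variances):
    """Gaussian scale-space L(x, y, s): L(., ., 0) is the image itself and larger
    variances s give the coarser, low-pass filtered levels of Eq. (2)."""
    levels = {0: image.astype(float)}
    for s in variances:
        levels[s] = gaussian_filter(image.astype(float), sigma=np.sqrt(s))
    return levels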
Fig. 1. Initial image (a) and edges detected with the zero-crossing of the second-order derivative at scale 0 (b), at scale 5 (c), and at scale 5 ignoring all edges whose gradient is not stronger than a given threshold (0.2 in this case) (d)
2.2 Outline of the Method: Impact of Smoothing on the First Derivative Analysis
The first derivative being less noise-sensitive than the second derivative, this part aims to show that the first-derivative amplitude behaves differently for edge and non-edge pixels, and how this can be used to identify the edges. Smoothing filters are used to remove, or at least reduce, the noise in an image, thus increasing the Signal-to-Noise Ratio (SNR). Smoothing is a process by which points are replaced by a weighted average of their neighborhood. The considered neighborhood, and thus the kernel size of the filter, clearly influences the signal to be detected and the SNR, according to the matched filter theorem. Using kernels of different sizes for smoothing, which is the basis of a multi-scale approach, induces different SNRs but also different blurring of the edges. Since the noise is more affected than the edges by the smoothing, the evolution of the gradient of the noise differs from that of the edges: when a group of pixels of nearly identical gradient is considered at a given scale s1, the pixels (xe, ye, s1) belonging to the edges and the other pixels (xn, yn, s1) cannot be dissociated by a threshold. However, after an appropriate smoothing (scale s2 > s1), the evolution of the gradient ∇L of the two groups is bound to differ because of the neighborhood consideration, increasing the probability of distinguishing the noise from the edges:

∃ s2 > s1,  ∇L(xe, ye, s2) ≥ ∇L(xn, yn, s2)    (3)
As a consequence, a threshold B(s2) can be found at scale s2 to distinguish the edge pixels from the noise pixels. Thus, for each scale, a binary image LB is constructed by:

∀(x, y, s):  if ∇L(x, y, s) < B(s) then LB(x, y, s) = 0, else LB(x, y, s) = 1    (4)

Thus, as the scale increases, new edges can be discriminated. This is illustrated in Fig. 2: at scale 0, twenty points of nearly identical gradient are selected and classified into two categories, pixels belonging to the edges (edge pixels) and pixels far from the edges (non-edge pixels). The curves confirm that the gradient of the non-edge pixels tends to decrease faster than that of the edge pixels when the scale
(a) Position of the test points
(b) Non-edge pixels
(c) Edge pixels
Fig. 2. Evolution through the scales of the amplitude of the gradient (Prewitt) for edge pixels (*) and for non-edge pixels (O)
increases. When the scale increases, a threshold curve can be drawn to separate the noise from the edges. In the example, most of the selected low-contrasted edges are already selected at scale 2. In a more general case, the study of the first derivative amplitude confirms the possibility of a multi-scale segmentation: Fig. 3(b) and 3(c) shows the evolution from scale 1 to scale 5 of the amplitude of the gradient of about 2000 labeled pixels regularly selected to cover the picture of Fig. 3(a) and manually labeled as edges and non-edges. A clear difference between the evolution of the edge and non-edge pixels can be seen: the multi-scale representation makes a separation between the edges and the noise possible, a separation that would be impossible considering only scale 1. We can see that the gradients of the non-edge pixels are concentrated on the bottom-left part of the graph, except for a few points. It can be noticed that these few points are located close to the edges of the membranes
(a) Tested image
(b) Non-edge points gradients
(c) Edge points gradients
Fig. 3. Evolution of the gradient of labeled pixels (non-edge and edge pixels) of the image on 2000 pixels of the left image, from scale 1 (abscissa) to scale 5 (ordinate) and suggested thresholds for each scale
and their position in the graph is explained by the blurring effect. Although the image at coarser scales is blurred, the evolution of the gradient can be used to identify the interesting features. In this work, the discrimination is performed by a scale-adapted threshold applied to the gradient-like images.

2.3 Gradient Evolution and Scale-Adapted Thresholding
As seen in the previous section, edges can be discriminated from the noise if the threshold is adapted to the scale. It must then be decided how the optimal threshold should be set. Segmentation of gradient images is known to be a difficult task, and a good approach is to use appropriate knowledge whenever available. The histogram of the gradient of the image is influenced by the gradient of the signal and the gradient of the noise. When the gradient magnitude of the noise can be modeled as a Rayleigh distribution, as for additive white Gaussian noise, the beginning of the histogram of the gradient of the image is generally considered as a peak that decreases sharply. Voorhees and Poggio [10] showed that under these assumptions, for an acceptable proportion of false edges PF, the threshold B(s) can be fixed by:

B(s) = Z(s) · sqrt(−2 · ln(PF))    (5)

Z(s) being the position of the maximum of the gradient histogram of the image at scale s. We retained B(s) = 3·Z(s) for each scale s, corresponding to PF < 1%. This compromise gives good results for our images, except at the finest scale, where the binary images appear quite noisy and a higher threshold would be more appropriate. The threshold is calculated and adapted at each scale to keep it optimal. Fig. 4 shows that when the image gets smoother (higher scales), the peak of the gradient histogram shifts to the left, getting higher and sharper. As a result, the position of the threshold (3·Z(s)) also shifts to the left and its value decreases as the scale increases (threshold curve in Fig. 2). Thus, the amplitude of the noise decreases with smoothing, and the threshold needs to be adapted to keep it optimal for each scale.
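A minimal sketch of this scale-adapted thresholding, under the interpretation that Z(s) is the gradient value at the histogram peak; the Prewitt gradient matches the operator mentioned in Fig. 2, while the 256-bin histogram and the handling of the zero bin are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter, prewitt

def gradient_magnitude(image):
    """Prewitt first-derivative amplitude."""
    return np.hypot(prewitt(image, axis=1), prewitt(image, axis=0))

def scale_adapted_threshold(image, variance, n_bins=256):
    """B(s) = 3 Z(s) of Eq. (5) with P_F < 1%, Z(s) being the gradient value at the
    peak of the gradient histogram at scale s (variance of the Gaussian kernel)."""
    smoothed = gaussian_filter(image.astype(float), sigma=np.sqrt(variance)) if variance > 0 else image.astype(float)
    grad = gradient_magnitude(smoothed)
    hist, edges = np.histogram(grad, bins=n_bins)
    peak = np.argmax(hist)              # excluding the zero bin may be needed in practice
    z = 0.5 * (edges[peak] + edges[peak + 1])
    return 3.0 * z, grad

def binary_edges(image, variances):
    """Binary edge maps L_B(x, y, s) of Eq. (4): 1 where the gradient exceeds B(s)."""
    maps = {}
    for s in variances:
        b, grad = scale_adapted_threshold(image, s)
        maps[s] = grad >= b
    return maps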
Fig. 4. Histograms of the gradient of the image at different scales s = sigma, sigma being the variance of the Gaussian filter
Fig. 5. Edges segmented from scale 0 to 7 (scale being the variance of the Gaussian kernel) with a scale adapted threshold
With this adapted threshold, the segmentation (Fig. 5) confirms that when the scale is increased it is possible to identify more edges of the membranes. An axiom of the multi-scale approach is that no new structure (or intensity level) should arise from one scale to another (though local extrema can arise for 2D signals [3]). Our method does not contradict this axiom, even though new edges are detected when the scale gets coarser: those segmented edges do exist at the finer scales, both from the amplitude-of-the-gradient point of view and from the zero-crossing-of-the-second-derivative point of view, but they are discarded there because the amplitude of the noise gradient is comparatively too high. Compared to the second-derivative method, the accuracy of edge detection with the first derivative is often pointed out as a drawback. The detection could be corrected with a thinning algorithm or by combining the edges found with the first and the second derivatives with a logical AND. However, we would not recommend doing so immediately in this particular case: edge thickness can help the tracking of the edges through the scales. Because the position of the zeros shifts slightly through the scales, the tracking of the edges could be more difficult with the second derivative than with the method used here. As can be seen in Fig. 6, the edges become wider with increasing scale because of the blurring. As a consequence, the position of an edge detected with precision at a finer scale is included in the position of its wider equivalent at a coarser scale. This simplifies the tracking of an edge through the scales as well as the continuity of heterogeneously contrasted edges detected at different scales.
(a) Gray level image and position of the lines
(b) Profile along the Line 1
(c) Profile along the Line 2
Fig. 6. Evolution of the profile lines of the image through the scales (the scale being the variance of the Gaussian kernel), after gradient transformation (dotted lines) and after thresholding (solid lines)
The profile line of Fig. 6(c) crosses 3 edges of layered sheet-like membranes detected at different scales and the profile line of Fig. 6(b) crosses two edges. All the desired edges can be identified at scale 4. However, for some of them, a finer scale exists where the edge can be separated from the noise and located more precisely. By analyzing the image at several scales, it is possible for each edge to be placed at the finest possible scale: for Fig. 6(c), the external edge (on the left) can be placed according to the information found at scale 1, while the other edges would be placed according to scales 2 and 4. Fig. 6(b) shows however that some neighboring edges merge at coarser scales when the edges are very wide (here at s = 7).
(a) Gradient-like image
(b) Combination along Line 1 (c) Combination along Line 2
Fig. 7. Weighted combination of the edges found at the different scales: (a) gradient-like image; (b) and (c) weighted combination along Lines 1 and 2 of Fig. 6
Fig. 8. Result of the watershed applied to the weighted combination of edges: result alone (left) and displayed on the initial gray level image (right)
To combine the features found at the different scales, a weighted combination of the binary images at the different scales is proposed. The value of the weighting coefficient a_s = (n + 1 − s) decreases as the scale s increases:

I_res = max(a_1 × I_B(x, y, 1), ..., a_s × I_B(x, y, s), ..., a_n × I_B(x, y, n))    (6)

n being the number of scales used. The objective of this step is to obtain a gradient-like, noise-reduced image (Fig. 7) which can be used with the watershed algorithm for a convenient partition of the image. The gray levels of the gradient-like image are related to the scale and are thus proportional to the precision of the edge position; the value of each pixel in I_res reflects the finest scale at which an edge has been detected. Finally, after the combination, the watershed algorithm is applied on the gradient-like image (Fig. 8) to partition the image into closed homogeneous regions. Edges are thinned by the watershed algorithm toward the edges identified at the finest scale, producing one-pixel-wide continuous edges placed at the crest lines of the gradient-like image.
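A sketch of this combination and of the final watershed step is given below, reusing binary_edges from the previous sketch; ranking the scales so that the finest gets the largest weight follows Eq. (6), while the marker-free call to scikit-image's watershed is an assumption about implementation details the paper does not specify.

import numpy as np
from skimage.segmentation import watershed

def gradient_like_combination(binary_maps):
    """Weighted combination of Eq. (6): I_res = max over scales of a_s * I_B(x, y, s),
    with a_s = n + 1 - s decreasing as the scale rank s increases (rank 1 = finest)."""
    scales = sorted(binary_maps)
    n = len(scales)
    ires = np.zeros(next(iter(binary_maps.values())).shape, dtype=float)
    for rank, s in enumerate(scales, start=1):
        ires = np.maximum(ires, (n + 1 - rank) * binary_maps[s].astype(float))
    return ires

def segment(image, variances):
    """Full pipeline sketch: scale-adapted edge maps, weighted combination, watershed."""
    ires = gradient_like_combination(binary_edges(image, variances))
    return watershed(ires)   # labels of the closed homogeneous regions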
3 Results and Discussion
The algorithm has been tested on various images acquired with different TEM (FEI Tecnai F30, Philips CM200, Hitachi H-7000 and H-8000). Images have been selected to represent a good panel of possible images, so that different categories are represented: – Types of membranes: sheets-like membranes (for example Fig. 9(a) and 9(d)) and vesicles (Fig. 9(b)) – Sizes: from small to bigg membranes (from about 100 nm to 4 μm in Fig. 9(a)) – Various numbers of membranes in the field of view, from hardly any to many (for example Fig. 9(c) and 9(d)) and vesicles (Fig. 9(b))
1014
N. Coudray et al.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
Fig. 9. Line 1: example of typical images; Line 2: black lines show the result of the proposed segmentation, displayed on the grey level original image. The size of the images is around 10×10 μm, 1024×1024 pixels.
Qualitative evaluation: The obtained segmentation fits the requirements: a partition of the image into homogeneous regions with a limited amount of false positive and false negative edges (Fig. 9). The multi-scale method clearly gives better results than the traditional mono-scale edge detectors that were tested. Evaluation done on different TEM images confirmed that the adapted threshold allows the identification of new edges when the scale increases. Though the complexity of the images makes an objective evaluation of the algorithm difficult, conclusions can be drawn by analyzing three types of edges: – The true positive edges are located with a precision that can be considered as equivalent to a manual segmentation. – The false negative edges are rare and generally related to very bright and low-contrasted segments. – Some regions can be over-segmented (false positive), due to the noise or the creation of new crests as edges are merging in the coarser scales. With TEM images of 1024×1024 pixels and a pyramidal implementation, the optimal number of scales stands between 4 and 16 scales (here a scale s means a reduction of the size by s in the x and y direction): when the image is well contrasted, a few scales are sufficient. Over scale 16, the size of the image (under 64×64) becomes too small to extract relevant information. Comparison with a conventional watershed algorithm: In our tests, the reconstructed gradient-like image that is used in our process is much less noisy than a gradient image. Indeed, applying the watershed algorithm directly on a gradient image leads to an important oversegmentation. For example, if the watershed method is directly applied on the gradient (Prewitt kernel) of the Fig. 9(b)), then more than 115,000 regions are obtained, with the largest covering 36 pixels. If the image is previsouly smoothed with a gaussian filter of variance
Fig. 10. Regions segmented by the watershed algorithm applied to the gradient (Prewitt kernel) of an image smoothed with a Gaussian kernel of variance 25 (Fig. 10(a)), and to our gradient-like image (Fig. 10(b))
If the image is previously smoothed with a Gaussian filter of variance 25 (strong blurring), the image is still heavily oversegmented (Fig. 10(a)): 61,914 regions, with a mean area of 10 pixels (693 pixels maximum). Applied to the gradient-like, noise-free image, the oversegmentation is considerably reduced (701 regions), and there are no missing edges in this case.
Robustness analysis: The robustness of the algorithm to acquisition noise has been tested by comparing the segmentations obtained for a given field acquired several times under the same conditions. The positions of the edge pixels in the 1024×1024 images were compared by summing the dilated edges of the detected areas. A 6-pixel dilation was used to correct for the mislocation due to noise. For each image, about 90% of its edges were also detected in the two other images. The robustness has also been tested by acquiring an area at four different magnifications (from 1,500× to 5,000×). Though a quantitative evaluation is difficult, the segmentation of the most interesting areas appears identical.
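The acquisition-robustness test described above can be expressed as a simple overlap measure. The sketch below (function and parameter names are ours, not the authors') dilates the edges of one acquisition by the stated 6-pixel tolerance and reports the fraction of edge pixels of another acquisition falling inside the dilated band.

import numpy as np
from scipy.ndimage import binary_dilation

def edge_agreement(edges_ref, edges_test, tolerance=6):
    """Fraction of edge pixels in edges_test lying within `tolerance`
    pixels of an edge in edges_ref (both are boolean edge maps)."""
    band = binary_dilation(edges_ref, structure=np.ones((3, 3), bool),
                           iterations=tolerance)
    hits = np.count_nonzero(edges_test & band)
    total = max(np.count_nonzero(edges_test), 1)
    return hits / total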
4 Conclusion
A multi-scale edge analysis, focused on the amplitude of the first derivative, has been presented. It has been shown that the amplitude variation through the scales is characteristic of the pixels in the image (edge and non-edge pixels) and that its analysis can be used to discriminate noise from edges. Applied to TEM images, this distinction can be obtained with a scale-adapted threshold. The automatic method allows an easy scale selection as well as a simple combination and linking of the edges found at the different scales. The segmentation obtained on various images is meaningful and promising for the next steps. The membranes are properly segmented, even when the contrast is low. The sub-segmentation of the objects in the image eases the study of membranes and stacked regions. We are currently working on the characterization of the membranes, aggregates and artifacts, using the segmented images.
Acknowledgments. This work was supported by the EU 6th framework (HT3DEM, LSHG-CT-2005-018811), in collaboration with the Biozentrum of Basel and the Max-Planck Institute for Biochemistry in Martinsried, which provided the TEM images.
Watershed Segmentation of Intervertebral Disk and Spinal Canal from MRI Images
Claudia Chevrefils¹,², Farida Chériet¹,², Guy Grimard², and Carl-Eric Aubin¹,²
¹ École Polytechnique de Montréal, C.P. 6079, succ. Centre-ville, Montreal, Canada H3C 3A7
[email protected]
² Department of Orthopaedics, Sainte-Justine Hospital, Montreal, Canada
[email protected]
Abstract. A robust method to segment intervertebral disks and the spinal canal in magnetic resonance images is required as part of a precise 3D reconstruction for computer assistance during diskectomy procedures with a minimally invasive surgical approach. In this paper, an unsupervised segmentation technique for intervertebral disks and the spinal canal from MRI data is presented. The proposed scheme uses a watershed transform and morphological operations to locate regions containing the structures of interest. Results show that the method is robust enough to cope with the variability of shapes and topologies characterizing MRI images of scoliotic patients.
1 Introduction
Benefits of minimally invasive surgery for disk resection are clear for scoliotic patients [1-3]. Unfortunately, the adoption of this procedure has been slowed by three main difficulties faced by the surgeons: the loss of depth perception, the reduced field of view, and the long learning curve. An image-guided surgery system that integrates 3D preoperative data with the thoracoscopic video images would help solve the problems encountered by surgeons with minimally invasive surgery. An important part of such a system is a precise preoperative 3D model of the patient's anatomy. The structures of interest for scoliotic patients are the vertebral bodies, the intervertebral disks and the spinal canal. MRI is a noninvasive imaging modality that allows precise visualization of soft tissues while showing a good delimitation of hard structures. Hence, MRI is a relevant choice of modality to obtain a 3D reconstruction of the intervertebral disks and the spinal canal. Magnetic resonance images are challenging because (1) non-uniformities of intensities over the same class of tissues or structures exist between patients, (2) the shape and position of structures vary between patients due to the scoliotic deformity, and (3) the relative intensity varies along the spine due to the different structures surrounding the spine at different levels (intra-patient variability). Also, as the ultimate goal of the segmentation process is to perform a 3D reconstruction of the structures and as the final application will be used in a clinical context, the external constraints are (1) automatic
segmentation (no user interaction), and (2) the segmentation should create closed contours that can be connected between successive sections for 3D reconstruction; otherwise it would necessitate an edge-closing step, which is often a complex task.
Fig. 1. Results of segmentation obtained for (a) the original image with (b) the Canny method and (c) the Marr-Hildreth method
For MRI images of the spine where intervertebral disks and the spinal canal have to be segmented, methods based on thresholding or top-hat transforms do not work, since the structures to be segmented do not have a fixed size, orientation and intensity. Figure 1 shows examples of edge detection by the Canny and Marr-Hildreth methods. For the Canny method, contours are not closed, and for 3D reconstruction this would necessitate an edge-closing step, which is often a complex task. The Marr-Hildreth method can produce closed contours, but the image is oversegmented as it contains edges of many structures besides the structures of interest. Other segmentation techniques in the active contour category, such as the snakes introduced by Kass et al. [4] or level sets, require a lot of parameter fine-tuning in order to obtain the type of segmentation needed. Also, results with this kind of method are strongly dependent on the initialization phase, so user interaction is often needed for this step. Only a few studies report work on the segmentation of spinal MRI images [5-9]. None of these techniques is useful for our application because of the spinal deformity and the external constraints, namely the unsupervised and closed-contour requirements. Hoad et al. [7] and Coulon et al. [5] respectively proposed techniques to segment vertebrae and the spinal cord on MR images. In both cases, the initialization phase required user interaction to manually locate the center of the spinal cord at every spine level [5] or to manually locate 4 points on each vertebral body [7]. Booth et al. [9] developed an algorithm that can automatically detect the center of the spinal canal on axial images based on a symmetry measure. With this information, they apply an active contour algorithm to segment the spinal canal. The vertebral bodies are then segmented in the axial direction based on a radial edge detection scheme that produces open contours, which is not adequate for the current application. Also, working in the axial direction only is not a good choice for the precise delimitation between the intervertebral disk and the vertebral body: the sagittal view gives much more information on the delimitation of these structures. On the other hand, Shi and Malik [10] proposed an unsupervised segmentation technique that could have been
used for our application if the final application were not for scoliotic spines. Indeed, this technique does not require initialization; it looks for pairwise affinities between all pairs of pixels and admits combinations of different features such as brightness, position, windowed histograms, etc. Carballido-Gamio et al. [8] have worked to alleviate the computational demand imposed by the normalized cuts technique, but even with this work the time to compute the segmentation is still too long. Also, this technique needed a selection of sagittal slices where the spinal canal can be clearly identified. For scoliotic patients, no sagittal image shows the spinal canal from top to bottom because of the 3D deformation of the spine. Peng et al. [6] automatically found the best sagittal slice to locate intervertebral disks, but they use a Canny edge operator, creating open contours. The watershed has been used in combination with other techniques in cardiology on ultrasound images [11] and in neurology on MR images [12-14]; those results showed that the technique is able to cope with variations of topology and shape, but it was never used for spinal deformity. The principle of the watershed transform is based on the detection of ridges and valleys. The image is viewed as a topographic surface where intensity represents the altitude of the pixels. The image is flooded from its minima, which delimits the catchment basins and the ridges (watershed lines). Hence the catchment basins represent regions of homogeneous intensity. But, as is well known, the use of the watershed method on the gradient image leads to oversegmentation problems [14-16], and the sought contours are lost among many irrelevant ones. Oversegmentation can be caused by too many minima due to noise or to other structures that should not appear in the gradient image. To overcome this problem it is possible to either remove irrelevant contours [12] or modify the gradient image [15]. The basic watershed method already uses the intensity information of the image (gradient image). Hence, to introduce shape information into the method and to get rid of the well-known oversegmentation problem of the watershed technique, we have used the marker method based on morphological operators. This method allows the incorporation of a priori knowledge of the shape of the structure of interest by introducing internal markers (sets of connected pixels of the region of interest) and external markers that correspond to the background. The external markers represent the deepest valley lines surrounding every internal marker. The use of an atlas to automatically determine the markers, as proposed by Grau et al. [14], was not applicable in the case of the intervertebral disks and vertebral bodies of scoliotic patients because of the complex spinal curves (three-dimensional deformation of the spine) specific to every scoliotic patient. The objective of this paper is to present an unsupervised segmentation technique using the watershed on magnetic resonance images and to evaluate the capacity of the method to deal with different spinal deformities of scoliotic patients.
2 Proposed Approach
2.1 Image Acquisition System and Hardware
The magnetic resonance images were acquired at Sainte-Justine Hospital with a 1.5T Magnetom Avanto system from Siemens. The radiofrequency (RF) transmitting and receiving units consisted of a body coil. A 3D MEDIC (Multi Echo Data Image
Combination) sequence was used in the sagittal plane with RT = 23 ms, ET = 12 ms, a slice thickness of 1 mm and a matrix of 256 × 256, leading to a voxel size of 1 mm³. The images were then transferred to a Pentium 4 (3 GHz, 1 GB of RAM).
2.2 Segmentation Algorithm
The algorithm presented in this study is able to detect intervertebral disks and the spinal canal. The preferred planes for the segmentation are not the same for both structures. The intervertebral disks have a cylindrical shape of approximately 20 mm diameter with a height of 6 mm and can easily be detected in the sagittal or coronal plane. The spinal canal has a long cylindrical shape with a diameter of approximately 10 mm, with curvatures in 3D that depend on the severity of the scoliotic deformity, so the axial plane is the preferred view to segment this structure. Figure 2 shows that the algorithm needs a preprocessing step (described in Section 2.2.1) allowing a consistent contrast from slice to slice in the acquisition plane. Hence, image reconstruction in the other planes (axial and coronal) does not show any discontinuity. The proposed method modifies the gradient image by using the internal and external markers to keep only the most significant and relevant contours for the structures of interest. Sections 2.2.2 and 2.2.3 give details of how the markers, created with morphological operations, can locate the intervertebral disks or the spinal canal. Then, the watershed transformation is applied on the modified gradient image to give a segmentation of the disks for images in the sagittal or coronal planes and of the spinal canal for images in the axial plane.
2.2.1 Preprocessing
As shown in Figure 2, the first step of the proposed algorithm is a preprocessing procedure. In order to have the same contrast from slice to slice in the sagittal plane, all the images of the volume go through a contrast stretching step which widens the dynamic range of the histogram based on a simple linear mapping:
I_out = (I_in − c) · (b − a)/(d − c) + a    (1)
where I_out is the processed image, I_in the original image, and a and b the minimum and maximum values of a normalized image, respectively. c and d are chosen so that they represent the 2nd and 98th percentiles of the histogram, meaning that 2% of the pixels in the histogram have values lower than c and 2% have values higher than d. This prevents outliers from affecting the histogram mapping. The axial reconstruction based on this new volume was then obtained without any contrast irregularities. Once the axial reconstruction was completed, no more preprocessing was needed on these images. For the sagittal view, the spinal canal, which appears very bright in the image, was removed with a morphological operator called opening by reconstruction. A simple opening, noted

I ∘ b = (I ⊖ b) ⊕ b    (2)

was obtained by applying an erosion ⊖ to the image with a structuring element b and then applying a dilation ⊕ to the resulting erosion. Opening by reconstruction is an iterative process (see Vincent et al. [15] for details) defined as
ψ_openrec(I) = R_I(I ∘ b | I)    (3)
which represents the opening by reconstruction of I using the structuring element b. This morphological operator is often used to filter out all the connected components which cannot contain the structuring element while preserving the others entirely.
2.2.2 Internal Markers
After this preprocessing step, the structures of interest in the image I are the bright pixels. An opening by reconstruction was applied with a small structuring element. The structuring element used for the spinal canal was a disk of 8 mm diameter, and a 2 mm × 2 mm square was used for the intervertebral disks. The choice of the structuring element was made knowing that the intervertebral disks in the coronal or sagittal plane can have different orientations depending on the severity of the spine deformity. Indeed, to be able to keep the same structuring element regardless of the scoliotic severity, the structuring element should be invariant to the rotation and translation of the structure in the sagittal or the coronal plane. This operation results in an image where the intervertebral disks and spinal canal have a smooth intensity (Figure 3 a, e). The markers are binary images; hence the intensity image (Figure 3 a, e) was converted to a black and white image with thresholding (Figure 3 b, f). The thresholding method used is Otsu's method, which automatically finds the threshold k that minimizes the within-class variance; this turns out to be the same as maximizing the between-class variance [17,18]:
max( σ_B² / σ_T² )    (4)
where σ_B² is the between-class variance and σ_T² is the total variance, which represents the sum of the between-class and within-class variances.
2.2.3 External Markers
The external markers represent the background and were created with the help of the distance transform of the internal markers. The distance transform of a binary image gives, for every pixel, the distance to the closest nonzero pixel. The metric used was the Euclidean distance between two points u = (x1, y1) and v = (x2, y2):
d(u, v) = √( (x1 − x2)² + (y1 − y2)² )    (5)
where d measures the straight-line distance between two pixels. An example of a distance transform of an internal marker image is shown in Figure 3 c, g. The watershed was applied on the distance transform, which produced the external markers. The combined binary marker F_m (Figure 3 d, h) is then imposed as minima on the gradient image:
F_m = F_intm ∪ F_extm    (6)
This minima imposition eliminates the oversegmentation problem that occurs when the watershed is applied directly on the gradient image.
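The marker generation and minima imposition of Sections 2.2.1-2.2.3 can be sketched with standard morphology tools. The following Python code is only an illustration of the pipeline under our own assumptions (the scipy/scikit-image primitives, the morphological gradient used in place of the paper's gradient image, and the example structuring-element size); it is not the authors' Matlab implementation.

import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.morphology import disk, reconstruction
from skimage.segmentation import watershed

def marker_controlled_watershed(image, selem):
    """Marker-controlled watershed of one MR slice (Sections 2.2.2-2.2.3)."""
    # Opening by reconstruction (Eqs. 2-3): keeps bright structures that can
    # contain the structuring element and smooths them.
    eroded = ndi.grey_erosion(image, footprint=selem)
    opened = reconstruction(eroded, image, method='dilation')
    # Internal markers: automatic Otsu threshold of the reconstructed image (Eq. 4).
    internal = opened > threshold_otsu(opened)
    # External markers: watershed lines of the distance transform (Eq. 5)
    # of the internal markers, i.e. the deepest valley lines around them.
    dist = ndi.distance_transform_edt(~internal)
    external = watershed(dist, watershed_line=True) == 0
    # Combined marker F_m (Eq. 6), imposed as minima on a gradient image.
    markers, _ = ndi.label(internal | external)
    grad = ndi.morphological_gradient(image.astype(float), size=3)
    return watershed(grad, markers=markers)

# Example usage (assuming 1 mm voxels): an 8 mm disk for the spinal canal
# on an axial slice, e.g. labels = marker_controlled_watershed(axial_slice, disk(4))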
Fig. 2. Algorithm for automatic detection of internal and external markers and the final segmentation
Fig. 3. Creation of internal and external markers. (a) and (e) are the results of the opening by reconstruction with a structuring element being a square and a disk, respectively. (b) and (f) are the results of the automatic threshold to create the internal markers in the sagittal and axial planes. (c) and (g) show the distance transform applied on the internal markers to obtain the external markers. (d) and (h) show the internal and external markers used to impose minima on the gradient image.
3 Experimental Results and Discussion
For clinical purposes, the intervertebral disks and the spinal canal have to be segmented on every image contained in the volume of interest. Intervertebral disks can easily be detected on the coronal or sagittal views and the spinal canal on the axial view. The technique presented here is unsupervised and does not need any initialization. All the segmented results were obtained by using the algorithm directly on the magnetic resonance images acquired with the specified protocol. Our implementation, using Matlab® on a Windows NT based system equipped with a 3 GHz processor, took less than 2 seconds per frame. This is fast considering the complexity of the current application and the use of a high-level interpreted language like Matlab®. Figure 4 a, b, c shows results for different severities of scoliotic deformity and Figure 4 d shows the result for axial images. Taking the 4 results (a, b, c, d) independently, the proposed scheme gives satisfactory results in terms of its ability to detect intervertebral disks along the spine of one specific patient, even with the relative intensity variation due to the different structures surrounding the spine at different disk levels. Indeed, for the 4 cases all the disks and the spinal canal were detected and segmented. The comparison of the results of Figure 4 (a), (b) and (c) illustrates that the marker-controlled watershed technique is able to deal with the non-uniformities of intensities that exist over the same class of tissues or structures between patients, because the algorithm is able to detect all the intervertebral disks. These results also illustrate that the proposed method can deal with the change of topology due to the scoliotic deformities of the spine. Figure 4 (d) illustrates the segmentation of the spinal canal on an axial image and clearly demonstrates that the
proposed technique is versatile and can detect many types of structures of interest. This new technique enables the use of prior information to automatically generate markers for specific structures. With proper internal and external markers used to modify the gradient images, the watershed algorithm is able to detect specific structures on axial or sagittal images. The position of the image plane is a factor that does affect the quality of the detection of the intervertebral disk. Indeed, at the extremities of the disk in a sagittal or coronal plane, it is hard to find the delimitation between the disk and the surrounding tissue (Figure 5). This can be bypassed by segmenting intervertebral disks in 2 orthogonal planes simultaneously. Hence, regions that can hardly be segmented in one plane (sagittal plane) correspond to regions that can easily be segmented in the other plane (coronal plane).
Fig. 4. Results for different severities of spine deformities. Sagittal view of (a) a normal spine, (b) a moderate spine deformation, (c) an important spine deformation, and (d) results in the axial view for the spinal canal detection.
Fig. 5. Poor delimitation at the extremities of the disk in one direction, (a) sagittal, corresponds to good delimitation of the disk in the orthogonal direction, (b) coronal
Unfortunately, a persistent problem with our method is oversegmentation. Our watershed-based technique does segment regions that are neither intervertebral disks nor the spinal canal. But since our segmentation technique leads to closed contours, it is easy
to send the closed contours into an unsupervised pattern recognition algorithm to determine whether each one is an intervertebral disk or not. Some preliminary work has been done with a non-parametric approach based on texture information and gives promising results. A validation step based on the Dice similarity coefficient is being carried out. This will quantitatively validate the segmentation method, and we will be able to compare the results with a manual segmentation done by a radiologist.
4 Conclusion
As the ultimate goal of the segmentation process for the current application is to reconstruct in 3D the intervertebral disks and the spinal canal of scoliotic patients in a clinical environment, robustness, automation and precision are the three fundamental requirements of the segmentation process. We developed a watershed-based technique for the segmentation of intervertebral disks and the spinal canal from magnetic resonance images. A qualitative analysis of the results obtained with this technique compared favourably with other fast and unsupervised techniques such as the Canny and Marr-Hildreth edge detectors. The advantage of our approach lies in the fact that it is fast, unsupervised, produces closed contours, and is based not only on intensity information but also on prior knowledge of the shape to be detected. We also demonstrated the robustness of this novel method by assessing the detection of intervertebral disks and the spinal canal on MRI data coming from normal spines and from the highly deformed spines of scoliotic patients. Complementary work is being carried out to determine quantitatively the precision of this novel segmentation process. Furthermore, an extensive validation of the proposed approach by an expert on different classes of scoliotic deformities will be performed. Finally, further work is under way to integrate into the proposed method a learning step from a scoliotic patient database to define shape descriptors of the anatomical structures of interest from their extracted boundaries.
Acknowledgements
Financial support for this project was provided by the Fonds Québécois de la Recherche sur la Nature et les Technologies and the Natural Sciences and Engineering Research Council of Canada.
References
[1] Picetti, G.D., Pang, D.: Thoracoscopic techniques for the treatment of scoliosis. Childs Nerv. Syst. 20, 802–810 (2004)
[2] Newton, P.O., Marks, M., Faro, F., Betz, R., Clements, D., Haher, T., Lenke, L., Lowe, T., Merola, A., Wenger, D.: Use of video-assisted thoracoscopic surgery to reduce perioperative morbidity in scoliosis surgery. Spine 28, S249–S254 (2003)
[3] Wong, H.K., Hee, H.T., Yu, Z., Wong, D.: Results of thoracoscopic instrumented fusion versus conventional posterior instrumented fusion in adolescent idiopathic scoliosis undergoing selective thoracic fusion. Spine 29, 2031–2038, discussion 2039 (2004)
[4] Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. In: First International Conference on Computer Vision, pp. 259–268 (1987)
[5] Coulon, O., Hickman, S.J., Parker, G.J., Barker, G.J., Miller, D.H., Arridge, S.R.: Quantification of spinal cord atrophy from magnetic resonance images via a B-spline active surface model. Magn. Reson. Med. 47, 1176–1185 (2002)
[6] Peng, Z., Zhong, J., Wee, W., Lee, J.H.: Automated Vertebra Detection and Segmentation from the Whole Spine MR Images. In: Conf. Proc. IEEE Eng. Med. Biol. Soc., vol. 3, pp. 2527–2530 (2005)
[7] Hoad, C.L., Martel, A.L.: Segmentation of MR images for computer-assisted surgery of the lumbar spine. Phys. Med. Biol. 47, 3503–3517 (2002)
[8] Carballido-Gamio, J., Belongie, S.J., Majumdar, S.: Normalized Cuts in 3-D for Spinal MRI Segmentation. IEEE Transactions on Medical Imaging 23(1), 36–44 (2004)
[9] Booth, S., Clausi, D.A.: Image segmentation using MRI vertebral cross-sections. In: Canadian Conference on Electrical and Computer Engineering, pp. 1303–1308 (2001)
[10] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
[11] Cheng, J., Foo, S.W., Krishnan, S.M.: Watershed-presegmented snake for boundary detection and tracking of left ventricle in echocardiographic images. IEEE Transactions on Information Technology in Biomedicine 10(2), 414–416 (2006)
[12] Cates, J.E., Whitaker, R.T., Jones, G.M.: Case study: An evaluation of user-assisted hierarchical watershed segmentation. Medical Image Analysis 9(6), 566–578 (2005)
[13] Dokladal, P., Bloch, I., Couprie, M., Ruijters, D., Urtasun, R., Garnero, L.: Topologically controlled segmentation of 3D magnetic resonance images of the head by using morphological operators. Pattern Recognition 36(10), 2463–2478 (2003)
[14] Grau, V., Mewes, A.U.J., Alcaniz, M., Kikinis, R., Warfield, S.K.: Improved watershed transform for medical image segmentation using prior information. IEEE Transactions on Medical Imaging 23(4), 447–458 (2004)
[15] Vincent, L., Soille, P.: Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(6), 583–598 (1991)
[16] Tek, F.B., Dempster, A.G., Kale, I.: Noise sensitivity of watershed segmentation for different connectivity: experimental study. Electronics Letters 40(21), 1332–1333 (2004)
[17] Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics SMC-9(1), 62–66 (1979)
[18] Ng, W.S., Lee, C.K.: Comment on using the uniformity measure for performance measure in image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(9), 933–934 (1996)
Towards Segmentation of Pedicles on Posteroanterior X-Ray Views of Scoliotic Patients
Vincent Doré¹, Luc Duong²,³, Farida Cheriet²,³, and Mohamed Cheriet¹
¹ Department of GPA, École de Technologie Supérieure, Montreal, QC H3C 1K3, Canada
² Research Center, Sainte-Justine Hospital, Montreal, QC H3T 1C5, Canada
³ Department of Computer Engineering, École Polytechnique de Montréal, Montreal, QC H3C 3A7, Canada
Abstract. The objective of this work is to provide a feasibility study towards the automatic segmentation of vertebral pedicles on X-ray images of scoliotic patients, with the ultimate goal of extracting high-level primitives leading to an accurate 3D spine reconstruction based on stereo-radiographic views. Our approach relies on coarse and fine parameter-free segmentation. First, an active contour is evolved on a probability score table built from the input pedicle sub-space, yielding a coarse shape. The prior knowledge induced by this shape is then introduced within a level set model to refine the segmentation, resulting in a fine shape. For validation purposes, the estimation of the rotation of scoliotic deformations obtained from the resulting fine shape is compared with a gold standard obtained by manual identification by an expert. The results are promising for finding the orientation of scoliotic deformations, and hence can be used in subsequent tools for clinicians.
Keywords: X-ray images, idiopathic scoliosis, active contour, vertebral orientation.
1 Introduction
Adolescent idiopathic scoliosis is a disease characterized by a frontal deviation of the spine from the normal alignment plane. This deviation is principally lateral, but it also includes a three-dimensional component. Present technologies only allow X-ray views in two dimensions (posteroanterior or PA, and lateral or LAT), which only partially reveal the deformation. To provide clinicians with additional information, 3D spine models are nowadays reconstructed by manually extracting landmark points from both views. Reconstruction by manual identification is a tedious task and might introduce a variability depending on the operator performing it. The ultimate goal of our work is to automate the reconstruction process and to enrich it by integrating basic primitives from different objects such as vertebrae, clavicles, pelvis and ribs. Several works already treat spine X-ray images [1]; they deal with vertebra shape matching [2], or with the classification of vertebrae [3,4] or of the spinal curve [5].
From this perspective, we present in this paper a feasibility study to automate the segmentation of pedicles at every vertebral level. Pedicles are two short rounded structures that extend from the lateral posterior margin of the vertebral body. They can be identified on both views and are thus used as landmarks for 3D reconstruction. In addition to giving insight into the vertebral rotation, the segmentation of pedicles would lead to an automatic 3D reconstruction. Furthermore, by extracting objects rather than points, high-level primitives can be used to improve the quality of the 3D model. This work could also be used to detect pedicles intra-operatively, since pedicle screws are used for the surgical correction of scoliosis. LAT views are highly noisy due to several object superpositions. Presently, only cervical and lumbar vertebrae are segmented on this view. The segmentation of pedicles on the LAT view should thus be driven by segmentation information from the PA view; projective geometry computed from a known object can be used for this purpose. The accuracy of the segmentation on the LAT view is thus dependent on the segmentation quality on the PA view. Hence, in this paper, we focus on the segmentation on the PA view. Because pedicles are small, low-contrast objects on highly noisy images, they are hard to identify, and even harder for non-expert clinicians. Moreover, since pedicle shapes are deformed in AIS, segmentation becomes really tedious. It is thus difficult to perform segmentation directly on those images. The first step of this kind of segmentation is to characterize the pixels belonging to the same pedicle. In fact, pedicles can be identified (after training) because some of their pixels are locally and slightly darker than the background. They also form a ring around lighter pixels belonging to the inner part of the pedicle. To enhance this characteristic, we built a score table which assigns to each pixel a probability of being in the darker object present in a certain neighborhood. In this table, the ring is hard to identify and often mistaken for vertebra edges. To simplify, and without loss of generality, only the inner part is detected. Thus, on the table, we initialize an active contour around the inner part. This kind of model enables us to extract closed curves from noisy images [6,7]. Recently, models incorporating shape knowledge into snakes [8] or into active contours [9,11] have appeared in the literature. We applied [11] to our segmentation problem with poor results. This happened because of the large variability of the shapes, their small and variable size, as well as the lack of singularity points. We then decided to use the simple piecewise-constant model developed by Chan and Vese [10] on the probability score table and embedded it in a coarse-to-fine segmentation. By using such a table instead of the original image, we group together pixels which have the same local characteristics; moreover, this table enables us to have the same initialization for all the pedicles. Initially, as the ring in these score tables may contain holes, the regularity parameter is set high. A priori information about the size of the pedicle enables the model to automatically and dynamically update the characteristic and regularity parameters. As a result of such a process, a coarse shape of the inner part of the pedicle is obtained. In order to
reach the summits of the real shape, we perform a second algorithm, with a priori knowledge of the coarse shape, that searches for more details around it. To validate the relevance of the segmentation, we show a comparison between the orientation of each vertebra (frontal rotation) estimated from our pedicle extraction and the one obtained manually. In addition, the assessment of this rotation, as described in [12], may be used as a good indicator of the severity of the curve. Even though those rotations are not accurate, they are still used by clinicians, due to the lack of 3D reconstruction. Hence, our automatic estimation is a useful clinical result. In the first part of the paper, we present the Chan-Vese model and its limitations when applied to our problem. Then, we introduce the new model to segment pedicles. Discussions on the estimation results of the vertebral orientations are provided in the following section. Finally, in the conclusion, we present the perspectives opened by this feasibility study.
2 The Active Contour Model of Chan-Vese
The piecewise-constant model developed by Chan and Vese [10] is a segmentation algorithm based on the minimization of the following simplified Mumford-Shah energy functional:

inf_{u,C} F^MS(u, C) = ∫_Ω (u − u_0)² dx dy + ν |C|    (1)
where u is considered to be a two-phase image: a single object of characteristic c_1 is separated from the background of characteristic c_2 by a contour C. The first term in (1) corresponds to the attachment of u to the input image u_0, while the second is the length of the contour C. The two-phase image that minimizes (1) is close to u_0 and has a regular contour. ν is the regularity coefficient, which makes it possible to favor contour regularity over data attachment, or vice versa. In the level set method, C is represented by the zero level set of an implicit function:

φ(s) = 0 if s ∈ C,  +ε if s ∈ C_in,  −ε if s ∈ C_out    (2)

where ε is a positive constant, C_in the area inside the curve C, and C_out the area outside it. The evolving image u can thus be decomposed as

u = c_1 H(φ) + c_2 (1 − H(φ))    (3)

with H the Heaviside function:

H(φ) = 1 if φ > 0,  0 otherwise    (4)
By using this decomposition, the functional (1) can be rewritten as:

inf_{c_1, c_2, φ} F^MS(c_1, c_2, φ) = ∫_Ω (c_1 − u_0)² H(φ) dx dy + ∫_Ω (c_2 − u_0)² (1 − H(φ)) dx dy + ν ∫_Ω |∇H(φ)| dx dy    (5)

The Euler-Lagrange equations can easily be computed by minimizing (5) with respect to c_1, c_2 and φ:

c_1 = ∫_Ω u_0 H(φ) dx dy / ∫_Ω H(φ) dx dy   and   c_2 = ∫_Ω u_0 (1 − H(φ)) dx dy / ∫_Ω (1 − H(φ)) dx dy    (6)

c_1 and c_2 correspond to the mean value of u_0 inside and outside the curve C, respectively. To solve the practical problem of the level set function, (5) is embedded in a gradient descent by introducing an artificial time t. The authors thus obtain the following equation:
∂φ/∂t = ν δ_ε(φ) div( ∇φ / |∇φ| ) + δ_ε(φ) [ −(c_1 − u_0)² + (c_2 − u_0)² ]    (7)

A finite-difference implicit scheme is used to discretize equation (7). At time t = 0, the level set function φ_t is initialized by the user. The algorithm iterates the three following steps until convergence of the system:
– compute c_1(φ_t) and c_2(φ_t)
– compute φ_{t+1} by discretizing (7)
– increment t = t + 1
One of the major drawbacks of the piecewise-constant model is the determination of the characteristics, which are constant over the object and the background. Indeed, in real life those characteristics are not constant but homogeneous (regular). In some extreme cases, the characteristics c_1 and c_2 can be very close when computed over the objects, but different when evaluated locally. For instance, in this study, the mean intensities of the inner and outer parts of the pedicle are often separated by less than three gray levels. It is clear that this difference of
Fig. 1. (a) X-ray of an L1 vertebra; (b) in white, what we call the outer part, and in black the inner part of the pedicle. The X-ray has been enhanced for visualization purposes.
contrast is not obvious. Nevertheless, we are able to perceive the pedicle as a dark ring on a lighter background. The difference in the evaluation of the local characteristics, which is generally larger than that of the global ones, enables us to have a better visual perception of the pedicles.
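For illustration, a minimal Python sketch of one iteration of the piecewise-constant model (Eqs. 6-7), using a regularized Heaviside/Dirac pair and an explicit curvature term; the smoothing choices, step sizes and function names are assumptions, not the authors' implementation.

import numpy as np

def chan_vese_step(phi, u0, nu=0.2, dt=0.5, eps=1.0):
    """One explicit update of the two-phase piecewise-constant model."""
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))   # regularized Heaviside
    delta = (eps / np.pi) / (eps ** 2 + phi ** 2)             # its derivative (Dirac)
    c1 = (u0 * H).sum() / (H.sum() + 1e-9)                    # mean inside  (Eq. 6)
    c2 = (u0 * (1.0 - H)).sum() / ((1.0 - H).sum() + 1e-9)    # mean outside (Eq. 6)
    # curvature term div(grad(phi) / |grad(phi)|)
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-9
    kappa = np.gradient(gy / norm)[0] + np.gradient(gx / norm)[1]
    # gradient-descent update of Eq. (7)
    phi_new = phi + dt * delta * (nu * kappa - (c1 - u0) ** 2 + (c2 - u0) ** 2)
    return phi_new, c1, c2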
3 Overview of the System
To enable the Chan-Vese model to segment pedicles, we adopt the following four steps (Figure 2):
– Initialisation: The algorithm is initialized by clicking inside the inner pedicle.
– Probability score table: In a neighbourhood of the initialization point, a probability score table is constructed by assigning to each pixel a probability of locally belonging to the darker object.
– Coarse segmentation: The active contour is processed on the table with a high regularity parameter ν. Hence, pixels having a high (or low) probability are grouped together. If the output segmentation is not suitable, the active contour is reinitialized with a different regularity parameter, depending on the resulting segmentation.
– Fine segmentation: This segmentation allows high curvature of the active contour and thus reaches the entire inner part. The model is augmented with a shape consistency term, forcing the fine shape to be close to the coarse one.
The three latter steps are detailed in the following subsections.
Fig. 2. Block diagram of the system
3.1 Probability Score Table
For each pixel x ∈ Ω, let V_a(x) ⊂ Ω be its neighborhood of size a. All the pixels x in the inner part of the pedicle should contain, in their neighborhood V_a(x), at least one pixel belonging to the outer part; a should thus be quite large. This parameter is set to seven for the first ten vertebrae and to nine for the other ones. Then, the histogram of V_a(x) is separated into two classes: A_1(x), the
background, and A_2(x), the outer part. To do so, Otsu's algorithm [13] is used to find the threshold k* that maximizes the between-class variance:

k* = arg max_k  p_1(k) p_2(k) [μ_1(k) − μ_2(k)]²    (8)
where k is a variable threshold, p_i(k) the weighting of class A_i, and μ_i(k) its mean. According to Bayes' theorem, the probability p(x ∈ A_1(x,a)) is computed as:

p(x ∈ A_1(x,a)) = p_1 P(x | A_1(x,a)) / [ p_1 P(x | A_1(x,a)) + p_2 P(x | A_2(x,a)) ]

By assuming that the distribution of the intensity inside each class is Gaussian,

P(x | A_i(x,a)) = (1 / (2π σ_i²)) · exp( −(μ_i − u(x))² / σ_i² )

we thus obtain:

p(x ∈ A_1(x,a)) = p_1 σ_2² / [ p_1 σ_2² + p_2 σ_1² · exp( −(μ_2 − u(x))²/σ_2² + (μ_1 − u(x))²/σ_1² ) ]    (9)
Coarse Segmentation of Pedicles
At t = 0, the initial level set function is a distance function whose 0-level line is a 5 radius circle centered on an inner pixel pt, chosen manually. Then, the objective is to group together all the pixels having a strong probability p2 of belonging to the inner part. Consequently, the characteristics c1 and c2 of the model are represented by the probabilities p2 and p1 = 1 − p2 . At the initialisation, the characteristics are set as c10 = 0 and c20 = 1. The only a priori knowledge that has been fixed to the level set function is: – A minimum area size Amin of the set H(φ) and a maximal one Amax . – A parameter ν corresponding to the weight of the regularity part in the energy (1). Those three parameters have been estimated experimentally for each kind of vertebra on 5 x-rays. ν is set high in order to get a regular shape and avoid
the active contour to come through the eventual holes of the outer part. The equation that handles the evolution of the level set function is :
∇φt+1 φt+1 = φt + ν δ (φt )∇ + δ (φt ) −(c2t − Ta )2 + (c1t − Ta )2 (10) |∇φt | Since the level set method allows change of topology, at each iteration the connected component C of H(φt ) > 0 that contains the initial point is extracted. φt is then reinitialized with the signed distance function (2) from the border ∂C. Let At be the area |C| at time t. If At is higher than Amax , some pixels included in the set H(φt ) > 0 should not belong to the inner part. In other words, the outer being the border of the inner part of the pedicle, those pixels should belong to the outer part. They have been included in the set H(φt ) > 0 because of their low probability p1 (noisy score table). The characteristic c1t is thus reduced to make the condition of being in the outer part less selective. So, some pixels in H(φt ) > 0 and at the border of this condition at time t, will be excluded of the positive part of φ at time t + 1 reducing the size area.
Fig. 3. (a) After 5 iterations of the process; (b) at the end of the process. A(φ_t > 0) is higher than A_max; the characteristic c_2^t decreases to 0.84
This process is iterated until the system reaches convergence. Due to the updating of the characteristics, situations may arise where c_1^t ≈ c_2^t, or where the number of iterations exceeds a threshold τ. It is obvious that in the first case the model will not converge to an acceptable segmentation. In the second one, experiments show that the 0-level line could not be limited to the inner part of the pedicle. In both situations the process is stopped, since it has encountered a problem of area size:
– If A_t < A_min, then the convergence problem is not due to the quality of the score table as we defined it previously, but to the pedicle shape. When the pedicle is too thin (as is the case for a high axial rotation of the vertebra) and the regularity constraint ν is too strong, the active shape encounters difficulties in growing, especially where its curvature is strong. Thus the algorithm is reinitialized with the same initial conditions, except for the regularity term, which is reduced.
– If A_t > A_max, the pedicle mostly has a hole in its outer part (due to the superposition of the pedicle with another object), and the table is badly defined. If the parameter ν is too small, then the 0-level line fills the hole and enables the shape to evolve outside the inner part. The algorithm is thus reinitialized with a higher parameter ν; this prevents the evolution of the shape at strong-curvature locations, and hence prevents it from proceeding outside the inner part.
The way the characteristics c_1, c_2 and the regularity term ν are updated enables the model to segment coarse shapes of pedicles from all vertebral levels (thoracic and lumbar) without any manual parameter adjustment. This is an important result for the automation of the model.
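The control loop of this subsection can be summarized as a schematic Python sketch; it reuses the chan_vese_step function sketched in Section 2, and the concrete values of A_min, A_max, the retry factors, the omission of the in-loop update of c_1 and the unbounded number of retries are our simplifying assumptions.

import numpy as np
from scipy import ndimage as ndi

def signed_distance(mask):
    """Signed distance function, positive inside `mask` and negative outside."""
    return ndi.distance_transform_edt(mask) - ndi.distance_transform_edt(~mask)

def coarse_segmentation(T, seed, nu, a_min, a_max, n_iter=300):
    """Schematic driver of the coarse pedicle segmentation (Section 3.2).
    `seed` is a (row, col) tuple clicked inside the inner pedicle."""
    yy, xx = np.mgrid[:T.shape[0], :T.shape[1]]
    phi = signed_distance((yy - seed[0]) ** 2 + (xx - seed[1]) ** 2 <= 25)  # radius-5 circle
    for _ in range(n_iter):
        phi, c1, c2 = chan_vese_step(phi, T, nu=nu)           # evolution, Eq. (10)
        labels, _ = ndi.label(phi > 0)
        if labels[seed] == 0:                                  # the seed left the shape
            break
        phi = signed_distance(labels == labels[seed])          # keep the seed component
    area = int((phi > 0).sum())
    if area < a_min:   # pedicle too thin for this regularity: retry with a smaller nu
        return coarse_segmentation(T, seed, nu * 0.5, a_min, a_max, n_iter)
    if area > a_max:   # contour leaked through a hole in the ring: retry with a larger nu
        return coarse_segmentation(T, seed, nu * 2.0, a_min, a_max, n_iter)
    return phi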
3.3 A Fine Segmentation
The Chan-Vese algorithm with a high regularity coefficient ν gives a regular shape approximating the pedicle. In this section, we present a way to get a fine segmentation for a better estimation of the pedicle shape. At first, we build a signed distance function φ_G from the coarse shape segmented in the previous subsection. Then, we use a second algorithm to better estimate the pedicle segmentation on a score table T_b. The neighbourhood size b is set to seven for the first ten vertebrae. For the others, smaller neighborhood sizes are to be used, because a 9-neighborhood produces too coarse a table. Moreover, the important axial rotation of those vertebrae implies thin pedicles. The shapes of these thin pedicles present two regions with strong curvature, which we call summits. Thus, in order to reach the upper and lower summits of these thin pedicles, the algorithm must be processed on a T_5 probability table. The approach that we recommend is to process the second algorithm iteratively on different tables (from T_8 to T_5). Once the process has converged on a table, the resulting level set function φ is used as the new coarse level set φ_G of the next level (Figure 4). This second algorithm is based on the model presented by Rousson and Paragios [11]. According to this model, a new energy is added to the Mumford-Shah functional. This energy, called shape energy, forces the level set function φ to evolve according to a certain shape database [11]. However, in our model φ evolves according to φ_G, which is the distance function to the coarse shape boundary extracted in the previous section. The energy to be minimized is:

E(φ) = ∫_Ω (c_1 − T_b)² H(φ) dx dy + ∫_Ω (c_2 − T_b)² (1 − H(φ)) dx dy + ν ∫_Ω |∇H(φ)| dx dy + μ ∫_Ω (φ − φ_G)² dΩ    (11)

The evolution scheme of φ associated with this energy is given in Figure 5. In addition, the energy parameter μ is not taken constant over the whole definition space of φ as in [11]. It is made dependent on φ¹_t and φ_G, and it follows a sigmoidal shape:

μ_t(φ¹_t, φ_G) = 1 / ( 1 + exp( −λ(φ¹_t − φ_G − P) ) )    (12)
Fig. 4. Shape superposition for neighborhood sizes from 9 to 5
φ_0 = φ_G
φ¹_{t+1} = φ_t + ν δ_ε(φ_t) div( ∇φ_{t+1} / |∇φ_t| ) + δ_ε(φ_t) [ (c_1^t − T_b)² − (c_2^t − T_b)² ]
φ²_{t+1} = φ_t − 2 dt (φ_t − φ_G)
update μ_t from φ¹_{t+1} and φ_G
φ_{t+1} = (1 − μ_t) φ¹_{t+1} + μ_t φ²_{t+1}
update c_1^{t+1} and c_2^{t+1} using A_max and A_min
Fig. 5. Algorithm for fine segmentation
When φ¹_t is close to φ_G, it can evolve to find details, whereas when the distance between φ and φ_G is high at some pixels, its evolution locally slows down until it stops. The coefficients λ and P are set in our experiments in such a way that the inflexion point is at 4 and the evolution is almost stopped for φ − φ_G < 5. From a shape point of view, this allows the active contour to evolve in a narrow band of ten pixels width, with a speed that is proportional to its local distance from φ_G.
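Combining the two update steps of Fig. 5 with the sigmoidal weight of Eq. (12) gives, schematically, the sketch below. It reuses chan_vese_step from Section 2; λ, P, dt and the sign convention of the data term are illustrative assumptions rather than the authors' settings.

import numpy as np

def fine_step(phi, phi_g, T_b, nu=0.2, dt=0.5, lam=2.0, P=4.0):
    """One iteration of the fine segmentation (Fig. 5)."""
    phi1, _, _ = chan_vese_step(phi, T_b, nu=nu, dt=dt)    # data + regularity step
    phi2 = phi - 2.0 * dt * (phi - phi_g)                  # shape-consistency step
    mu = 1.0 / (1.0 + np.exp(-lam * (phi1 - phi_g - P)))   # sigmoidal weight, Eq. (12)
    # where phi1 exceeds phi_G by more than P, mu tends to 1 and the shape
    # term dominates, locally stopping the evolution; elsewhere phi1 is kept.
    return (1.0 - mu) * phi1 + mu * phi2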
3.4 Experimental Results
To quantify the results of our method, we have tested our algorithm on five X-ray images of patients with adolescent idiopathic scoliosis. We focused on the twelve thoracic levels and the first four lumbar vertebrae. On those X-ray images, nine pedicles could not be segmented due to the presence of calibration balls, and sixteen were not visible due to strong axial rotations. We do not consider those pedicles in our study. Hence, of the remaining one hundred thirty-five, we succeeded in correctly segmenting one hundred and one (i.e., a 74% accuracy rate). Considering
the quality of the images, the difficulty of the task and the diversity of the spinal and vertebral deformations, this segmentation rate is rather promising. For the remaining undetected pedicles, the algorithm failed for two main reasons. First, some pedicles do not respect the thresholds A_min and A_max, especially for the T2 to T6 vertebrae, where variability is high. But those pedicles, for which we may design more specific methods, could be segmented separately as outliers. Secondly, some others may be intersected by the upper endplate. In this case, the algorithm may encounter difficulties in segmenting the complete pedicles. Hence, our segmentation accuracy rate is promising for the estimation of the vertebral rotation.
3.5 Frontal Rotation Estimation
The orientation of the vertebral body (frontal rotation) could be derived from the orientation of the endplates on healthy subjects. However, for patients with scoliosis, vertebrae are often deformed and the orientation of the endplates does not always coincide with the orientation of the vertebra. Using the pedicles may provide a more reliable evaluation of this rotation, and this has been widely accepted by the spine community [14]. The frontal rotation of a vertebra, as described in [12], is estimated on the PA view by computing the angle between the abscissa axis of this view and the line going through the centers of the pedicles (Figure 6(a)). Since the fine shape is noisy compared with the coarse shape, the latter seems more robust for estimating those centers. However, extracting a sufficient number of points from a fine contour also implies a good center estimation. Furthermore, since the fine shape reaches the entire pedicle, the estimated center might be better. The rotations must be estimated by the algorithm without the centers being dependent on the initialization point. To this end, an experiment comprised a set of ten pedicles from different levels; the algorithm was initialized with various points for each pedicle. The resulting mean variance of the center location is about 0.0616 pixel (corresponding to 0.02 mm). As explained below, this result is considerably insensitive to the starting point, which makes the model robust and user independent. First, let us consider the simple case of two different initializations on the same pedicle. If the three mentioned parameters do not change, the algorithm will be driven by the probability score table. This produces the same segmentation and thus the same center. Secondly, in the more general case, as the inside of the pedicle is quite small, the initialization points must be close. After a few iterations both level set functions will be very close. Consequently, the areas of the positive parts of both level set functions will change similarly, as will the parameters. This results in the end in similar segmentations; both centers will match. The frontal rotation estimation is currently done manually by extracting the upper and lower summits of the pedicles, the centers being just the middle points. In Figure 6(b) we show the frontal rotation obtained manually (on the right) and our semi-automatic evaluation of the angle (on the left). When the segmentation could not be found, the center was extracted manually from the X-ray.
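The rotation estimate itself reduces to the angle of the line joining the two pedicle centers on the PA view; a small illustrative helper follows (the centroid-of-mask definition of the center and the function name are our assumptions).

import numpy as np

def frontal_rotation(left_pedicle, right_pedicle):
    """Angle (in degrees) between the abscissa axis of the PA view and the
    line joining the centers of the two segmented pedicle masks."""
    cy_l, cx_l = np.argwhere(left_pedicle).mean(axis=0)
    cy_r, cx_r = np.argwhere(right_pedicle).mean(axis=0)
    return float(np.degrees(np.arctan2(cy_r - cy_l, cx_r - cx_l)))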
Our results may look noisy. Nevertheless, we preserve the intra-vertebra smoothness. Moreover, we notice that our estimated rotations are as accurate as those extracted manually on the same PA view (e.g., vertebrae 4 and 13). By comparing all the rotations, we obtain a mean difference of 2.3 degrees between the manual and the semi-automatic estimation. This error is considered more than acceptable, since the manual error is about five degrees. This result is really promising for the evaluation of the scoliotic deformation.
Fig. 6. (a) Rx is an estimation of the frontal rotation; vectors X and Y define the PA view. (b) Representation of the planar rotation in degrees (abscissa) of the different vertebrae (ordinate) of a patient. On the left, manually extracted rotations; on the right, rotations extracted by our semi-automated algorithm.
4 Conclusion
In this paper we have introduced a technique to segment pedicles on the PA view. We first perform active contours on score tables that assign to each pixel a probability of being in the outer part. Setting a high regularity parameter, we get a coarse shape of the pedicle. We then use a second algorithm that looks for details around it to reach the summit points. This paper is, to our knowledge, the first study to segment the intrinsic shape of pedicles from images of pathological spines. Moreover, segmentation on digital radiography is often limited to the lumbar segment of the spine and does not treat such fine details as the vertebral pedicles. Our study demonstrated good-to-fair results on this task, hence paving the road for other studies to automate the segmentation process by injecting some prior knowledge into it. Other research on the vertebrae [1] describes techniques that either provide information on the global spine shape or provide local information on the location of the vertebral bodies. For instance, the vertebral shape location could be injected into our model to improve the segmentation of this kind of pedicle. Moreover, state-of-the-art research on vertebra and spine segmentation allows us to think that eventually the initialization can be automated. The conclusions of this feasibility study offer good perspectives for automatically segmenting pedicles on a pair of stereo-radiographic views to automate the 3D reconstruction of the vertebral spine. Furthermore, by segmenting pedicles, high-level
primitives can be extracted to generate a more accurate and more representative 3D model than one reconstructed manually.
References
1. Antani, S., Long, L.R., Thoma, G.R., Lee, D.J.: Anatomical Shape Representation in Spine X-ray Images. In: Proceedings of the IASTED International Conference on Visualization, Imaging and Image Processing, Benalmadena, Spain, September 8-10, 2003, pp. 510–515 (2003)
2. Xiaoqian, X., Lee, D.J., Antani, S., Long, L.R.: Pre-indexing for fast partial shape matching of vertebrae images. In: Proceedings of the 19th IEEE International Symposium on Computer-Based Medical Systems, p. 6. IEEE Computer Society Press, Los Alamitos (2006)
3. Antani, S., Long, L.R., Thoma, G.R.: Vertebra Shape Classification using MLP for Content-Based Image Retrieval. In: Proceedings of the International Joint Conference on Neural Networks, vol. 1, pp. 160–165 (2003)
4. Koompairojn, S., Hua, K.A., Bhadrakom, C.: Automatic classification system for lumbar spine X-ray images. In: Proceedings of the 19th IEEE International Symposium on Computer-Based Medical Systems, p. 6 (2006)
5. Duong, L., Cheriet, F., Labelle, H.: Towards an Automatic Classification of Spinal Curves from X-Ray Images. Stud. Health Technol. Inform. 123, 419–424 (2006)
6. Caselles, V., Catté, F., Coll, B., Dibos, F.: A geometric model for active contours in image processing. Numerische Mathematik 66(1), 1–31 (1993)
7. Paragios, N., Chen, Y., Faugeras, O.: Handbook of Mathematical Models in Computer Vision. Springer, New York (2006)
8. Cremers, D., Tischhäuser, F., Weickert, J., Schnörr, C.: Diffusion Snakes: Introducing statistical shape knowledge into the Mumford-Shah functional. International Journal on Computer Vision 50(3), 295–313 (2002)
9. Hong, B.W., Prados, E., Soatto, S., Vese, L.: Shape Representation based on Integral Kernels: Application to Image Matching and Segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2006, vol. 1, pp. 833–840 (2006)
10. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Transactions on Image Processing 10(2), 266–277 (2001)
11. Rousson, M., Paragios, N.: Shape Prior for Level Set Representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 77–92. Springer, Heidelberg (2002)
12. O'Brien, M.F., Kuklo, T.R., Blanke, K.M., Lenke, L.G. (eds.): Radiographic Measurement Manual. Spinal Deformity Study Group (SDSG), Medtronic Sofamor Danek, Memphis, Tennessee, 110 pages (2004)
13. Otsu, N.: A Threshold Selection Method from Grey-Level Histograms. IEEE Trans. Syst., Man, Cybern. 9, 62–66 (1978)
14. Stokes, I.A.F. (chair), Scoliosis Research Society Working Group on 3-D Terminology of Spinal Deformity: Three-dimensional terminology of spinal deformity. Spine 19, 236–248 (1994)
Adaptive Mesh Generation of MRI Images for 3D Reconstruction of Human Trunk O. Courchesne1, F. Guibault2, J. Dompierre3, and F. Cheriet1,2 1
Institute of Biomedical Engineering, École Polytechnique de Montréal, Montreal, Quebec, H3T 1J4, Canada 2 Department of Computer Engineering, École Polytechnique de Montréal, Montreal, Quebec, H3T 1J4, Canada 3 Department of Mathematics and Computer Science, Laurentian University, 935 Ramsey Lake Road, Sudbury, Ontario, P3E 2C6, Canada
Abstract. This paper presents an adaptive mesh generation method from a series of transversal MR images. The adaptation process is based on the construction of a metric from the gray levels of an image. The metric is constrained by four parameters, which are the minimum and maximum Euclidean length of an edge, the maximum stretching of the metric and the target edge length in the metric. The initial mesh is a regular triangulation of an MR image. This initial mesh is adapted according to the metric by choosing appropriate values for the previous set of parameters. The proposed approach provides an anisotropic mesh for which the elements are clustered near the boundaries. The experimental results show that the elements' edges of the obtained mesh are aligned with the boundaries of anatomical structures identified on the MR images. Furthermore, this mesh has approximately 80% fewer vertices than the mesh before adaptation, with vertices mainly located in the regions of interest.
1 Introduction
The technological innovations of the past decades have had a significant impact on the medical field. Currently, surgeons express their need for planning and simulating their surgical maneuvers on numerical models. In this context, surgery simulators are being developed in different fields such as cardiology, neurology and orthopedics. They exploit physical and physics-based models that simulate an organ's deformation during surgery through laws governing the dynamical behavior of the organ. Such numerical simulations are used for instance in plastic and reconstructive facial surgery. Our long term objective is to develop and validate a deformable model of the human trunk from non invasive data in order to numerically simulate the effect of a spinal and/or rib surgical correction on the external shape of the trunk in the treatment of adolescent idiopathic scoliosis. The 3D reconstruction of a tetrahedral volume of
the trunk from MR images is a prerequisite to the development of the deformable model. To build a regular volumetric mesh of the trunk, MR images should be segmented to extract the contours of the different anatomical structures. These extracted contours on transversal slices can be discretized to a desired number of nodes. A nearest neighbor algorithm can be used to define correspondences on a contour-by-contour basis [1]. This will generate a grid of connected points on each slice. By connecting the points from successive slices, a tetrahedral mesh will be obtained. Several techniques have been proposed for the segmentation of MR images, such as the classic region growing technique [2], level sets [3] and graph cuts [4]. The idea of using adaptive meshing for image segmentation has been mainly developed during the last few years. Some authors have proposed techniques for large scale data visualization that can also be used for image compression [5] [6]. Others have used an adaptive triangular mesh to segment medical images with a variation of the classic level set technique [3]. A more interesting technique was used to perform tomographic reconstruction [7]. This last technique was used to generate an adaptive mesh based on the gray level of an image [8]. An anisotropic mesh, respecting the Delaunay criterion, is obtained. The density of the mesh is higher where the gradient of the image intensity is higher. This last technique is quite interesting but it provides no guarantee about the quality of the resulting meshes. In fact, the elements of the mesh are not necessarily oriented with the direction of intensity gradients. This means that the sides of the elements of the mesh may not follow the borders of the different classes of objects in the image. Furthermore, the quality of a mesh is defined by the PSNR (peak signal to noise ratio) between the original image and the one obtained from the mesh interpolation. This criterion is not sufficient to ensure that the meshes will be valid for the numerical solver, which is a very important aspect in the physical modeling step. To obtain an adequate mesh, we chose to use software called OORT developed by our team. The Object Oriented Remeshing Toolkit (OORT) [9] was initially developed as a tool to perform a posteriori mesh adaptation, which has become an accepted approach for controlling the numerical precision of simulation-based analysis for a range of application areas, including Computational Fluid Dynamics (CFD). With the goal of constructing meshes targeted towards the simulation of physical deformations of the trunk, we propose to use sequences of 2D meshes as a starting point for the reconstruction of the discrete 3D trunk model. The mesh adaptation tool has thus been recalibrated to take a discrete intensity field extracted from medical images as an input and produce 2D mesh slices that will later on be stacked to form a complete 3D model. The objective of this study is to define the appropriate parameters that can be used by OORT to generate valid 2D meshes of MR images of scoliotic patients in order to develop a deformable model of the trunk.
2 The Proposed Algorithm
This section presents an algorithm to produce valid adapted meshes based on an MR image. The image processing algorithm comprises the following steps:
1. Construct a regular mesh based on the MR image
2. Construct a Hessian matrix based on the gray levels of the image
3. Use the Hessian matrix to build a metric
4. Adapt the initial mesh to make it fit the metric
First, a regular mesh based on the MR image intensity is constructed. This mesh is a simple grid with one vertex for each pixel of the image and where the solution field is the gray level of each pixel. Figure 1 illustrates the construction of this first mesh.
Fig. 1. From original image to regular mesh
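As a rough illustration of step 1, a regular triangulation with one vertex per pixel and the grey level as the nodal field could be assembled as follows. This is only a sketch under our own conventions (vertex and triangle arrays), not the data structure actually used by OORT.

import numpy as np

def regular_mesh(image):
    # One vertex per pixel, two triangles per pixel quad,
    # grey level stored as the nodal solution field.
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    vertices = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    field = image.ravel().astype(float)
    triangles = []
    for y in range(h - 1):
        for x in range(w - 1):
            v0 = y * w + x
            v1, v2, v3 = v0 + 1, v0 + w, v0 + w + 1
            triangles.append((v0, v1, v3))  # split each quad into
            triangles.append((v0, v3, v2))  # two triangles
    return vertices, np.array(triangles), field

# A 256x256 slice yields 65 536 vertices, as reported in Section 3.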
The pixel intensity is then used as a scalar field from which a Hessian matrix is reconstructed at each pixel. The software OORT will use this matrix to build a metric according to constraints specified by the user. OORT will then adapt the original mesh according to the metric and to a series of adaptation constraints also specified by the user. At the end of the adaptation process, OORT will have produced a mesh that is valid. OORT defines four levels of conformity to ensure the validity of the resulting mesh:
1. Topological conformity, which means that the final mesh respects the topology of the original mesh.
2. Geometric conformity, which means that all the new elements generated from the modified elements will have a positive area or volume.
3. User conformity, which means that the mesh respects constraints given by the user, such as minimum edge length.
4. Metric conformity, which means that the mesh fits the metric used as closely as possible, according to a criterion of convergence specified by the user.
2.1 Metric Specification
Mesh adaptation with OORT is driven by a metric which is used as a control function throughout the adaptation process. In the present work, the metrics are reconstructed by post-processing a local reconstruction of second order derivatives of the intensity of the image at every pixel. Two different Hessian reconstruction methods have been
evaluated (expressed as Hessian matrix) to derive a metric from the discrete intensity function Fh : the simple linear fitting (SLF) and the quadratic fitting (QF) [10]. The Hessian matrix can be represented by the following:
M = | ∂²F_h/∂x²    ∂²F_h/∂x∂y |
    | ∂²F_h/∂y∂x   ∂²F_h/∂y²  |        (1)
To reconstruct the Hessian of the intensity function at a given node, both methods select a number of sampling points on each element of the mesh that surround the node. A polynomial regression is then locally performed on this sample set, and the regression is then differentiated to yield the values of the second order derivatives at the node. In the case of the SLF method, a linear regression in 2D is used to first reconstruct each spatial component of the function gradient. The gradient is computed at the sampling points and a linear fitting operator is independently applied to the two derivatives. This means that two linear functions
F_h^x(x, y) = a_x1 + a_x2·x + a_x3·y   and   F_h^y(x, y) = a_y1 + a_y2·x + a_y3·y        (2)
are computed to respectively smooth the derivatives
∂F_h^x/∂x = a_x2,   ∂F_h^x/∂y = a_x3,   ∂F_h^y/∂x = a_y2,   ∂F_h^y/∂y = a_y3        (3)
on a given patch of neighboring elements. The local reconstruction is then differentiated along x and y, and these derivatives are assigned as nodal values of smoothed derivatives. By averaging the mixed derivatives, a continuous recovery of the Hessian is obtained. In the case of the QF method, one or two levels of neighbors are used, with the sampling points chosen at the centroid of the elements. A least-squares quadratic reconstruction of the intensity function is then performed, yielding a twice differentiable local approximation of the function. This approximation is then differentiated to yield constant derivatives on the patch. This means that the quadratic function
F_h(x, y) = Σ_{l=1}^{6} a_l·ζ_l(x, y)        (4)
where
ζ_l ∈ {1, x, y, x², xy, y²}        (5)
is computed to respectively smooth the derivatives
∂²/∂x² Σ_{l=1}^{6} a_l·ζ_l(x, y) = 2·a_4,   ∂²/∂x∂y Σ_{l=1}^{6} a_l·ζ_l(x, y) = a_5,   ∂²/∂y² Σ_{l=1}^{6} a_l·ζ_l(x, y) = 2·a_6        (6)
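The snippet below recovers a per-pixel Hessian of the intensity with plain finite differences; it is only a stand-in for the SLF and QF recovery operators described above (which fit local regressions over element patches) and illustrates the quantity being estimated.

import numpy as np

def hessian_field(intensity):
    # Per-pixel Hessian of the image intensity, using repeated finite
    # differences (a crude stand-in for the SLF/QF recovery operators).
    Fy, Fx = np.gradient(intensity.astype(float))   # d/dy (rows), d/dx (columns)
    Fxy, Fxx = np.gradient(Fx)
    Fyy, Fyx = np.gradient(Fy)
    H = np.empty(intensity.shape + (2, 2))
    H[..., 0, 0] = Fxx
    H[..., 1, 1] = Fyy
    H[..., 0, 1] = H[..., 1, 0] = 0.5 * (Fxy + Fyx)  # average the mixed derivatives
    return H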
OORT will then construct a metric based on the Hessian and on the preferences of the user. The minimum and maximum edge lengths, the maximum stretching of the metric and, most importantly, the target length of an edge in the metric must be well defined by the user. The first two and the last parameters are expressed in the scale of the mesh. The remaining one, the stretching, represents the anisotropic factor of the metric.
2.2 Adaptive Mesh Generation Technique
The adaptive mesh can be obtained by moving vertices, swapping edges, refining edges or decimating edges. The ultimate goal is to produce a mesh that satisfies the metric. Figure 2 illustrates the global adaptation loop implemented in OORT. The goal of moving vertices is to regularize the mesh. Each vertex will be moved to the centroid of its neighbors, evaluated using the metric. Movements are allowed as long as the topology of the mesh is not modified. For instance, if a vertex belongs to an edge it will be moved toward the centroid but will remain on this edge. The user can control the moving speed by specifying a relaxation factor. For more robustness, this value should be slightly under one. The moving is done iteratively until a stopping criterion or a maximum number of iterations is reached. At each global iteration, statistical analysis is performed on the mesh. To determine which elements will be swapped, each element is analyzed to compute its level of deformation. Instead of sorting the elements by their shape deformation, a statistical process on the element shape measure evaluated using the metric is performed and returns the average and the standard deviation of the deformation. Assuming a Gaussian distribution, only the elements above a threshold are considered. This threshold corresponds to a number of times the standard deviation; a good value should be between 2 and 2.5, so that only the most stretched elements undergo diagonal swapping. A similar statistical process is also performed to evaluate the length of the edges according to the metric. The same threshold used for diagonal swapping applies to mesh refinement and decimation, to determine how many elements will be considered. Each of these elements enters the inner loop of refinement and decimation. The mesh refinement loop has its own threshold that determines when an edge is too long. Here, too long means that an edge evaluated in the metric space has a length higher than one. In fact, at convergence, the optimal adapted mesh should have all edges of metric length one. In practice, OORT will not reach such a perfectly adapted mesh but lets the user specify an acceptable threshold. The default value is 1.5, and the closer it gets to one the more time the computation takes; for that reason, a maximum number of iterations is also specified. Strict values for this threshold are about 1.2 to 1.3. The edges that are too long are cut in two. The same considerations apply to the mesh decimation loop. In this case, the default value for the threshold is 0.5 and strict values are about 0.7 to 0.8. Edges that are too short are removed.
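Tying the two subsections together, a minimal sketch of how a metric tensor could be derived from a recovered Hessian under the four constraints above (minimum and maximum edge length, maximum stretching, target edge length) is given below. The parameter handling is our own guess at a typical construction, not OORT's actual implementation.

import numpy as np

def metric_from_hessian(H, min_len=1.0, max_len=10.0, max_stretch=1000.0, target_len=1.0):
    # Eigenvalues of the metric behave like 1 / h_i**2, where h_i is the
    # desired edge length along the corresponding eigenvector.
    eigval, eigvec = np.linalg.eigh(H)
    lam = np.abs(eigval) / target_len                        # scale by the target edge length
    lam = np.clip(lam, 1.0 / max_len**2, 1.0 / min_len**2)   # min/max Euclidean edge length
    lam[lam.argmin()] = max(lam.min(), lam.max() / max_stretch**2)  # bound the anisotropy
    return eigvec @ np.diag(lam) @ eigvec.T

# Example on a single pixel Hessian.
H = np.array([[0.9, 0.1], [0.1, 0.02]])
print(metric_from_hessian(H))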
[Fig. 2 flowchart: the global adaptation cycle tests a condition before each of four operations in turn: moving vertices, swapping edges, refining edges, decimating edges.]
Fig. 2. The adaptation process of OORT
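The refinement and decimation steps above reduce to measuring edge lengths in the metric and comparing them with the two thresholds. A small, self-contained sketch of that test, with the default thresholds quoted in the text, could look like this; it is not OORT code.

import numpy as np

def metric_edge_length(p, q, M):
    # Length of the edge pq measured in metric M (2x2, assumed constant on the edge).
    e = np.asarray(q, float) - np.asarray(p, float)
    return float(np.sqrt(e @ M @ e))

def classify_edge(p, q, M, too_long=1.5, too_short=0.5):
    # 'refine' if the metric length exceeds the refinement threshold,
    # 'decimate' if it falls below the decimation threshold, 'keep' otherwise.
    L = metric_edge_length(p, q, M)
    if L > too_long:
        return 'refine'
    if L < too_short:
        return 'decimate'
    return 'keep'

# Example with an anisotropic metric that requests short edges along x.
M = np.array([[1.0, 0.0], [0.0, 0.04]])
print(classify_edge((0, 0), (2, 0), M))  # metric length 2.0 -> 'refine'
print(classify_edge((0, 0), (0, 2), M))  # metric length 0.4 -> 'decimate'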
3 Experimental Results and Discussion
The first step is to determine which Hessian reconstruction method yields the best metric for the present application. Then we determine which constraints to use to produce a good mesh based on MR images. The tests have been made with an axial MR image of the thoracic area. The resolution of this image is 256 by 256, so the original mesh produced has 65 536 vertices. Other tests performed on thoracic MR images have shown similar results.
3.1 Choice of Hessian Reconstruction Method
As can be seen in Figure 3, when OORT is used with default parameter settings for the minimum and maximum Euclidean edge length, the target metric length and the maximum stretching, the adapted meshes obtained using the SLF and QF reconstruction methods are very similar. In fact, they have roughly the same number of vertices and the elements are mainly in the trunk area. We can also see in those images the effect of noise and MRI artifacts. Even if there is nothing really interesting around the trunk, a lot of vertices are present and form a thick vertical line. With the QF metric, the noise can be reduced by taking a larger neighborhood (Figure 4). With SLF, increasing the target length of an edge in the metric can produce a similar effect (Figure 5).
Fig. 3. Comparison between the SLF and QF metrics with default values. Left to right: original image, SLF (46 622 vertices), QF (46 818 vertices).
Fig. 4. Variation of the neighborhood size with QF. Neighborhood sizes 1, 2 and 3 yield 46 818, 24 862 and 12 066 vertices, respectively.
Fig. 5. Variation of the target edge length in the metric with SLF. Target lengths 1, 2 and 3 yield 46 622, 23 105 and 13 889 vertices, respectively.
By superimposing boundaries extracted using Desolneux's method [11], based on level line selection, we can see in Figure 6 that both metrics produce a mesh with more vertices near edges. We can also see that elements follow the edge of the trunk. The QF metric produces thicker edges for roughly the same number of vertices in the mesh and takes more time to compute than the SLF. For these reasons, the SLF metric has been selected for the remainder of the experiments presented here.
3.2 Defining the Constraint Factors of the Metric
There are four main constraints to test: the minimum and maximum Euclidean edge length, the maximum stretching of the metric and the target length of an edge in the metric. The first test aims to determine the influence of the minimum Euclidean edge length in the metric. Since the mesh is based on an MR image, there is no feature to resolve smaller than a pixel; the smallest value tested is thus 1. For this test, the maximum Euclidean edge length has been fixed to 10 and the maximum stretching to 1000. Table 1 shows that the number of vertices decreases when the minimum edge length increases, which was predictable. Also, the larger the minimum length, the more regular the mesh becomes and the harder it is to identify the boundaries of the trunk (Figure 7).
Fig. 6. Comparison between the SLF and QF metrics (noise reduction). Left to right: original image with level lines, SLF (13 889 vertices), QF (12 066 vertices).

Table 1. Test for the minimum Euclidean edge length (EL)
Minimum EL    Target EL = 1    Target EL = 2    Target EL = 3
1.00          46 622           23 105           13 889
1.25          34 683           19 392           12 235
1.50          26 425           16 458           10 704
2.00          15 890           12 238            8 574
5.00           3 886            3 262            3 144
The second test aims to determine an appropriate value for the maximum Euclidean edge length. For this test, the minimum Euclidean edge length has been fixed to one and the maximum stretching to 1000. Table 2 shows that the number of vertices decreases when the maximum edge length increases, which was predictable. The impact of increasing the maximum edge length is not linear: increasing it beyond ten is quite ineffective in reducing the number of vertices.
Fig. 7. Effect of the minimum Euclidean edge length (left: 13 889 vertices; right: 3 144 vertices).

Table 2. Test for the maximum Euclidean edge length (EL)
Maximum EL    Target EL = 1    Target EL = 2    Target EL = 3
1             71 898           71 898           71 898
2             51 790           30 331           26 137
4             47 989           24 183           15 089
5             47 474           23 818           14 585
10            46 622           23 105           13 889
15            46 698           22 961           13 741
20            46 335           22 944           13 691
25            46 596           23 038           13 672
The next parameter tested is the maximum stretching of the metric. For this test, the minimum edge length has been fixed to one and the maximum to ten. Table 3 shows that the stretching factor should not be larger than 10, because past this value the number of vertices stops increasing and the elements follow the trunk's edges. Figure 8 shows that when the stretching factor is small, the mesh elements are roughly equilateral and are not aligned as well with the trunk boundaries as for higher values. The last test aims to determine a suitable value for the target edge length in the metric. For this test, the minimum edge length has been fixed to one, the maximum to ten and the maximum stretching to 1000. Table 4 shows that when the target edge length increases, the number of vertices decreases. It has been shown previously that the target length must be high enough to reduce noise artifacts. On the other hand, if the value is too high, the elements are not aligned with the trunk boundaries, as shown in Figure 9. We think that a target length of 3 is a good choice because the number of vertices is greatly reduced (79% fewer vertices) and the quality of the mesh is still good.
Fig. 8. Effect of the maximum stretching factor (factor 1: 8 349 vertices; factor 10: 13 889 vertices).

Table 3. Test for the maximum stretching of the metric (EL = edge length)
Maximum stretching    Target EL = 1    Target EL = 2    Target EL = 3
1                     35 768           15 466            8 349
5                     44 430           22 898           13 743
10                    46 622           23 105           13 889
100                   46 622           23 105           13 889
1000                  46 622           23 105           13 889
100000                46 622           23 105           13 889
Table 4. Test for the target edge length (EL) in the metric
Target EL    Number of vertices        Target EL    Number of vertices
1            46 622                    4             9 157
1.5          31 895                    5             6 269
2            23 105                    7             3 506
2.5          17 405                    9             2 412
3            13 889                    11            2 035
3.5          11 108                    13            1 683
Fig. 9. Effect of the target edge length in the metric (target 1: 46 622 vertices; target 3: 13 889 vertices; target 9: 2 412 vertices).
4 Conclusion
The proposed method allows an adaptive mesh generation from MR images of the human trunk by using the gray levels of the image to define an appropriate metric to control the adaptation process. The input parameters to the Object Oriented Remeshing Toolkit were determined through several experiments to provide a valid mesh with respect to a set of constraints specified by the user. The result obtained is a 2D mesh of the trunk section with the elements' edges aligned with the boundaries of anatomical structures. A very significant reduction in the number of vertices was achieved compared with the original mesh including all pixels, while greatly improving the alignment of element edges with the important features of the images. On the basis of this work, two segmentation algorithms have been successfully implemented on meshes, which allow the 3D reconstruction of a tetrahedral mesh of the whole trunk. This reconstruction is obtained by performing a Delaunay tetrahedralisation constrained by the trunk's external boundary [12]. Further work is under way to perform an extensive validation of the detected anatomical boundaries by an expert. Afterwards, a feature-based registration algorithm using the detected boundaries will be proposed to compensate for the movement of the patient during the acquisition of the MRI sequence. Finally, successive registered sections will be connected to provide a precise and valid 3D tetrahedral mesh of the whole trunk.
Acknowledgment. This work has been supported by the SickKids Foundation and the Natural Sciences and Engineering Research Council of Canada (NSERC).
References
1. Papademetris, X., Sinusas, A.J., Dione, D.P., Constable, R.T., Duncan, J.S.: Estimation of 3-D left ventricular deformation from medical images using biomechanical models. IEEE Transactions on Medical Imaging 21, 786–800 (2002)
2. Lu, Y., Jiang, T., Zang, Y.: Region growing method for the analysis of functional MRI data. NeuroImage 20, 455–465 (2003)
3. Xu, M., Thompson, P.M., Toga, A.W.: An Adaptive Level Set Segmentation on a Triangulated Mesh. IEEE Transactions on Medical Imaging 23, 191–201 (2004)
4. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images, vol. 1, pp. 105–112. IEEE Computer Society, Vancouver, BC, Canada (2001)
5. Wiley, D.F., Hamann, B., Bertram, M.: On a construction of a hierarchy of best linear spline approximations using a finite element approach. IEEE Transactions on Visualization and Computer Graphics 10, 548–563 (2004)
6. Kreylos, O., Hamann, B.: On simulated annealing and the construction of linear spline approximations for scattered data. IEEE Transactions on Visualization and Computer Graphics 7, 17–31 (2001)
7. Brankov, J.G., Yongyi, Y., Wernick, M.N.: Tomographic image reconstruction based on a content-adaptive mesh model. IEEE Transactions on Medical Imaging 23, 202–212 (2004)
8. Yongyi, Y., Wernick, M.N., Brankov, J.G.: A fast approach for accurate content-adaptive mesh generation. IEEE Transactions on Image Processing 12, 866–881 (2003)
9. Dompierre, J., Labbé, P.: OORT: Object-Oriented Remeshing Toolkit. MAGNU: Laboratoire de Maillage de Géométrie Numérique, Montréal (2006)
10. Manole, C.-M., Vallet, M.-G., Dompierre, J., Guibault, F.: Benchmarking second order derivatives recovery of a piecewise linear scalar field. In: Proceedings of the 17th IMACS World Congress Scientific Computation, Applied Mathematics and Simulation (2005)
11. Desolneux, A., Moisan, L., Morel, J.-M.: Edge Detection by Helmholtz Principle. Journal of Mathematical Imaging and Vision 14, 271–284 (2001)
12. Courchesne, O.: Génération adaptative de maillage tridimensionnel à partir d'images IRM du tronc humain. Master's thesis, Institut de génie biomédical, École Polytechnique de Montréal, Université de Montréal, Montréal, 130 pages (2007)
Efficient and Effective Ultrasound Image Analysis Scheme for Thyroid Nodule Detection Eystratios G. Keramidas, Dimitris K. Iakovidis, Dimitris Maroulis, and Stavros Karkanis Dept. of Informatics and Telecommunications, University of Athens, Panepistimioupolis, 15784, Athens, Greece
[email protected] Abstract. Ultrasound imaging of thyroid gland provides the ability to acquire valuable information for medical diagnosis. This study presents a novel scheme for the analysis of longitudinal ultrasound images aiming at efficient and effective computer-aided detection of thyroid nodules. The proposed scheme involves two phases: a) application of a novel algorithm for the detection of the boundaries of the thyroid gland and b) detection of thyroid nodules via classification of Local Binary Pattern feature vectors extracted only from the area between the thyroid boundaries. Extensive experiments were performed on a set of B-mode thyroid ultrasound images. The results show that the proposed scheme is a faster and more accurate alternative for thyroid ultrasound image analysis than the conventional, exhaustive feature extraction and classification scheme. Keywords: Ultrasound, Thyroid Nodules, Thyroid Boundary Detection, Local Binary Patterns.
1 Introduction
The thyroid is a small gland located near the bottom of the neck. It produces hormones that affect heart rate, cholesterol level, body weight, energy level, mental state and a host of other conditions. Epidemiologic studies have shown that palpable thyroid nodules occur in approximately seven percent of the population, but nodules found incidentally on ultrasonography suggest a prevalence of up to 67 percent [1]. Ultrasound imaging (US) can be used to detect thyroid nodules that are clinically occult due to their size or their shape. However, the interpretation of US images, as performed by the experts, is still subjective. An image analysis scheme for computer-aided detection of thyroid nodules would contribute to the objectification of the US interpretation and the reduction of the misdiagnosis rates. Early approaches to US image analysis utilized local grey-level histogram information for the characterization of the histological state of the thyroid tissue [2]. Subsequent approaches were based on spatial, first and second order statistical features [3][4] as well as on frequency domain features for computer-aided diagnosis of lymphocytic thyroiditis [5]. Recent approaches include quantification of thyroid
tissue characteristics using co-occurrence matrix features [6], and active contour methodologies for fine delineation of thyroid nodules [7]. However, many of them require manual selection of Regions Of Interest (ROI) within the US image, and they usually involve intensive computations, such as exhaustive feature extraction, or iterative integral operations over the whole image. In this paper we propose a novel scheme for automatic nodule detection based on features extracted from longitudinal US images of the thyroid gland. It involves two phases: a) application of a novel algorithm for the detection of the boundaries of the thyroid gland, and b) detection of thyroid nodules via classification of textural feature vectors extracted only from the ROI defined by the thyroid boundaries. The textural characteristics of the thyroid tissue are encoded by histograms of Local Binary Patterns (LBP) [8]. The advantages of this novel scheme include increased accuracy in nodule detection, compared with the conventional ultrasound image analysis methods, and time efficiency. The rest of this paper is organized in three sections. Section 2 describes the proposed scheme for the detection of nodules in the thyroid gland. The results from the experimental evaluation of the proposed scheme on thyroid US images are presented in Section 3. Finally, the conclusions as well as future perspectives are summarized in Section 4.
2 Thyroid Ultrasound Image Analysis Scheme
2.1 Thyroid Boundaries Detection Algorithm
The lobes of the thyroid gland are surrounded by a thin fibrous capsule of connective tissue [9]. That capsule bounds the thyroid gland and can be identified in longitudinal US images as thin hyperechoic lines. The first phase of the proposed thyroid US image analysis scheme aims at the detection of those hyperechoic lines. Detection is performed by a novel Thyroid Boundaries Detection (TBD) algorithm. This algorithm involves three stages: a) pre-processing of the US image, b) analysis of the pre-processed image, and c) identification of the thyroid boundaries. In the pre-processing stage, a US input image of N×M pixels and G grey levels is normalized and uniformly quantized into z discrete grey levels. Let p_i be the original value of a pixel and q_i the value of that pixel after quantization. Then q_i can be computed as follows:
q_i = (G·g)/(z − 1),   g ∈ ℕ,   (g·G)/z ≤ p_i ≤ ((g + 1)·G)/z        (1)
Grey level quantization results in a rough segmentation of the US image and accentuates the hyperechoic bounds of the thyroid gland (Fig. 1b). In the second stage of the algorithm, the quantized image is vertically sampled from top to bottom with horizontal stripes (Fig. 1c). Each stripe is h pixels high and M pixels wide, spanning the entire width of the image. A step of s ∈ (0, h] pixels between two successive stripes is considered, leading to a total of K stripe samples per image.
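A small sketch of the quantization of Eq. (1), assuming an 8-bit image (G = 256) and z = 3 levels as used later in the experiments; the function and parameter names are ours.

import numpy as np

def quantize(image, G=256, z=3):
    # Uniform quantization of Eq. (1): a pixel falling in bin g is
    # mapped to G*g/(z-1).  Assumes z >= 2.
    g = np.clip((image.astype(float) * z) // G, 0, z - 1)
    return G * g / (z - 1)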
Fig. 1. a) Input thyroid US image digitized at 256 grey levels. (b) Pre-processed US image, quantized at 3 grey levels. (c) Horizontal stripes sampled from the pre-processed image for analysis. (d) The detected thyroid boundaries superimposed to the original input image.
For each stripe, a weighted sum I_n is computed, where n denotes the stripe index, incrementing from top (n = 1) to bottom (n = K) (Fig. 1c). The weighted sum of the number of pixels with grey level j, for each stripe n, is calculated by the following equation:
I_n = Σ_{j=0}^{G} ( w(j) · Σ_{P_{j,n}} 1 )        (2)
where P_{j,n} represents the set of pixels having grey level j that reside within stripe n. Here w(j) denotes a quadratic weight function, defined as w(j) = α·j² + β·j + γ, where α, β, γ are constants. The weight function is chosen so as to amplify the contribution of higher grey levels in the calculation of I_n. Constants β, γ are chosen so as to satisfy w(0) = 0 and w(G) = 1. Therefore w(j) is finally derived by the following equation:
w(j) = α·j² + ((1 − α·G²)/G)·j        (3)
This weight function aims to amplify the contribution of higher grey levels, which clearly appear as hyperechoic lines in US images after the quantization process of the first stage (Fig. 1c). The values of I_n for the stripes sampled from the US image of Fig. 1 are depicted in Fig. 2. From this figure, it can be noticed that the peak values of I_n correspond to the stripes that fall on the thyroid boundaries. A direct measure of the rate of change of I_n between two successive stripes is estimated as follows:
D_n = (dI_i/di)|_{i=n} · I_n ,   n = 1, 2, ..., K        (4)
In the final stage of the algorithm, the stripes that contain the outer and the inner thyroid boundaries are selected. If nouter and ninner are the stripe indices that correspond to the outer and the inner boundaries respectively, then nouter and ninner should satisfy the following conditions:
Fig. 2. Example of In values for each stripe of the longitudinal US image of figure 1
n_outer = argmin_n [D_n · ω1(n)],   n_inner = argmax_n [D_n · ω2(n)],   n_inner − n_outer > T, where T > 0        (5)
ω1(n) = log((λ·K − n)/(λ·K) + 1) for n < λ·K,   ω1(n) = 0 for n ≥ λ·K        (6)
ω2(n) = log((n − (1 − λ)·K)/(λ·K) + 1) for n ≥ (1 − λ)·K,   ω2(n) = 0 for n < (1 − λ)·K        (7)
λ = 1 − T/N        (8)
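Putting Eqs. (2)-(8) together, the stripe scoring and boundary selection could be sketched as follows on the quantized image. The parameter values (a, h, s, T) and the function name are our own choices, not the authors' code, and the n_inner − n_outer > T check of Eq. (5) is left as a comment.

import numpy as np

def tbd_boundaries(Q, G=256, a=1e-4, h=16, s=16, T=32):
    # Q: quantized US image (N x M).  Returns the stripe indices of the
    # outer and inner thyroid boundaries (Eqs. 2-8).
    N, M = Q.shape
    def w(j):                                       # quadratic weight, Eq. (3)
        return a * j**2 + (1.0 - a * G**2) / G * j
    tops = range(0, N - h + 1, s)
    I = np.array([sum(w(j) * np.count_nonzero(Q[t:t + h] == j)
                      for j in np.unique(Q[t:t + h])) for t in tops])   # Eq. (2)
    K = len(I)
    D = np.gradient(I) * I                          # Eq. (4)
    lam = 1.0 - T / N                               # Eq. (8)
    n = np.arange(K)
    w1 = np.zeros(K)
    w2 = np.zeros(K)
    m1 = n < lam * K                                # Eq. (6): bias towards the top
    w1[m1] = np.log((lam * K - n[m1]) / (lam * K) + 1.0)
    m2 = n >= (1 - lam) * K                         # Eq. (7): bias towards the bottom
    w2[m2] = np.log((n[m2] - (1 - lam) * K) / (lam * K) + 1.0)
    n_outer = int(np.argmin(D * w1))                # Eq. (5)
    n_inner = int(np.argmax(D * w2))
    # A full implementation would also enforce n_inner - n_outer > T.
    return n_outer, n_inner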
Threshold T represents a minimum anteroposterior diameter of the thyroid gland and ensures that stripe n_inner will always reside below stripe n_outer. The logarithmic weight functions ω1(n) and ω2(n) bias n_outer and n_inner towards the upper and lower image regions, respectively.
2.2 Detection of Thyroid Nodules
The boundaries detected by the TBD algorithm define a ROI in which feature extraction and classification take place. This ROI is raster scanned with sliding windows (Fig. 3), and from each window a textural feature vector is extracted. The feature extraction method used for texture analysis of the thyroid gland is based on Local Binary Patterns (LBP). This method was chosen because it produces
highly discriminative texture descriptors, it involves low complexity computations, and it has been successfully applied for ultrasound image analysis [10]. Feature extraction is succeeded by classification of the feature vectors into two classes: a class of normal tissue regions, and a class of nodular tissue regions. A simple k-nearest neighbor (k-NN) algorithm was chosen as a powerful and robust non-parametric classification method with well-established theoretical properties that has demonstrated experimental success in many pattern recognition tasks [11]. As the LBP feature vectors are actually statistical distributions, the k-NN algorithm uses the intersection of two distributions as an effective similarity measure [12]. This measure is estimated by the following equation:
μ(h1, h2) = Σ_{n=1}^{L} min(h1(n), h2(n))        (10)
where h1 and h2 are the two LBP distributions compared, and L is the number of bins of each distribution.
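For illustration, a basic 3×3 LBP histogram, the intersection measure of Eq. (10) and a k-NN vote could be written as follows. This is our own minimal sketch (plain 3×3 LBP, no rotation invariance), not the authors' implementation.

import numpy as np

def lbp_histogram(window):
    # Basic 3x3 LBP: threshold the 8 neighbours at the centre pixel and
    # histogram the resulting 8-bit codes into a normalized distribution.
    c = window[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        nb = window[1 + dy:window.shape[0] - 1 + dy, 1 + dx:window.shape[1] - 1 + dx]
        code += (nb >= c).astype(np.int32) * (1 << bit)
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / hist.sum()

def intersection(h1, h2):
    # Similarity of two LBP distributions, Eq. (10).
    return np.minimum(h1, h2).sum()

def knn_label(h, train_hists, train_labels, k=7):
    # The k most similar training windows vote on 'nodular' vs 'normal'.
    sims = np.array([intersection(h, t) for t in train_hists])
    votes = [train_labels[i] for i in np.argsort(sims)[-k:]]
    return max(set(votes), key=votes.count)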
Fig. 3. (a) Longitudinal US image of thyroid gland. (b) ROI defined by the TBD algorithm. (c) Raster scanned ROI with sliding windows for feature extraction.
3 Results
Thyroid US examinations were performed using a digital US imaging system HDI 3000 ATL with a 5-12 MHz linear transducer at the Department of Radiology of Euromedica, Greece. A total of 39 longitudinal in vivo digital thyroid US images were acquired at a resolution of 256×256 pixels with 256 grey-level depth. The dataset used in the experiments comprises various thyroid nodules classified as Grade 3 [13]. In order to obtain ground truth images for the experiments, three expert radiologists manually annotated the US images by drawing horizontal lines for the inner and the outer boundaries of the thyroid lobe, and by delineating the existing thyroid nodules. The ground truth images were generated according to the principle that every pixel is characterized either as normal thyroid parenchyma, thyroid nodule or tissue not belonging to the thyroid gland, only if at least two out of three expert radiologists characterized it that way [14]. As a measure of detection accuracy, we consider the overlap between the detected and the ground truth region. The results of the proposed thyroid US image analysis scheme are presented in two parts, according to the experiments performed. The first part presents the results of the
experimental evaluation of the TBD algorithm, whereas the second part presents the results of the proposed scheme for thyroid nodule detection.
3.1 Evaluation of the TBD Algorithm
Experiments were performed to determine the optimal stripe dimensions that minimize the error between the boundaries detected by the TBD algorithm and the ground truth boundaries. In all the experiments a total of z = 3 quantization levels was found to be sufficient. Figure 4 illustrates the results obtained for different combinations of h ∈ H and s ∈ H: s ≤ h, where H = {4, 8, 16, 32, 64}. The optimal values of the investigated parameters were found to be h = 16 and s = 16 pixels, for which the mean accuracy in the detection of thyroid boundaries reaches a maximum of 93.2±3.2%. For large values of h (>32) a notable decrease of the accuracy is observed, as the large stripe size leads to a gross localization of the thyroid boundaries.
Fig. 4. Mean accuracy of the TBD algorithm for different values of h and s
3.2 Evaluation of the Proposed Scheme for Thyroid Nodule Detection
Experiments were performed to evaluate the performance of the proposed scheme with the TBD algorithm in comparison with the conventional, exhaustive feature extraction scheme. The window sizes tested were 16×16, 32×32 and 64×64 pixels, for sliding steps of 4, 8, 16, 32, 64 pixels. Three LBP neighborhoods, namely 3×3, 5×5, and 7×7 pixels, and three k-NN classifiers for k = 3, 5, 7 were tested. The parameters
delivering the best results in Section 3.1 were used for the TBD algorithm, i.e. s = 16 and h = 16. A balanced proportion of normal and abnormal samples was extracted from the available US images in a way that all samples corresponding to nodular thyroid tissues were included and an equal number of samples corresponding to normal tissues was randomly selected. In [15], it has been shown that, when learning from a balanced class distribution, classifiers generally come up with fewer but more accurate classification rules for the minority class than for the majority class. So, as the nodular tissue samples comprise a minority class, such an approach is expected to enhance the classification of abnormal samples and thus increase the system's sensitivity. The best results obtained per LBP neighborhood are summarized in Table 1. It can be observed that in all cases the application of the TBD algorithm improves nodule detection accuracy. The smallest LBP operator, for windows of 32×32 pixels sliding with a step of 8 pixels, resulted in the highest accuracy.
Table 1. Best classification results
                                      Without TBD algorithm       With TBD algorithm
Method    Window/Step    k (k-NN)     Acc.    Sens.    Spec.      Acc.    Sens.    Spec.
LBP{3}    32/8           7            0.75    0.75     0.76       0.82    0.78     0.81
LBP{5}    32/4           7            0.74    0.63     0.83       0.81    0.74     0.87
LBP{7}    32/8           5            0.71    0.62     0.77       0.81    0.73     0.88
Fig. 5. (a) Longitudinal US image of thyroid gland with a nodule. (b) Ground truth thyroid nodule annotated by radiologists. (c) Classification result without TBD (d) Classification result with TBD. The nodular tissues detected are colored white.
An example detection of a single nodule with and without the TBD algorithm is illustrated in Fig. 5. Besides the notable improvement of the detection accuracy, the use of the TBD algorithm also led to a notable improvement in time performance, which reached 74%.
4 Conclusions
We presented a novel scheme for nodule detection on longitudinal US images of the thyroid gland. This method encompasses two consecutive phases: the TBD algorithm is used to detect the thyroid boundaries defining an initial ROI, and feature extraction and classification techniques are then applied within the region defined. Through the series of experiments presented, the proposed scheme proved to be more accurate and less time consuming in nodule detection, compared with the conventional US image analysis methods. The results of the experimental study presented in this paper lead to the following conclusions about the proposed scheme:
• The proposed scheme can improve nodule detection accuracy.
• The proposed scheme can considerably decrease the processing time needed for nodule detection.
• Its application in clinical practice is feasible and could contribute to the reduction of false medical decisions.
Future work and perspectives include:
• Experimentation to determine the optimal feature extraction and classification methods for thyroid nodule detection.
• Investigation of the performance of the proposed scheme on images acquired from different US imaging systems.
• Enhancement of the functionality of the proposed TBD algorithm, to deal with US images other than longitudinal US images of the thyroid gland.
• Evolution of the proposed time-efficient scheme for application in an integrated real-time system for the assessment of the thyroid gland.
Acknowledgments. We would like to thank Dr. N. Dimitropoulos, M.D. and EUROMEDICA S.A., Greece for the provision of part of the medical images. This work was supported by the Greek General Secretariat of Research and Technology and the European Social Fund, through the PENED 2003 program (grant no. 03-ED662). It is also partially supported by the European Social Fund and National Resources (EPEAEK II).
References
1. Welker, M.J., Orlov, D.: Thyroid Nodules. American Family Physician 67(3) (2003)
2. Mailloux, G.E., Bertrand, M., Stampfler, R.: Local Histogram Information Content of Ultrasound B-Mode Echographic Texture. Ultrasound in Medicine and Biology 11(5), 743–750 (1985)
3. Wagner, R.F., Insana, M.F., Brown, D.G.: Unified approach to the detection and classification of speckle texture in diagnostic ultrasound. Opt. Eng. 25, 738–742 (1986)
4. Fellingham, L.L., Sommer, F.G.: Ultrasonic characterization of tissue structure in the in vivo human liver and spleen. IEEE Transactions on Sonics and Ultrasonics 31(4), 418–428 (1984)
5. Smutek, D., Sara, R., Sucharda, P., Tjahjadi, T., Svec, M.: Image texture analysis of sonograms in chronic inflammations of thyroid gland. Ultrasound in Medicine and Biology 29(11), 1531–1543 (2003)
6. Skouroliakou, C., Lyra, M., Antoniou, A., Vlahos, L.: Quantitative image analysis in sonograms of the thyroid gland. Nuclear Instruments and Methods in Physics Research A 569, 606–609 (2006)
7. Iakovidis, D.K., Savelonas, M.A., Karkanis, S.A., Maroulis, D.E.: Segmentation of Medical Images with Regional Inhomogeneities. In: Proc. International Conference on Pattern Recognition (ICPR), vol. 2, pp. 279–282. IAPR, Hong Kong (2006)
8. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 51–59 (1996)
9. Rumack, C.M., Wilson, S.R., Charboneau, J.W., Johnson, J.A.: Diagnostic Ultrasound. Mosby (2004), ISBN 0323020232
10. Pujol, O., Radeva, P.: Supervised texture classification for intravascular tissue characterization. In: Suri, J.S., Wilson, D., Laximinarayan, S. (eds.) Handbook of Biomedical Image Analysis, Segmentation Models Part B, vol. 2. Springer, Heidelberg (2005)
11. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, London (1999)
12. Swain, M.J., Ballard, D.H.: Color indexing. IJCV 7(1), 11–32 (1991)
13. Tomimori, E.K., Camargo, R.Y.A., Bisi, H., Medeiros-Neto, G.: Combined ultrasonographic and cytological studies in the diagnosis of thyroid nodules. Biochimie 81, 447–452 (1999)
14. Kaus, M.R., Warfield, S.K., Jolesz, F.A., Kikinis, R.: Segmentation of Meningiomas and Low Grade Gliomas in MRI. In: Taylor, C., Colchester, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI'99. LNCS, vol. 1679, pp. 1–10. Springer, Heidelberg (1999)
15. Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning. Technical Report ML-TR-43, Dept. of Computer Science, Rutgers University (2001)
Contour Energy Features for Recognition of Biological Specimens in Population Images Daniel Ochoa1,2, Sidharta Gautama1, and Boris Vintimilla2 1
Department of telecommunication and information processing, Ghent University,St-Pieters Nieuwstraat 41, B-900, Ghent, Belgium 2 Centro de Vision y Robotica, ESPOL University, Km 30.5 via perimetral, Guayaquil, Ecuador {dochoa,sid}@telin.ugent.be,
[email protected] Abstract. In this paper we present an approach to perform automated analysis of nematodes in population images. Occlusion, shape variability and structural noise make reliable recognition of individuals a difficult task. Our approach relies on shape and geometrical statistical data obtained from samples of segmented lines. We study how shape similarity in the objects of interest is encoded in active contour energy component values and exploit them to define shape features. Without having to build a specific model or making explicit assumptions on the interaction of overlapping objects, our results show that a considerable number of individuals can be extracted even in highly cluttered regions when shape information is consistent with the patterns found in a given sample set. Keywords: feature extraction, statistical shape analysis, segmentation, recognition.
1 Introduction
Many biological methods rely on molecular, biochemical and microbiological analysis of communities. One of the most studied families is the nematoda phylum, given its well-described nervous system, complete genome sequence and sensitivity to environmental changes, which makes it attractive for biotechnology research and development. Spatial and temporal distribution in nematode populations can be used as bio-indicators for soil management, variations between conventional and new genetically modified plants, expressions of disease symptoms in crops, pesticide treatments and lately for measuring the impact of the expected global warming. To study populations and their evolution, care has to be taken to resort to non-destructive methods to avoid killing the individuals and to collect a considerable amount of specimens from different samples and control sets. Once the specimens are under the microscope, the technician collects data related to length, area and spatial distribution that are then correlated to the rate of growth, biomass, feeding behavior, maturity index and other time-related metrics that are used to support or discard hypotheses about the sample set under consideration.
As high resolution camera systems have become affordable for research labs, the increasing amount of digital image data in biological studies will require efficient and robust image analysis tools to generate accurate and reproducible quantitative results. In contrast to medical images, where imaging conditions and sampling methods are highly controlled, biological images are inherently difficult to analyze because of sample variation, noise and clutter [1]. These problems can distort the shape measurements of the detected specimens if, for instance, overlapping specimens are regarded as one. In early papers, images containing single nematodes are examined [2]. After background correction, the image is thresholded and skeletonized, after which contour curvature patterns are used to identify the head and tail of the nematode. In a first step towards classifying C. elegans behavioral phenotypes quantitatively, in [3] motion patterns are identified by means of a one-nematode tracking system, morphological operators and geometry-related features. Images of nematode populations were used in [4] to describe how to apply scale space principles to linear object detection, but no attempt is made to extract single specimens from the population. In practice much of the work is still very labour intensive. Using digitalization software, the user marks points along the nematode body and linear segments are interpolated. The live-wire approach [5] can make the manual process easier, since by following the contour a line is attracted to the nematode body, but problems remain in cluttered regions where line information vanishes. In any case, the bulk of the recognition task is still done by hand for every single nematode. This is certainly a discouraging scenario for researchers, considering that a data set might consist of massive amounts of image data with possibly hundreds of specimens. Consequently, high-throughput screening of bio-images to fully describe biological processes on a quantitative level is still very much in demand [6]. Given the nature of these images, extracting reliable shape information for object identification with a restricted amount of image data, clutter and structural noise is a challenging task. However, we consider that effective recognition is a necessary step before any post-processing task, in particular if a computer vision based software tool is to be incorporated to derive statistical data from population samples [7], where accurate measurements are needed to provide truly meaningful information to bio-researchers. Previous work on nematode population samples resorts to intensity thresholding followed by filling, drawing and measuring operations in a semi-automatic fashion [8], and specimens are carefully placed apart from each other to prevent occlusion. Applying image processing techniques to several biological specimens that, despite belonging to the same class, differ in shape and appearance makes parameter setting a complicated issue [9] and narrows the scope of possible applications. Unlike previous efforts aimed at extracting shape cues from a set of single-nematode images, integrating them into a model and then finding the best possible fit on the image data [10], this paper explores whether shape information can be captured by the utilization of a population of active contours. We believe that, despite the inherent variability of nematode shapes in population images, common patterns can play an important role in recognition, not only in still images such as those used in our
experiments but also in video sequences, when disambiguation methods in occluding scenarios are required. This paper is organized as follows. In Section 2, shape models and the active contour approach are discussed. Shape statistics of detected nematodes are proposed and used for classification in Section 3. Results are shown in Section 4. Finally, conclusions and further work are presented in Section 5.
2 Active Contour Segmentation
In general, nematodes in an image can be thought of as lines of varying width along their length, wide in the center and narrow near both ends. Although an important body of research on linear structure detection, particularly for vessel/neurite segmentation, has been developed in the past [11], parameter setting involves a trade-off between image-content coverage and conciseness [12], a critical problem when dealing with populations, because as specimens overlap, line information vanishes at junctions and structural noise appears when nematodes' internal organs become visible. Graph-based search was proposed in [13] to integrate line evidence and detect networks of lines, but recognition of individual objects requires additional post-processing steps that, given the lack of salient contour points, make common shape representations less suitable for recognition. Capturing shape variation, for instance by means of appearance/shape models [14], is a complex task in the case of worm-like objects given the absence of discriminant landmark points. Moreover, complex motion patterns prevent the use of linear systems to create a simple shape model. Although nonlinear systems have been devised [10], the complete range of nematode body configurations is still far from being modeled. The spatial arrangement of feature points at different scales was exploited in [15] to search for a rigid wiry object in a cluttered environment. In a similar vein, but for non-rigid objects, in this paper we propose the utilization of active contour energies to capture relevant statistical shape information for recognition, applied to nematode detection in population images. Active contours, introduced by Kass with a model called Snake [16], have drawn attention due to their performance in various problems. Segmentation and shape modeling in single images proved effective by integrating region-based information, stochastic approaches and appropriate shape constraints [17,18]. Active contours merge image data and shape modeling through the definition of a linear energy function consisting of two terms: a data-driven component (external energy), which depends on the image data, and a smoothness-driven component (internal energy), which enforces smoothness along the contour.
E_snake = α·E_int + β·E_ext        (1)
In parametric contours, the internal energy can be decomposed further into tension and bending energies, which take higher values as the contour stretches or bends during the evolution process. The goal is to minimize the total energy iteratively using gradient descent techniques while the energy components balance each other.
E_int = ∫_0^P [e_t(p) + e_b(p)] dp ,        E_ext = ∫_0^P e_ext(p) dp        (2)
The rationale behind the proposed approach is that, the convergence of the active contours being mostly data-driven, appearance and geometrical information can be recovered from their energy component values. We consider the analysis of energy-based features a natural way to explore the range of possible shape configurations in nematode population images without: a) having to build a specific model, and b) making explicit constraints about the interaction of occluding objects [19]. In our experiments we chose the ziplock snake model [20]. This model is designed to work with open contours and initialization is limited to locating the contours' end points. Optimization is carried out from the end points towards the center of the contour, so that the initial control points will be located progressively on the object surface, which increases the probability of a correct segmentation. Being parametric, it can encode shape information [21] explicitly and provides faster convergence than geodesic snakes.
e_ext = I(x, y),   e_t = x′² + y′²,   e_b = (x′·y″ − x″·y′) / (x′² + y′²)^(3/2)        (3)
where primes denote derivatives with respect to the curve parameter p.
The tension energy corresponds to the point distance distribution, the bending energy to the local curvature, and a normalized version of the intensity image I is used as the energy field. The number of control points n should be chosen large enough to capture the global variability of the nematode's shape without losing the discriminant power to recognize individuals.
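For a discrete open contour, the three terms of Eq. (3) can be evaluated per control point roughly as follows, using finite-difference derivatives; this is our own sketch, not the ziplock-snake implementation.

import numpy as np

def contour_energies(points, image):
    # Per-control-point tension, bending and external energies (Eq. 3)
    # for a polyline contour, with finite-difference derivatives.
    p = np.asarray(points, float)
    dx, dy = np.gradient(p[:, 0]), np.gradient(p[:, 1])
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    e_t = dx**2 + dy**2                                                   # tension: point spacing
    e_b = (dx * ddy - ddx * dy) / np.maximum((dx**2 + dy**2)**1.5, 1e-9)  # bending: curvature
    xi = np.clip(np.round(p[:, 0]).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.round(p[:, 1]).astype(int), 0, image.shape[0] - 1)
    e_ext = image[yi, xi] / max(image.max(), 1e-9)                        # external: normalized intensity
    return e_t, e_b, e_ext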
Fig. 1. Segmented contours in a nematode population image. Left: initial end points and contours. Right: after convergence, some contours are located on a single nematode, while others lie partially on the background and on sections of different nematodes.
Given a set of potential end points, either manually or automatically detected [3], contours without any particular initialization can be proposed and optimized until convergence. As the energy components interact with each other, a number of contours will find the minimum energy on nematodes, whereas others will be trapped in clutter regions. Unless an explicit and detailed analysis of every case, or at least of the most common cases, is done (a particularly challenging task for non-rigid objects), there is no suitable way to distinguish between nematode and clutter contours (Fig. 1). Hence the suggested solution is presented in the following section.
3 Shape Classification Using Contour Energy Features
To select those contours corresponding to nematodes, we exploit the shape data available through the energy terms of the active contours. We define energy feature distributions at global and local levels for both the clutter (Cl) and nematode (Nl) contour classes. Let S be the set of contours; our general measure Gi is defined as the expected value of the observed energy component e for a given contour s in S.
$G_i^s = E[e_i]_{s \in S}, \quad i \in \{e_t, e_b, e_{ext}\}$   (4)
A more local measure Li is defined as an n-dimensional vector formed by the local values of the energy component estimated for every control point of a given contour s in S:
$L_i^s = (e_i^1, \ldots, e_i^n)_{s \in S}, \quad i \in \{e_t, e_b, e_{ext}\}$   (5)
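As a small illustration of Eqs. (4)-(5), the following sketch builds the global measure G_i (mean over the contour) and the local measure L_i (per-control-point vector) from per-control-point energies; the dictionary layout and the example values are hypothetical.

```python
import numpy as np

def contour_features(energies):
    """Global (G_i, Eq. (4)) and local (L_i, Eq. (5)) features of one contour
    from its per-control-point energy components."""
    G = {name: float(np.mean(vals)) for name, vals in energies.items()}
    L = {name: np.asarray(vals, dtype=float) for name, vals in energies.items()}
    return G, L

# Hypothetical energies of a single contour with five control points.
energies = {"e_t": [0.8, 0.9, 1.1, 1.0, 0.7],
            "e_b": [0.1, 0.3, 0.2, 0.4, 0.1],
            "e_ext": [0.5, 0.4, 0.3, 0.4, 0.6]}
G, L = contour_features(energies)
print(G["e_t"], L["e_ext"])
```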
From the data distributions obtained experimentally we found no clear feature clusters, and it seems difficult that any isolated feature would be able to separate nematode from clutter contours. However, some patterns in the image become evident, in particular for the local feature distributions. Regarding the shape of the nematodes, it seems that as the nematode gets thicker (in the central region) it becomes less flexible. Looking at the external energy features, it is also apparent that the distributions gradually shift toward lower mean energy values, since the nematode tends to be darker in the middle than at both ends. To combine features in a statistical framework we applied Bayes' rule to classify contours into the nematode and clutter classes. The ratio of the a posteriori probabilities of the nematode class to the clutter class was defined as the discriminant function.
$D = \frac{P(Nl \mid s)}{P(Cl \mid s)} = \frac{P(s \mid Nl) \cdot P(Nl)}{P(s \mid Cl) \cdot P(Cl)}$   (6)
Assuming independence¹ between energy components, conditional distributions can be readily defined for each class w ∈ {Nl, Cl} using the Gi- and Li-based feature sets. For object (global) features the prior was considered homogeneous and consequently discarded; for local features the prior probabilities were defined by taking into account the distance dc from a control point c to the closest end point.
¹ This is a crude assumption because one could expect some correlation between the energy terms.
$P(s \mid w) = \prod_i P(G_i \mid w), \quad i \subseteq \{e_t, e_b, e_{ext}\}$   (7)

$P(s \mid w) = \prod_c \prod_i P(L_i \mid c, w), \quad i \subseteq \{e_t, e_b, e_{ext}\}$   (8)

$P(w) = \prod_c P(d_c)$   (9)
In order to measure the discriminant power of the derived feature sets, different combinations of energy features were tested. For local feature classification the experiments were repeated without prior probabilities to estimate the impact of prior information. Classification with different numbers of control points was also performed by discarding an increasing number of segments at both ends of the contour.
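The following sketch illustrates how the discriminant of Eq. (6) can be evaluated for the global features under the independence assumption of Eq. (7), with class-conditional gamma densities fitted to training contours. The paper fits gamma and Weibull pdfs; the data layout, the SciPy fitting calls and the synthetic training values below are our own.

```python
import numpy as np
from scipy import stats

def fit_class_models(train, eps=1e-12):
    """Fit a gamma pdf to each global energy feature of each class.
    train: {'Nl': {'e_t': values, ...}, 'Cl': {...}} (layout is ours)."""
    return {cls: {name: stats.gamma.fit(np.asarray(v) + eps, floc=0)
                  for name, v in feats.items()}
            for cls, feats in train.items()}

def discriminant(G, models):
    """Ratio of Eq. (6) with homogeneous priors and the independence
    assumption of Eq. (7); values above 1 favour the nematode class."""
    def likelihood(cls):
        p = 1.0
        for name, value in G.items():
            a, loc, scale = models[cls][name]
            p *= stats.gamma.pdf(value, a, loc=loc, scale=scale) + 1e-300
        return p
    return likelihood("Nl") / likelihood("Cl")

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    train = {"Nl": {"e_t": rng.gamma(2.0, 1.0, 30), "e_ext": rng.gamma(1.5, 0.5, 30)},
             "Cl": {"e_t": rng.gamma(4.0, 1.0, 30), "e_ext": rng.gamma(3.0, 0.5, 30)}}
    models = fit_class_models(train)
    print(discriminant({"e_t": 2.0, "e_ext": 0.8}, models))
```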
4 Experimental Results
The developed approach was tested on a set consisting of 8 high-resolution time-lapse images depicting populations of adult nematodes, with approximately 200 specimens lying freely on an agar substrate. Straight contours were placed between every pair of end points manually segmented from ground-truth images within a distance equal to the maximum expected size of a nematode, and iteratively optimized until convergence (Fig. 1). Conditional distributions were derived from a training set of 30 randomly selected nematode and clutter contours. Given the non-Gaussian nature of our data, conditional and prior distributions were fitted using gamma and Weibull pdfs respectively. Classification results are summarized in Table 1. The general features Gi seem to have the lowest discrimination power, and in general the true positive (TP) rate tends to increase as more energy feature types are added. Surprisingly, results improve as the number of points is reduced; this indicates that nematode and clutter contours have similar average energy distributions, and only when the central part of the contour is considered is the difference large enough to allow better classification, mainly because the central parts of clutter contours tend to fall on the background. When considering the local features Li, the tension energy proved to be the most discriminant. The spatial distribution of control points can be explained by looking at the interaction between the energies. Since the external energy is lower in the middle of the nematode, control points tend to gather in that area, but as they move toward the center the tension energy increases near both contour ends and pulls control points in the opposite direction. Therefore, the distance between the control points varies depending on the region in which a control point is located; usually this region corresponds to morphological structures inside the nematode. It must be noted that only by combining several energy types can the false positive (FP) rate be consistently reduced. In particular, the bending energy allows us to filter out contours with sharp turns. We also found that contours whose spatial intensity distribution is too different from those common to nematodes, or whose control points are located mostly on the image background, can be rejected by taking the external energy distribution into account. When the experiments were repeated without prior information, the need for more control points to improve the accuracy of the results became manifest.
Fig. 2. Contour classification examples: green contours are classified as nematode and red as clutter. Left: correct nematode contour recognition in the presence of occlusion. Right: misclassified clutter contour that runs over a true nematode.

Table 1. Classification results for Gi and Li feature combinations; the shaded cells correspond to the best (Tp, Fp) pair

n    Energy set       Gi Tp    Gi Fp    Li Tp    Li Fp    Li(no prior) Tp    Li(no prior) Fp
8    {Et}             0.804    0.272    0.918    0.360    0.913              0.335
8    {Eb}             0.707    0.348    0.918    0.545    0.908              0.520
8    {Eext}           0.832    0.457    0.848    0.552    0.826              0.515
8    {Et,Eb}          0.728    0.217    0.924    0.330    0.935              0.332
8    {Et,Eext}        0.804    0.245    0.913    0.341    0.897              0.306
8    {Eb,Eext}        0.755    0.272    0.918    0.456    0.886              0.405
8    {Et,Eb,Eext}     0.761    0.201    0.908    0.306    0.924              0.284
12   {Et}             0.266    0.082    0.908    0.317    0.918              0.316
12   {Eb}             0.364    0.223    0.870    0.477    0.897              0.510
12   {Eext}           0.641    0.364    0.793    0.466    0.832              0.515
12   {Et,Eb}          0.223    0.082    0.902    0.284    0.940              0.307
12   {Et,Eext}        0.288    0.065    0.891    0.291    0.891              0.295
12   {Eb,Eext}        0.348    0.103    0.853    0.364    0.864              0.392
12   {Et,Eb,Eext}     0.239    0.054    0.918    0.255    0.924              0.266
16   {Et}             0.114    0.038    0.880    0.335    0.924              0.296
16   {Eb}             0.255    0.141    0.826    0.477    0.880              0.488
16   {Eext}           0.522    0.245    0.788    0.456    0.821              0.473
16   {Et,Eb}          0.082    0.049    0.897    0.305    0.918              0.280
16   {Et,Eext}        0.087    0.038    0.875    0.309    0.875              0.274
16   {Eb,Eext}        0.190    0.054    0.848    0.360    0.875              0.366
16   {Et,Eb,Eext}     0.065    0.022    0.913    0.273    0.902              0.248
Misclassified nematode contours fall into two broad categories: contours close to the image border, where the contrast between foreground and background is poor and appearance information is lost, and contours located on the perimeter of the sample, which are affected by optical distortion that produces unusual shape configurations different from the common patterns in the image set. Clutter contours can be mistakenly classified as nematodes when most of their control points converge towards a real nematode, for instance when two parallel nematodes are close to each other, or when, in the presence of heavy overlapping, a clutter contour manages to cover parts of different nematodes. Finally, it should be noted that in occlusion cases, the more occluded a nematode is, the lower its discriminant function value. Structural noise introduced at junction regions due to the change of relative optical density negatively affects the convergence of the contours and the feature values. Still, we can recover a number of nematodes from cluttered regions when enough shape information is retained. In our experience, in a local set of overlapping contours (for instance, contours derived from a common end point) nematode contours have discriminant function values consistently higher than those corresponding to clutter contours.
5 Conclusions
In this paper we have studied feature sets aimed at improving the recognition of individual specimens in population images, where structural noise, intensity variations, different shape configurations and occlusion are present. The inherent similarity of nematodes in populations provides valuable information for recognition that is captured during the evolution of well-known parametric contour models, allowing us to use the shape and image data encapsulated in the energy terms directly for classification. With this approach we avoid building a complex shape model and take advantage of image and shape statistics to narrow the range of possible appearance and geometrical configurations to those commonly present in a given sample set. Our features, based on energy component distributions, were tested on manually segmented images in the framework of Bayesian inference. Experimental results show that nematode and clutter contours can have similar statistics, and only when the less articulated nematode sections are considered is a significant number of clutter contours rejected while most of the nematode contours are retained. Local features do not require prior knowledge of the location of stable object sections, as the spatial distribution of contour points and their associated energy components effectively encode shape information. Known problems of contour-based segmentation, although still present, do not prevent correct nematode recognition when the nematodes differ from the background or lie in relative isolation. Recognition in cluttered regions is, as expected, more difficult, but as long as meaningful detected sections of the nematode body are linked by a contour, a positive classification is possible. We must point out that in practice only a representative sample of the population is needed to derive statistical data; our aim is to explore how to build such a sample in a reliable way by exploiting image and shape regularities.
Future work will focus on extending our findings to video sequences for tracking moving nematodes, specifically when two or more nematodes temporarily overlap and nematode identification is uncertain.
Acknowledgments. This work was supported by the VLIR-ESPOL program under component 8. The images were kindly provided by the Devgen corporation.
References
1. Bengtsson, E., Bigun, J., Gustavsson, T.: Computerized Cell Image Analysis: Past, Present and Future. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, Springer, Heidelberg (2003)
2. Fdez-Valdivia, J., Perez De la Blanca, N., Castillo, P., Gomez-Barcina, A.: Detecting Nematode Features from Digital Images. Journal of Nematology 24, 289–298 (1992)
3. Wei, G., Cosman, P., Berry, C., Zhaoyang, F., Schafer, W.R.: Automatic tracking, feature extraction and classification of C. elegans phenotypes. IEEE Transactions on Biomedical Engineering 51, 1811–1820 (2004)
4. Van Osta, P., Geusebroek, J.M., Ver Donck, K., Bols, L., Geysen, J., ter Haar Romeny, B.M.: The Principles of Scale Space Applied to Structure and Colour in Light Microscopy. Proceedings Royal Microscopical Society 37, 161–166 (2002)
5. Meijering, E., Jacob, M., Sarria, J.-C.F., Unser, M.: A Novel Approach to Neurite Tracing in Fluorescence Microscopy Images. Signal and Image Processing 399, 96–148 (2003)
6. Meijering, E., Smal, I., Danuser, G.: Tracking in Molecular Bioimaging. IEEE Signal Processing Magazine 3, 46–53 (2006)
7. Moller, S., Kristensen, C., Poulsen, L., Cartersen, J., Molin, M.: Bacterial Growth on Surfaces: Automated Image Analysis for Quantification of Rate-Related Parameters. Applied and Environmental Microbiology 6(1), 741–748 (1995)
8. Baguley, J., Hyde, L., Montagna, P.: A Semi-automated Digital Microphotographic Approach to Measure Meiofaunal Biomass. Limnology and Oceanography Methods 2, 181–190 (2004)
9. Tomankova, K., Jerabkova, P., Zmeskal, O., Vesela, M., Haderka, J.: Use of Image Analysis to Study Growth and Division of Yeast Cells. Journal of Imaging Science and Technology 6, 583–589 (2006)
10. Twining, C.J., Taylor, C.J.: Kernel Principal Component Analysis and the Construction of Non-Linear Active Shape Models. In: British Machine Vision Conference, pp. 26–32 (2001)
11. Kirbas, C., Quek, F.: Vessel Extraction Techniques and Algorithms: A Survey. In: Proceedings 3rd IEEE Symposium on BioInformatics and BioEngineering, pp. 238–246. IEEE Computer Society Press, Los Alamitos (2003)
12. Aylward, S.R., Bullitt, E.: Initialization, noise, singularities, and scale in height ridge traversal for tubular object centerline extraction. IEEE Transactions on Medical Imaging 21, 61–75 (2002)
13. Geusebroek, J., Smeulders, A., Geerts, H.: A minimum cost approach for segmenting networks of lines. International Journal of Computer Vision 43, 99–111 (2001)
14. Hicks, Y., Marshall, D., Martin, R.R., Rosin, P.L., Bayer, M.M., Mann, D.G.: Automatic landmarking for biological shape models. Proceedings IEEE International Conference on Image Processing 2, 801–804 (2002)
15. Carmichael, O., Hebert, M.: Shape-based recognition of wiry objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1537–1552 (2004)
16. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988)
17. Foulonneau, A., Charbonnier, P., Heitz, F.: Geometric shape priors for region-based active contours. In: Proceedings IEEE International Conference on Image Processing, vol. 3, pp. 413–416. IEEE Computer Society Press, Los Alamitos (2003)
18. Tsechpenakis, G., Rapantzikos, K., Tsapatsoulis, N., Kollias, S.: A snake model for object tracking in natural sequences. Signal Processing: Image Communication 19, 219–238 (2004)
19. Zimmer, C., Olivo-Marin, J.-C.: Coupled parametric active contours. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1838–1842 (2005)
20. Neuenschwander, W.M., Fua, P., Iverson, L., Székely, G., Kubler, O.: Ziplock snakes. International Journal of Computer Vision 23, 191–200 (1997)
21. Jiankang, W., Xiaobo, L.: Guiding ziplock snakes with a priori information. IEEE Transactions on Image Processing 12, 176–185 (2003)
Processing Random Amplified Polymorphism DNA Images Using the Radon Transform and Mathematical Morphology
Luis Rueda¹,³, Omar Uyarte³, Sofia Valenzuela²,³, and Jaime Rodriguez²,³
¹ Department of Computer Science, University of Concepción, Edmundo Larenas 215, Concepción, 4030000, Chile
² Department of Forest Science, University of Concepción, Chile
³ Laboratory for Forest Genomics, Center for Biotechnology, University of Concepción, Chile
{lrueda,ouyarte,sofvalen,jrodrig}@udec.cl
Abstract. Random Amplified Polymorphism DNA (RAPD) analysis is a well-known method for studying genetic relationships between individuals. In this context, processing the underlying RAPD images is a quite difficult problem, affected by various factors. A method for processing RAPD images is proposed, which detects bands by pre-processing the images with a template correction step followed by a band detection mechanism. The results on RAPD images, which aim to verify genetic identity among tree individuals, show that the proposed method detects either negative or positive amplification (bands) with a sensitivity of over 94%.
1 Introduction
Random Amplified Polymorphism DNA (RAPD) [18,19] is a type of molecular marker which has been used in verifying genetic identity. It is a codominant marker, of low cost to implement in the laboratory, and provides fast and reliable results [14]. During the past few years RAPDs have been used for studying phylogenetic relationships [4,17], gene mapping [9], trait-associated markers [16], and genetic linkage mapping [5]. This technique has been used as support for many agricultural, forest and animal breeding programs [11,13], for example, in the evaluation and characterization of germplasms and in the molecular identification of clone freezing resistance [8]. The RAPD technique consists of amplifying random sequences of the genomic DNA by using primers, which are commonly 10 bp (base pairs) in length. This process is carried out by the polymerase chain reaction (PCR) and generates a typical pattern for a single sample and different primers. The PCR products are separated in an agarose gel under an electric field, which permits smaller fragments of the PCR products to migrate faster and larger ones much more slowly. The gel is stained with a dye (typically ethidium bromide) and photographed for further data analysis. One way of analyzing the picture obtained is simply by visually comparing the different bands obtained for each sample. However, this can be a tedious process when various samples with different primer combinations have to be analyzed.
Fig. 1. A sample RAPD image with two reference lanes, and 12 lanes representing four ramets
At the same time, since in this case the presence or absence of bands is to be scored, the band calling can sometimes be very subjective, and there is no reliable threshold level due to variations of the intensity of the bands, which is affected by several factors (e.g., staining, gel quality, PCR reaction, DNA quality, etc.). In Fig. 1, a photograph of a RAPD reaction is shown. In this case, 12 samples were loaded, of which lanes 1 and 12 correspond to the molecular weight standards. In this case, four different genotypes of Eucalyptus globulus were studied, including three identical copies of each (known as ramets). If the ramets are identical, then an identical band pattern should be expected when analyzed with the same primer; however, this is not always the case, due to, for example, mislabeling of samples. During the process of generating the RAPD image, many physical-chemical factors affect the electrophoresis, producing different kinds of noise, rotations, deformations and other abnormal distortions in the image. The effect of this problem, unfortunately propagated through the different stages of the posterior analysis (visualization, background extraction, band detection, and clustering), can lead to erroneous biological conclusions. Thus, efficient image processing techniques will, on the other hand, have a positive impact on those biological conclusions.
2 The Problem
Consider an image (matrix) A = {a_ij}, i = 1, ..., n and j = 1, ..., m, where a_ij ∈ Z+, and A is a RAPD image¹. Usually, a_ij is in the range [0..65,535] in a TIFF image, and hereafter we use A(x, y) to refer to a_ij, where x and y are the pixel coordinates. The aim is to obtain a matrix B (bands), where B = {b_ij}, i = 1, ..., n and j = 1, ..., m, with b_ij = 0 or b_ij = 1 (a binary image), where 1 means that b_ij belongs to a band. The bands could also be represented as a list of vectors, b_i, where b_ij is the vertical position of the j-th band in the i-th lane. Informally speaking, the latter implies that the band matrix is composed of runs of lanes (not necessarily equally spaced), which are parallel to the x-axis. Typically, and as in this paper, the leftmost and rightmost lanes are considered as the reference lanes.
¹ The aim is to apply this method to a sub-image from which the unnecessary regions have been removed.
Image processing of RAPD images has usually been done using software packages which, even though they are very user-friendly, are copyright protected, and the underlying techniques for processing the images and the posterior analysis are in most cases not available. The most well-known software packages for RAPD image analysis are Image J from the National Institutes of Health (NIH), USA [3], Gel Compar II², Gel Quant [6], and Quantity One [12]. In this paper, we propose a method for pre-processing RAPD images which performs two main steps: template orientation correction and band detection. The template correction part relies mainly on the well-known Radon transform, while band detection is carried out using mathematical morphology and cubic spline smoothing. The experimental results show that the proposed method effectively detects bands in a batch of real-life RAPD images.
² Details of this software are available at http://www.applied-maths.com/gelcompar/gelcompar.htm
3 RAPD Image Processing
Given a RAPD image, first, any existing rotation is corrected with respect to the vertical lanes. Then, all the boundaries of the lanes are found, which help to find the centers of the references. The latter are used to fit the image into a perfect rectangular template so that there is a horizontal correspondence between the left and right reference bands. Once the corrections are done, the bands are detected by removing the background, projecting the signals onto the y-axis, and finally applying cubic smoothing splines and morphological operators.
3.1 Template Correction
For the template correction step, the image is first checked for rotations and, if any, they are corrected using the Radon transform [10]. Given an image A(x, y), the Radon transform is given by:

$R(p, t) = \int_{-\infty}^{\infty} A(x, t + px)\, dx ,$   (1)
where p is the slope and t its intercept. The rotation angle of the image with respect to the slope p is given by φ = arctan p. For the sake of notation, R(φ, t) is used to denote the Radon transform of image A. Each rotation angle φ will give a different one-dimensional function, and the aim is to obtain the angle that gives the best alignment with the lanes. This will occur when the lanes are parallel to the y-axis. While there are many ways to state this as an optimization problem, and different objective functions have been used (cf. [1]), an entropy-based function is used here. Assuming all the lanes are parallel to each other, the one-dimensional projected function will contain well-accentuated peaks, each corresponding to a lane, and deep valleys (the background separating the lanes). Normalizing the R(φ, t) function, and renaming it R'(φ, t), such that $\sum_t R'(\varphi, t) = 1$, the best alignment will occur at the angle φ_min that minimizes the entropy as follows:

$H(\varphi) = - \sum_t R'(\varphi, t) \log R'(\varphi, t) .$   (2)

One problem of the entropy function, however, is the following. When the rotation angle φ is near π/4, the extremes of the one-dimensional function become much smaller than the central part of the function, because the extremes are computed by integrating the corners of the image. As a consequence, the entropy function becomes smaller, producing an effect of diminishing the "uniformity" of the function and hence biasing the entropy measure. Although this happens only when φ is near π/4, we (i) consider small values of φ (no more than 10 degrees is good enough for almost all the images), and (ii) remove the sides of R(φ) when computing the entropy. Since the resulting signal function is in a discrete domain, i.e., φ takes discrete values, the entropy is computed as follows:

$H(\varphi) = - \sum_t R'(\varphi, t) \log R'(\varphi, t) ,$   (3)
and R(φ, t) is normalized into R'(φ, t) such that $\sum_t R'(\varphi, t) = 1$. Once the rotation is performed³, the pixel positions in the new image, x' and y', are integers, and their intensities are computed from those of pixel A(x, y) and/or its neighbors, where x' and y' are given by:

$x' = x \cos\varphi - y \sin\varphi ,$   (4)
$y' = x \sin\varphi + y \cos\varphi .$   (5)
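A brute-force version of the entropy-based angle search of Eqs. (2)-(3) can be sketched as follows. Here the projection at each candidate angle is obtained by rotating the image and summing columns (a stand-in for computing the Radon transform directly), the side trimming mimics step (ii) above, and all parameter values are illustrative rather than the authors' settings.

```python
import numpy as np
from scipy import ndimage

def rotation_angle(image, max_angle=10.0, step=0.1, trim=0.1):
    """Estimate the lane rotation angle phi_min by minimizing the entropy
    of the normalized vertical projection R'(phi, t) (Eqs. (2)-(3))."""
    best_angle, best_h = 0.0, np.inf
    for phi in np.arange(-max_angle, max_angle + step, step):
        rot = ndimage.rotate(image, -phi, reshape=False, order=1)  # undo a rotation of phi
        proj = rot.sum(axis=0)                                     # projection onto the x-axis
        k = int(trim * proj.size)
        proj = proj[k:proj.size - k]                               # remove the sides of R(phi)
        p = proj / (proj.sum() + 1e-12)
        p = p[p > 0]
        h = -np.sum(p * np.log(p))                                 # entropy H(phi)
        if h < best_h:
            best_angle, best_h = phi, h
    return best_angle

if __name__ == "__main__":
    # Synthetic gel with vertical lanes, rotated by 1.2 degrees.
    lanes = (np.sin(np.linspace(0, 24 * np.pi, 300)) > 0.6).astype(float)
    img = ndimage.rotate(np.tile(lanes, (200, 1)), 1.2, reshape=False, order=1)
    phi_min = rotation_angle(img)
    corrected = ndimage.rotate(img, -phi_min, reshape=False, order=3)  # rotate by -phi_min
    print(round(phi_min, 2))
```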
The transformed image is reconstructed by interpolating the intensities of pixels x' and y' and their neighbors; in this work, bicubic interpolation is used. Fig. 2 shows a sample RAPD image, UBC 01 e, along with the corresponding rotated image, Fig. 2 (b). In Fig. 2 (c), the entropy is plotted for all angles φ in [−10, 10], with a step of 0.05. The minimum value of H(φ) is found at φ_min = 1.15. The original image is then rotated by −1.15 degrees, and the resulting image in Fig. 2 (b) has the lanes well aligned with the vertical axis. The next step is to find the positions of the lanes. Suppose that the angle that gives the optimal rotation is φ_min, and the "rotated" image is A'. The vertical running sum of pixel intensities of A' is computed, obtaining the function $s_v(x) = \sum_y A'(x, y)$. Applying a simple mean-based smoothing, a function s'_v(x) is obtained, and the boundaries of the lanes are those of the nc largest (smallest or negative) slopes for the left, b^l (right, b^r), boundaries respectively, where nc is given by the user. The function s'_v(x) is divided into nc equally sized intervals, and the boundaries are detected as the maximum and minimum slopes in each interval. Fig. 3 shows the sample image UBC 01 e with the lane boundaries detected (c), the smoothed vertical running sum of pixel intensities (a), and the corresponding slopes of the running sum function, s'_v(x), in (b). It is visible that the highest peaks coincide with the largest slopes and the boundaries of the lanes in the image, and the lowest peaks indicate the negative slopes: the right lane boundaries.
³ Note that in the actual rotation the image is rotated by −φ_min degrees, as φ_min represents the angle by which the "incorrect" image is rotated.
(a) Original image. (b) Rotated image. (c) Entropy function.
Fig. 2. A sample RAPD image, UBC 01 e, and the new image rotated by φ_min degrees. The entropy function H(φ) shows where the minimum value of H at φ_min = 1.15 is obtained.
While in some cases, e.g. lane 6, the lanes are very poor and the boundaries detected are unpredictable, these cases of poor signals are usually discarded in the genotyping analysis. The centers of the (left and right) reference bands are located by taking each lane separately. Two new sub-images are created for the two reference lanes, namely A_l = A(x, y), where x ∈ [b_1^l, b_1^r], and A_r = A(x, y), where x ∈ [b_nc^l, b_nc^r]. For each sub-image the horizontal running sum of pixel intensities is computed, obtaining two functions, s_l(y) and s_r(y), for the left and right lanes respectively. These two functions are then corrected by applying morphological operators (dilation, s_l(y) ⊕ b, followed by erosion, s_l(y) ⊖ b, with b = [0, 1, 1, 1, 1, 0]) to s_l(y) and s_r(y). The n_r reference bands correspond to the highest peaks (local maxima), where n_r is user-defined. Finally, to correct the "trapezoidal" deformation, all the lanes are fitted into a rectangular template (more details can be found in [15]).
3.2 Band Detection
Once the image has been corrected and fitted to a perfect rectangular template, the final task is to detect the actual bands. A background removal procedure is applied, which eliminates the pixels that are in the quartile (25%) with the lowest intensities.
(a) Running sum. (b) Slopes. (c) Image with lane boundaries.
Fig. 3. The sample RAPD image UBC 01 e, the corrected running sum of pixel intensities, and the slopes for the running sum function
The next step consists of separating the lanes by creating a new sub-image for each lane, A_i, for i = 1, ..., n_c, as the portion of the image A given by A(x, y), where x ∈ [b_i^l, b_i^r] and y takes any value. The following step is to create a one-dimensional function with the horizontal running sum of pixel intensities as follows:

$s_i(y) = \sum_x A_i(x, y)$   (6)
The one-dimensional function s_i(y) is corrected by applying morphological operators as follows. For each point y, s_i(y) is corrected by applying s_i(y) ⊕ q, a dilation operator, followed by an erosion operator, s_i(y) ⊖ q, where q = [0, 1, 1, 1, 1, 1, 1, 0]. A cubic spline smoothing correction is then applied, using the csaps Matlab command. Cubic spline smoothing creates a new function s*_i(y) that minimizes the value of

$\mu \sum_{j=0}^{n} \left( \frac{s_i(y_j) + \eta_j - s_i^*(y_j)}{\sigma_j} \right)^2 + (1 - \mu) \int_{y_0}^{y_n} \{ s_i^{*\prime\prime}(y) \}^2 \, dy ,$

where μ ∈ [0, 1] and η_j is an i.i.d. random variable with variance σ_j² [2]. A function s*_i(y) is obtained, and the centers of the bands are located by finding the highest peaks (local maxima) of s*_i(y). Here, two problems arise: (a) extremely low peaks that represent "false bands", and (b) some high-intensity bands tend to "absorb" a band that is close to them.
(a) Horizontal running sum of intensities. (b) Normalized signal. (c) Bands detected.
Fig. 4. The third lane of the sample RAPD image UBC 01 e, the running sum of pixels, and the normalized function s_i(y)
To correct (a), only those peaks above a threshold θ are considered, where θ is user-defined; we have experimentally found that θ = 200 is a good value. To correct (b), the signal is normalized by taking a piecewise linear function as reference (baseline), where the knots of the normalizing function are selected as the lowest peaks of the signal. In Fig. 4, we plot a lane (the third one) of the sample image UBC 01 e, along with the corresponding original, smoothed and normalized running sum functions, as well as the bands detected. In Fig. 4 (a), the signal on top represents the function s_i(y), while the one at the bottom represents the smoothed signal (it has been plotted 1,000 units below s_i(y) to show the difference between them). In Fig. 4 (b), the normalized s_i(y) function is plotted, along with the baseline (y = 0) and the threshold line (y = θ, where θ = 200 in the example). Note that the normalizing method is able to distinguish bands only if they are separated by a local minimum. In Fig. 4 (c), the sub-image corresponding to the third lane is depicted (horizontally) along with the bands detected (black vertical lines).
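The per-lane band detection chain (running sum of Eq. (6), grey-scale dilation and erosion, cubic smoothing, baseline normalization and thresholded peak picking) can be sketched as follows. A Savitzky-Golay filter is used here as a stand-in for the csaps smoothing spline, the background removal step is omitted, and the structuring-element size, smoothing window and synthetic lane are only illustrative.

```python
import numpy as np
from scipy import ndimage, signal

def detect_bands(lane, q_size=8, theta=200.0):
    """Band centers for one lane sub-image A_i (rows = y, columns = x)."""
    s = lane.astype(float).sum(axis=1)               # s_i(y): running sum over x
    s = ndimage.grey_dilation(s, size=q_size)        # dilation  s_i(y) (+) q
    s = ndimage.grey_erosion(s, size=q_size)         # erosion   s_i(y) (-) q
    s_star = signal.savgol_filter(s, 11, 3)          # smoothed s*_i(y) (cubic)
    # Piecewise-linear baseline through the local minima, then normalization.
    y = np.arange(s_star.size)
    mins, _ = signal.find_peaks(-s_star)
    knots = np.concatenate(([0], mins, [y[-1]]))
    baseline = np.interp(y, y[knots], s_star[knots])
    peaks, _ = signal.find_peaks(s_star - baseline, height=theta)
    return peaks

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    lane = rng.normal(50.0, 5.0, (260, 40))
    for c in (40, 90, 150, 210):                     # four synthetic bands
        lane[c - 2:c + 3, :] += 400.0
    print(detect_bands(lane))
```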
4 Experiments
The image test suite was generated from a study made with 30 genotypes of Eucalyptus globulus, each represented by three ramets (identical copies), giving a total of 90 samples. The aim of the experiment was to use RAPD markers to verify genetic identity between ramets of the same clone.
To start the experiments, DNA was extracted from a leaf of each Eucalyptus globulus sample, following the protocol of [7]. The DNA was quantified and diluted for the PCR, the amplification was carried out, and the resulting product was run on an agarose gel, which was dyed with ethidium bromide and photographed, producing a grayscale image. A total of nine images were obtained using primer UBC 02, from which five images were selected at random. Each of these images contains two reference lanes with the molecular weight marker (leftmost and rightmost lanes), while the others correspond to groups of two or three lanes per clone. Each image typically contains more than ten samples, and the five selected images contain 54 samples, i.e., more than 50% of the 90 samples. A sixth image, which has 3 ramets of a clone, was included in this study. Each clone was amplified using four different primers, UBC 01, UBC 02, OPC 01 and OPD 01, where each group of three consecutive lanes corresponds to the amplification of the 3 ramets with the same primer. More details of the experiment and some of the images tested here are available in [15].
(a) Original image. (b) Bands detected are marked with a white line.
Fig. 5. Original image and bands detected for one of the test images, UBC 02 d
The image pre-processing and band detection method was applied to the test images. One of the images, UBC 02 d, is shown in Fig. 5. In Fig. 5a the original image is shown, while in Fig. 5b the processed image with the bands detected (marked with a horizontal line) can be observed; the background (the quintile with the lowest intensity values) was eliminated to facilitate visualization. It can be observed that all but four bands are detected by our method, based on visual inspection by an expert with the aid of the running sum function (as shown in Fig. 4). Also, the template correction detects a rotation and corrects it. This rotation can be observed in Fig. 5a by comparing the topmost bands of the leftmost and rightmost lanes.
Table 1. Number of bands detected, positive predicted values and sensitivity for various images

Primer   Image ID   TB-V   TB-A   FP   FN   PPV     S
UBC 02   d          20     24     4    0    83.33   100
UBC 02   k          59     65     6    0    90.77   100
UBC 02   m          37     47     10   0    78.72   100
UBC 02   g          57     75     18   0    76.00   100
UBC 02   i          45     51     6    0    88.24   100
To assess the efficiency of the proposed band detection method from a more objective perspective, a numerical analysis based on the (visually observed and automatically detected) bands was performed. Table 1 depicts these numerical results. Column 1 shows the primer name, while column 2 contains the image name/identifier. Assuming that the bands detected by the expert represent the positives, and that otherwise the samples represent the negatives, we computed the following measures. TP (true positives) is the number of bands detected by both the method and the expert. FN (false negatives) is the number of bands not detected by the method but detected by the expert. FP (false positives) is the number of bands detected by the method that the human expert did not detect. The number of bands detected visually by the expert, TB-V = TP + FN, is listed in column 3, while TB-A = TP + FP represents the total number of bands detected automatically by the proposed method. The next two columns represent FP and FN, while the last two columns contain the positive predicted value, PPV = TP/(TP+FP), and the sensitivity, S = TP/(TP+FN). Note that the expert has a lot of experience in detecting bands visually, but his/her analysis is prone to errors, which implies that our results could be even better. It can be seen that the proposed method is able to capture most of the bands, leading to a PPV of near 80% in some cases, and even 90% on image UBC 02 k. Moreover, the sensitivity was 100% for all images, since no false negatives were detected. To further assess the performance of the band detection method, an experiment was performed using three different primers with a clone (3 ramets) and samples loaded on a single gel. Each group of three lanes (excluding the leftmost and rightmost lanes) corresponds to the same clone, which has been amplified with four different primers, UBC 01, UBC 02, OPC 01 and OPD 01. The aim of this experiment was to check that the proposed method can detect the same band pattern when 3 ramets of the same clone are amplified with a specific primer. Fig. 6 depicts such an image, UBCOPCD a, to which the band detection method was applied; the bands detected are shown in (b). From the figure, it is seen that the proposed method is able to detect the same band patterns for each of the three primers, UBC 02, OPC 01 and OPD 01. In the case of primer UBC 01, the proposed method does not detect the same band patterns for the three lanes. To objectively assess the accuracy of our method, we counted the number of bands that were detected in one lane but not in another (within the group of three lanes). This was done for each group of three, by taking pairs of lanes, and counting a false positive when a band was detected in one lane but not in the other of the pair. The total number of bands detected visually by the expert was TP + FP = 102, from which 96 were detected by the proposed method (TP), representing a sensitivity of 94.12% and a PPV of 100% – note that no false negatives were detected.
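For reference, the two quality measures used in Table 1 reduce to the following small computation; the example reuses the counts reported for image UBC 02 d.

```python
def ppv_and_sensitivity(tp, fp, fn):
    """Positive predicted value PPV = TP/(TP+FP) and sensitivity S = TP/(TP+FN), in percent."""
    return 100.0 * tp / (tp + fp), 100.0 * tp / (tp + fn)

# Image UBC 02 d: TB-V = TP + FN = 20, TB-A = TP + FP = 24, FP = 4, FN = 0.
print(ppv_and_sensitivity(tp=20, fp=4, fn=0))   # -> (83.33..., 100.0)
```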
(a) Original image. (b) Bands detected are marked with a dark gray line.
Fig. 6. Image UBCOPCD a with groups of three lanes for each ramet using different primers
However, these results are independent of the experimental errors, such as buffer concentration, electrophoresis temperature and added volume, among others. Note that these errors, unfortunately, cannot be controlled by the image processing method. We have also performed a comparison with existing software, Image J [3], for which the details can be found in [15].
5 Conclusions
A new method for detecting bands in RAPD images is proposed, which performs template correction and band detection. The proposed method has been tested on two different kinds of images obtained from electrophoresis agarose gels used to amplify DNA extracted from Eucalyptus globulus leaves. One of the tests compares against visual inspection by the user expert and checks for false positives, showing that the bands are detected in most of the cases, although these experiments are prone to errors in identifying samples. The best results were thus obtained on an image in which groups of three lanes belong to the same ramet, giving a sensitivity of over 94%. Plans for future work focus on enhancing the band detection method by comparing band shapes rather than using a threshold. Other possible topics to investigate are the case of non-parallel lanes, and the case in which the bands may not be parallel to the horizontal axis. These two problems are currently being undertaken.
References
1. Antoniol, G., Ceccarelli, M.: A Markov Random Field Approach to Microarray Image Gridding. In: Proc. of the 17th International Conference on Pattern Recognition, Cambridge, UK, pp. 550–553 (2004)
2. De Boor, C.: A Practical Guide to Splines. Springer, Heidelberg (2001)
3. Burger, W.: Digital Image Processing with Java and ImageJ. Springer, Heidelberg (2006)
4. Cao, W., Scoles, G., Hucl, P., Chibbar, R.: Phylogenetic Relationships of Five Morphological Groups of Hexaploid Wheat Based on RAPD Analysis. Genome 43, 724–727 (2000)
5. Casasoli, M., Mattioni, C., Cherubini, M., Villani, F.: A Genetic Linkage Map of European Chestnut (Castanea sativa Mill.) Based on RAPD, ISSR and Isozyme Markers. Theoretical Applied Genetics 102, 1190–1199 (2001)
6. Das, R., Laederach, A.: GelQuant User's Manual. Stanford University (2004), electronically available at http://safa.stanford.edu/release/GelQuantGuiv09b3UserGuide.pdf
7. Doyle, J., Doyle, J.: Isolation of Plant DNA from Fresh Tissue. Focus 12, 13–15 (1990)
8. Fernandez, M., Valenzuela, S., Balocchi, C.: RAPDs and Freezing Resistance in Eucalyptus globulus. Electronic Journal of Biotechnology 9, 303–309 (2006)
9. Groos, C., Gay, G., Perrenant, M., Gervais, L., Bernard, M., Dedryver, F., Charmet, G.: Study of the Relationship Between Pre-harvest Sprouting and Grain Color by Quantitative Trait Loci Analysis in the White x Red Grain Bread-wheat Cross. Theoretical Applied Genetics 104, 39–47 (2002)
10. Helgason, S.: The Radon Transform, 2nd edn. Springer, Heidelberg (1999)
11. Herrera, R., Cares, V., Wilkinson, M., Caligarip, D.: Characterisation of Genetic Variation between Vitis vinifera Cultivars from Central Chile Using RAPD and Inter Simple Sequence Repeat Markers. Euphytica 124, 139–145 (2002)
12. Bio-Rad Laboratories: Quantity One User's Guide. Bio-Rad Laboratories (2000), electronically available at http://www.calpoly.edu/bio/ubl/equip_files/q1_manual.pdf
13. Narváez, C., Valenzuela, J., Muñoz, C.: Comparación de RAPD y AFLP como Métodos de Identificación Genética de Vid Basados en el Estudio de Fragmentos Genómicos Anónimos. Agricultura Técnica 60, 320–340 (2000)
14. Nkongolo, K.: RAPD Variations Among Pure and Hybrid Populations of Picea mariana, P. rubens, and P. glauca, and Cytogenetic Stability of Picea Hybrids: Identification of Species-specific RAPD Markers. Plant Systematics and Evolution 215, 229–293 (1999)
15. Rueda, L., Uyarte, O., Valenzuela, S., Rodriguez, J.: Band Detection in Random Amplified Polymorphism DNA Images. Technical Report, Department of Computer Science, University of Concepción (2007), electronically available at http://www.inf.udec.cl/~lrueda/papers/BandDetectJnl.pdf
16. Saal, B., Struss, D.: RGA- and RAPD-derived SCAR Markers for a Brassica B-genome Introgression Conferring Resistance to Blackleg in Oilseed Rape. Theoretical Applied Genetics 111, 281–290 (2005)
17. Sudapak, M., Akkaya, M., Kence, A.: Analysis of Genetic Relationships Among Perennial and Annual Cicer Species Growing in Turkey Using RAPD Markers. Theoretical Applied Genetics 105, 1220–1228 (2002)
18. Tripathi, S., Mathish, N., Gurumurthi, K.: Use of Genetic Markers in the Management of Micropropagated Eucalyptus Germplasm. New Forests 31, 361–372 (2006)
19. Williams, J., Kubelik, A., Livak, K., Rafalski, J., Tingey, S.: DNA Polymorphisms Amplified by Arbitrary Primers are Useful as Genetic Markers. Nucleic Acids Research 18, 6531–6535 (1990)
Scale-Adaptive Segmentation and Recognition of Individual Trees Based on LiDAR Data
Roman M. Palenichka and Marek B. Zaremba
Université du Québec en Outaouais, Gatineau, Québec, Canada
{palenich,zaremba}@uqo.ca
Abstract. A scale-adaptive method for tree segmentation and recognition based on LiDAR height data is described. The proposed method uses an isotropic matched filtering operator optimized for the fast and reliable detection of local and multiple objects. Sequential local maxima of this operator indicate the centers of potential objects of interest such as trees. The maxima points also represent the seed pixels for the region-growing segmentation of tree crowns. The tree verification (recognition) stage consists of tree feature estimation and comparison with reference values. Various non-uniform tree characteristics are taken into account when making a decision about tree presence at the found location. Experimental examples of the application of this method to tree detection in LiDAR images of forests are provided.
Keywords: remote sensing, LiDAR data, Digital Surface Model (DSM), image segmentation, tree recognition, multi-scale isotropic matched filtering (MIMF), local scale, feature estimation.
1 Introduction
Remote-sensing imagery acquired with different physical sensors at a resolution of one meter or less per pixel creates the possibility of automatic delineation (segmentation) of individual tree crowns. A substantial contribution to solving this task is made by LiDAR (Light Detection And Ranging) imaging, which can provide a high-resolution representation of object surfaces (canopy) in the form of a digital surface model (DSM) [1-3]. The advent of LiDAR technology prompted the development of new image analysis methods dealing with LiDAR data in particular [4-8]. However, some inherent difficulties arise when applying conventional pattern recognition techniques to the detection of individual trees. Varying tree heights and crown sizes, the absence or presence of leaves, the overlap of tree crowns in dense forests and the variety of crown shapes are among the hindering factors. A common existing approach to tree detection relies on the extraction of maximum height points on the canopy surface [3-6]. This is usually followed by a detailed data analysis in neighborhoods of the maxima points. For example, the local maxima, which may correspond to tree positions, are selected as seed points in the region-growing method for tree crown segmentation [7, 8].
However, this direct method of local height maxima gives many false alarm positions because of the height and width variability and the presence of other local objects. Another example is the method of gray-scale morphology used to locate the treetops in a dense forest area [3]. Some of the existing methods also capitalize on the canopy penetration property of the LiDAR beam by analyzing both the first and the last return data [6, 7]. In their majority, the existing methods require prior determination of the ground level (i.e., the Digital Terrain Model, DTM) and of other objects (e.g., buildings), which is a separate task and requires additional computational resources. Incorrect ground data, which may occur with the DTM approach, can significantly deteriorate the tree detection results. Recently, some software products for LiDAR data processing and object recognition have become available. One of the most sophisticated among them is the ArcGIS extension package "Lidar Analyst" (Visual Learning Systems, Inc.) [6]. The tree detection module is based on a DTM extraction as well as a separate procedure for building recognition. Although the package provides several object detection capabilities, its first version has shown some drawbacks, which are inherent in the underlying tree detection method (referred to as the LA method). A different approach to individual tree detection in forest areas is the crown delineation method based on the valley-following algorithm [2]. This method is an extension to LiDAR imagery of a similar method for tree crown delineation in aerial photos and multi-spectral images [8]. A set of rules is applied during the delineation process in order to cope with the variable tree heights and crown sizes. However, the method is not fully automatic, since it requires some preliminary masking of forest areas and a priori knowledge of key parameters. Multi-resolution and multi-scale approaches contribute substantially to the efficiency of object detection, although sensitivity to orientation, low contrast and noise greatly influences the detection results [10, 11]. A common example of the multi-resolution approach is the wavelet transform, which has also been applied to LiDAR imagery for tree detection [10]. Another effective technique for object detection in images is the matched filtering approach [13, 14]. In order to handle multi-scale objects and achieve rotational invariance, the method of multi-scale isotropic matched filtering (MIMF) was developed [14]. This method can also be considered a particular example of visual attention operators, which have proven to be effective for the fast detection of multiple and local objects, such as trees, in images [15-17]. In this paper, the MIMF method [14], with certain modifications, is applied to the task of tree recognition. In particular, the proposed local scale determination in the form of a scale map is effectively used to perform a scale-adaptive segmentation of tree crowns. In our method, the three-dimensional (3-D) height data of the LiDAR first return measurements are transformed to a DSM with a regular grid representation that can be treated as a grey-scale height image. If terrain surface data are available, then the MIMF method is applied to the difference image between the DSM and the DTM, which is called a canopy height model (CHM). The design of the MIMF operator as well as the feature extraction is based on the proposed modeling of LiDAR data (Section 2).
The model-based tree recognition method consists of the following steps. The first step is the localization of the objects of interest, invariant to local object size and orientation, by means of the MIMF local maxima (Section 3).
It also includes the estimation of the local scale at each grid point of the DSM. Segmentation of the potential tree crowns is performed in the second step (Section 4). The third step is the estimation of relevant features and tree presence verification.
2 Concise Description of Tree Surfaces
The main goal of the LiDAR data modeling (description) is to concisely represent the 3-D shape characteristics of the objects of interest by a few transformation-invariant height (surface) descriptors and planar shape features. In this local representation, an object fragment in the DSM or CHM of the LiDAR data can be any object of interest which has certain shape characteristics and a specified mean height above the ground. The height data are described separately from the planar shape of their projection region on the ground surface, which is called the object support region. The proposed surface modeling is a parametric representation of the DSM (CHM) data in the form of a function f(m,n) in a fragment centered at the l-th local object position (i,j)_l (e.g., a tree location) with two dominant height levels: o_l and b_l. An additive perturbation (noise) term is combined with f(m,n); it models fluctuations in the height data and can be mathematically described by a random field model. The two dominating levels, o_l and b_l, are the height values of the two regions representing the object region (e.g., the tree support region) and the ground region around it (Fig. 1). Additionally, a linear smoothing filter is introduced in this model in order to adequately represent the DSM (CHM) smoothness. The local representation of the DSM implies the following characteristics of the height distribution within the object (e.g., tree) support region. The object mean height, h_l = o_l − b_l, is used in the detection in order to minimize the effect of small objects and outliers. The non-homogeneity of the object height surface concisely characterizes its type and can be measured as the standard deviation λ of the perturbation term. Finally, the height surface sharpness is used to distinguish between different types of object surfaces. These height characteristics are combined with planar shape features to obtain a complete description of the image fragment within the l-th tree support region, S_l. An efficient approach to describing multi-scale planar shape is offered by mathematical morphology, which involves a disk-shaped structuring element at different scales [3, 13, 14]. Here, the local scale describes the local object size and can be morphologically defined as the diameter of the greatest structuring element centered at a given point and inscribed into the object support region. For example, the object size in the case of tree detection is the tree crown width. In the current modeling, one integral shape feature was involved: shape radial symmetry. It can be measured as the ratio of tree crown pixels to ground pixels in a ring region of a given width around the disk region. It should be noted that this model represents the case of isolated trees. However, it can be easily extended to the case of multiple trees connected with each other (i.e., a group of trees or a dense forest area) as an envelope of the individual tree surfaces. The two-dominant-level model is then applicable only to treetops and the valleys between the treetops inside a dense forest area.
[Fig. 1 labels: tree mean height h_l; tree maximal height; dominant levels o_l and b_l; DSM of three isolated trees, f(m,n); crown center (i,j)_l; planar shape; crown support regions {S_l}; local scale ρ(i,j)_l; tree crown width w_l]
Fig. 1. Two-dominant-level representation of DSM (CHM) in LiDAR height measurements: cross-section of the model for three isolated trees
3 Localization of Potential Trees
The tree detection method is based on the multi-scale isotropic matched filtering (MIMF) and the determination of its local maxima, which indicate the centers of the potential objects of interest [14]. A flowchart of the proposed algorithm for tree recognition and segmentation is shown in Fig. 2. Potential tree localization is the main step in this algorithm. The direct approach of local (absolute) height maxima gives many false alarm positions because of the tree height and width variability and the presence of other local objects [3-6]. Moreover, it is not applicable when the DSM is the only available data. Therefore, the MIMF operator is applied first and its local maxima are determined in a region of interest A:

$(i_f, j_f)_l = \arg\max_{(i,j) \in A} \{ \Phi[f(i,j), \rho(i,j)], \ (i,j) \notin \Gamma_l \} ,$   (3.1)
where f(i,j) is the input LiDAR image (i.e., DSM or CHM) of a given area, Φ[f(i,j), ρ(i,j)] is a non-linear matched filter function at the point (i,j) and local scale ρ(i,j), and (i_f, j_f)_l are the coordinates of the l-th maximum. The function Φ[f(i,j), ρ(i,j)], computed for all the grid points, is also called the MIMF map [14]. The region Γ_l in Eq. (3.1) represents the masking region that excludes areas around previous maxima. Four conditions for object presence were considered in the design of the MIMF operator: 1) two different dominant levels of height corresponding to the object and the ground around it; 2) local homogeneity of the object (or ground) heights; 3) maximal local scale (object width); 4) a high level of radial symmetry of the planar shape.
An example of an explicit expression for the computation of Φ[f(i,j), ρ(i,j)] is given in [14]. Object region segmentation at the l-th location, (i_f, j_f)_l, gives a bitmap of object pixels versus the background (e.g., tree points vs. non-tree points). The tree verification consists of the estimation of object features x1, ..., xN at the point (i_f, j_f)_l and their comparison with detection thresholds corresponding to these characteristics. Besides the independent output of the individual tree (crown) segmentation, this step in the flowchart in Fig. 2 is also used for the estimation of feature values during the tree presence verification.
[Fig. 2 blocks: LiDAR data (DSM, DTM); scale map, MIMF map; determination of next MIMF maximum; segmentation of potential tree crown; feature estimation; verification of tree presence; vectors of individual tree features; bitmap of tree crowns; search cycle for next object of interest]
Fig. 2. Flowchart for the scale-adaptive segmentation and recognition of individual trees using the MIMF approach
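The sequential maxima search of Eq. (3.1) can be sketched as follows. Since the explicit MIMF operator is only referenced to [14], the sketch takes a precomputed MIMF map and a scale map as inputs, and uses a disk-shaped masking region Γ_l of our own choosing around every accepted maximum.

```python
import numpy as np

def sequential_maxima(mimf_map, scale_map, n_objects, mask_factor=1.0):
    """Sequential search of Eq. (3.1): repeatedly take the global maximum of
    the MIMF map inside the region of interest, excluding a masking region
    Gamma_l around every previously found maximum."""
    work = mimf_map.astype(float).copy()
    rows, cols = np.indices(work.shape)
    found = []
    for _ in range(n_objects):
        i, j = np.unravel_index(np.argmax(work), work.shape)
        found.append((i, j))
        radius = mask_factor * max(scale_map[i, j], 1.0)
        work[(rows - i) ** 2 + (cols - j) ** 2 <= radius ** 2] = -np.inf  # Gamma_l
    return found

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    phi = rng.random((60, 60))            # stand-in for the MIMF map
    rho = np.full((60, 60), 4.0)          # stand-in for the local scale map
    print(sequential_maxima(phi, rho, n_objects=3))
```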
In the new version of the MIMF operator, the local scale value ρ(i,j) has to be estimated at each point (i,j) prior to the computation of the filter function Φ[f(i,j), ρ(i,j)]. The local scale, which is estimated independently over the whole image plane, produces the local scale map. It is also used in the scale-adaptive feature extraction and the scale-adaptive segmentation. According to the adapted morphological scale definition, the scale determination consists of selecting the greatest structuring element which is centered at (i,j) and inscribed into the object (tree) support region. Since the object support region is not given explicitly, the scale has to be estimated using the height function f(i,j) only:
$\rho(i,j) = \arg\max_{0 \le m \le M-1} \left\{ c^2(i,j,m) - \alpha \cdot d^2(i,j,m) \right\} ,$   (3.2)
where c(i,j,m) is an estimate of the object mean height with respect to the ground height at scale m, d(i,j,m) is an estimate of the height deviations at scale m in the object region as a measure of its non-homogeneity, M is the total number of scales, and α is the constraint coefficient [14]. The homogeneity estimation by d(i,j,m) is generic and may include pre-filtering of the height surface, such as differentiation, to cope with the presence of a slope or with surface convexity [16]. A more accurate crown width estimate can easily be obtained from the tree crown segmentation (Section 4.1).
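One plausible reading of the scale selection in Eq. (3.2) is sketched below: for every pixel, disk means approximate the object height c, ring means approximate the surrounding ground, and the in-disk standard deviation approximates the deviation d. The concrete estimators, radii and α are our assumptions, not the definitions given in [14].

```python
import numpy as np
from scipy import ndimage

def local_scale_map(f, radii=(2, 4, 6, 8, 10), alpha=1.0):
    """Local scale rho(i,j) maximizing c^2 - alpha * d^2 over disk radii (Eq. (3.2))."""
    best = np.full(f.shape, -np.inf)
    rho = np.zeros(f.shape)
    for r in radii:
        yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
        disk = (yy ** 2 + xx ** 2 <= r ** 2).astype(float)
        yy2, xx2 = np.mgrid[-2 * r:2 * r + 1, -2 * r:2 * r + 1]
        ring = ((yy2 ** 2 + xx2 ** 2 <= (2 * r) ** 2) &
                (yy2 ** 2 + xx2 ** 2 > r ** 2)).astype(float)
        mean_in = ndimage.convolve(f, disk / disk.sum(), mode="nearest")
        mean_sq = ndimage.convolve(f ** 2, disk / disk.sum(), mode="nearest")
        mean_out = ndimage.convolve(f, ring / ring.sum(), mode="nearest")
        c = mean_in - mean_out                                   # object vs. ground height
        d = np.sqrt(np.maximum(mean_sq - mean_in ** 2, 0.0))     # in-disk height deviation
        score = c ** 2 - alpha * d ** 2
        better = score > best
        rho[better] = 2 * r                                      # scale as the disk diameter
        best[better] = score[better]
    return rho

if __name__ == "__main__":
    f = np.zeros((80, 80))
    f[30:50, 30:50] = 10.0                                       # a flat 20x20 "crown"
    print(local_scale_map(f)[40, 40])
```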
4 Adaptive Segmentation and Recognition of Individual Trees
4.1 Scale-Adaptive Segmentation by Region Growing
An accurate solution to the segmentation task is necessary for the delineation of individual trees, since the tree support regions can be located within a small area (e.g., a 5x5 window) on the LiDAR map. In some dense forest areas, treetop regions within a 3x3 window are the only results of tree segmentation. Since the MIMF maxima are located at the potential tree centers, they represent the best candidates for the seed points in the region-growing segmentation [18]. There are two main reasons to select them as the seeds for the region growing. First, the MIMF maxima are located at the support region centers of the potential trees. Second, the MIMF method provides the local scale value, which represents an estimate of the width of the tree crown. Although the proposed region-growing algorithm is similar to the conventional region-growing procedure, it has distinguishing characteristics related to the MIMF method. For a pixel f(i,j) labeled as an individual tree, the condition that a neighboring pixel belongs to the same region is:

if |f(m,n) − f(i,j)| < η, then the point (m,n) belongs to the same tree support region,   (4.1)

where (i,j) are the coordinates of the labeled pixel, (m,n) are the coordinates of the current neighboring pixel, and η is the local height threshold. At the beginning of the region growing, the point (i,j) coincides with the seed point (i_f, j_f). However, this growing condition is not sufficient in the case of multiple trees in dense forest areas and of other objects of interest, which might occur as neighbors to a given tree. Therefore, between-object border conditions have to be used to distinguish between neighboring objects. For example, in order to delineate individual trees interconnected by their crowns, a valley condition is introduced:

if (f_xx(m,n) + f_yy(m,n)) < μ, then the point (m,n) belongs to the tree support region,   (4.2)

where f_xx(m,n) and f_yy(m,n) are the estimates of the height second derivatives at the point (m,n) and μ is the threshold for tree-valley presence. Condition (4.2) means that the height of the current object (tree) is not increasing while the neighboring object (tree) height is not decreasing. If there are other possible objects (e.g., buildings), then additional segmentation features and conditions associated with them, similar to Eq. (4.2), have to be introduced into the region-growing procedure. The object scale in Eq. (3.2) gives an approximate estimate of the local object size, i.e., the tree crown width. Therefore, height surface characteristics estimated within a circular neighborhood region with a diameter equal to or less than ρ(i_f, j_f) characterize the local variability of the object height. The scale-adaptive thresholds η and μ in Eq. (4.1) and Eq. (4.2) are calculated as given percentages of the mean square deviation of the height in this neighborhood region.
Since the MIMF maxima points correspond to all possible object locations including, for example, buildings, shrubs, poles and others, it is necessary to recognize the tree
presence at the current maximum (if,jf). Due to the preliminary localization of the objects of interest and the local scale estimation, the variability of feature values with respect to object classes (e.g., tree class versus non-tree class) is reduced substantially [19]. Therefore, the tree recognition at this stage is transformed into a simple step of tree presence verification. It consists of comparing the feature values, x1, ..., xN, estimated at the location (if,jf), with their reference values θ1, ..., θN using inequalities of two types:

Type I condition: ξk = 1 if |xk − θk| < δk, or Type II condition: ξk = 1 if |xk − θk| ≥ δk; otherwise ξk = 0,    (4.3)

where δ1, ..., δN are the detection thresholds and ξk is the feature value indicator represented by a binary variable. The final condition for a tree presence in this verification process is:

if Σ_{k=1}^{N} ξk = N, then the point (if,jf) is the tree location.    (4.4)
The condition in Eq. (4.4) means simultaneous satisfaction of all N inequalities (4.3) and can be implemented in a sequential manner, one feature at a time. The second, inverse condition in Eq. (4.3) is associated only with certain feature types in order to exclude other objects (non-ground locations) for which these features take a specified range of possible values. For example, the feature, which characterizes the surface sharpness, is used with the type I condition since the tree crown surfaces have a significant difference in the mean height values. The type II condition is verified in conjunction with the mean height feature. Here, the feature reference value and the detection threshold are set to a sufficiently low level above the ground in order to eliminate low-height objects (shrubs, mounds, cars, etc.). In the current version of the algorithm, N=5 features have been used: x1 – mean height; x2 – surface homogeneity; x3 – surface sharpness; x4 – width of support region; x5 – radial symmetry of support region.
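A minimal sketch of the verification step in Eqs. (4.3)-(4.4) follows; the feature values, reference values, thresholds and per-feature condition types are assumed to be supplied by the caller, and the helper name is illustrative rather than taken from the paper.

```python
def verify_tree(features, references, thresholds, types):
    """Eqs. (4.3)-(4.4): the point is accepted as a tree location only if every
    feature indicator xi_k equals 1 (Type I: |x - theta| < delta,
    Type II: |x - theta| >= delta)."""
    for x, theta, delta, cond in zip(features, references, thresholds, types):
        if cond == "I":
            xi = 1 if abs(x - theta) < delta else 0
        else:  # Type II
            xi = 1 if abs(x - theta) >= delta else 0
        if xi == 0:          # one failed inequality rejects the candidate
            return False
    return True              # all N inequalities hold, Eq. (4.4)
```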
5 Experimental Results

The study site for the individual tree detection in forest areas consisted of four plots located in a boreal forest area near the town of Whitecourt (Alberta, Canada). The test plots, of approximate size 2500 m², are composed of mixed species including coniferous trees (e.g., spruce, pine, fir) and deciduous trees (mostly aspen). The LiDAR data were acquired with a LiDAR device having a pulse frequency of 100 kHz and a beam footprint of 41 cm. The height measurement precision of the input LiDAR data has the vertical resolution [...]

[...] Dq(X, Yj), it follows that D(x, yj) > Dq(X, Yj).    (2)
Let Di, i = 1, ..., k, be the squared distance between x and the current yfi during the PDS process. Before the PDS is started, we set the initial current fi = 0, i = 1, ..., k, and the corresponding initial Di is set to ∞. We check each vector (except the initial current Yf1 itself) from Y1 to Yt. For each vector Yj to be searched, we compute |Y_1^j − X_1|. Suppose |Y_1^j − X_1| > √Dk; then D(x, yj) > Dk. Hence, j does not belong to Fk and Yj can be rejected. If |Y_1^j − X_1| < √Dk, then we employ the following fast search process. Beginning with q = 2, for each value of q, q = 2, ..., 2^n × 2^n, we first evaluate Dq(X, Yj). Suppose Dq(X, Yj) > Dk; then D(x, yj) > Dk and Yj can be rejected. Otherwise, we go to the next value of q and repeat the same process. Note that the partial distance Dq(X, Yj) can be expressed as

Dq(X, Yj) = Dq−1(X, Yj) + (Xq − Yq^j)².    (3)

Therefore, the partial distance for the new q can reuse the partial distance of the previous q, and only the computation of (Xq − Yq^j)² is necessary. This PDS process is continued until Yj is rejected or q reaches 2^n × 2^n. We compare D(x, yj) with Dk when q = 2^n × 2^n. If D(x, yj) < Dk, then we remove fk from Fk, insert j into Fk, and re-sort the elements in Fk. After the final Fk is found, we finally compute ki for each class ωi. The class having the highest ki is then identified as the class to which x belongs. This completes the PDS-based kNN classification.
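A software-level sketch of this PDS-based kNN search, assuming plain NumPy arrays and squared Euclidean distance, is shown below; the codebook layout and the class-vote step are simplified relative to the hardware description that follows.

```python
import numpy as np

def pds_knn(x, Y, labels, k):
    """Partial-distance-search kNN: Y is a (t, d) array of design vectors,
    labels their classes; returns the majority class among the k nearest."""
    t, d = Y.shape
    best = [(np.inf, None)] * k            # sorted (distance, index) pairs; Dk = best[-1][0]
    for j in range(t):
        Dk = best[-1][0]
        # first-coefficient test: (Y1 - X1)^2 > Dk already implies D(x, yj) > Dk
        if (Y[j, 0] - x[0]) ** 2 > Dk:
            continue
        D, rejected = 0.0, False
        for q in range(d):                 # accumulate the partial distance Dq
            D += (x[q] - Y[j, q]) ** 2
            if D > Dk:                     # Dq > Dk implies D(x, yj) > Dk
                rejected = True
                break
        if not rejected and D < Dk:
            best[-1] = (D, j)              # replace fk and re-sort Fk
            best.sort(key=lambda p: p[0])
    votes = {}
    for _, j in best:
        if j is not None:
            votes[labels[j]] = votes.get(labels[j], 0) + 1
    return max(votes, key=votes.get)
```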
3 The Architecture

3.1 Outline of the Subspace PDS for Hardware Implementation
Following is the complete outline of the proposed PDS for hardware realization:

Assume: k, m, l, and δ are given. Y_Lm^1, ..., Y_Lm^t are computed and stored with l bitplanes removed.
Step 1: Fetch an input vector x. Compute X_Lm and remove l LSB bitplanes from X_Lm. Initialize j = 1.
Step 2: Set the initial fi = i, i = 1, ..., k (i.e., Fk = {1, 2, ..., k}). Compute Di = D(X_Lm, Y_Lm^fi), i = 1, ..., k. Sort Fk in accordance with Di, i = 1, ..., k.
Step 3: If j = t then stop. Otherwise, j = j + 1. Initialize q = δ.
Step 4: Compute Dq(X_Lm, Y_Lm^j) using eq. (4).
Step 5: Compare Dq(X_Lm, Y_Lm^j) with Dk. If Dq(X_Lm, Y_Lm^j) > Dk, then go to Step 3 for the search of the next codeword.
Step 6: If q = 2^m × 2^m, then set fk = j, update Dk = D(X_Lm, Y_Lm^fk), and go to Step 3. Otherwise, q = q + δ; go to Step 4 to continue the partial distance computation and comparison.
Step 7: The Fk obtained after the PDS search is finished is the final Fk for x.
From the outline, we observe that the algorithm is a subspace PDS because it is based on X_Lm and Y_Lm^1, ..., Y_Lm^t. The algorithm utilizes bitplane reduction, since l LSB bitplanes are removed from X_Lm and Y_Lm^1, ..., Y_Lm^t prior to the PDS. Finally, the computation of Dq(X_Lm, Y_Lm^j) is based on

Dq(X_Lm, Y_Lm^j) = Dq−δ(X_Lm, Y_Lm^j) + Σ_{i=q−δ+1}^{q} (Xi − Yi^j)²,    (4)

where Xi and Yi^j are the i-th coefficients of X_Lm and Y_Lm^j, respectively. Consequently, the PDS employs multiple-coefficient accumulation. It can therefore be concluded that the subspace search, bitplane reduction and multiple-coefficient accumulation techniques are adopted in the algorithm for the hardware realization of the fast kNN classifier.
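As a software analogue of Eq. (4), the inner rejection test can accumulate δ coefficients at a time before each comparison. The sketch below assumes x and y are the subspace coefficient vectors X_Lm and Y_Lm^j already truncated by l bitplanes; it is an illustration rather than the hardware datapath itself.

```python
def partial_distance_reject(x, y, Dk, delta):
    """Multiple-coefficient accumulation of Eq. (4): add delta squared
    differences per step and reject as soon as the partial distance
    exceeds the current k-th best distance Dk."""
    D = 0.0
    for q in range(0, len(x), delta):
        for i in range(q, min(q + delta, len(x))):
            D += (x[i] - y[i]) ** 2
        if D > Dk:
            return None          # Y^j rejected: D(x, y) > Dk
    return D                     # full distance, candidate for updating Fk
```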
Fig. 1. The VLSI architecture of the proposed PDS algorithm
3.2 Single Module Hardware Implementation
The basic VLSI architecture for realizing the proposed PDS is shown in Fig. 1, which contains a ROM, a DWT unit, a VSDC unit, an accumulator, a comparator, a control unit, a sorting circuit, and a number of registers storing intermediate results. The ROM contains Y_Lm^1, ..., Y_Lm^t for the fast search. The DWT unit is used for computing X_Lm of an input vector x. The VSDC unit and the accumulator are used to compute and store the partial distance Dq(X_Lm, Y_Lm^j) for q = δ, 2δ, ..., 2^m × 2^m. The comparator then compares Dq(X_Lm, Y_Lm^j) with Dk. The comparison results are reported to the control unit, which determines whether the reset of the accumulator and the update of Fk are necessary. The sorting circuit is activated when Fk needs to be updated. The circuit first replaces the current fk by j and updates the corresponding Dk. Afterwards, the circuit re-orders the elements of Fk in accordance with Di, i = 1, ..., k.
Table 1. Truth table for the multiplexer in the sorting module

S_{i−1}   S_i   Output
   x       0    fi, Di
   0       1    j, D(X, Y^j)
   1       1    f_{i−1}, D_{i−1}
After the sorting operations, each of the new fi, i = 1, ..., k, and its corresponding Di will be stored in registers for subsequent PDS comparisons. The sorting architecture deserves particular mention. The proposed solution can sort N inputs in N clock cycles, and the area complexity increases only linearly with the number of sorting modules. The sorting architecture consists of a chain of k identical sorting modules, where each sorting module contains a comparator, a multiplexer, and a buffer. The output of the multiplexer of each module i contains both the index ji and the corresponding distance Di, i = 1, 2, ..., k. When D(X_Lm, Y_Lm^j) ≤ Dk, the index j with the associated distance D(X_Lm, Y_Lm^j) is entered into every sorting module in the sorting architecture. Each sorting module then carries out its own sorting process, which can be divided into two steps. In the first step, the comparator compares D(X_Lm, Y_Lm^j) with Di. The comparison result Si is set to 1 if D(X_Lm, Y_Lm^j) is smaller than Di and to 0 otherwise. The comparison result Si is given as one of the control line values of the multiplexers of the i-th and (i + 1)-th modules. In the second step, the multiplexers determine the output contents according to the control line values. The truth table for determining the output contents from the control line values is shown in Table 1. Let T(x) be the number of clock cycles required for the completion of the subspace PDS given an input vector x. The latency L is then defined as

L = (1 / (tW)) Σ_{j=1}^{W} T(xj),    (5)
where W is the total number of input vectors and t is the total number of codewords. The latency L thus represents the average number of clock cycles required per codeword in the processing of each input vector. It can be observed from [3] that the average latency of the PDS with a relatively small δ > 1 for detecting an undesired codeword is close to 1. In addition, it also follows from [3] that the subspace PDS with small δ and large l can lower the area complexity. Therefore, with a lower area complexity and a faster clock rate, the proposed PDS is able to achieve an average latency close to 1 for squared distance calculation and undesired codeword detection.

3.3 Multiple Module Hardware Implementation
Fig. 2. Architecture for the multiple-module PDS search

Since the kNN classification operations of different input vectors can be performed independently, the throughput of a kNN classifier will be substantially
increased by the concurrent search of codewords for multiple input vectors. The basic PDS VLSI architecture shown in Fig. 1 can be extended for this purpose. Fig. 2 shows the VLSI architecture for this extension. The architecture contains M modules, where each module is responsible for the PDS of a different input vector. Therefore, M input vectors can be processed concurrently by this structure. Input vectors are fetched one at a time from a host processor. Accordingly, the pin count of the architecture remains low as M increases. In this single-fetch scheme, there are no concurrent DWT operations. All modules can therefore share the same DWT unit for computing X_Lm. Moreover, as shown in Fig. 2, all modules also share the same ROM for accessing Y_Lm^1, ..., Y_Lm^t. This hardware sharing scheme may reduce the chip area when the number of modules becomes high. Each module has its own VSDC unit, as well as an accumulator, a comparator and a local control unit, for independent PDS operation. The input to each module is X_Lm. Therefore, each module is responsible for only Steps 2-6 of the PDS algorithm presented in Section 3.1. Although the proposed PDS algorithm has a low average latency, as shown in [3], the actual latency for
an input vector x may vary in accordance with the degree of energy compaction of x in the wavelet domain. Therefore, in the multiple-module VLSI architecture, the final closest codewords of successive input vectors may not be obtained in first-in first-out (FIFO) order. It is then necessary to include a global control unit (GCU) in the system for the scheduling of the data flow. Let Ai, i = 1, ..., M, be the i-th module in the architecture. In our design, we associate each Ai with two flag bits, Module_Available_of_Ai and Result_Available_of_Ai, which indicate whether module Ai is idle and whether the PDS search result of Ai is available, respectively. These flag bits are used in conjunction with the global control unit. Let P be the set of modules currently available, that is, P = {Ai : Module_Available_of_Ai = true}. The control unit will assign X_Lm to Ap, where

p = min{i : Ai ∈ P}.    (6)
Likewise, let Q be the set of busy modules that have completed their PDS operations; in other words, Q = {Ai : Result_Available_of_Ai = true}. The final closest codewords of Aq will be selected as the output of the circuit when

q = min{i : Ai ∈ Q}.    (7)

After the final closest codewords of Aq have been read, the module Aq is assigned back to P. In addition to dataflow scheduling, the GCU is also responsible for the interaction with an external program/device, which performs the input block loading and PDS results collection operations. Our design provides a simple interface for this interaction, which contains a status register and input and output buffers. There are 5 flag bits in the status register: Input_Data_Available, Write_Back_Result_Complete, Ready_for_Input, Ready_for_Output, and Module_Available. The flag Ready_for_Input indicates whether the input buffer is ready for access. It becomes true when the input buffer is empty and Module_Available is also true, where Module_Available is given by

Module_Available = Module_Available_of_A1 + Module_Available_of_A2 + ... + Module_Available_of_AM.

When the external program/device detects that Ready_for_Input = true, it loads a block x into the input buffer. After the completion of the loading operations, it sets Input_Data_Available to true, notifying the GCU that an input block is now ready for the PDS operation. The GCU loads x from the input buffer into the DWT unit for the computation of X_Lm. It then assigns the resulting X_Lm to Ap, where p is determined by eq. (6). Note that the GCU does not monitor the progress of the PDS process in Ap.
Fig. 3. The position of the kNN architecture in the ALU of the NIOS CPU
The flag Result_Available_of_Ap will automatically be set to true by the local controller of Ap when all the codewords in the ROM have been scanned. In this case, the flag Ready_for_Output is updated by

Ready_for_Output = Result_Available_of_A1 + Result_Available_of_A2 + ... + Result_Available_of_AM.

The flag Ready_for_Output is used for informing the external program/device that the PDS search results of some modules are available. The external program/device can then retrieve the results from the output buffer. Upon completion of the result retrieval, it sets the flag Write_Back_Result_Complete to true, informing the GCU that the PDS results of the module Aq have been written back. The GCU then sets Module_Available_of_Aq to true, so that the module is available for the PDS search of the next input block. Based on the GCU, a simple external device/program can be developed for sending the input blocks to the PDS architecture and collecting the kNN classification results from the architecture. When all modules are busy and Ready_for_Output = true, the program collects the kNN classification results from the architecture.

3.4 The PDS Architecture as a Custom User Logic in a Softcore CPU
For physically measuring the performance of the proposed architecture, the PDS architecture has been implemented as user logic in the ALU of the softcore NIOS CPU, which is a general purpose, pipelined, single-issue reduced instruction set computer (RISC) processor. The major difference between normal microprocessors and softcore processors is that softcore processors have
a re-configurable architecture. Accordingly, custom VLSI architectures can be embedded in the softcore processors for physical performance measurement. Fig. 3 shows the position of the kNN architecture in the ALU of the NIOS CPU. As shown in the figure, the kNN architecture and the basic ALU unit share the same input and output buses. The kNN architecture can be accessed by the custom instructions associated with the architecture. The custom instructions accessing the buffers and status register of the kNN architecture can be called using macros in C/C++. Therefore, the C program for kNN classification running on the NIOS CPU involves only the input vector loading and kNN results collection operations. The C code implementation is straightforward, and requires few computations. This can be beneficial for embedded systems where the CPU has limited computational capacity for high level language execution.
4 Experimental Results and Conclusions
In this section we present some physical performance measurements of the proposed FPGA implementation. In our experiment, there are 5 classes (i.e., N = 5). Each class represents a texture, as shown in Fig. 4. The design set Ci associated with each class contains 256 vectors. Therefore, there are 1280 vectors (t = 1280) in the design set C. The dimension of the vectors is 8 × 8 (i.e., n = 3). For the subspace search, the dimension of X_Lm and Y_Lm^j is 4 × 4 (i.e., m = 2). We also set k = 5 for the kNN classification. The whole kNN system, consisting of the NIOS CPU embedded with the proposed kNN architecture, is implemented on the Altera Stratix EP1S40 FPGA, which has a maximum of 41250 LEs [8]. The operating frequency of the system is 50 MHz. Table 2 shows the area complexity (LEs) and the CPU time of the complete kNN classification systems consisting of the NIOS CPU embedded with the proposed PDS architecture for various M. The system consumes only 6391 LEs for the implementation of both the NIOS CPU and the subspace PDS circuit when M = 1,
Fig. 4. Five textures for classification

Table 2. The area complexity of the complete kNN classification system consisting of the NIOS CPU embedded by the proposed PDS architecture with various M (design set = 1280)

M               1        2       3
CPU time (μs)   137.77   69.13   46.13
LEs             6391     7441    8384
Table 3. The CPU time of various kNN systems

CPU type             Algorithm      Implementation                        CPU time (μs)
Pentium IV 3.0 GHz   kNN            Software                              16079.2
Pentium IV 1.8 GHz   kNN            Software                              24912.5
Pentium IV 3.0 GHz   Subspace PDS   Software                              262.5
Pentium IV 1.8 GHz   Subspace PDS   Software                              497.5
NIOS 50 MHz          Subspace PDS   Hardware/Software codesign (M = 1)    137.77
NIOS 50 MHz          Subspace PDS   Hardware/Software codesign (M = 2)    69.13
NIOS 50 MHz          Subspace PDS   Hardware/Software codesign (M = 3)    46.13
which is significantly less than the maximum capacity of the target FPGA device. Moreover, although increasing the number of modules for the implementation of the subspace PDS requires a larger number of VSDC units and a higher routing overhead for accessing the ROM, it can be observed from Table 2 that, for the same t, the execution time of the M-module system is approximately 1/M times that of the single-module system. That is, the M-module system is M times faster than the single-module one, while the extra area cost (including the NIOS CPU) increases by only about 16% per added module. It is thus clear that the proposed architecture is effective for the hardware implementation of PDS-based kNN classifiers. The execution time of the NIOS is compared with that of the Pentium IV for various PDS operations in Table 3, where the execution time is defined as the average CPU time (in μs) required for identifying the final class of each input vector. The NIOS CPU is running with the support of the proposed hardware, whereas the Pentium IV CPU is executing solely C code. Note that, for the NIOS systems with software/hardware codesign, the measurements in fact cover the complete execution process for the codeword search, including the memory accesses (the fetching of NIOS instructions and source vectors), buffer filling, the execution of NIOS instructions, the execution of the PDS subspace hardware, and the retrieval of the encoding results. It can be observed from Table 3 that the average CPU times of the single-module hardware acceleration and of the Pentium IV 3.0 GHz are 137.77 μs and 262.5 μs, respectively. Therefore, the speedup is 1.91 with the aid of the single-module PDS circuit. The execution time of the NIOS can also be substantially reduced by increasing the number of modules M in the circuit. In particular, when M increases from 1 to 2, the reduction in average CPU time is 49.82% (from 137.77 to 69.13 μs). Moreover, it can also be observed from the table that the average CPU times of the 50 MHz NIOS systems with M = 1, 2, 3 are all lower than that of the 3.0 GHz Pentium IV system executing the same subspace PDS algorithm. Table 3 also includes the CPU time of the basic wavelet PDS VQ encoding algorithm executed on the Pentium IV without subspace search. In this case, the CPU time is 16079.2 μs for the Pentium IV 3.0 GHz CPU, which is 348.56 times longer than that of the NIOS CPU with three-module hardware support. All these facts demonstrate the effectiveness of the proposed implementation. In practical applications, the proposed circuit may be beneficial for the
implementation of a smart camera with low network bandwidth consumption, which directly produces the classification results instead of delivering raw image data to host computers for classification.
References

1. Hwang, W.J., Jeng, S.S., Chen, B.Y.: Fast Codeword Search Algorithm Using Wavelet Transform and Partial Distance Search Techniques. Electronics Letters 33, 365–366 (1997)
2. Hwang, W.J., Wen, K.W.: Fast kNN Classification Algorithm Based on Partial Distance Search. Electronics Letters 34, 2062–2063 (1998)
3. Yeh, Y.J., Li, H.Y., Hwang, W.J., Fang, C.Y.: FPGA Implementation of kNN Classifier Based on Wavelet Transform and Partial Distance Search. In: Proc. SCIA 2007, Aalborg, Denmark, June 10-14 (2007)
4. Mcnames, J.: Rotated Partial Distance Search for Faster Vector Quantization Encoding. IEEE Signal Processing Letters, 244–246 (2000)
5. Ridella, S., Rovetta, S., Zunino, R.: K-Winner Machines for Pattern Classification. IEEE Trans. Neural Networks 12, 371–385 (2001)
6. Vetterli, M., Kovacevic, J.: Wavelets and Subband Coding. Prentice Hall, New Jersey (1995)
7. Xie, A., Laszlo, C.A., Ward, R.K.: Vector Quantization Technique for Nonparametric Classifier Design. IEEE Trans. Pattern Anal. Machine Intell. 15, 1326–1330 (1993)
8. Stratix Device Handbook (2005), http://www.altera.com/literature/lit-stx.jsp
9. Custom Instructions for NIOS Embedded Processors, Application Notes 188 (2002), http://www.altera.com/literature/lit-nio.jsp
A Framework for Wrong Way Driver Detection Using Optical Flow

Gonçalo Monteiro, Miguel Ribeiro, João Marcos, and Jorge Batista

Institute for System and Robotics, Dep. of Electrical Engineering and Computers, University of Coimbra - Portugal
Abstract. In this paper a solution to detect wrong way drivers on highways is presented. The proposed solution is based on three main stages: Learning, Detection and Validation. Firstly, the orientation pattern of vehicles motion flow is learned and modelled by a mixture of gaussians. The second stage (Detection and Temporal Validation) applies the learned orientation model in order to detect objects moving in the lane’s opposite direction. The third and final stage uses an Appearance-based approach to ensure the detection of a vehicle before triggering an alarm. This methodology has proven to be quite robust in terms of different weather conditions, illumination and image quality. Some experiments carried out with several movies from traffic surveillance cameras on highways show the robustness of the proposed solution.
1 Introduction
In order to ensure safe and efficient driving, it is important to classify the behaviors of vehicles and to understand their interactions in typical traffic scenarios. Until recently, this task was performed by human operators at traffic control centers. However, the huge increase in the number of available cameras requires automatic traffic surveillance systems [1,2,3,4,5,6,7,8]. In the last decades, one of the most important efforts in Intelligent Transportation Systems (ITS) research has been the development of visual surveillance systems that could help reduce the number of traffic incidents and traffic jams in urban and highway scenarios. Despite the large number of systems based on different types of sensors and their relative performance, vision-based systems are very useful for collecting very rich information about road traffic. The work presented in this paper is part of an automatic traffic surveillance system [9]. The primary goal of the system is to detect and track potentially anomalous traffic events along highway roads. By anomalous events it is meant the detection of vehicles that stop on the highway, vehicles driving in the lane's opposite direction and also vehicles that are constantly switching between lanes.
This work was supported by BRISA, Auto-estradas de Portugal, S.A.
Vehicles driving on the wrong way represent a serious threat. An immediate detection of a vehicle driving in the wrong direction could help prevent serious accidents by warning the oncoming vehicles (via traffic telematic systems or radio announcements) and by warning the police. The proposed project aims at the automatic detection of drivers circulating on the wrong way and, consequently, at triggering an alarm on the highway traffic telematic system. The system must be robust to illumination changes and small camera movements, and must be able to robustly track vehicles against occlusions and crowded events. A simple way to detect wrong way drivers is to use a segmentation process to distinguish the vehicles from the background, and then to track all segmented cars and verify whether the direction of each trajectory is the correct one for the lane or whether the vehicle is circulating on the wrong side. This is not a hard process to implement, but it has some disadvantages, namely the lack of robustness to variations of light and weather conditions, and the difficult task of tracking vehicles in crowded situations without grouping those circulating near each other. After taking all this into account, we opted to use the optical flow obtained from two consecutive frames. This process is more robust and accurate with respect to variations in light and weather conditions. The solution presented in the paper is based mainly on three stages. Firstly, the orientation pattern of the vehicles' motion flow is learned and modelled by a mixture of Gaussians (Learning Stage). Then, a Detection and Temporal Validation stage applies the learned orientation model to detect objects moving in the lane's opposite direction. In both stages, a Block Median Filtering is applied to the motion flow in order to remove noisy data. The temporal validation is applied through a Kalman filter, tracking over a stack of images the blocks marked as belonging to a wrong way driver event. Finally, an appearance-based approach is used to validate the existence of a vehicle as the object that triggers the event, sending an alert sign if the temporal and appearance validation succeed.
2 Image Motion Estimation
The algorithm proposed uses motion information that can be provided by different motion estimation algorithms. A valuable comparison of different techniques is presented in [10]. Satisfactory results were obtained in several experiments by applying the method proposed by Lucas and Kanade in [11] and modified according to the work of Simoncelli et al. [12]. Furthermore, this optical flow estimation algorithm also provides an objective measurement of the local level of reliability of the motion information. Shi and Tomasi adopted this criterion of reliability in order to evaluate the texture properties of pictures areas, and achieved improved tracking performances [13]. We will not go into a detailed description of the method, but we will just report here the results of the discussion in [10]. The reliability of the estimates for a given pixel can be evaluated using the eigenvalues λ1 ≥ λ2 of the matrix C (1).
C = [ Σ_{x∈Ω} W²(x) Ix²(x)        Σ_{x∈Ω} W²(x) Ix(x) Iy(x)
      Σ_{x∈Ω} W²(x) Ix(x) Iy(x)   Σ_{x∈Ω} W²(x) Iy²(x)    ]    (1)
where the summations are taken over a small spatial neighborhood Ω of the pixel, W(x) is a window function that gives more influence to pixels in the center of the neighborhood, and Ix and Iy are the spatial gradients of the gray levels in the directions x and y, respectively. The method proposed in [11], [12] sets a condition λ2 ≥ δ on the smallest eigenvalue for a velocity to be evaluated; otherwise, no velocity value is assigned to the pixel. The result of the optical flow detection contains some disturbance (see Fig. 1), which is mainly caused by motion flow discontinuity regions and noise. To reduce the disturbance and the volume of information analyzed, the image is divided into blocks of 8 × 8 pixels. For each block the median of the directions obtained by the optical flow is calculated (Block Median Filtering). From this moment on, all references made to the movement direction in the image are related to the median motion flow of the block. Likewise, the flow detected in the image is analysed for each block instead of pixel by pixel.
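A minimal sketch of the block median filtering step, assuming a dense per-pixel flow-direction map (in radians, with NaN marking pixels where no reliable velocity was assigned); note that angular wrap-around is not handled in this simplified version.

```python
import numpy as np

def block_median_directions(direction_map, block=8):
    """Collapse a per-pixel flow-direction map into one median direction
    per block x block cell; cells with no valid flow remain NaN."""
    h, w = direction_map.shape
    out = np.full((h // block, w // block), np.nan)
    for bi in range(h // block):
        for bj in range(w // block):
            cell = direction_map[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            valid = cell[~np.isnan(cell)]
            if valid.size:
                out[bi, bj] = np.median(valid)
    return out
```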
Fig. 1. Results of the Block Median Filtering to reduce the disturbance in the optical flow estimation. a) output of the estimated flow with the Lucas-Kanade Algorithm. b) block median filtering applied to a).
3 Traffic Flow Direction Learning
Fig. 2. Flow chart of the traffic flow direction learning process

The basic idea when learning the patterns of the vehicles' motion direction flow on the different lanes is that the vehicles circulating on them during the learning period are moving in the correct direction along the lane. The motion orientation of each lane in the image is learned through the analysis of a large number of frames. A Gaussian mixture is modelled to learn the image motion flow orientation of each block in the image by analysing the vehicles' movement (see Fig. 2). If it is assumed that the directions of the vehicles (θ) have a Gaussian distribution, then the direction can be modelled by a Gaussian Mixture Model (GMM) [14], which is given by (2), in which ωi is the prior of the Gaussian distribution Ni with mean μi and standard deviation σi, and θ is the block image direction of the movement. In practice, the number of kernels was limited to a certain maximum, namely K = 3.

p(θ) = Σ_{i=1}^{K} ωi Ni(θ; μi, σi)    (2)
The mixture model is dynamically updated. Each block direction is updated as follows. i) The algorithm checks whether each incoming direction angle θ can be ascribed to a given mode of the mixture; this is the match operation. ii) If the direction angle occurs inside the confidence interval of ±2.57 standard deviations (for the 99% confidence interval), a match event is verified. The parameters of the corresponding (matched) distributions for that pixel are updated according to

μ_i^t = (1 − α_i^t) μ_i^{t−1} + α_i^t θ^t    (3)

σ_i^t = (1 − α_i^t) σ_i^{t−1} + α_i^t (θ^t − μ_i^t)²    (4)

where

α_i^t = τ N(θ^t; μ_i^{t−1}, σ_i^{t−1}).    (5)

The weights are updated by

ω_i^t = (1 − τ) ω_i^{t−1} + τ M_i^t,  with M_i^t = 1 for matched models and M_i^t = 0 for the remaining models,    (6)

where τ is the learning rate. The non-matched components of the mixture are not modified. If none of the existing components match the direction angle, the least probable distribution is replaced by a normal distribution with mean
equal to the current value, a large covariance and a small weight. iii) The next step is to order the distributions in descending order of ω. This criterion favours distributions which have more weight (most supporting evidence) and less variance (less uncertainty). iv) Finally, the algorithm models each direction as the sum of the corresponding updated distributions. Once the models are learned, they are filtered to eliminate some noise: all the models with a weight lower than a threshold are discarded (see Fig. 3(d)). The main advantage of the Gaussian mixture modelling in this situation is that it can embrace a variety of directions for the same block, which is very useful in lanes with exits or bifurcations, also modelling the movements of vehicles that are changing between lanes. The number of frames necessary to obtain a correct estimation of the GMMs depends on the number of vehicles circulating on the road. In our experiments a stack of 1000 learning frames was used. Obviously, if there are no vehicles circulating on one of the lanes, the direction of that lane will not be learned.
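A compact sketch of the per-block direction model described by Eqs. (2)-(6) is given below. It updates only the best-matching kernel and interprets Eq. (4) as the usual variance update of GMM background models; the initialisation values for an unmatched sample are illustrative assumptions.

```python
import numpy as np

class DirectionGMM:
    """Per-block Gaussian mixture over flow direction (radians)."""

    def __init__(self, K=3, tau=0.05):
        self.tau = tau
        self.w = np.full(K, 1.0 / K)        # priors omega_i
        self.mu = np.zeros(K)               # means mu_i
        self.var = np.full(K, np.pi ** 2)   # variances (large = uninformative)

    def update(self, theta):
        sigma = np.sqrt(self.var)
        z = np.abs(theta - self.mu) / sigma
        matched = z < 2.57                   # 99% confidence match test
        M = matched.astype(float)
        if matched.any():
            i = int(np.argmin(np.where(matched, z, np.inf)))
            pdf = np.exp(-0.5 * z[i] ** 2) / (sigma[i] * np.sqrt(2 * np.pi))
            alpha = self.tau * pdf                                        # Eq. (5)
            self.mu[i] = (1 - alpha) * self.mu[i] + alpha * theta         # Eq. (3)
            self.var[i] = ((1 - alpha) * self.var[i]
                           + alpha * (theta - self.mu[i]) ** 2)           # Eq. (4)
        else:
            i = int(np.argmin(self.w))       # replace the least probable kernel
            self.mu[i], self.var[i], self.w[i] = theta, np.pi ** 2, 0.01
            M[:] = 0.0
        self.w = (1 - self.tau) * self.w + self.tau * M                   # Eq. (6)
```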
Fig. 3. (a), (b) and (c): three frames (the 100th, 200th and 600th) showing the evolution of the orientation pattern modelled by the first Gaussian of the GMM in a highway scenario; (d) the result of the learning process after the model weight filtering.
4 Wrong Way Drivers Detection
Fig. 4. Flow chart of the wrong way drivers detection system proposed here

This section describes the methodology used to detect and validate the vehicles circulating on the wrong way (see Fig. 4). In each new frame the optical motion flow is computed and the median of the flow direction for each block is calculated. An object is defined as circulating in the wrong direction when the difference between the direction of the flow in the present frame and the estimated mean of the corresponding learned block is larger than 2.57σ for the 99% confidence interval. Due to the vibration of the surveillance camera pole and to noisy motion flow estimation, it is possible that a vector or a set of flow vectors is detected even if there is no real motion in those blocks of the image. Thus, it is necessary to validate all the objects detected on the wrong way before triggering an alarm. Two types of validation were used: a temporal validation, to verify whether the detected objects follow a coherent trajectory, and an appearance-based validation, to check whether the object is, in fact, a car.

4.1 Temporal Validation
The temporal validation consists of tracking all the objects detected as circulating on the wrong side of the road and verifying whether they appear in consecutive frames. If an object is detected more than n times in m consecutive frames and describes a coherent trajectory, then it will be considered as an object circulating on the wrong side of the road (namely n = 4 and m = 6). A second order Kalman filter is used to track and predict the position of the vehicles in consecutive frames.
When a detected flow does not match the learned motion direction model, a new tracker is initiated. The object image position, P, is given by the center of mass of all the neighboring blocks detected as being part of an object moving in a wrong direction. The velocity, ν, of the object is obtained in two parts: the direction is obtained as the median of the directions of all grouped blocks, and the magnitude is computed as the average motion of all grouped blocks. When tracking the object it is only necessary to store P, ν, and the area of the grouped blocks. The m frames used for the temporal validation are stored and used in the appearance-based validation.

4.2 Appearance-Based Validation
The appearance-based sub-system described here receives as input all the frames used to validate the vehicle in the temporal validation sub-system. The appearance-based sub-system is applied to the sub-windows where flow has been detected in m frames, and it verifies whether it is a vehicle or a false positive. This system uses a set of Haar-like features (see Fig. 5) to extract the information from the given image [15]. The detection of the objects is performed using these features as an input to an AdaBoost classifier. The AdaBoost classifier is then trained to perform the detection of the vehicles on the road. The main goal of this learning algorithm is to find a small set of Haar-like features that best classifies the vehicles, rejecting most of the background objects, and to construct a robust classifier function. To support this purpose, a weak learning algorithm is designed to select the single feature which best separates the positive and negative examples. For each feature, the weak learner determines the optimal threshold classification function, so that the minimum number of examples is misclassified. A weak classifier hj(x) consists of a feature fj, a threshold θj and a parity pj indicating the direction of the inequality sign (7). The value 1 represents the detection of the object class and 0 represents a non-object. Each of these classifiers alone is not able to detect the object category; rather, it reacts to some simple feature in the image that may be related to the object. The final classifier H(x) is constructed as the weighted sum of the T weak classifiers and is represented by (8).

hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise    (7)

H(x) = 1 if Σ_{t=1}^{T} αt ht(x) ≥ (1/2) Σ_{t=1}^{T} αt, and 0 otherwise    (8)
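A small sketch of the strong classifier of Eqs. (7)-(8); the weak classifiers are represented as (feature function, threshold, parity, weight) tuples, which is an illustrative layout rather than the exact structure used by the authors.

```python
def weak_classifier(x, f, theta, p):
    """Eq. (7): vote 1 if p * f(x) < p * theta, else 0."""
    return 1 if p * f(x) < p * theta else 0

def strong_classifier(x, weak_classifiers):
    """Eq. (8): weighted vote of T weak classifiers against half the total weight."""
    score = sum(alpha * weak_classifier(x, f, theta, p)
                for f, theta, p, alpha in weak_classifiers)
    half_total = 0.5 * sum(alpha for _, _, _, alpha in weak_classifiers)
    return 1 if score >= half_total else 0
```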
To construct a robust and accurate car classifier it is necessary to gather a large number of labelled cars from the scene where the classifier is to be used. A segmentation process [9] was used to obtain the labelled cars in the scene images. The detection of the objects is done by sliding a search window through each sub-image and checking whether an image region at a certain location is classified as a car (see Fig. 6).
Fig. 5. Subset of the Haar-like features prototypes used in the object detection. a, b, c and d are the line features, e and f the edge features and g is the center-surround feature.
Fig. 6. Cars classification using AdaBoost classifier at different scales and positions of the search window on the image
Initially, the detection window is of the same size as the classifier (30 × 30); then the window size is increased by a factor β until it is equal to the sub-image size (β = 1.05). The appearance-based validation is carried out after the temporal validation. In order to validate an object as a vehicle, it should be positively classified at least q times in m consecutive frames. A value of q = 4 was used.
5 Experimental Results
The system described here was tested using a real set of image sequences from highway traffic surveillance cameras with different weather conditions, illumination, image quality and fields of view. This set of image sequences is composed of real and simulated wrong way events. In fact, in some of the simulated video sequences the vehicles were not circulating on the wrong side of the road; these situations are scarce and it is difficult to obtain videos of such events when they happen. To test the system, all the directions learned during the training phase were increased by π and, therefore, all vehicles on the road should be considered as circulating in the wrong direction (see Fig. 7). All vehicles detected as wrong way drivers are bounded by a red box. A simulated wrong way driver event was also tested with this algorithm. In this situation, a vehicle is entering the highway through an exit lane, and the system was able to detect the event correctly (see Fig. 8). Another simulated situation was also tested, including frames from a video in a tunnel with a wrong way driver event (see Fig. 9).
Fig. 7. Wrong way drivers detection in real event video sequence. All the moving vehicles were validated as wrong way drivers.
Fig. 8. Wrong way driver detection on simulated event video sequence. In the first three frames of the sequence, the vehicle entering the highway through the exit lane is being validated and then it is detected as a wrong way driver.
It is worth noting that in the experimental result image sequences, the vehicle seen in the first frames going in the wrong direction is not detected immediately, due to the temporal validation. A set of 300 frames of highway tunnels in two different scenarios and 600 frames of outdoor video sequences in five different scenarios were used to test the system. All the directions learned during the training phase were increased by π and, therefore, all vehicles on the road should be considered as circulating in the wrong direction of the lane. The results of these experiments are presented in Table 1.
Fig. 9. Wrong way driver detection in a simulated event video sequence in a tunnel

Table 1. Performance of the system proposed here, tested in various scenarios

           Flow Detection (%)   Hit Rate (%)   False Alarm Rate (%)
Tunnels    89.85                92.31          0.03
Outdoor    89.55                89.86          0.24
The first column of the table represents the rate of cars detected with flow in each frame of the experimental image set. The hit rate is obtained as the number of vehicles detected in the wrong direction of the lane divided by the total number of vehicles circulating on the road, and the false alarm rate is the number of wrong way driver events falsely detected divided by the total number of frames. The system's performance in the tunnels is higher because their illumination is controlled, unlike the outdoor situations, where the illumination depends on the weather conditions, which are unstable. An additional problem in the outdoor scenarios is the vibration of the camera supporting poles, which induces a false optical flow in the image; moreover, the vehicles in the image are usually considerably smaller than those in tunnel situations. The system can detect vehicles moving in the wrong direction of a lane over a 320×240 pixel image at 33 frames/s on a 3.2 GHz P4 Intel processor under Linux OS. This approach is part of a traffic surveillance system that is currently being tested on some of Brisa's highway roads.
6 Conclusions
In this paper, a methodology to detect vehicles circulating on the wrong side of the highway using optical flow is proposed. In the learning phase, the direction
of each lane is modelled by a Gaussian mixture. The optical flow is calculated to detect the moving objects in every frame. If the calculated direction does not match the Gaussian Mixture Model, then a temporal and an appearance-based validation are initiated. After all these procedures, if the vehicle is validated, an alarm is triggered. The experiments conducted on a large number of scenes demonstrate that the proposed system has the following properties: a) it is able to detect vehicles circulating on the wrong side of the road with good accuracy; b) it runs in real time; and c) it is robust to variations of weather conditions, illumination and image quality.
References

1. Foresti, G.: Object detection and tracking in time-varying and badly illuminated outdoor environments. SPIE Journal on Optical Engineering (1998)
2. Piccardi, M., Cucchiara, R., Grana, C., Prati, A.: Detecting moving objects, ghosts and shadows in video streams. IEEE Trans. Pattern Anal. Machine Intell. 1337–1342 (2003)
3. Coifman, B., Malik, J., Beymer, D., McLauchlan, P.: A real-time computer vision system for measuring traffic parameters. In: IEEE CVPR, IEEE Computer Society Press, Los Alamitos (1997)
4. Koller, D., et al.: Towards robust automatic traffic scene analysis in real-time. In: Int. Conference on Pattern Recognition (1994)
5. Ikeuchi, K., Sakauchi, M., Kamijo, S., Matsushita, Y.: Occlusion robust vehicle detection utilizing spatio-temporal markov random filter model. In: Gauthier, G., VanLehn, K., Frasson, C. (eds.) ITS 2000. LNCS, vol. 1839, Springer, Heidelberg (2000)
6. Magee, D.: Tracking multiple vehicles using foreground, background and motion models. In: Image and Vision Computing, pp. 43–155 (2004)
7. Ebrahimi, T., Cavallaro, A., Steige, O.: Tracking video objects in cluttered background. In: IEEE Transactions on Circuits and Systems for Video Technology, pp. 575–584. IEEE Computer Society Press, Los Alamitos (2005)
8. Collins, R., et al.: A system for video surveillance and monitoring. In: CMU-RI-TR-00-12 (2000)
9. Fernandes, C., Batista, J., Peixoto, P., Ribeiro, M.: A dual-stage robust vehicle detection and tracking for real-time traffic monitoring. In: IEEE Int. Conference on Intelligent Transportation Systems, IEEE Computer Society Press, Los Alamitos (2006)
10. Fleet, D.J., Barron, J.L., Beauchemin, S.S.: Performance of optical flow techniques. Int. J. Comput. Vision 12, 43–77 (1994)
11. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: DARPA Image Understanding Workshop, pp. 121–130 (1981)
12. Adelson, E.H., Simoncelli, E.P., Heeger, D.J.: Probability distribution of optical flow. In: IEEE Conf. Comput. Vision and Pattern Recognition, pp. 310–315. IEEE Computer Society Press, Los Alamitos (1991)
13. Foresti, G.: Object detection and tracking in time-varying and badly illuminated outdoor environments. SPIE Journal on Optical Engineering (1998)
14. Stijnman, G., van den Boogaard, R.: Background extraction of colour image sequences using a Gaussian mixture model. Tech. Rep., ISIS - University of Amsterdam (2000)
15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Conf. Comput. Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2001)
Automated Stroke Classification in Tennis

Hitesh Shah¹, Prakash Chokalingam¹, Balamanohar Paluri¹, Nalin Pradeep¹, and Balasubramanian Raman²

¹ Sarnoff Innovative Technologies, Bangalore - 560025, India
{hshah, cprakash, bpaluri, npradeep}@sarnoff.com
² Indian Institute of Technology, Department of Mathematics, Roorkee - 247667, India
[email protected]

Abstract. Stroke recognition in tennis is important for building up statistics of the player and also for quickly analyzing the player. It is difficult primarily on account of low resolution, variability in strokes of the same player as well as among players, and variations in background, weather and illumination conditions. This paper proposes a technique to automatically classify tennis strokes efficiently under these varying circumstances. We use the geometrical information of the player to classify the strokes. The player is modeled using a color histogram and tracked across the video using histogram back projection. The binarized (segmented) output of the tracker is skeletonized and the gradient information of the skeleton is extracted to form a feature vector. A three-class SVM classifier is then used to classify the stroke as a Forehand, Backhand or Neither. We evaluated the performance of our approach on real world datasets and have obtained promising results. Finally, the proposed approach is real time and can be used with live tennis broadcasts.

Keywords: Skeletonization, Tracking, Optical Flow, Oriented Histogram.
1 Introduction

The increase in sport content archives has boosted the need for automated tools for video content analysis. A large body of literature exists for ball tracking [1], [2], highlight extraction [3], [4], shot classification [5], and team behavior classification [6], with applications to sports like Tennis [1], [2], Baseball [4], Basketball [7], Soccer [8], [9], etc. In this paper, we focus on stroke classification of a player in Tennis game video. We classify a tennis stroke as a forehand, backhand or as a stroke that cannot be classified; representative examples are shown in Figure 1(a, b, c), respectively. Such an analysis contributes to a better understanding of playing patterns and statistics of the game, and also helps in easy retrieval of game video. Prior work in stroke classification can be traced to Terrence and Andrew [10], who proposed a technique based on the relative position of the racquet's head to the player's position, and to Miyamori and Iisaku [11], whose technique used the relative position
of the tennis ball to the player's position and also the player's behavior to classify the strokes. However, tracking objects like the ball, racquet and head requires high-resolution videos, which limits the application of the above approaches. The proposed technique is based on skeleton features of the player. However, extracting skeletons from low resolution videos requires a robust tracking system which can track the player efficiently against different court backgrounds and different illumination conditions. The initialization of the tracker is also crucial. Most of the existing techniques expect the initialization to give a perfect model of the object to be tracked, so any outliers while modeling the tracker will affect the outputs drastically. This has led to the use of hand-marked inputs for the tracker, which requires human intervention. In our approach, we counter this problem by using a robust initialization based on optical flow [12]. We have opted for the use of a non-parametric tracking approach which is noise resilient and gives us scale and orientation information of the player being tracked.
Fig. 1. Three Strokes. (a) Forehand, (b) Backhand, (c) Neither.
The paper is organized as follows: Section 2 explains the proposed approach, Section 3 details the experimental results on real world data sets, and Section 4 concludes the paper and details the future work.
2 Proposed Approach

The objective of the paper can be formally stated as: given a video taken with a stationary camera behind the player, identify the strokes in each of the frames. The strokes identified by the current system are: Forehand, Backhand, and Neither. The assumption of having a stationary camera is not very restrictive, as it is the most common camera configuration during the television coverage of tennis. Figure 2 represents the overall architecture of the proposed system. The main blocks of the system are an initialization to identify the player and generate an appropriate model to track the player during the game. Features are then extracted from the tracked player using an oriented histogram of the player's skeleton. Finally, a Support Vector Machine (SVM) classifier is used to identify the strokes based on the extracted features.
Fig. 2. An Overview of the proposed system
2.1 Initialization

Given a video we need to locate the player. On account of the variability in various features such as clothes color, court color, illumination, non-rigid deformation and size of players, we have opted for using motion as a cue to segment the player. Optical flow is initially calculated between consecutive frames, as proposed in [12], which introduces regularization constraints for spatial coherence along with data fitting to obtain robust pixel motion. The magnitude of the optical flow vector at each pixel describes the displacement of the corresponding pixel. As the two players and
the ball will be the only moving entities in the video, the pixels where they are imaged will have a high magnitude of motion (figure 2 shows the magnitude of the optical flow). The image frame is segmented using global adaptive thresholding. In the thresholded image, the largest blob is due to the motion of the player near the camera. The largest blob is located and used to segment the player in the frame. The segmented player is used to calculate an appearance model to track the player in the subsequent frames. The appearance model is a distribution of colors represented by a histogram m, which is compared with a histogram of colors d observed within the current frame. The sample weight at each pixel with color i is then set to √(mi / di), a value related to the Bhattacharyya coefficient of histogram similarity. The obtained weight image (figure 3(b)) is then used with an Expectation Maximization (EM) algorithm, an extension of the mean-shift algorithm, as proposed by [13]. The EM-like algorithm not only estimates the position of the player but also estimates the covariance matrix that describes the shape of the player. This helps in obtaining an appropriately oriented and scaled ellipse around the player in each frame (see figure 3(a)). The orientation and the scale of the player are of significant importance in classifying the stroke, as explained in the next section. Moreover, the weight image is also used in the feature extraction phase, as shall be explained in the next section.
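A minimal sketch of the weight-image computation (histogram back-projection), assuming the 4-bins-per-channel RGB histograms described in Section 3; the helper names are illustrative.

```python
import numpy as np

def rgb_histogram(pixels, bins=4):
    """4x4x4 RGB color histogram, normalised to sum to 1."""
    idx = (pixels // (256 // bins)).astype(int)
    hist = np.zeros((bins, bins, bins))
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return hist / max(hist.sum(), 1)

def weight_image(frame, model_hist, bins=4):
    """Back-projection: each pixel gets sqrt(m_i / d_i), where m is the player
    model histogram and d is the histogram of the current frame."""
    h, w, _ = frame.shape
    pixels = frame.reshape(-1, 3)
    frame_hist = rgb_histogram(pixels, bins)
    idx = (pixels // (256 // bins)).astype(int)
    m = model_hist[idx[:, 0], idx[:, 1], idx[:, 2]]
    d = frame_hist[idx[:, 0], idx[:, 1], idx[:, 2]]
    return np.sqrt(m / np.maximum(d, 1e-12)).reshape(h, w)
```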
Fig. 3. Outputs at various stages of the algorithm: (a) tracked ellipse, (b) weight image, (c) blob, (d) skeleton.
2.2 Feature Extraction

A study of tennis videos suggested that the body shape of the players can be used as a discriminative feature for identifying the strokes. However, naive approaches like template matching and contour matching cannot be applied directly due to the subtle differences between players, their orientation while playing strokes, and the scale of the player in the video. To address the above problems, we have proposed the use of an oriented histogram of the skeletonized binary images of the player. The weight image calculated during the tracking phase, inside the tracked ellipse, is thresholded to obtain a binary image of the body of the player (figure 3(c)). We use the technique proposed by [14] to obtain the skeleton of the player (figure 3(d)). The algorithm computes curve skeletons from binary images by computing a potential field over the entire object. The potential at any object pixel (x, y) is defined as

φ = Σ_{i=0}^{N} 1 / r_i^m    (1)

where r_i = [(x − x_i)² + (y − y_i)²]^{1/2} and N is the number of points on the boundary
of the object. The gradient of the potential field at each object pixel is then calculated to form a vector field. This vector field is then analyzed to extract the seed points which start from high curvature values, high divergence values and saddle points.

Fig. 4. Tracked ellipse along with the skeleton of the player
The gradient vector of the potential field starting from the seed point converges at critical points which form the skeleton of the object. The skeletonization algorithm is very efficient in generating a smooth skeleton and is robust to noise, which is one of our major requirements. The skeletonization helps us overcome the differences in the body sizes of the players, but not the rotation and scale factors. For rotation invariance we use the oriented ellipse obtained during the tracking phase to align the players to a common reference axis (the minor axis of the ellipse). Finally, to achieve scale invariance, we take an oriented histogram of the skeleton inside the tracked ellipse. The oriented histogram is calculated by dividing the ellipse into n sectors of equal angle (φ), as shown in figure 4, and calculating the number of
skeleton points in each of the sectors. We normalize the histogram by dividing each bin value by the sum of all bin values; this is important to achieve scale invariance. Sample histograms of Forehand, Backhand and Neither strokes are shown in figure 5. The histogram is used as the feature vector and a trained SVM classifier is used to classify it.
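A sketch of the oriented-histogram feature, assuming the skeleton is given as pixel coordinates and the ellipse by its centre and minor-axis direction; the bin count of 60 follows the choice reported in Section 3.

```python
import numpy as np

def oriented_histogram(skeleton_points, center, minor_axis_angle, n_bins=60):
    """Count skeleton points in n_bins equal-angle sectors measured from the
    ellipse's minor axis, then normalise so the bins sum to 1 (scale invariance)."""
    pts = np.asarray(skeleton_points, dtype=float) - np.asarray(center, dtype=float)
    angles = np.arctan2(pts[:, 1], pts[:, 0]) - minor_axis_angle   # rotation invariance
    angles = np.mod(angles, 2 * np.pi)
    bins = (angles / (2 * np.pi / n_bins)).astype(int) % n_bins
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    return hist / max(hist.sum(), 1.0)
```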
Fig. 5. Histograms. (a) Forehand (b) Backhand (c) No Stroke.
2.3 Classification
For classification we experimented with correlation, Bhattacharyya-coefficient and nearest-neighbour techniques along with an SVM [15]. We observed the SVM classifier to outperform the rest. The SVM's performance can be attributed to its ability to map data to a high-dimensional space where the classes are linearly separable by hyperplanes. We trained an SVM classifier which classifies the feature vector (histogram) into one of three classes: Forehand, Backhand and Neither. An SVM with a Gaussian Radial Basis Function (RBF) kernel was trained; the parameters σ and c were set to 0.1 and 10, respectively. We trained the SVM classifier with 500 histograms of each class.
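The SVM implementation used is not named in the paper; a minimal training sketch with scikit-learn might look as follows. The placeholder data only illustrates the shapes involved (500 histograms of 60 bins per class), and the reported σ = 0.1 is converted to scikit-learn's gamma parameterization of the RBF kernel:

import numpy as np
from sklearn.svm import SVC

# Placeholder data with the shapes used in the paper: 500 histograms per class,
# 60 bins each; labels 0 = forehand, 1 = backhand, 2 = neither.
rng = np.random.default_rng(0)
X = rng.random((1500, 60))
X /= X.sum(axis=1, keepdims=True)          # histograms normalized to sum to one
y = np.repeat([0, 1, 2], 500)

sigma, C = 0.1, 10.0                       # parameter values reported in the paper
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=C)
clf.fit(X, y)
stroke = clf.predict(X[:1])                # classify one 60-length feature vector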
3 Experimental Results
We have evaluated the performance of the proposed approach on a variety of videos. The dataset consisted of tennis games played on grass, clay and artificial-grass courts, with both male and female players. The player is initially segmented based on optical flow. The parameters used in optical flow are the search window size and the Gaussian blurring window size. We experimentally found that the optical flow outputs were best for a search window size of 5 and a Gaussian tap size of 7. We also used a look-up table for the eigenvalue calculations to speed up the computation. The player is then modeled using a color histogram. We used 64 bins, with 4 bins for each of the Red, Green and Blue channels. By increasing the number of bins we can obtain a richer representation of the object to be tracked, but this is computationally more expensive. Moreover, the players to be tracked are homogeneous in nature; thus 64 bins were found sufficient for accurate tracking of the player. The histogram is used to generate the weight image as explained in the earlier section. The EM-like algorithm is
used to fit an appropriate ellipse. The mean indicates the position, and the variance indicates the scale and orientation of the ellipse used in tracking. In our experiments, we found that the EM algorithm takes 6 iterations to converge. The weight image obtained is thresholded using a global adaptive threshold to obtain a player blob inside the ellipse. The threshold was taken as 80% of the maximum weight value. A closing operation is then performed with a circular structuring element of radius 5 units. This is done so that the skeleton outputs are robust to variations across videos. The oriented histogram, accounting for each point in the skeleton with respect to the minor axis of the tracked ellipse, is then computed. The histogram obtained varies from generic to specific as the number of bins increases. After thorough experimental evaluation the number of bins was fixed to 60, i.e., 6 degrees are mapped into one bin. The trained SVM classifier [16] classifies the 60-length normalized feature vector (histogram values). The system performs at a speed of 20 fps on a Pentium 4, 3.4 GHz processor. The speed of the method is directly dependent on the speeds of the tracker, the skeletonization and the SVM classification. The tracker speed depends on the size of the object to be tracked, which is the same for most of the videos. The same applies to the skeleton computation. The speed of the SVM classifier depends on the size of the feature vector to be classified, which is fixed to 60. Hence, the system performs at the same frame rate for most tennis videos. Table 1 shows the recognition accuracies of the proposed approach on 150 different games played on different tennis courts by different players. Each game was clipped into segments of 10-minute duration. For generating the ground truth, each of the frames in a segment was manually labeled according to the shot. Our experiments showed high recognition accuracies for low-resolution real-world datasets under different conditions, as shown in figures 6, 7, and 8.

Table 1. Recognition Accuracies in Percent
Stroke ForeHand BackHand No Stroke
ForeHand 97.24 0.56 1.18
BackHand 0.44 96.43 0.79
No Stroke 2.32 3.02 98.02
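Returning to the preprocessing described above, the 80% threshold on the weight image and the closing with a radius-5 circular structuring element could be written with OpenCV roughly as follows (illustrative names, not the authors' code):

import cv2
import numpy as np

def player_blob(weight_image):
    # Threshold the weight image at 80% of its maximum and close the result
    # with a circular structuring element of radius 5 (diameter 2*5 + 1).
    blob = (weight_image >= 0.8 * float(weight_image.max())).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))
    return cv2.morphologyEx(blob, cv2.MORPH_CLOSE, kernel)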
Figures 6, 7, and 8 show the frames, their corresponding cropped skeleton images and the oriented histograms of 'forehand', 'backhand' and 'no stroke', respectively. The example figures have been chosen to illustrate the variations in scale, orientation, tennis courts and position of the players. Under these challenging variations, the skeletons and oriented histograms of the tracked player are classified with good accuracy, as shown in Table 1. For instance, in Figure 6(b), the player is classified correctly even while stretching for the ball, oriented downwards. Also, figures 7(a) and 7(d) are classified similarly despite different orientations. Figures 6(c), 6(d), 7(a) and 7(b) show that the forehand and backhand classification succeeds for cameras placed far off, thereby being unaffected by the scale of the player. In figure 7(e), in a doubles match, the tracker helped in
Fig. 6. ‘Forehand’ Strokes (a,b,c,d) and (e) ‘No stroke’ with Zoomed Skeleton images and Histograms
Fig. 7. ‘Backhand’ Strokes with Zoomed Skeleton images and Histograms
Fig. 8. ‘No Strokes’ with Zoomed Skeleton images and Histograms
getting accurate skeletons of the player, and the stroke was classified as 'backhand'. Similar behaviour can be observed for 'No Stroke', as shown in figure 8. Figure 6(e) is one of the failure cases; we believe the failure can be attributed to the fact that the player has both hands stretched, which leads to it being classified as 'No Stroke'.
4 Conclusion
An automated stroke classification system for real-world tennis videos has been proposed. Motion vectors are used to automatically model the player, and the structural and geometrical information of the player tracked across the video is used as features. The features are invariant to intra- and inter-class variations of players' strokes. A trained SVM classifier is used to recognize the strokes of the player. In the future, a possible application of the proposed system is to automatically annotate a player with metadata in order to generate statistics and also aid video search capabilities. Future work will also focus on recognizing other tennis strokes, such as the slice, volley and serve, using more semantic actions from the videos.
References 1. Yan, F., Kostin, A., Christmas, W., Kittler, J.: A novel data association algorithm for object tracking in clutter with application to tennis video analysis. In: CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 634–641. IEEE Computer Society, Los Alamitos (2006)
2. Pingali, G.S., Jean, Y., Carlbom, I.: Real-time tracking for enhanced tennis broadcasts. In: CVPR ’98: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, p. 260. IEEE Computer Society, Los Alamitos (1998) 3. Hanjalic, A.: Adaptive extraction of highlights from a sport video based on excitement modeling. IEEE Transactions on Multimedia 7(6), 1114–1122 (2005) 4. Rui, Y., Gupta, A., Acero, A.: Automatically extracting highlights for tv baseball programs. In: MULTIMEDIA ’00:Proceedings of the eighth ACM international conference on Multimedia, New York, NY, USA, pp. 105–115. ACM Press, New York (2000) 5. Duan, L.-Y., Xu, M., Yu, X.-D., Tian, Q.: A unified framework for semantic shot classification in sports videos. In: MULTIMEDIA ’02: Proceedings of the tenth ACM international conference on Multimedia, New York, NY, USA, pp. 419–420. ACM Press, New York (2002) 6. Thurau, C., Hettenhausen, T., Bauckhage, C.: Classification of team behaviors in sports video games. In: ICPR ’06: Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Washington, DC, USA, pp. 1188–1191. IEEE Computer Society, Los Alamitos (2006) 7. Zhou, W., Vellaikal, A., Jay Kuo, C.C.: Rule-based video classification system for basketball video indexing. In: MULTIMEDIA ’00: Proceedings of the 2000 ACM workshops on Multimedia, New York, NY, USA, pp. 213–216. ACM Press, New York (2000) 8. Ekin, A., Tekalp, A., Mehrotra, R.: Automatic soccer video analysis and summarization. In: Proceedings of IEEE Transactions Image Processing, pp. 796–807 (2003) 9. Iwase, S., Saito, H.: Parallel tracking of all soccer players by integrating detected positions in multiple view images. In: ICPR ’04: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR’04), Washington, DC, USA, vol. 4, pp. 751–754. IEEE Computer Society, Los Alamitos (2004) 10. Bloom, T., Bradley, P.: Player tracking and stroke recognition in tennis video. In: Proceedings of the WDIC, pp. 93–97 (2003) 11. Miyamori, H., ichi Iisaku, S.: Video annotation for content-based retrieval using human behaviour analysis and domain knowledge. In: Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 320–325 (2000) 12. Black, M.J., Anandan, P.: A framework for the robust estimation of optical flow. In: ICCV ’93, pp. 231–236 (1993) 13. Zivkovic, Z., Krose, B.: An em-like algorithm for color-histogram-based object tracking. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition (2004) 14. Saraf, Y., Balasubramanian, R., Swaminathan, K.: A classical approach for thinning of binary images using divergence of the potential field. International Journal of Computer Mathematics 82(6), 673–684 (2005) 15. Vapnik, V.: The nature of statistical learning theory. Springer, New York (1995) 16. Chapelle, O., Vapnik, V.N.: Model selection for support vector machines. In: Advances in Neural Information Processing Systems, vol. 12, pp. 230–236. MIT press, Cambridge (2000)
Color-Based Road Sign Detection and Tracking Luis David Lopez and Olac Fuentes Computer Science Department University of Texas, El Paso 79902, USA {ldlopez, ofuentes}@utep.edu
Abstract. This paper describes a general framework for the detection and tracking of traffic and road signs from image sequences using only color information. The approach consists of two independent parts. In the first we use a set of Gaussian distributions that model each color for detecting road and traffic signs. In the second part we track the targets detected in the first step over time. Our approach is tested using image sequences with high clutter that contain targets with the presence of rotation and partial occlusion. Experimental results show that the proposed system detects on average 97% of the targets in the scene in near real-time with an average of 2 false detections per sequence.
1 Introduction
Automatic detection of road signs has recently received attention from the computer vision research community. The main objective of these algorithms is to detect signs with different maintenance conditions and under variable light sources from image sequences acquired from a moving vehicle. Many algorithms have been proposed to solve this problem. These methods can be divided into shape-based [5,8,11,16] and color-based segmentation [3,4,13]. Shape-based methods detect the signs using a set of predefined templates. In general, these methods are sensitive to total or partial occlusion and target rotation. Color-based methods detect signs in a scene using the pixel intensity in RGB [3], CIELab [4] or other color spaces [13]. A few typical problems in the detection of road signs using color information are that vandalism, long exposure to the sun, or camera sensitivity produce a change in the apparent color of the sign. Other approaches combine shape and color segmentation to improve the percentage of correct detections, while minimizing the number of false detections [1,6,7]. But due to time requirements, these methods are not suitable for real-time applications. In this paper we propose to detect road signs in the CIELab color space, modeling the pixel intensity values using a set of Gaussian distributions. We combine the segmentation and tracking processes in order to obtain accurate detection results, while satisfying real-time constraints. Our computational results show that this method can detect signs in sequences with different outdoor lighting conditions, in the presence of obstacles such as cars and trees that partially occlude the sign, and with motion blur that is caused by the vehicle motion and vibration.
Some of the advantages that motivate the use of video instead of single images in this framework are:
– Redundant information: The use of several images from the same scene can increase the robustness of the detection system, especially in sequences with a high level of noise. Thus, a failure in the detection in a given image does not necessarily result in the failure of the detection in the whole sequence.
– Targets with different sizes: We only process targets whose area is bigger than a predefined threshold and reject the smaller targets. With this method we reduce the false target detection rate that may be generated by the noise in the image.
– Several points of view of the same target: With every step the angle of the camera changes; this information can be used to make the sign classification process easier [12].
This paper is organized as follows. Section 2 shows an overview of the proposed framework. Sections 3 and 4 describe the algorithms used to detect and track the signs, respectively. Several results that validate our system are reported in Section 5, and finally Section 6 contains concluding remarks.
2 Framework
The proposed road sign detection and tracking framework is shown in Figure 1. In our experiments, we first convert the RGB image into the CIELab color space. The segmentation phase uses a mixture of three Gaussian models [14] to characterize each road sign color (red, green, blue, yellow, and orange). These models are updated after the execution of the segmentation algorithm in order to improve the discrimination of the background at each step. All the candidate signs are tracked over time using the Conditional Density Propagation algorithm (Condensation) [10], which is a probabilistic method that iteratively propagates the targets' positions using Bayes' rule. After tracking the targets for k frames, the segmentation algorithm is executed again in order to obtain more information about the road sign; this helps to deal with partial occlusion and changes in size. The details of these algorithms are given in the next two sections.
3 Color-Based Sign Detection
In this approach we discriminate the road sign from the background using a set of Gaussian models. We use a perceptually uniform color space called CIELab [15], which measures how similar the reproduction of a color is to the original when viewed by a human observer. In this space, every pixel has a luminance component L and two chrominance components a and b. We create a system that is sensitive to color but invariant to intensity changes by using only the chrominance components, as also done in [4]. Gaussian models have been successfully applied to the automatic detection of mobile targets using color information [14,9]. Traditionally, this method models
Fig. 1. Road Sign detection framework
the background color of a given image, so that everything that is similar enough to the model is viewed as a part of the background. The results are improved at every step using a learning function that updates the mean value of the distributions. Unlike traditional methods that model the background, in our framework we use a Mixture of Gaussian distributions to model each road sign color (red, green, blue, yellow, orange and brown), where every Mixture of Gaussian distributions consists of three Gaussian models. Our detection algorithm consists of three basic steps: initialization, model generation and target detection. The following paragraphs explain each of these steps in detail. Gaussian initialization. A Gaussian Model consists of a mean value and a covariance matrix. In this step we compute the initial values of these parameters using a set of image patterns. Every model of the same mixture is trained using images with excellent, regular, and poor illumination conditions, respectively. In the training process the mean and the covariance matrix of every Gaussian model are computed using Eq. 1 and 2, respectively:

μ_i = (1/n) ∑_{j=1}^{n} color_{i,j}        (1)

Σ_i = (1/(n−1)) ∑_{j=1}^{n} (color_j − μ_i)(color_j − μ_i)^T        (2)
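Equations (1) and (2) are simply the sample mean and sample covariance of the training pixels; an illustrative NumPy sketch for one Gaussian of the mixture (not the authors' code) is:

import numpy as np

def init_gaussian(training_pixels):
    # Sample mean (Eq. 1) and covariance (Eq. 2, with the 1/(n-1) factor) of an
    # (n, 3) array of CIELab training pixels for one Gaussian of the mixture.
    mu = training_pixels.mean(axis=0)
    sigma = np.cov(training_pixels, rowvar=False)
    return mu, sigma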
Model Generation. In a Gaussian model each pixel p is modeled with a single probability distribution η(p_t, μ_t, Σ_t), as shown in Eq. 3, where μ_t and Σ_t are the mean value and covariance matrix of the distribution at frame t, respectively. Pixels whose observed colors p_t are close enough to the background distribution
are classified as background points, while those too far away are classified as foreground points.

η(p_t, μ, Σ_t) = 1 / ((2π)^{n/2} |Σ_t|^{1/2}) · exp(−(1/2) (p_t − μ)^T Σ_t^{−1} (p_t − μ))        (3)
In order to improve the accuracy of the detector, we use a multi-modal model; in this model the probability of observing the current pixel value is computed using a mixture of K multiple independent distributions (Eq. 4), where μ_{i,t} and Σ_{i,t} are the ith mean and covariance at time t, respectively, and ω_{i,t} is an estimate of the weight of the ith Gaussian in the mixture at time t.

P(p_t) = ∑_{i=1}^{K} ω_{i,t} · η(p_t, μ_{i,t}, Σ_{i,t})        (4)
Rather than explicitly specifying the value of the signs’ colors, we model the value of a particular color using Eq. 4, with K = 3 and ωi,t = 1. In our model, if the probability P (pt ) is higher than a predefined threshold, we classify the pixel as part of a road sign. Target Detection. In this step, for every pixel in a given image we compute the difference between the modeled and the real pixel value. A small difference means that this pixel is accurately described by the model and this pixel is considered an element of a sign, while a high difference represents a pixel that does not belong to a road sign. In this framework, it is enough if one model of the distribution generates a similar value to consider the pixel as part of the foreground. The pixels that were labeled as foreground are grouped using connected components [2], creating regions of interest that are used for later tracking.
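A compact sketch of this detection step is given below: each pixel is tested against the K = 3 Gaussians of one sign colour, labelled as foreground if any single model assigns it a sufficiently high density, and foreground pixels are then grouped with connected components. The code is illustrative NumPy/SciPy; the density threshold and the minimum-area value are assumptions, not values taken from the paper:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.ndimage import label

def detect_sign_pixels(image_lab, colour_models, density_threshold=1e-4, min_area=150):
    # image_lab: (H, W, 3) CIELab image; colour_models: list of (mu, sigma)
    # pairs, K = 3 Gaussians for one sign colour.
    h, w, _ = image_lab.shape
    pixels = image_lab.reshape(-1, 3).astype(float)
    foreground = np.zeros(h * w, dtype=bool)
    for mu, sigma in colour_models:
        density = multivariate_normal(mean=mu, cov=sigma).pdf(pixels)
        foreground |= density > density_threshold    # one matching model is enough
    regions, n_regions = label(foreground.reshape(h, w))   # connected components
    sizes = np.bincount(regions.ravel(), minlength=n_regions + 1)
    keep = [i for i in range(1, n_regions + 1) if sizes[i] >= min_area]
    mask = np.isin(regions, keep)
    return regions * mask                            # labelled candidate regions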
4 Target Tracking
In the previous step we generate a set of targets, each represented by a state vector x_i, which contains the position and size of the target. We use a modified version of the Condensation algorithm to track this set of targets over time. The proposed tracking algorithm is shown graphically in Figure 2 and expressed as pseudo-code in Table 1. This method has three steps: initialization, observation, and selection. The following paragraphs describe these steps in detail. Initialization. After the target detection step, the state vector T_i stores the position and size of all the targets at time t. In this approach, we randomly generate a set of N coordinates (x, y) that estimate the target position at time t + 1. The estimate of the target position at time t + 1, T_{i,n}, is defined in Equation (5), where r is a Gaussian random value with mean equal to the current velocity.
Fig. 2. Target tracking

Table 1. The algorithm to track road signs

Tracking(Target[ ] T)    - where T = (x, width, height) and x = (positionX, positionY, velocity)
  For i = 1 to maxTargets
    Select T_i from T                        - select the ith target from the targets' list
    For k = 1 to N                           - generate N particles for every target
      x_{i,k} = T_i.x + r(T_i.velocity, σ)
    X_i = (x_{i,1}, x_{i,2}, ..., x_{i,k})
    P_i = p(obs | X_i)                       - compute similarity
    prediction = argmax(P_i)                 - update target i at time t+1
The current targets’ velocity can help improve the prediction, but the detection algorithm can not compute this parameter, thus, in the first iteration we estimate the next position using a velocity equal to zero. After the first tracking step the velocity in pixels of the targets is computed as the difference between the positions at time t and t + 1. Ti,n = Ti .x + r(Ti .velocity, Σ)
(5)
Observation. In the observation process we compute the similarity between every target and its N predicted positions using a simple correlation method that works with gray-scale images. We use gray-scale images instead of images in the CIELab color space because the tracking algorithm gives approximately the same results in both cases, while the time required to compute the correlation is significantly lower on a gray-scale image. The result of this step is an array that represents the similarity of the predicted position i at time t + 1 to the target at time t. Selection. We determine which of the predicted positions contains the target in the image at time t + 1. In this framework we select the highest correlation value, which represents the most similar prediction, as the position of the target.
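The prediction/observation/selection loop of Table 1 and Eq. (5) can be sketched as follows. This is illustrative NumPy code, not the authors' implementation, and the correlation-based observation is assumed to be supplied as a function similarity(frame, x, y, target) that returns a score for a candidate position:

import numpy as np

def track_step(target, gray_frame, similarity, n_particles=50, sigma=3.0):
    # One tracking iteration: predict N candidate positions around the current
    # position (Eq. 5), score them by correlation, and keep the best one.
    rng = np.random.default_rng()
    x, y = target["position"]
    vx, vy = target["velocity"]                       # zero on the first iteration
    candidates = np.column_stack([
        rng.normal(x + vx, sigma, n_particles),
        rng.normal(y + vy, sigma, n_particles),
    ])
    scores = [similarity(gray_frame, cx, cy, target) for cx, cy in candidates]
    best = candidates[int(np.argmax(scores))]
    target["velocity"] = (best[0] - x, best[1] - y)   # velocity in pixels per frame
    target["position"] = (best[0], best[1])
    return target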
Fig. 3. Road sign detection results in the Highway sequence: (a) Highway sequence, (b) detection result at t=1, (c) detection result at t=100, (d) detection result at t=200
5 Experimental Results
In this section we show the experimental results of our approach. The algorithm was tested with a database of ten image sequences with a frame rate of 20 images per second, captured by a digital camera mounted on a vehicle in different driving situations. Figures 3(a), 4(a), and 5(a) show three sample image sequences: Highway, Road, and Parking lot, respectively. The Highway sequence (Figure 3) contains four green signs and one yellow sign, some of them with partial occlusion and rotation, on a straight highway. The Road sequence (Figure 4) contains two green signs of different sizes in images with a large amount of noise. The Parking lot sequence (Figure 5) contains several objects with colors that can be detected as road signs. The database also contains a large amount of shrubs and trees with colors that are similar to the green signs. Table 2 summarizes the principal features of the sample sequences and the computational results. In the table, Time processing shows the time required to process an image of the sequence, Pd is the probability of detecting a target with size larger than 150 pixels, and NFt is the number of false detections per sequence.
Fig. 4. Road sign detection results in the Road sequence: (a) Road sequence, (b) detection results at t=1, (c) detection results at t=80, (d) detection results at t=160
By applying the algorithms described in sections 3 and 4, the signs are detected and tracked in each frame. Figures 3(b), 3(c), and 3(d) show the road signs detected in the Highway sequence; in these images our method detected all the road signs in the scene with only one false detection. The two signs that are overlapped are detected as one single sign. This problem is generated by the camera perspective, but it is solved when the vehicle is closer and the angle from the camera to the signs places them on different positions in the image plane. In these images the milepost and other partially occluded signs are detected after 180 images when their size is larger than the threshold defined in the detection algorithm. Figures 4(b), 4(c), and 4(d) show the results using the road sequence; in these images we can observe that the proposed method detects all the road signs in the scene, and discriminates objects with similar colors to road signs, obtaining zero false detections. The results for the parking lot sequence are shown in Figures 5(b), 5(c), and 5(d). In this sequence our method detects correctly 95 % of the targets in the scene. The false detections are objects that can not be differentiated from road signs using only color information, but this number is not significant compared to the number of images in the sequence.
Fig. 5. Road sign detection results in the Parking lot sequence: (a) Parking sequence, (b) t = 1, (c) target at t = 50, (d) t = 100

Table 2. Computational results

Sequence      Image size   Frames   Targets   Time processing   Pd (%)   NFt
Highway       640x480      360      4         0.25 secs         96       1
Road          720x480      240      2         0.32 secs         98       0
Parking lot   2272x1240    240      2         1.60 secs         95       5
In these figures the false detections are shown with a white circle, while the detections obtained from the segmentation algorithm are marked with a white square. These results show that the proposed system can detect rotated signs and signs with different sizes and colors in sequences with varying characteristics.
6 Conclusions
This paper describes a general framework for the detection and tracking of traffic and road signs from image sequences using only color information. In the detection step we propose the use of Gaussian models to detect road signs. This method combines detection and tracking in order to reduce the
computational time required to process the whole image sequence, making this framework suited for real-time applications. The proposed approach was tested using a set of image sequences that contain signs with different sizes and colors in the presence of occlusion. Experimental results show that our method is invariant to in-plane rotation, being able to detect signs with different sizes and colors viewed at any orientation. On average the system detects 97% of the signs in highway, road and street environments. As future work we plan to use a machine learning method to improve the learning function in the detection algorithm. We will also perform experiments with nighttime image sequences and use road shape information in order to reduce the number of false detections.
References 1. Bahlmann, C., Zhu, Y., Ramesh, V., Pellkofer, M., Koehler, T.: A system for traffic sign detection, tracking, and recognition using color, shape, and motion information. In: Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2005), IEEE Computer Society Press, Los Alamitos (2004) 2. Ballard, D., Brown, C.: Computer Vision. Prentice-Hall, Englewood Cliffs, New Jersey (1982) 3. Benallal, M., Meunier, J.: Real-time color segmentation of road signs. In: Proceedings of the Canadian Conference on Electrical and Computer Engineering (IEEE CCECE 2003), pp. 1823–1826 (2003) 4. Chutorian, M., Trivedi, M.M.: N-tree Disjoint-Set Forests for Maximally Stable Extremal Regions. In: Proceedings of the British Machine Vision Conference, Edinburgh, PA (September 2006) 5. de la Escalera, A., Moreno, L.: Road traffic sign detection and classification. IEEE Transactions on Industrial Electronics 44, 848–859 (1997) 6. Fang, C.-Y., Chen, S.-W., Fuh, C.-S.: Road-sign detection and tracking. IEEE Transactions on Vehicular Technology 52(5), 1329–1341 (2003) 7. Fleyeh, H., Dougherty, M.: Road and traffic sign detection and recognition. In: Proceedings of the 10th EWGT Meeting and 16th Mini-EURO Conference (2005) 8. Gavrila, D., Philomin, V.: Real-Time Object Detection for Smart Vehicles. In: Proceedings of International Conference on Computer Vision, pp. 87–93 (1999) 9. Guti´errez, L.D.L., Robles, L.A.: Decision Fusion for Target Detection Using Multispectral Image Sequences from Moving Cameras. In: Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Estoril, Portugal, June 2005, pp. 720–727 (2005) 10. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 11. Loy, G., Barnes, N.M.: Fast Shape-based Road Sign Detection for a Driver Assistance System. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 70–75 (2004) 12. Mart´ınez, C., Fuentes, O.: Face Recognition using Unlabeled Data. Iberoamerican Journal of Computer Science Research 7, 123–129 (2003) 13. Shadeed, W.G., Abu-Al-Nadi, D.I., Mismar, M.J.: In: Proceedings of the 10th IEEE International Conference on Electronics, Circuits and Systems, pp. 890–893
14. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for Real-time Tracking. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages II, pp. 246–252 (1999) 15. Vezhnevets, V., Sazonov, V., Andreeva, A.: A survey on pixel-based skin color detection techniques. In: Proceedings of Graphicon (2003) 16. Zadeh, M., Kasvand, T., Suen, C.: Localization and recognition of traffic signs for automated vehicle control systems. In: Proceedings of the Conference on Intelligent Transportation Systems, Pittsburgh, PA, October 1997 (1997)
A New Automatic Planning of Inspection of 3D Industrial Parts by Means of Visual System J.M. Sebastián1, D. García1, A. Traslosheros1, F.M. Sánchez2, and S. Domínguez1 1
Departamento de Automática, Ingeniería Electrónica e Informática Industrial, Universidad Politécnica de Madrid, José Gutiérrez Abascal nº2. 28006 Madrid (Spain) 2 Departamento de Arquitectura y Tecnología de Sistemas Informáticos., Universidad Politécnica de Madrid, Campus de Montegancedo, Boadilla del Monte. 28660 Madrid (Spain)
[email protected],
[email protected],
[email protected],
[email protected],
[email protected] Abstract. This paper describes a new planning algorithm to perform automatic dimensional inspection of 3D industrial parts using a machine vision system. Our approach makes use of all the information available in the system: model of the machine part to inspect and characteristics of the inspection system obtained in previous calibration stages. The analysis is based on discretizing both the configuration space of the system as well as the geometry of the part. Our method does not limit the range of application of the system. It neither imposes any restrictions to the viewpoints from where the part is to be inspected nor to the complexity of the part. All the results shown in this study have been tested with a machine vision inspection system for quality control of industrial parts which consists of a stereoscopic pair of cameras and a laser plane. Keywords: 3D inspection, automatic planning, quality control.
1 Introduction
This study concentrates on the visual inspection of machine parts with three-dimensional characteristics for quality control tasks. Three-dimensional inspection is influenced by numerous factors that make it quite different from other types of inspection. Aspects such as the presence of occlusions, reflections or shadows introduce many inconveniences that will most likely make the analysis very difficult. Our work focuses on inspecting metal parts in order to assess their accuracy and tolerances. As is well known, tolerance checking is one of the most demanding tasks that can be done in an industrial environment in terms of precision in measurement [1], [2]. The need for comparing the real measurement with the ideal one makes it necessary to have such information beforehand, usually in the form of a computer aided design (CAD) model. The use of CAD models involves specific working methods and organization of data that differ from other techniques commonly adopted. Also, such parts are usually made of metal, with specular properties that require special methods of analysis. On the other hand,
the most ambitious aim in the development of an inspection system is to get the system to be able to find by itself the best strategy to perform the job in terms of some criteria to optimise. This study handles all these problems, focusing on the search for methods that improve precision in measurement. Our aim is not to limit this study to specific configurations of the inspection system. Instead, it aims to get the system to be capable of performing three-dimensional measurement in a similar way to coordinated measurement machines (CMM). In [3], [4], [5], [6], or [7] different approaches to the planning problem are shown, although their solutions depend too much on the architecture of their inspection systems. This work has been developed using an inspection system called INSPECTOR – 3D. In previous works, the characteristics of the system [8] and some early approaches to automatic planning [9] were introduced.
2 INSPECTOR-3D System Description
The INSPECTOR-3D system consists of two fixed converging cameras, a laser plane for surface scanning, a part positioning device with 3 degrees of freedom (2 rotational and 1 linear) and a workstation that controls the whole inspection process. Figure 1 shows an image of the system. All the degrees of freedom of the image acquisition system have been eliminated in order to simplify the camera calibration process and minimize the uncertainties of the final measurements. It is easy to demonstrate that calibrating the axes of the positioning device is much simpler and more precise than the dynamic calibration of the cameras.
Fig. 1. Image of the system Inspector-3D
Fig. 2. Functional architecture of the system
Referring to the functional architecture of the system, the inspection procedure consists of two stages, as shown in Figure 2. The first stage takes place off-line. The user analyses a CAD model of the part and selects the elements to inspect (called “entities”, as described later). This information, together with the calibration models, constitute the input data to the planning algorithm. The output information consists of the set of states that the system needs to follow in order to complete the inspection process.
In the online stage, small areas of the part are scanned in a sequential way according to the inspection plan. As a result, a cloud of points is obtained. These points are classified and referred to a common reference system. Finally, a comparison of this measurement with the tolerance zones is accomplished. Two important aspects need to be mentioned. In the first place, planning has only been considered as an off-line problem prior to any type of measurement. In the second place, the fact of working with high optical resolution normally implies that only a small part of each feature is visible. Therefore, data acquisition requires successive operations of orienting the part and scanning small areas. Besides allowing to deal with the planning problem, this system has been used as an excellent test bench for studies related to precision in measurement, calibration and evaluation of feature extraction algorithms.
3 Discretization of the Part The available information on the inspection process comes from two different sources: the inspection system itself and the part to be inspected. Regarding the inspection system, by calibrating both cameras, the laser plane and the part positioning device, it is possible to obtain a complete model of the system and use it to calculate the projection of the part on both images during inspection.
Fig. 3. Discretization of a part in triangles
Regarding the part, there are different ways of representing the geometric information, such as spatial enumeration (octrees), set-theoretic modelling (constructive solid geometry, or CSG for short) and boundary representations [2]. Nevertheless, our approach to data representation is based on discretizing the surface of the part into triangles [10], as shown in Figure 3. This technique, although widely used in computer graphics applications, has not, to our knowledge, been used as a basis for the analysis of the inspection planning problem. If we know the position of the part and the equation of the laser plane, it is easy to calculate the intersection of the plane with the triangles and to project such intersections on both images. With this approach, several advantages can be obtained. On one hand, we can reduce the analysis to areas around the projections, decreasing the calculation time and avoiding errors. On the other hand, it is easy to associate each digitized point with a triangle of the CAD model, avoiding later processing. Finally, as
the calibration models of both cameras and the laser plane are known, two independent and redundant measurements can be calculated in order to detect the presence of outliers. As a result, a better performance of the system is obtained, reducing the presence of digitization errors to a large extent. However, there are still some situations in which important errors in the measurement process can appear. These errors are basically due to multiple configurations for inspecting a single feature, the presence of internal reflections, and the direct visualisation of specular reflections. Although some of these effects can be minimized by controlling the dynamic range of digitalisation, the power of the laser unit or the aperture of the camera lens, there are still many situations that are unacceptable for inspection, which makes it necessary to find mechanisms for the automatic selection of the best conditions of inspection, that is, an inspection planning process.
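The laser-plane/triangle intersection used above (the laser plane clipped against one triangle of the discretized model) reduces to a few lines of geometry; the NumPy sketch below is illustrative only, and the projection of the resulting segment onto the calibrated cameras is left to a user-supplied routine:

import numpy as np

def plane_triangle_segment(tri, plane_n, plane_d, eps=1e-9):
    # Intersection segment of triangle `tri` (3x3 array of vertices) with the
    # laser plane n.x + d = 0; returns None when the plane misses the triangle.
    dist = tri @ plane_n + plane_d                    # signed vertex distances
    points = []
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        da, db = dist[i], dist[(i + 1) % 3]
        if abs(da) < eps:                             # vertex lies on the plane
            points.append(a)
        elif da * db < 0:                             # edge crosses the plane
            t = da / (da - db)
            points.append(a + t * (b - a))
    if len(points) < 2:
        return None
    return np.asarray(points[:2])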
4 Preliminary Aspects of the Planning Problem Our approach can be classified into the well known generation and verification techniques [11]. These techniques analyse every possible configuration of the system in a sequential way, considering only those configurations that allow one to measure the part (applying restrictions of visibility) and selecting among them the most adequate one in terms of a specific metric of quality (named reliability) [5]. In our system, the metric of quality has been set in terms of the behaviour of the feature detection algorithm which analyses the intersection of the laser plane with the part when seen from both cameras. In order to analyse a finite set of possible states of the system, the rank of variation of every degree of freedom has been discretized. Each combination of discrete values of the degrees of freedom of the system will be named configuration of the system. The analysis will be repeated sequentially for every triangle and for every configuration of the system until all features have been analysed. However, we need to consider some related concepts previously. 4.1 Definition of Entities In the first place, it is important to clarify the concept of entity. Since a discrete representation of the part is being used and the aim of the system is to measure specific features, a way for relating such concepts has been established by means of a new concept called entity. An entity is defined as the set of triangles associated to the areas of the part to inspect. Besides the geometrical information of the triangles, an entity usually includes information related to tolerance zones and reference systems (in some cases called simulated reference system) [12]. Therefore, various analysis such as inspection of parallelism between two faces or cylindricity of a surface are now considered as a problem of inspecting entities or triangles. At this point, two aspects need to be clarified. On one hand, although an approximate representation of the surface of the part has been used, the actual comparison has been performed between the measurements and the exact dimensions of the part. Thus, the discrete representation has only been used as an useful approach to the analysis of the problem. On the other hand, the definition of entities and the process of entering information on tolerances have been done manually using a friendly user interface.
4.2 Configuration Space Another concept to take into account is the way in which the degrees of freedom of the system have been considered. As mentioned before, the only dof of the whole system are those of the part positioning device: two for rotating the part and one for displacing the area to inspect under the laser plane. In the following analysis, a clear difference between these two types of degrees of freedom will be established. In fact, the space of analysis will be reduced to a 2-dimensional space taking into account only the rotational axes of the positioning system and hence considering displacement as a property associated to each configuration. The reason, as it will be explained later, is that the analysis of visibility and reliability depend most fundamentally on the orientation of the part. The result of the analysis will be represented in a diagram of discrete states of the system named configuration diagram. Every opening of the diagram represents a possible configuration of the system. In this diagram, the level of discretization of each degree of freedom depends on the level of detail aimed. Very high levels of discretization imply more accurate solutions although a higher number of states to analyse. As it is clear, the analysis has focused on the degrees of freedom of the system instead of considering other solutions such as the study of all the points of view around the part using a discretized sphere [13]. The reason is that these approaches that analyse large sets of points of view do not have to be physically realizable with the system as opposed to the configuration space approach. 4.3 Visibility The first set of restrictions assures that a specific triangle is visible to the cameras. We use a definition of the concept of visibility that involves both the cameras and the laser plane. A triangle is considered visible when a range of displacement of the part positioning device exists which assures that the intersection of the laser plane with the triangle is visible by both cameras at all times during the complete scanning of this triangle. Therefore, if a triangle is visible under a specific configuration of the system, it means it is possible to record a range of valid displacement of the part for that configuration. In order to optimise the implementation of the definition of visibility, the following restrictions have been consecutively applied in the INSPECTOR 3-D system: • Orientation: The triangle is oriented in such a way that its external face is visible by both cameras. • Field of view: The projection of the intersection lies inside the image. • Occlusions: No other triangle occludes the vision of the one being analysed. The verification of the previous restrictions allows a specific configuration of the system to verify the definition of visibility. The result is a set of valid configurations in which the triangle can be digitalized through laser scanning. 4.4 Reliability Once the condition of visibility is verified, a metric of quality will be associated to every visible configuration of the system. The aim is to be able to select the most adequate configuration for the measurement process between the visible ones. The criterion for the selection will be established in terms of the quality of the image being
observed. Such criterion depends on the behaviour of the feature extraction algorithm. In this case, the algorithm extracts the peak position of the laser plane in the image. In order to measure the quality of each configuration, a metallic planar surface was oriented sequentially sweeping the range of variation of the rotation axes of the positioning system. At each configuration the resulting images have been stored and analysed. Based on the type of laser reflection obtained, four different cases have been distinguished, as indicated in Figure 4:
Fig. 4. Different types of intersections (NOT VISIBLE, GAUSSIAN, SATURATION, SPECULAR)
• Not visible intersection (NOT VISIBLE). The intersection could not be seen under this configuration.
• Gaussian intersection (GAUSSIAN). The laser intersection is not saturated. Subpixel techniques may be employed to improve precision.
• Saturated intersection (SATURATION). The laser intersection is saturated. Subpixel algorithms cannot be applied. Instead, the centre of mass of the saturated intersection is calculated.
• Specular reflection (SPECULAR). The reflection of the laser plane hits directly on the sensor, making it impossible to process the image.
The occurrence of each case is strongly related to the relative orientation between the reflected laser plane, the metal part and the camera. Figure 5 shows the different elements involved in the analysis. The triangle is defined by the vector n, perpendicular to it; the reflected laser plane by the vector r, resulting from the intersection of this plane with the plane perpendicular to the original laser plane; and the axes of both cameras by the vectors v1 and v2. In this context, the cosine of the angle between the vectors r and v* (v1 or v2) constitutes a reliable measure to indicate the type of intersection that can be seen by each camera. We have defined specific thresholds to differentiate each of the four cases and the transitions between them, obtaining seven different states as shown in Table 1.

Table 1. Weights associated to every possible type of image
Type of intersection      cos(r, v)   Weight
Specular                  1           0
Saturation-Specular       0.975       0.25
Saturation                0.95        0.5
Gaussian-Saturation       0.9         0.75
Gaussian                  0.8         1
Not visible-Gaussian      0.7         0.5
Not visible               0.5         0
Fig. 5. Laser reflection with respect to the camera position
Moreover, we have associated a weight in the range between 0 and 1 to each of the possible seven states according to how favourable these states are with respect to inspection. Therefore, as the orientation of the part with respect to the cameras and to the laser plane is always known, it is possible to detect the type of intersection being visualized and to obtain a value of reliability for that configuration. Thus, every valid configuration of the diagram has two associated values: a range of displacement for digitising the triangle (obtained in the visibility analysis) and a reliability measure obtained from Table 2.

Table 2. Weights associated to the cosine between the camera axis and the laser reflection

Cosine   0.0   ...   0.5   0.6    0.7   0.8   0.9    1.0
Weight   0.0   0.0   0.0   0.25   0.5   1.0   0.75   0.0
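The reliability measure therefore reduces to a threshold lookup on cos(r, v). The Python function below is one plausible piecewise-constant reading of Table 2; the paper does not state how values between the tabulated cosines are handled, so the interval boundaries are assumptions:

def reliability(cos_rv):
    # Reliability weight from cos(r, v), following the sampled values of Table 2.
    if cos_rv >= 1.0:          # direct specular reflection onto the sensor
        return 0.0
    if cos_rv >= 0.9:          # saturation-Gaussian transition
        return 0.75
    if cos_rv >= 0.8:          # Gaussian intersection: best for sub-pixel extraction
        return 1.0
    if cos_rv >= 0.7:
        return 0.5
    if cos_rv >= 0.6:
        return 0.25
    return 0.0                 # intersection not visible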
4.5 Level of Discretization of the Part
An important aspect to consider in the following studies is the level of discretization of the triangles. Since there are two visibility restrictions (field of view and occlusions) affected by the size of the triangles, it is important to ensure that their size is not so large that it invalidates many configurations. However, if the size of the triangles is too small, there is a risk of causing excessive processing. In our approach, the part is initially discretized into triangles using conventional techniques [14]. Next, these triangles are divided recursively, using the midpoints of their sides, into four smaller triangles until the projection of the maximum dimension of every triangle is smaller than 40% of the dimensions of the image.
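The recursive midpoint subdivision might be implemented along the following lines. This is an illustrative sketch: projected_extent (the maximum projected dimension of a triangle under the calibrated camera model) is assumed to be supplied by the caller, and the image dimension used here is only a placeholder:

import numpy as np

def subdivide(tri, projected_extent, max_fraction=0.4, image_dim=480):
    # Split a triangle (3x3 array of vertices) into four via edge midpoints
    # until its maximum projected dimension is below 40% of the image dimension.
    if projected_extent(tri) <= max_fraction * image_dim:
        return [tri]
    a, b, c = tri
    ab, bc, ca = (a + b) / 2.0, (b + c) / 2.0, (c + a) / 2.0
    children = [
        np.array([a, ab, ca]),
        np.array([ab, b, bc]),
        np.array([ca, bc, c]),
        np.array([ab, bc, ca]),
    ]
    out = []
    for child in children:
        out.extend(subdivide(child, projected_extent, max_fraction, image_dim))
    return out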
5 Planning Algorithm
The procedure starts with a first stage in which the possibility of inspecting every single triangle of an entity is analyzed. Once this analysis is done, it is possible to know whether a specific entity can be inspected with the system. The only condition that must be verified is that every triangle of the entity has at least one reliable
configuration. However, it is very possible that if the analysis ends here and the best configuration for inspecting each triangle is selected, a set of very different configurations would be obtained. This would lead to a large number of successive operations of orienting and scanning of small areas. Instead, an additional fusion of the results of the triangles of one entity will be developed in order to unify their conditions of inspection, in those cases in which common inspection configurations exist. As a result, it will be possible to inspect groups of neighbouring triangles under the same configuration. This set of triangles will be named Group of Inspection. 5.1 Planning on Individual Triangles By applying the previous considerations, a value of reliability will be obtained for every visible configuration of a specific triangle. 5.2 Fusing the Diagrams of the Triangles of the Same Entity In order to fuse the configuration diagrams of different triangles, the following property will be used: neighboring triangles with similar orientation show small variations in their configuration diagrams.
Fig. 6. Configuration diagram associated to one link
At this point, it is important to clarify the use of the concept of similarity. It is considered that two triangles are similar when the intersection between their configuration diagrams is not empty. This definition is logical since the intersection of both diagrams implies that common inspection conditions exist for both triangles. A very useful approach is based on representing the information of the part by means of some kind of structure that reflects aspects of proximity or closeness between two different triangles. According to this, a new graph based representation, called
proximity graph, has been defined. In general terms, this graph consists of a set of nodes that represent each of the triangles of the part. A node is related to another by means of a link if both triangles share a side. Additionally, a weight has been associated to every link in the graph. But instead of using a single value, as is usually done, a complete configuration diagram has been used. This configuration diagram is obtained from the intersection between the two diagrams of the triangles of the nodes (Figure 6). Therefore, the proximity graph is a new representation that combines information on similarity and proximity. The study of fusing different triangles to create groups of inspection is based on the analysis of the proximity graph.
5.3 Initial Definitions
During the analysis of the proximity graph, a triangle (or node) can be in two possible states: classified, when assigned to a group of inspection, and unclassified, when not assigned to any group of inspection. If a triangle is not classified, it is being considered as a possible new member of a group and it is labelled as a candidate. As a result, in order to study the triangles, three different lists will be maintained: the list of groups, the list of candidate triangles and the list of unclassified triangles. Initially, the list of unclassified triangles is filled with all the triangles. The search process ends when the list of unclassified triangles is empty. The results are stored in the list of groups.
Creation of new groups: When a new group is created, the list of candidates is emptied. Every group has a single configuration diagram associated to it. Initially, the diagram of the group is set equal to the diagram of the triangle considered as the seed of the group, and it is modified according to the diagrams of new triangles added to the group. The criterion to modify the configuration diagram of the group consists of calculating the minimum between the diagram of the group and the diagram of the triangle to be added.
Selection of the seed triangle: The selection of a node as a seed is based on choosing the one with the largest number of neighbours in the proximity graph and the largest number of valid configurations in the configuration diagram of the node. The aim is not only to start from a configuration diagram with many valid configurations but also to avoid nodes which correspond to triangles on the borders of an entity. Such nodes usually have few neighbours and are located in areas where occlusions occur.
Analysis of neighbours: Once a node is selected as a seed, it is classified as a member of an inspection group and it is removed from the list of unclassified triangles. Besides, all unclassified neighbours of the seed node are included in the candidate list. Next, all candidates are analysed in search of the triangle whose reliability diagram has the largest intersection with the reliability diagram of the group. Therefore, the impact of adding a triangle to a group is minimal, due to the small reduction of reliable configurations that the reliability diagram of the group suffers. Once a candidate is added to a group, it becomes the new seed for the analysis, and therefore all its neighbours are added to the list of candidates.
Group Completion: A group is considered completed when the list of candidates is emptied or when none of the configuration diagrams of the candidates intersect with the configuration diagram of the group (Figure 7). The configuration used for inspection is obtained selecting the configuration with the highest value in the configuration diagram. The range of displacement of the part is obtained from the union of the ranges of displacement of all the triangles of the group.
Fig. 7. Entities in the proximity graph
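The grouping procedure of Sections 5.2 and 5.3 amounts to a greedy region growing on the proximity graph. The Python sketch below is illustrative, not the authors' implementation; it assumes each configuration diagram is stored as a NumPy array of reliabilities indexed by the discretized rotation axes, and neighbours gives the adjacency lists of the proximity graph:

import numpy as np

def grow_groups(diagrams, neighbours):
    # diagrams: {triangle_id: 2-D reliability array}; neighbours: adjacency lists
    # of the proximity graph. Returns a list of (triangle set, group diagram).
    unclassified = set(diagrams)
    groups = []
    while unclassified:
        # Seed: most neighbours, then most valid (non-zero) configurations.
        seed = max(unclassified, key=lambda t: (len(neighbours[t]),
                                                int(np.count_nonzero(diagrams[t]))))
        members, group_diag = {seed}, diagrams[seed].copy()
        unclassified.discard(seed)
        candidates = set(neighbours[seed]) & unclassified
        while candidates:
            # Candidate whose diagram intersects the group diagram the most.
            best = max(candidates,
                       key=lambda t: float(np.minimum(group_diag, diagrams[t]).sum()))
            merged = np.minimum(group_diag, diagrams[best])
            if not merged.any():               # no common valid configuration left
                break
            group_diag = merged
            members.add(best)
            unclassified.discard(best)
            candidates.discard(best)
            candidates |= set(neighbours[best]) & unclassified
        groups.append((members, group_diag))
    return groups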
6 Example Figure 8 shows the result of applying the planning algorithm to an angular entity.
Fig. 8. Three inspection groups obtained from an angular entity
As observed, three inspection groups are obtained, each one associated to a different angular configuration. Therefore, the inspection will consist in three different stages. In each stage, the part will be oriented and displaced according to the configuration diagrams of the groups (Figure 9).
Fig. 9. Digitalisation process based on three inspection groups
It is important to point out the fact that when the planning algorithm does not find a complete solution for the whole entity, it provides a partial solution for the triangles that have valid configurations. This allows one to understand the reasons why a complete inspection of the part cannot be performed.
7 Conclusions
This study describes a new planning algorithm to perform dimensional inspection of metal parts with three-dimensional characteristics in an automatic way. The algorithm follows a generation and verification strategy and it works directly in the configuration space of the inspection system. It uses a discrete representation of both the degrees of freedom of the system and the information of the part, represented as a set of triangles. The geometrical information of the features to inspect has been grouped into entities (sets of triangles). Each entity has been represented using a graph-based diagram called the proximity graph, which is very appropriate for this discrete analysis. Our approach has several advantages: we have not imposed any restrictions on the complexity of the features to inspect or on the types of measurements to perform. The limitations are only those associated with the use of a visual measurement system (lack of visibility) and with the limited degrees of freedom of the system, which only allow one to orient the part under specific configurations. Moreover, the planning algorithm provides partial solutions to the problem being solved. In case the entities being analysed cannot be completely inspected according to specifications, it is still possible to obtain solutions for the triangles that can actually be inspected, which constitutes very interesting information to help reorient the inspection process. The performance of this planning algorithm has been extensively tested using a set of more than twenty complex mechanical parts from the automobile industry, and the results have been quite satisfactory. On the other hand, when the inside of a part is inspected, the use of a laser introduces some problems such as internal reflections, which
cause two different negative effects: the appearance of multiple candidate points and the modification of the global illumination of the piece. Future work will take these problems into account.
Model-Guided Luminance Range Enhancement in Mixed Reality

Yunjun Zhang and Charles E. Hughes

Computer Science Department, University of Central Florida
Abstract. Mixed Reality (MR) applications tend to focus on the accuracy of registration between the virtual and real objects of a scene, while paying relatively little attention to the representation of the luminance range in the merged video output. In this paper, we propose a means to partially address this deficiency by introducing Enhanced Dynamic Range Video, a technique based on differing brightness settings for each eye of a video see-through head mounted display (HMD). First, we construct a Video-Driven Time-Stamped Ball Cloud (VDTSBC), which serves as a guideline and a means to store temporal color information for stereo image registration. Second, with the assistance of the VDTSBC, we register each pair of stereo images, taking into account confounding issues of occlusion occurring within one eye but not the other. Finally, we apply luminance enhancement on the registered image pairs to generate an Enhanced Dynamic Range Video.
1 Introduction
Current computer graphics rendering techniques are able to produce images that are close to photorealistic. When these approaches are applied in an MR environment, it should be possible to make the virtual indistinguishable from the real. However, limited by the contrast ratio (around two orders of magnitude) of conventional output devices, the actual images displayed cannot match the range that occurs in most natural settings. For instance, an indoor scene that includes visibility to outdoor sunlight provides a contrast ratio of five orders of magnitude, a range that is within the capabilities of the human eye [8] but beyond those of most conventional displays. In this paper, we address this luminance range limitation by introducing Enhanced Dynamic Range Video into the MR domain. We construct a Video-Driven Time-Stamped Ball Cloud (VDTSBC), which serves as a guideline and a means of storing temporal color information for stereo image registration. The positional data for each Ball in the VDTSBC is acquired by precise measurement, usually via a 3D terrestrial laser scanner, while the time-stamped dual-channel color parameters are projected from two corresponding images captured from the stereo rig. With the assistance of the VDTSBC, the pair of stereo images, each based on a different brightness setting, can be registered, even in the presence of areas occluded in one but not the other eye.
Fig. 1. Flow chart of our luminance enhancement system
In order to generate a full radiance map that can cover the human eye contrast ratio in an indoor-outdoor combined MR environment, more than two frames taken under distinct exposures are required [1]. This is an impractical requirement for an online MR system using the "eyes" of the HMD. To achieve capabilities similar to the human visual system's ability to view bright sunlight and dark indoor details at the same time, we abstract the real-world radiance with a low and a high partial radiance map. After registering and combining the two views, the over-saturated and under-illuminated regions can be represented in the same frame. To validate our luminance enhancement approach, we make the assumption that the two cameras in the stereo rig, when set at the same exposure level, present the same color for most scene objects. To see that this assumption is reasonable, we note that, when the two sets of hardware for displaying the views of the left and right cameras are identical, the actual colors displayed for each camera are decided by the illumination geometry and the viewing geometry. Since the cameras in the stereo rig take images at the same moment, the illumination geometry for the scene is constant at that time. The distance between the two camera centers in our stereo rig is less than 70 mm. For any scene position more than 1.5 m from the stereo rig, the view angle difference α is less than 3 degrees. In the well-known Phong lighting model, the specular reflection intensity is proportional to the n_s-th power of a cosine function of α, which results in only around a 10 percent intensity difference at 3 degrees even if n_s is as high as 100. Thus, in our context of MR environments, this simplifying assumption may be made with little loss of generality.
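A quick back-of-the-envelope check of this claim can be done numerically. The 70 mm baseline and the 1.5 m minimum distance come from the text; the fronto-parallel geometry and the use of cos(α)^n_s as the specular term are simplifying assumptions made here for illustration.

```python
import math

# With a 70 mm baseline and scene points at least 1.5 m away, the viewing
# directions of the two cameras differ by less than about 3 degrees, and even
# a very shiny Phong surface (n_s = 100) changes its specular term by only
# on the order of 10 percent.
baseline = 0.07          # metres between the camera centres (from the text)
distance = 1.5           # metres to the closest scene point considered
alpha = math.degrees(math.atan2(baseline, distance))   # view-angle difference
ratio = math.cos(math.radians(alpha)) ** 100            # Phong specular falloff
print(f"alpha = {alpha:.2f} deg, cos(alpha)^100 = {ratio:.3f}")
# -> alpha is about 2.67 deg and cos(alpha)^100 is about 0.897,
#    i.e. roughly a 10 percent intensity difference.
```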
This paper is organized as follows: Section 2 gives a quick review of related research. Section 3 presents an overview of our framework, including the notation and the components of the system. Section 4 explains the construction and operations based on the VDTSBC for registering stereo images. Section 5 shows how to generate an Enhanced Dynamic Range Video. Section 6 demonstrates three experimental video results. Section 7 presents the conclusions, followed by the acknowledgements.
2 Related Work
Research in High Dynamic Range Imaging remains active in the computer graphics and computer vision communities. Most previous research has focused on the generation and representation of HDR images given multiple snapshots taken under identical scene-camera geometric relations. In order to generate an HDR image, the camera response function, which defines the mapping from the irradiance to the image brightness, needs to be recovered. Debevec and Malik [1] construct the camera response function while reciprocity holds (as long as the product remains the same, halving the irradiance and doubling the exposure time will not change the optical density). Their approach requires a minimum of two images to recover the camera response function, though more images are needed to generate a whole-range radiance map. Shafique and Shah [9] model the camera response function for each color channel as a gamma curve. This approach requires a set of registered images under different unknown illumination conditions. A rough estimation of the exposure setting for a set of registered images is also enough to extract the camera response function, as is proposed in Mitsunaga and Nayar's work [4]. The color variance between differently exposed frames needs to be adjusted before applying most vision-based correspondence searching algorithms. The color transfer approach [6] is popular. Porikli [5] calibrates the inter-camera colors based on a color correlation map. Though promising, this approach helps very little in over-saturated and under-illuminated regions. Hongcheng Wang et al. [10] diverge from image registration by customizing a camera rig with three CCD sensors aligned on the same principal axis. Their results achieve good spatial and temporal consistency. Unfortunately, we cannot apply this technique given the constraints of our application to function within the domain of existing video see-through HMDs.
3 Framework Overview
Figure 1 summarizes the pipeline of our video enhancement approach. The stereo camera rig calibration and the background point cloud scanning are applied offline. The outdoor region in the cloud is labeled based on the window areas. During an interactive experiment, the relative poses of the cameras with respect to the background are updated by a permanently attached tracker. Upon registration, the VDTSBC provides geometrical information to generate a virtual disparity map.
Fig. 2. An overview of the camera coordinate frame – For a general projective camera model, the projection center is called the camera center C. The plane passing through C and parallel to the image plane is the principal plane. The Z axis of the camera coordinate system is defined as the principal axis. The point where the principal axis meets the image plane is called the principal point p. The actual image origin is defined as O.
Differing from a real disparity map, the virtual one records not only the disparity values but also the corresponding ball labels, by which a depth ordering is established to segment the occlusion region. Finally, all the regions are submitted to the luminance enhancement and adjustment modules to deliver an improved scene rendering.

The camera model we selected in our system is a finite projective camera with radial distortion. The basic definitions are listed in Figure 2. In our framework, a capitalized boldface letter $X$ denotes a non-homogeneous 3-vector in Euclidean 3D space, and $\vec{X}$ represents the corresponding homogeneous coordinates. Similarly, lowercase $x$ and $\vec{x}$ denote the same concepts in Euclidean 2D space. A tilde pair $(\tilde{x}, \tilde{y})$ denotes the normalized coordinates, and a letter with a subscript $d$ represents the distorted coordinates. The coordinates on the 2D camera plane are denoted as $(x_p, y_p)$, and the final pixel coordinates are denoted as $(x_i, y_i)$. For stereo matching and luminance enhancement, $I$ denotes an image and $I(x)$ is the color value at $x$. The corresponding position of $x$ in the second camera is denoted as $\hat{x}$, and the radiance value of $I(x)$ is represented by $\dot{I}(x)$.
4 Left-Right Input Registration by VDTSBC Model
The main contribution of this paper is the introduction of the VDTSBC to assist the construction of a high quality stereo matching, which is the central issue for video luminance enhancement.
We define a ball $B$ as a 6-tuple: $B = \langle T, R, C_l, C_h, t, r \rangle$, where $T$ denotes the three-dimensional coordinates of the center of the ball. $C_l$ and $C_h$ specify the color information for the ball, back-projected from the two differently exposed stereo cameras. $t$ records the duration from the time $C_l$ and $C_h$ were projected to the present. $r$ is the radius of $B$, which is defined by a function $radius(t)$. $R$ records the camera's direction at time $t$. A Video-Driven Time-Stamped Ball Cloud $G$ is defined as a set of balls:

$$G = \{ B_i = \langle T_i, R_i, C_{li}, C_{hi}, t_i, r_i \rangle \mid i = 1..n \}$$

For each ball $B_i$, the positional parameter $T_i$ is acquired by the off-line geometry scan using a 3D laser scanner. Our approach to acquire $C_{li}$ and $C_{hi}$ is explained as follows. To simplify the notation, we ignore the index $i$. For a ball $B$, the homogeneous coordinates $\vec{X}$ in the world frame are given by $(T, 1)^{\top}$. The homogeneous coordinates $\vec{X}_c$ of $B$ in the camera frame are given by

$$\vec{X}_c = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} \vec{X}$$

where $\tilde{C}$ represents the coordinates of the camera center in the world coordinate frame. The ideal point corresponding to $B$ in one camera view is given by $(\tilde{x}, \tilde{y}) = (\vec{X}_c(1)/\vec{X}_c(3),\ \vec{X}_c(2)/\vec{X}_c(3))$. In order to register the virtual scene correctly onto the camera view, a camera distortion model needs to be taken into consideration. For our non-wide-angle stereo camera, it is not necessary to push the radial component of the distortion model beyond the 4th order. [11] also suggests that the tangential distortion is negligible compared with the radial distortion; therefore the actual projected point of ball $B$ on the camera frame is calculated by $(x_d, y_d) = L(\tilde{x}^2 + \tilde{y}^2)(\tilde{x}, \tilde{y})$, where $L(r^2)$ is a polynomial function $L(r^2) = 1 + a_1 r^2 + a_2 r^4$. The coefficients $a_1$ and $a_2$ are the camera radial distortion parameters. The actual pixel coordinates of the ball $B$ are $(x_p, y_p, 1)^{\top} = K (x_d, y_d, 1)^{\top}$, where $K$ is an upper triangular $3 \times 3$ camera calibration matrix with the skew parameter preset to zero. The color information $C$ is then acquired by a convolution around $x_i = (x_{pi}, y_{pi})$:

$$C_k = \sum_{\xi} G(\xi)\, I(x_i + \xi), \qquad k = l, h$$

where $G(\xi)$ is a spatial smoother, in our case a Gaussian convolution kernel, $\xi$ is a small displacement vector, and $I(x_i + \xi)$ denotes the image color information at $x_i + \xi$. Under occlusion, a ball can be visible in one view but not in the other view. The $C_l$ and $C_h$ calculated in this case are not consistent using this
approach. The reason is that, under occlusion, the color information projected onto the ball $B$ belongs to some other ball that overlaps $B$ from the camera view direction. In order to handle occlusion, we extend the disparity map concept in the virtual scene to include one more parameter, the ball label $p$, which indexes the corresponding ball $B$ used to generate the disparity. This extension builds a direct connection between a geometrical position and its image projections. On the other hand, the real disparity map preserves depth cues related to the cameras. Keeping these ideas in mind for one camera, we project all balls in the VDTSBC onto the two cameras to generate disparity values. Without loss of generality, we describe only the construction of the virtual disparity map in the right V camera. Assume ball $B$ has a projection $x$ in the right V camera and a projection $\hat{x}$ in the left V camera. The disparity is calculated by

$$\partial x = x - \hat{x}$$

If several balls have the same image projection $x$, the ball that has the smallest depth $\vec{X}_c(3)$ has its index attached to $x$, and its disparity value $\partial x$ is updated in the final virtual disparity map for the right V camera. If the index at the corresponding position $\hat{x}$ is not the same as for $x$, the index $p$ for $x$ is modified to a unique value to identify the occlusion.
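The projection chain just described (world frame, camera frame, normalized coordinates, radial distortion, pixel coordinates) and the resulting disparity can be sketched as follows. The calibration matrix, distortion coefficients and camera placement used here are made-up placeholders for illustration, not values from the paper.

```python
import numpy as np

def project_ball(T, R, C, K, a1, a2):
    """Project a ball centre T (world frame) into pixel coordinates.

    R, C are the camera rotation and centre, K the 3x3 calibration matrix,
    a1/a2 the 2nd- and 4th-order radial distortion coefficients, following
    the model above.  All numeric values below are illustrative only.
    """
    Xc = R @ (np.asarray(T, float) - np.asarray(C, float))   # camera frame
    x_n, y_n = Xc[0] / Xc[2], Xc[1] / Xc[2]                   # normalized coords
    r2 = x_n ** 2 + y_n ** 2
    L = 1.0 + a1 * r2 + a2 * r2 ** 2                          # radial distortion
    xd, yd = L * x_n, L * y_n
    p = K @ np.array([xd, yd, 1.0])                           # pixel coordinates
    return p[:2] / p[2], Xc[2]          # pixel position and depth Xc(3)

# Toy disparity entry for one ball seen by the right and left cameras.
K = np.array([[800.0, 0.0, 160.0], [0.0, 800.0, 120.0], [0.0, 0.0, 1.0]])
R_id = np.eye(3)
T = [0.1, 0.0, 2.0]                                  # ball centre, 2 m ahead
x_right, depth = project_ball(T, R_id, [0.00, 0.0, 0.0], K, -0.1, 0.01)
x_left, _      = project_ball(T, R_id, [0.07, 0.0, 0.0], K, -0.1, 0.01)
disparity = x_right - x_left                         # corresponds to x - x̂
```

Keeping the depth returned by the projection makes it easy to resolve competing balls that project to the same pixel, as required when building the virtual disparity map.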
5 Enhanced Dynamic Range Video Generation
Theoretically, after registering most of the regions in the stereo images, a full radiance map that covers the whole image region is needed in order to enhance the image dynamic range. However, the possibility of missing registration cannot be ignored [3]. In their case, Kang et al. relax the registration requirement by excluding over-saturated or under-illuminated pixels from the weighted radiance map generation function. Our system distinguishes the indoor and outdoor environments by labeling the VDTSBC. Based on this labeling, we either enhance or adjust the luminance dynamic range to achieve the desired perceptual improvement. For generating a correct tone map, existing standard methods like [7] can be applied, given camera color calibration in advance. For the purpose of enhancing the luminance range, our simple linear method works reasonably well and it can be easily implemented in our online framework.

5.1 Luminance Enhancement
The techniques to enhance the luminance dynamic range in the left and right cameras are symmetric. Without loss of generality, we explain only the right-image luminance enhancement in this subsection. With the assistance of the VDTSBC, most of the modeled regions in $I_R$ have three available colors from which to choose. For a position $x$ in such a region of $I_R$, $\dot{I}_R(x)$ is from the right image;
$\dot{I}_L(\hat{x})$ is from the corresponding location in the left image; and $C_{hi}$ is from $B_i$ in the VDTSBC. These three have weights of $w_R$, $w_L$ and $w_B$, respectively. If the corresponding position $\hat{x}$ of $x$ is occluded in the left camera, $\dot{I}_L(\hat{x})$ is meaningless. The radiance map can be calculated as:

$$R = \sum_{k=1}^{n} \bar{w}_k \dot{I}_k \qquad (1)$$

where $\bar{w}_k$ is the normalized weight and we abuse the notation slightly to denote $\dot{I}_k$ as one of the available radiance values. $n$ is 3 when three colors are available; otherwise $n = 2$.
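A minimal sketch of equation (1) for a single pixel is given below; the weight values are arbitrary placeholders, since the paper does not specify $w_R$, $w_L$ and $w_B$ numerically.

```python
import numpy as np

def combine_radiance(values, weights):
    """Weighted radiance combination for one pixel, following eq. (1).

    `values` holds the available radiance estimates (two or three of:
    right image, left image, ball colour C_h) and `weights` the
    corresponding un-normalized weights w_R, w_L, w_B.
    """
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w_bar = w / w.sum()                  # normalize the weights
    return float(np.dot(w_bar, v))

# Occluded in the left camera: only the right image and the ball colour remain (n = 2).
r_occluded = combine_radiance([0.62, 0.58], [1.0, 0.5])
# All three sources available (n = 3).
r_full = combine_radiance([0.62, 0.60, 0.58], [1.0, 1.0, 0.5])
```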
Fig. 3. The sparse 3D point cloud is projected onto the left and right images. Each point in the cloud represents a cross on the checker board. A radial distortion model is applied to achieve accurate registration.
5.2 Luminance Adjustment
Under high brightness settings in our system, the outdoor environments are too bright to be registered based on our stereo method. At the same time, the VDTSBC gives no geometrical information for outdoor environments. These facts direct us to a practical solution. For the left image IL , which has a lower brightness setting, the color from I˙L in the outdoor region is used to fill the radiance map. For the right image, things are more complicated since it is necessary to get detailed information from the right image for an outdoor region.
Fig. 4. Left image is actual input from the left camera with a low brightness setting. Right image is the input from the right camera with a high brightness setting.
Though it is hard to achieve a precise registration in an outdoor region, an approximation is acceptable for most situations. This is supported by the fact that a large portion of the outdoor scene is far enough away to be "indistinguishable" through the HMD video cameras. In fact, with an image resolution of 320×240, any geometrical location 15 meters or more away from the stereo camera rig results in merely a sub-pixel displacement. In our current implementation, a direct copy from the right image to the left image in the outdoor region is applied, though a small displacement along the epipolar line for each pixel might be more precise. The actual tradeoff of such a warping function compared with direct copying needs further evaluation.
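Whether the far-field direct copy stays below one pixel for a given setup can be checked with the usual disparity relation d = f·B/Z. The 70 mm baseline is taken from the text; the focal length in pixels below is an assumption (the paper does not report it), so the numbers are only indicative.

```python
import math

def disparity_px(f_px, baseline_m, depth_m):
    # Stereo disparity in pixels: d = f_px * B / Z.
    return f_px * baseline_m / depth_m

f_px = 200.0                              # assumed focal length (pixels) for a 320x240 image
print(disparity_px(f_px, 0.07, 15.0))     # ~0.93 px: sub-pixel at 15 m under this assumption
print(15.0 / 0.07)                        # focal length (px) at which d reaches 1 px for Z = 15 m
```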
6 Implementation and Results
We build our framework within the Mixed Reality System developed at the Media Convergence Laboratory (Hughes et al., [2]). The positional information for the VDTSBC that we apply in our MR system is acquired by a Riegl LMS-Z420i 3D terrestrial laser scanner. The head mounted display we select is the Canon VH-2002 see-through model, on which an Intersense IS-900 ultra-sonic tracking sensor is permanently attached. The system runs on two Pentium 4 3 GHz machines, one for capturing and tracking, the other for rendering. The video card we use is an nVidia 6800 with 256 MB memory. For all of the experimental sequences, the stereo rig was calibrated off-line with the stereo calibration toolbox available at http://www.vision.caltech.edu/bouguetj/calib_doc/. Only the 2nd and 4th order radial distortion parameters are considered, and the camera skew is set to zero. The displacement between the tracker center and the stereo rig is also compensated before registering the virtual and real cameras. Since our method does not rely on high accuracy of image feature correspondence, the difference between the two brightness levels within the stereo cameras can be adjusted freely depending on the radiance range the user wants to cover. Figure 4 shows a typical brightness setting for our framework.
Fig. 5. In this sequence, a paper box is attached to the wall to demonstrate the advantage of our method. (a) Left input with the projected point cloud on it. (b) One frame of the enhanced result of our method. (c)-(f) Left: two frames of our enhanced results. Right: the corresponding directly merged results. (g) A frame of the enhanced result with an indoor and outdoor mixed view. (h) Left input. (i) Right input.
Calibration board test. In this sequence, a flat-shaped ball cloud is applied. Figure 3 shows a sparse cloud model to demonstrate the accuracy of the cloud-image and image-image registration. Figure 6 is a representative frame from the test. Since our method takes radial distortion into consideration, the slightly distorted image pair is correctly registered and the enhanced left view (f) has a much better brightness level than the left input does. The resolution of the input in this test is 512×480.
Fig. 6. (a) Left input. (b) Right input. (c) Disparity map generated by the VDTSBC. Note that only the checker board area is modeled. (d) Directly merged result for the left display. The displacement is large due to the close camera position. (e) Merged left result of our method. The brighter checker board is achieved by combining the checker board from the right image into the left image. (f) Result of the checker board only.
Office Room MR environment sequence. The office room has an unobstructed window facing the sun during the day time. A single brightness setting is clearly not enough to cover most of the details inside and outside the room, as Figure 5 (h) and (i) demonstrate. The VDTSBC here serves as the intermediate 3D marker for stereo image registration, and provides proper color for the occlusion region. Since we assume that the lighting condition cannot change suddenly, the output enhanced image results are perceptually acceptable, including those in the occluding region. Figure 5 shows a few key frames of two office sequences taken at two different times of day. In this set of sequences, a non-flat
Fig. 7. From left to right: three typical enhanced frames for the modeled game table on the top and the corresponding directly merged results on the bottom
model is used for the VDTSBC. The directly merged result in Figure 5 (f) clearly shows that the disparity on the front side of the box is noticeably larger than the disparity on the wall. For an unconstrained camera with a moving camera center, it is impossible to compensate for the disparity with a single global motion. Here the VDTSBC serves as an effective guide to generate a correct disparity map. For the indoor-outdoor mixed area in Figure 5 (h)(i), our result preserves dark indoor details, like the Chinese characters on the wall, together with the visible outdoor scene (a portion of a tree) in the same frame, Figure 5 (g).

Pinball game machine sequence. In this sequence, instead of defining a global coordinate frame for the scene, we define the top left corner of the game table as the local origin; the right-hand coordinate frame has x- and y-directions parallel to the sides of the table. The transformation between the table coordinates and the tracking system coordinates is adjusted interactively. The table has a very complicated surface shape. We scan the 3D shape of the pinball table with a Riegl laser scanner. To demonstrate the efficacy of our method, we apply a relatively sparse model of the machine. The small black holes shown in the results are the consequence of low point density. Other than this, the enhanced frames show very good registration results (Figure 7).
7 Conclusions
In this paper, we present a framework to enhance the visible luminance range in a mixed reality environment. We introduce the Video-Driven Time-Stamped Ball Cloud (VDTSBC) into the camera registration pipeline to assist the matching between stereo images and to provide a rational luminance enhancement in occlusion areas, which is hard to recover using a pure stereo-based algorithm.
Acknowledgements This work is partially supported by the VIRTE (Virtual Technology and Environment) project funded through the Office of Naval Research. We would like to gratefully acknowledge Nick Beato and Matt O’Connor for contributions they made to the mixed reality rendering software. We thank the UCF Media Convergence Laboratory at the Institute for Simulation and Training for the use of its facilities. Moreover, we thank Mark Colbert for sharing insightful comments with us and we recognize the earlier work of James Burnett whose application of localized color transfer on the stereo input inspired this work.
References 1. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: SIGGRAPH ’97. Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 369–378. ACM Press/AddisonWesley Publishing Co, New York (1997) 2. Hughes, C.E., Stapleton, C.B., Hughes, D.E., Smith, E.M.: Mixed reality in education, entertainment, and training. IEEE Computer Graphics and Applications 25, 24–30 (2005) 3. Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High dynamic range video. ACM Trans. Graph. 22(3), 319–325 (2003) 4. Mitsunaga, T., Nayar, S.K.: Radiometric Self Calibration. In: CVPR’99. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, p. 380. IEEE Computer Society Press, Los Alamitos (1999) 5. Porikli, F.: Inter-camera color calibration by correlation model function. In: ICIP. IEEE International Conference on Image Processing, vol. 2, pp. 133–136 (2003) 6. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color Transfer between Images. IEEE Computer Graphics and Applications 21, 34–41 (2001) 7. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic tone reproduction for digital images. In: SIGGRAPH ’02. Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 267–276. ACM Press, New York, NY, USA (2002) 8. Reinhard, E., Ward, G., Pattanaik, S., Debevec, P.: High Dynamic Range Imaging, First Edition: Acquisition, Display, and Image-Based Lighting. Morgan Kaufmann, San Francisco (2005) 9. Shafique, K., Shah, M.: Estimation of the radiometric response functions of a color camera from differently illuminated images, pp. IV: 2339–2342 (2004) 10. Wang, H., Raskar, R., Ahuja, M.: High Dynamic Range Video Using Split Aperture Camera. In: OMNIVIS (2005) 11. Zhang, Z.: A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
Improving Reliability of Oil Spill Detection Systems Using Boosting for High-Level Feature Selection

Geraldo L.B. Ramalho and Fátima N.S. de Medeiros

Image Processing Research Group, Federal University of Ceará, 60455-760 Fortaleza, CE, Brazil
[email protected],
[email protected] http://www.gpi.deti.ufc.br
Abstract. A major problem in surveillance systems is the occurrence of false alarms, which lead people to take wrong actions. Thus, if false alarms are frequent and occur mainly due to system misclassification, the system becomes unreliable and soon falls out of use. This paper proposes a classification method for oil spill detection using SAR images. The proposed methodology uses a boosting method to minimize misclassification and also to reach better generalization in order to reduce false alarms. Different feature sets were applied to single neural network classifiers and their performance was compared to a modified boosting method which provides high-level feature selection. The experiments show a substantial improvement in discriminating SAR images containing oil spots from look-alike ones.

Keywords: oil slick, SAR image, neural network, feature selection, boosting.
1 Introduction
Concerning ocean oil spill surveillance, one can make use of SAR (Synthetic Aperture Radar) images to extract feature sets by using different methods to predict whether a specific region contains an oil spill or not [1] [2]. As radar backscatter values for oil slicks are very similar to backscatter values for very calm sea areas and other ocean phenomena, dark areas in SAR imagery tend to be misinterpreted [3]. In oil spill detection systems, it is important to reduce the number of false alarms in order to take the correct action in time to minimize the effects of oil spill disasters [2]. Lower false alarm levels represent less cost and a more effective use of the surveillance resources. Feature extraction methods based on the oil spill shape are penalized by the presence of speckle noise in SAR images and by the different number of looks and resolution of the images. One can improve the classification performance by combining features coded by different systems of feature extraction, such as spatial texture [4] [5] and spectral texture, although this often increases the dimensionality of the problem, which results in a more complex classifier design.
Classical heuristic methods for feature combination and dimensionality reduction make use of linear transformations like PCA (Principal Component Analysis), CVA (Canonical Variate Analysis) and MLD (Multiple Discriminant Analysis) [6]. However, these methods present some limitations which depend on the statistics of the application dataset. As this paper deals with a small set of sample images, there are some numerical problems in obtaining good statistical parameter estimates. The use of ANN (Artificial Neural Network) classifiers based on MLP (Multilayer Perceptron) and RBF (Radial Basis Function) produces good solutions [7] for this approach. When using single neural networks like MLP, RBF and SVM (Support Vector Machine) classifiers, no previous knowledge about the density function of the data is necessary and, thus, there is no need for statistical parameter estimation. Both MLP and SVM can be tuned to provide good generalization [8]. However, when handling noisy data from a small sample size, they can easily run into an overfitting problem. Also, their parameter estimates are not always guaranteed to be a global minimum [9]. Feature sets coded by different systems were joined in order to provide more details for class discrimination. Nevertheless, the classifier design becomes a hard task due to the higher data dimensionality. The improvement in detection performance achieved by single neural network classifiers is limited by overfitting. Thus, using a single classifier with the lowest training MSE (mean square error) may not achieve the best solution for designing an automatic system for reliable oil spill detection [10]. In recent works, researchers in computational learning theory have started to consider algorithms that search for a good classification rule by optimizing quantities other than the training error (such as the least mean square error). Algorithms of this type include Boosting [11], which has been designed to maximize the margin of a linear classifier. Boosting is one of the most common ensemble methods found in recent research used to solve high-dimensionality classification problems. An ensemble is a set of classifiers whose individual predictions are combined to classify new examples [10] with better performance than using a single classifier. The use of boosting methods is growing because they are very effective computational procedures to improve the performance of a classifier. Moreover, boosting is resistant to overfitting [12], which is a good advantage over other methods. In this paper we present a methodology based on boosting that combines classifiers and features, adapting the algorithm described in [13] to our dataset and base classifiers. This ensemble is applied to improve oil spill identification using SAR images. Tests were made and the results were compared to other single ANN classifiers. In the next section we briefly discuss boosting methods and their characteristics. Section 3 presents the results of the proposed approach applied to oil spill detection over 18 SAR image samples, each composed of extracted shape and texture features. The boosting performance is compared to classical classifiers such as ANN, kNN (k-Nearest Neighbors) and Naïve Bayes. In the last section we present the concluding remarks.
2 Boosting
Boosting [11] was introduced in the 1990s and is used to arbitrarily increase the precision of a "weak" classifier (also called "base classifier" or "base function") based on the principle of divide-and-conquer [8]. This method provides high-level feature combination suitable for multi-class problems. Its resulting weighted voting classifier provides better performance than other methods based on selecting the best single classifier. Boosting has been used successfully to solve multivariate classification problems such as the prediction of protein domain structural classes from 3D structures [14] [15] and automatic target recognition from SAR images [9], as well as oil spill detection [16]. Also, there are applications where the feature vector, extracted by different methods, has high dimensionality, so that the best feature set is not easily recognized. For this kind of application, Yin et al. [13] introduced an ensemble of classifiers that can focus on the best feature set, providing high-level feature combination. This modified AdaBoost algorithm is thus suitable for automatic feature selection and increases the classification performance over high-dimensional datasets. AdaBoost [17] is an adaptive boosting method which is fast, simple and easy to program. The simplicity comes from the fact that it has only one parameter to tune. This adaptive method builds ensembles based on functions that emphasize critical and erroneous samples [18], and this emphasis can also be tuned as a simple parameter. These methods have been tested successfully to improve multivariate classification applications such as the prediction of protein domain structural classes from 3D structures [14] [15], automatic feature combination [13], tumour classification [19] and automatic target recognition from SAR images [9]. For marine surveillance and, in particular, oil spill detection, good results were obtained when compared to other neural network classifiers [16].

2.1 Ensemble Construction
In this approach, on each boosting round one "weak" classifier is built and trained using a different sample distribution which depends on the misclassified vectors. These classifiers are then combined by weighted voting into a new final classifier. The AdaBoost (Adaptive Boosting) [17] algorithm is an improved version of the original boosting. AdaBoost is sometimes compared to SVM because the "weak" classifiers are forced to focus on the patterns nearest to the decision boundaries [8] [6]. The AdaBoost [12] [20] algorithm approximates a function $F(x)$:

$$F(x) = \sum_{m=1}^{M} c_m f_m(x) . \qquad (1)$$

where $c_m$ are constants to be determined and $f_m$ are the basis functions or "weak" classifiers. A Discrete Binary AdaBoost algorithm is shown in Fig. 1. To form the ensemble, the classifiers $f_m$ are trained given the input data $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$,
where $x_i \in X$ are the training vectors, $y_i \in Y = \{-1, 1\}$ are the corresponding class labels, and $M$ is the number of iterations or number of base classifiers. In each round, the classification error rate of $f_m$ is used to compute the weights for the new training set of the next classifier $f_{m+1}$. As the training vectors which are close to the class boundary tend to be misclassified, the $f_{m+1}$ classifier will focus on them. Indeed, there is a similarity between AdaBoost and SVM, as both build discriminants closely related to the vectors on the class frontier.

The Pseudo-Code of Discrete Binary AdaBoost
1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier $f_m(x) \in \{-1, 1\}$ using weights $w_i$ on the training data.
   (b) Compute $err_m = E_w[1_{(y \neq f_m(x))}]$ and $c_m = \log((1 - err_m)/err_m)$.
   (c) Set $w_i \leftarrow w_i \exp[c_m 1_{(y_i \neq f_m(x_i))}]$, $i = 1, 2, \ldots, N$, and normalize so that $\sum_i w_i = 1$.
3. Output the classifier $F(x) = \mathrm{sign}\left[\sum_{m=1}^{M} c_m f_m(x)\right]$.
Fig. 1. The Pseudo-Code of Discrete Binary AdaBoost [17]
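For concreteness, a minimal runnable version of the Fig. 1 procedure is sketched below. Decision stumps are used as the weak learner purely for illustration; the paper's experiments use MLP and Naïve Bayes base classifiers instead, and labels are assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, M=10):
    """Minimal Discrete Binary AdaBoost following Fig. 1 (labels y in {-1, +1})."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    learners, c = [], []
    for m in range(M):
        f = DecisionTreeClassifier(max_depth=1)      # weak learner (a stump here)
        f.fit(X, y, sample_weight=w)                 # (a) fit with weights w_i
        miss = (f.predict(X) != y)
        err = np.clip(np.dot(w, miss), 1e-10, 1 - 1e-10)   # (b) weighted error
        cm = np.log((1.0 - err) / err)
        w = w * np.exp(cm * miss)                    # (c) re-weight the misclassified
        w /= w.sum()                                 #     and normalize
        learners.append(f)
        c.append(cm)
    def F(Xnew):
        votes = sum(cm * f.predict(Xnew) for cm, f in zip(c, learners))
        return np.sign(votes)                        # 3. weighted-vote output
    return F
```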
2.2 Feature Combination
Yin et al. [13] modified one step of the AdaBoost algorithm, obtaining a high-level automatic feature combination method while the ensemble is formed. As those authors did not name this algorithm, for the sake of simplicity in this paper we call the method Variant AdaBoost (VARADABOOST). An additional step was included in the original AdaBoost algorithm, as Fig. 2 shows. Instead of training only one classifier per round, in the Variant AdaBoost $K$ classifiers are trained, each one over a different subset of the feature set. The final middle classifier is taken as the weighted voting

$$f_m = \frac{1}{K} \sum_{k=1}^{K} g_k(m) . \qquad (2)$$
where $m$ represents the $m$th round of boosting. For example, if a feature set is formed by combining three different feature extraction methods, in each round three classifiers $g_k(m)$ can be trained, each one over one of the feature subsets $F_k$ ($k = 1, 2, 3$). The resulting ensemble tends to focus on the best set of features for each set of weighted sample vectors. An important advantage of using this ensemble algorithm compared to single ANNs is that fine tuning and sophisticated nonlinear optimization are not necessary [19]. Both boosting algorithms have only one parameter to tune: the number $M$ of "weak" classifiers.
The Pseudo-Code of Variant AdaBoost
1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Repeat for $k = 1, 2, \ldots, K$: train a weak classifier $g_k(x)$ using the features coded by a certain system, $\{(x_{1k}, y_1), \ldots, (x_{Mk}, y_M)\}$.
   (b) Get a middle final classifier $f_m(x)$ by combining $\{g_k\}$ using majority voting.
   (c) Compute $err_m = E_w[1_{(y \neq f_m(x))}]$ and $c_m = \log((1 - err_m)/err_m)$.
   (d) Set $w_i \leftarrow w_i \exp[c_m 1_{(y_i \neq f_m(x_i))}]$, $i = 1, 2, \ldots, N$, and normalize so that $\sum_i w_i = 1$.
3. Output the classifier $F(x) = \mathrm{sign}\left[\sum_{m=1}^{M} c_m f_m(x)\right]$.
Fig. 2. The Pseudo-Code of Variant AdaBoost [13]
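The extra step of Fig. 2 can be sketched as follows. Gaussian Naïve Bayes stands in for the paper's base learners, the weighted distribution is realized by re-sampling (as the authors do in Section 3.2), and the feature subsets are passed as lists of column indices; all of these choices are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def variant_adaboost(X, y, subsets, M=10):
    """Sketch of the Variant AdaBoost of Fig. 2 (labels y in {-1, +1}).

    `subsets` is a list of column-index arrays, one per feature-coding
    system (e.g. shape, spatial texture, spectral texture).
    """
    N = len(y)
    w = np.full(N, 1.0 / N)
    rounds = []                                   # (weak classifiers of the round, weight c_m)
    for m in range(M):
        # (a) one weak classifier g_k per feature subset, trained on a
        #     resample drawn according to the current weights w_i.
        idx = np.random.choice(N, size=N, p=w)
        gks = [GaussianNB().fit(X[idx][:, cols], y[idx]) for cols in subsets]
        # (b) middle classifier f_m: majority vote of the g_k.
        votes = sum(g.predict(X[:, cols]) for g, cols in zip(gks, subsets))
        fm = np.where(votes >= 0, 1, -1)
        # (c)-(d) weighted error, classifier weight and weight update.
        miss = (fm != y)
        err = np.clip(np.dot(w, miss), 1e-10, 1 - 1e-10)
        cm = np.log((1.0 - err) / err)
        w = w * np.exp(cm * miss)
        w /= w.sum()
        rounds.append((gks, cm))
    def F(Xnew):
        total = np.zeros(len(Xnew))
        for gks, cm in rounds:
            v = sum(g.predict(Xnew[:, cols]) for g, cols in zip(gks, subsets))
            total += cm * np.where(v >= 0, 1, -1)
        return np.sign(total)
    return F
```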
As boosting is generally resistant to overfitting, the choice of $M$ is typically not very critical. Moreover, there is an empirical approach to automatically choose this parameter, explained in [19] [21]. The "weak" classifier is often chosen to be a regression tree algorithm like CART [22]. Nevertheless, virtually any other classifier can be used instead. In our experiments, MLP and Naïve Bayes were tested as the base classifiers.
3 Experimental Results and Discussions

3.1 Dataset
The working dataset was taken from 18 SAR images separated into two classes, each one containing 9 samples: the real oil spill class (class C1) and the look-alike class (class C2); two of them are shown in Fig. 3. Fig. 3a displays a real oil spill image in the ocean and Fig. 3b exhibits a similar-looking one. Three different feature sets were formed by different feature extraction methods. The first one (SHAPE) was generated as described in Del Frate et al. [23] and is based on the geometric characteristics of the spill. This set contains 8 descriptors. The spatial texture features (SPATIAL), with 15 descriptors, are taken from the co-occurrence matrix as suggested in Haralick et al. [24]. The third feature set, named spectral texture (SPECTRAL), contains 7 descriptors. They are the coefficient energies of the first 7 Haar wavelet decompositions of the spill in the segmented image. This feature set gives us a spectral interpretation of the texture of the spot. As the waves on the ocean surface are attenuated by the oil viscosity, the radar backscatter loses energy and the spot appears much more homogeneous in the grayscale image. Three feature combination sets (FC1, FC2 and FC3) were formed by combining the single feature sets in different
(a) Oil spill
(b) Look alike spot
Fig. 3. SAR image samples used in the experiments
groups: FC1 = SHAPE + SPATIAL, FC2 = SHAPE + SPECTRAL and FC3 = SHAPE + SPATIAL + SPECTRAL. These sets have high dimensionality, respectively 23, 15 and 30 dimensions.
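A rough sketch of how such descriptors could be extracted and concatenated is shown below. It does not reproduce the exact 8 shape, 15 spatial and 7 spectral descriptors of the paper: the shape descriptors are taken as given, only a few co-occurrence properties are computed as a stand-in for the Haralick set, and scikit-image / PyWavelets are just one possible choice of tooling (older scikit-image versions spell the functions greycomatrix/greycoprops).

```python
import numpy as np
import pywt
from skimage.feature import graycomatrix, graycoprops

def spectral_features(spot, levels=7):
    """Energies of the first Haar wavelet decomposition levels (SPECTRAL-style)."""
    coeffs = pywt.wavedec2(spot.astype(float), 'haar', level=levels)
    # coeffs[0] is the approximation; each following entry is (cH, cV, cD).
    return np.array([sum(float((d ** 2).sum()) for d in detail)
                     for detail in coeffs[1:]])

def spatial_features(spot):
    """A few co-occurrence descriptors as a stand-in for the 15 used in the paper.

    `spot` is assumed to be an 8-bit grayscale patch of the segmented spill.
    """
    glcm = graycomatrix(spot, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    props = ['contrast', 'homogeneity', 'energy', 'correlation']
    return np.array([graycoprops(glcm, p)[0, 0] for p in props])

def combined_features(spot, shape_descriptors):
    """FC3-style concatenation: SHAPE + SPATIAL + SPECTRAL."""
    return np.concatenate([np.asarray(shape_descriptors, dtype=float),
                           spatial_features(spot),
                           spectral_features(spot)])
```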
3.2 Classifiers Implementation
We have implemented two Variant AdaBoost algorithms using Naïve Bayes and a three-layer MLP as the base learners. The MLP was designed with 5 neurons in the hidden layer. The same architecture was adopted for all boosting algorithms, with a maximum of 10 base learners each. In step two of the boosting algorithm, the training is based on weighted input vectors. An appropriate way to implement this probability of vector occurrence is re-sampling: in our experiments, an input vector was resampled as many times as needed to reflect its weight in training. When applying Variant AdaBoost to FC1 and FC2, a slight change was made in the algorithm because majority voting does not apply; we adopted weighted voting, with the weight based on the classifier training error. Three single classifiers were tested to compare the results. The performance of a classical 3-neighbour kNN algorithm was established as the reference. One MLP was adjusted to reach the minimum least-square training error; this MLP has a three-layer architecture and uses the backpropagation algorithm, and the number of neurons in its hidden layer was experimentally set to half of the feature dimension and then tuned until reaching the best performance. The last single classifier is an SVM with a polynomial kernel, tuned by changing the number of free degrees between 2 and 5 until reaching the best performance for each feature set.
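The re-sampling step described above can be written as a small helper; the overall size factor of the resampled set is an arbitrary choice here, and a stochastic draw with numpy.random.choice(len(y), size=len(y), p=w) would realize the same distribution.

```python
import numpy as np

def replicate_by_weight(X, y, w, factor=10):
    # Replicate each training vector roughly in proportion to its boosting
    # weight w_i, so that the weak learner sees the weighted distribution.
    counts = np.rint(w * len(y) * factor).astype(int)
    idx = np.repeat(np.arange(len(y)), counts)
    return X[idx], y[idx]
```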
Results
The datasets were submited to the single and ensembled classifiers: one classical kNN, two ANNs (MLP and SVM), one Adaboost algorithm with MLP as base
1178
G.L.B. Ramalho and F.N.S. de Medeiros
classifier and two variant boosting algorithms with MLP and Na¨ıve Bayes base classifiers, respectively. Each classifier performance estimation was obtained using a hold-out method with 70% of the samples used for training and the rest for test. For boosting algorithms no dimensionality reduction was taken. The graph in Fig. 4 reports a comparison between the estimated success classification rates of each classifier over the different datasets. As it can be observed in Fig. 4, Variant AdaBoost algorithms were not applied to single feature sets. The average correct classification rate and the variance were used to determine which classifier is the best predictor for each feature set. Boosting methods have shown better performance over all datasets and the high-level feature combination has produced even better results with improved generalization. The classifier variance is an important parameter to help understand the generalization capacity of the classifier [25]. Smaller variances yield better generalization and imply more precise classifier error rates estimation. The boosting classifier variances are smaller than the others, which means that the generalization of this approach is also better.
Fig. 4. Classifier performances compared at the correct classification rates
Table 1 displays the performances obtained for the seven classifiers using the F C3 dataset. Besides its higher dimensionality, this dataset provides the best class discrimination. We observe that the high level feature combination from the Variant AdaBoost classifiers yields the better performance, reaching 97.5% of correct class prediction. Moreover, this improvement was achieved without loss of generality, confirmed by the very low variance, which is a remarkable aspect concerning to reliable classifiers. Tables 2 and 3 show the confusion matrix for the best performance of the single classifiers and for the best performance of the Variant AdaBoost ensemble. In both cases we look for the higher classification rate at the lower variance. As
Improving Reliability of Oil Spill Detection Systems
1179
Table 1. Classifier comparison over FC3 feature set Correct Classifier Prediction (%) NAIVE 0.7817 MLP 0.8133 0.7783 SVM 0.9483 ADABOOST MLP VARADABOOST NAIVE 0.9750 0.9400 VARADABOOST MLP
Variance 0.0167 0.0176 0.0080 0.0082 0.0041 0.0068
Table 2. Rated Confusion Matrix for SVM Estimated Real C1 C2 C1 0.4400 0.0600 C2 0.1617 0.3383
Table 3. Rated Confusion Matrix for Variant AdaBoost with Na¨ıve Bayes Estimated Real C1 C2 C1 0.4783 0.0033 C2 0.0217 0.4967
it can be observed, the misclassification reduction is another important result obtained with the proposed methodology.
4
Concluding Remarks
The simulation results showed that the proposed classification ensemble minimized the misclassification of dark spots in oil spills or look-alikes for such set of SAR images. The number of false alarms was significantly reduced. Furthermore, we have shown the use of boosting methods can improve the performance of oil spill identification systems using high dimensionality dataset and limited training samples. Although boosting methods have higher computational cost for training, the ensemble test response is as fast as the response of a single neural network. The ensemble is easy to implement and tune. Indeed, only one parameter must be tuned and the feature subsets are automatically taken in action to the ensemble construction. In future works we will investigate the use of different base learners and modify the algorithm in order to reach feature selection improvements in multi-class classification problems.
Acknowledgement. The authors would like to thank CNPq (#476177/2004-9) and CAPES for their financial support.
References 1. Martinez, A., Moreno, V.: An oil spill monitoring system based on SAR images. Spill Science & Technology Bulletin 3(1/2), 65–71 (1996) 2. Ferrado, G., Bernadini, A., David, M., Meyer-Roux, S., Muellenhoff, O., Perkovic, M., Tarchi, D., Toupozelis, K.: Towards an operational use of space imagery for oil pollution monitoring in the mediterranean basin: A demonstration in the adriatic sea. Marine Pollution Bulletin (2007) doi:10.1016/j.marpolbul.2006.11.022 3. Brekke, C., Solberg, A.H.S.: Oil spill detection by satellite remote sensing. Remote Sensing Environment 95, 1–13 (2005) 4. Marghany, M.: RADARSAT automatic algorithms for detecting coastal oil spill pollution. International Journal of Applied Earth Observation and Geoinformation 3(2), 191–196 (2001) 5. de Lopes, A., Ramalho, D.F., de Medeiros, F.N.S., Costa, R.C.S., Ara´ ujo, R.T.S.: Combining features to improve oil spill classification in SAR images. In: Yeung, D.-Y., Kwok, J.T., Fred, A., Roli, F., de Ridder, D. (eds.) Structural, Syntactic, and Statistical Pattern Recognition. LNCS, vol. 4109, pp. 928–936. Springer, Heidelberg (2006) 6. Webb, A.R.: Statistical Pattern Recognition, 2nd edn. Wiley, England,UK (2002) 7. Topouzelis, K., Karathanassi, V., Pavlakis, P., Rokos, D.: Oil spill detection using RBF neural networks and SAR data. In: Proceedings on ISPRS, pp. 724–729 (2005) 8. Haykin, S.: Redes Neurais, princ´ıpios e pr´ atica, 2nd edn. Bookman, Porto Alegre (2001) 9. Sun, Y., Liu, Z., Todorovic, S., Li, J.: SAR automatic target recognition using adaboost. In: Proc. SPIE on Technologies and Systems for Defense and Security, vol. 5808, pp. 282–293 (2005) 10. Dzeroski, S., Zenko, B.: Is Combining Classifiers Better than Selection the Best One? Machine Learning 54(3), 255–274 (2004) 11. Schapire, R.E.: The strength of weak learnability. Machine Learning 5(2), 197–227 (1990) 12. Freund, Y., Schapire, R.E.: A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence 14(5), 771–780 (1999) 13. Yin, X.-C., Liu, C.-P., Han, Z.: Feature combination using boosting. Pattern Recognition Letters 26, 2195–2205 (2005) 14. Cai, Y.-D., Feng, K.-Y., Lu, W.-C., Chou, K.-C.: Using LogitBoost classifier to predict protein structural classes. Journal of Theoretical Biology 238, 172–176 (2006) 15. Feng, K.-Y., Cai, Y.-D., Chou, K.-C.: Boosting classifier for predicting protein domain structural class. Biochemical and Biophysical Research Communications 334, 213–217 (2005) 16. Ramalho, G.L.B., Medeiros, F.N.S.d.: Using Boosting to Improve Oil Spill Detection in SAR Images. In: 18th International Conference on Pattern Recognition, vol. 2, pp. 1066–1069 (2006) 17. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119–139 (1997)
18. Gomez-Verdejo, V., Ortega-Moral, M., Arenas-Garc´ıa, J., Figueiras-Vidal, A.R.: Boosting by weighting critical and erroneous samples. Neurocomputing 69, 679– 685 (2006) 19. Dettling, M., B¨ uhlmann, P.: Boosting for tumor classification with gene expression data. Bioinformatics 19(9), 1061–1069 (2003) 20. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. The. Annals of Statistics 38(2), 337–374 (2000) 21. B¨ uhlmann, P.: Boosting Methods: Why They Can be Useful for High-Dimensional Data. In: Proceedings on Distributed Statistical Computing, Vienna (2003) 22. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth and Brooks/Cole, Montery, CA (1984) 23. Calabresi, G., Del Frate, F., Lichtenegger, J., Petrocchi, A.: Neural networks for oil spill detection using ERS-SAR data. Ieee. Transactions On. Geoscience And. Remote Sensing 38(5), 2282–2287 (2000) 24. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics 3(6), 610–621 (1973) 25. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, England (2000)
Computer Assisted Transcription for Ancient Text Images

Verónica Romero1, Alejandro H. Toselli1, Luis Rodríguez2, and Enrique Vidal1

1 Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Camí de Vera s/n, 46022 València (Spain)
{vromero,ahector,evidal}@iti.upv.es
2 Departamento de Sistemas Informáticos, Universidad de Castilla La Mancha, Spain
[email protected]

Abstract. Paleography experts spend many hours transcribing ancient documents and state-of-the-art handwritten text recognition systems are not suitable for performing this task automatically. We propose here a new interactive, online framework which, rather than full automation, aims at assisting the experts in the proper recognition-transcription process; that is, to facilitate and speed up the transcription of old documents. This framework combines the efficiency of automatic handwriting recognition systems with the accuracy of the experts, leading to a cost-effective perfect transcription of ancient manuscripts.

Keywords: Automatic Transcription of Ancient Documents, Computer Assisted Transcription, handwritten text image recognition.
1 Introduction

The increasing number of on-line digital libraries publishing a large quantity of digitized legacy documents also makes it necessary to transcribe these document images, in order to provide historians and other researchers new ways of indexing, consulting and querying these documents. These transcriptions are usually carried out by experts in paleography, who are specialized in reading ancient scripts, characterized, among other things, by different handwritten/printed styles from diverse places and time periods. How long experts take to transcribe one of these documents depends on their skills and experience. For example, to transcribe many of the pages of the document used in the experiments reported in section 4.1, they would spend several hours per page. On the other hand, up-to-date handwritten text recognition (HTR) systems cannot substitute the experts in this task, because there are no efficient solutions to perform an automatic transcription of ancient documents with good accuracy. The difficulties in segmenting text lines, the variability of the handwriting, the complexity of the styles and the open vocabulary explain most of the issues encountered by these recognition systems.
An interactive on-line scenario can allow for a more effective approach. Here, the automatic HTR system and the human transcriptor cooperate to generate the final transcription of the text line images. The rationale behind this approach is to combine the accuracy provided by the human transcriptor with the efficiency of the HTR system. This approach is called "Computer Assisted Transcription of Handwritten Text Images" (CATOTI) in this article. It follows ideas similar to those previously applied to computer assisted translation (CAT) [1,2,3] and computer assisted speech transcription (CATS) [4,5], where experiments have shown that this kind of system can save significant amounts of human effort. The HTR system employed here is based on Hidden Markov Models (HMMs) [6], in the same way as current speech recognition systems [7]. The most important difference is that the input feature vector sequence of the HTR system represents a handwritten text line image, rather than an acoustic speech signal. This paper is organized as follows. First, the CATOTI framework is introduced in section 2 and formalized in section 2.1. An initial implementation is described in section 2.3. Then, a general description of the HTR system used is given in section 3. The experiments and results are discussed in section 4. Finally, some conclusions are drawn in section 5.
2 Foundations of CATOTI

This section overviews our approach to CATOTI. As illustrated in Figure 1, the process starts when the HTR system proposes a full transcription $\hat{s}$ (or a set with the N-best transcriptions) of a suitable segment of the feature vector sequence $x$, extracted from a handwritten text line image (see section 3). Then, the human transcriptor (named user from now on) reads this transcription until he or she finds a mistake; i.e., he or she validates a prefix $\hat{s}_p$ of the transcription which is error-free. Now, the user can enter a word (or words), $c$, to correct the erroneous text that follows the validated prefix. This action produces a new prefix $p$ (the previously validated prefix, $\hat{s}_p$, followed by $c$). Then, the system takes into account the new prefix to suggest a suitable continuation (or a set of best possible continuations) to this prefix (i.e., a new $\hat{s}$), thereby starting a new cycle. This process is repeated until a correct, full transcription $t$ of $x$ is accepted by the user. A key point in this interactive process is that, at each user-system iteration, the system can take advantage of the prefix validated so far to attempt an improved prediction.

2.1 Formal Framework

The traditional handwritten text recognition problem can be formulated as the problem of finding a most likely word sequence, $\hat{w}$, for a given handwritten sentence image represented by a feature vector sequence $x$, i.e., $\hat{w} = \arg\max_w Pr(w|x)$. Using Bayes' rule we can decompose the probability $Pr(w|x)$ into two probabilities, $Pr(x|w)$ and $Pr(w)$, representing morphological-lexical knowledge and syntactic knowledge, respectively:

$$\hat{w} = \arg\max_w Pr(w|x) = \arg\max_w Pr(x|w) \cdot Pr(w) . \qquad (1)$$
[Figure 1 shows the feature vector sequence x of a text line image and four interaction steps (INTER-0, INTER-1, INTER-2, FINAL), in which the system hypotheses ŝ, the validated prefixes ŝ_p and p, and the user corrections c converge to the final transcription t = “antiguos ciudadanos que en castilla se llamaban”.]
Fig. 1. Example of CATOTI operation. Starting with an initial recognized hypothesis ŝ, the user validates its longest well-recognized prefix ŝ_p and corrects the following erroneous word with c; next, the validated prefix p (c concatenated to ŝ_p) is submitted as additional help information to the recognition system, which, based on this, emits a new recognized hypothesis ŝ. This process goes on until the final error-free transcription t is obtained. Underlined boldface words in the final transcription are those which were corrected by the user.
Pr(x|w) is typically approximated by concatenated character models (usually hidden Markov models [8,7]) and Pr(w) is approximated by a word language model (usually n-grams [8]). In the CATOTI framework, in addition to the given feature sequence, x, a prefix p of the transcription (validated and/or corrected by the user) is available, and the HTR system should try to complete this prefix by searching for a most likely suffix ŝ as:

ŝ = arg max_s Pr(s|x, p) = arg max_s Pr(x|p, s) · Pr(s|p) .   (2)

Equation (2) is very similar to (1), where w is the concatenation of p and s. The main difference is that now p is given. Therefore, the search must be performed over all possible suffixes s, and the language model probability Pr(s|p) must account for the words that can be written after the prefix p. In order to solve equation (2), the signal x can be considered split into two fragments, x_1^b and x_{b+1}^m, where m is the length of x. By further considering the boundary point b as a hidden variable in (2), we can write:

ŝ = arg max_s  Σ_{1≤b≤m} Pr(x, b|s, p) · Pr(s|p) .   (3)

We can now make the naive (but realistic) assumption that x_1^b does not depend on the suffix and x_{b+1}^m does not depend on the prefix, to rewrite (3) as:

ŝ ≈ arg max_s  Σ_{1≤b≤m} Pr(x_1^b|p) · Pr(x_{b+1}^m|s) · Pr(s|p) .   (4)
Finally, the sum over all the possible segmentations can be approximated by the dominating term, leading to:

ŝ ≈ arg max_s  max_{1≤b≤m} Pr(x_1^b|p) · Pr(x_{b+1}^m|s) · Pr(s|p) .   (5)

This optimization problem entails finding an optimal boundary point, b̂, associated with the optimal suffix decoding, ŝ. That is, the signal x is actually split into two segments, x_p = x_1^b̂ and x_s = x_{b̂+1}^m. Therefore, the search for the best transcription suffix that completes a prefix p can be performed just over the segments of the signal corresponding to the possible suffixes and, on the other hand, we can take advantage of the information coming from the prefix to tune the language model constraints modelled by Pr(s|p). This is discussed in the next subsections.

2.2 Adapting the Language Model
Perhaps the simplest way to deal with Pr(s|p) is to adapt an n-gram language model to cope with the consolidated prefix. Given that a conventional n-gram models the probability Pr(w) (where w is the concatenation of p and s, i.e., the whole sentence), it is necessary to modify this model to take into account the conditional probability Pr(s|p). As discussed in [4], assuming an n-gram model for Pr(w) leads to the following decomposition:

Pr(s|p) ≃ ∏_{i=k+1}^{k+n−1} Pr(w_i|w_{i−n+1}^{i−1}) · ∏_{i=k+n}^{l} Pr(w_i|w_{i−n+1}^{i−1}) ,   (6)

where the consolidated prefix is w_1^k = p and w_{k+1}^l = s is a possible suffix. The first term of (6) accounts for the probability of the first n−1 words of the suffix, whose probability is conditioned by words from the validated prefix, and the second one is the usual n-gram probability for the rest of the words in the suffix.
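As an illustration of Eq. (6), the following sketch (ours, not taken from the paper) scores a candidate suffix under a prefix-conditioned n-gram; the ngram_prob lookup function and its smoothing are assumptions of this example.

```python
import math

def suffix_log_prob(prefix, suffix, ngram_prob, n=2):
    """Score a candidate suffix s given the validated prefix p, as in Eq. (6).

    ngram_prob(history, word) is assumed to return a smoothed probability
    P(word | history); with n = 2 the history is just the previous word.
    The single loop covers both factors of (6): for the first n-1 suffix
    words the history still contains prefix words, afterwards it does not.
    """
    words = list(prefix) + list(suffix)      # w_1^l = p followed by s
    k = len(prefix)
    logp = 0.0
    for i in range(k, len(words)):
        history = tuple(words[max(0, i - n + 1):i])
        logp += math.log(ngram_prob(history, words[i]))
    return logp
```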
2.3 Searching
In this section, a possible implementation of a CATOTI decoder is described. In the first iteration of the CATOTI process, p is empty. Therefore, the decoder has to generate a full transcription of x as shown in equation (1). Afterwards, in the following iterations of the interactive-transcription process, the user-validated prefix p has to be used to generate a suitable continuation s. As discussed in [4], we can explicitly rely on equation (5) to implement a decoding process in one step, as in conventional HTR systems. The decoder should be forced to match the previously validated prefix p and then continue searching for a suffix ŝ according to the constraints (6). This can be achieved by building a special language model which can be seen as the “concatenation” of a linear model, which strictly accounts for the successive words in p, and the “suffix language model” (6). An example of this language model is shown in figure 2. Owing to the finite-state nature of this special language model, the search involved in equation (5) can be efficiently carried out using the well-known Viterbi algorithm [8].
[Figure 2 shows: the original bigram L, estimated from the training samples “de la edad media”, “de la epoca media”, “de edad media”, “de la epoca actual”, “de la actual” and “de actual”; the linear model L_p built for the prefix “en la”; and the final combined model L_p L_s.]
Fig. 2. Example of a CATOTI dynamic language model building. First, an n-gram (L) for the training set of the figure is built. Then, a linear model (Lp ) which accounts for the prefix “en la” is constructed. Finally, these two models are combined into a single model (Lp Ls ) as shown.
Apart from the optimal suffix decoding, ŝ, a correspondingly optimal segmentation of x is then obtained as a byproduct. It should be noted that a naive Viterbi adaptation to implement these techniques would lead to a computational cost that grows quadratically with the number of words of each sentence. However, using word-graph techniques similar to those described in [3] for computer assisted translation, a very efficient, linear-cost search can easily be achieved.
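To make the construction above concrete, the sketch below (a simplification of ours, not the authors' decoder) represents the special language model as a transition table: a linear chain that only accepts the validated prefix, followed by ordinary bigram transitions. The bigram dictionary, state encoding and smoothing constant are illustrative assumptions.

```python
def build_prefix_lm(prefix, bigram, vocab):
    """Return a transition table for the 'L_p L_s' model sketched in figure 2.

    States 0..len(prefix) form a linear chain that is forced to emit the
    validated prefix; from the last chain state the usual bigram transitions
    take over.  bigram[(v, w)] is an assumed smoothed probability P(w | v).
    """
    transitions = {}                                   # state -> [(word, prob, next_state)]
    for i, w in enumerate(prefix):
        transitions[i] = [(w, 1.0, i + 1)]             # forced prefix words
    # Suffix part: one state per last emitted word, keyed as ('s', word).
    for v in vocab:
        transitions[('s', v)] = [(w, bigram.get((v, w), 1e-6), ('s', w)) for w in vocab]
    # Entry into the suffix model is conditioned on the last prefix word.
    v0 = prefix[-1] if prefix else '<s>'
    transitions[len(prefix)] = [(w, bigram.get((v0, w), 1e-6), ('s', w)) for w in vocab]
    return transitions
```

A Viterbi search over the composition of such an automaton with the character HMMs would then yield the suffix ŝ and the boundary b̂ of Eq. (5).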
3 HTR System Overview
The HTR system used here follows a classical architecture composed of three modules: a preprocessing module, in charge of filtering out noise, recovering handwritten strokes from degraded images and reducing the variability of text styles; a feature extraction module, where a feature vector sequence is obtained as the representation of a handwritten text image; and finally, a recognition module, which obtains the most likely word sequence for the sequence of feature vectors. In addition, the recognition module used in this work has been adapted to solve (5) using (6) as a suffix model for each consolidated prefix. The following subsections describe each of these modules in detail.

3.1 Preprocessing
It is quite common for ancient documents to suffer from degradation problems [9]. Among these are the presence of smear, backgrounds with large variations and uneven illumination, spots due to humidity, or marks resulting from the ink that goes through the paper (generally called bleed-through). In addition, other kinds of difficulties appear in these pages, such as different font types and sizes in the words, underlined and/or crossed-out words, etc. The combination of these problems makes the recognition process difficult; therefore, a preprocessing module becomes essential.
Concerning the preprocessing module used in this work, the following steps take place: skew correction, background removal and noise reduction, line extraction, slant correction and size normalization. We understand “skew” as the angle between the horizontal direction and the direction of the lines on which the writer aligned the words. Skew correction is carried out on each document page image by aligning its text lines with the horizontal direction [10]. Background removal and noise reduction are performed by applying a bi-dimensional median filter [11] to the whole page image, followed by a grey-level normalization to increase the foreground/background image contrast (see figure 3, panel b). The next step consists of dividing the page image into separate line images. The method used is based on the horizontal projection profile of the input image. Local minima in this projection are considered as potential cut-points located between consecutive text lines (see panel c of figure 3). When the minimum values are greater than zero, no clear separation is possible. This problem has been solved using a method based on connected components [12]. Panel d of figure 3 shows the resulting line images (from the highlighted region in panel c) after applying the above-mentioned method. Finally, the slant correction and size normalization processes are applied to each previously separated line image. The slant is the clockwise angle between the vertical direction and the dominant direction of the written vertical strokes. This angle is determined using a method based on the vertical projection profile (see [13]), and is then used by the slant correction process to put the written text strokes in an upright position. On the other hand, the size normalization process tries to make the system invariant to character size and to reduce the areas of background pixels which remain in the image because of the ascenders and descenders of some letters [14]. The last three panels of figure 3 show these final steps of the preprocessing module.

3.2 Feature Extraction
As our HTR system is based on Hidden Markov Models (HMMs), each preprocessed text line image has to be represented as a sequence of feature vectors. To do this, the feature extraction module applies a grid to divide the text line image into N×M squared cells. In this work, N = 40 is chosen empirically and M must satisfy the condition that M/N equals the original line image aspect ratio [10]. Each cell is characterized by the following features: normalized gray level, horizontal gray-level derivative and vertical gray-level derivative. To obtain smoothed values of these features, feature extraction is not restricted to the cell under analysis, but extended to a 5 × 5 cell window centered at the current cell. To compute the normalized gray level, the analysis window is smoothed by convolution with a 2-d Gaussian filter. The horizontal derivative is calculated as the slope of the line which best fits the horizontal function of column-average gray level. The fitting criterion is the sum of squared errors weighted by a 1-d Gaussian filter which enhances the role of the central pixels of the window under analysis. The vertical derivative is computed in a similar way. Columns of cells (also called frames) are processed from left to right and a feature vector is constructed for each frame by stacking the three features computed in its constituent cells.
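The grid-based extraction just described can be sketched as follows; this is an illustrative reading of the text, with the window weights, boundary handling and derivative fit chosen by us rather than taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def line_image_features(img, n_cells=40, win=5):
    """Turn a gray-level text line image (rows x cols, values in [0, 1]) into a
    sequence of feature vectors: per cell, smoothed gray level plus horizontal
    and vertical gray-level derivatives, analysed over a win x win cell window."""
    h, w = img.shape
    cell = max(1, h // n_cells)
    n_frames = max(1, w // cell)
    smoothed = gaussian_filter(img, sigma=1.0)           # 2-D Gaussian smoothing
    feats = np.zeros((n_frames, 3 * n_cells))
    for f in range(n_frames):                            # frames: columns of cells
        for c in range(n_cells):
            r0 = max(0, (c - win // 2) * cell)
            r1 = min(h, (c + win // 2 + 1) * cell)
            c0 = max(0, (f - win // 2) * cell)
            c1 = min(w, (f + win // 2 + 1) * cell)
            window = smoothed[r0:r1, c0:c1]
            gray = window.mean()
            # Horizontal derivative: slope of the column-average gray level.
            col_avg = window.mean(axis=0)
            dx = np.polyfit(np.arange(col_avg.size), col_avg, 1)[0]
            # Vertical derivative: slope of the row-average gray level.
            row_avg = window.mean(axis=1)
            dy = np.polyfit(np.arange(row_avg.size), row_avg, 1)[0]
            feats[f, c] = gray
            feats[f, n_cells + c] = dx
            feats[f, 2 * n_cells + c] = dy
    return feats
```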
Fig. 3. Preprocessing example: a) original image; b) skew correction, background removal, noise reduction and increase of contrast; c) image with cutting lines; d) separated line images from the highlighted region
Fig. 3. (continued) e) a separated line image; f) slant correction; g) size normalization
Hence, at the end of this process, a sequence of M 120-dimensional feature vectors (40 normalized gray-level components plus 40 horizontal and 40 vertical derivative components) is obtained. This process is similar to that followed in [6]. At the top of figure 1, an example of the feature vector sequence for a separated line image is shown graphically.

3.3 Recognition
Characters are modelled by continuous-density left-to-right HMMs, with 6 states and a mixture of 64 Gaussian densities per state. This Gaussian mixture serves as the probability law governing the emission of feature vectors in each model state. The number of Gaussian densities as well as the number of states were empirically chosen after tuning the system. It should be noted that the number of Gaussians and states defines the number of parameters to be estimated, and this number strongly depends on the amount of training vectors available. The character HMMs are trained using a well-known instance of the EM algorithm called forward-backward or Baum-Welch re-estimation [8]. Each lexical word is modelled by a stochastic finite-state automaton which represents all possible concatenations of the individual characters that may compose the word. On the other hand, according to section 2.2, text line sentences are modelled using back-off bi-grams with Kneser-Ney smoothing [15,16], which are estimated from the given transcriptions of the training set. All these finite-state models (character HMMs, word automata and the sentence language model) can be easily integrated into a single global model on which the search for decoding the input feature vector sequence x into the output word sequence w is performed. This search is optimally done by using the Viterbi algorithm [8]. This algorithm can also be easily adapted for the search required in the CATOTI interactive framework explained in section 2.3.
4 Experimental Results
In order to test the effectiveness of the CATOTI approach proposed in this paper on legacy documents, different experiments were carried out. The corpus used in the experiments, as well as the different measures and the obtained experimental results, are explained in the following subsections.
4.1 Corpus
The corpus was compiled from the legacy handwritten document identified as “Cristo-Salvador”, which was kindly provided by the Biblioteca Valenciana Digital (BIVALDI, http://bv2.gva.es). This corpus is composed of 53 text page images, written by a single writer and scanned at 300 dpi. Some of these page images are shown in figure 4.
Fig. 4. Examples of corpus “Cristo-Salvador”
As explained in section 3, the page images have been preprocessed and divided into lines, resulting in a data-set of 1,172 text line images. The transcriptions corresponding to each line image are also available, containing 10,911 running words with a vocabulary of 3,408 different words. Two different partitions were defined for this data-set. In the first one, called soft, the test set is formed by the 491 samples corresponding to the last ten lines of each document page, whereas the training set is composed of the 681 remaining samples. In the second partition, called hard, the test set is composed of the 497 line samples belonging to the last 20 document pages, whereas the remaining 675 were assigned to the training set. All this information is summarized in Table 1. The soft partition is considered easier to recognize than the hard one because its test line samples were extracted from the same pages as the training line samples. On the other hand, the hard partition better approximates a realistic transcription process.
Table 1. Basic statistics of the database and its soft and hard partitions

Number of:  | Soft Training | Soft Test | Hard Training | Hard Test | Total  | Lexicon
text lines  | 681           | 491       | 675           | 497       | 1,172  | –
words       | 6,432         | 4,479     | 6,222         | 4,689     | 10,911 | 3,408
characters  | 36,699        | 25,460    | 35,845        | 26,314    | 62,159 | 78
That is, the CATOTI system is initially trained with the first document pages, and then, as more page images are transcribed, a greater amount of samples (line images and transcriptions) becomes available to retrain the system and therefore improve the assistance in the transcription of the rest of the document.

4.2 Assessment Measures
Two kinds of measures have been adopted. On the one hand, the quality of the transcription without any system-user interactivity is given by the well-known word error rate (WER). It is defined as the minimum number of words that need to be substituted, deleted or inserted to convert a sentence recognized by the system into the corresponding reference transcription, divided by the total number of words in the reference transcription. On the other hand, the effort needed by a human transcriptor to produce correct transcriptions using the CATOTI system is assessed by the word stroke ratio (WSR). The WSR is computed using a reference transcription of the text image considered. After a first CATOTI hypothesis, the longest common prefix between this hypothesis and the reference is obtained and the first unmatching word from the hypothesis is replaced by the corresponding reference word. This process is iterated until a full match with the reference is achieved. Therefore, the WSR can be defined as the number of (word-level) user interactions that are necessary to achieve the reference transcription of the text image considered, divided by the total number of reference words. This definition makes WSR and WER comparable. Moreover, the relative difference between WER and WSR gives us an estimation of the reduction in human effort that can be achieved by using CATOTI with respect to using a conventional HTR system followed by human post-editing.
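The two measures can be illustrated with the following sketch, which is ours and not the authors' evaluation code; decode(x, prefix) stands for the prefix-conditioned recognizer of section 2.3 and is assumed to return a hypothesis that extends the given prefix.

```python
def word_errors(hyp, ref):
    """Word-level Levenshtein distance (substitutions + insertions + deletions),
    the numerator of the WER."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(hyp)][len(ref)]

def word_strokes(x, ref, decode):
    """Simulate the interactive protocol and count user corrections (WSR numerator)."""
    prefix, strokes = [], 0
    while prefix != ref:
        hyp = decode(x, prefix)
        k = len(prefix)                      # prefix is always a prefix of ref here
        while k < len(ref) and k < len(hyp) and hyp[k] == ref[k]:
            k += 1                           # extend the longest common prefix
        if k == len(ref):
            break                            # hypothesis already matches the reference
        prefix = ref[:k + 1]                 # user corrects the first wrong word
        strokes += 1
    return strokes
```

Summing word_errors over the first-pass hypotheses and word_strokes over the interactive runs, and dividing by the total number of reference words, gives the WER and the WSR respectively.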
4.3 Results
As previously mentioned, two evaluation measures have been adopted: the WER, which is the conventional automatic HTR result, and the CATOTI result, called WSR. Table 2 shows the results obtained with both the soft and the hard partitions.

Table 2. Results obtained with soft and hard partitions

      | WER (%) | WSR (%)
Soft  | 46.8    | 44.1
Hard  | 50.3    | 48.7
Several comments about these results are in order. First, as expected given the high difficulty of the transcription problem, conventional HTR yields too many errors for human post-editing to be considered a viable, cost-effective approach to produce error-free transcriptions. In contrast, the assistive nature of CATOTI places the user always in command of the whole process: if the predictions are not good enough, then the user simply keeps typing at her own pace; otherwise, she can accept (partial) predictions and thereby save raw typing effort. According to the results, more than 51% of the raw typing effort of the paleography experts is expected to be saved. In addition, even if we ignore the inadequacy of asking a human expert to post-edit a text where half of its words have to be corrected, the interactive approach would still be more effective: using CATOTI, the estimated relative reductions in the number of words to be corrected are 5.8% and 3.2% for the soft and hard settings, respectively.
5 Remarks and Conclusions
In this paper, we have proposed a new interactive, on-line framework which combines the efficiency of automatic HTR systems with the accuracy of paleography experts in the transcription of ancient documents. In this proposal, the words corrected by the expert become part of increasingly longer prefixes of the final target transcription. These prefixes are used by the CATOTI system to suggest new suffixes that the expert can iteratively accept or modify until a satisfactory, correct target transcription is finally produced. The results reported in the last section should be considered preliminary. Because of the large vocabulary of the corpus and the limited number of training samples available in this experiment, the n-gram models are clearly undertrained. Despite this severe sparse-data condition, and given the extreme difficulty of the task, the preliminary results achieved in this work are encouraging. In fact, much better results are expected if more training text is available for language modeling, as shown by an informal experiment carried out by training the n-gram models with the whole data-set (1,172 text line images). In this case, WER and WSR were 30.0% and 19.9%, respectively, leading to an estimated reduction in human effort of 33.8%. Current work is under way to apply this approach to the transcription of much larger documents and document collections, involving bigger vocabularies and larger amounts of training material.

Acknowledgments. This work has been partially supported by the EC (FEDER), by the Spanish MEC under grant TIN2006-15694-CO2-01, by the Conselleria d'Empresa, Universitat i Ciència - Generalitat Valenciana under contract GV06/252, and by the Universitat Politècnica de València (FPI grant 2006-04).
References
1. Cubel, E., Civera, J., Vilar, J.M., Lagarda, A.L., Vidal, E., Casacuberta, F., Picó, D., González, J., Rodríguez, L.: Finite-state models for computer assisted translation. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI04), Valencia, Spain, pp. 586–590. IOS Press, Amsterdam (2004)
2. Civera, J., Vilar, J.M., Cubel, E., Lagarda, A.L., Barrachina, S., Casacuberta, F., Vidal, E., Picó, D., González, J.: A syntactic pattern recognition approach to computer assisted translation. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A., de Ridder, D. (eds.) Structural, Syntactic, and Statistical Pattern Recognition. LNCS, vol. 3138, pp. 207–215. Springer, Heidelberg (2004)
3. Barrachina, S., Bender, O., Casacuberta, F., Civera, J., Cubel, E., Khadivi, S., Lagarda, A.L., Ney, H., Tomás, J., Vidal, E., Vilar, J.: Statistical approaches to computer-assisted translation. In: Computational Linguistics (submitted, 2006)
4. Rodríguez, L., Casacuberta, F., Vidal, E.: Computer Assisted Speech Transcription. In: Proceedings of the Third Iberian Conference on Pattern Recognition and Image Analysis, Girona, Spain. LNCS, Springer, Heidelberg (2007)
5. Alabau, V., Benedí, J., Casacuberta, F., Juan, A., Martínez-Hinarejos, C., Pastor, M., Rodríguez, L., Sánchez, J., Sanchis, A., Vidal, E.: Pattern Recognition Approaches for Speech Recognition Applications. In: Pla, F., Radeva, P., Vitrià, J. (eds.) Pattern Recognition: Progress, Directions and Applications, pp. 21–40. Centre de Visió per Computador (2006) ISBN 84-933652-6-2
6. Bazzi, I., Schwartz, R., Makhoul, J.: An Omnifont Open-Vocabulary OCR System for English and Arabic. IEEE Trans. on PAMI 21(6), 495–504 (1999)
7. Rabiner, L.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77, 257–286 (1989)
8. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1998)
9. Drira, F.: Towards Restoring Historic Documents Degraded Over Time. In: DIAL '06: Proceedings of the Second International Conference on Document Image Analysis for Libraries, Washington, DC, USA, pp. 350–357. IEEE Computer Society Press, Los Alamitos (2006)
10. Toselli, A.H., Juan, A., Keysers, D., González, J., Salvador, I., Ney, H., Vidal, E., Casacuberta, F.: Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int. Journal of Pattern Recognition and Artificial Intelligence 18(4), 519–539 (2004)
11. Kavallieratou, E., Stamatatos, E.: Improving the quality of degraded document images. In: DIAL '06: Proceedings of the Second International Conference on Document Image Analysis for Libraries, Washington, DC, USA, pp. 340–349. IEEE Computer Society Press, Los Alamitos (2006)
12. Marti, U.-V., Bunke, H.: Using a Statistical Language Model to improve the performance of an HMM-Based Cursive Handwriting Recognition System. Int. Journal of Pattern Recognition and Artificial Intelligence 15(1), 65–90 (2001)
13. Pastor, M., Toselli, A., Vidal, E.: Projection profile based algorithm for slant removal. In: Campilho, A., Kamel, M. (eds.) ICIAR 2004. LNCS, vol. 3211, pp. 183–190. Springer, Heidelberg (2004)
14. Romero, V., Pastor, M., Toselli, A.H., Vidal, E.: Criteria for handwritten off-line text size normalization. In: Proc. of the Sixth IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 06), Palma de Mallorca, Spain (2006)
15. Katz, S.M.: Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Trans. on Acoustics, Speech and Signal Processing ASSP-35, 400–401 (1987)
16. Kneser, R., Ney, H.: Improved backing-off for N-gram language modeling. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 181–184 (1995)
Methods for Written Ancient Music Restoration

Pedro Castro and J.R. Caldas Pinto

IDMEC/IST, Technical University of Lisbon, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
[email protected],
[email protected]

Abstract. Degradation in old documents has been a matter of concern for a long time. With the easy access to information provided by technologies such as the Internet, new ways have arisen for consulting those documents without exposing them to yet more dangers of degradation. While restoration methods are present in the literature in relation to text documents and artworks, little attention has been given to the restoration of ancient music. This paper describes and compares different methods to restore images of ancient music documents degraded over time. Six different methods were tested, including global and adaptive thresholding, color clustering and edge detection. We conclude that the methods based on Sauvola's thresholding algorithm are the best suited for our proposed goal of ancient music restoration.

Keywords: Ancient Music Restoration, Image Processing, Document Degradation, Thresholding, Clustering, Edge Detection.
1 Introduction
Documents of ancient music frequently appear with multiple signs of degradation. Most of these signs result from the aging process, especially if poor care was taken to store and preserve the documents over time. It is becoming a common task in institutions like libraries to provide easy access to different types of information. With ancient documents, the process usually evolves from the creation of high-resolution images, using image scanners, to the generation of different versions of these images, differing in aspects like image quality and size. One feature common to all of these images is that they continue to reveal the deterioration that was present in the original documents. Restoration can be thought of as a transformation process that gives back the original aspect of ancient document images that show a certain state of degradation. Degradation, on the other hand, can be seen as “every sort of less-than-ideal properties of real document images, e.g. coarsening due to low digitizing resolution, ink/toner drop-outs and smears, thinning and thickening, geometric deformations, etc” [1]. Restoration is needed not only to improve the appearance of a document but also to improve the results of further segmentation and recognition operations. By clearing artifacts from the images there will be less room in the future for misinterpretation.
Degradation can be divided into three types [2], according to the areas of a document that are subject to interference: 1) background degradation; 2) foreground degradation; and 3) global degradation. Background degradation includes blotches due to humidity, marks resulting from ink that traverses the paper (bleed-through) or from the scanning process (show-through), underlines, strokes of pen, annotations, and the superimposition of other symbols. Degradation of the foreground generally leads to broken or touching foreground objects, for instance characters or musical symbols. Aging can affect the ink components of a document, leading to ink disappearance, and some gaps can even appear in the document image, causing significant loss of data and therefore affecting the document's content. Global degradation affects documents in their entirety, relating to a transformation that can be observed in a document as a whole, i.e., without affecting only the foreground or only the background. It includes, for instance, geometrical transformations and torn pages. Background degradations account for the majority of degradations found in documents of ancient music, and therefore constitute the main focus of this paper. Examples of these degradations are depicted in Fig. 1. Water blotches are characterized by having a mainly convex shape (due to the diffusion of water molecules in the paper), a color that is darker than the neighborhood (due to the dust which is attracted into the paper texture), and an even darker border area where the dust accumulates [3]. Bleed-through refers to the seeping of ink from one side of a page to the other. It can be quite damaging, showing intensity levels that can be even darker than those of the true valid musical symbols in the foreground. Underlines are more frequent in text documents than they are in music documents, but strokes, annotations and superimposed symbols do appear quite often. The restoration of ancient music images constitutes a topic that has been given little attention in the literature. Many papers exist that focus on optical music recognition, even for ancient music, but no emphasis is given to the restoration of those documents. On the other hand, a variety of methods do exist with application to text documents, and also some with application to images of maps, paintings and photographic prints. Many segmentation and binarization approaches that aim to extract clear text from either noisy or textured backgrounds have been reported and compared in the literature [4,5,6,7,8], some with direct relation to old documents [9,10,11,12,13,14]. Edge-based segmentation has also been reported to work, based on the observation that edges of the valid foreground are sharper than those of the interference [15,16]. Other approaches exist that attempt to recover from defects found in artworks [3,17,18,19], typically following a semi-automatic approach that requires a user to interactively select the locations of the defects. The rest of the paper is organized as follows. Section 2 describes the methods to be used throughout the experiments. Section 3 details the methodology used during the evaluation, and presents and analyzes the results. Section 4 concludes the paper and indicates future work directions.
Fig. 1. Images of ancient music showing background degradation
2 Methods
Interference in the background of a document can generally be detected using segmentation in the feature space. Because the background interference usually presents intensity levels that lie midway between those of the background and those of the foreground, thresholding and clustering methods can be used to separate these defects from the rest of the image. However, there are also cases of extreme degradation in which other methods have to be used. In this sense, edge-based segmentation has been used, as it can be observed that the edges surrounding degradation objects are usually smeared while those of the foreground are sharp. Of the existing methods, only a subset can be applied to restore documents of ancient music. The methods selected for the experiments are described next.

2.1 Niblack Thresholding
Niblack's method [20] performs adaptive thresholding and was selected because it is frequently cited and has been thoroughly evaluated on different types of documents. Comparisons have revealed that it outperforms other thresholding algorithms under different application domains [7,8,13]. This algorithm calculates a threshold value for each pixel based on the mean and standard deviation of all the pixels in a local neighborhood. A window of size N × N is moved over the image and the threshold value for a pixel (x, y) is calculated as:

t(x, y) = m(x, y) + K · s(x, y) ,   (1)

where m(x, y) and s(x, y) are the mean and standard deviation values, respectively, in a local neighborhood of size N × N of pixel (x, y). The parameters N and K highly influence the segmentation result. The window size N must be small enough to keep the details, but also large enough to remove noise. The value of K is used to decide how much of the total print object boundary is retained. A difficulty naturally arises in determining the best N and K parameters. We tested N and K ranging from 9 to 71 and from −0.2 to −1, respectively. The best parameter set did vary from image to image, but the values N = 51 and K = −0.8 were found to perform best on average.
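A minimal sketch of Eq. (1), using box filters for the local statistics, is given below; it is our illustration rather than the implementation used in the paper, and it assumes that ink is darker than the local threshold.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, n=51, k=-0.8):
    """Adaptive Niblack threshold t = m + K*s over an n x n window (Eq. 1).
    gray is a 2-D float array; returns a boolean foreground (ink) mask."""
    mean = uniform_filter(gray, size=n)
    mean_sq = uniform_filter(gray * gray, size=n)
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
    return gray < mean + k * std
```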
2.2 Sauvola Thresholding
Sauvola's method [21] performs adaptive thresholding and was chosen because it is a modification of Niblack's method aimed at dealing better with cases in which the background contains light texture, large intensity variations and uneven illumination. In this modification, the threshold values are computed with the dynamic range of the standard deviation, as:

t(x, y) = m(x, y) · (1 + K · (s(x, y)/R − 1)) ,   (2)

where m(x, y) and s(x, y) are still the mean and standard deviation values, respectively, in a local neighborhood of size N × N, as they were in Niblack's formula. R is the dynamic range of the standard deviation and K now assumes positive values. Our tests revealed that the value of N had limited influence on the quality of the segmentation, except with images containing bigger music note heads, in which case lower N values would not detect their centers. The value of R was fixed to 128, as used by Sauvola [21]. From tests performed with N and K ranging from 5 to 31 and from 0.2 to 0.8, respectively, the values N = 15 and K = 0.2 provided the best results.
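The corresponding sketch for Eq. (2) follows; the local statistics are computed as in the Niblack sketch above, and R = 128 presumes 8-bit gray levels, as in the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_binarize(gray, n=15, k=0.2, r=128.0):
    """Sauvola threshold t = m * (1 + K*(s/R - 1)) over an n x n window (Eq. 2).
    gray holds gray levels in 0..255; returns a boolean foreground mask."""
    mean = uniform_filter(gray, size=n)
    mean_sq = uniform_filter(gray * gray, size=n)
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
    return gray < mean * (1.0 + k * (std / r - 1.0))
```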
Otsu Thresholding
Otsu’s method [22] was included for completeness, to perform global thresholding. It has been many times cited, evaluated [7,8] and used. This method selects a threshold that maximizes between-class variance after creating the histogram of an intensity image. This threshold is then applied to all image pixels. It has the advantage of not requiring the input of parameters, but assumes that histograms are bimodal and illumination is uniform. 2.4
Canny Edge Detection with Niblack Post-processing
This method is a composition of algorithms, similar in part to what was used by Tan et al [15]. As outlined in Fig. 2, it starts by using a Canny edge detector [23], which is followed by foreground recovery and an adaptive thresholding algorithm, in this case Niblack’s. Fig. 3 depicts the image resulting from the multiple stages of this method, which will be described next. Canny edge detection is used to detect the edges within an image based on the observation that the edges of the valid foreground are usually sharper than
Fig. 2. Outline of the Canny edge detection with post-processing method
those of the background interference. This edge detector was selected due to the previously mentioned work with images of text documents, and also because it is known to many as the optimal edge detector under the following criteria: 1) low error rate; 2) good localization of edge marking; and 3) a minimal number of responses for a single edge. The algorithm starts by applying a Gaussian filter to smooth the image. Next, the gradient magnitude and orientation are computed using finite-difference approximations of the partial derivatives. Non-maximal suppression is applied to the gradient magnitude and double thresholding is used to detect and link edges. The two thresholds are used to eliminate weak edges. Initial seeds relating to strong edges are found using the high threshold. These seeds are grown to the pixels connected to them that show a gradient value above the low threshold. This inherently imposes the difficulty of selecting good threshold values. We opted to select the thresholds automatically based on the histogram of gradient magnitudes. The high threshold is set to the level that separates the 30% highest gradient magnitudes from the rest. The low threshold is set to 40% of the high threshold, according to Canny's recommendations. Fig. 3(b) shows an example of detected edges. Following the detection of the edges, recovery is needed to restore the original foreground. The dilation mathematical morphology operator is used with a structuring element corresponding to a square of size D × D. Its purpose is to fill the regions around the edges: the neighboring pixels within a D × D window centered on each edge pixel are recovered. The pixels outside the recovered region are set to the average color of the region they fill. Fig. 3(c) shows the result of applying the dilation operator to the edges that were previously detected. This image is used to recover the original pixels, resulting in the image shown in Fig. 3(d). As can be observed in Fig. 3(d), the dilation operator retains not only valid parts of the foreground but also background and degradation that were surrounding the edges. Therefore, adaptive thresholding is used to remove the remaining interference. Niblack's method was once again selected for this task. Fig. 3(e) presents the result of the binarization process. This binary image can be used together with the original image to retain the original color of the foreground, as shown in Fig. 3(f). The background and interference pixels were set to their average color. We tested D, N and K ranging from 5 to 9, from 9 to 71, and from −0.2 to −1, respectively, where N and K are the parameters of Niblack's thresholding. Regarding the D parameter used for the dilation, a value of 9 has to be used for the foreground objects to be completely filled.
Fig. 3. Images resulting from the multiple stages of the Canny edge detection method with post-processing
Contrary to what happens with text documents, where a value of 7 has been used [15], musical objects can often be thicker, thus requiring a higher D value. The values N = 51 and K = −0.8 provided the best results for the thresholding phase.

2.5 Canny Edge Detection with Sauvola Post-processing
This method is similar to the previous one, described in Sect. 2.4, with the difference that Sauvola thresholding is used in the post-processing phase instead of Niblack's. From tests with D, N and K ranging from 7 to 9, from 5 to 31, and from 0.2 to 0.8, the values D = 9, N = 15 and K = 0.2 provided the best results, where N and K are the parameters of Sauvola's thresholding.
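Our reading of the pipeline of Sects. 2.4-2.5 can be sketched as follows; the automatic 30%/40% threshold selection is replaced here by the library defaults, so this is only an approximation of the setup described above. The threshold_fn argument can be, for instance, the niblack_binarize or sauvola_binarize sketches given earlier.

```python
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.feature import canny

def canny_then_threshold(gray, threshold_fn, d=9, sigma=1.0):
    """Edge-based restoration sketch: detect edges, dilate them with a d x d
    square to recover the strokes around them, then keep only the recovered
    pixels that the adaptive threshold also marks as foreground."""
    edges = canny(gray, sigma=sigma)                                 # stage 1: edge map
    recovered = binary_dilation(edges, structure=np.ones((d, d)))    # stage 2: recovery
    foreground = threshold_fn(gray)                                  # stage 3: thresholding
    return recovered & foreground
```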
Fuzzy C-Means Clustering
This method shares similarities with that of Pinto et al [24], which uses fuzzy clustering in the restoration of old text documents. Clustering is used in the RGB color space to collect pixels into groups according to their similarity. Groups, or clusters, are created based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Objects within a cluster should have high similarity between each other, but should also be very dissimilar to objects in other clusters. Fuzzy clustering distinguishes itself from hard clustering by not assigning an object into a single cluster. Each object can belong to more than one cluster, with certain membership levels. The fuzzy C-Means [25] is well known and revealed to present good results [24]. It is used to partition an image into N clusters, where each pixel has a membership degree to each cluster. The cluster centers are used to determine which cluster relates to the darkest color. All pixels that share a degree of membership greater than a value M are selected as valid foreground pixels. A difficulty arises in the selection of the ideal number of clusters, as well as on the minimum required membership value M . From tests with N and M ranging from 2 to 5 and from 0.65 to 0.95, the best results were achieved with N = 3 and M = 0.75.
3 Experiments
Experiments were performed in order to determine which of the methods achieves the best segmentation. The emphasis was put on the quality of the restoration as observed in the resulting images; computational issues were therefore not considered.
Methodology
Images of written ancient music were provided by the Portuguese Biblioteca Nacional and the Biblioteca Geral da Universidade de Coimbra, Portugal. A total of 10 images, scanned at a resolution of 150 dpi, were used throughout the experiments. These images contain a variety of degradations, as well as different musical notations and illumination characteristics. They are representative
They are representative of the majority of images present in the studied collections from the two aforementioned libraries. All images were first manually restored, using graphics editing software, in order to serve as a reference for comparison. Except for the case of Fuzzy C-Means, which clustered the images in the RGB color space, all other methods had the original images converted to gray scale before processing. This was performed using the well-known weighted conversion:

Gray = 0.3 · R + 0.59 · G + 0.11 · B ,   (3)

where Gray, R, G and B represent the gray, red, green and blue intensity levels, respectively. The result of all methods is a binary image containing the restored foreground and background. After binarization, this image can be applied to the original image as a mask, in order to retain the original colors of the foreground. As for the background, it can be set to a fixed color or to the average of its pixels, to name a few simple alternatives. The test images were processed by the chosen methods and compared to the manually restored images. The comparison was evaluated by the standard measures of precision and recall [26], with a slight modification. To evaluate text segmentation, these measures are typically used with the precision of a character or word. When relating to ancient music, this does not apply that well, as the musical notation is very varied, including notes, clefs, key and time signatures, rests, bar and staff lines, as well as text, among other symbols. All of these symbols could be treated as characters, for instance, but that would ignore the great differences they have in shapes and sizes. Therefore, we opted to perform a bitwise comparison, and as such, the precision (P) and recall (R) measures were used as:

P = Correctly Detected Pixels / Total Detected Pixels ,   (4)

R = Correctly Detected Pixels / Total Pixels ,   (5)

where “Correctly Detected Pixels” refers to the foreground pixels that were correctly binarized (i.e., that are equal to those of the manually restored images) by a specific method, “Total Detected Pixels” refers to the total foreground pixels that were binarized by that method, and “Total Pixels” refers to the total foreground pixels that are present in the manually restored image. Precision and recall reflect the performance of removing interfering strokes and restoring foreground strokes, respectively. The two measures have to be used together in order to compare the methods. To relate the two measures, the F1 score statistical measure was used, which is defined as:

F1 = 2 · P · R / (P + R) .   (6)

The F1 score is the actual value that is worth maximizing.
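The bitwise evaluation of Eqs. (4)-(6) amounts to the following few lines (an illustrative sketch; both arguments are boolean foreground masks, with the second playing the role of the manually restored reference).

```python
import numpy as np

def pixel_precision_recall_f1(detected, reference):
    """Pixel-level precision, recall and F1 between a binarization result and
    the manually restored reference mask."""
    correct = np.logical_and(detected, reference).sum()
    p = correct / max(detected.sum(), 1)
    r = correct / max(reference.sum(), 1)
    f1 = 2.0 * p * r / max(p + r, 1e-12)
    return p, r, f1
```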
3.2 Results
The results of evaluating the selected methods on the 10 selected images are presented in Table 1. Sauvola's thresholding was the method that obtained the highest average F1 score, but only slightly better than Canny edge detection with Sauvola post-processing. Niblack's thresholding was better when used as a post-processor for the Canny method, but only by a small margin over its use alone. Fuzzy C-Means performed somewhat close to Niblack's thresholding. Otsu's global thresholding method gave the worst results, as expected.

Table 1. Detailed precision (P), recall (R) and F1 results obtained by applying six different methods to 10 images of ancient music

Method        | Measure | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | Average
Niblack       | P       | 88.02 | 83.63 | 48.64 | 83.83 | 75.24 | 69.09 | 91.03 | 81.95 | 91.12 | 95.28 | 80.78
              | R       | 91.99 | 80.18 | 82.83 | 84.42 | 82.32 | 88.77 | 71.86 | 80.30 | 81.68 | 89.81 | 83.42
              | F1      | 89.97 | 81.87 | 61.29 | 84.12 | 78.62 | 77.70 | 80.31 | 81.12 | 86.14 | 92.47 | 81.36
Sauvola       | P       | 89.42 | 84.90 | 94.51 | 91.74 | 74.97 | 94.00 | 90.60 | 84.04 | 89.56 | 96.35 | 89.01
              | R       | 91.17 | 81.37 | 93.17 | 85.56 | 79.32 | 93.00 | 80.33 | 95.66 | 88.80 | 94.79 | 88.32
              | F1      | 90.28 | 83.10 | 93.84 | 88.54 | 77.08 | 93.50 | 85.16 | 89.47 | 89.18 | 95.57 | 88.57
Otsu          | P       | 70.16 | 73.01 | 24.12 | 66.64 | 67.25 | 69.87 | 83.93 | 62.60 | 79.74 | 85.92 | 68.32
              | R       | 90.72 | 85.53 | 36.16 | 81.91 | 85.11 | 96.10 | 73.92 | 78.65 | 80.70 | 87.06 | 79.59
              | F1      | 79.13 | 78.78 | 28.94 | 73.49 | 75.13 | 80.92 | 78.61 | 69.71 | 80.22 | 86.49 | 73.14
Canny Niblack | P       | 86.22 | 80.12 | 40.81 | 85.19 | 72.33 | 71.79 | 89.84 | 78.41 | 88.96 | 97.05 | 79.07
              | R       | 94.02 | 83.20 | 86.02 | 84.86 | 86.96 | 92.06 | 77.39 | 84.92 | 83.09 | 93.00 | 86.55
              | F1      | 89.95 | 81.63 | 55.36 | 85.02 | 78.97 | 80.67 | 83.15 | 81.53 | 85.92 | 94.98 | 81.72
Canny Sauvola | P       | 88.05 | 84.79 | 93.13 | 91.91 | 78.00 | 92.75 | 90.69 | 84.00 | 89.36 | 96.05 | 88.87
              | R       | 90.59 | 81.69 | 93.33 | 85.81 | 82.00 | 91.38 | 80.01 | 95.00 | 88.25 | 94.99 | 88.31
              | F1      | 89.31 | 83.21 | 93.23 | 88.76 | 79.95 | 92.06 | 85.02 | 89.16 | 88.80 | 95.52 | 88.50
Fuzzy C-Means | P       | 76.02 | 85.79 | 33.63 | 87.20 | 59.39 | 89.14 | 95.36 | 74.73 | 90.71 | 98.40 | 79.04
              | R       | 87.83 | 76.90 | 95.39 | 89.44 | 90.40 | 86.46 | 68.29 | 84.83 | 88.37 | 91.47 | 85.94
              | F1      | 81.50 | 81.10 | 49.72 | 88.31 | 71.69 | 87.78 | 79.58 | 79.46 | 89.52 | 94.81 | 80.35
The two methods that used Sauvola's thresholding obtained the highest values of precision and recall, with a small difference between them. The other methods presented a bigger difference between precision and recall, with recall being greater most of the time. This means that these methods showed a preference for restoring foreground strokes over removing interference, causing a great part of the degradation to remain intact. Fuzzy C-Means was more effective with images in which the degradation showed a strong color. When such images are converted to gray scale, as is done for the other methods, the degradation color may in some parts become a gray level similar to that of the foreground; in this case, clustering in the color space may be able to group the degradation pixels together, distinguishing them from the remaining pixels. Sauvola's method was able to outperform Niblack's with a much smaller window of size 15 × 15, compared to the window of size 51 × 51 used with Niblack's method.
Fig. 4. Result of processing two of the original images with the chosen methods. Columns 1 and 2 correspond to the processing of images 2 and 3, respectively. Presented from top to bottom are: the original images; the result of binarizing the original images with Niblack thresholding, Sauvola thresholding, Otsu thresholding, Canny edge detection with Niblack post-processing, Canny edge detection with Sauvola post-processing, and Fuzzy C-Means; and the restored images using Sauvola thresholding, the method that performed the best.
In fact, using a 15 × 15 window with Niblack's method resulted in a particularly bad segmentation, with most of the symbols barely recognizable, as most of the interference would be detected as foreground too. Two of the images used in the experiments are presented in Fig. 4, along with the results of processing them with the chosen methods. Image 2 shows less variation in the results, but noticeable interference is still observable in the results produced by the methods of Otsu, Niblack and Canny with Niblack post-processing. The Fuzzy C-Means method was particularly aggressive with this image, erasing part of the staff lines. More varied results are observable for image 3. The Otsu and Fuzzy C-Means methods performed very poorly, leaving large areas of degradation. Niblack's method, either when used alone or as post-processing for the Canny edge detection, also performed poorly, showing sparse but still large interference regions. Sauvola's method, either after edge detection or used alone, was able to remove almost all degradation. No method was able to remove the bleed-through pixels, which appear on the right in image 3. This is due to the high color intensity with which bleed-through objects can appear.
4 Conclusion
Ancient music images often present multiple types of degradation. Typical cases of degradation were presented and analyzed. Different written ancient music restoration methods were described. A methodology for method evaluation was established and the selected methods were compared, and the obtained results were discussed. Indeed, the use of edge detection did not seem to be of great benefit when compared to the use of the thresholding algorithms themselves. This seems to have less to do with the edge detection itself and more with the way those edges are used afterwards. By observing the edge detection result, as exemplified in Fig. 3(b), it is noticeable that it typically detects the edges correctly, i.e., most of the foreground edges and almost none of the interference. The problem is that, after edge detection, the recovery phase retains all the pixels that surround the edges, within a local window, therefore including not only foreground but also interference pixels. A solution to be tested would consist of growing the edge pixels in the direction of the darkest colors, in such a way that only the true foreground would be preserved. Another problem evidenced by our experiments relates to the removal of bleed-through. As can be observed in the binarization results of image number 3, in Fig. 4, even Sauvola's method was not able to remove the seeped ink. One thing that is noticeable is that edge detection does, in fact, distinguish between bleed-through and valid foreground strokes: as the bleed-through strokes are smeared, they are typically not detected as edges. Therefore, once again, a different way of reusing the detected edges, like the previously indicated solution of pixel growing, could be more successful in dealing with this problem. Finally, some research is still necessary in order to fully automate the application of these methodologies to the mass production of restored ancient music documents. That will be the concern of future work.
Acknowledgments. This work was partly supported by: “Programa de Financiamento Plurianual de Unidades de I&D (POCTI), do Quadro Comunitário de Apoio III”; the FCT project POSC/EIA/60434/2004 (CLIMA), Ministério do Ensino Superior da Ciência e Tecnologia, Portugal; program FEDER; and “Programa Operacional Ciência e Inovação POCI2010”.
References
1. Baird, H.: The state of the art of document image degradation modeling (2000)
2. Drira, F.: Towards restoring historic documents degraded over time. In: Document Image Analysis for Libraries, pp. 350–357 (2006)
3. Stanco, F., Ramponi, G.: Detection of Water Blotches in Antique Documents. In: Proc. 8th COST 276 Workshop, Trondheim, Norway (May 2005)
4. Liu, Y., Srihari, S.N.: Document image binarization based on texture features. IEEE Trans. Pattern Anal. Mach. Intell. 19(5), 540–544 (1997)
5. Liang, S., Ahmadi, M., Shridhar, M.: A morphological approach to text string extraction from regular periodic overlapping text/background images. ICIP (1), 144–148 (1994)
6. Yang, Y., Yan, H.: An adaptive logical method for binarization of degraded document images. Pattern Recognition 33, 787–807 (2000)
7. Trier, Ø.D., Jain, A.K.: Goal-directed evaluation of binarization methods. IEEE Trans. Pattern Anal. Mach. Intell. 17(12), 1191–1201 (1995)
8. Taxt, T., Trier, Ø.D.: Evaluation of binarization methods for document images. IEEE Trans. Pattern Analysis and Machine Intelligence 17(6), 640–640 (1995)
9. Negishi, H., Kato, J., Hase, H., Watanabe, T.: Character extraction from noisy background for an automatic reference system. In: ICDAR, pp. 143–146 (1999)
10. Gatos, B., Pratikakis, I., Perantonis, S.J.: An adaptive binarization technique for low quality historical documents. In: Workshop on Document Analysis Systems, pp. 102–113 (2004)
11. Leedham, G., Varma, S., Patankar, A., Govindaraju, V.: Separating text and background in degraded document images: A comparison of global thresholding techniques for multi-stage thresholding. In: Frontiers in Handwriting Recognition, pp. 244–249 (2002)
12. Garain, U., Paquet, T., Heutte, L.: On foreground–background separation in low quality document images. International Journal on Document Analysis and Recognition 8(1), 47–63 (2000)
13. He, J., Do, Q.D.M., Downton, A.C., Kim, J.H.: A comparison of binarization methods for historical archive documents. In: ICDAR, pp. 538–542. IEEE Computer Society, Los Alamitos (2005)
14. Leedham, G., Yan, C., Takru, K., Tan, J.H.N., Mian, L.: Comparison of some thresholding algorithms for text/background segmentation in difficult document images. In: International Conference on Document Analysis and Recognition, pp. 859–864 (2003)
15. Tan, C.L., Cao, R., Wang, Q., Shen, P.: Text extraction from historical handwritten documents by edge detection. In: ICARCV 2000, 6th International Conference on Control, Automation, Robotics and Vision, Singapore (December 5-8, 2000)
16. Tan, Cao, Shen: Restoration of archival documents using a wavelet technique. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002)
17. Barni, M., Bartolini, F., Cappellini, V.: Image processing for virtual restoration of artworks. IEEE MultiMedia 7(2), 34–37 (2000)
18. de Rosa, A., Bonacchi, A.M., Cappellini, V., Barni, M.: Image segmentation and region filling for virtual restoration of artworks. In: International Conference on Image Processing, vol. 1, pp. 562–565 (2001)
19. Stanco, F., Ramponi, G., Tenze, L.: Removal of Semi-Transparent Blotches in Old Photographic Prints. In: Proc. 5th COST 276 Workshop, Prague, Czech Republic (2003)
20. Niblack, W.: An Introduction to Digital Image Processing. Prentice-Hall, Englewood Cliffs (1986)
21. Sauvola, J., Pietikainen, M.: Adaptive document image binarization. Pattern Recognition 33(2), 225–236 (2000)
22. Otsu, N.: A threshold selection method from gray level histograms. IEEE Trans. Systems, Man and Cybernetics 9, 62–66 (1979)
23. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8, 679–698 (1986)
24. Pinto, J.R.C., Bandeira, L., Sousa, J.M.C., Pina, P.: Combining fuzzy clustering and morphological methods for old documents recovery. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3523, p. 387. Springer, Heidelberg (2005)
25. Bezdek, J.C. (ed.): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY (1981)
26. Junker, M., Dengel, A., Hoch, R.: On the evaluation of document analysis components by recall, precision, and accuracy. In: ICDAR, pp. 713–716 (1999)
Suppression of Noise in Historical Photographs Using a Fuzzy Truncated-Median Filter

Michael Wirth and Bruce Bobier

Dept. Computing & Information Science, University of Guelph
{mwirth, bbobier}@uoguelph.ca
Abstract. To a large extent, noise suppression algorithms have been designed to deal with the two classically defined types of noise: impulsive and Gaussian noise. However, digitized images acquired from historical photographs, such as albumen prints, contain a form of quasi-noise we shall term chaotic noise. This paper describes the concept of chaotic noise and proposes two fuzzy filters, based on the truncated median (an approximation of the mode filter), to suppress various types of noise in historical photographs.

Keywords: noise suppression, chaotic noise, fuzzy filter, image quality.
1 Introduction
Photographs represent one of the oldest visual forms of media used to convey information. Historical images constitute an important part of our cultural and documentary heritage (see Figure 1). In many cases the views that these images represent have either changed significantly or no longer exist, and as such they provide an invaluable insight into the state of structures and monuments of the past 150 years. The purpose of image enhancement is to improve the quality of an image. This usually means improving acuity, augmenting contrast, or suppressing noise. Noise can be described as any unwanted distortion in an image. The two most commonly portrayed types of noise in images are impulse noise and random noise. Impulse noise is characterized by spurious corrupted pixels, which may not affect the content of the image too greatly, and may be due to information loss. Random, or Gaussian, noise can be introduced during the process of generating an image, and is analogous to film “grain”. In photography, the “grainy” effect of a photograph is caused by developed silver halide crystals that cluster together on the processed negative. When a photograph is printed, these grain clusters are enlarged, becoming a perceptible pattern over the whole image. The presence of such film grain has a noticeable effect on the ability to discern small features in an image. Figure 1 shows an example of film grain noise. However, there is also a form of quasi-noise we will categorize as chaotic noise. Chaotic noise may occur as the result of a deteriorative process, resulting in changes to the physical structure of a piece of art or a photograph. When digitized, physical characteristics such as cracks in paintings and albumen prints manifest themselves as noise in the image, contributing to a decrease in image quality and aesthetic appeal.
quality and aesthetic appeal. Consider the albumen print shown in Fig. 2. Cracking of the albumen results in thousands of ultra-fine fractures on the surface of the print; this is sometimes termed crazing. We term this sub-category of chaotic noise thread noise [1], because the noise looks like threads in the image. It may also be the product of dirt which has accumulated on the emulsion; this dirt results in an overall grainy appearance of the photograph. It might also be that the noise was introduced during photographic processing. Figure 3 shows an example of noise resulting from the presence of dirt on the emulsion of a historical photograph.
Fig. 1. An example of film grain noise
Fig. 2. The original (left) and enlarged region of interest showing an example of cracking manifesting itself as thread, or ribbon noise (right)
There are a multitude of nonlinear filters which have been developed to suppress noise in images. Noise suppression algorithms should possess two characteristics: the ability to suppress noise, and the ability to preserve edges within the image. Median
filters provide a good balance: they are able to reduce impulse noise while preserving edges, and only moderately attenuate noise in homogeneous or flat regions of an image. They operate by replacing a pixel by the median value obtained from a neighborhood surrounding the pixel. However, the caveat with median filters is that their performance in suppressing random noise is somewhat mediocre. The moving-average filter is adept at smoothing random noise, but suffers from an inability to preserve edges or reduce impulse noise. There are few filters able to deal with noise such as thread noise. In this paper, we extend the fuzzy filters of Kwan and Cai [2], which apply fuzzy membership weighting functions that incorporate aspects of both moving-average and median filters. We take this a step further, advocating the use of truncated-median (or quasi-mode) filters.
Fig. 3. The original (left) and enlarged region of interest showing an example of noise relating to the presence of dirt on the emulsion (right)
2 Fuzzy Filtering Fuzzy filters are based on the principle of a 2D filter [2] of the form:
I_e(i, j) = \frac{\sum_{(x,y) \in W} F[W(i+x,\, j+y)] \cdot W(i+x,\, j+y)}{\sum_{(x,y) \in W} F[W(i+x,\, j+y)]}   (1)
where F [W ] is the filtering function, and W is the sub-image being processed, centred on the pixel (i, j ) . The window is of size N, such that the range of x and y are: -X ≤ x ≤ X, and –Y ≤ y ≤ Y, where N = 2X+1 = 2Y+1. We use a value of N = 5 for both filters described in the next section. Kwan [3] introduced a series of fuzzy filters based on symmetrical and asymmetrical triangular membership functions, used to filter impulse, random and mixed noise. Some of these include the TMED (symmetrical triangular fuzzy filter with median value), GMED (Gaussian fuzzy filter with median centre), ATMED
(asymmetrical triangular fuzzy filter with median value), and three similar filters, GMAV, TMAV and ATMAV, where the median is replaced with a moving average. For example, the GMED incorporates a Gaussian, which provides a “smoothing” effect on the image, and a median filter for filtering “impulse”-type noise and preserving edges.
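To make the structure of (1) concrete, the following is a minimal sketch (not the authors' implementation) of the generic fuzzy-weighted filter, shown here with the symmetrical triangular membership around the window median (TMED); the window size, border handling and the exact membership shape are assumptions for illustration.

```python
import numpy as np

def tmed_weights(window):
    """Symmetrical triangular membership centred on the window median (TMED)."""
    med = np.median(window)
    spread = max(window.max() - med, med - window.min())
    if spread == 0:
        return np.ones_like(window, dtype=float)
    return 1.0 - np.abs(window - med) / spread

def fuzzy_filter(image, weight_fn=tmed_weights, half=2):
    """Generic fuzzy filter of Eq. (1): weighted average of each N x N window
    (N = 2*half + 1), with weights given by the membership function F."""
    img = np.asarray(image, dtype=float)
    out = img.copy()
    rows, cols = img.shape
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            w = img[i - half:i + half + 1, j - half:j + half + 1]
            f = weight_fn(w)
            out[i, j] = np.sum(f * w) / np.sum(f)
    return out
```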
3 Fuzzy Filters Using the Truncated Median 3.1 The TMOD and GMOD Filters
We introduce a symmetrical triangular fuzzy filter with a truncated median (TMOD) and a Gaussian fuzzy filter with truncated-median centre (GMOD). These filters are used for filtering impulse, random and mixed noise of a sporadic nature. The TMOD filter is defined as:

F[W(i+x,\, j+y)] = \begin{cases} 1 - \dfrac{|W(i+x,\, j+y) - W_{tmed}|}{W_{mm}} & \text{if } |W(i+x,\, j+y) - W_{tmed}| \le W_{mm} \\ 1 & \text{if } W_{mm} = 0 \end{cases}   (2)

W_{mm} = \max\left[\, W_{max} - W_{tmed},\; W_{tmed} - W_{min} \,\right]   (3)
where W_{tmed}, W_{min} and W_{max} represent the truncated median, minimum and maximum values of the sub-image W respectively. The GMOD filter is defined as:

F[W(i+x,\, j+y)] = \exp\left( -\tfrac{1}{2} \left[ \dfrac{W(i+x,\, j+y) - W_{tmed}}{W_{\sigma}} \right]^{2} \right)   (4)

where W_{tmed} and W_{\sigma} represent the truncated median and variance of the sub-image W respectively.
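For illustration only, a direct transcription of the TMOD and GMOD membership functions of Eqs. (2)-(4) might look as follows; the truncated median W_tmed is taken here as an input (its computation is described in Sect. 3.2 below), and the use of the sample variance for W_sigma is an assumption.

```python
import numpy as np

def tmod_weights(window, w_tmed):
    """TMOD membership of Eqs. (2)-(3): triangular, centred on the truncated median."""
    w_mm = max(window.max() - w_tmed, w_tmed - window.min())   # Eq. (3)
    if w_mm == 0:
        return np.ones_like(window, dtype=float)
    return 1.0 - np.abs(window - w_tmed) / w_mm                # Eq. (2)

def gmod_weights(window, w_tmed):
    """GMOD membership of Eq. (4): Gaussian, centred on the truncated median."""
    w_sigma = window.var()          # assumed: the "variance" W_sigma of the sub-image
    if w_sigma == 0:
        return np.ones_like(window, dtype=float)
    return np.exp(-0.5 * ((window - w_tmed) / w_sigma) ** 2)
```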
3.2 The Truncated Median
The mode of an image neighborhood represents the most probable intensity value [4]. Davies [4] points out that while the mode has redeeming characteristics as a noise-suppression filter, it is difficult to implement in a small neighborhood with a sparse intensity distribution. A truncated median (sometimes called the trimmed median) offers an approximation of the mode. It is calculated by first extracting a neighborhood W centred on pixel (i, j) from the original image, reshaping it into a vector E, and then calculating the mean (E_{mean}), median (E_{med}), maximum (E_{max}),
and minimum (E_{min}) values in the neighborhood. Now we can decide whether the neighborhood will be truncated on the upper or the lower bound of the vector:

E_{trunc} = \begin{cases} E \ge 2E_{med} - E_{min} & \text{(truncation at the upper bound)} \\ E \le 2E_{med} - E_{max} & \text{(truncation at the lower bound)} \end{cases}

… q, k ∉ M and k ∉ Nc(j)};
   If q ≤ Δ + 1 then c(j) := q;
   Else c(j) := min{k | k ∉ Nc(j)};
   Endif
  Enddo
 Endif
End.

where c(i) is the color of node i, N(i) is the set of nodes adjacent to i (the set of colors of N(i) is denoted Nc(i)), d(i) = |N(i)| is the number of neighbors of node i (its degree), and Δ = max{d(i) | i ∈ V}. After applying Procedure 1, some colors remain without any dominating vertex. The following Procedure 2 finds a b-coloring of the graph G in which all the colors belonging to L are dominating colors (i.e. Dm = L). The idea is the following: each non-dominating color q (i.e. q ∈ NDm) can be changed. In fact, after removing q from the graph G, each vertex vi colored with q (i.e. c(vi) = q) is assigned a new color different from those of its neighborhood. As our objective is to find a partition such that the sum of vertex dissimilarities within each class is minimized, the color whose distance to vi is minimal is selected when there is a choice between several colors for vi. Before starting again with another color q' ∈ NDm, we verify whether any colors of NDm have acquired a dominating vertex (in such a case, these colors are added to the set Dm).

Procedure 2: Graph_b-coloring()
BEGIN
 Repeat
  q := max{k | k ∈ NDm}; L := L \ {q}; NDm := L \ Dm;
  For each vertex vi such that c(vi) = q do
   K := {k | k ∈ L and k ∉ Nc(vi)};
   c(vi) := {c | dist(vi, c) = min_{k ∈ K} dist(vi, k)};
  Enddo
  For each vertex vj such that c(vj) ∈ NDm do
   Update(Nc(vj));
   If Nc(vj) = L \ {c(vj)} then Add(c(vj), Dm); EndIf
  Enddo
 Until (NDm = ∅)
END.
In the next part of this paper, we present our new approach to ABL based on graph coloring, where proper graph coloring is used to improve the physical segmentation and graph b-coloring is used to train a classifier that separates the address block from the other blocks.
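For concreteness, a possible Python sketch of Procedure 2's colour-reassignment loop is given below (illustrative only, not the authors' code); the colour set L, the sets Dm and NDm, the neighbour-colour sets Nc and the vertex-to-colour dissimilarity dist() are assumed to be available from the preceding steps.

```python
def graph_b_coloring(vertices, c, L, Dm, Nc, dist):
    """Re-colour vertices of non-dominating colours until every colour in L dominates.

    c: dict vertex -> colour; L, Dm: sets of colours; Nc: dict vertex -> set of
    neighbour colours (assumed to be kept up to date as colours change).
    """
    NDm = set(L) - set(Dm)
    while NDm:
        q = max(NDm)                     # largest non-dominating colour
        L.discard(q)
        NDm = set(L) - set(Dm)
        for v in [u for u in vertices if c[u] == q]:
            K = [k for k in L if k not in Nc[v]]      # colours absent from v's neighbourhood
            if K:
                c[v] = min(K, key=lambda k: dist(v, k))   # closest admissible colour
        for v in vertices:               # check whether new dominating vertices appeared
            if c[v] in NDm and Nc[v] == set(L) - {c[v]}:
                Dm.add(c[v])
        NDm = set(L) - set(Dm)
    return c, L, Dm
```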
3 Our Approach of Address Block Location The ABL is done in three steps: image binarisation and CC detection; physical segmentation, to decompose the envelope image into several candidate blocks; and, in the feature space, classification of the blocks to extract the AB from them. Figure 4 presents our modular architecture, conceived to ensure an effective coherence between the different modules.
[Figure 4 block diagram; main modules: foreground block localization on the envelope image (multi-resolution cumulated-gradient approach with morphological operations), CC detection, local thresholding of foreground blocks, pyramidal organisation of data (levels 1-3), physical segmentation by proper hierarchical colouring of graph, hierarchical extraction of features, classifier trained by graph b-colouring, localization of the address block and its lines, skew detection and correction, OCR]
Fig. 4. Modular scheme of different steps of address block localization, pyramidal organization of data and hierarchical graph
3.1 Methods Used in Envelope Image Thresholding and CCs Detection None of the traditional (global or adaptive) methods fulfils all the imposed conditions, namely a certain efficiency on all images within an imposed computing time. Separating the two stages of thresholding and text localization both increases the computing time and leads to over-segmentation of the paper-texture noise on the image background. We have managed to optimize our thresholding method by applying adapted threshold calculations near the text zones only (see Fig. 5). The localization of the information zones is carried out by a multi-resolution cumulated-gradient approach combined with morphological operations computed directly on the grey-level image.
Fig. 5. Our mixed approach of binarisation (text localization/thresholding)
Then, we detect the connected components (CCs) of the binary card and of the foreground-blocks card returned by our hybrid thresholding method. The method used is inspired by Pavlidis's work [8], based on the LAG (Line Adjacency Graph) structure presented previously. Each CC i is represented by its bounding-box coordinates CC(i) = (xi1, yi1, xi2, yi2). Let V(L1) be the CC set of the binary card, which forms the finest layer L1, and V(L3) that of the foreground-blocks card, which forms the coarsest layer L3 of the pyramid, with V(Lk) = {CCk(i) | i = 1…Nk}, where Nk is the number of CCs in layer k (k = 1, 2, 3). In order to reduce Nk we remove all parasitic CCs. Both sets V(L1) and V(L3) will be used by the next segmentation module.
Fig. 6. Bounding boxes of connected components of binarised envelope image
3.2 Our Hierarchical Strategy of Features Extraction In classical ABL methods, the feature extraction step, whether local or global (Figure 7), is usually applied after the segmentation phase. In our work, this stage is carried out during the segmentation and also helps to improve it. Denoting by VdLk(i) the feature vector of item i extracted from layer k of the envelope image, this descriptor should be able to represent the disparity between dissimilar objects. The total descriptor of each block is given by the combination of the features extracted from all layers of the pyramid (see formula (1)).
Vd_{Total}(i) = Vd_{L1} ∪ Vd_{L2} ∪ Vd_{L3}   (1)
Fig. 7. Hierarchical features extraction
3.3 Physical Segmentation (PHS) Using Hierarchical Proper Coloring of Graph In this section, we propose a physical segmentation of the envelope image based on graph coloring. The idea is to use the clustering of items that have similar features in the PHS, owing to the fact that the blocks card resulting from the localization phase cannot always be correct, that an address block can contain other objects of different natures, and that two different blocks can represent the same object. At the beginning, we use the set of CCs named V(L1) to build our graph G1(V(L1), E>S1). Each object in this set is represented by a descriptor (a vector of relevant features of the object) denoted VdL1(i). Vertices (nodes) in G1 correspond to objects, edges represent neighborhood relationships, and edge weights reflect the dissimilarity between pairs of linked vertices. We then use the distributed algorithm “Proper_Graph_coloring()” to give the same color to all similar objects. This algorithm is detailed in Section 2.3. The dissimilarity thresholds Si (i = 1, 2, 3) can be obtained automatically by training [15]. We say that two objects are neighbours if their dissimilarity is greater than S. We combine the colors of graph G1 with the blocks card (see Fig. 4) to split each block according to these colors, and the colors according to these blocks. Each block resulting from this decomposition should not be heterogeneous. Using another set of features, we then execute “Proper_Graph_coloring()” on the graph G3 built from the set of homogenised blocks; the resulting classes represent the final set of corrected blocks. Lastly, we execute “Proper_Graph_coloring()” with yet another set of features to detect the text lines in each block. The following algorithm summarizes the principle of our method.
Procedure 3: Physical_Segmentation()
Begin
 Level 1: CCs card, regrouping of similar objects
  For every CC1(i), i := 1…N1: Extract VdL1(i) Endfor
  V(L1) := {VdL1(i) | i = 1…N1}
  G1 := G(V(L1), E(L1) > S1); Execute Proper_Graph_coloring()(G1)
 Level 3: blocks card, segmentation refinement, split-and-merge of existing blocks
  Let M := G1 ∩ V(L3)
  For every item i of M: Extract VdL3(i) Endfor
  G3 := G(M, E(M) > S3); Execute Proper_Graph_coloring()(G3)
 Level 2: text lines, segmentation of each block into several text lines
  Let M := V(L1) ∩ G3
  For every item i of M: Extract VdL2(i) Endfor
  G2 := G(M, E(M) > S2); Execute Proper_Graph_coloring()(G2)
  V(L2) := G2, V(L3) := G3;
  Extract VdL1 := V(L1) / G2 and G3;
  Extract VdL2 := G2 / V(L1) and G3;
  Extract VdL3 := G3 / V(L1) and G2;
  For every block find Vd_{Total}(i) = {VdL1 ∪ VdL3} ∪ {VdL2 ∪ VdL3} ∪ VdL3;
End
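A rough Python sketch of the Level 1 step of Procedure 3 is shown below for illustration; the greedy colour assignment used for Proper_Graph_coloring() is an assumption (the distributed algorithm itself is described in Section 2.3 and [17]), and features() and the dissimilarity threshold S1 are placeholders.

```python
import numpy as np

def proper_graph_coloring(objects, dissim, S):
    """Greedy proper colouring of the graph whose edges link objects with dissimilarity > S:
    dissimilar neighbours receive different colours, so each colour groups similar objects."""
    colors = {}
    for i, vi in enumerate(objects):
        forbidden = {colors[j] for j in range(i) if dissim(objects[j], vi) > S}
        c = 0
        while c in forbidden:
            c += 1
        colors[i] = c
    return colors

def segment_level1(ccs, features, S1):
    """Level 1 of Procedure 3: group connected components with similar descriptors."""
    descriptors = [np.asarray(features(cc), dtype=float) for cc in ccs]   # Vd_L1(i)
    dissim = lambda a, b: float(np.linalg.norm(a - b))
    return proper_graph_coloring(descriptors, dissim, S1)
```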
3.4 Training Based on Graph b-Coloring To generate our training data set from the envelope images segmented by our method (presented in Section 3.3), a set of 400 blocks is selected from a wide range of categories (AB, stamps, logos…). To find the several classes we adopt the clustering technique based on graph b-coloring (presented in Section 2.3). The different steps of this method are presented in [15]. The proposed technique offers a real representation of clusters by dominating items, which guarantees the interclass disparity of a partitioning (see Figure 8). The dominating vertices are also used to distinguish an address block from the other blocks extracted from the image of a query envelope. 3.5 ABL Based on Graph b-Coloring To select the AB from the several blocks detected on the envelope image, we compare in the feature space all objects of the set S = {Vi | i = 1…N} with all dominating objects (nodes) of the set S* = {v*j | j = 1…M} that were detected in the training step.
Fig. 8. Training: blocks classification based on graph b-coloring and detection of dominating node of each color (class)
We assign to each object of S the same identity as its neighbouring object of S*. If several objects represent an AB at the same time, we must use the spatial relationships of each object (AB) with all the others in the same set. The measure of the dissimilarity between two objects vi and v*j is given by the generalized Minkowski distance of order α (α ≥ 1):

d_{i,j} = \left( \sum_{k=1}^{N_f} g_k\!\left(v_i^{k}, v_j^{*k}\right)^{\alpha} \right)^{1/\alpha}
When α = 1, d_{i,j} is the City Block distance; when α = 2, d_{i,j} is the Euclidean distance. As α increases, the distance tends towards a Chebyshev result. N_f is the number of features used, and g_k is the comparative dissimilarity function between the two features.
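For illustration, the generalized Minkowski distance above can be written as in the following sketch; taking the per-feature dissimilarity g_k as a simple absolute difference is an assumption made here.

```python
def minkowski_distance(v_i, v_j_star, alpha=2.0, g=lambda a, b: abs(a - b)):
    """Generalized Minkowski distance of order alpha between an object and a dominating node."""
    assert alpha >= 1.0
    return sum(g(a, b) ** alpha for a, b in zip(v_i, v_j_star)) ** (1.0 / alpha)

# alpha = 1 gives the City Block distance, alpha = 2 the Euclidean distance.
```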
4 Experimental Results and Discussions We tested our method on a set of 750 envelope images; the rate of good localisation is 98%. To evaluate the efficiency of our method on envelopes rejected by the existing system, we used a set of 100 rejected envelope images. Statistically, the rejection of 18 envelopes is due to a failure of the binarisation method, that of 53 envelopes to a failure of the physical segmentation, and the rejection of the remaining
envelopes to bad classification of the AB. We tested our method on this basis for ABL; only 7 envelopes among the 100 were rejected. Figure 9 illustrates the performance of our approach.
Fig. 9. Comparison of methods and reduction of the rejection rate by our approach
The diagram in Figure 10 shows the increase in OCR recognition rate after the application of our mixed binarisation method.
Fig. 10. Increase of the recognition rate with our mixed thresholding method
5 Conclusion We presented in this paper a new approach to ABL based on hierarchical graph coloring and a pyramidal structure of data. We have seen that the ABL module is composed of three elementary modules (binarization and connected-components
detection, physical segmentation of the envelope, and address-zone extraction). A failure in any one of these modules leads directly to rejection of the postal mail piece. We have used hierarchical graph coloring in the physical segmentation, at the same time, to increase robustness to the problem of parasitic objects, to avoid fusion between text lines, to split and merge the blocks of the last pyramidal layer, and to regroup objects which have the same characteristics. Using a pyramidal structure of data for feature extraction, graph b-coloring is used in the training to extract the address zone. This new architecture ensured a good coherence between the various modules and reduced, at the same time, the computing time and the rejection rate. This work is supported by the CESA company (www.cesa.fr).
References [1] Ching-Huei, W., Palumbo, P.W., Srihari, S.N.: Object recognition in visually complex environments: an architecture for locating address blocks on mail pieces. In: Pattern Recognition, 9th International Conference, vol. 1, pp. 365–367. IEEE Computer Society Press, Los Alamitos (1988) [2] Viard-Gaudin, C., Barba, D.: A multi-resolution approach to extract the address block on flat mail pieces. In: ICASSP-91. International Conference, vol. 4, pp. 2701–2704. [3] Yu, B., Jain, A.K., Mohiuddin, M.: Address block location on complex mail pieces. In: Document Analysis and Recognition. Fourth International Conference, vol. 2, pp. 897– 901. IEEE Computer Society Press, Los Alamitos (1997) [4] Jeong, S.H., Jang, S.I., Nam, Y.-S.: Locating destination address block in Korean mail images. In: ICPR 2004. 17th International Conference, vol. 2, pp. 387–390. IEEE Computer Society Press, Los Alamitos (2004) [5] Eiterer, L.F., Facon, J., Menoti, D.: Postal envelope address block location by fractalbased approach. In: Computer Graphics and Image Processing. 17th Brazilian Symposium, pp. 90–97. IEEE Computer Society Press, Los Alamitos (2004) [6] Otsu, N.: A threshold selection method from grey-level histogram. IEEE trans system, man and cybernetics 9, 62–66 (1979) [7] Sauvola, J., et al.: Adaptive Document Binarization. In: ICDAR’97, vol. 1, pp. 147–152 (1997) [8] Pavlidis, Z., Zhou, J.: A Page Segmentation and Classification. CVGIP92 54(6), 484–496 [9] Regentova, E., Latifi, S., Deng, S., Yao, D.: An Algorithm with Reduced Operations for Connected Components Detection in ITU-T Group 3/4 Coded Images. In: Pattern Analysis and Machine Intelligence, vol. 24, pp. 1039–1047. IEEE Computer Society Press, Los Alamitos [10] Déforges, O., Barba, D.: A fast multiresolution text-line and non text line structures extraction and discrimination scheme for document image analysis. In: ICPR 94, pp. 34– 138 [11] Wang, S.-Y., Yagasaki, T.: Block selection: a method for segmenting a page image of various editing styles. In: ICDAR 1995. Proceedings of the Third International Conference, vol. 1, pp. 128–133 (1995) [12] Shi, Z., Govindaraju, V.: Line separation for complex document images using fuzzy runlength. In: Document Image Analysis for Libraries, DIAL 2004, pp. 306–312 (2004) [13] Drivas, D., Amin, A.: Page segmentation and classification utilising a bottom-up approach, Document Analysis and Recognition. In: ICDAR 1995. Proceedings of the Third International Conference, vol. 2, pp. 610–614 (1995)
[14] Effantin, B., Kheddouci, H.: The b-chromatic number of power graphs. Discrete Mathematics and Theoretical Computer Science (DMTCS) 6, 45–54 (2003) [15] Elghazel, H., Hacid, M.-S., Khedddoucil, H., Dussauchoy, A.: A New Clustering Approach for Symbolic Data: Algorithms and Application to Healthcare Data, 22 éme journées bases de données avancées, BDA 2006, Lille, Actes On formal proceeding (Octobre 17, 2006) [16] Corteel, S., Valencia-Pabon, M., Vera, J.-C.: On approximating the b-chromatic number,Discrete Applied Mathematics archive (ISSN:0166-218X), vol. 146, pp. 106–110. Publisher Elsevier Science Publishers B. V, Amsterdam, The Netherlands (2005) [17] Effantin, B., Kheddouci, H.: A distributed algorithm for a b-coloring of a graph. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds.) ISPA 2006. LNCS, vol. 4330, Springer, Heidelberg (2006)
Poultry Skin Tumor Detection in Hyperspectral Reflectance Images by Combining Classifiers Chengzhe Xu1, Intaek Kim1, and Moon S. Kim2 1
Department of Communication Engineering, Myongji University, 38-2 Namdong, Yongin, Kyonggido, South Korea
[email protected],
[email protected] 2 USDA ARS, BA, ANRI, ISL, Bldg. 303, BARC-East, 10300 Baltimore Ave., Beltsville, MD 20705-2350, U.S.A
[email protected] Abstract. This paper presents a new method for detecting poultry skin tumors in hyperspectral reflectance images. We employ principal component analysis (PCA), the discrete wavelet transform (DWT), and kernel discriminant analysis (KDA) to extract independent feature sets from the hyperspectral reflectance image data. These features are individually classified by a linear classifier and their classification results are combined using the product rule. The final classification result based on the proposed method shows better performance in detecting tumors compared with previous works.
1 Introduction A hyperspectral image is a three-dimensional volume of data containing two-dimensional spatial information measured at a sequence of individual wavelengths across sufficiently broad spectral bands (spectrum). The hyperspectral image shows great potential for the detection of abnormalities because it provides both spatial and spectral features about the objects of interest in the image. Hyperspectral imaging techniques have been utilized in many scientific disciplines. One recent application of hyperspectral imaging techniques is inspection and quality control in the food safety and inspection service. Several approaches have been proposed to deal with the poultry skin tumor detection problem [1][2]. To enhance the performance of detecting tumors, two different approaches have been proposed with hyperspectral fluorescence images. A PCA technique was used to extract the features and a Support Vector Machine (SVM) classifier was applied to detect skin tumors on poultry carcasses [3]. The other method utilized the band ratio to maximize the distinction between normal skin and tumor; it employed an RBPNN classifier to differentiate the tumor and normal regions in the image [4]. This paper proposes a new method based on combining classifiers for poultry skin tumor detection in hyperspectral reflectance images. The paper is organized as follows. In the next section, we briefly describe hyperspectral images. In Section 3, the feature extraction and classifier combination methods are described. Section 4 summarizes the main results of the paper, which show that combining classifiers works better than any single classifier. In the final section, concluding remarks are made.
2 Hyperspectral Image The Instrument and Sensing Laboratory (ISL) in the USDA has developed a laboratory-based line-by-line hyperspectral imaging system capable of reflectance and fluorescence imaging for uses in food safety and quality research [5]. Hyperspectral sensors collect the spectral signatures of a number of contiguous spatial locations to form hyperspectral sensor imagery. A hyperspectral image can be represented as a 3D array or cube of data; single-band images are stacked along a spectral axis. Each data value represents the intensity of a pixel and can be denoted by I(u, v, λ), where u = 1, …, M and v = 1, …, N are spatial coordinates and λi, i = 1, …, L
indicates spectral band. For a fixed λk , I (u , v, λk ) represents the k-th band image. If u and v are fixed, then I (u , v, λ ) stands for spectrum or spectral information. Hyperspectral images are useful in the analysis of a scene as no single-band image has sufficient information to describe the information of the scene completely. Hyperspectral reflectance images can be obtained using normal luminous source like an incandescent electric lamp, which makes the image acquisition cheaper and simpler than any other sources of image. In this study, 112 spectral bands are obtained from wavelength λ1 = 425.4 nm to λ112 = 710.7 nm with the same interval. Fig.1
shows spectral band images at the wavelength of λ10 , λ40 , λ70 , λ100 .
Fig. 1. Hyperspectral reflectance images at the wavelength of λ10 , λ40 , λ70 , and λ100 (from left to right)
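As an illustration of this indexing (hypothetical NumPy code, with the spatial dimensions M and N chosen arbitrarily):

```python
import numpy as np

M, N, L = 256, 320, 112            # placeholder spatial size; 112 spectral bands as in this study
cube = np.zeros((M, N, L))         # I(u, v, lambda)

band_40 = cube[:, :, 39]           # fixed lambda_k: the k-th band image
spectrum = cube[100, 150, :]       # fixed (u, v): the spectrum at one pixel
```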
3 Tumor Detection Algorithm There have been several attempts at detecting tumors in hyperspectral images, and they relied on a single classifier. In this section, we investigate three distinct features which are independent and may carry complementary information for locating tumors. Like a conventional classification system, the procedure for tumor detection requires two steps: feature extraction and classifier combination. 3.1 Feature Extraction
Different feature sets are extracted by different methods from the same data. The information from the three features seems to be independent, but we will not deal with this in detail because proving the independence of these features is out of scope; rather, the classification result can tell whether the assumption is correct. In this study, we employ
three different methods to obtain three different feature sets: PCA (Principal Component Analysis), DWT (Discrete Wavelet Transform), and KDA (Kernel Discriminant Analysis). 3.1.1 PCA (Principal Component Analysis) PCA is a linear transformation that projects the original data onto new orthogonal axes to find the subspaces that yield the best approximation for the observed data [6]. PCA can be defined by Y = W^T X, where X is an n-dimensional column vector set, E[X] = 0, and W ∈ ℝ^{n×n}. The transformation matrix W can be obtained from

\Sigma_X \mathbf{W} = \mathbf{W} \Lambda   (1)
where Λ = diag[λ1, λ2, …, λn] consists of the eigenvalues (λ1 ≥ λ2 ≥ … ≥ λn) of the covariance matrix Σ_X = E[XX^T]. The transformation matrix W, whose columns are the n normalized orthogonal eigenvectors, satisfies W^{-1} = W^T. The covariance matrix Σ_Y is obtained from

\Sigma_Y = \mathbf{W}^T \Sigma_X \mathbf{W}   (2)
Using Equations (1) and (2) with W^{-1} = W^T yields Σ_Y = Λ = diag[λ1, λ2, …, λn]. The eigenvalues of the covariance matrix of the input data are the variances of the principal components. The variance indicates the degree of contrast; thus a principal component with higher contrast, in general, carries better information and can be considered as a candidate for an efficient feature. The first principal component is excluded because it is the best approximation of the original data and is inappropriate as a feature. According to the magnitudes of the variances, only the first few principal components hold considerable information of the original data while the rest contain mostly noise. 3.1.2 DWT (Discrete Wavelet Transform) Wavelet analysis is a transformation method that transforms the original signal into a different domain for analysis and processing [7]. In the DWT, the signal x(t) in a square-integrable functional space can be expressed as

x(t) = \sum_k c_{j,k}\,\varphi_{j,k}(t) + \sum_k d_{j,k}\,\psi_{j,k}(t) + \sum_k d_{j-1,k}\,\psi_{j-1,k}(t) + \cdots + \sum_k d_{1,k}\,\psi_{1,k}(t)   (3)
where φ_{j,k}(t) and ψ_{j,k}(t) are wavelet functions generated from the father and mother wavelets φ and ψ through scaling and translation as follows:

\varphi_{j,k}(t) = 2^{-j/2}\,\varphi\!\left(\frac{t - 2^{j}k}{2^{j}}\right), \qquad \psi_{j,k}(t) = 2^{-j/2}\,\psi\!\left(\frac{t - 2^{j}k}{2^{j}}\right).   (4)
The coefficients c_{j,k}, d_{j,k}, …, d_{1,k} are the amplitude coefficients of each basis function, j and k are integers, j > 0 is the resolution level, k = 1, …, N/2 is the shift parameter, and N is the number of data points recorded.
The multi-resolution decomposition of a signal can now be defined as:

A_j(t) = \sum_k c_{j,k}\,\varphi_{j,k}(t), \qquad D_i(t) = \sum_k d_{i,k}\,\psi_{i,k}(t), \qquad \text{for } i = 1, \ldots, j.   (5)
Then, A_j(t) corresponds to the coarser detail at level j, which is called an approximation of the signal; it contains the low-frequency component of x(t). The D_i(t) correspond to the fine details. The low-frequency component generally contains most of the information of the signal; thus, the coefficients c_{j,k} corresponding to A_j(t) can be used as features to describe the original signal.
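As an illustration only, the level-3 approximation coefficients of a pixel spectrum can be obtained with a sketch like the following (assuming the PyWavelets package is available; the Daubechies-1 wavelet and 3 decomposition levels follow Sect. 4):

```python
import numpy as np
import pywt

def dwt_features(spectrum, wavelet="db1", level=3):
    """Return the level-3 approximation coefficients c_{j,k} of a pixel spectrum."""
    coeffs = pywt.wavedec(np.asarray(spectrum, dtype=float), wavelet, level=level)
    return coeffs[0]    # coeffs[0] is the approximation A_j; coeffs[1:] are the details D_j ... D_1
```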
3.1.3 KDA (Kernel Discriminant Analysis) KDA is a nonlinear approach that extracts the discriminating features based on the kernel technique [8]. Assume there are n training patterns {x1, …, xn} which are categorized into c classes, and li (i = 1, …, c) denotes the number of samples in the ith class. Due to nonlinearity, these patterns cannot be linearly separated in the space of x. With an appropriately selected nonlinear mapping φ, we project the patterns onto a high-dimensional feature space where the patterns are linearly separable. By performing LDA in the feature space, one can obtain a nonlinear representation in the original input space. To find the linear discriminant in the feature space, we need to maximize

J(\mathbf{W}) = \frac{\mathbf{W}^T S_B^{\phi} \mathbf{W}}{\mathbf{W}^T S_W^{\phi} \mathbf{W}}   (6)
where W is the transformation matrix in the feature space, and S_B^φ and S_W^φ are the between-class and within-class scatter matrices in the feature space, respectively. However, computing φ explicitly may be problematic or even impossible. If the nonlinear mapping φ satisfies Mercer's condition, then the inner product of two vectors in the feature space can be computed through a kernel function k(x, y) = (φ(x) · φ(y)) in the input space, and the kernel function can replace the explicit mapping of the data to the feature space. In what follows, we assume that we deal with centred data. According to the theory of the reproducing kernel, W will be an expansion of all training samples in the feature space, i.e. there exist coefficients αi (i = 1, 2, …, n) such that

\mathbf{W} = \sum_{i=1}^{n} \alpha_i\,\phi(\mathbf{x}_i) = \mathbf{H}\boldsymbol{\alpha}   (7)
where H = (φ(x1), …, φ(xn)) and α = (α1, …, αn)^T. Substituting (7) into (6), we can obtain the following equation [9]:
J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T \mathbf{K} \mathbf{V} \mathbf{K} \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T \mathbf{K} \mathbf{K} \boldsymbol{\alpha}}   (8)
where K is the kernel matrix and V = diag(V1, …, Vc), where Vi is an li × li matrix whose diagonal elements are 1/li. Let S̃_B = KVK and S̃_W = KK; then α is determined by finding the leading eigenvector of S̃ = S̃_W^{-1} S̃_B. The projection of a new pattern x onto W is given by (9):

\mathbf{W} \cdot \phi(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\,k(\mathbf{x}_i, \mathbf{x}).   (9)
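A possible two-class sketch of computing α and the projection (9) is shown below (illustrative code, not the authors' implementation; the Gaussian kernel and σ = 0.8 follow Sect. 4, while the construction of V as class blocks filled with 1/l_i follows one common kernel-discriminant formulation, and the small ridge term is an added numerical safeguard).

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=0.8):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kda_fit(X, labels, sigma=0.8, ridge=1e-6):
    """Expansion coefficients alpha maximizing J(alpha) of Eq. (8) (assumes centred data)."""
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    V = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        V[np.ix_(idx, idx)] = 1.0 / len(idx)          # class block filled with 1/l_i (assumption)
    Sb = K @ V @ K
    Sw = K @ K + ridge * np.eye(n)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    alpha = np.real(eigvecs[:, np.argmax(np.real(eigvals))])   # leading eigenvector
    return alpha

def kda_project(x_new, alpha, X_train, sigma=0.8):
    """Projection of a new pattern onto W, Eq. (9)."""
    k = gaussian_kernel(np.atleast_2d(x_new), X_train, sigma)[0]
    return float(alpha @ k)
```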
3.2 Classifier Combination The product rule is one of the rules for combining multiple classifiers [10]. The principle of the product rule can be described as follows. Consider R classifiers, each representing the given pattern X by a distinct measurement vector. In each classifier, X is to be assigned to one of the m possible classes (ω1, …, ωm) according to the measurement vector. Denote the measurement vector obtained by the i-th classifier by xi. In the measurement space, each class ωk is modeled by the probability density p(xi | ωk), and its a priori probability of occurrence is denoted by P(ωk). According to Bayesian theory, given the measurements xi (i = 1, …, R), the pattern X should be assigned to class ωj if the a posteriori probability of that interpretation is maximum, i.e.

X \in \omega_j \quad \text{if} \quad P(\omega_j \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) = \max_{k=1}^{m} P(\omega_k \mid \mathbf{x}_1, \ldots, \mathbf{x}_R).   (10)
Using the Bayes theorem and the total probability theorem, (10) can be rewritten as:

X \in \omega_j \quad \text{if} \quad p(\mathbf{x}_1, \ldots, \mathbf{x}_R \mid \omega_j)\,P(\omega_j) = \max_{k=1}^{m} p(\mathbf{x}_1, \ldots, \mathbf{x}_R \mid \omega_k)\,P(\omega_k).   (11)
Here p(x1, …, xR | ωk) represents the joint probability distribution of the measurements extracted by the classifiers. Let us assume that the representations used are conditionally statistically independent; we will investigate the consequences of this assumption and write

p(\mathbf{x}_1, \ldots, \mathbf{x}_R \mid \omega_k) = \prod_{i=1}^{R} p(\mathbf{x}_i \mid \omega_k)   (12)
where p(xi | ω k ) is the measurement process model of the i-th representation. Substituting (12) into (11), we obtain the decision rule:
X \in \omega_j \quad \text{if} \quad P(\omega_j) \prod_{i=1}^{R} p(\mathbf{x}_i \mid \omega_j) = \max_{k=1}^{m} P(\omega_k) \prod_{i=1}^{R} p(\mathbf{x}_i \mid \omega_k).   (13)
The combination of results obtained by the same classifier over different feature sets frequently outperforms the combination of results obtained by different classifiers over the same feature set. In this paper, the former is utilized to realize the classifier combination applying the product rule.
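Assuming each single classifier outputs class-conditional likelihoods for its own feature set, the product rule (13) can be sketched as follows (illustrative only):

```python
import numpy as np

def product_rule(likelihoods, priors):
    """Combine R classifiers by the product rule, Eq. (13).

    likelihoods: list of R arrays of shape (m,), p(x_i | w_k) for each classifier i
    priors:      array of shape (m,), P(w_k)
    Returns the index of the winning class.
    """
    scores = np.asarray(priors, dtype=float)
    for p in likelihoods:
        scores = scores * np.asarray(p, dtype=float)
    return int(np.argmax(scores))

# Hypothetical example: three classifiers, two classes (normal, tumor)
decision = product_rule([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]], priors=[0.5, 0.5])
```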
4 Experimental Results A total of 13 hyperspectral reflectance images with skin tumors were collected for the experiment. Two sets of data containing 1000 pixels were randomly selected for normal skin and tumor, respectively. In each set, half of the pixels were used for training and the remaining half for testing. In preprocessing, the data were normalized into the range [-1, 1]. In PCA, the principal components up to the 8th were considered as features, except the first component. In DWT, the Daubechies 1 wavelet was utilized to perform a 3-level wavelet decomposition for every pixel, and the approximation coefficients were used as features. Finally, a Gaussian kernel was used in KDA with σ = 0.8 and a regularization parameter C = 1.5. Each feature was classified by a linear classifier and the results were combined according to the product rule. In the following experiments, the average test errors and standard deviations were obtained over 100 runs. In each run, the data set was randomly selected from the poultry samples, and the pixels at the same locations of the hyperspectral fluorescence images were selected as a data set for the purpose of comparison. Table 1 shows the error rates of the individual classifications for each feature and that of the classifier combination. It demonstrates that the result of the classifier combination outperforms the best individual classification result. To evaluate the performance of the proposed method, we compared the result with those in [3] and [4] in Table 2. It should be noted that an SVM classifier and a PCA method for feature selection were implemented in [3], while an RBPNN classifier with a band-ratio method was adopted in [4]. The proposed method, combining linear and simple classifiers, reduces the error rate by about 4.7% compared with the previous works. Fig. 2 also indicates that the proposed method yields fewer false alarms. Table 1. Error rates of individual classification vs. classifier combination (%)
Method                   Error Rate
PCA                      11.52 ± 1.40
DWT                       5.63 ± 0.84
KDA                       3.17 ± 0.55
Classifier Combination    2.82 ± 0.54

Table 2. Comparison of error rates with previous works (%)

Method           Error Rate
Fletcher's [3]   7.55 ± 0.81
Kim's [4]        7.51 ± 0.88
Proposed         2.82 ± 0.54
Fig. 2. Classification result from [3], classification result from [4], classification result by the proposed method, and real tumor spots (from left to right)
5 Conclusions A new method for detecting skin tumors on chicken carcasses in hyperspectral reflectance images has been presented. The proposed detection method yields better performance by combining several pieces of information from the different feature sets. The feature sets are extracted using PCA, DWT, and KDA, and then each feature is classified by a linear classifier individually. All of these classification results are combined by the product rule, and finally each pixel is classified as either normal skin or tumor. In the experiment, the classification result was improved compared with previous works and with others using a single classifier. From the study, it can be inferred that each different method may extract independent features, since their combination improves the performance by reducing the error rate. Therefore, as long as the independence of the features is maintained, combining classifiers can show enhanced performance. Further research may include the fusion of hyperspectral reflectance and fluorescence images in order to reduce the error rate more significantly.
References 1. Chao, K., Chen, Y.R., Hruschka, W.R., Gwozdz, F.B.: On-line Inspection of Poultry Carcasses by a Dual-camera System. J. Food. Eng. 51(3), 185–192 (2002) 2. Kong, S.G., Chen, Y.R., Kim, I., Kim, M.S.: Analysis of Hyperspectral Fluorescence Images for Poultry Skin Tumor Inspection. Applied Optics. 43(4), 824–833 (2004) 3. Fletcher, J.T., Kong, S.G.: Principal Component analysis for Poultry Tumor Inspection using Hyperspectral Fluorescence Imaging. In: Proceedings of the International Joint Conference on Neural Networks 2003, Portland, Oregon. Doubletree Hotel-Jantzen Beach, vol. 1, pp. 149–153 (2003) 4. Kim, I., Xu, C., Kim, M.S.: Poultry Skin Tumor Detection in Hyperspectral Images Using Radial Basis Probabilistic Neural Network. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3973, pp. 770–776. Springer, Heidelberg (2006) 5. Kim, M.S., Chen, Y.R., Mehl, P.M.: Hyperspectral Reflectance and Fluorescence Imaging System for Food Quality and Safety. Transactions of ASAE 44(3), 721–729 (2001) 6. Jolliffe, I.J.: Principal Component Analysis, 2nd edn. Springer-Verlag, Heidelberg (2002) 7. Rao, R.M., Bopardikar, A.S.: Wavelet Transforms: Introduction to theory and Applications, 1st edn. Addison Wesley Longman, Inc., Redwood City (1998)
8. Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Muller, K.: Fisher Discriminant Analysis with Kernels. In: IEEE Neural Networks for Signal Processing Workshop, pp. 41–48. IEEE Computer Society Press, Los Alamitos (1999) 9. Li, Y., Gong, S., Liddell, H.: Recognising Trajectories of Facial Identities Using Kernel Discriminant Analysis. J. Image and Vision Computing 21, 1077–1086 (2003) 10. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Transactions on PAMI 20(3), 226–239 (1998)
Intelligent Real-Time Fabric Defect Detection Hugo Peres Castilho3 , Paulo Jorge Sequeira Gonc¸alves1,2 , Jo˜ao Rog´erio Caldas Pinto1 , and Ant´onio Limas Serafim3 1
IDMEC/IST, Technical University of Lisbon (TU Lisbon) Av. Rovisco Pais, 1049-001 Lisboa, Portugal
[email protected] 2 EST, Polytechnical Institute of Castelo Branco Av. do Empres´ario, 6000-767 Castelo Branco, Portugal
[email protected] 3 INETI, Instituto Nacional de Engenharia Tecnologia e Inovac¸a˜ o, Estrada do Pac¸o do Lumiar, 22, 1649-038 Lisboa, Portugal
[email protected],
[email protected] Abstract. This paper presents real-time fabric defect detection based on intelligent techniques. Neural networks (NN), fuzzy modeling (FM) based on product-space fuzzy clustering, and the adaptive network based fuzzy inference system (ANFIS) were used to obtain a clear classification for defect detection. Their implementation requires thresholding their output and, based on previous studies, a confusion-matrix-based optimization is used to obtain the threshold. Experimental results for real fabric defect detection, obtained with the experimental apparatus presented in the paper, showed the usefulness of the three intelligent techniques, although the NN has faster performance. On-line implementation of the algorithms showed they can easily be implemented with commonly available resources and may be adapted to industrial applications without great effort.
1 Introduction It has been said [1,2] that defects reduce the value of fabrics by 45% to 65%; furthermore, a human inspector can detect no more than 60% of the defects present. This means many defects are missed, and the inspection is inconsistent, with its outcome depending on the training and the skill level of the personnel. The use of pattern recognition in machine vision systems would allow for consistent results in quality control, increased speeds and reduced manual intervention. Related to the work presented in this paper, fabric defect detection, numerous approaches have been suggested and presented, which can be broadly divided into four main classes [3]: statistical, geometrical, model based and signal processing. Statistical approaches are among the first mentioned in the machine vision literature and are based on the statistical distribution of gray values [1,4,5,6]. Geometrical models were first suggested as an interpretation of human perception of textures by Julesz, as stated in [3], when he proposed the “theory of textons”, and are characterized by the definition of texture as being composed of texture elements. Model-based texture models are based on the construction of an image model that can be used not only to describe texture, but also to synthesize it.
Several signal processing techniques have been applied to texture analysis with success, and research has given evidence that such an analysis is similar to what the human brain does [2,7,8,9,10,11,12,13]. The detailed description of these approaches is beyond the scope of this paper. The approach followed in this work is based on a filter that takes into account pixel neighborhoods [11] to generate inputs for a classification system. Three classification systems were tested: a feed-forward neural network (NN), fuzzy modeling (FM) based on product-space fuzzy clustering, and the adaptive network based fuzzy inference system (ANFIS). The output is processed in order to define a threshold that allows defect presence to be clearly identified. The threshold selection is based on a performance function that uses the classification's confusion matrix. All three classification approaches were tested off-line using the experimental apparatus presented in this paper to obtain images of real defective fabrics. The models obtained from the off-line training process were then implemented in the experimental apparatus; the system is now prepared for defect detection on the trained fabrics. This paper also presents results for the on-line implementation of all the classification systems. Real-time inspection systems use dedicated hardware to process fabric data [6,12] and line capture is encoder driven. The system presented in this paper is based on a line-scan camera, a frame grabber installed in a PC and a motion-controlled drum, all capable of simulating a shop-floor process. The recognition system was tested both off- and on-line. Special emphasis is given to the on-line results and the obtained inspection rates. The organization of this paper is as follows: Section 2 describes the proposed method for defect inspection. The apparatus and the experiments are detailed in Sec. 3. Results comparing the classification systems are in Sec. 4 and the real-time implementation results in Sec. 5. Finally, Sec. 6 summarizes our findings.
2 Defect Detection Algorithm The block diagram of the defect detection algorithm, based on a neural network, proposed in [14] is presented in Fig. 1. In this paper, each of the classification systems is used to classify feature vectors extracted from the input image. The output from the classifier is post-processed to reduce classification errors and to produce a clear delimitation between defect and defect-free areas. Feature Vector Extraction. Feature vectors are simply the grey-level values of pixels in a pre-defined neighborhood. This is a simple and direct method given the goal of real-time defect detection. Classification System. Processes feature vectors to highlight defects. The output is an image of the same size containing continuous values in the [0, 1] interval.
Fig. 1. Block diagram of the defect detection algorithm
Averaging Filter. Reduces speckle-like noise in the output. This filter is an N × N matrix of equal weights and sum 1 that is convolved with the image. Threshold. Binarizes the output into two distinct classes, using a pre-determined threshold value obtained with the methods described in Sec. 2.3. Feature Vector Extraction. Many methods have been devised to select features capable of defining the texture, some of them of high complexity, such as those coming from random field models; others, simpler, are useful for most practical applications. The approach followed in this work sought a simple algorithm given the goal of applying it to a real-time defect detection system. So, a quite direct feature selection process was chosen, based on the gray levels of pixels in a given neighborhood. A number of configuration masks or macro-windows could be considered for neighborhood selection [14]. Following the previously cited work, the neighborhood (macro-window) selected is presented in Fig. 2; a small sketch of these two processing stages is given after the figure.
Fig. 2. Macro-window (MW) shape
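A minimal sketch of the two stages just described, feature vectors built from a pixel neighbourhood and the N × N equal-weight averaging filter, is given below; the square 5 × 5 neighbourhood used here is a simplification of the macro-window of Fig. 2, and SciPy's uniform filter stands in for the explicit convolution.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def extract_feature_vectors(image, half=2):
    """Grey-level values of the pixels in a square neighbourhood around each pixel."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    vectors = []
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            vectors.append(img[i - half:i + half + 1, j - half:j + half + 1].ravel())
    return np.array(vectors)

def averaging_filter(output_image, n=3):
    """N x N matrix of equal weights and sum 1, convolved with the classifier output."""
    return uniform_filter(np.asarray(output_image, dtype=float), size=n)
```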
2.1 Classification System Neural Network: Neural networks are among the fastest and most flexible classification systems in use. Another desirable feature is that they are easily trained using backpropagation. When used directly on feature vectors from the acquired image, this method is, in some literature [9], compared to filter bank approaches: the input weights of the first layer are similar to a filter bank and the nonlinearities in the network resemble the nonlinearity in the local energy function. The use of the log-sigmoid as activation function comes naturally when dealing with binary states, 0 representing defect presence and 1 its absence. The output of the NN is an image of the same size as the input containing continuous values in the [0, 1] interval. Fuzzy Modeling Based on Product Space Fuzzy Clustering: Fuzzy modeling often follows the approach of encoding expert knowledge expressed in verbal form into a collection of if–then rules. Parameters in this structure can be adapted using input-output data. When no prior knowledge about the system is available, a fuzzy model can be constructed entirely on the basis of system measurements. In the following, we consider data-driven modeling based on fuzzy clustering [15]. We consider rule-based models of the Takagi-Sugeno (TS) type. TS models consist of fuzzy rules describing a local input-output relation, typically in an affine form: Ri : If x1 is Ai1 and . . . and xn is Ain then yi = ai x + bi , i = 1, 2, . . . , K. Here Ri is the ith rule, x = [x1 , . . . , xn ]T are the antecedent variables, Ai1 , . . . , Ain are fuzzy sets defined in the antecedent space, and yi is the rule output variable. K
denotes the number of rules in the rule base, and the aggregated output of the model, ŷ, is calculated by taking the weighted average of the rule consequents:

\hat{y} = \frac{\sum_{i=1}^{K} \beta_i y_i}{\sum_{i=1}^{K} \beta_i},   (1)

where βi is the degree of activation of the ith rule: \beta_i = \prod_{j=1}^{n} \mu_{A_{ij}}(x_j), i = 1, …, K, and \mu_{A_{ij}}(x_j) : ℝ → [0, 1] is the membership function of the fuzzy set Aij in the antecedent of Ri. To identify the model in (1), the regression matrix X and an output vector y are constructed from the available data: X^T = [x_1, …, x_N], y^T = [y_1, …, y_N], where N ≫ n is the number of samples used for identification. The number of rules, K, the antecedent fuzzy sets, Aij, and the consequent parameters, ai, bi, are determined by means of fuzzy clustering in the product space of the inputs and the outputs [16]. Hence, the data set Z to be clustered is composed from X and y: Z^T = [X, y]. Given Z and an estimated number of clusters K, the Gustafson-Kessel fuzzy clustering algorithm [17] is applied to compute the fuzzy partition matrix U. The fuzzy sets in the antecedent of the rules are obtained from the partition matrix U, whose ikth element μik ∈ [0, 1] is the membership degree of the data object zk in cluster i. One-dimensional fuzzy sets Aij are obtained from the multidimensional fuzzy sets defined point-wise in the ith row of the partition matrix by projection onto the space of the input variables xj. The point-wise defined fuzzy sets Aij are approximated by suitable parametric functions in order to compute \mu_{A_{ij}}(x_j) for any value of xj. The consequent parameters for each rule are obtained as a weighted ordinary least-squares estimate. Let θ_i^T = [a_i^T; b_i], let X_e denote the matrix [X; 1], and let W_i denote a diagonal matrix in ℝ^{N×N} having the degree of activation, β_i(x_k), as its kth diagonal element. Assuming that the columns of X_e are linearly independent and β_i(x_k) > 0 for 1 ≤ k ≤ N, the weighted least-squares solution of y = X_e θ + ε becomes

\theta_i = \left( \mathbf{X}_e^T \mathbf{W}_i \mathbf{X}_e \right)^{-1} \mathbf{X}_e^T \mathbf{W}_i \mathbf{y}.   (2)
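To illustrate Eqs. (1) and (2), a compact sketch of the TS model output and of the weighted least-squares consequent estimate is given below (illustrative only; the membership functions mu_ij are assumed to be available from the clustering step).

```python
import numpy as np

def ts_output(x, memberships, A, b):
    """Aggregated TS model output, Eq. (1): weighted average of the rule consequents.

    memberships[i][j] is the callable membership function mu_Aij of rule i, input j;
    A has shape (K, n) with rows a_i, b has shape (K,) with entries b_i.
    """
    x = np.asarray(x, dtype=float)
    betas = np.array([np.prod([mu(x[j]) for j, mu in enumerate(rule_mus)])
                      for rule_mus in memberships])       # degrees of activation beta_i
    y_rules = A @ x + b                                   # y_i = a_i x + b_i
    return float(betas @ y_rules / betas.sum())

def wls_consequents(Xe, y, beta_i):
    """Weighted least-squares estimate of one rule's consequent parameters, Eq. (2)."""
    W = np.diag(beta_i)                                   # diagonal matrix of activations
    return np.linalg.solve(Xe.T @ W @ Xe, Xe.T @ W @ y)   # theta_i = [a_i; b_i]
```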
ANFIS - Adaptive Network Based Fuzzy Inference System: ANFIS [18] combines a fuzzy inference system and a neural network. The fuzzy inference system concept has been widely used in fuzzy control; it uses a certain number of rules that, combined, give an ideal membership function to deal with structural knowledge. Neural networks usually do not deal with structured knowledge, but they can self-adapt and self-learn, and can estimate input-output relations from data. ANFIS makes full use of the excellent characteristics of neural networks and fuzzy inference systems and is widely applied in fuzzy control and modeling. As a special neural network, ANFIS can approximate nonlinear systems; in fact, ANFIS is a neural network that realizes Takagi-Sugeno systems using a network structure. 2.2 Training the Models To train the neural network, the fuzzy model and the ANFIS system, the inputs and targets (outputs) are supplied. The inputs are feature vectors extracted from the training
Fig. 3. Training inputs and outputs
image, as explained in Sec. 2. The target (output) is determined by an image in which normal areas are marked in white and defective areas in black; these images are referred to as defect tags. During the training process, inputs and targets (outputs) are presented for every pixel in the image, as shown in Fig. 3. 2.3 Threshold Selection The confusion matrix partitions the results into a table. The columns are the classes predicted by the algorithm and the rows the actual classes; an example of a confusion matrix is presented in Tab. 1.

Table 1. Example of confusion matrix

                   Classified
                   Normal   Defect
Actual   Normal    a        b
         Defect    c        d
This is a common way to compare classification results, and therefore a good tool to determine an optimal threshold. The terms in the confusion matrix presented in Table 1 are described as follows:
a  number of correct predictions that an instance is normal
b  number of incorrect predictions that an instance is defective
c  number of incorrect predictions that an instance is normal
d  number of correct predictions that an instance is defective
Several standard terms have been defined for the two-class matrix:

AC = \frac{a + d}{a + b + c + d}   (3)

TP = \frac{d}{c + d}   (4)

FP = \frac{b}{a + b}   (5)

TN = 1 - FP   (6)

FN = 1 - TP   (7)
AC (accuracy) is the proportion of the total number of predictions that were correct, Eq. 3. TP (true positive rate) is the proportion of defect cases correctly identified, Eq. 4. FP (false positive rate) is the proportion of normal cases incorrectly classified as defect, Eq. 5. TN (true negative rate) is the proportion of correctly identified normal cases, Eq. 6. FN (false negative rate) is the proportion of defect cases incorrectly classified as normal, Eq. 7. Using the accuracy we build the following performance function:

f(x) = 1 - AC(x)   (8)

which, compared with the performance function defined in [14],

f(x) = w\,FP(x) + (1 - w)\,FN(x)   (9)

has the advantage of not requiring a constant weight w to be fixed in order to set the relative importance of the factors with respect to one another. In these equations, x is the selected threshold and f is the value of the performance function; in Eq. 9 the weight w sets the relative importance of the factors. The first method, using the accuracy, will be called accuracy performance (ACP), and the second, in Eq. 9, false rate performance (FRP). A simple minimum search based on the simplex search method is applied to the selected performance function (accuracy or false rate performance). The resulting threshold value is used in the defect detection algorithm.
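The accuracy-based threshold selection can be sketched as follows (illustrative code, not the authors' implementation; SciPy's Nelder-Mead simplex search stands in for the "simplex search method" mentioned above, and mapping low classifier outputs to defects is an assumption consistent with the 0 = defect convention of Sec. 2.1).

```python
import numpy as np
from scipy.optimize import fmin

def confusion_terms(output, truth, threshold):
    """a, b, c, d of Table 1 for a given threshold on the filtered classifier output."""
    pred_defect = output < threshold          # assumption: low output values mark defects
    actual_defect = truth < 0.5               # defect tags: 0 = defect, 1 = normal
    a = np.sum(~actual_defect & ~pred_defect)
    b = np.sum(~actual_defect & pred_defect)
    c = np.sum(actual_defect & ~pred_defect)
    d = np.sum(actual_defect & pred_defect)
    return a, b, c, d

def acp(threshold, output, truth):
    """Accuracy performance function f(x) = 1 - AC(x), Eq. (8)."""
    t = float(np.asarray(threshold).ravel()[0])
    a, b, c, d = confusion_terms(output, truth, t)
    return 1.0 - (a + d) / (a + b + c + d)

def select_threshold(output, truth, x0=0.5):
    """Minimize the performance function with a Nelder-Mead simplex search."""
    return float(fmin(acp, x0, args=(output, truth), disp=False)[0])
```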
3 Experimental Apparatus and Experiment Description The experimental setup includes the following components: motion controlled rotating drum to display the fabric, illumination source, line-scan camera, frame grabber (Matrox Meteor II/Dig) and a personal computer. The rotating drum is used to display the fabric, similarly to what happens on a manual inspection machine or a loom. The objective is to have a setup capable of simulating the real inspection process. Figure 4 displays a photograph of the apparatus developed in INETI for this project. A detailed description of the process of determining parameters for threshold algorithm, Sec. 2.3, is present in [14]. For the FRP method to determine the threshold we used w = 0.975 which has shown to produce good results. The first step was to study the performance of the classification systems by comparing the results using the same global parameters. This was done by acquiring fabric images and processing them off-line. The second step was to test the algorithms in real-time, to this purpose a careful study of times was made, both for processing and acquisition. The flow diagram of the
Intelligent Real-Time Fabric Defect Detection
1303
Fig. 4. Experimental apparatus
algorithm in real-time is present in Fig. 5. We have to ensure that processing is faster than acquisition, otherwise there will be gaps between two consecutive acquisitions [19].
Fig. 5. Flow diagram for the algorithm in real-time
The computer used in the experiment is a Pentium 4, 2.67 GHz with 256 MB of RAM memory, running Windows XP.
4 Classification System Comparison Table 2 displays the results for the classification systems. These results were obtained off-line during the training process, in order to obtain the best parameters for each classification system. For the NN, five neurons in the input layer, five neurons in the hidden layer and one neuron in the output layer were obtained. For FM and ANFIS we obtained ten and seven clusters, respectively. To help visualization and interpretation of these results, we present in Fig. 6 the output images of the algorithms for fabric 1.
5 Online Results Each classification system was in turn tested on-line for the same group of defects, with the parameters obtained in the off-line training, see Sec.4. The algorithms were then
Table 2. Classification system performance using the FRP performance function

              Fabric 1                      Fabric 2
CS       AC        TP        TN        AC        TP        TN
NN       0.99800   0.50062   0.99978   0.99553   0.27035   0.99975
FM       0.99785   0.49859   0.99979   0.99541   0.20616   0.99985
ANFIS    0.99791   0.46269   0.99976   0.99550   0.24950   0.99980
(a) Fabric Image  (b) NN  (c) FM  (d) ANFIS
Fig. 6. Example output for each experimented classification system using the accuracy performance function (off-line)
validated for a different defect in the same fabric. Two parameters are evaluated: the time the algorithm takes to process the image, and the classification results. On-line, each captured frame is 512 × 1024 pixels. Table 3 displays the obtained frame rates for the different classification systems, and Table 4 presents the results of the on-line processing of the phases depicted in Fig. 1. From Tab. 3 it is seen that the neural network was, by far, the fastest classification system, about four times faster than FM and ten times faster than ANFIS. It should be
Table 3. Results for on-line experiment
                                            NN       FM       ANFIS
Frame Rate (frame/s)                        4.18     1.22     0.47
Feature Extraction & Classification (ms/f)  185.1    762.8    2052.3
Averaging (ms/f)                            42.7     44.8     42.5
Threshold (ms/f)                            0.2      0.2      0.2
Table 4. Results for on-line experiment

CS      AC       TP       FP       TN       FN
NN      0.9987   0.7828   0.0007   0.9993   0.2172
FM      0.9982   0.7162   0.0005   0.9995   0.4746
ANFIS   0.9986   0.7955   0.0008   0.9992   0.2045
(a) Fabric Image for NN  (b) Fabric Image for FM  (c) Fabric Image for ANFIS  (d) NN  (e) FM  (f) ANFIS
Fig. 7. Example output for each experimented classification system using the accuracy performance function (on-line)
possible to reduce times for the fuzzy systems by decreasing the number of clusters, with the drawback of a possible decrease in accuracy (AC).
The classification results for the on-line experiment, Tab. 4 and Fig. 7, showed lower accuracy than expected from the off-line results. In the time between the image acquisition for off-line training and the on-line tests, slight variations in the camera position and lighting conditions deteriorated the results. This shows that the algorithm is sensitive to these factors. The different classification systems obtained similar results for the same group of defects. The worst classification system was FM; this result may indicate that the system is over-trained for this on-line application.
6 Conclusions

In this work, real-time fabric defect detection based on neural networks, on fuzzy modeling using product-space fuzzy clustering, and on the adaptive network-based fuzzy inference system (ANFIS) was presented and compared. The output of these methods was thresholded with a novel confusion-matrix-based threshold method. The effectiveness of all the methods has been demonstrated with examples and with real-time experiments on an experimental apparatus. The real-time implementation confirmed the applicability of these methods in such cases, but highlighted their sensitivity to variations in the ambient conditions. By balancing the algorithm's parameters we are able to control the processing load without degrading performance to an unacceptable level; this allows inspection rates of about 16 cm/s. As future work, the proposed approach for fabric defect detection should be optimized for speed and its robustness tested with fabrics with different patterns. Further research will be devoted to on-line modeling of the classification systems.

Acknowledgments. This work is partially supported by the "Programa de Financiamento Plurianual de Unidades de I&D (POCTI), do Quadro Comunitário de Apoio III" by program FEDER and by the FCT project POSC/EIA/60434/2004 (CLIMA), Ministério do Ensino Superior da Ciência e Tecnologia, Portugal. The authors also thank "A Penteadora" for making available the fabrics used in this research.
References

1. Mitropoulos, P., Koulamas, C., Stojanovic, R., Koubias, S., Papadopoulos, G., Karagiannis, G.: A real-time vision system for defect detection and neural classification of web textile fabric. In: Proceedings of the SPIE Electronic Imaging, Machine Vision Applications in Industrial Inspection VII, vol. 3652, pp. 59–69 (1999)
2. Chan, C., Pang, G.K.H.: Fabric defect detection by Fourier analysis. IEEE Transactions on Industry Applications 36(5), 1267–1276 (2000)
3. Tuceryan, M., Jain, A.K.: Texture analysis. In: Handbook of Pattern Recognition & Computer Vision, pp. 235–276. World Scientific Publishing Co., Inc., River Edge, NJ (1993)
4. Stojanovic, R., Mitropulos, P.: Automated detection and neural classification of local defects in textile web. In: IPA'99, Proceedings of the IEE 7th International Conference on Image Processing and its Applications (Conf. Publ. No. 465), Manchester, 1999, vol. 2, pp. 647–651 (1999)
5. Karayiannis, G., Stojanovic, R., Mitropoulos, P., Koulamas, C., Stouraitis, T., Koubias, S., Papadopoulos, G.: Defect detection and classification on web textile fabric using multiresolution decomposition and neural networks. In: Proceedings of the 6th IEEE International Conference on Electronics, Circuits and Systems, pp. 765–768. IEEE Computer Society Press, Los Alamitos (1999)
6. Stojanovic, R., Mitropulos, P., Koulamas, C., Karayiannis, Y., Koubias, S., Papadopoulos, G.: Real-time vision-based system for textile fabric inspection. Real-Time Imaging 7(6), 507–518 (2001)
7. Kumar, A., Pang, G.K.H.: Defect detection in textured materials using Gabor filters. IEEE Transactions on Industry Applications 38(2), 425–440 (2002)
8. Kumar, A., Pang, G.K.H.: Defect detection in textured materials using optimized filters. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics 35(5), 553–570 (2002)
9. Randen, T., Husøy, J.H.: Filtering for texture classification: A comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(4), 291–310 (1999)
10. Sobral, J.L.: Leather inspection based on wavelets. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3523, p. 682. Springer, Heidelberg (2005)
11. Kumar, A.: Neural network based detection of local textile defects. Pattern Recognition 36(7), 1645–1659 (2003)
12. Sari-Sarraf, H., Goddard, J.S.: Vision system for on-loom fabric inspection. IEEE Transactions on Industry Applications 35(6), 1252–1259 (1999)
13. Karras, D.A., Karkanis, S.A., Mertzios, B.G.: Supervised and unsupervised neural network methods applied to textile quality control based on improved wavelet feature extraction techniques. International Journal of Computer Mathematics 67(1), 169–181 (1998)
14. Castilho, H.P., Pinto, J.R.C., Serafim, A.L.: N.N. automated defect detection based on optimized thresholding. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4142, pp. 790–801. Springer, Heidelberg (2006)
15. Gonçalves, P.S., Paris, A., Christo, C., Sousa, J., Pinto, J.C.: Uncalibrated visual servoing in 3D workspace. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4142, pp. 225–236. Springer, Heidelberg (2006)
16. Sousa, J., Kaymak, U.: Fuzzy Decision Making in Modeling and Control. World Scientific Pub. Co., Singapore (2002)
17. Gustafson, D.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix. In: Proceedings IEEE CDC, San Diego, pp. 761–766. IEEE Computer Society Press, Los Alamitos (1979)
18. Jang, J.-S.R.: ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics 23(3), 665–685 (1993)
19. Matrox: Matrox Imaging Library User Guide, 6.1 edn. (2000)
Author Index
Abate, Andrea 784 Abdallah, Emad E. 772 Abdel-Dayem, Amr R. 935 Abdelsalam, Wegdan 411 Ahmed, Maher 411 Alajlan, Naif 436 Alexandrescu, Ionut 949 Ali, M.A. 581 Androutsos, Dimitrios 674 ´ Angel, L. 295 Aoki, Takafumi 210 Aubin, Carl-Eric 1017 Azimifar, Zohreh 923 Azzari, Pietro 501 Backer, Steve De 58 Barron, John 513 Barry, M. 794 Basir, Otman 650 Batista, Jorge P. 839, 1117 Bauckhage, Christian 352 Belkasim, Saeid 650 Bentabet, Layachi 526 Beny´ o, Zolt´ an 866 Bergevin, Robert 222 Bevilacqua, Alessandro 501 Bhattacharya, Prabir 772, 854 Bobier, Bruce 1206 Boostani, Reza 923 Boufama, B. 581 Bougleux, S´ebastien 46 Bouguila, Nizar 330 Boutemedjet, Sabri 330 Buessler, Jean-Luc 1005 Campilho, Aur´elio 878 Cao, Chunguang 176 Carelli, R. 295 Castilho, Hugo Peres 1297 Castro, Pedro 1194 Chae, Oksam 318 Chai, Peiqi 375, 715 Chang, Ran 638 Chao, Yuyan 131
Chau, Siu-Cheung 411 Chen, Wufan 972, 982 Chen, Yang 972 Ch´eriet, Farida 166, 187, 889, 949, 1017, 1028, 1040 Cheriet, Mohamed 234, 1028 Chevrefils, Claudia 1017 Chin, Francis Y-L. 626 Choi, Minseok 740 Choi, Sukwon 548 Chokalingam, Prakash 1128 Chow, K.P. 626 Chung, Ronald H-Y. 626 Conant-Pablos, Santiago E. 911 Corrales, Jos´e Antonio 93 Correia, Miguel Velhote 254 Coudray, Nicolas 1005 Couloigner, Isabelle 1093 Courchesne, O. 1040 Cui, Xia 375, 715 da Silva, Jo˜ ao Marcelo Monte Delorme, S´ebastien 949 Demirel, Hasan 831 Deschˆenes, Fran¸cois 1 Dewan, M. Ali Akber 318 Dias, Paulo 1254 Dom´ınguez, S. 1148 Dompierre, J. 1040 Dor´e, Vincent 1028 Dornaika, Fadi 469 Drar´eni, Jamil 558 Duijster, Arno 58 Duong, Luc 1028
1217
Ebrahim, Yasser 411 Ebrahimi, Mehran 117 Eglin, V´eronique 1276 El-ghazal, Akrem 650 El-Sakka, Mahmoud R. 935, 961 Elmoataz, Abderrahim 46 Emptoz, Hubert 1276 Faisal, Mohammad Fallavollita, Pascal
513 889
Jeong, Jechang 481, 740 Ji, Rongrong 615 Jiang, J. 593 Junior, Geraldo Braz 899
Fazel-Rezai, Reza 105 Feng, Qianjin 972, 982 Feng, Yanqiu 972 Ferchichi, Seifeddine 364 Ferreira, Ant´ onio Mota 254 Fieguth, Paul 198 Filiatrault, Alexandre 222 Filipovych, Roman 81 Freeman, G.H. 24, 993 Fu, Bin 1242 Fuentes, Olac 1138 Gaceb, Djamel 1276 Garc´ıa, D. 1148 Garro, Beatriz A. 818 Gautama, Sidharta 1061 Ghassabeh, Youness Aliyari 399 Gomes e Silva, Andr´e R. 1229 Gon¸calves, Jos´e A. 1265 Gon¸calves, Paulo Jorge Sequeira 1297 Granger, E. 794 Greenspan, Michael 271 Grenander, Ulf 143 Grimard, Guy 1017 Grira, Sofiane 364 Guibault, F. 1040 Hafiane, Adel 387 Hamou, Ali K. 961 Hamza, A. Ben 772 He, Hong 38 He, Lifeng 131 Hern´ andez-Cisneros, Rolando R. Hong, Youpyo 698 Hor´e, Alain 1 Hossain, M. Julius 318 Hu, Shushu 263 Huang, Zhiyong 972 Hughes, Charles E. 1160 Hurtut, Thomas 187 Hwang, Wen-Jyi 1105 Iakovidis, Dimitris K. Idoumghar, L. 460 Ishii, Hideaki 210 Ito, Koichi 210 Jang, Jong Whan 307 Jeon, Gwanggil 481
246, 1052
911
Kang, Moon Gi 491 Karkanis, Stavros 1052 Katebi, Serajodin 923 Keramidas, Eystratios G. 1052 Kerre, Etienne E. 12 Kihl, Hubert 1005 Kim, Cheong-Ghil 708 Kim, Daijin 538, 548, 569 Kim, Intaek 1289 Kim, Jongho 740 Kim, Jungjun 481 Kim, Moon S. 1289 Kim, Sang Beom 698 Kim, Shin-Dug 708 Kim, Shin-Hyoung 307 Kim, Sung Min 698 Kinsner, Witold 105 Kobayashi, Koji 210 Kolesnikov, Alexander 761 Kpalma, Kidiyo 423 Kutics, Andrea 686 Lebourgeois, Frank 1276 Lee, In-Jik 708 Lee, Sangjae 538 Leszczy´ nski, Mariusz 342 Lezoray, Olivier 46 Li, Hui-Ya 1105 Li, J. 24, 993 Li, Jue 375 Li, Rongfeng 1242 Li, Wenxin 1242 Li, Yu 263 Lins, Rafael Dueire 1217, 1229 Lombaert, Herve 166 Lopez, Luis David 1138 Lorenzo-Ginori, Juan Valent´ın 157 Lu, Zhentai 982 Luong, Hiˆep 69 Luszczkiewicz, Maria 662 Majumdar, Angshul 806 Manay, Siddharth 447 Mandal, Tanaya 806 Mar¸cal, Andr´e R.S. 1265
Marcos, João 1117 Maroulis, Dimitris 246, 1052 Martins, Leonardo de Oliveira 899 Mavromatis, Sébastien 1254 Mayer, Gregory S. 728 Medeiros, Fátima N.S. de 1172 Mélange, Tom 12 Melkemi, M. 460 Mendonça, Ana Maria 878 Moayedi, Fatemeh 923 Moghaddam, Hamid Abrishami 399 Mohamed, S.S. 24, 993 Mohebi, Azadeh 198 Monteiro, Gonçalo 1117 Muñiz, Rubén 93 Nachtegael, Mike 12 Nagashima, Sei 210 Nakagawa, Akihiko 686 Nappi, Michele 784 Newman, Timothy S. 176 Ni, Zhicheng 715 Ochoa, Daniel 1061 Osman, Said 961 Paglieroni, David W. 447 Paiva, Anselmo Cardoso de 899 Palenichka, Roman M. 1082 Paluri, Balamanohar 1128 Pereira, Carlos S. 878 Pereira e Silva, Gabriel 1229 Pernkopf, Franz 602 Phan, Raymond 674 Philips, Wilfried 69 Pinto, João Rogério Caldas 1194, 1297 Portman, Nataliya 143 Pradeep, Nalin 1128 Qi, Xiaojun 638
Ramalho, Geraldo L.B. 1172 Raman, Balasubramanian 1128 Ribeiro, Eraldo 81 Ribeiro, Miguel 1117 Riccio, Daniel 784 Rivest-H´enault, David 234 Roberti, F. 295 Rodriguez, Jaime 1071 Rodr´ıguez, Luis 1182
Romero, Ver´ onica 1182 Ronsin, Joseph 423 Roy, Kaushik 854 Roy, S´ebastien 558 Rueda, Luis 1071 Ryu, Wooju 569 Salama, M.M.A. 24, 993 S´ anchez, F.M. 1148 Sappa, Angel D. 469 Savelonas, Michalis A. 246 Scheunders, Paul 58 Schulte, Stefan 12 Sebasti´ an, J.M. 295, 1148 Seetharaman, Guna 387 Sequeira, Jean 1254 Serafim, Ant´ onio Limas 1297 Shah, Hitesh 1128 Shi, Pengcheng 972, 982 Shi, Yun Q. 375, 715 Shindoh, Kazuhiro 686 Silva, Arist´ ofanes Corrˆea 899 Silva, Erick Corrˆea da 899 Skarbek, Wladyslaw 342 Smolka, Bogdan 662 Sohn, Young Wook 491 Sossa, Humberto 818 Soyel, Hamit 831 Sun, Yi 526 Sun, Yiyong 166 Sung, Jaewon 538 Suzuki, Kenji 131 Szil´ agyi, L´ aszl´ o 866 Szil´ agyi, S´ andor M. 866 Tan, Xi 38, 752 Terashima-Mar´ın, Hugo 911 Tong, Xuefeng 715 Toselli, Alejandro H. 1182 Traslosheros, A. 295, 1148 Tsang, Kenneth S-H. 626 Urban, Jean-Philippe Uyarte, Omar 1071
1005
Valenzuela, Sofia 1071 V´ azquez, Roberto A. 818 Vidal, Enrique 1182 Vintimilla, Boris 1061 Vrscay, Edward R. 117, 143, 728
Wang, Cuilan 176 Wang, Guanghui 285 Wang, Jicheng 615 Wang, Qingqi 972 Wang, Shengrui 364 Wei, Yi 263 Wirth, Michael 1206 Witte, Val´erie De 12 Won, Chee Sun 698 Wong, Kwan-Yee K. 626 Wu, Minghui 1242 Wu, Q.M. Jonathan 285, 806 Xiao, G. 593 Xu, Chengzhe 1289 Xu, Peifei 615
Xu, Zhuoqun 1242 Xuan, Guorong 375, 715 Yao, Hongxun 615 Yeh, Yao-Jung 1105 Yuk, Jacky S-C. 626 Zaremba, Marek B. 1082 Zavidovique, Bertrand 387 Zhang, B. 593 Zhang, Minghui 982 Zhang, Qiaoping 1093 Zhang, Shawn 271 Zhang, Yunjun 1160 Zhang, Zhen 615 Zhu, Xiuming 375 Ziou, Djemel 1, 330