Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
6134
Abderrahim Elmoataz Olivier Lezoray Fathallah Nouboud Driss Mammass Jean Meunier (Eds.)
Image and Signal Processing 4th International Conference, ICISP 2010 Trois-Rivières, QC, Canada, June 30-July 2, 2010 Proceedings
Volume Editors

Abderrahim Elmoataz
Olivier Lezoray
Université de Caen Basse-Normandie
GREYC UMR CNRS 6072, ENSICAEN, 14050 Caen, France
E-mail: {abderrahim.elmoataz-billah, olivier.lezoray}@unicaen.fr

Fathallah Nouboud
Université du Québec à Trois-Rivières
Département de Mathématiques et d'Informatique
G9A 5H7 Trois-Rivières, QC, Canada
E-mail: [email protected]

Driss Mammass
Université Ibn Zohr, Faculté des Sciences, Agadir, Morocco
E-mail: [email protected]

Jean Meunier
Université de Montréal
Département d'Informatique et de Recherche Opérationnelle
Montréal, QC, H3C 3J7, Canada
E-mail: [email protected]

Library of Congress Control Number: 2010928191
CR Subject Classification (1998): I.4, I.5, H.3, F.1, I.2.10, C.3
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN: 0302-9743
ISBN-10: 3-642-13680-X Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-13680-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Preface
ICISP 2010, the International Conference on Image and Signal Processing, was the fourth ICISP conference and was held in Trois-Rivières, Québec, Canada. Historically, ICISP is a conference resulting from the actions of researchers in Canada, France and Morocco. Previous editions of ICISP were held in Cherbourg-Octeville (France), Agadir (Morocco), and Agadir (Morocco) in 2008, 2003 and 2001, respectively. ICISP 2010 was sponsored by EURASIP (European Association for Signal Processing) and IAPR (International Association for Pattern Recognition).

The response to the call for papers for ICISP 2010 was encouraging. From 165 full papers submitted, 69 were finally accepted (54 oral presentations and 15 posters). The review process was carried out by the Program Committee members, all experts in various image and signal processing areas. Each paper was reviewed by at least two reviewers and also checked by the conference co-chairs. The quality of the papers in these proceedings is attributed first to the authors, and second to the quality of the reviews provided by the experts. We would like to thank the authors for responding to our call, and we thank the reviewers for their excellent work.

We were very pleased to be able to include in the conference program keynote talks by three world-renowned experts: Yann LeCun, Silver Professor of Computer Science and Neural Science, The Courant Institute of Mathematical Sciences and Center for Neural Science, New York University (USA); Theo Gevers, Associate Professor, Faculty of Science, University of Amsterdam (The Netherlands); and Leo Grady, Senior Member of Technical Staff with Siemens Corporate Research in Princeton, New Jersey (USA).

We would like to thank the members of the Local Organizing Committee for their advice and help. We would also like to thank Olivier Lezoray for preparing the proceedings, and the remote sites of the Université de Caen Basse-Normandie located in the Manche for their support. We are grateful to Springer's editorial staff for supporting this publication in the LNCS series. Finally, we were very pleased to welcome all the participants to this conference. For those who did not attend, we hope this publication provides a good view into the research presented at the conference, and we look forward to meeting you at the next ICISP conference.

April 2010
Abderrahim Elmoataz Olivier Lezoray Fathallah Nouboud Driss Mammass Jean Meunier
ICISP 2010 – International Conference on Image and Signal Processing
General Chair
Fathallah Nouboud, Université du Québec à Trois-Rivières, Québec, Canada
[email protected]

General Co-chair
Abderrahim Elmoataz, Université de Caen Basse-Normandie, France
[email protected]

Program Committee Chair
Olivier Lezoray, Université de Caen Basse-Normandie, France
[email protected]

Program Committee Co-chair
Driss Mammass, Université Ibn Zohr, Morocco
[email protected]

Program Committee Co-chair
Jean Meunier, Université de Montréal, Québec, Canada
[email protected]

Local Committee Chair
Alain Chalifour, Université du Québec à Trois-Rivières, Québec, Canada
[email protected]

Website
Ara Studio, Trois-Rivières, Québec, Canada
International Association Sponsors
International Association for Pattern Recognition (IAPR)
European Association for Signal Processing (EURASIP)

Sponsoring Institutions
Université du Québec à Trois-Rivières, Québec, Canada
Université de Caen Basse-Normandie, France
Université de Caen Basse-Normandie, sites délocalisés de la Manche, France
Université Ibn Zohr, Morocco
Centre de recherches mathématiques, Montréal, Québec, Canada
Local Organizing Committee
Alain Chalifour - Université du Québec à Trois-Rivières, Québec, Canada
Linda Badri - Université du Québec à Trois-Rivières, Québec, Canada
Mourad Badri - Université du Québec à Trois-Rivières, Québec, Canada
Program Committee
Ennaji Abdel - Université de Rouen, France
Driss Aboutajdine - Université Mohamed V, Morocco
Antonis Argyros - University of Crete and FORTH-ICS, Greece
Sebastiano Battiato - University of Catania, Italy
Abdel Belaid - Université de Vand.-Les-Nancy, France
Mostafa Bellafkih - INPT-Rabat, Morocco
Charles Beumier - Royal Military Academy, Belgium
Guillaume-Alexandre Bilodeau - École Polytechnique de Montréal, Canada
Diego Borro - CEIT and Tecnun, Spain
Adrian Bors - University of York, Great Britain
Alain Boucher - The Francophone Institute for Computer Science, Vietnam
Sébastien Bougleux - Université de Caen Basse-Normandie, France
Alexandra Branzan - University of Victoria, Canada
Luc Brun - ENSICAEN, France
Gustavo Carneiro - Technical University of Lisbon, Portugal
Emre Celebi - Louisiana State University, USA
Alain Chalifour - Université du Québec à Trois-Rivières, Canada
Christophe Charrier - Université de Caen Basse-Normandie, France
Xiaochun Cheng - Middlesex University, UK
Ronald Chung - The Chinese University of Hong Kong, China
Laurent Cohen - Ceremade, France
Tomeu Coll - Universitat de les Illes Balears, Spain
Jose Crespo - Universidad Politécnica de Madrid, Spain
Kevin Curran - University of Ulster, UK
Jérôme Darbon - UCLA, USA
Marleen de Bruijne - University of Copenhagen, Denmark
Farzin Deravi - University of Kent, UK
Francois Deschenes - Université du Québec à Rimouski, Canada
Laurent Duval - IFP, France
Ayman El-Baz - University of Louisville, USA
Abderrahim Elmoataz - Université de Caen Basse-Normandie, France
Adrian Evans - University of Bath, Great Britain
Christine Fernandez-Maloigne - Université de Poitiers, France
Andrea Fusiello - Università di Verona, Italy
Antonios Gasteratos - Democritus University of Thrace, Greece
Basilios Gatos - National Center for Scientific Research, Greece
Abel Gomes - University of Beira Interior, Portugal
Michael Greenspan - Queen's University, Canada
Metin Gurcan - OSU Columbus, USA
Rachid Harba - École Polytechnique de l'Université d'Orléans, France
Stéphanie Jehan-Besson - Laboratoire LIMOS CNRS, France
Xiaoyi Jiang - University of Münster, Germany
Pierre-Marc Jodoin - Université de Sherbrooke, Canada
Zoltan Kato - University of Szeged, Hungary
Mohamed Lamine Kherfi - Université du Québec à Trois-Rivières, Canada
Dimitrios Kosmopoulos - NCSR Demokritos, Greece
Michal Kozubek - Masaryk University, Czech Republic
Zakaria Lakhdari - Université de Caen, France
Denis Laurendeau - Université Laval, Canada
Sébastien Lefevre - Université de Strasbourg, France
Reiner Lenz - ITN, Sweden
Olivier Lézoray - Université de Caen Basse-Normandie, France
Xuelong Li - University of London, UK
Chen Liming - École Centrale de Lyon, France
Rémy Malgouyres - Université de Clermont-Ferrand, France
Driss Mammass - Université Ibn Zohr, Morocco
Alamin Mansouri - Université de Bourgogne, France
Franck Marzani - Université de Bourgogne, France
Brendan McCane - University of Otago, New Zealand
Mahmoud Melkemi - Université Haute Alsace, France
Jean Meunier - Université de Montréal, Canada
Francois Meunier - Université du Québec à Trois-Rivières, Canada
Cyril Meurie - Université de Technologie de Belfort-Montbéliard, France
Max Mignotte - Université de Montréal, Canada
Amar Mitiche - INRS-Énergie, Matériaux et Télécommunications, Canada
Philippos Mordohai - Stevens Institute of Technology, USA
Fathallah Nouboud - Université du Québec à Trois-Rivières, Canada
Yanwei Pang - Tianjin University, China
Bogdan Raducanu - Computer Vision Center, Spain
Eraldo Ribeiro - Florida Institute of Technology, USA
Christophe Rosenberger - ENSICAEN, France
Gerald Schaefer - Loughborough University, UK
Sophie Schupp - Université de Caen Basse-Normandie, France
Lik-Kwan Shark - University of Central Lancashire, UK
Jialie Shen - Singapore Management University, Singapore
Bogdan Smolka - Silesian University of Technology, Poland
Jean-Luc Starck - CEA, France
Andrea Torsello - University of Venice, Italy
Eiji Uchino - Yamaguchi University, Japan
Yvon Voisin - Université de Bourgogne, France
Liang Wang - University of Melbourne, Australia
Djemel Ziou - Université de Sherbrooke, Canada
Table of Contents
Image Filtering and Coding

Facial Parts-Based Face Hallucination Method ..... 1
   Kaori Kataoka, Shingo Ando, Akira Suzuki, and Hideki Koike
Geometric Image Registration under Locally Variant Illuminations Using Huber M-estimator ..... 10
   M.M. Fouad, R.M. Dansereau, and A.D. Whitehead
Morphological Sharpening and Denoising Using a Novel Shock Filter Model ..... 19
   Cosmin Ludusan, Olivier Lavialle, Romulus Terebes, and Monica Borda
Resolution Scalable Image Coding with Dyadic Complementary Rational Wavelet Transforms ..... 28
   F. Petngang, R.M. Dansereau, and C. Joslin
LMMSE-Based Image Denoising in Nonsubsampled Contourlet Transform Domain ..... 36
   Md. Foisal Hossain, Mohammad Reza Alsharif, and Katsumi Yamashita
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic ..... 44
   Jamal Saeedi and Ali Abedi
An Adaptive Multiresolution-Based Multispectral Image Compression Method ..... 54
   Jonathan Delcourt, Alamin Mansouri, Tadeusz Sliwa, and Yvon Voisin
Zoom Based Super-Resolution: A Fast Approach Using Particle Swarm Optimization ..... 63
   Prakash Gajjar and Manjunath Joshi
Real Time Human Visual System Based Framework for Image Fusion ..... 71
   Gaurav Bhatnagar, Q.M. Jonathan Wu, and Balasubramanian Raman
Color VQ-Based Image Compression by Manifold Learning ..... 79
   Christophe Charrier and Olivier Lézoray
Total Variation Minimization with Separable Sensing Operator ..... 86
   Serge L. Shishkin, Hongcheng Wang, and Gregory S. Hagen

Pattern Recognition

Circle Location by Moments ..... 94
   Vivian Cho and Wm. Douglas Withers
Remote Sensing Image Registration Techniques: A Survey ..... 103
   Suma Dawn, Vikas Saxena, and Bhudev Sharma
Assessment of the Artificial Habitat in Shrimp Aquaculture Using Environmental Pattern Classification ..... 113
   José Juan Carbajal Hernández, Luis Pastor Sánchez Fernández, and Marco Antonio Moreno Ibarra
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus ..... 122
   Yousef Ajami Alotaibi, Mansour Alghamdi, and Fahad Alotaiby
Statistical and Neural Classifiers: Application for Singer and Music Discrimination in Polyphonic Music Context ..... 130
   Hassan Ezzaidi and Mohammed Bahoura
An Adaptive Method Using Genetic Fuzzy System to Evaluate Suspended Particulates Matters SPM from Landsat and Modis Data ..... 137
   Bahia Lounis, Sofiane Rabia, Adlene Ramoul, and Aichouche Belhadj Aissa
An Efficient Combination of Texture and Color Information for Watershed Segmentation ..... 147
   Cyril Meurie, Andrea Cohen, and Yassine Ruichek
New Approach Based on Texture and Geometric Features for Text Detection ..... 157
   Hinde Anoual, Sanaa El Fkihi, Abdelilah Jilbab, and Driss Aboutajdine
Segmentation of Images of Lead Free Solder ..... 165
   Matthias Scheller Lichtenauer, Silvania Avelar, and Grzegorz Toporek
Grassland Species Characterization for Plant Family Discrimination by Image Processing ..... 173
   Mohamed Abadi, Anne-Sophie Capelle-Laizé, Majdi Khoudeir, Didier Combes, and Serge Carré

Biometry

Efficient Person Identification by Fusion of Multiple Palmprint Representations ..... 182
   Abdallah Meraoumia, Salim Chitroub, and Ahmed Bouridane
A Cry-Based Babies Identification System ..... 192
   Ali Messaoud and Chakib Tadj
Comparative Testing of Face Detection Algorithms ..... 200
   Nikolay Degtyarev and Oleg Seredin
On Supporting Identification in a Hand-Based Biometric Framework ..... 210
   Pei-Fang Guo, Prabir Bhattacharya, and Nawwaf Kharma
Selecting Optimal Orientations of Gabor Wavelet Filters for Facial Image Analysis ..... 218
   Tianqi Zhang and Bao-Liang Lu
Human Detection with a Multi-sensors Stereovision System ..... 228
   Y. Benezeth, P.M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger

Signal Processing

Efficient Time and Frequency Methods for Sampling Filter Functions ..... 236
   Fadel M. Adib and Hazem M. Hajj
The Generalized Likelihood Ratio Test and the Sparse Representations Approach ..... 245
   Jean Jacques Fuchs
Frequency Tracking with Spline Based Chirp Atoms ..... 254
   Matthias Wacker, Miroslav Galicki, and Herbert Witte
Improving Performance of a Noise Reduction Algorithm by Switching the Analysis Filter Bank ..... 262
   Hamid Sepehr, Amir Y. Nooralahiyan, and Paul V. Brennan
On the Choice of Filter Bank Parameters for Wavelet-Packet Identification of Dynamic Systems ..... 272
   Henrique Mohallem Paiva and Roberto Kawakami Harrop Galvão
A Signal Processing Algorithm Based on Parametric Dynamic Programming ..... 280
   Andrey Kopylov, Olga Krasotkina, Oleksandr Pryimak, and Vadim Mottl
Determining Dominant Frequency with Data-Adaptive Windows ..... 287
   Gagan Mirchandani and Shruti Sharma
Non-linear EEG Analysis of Idiopathic Hypersomnia ..... 297
   Tarik Al-ani, Xavier Drouot, and Thi Thuy Trang Huynh
Classification of Similar Impact Sounds ..... 307
   Sofia Cavaco and José Rodeia

Video Coding and Processing

High Dimensional versus Low Dimensional Chaos in MPEG-7 Feature Binding for Object Classification ..... 315
   Hanif Azhar and Aishy Amer
Compensating for Motion Estimation Inaccuracies in DVC ..... 324
   Jürgen Slowack, Jozef Škorupa, Stefaan Mys, Nikos Deligiannis, Peter Lambert, Adrian Munteanu, and Rik Van de Walle
Multi-sensor Fire Detection by Fusing Visual and Non-visual Flame Features ..... 333
   Steven Verstockt, Alexander Vanoosthuyse, Sofie Van Hoecke, Peter Lambert, and Rik Van de Walle
Turbo Code Using Adaptive Puncturing for Pixel Domain Distributed Video Coding ..... 342
   Mohamed Haj Taieb, Jean-Yves Chouinard, Demin Wang, and Khaled Loukhaoukha
Error-Resilient and Error Concealment 3-D SPIHT Video Coding with Added Redundancy ..... 351
   Jie Zhu, R.M. Dansereau, and Aysegul Cuhadar
Video Quality Prediction Using a 3D Dual-Tree Complex Wavelet Structural Similarity Index ..... 359
   K. Yonis and R.M. Dansereau
An Efficient 3D Head Pose Inference from Videos ..... 368
   Mohamed Dahmane and Jean Meunier
Fall Detection Using Body Volume Reconstruction and Vertical Repartition Analysis ..... 376
   Edouard Auvinet, Franck Multon, Alain St-Arnaud, Jacqueline Rousseau, and Jean Meunier

Watermarking and Document Processing

A Subsampling and Interpolation Technique for Reversible Histogram Shift Data Hiding ..... 384
   Yih-Chuan Lin, Tzung-Shian Li, Yao-Tang Chang, Chuen-Ching Wang, and Wen-Tzu Chen
Multi-Objective Genetic Algorithm Optimization for Image Watermarking Based on Singular Value Decomposition and Lifting Wavelet Transform ..... 394
   Khaled Loukhaoukha, Jean-Yves Chouinard, and Mohamed Haj Taieb
Image Hiding by Using Genetic Algorithm and LSB Substitution ..... 404
   Masoumeh Khodaei and Karim Faez
Tone Recognition of Isolated Mandarin Syllables ..... 412
   Zhaoqiang Xie and Zhenjiang Miao
A Semantic Proximity Based System of Arabic Text Indexation ..... 419
   Taher Zaki, Driss Mammass, and Abdellatif Ennaji
Classification of Multi-structured Documents: A Comparison Based on Media Image ..... 428
   Ali Idarrou, Driss Mammass, Chantal Soulé Dupuy, and Nathalie Valles-Parlangeau

Computer Vision

Human Action Recognition Using Key Points Displacement ..... 439
   Kuan-Ting Lai, Chaur-Heh Hsieh, Mao-Fu Lai, and Ming-Syan Chen
Shape Skeleton Classification Using Graph and Multi-scale Fractal Dimension ..... 448
   André R. Backes and Odemir M. Bruno
Leaves Shape Classification Using Curvature and Fractal Dimension ..... 456
   João B. Florindo, André R. Backes, and Odemir M. Bruno
Plant Leaf Identification Using Color and Multi-scale Fractal Dimension ..... 463
   André R. Backes and Odemir M. Bruno
A Novel Polychromatic Model for Light Dispersion ..... 471
   Samy Metari and François Deschênes
Content Based Thermal Images Retrieval ..... 479
   Hernán Darío Benítez and Gloria Inés Alvarez
People Reacquisition Across Multiple Cameras with Disjoint Views ..... 488
   D.-N. Truong Cong, L. Khoudour, and C. Achard
Performance Evaluation of Multiresolution Methods in Disparity Estimation ..... 496
   Dibyendu Mukherjee, Gaurav Bhatnagar, and Q.M. Jonathan Wu
3D Head Trajectory Using a Single Camera ..... 505
   Caroline Rougier and Jean Meunier

Biomedical Applications

Artificial Neural Network and Fuzzy Clustering Methods in Segmenting Sputum Color Images for Lung Cancer Diagnosis ..... 513
   Fatma Taher and Rachid Sammouda
A Multi-stage Approach for 3D Teeth Segmentation from Dentition Surfaces ..... 521
   Marcin Grzegorzek, Marina Trierscheid, Dimitri Papoutsis, and Dietrich Paulus
Global versus Hybrid Thresholding for Border Detection in Dermoscopy Images ..... 531
   Rahil Garnavi, Mohammad Aldeen, Sue Finch, and George Varigos
Unsupervised Segmentation for Inflammation Detection in Histopathology Images ..... 541
   Kristine A. Thomas, Matthew J. Sottile, and Carolyn M. Salafia
Scale-Space Representation of Lung HRCT Images for Diffuse Lung Disease Classification ..... 550
   Kiet T. Vo and Arcot Sowmya
Correction of Left Ventricle Strain Signals Estimated from Tagged MR Images ..... 559
   Mina E. Khalil, Ahmed S. Fahmy, and Nael F. Osman
Texture Analysis of Brain CT Scans for ICP Prediction ..... 568
   Wenan Chen, Rebecca Smith, Nooshin Nabizadeh, Kevin Ward, Charles Cockrell, Jonathan Ha, and Kayvan Najarian
Mass Description for Breast Cancer Recognition ..... 576
   Imene Cheikhouhou, Khalifa Djemal, and Hichem Maaref
A New Preprocessing Filter for Digital Mammograms ..... 585
   Peyman Rahmati, Ghassan Hamarneh, Doron Nussbaum, and Andy Adler
Classification of High-Resolution NMR Spectra Based on Complex Wavelet Domain Feature Selection and Kernel-Induced Random Forest ..... 593
   Guangzhe Fan, Zhou Wang, Seoung Bum Kim, and Chivalai Temiyasathit

Author Index ..... 601
Facial Parts-Based Face Hallucination Method

Kaori Kataoka(1), Shingo Ando(2), Akira Suzuki(1), and Hideki Koike(1)

(1) NTT Cyber Space Laboratories, NTT Corporation,
1-1 Hikarinooka, Yokosuka-shi, Kanagawa 239-0847, Japan
(2) Research and Development Center, Nippon Telegraph and Telephone West Corporation,
6-2-82 Shimaya, Konohana-ku, Osaka 554-0024, Japan
E-mail: [email protected]

Abstract. Face hallucination produces high-resolution facial images from low-resolution inputs. In this paper, we propose a facial-parts-based face hallucination method. Since our goal is face recognition rather than face reconstruction, the contour information of facial parts (such as the eyes) is important. This method reconstructs facial parts as entities instead of dividing them into small blocks. We obtain the contours of facial parts by using the Active Appearance Model (AAM), and transform training images based on these contours. We confirm that the proposed method significantly enhances face recognition performance.

Keywords: face hallucination, contour of facial parts, Active Appearance Model.
1 Introduction
Surveillance cameras are becoming more and more common in places such as banks, stores, and parking lots. In most cases, however, the face sizes are very small because the cameras are configured to capture very wide views. The low resolution of these face images is a primary obstacle to effective face identification and recognition. To gain the detailed facial features needed for recognition, it is necessary to process the captured low-resolution image so as to generate a high-resolution face image. This technique, called face hallucination, was first proposed by Baker and Kanade [1].

A number of learning-based image hallucination techniques that generate a high-resolution image from one low-resolution image have been proposed recently. They employ a training database consisting of pairs of high- and low-resolution image samples to output hallucinated high-resolution faces. These methods are based on either global models or the patch approach. Wang and Tang [2] developed an efficient global-model-based face hallucination scheme. They use PCA to express the input face image as a linear combination of low-resolution training face images, and the final high-resolution image is synthesized by replacing the low-resolution training face images with their high-resolution counterparts; the same combination weights are used. Unfortunately, their technique ignores local details to focus on global information, so the resulting images are unclear and lack detailed features.
Chang et al. [3] apply the manifold learning method by assuming that small equivalent patches in the low- and high-resolution images form manifolds with similar local geometry. This method requires fewer training examples than other learning-based hallucination methods because the generation of a high-resolution image patch depends on multiple nearest neighbors. Their approach is similar to LLE for manifold learning and so does not depend on just one of the nearest neighbors in the training set. They represent each low- or high-resolution image as a set of small overlapping image patches. Liu et al. [4] proposed a two-step approach to face super-resolution. First, a global linear model learns a transfer function from low-resolution faces to high-resolution faces; then, a patch-based non-parametric Markov network locally reconstructs high-frequency content. Li and Lin [5] improved this approach by applying PCA to both low- and high-resolution images in the first step and by using a MAP framework instead of the MRF model for finding the optimal local face. Since patch-based methods divide the image into small blocks that are uniformly overlapped, and hallucination is performed on each block, the reconstructed high-resolution image exhibits significant blocking artifacts.

In this paper, we propose a new facial-parts-based method that preserves local details. This method reconstructs each facial part (such as the eyes) as an entity instead of dividing the facial parts into small blocks. To obtain the contours of facial parts, we construct an Active Appearance Model (AAM) [6] for each facial part in the training images, and locate the best-match AAM parameters for each facial part. Next, we transform the training images according to the contours. We confirm that our approach yields significant benefits in face recognition.
2 Active Appearance Model
Our method uses the AAM to obtain the contours of facial parts, so we must first construct an AAM from the training images. An AAM is a statistical model that can describe the variation of shape and appearance with lower dimensionality. A statistical model of shape variation can be generated from a training set of labeled images. The labeled points on a single object describe the shape of that object. Each shape is represented by a vector x which consists of the aligned coordinates of the labeled points. By applying principal component analysis (PCA) to the data, any example can then be approximated using

x = \bar{x} + P_s b_s    (1)

where \bar{x} is the mean shape, P_s is a set of orthogonal modes of variation and b_s is a set of shape parameters. To build a statistical model of object appearance, grey-level information g is sampled from the shape-normalized image by warping, using a triangulation algorithm, each training image so that its labeled points match the mean shape. By applying PCA to the normalized data, we can obtain the following linear model:

g = \bar{g} + P_g b_g    (2)
where \bar{g} is the mean normalized grey-level vector, P_g is a set of orthogonal modes of variation and b_g is a set of grey-level parameters. The shape and appearance of any example can be summarized by the vectors b_s and b_g. For each example we generate the concatenated vector

b = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix} = \begin{pmatrix} W_s P_s^T (x - \bar{x}) \\ P_g^T (g - \bar{g}) \end{pmatrix}    (3)

where W_s is a diagonal matrix of weights for each shape parameter, which handles the difference in units between the shape and grey models. Since there may be correlations between the shape and grey-level variations, we again apply PCA to the data to obtain

b = Q c    (4)

where Q are the eigenvectors and c is a vector of appearance parameters that determines both the shape and grey-level of the model. Next, we synthesize the object image by matching the statistical model of object shape and appearance to the input image. We minimize the difference between the input image and the one synthesized by the AAM. The difference vector \delta I can be defined as

\delta I = I - I_m    (5)

where I is the vector of grey-level values of the input image, and I_m is the vector of grey-level values for the current model parameters. To locate the best match between the model and the input image, we minimize the magnitude of the difference vector, \Delta = |\delta I|^2, by varying the model parameters c. We can then obtain the contour coordinates from the most appropriate parameter c.
3
Facial Parts-Based Face Hallucination Method
Our facial-parts-based method is based on the assumption that the AAM can extract the contours of facial parts. We describe our method below. First, we construct an AAM for each facial part. The procedure is illustrated in Fig. 1.

Fig. 1. Diagram of the facial-part contour extraction
Fig. 2. The diagram of our synthesis method
1. Prepare training sets in which the contours of the facial parts are labeled.
2. Downsize the training images to the input image size and then scale them back up to the original size using the linear interpolation algorithm, to yield blurred images similar to the input image.
3. Build an AAM of each facial part, such as the eyes, from the blurred training images and the contour coordinates.

Next, we locate the best match between the model and the input image to obtain the contour coordinates.

4. Scale up the input image (the low-resolution image) with the linear interpolation algorithm.
5. Locate the parameter c that minimizes the difference vector between the synthesized image and the scaled-up input image.
6. Obtain the contour of the facial part from the best-match parameter c.

Finally, we synthesize the high-resolution image using the obtained contour coordinates. The synthesis procedure is illustrated in Fig. 2.

7. Warp the original high-resolution training images of the facial parts to fit the obtained contour and average the warped training images. This average image is output as the high-resolution facial-part image.
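A minimal sketch of step 7 follows, assuming scikit-image's piecewise-affine warp as a stand-in for the triangulation-based warping used with the AAM; the function and array names are hypothetical, and contours are taken to be (n_points, 2) coordinate arrays.

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def hallucinate_part(hr_images, hr_contours, target_contour, out_shape):
    """Warp each HR training part image onto the estimated contour and
    average the results (step 7 of the procedure)."""
    acc = np.zeros(out_shape, dtype=np.float64)
    for img, src_contour in zip(hr_images, hr_contours):
        tform = PiecewiseAffineTransform()
        # warp() expects the inverse map (output -> input coordinates),
        # so we estimate the transform from the target to the source.
        tform.estimate(target_contour, src_contour)
        acc += warp(img, tform, output_shape=out_shape)
    return acc / len(hr_images)
```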
4 Experiments and Results
We decided to construct one AAM for each facial part to better preserve the diversity in their appearances. Since the eyes are especially important in facial recognition [6], our experiments focused on them.

4.1 Building an AAM
4.1.1 Data Set
Our experiments utilized the Japanese Face Image Database HOIP [8], which contains images of 300 people. From the HOIP frontal facial database, we chose 103 people (three of whom were used as training data) whose eyes were not occluded by glasses, accessories, or hair. The high-resolution face images were scaled to 80×80 pixels, which yields an eye size of about 25×10. The input images were created by blurring the face images (averaging neighboring pixels) and downsizing them to low-resolution images of size 20×20 (eye size about 6×3).
Fig. 3. Training eye images
Fig. 4. Labelled eye image
4.1.2 A Set of Training Images
As the training data, we chose three eye images that exhibit different types of eye contour; this was done to challenge the AAM with a wide variety of eye curves. In this experiment, the eye types reflect the angles of the eye corners, as shown in Fig. 3. Fig. 3(a) shows the curve type wherein angles α and β are almost the same and large. The angles in Fig. 3(b) are small and similar. The angles in Fig. 3(c) are quite different. We expect that these training data represent various eye types well, such as drooping eyes, narrowed eyes, wide-open eyes, and slanting eyes.

4.1.3 A Set of Labeled Points on the Training Images
We manually labeled 12 landmark points on each eye and set triangular patches as shown in Fig. 4.

4.2 Modified Method Using Residue Images
We tried to modify the obtained facial-part images by using residue images, as shown in Fig. 5. Suppose I_l is the low-resolution (LR) input image and I_h is the corresponding obtained high-resolution (HR) image; the low-resolution residue image R_l can then be denoted as R_l = D(I_h) − I_l, where D(·) is the down-sampling function. Given a bias of 128 when displaying R_l, D(I_h) is darker than I_l where the pixel value of R_l is smaller than 128, and D(I_h) is lighter than I_l where the pixel value of R_l is larger than 128. Since the contour line is darker than the skin color, we expect to be able to correct the position of the contour line based on R_l.
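The residue image can be computed as in the sketch below; block averaging is one plausible choice for the down-sampling function D(·), which is not pinned down above, and the function name is our own.

```python
import numpy as np

def residue_image(I_h, I_l, factor):
    """R_l = D(I_h) - I_l, with D(.) taken here to be `factor` x `factor`
    block averaging (an assumption of this sketch)."""
    N, M = I_l.shape
    D_Ih = I_h.reshape(N, factor, M, factor).mean(axis=(1, 3))
    R_l = D_Ih - I_l.astype(np.float64)
    # Displayed with a bias of 128, as in the text: values below 128 mean
    # D(I_h) is darker than I_l; values above 128 mean it is lighter.
    R_display = np.clip(R_l + 128.0, 0, 255).astype(np.uint8)
    return R_l, R_display
```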
Fig. 5. The procedure of modification using the residue image
Fig. 6. Examples of the modified method based on residue image
Fig. 7. An example of a modified eye image. (a) Residue eye image, (b) eye image before modification, (c) eye image after modification, (d) original HR eye image.
We describe the modification method below.

Case 1, outer corner of the eye: If the pixel corresponding to point P0 on R_l is darker than the threshold, as is the pixel adjacent to point P0 that lies outside and normal to the contour, the hallucinated eye is larger than the original, and thus P0 is transformed inward. In association with this transformation, P1 and P11 are also transformed inward. We move P0 four pixels inward and P1 and P11 two pixels inward. If the pixel adjacent to point P0 that lies outside and normal to the contour is lighter than the threshold and point P0 is darker, the hallucinated eye is smaller than the original, and therefore P0 is transformed outward.
Fig. 8. Examples of hallucinated images. (a) Input LR images, (b) cubic B-spline interpolation images, (c) hallucinated HR images by our method, (d) original HR images.
Table 1. Comparison of correlation value against the original image and recognition performance

                     Cubic B-spline   Our method
Correlation value    0.37             0.72
Recognition rate     6/100            49/100
Case 2, inner corner of the eye: If point P6 is darker than the threshold, as is the pixel adjacent to point P6 that lies outside and normal to the contour, the hallucinated eye is larger than the original, and thus P6 is transformed inward as shown in Fig. 6(a). If the pixel adjacent to point P6 that lies outside and normal to the contour is lighter than the threshold and P6 is darker, the hallucinated eye is smaller than the original, and thus P6 is transformed outward.

Case 3, upper curve: If point P3 is darker than the threshold, as is the pixel adjacent to point P3 that lies outside and normal to the contour, the hallucinated eye is larger than the original, and thus P3 is transformed inward as shown in Fig. 6(b). If the pixel adjacent to point P3 that lies outside and normal to the contour is lighter and point P3 is darker than the threshold, the hallucinated eye is smaller than the original, and thus P3 is transformed outward.

Case 4, lower curve: If point P9 is darker than the threshold, as is the pixel adjacent to point P9 that lies outside and normal to the contour, the hallucinated eye is larger than the original, and thus P9 is transformed inward. If the pixel adjacent to point P9 that lies outside and normal to the contour is lighter than the threshold and point P9 is darker, the hallucinated eye is smaller than the original, and thus P9 is transformed outward.

These rules were decided empirically, but the result of this modification, shown in Fig. 7 ((a) depicts a residue eye image, (b) an eye image before modification, (c) an eye image after modification, and (d) the original HR eye image), confirms that they are entirely valid for modifying hallucinated images.
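All four cases share one pattern: compare the residue at a contour point and at its outward neighbour against the threshold, then push the point inward or outward. A sketch of that shared rule for the outer corner (case 1) follows; the threshold value, the display-bias convention, the landmark ordering, and the helper names are assumptions of this illustration.

```python
import numpy as np

def adjust_outer_corner(points, R_display, outward, thresh=128):
    """Case 1 rule: move P0 (index 0, with neighbours P1 and P11) along
    the outward contour normal `outward` (a unit vector), based on the
    bias-128 residue image R_display."""
    def darker(p):
        x, y = np.round(p).astype(int)
        return R_display[y, x] < thresh
    p0 = points[0].copy()
    if darker(p0) and darker(p0 + outward):
        step = -outward            # hallucinated eye too large: move inward
    elif darker(p0) and not darker(p0 + outward):
        step = outward             # hallucinated eye too small: move outward
    else:
        return points              # leave the contour unchanged
    points[0] = p0 + 4 * step      # P0 moves four pixels
    points[1] += 2 * step          # P1 and P11 follow by two pixels
    points[11] += 2 * step
    return points
```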
4.3 Face Recognition Experiments
We conducted a hallucination experiment on 100 eye regions. Some hallucination results are shown in Fig. 8 in comparison with the input LR face images and the cubic B-spline interpolation results. We compared the correlation with the original HR eye region for both the hallucinated images and the interpolation results. Table 1 shows the average correlation over 100 faces. A perfect fit yields a coefficient value of 1.0; thus, the higher the correlation coefficient, the better. We also compared the recognition rate of the hallucinated images and the interpolation results. Table 1 shows the recognition performances
for 100 individuals. The eye-region edge images were created with a Sobel filter, and the estimation was based on normalized correlation values of the eye region.
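For reference, the similarity score used in this recognition test can be computed as in the sketch below (Sobel edge magnitude followed by normalized correlation); the exact filter settings are not specified above, so this is one plausible realization.

```python
import numpy as np
from scipy.ndimage import sobel

def eye_similarity(eye_a, eye_b):
    """Normalized correlation of Sobel edge images of two eye regions."""
    def edge_magnitude(img):
        img = img.astype(np.float64)
        return np.hypot(sobel(img, axis=1), sobel(img, axis=0))
    a = edge_magnitude(eye_a).ravel()
    b = edge_magnitude(eye_b).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```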
5 Conclusion
In this paper, we proposed a new facial-parts-based hallucination method. Our method prevents facial parts from being subdivided by obtaining the contour location of each facial part with the Active Appearance Model (AAM). Facial recognition experiments showed that the proposed method yields hallucinated images that well approximate the original high-resolution images, making our method effective for raising face recognition performance. In future work, we will conduct experiments on all facial parts other than the eyes and synthesize them. We will also apply the method to real-world images captured by surveillance systems, because the current results were obtained by blurring and down-sampling original high-resolution images.
References

1. Baker, S., Kanade, T.: Hallucinating faces. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 83–88 (2000)
2. Wang, X.G., Tang, X.O.: Hallucinating face by eigentransformation. IEEE Trans. Syst. Man Cybern. 35(3), 425–434 (2005)
3. Chang, H., Yeung, D.Y., Xiong, Y.M.: Super-resolution through neighbor embedding. In: Proceedings of CVPR'04, Washington, DC, vol. 1, pp. 275–282 (2004)
4. Liu, C., Shum, H., Freeman, W.T.: Face hallucination: Theory and practice. Int. J. Comput. Vis. 75(1), 115–134 (2007)
5. Li, Y., Lin, X.: An improved two-step approach to hallucinating faces. In: Proc. IEEE Conf. Image and Graphics, pp. 298–301 (December 2004)
6. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Proc. of the 5th European Conference on Computer Vision, vol. 2, pp. 484–498. Springer, Heidelberg (1998)
7. Moriyama, T., Kanade, T., Xiao, J., Cohn, J.F.: Meticulously detailed eye region model and its application to analysis of facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(5), 738–752 (2006)
8. Japanese Face Image Database HOIP, http://www.softopia.or.jp/rd/facedb.html
Geometric Image Registration under Locally Variant Illuminations Using Huber M-estimator

M.M. Fouad, R.M. Dansereau, and A.D. Whitehead

Dept. of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada, K1S 5B6
E-mail: {mmafoad,rdanse}@sce.carleton.ca, [email protected]

Abstract. In this paper, we extend our previous registration model for images having arbitrarily-shaped locally variant illuminations, from shadows to multiple shading levels. These variations tend to degrade the performance of geometric registration and impact subsequent processing. Traditional registration models often use a least-squares estimator that is sensitive to outliers. Instead, we propose using a robust Huber M-estimator to increase the geometric registration accuracy (GRA). We demonstrate the proposed model and compare it to other models on simulated and real data. This modification shows clear improvements in terms of GRA and illumination correction.
1 Introduction
Geometric image registration (GIR) of a set of images is a common pre-processing step in many applications [1], such as super-resolution [2], image stitching [3], and remote-sensing applications [4]. For these and similar applications, sub-pixel accuracy in registration is necessary for satisfactory post-processing results.

Existing image registration approaches can be categorized into feature-based and intensity-based. The former match the features extracted in one image to their corresponding features in another image, either by their appearance similarity or by geometric closeness (see, e.g., [5, 6]), or by local descriptors, such as SIFT, Harris, etc. (see, e.g., [7, 8]). Although these approaches have robustness to outliers, such as RANSAC [9, 10], they can be highly sensitive to the feature extraction process, especially in the presence of locally variant illumination changes [3]. Note that an outlier in feature-based registration approaches denotes a mismatch between corresponding features. Another major problem of feature-based approaches is that the sparseness of the features increases the geometric registration error, especially if the features are less dense in a specific part of the input images.

Intensity-based approaches directly deal with image intensity without detecting prominent features. These approaches aim to find a transformation function such that the discrepancy in the intensity of the corresponding pixels is minimized, thereby avoiding some of the pitfalls of the feature extraction stage. Although a number of researchers [11–14] have incorporated global or
semi-local illumination models into the image registration problem, their approaches do not allow the arbitrarily-shaped illumination changes that can occur in real imaging scenarios.

The GIR process is impacted by the presence of locally variant illumination changes [3]. In [11, 12], a global illumination (GI) model is presented to relate the intensity levels of an image pair, where pre-defined neighborhoods with an imposed smoothness constraint are proposed in [11]. In [13], an affine illumination model is given with triangular or quadrilateral region support. In [14], a dual inverse compositional algorithm is proposed based on the assumption that the geometric and photometric parameters commute, thereby impeding the use of explicit local intensity transformations. Unfortunately, the registration resulting from these models tends to be degraded when arbitrarily-shaped regions with distinct illumination variations exist. In our previous work [15], although local illumination variations were taken into account simultaneously with GIR, one can notice two limitations. First, these variations are limited to two illumination levels (i.e., shadowed and non-shadowed). Second, the approaches in [11–15] use a least-squares estimator (LSE). Intuitively, the LSE is sensitive to outliers, since its influence function (i.e., the first derivative) is proportional to the residuals' magnitude. Instead, the LSE can be replaced by a function that assigns smaller weights to larger residuals. The most common method is Huber M-estimation [16], which has greater resistance to outliers as it penalizes the residuals differently. Note that an outlier in intensity-based registration approaches refers to corresponding pixels that have a large absolute intensity difference due to illumination variations.

In this paper, we propose an intensity-based registration approach by extending the approach in [15] to cope with multiple shading levels, while improving GIR accuracy using a robust Huber M-estimator [16] instead of the LSE. Since the approach in [13] tends to limit the illumination regions to predefined shapes, and the algorithm in [14] prevents using local illumination transformations, we compare the proposed model to the ASLIV model presented in [15] and the GI model shown in [11, 12].

This paper is organized as follows. In Section 2, we propose an approach using the Huber M-estimator. In Section 3, we develop the experiments using real and simulated data sets. Finally, conclusions are given in Section 4.
2 Proposed Model with Huber M-estimator

In this section, we propose a registration model to improve geometric registration and illumination correction using the Huber M-estimator.

2.1 Model

Consider two N×M input images, I_1 and I_2, captured of the same scene at two different times. Both images may have distinct illumination regions
with arbitrary shapes. We can formulate our extended registration model, relating the intensity levels of I_1 and I_2, as

I_2(x_2) \approx B(x_1) \cdot I_1(x_1) + C(x_1),    (1)

where x = (x, y) is a pixel location, and {I_i, B, C} \in R^{N \times M}. As presented in [15], the illumination variations B and C are modeled as the matrices

B = \begin{pmatrix} b_{11} & \cdots & b_{1M} \\ \vdots & \ddots & \vdots \\ b_{N1} & \cdots & b_{NM} \end{pmatrix}, \quad C = \begin{pmatrix} c_{11} & \cdots & c_{1M} \\ \vdots & \ddots & \vdots \\ c_{N1} & \cdots & c_{NM} \end{pmatrix}.    (2)
Many motion models could be used to model the motion in (1); however, we chose the 6-parameter affine motion model as in [11–13, 15] (i.e., x_2 = A(x_1); see Eqs. (3) and (4) in [15]). In [15], the illumination changes were limited to two shading levels. In this paper, the approach is generalized so that each image I_i can contain V_i distinct levels of illumination. To demonstrate the idea of the proposed model, we start with the case where a priori information on the distinct shading levels is available. Fig. 1(a) shows an example of a masked image I_1 segmented into three distinct illumination levels, L_1^1, L_1^2 and L_1^3 (i.e., V_1 = 3). Similarly, Fig. 1(b) depicts another masked image I_2 segmented into a different set of distinct illumination levels, L_2^1, L_2^2 and L_2^3 (i.e., V_2 = 3). With a rough geometric registration of I_1 and I_2, an absolute image difference (AID), I_{1,2}, is created having a set of J overlapping regions R_j, such that I_{1,2} = \bigcup_{j=1}^{J} R_j (see Fig. 1(c)), where J refers to the number of resulting overlapping regions. Similarly, in a real imaging scenario, we propose a simple method of segmenting the AID of the roughly aligned images using, for instance, the k-means algorithm [17]. Thus, the illumination regions R_j are obtained: first estimated and then iteratively refined. Note that the sum of the numbers of distinct illumination levels in both images (i.e., V_1 + V_2) does not determine the number of resulting overlapping regions J, as the latter depends on the former as well as on the intersections among these distinct levels L_i^v, \forall v. We stress that each resulting region R_j is assumed to have its own constant brightness b_j and contrast c_j. A special case was discussed in [15], where J = 2. For pixel-domain mathematical manipulations, each R_j can be represented by a binary mask Q_j(x) such that

Q_j(x) = \begin{cases} 1, & \forall x \in R_j \\ 0, & \text{otherwise} \end{cases}, \quad \text{where } Q_j \in R^{N \times M}.    (3)
J
j=1
bj Qj ,
C=
J
j=1
cj Q j .
(4)
Fig. 1. With a priori known information, (a, b) two different sets of distinct illumination levels of two input images (i.e., V_1 = V_2 = 3), yielding (c) a masked AID with seven arbitrarily-shaped regions, each having its own constant illumination (i.e., J = 7).
The following section presents an approach to estimate the unknown vector Φ = [a_1, ..., a_6, b_1, ..., b_J, c_1, ..., c_J]^T, which contains the six affine motion parameters, J brightness parameters and J contrast parameters, respectively.

2.2 Iterative Scheme Using Huber M-estimator
This section presents an iterative framework using the Huber M-estimator to estimate the unknown vector Φ. Unlike the quadratic cost function used in [11–13, 15], we propose using the Huber M-estimator [16]:

\min_{\Phi} L = \sum_{x} \Psi(E(\Phi, x); \alpha), \quad \text{where } \Psi(t; \alpha) = \begin{cases} \frac{1}{2} t^2, & |t| \le \alpha \\ \alpha |t| - \frac{1}{2} \alpha^2, & |t| > \alpha \end{cases}    (5)

where α is a positive tuning threshold, Ψ(·) denotes Huber's function, and E(Φ, x) refers to the residual located at position x = (x, y). This Huber M-estimation formula can be rewritten [18] as

\min_{\Phi} L = 0.5 \sum_{x} \left[ E^2(\Phi) \cdot F + 2\alpha\, S \cdot E(\Phi) - \alpha^2 \right],    (6)

where (·) refers to the element-by-element matrix product. The residuals can be expressed as

E(\Phi; x) = I_2(A(x)) - B(x)\, I_1(x) - C(x).    (7)
(8)
While, the low-residual selective matrix is F = 1 − S · S, fxy = 1 − s2xy ,
(9)
14
M.M. Fouad, R.M. Dansereau, and A.D. Whitehead
where fxy is an element of F at position (x, y), and {1, E, S, F } ∈ RN ×M . We can estimate Φ using the Gaussian-Newton method [20] to solve the non-linear ˆ is updated by Δ at each iteration k, minimization problem in (6). Note that Φ ˆ ˆ such that Φk = Φk−1 + Δk . Replacing E(·) in (7) by its 1st order Taylor series expansion, then F in (6) can be rewritten as
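In code, the selective matrices and the cost of Eqs. (6), (8) and (9) reduce to a few array operations, as in this sketch (the function name is our own):

```python
import numpy as np

def huber_terms(E, alpha):
    """S and F of Eqs. (8)-(9) and the Huber cost of Eq. (6) for a
    residual image E."""
    S = np.where(E > alpha, 1.0, np.where(E < -alpha, -1.0, 0.0))
    F = 1.0 - S * S                   # selects the low (quadratic) residuals
    L = 0.5 * np.sum(E**2 * F + 2.0 * alpha * S * E - alpha**2)
    return S, F, L
```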
We can estimate Φ using the Gauss–Newton method [20] to solve the non-linear minimization problem in (6). Note that \hat{\Phi} is updated by Δ at each iteration k, such that \hat{\Phi}_k = \hat{\Phi}_{k-1} + \Delta_k. Replacing E(·) in (7) by its first-order Taylor series expansion, L in (6) can be rewritten as

L = 0.5 \sum_{x} \left[ F \cdot \left( E(\Phi_{k-1}) + \Delta_k^T \nabla_{\Phi} E(\Phi_{k-1}) \right)^2 + 2\alpha\, S \cdot \left( E(\Phi_{k-1}) + \Delta_k^T \nabla_{\Phi} E(\Phi_{k-1}) \right) - \alpha^2 \right].    (10)

Setting the gradient of L w.r.t. Δ to zero, we obtain

-\sum_{x} \left[ \left( E(\Phi_{k-1})\, \nabla_{\Phi} E(\Phi_{k-1}) \right) \cdot F + \alpha\, S \cdot \nabla_{\Phi} E(\Phi_{k-1}) \right] = \left[ \sum_{x} \left( \nabla_{\Phi} E(\Phi_{k-1})\, \nabla_{\Phi} E^T(\Phi_{k-1}) \right) \cdot F \right] \hat{\Delta}_k.    (11)
We can write (11) in matrix notation as

-Y P^T = (Y H^T)\, \hat{\Delta},    (12)

where H = [H_{11}, H_{12}, \ldots, H_{pq}], Y = [Y_{11}, Y_{12}, \ldots, Y_{pq}], P = [P_{11}, P_{12}, \ldots, P_{pq}],

H_{pq} = [p I_x, q I_x, p I_y, q I_y, I_x, I_y, -I_1 Q_1, \ldots, -I_1 Q_J, -Q_1, \ldots, -Q_J]^T,    (13)

Y_{pq} = f_{pq} [p I_x, q I_x, p I_y, q I_y, I_x, I_y, -I_1 Q_1, \ldots, -I_1 Q_J, -Q_1, \ldots, -Q_J]^T,    (14)

and

P_{pq} = \alpha s_{pq} + f_{pq} I_2(p, q), \quad 1 \le p, q \le N, M.    (15)

Equations (12) through (15) can be used to perform one iteration of finding a solution for \hat{\Delta} to update \hat{\Phi}. Our approach also uses a coarse-to-fine [19] framework with r resolution levels to cope with large motions. Convergence is achieved at the last resolution level, when either the cost function in (6) is updated by less than a predefined threshold ε, or a maximum number of iterations \bar{k} has been reached. The unknown parameters at each iteration are accumulated, yielding a single final estimate. We refer to the proposed model explained in this section as HM-ASLIV^{6,2J}, where the superscript accounts for the 6 + 2J parameters in Φ.
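The normal equations of Eq. (12) can be assembled and solved as below, stacking the per-pixel vectors H_pq and Y_pq as matrix columns. The commented outer loop sketches how this step would sit inside the iterative scheme; `residuals` and `assemble` are hypothetical helpers standing in for Eqs. (7) and (13)-(15), and `huber_terms` is the sketch given earlier.

```python
import numpy as np

def solve_update(H, Y, P):
    """One Gauss-Newton step of Eq. (12): solve (Y H^T) Delta = -(Y P^T).

    H and Y are (6 + 2J, n_pixels) matrices whose columns are the vectors
    H_pq and Y_pq of Eqs. (13)-(14); P is the length-n_pixels vector of
    the scalars P_pq of Eq. (15).
    """
    A = Y @ H.T                      # (6 + 2J) x (6 + 2J) normal matrix
    rhs = -(Y @ P)                   # right-hand side of Eq. (12)
    return np.linalg.lstsq(A, rhs, rcond=None)[0]

# Sketch of the outer loop at one resolution level (coarse-to-fine omitted):
#   phi = phi_0
#   for k in range(k_max):
#       E = residuals(phi, I1, I2, masks)                 # Eq. (7)
#       S, F, _ = huber_terms(E, alpha=1.345 * E.std())   # Eqs. (8)-(9)
#       H, Y, P = assemble(I1, I2, masks, S, F)           # Eqs. (13)-(15)
#       phi = phi + solve_update(H, Y, P)
```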
3 Experimental Results

In this section, we develop the experiments and compare the proposed HM-ASLIV model to the ASLIV model in [15] and the GI model in [11, 12], whose implementation can be found online [22].
Table 1. The AAE (×10^-4) of the estimated affine parameters of the "J=3" and "J=4" data computed by the competing models, compared to their ground-truth values

        "J=3" data set                       "J=4" data set
Para.   GI        ASLIV6,6   HM-ASLIV6,6     GI        ASLIV6,8   HM-ASLIV6,8
a1      15.3658   1.7205     1.2025          16.5962   1.8021     1.4358
a2      12.0018   1.5079     1.0344          13.2015   1.6025     1.1916
a3      14.6081   1.6189     1.1661          15.5246   1.7360     1.2049
a4      19.1128   2.0502     1.8699          21.0259   2.4261     1.9280
a5      9190.77   789.16     722.35          9472.09   816.50     760.57
a6      2688.72   795.70     716.11          2835.12   804.81     755.11
Table 2. SSIM indexes, PSNR and NCC values expressing the correlation between the overlapping region of the registered pairs for a random subset of the "J=3" data set using (i) GI, (ii) ASLIV6,6 and (iii) HM-ASLIV6,6 models

        SSIM [23]                PSNR (dB)             NCC [24]                 Time (sec.)
Pair #  (i)     (ii)    (iii)    (i)    (ii)   (iii)   (i)     (ii)    (iii)    (i)   (ii)  (iii)
10      0.8397  0.9602  0.9622   22.53  29.02  29.78   0.9111  0.9942  0.9956   43.0  57.7  62.2
17      0.8194  0.9577  0.9589   20.01  28.33  29.01   0.9241  0.9932  0.9948   42.8  55.4  59.8
28      0.8268  0.9534  0.9598   21.76  29.42  30.17   0.9049  0.9939  0.9951   44.7  58.8  61.4
34      0.8066  0.9507  0.9531   21.91  27.01  27.76   0.9100  0.9927  0.9946   43.3  57.2  62.5
38      0.8552  0.9650  0.9665   23.47  28.63  29.25   0.9192  0.9948  0.9956   42.7  56.4  63.2
48      0.8419  0.9634  0.9643   21.94  29.69  30.10   0.9067  0.9941  0.9953   41.2  55.3  59.4
Table 3. NCC for the real pairs using GI, ASLIV6,8 and HM-ASLIV6,8 models

Pair   GI       ASLIV6,8   HM-ASLIV6,8     Pair   GI       ASLIV6,8   HM-ASLIV6,8
A      0.8819   0.9226     0.9237          E      0.9601   0.9743     0.9758
B      0.9642   0.9789     0.9802          F      0.9919   0.9922     0.9939
C      0.9799   0.9853     0.9870          G      0.9651   0.9657     0.9671
D      0.9550   0.9644     0.9659          H      0.9071   0.9229     0.9245

3.1 Data Set Description and Performance Evaluation
The image data sets used in the experiments fall into two categories: real and simulated. The first category involves nine real 400×400 and 600×600 LANDSAT satellite images [21] with unknown true motion. The second category includes 50 simulated image pairs of 500×500 pixels, acquired from 3000×3000-pixel IKONOS satellite images of the Pentagon and its surroundings, as shown in [15]. Each simulated pair is subjected to V_1 and V_2 varying levels of illumination, respectively, in different areas of the two images, according to the desired number of illumination regions J, set to 3 and 4. Note that all the simulated pairs are presented directly to the competing models without pixel quantization error. For the real and simulated data sets, we determine the correlation between the overlapping area of the two registered images using commonly used image
quality measures: the structural similarity (SSIM) index [23], PSNR (in dB) and normalized cross-correlation (NCC) [24]. The average absolute error (AAE) between the estimated affine parameters and the corresponding ground-truth values is also computed for the simulated data sets.

Fig. 2. (a, b) A real pair (#B). (c, d) A "J=3" simulated pair (#34). Using the competing models, (e, f, g) gamma-corrected AIDs of the real pair in (a, b), γ = 1.7, with NCC = 0.9642, 0.9789 and 0.9802, respectively, and (h, i, j) normalized AIDs of the simulated pair in (c, d), with PSNR = 21.91, 27.01 and 27.76 dB, respectively.
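As a reference for the PSNR and NCC figures reported in Tables 2 and 3, these two measures can be computed on the overlapping region of a registered pair as in this short sketch (an 8-bit peak value of 255 is assumed):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR in dB between two equally sized image regions."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ncc(a, b):
    """Normalized cross-correlation [24] of two equally sized regions."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```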
3.2 Implementation Setup and Discussion
Our implementation runs on a 2 GHz Pentium IV Core 2 Duo with 2 GB of RAM. We compare the HM-ASLIV model to the GI model in [11, 12] and the model in [15]. We refer to the model in [15] as ASLIV^{6,2J}, in which the quadratic cost function is employed, but for J=3 and 4 (i.e., ASLIV^{6,6} and ASLIV^{6,8}, respectively). In our experiments, an identity initialization is used for Φ_0, yielding an initial guess that the two input images are aligned and no illumination variations exist. Thus, Φ_0 is either set to [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0]^T with models ASLIV^{6,6} and HM-ASLIV^{6,6}, or set to [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]^T with models ASLIV^{6,8} and HM-ASLIV^{6,8}. Our approach iteratively exploits a coarse-to-fine framework with 5 resolution levels (i.e., r = 5), so that \hat{\Phi} is adequately estimated at the coarsest level for both real and simulated data. For the convergence settings, ε and \bar{k} are set to 0.1 and 10, respectively. Note that α is set to 1.345σ [20] when using the HM-ASLIV model, where σ is the standard deviation of the residuals computed at each iteration.

Table 1 shows that the proposed model, HM-ASLIV, provides more precise motion estimates than the GI and ASLIV models in the presence of local illumination variations, for the "J=3" and "J=4" data sets. In addition, Tables 2 and 3 show that the selected image quality measures using the proposed HM-ASLIV model outperform those using the GI and ASLIV models, with only a slight increase in computational time for both simulated and real data. The gamma-corrected AIDs of the registered pairs, for the pair in Fig. 2(a, b), are shown in Fig. 2(e, f, g) using the GI, ASLIV^{6,8} and HM-ASLIV^{6,8} models, respectively (γ = 1.7 for better visualization). Similarly, the normalized AIDs of the aligned pairs, for the pair in Fig. 2(c, d), are shown in Fig. 2(h, i, j) using the GI, ASLIV^{6,6} and HM-ASLIV^{6,6} models, respectively. Since lower brightness in the AID indicates more precise GRA, Fig. 2 shows that the HM-ASLIV model qualitatively outperforms both the GI and ASLIV models.
4
Conclusions
In this paper, we present an intensity-based image registration model to cope with images having multiple illumination variations (regions) with arbitrary shapes. The proposed model assumes brightness and contrast uniformity within each illumination region, which allows conditioning of the ill-posed problem. The approach also uses a robust Huber M -estimator that is not as vulnerable as LSE to outliers. Convergence is well achieved with an identity initialization, through an iterative coarse-to-fine scheme to allow for large motions. The experiments are performed on both real and simulated data sets. Clear improvements in terms of geometric registration accuracy are obtained using the proposed model as opposed to the GI model [11, 12] and ASLIV model [15] by an average increase of 88.5% and 18.9%, respectively, with only a slightly average increase of 33.1% and 11.6%, respectively, in computational time.
References 1. Zitov´ a, B., Flusser, J.: Image registration methods: A Survey. Image & Vis. Comp. 21, 977–1000 (2003) 2. Gevrekci, M., Gunturk, B.: Superresolution under photometric diversity of images. Advances in Signal Proc. (2007) 3. Szeliski, R.: Image alignment and stitching: A tutorial. Found. and Trends in Comp. Graphics and Vision 2 (2006) 4. Lou, L., Zhang, F., Xu, C., Li, F., Xue, M.: Automatic registration of aerial image series using geometric invariance. In: IEEE ICAL, pp. 1198–1203 (2005)
18
M.M. Fouad, R.M. Dansereau, and A.D. Whitehead
5. Aylward, S., Jomier, J., Weeks, S., Bullitt, E.: Registration of vascular images. Comp. Vis. 55, 123–138 (2003) 6. Xu, D., Kasparis, T.: Robust image registration under spatially non-uniform brightness changes. In: IEEE ICASSP, vol. 2, pp. 945–948 (2005) 7. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image descriptors. Comp. Vis. and Patt. Recog. 2, 506–513 (2004) 8. Battiato, S., Gallo, G., Puglisi, G., Scellato, S.: SIFT features tracking for video stabilization. In: Proc. 14th ICIAP, pp. 825–830 (2007) 9. Fischler, M., Bolles, R.: Random Sample Consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. Of the ACM 24, 381–395 (1981) 10. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision, 2nd edn. Cambridge University Press, Cambridge (2003) 11. Periaswamy, S., Farid, H.: Elastic registration in the presence of intensity variations. IEEE Trans. Med. Imag. 22, 865–874 (2003) 12. Aguiar, P.: Unsupervised simultaneous registration and exposure correction. In: IEEE ICIP, pp. 361–364 (2006) 13. Altunbasak, Y., Mersereau, R., Patti, A.: A fast parametric motion estimation algorithm with illumination and lens distortion correction. IEEE Trans. Image Proc. 12, 395–408 (2003) 14. Bartoli, A.: Groupwise geometric and photometric direct image registration. IEEE Trans. Patt. Ana. & Mach. Intel. 30, 2098–2108 (2008) 15. Fouad, M., Dansereau, R., Whitehead, A.: Geometric registration of images with arbitrarily-shaped local intensity variations from shadows. In: IEEE ICIP, pp. 201–204 (2009) 16. Huber, P.: Robust Statistics, 1st edn. Wiley, New York (1981) 17. Kanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R.: An efficient K-means clustering algorithm: Analysis and implementation. IEEE Trans. Patt. Ana. & Mach. Intel. 24, 881–892 (2002) 18. Mangasarian, O., Musicant, D.: Robust linear and support vector regression. IEEE Trans. Patt. Ana. & Mach. Intel. 22, 950–955 (2000) 19. Jiang, J., Zheng, S., Toga, A., Tu, Z.: Learning based coarse-to-fine image registration. In: IEEE ICCVPR, pp. 1–7 (2008) 20. Nocedal, J., Wright, S.: Numerical optimization. Springer, New York (1999) 21. http://vision.ece.ucsb.edu/registration/satellite/ (accessed January 2009) 22. http://users.isr.ist.utl.pt/~ aguiar/mosaics/ (accessed January 2009) 23. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Proc. 13, 600–612 (2004) 24. Barnea, D., Silverman, H.: A class of algorithms of fast digital image registration. IEEE Trans. Comp. 21, 179–186 (1972)
Morphological Sharpening and Denoising Using a Novel Shock Filter Model Cosmin Ludusan1,2, Olivier Lavialle2, Romulus Terebes1, and Monica Borda1 1
Technical University of Cluj-Napoca, 15 C. Daicoviciu Street, 400020, Cluj-Napoca, Romania 2 Bordeaux 1 University, 351 Cours de la Libération, 33405, Talence, France {Cosmin.Ludusan,Romulus.Terebes,Monica.Borda}@com.utcluj.ro,
[email protected],
[email protected] Abstract. We present a new approach based on Partial Differential Equations (PDEs) for image enhancement in generalized “Gaussian Blur (GB) + Additive White Gaussian Noise (AWGN)” scenarios. The inability of the classic shock filter to successfully process noisy images is overcome by the introduction of a complex shock filter framework. Furthermore, the proposed method allows for better control and anisotropic, contour-driven, shock filtering via its control functions f1 and f2. The main advantages of our method consist in the ability of successfully enhancing GB+AWGN images while preserving a stableconvergent time behavior. Keywords: PDEs, shock filters, image denoising, image sharpening, image enhancement.
1 Introduction The reasoning behind our approach has as starting point the limitations presented by the classic shock filter model described in [1] as well as on its subsequent developments, such as the one presented in [5]. The main drawback of the classic model consists in the fact that in the presence of AWGN the filtering is minimal, at best. This minimal filtering effect is mainly due to the numerical scheme based on the minmod function defined in [1] and used in computing the gradient norm in each point of the image function. Its role is to restrict large value variations in neighboring pixels, such being the case of noise-corrupted images. Another important drawback of the classic shock filter model resides in its edge detector, based on the 2nd order directional derivative that, in GB+AWGN scenarios, fails to correctly detect edges and contours, thus blocking the shock filter’s natural time evolution. By knowing these a priori limitations of the classic model, a series of steps towards improving its overall performance were taken over time. Noteworthy results were described in [2], [4] and [8], the edge detector and its robustness being their main focus, somehow neglecting the global GB+AWGN scenario. This generalized scenario was approached in [3] and [5] where the useful signal, affected by both GB and AWGN was part of the problem’s statement. A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 19–27, 2010. © Springer-Verlag Berlin Heidelberg 2010
20
C. Ludusan et al.
The starting point of our approach is represented by the work described in [5] that contains a series of innovative ideas at the level of contour detection as well as imagedefinition domain. Nevertheless, handling both contamination sources at the same time implies a series of compromises either processing quality wise, noise removal wise or edge enhancement wise. In the case of the method described in [5], in order to surmount the inherent classic edge detector’s limitations, a new approach was proposed: an approximation of the 2nd order directional derivative given by the imaginary part of the image function. In order to accomplish this, the image definition domain needs to be changed from the real one to the more general domain, the complex one, thus adding a new dimension to the work space. The major improvement brought by this edge detector consists in its robustness to noise, even when dealing with low signal-to-noise ratio (SNR) images. On the other hand, this edge detector presents a noticeable drawback as well, due to the fact that the edge detector will continuously evolve over time, leading to a divergent effect of the filtered result instead of reaching a steady-state solution, as in the case of the classic shock filter.
2 Shock Filter Theory Review As previously stated, the main limitation of traditional shock filters is the presence of AWGN as contamination source of the useful signal. Since traditional shock filters were initially designed to deal exclusively with GB signal corruption, the more general scenario, GB+AWGN, proves to be too complex for a classic shock filter like the one described in [1]. The GB+AWGN scenario represents a complex perturbation, difficult to filter using traditional methods, requiring either complex processing models or successive filtering for removing each distortion at a time. The qualitative level of the filtering also depends, to a great extent, on the discretization method used for the mathematical model. An alternative discretization scheme, described in [4], allows the classic shock filter to perform well even in AWGN scenarios, but only for large SNR values, i.e. small AWGN signal corruption. When dealing with just AWGN perturbations, the usual PDEs framework approach is the use of diffusion filters; this type of filters performs a controlled GB filtering based on the principle of heat dissipation, described in Physics. The controlled GB filtering is behaviorally similar to the GB distortion, hence it can be inferred that the noise removal GB filtering in the AWGN scenario can be approximated to the GB distortion in the GB perturbation scenario. Therefore, in the case of the generalized scenario of GB+AWGN the separate filtering of each distortion is performed with filters opposite in nature, leading to a complex problem. This problem is discussed in [3] and [5], leading to an elegant solution by defining a series of connecting terms between the filtered image and the input image in order to preserve coherence and avoid the filtered image’s divergence (absence of a steady-state solution) induced by the opposite nature of the two filtering processes. The novelty of the idea described in [5] arises from the purpose of the method: to use a shock filter for processing AWGN-corrupted images not just GB-corrupted ones. In order to attain this desideratum the edge detector needs to be rethought, since the classic edge detector is not adequate in handling AWGN-corrupted signals, as
Morphological Sharpening and Denoising Using a Novel Shock Filter Model
21
previously stated. The solution given in [5] consists in redefining the definition domain of the image function, from the real one to the complex one. By doing so, the use of the imaginary part of the image function as an edge detector proves to be an elegant and efficient solution in overcoming the classic edge detector’s problem. The general shock filter 1D equation is the following:
I t = − I x ⋅ F ( I xx ) .
(1)
With F satisfying the following constraints:
⎧ F ( 0) = 0 . ⎨ ⎩ F ( s ) ⋅ sign ( s ) ≥ 0 .
(2)
By choosing F ( s ) = sign( s ) one obtains the classic shock filter expression:
I t = − sign ( I xx ) ⋅ I x .
(3)
When dealing with images, we generally work in a 2D or higher framework. For the 2D case, (3) becomes:
I t = − sign ( Iηη ) ⋅ ∇I .
(4)
η represents the gradient vector’s direction. An important role in the discretization of (4) is played by the way in which the gradient norm | ∇I | is computed, in order to avoid the algorithm’s instability caused by the approximation of the 1st order derivatives when computing the gradient vector. A way around this problem is described in [1] where the gradient norm | ∇I | is computed using a slope limiter minmod function in order to minimize the sudden signal variations. The classic shock filter model from (4) combined with its discretization using the minmod function is extremely sensitive to AWGN perturbations, as stated in [1]. The filtering of a GB-corrupted signal with overlaid AWGN or just of an AWGNcorrupted one using the shock filter (4), will amplify the AWGN instead of successfully processing it. If we consider the image function over a continuous domain, the noise amplification can lead to an infinite number of inflexion points, thus resulting in the image function’s rapid divergence from a steady-state solution. Another way to address the AWGN problem is to consider a more complex approach of the shock filter formalism. Such an approach would combine a deblurring method with a noise removal method: for the isotropic regions of the image, a noise removal will take care of the AWGN distortion; as for the anisotropic regions, such as edges and contours, a local, image geometry-driven deblurring will take care of the GB distortion. Such an approach is presented in [2] and consists in coupling a diffusion filter with a shock filter:
I t = − sign(Gσ ∗ Iηη ) ⋅ ∇I + cI ξξ .
(5)
22
C. Ludusan et al.
σ is the standard deviation of the Gaussian kernel G and c is a positive constant; ξ defines the direction orthogonal to the gradient vector. A more complex mathematical model is described in [3]:
I t = α r ⋅ ( hτ Iηη + I ξξ ) − α e ⋅ (1 − hτ ) ⋅ sign (Gσ ∗ Iηη ) ⋅ ∇I .
(6)
Where: hτ = hτ ( Gσ ∗ ∇I ) = 1 if Gσ ∗ ∇I < τ
and 0 otherwise. In order to improve the filtering capacity, [5] firstly suggests changing the sign function F described by (2), to allow taking into account not only the 2nd order derivative’s direction but also its magnitude. This way the inflexion points (the regions close to contours/edges where the 2nd order derivative has a higher magnitude) will not have equal weights, which translates into a higher deblurring speed near edges and contours than in the isotropic regions of the image. The new sign function is expressed as follows: F (s ) =
2
π
arctan(as ) .
(7)
In (7) a is the parameter that controls the steepness of the 2nd order derivative’s slope near 0. Finally, [5] proposes a complex shock filter model that employs the sign function (7), having the following expression:
It = −
2
π
~ arctan(a ⋅ Im(I θ )) ⋅ ∇I + λIηη + λ I ξξ .
(8)
~ Where λ = reiθ is a complex scalar, λ is a real scalar and θ ∈ (− π 2 , π 2) . For small values of θ ( θ → 0 ), the imaginary part can be regarded as a smoothed 2nd order derivative of the initial signal factored by θ and the time t, as it was mathematically proven in [6]. The implementation of (8) is done by the same standard discrete approximations used in [1], except that all computations are performed in the complex domain.
3 The Hybrid Shock Filter Model Although the complex shock filter described in [5] proves to be a viable alternative to the classic one in circumventing the noise problem in the generalized scenario of GB+AWGN interference, it presents at the same time a series of shortcomings, the most important of them being its numerical implementation, which becomes unstable after a sufficiently large number of iterations. This translates into the method’s dependency on the human supervised control, the algorithm’s stopping criterion being tied to its input parameters and sensitive to the nature of the input image. These shortcomings along with the ones presented by the classic model represent the premises of our hybrid shock filter model. Our goal is to combine the advantages of both models without preserving any of their disadvantages. So far our hybrid model solves the inability to efficiently process AWGN of the classic shock filter as
Morphological Sharpening and Denoising Using a Novel Shock Filter Model
23
well as the divergent character of the complex one, thus resulting a shock filter capable of image enhancement in GB+AWGN scenarios that is both efficient and stable. Another advantage of this method resides in its modularity, allowing the use of multiple sets of functions, useful in the filter’s behavioral analysis over a large variety of input images. The mathematical expression of the hybrid shock filter is the following: Re( I t ) = −
2
π
arctan(a ⋅ Im( I ) θ ) ⋅ f1 ⋅ ∇I − sign(Re( Iηη )) ⋅ f 2 ⋅ ∇I
~ + f1 ⋅ (Re(λ ) ⋅ Re( Iηη ) − Im(λ ) ⋅ Im( Iηη ) + λ Re( I ξξ )) .
Im(I t ) = Im(λ ) ⋅ Re(Iηη ) − Re(λ ) ⋅ Im(Iηη ) + λ Im(Iξξ ) . ~
(9)
The parameters of (9) have the following significance: • • • • • •
a is the parameter that controls the slope of the edge detector’s sign function (arctan). θ ∈ (-π/2,π/2) is an input parameter. When θ→0, Im( I ) θ can be approximated to the 2nd order directional derivative of the image function I. | ∇I | represents the gradient norm of function I computed using the minmod function as defined in [1]. λ = re iθ is a complex scalar parameter, computed as a function of θ. ~ λ is a real scalar input parameter. f 1 and f 2 are two complementary functions that represent the core of the hybrid shock filter formulation (9). Their purpose is to control the nature of the hybrid shock filter, i.e. to control the transition rate of the filter’s behavior from an exclusively complex one to an exclusively real one. These functions are defined as follows:
⎧ 1, i < TI1 ⎪⎪ i − T i = 0....N −1, I1 , TI1 ≤ i < TS1, f1(TI1, TS1) = ⎨1− TI1, TS1 ∈(0; N −1) ⎪ TS1 − TI1 ⎪⎩ 0, i ≥ TS1 ⎧ 0, i < TI 2 ⎪⎪ i − T i = 0....N −1, I2 , TI 2 ≤ i < TS 2 , f2 (TI 2 , TS 2 ) = ⎨ TI 2 , TS 2 ∈(0; N −1) ⎪TS 2 − TI 2 ⎪⎩ 1, i ≥ TS 2
(10)
N represents the number of iterations, correlated with the mathematical model’s time parameter t (t = dt·N). TI 1 , TS1 , TI 2 , TS 2 are threshold parameters used to define the complementary behavior of f1 and f2; f1,f2 : [0;N-1]→[0;1] as exemplified in Fig. 1:
24
C. Ludusan et al.
Fig. 1. Graphical representation of f1 and f2 (10) for: TI1 = 200; TS1 = 500 / TI2 = 100; TS2 = 700 left and TI1 = 100; TS1 = 700 / TI2 = 400; TS2 = 900 - right
The hybrid shock filter has a combined behavior, weighted by its control functions f1 and f2 that could be summarized as follows: •
•
•
When f1=1 and f2=0 the filter behaves exclusively as a complex shock filter (described in [5]). This behavior is required in order to effectively deal with the AWGN perturbation that can be approached using the complex shock filter paradigm. Thus, the hybrid shock filter relies on its edge detector (imaginary part of the image function I) in correctly detecting edges and contours in GB+AWGN conditions. Following its time evolution, after a certain number of iterations the AWGN will be filtered enough to use the classic shock filter component of the hybrid shock filter. This is the case when f1 ∈ [0;1) and f2 ∈ [0;1), translating into a simultaneous evolution of both hybrid shock filter’s components. Finally, at the end of the filtering process, f1=0 and f2=1 allowing the hybrid shock filter to properly act as an edge enhancement filter (filtering the GB perturbation) through its classic shock filter component.
4 Experimental Results In order to carry out the performance analysis, as well as the comparative study of the hybrid shock filter, we first need to define the experimental setting: a test image distorted with a GB+AWGN-type distortion (Fig. 2b) with the following parameters: GB with σ = 5 and AWGN of amplitude A = 30. The comparative analysis between the three types of shock filters (classic, complex and hybrid respectively) will be performed using as an objective quality assessor the Root-Mean-Square-Error (RMSE) measurement and as a subjective quality assessor the visual comparison between the unaltered test image and the filtered results. Fig. 2 represents the experimental setting as well as the filtered results. For the above test scenario: the test parameters were the following: θ = 0.00001, ~ λ = 0.5 , a = 0.5, λ = 0.5 dt = 0.1 and N = 1000 leading to a theoretical evolution time of t = 100 seconds. It needs to be emphasized that the classic shock filter only uses two parameters (dt and N) while the complex and hybrid shock filters use the
Morphological Sharpening and Denoising Using a Novel Shock Filter Model
a)
b)
c)
d)
25
e) Fig. 2. Experimental setting: a) Original test image; b) Distorted test image; c) Classic shock filter result; d) Complex shock filter result; e) Hybrid shock filter result
same parameters with the remark that for the hybrid shock filter the function set (10) was used with the following parameters: TI1 =300;TS1 = 350 / TI2 = 320;TS2 = 900. Fig. 3 represents the RMSE/time evolution of the three filters, according to the experimental setting described by Fig. 2. The RMSE measurement was performed between the unaltered image (Fig. 2a) and each of the three filtered results (Fig. 2c2e). As it can be noted, the hybrid shock filter possesses the advantages of both the classic shock filter (stable time evolution, steady-state solution) and the complex one (efficient AWGN filtering as well as GB deblurring). Since any output image is considered to be information and according to the definition of information, it represents an entity about which we do not possess any prior knowledge, it is impossible to a priori know the minimum value of the RMSE obtained by filtering. Thus, the complex shock filter lacks the ability of maintaining a stable behavior (that leads to a steady-state solution) long enough to ensure that its time evolution has reached the minimum RMSE value before diverging.
26
C. Ludusan et al.
Fig. 3. RMSE comparison between the three shock filters
5 Conclusions While still a work in progress, the hybrid shock filter model has proven so far a step forward in shock filter theory, turning out to be a viable alternative to the classic methods in both GB, AWGN and GB+AWGN scenarios. The mathematical model allows further improvements, from minor tweaks in control function definition to increasing the filtering capacity as well as the convergence speed towards a steadystate RMSE value. The experimental results and comparative analysis were promising, establishing the premises for future shock filter theory paradigms. Acknowledgment. This work was supported by CNCSIS-UEFISCSU, project number PNII-IDEI code 908/2007.
References 1. Osher, S., Rudin, L.: Feature-oriented Image Enhancement Using Shock Filters. SIAM J. on Num. Anal. 27, 919–940 (1990) 2. Alvarez, L., Mazorra, L.: Signal and Image restoration Using Shock Filters and Anisotropic Diffusion. SIAM J. on Num. Anal. 31, 590–605 (1994) 3. Kornprobst, P., Deriche, R., Aubert, G.: Image Coupling, Restoration and Enhancement via PDEs. In: International Conference on Image Processing (ICIP) Proceedings, vol. 2, pp. 458–461 (1997) 4. Remaki, L., Cheriet, M.: Numerical Schemes of Shock Filter Models for Image Enhancement and Restoration. J. of Math. Imag. And Vis. 18, 129–143 (2003) 5. Gilboa, G., Sochen, N.A., Zeevi, Y.Y.: Regularized Shock Filters and Complex Diffusion. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 399–413. Springer, Heidelberg (2002) 6. Gilboa, G., Zeevi, Y.Y., Sochen, N.A.: Complex Diffusion Processes for Image Filtering. In: Kerckhove, M. (ed.) Scale-Space 2001. LNCS, vol. 2106, pp. 299–307. Springer, Heidelberg (2001)
Morphological Sharpening and Denoising Using a Novel Shock Filter Model
27
7. Gilboa, G.: Super-resolution Algorithms Based on Inverse Diffusion-type Processes. PhD Thesis, Technion-Israel Institute of Technology, Haifa (2004) 8. Weickert, J.: Coherence-Enhancing Shock Filters. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 1–8. Springer, Heidelberg (2003) 9. Tschumperle, D., Deriche, R.: Constrained and Unconstrained PDEs for Vector Image Restoration. In: 12th Scandinavian Conference on Image Analysis, Norway, pp. 153–160 (2001) 10. Buades, A., Coll, B., Morel, J.M.: Image Enhancement by Non-local Reverse Heat Equation (Preprint CMLA 2006-22) (2006) 11. Buades, A., Coll, B., Morel, J.M.: The Staircasing Effect in Neighborhood Filters and its Solution. IEEE Transactions on Image Processing 15(6), 1499–1505 (2006) 12. Rudin, L.: Images, Numerical Analysis of Singularities and Shock Filters. PhD Thesis, California Institute of Technology, Pasadena CA (1987) 13. Barash, D.: One-step Deblurring and Denoising Color Images Using Partial Differential Equations. HP Laboratories, Israel (2001) 14. Welk, M., Theis, D., Borx, T., Weickert, J.: PDE-based Deconvolution with ForwardBackward Diffusivities and Diffusion Tensors. In: Kimmel, R., Sochen, N.A., Weickert, J. (eds.) Scale-Space 2005. LNCS, vol. 3459, pp. 585–597. Springer, Heidelberg (2005) 15. Bettahar, S., Stambouli, A.B.: Shock Filter Coupled to Curvature Diffusion for Image Denoising and Sharpening. Imag. and Vis. Comp. 26(11), 1481–1489 (2008)
Resolution Scalable Image Coding with Dyadic Complementary Rational Wavelet Transforms F. Petngang, R.M. Dansereau, and C. Joslin Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada
[email protected], chris
[email protected] Abstract. Traditional wavelet-based image coders use dyadic multiresolution decompositions, resulting in spatial resolutions that are also dyadic in scale, i.e., powers of 1/2 the original resolution. In this paper, we increase the set of decodable spatial resolutions beyond dyadic scales by introducing a two-dimensional wavelet decomposition made of combining wavelet transforms of non-integer dilation factors. We describe how the proposed wavelet decomposition can produce spatial resolutions related to the original resolution by dyadic and non-dyadic ratios, and present how the resulting subband structure can be made compatible with zerotree-based subband coders. Keywords: Image compression, dyadic and non-dyadic spatial scalability, rational dilation wavelet transform.
1
Introduction
In image or video coding, the notion of scalability designates the idea that an image/video can be compressed once and decompressed in various ways to produce images of different visual quality, different spatial resolutions, or videos of different temporal resolutions. With the increasing diversity of media, different display terminal size, frame rates, and bandwidth, development of codecs with support for scalability on multiple dimensions provides an efficient way to accommodate various users from one single compressed file. From the perspective of spatial scalability, multi-resolution decompositions using wavelet transforms are a natural choice. The successive sizes of the approximation subband produced during the wavelet decomposition constitute the native spatial resolutions available at the decoding side of a wavelet-based image/video compression system. As traditionally defined, a two-dimensional wavelet decomposition produces dyadic native spatial resolutions, related to the original size of the image by a factor (1/2)k , where k is an integer. Few works in the literature have investigated the inclusion of spatial scalability by non-dyadic ratios with wavelet-based approaches. In [1], a post-processing stage is used in addition to a full decomposition tree obtained by extending the two-dimensional wavelet decomposition filter bank. While the scheme allows for dyadic and non-dyadic spatial scalability, the post-processing stage at the A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 28–35, 2010. c Springer-Verlag Berlin Heidelberg 2010
Resolution Scalable Image Coding
p
H(z)
q
q-p
G(z)
q
.
. .
.
q
H’(z)
p
q
G’(z)
q-p
29
+ .
. .
.
Analysis stage (forward transform)
Synthesis stage (inverse transform)
Fig. 1. A (p, q) wavelet transform filter bank
decoder is an extra complexity and introduces additional distortion. In [2], M band wavelet filter banks are exploited to produce spatial resolutions obtained by non-dyadic factors. The proposed method, however, requires a different set of synthesis filters to be built for every size ratio. In [3], a wavelet transform with sampling coefficients forming a non-integer ratio is proposed for generating nondyadic spatial resolutions; combining dyadic and non-dyadic spatial scalability within that framework was, however, not attempted. In this paper, we propose using a combination of wavelet transforms of noninteger dilation factors capable of producing native spatial resolutions related to the original resolution by dyadic and non-dyadic ratios. Section 2 briefly reviews wavelet transform filter banks with rational dilation factors. In Section 3, the proposed subband decomposition is introduced and a constraint on the dilation factors is formulated to allow both dyadic and non-dyadic resolutions. An extension is presented in Section 4, and in Section 5 we describe a parent-child tree construction scheme for compatibility with zerotree-based subband coders. Finally, experimental results are presented in Section 6, and conclusions drawn in Section 7.
2
Rational Wavelet Transforms
Fig. 1 illustrates a generalization of the two-band filter bank representation of a wavelet transform to a pair of integer sampling coefficients p and q with p < q. The ratio q/p is defined as the dilation (or sampling) factor of the filter bank. The traditional definition of a wavelet transform, often qualified as dyadic, corresponds to a filter bank with sampling coefficients (p, q) = (1, 2) and a dilation factor q/p = 2. When the sampling coefficients are not divisible, the filter bank is said to have a rational dilation factor and the wavelet transform is termed a rational wavelet transform. The interested reader is referred to [4],[5] for a discussion on wavelet filter banks with rational dilation factors.
3
Combined Rational Wavelet Decomposition for Spatial Scalability with Dyadic and Non-dyadic Ratios
A two-dimensional wavelet decomposition is performed by applying the filter bank illustrated in Fig. 1 along the rows and the columns of the image, producing an approximation subband and three details subbands. Multiple levels of
30
F. Petngang, R.M. Dansereau, and C. Joslin
Original image
approximation subband (w1 x h1)
approximation subband (w2 x h2)
HL2 HL1
HL1 HH2
LH2 LH1 (a)
(b)
HH1
LH1
HH1
(c)
Fig. 2. Illustration of a two-dimensional combined rational wavelet decomposition: a 2D (p1 , q1 ) wavelet transform is applied on the original image, then a 2D (p2 , q2 ) wavelet transform is applied on the resulting approximation subband
decomposition are performed by recursively applying the 2D wavelet transform to the resulting approximation subband. The set of native spatial resolutions produced by a multi-level two-dimensional wavelet decomposition is completely determined by the size ratios between the successive sizes of the approximation subband and the size of the original image. If (w0 , h0 ) is the size of the original image, n levels of 2D decomposition with a wavelet transform of dilation factor q/p results in an approximation subband of size (wn , hn ) = (( pq )n ×w0 , ( pq )n ×h0 ). Hence, with a dilation factor of q/p, the set of native spatial resolutions is determined by the set of size ratios {(p/q)m }, with m = 0, . . . , n. It is clear from this formulation that the two-dimensional wavelet decomposition, as defined, is not capable of producing a set of native spatial resolutions containing both dyadic and non-dyadic ratios, regardless of the value of the dilation factor q/p. However, a different set of native resolutions can be obtained if the multi-level wavelet decomposition alternates between wavelet transforms of different dilation factors, as illustrated in Fig. 2. Such an alternated use of rational wavelet transforms in a multi-level two-dimensional wavelet decomposition is what we denote a combined rational wavelet decomposition. In the rest of the document, this terminology will often be simplified to combined decomposition. The above definition can be generalized to any number of different dilation factors qk /pk . However, without loss of generality, the notion of combined rational wavelet decomposition is described in the current paper for a combination of only two different rational wavelet transforms. Given an original image of size (w0 , h0 ), the combined decomposition illustrated in Fig. 2 produces native spatial resolutions of sizes (w1 , h1 ) = ( pq11 × w0 , pq11 × h0 ) and (w2 , h2 ) = ( pq11 × pq22 × w0 , pq11 × pq22 × h0 ), successively. This constitutes one level of combined decomposition. Multiple levels of combined decomposition are performed by reiterating the scheme on the approximation subband. It can be demonstrated that the set of native spatial resolutions produced by n levels of combined decomposition is determined by the two sets of size ratios {( pq11 )m ( pq22 )m−1 , m = 1, . . . , n} and {( pq11 )m ( pq22 )m , m = 0, . . . , n}. From the latter
Resolution Scalable Image Coding
31
list, we can establish that the set of native spatial resolutions will include resolutions obtained by dyadic ratios if the dilation factors are selected such that (qi /pi ) = 2 . (1) i
An example of dilation factors satisfying this condition is the pair q1 /p1 = 4/3 and q2 /p2 = 3/2. A two-level combined decomposition with wavelet transforms of dilation factors q1 /p1 = 4/3 and q2 /p2 = 3/2 produces the set of size ratios {1, 3/4, 1/2, 3/8, 1/4}. The corresponding native spatial resolutions are indicated in Table 1 for an original image at 4CIF resolution; the result is an extended set of native resolutions when compared to a regular dyadic wavelet decomposition. Table 1. Native spatial resolutions from an original image of size 704 × 576 n=0 Dyadic wavelet decomposition 704 × 576
n=1 352 × 288
n=2 176 × 144
Combined decomposition with 704 × 576 528 × 432 352 × 288 264 × 216 176 × 144 q1 /p1 = 4/3 and q2 /p2 = 3/2
4
A Wavelet Packet Extension to the Combined Rational Wavelet Decomposition
In this section, we propose an extension to the combined rational wavelet decomposition introduced in Section 3. The extension, denoted wavelet packet extension, consists of performing an additional one-dimensional wavelet transform on the horizontal and vertical subbands from the (p1 , q1 ) wavelet decomposition, using the (p2 , q2 ) wavelet transform. This is illustrated in Fig. 3 for the pair of dilation factors q1 /p1 = 4/3 and q2 /p2 = 3/2. It can be observed in Fig. 3 that with this specific pair of dilation factors, the wavelet packet extension results in a subband decomposition similar to a wavelet packet decomposition obtained with the use of a dyadic wavelet transform. This similarity has a significant advantage: it translates in natural compatibility with a wide base of existing wavelet-based image coding frameworks. Indeed, since the support for alternative decompositions structures was added to Part 2 of the JPEG2000 image coding reference standard, many existing frameworks and subband coders developed for wavelet-based image compression have naturally included support for dyadic wavelet packet decomposition structures. Furthermore, with the described wavelet packet extension mode, dyadic wavelet decomposition and combined decomposition can be used together in a multi-level decomposition, and the resulting subband structure can be made compatible with existing subband coders by taking advantage of the similarity with dyadic wavelet packet decompositions. An example is illustrated in Section 5 for subband coders of the zerotree coding family.
32
F. Petngang, R.M. Dansereau, and C. Joslin w/2
HL2 HL1 HH2
LH2
LH1
HH1
additional decomposition of subbands LH1 and HL1
h/2
w/4
w/4
HL2
HL1
h/4
LH2
HH2
HH’1
h/4
LH1
HH’2
HH1
Subbands from one level of combined decomposition
Fig. 3. Illustration of the wavelet packet extension for a combined decomposition with q1 /p1 = 4/3 and q2 /p2 = 3/2
5
Adapting the Parent-Child Tree Construction Scheme for Zerotree-Based Subband Coders
A natural challenge that arises from a new two-dimensional subband decomposition structure is the compatibility with existing subband coders. In this section, we address this issue for subband coders based on a zerotree parent-child structure, of which the widely-used SPIHT [6] is a good example. Zerotree-based subband coders use a parent-child tree structure built among the wavelet subbands. With a dyadic wavelet decomposition (q/p = 2), the construction of the parent-child tree takes advantage of the constant integer ratio of two between the sizes of subbands at consecutive levels of decomposition. In the case of a combined rational wavelet decomposition, an integer ratio in size is not always possible between subbands produced by wavelet transforms of different dilation factors. For example, considering the subband decomposition depicted in Fig. 2, and the pair q1 /p1 = 4/3 and q2 /p2 = 3/2, the ratio in size between the subbands HL1 and HL2 would be 1 along the width and 3/2 along the height. In order to overcome the issue of non-integer ratios, we propose to group subbands by levels of combined decomposition and associate any subband S k obtained after k levels of combined decomposition to the corresponding subband S k+1 obtained after k + 1 levels. If the dilation factors are selected according to (1), the ratio in size between two subbands S k and S k+1 at consecutive levels of combined decomposition is a constant integer value of two. From this derivation, we propose to adapt the parent-child tree construction scheme from the traditional method depicted in Fig. 4(a) to the manner illustrated in Fig. 4(b). The parallel between the two schemes is clearly visible. Fig. 4(c) on the other hand illustrates how the parent-child tree can be built when multi-level combined decomposition and multi-level dyadic wavelet decomposition are used in conjunction. The illustrated method considers the special case where q1 /p1 = 4/3, q2 /p2 = 3/2, and the wavelet packet extension is used for the combined decomposition. The similarity with dyadic wavelet packet decomposition structures is
Resolution Scalable Image Coding
..
..
..
.. ..
(a) two-level dyadic decomposition
..
..
(b) two-level combined decomposition
.. .. .. .. .. .. .. .. .. ..
.... .... .... ....
.... .... .... ....
.... ....
.... ....
.... ....
.... ....
.... ....
.
..
. .. ..
.... .... .... ....
.
..
. . . . . .
.... .... .... ....
.
33
(c) two-level combined decomposition + two-level dyadic decomposition
Fig. 4. Adaptation of the parent-child tree construction scheme
taken advantage of, notably in the transition between the lowest level of dyadic wavelet decomposition and the highest level of combined decomposition.
6
Experimental Results
We implemented the combined decomposition with wavelet transforms of dilation factors q1 /p1 = 4/3 and q2 /p2 = 3/2 in a wavelet-based image compression system with support for spatial scalability. The subband coding stage of the compression is done with an adaptation of SPIHT [6] capable of producing resolution-scalable bitstreams [7], and added the ability to specify a bitstream length for every spatial resolution of interest in the bitstream. The encoded image is decoded at full size resolution and at 3/4 the original size using the native spatial resolution available at the decoder. A similar image coder is built using only dyadic wavelet decomposition. Because a native spatial resolution corresponding to 3/4 the original size of the image is not available at the decoder, two configurations are defined: 1. the image is to be decoded at half-size resolution using the native resolution available, and then interpolated to 3/4 the original size of the image. 2. the image is to be decoded at full size resolution, and then decimated to an image of 3/4 the original size. We denote these two configurations of the dyadic wavelet-based image coder respectively as B1 and B2, while A designates our image coder built with combined decomposition. Four levels of dyadic wavelet decomposition are specified for the image coders B1 and B2. The image coder A uses two levels of combined decomposition followed by two levels of dyadic wavelet decomposition; our experimentations have shown that this produces slightly better results than the use of four levels of combined decomposition. The CDF9/7 filters are used for the dyadic wavelet transforms, and the rational wavelet transforms are performed using wavelet filters of four vanishing moments obtained from [5]. Finally, for all three image coders, the subband coder imposes 3/4 of the full bit budget to be assigned to the data needed to obtain the image at 3/4 the original size.
F. Petngang, R.M. Dansereau, and C. Joslin
0,96 0,94 0,92 0,9 0,88 0,86 0,84 0,82
Boat, 3/4 the original size
A B1 B2 0,3
0,4 0,5
0,6
0,7
0,8 0,9
SSIM
SSIM
34
1
0,92 0,9 0,88 0,86 0,84 0,82 0,8 0,78
bitrate for the full bit budget (in bpp)
0,94
0,4 0,5
0,6
0,7
0,8 0,9
1
Cafe, full size resolution
0,92 0,88
A B1 B2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 bitrate for the full bit budget (in bpp)
SSIM
SSIM
0,74
0,3
Cafe, 3/4 the original size
0,86
0,78
A B1 B2
bitrate for the full bit budget (in bpp)
0,9
0,82
Boat, full size resolution
0,84 0,8 A B1 B2
0,76 0,72 0,68
0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 bitrate for the full bit budget (in bpp)
Fig. 5. Experimental results
Experiments are run with the images Boat of size 512 × 512 and Cafe of size 1024 × 1280 (a half-size version of the popular JPEG2000 test image). Bitrates for the full bit budget range from 0.3 to 1 bits per pixel, and the quality of the decoded images is measured using the Structural Similarity Index Metric (SSIM). A resized version of the original image is used as reference for images of lower spatial resolution. Bicubic interpolation/decimation is used to perform resizing operations for B1/B2 outputs. Fig. 5 illustrates the objective quality scores obtained with the different sets of images. The results display a gain in quality when using the image coder with combined decomposition compared to the alternative of interpolating the lower dyadic resolution of a dyadic wavelet-based image coder. Furthermore, across the range of bitrates, visual quality remains competitive against the second alternative of decimating the higher dyadic resolution. Quality scores at full size resolution illustrate a tradeoff of having extended spatial scalability through combined rational wavelet decomposition, as the images decoded at full size resolution are shown to be of lesser quality than the outputs from the dyadic wavelet-based image coder. The apparent drop in quality at higher bitrates for full size images obtained from the image coder B1 can be explained by the fact that the subband coder assigns 3/4 of the bit budget to the data needed for the half-size image used to obtain an image at 3/4 the original size.
Resolution Scalable Image Coding
7
35
Conclusions
In this paper, we have introduced a multi-level wavelet decomposition structure made of combining rational wavelet transforms of different dilation factors. We’ve shown that with a proper selection of dilation factors, the proposed decomposition allows for spatial scalability with dyadic and non-dyadic ratios, with minimal impact on coding efficiency. Future work on the subject will include a more extensive performance analysis with various sets of wavelet filters for the rational wavelet transforms, and a study on the impact of combining different wavelet transforms in a multi-level subband decomposition. Acknowledgements. The authors would like to thank Nortel and the Natural Sciences and Engineering Research Council of Canada for funding this research.
References 1. Nakachi, T., Sawabe, T., Suzuki, J., Fujii, T.: A study on non-octave scalable image coding and its performance evaluation using digital cinema test material. IEICE Trans. Fundamentals E89-A(9), 2405–2414 (2006) 2. Pau, G., Pesquet-Popescu, A., Piella, G.: Modified M-band synthesis filter bank for fractional scalability of images. IEEE Signal Processing Letters 13(6), 345–348 (2006) 3. Xiong, R., Xu, J., Wu, F.: A lifting-based wavelet transform supporting non-dyadic spatial scalability. In: IEEE International Conference on Image Processing, October 2006, pp. 1861–1864 (2006) 4. Blu, T.: A new design algorithm for two-band orthonormal rational filter banks and orthonormal rational wavelets. IEEE Trans. on Signal Processing 46(6), 1494–1504 (1998) 5. Bayram, I., Selesnick, I.W.: Orthonormal FBs with rational sampling factors and oversampled DFT-modulated FBs: A connection with filter design. IEEE Trans. on Signal Processing 57(7), 2515–2526 (2009) 6. Said, A., Pearlman, W.: A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. on Circuits and Systems for Video Technology 6(3), 243–250 (1996), doi:10.1109/76.499834 7. Christophe, E., Pearlman, W.A.: Three-dimensional SPIHT coding of volume images with random access and resolution scalability. EURASIP Journal on Image and Video Processing 2008 (2008)
LMMSE-Based Image Denoising in Nonsubsampled Contourlet Transform Domain Md. Foisal Hossain1, Mohammad Reza Alsharif1, and Katsumi Yamashita2 1
Department of Information Engineering, University of the Ryukyus, Okinawa, Japan
[email protected],
[email protected] 2 Graduate School of Engineering, Osaka Prefecture University, Osaka, Japan
[email protected] Abstract. This paper proposes a nonsubsampled contourlet transform (NSCT) based multiscale linear minimum mean square-error estimation (LMMSE) scheme for image denoising. The contourlet transform is a new extension of the wavelet transform that provides a multi-resolution and multi-direction analysis for two dimension images. The NSCT expansion is composed of basis images oriented at various directions in multiple scales, with flexible aspect ratios. Given this rich set of basis images, the NSCT transform effectively captures smooth contours that are the dominant feature in natural images. To investigate the strong interscale dependencies of NSCT, we combine pixels at the same spatial location across scales as a vector and apply LMMSE to the vector. Experimental results show that the proposed approach outperforms wavelet method and contourlet based method both visually and in terns of the peak signal to noise ratio (PSNR) values at most cases. Keywords: Denoising, NSCT, LMMSE, PSNR.
1 Introduction An image is often corrupted by noise in its acquisition and transmission. Image denoising is used to remove the additive noise while retaining as much as possible the important signal features. In the recent years there has been a fair amount of research on wavelet thresholding and threshold selection for signal de-noising [1], [2]-[9]. However, wavelets are less effective for images where singularities are located both in space and directions. Initially, the wavelet transform was considered to be a good decorrelator for images, and thus wavelet coefficients were assumed to be independent and were simply modeled using marginal statistics [10]. However, wavelet coefficients of natural images exhibit strong dependencies both across scales and between neighboring coefficients within a subband, especially around image edges. This gave rise to several successful joint statistical models in the wavelet domain [10]-[13]. One important feature of a transform is its stability with respect to shifts of the input signal. Shift invariance is very important in image denoising by thresholding because lack of shift invariance causes pseudo-Gibbs phenomena around singularities A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 36–43, 2010. © Springer-Verlag Berlin Heidelberg 2010
LMMSE-Based Image Denoising in Nonsubsampled Contourlet Transform Domain
37
[7]. In addition to shift invariance, it is necessary that an efficient image representation should have the ability to capture geometrical structure exists in natural images. Contourlet transform [14] is a multidirectional and multiscale transform that is constructed by combining the Laplacian pyramid [15],[16] with directional filter bank (DFB), can be used to capture geometrical properties of images. However, due to downsamplers and upsamplers present in both the Laplacian pyramid and the DFB, the contourlet transform is not shift-invariant. Hence, some new transforms have been introduced to solve this problem. Cunha and Do [17], developed nonsubsampled contourlet transform (NSCT), which is a fully shift-invariant version of contourlet transform, as well as multiscale, and multidirectional expansion. The NSCT is based on a nonsubsampled pyramid structure and nonsubsampled directional filter banks. NSCT transform allows for different and flexible number of directions at each scale, while achieving nearly critical sampling. Although NSCT well decorrelates signals, strong interscale dependencies between NSCT coefficients still exist. Efficient modeling and exploiting such dependencies can significantly improve denoising performance. Denoising approach based on LMMSE considering interscale dependencies in wavelet domain was presented in [18]-[20]. However, the major drawback for wavelets in two-dimensions is their limited ability in capturing directional information. In this paper, an LMMSE-based denoising approach with an interscale model is presented by using nonsubsampled contourlet transform (NSCT). NSCT do not have any downsampling in the decomposition and each NSCT subband has the same number of coefficients as the input image. We combine the NSCT coefficients with the same spatial location across adjacent scales as vector, to which LMMSE is then applied. Such an operation naturally incorporates the interscale dependencies of NSCT coefficients to improve the estimation. Experimental results show that the proposed approach outperforms other denoising methods in terms of visual quality. The rest of the paper is organized as follows. Section 2 introduces NSCT in brief. The proposed LMMSE-based denoising approach is described in section 3. Experimental results are presented in section 4. Finally a conclusion is presented in section 5.
2 Construction of Nonsubsampled Contourlet Transform The NSCT is a fully shift-invariant, multi-scale, and multidirectional expansion that has a fast implementation. Overview of the NSCT is shown in Figure 1(a). The structure consists in a bank of filters that splits the 2-D frequency plane in the subbands shown in Figure 1(b). NSCT transform can thus be divided into two shiftinvariant parts which are as follow: a nonsubsampled pyramid structure that ensures the multi-scale property and a nonsubsampled DFB structure that gives directionality. The contourlet transform is implemented via a two dimensional filter bank that decomposes an image into several directional subbands at multiple scales. To achieve this, the nonsubsampled contourlet transform is built upon nonsubsampled pyramids and nonsubsampled DFB [17]. First, a nonsubsampled pyramid split the input into a
38
M.F. Hossain, M.R. Alsharif, and K. Yamashita
low pass subband and a high pass subband. Then a nonsubsampled DFB decomposes the high pass subband into several directional subbands. The scheme is iterated repeatedly on the low pass subband outputs of nonsubsampled pyramids. The multiscale property of the NSCT is obtained from a shift invariant filtering structure that achieves subband decomposition similar to that of the Laplacian pyramid. The block diagram of the nonsubsampled pyramid is shown in Figure 2(a) and subband on the 2-D frequency plane is shown in Figure 2(b). (ʌ,ʌ)
Ȧ2
Ȧ1
(-ʌ,-ʌ)
(a)
(b)
Fig. 1. Nonsubsampled contourlet transform. (a) NSFB structure that implements the NSCT. (b) Obtained idealized frequency partitioning. H0(z4I) y0
H0(z2I)
(ʌ,ʌ)
Ȧ2 H1(z4I)
H0(z)
3 y1
H1(z2I)
x
0
y2
1
2 Ȧ1
H1(z) y3
(-ʌ,-ʌ) (a)
(b)
Fig. 2. Nonsubsampled pyramid is a 2-D multiresolution expansion. (a) Three stage pyramid decomposition. (b) Subbands on the 2-D frequency plane.
LMMSE-Based Image Denoising in Nonsubsampled Contourlet Transform Domain
39
Specifically, NSFB can be built from low-pass filter H0(z). Then we need to set H1(z)=1-H0(z) .
(1)
Perfect reconstruction condition is given as follows:
H 0 ( z )G0 (z ) + H1 (z )G1 (z ) = 1 .
(2)
This condition is much easier to satisfy than the perfect reconstruction condition for critically sampled filter banks, and thus allows better filters to be designed. The nonsubsampled DFB is a shift-invariant version of the critically sampled DFB in the contourlet transform. Nonsubsampled pyramids provide multiscale decomposition and nonsubsampled DFB’s provide directional decomposition. The building block of a nonsubsampled DFB is also a two-channel nonsubsampled filter bank. However, shift invariance property is obtained by eliminating the downsamplers and upsamplers if the DFB.
3 LMMSE Based Denoising NSCT is translation invariant and hence do not cause visual artifacts in threshold based denoising. For this the denoising scheme presented in this paper adopts NSCT. Suppose the original signal f is corrupted with additive Gaussian white noise ε
g = f +ε . where
(3)
ε ∈ N (0,σ 2 ) . Applying NSCT to the noisy signal g, at scale j yields wj = x j + v j .
(4)
where wj is coefficients at scale j, xj and vj are the expansions of f and ε, respectively. Suppose the variance of vj is σj2 and that of xj and vj are both zero mean, the LMMSE of xj is
xˆ j = c′.w j .
(5)
with
c=
σ x2
j
σ x2 + σ 2j j
. (6)
The factor c is always less than 1, thus the magnitude of estimated NSCT coefficients would be less than that of wj.
K xˆ j is obtained, only the component xˆ j is extracted. K Estimation of xˆ j +1 would be obtained form the LMMSE result xˆ j +1 . After the LMMSE result
40
M.F. Hossain, M.R. Alsharif, and K. Yamashita
4 Experimental Results To evaluate the performance of the proposed approach this section compares the proposed scheme with other popular denoising schemes: wavelet based hardthresholding (WH), wavelet based soft-thresholding (WS). The three benchmark images named Lena, Barbara and Cameraman with dimension 512×512 are used for the experiments. The noisy images are obtained by adding Gaussian white noise to the noise free image. Figure 3 and Figure 4 show the denoising performance of the above mentioned methods using the experimented images. From these figures, it can be observed that the proposed approach is better than the standard denoising methods.
(a)
(e)
(b)
(c)
(f)
(d)
(g)
(h)
Fig. 3. Denoising results over Lena image (a) Original image with dimension 512×512; (b) noisy image (σ=20); (c) denoised image over b; (d) noisy image (σ=25); (e) denoised image over d; (f) noisy image (σ=30); (g) denoised image using soft wavelet threshold over f; (h) denoised image over f using the proposed method
In order to evaluate the performance of the proposed method numerically, we measured peak signal to noise ration (PSNR), which is defined as: PSNR = 10 log10
2 I max
1 MN
.
∑ ( X (i, j ) − Y (i, j ))
2
(7)
i, j
where Imax is the maximum gray level value of the image, Xi,j is the gray value of pixel at (i,j) of the original image, Yi,j is the gray value of pixel at (i,j) of the denoised image, and M and N are the numbers of row and column of the image respectively.
LMMSE-Based Image Denoising in Nonsubsampled Contourlet Transform Domain
(a)
(b)
(e)
(f)
(c)
(g)
41
(d)
(h)
Fig. 4. Denoising results over Barbara image (a) Original image with dimension 512×512; (b) noisy image (σ=20); (c) denoised image over b; (d) noisy image (σ=25); (e) denoised image over d; (f) noisy image (σ=30); (g) denoised image using soft wavelet threshold over f; (h) denoised image over f using the proposed method Table 1. PSNR comparison of different methods over experimented images with different noise level
Image
Lena Cameraman
Barbara
Noise level ı 20 25 30 20 25 30 20 25 30
Wavelet hard threshold 31.42 30.71 29.16 28.42 27.22 26.36 23.26 23.12 22.36
Wavelet soft threshold 31.67 30.76 29.37 28.46 27.29 26.41 23.72 23.37 22.47
Proposed method 34.76 32.87 32.16 30.64 29.12 27.83 27.57 27.16 26.73
Table 1 list the PSNR results of the three methods on the experimented images corrupted by different levels of additive Gaussian noise. Observing the table, it can be grasped that the proposed approach outperforms other denoising methods.
5 Conclusion In this paper, we have presented LMMSE-based denoising approach with NSCT interscale model. The NSCT is a fully shift-invariant, multi-scale, and
42
M.F. Hossain, M.R. Alsharif, and K. Yamashita
multidirectional expansion. The NSCT coefficients at the same spatial locations at two adjacent scales are presented as a vector and LMMSE is applied to the vector. The NSCT interscale dependencies are thus exploited to improve the signal estimation. The proposed method gives highest PSNR values than other methods. It is ascertained from the experimental results that the proposed LMMSE based denoising approach outperforms other denoising methods.
References 1. Donoho, D.L.: De-Noising by Soft Thresholding. IEEE Trans. Info. Theory 43, 933–936 (1993) 2. Grace Chang, S., Yu, B., Vattereli, M.: Adaptive Wavelet Thresholding for Image Denoising and Compression. IEEE Trans. Image Processing 9, 1532–1546 (2000) 3. Donoho, D.L., Johnstone, I.M.: Adapting to Unknown Smoothness via Wavelet Shrinkage. Journal of American Statistical Assoc. 90(432), 1200–1224 (1995) 4. Grace Chang, S., Yu, B., Vattereli, M.: Wavelet Thresholding for Multiple Noisy Image Copies. IEEE Trans. Image Processing 9, 1631–1635 (2000) 5. Grace Chang, S., Yu, B., Vattereli, M.: Spatially Adaptive Wavelet Thresholding with Context Modeling for Image Denoising. IEEE Trans. Image Processing 9, 1522–1530 (2000) 6. Lang, M., Guo, H., Odegard, J.E.: Noise reduction Using Undecimated Discrete wavelet transform. IEEE Signal Processing Letters (1995) 7. Coifman, R.R., Donoho, L.D.: Translation Invariant De-noising. In: Antoniadis, A., Oppenheimm, G. (eds.) Wavelet and Statistics, pp. 125–150. Springer, Berlin (1995) 8. Bao, P., Zhang, L.: Noise Reduction for Magnetic Resonance Images via Adaptive multiscale Products Thresholding. IEEE Trans. Medical Imaging 22(9), 1089–1099 (2003) 9. Mallat, S.G.: A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989) 10. Romberg, J.K., Choi, H., Baraniuk, R.G.: Bayesian Tree-Structured Image Modeling Using Wavelet-Domain Hidden Markov Models. IEEE Trans. Image Proc. 10(7), 1056– 1068 (2001) 11. Crouse, M., Nowak, R.D., Baraniuk, R.G.: Wavelet-Based Signal Processing Using Hidden Markov Models. IEEE Trans. Signal Proc. 46, 886–902 (1998) 12. Wainwright, M.J., Simoncelli, E.P., Willsky, A.S.: Random Cascades on Wavelet Trees and Their Use in Modeling and Analyzing Natural Images. Journal of Appl. and Comput. Harmonic Analysis 11, 89–123 (2001) 13. Do, M.N., Vetterli, M.: The contourlet transform: An efficient directional multiresolution image representation. IEEE Trans. Image Process. 14(12), 2091–2106 (2005) 14. Burt, P.J., Adelson, E.H.: The Laplacian pyramid as a compact image code. IEEE Trans. commun. 31(4), 532–540 (1983) 15. Do, M.N., Vetterli, M.: Framing pyramids. IEEE Trans. Signal Process. 51(9), 2329–2342 (2003) 16. Cunha, L., Zhou, J., Do, M.N.: The nonsubsampled contourlet transform: theory, design and applications. IEEE Trans. Image Proc. 15(10), 3089–3101 (2006)
LMMSE-Based Image Denoising in Nonsubsampled Contourlet Transform Domain
43
17. Mihçak, M.K., Kozintsev, I., Ramchandran, K., Moulin, P.: Low Complexity Image Denoising Based on Statistical Modeling of Wavelet Coefficients. IEEE Signal Process. Lett. 6(12), 300–303 (1999) 18. Li, X., Orchard, M.: Spatially Adaptive Image Denoising Under Overcomplete Expansion. In: Int. Conf. Image Process., Vancouver, Canada, pp. 300–303 (2000) 19. Zhang, L., Bao, P., Xiaolin, W.: Multiscale LMMSE-Based Image Denoising with Optimal Wavelet Selection. IEEE Trans. on Circuits and Systems for Video Technology 15(4), 469–481 (2005)
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic Jamal Saeedi and Ali Abedi Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran {jamal_saeidi,ali_abedi}@aut.ac.ir
Abstract. In this paper, we propose a new wavelet shrinkage algorithm based on fuzzy logic for multi-channel image denoising. In particular, intra-scale dependency within wavelet coefficients is modeled using a fuzzy feature. This feature space distinguishes between important coefficients, which belong to image discontinuity and noisy coefficients. Besides this fuzzy feature, we use inter-relation between different channels for improving the denoising performance compared to denoising each channel, separately. Then, we use the Takagi-Sugeno model based on two fuzzy features for shrinking wavelet coefficients. We examine our multi-channel image denoising algorithm in the dual-tree discrete wavelet transform domain, which is the new shiftable and modified version of discrete wavelet transform. Extensive comparisons with the state-of-the-art image denoising algorithms indicate that our image denoising algorithm has a better performance in noise suppression and edge preservation. Keywords: Dual-tree discrete wavelet transform; Fuzzy membership function; Multi-channel image.
1 Introduction Denoising has become an essential step in image analysis. Indeed, due to sensor imperfections, transmission channels defects, as well as physical constraints, noise weakens the quality of almost every acquired image. Reducing the noise level is the main goal of an image denoising algorithm, while preserving the image features (such as edges, textures, etc.). The multi-resolution analysis performed by the wavelet transform has been shown to be a powerful tool to achieve these goals [1]. Indeed, in the wavelet domain, the noise is uniformly spread throughout the coefficients, while most of the image information is concentrated in the few largest ones. Classical wavelet-based denoising methods consist of three steps: (a) Compute the discrete wavelet transform (DWT), (b) Remove noise from the wavelet coefficients, and (c) Reconstruct the enhanced image by using the inverse DWT. In this paper, we consider image with C channels. Typically, C is equal to three channels for RGB images, but for biological (fluorescence) and multi-channel satellite images, C might be larger. We denote these multi-channel images by: X = [x(i, j ,1), x(i , j ,2 ) ... x(i, j , C )] A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 44–53, 2010. © Springer-Verlag Berlin Heidelberg 2010
(1)
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic
45
These images are corrupted by an additive Gaussian white noise following a normal law defined by a zero mean and a known σ2 variance, that is n ~ N 0, σ 2 :
(
N = [n(i, j ,1), n(i , j ,2 ) ... n(i, j , C )]
)
(2)
We denote the resulting noisy image by: Y =X +N
(3)
Due to the linearity of the wavelet transform, additive noise in the image domain remains additive in the transform domain. If ys , d (i, j ) and x s,d (i, j ) denote the noisy and the noise-free wavelet coefficients of scale s and orientation d, respectively, then we can model the additive noise in the transform domain as: y s , d (i , j , c ) = x s .d (i , j , c ) + n s , d (i , j , c )
(4)
The easiest method for denoising multi-channel image is simply to apply an existing denoising algorithm separately in each channel. However, this solution is beyond optimal, due to the presence of strong common information between the various channels. In literatures, inter-channel information in both spatial and wavelet domain is used for multi-channel image denoising. S. Schulte et al used a fuzzy filter in spatial domain for color image denoising. The filter consists of two sub-filters. The first subfilter computes the fuzzy distance between the color components of the central pixel and its neighborhood. These distances determine to what degree each component must be corrected. The second sub-filter corrects pixels where the color components differences are corrupted so much that they appear as outliers in comparison to their environment [2]. Luisier’s et al. multi-channel Sure-Let work follows their SURELET approach [3], where the denoising algorithm is parameterized as a linear expansion of thresholds (LET) and optimized using Stein’s unbiased risk estimate (SURE). The proposed wavelet thresholding function is point-wise and depends on the coefficients of same location in the other channels, as well as on their parents in the coarser wavelet sub-band [4]. Pizurica et al. have also adapted their original ProbShrink by including the inter-channel dependencies in the definition of their local spatial activity indicator [5]. In image denoising, where a trade-off between noise suppression and the maintenance of actual image discontinuity must be made, solutions are required to detect important image details and accordingly adapt the degree of noise smoothing. We always postulate that noise is uncorrelated in the spatial domain; it is also uncorrelated in the wavelet domain. With respect to this principle, we use a fuzzy feature to enhance image information in wavelet sub-bands. This feature space distinguishes between important coefficients, which belong to image discontinuity and noisy coefficients. Besides this fuzzy feature, we use inter-relation between different channels for improving the denoising performance compared to denoising each channel, separately. Then, we use the Takagi-Sugeno model [6] based on two fuzzy features for shrinking wavelet coefficients. In addition, we examine our image denoising algorithm in the dual-tree DWT (DTDWT), which provides both shiftable sub-bands and good directional selectivity and low redundancy [7].
46
J. Saeedi and A. Abedi
This paper is structured as follows: Section 2 presents proposed multi-channel image denoising algorithm. Section 3 gives results and the comparisons. Finally, we conclude with a brief summary in Section 4.
2 Multi-Channel Image Denoising In this Section, we describe our multi-channel image denoising algorithm. First, the input noisy image is decomposed using DT-DWT. After extracting two fuzzy features from different sub-bands at different scales, wavelet coefficients are shrunk using a simple fuzzy rule. Finally, inverse DT-DWT of modified wavelet coefficients generates denoised image. 2.1 Single and Multi-Channel Fuzzy Features
In multi-channel images, different channels are correlated: an image discontinuity from one channel is probably to occur in at least some of remaining channels. As for this reason and uncorrelated nature of noise, we use a new feature for denoising of multi-channel images. Improving wavelet coefficients’ information in the shrinkage step is the main idea. The first feature is obtained using a nonlinear averaging filter in the wavelet subbands of each single channel. We want to give large weights to neighboring coefficients with similar magnitude, and vice versa. The larger coefficients, which produced by noise are always isolated or unconnected, but edge coefficients are clustered and persistence. It is well known that the adjacent points are more similar in magnitude. Therefore, we use a fuzzy function m(l , k , c ) of magnitude similarity and a fuzzy function s(l , k , c ) of spatial similarity, which is defined as: ⎛ ⎛ y (i , j , c ) − y (i + l , j + k , c ) ⎞ 2 ⎞ ⎜ s,d s,d ⎟ ⎟⎟ m (l , k , c ) = exp ⎜ − ⎜⎜ ⎟ ⎟ Thr ⎜ ⎝ ⎠ ⎠ ⎝
⎛ ⎛ l2 + k2 s (l , k , c ) = exp ⎜ − ⎜ ⎜ ⎜ N ⎝ ⎝
where
y s , d (i, j , c )
and
y s ,d (i + l , j + k , c )
⎞⎞ ⎟⎟ ⎟⎟ ⎠⎠
(5)
(6)
are central coefficient and neighbor
coefficients in wavelet sub-bands of channel c, respectively. Thr = K × σˆ nc , 3 ≤ K ≤ 4 , σˆ nc is estimated noise variance of channel c using median estimator, and N is the number of coefficients in the local window k ∈ [− K ...K ] , and l ∈ [− L...L ] . According the two fuzzy functions, we can get adaptive weight w(l , k , c ) for each neighboring coefficient in each channel: w (l , k , c ) = m (l , k , c )× s (l , k , c )
(7)
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic
47
Using the adaptive weights w (l , k , c ) , we can obtain the first fuzzy feature for each coefficient in wavelet sub-bands as follows: L
K
∑ ∑ w (l , k , c ) × y s , d (i + l , j + k , c )
f1 ( i , j , c ) = l = − L k = − K
L
(8)
K
∑ ∑
w (l , k , c )
l=−L k =−K
In addition, the second fuzzy feature that we have used for multi-channel image denoising is defined as: C ⎛
⎜
f 2 (i , j ) =
L
K
⎛ C
⎞
⎞ ⎟
∑ ⎜⎜ ∑ ∑ ⎜⎜⎜ ∏ w (l , k , c ) ⎟⎟⎟ × y s , d (i + l , j + k , c ) ⎟⎟ ⎠ ⎞ ⎛ C ⎟ ⎜ C× w (l , k , c ) ⎟ ⎜ ⎟ ⎜ l = − L k = − K ⎝ c =1 ⎠
c =1 ⎝ l = − L k = − K ⎝ c =1 L K
⎠
(9)
∑ ∑ ∏
The second fuzzy feature equally is used to shrink wavelet coefficients in each channel. In fact, for denoising one channel, the second feature simultaneously takes the remaining channels’ information into account in the denoising process. Then, we use the Takagi-Sugeno model based on the two fuzzy features for shrinking wavelet coefficients [7]. 2.2 Fuzzy Shrinkage Rule
Having the fuzzy features, we will define Linguistic IF-THEN rules for shrinking wavelet coefficients in each channel as follows: IF f1 (i, j, c) (the first fuzzy feature) is large AND f 2 (i, j ) (the second fuzzy feature) is large THEN shrinkage of wavelet coefficient in channel c i.e. y s, d (i, j , c) is small.
The idea behind this simple fuzzy rule is to assign small shrinkage weights to the wavelet coefficients, which have similar values to the neighbors and corresponding coefficients in the remaining channels. For color images, it is important to treat pixels as color components not as three separate colors. When only the separate channels are considered, more artifacts are introduced. For building standard rule from linguistic rule, each linguistic value is represented by a membership function. The membership functions that are used to represent the two fuzzy sets of large f1 (i, j, c) (the first fuzzy feature) and large f 2 (i, j ) (the second fuzzy feature), are denoted as μ ( f1 (i, j , c) ) and μ ( f 2 (i , j ) ) , respectively. Here, we use the spline-based curve as fuzzy membership function, which is a mapping on the vector x, and is named because of its S-shape. The parameters T1 and T2 locate the extremes of the sloped portion of the curve, as given by:
48
J. Saeedi and A. Abedi
⎧ 0 ⎪ 2 ⎪ 2 × ⎛⎜ x − T1 ⎞⎟ ⎜T −T ⎟ ⎪⎪ 1 ⎠ ⎝ 2 μ (x) = ⎨ 2 ⎛ ⎞ ⎪ x ⎜ ⎟ − × − T 1 2 ⎪ ⎜ 2 T −T ⎟ 2 1 ⎠ ⎝ ⎪ ⎩⎪1
x ≤ T1 T1 < x ≤
T1 + T 2 2
T1 + T 2 ≤ x < T2 2 x ≥ T2
(10)
For the pair of values {μ ( f1 (i, j , c) ), μ ( f 2 (i, j ) )} , the degree of satisfaction of the antecedent part of the rule determines the firing strength of the rule: τ ( i , j , c ) = f1 ( i , j , c ) AND
f 2 (i, j )
(11)
The AND operation is typically implemented as minimum, but any other triangular norms (t-norm) may be used (such as algebraic product, weak, and bounded sum). Originally, t-norms appeared in the context of probabilistic metric spaces [8]. Then they were used as a natural interpretation of the conjunction in the semantics of mathematical fuzzy logics and they are used to combine criteria in multi-criteria decision making [9]. In such applications, the minimum or product t-norm are usually used because of a lack of motivation for other t-norms [8]. We have chosen algebraic product for the AND operation, because it is the standard semantics for strong conjunction in product fuzzy logic and is continues as a function. Finally, the estimated noise free signal is obtained using the following formula: xˆ s , d (i , j , c ) = τ (i , j , c ) × y s , d (i , j , c )
(12)
where s and d are scale and orientation of DT-DWT sub-bands, and τ (i, j , c ) is obtained using (11). In addition, for building fuzzy membership function we used test images in different noise levels, to obtain best values for the thresholds. 2.3 Post-processing
Processing artifact usually result from a modification of the spatial correlation between wavelet coefficients (often caused by zeroing of small neighboring coefficients) or using DWT, which is shift invariance and will cause some visual artifacts (such as Gibbs phenomena) in thresholding-based denoising. For this reason, we use a fuzzy filter on the results of our fuzzy-shrink algorithm to reduce artifacts to some extent. First, we use a window of size (2 L + 1)× (2 K + 1) centered at (i, j ) to filter the current image pixel at position (i, j ) . Next, we calculate the similarity of neighboring pixels to the center pixel using following formula: 2⎞ ⎛ ⎛ ˆ ⎜ ⎜ X (i , j , c ) − Xˆ (i + l , j + k , c ) ⎞⎟ ⎟ m (l , k ) = exp ⎜ − ⎟ ⎟⎟ Thr ⎜ ⎜⎝ ⎠ ⎠ c =1 ⎝ C
∏
(13)
where Xˆ (i, j , c ) and Xˆ (i + l , j + k , c ) are central pixel and neighbor pixels in the denoised image of channel c, respectively. N is the number of pixels in the local window k ∈ [− K ...K ] , and l ∈ [− L...L ] , and 2.55 < Thr < 7.65 .
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic
49
According the fuzzy feature, we can get adaptive weight w(l , k ) for each neighboring coefficient: w (l , k ) = m (l , k )× s (l , k )
(14)
where s (l , k ) is obtained using (6). Finally, the output of post-processing step is determined as follows: L
~ X (i , j , c ) =
K
∑ ∑ w (l , k ) × Xˆ ( i + l , j + k , c )
l=−L k =−K
L
(15)
K
∑ ∑
w (l , k )
l=−L k =−K
where Xˆ is the denoised image, which is obtained via our fuzzy-shrink algorithm.
3 Experimental Results In the satellite systems, it may be desirable to perform denoising before the image compression step in order to improve the compression efficiency. Therefore, we use both color and multi-spectral satellite images for experimental dataset, which was captured by simulated additive Gaussian white noise. In this section, we compare our multi-channel fuzzy denoising method with three state-of-the-art wavelet-based techniques: Pizurica’s multi-channel ProbShrink [5] and Luisier’s et al. multi-channel Sure-Let [4]. In addition, we compare our multichannel image denoising algorithm with the Portilla’s BLS-GSM [10], which denoised each channel, separately. It should be mentioned that for comparing ProbShrink [5] and Sure-Let [4] methods, which are proposed in the discrete wavelet transform domain, we use a critically sampled orthonormal wavelet basis with eight vanishing moments (sym8) over four decomposition stages in our fuzzy-shrink method. On the other hand, BLSGSM method has used a redundant wavelet transform (an 8-orientations full steerable pyramid [10]). Therefore, for a fair comparison between BLS-GSM and our fuzzyshrink methods, we use dual-tree discreet wavelet transform over four decomposition stages for wavelet analysis. We measured the experimental results by the peak signalto-noise ratio (PSNR) in decibels (dB), objectively: ⎛ 255 2 PSNR = 10 × log 10 ⎜ ⎜ MSE ⎝
⎞ ⎟ ⎟ ⎠
(16)
where MSE
=
1 C ×M × N
⎛ ⎜ ⎜⎜ c =1 ⎝
⎞
~ ∑ ∑ ∑ (X ( i , j , c ) − X ( i , j , c ) )⎟⎟⎟ C
M
N
i =1 j =1
2
(17)
⎠
~ where X is the original image, X is the estimated noise free signal, and M×N is the number of pixels.
50
J. Saeedi and A. Abedi Table 1. Comparison of multi-channel image denoising algorithms σ n Method Input PSNR Prob-Shrink-M (DWT) Sure-let-M (DWT) Fuzzy-Shrink-M (DWT)
5
15
25
30
50
5
15
25
30
50
34.15 35.61
24.69 31.40
Lena 512×512 20.37 18.86 29.31 28.51
37.86
32.40
29.60
28.64
26.20
36.62
31.79
29.16
28.28
25.61
36.81 36.75
32.08 32.65
30.01 30.60
29.31 29.80
26.77 27.14
36.13 36.05
31.26 31.53
29.17 29.61
28.49 28.95
25.86 26.15
37.11
32.72
30.55
29.95
27.89
36.12
31.63
29.86
29.03
26.98
37.56 37.42
33.16 33.29
27.76 27.92
36.57 36.41
34.15 33.11
24.66 26.44
31.17 30.35 31.28 30.46 Baboon 512×512 20.33 18.79 23.73 22.89
14.75 20.94
34.26 33.29
31.94 30.01 29.21 26.64 32.23 30.24 29.45 26.78 Southern California 512×512 24.81 20.44 18.92 14.88 27.13 24.31 23.43 21.06
Sure-let-M (DWT) Fuzzy-Shrink-M (DWT)
34.46
26.82
24.03
23.27
21.43
33.36
27.74
24.95
24.05
21.66
34.01 33.86
27.25 27.44
24.41 24.86
23.47 23.96
21.49 21.79
35.21 35.35
28.25 28.41
25.31 25.64
24.41 24.76
21.99 22.42
BLS-GSM Best Redundant Fuzzy-Shrink-M (DT-DWT)
35.01
27.56
24.75
24.01
21.62
35.31
28.31
25.32
24.68
22.13
34.33
27.66
25.25
24.57
22.28
35.23
28.94
26.15
25.23
22.68
34.21
27.56
25.31
24.72
22.45
35.19
29.21
26.41
25.48
22.84
BLS-GSM Best Redundant Fuzzy-Shrink-M (DT-DWT) Method Input PSNR ProbShrink-M (DWT)
14.82 26.08
34.25 34.86
24.78 30.35
Peppers 512×512 20.47 18.98 28.12 27.82
15.01 25.29
Note: Output PSNRs have been averaged over five noise realizations. The best redundant results are obtained using the BLS-GSM 3× 3 with an 8-orientations full steerable pyramid [10]. In addition, the results in the second row for our fuzzy method are obtained after post processing.
Table 1 summarizes the obtained results. As it can be observed in Table 1, our fuzzy multi-channel image denoising method is already competitive with the best techniques. When looking closer at the results, we observe the following: • • • •
Our method (Fuzzy-Shrink using DWT) provides better results than Pizurica’s multi-channel ProbShrink, which integrates the intra-scale dependencies (average gain of +1.199 dB). Our method (Fuzzy-Shrink using DWT) outperforms the Luisier’s et al. multichannel Sure-Let, which is used the inter-scale and inter-channel dependencies more than +0.449 dB on average. Our method (Fuzzy-Shrink using DT-DWT) also gives better results than Portilla’s BLS-GSM (Best-Redundant), which is used for each channel separately (average gain of +0.435 dB). The worst result is obtained for the “Baboon” image. Our explanation for this is that the “Baboon” image is very noisy (i.e. texture area in this image is similar to noise), and when we use the fuzzy features for taking into account the neighbor dependency, it will be smoother in the resulting image.
In addition, Figs. 1 and 2 illustrate the results of noisy multispectral “southern California” and noisy color “Lena” images from different methods. As it can be observed in Figs. 1 and 2, our multi-channel image denoising is successful in preserving image edges and fewer artifacts as visual criteria as compared to other methods.
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic
Fig. 1. Comparison of multi-channel image denoising algorithms
51
52
J. Saeedi and A. Abedi
Fig. 2. Comparison of multi-channel image denoising algorithms
4 Conclusion In this paper, we propose a new wavelet-based multi-channel image denoising using intra-scale dependency as a fuzzy feature, and inter-channel relation to improve wavelet coefficients’ information at the shrinkage step. We use the DT-DWT for wavelet analysis, because it is shift invariant, and has more directional sub-bands
Wavelet-Based Multi-Channel Image Denoising Using Fuzzy Logic
53
compared to the DWT. In other words, proposing a new method for shrinking wavelet coefficients in the second step of the wavelet-based image denoising is the main novelty of this paper. The comparison of the denoising results obtained with our algorithm, and with the best state-of-the-art methods, demonstrate the performance of our fuzzy approach, which gave the best output PSNRs for most of the images. In addition, the visual quality of our denoised images exhibits the fewest number of artifacts and preserves most of edges compared to other methods.
References 1. Mallat, S.: A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989) 2. Schulte, S., Witte, V.D., Kerre, E.E.: A Fuzzy Noise Reduction Method for Color Images. IEEE Trans. Image Process 16(5), 1425–1436 (2007) 3. Blu, T., Luisier, F.: The SURE-LET approach to image denoising. IEEE Trans. Image Process 16, 2778–2786 (2007) 4. Luisier, F., Blu, T.: SURE-LET Multichannel Image Denoising: Interscale Orthonormal Wavelet Thresholding. IEEE Trans. Image Process 17(4), 482–492 (2008) 5. Pizurica, A., Philips, W.: Estimating the probability of the presence of a signal of interest in multiresolution single-and multiband image denoising. IEEE Trans. Image Process, 654–665 (2006) 6. Kingsbury, N.G.: Complex Wavelets for Shift Invariant Analysis and Signals. Applied and Computational Harmonic Analysis 10(3), 234–253 (2001) 7. Zadeh, L.A.: Fuzzy logic and its application to approximate reasoning. Inf. Process 74, 591–594 (1973) 8. Driankov, D., Hellendoorn, H., Reinfrank, M.: An Introduction to Fuzzy Control. Springer, Heidelberg (1993) 9. Hájek, P.: Metamathematics of Fuzzy Logic. Kluwer, Dordrecht (1998) 10. Portilla, J., Strela, V., Wainwright, M., Simoncelli, E.: Image denoising using Gaussian scale mixtures in the wavelet domain. IEEE Transactions on Image Processing, 1338–1351 (2003)
An Adaptive Multiresolution-Based Multispectral Image Compression Method Jonathan Delcourt, Alamin Mansouri, Tadeusz Sliwa, and Yvon Voisin Laboratoire Le2i, BP 16 Route des Plaines de l’Yonne, 89010 Auxerre Cedex, France
[email protected] Abstract. This paper deals with the problem of multispectral image compression. In particular, we propose to substitute the built-in JPEG 2000 wavelet transform by an adequate multiresolution analysis that we devise within the Lifting-Scheme framework. We compare the proposed method to the classical wavelet transform within both multi-2D and full3D compression strategies. The two strategies are combined with a PCA decorrelation stage to optimize their performance. For a consistent evaluation, we use a framework gathering four families of metrics including the largely used PSNR. Good results have been obtained showing the appropriateness of the proposed approach especially for images with large dimensions.
1
Introduction
Multispectral images are widely used in geoscience and remote sensing. Their use in other applications field like medical imagery, quality control in industry, meteorology, and exact color measurements are increasing. A multispectral image is generated by collecting dozens, so-called image-bands, where each one is a monochromatic image centered on a particular wavelength of the electromagnetic spectrum. As a result, uncompressed multispectral images are very large, with a single image potentially occupying hundreds of megabytes. Compression is thus necessary to facilitate both the storage and the transmission. One of the most efficient compression method for monochrome images is the JPEG 20001 [1–5]. However, its extension to multispectral images needs to be adapted and yields to different methods. These methods depend on the manner of which one consider the multispectral cube: – each image band of the multispectral image is considered separately (2D wavelets + 2D SPIHT), – the whole cube is considered as input (3D wavelets + 3D SPIHT). The contributions of this paper consists in substituting the built-in JPEG 2000 wavelet transform by an adaptive multiresolution analysis that we develop within the Lifting-Scheme framework. We will compare this approach with the standard version of JPEG 2000 within the previous presented strategies. 1
http://www.jpeg.org
A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 54–62, 2010. c Springer-Verlag Berlin Heidelberg 2010
An Adaptive Multiresolution
55
In the next section we will develop the theory of the proposed multiresolution analysis. Then we describe the data, experiments and results before discussing them in the third section. We concluded in the last section.
2
Proposed Multiresolution Analysis
We propose to substitute the wavelet transform by building an adequate multiresolution analysis adapted to the image compression aim. We build our multiresolution analysis within the lifting-Scheme framework which allows to produce easily second generation multiresolution analyses [6, 7]. Indeed, the advantage of such framework is that the inverse transform is explicitly defined and its value is exactly computed, whatever are the applied filters. We limit here the demonstration and illustration to the case of a scheme built on a couple of filters, one for prediction and one for update. The idea we aim to exploit consist in adapting the predict filter for each stage of the analysis in that way we minimize the energy hold by the detail signal and reduce more again the non-zero elements after the quantization stage. However, we risk in doing so to lose the error propagation control during the inverse transform. Indeed, ideal compression requires building perfect orthogonal transforms. But when we generate adaptive filters we cannot guarantee the orthogonality condition nor to be near. In order to reduce such a problem effect, we add a term expressing the constraint of near-orthogonality that we resolve using an optimization under constraints method. We recall that we work within a scheme with one predict filler f and one update filter g. The linear operator corresponding to the scale change and its inverse could be written as [8, 9]: ⎞ ⎛ 1I 0 A aI 0 Ig I 0 O O I 0 I −g ⎝ a ⎠ A = , = D D 0 bI 0I −f I E E fI 0 I 0 1b I (1) Where A is the approximation signal, D the details signal, O and E the odd and even signals, a and b normalization coefficients (that we can consider positive), I the identity matrix, and f and g are assimilated to their form of Toeplitz matrices. The non-orthogonality condition of a linear operator T could be written worthily as following: ε1 = T ∗ T − I, ε2 = T ∗ − T −1 , N (εk ) ≤ c ∗
(2)
Where N is a matrix norm and represents the conjugate of the transpose operation. In the case of a filter with real values, the ∗ is the convolution filtering. Let us also notice that the product of two filters is commutative due to the commutativity of the convolution product. One important mathematical property we will exploit is the equivalence between all norms in finite dimension. This is our case since the number of samples is finite. Hence we will define and use a norm that ensures simplification
56
J. Delcourt et al.
of calculus in (2). First, equation (1) could easily be simplified as in [8, 9] by putting b = 1/a. Then after forward simplification of calculus in (2), it appears that we can make non-diagonal elements in ε2 equal to zero when considering g = f ∗ /a2 . In doing so, we limit the solutions that minimize the nonorthogonality condition within the following family: ⎞ ⎛ aI − a1 f f ∗ a1 f ∗ 1 1 A ⎠ O b = , g = 2 f ∗, =⎝ D E a a 1 1 −af aI
(3)
We switch now to the definition ε1 of non-orthogonality for calculus commodity. We inject b and g in ε1 of equation (2) and we obtain a result proportional to: ∗ ∗ f f − (a2 − 1)I 0 f f − a2 I f ∗ f I 0 f f ∗ − (a2 − 1)I
(4)
By replacing T by T −1 , we obtain a result proportional to: ∗ f f − (a2 − 1)I 0 I f∗ 0 f f ∗ − (a2 − 1)I f f f ∗ − a2 I
(5)
The coefficient of proportionality is a common constant of value 1/a2 meaning that it will not be involved in optimization. Now we consider N as the operator norm subordinate to the norm of Euclidean vector. In other words, for an operator A: A2 = sup v=0
Av2 = ρ(A∗ A) = ρ(AA∗ ) v2
(6)
Where ρ(M ) represents the spectral ray of the matrix M , equivalent to the largest eigenvalue in absolute value. From equations (4) and (5) one can remark that we can reason either on T or T −1 . We can understand this intuitively as that the error propagates in the same way in direct and inverse sense. By using the inequality of the product for the norm (6) in (4) and (5), we can evaluate independently the terms constituting the product. The first term is completely determined by: f f ∗ − (a2 − 1)I
(7)
let us remark that (7) is symmetric and we can reason on its eigenvalues. We can obtain the upper-bound for (6) applied to (7):
max | max(Sp(f f ∗ )) − (a2 − 1)|, | min(Sp(f f ∗ )) − (a2 − 1)| Sp corresponding to the spectrum.
(8)
An Adaptive Multiresolution
57
We consider now the second term of the product. After simplification using symbolic notations (Matlab symbolic toolbox), we can obtain, when we let zmax = max(| max(Sp(f f ∗ )) − (a2 − 1)|, | min(Sp(f f ∗ )) − (a2 − 1)|), an upperbound for the second term as: 2 zmax + zmax + 4a2 (9) 2 Leading to the following upper-bounding for the product of the two terms: 2 zmax + zmax + 4a2 zmax (10) 2 By identifying this upper-bound with the constraint c in equation (2), we obtain: zmax = √
c c 1 = a 1+ a2 + c
c a2
N V Ff2 ff used (i, j) = f2 (i, j) (3) ⎩ median[f2 (i, j), f2 (i, j)] N V Ff1 = N V Ff2 In case of more than two images, two images are fused first and then the fused image is fused with the third one and this process continues till all the images are fused. For the merging of equivalent region, median is used instead of averaging. The main reason
74
G. Bhatnagar, Q.M.J. Wu, and B. Raman
for choosing median is come from statistical theory. According to the statistical theory, averaging has some drawbacks neither it can be determined by inspection nor it can be located graphically, it cannot be determined when one or more data is missing and mainly it is affected very much by extreme values. Where as median is the middle of a distribution: half the values are above the median and half are below the median. The median can be viewed as a value such that the number of data above is equal to the number of data below. Hence, median is a positional average. Moreover, median is determined when one or more data is missing and it is not at all affected by extreme values and this makes it a better measure than the mean. In the case of two data, median coincides with average value of given data.
3 Experimental Results Some general requirements for fusion algorithm are: (1) it should be able to extract complimentary features from input images, (2) it must not introduce artifacts or inconsistencies according to Human Visual System and (3) it should be robust and reliable. Generally, these requirements are often very difficult to achieve. Hence, first an evaluation index system is established for evaluating the proposed fusion algorithm. These indices are determined according to the statistical parameters. This evaluation index system includes mean, standard deviation, entropy, peak signal to noise ratio (PSNR) and spectral distortion. Among these mean, standard deviation, entropy reflect spatial details information whereas PSNR and spectral distortion reflects spectral information. Mathematical definitions of these indices are given below: 1. Mean and Standard Deviation: In statistical theory, mean and standard deviation are defined as follows: μ ˆ= σ ˆ2 =
1 MN
M N
xi,j ,
i=1 j=1 M N
1 (M−1) (N −1)
(4) (xi,j − μ ˆ)2
i=1 j=1
where M N is the total number of pixels in the image and xi,j is the value of the ij th pixel. 2. Entropy: Entropy is the measure of information quantity contained in an image. If the value of entropy becomes higher after fusion then the information quality will increase. Mathematically, entropy is defined as: E=−
M N
p(xi,j ) ln p(xi,j )
(5)
i=1 j=1
where p(xi,j ) is the probability of the occurrence of xi,j . 3. Average Gradient: The average gradient is given by: Ix2 + Iy2 1 g¯ = MN i j 2
(6)
Real Time Human Visual System Based Framework for Image Fusion
75
where Ix and Ix are the differences in x and y direction. The larger the average gradient, the sharper the image. 4. Peak Signal to Noise Ratio: The PSNR indicates the similarity between two images. The higher the value of PSNR, the better fused image is. Mathematically, PSNR is defined as: 2552 P SN R = 10 lg (7) M N 1 2 [x − y ] i,j i,j MN i=1 j=1
where M N is the total number of pixels in the image, xi,j and yi,j are the values of the ij th pixel in original and degrade image. 5. Structural Similarity: Structural similarity (SSIM) is designed by modeling any image distortion as the combination of loss of correlation, radiometric distortion and contrast distortion. SSIM is defined as: SSIM =
σxy 2μx μy 2σx σy σx σy μ2x + μ2y σx2 + σy2
(8)
where μx , μy are mean intensity and σx , σy , σxy are the standard deviation. In equation 8, first term is the correlation coefficient between x and y. The second term measure how close the mean grey level is, while third term measure the similarity in contrast of x and y. The higher the value of SSIM, the better fused image is. The performance of the proposed fusion algorithm is demonstrated using MATLAB by taking different set of gray-scale images as experimental images. In the experiments, Table 1. Evaluation Indices for Fused Images Images Time Taken (in sec.)
Mandrill Payaso DWT
Pepper
Lab
Medical
Gun
51.9358 50.7060 52.0318 41.2769 32.0462 31.9971
Proposed 10.7415 10.6577 9.9431
8.3922
6.6555 6.1712
Original 129.1438 99.7830 120.2164 122.9372 29.9628 25.6251 Mean
DWT
129.2793 99.4065 119.7593 122.7133 27.2458 31.3983
Proposed 129.3688 99.6213 119.6156 123.9435 41.1227 52.2069 Original 10.3859 10.5246 10.8560 46.0241 38.7900 32.7680 Standard Deviation
DWT
10.3703 9.9039 10.7842
9.6350
8.8112 8.0418
Proposed 42.1657 63.0997 50.9855 47.2410 35.1843 55.2290 Entropy PSNR
Original
5.1000
DWT
5.0854
5.3771
5.2542
4.8079
4.0565 4.2379
Proposed 5.1004
5.3678
5.2558
4.9537
4.4472 4.6353
DWT
5.2635
5.2635
4.7847
2.8922 3.3344
34.0813 33.3704 31.6640 28.2273 15.9483 17.0233
Proposed 35.3932 33.3838 26.3765 27.6945 15.6526 17.9351 SSIM
DWT
2.7971
3.1574
4.0271
4.8788 27.0301 19.8870
Proposed 1.8513
2.6500
3.0346
3.6889 26.3006 19.0912
76
G. Bhatnagar, Q.M.J. Wu, and B. Raman
Original Image
Input: source Images
Fused Images using Li method [4] Proposed method
(Mandril)
(a)
(b)
(c)
(d)
(Payaso)
(e)
(f)
(g)
(h)
(Pepper)
(i)
(j)
(k)
(l)
(Lab)
(m)
(n)
(o)
(p)
(q)
(r)
(s)
(t)
(u)
(v)
(w)
(x)
Fig. 2. Results for all experimental Image
Real Time Human Visual System Based Framework for Image Fusion
77
mandrill, payaso, peppers, lab, medical images and gun images are taken as experimental images these are having size of 512 × 512, 512 × 512, 512 × 512, 384 × 288, 256 × 256 and 256 × 256 respectively. In mandrill images, we have concentrated on upper half and lower half part. In pepper images, we have concentrated on left half and right half part. In payaso images, we have concentrated on middle and outer part. Lab and medical images are the examples of multi focus images. Finally, gun images are the very famous example of the Concealed Weapon Detection. To evaluate the robustness and superiority of the proposed method, the comparison is made by us with very well known fusion algorithm given by Li [5] which is based on the wavelet transform and activity measure. The activity measure of each coefficient is computed in a 3×3 or 5×5 window and fusion is done by consistency verification along with area based activity measure and maximum selection. For Medical and Gun images, proper original images are not available to compare our results. Hence, we evaluate all indices with the help of all input images and then take the average of all values as the results for the original image. Table 1 shows the values of evaluation indices for original and fused images using Li [5] and proposed methods and visual results are shown in figure 2. The first row of the table shows the time taken in the fusing images of different sizes. It is clear that the time taken by the proposed method is very less. The proposed methods perform better compared to Li et al. [5]. To reach this conclusion the main stress is focused on the mean, standard deviation, entropy, PSNR and SSIM which have the highest/lowest values between the compared methods.
4 Conclusions The fusion method described in this paper cover a large variety of practical applications. In the proposed technique, human visual system is considered and applied in spatial domain which make it well suitable candidate for real time applications. The main benefit is that the use of no transform makes the proposed technique less complex and very easy. Hence, it has very low computation cost. It can do a equivalent/better job than existing wavelet based fusion method[5] as shown by the experiments and comparison made by us. This shows the algorithm robustness and accuracy in fusing several types of real images obtained from different kinds of sensors, for different purposes. Hence, the proposed algorithm will be suitable for implementation in real-time processing.
Acknowledgement The work is supported by the Canada Research Chair program, the NSERC Discovery Grant.
References 1. Hall, D.L., Llinas, J.: An Introduction to Multisensor Data Fusion. Proceedings of the IEEE 85(1), 6–23 (1997) 2. Xydeas, C. S., Petrovic, V.: Objective Pixel-level Image Fusion Performance Measure. In: Procedding of SPIE, Sensor Fusion: Architectures, Algorithms, and Applications IV, vol. 4051, pp. 89–98. Society of Photographic Instrumentation Engineers (2002)
78
G. Bhatnagar, Q.M.J. Wu, and B. Raman
3. Chavez, P.S., Kwarteng, A.Y.: Extracting spectral contrast in Landsat thematic mapper image data using selective principal component analysis. Photogrammetric Engineering and Remote Sensing 55, 339–348 (1989) 4. Burt, P.J., Kolczynski, R.J.: Enhanced image capture through fusion. In: Proceedings of International Conference on Computer Vision, Berlin, Germany, pp. 173–182. IEEE Press, Los Alamitos (1993) 5. Li, H., Manjunath, B.S., Mitra, S.K.: Multisensor image fusion using the wavelet transform. Graph Models Image Processing 57(3), 235–245 (1995) 6. Qu, G., Zhang, D., Yan, P.: Medical Image Fusion by Wavelet Transform Modulus Maxima. Optics Express 9, 184–190 (2001) 7. Levicky, D., Foris, P.: Human visual system models in digital watermarking. Radioengineering 13(4), 38–43 (2004) 8. Voloshynovskiy, S., Herrigel, A., Baumgartner, N., Pun, T.: A stochastic apporach to content adaptive digital image watermarking. In: Pfitzmann, A. (ed.) IH 1999. LNCS, vol. 1768, pp. 211–236. Springer, Heidelberg (2000)
Color VQ-Based Image Compression by Manifold Learning Christophe Charrier1,2 and Olivier L´ezoray1 1
Universit´e de Caen Basse-Normandie Laboratoire GREYC, Unit´e Mixte de Recherche CNRS 6072 6 Bd. Mar´echal Juin, 14050 Caen, France 2 Universit´e de Sherbrooke Dept. d’informatique, laboratoire MOIVRE 2500 Bd. de l’Universit´e, Sherbrooke, Qc., J1K 2R1, Canada
Abstract. When the amount of color data is reduced in a lossy compression scheme, the question of the use of a color distance is crucial, since no total order exists in IRn , n > 1. Yet, all existing color distance formulae have severe application limitation, even if they are widely used, and not necesseraly within the initial context they have been developed for. In this paper, a manifold learning approach is applied to reduce the dimension of data in a Vector Quantization approach to obtain data expressed in IR. Three different techniques are applied before construct the codebook. Comparaisons with the standard LBG-based VQ method are performed to judge the performance of the proposed approach using PSNR, MS-SSIM and VSNR measures.
1
Introduction
Compression is a commonly used process to reduce the amount of initial data to be stored or transmited by a channel to a receiver. To reach this reduction goal, two compression families exist : 1) lossless compression approach (usually entropy-based schemes) and 2) lossy compression techniques. In those latters, compression operates as a nonlinear and noninvertible operation and is applied on individual pixel (scalar quantization) or group of pixels (Vector Quantization – VQ). VQ corresponds to the coding structure developed by Shannon in his theoretical development of source coding [1]. Conceptually, it is an extension of scalar quantization to a multidimensional space. In VQ, the input image is parsed into a sequence of groups of pixels, referred to as input vectors. This sequence is termed the test set, and the training set corresponds to a set of training images. VQ maps a vector x of dimension k to another vector y of dimension k belonging to a finite set C (codebook ) containing n output vectors, called code vectors or codewords. When one tries to reduce the amount of data in a multidimensional space, the question of the used distance to measure existing difference between two candidate vectors is crucial. Actually no total order exists in spaces whose dimension is greater than 1. For example, within the color domain, many distance formulae A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 79–85, 2010. c Springer-Verlag Berlin Heidelberg 2010
80
C. Charrier and O. L´ezoray
have been introduce to counterbalance this main drawback. Furthermore, those formulae tend to measure the color perception as done by the Human Visual System. For now, all distances fails to measure the distance between two vectors, specially for chromatic data, represented by a 3D vector (C1 , C2 , C3 ). In addition, when one tries to apply VQ techniques on 3D data (i.e., colorimetric data), one has to measure the distance betweeen multidimensional color vectors to construct the final codebook. In that case, no satisfactory distance formulae are available to be apply on those data. Instead of developping a dedicated distance formula, an investigation about reduction dimensionnality methods is performed. The aim of such an approach is to be able applying Euclidean distance on scalar obtained after performind a dimensionnality reduction process. In this paper, a manifold learning process is applied prior a VQ compression scheme in order to reduce the data dimensionality to generate the codebook. Comparison with standard VQ scheme is performed to measure the effeciency of the proposed approach.
2
Dimensionality Reduction
Given a set of visual features describing an image, a Manifold Learning method is used to project the data onto a new low-dimensional space. Thus, nonlinear new discriminant features of the input data are yielded. The obtained low dimensional sub-manifold is used as a new representation that is transmitted to classifiers. When data objects, that are the subject of analysis using machine learning techniques, are described by a large number of features (i.e. the data is high dimensional) it is often beneficial to reduce the dimension of the data. Dimensionality reduction can be beneficial not only for reasons of computational efficiency but also because it can improve the accuracy of the analysis. Indeed, traditional algorithms used in machine learning and pattern recognition applications are often susceptible to the well-known problem of the curse of dimensionality, that refers to the degradation in the performance of a given learning algorithm as the number of features increases. To deal with this issue, dimensionality reduction techniques are often applied as a data pre-processing step or as part of the data analysis to simplify the data model. This typically involves the identification of a suitable low-dimensional representation of the original high-dimensional data set. By working with this reduced representation, tasks such as classification or clustering can often yield more accurate and readily interpretable results, while computational costs may also be significantly reduced. Dimensionality reduction methods can be divided into two sets wether the transformation is linear or nonlinear. We detail here the principles of three well-known linear and nonlinear dimensionality reduction methods: Principal Components Analysis (PCA)[2], Laplacian Eigenmaps (LE)[3] and Manifold Parzen Window (MPW) [4]. Let X = {x1 , x2 , · · · , xn } ∈ Rp be n sample vectors. Dimensionality reduction consists in finding a new low-dimensional representation in Rp with q p.
Color VQ-Based Image Compression by Manifold Learning
2.1
81
Principal Components Analysis
The main linear technique for dimensionality reduction, principal components analysis (PCA), performs a linear mapping of the data to a lower dimensional space in such a way, that the variance of the data in the low-dimensional representation is maximized. Traditionally, principal component analysis is performed on the symmetric covariance matrix Ccov or on the symmetric correlation matrix Ccor . We will denote C one of these two matrices in the sequel. From such a symmetric matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. Therefore, PCA simply consists in computing the eigenvectors and eigenvalues of the matrix C: C = U ΛU T where Λ = diag(λ1 , · · · , λn ) is the diagonal matrix of the ordered eigenvalues λ1 ≤ · · · ≤ λn , and U is a p × p orthogonal matrix containing the eigenvectors. Dimensionality reduction is then obtained by the following operator hP CA : xi → (y1 (i), · · · , yq (i)) where yk (i) is the ith coordinate of eigenvector yk . In the rest of this paper, we will Cor denote hCov P CA and hP CA , dimensionality reduction performed with PCA of the covariance or the correlation matrix. 2.2
Laplacian Eigenmaps
Given a neighborhood graph G associated to the vectors of X, one considers its adjacency matrixW where weights Wij are given by a Gaussian ker ||x −x ||2
nel Wij = k(xi , xj ) = e − i σ2 j . Let D denote the diagonal matrix with elements Dii = j Wij and Δ denote the un-normalized Laplacian defined by Δ = D−W . Laplacian Eigenmaps dimensionality reduction consists in searching foranew representation {y1 , y2 , · · · , yn } with yi ∈ Rn , obtained by minimizing 1 yi − yj Wij = T r(YT ΔY) with Y = [y1 , y2 , · · · , yn ]. This cost func2 2 ij
tion encourages nearby sample vectors to be mapped to nearby outputs. This is achieved by finding the eigenvectors y1 , y2 , · · · , yn of matrix Δ. Dimensionality reduction is obtained by considering the q lowest eigenvectors (the first eigenvector being discarded) with q p and is defined by the following operator hLE : xi → (y2 (i), · · · , yq (i)) where yk (i) is the ith coordinate of eigenvector yk . 2.3
Manifold Parzen Window
Considering X, one can associate an unknown probability density function pX (.). Let a training set contain l samples of that random variable, collected in a l × n matrix X whose row xi is the i-th sample. Then, the goal is to estimate the density pX (.). Let us consider a small region R centered on the point x at which we wish to determine the probability density. In order to count the number K of points falling within this region, a commonly used way is the use of the following function 1 |ui | ≤ 1/2, i = 1, . . . , D k(u) = (1) 0 otherwise
82
C. Charrier and O. L´ezoray
that represents a unit cube centered on the origin. The function k(u) is known as a Parzen window. From this, the quantity k(x − xn )/h will be one if the data point xn lies inside the cube, and zero otherwise. Thus the total number of data points lying inside this cube is given by K=
N x − xn k h n=1
(2)
Thus, the estimate density at data point x is given by N x − xn 1 1 k p(x) = N n=1 hp h
(3)
where V = hp is the volume of the cube of side h in p dimensions. Nevertheless, this kernel density estimator suffers from one of the same problems encountered using the histogram method, namely the presence of artificial discontinuities at the cube boundaries. To prevent this, one uses a smoother kernel function based on a Gaussian kernel defined as follows: N 1 1 x − xn 2 √ exp − p(x) = N n=1 2πh2 2h2
(4)
where h represents the standard deviation of the Gaussian components, i.e. can be interpreted as the covariance matrix C. In that case, h2 represents the determinant |C|. In order to obtain a more compact representation of the inverse Gaussian, one stores only the eigenvectors associated with the first few largest eigenvalues of Ci , as described below. The eigen-decomposition of a covariance matrixC can be expressed as: C = V DV T , where the columns of V are the orthonormal eigenvectors and D is a diagonal matrix with the eigenvalues λ1 , . . . , λn , that we will suppose sorted in decreasing order, without loss of generality. The first q eigenvectors with largest eigenvalues correspond to the principal directions of the local neighborhood, i.e. the high variance local directions of the supposed underlying q-dimensional manifold (but the true underlying dimension is unknown and may actually vary across space). Dimensionality reduction is obtained by considering the q lowest eigenvectors (the first eigenvector being discarded) with q p and is defined by the following operator hP W : xi → (y2 (i), · · · , yq (i)) where yk (i) is the ith coordinate of eigenvector yk .
3
Performance Measure Protocol
To analyze the performance of the proposed approach, a comparison with the standard VQ compression scheme (based on the use of the LBG algorithm [5]) is computed.
Color VQ-Based Image Compression by Manifold Learning
83
The comparison operates using computable metrics. Three computable metrics are selected: 1) the PSNR due to its commonly use in the image processing community, 2) the Image Quality Assessment (IQA) algorithm Visual Signalto-Noise Ratio (VSNR) introduced by Chandler and Hemami [6] and 3) the IQA algorithm Multiscale Structural Similarity Index (MS-SSIM). The VSNR has been developed to uantify the visual fidelity of natural images based on near-threshold and suprathreshold properties of human vision. In [7] Sheikh et al. have shown that MS-SSIM is highly competitive with all other existing IQA algorithms. 3.1
Experimental Setup
To judge how the proposed method outperforms the standard VQ algorithm, 25 initial images in the LIVE image database are used [8]. From those images, two datasets are generated: 1) 12 images are used as a training set and the 2) the 13 remaining images serve as test set. From the training set, 12 codebooks (Ci )i∈[1,...,12] of size from 32 to 560 are generated without performing the dimension reduction approach and 12 codebooks are computed after performing each one of the three manifold learning algorithm described in section 2. The size of the used vectors to construct both the training set and the test set is 8 × 8. Those codebook sizes yields us to have image quality from very bad to excellent.
4
Results
Table 1 presents the obtained results using the three computed metrics to measure the quality of VQ-based reconstructed images and the manifold learning VQ-based compressed images. In this latter only the first eigenvector is used to construct the codebook at different sizes using monodimensionnal data dm projected on that new axis. One can observe that the obtained results are slightly better that those obtained using the initial spatial vectors to construct the codebook. The differences are not really significant to claim that the proposed method definitely outperforms the standard VQ-based compressed images. Nevertheless, the construction of the codebooks applying the proposed approach has been realized only using monodimensional data dm instead of 8 × 8 × 3 color data. By the way, a simple Euclidean distance is used instead of a complex color distance. From all tested manifold learning methods, the Laplacian Eigenmaps gives better results than the two others, in terms of IQA measures. Table 2 presents the quality measure results taking into account more and more dimensions during the construction of the codebooks. One notes that for a number of dimension greater than 3, no significant improvement of the quality is observed. This tends to prove that no more 3D data are needed to construct codebook of quality when the distance between data is measure using the Euclidean formula. By the way, we do not need to use any specific colorimetric distance to generate a codebook of quality.
84
C. Charrier and O. L´ezoray
Table 1. Computable metrics applied to standard VQ-based compressed images and to each of the manifold learning VQ-based compressed images (in the latter case, only data provided by the first eigenvector are used)

Codebook size    32     80     128    176    224    272    320    368    416    464    512    560

Standard VQ
  PSNR           20.3   21.2   21.0   22.1   22.3   22.8   23.9   24.1   25.3   26.7   27.2   28.2
  MS-SSIM        0.919  0.948  0.956  0.963  0.965  0.969  0.970  0.971  0.972  0.973  0.974  0.976
  VSNR           1.051  0.952  0.871  0.672  0.596  0.518  0.445  0.431  0.392  0.331  0.298  0.221

hCov PCA and VQ
  PSNR           23.4   25.1   26.2   26.8   26.8   27.9   28.1   28.3   28.4   28.6   28.8   29.1
  MS-SSIM        0.917  0.946  0.952  0.960  0.962  0.964  0.968  0.970  0.972  0.972  0.973  0.974
  VSNR           1.021  0.943  0.856  0.664  0.600  0.489  0.421  0.401  0.391  0.330  0.289  0.213

hCor PCA and VQ
  PSNR           23.2   25.4   26.3   26.8   26.7   27.5   23.9   26.1   26.5   27.0   27.1   27.9
  MS-SSIM        0.912  0.943  0.954  0.961  0.960  0.957  0.965  0.968  0.970  0.972  0.973  0.974
  VSNR           1.045  0.946  0.869  0.678  0.610  0.489  0.433  0.412  0.392  0.331  0.291  0.220

hLE and VQ
  PSNR           23.9   26.1   26.5   27.0   27.1   27.9   28.1   28.2   28.5   28.6   28.8   29.3
  MS-SSIM        0.921  0.951  0.957  0.963  0.967  0.969  0.971  0.973  0.973  0.974  0.974  0.978
  VSNR           1.032  0.946  0.869  0.678  0.610  0.509  0.401  0.398  0.391  0.328  0.264  0.208

hPW and VQ
  PSNR           23.6   25.2   25.8   27.7   27.2   27.5   27.8   28.1   28.2   28.3   28.3   28.5
  MS-SSIM        0.916  0.942  0.951  0.958  0.964  0.970  0.969  0.969  0.971  0.972  0.972  0.973
  VSNR           1.043  0.923  0.857  0.654  0.602  0.487  0.436  0.421  0.409  0.350  0.287  0.212
Table 2. Evolution of each computable metric applied to Laplacian Eigenmaps VQ-based compressed images when the number of eigenvectors increases

Codebook size    32     80     128    176    224    272    320    368    416    464    512    560

λ1
  PSNR           23.9   26.1   26.5   27.0   27.1   27.9   28.1   28.2   28.5   28.6   28.8   29.3
  MS-SSIM        0.921  0.951  0.957  0.963  0.967  0.969  0.971  0.973  0.973  0.974  0.974  0.978
  VSNR           1.032  0.946  0.869  0.678  0.610  0.509  0.401  0.398  0.391  0.328  0.264  0.208

λ2
  PSNR           24.22  26.32  26.67  27.5   27.6   28.3   28.6   28.8   29.2   29.6   29.5   29.7
  MS-SSIM        0.923  0.953  0.960  0.965  0.968  0.972  0.973  0.974  0.973  0.976  0.974  0.980
  VSNR           1.036  0.949  0.874  0.682  0.618  0.512  0.421  0.408  0.398  0.333  0.275  0.214

λ3
  PSNR           24.33  26.41  26.70  27.8   28.0   28.4   28.9   29.2   29.5   29.8   29.7   30.2
  MS-SSIM        0.924  0.954  0.962  0.965  0.968  0.972  0.974  0.974  0.974  0.977  0.975  0.981
  VSNR           1.037  0.951  0.876  0.684  0.619  0.515  0.424  0.414  0.402  0.335  0.277  0.217

λ4
  PSNR           24.35  26.44  26.71  27.9   28.1   28.45  29.1   29.4   29.7   29.8   29.9   30.3
  MS-SSIM        0.924  0.955  0.963  0.966  0.969  0.973  0.974  0.975  0.974  0.978  0.977  0.982
  VSNR           1.037  0.952  0.876  0.686  0.621  0.516  0.427  0.416  0.404  0.338  0.281  0.219
Fig. 1 shows an example of a reconstructed image when a dimension reduction process is applied prior to VQ (a) or not (b).
Fig. 1. Example of hLE + VQ reconstructed image (a) and standard VQ reconstructed image (b) using a codebook size equal to 100
5 Conclusion
In this paper, a dimensionality reduction is applied before performing a VQ-based compression technique. With such an approach, the problem of selecting a good colorimetric distance to construct the codebook vanishes, since the data obtained from the manifold learning methods are not color data. In that case, a Euclidean distance can be used. In a first part, three manifold learning methods have been compared to the standard VQ compression scheme in terms of PSNR, MS-SSIM and VSNR values. Laplacian Eigenmaps applied before the construction of the codebook gives the best results. In a second part, it has been shown that no more than 3 eigenvectors need to be used to improve the quality of the results. Consequently, one can reach better quality using 3D data than using the initial spatial color vectors.
References
1. Gersho, A., Gray, R.M.: Vector Quantization and Signal Compression. Kluwer Academic Publishers, Dordrecht (1991)
2. Saul, L.K., Weinberger, K.Q., Ham, J., Sha, F., Lee, D.D.: Spectral methods for dimensionality reduction. In: Semi-supervised Learning, pp. 279–294. MIT Press, Cambridge (2006)
3. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
4. Bengio, Y., Vincent, P.: Manifold Parzen windows. Tech. Rep., CIRANO (2004)
5. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Transactions on Communications 28, 84–94 (1980)
6. Chandler, D.M., Hemami, S.S.: VSNR: A wavelet-based visual signal-to-noise ratio for natural images. IEEE Transactions on Image Processing 16(9), 2284–2298 (2007)
7. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing 15(11), 3441–3452 (2006)
8. Laboratory for Image & Video Engineering, University of Texas at Austin: LIVE Image Quality Assessment Database (2002), http://live.ece.utexas.edu/research/Quality
Total Variation Minimization with Separable Sensing Operator

Serge L. Shishkin, Hongcheng Wang, and Gregory S. Hagen

United Technologies Research Center, 411 Silver Ln, MS 129-15, East Hartford, CT 06108, USA
Abstract. Compressed Imaging is the theory that studies the problem of image recovery from an under-determined system of linear measurements. One of the most popular methods in this field is Total Variation (TV) Minimization, known for accuracy and computational efficiency. This paper applies a recently developed Separable Sensing Operator approach to TV Minimization, using the Split Bregman framework as the optimization approach. The internal cycle of the algorithm is performed by efficiently solving coupled Sylvester equations rather than by an iterative optimization procedure as is done conventionally. Such an approach requires less computer memory and computational time than any other algorithm published to date. Numerical simulations show an improvement, by an order of magnitude or more, in time vs. image quality compared to two conventional algorithms.
1 Introduction
Compressed Imaging (CI) methods perform image recovery from a seemingly insufficient set of measurements [1]. This is a fast growing field with numerous applications in optical imaging, medical MRI, DNA biosensing, seismic exploration, etc. One promising but still open application for CI is the problem of on-line processing of "compressed" video sequences. The two main obstacles are that, first, even the fastest recovery algorithms are not fast enough for on-line processing of video; and second, the memory requirements of the algorithms are too high for the relatively small processors typical of video devices. A big step toward overcoming these limitations was taken recently in [2,3] with the suggestion to use only separable sensing matrices (SM), i.e. those presentable in the form A = R ⊗ G, where "⊗" stands for the Kronecker product. In such a case, for any image F the following equality holds:

    A vect(F) = vect(RFG),    (1)

where "vect" denotes a vectorization operator. For M measurements performed on an image of size n × n, the conventional approach (the left-hand side of (1)) demands O(Mn^2) operations and computer memory, while the right-hand side of (1) requires O(M^{1/2}n) operations and computer memory. This simplification fits into any CI method and yields a significant reduction of computational time and memory almost independently of the method schematics.
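The equivalence in (1) is easy to check numerically. The sketch below, a minimal NumPy illustration with arbitrary sizes, applies the two small factors instead of the full Kronecker matrix. Note that the exact Kronecker factor ordering depends on the vectorization convention: the sketch uses column-major vectorization, for which vect(RFG) = (Gᵀ ⊗ R) vect(F), and the paper's A = R ⊗ G notation is understood up to this convention.

```python
import numpy as np

n, m = 64, 16                        # image is n x n, measurements M = m * m
rng = np.random.default_rng(0)
R = rng.standard_normal((m, n))      # left factor
G = rng.standard_normal((n, m))      # right factor
F = rng.standard_normal((n, n))      # test "image"

# separable application: O(M^{1/2} n^2) work, O(M^{1/2} n) storage for R, G
y_fast = (R @ F @ G).flatten(order='F')

# explicit sensing matrix: (m*m) x (n*n), much larger in time and memory
A = np.kron(G.T, R)
y_slow = A @ F.flatten(order='F')

assert np.allclose(y_fast, y_slow)   # both sides of (1) agree
```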
However, the potential advantages of separable SMs are not limited to the simple consideration above. As will be shown in this paper, some CI algorithms can be customized according to specific qualities of the SM, and this leads to a further speed-up of image recovery. In this paper, we consider an approach to CI recovery via Total Variation (TV) minimization. TV minimization was first introduced for image denoising in [4] and later was very successfully used for CI recovery in [5], [6] and other papers. A theoretical justification of the approach is presented in [7]. In the TV framework, the image is recovered by solving one of two problems:

    minimize ‖∇_x F‖_1 + ‖∇_y F‖_1  s.t. ‖A vect(F) − y‖ ≤ ε,    (2)

    minimize ‖∇_x F‖_1 + ‖∇_y F‖_1 + (μ/2)‖A vect(F) − y‖_2^2,    (3)

where ε and μ are small constants, A is the M × N sensing matrix, y is the M-vector of measurements, and M < N. A variant of this approach called "Anisotropic TV minimization" involves solving one of the slightly modified problems, respectively:

    minimize Σ_{i,j} √(|∇_x F_{i,j}|^2 + |∇_y F_{i,j}|^2)  s.t. ‖A vect(F) − y‖ ≤ ε,    (4)

    minimize Σ_{i,j} √(|∇_x F_{i,j}|^2 + |∇_y F_{i,j}|^2) + (μ/2)‖A vect(F) − y‖_2^2.    (5)

Since the problems (2) and (4) can be solved by reduction to the forms (3) and (5) respectively, either with iterative adjustment of the parameter μ or by introducing an auxiliary variable as in [9], we consider only (3) and (5). With the assumption that the SM has the separable structure A = R ⊗ G, these two problems take the following forms, respectively:

    minimize ‖∇_x F‖_1 + ‖∇_y F‖_1 + (μ/2)‖RFG − Y‖_2^2,    (6)

    minimize Σ_{i,j} √(|∇_x F_{i,j}|^2 + |∇_y F_{i,j}|^2) + (μ/2)‖RFG − Y‖_2^2,    (7)
where the matrices R and G of appropriate dimensions are fixed a priori and y = vect(Y). To enable CI recovery, the sensing matrix A = R ⊗ G must satisfy certain conditions; we do not discuss them here, and the reader is referred to [1], [3]. We propose that the problems (6) and (7) be solved using a Split-Bregman approach [8], [9], with one important alteration: the internal cycle of minimization is replaced by solving two Sylvester equations. Such a modification is possible due to the separable structure of the SM. This significantly reduces the computational load of the method and improves the accuracy. The structure of this paper is as follows: Section 2 presents the fast modification of the Split-Bregman method to solve the problems (6) and (7). Results of numerical simulations are presented and discussed in Section 3. Conclusions are formulated in Section 4.
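For concreteness, the two TV regularizers appearing in (6) and (7) can be evaluated on a discrete image as below; this is only a sketch, and the forward differences with a replicated border are an assumed discretization of ∇_x, ∇_y, not one prescribed by the paper.

```python
import numpy as np

def tv_terms(F):
    # forward differences, last row/column replicated (assumed discretization)
    gx = np.diff(F, axis=1, append=F[:, -1:])
    gy = np.diff(F, axis=0, append=F[-1:, :])
    tv_separate = np.abs(gx).sum() + np.abs(gy).sum()   # regularizer of (3)/(6)
    tv_coupled = np.sqrt(gx**2 + gy**2).sum()           # regularizer of (5)/(7)
    return tv_separate, tv_coupled
```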
2 TV Minimization Algorithm
Let us start with problem (6). According to the Split-Bregman approach [8], [9], we introduce new auxiliary variables V^k, W^k, D_x^k, B_x^k, D_y^k, B_y^k and replace the unconstrained minimization problem (6) by the equivalent problem

    minimize ‖D_x‖_1 + ‖D_y‖_1 + (μ/2)‖RFG − Y‖_2^2  such that D_x = ∇_x V, D_y = ∇_y V, V = F.    (8)

Here we see the main idea of the proposed algorithm: "splitting" the main variable into two, F and V, so that the former is responsible for the term ‖RFG − Y‖_2^2 while the latter is being l1-regularized by TV minimization. The condition F = V guarantees that this is, in fact, one variable; however, such a split allows us to decompose the most computationally expensive step of the algorithm into two much simpler steps. This approach looks similar to the version of the Split Bregman method proposed in [9] for solving problem (2) with ε = 0. However, the method presented here is developed for a slightly different problem (3), and additional differences will be noted in the development. The method presented here can easily be modified along the lines of [9] to address problems (2) or (4). The solution of (8) can be found by unconstrained optimization:

    minimize_{F,V,D_x,D_y} ‖D_x‖_1 + ‖D_y‖_1 + (λ/2)‖∇_x V − D_x‖_2^2 + (λ/2)‖∇_y V − D_y‖_2^2 + (ν/2)‖V − F‖_2^2 + (μ/2)‖RFG − Y‖_2^2    (9)

with an appropriate choice of the coefficients λ, ν, μ. Consistently applying the Split-Bregman procedure, we replace (9) by the sequence of minimization problems

    minimize ‖D_x^k‖_1 + ‖D_y^k‖_1 + (λ/2)‖∇_x V^k − D_x^k + B_x^k‖_2^2 + (λ/2)‖∇_y V^k − D_y^k + B_y^k‖_2^2 + (ν/2)‖V^k + W^k − F^k‖_2^2 + (μ/2)‖RFG − Y‖_2^2,    (10)

where the auxiliary variables B_x^k, B_y^k, and W^k are updated as follows:

    W^{k+1} = V^{k+1} + W^k − F^{k+1},    (11)

    B_x^{k+1} = ∇_x V^{k+1} + B_x^k − D_x^{k+1},  B_y^{k+1} = ∇_y V^{k+1} + B_y^k − D_y^{k+1}.    (12)

The minimization (10) can be split into the following subtasks:

    F^{k+1} = arg min_F (μ/2)‖RFG − Y‖_2^2 + (ν/2)‖F − V^k − W^k‖_2^2,    (13)

    V^{k+1} = arg min_V (λ/2)‖D_x^k − ∇_x V − B_x^k‖_2^2 + (λ/2)‖D_y^k − ∇_y V − B_y^k‖_2^2 + (ν/2)‖V − F^{k+1} + W^k‖_2^2,    (14)

    D_x^{k+1} = arg min_D ‖D‖_1 + (λ/2)‖D − ∇_x V^{k+1} − B_x^k‖_2^2,    (15)

    D_y^{k+1} = arg min_D ‖D‖_1 + (λ/2)‖D − ∇_y V^{k+1} − B_y^k‖_2^2.    (16)
Let us consider these subtasks one by one. Differentiating (13), we obtain the necessary condition on F^{k+1}:

    μ(RᵀR) F^{k+1} (GGᵀ) + ν F^{k+1} = μ Rᵀ Y Gᵀ + ν V^k + ν W^k.    (17)

This is a classical Sylvester equation. To solve it, let us compute the eigendecompositions of the matrices (RᵀR) and (GGᵀ), respectively:

    RᵀR = U_Rᵀ L_R U_R,   GGᵀ = U_Gᵀ L_G U_G,

where U_R, U_G are orthogonal matrices and L_R, L_G are diagonal ones. Denote

    F̃^k = U_R F^k U_Gᵀ,  Ṽ^k = U_R V^k U_Gᵀ,  W̃^k = U_R W^k U_Gᵀ,  Ỹ = U_R Rᵀ Y Gᵀ U_Gᵀ.

Multiplying (17) by U_R from the left and by U_Gᵀ from the right, we obtain

    μ L_R F̃^{k+1} L_G + ν F̃^{k+1} = μ Ỹ + ν Ṽ^k + ν W̃^k,    (18)

which is easily solvable on an entry-by-entry basis, yielding:

    F̃^{k+1}_{i,j} = (μ Ỹ + ν Ṽ^k + ν W̃^k)_{i,j} / (ν + μ (L_R)_{ii} (L_G)_{jj}),  i, j = 1, ..., n.    (19)
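The F-update (17)–(19) is compact in code. The sketch below uses NumPy's eigendecomposition convention (eigenvectors as columns, so the rotation into the eigenbasis reads Uᵀ(·)U), and the default parameter values μ = 1, ν = 5 are taken from the experiments section; everything else is an illustrative assumption. In a real implementation the eigendecompositions would be precomputed once, as noted in the text below.

```python
import numpy as np

def f_update(R, G, Y, V, W, mu=1.0, nu=5.0):
    """Solve mu*(R^T R) F (G G^T) + nu*F = mu*R^T Y G^T + nu*(V + W) as in (17)-(19)."""
    lr, Ur = np.linalg.eigh(R.T @ R)          # R^T R = Ur diag(lr) Ur^T
    lg, Ug = np.linalg.eigh(G @ G.T)          # G G^T = Ug diag(lg) Ug^T
    rhs = mu * (R.T @ Y @ G.T) + nu * (V + W)
    rhs_t = Ur.T @ rhs @ Ug                   # rotate into the joint eigenbasis
    F_t = rhs_t / (nu + mu * np.outer(lr, lg))  # entrywise solve, cf. (19)
    return Ur @ F_t @ Ug.T                    # rotate back
```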
Now, let us denote by D_x, D_y the matrices of the differential operators ∇_x, ∇_y: ∇_x F = D_x F, ∇_y F = F D_y, and compute the eigendecompositions of the matrices (D_xᵀ D_x) and (D_y D_yᵀ), respectively:

    D_xᵀ D_x = U_xᵀ L_x U_x,   D_y D_yᵀ = U_yᵀ L_y U_y.

Following the same steps as in the analysis of (13), we obtain the solution of problem (14), which is given by

    V^{k+1}_{i,j} = ( U_x [ λ D_xᵀ (D_x^k − B_x^k) + λ (D_y^k − B_y^k) D_yᵀ + ν F^{k+1} − ν W^k ] U_yᵀ )_{i,j} / l_{i,j},    (20)

where l_{i,j} = ν + λ(L_x)_{ii} + λ(L_y)_{jj}, i, j = 1, ..., n. Note that U_x, U_y are Fourier matrices, and thus all matrix multiplications in (20) can be performed very efficiently by Fast Fourier Transforms and by differentiation with respect to x, y. In the "classical" Split Bregman scheme, the subproblems (13), (14) would be presented as one optimization problem involving all the terms now distributed between them. However, solving such a problem would not be straightforward: either some iterative minimization algorithm has to be employed to find an accurate solution of the subproblem at each step (as in [8]), or some severe limitations (incompatible with TV minimization) must be imposed on all the operators (as in [9]). Our method, as demonstrated above, allows the application of the Sylvester method for solving such problems; thus it is much faster than any existing optimization procedure.
Fig. 1. Image PSNR (dB) vs. sampling rate r (the number of measurements M = r · n · n), for image sizes n = 64, 128, 256, 512, 1024

Fig. 2. Comparison of image quality (PSNR, dB) vs. computational time (s) for three algorithms: UTRC TV-SBI, "original" TV-SBI, and GPSR
Now, let us consider the problems (15), (16). Since there is no coupling between the elements of D in either subproblem, we can solve them on an element-wise basis. It is straightforward to obtain the formula for the solution ([8], [9]):

    (D_x^{k+1})_{i,j} = shrink( (∇_x V^{k+1} + B_x^k)_{i,j}, 1/λ ),  i, j = 1, ..., n,    (21)

    (D_y^{k+1})_{i,j} = shrink( (∇_y V^{k+1} + B_y^k)_{i,j}, 1/λ ),  i, j = 1, ..., n,    (22)

where shrink(d, γ) = sign(d) max(|d| − γ, 0), d, γ ∈ ℝ.

Assembling (11), (12) and (19)–(22) we obtain the formal description of the algorithm. Note that the matrices U_R, U_G, U_x, U_y, L_R, L_G, L_x, L_y are independent of the particular image and can be computed in advance. Solving problem (7) is quite similar to (6); the only difference is that the subproblems (15), (16) have to be replaced by the subproblem

    [D_x^{k+1}, D_y^{k+1}] = arg min_{D_x,D_y} Σ_{i,j} √((D_x)_{i,j}^2 + (D_y)_{i,j}^2) + (λ/2)‖D_x − ∇_x V^{k+1} − B_x^k‖_2^2 + (λ/2)‖D_y − ∇_y V^{k+1} − B_y^k‖_2^2.    (23)

The solution of (23) can be expressed by

    (D_x^{k+1})_{i,j} = shrk( (∇_x V^{k+1} + B_x^k)_{i,j}, (∇_y V^{k+1} + B_y^k)_{i,j}, 1/λ ),    (24)

    (D_y^{k+1})_{i,j} = shrk( (∇_y V^{k+1} + B_y^k)_{i,j}, (∇_x V^{k+1} + B_x^k)_{i,j}, 1/λ ),    (25)

where shrk(d_1, d_2, γ) = d_1 (d_1^2 + d_2^2)^{−1/2} max( (d_1^2 + d_2^2)^{1/2} − γ, 0 ), d_1, d_2, γ ∈ ℝ. Thus the algorithm for solving the anisotropic problem (7) is defined by (11), (12), (19), (20), (24), (25).
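Both shrinkage operators are simple element-wise maps. A NumPy sketch of (21)–(22) and (24)–(25) follows; the arrays and the guarded division at zero magnitude are the only additions beyond the printed formulas.

```python
import numpy as np

def shrink(d, gamma):
    # element-wise soft-thresholding of (21)-(22): sign(d) * max(|d| - gamma, 0)
    return np.sign(d) * np.maximum(np.abs(d) - gamma, 0.0)

def shrk(d1, d2, gamma):
    # coupled shrinkage of (24)-(25): threshold the magnitude of (d1, d2)
    # while keeping the d1 component of the direction
    mag = np.sqrt(d1**2 + d2**2)
    return d1 * np.maximum(mag - gamma, 0.0) / np.where(mag > 0.0, mag, 1.0)

# usage inside the iteration, e.g.: Dx = shrink(grad_x_of_V + Bx, 1.0 / lam)
```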
Fig. 3. Algorithm performance comparison. The sampling rate is r = 0.3, the image resolution is 512 × 512. (a) Original image. (b) Reconstruction by GPSR, PSNR = 25.3 dB, time = 29.9 seconds. (c) Reconstruction by original TV-SBI with separable matrix and PCG solver, PSNR = 30.1 dB, time = 134.9 seconds. (d) Reconstruction by our "UTRC TV-SBI" algorithm, PSNR = 29.3 dB, time = 16.9 seconds.
3 Numerical Experiments
We have performed a set of numerical experiments to test the effectiveness of our proposed algorithm. The parameters of the method were fixed as follows: μ = 1, λ = 50, ν = 5. All the experiments were run using MATLAB code on a standard PC with an Intel Duo Core 2.66GHz CPU and 3GB of RAM. We first tested the image quality (measured using PSNR) with respect to the CI sampling rate. The number of algorithm iterations is kept constant and thus the computational time varies just slightly. Typical results are shown in Fig. 1. As expected, as the number of measurements increases, the reconstructed image quality gets better. At the same time, the PSNR gets better as image resolution increases with the same sampling rate. This is because the relative image sparsity with respect to image resolution is smaller for larger images. Experiments on different images generated similar results.
Fig. 4. An example image of 16MP reconstructed by our algorithm
We then compared the performance of our proposed algorithm (named "UTRC TV-SBI") with the original TV-SBI algorithm with the PCG (Preconditioned Conjugate Gradient) solver [8]. This method was sped up by using separable sensing matrices as proposed in [2]. We also used for comparison one of the most efficient algorithms in the literature, GPSR (Gradient Projection for Sparse Reconstruction) [10]. We did not compare against the original TV-SBI algorithm [8] without separable sensing matrices because the computer runs out of memory even for moderate image sizes. For the same reason, we had to use GPSR with a sparse sensing matrix, as a full one would require 20 GB of storage. Fig. 2 presents the image quality vs. computational time results for the three algorithms. We used a sampling rate of r = 0.3. Each line in the plot was generated based on different convergence criteria of the algorithms. As the number of iterations increases, the final image quality increases and saturates at some PSNR. The line corresponding to the "original" TV-SBI starts so high because the very first iteration of the method takes significant time due to the slow convergence of PCG; as a reward, though, it produces a result with high PSNR. It can easily be seen that, with comparable image quality, our algorithm is up to an order of magnitude faster than GPSR and up to two orders of magnitude faster than the original TV-SBI algorithm with separable sensing matrices. The final results of image reconstruction by each of the three algorithms are shown in Fig. 3. The image resolution is 512 × 512. The proposed algorithm yields image quality comparable with the original TV-SBI with separable matrices, but much better image quality than GPSR. Besides, the proposed algorithm is several times more efficient than the original TV-SBI and almost twice as efficient as GPSR. Experiments with different sampling rates generated similar results.
Finally, we have used our algorithm to recover images as large as 16 megapixels. Fig. 4 shows an example of such a reconstruction with compression rate 20%, which took 1.5 hours on the same computer platform described earlier in this paper. The sampling rate is r = 0.3. The resulting image has a PSNR of 31.7 dB.
4 Conclusions
This paper presents a new algorithm for Compressive Imaging built upon the approaches of Total Variation minimization and a Separable Sensing Matrix. Rather than mechanically combining these two ideas, the algorithm welds them together in a specific and computationally efficient way that allows a speed-up of the image recovery by almost an order of magnitude. Numerical experiments show a significant computational advantage of the proposed algorithm over such efficient methods as Split Bregman ([8], [2]) and GPSR ([10]). Our algorithm also has the advantage of memory efficiency because it uses separable sensing matrices. With a standard PC, our algorithm can reconstruct images of up to 16 megapixels. To the best of our knowledge, this is the largest image recovery reported in the literature to date using a typical PC. The authors thankfully acknowledge the help and support kindly given to them by Dr. Alan Finn, Dr. Yiqing Lin and Dr. Antonio Vincitore at the United Technologies Research Center.
References
1. Donoho, D.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)
2. Rivenson, Y., Stern, A.: Compressed imaging with a separable sensing operator. IEEE Signal Processing Letters 16(6), 449–452 (2009)
3. Duarte, M., Baraniuk, R.: Kronecker compressive sensing. Submitted to IEEE Transactions on Image Processing (2009)
4. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
5. Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for total variation-based image restoration. Multiscale Modeling and Simulation 4, 460–489 (2005)
6. Ma, S., Yin, W., Zhang, Y., Chakraborty, A.: An efficient algorithm for compressed MR imaging using total variation and wavelets. In: IEEE Conf. on Computer Vision and Pattern Recognition 2008, pp. 1–8 (2008)
7. Han, W., Yu, H., Wang, G.: A general total variation minimization theorem for compressed sensing based interior tomography. Int. J. of Biomedical Imaging 2009, Article ID 125871 (2009), doi:10.1155/2009/125871
8. Goldstein, T., Osher, S.: The Split Bregman method for L1 regularized problems. UCLA CAM Report (2008)
9. Plonka, G., Ma, J.: Curvelet-wavelet regularized Split Bregman iteration for compressed sensing (2009) (preprint)
10. Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing 1(4), 586–598 (2007)
Circle Location by Moments

Vivian Cho¹ and Wm. Douglas Withers¹,²

¹ Department of Mathematics, United States Naval Academy, Annapolis, MD 21402
² Accusoft Pegasus
[email protected]

Abstract. We present a method for locating circles in images based on the moment values of specially-constructed functions. Our approach differs from Hough-transform-type approaches in requiring no large accumulator array, searching operation, or preliminary edge-detection step. It differs from previous work on curve location by moments by the existence of an explicit formula for the circle parameters and a theoretically proven guarantee of success. The moment approach is inherently robust with respect to blurriness, pixelization, or small-scale noise.
1 Introduction
Locating circles within images is an important computer-vision task with practical applications in astrogation, image registration, camera calibration, biometric identification, and localization of machine parts. Many approaches proposed for circle location rely on the basic approach of the Hough transform, which, though originally conceived as a line-location method, was applied to the problem of locating more general curves as early as 1972 (see [1]).
Fig. 1. Moment methods of curve location tend to seek out the single most prominent low-frequency structure
While intuitive and robust, the Hough transform presents serious resource issues of both time and space with an increasing number of parameters in the family of curves under consideration. For this reason, circle-location methods using the Hough transform generally rely on one or more ingenuities to reduce the dimensionality of the required accumulator array and/or the time required to search the array; [4] includes a brief survey. An alternative approach of using moments to locate features in images has origins as early as 1989 for the case of straight lines (see [2]). [3] presents a computationally simpler and more general method, and [4] uses moments to locate more general curves. The moment approach offers strengths complementary to those of Hough-type methods: no accumulator array is needed, no searching is involved, and the moment calculations are inherently robust with respect to blurriness and high-frequency noise. A major distinction is that while Hough-type methods yield many curves or fragments of curves within the region analyzed, moment methods yield a single global solution representing the dominant feature in the region analyzed (Fig. 1). Useful preprocessing operations and recursive-subdivision approaches to the problems of image clutter and multiple curves are discussed in [4]. The theory developed in [4] concerns a curve described by an equation

    Σ_{n=1}^{N} p_n z_n(x, y) = 0.    (1)
This includes, for example, the case of a general circle with the four-term equation:

    p₁ + p₂x + p₃y + p₄(x² + y²) = 0.    (2)

The method takes S to be a bounded subset of the plane corresponding to an image or a subregion thereof. Let I(x, y) represent the grayscale values of the image for (x, y) ∈ S. Suppose the image takes constant values on S ∩ Γ and S − Γ, where the curve ∂Γ is described by Equation (1). Our goal is to recover the values of p_n from moment values of the image. For a given function U(x, y) defined on S, define the associated moment value ⟨I, U⟩:

    ⟨I, U⟩ = ∫∫_S I(x, y) U(x, y) dx dy.
The chief tool of the method is the following theorem:

Theorem 1. Let V be an open set containing S. Define s(x, y) twice continuously differentiable on V and F(x, y) continuously differentiable on V such that F(x, y) = 0 on ∂S. For n = 1, ..., N, define U_n(x, y) = Υ(F, s, z_n), where

    Υ(F, s, z) = ([s z_y − z s_y] F)_x − ([s z_x − z s_x] F)_y,

the x and y subscripts indicating partial differentiation. Then

    Σ_{n=1}^{N} ⟨I, U_n⟩ p_n = 0.    (3)
Example: For the circle, we have z₁(x, y) = 1, z₂(x, y) = x, z₃(x, y) = y, z₄(x, y) = x² + y². Take S to be a subsquare of the image. By a change of coordinates, if necessary, we may consider S to be the region −1 ≤ x ≤ 1, −1 ≤ y ≤ 1. Define

    F(x, y) = max(0, 1 − x²) max(0, 1 − y²),    (4)

then F ≡ 0 on ∂S as required by Theorem 1. Define s(x, y) ≡ 1. Then we have:

    U₁(x, y) = Υ(F, s, 1) = 0,
    U₂(x, y) = Υ(F, s, x) = 2y(1 − x²),
    U₃(x, y) = Υ(F, s, y) = −2x(1 − y²),
    U₄(x, y) = Υ(F, s, x² + y²) = 4xy(y² − x²).

If we now further suppose that our image function I(x, y) is zero-valued outside a circle of radius 3/2 centered at (−1/2, −1) and has value 1 inside the circle (corresponding to p₁ = −1, p₂ = 1, p₃ = 2, p₄ = 1), we can calculate the corresponding moment values over S: u₁ = 0, u₂ = −1.13136, u₃ = 0.561783, u₄ = 0.007794. It is easily verified that (3) holds. Here the functions F and s are largely arbitrary. In practice the differentiability assumptions can be relaxed somewhat. Various choices of (F, s) yield various equations of the form (3). The goal is to generate a system of such equations solvable for the parameters p_n up to multiplication by a single nonzero constant (which leaves invariant the curve described by (1)). In practice F also serves to define the region S, as the region where F(x, y) is nonzero together with its boundary. In this article we take F as defined by (4), so that S is the square [−1, 1] × [−1, 1]. With appropriately defined coordinates, this square can represent any subsquare of the image. Theorem 1 leaves us with two problems. The first is choosing a set of (F, s) yielding a system of equations solvable for p_n, the ease or difficulty of which is highly dependent on the class of curves under consideration. The bare minimum number of equations required is the number of independent parameters in the curve equation. Not infrequently more are needed, due to redundancy of one or more equations, which may, moreover, be contingent upon the curve position. The second problem is solving the resulting system of equations: a thoroughly understood but possibly complicated matter of linear algebra. Explicit formulas for the solution of the system are therefore valuable if available. Two examples: (1) The relatively simple case of locating straight lines ([3]) uses three equations based on three distinct moment functions, with a simple explicit form for the solution. (2) The more complex problem of locating a general conic with five parameters uses 18 equations based on 45 distinct moment functions in order to minimize the potential for excess redundancy; however, for
the class of general conic curves this potential persists for fundamental reasons of geometry no matter how many equations are used. Special measures for coping with this and other aspects of this problem are discussed at length in [4]. This article treats the family of circles described by (2) and presents a method based on Theorem 1 for locating circles in images using moment information. This class of curves is a subset of that treated in [4]; we use the special properties of this class to present a specialized solution, which is simpler to implement than the general approach of [4]. Moreover, in contrast to [4], we give precise conditions guaranteeing the existence of a unique solution and an explicit formula for the solution.
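The worked example above can be checked numerically by approximating the moment integrals with Riemann sums on a pixel grid. The following sketch does so; the grid resolution is an arbitrary assumption, and the printed values should approach the quoted u₂, u₃, u₄ and the Theorem 1 identity.

```python
import numpy as np

# indicator image of the circle of radius 3/2 centered at (-1/2, -1) on S
N = 2001
t = np.linspace(-1.0, 1.0, N)
X, Y = np.meshgrid(t, t)
I = ((X + 0.5)**2 + (Y + 1.0)**2 <= 1.5**2).astype(float)

U2 = 2.0 * Y * (1.0 - X**2)
U3 = -2.0 * X * (1.0 - Y**2)
U4 = 4.0 * X * Y * (Y**2 - X**2)
dA = (t[1] - t[0])**2
u2, u3, u4 = (np.sum(I * U) * dA for U in (U2, U3, U4))

print(u2, u3, u4)            # approaches -1.13136, 0.561783, 0.007794
print(1*u2 + 2*u3 + 1*u4)    # p2*u2 + p3*u3 + p4*u4, approximately 0 by (3)
```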
2 Theory
A circle with center (x₀, y₀) and radius r has equation (2), where p₄ ≠ 0 and

    p₂ = −2p₄x₀,  p₃ = −2p₄y₀,  p₁ = p₄(x₀² + y₀² − r²).    (5)

At least one of p₂, p₃, p₄ must be nonzero for (2) to describe a curve. In the case p₄ = 0, this equation describes a straight line, which is a limiting case of a circle as the radius goes to infinity. Our method locates straight lines or circles with a unified approach. We use the term "generalized circle" to refer to either a circle or straight line, as opposed to a "true circle" with finite radius (p₄ ≠ 0). Perhaps the simplest conceivable method would be to define moment functions M_n such that ⟨I, M_n⟩ = p_n. Unfortunately, no such set of functions exists. For necessarily ⟨I, M₄⟩ = 0 for any I consisting of a single straight edge (p₄ = 0). One can then argue, using the invertibility of either the Radon or Fourier transform, that M₄ ≡ 0, making recovery of p₄ impossible. Our slightly more complicated method operates in two phases. The first phase identifies a line Λ with equation

    x cos φ + y sin φ = κ    (6)

which intersects ∂Γ orthogonally. The second phase defines a set of moment functions M₁, ..., M₄ in terms of Λ such that ⟨I, M_n⟩ = p_n for any generalized circle orthogonal to Λ.

Phase I: Determining an Orthogonal Line. Let I(x, y), S, V, and F(x, y) be as in Theorem 1, and assume ∂Γ is a generalized circle described by (1). Applying Theorem 1 with s = 1, define:

    A(x, y) = Υ(F, 1, x) = −F_y,
    B(x, y) = Υ(F, 1, y) = F_x,
    C(x, y) = Υ(F, 1, x² + y²) = 2yF_x − 2xF_y.

Note Υ(F, 1, 1) = 0. Theorem 1 then yields:

    ⟨I, A⟩p₂ + ⟨I, B⟩p₃ + ⟨I, C⟩p₄ = 0.    (7)
We claim that (6) describes a line orthogonal to ∂Γ if ρ, φ, κ satisfy:

    ⟨I, A⟩ = ρ cos φ,  ⟨I, B⟩ = ρ sin φ,  ⟨I, C⟩ = 2ρκ.    (8)

For true circles it suffices to show that the center (x₀, y₀) lies on Λ. This is verifiable by direct substitution of (5) and (8) into (7). In the straight-line case, orthogonality follows from the value ⟨I, B⟩/⟨I, A⟩ of the slope of ∂Γ (see [3]). This method fails if ⟨I, A⟩ = ⟨I, B⟩ = ρ = 0. In the straight-line case (see [3]), ρ > 0 provided that F(x, y) ≥ 0 everywhere (nonnegativity) and ∂Γ intersects the support of F nontrivially (impingency). Impingency is an obvious requirement; a curve disjoint from the set of pixels to be examined is unlocatable. For the true-circle case we provide the following back-up method. For the sake of brevity we merely sketch the derivation. An alternate formulation for Λ, and κ in particular, is given by defining:

    P(x, y) = Υ(F, x, y),  Q(x, y) = Υ(F, x, x² + y²),  R(x, y) = Υ(F, y, x² + y²).

Let ρ, φ be defined as in (8). Then it can be shown that:

    x₀ = (p₁/p₄) ρ sin φ / (2⟨I, P⟩) − ⟨I, R⟩ / (2⟨I, P⟩),
    y₀ = −(p₁/p₄) ρ cos φ / (2⟨I, P⟩) + ⟨I, Q⟩ / (2⟨I, P⟩).

It follows that

    x₀ cos φ + y₀ sin φ = (⟨I, Q⟩ sin φ − ⟨I, R⟩ cos φ) / (2⟨I, P⟩) = κ (alternate).    (9)

Formula (9) also fails if ⟨I, P⟩ = 0; however, it can be proven that nonnegativity and impingency together guarantee that at least one of ρ or ⟨I, P⟩ is nonzero. This alternate formulation of Λ is numerically stable as ρ → 0 in the following subtle sense: even though φ as defined by (8), and hence the position of Λ, is not numerically stable, the (ideally zero) distance from (x₀, y₀) to Λ is increasingly insensitive to errors in φ as ρ → 0. For example, in the extreme case ρ = 0,

    x₀ = −⟨I, R⟩ / (2⟨I, P⟩),  y₀ = ⟨I, Q⟩ / (2⟨I, P⟩),

so that (9) holds for any value of φ.

Phase II: Determining the Circle Parameters. Now assume ∂Γ to be defined by Equation (2) and Λ to be a line with equation (6) (determined by the method of the previous section or any other) which intersects ∂Γ orthogonally. Let S and V be sets satisfying the conditions of Theorem 1, and let G(x, y) be a function satisfying the same conditions as F(x, y) in Theorem 1. G need not be (and in fact should not be, for a reason given below) the same as the F used in Phase I. The parameters of Equation (2) can then be determined as follows:

Theorem 2. Define:

    M₁(x, y) = Υ(G, −x sin φ + y cos φ, −2κ[x cos φ + y sin φ] + x² + y²);
    M₂(x, y) = Υ(G, 1, −2κy + (x² + y²) sin φ);
    M₃(x, y) = Υ(G, 1, 2κx − (x² + y²) cos φ);
    M₄(x, y) = Υ(G, 1, −x sin φ + y cos φ).
Then all points (x, y) on ∂Γ satisfy

    ⟨I, M₁⟩ + ⟨I, M₂⟩x + ⟨I, M₃⟩y + ⟨I, M₄⟩(x² + y²) = 0.    (10)

Proof. This is a matter of direct calculation, aided by Green's Theorem, which yields:

    ⟨I, Υ(G, s, z)⟩ = h ∫∫_Γ Υ(G, s, z) dx dy = h ∮_{∂Γ} G([s z_x − z s_x] dx + [s z_y − z s_y] dy),

where h is the difference in gray levels between the left and right sides of ∂Γ. Note in particular:

    h ∫∫_Γ Υ(G, 1, z) dx dy = h ∮_{∂Γ} G dz.

Even in the straight-line case, ∂Γ can be treated as a closed curve by assuming it bends around outside the region S (where G is zero-valued). For example, we have:

    ⟨I, M₄⟩ = h ∮_{∂Γ} G d(−x sin φ + y cos φ).

Consider first the true-circle case. Following (5), it suffices to show:

    ⟨I, M₁⟩ = (x₀² + y₀² − r²)⟨I, M₄⟩;  ⟨I, M₂⟩ = −2x₀⟨I, M₄⟩;  ⟨I, M₃⟩ = −2y₀⟨I, M₄⟩.

Introducing a parametric representation for the circle:

    x = x₀ + r cos θ,  y = y₀ + r sin θ,  0 ≤ θ ≤ 2π,    (11)

we find:

    ⟨I, M₄⟩ = h ∮_{∂Γ} G (r sin θ sin φ + r cos θ cos φ) dθ = h ∮_{∂Γ} G r cos(θ − φ) dθ.    (12)

We then calculate:

    ⟨I, M₃⟩ = h ∮_{∂Γ} G d(2κx − (x² + y²) cos φ)
            = h ∮_{∂Γ} G (−2κr sin θ + 2x₀r cos φ sin θ − 2y₀r cos θ cos φ) dθ
            = h ∮_{∂Γ} G (−2r sin θ[κ − x₀ cos φ] − 2y₀r cos θ cos φ) dθ
            = h ∮_{∂Γ} G (−2r sin θ[y₀ sin φ] − 2y₀r cos θ cos φ) dθ
            = −2y₀ h ∮_{∂Γ} G r cos(θ − φ) dθ = −2y₀⟨I, M₄⟩.

⟨I, M₁⟩ and ⟨I, M₂⟩ are treated similarly. In the straight-line case, let (x₁, y₁) be the point of intersection of ∂Γ and Λ. Being orthogonal to Λ, ∂Γ has equation p₁ + p₂x + p₃y = 0, where p₁ = y₁ cos φ − x₁ sin φ, p₂ = sin φ, p₃ = −cos φ (and p₄ = 0). We can also write ∂Γ parametrically as:

    x = x₁ + t cos φ,  y = y₁ + t sin φ.    (13)

Direct calculation then yields ⟨I, M_n⟩ = K p_n, where

    K = h ∮_{∂Γ} 2Gt dt.    (14)
A final concern is that (10) might hold trivially, in that ⟨I, M_n⟩ = 0 for all n. (In fact, this happens if G is taken identical to the F from Phase I.) We can avoid this by noting that in the true-circle case (11):

    x cos φ + y sin φ − κ = r cos(θ − φ),

while in the straight-line case (13):

    x cos φ + y sin φ − κ = t.

It therefore suffices to choose G(x, y) so that (x cos φ + y sin φ − κ)G(x, y) is nonnegative, for example:

    G(x, y) = (x cos φ + y sin φ − κ) max(0, 1 − x²) max(0, 1 − y²).

Impingency then implies that the integrand of (12) or (14) is positive somewhere and negative nowhere, so the resulting integral is necessarily nonzero. A general discussion of the organization of the method for computational efficiency is given in [4]. As noted there, the moment computation dominates the operation count (even more so for the simplified method presented here), and the overriding concern is the degree of the moment functions in x alone. The functions used in Phase I are all of cubic degree or lower in x; this phase of the computation thus can be implemented using three multiplications and four additions per pixel. The functions used in Phase II are all of fifth degree or lower; Phase II thus entails an additional two multiplications and two additions per pixel.
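The whole of Phase I amounts to three moment computations followed by a little trigonometry on (8). A sketch follows, reusing the indicator-circle example from the introduction; the grid resolution is again an illustrative assumption, and A, B, C are the closed forms of Υ(F, 1, x), Υ(F, 1, y), Υ(F, 1, x² + y²) for the F of (4).

```python
import numpy as np

N = 1001
t = np.linspace(-1.0, 1.0, N)
X, Y = np.meshgrid(t, t)
I = ((X + 0.5)**2 + (Y + 1.0)**2 <= 1.5**2).astype(float)

A = 2.0 * Y * (1.0 - X**2)           # -F_y on S
B = -2.0 * X * (1.0 - Y**2)          # F_x on S
C = 4.0 * X * Y * (Y**2 - X**2)      # 2y F_x - 2x F_y on S
dA = (t[1] - t[0])**2
mA, mB, mC = (np.sum(I * U) * dA for U in (A, B, C))

rho = np.hypot(mA, mB)               # from <I,A> = rho cos(phi), <I,B> = rho sin(phi)
phi = np.arctan2(mB, mA)
kappa = mC / (2.0 * rho)             # from <I,C> = 2 rho kappa
# the circle center (-1/2, -1) should lie (approximately) on Lambda:
print(-0.5 * np.cos(phi) - 1.0 * np.sin(phi) - kappa)   # close to 0
```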
3 Examples
This section presents illustrations of our method in practice, with particular emphasis on the special capabilities of the moment approach. To begin with, we ran a series of 5000 test cases on randomly-generated 500×500 images containing whole or partial circles. In all cases the circle contained between 10% and 90% of the image area. Goodness of fit was measured by reconstructing the image from the calculated parameter values and counting the pixels differing between the image and its reconstruction. The average case showed a difference in 0.0093% of pixel values. The worst case showed a difference in 1.36% of pixel values.

Figure 2 shows our method used for locating a blurred circle. This type of image is problematic for Hough-type approaches or other methods requiring a preliminary edge detection step. However, the moment method, which relies on global moment values, is scarcely impeded. In this and all other examples, with the exception of Fig. 5, the method was applied with no preprocessing whatsoever. Figure 1 shows successful location of a circle which is not only blurred but overlaid with distracting high-frequency features. Figure 3 shows fitting of a circle to an approximate circular boundary. Figure 4 shows successful circle location in an image consisting of the same two pixel values both inside and outside the circle, the boundary marked only by a shift in frequency.

Figure 5 shows the possible application of our method to the problem of locating the pupil in a red-eye-removal application. In this case simple preprocessing (paint a pixel white if its red component value exceeds its green component value by 77 or more, gray otherwise) was used to isolate the red-flare region. Subsequent application of our method serves to locate the pupil with subpixel accuracy.

Fig. 2. Circle location on a blurred image

Fig. 3. Fitting a circle to an approximate boundary

Fig. 4. Locating a circle defined by a change in frequency

Fig. 5. Locating the pupil in a potential red-eye removal application: (a) grayscale representation of the (originally color) image; (b) red component of the image, showing flare in the pupil; (c) pupil located to sub-pixel accuracy
References
1. Duda, R.O., Hart, P.E.: Use of the Hough Transform to Detect Lines and Curves in Pictures. Communications of the ACM 15, 11–15 (1972)
2. Lyvers, E.P., Mitchell, O.R., Akey, M.L., Reeves, A.P.: Subpixel Measurements Using a Moment-Based Edge Operator. IEEE Trans. Pattern Anal. Machine Intell. 11(12), 1293–1309 (1989)
3. Popovici, I., Withers, W.D.: Custom-Built Moments for Edge Location. IEEE Trans. Pattern Anal. and Machine Intell. 28(4), 637–642 (2006)
4. Popovici, I., Withers, W.D.: Curve Parametrization by Moments. IEEE Trans. Pattern Anal. and Machine Intell. 31(1), 15–26 (2009)
Remote Sensing Image Registration Techniques: A Survey

Suma Dawn¹, Vikas Saxena¹, and Bhudev Sharma²

¹ Department of Computer Science Engineering and Information Technology
² Department of Mathematics
Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P., India
{suma.dawn,vikas.saxena,bhudev.sharma}@jiit.ac.in
Abstract. Image registration is the first step towards using remotely sensed images for any purpose. Despite numerous techniques being developed for image registration, only a handful have proved to be useful for registration of remote sensing images, due to their characteristic of being computationally heavy. A recent flux in technology has prompted a legion of approaches that may suit divergent remote sensing applications. This paper presents a comprehensive survey of such literature, including recently developed techniques.

Keywords: Remote sensing images, image registration, feature selection, control point matching.
1 Introduction

Image registration is the process of transforming different sets of data into one coordinate system; it may also be described as the process of overlaying two or more images of the same scene taken at different times, from different viewpoints, or from different sensors. Its main aim is to geometrically align two images. Despite numerous techniques being developed for image registration [1], only a few have proved to be useful for registration of remote sensing images, due to their characteristic of being computationally heavy. A recent flux in technology has prompted a legion of approaches that may suit a particular remote sensing application. Registration is used in the first phase of analysis of images received from one or more sensors (multimodal) having spatial or temporal variations. Major applications of remote sensing image registration are in the fields of medicine, cartography, climatology, archaeo-survey, hydrology and hydrogeology, pattern recognition, geographical information systems, etc. [2], [4], [24], [38], [77]. The procedure followed for registering two remotely sensed images of a given scene, as illustrated by Goshtasby [2], is: (a) preprocessing, (b) feature selection, (c) feature correspondence, (d) determination of a transformation function, and (e) resampling. Of these aforementioned steps, feature selection, correspondence, and transformation function determination are the ones to which numerous manipulation techniques may be applied. It is these variations that form the basis of the classification stated in this letter.
Section 2 categorizes some of the recent trends and technologies for image registration. A comparative study is presented in Section 3. Finally, Section 4 concludes this letter.
2 Classification of Image Registration Techniques

The feature selection phase may be based on the orientation of particular structures (regions, lines, curves, corners, etc.) present in the images, or registration may depend on the geometric orientation of the pixels and their intensities. These categories are feature-based and intensity-based, respectively. Classical intensity-based methods for registering imagery are cross-correlation, normalized cross-correlation (NCC), and minimum distance criteria [3], wherein registration involves calculating a transformation between two images using the pixel or voxel values alone. In its purest form, the registration transformation is determined by iteratively optimizing some similarity measure calculated from all pixel or voxel values [86]. Ground control points (GCP), as primarily used by feature-based registration routines, represent centers of unique neighbourhoods that contain considerable image information [2]. Much of this work is brought together by Mohr [37], Brown [4], and Flusser and Zitova [5]. Based on the evolution of techniques and the minor deviations incorporated by these registration styles, we have categorized the literature according to the core concepts used for feature selection, correspondence, or transformation function determination. The classes are as follows:

Class 1 - Similarity-metrics based methods. Similarity metrics like mutual information, the correlation coefficient, and histograms, among others, though traditionally used as measures of the correctness of registered images, have also evolved into registration methods in their own right. Correlation-based similarity measures are a basic statistical approach and are useful for images which are misaligned by small rigid or affine transformations [59]; a sketch of the classical NCC criterion appears after this paragraph. Mutual information has found acceptance both in geosciences applications and in medical imaging, because of its generality and high accuracy [34], [76], Arya et al [32], Chen et al [39], and Boes and Meyer [43]. Li et al [63] presented an automated intensity-based image registration approach, namely the simultaneous perturbation stochastic approximation (SPSA) algorithm. Xie et al [40] analyzed two interpolation algorithms, namely bilinear interpolation and partial volume interpolation. A new indicator, the misregistration value, was proposed by Lu et al [64] to assess registration accuracy based on the estimated offset between corresponding points in the principal and candidate images computed by a spectral diversity algorithm.
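As a concrete reference point for the intensity-based family, the following is a minimal sketch of NCC-based registration of a pure integer translation; the exhaustive search and the absence of pyramids or subpixel refinement are simplifications for illustration, not features of any particular surveyed method.

```python
import numpy as np

def ncc(a, b):
    # normalized cross-correlation of two equally-sized patches
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def register_translation(ref, tpl):
    # exhaustively score every integer placement of tpl inside ref
    H, W = ref.shape
    h, w = tpl.shape
    best = (-2.0, (0, 0))
    for dy in range(H - h + 1):
        for dx in range(W - w + 1):
            score = ncc(ref[dy:dy + h, dx:dx + w], tpl)
            best = max(best, (score, (dy, dx)))
    return best   # (NCC score, (row, col) offset of the best match)
```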
Class 2 - Methods using a combination of Simulated Annealing and Genetic Algorithms. Yagouni [6] proposed a rigid registration technique for two monomodal images which adapted two optimization metaheuristics, Simulated Annealing and Genetic Algorithms, in their sequential and parallel versions. A combination of Simulated Annealing based Marquardt-Levenberg (SA-ML) with mutual information, together with a wavelet-based multiresolution pyramid due to Simoncelli, was introduced by Ghorbani et al [70]. Genetic Algorithm application to image registration was also explored by Seixas et al [73], who addressed the point matching problem by employing a method based on nearest neighbors.

Class 3 - Pyramid or triangular and coarse-to-fine method based. Massout et al [24] introduced a method based on pyramid algorithms and morphological filtering to create synthesis images having both high spatial resolution and high spectral resolution. To algorithmically improve registration time, a pyramid scheme providing a coarse-to-fine grain registration process was employed by Bui et al [9]. Xuehua et al [13] used a combination of the Moravec and Forstner operators for feature point matching. A coarse-to-fine matching process was also used by Chen et al [31], in which the image pyramids of the working and reference images were constructed first, followed by feature point matching. Yulin et al [52] proposed a coarse-to-fine matching scheme incorporating an edge feature consensus method to rectify the rotation and translation difference between the SAR images globally in the coarse-level registration, and automatically rectifying global control point pairs from the edge maps. Another multi-source image registration approach using a coarse-to-fine method was proposed by Zhang et al [56]. Lu et al [33] proposed another coarse-to-fine registration framework using Quantum Particle Swarm Optimization (QPSO) as the optimizer. Quick and robust geometric registration of parallax images using a coarse-to-fine strategy was proposed by Junbin et al [65].

Class 4 - Methods using points, edges, corners, road junctions and road point extraction as basis. Man-made structures provide prominent straight lines, junctions, T-points, etc. For registration of city images via remote sensing, straight lines are used as an important feature by Bentoutou et al [46], Lou et al [58], Zou et al [28], Dell'Acqua et al [22], Xiong et al [29], Alkaabi et al [51], Wen et al [74] and Weijie et al [14]. Zhang [12] used the Radon transform as a tool for straight line detection. Bilinear interpolation and road-network-based GCPs, which are characteristic of urban remote sensing, were used by Guo et al [45] and Li et al [7]. A normalized sum of absolute differences (NSAD) metric based method was used by Alkaabi et al [51]. A multicamera registration and mosaicking method using control points and projective transformation was shown by Holtkamp et al [27]. Automatic Registration of Remote-Sensing Images (ARRSI), an automatic registration system built to register multimodal and multitemporal satellite and aerial remotely sensed images, was developed by Wong et al [68]; it used a phase-congruency model and a variation on the Random Sample and Consensus algorithm called Maximum Distance Sample Consensus (MDSAC), and was said to be capable of handling remotely sensed images geometrically distorted by various transformations such as translation, rotation, and shear.

Class 5 - Segmentation, Point-Scatter and Learning-based methods.
Pohl [38] justified using the segmented features of the image to generate and recursively evaluate a number of selected hypotheses as a basis for image registration, especially for multiresolution images. A modified watershed transformation algorithm was used for image segmentation by Zhao et al [16] for feature-based geometric registration. Goncalves et al [17] also proposed several measures for an objective evaluation of the geometric correction process. Similar measures were also proposed by Aldrighi et al [18], who used a mode-based feature matching scheme initially developed for computer vision applications but later adapted to pre- and post-event matching. Another novel algorithm for registration of SAR images was offered by Kang [61]. Isolated point scatterers (IPS) were used by Serafino et al [62]. Baboulaz [8] took maximum advantage of a priori knowledge of the acquisition filter for feature extraction. Barbu et al [21] proposed a system to learn the registration algorithm using a larger pool of feature-based registration algorithms, from which a small number were selected and combined using AdaBoost; the pool of weak registration algorithms from which the boosted algorithm is trained contains phase-correlation based algorithms and feature-based registration algorithms. A neural computational approach was presented by Srikanchana et al [42], which used a mixture of principal axes registration (mPAR) method and a multilayer perceptron neural network (MLP) to recover nonlinear deformations.

Class 6 - Fourier transformation and wavelet based techniques. Fourier-based methods work with the images in the frequency domain and utilize the translation and rotation properties of the Fourier transform to calculate the transformation parameters, as stated in Xu et al [19] and Li [75]. Zhou et al [48] used wavelet-based global image registration, which was also used by Hong et al in [44] and [69]. Use of the phase correlation (PC) method was shown by Erives and Fitzgerald [57]. Zhou et al [35], [48] tested serial and parallel strategies of wavelet-based automatic image registration for the ChinaGrid project. Xu et al [19] used the Multiple Signal Classifier (MUSIC) algorithm, whereas Qumar et al [31] used Discrete Multiwavelet Transform (DMWT) methods. Steerable Simoncelli filters were used by Li et al [71] and by Plaza, Le Moigne and Netanyahu [54]. Kasetkasem [20] proposed the use of the particle filtering algorithm to search for the global optima. Mixel decomposition of remote sensing images using a particle swarm intelligence searching method was proposed by Dong et al [26]. In another paper, Zavorin and Le Moigne [49] proposed a method to handle single- and multisensor satellite data with relatively small amounts of nonlinear radiometric variation.

Class 7 - Miscellaneous methods. A combination of an area-based matching technique with a feature-based matching technique is often used to register images. Hybrid procedures were developed by Jia et al [47], Li et al [50], and Jianying [66]. Cluster machines can be utilized for parallel implementation of mutual-information-based image registration, as presented by Du et al [55], which used conjugate-gradient methods implemented on a commercial off-the-shelf (COTS) cluster of parallel Linux computers. Gupta et al [72] presented an efficient VLSI architecture for real-time implementation of image registration algorithms using an exhaustive search method. Registration between GIS map data and remote sensing images may give erroneous results; Tian et al [53] suggested an automatic algorithm for this image-map registration problem using a modified partial Hausdorff distance (MPHD) as the distance measure. Image registration may also be modeled by a displacement vector field that may be estimated by measuring rigid local shifts for each pixel in the image, as shown by Inglada [67]. Tzeng et al [60] used a fractal method that can be applied to multitemporal polarimetric SAR images for change detection. Mitra et al [41] used point cloud data (PCD).

Table 1. Classification and comparison of techniques used for registration of remote sensing images based on the literature study

Similarity-metrics based methods
- Feature detection and extraction: Maximum likelihood estimator values, pseudorandom S/N surfaces, MI using nonlinear diffusion, GPVE, joint histogram, etc.
- Searching and feature matching: Iterative linearized least square estimation used for searching and feature matching.
- Image types used for experimentation: IRS P6, ASTER-ASTER/Map, SAR-SAR, Quickbird and EROS-B.

SA and GA based methods
- Feature detection and extraction: Distortion finding done using translation/affine transformations; edges, contours, wavelet coefficients, and moment invariant features.
- Searching and feature matching: A correspondence matrix based on a closest-point rule may be used.
- Image types used for experimentation: Monomodal, multitemporal and synthetic image sets of IRS-1C.

Pyramid or triangular and coarse-to-fine methods
- Feature detection and extraction: TDGO and Harris operator for edges; QPSO, wavelet transformation for feature detection.
- Searching and feature matching: Moravec and Forstner operators, NCC, polynomial transformation, weight vector matrices.
- Image types used for experimentation: SPOT, FORMOSAT-2, QuickBird, multispectral images.

Points, edges, corners, road-junctions based methods
- Feature detection and extraction: Radon detectors, bilinear interpolation, road network and junction points, corner point extraction, robust SIFT, Lee, Harris, Canny and Laplace detectors.
- Searching and feature matching: Peak and point pattern matching, RANSAC, clustering and coherence, interest points; thin-plate spline interpolation, etc.
- Image types used for experimentation: Optical, SAR, multispectrum, IKONOS, MODIS, Landsat ETM+, SPOT images.

Segmentation, point-scatter and learning-based methods
- Feature detection and extraction: Geometric moments used for feature retrieval; edge, corner, centroids, and Lowe or Sojka methods used for detection; neural networks used for continuous learning.
- Searching and feature matching: Feature matching done using spatial relations and clustering; neural computation involving MLP is also used.
- Image types used for experimentation: QuickBird, SAR, IKONOS images, Envisat data sets and images from UAV.

Fourier transformation and wavelet based techniques
- Feature detection and extraction: Obtain translation misalignment from the cross-power spectrum of two images by applying the MUSIC algorithm; computation of the decomposition of prefiltered images.
- Searching and feature matching: Feature searching done by bandpass or high-pass frequency filtering, followed by matching; phase correlation may also be used to register images.
- Image types used for experimentation: CCD and IR sensor images, Landsat TM and ETM+, IKONOS, MODIS, LiDAR, optical.

Miscellaneous methods
- Feature detection and extraction: Features used are gray levels, edges, and wavelet coefficients; pixel-by-pixel comparison of fractal images can be done by a simple difference operator.
- Searching and feature matching: Matching performed using cross-correlation, mutual information or a Hausdorff distance as similarity metrics; Fast Fourier correlation methods, Simoncelli wavelet features, SPSA and gradient ascent optimization also used.
- Image types used for experimentation: IKONOS, QuickBird, Landsat ETM+, multitemporal polarimetric SAR, multispectral aerial images; SAR images.
3 Comparative Study

The various classes described in this letter are presented in tabular format in Table 1, which illustrates not only the main feature detection and extraction methods, along with the searching and feature matching methods, but also the types of images most often used for experimentation with these methods.
4 Conclusion

In this letter, we have collected together methods which have been classified and categorized by considering the core theme or procedure used for registration. Recent techniques presenting a hybrid method seem to give better results, as shown by the experimentation done by various authors, for the requirements of the target application. Hence, many methods are classified not on the basis of the key classes of image registration methods, but rather on the mathematical and statistical methodology used in the background for the registration. Some of this work may be further enhanced in the direction of providing a better framework encompassing a wider range of applications.
References

[1] Fonseca, L.M.G., Manjunath, B.S.: Registration Techniques for Multisensor Remotely Sensed Imagery. Photogrammetric Engineering & Remote Sensing 62(9), 1049–1056 (1996)
[2] Goshtasby, A.: 2-D and 3-D Image Registration for Medical, Remote Sensing, and Industrial Applications. Wiley-Interscience Publication, Hoboken (2005)
[3] Webber, W.F.: Techniques for Image Registration. In: Conference on Machine Processing of Remotely Sensed Data (1973)
[4] Brown, L.G.: A Survey of Image Registration Techniques. ACM Computing Surveys 24, 326–376 (1992)
[5] Zitova, B., Flusser, J.: Image Registration Methods: A Survey. Image and Vision Computing 21, 977–1000 (2003)
[6] Yagouni, M.: Using Metaheuristics for Optimizing Satellite Image Registration. In: International Conference on Computers & Industrial Engineering (2009)
[7] Qiaoliang, L., Wang, G., Liu, J., Chen, S.: Robust Scale-Invariant Feature Matching for Remote Sensing Image Registration. IEEE Geoscience and Remote Sensing Letters 6(2), 287–291 (2009)
[8] Baboulaz, L., Dragotti, P.L.: Exact Feature Extraction Using Finite Rate of Innovation Principles with an Application to Image Super-Resolution. IEEE Transactions on Image Processing 18(2), 281–298 (2009)
[9] Bui, P., Brockman, J.: Performance Analysis of Accelerated Image Registration Using GPGPU. In: Proceedings of 2nd Workshop on GPGPU (2009)
[10] Wang, W., Liu, Y., Zheng, B., Lu, J.: A Method of Shape Based Multi-Sensor Image Registration. Joint Urban Remote Sensing Event, 1–5 (2009)
[11] Yang, J., Zhao, Z.: A New Image Feature Point Detection Method Based on Log-Gabor Gradient Feature. Joint Urban Remote Sensing Event, 1–5 (2009)
[12] Jun, Z.: A Study on Automated Image Registration Based on Straight Line Features. Joint Urban Remote Sensing Event, 1–6 (2009)
[13] Lin, X., Yongzhi, Z., Jiazi, Y.Y.: Application of Triangulation-Based Image Registration Method in the Remote Sensing Image Fusion. In: International Conference on Environmental Science and Information Application Technology, vol. 1, pp. 501–504 (2009)
[14] Weijie, J., Jixian, Z., Jinghui, Y.: Automatic Registration of SAR and Optics Image Based on Multi-Feature on Suburban Areas. Joint Urban Remote Sensing Event, 1–7 (2009)
[15] Serief, C., Bentoutou, Y., Barkat, M.: Automatic Registration of Satellite Images. In: First International Conference on Advances in Satellite and Space Communications, pp. 85–89 (2009)
[16] Zhao, Y., Liu, S., Du, P., Li, M.: Feature-Based Geometric Registration of High Spatial Resolution Satellite Imagery. Joint Urban Remote Sensing Event, 1–5 (2009)
[17] Goncalves, H., Goncalves, J.A., Corte-Real, L.: Measures for an Objective Evaluation of the Geometric Correction Process Quality. IEEE Geoscience and Remote Sensing Letters 6(2), 292–296 (2009)
[18] Aldrighi, M., Dell'Acqua, F.: Mode-Based Method for Matching of Pre- and Postevent Remotely Sensed Images. IEEE Geoscience and Remote Sensing Letters 6(2), 317–321 (2009)
[19] Xu, M., Varshney, P.K.: A Subspace Method for Fourier-Based Image Registration. IEEE Geoscience and Remote Sensing Letters 6(3), 491–494 (2009)
[20] Kasetkasem, T., Homsup, N., Meetit, D.: An Image Registration Algorithm Using Particle Filters. In: 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, vol. 02, pp. 1120–1123 (2009)
[21] Barbu, A., Ionasec, R.: Boosting Cross-Modality Image Registration. Joint Urban Remote Sensing Event, 1–7 (2009)
[22] Dell'Acqua, F., Sacchetti, A.: Steps towards a New Technique for Automated Registration of Pre- and Post-Event Images. Joint Urban Remote Sensing Event, 1–5 (2009)
[23] Liying, W., Weidong, S.: A Review of Range Image Registration Methods with Accuracy Evaluation. Joint Urban Remote Sensing Event, 1–8 (2009)
[24] Massout, S., Smara, Y., Ouarab, N.: Comparison of Fusion by Morphological Pyramid and by High Pass Filtering. Joint Urban Remote Sensing Event, 1–6 (2009)
[25] McCartney, M.I., Zein-Sabatto, S., Malkani, M.: Image Registration for Sequence of Visual Images Captured by UAV. In: IEEE Symposium on Computational Intelligence for Multimedia Signal and Vision Processing, pp. 91–97 (2009)
[26] Dong, W., Xiangbin, W., Dongmei, L.: Particle Swarm Mixel Decomposition for Remote Sensing Images. In: IEEE International Conference on Automation and Logistics, pp. 212–216 (2009)
[27] Holtkamp, D.J., Goshtasby, A.A.: Precision Registration and Mosaicking of Multicamera Images. IEEE Transactions on Geoscience and Remote Sensing 47(10), 3446–3455 (2009)
[28] Zou, S., Zhang, J., Zhang, Y.: The Automatic Registration between High Resolution Satellite Images and a Vector Map Based on RFM. In: International Conference on Image Analysis and Signal Processing, pp. 397–401 (2009)
[29] Xiong, Z., Zhang, Y.: A Novel Interest-Point-Matching Algorithm for High-Resolution Satellite Images. IEEE Transactions on Geoscience and Remote Sensing (2009)
[30] Qumar, J., Pachori, R.B.: A Novel Technique for Merging of Multisensor and Defocussed Images using Multiwavelets. IEEE Region 10 TENCON, 1–6 (2005)
[31] Chen, C.F., Chen, M.H., Li, H.T.: Fully Automatic and Robust Approach for Remote Sensing Image Registration. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 891–900. Springer, Heidelberg (2007)
[32] Arya, K.V., Gupta, P., Kalra, P.K., Mitra, P.: Image Registration Using Robust M-Estimators. Pattern Recognition Letters 28, 1957–1968 (2007)
[33] Lu, Y., Liao, Z.W., Chen, W.F.: An Automatic Registration Framework using Quantum Particle Swarm Optimization for Remote Sensing Images. In: Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition (2007)
[34] Wang, X., Zhang, J., Zhang, Y.: Registration of Remote Sensing Images Based on Gaussian Fitting. In: 3rd IEEE Conference on Industrial Electronics and Applications, pp. 378–381 (2008)
[35] Zhou, H., Tang, Y., Yang, X., Liu, H.: Research on Grid-Enabled Parallel Strategies of Automatic Wavelet-based Registration of Remote-Sensing Images and Its Application in ChinaGrid. In: Fourth International Conference on Image and Graphics, pp. 725–730 (2007)
[36] Bunting, P., Lucas, R., Labrosse, F.: An Area based Technique for Image-to-Image Registration of Multi-Modal Remote Sensing Data. In: IEEE International Geoscience and Remote Sensing Symposium, vol. 5(7-11), pp. 212–215 (2008)
[37] Rohr, K.: Localization Properties of Direct Corner Detectors. Journal of Mathematical Imaging and Vision (4), 139–150 (1994)
[38] Pohl, C., Van Genderen, J.L.: Multisensor Image Fusion in Remote Sensing: Concepts, Methods and Applications. International Journal of Remote Sensing 19(5), 823–854 (1998)
[39] Chen, H.M., Varshney, P.K., Arora, M.K.: Performance of Mutual Information Similarity Measure for Registration of Multitemporal Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 41(11) (2003)
[40] Xie, H., Pierce, L.E., Ulaby, F.T.: Mutual Information Based Registration of SAR Images. In: IEEE International Geoscience and Remote Sensing Symposium Proceedings, vol. 6(6), pp. 4028–4031 (2003)
[41] Mitray, N.J., Gelfandy, N., Pottmannz, H., Guibasy, L.: Registration of Point Cloud Data from a Geometric Optimization Perspective. In: Eurographics Symposium on Geometry Processing (2004)
[42] Srikanchana, R., Xuan, J., Freedman, M.T., Nguyen, C.C., Wang, Y.: Non-Rigid Image Registration by Neural Computation. Journal of VLSI Signal Processing 37, 237–246 (2004)
[43] Boes, J.L., Meyer, C.R.: Multi-variate Mutual Information for Registration. In: Taylor, C., Colchester, A. (eds.) MICCAI 1999. LNCS, vol. 1679, pp. 606–612. Springer, Heidelberg (1999)
[44] Hong, G., Zhang, Y.: The Image Registration Technique for High Resolution Remote Sensing Image in Hilly Area. In: International Society of Photogrammetry and Remote Sensing Symposium (2005)
[45] Guo, X., Zhang, W., Ma, G.: Automatic Urban Remote Sensing Images Registration Based on Road Networks. IEEE Urban Remote Sensing Joint Event (2009)
[46] Bentoutou, Y., Taleb, N., Kpalma, K., Ronsin, J.: An Automatic Image Registration for Applications in Remote Sensing. IEEE Transactions on Geoscience and Remote Sensing 43(9), 2127–2137 (2005)
[47] Jia, X.: Automatic Ground Control Points Refinement for Remote Sensing Imagery Registration. In: Proceedings from International Conference on Intelligent Sensors, Sensor Networks and Information Processing Conference, pp. 145–149 (2005)
[48] Zhou, H., Yang, X., Liu, H., Tang, Y.: First Evaluation of Parallel Methods of Automatic Global Image Registration Based on Wavelets. In: International Conference on Parallel Processing, pp. 129–136 (2005)
[49] Zavorin, I., Le Moigne, J.: Use of Multiresolution Wavelet Feature Pyramids for Automatic Registration of Multisensor Imagery. IEEE Transactions on Image Processing 14(6), 770–782 (2005)
[50] Li, Y., Davis, C.H.: A Combined Global and Local Approach for Automated Registration of High Resolution Satellite Images Using Optimum Extrema Points. In: IEEE International Geoscience and Remote Sensing Symposium, vol. 2(2), pp. 1032–1035 (2008)
[51] Alkaabi, S., Deravi, F.: A New Approach to Corner Extraction and Matching for Automated Image Registration. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, vol. 5, pp. 3517–3521 (2005)
[52] Yulin, Zhimin, Z., Wenge, C., Tian, J.: A New Registration Method for Multi-Spectral SAR Images. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, vol. 3, pp. 1704–1708 (2005)
[53] Tian, L., Kamata, S., Ueshige, Y., Kuroki, Y.: An Automatic Image-Map Registration Algorithm Using Modified Partial Hausdorff Distance. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, vol. 5, pp. 3534–3537 (2005)
[54] Plaza, A., Le Moigne, J., Netanyahu, N.S.: Automated Image Registration Using Morphological Region of Interest Feature Extraction. In: International Workshop on the Analysis of Multi-Temporal Remote Sensing Images, pp. 99–103 (2005)
[55] Du, Y., Zhou, H., Wang, P., Yang, X., Liu, H.: A Parallel Mutual Information Based Image Registration Algorithm for Applications in Remote Sensing. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds.) ISPA 2006. LNCS, vol. 4330, pp. 464–473. Springer, Heidelberg (2006)
[56] Zhang, D., Yu, L., Cai, Z.: A Matching-Based Automatic Registration for Remotely Sensed Imagery. In: IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 956–959 (2006)
[57] Erives, H., Fitzgerald, G.J.: Automatic Subpixel Registration for a Tunable Hyperspectral Imaging System. IEEE Geoscience and Remote Sensing Letters 3(3), 397–400 (2006)
[58] Lou, X., Huang, W., Fu, B., Teng, J.: A Feature-Based Approach for Automatic Registration of NOAA AVHRR Images. In: IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 995–998 (2006)
[59] Martinez, A., Garcia-Consuegra, J., Abad, F.: A Correlation-Symbolic Approach to Automatic Remotely Sensed Image Rectification. In: IEEE International Geoscience and Remote Sensing Symposium, vol. 1, pp. 336–338 (1999)
[60] Tzeng, Y.C., Chiu, S.H., Chen, K.S.: Automatic Change Detections from SAR Images Using Fractal Dimension. In: IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 759–762 (2006)
[61] Kang, X., Han, C., Yang, Y.: Automatic SAR Image Registration by Using Element Triangle Invariants. In: IEEE 9th International Conference on Information Fusion, pp. 1–7 (2006)
[62] Serafino, F.: SAR Image Coregistration Based on Isolated Point Scatterers. IEEE Geoscience and Remote Sensing Letters 3(3), 354–358 (2006)
[63] Li, Q., Sato, I., Murakami, Y.: Simultaneous Perturbation Stochastic Approximation Algorithm for Automated Image Registration Optimization. In: IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 184–187 (2006)
[64] Lu, B., Yanping, W., Lideng, W., Wen, H., Hailiang, P.: InSAR Co-registration Accuracy Assessment Based on Misregistration Value. In: IEEE International Conference on Geoscience and Remote Sensing Symposium, pp. 964–967 (2006)
[65] Junbin, G., Xiaosong, G., Wei, W., Pengcheng, P.: Geometric Global Registration of Parallax Images Based on Wavelet Transform. In: 8th International Conference on Electronic Measurement and Instruments (2007)
[66] Jianying, J., Qiming, Z.: A Novel Method for Multispectral Aerial Image Registration. In: IEEE International Geoscience and Remote Sensing Symposium, pp. 385–388 (2007)
[67] Inglada, J., Muron, V., Pichard, D., Feuvrier, Y.: Analysis of Artifacts in Subpixel Remote Sensing Image Registration. IEEE Transactions on Geoscience and Remote Sensing 45(1) (2007)
[68] Wong, A., Clausi, D.A.: ARRSI: Automatic Registration of Remote-Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, Part 2 45(5), 1483–1493 (2007)
[69] Hong, G., Zhang, Y.: Combination of Feature-Based and Area-Based Image Registration Technique for High Resolution Remote Sensing Image. In: IEEE International Geoscience and Remote Sensing Symposium, pp. 377–380 (2007)
[70] Ghorbani, H., Beheshti, A.A.: Multiresolution Registration of Multitemporal Remote Sensing Images by Optimization of Mutual Information Using a Simulated Annealing Based Marquardt-Levenberg Technique. In: International Conference on Intelligent and Advanced Systems, pp. 685–690 (2007)
[71] Li, Q., Sato, I., Murakami, Y.: Steerable Filter Based Multiscale Registration Method for JERS-1 SAR and ASTER Images. In: IEEE International Geoscience and Remote Sensing Symposium, pp. 381–384 (2007)
[72] Gupta, N., Gupta, N.: A VLSI Architecture for Image Registration in Real Time. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15(9), 981–989 (2007)
[73] Seixas, F.L., Ochi, L.S., Conci, A., Saade, D.C.M.: Image Registration Using Genetic Algorithms. In: Genetic and Evolutionary Computation Conference, GECCO'08, July 12–16 (2008)
[74] Wen, G.J., Lv, J.J., Yu, W.X.: A High-Performance Feature-Matching Method for Image Registration by Combining Spatial and Similarity Information. IEEE Transactions on Geoscience and Remote Sensing, Part 2 46(4), 1266–1277 (2008)
[75] Li, Q., Sato, I., Sakuma, F.: A Novel Strategy for Precise Geometric Registration of GIS and Satellite Images. In: IEEE International Geoscience and Remote Sensing Symposium, vol. 2(2), pp. 1092–1095 (2008)
[76] Bunting, P., Lucas, R., Labrosse, F.: An Area based Technique for Image-to-Image Registration of Multi-Modal Remote Sensing Data. In: IEEE International Geoscience and Remote Sensing Symposium, vol. 5(5), pp. 212–215 (2008)
[77] Ranganath, R.N., Jayaraman, V., Roy, P.S.: Remote Sensing Applications: An Overview. Current Science 93(12) (2007)
Assessment of the Artificial Habitat in Shrimp Aquaculture Using Environmental Pattern Classification José Juan Carbajal Hernández, Luis Pastor Sánchez Fernández, and Marco Antonio Moreno Ibarra Centre of Computer Research – National Polytechnic Institute, Av. Juan de Dios Bátiz, Col. Nva. Industrial Vallejo, México D.F., México. 57296000, Ext. 56573
[email protected], {lsanchez,marcomoreno}@cic.ipn.mx
Abstract. This paper presents a novel model for assessing the water quality of the artificial habitat in shrimp aquaculture. The physical-chemical variables involved in the artificial habitat are measured and studied in order to model the environment of the ecosystem. A new physical-chemical index (Γ) classifies the behavior of the environmental variables, calculating the frequency and the deviations of the measurements based on impact levels. A fuzzy inference system (FIS) is used to establish a relationship between environmental variables, describing the negative ecological impact of the reported concentrations. The FIS uses a reasoning process to classify the environmental levels, providing a new index that describes the general status of the water quality (WQI): excellent, good, regular, and poor. Keywords: Artificial intelligence, fuzzy inference systems, classification, water management.
1 Introduction Water management is an important factor in shrimp aquaculture, where the ecosystem must be kept under control. A destabilized habitat is not conducive to good farming; moreover, an organism with a weakened immune system is more likely to get sick (for example, from the Taura, White Spot, or Yellow Head viruses). The main purpose of water management in aquaculture systems is to control and maintain optimal conditions for the survival and growth of the organisms, as close as possible to a natural ecosystem [1]. The assessment of the water quality can be estimated using the relationships between the physical, chemical, and biological parameters; the combination of the environmental variables determines the status of the water quality [2], [3]. At present, laws around the world do not provide enough criteria to resolve this problem, and the ecological standards only describe the toxicity limits of pollutants in water bodies and the methodologies to measure them [4], [5], [6], [7]. Environmental variables have concentration limits, where low or high concentrations (depending on the variable) can be harmful to the organism [1], [2], [3]. Following these behaviors, it is possible to implement a model in which those limits and changes in the variables are used to determine when a
concentration is good or bad for shrimp, and how the combination of these variables affects the water quality in the artificial shrimp habitat. This strategy will reduce negative situations and, consequently, the stress on the organism, leading to lower mortality rates. A new model for assessing the water quality in artificial shrimp habitats is presented. The system includes a statistical model for evaluating physical-chemical concentrations and a fuzzy inference system for classifying the status of the ecosystem. The Litopenaeus vannamei shrimp, which is cultivated in farms located in Sonora, Mexico, has been studied in this work; therefore, toxic concentrations for this organism are analyzed for the construction of the FIS.
2 Materials and Methods 2.1 Classification Levels In order to classify the behavior of a physical-chemical variable, it is necessary to define the limits of optimal or harmful concentrations [8]. The classification levels (statuses) of the physical-chemical variables are defined in Table 1: for dissolved oxygen we chose "hypoxia", "low" and "normal"; for the temperature and salinity variables we chose "low", "normal" and "high"; and for the pH variable we chose "acid", "low", "normal", "high", and "alkaline". Tolerances (Tol) and limits (Lim) are used to determine the bounds of the ranges, where values can be considered closer to or further from a specified level. Table 1. Classification levels, tolerances and limits of physical-chemical variables
Variables   | Hypoxia/Acid | Low    | Normal  | High   | Alkaline | Tol | Lim
Temp (ºC)   | -------      | 0 – 23 | 23 – 30 | 30 – ∞ | -------  | 1   | 1
Sal (mg/L)  | -------      | 0 – 15 | 15 – 25 | 25 – ∞ | -------  | 1   | 1
DO (mg/L)   | 0 – 3        | 3 – 6  | 6 – 10  | 10 – ∞ | -------  | 0.5 | 0.5
pH          | 0 – 4        | 4 – 7  | 7 – 9   | 9 – 10 | 10 – 14  | 0.5 | 0.5
2.2 Preprocessing The environmental signals have several peak values; this behavior can be generated by a faulty device, human error, or noise. In order to minimize those effects, the physical-chemical signals are smoothed for an accurate assessment (Fig. 1). A moving average weighted filter uses an average of signal points (measured concentrations) to produce the output points of the new, smoothed signal [9], [10]. The smoothing process of the physical-chemical variables can be calculated as follows:

y(n) = Σ_{i=1..N} b_i x(n − i + 1)    (1)
where x(n) is the original signal, y(n) is the new output signal, N is known as the filter order, and the b_i are the Henderson 7-term coefficients, defined as H = [–0.05874, 0.05874, 0.29371, 0.41259, 0.29371, 0.05874, –0.05874]. The smoothing process using this moving average weighted filter is therefore:

y(n) = –0.05874 x(n) + 0.05874 x(n−1) + 0.29371 x(n−2) + 0.41259 x(n−3) + 0.29371 x(n−4) + 0.05874 x(n−5) – 0.05874 x(n−6)    (2)
Fig. 1. Original and smoothed signals of the physical-chemical variables (temperature, salinity, dissolved oxygen, pH) using a moving average
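As a minimal illustration of this smoothing step, the following Python sketch applies the Henderson 7-term weights of Eqs. (1)–(2) to a measurement series (the function name and the sample values are ours):

```python
import numpy as np

# Henderson 7-term weights H given in the text
H = np.array([-0.05874, 0.05874, 0.29371, 0.41259,
              0.29371, 0.05874, -0.05874])

def smooth(signal):
    """Moving average weighted filter (Eqs. 1-2) for one
    physical-chemical signal."""
    # mode="same" keeps the output aligned with the input; the first
    # and last 3 samples are affected by implicit zero-padding.
    return np.convolve(signal, H, mode="same")

# Hypothetical dissolved-oxygen series with a spurious peak (mg/L)
do = np.array([6.1, 6.3, 9.8, 6.2, 6.0, 5.9, 6.4, 6.2, 6.1, 6.3])
print(smooth(do))
```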
Physical-Chemical Assessment Index (Γ). The physical-chemical assessment index (Γ) classifies the level that a physical-chemical variable reports, based on Table 1. The Γ index comprises the three indices described below.

Index 1: α (Frequency). The frequency represents the proportion of individual tests that do not meet the objectives (failed tests):

α = m_f / m_T    (3)

where α is the failure index, m_f is the number of failed tests, and m_T is the number of
total tests of the variable.

Index 2: β (Amplitude). The average amount by which the failed test values do not meet their objectives is calculated in three steps. The number of times by which an individual concentration is greater (or less) than the objective is termed a "deviation", expressed as follows. When the value must not exceed the level:

e = (m − l_a) / (t_a − l_a)    (4)

where e is the deviation of the failed test; m is the value of the test; l_a is the upper limit of the range to evaluate; and t_a is the upper tolerance. When the value must not fall below the level:
e = (l_b − m) / (l_b − t_b)    (5)

where l_b is the lower limit of the range evaluated and t_b is the lower tolerance of the range (e.g., if normal salinity has the range [15 – 23 mg/l], the tolerance is 1 and the limit is −1, therefore l_a is 22 mg/l, t_a is 24 mg/l, t_b is 14 and l_b is 16). The average of the deviations is calculated as an index (β):

β = ( Σ_{i=1..n} e_i ) / m_T    (6)
where i = 1, 2, …, n; n is the number of calculated deviations; and m_T is the number of total tests.

Index 3: Γ (Physical-chemical assessment index). The physical-chemical assessment index (Γ) classifies the behavior of the variable, establishing a level status (Table 1). The Γ index can be expressed as follows:

Γ = (α + β) / 2    (7)

The Γ result can be interpreted as follows:
• If 0 ≤ Γ < 1, the variable behavior is classified inside the evaluated range.
• If Γ ≥ 1, the variable behavior is classified totally outside the evaluated range.
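A minimal Python sketch of this index for a single variable and evaluated range, following Eqs. (3)–(7) as reconstructed above (the function name, the signed-limit handling, and the sample salinity values are ours):

```python
import numpy as np

def gamma_index(values, low, high, tol, lim):
    """Physical-chemical assessment index (Gamma) for one range.

    values    : measured concentrations of one variable
    low, high : bounds of the evaluated range (e.g. normal salinity 15-23)
    tol, lim  : tolerance and limit magnitudes building la, ta, lb, tb
    """
    la, ta = high - lim, high + tol   # upper limit / upper tolerance
    lb, tb = low + lim, low - tol     # lower limit / lower tolerance
    m_T = len(values)

    deviations = []
    for m in values:
        if m > la:                    # Eq. (4): value exceeds the level
            deviations.append((m - la) / (ta - la))
        elif m < lb:                  # Eq. (5): value falls below the level
            deviations.append((lb - m) / (lb - tb))

    alpha = len(deviations) / m_T     # Eq. (3): proportion of failed tests
    beta = sum(deviations) / m_T      # Eq. (6): average deviation
    return (alpha + beta) / 2         # Eq. (7)

# Salinity example from the text: normal range [15, 23] mg/l, Tol = Lim = 1
sal = np.array([16.0, 18.5, 24.5, 14.2, 19.0, 25.0])
print(gamma_index(sal, low=15, high=23, tol=1, lim=1))
```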
2.3 Fuzzy Inference System Fuzzy inference system (FIS) theory was applied in this study, providing a nonlinear relationship between the input sets (physical-chemical variables) and the output set (water quality index) [11], [12]. The Γ index can be used to calculate the membership functions of the fuzzy inference system; the fuzzy input can be expressed as follows:

μ_v = 1 − Γ_v  if 0 ≤ Γ_v < 1,  and  μ_v = 0  if Γ_v ≥ 1    (8)

where v is the desired variable. The input set of the FIS can be expressed as follows:

z = ( μ_Temp,i , μ_Sal,j , μ_pH,k , μ_DO,l )    (9)
where i, j, k and l are the ranges of the evaluated variables (high, normal, low, alkaline, acid, and hypoxia, respectively). 2.4 Reasoning Process There are some expressions that are frequently used by experts in water quality; those expressions are helpful for the construction of the FIS [11], [12]. These types of
expressions build the fuzzy language; they are represented as a rule set and can be expressed as follows: Rule 1: If Temp is normal and Salt is normal and pH is normal and DO is normal then WQI is Excellent
Rule 2: If Temp is normal and Salt is normal and pH is normal and DO is low then WQI is Good
The output rule functions are fuzzy expressions that can be determined as follows:

μ_R = min( μ_Temp , μ_Sal , μ_pH , μ_DO )    (10)
Fig. 3 illustrates the operation of rules 1 and 2. The size of the rule set depends on the number of rules involved in the environment; a total of 135 rules have been used in this case. 2.5 Water Quality Status The ecosystem is always changing, and the combination of the variable concentrations defines the status of the water quality. If a high-impact variable reports harmful concentrations, the status of the water quality will deteriorate [1], [2], [3], [12]. Water quality status has been classified into four levels, which cover all the hypothetical situations in a shrimp pond:
1. Excellent: the physical-chemical variables report concentrations in the optimal range.
2. Good: one variable reports concentrations out of the optimal range; however, this situation does not represent a danger to the shrimp.
3. Regular: some variables report concentrations out of the optimal range, and the combination between them represents a certain stress level in the organism.
4. Poor: all the variable concentrations are out of the optimal ranges, or a variable with a high impact level presents concentrations that could generate a potentially dangerous situation in the pond (e.g., extremely low oxygen concentrations).
The membership rules must be processed by an output membership function in order to determine the fuzzy outputs. The output membership function proposed for water quality is shown in Fig. 2. Trapezoidal membership functions were used to represent the poor, regular, good, and excellent fuzzy sets [11]. They are represented as:

f(x; a, b, c, d) = max( min( (x − a)/(b − a), 1, (d − x)/(d − c) ), 0 )    (11)

where a, b, c and d are the membership parameters.
Fig. 2. Membership functions for WQI (poor, regular, good, excellent)
Fig. 3. Inference process where the membership functions of the rules are used to determine the final membership function (inputs: Temp, Sal, pH, DO; output: WQI after defuzzification)
The values of the WQI membership functions are determined by matching with the rule memberships (Fig. 3).
2.6 Aggregation The aggregation process generates one final function (μ_out) using all the output membership functions (μ_R). The WQI index is calculated using the center of gravity of this function [7]:

WQI = ( ∫ x μ_out(x) dx ) / ( ∫ μ_out(x) dx )    (12)
The final score for the WQI index lies in the range [0.078, 0.87]; in order to obtain a [0, 1] range, the output is normalized using the following equation:
x_n = (x − min) / (max − min)    (13)

where x is the output of the fuzzy system and x_n is the new normalized result.
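The following Python sketch strings Eqs. (8)–(13) together for one evaluation step, assuming min for the rule conjunctions, max aggregation over trapezoidal output sets, and centroid defuzzification; the two rules and all breakpoints are illustrative stand-ins for the 135-rule system:

```python
import numpy as np

def trapezoid(x, a, b, c, d):
    # Eq. (11): trapezoidal membership function
    return np.maximum(np.minimum.reduce(
        [(x - a) / (b - a), np.ones_like(x), (d - x) / (d - c)]), 0.0)

# Illustrative output sets on the WQI axis (breakpoints are ours)
x = np.linspace(0.0, 1.0, 501)
wqi_sets = {"poor":      trapezoid(x, -0.10, 0.00, 0.20, 0.35),
            "regular":   trapezoid(x,  0.20, 0.35, 0.50, 0.65),
            "good":      trapezoid(x,  0.50, 0.65, 0.80, 0.90),
            "excellent": trapezoid(x,  0.80, 0.90, 1.00, 1.10)}

def infer(mu_temp, mu_sal, mu_ph, mu_do_normal, mu_do_low):
    """Inputs are the Eq. (8) memberships of each variable's levels."""
    rules = [(min(mu_temp, mu_sal, mu_ph, mu_do_normal), "excellent"),  # Rule 1
             (min(mu_temp, mu_sal, mu_ph, mu_do_low), "good")]          # Rule 2
    # Clip each output set at its rule strength, then aggregate by max
    mu_out = np.zeros_like(x)
    for strength, label in rules:
        mu_out = np.maximum(mu_out, np.minimum(strength, wqi_sets[label]))
    wqi = np.sum(x * mu_out) / np.sum(mu_out)   # Eq. (12): center of gravity
    return (wqi - 0.078) / (0.87 - 0.078)       # Eq. (13): normalize to [0, 1]

print(infer(0.9, 0.8, 0.95, 0.7, 0.3))
```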
3 Experimental Results In order to estimate how the WQI index performs with real data, data sets of environmental measurements obtained from farms at Rancho Chapo, located in Sonora, Mexico, were used as fundamental patterns. The measurements of the variables depend on how accurately a supervisor monitors the aquaculture system. Temperature, dissolved oxygen, salinity and pH were measured using one sensor for each variable. The monitoring frequency was 4 measurements per hour. The data set contains four months of measurements with 96 values per day; a record of 9312 values per variable has been collected.
Fig. 4. a) Measurements of the environmental variables (temperature, salinity, dissolved oxygen, pH) in one month of farming; b) Results of the evaluation of the habitat in the farm using the WQI (Excellent = 1, Good = 0.66, Regular = 0.28, Bad = 0)
The experimental phase was carried out as follows: one month of measurements was extracted from the database (Fig. 4a), the Γ index was calculated using one day of information (96 measurements), and this process was repeated for the rest of the month. The z vector is built using the Γ classifications and is processed by the fuzzy inference system. The final result is normalized in order to establish a [0, 1] index. The WQI results are shown in Fig. 4b. In Fig. 4a, the salinity is always at a high level, the temperature is increasing, and the dissolved oxygen has an oscillatory behavior, changing between normal and low concentrations; these behaviors directly affect the habitat, resulting in regular and bad water quality reports (e.g., in Fig. 4b, on the 15th day DO reported most of its concentrations at the low and hypoxia levels, which generates a poor state).
4 Conclusion In this paper, a new algorithm based on fuzzy inference systems has been introduced. The model for assessing the water quality status was built in two phases: the first classifies the levels of the physical-chemical variables; the second evaluates the negative ecological impacts in order to analyze the water quality status in the shrimp habitat. Experimental results have shown that this algorithm is an efficient way to improve water management in shrimp aquaculture. Acknowledgments. The authors would like to thank the following institutions for their support in developing this work: National Polytechnic Institute, Mexico, the Biology Research Center of Sonora (CIB), and CONACyT.
References
1. Casillas, R., Magallón, F., Portillo, G., Osuna, P.: Nutrient mass balances in semi-intensive shrimp ponds from Sonora, Mexico using two feeding strategies: Trays and mechanical dispersal. In: Aquaculture, vol. 258, pp. 289–298. Elsevier, Amsterdam (2006)
2. Martínez, R.: Cultivo de Camarones Pendidos, Principios y Practica. AGT Editor S.A. (1994)
3. Páez, O.F.: Camaronicultura y Medio Ambiente. Instituto de Ciencias del Mar y Limnología, UNAM, México, pp. 271–298 (2001)
4. (INE) Instituto Nacional de Ecología: La calidad del agua en los ecosistemas costeros de México (2000)
5. (SEMARNAP) Secretaría de Medio Ambiente, Recursos Naturales y Pesca: NOM-001-ECOL-1996
6. (CCME) Canadian Council of Ministers of the Environment (Canada): An assessment of the application and testing of the water quality index of the Canadian Council of Ministers of the Environment for selected water bodies in Atlantic Canada. National indicators and reporting office, http://www.ec.gc.ca/soer-ree/N (accessed August 2007)
7. (NSF) National Sanitation Foundation International (2005), http://www.nsf.org (accessed August 2007)
8. Yew-Hu, C.: Water quality requirements and management for marine shrimp culture. In: Proceedings of the Special Session on Shrimp Farming, World Aquaculture Society, Baton Rouge, LA, USA, pp. 144–156 (1992)
9. Emmanuel, C.: Digital signal processing: a practical approach. Addison-Wesley, Reading (1993)
10. Proakis, J., Manolakis, D.: Tratamiento digital de señales, vol. 1(4). Pearson Education, España (2007)
11. Ocampo, W., Ferré, N., Domingo, J., Schuhmacher, M.: Assessing water quality in rivers with fuzzy inference systems: A case study. In: Environment International, vol. 32, pp. 733–742. Elsevier, Amsterdam (2006)
12. Chow, M.-Y.: Methodologies of using neural network and fuzzy logic technologies for motor incipient fault detection. World Scientific, Singapore (1997)
Speech Recognition System of Arabic Alphabet Based on a Telephony Arabic Corpus
Yousef Ajami Alotaibi (1), Mansour Alghamdi (2), and Fahad Alotaiby (3)
(1) Computer Engineering Department, King Saud University, Riyadh, Saudi Arabia; (2) King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia; (3) Department of Electrical Engineering, King Saud University, Riyadh, Saudi Arabia
[email protected], [email protected], [email protected]
Abstract. Automatic recognition of spoken alphabets is one of the difficult tasks in the field of computer speech recognition. In this research, spoken Arabic alphabets are investigated from the speech recognition point of view. The system is designed to recognize the spelling of an isolated word. The Hidden Markov Model Toolkit (HTK) is used to implement the isolated-word recognizer with phoneme-based HMM models. In the training and testing phases of this system, isolated-alphabet data sets are taken from the telephony Arabic speech corpus SAAVB. This standard corpus was developed by KACST and is classified as a noisy speech database. A hidden Markov model based speech recognition system was designed and tested on automatic Arabic alphabet recognition. Four different experiments were conducted on these subsets: the first three were trained and tested using each individual subset, and the fourth was conducted on the three subsets collectively. The recognition system achieved 64.06% overall correct alphabet recognition using the mixed training and testing subsets. Keywords: Arabic, alphabets, SAAVB, HMM, Recognition, Telephony corpus.
1 Introduction 1.1 Arabic Language Arabic is a Semitic language, and it is one of the oldest languages in the world. Currently, it is the fifth language in terms of number of speakers [1]. Arabic is the native language of twenty-five countries, including Saudi Arabia, Jordan, Oman, Yemen, Egypt, Syria, Lebanon, etc. [1]. The Arabic alphabet is used in several languages in addition to Arabic, such as Persian and Urdu. Standard Arabic has basically 34 phonemes, of which six are vowels and 28 are consonants [2]. A phoneme is the smallest element of speech that indicates a difference in meaning, word, or sentence. Arabic has fewer vowels than English: it has three long and three short vowels, while American English has twelve vowels [3]. Arabic phonemes contain two distinctive classes, named pharyngeal and emphatic phonemes. These two classes can be found only in Semitic languages like
Hebrew [2], [4]. The allowed syllables in the Arabic language are CV, CVC, and CvCC, where V indicates a long or short vowel, v indicates a short vowel, and C indicates a consonant. Arabic utterances can only start with a consonant [2]. Table 1 shows the Arabic alphabets along with the number and types of syllables in every spoken alphabet. Table 1. Arabic Alphabets
1.2 Spoken Alphabets Recognition In general, spoken alphabets of different languages have been targeted by automatic speech recognition researchers. A speaker-independent spoken English alphabet recognition system was designed by Cole et al. [5]. That system was trained on one token of each letter from 120 speakers. Performance was 95% when tested on a new set of 30 speakers, but increased to 96% when tested on a second token of each letter from the original 120 speakers. Other efforts in spoken English alphabet recognition were conducted by Loizou et al. [6]. In their system, a high-performance spoken English recognizer was implemented using context-dependent phoneme hidden Markov models (HMM). That system incorporated approaches to tackle the problems associated with the confusions occurring between the stop consonants in the E-set and the confusions between the nasal sounds. The recognizer achieved 55% accuracy in nasal discrimination, 97.3% accuracy in speaker-independent alphabet recognition, 95% accuracy in speaker-independent E-set recognition, and 91.7% accuracy in recognition of 300 last names. Karnjanadecha et al. [7] designed a high-performance isolated English alphabet recognition system; the best accuracy achieved by their system for speaker-independent alphabet recognition was 97.9%. Regarding digit recognition, Cosi et al.
[8] designed and tested a high-performance telephone-bandwidth speaker-independent continuous digit recognizer. That system was based on an artificial neural network, and it gave 99.92% word recognition accuracy and 92.62% sentence recognition accuracy. The Arabic language has seen a limited number of research efforts compared to other languages such as English and Japanese, and only a few studies have been conducted on Arabic alphabet recognition. In 1985, Hagos [9] and Abdullah [10] separately reported Arabic digit recognizers. Hagos designed a speaker-independent Arabic digit recognizer that used template matching for input utterances; his system is based on LPC parameters for feature extraction and the log likelihood ratio for similarity measurement. Abdullah developed another Arabic digit recognizer that used positive-slope and zero-crossing durations as the feature extraction algorithm; he reported a 97% accuracy rate. Both systems mentioned above are isolated-word recognizers in which template matching was used. Al-Otaibi [11] developed an automatic Arabic vowel recognition system; isolated Arabic vowel and isolated Arabic word recognition systems were implemented. He studied the syllabic nature of the Arabic language in terms of syllable types, syllable structures, and primary stress rules. 1.3 Hidden Markov Models and Used Tools Automatic Speech Recognition (ASR) systems based on the HMM started to gain popularity in the mid-1980s [6]. The HMM is a well-known and widely used statistical method for characterizing the spectral features of speech frames. The underlying assumption of the HMM is that the speech signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise, well-defined manner. The HMM method provides a natural and highly reliable way of recognizing speech for a wide range of applications [12], [13]. The Hidden Markov Model Toolkit (HTK) [14] is a portable toolkit for building and manipulating HMM models. It is mainly used for designing, testing, and implementing ASR systems and related research tasks. This research concentrates on the analysis and investigation of the Arabic alphabets from an ASR perspective. The aim is to design a recognition system using the Saudi Accented Arabic Voice Bank (SAAVB) corpus provided by King Abdulaziz City for Science and Technology (KACST). SAAVB is considered a noisy speech database because most of it was recorded in normal life conditions over mobile and landline telephone calls [15]. The system is based on HMMs and built with the aid of HTK tools.
2 Experimental Framework 2.1 System Overview A complete ASR system based on HMM was developed to carry out the goals of this research. The system is divided into three modules according to their roles. The first module is the training module, whose function is to create the knowledge about the speech and language to be used in the system. The second module is the HMM model bank, whose function is to store and organize the system knowledge gained by
the first module. The final module is the recognition module, whose function is to decode the input speech given in the testing phase, with the aid of the HMM models mentioned above. The parameters of the system were an 8 kHz sampling rate with 16-bit sample resolution, a 25-millisecond Hamming window with a step size of 10 milliseconds, 12 MFCC coefficients computed from 26 filter-bank channels with a cepstral liftering length of 22, and a pre-emphasis coefficient of 0.97. Phoneme-based models are good at capturing phonetic details. Also, context-dependent phoneme models can be used to characterize formant transition information, which is very important for discriminating between alphabets that can be confused. The Hidden Markov Model Toolkit (HTK) is used for designing and testing the speech recognition systems throughout all experiments. The baseline system was initially designed as a phoneme-level recognizer with three active states, one Gaussian mixture per state, and continuous, left-to-right, no-skip HMM models. The system was designed by considering all thirty-four Modern Standard Arabic (MSA) monophones as given by the KACST labeling scheme [16]. This scheme was introduced to standardize the phoneme symbols in research on classical and MSA language and all of its variations and dialects; its labeling symbols cover all the Quranic sounds and their phonological variations. The silence (sil) model is also included in the model set; in a later step, the short pause (sp) model was created from, and tied to, the silence model. Since most of the alphabets consist of more than two phonemes, context-dependent triphone models were created from the monophone models mentioned above. Before this, the monophone models were initialized and trained on the training data explained above; this was done over several iterations and repeated for the triphone models. A decision-tree method is used to align and tie the models before the last step of the training phase. The last step in the training phase is to re-estimate the HMM parameters using the Baum-Welch algorithm [12] three times.
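As an illustration of this topology outside HTK, the sketch below builds a 3-state, left-to-right, no-skip HMM over MFCC frames in Python, using the librosa and hmmlearn packages; the file name and the exact front-end settings are placeholders approximating the parameters above:

```python
import numpy as np
import librosa
from hmmlearn import hmm

# Front end close to the paper's settings: 8 kHz audio, 25 ms Hamming
# windows with a 10 ms step, 12 cepstral coefficients, 26 Mel channels.
y, sr = librosa.load("alphabet_token.wav", sr=8000)   # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=200,
                            hop_length=80, window="hamming",
                            n_mels=26).T               # (frames, 12)

# Left-to-right, no-skip topology: a state may only stay or move on.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                        init_params="mc", params="stmc", n_iter=10)
model.startprob_ = np.array([1.0, 0.0, 0.0])
model.transmat_ = np.array([[0.5, 0.5, 0.0],
                            [0.0, 0.5, 0.5],
                            [0.0, 0.0, 1.0]])
model.fit(mfcc)            # Baum-Welch re-estimation (zeros are preserved)
print(model.score(mfcc))   # log-likelihood used for recognition scoring
```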
2.2 Database The SAAVB corpus [15] was created by KACST; it contains a database of speech waves and their transcriptions from 1033 speakers, covering all the regions in Saudi Arabia with a statistical distribution of region, age, gender and telephones. The SAAVB was designed to be rich in terms of its speech sound content and speaker diversity within Saudi Arabia, and to train and test automatic speech recognition engines as well as speaker, gender, accent, and language identification systems. The database has more than 300,000 electronic files, including 60,947 PCM files of speech recorded via telephone, 60,947 text files of the original text, 60,947 text files of the speech transcription, and 1033 text files about the speakers. These files have been verified by IBM Egypt, and the corpus is completely owned by KACST. SAAVB contains three subsets, called SAAVB26, SAAVB27, and SAAVB28 [15]. First, SAAVB26 contains read speech as the spelling of a phonetically rich word (671 randomized and partially repeated items). Second, SAAVB27 contains read speech as the spelling of a city name (Prompts 20 and 21). Last, SAAVB28 contains read speech as the spelling of a personal name (260 randomized and repeated items). Arabic contains a total of about 35 alphabets, but only 29 were used in our final system testing.
3 Results We conducted different experiments and obtained four confusion matrices. The system was trained and tested on SAAVB26, SAAVB27 and SAAVB28 separately, and then on these three combined, named SAAVBaa. In all three experiments the training and testing subsets are disjoint; the number of tokens (alphabets) in the testing subsets of SAAVB26, SAAVB27, and SAAVB28 was 923, 1,594 and 1,110, respectively, and the number of tokens in each training subset was about three times that of the respective testing subset. As can be seen from Table 2, the correct rates of the system are 62.84%, 67%, and 61.26% when using SAAVB26, SAAVB27, and SAAVB28, respectively. The worst correct rate was encountered for SAAVB28, while the best was for SAAVB27. We think this is correlated with the amount of training data, because SAAVB26 and SAAVB28 have little training data compared to SAAVB27; as can be noticed, the size of SAAVB28 is in between those of SAAVB26 and SAAVB27. We want to emphasize that the size of the training subset is three times that of the testing subset in all cases.

By mixing the training subsets of SAAVB26, SAAVB27, and SAAVB28 into one training subset, and all their testing subsets into one testing subset, we get the set called SAAVBaa, whose confusion matrix is shown in Table 3. When testing on this subset (i.e., SAAVBaa), the system must try to recognize 3,623 samples over all 29 alphabets. The overall system performance was 64.06%, which is reasonably high given that our database is considered a noisy corpus. The similarity between groups of Arabic alphabets is very high; for example, two alphabets may differ only in the voicing of the carried phoneme, as is the case for alphabets A4 and A9. There are many such pairs of phonemes that can cause misleading behavior of the system, and such problems can be noticed in the confusion matrices. The system failed to recognize a total of 1,342 alphabets (1,302 were substituted and 40 were deleted mistakenly by the system) out of 3,623 recorded alphabets. Alphabets A7, A15, A23, A28, and A29 obtained reasonably high recognition rates (above 85%); on the other hand, bad performance was encountered with alphabets A4, A10, A20, A21, A25, and A27, where the performance is less than 50%. Even though the database size is medium (only the 29 spoken Arabic alphabets) and noise is present, the system showed a reasonable performance given the variability in how Arabic alphabets are pronounced.

The number of tokens per alphabet in the test data subsets ranged from 0 up to 469. This is one of the drawbacks of the SAAVB26, SAAVB27, and SAAVB28 subsets: some of the alphabets were completely absent from the vocalized version, due to problems in item selection and/or bad vocalizations given by the speakers. A16 (ض) and A18 (ظ) are two alphabets that appeared 0 times in all three subsets. Also, the inconsistency in the number of occurrences of the alphabets across the SAAVB26, SAAVB27, and SAAVB28 subsets revealed a serious disadvantage of SAAVB,
which may cause biased training across the different system vocabulary items. To give an example, alphabet A1 appeared about 1,407 times over the three subsets, while A10 and A18 appeared 24 and 0 times, respectively. Table 2. Summary of accuracies for all subsets and alphabets
In our experiments, we can notice that some alphabets obtained very bad accuracy in all subsets, which means the problem is caused by low training data for those alphabets and/or their high similarity to other alphabets; examples of this situation are A4 and A27. It can also be noticed that some alphabets obtained a very low accuracy in one subset but a high accuracy in the others; an example of this situation is A20.
Table 3. Confusion matrix of the system when trained and tested with SAAVBaa
4 Conclusion To conclude, a spoken Arabic alphabet recognizer was designed to investigate the process of automatic alphabet recognition. The system is based on HMMs, uses the Saudi-accented, noisy corpus SAAVB, and is carried out with HTK tools. A total of four experiments were conducted on the Arabic alphabets. The SAAVB corpus supplied by KACST (especially the subsets dedicated to isolated Arabic alphabets, namely SAAVB26, SAAVB27, and SAAVB28) was used in this research. The first three experiments were conducted on each of these three subsets, and the fourth was conducted on the union of all of them (i.e., using all alphabet data in one experiment). The system correct rate for the combined experiment is 64.06%.
Acknowledgment This paper is supported by KACST (Project: 28-157).
References
[1] http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers, http://en.wikipedia.org/wiki/Arab_world
[2] Alkhouli, M.: Alaswaat Alaghawaiyah. Daar Alfalah, Jordan (1990) (in Arabic)
[3] Deller, J., Proakis, J., Hansen, J.H.: Discrete-Time Processing of Speech Signal. Macmillan, Basingstoke (1993)
[4] Elshafei, M.: Toward an Arabic Text-to-Speech System. The Arabian Journal for Science and Engineering 16(4B), 565–583 (1991)
[5] Cole, R., Fanty, M., Muthusamy, Y., Gopalakrishnan, M.: Speaker-Independent Recognition of Spoken English Letters. In: International Joint Conference on Neural Networks (IJCNN), vol. 2, pp. 45–51 (June 1990)
[6] Loizou, P.C., Spanias, A.S.: High-Performance Alphabet Recognition. IEEE Trans. on Speech and Audio Processing 4(6), 430–445 (1996)
[7] Karnjanadecha, M., Zahorian, Z.: Signal Modeling for High-Performance Robust Isolated Word Recognition. IEEE Trans. on Speech and Audio Processing 9(6), 647–654 (2001)
[8] Cosi, P., Hosom, J., Valente, A.: High Performance Telephone Bandwidth Speaker Independent Continuous Digit Recognition. In: Automatic Speech Recognition and Understanding Workshop (ASRU), Trento, Italy (2001)
[9] Hagos, E.: Implementation of an Isolated Word Recognition System. UMI Dissertation Service (1985)
[10] Abdulah, W., Abdul-Karim, M.: Real-time Spoken Arabic Recognizer. Int. J. Electronics 59(5), 645–648 (1984)
[11] Al-Otaibi, A.: Speech Processing. The British Library in Association with UMI (1988)
[12] Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
[13] Juang, B., Rabiner, L.: Hidden Markov Models for Speech Recognition. Technometrics 33(3), 251–272 (1991)
[14] Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, Cambridge (2006), http://htk.eng.cam.ac.uk/prot-doc/ktkbook.pdf
[15] Alghamdi, M., Alhargan, F., Alkanhal, M., Alkhairi, A., Aldusuqi, M.: Saudi Accented Arabic Voice Bank (SAAVB). Final report, Computer and Electronics Research Institute, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia (2003)
[16] Alghamdi, M., El Hadj, Y., Alkanhal, M.: A Manual System to Segment and Transcribe Arabic Speech. In: IEEE International Conference on Signal Processing and Communication (ICSPC'07), Dubai, UAE, November 24-27 (2007)
Statistical and Neural Classifiers: Application for Singer and Music Discrimination in Polyphonic Music Context
Hassan Ezzaidi (1) and Mohammed Bahoura (2)
(1) Département des Sciences Appliquées, Université du Québec à Chicoutimi, 550, boul. de l'Université, Chicoutimi, Qc, Canada, G7H 2B1, [email protected]
(2) Département de Mathématiques, d'Informatique et de Génie, Université du Québec à Rimouski, 300, allée des Ursulines, Rimouski, Qc, Canada, G5L 3A1, Mohammed [email protected]
Abstract. The problem of identifying sections of singer voice and instruments is investigated in this paper. Three classification techniques are presented and compared: the Linde-Buzo-Gray algorithm (LBG), Gaussian Mixture Models (GMM), and the feed-forward Multi-Layer Perceptron (MLP). All techniques are based on Mel-frequency Cepstral Coefficients (MFCC), which are commonly used in the speech and speaker recognition domains. All the proposed approaches yield a decision every 125 ms. A large experimental data set is extracted from the RWC music genre database, covering various styles (68 pieces, 25 subcategories). The recognition scores are evaluated on data used in the training session and on data never seen by the proposed systems. The best results are obtained with the GMM (94% with train data and 80.5% with test data). Keywords: music, song, artist, discrimination.
1 Introduction
The explosion of media, particularly the accessibility of digital information via the Internet and the increasing production of the disk industry, creates many new needs, such as the browsing, archiving, maintenance, classification and authentication of data. Hence, the needs for automatic processing of media documents are increasing fast, and the challenge is becoming greater. Among these needs, there is the interest in classifying music by style, singer, instrument or language. Automatic indexing is another need, even more delicate and complex. Discrimination between music, silence, speech and singing is crucial for the proper functioning of identification, recognition and segmentation systems. In other words, we want an automatic system to analyse, understand, track and mask, much as the human auditory system does. So, the challenge is very
strong despite the progress obtained in the speech and speaker domains. Moreover, the acoustical music signal is very complicated because it is composed of multiple sound sources, including the artist's singing. Many works have been published in music information retrieval, but in very limited contexts. Indeed, the context depends on the database used, the pre-processing and post-processing carried out, the parameter vectors, the recognition models, the learning approach, and the target task to be performed by the system. Global tasks include speech and music discrimination, genre recognition, singer identification, instrument identification, tracking and detection of the singer, tempo/beat/metric determination, etc. [1,2]. In the case of singer characterization, some works have previously been published; all the applied techniques are inspired by the speech and speaker domains. Mel-frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) are known as the best acoustic features in speech, speaker and instrument recognition; intuitively, they have also been used in artist identification systems [1,3,4,5]. Classification strategies including discriminant functions, Gaussian Mixture Models (GMM), Nearest Neighbors (k-NN), the Kullback-Leibler divergence and the Kullback-Leibler distance measure were compared for singer identification in polyphonic music, using a concept of vocal separation [1]. A Support Vector Machine (SVM) is a classifier estimating an optimal hyperplane to separate classes of data; having reported good results in genre classification, it was also applied to singer identification [1]. Artist classification was also performed by a Multi-Layer Perceptron (MLP) neural network [4]. In the present work, we are interested in a system that detects the artist from the beginning to the end of his song at different times. Classical MFCC coefficients are used as parameters. We apply the Linde-Buzo-Gray algorithm (LBG) [3], the multi-layer feed-forward perceptron (MLP) and Gaussian Mixture Models (GMM) as classifiers. The decision is determined at a rhythm of 125 ms, without overlapping shift windows.
2 Database
A set of 68 musical pieces from the RWC Music Genre Database [6] is used. Each piece averages 4 minutes, originating from more than 25 genres (Popular, Ballade, Rock, Heavy Metal, Rap/Hip-Hop, House, Techno, Funk, Soul/R&B, Big Band, Modern Jazz, Fusion, Bossa Nova, Samba, Reggae, Tango, Baroque, Classical, Romantic, Modern, Blues, Folk, Country, Gospel), sung by men and women. The musical pieces used are numbered in the RWC database [1] G01 xx to G05 xx and G07 xx. The duration of the whole database is approximately 5 hours, with 3 hours of music only. All the musical pieces used in our experiments have been hand labeled: for each file, a person listened to segments of one second each and typed 1 on hearing an artist singing, or 0 on hearing only music. Each piece in the proposed database is split equally into two parts: a first half segment and a last half segment. The first half segments of all pieces are used as the training set. The second halves are grouped together and are never presented in the training session, so that they can be used for validation in the testing session.
3 Feature Extraction
The conventional Mel-Frequency Cepstral Coefficient (MFCC) parameters, often used in recognition and identification systems (speech or music), were considered. Indeed, all musical pieces are first down-sampled from 44.1 kHz to 16 kHz. Then, the musical signal is divided into frames of 2000 samples. These frames are shifted at a frame rate of 2000 samples (no overlapping); in fact, the musical signal is more stable and quasi-stationary than the speech signal (co-articulation phenomena). For each frame, a Hanning window is applied, without pre-emphasis. The Fourier power spectrum is computed and coefficients are extracted from 32 triangular Mel filters. After applying the log operator to each filter output, a discrete cosine transform is applied to extract 24 coefficients in the cepstral space.
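A minimal Python sketch of this front end, assuming librosa for the resampling and the Mel filter bank (a 2000-sample frame at 16 kHz corresponds to the 125 ms decision rate):

```python
import numpy as np
import librosa
import scipy.fftpack

def extract_mfcc(y, sr_in=44100):
    """24 MFCCs per non-overlapping 2000-sample frame, as in Section 3."""
    y = librosa.resample(y, orig_sr=sr_in, target_sr=16000)
    frame_len = 2000                               # 125 ms at 16 kHz
    mel_fb = librosa.filters.mel(sr=16000, n_fft=frame_len, n_mels=32)
    window = np.hanning(frame_len)
    feats = []
    for i in range(len(y) // frame_len):
        frame = y[i * frame_len:(i + 1) * frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2    # Fourier power spectrum
        log_mel = np.log(mel_fb @ power + 1e-10)   # 32 triangular Mel filters
        feats.append(scipy.fftpack.dct(log_mel, norm="ortho")[:24])
    return np.array(feats)                         # shape: (n_frames, 24)
```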
4 Proposed Classification Techniques
The problem of discriminating singing from music can naively be seen as a categorization into two classes: a first class associated with musical sequences, and a second associated with the song signal. In fact, modeling with just two classes is far from the reality of the rich information conveyed by the singer's signal (melody, form, prosody, sex, age, feeling, etc.) and the musical signal (instruments, style, etc.). However, in the field of speaker identification, several contributions have been proposed to track and recognize one or more people engaged in a conversation [3]. Inspired by the progress achieved in this field in recent years, we propose to consider the following statistical approaches.
4.1 Non-parametric Model
The Linde-Buzo-Gray (LBG) [3] clustering algorithm is a method that aims to partition N observations into K clusters (prototypes or centers), in which each observation belongs to the cluster at the nearest distance. A strategy based on vector quantization using the LBG algorithm was investigated to compute 32, 64 and 128 centers, respectively (one codebook for music and a second codebook for song). The same MFCC coefficients described previously were used as parameter vectors.
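A sketch of this two-codebook classifier, with scikit-learn's k-means standing in for LBG codebook training (both produce centroid codebooks of the same kind; all names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(features, n_centers=32):
    """Train one VQ codebook (music or song) on (n_frames, 24) MFCCs."""
    return KMeans(n_clusters=n_centers, n_init=5).fit(features).cluster_centers_

def classify(frame, music_cb, song_cb):
    """Label one MFCC frame by its nearest codebook."""
    d_music = np.min(np.linalg.norm(music_cb - frame, axis=1))
    d_song = np.min(np.linalg.norm(song_cb - frame, axis=1))
    return "song" if d_song < d_music else "music"

# Usage: music_cb = train_codebook(music_feats); song_cb = train_codebook(song_feats)
```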
4.2 Multi-Layer Feed-Forward Perceptron (MLP)
Neural networks (NN) are known to be able to perform complex tasks in various fields of application, such as pattern recognition, identification, classification, prediction and control systems. NN designs are inspired by the biological cells of neural systems. Among the most used networks are the multi-layer feed-forward networks. Each layer consists of simple units operating in parallel; units receive their input from the units of the layer directly below and send their output to the units of the layer above. Each unit is weighted with an appropriate coefficient
Fig. 1. Structure of the neural network system
Commonly, a NN is trained by adjusting the weights and biases so that each input yields the specified target at the output. The Auditory Toolbox for Matlab is used for this experiment [7]. Based on the data manually labeled into music and song, we used a conventional multi-layer feed-forward network. The output is a binary encoding; the input is the vector of MFCC coefficients, with a hidden layer of 32, 64 and 128 cells respectively. In principle, for the song segments we should use an audio signal segmentation based on phonemes; several alternative codings could then be used to establish the targets and to train the network. This is a feasible task, despite the variability of the speaker. However, for the music segments this strategy becomes very complicated and difficult due to the number of instruments and combinations of musical notes. To bypass this problem, we applied the LBG algorithm to perform segmentation using a simple centroid representation. The system architecture is illustrated in Fig. 1. Specifically, we separated the learning data into two subgroups, for music and song respectively, based on the manual labeling. Then, for each subgroup, we applied the LBG algorithm to estimate prototype codebooks. Three sets of prototypes were estimated for each subgroup: 32, 64 and 128 codebooks. Relying on the similarity measure, the target for each input is coded as the nearest prototype. We are thus left with 32 prototypes (of dimension 24) for music and for singing, respectively.
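One way to realize this target-coding scheme is sketched below, using scikit-learn's MLPClassifier as a stand-in for the Matlab network; all variable names (X_train, y_music_song, cb_music, cb_song, X_test) are hypothetical, and the mapping of prototype indices to classes is our own reading of the description.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def codebook_targets(X, labels, cb_music, cb_song):
    """Code each frame's target as its nearest prototype: indices 0..31 for
    music frames, 32..63 for song frames (for 32-prototype codebooks)."""
    t = np.empty(len(X), dtype=int)
    for i, (x, lab) in enumerate(zip(X, labels)):
        cb = cb_music if lab == 0 else cb_song
        idx = ((cb - x) ** 2).sum(1).argmin()
        t[i] = idx if lab == 0 else idx + len(cb_music)
    return t

# Train on codebook-coded targets, then decide music/song at test time by
# which subgroup the predicted prototype belongs to (1 = song, 0 = music).
targets = codebook_targets(X_train, y_music_song, cb_music, cb_song)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X_train, targets)
pred_class = (mlp.predict(X_test) >= len(cb_music)).astype(int)
```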
4.3 Parametric Model
We use Gaussian Mixture Models (GMM) with a weighted sum of 32, 64 and 128 Gaussians. In each case, one GMM is trained on the song features and another on the music features extracted from the training database. Expectation-Maximization (EM) is used for estimating the means, covariances and weights of the GMM models. Test features are classified by a maximum-likelihood discriminant function, computed from their distances to the multiple Gaussians of the class distributions. Recall that all the experimental data is hand labeled; it contains male and female artists, various genres, and 5 hours of recordings.
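A compact sketch of this classifier is shown below, using scikit-learn GMMs; the diagonal covariance type is our assumption (the text does not specify it), and X_music, X_song and X_test are hypothetical names.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One GMM per class, fitted by EM on the training features of that class.
gmm_music = GaussianMixture(n_components=64, covariance_type='diag').fit(X_music)
gmm_song  = GaussianMixture(n_components=64, covariance_type='diag').fit(X_song)

# Maximum-likelihood decision per test frame: 1 = song, 0 = music.
pred = (gmm_song.score_samples(X_test)
        > gmm_music.score_samples(X_test)).astype(int)
```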
Table 1. Score recognition (every 2000 samples): GMM method
Data set       Number of Gaussians
               32       64       128
Train Music    86.5%    89.5%    94.1%
Train Song     89.5%    92.2%    93.4%
Test Music     77.7%    79.7%    81.8%
Test Song      81.2%    78.3%    79.1%
Table 2. Score recognition (every 2000 samples): LBG method
Data set       Prototypes
               32      64      128
Train Music    73%     76%     78%
Train Song     76%     78%     81%
Test Music     67%     69%     70%
Test Song      72%     73%     75%
Table 3. Score recognition (every 2000 samples): Multi-layer perceptron based on LBG
Data set       Hidden layer units
               32      64      128
Train Music    86%     86%     88%
Train Song     54%     56%     58%
Test Music     80%     80%     83%
Test Song      54%     60%     59%
Table 4. Score recognition (every 2000 samples): Multi-layer perceptron
Data set       Hidden layer units
               32      64      128
Train Music    94%     92%     92%
Train Song     60%     64%     66%
Test Music     90%     89%     88%
Test Song      35%     35%     37%
For all examined techniques, the first half of each file in the database is used for the training session, and the second half is used in the testing session as unknown data.
5 Results and Discussion
The results for the three classifiers are shown in Tables 1 to 4. The columns give the recognition scores as a function of the number of LBG prototypes, weighted Gaussians of the GMM, or cells in the hidden layer of the MLP (32, 64 and 128).
The rows give the recognition scores on training and test data, for music and song segments respectively. The recognition rate generally increases with this factor (32, 64 and 128). The results show that better scores are obtained on training data for all systems; on test data, the recognition rate decreases for all strategies. Comparing the results obtained by the GMM and the LBG, we find that the LBG scores are lower in all cases. In fact, the quantization algorithm can be seen as a discrete version of the GMM model, so lower performance is expected, and in the best case the scores will be close; the advantage of the LBG algorithm is that it is easy to implement and serves as an indicator. Comparing the results obtained by the GMM and the proposed neural networks, we find that the neural networks do not discriminate singing segments well, while the GMM achieves good performance. In the case of music discrimination, the two approaches give good results; however, the best performance is obtained with the GMM in all cases. We conclude that coding the targets by codebook is less effective for training the network than the GMM system. The unsatisfactory results of the neural network could be improved by considering another coding scheme and by balancing the amount of data between music and singing; GMM-based coding would be a good alternative.
6 Conclusion
In this work, the problem of identifying sections of singing voice and of musical instruments is explored. A parametric GMM model, a non-parametric LBG model and an MLP neural network were used as classifiers and compared. Only classical MFCC coefficients were used as parameters. The best performance is obtained with the GMM model, with up to 94% recognition score on training data and 80.5% on test data. The neural network results could still be improved, but the coding part must be reworked, using only the rank of codebook elements or investigating another segmentation technique. Codebook-based coding of the targets remains a good idea for fast training of the neural network in the case of music.
References

1. Mesaros, A., Virtanen, T., Klapuri, A.: Singer identification in polyphonic music using vocal separation and pattern recognition methods. In: Proc. ISMIR, Vienna, Austria (2007)
2. Tzanetakis, G., Essl, G., Cook, P.: Automatic musical genre classification of audio signals. In: Proc. ISMIR, Bloomington, Indiana (2001)
3. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Trans. Comm. 28(1), 84–95 (1980)
4. Berenzweig, A., Ellis, D., Lawrence, S.: Using voice segments to improve artist classification of music. In: Proc. AES-22 Intl. Conf. on Virt., Synth., and Ent. Audio, Espoo, Finland (June 2002)
5. Kim, Y.E., Whitman, B.: Singer identification in popular music recordings using voice coding features. In: Proc. ISMIR, Paris, France (2002)
6. Goto, M., Hashiguchi, H., Nishimura, T., Oka, R.: RWC music database: music genre database and musical instrument sound database. In: Proc. ISMIR, pp. 229–230 (2003)
7. The Auditory Toolbox for Matlab, http://cobweb.ecn.purdue.edu/~malcolm/interval/1998010/
An Adaptive Method Using Genetic Fuzzy System to Evaluate Suspended Particulates Matters SPM from Landsat and Modis Data

Bahia Lounis, Sofiane Rabia, Adlene Ramoul, and Aichouche Belhadj Aissa

Laboratory of Image Processing and Radiation, Faculty of Electronics and Computer Science, USTHB University, BP 32, El-Alia, Beb-Ezzouar, 16111 Algiers, Algeria
[email protected], [email protected]

Abstract. In this paper, we propose an optimization of a fuzzy model which exploits remotely sensed multispectral reflectances to estimate Suspended Particulate Matter (SPM) concentrations in coastal waters. The relation between the SPM concentrations and the subsurface reflectances is modeled by a set of fuzzy rules extracted automatically from the data through a two-step procedure. First, fuzzy rules are generated by unsupervised fuzzy clustering of the input data. In the second step, a genetic algorithm is applied to optimize the rules. Our contribution focuses on the global and partial optimization of the rules and on a proposed chromosome structure adapted to remote sensing data. Results of the application of each type of optimization to Landsat and Modis data are shown and discussed.

Keywords: Remote sensing data, Coastal water quality, SPM concentrations, Fuzzy clustering, Genetic optimization, Chromosome codification.
1 Introduction

In most coastal waters, Suspended Particulate Matter (SPM) is the biggest source of water pollution. The determination of its concentration is fundamental to the study of environmental phenomena and to the management of coastal environments. Furthermore, SPM is also essential in the initialization and validation of numerical hydro-sedimentary models [1], and is considered an important indicator of erosion problems in the watershed. For these reasons, the concentration of this parameter must be continuously monitored in order to follow its seasonal movements, to quantify sedimentary fluxes, and to estimate the fluvial solid discharges to the ocean. To reach this goal, spatial and/or aerial remote sensing is the most efficient tool for mapping the concentration of SPM, as it provides an instantaneous and synoptic view of sediments that would otherwise be unavailable from in situ measurements. SPM concentrations affect remotely sensed signals through the interactions of scattering and absorption by sediment particles and water. Summaries of these relationships are provided in [2]. In general, the optics of water-sediment mixtures are highly nonlinear, and many factors such as suspended particle size, shape and color can have large influences on water-sediment optics. Due to these complexities, it is
well known that there are no universal algorithms to remotely estimate sediment concentrations, and the first developed algorithms were either empirical or semi-analytical models [3]. With the increase of multi-spectral information and the complexity of solving the inverse model, new estimation algorithms inspired by natural phenomena have emerged in recent years. Neural networks were successfully used to implement the inverse model and to properly address the non-linearity problem [4]. Besides neural networks, fuzzy systems have proved to be particularly effective in identifying non-linear models too. The most popular approach to fuzzy modeling is based on the identification of fuzzy rules, which describe in linguistic terms the input/output relationship [5]. These rules are extracted from the available numerical data and fix the actions that must be performed to obtain the output if some conditions on the input variables are satisfied. The crucial point of this method is the definition of the membership functions, which express the membership degrees of the input variables to their appropriate fuzzy sets. An optimization process is usually needed to tune the membership functions so that the fuzzy method implements the desired inverse model. In this paper, we propose a methodological approach for SPM estimation based on the hybridization of fuzzy logic and genetic algorithms (GAs) using Landsat and Modis data. The first part of this approach is devoted to fuzzy SPM estimation based on fuzzy rules of the Takagi-Sugeno-Kang (TSK) type [5]; we establish the membership functions of the fuzzy sets, which are the most important parameters of the SPM estimation model. We then present, in the second part of our approach, the optimization process, in which a TSK population is evolved to generate accurate membership functions and TSK rules. Our contribution consists in the search for the best optimization giving the best estimation results; we perform a global and a partial optimization by adapting the chromosome structure codifying the fuzzy set parameters. Further details of these two parts are presented in sections three and four. In the experimental section, we present the estimation results and show the benefits of each optimization type.
2 Processed Data

To carry out our experiments, we have used two images with different spatial resolutions, 30 m and 1 km, corresponding respectively to the ETM+ and Modis satellites and covering the Algiers bay and the Manche Sea. The ETM image (1342x489 pixels) was first geometrically corrected, georeferenced and transformed into radiances; it was then corrected for the atmospheric contribution by the dark-point technique [6], and the land areas were masked. The Modis image (1667x2401 pixels) was provided with all necessary pretreatments. From these images and according to the available SPM concentrations, we have built two data sets of subsurface reflectances R and SPM concentrations: 282 and 7093 pairs of $(R_j, SPM)$ for ETM and Modis respectively. For each set, we randomly split the data into two subsets: the former, composed of 2/3 of the initial data, is used to define the fuzzy model, whereas the latter is used to evaluate its performances.
3 Fuzzy SPM Estimation

The main idea of fuzzy SPM estimation is based on unsupervised fuzzy clustering of the available data into clusters, in order to select a blend of water types. Each cluster $i$, which shares the same spectral characteristics, is associated to a water type. For each water type, the SPM concentration variation is assumed to be locally linear and expressed as $SPM_i = p_{i,0} + p_{i,1} R_1 + p_{i,2} R_2 + \dots + p_{i,M} R_M$, where $R_1, \dots, R_M$ are the $M$ subsurface reflectances and $(p_{i,0}, p_{i,1}, p_{i,2}, \dots, p_{i,M})$ are real numbers. The estimation model is then expressed as a set of conditions, which assign each pixel to a water type, in the form of Takagi-Sugeno-Kang (TSK) rules. The premises of these rules depend on the fuzzy sets $A_{i,j}$ defined on the reflectance domain, and the consequences give the local linear SPM model, as shown by equation (1) [5]:

$$Rule_i: \text{If } (R_1 \text{ is } A_{i,1}) \text{ and } \dots \text{ and } (R_M \text{ is } A_{i,M}) \text{ then } SPM_i = p_{i,0} + p_{i,1} R_1 + \dots + p_{i,M} R_M \qquad (1)$$
Each local model is associated with a degree of activation $\beta_i$ expressed as $\beta_i = \min(u_{A_{i,1}}(R_1), u_{A_{i,2}}(R_2), \dots, u_{A_{i,M}}(R_M))$, where $u_{A_{i,j}}(R_j)$ is the membership function associated to the fuzzy set $A_{i,j}$. The global SPM concentration is then computed by aggregating the conclusions inferred from the $r$ individual rules as follows:

$$SPM = \frac{\sum_{i=1}^{r} \beta_i \, SPM_i}{\sum_{i=1}^{r} \beta_i} \qquad (2)$$
Consequently, the estimation model is regarded as an identification problem of the fuzzy sets $A_{i,j}$ and their membership functions $u(A_{i,j})$, which define the different water types, and of the parameters $(p_{i,0}, p_{i,1}, p_{i,2}, \dots, p_{i,M})$ of the linear model associated to each cluster. All these parameters are extracted from an unsupervised partition of the available data using a fuzzy clustering algorithm. Let $Z = \{z_1, z_2, \dots, z_N\}$ be a training data set, with $z_k = (R_k, SPM_k)$, where $R_k$ is the $k$th vector of subsurface spectral reflectances and $SPM_k$ is its corresponding suspended particulate matter concentration. Let $C_1, \dots, C_C$ be a family of fuzzy clusters on $Z$, and $U_{C,N}$ a real $[C \times N]$ matrix. A fuzzy C-partition of $Z$ is the construction of the matrix $U_{C,N}$ as

$$U = \left\{ u_{i,k} \in [0,1] \;\; \forall i,k; \quad \sum_{i=1}^{C} u_{i,k} = 1 \;\; \forall k; \quad 0 < \sum_{k=1}^{N} u_{i,k} < N \;\; \forall i \right\}$$

where $u_{i,k}$ is the membership value of $z_k$ to a cluster $C_i$. The FCM algorithm determines the optimal fuzzy C-partition minimizing the objective function

$$J_m(U, V, W) = \sum_{i=1}^{C} \sum_{k=1}^{N} u_{i,k}^m \, d_i^2(z_k, v_i), \quad \text{where } d_i^2(z_k, v_i) = (z_k - v_i)^T W_i (z_k - v_i)$$

and $m$ is the fuzzification coefficient. Each cluster $C_i$ is represented by a prototype $v_i$ in $\mathbb{R}^M$ and
its matrix $W_i$, which determines its shape. In the classical FCM algorithm, $W_i$ is the identity matrix and the distance is Euclidean [7]. In the Gustafson and Kessel algorithm [8], known as the GK algorithm, $W_i = [\rho_i \det(Q_i)]^{1/C} Q_i^{-1}$, where $Q_i$ is the fuzzy covariance matrix of cluster $i$, and the clusters are hyper-ellipsoidal. In the weighted fuzzy C-means (WFCM) algorithm of [5], $W_i$ is forced to be diagonal, with diagonal elements $w_{i,f,f}$; the latter is equivalent to using a weighted distance, where each element $w_{f,f}$ on the diagonal of $W_i$ weights the importance of feature $f$ in computing $d_i$. Compared to the classical FCM and GK algorithms, the WFCM offers simplicity of implementation, rapid convergence and better performance [5]; that is why we have chosen it to partition the input space of our data. The WFCM principle is summarized as follows (a compact sketch is given after the list): fix the number of clusters $C$, the fuzzification coefficient $m$ and the weight $h$, and initialize $U_{C,N}^{(l=0)}$ randomly. Then, at step $l = 0, 1, 2, \dots$, do:

1. Calculate the vector prototypes $V^{(l)}$ using $U_{C,N}^{(l)}$.
2. Calculate $W_i^{(l)}$ using $V^{(l)}$ and $U_{C,N}^{(l)}$.
3. Update $U_{C,N}^{(l+1)}$ using $V^{(l)}$ and $W_i^{(l)}$.
4. Compare $U_{C,N}^{(l)}$ and $U_{C,N}^{(l+1)}$ using the matrix norm $\|U_{C,N}^{(l+1)} - U_{C,N}^{(l)}\|$: if $\|U_{C,N}^{(l+1)} - U_{C,N}^{(l)}\| \le \varepsilon$ then stop; otherwise go to 1 with $l = l + 1$, where $\varepsilon$ is the admissible error, generally $\varepsilon = 10^{-5}$.
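The following sketch illustrates this loop with NumPy. It is illustrative only: fixed diagonal feature weights stand in for the adaptive weight update driven by h in [5], and the prototype/membership updates follow the standard FCM formulas, which [5] adapts.

```python
import numpy as np

def wfcm(Z, C=5, m=2.0, w=None, n_iter=100, eps=1e-5):
    """WFCM sketch. Z: (N, F) data (reflectances + SPM); w: diagonal feature
    weights, kept fixed here for brevity. Returns memberships U (C, N) and
    prototypes V (C, F)."""
    N, F = Z.shape
    w = np.ones(F) if w is None else np.asarray(w, dtype=float)
    U = np.random.dirichlet(np.ones(C), size=N).T              # random U^(0)
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ Z) / Um.sum(axis=1, keepdims=True)           # step 1: prototypes
        d2 = (((Z[None, :, :] - V[:, None, :]) ** 2) * w).sum(-1) + 1e-12
        inv = 1.0 / d2 ** (1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)                          # step 3: memberships
        if np.abs(U_new - U).max() <= eps:                     # step 4: convergence
            return U_new, V
        U = U_new
    return U, V
```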
Once the input space is partitioned, each cluster is associated to a fuzzy set $A_{i,j}$. The membership functions $u_{A_{i,j}}$ of $A_{i,j}$ are obtained by projecting the rows of the partition
matrix $U_{C,N}$ onto the input variables $R_j$ and approximating the projections by triangular membership functions, as expressed in equation (3):
$$u_{A_{i,j}}(R_j) = \max\left(0, \min\left(\frac{R_j - a}{b - a}, \frac{c - R_j}{c - b}\right)\right) \qquad (3)$$
where $b$ corresponds to the abscissa of the vertex of the triangle and is computed as the weighted average of the components of the training patterns. Parameters $a$ and $c$ are deduced as the intersections of the abscissa axis with the lines obtained by linear regression of the membership values of the training patterns on the left and right sides of $b$, respectively. Given the membership functions, the parameters $(p_{i,0}, p_{i,1}, p_{i,2}, \dots, p_{i,M})$ of the local linear model associated to each cluster are obtained by standard linear least-squares methods, as described in [9]:
$$P = (X^T X)^{-1} X^T Y \qquad (4)$$

where

$$X = \begin{bmatrix} \alpha_{1,1} \,..\, \alpha_{C,1} & \alpha_{1,1}R_{1,1} \,..\, \alpha_{C,1}R_{1,1} & \cdots & \alpha_{1,1}R_{M,1} \,..\, \alpha_{C,1}R_{M,1} \\ \vdots & \vdots & & \vdots \\ \alpha_{1,k} \,..\, \alpha_{C,k} & \alpha_{1,k}R_{1,k} \,..\, \alpha_{C,k}R_{1,k} & \cdots & \alpha_{1,k}R_{M,k} \,..\, \alpha_{C,k}R_{M,k} \\ \vdots & \vdots & & \vdots \\ \alpha_{1,N} \,..\, \alpha_{C,N} & \alpha_{1,N}R_{1,N} \,..\, \alpha_{C,N}R_{1,N} & \cdots & \alpha_{1,N}R_{M,N} \,..\, \alpha_{C,N}R_{M,N} \end{bmatrix},$$

$$P = [p_{1,0}, \dots, p_{1,M}, \dots, p_{C,0}, \dots, p_{C,M}]^T, \qquad Y = [SPM_1, \dots, SPM_N]^T,$$

and

$$\alpha_{i,k} = \frac{A_{i,1}(R_{1,k}) \wedge \dots \wedge A_{i,M}(R_{M,k})}{\sum_{k=1}^{N} \left( A_{i,1}(R_{1,k}) \wedge \dots \wedge A_{i,M}(R_{M,k}) \right)}, \qquad \forall i \in [1, C].$$
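To make the estimation model concrete, the sketch below evaluates equations (2)-(4) with NumPy. It is an illustration under our own naming conventions, not the authors' code.

```python
import numpy as np

def triangular(a, b, c):
    """Triangular membership function u_{A_{i,j}} of eq. (3)."""
    return lambda x: np.maximum(0.0, np.minimum((x - a) / (b - a + 1e-12),
                                                (c - x) / (c - b + 1e-12)))

def fit_consequents(R, spm, alpha):
    """Least-squares solution of eq. (4). R: (N, M) reflectances,
    spm: (N,) measured concentrations, alpha: (C, N) normalized activations."""
    X = np.hstack([alpha.T] + [alpha.T * R[:, [j]] for j in range(R.shape[1])])
    P, *_ = np.linalg.lstsq(X, spm, rcond=None)
    # columns of X are grouped coefficient-by-coefficient; reshape rule-by-rule
    return P.reshape(R.shape[1] + 1, alpha.shape[0]).T        # (C, M+1)

def tsk_predict(R, premises, P):
    """Global SPM estimate of eq. (2). premises[i][j] is the membership
    function of fuzzy set A_{i,j}; P[i] = (p_{i,0}, ..., p_{i,M})."""
    beta = np.array([np.min([premises[i][j](R[:, j])
                             for j in range(R.shape[1])], axis=0)
                     for i in range(len(premises))])           # (r, N) activations
    local = P[:, [0]] + P[:, 1:] @ R.T                         # (r, N) local SPM_i
    return (beta * local).sum(0) / (beta.sum(0) + 1e-12)
```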
After identifying all fuzzy rules, a genetic optimization is applied in order to tune their parameters and reach the desired estimation results.
4 Genetic Optimization of Fuzzy Model

Genetic Algorithms (GAs) are adaptive methods suitable for solving optimization problems. By mimicking the principles of natural selection, GAs are able to evolve solutions for real-world problems, provided that they have been suitably encoded. Each potential solution to a given problem is coded as a binary or real string called a chromosome. Generally, GAs start with a randomly generated initial population of chromosomes. At each step, a new population is generated from the current one using two basic operators: crossover and mutation. While crossover combines parent chromosomes to generate offspring chromosomes, mutation is a local modification of a chromosome. Chromosomes are selected for crossover and/or mutation based on their fitness function. In our model, the general objective of the genetic optimization is to find a fuzzy model which can approximate the input/output relationship by optimizing the structure of the model and/or its parameters. As stated above, the fuzzy SPM estimation model is based on a number of rules, which are defined by the fuzzy sets and output parameters extracted from the partition of the input space. Several techniques are available to optimize the structure of fuzzy models. According to the model complexity, these techniques proceed by global or partial optimization, taking into account all or one part of the fuzzy model. To adapt the GA to a specific optimization, the chromosome structure is adjusted to the parameters which are tuned.

4.1 Structure Codifying Chromosome
In a global optimization, each chromosome represents the entire fuzzy system, rule by rule, with the antecedent and consequent parts. As shown by figure 1.a, each rule antecedent consists of a sequence of M triplets (a, b, c) of real numbers representing triangular membership functions, whereas each rule consequent contains M + 1 real numbers corresponding to the consequent parameters $p_{i,j}$ (see section 3). The identification of the antecedent and consequent parts of a rule is inseparable, since they have a mutual relationship; thus, the optimization of the parameters needs much more processing time to reach the desired estimation results. For some applications, the GA is adapted to a specific problem; each chromosome then codifies only one part of the
fuzzy model, generally the membership functions. As illustrated by figure 1.b, the chromosome structure is then composed of the antecedent part of the rule, which defines the membership functions (a, b, c) of the fuzzy sets.
Fig. 1. Chromosome representation. (a): Global optimization, (b): Partial optimization.
In this work, we propose to test both types of optimization in order to select the best one for our databases. In fact, the complexity of the inverse model of SPM concentration depends on several parameters, such as the atmospheric and geographic conditions specific to the studied area and its suspended particle size, shape and color.

4.2 Genetic Evolution
For each optimization type (global or partial), the most important considerations in the implementation are to define a proper fitness function, which helps to find a better solution for the given optimization, and the GA evolution, which ensures the algorithm's convergence. Since we are interested in generating the SPM concentrations while minimizing the estimation error, we define the fitness value as the inverse of the Mean Square Error (MSE). Also, to accelerate convergence, the fuzzy parameters resulting from the partition algorithm are taken as the initial population of the GA, and we apply the arithmetic crossover and uniform mutation operators to generate a new population [10]. Chromosomes to be mated are chosen using the well-known roulette-wheel selection method, which associates to each chromosome a probability proportional to its fitness value. We fixed the probabilities of crossover and mutation to 0.9 and 0.1, respectively. The offspring generated using crossover and mutation are checked against the aforementioned space coverage criterion. Only 80% of the new population is composed of offspring, whereas 20% consists of the best chromosomes of the previous population. When the average of the fitness values of all the individuals in the population is greater than 99% of the fitness value of the best individual, or a prefixed number of iterations has been executed, the GA is considered to have converged.
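A sketch of this evolution loop is given below, with the parameter values quoted above (fitness = 1/MSE, roulette-wheel selection, arithmetic crossover with probability 0.9, uniform mutation with probability 0.1, 20% elitism, 99% convergence test). mse_of is a hypothetical callback that decodes a chromosome into a fuzzy model and returns its MSE; the mutation amplitude is our assumption.

```python
import numpy as np

def ga_optimize(init_pop, mse_of, n_gens=100, pc=0.9, pm=0.1, elite=0.2):
    pop = np.array(init_pop, dtype=float)        # real-coded chromosomes
    n, L = pop.shape
    for _ in range(n_gens):
        fit = 1.0 / (np.array([mse_of(c) for c in pop]) + 1e-12)
        if fit.mean() > 0.99 * fit.max():        # convergence criterion
            break
        probs = fit / fit.sum()
        n_off = int(round((1 - elite) * n))      # 80% of next population
        idx = np.random.choice(n, size=(n_off, 2), p=probs)   # roulette wheel
        p0, p1 = pop[idx[:, 0]], pop[idx[:, 1]]
        lam = np.random.rand(n_off, 1)
        off = np.where(np.random.rand(n_off, 1) < pc,
                       lam * p0 + (1 - lam) * p1, p0)         # arithmetic crossover
        mut = np.random.rand(n_off, L) < pm                   # uniform mutation
        off[mut] += np.random.uniform(-0.1, 0.1, mut.sum())
        keep = pop[np.argsort(fit)[-(n - n_off):]]            # 20% elitism
        pop = np.vstack([off, keep])
    fit = 1.0 / (np.array([mse_of(c) for c in pop]) + 1e-12)
    return pop[fit.argmax()]
```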
5 Experimental Results

To estimate the SPM concentrations for the ETM and Modis data sets, we first partition both data sets by executing the clustering with the weighted fuzzy C-means (WFCM) algorithm proposed in [5], in order to deduce the fuzzy set parameters $(A_{i,j}, u(A_{i,j}))$, which define the different water types, and the associated output parameters $(p_{i,0}, p_{i,1}, p_{i,2}, \dots, p_{i,M})$.
Fig. 2. Fuzzy model performances for the ETM training data. (a): Fuzzy model, (b): Fuzzy model with global optimization, (c): Fuzzy model with partial optimization.
Fig. 3. Fuzzy model performances for the ETM test data. (a): Fuzzy model, (b): Fuzzy model with global optimization, (c): Fuzzy model with partial optimization.
Fig. 4. Fuzzy model performances for the Modis training data. (a): Fuzzy model, (b): Fuzzy model with global optimization, (c): Fuzzy model with partial optimization.
Fig. 5. Fuzzy model performances for the Modis test data. (a): Fuzzy model, (b): Fuzzy model with global optimization, (c): Fuzzy model with partial optimization.
Table 1. Statistical comparison of fuzzy model performances before and after optimization

                       Fuzzy model          Global optimization   Partial optimization
                       MSE         ρ        MSE         ρ         MSE         ρ
Modis  Training data   0.0081039   0.9142   0.0065202   0.9313    0.0015308   0.9842
Modis  Test data       0.0084515   0.9093   0.0074373   0.9212    0.0021842   0.9762
ETM    Training data   1.93138e-8  0.9952   9.57956e-9  0.9976    4.68912e-9  0.9988
ETM    Test data       2.33829e-8  0.9954   1.23177e-8  0.9976    6.70072e-9  0.9987

Fig. 6. MSE variation versus iteration number. (a): ETM data, (b): Modis data.
Then, we computed the SPM concentrations and evaluated the performances of the fuzzy model by calculating the mean square error (MSE) and the correlation coefficient ρ for each type of data. An important point in the implementation of the fuzzy SPM model is to determine the optimum number of clusters C, which determines the number of rules r. As in [5], we adopted the Xie-Beni index criterion, and we obtained 5 and 9 rules for Modis and ETM respectively. Figures 2.a, 3.a, 4.a and 5.a show the comparison between the fuzzy-estimated and the measured values of SPM for ETM and Modis data. In these figures, we note that the values are correlated: they approach the bisector, although those obtained for ETM data present some dispersion. The MSE and correlation coefficients presented in Table 1 confirm these remarks. Figures 2.b, 3.b, 4.b, 5.b and 2.c, 3.c, 4.c, 5.c illustrate the results of the global and partial genetic optimizations, respectively. Let us note that, to be able to compare these two types of optimization, we have taken the same hypotheses, namely the same initial population, number of iterations, crossover and mutation probabilities, fitness function and convergence criteria. Figure 6 illustrates an example of the MSE evolution during the global optimization. Figures 2.b, 3.b, 4.b and 5.b illustrate the results of the global genetic optimization. Compared to figures 2.a, 3.a, 4.a and 5.a, these graphs show the contribution of this optimization to improving the estimation results. Indeed, for Modis data the dispersion of points around the bisector decreased, especially for the areas encircled in figures 4.b and 5.b, while for ETM data the dispersion is only slightly reduced. The statistical results in Table 1 confirm this observation: the correlation coefficient for the ETM data increased from 91% to 93%. Thus, it appears that the exploration of the
solution space by encoding both the input and output parameters in a fixed architecture is not adapted to these data. Figures 2.c and 3.c represent the partial optimization results. The graphs show a clear improvement for ETM data: the partial exploration of the solution space, with the mutual relationship between the antecedent and consequent parameters, guarantees more improved results than the global optimization proposed in [6]. This observation is confirmed by the statistical evaluation presented in Table 1; indeed, the correlation coefficient increased from 91% to 98%, considerably reducing the MSE. On the other hand, the partial optimization only slightly improved the results for Modis data, as shown by the statistical results presented in Table 1 and by figures 4.c and 5.c.
6 Conclusion

We have proposed a genetic optimization of a fuzzy model to estimate the SPM concentrations from ETM and Modis data. The fuzzy model is based on fuzzy rules automatically extracted from the data using an unsupervised clustering algorithm. The rules composing the fuzzy model allow an easy interpretation of the input/output relations between surface reflectances and SPM concentrations. Furthermore, the number of rules of the fuzzy model for ETM data is approximately double that obtained for Modis data; this means that the inverse model of SPM for ETM is more complex than the Modis model, and gives an initial appreciation of the water quality type corresponding to each studied site. Finally, the fuzzy rule parameters are tuned by a genetic algorithm in order to minimize the error between desired and predicted outputs. For each data type, we have proposed two chromosome structures to codify the different rule parameters and to optimize the model in a global or partial way. From the experimental results, we conclude that the global optimization gives satisfactory results for the fuzzy model of Modis data, whereas the partial optimization is more suitable for a complex inverse model, which corresponds to ETM data. The generalization of the proposed approach to the entire images provides a map which represents the spatial variability of SPM in coastal and marine areas. Using multitemporal remote sensing images, the estimated maps give important information about the SPM spreading, which allows determining seasonal and yearly changes caused, for example, by floods. These considerations will be integrated into our future works.
References

1. Douillet, P., Ouillon, S., Cordier, E.: A numerical model for fine suspended sediment transport in the southwest lagoon of New Caledonia. Coral Reefs 20, 361–372 (2001)
2. Stumpf, R.P., Pennock, J.R.: Calibration of a general optical equation for remote sensing of suspended sediments in a moderately turbid estuary. Journal of Geophysical Research 94(C10), 14363–14371 (1989)
3. Acker, J., Ouillon, S., Gould, R., Arnone, R.: Measuring marine suspended sediment concentrations from space: history and potential. In: 8th International Conference on Remote Sensing for Marine and Coastal Environments, Halifax, Canada, May 17–19 (2005)
4. Djellal, D., Lounis, B., Belhadj-Aissa, A.: Cartographie de la concentration des sédiments côtiers par imagerie satellitaire et étude des lidars bathymétriques. In: Journées d'Animation Scientifique (JAS'09) de l'AUF, Alger (Novembre 2009)
5. Cococcioni, M., Corsini, G., Lazzerini, B., Marcelloni, F.: Approaching the ocean color problem using fuzzy rules. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics 34(3) (June 2004)
6. Lounis, B., Belhadj Aissa, A., et al.: Processus de correction relative PCR des images satellitaires Landsat multidates. In: SETIT 2005, 3rd International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, Tunisia, March 27–31 (2005)
7. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY (1981)
8. Gustafson, D.E., Kessel, W.C.: Fuzzy clustering with fuzzy covariance matrix. In: Gupta, M.M., Ragade, R.K., Yager, R. (eds.) Advances in Fuzzy Set Theory and Applications, pp. 605–620. North-Holland, Amsterdam (1979)
9. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Systems, Man, Cybern. SMC-15, 116–132 (1985)
10. Wei, W.: Synthèse d'un contrôleur flou par algorithme génétique: application au réglage dynamique des paramètres d'un système. Thèse de doctorat de l'université de Lille 1 (1998)
An Efficient Combination of Texture and Color Information for Watershed Segmentation

Cyril Meurie, Andrea Cohen, and Yassine Ruichek

Systems and Transportation Laboratory, University of Technology of Belfort-Montbéliard, 13 rue Ernest Thierry-Mieg, 90010 Belfort, France
{cyril.meurie,andrea.cohen,yassine.ruichek}@utbm.fr
Abstract. A new segmentation technique based on a color watershed using an adaptive combination of color and texture information is proposed in this paper. This information is represented by two morphological gradients: a classical color gradient and a texture gradient based on co-occurrence matrix texture features. The two morphological gradients are then mixed using a gradient component fusion strategy and an adaptive technique for choosing the weighting coefficients. The segmentation process is finally performed by applying the watershed algorithm. The obtained results are evaluated with the MSE for several sets of parameters and color spaces.

Keywords: image segmentation, adaptive combination, color, texture, mathematical morphology, co-occurrence matrices.
1 Introduction
Many segmentation methods have been proposed in the literature. The methods on which this article is based can be grouped into two categories: edge-based and/or region-based segmentation. These methods are usually developed for specific applications; therefore, there is no method that can be successfully applied to all applications without parameter redefinition. Generally, images present two important sources of information, which are color and texture. It is hence useful to use a segmentation method based on both. Angulo [1] proposes a segmentation method combining color and texture information; however, this method involves many parameters, which are difficult to adjust according to the considered application. Considering the popularity of co-occurrence matrices [2] [3] [4] [5] for extracting the texture features of an image, we propose a new texture gradient based on co-occurrence matrices. This new texture gradient is then used in a new segmentation technique based on an original and adaptive combination (considering local image content) that takes into account both the color and texture gradients. This combination technique has already been tested in previous works ([6] [7]) with Angulo's texture gradient [1] for two specific applications. The paper is organized as follows: Section 2 presents the morphological texture and color gradients definition. Section 3 describes the structural gradient
process and the proposed strategy for combining the texture and color gradients. Before concluding, experimental results for the Berkeley Segmentation Dataset and Benchmark are presented for different parameter settings and color spaces in section 4.
2 Morphological Texture and Color Gradients Definition

2.1 Co-occurrence Matrices
A co-occurrence matrix is, essentially, a two-dimensional histogram of the number of times that pairs of intensity values occur in a given spatial relationship (or translation) [4]. Mathematically, a co-occurrence matrix $C$ is defined over a gray-level image $g$, parameterized by a translation $t$, as:

$$C_t(i, j) = \mathrm{card}\{(x, x + t) \in D : g(x) = i, g(x + t) = j\} \qquad (1)$$
where $g(x)$ is the gray level of the pixel $x$; $i$ and $j$ are gray levels, and $D$ is the domain of the image. A co-occurrence matrix describes an image by looking at the relation between neighboring pixels and not at each pixel separately. Texture, on the other hand, is a phenomenon associated with a neighborhood of pixels and not with an individual pixel; hence, co-occurrence matrices can be used as a tool for texture description. Nevertheless, the success of this tool highly depends on the spatial relationship (translation vector) chosen. There are several characteristic features of co-occurrence matrices that can be used for textural description; these features summarize the content of the matrices. A total of 13 such features have been presented by Haralick et al. [3]. Only three of them are used for extracting feature images in [2]: the entropy, the angular second moment (ASM) and the contrast (c). We will only present and use the ASM and the contrast since, as our experiments showed, the entropy does not provide interesting results:

$$ASM_t = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} C_t^2(i, j) \quad \text{and} \quad c_t = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} (i - j)^2 C_t(i, j) \qquad (2)$$
where $G$ represents the number of gray levels taken into account, which is set to 16 in our experiments. The ASM is a measure of the homogeneity of the texture for the given spatial relationship: its value is high when the same couple of pixels is found repeatedly throughout the image, resulting in a matrix that has few entries of large magnitude. The contrast feature is a measure of the amount of local variation of the texture according to the spatial relationship: it is high when the matrix presents large terms far away from the diagonal, which means that pixels that are neighbors in terms of the vector $t$ have very different intensity values.
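The following sketch computes such a matrix and the two features with NumPy (our own minimal implementation; the quantization to G = 16 levels assumes 8-bit input).

```python
import numpy as np

def cooccurrence(gray, t, G=16):
    """Co-occurrence matrix of eq. (1) for a translation t = (dy, dx),
    with 8-bit intensities quantized to G = 16 gray levels."""
    dy, dx = t
    H, W = gray.shape
    q = (gray.astype(int) * G) // 256                 # quantize to G levels
    ys = np.arange(max(0, -dy), min(H, H - dy))       # positions where x and x+t
    xs = np.arange(max(0, -dx), min(W, W - dx))       # both fall inside the image
    a = q[np.ix_(ys, xs)].ravel()
    b = q[np.ix_(ys + dy, xs + dx)].ravel()
    C = np.zeros((G, G))
    np.add.at(C, (a, b), 1)                           # card{(x, x+t): g(x)=i, g(x+t)=j}
    return C

def asm_and_contrast(C):
    """ASM and contrast features of eq. (2)."""
    i, j = np.indices(C.shape)
    return (C ** 2).sum(), ((i - j) ** 2 * C).sum()
```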
2.2 Local Analysis
As presented in the preceding section, co-occurrence matrices can be used to obtain texture features. Nevertheless, they are computed globally for the entire image, while the expectation is to obtain more than one texture per image; indeed, the analysis should be applicable to each pixel of a color image $f$. To solve this problem, a co-occurrence matrix is computed locally for each pixel. This can be done by calculating $C$ on a window centered on each pixel $x$. In this way, both measures (ASM and contrast) can be computed for each pixel of the image. The local ASM image associated to $t$ is defined as the image that groups the ASM measures associated to the translation $t$ for each pixel of the image. The same concept can be applied to the contrast, resulting in a local contrast image associated to $t$. In a $t$-oriented local ASM image (resp. contrast image), pixels associated to a texture that is homogeneous in the direction and size of $t$ (or that presents a lot of contrast, for the contrast image) are assigned a high value. The images obtained by this procedure are then normalized. Since co-occurrence matrices can only be computed over a gray-level image, they are computed separately for each color plane of the input image $f$ in the RGB color space. This results in three $t$-oriented local ASM images (resp. contrast images) per color image $f$ per translation vector $t$. By using a set of vectors with different sizes and orientations, one can obtain texture feature images that are responsive to a number of textures with different orientations. The chosen set of vectors must cover the whole spectrum of textures present in the image.
2.3 Morphological Texture and Color Gradients Computation
A morphological gradient is defined, over a gray-level image $g$, as the residue of the dilation and the erosion (usually computed with a structuring element of size 1):

$$Q(g(x)) = \delta_\beta(g(x)) - \epsilon_\beta(g(x)) \qquad (3)$$

where $\beta$ represents the structuring element (normally a disc of radius 1). If $f$ is a color image, then $\delta(f(x))$ and $\epsilon(f(x))$ are color vectors, and the classical morphological color gradient of $f$ is given by the preceding formula using a conditional (also called lexicographic) vector ordering [8] [9] [10] [11]. The morphological texture gradient is defined with reference to the concept of texture feature images; this concept may refer to the local ASM images for the computation of an ASM texture gradient, and to the local contrast images for the computation of a contrast texture gradient, as previously presented. Each one of these texture feature images is referred to as $t_k$. The texture gradient of the image is finally defined in reference to the morphological gradient of each $t_k$:

$$Q_{tex}(f(x)) = \bigvee_{k \in K} [Q(t_k(x))] \qquad (4)$$

where $K$ is a set of vectors with different sizes and orientations (such as those presented in Table 1). The process for computing a morphological texture gradient
Table 1. Tested sets of vectors in co-occurrence matrices

Set   Nb of vectors   Vectors' sizes   Vectors' orientations
1     4               2                0, 45, 90, 135
2     6               2                0, 30, 60, 90, 120, 150
3     8               4                0, 30, 45, 60, 90, 120, 135, 150
4     10              4                0, 30, 37, 45, 60, 90, 120, 127, 135, 150
5     16              2-4              0, 30, 45, 60, 90, 120, 135, 150
6     8               8                0, 30, 45, 60, 90, 120, 135, 150
for both the contrast and the ASM measures is illustrated in figure 1. Other definitions may be applied for the computation of the texture gradient, such as a Euclidean gradient; however, the supremum gradient is simpler to compute and has been proved to be as efficient in [1]. It should be pointed out that both resulting texture gradients (ASM and contrast) are gray-level images.
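A possible NumPy/SciPy sketch of equations (3)-(4) follows; the 3x3 structuring element is our approximation of the disc of radius 1.

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def morph_gradient(img):
    """Residue of dilation and erosion (eq. 3); a 3x3 square approximates
    the disc of radius 1."""
    return (grey_dilation(img, size=(3, 3)).astype(float)
            - grey_erosion(img, size=(3, 3)))

def texture_gradient(feature_images):
    """Supremum, over k in K, of the morphological gradients of the texture
    feature images t_k (eq. 4); feature_images is the list of local ASM
    (or contrast) images, one per translation vector and color plane."""
    return np.maximum.reduce([morph_gradient(tk) for tk in feature_images])
```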
3 Structural Gradient Definition
In order to achieve a robust and reliable segmentation, it is very useful to use both texture and color information. The main idea is to produce a structural gradient by combining the texture and color gradients. The difficulty is that the color gradient is a color image while the texture gradient is a gray-level image. To solve this problem, the proposed method starts by decomposing the color gradient image $Q_{col}$ to obtain its three components $Q^R_{col}$, $Q^G_{col}$ and $Q^B_{col}$. In the next step, each component of the color gradient image is combined with the texture gradient image $Q_{tex}$, where $Q_{tex}$ may refer to either a contrast-based gradient or an ASM-based gradient. This operation produces three gray-level images $Q^R$, $Q^G$ and $Q^B$:

$$\begin{cases} Q^R = Q^R_{col} \otimes Q_{tex} \\ Q^G = Q^G_{col} \otimes Q_{tex} \\ Q^B = Q^B_{col} \otimes Q_{tex} \end{cases} \qquad (5)$$

$Q^R$, $Q^G$ and $Q^B$ can be interpreted as the color planes of a new color image, which is proposed to define the needed structural gradient. In other words, the structural gradient $Q$ is defined as a color image with $Q^R$ as the red component, $Q^G$ as the green component, and $Q^B$ as the blue component. This combination approach is suitable because not only is the color information preserved, but the texture information is also added to each color component. Indeed, if we assume that texture is not a color phenomenon, then it is supposed to affect all colors equally. Other strategies involving the fusion of the color gradient into a gray-level image have been discarded since they give very bad results. The operator $\otimes$ represents the combination of two gray-level images: a component of the color gradient image $q$ and the texture gradient image $r$. Three techniques are used:
Fig. 1. Illustration of the structural gradient computation process (left: texture gradient computation; right: structural gradient computation)
a fixed combination, an adaptive combination and a supremum combination. An illustration of this process is given in figure 1. Let $h$ be the output of the combination process, which is applied for each pixel.
3.1 Fixed Combination
The fixed combination is defined as a barycentric sum of the color and texture gradient images. It uses a global weighting coefficient referred to as $\alpha$:

$$h(p) = \alpha q(p) + (1 - \alpha) \, r(p) \qquad (6)$$
where $\alpha$ is a constant coefficient taking its value in $[0, 1]$. This combination technique is generally not suitable, because the coefficient $\alpha$ is constant for the entire image; indeed, one may need to give priority to color or texture according to their importance in the different zones of the image. Moreover, this technique requires manual adjustment of the coefficient $\alpha$ according to the content of the image.
3.2 Adaptive Combination
The proposed adaptive combination strategy uses a modular combination of texture and color gradients according to the content of the image. This implies two advantages: first, it gives priority to the most important information (color or texture) for a given pixel; second, it constitutes an automatic method, which can be applied to all types of images. This technique has been tested with a different texture gradient in [6], [7]. The adaptive combination is expressed as follows:

$$h(p) = \alpha_p q(p) + (1 - \alpha_p) \, r(p) \quad \text{with} \quad \alpha_p = \frac{q(p)}{q(p) + r(p)} \qquad (7)$$
$\alpha_p$ is a coefficient taking its value in $[0, 1]$. It is calculated for each pixel $p$ in order to give a high weight to the gradient that provides the most important information for that pixel: in other words, $\alpha_p$ is high if the information carried by $q$ is more important than that carried by $r$ (and vice versa).
3.3 Supremum Combination
Based on the same principle as the adaptive strategy, the supremum combination is slightly different: there is actually no blending at the pixel level. Indeed, for a given pixel, the modular gradient is either a copy of the color gradient or of the texture gradient, depending on which one of them provides the greatest amount of information (the supremum). This combination has the same advantages as the adaptive one:

$$h(p) = \begin{cases} q(p), & \text{if } q(p) \ge r(p) \\ r(p), & \text{if } q(p) < r(p) \end{cases}$$
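The three strategies reduce to a few array operations per color plane; the sketch below (our own naming) also shows how the structural gradient of section 3 would stack the combined planes.

```python
import numpy as np

def combine(q, r, mode="adaptive", alpha=0.7):
    """Pixel-wise combination of a color-gradient component q with the
    texture gradient r (eqs. 6-7 and the supremum rule)."""
    q = q.astype(float); r = r.astype(float)
    if mode == "fixed":                        # eq. (6): global weight
        return alpha * q + (1 - alpha) * r
    if mode == "adaptive":                     # eq. (7): per-pixel weight
        a = q / (q + r + 1e-12)
        return a * q + (1 - a) * r
    if mode == "supremum":                     # keep the strongest response
        return np.maximum(q, r)
    raise ValueError(mode)

# Structural gradient: combine each color plane with the texture gradient
# and stack the results back into a color image (Q_col: HxWx3, Q_tex: HxW).
# Q = np.dstack([combine(Q_col[..., c], Q_tex) for c in range(3)])
```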
4 Experimental Results
In this section, we evaluate the proposed approach for image segmentation by combination of color and texture information using eight images of the Berkeley Segmentation Dataset and Benchmark (BSDB). To analyze the different segmentation results, we use the Mean Square Error (MSE), which does not require a reference segmentation. The influence of the color space is also studied for six color spaces (RGB, HSL, I1I2I3, YCbCr, YCh1Ch2, L*a*b*), each one belonging to one of the six main color space families described in the literature [12]. Segmentation is achieved by performing a watershed algorithm [13] [14], which segments an image into homogeneous regions from a seeds image (markers) and a potential image (gradient). The seeds image corresponds to 5% of the local minimum values of the original image, while the gradient image corresponds to the structural gradient presented in the preceding section. For better visualization, we only present the results obtained with the best parameter settings (figure 2): on the top of the figure, average results for the best window sizes in the RGB color space are presented for each collection of Table 1; on the bottom, we illustrate the results corresponding to the best collection for each color space for the same window sizes. The different curves on the graphs correspond to several structural gradients computed from the contrast-based texture gradient. Results for the ASM-based texture gradient are not presented since they are not satisfactory enough. The worst segmentation results are obtained with a window of size 16 (except for the fixed combination with α = 0.7, for which the results are almost unchanged regardless of the window size); this is why they are not shown in figure 2. The proposed adaptive combination technique always reaches the second or third place, if not the first. On average, a fixed combination with α = 0.7 gives the best results, but we should not forget that this choice of α was found after a great number of experiments and is not guaranteed to work for a different set of images; this is not the case for the proposed adaptive method. In the RGB color space, the best results are obtained: 1/ for the proposed adaptive combination with a window size of 4 and collection 1 (figure 2, top-left); 2/ for the combination with α = 0.7, a window of size 6 and collection 5 (figure 2, top-right). To analyze the influence of the color space on the segmentation results, these two window sizes (4 and 6) are kept
Fig. 2. Evaluation of segmentation results in the RGB color space for all collections (top) and in different color spaces for the best collection (bottom), with a window size of 4 (left) and a window size of 6 (right).
Fig. 3. Segmented images with the best gradient combinations on the Berkeley Segmentation Dataset and Benchmark (top to bottom: initial image, segmented images with the α = 0.7 combination, segmented images with the adaptive combination)
Fig. 4. Evaluation of segmentation results in different color spaces for the best collection and window size (left), evaluation of segmentation results for previous works in different color spaces and with the best parameters (right)
and tested for all collections in Table 1. We can conclude that the best results are obtained with the proposed adaptive method (with a window of size 4) and with the α = 0.7 fixed combination (with a window of size 6) in the YCbCr color space. For the α = 0.7 fixed combination, the best results are obtained for collection 4, which emphasizes the importance of the choice of the vector set; this choice can vary according to the application. For the proposed adaptive combination, the best results are obtained for collection 2. Nevertheless, it is important to point out that the proposed adaptive method is also the best regardless of the collection chosen (and no parameter settings are required, as opposed to the fixed method). In figure 3, we illustrate the segmentation results in YCbCr for the two best combinations (i.e., α = 0.7 and adaptive). Even if the results are similar for the fourth image, we can notice that the proposed method provides better results for the four other images: for example, image 1 shows a better segmentation of the bird's parts; the seaweed and the diver are better extracted in image 2; the man's hair and the woman's arm are well segmented in image 3; and the surfboard is better extracted in image 5. The method has also been compared to previous works (which replace the proposed texture gradient computation with a morphological texture gradient, [6] [7]). The evaluation results for both methods are presented in figure 4. As we can see, both methods give their best results in the YCbCr color space; however, the previous morphological gradient only achieves an MSE of 1900 with the adaptive combination, while the present method achieves an MSE of 1800 for this combination. This beats the best results obtained by the older method, thus demonstrating the interest of the new approach.
5 Conclusion
A new segmentation technique based on the watershed algorithm, combining both color and texture information, has been proposed. This information is represented by two morphological gradients: a classical color gradient and a texture gradient based on co-occurrence matrix texture features. These two morphological gradients are then mixed using a gradient component fusion strategy.
The originality of the method is that it takes into account the local image content by automatically computing the weighting coefficients for color and texture. The classical technique uses a fixed combination and requires the choice of the collection and a manual adjustment of the weighting coefficients, which are global for all the pixels of the image; on the contrary, the proposed method does not require any parameter settings. For the BSDB, the best results are obtained with the proposed adaptive method in the YCbCr color space (with a window of size 4, which is a parameter for the calculation of the texture gradient). The segmentation results of the proposed method are very satisfactory compared with those obtained by other methods which, in addition, present the disadvantage of requiring a parameter adjustment. Perspectives concern the reduction of the execution time, in order to be able to use the proposed method in a real-time application. Future works will also focus on the computation of the texture gradient in different color spaces, as well as on the definition of a texture gradient as a color image instead of a gray-level image.
References

1. Angulo, J.: Morphological texture gradients. Application to colour+texture watershed segmentation. In: Proc. of the 8th International Symposium on Mathematical Morphology, October 2007, pp. 399–410 (2007)
2. Kruizinga, P., Petkov, N.: Nonlinear operator for oriented texture. IEEE Transactions on Image Processing 8(10), 1395–1407 (1999)
3. Haralick, R.M., Shanmugan, K., Dinstein, I.: Textural features for image classification. IEEE Trans. on Systems, Man, and Cybernetics 3(6), 610–621 (1973)
4. Zucker, S.W., Terzopoulos, D.: Finding structure in co-occurrence matrices for texture analysis. Computer Graphics and Image Processing 12, 286–308 (1980)
5. Baraldi, A., Parmiggiani, F.: An investigation of the textural characteristics associated with gray level cooccurrence matrix statistical parameters. IEEE Trans. on Geoscience and Remote Sensing 33(2) (March 1995)
6. Cohen, A., Meurie, C., Ruichek, Y., Marais, J., Flancquart, A.: Quantification of GNSS signals accuracy: an image segmentation method for estimating the percentage of sky. In: IEEE International Conference on Vehicular Electronics and Safety (ICVES), November 2009, pp. 40–45 (2009)
7. Meurie, C., Ruichek, Y., Cohen, A., Marais, J.: A hybrid and adaptive segmentation method using color and textural information. In: IS&T/SPIE Electronic Imaging 2010 - Image Processing: Machine Vision Applications III (January 2010)
8. Meurie, C.: Segmentation of color images by pixels classification and hierarchy of partitions. Ph.D. dissertation, University of Caen Basse-Normandie, Caen, France (October 2005)
9. Lezoray, O., Meurie, C., Elmoataz, A.: A graph approach to color mathematical morphology. In: IEEE Symposium on Signal Processing and Information Technology (ISSPIT), pp. 856–861 (2005)
10. Lezoray, O., Meurie, C., Elmoataz, A.: Graph-based ordering scheme for color image filtering. International Journal of Image and Graphics 8(3), 473–493 (2008)
11. Aptoula, E., Lefevre, S.: A comparative study on multivariate mathematical morphology. Pattern Recognition 40(11), 2914–2929 (2007)
12. Vandenbroucke, N.: Segmentation d'images par classification de pixels dans des espaces d'attributs colorimétriques adaptés. Ph.D. dissertation, University of Lille 1 (December 2000)
13. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 13(6), 583–598 (1991)
14. Beucher, S.: Watershed, hierarchical segmentation and waterfall algorithm. In: Mathematical Morphology and its Applications to Image and Signal Processing, pp. 69–76 (1994)
New Approach Based on Texture and Geometric Features for Text Detection

Hinde Anoual (1), Sanaa El Fkihi (1,2), Abdelilah Jilbab (1,3), and Driss Aboutajdine (1)

(1) LRIT, unité associée au CNRST, FSR, Mohammed V University Agdal, Morocco
[email protected], [email protected]
(2) ENSIAS, Mohammed V University Soussi, BP 713, Rabat, Morocco
[email protected]
(3) ENSET, Madinat AL Irfane, B.P 6207 Rabat-Instituts, Rabat, Morocco
[email protected]
[email protected] Abstract. Due to the huge amount of data carried by images, it is very important to detect and identify the text region as accurately as possible before performing any character recognition. In this paper we describe a text detection algorithm in complex background. It is based on texture and connected components analysis. First we abstract texture regions which usually contain text. Second, we segment the texture regions into suitable objects; the image is segmented into three classes. Finally, we analyze all connected components present in each binary image according to the three classes with the aim to remove non-text regions. Experiments on a benchmark database show the advantages of the new proposed method compared to another one. Especially, our method is insensitive to complex background, font size and color; and offers high precision (83%) and recall(73%) as well. Keywords: Text detection, text localization, feature extraction, texture analysis, geometric analysis.
1 Introduction
Text detection is defined as the task of localizing text in a complex background without recognizing individual characters. It is still an interesting research topic in many fields. The rationale behind this is the fact that, in a given image, the embedded texts are considered reliable sources of descriptive information, since they carry important information on the semantics of the image content. However, there are mainly three text detection challenges: complex background, size of characters, and multiple colors/font styles. To face these challenges, extensive efforts have been made to extract text from images. The existing approaches are based on the mainstream text characteristics, which are:

– Texture characteristics: The text region is considered as a texture region to be isolated from the rest of the image. There are many kinds of text texture characteristics, such as contrast and color homogeneity. Indeed, text must be readable, which is why the contrast of text is important and higher compared to other objects. Also, characters tend to have the same or similar
colors. We can find monochrome and polychrome characters in an image, but polychrome characters are related to artistic aspects more than to informative purposes, which is why most authors tend to discard them.
– Geometric aspects: Text can appear with different sizes. Its size can vary a lot, but this variability must remain reasonable; hence characters have size restrictions. In fact, a character is neither as large as the whole screen, nor smaller than a certain number of pixels. The minimal height and width allowing viewers to read the text are approximately 15 and 7 pixels, respectively. In addition, there is a relationship between neighboring characters: characters in a text have a uniform distance between them.

These two features depend on the type of text. Basically, there are two kinds of text: scene text and artificial text. Scene text appears as a part of the scene and was recorded with it, like text on T-shirts and names of streets, whereas artificial text is produced separately from the image and laid over it in a post-processing stage, like text in logos and text describing images. Here, we focus on both kinds of text. In this paper we introduce a new, efficient text detection technique. Our approach consists of three main steps: in the first stage we extract texture regions, while in the second one we segment these regions into suitable objects; in particular, the image is segmented into three classes. In the third step, we analyze all connected components present in each binary image according to the defined classes of the segmented zones. The remainder of this paper is organized as follows: in section 2 we present related work on text detection; section 3 details our new proposed approach; then, section 4 is devoted to experimental results obtained by comparing our approach to that of Y. Liu [1] on the ICDAR 2003 Robust Reading dataset [2]; finally, conclusions are drawn in section 5.
2 Related Works
Numerous text detection algorithms have been proposed in the past years, based on different text properties. The feature most widely used in the literature is the edge (gradient). For example, Wu et al. [3] propose an algorithm based on the image gradient produced by nine second-order Gaussian derivatives. Pixels that have a large gradient are considered as strokes; these strokes are then grouped into text blocks based on several empirical rules. Chen et al. [4] and Ye et al. [5] use the Canny edge feature and the morphological 'close' operation to detect candidate text blocks. In [6], Smith et al. detect text by finding text boxes, i.e., 'horizontal rectangle structures of clustered sharp edges', but only text of a certain font size can be detected. Zhong et al. [7] propose a classical text detection algorithm based on connected component analysis. They extract text as those connected components that follow certain size and horizontal alignment constraints. Jain and Yu [8] use connected component analysis as well. In their algorithm, they decompose the image into a multi-value map, from which colored connected components are selected. The latter are determined as text regions if they
are accepted by either of the following strategies: (1) inter-component features (geometry and connection properties), (2) projection profile features in both horizontal and vertical orientations. The probability of missing text is minimized at the cost of increasing false alarms. Texture analysis is also used to discriminate between text areas and non-text areas. Indeed, techniques based on Gabor filters, wavelets, the Fast Fourier Transform (FFT), spatial variance, etc. can detect the textural properties of a text region in an image. For example, in [9], Li et al. try to capture the texture by the mean and the second- and third-order central moments in the Haar-wavelet domain. Zhong et al. [10] detect text in the JPEG/MPEG compressed domain using texture features from DCT coefficients. They first detect blocks of high horizontal spatial intensity variation as text candidates, and then refine these candidates into regions using spatial constraints. The potential caption text regions are verified by the vertical spectrum energy, but the robustness of this method in complex backgrounds may not be satisfying, owing to the limitations of spatial-domain features. Kim et al. [11] and Li et al. [12] propose texture-based methods using support vector machines (SVM). Another method, based on color, texture and OCR statistical features to discriminate text from non-text patterns, is described in [13]. With the aim of dealing with the text detection problem, we propose to capitalize on the texture and geometric aspects of text regions. In the next section we describe our proposed approach.
3 Proposed Text Detection Approach
In this section, we present a new approach to the text detection problem. The proposed method is based mainly on texture and geometric features, and it consists of three steps: a texture analysis step, a segmentation step, and a geometry analysis step. In the following we detail each step.
3.1 Texture Analysis Step
The texture analysis step aims to extract texture regions and filter out smooth ones. In this step we divide a grayscale image Im into two kinds of regions: smooth and texture. We decompose the image into N macro blocks (MB); the MB size was set to 64x64 pixels (chosen experimentally). For each MB, the edge intensity is considered; we adopt the gradient magnitude to obtain the edge image. Let MB(X, Y) denote the MB at location (X, Y) in the image. The edge intensity of MB(X, Y) is calculated as:

MBI(X, Y) = \sum_{(i,j) \in MB(X,Y)} V(i, j)    (1)
where V(i, j) is the magnitude of the intensity gradient of I(i, j) at location (i, j), estimated using finite-extent convolution masks for the partial derivatives:

V(i, j) = \sqrt{ \left( \frac{dI(i,j)}{dx} \right)^2 + \left( \frac{dI(i,j)}{dy} \right)^2 }    (2)
Fig. 1. Color image and its texture analysis result
An MB is classified as a texture MB if its edge intensity is larger than a threshold Tv; otherwise, it belongs to the smooth regions. Tv is determined by:

T_v = \alpha \sum_{MB(X,Y) \in Im} MBI(X, Y) / N    (3)
where α is a weighting factor, whose value has been experimentally set to 0.75. Fig. 1 illustrates an input image and the output of our texture analysis process. The output is a gray-level image in which black regions represent smooth regions and the remaining regions represent texture regions.
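To make the step concrete, here is a minimal sketch of the macro-block classification of Eqs. 1-3 (not the authors' code; the helper name and the use of NumPy are our own assumptions):

```python
import numpy as np

def texture_blocks(im, block=64, alpha=0.75):
    """Classify each 64x64 macro block (MB) as texture or smooth (Eqs. 1-3)."""
    gy, gx = np.gradient(im.astype(float))
    v = np.sqrt(gx ** 2 + gy ** 2)                 # V(i,j), Eq. 2
    h, w = im.shape[0] // block, im.shape[1] // block
    # MBI(X,Y): summed edge intensity inside each macro block, Eq. 1
    mbi = v[:h * block, :w * block].reshape(h, block, w, block).sum(axis=(1, 3))
    tv = alpha * mbi.mean()                        # threshold Tv, Eq. 3
    return mbi > tv                                # True where the MB is textured
```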
3.2 Segmentation Step
The segmentation of texture regions can potentially improve the text detection results. Applying an enhancement technique to the texture regions can thus benefit region segmentation and improve the binarization before connected component analysis is employed. Fig. 2 describes the multi-segment decomposed image approach that we propose for segmenting the texture regions. In this approach, we apply a segmentation algorithm which classifies the pixels into K classes. Then, for each class, we generate a binary image by assuming that this class corresponds to the text class. Finally, we analyze the connected components present in each binary image separately. Hence, for K classes, we get K segmented images. The choice of K is an important and difficult issue. By running several experiments on our database, we noted that if K ≤ 2, a character may be merged with other objects around it, and if K ≥ 4 it may be divided into many components represented in different binary images. The optimal value of K is therefore 3. The Expectation-Maximization (EM) method is the segmentation method used here (see the sketch below). As shown in Fig. 2, each coherent region is labelled by a color to discriminate regions (yellow, blue and white). The black region of each image represents the smooth region, which we are not interested in. After that, we represent each coherent region in a single binary image (mask), which will be analyzed in the next step.
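The paper does not name an EM implementation; as an illustration only, the K = 3 pixel classification could be reproduced with a Gaussian mixture fitted by EM, e.g. with scikit-learn (the helper below and the pixel-level texture mask are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def em_masks(im, texture_mask, k=3):
    """Cluster texture-region pixels into k classes (EM) and return k binary masks."""
    pixels = im[texture_mask].reshape(-1, 1).astype(float)
    gm = GaussianMixture(n_components=k, random_state=0).fit(pixels)
    labels = gm.predict(pixels)
    masks = []
    for c in range(k):
        m = np.zeros(im.shape, dtype=bool)
        m[texture_mask] = labels == c      # one binary image (mask) per class
        masks.append(m)
    return masks
```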
3.3 Geometry Analysis Step
Text can appear in a single image with different textures; hence text components can be defined by different coherent regions. Consequently, the totality of a text may not be found in a single mask obtained in the previous step. For this reason, we propose to analyze the different components of each mask from the three
Fig. 2. Segmentation results of the multi-segment decomposed images
obtained ones. Thus, we capitalize on the fact that a text area is a connected component. Nevertheless, the connected components present in an image may include non-text objects, which in our case are false positive alarms. To deal with this, we propose to use a set of rules to discriminate text from non-text regions. The rules used are those found in many other published text detection methods, such as [5] and [14], but adapted to our case. These rules are related to the geometric and spatial properties of a text area in an image. First, we eliminate false positive alarms by using the geometrical properties of a given connected component i. Let Wi and Hi be the width and height of the possible text block labelled i. The aspect ratio (Ri) and the area (Ai) of the possible text block labelled i are given by:

R_i = W_i / H_i ;  A_i = W_i * H_i    (4)

The candidate text block is considered as a text block if:

0.1 ≤ R_i ≤ 2 ;  0.1 ≤ A_i    (5)
Second, we consider the geometric and spatial relationships between components. Given two neighboring blocks (xi; yi; Wi; Hi) and (xj; yj; Wj; Hj), the vicinity-based rules are defined by Eq. 6, Eq. 7 and Eq. 8, where (xi; yi) (resp. (xj; yj)) are the coordinates of the gravity center of the region labelled i (resp. j):

0.5 ≤ H_i / H_j ≤ 2    (6)

dist(x_i, x_j) ≤ 0.2 max(H_i, H_j)    (7)

dist(y_i, y_j) ≤ 2 max(W_i, W_j)    (8)

where dist(x, y) is the distance between x and y.
The first rule, given by Eq. 6, ensures that connected components have similar heights. The second rule (Eq. 7) states that the connected components are neighbors in the horizontal direction, while Eq. 8 ensures that the considered components belong to the same text line [15]. Finally, when all the rules above are verified, we obtain an image in which the characters are segmented.
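A compact sketch of the geometry analysis, applying rules (4)-(8) to the connected components of one mask (scikit-image is an assumed implementation choice, and the axis convention for the gravity centers follows the paper's notation):

```python
from skimage.measure import label, regionprops

def geometry_filter(mask):
    """Keep connected components satisfying Eqs. 4-8."""
    comps = []
    for p in regionprops(label(mask)):
        h = p.bbox[2] - p.bbox[0]
        w = p.bbox[3] - p.bbox[1]
        if 0.1 <= w / h <= 2 and w * h >= 0.1:        # Eqs. 4-5
            comps.append((p.centroid[0], p.centroid[1], w, h))
    keep = set()
    for i, (xi, yi, wi, hi) in enumerate(comps):
        for j, (xj, yj, wj, hj) in enumerate(comps):
            if (i != j
                    and 0.5 <= hi / hj <= 2                # Eq. 6
                    and abs(xi - xj) <= 0.2 * max(hi, hj)  # Eq. 7
                    and abs(yi - yj) <= 2 * max(wi, wj)):  # Eq. 8
                keep.add(i)      # component i has a compatible text neighbor
                break
    return [comps[i] for i in sorted(keep)]
```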
4 Experimental Results
We used the dataset made available on the occasion of the ICDAR 2003 Robust Reading Competition [2] to evaluate the performance of the new approach against that of Y. Liu et al. [1]. The latter is based on the gradient magnitude, connected components and wavelets. The benchmark database contains 529 images, which depict one or more text areas with different characteristics. Fig. 3 illustrates some text detection results: it presents the outputs of the two compared text detection systems. We notice that the detection of the text (red block) is accurate and effective using the new approach.
Fig. 3. Results of the compared text detection systems. Top: some inputs. Middle: outputs of our approach. Bottom: outputs of the other method.
For quantitative comparison, two metrics, recall and precision, are adopted to evaluate the performance of the two compared methods. The recall (R) is the number of correct estimates (C) divided by the number of ground-truth text regions (T), whereas the precision (P) is the ratio of the number of correct estimates (C) to the number of regions claimed (E). These metrics are defined by:

R = C / T ;  P = C / E    (9)
A high recall (R) denotes a superior ability to detect the relevant text regions, while a high precision (P) indicates a high rate of correctly detected text regions.
Fig. 4. The recall/precision curves of our method and of Y. Liu's method
Figure 4 gives the recall/precision curves of our method compared to Y. Liu's method [1]. The figure highlights that the new approach offers higher precision for all recall values. The precision rate is 83% (resp. 59%) and the recall rate is 73% (resp. 73%) for the new approach (resp. Y. Liu's method). In addition, precision and recall can be combined into a single standard measure of quality (f), given by Eq. 10. The relative weights of recall and precision are controlled by α, which we set to 0.5 in order to give equal weight to the two metrics:

f = 1 / (α/P + (1 − α)/R)    (10)

For our proposed approach, the value of f is equal to 0.78, while it is equal to 0.65 for the compared method. In terms of both qualitative and quantitative evaluations, we conclude that the new proposed approach is more efficient and robust than the other one.
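The reported numbers can be checked with a one-line implementation of Eq. 10 (a sketch; the values are the rates quoted above):

```python
def f_measure(p, r, alpha=0.5):
    """Combined quality f from precision P and recall R (Eq. 10)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

print(round(f_measure(0.83, 0.73), 2))  # 0.78, new approach
print(round(f_measure(0.59, 0.73), 2))  # 0.65, Y. Liu's method
```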
5 Conclusion
This paper describes an automatic text detection method for images with complex backgrounds. The proposed approach is based on the textured areas contained in an image: texts are considered as textured zones that have to be distinguished from smooth ones. A segmentation method is then applied to decompose the texture regions into different coherent regions. The latter are analyzed separately in order to select the candidate text regions, and the results are finally combined to define the text regions. Simulation results on the public benchmark ICDAR 2003 Robust Reading dataset show that our method detects characters efficiently in spite of the complexity of the image background. The precision and recall rates are 83% and 73%, respectively, and the standard measure of quality f is equal to 0.78. Consequently, the proposed method is accurate and effective. In future work, we propose to improve the approach by minimizing the number of holes in the case of false text detection alarms.
References
1. Liu, Y., Goto, S., Ikenaga, T.: A Contour-Based Robust Algorithm for Text Detection in Color Images. IEICE Transactions 89-D(3), 1221–1230 (2006)
2. http://algoval.essex.ac.uk/icdar/RobustReading.html
3. Wu, V., Manmatha, R., Riseman, E.M.: Finding text in images. In: DL 1997: Proceedings of the Second ACM International Conference on Digital Libraries, New York, NY, USA, pp. 3–12. ACM, New York (1997)
4. Chen, D.T., Bourlard, H., Thiran, J.-P.: Text identification in complex background using SVM. In: International Conference on Computer Vision and Pattern Recognition 2001, pp. 621–626 (2001)
5. Ye, Q., Gao, W., Wang, W., Zeng, W.: A robust text detection algorithm in images and video frames. In: Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and the Pacific-Rim Conference on Multimedia, Singapore (2003)
6. Smith, M.A., Kanade, T.: Video skimming for quick browsing based on audio and image characterization. Carnegie Mellon University, Pittsburgh, PA, Technical Report CMU-CS-95-186 (July 1995)
7. Zhong, Y., Karu, K., Jain, A.K.: Locating text in complex color images. In: ICDAR '95: Proceedings of the Third International Conference on Document Analysis and Recognition, Washington, DC, USA, vol. 1, p. 146. IEEE Computer Society, Los Alamitos (1995)
8. Jain, A.K., Yu, B.: Automatic text location in images and video frames. Pattern Recognition 31, 2055–2076 (1998)
9. Li, H., Doermann, D., Kia, O.: Automatic text detection and tracking in digital video. IEEE Transactions on Image Processing 9(1) (2000)
10. Zhong, Y., Zhang, H., Jain, A.K.: Automatic caption localization in compressed video. IEEE Trans. Pattern Anal. Mach. Intell. 22(4), 385–392 (2000)
11. Kim, K.I., Jung, K., Kim, J.H.: Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Transactions on PAMI 25, 1631–1639 (2003)
12. Li, X., Wang, W., Jiang, S., Huang, Q., Gao, W.: Fast and effective text detection. In: ICIP, pp. 969–972. IEEE, Los Alamitos (2008)
13. Ye, Q., Jiao, J., Huang, J., Yu, H.: Text detection and restoration in natural scene images. J. Vis. Commun. Image Represent. 18(6), 504–513 (2007)
14. Ezaki, N., Bulacu, M., Schomaker, L.: Text detection from natural scene images: towards a system for visually impaired persons. In: ICPR (2), pp. 683–686 (2004)
15. Hanif, S., Prevost, L.: Text detection and localization in complex scene images using constrained AdaBoost algorithm, pp. 1–5 (2009)
Segmentation of Images of Lead Free Solder

Matthias Scheller Lichtenauer, Silvania Avelar, and Grzegorz Toporek

Swiss Federal Laboratories for Materials Testing and Research (EMPA), Laboratory of Media Technology, Ueberlandstrasse 129, 8600 Duebendorf, Switzerland
{matthias.scheller,silvania.avelar}@empa.ch
Abstract. We present two approaches to segment metallic phases in images of lead free solder joints. We compare the results of a method that requires no user interaction with those of a method that extrapolates information from a relatively small set of user-labeled pixels. The segmented images provide statistical data on the spatial characteristics of phases, to serve as input to numerical models of solder joints.
1 Introduction
Lead has been replaced in the last decade as a joining material in soldering due to its toxic properties. The relatively new lead free soldering can still be improved. Some properties sought for are a low melting point and mechanical properties like resistance to tear stress and tensile strength. Data about the size, form and spatial distribution of regions of homogeneous materials (phases) is needed for numerical models that predict mechanical properties at the scale of solder balls (100–1000 µm) in contemporary electronic devices [1]. Methods to segment images of solder joints into regions containing specific materials are required in order to create statistical data on the spatial distribution of phases. In materials science, related work has recently appeared. Sidhu and Chawla [2] described the use of numerical models to calculate solder deformation behavior. Erinc et al. [3] analyzed fatigue fracture of solder joints using images. Kang et al. [4] investigated soldering as a process in which the size and form of phases can vary widely with cooling rates. Suitable segmentation methods for images of lead free solder have to be adaptive to cope with this variety of size and form of phases. There are three major challenges in detecting phases in images of solder. First, the appearance of a phase in the probe depends on probe preparation. Second, the imaging system may map the probe unequally to an image, due to depth of focus or to the planarity of the probe. Third, methods to determine the chemical composition at a given spot have limitations in space, be it resolution or field of view, or limitations in the time needed to sample the probe. There is a risk of inaccuracy when focussing on regions that are not representative enough, or because two phases look the same in an image although they are chemically different. We aim to segment images of lead free solder joints such that the segmented regions correspond to phases as human experts identify them. We combine light microscope images with different illumination and imaging methods.
We apply two segmentation approaches and compare their results, namely hierarchical segmentation, as used by Ibrahim et al. [5], and the learning methodology of support vector machines (SVM) [6,7]. Hierarchical or decision tree methods partition the set of pixels to be segmented with each step, starting either with the pixels that can be reliably segmented or by reducing the input size of computationally expensive steps to regions of interest. The SVM method maps the data to a higher dimensional feature space and constructs an optimal separation hyperplane between two classes in this space [6]. Both approaches are used to classify the image data into categories, which leads to the segmentation of phase regions in the image [8]. We present the image capturing process of solder probes in the next section. Section 3 discusses the segmentation approaches. A comparison of the segmentation results is shown in Section 4. We give our conclusions in Section 5.
2 Image Capturing
We used a Leica DC500 camera on a Leica DMRX light microscope to take images of solder probes. The parameters of this system are listed in Table 1. We corrected the evenness of illumination with the built-in function of Leica Firecam version 1.7.1 on Mac, which was also used for image capturing. The different resolution modes of the camera and the magnification factors of the light microscope objectives allowed us to choose between alternative configurations with approximately the same pixel sampling interval. The lateral resolution of the optical system (λ/2NA) is calculated at λ = 380 nm. The respective depth of focus δz is estimated at the same wavelength [9]. The structures we visualize are at the scale of 1 µm, requiring at least objective 10× to achieve the Nyquist sampling frequency limit. We took microscopic images of metallic probes constituted of Cu joined with Sn solder that was reinforced with 4 wt.% Ag and 0.5 wt.% Cu. We will refer to this composition as SAC405. The knowledge of how phases look is obtained by scanning electron microscopy (SEM) and energy-dispersive spectroscopy (EDS/EDX). In most of the light microscope images human experts were able to visually distinguish five categories, namely copper, eutectic, βSn, void, and intermetallics (Cu6Sn5, Cu3Sn and Ag3Sn).

Table 1. Areas visible in the optical microscope and pixel sampling interval on the probe for the resolution modes of the camera. NA stands for numerical aperture.

Objective   NA     min δz     λ/2NA      Field of View       3900 px      2600 px      1300 px
10×         0.20   4.75 µm    1.80 µm    1.360 × 1.057 mm    352 nm/px    528 nm/px    1.06 µm/px
20×         0.45   0.88 µm    0.81 µm    0.680 × 0.529 mm    176 nm/px    264 nm/px    528 nm/px
40×         0.60   0.47 µm    0.61 µm    0.340 × 0.270 mm    89 nm/px     133 nm/px    266 nm/px
50×         0.86   0.19 µm    0.44 µm    0.272 × 0.215 mm    70 nm/px     106 nm/px    212 nm/px
100×        0.90   0.16 µm    0.42 µm    0.136 × 0.108 mm    35 nm/px     52 nm/px     104 nm/px
[Figure 1: left, a raw probe with illumination not corrected; right, a polished probe with corrected illumination]

Fig. 1. Visible differences due to polishing, etching and illumination correction

[Figure 2: plot of the surface height profile (in micrometers) over a 0–1000 µm lateral range, with the depth of field of objective 10× (NA = 0.20) and objective 20× (NA = 0.45) overlaid]

Fig. 2. Surface height profile compared to depth of focus of the light microscope. Both dimensions are in µm. With objective 20× or higher, some parts of the image will be out of focus even if the sample is optimally aligned.
To improve the visibility of phases, samples were polished and then etched for a few seconds. Etching creates a relief. The edges are visible in dark field illumination, but the height differences increase the depth of focus needed. There are known tradeoffs between magnification and depth of focus in light microscopy. Figure 1 shows on the right side an example of an image in bright field illumination taken with objective 50×, covering a visible field of 272 × 215 µm. The image is focussed on the interface between the soldered copper and the solder, while the rest of the solder appears blurred. Surface line profile measurements of two probes using an Ambios XP-1 profilometer at different locations revealed that the step height between the copper and the solder is about 1 to 2 µm after polishing and etching, while the solder surface height varies in the range of ±1 µm, which is more than the light microscope can keep in focus with objectives magnifying more than 10× (Figure 2). We spectrally measured the reflectance of all different phases in a probe with a Leica MPV-SP microspectrometer mounted on a Leica DMRX optical microscope. Except for copper, the phases show almost no differences in hue or saturation, some not even in brightness. The reflectance of an X-Rite white standard is
[Figure 3: reflection spectra of the phases (white standard, βSn, eutectic, Cu6Sn5, Cu3Sn, Ag3Sn) plotted over wavelengths 400–700 nm]
Fig. 3. Reflection measured with Leica MPV-SP microspectrometer. Intensity values are in relative units of this microspectrometer, as reflectance of the phases exceeds reflectance of the X-Rite white standard.
way below that of all phases, supporting our assumption that the reflection of the phases was mostly specular and the surfaces were mostly oriented parallel to the image plane (Figure 3). In optical microscopy using dark field illumination, smooth surfaces parallel to the image plane appear dark, whereas rough surfaces or slopes appear bright. We compared bright field and dark field images from the same field of view with regard to surface relief. Although the reflectance spectra of the eutectic and the Ag3Sn intermetallic phase did not differ much, they can be visually distinguished in images by their different surface relief in dark field illumination and their brightness variation in bright field illumination. Hence, we provided paired bright field and dark field images (Figure 4).
3 Image Segmentation
In order to generate good statistical data, we had to segment enough unconnected components of each phase. For users, interactively labeling each of the components is tedious and error prone. We therefore looked for segmentation approaches able to identify phases with minimal user interaction and able to extrapolate information from a small sample set, even to several different images of a probe.
[Figure 4: left, bright field illumination; right, dark field illumination]
Fig. 4. Paired light microscope images of a SAC405 probe
As input for our segmentation methods, we provided two paired images in bright and dark field illumination showing exactly the same field of view, sampled with 2600 × 2060 pixels. For the comparison of results, we also provided the positions of 250 pixels to be labeled by the two methods, and we gathered the labels of the selected pixels for the evaluation of results. To gather those 250 ground truth points, an image of a probe taken with bright field illumination was displayed on a computer screen, and human experts labeled pixels by choosing one of the five phases and clicking on its pixels. Since some phases were quite rare, we decided against randomly selecting the pixels to be labeled and took care that there was an equal number of sample points for all phases (Figure 5). The decision tree approach calculates, by thresholding methods, a set of masks for flatness (gray level of the dark field image), redness (difference of red and blue channels), and brightness (gray level of the bright field image). The combination of these masks is used to segment the phases. For the support vector machine, local features of the image are extracted from individual image channels by means of a sliding window. The brightness histogram [10] and the output of a bank of Gabor filters [11] are calculated at each window position to extract texture information. Features are obtained by computing the mean of the output in the frequency domain and filtering the image texture with 4 orientations and 4 scales. We used the SVM with a radial basis function (RBF) kernel in our experiment. For training the SVM, we provided another 250 labeled pixels of the five phases. The image in bright field illumination was tested to look for patterns.
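A sketch of the SVM feature extraction and training described above; scikit-image/scikit-learn and the window size, histogram bin count and Gabor frequency spacing are assumptions, as the paper does not fix them:

```python
import numpy as np
from skimage.filters import gabor
from sklearn.svm import SVC

def window_features(win, n_orient=4, n_scale=4, bins=16):
    """Brightness histogram plus mean Gabor magnitudes (4 orientations x 4 scales)."""
    feats = list(np.histogram(win, bins=bins, range=(0.0, 1.0))[0] / win.size)
    for s in range(n_scale):
        freq = 0.25 / (2 ** s)                         # assumed scale spacing
        for o in range(n_orient):
            re, im = gabor(win, frequency=freq, theta=o * np.pi / n_orient)
            feats.append(float(np.sqrt(re ** 2 + im ** 2).mean()))
    return np.array(feats)

# RBF-kernel SVM trained on features of windows around the 250 labeled pixels:
clf = SVC(kernel="rbf", gamma="scale")
# clf.fit(X_train, y_train)   # X_train: (250, n_features), y_train: phase labels
```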
4 Results
The outputs of the two segmentation approaches on the images were compared to the ground truth. The hierarchical approach, based on a pixelwise thresholding technique, achieves hit rates between 72% and 96%. The recognition of the copper substrate is reliable enough to estimate the form and thickness of the interfacial intermetallic layer with the help of a distance transform. Using SVM, the prediction of phases for our ground truth samples has an accuracy of 83.2% (208/250 points). Misclassified phases were mostly near the boundary of different phases and in some areas of similar texture from copper and βSn. SVM performed better in distinguishing the eutectic and intermetallic phases, but worse in the recognition of copper (Figure 6). We make here some runtime considerations, assuming a pixel resolution of 1 µm². The decision tree classifier is able to segment more than 2600 × 2060 pixels a minute on a 2.8 GHz Intel Core2 Duo. The extraction of image features for the SVM classification is more computationally intensive than the classification itself. It takes about five minutes to take the two pictures in dark field and bright field once the probe is etched and polished. Using SEM, the same polishing and etching technique could be used, so the time for probe preparation would not differ. But it would take at least about 100 hours to sample the same surface
[Figure 5: ground truth points overlaid on the bright field image, with zoomed examples of the five phases (copper, βSn, eutectic, intermetallic, void)]

Fig. 5. Ground truth. Zoomed examples are adjusted to increase visibility in print.

Decision tree
Ground truth          Cu   Eu   Im   Vo   Sn   Hit (%)
copper        (Cu)    45    2    3    0    0     90
eutectic      (Eu)     0   36    0    1   13     72
intermetallic (Im)     0    3   38    0    9     76
void          (Vo)     0    0    2   48    0     96
βSn           (Sn)     0    0    3    0   47     94
Total                 45   41   46   49   69

SVM
Ground truth          Cu   Eu   Im   Vo   Sn   Hit (%)
copper        (Cu)    32    1    5    5    7     64
eutectic      (Eu)     1   45    0    3    1     90
intermetallic (Im)     1    2   43    2    2     86
void          (Vo)     0    2    0   48    0     96
βSn           (Sn)     6    0    3    1   40     80
Total                 40   50   51   59   50

Fig. 6. Phase confusion resulting from the two segmentation approaches
at one sample per µm² with SEM using energy dispersive spectroscopy (EDS), although the EDS technique would yield more reliable results. Using EDS to check phase attribution is feasible in acceptable time, since the areas to check are smaller.
5 Conclusions and Future Research
We presented two approaches to segment phases in images of solder joints taken with light microscopy. These segmentation methods can be used to generate statistical data as input for numerical methods, limited by the accuracy and resolution needed. The results can complement information from scanning electron microscopy using energy dispersive spectroscopy. Scanning electron and atomic force microscopy (AFM) could also be combined as imaging methods. The SVM method was mostly based on texture features. If texture cannot be captured by the imaging methods due to focus problems or probe preparation, the prediction accuracy decreases. Ground truth points have to be chosen in regions with precisely known phases and near the boundary of different phases. The use of new combinations of features, as well as an increased training data size, needs to be further investigated in order to improve the SVM prediction accuracy for different images of a probe. The decision tree method used the color information contained in the bright field image as well as the flatness information available in the dark field image, but no texture information as the SVM did. A combination of both feature sets could lead to better results. Since SEM images are grayscale and the feature set extracted for use in the SVM relies on grayscale information of the light microscope images only, a combination of SEM with light microscope images could also profit from the additional information of color and flatness. In order to obtain spatial characteristics of phases, we also aim to construct a three-dimensional model of the phases. It is possible to obtain a 3D model from a sequence of 2D cuts or to estimate it from one single 2D cut. Regardless of which method we choose, the quality of the results depends on the quality of the segmentation. Future work includes the combination of image data with spatially correlated topographical measurements of a probe as well as the construction of three-dimensional structures from two-dimensional cuts.
References
1. Sivasubramaniam, V., Galli, M., Cugnoni, J., Janczak-Rusch, J., Botsis, J.: A study of the shear response of a lead-free composite solder by experimental and homogenization techniques. Journal of Electronic Materials 38(10), 2122–2131 (2009)
2. Sidhu, R.S., Chawla, N.: Three-dimensional (3D) visualization and microstructure-based modeling of deformation in a Sn-rich solder. Scripta Materialia 54(9), 1627–1631 (2006)
3. Erinc, M., Assman, T., Schreurs, P., Geers, M.: Fatigue fracture of SnAgCu solder joints by microstructural modeling. International Journal of Fracture 152(1), 37–49 (2008)
4. Kang, S.K., Choi, W.K., Shih, D., Henderson, D.W., Gosselin, T., Sarkhel, A., Goldsmith, C., Puttlitz, K.J.: Ag3Sn plate formation in the solidification of near-ternary eutectic Sn-Ag-Cu. JOM Journal of the Minerals, Metals and Materials Society 55(6), 61–65 (2003)
5. Ibrahim, A., Tominaga, S., Horiuchi, T.: Material classification for printed circuit boards by spectral imaging system. In: Trémeau, A., Schettini, R., Tominaga, S. (eds.) CCIW 2009. LNCS, vol. 5646, pp. 216–225. Springer, Heidelberg (2009)
6. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
7. Bruzzone, L., Carlin, L., Melgani, F.: A multilevel hierarchical approach to classification of high spatial resolution images with support vector machines. In: Proceedings of IGARSS, IEEE International Geoscience and Remote Sensing Symposium, pp. 540–543 (2004)
8. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice-Hall, Englewood Cliffs (1989)
9. Young, I.T., Zagers, R., van Vliet, L.J., Mullikin, J., Boddeke, F., Netten, H.: Depth-of-focus in microscopy. In: Proc. 8th Scandinavian Conference on Image Analysis, SCIA, Tromso, Norway, pp. 493–498 (1993)
10. Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5) (1999)
11. Jian, M., Guo, H., Liu, L.: Texture image classification using visual perception texture features and Gabor wavelet features. Journal of Computers 4(8), 763–770 (2009)
Grassland Species Characterization for Plant Family Discrimination by Image Processing

Mohamed Abadi1, Anne-Sophie Capelle-Laizé1, Majdi Khoudeir1, Didier Combes2, and Serge Carré2

1 Université de Poitiers, XLIM-SIC, BP 30179, 86962 Futuroscope-Chasseneuil {abadi,khoudeir,capelle}@sic.sp2mi.univ-poitiers.fr
2 INRA Poitou-Charentes, UR4 P3F, BP 6, 86600 Lusignan, France {dcombes,scarre}@lusignan.inra.fr
Abstract. Pasture species belonging to the poaceae and fabaceae families constitute essential elements for maintaining natural and cultivated regions. Their balance and productivity are key factors for the good functioning of grassland ecosystems. This study is based on image processing. First, an individual signature is defined from the geometric characteristics of each family. This signature is then used to discriminate between the families. Our approach focuses on the use of shape features in different situations. Specifically, it is based on cutting the representative leaves of each plant family; after cutting, we obtain leaf sections of different sizes and random geometry, on which the shape features are computed. Principal component analysis is used to select the most discriminatory features. The results will be used to optimize the acquisition conditions. We obtain a discrimination rate of more than 90% for the experiments carried out in a controlled environment. Experiments are under way to extend this study to natural environments. Keywords: shape features, plant classification, leaf recognition, pasture, poaceae and fabaceae family, image processing.
1 Introduction
The areas occupied by temporary or permanent grasslands are now declining due to increasing woodland and their transformation into arable land [1]. The balance and productivity of these pasture species are key factors for the functioning of our grassland ecosystems. They provide a specific and significant contribution to biodiversity and play a major environmental role. Identifying each family makes it possible to measure its proportion and thus to understand the spatial evolution of these families. This supports a certain balance between families, which is essential to the sustainability and agricultural value of temporary grassland. Moreover, the effects of light competition between the different components of these vegetation covers have been the subject of many studies [2–5].
In the general context of the characterization and identification of grassland species, the main objective of this study is to discriminate between grassland families (poaceae and fabaceae). To understand the evolution of these species within populations and to assist enduring environment management, the National Institute for Agricultural Research (INRA) of Lusignan (France) has established experimental plots combining representative species of these two families. The idea is to have indicators to quantify the rate of the different families within mixed grassland. Image analysis is effective and less costly in terms of manpower than classical procedures. The species were cultivated separately, which allows acquisitions in a controlled environment and thus facilitates the processing and analysis phase. Samples of five species were photographed (white clover, alfalfa, orchardgrass, tall fescue and ryegrass). This work allows an individual characterization of each species based on the study of its geometric shape, which depends on the acquisition quality and the pre-processing performed [6]. We explain our choice by the fact that humans easily identify objects by their shape [7], and features that describe shape are widely used to classify plants by characterizing their leaves [8]. For example, Im et al. [9] used a polygon approximation method to describe the geometric shape of leaves, which helps to recognize the Acer family. Wang et al. [10] developed a method that combines different shape features computed on leaves to identify plants in an image database. The rest of the paper is organized in the following manner. Section 2 describes our acquisition process, the pre-processing developed to extract the region of interest, and the procedure adopted to estimate the limits of shape features and their robustness. The techniques to characterize the shape features and the classification algorithms are briefly described in Section 3. In Section 4, we present the results obtained for the poaceae and fabaceae plant families. Section 5 presents the conclusions and perspectives of this study.
2 Image Characteristics and Pre-processing
In this section, we present the leaves used to describe the poaceae and fabaceae families, on which we apply our approach.
2.1 Acquisition Process
To build the image database of these two families, a number of sets composed of different cultures were cultivated at INRA Lusignan under controlled conditions. These sets comprise several samples, one sample representing each isolated plant (Figure 1, left). The leaf images were acquired for each species in the XLIM-SIC laboratory. The acquisitions were carried out in the visible domain under controlled lighting, capturing detailed leaves of these species with a NIKON D100 camera (Figure 1, center). This camera produces images in raw format (i.e. NEF: a format without pre-treatment), which allows exploring all color components (R, g1, g2 and B), unlike other formats (e.g. tiff, jpeg).
Our database includes 5 species (white clover, alfalfa, orchardgrass, tall fescue and ryegrass) belonging to the poaceae and fabaceae families (Figure 2). For each species, we took 20 samples, so in total we have 180 images: 60 leaves represent the poaceae family (orchardgrass, tall fescue and rye-grass) and 120 leaves represent the fabaceae family (white clover and alfalfa). Each leaf belonging to the fabaceae family is composed of 3 leaflets; we applied specific treatments based on morphological operators (erosion followed by dilation) to separate them. This increased the number of processed leaves and consequently improved the robustness of the sought signature.
2.2 Images Pre-processing
Through these acquisitions, we obtain images with 4 color components at a resolution of 1012 × 1518. Many authors have shown that effective processing of color images, and segmentation in particular, requires an adequate color space [11–14]. The goal of our pre-processing is to extract the area of interest. For this, we propose to determine, for a given image and for each class, the optimal color component combination. We used a criterion which, at every iteration, maximizes the separation between the pixels belonging to the two classes. The general scheme of our algorithm is presented in Figure 1 (right). The optimal combination is found by an iterative procedure. At each iteration the procedure selects a new color combination, associated with two classes previously identified by a thresholding method. The goal is to define the most discriminating hybrid color combination; for this we use the highest discriminant power. The corresponding regions are then labelled and used in the next iteration to extract color component data for all other regions. Finally, the procedure is stopped after the iteration n defined by the expert. The advantage of this approach is that it provides the optimal color combination for each class identified in the image. It provides an effective solution to the problem of correlation between color components and ensures a better discriminant power compared to treatments applied in classical and/or hybrid color spaces. This pre-processing yields an intermediate version of the images (binary images) in which the leaves are well localized, so that the results can be used for further processing (i.e. shape feature extraction).
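The selection criterion is not given in closed form in the paper; one plausible reading, sketched below under that assumption, ranks candidate color-component combinations by Otsu's between-class variance inside the region still to be segmented (the function name and the candidate set are hypothetical):

```python
import numpy as np
from skimage.filters import threshold_otsu

def best_color_combination(candidates, region):
    """Pick the candidate component image with the highest two-class separability.
    candidates: dict name -> 2-D array (e.g. single channels, an R - B difference).
    region: boolean mask of the pixels not yet assigned to a class."""
    best_name, best_sep = None, -np.inf
    for name, img in candidates.items():
        vals = img[region].astype(float)
        t = threshold_otsu(vals)
        lo, hi = vals[vals <= t], vals[vals > t]
        if lo.size and hi.size:
            w0, w1 = lo.size / vals.size, hi.size / vals.size
            sep = w0 * w1 * (lo.mean() - hi.mean()) ** 2   # between-class variance
            if sep > best_sep:
                best_name, best_sep = name, sep
    return best_name
```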
3 Discriminative Signatures for Plant Families
To characterize the two families, high resolution images are now available. To process them, we developed a procedure consisting of three steps:
– extract the regions of interest (leaf localization) (Section 2.2),
– characterize these regions by computing shape features that describe their geometry,
– apply classification algorithms to these features to automatically discriminate between families.
3.1 Features Used
The morphological characteristics of the leaves of these plants are a relevant recognition parameter, used in plant taxonomy. Indeed, some recent work has focused on the characterization of leaves for plant recognition. Here we are particularly interested in approaches that exploit the geometric shape. We chose to compute the digital morphological features (DMF) because they are the most conventional and widely used, and are considered the most important for characterizing the leaves of a plant. Moreover, the geometric shape of the leaf is sub-oval (fabaceae) or highly elongated (poaceae). Several features have been studied; for reasons of feature robustness we have chosen to present only the results using the DMF. They are extracted from the contours of the leaf. These features generally include geometrical features (GF) [15, 16] and invariant moment features (MF) [7, 15, 17], which are robust and invariant under translation, rotation and scaling.
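As an illustration of the moment part of the DMF, the seven Hu invariants of a binary leaf (or leaf section) can be computed with OpenCV (an assumed implementation; the paper's tables report the values before normalization):

```python
import cv2
import numpy as np

def hu_moments(binary_leaf):
    """Seven Hu invariant moments of a binary leaf mask (values as in Tables 1-2
    are unnormalized; log scaling, commented out, is a common normalization)."""
    m = cv2.moments(binary_leaf.astype(np.uint8), binaryImage=True)
    hu = cv2.HuMoments(m).flatten()
    # hu = np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # optional normalization
    return hu
```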
3.2 Classification Algorithm
Classification is the final step of our processing. The selection of discriminant features makes the choice of classification method relatively less important. In this study, we apply the LDC (normal-densities-based linear (multi-class) classifier), QDC (normal-densities-based quadratic (multi-class) classifier) and UDC (uncorrelated normal-densities-based quadratic classifier) algorithms, as defined in the PRTools toolbox version 3.0 developed by Duin [18]. These methods were chosen for their fast convergence, their supervised mode and their suitability for features having a Gaussian probability density. The samples are partitioned into training and test sets: a pattern recognition model based on these 3 classifiers is built on the training set and then used to estimate the error over the test set.
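PRTools is a MATLAB toolbox; as a rough illustration only, scikit-learn offers close analogues of the three normal-density classifiers (LDC as linear discriminant analysis, QDC as quadratic discriminant analysis, UDC as a Gaussian naive Bayes with uncorrelated, i.e. diagonal, covariances):

```python
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

def test_errors(X, y):
    """Estimate the test error of the three normal-density classifiers."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    classifiers = {"LDC": LinearDiscriminantAnalysis(),
                   "QDC": QuadraticDiscriminantAnalysis(),
                   "UDC": GaussianNB()}
    return {k: 1.0 - c.fit(Xtr, ytr).score(Xte, yte)
            for k, c in classifiers.items()}
```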
3.3 Treatment Approach
In order to study the limits and robustness of techniques that characterize the geometry of plant leaves, we proceed by cutting the leaf images into regions of decreasing size (a cutting sketch follows the list below). This choice is justified by the fact that this study fits into the problem of recognizing species within a family. Indeed, in a real environment, plants are mixed and access to a whole leaf is not the most frequent case. Cutting the leaves therefore allows a better evaluation of the quality and discrimination capacity of the selected features while simulating real cases. The size and geometry of these regions change with the cutting used. For each leaf (Figure 3), we applied 3 cuttings (Figures 4, 5 and 6), composing 3 sets of data:
– Set 1: the leaves are cut into 4 regions (Figure 4),
– Set 2: the leaves are cut into 6 regions (Figure 5),
– Set 3: the leaves are cut into 8 regions (Figure 6).
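The exact cutting geometry is only partly specified by the section sizes reported in Figs. 4-6; under the assumption of equal cuts along one image axis, a sketch is:

```python
import numpy as np

def cut_leaf(mask, n_cuts, axis=1):
    """Cut a binary leaf image into equal sections along one axis
    (the cutting direction is an assumption, cf. the sizes in Figs. 4-6)."""
    return np.array_split(mask, n_cuts, axis=axis)
```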
Fig. 1. Left: acquisition system and example of the plants used. Right: general scheme of the proposed leaf extraction.
Fig. 2. Examples of plant leaves representative of the 5 species studied in this paper. Left to right: poaceae family (orchardgrass, tall fescue and rye-grass) and fabaceae family (white clover and alfalfa).
Fig. 3. Leaves after pre-treatment. Left to right: white clover (437 × 398) and orchardgrass (1024 × 617).

Fig. 4. Leaves cut into 4 sections (s1–s4). Top to bottom, left to right: white clover (437 × 199) and orchardgrass (1024 × 308).

Fig. 5. Leaves cut into 6 sections. Top to bottom, left to right: white clover (437 × 132) and orchardgrass (1024 × 205).

Fig. 6. Leaves cut into 8 sections. Top to bottom, left to right: white clover (437 × 99) and orchardgrass (1024 × 154).
Table 1. Hu invariant moment values computed on the samples of Figure 3

      White clover   orchardgrass
φ1        0.2190         1.3724
φ2        0.1024         0.2276
φ3        0.0353         0.1586
φ4        0.0220         0.1840
φ5        0.0247        −0.1745
φ6        0.0307         0.1940
φ7       −0.0178        −0.1530
Table 2. Hu invariant moment values computed on the samples of Figure 4

      White clover                          orchardgrass
        s1       s2       s3       s4        s1       s2       s3       s4
φ1    1.5786   1.5673   1.3937   1.2854    0.2714   0.2778   0.2441   0.2311
φ2    1.4275   1.4054   1.2271   1.0876    0.1752   0.1490   0.0824   0.0977
φ3    0.7799   0.7896   0.6520   0.6889    0.0467   0.0889   0.0191   0.0544
φ4    0.7209   0.6971   0.5733   0.6679    0.0226   0.0681   0.0145   0.0376
φ5    0.7312   0.7160   0.5867   0.6704    0.0224   0.0727   0.0151   0.0413
φ6    0.8308   0.8085   0.6631   0.7336    0.0314   0.0737   0.0223   0.0458
φ7   −0.5849  −0.5620   0.4899   0.5210    0.0269  −0.0479   0.0140  −0.0240
Fig. 7. Classification error for three examples of cutting and classifiers
In each region, geometrical features (14 features) and features based on the Hu invariant moments (7 features) are computed. Principal component analysis is then used to select only the most discriminative features; these constitute the data processed by the classification algorithms defined previously.
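The PCA selection step could look as follows (scikit-learn and the retained-variance cutoff are assumptions):

```python
from sklearn.decomposition import PCA

# X: (n_sections, 21) matrix of the 14 geometric + 7 Hu-moment features
pca = PCA(n_components=0.95)        # keep 95% of the variance (assumed cutoff)
# X_reduced = pca.fit_transform(X)  # input to the LDC/QDC/UDC classifiers
```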
4 Results
Here we present the results for the three cuttings that were used. Figure 7 presents the classification results on the data sets constituting the test set for the three classifiers. Note that the LDC classifier returns a low classification error rate (0.04) for set 3, unlike the other classifiers. This rate remained stable despite changes in the size of the training set (samples ≥ 100). Similarly, the results obtained by the same classifier are very good for the other sets despite small training set sizes (samples ≥ 30), showing a strong potential for discriminating leaves. In general, the experiments carried out show that the shape features allow discrimination between families, the discrimination rate remaining higher than 90% for cuttings ≤ 3. Without cutting the leaves, the rate varies between 85% and 100% depending on the selected features, the classification algorithms and the size of the training set. When the cutting is > 3, these features face difficulties. This is explained by the fact that the obtained leaf sections become smaller, so their geometry is reduced to rectangles with few variations on the edges. More precisely, the discrimination rate is 73% when the leaves are cut into 10 regions. Tables 1 and 2 give an idea of the variation of the Hu invariant moment values before normalization. These values are computed on the representative plant leaf samples shown in Figures 3 and 4. In light of these objectives, we believe that shape features based on digital morphological features give the best separation between the two families (fabaceae and poaceae), and that the maximum cutting is 3: the width of a leaf section must remain above a minimum number of pixels for both families.
5 Conclusions
In this paper, we have presented an approach for measuring the limits and robustness of techniques that characterize the shape of objects. This approach was applied to separate the fabaceae and poaceae families. The separation is based on the recognition of plant leaf species extracted from high definition images. Recognition is performed by a supervised classification process based on a leaf shape description using digital morphological features. The leaves are representative of the grassland species studied by INRA in Lusignan (white clover, alfalfa, orchardgrass, tall fescue and rye-grass). The DMF are computed after applying a segmentation process to locate the leaves and separate them from the ground. The results are satisfactory given the divisions employed.
This study shows that shape features can separate the two families. The difficulty is that these features alone cannot identify the species composing each family. This problem is more complex to address because of the strong resemblance between the geometrical shapes of leaves within each family. Further studies are exploring texture and color features. Acknowledgements. The authors would like to thank the Poitou-Charentes region for financing this study.
References
1. Huygue, C., Bournoville, R., Couteaudier, Y., Duru, M.: Prairies et cultures fourragères en France : entre logiques de production et enjeux territoriaux. INRA (2005)
2. Leconte, D., Simon, J.: Diversité floristique des prairies Normandes. Un patrimoine, un rôle fonctionnel. Prairiales Normandie, Colloque du Robillard (2002)
3. Zaller, J.: Ecology and non-chemical control of rumex crispus and r. obtusifolius (polygonaceae): a review. Weed Research 44(6), 414–432 (2004)
4. Gebhardt, S., Schellberg, J., Lock, R., Kühbauch, W.: Identification of broad-leaved dock (rumex obtusifolius l.) on grassland by means of digital image processing. Precision Agriculture 7(3), 165–178 (2006)
5. Himstedt, M., Fricke, T., Wachendorf, M.: Determining the contribution of legumes in legume-grass mixtures using digital image analysis. Crop Science 49, 1910–1916 (2009)
6. Neto, N., Meyer, G., Jones, D., Samal, A.K.: Plant species identification using elliptic Fourier leaf shape analysis. Computers and Electronics in Agriculture 50(2), 121–134 (2006)
7. Wang, X., Huang, D., Du, J., Huan, X., Heutte, L.: Classification of plant leaf images with complicated background. Applied Mathematics and Computation 205(2), 916–926 (2008)
8. Du, J.-X., Wang, X.-F., Gu, X.: Shape matching and recognition based on genetic algorithm and application to plant species identification. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 282–290. Springer, Heidelberg (2005)
9. Im, C., Nishida, H., Kunii, T.: Recognizing plant species by leaf shapes - a case study of the Acer family. In: ICPR '98: Proceedings of the 14th International Conference on Pattern Recognition, Washington, DC, USA, vol. 2, pp. 1171–1173. IEEE Computer Society Press, Los Alamitos (1998)
10. Wang, Z., Chi, Z., Feng, D.: Fuzzy integral for leaf image retrieval. In: FUZZ-IEEE '02: Proceedings of the IEEE International Conference on Fuzzy Systems, Honolulu, HI, USA, vol. 1, pp. 372–377. IEEE Computer Society Press, Los Alamitos (2002)
11. Plataniotis, K., Venetsanopoulos, A.: Color Image Processing and Applications. Springer, Heidelberg (2000)
12. Cheng, H., Jiang, X., Sun, Y.: Color image segmentation: advances and prospects. Pattern Recognition 34(12), 2259–2281 (2001)
13. Vandenbroucke, N., Macaire, L., Postaire, J.: Color image segmentation by pixel classification in an adapted hybrid color space. Application to soccer image analysis. Computer Vision and Image Understanding 90(2), 190–216 (2003)
14. Yang, L., Yang, X.: Multi-object segmentation based on curve evolving and region division. Chin. J. Comput. 27(3), 420–425 (2004)
15. Du, J., Wang, X., Zhang, G.: Leaf shape based plant species recognition. Applied Mathematics and Computation 185(2), 883–893 (2007)
16. Wu, S., Bao, F., You Xu, E., Wang, Y., Chang, Y., Xiang, Q.: A leaf recognition algorithm for plant classification using probabilistic neural network. Computer Science Artificial Intelligence (2007)
17. Wang, X., Du, J., Zhang, G.: Recognition of leaf images based on shape features using a hypersphere classifier. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 87–96. Springer, Heidelberg (2005)
18. Duin, R.: A Matlab toolbox for pattern recognition, PRTools version 3.0 (January 2000), http://www.ph.tn.tudelft.nl/prtools
Efficient Person Identification by Fusion of Multiple Palmprint Representations

Abdallah Meraoumia1, Salim Chitroub1, and Ahmed Bouridane2

1 Signal and Image Processing Laboratory, Electronics and Computer Faculty, U.S.T.H.B., P.O. Box 32, El-Alia, Bab-Ezzouar, 16111, Algiers, Algeria
2 School of Computing, Engineering and Information Sciences, Northumbria University, Pandon Building, Newcastle upon Tyne, UK
[email protected], [email protected], [email protected]

Abstract. Automatic person identification is a significant component of any biometric security system, because of the challenges and the significant number of applications that require high security. A biometric system based solely on one template (representation) is often unable to meet the desired performance requirements. Identification based on multiple representations is a promising trend. In this context, we propose a multi-representation biometric system for person recognition using palm images, integrating two different representations of the palmprint. Two ensembles of matchers that use two different feature representation schemes of the images are considered. The two feature extraction methods are the block-based 2D Discrete Cosine Transform (2D-DCT) and the phase information of the 2D Discrete Fourier Transform (2D-DFT), which complement each other in terms of identification accuracy. Finally, the two ensembles are combined, with fusion applied at the matching-score level. Using the PolyU palmprint database, the results show the effectiveness of the proposed multi-representation biometric system in terms of recognition rate. Keywords: Biometric, Palmprint, DCT, DFT, PCF, Data Fusion.
1 Introduction
The pronounced need for reliably determining or verifying the identity of a person is becoming critical in our vastly interconnected information society [1]. Traditional methods of establishing a person's identity, such as knowledge-based (passwords) and token-based (ID cards) mechanisms, are time-consuming, inefficient and expensive. In fact, a reliable identity management system is a critical component in several protected applications that render services only to legitimately enrolled users. However, these surrogate representations of identity can easily be lost, shared, manipulated or stolen, thereby undermining the intended security. Biometrics offer a natural and reliable solution to the problem
of identity determination. Biometric authentication deals with recognizing the identity of individuals using unique physical or behavioral characteristics that are inherent to the person. However, every biometric technology has its merits and its limitations, and no technology is the best for every application domain [2], [3], [4], [5]. The palmprint, as a new biometric feature, has several advantages compared to hand-based biometric technologies such as fingerprint and hand geometry [6], [7], [8]. A simple biometric system has a sensor module for acquiring the palmprint, a feature extraction module for palmprint representation, and a matching module for decision making. Most palmprint recognition systems are based on a single palmprint representation, which can be considered a performance bottleneck. In fact, the performance of any biometric identification system is largely affected by the reliability of the method used for feature extraction: if this method is not efficient, the resulting matching score computed by the matching module may not be reliable. An ideal palmprint recognition system should be based on the fusion of several palmprint representations. Such systems, known as multi-representation biometric systems, are expected to be more reliable due to the presence of multiple representations of evidence [6], [7], [8]. In this paper, we address the problem of information fusion by first building a multi-representation biometric system and then devising a scheme to integrate the representations. For this, we propose to use two different feature extraction methods. The extracted features are used as inputs of the matcher modules, whose outputs are combined using data fusion at the score level. We propose to extract the palmprint features using the 2D-DFT and the 2D-DCT. Thus, for each palm image, two feature vectors are extracted and used for training two different matchers. The integration scheme is required to fuse the information presented by the individual representations. The remainder of the paper is organized as follows. The proposed palmprint identification scheme is presented in Section 2. The method used for extracting the Region Of Interest (ROI) is presented in Section 3. The two feature extraction methods, the 2D-DFT method and the block-based 2D-DCT method, are discussed in Section 4. Section 5 describes the two matching modules, the Sum of Absolute Differences (SAD) and the Phase Correlation Function (PCF), and the score-level fusion process for fusing the information presented by the extracted features. The obtained results are evaluated and commented on in Section 6. We conclude the present work in Section 7.
2 Proposed Palmprint Identification Scheme
Fig. 1 shows the block diagram of the proposed palmprint identification system. The system is composed of three steps: image pre-processing, feature extraction and fusion. In the pre-processing module, the grey-scale image of the palm surface of the hand is segmented to extract the Region Of Interest (ROI). In this multi-representation system, the feature vectors are extracted
independently from the single ROI using the block-based 2D-DCT and the phase information of the 2D-DFT. The extracted features are then compared to the enrolment templates, which are stored separately for each representation. Based on the proximity of the feature vector and the template, measured using SAD and PCF, each subsystem computes its own matching score. After normalization of the matchers' outputs, these scores are finally fused into a total score, which is handed over to the decision module.
Fig. 1. Block-diagram of the multirepresentation palmprint identification system based on the fusion at the matching-score level
3 Region Of Interest (ROI) Extraction
After the image is captured, it is pre-processed to obtain only the palm area of the hand. In the pre-processing phase, the tangent of the two holes, which lie between the forefinger and the middle finger and between the ring finger and the little finger, is computed and used to align the palmprint. With reference to the locations of the gaps between the fingers, the palm is duly rotated and the maximum palm area is then located. The central part of the image is then cropped to represent the whole palmprint. The ROI is defined as a square and converted to a fixed size (128 × 128 pixels) so that all palmprints conform to the same size. The ROI extraction method applied in our system is based on the algorithm described in [5]. The basic steps to extract the ROI are summarized as follows. First, a low-pass filter, such as Gaussian smoothing, is applied to the original palmprint image. A threshold, Tp, is used to convert the filtered image to a binary image; then a boundary-tracking algorithm is used to obtain the boundaries of the binary image. This boundary is processed to determine the points F1 and F2 for locating the ROI pattern and, based on these relevant points (F1 and F2), the ROI pattern is located on the original image. Finally, the ROI is extracted. The pre-processing steps are shown in Fig. 2. An example showing a sample of ROI patterns is given in Fig. 3.
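The following sketch illustrates these pre-processing steps; it is our own illustration in Python with OpenCV, not the authors' implementation, and the threshold value Tp and the simplified cropping logic (which bypasses the finger-gap points F1 and F2) are assumptions made for brevity.

    import cv2

    def extract_roi(gray, Tp=40, size=128):
        # Step 1: low-pass (Gaussian) filtering of the original palmprint image.
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)
        # Step 2: the threshold Tp converts the filtered image to a binary image.
        _, binary = cv2.threshold(blurred, Tp, 255, cv2.THRESH_BINARY)
        # Step 3: boundary tracking (contour following) on the binary image.
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_NONE)
        hand = max(contours, key=cv2.contourArea)  # largest boundary = hand
        # Steps 4-5 (simplified): locate a square ROI inside the palm; a real
        # system derives this square from the points F1 and F2 on the boundary
        # rather than from the bounding box used here.
        x, y, w, h = cv2.boundingRect(hand)
        cx, cy, half = x + w // 2, y + h // 2, min(w, h) // 4
        roi = gray[cy - half:cy + half, cx - half:cx + half]
        # Step 6: normalize the ROI to a fixed 128 x 128 size.
        return cv2.resize(roi, (size, size))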
Fig. 2. Various steps in a typical region of interest extraction algorithm. (a) The filtered image, (b) The binary image, (c) The boundaries of the binary image and the points for locating the ROI pattern, (d) The central portion localization, and (e) The preprocessed result (ROI).
Fig. 3. Sample ROIs extracted from a palmprint
4 Feature Extraction

4.1 Discrete Fourier Transform (DFT)
The 2D-DFT of an input image of size H × H is defined as:

F(u,v) = \frac{1}{H^2} \sum_{y=0}^{H-1} \sum_{x=0}^{H-1} f(x,y) \exp\left( 2\pi j \, \frac{xu + yv}{H} \right)    (1)

Note that f(x,y) is the image and is real, but F(u,v), u,v = 0, 1, 2, ..., H−1, is its Fourier transform and is, in general, complex. Generally, F is represented by its magnitude (A) and phase (φ) rather than by its real and imaginary parts, where:

A(F) = \sqrt{Re^2(F) + Im^2(F)}, \qquad \phi(F) = \arctan\frac{Im(F)}{Re(F)}    (2)

At the feature-extraction stage, the features are generated from the ROI sub-image as φ(F). This feature-extraction technique is used by the matching module and represents an efficient approach to palmprint recognition using the phase components of the 2D-DFT of the given images.
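As a minimal illustration of Eqs. (1)-(2) (our sketch, not the authors' code), the phase feature can be obtained with NumPy; the 128 × 128 ROI of Section 3 is assumed as input:

    import numpy as np

    def dft_phase_features(roi):
        # roi: H x H real-valued ROI sub-image (H = 128 in this paper).
        F = np.fft.fft2(roi)   # 2D-DFT; complex-valued in general
        A = np.abs(F)          # magnitude A(F) of Eq. (2), not used as a feature
        phi = np.angle(F)      # phase phi(F) of Eq. (2)
        return phi             # phase components used as the feature map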
4.2 Block Based Discrete Cosine Transform (DCT)
With the block-based 2D-DCT, the image is analyzed block by block. Given an image block f(x,y), with x, y = 0, 1, 2, ..., N − 1, we decompose it in terms
of orthogonal 2D-DCT basis functions. The result is an N × N matrix C(u,v) containing the DCT coefficients:

C(u,v) = \alpha(u)\alpha(v) \sum_{y=0}^{N-1} \sum_{x=0}^{N-1} f(x,y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}    (3)
with u, v = 0, 1, 2, ..., N−1, \alpha(v) = \sqrt{1/N} for v = 0, and \alpha(v) = \sqrt{2/N} for v = 1, ..., N−1. The coefficients are ordered according to a zigzag pattern. For a block located at (b,a), the baseline 2D-DCT feature vector is composed of:

x_1 = \left[ c_0^{(b,a)} \; c_1^{(b,a)} \; c_2^{(b,a)} \; c_3^{(b,a)} \cdots c_{M-2}^{(b,a)} \; c_{M-1}^{(b,a)} \right]^T    (4)

where c_n^{(b,a)}, n ∈ [0, M−1], denotes the n-th 2D-DCT coefficient and M = N × N is the number of coefficients. 2D-DCT-based features are sensitive to changes in the illumination direction. Sanderson and Paliwal [9] used a modified form of DCT feature extraction, termed DCT-mod2, which has been shown to be robust against illumination direction changes. In this paper, we adopt the DCT-mod2 feature extraction technique. In DCT-mod2, the first three 2D-DCT coefficients of the vector x_1 are replaced by their respective horizontal and vertical deltas, in order to reduce the effects of illumination direction changes, thus forming the following feature vector for a given block [10]:

x_2 = \left[ \Delta^h c_0 \; \Delta^v c_0 \; \Delta^h c_1 \; \Delta^v c_1 \; \Delta^h c_2 \; \Delta^v c_2 \; c_3 \; c_4 \cdots c_{M-2} \; c_{M-1} \right]^T    (5)
The n-th horizontal and vertical delta coefficients for the block located at (b,a) are:

\Delta^h c_n^{(b,a)} = \frac{\sum_{k=-K}^{K} k\, h_k\, c_n^{(b,a+k)}}{\sum_{k=-K}^{K} h_k\, k^2}, \qquad \Delta^v c_n^{(b,a)} = \frac{\sum_{k=-K}^{K} k\, h_k\, c_n^{(b+k,a)}}{\sum_{k=-K}^{K} h_k\, k^2}    (6)

where h is a symmetric window vector of length 2K+1. In this paper, we use K = 1 and a rectangular window, i.e., h = [1 1 1]^T.
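A compact sketch of the block-based DCT-mod2 features of Eqs. (3)-(6) is given below; it is our illustration with SciPy, the block size N and the number M of retained coefficients are assumed values, and a row-major ordering replaces the zigzag scan for brevity.

    import numpy as np
    from scipy.fftpack import dct

    def block_dct(block):
        # Orthonormal 2D-DCT of an N x N block, Eq. (3).
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def dct_mod2_features(image, N=8, M=15):
        # One DCT per non-overlapping N x N block; keep the first M coefficients.
        B, A = image.shape[0] // N, image.shape[1] // N
        coeffs = np.array([[block_dct(image[b*N:(b+1)*N, a*N:(a+1)*N]).ravel()[:M]
                            for a in range(A)] for b in range(B)])
        h, K = np.array([1.0, 1.0, 1.0]), 1            # rectangular window, K = 1
        denom = np.sum(h * np.arange(-K, K + 1) ** 2)  # denominator of Eq. (6)
        feats = []
        for b in range(1, B - 1):                      # interior blocks only,
            for a in range(1, A - 1):                  # since deltas need neighbors
                x, deltas = coeffs[b, a], []
                for n in range(3):                     # replace c0, c1, c2 ...
                    dh = sum(k * h[k + K] * coeffs[b, a + k][n]
                             for k in range(-K, K + 1)) / denom
                    dv = sum(k * h[k + K] * coeffs[b + k, a][n]
                             for k in range(-K, K + 1)) / denom
                    deltas += [dh, dv]                 # ... by their h/v deltas
                feats.append(np.concatenate([deltas, x[3:]]))  # Eq. (5)
        return np.array(feats)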
5 Matching Modules
The two feature-extraction modules, the phase information of the 2D-DFT and the block-based 2D-DCT, are used to extract features from the ROI sub-image, while the two matching modules perform the matching between the input features and the templates from the database. After normalization of the matchers' outputs, fusion is applied. Two different similarity/dissimilarity measures are used:
5.1 Sum of Absolute Differences (SAD)
For feature matching, the most popular criterion consists of minimizing the sum of the errors between the values of every pair of corresponding points
between the input feature F_v and the stored feature (template) F_r. In this context, the objective function takes the form:

E = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| F_v(i,j) - F_r(i,j) \right|    (7)
5.2 Phase Correlation Function (PCF)
Frequency-domain palmprint matching is based on the 2D-DFT property that a translational displacement in the spatial domain corresponds to a linear phase shift in the frequency domain [11]. Consider two N_1 × N_2 images: the palmprint image to be verified (current), f_v, and the registered palmprint image (reference), f_r. Assume that f_v and f_r differ over a moving area A only by a translational displacement (\tau_1, \tau_2):

f_v(n_1, n_2) = f_r(n_1 + \tau_1, n_2 + \tau_2), \quad (n_1, n_2) \in A    (8)
Taking the 2D-DFT of both sides gives:

F_v(k_1, k_2) = F_r(k_1, k_2)\, e^{j(-k_1\tau_1 - k_2\tau_2)}    (9)
where F_v and F_r are the Fourier transforms (FT) of f_v and f_r. If we define \Delta\phi(k_1,k_2) as the phase difference between the FT of f_v and that of f_r, then we obtain:

e^{j\Delta\phi(k_1,k_2)} = e^{j[\phi_v(k_1,k_2) - \phi_r(k_1,k_2)]} = \frac{F_v(k_1,k_2)}{|F_v(k_1,k_2)|} \cdot \frac{F_r^*(k_1,k_2)}{|F_r(k_1,k_2)|}    (10)
where \phi_v and \phi_r are the phase components of F_v and F_r, respectively. If we define C_{v,r}(n_1,n_2) as the inverse DFT of e^{j\Delta\phi(k_1,k_2)}, then we have:

C_{v,r}(n_1,n_2) = \mathcal{F}^{-1}\{e^{j\Delta\phi(k_1,k_2)}\} = \mathcal{F}^{-1}\{e^{j\phi_v(k_1,k_2)}\} \otimes \mathcal{F}^{-1}\{e^{-j\phi_r(k_1,k_2)}\}    (11)

where ⊗ denotes 2D convolution. In other words, C_{v,r}(n_1,n_2) is the cross-correlation of the inverse 2D-DFTs (\mathcal{F}^{-1} = 2D IDFT) of the phase components of F_v and F_r. For this reason, C_{v,r}(n_1,n_2) is known as the Phase Correlation Function (PCF) [12]. The importance of this function becomes apparent if it is rewritten in terms of the phase difference of equation (9):

C_{v,r}(n_1,n_2) = \mathcal{F}^{-1}\{e^{j\Delta\phi(k_1,k_2)}\} = \delta(n_1 - \tau_1, n_2 - \tau_2)    (12)
Thus, the phase correlation surface has a distinctive impulse at (\tau_1, \tau_2). This observation is the basic idea behind phase correlation matching. When two images are similar, their PCF exhibits a distinct sharp peak; when they are not similar, the peak drops significantly. The height of the peak can therefore be used as a good similarity measure for image matching.
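A minimal NumPy sketch of the PCF matcher of Eqs. (10)-(12) follows (our illustration, not the authors' code); the peak height of the correlation surface is returned as the matching score:

    import numpy as np

    def pcf_score(f_v, f_r):
        # Phase difference of Eq. (10): normalized cross-power spectrum.
        Fv, Fr = np.fft.fft2(f_v), np.fft.fft2(f_r)
        eps = 1e-12   # guards against division by zero in flat spectra
        cross = (Fv / (np.abs(Fv) + eps)) * (np.conj(Fr) / (np.abs(Fr) + eps))
        # Inverse transform yields the phase correlation surface, Eqs. (11)-(12):
        # a sharp peak at (tau1, tau2) for similar images, a low peak otherwise.
        C = np.real(np.fft.ifft2(cross))
        return C.max()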
5.3 Normalization and Fusion Modules
Score-level fusion is achieved by combining the matching scores calculated from the different representations. Each representation has its own matching module: matching scores are calculated with PCF-based matching and with DCT-based matching. The two matching scores for the two feature vectors are normalized, i.e., mapped to the interval [0,1], and combined using the sum rule (based on their experimental results, the authors of [13] observe that the sum rule achieves the best performance). In the decision stage, the fused score is compared with the decision threshold, denoted T_E. When the fused score ≤ T_E, the claimed identity is accepted; otherwise it is rejected.
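The decision stage can be sketched as follows; this is our illustration, and the min-max normalization is an assumption, since the paper only states that the scores are mapped to [0,1]. The comparison direction assumes dissimilarity scores, consistent with the acceptance rule above.

    def minmax_norm(score, s_min, s_max):
        # Map a raw matcher score onto [0,1] (assumed normalization method).
        return (score - s_min) / (s_max - s_min)

    def fuse_and_decide(s_dft, s_dct, bounds_dft, bounds_dct, T_E=0.65):
        n1 = minmax_norm(s_dft, *bounds_dft)   # normalized 2D-DFT/PCF score
        n2 = minmax_norm(s_dct, *bounds_dct)   # normalized 2D-DCT/SAD score
        fused = (n1 + n2) / 2.0                # sum rule (averaged form)
        return fused <= T_E                    # accept the claimed identity?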
6 Experimental Results: Evaluation and Comments
We have tested and evaluated our approach on the PolyU palmprint database [14]. This database contains 100 classes. The original palmprint images are 384 × 384 pixels, captured at a resolution of 75 dpi. Twenty samples from each person were collected in two sessions. The palmprint images are aligned and the central part of the image, of size 128 × 128 pixels, is automatically segmented to represent the ROI of the palmprint. Three palmprint images of each palm were randomly selected for enrollment, and the remaining 17 palmprint images were used as test images for recognition. In the first experiment, the genuine and impostor distributions are generated by 1700 and 42075 comparisons for 100 persons, respectively. The Receiver Operating Characteristic (ROC) curves are plotted in Fig. 4, in which the Genuine Acceptance Rate (GAR) is plotted against the False Acceptance Rate (FAR). For example, if only the 2D-DCT feature is used, we obtain GAR = 95.75 % (FAR = FRR = 4.25 %). When the 2D-DFT is used, the GAR improves to 96.90 % (FAR = FRR = 3.10 %). By applying the fusion rule, the results improve further and reach 97.95 % (FAR = FRR = 2.05 %) for a database size of 100. We can therefore conclude from these results that an integration of the 2D-DFT and the 2D-DCT does result in an improvement of identification accuracy. Other series of experiments were carried out using another palmprint database. Two experiments were done in order to test the identification performance of the system. We set up two registration databases, for N = 50 and N = 100; no palmprint image of the testing database is contained in either registration database. In the first experiment, with the database of fifty persons, and in order to obtain performance characteristics such as the genuine and impostor distributions, the biometric feature of an unknown user is presented to the system, which tries to match the user against all templates stored in the database. A genuine matching is defined as a matching between palmprints from the same palm, and an impostor matching is a matching between palmprints from different palms. Therefore, a total of 21675 comparisons were performed, of which 850 were genuine matchings and 20825 were impostor matchings. The genuine and impostor distributions are plotted in Fig. 5.a
Fig. 4. Receiver operating curves for individual representations
Fig. 5. Identification test results (50 persons). (a) The genuine and impostor distributions, and (b) the receiver operating curves.
Fig. 6. Identification test results (100 persons). (a) The genuine and impostor distributions, and (b) the receiver operating curves.
and the ROC curves, shown in Fig. 5.b, depict the performance of the system. The described identification system achieves equality between the FAR and the False Rejection Rate (FRR), with an Equal Error Rate (EER) of 2.82 % at the decision threshold T_E = 0.654. A maximum GAR is also reached
at the rate of 97.18 %. Thus, we can conclude that the fusion of the two features improves the identification results compared with the results obtained with only one feature. In the second experiment, with the database of one hundred persons, the genuine and impostor distributions are generated by 1700 and 42075 comparisons, respectively. Fig. 6.a presents the identification results and Fig. 6.b shows the corresponding ROC curves. As in the first experiment, the identification system achieves equality between FAR and FRR, with an EER of 2.25 % at T_E = 0.656, and a maximum GAR of 97.75 % is reached. These experimental results show that the proposed method reaches a higher identification rate than other methods developed in the literature [15], [16].
7 Conclusion
A biometric system based on only a single biometric characteristic may not always achieve the desired performance. Multi-representation biometric techniques, which combine multiple representations in the identification process, can be used to overcome the limitations of single-representation systems. Integrating multiple representations in an identification system can improve the identification accuracy. We have developed a fusion scheme that integrates two different representations that complement each other: the block-based 2D-DCT and the phase information of the 2D-DFT. The fusion scheme exploits all the information of the two representations via its decision rule. To demonstrate the efficiency of the proposed fusion scheme, the PolyU palmprint database was used for testing and evaluation. The obtained results show that the proposed fusion scheme overcomes many of the limitations of both individual systems, the DCT-based and the DFT-based identification systems. They demonstrate that it performs very well and meets the accuracy requirements.
References

1. Jain, A.K., Bolle, R., Pankanti, S.: Biometrics: Personal Identification in Networked Society. The Kluwer International Series in Engineering and Computer Science, USA (2002)
2. Ross, A.A., Nandakumar, K., Jain, A.K.: Handbook of Multibiometrics. Springer Science+Business Media, LLC (2006)
3. Ross, A., Jain, A.K.: Information Fusion in Biometrics. Pattern Recognition Letters 24(13), 2115–2125 (2003)
4. Ma, Y., Cukic, B., Singh, H.: A Classification Approach to Multi-Biometric Score Fusion. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 484–493. Springer, Heidelberg (2005)
5. Zhang, D., Kong, W., You, J., Wong, M.: On-line Palmprint Identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1041–1050 (2003)
6. Kong, W.K., Zhang, D.: Palmprint Texture Analysis Based on Low-Resolution Images for Personal Authentication. In: The 16th International Conference on Pattern Recognition, August 2002, vol. 3, pp. 807–810 (2002)
7. Nanni, L., Lumini, A.: Ensemble of Multiple Palmprint Representation. Expert Systems with Applications (2008)
8. Kumar, A., Wong, D.C.M., Shen, H.C., Jain, A.K.: Personal Verification Using Palmprint and Hand Geometry Biometrics. In: The Fourth International Conference on Audio- and Video-Based Biometric Personal Authentication (June 2003)
9. Sanderson, C., Paliwal, K.K.: Polynomial Features for Robust Face Authentication. In: Proceedings of the International Conference on Image Processing, Rochester, New York, vol. 3, pp. 997–1000 (2002)
10. Sanderson, C., Paliwal, K.K.: Fast feature extraction method for robust face verification. Electronics Letters 38(25) (2002)
11. Kumar, A., Mahalanobis, A., Juday, R.: Correlation Pattern Recognition. Cambridge University Press, New York (2005)
12. Ito, K., Nakajima, H., Kobayashi, K., Aoki, T., Higuchi, T.: A fingerprint matching algorithm using phase-only correlation. IEICE Transactions on Fundamentals E87-A(3) (2004)
13. Bubeck, U.M.: Multibiometric Authentication: An Overview of Recent Developments. Term Project CS574, San Diego State University (2003)
14. The Hong Kong Polytechnic University: PolyU Palmprint Database, http://www.Comp.polyu.edu.hk/biometrics
15. Pang, Y.H., Teoh, A.B.J., Ngo, D.C.L.: Palmprint Authentication System Using Wavelet based Pseudo Zernike Moments Features. International Journal of The Computer, the Internet and Management 13(2), 13–26 (2005)
16. Hennings, P., Kumar, A.: Palmprint Recognition Using Correlation Filter Classifiers. IEEE Transactions on Circuits and Systems for Video Technology (2004)
A Cry-Based Babies Identification System

Ali Messaoud and Chakib Tadj

Laboratoire de Traitement de l'Information et des Signaux, École de Technologie Supérieure, 1100, Rue Notre-Dame Ouest, Montréal, Canada
[email protected],
[email protected]

Abstract. Human biological signals convey precious information about the physiological and neurological state of the body. Crying is a vocal signal through which babies communicate their needs to their parents, who should then satisfy them properly. Most research dealing with infant cries aims mainly to establish a relationship between the acoustic properties of a cry and the state of the baby, such as hunger, pain, illness and discomfort. In this work, we are interested in recognizing babies solely by analyzing their cries, through the use of an automatic analysis and recognition system operating on a real cry database.

Keywords: Infant cry, classification, neural network, acoustic features.
1 Introduction

The human body is a source of many signals related to different functions, such as the cardiological and nervous systems. The analysis of biological signals is important for medical diagnoses and also for the study of various phenomena observed in the human body. Although a baby has no explicit means of communication, he can inform his parents, through his cry, about a need to be satisfied. Experienced parents can analyze this signal correctly and act appropriately, whereas others remain confused about it. In many cases, this parental perception is altered by individual or contextual factors [1-3]. Another important aspect of the subject is the ability of the mother to distinguish her infant's cry from others within the first days of life [4]. In previous studies [5-7], researchers discovered that the acoustic features of a baby's cry change according to the physiological state of the baby. These findings were at the origin of efforts to develop "intelligent", fully automatic systems capable of analyzing an infant's cry and decoding its underlying significance [8-10]. In this study, the main goal is to design an automatic system for the identification of a baby from his cry, in an attempt to emulate a mother who is capable of recognizing her own child just by his cry. A comparison between the natural and artificial recognition systems could contribute to a better understanding of the perception mechanism of infant cries. We followed a procedure composed of two main steps. First, we performed signal processing tasks resulting in the extraction of specific acoustic features characterizing the cry. Second, an automatic classifier based on neural networks was used to assign
an input cry to one of the known babies in the database. It was trained and tested by means of real cries recorded for the purpose of this study. Figure 1 shows the general structure of the designed system.
Fig. 1. Block diagram of cry-based recognition system
2 Methodology

2.1 Subjects

The cry database was constructed from recordings of 13 healthy babies aged less than 6 months: 7 females and 6 males. These babies had no family support and were therefore part of a social program aiming to provide them special care and shelter during their first months of life, before adoption. It is important to consider this social aspect in other analogous studies, especially if both subjects without family support and subjects with a normal family life are involved.

2.2 Acquisition Procedure

In this work, real infant cries were used in the experiments. Therefore, it was necessary to maintain a uniform and well-defined procedure during the acquisition of the cry sounds so as to guarantee a good quality of the final recordings. The procedure was based on subject-related, environment-related and material-related precautions. The subject-related precautions consist in recording the babies in an "everyday" natural state, such as hunger or discomfort caused by wetness, and avoiding extreme cases such as a very painful stimulus. Hence, spontaneous cries were recorded mainly in a hunger context. As for the environment-related precautions, special care was given to the ambient conditions surrounding the baby being recorded. Noise was kept as low as possible by recording each baby separately and suspending the acquisition whenever the ambient noise was too high. In order to achieve a good sound quality, we used a WS-310M Olympus digital voice recorder at a 44.1 kHz sample rate. During acquisition, the device was positioned so as to capture a maximum cry level while keeping the ambient noise very low (a high signal-to-noise ratio).
2.3 Segmentation After acquisition, the digital cry sounds were transferred to a computer and stored in the WAV format. The raw cry recordings had a total duration of 158 min. They were treated by Adobe Audition software1 to reduce the background noise picked up by the recorder. Then, we segmented these records manually into individual cry utterances discarding unusable ones. We obtained 1615 cry samples ready to be analyzed. In the next sections, the term “cry” denotes an individual cry utterance.
3 Acoustic Analysis According to previous studies [8-10], a cry is characterized by a set of acoustic features which contain relevant information about the state of the baby and which can be perceived, analyzed and interpreted by the listener. Thus, the first step in this study is the extraction of these features. 3.1 Preprocessing The inaccurate manual segmentation resulted in cries with imprecise start and end points (advanced start and delayed end). So, the “silent” portions at the edges of each cry were removed by considering energy less than 1 % of the maximum energy as “silence”. As a result, all the cries in the database had synchronous start. We have noticed in many other works [8-9] that the cry records were segmented on a fixed length basis (e.g. 1 second). This method leads to a loss of information due to the truncation of the cries longer than the chosen duration. Instead of this method, we adopted a variable-duration segmentation that resulted in cries with different durations. The other preprocessing we implemented was the time scaling of all the cry waveforms to a unique length of 1 s using a phase vocoder technique. Finally, we filtered the cry samples using a 4th order low pass filter with a cutoff frequency of 3000 Hz. 3.2 Feature Extraction The main goal of features extraction is the conversion of vocal waves into a set of values which represent the signal in a very compact way discarding irrelevant information. Various acoustic features such as pitch, intensity, jitter, etc. [3] have been proposed to characterize an infant cry. Cepstral coefficients are the most widely used features for speech recognition. Cepstral analysis consists in calculating the short-term Fourier transform power spectrum, mapping this spectrum to a Mel scale using a filter bank and performing the Discrete Cosine Transform of the logarithm of the obtained spectrum [11]. The Mel Frequency Cepstral Coefficients (MFCCs) are the amplitudes of the resulting spectrum. The Mel scale used in this analysis is based on a model of the human auditory system. 1
1 http://www.adobe.com/fr/products/audition/
Fig. 2. MFCC extraction system
In order to perform such an analysis, each cry was divided into overlapping, Hamming-windowed short segments. A window duration of 25 ms and an overlap of 20 % were adopted. From each window, 25 MFCCs were extracted. Figure 2 shows the block diagram of the MFCC extractor.
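A minimal sketch of this extraction step with the librosa library (our illustration; the paper does not disclose its implementation) using the stated parameters, 25 ms windows with 20 % overlap and 25 coefficients:

    import librosa

    def extract_mfcc(path, sr=44100, n_mfcc=25):
        y, sr = librosa.load(path, sr=sr)
        win = int(0.025 * sr)       # 25 ms Hamming window
        hop = int(win * 0.8)        # 20 % overlap between adjacent windows
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=win, hop_length=hop,
                                    window='hamming')
        return mfcc                 # shape: (25, number_of_frames)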
4 Classification

The recognition of a particular cry among a set of acoustically different cries was based on the acoustic characteristics (MFCCs) obtained in the feature extraction stage. These features were formatted as a vector representing the input of the system.

4.1 Neural Network

Among the many neural network architectures, we adopted the probabilistic neural network (PNN). We made this choice because of the simplicity of the design of such a network and its suitability for classification problems. The PNN is a two-layer network. In the first layer, the distances between the input vector and the training input vectors are computed. The second layer produces a vector containing the probabilities of belonging to the classes. A competitive transfer function takes these probabilities as input and produces the final classification result, which corresponds to a value of 1 for the largest probability and 0 elsewhere. In our case, we used a network formed by 780 nodes in the first layer and 13 nodes in the second layer.

4.2 Classifier Input

The feature extraction stage produced a total of 1225 MFCCs for each cry. Due to this large dimension of the input vector, we used a dimension-reduction technique whose goal was to decrease the required computational power. For each cry, the arithmetic average of the MFCCs was calculated along the time axis, resulting in a vector formed by only 25 elements. Fig. 3 shows an example of how the MFCC matrix of a cry is transformed into a simple 25-element vector (we used interpolation to smooth the variation of MFCC values). A simple visual analysis of the degree of similarity of this feature vector from one baby to another supports considering it as characteristic. Figure 4 shows the MFCC values for 4 babies (the mean over all the cries of each baby).
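A compact sketch of the classifier input and the PNN of Sections 4.1-4.2 is given below (our illustration; the Gaussian smoothing parameter sigma is an assumption, as the paper does not report it):

    import numpy as np

    def reduce_mfcc(mfcc):
        # Section 4.2: average the MFCC matrix along the time axis,
        # turning 1225 coefficients into a 25-element vector.
        return mfcc.mean(axis=1)

    def pnn_classify(x, train_X, train_y, n_classes=13, sigma=0.1):
        # Layer 1: Gaussian activations from the distances between the input
        # vector x and the 780 training vectors (pattern nodes).
        d2 = np.sum((train_X - x) ** 2, axis=1)
        activations = np.exp(-d2 / (2.0 * sigma ** 2))
        # Layer 2: class membership probabilities (13 summation nodes).
        probs = np.array([activations[train_y == c].sum()
                          for c in range(n_classes)])
        # Competitive output: 1 for the largest probability, 0 elsewhere.
        out = np.zeros(n_classes)
        out[np.argmax(probs)] = 1.0
        return out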
Fig. 3. Reduction of MFCC feature dimension
Fig. 4. Examples of MFCC curves of 4 babies
5 Results and Discussion

The designed system was implemented using the MATLAB programming environment. The experimental results presented in the next sections deal with the classification of 13 babies according to their characteristic cries, modeled by the extracted MFCCs.

5.1 Preselection of Cry Samples

The segmentation stage resulted in 1615 cries not uniformly distributed over the 13 babies. Using this raw distribution was likely to produce inaccurate classification results. So, before launching the classification process, we performed a preselection step in order to obtain the same number of cries per class. The idea was to find the class having the fewest cries and to select randomly that number of cries from every class. We obtained 73 cries per class (a total of 949 cries).
5.2 Cross Validation

In order to validate the designed classifier, we adopted a 7-fold cross-validation technique that consisted in dividing the cry data set of each class into 7 subsets: 1 test set and 6 training sets (10 cries per subset for each class). The classification was then repeated 7 times, each time choosing one of the 7 subsets for testing and the remaining subsets for training.

5.3 Classification Performance

The performance of the classification is measured by the number of correctly recognized cries presented to the system. Table 1 gathers the performance specific to the 7 cross-validation runs. We can see from these results that the accuracy of the designed classifier is greater than 69.2 %. Besides, in any classification problem, the classes to be predicted do not have the same specificity (the ability to be classified correctly). Therefore, we computed the accuracy of the classifier with respect to the 13 babies involved in the study. Table 2 presents these results, which show different accuracies from one baby to another, with a maximum performance of 90 % for baby #12. The 40 % value of the first class indicates that the cries belonging to baby #1 were not easy to distinguish from other cries. A simple auditory inspection of the cries produced by baby #1 showed a relatively large variability of the perceived acoustic properties compared to other babies.

Table 1. Performance specific to the 7 classifications (%)
Run       1     2     3     4     5     6     7
Accuracy  71.5  72.3  74.6  70    71.5  70.7  69.2

Table 2. Performance specific to the 13 classes (%)

Baby      1   2     3     4     5     6   7   8     9     10    11    12  13
Accuracy  40  68.5  65.7  61.4  84.2  70  70  58.5  82.8  87.1  67.1  90  82.8
The detailed classification performance results presented in the previous table can be used as a guideline for improving the system performance. For instance, we would need to concentrate on increasing the accuracy for baby #1 rather than baby #5 or baby #10, whose classification rates are already high. This improvement could be achieved, for example, by altering the content of the cry database of the concerned classes when performing the random preselection of cries described in Section 5.1; the cry data set which gives the best results could then be retained. We present in Table 3 the confusion matrix of the classifier, which is the mean of the 7 confusion matrices. We can see in this table that the diagonal contains high values, indicating a fairly good performance. The misclassification rate is represented by the rest of the matrix.
The overall performance of the classifier was calculated independently of the classes and the classification iterations by taking the mean of all the specific accuracy values. We obtained a global value of 71.4 %, showing the high ability of the designed classifier to recognize individuals (babies) from their cry signals.

Table 3. Confusion matrix
6 Conclusion

In the human body, biological signals convey rich information about various vital functions. A newborn, up to a certain age, uses cry signals to communicate. In this work, we were interested in using the information contained in an infant's cry to recognize a baby among 13 babies. Acoustic characteristics were extracted from a real cry database. We used Mel Frequency Cepstral Coefficients, which were a good choice in this study. As for the classification, we chose a Probabilistic Neural Network, known for its suitability for classification problems. The obtained results showed an overall performance of 71.4 %. As future work, we intend to enlarge the database and try other combinations of acoustic features to improve the accuracy.

Acknowledgements. We acknowledge the funding by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
References

1. LaGasse, L.L., Neal, A.R., Lester, B.M.: Assessment of infant cry: Acoustic cry analysis and parental perception. Mental Retardation and Developmental Disabilities Research Reviews 11, 83–93 (2005)
2. Wood, R.M., Gustafson, G.E.: Infant Crying and Adults' Anticipated Caregiving Responses: Acoustic and Contextual Influences. Child Development 72, 1287–1300 (2001)
3. Protopapas, A.: Perceptual differences in infant cries revealed by modifications of acoustic features. The Journal of the Acoustical Society of America 102, 3723–3734 (1997)
4. Linwood, A.: Crying and Fussing in an Infant. Gale Encyclopedia of Children's Health: Infancy through Adolescence. Thomson Gale (2006)
5. Fuller, B.F., Keefe, M.R., Curtin, M., Garvin, B.J.: Acoustic Analysis of Cries from "Normal" and "Irritable" Infants. West J. Nurs. Res. 16, 253 (1994)
6. Goberman, A.M., Robb, M.P.: Acoustic characteristics of crying in infantile laryngomalacia. Logopedics Phoniatrics Vocology 30, 79–84 (2005)
7. Green, J.A., Gustafson, G.E., McGhie, A.C.: Changes in Infants' Cries as a Function of Time in a Cry Bout. Child Development 69, 271–279 (1998)
8. Galaviz, O.F.R.: Infant cry classification to identify hypo acoustics and asphyxia comparing an evolutionary-neural system with a neural network system. In: Gelbukh, A., de Albornoz, Á., Terashima-Marín, H. (eds.) MICAI 2005. LNCS (LNAI), vol. 3789, pp. 949–958. Springer, Heidelberg (2005)
9. Orozco, J., Garcia, C.A.R.: Detecting pathologies from infant cry applying scaled conjugate gradient neural networks. In: Proc. European Symposium on Artificial Neural Networks, pp. 349–354. d-side publi., Bruges, Belgium (2003)
10. Ortiz, S.D.C.: A radial basis function network oriented for infant cry classification. In: Sanfeliu, A., Martínez Trinidad, J.F., Carrasco Ochoa, J.A. (eds.) CIARP 2004. LNCS, vol. 3287, pp. 374–380. Springer, Heidelberg (2004)
11. Xu, M., Duan, L.-Y., Cai, J., Chia, L.-T., Xu, C., Tian, Q.: HMM-Based Audio Keyword Generation, pp. 566–574 (2005)
Comparative Testing of Face Detection Algorithms

Nikolay Degtyarev and Oleg Seredin

Tula State University
[email protected] http://lda.tsu.tula.ru
Abstract. Face detection (FD) is widely used in interactive user interfaces, in the advertising industry, in entertainment services and in video coding, and is a necessary first stage for all face recognition systems. However, the last practical and independent comparisons of FD algorithms were made by Hjelmas et al. and by Yang et al. in 2001. The aim of this work is to propose parameters for evaluating the quality of FD algorithms and a methodology for their objective comparison, and to show the current state of the art in face detection. The main idea is a routine test of the FD algorithms on labeled image datasets. Faces are represented in these datasets by the coordinates of the centers of the eyes. For algorithms representing detected faces by rectangles, a statistical model for estimating the eyes' coordinates is proposed. In this work, seven face detection algorithms were tested; the article contains the results of their comparison.

Keywords: face detection, face localization accuracy, comparative test, face datasets.
1 Introduction
Face detection tasks are required more and more frequently in the modern world. This is caused by the development of security systems as an answer to acts of terrorism. In addition, these algorithms are widely used in interactive user interfaces, in the advertising industry, entertainment services, video coding, etc. However, many researchers have mostly paid attention to face recognition algorithms [6], considering face detection (a necessary first stage for all face recognition systems) to be almost solved. Thus, as far as we know, the last practical and independent comparisons of FD algorithms were made by Hjelmas et al. [4] and by Yang et al. [17] in 2001. Nevertheless, "due to the lack of standardized tests" [4], most such studies (including the two mentioned above) "do not provide a comprehensive comparative evaluation" and contain only a summary of the originally reported performance of several face detection algorithms on a pair of small datasets. We are sure that this type of comparative testing can hardly represent the "true" performance,
This work is supported by Russian Foundation for Basic Research (Gr. 09-07-00394).
because the reported results could be based on different evaluation methods and parameters, could be adjusted to demonstrate better performance under controlled circumstances, etc. The aim of this work is to propose parameters for evaluating the quality of FD algorithms and a methodology for their objective comparison, and to show the current state of the art in face detection. It should also be stressed that a correct experiment should consist of two parts: algorithm learning on a training set and comparative testing. Unfortunately, we are not able to train all algorithms on the same data, for several reasons. However, we believe that this does not diminish the correctness of this research, because our goal is to evaluate face detection systems rather than the learning methods. The following algorithms were tested in this work: Intel© OpenCV (OCV), Luxand© FaceSDK (FSDK), Face Detection Library (FDLib), SIFinder (SIF), University of Surrey (UniS), FaceOnIt (FoI), and Neurotechnology© VeriLook (VL). Their brief description will be given in Section 2.
2 Algorithms' Test Set

2.1 Intel© Open Computer Vision Library
In this work we used OpenCV 1.0, which contains the extended implementation of the Viola-Jones object detection algorithm [14,15] supporting Haar-like features. Haar-like features, originally proposed by Papageorgiou et al. [12], evaluate differences in average intensities between two rectangular regions, which makes them able to extract texture without depending on absolute intensities. Viola and Jones, during their work on object detection algorithms [14], extended the feature set and developed an efficient method for evaluating it, called the "integral image" [14]. Later, Lienhart et al. [9] introduced an efficient scheme for calculating 45° rotated features and included it in the OpenCV library. It should be mentioned that, as opposed to many existing algorithms using one single strong classifier, the Viola-Jones algorithm uses a set of weak classifiers, each constructed by thresholding one Haar-like feature. Due to the large number of weak classifiers, they can be ranked and organized into a cascade. In this work, we have tested the cascade for frontal face detection included by default in OpenCV 1.0: haarcascade_frontalface_alt (trained by R. Lienhart). To find the trade-off between FAR and FRR (see Section 4 for the FAR and FRR definitions), we changed the min_neighbors parameter, which indicates the minimum number of overlapping detections needed to decide that a face is present in the selected region; all other parameters were set to their defaults.
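For orientation, the cascade can be exercised through the modern OpenCV Python bindings as follows; this is our illustration (the study itself used the OpenCV 1.0 interface), and the file and image names are placeholders:

    import cv2

    cascade = cv2.CascadeClassifier('haarcascade_frontalface_alt.xml')
    gray = cv2.imread('test_image.png', cv2.IMREAD_GRAYSCALE)
    # minNeighbors is the parameter varied in this work: raising it demands
    # more overlapping detections per face, lowering FAR at the cost of FRR.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=2)
    for (x, y, w, h) in faces:
        print('face rectangle:', x, y, w, h)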
2.2 SIF
This algorithm [8] has been developed in the Laboratory of Data Analysis of Tula State University. The main hypothesis is that the eyes appear as dark spots in the face image, which makes it possible to skip the routine scan of the image by sub-windows of different sizes.
At the beginning, the algorithm finds points of minimum brightness in the image; these points are then sorted, some of them are discarded, and the rest are grouped in pairs. The image fragments containing a pair of singular points are then photometrically normalized, affine transformed (for images containing only two singular points, only the following transformations can be applied: rotation, scaling with the same scale on both axes, and displacement) and projected onto lattices of fixed size. After these transformations, the lattices are represented as vectors of features (brightness values at the nodes) and are sent to a two-class SVM classifier, trained in advance on a large number of faces and non-faces. As the parameter for finding the trade-off between FAR and FRR, we changed the shift of the hyperplane separating the face and non-face classes in the feature space (brightness values of the lattice nodes).
2.3 Face Detection Library
The Face Detection Library (FDLib) was developed by Kienzle et al. [7]. The authors proposed a method for computing fast approximations to support vector decision functions (the so-called reduced set method) in the field of object detection. This method creates sparse kernel expansions that can be evaluated via separable filters. The algorithm has only one tuning parameter, which controls the "rigour" of face detection by changing the number of separable filters into which the reduced support vectors are decomposed.
2.4 UniS
The UniS algorithm was developed at the University of Surrey and is based on various methods. To find the trade-off between FAR and FRR, we changed the value of the face confidence threshold of UniS.
2.5 FaceOnIt
FaceOnIt (http://www.faceonit.ch) is a face detection SDK developed at the Idiap research institute [13,10]. It is based on the cascade architecture introduced by Viola-Jones and on an extension of Local Binary Patterns (LBPs). LBPs were proposed by Ojala et al. [11] for texture classification; their rotation invariance and computational lightness were later used by Ahonen et al. [1] to develop an effective and fast face recognition algorithm. As the parameter for finding the trade-off between FAR and FRR, we changed the value of the face confidence threshold.
2.6 FSDK and VL
FaceSDK (version 2.0) and VeriLook (version 4.0) were kindly provided by Luxand Inc. (http://www.luxand.com) and Neurotechnology (http://www.neurotechnology.com), respectively. These two algorithms are commercial products, and therefore no details of their principle of operation were disclosed. To find the trade-off between FAR and FRR, we changed the value of the
face confidence threshold for VL and, for FaceSDK, the parameter of the FSDK_SetFaceDetectionThreshold function affecting the detection threshold.
3 Models of Face Representations and Localization Accuracy
There are many different models of face representation in images: by the center of the face and its radius, by a rectangle (OCV, FDLib, FoI), by the coordinates of the centers of the eyes (SIF, UniS, FSDK, VL), by an ellipse, etc. In this work we represent faces by the coordinates of the centers of the eyes (i.e., the centers of the pupils) because, first, this representation appears more convenient for comparing results; second, face recognition algorithms usually require matched eye centers for their learning samples; and third, experts mark eyes faster, more easily and more precisely than they mark faces with rectangles. Thus, to unify the comparison of results, we suggest an eye reconstruction model, which receives a face location in the rectangle representation and returns estimated coordinates of the centers of the eyes.
Fig. 1. Schematic face representation. Eye_{Left} and Eye_{Right} – absolute coordinates of the detected left and right eye, respectively; l_{Eyes} – distance between the eyes' centers; l^{Left}_{HEyes}, l^{Right}_{HEyes}, l_{HEyes} – distance between the top border of the face and the center of the left eye, the right eye, or the eyes, respectively; Size_{Head} – size of the rectangle representing the face; D_{Eyes} – diameter of the area of acceptable deviation of the eyes' coordinates from the true eye locations Eye^{A}_{Right} and Eye^{A}_{Left}; Center_{Head} – absolute coordinates of the found face.
3.1 Localization Accuracy for Algorithms Describing Faces by Centers of the Eyes
If detected faces are represented by the centers of the eyes (Fig. 1.a), we consider a face to be correctly detected if and only if the detected eyes lie within the area around the true eye locations of diameter D_{Eyes}. This diameter depends on the distance between the eyes' centers and on α, which has been taken equal to 0.25 (this criterion was originally used by Jesorsky et al. [5]), and is calculated as D_{Eyes} = 2α × l_{Eyes}.
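A direct transcription of this acceptance criterion (our sketch, not the authors' code):

    import numpy as np

    def eyes_correct(det_left, det_right, true_left, true_right, alpha=0.25):
        # Jesorsky-style criterion: each detected eye must lie within a disc
        # of diameter D_Eyes = 2 * alpha * l_Eyes around the true eye center.
        l_eyes = np.linalg.norm(np.subtract(true_right, true_left))
        radius = alpha * l_eyes   # D_Eyes / 2
        ok_left = np.linalg.norm(np.subtract(det_left, true_left)) <= radius
        ok_right = np.linalg.norm(np.subtract(det_right, true_right)) <= radius
        return ok_left and ok_right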
3.2 Localization Accuracy for Algorithms Describing Faces by Rectangle
Assume there is a full-face portrait image with no incline (Fig. 1.b), and the algorithm has found the face's center and size (Center_{Head} and Size_{Head}, respectively). Obviously, the eyes in this image are located symmetrically about the vertical axis (i.e., at half the distance between them, l_{Eyes}/2) and at the same distance (l_{HEyes}) from the top border of the face rectangle. Thus the absolute coordinates of the eyes can be estimated as:

Eye^y_{Right} = Eye^y_{Left} = Center^y_{Head} + l_{HEyes} - \frac{1}{2} Size_{Head},
Eye^x_{Right} = Center^x_{Head} - \frac{1}{2} l_{Eyes},
Eye^x_{Left} = Center^x_{Head} + \frac{1}{2} l_{Eyes}.    (1)

Let us estimate the parameters of this model, namely l_{Eyes} and l_{HEyes}, as averages over a large number of images with expert-labeled eyes. Based on such an analysis, the following coefficients were found: A – the average proportion of the distance between the top border of the face and the centers of the eyes (l_{HEyes}) to the size of the face rectangle; and B – the average proportion of the distance between the eyes (l_{Eyes}) to the size of the face rectangle (Size_{Head}). They can be estimated using the information about the true eye locations in an image series:

A = \frac{1}{N} \sum_{i=1}^{N} \frac{l^i_{HEyes}}{Size^i_{Head}}, \qquad B = \frac{1}{N} \sum_{i=1}^{N} \frac{l^i_{Eyes}}{Size^i_{Head}},    (2)

where l^i_{HEyes}, l^i_{Eyes} and Size^i_{Head} are the respective parameters measured for the i-th image in a data set containing N objects. The coordinates of the eyes for a given face size and the coefficients of proportion (2) are therefore calculated according to the following equations:

Eye^y_{Right} = Eye^y_{Left} = Center^y_{Head} + Size_{Head} \left( A - \frac{1}{2} \right),
Eye^x_{Right} = Center^x_{Head} - Size_{Head} \cdot \frac{1}{2} B,
Eye^x_{Left} = Center^x_{Head} + Size_{Head} \cdot \frac{1}{2} B.    (3)
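Eq. (3) translates directly into code (our sketch; the per-algorithm values of A and B are those given later in Table 1):

    def eyes_from_rectangle(cx, cy, size, A, B):
        # (cx, cy): absolute center of the detected face rectangle;
        # size: its side length; A, B: proportion coefficients of Eq. (2).
        eye_y = cy + size * (A - 0.5)      # common vertical coordinate
        right_x = cx - size * 0.5 * B      # right eye, Eq. (3)
        left_x = cx + size * 0.5 * B       # left eye, Eq. (3)
        return (left_x, eye_y), (right_x, eye_y)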
Once the eyes' coordinates are estimated, the localization accuracy can be determined as described in Section 3.1. If there is a full-face portrait image with some incline, we take l_{HEyes} as the average distance between the center of each eye and the top border of the face, i.e., l_{HEyes} = \frac{1}{2}\left( l^{Left}_{HEyes} + l^{Right}_{HEyes} \right). It should be noted that such a "conversion" of the face representation could deteriorate the localization accuracy for algorithms describing faces by rectangles. However, the benefits of the conversion (discussed at the beginning of this section) outweigh the risks mentioned above.
4 Parameters of Results Evaluation
The results of each algorithm have been evaluated by the following parameters:

– False Rejection Rate (FRR) — the ratio of type I errors, which indicates the probability of misclassifying images containing a face;
– False Acceptance Rate (FAR) — the ratio of type II errors, which indicates the probability of misclassifying images not containing a face;
– Distance to the "exemplary" algorithm (d_{exemp}). We consider an FD algorithm to be "exemplary" (exemp) if its FAR and FRR both equal 0. The distance between the "exemp" algorithm and a given one is then d_{exemp} = \sqrt{FAR^2 + FRR^2};
– Speed parameters, such as the mean and median of the processing time (ms) per image from the dataset. These results were obtained in the following configuration: Intel Core2Duo 1.66 GHz, 2 GB RAM, Windows Vista HP.

The measure of dissimilarity of the algorithms' results is the normalized Hamming distance between a pair of vectors of algorithm errors:

d_H(X_i, X_j) = \frac{1}{N} \sum_{s=1}^{N} \left| x^{(s)}_i - x^{(s)}_j \right|.

Each vector component x^{(s)} equals 1 if the algorithm has correctly detected the face region in the corresponding (s-th) image of the dataset. Further, we will call the matrix of such distances (d_H ∈ [0, 1]) the matrix of dissimilarities, or d-matrix.
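The d-matrix entries are straightforward to compute (our sketch):

    import numpy as np

    def d_hamming(x_i, x_j):
        # Normalized Hamming distance between two binary result vectors,
        # where component s is 1 if the algorithm detected the face region
        # in the s-th image correctly.
        x_i, x_j = np.asarray(x_i), np.asarray(x_j)
        return np.abs(x_i - x_j).mean()

    # Example: d_hamming([1, 1, 0, 1], [1, 0, 0, 1]) returns 0.25.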
5 Image Dataset
All algorithms have been tested on the following manually marked datasets1:

1. Face Place: 1247 images (480×400 pixels) of 150 people taken from different angles, http://www.face-place.org/;
2. The IMM Face Database: 240 images (512×342 pixels) of 40 people, http://www.imm.dtu.dk/~aam/;
3. B. Achermann's face collection, University of Bern: 300 greylevel images (512×342 pixels) of 30 people, ftp://ftp.iam.unibe.ch/pub/Images/FaceImages/;
4. BioID: only the 1520 grayscale images containing one face have been used; complex background;
5. The Sheffield Face Database (previously known as The UMIST Face Database): only the 416 images containing two eyes have been used in this test;
6. PIE Database subset (one picture of each individual); complex background;
7. Indian Face Database: only the 513 images containing two eyes have been used;
8. The ORL Database of Faces;
9. Laboratory of Data Analysis (Tula State University) Face Database: 6973 color images, complex background, 320×240 pixels (see examples at http://lda.tsu.tula.ru/FD/TulaSU_FDB.zip).

For correct FAR estimation we used non-face images collected at Tula State University. This dataset contains 48211 color images, 320×240 pixels.
1 If no URL is given, it can be found at http://www.face-rec.org/databases/
The total test dataset size is 59888 images: 11677 faces and 48211 non-faces. The coefficients A and B were determined on the Georgia Tech Face Database.
6 Results
After the algorithms had been routinely tested, the FRR, FAR, d_{exemp}, speed and vectors of algorithm errors were obtained for each one. The coefficients of the model of face localization accuracy for algorithms representing faces by rectangles (see Section 3.2) are given in Table 1.

Table 1. Coefficients of the model for estimating the eyes' coordinates

Algorithm      FDLib   OCV     FoI
coefficient A  0.3332  0.3858  0.2646
coefficient B  0.3830  0.3666  0.5124
For each algorithm, varying its parameter and plotting the FAR against the FRR for all tested parameter values yields the receiver operating characteristic (ROC) curve in the standard biometric sense [3,16]. These curves (Fig. 2) let us determine the optimal algorithm (with its parameters) in a particular situation. For a more detailed analysis, we fix the parameters delivering the lowest d_{exemp} for each algorithm, and find the d-matrix (see Table 2), a two-dimensional FastMap [2] embedding of the d-matrix (Fig. 3) and the other evaluation parameters (see Table 3).

Table 2. Matrix of dissimilarity of algorithms with fixed parameters (parameter values in parentheses)
          OCV    SIF    FDLib  FSDK   UniS   FoI    VL     Exemp
OCV(2)    0      0.099  0.227  0.052  0.076  0.053  0.047  0.043
SIF(-3)          0      0.226  0.089  0.097  0.070  0.075  0.073
FDLib(1)                0      0.222  0.223  0.213  0.215  0.216
FSDK(5)                        0      0.069  0.043  0.035  0.030
UniS(20)                              0      0.051  0.053  0.047
FoI(5)                                       0      0.026  0.019
VL(2)                                               0      0.011
Exemp                                                      0
Analysis of the elements of the d-matrix leads to the conclusion that the most similar algorithms are FoI and VL, and the most different are FDLib and OCV. It should be mentioned that the d_H values between almost all tested algorithms (excluding FDLib) are smaller than 0.1. It is obvious that each of the algorithms has unique detection peculiarities. One way to evaluate them numerically is to compare the number of images uniquely classified by each algorithm, and the numbers of "challenging" and "easy" images (see Table 4). Here, "easy" images are images detected by all algorithms; in the opposite case, images are considered "challenging".
Fig. 2. The ROC plots. FAR (in log scale) against FRR. Perfect performance would be the bottom left corner: FRR = FAR = 0.
Fig. 3. Two-dimensional FastMap diagram obtained from the data in Table 2 (the closest five algorithms are also shown on an expanded scale in the frame)
Table 3. Results of the algorithms' testing with fixed parameters (parameter values in parentheses); time estimates (mean and median) are given in ms

Algorithm  FRR     FAR     d_exemp  Mean, ms  Median, ms
OCV(2)     0.0628  0.0423  0.0757   90        88
SIF(-3)    0.2362  0.0454  0.2405   260       254
FDLib(1)   0.4565  0.1868  0.4932   64        62
FSDK(5)    0.0805  0.0294  0.0857   1305      1041
UniS(20)   0.1444  0.0426  0.1505   176       149
FoI(5)     0.2222  0.0044  0.2222   84        85
VL(2)      0.0523  0.0062  0.0527   47        43
Table 4. Distribution of peculiar images over the datasets (the Dataset ID corresponds to the index in the image dataset list of Section 5)

Dataset ID            1     2    3    4     5    6   7    8    9     NonFaces
Number of images      1247  240  300  1520  416  68  513  400  6973  48211
"easy" images         315   31   121  713   28   44  4    248  2881  34093
"challenging" images  9     3    5    6     7    51
only OCV              2     5    1    6
only SIF              5
only FDLib            1     2
only FSDK             1     2    2    8
only UniS             1     2    3    18
only FoI              3     1    21   1
only VL               12    2    23   2     21
7 Discussion and Conclusion
In this work, seven FD algorithms were tested and a statistical model of eye position estimation for algorithms describing faces by rectangles was proposed (see Section 3.2). According to the results of our study, VeriLook has the best performance over various parameter settings and takes first place in the speed test (18-20 images per second). FDLib shows good speed characteristics (second place) but demonstrates the worst performance. OpenCV, the most popular and freely available FD algorithm, took second place in the performance test and has sufficient speed. SIF, developed at Tula State University, demonstrated average performance. It is worth noting that VeriLook has the largest number of uniquely classified images, i.e., images that were misclassified by the other algorithms. It should also be noted that about 64 % of the images were correctly processed by all algorithms; such images are called "easy" in this work. Though the best algorithms demonstrated excellent performance in this test, as much as 5-6 % of FRR still separates them from the desired error-free result. However, the percentage of "challenging" images in the dataset is
dramatically small: only 0.14 % of our dataset. Thus there is potential for FD algorithm refinement. We also believe that this performance "gap" can be eliminated by combining detectors. In the future we plan to offer an interactive web framework for FD algorithm testing. It should be emphasized that this research has a minor shortcoming: we partially used public image databases, and we cannot rule out that they were used in the learning process of the tested FD systems. We also look forward to collaboration with other developers of FD algorithms and with researchers who would like to share their image databases with us.
References

1. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
2. Faloutsos, C., Lin, K.I.: FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 163–174 (1995)
3. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
4. Hjelmas, E., Low, B.K.: Face Detection: A Survey. Computer Vision and Image Understanding 83(3), 236–274 (2001)
5. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust face detection using the Hausdorff distance. In: Bigun, J., Smeraldi, F. (eds.) AVBPA 2001. LNCS, vol. 2091, pp. 90–95. Springer, Heidelberg (2001)
6. Jones, M.J.: Face Recognition: Where We Are and Where To Go From Here. IEEJ Trans. on Electronics, Information and Systems 129(5), 770–777 (2009)
7. Kienzle, W., Bakir, G., Franz, M., Scholkopf, B.: Face detection – efficient and rank deficient. Advances in Neural Information Processing Systems 17, 673–680 (2005)
8. Krestinin, I.A., Seredin, O.S.: Excluding cascading classifier for face detection. In: Proc. of the 19th Int. Conf. on Computer Graphics and Vision, pp. 380–381 (2009)
9. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: Proc. Intern. Conf. on Image Processing 2002, pp. 900–903 (2002)
10. Marcel, S., Keomany, J., et al.: Robust-to-illumination face localisation using Active Shape Models and Local Binary Patterns. IDIAP-RR, vol. 47 (2006)
11. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
12. Papageorgiou, C.P., Oren, M., Poggio, T.: A general framework for object detection. In: Proc. of ICCV, pp. 555–562 (1998)
13. Sauquet, T., Rodriguez, Y., Marcel, S.: Multiview face detection. IDIAP-RR (2005)
14. Viola, P., Jones, M.: Rapid Object Detection using a Boosted Cascade of Simple Features. In: Proc. IEEE CVPR 2001 (2001)
15. Viola, P., Jones, M.J.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004)
16. Wechsler, H.: Reliable face recognition methods: system design, implementation and evaluation. Springer, Heidelberg (2007)
17. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1), 34–58 (2002)
On Supporting Identification in a Hand-Based Biometric Framework

Pei-Fang Guo1, Prabir Bhattacharya2, and Nawwaf Kharma1

1 Electrical & Computer Engineering, Concordia University, 1455 de Maisonneuve Blvd., Montreal, QC H3G 1M8, Canada
{pf_guo,kharma}@ece.concordia.ca
2 Computer Science Department, University of Cincinnati, 814 Rhodes Hall, Cincinnati, OH 45221-0030, USA
[email protected]

Abstract. Research on hand features has drawn considerable attention in the biometric identification field over the past decades. In this paper, feature generation is carried out by integrating genetic programming (GP) and the expectation maximization (EM) algorithm with a mean square error (MSE) fitness measure (GP-EM-MSE), in order to improve the overall performance of a hand-based biometric system. The GP program trees of the approach are used to find optimal generated feature representations in a nonlinear fashion; via EM, the learning task reduces to the simple k-means problem, which exhibits better convergence properties. As a subsequent refinement of the identification, GP-EM-MSE achieves a recognition rate of 96% using the generated features, better than the performance obtained with the selected primitive features.

Keywords: Feature generation, biometric identification, genetic programming, expectation maximization algorithm, mean square error, classification.
1 Introduction

Biometric identification is a verification approach that uses the biological features of each individual. Hand features have been widely used in designing biometric identification systems, and the associated challenges are well established [1], [2]. In pattern recognition, the analytical selection of features and the automatic generation of features are two distinct approaches, because they rely on different sources of information [3]. It may therefore be useful to explore new biometric identification systems that combine feature selection and feature generation. In this study, building upon our previous work on feature selection for hand images, we present an automatic feature generation method, using genetic programming and the expectation maximization algorithm with the mean square error measure (GP-EM-MSE), for a hand-based biometric identification system. The benefits of the combined feature selection and generation system are as follows:
• to increase the probability that errors of the individual feature selection are compensated for by the correct results of automatic feature generation;
• to raise the overall classification performance toward a higher recognition rate by improving the quality of the feature representations.
The organization of the paper is as follows: Section 2 briefly describes our previous work on feature selection for hand images. In Section 3, GP and the k-means problem via EM are described as the basis of the platform, including the evaluation phase. Section 4 presents identification results, and conclusions and future work are given in Section 5.
2 Previous Work on Feature Selection for Hand Images

Our previous work on hand images can be categorized as unsupervised feature selection for clustering hand images. The main tool for accomplishing this was a genetic algorithm (GA). The method, named the cooperative coevolutionary clustering algorithm (CCCA), was designed to search for a proper number of clusters of hand images (without prior knowledge of it), and simultaneously to achieve feature selection. CCCA was implemented using 100 hand images as the test set; a sample hand image is shown in Fig. 1. The results showed that the dimensionality of the clustering space was reduced from 84 original features to 41 selected features (11 geometric and 30 statistical features), with 4 clusters produced. At the end, each output cluster was labeled with the number of input image patterns per class assigned to it. The details of the CCCA method appear in our previous work [4], [5].
Fig. 1. A sample of a hand image
3 The GP-EM-MSE Biometric Identification System

The prerequisite for feature generation is the preparation of primitive feature sets. Using our previous method [4] on a dataset of 200 hand images, we succeeded in clustering the 200 images into 4 categories, with a total
number of 41 features selected as the primitive feature set for the current study of feature generation. Fig. 2 presents the feature generation method, which integrates GP and EM with the MSE fitness indicator (GP-EM-MSE). In this approach, GP program trees can be viewed as sequences of applications of functions (mathematical operators) to arguments (the 41 primitive features), which express the generated features as rational expressions in a straightforward way; via EM, the learning task reduces to a simple hypothesis of the k-means problem. In the next subsections, we describe each component of the biometric GP-EM-MSE system, including the steps of the computation.
Fig. 2. The GP-EM-MSE hand-based biometric identification system
3.1 The GP Application

GP maintains a population of individuals represented as program trees. In each iteration, it produces a new generation of individuals using reproduction, crossover, and mutation. The fitness of a given individual program in the population is typically determined by executing the program on a set of training data. The GP iteration is illustrated in Fig. 3; the details of the GP process can be found in [6].
Fig. 3. The diagram of the evolutionary GP iteration
The GP program trees in the approach. The choices of terminal arguments and functions provide the representation of GP program trees in the problem domain. In this study, we choose the following mathematical operators:
{+, −, ×, ÷, square root, sine, cosine, tan, exponential, absolute, square, negative},
which constitute the GP function set; the terminal arguments receive the 41 primitive features selected by the method of [4]. As shown in Fig. 4, an example rational expression, h2 × h5 − tg(h5 + h8), can be expressed as a GP program tree. On the tree, 'tg' is a mathematical operator that takes one terminal argument, and '+', '−' and '×' are operators that take two arguments. The argument hd, d = 2, 5, 8, is replaced with the (d+1)th primitive feature.

Fig. 4. A GP program tree for the rational expression h2 × h5 − tg(h5 + h8)
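To make the tree representation concrete, the following minimal Python sketch (our illustration, not the authors' implementation) evaluates such a program tree over a vector of primitive features; the tree of Fig. 4 is hard-coded, and the protected division and protected square root are common GP conventions assumed here.

```python
import numpy as np

# Minimal sketch (not the authors' implementation): a program tree is a nested
# tuple (operator, child, ...) whose integer leaves index primitive features.
UNARY = {"tg": np.tan, "sin": np.sin, "cos": np.cos, "abs": np.abs,
         "neg": np.negative, "sqrt": lambda x: np.sqrt(np.abs(x))}   # protected sqrt
BINARY = {"+": np.add, "-": np.subtract, "x": np.multiply,
          "/": lambda a, b: a / np.where(b == 0, 1e-12, b)}          # protected division

def evaluate(tree, h):
    """Evaluate a GP program tree on the primitive-feature vector h (length 41)."""
    if isinstance(tree, int):                  # leaf: index into primitive features
        return h[tree]
    op, *children = tree
    args = [evaluate(c, h) for c in children]
    return UNARY[op](*args) if op in UNARY else BINARY[op](*args)

# The tree of Fig. 4: h2 x h5 - tg(h5 + h8)
tree = ("-", ("x", 2, 5), ("tg", ("+", 5, 8)))
h = np.random.rand(41)                         # stand-in primitive features
print(evaluate(tree, h))
```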
The fitness function, MSE. The MSE measure is well known in function approximation and learning system theories when the number of classes is assumed to be known a priori. In this supervised feature generation problem, we employ the MSE measure as the fitness, which is given by [7]:

$$\text{MSE fitness} = \sum_{j=1}^{k}\sum_{i=1}^{m} D(c_j, r_i^j), \qquad (1)$$
where D is the Euclidean distance between the instance $r_i$ and its mean center $c_j$.

3.2 Derivation of the k-Means Problem via EM

Given the observed data R = {$r_i$}, divided into the k known classes, the hidden variables Z = {< $z_{i1}$, …, $z_{ij}$, …, $z_{ik}$ >} indicate the index of the jth mixture component that generates $r_i$. With this approximation, the data R can be described as a mixture of k Gaussians. The purpose is to fit the Gaussian density mixture to the data R by optimizing the mixture parameters, which include the mixing proportions, the mean values, and the covariance matrix. In practice, it is most common to assume equal mixing proportions (1/k) with a univariate case for all k categories [8], which simplifies the task to a hypothesis of the k-means problem [9]. Applying the maximum likelihood estimator, the k-means problem is broken down into two separate steps:

E-step: $$E[z_{ij}] = \frac{\exp\left(-\frac{1}{2\sigma^2}\,\|r_i - \mu_j\|^2\right)}{\sum_{n=1}^{k} \exp\left(-\frac{1}{2\sigma^2}\,\|r_i - \mu_n\|^2\right)} \qquad (2)$$
M-step: $$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, r_i}{\sum_{i=1}^{m} E[z_{ij}]} \qquad (3)$$
where μj is the mean value of the jth Gaussian, and E[zij] is the probability that ri is generated by the jth Gaussian distribution. Further details on the derivation of the k-means problem via EM can be found in [9], [10]. In the implementation, the data R are replaced with the generated data produced by GP-EM-MSE. The EM steps, expressed in Eqs. (2) and (3), are executed on the data R, and constitute a joint loop with the GP iteration.

3.3 The GP-EM-MSE Recursive Computation

The steps of the GP-EM-MSE computation are summarized in Table 1.

Table 1. The GP-EM-MSE recursive computation

Note: begin from each population of the generated features, P_feature p.
FOR index p = 1 TO 32 populations DO
    FOR index = 1 TO 100 images DO
        Perform the data transformation based on the variation of the GP program trees.
    END
    Implement the E-step, expressed in Eq. (2), to estimate the expected values E[zij].
    Implement the M-step, expressed in Eq. (3), to revise the values of the k-means.
    Replace the k-means with the new revised values.
    Evaluate each of the populations by the MSE fitness, expressed in Eq. (1).
    Vary and generate new populations by applying reproduction, crossover and mutation.
END
Save the best generated features.
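As a hedged sketch of the inner EM loop of Table 1, the fragment below implements the E- and M-steps of Eqs. (2) and (3) on a one-dimensional generated feature, under the equal-mixing, shared-variance assumption stated above; the iteration count, σ, and initialization are illustrative choices, not the authors' settings.

```python
import numpy as np

def em_kmeans(r, k, sigma=1.0, iters=50, seed=0):
    """EM for the k-means hypothesis of Eqs. (2)-(3): equal mixing proportions
    and a shared univariate variance sigma^2 are assumed (illustrative only)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(r, size=k, replace=False)        # initial means
    for _ in range(iters):
        # E-step (Eq. 2): responsibility of Gaussian j for each sample r_i
        d2 = (r[:, None] - mu[None, :]) ** 2
        w = np.exp(-d2 / (2 * sigma**2))
        ez = w / w.sum(axis=1, keepdims=True)
        # M-step (Eq. 3): re-estimate the k means
        mu = (ez * r[:, None]).sum(axis=0) / ez.sum(axis=0)
    # MSE fitness (Eq. 1): sum of distances to the nearest mean center
    mse = np.sum(np.min(np.abs(r[:, None] - mu[None, :]), axis=1))
    return mu, mse

mu, mse = em_kmeans(np.random.randn(100), k=4)       # 100 generated values, 4 classes
```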
3.4 The Evaluation Phase Using the Classifiers MDC and KNN

The task of identification is to classify samples into categories so that samples in the same category are as similar as possible and samples in different categories are as dissimilar as possible. Mathematically, the problem of classification can be formulated in terms of a predetermined measure [11]. For the minimum distance classifier (MDC), the distance is defined as an index of similarity, so that the minimum distance corresponds to the maximum similarity; for K nearest neighbors (KNN), each image pattern is assigned to the most frequent class among its neighbors based on a similarity measure. In this study, the Euclidean distance measure is employed for both MDC and KNN in the evaluation.
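For completeness, a minimal sketch of our own (assuming integer class labels and the Euclidean distance stated above) of the two evaluation classifiers:

```python
import numpy as np

def mdc_predict(X_train, y_train, X_test):
    """Minimum distance classifier: assign each test sample to the class
    whose training mean is nearest in Euclidean distance."""
    classes = np.unique(y_train)
    centers = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_test[:, None, :] - centers[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

def knn_predict(X_train, y_train, X_test, k=3):
    """K nearest neighbors: majority vote among the k closest training samples
    (non-negative integer labels assumed for np.bincount)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(y_train[idx]).argmax() for idx in nearest])
```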
4 Identification Experiments

In the implementation, we divided the 200 images into training and testing sets, each with 100 images. Using the 41 primitive features as input, the GP-EM-MSE training was run with a maximum tree depth of 5 and a population size of 32.
4.1 Feature Generation Results

We ended the GP-EM-MSE training after running 400 iterations. The total computation time was about three minutes on a Pentium 4 at 1.60 GHz. The two features, P_feature 1 and P_feature 2, produced by GP-EM-MSE are as follows:

$$P\_feature\,1 = \cos\{\sin[h_{0} \times h_{39} \times \mathrm{tg}(h_{22} \times h_{21})] \times \mathrm{tg}[\,|h_{32} \times h_{0}| + h_{39} \times h_{27}\,] + |\sin[(h_{30} h_{7})^{2} - e^{h_{2} \times h_{19}}] - (h_{12} \times h_{31} + h_{14} \times h_{26})^{2}|\}$$

and

$$P\_feature\,2 = \cos\left(\frac{\cos[-(h_{32}+h_{25}) \times (h_{9}+h_{28})]}{h_{25} + \cos(h_{9}) \times \frac{h_{8}}{h_{23}} + \cos(h_{35} \times h_{20}) + \frac{h_{10}}{h_{8}} + \frac{h_{8}}{h_{22} \times \sin(h_{25}+h_{32})}}\right),$$
where hd is the (d+1)th feature of the 41 primitive features; the convergence results are presented in Fig. 5. In the sequel, the resulting generated features will be employed to identify hand images.

Fig. 5. Convergence results (MSE fitness versus iterations) for the two features produced by GP-EM-MSE
4.2 Identification Results

The goal of the identification is to classify hand images into a known number of 4 categories, or classes, according to the decision space of the generated features. For the MDC classifier, Table 2 shows the confusion matrix of the identification performance for the four classes on the test set, using the two features, P_feature 1 and P_feature 2, produced by GP-EM-MSE. Each row of the table represents the identification performance for a given category, while each column gives the percentage of samples assigned to each class. It can be observed from Table 2 that the difference in identification accuracy among the classes is small, ranging from 95.00% to 96.96%.
Table 2. Identification accuracy (%) using two features, P_feature 1 and P_feature 2, produced by GP-EM-MSE with MDC

categories               class I   class II   class III   class IV
class I (33 samples)     96.96     3.04       0           0
class II (25 samples)    0         96.00      0           4.00
class III (22 samples)   4.55      0          95.45       0
class IV (20 samples)    0         5.00       0           95.00
4.3 Comparison Results

The classifiers MDC and KNN are used to assess the capability of different feature sets over different classification systems. In terms of identification accuracy, Table 3 compares the features produced by GP-EM-MSE with MDC against the 41 primitive features selected by the method of [4] with KNN, in which the value of K is set by the number of input primitive features (K = 41).

Table 3. Comparison between the features produced by GP-EM-MSE / MDC and the 41 primitive features selected by [4] / KNN, in terms of identification accuracy (%)

feature types / classifier                      no. of features   identification accuracy (%)
primitive features selected by [4] / KNN        41                93.0
P_feature 1 / MDC                               1                 92.0
P_feature 1, P_feature 2 / MDC                  2                 96.0
It can be seen from Table 3 that the identification accuracy using the single feature P_feature 1 is slightly lower than that using the 41 primitive features. However, the combination of the two features P_feature 1 and P_feature 2 produced by GP-EM-MSE achieves the best identification performance, with an accuracy of 96.0%.
5 Conclusion and Future Work

As an extension of the CCCA feature selection method in [4], the purpose of GP-EM-MSE in this study is to find improved feature representations in an optimal control environment in order to further minimize identification errors for the hand-based biometric system. GP-EM-MSE has succeeded in producing effective features that identify hand images with improved performance. The previous CCCA method evolved without supervision, in such a way that the selected feature sets operated globally to cluster hand images. The current study of GP-EM-MSE for feature generation is, consequently, a supervised learning algorithm built on the results of CCCA. For not-so-simple problems, the GP-EM-MSE method offers a customizable general-purpose research platform that jointly optimizes feature representations, demonstrating that such a combination of feature selection and generation leads to a higher recognition rate.
Our future work will be dedicated to examining the robustness of the system on difficult pattern identification problems and large sample databases.

Acknowledgments. This work is supported by grants from NSERC.
References

1. Zheng, G., Wang, C.-J., Boult, T.E.: Application of projective invariants in hand geometry biometrics. IEEE Trans. Information Forensics and Security 2, 758–768 (2007)
2. Zhang, D., Kong, W., You, J.: Online palmprint identification. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1041–1050 (2003)
3. Rizki, M.M., Zmuda, M.A., Tamburino, L.A.: Evolving pattern recognition systems. IEEE Trans. Evol. Comput. 6, 594–609 (2002)
4. Kharma, N., Suen, C.Y., Guo, P.-F.: Palmprints: a cooperative co-evolutionary algorithm for clustering hand images. Int. J. Image and Graphics 5, 595–616 (2005)
5. Guo, P.-F.: Palmprints: a cooperative co-evolutionary clustering algorithm for hand-based biometric identification. M.A.Sc. thesis, Concordia University, Montreal, Canada (2003)
6. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge (1994)
7. Lee, C.-Y.: Efficient automatic engineering design synthesis via evolutionary exploration. Ph.D. thesis, California Institute of Technology, Pasadena, California (2002)
8. Raymer, M.L., Doom, T.E., Kuhn, L.A., Punch, W.F.: Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm. IEEE Trans. Syst., Man, Cybern., Part B 33, 802–813 (2003)
9. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
10. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
11. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, Boston (1990)
Selecting Optimal Orientations of Gabor Wavelet Filters for Facial Image Analysis Tianqi Zhang1 and Bao-Liang Lu1,2 1
Dept. of Computer Sci. & Eng., Shanghai Jiao Tong University, Shanghai 200240, China 2 MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems Shanghai Jiao Tong University, Shanghai 200240, China {ztq121,bllu}@sjtu.edu.cn
Abstract. Gabor wavelet-based methods have been widely used to extract representative features for face analysis. However, the existing methods usually suffer from the high computational complexity of the Gabor wavelet transform (GWT), and the Gabor parameters are fixed to a few conventional values which are assumed to be the best choice. In this paper we show that, for some facial analysis applications, the conventional GWT can be simplified by selecting the most discriminating Gabor orientations. In the selection process, we analyze the histogram of oriented gradient (HOG) of the average face image in a dataset, and eliminate the less significant orientation combinations. We then traverse the remaining combinations and select the best according to classification performance. We find that the selected orientations match the analysis of the HOG well, and are therefore consistent with the intrinsic gradient characteristics of human face images. In order to assess the performance of the selected Gabor filters, we apply the proposed method to two tasks: face recognition and gender classification. The experimental results show that our method improves the accuracy of the classifiers and reduces the computation cost.
1 Introduction

Facial image analysis plays an important role in human-computer interaction studies because a machine can detect much useful information about its users from face images, such as identity, age, gender and expression, which makes existing human-computer interaction systems more intelligent and easier to use. However, building an automatic system to detect such information is a challenging problem which attracts a great deal of research interest in the computer vision realm. The key to this problem is to extract the most representative features from facial images for classification. The Gabor wavelet transform (GWT) [1] [2] was first introduced as a model of the simple cells in the human visual cortex. It derives desirable facial features characterized by spatial frequency, spatial locality, and orientation selectivity to cope with the variations due to
This work was partially supported by the National Natural Science Foundation of China (Grant No. 60773090 and Grant No. 90820018), the National Basic Research Program of China (Grant No. 2009CB320901), the National High-Tech Research Program of China (Grant No. 2008AA02Z315), and the Science and Technology Commission of Shanghai Municipality (Grant No. 09511502400).
illumination and facial expression changes. Gabor-based features have been widely used in many face analysis tasks, such as face recognition [3] [4] and gender classification [5]. However, the huge dimensionality of Gabor features can be prohibitive for statistical learning and pattern classification methods. In order to solve this problem, subspace methods have been introduced to reduce the dimension of Gabor features. For face recognition, Liu and Wechsler applied the Fisher discrimination model [4], independent component analysis (ICA) [6] and principal component analysis (PCA) [3] with an evenly spaced Gabor filter set of 40 filters (5 scales × 8 orientations). An alternative way to bypass the dimensionality problem is to use a non-statistics-based face representation, as Zhang et al. did with their local Gabor binary pattern histogram sequence (LGBPHS) method [7] for face recognition. Later, Xia et al. [8] managed to map the LGBPHS feature into a low-dimensional space with an LDA-like method called the local Gabor binary mapping pattern (LGBMP), so that classic statistical classifiers like SVM can be used for gender classification. Although subspace methods significantly reduce the dimension of Gabor features, these Gabor-based methods still suffer from the high computational complexity of the Gabor wavelet transform and are ill-suited to speed-sensitive scenarios. Moreover, the popular Gabor parameters, 5 scales × 8 orientations, have been assumed to be the best choice in many studies ([3] [4] [6] [7] [8]) without careful discussion and examination of their performance. A few efforts [9] [10] have been made to reduce the number of filters. In [9] the Information Diagram concept was adopted for Gabor parameter selection, and in [10] a minimum set of Gabor filters was arranged to cover the frequency plane. However, in these two studies the intrinsic characteristics of face images were not considered. In this study, we propose a method to reduce the number of Gabor filters while maintaining and even improving classification accuracy. Our work is motivated by the intrinsic gradient distribution of human face images and the redundancy of the Gabor filters when using equally divided orientations. We analyze the histogram of oriented gradient (HOG) of the average face image in a dataset, and eliminate the less significant orientation combinations. We then traverse the remaining combinations and select the best according to cross-validation accuracy. The selection process is quite simple, but also very effective and consistent with the face gradient distribution. The performance and generalization of the selected Gabor filters are evaluated in face recognition and gender classification experiments on different datasets. The results show that the selected Gabor filters are optimal and can represent the face images better with less computation cost. We also examined the generalization capability of our optimal Gabor filters by applying the image Euclidean distance (IMED) [11] transform to the input images. IMED has been used to gain tolerance to small deformations of face images [12] [13]. The selected Gabor filters also show steady performance after the IMED transform. The details of our selection method are discussed in Sec. 2, and the evaluation methods and results on public datasets are given in Sec. 3.
2 Gabor Filter Selection The motivation of using Gabor-based feature is mostly biological, since Gabor-like receptive fields have been found in the visual cortex of primates [14] [15] [16]. These
neurological studies partially support the evenly spaced Gabor filters that are commonly used in a variety of computer vision tasks. The research in [16] proved that the Gabor wavelet representation is optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency. Field also suggested that the evenly tuned orientations and spatial frequencies of mammalian simple cells help cope with the uncertainty of natural images by preserving a certain level of redundancy, allowing information to be transmitted with only a subset of the total number of cells [15]. This kind of redundancy is crucial for mammals due to the variations in scaling, rotation, shift and illumination of real-world objects, but is not necessarily useful for computer vision problems, because we usually have aligned and normalized face images as inputs. Therefore most of the rotation uncertainty can be avoided, and the redundancy of Gabor orientations can be reduced. We call this Gabor filter selection because the reduction is done by selecting the best orientation combination in the parameter space.

2.1 Gabor Wavelet Filters

Based on the Gabor wavelet transform, a family of Gabor kernels is defined as [3]:

$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2}\, e^{-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}} \left[ e^{i k_{\mu,\nu} z} - e^{-\frac{\sigma^2}{2}} \right], \qquad (1)$$

where μ and ν define the orientation and scale of the Gabor kernels, z = (x, y), and $k_{\mu,\nu} = k_\nu e^{i\phi_\mu}$ is the wave vector, in which $k_\nu = k_{max}/f^\nu$ and $\phi_\mu = \pi\mu/8$. Here $k_{max}$ is the maximum frequency and f is the spacing factor between wavelets in the frequency domain. Traditionally, ν ∈ {0, …, 4} and μ ∈ {0, 1, …, 7} are used for face images. In the following we will show that the evenly divided $\phi_\mu$ are redundant and can be improved for better performance.
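A hedged NumPy sketch of sampling the kernel of Eq. (1) follows; the parameter values kmax = π/2, f = √2, σ = 2π and the window size are conventional choices in the cited literature, assumed here rather than prescribed by this paper.

```python
import numpy as np

def gabor_kernel(mu, nu, size=31, kmax=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """Sample the Gabor kernel of Eq. (1) on a size x size grid.
    kmax, f and sigma are common (assumed) parameter choices."""
    k = (kmax / f**nu) * np.exp(1j * np.pi * mu / 8)    # wave vector k_{mu,nu}
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z = x + 1j * y
    kz = (k * z.conj()).real                             # dot product k_{mu,nu} . z
    k2, z2 = np.abs(k) ** 2, np.abs(z) ** 2
    return (k2 / sigma**2) * np.exp(-k2 * z2 / (2 * sigma**2)) \
           * (np.exp(1j * kz) - np.exp(-sigma**2 / 2))

# Filtering an image then amounts to convolving it with the kernel for each (mu, nu).
psi = gabor_kernel(mu=0, nu=0)
```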
2.2 Facial Statistics Analysis

Gabor filters can be considered orientation- and scale-tunable edge and line detectors, since they only respond to specifically oriented textures at a specific scale. In [15], it was suggested that the efficiency of a coding method depends on the statistics of the inputs. Therefore, by analyzing the intrinsic gradient characteristics of human face images, we can use fewer but specially tuned Gabor filters to acquire more discrimination capability and avoid unnecessary noise. We borrow the histogram of oriented gradient (HOG) [17] technique to analyze the significant gradient orientations of face images. Given a dataset, we generate its average face A, and then apply the Sobel operators [18] to A to obtain the horizontal and vertical gradient images $G_x$ and $G_y$, and the orientation image $G_\Theta$:

$$G_x = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} * A, \qquad G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A \qquad (2)$$

$$G_\Theta = \arctan\!\left(\frac{G_y}{G_x}\right) \qquad (3)$$

The histogram of $G_\Theta$ reflects the general orientation distribution of the images in the dataset. To make the Sobel operation region the same size as the Gabor window, the average face A is first smoothed by a 3 × 3 mean filter.
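The analysis of Eqs. (2) and (3) can be sketched as follows (our illustration, not the authors' code); the bin count and the gradient-magnitude weighting of the histogram are our assumptions, and arctan2 folded to [0, π) replaces the plain arctan for numerical robustness.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def face_hog(A, bins=64):
    """Orientation histogram of the average face A (Eqs. 2-3).
    A 3x3 mean smoothing precedes the Sobel operators, as in the text."""
    A = uniform_filter(A.astype(float), size=3)
    sx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], float)
    gx = convolve(A, sx)                   # Eq. (2), horizontal gradient G_x
    gy = convolve(A, sx.T)                 # Eq. (2), vertical gradient G_y
    theta = np.arctan2(gy, gx) % np.pi     # Eq. (3), folded to [0, pi)
    hist, edges = np.histogram(theta, bins=bins, range=(0, np.pi),
                               weights=np.hypot(gx, gy))   # magnitude-weighted (assumed)
    return hist, edges
```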
2.3 Straightforward Filter Selection

The HOG analysis of the average face reveals the statistics of the gradient distribution in a dataset, but HOG analysis alone is not enough for selecting optimal Gabor filters. On the one hand, the HOG of a dataset does not always have a proper number of obvious peaks; the curve can have various shapes, making it hard to identify the most significant orientation combination. Thus, the HOG analysis can only be used as a reference. On the other hand, and more importantly, we should also consider the orthogonality of the orientations in addition to their HOG significance. Evidence has shown that adjacent simple cells demonstrate an orthogonal phase-selective property [19]. And according to [15], as long as two phase relations are in quadrature (i.e., they differ by 90°), it is not critical what phases are involved.
Fig. 1. Three examples of average face images (left) and the corresponding HOGs (right). Datasets used here from top to bottom are C-PM00, C-PM15 and C-PM30, respectively.
To take orthogonality into consideration, we restrict the selection to an orthogonal orientation space, such as the commonly used [0, π/8, 2π/8, 3π/8, …, 7π/8]. We use a mixed method to select optimal orientations automatically from this space. Assume the mean intensity of the HOG vector is $\mu_H$. For all k-orientation combinations $\{\delta_k^i \mid i = 1, 2, \ldots, \binom{8}{k}\}$ in the orientation base, we examine the average HOG intensity of the ith combination, $\mu_k^i$, and narrow our selection base down to the set $\Omega = \{\delta_k^i \mid \mu_k^i \geq \mu_H\}$. A traversal is carried out over Ω, and the combination with the highest classification performance is chosen as the optimal orientations. Ω is usually fairly small and can be traversed in reasonable time. Next we show that the result of this selection is confirmed by both the shape of the facial HOG curve and the classification performance. Fig. 1 shows the average faces we generate from three different poses (facing front, deflection angle from 0° to 30°) and their HOGs (blue curves). The gradient distribution peaks imply that the most significant gradient changes occur in these orientations. The vertical green lines indicate the best 4 orientations we find in the $\binom{8}{4}$ combination space. Note that lines at '0' are drawn at 'π'. We can see that the green lines are near the peaks of the HOG curve, indicating that Gabor filters specially tuned to these orientations do indeed produce better features for classification. Red double arrows represent the orthogonal relationships between orientations, which coincides with the observations in [19] and [15] about the orthogonal property of adjacent simple cells. One exception in Fig. 1 is pose C-PM30, which sacrifices one pair of orthogonal orientations for better orientation significance. This also shows that our selection method balances orientation significance and orthogonality. The matching of the selected orientations to the HOG peaks indicates that these statistical characteristics can be useful for classification. In fact, according to our gender classification experiment, the two best orientations already contain as much gender information as all 8 orientations do.
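The pruning-plus-traversal procedure can be sketched as below (our illustration); `cv_accuracy` stands for the user-supplied cross-validation evaluation of a classifier trained on the features of a given orientation subset, and is a hypothetical callback.

```python
from itertools import combinations
import numpy as np

def select_orientations(hog8, k, cv_accuracy):
    """Select the best k of the 8 orientations [0, pi/8, ..., 7pi/8].
    hog8[m] is the HOG mass near orientation m*pi/8; cv_accuracy(subset)
    returns a cross-validation score for that subset (hypothetical)."""
    mu_H = np.mean(hog8)                         # mean HOG intensity
    # Keep only combinations whose average HOG intensity reaches mu_H.
    omega = [c for c in combinations(range(8), k)
             if np.mean([hog8[m] for m in c]) >= mu_H]
    # Traverse the (usually small) reduced set and keep the best-scoring one.
    return max(omega, key=cv_accuracy)
```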
3 Experiments

3.1 Set-Up

In this section we assess the performance of the selected Gabor filters in both face recognition and gender classification experiments. The framework of the feature extraction method used for comparison is described here. For face recognition, we follow the Gabor-based PCA method used in [3], and for gender classification we adapt the more recent framework using LGBP features [5,8] that originated with the work of [7]. In the pre-processing step, with the eye positions located, a face image is first cropped and normalized. The result is then fed to the selected Gabor filters to produce Gabor magnitude pictures (GMPs). For face recognition, the GMPs are down-sampled by a factor ρ, normalized according to [3], and finally fed to PCA to generate lower-dimensional features. For gender classification, the GMPs are converted to LGBP images with the $LBP_{8,1}^{u2}$ operator, and each is then divided into 10 × 10 non-overlapping rectangular regions, so that they can be mapped to LGBMP features, as proposed by Xia et al. [8].
The optimal orientations used in Sec. 3 are selected prior to the experiments by the unsupervised process described in Sec. 2, according to gender classification cross-validation accuracy on the CAS-PEAL [20] dataset. The same orientation set was later used for the face recognition experiment on the FERET [21] dataset in Sec. 3.3 and on the IMED [11] transformed images in Sec. 3.4, to show the generalization of the optimal orientations selected by our method.

3.2 Dataset

For frontal face recognition, we conduct our experiment on the FERET [21] database with 600 frontal face images belonging to 200 subjects. Of every three images corresponding to the same subject, one is taken under different illumination and another with a different facial expression. We randomly chose two images of each subject for training and left the remaining image for testing. This results in a training set of 400 images and a test set of 200 images. For pose-angled gender classification, we use both FERET frontal images and the CAS-PEAL [20] database to compose multiple datasets of different poses. CAS-PEAL is a large-scale face database that currently contains 30,864 facial images with 9 different poses (vertically facing up, middle and down, each coupled with a horizontal deflection angle of 0°, 15° and 30°); some sample images are given in Fig. 2. For each pose, and also for the frontal images in FERET, we randomly chose about 600 training images and left about 400 images for testing. The detailed numbers are shown in Table 1.

Table 1. Training and test data for gender classification, with the male/female composition of each set. 'C-PD00' means the CAS-PEAL subset facing down with a deflection angle of 0°.

Dataset   Training Male/Female   Test Male/Female
C-PD00    311/311                284/134
C-PD15    296/296                220/127
C-PD30    295/295                220/127
C-PM00    282/282                285/134
C-PM15    310/310                221/127
C-PM30    295/295                221/127
C-PU00    311/311                284/134
C-PU15    296/296                220/127
C-PU30    296/296                220/127
FERET     282/282                307/121
3.3 Face Recognition

For comparison purposes, we implemented the Gabor-based PCA method from [3] using 5 scales and 8 orientations (5 × 8) of Gabor filters, and then replaced them with the independently selected filter set (5 × 4). Fig. 3 shows the performance of these two different sets of Gabor filters. We can see that the recognition rate of the selected Gabor filters is slightly better than that of the full 5 × 8 set. In particular, the selected Gabor
Fig. 2. Normalized sample images from the CAS-PEAL dataset in different poses. From top to bottom: C-PU, C-PM and C-PD; from left to right: angles 0°, 15° and 30°.
Fig. 3. Face recognition performance of the PCA method with two different sets of Gabor filters: Gabor 5 × 4 and Gabor 5 × 8
filter set achieves 100% face recognition accuracy using 365 features, while the full 5 × 8 set achieves 98.5%. This comparison demonstrates both the effectiveness and the generalization performance of the selected Gabor filters. Another advantage of selecting Gabor filters is that it reduces the dimension and computing time of the raw Gabor features, which is very helpful for speed-sensitive applications. Subspace methods like PCA benefit from lower-dimensional features as well, since solving the eigenvalue problem of a huge matrix is extremely time-consuming. For the 5 × 8 = 40 filters, the dimension of the Gabor features can be as much as 10,000 even after down-sampling, which results in a PCA coefficient matrix of 10000 × 10000. With the selected Gabor filters, we can reduce the size to about 5000 × 5000. The summarized comparison is in Table 2 (the numbers are approximate).
3.4 Gender Classification

We employ the LGBMP method [8] as the baseline of our experiments, with the parameters set to 5 scales and 8 orientations. We then select the best 1, 2, …, 7 orientations from the full family of 8 to show the advantage of the selected filters. Fig. 4 shows that the best accuracy comes with 4 or 5 orientations, which corresponds to the number of significant peaks in the HOG of the corresponding dataset. In fact, using 4 or 5 orientations is also a trade-off between keeping information and avoiding noise: too few filters cannot give enough information for classification, while too many filters may bring too much noise. As mentioned before, Fig. 4 indicates that we can use only 25% of the original features while maintaining competitive performance; that is, from a dimension of 4000 (5 scales × 8 orientations × 10 × 10 regions) down to only 1000. The computing time is also reduced greatly, to nearly real-time. The detailed comparison is shown in Table 2. Fig. 4 also demonstrates that the effectiveness of our filter selection method is not affected by the IMED transform.
Fig. 4. The average accuracy over all 10 datasets using different orientation numbers and with/without IMED transform. The approximate computing time is also given.
Fig. 5. The accuracy with different filter sets (full 8 orientations, best 4 orientations, random 4 orientations) on each dataset. 1: FERET; 2: C-PD00; 3: C-PD15; 4: C-PD30; 5: C-PM00; 6: C-PM15; 7: C-PM30; 8: C-PU00; 9: C-PU15 and 10: C-PU30.
Fig. 5 shows the effect of orientation selection on every dataset we used. The average accuracy of randomly chosen orientations is also given for comparison. An interesting fact is that orientation selection brings less improvement on C-PD00, C-PM00 and C-PU00, which are also the datasets with less significant HOG peaks. When the face deflection angle is 0°, the image is horizontally symmetrical and so is the HOG curve; therefore the characteristic orientations may not be as obvious as those of the other datasets.

Table 2. Performance comparison of optimal Gabor orientations and the full Gabor family. Time is measured on an Intel Pentium 4 2.8 GHz PC.

Experiment              No. of Filters   Feature Dim.   Time    Recog. Rate / Accuracy
Face Recognition        5 × 8 = 40       10000          0.36s   98.5%
                        5 × 4 = 20       5000           0.18s   100%
Gender Classification   5 × 8 = 40       4000           0.4s    96.17%
                        5 × 4 = 20       2000           0.2s    97.02%
                        5 × 2 = 10       1000           0.1s    96.17%
4 Conclusion

This paper presents a novel criterion to select the most discriminating Gabor filters for facial image analysis. We first exclude the less significant orientation combinations according to HOG analysis and then traverse the rest to choose the best combination. The result of this method has been shown to match the HOG analysis well, and is therefore consistent with the intrinsic gradient characteristics of human face images. We evaluated the selected Gabor filters on face recognition and gender classification tasks, and the experimental results show that the selected Gabor filters are optimal and can represent face images better with less computational time.
References

1. Daugman, J.: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A 2(7), 1160–1169 (1985)
2. Marcelja, S.: Mathematical description of the responses of simple cortical cells. Journal of the Optical Society of America 70(11), 1297–1300 (1980)
3. Liu, C.: Gabor-based kernel PCA with fractional power polynomial models for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5), 572–581 (2004)
4. Liu, C., Wechsler, H.: A Gabor feature classifier for face recognition. In: Proceedings of IEEE International Conference on Computer Vision, vol. 2 (2001)
5. Lian, H., Lu, B.: Multi-view gender classification using local binary patterns and support vector machines. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3972, p. 202. Springer, Heidelberg (2006)
6. Liu, C., Wechsler, H.: Independent component analysis of Gabor features for face recognition. IEEE Transactions on Neural Networks 14(4), 919–928 (2003)
7. Zhang, W., Shan, S., Gao, W., Chen, X., Zhang, H.: Local Gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition. In: Proceedings of ICCV, pp. 786–791 (2005)
8. Xia, B., Sun, H., Lu, B.: Multi-view gender classification based on local Gabor binary mapping pattern and support vector machines. In: Proceedings of IEEE International Joint Conference on Neural Networks, pp. 3388–3395 (2008)
9. Moreno, P., Bernardino, A., Santos-Victor, J.: Gabor parameter selection for local feature detection. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3522, pp. 11–19. Springer, Heidelberg (2005)
10. Wu, H., Yoshida, Y., Shioyama, T.: Optimal Gabor filters for high speed face identification. Vectors 3, 5 (2002)
11. Wang, L., Zhang, Y., Feng, J.: On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1334–1339 (2005)
12. Li, J., Lu, B.: An adaptive image Euclidean distance. Pattern Recognition 42, 349–357 (2009)
13. Zhu, S., Song, Z., Feng, J.: Face recognition using local binary patterns with image Euclidean distance. In: Proceedings of SPIE, vol. 6970, 67904Z (2007)
14. Burr, D., Morrone, M., Spinelli, D.: Evidence for edge and bar detectors in human vision. Vision Research 29, 419–431 (1989)
15. Field, D.: Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A 4, 2379–2394 (1987)
16. Jones, J., Palmer, L.: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology 58, 1233 (1987)
17. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, p. 886 (2005)
18. Sobel, I., Feldman, G.: A 3x3 isotropic gradient operator for image processing. Presentation for the Stanford Artificial Intelligence Project (1968)
19. Pollen, D.A., Ronner, S.F.: Phase relationships between adjacent simple cells in the cat. Science 212, 1409–1411 (1981)
20. Gao, W., Cao, B., Shan, S., Chen, X., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Transactions on Systems, Man and Cybernetics, Part A 38, 149 (2008)
21. Phillips, P., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face-recognition algorithms. Image and Vision Computing 16, 295–306 (1998)
Human Detection with a Multi-sensors Stereovision System Y. Benezeth1 , P.M. Jodoin2 , B. Emile3 , H. Laurent4 , and C. Rosenberger5 1
Orange Labs, 4 rue du Clos Courtel, 35510 Cesson-Sévigné, France
2 MOIVRE, Université de Sherbrooke, BP 91226, Sherbrooke, J1K 2R1, Canada
3 Institut PRISME, Université d'Orléans, IUT de l'Indre, 36000 Châteauroux, France
4 ENSI de Bourges, Institut PRISME, 88 bd Lahitolle, 18020 Bourges Cedex, France
5 GREYC, ENSICAEN, Université de Caen - CNRS, 14000 Caen, France
[email protected]
Abstract. In this paper, we propose a human detection process using far-infrared (FIR) and daylight cameras mounted in a stereovision setup. Although daylight and FIR cameras have long been used to detect pedestrians, they nonetheless suffer from known limitations. In this paper, we present how both can collaborate in a stereovision setup to reduce the false positive rate inherent to their individual use. Our detection method is based on two distinct steps. First, human positions are detected in both FIR and daylight images using a cascade of boosted classifiers. Then, both results are fused based on the geometric information of the stereovision system. In this paper, we present how human positions are localized in the images, and how the decisions taken by each camera are fused together. In order to gauge performance, a quantitative evaluation based on an annotated dataset is presented.
1 Introduction
Techniques for locating humans in still images and videos have long been studied. They are now used in all kinds of military and civilian applications such as surveillance and security, energy-saving control of air-conditioning and lighting in offices, or simply counting people entering and leaving a building, to name a few [1]. But detecting humans in real-life scenarios is a fundamentally hard problem to solve, mainly because of the wide variability of human postures. Furthermore, small object size, occlusion, bad weather conditions, sudden illumination changes, camouflage problems, and the need for real-time operation are common challenges one has to deal with. As mentioned by Gavrila in his review paper [1], human detection methods can be divided into three categories, namely (1) 3D approaches, (2) 2D approaches with an explicit shape model, and (3) 2D approaches without an explicit shape model. Methods in (1) and (2) aim at recovering 3D/2D body parts and posture on pre-localized blobs [2], often resulting in a stick-figure representation. While these methods do not require a training phase, they make strong assumptions on the content of the scene and often require the use of non-trivial
mathematical models. On the other hand, methods in (3) detect people without explicitly locating their body parts. Based on a training database, these methods extract features (e.g., edges, gradients, shape, wavelet coefficients) and, following a classification step (e.g., SVM, AdaBoost), separate human from non-human shapes [3], [4]. Most of the methods presented above were meant to work with daylight (visible-spectrum) cameras. However, as their cost keeps decreasing, far-infrared (FIR) cameras (often called IR or thermal cameras) are gaining interest for human detection [5], [6], [7], as they provide numerous advantages (night vision, relatively uniform backgrounds, etc.). As opposed to daylight cameras, however, FIR cameras fail at detecting humans on hot summer days and often suffer from floor reflections. Note that a study on human detection in FIR and daylight images can be found in [8]. As false detections in FIR and daylight images are due to different causes, it is interesting to make the two cameras collaborate in a stereovision framework so that the number of false detections inherent to their individual use is reduced. To our knowledge, very few papers have addressed that issue before, the closest being by Bertozzi et al. [9]. However, their method uses two pairs of stereovision systems from which they match disparity maps. In our method, only two cameras are required, and the detection is based on a machine learning method and a fusion step. Each video stream of the stereovision system is processed independently based on Viola et al.'s method [4] before the results are merged together. The paper is organized as follows: we first present the stereovision system used and the implemented human detection system. Performance and possible applications of the system are presented last.
2 The Stereovision System
The objective of our method is to combine information from a FIR and a daylight camera mounted side by side. But prior to doing so, let us first review some basic stereovision notions. It is well known that, given the epipolar geometry of a stereovision system, each point observed in one image corresponds to an epipolar line in the other image. Such correspondence is typically determined via the so-called fundamental matrix [15]. The fundamental matrix F is a 3 × 3 matrix satisfying the relation $x'^{T}_{A} F x_{A} = 0$, in which A is a point in the world reference frame imaged as $x_A$ in the first view and $x'_A$ in the second. A simple projection $l'_A = F x_A$ determines the epipolar line $l'_A$ corresponding to $x_A$ in the second image. This point-to-line relation stipulates that points lying on the epipolar line $l'_A$ in camera 2 are the only ones that can match $x_A$ in camera 1. Many methods have been proposed to estimate the fundamental matrix between two cameras, the most commonly implemented being the 8-point algorithm [15]. This being said, the fundamental matrix can also be obtained by combining the extrinsic and calibration matrices of each camera:

$$F = [P'C]_{\times}\, P' P^{+} \qquad (1)$$
Fig. 1. Camera calibration setup: images acquired by the daylight and FIR cameras
where P and P' are the projection matrices of the first and second camera, P+ is the pseudo-inverse of P, and C is the first camera center. By their very nature, P and P' are obtained by multiplying together the calibration and extrinsic matrices (see chap. 9 of [15] for more details). The reason for using the calibration and extrinsic matrices in our setup is twofold: 1) estimate F following Eq. 1 to enforce the human detection procedure, and 2) estimate the 3D position of each detected person (localization examples are presented in Section 4). The intrinsic and extrinsic parameters are estimated following a calibration procedure involving a pattern with known dimensions [11]. In our method, a picture of the calibration pattern is taken simultaneously by both cameras. Note that, by its very nature, the FIR camera needs an atypical pattern with "warm" and "cold" areas (see Fig. 1). This is achieved with a heat lamp mounted atop the calibration pattern so that its dark squares are heated up.
3 Human Detection Methods
As mentioned previously, images from the FIR and daylight cameras are first processed independently. Then, each detection (bounding box) in one camera is matched (or confirmed) with the detections in the other camera. In this section, we describe how the human detection method works on still images and how results from both cameras are fused together.

3.1 Detection on FIR and Daylight Images
In order to detect humans in FIR and daylight images, we use Viola et al.'s method [4]. Here, 14 Haar-like filters are used; as shown in Fig. 2, those filters are made of two or three black and white rectangles. The feature values xi are computed as a weighted sum of the pixels of each component. Each feature xi is then fed to a simple one-threshold weak classifier fi:

$$f_i = \begin{cases} +1 & \text{if } x_i \geq \tau_i \\ -1 & \text{if } x_i < \tau_i \end{cases} \qquad (2)$$

where +1 corresponds to a human shape and −1 to a non-human shape. The threshold τi corresponds to the optimal threshold that minimizes the misclassification error of the weak classifier fi, estimated during the training stage. Then, a
Fig. 2. Haar-like filters used by our human detection method
Fig. 3. Example of decision fusion based on epipolar geometry
more robust classifier is built from several weak classifiers trained with a boosting method [12]:

$$F_j = \mathrm{sign}(c_1 f_1 + c_2 f_2 + \ldots + c_n f_n). \qquad (3)$$

Then, a cascade of boosted classifiers is built, where Fj corresponds to the boosted classifier of the jth stage of the cascade. Each stage can reject or accept the input window. Whenever an input window passes through every stage, the algorithm labels it as a human shape. Note that humans are detected in a sliding-window framework [4].
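A minimal sketch of our own (not the authors' code) of how Eqs. (2) and (3) combine into a cascade decision:

```python
def weak_classify(x, tau):
    """One-threshold weak classifier of Eq. (2): +1 (human) or -1."""
    return 1 if x >= tau else -1

def boosted_stage(features, stage):
    """Boosted classifier of Eq. (3): sign of the weighted weak votes.
    `stage` is a list of (feature_index, tau, c) triples found by boosting."""
    s = sum(c * weak_classify(features[i], tau) for i, tau, c in stage)
    return 1 if s >= 0 else -1

def cascade_label(features, cascade):
    """A window is labeled as a human shape only if every stage accepts it."""
    return all(boosted_stage(features, stage) == 1 for stage in cascade)
```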
3.2 Detection with the Stereovision System
As mentioned in Section 2, knowing the epipolar geometry of the stereovision system, one can link a point in one image to its corresponding epipolar line in the second image. In our method, this correspondence is used to confirm every human shape detected in one camera with those detected in the other camera. Let us consider that M human shapes have been detected in the daylight image and N in the FIR image. As shown in Fig. 3, let $A_i^{vis}$ and $B_i^{vis}$ be the top-left and bottom-right points of the ith human shape in the daylight image (represented by a bounding box), and $dA_i^{vis}$ and $dB_i^{vis}$ their respective epipolar lines in the FIR image, obtained with the fundamental matrix. In our method, a detected shape i ∈ [1, M] in the daylight image is kept if and only if there is a shape j ∈ [1, N] such that the distance between $A_j^{FIR}$ and $dA_i^{vis}$ and between $B_j^{FIR}$ and $dB_i^{vis}$ is smaller than a pre-defined threshold (obtained empirically). Whenever that test fails, the detected shape is deleted. In Fig. 3, two
human shapes have been detected in the daylight image and only one in the FIR image. In this example, only shape 1 has been kept. Of course, this algorithm is used both ways such that the shapes in the daylight image are confirmed with those in the FIR image and vice versa.
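The confirmation test can be sketched as follows (our illustration; the threshold value, the homogeneous-coordinate layout and the corner representation are assumptions):

```python
import numpy as np

def point_line_distance(line, p):
    """Distance from 2-D point p to homogeneous line (a, b, c): |ax+by+c| / sqrt(a^2+b^2)."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def confirmed(box_vis, boxes_fir, F, thresh=15.0):
    """Keep a daylight detection only if some FIR detection has both of its
    corners close to the corners' epipolar lines (threshold is illustrative)."""
    A, B = box_vis                                   # top-left, bottom-right (x, y)
    lA = F @ np.array([A[0], A[1], 1.0])             # epipolar line of A in the FIR image
    lB = F @ np.array([B[0], B[1], 1.0])
    return any(point_line_distance(lA, Aj) < thresh and
               point_line_distance(lB, Bj) < thresh
               for Aj, Bj in boxes_fir)
```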
4 Experimental Results

4.1 Datasets
Since a classifier is used in each spectrum, two learning datasets are required (cf. examples in Fig. 4). The positive datasets are made of 1208 daylight images (from [3]) and 1175 FIR images (from the OTCBVS dataset [13,14] and our manually annotated images). Both negative learning datasets are made of 3415 gray-level images (FIR and daylight).
Fig. 4. Example of FIR and daylight images from the learning dataset
The human detection rate with the FIR and daylight cameras has been evaluated on several real-life videos taken in three different areas. The test dataset is made of 640 daylight and FIR images randomly extracted from the videos and manually annotated.
4.2 Results
In order to gauge performance, Precision/Recall curves are used:

$$\text{Precision} = \frac{\#TP}{\#TP + \#FP}, \qquad \text{Recall} = \frac{\#TP}{\#TP + \#FN}, \qquad (4)$$
where #TP, #FP and #FN stand for the number of true positives, false positives and false negatives, respectively. We present in Fig. 5 results obtained using the fundamental matrix to perform the point-to-line correspondence. The FIR and Daylight curves show results obtained by processing the daylight and FIR images outside our stereovision setup. As one might expect, results are far more precise with the FIR images. This is because FIR images have more uniform backgrounds, and their human-shape-versus-background contrast is much stronger than in daylight images. The Improved daylight and Improved FIR curves show results obtained with the fusion process described in Section 3.2. As can be seen, our method clearly improves human detection performance in both spectra.
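For reference, Eq. (4) in code form (a trivial sketch of ours):

```python
def precision_recall(tp, fp, fn):
    """Eq. (4): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=90, fp=10, fn=5)   # illustrative counts
```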
Fig. 5. Precision/Recall curves for Daylight and FIR images processed with and without our stereovision setup
Fig. 6. First row: detection examples. Second row: detection examples with the stereovision system.
Detection examples are presented in Fig. 6. In the first row, one can see detections in a FIR image and a daylight image with false positives. In the second row, detections have been fused with our method and the false positives (the reflection on the floor on the left and the chair on the right) have been deleted. Note that our fusion procedure is very fast since it only requires two projections per detection. As explained in Section 2, by using the extrinsic and intrinsic parameters of the calibrated cameras, it is possible to locate humans in 3D and thus verify their activities. We present in Fig. 7 two localization results. In the first example, the rectangle represents the ground-truth location and the crosses correspond to the estimated locations. The person's coordinates are obtained by retrieving the coordinates of the detection centroid in both spectra and then calculating its 3D coordinates using stereo-triangulation. In the second example, we show
Fig. 7. Localisation results (the X and Z axes are in meters)
the estimated path of a person drawn onto the floor map. In this figure, the individual enters from the right door, stays for a while near the shelf, and then leaves through the same door. We note that this information is sufficient for home-care applications, where a more precise location would not be useful.
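A hedged sketch of the stereo-triangulation step using OpenCV follows (our illustration; the paper does not state which implementation was used):

```python
import numpy as np
import cv2

def locate_3d(P_vis, P_fir, centroid_vis, centroid_fir):
    """Stereo-triangulate one matched detection centroid into 3-D.
    P_vis / P_fir are the 3x4 projection matrices from calibration;
    the result is expressed in the units of the calibration pattern."""
    x1 = np.asarray(centroid_vis, float).reshape(2, 1)
    x2 = np.asarray(centroid_fir, float).reshape(2, 1)
    X = cv2.triangulatePoints(P_vis, P_fir, x1, x2)   # 4x1 homogeneous point
    return (X[:3] / X[3]).ravel()                      # (X, Y, Z)
```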
5 Conclusion
In this paper, we presented a stereovision-based human detection system which uses a far-infrared and a daylight camera. Since FIR and daylight cameras suffer from different limitations, our method combines both cameras in order to reduce the number of false positives inherent to their individual use. In our method, human positions are first detected in each camera independently with a cascade of boosted classifiers. Then, the detected shapes of each spectrum are fused in order to remove false positives. Results have been shown quantitatively with an evaluation study based on various manually annotated real-life videos, demonstrating that our system distinctly outperforms classic human detection. The aim of this work was to decrease the number of false detections by taking into account the information given by our stereovision system. In the future, we plan to study the benefits of decreasing the number of missed detections by taking into account temporal information, for example by integrating this human detection stereovision system into a simple tracking framework. We will then consider a fusion scheme combining the detection results of both cameras in the world reference frame instead of performing the fusion in the image reference frames, thereby obtaining a single detection result at each time step for the whole system.
References

1. Gavrila, D.M.: The Visual Analysis of Human Movement: A Survey. CVIU 73, 82–98 (1999)
2. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: Real-Time Tracking of the Human Body. IEEE Trans. PAMI 19, 780–785 (1997)
3. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
4. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Proc. CVPR, pp. 511–518 (2001)
5. Bertozzi, M., Broggi, A., Lasagni, A., Del Rose, M.: Infrared Stereo Vision-based Pedestrian Detection. In: Proc. of IVS, pp. 24–29 (2005)
6. Xu, F., Liu, X., Fujimura, K.: Pedestrian Detection and Tracking With Night Vision. IEEE Trans. ITS 6, 63–71 (2005)
7. Benezeth, Y., Emile, B., Laurent, H., Rosenberger, C.: A Real Time Human Detection System Based on Far-Infrared Vision. In: Elmoataz, A., Lezoray, O., Nouboud, F., Mammass, D. (eds.) ICISP 2008. LNCS, vol. 5099, pp. 76–84. Springer, Heidelberg (2008)
8. Fang, Y., Yamada, K., Ninomiya, Y., Horn, B., Masaki, I.: Comparison between infrared-image-based and visible-image-based approaches for pedestrian detection. In: Proc. of IVS, pp. 505–510 (2003)
9. Bertozzi, M., Broggi, A., Felisa, M., Vezzoni, G., Del Rose, M.: Low-level Pedestrian Detection by means of Visible and Far Infra-red Tetra-vision. In: Proc. of IVS, pp. 231–236 (2006)
10. Bovik, A.: Handbook of Image and Video Processing. Academic Press, London (2000)
11. Bouguet, J.Y., Perona, P.: Closed-form Camera Calibration in Dual-space Geometry. In: Proc. of the ECCV (1998)
12. Schapire, R.E.: The boosting approach to machine learning: An overview. In: Workshop on N.E.C. (2002)
13. Davis, J., Keck, M.: A two-stage template approach to person detection in thermal imagery. In: Workshop on A.C.V. (2005)
14. Davis, J., Sharma, V.: Background-subtraction using contour-based fusion of thermal and visible imagery. CVIU 106, 162–182 (2007)
15. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
Efficient Time and Frequency Methods for Sampling Filter Functions Fadel M. Adib and Hazem M. Hajj Department of Electrical and Computer Engineering, American University of Beirut, Bliss st., Beirut, Lebanon {fma32,hazem.hajj}@aub.edu.lb
Abstract. In this paper, we seek to determine the adequate number of samples for an analog filter function f(t). The proposed approaches provide discrete filters that can be used for multiresolution analysis. We introduce two methods that provide sampling results for localization: one defines an approximate Nyquist rate, and the other samples in a manner that ensures time-frequency consistency between the generated samples and the analog filter function. The key contribution of the paper is that it establishes robust mathematical and programmable foundations for a previously established empirical method. Analytically, we show that the time-frequency method is based on minimizing aliasing while maximizing decimation. The method can be programmed by introducing a mean square error (MSE) threshold across scales. Afterwards, we provide the outcomes of experiments that demonstrate the success of localization with the proposed time-frequency method. Keywords: Multiresolution analysis; multiscale analysis; sampling filters; time-frequency.
1 Introduction

Sampling is a key step in digital signal processing for transforming a mathematically continuous function into a discrete function for digital processing. In this paper, we examine the problem of sampling a continuous filter function to derive samples for multiresolution analysis. Several researchers [1-9] have faced this step in their multiscale applications. Some of the work does not clearly describe how the samples were derived, while other work includes restrictive conditions to ensure perfect reconstruction of the function from its samples. There has been extensive research and publication on sampling methods for multiresolution analysis. The development of wavelets by Mallat [10,11] as filters for multiresolution analysis, and the establishment of orthonormal bases with compact support by Daubechies [12,13], are considered key foundational references for multiresolution sampling theory and analysis. Applications in multiresolution analysis have at times relied on analog filters, as they provide an elegant mathematical derivation, and thereby avoided the sampling problem [9]. Level-crossing has contributed to the literature of sampling signals on one hand and of feature detection on the other. Guan and Singer [14] proposed sampling by
level-crossings based on the Lebesgue integral, which approximates frequency non-bandlimited functions with a finite set of amplitude levels. Despite its advantages, this approach divides the amplitude of the signal into equally spaced intervals in order to set the levels of the crossings, yet it specifies neither how many levels should be taken nor how to obtain this number. Much research has also been done on analog-to-digital, or s-to-z, transforms. The approaches for these transforms, according to [15], can be divided into three categories: time-invariant response methods, numerical approximation methods, and heuristic methods. These transforms form a basis for mapping an analog filter to a digital one. However, in this mapping, these methods still require a specification of how to select the sampling period T in order to finalize all aspects of the transformation. Shannon sampling provides a robust way of sampling bandlimited signals; however, most filter functions we deal with are time-bandlimited and thus non-bandlimited in frequency, so an alternative approach is needed in these cases. Several approaches have been proposed, each posing its own limitations. Liu et al. [16] establish a generalized Shannon-type sampling theorem for a class of non-bandlimited signals with special spectrum properties involving a two-parameter ladder-shaped filter. Moreover, Chen et al. [17] propose a theorem to reconstruct certain non-bandlimited signals; they also provide several sampling formulae that remain challenged by the question of identifying the six relevant parameters required to sample. In short, the question of sampling different classes of non-bandlimited signals remains open in many respects. Accordingly, the remainder of the paper is organized as follows. Section 2 proposes a method to approximate the Nyquist rate for filter signals that are time-bandlimited, and thus frequency non-bandlimited. Section 3 introduces the time-frequency method, which ensures consistency in both time and frequency domains by constructing a discrete time-consistent model, and provides the mathematical foundation for this method, followed by evaluation and comparison in Section 4. Section 5 then summarizes the results, and Section 6 concludes the paper.
2 Frequency-Based Method

This section presents a method to sample frequency non-bandlimited signals by employing continuous- and discrete-time signal analysis. This is analogous to finding the bounds of an anti-aliasing filter. Shannon's sampling theorem requires that sampling occur at a rate at least double the highest frequency of the input signal; thus, it does not address how to sample input functions that have no upper bound on frequency. To sample frequency non-bandlimited functions, we draw on Parseval's theorem, which equates energy in the time and frequency domains. Unlike bandlimited functions, whose integral in the frequency domain is bounded by the highest frequency, we must approximate the boundary frequency of non-bandlimited functions. Our approximation employs a threshold ratio τ of the total energy to derive the sampling rate needed to produce an output signal that contains that percentage of the energy content of the total input function, where τ = 1, i.e., 100%, represents the whole energy.
2.1 Procedure

Given an approximation ratio τ, e.g., τ = 0.9 to represent a 90% energy approximation, the procedure for determining the upper frequency bound is described below. First, consider the continuous filter function f(t) and derive its Fourier transform F(w). Then, we calculate its total energy content. Finally, we solve the equation

\[ \frac{1}{2\pi}\int_{-W_h}^{W_h} |F(w)|^2\, dw \;=\; \tau \cdot \frac{1}{2\pi}\int_{-\infty}^{\infty} |F(w)|^2\, dw . \tag{1} \]

Solving the above equation means finding the frequency w_high (W_h) at which the energy content of the filter signal will have reached τ·100% of its total value. The approximation of the Nyquist theorem now implies that sampling should take place at a frequency at least equal to twice W_h. Therefore, the required sampling frequency is

\[ w_s = 2\, W_h . \tag{2} \]
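As a concrete illustration, the sketch below estimates W_h numerically via an FFT for a Gaussian filter on its 6σ support. The paper prescribes no implementation; the function and variable names are ours, and the resulting sample counts depend on the discretization, so they may differ slightly from Table 1.

```python
import numpy as np

def upper_frequency_bound(f, t, tau=0.9):
    """Estimate W_h such that the band [-W_h, W_h] holds a fraction tau
    of the signal energy, using an FFT approximation of F(w)."""
    dt = t[1] - t[0]
    F = np.fft.rfft(f(t)) * dt                   # one-sided spectrum
    w = 2 * np.pi * np.fft.rfftfreq(len(t), dt)  # pulsations in rad/s
    energy = np.abs(F) ** 2
    cum = np.cumsum(energy) / energy.sum()       # cumulative energy fraction
    return w[np.searchsorted(cum, tau)]

# Gaussian filter with sigma = 1 s on the 6-sigma support used in the paper
t = np.linspace(-3.0, 3.0, 4096)
gauss = lambda t: np.exp(-t ** 2 / 2.0)
W_h = upper_frequency_bound(gauss, t, tau=0.9)
w_s = 2 * W_h                                     # approximate Nyquist rate, Eq. (2)
n_samples = int(np.ceil(6.0 * w_s / (2 * np.pi)))  # samples over a 6 s support
print(W_h, w_s, n_samples)
```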
2.2 Examples

To illustrate this procedure, we consider some filter functions, namely Haar, Gaussian, First Derivative of Gaussian (FOG), Laplacian of Gaussian (LOG), and Pulse, and apply the method with τ = 0.9 to derive the required sampling rates. The results are summarized in Table 1. For the Haar and the Pulse filter functions, the time frame considered was 2 seconds by design. The Gaussian, on the other hand, has a standard deviation σ = 1 sec; therefore, it is safe to consider 6σ = 6 sec as a complete time support for the function. This time support was also adopted for the FOG and LOG.

Table 1. Required number of samples using the frequency-based method

Filter Function                        | Number of samples needed
Haar                                   | 40
Gaussian                               | 24
First Derivative of Gaussian (FOG)     | 92
Laplacian of Gaussian (LOG)            | 30
Pulse                                  | 32
3 Time Frequency Consistency

Our approach of sampling filter functions at lower rates than Nyquist aims at modeling these filters with smaller supports without losing their practical representations. The key objective is that the filter samples should provide the best localization when applied to signals. This section describes a method that builds on time and frequency consistency (TFC) across scales. In [18], Hajj relied on visual inspection of the Z-transform. In this paper, we develop an analytical and programmable approach that eliminates the need for visual inspection of the spectrum. The essence of the method is to maintain consistency in both the time and frequency domains, which yields representations of filter functions at the lowest sampling rate that are still consistent with their representations in both time and frequency analysis, and thus
may be used to provide viable models for the filter samples. On one hand, consistency in time is maintained by construction: the samples are chosen in discrete time in a manner that closely follows the continuous-time model. We then proceed by finding the Z-transform of the filter signal at the lowest scale (i.e., at 2 or 3 samples), whose magnitude is plotted over [0, π]. The number of samples taken afterwards is then increased dyadically until the Z-transform shows consistency in the plot. The onset of consistency at a certain level implies that the corresponding number of samples is a sufficient representation of the function in the frequency domain, as will be demonstrated in the following subsection. A sufficient representation of the samples may be assumed when the following two criteria are satisfied: 1. The Z-transform is evident within [0, π], i.e., the whole shape of the transform lies inside [0, π], not outside it. 2. The transform plot has sufficient decay to 0, i.e., according to a pre-specified threshold (90%, for example).

3.1 Programmable Approach

This section illustrates the results of TFC, verifying its applicability from a numerical perspective before turning to its proof from a mathematical viewpoint. To automate the sampling procedure and satisfy the above two conditions, the following procedure is implemented. First, starting at 2 or 3 samples, we compute the magnitude of the Z-transform at points evenly spaced between 0 and π. Then, at any scale a, we consider the following normalized sum:
\[ e_a = \left( \frac{1}{N} \sum_{i=1}^{N} \big( |Z_a[i]| - |Z_{2a}[i]| \big)^2 \right)^{1/2} , \tag{3} \]
where N is the number of points in computing the Z-transform and |Z_a[i]| is the magnitude of the Z-transform at scale a and point i. This sum is the normalized error, expressed as the root mean square (e_a) of the errors between two successive scales. In our experiments, we found that a threshold of 16% was sufficient for achieving TFC. However, we must also ensure that this error criterion holds across all scales, which can be detected by having the threshold satisfied at a pair of scales and then having decreasing errors from one scale to the next. Programmatically, in order to consider scale a as sufficient, we must therefore have: 1. The error between scales satisfies e_a < 0.16. 2. If, at one level, the error meets this threshold requirement, the following error (i.e., that of the next transition) must be smaller than the former. If condition 2 is not satisfied, we cannot ensure consistency at higher scales.
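A minimal sketch of this programmable procedure follows. It is our own illustration, not the authors' code: it assumes dyadic sample counts of 2^L + 1 per level (which matches most, though not all, of the counts reported in Table 2) and uses the RMS error of our reconstructed Eq. (3).

```python
import numpy as np

def ztransform_magnitude(samples, n_points=512):
    """|Z(e^{jw})| of a finite filter, evaluated on [0, pi]."""
    w = np.linspace(0.0, np.pi, n_points)
    k = np.arange(len(samples))
    return np.abs(samples @ np.exp(-1j * np.outer(k, w)))

def tfc_level(filter_fn, support, threshold=0.16, max_level=8):
    """Return the first dyadic level whose error e_a meets the threshold
    and is followed by a decreasing error, plus the error history."""
    t0, t1 = support
    errors, prev = [], None
    for level in range(max_level):
        n = 2 ** level + 1                       # assumed counts: 2, 3, 5, 9, 17, ...
        mag = ztransform_magnitude(filter_fn(np.linspace(t0, t1, n)))
        if prev is not None:
            errors.append(np.sqrt(np.mean((mag - prev) ** 2)))  # e_a, Eq. (3)
        prev = mag
    for a in range(len(errors) - 1):
        if errors[a] < threshold and errors[a + 1] < errors[a]:
            return a + 1, errors                 # accepted level
    return None, errors
```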
In order to verify this method, it is tested on several filter signals, namely Haar, Gaussian and its first two derivatives, roof, triangle, and pulse; the results are shown in Table 2.

Table 2. Required number of samples using TFC

Filter Function                    | Scales examined | Consecutive e_a values                      | Samples needed
Haar                               | 4               | 0.4941, 0.2479, 0.1229, 0.0604              | 8 (level 3)
Gaussian                           | 4               | 0.4202, 0.2153, 0.0972, 0.0476              | 9 (level 3)
First Derivative of Gaussian (FOG) | 5               | 3.4606e-005, 0.1924, 0.2575, 0.1100         | 17 (level 4)
Laplacian of Gaussian (LOG)        | 5               | 0.5808, 0.3672, 0.4517, 0.1654, 0.0764      | 17 (level 4)
Roof                               | 3               | 0.1061, 0.0813                              | 5 (level 2)
Triangle                           | 5               | 2.9052e-005, 0.5208, 0.1826, 0.0802, 0.0375 | 17 (level 4)
Pulse                              | 4               | 0.1229, 0.4176, 0.1365, 0.0970              | 9 (level 3)
3.2 Analytic Approach

Having illustrated an automated, programmable TFC method for sampling filter functions, we now provide the mathematical foundation of the method along with an interpretation of the results obtained. Similar to the analysis of the thresholding operator in [18], one can show that our TFC sampling operator is non-linear and discontinuous, which makes it difficult to analyze. Nevertheless, we demonstrate that the foundation of the method is an optimization problem based on minimizing aliasing and maximizing decimation.

3.2.1 Sampling Consistency in Time Domain

Often, when we need a larger support for a filter function, the approach is to upsample this signal (function) by N, and then filter the output signal to smooth out the filter. The upsampling of a signal x[n],
\[ y[n] = \begin{cases} x[n/N] & \text{if } n \text{ is a multiple of } N \\ 0 & \text{otherwise} \end{cases} \tag{4} \]

results in the following corresponding frequency relation [19]:

\[ Y(z) = X(z^N) . \tag{5} \]
However, this refined filter introduces aliasing, or "inaccurate" values at the skipped sampling points. On the other hand, TFC provides us with the ability to widen the support of our filter signal by adding exact samples at the skipped sampling points rather than introducing these "inaccurate" samples.
Many systems are aware of the filter functions in the time domain. In such cases, TFC supplies these systems with a much better approximation of the discrete filter functions, yielding exact samples at the sampling instants, as opposed to approximated samples at half the sampling intervals in the conventional method, even at the first step of upsampling by 2. Note that all we need to do in TFC is to generate these new samples from the provided filter function while keeping the old ones.

3.2.2 Sampling Consistency in Frequency Domain

The literature [20] explains how downsampling a signal x[n] by M, y[n] = x[nM], yields the following relation in the frequency domain:

\[ Y(z) = \frac{1}{M} \sum_{k=0}^{M-1} X\!\left(z^{1/M} e^{-i 2\pi k / M}\right) . \tag{6} \]
This perspective allows us to explain two phenomena observed in TFC:
• The expansion of the Z-transform induced by the downsampling derivation explains the shrinking effect of the TFC frequency plots when compared to higher scales.
• In downsampling, if x[m] is not bandlimited to π/M, aliasing may result from spectral overlap. Thus, as we downsample a bandlimited signal, increasing M beyond a certain threshold (defined by the signal's frequency spectrum) starts inducing this aliasing effect. Conversely, if we consider starting with the lowest rates (i.e., number of samples per signal), this aliasing effect is present initially at low rates, and as the rate increases it decreases, until it is completely removed if and only if there exists M such that the corresponding x[m] is bandlimited to π/M in the downsampling perspective. In other words, aliasing can be completely eliminated only if the signal itself is frequency-bandlimited.
As a result, to achieve a balance when choosing the smallest set of samples, we choose the samples that give the least aliasing while maximizing decimation. TFC provides the procedure to determine this optimal set of samples and to derive larger sets of samples for multiresolution analysis.
4 Advantages of TFC

Having presented the mathematical foundation of TFC, we now present some of the advantages of the proposed method. First of all, as in level-crossing sampling [14], in TFC the signal dictates the sampling frequency of the filter signal, in our case even when the signal is frequency non-bandlimited. This feature makes sampling by TFC much simpler from a processing perspective, while generating a minimal number of samples. Moreover, TFC is a parametric sampling scheme with two parameters that can be determined by the sampling system: (1) the MSE (e_a²), which may be set high for rough edge detection and low for finer detail detection, and (2) the bit rate of the sampling process, adjustable to a system's limitations and design. The two advantages discussed so far render TFC a highly scalable sampling approach capable of adapting to multiple purposes and various systems. Furthermore,
to illustrate the advantage of having optimal small sets of samples for multiresolution analysis, consider the following example. If we convolve a multi-step signal with a filter, the results vary depending on the relative number of samples taken for the filter and the signal. We focus on the Haar filter and the multi-step function, where step edges can be detected at the peaks of the Haar response. From Fig. 1, we can see that the TFC responses provide excellent localization, while the responses of the filter samples derived from the Nyquist approximation have smeared peaks with little localization.
Fig. 1. Signal Convolved with Haar at TFC (left) and Approximated Nyquist (right)
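The following sketch reproduces this kind of experiment in spirit. It is our own illustration under stated assumptions (synthetic step positions, noise level, and an 8-sample Haar kernel, matching the TFC count for Haar), not the authors' test setup.

```python
import numpy as np
from scipy.signal import find_peaks

# Edge localization with a coarsely sampled Haar filter: convolve a noisy
# multi-step signal with an 8-sample Haar kernel and take peaks of the
# response magnitude as step-edge locations.
n = 1024
steps = np.concatenate([np.zeros(n // 4), np.ones(n // 4),
                        3 * np.ones(n // 4), np.ones(n // 4)])
noisy = steps + 0.5 * np.random.randn(n)

haar = np.concatenate([np.ones(4), -np.ones(4)])
response = np.abs(np.convolve(noisy, haar, mode='same'))
peaks, _ = find_peaks(response, height=0.6 * response.max(), distance=50)
print(peaks)   # expected near the true edges at 256, 512, 768
```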
5 Summary of Results

The table below summarizes the sampling rates derived by the frequency-based method and by TFC for the selected filter functions.

Table 3. Number of samples according to the two discussed methods

Filter function | Frequency-based Method (90% threshold) | TFC Method (# of samples needed)
Haar            | 40                                     | 8
LOG             | 30                                     | 17
FOG             | 92                                     | 17
Gaussian        | 24                                     | 9
Pulse           | 32                                     | 9
The table clearly shows the benefit of using the two proposed methods in reducing the number of samples needed from infinite to finite amounts for filter functions that decay to zero in the frequency domain. The relevant restrictions on the properties of such filters' transfer functions (poles and zeros, regions of convergence, etc.) require further analysis and may be within the scope of another paper. We illustrated how TFC gradually removes aliasing, and how the iterative approach of taking more samples allows us to reach the desired error without overhead. In addition, this approach proves time- and resource-efficient, allows the filter function to dictate its number of samples, and allows the application to set the maximum tolerated MSE. Finally, the method has proved superior in localization. In short, the key contribution of the paper is that it establishes robust mathematical and programmable foundations for a previously established empirical method.
6 Conclusion

For frequency-bandlimited signals, Shannon's theorem has long provided a sufficient condition for reproduction. However, we often encounter filter functions that have a finite time support and thus an infinite frequency representation. Applying the Nyquist criterion to these functions renders us unable to reproduce them given our finite resources of space and time. To overcome this limitation, we provided, in this paper, two methods for filter sampling: the frequency-based method and the time-frequency consistency method. These methods can be applied when our goal is not the complete reproduction of a filter function, as they use specific forms for signal processing. They also prove beneficial for applications that require few filter samples, such as edge localization and multiresolution analysis. We also showed robust programmable and analytical approaches for the TFC method. The smallest set of samples can be derived based on the optimal condition of minimum aliasing and maximum decimation. As these methods are novel, further research could delve more into classifying the types of filters for which our method works best; this will allow us to compare our methods further against recently proposed approaches in compressed sensing, which in turn requires certain properties of the sampling functions.
References
1. Qiu, P.: Image Processing and Jump Regression Analysis. Wiley, Hoboken (2005)
2. Foucher, S.: Multiscale filtering of SAR images using scale and space consistency. In: International Geoscience and Remote Sensing Symposium, July 2007, pp. 3878–3882 (2007)
3. Scharcanski, J., Jung, C., Clarke, R.T.: Adaptive Image Denoising Using Scale and Space Consistency. IEEE Transactions on Image Processing 11(9), 1091–1101 (2002)
4. Marin-Jimenez, M.J., de la Blanca, N.P.: Empirical Study of Multi-scale Filter Banks for Object Categorization. In: 18th International Conference on Pattern Recognition, vol. 1, pp. 578–581 (2006)
5. Foucher, S., Benie, G., Boucher, J.-M.: Multiscale MAP Filtering of SAR Images. IEEE Transactions on Image Processing 10(1), 49–60 (2001)
6. Polak, I.: Segmentation and Restoration via Nonlinear Multiscale Filtering. IEEE Signal Processing Magazine, 26–35 (September 2002)
7. Witkin, A.P.: Scale-space Filtering. In: Proceedings of the Eighth International Joint Conference on Artificial Intelligence, Karlsruhe, FRG, pp. 1019–1022 (1983)
8. Ziou, D., Tabbone, S.: A Multiscale Edge Detector. Pattern Recognition 26(9), 1305–1314 (1993)
9. Xu, Y., Weaver, J.B., Healy Jr., D., Lu, J.: Wavelet Transform Domain Filters: A Spatially Selective Noise Filtration Technique. IEEE Trans. on Image Processing 3(6), 747–758 (1994)
10. Mallat, S., Hwang, W.L.: Singularity Detection and Processing with Wavelets. IEEE Trans. on Inf. Theory 38(2), 617–643 (March 1992)
11. Mallat, S.G.: A Wavelet Tour of Signal Processing, 2nd edn. Academic, New York (1998)
12. Daubechies, I.: Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math. 41, 909–996 (1988)
13. Daubechies, I.: Ten Lectures on Wavelets. SIAM, Philadelphia (1992)
14. Guan, K., Singer, A.C.: A Level-Crossing Sampling Scheme for Non-Bandlimited Signals. Presented at the 2006 IEEE Sarnoff Symposium, Princeton, NJ, USA (2006)
15. Al-Alaoui, M.A.: Novel Approach to Analog-to-Digital Transforms. IEEE Transactions on Circuits and Systems 54(2), 338–350 (2007)
16. Liu, Y.-L., et al.: doi:10.1016/j.sigpro.2009.09.030
17. Chen, Q.H., Wang, Y.B., Wang, Y.: A sampling theorem for non-bandlimited signals using generalized Sinc functions. Comput. Math. Appl. 56(6), 1650–1661 (2008)
18. Hajj, H., Nguyen, T., Chin, R.: On Multi-Scale Feature Detection using Filter Banks. In: Proceedings of the Twenty Ninth Annual Asilomar Conference, vol. 29 (1996)
19. Boche, H., Mönich, U.: Limits of signal processing performance under thresholding. Signal Processing 89(8), 1634–1646 (2009)
20. Bose, T.: Digital Signal and Image Processing. Wiley, Hoboken (2004)
The Generalized Likelihood Ratio Test and the Sparse Representations Approach Jean Jacques Fuchs IRISA, Univ. de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex, France
[email protected] Abstract. When sparse representation techniques are used to tentatively recover the true sparse underlying model hidden in an observation vector, they can be seen as solving a joint detection and estimation problem. We consider the ℓ2-ℓ1 regularized criterion, probably the most used in the sparse representation community, and show that, from a detection point of view, minimizing this criterion is similar to applying the Generalized Likelihood Ratio Test. More specifically, tuning the regularization parameter in the criterion amounts to setting the threshold in the Generalized Likelihood Ratio Test.
1 Introduction
Sparse representation techniques can be considered to have two main applications. The first and direct one [1] is compression, i.e., given a data vector, represent it in a sparse way, which is then useful in coding, compression, archiving, etc. The second and less direct one is the retrieval of a true sparse representation from noisy observations [2]. The application in statistics [3], in a variable selection context, when the number of potential instruments or predictor variables is larger than the number of observations, can be seen as lying somewhere in between these two main classes of applications. One indeed seeks a sparse representation on a set of, say, physically meaningful instruments, though strictly speaking there is no prior true sparse representation. The second application is the one considered in this paper. It has a long history in the signal processing community and started well before the present sparse representations era, in a number of different domains such as identification, source separation, model fitting, and inverse problems, among others [4,5]. In the next section, one shortly describes the general sparse representation problem that will be considered. It potentially covers a large class of the applications listed above. The classical problem of the detection of signals in noise of known covariance is then presented and the generalized likelihood ratio test (GLRT) developed [6,7]. It is then shown that setting the value of the threshold in the GLRT is equivalent to tuning the hyper-parameter in the ℓ2-ℓ1 regularized criterion used to obtain a sparse representation. Finally, a specific application is analysed and some simulation results presented.
2 Sparse Representations in an Estimation Context
Basically, and without seeking generality in order to keep the model simple [8], sparse representation techniques can, for instance, be applied as soon as an n-dimensional observation vector b is the sum of an unknown number P of weighted contributions a(θ) in additive white noise, i.e.,

\[ b = \sum_{k=1}^{P} \alpha_k\, a(\theta_k) + e \tag{1} \]
with e ∼ N(0, σ²Iₙ) and a(θ) a parametrized vector function depending upon a scalar θ ∈ [0, 1]. To both decide upon the number P of contributors and estimate their characteristics {α_k, θ_k}, one then solves

\[ \min_x \; \frac{1}{2}\|Ax - b\|_2^2 + h\|x\|_1 \quad \text{with } h > 0, \tag{2} \]

with ‖x‖₁ = Σ_k |x_k| and ‖x‖₂² = Σ_k |x_k|², and A an n × m matrix (m ≫ n) with column vectors a_j = a(j/m)/‖a(j/m)‖₂, 1 ≤ j ≤ m, the normalized a(θ)-vectors on a uniform grid. One seeks to reconstruct the observation vector b by a model of the form Ax, where x is a high-dimensional but sparse (having just a few non-zero components) vector that indicates which columns of A are useful in the construction of an approximation of b. The sparseness is ensured by the presence of the ℓ1 norm in the criterion. Ideally, there are between P and 2P non-zero components in the optimal x. This optimization problem, being convex [9], generically has a unique optimum. One deduces from its optimum both an estimate P̂ of P and the estimates of {α_k, θ_k} for k = 1 to P̂. The parameter h, which appears in (2), has to be fixed by the user with care, since it plays an important role that we emphasize below.
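The paper does not prescribe a particular solver for (2). As one illustration, the sketch below minimizes the criterion by iterative soft-thresholding (ISTA), a standard method for this convex problem; the function name and iteration count are ours.

```python
import numpy as np

def ista(A, b, h, n_iter=500):
    """Minimize 0.5*||Ax - b||_2^2 + h*||x||_1 by iterative
    soft-thresholding (one standard solver for this convex criterion)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)           # gradient of the quadratic term
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - h / L, 0.0)  # soft threshold
    return x
```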
3 Detection of Signals in Noise
In the context of model (1), the associated basic detection problem is known as the detection of signals in noise of known covariance. It is well documented, and in its elementary version it consists in deciding between

\[ H_0: \; b = e \qquad \text{and} \qquad H_1: \; b = \alpha\, a(\theta_o) + e \tag{3} \]
with θ_o known, i.e., to decide whether α = 0 or α ≠ 0. One way to take the decision is then to use the likelihood ratio test (LRT), which in this case amounts to comparing

\[ \frac{1}{\sigma^2} \frac{(a(\theta_o)^T b)^2}{\|a(\theta_o)\|_2^2} \]

to a threshold. The quantity to be tested is clearly linked to the so-called matched filter output [10], (a(θ_o)ᵀb)/‖a(θ_o)‖₂. This test procedure is optimal in the sense that it maximizes the detection probability for a given probability of false alarm.
In the more general case where θ_o is also unknown, a suboptimal strategy known as the generalized likelihood ratio test (GLRT) [6,7] can be used. It is a quite general approach to the detection of signals that include unknown parameters. The GLR is defined as the ratio of the maximum value of the likelihood under H1 to the maximum value under H0, the maximum being taken with respect to the unknown parameters. It is shown below that this amounts to projecting the observation b onto the manifold associated with a(θ) for θ ∈ [0, 1], and retaining the model leading to the projection with the largest modulus. To confirm these observations, let us define the GLRT statistic t(b):

\[ t(b) = \max_{\alpha,\theta}\{2 \log f_1(b)\} - 2 \log f_0(b) \tag{4} \]
with f_0(b) and f_1(b) the probability density functions (p.d.f.) of b under H0 and H1, respectively. One has, under H1,

\[ 2 \log f_1(b) = -n \log(2\pi\sigma^2) - \frac{1}{\sigma^2}\|b - \alpha\, a(\theta)\|_2^2 . \]

Its optimum w.r.t. α is attained at

\[ \hat{\alpha} = \frac{b^T a(\theta)}{a(\theta)^T a(\theta)} \tag{5} \]

and, by substitution of this value, its optimum over θ satisfies

\[ \hat{\theta} = \arg\min_{\theta} \; b^T \Big(I - \frac{a(\theta)a(\theta)^T}{a(\theta)^T a(\theta)}\Big) b \;=\; \arg\max_{\theta} \; \frac{(b^T a(\theta))^2}{a(\theta)^T a(\theta)} . \]

The associated value of the loglikelihood is then

\[ \max_{\alpha,\theta}\{2 \log f_1(b)\} = -n \log(2\pi\sigma^2) - \frac{1}{\sigma^2}\Big(\|b\|_2^2 - \max_{\theta} \frac{(b^T a(\theta))^2}{a(\theta)^T a(\theta)}\Big) . \]

Since, under H0, the value of the loglikelihood is simply 2 log f_0(b) = −n log(2πσ²) − (1/σ²)‖b‖₂², the statistic t(b) in (4) is now fully characterized, and the GLRT procedure is thus to compare it to a threshold η:

\[ t(b) = \frac{1}{\sigma^2} \max_{\theta} \frac{(b^T a(\theta))^2}{a(\theta)^T a(\theta)} \;\gtrless\; \eta \tag{6} \]

and to decide H1 if the maximum is greater than the threshold and H0 otherwise. The threshold η is generally fixed so as to make the probability of false alarm equal to a user-chosen value P_FA. To fix the threshold, one thus needs to find the p.d.f. of t(b) when b = e, or equivalently the p.d.f. of

\[ t(w) = \max_{\theta} \, (w^T a(\theta))^2 / \|a(\theta)\|^2 \quad \text{with} \quad w \sim N(0, I_n) . \]
Seeking the probability density function of the maximum of a set or a continuum of random variables is in general a difficult problem [11,12] when the variables are not independent.
4 Detection Properties of the Sparse Representations Approach
Let us establish that, if one wants to decide between the two hypotheses in (3), comparing the GLR to a threshold η is equivalent to comparing the optimum x of (2) to the zero vector. To prove this point, let us introduce the conditions satisfied by the optimum of the criterion (2). This criterion is convex [9], and the necessary and sufficient condition for x to be a global minimum of (2) is that the zero vector is a sub-gradient of the criterion at x:

\[ \exists\, u \in \partial\|x\|_1 \quad \text{such that} \quad A^T(Ax - b) + h u = 0, \tag{7} \]

where u is a sub-gradient of ‖x‖₁, i.e., a vector belonging to the set called the sub-differential of ‖x‖₁, denoted ∂‖x‖₁, and satisfying

\[ \partial\|x\|_1 = \{u \mid u^T x = \|x\|_1,\; \|u\|_\infty \le 1\} = \{u \mid u_i = \operatorname{sign}(x_i) \text{ if } x_i \ne 0, \; |u_i| \le 1 \text{ otherwise}\}. \]

To prove that, if

\[ h \ge \|A^T b\|_\infty = \max_k |a_k^T b|, \tag{8} \]
|wT a(k/m)| a(k/m)
with w = N (0, In ) .
compare with (6). The sole difference is that in (8) the maximum is sought over m values of θ, while in the GLRT it is sought over a continuum of values of θ. As a matter of fact, if a unique component in the optimal x of (2) is non-zero, say the i-th component xi then it can be deduced from (7), that its value is given by [13]: xi = aTi b − hsign(xi ), a sum of two terms, where the first term is the optimal (ML) amplitude estimate, see (5), and the second term is a bias due to the presence of the penalization hx1 in (2) that systematically diminishes the absolute value by h. The sparse representation approach continues to work if more than just one signal is present and can be seen as an extension of the GLRT procedure applicable to a generalization of (3). The presence of the systematic bias on the amplitude estimates is well known, it has then a more intricate expression than above, but remains easy to remove.
The GLRT and the Sparse Representations Approach
5
249
Typical Application: A Sum of Sinusoids in White Noise
5.1
The Model
Let us apply the observations made above to the specific sinusoid-plus-noise example, where the parameter θ to be estimated is two-dimensional b j = sj + e j ,
with sj =
P
αk cos(ωk j + ϕk ),
with
j ∈ {0, n − 1} (9)
k=1
where ej are white Gaussian noise samples. To cast this application into the form required by the sparse representations approach in (2), one switches to complex weights x = ρeiϕ and complex basis vector denoted c(ω) corresponding to cisoids with equispaced pulsations ωk = 2π (k/m) with k ∈ {0, m − 1}: c(ω) = [1 eiω e2iω
... ei(n−1)ω ]H
(10)
Another possibility [15] is to store in the columns of A, real vectors a(ω, ϕ) where ω takes takes mω equispaced values in [0, π[ and ϕ takes mϕ values equispaced in [0, π] but the complex representationis preferred in the sequel. The component (q, p) of A is thus equal to aq,p = exp(−2iπ(q − 1)(p − 1)/m), i.e. the q−th component of the column of the cisoid c(ω) for ω = 2π(p − 1)/m). To fix ideas the number m of columns in A is taken even, it follows that ak = a ¯m+2−k , i.e. the columns in A are pairwise complex conjugate. Similarly, for the reconstructed vector Ax to be a real vector like the observation vector b it aims to reconstruct, the same hermitian symmetry property will hold for x, i.e. ¯m+2−k for k = 2 to m/2. Indeed one can establish by contradiction that xk = x the optimal solution x of (2) will always satisfy this property. It follows then from (8) that when b is composed of a single sinusoid, it will be detected by the sparse representations approach if h < AH b∞ = max |aH k b|. k
5.2
(11)
The Associated GLRT
One considers the single sinusoid case, see (3), where one decides if a single sinusoid with unknown pulsation and initial phase is present or not present. For the GLRT approach and under H1 , one then has to find the MLE estimates of the different quantities. One establishes below that the ML estimate of the pulsation corresponds to the location of the maximum of the periodogram, just like in the sparse representation approach, see (11). With a(ω, ϕ) a n-dimensional vector whose k-th component is cos((k−1)ω+ϕ), the probability density function of b is fα,ω,ϕ (b) = (2πσ 2 )−n/2 exp(−(1/2σ 2 )b − αa(ω, ϕ)22 ).
250
J.J. Fuchs
Under H0 , α = 0 and using the notations of the previous section −(2/n) log f0 (b) = log(2π) + log σ 2 + (1/nσ 2 ) b2 . Under H1 , one seeks the maximum likelihood estimates (MLE) of the unknowns, α, ω, and ϕ, they are obtained by locating the maximum of fα,ω,ϕ (b). Equating to zero the partial derivatives of f with respect to α yields α ˆ = [a(ω, ϕ)T b] / [a(ω, ϕ)T a(ω, ϕ)] Substitution of this values into f , one observes that maximizing this function with respect to ω and ϕ amounts to maximize max (bT a(ω, ϕ))2 / a(ω, ϕ)2 . ω,ϕ
If one now assume that the norm of a(ω, ϕ) is independent of ω and ϕ, which is realistic [16] for large n or ω not too close to 0 or π, the sought-for estimates of ω and ϕ are obtained as ω ˆ , ϕˆ = arg max g(ω, ϕ) ω,ϕ
with g(ω, ϕ) = |bT a(ω, ϕ)|.
Let us denote c(ω, ϕ) the vector filled with complex exponentials whose real part is a(ω, ϕ) and B(ω), the discrete Fourier transform (DFT) of b with modulus ρ(ω) and phase θ(ω). One then has g(ω, ϕ) = |bT a(ω, ϕ)| = | (bT c(ω, ϕ))| = | (bT c(ω, 0) exp(−iϕ))| = | (B(ω) exp(−iϕ))| = ρ(ω)| cos(θ(ω) − ϕ)|. The maximum of this last expression is attained at ω ˆ the value of ω that maximizes ρ(.) and then at ϕˆ = θ(ˆ ω ) to make the cosine equal to one. This means that the MLE of ω is, for large n, the values of ω for which the modulus ρ(ω) of the DFT of b is maximal. The other estimates are then obtained by substitution of this estimate ω ˆ , one first obtains ϕˆ and then α. ˆ Since
max |bT a(ω, ϕ)| = max ρ(ω) = max |bT c(ω)| ω,ϕ
ω
ω
with c(ω) = c(ω, 0) the column associated with the cisoid at pulsation ω, it follows that the GLR statistic (4) can be taken as t(b) = (1/σ 2 ) max |bT c(ω)|2 , ω
which is nothing but, up to a multiplicative constant, the maximum of the periodogram and is also the test strategy performed by the sparse representations approach, compare with (8), since ak the generic notation for the k-th column of the A matrix is now precisely c(ω) with ω = 2π(k − 1)/m.
5.3 Tuning the Parameter h
The simple scenario described in (3) is considered, and the parameter h is set to achieve a given probability of false alarm. A false alarm appears when the algorithm decides that there is (at least) one sinusoid present in the observation vector b while indeed there is none, i.e., while b = e. From Section 4 and relation (11), it follows that this is the case if ‖Aᴴe‖∞ > h, where a_k is the k-th (complex) column vector of the matrix A and e ∼ N(0, σ_e²Iₙ). The vector Aᴴe is a vector of m dependent scalar complex random variables built upon the n independent real Gaussian variables present in e. Each of these complex random variables is of the form cᴴe, with c an n-dimensional complex vector (10) with norm √n, that can be decomposed into the sum of its real part (cosines) and imaginary part (sines), so that cᴴe = v_cosᵀe − i v_sinᵀe, where the two vectors v_sin and v_cos are quasi-orthogonal and of similar norm. It follows that √(2/(nσ_e²)) cᴴe is, approximately, a complex Gaussian random variable whose real and imaginary parts are Gaussian, independent, zero mean, and with unit variance, and thus that (2/(nσ_e²)) |cᴴe|² is, approximately, a Chi-square random variable with 2 degrees of freedom. Let us now recall a quite standard result. For n independent and identically distributed random variables, say w_i, the following result holds:

\[ P\big(\max_{i \in (1,n)} w_i > t\big) = p \;\Leftrightarrow\; P\big(\max_{i \in (1,n)} w_i < t\big) = 1 - p \;\Leftrightarrow\; P(w_i < t)^n = 1 - p \;\Leftrightarrow\; P(w_i > t) = 1 - (1 - p)^{1/n} \simeq \frac{p}{n}, \]

where P(w_i < t) denotes the probability of the random variable w_i being smaller than t, and where the approximation in the last step is valid for small probability p and large n. In the present situation, one can consider that, since the m dependent variables in Aᴴe depend only upon the n degrees of freedom in e, there are only n independent random variables that are χ²₂. Applying the previous result then leads to

\[ h = \sigma_e \sqrt{\frac{n\, t}{2}}, \quad \text{with} \quad \Pr(\chi^2_2 > t) = \frac{p}{n} . \tag{12} \]

A precise analysis is extremely difficult, and no analytical expression that would allow fixing h exactly exists to our knowledge [11,12,14]. The simulations below (see Section 5.4) show that this approximate way of fixing h is quite accurate.
5.4 Simulation Results
To check the validity of the analysis performed in Section 5.3, we evaluate the probability of false alarm of the GMF. We consider n = 100 data points and m = 1000 columns in A. For a probability of false alarm of 10 percent, i.e., p = 0.1, the value of h given by (12) is h = 26.28, and 10000 independent realizations of noise-only observations b = e were simulated. The procedure wrongly detected a sinusoid (had an optimal x not identically zero) for 1348 of the 10000 realizations, i.e., 13 percent.
As a matter of fact, if we estimate the actual noise variance for each realization and use it to fix the value of h accordingly using (12), the number of false alarms for the same 10000 realizations drops to 1077, quite close to the expected 10 percent. This means that the approximate evaluation of the statistical properties of the moduli of the outputs of the Fourier transform under H0 gives a threshold that is quite accurate, and that the choice of the threshold is quite sensitive for this specific detection problem. Let us now simulate a single sinusoid with α = √2 and the same unit-variance noise. This corresponds to a signal-to-noise ratio (SNR) of 0 dB. For h given by (12), the sparse representations approach systematically detects the sinusoid, and sometimes detects a second sinusoid. The results obtained over 10000 independent realizations indicate that the proposed procedure attains the Cramer-Rao bounds [16] for the true sinusoid and that, among the 10000 realizations, a second (by far much weaker) sinusoid is detected in 1259 realizations, i.e., again in about 13 percent of the realizations, as expected, since the true sinusoid is strong enough to be always detected.
6 Concluding Remarks
It has been shown that, in a detection and estimation context, the basic sparse representations criterion

\[ \min_x \; \frac{1}{2}\|Ax - b\|_2^2 + h\|x\|_1 \quad \text{with } h > 0, \]

acts as a generalized likelihood ratio test (GLRT), and that the tuning of the hyperparameter h in this criterion relies on the same statistical analysis as the tuning of the threshold in the GLRT. Since this criterion is equivalent to

\[ \min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - b\|_2^2 \le \rho, \]

one could argue that, since tuning ρ is much easier, this second criterion should be preferred. But this is of course wrong. From a detection point of view, the second criterion is indeed equivalent to an energy detector, with a much poorer probability of detection for a given probability of false alarm than the GLRT. The relation between h and ρ that makes these criteria equivalent is unknown a priori and data dependent, and there is no way to exploit this relation to bypass the sophisticated analysis needed to tune h.
References
1. Donoho, D.: Compressed sensing. IEEE-T-IT 52, 1289–1306 (2006)
2. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353–2362 (2001)
3. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Royal Statist. Soc. B 58(1), 267–288 (1996)
4. Kwakernaak, H.: Estimation of pulse heights and arrival times. Automatica 16, 367–377 (1980)
5. Fuchs, J.J.: On the Application of the Global Matched Filter to DOA Estimation with Uniform Circular Arrays. IEEE-T-SP 49, 702–709 (2001)
6. Van Trees, H.L.: Detection, Modulation and Estimation Theory, part 1. John Wiley and Sons, Chichester (1968)
7. Cox, D., Hinkley, D.: Theoretical Statistics. Chapman and Hall, Boca Raton (1974)
8. Fuchs, J.J.: Detection and estimation of superimposed signals. In: Proc. IEEE ICASSP, Seattle, vol. III, pp. 1649–1652 (1998)
9. Fletcher, R.: Practical Methods of Optimization. John Wiley and Sons, Chichester (1987)
10. Scharf, L.L.: Statistical signal processing: detection, estimation and time series analysis. Addison Wesley, Reading (1991)
11. Hotelling, H.: Tubes and spheres in n-spaces, and a class of statistical problems. Amer. J. Math. 61, 440–460 (1939)
12. Weyl, H.: On the volume of tubes. Amer. J. Math. 61, 461–472 (1939)
13. Fuchs, J.J.: On sparse representations in arbitrary redundant bases. IEEE-T-IT 50(6), 1341–1344 (2004)
14. Villier, E., Vezzosi, G.: Caractéristiques opérationnelles de réception d'une classe de détecteurs. In: Gretsi, 15ème colloque, Juan les Pins, pp. 125–128 (September 1995)
15. Fuchs, J.J.: A sparse representation criterion: recovery conditions and implementation issues. In: 17th World Congress IFAC, Seoul, July 2008, pp. 12425–12429 (2008)
16. Porat, B.: Digital Processing of Random Signals: Theory and Methods. Prentice Hall, New Jersey (1994)
Frequency Tracking with Spline Based Chirp Atoms Matthias Wacker, Miroslav Galicki, and Herbert Witte Institute for Medical Statistics, Computer Sciences and Documentation Friedrich-Schiller-University Jena, GER
[email protected] Abstract. The tracking of prominent oscillations with time-varying frequencies is a common task in many signal processing applications. Therefore, efficient methods are needed to meet precision and run-time requirements. We propose two methods that extract the energy of the time frequency plane along a cubic spline trajectory. To raise efficiency, a sparse sampling method with dynamically shaped atoms is developed for the one method in contrast to standard Gabor atoms for the other. Numerical experiments using both synthetic and real-life data are carried out to show that dynamic atoms can significantly outperform classical Gabor formulations for this task.
1 Introduction
Frequency-dynamic oscillations are present in many situations of daily life. Music, wireless communication, a beating heart: endless applications deal with the task of assigning a frequency to each point in time, working against Heisenberg's uncertainty principle. Depending on the application, requirements differ in precision, runtime, noise robustness, and online applicability, leading to a variety of approaches. Prudat and Vesin present in [1] multi-signal extensions of (FIR and IIR) notch-filter-based frequency tracking algorithms that minimize the filter response energy with respect to the center frequency. Similarly, Johansson and White [2] apply the notch filter concept to time-variant autoregressive data modeling, leading to two types of time-variant notch filters that are applied to signals with low signal-to-noise ratios. In [3], Streit and Barrett formulate the problem of frequency line tracking as an optimal path through fixed frequency states, determined by a Hidden Markov Model, which implies high noise robustness at the cost of a coarse frequency resolution (prior chosen states). Witte et al. [4] use an active contour model approach from image processing, the Snake algorithm, to extract a time frequency line. This is the closest work to ours and will be commented on later in this paper. In our application, we want to adapt further automatic processing steps to the time frequency course of a salient oscillation in magnetoencephalogram data. Although this oscillation is prominent, it might have discontinuations, so that there
This work was supported by DFG Wi 1166/9-1 (Gamma).
might be no useful information at a point in time. Therefore, it is an important requirement that the tracking must not drift away during such discontinuations. Accordingly, the entire temporal context must be considered simultaneously. In the following we present an off-line approach to time frequency tracking. The general theoretical problem formulation is described in Section 2, including the derivation of an efficient approach to solve it. Numerical experiments on synthetic and real-life data are performed in Section 3, leading to the final conclusion in Section 4.
2 Method

2.1 Problem Formulation
Formulating the goal in more mathematical terms, we look for a continuous time frequency trajectory F ∈ C⁰[t₀, t_end] of a signal in the time range of interest. The trajectory defines a path through the complex-valued time frequency plane (TFP), which can be computed by Gabor analysis [5] for a given time frequency resolution determined by σ_t:

\[ \mathrm{TFP} : \mathbb{R} \times \mathbb{R} \to \mathbb{C}, \quad (t, f) \mapsto \mathrm{TFP}(t, f) = [g(\tau, f, \sigma_t) * x(\tau)](t) = \int_\tau g(t - \tau, f, \sigma_t)\, x(\tau)\, d\tau \tag{1} \]

where

\[ g(t, f, \sigma_t) = \underbrace{\frac{1}{\sigma_t \sqrt{2\pi}} \exp\!\Big(\frac{-t^2}{2\sigma_t^2}\Big)}_{=:\, w_{\sigma_t}(t),\ \text{envelope}} \cdot \underbrace{\exp(i 2\pi f t)}_{\text{compl. wave}} . \tag{2} \]
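For illustration, the following sketch samples the TFP of (1) on a frequency grid by convolving the signal with the Gabor atoms of (2). The function name, grid choice, and centering convention are our own assumptions, not part of the paper.

```python
import numpy as np

def tfp(x, t, freqs, sigma_t):
    """Sample the complex time frequency plane of Eq. (1) on a frequency
    grid by convolving the signal with Gabor atoms g from Eq. (2)."""
    dt = t[1] - t[0]
    tau = t - t[t.size // 2]                      # centered kernel time axis
    out = np.empty((len(freqs), len(t)), dtype=complex)
    for i, f in enumerate(freqs):
        g = (np.exp(-tau ** 2 / (2 * sigma_t ** 2)) / (sigma_t * np.sqrt(2 * np.pi))
             * np.exp(1j * 2 * np.pi * f * tau))
        out[i] = np.convolve(x, g, mode='same') * dt
    return out
```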
The desired trajectory is defined by the path of maximum energy through the TFP:

\[ \hat{F} = \arg\max_{F \in C^0[t_0, t_{\mathrm{end}}]} \int_{t_0}^{t_{\mathrm{end}}} |\mathrm{TFP}(t, F(t))|^2\, dt , \tag{3} \]
which implies finding the solution for an infinite dimensional optimization problem. Beginning here, our approach differs from that of Witte et al. in [4], where the TFP is sampled and is then regarded as a discrete image, which implies a fixed frequency spacing and the computation of many points in the TFP that will remain untouched in the end. After this sampling, an active contour model is applied, which then already works on the discrete image approximation and suffers from sampling artefacts. In contrast, we regard the TFP as a function, that can be precisely evaluated at every time and frequency point. The method of energy extraction is discussed in the following. To limit dimensionality, we reduce the search space of possible trajectories to smooth functions, which can be approximated by cubic splines [6]. Those splines automatically ensure a high degree of robustness regarding the previously mentioned discontinuations of the oscillations because the (local) change of a
node affects the whole spline. Hence, a local overfitting cannot be preferred over global information. In order to define the spline, we choose the natural boundary condition and an equidistant spacing Δ_t, leading to N = (t_end − t₀)/Δ_t + 1 nodes t_k:
\[ F \in \{s(t)\}, \qquad s(t) = a_k + b_k(t - t_k) + c_k(t - t_k)^2 + d_k(t - t_k)^3, \quad \forall t \in [t_k, t_{k+1}] . \tag{4} \]

The function values at these nodes define the complete spline and are therefore combined in an N-dimensional parameter vector y. This leads to the finite dimensional optimization problem

\[ \hat{y} = \arg\max_y \int_{t_0}^{t_{\mathrm{end}}} |\mathrm{TFP}(t, F_y(t))|^2\, dt . \tag{5} \]
The dimensionality is now well under control, so we can have a closer look at the integral. The integration is numerically a very expensive and critical task, as it has to be repeated for every evaluation in the optimization. Therefore we approximate it by a sum of scalar products of Gabor atoms with the signal,

\[ \hat{y} = \arg\max_y \sum_{v=0}^{V} \left| \int g(t_v - \tau, F_y(t_v), \sigma_t)\, x(t_v - \tau)\, d\tau \right|^2 , \tag{6} \]
where the atoms have an equidistant spacing. Using very few atoms in a subinterval of course does not sufficiently approximate the spline behavior, as one atom represents a constant frequency course. The obvious means to solve this problem is to reduce the spacing between the atoms until the spline behavior is well approximated, which implies a dense sampling of the trajectory in the TFP. The smarter way is to adapt the shape of the atoms to the shape of the spline curve, so that a sparse sampling still generates a sufficient approximation. As indicated in (2), the Gabor atom is nothing but a complex wave with a constant frequency, multiplied by a Gaussian envelope w(t). In algorithms such as matching pursuit, Gabor atoms have already been generalized to Gaussian-windowed linear chirps [7]. Here, as we have cubic splines as target functions, the generalization is straightforward. By replacing the argument of the complex wave by the antiderivative S(t) of the spline s(t), we obtain:

\[ h(t, t_v, \sigma_t, y) = w_{\sigma_t}(t - t_v) \cdot \exp(i 2\pi S(t)), \tag{7} \]

where

\[ S(t) = \int_{t_0}^{t} s(\tau)\, d\tau = \sum_{l=0}^{k-1} \Big( \frac{d_l}{4}\Delta_t^4 + \frac{c_l}{3}\Delta_t^3 + \frac{b_l}{2}\Delta_t^2 + a_l \Delta_t \Big) + \frac{d_k}{4}\delta_t^4 + \frac{c_k}{3}\delta_t^3 + \frac{b_k}{2}\delta_t^2 + a_k \delta_t \tag{8} \]

with t ∈ [t_k, t_{k+1}] and δ_t = t − t_k. Figure 1 shows an example of a Gabor analysis of a single Gabor atom and a single chirp atom. The coefficients a_k, b_k, c_k, d_k are
Fig. 1. Gabor analysis of a single Gabor atom (left) and a single chirp atom (right). Dark indicates high power.
uniquely defined by the spline algorithm and the spline node values y. So we obtain

\[ \hat{y} = \arg\max_y E(y) = \arg\max_y \sum_{v=0}^{V} \left| \int h(\tau, t_v, \sigma_t, y)\, x(\tau)\, d\tau \right|^2 \tag{9} \]

as the final approach to determine the time frequency trajectory with maximum energy E(y).
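A compact sketch of the chirp atoms (7)-(8) and the objective E(y) of (9) follows. It leans on SciPy's natural cubic spline and its antiderivative rather than the explicit coefficient formulas derived in Section 2.2, and all names are our own assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def chirp_atom(t, t_v, sigma_t, nodes, values):
    """Spline-based chirp atom h(t) = w(t - t_v) * exp(i*2*pi*S(t)), Eq. (7),
    with S the antiderivative of the natural cubic spline through
    (nodes, values)."""
    spline = CubicSpline(nodes, values, bc_type='natural')
    S = spline.antiderivative()                   # S(t) = integral of s
    w = (np.exp(-(t - t_v) ** 2 / (2 * sigma_t ** 2))
         / (sigma_t * np.sqrt(2 * np.pi)))        # Gaussian envelope of Eq. (2)
    return w * np.exp(1j * 2 * np.pi * S(t))

def energy(y, nodes, x, t, atom_centers, sigma_t):
    """Objective E(y) of Eq. (9): summed squared inner products of the
    signal with one chirp atom per atom center."""
    return sum(abs(np.trapz(x * np.conj(chirp_atom(t, tv, sigma_t, nodes, y)), t)) ** 2
               for tv in atom_centers)
```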
2.2 Derivation of the Objective Function
In order to implement an efficient algorithm to find the solution, we compute the derivative of the objective function with respect to the parameters. To increase readability, unnecessary indices and variables are hidden:

\[ \frac{dE(y)}{dy_j} = \frac{d}{dy_j} \left| \int x h\, d\tau \right|^2 = \frac{d}{dy_j} \left( \int x h\, d\tau \; \overline{\int x h\, d\tau} \right) = \int x \frac{d\bar{h}}{dy_j} d\tau \int x h\, d\tau + \overline{\int x h\, d\tau} \int x \frac{dh}{dy_j} d\tau \tag{10} \]

where \(\bar{\cdot}\) indicates the complex conjugate. Next,

\[ \frac{dh(t, t_v, \sigma_t, y)}{dy_j} = w_{\sigma_t}(t - t_v) \exp(i 2\pi S(t))\; i 2\pi \frac{d}{dy_j} S(t) \tag{11} \]

and

\[ \frac{d}{dy_j} S(t) = \sum_{l=0}^{k-1} \Big( \frac{dd_l}{4\, dy_j}\Delta_t^4 + \frac{dc_l}{3\, dy_j}\Delta_t^3 + \frac{db_l}{2\, dy_j}\Delta_t^2 + \frac{da_l}{dy_j}\Delta_t \Big) + \frac{dd_k}{4\, dy_j}\delta_t^4 + \frac{dc_k}{3\, dy_j}\delta_t^3 + \frac{db_k}{2\, dy_j}\delta_t^2 + \frac{da_k}{dy_j}\delta_t \tag{12} \]
give the derivative w.r.t. y_j of the 'dynamic' complex atom. The derivative of h̄ is computed in the same way. Hence, the derivation of the objective function E(y) leads to the derivatives of the spline coefficients w.r.t. the spline node values. To complete the computation, the following paragraph recapitulates some facts on natural cubic splines and shows how to carry out the derivation.

Natural cubic splines and their derivation. For t ∈ [t_i, t_{i+1}], the value of the spline s is given by its i-th part

\[ s_i(t) = \frac{z_{i+1}}{6h}(t - t_i)^3 + \frac{z_i}{6h}(t_{i+1} - t)^3 + \Big(\frac{y_{i+1}}{h} - \frac{h}{6} z_{i+1}\Big)(t - t_i) + \Big(\frac{y_i}{h} - \frac{h}{6} z_i\Big)(t_{i+1} - t), \tag{13} \]

where h = t_{i+1} − t_i, z_0 = z_{N−1} = 0, and

\[ \underbrace{\begin{bmatrix} 4 & 1 & & 0 \\ 1 & 4 & & \\ & & \ddots & 1 \\ 0 & & 1 & 4 \end{bmatrix}}_{M} \begin{bmatrix} z_1 \\ \vdots \\ z_{N-2} \end{bmatrix} = \begin{bmatrix} y_2 - 2y_1 + y_0 \\ \vdots \\ y_{N-1} - 2y_{N-2} + y_{N-3} \end{bmatrix} . \tag{14} \]
Multiplying out (13) and identifying with the coefficients in (4) yields

\[ d_i = (z_{i+1} - z_i)/(6h) \tag{15} \]
\[ c_i = 0.5\, z_i \tag{16} \]
\[ b_i = -h z_i/3 - h z_{i+1}/6 + (y_{i+1} - y_i)/h \tag{17} \]
\[ a_i = y_i . \tag{18} \]

From (14), we get

\[ \frac{dz_k}{dy_j} = (M^{-1})_{k,j-1} - 2 (M^{-1})_{k,j} + (M^{-1})_{k,j+1} , \tag{19} \]
where (M⁻¹)_{i,j} := 0 if (i, j) ∉ {1, …, N−2}², and dz_0/dy_k = 0, dz_{N−1}/dy_k = 0. For the final derivation of the coefficients in (15)-(18), we obtain, with the unit impulse δ,

\[ \frac{dd_i}{dy_j} = \frac{1}{6h}\Big(\frac{dz_{i+1}}{dy_j} - \frac{dz_i}{dy_j}\Big) \tag{20} \]
\[ \frac{dc_i}{dy_j} = 0.5\, \frac{dz_i}{dy_j} \tag{21} \]
\[ \frac{db_i}{dy_j} = -\frac{h}{3}\frac{dz_i}{dy_j} - \frac{h}{6}\frac{dz_{i+1}}{dy_j} + \frac{1}{h}\,[\delta(i + 1 - j) - \delta(i - j)] \tag{22} \]
\[ \frac{da_i}{dy_j} = \delta(i - j) . \tag{23} \]
By these computations, we have completed the derivation in (12), so that the derivative of the whole objective function is now also available.
Fig. 2. Time-frequency trajectories of the benchmark signals
2.3 Optimization
Now that we have the derivative of our objective function, we can apply a gradient-based optimization algorithm to solve (9). The class of quasi-Newton methods is of special interest, as they approximate the Hessian of the objective function by an iterative update formula with the gradient information as input. Thus, these algorithms show a much more robust iteration behavior than the simple steepest-ascent method, especially in the presence of "narrow valleys" in the parameter space. We selected the BFGS algorithm (named after Broyden, Fletcher, Goldfarb, and Shanno), as it is one of the most popular [8] and promising.
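As an illustration of this choice, the optimization step can be sketched with SciPy's BFGS implementation. This is our own sketch, not the authors' implementation: it reuses the hypothetical `energy` function from the earlier snippet, assumes the data arrays are already defined, and falls back on finite-difference gradients instead of the analytic gradient of Section 2.2.

```python
import numpy as np
from scipy.optimize import minimize

# Maximize E(y) of Eq. (9) with BFGS by minimizing -E(y).
def neg_energy(y, nodes, x, t, atom_centers, sigma_t):
    return -energy(y, nodes, x, t, atom_centers, sigma_t)

y0 = np.full(len(nodes), 10.0)   # flat initial trajectory at 10 Hz (arbitrary)
result = minimize(neg_energy, y0,
                  args=(nodes, x, t, atom_centers, sigma_t), method='BFGS')
y_hat = result.x                 # spline node values of the tracked trajectory
```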
3 Results
An experiment with two synthetic signals is carried out to obtain a ground-truth comparison of the sparse dynamic atoms with classic Gabor atoms. Afterwards, a result on real-life magnetoencephalogram (MEG) data is presented. For the simulation, we took one target chirp signal whose frequency trajectory is in C∞ (sine function) and one from C⁰ (triangle function). These ground truth trajectories are visualized in Fig. 2. The vertical lines indicate the spline nodes that we allow for approximating the target signal. We compare the behavior of the dynamic atoms with that of the commonly known Gabor atoms. As a measure for this comparison, we computed the mean squared error between the ground truth and the tracked trajectory. In order to use "realistic" experimental conditions, the original signals have been overlaid with noise at 0 dB SNR. Figure 3 shows the error values of both approaches plotted against the number of atoms per spline subinterval. The significant advantage of the spline-based atoms compared to the Gabor atoms
Fig. 3. Error values for spline atoms (sp) and Gabor atoms (gb) for signal type 1 and 2
Fig. 4. Time-frequency plane of a MEG-signal with the extracted trajectory as solid line for the spline atoms and as a dotted line for the standard Gabor atoms. White indicates high power.
is visible in the range where only a few atoms are used. If there are more than two atoms per subinterval, the difference between the two approaches is close to the measurement precision of our signals. To demonstrate the practical applicability, we take MEG data from a photic driving experiment [9], where so-called alpha oscillations in the brain are manipulated by a time-limited flickering light stimulus. The dynamics of this process is of special interest for the study of epileptic disease. Fig. 4 therefore shows a Gabor analysis of the data, where white indicates high power. The onset of the stimulus is at the 0 seconds point of the presented signal. For the setup of the tracking, we chose the spline intervals to be 0.25 seconds and used one atom per interval. The result of our tracking algorithm is shown as a solid black line, and the tracking with standard Gabor atoms as a dotted black line, superimposed on the Gabor analysis. It can easily be seen that the trajectory from the spline atoms outperforms that of the standard Gabor atoms. The dynamics of the oscillations are depicted clearly, and further automatic data analysis is made possible at a lower computational cost.
4 Conclusion
We have presented two approaches to off-line frequency tracking. The trajectory of maximum energy through the TFP is modeled by a natural cubic spline curve, which enables us to reduce the search space drastically. The energy formulation is approximated by standard Gabor atoms in one case and by dynamic chirp atoms, which implicitly approximate the spline curve, in the other. The dynamic atoms allow a sparse sampling along the trajectory without losing significant precision. Differentiating these atoms with respect to the spline parameters leads to an efficient implementation using a quasi-Newton optimization scheme. Numerical experiments validate the superiority of the spline atoms over the standard Gabor atoms. Future investigations will include the design of adaptive regions of interest for higher frequency regions of MEG data which incorporate the dynamics of the tracked alpha oscillations. The generated regions of interest will thereby be free of higher harmonics of alpha.
References

1. Prudat, Y., Vesin, J.-M.: Multi-signal extensions of adaptive frequency tracking algorithms. Signal Processing 89, 963–973 (2009)
2. Johansson, A.T., White, P.R.: Instantaneous frequency estimation at low signal-to-noise ratios using time-varying notch filters. Signal Processing 88, 1271–1288 (2008)
3. Streit, R., Barrett, R.: Frequency line tracking using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing 38(4), 586–598 (1990)
4. Witte, H., Putsche, P., Hemmelmann, C., Schelenz, C., Leistritz, L.: Analysis and modeling of time-variant amplitude-frequency couplings of and between oscillations of EEG bursts. Biological Cybernetics 99(2), 139–157 (2008)
5. Gabor, D.: Theory of communication. Journal of the Institution of Electrical Engineers 93, 429–457 (1946)
6. Schoenberg, I.J.: On equidistant cubic spline interpolation. Bulletin of the American Mathematical Society 77(6), 1039–1044 (1971)
7. Kováčová, M., Kristeková, M.: New version of matching pursuit decomposition with correct representation of linear chirps. In: Proceedings of Algoritmy, pp. 33–41 (2002)
8. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Heidelberg (2006)
9. Schwab, K., Ligges, C., Jungmann, T., Hilgenfeld, B., Haueisen, J., Witte, H.: Alpha entrainment in human electroencephalogram recordings. Neuroreport 17, 1829–1833 (2006)
Improving Performance of a Noise Reduction Algorithm by Switching the Analysis Filter Bank

Hamid Sepehr¹,², Amir Y. Nooralahiyan¹, and Paul V. Brennan²

¹ ElaraTek LTD
² University College London
[email protected]
Abstract. A new approach for the preservation of transient parts of speech in a noise reduction system is proposed in this paper. Transient components of speech, such as vowel onsets and the beginning of some consonants (e.g., stop sounds), are important for the intelligibility of speech. These components are usually attenuated by noise reduction algorithms due to the low temporal resolution of block-based noise reduction techniques. A method is proposed to detect the transient components of speech, followed by dynamic switching of the analysis filter bank at the front end of the noise reduction system to provide higher resolution in the time domain. The optimal spectral gain values are transformed into the time domain to form a linear filter that performs the noise reduction, and only group delay equalisation is performed to avoid discontinuity. Our objective evaluation shows that the proposed method provides superior performance compared to noise reduction with fixed time/frequency resolution analysis filter banks.
1 Introduction

Ambient noise is the most common factor degrading the quality and intelligibility of speech signals in many telecommunication networks, particularly when wireless communication devices, such as mobile phones or wireless headsets, are used in high-noise environments. Additionally, CODECs in many digital devices further reduce the quality and intelligibility of speech. The cumulative effect of CODECs and ambient noise can significantly reduce the quality of conversation in telecommunication networks, causing strain and listening fatigue for the user. There are various known methods for performing noise estimation followed by noise cancellation on audio voice signals. One of the most practically used noise estimation approaches is based on minimum statistics [1] and its modified variants [2][3][4]. These methods are based on tracking the minimum value of energy in different sub-bands, where the smoothed value of the minimum signal is considered as the estimated noise level. Other methods based on using a Voice Activity Detector (VAD) are also used [4], in which the estimated noise value is updated during speech pauses. After the noise value is estimated, the noise reduction part of the algorithm is mostly performed in the spectral domain. The noise reduction algorithm transforms the audio data frame by frame to the frequency domain and subsequently
attenuates the estimated noise level, which usually results in some speech distortion. Two of the most practical (low complexity and ease of implementation) methods used for noise reduction are based on adaptive sub-band Wiener filtering and spectral subtraction performed independently in the frequency bins [4][5]. Equations (1) and (2) briefly describe the foundation of the spectral subtraction algorithm. The noisy signal y(m) is the sum of the desired signal x(m) and the noise n(m):
y(m) = x(m) + n(m)    (1)

Y(ω) = X(ω) + N(ω)  ⟹  X̃(ω) = Y(ω) − Ñ(ω)    (2)
where Y(ω), X(ω) and N(ω) are the Fourier transforms of y(m), x(m) and n(m), respectively, and Ñ(ω) and X̃(ω) respectively represent the estimated noise level and the estimated clean speech signal. The Short Time Fourier Transform (STFT) and the Wavelet Transform are typical filter banks which are widely used in conjunction with different noise reduction algorithms, such as the Wiener filtering [6] and spectral subtraction [5] indicated above. One of the drawbacks of these methods is the fixed frame length of the audio data, which results in a fixed temporal/frequency resolution of the selected transform. Commonly, the audio frame length is selected to be of sufficient length to obtain a good frequency resolution, in order to achieve a good degree of noise reduction and to limit excessive variations in the noise reduction algorithm. This process is reasonable and can work well; however, it does not take into account fast-moving non-stationary elements of speech. This drawback results in non-optimal noise reduction performance in non-stationary transient parts of speech, creating pre-echo or undesirable attenuation of the wanted transient portion of the speech signal. Examples of such parts include plosive bursts, formant transitions and vowel onsets of speech, which are major contributors to the intelligibility of the speech signal. The main area of study in this paper is to achieve a reasonable solution based on trading off the frequency resolution when the transient components of the speech signal are detected. The secondary aspect of this study is the time-domain implementation of the proposed solution and a review of the added advantages that this approach encompasses. The above-mentioned trade-off is addressed in recent publications on noise reduction and speech enhancement. In [7], an adaptive buffer size is proposed to estimate and attenuate the noise. The adaptive segmentation algorithm defines the length of the effective audio buffer depending on the non-stationary nature of the signal. In [8], time-domain high-temporal-resolution audio data is used, followed by an extrapolation technique to further improve the frequency resolution of the audio data; however, this algorithm is computationally intensive and extrapolation does not necessarily guarantee an improvement in frequency resolution. Another approach is taken in [9] to mix the coefficients of two different analysis filter banks depending on transient detection in the signal; however, details of the filter banks and perfect reconstruction of the method are not addressed. The approach proposed in this study is somewhat similar to [10], in which a pair of adaptive analysis/synthesis filter banks
with high/low temporal resolutions are switched (swapped) depending on transient detection in the audio speech signal. The main difference in the approach presented in this study is that the system works in the time domain, so the selected filter banks do not need to satisfy the full reconstruction restrictions during the transient switching period. In an interesting approach [11], the similarity of the visual and aural systems in human subjects, and their sensitivity to edge contours in images and acoustic edges (speech transients) in audio signals, is exploited, and an adaptive structure method (previously used for image enhancement [12]) is developed to enhance the speech signal. In this paper, we propose a technique to switch between two different analysis filter banks depending on transient detection in the signal, with the process of noise reduction carried out in the time domain. Objective measures and informal subjective testing demonstrate improvement in the quality and intelligibility of the processed signal for a variety of noise test signals. The proposed algorithm has low complexity and requires a small memory footprint. It also introduces very low processing delay (< 8 msec), which makes it an attractive solution for many applications that are sensitive to processing delay in the receive and/or transmit channels. Such applications include wireless handsets and network infrastructure, particularly end-user headset/handset and telephony devices with built-in side-tones. In the next section the fundamentals of the suggested idea are outlined; Section 3 describes the proposed algorithm for detection of transient parts of the speech signal; in Section 4 the modified structure of the proposed noise reduction is further explained; and finally the results and conclusions of the suggested method are presented in Sections 5 and 6.
2 Fundamental Principles of the Suggested Method

The fundamentals of the proposed algorithm lie in two concepts: (i) detection of transients in speech signals in order to detect events such as vowel onsets and plosive bursts; this information is used to change the analysis filter bank to increase the temporal resolution of the time-frequency representation of the signal; (ii) performing the noise reduction in the time domain, as opposed to the traditional frequency domain. The latter has been investigated by Gustafsson [13][14], where the main idea is to compute the optimal spectral gains based on the a priori SNR in different frequency bins. The frequency domain weights w_j (j represents the different sub-bands) are transformed to the time domain as a linear filter that is convolved with the time domain audio signal. As the frequency domain optimal weights are real-valued vectors, the time domain representation of these weights is a non-causal filter. This means a suitable phase can be added to the spectral domain gains to achieve a causal real-valued filter. A Hilbert-based algorithm is used in [13] to obtain the minimum-phase filter. However, in this study, a linear-phase approach is used for ease of implementation and in order to have a controlled group delay of the time domain filter, as shown in Equation 4. Equation 3 shows the optimal spectral domain gains for estimation of the
desired signal. In Equation 4, N is the length of the filter, and hence the length of the audio buffer, and k indexes the elements of the filter vector.
X̃(ω) = Y(ω) × W(ω)  ⟹  x̃(m) = y(m) ∗ w(m)    (3)

[w_0(m), w_1(m), …, w_{N−1}(m)] = IFFT([W_0(ω), W_1(ω), …, W_{N−1}(ω)] × exp(−j·2πk/N))    (4)
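As an illustration of Equations (3) and (4), the following Python sketch maps a vector of real-valued spectral gains to a causal linear-phase time-domain filter. The function name and the generalised `delay` parameter are our own illustrative choices; Equation (4) as printed corresponds to a unit-sample delay term.

```python
import numpy as np

def gains_to_time_filter(W, delay=None):
    """Turn real-valued spectral gains W[k] into a causal FIR filter.
    A plain IFFT of symmetric real gains yields a zero-phase, non-causal
    filter; multiplying by a linear phase exp(-j*2*pi*k*delay/N) shifts
    the impulse response by `delay` samples, making it causal (Eq. 4)."""
    W = np.asarray(W, dtype=float)
    N = len(W)
    if delay is None:
        delay = N // 2                     # centre the impulse response
    k = np.arange(N)
    phase = np.exp(-1j * 2 * np.pi * k * delay / N)
    w = np.fft.ifft(W * phase)
    return w.real                          # imag part negligible for symmetric gains

# usage: filter a noisy frame y by convolution, as in Eq. (3)
# x_hat = np.convolve(y, gains_to_time_filter(W), mode="same")
```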
The proposed method for detection of transient elements of the audio signal is presented in the next section.
3 Detection of Boundary of Speech Utterance

In order to increase the temporal resolution of the proposed noise reduction, it is required to detect the points of interest, and hence the stationary boundaries, of the audio signal. Transient detection has many applications in audio signal processing, from efficient audio coding to audio signal segmentation and many others; however, in this paper, transient detection is used to increase the temporal resolution of the analysis filter bank in a noise reduction solution. In Lukin et al. [9], a transient is defined as a variation of signal properties in time; the audio critical bands are convolved with an impulse response h(t) = {−1,−1,−1,0,1,1,1}, and transient-like signals are estimated by thresholding these sub-band values. In the edge-preserving method used in image noise reduction [12], the transient behaviour of the noisy signal is represented by the value S_k defined in equation (5); comparing this value with the threshold T_k controls the increase or decrease of the frame size, where E_k represents the energy of the noisy signal in the k-th frame, M_k is the average of the noisy signal in the k-th frame, δn_k is the noise variance in the k-th frame, and L_k is the frame length. This can be a low-complexity and efficient algorithm for transient detection in audio signals.

S_k = E_k − (M_k)² − (δn_k)²,    T_k = η (δn_k)² / L_k    (5)
Transient detection in speech is also used for efficient high-quality compression of speech signals. In one implementation [15], the weighted summation of the first derivative of the signal energy in four frequency regions up to 5.5 kHz is used as a measure to mark the transient segments of the speech signal. A transient can also be detected by a likelihood-ratio measurement of the statistical similarity of two blocks of data, to decide whether the signal has stayed stationary or its statistical characteristics have changed, hence signifying a transient event [10]. One efficient method to detect speech transients was suggested by Li et al. [16], who define transients in a time series as non-recurring signals lacking fractal structure, and use a computational complexity measure to detect nonlinear features of the signal in the time series. In this study, a benchmark between the methods presented in [12], [15] and [16] was carried out. The method in [16] provided the best results; hence the implementation details of this algorithm, including our modification of [16], are listed below:
- In a received audio buffer, compute the mean of the signal and construct a binary sequence by comparing each value of the audio buffer to the mean value.
- Estimate the Kolmogorov Complexity ("KC") of the binary sequence.
- Calculate the first-order derivative of KC and compare the absolute value of the first derivative with a threshold.
- Hold the activated flag for 32 ms to ensure that the transient signal is included during the active-flag period.
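The following Python sketch illustrates these steps. The paper does not specify how KC is estimated; here Lempel-Ziv phrase counting, a standard estimator of Kolmogorov complexity, stands in for it, and the frame length, threshold and hold length are illustrative parameters.

```python
import numpy as np

def lz_phrase_count(s):
    """Rough Lempel-Ziv complexity: number of new phrases needed to
    parse the binary string s from left to right."""
    phrases, i, n = 0, 0, len(s)
    while i < n:
        j = i + 1
        while j <= n and s[i:j] in s[:j - 1]:   # extend while seen before
            j += 1
        phrases += 1
        i = j
    return phrases

def kc_transient_flags(x, frame_len=256, thresh=2.0, hold_frames=10):
    """Per-frame KC of the mean-binarized buffer; a frame is flagged as
    transient when |dKC| between frames exceeds thresh, and the flag is
    then held for hold_frames frames (roughly the 32 ms hold above)."""
    n_frames = len(x) // frame_len
    kc = np.empty(n_frames)
    for f in range(n_frames):
        buf = x[f * frame_len:(f + 1) * frame_len]
        bits = ''.join('1' if v > buf.mean() else '0' for v in buf)
        kc[f] = lz_phrase_count(bits)
    dkc = np.abs(np.diff(kc, prepend=kc[0]))
    flags = dkc > thresh
    out = flags.copy()
    for f in np.flatnonzero(flags):             # hold the activated flag
        out[f:f + hold_frames] = True
    return out
```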
Figure 1 below shows the ability of this method to detect the transient start of speech when speech is mixed with white noise, at an input SNR value of 12.
Fig. 1. Noisy speech signal, KC value of the signal, and transient point in the noisy speech signal
4 Modified Noise Reduction System

The primary objective in this paper is to preserve the transient points in the speech audio signal by increasing the temporal resolution of the analysis filter bank. The secondary objective is to illustrate the advantages of a time-domain implementation of this approach. Using the modified KC method described above, we are able to detect the transient points in speech; upon detection of a transient, the analysis filter bank is switched to provide better temporal resolution. When the transient ends, the analysis filter bank is switched back to one with better frequency resolution, so that optimum noise reduction can be achieved. In this implementation, as the noise reduction is carried out in the time domain, there is no
constraint to ensure full reconstruction of the filter bank, because the frequency domain signal does not need to be synthesized back to the time domain (the optimal spectral coefficients are synthesised in the time domain, as explained in Section 2). The only consideration is the group delay of the two filters resulting from the high-frequency-resolution versus high-temporal-resolution analysis filter banks; this can be simply rectified by zero-padding the filters obtained with the high-temporal-resolution filter bank. Figure 2 below illustrates a sample filter obtained from each of the two filter banks (L1 = 32 and L2 = 128) and their group delays.
Fig. 2. Time Domain Filters and their group delays
In the above example, in order to equalise the group delay and avoid any glitches or discontinuity in the processed signal, the filter implemented with the high-temporal-resolution filter bank is padded with zeros, where the required number of zeros is (L2 − L1)/2 = 48.

In Figure 3, the overall structure of the proposed method is depicted. The input data is buffered and passed through one of the two analysis filter banks. The selection of the analysis filter bank depends on the KC transient detection block. The windowed signal is then passed through an STFT to determine the optimal spectral gains [5],[6],[13],[14]. Additionally, the transient detection flag controls the smoothing performed on the optimal spectral filter gains and signal spectra. If a transient is detected, the spectral gains change their values more rapidly in order to take into account the transient nature of the signal [11]; otherwise, the spectral gains are smoothed further with an autoregressive filter to provide a steady and consistent noise reduction performance. Once the optimal spectral filter is determined, the filter is mapped into the time domain and convolved with the incoming noisy audio signal. If a transient is detected, a variable gain of between 0 and 4 dB is also applied to the processed audio signal. Tests based on informal subjective listening suggest that higher gain values (~4 dB) can be applied at high SNR values; however, this gain needs to be reduced as the
SNR deteriorates. A similar method is suggested in [17] to improve the intelligibility of the speech signal in the presence of background noise, and is deployed in [10] for a noise reduction algorithm with a switching analysis/synthesis filter bank structure.
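A minimal sketch of the group-delay equalisation step described in this section: the shorter linear-phase filter from the high-temporal-resolution bank is zero-padded to match the longer filter's delay. The function name and the symmetric-padding choice are our own.

```python
import numpy as np

def equalise_group_delay(w_short, L_long):
    """Pad a linear-phase FIR filter from the short (high temporal
    resolution) bank with (L_long - L_short) / 2 leading zeros so its
    group delay matches the long bank's, then pad the tail to L_long."""
    L_short = len(w_short)
    lead = (L_long - L_short) // 2          # e.g. (128 - 32) / 2 = 48
    tail = L_long - L_short - lead
    return np.concatenate([np.zeros(lead), w_short, np.zeros(tail)])
```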
Fig. 3. Overall structure of the suggested method (noisy signal → analysis filter bank switching → STFT → optimal spectral gains vector → mapping to time domain and group delay equalisation → variable gain → clean signal, with the KC transient detection block controlling the switching and smoothing)
Fig. 4. Two selections of windows, for higher temporal resolution and for higher frequency resolution
5 Results

Figure 5 below illustrates a pure speech signal (top left); the noisy speech signal with pink noise at SNR 15 (bottom left); the signal processed with a fixed analysis filter bank (top right); and the proposed method of switching analysis in the time domain (bottom right). As can be seen, with the proposed noise reduction method the transient speech components are less attenuated in comparison with a noise reduction algorithm using a fixed analysis filter bank. The proposed algorithm was tested objectively against a fixed time/frequency resolution analysis filter bank; the test database included 80 different words spoken by two male and two female speakers. The
noise types were white noise, pink noise, car road noise and engine noise, at SNR values of 9, 12, 15 and 20 (at lower SNR values the performance of the transient detector deteriorates, which impacts the effectiveness of the algorithm). The objective measure for benchmarking the proposed algorithm is PESQ (Perceptual Evaluation of Speech Quality) [4]; the PESQ software for objective evaluation of noise reduction algorithms can be downloaded from [18]. The objective measurement shows a significant improvement for the proposed method compared to noise reduction with fixed time/frequency resolution. This is illustrated in Figure 6, where the PESQ value is increased by 0.1 on average for a variety of noise types and at different input SNR values.
Fig. 5. Time-domain presentations of the clean, noisy and processed signals
Fig. 6. PESQ values for the fixed analysis filter bank and the suggested method, at SNR 9, 12, 15 and 20
6 Conclusion

In this paper, a method to preserve the transient elements of speech in a noise reduction system is proposed. KC values were used to identify the transient
components of speech, and upon detection of these transient events the temporal resolution of the analysis filter bank used for noise reduction was increased. Objective measurement shows superior performance of the proposed method over a standard noise reduction method with a fixed analysis filter bank. The objective performance evaluation was based on PESQ and gave an average increase of 0.1 in PESQ value over the standard method for a variety of noisy test signals. The proposed algorithm operates in the time domain, so no synthesis filter bank is needed. This eliminates any requirement for full reconstruction of the selected filter bank during the switching period and reduces the complexity, computational load and memory footprint of the algorithm. Additionally, as the noise reduction is performed in the time domain, the only processing delay is due to the group delay of the time domain filter, which is small (< 8 msec) and therefore makes the proposed algorithm a suitable solution for many applications that are sensitive to processing delay in the receive and/or transmit channels.

Acknowledgment. The authors would like to thank EPSRC for funding the research under research grant EE385538.
References

[1] Martin, R.: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. on Speech and Audio Processing 9(5), 504–512 (2001)
[2] Cohen, I., Berdugo, B.: Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Proc. Letters 9(1), 12–15 (2002)
[3] Cohen, I.: Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging. IEEE Transactions on Speech and Audio Processing 11(5) (2003)
[4] Loizou, P.: Speech Enhancement Theory and Practice. CRC Press, Boca Raton (2007)
[5] Kamath, S.D., Loizou, P.C.: A Multi-Band Spectral Subtraction Method for Enhancing Speech Corrupted by Colored Noise. In: ICASSP (2002)
[6] Hu, Y., Loizou, P.: Speech enhancement by Wavelet thresholding the multitaper spectrum. IEEE Transactions on Speech and Audio Processing 12(1), 59–67 (2004)
[7] Hendriks, R.C., Heusdens, R., Jensen, J.: Adaptive time segmentation for improved speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 14(6), 2064–2074 (2006)
[8] Kauppinen, I., Roth, K.: Improved Noise Reduction in Audio Signals Using Spectral Resolution Enhancement with Time-Domain Signal Extrapolation. IEEE Transactions on Speech and Audio Processing 13(6) (2005)
[9] Lukin, A., Todd, J.: Adaptive Time-Frequency Resolution for Analysis and Processing of Audio. In: AES 120th Convention, Paris, France (2006)
[10] Mauler, D., Martin, R.: Improved Reproduction of Stops in Noise Reduction Systems with Adaptive Windows and Non-stationary Detection. EURASIP Journal on Advances in Signal Processing (2009)
[11] Quatieri, T.F., Dunn, R.B.: Speech Enhancement Based on Auditory Spectral Change. In: ICASSP, pp. 257–260 (2002)
[12] Song, W.-J., Pearlman, W.A.: Edge-Preserving Noise Filtering Based on Adaptive Windowing. IEEE Transactions on Circuits and Systems 35(8) (1988)
[13] Gustafsson, H.: Speech enhancement for mobile communications. Department of Telecommunications and Signal Processing, University of Karlskrona/Ronneby, Ronneby (2000)
[14] Gustafsson, H., et al.: Signal Noise Reduction by Time-Domain Spectral Subtraction. US Patent US6507623B1 (2003)
[15] Szwoch, G., Kulesza, M., Czyżewski, A.: Transient Detection for Speech Coding Applications. IJCSNS International Journal of Computer Science and Network Security 6(12) (2006)
[16] Li, Y., Fan, Y.-L., Tong, Q.-Y.: Endpoint Detection in Noisy Environment Using Complexity Measure. In: Proceedings of the 2007 International Conference on Wavelet Analysis and Pattern Recognition, Beijing, China (2007)
[17] Hazan, V., Simpson, A.: The effect of cue-enhancement on the intelligibility of nonsense word and sentence materials presented in noise. Speech Communication 24(3), 211–226 (1998)
[18] PESQ and other objective measures for evaluating quality of speech, http://www.utdallas.edu/~loizou/speech/software.htm
On the Choice of Filter Bank Parameters for Wavelet-Packet Identification of Dynamic Systems

Henrique Mohallem Paiva and Roberto Kawakami Harrop Galvão

Instituto Tecnológico de Aeronáutica – ITA, CTA, São José dos Campos, SP, 12228-900, Brazil
{hmpaiva,kawakami}@ita.br
Abstract. This paper is concerned with a recently proposed technique for linear system identification in frequency subbands using wavelet-packet filter banks. More specifically, the effect of using different mother wavelets and resolution levels is investigated. The study is based on simulated examples involving the identification of a servomechanism model. The results reveal that the identification outcome can be improved by using wavelet filters with better frequency selectivity, as well as by increasing the number of resolution levels in the filter bank. In this context, the advantages of using wavelet packets instead of standard wavelet decompositions are also discussed. Keywords: Wavelet Packets, System Identification, Filter Banks.
1 Introduction
System identification techniques employing wavelet decompositions have been widely used in several applications. Examples include, for instance, modal parameter estimation in mechanical systems [7,9], vibration signal modelling [5], and non-parametric linear system identification [11]. The use of wavelet packets, which allow greater flexibility in spectral partitioning as compared to standard wavelet transforms, has also been proposed in this context [12]. In [12], the wavelet-packet decomposition tree was used to establish frequency bands where subband models were created. The tree structure was optimized by using a generalized cross-validation method in order to achieve a compromise between accuracy and parsimony of the overall model. In this manner, the most appropriate frequency partitioning for the subband models was automatically determined. In comparison with a standard ARX (autoregressive with exogenous input) identification method, the proposed technique was superior in terms of resonance peak identification and sensitivity to white measurement noise. This technique was also studied in [13], where the effects of coloured noise were analyzed. However, some important aspects of concern for system identification were not addressed. In particular, the advantage of using wavelet packets instead of the more conventional wavelet transform was not clearly demonstrated.
Moreover, the choice of mother wavelet and number of resolution levels was not discussed. In fact, choosing the mother wavelet is an important issue in any given application [2], such as texture analysis [1], electrocardiogram signal processing [4] and denoising [14]. The present work investigates the effect of the choice of wavelet type (family and length) and number of resolution levels in the wavelet-packet identification algorithm proposed in [12]. For this purpose, a case study involving a servomechanism model is presented. The investigation is aimed at providing practical guidelines for using the wavelet-packet identification technique. Moreover, a comparison with an identification structure employing the conventional wavelet transform is carried out. In this case, the purpose consists of determining whether the wavelet-packet formulation does bring advantages over the use of simpler wavelet decompositions.

1.1 Notation
Fig. 1 depicts the wavelet-packet decomposition and reconstruction trees [15]. In this figure, H and G are lowpass and highpass filters, respectively, and Hr and Gr are the associated reconstruction filters. In the present work, these filters are adopted such that the filter bank is orthonormal. ↓2 and ↑2 denote downsampling and upsampling operations, respectively. y is the signal to be decomposed. x_{i,j}{y} represents the wavelet coefficients of signal y at node (i, j) of the tree, where j ≥ 0 indicates the resolution level and i is a sequential index ranging from 0 to 2^j − 1. Node (i, j) is the parent of nodes (2i, j+1) and (2i+1, j+1), which are the children of node (i, j). All nodes without children are called leaf nodes. The depth of the tree is defined as the highest level in which nodes are present. A tree is said to be complete if all leaf nodes are in the same resolution level.
2
x 0,1 {y}
H(z) G(z)
y = x 0,0 {y}
G(z)
2
x 1,1 {y}
2 2
H(z)
2
G(z)
2
x 0,2 {y} x 1,2 {y} x 2,2 {y} x 3,2 {y}
... ... ... ...
(b) ...
...
x 0,2 {y} 2 x 1,2 {y}
x 2,2 {y} ...
...
Hr (z) + +
2
x 0,1 {y}
2
Hr (z)
Gr (z) + +
2
Hr (z) + +
x 3,2 {y} 2
x 1,1 {y}
2
y = x 0,0 {y}
Gr (z)
Gr (z)
Fig. 1. Wavelet-packet (a) decomposition and (b) reconstruction tree
Scalars and row vectors are represented by italic and boldface lowercase symbols, respectively.
2 The Wavelet-Packet Identification Technique
This section presents a short review of the wavelet-packet identification technique proposed in [12]. The identification algorithm is based on the development of several subband models. The wavelet-packet decomposition tree (Fig. 1a) is used to establish the frequency bands at which the subband models will be created. Each leaf node of the tree is associated with a frequency band, and the complete set of leaf nodes covers the whole frequency range. For each frequency band, a subband model is created. This modelling scheme is illustrated in Fig. 2 for a particular structure of the wavelet-packet decomposition tree. In this figure, M_{i,j} indicates the subband model intended to represent the plant in the frequency band associated with leaf node (i, j). If (i, j) is a leaf node, then signal ŭ_{i,j} is defined as the output of model M_{i,j} for input u. If (i, j) is not a leaf node, then the coefficients x_{i,j}{ŭ_{i,j}} are defined as the reconstruction of the coefficients at the children nodes of (i, j).
Fig. 2. Example of the system modelling scheme for a particular wavelet-packet decomposition tree with four leaf nodes (in this example, all leaf nodes are in the same level, but this is not a requirement)
The structure adopted for each subband model is a transfer function of the form M_{i,j}(z) = P_{i,j}(z) Q_{i,j}(z), where

P_{i,j}(z) = [1/(1 − z⁻¹)]^{s_{i,j}}, s_{i,j} ∈ Z;    Q_{i,j}(z) = α_{i,j} + β_{i,j} z⁻¹, α_{i,j}, β_{i,j} ∈ R    (1)

Parameters s_{i,j}, α_{i,j} and β_{i,j} are estimated in order to minimize the cost function J_{i,j}: Z × R² → R given by J_{i,j}(s_{i,j}, α_{i,j}, β_{i,j}) = e_{i,j}(e_{i,j})ᵀ, where the residue e_{i,j} = x_{i,j}{y − ŭ_{i,j}} denotes the wavelet-packet coefficients of the difference between the plant output y and the subband model output ŭ_{i,j} in the frequency band under consideration. For a fixed value of s_{i,j}, cost J_{i,j} is minimized with respect to α_{i,j} and β_{i,j} by a least-squares procedure. A search algorithm is used to find the value of
s_{i,j} that leads to the minimum value of J_{i,j}. The details of these procedures are described in [12]. As demonstrated in [12], the cost J_{0,0} at the root node (i, j) = (0, 0) of the wavelet-packet tree is equal to the square of the 2-norm of the prediction error (y − ŷ), that is, J_{0,0} = (y − ŷ)(y − ŷ)ᵀ. Furthermore, for orthonormal wavelets, the cost at a non-leaf node (i, j) was shown to be equal to the sum of the costs at its children nodes, i.e., J_{i,j} = J_{2i,j+1} + J_{2i+1,j+1}. Such features allow the use of the following algorithm to adjust the wavelet-packet tree structure in order to achieve a compromise between accuracy and complexity of the model [12].

1. Fix the maximum depth d allowed for the tree and initialize the search with a complete tree with that depth.
2. All nodes are candidates to be leaf nodes. Thus, for each node (i, j), leaf or non-leaf, obtain the associated subband model M_{i,j}(z). Calculate, for each node, the square of the 2-norm of the residue e_{i,j} and call this value J^l_{i,j}. Superscript l, standing for leaf, is used to emphasize that J^l_{i,j} will be equal to the cost J_{i,j} at node (i, j) only if this node is chosen to be a leaf.
3. Initialize the costs at level d, the deepest level of the tree. If a node of this level is kept in the tree, it will necessarily be a leaf node. Thus, let J_{i,d} = J^l_{i,d} for all i = 0, 1, ..., (2^d − 1).
4. Analyze the other nodes of the tree, which can be made either leaf or non-leaf nodes. Start from level j = d − 1 and use a bottom-up approach (that is, analyze all nodes of a level before evaluating the parent level). Decide whether each node (i, j) should be a leaf or a non-leaf node by comparing cost J^l_{i,j} (cost if (i, j) is a leaf node) with the sum J_{2i,j+1} + J_{2i+1,j+1} of the costs at its children nodes (cost if (i, j) is a non-leaf node). The decision rule is:

J_{i,j} = J^l_{i,j}, if J^l_{i,j} ≤ ρ(J_{2i,j+1} + J_{2i+1,j+1}); J_{i,j} = J_{2i,j+1} + J_{2i+1,j+1}, otherwise,

where the penalty factor ρ ≥ 1 ensures that node (i, j) will only be split into children nodes if the cost reduction is large enough to justify the increase in model complexity.

The penalty factor ρ is required to avoid overfitting of the identification data. An increase in the value of ρ tends to reduce the number of nodes in the resulting tree. Thus, the choice of the penalty factor ρ can be regarded as a model order determination problem, which is addressed by using a generalized cross-validation (GCV) method, described in detail in [12]. The value of ρ that minimizes the GCV index is selected, thus providing a tradeoff between model parsimony and identification accuracy. Therefore, the ability of the model to represent the behaviour of the system for input signals different from the one used in the identification is improved.
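A minimal Python sketch of the bottom-up pruning rule in step 4, assuming the leaf costs J^l_{i,j} have already been computed for every node; function and variable names are illustrative.

```python
def prune_wavelet_packet_tree(J_leaf, d, rho):
    """Bottom-up cost propagation over a complete tree of depth d.
    J_leaf[(i, j)] holds J^l_{i,j}; a node is split only if the penalized
    children cost rho*(J_child1 + J_child2) is below its own leaf cost.
    Returns the leaves of the optimized tree and the root cost J_{0,0}."""
    J, is_leaf = {}, {}
    for i in range(2 ** d):                 # deepest level: leaves only
        J[(i, d)], is_leaf[(i, d)] = J_leaf[(i, d)], True
    for j in range(d - 1, -1, -1):          # bottom-up pass
        for i in range(2 ** j):
            children = J[(2 * i, j + 1)] + J[(2 * i + 1, j + 1)]
            if J_leaf[(i, j)] <= rho * children:
                J[(i, j)], is_leaf[(i, j)] = J_leaf[(i, j)], True
            else:
                J[(i, j)], is_leaf[(i, j)] = children, False
    leaves, stack = [], [(0, 0)]            # read off the pruned tree
    while stack:
        i, j = stack.pop()
        if is_leaf[(i, j)]:
            leaves.append((i, j))
        else:
            stack += [(2 * i, j + 1), (2 * i + 1, j + 1)]
    return leaves, J[(0, 0)]
```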
3 Case Study
This case study involves the identification of a servomechanism model. Eq. (2) represents the transfer function employed in the simulations, where U is the input voltage (volts) and Y is the shaft speed (rad/s) of the servomechanism.
Y(s)/U(s) = 5000 / (s³ + 75s² + 1350s + 15000)    (2)
The following chirp excitation was used for identification: u(kT) = sin(2πf·kT), k = 0, 1, ..., 5×10⁴, T = 0.0125 s, with frequency f varying linearly from 0.1 to 40 Hz. A zero-mean white Gaussian noise was added to the output y of the plant, with standard deviation equal to 5% of the standard deviation of y. Sections 3.1 and 3.2 present investigations concerning the choice of the number of resolution levels and the choice of the wavelet filters, respectively.
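For reference, the identification data described above can be generated along the following lines (a sketch using SciPy; the random seed and variable names are our own choices):

```python
import numpy as np
from scipy import signal

# Chirp excitation and simulated servomechanism output (Eq. 2)
T = 0.0125                               # sampling period (s)
k = np.arange(5 * 10**4 + 1)
t = k * T
f = np.linspace(0.1, 40.0, k.size)       # frequency swept linearly, Hz
u = np.sin(2 * np.pi * f * t)

servo = ([5000.0], [1.0, 75.0, 1350.0, 15000.0])
_, y, _ = signal.lsim(servo, u, t)       # continuous-time simulation

rng = np.random.default_rng(0)           # noise std = 5% of output std
y_noisy = y + rng.normal(0.0, 0.05 * np.std(y), y.size)
```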
3.1 Resolution Levels
This section presents the identification results obtained by using different resolution levels. The Daubechies 8 (db8) wavelet filters were adopted, as in [12]. The maximum wavelet tree depth was varied from 4 to 8. For comparison, an identification using standard wavelet trees, where only the lowpass channel is successively decomposed, was also performed. Figure 3a presents the cost J_{0,0} obtained for the standard-wavelet and wavelet-packet decomposition trees. As mentioned previously, this cost is equal to the squared 2-norm of the prediction error. The value of J_{0,0} presented in this figure is normalized by yyᵀ (squared 2-norm of output y). For comparison, the squared 2-norm of the Gaussian noise added to the plant output y is also shown, likewise normalized by yyᵀ. If the identification were perfect, cost J_{0,0} would be equal to the squared 2-norm of the noise added to output y. Since the identified model is only an approximation of the true plant, cost J_{0,0} is higher than this value. It is worth noting that J_{0,0} decreases as more resolution levels are used, pointing to an improvement in the approximation. In the wavelet-packet case, J_{0,0} appears to converge to the squared 2-norm of the noise. It is worth noting that a hypothetical cost J_{0,0} smaller than the squared 2-norm of the noise added to output y would imply data overfitting, i.e., modelling of noise rather than plant dynamics.
3.2 Wavelet Filters
This section presents the identification results obtained by using different wavelets. The wavelet tree depth was set to six, and the wavelet filters were taken from the Daubechies, Symlet and Coiflet families [6]. It should be noted that the identification technique requires the filter bank to be orthonormal, which means that families such as Morlet, Meyer and Derivative of a Gaussian (DOG) [6] cannot be used. The wavelet filters H, Hr, G, and Gr (as defined in Fig. 1) corresponding to the dbN and symN wavelets (Daubechies and Symlet families) have length 2N (each filter
Fig. 3. Servomechanism identification results: Normalized cost J0,0 for (a) db8 wavelet, different resolution levels and (b) different wavelets, six resolution levels
Fig. 4. Frequency response of the low-pass filters H corresponding to mother wavelets from the Daubechies, Symlet and Coiflet families. ωs denotes the sampling frequency.
has a finite impulse response with 2N values different from zero). The wavelet filters corresponding to the Coiflet family coifN have length 6N. Fig. 4 presents the frequency responses of the low-pass filters H corresponding to the mother wavelets compared in this study. Fig. 3b shows the resulting cost J_{0,0}. Fig. 4 shows that, within a given family, longer wavelet filters present better frequency selectivity (sharper cut-off transition). Furthermore, for a given length, the Daubechies and Symlet filters have very similar frequency responses, with selectivity better than the Coiflet filter. The results presented in Fig. 3b show that wavelet filters with better frequency selectivity lead to better results (i.e. smaller cost values). The figure shows that Daubechies and Symlet filters with the same number of tap weights lead to similar costs, which are smaller than those obtained with the Coiflet family. Furthermore, within the same family, longer wavelet filters lead to smaller costs. This result was expected, because filters with better frequency selectivity lead to a better separation of the frequency bands in the wavelet-packet frequency partitioning scheme. A comparison between Fig. 3a and Fig. 3b suggests that a reduction in filter length can be compensated by an increase in the wavelet-tree depth, and vice-versa. For instance, db12 with six resolution levels and db8 with eight resolution
levels lead to the same cost value (0.028). In terms of computational effort, the most efficient approach would be to use longer filters and fewer resolution levels (because increasing the filter length increases the complexity linearly, whereas increasing the number of resolution levels may increase the number of leaf nodes exponentially). However, at least a minimum number of resolution levels should be used, in order to maintain the frequency-partitioning flexibility.
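The frequency-selectivity comparison behind Fig. 4 can be reproduced along these lines (a sketch assuming the PyWavelets package is available; the transition-band measure is our own illustrative proxy for selectivity):

```python
import numpy as np
import pywt
from scipy.signal import freqz

# Frequency responses of orthonormal lowpass decomposition filters H.
for name in ['db3', 'db6', 'db9', 'db12', 'sym6', 'coif2']:
    h = pywt.Wavelet(name).dec_lo           # lowpass filter tap weights
    w, H = freqz(h, worN=1024)
    mag = np.abs(H) / np.sqrt(2)            # |H(0)| = sqrt(2) for orthonormal banks
    # crude selectivity proxy: width of the band where 0.1 < |H| < 0.9
    trans = w[(mag > 0.1) & (mag < 0.9)]
    width = np.ptp(trans) / np.pi if trans.size else float('nan')
    print(f"{name:6s}  length {len(h):2d}  transition width ~ {width:.3f} * pi rad")
```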
4 Conclusions
This paper revisited a recently proposed technique for linear system identification in frequency subbands. In this formulation, a wavelet-packet decomposition tree is used to establish frequency bands where subband models are created. An optimization of the tree structure is performed by using a generalized cross-validation method in order to achieve a compromise between accuracy and parsimony of the overall model. The following conclusions were obtained in the application example involving the identification of a simulated servomechanism:

- Number of resolution levels: The cost associated with the identification was shown to be higher than the squared 2-norm of the noise added to the plant output, and to converge to this value when more resolution levels are used, which indicates an improvement in the identification. The use of wavelet packets led to better results (lower identification costs) than the use of the standard wavelet decomposition. A hypothetical cost lower than the squared 2-norm of the noise, which would imply modelling of noise rather than plant dynamics, was not observed.

- Different wavelets: Wavelet filters with better frequency selectivity provided better cost results. Moreover, the results suggest that a reduction in filter length can be compensated by an increase in the wavelet-tree depth, and vice-versa. In terms of computational effort, the most efficient approach is to use longer filters and fewer resolution levels; however, at least a minimum number of resolution levels should be used, in order to keep the frequency-partitioning flexibility.

Future work could adopt other basic structures for each subband model M_{i,j}. For instance, more coefficients may be used in the Q_{i,j}(z) term in Equation (1). Moreover, the use of autoregressive terms could also be investigated. The model order determination problem, addressed in this work by using the GCV method, could be studied by using alternative approaches, such as model complexity penalty, statistical hypothesis tests [10] and the minimum description length criterion [8]. Finally, it would be worth extending the technique to the identification of a broader class of systems, such as piecewise-affine plants [3].
Acknowledgements This work was supported by Embraer, CNPq (research fellowship and postdoctoral grant 200721/2006-2) and FAPESP (grant 2006/58850-6).
References

1. Abhayaratne, G.C.K., Jermyn, I.H., Zerubia, J.: Texture-adaptive mother wavelet selection for texture analysis. In: Proc. IEEE International Conference on Image Processing (ICIP), vol. 2, pp. 1290–1293 (2005)
2. Ahuja, N., Lertrattanapanich, S., Bose, N.K.: Properties determining choice of mother wavelet. IEE Proceedings – Visual, Image and Signal Processing 152(5), 659–664 (2005)
3. Bemporad, A., Garrulli, A., Paoletti, S., Vicino, A.: A bounded-error approach to piecewise affine system identification. IEEE Transactions on Automatic Control 50(10), 1567–1580 (2005)
4. Bhatia, P., Boudy, J., Andreao, R.V.: Wavelet transformation and pre-selection of mother wavelets for ECG signal processing. In: Proc. 24th IASTED International Conference on Biomedical Engineering, pp. 390–395 (2006)
5. Chen, H.X., Chua, P.S.K., Lim, G.H.: Adaptive wavelet transform for vibration signal modelling and application in fault diagnosis of water hydraulic motor. Mechanical Systems and Signal Processing 20(8), 2022–2045 (2006)
6. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Series in Applied Mathematics, vol. 61. SIAM, Philadelphia (1992)
7. Erlicher, S., Argoul, P.: Modal identification of linear non-proportionally damped systems by wavelet transform. Mechanical Systems and Signal Processing 21(3), 1386–1421 (2007)
8. Feil, B., Abonyi, J., Szeifert, F.: Model order selection of nonlinear input-output models: a clustering based approach. Journal of Process Control 14(6), 593–602 (2004)
9. Huang, C.S., Su, W.C.: Identification of modal parameters of a time invariant linear system by continuous wavelet transformation. Mechanical Systems and Signal Processing 21(4), 1642–1664 (2007)
10. Ljung, L.: System Identification: Theory for the User, 2nd edn. Prentice Hall, Upper Saddle River (1999)
11. Luk, R.W.-P., Damper, R.I.: Non-parametric linear time-invariant system identification by discrete wavelet transforms. Digital Signal Processing 16(3), 303–319 (2006)
12. Paiva, H.M., Galvão, R.K.H.: Wavelet-packet identification of dynamic systems in frequency subbands. Signal Processing 86(8), 2001–2008 (2006)
13. Paiva, H.M., Galvão, R.K.H.: Wavelet-Packet Identification of Dynamic Systems with Coloured Measurement Noise. In: Elmoataz, A., Lezoray, O., Nouboud, F., Mammass, D. (eds.) ICISP 2008. LNCS, vol. 5099, pp. 508–515. Springer, Heidelberg (2008)
14. Singh, B.N., Tiwari, A.K.: Optimal selection of wavelet basis function applied to ECG signal denoising. Digital Signal Processing 16(3), 275–287 (2006)
15. Vetterli, M., Kovacevic, J.: Wavelets and Subband Coding. Prentice Hall, Upper Saddle River (1995)
A Signal Processing Algorithm Based on Parametric Dynamic Programming

Andrey Kopylov¹, Olga Krasotkina¹, Oleksandr Pryimak², and Vadim Mottl³

¹ Tula State University, Lenin pr. 92, 300600 Tula, Russia
² Moscow Institute of Physics and Technology, 141700, 9, Institutskii per., Dolgoprudny, Moscow Region, Russia
³ Dorodnicyn Computing Centre of RAS, Vavilov st. 40, 119333 Moscow, Russia
{And.Kopylov,aleksandr.priymak}@gmail.com, {ko180177,vmottl}@yandex.ru
Abstract. A new algorithm for low-level signal processing is proposed, based on the dynamic programming principle. It is shown that it is possible to extend the dynamic programming procedure to the case of continuous variables by introducing a parametric family of Bellman functions, represented as a minimum of a set of quadratic functions. The procedure can take into account a wide range of prior assumptions about the sought-for result, and leads to effective algorithms of data analysis. Keywords: dynamic programming, edge-preserving smoothing, separable optimization.
1 Introduction

There is a considerably wide class of low-level signal processing problems where the optimization-based approach is claimed to play the role of a universal framework. This class includes, in particular, such data analysis problems as smoothing, non-stationary regression analysis, segmentation, etc. In all these cases, the ultimate aim of processing can be represented as a transformation of the original data Y = (y_t, t ∈ T), y ∈ Y, defined on a subset T of
the signal axis, into a secondary array X = (x_t, t ∈ T), which would be defined on the same argument set t ∈ T and take values x ∈ X from a set specific for each particular problem. It is important that the signal axis is naturally supplied with a binary neighborhood relation, which turns it into a simple graph in the form of a chain. The problem of finding the best, in some sense, transformation of the source data Y into a secondary array X can be mathematically set as that of minimizing an objective function J(X) of a special kind, defined over the variety of all the feasible results of processing, and meant to assess the discrepancy between each admissible version of the result X and the given data array Y:
J(X) = Σ_{t∈T} ψ_t(x_t | Y_t) + Σ_{(t′,t″)∈G} γ_{t′,t″}(x_{t′}, x_{t″})    (1)
The structure of the objective function J(X) reflects the fact that the data are ordered along the axis of one or two arguments, and can be defined by the undirected neighborhood graph G ⊂ T × T. Such objective functions are called pair-wise separable, since they are representable as sums of elementary objective functions, each of only one or two variables. These functions are associated, respectively, with the nodes and edges of the neighborhood graph. The data-dependent node functions ψ_t(x_t | Y_t) are to be chosen with respect to the essence of a particular data processing problem, so that the greater the value each of them takes, the more evident is the contradiction between the hypothesis that x_t is just the correct local value we are seeking and the respective vicinity Y_t of the data array. Each of the model-based edge functions γ_{t′,t″}(x_{t′}, x_{t″}) is meant to impose an individual penalty upon the distinction of values in the respective pair of adjacent variables of the edge (t′, t″) of the neighborhood graph G. The evident neighborhood graph G for signals is a chain. In the case of finitely valued objective variables x_t ∈ X = {1,...,m}, such an optimization problem is known under the name of the (min,+) labeling problem. For an arbitrary adjacency graph G this problem is NP-hard; however, in the case of an acyclic graph it is possible to construct very effective optimization procedures, dynamic programming (DP) in nature. But as the number of elements in the set X grows, the computation time and memory requirements of the large majority of (min,+) optimization algorithms quickly become prohibitive. Nevertheless, it is possible to construct a generalization of the classical dynamic programming procedure to the case of continuous variables [1, 2]. Such a generalization can be made on the basis of introducing the concept of a parametric family of Bellman functions. It allows us, first, to reduce the memory requirements, as the number of parameters of the Bellman functions is less than the number of elements in the set X; second, to increase computational speed, as the intermediate optimization problems at each step of the DP procedure turn into recurrent re-evaluation of parameters; and third, to increase accuracy in tasks where the objective variables are continuous in nature. Unfortunately, the class of appropriate parametric families of functions is quite limited, which greatly reduces the range of practical applications of parametric DP procedures. We propose here a way to extend the parametric family of Bellman functions to the case of the most widely used optimization criteria.
2 Generalized Dynamic Programming Procedure

A pair-wise separable objective function (1) allows a highly effective global optimization procedure [1] when the neighborhood graph on the set of its variables has no cycles, i.e. is a tree (Fig. 1).
Fig. 1. Structure of an arbitrary tree
The optimization procedure in this case is based on a recurrent decomposition of the initial problem of optimizing a function of |T| variables, where |T| is the number of elements in the set T, into a succession of |T| elementary problems, each of which consists in the optimization of a function of only one variable. The elementary functions of one variable to be minimized at each step of the minimization of a tree-supported separable function play the same role as the Bellman functions in the classical procedure and are called here extended Bellman functions. The fundamental property of the Bellman function,

J_t(x_t) = ψ_t(x_t) + Σ_{s∈T⁰(t)} min_{x_s∈X} {γ^m_{t,s}(x_t, x_s) + J_s(x_s)},    (2)

where T⁰(t) denotes the set of children of node t,
will be called the upward recurrent relation. The inverted form of this relation,

x̄_s(x_t) = arg min_{x_s∈X} {γ^m_{t,s}(x_t, x_s) + J_s(x_s)}, s ∈ T⁰(t),    (3)
will be referred to as the downward recurrent relation. The procedure runs through all the nodes of the graph in order of their membership in the hierarchy levels T_j. At the upward pass, j = 1,...,M, it recurrently calculates and stores the Bellman functions in accordance with the upward recurrent relation (2), starting with J_t(x) = ψ_t(x), x ∈ X, t ∈ T_1. The Bellman function at the root, J_{t*}(x), x ∈ X, obtained at the last step of the upward pass, immediately gives the optimal value of the root variable. On the downward pass, as the procedure descends the hierarchy levels j = M,...,2, the already found optimal values of the current level determine the optimal values of the variables at the immediately underlying level in accordance with the downward recurrent rule (3). In the case of continuous variables, e.g. if X = (x_t, t ∈ T), x_t ∈ Rⁿ, a numerical
realization of the basic procedure for a separable function supported by a tree, as well as of the classical dynamic programming procedure, is possible only if there exists
a finitely parameterized function family J(x, a), concordant with the node functions ψ_t(x) and edge functions γ_{t′,t″}(x_{t′}, x_{t″}) in the sense that the Bellman functions J_t(x_t) belong to this family at each step. In this case, the upward pass of the procedure consists in recurrently re-evaluating the parameters a_t that completely represent the Bellman functions: J_t(x) = J(x, a_t). In particular, as is shown in [1], if the node and edge functions are quadratic, the Bellman functions will be quadratic too. In that case, the DP procedure is entirely equivalent to the Kalman–Bucy filter and interpolator [3]. However, prior smoothness constraints in the form of quadratic edge functions do not allow abrupt changes of the data to be kept in the course of processing. A parametric representation is also possible when the absolute value of the difference of adjacent variables is used as the edge function instead of a quadratic one. It leads to a piecewise-constant form of the sought-for data array [2]. Though this kind of assumption is appropriate for certain problems, for example in economics, it generally yields poorer results. We propose here to use the following form of the edge functions in criterion (1):
γ_{t′,t″}(x_{t′}, x_{t″}) = min[(x_{t′} − x_{t″})ᵀ U_{t′,t″} (x_{t′} − x_{t″}), h], (t′, t″) ∈ G    (4)
It has been known [4] that an edge function of this kind possesses the property of preserving abrupt changes in the data being analyzed. The node functions are chosen in the quadratic form ψ_t(x_t | Y_t) = (y_t − x_t)ᵀ B_t (y_t − x_t). In this case it is possible to construct a parametric DP procedure for the minimization of the criterion in the form (1). Let us consider the case when the adjacency graph G has the form of a chain, so that the source data Y = (y_t, t ∈ T) and the result of processing X = (x_t, t ∈ T) are signals, and t = 1, 2,…, N. It can be proven that if the Bellman function at node t − 1 is

J_{t−1}(x_{t−1}) = min[J^{(1)}_{t−1}(x_{t−1}), J^{(2)}_{t−1}(x_{t−1}), …, J^{(K_{t−1})}_{t−1}(x_{t−1})],

then the next Bellman function will be

J_t(x_t) = min[J^{(1)}_t(x_t), J^{(2)}_t(x_t), …, J^{(K_t)}_t(x_t)],

where K_t = K_{t−1} + 1 and, in accordance with (2),

J^{(i)}_t(x_t) = ψ_t(x_t) + min_{x_{t−1}∈X} {(x_t − x_{t−1})ᵀ U_t (x_t − x_{t−1}) + J^{(i)}_{t−1}(x_{t−1})}, i = 1, 2,…, K_t − 1,

J^{(K_t)}_t(x_t) = ψ_t(x_t) + min_{x_{t−1}∈X} {h + J_{t−1}(x_{t−1})}.

It is easy to see that each function J^{(i)}_t(x_t) has the parametric representation
J^{(i)}_t(x_t) = (x_t − x^{(i)}_t)ᵀ R^{(i)}_t (x_t − x^{(i)}_t) + d^{(i)}_t, i = 1, 2,…, K_t.
The downward recurrent relation (3) takes the form

x^{(i)}_{t−1}(x_t) = arg min_{x_{t−1}∈X} {min[f^{(1)}_t(x_{t−1}, x_t), f^{(2)}_t(x_{t−1}, x_t), …, f^{(K_t−1)}_t(x_{t−1}, x_t), f^{(K_t)}_t(x_{t−1})]},

where f^{(i)}_t(x_{t−1}, x_t) = (x_t − x_{t−1})ᵀ U_t (x_t − x_{t−1}) + J^{(i)}_{t−1}(x_{t−1}), i = 1, 2,…, K_t − 1, and f^{(K_t)}_t(x_{t−1}) = h + J_{t−1}(x_{t−1}).

Let J_t(x_t) = min[J^{(1)}_t(x_t), J^{(2)}_t(x_t), …, J^{(K)}_t(x_t)]. If, for all x_t ∈ R, J^{(1)}_t(x_t) ≥ min[J^{(2)}_t(x_t), …, J^{(K)}_t(x_t)], then J_t(x_t) = min[J^{(2)}_t(x_t), …, J^{(K)}_t(x_t)]. Thus, the number of functions in the representation of the Bellman function at step t can be reduced. We consider the procedure of such reduction in the following section, by the example of edge-preserving smoothing of signals.
3 Parametric DP Procedure for Edge-Preserving Smoothing of Signals

Let us suppose that the observable signal is Y = (y_t = x_t + ξ_t, t = 1,…, N), y_t ∈ R, where ξ_t is additive white Gaussian noise with zero mean. The aim of processing is to restore the hidden signal X = (x_t, t = 1,…, N), x_t ∈ R. The basic assumption in the problem is that the original signal changes smoothly enough except, perhaps, at some points where jumps or breaks can be observed. It has been known [4] that the following edge function possesses the required edge-preservation properties:
\gamma_{t-1,t}(x_{t-1}, x_t) = u \min[(x_{t-1} - x_t)^2, \Delta^2].    (5)
It is easy to see that the form of function (5) coincides with (4), if it is remembered that a constant is the degenerate form of a quadratic function. Criterion (1) takes the following simple form:

J(X) = \sum_{t=1}^{N} (y_t - x_t)^2 + u \sum_{t=2}^{N} \min[(x_{t-1} - x_t)^2, \Delta^2].    (6)
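As a point of reference, criterion (6) can be minimized directly by dynamic programming over a finite grid of candidate values, i.e., the trivial parameterization by a finite set of values mentioned later in the conclusion. The following Python sketch is our own illustration, not the authors' code; the function name and grid size are arbitrary choices.

```python
import numpy as np

def smooth_dp_grid(y, u, delta, n_levels=128):
    """Brute-force minimization of criterion (6) by DP over a value grid.

    Trivial (non-parametric) baseline: O(N * n_levels^2) time and
    O(N * n_levels) memory, against which the parametric procedure
    of this paper is compared.
    """
    grid = np.linspace(y.min(), y.max(), n_levels)  # candidate values for x_t
    N = len(y)
    J = (y[0] - grid) ** 2                          # Bellman function at t = 1
    back = np.zeros((N, n_levels), dtype=int)       # argmin pointers
    # pairwise edge costs u * min((x' - x)^2, delta^2) between grid values
    pair = u * np.minimum((grid[:, None] - grid[None, :]) ** 2, delta ** 2)
    for t in range(1, N):
        total = pair + J[None, :]                   # [i, j]: x_t = grid[i], x_{t-1} = grid[j]
        back[t] = np.argmin(total, axis=1)
        J = (y[t] - grid) ** 2 + total[np.arange(n_levels), back[t]]
    x = np.empty(N)                                 # downward (backtracking) pass
    idx = int(np.argmin(J))
    for t in range(N - 1, -1, -1):
        x[t] = grid[idx]
        idx = back[t, idx]
    return x
```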
Suppose that at step t - 1 the Bellman function J_{t-1}(x_{t-1}) is represented in the form

J_{t-1}(x_{t-1}) = \min\left[ J_{t-1}^{(1)}(x_{t-1}), J_{t-1}^{(2)}(x_{t-1}), \ldots, J_{t-1}^{(K_{t-1})}(x_{t-1}) \right],

where

J_{t-1}^{(i)}(x_{t-1}) = c_{t-1}^{(i)} (x_{t-1} - x_{t-1}^{(i)})^2 + d_{t-1}^{(i)}, \quad c_{t-1}^{(i)} \ge 0.    (7)
Then, in accordance with the forward recurrent relation (2) of the DP procedure, the Bellman function J_t(x_t) at the next step will be

J_t(x_t) = \min\left[ J_t^{(1)}(x_t), J_t^{(2)}(x_t), \ldots, J_t^{(K_t)}(x_t) \right],

J_t^{(i)}(x_t) = (y_t - x_t)^2 + \min_{x_{t-1}} \left[ u (x_{t-1} - x_t)^2 + J_{t-1}^{(i)}(x_{t-1}) \right],

J_t^{(K_t)}(x_t) = (y_t - x_t)^2 + \min_{x_{t-1}} \left[ u \Delta^2 + J_{t-1}(x_{t-1}) \right].
This implies that each Bellman function has a finite-parametric representation. Note also that at each step the number of quadratic functions needed to represent the Bellman function, generally speaking, increases by one. As a result, the procedure has complexity O(N^2). At the same time, numerous experiments on simulated data showed that two or three quadratic functions are usually enough to completely represent each Bellman function. A modification of the above-described algorithm is based on the following idea. It is easy to see that if x_1, x_2, \ldots, x_N is the minimum point of criterion (6), then x_t \in [m, M] for each t = 1, 2, \ldots, N, where m = \min(Y) and M = \max(Y). So we can omit every J_t^{(i)} satisfying J_t^{(i)}(x_t) \ge \min_{j \ne i} J_t^{(j)}(x_t) for all x_t \in [m, M].
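For illustration, the forward pass of the parametric procedure for criterion (6) can be sketched as follows. This is our reading of the recursions above: the infimal convolution of u(x_{t-1} - x_t)^2 with a quadratic c(x - x_0)^2 + d is again quadratic with curvature uc/(u + c), and the reduction is approximated by a dominance check on a grid spanning [m, M]. All names and the grid-based check are our own; backtracking is omitted.

```python
import numpy as np

def parametric_forward(y, u, delta, n_check=64):
    """Forward pass of the parametric DP procedure for criterion (6).

    Each Bellman function J_t is stored as a minimum of quadratics
    c * (x - x0)**2 + d, i.e., a list of (c, x0, d) triples.
    """
    xs = np.linspace(y.min(), y.max(), n_check)   # grid used only for pruning
    quads = [(1.0, y[0], 0.0)]                    # J_1(x) = (y_1 - x)^2
    for t in range(1, len(y)):
        d_min = min(d for _, _, d in quads)       # min over x of J_{t-1}(x)
        new = []
        for c, x0, d in quads:
            q = u * c / (u + c)                   # infimal convolution with u*(z - x)^2
            cc = 1.0 + q                          # then add the data term (y_t - x)^2
            nx0 = (y[t] + q * x0) / cc
            nd = d + (q / cc) * (y[t] - x0) ** 2
            new.append((cc, nx0, nd))
        new.append((1.0, y[t], u * delta ** 2 + d_min))   # "jump" (K_t-th) branch
        vals = np.array([c * (xs - x0) ** 2 + d for c, x0, d in new])
        keep = [i for i in range(len(new))        # drop quadratics dominated on [m, M]
                if np.any(vals[i] <= np.delete(vals, i, axis=0).min(axis=0))]
        quads = [new[i] for i in keep]
    return quads
```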
Let us denote the number of functions needed to represent the Bellman function at each step by m_1, m_2, \ldots, m_N. It can be shown that if there exists a constant M such that

m_1^2 \ln m_1^2 + m_2^2 \ln m_2^2 + \cdots + m_N^2 \ln m_N^2 \le M \cdot N,

the procedure has computational complexity O(N). This procedure can be extended to the case of image analysis by means, for example, of approximation of the initial lattice-like neighborhood graph of variables by a succession of trees, or other techniques described in [1].
4 Experiments

For an experimental study of the complexity of the algorithm described in Section 3, the following data model has been used. Observable signal: Y = (y_t = x_t + ξ_t, t = 1, \ldots, N), y_t \in R, where ξ is additive white Gaussian noise with zero mean and variance σ^2. Hidden signal: X = (x_t, t = 1, \ldots, N), where x_1 = z_1 and, for t = 2, \ldots, N, x_t = x_{t-1} with probability p and x_t = z_t with probability 1 - p. The z_t are independent random variables uniformly distributed on the interval [0, 50]. The distribution of the mean number of quadratic functions in the composition of the Bellman functions, (m_1 + m_2 + \ldots + m_N)/N, was constructed by the Monte Carlo method for N = 100, 200, 500, 1000 (Fig. 2). The results of the experiments show that the procedure has average complexity O(N).
Fig. 2. Distribution of the mean number of quadratic functions in the composition of the Bellman functions (panels for N = 1000, 500, 200, 100)
5 Conclusion

The DP procedure based on a parametric family of Bellman functions, represented as the minimum of a set of quadratic functions, can reduce the amount of memory required for its implementation whenever the number of parameters of the Bellman functions is small enough, i.e., less than the amount of memory needed in the case of the trivial parameterization by a finite set of values. This is of particular importance, as the number of discrete values of the objective variables has a massive impact on the requirements of modern (min,+) optimization algorithms. The experimental study suggests that, for the vast majority of source data arrays, two or three quadratic functions are usually enough to completely represent each Bellman function. The proposed procedure allows a parametric implementation, can take into account a wide range of prior assumptions about the sought-for result, and leads to an effective data analysis algorithm with linear average computational complexity.
References

1. Mottl, V., Kopylov, A., Blinov, A., Kostin, A.: Optimization techniques on pixel neighborhood graphs for image processing. In: Graph-Based Representations in Pattern Recognition. Computing, vol. (suppl. 12), pp. 135–145. Springer-Verlag/Wien (1998)
2. Kopylov, A.V.: Parametric dynamic programming procedures for edge preserving in signal and image smoothing. Pattern Recognition and Image Analysis 15(1), 227–230 (2005)
3. Kalman, R.E., Bucy, R.S.: New Results in Linear Filtering and Prediction Theory. Journal of Basic Engineering 83, 95–108 (1961)
4. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 30(6), 1068–1080 (2008)
Determining Dominant Frequency with Data-Adaptive Windows

Gagan Mirchandani and Shruti Sharma

School of Engineering, College of Engineering & Mathematical Sciences, University of Vermont
Abstract. Measurement of activation rates in cardiac electrograms is commonly done through estimating the frequency of the sinusoid with the greatest power. This frequency, commonly referred to as Dominant Frequency, is generally estimated using the short-time Fourier Transform with a window of fixed size. In this work a new short-time Fourier transform method with a data-adaptive window is introduced. Experiments are conducted with both synthetic and real data. Results for the former case are compared with current state-of-the-art methods. Given the difficulty in identifying activation points in electrograms, experiments reported in the literature have so far used only synthetic data. The new method is tested by application to real data, with true activation rates determined manually. Substantial improvement is observed. An error analysis is provided. Keywords: atrial fibrillation, data-adaptive windows, non-stationary signals, dominant frequency.
1 Introduction

Measurement of temporal beat-to-beat variations in heart rate is one of the key factors in the determination of cardiac disorders. In atrial fibrillation (AF), the atria contract rapidly and irregularly. The corresponding heart rate, measured through intracardiac electrograms, shows rapidly changing frequencies between 3-15 Hz. Since AF can be associated with heart disease and heart failure, an accurate determination of heart rate is essential for proper management purposes. Activation rate, or frequency, is defined as the inverse of the time period between two consecutive activation points. While manual determination of atrial activation timings by experts is theoretically possible, the sheer volume of data that would need analysis makes this an impractical option. Accordingly, automated methods for determining the beat-to-beat variations have received much attention. This is a difficult task for many reasons: electrogram data is random and nonstationary, and there is no clearly identifiable spectral estimation technique that specifically matches the particular problem of atrial activation rate determination. Methods using the short-time Fourier transform (STFT) suffer from the usual time-duration versus frequency-resolution problem. Furthermore, complexity in identifying activation time, often due to morphology fragmentation, further complicates the spectral estimation task.
There currently do not exist automated methods for the determination of activation rates. Using synthetic data and fixed-window STFT analysis, results have been reported for the so-called dominant frequency (DF) associated with some specific structures of the data. DF is defined as the frequency with the greatest power in the power spectral estimate of the data. In this paper we introduce a new STFT-based method for power spectral estimation of electrogram data. A data-adaptive window is employed to partially mitigate the resolution problem. The method is tested using both synthetic data as well as real electrogram data where the true activation rates are determined manually. Results with the fixed window and the new data-adaptive window are compared. An error analysis shows substantial gains with the data-adaptive window method. The latter method can also find use in other similar non-stationary data applications such as EEG power spectral estimation.
2 State of the Art in Electrogram Analysis

Methods used for determining AF include time-frequency analysis [1], [11], hidden Markov models [15], deconvolution [3], and spectral analysis, with the latter method being the predominant one. Motivated perhaps by the hypothesis that sites of high frequency activation might serve as drivers for fibrillation and could therefore be targets for ablation therapy [13], [16], spectral estimation has centered around frequency domain methods and specifically around the concept of DF within the framework of a fixed-window STFT [6], [12], [13], [14], [16], [5], [10], [9], [7]. While for certain synthetic signals, DF correlates well with the mean, mode and median of the activation time [12], [6], results with real data show poor correlation [4]. Surprisingly, there are very few results in the literature reporting on the accuracy of the DF methodology applied to signals, real or synthetic. For those cases that are documented, synthetic signals are created by placing a Hanning window to emulate a clean morphology [13], [6]. Another interesting approach for the understanding of AF is in the use of morphology matching for the determination of the direction of progression of the activation wavefront. Cross-correlation is used to determine the recurrence of a morphology and that information is used to detect a pattern in the flow of the electrical activity [2], [8].
3 Spectral Analysis Using Variable Windows

As indicated earlier, DF characterizes electrogram data with varying frequency by the one frequency with the largest power in the power spectral density. However, given that the signal is non-stationary, it is clear that for an accurate determination of activation rates, electrogram signals should best be analyzed locally to capture the changing nature of the frequencies. As a solution, a data-adaptive window method is proposed: the window length is determined by the time resolution required for the signal.
Fig. 1. Synthetic signal (shown with 500-point and 250-point windows, each with 50% overlap)
Hence, ideally, with the frequency between 3-15 Hz, the spectrogram data should be analyzed in segments with a maximum of two activations per segment. Accordingly, given that an f Hz signal exhibits f activations (peaks) per second, the ideal window size for a 4 Hz signal would be 500 points, decreasing linearly to 125 points for a 16 Hz signal, assuming a 1 kHz sampling rate. However, the small windows are designed to analyze each segment with a maximum of three activations per segment (3 activations in a 250 ms window would mean an inter-event frequency of 12 Hz), and since the inter-event period would not be greater than 250 ms (the total length of the segment being analyzed is 250 ms), the signal was not analyzed in segments of 125 ms. In the proposed method, the DF is first determined using the STFT with a 500-point window with a 50% overlap. If the observed DF is greater than 4 Hz, a 250-point window (also with a 50% overlap) is utilized. Dominant frequencies obtained from each of the overlapping windows are then used to label frequencies over 125 ms segments of the signal, to effectively generate a localized frequency estimate. To illustrate the technique, the method is applied to a simple electrogram signal where the signal allows for 8 125-ms DF estimates over the 1000-ms duration.

Example 1: The synthetic signal shown in Figure 1, with randomly changing frequency between 3-15 Hz, was created using specific morphologies that were obtained by pacing the heart from different directions. The signal was analyzed using the data-adaptive windows described above. For the 1000 ms signal shown in Figure 1, a 500-point window with the 50% overlap provides 3 consecutive signal segments. The DF for each of the 3 segments shows frequencies greater than 4 Hz. Accordingly, 250-point windows are utilized. With the 50% overlap, there are 6 consecutive 125-ms signal segments, for which the DF is determined. These, as well as the true activation rates (determined manually), are placed on a frequency-time plot and compared. Note that the latter frequencies occur at actual activation times and are shown accordingly
Table 1. Ground truth activation rates and DF estimate using data-adaptive windows

Activation rates (ground truth):       8 10 9 6 4 11 11
500-ms windows, 250-ms overlap:        11 8 10
250-ms windows, 125-ms overlap:        10 11 10 7 5 4 10
DF estimate using both windows:        10 10 11 10 7 5 4 10 11
Fig. 2. DF error using data-adaptive windows: ground-truth activation rates vs. the data-adaptive window DF estimate
on the frequency-time plot. The DF estimates, however, using the two fixed-size adaptive windows, occur at activation times that are integer multiples of 250 or 125 ms. The final frequency estimates are obtained by using the 250 ms window results and extrapolating the start and end points to cover the whole range from 0 to 1000 ms. Hence, for the example here, the frequencies over the 1-250 ms and 875-1000 ms ranges are extrapolated to 5.3 Hz and 3 Hz, respectively. The activation rates and their DF estimates, while generated over different-sized ms-blocks, are assumed to apply over each ms, generating a time series, shown in Figure 2, that can be used for error analysis.
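A sketch of the two-stage window selection just described is given below; the peak-picking details (the window function and the spectral band searched) are our assumptions rather than specifics stated in the paper.

```python
import numpy as np

def dominant_freq(seg, fs=1000.0):
    """DF of one segment: frequency of the largest spectral peak in 3-15 Hz."""
    spec = np.abs(np.fft.rfft(seg * np.hanning(len(seg)))) ** 2
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / fs)
    band = (freqs >= 3) & (freqs <= 15)
    return freqs[band][np.argmax(spec[band])]

def adaptive_df(x, fs=1000.0):
    """Two-stage data-adaptive DF: 500-point windows (50% overlap) first;
    where the observed DF exceeds 4 Hz, re-analyze with 250-point windows
    and label the signal over 125-ms segments."""
    labels = {}
    for start in range(0, len(x) - 500 + 1, 250):
        df = dominant_freq(x[start:start + 500], fs)
        if df <= 4:
            for s in range(start, start + 500, 125):
                labels.setdefault(s, df)                 # coarse label, keep first
        else:
            for s2 in range(start, start + 251, 125):    # 250-pt windows, 50% overlap
                df2 = dominant_freq(x[s2:s2 + 250], fs)
                for s in range(s2, s2 + 250, 125):
                    labels[s] = df2                      # finer labels overwrite
    return labels                                         # {segment start (samples): DF in Hz}
```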
4 Experiments and Results

Experiments were divided into two categories: (i) those similar to ones in the reported literature, and (ii) those with real morphologies.

4.1 Previous Experiments

The first set of experiments was conducted to compare results reported in [12], referred to here as the fixed-window method. In that work, electrogram data was analyzed in segments of 4000 points. Segments were created using a 50-point
Hanning window as the morphology, while the frequencies were chosen using a mean frequency from a uniform distribution between 3-9 Hz, with a standard deviation of 1 Hz. Four sets of tests were conducted with the following conditions:

1. Constant amplitude, constant frequency.
2. Varying amplitude, constant frequency.
3. Constant amplitude, varying frequency.
4. Varying amplitude, varying frequency.
Since the first two tests involved constant frequency, these experiments were not duplicated, as a variable-window analysis would not provide any gain there. The last two sets of tests were conducted with synthetic signals. For the purpose of this paper, one hundred sets of 4000-point signals were created in which the frequency was varied randomly between 3-10 Hz. A 50-point Hanning window was used to depict a filtered, rectified biphasic morphology. The average inter-event frequencies and the DF estimates from the data-adaptive window method were then compared to the DF estimates using the fixed-window method. Figures 3(a) and 4(a) show sample synthetic signals with varying frequency where the peak amplitude ratio was constant (condition 3) and where it was varied between 0.1 and 1 (condition 4). These are followed by scatter plots of the DF by the fixed-window method vs. the average inter-event frequency (ground truth) and the DF by the data-adaptive window method vs. the average inter-event frequency (ground truth). The correlation coefficients for the tests using the fixed-window and data-adaptive window methods were determined to be 0.069 and 0.91 for test 3, and 0.071 and 0.81 for test 4, respectively.
4.2 Experiments Using Real Morphologies

In order to simulate signals that resemble real electrograms, signals were generated by pacing the heart manually from different directions. Seven different morphologies were generated. Each morphology was placed on a baseline with an inter-event frequency chosen at random between 3-10 Hz to create 10,000-point synthetic signals. These were analyzed using data-adaptive windows. The mean and standard deviation of the error were determined. The signals were subjected to the standard electrogram preprocessing procedures (a sketch of this chain follows the list):

1. Band-pass filtering between 40-250 Hz.
2. Signal rectification.
3. Low-pass filtering between 0-20 Hz.
4. Windowing the signal with a Hamming window.
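A minimal sketch of this four-step chain using SciPy is given below; the Butterworth filters, their orders, and the zero-phase filtering are our own choices, as the paper does not specify them.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(x, fs=1000.0):
    """Standard electrogram preprocessing: band-pass, rectify,
    low-pass, then apply a Hamming window."""
    b, a = butter(4, [40 / (fs / 2), 250 / (fs / 2)], btype="band")
    x = filtfilt(b, a, x)                  # 1. band-pass 40-250 Hz
    x = np.abs(x)                          # 2. rectification (use x**2 for squaring)
    b, a = butter(4, 20 / (fs / 2), btype="low")
    x = filtfilt(b, a, x)                  # 3. low-pass 0-20 Hz
    return x * np.hamming(len(x))          # 4. Hamming window
```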
Experiments were conducted using both rectification and squaring, the latter as an alternative to rectification. The DF values obtained using the data-adaptive window method for both were formulated as time series, with each frequency value representing, as before, frequencies over 1 ms for the 125 ms segments. These were then compared to the ground-truth values. The error series for five synthetic signals
Fig. 3. (a) Sample signal with constant amplitude and variable frequency. (b) DF using the fixed-window method vs. average inter-event frequencies (correlation coefficient 0.069). (c) DF using data-adaptive windows vs. average inter-event frequencies (correlation coefficient 0.91).
Fig. 4. (a) Sample signal with variable amplitude and variable frequency. (b) DF using the fixed-window method vs. average inter-event frequencies (correlation coefficient 0.071). (c) DF using data-adaptive windows vs. average inter-event frequencies (correlation coefficient 0.81).
Fig. 5. Error and error distribution with the data-adaptive method using real morphologies (with squaring: mean = 1.53 Hz; with rectification: mean = 1.61 Hz; standard deviation = 0.15 Hz for both)
Fig. 6. Examples of real signals from real AF electrograms that were used for the experiments
Fig. 7. Error distribution with rectification and squaring, using real data (mean 1.66 Hz, standard deviation 1.42 Hz for both)
are shown in Figure 5. Also shown is the error distribution for the average error of each of the hundred synthetic signals created. The mean errors for squaring and rectification were 1.50 Hz and 1.61 Hz, respectively, with a standard deviation of 0.15 Hz for both.

4.3 Experiments Using Real Signals

Fifty sets of 1000-point real signals were obtained from a patient with AF; examples are shown in Figure 6. Activation times and rates were determined manually. The signals were preprocessed and analyzed using the data-adaptive window method. The data-adaptive window method was able to capture the changing frequency of the signals with a mean error of 1.66 Hz and a standard deviation of 1.42 Hz for both squaring and rectification. The error distribution is shown in Figure 7.
5 Conclusion

A data-adaptive window has been used with the STFT to capture non-stationary characteristics of synthetic signals representing electrogram data. Tests were conducted using 100 sets of 4000-point data with variable frequency and fixed amplitude, and with variable frequency and variable amplitude. Correlating results with the ground truth shows the data-adaptive and fixed-window methods performing with correlation coefficients of 0.91 and 0.069 for the first case, and 0.81 and 0.071, respectively, for the second. Experiments with real morphologies with the data-adaptive window showed mean errors of 1.50 and 1.61 Hz and a standard deviation of 0.15 Hz. Using 50 sets of real 1000-point electrogram data showed a mean error of 1.66 Hz and a standard deviation of 1.42 Hz.
Acknowledgements

Support from a Graduate Teaching Fellowship from The University of Vermont and a Research Fellowship from Medtronics is hereby acknowledged.
References

1. Addison, P.S., Watson, J.N., Clegg, G.R., Steen, P., Robertson, C.E.: Finding coordinated atrial activity during ventricular fibrillation using wavelet decomposition. IEEE Engineering in Medicine and Biology, 58–65 (January/February 2002)
2. Barbaro, V., Bartolini, P., Calcagnini, G., Censi, F., Michelucci, A., Morelli, S.: Mapping the organization of human atrial fibrillation using a basket catheter. Computers in Cardiology, 475–478 (1999)
3. Ellis, W.S., Eisenberg, S.J., Auslander, D.M., Dae, M.W., Zakhor, A., Lesh, M.D.: Deconvolution: A novel signal processing approach for determining activation time from fractionated electrograms and detecting infarcted tissue. Circulation 94, 2633–2640 (1996)
4. Elvan, A., et al.: Dominant Frequency of Atrial Fibrillation Correlates Poorly with Atrial Fibrillation Cycle Length. Circulation: Arrhythmia and Electrophysiology 2, 634–644 (2009)
5. Everett, T.H., Kok, L.-C., Vaughn, R.H., Moorman, J.R., Haines, D.E.: Frequency domain algorithm for quantifying atrial fibrillation organization to increase defibrillation efficacy. IEEE Transactions on Biomedical Engineering 48(9), 969–978 (2001)
6. Fischer, G., Stühlinger, M.C., Wieser, B., Nowak, C.-N., Wieser, L., Tilg, B., Hintringer, F.: On Computing Dominant Frequency From Bipolar Intracardiac Electrograms. IEEE Transactions on Biomedical Engineering 54(1), 165–169 (2007)
7. Le Goazigo, C.: Measurement of the dominant frequency in atrial fibrillation electrograms. MSc. Thesis, Cranfield University (2005)
8. Houben, R.P.M., Allessie, M.A.: Processing of intracardiac electrograms in atrial fibrillation. IEEE Engineering in Medicine and Biology Magazine, 40–51 (November/December 2006)
9. Jacquemet, V., Oosterom, A.V., Vesin, J.V., Kappenberger, L.: Analysis of electrocardiograms during atrial fibrillation. IEEE Engineering in Medicine and Biology Magazine, 79–88 (November/December 2006)
10. Langley, P., Bourke, J.P., Murray, A.: Frequency analysis of atrial fibrillation. Computers in Cardiology, 65–68 (September 2000)
11. Moghe, S.A., Qu, F., Leonelli, F.M., Patwardhan, A.R.: Time-frequency representation of epicardial electrograms during atrial fibrillation. Biomedical Sciences Instrumentation 36, 45–50 (2000)
12. Ng, J., Kadish, A.H., Goldberger, J.J.: Effect of electrogram characteristics on the relationship of dominant frequency to atrial activation rate in atrial fibrillation. Heart Rhythm 3(11), 1295–1305 (2006)
13. Ng, J., Goldberger, J.J.: Understanding and interpreting dominant frequency analysis of AF electrograms. Journal of Cardiovascular Electrophysiology 18(7), 680–685 (2007)
14. Ng, J., Kadish, A.H., Goldberger, J.J.: Technical considerations for dominant frequency analysis. Journal of Cardiovascular Electrophysiology 18(7), 757–764 (2007)
15. Sandberg, F., Stridh, M., Sörnmo, L.: Frequency tracking of atrial fibrillation using hidden Markov models. IEEE Transactions on Biomedical Engineering 55(2), 502–511 (2008)
16. Sanders, P., Berenfeld, O., Hocini, M., Jaïs, P., Vaidyanathan, R., Hsu, L.-F., Garrigue, S., Takahashi, Y., Rotter, M., Sacher, F., Scavée, C., Ploutz-Snyder, R., Jalife, J., Haïssaguerre, M.: Spectral analysis identifies sites of high frequency activity maintaining atrial fibrillation in humans. Circulation 112, 789–797 (2005)
Non-linear EEG Analysis of Idiopathic Hypersomnia

Tarik Al-ani 1,2, Xavier Drouot 3,4, and Thi Thuy Trang Huynh 3

1 LISV-UVSQ, 10-12 Av. de l'Europe, 78140 Vélizy, France
2 Dept. Informatique, ESIEE-Paris, Cité Descartes-BP 99, 93162 Noisy-Le-Grand, France
[email protected]
3 APHP, Hôpital Henri Mondor, Service de Physiologie, Centre du Sommeil, 51 av. de Lattre de Tassigny, 94010 Créteil, France
4 EA 4391, Université Paris 12, France
[email protected]

Abstract. Electroencephalogram (EEG) signals are used to analyse and quantify the "depth" of sleep and its dynamic behaviour during the night. In this work, we investigate a direct data-driven, nonlinear and non-stationary quantitative analysis of sleep EEG from patients suffering from idiopathic hypersomnia. We show that the minimum weighted average instantaneous frequency appears to be a specific intrinsic characteristic of the brain function mechanism in these patients. It could be an interesting new parameter for the quantification of sleep.
1 Introduction

Hypersomnia can result from a disorder of the central nervous system, in which case it is called primary hypersomnia. There are two main types of primary hypersomnia [1,2,3]: narcolepsy and idiopathic hypersomnia (IH). This work concerns only the automatic diagnosis of IH. Sleep quality and sleep duration are, in part, governed by a homeostatic process (process S) which increases during waking and decreases during sleep [4]. This homeostatic drive [5] can be quantified during sleep by calculating the so-called slow wave activity (SWA) with the Fast Fourier transform (FFT), i.e., the spectral power of the 0.75-4 Hz band of the EEG corresponding to the sleep slow waves. For routine sleep disorder diagnosis, sleep can be analysed by the study of electroencephalogram (EEG) signals recorded during one night of sleep. Based on the R&K standard rules [6], these EEG signals are used to quantify the "depth" of sleep and its dynamic behaviour during the night. Various interesting computerized (discrete-time) automatic sleep EEG analysis approaches have been developed in the literature [7]. Among them are approaches based essentially on linearity and stationarity hypotheses, such as the use of the FFT spectrum over a short segment of data. However, the accuracy of the FFT calculation is closely related to the choice of the duration of the signal segment [7]. The main disadvantage of Fourier analysis in signal processing is that it extracts only global features of signals and does
not adapt to the analysed signals. Different drawbacks of the FFT for EEG analysis are given by Penzel et al. [7]. In the framework of automatic diagnosis of IH, the calculation of the homeostatic drive during sleep with SWA has been widely used since Borbely's work [4], and is unquestionably a robust parameter for studying process S regulation in response to experimental conditions such as sleep deprivation in intra-individual study designs. However, the calculation of SWA using the FFT has some limitations and caveats that have to be kept in mind. First, as already stated above, the FFT calculation relies on the assumption that the sleep EEG is a stationary signal, which is likely not the case. Second, and more importantly, SWA analysis with the FFT supposes an a priori choice of the bandwidth of interest. The lower limit of the SWA band has never been strictly defined and varies across studies from 0.25 Hz [8] or 0.5 Hz [9] to 0.75 Hz [10]. However, this could be of physiological importance, since the SWA results from the combination of two distinct EEG generators, the cortical slow oscillation (around 0.7 Hz) [11] and the thalamic delta clock rhythm (around 2 Hz) [12]. Such a combination is supported by the fact that the dynamics across the night of SWA in response to sleep deprivation has been shown to be significantly different for the lower part of the SWA (< 1 Hz) and the upper part (1-4 Hz) in both animals and humans [13]-[14]. Thus it is interesting to search for a nonlinear and non-stationary decomposition without an a priori choice of frequency bands. In the Wavelet Transform (WT) [15], more generalized functions may be used. These functions are characterized by frequency as well as by time scales. Although these approaches are more effective than the FFT, they show at the same time much greater ambiguity in signal decomposition. In this work, we use a fairly recent technique called the empirical mode decomposition (EMD) [16] for nonlinear and non-stationary time-series data analysis such as EEG. The EMD behaves like a bank of filters. However, the significant difference with a standard bank of filters is that the modes can be modulated in amplitude and especially in frequency. The advantage of EMD is that it is a data-driven approach, i.e., one does not need to define a mother wavelet beforehand. In this work, we are motivated by a direct nonlinear and non-stationary quantitative analysis of sleep EEG from healthy subjects and patients suffering from idiopathic hypersomnia, based on the instantaneous frequency (IF) using the EMD algorithm. The rest of the paper is organized as follows. Section 2 briefly recalls the principle of the EMD. In Section 3 we introduce the IF estimation approaches. Section 4 introduces our methods for hypersomnia analysis. Section 5 is devoted to the results. Finally, Section 6 gives some discussion and conclusions on this work.
2 Empirical Mode Decomposition (EMD)

The traditional EMD, recently introduced by Huang et al. [16], is simply defined by an algorithm based on an empirical framework. The EMD can be used to decompose a signal adaptively into a finite number of mono-component signals, which are known as intrinsic mode functions (IMFs) or modes. It considers signals at
their local oscillations, but these are not necessarily considered in the sense of Fourier harmonics. Their extraction is non-linear, but their recombination for exact reconstruction of the signal is linear. The IMFs admit well-behaved Hilbert transforms (HT) [19] and they satisfy the following properties: they are symmetric, and different IMFs yield different instantaneous local frequencies as functions of time that give sharp identifications of embedded structures. The decomposition is done linearly or non-linearly depending on the data. This complete and almost orthogonal decomposition is empirically realised by identifying the physical local characteristic time scales intrinsic to these data, which is the lapse between successive extrema. For more details on the EMD algorithm, see [16].
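For concreteness, a bare-bones version of the sifting procedure underlying the EMD can be sketched as follows; the envelope construction, the fixed sifting count, and the extrema detection are simplifications of the algorithm in [16], and all names are our own.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def envelope(x, idx):
    """Cubic-spline envelope through the extrema at positions idx."""
    return CubicSpline(idx, x[idx])(np.arange(len(x)))

def emd(x, n_modes=5, n_sift=10):
    """Bare-bones EMD: extract n_modes IMFs by repeated sifting."""
    imfs, r = [], np.asarray(x, dtype=float).copy()
    for _ in range(n_modes):
        h = r.copy()
        for _ in range(n_sift):                 # fixed sifting count (simplified stop rule)
            d = np.diff(h)
            # local maxima / minima (endpoints kept as spline anchors)
            maxi = np.where((np.hstack([d, -1.0]) < 0) & (np.hstack([1.0, d]) > 0))[0]
            mini = np.where((np.hstack([d, 1.0]) > 0) & (np.hstack([-1.0, d]) < 0))[0]
            if len(maxi) < 4 or len(mini) < 4:
                break                           # too few extrema to sift further
            mean = 0.5 * (envelope(h, maxi) + envelope(h, mini))
            h = h - mean                        # subtract the local mean
        imfs.append(h)
        r = r - h                               # residual feeds the next mode
    return imfs, r
```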
3 Instantaneous Frequency Estimation (IF)

The IF was originally defined in the context of FM modulation theory in communications. Ville [17] unified the work done by Carson and Fry [18] and Gabor [19] and noted that, since the IF was time-varying, there should intuitively be some instantaneous spectrum associated with it, with the mean value of the frequencies in this instantaneous spectrum being the IF [20]. The importance of the IF concept stems from the fact that in many applications, like our EEG signal analysis, the spectral characteristics (in particular the frequency of the spectral peaks) vary with time, i.e., are nonstationary. In this case, the IF is an important characteristic; it is a time-varying parameter which defines the location of the signal's spectral peak as it varies with time. Conceptually it may be interpreted as the frequency of a sine wave which locally fits the signal under analysis. Physically, it has meaning only for mono-component signals, where there is only one frequency or a narrow range of frequencies varying as a function of time [20]. The IF has been applied with success to EEG signal analysis [21]. The EMD method is not itself a time-frequency representation. The use of the concept of IF is interesting here since the IMFs are mono-component signals. This implies that there is a single frequency at any time. There are several techniques to estimate the IF using the derivative of the phase of a signal. The Hilbert-Huang transform (HHT) [22] is an empirically based data-analysis method built on the EMD and the Hilbert transform (HT) [19], which is the most common way to estimate the IF. Its basis of expansion is adaptive, so that it can produce physically meaningful representations of data from nonlinear and non-stationary processes. Results reported in the literature indicate that such methods can capture the main characteristics of the EEG, and that the HHT can more accurately express the EEG distribution in the time and frequency domains, because it produces a self-adaptive basis according to the signal and obtains the local and instantaneous frequency of the EEG. However, the determination of this instantaneous quantity is based on the integration of the signal, which is a global operation on the signal duration with the same weight for the past and for the future. Moreover, the definition of this instantaneous quantity is based on a non-causal and non-instantaneous filter and on the FT of the analytical signal. The estimation of instantaneous quantities requires values of the signal that are not yet observed. Thus, we cannot do that before
the signal has been fully observed. To avoid this constraint, there are other approaches privileging essentially the notion of instantaneity to estimate the IF and the instantaneous amplitude (IA) of a signal. These approaches use only the amplitudes of the signal and, if necessary, its derivative at the given instant [23].

3.1 Teager-Kaiser Energy Operator (TKEO)
Kaiser [23] proposed a very simple and fast algorithm to estimate the energy of a signal. This operator is called the Teager-Kaiser Energy Operator (TKEO), provided that the restriction related to the signal bandwidth (narrow-band signal) is respected. In the following, the discrete version of this algorithm is presented. Consider a discrete real-valued signal x(n). The discrete TKEO, Ψ_d[x(n)], may be written as

\Psi_d[x(n)] = x^2(n) - x(n-1)\, x(n+1).

This operator requires no assumption on the stationarity of the signal. Its formulation shows that it is local and easy to implement. Indeed, only three samples are required to calculate the energy at each moment. The estimation of the IF and IA of a discrete signal x(n) sampled at a frequency f_e is inspired by the discrete-time energy separation algorithm 1 (DESA-1) developed by Maragos et al. [24]-[26]:

|a(n)| = \sqrt{\frac{\Psi_d[x(n)]}{1 - \left(1 - \frac{\Psi_d[y(n)] + \Psi_d[y(n+1)]}{4 \Psi_d[x(n)]}\right)^2}},    (1)

\Omega_i(n) = \arccos\left(1 - \frac{\Psi_d[y(n)] + \Psi_d[y(n+1)]}{4 \Psi_d[x(n)]}\right), \qquad f_i(n) = \frac{\Omega_i(n)}{2\pi} f_e,    (2)

with y(n) = x(n) - x(n-1). Finally, the IF is estimated by Eq. (2). Potamianos et al. [27] have concluded on the interest of this operator for the estimation of the IF for demodulation in speech, compared to the approach based on the analytical signal.
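The discrete TKEO and the DESA-1 estimates of Eqs. (1)-(2) translate almost directly into code; the following sketch assumes a narrow-band signal, trims boundary samples for alignment, and guards the divisions and the arccos argument for numerical safety (our choices).

```python
import numpy as np

def tkeo(x):
    """Discrete Teager-Kaiser energy: x^2(n) - x(n-1) * x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1(x, fs):
    """DESA-1 estimates of instantaneous amplitude (IA) and frequency (IF)."""
    y = np.diff(x)                          # y(n) = x(n) - x(n-1)
    px = tkeo(x)                            # Psi_d[x(n)]
    py = tkeo(y)                            # Psi_d[y(n)]
    s = py[:-1] + py[1:]                    # Psi_d[y(n)] + Psi_d[y(n+1)]
    px = px[1:len(s) + 1]                   # align with Psi_d[x(n)]
    px = np.where(px == 0, 1e-12, px)       # avoid division by zero
    ratio = 1 - s / (4 * px)
    omega = np.arccos(np.clip(ratio, -1.0, 1.0))            # Eq. (2), rad/sample
    amp = np.sqrt(np.abs(px / (1 - ratio ** 2 + 1e-12)))    # Eq. (1)
    freq = omega / (2 * np.pi) * fs         # IF in Hz
    return amp, freq
```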
4 Methods

4.1 EEG Records

EEG records from healthy females as well as from females suffering from IH were collected by the team at the Sleep Laboratory, Department of Physiology - Functional Explorations at Henri Mondor Hospital (Créteil, France), for about 8 hours of sleep. There are two types of collected EEG records. The first were obtained from healthy volunteers: 9 healthy females aged 25-35 years. The second were obtained from age-matched females suffering from IH, diagnosed with the polysymptomatic form of IH: 11 females aged 25-35 years. Diagnoses of IH were established using the international criteria [5]. All the EEG signals, issued from the placement of electrodes C4-O2 according to the international 10-20 system, were sampled at 200 Hz, with 16-bit resolution. The studied EEG signals were scored in 30-second epochs based on the R&K standard criteria. The hypnograms corresponding to these signals were also used in our work.
4.2 Application of EMD to Our EEG Records
In our work, the EMD algorithm is used to calculate the IMFs, or modes. Five modes were generated. Then, the instantaneous frequencies (IFs) were calculated using Equations (1)-(2). Analysing a whole night of sleep requires analysing each of the modes over the whole night. To do that, the EMD is applied to each epoch (30 seconds) of the EEG signal to retrieve the mode of interest. For each epoch of the calculated mode, we compute the mean of the IF and the energy of the EEG signal of the epoch. Finally, we calculate and plot the IF and the energy as functions of time. To reduce noise, the values are smoothed using a median filter for the energy and a Savitzky-Golay filter [28] for the IF. This provides the evolution of the IF and the energy during the whole night of sleep.
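Assuming the emd and desa1 helpers sketched above, the per-epoch processing described here might look as follows; the smoothing window lengths and the mode index are our own choices, not values given in the paper.

```python
import numpy as np
from scipy.signal import medfilt, savgol_filter

def night_profile(eeg, fs=200, epoch_s=30, mode=2):
    """Mean IF and energy of one EMD mode, per 30-second epoch."""
    n = int(fs * epoch_s)
    mean_if, energy = [], []
    for start in range(0, len(eeg) - n + 1, n):
        epoch = eeg[start:start + n]
        imfs, _ = emd(epoch, n_modes=5)            # sketched earlier
        _, freq = desa1(imfs[mode], fs)            # sketched earlier
        mean_if.append(np.nanmean(freq))
        energy.append(np.sum(epoch ** 2))
    mean_if = savgol_filter(mean_if, 11, 3)        # smooth IF (Savitzky-Golay)
    energy = medfilt(np.asarray(energy, float), 11)  # smooth energy (median)
    return mean_if, energy
```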
5 Results

We started by first applying the EMD algorithm to one epoch (30 seconds) of the EEG signal corresponding to the healthy female "ab". In our analysis, each mode reveals predominantly a certain range of frequencies, and the frequency of mode 1 is nearly two times greater than that of mode 2, and so on.

5.1 Identification of the Descriptive Parameter of Slow Deep Sleep by the Study of 9 Healthy Females
The IFs of the different modes during a night's sleep were calculated. We observe that the IF of stage 4 is always lower than that of the other stages. Besides, the evolution of the IFs during the night is almost the same for all 5 modes; only the amplitude of the variations decreases progressively from mode 1 to mode 5. This led us to calculate the weighted average instantaneous frequency (WAIF) [29], which has a linear relationship to the individual instantaneous frequencies. This weighting takes into account the amplitudes of the different modes. Let x(t) be a signal decomposed by the EMD algorithm into N modes. The WAIF at time t is given by

\mathrm{WAIF}(t) = \frac{\sum_{n=1}^{N} a_n^2(t)\, f_n(t)}{\sum_{n=1}^{N} a_n^2(t)},    (3)

where a_n(t) and f_n(t) are the IA and the IF of mode n at time t, respectively. The WAIF of one epoch is the average of the WAIF values of all its points. For the nine healthy females, we noted that, whatever the stage, the WAIF is lower during sleep than in the awake state, and that it reveals slow deep sleep (stages 3 and 4) very well, particularly very deep sleep (stage 4). Indeed, during stage 4, the WAIF decreases significantly. In addition, the stage-4 periods are well localised in terms of time and duration. A statistical analysis using repeated-measures analysis of variance (ANOVA) showed a significant effect of sleep stage (wake, stage 1, stages 2 to 4, and REM) on WAIF for all sleep epochs (F = 62.5 and p < 0.0001).
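Equation (3) translates directly; given per-mode IA and IF arrays (for instance from the DESA-1 sketch above), a minimal implementation is:

```python
import numpy as np

def waif(amps, freqs):
    """Weighted average instantaneous frequency, Eq. (3).
    amps, freqs: arrays of shape (n_modes, T) with IA and IF per mode."""
    w = amps ** 2
    return np.sum(w * freqs, axis=0) / np.sum(w, axis=0)

# The WAIF of one epoch is the average of the per-sample WAIF values:
# epoch_waif = waif(amps, freqs).mean()
```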
Fig. 1. The minWAIF of the episodes of deep sleep for the healthy female "ab" (top: the WAIF curve with the minimum WAIF of the considered episodes; bottom: the corresponding hypnogram). The different minWAIF values are indicated by red dots.
A post hoc Fisher test also showed that the WAIF in each sleep stage differs significantly from the three others (p < 0.006 for all pairs). We noticed that the most interesting parameter could be the minimum weighted average instantaneous frequency (minWAIF) reached during deep sleep episodes (one episode being a series of 30-second epochs). These episodes of deep sleep necessarily contain stage 4 (or stage 3 in the case of the newer scoring, in which stages 3 and 4 are grouped) and possibly stages 2 and 3. Hence, we developed a simple algorithm to obtain the minWAIF of deep sleep episodes. Fig. 1 shows that the minWAIF could be considered as a new parameter for the quantification of sleep EEG.
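The paper does not spell out this episode-detection algorithm; one plausible reading, as a hedged sketch assuming integer stage codes where 2, 3 and 4 denote the deep-sleep stages, is to scan the hypnogram for maximal runs of stages 2-4 that contain at least one stage-4 epoch and to take the minimum epoch WAIF within each run:

```python
def min_waif_per_episode(stages, epoch_waif):
    """Return the minWAIF of each deep-sleep episode, where an episode is a
    maximal run of stage-2/3/4 epochs containing at least one stage-4 epoch."""
    episodes, run = [], []
    for i, s in enumerate(stages):
        if s in (2, 3, 4):
            run.append(i)
        else:
            if run and any(stages[j] == 4 for j in run):
                episodes.append(min(epoch_waif[j] for j in run))
            run = []
    if run and any(stages[j] == 4 for j in run):   # trailing run at end of night
        episodes.append(min(epoch_waif[j] for j in run))
    return episodes
```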
5.2 Comparison of the minWAIF for the Two Female Groups

In this section, we compare the minWAIF for the two female groups (9 healthy females and 11 females suffering from IH). We carry out two types of interpretation: visual and statistical.

Visual interpretation. In Fig. 2, one can already see that the mean of the minWAIF value in healthy females never drops below 0.6 Hz, while that of females suffering from IH drops as low as 0.45 Hz. In addition, the range of the
Fig. 2. The mean of the minWAIF of the first four deep sleep episodes of the two groups (healthy and hypersomnia subjects, with the median of each group indicated)
mean of minWAIF values in healthy females is narrower than that of the female patients.

Statistical interpretation. For all tests, the level of significance, α, was set to 5%. In order to find possible differences between healthy females and females suffering from IH, we compared the minWAIF values in both groups (9 healthy females vs. 11 female patients) using the Mann-Whitney nonparametric test. The test result shows that the minWAIF value was significantly lower in patients than in healthy females: 0.7172 [0.6787 - 0.7635] (median [25th - 75th percentile]) for the 9 healthy females vs. 0.6524 [0.5640 - 0.7761] for the 11 female patients (p < 0.05, Mann-Whitney test).
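The group comparison reported here corresponds to a standard two-sided Mann-Whitney U test; for instance (the array names below are hypothetical, one mean minWAIF value per subject):

```python
from scipy.stats import mannwhitneyu

stat, p = mannwhitneyu(minwaif_healthy, minwaif_patients, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")  # significant at the 5% level if p < 0.05
```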
6 Conclusions

In this work, we studied a new approach for the nonlinear and non-stationary quantitative analysis of sleep EEG signals from healthy females as well as from females suffering from idiopathic hypersomnia. This approach is based on the empirical mode decomposition (EMD) algorithm and the instantaneous frequency (IF) of the intrinsic mode functions (IMFs). The IF, which is generated by the Teager-Kaiser energy operator (TKEO), is motivated by the objective of seeking a relevant descriptive parameter of this neurological hypersomnia. Our study has been validated on sleep EEG data from healthy females and from female patients suffering from IH. Because of the similarity in the evolution of the IF in the five relevant modes, the weighted average instantaneous frequency (WAIF) was then calculated by weighting with the amplitudes of the different
modes. The results showed that the WAIF reveals deep sleep well (especially stage 4). However, it should be noted that the physiological interpretation of these quantities is difficult and delicate, because their values do not in themselves represent the "true" physiological frequencies that are normally observed. We are interested only in their variation over time, since it reflects the evolution of the physiological frequencies of brain waves. Unlike the IF, the evolution of the signal energy is discriminating only for modes 2, 3 and 4. The reduction of energy at the end of the night can be explained by the fact that, in general, the amount of deep sleep gradually decreases over a night of sleep. A physiological explanation could be attributed to the decrease in frequency and increase in energy during deep sleep: the deeper the sleep is, the more brain activity slows down and becomes synchronised, resulting in the slower and larger brain waves observed during deep sleep. Finally, we decided to go further by examining the minimum weighted average instantaneous frequency (minWAIF) reached during deep sleep episodes. Statistical tests showed that the minWAIF is stable during the night and significantly lower among patients suffering from IH than among healthy females. The minWAIF appears to be a specific intrinsic characteristic of the brain function mechanism in patients suffering from IH. It could be an interesting new parameter for the quantification of sleep EEG. Most patients have a minWAIF in the range of the healthy subjects. However, the three patients whose minWAIF is clearly below the lowest minWAIF among the healthy subjects also suffer from severe difficulties in awakening in the morning. Further studies will be necessary to determine if this low WAIF is specific to these awakening difficulties. The results of the analysis of the minWAIF seem interesting and promising. However, it remains to verify, in a future study, the reproducibility of this parameter from one night to another, in order to study the first-night effect. This study needs EEG records of two consecutive nights for the same healthy females. In order to obtain more statistically significant results, several EEG records from a sufficient number of healthy subjects as well as patients suffering from IH will also be required.
Acknowledgments

The authors would like to thank CEPHALON for the financial support (indemnity) provided through the research scholarship of Thi Thuy Trang Huynh.
References

1. Furukawa, T.: Heinrich Bruno Schindler's description of narcolepsy in 1829. Neurology 37, 146 (1987)
2. Roth, B.: Narcolepsy and hypersomnia: review and classification of 642 personally observed cases. Schweiz Arch Neurol Neurochir Psychiatr 199(1), 31–41 (1975)
3. Kirsch, D.B., Chervin, R.D.: Idiopathic Hypersomnia and Recurrent Hypersomnia. In: Culebras, A. (ed.) Sleep Disorders And Neurologic Diseases, 2nd edn., ch. 9. New York (2007)
4. Borbely, A.A.: A two process model of sleep regulation. Hum. Neurobiol. 1, 195–204 (1982)
5. American Academy of Sleep Medicine: ICSD - The International Classification of Sleep Disorders, 2nd edn. (2005)
6. Rechtschaffen, A., Kales, A.: A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Brain Inform. Service/Brain Res. Inst., Univ. California, Los Angeles (1968)
7. Penzel, T., Conradt, R.: Computer based sleep recording and analysis. Sleep Medicine Reviews 4(2), 131–148 (2000)
8. Guilleminault, C., Kirisoglu, C., da Rosa, A.C., Lopes, C., Chan, A.: Sleepwalking, a disorder of NREM sleep instability. Sleep Medicine 7(2), 163–170 (2006)
9. Preud'homme, X.A., Lanquart, J.P., Krystal, A.D., Bogaerts, P., Linkowsk, P.: Modeling slow-wave activity dynamics: does an exponentially dampened periodic function really fit a single night of normal human sleep? Clin. Neurophysiol. 119, 2753–2761 (2008)
10. Sforza, E., Gaudreau, H., Petit, D., Montplaisir, J.: Homeostatic sleep regulation in patients with idiopathic hypersomnia. Clinical Neurophysiology 111, 277–282 (2000)
11. Achermann, P., Borbely, A.A.: Low-frequency (

… (μ > 2) Fisher criterion can be decomposed into w two-class Fisher criteria (where w = μC2); here C indicates the combination of μ classes taken as pairs.
Fig. 1. 2-D visualization of video objects in dataset 16: a) MPEG-7 and b) C-MP7 (h-d chaos)
The inter-class scatter distance f_AB between class A, with E video objects Υ_A, and class B, with F video objects Υ_B, is

f_{AB} = \frac{\left| \frac{1}{E} \sum_{A=1}^{E} \Upsilon_A - \frac{1}{F} \sum_{B=1}^{F} \Upsilon_B \right|}{\frac{1}{E-1} \sum_{A=1}^{E} (\Upsilon_A - \bar{\Upsilon}_A)^2 + \frac{1}{F-1} \sum_{B=1}^{F} (\Upsilon_B - \bar{\Upsilon}_B)^2}.    (4)
In a μ-dimensional space, the distance measure is defined as [14]

\mathrm{DIS} = \min(\beta_u) + \lambda \frac{1}{w} \sum_{u=1}^{w} \beta_u,    (5)

where β_u denotes the u-th binary-class Fisher criterion (e.g., f_AB) among the μ classes decomposed into w pairs, and λ is an empirical factor. A higher distance DIS in Eq. (5) implies a more discriminant feature vector for the corresponding multiple classes. Fig. 2 shows that the minimum intra-class distance, for the four classes we consider, is higher for C-MP7 (with h-d chaos) than for MPEG-7 and for C-MP7 (with l-d chaos).
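Under our reading of Eqs. (4)-(5) (the extraction of the formulas is garbled, so take this as an interpretation rather than the authors' exact code), f_AB measures the separation of the class means relative to the summed unbiased within-class variances, and DIS combines the w pairwise criteria. A sketch with scalar feature values per object:

```python
import numpy as np
from itertools import combinations

def fisher_pair(a, b):
    """Two-class Fisher criterion f_AB of Eq. (4): separation of the class
    means relative to the summed unbiased within-class variances."""
    return abs(a.mean() - b.mean()) / (a.var(ddof=1) + b.var(ddof=1))

def dis(classes, lam=1.0):
    """Multi-class distance of Eq. (5): DIS = min(beta_u) + lam * mean(beta_u),
    with beta_u ranging over all C(mu, 2) class pairs."""
    betas = [fisher_pair(a, b) for a, b in combinations(classes, 2)]
    return min(betas) + lam * float(np.mean(betas))
```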
4.2 Classification Accuracy

Initially, [1] reports that better cross-validation accuracy can be achieved using C-MP7 over MPEG-7, with either h-d or l-d chaotic series. In surveillance scenes, changes in video objects in successive frames are usually not significant. Thus, in cross-validation, the training and testing datasets may share almost identical image regions from successive frames. Here, we present classification accuracy with disjoint training and test datasets. One pruned dataset (with no repeated objects, to avoid overfitting in training) containing few video objects (i.e., 167) is selected for training. In the test datasets, repeated objects from subsequent frames are kept.
Fig. 3 shows that significant improvement can be achieved by C-MP7 (using h-d chaos). In SVM, the accuracy improves by 10% to 20% over MPEG-7 for all four classes, i.e., p, g, v, u. C-MP7, however, performs poorly if l-d chaos is used. In addition to the overall classification performance on all classes, above 99% accuracy is achieved for vehicles against all other objects in a binary SVM classifier for C-MP7 with h-d chaos. The corresponding accuracy for MPEG-7 is on average 84%. This high vehicle accuracy complements the well-clustered feature vector representation of vehicles in Fig. 1-b.
Fig. 2. Multi-class Fisher criteria for four classes
4.3 Why Is High-Dimensional Chaos Better?

A chaotic series must be sensitive to initial conditions [8]. One reason for the superiority of C-MP7 with h-d chaos over that with l-d chaos may be the sensitivity to the initial seed x_0, either in the chaotic series or in the chaotic attractor. In Fig. 4-a, up to n = 15, l-d chaos (Eq. 1) exhibits the same series for initial seeds differing by 10^-6, i.e., 0.123423 and 0.123425. As n increases, the chaotic series drift apart. In the 2-dimensional phase space, x(n) vs. x(n+1), both series nevertheless stay on the same attractor; see Fig. 4-b. In the case of h-d chaos (Eq. 3), at higher iterations, e.g., n = 1980, the said two seeds show drift in Fig. 4-c, but stay on the same attractor in Fig. 4-d. However, for initial seeds differing by 10^-1, the h-d chaotic attractors show significant drift in Fig. 4-d. The non-linear structure (i.e., the dynamics of the trajectory) in an l-d chaotic series is easily detectable, while it is hardly detectable in an h-d chaotic series. To verify the effect of drifts in the chaotic series, we re-simulate C-MP7 with the l-d chaotic series at 10 iterations, where there is no drift in the series, and with the h-d chaotic series at 2500 iterations, where there are significant drifts in the series. As in Fig. 3, for SVM, the C-MP7 with l-d
Fig. 3. Accuracy with C-MP7 and MPEG-7
Fig. 4. a) l-d series, b) l-d attractors, c) h-d series, and d) h-d attractors
chaos again shows poor accuracy, while with h-d chaos it achieves increased accuracy, on average 84.58%. The drift in h-d chaotic attractors, but not in the chaotic series, for changes in initial seeds in the range of 10^-1, is due to a transient. In Fig. 4-d, this transient is the initial delay before the h-d chaotic series rides on the attractor, and is caused by the asymptotic stability of fixed points in the attractor [8]. The transient is practically insignificant in l-d chaos. In a class, this
H. Azhar and A. Amer
Fig. 5. Variance in MPEG-7 and C-MP7
attractor [8]. Transient is practically insignificant in l-d chaos. In a class, this transient allows C-MP7 to capture subtle variations of descriptor coefficients, as they exist in diverse image-regions for video objects in successive frames in surveillance scenes. The higher variance in C-MP7 with h-d chaos as shown in Fig. 5 for video objects, manifests the said subtle variations.
5
Conclusion
The proposed histogram-based chaotic feature binding of MPEG-7 visual descriptors generates a new feature vector C-MP7, and is MPEG-7 compliant. CMP7, with h-d chaotic series, video object classification accuracy is better, than the C-MP7 with l-d chaos, and with original MPEG-7. Our analysis, validates the excellence of h-d chaos-based C-MP7 as a feature vector over MPEG-7, for video objects, despite poor segmentation and tracking. Video objects of different scale, resolution, orientation and occlusion are randomly mixed, and tested from both indoor and outdoor surveillance scenes. Vehicle objects are clustered well with C-MP7, which leads to above 99% accuracy for only vehicles against other objects in all data sets. The new feature vector C-MP7 has the same MPEG-7 format as available in the XML schema for each descriptor, except null feature coefficients added at varying indexes. So each descriptor for a video object can be reconstructed back to the original XML description with reduced feature coefficients. This compliance allows the C-MP7 -based video object descriptions to be shared with other potential MPEG-7 industry applications.
References

1. Azhar, H., Amer, A.: Chaos and MPEG-7 Based Feature Vector for Video Object Classification. In: IEEE International Conference on Image Processing, pp. 432–437 (2008)
2. Ma, X., Grimson, W.E.L.: Edge-based rich representation for vehicle classification. In: IEEE International Conference on Computer Vision, pp. 1185–1192 (2005)
3. Qiming, L., Khoshgoftaar, T.M., Folleco, A.: Classification of Ships in Surveillance Video. In: IEEE International Conference on Information Reuse and Integration, pp. 432–437 (September 2006)
4. Eidenberger, H.: How Good are the Visual MPEG-7 Features? In: SPIE Visual Communications and Image Processing, vol. 5150, pp. 476–488 (2003)
5. Basar, E. (ed.): Chaos in Brain Function. Springer, Heidelberg (1990)
6. Hubel, D.H.: Eye, Brain, and Vision, 2nd edn. W.H. Freeman, New York (1995)
7. Skarda, C.A., Freeman, W.J.: How Brains Make Chaos in Order to Make Sense of the World. Behav. Brain Sci. 10, 161–195 (1987)
8. Kaplan, D., Glass, L.: Understanding Nonlinear Dynamics. Springer, Heidelberg (1998)
9. Takens, F.: Dynamical Systems and Turbulence. Lecture Notes in Mathematics, vol. 898. Springer, Heidelberg (1981)
10. Grassberger, P., Procaccia, I.: Characterization of Strange Attractors. Phys. Rev. Lett. 50, 346–349 (1983)
11. Amer, A.: Voting-based Simultaneous Tracking of Multiple Video Objects. IEEE Transactions on Circuits and Systems for Video Technology 15, 1448–1462 (2005)
12. Manjunath, B.S., Salembier, P., Sikora, T.: Introduction to MPEG-7. John Wiley and Sons, Ltd., Chichester (2002)
13. Javed, O., Shah, M., Shafique, K.: Automated Visual Surveillance in Realistic Scenarios. IEEE Multimedia Magazine (2007)
14. Jack, L., Guo, H., Nandi, A.: Feature Generation Using Genetic Programming with Application to Fault Classification. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 35(1) (2005)
Compensating for Motion Estimation Inaccuracies in DVC

Jürgen Slowack 1, Jozef Škorupa 1, Stefaan Mys 1, Nikos Deligiannis 2, Peter Lambert 1, Adrian Munteanu 2, and Rik Van de Walle 1

1 Ghent University - IBBT, Dept. of Electronics and Information Systems (ELIS), Multimedia Lab, Gaston Crommenlaan 8 bus 201, B-9000 Ghent, Belgium
2 Vrije Universiteit Brussel - IBBT, Electronics and Informatics Dept. (ETRO), Pleinlaan 2, B-1050 Brussels, Belgium
Abstract. Distributed video coding is a relatively new video coding approach, where compression is achieved by performing motion estimation at the decoder. Current techniques for decoder-side motion estimation make use of assumptions such as linear motion between the reference frames. It is only after the frame is partially decoded that some of the errors are corrected. In this paper, we propose a new approach with multiple predictors, accounting for inaccuracies in the decoder-side motion estimation process during the decoding. Each of the predictors is assigned a weight, and the correlation between the original frame at the encoder and the set of predictors at the decoder is modeled at the decoder. This correlation information is then used during the decoding process. Results indicate average quality gains up to 0.4 dB.
1 Introduction
Video compression is achieved by exploiting redundancies in the frame sequence. In the temporal direction, these redundancies are often exploited through a process called motion estimation. In conventional video compression schemes, motion estimation is performed at the encoder. Each frame is partitioned into non-overlapping blocks, and the goal of the motion estimation process is to find, for each of these blocks, the closest matching block in a set of reference frames. Next, the residual between the block and its prediction is entropy coded, along with the motion vectors. In distributed video coding (DVC), on the other hand, motion estimation is performed by the decoder instead of the encoder. As a result, the complexity of the encoder is low compared to the complexity of the decoder. However, performing motion estimation at the decoder is difficult, since in DVC motion estimation is performed without having the original frame available. Hence, the original frame available at the encoder is predicted at the decoder using reference frames only. This prediction is called side information. Since the side information is but a prediction of the original, additional information is sent by the encoder allowing the decoder to correct the side information. In this process, the correlation between the original frame X and the side information Y is often estimated
at the decoder, for efficient use of the error correcting information (e.g. LDPC or turbo codes). Complex motion characteristics of video cause a significant amount of errors in the side information. Typically, a motion vector is calculated for each block in the side information by comparing blocks in the reference frames. For example, techniques have been proposed by Aaron et al. [1] and in the context of DISCOVER [2]. In the latter, block-based motion estimation is performed between a past and a future reference frame. This motion field is then interpolated to obtain a motion vector for each block in the side information. Next, the motion vector is further refined. Other researchers have made contributions as well, for example, Kubasov et al. [3] use a mesh-based approach for generating the side information, as well as a combination of mesh and block-based techniques. The problem with these techniques is that motion is assumed linear between the past and future reference frames. This assumption becomes less valid if the distance between the reference frames is large. As a result, sequences with irregular motion such as non-linear motion and occlusion are not predicted very accurately. This is illustrated with an example further on. The most recent techniques for side information generation use a refinement approach. Decoding is performed partially and the partially decoded frame is used to improve the side information. The improved side information is then used for further decoding. Some interesting techniques in this context are proposed by Martins et al. [4], as well as by Ye et al. [5], and Fan et al. [6], for example. While these techniques show good results, they need to decode some information first before they can compensate for any mistakes made during the side information generation process. Therefore, in this paper, we propose a technique where some of the motion estimation inaccuracies are taken into account during the decoding, by using a combination of weighted predictors (Sect. 2). The weights are updated using an online procedure. Evaluating our technique indicates average PSNR gains up to 0.4 dB (Sect. 3). Conclusions and future work are provided in Sect. 4, and Sect. 5, respectively.
2 Proposed Technique
We first illustrate the problems associated with side information generation in the case of complex motion. Side information has been generated using the techniques employed in DISCOVER [2], for the 5th frame of the Foreman sequence, using the first frame as a past reference, and the 9th frame as a future reference. The side information is corrected using a turbo decoding procedure. When analyzing the residual between the side information and the decoded frame in Fig. 1, it is clear that a lot of errors have been corrected. Judging from the side information itself, it could already be expected that the accuracy of estimating the face is low. However, the residual between the side information and the decoded frame also reveals that errors have been corrected in the background. More specifically, we can see that edges in the side information are not predicted accurately. This is due to non-linear camera motion.
(Panels, left to right: past reference, side information, future reference, decoded frame, residual.)
Fig. 1. A lot of errors need to be corrected in the side information if the distance between the reference frames is large, as shown by the residual between the side information and the decoded frame
We can compensate for some of these inaccuracies by using more than one prediction for each block. This is explained using Fig. 2. As input we use the side information Y generated as in DISCOVER [2]. As such, a particular block in the side information is generated by averaging past and future reference blocks P and F, using a linear motion vector. However, if the motion is non-linear, then the prediction should appear at a different spatial position in the side information. Hence, to predict a block at position (x0, y0), we can use the block at position (x0, y0) in Y, together with some of the surrounding blocks in Y. This strategy can also be beneficial in other cases with complex motion such as occlusion and deformation. Before explaining this method in detail, a description of the codec is provided in the following section.
2.1 Codec Description
The proposed codec is based on the work of Aaron et al. [1], with some important extensions adopted from DISCOVER [2], and from our previous work [7]. The codec is depicted in Fig. 3, highlighting the extensions proposed in this paper. The frame sequence is partitioned into key frames I and Wyner-Ziv (WZ) frames W. At the encoder, key frames are intra coded using H.264/AVC intra coding. The intra decoded key frames I′ and the original key frames I are used to calculate the quantization noise, which is needed at the decoder for accurate correlation noise estimation, as in [7]. WZ frames are partitioned into 4-by-4 non-overlapping blocks, and each block is transformed using a DCT. Coefficients at
Fig. 2. The linear motion vector (1) could be inaccurate, so that the interpolation between P and F is located on a different spatial position (2)
the same index k (e.g. all DC coefficients) are grouped into so-called coefficient bands, and each coefficient band is quantized using a quantizer having 2^{M_k} levels. For each quantized band, bits at the same position (e.g. all most significant bits) are grouped into bitplanes, which are fed to a turbo coder calculating parity bits. These parity bits are stored in a buffer, and sent in portions to the decoder upon request. At the decoder, key frames are decoded into I′. For each WZ frame, side information is generated using already decoded frames I′ and W′ (as discussed below). We adopt the techniques for side information generation as used in DISCOVER [2]. The output of this process is the side information frame Y, and for each block the (linear) motion vector MV_SI, as well as the residual R_SI between the past and future reference blocks. This information is used as input for the extensions provided in this paper. First, for each block, multiple predictors are generated (Sect. 2.2), denoted {Y_n}. Next, each of these predictors is assigned a weight (Sect. 2.3), and the correlation between the predictors and the original is modeled through the conditional distribution f_{X|{Y_n}} (Sect. 2.4). This distribution is used by the turbo decoder, which requests bits until the decoded result is sufficiently reliable¹. Finally, the quantized coefficients are reconstructed (Sect. 2.5) and inverse transformed to obtain the decoded frame W′.
2.2 Generation of Predictors
A block at position (x0, y0) is predicted using multiple predictors, obtained from the side information frame Y. The first predictor is the one corresponding to linear motion, i.e., the block at the co-located position in Y. To compensate for motion inaccuracies such as non-linear motion, surrounding blocks in Y are taken into account as well. As a compromise between complexity and performance, eight additional predictors are used, namely the ones corresponding to positions (x0 ± 1, y0 ± 1) in Y. This results in a total of 9 predictors per block, as illustrated by the sketch below. Not every predictor is equally likely, so weights are calculated for each predictor, as explained in the following section.
¹ More specifically, the sign-difference ratio is used as a stopping criterion for turbo decoding [8].
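A minimal sketch of how the nine candidate blocks could be gathered is given below. The function name, array layout, and the clamping of blocks at the frame borders are our own assumptions; the paper does not specify how borders are handled.

```python
import numpy as np

def generate_predictors(Y, x0, y0, block=4):
    """Gather the co-located block in the side information Y plus its
    eight spatial neighbours at offsets of +/-1 pixel (Sect. 2.2)."""
    H, W = Y.shape
    predictors = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # clamp so that blocks near the frame border stay inside Y
            x = min(max(x0 + dx, 0), W - block)
            y = min(max(y0 + dy, 0), H - block)
            predictors.append(Y[y:y + block, x:x + block])
    return predictors  # 9 candidates; index 4 is the linear-motion block
```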
Fig. 3. Our DVC codec with the proposed modifications highlighted
2.3 Online Calculation of the Predictor Weights
Each of the 9 predictors is assigned a weight, according to the probability that this predictor is the best one out of the set. To discriminate between different predictors, denote MV_NL as the predictor offset compared to the linear predictor. For example, if the linear predictor corresponds to the block at position (x0, y0) in Y, then the predictor at position (x0 + 1, y0 − 1) in Y has MV_NL = (1, −1). As a second parameter, different weights are defined for different values of MV_SI. The reason is that large motion vectors MV_SI (delivered by the side information generation process) could be less accurate than small motion vectors. Instead of using the two-dimensional parameters MV_NL and MV_SI as input for retrieving the predictor weights, we define the following magnitude metric:

$$L((x, y)) = \max(|x|, |y|) \quad (1)$$
Using this metric, the predictor weights are defined through the distribution p(L(M VSI ), L(M VN L )). This distribution is established during the decoding process, since different sequences often require different weights. For some sequences, linear motion might be a good assumption, so that the linear-motion predictor should have a significantly higher weight than the other predictors. On the other hand, for sequences with complex motion characteristics, all predictors could require similar weights. Hence, the following technique is used for updating the weights using an online procedure. Given the side information frame Y and the decoded frame W , each block in W is compared to each of the 9 predictors in Y . More specifically, the mean absolute difference (MAD) is calculated between the block at a certain position (x0 , y0 ) in W and the co-located block in Y . This MAD indicates the amount of errors corrected when using the linear predictor. Likewise, the MAD for other
predictors is calculated, for example, comparing the block at position (x0, y0) in W′ to the block at position (x0 + 1, y0 + 1) in Y, etc. The predictor with the lowest MAD is then considered the best one out of the set. However, a non-linear predictor is only considered best in case its MAD is lower than 0.9 times the MAD of the linear predictor. Otherwise, the linear predictor is considered best nonetheless. This criterion is used to ensure that only significant improvements over the linear predictor are taken into account. For example, in a region without much texture, one of the non-linear predictors could have a lower MAD than the linear predictor, because the quantization noise in this predictor has distorted the block in such a way that it better resembles the decoded result. To avoid these situations, the MAD of a non-linear predictor must be lower than 0.9 times the MAD of the linear predictor. The value of 0.9 has been obtained experimentally. Given the best predictor for a block, a histogram table T is updated, where each entry T(i, j) indicates the number of occurrences of the tuple (L(MV_SI) = i, L(MV_NL) = j). This table only covers the statistics of the current frame. After processing all blocks in W′, the histogram table is used to update the global statistics:

$$p(L(MV_{SI}) = i, L(MV_{NL}) = j) = K \cdot p(L(MV_{SI}) = i, L(MV_{NL}) = j) + (1 - K) \cdot \frac{T(i, j)}{\sum_k T(i, k)} \quad (2)$$
where the update parameter K is set to 0.8. This value, which has been obtained through experiments, remains fixed for all test sequences. A more detailed study of the update parameter is left as future work, as described in Sect. 5. The global statistics are used for weighing the predictors in the following frame to be decoded.
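The following sketch shows one possible implementation of this online update. The function names, the `blocks` iterable, and the handling of empty histogram rows are our own assumptions; the paper leaves these details unspecified.

```python
import numpy as np

def mad(a, b):
    """Mean absolute difference between two equally sized blocks."""
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

def update_weights(p, W_dec, Y, blocks, K=0.8, ratio=0.9, block=4):
    """One online update of p(L(MV_SI), L(MV_NL)) after decoding a frame.
    `p` is a 2-D array indexed by (L(MV_SI), L(MV_NL)); `blocks` yields
    (x0, y0, L_mvsi) for every block position away from the borders."""
    T = np.zeros_like(p)
    for x0, y0, L_mvsi in blocks:
        ref = W_dec[y0:y0 + block, x0:x0 + block]
        mads = {(dx, dy): mad(ref, Y[y0 + dy:y0 + dy + block,
                                     x0 + dx:x0 + dx + block])
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)}
        best = min(mads, key=mads.get)
        # a non-linear predictor only wins if its MAD is below 0.9 times
        # the MAD of the linear predictor
        if best != (0, 0) and mads[best] >= ratio * mads[(0, 0)]:
            best = (0, 0)
        T[L_mvsi, max(abs(best[0]), abs(best[1]))] += 1  # metric (1)
    rows = T.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1          # guard against rows with no observations
    return K * p + (1 - K) * T / rows  # update rule (2)
```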
2.4 The Correlation Model
Using the weights, a correlation model is defined. This model is used by the turbo decoder. If the model is more accurate, fewer bits are needed by the turbo decoder for decoding the quantized frame. In addition, the reconstruction process (as described in Sect. 2.5) will be more accurate. The goal is to model the correlation between the original X and the set of predictors denoted {Y_n}. This is modeled in the (DCT) transform domain. For each 4-by-4 block in the original frame, 16 distributions are generated, i.e., one for each coefficient X_k (k = 0 ... 15). The predictors are transformed, and all coefficients at the same index are grouped. Denote the predictors for X_k as {Y_{k,n}}. It is common in DVC to model the correlation between the original and the side information as a Laplace distribution [9]. Hence, with multiple predictors, in this paper the conditional distribution f_{X_k|{Y_{k,n}}} is modeled as a combination of weighted Laplace distributions, i.e.:

$$f_{X_k|\{Y_{k,n}\}}(x \mid \{y_{k,n}\}) = \sum_i w_i \cdot \frac{\alpha}{2} e^{-\alpha |x - y_{k,i}|} \quad (3)$$
where y_{k,i} indicates the kth coefficient of the ith predictor. The weight w_i for each predictor is given by p(L(MV_SI), L(MV_NL)), divided by the number of predictors having the same value for L(MV_NL). The scaling parameter α is calculated based on the reference residual of the linear predictor, using the techniques proposed in our earlier work [7].

Fig. 4. Rate-distortion curves (PSNR in dB versus rate in kbps) for the Football, Coastguard, and Foreman sequences, comparing the proposed method, our previous work, and DISCOVER. Over our previous work, average quality gains of 0.4 dB are reported for Football and Foreman, and 0.1 dB for Coastguard
2.5 Coefficient Reconstruction
After turbo decoding, the quantization bin q_k containing the original value (with very high probability) is known at the decoder. The following step is to choose a value in this quantization bin as the decoded coefficient X′_k. This is done through optimal centroid reconstruction [10]:

$$X'_k = \frac{\sum_i w_i \int_{q_L}^{q_H} x \cdot \frac{\alpha}{2} e^{-\alpha|x - y_{k,i}|}\, dx}{\sum_i w_i \int_{q_L}^{q_H} \frac{\alpha}{2} e^{-\alpha|x - y_{k,i}|}\, dx} \quad (4)$$
where qL and qH indicate the low and high border of qk , respectively.
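As a rough illustration, the centroid of Eq. (4) can be evaluated numerically, as in the sketch below. The function name and the use of numerical rather than closed-form integration are our own choices for brevity; the paper does not prescribe an evaluation method.

```python
import numpy as np

def reconstruct_coeff(qL, qH, y, w, alpha, n=1000):
    """Optimal centroid reconstruction (4) for one DCT coefficient,
    evaluated by numerical integration of the weighted Laplacian mixture
    over the decoded quantization bin [qL, qH]."""
    x = np.linspace(qL, qH, n)
    # mixture density f(x) = sum_i w_i * (alpha / 2) * exp(-alpha |x - y_i|)
    f = sum(wi * 0.5 * alpha * np.exp(-alpha * np.abs(x - yi))
            for wi, yi in zip(w, y))
    return np.trapz(x * f, x) / np.trapz(f, x)  # conditional expectation
```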
3 Results
Tests have been conducted on three different sequences: Foreman, Football and Coastguard. All are in CIF resolution, at a frame rate of 30 frames per second. A GOP of size 8 is used, and only the luma component is coded for comparison with the state-of-the-art DISCOVER codec². Our system is also compared to our previous work [7], which applies a better correlation noise model than DISCOVER, but still uses only one predictor per block. The results indicate improvements over both systems (Fig. 4). The gains are the largest for sequences with complex motion such as Football and Foreman, where the linear predictor does not always provide an accurate prediction. In these cases, using multiple predictors to compensate for inaccuracies shows average Bjøntegaard [11] quality gains up to 0.4 dB over our previous work, and 1.0 dB over DISCOVER (both for Football and Foreman). For sequences with rather simple motion characteristics, such as Coastguard, less gain is observed. For such sequences, an accurate prediction is already provided by the linear motion predictor, so using additional predictors provides less gain. Over our previous work, average quality gains of 0.1 dB are reported, while gains of 1.4 dB are obtained over DISCOVER.
4 Conclusions
The technique proposed in this paper tries to compensate for inaccuracies in the generation of the side information at the decoder. Instead of using only the predictor associated with linear motion, other predictors are taken into account as well. This technique shows good results, achieving average quality gains up to 0.4 dB over our previous work, and 1.4 dB over DISCOVER [2]. These results illustrate the importance of accurate side information and correlation noise estimation in DVC.
² Executables for the DISCOVER codec are available at http://www.discoverdvc.org/
5 Future Work
Several extensions to this work can be investigated. One possible extension is to compare different techniques for updating the global statistics (used as weights in the following WZ frame to be decoded). For example, instead of using a fixed value for the update parameter K, a value for K can be determined online, i.e., during the decoding process. This could enable better tracking of the motion non-linearities in the frame sequence. The set of predictors could also be extended, for example, by adding unidirectional predictors for better handling of occlusion. Also, instead of using the same weights for all predictors in one WZ frame, it could be interesting to study techniques where the predictor weights are spatially adaptive.
References
1. Aaron, A., Rane, S., Setton, E., Girod, B.: Transform-domain Wyner-Ziv codec for video. In: Proc. SPIE Visual Communications and Image Processing, vol. 5308, pp. 520–528 (January 2004)
2. Artigas, X., Ascenso, J., Dalai, M., Klomp, S., Kubasov, D., Ouaret, M.: The DISCOVER codec: Architecture, techniques and evaluation. In: Proc. Picture Coding Symposium (PCS) (November 2007)
3. Kubasov, D., Guillemot, C.: Mesh-based motion-compensated interpolation for side information extraction in Distributed Video Coding. In: Proc. IEEE International Conference on Image Processing (ICIP) (October 2006)
4. Martins, R., Brites, C., Ascenso, J., Pereira, F.: Refining side information for improved transform domain Wyner-Ziv video coding. IEEE Transactions on Circuits and Systems for Video Technology 19(9), 1327–1341 (2009)
5. Ye, S., Ouaret, M., Dufaux, F., Ebrahimi, T.: Improved side information generation for distributed video coding by exploiting spatial and temporal correlations. EURASIP Journal on Image and Video Processing (2009), Article ID 683510
6. Fan, X., Au, O.C., Cheung, N.M., Chen, Y., Zhou, J.: Successive refinement based Wyner-Ziv video compression. Signal Processing: Image Communication (2009), doi:10.1016/j.image.2009.09.004
7. Slowack, J., Mys, S., Škorupa, J., Lambert, P., Grecos, C., Van de Walle, R.: Accounting for quantization noise in online correlation noise estimation for distributed video coding. In: Proc. Picture Coding Symposium (PCS) (May 2009)
8. Škorupa, J., Slowack, J., Mys, S., Lambert, P., Grecos, C., Van de Walle, R.: Stopping criterions for turbo coding in a Wyner-Ziv video codec. In: Proc. Picture Coding Symposium (PCS) (May 2009)
9. Brites, C., Pereira, F.: Correlation Noise Modeling for Efficient Pixel and Transform Domain Wyner-Ziv Video Coding. IEEE Transactions on Circuits and Systems for Video Technology 18(9), 1177–1190 (2008)
10. Kubasov, D., Nayak, J., Guillemot, C.: Optimal reconstruction in Wyner-Ziv video coding with multiple side information. In: IEEE MultiMedia Signal Processing Workshop, pp. 183–186 (October 2007)
11. Bjøntegaard, G.: Calculation of average PSNR differences between RD-curves. Technical report, VCEG, Contribution VCEG-M33 (April 2002)
Multi-sensor Fire Detection by Fusing Visual and Non-visual Flame Features
Steven Verstockt¹,², Alexander Vanoosthuyse², Sofie Van Hoecke², Peter Lambert¹, and Rik Van de Walle¹
¹ Department of Electronics and Information Systems, Multimedia Lab, Ghent University - IBBT, Gaston Crommenlaan 8, bus 201, B-9050 Ledeberg-Ghent, Belgium
² University College West Flanders, Ghent University Association, Graaf Karel de Goedelaan 5, 8500 Kortrijk, Belgium
[email protected]
http://multimedialab.elis.ugent.be/
[email protected] http://multimedialab.elis.ugent.be/
Abstract. This paper proposes a feature-based multi-sensor fire detector operating on ordinary video and long wave infrared (LWIR) thermal images. The detector automatically extracts hot objects from the thermal images by dynamic background subtraction and histogram-based segmentation. Analogously, moving objects are extracted from the ordinary video by intensity-based dynamic background subtraction. These hot and moving objects are then further analyzed using a set of flame features which focus on the distinctive geometric, temporal and spatial disorder characteristics of flame regions. By combining the probabilities of these fast retrievable visual and thermal features, we are able to detect the fire at an early stage. Experiments with video and LWIR sequences of fire and non-fire real case scenarios show good results and indicate that multi-sensor fire analysis is very promising.
1
Introduction
Recently, research on video analysis for fire detection has become a hot topic in computer vision. This has resulted in a large amount of vision-based detection techniques that can be used to detect the fire at an early stage [1]. Based on the numerous advantages of video-based sensors, e.g. fast detection (no transport delay), indoor/outdoor detection at a distance, and the ability to provide fire progress information, it is expected that video fire detection (VFD) will become a viable alternative or complement for the more traditional fire sensors. Although it has shown that ordinary video promises good fire detection and analysis results, we believe that the use of IR cameras in the long wave IR range (LWIR) can be of added value. Various facts corroborate this idea. First of all, existing VFD algorithms have inherent limitations, such as the need for sufficient and specific lighting conditions. Thermal IR imaging sensors image emitted light, not reflected light, and do not have this limitation, providing a 24 hour, 365 day capability. Also, the further we go in the infrared spectrum the more the visual A. Elmoataz et al. (Eds.): ICISP 2010, LNCS 6134, pp. 333–341, 2010. c Springer-Verlag Berlin Heidelberg 2010
334
S. Verstockt et al.
perceptibility decreases and the thermal perceptibility increases. As such, hot objects like flames will be best visible and less disturbed by other objects in the LWIR spectral range. Furthermore, due to the variability of shape, motion, colors, and patterns of smoke and flames, many of the existing VFD approaches are still vulnerable to false alarms. Since it is possible to integrate IR cameras into existing closed-circuit television (CCTV) networks, the combination of both technologies can be used to reduce these false alarms. At last, smoke is almost transparent in the LWIR wavelength range, and therefore, in contrast to video cameras or the human eye, a LWIR camera can look through it. As such, LWIR imaging could also be used to localize the fire through the smoke. On the basis of all these facts, the use of LWIR in combination with ordinary VFD is considered to be a win-win. This is also confirmed by our experiments, as the fused detector performs better than either sensor alone. The remainder of this paper is organized as follows. Section 2 presents the related work in non-visible light, with a particular focus on the underlying features which can be of use in LWIR flame detection. For the related work in visible light we refer to earlier work of the authors [1]. Based on the analysis of the existing approaches in visibible and non-visible light and on our experiments, Section 3 proposes the multi-sensor flame detector. Subsequently, Section 4 shows the experimental results. Section 5 ends this paper with the conclusions.
2
Related Work in Non-visible Light
Although the trend towards IR-based video analysis is noticeable, the number of papers about IR-based fire detection in literature is still limited. Nevertheless, the results from existing work already seem very promising and ensure the feasibility of IR video in fire detection. Owrutsky et al. [2] work in the near infrared spectral range and focus on an increase in the global luminosity, i.e. the sum of the pixel intensities in the frame. Although this fairly simple algorithm seems to produce good results, its limited constraints do raise questions about its applicability in large and open places with varying backgrounds and a lot of ordinary moving objects. Toreyin et al. [3] detect flames in infrared by searching for bright-looking moving objects with rapid time-varying contours. A wavelet domain analysis of the 1D-curve representation of the contours is used to detect the high frequency nature of the boundary of a fire region. In addition, the temporal behavior of the region is analyzed using a Hidden Markov Model. The combination of both temporal and spatial clues seems more appropriate than the luminosity approach and, according to the authors [3], greatly reduces false alarms. A similar combination of temporal and spatial features is used by Bosch et al. [4]. Hotspots, i.e. candidate flame regions, are detected by automatic histogram-based image thresholding. By analyzing the resulting hot objects their intensity, signature, and orientation, discrimination between flames and other hot objects is made.
Multi-sensor Fire Detection
335
Fig. 1. General scheme of multi-sensor flame detector
3
Multi-sensor Flame Detector
Based on the analysis of our experiments and the related work, this section proposes the multi-sensor flame detector. As can be seen in Fig. 1, the detector consists of a video and LWIR moving (hot) object segmentation and a set of visual and LWIR flame features which analyze these objects. At the end, a global classifier combines the analysis results and takes a final decision. 3.1
Dynamic Background Subtraction: Moving Object Detection
Both the video and LWIR segmentation process start with a dynamic background subtraction, which extracts moving objects by subtracting the LWIR and video frames with everything in the scene that remains constant over time, i.e. the estimated background BGn . This estimation is updated dynamically (Eq. 1) after each segmentation by comparing the pixel values of BGn with the values of the corresponding pixels in the frame Fn . For the LWIR images the comparison is based on the temperature increase of flames: if the quantized shifted value of the pixel [x, y] in Fn is higher than the shifted value of the pixel in BGn , Fn [x, y] is labeled as foreground F G. Otherwise, Fn [x, y] gets a background label BG. The shifting is performed by moving down the value of each pixel in BGn to the nearest ten and by shifting the corresponding pixel in Fn over the same distance. The quantization is done by rounding off the shifted values of Fn to the nearest ten. For the ordinary video, a similar approach is used. Fn [x, y] is labeled as F G if the quantized shifted intensities of the pixel in BGn and Fn differ. αBGn [x, y] + (1 − α)Fn [x, y] if Fn [x, y] → BG BGn+1 [x, y] = (1) BGn [x, y] if Fn [x, y] → F G where the update parameter α is a time constant that specifies how fast new information supplants old observations. Here α (=0.95) was chosen close to 1.
Fig. 2. Hot object segmentation by background subtraction and histogram-based dynamic thresholding
To avoid unnecessary computational work and to decrease the number of false alarms caused by noisy objects, a morphological opening, with a 3×3 square structuring element, is performed after the dynamic background subtraction. Each of the remaining FG objects in the video images is then further analyzed using a set of visual flame features. Analogously, LWIR FG objects are analyzed by a set of LWIR flame features. Before this LWIR analysis starts, a histogram-based segmentation is additionally used to extract the hottest objects out of the set of LWIR FG objects. Only these hot FG objects are further analyzed.
3.2 LWIR Hot Object Segmentation
Like in the work of Bosch et al. [4] we extract the hot objects representing possible flames by separating the highly brightened objects from the less brightened objects (Fig. 2). This segmentation step uses Otsu's method [5], which automatically performs histogram shape-based image thresholding, assuming that the image to be processed contains two classes of objects. Iteratively, the optimum threshold separating those two classes is calculated so that their combined spread, i.e. the intra-class variance (Eq. 2), is minimal:

$$\sigma_w^2(t) = \omega_1(t)\,\sigma_1^2(t) + \omega_2(t)\,\sigma_2^2(t) \quad (2)$$
where the weights ω_i are the probabilities of the two classes separated by a threshold t and σ_i² are the variances of these classes.
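A short sketch of this threshold search is shown below. It is a plain exhaustive implementation of Otsu's criterion of Eq. (2); the function name and the 256-bin histogram are our own assumptions.

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Exhaustive search for the threshold t minimizing the intra-class
    variance of Eq. (2)."""
    hist, edges = np.histogram(img.ravel(), bins=bins)
    prob = hist.astype(float) / hist.sum()
    c = np.arange(bins)
    best_t, best_var = 0, np.inf
    for t in range(1, bins):
        w1, w2 = prob[:t].sum(), prob[t:].sum()
        if w1 == 0 or w2 == 0:
            continue  # one class is empty; not a valid split
        mu1 = (c[:t] * prob[:t]).sum() / w1
        mu2 = (c[t:] * prob[t:]).sum() / w2
        var1 = ((c[:t] - mu1) ** 2 * prob[:t]).sum() / w1
        var2 = ((c[t:] - mu2) ** 2 * prob[t:]).sum() / w2
        intra = w1 * var1 + w2 * var2      # Eq. (2)
        if intra < best_var:
            best_t, best_var = t, intra
    return edges[best_t]
```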
3.3 Hot Object Analysis: LWIR Flame Features
Bounding Box Disorder (BBD): Our experiments (Fig. 3) revealed that the bounding box BB of flames varies considerably over time in both directions and that this variation shows a high degree of disorder. As such, the BBD (Eq. 3) is chosen as a feature to distinguish between flames and other hot objects. It is related to the number of extremes, i.e., local maxima and minima, in the set of N BBwidth and BBheight data points. By smoothing these data points, small
Fig. 3. Bounding box disorder of flames (a) and moving person (b)
differences between consecutive points are filtered out and are not taken into account in the extrema calculation, which increases the strength of the feature. Flames, with a high number of extremes, will have a BBD close to 1, while for more static objects it will be near 0.

$$BBD = \frac{|\operatorname{extrema}(BB_{width})| + |\operatorname{extrema}(BB_{height})|}{N} \quad (3)$$
Principal Orientation Disorder (POD): During the experiments, it was also found that the disorder in principal orientation is remarkably higher for flames than for more static objects like people. This orientation equals the angle α between the x-axis and the major axis of the ellipse that has the same second moments as the object region. The POD focuses on this orientation disorder characteristic and is calculated in a similar way to the BBD (Eq. 4):

$$POD = \frac{|\operatorname{extrema}(\alpha)|}{N/2} \quad (4)$$
Again, flames, with a high number of orientation extremes, will have a POD close to 1, while the POD of more static objects will be near 0.
Histogram Roughness (HR): By inspection of the histograms H of hot objects, it was observed that histograms of flame regions are very rough (Fig. 4). Also, it was found that the intensities of these regions range over almost the whole histogram, while for non-flame objects these intensities are more centered on some specific intensity bins and have a smaller range. The HR focuses on these two findings. As Eq. 5 shows, the HR equals the mean range of the histogram multiplied by the average disorder over all the non-zero bins. This average disorder is also calculated by extrema analysis and is the indicator of the histogram roughness over time:

$$HR = \frac{\operatorname{range}(H)}{N} \times \frac{|\operatorname{extrema}_{bins \neq \emptyset}(H)|}{N/2} \quad (5)$$
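The sketch below illustrates the extrema counting shared by the BBD and POD features of Eqs. (3) and (4). The smoothing window size and the function names are our own assumptions; the paper only states that small differences are filtered out before counting.

```python
import numpy as np

def count_extrema(series, smooth=3):
    """Count local maxima and minima of a 1-D series after a small moving
    average, so that tiny fluctuations are not counted as extrema."""
    s = np.convolve(series, np.ones(smooth) / smooth, mode='valid')
    d = np.sign(np.diff(s))
    d = d[d != 0]                       # ignore flat segments
    return int(np.sum(d[1:] != d[:-1])) # sign changes = extrema

def bbd(widths, heights):
    """Bounding box disorder, Eq. (3), over N frames of box measurements."""
    N = len(widths)
    return (count_extrema(widths) + count_extrema(heights)) / N

def pod(angles):
    """Principal orientation disorder, Eq. (4)."""
    return count_extrema(angles) / (len(angles) / 2.0)
```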
Fig. 4. Histogram analysis of flames (a) and moving person (b)
3.4 Moving Object Analysis: Visual Flame Features
The set of visual flame features, to analyze moving objects in ordinary video, also consists of three features. Two of them, i.e. BBD and POD, are equal to their LWIR equivalents. The other feature is the spatial flame color disorder, which focuses on the color characteristics of the object.
Spatial Flame Color Disorder (SFCD): Based on our experiments and the work of others [6], it is reasonable to assume that the colors of flames belong to the red-yellow color range. Furthermore, experiments showed that the flame color does not remain steady, i.e. flames are composed of several varying colors. The SFCD focuses on both color-related aspects of flames in order to eliminate non fire-colored objects and ordinary fire-colored objects with a solid flame color. As defined in Eq. 6, the SFCD is calculated as the product of the percentage of flame pixels and the spatial disorder in the object region R. The percentage of flame pixels equals the ratio of the number of pixels #_{Y-R}(R_H) with a hue value in the yellow-red range to the total number of pixels #_{pixels}(R). The spatial disorder equals the ratio of the average range over all sets Ω_I([x, y]) of 4-neighboring pixels' intensities to the overall range of the pixels' intensities in the object region. Objects with a SFCD close to 1 will most probably represent flames, while ordinary moving objects will have a SFCD close to 0.

$$\Omega([x, y]) = \{[x, y-1], [x-1, y], [x, y], [x+1, y], [x, y+1]\}$$

$$SFCD = \frac{\#_{Y-R}(R_H)}{\#_{pixels}(R)} \times \frac{\overline{\operatorname{range}}(\Omega_I([x, y]))}{\operatorname{range}(R_I) + \epsilon} \quad (6)$$
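A sketch of the SFCD computation of Eq. (6) is given below. The numeric yellow-red hue interval is our own assumption (the paper only says "red-yellow color range"), as are the function name and the per-pixel loop.

```python
import numpy as np

def sfcd(hue, intensity, mask, eps=1e-6, hue_lo=0.0, hue_hi=0.17):
    """Spatial flame color disorder, Eq. (6). `mask` selects the object
    region R; [hue_lo, hue_hi] approximates the yellow-red range on a
    0-1 hue wheel."""
    flame_ratio = np.mean((hue[mask] >= hue_lo) & (hue[mask] <= hue_hi))
    H, W = intensity.shape
    ys, xs = np.where(mask)
    ranges = []
    for y, x in zip(ys, xs):
        # 4-neighbourhood Omega([x, y]), clipped at the image borders
        nb = [(y, x)] + [(y + dy, x + dx)
                         for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))]
        vals = [intensity[j, i] for j, i in nb if 0 <= j < H and 0 <= i < W]
        ranges.append(max(vals) - min(vals))
    spatial = np.mean(ranges) / (np.ptp(intensity[mask]) + eps)
    return flame_ratio * spatial
```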
3.5 Global Classifier
Each of the proposed visual and LWIR flame features possesses a probability between 0 and 1, indicating whether the object has the flame characteristic. By averaging these LWIR and visual flame probabilities (Eq. 7), the probabilities P_LWIR(Flames) and P_video(Flames) are retrieved, which indicate whether the object should be classified as flames in the respective spectral range. The global classifier combines these two probabilities, using Eq. 8, into an overall flame
probability P(Flames). The parameter β in this equation is a constant that specifies how much of P_LWIR(Flames) and P_video(Flames) must be taken into account in the overall probability calculation. Depending on the circumstances, e.g. night or day, an appropriate β value can be chosen. At the end, the overall probability P(Flames) is compared to an alarm threshold t_fire. If the flame probability is high enough, a fire alarm is raised. In our experiments it was found that a good value for t_fire is 0.7.

$$P_{LWIR}(Flames) = \frac{BBD + POD + HR}{3}, \qquad P_{video}(Flames) = \frac{BBD + POD + SFCD}{3} \quad (7)$$

$$P(Flames) = \beta \cdot P_{LWIR}(Flames) + (1 - \beta) \cdot P_{video}(Flames) \quad (8)$$
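The whole fusion step of Eqs. (7) and (8) fits in a few lines, as sketched below; the function name and tuple-based interface are our own assumptions.

```python
def flame_probability(lwir_feats, video_feats, beta=0.6, t_fire=0.7):
    """Global classifier of Eqs. (7)-(8). lwir_feats = (BBD, POD, HR) of a
    hot LWIR object; video_feats = (BBD, POD, SFCD) of the corresponding
    moving object in ordinary video. Returns the overall probability and
    the alarm decision."""
    p_lwir = sum(lwir_feats) / 3.0            # Eq. (7), LWIR branch
    p_video = sum(video_feats) / 3.0          # Eq. (7), visual branch
    p = beta * p_lwir + (1 - beta) * p_video  # Eq. (8)
    return p, p >= t_fire
```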
4 Experimental Results
The proposed multi-sensor flame detector was tested with a Xenics Gobi-384 LWIR camera [7] and a Linksys WVC2300 camera, which work in the 8–14 μm spectral range and the visible spectrum, respectively. Using the Xenics Xeneth software we were able to extract appropriate grayscale video images from the thermal imaging camera. These images were then further analyzed by our own LWIR detection algorithm written in Matlab, extended with some add-ons for extrema detection and histogram analysis. The images from the Linksys camera, which were already in the appropriate MPEG-4 video format, were analyzed by the video-based flame detector, also written in Matlab. At the end, the probabilities of both detections were combined using the global classifier (Eq. 8, β=0.6), and a final indication about the presence of flames is given. The video and LWIR images in Fig. 5 are some exemplary frames of the fire and non-fire realistic video sequences which were captured to test the proposed multi-sensor flame detection algorithm. Table 1 summarizes the detection results for all the tested sequences. As the results in the table indicate, the proposed algorithm yields good detection results. For uncontrolled fires, e.g. burning paper, the flame detection rate is higher than 90% and for controlled fires, e.g. a bunsen burner, it is around 75%. Furthermore, the number of false detections is very low. By further analyzing the detection results, e.g. by temporal median filtering, it is expected that this number of false alarms can be reduced even further. However, to confirm all these findings a more thorough evaluation and comparison with other work [2-4] is needed. This will be part of our future work. Changing the β value in Eq. 8 to 0 or 1 transforms the multi-sensor detector into a standalone LWIR or video fire detector, respectively. Comparison of the results for these standalone detectors with the results of our multi-sensor approach has confirmed our hypothesis that the fused detector performs remarkably better than either sensor alone.
Table 1. Performance evaluation of multi-sensor flame detector
Video sequence (# frames)                  | # fire frames | # detected fire frames | mean P(flames) | # false detections | flame detection rate*
Attic1 (337) Fire → burning paper          | 264           | 259                    | 0.92           | 6                  | 0.96
Attic2 (2123) Fire / Moving people         | 1461          | 1352                   | 0.84           | 19                 | 0.91
Attic3 (886) Moving people                 | 0             | 5                      | 0.22           | 5                  | -
Lab (115) Fire → bunsen burner             | 98            | 74                     | 0.77           | 0                  | 0.75
Corridor1 (622) Moving people              | 0             | 0                      | 0.19           | 0                  | -
Corridor2 (184) Moving people / Hot object | 0             | 3                      | 0.28           | 3                  | -
* detection rate = (# detected fire frames - # false alarms) / # fire frames
Fig. 5. Test sequences in ordinary video (a,c,e,g) and LWIR (b,d,f,h): Attic2, Lab, Corridor2, and Attic3
5 Conclusions
To detect the presence of fire at an early stage, the proposed multi-sensor fire detector fuses visual and non-visual flame features from moving (hot) objects in ordinary video and long wave infrared (LWIR) thermal images. Extraction of the moving objects is based on dynamic background subtraction. Additionally, LWIR moving objects are further filtered by histogram-based hot object segmentation, such that only the hottest moving objects remain. These hot and moving objects are then further analyzed using a set of flame features which focus on
the distinctive geometric, temporal and spatial disorder characteristics of flame regions. By combining the probabilities of the bounding box disorder (BBD), the principal orientation disorder (POD), and the histogram roughness of hot moving objects in LWIR, a LWIR flame probability is calculated. Analogously, the probabilities of the spatial flame color disorder, the BBD, and the POD of moving objects in ordinary video are combined into a video flame probability. At the end, both the LWIR and the video flame probability are combined into a multi-sensor flame probability, which gives an indication about the presence of flames. If this indication is high enough, a fire alarm is given. Experimental results have shown that the proposed multi-sensor flame detector already yields good results, but further testing on a broader range of video sequences and comparison with existing work is necessary for a more adequate performance evaluation. At the moment, however, we can already say that multi-sensor fire analysis is very promising.
Acknowledgements. The research activities as described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), University College West Flanders, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.
References
1. Verstockt, S., Merci, B., Sette, B., Lambert, P., Van de Walle, R.: State of the art in vision-based fire and smoke detection. In: Proc. 14th International Conference on Automatic Fire Detection, pp. 285–292 (2009)
2. Owrutsky, J.C., Steinhurst, D.A., Minor, C.P., Rose-Pehrsson, S.L., Williams, F.W., Gottuk, D.T.: Long Wavelength Video Detection of Fire in Ship Compartments. Fire Safety Journal 41, 315–320 (2006)
3. Toreyin, B.U., Cinbis, R.G., Dedeoglu, Y., Cetin, A.E.: Fire Detection in Infrared Video Using Wavelet Analysis. SPIE Optical Engineering 46, 067204, 1–9 (2007)
4. Bosch, I., Gomez, S., Molina, R.: Object discrimination by infrared image processing. In: Mira, J., Ferrández, J.M., Álvarez, J.R., de la Paz, F., Toledo, F.J. (eds.) IWINAC 2009. LNCS, vol. 5602, pp. 30–40. Springer, Heidelberg (2009)
5. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 9, 62–66 (1979)
6. Chen, T.-H., Wu, P.-H., Chiou, Y.-C.: An early fire-detection method based on image processing. In: Proc. International Conference on Image Processing, pp. 1707–1710 (2004)
7. Xenics Infrared Solutions: Situational Firefighting, http://www.xenics.com/
Turbo Code Using Adaptive Puncturing for Pixel Domain Distributed Video Coding
Mohamed Haj Taieb¹, Jean-Yves Chouinard¹, Demin Wang², and Khaled Loukhaoukha¹
¹ Laval University, Quebec, QC, G1K 7P4 Canada
² Communications Research Centre Canada, Ottawa, ON, K2H 8S2 Canada
[email protected], [email protected], [email protected], [email protected]
Abstract. Distributed video coding is a research field which brings together error correction coding techniques and video compression ones. Its core part is a Slepian-Wolf encoder, which often involves turbo codes because of their strong error correction capabilities. The turbo encoder generates parity bits which are sent to refine the side information constructed at the decoder by interpolation using the neighboring key frames already received. For bit rate flexibility, these parity bits are punctured and sent gradually upon request until the receiver can correctly decode the frame. In this work, we introduce a novel distributed video coding scheme with adaptive puncturing that sends more parity bits when the virtual channel is noisy and fewer parity bits when it is not. Considerable compression performance improvement over the puncturing techniques used in the literature is reported.
Keywords: Distributed video coding, Rate compatible punctured Turbo code, Log-MAP decoding.
1 Introduction
Digital video coding standards have been steadily changing in order to achieve high compression performance using sophisticated and increasingly complex techniques for accurate motion estimation and motion compensation. Those techniques are executed at the encoder, which needs to be adapted to the computationally demanding video coding task. The decoder, on the other hand, can easily reconstruct a video sequence by exploiting the motion vectors computed at the encoder. This task repartition is well suited to the common video transfer applications such as broadcasting and video streaming, where the encoder benefits from high computational power to compress the video sequence just once and then send it to many computationally limited low cost devices. However, since the emergence of locally distributed wireless surveillance cameras, the growth of cellular interactive video utilities, as well as many other applications involving several low cost video encoders at the expense of a high complexity central decoder, the traditional video encoding standard (e.g. H.264/AVC standard [1]) has been revised and the task repartition reversed.
The Slepian and Wolf information theoretical result on lossless coding for correlated distributed sources [2] and its extension to the lossy source coding case with side information at the decoder, through the results of Wyner and Ziv [3], constitute the theoretical basis of distributed source coding. These theoretical results have given birth to a wide field of applications such as the recent distributed video coding paradigm, established a few years ago ([4], [5]). The compression complexity is transferred from the encoder to the decoder, which becomes responsible for motion estimation and compensation by generating the side information. Since its recent establishment, the distributed video coding paradigm has attracted many research groups and has been intensively studied. Several directions are pursued in order to achieve better rate versus distortion performance. Some of those directions are: the improvement of side information interpolation [6], correlation noise modeling [7] and the improvement of the reconstruction algorithm [8]. Nevertheless, little or no work has been conducted involving the channel coding part of a distributed video codec. Starting from the Wyner-Ziv video codec introduced in [5], we propose a novel puncturing technique which optimizes the effect of the sent parity bits. In the DVC architecture presented in [5], the number of parity bits sent over a period of 8 symbols (pixels) is the same across the whole frame. The main idea of the puncturing technique proposed in this work is to send parity bits where they are the most needed and to avoid wasting parity bits where the side information is already sufficient and performing well. We propose here a modified Wyner-Ziv codec which estimates when parity bits should be punctured and when they should be sent. In Section 2, we describe the Wyner-Ziv and in particular the Slepian-Wolf video codec used in [5]. In Section 3, we describe in detail the proposed Wyner-Ziv codec ensuring adaptive puncturing. In Section 4, simulation results are presented to show the performance improvement of the proposed encoder. Finally, we conclude in Section 5.
2 Wyner-Ziv and Slepian-Wolf Codec
The architecture of a distributed video coding system is depicted in figure 1. The odd-numbered frames (intra frames) are considered perfectly reconstructed at the decoder and will be used to generate the side information to decode the even frames (inter frames). The inter frames are fed to a 16-level scalar quantizer and then to a turbo encoder consisting of the two constituent rate 4/5 recursive convolution encoders shown in figure 2. The generator polynomials in octal form of these encoders are listed in Table 1, where h0(D) represents the feedback connection polynomial. At each clock pulse, the 4 bits of a given quantized pixel are fed to the two rate 4/5 constituent recursive convolution encoders, separated by a pseudo-random interleaver. Each of the two RSCs (recursive systematic coders) associates a parity bit to a quantized pixel. The parity bits are stored in a buffer and sent gradually to the decoder upon request. To ensure compression, the systematic bits are discarded since the decoder has already an interpolated
Table 1. Generator polynomials in octal form of the constituent encoders

k:      0   1   2   3   4
h_k(D): 23  35  31  37  27
Fig. 1. Wyner-Ziv video codec

Fig. 2. Rate 4/5 recursive convolution encoder with parity polynomials (23, 35, 31, 37, 27) in octal form
version of the even frames, which can replace these bits in the turbo decoding process. The latter involves the computation of branch metrics α_k(s), β_k(s) and γ_k(s′, s) relative to a transition in the decoding trellis between the states s′ and s. More details can be found in [9]; we explain here the role of the parity bits. In this coding scheme, the systematic bits are discarded. The computation of the trellis branch metric γ_k(s′, s) is as follows:

$$\gamma_k(s', s) = \underbrace{P(x_k)}_{\substack{\text{extrinsic information} \\ \text{from the other decoder}}} \times \underbrace{P(y_k^x = SI \mid x_k)}_{\text{channel likelihood}} \times \underbrace{P(y_k^p \mid p_k)}_{\text{parity bit likelihood}} \quad (1)$$
where SI is the side information playing the role of the discarded systematic bit ykx , xk is the input symbol and ykp its associated parity bit. pk is the parity bit effectively sent by the encoder to refine the side information and is evaluated as:
$$P(y_k^p \mid p_k) = \begin{cases} 1, & y_k^p = p_k \quad (p_k \text{ sent}) \\ 0, & y_k^p \neq p_k \quad (p_k \text{ sent}) \\ \frac{1}{16}, & (p_k \text{ punctured}) \end{cases} \quad (2)$$
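The sketch below shows how the parity-bit likelihood of Eq. (2) plugs into the branch metric of Eq. (1). The function names and the convention of passing `None` for a punctured bit are our own assumptions.

```python
def parity_likelihood(y_p, p_sent):
    """Parity-bit likelihood P(y^p_k | p_k) of Eq. (2): deterministic when
    the parity bit was transmitted, uninformative (1/16, uniform over the
    16 input symbols) when it was punctured (p_sent is None)."""
    if p_sent is None:
        return 1.0 / 16.0
    return 1.0 if y_p == p_sent else 0.0

def branch_metric(p_xk, p_channel, y_p, p_sent):
    """Branch metric gamma_k(s', s) of Eq. (1): extrinsic probability of
    the input symbol, times the side-information channel likelihood,
    times the parity-bit likelihood."""
    return p_xk * p_channel * parity_likelihood(y_p, p_sent)
```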
To further demonstrate the role of parity bits, we present in figure 3 two transitions of the 16-state trellis of the recursive convolutional encoder described in figure 2. By observing this trellis, one notes that a received parity bit reduces the number of potential input symbols by half. For instance, if the received parity bit is 0, only the links with parity bit equal to 0 are kept and the links with parity bit equal to 1 are removed. This is more useful when the received systematic bit is actually corrupted than when it is correct; that is, in the DVC context, when the side information generation fails instead of when it succeeds.
Fig. 3. Part of the 16 states trellis of the RSC described in figure 2
The puncturing scheme used in [5] has a pattern period of 8 parity bits. This means that for each group of 8 pixels, a given number of parity bits, for instance np1 for the first decoder and np2 for the second decoder, are sent. Thus the overall number of parity bits is nb_total = (np1 + np2) × 25344/8, where 25344 (144 rows and 176 columns) is the total number of pixels in a frame. If the decoder fails with this amount of parity bits, a request is sent to the encoder: it will send an additional number of bits equal to nb_added = 25344/8 = 3168. This scheme does not allow for finer rate control. For this reason, it is more suitable to compare the performance of the adaptive puncturing technique introduced in
this work to a random puncturing pattern over the whole frame. Therefore, the total number of parity bits will be more adjustable.
3 Wyner-Ziv Coding with Adaptive Puncturing
The behavior of typical propagation channels is random and unpredictable, such that puncturing cannot be adjusted according to their time variations. However, in the context of distributed video coding, the channel behavior can somehow be estimated at the decoder. By observing the neighboring key frames, the decoder can assess approximately the regions in the frame where the interpolation is more likely to fail: if there is a large difference between the value of a pixel in the previous key frame and in the following key frame, interpolation will probably fail. Therefore, the decoder can get an estimate of the noise variations in the DVC virtual channel. This information should be sent back to the encoder, so that it will not puncture the parity bits relative to the noisy pixels. In figure 4, we propose a new architecture of distributed video coding allowing the minimization of the feedback channel solicitation by which the decoder informs the encoder about the estimated noisy pixels.
Fig. 4. Proposed adaptive puncturing Wyner-Ziv codec
As a matter of fact, failure of the interpolation process usually occurs in localized clusters, as can be seen in figure 5. This figure represents the difference between the quantized pixels of the original frame and the quantized pixels of the interpolated frame. For the first WZ frame this difference ranges from 0 to 6, as shown in the color bar of figure 5. The adaptive puncturing scheme consists of partitioning the frame into a number of regions, estimating the noise value in each region, and then sending back the number of parity bits required for each region. For instance, with a puncturing period of 8 = 2³ parity bits, as used in [5], and a frame divided into 44 regions, the decoder needs to send back only 3 × 44 = 132 bits through the feedback channel. This
(Figure content: error positions in the first interpolated frame, overlaid with the region indices 1–44; total number of errors: 4261.)
Fig. 5. Frame partitioning into 44 regions, showing the error clusters
scheme also reduces considerably the number of parity bit requests since an initial estimate of the needed parity bits is sent to the encoder.
4 Simulations and Discussion
In Table 2, we present the parity bit distribution used for the first WZ frame of the video sequence Carphone. The decoder will only send back the number of parity bits required by the first decoder, np1 ∈ {1, …, 8}, given by:

$$n_{p1}(\text{region} = r) = \min\left(\left\lceil \operatorname{mean}_r\left(\frac{|X_{2i-1} - X_{2i+1}|}{2}\right) \right\rceil, 8\right) \quad (3)$$

where r ∈ {1, …, 44} and the mean of the absolute value of the pixel differences is taken over the region r (16 × 36 pixels). Then, the parity bits for the second decoder can be deduced according to this relation:

$$n_{p2} = \begin{cases} \min(n_{p1} + 1, 8) & \text{if } n_{p1} \geq 4 \\ \max(n_{p1} - 2, 0) & \text{otherwise} \end{cases} \quad (4)$$
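The following sketch implements this per-region allocation under our reading of the geometry: a 144×176 frame split into a 4×11 grid of 36×16-pixel regions. The function name, the region grid layout, and the lower clamp of np1 to 1 (to honor the stated range np1 ∈ {1, …, 8}) are our own assumptions.

```python
import math
import numpy as np

def allocate_parity_bits(key_prev, key_next, bh=36, bw=16):
    """Per-region parity allocation of Eqs. (3)-(4). Returns per-region
    (np1, np2) parity-bit counts per 8-pixel puncturing period."""
    diff = np.abs(key_prev.astype(float) - key_next.astype(float)) / 2.0
    rows, cols = diff.shape[0] // bh, diff.shape[1] // bw
    np1 = np.empty((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            region = diff[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            np1[r, c] = min(max(math.ceil(region.mean()), 1), 8)  # Eq. (3)
    np2 = np.where(np1 >= 4, np.minimum(np1 + 1, 8),
                   np.maximum(np1 - 2, 0))                        # Eq. (4)
    return np1, np2
```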
We notice from Table 2 that the parity bits sent by the first encoder are more dispersed than those sent by the second encoder. The pseudo-random interleaver inside the turbo encoder ensures the dispersal of the parity bits in the second decoder. To demonstrate the impact of concentrating the parity bits in the frame's noisy regions, we show in figure 6 the convergence of the turbo code using adaptive puncturing versus random puncturing for the same total number of sent parity bits. This figure shows clearly that adaptive puncturing performs better than random puncturing. With the adaptive puncturing scheme, the parity bits are
Table 2. Number of requested parity bits across the 44 regions of figure 5 for the 2 RSC encoders

First encoder:
2 5 5 5 | 2 5 4 4 | 2 4 5 3 | 2 2 5 4 4 5 2 2 | 3 3 3 3 7 4 5 5 5 3 7 5 | 8 8 5 2 | 7 8 8 2 | 2 7 8 4

Second encoder (deinterleaved order):
0 0 0 0 0 1 1 1 8 8 0 6 6 5 6 5 1 8 5 8 8 8 6 5 6 5 6 6 6 6 6 8 8 6 5 1 0 0 1 8 6 0 0 5
(Fig. 6 annotations. Random puncturing, turbo decoding divergence: Iter 1: 2815 errors; Iter 2: 2431; Iter 3: 2512; Iter 4: 2573. Adaptive puncturing, turbo decoding convergence: Iter 1: 3182 errors; Iter 8: 1578; Iter 12: 970; Iter 20: 2.)
Fig. 6. Turbo decoding process of the first WZ frame using random puncturing and adaptive puncturing with the same number (27288) of parity bits
mostly directed to regions 9, 10, 20 and 21, where the interpolation has difficulties. Fewer parity bits remain for the other (cleaner) regions to correct the few remaining scattered errors and to maintain some robustness across the whole trellis during the log-MAP decoding process. For the random puncturing scheme, where the parity bits are spread randomly across the trellis, the number of errors decreases rapidly during the first two iterations. But as soon as the isolated errors are corrected, using more parity bits than actually needed, the decoder can no longer combat the error clusters with the few remaining parity bits, and begins diverging at the third iteration. In figure 7, we compare the number of parity bits needed to make the turbo decoder converge for the three different puncturing schemes discussed above: the random puncturing where the parity bits are sent randomly across the frame, the puncturing used in [5] where the parity bits are sent progressively according to a fixed pattern, and finally the adaptive puncturing presented in this paper where
Fig. 7. Number of parity bits sent for each WZ frame of the Carphone sequence to ensure turbo decoding convergence

Table 3. Parity bit rate for the WZ frames of the Carphone sequence

Method               Rate in kbps
Aaron et al. [5]     401.07
Random puncturing    388.72
Adaptive puncturing  360.78
the parity bits are directed towards the noisy regions. The proposed adaptive puncturing scheme offers a significant rate reduction, as shown in Table 3. The WZ frame rate is 15 frames per second. Note that the 132 bits sent back by the decoder have been taken into account.
5 Conclusion
In this paper, we propose a distributed video codec scheme that supports adaptive puncturing. This scheme is based on the argument that redirecting the parity bits to where they are the most effective will improve the compression results. Simulation results demonstrate that lower bitrates are achieved. This puncturing technique is restricted to architectures where the decoder can approximately predict the location of the noisy regions and can send this information to the encoder with as few bits as possible: this is indeed the case for the distributed video coding architecture.
Acknowledgments

The authors are greatly indebted to André Vincent and Grégory Huchet from the Advanced Video Systems Group at the Communication Research Centre in Ottawa, for their expertise and support for this work.
References
1. Richardson, I.E.G.: H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. Wiley Interscience, Hoboken (2003)
2. Slepian, D., Wolf, J.K.: Noiseless coding of correlated information sources. IEEE Transactions on Information Theory 19, 471–480 (1973)
3. Wyner, A.D., Ziv, J.: The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory (January 1976)
4. Girod, B., Aaron, A., Rane, S., Monedero, D.R.: Distributed video coding. Proceedings IEEE, Special Issue on Advances in Video Coding and Delivery (January 2005)
5. Aaron, A., Zhang, R., Girod, B.: Wyner-Ziv coding of motion video. In: Proc. Asilomar Conference on Signals and Systems, Pacific Grove, CA, USA (November 2002)
6. Ascenso, J., Brites, C., Pereira, F.: Improving Frame Interpolation with Spatial Motion Smoothing for Pixel Domain Distributed Video Coding. In: 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Slovak Republic (July 2005)
7. Brites, C., Pereira, F.: Correlation Noise Modeling for Efficient Pixel and Transform Domain Wyner–Ziv Video Coding. IEEE Transactions on Circuits and Systems for Video Technology, 1177–1190 (September 2008)
8. Weerakkody, W., Fernando, W., Kondoz, A.: Enhanced reconstruction algorithm for unidirectional distributed video coding. IET Image Processing 3(6), 329–334 (2009)
9. Avudainayagam, A., Shea, J.M., Wu, D.: A Hyper-Trellis based Turbo Decoder for Wyner-Ziv Video Coding. In: Proceedings IEEE GLOBECOM (2005)
Error-Resilient and Error Concealment 3-D SPIHT Video Coding with Added Redundancy Jie Zhu, R.M. Dansereau, and Aysegul Cuhadar Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada {jzhu,rdanse}@sce.carleton.ca
Abstract. In this paper, we propose a multiple description (MD) video coding algorithm based on error-resilient and error concealment set partitioning in hierarchical trees (ERC-SPIHT). Compared to ERC-SPIHT, the novelty of the proposed approach is the injection of additional redundancy into the substreams so that the coefficients in the spatial root subband are highly protected from transmission errors. Experimental results on different video sequences show that the proposed method maintains error-resilience with high coding efficiency. In particular, our results demonstrate that the proposed algorithm achieves a significant improvement in video quality, by up to 2.24 dB, in the presence of a substream loss compared to ERC-SPIHT. Keywords: multiple description (MD) coding, set partitioning in hierarchical trees (SPIHT), error-resilience, error concealment.
1 Introduction
The set partitioning in hierarchical trees (SPIHT) algorithm has made great strides in image and video coding [1]. It provides outstanding rate-distortion (R-D) performance with low computational complexity. However, SPIHT-coded bitstreams are extremely sensitive to transmission errors. To remedy this problem, it is important to incorporate error-resilience into the SPIHT encoding algorithm for robust video transmission over error-prone networks. Various error control mechanisms have been proposed to provide error-resilience for video coders [2]. One common approach is to request retransmission of any corrupted information from the receiver. However, retransmission may not be acceptable for real-time video applications due to long round-trip delays. Another popular error-resilience method is forward error correction (FEC), which is designed to add channel redundancy to the bitstream. However, FEC is ineffective in packet-based networks when a large number of packets are lost in clusters and such loss exceeds the recovery capability of the FEC codes [3]. Multiple description (MD) video coding has recently attracted attention as a promising approach for error-resilient video transmission [4]. Among existing MD
This research was partially funded by Nortel and the Natural Sciences and Engineering Research Council (NSERC) of Canada.
video coding schemes, several domain-partitioning based algorithms using three-dimensional (3-D) SPIHT have been shown to be effective in achieving error resilience, such as spatial and temporal tree preserving 3-D SPIHT (STTP-SPIHT) [5] and error resilient and error concealment 3-D SPIHT (ERC-SPIHT) [6]. In these algorithms, the wavelet coefficients are divided into multiple groups, and each group is independently encoded with the 3-D SPIHT algorithm and then separately transmitted over the network. The error-resilience of a domain-partitioning based MD video coder can be further improved when the coder is supplemented with an error concealment technique at the decoder. It has been demonstrated that even simple error concealment, such as bilinear interpolation (each missing coefficient is estimated by the average value of its four adjacent neighbors), can improve the R-D performance when combined with STTP-SPIHT and ERC-SPIHT [7]. More specifically, ERC-SPIHT provides better error concealment performance than STTP-SPIHT because ERC-SPIHT uses dispersive grouping, such that spatially adjacent coefficients are spread onto different substreams [7]. Although error concealment can be used in conjunction with STTP-SPIHT and ERC-SPIHT, reduced presentation quality may still result. Small relative errors in the estimation of missing coefficients in the spatial root subband (i.e., the approximation subband on the highest spatial decomposition level) can have a significant impact on the overall distortion. An elegant solution for a video coder to reduce such impact is to inject additional redundancy into the substreams so that the missing coefficients in the spatial root subband can be recovered with a high level of accuracy. By additional redundancy, we mean redundancy that is added intentionally to the signal, as opposed to the redundancy inherently present in the signal. The key challenge in developing such a video coder is how to balance the tradeoff between the source coding efficiency at the encoder and the error concealment performance at the decoder. In this paper, we propose an error-resilient MD video coding framework for ERC-SPIHT-coded video sent across error-prone networks. The framework is an improvement of our earlier related scheme [8]. Although the work in [8] allows an important quality improvement, it is restricted to a specific scenario (i.e., only four substreams are created and only one of the substreams is allowed to be lost due to noisy channels). In this paper, we generalize the scheme to the generation of an arbitrary number S of substreams and extend the recovery of missing coefficients. The rest of this paper is organized as follows. In Section 2, we present our error-resilient video coding framework. We discuss the simulation results in Section 3 and draw conclusions in Section 4.
2 Proposed Multiple Description Video Coding Algorithm With 3-D SPIHT
Motivated by the approach in [8], we now propose a new error-resilient MD video coding scheme. The overall framework of the proposed coding algorithm is shown in Fig. 1, where two new functions are added to the ERC-SPIHT video coding system [7], namely additional redundancy generation at the encoder and
root subband recovery at the decoder. In our implementation, we use the same order of spatial and temporal transformation as in [7]. In addition, we adopt the 3-D SPIHT with asymmetric tree structure (AT-SPIHT) in our work since it exhibits better R-D performance than its counterpart, the 3-D SPIHT with symmetric tree structure [7].
[Fig. 1 block diagram: Encoder: Spatial Decomposition, Additional Redundancy Generation, Temporal Decomposition, Domain Partitioning, 3-D SPIHT Encoder, leading into the noisy channels. Decoder: 3-D SPIHT Decoder, Temporal Reconstruction, Root Subband Recovery, Error Concealment on Root+High Subband, Spatial Reconstruction.]
Fig. 1. Framework of the proposed video coding algorithm
2.1 Additional Redundancy Generation and Encoding
After a group of frames (GOF) is spatially transformed into the 2-D wavelet transform domain, the next step is to generate the additional redundancy. In essence, the additional redundancy generation operates as illustrated in Fig. 2(a). To simplify the illustration, we assume there are N frames in the GOF and the frames are spatially transformed in one level. Symbol $^{2D}W$ denotes the wavelet coefficients resulting from the spatial transform and symbol $^{2D}W^{SR}$, shown by the shaded area, represents the spatial root subband. The additional redundancy is obtained by spatially decomposing each frame of the spatial root subband $^{2D}W^{SR}$ by one more level. Then, the resulting coefficients in the LL (low-low) subband (i.e., approximation coefficients in the lowest frequency subband) are grouped together as the additional redundancy, denoted by $^{2D}R$. Note that the wavelet transform chosen at this step can be different from the one used for decomposing the GOF into the 3-D wavelet domain. In this work, we employ the Haar wavelet transform. One reason to choose the Haar wavelet is that the computation is much simpler. Another advantage of the Haar wavelet is that it is a 2-tap filter. As a result, each element in $^{2D}R$ is only correlated to a block of $2 \times 2$ adjacent wavelet coefficients in $^{2D}W^{SR}$. Recall that the coefficients of the Haar low-pass filter are $(1/\sqrt{2}, 1/\sqrt{2})$ and the coefficients of the high-pass filter are $(1/\sqrt{2}, -1/\sqrt{2})$. Therefore,
let $^{2D}R(x, y, t)$ and $^{2D}W(x, y, t)$ denote the coefficient at the spatial location $(x, y)$ of frame $t$ in $^{2D}R$ and $^{2D}W$, respectively. We find that

$$^{2D}R(x, y, t) = \{^{2D}W(2x-1, 2y-1, t) + {}^{2D}W(2x-1, 2y, t) + {}^{2D}W(2x, 2y-1, t) + {}^{2D}W(2x, 2y, t)\}/2. \qquad (1)$$
[Fig. 2 residue: (a) frame $t$ of the spatially transformed GOF $^{2D}W$, its spatial root subband $^{2D}W^{SR}$ (shaded), and the derived additional redundancy $^{2D}R$; (b) Haar decomposition of the root subband into LL, HL, LH and HH subbands.]
Fig. 2. Relationship between original video sequence and additional redundancy
Next, $^{2D}W$ and $^{2D}R$ are both temporally transformed to the 3-D wavelet domain and the resulting coefficients are denoted by $^{3D}W$ and $^{3D}R$, respectively. The above relationship is still valid for $^{3D}W$ and $^{3D}R$ such that

$$^{3D}R(x, y, t) = \{^{3D}W(2x-1, 2y-1, t) + {}^{3D}W(2x-1, 2y, t) + {}^{3D}W(2x, 2y-1, t) + {}^{3D}W(2x, 2y, t)\}/2, \qquad (2)$$
where $^{3D}R(x, y, t)$ and $^{3D}W(x, y, t)$ represent the coefficient at the spatial location $(x, y)$ of frame $t$ in $^{3D}R$ and $^{3D}W$, respectively. This relation holds true since the Haar wavelet transform is a separable transform. The issue is now how to encode $^{3D}R$ efficiently and resiliently. An effective solution is to distribute the additional redundancy $^{3D}R$ over different substreams. We assume that the coefficients in $^{3D}W$ are divided into $S$ groups by dispersive grouping. Similarly, we use dispersive grouping to organize the coefficients in $^{3D}R$ into $S$ groups. Groups $^{3D}W_i$ and $^{3D}R_i$ are then independently encoded with AT-SPIHT to produce substream-$i$, where $i = 1, \ldots, S$. Recall that in our previous work [8], the additional redundancy $^{3D}R$ is comprised of the average value of four coefficients with the same location in the spatial root subband of groups $^{3D}W_1$, $^{3D}W_2$, $^{3D}W_3$ and $^{3D}W_4$. When dispersive grouping is applied, we had

$$^{3D}R(x, y, t) = \{^{3D}W(2x-1, 2y-1, t) + {}^{3D}W(2x-1, 2y, t) + {}^{3D}W(2x, 2y-1, t) + {}^{3D}W(2x, 2y, t)\}/4. \qquad (3)$$
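The block-sum construction of eqs. (1)-(2) is easy to sketch in Python/NumPy. The following is an illustrative reading of the equations, not the authors' implementation; the array layout (frame index first, even spatial dimensions) is an assumption.

```python
import numpy as np

def additional_redundancy(W):
    """Eq. (1): each element of R is the Haar LL coefficient of a 2x2 block
    of the spatial root subband, i.e. the sum of the four neighbours
    W(2x-1,2y-1), W(2x-1,2y), W(2x,2y-1), W(2x,2y) divided by 2.

    W -- array of shape (T, H, W) holding the root subband of a GOF.
    """
    return (W[:, 0::2, 0::2] + W[:, 0::2, 1::2] +
            W[:, 1::2, 0::2] + W[:, 1::2, 1::2]) / 2.0

root = np.random.randn(16, 44, 36)   # toy GOF of root-subband frames
R = additional_redundancy(root)      # shape (16, 22, 18)
```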
Compared to the proposed video coder, a smaller amount of additional redundancy is inserted among the substreams in the previous work [8], since each element in $^{3D}R$ is only half the magnitude of the one in (2).
2.2 Reconstruction of Lost Wavelet Coefficients
We now describe how the proposed root subband recovery algorithm can be used with other 2-D error concealment schemes in each frame. We define $^{3D}W'$ and $^{3D}R'$ as the coefficients decoded from the received substreams. We assume that when an error induced by the noisy channels occurs in any substream, all coefficients in this substream are set to zero. Then, the reverse temporal transform is applied on $^{3D}W'$ and $^{3D}R'$ separately and the resulting frames are denoted by $^{2D}W'$ and $^{2D}R'$. The issue now is how to reconstruct the missing coefficients in $^{2D}W'$. In this work, we focus on the reconstruction of the missing coefficients in the spatial root subband. When the additional redundancy is generated, each element in $^{2D}R$ is only correlated to a block of $2 \times 2$ adjacent wavelet coefficients in $^{2D}W$. As a result, root subband recovery is implemented in each block of $2 \times 2$ adjacent wavelet coefficients. For notational convenience, let us denote the coefficients in the same block, $^{2D}W(2x-1, 2y-1, t)$, $^{2D}W(2x-1, 2y, t)$, $^{2D}W(2x, 2y-1, t)$ and $^{2D}W(2x, 2y, t)$, as $W_1$, $W_2$, $W_3$ and $W_4$. In this paper, we consider the two cases below.

• Case 1: only one coefficient is missing. It is straightforward to reconstruct the lost coefficient by (2). For example, if $W_1$ is missing, it can be reconstructed by
$$W_1^* = 2 \cdot {}^{2D}R'(x, y, t) - W_2 - W_3 - W_4. \qquad (4)$$
• Case 2: two coefficients are missing. Note that (4) is not suitable when more than one coefficient is lost. A solution for this problem is to recalculate the missing coefficients by exploiting the Haar wavelet decomposition and reconstruction. Let us assume that two coefficients, $W_2$ and $W_4$, in the same block are affected by the noisy channels. Initially, the lost coefficients $W_2$ and $W_4$ are estimated with any existing 2-D error concealment technique; for example, $W_2$ and $W_4$ are reconstructed as $\hat{W}_2$ and $\hat{W}_4$. This block is then spatially decomposed by one level by applying the Haar wavelet transform. We have

$$LH = \{W_1 + \hat{W}_2 - W_3 - \hat{W}_4\}/2, \qquad (5)$$

$$HH = \{W_1 - \hat{W}_2 - W_3 + \hat{W}_4\}/2 \qquad (6)$$
where $LH$ and $HH$ are the resulting coefficients in the low-high and high-high subbands, respectively. Meanwhile, $LL$ and $HL$ can be recovered from the correctly received substreams. That is,

$$LL = \{W_1 + W_2 + W_3 + W_4\}/2 = {}^{2D}R'(x, y, t), \qquad (7)$$
$$HL = \{W_1 - W_2 + W_3 - W_4\}/2 = W_1 + W_3 - {}^{2D}R'(x, y, t) \qquad (8)$$
where $LL$ and $HL$ are the resulting coefficients in the low-low and high-low subbands, respectively. Then, the missing coefficients can be recalculated by applying the inverse Haar wavelet transform such that

$$W_2^* = \{LL + LH - HL - HH\}/2 = \{W_2 + W_4 + \hat{W}_2 - \hat{W}_4\}/2, \qquad (9)$$

$$W_4^* = \{LL + HH - LH - HL\}/2 = \{W_2 + W_4 + \hat{W}_4 - \hat{W}_2\}/2. \qquad (10)$$
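The two recovery cases transcribe directly into code. The sketch below follows eqs. (4)-(10) for a single 2 x 2 block; the function names and the per-block calling convention are illustrative assumptions.

```python
def recover_one(R, w2, w3, w4):
    """Case 1, eq. (4): only W1 is missing in the block; R is the received
    redundancy coefficient and w2, w3, w4 the received coefficients."""
    return 2.0 * R - w2 - w3 - w4

def recover_two(R, w1, w3, w2_hat, w4_hat):
    """Case 2, eqs. (5)-(10): W2 and W4 are missing; w2_hat and w4_hat are
    their initial estimates from any 2-D error concealment technique."""
    LL = R                                   # eq. (7), from the received R'
    HL = w1 + w3 - R                         # eq. (8), from received data
    LH = (w1 + w2_hat - w3 - w4_hat) / 2.0   # eq. (5), uses the estimates
    HH = (w1 - w2_hat - w3 + w4_hat) / 2.0   # eq. (6)
    w2 = (LL + LH - HL - HH) / 2.0           # eq. (9), inverse Haar
    w4 = (LL + HH - LH - HL) / 2.0           # eq. (10)
    return w2, w4
```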
The sum of squared differences (SD) between the original and reconstructed coefficients is

$$SD = (W_2^* - W_2)^2 + (W_4^* - W_4)^2 = \{(\hat{W}_2 - W_2)^2 + (\hat{W}_4 - W_4)^2 - 2(\hat{W}_2 - W_2)(\hat{W}_4 - W_4)\}/2. \qquad (11)$$

Compared to the SD obtained using only an existing error concealment technique, the SD can be decreased by

$$\Delta SD = (\hat{W}_2 - W_2)^2 + (\hat{W}_4 - W_4)^2 - SD = \{(\hat{W}_2 + \hat{W}_4) - (W_2 + W_4)\}^2/2. \qquad (12)$$
Note that not all of the missing coefficients in the spatial root subband can be estimated by root subband recovery when the coefficient $^{2D}R'(x, y, t)$ has to be set to zero (i.e., $^{2D}R'(x, y, t) = 0$). In this case, the missing coefficients can only be estimated by using other error concealment techniques, such as bilinear interpolation.
3 Simulation Results
The experiments are conducted using the $352 \times 240 \times 48$ monochrome "Football" and "Susie" video sequences with a frame rate of 30 frames/s. We use 16 frames for the GOF. A three-level decomposition using the 9/7 biorthogonal wavelet filter is applied in both the spatial and temporal domains. The wavelet coefficients are divided into multiple groups ($S = 4$ or $S = 16$) and each group is independently encoded at the coding rate $c = 1.0$ bit/pixel (bpp). The peak signal-to-noise ratio (PSNR),

$$PSNR = 10 \log_{10}\left(\frac{255^2}{MSE}\right)\ \mathrm{dB}, \qquad (13)$$

is used as the distortion metric, where MSE is the mean-squared error between the original and the reconstructed video sequence.
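For reference, eq. (13) can be computed as follows for 8-bit sequences (a minimal sketch; the floating-point conversion is an implementation detail, not from the paper).

```python
import numpy as np

def psnr(original, reconstructed):
    """PSNR of eq. (13), with the MSE taken over the whole sequence."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```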
Table 1. Comparison of PSNR (dB) for "Football" and "Susie" video sequences in noiseless channels

(a) Football
Total number of substreams | Proposed algorithm (dB) | Previous work in [8] (dB) | ERC-SPIHT (dB)
S = 4                      | 33.16                   | 33.20                     | 33.35
S = 16                     | 32.96                   | N/A                       | 33.16

(b) Susie
Total number of substreams | Proposed algorithm (dB) | Previous work in [8] (dB) | ERC-SPIHT (dB)
S = 4                      | 43.72                   | 43.75                     | 43.87
S = 16                     | 43.51                   | N/A                       | 43.66
Table 2. Comparison of PSNR (dB) for "Football" and "Susie" video sequences in a noisy channel

(a) Football
Total number of substreams | Substream(s) lost | Proposed algorithm (dB) | Previous work in [8] (dB) | ERC-SPIHT (dB)
S = 4                      | 2                 | 26.00                   | 25.93                     | 24.93
S = 4                      | 4                 | 25.91                   | 25.84                     | 24.81
S = 4                      | 2 & 4             | 21.56                   | N/A                       | 21.51
S = 16                     | 2                 | 30.56                   | N/A                       | 29.71
S = 16                     | 4                 | 30.55                   | N/A                       | 29.60
S = 16                     | 2 & 4             | 28.48                   | N/A                       | 27.64

(b) Susie
Total number of substreams | Substream(s) lost | Proposed algorithm (dB) | Previous work in [8] (dB) | ERC-SPIHT (dB)
S = 4                      | 2                 | 30.74                   | 30.73                     | 29.71
S = 4                      | 4                 | 30.58                   | 30.57                     | 29.41
S = 4                      | 2 & 4             | 26.52                   | N/A                       | 25.97
S = 16                     | 2                 | 39.85                   | N/A                       | 37.78
S = 16                     | 4                 | 39.51                   | N/A                       | 37.27
S = 16                     | 2 & 4             | 36.45                   | N/A                       | 34.62
• Source Coding Efficiency. We assume that all substreams are received correctly at the decoder. Table 1 shows the average PSNR values for the proposed algorithm, the previous work in [8] and ERC-SPIHT when the total number of substreams is S = 4 or S = 16. When S = 4, the PSNRs obtained with the proposed algorithm and the previous work [8] are lower than those of ERC-SPIHT (by 0.15–0.19 dB). We also observe that the PSNR values of the proposed algorithm are slightly lower than those of [8]. This means that more bits are assigned to the additional redundancy in the final bitstreams of the proposed video coder. When S = 16, the previous work [8] is not applicable. It can be seen that the difference between the proposed algorithm and ERC-SPIHT is 0.15–0.20 dB in terms of PSNR.
• Error Concealment Performance. We assume that one or two substreams are corrupted due to transmission errors. Table 2 presents the average PSNR values for the proposed algorithm, the previous work [8] and ERC-SPIHT. When only one substream is corrupted by the noisy channels, the proposed algorithm outperforms ERC-SPIHT by up to 2.24 dB in average PSNR. In addition, the average PSNR values obtained with the proposed algorithm are slightly higher than those of [8] (S = 4), since a larger amount of additional redundancy is inserted among the substreams in the proposed algorithm, which increases the accuracy of the estimation of missing coefficients in root subband recovery. When two substreams are lost and the two missing coefficients fall in the same 2 × 2 coefficient block, the missing coefficients can be recalculated by using (9)–(10). Compared to ERC-SPIHT, the average PSNRs of the proposed algorithm can be further improved by up to 1.83 dB.
4 Conclusions
We presented a domain-partitioning based MD video coding algorithm with 3-D SPIHT for robust video transmission. Although the proposed algorithm provides lower source coding efficiency than ERC-SPIHT in lossless transmission, the proposed approach has been shown to be more resilient in an error-prone transmission environment.
References
[1] Said, A., Pearlman, W.A.: A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees. IEEE Trans. Circuits Syst. Video Technol. 6(3), 243–250 (1996)
[2] Guo, Y., Chen, Y., Wang, Y.-K., Li, H., Hannuksela, M.M., Gabbouj, M.: Error Resilient Coding and Error Concealment in Scalable Video Coding. IEEE Trans. Circuits Syst. Video Technol. 19, 781–795 (2009)
[3] Wu, D., Hou, Y.T., Zhang, Y.-Q.: Transporting Real-Time Video Over the Internet: Challenges and Approaches. Proc. IEEE 88(12), 1855–1877 (2000)
[4] Goyal, V.K.: Multiple Description Coding: Compression Meets the Network. IEEE Signal Processing Mag. 18, 74–93 (2001)
[5] Cho, S., Pearlman, W.A.: A Full-Featured, Error-Resilient, Scalable Wavelet Video Codec Based on the Set Partitioning in Hierarchical Trees (SPIHT) Algorithm. IEEE Trans. Circuits Syst. Video Technol. 12, 157–171 (2002)
[6] Cho, S., Pearlman, W.A.: Error Resilience and Recovery in Streaming of Embedded Video. Signal Processing 82, 1545–1558 (2002)
[7] Cho, S., Pearlman, W.A.: Error Resilient Video Coding With Improved 3-D SPIHT and Error Concealment. In: Proc. of SPIE/ID&T Int. Conf. on Electron. Imaging, vol. 5022, pp. 125–136 (2003)
[8] Zhu, J., Cuhadar, A.: Error-Resilient Multiple-Description Video Coding With 3-D SPIHT. In: Proc. IEEE PacRim Conf. Communications, Computers and Signal Process., pp. 606–611 (2009)
Video Quality Prediction Using a 3D Dual-Tree Complex Wavelet Structural Similarity Index K. Yonis and R.M. Dansereau Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada {kyonis,rdanse}@sce.carleton.ca
Abstract. In this paper, we test the performance of the complex wavelet structural similarity index (CW-SSIM) using the 2D dual-tree complex wavelet transform (DT-CWT). We also propose using a 3D DT-CWT with the CW-SSIM algorithm to predict the quality of digital video signals. The 2D algorithm was tested against the LIVE image database and has shown higher correlation with the subjective results than peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and the initial steerable pyramid implementation of CW-SSIM. The proposed 3D DT-CWT implementation of the CW-SSIM is tested against a set of subjectively scored video sequences from the Video Quality Experts Group's (VQEG) multimedia (MM) project and gave promising results. Both implementations prove to be good quality assessment tools for embedding in DT-CWT based image and video denoising algorithms as well as DT-CWT based image and video coding algorithms. Keywords: Objective video quality assessment, dual-tree wavelet, structural similarity.
1 Introduction
With increasing development in video related applications, the need for a reliable method to verify and rate the quality of coded and processed video has become ever more important. As humans are often the final consumers of video information, subjective quality assessment is considered to yield the best results. However, due to the challenges in preparing subjective tests, their cost, and their time consumption, research efforts have developed with the goal of quantitatively predicting the quality of the final video in a way that correlates with subjective scores. Element-based objective quality assessment algorithms, such as the mean squared error (MSE) and peak signal-to-noise ratio (PSNR), have often been used to assess signal quality, due to their simplicity and their direct relation to the distortion in the signal. However, such algorithms are less meaningful when it comes to image and video signals and do not correlate well with subjective results [1, 2]. To that effect, many research efforts, including [3–7], proposed objective quality assessment algorithms that simulate human visual system (HVS) characteristics such as perceptual channels, contrast sensitivity and contrast masking.
As directly simulating the features of the HVS can be complex and leads to high processing overhead, a new way of looking at the HVS led to a new class of objective quality algorithms. Recognizing that the HVS is more oriented towards identifying structure information in images, the structure information was used to predict the quality of images in a full reference (FR) manner, leading to the development of the structural similarity (SSIM) index and its extensions [1, 8, 9]. While having an objective video quality assessment algorithm is important, being able to embed that quality assessment as part of the processing algorithm can allow further system optimization and quality control. Unfortunately, embedding quality assessment into any processing algorithm increases the algorithm's complexity and computational overhead. To limit computational overhead, we are also motivated to develop a simple, yet effective, video quality assessment algorithm. The simplicity of the SSIM algorithm, along with its good performance, makes it a solid starting point for developing simple video assessment algorithms. The dual-tree complex wavelet transform (DT-CWT) is a relatively new transform [10]. Nonetheless, the DT-CWT has shown good results in digital video denoising [11]. Also, the transform gave good results in video coding, without the need for a motion estimation/compensation step [12]. In this paper, we extend the complex wavelet SSIM (CW-SSIM) algorithm to use the 3D DT-CWT and use it to predict the quality of processed video signals in a FR manner. The approach is evaluated as a stand-alone algorithm, without the use of a preprocessing (registration or motion compensation) step. The rest of this paper is organized as follows. In Sec. 2, we review the SSIM index and its complex wavelet extension, CW-SSIM. In Sec. 3, we provide background on the dual-tree complex wavelet transform. In Sec. 4 and Sec. 5, we discuss the implementation and evaluation of the CW-SSIM using the 2D and 3D DT-CWT, respectively. Finally, conclusions are drawn in Sec. 6.
2 Structural Similarity Index
Including known HVS characteristics in a video quality assessment algorithm currently results in slow and very complicated systems. A need for a simple, yet effective, algorithm emerged, where focusing on certain aspects of the HVS led to the development of the structural similarity (SSIM) algorithm [8]. Recognizing that the HVS is better adapted to extracting the viewed scene's structure information than to detecting absolute differences in intensities, Wang et al. [8] developed the SSIM index to estimate the quality of an image based on structural changes between the distorted and undistorted image. For two images $x$ and $y$, where $x$ is considered to be of perfect quality and $y$ is a distorted version of $x$, the structure information $\varsigma$ of each image is calculated by removing the luminance information $\mu$ and then normalizing by the contrast information $\sigma$, to get $\varsigma_x = (x - \mu_x)/\sigma_x$. Three comparison functions are derived from the extracted information of the two images: a luminance function $l(x, y)$,
a contrast function $c(x, y)$ and a structure function $s(x, y)$. The three functions are defined and combined in [8] to give

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (1)$$
where $C_1$ and $C_2$ are small positive constants used for numeric stability. The SSIM index has shown a higher correlation with subjective scores than PSNR. Also, SSIM is not very sensitive to small variations in luminance or contrast. However, it is very sensitive to translations, rotations, and scaling due to its operation in the spatial domain. To address these sensitivities, [1] proposed moving the SSIM algorithm from the spatial domain to the complex wavelet domain, where these non-structural distortions have a smaller effect on the final quality index. This move to the complex wavelet domain made the algorithm less sensitive to small geometric distortions that are not detectable by the human eye and do not affect the perceived quality of the image. The complex wavelet structural similarity [1] is defined in the following manner: if $c_x = \{c_{x,i} \mid i = 1, \ldots, N\}$ and $c_y = \{c_{y,i} \mid i = 1, \ldots, N\}$ are two sets of complex wavelet coefficients extracted from the same locations in the decompositions of the reference and distorted images, respectively, then the CW-SSIM is given by
$$\mathrm{CW\text{-}SSIM}(c_x, c_y) = \frac{2\left|\sum_{i=1}^{N} c_{x,i}\, c_{y,i}^{*}\right| + K}{\sum_{i=1}^{N} |c_{x,i}|^2 + \sum_{i=1}^{N} |c_{y,i}|^2 + K}. \qquad (2)$$
The results presented in [1] for the CW-SSIM algorithm are for a steerable pyramid wavelet transform, where (2) is applied on 7 × 7 coefficient blocks extracted at matching spatial locations from the same decomposition band of both the reference and distorted images. For quality evaluation, the index of each band is calculated as the average quality of all the blocks in the band and the overall quality of the image is taken as the mean of all the bands’ quality values.
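A minimal Python sketch of eq. (2) and the block-averaging step just described follows. The non-overlapping block stride and the value of K are simplifying assumptions (the paper uses overlapping 7 x 7 blocks and does not state K's value here).

```python
import numpy as np

def cw_ssim(cx, cy, K=0.01):
    """CW-SSIM of eq. (2) for two sets of complex wavelet coefficients taken
    from the same locations of the reference and distorted images."""
    cx, cy = np.ravel(cx), np.ravel(cy)
    num = 2.0 * np.abs(np.sum(cx * np.conj(cy))) + K
    den = np.sum(np.abs(cx) ** 2) + np.sum(np.abs(cy) ** 2) + K
    return num / den

def band_quality(band_x, band_y, block=7, K=0.01):
    """Mean CW-SSIM over block x block windows of one decomposition band."""
    h, w = band_x.shape
    scores = [cw_ssim(band_x[i:i + block, j:j + block],
                      band_y[i:i + block, j:j + block], K)
              for i in range(0, h - block + 1, block)
              for j in range(0, w - block + 1, block)]
    return float(np.mean(scores))
```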
3 Dual-Tree Complex Wavelet Transform
As an alternative to the steerable pyramid wavelet transform used in [1], in this paper we propose using the dual-tree complex wavelet transform, which has been successfully used for video coding [12] and video denoising [11]. Being able to use the DT-CWT for video quality assessment could improve the coding/processing performance of these applications that use the DT-CWT. The dual-tree complex wavelet transform (DT-CWT) is a relatively new wavelet transform [10], designed with the shortcomings of the discrete wavelet transform (DWT) in mind. Unlike the DWT, the DT-CWT is designed to be almost shift invariant with good direction selectivity [10]. The strong directional orientation of the DT-CWT comes from its ability to separate positive from negative frequencies into different bands. Similar to steerable filters, the DT-CWT
is an overcomplete wavelet transform with a limited redundancy of $2^d$, where $d$ is the number of dimensions in the processed signal [10]. The DT-CWT is designed with two wavelet bases in a Daubechies-like algorithm. The first wavelet can be derived in a similar fashion to that of the DWT. The second wavelet basis $\Psi_g(t)$ is derived from the first basis, forming a Hilbert transform pair $\Psi_g(t) = H(\Psi_h(t))$. Although both wavelet bases are real and yield their own sets of coefficient bands, $\Psi_h(t)$ is taken as the real part of the transform and $\Psi_g(t)$ is treated as the complex part. With that, $\Psi(t) = \Psi_h(t) + i\Psi_g(t)$. The 2D DT-CWT is taken as $\Psi(x, y) = \Psi(x)\Psi(y)$. With the use of the different combinations $\Psi(x)\Psi(y)$, $\Phi(x)\Psi(y)$, $\Psi(x)\Phi(y)$, $\Phi(x)\overline{\Psi}(y)$, $\Psi(x)\overline{\Phi}(y)$, where $\Phi_h(t)$ is the wavelet's scaling function and $\overline{\Phi}(y)$ is the conjugate of $\Phi(y)$, we can obtain 6 different orientation wavelets, each with its own unique spatial orientation. The 3D DT-CWT is defined as $\Psi(x, y, z) = \Psi(x)\Psi(y)\Psi(z)$ and the 28 different orientation wavelets are derived in a similar fashion to that of the 2D transform. For a detailed discussion regarding the DT-CWT, see [10, 11].
4 Dual-Tree Complex Wavelet SSIM for Image Assessment
In this section, we evaluate how well the DT-CWT works for image quality assessment using CW-SSIM. To validate the performance of this combination, which we call the dual-tree complex wavelet SSIM (DT-CW-SSIM), we use the DT-CWT to transform both the reference and distorted images into the complex wavelet domain. This is different from what was used in testing the original CW-SSIM implementation in [1] and from what was used for the analysis in [2], both of which used the steerable pyramid (SP) transform. To predict the quality of the image using the DT-CW-SSIM algorithm, we used a 4-level, 6-orientation decomposition and broke each resulting band of both the reference and test images into 7 × 7 overlapping blocks. The final quality index was computed as the mean of the sub-blocks' quality indices, as described in [1]. We performed the image quality assessment on the entire LIVE image database [13], with results shown in Fig. 1. In the scatter plots of Fig. 1, we show the quantitative measure against the difference mean opinion score (DMOS) as subjectively tested for each image. Four separate quantitative measures are shown, including PSNR in Fig. 1(a), the original spatial SSIM in Fig. 1(b), the steerable pyramid with CW-SSIM (SP-CW-SSIM) as presented in [1] in Fig. 1(c), and our use of the DT-CW-SSIM in Fig. 1(d). As we can see from the SSIM results in Fig. 1(b), a reasonable correlation with perceptual measures is achieved. However, the SP-CW-SSIM and DT-CW-SSIM results in Fig. 1(c) and Fig. 1(d) show a tighter correlation than SSIM, particularly since the images corrupted by white noise are brought in tighter with the rest of the image distortion types. From the scatter plots in Fig. 1(c) and Fig. 1(d), it is difficult to gauge what performance differences exist between SP-CW-SSIM and DT-CW-SSIM, so additional correlation tests were performed and are tabulated in Table 1.
[Fig. 1 panels: (a) PSNR, (b) SSIM, (c) SP-CW-SSIM, (d) DT-CW-SSIM]
Fig. 1. Scatter plots of different quality assessment results vs. the DMOS values for all the images in the LIVE image database [13]
For each test in Table 1, both the correlation coefficient (CC) and Spearman's rank order correlation coefficient (ROCC) were calculated, to show the algorithms' prediction accuracy and monotonicity, respectively. Another correlation coefficient (CC2) was calculated between the DMOS values and the non-linearly mapped objective results. For the non-linear mapping between the objective and subjective scores, we used the logistic function adopted in the video quality experts group (VQEG) Phase II FR-TV test [14]. It can be seen from Table 1 that both CW-SSIM algorithms outperform the PSNR and the spatial SSIM tests. Also, the results indicate that although the rank-order correlation values of the SP-CW-SSIM and the DT-CW-SSIM are very close, the DT-CW-SSIM algorithm has a higher accuracy, reflected by its correlation coefficient values. From the results in Fig. 1 and Table 1, we conclude that the 2D DT-CW-SSIM provides
Table 1. Correlation Coefficient (CC), non-linear mapped Correlation Coefficient (CC2) and Rank Order Correlation Coefficient (ROCC) of the objective quality results vs. DMOS for all images in the LIVE image database

Objective test | CC     | CC2    | ROCC
PSNR           | 0.7883 | 0.8881 | 0.8930
SSIM           | 0.7506 | 0.9079 | 0.9040
SP-CW-SSIM     | 0.8262 | 0.9140 | 0.9334
DT-CW-SSIM     | 0.8866 | 0.9309 | 0.9378
a slight performance gain over 2D SP-CW-SSIM. The next section explores the performance of extending the DT-CW-SSIM to a 3D implementation to include the temporal component of video.
5 3D Dual-Tree Complex Wavelet SSIM for Video Assessment
The good results that the 3D DT-CWT showed in video coding algorithms, without the need for a motion compensation step, encouraged us to implement the CW-SSIM algorithm using the 3D DT-CWT. The motivation behind such an implementation is to have a quality assessment algorithm that is simple and can be integrated into DT-CWT based algorithms to provide a quality control step. Also, due to the DT-CWT's good directional selectivity [12] (orientation motion) and its near shift invariant design, we believe that using this transform can drastically reduce the need for computationally intensive registration steps, such as the one used in [6]. The 3D DT-CW-SSIM implementation operates on 3D blocks following the same steps as the original 2D implementation. To predict the quality of the video sequence, both reference and test video sequences are transformed using the 3D DT-CWT, which extracts the sequence's information into different 3D frequency and orientation bands. After the sequence decomposition, we break each frequency-orientation band into overlapping 3D blocks and calculate the CW-SSIM index for each. Let $X = \{x_i, i = 1, 2, \ldots, I\}$, with $I$ the number of coefficients in $X$, be the set of coefficients from an extracted 3D band's sub-block of the reference video. Also, let $Y = \{y_i, i = 1, 2, \ldots, I\}$ be the corresponding sub-block from the test sequence. Then, the sub-block's quality can be calculated directly using (2). The frequency/orientation band's quality is the mean of all its sub-block qualities, the decomposition level's quality is the mean of all its bands' qualities, and the final video quality index is taken as the average of the different decomposition levels' quality indices. To test the proposed approach, we implemented the CW-SSIM algorithm with Selesnick's implementation of the DT-CWT. To reduce the memory and
time requirement of the algorithm, we divided each video into a number of groups of pictures (GoPs) with a length of 16 frames each. The algorithm was applied to each pair of reference and test GoPs, decomposing each GoP into three decomposition levels with 28 orientation bands each. We discarded the low frequency approximation subbands, as most of the structure information has been extracted into the detail subbands of the transform. The extracted sub-blocks from the detail bands had a fixed size of 7 × 7 coefficients spatially and a variable temporal depth, with an upper limit of 7, adjusted based on the temporal depth of each decomposition band. The final video quality was computed as the average of all the GoP quality values. To test the proposed approach, we obtained a set of 112 subjectively scored common intermediate format (CIF) test videos. The video sequences were processed by Nortel, in a collaborative effort with the video lab at the Communications Research Center Canada (CRC), as part of the VQEG Multimedia (MM) project to evaluate objective quality assessment algorithms for CIF and QCIF video sequences. More on VQEG's multimedia project can be found in the project's Phase I report [15]. The test video cases were designed to reflect impairments from encoding compression and network transmission (packet loss). The test video sequences were generated from 7 reference CIF sequences, with 16 impaired sequences from each of the reference videos. The impairments in the test videos were generated through a combination of H.264 compression at bit rates of 720, 256, or 128 kbps; frame rates of 30, 20, or 15 fps; and packet loss percentages of 0.0%, 0.5%, 0.1%, 0.2%, 0.4%, or 0.8%. We used the same criteria to evaluate the 3D DT-CW-SSIM video algorithm as for the 2D implementation. To compare the performance of the proposed approach, we implemented the video-SSIM algorithm proposed by Wang et al. in [9]. We selected the video-SSIM algorithm because it is also a video extension of the SSIM algorithm, though it only uses a 2D version of SSIM, and because the video-SSIM algorithm is reported to have good performance on VQEG's FRTV Phase I test [9]. We have tested both implemented algorithms, video-SSIM and 3D DT-CW-SSIM, against the full set of test video sequences. As shown in Table 2, the results of the two video quality assessment measures indicate that the 3D DT-CW-SSIM algorithm has a significant performance increase over the spatial video-SSIM algorithm. Although the 3D DT-CW-SSIM outperformed video-SSIM, we do see that the correlations are still rather low. Recognizing that the selected 3-D DT-CWT implementation has limitations in representing fast motion [11, 16], we ran tests on two of the VQEG MM project's sequences with low motion, CRC SRC bench cif and CRC SRC birches1 cif, and show the results in the last row of Table 2. It seems evident that fast motion in the video can further degrade the quality of the assessments, since there is a marked difference when only low motion videos are assessed.
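The hierarchical averaging described above (sub-blocks, then bands, then levels, then GoPs) can be sketched as follows; the nested-list representation of the DT-CWT output and the non-overlapping 7 x 7 spatial blocks are simplifying assumptions.

```python
import numpy as np

def cw_ssim_block(cx, cy, K=0.01):
    cx, cy = np.ravel(cx), np.ravel(cy)
    return ((2.0 * np.abs(np.sum(cx * np.conj(cy))) + K) /
            (np.sum(np.abs(cx) ** 2) + np.sum(np.abs(cy) ** 2) + K))

def gop_quality(ref_levels, test_levels):
    """One GoP's index: mean over levels of the mean over bands of the mean
    over 3-D sub-blocks of CW-SSIM. Each level is a list of 3-D complex
    coefficient arrays of shape (T, H, W)."""
    level_scores = []
    for ref_bands, test_bands in zip(ref_levels, test_levels):
        band_scores = []
        for rb, tb in zip(ref_bands, test_bands):
            blocks = [cw_ssim_block(rb[:, i:i + 7, j:j + 7],
                                    tb[:, i:i + 7, j:j + 7])
                      for i in range(0, rb.shape[1] - 6, 7)
                      for j in range(0, rb.shape[2] - 6, 7)]
            band_scores.append(np.mean(blocks))
        level_scores.append(np.mean(band_scores))
    return float(np.mean(level_scores))

# The final video quality would then be the average of gop_quality(...) over
# all 16-frame GoPs of the sequence.
```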
Table 2. Correlation coefficient (CC), non-linear mapped correlation coefficient (CC2) and rank order correlation coefficient (ROCC) of the 3D DT-CW-SSIM objective quality results vs. DMOS for VQEG's MM test video sequences

                 | Objective test | CC     | CC2    | ROCC
all videos       | video-SSIM     | 0.3776 | 0.4031 | 0.5111
                 | 3D-DT-CW-SSIM  | 0.6975 | 0.7096 | 0.7221
low motion video | video-SSIM     | 0.7981 | 0.8007 | 0.7598
                 | 3D-DT-CW-SSIM  | 0.8095 | 0.8416 | 0.8510
6 Conclusion
This paper presented results of the CW-SSIM algorithm using the 2D and 3D DT-CWT. The 2D implementation was evaluated against the LIVE image database. The DT-CWT implementation showed higher correlation values with subjective scores compared to PSNR, SSIM, and the original steerable pyramid implementation of the algorithm. Also, we have proposed and tested a 3D extension of the CW-SSIM algorithm. The 3D DT-CW-SSIM implementation was tested against a set of video sequences from VQEG's MM project. The implementation showed very promising performance results, given that the algorithm is very simple and does not use any preprocessing/registration steps. We have shown that using the DT-CWT with the CW-SSIM algorithm provides a good assessment algorithm for both images and video sequences. In its current implementation, the 3D DT-CW-SSIM algorithm's performance is better for low motion video sequences; for faster motion, the performance of the 3D DT-CW-SSIM deteriorates. While not yet tested, we expect that including a rough registration step as a preprocessing stage would further enhance the performance of the 3D DT-CW-SSIM and likely allow a broader range of motion in videos for quality assessment.

Acknowledgements. The authors would like to thank Nortel and the Natural Sciences and Engineering Research Council of Canada for funding this research.
References
1. Wang, Z., Simoncelli, E.: Translation insensitive image similarity in complex wavelet domain. In: Proc. of IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, March 2005, pp. 573–576 (2005)
2. Brooks, A.C., Zhao, X., Pappas, T.N.: Structural similarity quality metrics in a coding context: Exploring the space of realistic distortions. IEEE Trans. on Image Processing 17(8), 1261–1273 (2008)
3. Voran, S.: The development of objective video quality measures that emulate human perception. In: Proc. of GLOBECOM, pp. 1776–1781 (1991)
4. van den Branden Lambrecht, C., Verscheure, O.: Perceptual quality measure using a spatio-temporal model of the human visual system. In: Proc. SPIE, San Jose, CA, January 1996, vol. 2668, pp. 450–461 (1996)
5. Winkler, S.: A perceptual distortion metric for digital color video. In: Proc. SPIE, pp. 1–4 (1999)
6. Wolf, S., Pinson, M.H.: Video quality measurement techniques. National Telecommunications and Information Administration, Report 02-392 (June 2002)
7. Yao, S., Lin, W., Ong, E., Lu, Z.: A wavelet-based visible distortion measure for video quality evaluation. In: Proc. of IEEE Intern. Conf. on Image Processing, October 2006, pp. 2937–2940 (2006)
8. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: From error visibility to structural similarity. IEEE Trans. on Image Processing 13(4), 600–612 (2004)
9. Wang, Z., Lu, L., Bovik, A.: Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication 19(2), 121–132 (2004)
10. Selesnick, I.W., Baraniuk, R.G., Kingsbury, N.G.: The dual-tree complex wavelet transform. IEEE Signal Processing Magazine 22(6), 123–151 (2005)
11. Selesnick, I.W., Li, K.Y.: Video denoising using 2d and 3d dual-tree complex wavelet transforms. In: Wavelet Appl. Signal Image Proc. X, Proc. SPIE 5207, pp. 607–618 (2003)
12. Wang, B., Wang, Y., Selesnick, I., Vetro, A.: Video coding using 3d dual-tree wavelet transform. EURASIP J. on Image and Video Processing 2007(1), 15 (2007)
13. Sheikh, H.R., Wang, Z., Cormack, L., Bovik, A.C.: LIVE image quality assessment database release 2, http://live.ece.utexas.edu/research/quality
14. VQEG: FRTV Phase II report, final report from the video quality experts group on the validation of objective models of video quality assessment. VQEG, Tech. Rep. (August 2003)
15. VQEG: Final report from the video quality experts group on the validation of objective models of multimedia quality assessment, phase I. VQEG, Tech. Rep. (September 2008)
16. Shi, F., Selesnick, I.W.: Video denoising using oriented complex wavelet transforms. In: Proc. of IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, May 2004, vol. 2, pp. 949–952 (2004)
An Efficient 3d Head Pose Inference from Videos Mohamed Dahmane and Jean Meunier DIRO, Université de Montréal, CP 6128, Succursale Centre–Ville, 2920 Chemin de la tour, Montréal, Québec, Canada, H3C 3J7 {dahmanem,meunier}@iro.umontreal.ca
Abstract. In this article, we propose an approach to infer the 3d head pose from a monocular video sequence. First, we employ a Gabor phase–based displacement estimation technique to track face features (two inner eye corners, two wings, tip and root of the nose). The proposed method is based on the iterative Lowe's pose estimation technique, using the six tracked image facial points and their corresponding absolute locations in a 3d face model. As with any iterative technique, the estimation process needs a good initial approximate solution, which is found from orthography and scaling. With this method, the pose parameters are accurately obtained as continuous angular measurements rather than expressed in a few discrete orientations. Experimental results showed that, under the assumption of a reasonable accuracy of the facial feature locations, the method yields very satisfactory results. Keywords: Head–pose estimation, facial features, face tracking, facial analysis.
1 Introduction
The orientation of the human head allows inferring important non–verbal forms of communication in a face-to-face talk (e.g. spatial information about the focus of attention, agreement–disagreement, confusion...). The computer vision community is thus interested in automating the capture of interpersonal information from a scene. Thereby, specific meaning can be automatically extracted from video by head pose inference, a process that presents an important challenge arising from individual appearance and personal facial dynamics. The head pose can also be used to warp and normalize the face for more general facial analysis, or in some augmented reality systems to construct images that wrap around the sides of the head. For videos in which the face occupies less than 30 by 30 pixels (very low resolution), 3d model based approaches are inefficient [1]. The pose estimation problem is rather converted into a classical classification problem. For somewhat higher resolution videos, since facial features remain hard to detect and track, adapted approaches were developed, such as appearance–based techniques (e.g. [2]). In the case of high resolution face images, geometric–based approaches become relevant since facial features are visible and can be accurately recovered [3].
In recent years, a variety of methods have been introduced; the reader is therefore invited to consult the thorough review presented in [4]. In this paper, we deal with high resolution videos and use a geometric approach for head pose assessment, since it can directly exploit properties that are known to influence human pose [4]. A further motivation behind using feature–based approaches is their near invariance under pose and orientation changes [5]. However, their performance depends on the precise configuration of the local facial features. Therefore, an effective scheme for tracking the features on the face is essential to compensate for this shortcoming.
2 Feature–Based Facial Tracking Scheme
Generally, facial feature tracking methods, mainly based on Gabor wavelets, perform refinement stages [6–8] by imposing geometric constraints with subspace projection techniques or by using gray–level profiles to refine and adjust the positions of the features of interest. In this paper, we adopt a facial feature–based tracking using a Gabor phase–based technique. In what follows, we propose to use a personalized gallery of facial bunch graphs (sets of Gabor jets attached to each node) that we deform to fit a set of tracked points using the Procrustes transform. Six particular facial feature points (left/right eye inner corners, left/right nose wings, root and tip of the nose) are used because they are known to be less sensitive to facial deformations (Fig. 1).
2.1 Procrustes Transform
Procrustes shape analysis is a method in directional statistics [9] used to compare two shape configurations. A two–dimensional shape can be described by a centered configuration $u \in \mathbb{C}^k$ (with $u^{\top}\mathbf{1}_k = 0$), where $u$ is a vector containing the 2d shape landmark points, each represented by a complex number of the form $x + \imath y$. The Procrustes transform is the similarity transform (eq. 1) that minimizes (eq. 2), where $\alpha\mathbf{1}_k$, $|\beta|$ and $\angle\beta$, respectively, translate, scale and rotate $u_2$ to match $u_1$:

$$u_1 = \alpha\mathbf{1}_k + \beta u_2, \quad \alpha, \beta \in \mathbb{C}, \quad \beta = |\beta| e^{\imath\angle\beta} \qquad (1)$$

$$\left\| u_1 - \alpha\mathbf{1}_k - \beta u_2 \right\|^2 \qquad (2)$$
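Since eq. (2) is linear least squares in alpha and beta, the fit has a closed-form solution. The NumPy sketch below is an illustration of the construction, not the authors' code.

```python
import numpy as np

def procrustes_fit(u1, u2):
    """Solve eq. (2): find complex alpha (translation) and beta
    (scale + rotation) minimizing ||u1 - alpha*1 - beta*u2||^2.
    u1, u2 -- complex vectors of k landmark points (x + iy)."""
    A = np.column_stack([np.ones(len(u2), dtype=complex), u2])
    (alpha, beta), *_ = np.linalg.lstsq(A, u1, rcond=None)
    return alpha, beta

# Fit on the 4 tracked reference points, then move the remaining nodes.
u1 = np.array([0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j])            # tracked points
u2 = np.array([0.1 + 0.1j, 1.2 + 0j, 1.1 + 1j, 0 + 0.9j])  # graph anchors
alpha, beta = procrustes_fit(u1, u2)
extra_nodes = np.array([0.5 + 0.5j, 0.5 + 1.5j])           # hypothetical nodes
print(alpha + beta * extra_nodes)                          # adjusted positions
```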
(2)
Tracking of Facial Reference–Points
From frame to frame, the tracking of the 4 reference points (left and right eye inner corner and the two nose wings) is based on an iterative disparity estimation procedure (Gabor phase–based technique described in [10]). To ensure a high
370
M. Dahmane and J. Meunier
tracking efficiency, we create a personnalized bunch for each one of these features. The size of each bunch depends on the degree of deformation of the corresponding feature. 2.3
Geometric Fitting
The Procrustes transform is used to adequately deform each reference graph Gi stored in the personalized gallery. The transformation, that best wraps the 4 anchor points of Gi to fit the tracked reference points (circle–dots in figure 1), is used to adjust the positions of the two remaining feature points of Gi (diamond– dots in figure 1). The new generated positions form the probe graph Gp . 2.4
Refinement Stage
Then, Gabor jets are calculated at each point position of the generated graph Gp , and a Jet–based similarity [10] is computed between each reference graph Gi and the corresponding Gp . The probe graph with the highest similarity gives the positions of the six facial feature points. The final positions are obtained by estimating the optimal displacement of each point by using the Gabor phase– based displacement estimation technique [10].
3
Extracting the Pose of the Face
For recovering the pose of the face, we use the positions of the six facial feature points (L–R eye-inner corner, L–R nose wings, the root and the tip of the nose), that are less sensitive to facial deformations. The pose can be estimated from the topography of the nose, using the volume generated by these points as it can be seen from the face model (Fig. 2).
2 1 4
3 5
6
Fig. 1. The four tracked (circle) and the two adjusted (diamond) facial points
Fig. 2. The volume approaching the topography of the nose on a 3d GeoFace model
Given an approximate 3d absolute position of each point of this volume and the corresponding 2d points on the face image, we estimate the pose of the head using Lowe’s pose estimation algorithm [11].
An Efficient 3d Head Pose Inference from Videos
3.1
371
Lowe’s Pose Estimation Method
Given Pi (1 ≤ i ≤ n) the set of three–dimensional model points, expressed in the model reference frame, if Pi represents the corresponding set, expressed in the camera reference frame, and if we denote by pi (xi , yi ) the corresponding set of 2d image points, the pose estimation problem is to solve for the parameters1 s = (R, T) so that [Xi , Yi , Zi ] = R Pi + T
(3)
where pi is the perspective projection of Pi Xi Yi [xi , yi ] = f , = P roj (˜s, Pi ) Zi Zi
(4)
For ˜ s, an estimate of s, an error measurement between the observations and the locations of pi is computed (eq. 5). ei = pi − P roj (˜s, P )
(5)
The pose parameters correction amount δs, that eliminates the residual e, can be found via the Newton Method’s [12]. Lowe’s method proceeds by producing new estimates for the parameters s (eq. 6) and iterating the procedure until the residual e (eq. 5) drops below a given tolerance. s(i+1) = s(i) + δs 3.2
(6)
Initialization
The iterative Newton’s method starts off with an initial guess s(0) which should be sufficiently accurate to ensure convergence to the true solution. As any iterative methods, choosing s(0) from an appropriate well–behaved region is essential. For this purpose, we use the POS2 algorithm [13–15], which gives a reasonable rough estimate for s(0) . The algorithm approximates the perspective projection with a scaled orthographic projection, and finds the rotation matrix and the translation vector of the 3d object by solving a linear system. If p and P denote, respectively, the image points and the model points, the initial solution s(0) can be determined, by recovering the rotation matrix R(0) and the translation vector T(0) , from Pi P1 and pi p1 , two matrices constructed as follows, ⎞ X2 − X1 Y2 − Y1 Z2 − Z1 ⎜ ⎟ .. .. .. Pi P1 = ⎝ ⎠ . . . Xn − X1 Yn − Y1 Zn − Z1 ⎛
1 2
⎞ x2 − x1 y2 − y1 ⎜ ⎟ .. .. ; pi p1 = ⎝ ⎠ . . xn − x1 yn − y1 ⎛
¹ $s$ is the pose vector concatenating the three rotation angles (roll, pitch and yaw) and the x, y, z translations.
² Pose from Orthography and Scaling.
The initial rotation matrix $R^{(0)}$ is formed as

$$R^{(0)} = \begin{pmatrix} a_N^{\top} \\ b_N^{\top} \\ (a_N \times b_N)^{\top} \end{pmatrix} \qquad (7)$$
where $(a, b)$ is the matrix obtained by the left division³ $P_iP_1 \backslash p_ip_1$, and $a$ and $b$ are three–dimensional vectors. Subscript $N$ refers to the normalized vector, and the symbol "×" to the cross product of two vectors. The initial translation vector $T^{(0)}$ is defined as
$$T^{(0)} = \frac{(\bar{p}_x,\ \bar{p}_y,\ f)^{\top}}{sc} \qquad (8)$$
where $sc$ refers to the scale of the projection, which corresponds to $(\|a\| + \|b\|)/2$; $\bar{p} = \frac{1}{n}\sum_i p_i$; and $f$ is the camera focal length in pixels.
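The whole initialization reduces to one least-squares solve. A NumPy sketch under the construction above (the function name and array shapes are assumptions):

```python
import numpy as np

def pos_initial_pose(P, p, f):
    """Rough pose from orthography and scaling: P is n x 3 model points,
    p is n x 2 image points, f the focal length in pixels."""
    dP = P[1:] - P[0]                                  # rows Pi - P1
    dp = p[1:] - p[0]                                  # rows pi - p1
    ab, *_ = np.linalg.lstsq(dP, dp, rcond=None)       # left division dP \ dp
    a, b = ab[:, 0], ab[:, 1]
    aN, bN = a / np.linalg.norm(a), b / np.linalg.norm(b)
    R0 = np.vstack([aN, bN, np.cross(aN, bN)])         # eq. (7)
    sc = (np.linalg.norm(a) + np.linalg.norm(b)) / 2.0
    p_bar = p.mean(axis=0)                             # mean image point
    T0 = np.array([p_bar[0], p_bar[1], f]) / sc        # eq. (8)
    return R0, T0
```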
4 Experimental Results
To test the approach on real world data, we used representative video sequences displaying different head movements and orientations. In order to enhance the tracking performance, a personalized gallery was built with graphs⁴ (Fig. 1) from different faces under different "key orientations". First, the subgraph containing the four reference facial features (Fig. 1) is roughly localized in the first frame of the video via an exhaustive search for the subgraph, as a rigid object, through the coarsest face image level. We used a three–level hierarchical image representation to decrease the inherent average latency of the graph search operation, which is based on the Gabor jet similarity, by reducing the image search area and the size of the jet. For images at the finest level (640 × 480 pixel resolution), jets are defined as sets of 40 complex coefficients constructed from different Gabor filters spanning 8 orientations under 5 scales, whereas those for images at the coarsest level (320 × 240) are formed by 16 coefficients obtained from filters spanning 8 orientations under 2 scales. The intermediate level images use jets of 8 × 3 coefficients. Then, the iterative displacement estimation procedure is used as a refinement stage, performed individually on each position of all of the six feature points, and over the three levels of the pyramid face image. From frame to frame, only the four reference points are tracked using the iterative disparity estimation procedure. To avoid drifting during tracking, for each feature point, a local fitting is performed by searching through the gallery for the subgraph that maximizes the Gabor magnitude–based similarity. The rough
³ The left division $A \backslash B$ is the solution to the equation $AX = B$.
⁴ From our experiment, we found that about twenty graphs are sufficient to cover the different head appearances under various orientations.
positions of the four feature points are given by the positions of the nodes of the optimal subgraph. These are then adjusted using the displacement estimation procedure. The entire 3d head pose inference method is summarized in the flow diagram of figure 3.
[Fig. 3 flow diagram: Tracking: for the first frame, search for the optimal graph from the gallery (use only the coarsest level); for each current frame, track the 4 reference nodes using the phase-based disparity estimation procedure over the 3 image levels, get the 6 node positions by the Procrustes transform that best deforms each subgraph within the gallery to fit the 4 tracked positions, and refine the node positions over the 3 image levels via the displacement estimation procedure. Pose estimation: from the optimal 2d (image) facial feature positions and the 3d (model) facial feature positions, evaluate the initial pose s0 using the POS algorithm (initialization), then estimate the final pose s via Lowe's method.]
Fig. 3. Flow diagram of the entire 3d pose inference approach
Figure 4 shows the tracking and pose inference results for 3 different persons and environment conditions. Clearly, the achieved pose recovery performance is visually consistent with the orientation of the head, as shown by the direction of the normal of the face, drawn in yellow in figure 4. Figure 5 gives an example where a tracking failure occurred (see the tip of the nose). In this case, a correcting mechanism (e.g. reinitialization based on poor Gabor similarity) can be adopted to prevent the tracker from drifting.
Fig. 4. Tracking and pose recovery performances
Fig. 5. A tracking failure of the nose tip affects the pose recovery
5 Conclusions
This article has described a feature-based pose estimation approach using Lowe's iterative pose estimation technique, with six tracked image facial points and their corresponding locations in a 3D face model. The estimated orientation parameters are given as continuous angular measurements rather than expressed as a few discrete orientations (left, up, front, etc.); furthermore, the solution provides information about the 3D position of the object. As with any iterative solution, the estimation process needs an accurate initialization seed, which can be obtained using the pose from orthography and scaling (POS) algorithm. The facial feature tracking is accomplished by using the phase-based displacement
estimation technique and a personalized gallery of facial bunch graphs to further enhance the tracking efficiency. In the future, we plan to assess non-verbal communication between patients and their health care professionals in a clinical setting.
Acknowledgements. This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
References

1. Jilin, T., Huang, T., Hai, T.: Accurate head pose tracking in low resolution video. In: 7th Int. Conf. on Automatic Face and Gesture Recognition, pp. 573–578 (2006)
2. Fu, Y., Huang, T.: Graph embedded analysis for head pose estimation. In: 7th International Conference on Automatic Face and Gesture Recognition (April 2006)
3. Hu, Y., Chen, L., Zhou, Y., Zhang, H.: Estimating face pose by facial asymmetry and geometry. In: 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 651–656 (May 2004)
4. Murphy-Chutorian, E., Trivedi, M.: Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4), 607–626 (2009)
5. Yang, M.: Recent advances in face detection. In: Tutorial of IEEE Conference on Pattern Recognition (2004)
6. McKenna, S.J., Gong, S., Würtz, R.P., Tanner, J., Banin, D.: Tracking facial feature points with Gabor wavelets and shape models. In: Bigün, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 35–42. Springer, Heidelberg (1997)
7. Zhu, Z., Ji, Q.: Robust pose invariant facial feature detection and tracking in real-time. In: 18th International Conference on Pattern Recognition, pp. 1092–1095 (2006)
8. Yu, M.S., Li, S.F.: Tracking facial feature points with statistical models and Gabor wavelet. In: 5th Mexican International Conference on Artificial Intelligence, pp. 61–67 (November 2006)
9. Mardia, K., Jupp, P.: Directional Statistics. Wiley, New York (2000)
10. Dahmane, M., Meunier, J.: Constrained phase-based personalized facial feature tracking. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2008. LNCS, vol. 5259, pp. 1124–1134. Springer, Heidelberg (2008)
11. Lowe, D.G.: Fitting parameterized three-dimensional models to images. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(5), 441–450 (1991)
12. Trucco, E., Verri, A.: Introductory Techniques for 3-D Computer Vision. Prentice-Hall, Englewood Cliffs, NJ (1998)
13. DeMenthon, D.F., Davis, L.S.: Model-based object pose in 25 lines of code. International Journal of Computer Vision 15, 123–141 (1995)
14. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vision 9(2), 137–154 (1992)
15. Ullman, S., Basri, R.: Recognition by linear combinations of models. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(10), 992–1006 (1991)
Fall Detection Using Body Volume Reconstruction and Vertical Repartition Analysis

Edouard Auvinet (1,2), Franck Multon (2), Alain St-Arnaud (3), Jacqueline Rousseau (4), and Jean Meunier (1)

1 University of Montreal - IGB - Montreal, QC, Canada
2 University of Rennes 2 - M2S - Rennes, France
3 Health and social service center Lucille-Teasdale - Montreal, QC, Canada
4 CRIUGM, University of Montreal, Montreal, QC, Canada
Abstract. With the increase in life expectancy, more and more elderly people risk falling at home. In order to help them live safely at home by reducing the likelihood of an unrescued fall, autonomous detection systems are being developed. In this paper, we propose a new method to detect falls at home, based on a multiple-camera network for reconstructing the 3D shape of people. Fall events are detected by analyzing the volume distribution along the vertical axis, and an alarm is triggered when the major part of this distribution is abnormally near the floor, which implies that a person has fallen. This method is evaluated with respect to the number of cameras (from 3 to 8) on 22 fall scenarios. Results show 96% correct detections with 3 cameras and above 99% with 4 cameras or more.
1 Introduction
Falls are a major health and safety problem, and their incidence increases with the growth of the elderly population. Over the past decade, however, technological solutions have been proposed to address the problem of unrescued elderly people who fall on the floor at home and cannot get up by themselves. These systems use different kinds of sensors, such as accelerometers and camera systems, to detect fall events and then call emergency services. Accelerometer systems are easy to set up and their use has been widely experimented [4]. But acceleration measurements are sometimes insufficient to discriminate falls from other activities of daily life (ADL), and they are totally useless when patients forget to wear them, which is not negligible for patients suffering from dementia or Alzheimer's disease. With camera solutions, the contrary situation prevails: picture sequences contain too much information, and the main problem is to identify the useful part. Once this segmentation operation is done, it remains to find the descriptor that will classify a situation as a fall or not. Some previous works proposed mono-camera descriptors using movement, such as the reconstructed 3D movement of the head [5] or body silhouette changes [6]. Others use body orientation features, such as the width and height of the silhouette, by comparing a standing and a lying person [7][8][9]. However, in case of occlusions, which come easily with
home furniture, those solutions are defeated by hidden parts of the movement or body. This is why other descriptors use multi-camera information to overcome this problem. This kind of system makes it possible to reconstruct a person's volume in space using a visual hull reconstruction method [11]. For instance, Anderson et al. [10] use a descriptor based on the principal axis of the reconstructed volume and fuzzy-logic analysis to classify human activities, such as falls. But a simpler solution is possible for detecting people lying on the ground. Our contribution resides in two points. The first is the use of an adaptation of the visual hull reconstruction method that increases robustness to occlusions [12] and does not require the person's silhouette to be present in all cameras to reconstruct the volume. The second is the validation of a descriptor based on a simple volume repartition ratio in order to detect a person lying down on the floor. Using the fact that the volume of a person in this situation does not exceed 40 centimeters in height, detection is done by computing the ratio of the volume under this height threshold to the total volume. A previous presentation of this descriptor [1] showed excellent results using 8 cameras. In this paper, we evaluate the performance of this descriptor with respect to the number of cameras (from 3 to 8) used to reconstruct the volume. In the following section, a brief description of the method is given, followed by the results obtained with 22 different simulated fall scenarios. A discussion of the method and future work is then presented in the conclusion.
2 Method
Our algorithm can be divided into 3 levels: camera, data fusion and recognition levels. Each level is now described in detail.

2.1 Camera Level
In order to calculate the volume distribution of a subject in his environment, the system must know the relationship between the camera coordinate system and the real 3D space. Thus, prior to the fall detection process, the cameras have to be calibrated. The intrinsic parameters were computed using the chessboard method [15] to define the focal distance $f = (f_x, f_y)$, the optical center $c = (c_x, c_y)$, the skew coefficient $\alpha$ and the distortion coefficients $k = (k_1, k_2, k_3, k_4, k_5)$, as presented in [14]. The latter parameters are necessary because of the non-negligible radial distortion due to the large field of view of the camera lenses. The external parameters, the rotation matrix R and the translation vector T, were calculated using feature points manually placed on the floor. Altogether, those parameters define the projective camera model described as follows. Let $X = (X, Y, Z)$ be the real-world vector of a 3D point and $X_c = (X_c, Y_c, Z_c)$ its coordinates in the camera space; then:

$$X_c = R\,X + T$$
The normalized image projection is defined by:

$$\begin{pmatrix} x_n \\ y_n \end{pmatrix} = \begin{pmatrix} X_c/Z_c \\ Y_c/Z_c \end{pmatrix}$$

The normalized point coordinates with radial distortion become:

$$\begin{pmatrix} x_d \\ y_d \end{pmatrix} = \left(1 + k_1 r_n^2 + k_2 r_n^4 + k_5 r_n^6\right) \begin{pmatrix} x_n \\ y_n \end{pmatrix} + \begin{pmatrix} d_x \\ d_y \end{pmatrix}$$

where the tangential distortion vector is:

$$\begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} 2 k_3 x_n y_n + k_4 (3 x_n^2 + y_n^2) \\ k_3 (x_n^2 + 3 y_n^2) + 2 k_4 x_n y_n \end{pmatrix}$$

Finally, multiplying the normalized coordinates by the camera matrix gives the pixel coordinates:

$$\begin{pmatrix} x_p \\ y_p \end{pmatrix} = \begin{pmatrix} f_x & \alpha \cdot f_x & c_x \\ 0 & f_y & c_y \end{pmatrix} \begin{pmatrix} x_d \\ y_d \\ 1 \end{pmatrix}$$

This function can be written as:

$$[x_p, y_p] = \phi(X, Y, Z, f, c, k, R, T, \alpha)$$

In order to detect moving objects, each camera image $i_j$ is subtracted from its own background model $b_j$, obtained by computing a temporal median image of the sequence [13]. When the absolute difference at a pixel is higher than a predefined threshold $Th$, it is registered as a foreground pixel; otherwise it is considered a background pixel:

$$s_j(x_p, y_p) = \begin{cases} 1 & \text{if } |i_j(x_p, y_p) - b_j(x_p, y_p)| > Th \quad (\text{foreground pixel}) \\ 0 & \text{otherwise} \quad (\text{background pixel}) \end{cases}$$

Finally, in order to reduce noise and reinforce large surface detection, a morphological opening operation is applied to $s_j$ to obtain the final 2D silhouette of the person for each camera $j$.
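As an illustration of the camera level, the following minimal NumPy sketch applies the projective model $\phi$ and the median-background silhouette extraction described above; the function names and the threshold value are ours, and the calibration parameters are assumed given.

```python
import numpy as np
from scipy import ndimage

def project(X, R, T, f, c, k, alpha):
    """Pixel coordinates of a 3D world point (the model phi above)."""
    Xc = R @ X + T                                  # world -> camera frame
    xn, yn = Xc[0] / Xc[2], Xc[1] / Xc[2]           # normalized projection
    r2 = xn**2 + yn**2
    radial = 1 + k[0]*r2 + k[1]*r2**2 + k[4]*r2**3  # radial distortion
    dx = 2*k[2]*xn*yn + k[3]*(3*xn**2 + yn**2)      # tangential distortion
    dy = k[2]*(xn**2 + 3*yn**2) + 2*k[3]*xn*yn
    xd, yd = radial*xn + dx, radial*yn + dy
    xp = f[0]*xd + alpha*f[0]*yd + c[0]             # camera matrix with skew
    yp = f[1]*yd + c[1]
    return xp, yp

def silhouette(frames, frame, th=30):
    """Foreground mask s_j: median background, threshold, then opening."""
    bg = np.median(frames, axis=0)                  # temporal median image
    mask = np.abs(frame.astype(float) - bg) > th
    return ndimage.binary_opening(mask)             # remove isolated noise
```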
2.2 Data Fusion Level
This level aims at gathering the projections of the 2D silhouettes provided by each camera onto horizontal slices in order to reconstruct the 3D volume of the subject. Let $S_{i,j}$ be the projection of the image provided by camera $j$ on the horizontal plane $i$, as follows:

$$S_{i,j}(X, Y) = s_j\!\left(\phi(X, Y, Z_i, f_j, c_j, k_j, R_j, T_j, \alpha_j)\right)$$

where $Z_i$ is the height of the horizontal plane $i$ and $f_j, c_j, k_j, R_j, T_j, \alpha_j$ are the parameters of camera $j$.
For each horizontal slice $i$, $S_i$ is the image corresponding to the summation of the projections $S_{i,j}$ coming from the $n$ cameras:

$$S_i(X, Y) = \sum_{j=1}^{n} S_{i,j}(X, Y)$$

where $n$ is the total number of cameras. Therefore, $S_i(X,Y)$ takes values between 0 and $n$, depending on the number of 2D silhouettes (from the $n$ cameras) contributing to the 3D reconstruction at position $(X,Y)$ and height $Z_i$. The distance between slices was set arbitrarily to 10 cm in this study. Fig. 1 illustrates an example of this kind of fusion.

Fig. 1. Representation of 4 slices ($S_i$) where camera views were projected and summed (18 slices were used in practice with a 10 cm vertical interval)

Without occlusion, the person is visible from all cameras, and consequently all positions $(X,Y)$ where $S_i(X,Y) = n$ define the correct 3D reconstruction (slice by slice). To allow tolerance for one possible occlusion, we simply add the positions where $S_i(X,Y) = n-1$, at the expense of a slightly larger and coarser reconstruction. Therefore, by thresholding $S_i$ at $n-1$ we obtain the 3D reconstruction as a series of segmented slices $S_i^*$:

$$S_i^*(X, Y) = \begin{cases} 1 & \text{if } S_i(X, Y) \ge n - 1 \\ 0 & \text{otherwise} \end{cases}$$

Let $B$ be the set of pixels in each slice belonging to the largest object. The Vertical Volume Distribution of this object at the $i$th slice, denoted $VVD(i)$, is given by:

$$VVD(i) = \sum_{(X,Y) \in B} S_i^*(X, Y)$$
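A compact sketch of the fusion level follows (NumPy/SciPy); the per-slice extraction of the largest connected object is our reading of the set B.

```python
import numpy as np
from scipy import ndimage

def vertical_volume_distribution(S, n):
    """VVD from the summed slice images (m slices of shape HxW).

    S[i] holds S_i with values in 0..n (n = number of cameras);
    thresholding at n-1 tolerates one occluded view, and only the
    largest connected object in each slice is kept."""
    m = S.shape[0]
    vvd = np.zeros(m)
    for i in range(m):
        seg = S[i] >= n - 1                      # segmented slice S_i*
        labels, count = ndimage.label(seg)       # connected components
        if count:
            sizes = ndimage.sum(seg, labels, index=range(1, count + 1))
            vvd[i] = sizes.max()                 # pixels of the largest object
    return vvd
```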
Examples of the resulting volumes for a standing (light gray) and a lying-down (dark gray) position, with their corresponding vertical volume distributions, are presented in Fig. 2 (left), where the difference is clearly visible. Fig. 2 (right) shows the evolution of the vertical volume distribution (displayed with gray levels) of a subject during a fall scenario.
Fig. 2. On the left, 3D reconstruction of a person after fusion of the different points of view, with the corresponding vertical volume distributions in the middle graphic (light gray: standing person; dark gray: person lying on the ground). On the right, an example of the vertical volume distribution during a fall scenario (displayed with gray levels).
2.3 Fall Detection Level

To detect a fall, an indicator based on the ratio of the sum of the VVD values over the first 40 cm (5 slices starting from the floor) to the whole volume (m = 18 slices) is computed as follows:

$$VVDR = \frac{\sum_{i=1}^{5} VVD(i)}{\sum_{i=1}^{m} VVD(i)}$$

A fall is detected if this ratio is above a preselected threshold (see next section) during a predefined period of time (5 seconds in our case).
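A sketch of this decision rule is given below; the 0.6 threshold is the value suggested for four or more cameras in Section 4, while the 30 fps frame rate behind `hold_frames` is our assumption.

```python
def detect_fall(vvd_seq, threshold=0.6, hold_frames=150):
    """Trigger when VVDR stays above threshold for a sustained period.

    vvd_seq: iterable of per-frame VVD arrays (m slices each);
    hold_frames = 150 corresponds to 5 s at an assumed 30 fps."""
    run = 0
    for vvd in vvd_seq:
        vvdr = vvd[:5].sum() / max(vvd.sum(), 1e-9)  # first 40 cm / total
        run = run + 1 if vvdr > threshold else 0
        if run >= hold_frames:
            return True
    return False
```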
3 Materials
To evaluate the ability of the method to detect falls, 22 simulated fall scenarios were performed by an actor, following many previously published works (e.g., [3]). The room was 7 meters long by 4 meters wide. A set of 8 GADSPOT 4600 IP cameras with large field-of-view lenses was attached to the ceiling for the tests. The times of the falls detected with this method were compared with the ground truth measured visually: if the difference was below 1 second, the fall was considered detected; a fall reported outside this time period was considered a false detection.

Table 1. Number of possible combinatorial arrangements of cameras, per number of cameras

Number of cameras per combination          3     4     5     6    7    8   Total
Combinatorial arrangements per scenario    56    70    56    28   8    1   219
Total number of situations                 1232  1540  1232  616  176  22  4818
Each scenario was analysed with all camera combinations from 3 to 8 cameras, as presented in Table 1, for a total of 219 camera configurations per scenario.
4 Results
The Vertical Volume Distribution Ratio (VVDR) for each body position is presented in Fig. 3 as a function of the number of cameras; the 95% confidence interval area for the lying-down-on-the-floor position is clearly separated from the others. This confirms that fall detection is possible with an appropriate threshold applied to the VVDR (e.g., 0.6 for four cameras or more).

Fig. 3. Influence of the number of cameras on the capability of the VVDR to discriminate body postures (standing up, lying down on the floor, crouched). The gray areas correspond to 95% confidence intervals.

Table 2. Influence of the number of cameras on correct detection and false alarm rates

Number of cameras  Correct detection rate  False detection rate
>= 4               >= 99%                  < 1%
3                  96%                     10%
The results, presented in Table 2, show that falls were correctly detected in 96% of situations with only 3 cameras, and in above 99% with 4 cameras or more. The false detection rate is 10% with 3 cameras and less than 1% with 4 cameras or more. These results were obtained with a VVDR threshold set, for each number of cameras, to the midpoint between the crouching-position and lying-on-the-floor-position VVDR values.
5 Conclusion
The results obtained with this method highlight its efficiency at its aim: detecting people lying on the floor. However, it is based on the assumption that the major part of the fallen body lies under 40 centimeters in height. When the post-fall position differs from this expected position (e.g., a fall onto a table), it will be misdetected. The method is clearly efficient even with only three cameras, although the false detection rate is still not negligible in this case. Fortunately, with four cameras or more, the false detection rate becomes negligible. The foreground segmentation algorithm used was a very basic one. This is possible because the volume reconstruction method, being robust to occlusions, is also robust to segmentation errors. For future work, more experimentation on occlusion resistance should be done in order to better assess the limits of this method. This could be done with real-life and simulated occlusions. Some improvements could be worked on, such as running the segmentation and back-projection processing on a GPU, or using a better foreground/background segmentation algorithm. The major contribution of this paper is the validation of the VVDR efficiency with 3 cameras or more for detecting a lying-down person after a fall. But this reconstruction method could also be applied to other types of applications, such as quantifying daily life activities, which is a key issue of our modern society.
Acknowledgment This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and MITACS.
References

1. Auvinet, E., Reveret, L., St-Arnaud, A., Rousseau, J., Meunier, J.: Fall detection using multiple cameras. In: EMBS 2008, 30th Annual International Conference of the IEEE (2008)
2. Xinguo, Y.: Approaches and principles of fall detection for elderly and patient. In: 10th IEEE Intl. Conf. on e-Health Networking, Application and Service (2008)
3. Noury, N., Fleury, A., Rumeau, P., Bourke, A., Laighin, G., Rialle, V., Lundy, J.: Fall detection - principles and methods. In: 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS 2007), pp. 1663–1666 (2007)
4. Kangas, M., Konttila, A., Lindgren, P., Winblad, I., Jämsä, T.: Comparison of low-complexity fall detection algorithms for body attached accelerometers. Gait & Posture 28, 285–291 (2008)
5. Rougier, C., Meunier, J., St-Arnaud, A., Rousseau, J.: Monocular 3D head tracking to detect falls of elderly people. In: Conference of the IEEE EMBS, August 30-September 3 (2006)
6. Rougier, C., Meunier, J., St-Arnaud, A., Rousseau, J.: Fall detection from human shape and motion history using video surveillance. In: IEEE First International Workshop on Smart Homes for Tele-Health, Niagara Falls, pp. 875–880 (May 2007)
7. Tao, J., Turjo, M., Wong, M.F., Wang, M., Tan, Y.-P.: Fall incidents detection for intelligent video surveillance. In: Conference on Information, Communications and Signal Processing, December 6-9 (2005)
8. Lee, T., Mihailidis, A.: An intelligent emergency response system: preliminary development and testing of automated fall detection. Journal of Telemedicine and Telecare 11(4), 194–198 (2005)
9. Miaou, S.G., Shung, P.H., Huang, C.Y.: A customized human fall detection system using omni-camera images and personal information. In: D2H2 2006, 1st Transdisciplinary Conf. on Distributed Diagnosis and Home Care, pp. 39–42 (2006)
10. Anderson, D., Keller, J.M., Skubic, M., Chen, X., He, Z.: Recognizing falls from silhouettes. In: Conference of the IEEE EMBS, August 30-September 3 (2006)
11. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 16(2), 150–162 (1994)
12. Kim, H., Sakamoto, R., Kitahara, I., Orman, N., Toriyama, T., Kogure, K.: Compensated visual hull for defective segmentation and occlusion. In: International Conference on Artificial Reality and Telexistence (2007)
13. Piccardi, M.: Background subtraction techniques: a review. In: IEEE International Conference on Systems, Man and Cybernetics (2004)
14. Heikkilä, J., Silvén, O.: A four-step camera calibration procedure with implicit image correction. In: Conference on Computer Vision and Pattern Recognition (1997)
15. Bouguet, J.Y.: Camera Calibration Toolbox for Matlab (2007), http://www.vision.caltech.edu/bouguetj/calib_doc/
A Subsampling and Interpolation Technique for Reversible Histogram Shift Data Hiding

Yih-Chuan Lin (1), Tzung-Shian Li (1), Yao-Tang Chang (2), Chuen-Ching Wang (3), and Wen-Tzu Chen (4)

1 Dept. of Computer Science and Information Engineering, National Formosa University, Yunlin 63201, Taiwan
2 Dept. of Electro-Optical Science and Engineering, Kao Yuan University, Taiwan
3 Dept. of Electrical Engineering, National Changhua University of Education, Taiwan
4 Institute of Telecommunications Management, National Cheng Kung University, Taiwan
[email protected]

Abstract. This paper proposes a novel reversible data hiding algorithm for images, in which the original host image can be exactly recovered from the marked image after the hidden data have been extracted. The proposed algorithm shifts the histogram of the difference values between subsampled target pixel intensities and their interpolated counterparts to hide secret data. The shifting of the difference-value histogram is carried out by modifying the target pixel values. Compared to other schemes, the proposed method makes more use of the correlation between nearby pixels in an image, via simple interpolation techniques, to increase the embedding capacity without sacrificing much distortion. This is feasible because the difference histogram derived in this paper is so highly centralized around zero that much more embedding capacity than before can be obtained. The experimental results demonstrate that the proposed method not only provides a larger embedding capacity than other histogram shifting methods but also maintains a high visual quality. Moreover, the computational complexity of the proposed method is low, since only simple arithmetic computations are needed. Keywords: Reversible Data Hiding, Histogram Shifting, Subsampling, Interpolation.
1 Introduction

Data hiding is a method that embeds useful information into digital content (such as video, audio, images, or electronic documents) and extracts it at a later time for one or more of the purposes of copy control, content authentication, distribution tracking, and so forth. In most cases, the marked image experiences some distortion due to the modification of pixels for hiding data and cannot be inverted back to the original version. However, in some applications, such as law enforcement, medical or military imagery systems, any permanent distortion induced in the original images by the data hiding technique is intolerable. In such cases, the
original images have to be recovered without any distortion after the extraction of the embedded data. Data hiding techniques satisfying this requirement are referred to as reversible data hiding. To the best of our knowledge, three categories of reversible data hiding methods are available in the literature:

(a) Difference expansion methods [1-4]: this type of method uses the average and difference values between any two neighboring pixels in the host image, and expands the difference value to create the embedding space for hiding secret messages. It requires keeping a large location map of data embedding for the lossless recovery of the host image, so a data compression technique is usually employed to alleviate this problem.

(b) Histogram shifting methods [5-11]: most of these methods compute the histogram of the whole image and manipulate it to hide data. They are simple because they do not even need a compression technique; however, their capacity is usually insufficient for most applications.

(c) LSB (least significant bit) replacement methods [12, 13]: techniques of this type usually hide the secret data bits in the LSB positions of the host pixel values, while recording the original LSBs as overhead for the exact recovery of the host image. In general, they need a lossless data compression method to offer additional space for increasing the data hiding capacity.

Among these methods, the histogram shifting based methods have the advantages of less distortion and better image quality. Because they modify the pixel intensities only slightly, they can produce a very high peak signal-to-noise ratio (PSNR) after embedding the secret data [5]. However, the embedding capacity that histogram shifting methods provide is not sufficient for most applications. The capacity of this approach depends on the height of the maximum peak of the histogram: if the histogram distribution is highly centralized, i.e., the height of the maximum (peak) point is raised significantly, the embedding capacity becomes larger. Therefore, many researchers have tried various approaches to centralize the histogram distribution in order to maximize the capacity of histogram shifting methods. For example, in 2009, Kim proposed a method that utilizes the differences between sub-images of a host image, obtained by sub-sampling, to raise the hiding capacity of histogram shifting. Since the histogram of the difference values becomes rather centralized, a good performance was obtained by this method [8, 11]. Nevertheless, we believe that much more of the correlation that exists between neighboring pixels in an image can be exploited to increase the lossless data hiding capacity. Therefore, while following Kim's sub-sampling method, interpolation techniques are added in this paper to produce more centralized histograms, in which more of the correlation between nearby pixels is utilized by interpolating neighboring pixels to reduce the difference between a target pixel and its interpolated counterpart. The secret data bits are embedded by shifting the histogram of the interpolated difference values, where each shift of a histogram bin is carried out by modifying the corresponding target pixel values in the host image. By incorporating interpolation into the sub-sampling method, both the image quality and the hiding capacity can be improved for most images.
The rest of this paper is organized as follows. In Section 2, we explain the proposed method, where linear and bilinear sub-sampling schemes are presented. In Section 3, the embedding and extracting processes are illustrated. Section 4 is devoted to the experimental results. Finally, conclusions are drawn in Section 5.
2 Proposed Subsampling and Interpolation Technique

Fig. 1 depicts the components that comprise the proposed data hiding scheme. As shown in Fig. 1, the input host image I is partitioned into one subsampled sub-image and several target sub-images by a sub-sampling technique. For each target sub-image, an interpolated sub-image is computed by an interpolation scheme. Then the difference between the subsampled and interpolated sub-images is computed. For embedding secret data, we adopt the histogram shifting method mentioned by other scholars to embed data into the histogram of the difference values. In this section, the subsampling and interpolation technique is described in detail.
Fig. 1. Flowchart of data hiding process
For resizing an image, linear or bilinear interpolation is usually preferred. Kim's method [11] is equivalent to the simplest nearest-neighbor interpolation, which relies solely on the difference between adjacent intensities; thus the correlation between sub-images is not efficiently exploited in his method. The reason is described as follows. Refer to Fig. 2, where three consecutive pixels in an image are considered, and b' stands for the intensity interpolated for pixel b by simple linear interpolation. Table 1 lists statistics on the different differences among the three considered pixels to show the advantage of the linear interpolation technique. From Table 1, we can easily see that the distribution of difference values between b and b' is more centralized than that of difference values between neighboring pixels.
Fig. 2. Illustrative example of linear interpolation
Table 1. Comparison between the nearest-neighbor and linear interpolation methods

          |b − a| or |b − c|        |b′ − b|
          μ        σ               μ        σ
Lenna     6.45     9.35            3.99     5.27
Baboon    17.2     17.8            12.2     12.4
F-16      6.66     16              3.96     8.15
Goldhill  7.22     9.12            5.28     6.41
2.1 The Linear Interpolation Sub-sampling Method
Assume that the input image is an 8-bit $N_1 \times N_2$ image $I(i, j)$, where $i = 0, \ldots, N_1 - 1$ and $j = 0, \ldots, N_2 - 1$. We divide the image $I$ into one sub-sampled image $I_s(k_1, k_2)$ and one target image $I_t(k_1, k_2)$, as described in (1). Based on $I_s(k_1, k_2)$, the reference image $I_r(k_1, k_2)$, used as the interpolated version of the target image, is obtained by the linear interpolation defined in (2).

$$I_s(k_1, k_2) = I(k_1, 2k_2), \quad I_t(k_1, k_2) = I(k_1, 2k_2 + 1), \quad 0 \le k_1 \le M, \; 0 \le k_2 \le \left\lfloor \tfrac{N}{2} \right\rfloor - 1 \qquad (1)$$

$$I_r(k_1, k_2) = \left\lfloor \frac{I_s(k_1, k_2) + I_s(k_1, k_2 + 1)}{2} \right\rfloor, \quad 0 \le k_1 \le M, \; 0 \le k_2 \le \left\lfloor \tfrac{N}{2} \right\rfloor - 1 \qquad (2)$$

Using the reference image, the difference image is formed using (3):

$$D(k_1, k_2) = I_r(k_1, k_2) - I_t(k_1, k_2) \qquad (3)$$
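For concreteness, Eqs. (1)-(3) can be computed as in the NumPy sketch below; the variable names are ours, and edge replication at the last column is our assumption, since the paper does not specify border handling.

```python
import numpy as np

def linear_difference(img):
    """Difference image D (Eq. 3) from the linear pattern (Eqs. 1-2).

    img: 2D uint8 array with an even number of columns."""
    I = img.astype(np.int32)
    Is = I[:, 0::2]                          # sub-sampled image I_s, Eq. (1)
    It = I[:, 1::2]                          # target image I_t, Eq. (1)
    nxt = np.pad(Is, ((0, 0), (0, 1)), mode='edge')[:, 1:]  # I_s(k1, k2+1)
    Ir = (Is + nxt) // 2                     # reference image, Eq. (2)
    return Ir - It                           # Eq. (3)
```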
2.2 The Bilinear Interpolation Sub-sampling Method
To gain more capacity, we can extend the simple sub-sampling and linear interpolation method to its two-dimensional counterpart. Fig. 3 depicts the two-dimensional sub-sampling and interpolation pattern.

Fig. 3. Example of the bilinear interpolation method

By this pattern, there are one sub-sampled image $I^s(k_1, k_2)$ and three target images, produced using (4). For each sub-sampled pixel, three corresponding interpolated pixels are calculated as reference pixels according to (5). An index $q$ specifies which reference (target) is employed.

$$I^s(k_1, k_2) = I(2k_1, 2k_2)$$
$$I_1^t(k_1, k_2) = I(2k_1, 2k_2 + 1), \quad I_2^t(k_1, k_2) = I(2k_1 + 1, 2k_2), \quad I_3^t(k_1, k_2) = I(2k_1 + 1, 2k_2 + 1)$$
$$0 \le k_1 \le \left\lfloor \tfrac{M}{2} \right\rfloor - 1, \; 0 \le k_2 \le \left\lfloor \tfrac{N}{2} \right\rfloor - 1 \qquad (4)$$

$$I_1^r(k_1, k_2) = \left\lfloor \frac{I^s(k_1, k_2) + I^s(k_1, k_2 + 1)}{2} \right\rfloor, \quad I_2^r(k_1, k_2) = \left\lfloor \frac{I^s(k_1, k_2) + I^s(k_1 + 1, k_2)}{2} \right\rfloor$$
$$I_3^r(k_1, k_2) = \left\lfloor \frac{I^s(k_1, k_2) + I^s(k_1, k_2 + 1) + I^s(k_1 + 1, k_2) + I^s(k_1 + 1, k_2 + 1)}{4} \right\rfloor \qquad (5)$$

The difference images for bilinear interpolation are formed by (6):

$$D_q(k_1, k_2) = I_q^r(k_1, k_2) - I_q^t(k_1, k_2), \quad 1 \le q \le 3 \qquad (6)$$
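The bilinear pattern of Eqs. (4)-(6) admits an analogous vectorized sketch (same border-handling assumption as above):

```python
import numpy as np

def bilinear_differences(img):
    """Difference images D_q (Eq. 6) for the 2-D pattern (Eqs. 4-5).

    img: 2D uint8 array with even dimensions."""
    I = img.astype(np.int32)
    s = I[0::2, 0::2]                               # sub-sampled image I^s
    t1, t2, t3 = I[0::2, 1::2], I[1::2, 0::2], I[1::2, 1::2]
    # pad the sampled grid so k1+1 / k2+1 stay in range at the border
    sp = np.pad(s, ((0, 1), (0, 1)), mode='edge')
    r1 = (sp[:-1, :-1] + sp[:-1, 1:]) // 2          # Eq. (5), I_1^r
    r2 = (sp[:-1, :-1] + sp[1:, :-1]) // 2          # Eq. (5), I_2^r
    r3 = (sp[:-1, :-1] + sp[:-1, 1:] + sp[1:, :-1] + sp[1:, 1:]) // 4
    return r1 - t1, r2 - t2, r3 - t3                # Eq. (6)
```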
3 Embedding and Extracting Processes

In this section, we describe the procedure that shifts the histogram of difference values and embeds data in the host image. The process is similar to other histogram shifting methods; instead of directly shifting the difference values of the histogram, the proposed method modifies the pixels in the target images to reflect the shifting. Because there is more than one target image in the bilinear method, we define a symbol T to represent the number of target images used in the embedding process: T = 1 for linear interpolation and T = 3 for bilinear interpolation, with the index of a reference or target image at most T. The embedding level L is pre-specified; a larger L yields a larger capacity, and vice versa. We denote by WB the secret data bit to embed. Modifying the target images $I_q^t$ for embedding yields the marked images $\hat{I}_q^t$. The embedding process is described as follows:

Step 1: Shift the histogram outwards:

$$\hat{I}_q^t(k_1, k_2) = \begin{cases} I_q^t(k_1, k_2) + L + 1 & \text{if } D_q(k_1, k_2) \le -L - 1 \\ I_q^t(k_1, k_2) - L - 1 & \text{if } D_q(k_1, k_2) \ge L + 1 \end{cases} \quad \text{where } 1 \le q \le T \qquad (7)$$

Step 2: Embed the secret data WB in the difference histogram:

$$\hat{I}_q^t(k_1, k_2) = \begin{cases} I_q^t(k_1, k_2) - D_q(k_1, k_2) + WB & \text{for } -L \le D_q(k_1, k_2) < 0 \\ I_q^t(k_1, k_2) - D_q(k_1, k_2) - WB & \text{for } 0 \le D_q(k_1, k_2) \le L \end{cases} \quad \text{where } 1 \le q \le T, \; WB \in \{0, 1\} \qquad (8)$$

When the message is embedded into the target image, or when the target image is shifted, over/underflow might occur. A modular addition with a cycle, as mentioned in [14], is employed: if a difference value is found to over/underflow as a result of hiding the secret data bit, the embedding operation is changed to use (9) instead of (8). In our experiments, the method uses a cycle of adaptive length to avoid salt-and-pepper noise, as defined in (9). For example, the length C of the cycle is set to 128, i.e., the permutations of integers 0 ↔ 1 ↔ … ↔ 127 ↔ 0 and 128 ↔ 129 ↔ … ↔ 255 ↔ 128, for the image Lenna, when the maximum difference value between the sub-images is found to be 76.

$$I_q^t(k_1, k_2) + i = C \left\lfloor \frac{I_q^t(k_1, k_2)}{C} \right\rfloor + \mathrm{mod}\!\left(I_q^t(k_1, k_2) + i, \; C\right)$$
$$\text{where } 1 \le q \le T, \quad i = \begin{cases} D_q(k_1, k_2) \pm WB & \text{when } -L \le D_q(k_1, k_2) \le L \\ \pm (L + 1) & \text{when } D_q(k_1, k_2) < -L - 1 \text{ or } D_q(k_1, k_2) > L + 1 \end{cases} \qquad (9)$$
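To make the mapping from Eqs. (7)-(8) concrete, a minimal embedding sketch for the linear case (T = 1) follows; the modular over/underflow handling of Eq. (9) is omitted and simply asserted not to occur for the chosen L.

```python
import numpy as np

def embed_linear(img, bits, L=2):
    """Histogram-shift embedding (Eqs. 7-8) on the linear pattern."""
    out = img.astype(np.int32)
    it = iter(bits)
    H, W = img.shape
    for k1 in range(H):
        for k2 in range(W // 2 - 1):   # last target skipped at the border
            Ir = (int(img[k1, 2*k2]) + int(img[k1, 2*k2 + 2])) // 2  # Eq. (2)
            t = 2*k2 + 1                              # target pixel column
            D = Ir - int(img[k1, t])                  # Eq. (3)
            if D <= -L - 1:
                out[k1, t] += L + 1                   # Eq. (7): shift outwards
            elif D >= L + 1:
                out[k1, t] -= L + 1
            else:
                b = next(it, 0)                       # secret bit WB
                if D < 0:
                    out[k1, t] += -D + b              # Eq. (8), -L <= D < 0
                else:
                    out[k1, t] += -D - b              # Eq. (8), 0 <= D <= L
    assert out.min() >= 0 and out.max() <= 255, "over/underflow: use Eq. (9)"
    return out.astype(np.uint8)
```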
In the extracting process, we first generate the difference values from the marked image. Because the difference values are centered at zero under normal circumstances, we can determine which target pixels used modular addition when their difference values are very large, and restore those pixels first. A similar approach is described in [11]. The extracting and recovering processes are as follows:

Step 1: Sub-sample the marked image using the same order as in the embedding procedure. Then compute the differences between the reference and target images, and generate the histogram of difference values.

Step 2: Scan each difference image; if a difference value is within [−2L−1, 2L+1], the extraction is performed according to (10):

$$WB = \begin{cases} 0 & \text{if } D_q(k_1, k_2) \bmod 2 = 0 \\ 1 & \text{if } D_q(k_1, k_2) \bmod 2 = 1 \end{cases} \quad \text{for } -2L - 1 \le D_q(k_1, k_2) \le 2L + 1 \qquad (10)$$

Step 3: Recover the target pixels whose difference values lie within [−2L−1, 2L+1] using (11):

$$I_q^t(k_1, k_2) = \begin{cases} \hat{I}_q^t(k_1, k_2) + l + 1 & \text{if } D_q(k_1, k_2) = 2l + 1 \\ \hat{I}_q^t(k_1, k_2) + l & \text{if } D_q(k_1, k_2) = 2l \end{cases} \quad \text{for } 0 \le l \le L$$

$$I_q^t(k_1, k_2) = \begin{cases} \hat{I}_q^t(k_1, k_2) + l - 1 & \text{if } D_q(k_1, k_2) = 2l - 1 \\ \hat{I}_q^t(k_1, k_2) + l & \text{if } D_q(k_1, k_2) = 2l \end{cases} \quad \text{for } -L \le l < 0$$

$$\text{where } 1 \le q \le T \qquad (11)$$

Step 4: Recover the target pixels whose difference values lie outside [−2L−1, 2L+1] using (12):

$$I_q^t(k_1, k_2) = \begin{cases} \hat{I}_q^t(k_1, k_2) + L + 1 & \text{if } D_q(k_1, k_2) > 2L + 1 \\ \hat{I}_q^t(k_1, k_2) - L - 1 & \text{if } D_q(k_1, k_2) < -2L - 1 \end{cases}$$
$$\text{where } 0 \le k_1 \le \left\lfloor \tfrac{M}{2} \right\rfloor - 1, \; 0 \le k_2 \le \left\lfloor \tfrac{N}{2} \right\rfloor - 1, \; 1 \le q \le T \qquad (12)$$
With this embedding/extracting process pair, the proposed method is reversible, simple, and of low computational complexity. Besides, it does not have to transmit any overhead data, because the over/underflow problem is solved by the modular method.
4 Results

Grayscale images of 512×512×8 bits, as shown in Fig. 4, are used in the experiments to verify the reversible data hiding algorithm proposed in this research. We have tested the proposed method on many kinds of images, such as natural images, map images and texture images, and satisfactory results were obtained.
Fig. 4. Test images. (a) Lenna; (b) Baboon; (c) F-16; (d) Goldhill; (e) Map; (f) Texture.
In Fig. 5, the original image Lenna and the marked images obtained by the bilinear method are shown; no prominent visual distortion is apparent. A comparison of capacity between Kim's algorithm and the proposed method is given in Fig. 6. In Tables 2 and 3, we compare the PSNR not only with that of Kim's method but also with that of Tian's method [1], which is very famous in the data hiding field, after 50000 bits were embedded in the images considered. Because the difference between the sub-images defined by our sub-sampling methods is small, the over/underflow problem almost never happens. The proposed method is thus less likely to use modular addition and yields better visual quality.
Fig. 5. (a) Original image; (b) 10000 bits hidden, PSNR = 46.28 dB; (c) 50000 bits hidden, PSNR = 43.06 dB; (d) 70000 bits hidden, PSNR = 41.33 dB.
Fig. 6. Capacity comparison with varying embedding levels. (a) Lenna; (b) Baboon; (c) F-16; (d) Goldhill; (e) Map; (f) Texture.
As shown in Fig. 6, the lower capacity for the Baboon and Texture images under the bilinear interpolation method results from the weaker correlation among neighboring pixels in the column direction in both images. If the sub-images subsampled in the column direction are less correlated, the shape of the difference histogram is more scattered and thus less capacity is gained. However, the bilinear sub-sampling method is more adaptive: if the images have stronger correlation in the column direction, the capacity of the proposed algorithm can be raised as demanded by increasing the embedding level L.

Table 2. Comparison results in terms of PSNR (dB) for the Lenna, Baboon and F-16 images

Method                  Lenna   Baboon  F-16
Tian [1]                37.89   28.51   37.50
Kim [8]                 42.76   33.30   43.01
Linear interpolation    43.31   35.52   46.30
Bilinear interpolation  45.63   34.93   43.24

Table 3. Comparison results in terms of PSNR (dB) for the Goldhill, Map and Texture images

Method                  Goldhill  Map     Texture
Tian [1]                38.68     29.33   37.94
Kim [8]                 40.50     33.42   40.58
Linear interpolation    42.82     35.64   43.04
Bilinear interpolation  42.36     35.10   40.38
5 Discussion and Conclusions

As shown by the results in Section 4, the proposed method has three strong points. First, it provides more capacity than other methods. Second, it keeps a relatively high image quality. Third, it is applicable to most types of images. In this research, the correlations between sub-images obtained by the proposed sub-sampling/interpolation technique are fully exploited to increase the hiding capacity. The simulation results prove the effectiveness of the proposed method in improving traditional histogram shifting in terms of both embedding capacity and objective fidelity. Finding other, better ways of defining the subsampled sub-images appears to be interesting future work.
Acknowledgments. This research was supported in part by the National Science Council, Taiwan, under research grants NSC 98-2221-E-150-051 and NSC 98-2221-E-244-002 to Yih-Chuan Lin and Yao-Tang Chang, respectively.
References

1. Tian, J.: Reversible data embedding using a difference expansion. IEEE Transactions on Circuits and Systems for Video Technology 13(8) (2003)
2. Alattar, A.M.: Reversible watermarking using the difference expansion of a generalized integer transform. IEEE Transactions on Image Processing 13(8) (2004)
3. Coltuc, D., Chassery, J.-M.: Very fast watermarking by reversible contrast mapping. IEEE Signal Processing Letters 14(4), 255–258 (2006)
4. Lin, C.-C., Yang, S.-P., Hsueh, N.-L.: Lossless data hiding based on difference expansion without a location map. In: IEEE Congress on Image and Signal Processing (2008)
5. Ni, Z., Shi, Y.-Q., Ansari, N., Su, W.: Reversible data hiding. IEEE Transactions on Circuits and Systems for Video Technology 16(3) (2006)
6. Hwang, J.H., Kim, J.W., Choi, J.U.: A reversible watermarking based on histogram shifting. In: Shi, Y.Q., Jeon, B. (eds.) IWDW 2006. LNCS, vol. 4283, pp. 348–361. Springer, Heidelberg (2006)
7. Kuo, W.-C., Jiang, D.-J., Huang, Y.-C.: A reversible data hiding based on block division. In: Congress on Image and Signal Processing, vol. 1, pp. 365–369 (2008)
8. Kim, K.-S., Lee, M.-J., Suh, Y.-H.: Histogram-based reversible data hiding technique using subsampling. In: Proceedings of the 10th ACM Workshop on Multimedia and Security (September 2008)
9. Chung, K.-L., Huang, Y.-H., Yang, W.-N., Hsu, Y.-C.: Capacity maximization for reversible data hiding based on dynamic programming approach. Applied Mathematics and Computation, 284–292 (2009)
10. Lin, Y.-C., Li, T.-S.: High capacity histogram shifting lossless data hiding. In: IPPR Conference on Computer Vision, Graphics, and Image Processing (2009)
11. Kim, K.-S., Lee, M.-J., Suh, Y.-H.: Reversible data hiding exploiting spatial correlation between sub-sampled images. Pattern Recognition 42, 3083–3096 (2009)
12. Celik, M.U., Sharma, G., Tekalp, A.M., Saber, E.: Lossless generalized-LSB data embedding. IEEE Transactions on Image Processing 14(2) (2005)
13. Ohyama, S., Niimi, M., Yamawaki, K., Noda, H.: Lossless data hiding using bit-depth embedding for JPEG2000 compressed bit stream. In: IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing (2008)
14. Fridrich, J., Goljan, M., Du, R.: Invertible authentication. In: Proceedings of the SPIE, Security and Watermarking of Multimedia Contents, vol. 4314, pp. 197–208 (2001)
Multi-Objective Genetic Algorithm Optimization for Image Watermarking Based on Singular Value Decomposition and Lifting Wavelet Transform

Khaled Loukhaoukha, Jean-Yves Chouinard, and Mohamed Haj Taieb
Laval University, Quebec, QC, Canada, G1K 7P4
[email protected], [email protected], [email protected]

Abstract. In this paper, a new optimal watermarking scheme based on singular value decomposition (SVD) and lifting wavelet transform (LWT) using multi-objective genetic algorithm optimization (MOGAO) is presented. The singular values of the watermark are embedded in a detail subband of the host image. To achieve the highest possible robustness without losing watermark transparency, multiple scaling factors (MSF) are used instead of a single scaling factor (SSF). Determining the optimal values of the MSF is a difficult problem; to find these values, a multi-objective genetic algorithm optimization is used. Experimental results show a much improved performance, in terms of transparency and robustness, of the proposed method compared to other methods. Keywords: Digital watermarking, multi-objective optimization, genetic algorithm, singular value decomposition, lifting wavelet transform.
1 Introduction
In the digital era, it has become easier to illegally exchange digital multimedia content. In this context, digital watermarking was introduced as a technical security protection solution, which consists of inscribing invisible secret information (known as a watermark) into the multimedia content. In the case of digital images, watermarks are generally embedded in the spatial or frequency domain. The spatial domain has the advantage of low computational complexity compared to transform domains; however, it suffers from weak robustness against various attacks. Embedding watermarks in the frequency domain enhances imperceptibility because the Human Visual System (HVS) behavior naturally follows the spectral characteristics (frequency domain) of the source. In recent years, artificial intelligence techniques have been used to improve the performance of digital watermarking methods [1–3]. This paper investigates the genetic algorithm as an artificial intelligence technique. The paper is organized as follows. In Section 2, fundamental concepts of genetic algorithms are explained. Section 3 describes
the SVD-LWT watermarking algorithm. In Section 4, we explain how multi-objective genetic algorithm optimization is used in the watermarking algorithm. The experimental results are discussed in Section 5, and concluding remarks are given in Section 6.
2 Fundamental Concepts of Genetic Algorithm
Holland [4] defined the genetic algorithm (GA) as a technique based on the process of natural selection, genetic recombination and evolution. The main techniques employed in a genetic algorithm are:

1. Population initialization: an initial population is a set of chromosomes, each representing a solution of the problem. This population can be generated randomly or by some problem-specific heuristic. It is important to note that a chromosome is expressed as a string of genes, which can be represented, depending on the application, by the binary alphabet, integers or real numbers.

2. Evaluation: this consists of ranking the chromosomes from best to worst according to their fitness values, in order to reflect their importance in forming the next generation. The fitness value is computed using an objective function.

3. Reproduction: this gives a chromosome the possibility of contributing to the next generation. The probability that a chromosome will be chosen is based on its fitness value. This operator plays an important role in driving the search towards a better solution and in maintaining diversity in the population. The main reproduction methods used are roulette wheel selection and tournament selection.

4. Crossover: this combines pairs of chromosomes (called parents) to generate new chromosome pairs (called offspring) that share some features taken from the parents. The crossover is executed with a fairly high probability (i.e., PC > 0.7). Its aim is to form a new generation with a higher proportion of the characteristics of the good chromosomes than the previous generation. There are many ways of performing the crossover, such as multi-point crossover, uniform crossover, and arithmetic crossover.

5. Mutation: this applies to some offspring chromosomes. It consists of introducing variations in the values of several genes selected at random. This operator ensures genetic diversity within the population and is performed with a low probability (i.e., PM is kept within the range 0.001-0.05).

A classical genetic algorithm scheme is shown in Figure 1; a compact sketch of such a loop is given below.
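The sketch uses the operators named above (roulette wheel selection, arithmetic crossover, Gaussian mutation); the search bounds and the minimization convention are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_algorithm(fitness, n_vars, pop_size=100, generations=150,
                      pc=0.8, pm=0.05, bounds=(-50.0, 50.0)):
    """Minimal real-coded GA. `fitness` returns a value to MINIMIZE
    (the ideal value of Eq. 14 is one), so selection weights use the
    reciprocal of the score."""
    lo, hi = bounds
    pop = rng.uniform(lo, hi, (pop_size, n_vars))       # initial population
    for _ in range(generations):
        scores = np.array([fitness(c) for c in pop])
        weights = 1.0 / (scores + 1e-12)                # lower score = fitter
        probs = weights / weights.sum()
        # roulette-wheel selection of parent pairs
        parents = pop[rng.choice(pop_size, size=(pop_size, 2), p=probs)]
        a = rng.uniform(size=(pop_size, 1))
        children = np.where(rng.uniform(size=(pop_size, 1)) < pc,
                            a * parents[:, 0] + (1 - a) * parents[:, 1],
                            parents[:, 0])              # arithmetic crossover
        mutate = rng.uniform(size=children.shape) < pm
        children = children + mutate * rng.normal(0, 1.0, children.shape)
        pop = np.clip(children, lo, hi)                 # Gaussian mutation
    scores = np.array([fitness(c) for c in pop])
    return pop[np.argmin(scores)]                       # best chromosome
```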
3 Watermarking Algorithm Based on SVD and LWT
The developed SVD-LWT watermarking algorithm [7] can be described by two processes: watermark embedding and watermark extraction.
Fig. 1. Scheme of a classical genetic algorithm [6]
3.1 Watermark Embedding
Consider an original image I of size N × N and let the watermark W be a binary image of size M × M. The embedding procedure is described as follows:

1. Decompose the original image I into 3ℓ + 1 subbands by applying an ℓ-level lifting wavelet transform (LWT).

2. Select one subband (SB) among the three following subbands: HH_ℓ, HL_ℓ and LH_ℓ.

3. Compute the inverse LWT of the selected subband (SB):

$$X = LWT^{-1}(SB) \qquad (1)$$

4. Apply a singular value decomposition (SVD) to matrix X:

$$X = U_X \cdot S_X \cdot V_X^T \qquad (2)$$

5. Apply a singular value decomposition to the watermark matrix W:

$$W = U_W \cdot S_W \cdot V_W^T \qquad (3)$$

6. Compute a one-way hash function of the matrices $U_W$ and $V_W$:

$$H_U = \mathrm{Hash}(U_W), \quad H_V = \mathrm{Hash}(V_W) \qquad (4)$$

7. The matrices $U_W$ and $V_W$, and the hash values $H_U$ and $H_V$, are stored in the private key.

8. Compute matrix $S_Y$ according to:

$$S_Y(i, j) = S_X(i, j) + \alpha \cdot S_W(i, j), \quad 0 \le i, j \le M \qquad (5)$$

where α is the watermark strength factor that controls the tradeoff between visual quality and robustness of the watermarking scheme.
9. Compute matrix $Z_W$ according to:

$$Z_W = U_X \cdot S_Y \cdot V_X^T \qquad (6)$$

10. Compute the lifting wavelet transform of the matrix $Z_W$:

$$SB_W = LWT(Z_W) \qquad (7)$$

11. The watermarked image $I_W$ is computed by applying the inverse lifting wavelet transform over the ℓ levels to the modified subband $SB_W$ and the 3ℓ unmodified subbands.
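A sketch of the embedding steps using PyWavelets is given below. A plain Haar DWT stands in for the lifting wavelet transform, and treating the selected detail subband as the approximation input of a one-level inverse transform is our reading of Eq. (1); the vector `alphas` holds one strength per singular value of the watermark (the MSF of Section 4; a single repeated value reproduces Eq. (5)).

```python
import numpy as np
import pywt  # plain DWT used as a stand-in for the lifting wavelet transform

def embed(host, wm, alphas, level=3, band=2):
    """Sketch of embedding steps 1-11 (Eqs. 1-7). len(alphas) must equal
    the watermark side M; band picks one detail subband of the deepest
    level (0: horizontal, 1: vertical, 2: diagonal)."""
    coeffs = pywt.wavedec2(host.astype(float), 'haar', level=level)
    SB = coeffs[1][band]                               # step 2
    X = pywt.idwt2((SB, (None, None, None)), 'haar')   # step 3: X = LWT^-1(SB)
    Ux, Sx, Vxt = np.linalg.svd(X)                     # step 4
    Uw, Sw, Vwt = np.linalg.svd(wm.astype(float))      # step 5 (store Uw, Vwt)
    Sy = Sx.copy()
    Sy[:Sw.size] += np.asarray(alphas) * Sw            # step 8, Eq. (5)
    Zw = Ux @ np.diag(Sy) @ Vxt                        # step 9, Eq. (6)
    SBw, _ = pywt.dwt2(Zw, 'haar')                     # step 10, Eq. (7)
    details = list(coeffs[1])
    details[band] = SBw                                # replace the subband
    coeffs[1] = tuple(details)
    return pywt.waverec2(coeffs, 'haar')               # step 11
```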
3.2 Watermark Extraction
Several watermarking algorithms based on the singular value decomposition (SVD) that embed a watermark using singular values suffer from a high probability of false positive detections. This vulnerability of SVD-based watermarking was reported by Zhang et al. [8] and Rykaczewski [9] for the SVD-based spatial watermarking algorithm proposed by Liu and Tan [10]. Moreover, other researchers have shown that other algorithms suffer from the same high probability of false positive detection [11, 12]. Since the proposed watermarking algorithm is based on SVD, to protect it against this vulnerability we use a solution proposed in [13], which consists in using a one-way hash function or an encryption function. Briefly, the solution consists in storing the hash values of the singular vector matrices during the watermark embedding process; these values are denoted by $H_U$ and $H_V$. Before starting the watermark extraction, a safety test is done. The following steps summarize the extraction algorithm:

1. Safety test: the hash values of the singular vector matrices $U_W$ and $V_W$ (possibly changed by an attacker to $\tilde{U}_W$ and $\tilde{V}_W$) are computed. These hash values, denoted by $H_{\tilde{U}}$ and $H_{\tilde{V}}$, are compared to the hash values stored during the embedding procedure (step 6). Then,

$$\begin{cases} \text{if } H_U = H_{\tilde{U}} \text{ and } H_V = H_{\tilde{V}} & \longrightarrow \text{go to step 2} \\ \text{if } H_U \ne H_{\tilde{U}} \text{ or } H_V \ne H_{\tilde{V}} & \longrightarrow \text{stop: algorithm attacked} \end{cases} \qquad (8)$$

2. Decompose the original image I and the watermarked image $I_W$ by applying the ℓ-level lifting wavelet transform.

3. Select the same subband (SB) used in step 2 of the watermark embedding procedure. Let $SB_I$ and $SB_{I_W}$ be the subbands selected for the original and watermarked images, respectively.

4. Compute the inverse lifting wavelet transform of the selected subbands:

$$X = LWT^{-1}(SB_I), \quad X_{I_W} = LWT^{-1}(SB_{I_W}) \qquad (9)$$
5. Apply the singular value decomposition to the matrices X and $X_{I_W}$:

$$X = U_X \cdot S_X \cdot V_X^T, \quad X_{I_W} = U_{X_{I_W}} \cdot S_{X_{I_W}} \cdot V_{X_{I_W}}^T \qquad (10)$$

6. Compute matrix $S_{\hat{W}}$ as follows:

$$S_{\hat{W}} = \frac{S_{X_{I_W}} - S_X}{\alpha} \qquad (11)$$

7. Determine the estimated watermark $\hat{W}$ by computing:

$$\hat{W} = \tilde{U}_W \cdot S_{\hat{W}} \cdot \tilde{V}_W^T \qquad (12)$$

4 Optimal Watermarking Algorithm Using Multi-Objective Genetic Algorithm Optimization
Watermarking schemes are based on either an additive or a multiplicative embedding rule, usually of the form (13) below:

$$\begin{cases} I_W = I + \alpha \cdot W & \longrightarrow \text{additive rule} \\ I_W = I \cdot (1 + \alpha \cdot W) & \longrightarrow \text{multiplicative rule} \end{cases} \qquad (13)$$

where $I_W$ is the (transformed) watermarked image, I is the (transformed) original image, and α controls the watermarking strength. From (13), one can see that the watermark W is scaled by the factor α in the embedding process. Cox et al. [14] suggest the use of multiple scaling factors (MSF) instead of one; they state that a single scaling factor (SSF) may not be applicable for altering all the values of the original image I. Therefore, it is necessary to use multiple scaling factors instead of a single one. Determining the optimal values of these factors, in order to achieve the highest robustness and transparency under various attacks, is unfortunately a difficult problem. In this paper, we propose a multi-objective genetic algorithm optimization (MOGAO) scheme to find the optimal multiple scaling factors for the watermarking algorithm presented in Section 3. The steps for applying MOGAO to the SVD-LWT watermarking scheme are:

1. Define the population size (PS), the number of variables (NV), the crossover probability (PC), the mutation probability (PM), the objective function, and the generation number (GN) or any other stopping condition.
2. Randomly generate an initial population of potential solutions.
3. Using the embedding process, produce the watermarked images for each chromosome of the population.
4. Compute the normalized correlation between the original and watermarked images, NC(I, IW) (the number of watermarked images IW is equal to PS).
5. Apply the selected attacks to each watermarked image IW, one by one (T attacked images are generated for each watermarked image).
6. Using the extraction process previously described, extract the watermarks from the attacked watermarked images.
7. Compute NC(W, Ŵ) (Ŵ being the extracted watermark).
8. Evaluate the objective function for every chromosome of the population.
9. Apply the selection process, crossover and mutation operations.
10. Repeat steps 3 to 10 until the generation number is reached or another stopping condition is met.

The objective function depends on the imperceptibility, NC(I, IW), and on the robustness, NC(W, Ŵi) (the normalized correlation between the original watermark W and the watermark Ŵi extracted under attack i, with i = 1, 2, …, T). Equation (14) gives the fitness value of chromosome i:

$$\mathrm{Fitness}_i = \frac{1}{NC(I, I_W)} + \frac{1}{T} \sum_{i=1}^{T} \left( \frac{1}{NC(W, \hat{W}_i)} - NC(W, \hat{W}_i) \right) \qquad (14)$$

It is clear from (14) that the ideal fitness value is one.
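A direct transcription of Eq. (14) follows; the cosine-style definition of the normalized correlation NC is our assumption, since the paper does not define it explicitly.

```python
import numpy as np

def nc(a, b):
    """Normalized correlation between two images (assumed definition)."""
    a, b = a.astype(float).ravel(), b.astype(float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fitness(host, marked, watermark, extracted_list):
    """Eq. (14): 1/NC(I, I_W) + (1/T) * sum(1/NC(W, W_i) - NC(W, W_i));
    the ideal value, reached when all correlations equal 1, is one."""
    robust = [nc(watermark, w) for w in extracted_list]
    T = len(robust)
    return 1.0 / nc(host, marked) + sum(1.0 / r - r for r in robust) / T
```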
5 Experimental Results
In this section, several experiments are performed using three 256 × 256 grayscale images and a 32 × 32 binary watermark, as depicted in Fig. 2. To show the effectiveness and impact of the multi-objective genetic algorithm optimization in the proposed scheme, denoted MSF-Scheme, our results are compared to the pure DWT watermarking scheme presented in [15] and to the SVD-LWT watermarking scheme [7] of Section 3 using a single scaling factor α; these algorithms are denoted PDWT and SSF, respectively. The embedding is done in the LH3 subband and the number of lifting wavelet transform levels is 3. To determine the optimal values of the multiple scaling factors (MSF), the genetic algorithm parameters should be selected carefully. These parameters (population size, crossover probability and mutation probability) are selected by varying one
Fig. 2. Original images ((a) Baboon; (b) Lena; (c) Peppers) and watermark ((d) Letter A)
Table 1. Parameter settings of the genetic algorithm

Control parameter               Setting
Population size (PS)            100
Generation number (GN)          150
Number of variables (NV = MSF)  32
Selection method                Roulette wheel selection
Crossover type                  Arithmetic crossover
Crossover probability (PC)      0.8
Mutation type                   Gaussian mutation
Mutation probability (PM)       0.05
parameter at a time while keeping the others fixed. The genetic algorithm parameters selected for the simulations are listed in Table 1. For the experimental tests, eight different attacks are selected in conjunction with the multi-objective optimization (i.e., T = 8). These attacks are: salt & pepper noise (density 0.05), Gaussian filtering (3 × 3), cropping (1/8 center), JPEG compression (Q = 5), sharpening, scaling (256 → 512 → 256), histogram equalization and gray value quantization (1 bit), denoted respectively SP, GF, CR, CM, SH, SC, HE and QN. The proposed watermarking scheme is flexible: the number of selected attacks (T) can easily be decreased or increased, since the required level of robustness differs according to the nature of the watermarking application. Table 2 gives the results of the imperceptibility and robustness tests of the proposed scheme compared with the schemes proposed by Xianghong et al. [15] and Loukhaoukha et al. [7].

Table 2. Imperceptibility and robustness tests (the SP through QN columns report NC(W, Ŵi) under each attack)

Image    Scheme      NC(I,IW)  NC(W,Ŵ)  SP     GF     CR     CM     SH     SC     HE     QN
Baboon   MSF-Scheme  1.000     1.000    0.982  0.999  0.995  0.978  0.983  1.00   0.985  0.982
Baboon   SSF [7]     0.999     1        0.841  0.786  0.986  0.886  0.909  0.923  0.962  0.960
Baboon   PDWT [15]   0.999     0.999    0.694  0.858  0.983  0.633  0.712  0.986  0.440  0.569
Lena     MSF-Scheme  0.999     0.997    0.995  0.995  0.985  0.994  0.995  0.997  0.993  0.973
Lena     SSF [7]     0.999     1        0.749  0.703  0.835  0.856  0.964  0.993  0.986  0.969
Lena     PDWT [15]   0.999     0.999    0.616  0.866  0.983  0.640  0.666  0.994  0.587  0.625
Peppers  MSF-Scheme  0.999     1.000    0.985  0.997  0.977  0.946  0.987  1.000  0.989  0.978
Peppers  SSF [7]     0.999     0.999    0.770  0.722  0.879  0.856  0.963  0.990  0.976  0.958
Peppers  PDWT [15]   0.999     0.999    0.713  0.891  0.983  0.609  0.699  0.996  0.749  0.534
From Table 2, one can see the impact of using MOGAO to find the optimal multiple scaling factors (MSF) in the proposed watermarking scheme, compared to the same watermarking scheme using a single scaling factor (SSF). The experimental results clearly demonstrate the much improved performance, in terms of imperceptibility and robustness, of the proposed scheme with multiple scaling factors (MSF) over the same watermarking algorithm using a single scaling factor (SSF) and over the scheme of Xianghong et al. [15]. Table 3 provides the optimal multiple scaling factors found by multi-objective genetic algorithm optimization (MOGAO) when the Baboon image is used as the original image. Figure 3 depicts the attacked watermarked images and the watermarks extracted under the different attacks.

Table 3. Optimal multiple scaling factors for the Baboon image

Rank  1     2     3     4     5     6     7     8      9      10     11     12    13     14     15     16
MSF   10.0  13.5  11.0  18.3  14.2  23.5  29.0  -35.0  -21.3  -18.0  -18.1  17.1  -21.1  -23.2  -23.0  12.2
Rank  17    18     19     20    21    22     23    24     25     26   27    28     29     30    31     32
MSF   15.5  -15.7  -45.3  26.3  11.5  -14.3  17.5  -27.0  -16.3  9.1  33.1  -18.8  -14.2  33.7  -11.2  -13.4
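All entries in Tables 2 and 3 rely on the normalized correlation NC(·,·). As a concrete reference, the following minimal sketch computes this metric; the function name and the NumPy representation are our own illustrative choices, not taken from the paper.

    import numpy as np

    def nc(A, B):
        """Normalized correlation between two images, e.g. the original watermark W
        and an extracted watermark W_hat, or the original and watermarked images."""
        a = A.astype(np.float64).ravel()
        b = B.astype(np.float64).ravel()
        return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

In the multi-objective setting, one NC value is computed for the watermarked image and one per attacked extraction (T = 8 here), and the optimizer trades these objectives off when selecting the MSF.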
Fig. 3. Watermarked image attacked under the 8 different attacks and the corresponding extracted watermarks: (a) salt & pepper noise (NC = 0.951), (b) Gaussian filtering (NC = 0.963), (c) cropping (NC = 0.994), (d) JPEG compression (NC = 0.967), (e) sharpening (NC = 0.970), (f) scaling (NC = 0.998), (g) histogram equalization (NC = 0.969), (h) gray quantization (NC = 0.966)
6 Conclusion
Most watermarking schemes use additive or multiplicative embedding rules with a single scaling factor α in order to achieve a tradeoff between robustness and imperceptibility. In this paper, a new optimal image watermarking scheme based on SVD and LWT, using an additive embedding rule with multiple scaling factors (MSF), is proposed. To achieve the best trade-off between robustness and imperceptibility, multi-objective genetic algorithm optimization is employed to determine the optimal values of the MSF. Experimental results demonstrate that the proposed scheme outperforms the watermarking schemes proposed in [7, 15]. Moreover, the false-positive detection problem, which affects most SVD-based watermarking algorithms, is solved using one-way hash functions.
References

1. Zhang, F., Zhang, H.: Applications of a neural network to watermarking capacity of digital image. Neurocomputing 67, 345–349 (2005)
2. Wang, Z., Sun, X., Zhang, D.: A novel watermarking scheme based on PSO algorithm. In: Proceedings of the International Conference on Life System Modeling and Simulation, September 2007, pp. 307–314 (2007)
3. Kumsawat, P., Attakitmongcol, K., Srikaew, A.: A new approach for optimization in image watermarking by using genetic algorithms. IEEE Transactions on Signal Processing 53(12), 4707–4719 (2005)
4. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press (1975)
5. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, Berlin (1996)
6. García-Villoria, A., Pastor, R.: Solving the response time variability problem by means of a genetic algorithm. European Journal of Operational Research 202(2), 320–327 (2010)
7. Loukhaoukha, K., Chouinard, J.Y.: Hybrid watermarking algorithm based on SVD and lifting wavelet transform for ownership verification. In: Canadian Workshop on Information Theory, May 2009, pp. 177–182 (2009)
8. Zhang, X.P., Li, K.: Comments on an SVD-based watermarking scheme for protecting rightful ownership. IEEE Transactions on Multimedia 7(3), 593–594 (2005)
9. Rykaczewski, R.: Comments on an SVD-based watermarking scheme for protecting rightful ownership. IEEE Transactions on Multimedia 9(2), 421–423 (2007)
10. Liu, R., Tan, T.: An SVD-based watermarking scheme for protecting rightful ownership. IEEE Transactions on Multimedia 4(1), 121–128 (2002)
11. Xiao, L., Wei, Z., Ye, J.: Comments on robust embedding of visual watermarks using discrete wavelet transform and singular value decomposition and theoretical analysis. Journal of Electronic Imaging 17(4), 40501 (2008)
12. Zhang, T.X., Zheng, W.M., Lu, Z.M., Liu, B.B.: Comments on a semi-blind digital watermarking scheme based on singular value decomposition. In: Proceedings of the International Conference on Intelligent Systems Design and Applications, November 2008, vol. 2, pp. 123–126 (2008)
13. Loukhaoukha, K., Chouinard, J.Y.: On the security of ownership watermarking of digital images based on SVD decomposition. Journal of Electronic Imaging 19(1), 013007 (2010)
14. Cox, I., Kilian, J., Leighton, F.T., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE Transactions on Image Processing 6(12), 1673–1687 (1997)
15. Xianghong, T., Lu, L., Lianjie, Y., Yamei, N.: A digital watermarking scheme based on DWT and vector transform. In: Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing, October 2004, pp. 635–638 (2004)
Image Hiding by Using Genetic Algorithm and LSB Substitution

Masoumeh Khodaei1 and Karim Faez2

1 Islamic Azad University of Qazvin, Iran
[email protected]
2 AmirKabir University of Technology, Iran
[email protected]

Abstract. In this paper, we propose a new image hiding method using LSB substitution to improve stego-image quality. In this method, we transform the secret image into a meaningless picture using a bijective mapping function, so that the difference between the embedded secret image bits and the LSB bits of the host image pixels is as small as possible. This operation results in the encryption of the secret image. Thus, even if grabbers detect the existence of a secret image in the stego-image, they will not be able to recognize the secret image exactly. As a result, the security of our method is increased. We use a genetic algorithm to set the parameters of the bijective mapping function in order to obtain the best distribution of the pixels. We compare our proposed method with the LSB substitution method and Chang et al.'s method. The experimental results show that our proposed method enhances both the quality and the security of the stego-image obtained by LSB substitution. In addition, our results are approximately close to the results of Chang et al.'s method.

Keywords: Cryptography, Genetic Algorithm, Image Hiding, Steganography.
1 Introduction

With expanding network and Internet communications, steganography and cryptography methods have become increasingly important for raising the security of message sending between sender and receiver. Cryptography is used to protect important data against illicit access, while a steganographic method is used to hide a secret message in host data. The main goal of steganographic methods is to hide data in a host image with an acceptable decrease of image quality, so that the quality of the image containing the secret data remains acceptable and the distortion is imperceptible to viewers. Secret data and host data can be text, image, video or audio. The image in which the secret data will be embedded is called the host image, the image with the secret data embedded within is called the stego-image, and the encrypted image is called the crypto-image. Many data hiding techniques have been proposed [1-4]. One of the most famous techniques is based on manipulating the least-significant bits (LSB) of the host image pixels by replacing the LSB bits of host image pixels with the
secret message bits. In 2001, Wang et al. [5] proposed a genetic-algorithm-based scheme with LSB substitution to hide a secret image in a host image. In 2007, a method was proposed by Li and Wang [6], based on the Particle Swarm Optimization (PSO) algorithm, to enhance the quality of the stego-image. In 2008, Chang et al. [7] presented a reversible steganographic method which used a genetic algorithm to find an approximately optimal common bitmap to embed the secret data in the image. In this paper, we use an image as the secret data. We encrypt the secret image using a genetic algorithm so that the difference between the LSB bits of the host image pixels and the bits of the secret image pixels is minimal; the security of our proposed method is also increased by applying this secret image encryption. This paper is organized as follows: related works are introduced in Section 2. The proposed scheme is presented in Section 3. Experimental results are illustrated in Section 4, and the conclusions are reported and discussed in Section 5.
2 Related Works

2.1 Image Hiding by LSB Substitution

We describe the concept of image hiding by LSB substitution as presented by Wang et al. [5]. Suppose we want to hide a secret image in a host image; the resulting image is called the stego-image. All images are 8-bit gray-scale images. The size of the host image H is k times that of the secret image S. The simplest method is to hide the bits of S in the least-significant bits (LSBs) of H. This can be done in three steps. In the first step, the method decomposes the pixel values of the secret image S into several k-bit units and considers each k-bit unit as a single k-bit pixel. The resulting image is called S'. In the second step, the k LSB bits of each pixel of the host image H are extracted, forming an image called R. Finally, R is replaced with S' in order to obtain the embedded image H'. Some additional work can be done to increase the security and the quality of the stego-image H'. For this purpose, a bijective (i.e., one-to-one and onto) mapping function is used to transform each pixel location in S' into a new random location. First, the pixel locations in S' are numbered sequentially from 0 to s−1, where s is the number of pixels in S'. Next, the original pixel location x is transformed to a new location f(x) by using the following bijective mapping function:
f(x) = (k1 + k2 × x) mod s,  with gcd(k2, s) = 1   (1)

where k1 and k2 are two constants that are taken as keys, and gcd(·) is the greatest common divisor. After this location transform, we get a meaningless image S'' and call it the crypto-image. Then, we replace R with S'' and get the stego-image H'. With this method, the security of image hiding is increased because the secret image is encrypted
and grabbers cannot analyze the embedded data in the stego-image. The following section describes our method to obtain the optimal crypto-image S''.
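As an illustration of this pipeline (decomposition into k-bit units, the permutation of eq. (1), and LSB replacement), here is a minimal NumPy sketch. The function name, the NumPy representation and the row-major embedding order are our own assumptions, not prescribed by the paper.

    import numpy as np
    from math import gcd

    def embed(host, secret, k1, k2, k=4):
        """Hide `secret` (8-bit) in the k LSBs of `host` (8-bit); k is assumed to divide 8."""
        # 1) Decompose each 8-bit secret pixel into 8/k units of k bits each -> S'
        units = []
        for p in secret.flatten():
            for shift in range(8 - k, -1, -k):
                units.append((int(p) >> shift) & ((1 << k) - 1))
        s_prime = np.array(units, dtype=np.uint8)
        s = s_prime.size
        assert gcd(k2, s) == 1, "k2 must be coprime with s for f to be bijective"
        # 2) Bijective mapping of eq. (1): location x moves to f(x) -> crypto-image S''
        s_dprime = np.empty_like(s_prime)
        for x in range(s):
            s_dprime[(k1 + k2 * x) % s] = s_prime[x]
        # 3) Replace the k LSBs of the first s host pixels with S''
        mask = (1 << k) - 1
        flat = host.flatten().astype(np.uint8)
        assert flat.size >= s, "host must be large enough to carry the secret"
        flat[:s] = (flat[:s] & (0xFF - mask)) | s_dprime
        return flat.reshape(host.shape)

Extraction inverts the three steps: read the k LSBs, invert the permutation using the keys (k1, k2), and reassemble the 8-bit pixels.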
3 Proposed Method

In this section, we present the method for setting the optimal values of k1 and k2 in eq. (1) so that the difference between the k-bit crypto-image and the k LSB bits of the host image is small. This increases the quality of the stego-image. We use a genetic algorithm to find the optimal values of k1 and k2, as introduced in the next sub-section.
Fig. 1. The flowchart of embedding by the proposed method: the secret image S (n bits/pixel) is decomposed into S' (k bits/pixel) and encrypted by the genetic algorithm into the crypto-image S'' (k bits/pixel); the k LSBs extracted from the host image H (n bits/pixel) form the residual image R (k bits/pixel), which is replaced by S'' to produce the stego-image H' (n bits/pixel)
3.1 Secret Image Encryption by Using Genetic Algorithm

In our method, we want to find the optimal values of k1 and k2 in eq. (1) by using a genetic algorithm to encrypt the secret image, so as to reduce the changes resulting from embedding the secret image bits in the host image and thereby enhance the quality of the stego-image. Fig. 1 illustrates the flowchart of our proposed method. According to Fig. 1, we first decompose the pixel values of the secret image S into several k-bit units and call the result S'. Then, we extract the k LSB bits of each pixel in the host image H and denote this image R. Now, we must encrypt the k-bit secret image S'. To do this, we number the pixel locations in S' from 0 to s−1 sequentially, where s is the number of pixels in S'. Next, each original pixel location x is transformed to a new location f(x) by using the bijective mapping function of eq. (1). Here, we describe the genetic algorithm steps used to find the optimal values of k1 and k2. The initial step in a genetic algorithm is to encode the solution space of the problem as chromosomes, such that a chromosome is an individual in the GA; a chromosome contains many genes. We use the following steps to find the optimal values of k1 and k2:
Step 1: Initial population generation: Generate N chromosomes randomly as the initial population. Suppose a chromosome has two genes, and define a chromosome G as:

G = g1 g2   (2)
where g1 represents the k1 value and g2 represents the k2 value.

Step 2: Calculation of fitness value: Set the values g1 and g2 of each chromosome in eq. (1) to get the new locations of the k-bit units in S', and call the result S''. Then, calculate the fitness value of each chromosome, which judges its goodness, by

Fitness = 1 / Σ_{i=1}^{m} (R_i − S''_i)²   (3)

where m is the image size of R and S''.

Step 3: Parent selection: Select two chromosomes from the population as parents with probability proportional to their fitness values, i.e., chromosomes with higher fitness have a higher chance of selection.

Step 4: Performing the crossover operator: Suppose two chromosomes G1 = p1p2 and G2 = q1q2 are chosen as parents. Perform the crossover operator on G1 and G2 with probability Pc and produce two offspring G'1 and G'2 by fixed-point crossover, as shown in Fig. 2: divide G1 and G2 into two parts and exchange them to get G'1 = q2p2 and G'2 = q1p1. Then, check the validity of the two chromosomes G'1 and G'2: a chromosome is valid if gcd(p1, s) = 1 or gcd(p2, s) = 1. If an offspring is not valid, transmit the fitter parent to the new population.
Fig. 2. Crossover operation on chromosomes G1 = p1p2 and G2 = q1q2, producing G'1 = q2p2 and G'2 = q1p1
Step 5: Performing the mutation operator: Perform the mutation operator on a chromosome G' = q1q2 with probability Pm. To do this, generate two random numbers in the range [0, 10000] and add them to the values q1 and q2 of chromosome G' to get G''. Then, check the validity of the chromosome: it is valid if the greatest common divisor of q2 and s is 1. If G'' is not valid, transmit G' to the new population.

Step 6: Repeat steps (3-5) until N new individuals are generated.

Step 7: Survivor selection: Select all of the produced offspring as the new population.
Step 8: Repeat steps (2-6) until the termination condition is reached, which can be either a predefined number of generations or a given fitness value (a compact sketch of this loop follows).
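The following is a minimal Python sketch of Steps 1-8 under our own assumptions: list-based images, a `+1` guard in the fitness denominator to avoid division by zero, an illustrative [0, 10000) search range for the keys, and a single offspring per crossover. It is not the authors' implementation.

    import random
    from math import gcd

    def fitness(R, s_prime, k1, k2):
        """Eq. (3): reciprocal of the squared difference between R and the permuted secret S''."""
        s = len(s_prime)
        s_dprime = [0] * s
        for x in range(s):
            s_dprime[(k1 + k2 * x) % s] = s_prime[x]
        err = sum((r - v) ** 2 for r, v in zip(R, s_dprime))
        return 1.0 / (1.0 + err)   # +1 guards against division by zero

    def run_ga(R, s_prime, N=30, pc=0.8, pm=0.2, generations=100):
        s = len(s_prime)
        valid_k2 = [v for v in range(1, 10000) if gcd(v, s) == 1]
        pop = [(random.randrange(10000), random.choice(valid_k2)) for _ in range(N)]
        for _ in range(generations):
            scored = [(fitness(R, s_prime, *g), g) for g in pop]
            total = sum(f for f, _ in scored)
            def pick():  # roulette-wheel selection proportional to fitness (Step 3)
                r, acc = random.uniform(0, total), 0.0
                for f, g in scored:
                    acc += f
                    if acc >= r:
                        return g
                return scored[-1][1]
            children = []
            while len(children) < N:
                (p1, p2), (q1, q2) = pick(), pick()
                c = (q2, p2) if random.random() < pc else (p1, p2)  # fixed-point crossover (Step 4)
                if random.random() < pm:                            # mutation (Step 5)
                    c = (c[0] + random.randrange(10000), c[1] + random.randrange(10000))
                # validity: the k2 gene must stay coprime with s; else keep a parent
                children.append(c if gcd(c[1], s) == 1 else (p1, p2))
            pop = children
        return max(pop, key=lambda g: fitness(R, s_prime, *g))

Here `R` and `s_prime` are the flattened k-bit residual and secret unit sequences produced in Section 2.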
After the GA terminates, we choose the chromosome with the greatest fitness value in the current population. The selected chromosome represents the optimal values of k1 and k2.

3.2 Embedding Secret Image into Host Image
After finding the optimal values of k1 and k2, we set these values in eq. (1) to get the new locations of the k-bit units in S' and obtain the image S''. Then, we replace the image R with S'' and get the stego-image H'. With this method, both the quality and the security of the stego-image are increased. When sending the stego-image, we should also send the values of k1 and k2 so that the receiver can decrypt the crypto-image and recover the original secret image.
4 Experimental Results and Analysis

In this section, we present the experimental results of the proposed method. We use four 8-bit gray-level images, Lena, Baboon, Jet and Elaine, shown in Fig. 3, as host images in our experiments. Moreover, three images, Boat, House and Tank, of size 128 × 128, shown in Fig. 4, are used as secret images. The size of all host images is 256 × 256. We use the peak signal-to-noise ratio (PSNR) to evaluate the quality of the stego-images; a large PSNR value indicates a high-quality stego-image. The PSNR value between H and H' is calculated by
PSNR = 10 × log10( 255² / MSE )   (4)

where MSE is defined by

MSE = (1/m) Σ_{i=1}^{m} (H_i − H'_i)²   (5)
Here, m is the size of the images H and H'. In all experiments, the number k of bits embedded into the host image pixels is equal to 4 and the population size N is 30. Also, the probability Pc of the crossover operation is 0.8 and the probability Pm of the mutation operation is 0.2. We use eq. (6) to calculate the similarity measure between the secret image and the crypto-image [8]:

r = Σ_{i=1}^{MN} (x_i − x̄)(y_i − ȳ) / [ sqrt( Σ_{i=1}^{MN} (x_i − x̄)² ) × sqrt( Σ_{i=1}^{MN} (y_i − ȳ)² ) ]   (6)
Fig. 3. Four 256 × 256 host images: (a) Lena, (b) Baboon, (c) Jet and (d) Elaine
where x_i is the value of the pixels in the secret image, y_i is the value of the pixels in the crypto-image, x̄ is the mean of the x_i and ȳ is the mean of the y_i. Also, M is the size of the secret image and N is the size of the crypto-image. The results of our proposed method are presented in Tables 1, 2 and 3. These tables show that the MSE and PSNR values given by the proposed method are better than those of the simple LSB method and approximately close to the values derived from Chang et al.'s method [2].
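For reference, the quality and scrambling metrics of eqs. (4)-(6) can be sketched as follows; the function names and NumPy usage are our own illustrative choices.

    import numpy as np

    def mse(H, H2):
        """Eq. (5): mean squared error between host image H and stego-image H'."""
        d = H.astype(np.float64) - H2.astype(np.float64)
        return float(np.mean(d ** 2))

    def psnr(H, H2):
        """Eq. (4): peak signal-to-noise ratio for 8-bit images."""
        return 10.0 * np.log10(255.0 ** 2 / mse(H, H2))

    def similarity_r(x, y):
        """Eq. (6): correlation between secret image x and crypto-image y;
        a small |r| means the crypto-image is well scrambled."""
        x = x.astype(np.float64).ravel()
        y = y.astype(np.float64).ravel()
        x, y = x - x.mean(), y - y.mean()
        return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))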
Fig. 4. Three 128 × 128 secret images: (a) Boat, (b) House and (c) Tank. Three 128 × 128 crypto-images obtained by the proposed method: (d) Crypto-Boat with k1 = 4753, k2 = 227 and r = 0.001863, (e) Crypto-House with k1 = 43, k2 = 911 and r = 0.003287, and (f) Crypto-Tank with k1 = 27, k2 = 101 and r = 0.006395.
Although the gain in stego-image quality may appear slight, our proposed method increases security owing to the encryption of the secret image in the steganographic method. Therefore, even if grabbers recognize that the host image contains a secret image, they will not understand the secret image easily unless they are aware of the k1 and k2 values. Fig. 4 shows three secret images encrypted using the proposed approach. The value r represents the similarity and correlation between the secret image pixels and the crypto-image pixels. A smaller similarity value r means that the crypto-image is encrypted suitably and indistinguishably. The obtained values of r are small; thus, the secret images are encrypted satisfactorily and imperceptibly by our method, and this increases the security of the image hiding method.

Table 1. The results of embedding the Boat image as secret image by the proposed, LSB and Chang et al.'s [2] methods

                        MSE                              PSNR
Host Image   LSB     Chang [2]  Proposed     LSB     Chang [2]  Proposed
Lena         18.74   15.62      16.92        35.40   36.19      35.84
Baboon       17.83   14.28      15.78        35.61   36.58      36.14
Jet          17.47   15.12      16.01        35.70   36.15      36.08
Elaine       18.10   15.71      16.24        35.55   36.16      36.02
Table 2. The results of embedding the House image as secret image by the proposed, LSB and Chang et al.'s [2] methods

                        MSE                              PSNR
Host Image   LSB     Chang [2]  Proposed     LSB     Chang [2]  Proposed
Lena         18.36   16.27      17.01        35.49   36.01      35.82
Baboon       17.51   14.70      15.89        35.69   36.45      36.11
Jet          17.32   14.33      15.47        35.74   36.56      36.23
Elaine       17.55   13.93      16.09        35.68   36.69      36.06
Table 3. The results of embedding the Tank image as secret image by the proposed, LSB and Chang et al.'s [2] methods

                        MSE                              PSNR
Host Image   LSB     Chang [2]  Proposed     LSB     Chang [2]  Proposed
Lena         17.15   15.32      16.21        35.78   36.27      36.03
Baboon       16.64   12.91      14.73        35.91   37.02      36.46
Jet          16.36   13.08      14.51        35.78   36.94      36.51
Elaine       16.55   13.67      14.48        35.93   36.77      36.53
5 Conclusions

In this paper, we proposed an image hiding method to hide a secret image in a host image. In this approach, we used a genetic algorithm to encrypt the secret image so that
the changes resulting from embedding the secret image bits into the LSB bits of the host image are minimal. Moreover, even if grabbers learn that the stego-image contains a secret image, they will not be able to recognize the original secret image easily unless they know the values of k1 and k2. Consequently, the security of the image steganographic method is increased. The experimental results of our proposed method showed that the stego-image quality is better than that resulting from the simple LSB substitution method and approximately close to the results of Chang et al.'s method.
References

1. Chang, C.-C., Hsiao, J.-Y., Chan, C.-S.: Finding optimal least-significant-bit substitution in image hiding by dynamic programming strategy. Pattern Recognition 36, 1583–1595 (2003)
2. Chang, C.-C., Chan, C.-S., Fan, Y.-H.: Image hiding scheme with modulus function and dynamic programming strategy on partitioned pixels. Pattern Recognition 39, 1155–1167 (2006)
3. Wang, R.-Z., Tsai, Y.-D.: An image-hiding method with high hiding capacity based on best-block matching and k-means clustering. Pattern Recognition 40, 398–409 (2006)
4. Chang, C.-C., Lu, T.-C.: Lossless information hiding scheme based on neighboring correlation. International Journal of Signal Processing, Image Processing and Pattern (2009)
5. Wang, R.Z., Lin, C.F., Lin, J.C.: Image hiding by optimal LSB substitution and genetic algorithm. Pattern Recognition 34, 671–683 (2001)
6. Li, X., Wang, J.: A steganographic method based upon JPEG and particle swarm optimization algorithm. Pattern Recognition 177, 3099–3109 (2007)
7. Chang, C.-C., Lin, C.-Y., Fan, Y.-H.: Lossless data hiding for color images based on block truncation coding. Pattern Recognition 41, 2347–2357 (2008)
8. Al-Taani, A.T., Al-Issa, A.M.: A novel steganographic method for gray-level images. International Journal of Computer and System Science and Engineering (2009)
Tone Recognition of Isolated Mandarin Syllables

Zhaoqiang Xie and Zhenjiang Miao

Institute of Information Science, Beijing Jiao Tong University, Beijing 100044, P.R. China
{08120470,zjmiao}@bjtu.edu.cn

Abstract. Mandarin is a tonal language. For Mandarin, tone identification is very important for speech recognition and pronunciation evaluation. Mandarin tone behavior varies greatly from speaker to speaker, and this presents the greatest challenge to any speaker-independent tone recognition system. This paper presents a speaker normalization algorithm designed to reduce this influence. Then, a basic neural network tone recognizer using recognition features extracted from the processed syllable is introduced. The system employs an improved pitch detector and a powerful tone identification method. Finally, the experimental results show that the tone recognition system classifies the four tones very well.

Keywords: Mandarin speech, tone recognition, pitch normalization, pitch contour, BP neural network.
1 Introduction

Mandarin is well known as a tonal language with four tones (1st tone, high-level; 2nd tone, rising; 3rd tone, falling-rising; and 4th tone, high-falling). There are a total of about 416 phonologically allowed syllables in Mandarin Chinese if the differences in tones are disregarded, but about 1345 syllables if they are considered. Accurate tone recognition is very helpful for Chinese recognition systems, because the tones of syllables carry lexical meaning in Chinese. Tone identification is therefore regarded as an important part of Mandarin speech recognition. It has been shown that the main difference among the tones lies in the pitch contours [1]. Hence, the performance of pitch extraction strongly affects the identification of tone. Many algorithms for pitch extraction have been studied extensively over the years, some of which exploit various characteristics of speech signals in the time domain, some in the frequency domain, and some in both [2]. One of the oldest digital methods for estimating the fundamental frequency (F0) of voiced speech is auto-correlation analysis. What is used in this paper is a simplified analysis technique, based upon an inverse filter formulation, which retains the advantages of both the autocorrelation and cepstral analysis techniques [4]. One of the biggest difficulties of tone recognition is that tone varies greatly, and the causes of tone variation, such as speakers and contexts, are too many and too complicated to solve. Even when spoken by the same speaker, syllables with the same tone usually do not have similar characteristics. Thus, a good method to model tone
variations is necessary. In this paper, we introduce our work on reducing the impact of gender. In the last step of the whole system, the tone value is identified. There are many tone identification methods based on pitch contours, for example, methods based on fuzzy pattern recognition [5], on HMM and the wavelet transform [6], or on artificial neural networks (ANN) [7]. The ANN has self-organization, fuzziness, redundancy and self-learning characteristics, and it also has good non-linear discrimination ability; its use in tone recognition is therefore effective. Among ANN applications, the multi-layer perceptron (MLP) trained with the back-propagation (BP) algorithm [8] is used very often. In this paper, a BP-trained MLP is adopted as the tone recognizer. The rest of this paper is organized as follows: Section 2 describes the composition of the tone recognition system. A pitch detection method is given in Section 3. In Section 4, the method of the final tone identification is discussed. The experiments and results are illustrated in Section 5. Finally, conclusions are given in Section 6.
2 Tone Recognition System

Tone is mainly described by the pitch contour, so an accurate extraction of pitch is necessary in a tone recognition system; the accuracy of this extraction essentially determines the correct rate of the system. For voiced INITIALs or zero INITIALs, the pitch contour begins with the onset of the word, but for an unvoiced INITIAL it begins with the FINAL, because an unvoiced INITIAL is much like noise. So the classification of voiced and unvoiced INITIALs must be done before the extraction of the pitch contour. There are many methods of pitch extraction, such as the classical autocorrelation function method [9], the one-level cutoff wave method, AMDF and Seneff's method [10], SIFT, etc. The extracted pitch sequence contains some multiple-pitch points, divided-pitch points and random error points; therefore, a smoothing algorithm is needed to correct such errors. The framework of the tone recognition system is shown in Fig. 1.
Fig. 1. The Diagram of Tone Recognition System
3 Tone Pitch Detection

3.1 End Point Detection

The unvoiced part of the input speech signal must be cut off, since this part is useless and has a serious impact on recognition. A special V/UV classification method based on ZCR and STAM is therefore adopted [3]. For isolated Mandarin syllables, classifying the voiced and unvoiced parts is sufficient.

3.2 Pitch Detection

Mandarin tone is the pattern of pitch variation, so it may be acquired by pitch extraction. Many pitch detection methods have been developed; in this paper the pitch detector using the SIFT algorithm [4] is adopted. The algorithm is based upon a simplified version of a general technique for fundamental frequency extraction using digital inverse filtering. It has been demonstrated that the simplified inverse filter tracking algorithm combines the desirable properties of both autocorrelation and cepstral pitch analysis techniques. In addition, the SIFT algorithm is composed of only a relatively small number of elementary arithmetic operations: in software it runs at several times real time, while with special-purpose hardware it could easily be realized in real time. That is why we choose it as our pitch detector. Our algorithm procedure is shown in Fig. 2.
Fig. 2. The SIFT Procedure
The time/frequency weighting avoids the spectral leakage generated by truncating the data to a limited frame width. The formula is:

p(n) = [rs(n) − rs(n−1)] × [0.54 − 0.46 × cos( 2π(n−1) / (FW/DR) )],  n = 1, 2, 3, …, FW/DR,  with rs(n) = 0 for n < 1   (1)

where FW/DR is the frame width after sampling.
The normalized autocorrelation function is:

ρ_e(i) = Σ_{n=1}^{N} e(n)w(n)e(n+i)w(n+i) / Σ_{n=1}^{N} [e(n)w(n)]²   (2)

where e(n) is the inverse filter output of the LPC analysis and w(n) is a Hamming window. Fig. 3 shows an example compared with Praat (frame increment 10 ms).
140
130
Pitch(Hz)
120
110
100
90
80 0
50
100
150
200
Frame
(a) 150 140
Pitch (Hz)
130 120 110 100 90 80 75 0
0.5
1
1.5
2
2.469
Time (s)
(b) Fig. 3. Estimated fundamental frequency curves for utterance “běi jīng jiāo tōng” (a) is gained by the above SIFT method, and (b) is from Praat. We can see the result is satisfactory.
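The following sketch illustrates the autocorrelation core of eq. (2) for frame-wise F0 estimation. It is a simplification under our own assumptions: the LPC inverse-filtering and decimation stages of the full SIFT algorithm are omitted, and the 0.3 voicing threshold and frame sizes are illustrative.

    import numpy as np

    def pitch_track(signal, fs, frame_ms=40, hop_ms=10, fmin=60, fmax=400):
        """Frame-wise F0 estimation: Hamming-windowed frames, normalized
        autocorrelation (eq. (2)), peak picking in the plausible lag range."""
        frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
        lag_lo, lag_hi = int(fs / fmax), int(fs / fmin)
        f0 = []
        for start in range(0, len(signal) - frame, hop):
            x = signal[start:start + frame] * np.hamming(frame)
            r = np.correlate(x, x, mode='full')[frame - 1:]   # autocorrelation, lags >= 0
            r = r / (r[0] + 1e-12)                            # normalization as in eq. (2)
            lag = lag_lo + int(np.argmax(r[lag_lo:lag_hi]))
            f0.append(fs / lag if r[lag] > 0.3 else 0.0)      # 0 marks unvoiced/unreliable frames
        return np.array(f0)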
3.3 F0 Smoothing Method
After observing some speech signal waveforms, we found that the result of the pitch calculation usually contains divided-frequency, multiple-frequency and random error points, so it needs smoothing. There are many smoothing methods, such as the linear
smoothing method, the mid-point smoothing method and the normal linear interpolation method, etc. These methods cannot treat consecutive random error points. From the physical process of voicing, we know that the pitch does not change with large magnitude in continuous speech, so we can solve the problem with a statistical method. Suppose F0(1), F0(2), F0(3), …, F0(N) are the N frame frequency points of the FINAL segment. The smoothing algorithm is as follows (a sketch is given below):

1) For all frames, if |F0(i) − F0(i−3)| > F0(i) × RUNAWAY and |F0(i) − F0(i+3)| > F0(i) × RUNAWAY (RUNAWAY = 0.3), then F0(i) is recognized as a real isolated point and we set F0(i) = 0.

2) For all frames, if F0(i) = 0 and F0(i−1) ≠ 0, find the first F0(k) ≠ 0 (k = i+1, …, i+3) and set F0(i) = (F0(i−1) + F0(k)) / 2.
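A direct transcription of the two rules, with our own boundary handling at the ends of the sequence:

    def smooth_f0(F0, runaway=0.3):
        """Two-rule smoothing of Section 3.3: remove isolated outlier points, then
        interpolate zeros from valid neighbours. F0 is a list of per-frame frequencies."""
        F0 = list(F0)
        n = len(F0)
        # Rule 1: a point far (relative threshold) from both its -3 and +3 neighbours is an outlier
        for i in range(3, n - 3):
            if (abs(F0[i] - F0[i - 3]) > F0[i] * runaway and
                    abs(F0[i] - F0[i + 3]) > F0[i] * runaway):
                F0[i] = 0.0
        # Rule 2: fill a zero that follows a valid frame with the mean of its valid neighbours
        for i in range(1, n):
            if F0[i] == 0.0 and F0[i - 1] != 0.0:
                for k in range(i + 1, min(i + 4, n)):
                    if F0[k] != 0.0:
                        F0[i] = (F0[i - 1] + F0[k]) / 2.0
                        break
        return F0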
4 Tone Recognition Scheme

4.1 Tone Feature Extraction

The dynamic pitch ranges of different speakers differ greatly, from about 80 Hz to 350 Hz. Therefore, speaker-independent tone recognition must normalize F0. Here, the F0 contour of each utterance is normalized by its own mean to reduce inter-speaker variability:

f0(i) = F0(i) / Fmean;  e(i) = E(i) / Emean   (3)

where F0(i) is the ith frame frequency point and E(i) is the corresponding ith frame energy. Based on this F0 normalization, each signal frame is represented by a vector of 5 features [f0(t), Δf0(t), ΔΔf0(t), e(t), Δe(t)], where

Δf0(t) = f0(t+1) − f0(t);  ΔΔf0(t) = Δf0(t+1) − Δf0(t);  Δe(t) = e(t+1) − e(t)   (4)
The F0 and energy curves are divided uniformly into three sub-contours, and we take the means of these sub-contours as the final tone features, so our tone feature vector dimension is fifteen. These features are then fed into an MLP pattern recognizer for tone recognition (a feature-extraction sketch follows).
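A sketch of the feature extraction of eqs. (3)-(4) followed by the sub-contour averaging; it assumes the F0 and energy sequences cover the voiced segment only.

    import numpy as np

    def tone_features(F0, E):
        """Mean-normalize F0 and energy (eq. (3)), take first and second differences
        (eq. (4)), then average each of the five curves over three uniform
        sub-contours, yielding a 15-dimensional feature vector."""
        f0 = np.asarray(F0, float) / np.mean(F0)
        e = np.asarray(E, float) / np.mean(E)
        df0 = np.diff(f0)            # delta f0(t) = f0(t+1) - f0(t)
        ddf0 = np.diff(df0)          # delta-delta f0
        de = np.diff(e)              # delta e(t)
        feats = []
        for curve in (f0, df0, ddf0, e, de):
            for part in np.array_split(curve, 3):
                feats.append(float(part.mean()))
        return feats                 # 15-dimensional tone feature vector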
4.2 MLP Model

The MLP model consists of three layers: the input layer, the hidden layer and the output layer. The hidden layer may itself comprise multiple layers, but in this system all MLP models have only one hidden layer, for simplicity [11]. The output layer consists of output nodes corresponding to the four tones. Each neuron output is the sigmoid function of the weighted sum of its inputs. The MLP recognizer is trained by the back-propagation (BP) rule, which minimizes the mean squared error between the feed-forward outputs and the desired targets.
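For reference, here is a minimal one-hidden-layer MLP with sigmoid units and a single back-propagation update on the squared error; the layer sizes (15 inputs, 20 hidden units, 4 outputs) and learning rate are illustrative assumptions, not the paper's settings.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class MLP:
        """Minimal one-hidden-layer perceptron trained by back-propagation
        on mean squared error (dimensions are illustrative)."""
        def __init__(self, n_in=15, n_hidden=20, n_out=4, lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
            self.W2 = rng.normal(0, 0.1, (n_hidden, n_out))
            self.lr = lr

        def forward(self, x):
            self.h = sigmoid(x @ self.W1)
            self.y = sigmoid(self.h @ self.W2)
            return self.y

        def backprop(self, x, target):
            y = self.forward(x)
            d_out = (y - target) * y * (1 - y)                 # gradient of MSE through sigmoid
            d_hid = (d_out @ self.W2.T) * self.h * (1 - self.h)
            self.W2 -= self.lr * np.outer(self.h, d_out)
            self.W1 -= self.lr * np.outer(x, d_hid)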
5 Experiments and Results

The experiments are carried out on a speaker-independent isolated-syllable database in Chinese. The database consists of 5 male and 6 female speakers; the sample rate is 16 kHz and the number of bits per sample is 16. Every person pronounced 1588 syllables with tone. Each syllable is pronounced four times using the 4 tones, even though some tones do not occur in daily life: for example, although “tian4” (the number is the tone value) is not pronounced in daily life, it was also recorded. The training set uses only one standard female voice; the reason for doing so is to examine whether the normalization method is effective. Table 1 describes the differences in the features ([f0(t), Δf0(t), ΔΔf0(t), …], given by formulas (3)-(4)) between male and female speakers, taking the four tones of the syllable “Xin” as an example. As can be seen from the table, after normalization, the difference in the features between male and female is so small that it can basically be ignored. Because of this, even though we use very few training samples and the training set contains no male samples, the results for both the male and female test sets are very good, as can be seen from Table 2.
Table 1. The normalized features of male and female speakers on the same tone

      Male                                                                 Female
Xin1  (1.004, 0.994, 1.005, 0.001, 0.024, 0.056, -0.017, 0.007, 0.015, …)  (1.023, 0.978, 0.996, -0.276, 0.004, 0.033, 0.139, 0.011, -0.051, …)
Xin2  (0.748, 1.06, 1.206, 1.62, 0.409, 0.558, 0.025, 0.016, -0.016, …)    (0.862, 0.966, 1.271, 1.736, 1.996, 0.692, 0.005, 0.026, 0.003, …)
Xin3  (1.008, 0.898, 1.097, 0.559, 0.067, 0.714, 0.047, 0.05, 0.003, …)    (1.072, 0.89, 1.024, -0.664, 0.118, 0.658, 0.093, -0.051, 0.022, …)
Xin4  (1.216, 0.997, 0.768, -0.17, -1.506, 0.263, -0.08, -0.02, 0.15, …)   (1.189, 1.024, 0.768, -0.268, -1.301, 0.912, -0.109, -0.127, 0.195, …)
Table 2 shows the confusion matrix of experiment 3. In this table, we can see that tone 3 is prone to be confused with tone 2. In fact, an isolated tone 3 is quite similar to tone 2, except that tone 3 is lower than tone 2 and sometimes becomes creaky voice.

Table 2. The identification accuracy of the four tones (percent)

Input (count)    Tone 1 (5832)  Tone 2 (5892)  Tone 3 (5384)  Tone 4 (5230)
Output Tone 1    95.32%         0.45%          0.59%          0.32%
Output Tone 2    0.49%          92.12%         5.39%          0.08%
Output Tone 3    2.11%          4.98%          90.93%         1.02%
Output Tone 4    0.43%          0.12%          1.28%          96.29%
6 Conclusion

Several tone recognition schemes based on artificial neural networks have been discussed in this paper. First, we presented a tone pitch detector using SIFT and a smoothing scheme. In the recognizer, our analysis shows that the normalization method can reduce the influence of gender, and the experimental results verified that it works very well. Future work will focus on how to apply these conclusions to large-vocabulary speech recognition systems.
Acknowledgment This work is supported by Innovative Team Program of Ministry of Education of China (IRT0707).
References

[1] Lee, L.S.: Voice dictation of Mandarin Chinese. IEEE Signal Processing Magazine 14(4), 63–101 (1997)
[2] Rabiner, L.R., Cheng, M.L., Rosenberg, A.E., McGonegal, C.A.: A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoust., Speech, Signal Processing ASSP-24, 399–417 (1976)
[3] Xie, X., Xie, L.: A new approach for tone recognition of isolated Mandarin syllables. In: Communications and Information Technologies (2006)
[4] Markel, J.D.: The SIFT algorithm for fundamental frequency estimation. IEEE Trans. AU-20, 361–377 (1972)
[5] Shilin, X.: Fuzzy tone recognition for Chinese speech. Acta Electronica Sinica, China 24(1), 119–121 (1996)
[6] Jun, C., Kechu, Y., Bingbing, L.: A tone recognizer using wavelet transform and Hidden Markov Model. Journal of Electronics and Information Technology, China 19(2), 177–182 (1997)
[7] Fang, S., Guangrui, H.: Chinese tones recognition based on a new neural network. Journal of Shanghai JiaoTong University 31(5), 36–38 (1997)
[8] Tang, L., Yin, J.: Mandarin tone recognition based on pre-classification. In: Proceedings of the 6th World Congress on Intelligent Control and Automation, June 21-23 (2006)
[9] Rabiner, L.R., Schafer, R.W.: Digital Processing of Speech Signals. Science Press, Beijing (1983)
[10] Seneff, S.: Real-time harmonic pitch detector. IEEE Trans. Acoustics, Speech and Signal Processing 26(4), 358–365 (1978)
[11] Qing, S., Lin, T.: Introduction of Pattern Recognition. National Defense University of Technology Press, Changsha (1991)
A Semantic Proximity Based System of Arabic Text Indexation

Taher Zaki1, Driss Mammass1, and Abdellatif Ennaji2

1 Ibn Zohr University, Agadir, Morocco
[email protected]
[email protected]
2 LITIS EA 4108, University of Rouen, France
[email protected]

Abstract. In this paper, we extend the vectorial model of Salton [9], [11], [12] and [14] by adapting the TF-IDF parameter through its combination with the Okapi formula for index term extraction and evaluation, in order to identify the relevant concepts that represent a document. Indeed, we propose a new measure, TFIDF-ABR, which takes into consideration the notion of semantic vicinity by using a measure of similarity between terms that combines the TF-IDF-Okapi computation with a kernel approach (radial basis function). This indexing approach allows contextual and semantic retrieval. In order to have a robust descriptor index, we use not only a semantic graph to highlight the semantic connections between terms, but also an auxiliary dictionary to increase the connectivity of the constructed graph and therefore the semantic weight of the indexing words.

Keywords: Document indexation, semantic graph, semantic vicinity, dictionary, kernel function, Okapi formula, similarity, TF-IDF, vectorial model.
1 Introduction

The explosive growth of textual information, in particular information published and easily accessible worldwide, requires the implementation of useful techniques for extracting information from the forms that appear in large corpora of texts. It has therefore become essential to design methods making it possible to exploit both the structure and the textual contents of such corpora. The goal of indexing is to identify, locate and easily retrieve information in a set of documents. Indexing systems are generally used for information retrieval, but also for comparing and classifying documents, for keyword extraction, for automatic summarization of documents, for calculating co-occurrences of words, and so on. Any document indexing process leads to a loss of some part of the initial information. Some recent works are interested in redefining the task of information retrieval and indexing for semi-structured documents.
In this paper, we define a new formalism for the statistical processing of Arabic textual documents and we show how this formalism can be used for various problems, including indexing and classification. Our work is positioned in the context of information retrieval based on statistical learning, which allows the development of generic methods easily used on different corpora. Section 2 briefly introduces the indexing phase. Then, in Section 3, we describe some semantic resources used by our approach. In Sections 4 and 5, our framework is presented. Next, we present experimental results in Section 6. Finally, we conclude and discuss some perspectives of this work in Section 7.
2 Indexing Phase

One way of tackling the problem of document indexing and classification is to take into account some explicit information around the text, like the structure and the distribution of words, as well as implicit semantic information.

2.1 Weighting of Index Units

The goal is to find the terms that best characterize the contents of a document. The easiest way to calculate the weight of a term is to calculate its occurrence frequency, because a term which often appears in a document can best characterize its contents. Several term weighting functions have been proposed [15]. We are interested in the TF-IDF (term frequency - inverse document frequency) of the vectorial model, which is used with some adaptations in this work. The dimensions retained to calculate the weight of a term are:

- A local weight that determines the importance of a given term in the document. It is usually represented by the term frequency (tf).
- A global weight that determines the distribution of the term in the document database. It is generally represented by the inverse of the frequency of documents containing the term (idf).

a_ij = tf(i,j) × idf(i) = tf(i,j) × log( N / N_i )   (1)
where tf(i,j) is the frequency, i.e., the number of times the term t_i appears in document d_j, and idf(i) is the inverse document frequency, i.e., the logarithm of the ratio between the number N of documents in the corpus and the number N_i of documents that contain the term t_i. This indexing scheme gives higher weight to words that appear with high frequency in few documents. The underlying idea is that these words help to discriminate between texts on different subjects. The TF-IDF has two fundamental limits. First, its dependence on the term frequency is too strong: if a word appears twice in a document d_j, that does not necessarily mean that it is twice as important as in a document where it appears only once. Second, the weights of long documents are higher because they
contain more words and the term frequencies tend to increase. To address these problems, a new indexing technique, the Okapi formula, has been proposed [8]:

a_ij = tf(i,j) × idf(i) / ( [ (1 − b) + b × NDL(d_j) ] + f(i,j) )   (2)

where NDL(d_j) is the normalized length of d_j (the number of words it contains divided by the average length of the documents in the corpus).
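A small sketch of eq. (2); we read the additive f(i,j) term as the raw term frequency and use the conventional b = 0.75, both assumptions on our part.

    import math

    def okapi_tfidf(tf, df, N, doc_len, avg_len, b=0.75):
        """Okapi-style weighting: dampens raw term frequency and penalizes
        long documents via the normalized document length NDL(dj)."""
        idf = math.log(N / df)
        ndl = doc_len / avg_len              # NDL(dj)
        return tf * idf / (((1 - b) + b * ndl) + tf)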
3 The Semantic Resources

3.1 The Auxiliary Semantic Dictionary

It is a hierarchical dictionary with a standard vocabulary based on generic terms and on terms specific to a domain. It provides definitions, the relations between terms, and the preferred choice among their meanings. The relations commonly expressed in such a dictionary are:

- the taxonomic relations (hierarchy);
- the equivalence relations (synonymy);
- the associative relations (relations of semantic proximity: close to, connected to, etc.).

3.2 Taxonomy

We introduce a taxonomy as the organization of concepts linked by hierarchical relations [21].

3.3 Semantic Networks

Semantic networks [5] were initially introduced as a model of human memory. A semantic network is an oriented, labelled graph (or, more precisely, a multigraph). An arc connects (at least) one starting node to (at least) one arrival node. The relations range from semantic proximity relations to part-of, cause-effect and parent-child relations. Concepts are represented as nodes and their relations as arcs. The inheritance of properties is materialized by a kind-of arc between nodes. Different types of connections can be mixed, as well as concepts and instances. Various developments have thus emerged and led to the definition of formal languages.
4 Semantic Indexing Based on Radial Basis Functions

Several studies have adapted the vectorial model by directly indexing concepts instead of terms. These approaches primarily treat synonymy by replacing terms with their concepts. We consider the rich links between terms by taking into account all the semantic types of relations (ontology of the field). This can solve the problem of synonymy and also avoids the complications caused by the other relations, such as specialization and generalization.
4.1 The Proposed Indexation and Classification System

Unlike existing methods, we do not use concepts only. Indeed, the terms are enriched if they are linked to concepts or if they provide good semantic connectivity. It is important to note that, during search, we also find terms which are not linked to the ontology. We calculate the proximity between terms. Thus, we define a radial basis function (RBF) which associates with each term a region of influence characterized by the degree of semantic similarity and the relationship between the kernel term and its neighbors. Rada et al. [6] were the first to suggest that similarity in a semantic network can be calculated based on the taxonomic "is-a" links. One of the most obvious ways to evaluate semantic similarity in a taxonomy is to calculate the distance between nodes by the shortest path. In [7], the idea was to calculate the paths linking each concept to its closest ancestor at the top of the ontology. We are aware that calculating the similarity measure by restriction to the "is-a" relation is not always suitable, because taxonomies are not always at the same level of granularity; some parts may have a higher density than others. These problems can be solved by associating weights with the links, so we have chosen to take into consideration all types of relationships (the conceptual problem) and the distribution of words in documents (the structural problem). However, the automatic calculation of the degree of semantic relatedness is difficult, and much previous work has focused on similarity measures, often based on known hierarchies (e.g. WordNet [1] and Wikipedia [3]). We have adapted our system to support any kind of semantic relation, such as synonymy, meronymy, hyponymy, taxonomy, antonymy, etc., and we initially assign a unit weight to semantic links. A semantic network is constructed in each phase to model the semantic relationships between terms. To avoid connectivity problems, we build an auxiliary dictionary that strengthens the connectivity of the constructed network and thereafter increases the weight of the semantic descriptor terms. In the next section, we define our TF-IDF measure with radial basis function and we show how the weights of the indexing terms are enriched from the outputs of this measure.

4.2 Text Preprocessing

All text documents went through a preprocessing stage. This is necessary due to the variations in the way text can be represented in Arabic. The preprocessing is performed for the documents to be classified and for the learning classes themselves. Preprocessing consists of the following steps:

- Convert text files to UTF-16 encoding.
- Remove punctuation marks, diacritics, non-letters and stop words.
- Stem the text using the Khoja stemmer [4].
4.3 KNN Text Classification

In many application frameworks, this approach has shown interesting results. Indeed, the probability of error of a k-NN classifier converges to the Bayes risk as the quantity of learning data increases, regardless of k. However, it has several drawbacks. Its robustness depends directly on the relevance and quality of the learning set. Another drawback concerns access to the learning set, which generally requires a large memory space and time-consuming computation. In this paper, we use the k-NN classifier as a reference in the experiments. In fact, the k-NN can be used with few data, which is a very interesting property in our context. Instead of the Euclidean distance, we chose the measure of relevance combined with the Dice measure after reducing the learning set. The text preprocessing phase is first applied to each document to be classified. Then the RBF Okapi-TFIDF profile is generated (RBF for radial basis function). The RBF Okapi-TFIDF profile of each text document (document profile) is compared against the profiles of all the documents in the learning classes (class profiles) in terms of similarity. The second measure used is the Dice measure of similarity:

Dice(P_i, P_j) = 2 |P_i ∧ P_j| / ( |P_i| + |P_j| )   (3)

where |P_i| is the number of elements in profile P_i and |P_i ∧ P_j| is the number of elements that are found in both P_i and P_j.
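The profile comparison of eq. (3) and the k-NN vote can be sketched as follows; the set-based profile representation and k = 3 are our own illustrative choices.

    from collections import Counter

    def dice(profile_i, profile_j):
        """Eq. (3): Dice similarity between two term profiles (sets of index terms)."""
        pi, pj = set(profile_i), set(profile_j)
        return 2.0 * len(pi & pj) / (len(pi) + len(pj))

    def knn_classify(doc_profile, labelled_profiles, k=3):
        """k-NN over profile similarity; labelled_profiles is a list of (profile, class) pairs."""
        ranked = sorted(labelled_profiles, key=lambda pc: dice(doc_profile, pc[0]), reverse=True)
        votes = Counter(c for _, c in ranked[:k])
        return votes.most_common(1)[0][0]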
4.4 The TF-IDF with Radial Basis Functions

The TF-IDF with radial basis functions (RBF-TFIDF) is based on the determination of supports in the representation space E. However, unlike the traditional TF-IDF, these can correspond to fictitious forms which are a combination of traditional TF-IDF values; we will call them prototypes. They are associated with an area of influence defined by a distance (Euclidean, Mahalanobis, ...) and a radial basis function (Gaussian, bell-shaped, ...). The output discriminant function g of the RBF-TFIDF is defined from the distance of the input form to each of the prototypes and the linear combination of the corresponding radial basis functions:

g(X) = w_0 + Σ_{i=1}^{N} w_i φ( d(X, sup_i) )   (4)

where d(X, sup_i) is the distance between the input X and the support sup_i, {w_0, ..., w_N} are the weights of the combination and φ is the radial basis function. The learning of this type of model can be done in one or two steps. In the first case, a gradient-type method is used to adjust all the parameters by minimizing an objective function based on a criterion such as least squares. In the second case, a first step determines the parameters associated with the radial basis functions (positions of the prototypes and areas of influence); to determine the centers, methods of unsupervised
classification are often used. In a second step, the weights of the output layer can be learned by different methods such as pseudo-inverse or gradient descent. The RBF-TFIDF has several advantages in the case of two-step learning. For example, the separate learning of the radial basis functions and of their combination makes learning faster and simpler and avoids the problems of local minima (local and global relevance); the prototypes of the RBF-TFIDF represent the distribution of the examples in the representation space E (terms). Moreover, the management of multi-class problems is easier with RBF-TFIDF. The modeling of RBF-TFIDF is both discriminant and intrinsic: the layer of radial basis functions corresponds to an intrinsic description of the learning data, while the output combination layer seeks to discriminate the different classes. In this paper, we use RBF-TFIDF with two-step learning. The radial basis function is of Cauchy type:

φ(d) = 1 / (1 + d)   (5)

We have defined two new operators:

a) the relational weight:

WeightRel(t) = degree(t) / total number of concepts   (6)

b) the semantic density:

SemDensity(c1, c2) = Dist(c1, c2) / cost of the minimum spanning tree   (7)

Thus, the semantic distance between two concepts is

DistSem(c1, c2) = WeightRel(c1) × WeightRel(c2) × SemDensity(c1, c2)   (8)

The proximity measure is a Cauchy function:

Proximity(c1, c2) = 1 / ( 1 + DistSem(c1, c2) )   (9)
Here, degree(t) is the number of incoming and outgoing edges of node t, and Dist(c1, c2) is the minimum distance between c1 and c2, calculated by Dijkstra's algorithm [2], [16] applied to the semantic network built from the document. For the indexing phase, we will see below how the weights of the index descriptors are generated by the radial basis measures taking a semantic distance as parameter.
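A sketch of eqs. (6)-(9), using the networkx library as an assumed helper for the Dijkstra distances and the minimum spanning tree; the paper does not prescribe an implementation, and we assume an undirected weighted graph view of the semantic network.

    import networkx as nx  # assumed graph library, not prescribed by the paper

    def proximity(G, c1, c2):
        """Relational weights from node degrees (eq. 6), semantic density as the
        Dijkstra distance over the minimum-spanning-tree cost (eq. 7), their
        product (eq. 8), and the Cauchy proximity (eq. 9)."""
        n = G.number_of_nodes()
        w1, w2 = G.degree(c1) / n, G.degree(c2) / n
        mst_cost = nx.minimum_spanning_tree(G).size(weight='weight')
        dist = nx.dijkstra_path_length(G, c1, c2)
        dist_sem = w1 * w2 * (dist / mst_cost)
        return 1.0 / (1.0 + dist_sem)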
5 New Weights of the Index Descriptors

The documents are represented by sets of term vectors. The weights of terms are calculated according to their distribution in the documents. The weight of a term is enriched by the conceptual similarities of words co-occurring in the same topic. We proceed by calculating the TF-IDF of the terms over the set of themes of the learning base in order to deduce their global relevance, and then we calculate their local relevance by our radial basis function combined with the traditional TF-IDF, accepting only the terms located in the zone of influence. This weight, denoted RBF-TFIDF(t), is calculated by the formula:

RBF-TFIDF(t, theme) = TFIDF(t, theme) + Σ_{i=1}^{n} TFIDF(t_i, theme) × φ( Proximity(t, t_i) )   (10)

with φ(Proximity(t, t_i)) < threshold, where t_i belongs to the set of n terms in the topic. The threshold is a value which restricts the proximity to a certain vicinity (the zone of semantic influence of the term t); we initially fix it to the proximity between the concept of t and the context concept (the concept which represents the topic).

5.1 Extension of the Okapi Formula
To avoid the drawbacks of the TF-IDF measure, and in order to make it more robust, we opted for the Okapi model proposed by [8], with a semantic extension. For this, the function φ(d) computes the degree of relevance of each term according to its semantic proximity level (zone of influence). The new formula is:

a_ij = tf(i,j) × idf(i) × φ(d_j) / ( [ (1 − b) + b × NDL(d_j) ] + f(i,j) )   (11)

or, more simply,

a_ij = TFIDF-ABR(i,j) / ( [ (1 − b) + b × NDL(d_j) ] + f(i,j) )   (12)

where d_j is the set of the terms of d_j that are semantically close to t_i. A similarity threshold is necessary to characterize this set of elements. We set the similarity threshold to the value of Proximity(t_i, t), which corresponds to the degree of similarity between t and the concept of the theme where it appears (the term is accepted if it is in the influence zone of the kernel term defined by the radial basis function φ).
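A sketch of the enrichment step underlying eqs. (10)-(12); `tfidf` and `proximity` stand for the previously defined measures, and the threshold value shown is a placeholder for the context-concept proximity described above. The acceptance test follows eq. (10) literally, keeping contributions whose φ value is below the threshold.

    def rbf_tfidf(term, theme_terms, tfidf, proximity,
                  phi=lambda d: 1.0 / (1.0 + d), threshold=0.5):
        """Eq. (10): enrich a term's TF-IDF with the TF-IDF of semantically close
        terms, weighted by the Cauchy radial basis function."""
        weight = tfidf(term)
        for t in theme_terms:
            if t == term:
                continue
            contrib = phi(proximity(term, t))
            if contrib < threshold:          # keep only terms inside the influence zone
                weight += tfidf(t) * contrib
        return weight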
6 Results

For the learning phase, we worked on a very small database (initial corpus) of labeled documents representing the classes (sport, politics, economy and finance) that we seek to discriminate or to learn; this is the strong point of our measure.
The more discriminating and representative this base is, the more efficient our method is and the better the results are. For the test phase, we worked on a corpus of 200 press documents from the Associated Press (AP), which is a very rich and varied base of 2246 English documents [20]. For the Arabic documents, we worked on a corpus of 300 documents from the Arabic electronic press ([17], [18] and [19]).

Table 1. Results of the experimentation
Corpus    Method            Classifier    Recall    Precision    Accuracy (%)
English   TFIDF-ABR         k-NN          0.89      0.92         90.5
Arabic    TFIDF-Okapi-ABR   k-NN          0.98      0.98         98.79
7 Conclusion

After validation of the prototype, our method has shown robustness and adaptability for both Arabic and English corpora. In addition, the results of the indexing contain exactly the required keywords, sorted by relevance. We set a threshold for semantic enrichment which keeps out most intruder words far from those sought. Many elements remain to be evaluated, in particular the addition of a disambiguation algorithm. The handling of complex (multi-word) concepts may also prove an interesting track, since longer concepts are often less ambiguous.
Acknowledgment This work is supported with grants from the “Action intégrée Maroco-Française” n° MA/10/233 and the AIDA project, program Euro-Mediterranean 3 +3: n° M/09/05.
References

1. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of semantic distance. Computational Linguistics 32(1), 13–47 (2006)
2. Dijkstra, E.W.: A Short Introduction to the Art of Programming, containing the original article describing Dijkstra's algorithm, pp. 67–73
3. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proc. IJCAI'07, pp. 1606–1611 (2007)
4. Khoja, S., Garside, S.: Stemming Arabic text. Computing Department, Lancaster University, Lancaster, September 22 (1999), http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps
5. Quillian, M.R.: Semantic memory. In: Semantic Information Processing (1968)
6. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 19(1), 17–30 (1989)
7. Resnik, P.: Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95–130 (1999)
8. Robertson, S., Walker, S., Beaulieu, M.: Experimentation as a way of life: Okapi at TREC. Information Processing and Management 36(1), 95–108 (2000)
9. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, 513–523 (1988)
10. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
11. Salton, G., Yang, C.S., Yu, C.T.: A theory of term importance in automatic text analysis. Journal of the American Society for Information Science and Technology 26(1), 33–44 (1975)
12. Salton, G., Fox, E.A., Wu, H.: Extended boolean information retrieval. Communications of the ACM, 1022–1036 (1983)
13. Salton, G., Singhal, A., Buckley, C., Mitra, M.: Automatic text decomposition using text segments and text themes. In: UK Conference on Hypertext, pp. 53–65 (1996)
14. Salton, G.: The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs (1971)
15. Seydoux, F., Rajman, M., Chappelier, J.C.: Exploitation of external semantic knowledge in vectorial representations for document retrieval. Ph.D. thesis (2006)
16. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., Section 24.3: Dijkstra's algorithm, pp. 595–601. MIT Press, McGraw-Hill (2001) ISBN 0-262-03293-7
17. Al Jazeera, http://www.aljazeera.net/
18. Al charq Al awsat, http://www.aawsat.com/
19. Al ahdat Al maghrebiya, http://www.almaghribia.ma/
20. Associated Press, http://www.cs.princeton.edu/~blei/lda-c/ap.tgz
21. Wikipedia, http://en.wikipedia.org/
Classification of Multi-structured Documents: A Comparison Based on Media Image Ali Idarrou1,2, Driss Mammass2, Chantal Soulé Dupuy1, and Nathalie Valles-Parlangeau1 1 IRIT, groupe SIG, Université Paul Sabatier 118, route de Narbone, 31062 Toulouse cedex 9 France 2 IRF – SIC Université Ibn Zohr Agadir Maroc
[email protected],
[email protected], {chantal.soule-dupuy,nathalie.valles-parlangeau}@univ-tlse1.fr
Abstract. This paper focuses on the structural comparison of multimedia documents. Most systems that process multimedia documents exploit only their textual part. However, text is no longer the only means of carrying information, and the major issue is to extend these systems to other modalities, notably the image, which is one of the basic components of multimedia documents. The complexity of multimedia documents, multi-structured in essence, calls for a structural representation not merely as trees but as graphs. Graphs are well suited to describing these documents: for example, one can describe the components of a scene in an image, the relations between these components, their positions (spatial relations), etc. We propose a new graph similarity measure based on a univocal matching between the graphs to compare. Our approach takes into account structural information and the specificities of multimedia information. We evaluate our measure on a corpus of multi-structured documents from the INEX 2007 corpus. Keywords: Documents, multimedia information, image, clustering, classification, similarity, graph matching.
1 Introduction The technological evolution of digitization, coding, compression, storage and communication contributes to the massive availability of digital multimedia information (text, image, audio, video, ...), whose exploitation requires appropriate methods, models and tools. Digital multimedia documents are complex in essence. Since they arise from the composition of different documents, themselves more or less complex, their information is not as directly accessible as in text documents. This obliges us to use tools and techniques allowing a synthetic and complete description of this type of multi-structured document. Each multimedia document has its own structure, and each document composing it also has at least one structure, if not several. Their management and, in
particular, their storage requires that we no longer consider these documents as simple trees but rather as graphs. In fact, multi-structuralism induces more complex relations (there may be multiple relations between the same two nodes). Graphs are mathematical tools for modelling structured objects; they are well suited to representing digital multimedia documents in XML format and the sub-documents they contain. For example, one can describe the objects and sub-objects composing a scene of an image, the relations between these objects, and their positions (spatial relations, etc.). Unlike classical image description methods, such as texture descriptors or colour histograms, graphs allow the modelling of different facets of image content, such as the nature of the objects, their spatial relations, etc. Most systems that process multimedia documents exploit only their textual part. The major issue is to extend these systems to other modalities, notably the image, which is one of the basic and very frequent components of multimedia documents. Describing text and image (and other modalities) with the same representation model allows several methods to be combined in order to exploit multimedia information easily. In information retrieval, for example, the user's need can be a fragment of text (or image), an image, or text + image, and the result can be one or more subgraph(s). This technique gives access to the relevant information in multimedia documents, and equally allows access to an image by context (using the neighbouring nodes of the image). Documentary classification is a solution for optimizing the access time to relevant information in a large mass of data. We are interested in structural clustering as an organization of documents into classes of typologically similar documents having similar structures. In this paper, we consider the "text" and "image" media, but our approach can be applied to other types of media (audio and video). We take into account structural information and multimedia specificities and present a new graph similarity measure based on a univocal matching between graphs. This matching takes into account both the ontological context and the hierarchical context of the components of the graphs to match. In the following section, we present some examples of applications using graphs to represent objects and review the graph similarity measures most used in the literature. In the third section, we present our approach to multimedia document comparison, and we finish with experiments on the INEX 2007 corpus.
2 Representation of Objects Using Graphs: Comparative Study Graphs are widely used in many applications, such as information retrieval, image processing, and organic chemistry (to represent molecules). If objects are represented by graphs, measuring their similarity means matching their graphs so as to find their common points and their differences. The set of techniques and mathematical tools developed in graph theory makes it easy to demonstrate
properties, to deduce the characteristics of an object from a similar object, etc. Several previous works have used graphs as representation models for analyzing and comparing objects. Graphs are used in various domains to represent objects. In organic chemistry, they describe the structures of molecules: every atom is represented by a node and each bond between atoms by an arc, the nodes and arcs carrying labels symbolizing the object they model. This makes it possible to synthesize molecules and to show, for example, that one molecule enters into the composition of another [4]. In CAD, multi-labelled graphs are used to represent computer-aided design objects [8], [9]: each graph node represents a component of the object and each arc a binary relation between two components, with labels assigned to nodes and arcs to express the different types of components and relations. The objective in this context is to automatically identify all the previously designed objects similar to a new object to be designed. The matching used in this application is multivocal: a component of one graph is put in correspondence with more than one component of the other graph (two components of one graph, for example, can play the same role as one component of the other). Let us note that two objects can be similar even though they are not identical. Labelled graphs are also used to represent images to compare [12]: images are often segmented into regions, nodes represent the regions (labelled by their properties: colour, size in the scene, ...) and arcs represent the various possible binary relations between regions. In our previous works [1], [2], we used graphs to represent multimedia documents in order to classify them within the framework of a documentary warehouse; the goal was to find, among a set of multimedia documents, the one most similar to a given document. Different graph similarity measures exist in the literature. We present in what follows an overview of the most relevant: - To calculate the similarity of two XML documents represented by trees, [6] introduced the following measure:
Sim(T, T') = 1 - \frac{\sum_{v_j} D_{anc}(v_j)}{\sum_{v_j} P_{anc}(v_j)}    (1)
where Danc(vj) represents the distance of the ancestors of vj and Panc(vj) represents the weight of the path of vj. - To evaluate the similarity of two trees T1 and T2 representing XML documents, [3] proposed the following measure:
Sim(T_1, T_2) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(e_{1i}, e_{2j})}{\max(|T_1|, |T_2|)}    (2)
where |T1| is the size of T1 and sim(e1i, e2j) represents the similarity of the nodes e1i and e2j belonging respectively to T1 and T2.
- In [8], [9], the similarity of two labeled graphs G and G', representing the objects to be designed, is based on a multivalent matching of the vertices of these two graphs: a component of one object can play the same role as several components of the other object:
Sim(G, G') = \max_{M \subseteq V \times V'} \frac{f\big(descr(G) \cap_M descr(G')\big) - g\big(splits(M)\big)}{f\big(descr(G) \cup descr(G')\big)}    (3)
where f and g are two functions that depend on the application, splits(M) indicates the set of nodes of one graph that are matched to more than one node of the other graph, and descr(G) ∩M descr(G') indicates all the characteristics common to G and G' with regard to the matching M. - In [10], the authors showed that measure (3) is not directly suited to certain contexts, and they proposed a new similarity measure for graphs that represent images. They showed that this measure is generic because it is parameterized by two functions, Simv and Sime, which depend on the considered application:
Sim(G, G') = \max_{m \subseteq V \times V'} \frac{\sum_{v \in V \cup V'} Sim_v\big(v, m(v)\big) + \sum_{(u,v) \in E \cup E'} Sim_e\big((u,v), m(u,v)\big)}{|V \cup V'| + |E \cup E'|}    (4)
where G = (V, E) and G' = (V', E'), V (resp. V') is the set of nodes of G (resp. G'), E (resp. E') is the set of arcs of G (resp. G'), and m is a graph matching. The above examples concern techniques for comparing objects represented by graphs, based on: - The search for a univocal matching: a node and/or arc of one graph can be put in correspondence with at most one node and/or arc of the other graph (graph isomorphism, subgraph isomorphism, largest common subgraph of two graphs). - The search for a multivocal matching: a node and/or arc of one graph can be put in correspondence with one or several nodes and/or arcs of the other graph. We notice that these similarity measures are all based on the search for a best matching between the graphs to compare, and that the preferences and constraints imposed on the desired matching depend, for each measure, on the application. They are not defined in the same context, which each time imposes constraints on the desired matching between graphs; these constraints depend on the problem to be solved and differ from one application to another. It is thus necessary to find, each time, the best fit between the objective and the effective behaviour of the defined measure. Therefore, it is difficult to compare and apprehend the results of the many studies that have addressed this topic.
However, measure (2) takes into account neither the preservation of arcs nor the order of the matched components; yet two nodes can be synonymous and still have different semantics in the two trees. On the other hand, measures (3) and (4) do not take into account the distribution of the components of the two graphs (depth level, order of components). As for measure (1), it only evaluates the number of nodes of T found in T' (it does not consider the relations between nodes). In the next section, we present our approach based on the image medium.
3 Structural Comparison of Multi-structured Documents Unlike conventional document comparison approaches, which are based on surface similarity and exploit only the textual part of documents, our approach takes into account structural information and multimedia specificities to make the document comparison process more effective. In this paper, we focus on the "text" and "image" media, but our approach can be applied to other media (audio and video). Generally, images are segmented into regions: every graph node then represents a region and the arcs represent the various possible relations between regions. Image segmentation is not our objective but our point of departure; whatever facet is used, the image can be represented by a graph. Describing text and image (and the other modalities) with the same representation model allows several modalities to be combined jointly in order to exploit multimedia information easily. In information retrieval, for example, a user may need a fragment of text (or image), an image, or text + image, and the returned result can be one or more subgraph(s). This description also allows access to an image by context (using the neighbouring nodes of the image). The choice of graphs as a description tool is consistent with the representation of multimedia documents, whose structural complexity is high: these tools allow a rich modelling of objects (and sub-objects) and of their relations. To compare two documents, it suffices to compare their graphs. This matching evaluates the number of nodes and arcs they have in common; more precisely, (sub)graph isomorphism can show that one graph is included in the other (or that both graphs are structurally identical), while the intersection between graphs can show the commonalities between the objects they represent. In this paper, the documentary structures are described by the MVDM meta-model [5]. This model consists of two levels: a generic level where the generic views are represented by graphs (a class represents a set of typologically similar documents while keeping their specific characteristics) and a specific level where the specific views are represented by trees. Our process for integrating a document into the documentary warehouse is composed of the following steps: a) Extraction of the Generic View of a Document To extract the generic view of an XML document, we use two parsers. The first one is based on SAX (Simple API for XML) and returns all the tags encountered in the
document. The semantic tags are then intercepted by a second parser that analyzes, filters and makes changes to obtain the generic view of the document. The generic view thus obtained is used as the representative of the document in our matching process. b) Filtering of the Generic Views This step consists in selecting the generic views Vgi of the documentary warehouse that have a degree of resemblance with the generic structure Vrep obtained in the previous step. The problem thus consists in finding a univocal matching φn (a node and/or arc of one graph is matched with at most one node and/or arc of the other graph) between Vrep and Vgi that puts in correspondence the maximum number of nodes and/or arcs of Vrep and Vgi. To preserve semantics (spatial relations, etc.), φn has to preserve the arcs between the components matched in Vrep and Vgi. We propose the function Cf, which computes the filtering coefficient Cf(Vrep, Vgi) from which a generic view Vgi of the documentary warehouse (DW) can be selected: Cf : {Vrep} × DW → [0,1], (Vrep, Vgi) → Cf(Vrep, Vgi)
Formally, Vrep = (V, E) and Vgi = (Vi, Ei), where V (resp. Vi) is the set of nodes of the graph Vrep (resp. Vgi) and E (resp. Ei) is the set of arcs of the graph Vrep (resp. Vgi); Vgcand = {Vgi ∈ DW / Cf(Vrep, Vgi) >= Sf} is the set of generic views Vgi candidate for comparison, where Sf is a constant filtering threshold determined a priori by experiments. We define the filtering coefficient Cf(Vrep, Vgi), which depends on the searched matching φn, as follows:

C_f(V_{rep}, V_{gi}) = \frac{1}{2} \left( \frac{|\varphi_n(V)|}{|V|} + \frac{|\varphi_n(V_i)|}{|V_i|} \right)    (5)
where |X| indicates the cardinality of the set X, φn(V) = {u' ∈ Vi / ∃ u ∈ V; φn(u) = u'}, φn(Vi) = {u ∈ V / ∃ u' ∈ Vi; φn(u') = u}, and φn is a univocal matching between Vrep and Vgi (resp. between Vgi and Vrep) that preserves the arcs: φn : Vrep → Vgi, u → v, where v is similar to u (i.e., u and v have the same or similar labels). Cf(Vrep, Vgi) evaluates the percentage of nodes common to Vrep and Vgi: the first part determines the percentage of matched nodes of the structure Vrep, that is, the degree of inclusion of Vrep in Vgi, and the second part determines the percentage of matched nodes of the structure Vgi, that is, the degree of inclusion of Vgi in Vrep (a sketch is given below). The generic structures of the warehouse for which Cf(Vrep, Vgi) is greater than or equal to Sf are selected for the next steps of the comparison phase. c) Reconciliation of Generic Views This step consists in reconciling the generic views of the warehouse that are candidates for comparison with the generic view Vrep representing the specific view of the new document. Additions of nodes and arcs missing in the generic views of the warehouse can be envisaged.
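As an illustration of the filtering step b), the coefficient of Eq. (5) can be sketched in a few lines of Python. This is a minimal sketch under a strong assumption that is not in the paper: each graph is reduced to the multiset of its node labels, and the univocal matching φn simply pairs nodes carrying identical labels, ignoring label similarity and arc preservation.

```python
# Minimal sketch of the filtering coefficient of Eq. (5).
# Assumption (not from the paper): phi_n pairs nodes with identical
# labels one-to-one; label similarity and arc preservation are omitted.
from collections import Counter

def filtering_coefficient(v_rep_labels, v_gi_labels):
    """Cf = (|phi_n(V)|/|V| + |phi_n(Vi)|/|Vi|) / 2."""
    c1, c2 = Counter(v_rep_labels), Counter(v_gi_labels)
    matched = sum((c1 & c2).values())  # nodes pairable one-to-one by label
    return (matched / len(v_rep_labels) + matched / len(v_gi_labels)) / 2

# Hypothetical example: Vrep has 4 nodes, Vgi has 5, 3 labels in common.
v_rep = ["article", "body", "figure", "image"]
v_gi = ["article", "body", "figure", "caption", "link"]
print(filtering_coefficient(v_rep, v_gi))  # (3/4 + 3/5) / 2 = 0.675
```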
d) Weighting This step consists in attributing a weight to every relation. This value depends on the position of the relation in the graph; it thus translates the hierarchical context of the arcs (Fig. 1). We propose the weighting function P_e from E (resp. E_i) to ]0,1[:

P_e : E \rightarrow\ ]0,1[\,, \quad (u,v) \mapsto P_e(u,v)

P_e(u,v) = \begin{cases} 1 - \frac{ord(v)}{k} & \text{if } depth(v) = 1 \\ P_e(x,u) - \frac{ord(v)}{k^{depth(v)}} & \text{otherwise, where node } u \text{ is the extremity of the arc } (x,u) \end{cases}    (6)

k is a parameter fixed by the user (a power of 10) indicating the maximum number of child nodes (number of children < k) for each node of the manipulated graphs, and depth(v) is the depth of v (the depth of the root node being 0). For example (Fig. 1), Pe(name, Id) = 0.89. The weighting function penalizes arcs according to their depth: an arc at depth i (i >= 0) is more important than an arc at depth i + 1 (a sketch of this function is given below).
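The following short Python sketch implements the weighting function of Eq. (6) on a toy tree; the tree encoding (child lists keyed by parent label) is an assumption made for the example, and k = 10 as in the paper's Fig. 1.

```python
# Sketch of the arc-weighting function Pe of Eq. (6), with k = 10.
K = 10

def arc_weights(children, root, parent_weight=None, depth=1, weights=None):
    """Assign Pe to each arc (u, v); ord(v) starts at 1 for the first child."""
    if weights is None:
        weights = {}
    for ord_v, v in enumerate(children.get(root, []), start=1):
        if depth == 1:
            w = 1 - ord_v / K
        else:
            w = parent_weight - ord_v / K ** depth
        weights[(root, v)] = w
        arc_weights(children, v, w, depth + 1, weights)
    return weights

# Reproduces Fig. 1: Pe(article, name) = 0.9 and Pe(name, Id) = 0.89.
tree = {"article": ["name", "body"], "name": ["Id"]}
print(arc_weights(tree, "article"))
```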
Fig. 1. Example representation, hierarchical relation, temporal relation and weighting
e) Calculation of the Similarity Score The similarity between the graphs Vrep and Vgi is based on the search for a best univocal matching between the two graphs: a matching which maximizes the similarity between Vrep and Vgi while respecting the hierarchical order of the matched components and relations. Our similarity measure takes into account both the ontological and the hierarchical contexts of the matched nodes and arcs. The decision depends on a similarity threshold: more formally, if Sim(Vrep, Vgi) >= Ss, then the graphs Vrep and Vgi are considered similar, and the specific view represented by the graph Vrep is aggregated into the (most similar) generic view Vgi of the warehouse; otherwise, a new class is created, represented by the view Vrep. Concerning the similarity threshold Ss, several tests were performed to determine the optimal value. We noticed that increasing Ss leads to the creation of numerous classes; conversely, decreasing it increases the number of documents associated with each class, which induces heterogeneity among documents of the same class. We noticed that the value 0.8 (80% similarity) gave good results. To evaluate the similarity of two graphs, we propose:
Sim(G, G') = \begin{cases} 1 - \frac{d_1 + d_2}{2} & \text{if } d_1 \cdot d_2 \neq 0 \\ 1 - (d_1 + d_2) & \text{otherwise} \end{cases}    (7)

where G = (V, E), G' = (V', E'),

d_1 = \frac{1}{Nbchm_G} \sum_{v_j \in V,\, v_j \text{ leaf}} Dchm(v_j), \qquad d_2 = \frac{1}{Nbchm_{G'}} \sum_{v'_j \in V',\, v'_j \text{ leaf}} Dchm(v'_j),

Dchm(v_j) = \frac{\sum_{e_i \in chm(v_j)} D(e_i)}{\sum_{e_i} P_e(e_i)}, \qquad D(e_i) = \begin{cases} |P_e(e_i) - P_e(e'_i)| & \text{if } \exists\, e'_i \in E' / \varphi_e(e_i) = e'_i \\ P_e(e_i) & \text{otherwise} \end{cases}

and φe is the matching between E and E' (Nbchm_G denotes the number of root-to-leaf paths of G and chm(vj) the path from the root to the leaf vj). For example, the similarity between the paths (Fig. 1) v1: article\body\figure\image\xmlns:xlink and v2: article\body\figure\image\xmlns:xlink is equal to 0.97. Unlike measure (1), our measure evaluates the degree of inclusion in both directions (G ⊆ G' and G' ⊆ G). To preserve semantics, this measure is based on a matching that preserves the arcs (hierarchical and spatio-temporal relations) and the order of the matched components. This technique preserves the synchronization between the matched objects of the documents to compare (which is not the case for measures (2), (3) and (4)). In the context of comparing views of digital multimedia documents represented by graphs, our measure is therefore more effective: unlike the measures studied in Section 2, it takes into account at once the ontological context, the hierarchical context and the multimedia specificities, and it is based on a matching which preserves the order of the components and the arcs between them (so as to preserve the semantics of the matched components).
4 Experimental Results To validate our approach, we developed a prototype in Java. This prototype consists of two modules: the first parses the document to extract its generic view; the second evaluates the similarity of the generic view resulting from the first module with the generic views of the documentary warehouse. We used this tool to process 102 XML documents (1515 KB) extracted from the INEX 2007 corpus (http://wwwconnex.lip6.fr/~denoyer/wikipediaXML/). The 102 documents included in the documentary warehouse must be grouped into classes of structurally similar documents, with a degree of similarity of 80%. We obtained the clustering shown in Table 2: four classes for 102 documents. One class has the peculiarity of containing only six documents; otherwise, the classes are rather homogeneous. In our comparison process, we imposed the order constraint and the arc-preservation constraint: the graph matching (between the graphs G and G') must respect the order of the matched components (1) and must preserve the arcs (2).
Table 1. Characteristics of the corpus used

Number of documents                   102
Average number of elements / Vsp      147.7
Average number of elements / Vrep     26.75
Average number of attributes / Vsp    194.3
Average number of attributes / Vrep   26.2
Average number of paths / Vrep        32.15
Average depth / Vrep                  6.5
Average size in KB / doc              14.85

Table 2. Number of aggregated views per class

Generic class                 C1    C2    C3    C4
Number of aggregated views    39    6     12    41
(1) : ∀ (u,v) ∈ E, ord(u) …

\lambda_n^{max} > T_S    (3)
Here, T_S is a threshold defining the largest variance of components. In our algorithm, the threshold is chosen as T_S = p_v \lambda_0^{max}, where \lambda_0^{max} is the maximum eigenvalue of the covariance matrix \Lambda_0 of the initialization component and p_v is a predefined percentage value modifying the variance of the components (p_v is set to 20% in our experiments). Thus, a component n which satisfies the above condition is split into two new components n_1 and n_2. Let v_n^{max} be the eigenvector corresponding to the maximum eigenvalue \lambda_n^{max} of the covariance matrix \Sigma_n. The two subsets of pixels re-assigned to these two new components are defined by the separating plane which is perpendicular to v_n^{max}:

C_{n_1} = \{ x \in C_n \mid (x - \mu_n) \cdot v_n^{max} \ge 0 \}, \qquad C_{n_2} = \{ x \in C_n \mid (x - \mu_n) \cdot v_n^{max} < 0 \}    (4)

where C_n, C_{n_1} and C_{n_2} are the sets of feature vectors belonging to components n, n_1 and n_2, respectively.
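As a concrete reading of the split test (3) and the re-assignment rule (4), here is a minimal numpy sketch; the data layout (an N x d array of pixel feature vectors) and the toy threshold are assumptions for the example, not the paper's implementation.

```python
# Sketch of the splitting step (Eqs. (3)-(4)): a component whose largest
# eigenvalue exceeds T_S is split along its principal eigenvector.
import numpy as np

def split_component(mu, sigma, pixels, t_s):
    """Return the two pixel subsets C_n1 / C_n2, or None if no split is needed."""
    eigvals, eigvecs = np.linalg.eigh(sigma)    # ascending eigenvalues
    if eigvals[-1] <= t_s:                      # condition (3) not met
        return None
    v_max = eigvecs[:, -1]                      # principal eigenvector
    side = (pixels - mu) @ v_max                # signed distance to the plane
    return pixels[side >= 0], pixels[side < 0]  # Eq. (4)

# Toy example: an elongated 2D cluster with T_S small enough to force a split.
rng = np.random.default_rng(0)
pts = rng.normal(0.0, 1.0, (500, 2)) * np.array([5.0, 0.5])
c1, c2 = split_component(pts.mean(axis=0), np.cov(pts.T), pts, t_s=1.0)
print(len(c1), len(c2))
```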
The parameters of both components are re-estimated based on the statistics of their assigned pixels:

\alpha_{n_i} = n_{n_i} / N, \qquad \mu_{n_i} = \frac{1}{n_{n_i}} \sum_{x \in C_{n_i}} x, \qquad \Sigma_{n_i} = \frac{1}{n_{n_i}} \sum_{x \in C_{n_i}} x x^T - \mu_{n_i} \mu_{n_i}^T    (5)
where n_{n_i} is the number of pixels assigned to the component n_i.
– Merging components: The merging algorithm is applied to reduce the overestimated number of components in the mixture model. The criterion for merging two components is based on the covariances of the two components and the distance between them, defined using the Mahalanobis distance:

d_M(n_1, n_2) = (\mu_{n_1} - \mu_{n_2})^T \Sigma_{n_2}^{-1} (\mu_{n_1} - \mu_{n_2}) + (\mu_{n_2} - \mu_{n_1})^T \Sigma_{n_1}^{-1} (\mu_{n_2} - \mu_{n_1})    (6)

where \mu_{n_i} and \Sigma_{n_i} (i = 1, 2) are the mean vector and covariance matrix of the two components n_1 and n_2. For each component n_i, we first select its nearest neighbour n_j in terms of their Mahalanobis distance. These two components are considered suitable for merging if their distance is lower than a given threshold:

d_M(n_i, n_j) < T_M    (7)
where n_j = \arg\min_{n_k} d_M(n_i, n_k). The parameters of the new merged Gaussian are calculated as:

\alpha_{new} = \alpha_{n_i} + \alpha_{n_j}, \qquad \mu_{new} = \alpha_{n_i} \mu_{n_i} + \alpha_{n_j} \mu_{n_j},
\Sigma_{new} = \alpha_{n_i} \left[ \Sigma_{n_i} + (\mu_{n_i} - \mu_{new})(\mu_{n_i} - \mu_{new})^T \right] + \alpha_{n_j} \left[ \Sigma_{n_j} + (\mu_{n_j} - \mu_{new})(\mu_{n_j} - \mu_{new})^T \right]    (8)
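The merge criterion (6) and the parameter update (8) can likewise be sketched; Eq. (8) is transcribed literally, so a possible renormalization of the weights after the merge is left out, as in the paper.

```python
# Sketch of the merging step (Eqs. (6)-(8)).
import numpy as np

def mahalanobis_pair(mu1, sig1, mu2, sig2):
    """Symmetrized Mahalanobis distance of Eq. (6)."""
    d = mu1 - mu2
    return d @ np.linalg.inv(sig2) @ d + d @ np.linalg.inv(sig1) @ d

def merge(alpha1, mu1, sig1, alpha2, mu2, sig2):
    """Moment-style merge of two Gaussian components, Eq. (8)."""
    a_new = alpha1 + alpha2
    mu_new = alpha1 * mu1 + alpha2 * mu2
    d1, d2 = mu1 - mu_new, mu2 - mu_new
    sig_new = (alpha1 * (sig1 + np.outer(d1, d1))
               + alpha2 * (sig2 + np.outer(d2, d2)))
    return a_new, mu_new, sig_new
```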
The distance (dissimilarity) between two silhouettes is computed by first re-calculating the probability density function f(x) corresponding to each silhouette (Eq. (1)), then using the normalized L2 distance between the two probability density functions:

d(X_1, X_2) = \sum_x \left( \bar{f}_1(x) - \bar{f}_2(x) \right)^2    (9)

where \bar{f}(x) = f(x) / \sqrt{\sum_x f(x)^2}.
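A minimal sketch of Eq. (9), assuming the two densities have already been evaluated on a common discretized grid of feature values x (the grid itself is an assumption of the example):

```python
# Normalized L2 distance between two sampled densities, Eq. (9).
import numpy as np

def normalized_l2(f1, f2):
    f1 = f1 / np.sqrt(np.sum(f1 ** 2))  # f_bar = f / ||f||_2
    f2 = f2 / np.sqrt(np.sum(f2 ** 2))
    return np.sum((f1 - f2) ** 2)

# Example with two 1D Gaussian-shaped densities on a grid.
x = np.linspace(-5.0, 5.0, 200)
g = lambda m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2))
print(normalized_l2(g(0.0, 1.0), g(0.5, 1.0)))
```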
3 Silhouette Classification for People Re-identification In this section, we detail the proposed algorithm for measuring the similarity of passages in front of the cameras. The key idea is first to embed the set of signatures into a lower-dimensional space by using a graph-based manifold-learning technique [12], then to measure the distance (dissimilarity) between two point sets based on their robust centroids. Given a set of signatures S = {S_1, S_2, ..., S_m} extracted from m silhouettes belonging to two passages in front of two different cameras, the algorithm for people re-identification is formally stated as follows: 1. Construct the adjacency graph: Let G = (V, E) denote a graph with m vertices V = {v_1, v_2, ..., v_m}. Each signature S_i corresponds to a vertex v_i. Two vertices v_i and v_j are connected by an edge whose weight is computed as:
W_{ij} = \exp\left( - \frac{d(S_i, S_j)^2}{\tau} \right)    (10)

where d(S_i, S_j) is the distance between the two signatures S_i and S_j, and τ is a predefined parameter.
2. Search for the new representation: Let D denote the diagonal matrix with elements D_{ii} = \sum_j W_{ij} and L denote the un-normalized Laplacian defined by L = D - W. Dimensionality reduction is obtained by solving the generalized eigenvector problem

L y = \lambda D y    (11)

for the d lowest non-zero eigenvalues. The dimensionality reduction operator is thus defined as:

h : S_i \rightarrow u_i = [y_1(i), \ldots, y_d(i)]    (12)

where y_k(i) is the i-th coordinate of eigenvector y_k.
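Steps 1 and 2 (Eqs. (10)-(12)) amount to a Laplacian-eigenmaps embedding; a sketch follows, in which plain Euclidean distances between signature vectors stand in for the silhouette distance d(Si, Sj) of Eq. (9):

```python
# Sketch of the graph construction and spectral embedding (Eqs. (10)-(12)).
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_embedding(signatures, tau, d):
    dist = cdist(signatures, signatures)  # stand-in for d(S_i, S_j)
    w = np.exp(-dist ** 2 / tau)          # Eq. (10)
    deg = np.diag(w.sum(axis=1))
    lap = deg - w                         # un-normalized Laplacian L = D - W
    vals, vecs = eigh(lap, deg)           # solves L y = lambda D y, Eq. (11)
    return vecs[:, 1:d + 1]               # skip the trivial eigenvector

# Example: embed 10 random 5-D signatures into d = 2 dimensions.
u = laplacian_embedding(np.random.rand(10, 5), tau=1.0, d=2)
print(u.shape)  # (10, 2)
```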
3. Measure the similarity between two passages: Suppose that S^a = \{S_1^a, \ldots, S_{m_1}^a\} and S^b = \{S_1^b, \ldots, S_{m_2}^b\} are two signature sets belonging to two passages a and b, and that u^a = \{u_1^a, \ldots, u_{m_1}^a\} and u^b = \{u_1^b, \ldots, u_{m_2}^b\} are the two corresponding point sets in the new coordinates. The dissimilarity between the two passages is computed as follows: (i) Estimate the robust centroids \hat{C}^a and \hat{C}^b of the two point sets corresponding to the two passages by using an iterative method proposed in [13]. Given an initial guess C_0, the estimated centroid is iteratively defined as:

\hat{C}_k = \frac{\sum_{i=1}^{m} w_i u_i}{\sum_{i=1}^{m} w_i}    (13)

where w_i = 1 / \| u_i - \hat{C}_{k-1} \|_2 and \|\cdot\|_2 denotes the L2-norm.
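The iteration of Eq. (13) is a Weiszfeld-type re-weighting; a sketch, with a stopping tolerance and iteration cap added as assumptions:

```python
# Sketch of the robust centroid of Eq. (13).
import numpy as np

def robust_centroid(points, n_iter=50, eps=1e-9):
    c = points.mean(axis=0)  # initial guess C_0
    for _ in range(n_iter):
        w = 1.0 / (np.linalg.norm(points - c, axis=1) + eps)
        c_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(c_new - c) < 1e-8:
            break
        c = c_new
    return c

pts = np.vstack([np.random.randn(20, 2), [[50.0, 50.0]]])  # one far outlier
print(robust_centroid(pts))  # stays near the origin despite the outlier
```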
(ii) Let \hat{C}^{ab} = \hat{C}^a - \hat{C}^b, and let u^{a*} = \{u_1^{a*}, \ldots, u_{m_1}^{a*}\} and u^{b*} = \{u_1^{b*}, \ldots, u_{m_2}^{b*}\} be the projections of u^a and u^b onto the unit vector \hat{C}^{ab} / \|\hat{C}^{ab}\|.
(iii) Estimate the classification point \hat{C}^* which is defined to minimize the misclassification:

\hat{C}^* = \mathrm{median}\left\{ C^* : C^* = \arg\min_C err(C) \right\}    (14)

where err(C) = 1 - \left[ \frac{1}{m_1} \sum_{i=1}^{m_1} \mathbb{1}(u_i^{a*} \le C) + \frac{1}{m_2} \sum_{i=1}^{m_2} \mathbb{1}(u_i^{b*} \le C) \right].
(iv) The dissimilarity between the two passages is thus defined as:

d(a, b) = \min_i \left| \hat{C}^* - u_i^{a*} \right| + \min_i \left| u_i^{b*} - \hat{C}^* \right|    (15)
4 Experimental Results In this section, we evaluate the proposed approach using two real datasets. The first dataset, collected on INRETS premises, contains video sequences of 40 people captured in two different locations (indoors in a hall near windows, and outdoors with natural light). The second dataset, containing sequences of 35 people, was collected by two cameras installed on board a train at two different locations: one in the corridor and one in the cabin. This dataset is more difficult than the first one, since the two cameras are set up at different angles and the video acquisition is influenced by many factors, such as fast illumination variations, reflections and vibrations. Figure 1 shows several example images of the datasets.
Fig. 1. Illustrations of the datasets
For each passage in front of a camera, the silhouette of the moving person is first extracted by combining two main techniques: the GMM background subtraction algorithm [14], coupled with an adaptive update of the background model, and a motion detection module for managing important illumination changes. The appearance-based signature is then calculated in order to characterize the person. In our experiments, we compare the proposed signature with three signatures existing in the literature: colour histograms, spatiograms and the colour/path-length descriptor. For each query passage in front of one camera, the distances between the query passage and each of the candidate passages from the other camera are calculated. A decision threshold is chosen; distances below the threshold indicate a re-identification (i.e. the
two passages correspond to the same person). If the distance is above the threshold, this means a distinction: the two passages correspond to two different individuals. In order to illustrate the performance of the system, we use the ROC curve, which is a plot of the true re-identification rate (TRR) (also known as the true positive rate) versus the false re-identification rate (FRR) (also known as the false positive rate) as the threshold value varies. The optimal threshold is chosen by referring to the Equal Error Rate (EER) point, at which the two error rates (false positive rate and false negative rate) are equal. Figure 2 presents the ROC curves obtained by applying our approach to the two datasets. The closer the ROC curve approaches the top left-hand corner of the plot, the better the method. The optimal points are also shown as the crossings between the ROC curves and the EER line.
Fig. 2. ROCs corresponding to four signatures (left-hand diagram: results of the first dataset; right-hand diagram: results of the second one)
For the first dataset, the optimal settings of the thresholds lead to 77.5% TRR for histograms, 90% for both spatiograms and colour/pathlength descriptors, and 97.5% for the proposed signature. For the second dataset, in spite of the difficulties encountered in the processing, we notice that globally the performances are still very satisfying (91.4% for the proposed descriptor). We note that the performance of colour histograms is the worst. The others lead to better results thanks to the additional spatial information. For both datasets, one can notice that the proposed signature always leads to the best result.
5 Conclusion and Perspectives In this paper, we have presented an algorithm to re-identify people who appear and reappear at different times and across different cameras. We first proposed an appearancebased model which includes both colour and spatial feature of extracted silhouette. The proposed signature leads to satisfying results compared to other descriptors existing in literature. A silhouette clustering algorithm is then carried out in order to measure the dissimilarity between two passages with an important robustness and to make the final decision of re-identification. The global system was tested on two real datasets. The first one was captured in laboratory while the second one comes from a real surveillance
system including two sensors installed on board a moving train. The experimental results show that the proposed system leads to satisfactory results: a 97.5% true re-identification rate for the first dataset and 91.4% for the second one. In order to further improve the performance of the system, the appearance-based features need the addition of more temporal and spatial information, such as the camera transition time or the moving direction of persons. These can help to deal with more challenging scenarios (multiple passages in front of the cameras, many people wearing clothes of the same colour, occlusions, partial detections, etc.).
References 1. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: ICCV (2003) 2. Bird, N., Masoud, O., Papanikolopoulos, N., Isaacs, A.: Detection of loitering individuals in public transportation areas. IEEE Transactions on Intelligent Transportation Systems 6(2), 167–177 (2005) 3. Gheissari, N., Sebastian, T., Hartley, R.: Person reidentification using spatiotemporal appearance. In: CVPR, pp. 1528–1535 (2006) 4. Wang, X., Doretto, G., Sebastian, T., Rittscher, J., Tu, P.: Shape and appearance context modeling. In: ICCV (2007) 5. Yu, Y., Harwood, D., Yoon, K., Davis, L.: Human appearance modeling for matching across video sequences. Machine Vision and Applications 18(3), 139–149 (2007) 6. Truong Cong, D.N., Khoudour, L., Achard, C., Meurie, C., Lezoray, O.: People reidentification by spectral classification of silhouettes. International Journal of Signal Processing (2009) 7. Hadjidemetriou, E., Grossberg, M., Nayar, S.: Spatial information in multiresolution histograms. In: CVPR (2001) 8. Birchfield, S., Rangarajan, S.: Spatiograms versus histograms for region-based tracking. In: CVPR (2005) 9. Yoon, K., Harwood, D., Davis, L.: Appearance-based person recognition using color/pathlength profile. Journal of Visual Communication and Image Representation 17(3), 605–622 (2006) 10. Truong Cong, D.N., Khoudour, L., Achard, C.: Approche spectrale et descripteur couleurposition statistique pour la ré-identification de personnes à travers un réseau de cameras. In: Colloque GRETSI 2009 (2009) 11. McKenna, S., Raja, Y., Gong, S.: Color model selection and adaptation in dynamic scenes. In: ECCV (1998) 12. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007) 13. Huber, P.: Robust statistics. Wiley Series in Probability and Mathematical Statistics (1981) 14. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: CVPR (1999) 15. Buchsbaum, G.: A spatial processor model for object color perception. Journal of the Franklin Institute 310(1), 1–26 (1980) 16. Finlayson, G., Hordley, S., Schaefer, G., Yun Tian, G.: Illuminant and device invariant colour using histogram equalisation. Pattern Recognition 38(2), 179–190 (2005)
Performance Evaluation of Multiresolution Methods in Disparity Estimation Dibyendu Mukherjee, Gaurav Bhatnagar, and Q.M. Jonathan Wu Department of Electrical and Computer Engineering University of Windsor, Windsor, ON, Canada N9B 3P4
[email protected],
[email protected],
[email protected] Abstract. Disparity estimation from stereo imagery has attracted substantial interest from the research community since its commencement, with the recent trend being the use of multiresolution methods. Existing multiresolution-based methods are relatively independent and do not, in general, build on one another as a continuous line of research. As a result, the relative advantages and disadvantages of a particular multiresolution method in disparity estimation are hard to understand. The present work is an effort to put different multiresolution methods side by side to highlight their expediency and suitability, and to compare them for a better understanding. Three different frameworks with different strengths and limitations are used, followed by a comparison in terms of time complexity, quality of matching and the effect of different levels of decomposition. Qualitative and quantitative results are provided for four standard multiresolution methods.
1 Introduction Disparity estimation between stereo images is a fundamental task in computer vision used to determine the dense depth map. Recently, searching for disparity at multiple resolutions has become an interesting research topic, but an extensive study reveals that little research has been carried out to compare different multiresolution methods and to assess how the detail subbands contribute to the performance. Multiresolution-based methods bridge the gap between global and local methods in terms of speed and accuracy: they are faster than global methods due to the reduction of the search area, and more accurate than local methods due to the higher number of matching features and the refinement of the disparity at different resolutions. A lot of research has been reported on wavelet-based stereo estimation [3,7,11]. Recently, Bhatti and Nahavandi [1] have proposed an effective way of stereo matching based on multiwavelets that actually involves the different detail subbands. Further, Mukherjee et al. [8] have proposed the use of curvelets associated with support weights to enhance the results. Ding et al. [4] have proposed the use of the shift-invariant contourlet transform in stereo. However, these methods are apparently less popular due to the lack of proper exploration of their performances and of comparisons among them. The goal of the present work is to provide a comparative study of different multiresolution methods for stereo disparity estimation with their relative performances. The primary reason for such a study is to highlight the progress in a relatively less popular domain
and motivate researchers to make further advances. Also, it allows one to analyze each algorithm and improve its performance. Finally, it highlights the limitations of each method with respect to the others and facilitates the choice of an appropriate multiresolution method for specific needs. The paper briefly discusses the use of the Wavelet Transform, the Curvelet Transform, Multiresolution Singular Value Decomposition and the Contourlet Transform in disparity estimation and their comparison on the well-known Middlebury stereo datasets. For comparison, three frameworks have been used, viz. fixed-window-based SAD, segmented-window-based SAD and approximate-band-based disparity estimation. To the best of our knowledge, this work is the first attempt to explore and evaluate the performance of different multiresolution methods in disparity estimation with the above-mentioned frameworks. The paper is organized as follows. Section 2 provides an overview of the different multiresolution methods. Section 3 describes the frameworks used for evaluation and the comparison of their relative performances in disparity estimation. Section 4 gives the experimental results and comparison tables. Finally, concluding remarks are given in Section 5.
2 Multiresolution Methods This section provides a brief overview of the different multiresolution methods used in the paper. 2.1 Wavelet Transform Wavelets are time-limited functions that are localized in frequency [7]. Owing to their transient nature, they are fundamentally different from the functions used for the Fourier Transform. The wavelet transform of a signal consists of an approximate and a detail part: the approximate part is of low frequency and the detail part consists of the high-frequency components. The wavelet transform of an image decomposes it into an approximate subband and a detail part consisting of horizontal, vertical and diagonal subbands. For the implementation, Daubechies wavelets have been used. 2.2 Curvelet Transform Curvelets are new multiscale representations that are better suited to objects with smooth curvatures [2]. They were designed to represent edges and singularities along curves. Roughly, representing an edge to squared error 1/N requires 1/N wavelets but only 1/√N curvelets. Unlike wavelets, the curvelet transform (CT) can measure the information of a signal at specified scales and positions along specified orientations. This orientation flexibility makes it a powerful feature descriptor for disparity estimation. For the implementation, Curvelets via Wrapping [2] has been used with the codes provided by CurveLab. 2.3 Multiresolution Singular Value Decomposition Multiresolution SVD (MRSVD) is a recent addition to the family of multiresolution methods. In 2001, Kakarala and Ogunbona [6] generalized the SVD to MRSVD. The idea is
to replace the low- and high-pass filters in the decomposition by the SVD. The image is subdivided into blocks of size p×q and a feature vector matrix R_l is generated from the blocks. The eigenvector matrix U_l is obtained by singular value decomposition of R̄_l R̄_l^T, where R̄_l is the zero-mean representation of R_l. Finally, the rows of the transform-domain representation U_l^T R̄_l yield the different subbands. The number of subbands depends on the block size p × q. For our implementation, the block size is kept at 2 × 2 to get subbands equivalent to those of the WT (see the sketch after Section 2.4). 2.4 Contourlet Transform Directionality being a key feature of multidimensional signals, the Contourlet Transform (CONT) decomposes an image into several directional subbands by combining the Laplacian pyramid with a directional filter at each scale [5]. Due to the cascade structure, the multi-scale and directional stages are independent of each other, and each scale can be separately decomposed into any number of directions, provided that the number of directions is a power of 2. Contourlets are similar to curvelets in terms of their directionality and parabolic scaling. For the implementation, the code provided in the Contourlet Toolbox by Minh N. Do is used, with the '9-7' filter for the pyramidal decomposition stage and the 'pkva' filter for the directional decomposition stage.
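The MRSVD construction of Section 2.3 can be made concrete with a short numpy sketch of a single decomposition level with 2 x 2 blocks; treating the eigenvector of the largest eigenvalue as the approximation channel is a convention assumed here, and a full codec would also keep the stored means and U_l for reconstruction.

```python
# Sketch of one MRSVD level with 2x2 blocks, after Kakarala and Ogunbona [6].
import numpy as np

def mrsvd_level(img):
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w].astype(float)
    # Stack each 2x2 block as a column of the 4 x (h*w/4) matrix R_l.
    r = np.array([img[0::2, 0::2].ravel(), img[0::2, 1::2].ravel(),
                  img[1::2, 0::2].ravel(), img[1::2, 1::2].ravel()])
    r_bar = r - r.mean(axis=1, keepdims=True)  # zero-mean representation
    _, u = np.linalg.eigh(r_bar @ r_bar.T)     # eigenvectors of R_bar R_bar^T
    u = u[:, ::-1]                             # largest eigenvalue first
    bands = u.T @ r_bar                        # rows = the four subbands
    return [b.reshape(h // 2, w // 2) for b in bands]

approx, d1, d2, d3 = mrsvd_level(np.random.rand(8, 8))
print(approx.shape)  # (4, 4)
```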
3 Comparison Framework In order to evaluate the performances, three different frameworks are used, for a number of reasons. A comparison among methods relies not only on the performance of the methods but also on the comparison framework used; by using three different frameworks, better knowledge of the performance of the methods can be obtained. Also, the results in Section 4 show that by using better cost functions, the results can be greatly improved. In Section 3.1, the support window is kept at a fixed size and all subbands of the multiresolution method are included. Next, a segmented-window-based cost function is used for more accurate disparity estimation in Section 3.2. Finally, in Section 3.3, the classical approximate-band-based disparity estimation is used to highlight the effects of simple multiple-resolution searching. In all cases, SAD is used as our cost function. Further, in order to keep an equivalent level of WT and MRSVD decomposition, a single level is used in Sections 3.1 and 3.2 whereas two levels are used in Section 3.3. As far as CT and CONT are concerned, one cannot speak of a level of decomposition because they are decomposed into scales and orientations; hence, scale 2 and orientation 8 are used for Sections 3.1 and 3.2, whereas scale 3 and orientation 8 are used for Section 3.3. 3.1 Fixed Window SAD Fixed Window SAD (fixWin) uses SAD as the cost function and refines the disparity over multiple resolutions and subbands with a fixed-size window, using the following algorithm:
1. Perform the multiresolution analysis of the gray-scale stereo pair.
2. Compute the correspondence match in the approximate image using SAD.
3. Shift to the first subband.
4. Compute scale factor = (size of current subband image) / (size of previous subband image).
5. For each pixel, divide the coordinates of the pixel by the scale factor to get the coordinates at the previous scale. Then compute the initial curr disparity = scale factor * disparity at the previous scale.
6. Take a range of search around the initial curr disparity.
7. Find the best match using SAD.
8. If this is the last subband, go to step 10.
9. Shift to the next subband. Go to step 5.
10. Take a range of search in the final image based on the refined disparity map recovered after step 9.
11. Find the final disparity map using SAD.
For WT and MRSVD, only four subbands are obtained, while CT and CONT yield a number of subbands at different scales and orientations. In the algorithm, it is assumed that if the current subband is at scale s and orientation o, then the next subband is at scale s and orientation o + 1 if orientation o + 1 exists; otherwise it is the first orientation at the next scale. The algorithm implicitly involves two parameters: the search range for refinement and the fixed support window size. To keep the quality of the estimated disparity consistent, both are kept constant for each comparison. 3.2 Segmented Window SAD The major disadvantage of a fixed window is the fattening effect at depth discontinuities. As all points in a support window are considered to be of similar depth, regions with depth discontinuities are poorly matched. If the pixels in the window are restricted to the same region, the results can be improved. Based on this idea, this section describes a segmented support window (segWin) that works particularly well in these regions and can largely remove the fattening effect. For this method, color images are used to produce the multiresolution output of each color band separately. Each color band yields a set of multiresolution subbands; in other words, each subband has three gray-value components, each corresponding to a color. These "tricolor" subbands are used to segment the image into different regions by simple thresholding. The pseudocode below shows the procedure:
1. SAD = 0
2. FOR i IN −W TO W
3.    FOR j IN −W TO W
4.       IF dist(sBL(x + i, y + j, :), sBL(x, y, :)) < thresh
5.          SAD = SAD + |sBL(x + i, y + j, :) − sBR(x + i + d, y + j, :)|
6.       END IF
7.    END FOR
8. END FOR
Here, sBL(·) and sBR(·) denote the "tricolor" subbands, with sBL(x, y, :) denoting the three components of pixel (x, y); dist is a function that returns the Euclidean distance between the two component vectors of two pixels, d denotes the disparity, (2∗W + 1) is the support window size and thresh is the threshold for segmentation. Only the pixels that meet the criterion in the IF condition are chosen for the SAD computation. The extra parameter involved in this method is the threshold value, which needs to be kept in a minimal range to remove inter-regional interference (a Python rendering of this cost is sketched below). 3.3 Approximate Band Based Matching The study is concluded with classical approximate band searching, which ignores the detail subbands and searches in the approximate bands progressively until it reaches the original image resolution. The search region is projected into higher resolutions and the search is limited to this region only. This highlights the effect of the low-pass-filtered approximate band on stereo matching. The method (segWinApprox) uses segWin, and the comparison has been done with single and double levels of decomposition.
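A direct Python rendering of the segmented-window cost of Section 3.2 is sketched below; sbl and sbr are assumed to be H x W x 3 arrays holding a "tricolor" subband pair, bounds checking is left to the caller, and the parameter names mirror the pseudocode.

```python
# Sketch of the segmented-window SAD of Section 3.2.
import numpy as np

def segmented_sad(sbl, sbr, x, y, d, half_w, thresh):
    sad = 0.0
    centre = sbl[x, y, :]
    for i in range(-half_w, half_w + 1):
        for j in range(-half_w, half_w + 1):
            # Only pixels in the same colour segment as the centre contribute.
            if np.linalg.norm(sbl[x + i, y + j, :] - centre) < thresh:
                sad += np.abs(sbl[x + i, y + j, :]
                              - sbr[x + i + d, y + j, :]).sum()
    return sad
```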
4 Experimental Results The well-known Middlebury datasets Tsukuba [9] and Cones [10] are used for generating the comparative results. The window size has been kept at 9 × 9 in the original scale and 7 × 7 for level 1 of the decomposition for fixWin, and at 33 × 33 in the original scale and all subbands for segWin and segWinApprox. In Table 1, the comparison of the different methods is provided for fixWin. The results are also compared with the original SAD (origSAD) result in terms of the percentage of bad pixels, calculated with the following formula:

\frac{1}{N} \sum_{(x,y)} \mathbb{1}\left( |d_c(x,y) - d_t(x,y)| > \delta_{thresh} \right)    (1)

where N is the total number of pixels in the image, d_c and d_t represent the computed and ground-truth disparity maps respectively, and δ_thresh is the threshold for bad pixels (usually equal to 1.0). In Table 2, the execution times of the different methods are compared for fixWin, in seconds. Regarding time complexity, it is important to mention that all methods have been run on a Pentium 4, 2 GHz machine with MATLAB. It is evident from the results that multiresolution methods provide comparable results with faster computation. The qualitative results can be seen in Figure 1. Table 3 shows the comparison among the different multiresolution methods for segWin. The results, as can be seen in Figure 2, are much better than fixWin and are quite immune to the fattening effect. As segWin is much slower compared to the fixed Table 1. Comparisons of methods based on fixed window SAD
Images     WT      CT      MRSVD   CONT    OrigSAD   (% of bad pixels)
Tsukuba    10.44   9.38    10.53   10.57   11.82
Cones      21.79   20.81   20.41   21.69   21.55
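The bad-pixel percentage of Eq. (1) reduces to a few lines; the toy maps below are hypothetical.

```python
# Sketch of Eq. (1): percentage of bad pixels between a computed and a
# ground-truth disparity map.
import numpy as np

def bad_pixel_percentage(d_computed, d_truth, delta=1.0):
    bad = np.abs(d_computed - d_truth) > delta
    return 100.0 * bad.mean()

dc = np.array([[1.0, 2.0], [3.0, 4.0]])
dt = np.array([[1.0, 2.0], [5.5, 4.0]])
print(bad_pixel_percentage(dc, dt))  # 25.0
```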
Fig. 1. Disparity maps generated using Fixed Window SAD. (a),(g) groundtruth, (b),(h) WT, (c),(i) CT, (d),(j) MRSVD, (e),(k) CONT and (f),(l) OrigSAD.

Table 2. Comparison of execution times of different methods based on fixed window SAD
Images     WT        CT        MRSVD     CONT      OrigSAD
Tsukuba    13.9313   19.3346   14.1931   17.3685   25.0329
Cones      28.2972   44.7665   28.5641   28.7104   130.3596
Table 3. Comparison of methods based on segmented window SAD
Images     WT      CT      MRSVD   CONT    (% of bad pixels)
Tsukuba    6.23    5.72    6.35    6.63
Cones      19.31   17.69   17.64   19.68
Table 4. Comparison of methods based on approximate band matching in multiple resolutions (% of bad pixels)

           WT             CT             MRSVD          CONT
Images     L1     L2      L1     L2      L1     L2      L1     L2
Tsukuba    8.73   9       8.42   8.33    9.2    9.4     8.78   8.85
Cones      23.59  27.27   22.72  25.4    21.24  22.61   23.51  27.54
Fig. 2. Disparity maps generated using Segmented Window SAD. (a),(e) WT, (b),(f) CT, (c),(g) MRSVD and (d),(h) CONT.
Fig. 3. Disparity maps generated using approximate subbands and Segmented Window SAD. (a)-(h) represent level 1 and (i)-(p) represent level 2 decomposition. (a),(e),(i),(m) WT, (b),(f),(j),(n) CT, (c),(g),(k),(o) MRSVD and (d),(h),(l),(p) CONT.
window-based method, a time complexity analysis is not provided; of course, the execution time depends on the platform and can be improved. The comparison results reveal that CT and MRSVD generally give better results than CONT and WT. Moreover, CT gives better results for smaller search ranges, as in the Tsukuba pair, while MRSVD provides better results for larger search ranges, as in the Cones pair. This can be attributed to the fact that a decomposition of the same level generates more subbands for CT than for MRSVD. Multiple-subband searching, although effective for improving accuracy, may offer more candidate disparities as the search range grows; this apparent ambiguity results in relatively less accurate disparity estimation at higher disparity ranges for CT. The results of segWinApprox are quite close to the segWin results, but with improper edge restoration and a reduction of minute details, because the approximate bands do not contain the detail parts, so the edges and minute details of an image are lost. As the level of decomposition increases, the results deteriorate: the initial disparity estimation is done in the smallest approximate band at the highest level of decomposition, so the initial map loses detailed information, and the refinements in the lower-level approximations are unable to restore the details. Thus, a higher level of decomposition normally yields less accurate disparity maps, although the execution speed increases since most of the search is done at the lowest resolution.
5 Conclusions This paper provides a study on the use of different multiresolution methods for disparity estimation with an effort to highlight the power of multiple resolution processing and the expediency and suitability of the details obtained from the subbands. This study also provides references to previous works that were relatively independent till now. Finally, the goal of the comparisons provided in the paper is to provide a better understanding of different multiresolution methods, their relative strengths and drawbacks and their effects on stereo matching.
Acknowledgments The work is supported in part by the Natural Sciences and Engineering Research Council of Canada.
References 1. Bhatti, A., Nahavandi, S.: Depth estimation using multiwavelet analysis based stereo vision approach. International Journal of Wavelets Multiresolution and Information Processing 6(3), 481–497 (2008) 2. Candes, E.J., Demanet, L., Donoho, D.L., Ying, L.: Fast discrete curvelet transforms. Multiscale Modelling and Simulation 5, 861–899 (2006)
3. Caspary, G., Zeevi, Y.Y.: Wavelet-based multiresolution stereo vision. In: Proceedings of the International Conference on Pattern Recognition, vol. 3, pp. 680–683 (2002) 4. Ding, H., Fu, M., Wang, M.: Shift-invariant contourlet transform and its application to stereo matching. In: Proceedings of the International Conference on Innovative Computing, Information and Control, pp. 87–90 (2006) 5. Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on Image Processing 14(12), 2091–2106 (2005) 6. Kakarala, R., Ogunbona, P.: Signal analysis using a multiresolution form of the singular value decomposition. IEEE Transactions on Image Processing 10(5), 724–735 (2001) 7. Mallat, S.: Wavelets for a vision. Proceedings of IEEE 84(4), 604–614 (1996) 8. Mukherjee, D., Wang, G., Wu, Q.: Stereo matching algorithm based on curvelet decomposition and modified support weights. Accepted at IEEE International Conference on Acoustics, Speech and Signal Processing (2010) 9. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47(1-3), 7–42 (2002) 10. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 195–202 (2003) 11. Zhang, W., Zhang, Q., Qu, L., Wei, S.: A stereo matching algorithm based on multiresolution and epipolar constraint. In: Proceedings of the International Conference on Image and Graphics, pp. 180–183 (2004)
3D Head Trajectory Using a Single Camera Caroline Rougier and Jean Meunier Department of Computer Science and Operations Research (DIRO), University of Montreal, QC, Canada
[email protected],
[email protected] Abstract. Video surveillance applications need tools to track people's trajectories. We present here a new method to extract the 3D head trajectory of a person in a room using only one calibrated camera. A 3D ellipsoid representing the head is used to track the person with a hierarchical particle filter. This method can run in quasi-real time, providing reasonable 3D errors for a monocular system. Keywords: computer vision, 3D, head tracking, monocular, particle filter, video surveillance.
1 Introduction
1.1 Motivation
A 3D trajectory gives a lot of information for behavior recognition and can be very useful for video surveillance applications. For instance, in our research on fall detection [17], the 3D head trajectory of a person in a room is needed. Indeed, Wu [11] shows in a biomechanical study with wearable markers that the 3D head velocities are efficient to detect falls. Thus, we need a method to extract the 3D head trajectory in real-time with sufficient precision to compute 3D velocities. In this paper, we present an original method which can track the person head with a 3D ellipsoid using a single calibrated camera and recover its location in a room. 1.2
1.2 Related Work
Usually, a multi-camera system is used to obtain 3D information. Wu and Aghajan [7] have recovered both the 3D trajectory of the head and its pose with a multi-camera system without accurate calibration of the camera network. Their method was based on Chamfer matching refined with a probabilistic framework. Kawanaka et al. [4] have tracked the head in a 3D voxel space with four stereo cameras, using particle filtering with depth and color information. Kobayashi et al. [5] have used AdaBoost-based cascaded classifiers to extract the 3D head trajectory using several cameras. For a low-cost system, we prefer to use only one camera per room, which explains our interest in monocular methods. Very few attempts have been made to track the 3D head trajectory in real time with a single camera. Hild [3] has
recovered a 3D motion trajectory from a walking person, but he assumed that the person was standing and that the camera's optical axis was parallel to the (horizontal) ground plane. These assumptions do not hold in video surveillance applications: the camera needs to be placed higher in the room to avoid occluding objects and to obtain a larger field of view, and the person cannot be assumed to be standing or facing the camera.
1.3 Method Overview
Birchfield [1] showed that a head is well approximated by an ellipse in the 2D image plane. His tracker was based on a local search using gradient and color information. A head ellipse was also tracked by Nummiaro et al. [10] using a color-based particle filter. In our previous work [17], we used this idea to compute the 3D head localization from a 2D head ellipse tracked in the image plane. The 3D head pose was computed using a calibrated camera, a 3D model of the head and its projection in the image plane. This method worked well with a standing person, but the 3D pose can be wrongly estimated when the person falls (the ellipse ratio can change with the point of view). In this work, we explore the inverse approach: the head is modeled as a 3D ellipsoid which is projected as an ellipse in the 2D image plane and tracked through the video sequence using a particle filter. Section 2 describes the 3D head model and its projection in the image plane. The head tracking module using a particle filter is presented in Section 3 and is tested in Section 4 with the HumanEva-I dataset [14].
2 Head Model Projection
We chose to represent the head by a 3D ellipsoid, which is projected in the image plane as an ellipse [6]. The 3D head model is tracked in the world coordinate system attached to the XY ground plane. To project the 3D head model into the image plane, we need to know the camera characteristics (intrinsic parameters) and the pose of the XY ground plane relative to the camera (extrinsic parameters). The intrinsic parameters can be computed using a chessboard calibration pattern and the camera calibration toolbox for Matlab [16]. The camera's intrinsic matrix K is defined by:

K = \begin{pmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}   (1)

with the focal length (f_x, f_y) and the principal point (u_0, v_0) in pixels. Notice that image distortion coefficients (radial and tangential distortions) are also computed and used to correct the images for distortion before processing with our methodology.
The extrinsic parameters are obtained from correspondences between real-world points and their projected image points. The plane–image homography obtained from these two sets of points is used to compute the extrinsic parameters M_{ext} = [R\ T;\ 0\ 0\ 0\ 1], where R and T are, respectively, a 3D rotation matrix and a 3D translation vector, as described in the work of Zhang [12]. An ellipsoid is a quadric described by a matrix Q_C in the camera coordinate system such that [x, y, z, 1]\, Q_C\, [x, y, z, 1]^T = 0 for a point (x, y, z) belonging to the ellipsoid. This ellipsoid is projected in the image plane with the projection matrix P as a conic C [6][13]:

C = Q_{C_{44}}\, Q_{C_{1:3,1:3}} - Q_{C_{1:3,4}}\, Q_{C_{1:3,4}}^T   (2)

The ellipse is described by the conic via [u, v, 1]\, C\, [u, v, 1]^T = 0 for a point (u, v) in the image plane. Concretely, the ellipsoid matrix representing the head in the head coordinate system has the form:

Q_H = \begin{pmatrix} 1/B^2 & 0 & 0 & 0 \\ 0 & 1/B^2 & 0 & 0 \\ 0 & 0 & 1/A^2 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}   (3)

with the semi-major axis A and the semi-minor axis B of the head ellipsoid. The head ellipsoid is projected as Q_C in the camera coordinate system with the projection matrix P:

Q_C = P^{-T}\, Q_H\, P^{-1}   (4)

The projection matrix P represents here the transformation from the head ellipsoid coordinate system to the image plane:

P = K\, M_{ext}\, M_{Head/World}   (5)

M_{Head/World} will be defined during the tracking process by the translation and the rotation of the head in the world coordinate system (see Section 3). Finally, the conic obtained from eq. (2) is used to define the ellipse parameters. An example of ellipsoid projection is shown in Fig. 1.
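For illustration, the projection of the head quadric into an image conic can be sketched in a few lines of Python/NumPy. This sketch uses the standard dual-quadric projection C* = P Q* P^T from Hartley and Zisserman [13] as an alternative route to the same outline conic; the function name and the normalization step are our own, not part of the paper.

```python
import numpy as np

def project_ellipsoid_to_conic(A, B, P):
    """Sketch: image conic of the head ellipsoid of eq. (3).

    A, B : semi-major / semi-minor axes of the head ellipsoid
    P    : 3x4 projection matrix from head coordinates to pixels (eq. 5)

    Uses the dual-quadric projection C* = P Q* P^T (Hartley & Zisserman
    [13]) to obtain the outline conic of the projected ellipsoid.
    """
    Q = np.diag([1.0 / B**2, 1.0 / B**2, 1.0 / A**2, -1.0])  # eq. (3)
    Q_star = np.linalg.inv(Q)          # dual quadric (diagonal, so trivial)
    C_star = P @ Q_star @ P.T          # dual conic of the outline
    C = np.linalg.inv(C_star)          # point conic: [u, v, 1] C [u, v, 1]^T = 0
    return C / C[2, 2]                 # fix the overall scale
```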
3 3D Head Tracking with Particle Filter

3.1 Particle Filter
Particle filters have been used with success in several applications, and in particular to track a head with an ellipse [17][10], or a parametric spline curve [8] using color information or edge contours. In our work, a particle filter is particularly suitable as it allows abrupt variations of the trajectory and can deal with small occlusions.
Fig. 1. The 3D ellipsoid model (a) is projected into an ellipse in the image plane (b). From the foreground silhouette (c), the foreground coefficient is computed using the normal segments to the ellipse. For example, in this case, we obtain a foreground coefficient equal to CF = 0.81 (see Sect. 3.2).
The main idea of particle filters is to estimate the probability distribution p(X_t | Z_t) of the state vector X_t of the tracked object given Z_t, representing all the observations. This probability can be approximated from a set of N weighted samples S_t = \{(s_t^{(n)}, \pi_t^{(n)}),\ 1 \le n \le N\} (also called particles). The main steps of the particle filter algorithm are:

Selection. N new samples are selected from the old sample set by favoring the best particles.
Prediction. The new samples are predicted with a stochastic dynamical model S_t^{(n)} = A\, S_{t-1}^{(n)} + B\, w_t^{(n)}, where w_t^{(n)} is a vector of standard normal random variables, and A and B are, respectively, the deterministic and stochastic components of the dynamical model.
Measurement. A weight \pi_t^{(n)} is assigned to each sample (see Sect. 3.2).
Estimation. The mean state is then estimated using the weighted particles.

In our work, each particle of the filter is an ellipsoid, represented by the state vector:

X = [X_e, Y_e, Z_e, \theta_{X_e}, \theta_{Y_e}]   (6)

where (X_e, Y_e, Z_e) is the 3D head ellipsoid centroid expressed in the world coordinate system (translation component of the matrix M_{Head/World}), and (\theta_{X_e}, \theta_{Y_e}) are respectively the rotations around the X and Y axes (rotation component of the matrix M_{Head/World})¹.
¹ Notice that two angles (instead of three) are sufficient to define the position and orientation of the ellipsoid, since its two minor axes have the same length in our model.
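As an illustration of the four steps above, one particle-filter cycle can be sketched as follows; the weight function of Sect. 3.2 is passed in as a callable, and all names and array shapes are ours, not the authors' implementation.

```python
import numpy as np

def particle_filter_step(particles, weights, A, B, measure):
    """One Selection/Prediction/Measurement/Estimation cycle (Sect. 3.1).

    particles : (N, 5) array of states [Xe, Ye, Ze, thetaXe, thetaYe]
    weights   : (N,) normalized weights pi
    A         : (5, 5) deterministic component of the dynamical model
    B         : (5,) per-component stochastic magnitudes
    measure   : callable state -> weight (the coefficients of Sect. 3.2)
    """
    N = len(particles)
    # Selection: resample, favoring the best particles
    particles = particles[np.random.choice(N, size=N, p=weights)]
    # Prediction: S_t = A S_{t-1} + B w_t with w_t standard normal
    particles = particles @ A.T + B * np.random.standard_normal((N, 5))
    # Measurement: assign a new weight to each sample
    weights = np.array([measure(s) for s in particles])
    weights /= weights.sum()
    # Estimation: weighted mean state
    return particles, weights, weights @ particles
```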
3.2 Particle Weights
The particle weights are based on foreground, color and body coefficients.

• Foreground coefficient C_F. To compute the foreground coefficient, we first need to extract the person's silhouette in the image. For this purpose, we use a background subtraction method which consists in comparing the current image with an updated background image. We use the codebook method from the work of Kim et al. [9], which takes into account shadows, highlights and high image compression. The foreground coefficient is computed by searching for silhouette contour points along N_e ellipse normals. For each normal, the distance d_e from the ellipse point to the detected silhouette point is used to compute the foreground coefficient:

C_F = \frac{1}{N_e} \sum_{n=1}^{N_e} \frac{D_e(n) - d_e(n)}{D_e(n)}, \quad C_F \in [0 \ldots 1]   (7)

where D_e, the half-length of the normal, is used to normalize the distances. An example of a foreground silhouette obtained with the codebook method [9] is shown in Fig. 1. This figure also shows the foreground coefficient obtained for a detected ellipse.

• Color coefficient C_C. The color coefficient is based on the normalized 3D color histogram of the head ellipse [10]. The color histogram, in the RGB color space, is composed of N_b = 8 × 8 × 8 bins and is computed inside a rectangular zone included in the ellipse. The comparison between an updated color head model and the target model is then done by calculating the histogram intersection:

C_C = \sum_{i=1}^{N_b} \min(H(i), H_{ref}(i)), \quad C_C \in [0 \ldots 1]   (8)
• Body coefficient C_B. The body coefficient is used to link the head to the body through the body center. The projection of the 3D point corresponding to the centroid of the person should be near the centroid of the 2D silhouette (distance d_b compared to the half-major axis of the bounding box, D_b). This coefficient is only used when the bounding box is valid (it is not used in case of occlusion, for example).

C_B = (D_b - d_b)/D_b, \quad C_B \in [0 \ldots 1]   (9)
• Final coefficient. Finally, the ellipsoid coefficient is a combination of these coefficients, amplified by a Gaussian to give larger weights to the best particles:

C_{final} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(\frac{C_F\, C_C\, C_B}{2\sigma^2}\right), \quad \text{with } \sigma = 0.25   (10)
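Putting the three coefficients together, the weight of one particle can be sketched as below; it assumes the reading of eq. (10) given above, and the argument names and shapes are placeholders of our own.

```python
import numpy as np

def particle_weight(D_e, d_e, H, H_ref, D_b, d_b, sigma=0.25):
    """Weight of one particle from the coefficients of eqs. (7)-(10).

    D_e, d_e : (Ne,) normal half-lengths and ellipse-to-silhouette distances
    H, H_ref : (Nb,) normalized color histograms of candidate and model
    D_b, d_b : bounding-box half-major axis and body-center distance
    """
    C_F = np.mean((D_e - d_e) / D_e)        # foreground coefficient, eq. (7)
    C_C = np.minimum(H, H_ref).sum()        # histogram intersection, eq. (8)
    C_B = (D_b - d_b) / D_b                 # body coefficient, eq. (9)
    # Final coefficient, eq. (10): exponential amplification of the product
    return np.exp(C_F * C_C * C_B / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
```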
3.3 Tracking
Initially, a head detection module is used to calibrate our system. When the person stands up, the top head point is detected in the foreground silhouette. From this point, several 2D ellipses are tested, and the one with the largest foreground coefficient C_F is kept. If C_F > 0.7, the ellipse is considered sufficiently reliable. This head ellipse is used to calibrate the ellipsoid proportion relative to the human height (the height H of the person is supposed to be known, and the aspect ratio of the ellipse is fixed at 1.2 for a human head [1]). Finally, to obtain the initial 3D position of the ellipsoid, we use a floor–image homographic transformation of the image position of the feet to get the (X, Y) location, and the person's height H to infer the Z coordinate.

To have a reliable 3D head localization, we need to precisely estimate the head projection in the image. With a conventional particle filter, a lot of particles are needed for precision, which severely affects the computational performance and is incompatible with real-time operation. We prefer instead an improved particle filter based on a hierarchical scheme with several layers, similar to the annealed particle filter [2]. We chose to work with 250 particles and 4 layers, which is a good compromise between performance and computational time. The known velocity between two successive images is added to the particles to predict the next 3D ellipsoid localization before propagating the particles. Each layer has a different stochastic component B = (B_{X_e}, B_{Y_e}, B_{Z_e}, B_{\theta_{X_e}}, B_{\theta_{Y_e}}) for the model propagation, sufficiently large for the first layer and decreasing for the next layers, such that B_{l+1} = B_l / 2, with l the current layer and l+1 the next layer. At the beginning, to refine the initial head position, B is large for the X and Y components (initially, the person is supposed to be standing up, so that Z_e is approximately known and \theta_{X_e} and \theta_{Y_e} are close to zero). For the next images, B is reinitialized from the current velocity between two successive images, and fixed to 0.1 m or 0.1 rad if the velocity is too small.
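The layered refinement can be sketched on top of the particle_filter_step function from the earlier sketch; identity dynamics and the loop structure below are our assumptions, not the paper's code.

```python
import numpy as np

def hierarchical_track(particles, weights, B0, measure, n_layers=4):
    """Layered refinement of Sect. 3.3: rerun the filter with halved noise.

    B0 is the stochastic vector of the first layer; B_{l+1} = B_l / 2.
    Identity dynamics (A = I) are assumed in this sketch; the velocity
    prediction of the paper would be added to the particles beforehand.
    """
    A = np.eye(particles.shape[1])
    B = np.asarray(B0, dtype=float).copy()
    for _ in range(n_layers):
        particles, weights, estimate = particle_filter_step(
            particles, weights, A, B, measure)
        B /= 2.0                             # shrink the search region
    return particles, weights, estimate
```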
4 Experimental Results
The precision of our method is evaluated with the HumanEva-I dataset [14], which contains synchronized multi-view video sequences and corresponding MoCap data (ground-truth 3D localizations). As our method works with a single camera, we compared the results obtained from three different viewpoints (color cameras C1, C2 and C3) using the video sequences of three subjects (S1, S2 and S3). The HumanEva dataset contains several actions; we used the motion sequences "walking" and "jog" to evaluate our 3D trajectories at 30Hz, 20Hz and 10Hz. Figure 2 gives an example of 3D head trajectories obtained from different viewpoints, and shows that the curves are similar to the MoCap trajectory. Table 1 summarizes the 3D mean errors obtained for each subject and each camera. This corresponds to a mean error of about 5% at a 4 to 6 meter distance. As expected, when the movement was larger, the error tended to be slightly higher, but the 3D ellipsoid remained well tracked.
Fig. 2. 3D head trajectories for a walking sequence of subject S1 (20Hz), showing the X, Y and Z locations (in meters) over time (in frames) for the MoCap ground truth and for cameras C1, C2 and C3.

Table 1. Mean 3D errors (in cm) obtained from walking and jog sequences for different subjects (S1, S2, S3) and several viewpoints (C1, C2, C3)

Frame rate  Camera  Action "walking"                         Action "jog"
                    S1           S2           S3             S1           S2           S3
30Hz        C1      23.4 ± 15.6  23.2 ± 9.2   23.6 ± 12.6    17.8 ± 9     25.7 ± 18.2  22.3 ± 16
30Hz        C2      17.5 ± 14.6  24 ± 9.5     21.4 ± 11      27 ± 11.6    26 ± 9       17 ± 9
30Hz        C3      21 ± 13.7    24.1 ± 9.8   22.9 ± 13.6    27 ± 12.1    21.5 ± 9.4   33.1 ± 21.7
20Hz        C1      23.4 ± 14.2  22.9 ± 9.3   24.1 ± 11.9    21.4 ± 14.5  28.8 ± 22.3  26.4 ± 21.1
20Hz        C2      20.7 ± 11.4  22.9 ± 8.4   19.5 ± 9       30 ± 11.6    29.5 ± 12.4  18.8 ± 13.6
20Hz        C3      19.7 ± 9.8   23.1 ± 10.8  24.6 ± 13.1    23.4 ± 8.7   25.8 ± 12.1  30.9 ± 17.6
10Hz        C1      22.6 ± 12.3  24.1 ± 12.8  21.2 ± 15.6    20.7 ± 16.8  30.8 ± 24.4  51.3 ± 45.7
10Hz        C2      33.3 ± 15.9  23.1 ± 7.7   21.2 ± 15.3    35.6 ± 15.4  34.4 ± 15.4  37.8 ± 23.5
10Hz        C3      21.7 ± 14    20.4 ± 9.5   25.7 ± 12.7    18.5 ± 8.2   35.3 ± 12.5  39.2 ± 28.9
Our method is implemented in C++ using the OpenCV library [15] and can run in quasi-real time (130 ms/frame on an Intel Core 2 Duo processor (2.4 GHz), with unoptimized code and an image size of 640×480).
5 Discussion and Conclusion
In this paper, we have shown that a 3D head trajectory can be computed with a single calibrated camera. The method gave similar results for different viewpoints, different frame rates and different subjects. The 3D locations were estimated with a reasonable mean error of around 25 cm (5% at 5 meters), which is sufficient for most trajectory-based activity recognition.
Acknowledgement. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References
1. Birchfield, S.: Elliptical head tracking using intensity gradients and color histograms. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), pp. 232–237 (1998)
2. Deutscher, J., Blake, A., Reid, I.D.: Articulated Body Motion Capture by Annealed Particle Filtering. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 126–133 (2000)
3. Hild, M.: Estimation of 3D motion trajectory and velocity from monocular image sequences in the context of human gait recognition. In: International Conference on Pattern Recognition (ICPR), vol. 4, pp. 231–235 (2004)
4. Kawanaka, H., Fujiyoshi, H., Iwahori, Y.: Human Head Tracking in Three Dimensional Voxel Space. In: International Conference on Pattern Recognition (ICPR), vol. 3, pp. 826–829 (2006)
5. Kobayashi, Y., Sugimura, D., Hirasawa, K., Suzuki, N., Kage, H., Sato, Y., Sugimoto, A.: 3D Head Tracking using the Particle Filter with Cascaded Classifiers. In: Proc. of British Machine Vision Conference (BMVC), pp. 37–46 (2006)
6. Stenger, B., Mendonça, P.R.S., Cipolla, R.: Model-Based Hand Tracking Using an Unscented Kalman Filter. In: Proc. BMVC, vol. 1, pp. 63–72 (2001)
7. Wu, C., Aghajan, H.: Head pose and trajectory recovery in uncalibrated camera networks – Region of interest tracking in smart home applications. In: ACM/IEEE International Conference on Distributed Smart Cameras, pp. 1–7 (2008)
8. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998)
9. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.: Real-time foreground-background segmentation using codebook model. Real-Time Imaging 11(3), 172–185 (2005)
10. Nummiaro, K., Koller-Meier, E., Van Gool, L.: An adaptive color-based particle filter. Image and Vision Computing 21(1), 99–110 (2003)
11. Wu, G.: Distinguishing fall activities from normal activities by velocity characteristics. Journal of Biomechanics 33(11), 1497–1500 (2000)
12. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000)
13. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
14. Sigal, L., Black, M.J.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University, Department of Computer Science, Providence, RI (2006)
15. OpenCV: Intel Open Source Computer Vision Library, http://www.intel.com/technology/computing/opencv
16. Bouguet, J.Y.: Camera calibration toolbox for Matlab, http://www.vision.caltech.edu/bouguetj/calib_doc/index.html
17. Rougier, C., Meunier, J., St-Arnaud, A., Rousseau, J.: Monocular 3D Head Tracking to Detect Falls of Elderly People. In: International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6384–6387 (2006)
Artificial Neural Network and Fuzzy Clustering Methods in Segmenting Sputum Color Images for Lung Cancer Diagnosis

Fatma Taher and Rachid Sammouda

Department of Computer Engineering, Khalifa University, Sharjah, United Arab Emirates
[email protected]
Department of Computer Science, University of Sharjah, Sharjah, United Arab Emirates
[email protected]

Abstract. Lung cancer is cancer that starts in the lungs; cancerous cells grow out of control, taking over normal cells and organs in the body. The early detection of lung cancer is the most effective way to decrease the mortality rate. In this paper we compare two methods used in segmenting sputum color images: a modified Hopfield Neural Network (HNN) and the Fuzzy C-Means (FCM) clustering algorithm. The segmentation results will be used as a basis for a Computer Aided Diagnosis (CAD) system for the early detection of lung cancer. Both methods are designed to classify an image of N pixels among M classes or regions. Due to intensity variations in the background of the raw images, a pre-segmentation process is developed to standardize the segmentation process. In this study, we used 1000 sputum color images to test both methods; HNN showed a better classification result than FCM, but the latter was faster in converging. Keywords: Lung cancer diagnosis, Sputum cells, Hopfield Neural Network, Fuzzy C-Means Clustering.
1 Introduction

Lung cancer is considered to be the leading cause of cancer death throughout the world, and it is difficult to detect in its early stages because symptoms appear only at advanced stages [1]. Physicians use several techniques to diagnose lung cancer, such as chest radiography and sputum cytological examination, where a sputum sample can be analyzed for the presence of cancerous cells. However, the analysis of the images obtained by this technique still suffers from some limitations caused by the scanning process. Recently, some medical researchers have shown that the analysis of sputum cells can lead to a successful diagnosis of lung cancer [2]. For this reason, we attempt to develop an automatic diagnostic system for detecting lung cancer in its early stages based on the analysis of sputum color images.
In order to formulate a rule, we have developed a technique for the unsupervised segmentation of sputum color images. Image segmentation is used as the first step of image analysis in order to divide an image into several meaningful sub-regions. Many algorithms have been proposed for medical image segmentation, such as histogram analysis, region growing, edge detection and pixel classification; a good reference for these techniques can be found in [3]. Other authors have considered the use of color information as the key discriminant factor for cell segmentation for lung cancer diagnosis. The analysis of sputum images has been used in [4] for detecting tuberculosis; it consists of analyzing sputum images for detecting bacilli. In this paper, two basic techniques are applied, the Hopfield Neural Network (HNN) and the Fuzzy C-Means (FCM) clustering algorithm, to segment sputum color images prepared by the standard staining method described in [5]. In computer memory, each image is represented as three separate pixel matrices corresponding to red, green and blue intensities (the R, G and B image planes). The segmentation is performed based on the result of a preprocessing step for the extraction of the region of interest (ROI). We present segmentation results of some images for both methods. The remaining sections of this paper are organized as follows. In Section 2, the Hopfield Neural Network (HNN) segmentation algorithm is introduced. In Section 3, the fuzzy clustering algorithm is described. In Section 4, segmentation results and a discussion are presented. Finally, in Section 5, the conclusion is drawn and several issues for future work are discussed.
2 Hopfield Neural Network

The Hopfield Neural Network (HNN) is an artificial neural network which has been proposed for segmenting both gray-level and color images. In [6], the authors formulate the segmentation problem for gray-level images as the minimization of a suitable energy function with an HNN, and derive the network architecture from the energy function. An existing algorithm for color image segmentation can be found in [7], where the authors present an algorithm for the segmentation of medical stained images in which the nuclei, cells and background are classified into three classes, with the RGB components of the images as input. In our work we use the HNN algorithm as our segmentation method. The HNN is very sensitive to intensity variation, and it can detect the overlapping cytoplasm classes. The HNN performs unsupervised learning; the network classifies the feature space without a teacher, based on the compactness of each cluster, calculated using the Euclidean distance measure between the kth pixel and the centroid of class l. The network structure consists of a grid of N × M neurons, with each column representing a cluster and each row representing a pixel. The network is designed to classify the image of N pixels of P features among M classes, such that the assignment of the pixels minimizes the criterion function:

E = \frac{1}{2} \sum_{k=1}^{N} \sum_{l=1}^{M} R_{kl}^2\, V_{kl}^2   (1)
where R_{kl} is the Euclidean distance between the kth pixel and the centroid of class l, and V_{kl} is the output of the corresponding neuron. The minimization is achieved using the HNN by solving the motion equations satisfying:

\frac{du_i}{dt} = -\mu(t)\,\frac{\partial E}{\partial V_i}   (2)

where μ(t) is, as defined in [6], a scalar positive function of time used to increase the convergence speed of the HNN. By applying relation (2) to equation (1), we obtain a set of neural dynamics given by:

\frac{dU_{kl}}{dt} = -\mu(t)\,R_{kl}^2\,V_{kl}   (3)
where U_{kl} and V_{kl} are the input and output of the kth neuron, respectively. To assign a label m to the kth pixel we use the winner-take-all input–output function:

V_{kl}(t+1) = 1 \text{ if } U_{kl}(t) = \max_m \{U_{km}(t)\}, \text{ and } 0 \text{ otherwise}   (4)

The HNN segmentation algorithm can be summarized in the following steps:
1. Initialize the inputs of the neurons to random values.
2. Apply the input–output relation given in (4) to obtain the new output value for each neuron, establishing the assignment of pixels to classes.
3. Compute the centroid of each class.
4. Solve the set of differential equations in (3) to update the input of each neuron.
5. Repeat from step 2 until convergence, then terminate.
We applied the HNN with the specifications mentioned above to one thousand sputum color images and retained the results for further processing in the following steps. Our algorithm could successfully segment 97% of the images into nuclei, cytoplasm regions and a clear background. Furthermore, the HNN takes a short time to achieve the desired results: by experiment, it needs fewer than 120 iterations to reach the desired segmentation result, in 36 seconds.
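A minimal sketch of this iterative scheme in Python/NumPy is given below, assuming RGB feature vectors and a simple explicit update of the dynamics in (3); the step size μ, the initialization and all names are placeholders of ours, not the values used by the authors.

```python
import numpy as np

def hnn_segment(pixels, M, n_iter=120, mu=1e-3, seed=0):
    """Sketch of the HNN clustering of Sect. 2 (eqs. 1-4).

    pixels : (N, 3) array of RGB feature vectors
    M      : number of classes
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(pixels), M))                     # step 1: random inputs
    for _ in range(n_iter):
        # step 2: winner-take-all output, eq. (4)
        V = (U == U.max(axis=1, keepdims=True)).astype(float)
        # step 3: centroid of each class from the current assignment
        counts = V.sum(axis=0) + 1e-12
        centroids = (V.T @ pixels) / counts[:, None]
        # squared Euclidean distances R_kl^2 between pixels and centroids
        R2 = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # step 4: explicit update of the neuron inputs via eq. (3)
        U -= mu * R2 * V
    return U.argmax(axis=1)                              # final labels
```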
3 Fuzzy Clustering

Clustering is the process of dividing data into groups based on the similarity of objects; information that is logically similar is stored together physically, in order to increase the efficiency of the database system and to minimize the number of disk accesses. Let \{X^{(q)} : q = 1, \ldots, Q\} be a set of Q feature vectors. Each feature vector X^{(q)} = (X_1^{(q)}, \ldots, X_N^{(q)}) has N components. The process of clustering is to assign the Q feature vectors to K clusters \{C^{(k)} : k = 1, \ldots, K\}, where C^{(k)} designates the center of the kth cluster.
Fuzzy clustering has been used in many fields such as pattern recognition and fuzzy identification. A variety of fuzzy clustering methods have been proposed, and most of them are based upon distance criteria. The most widely used algorithm is the Fuzzy C-Means (FCM) algorithm, which uses reciprocal distance to compute fuzzy weights. FCM was introduced by J.C. Bezdek [8]. This algorithm takes as input a predefined number of clusters (the "C" in its name); "means" stands for the average location of all the members of a particular cluster, and the output is a partitioning into clusters of a set of objects. The objective of FCM clustering is to minimize the total weighted mean-square error:

J(W_{qk}, C^{(k)}) = \sum_{q=1}^{Q} \sum_{k=1}^{K} (W_{qk})^p\, \| x^{(q)} - c^{(k)} \|^2   (5)
The FCM allows each feature vector to belong to multiple clusters with various fuzzy membership values [9]; the final classification is then made according to the maximum weight of the feature vector over all clusters. The detailed algorithm:

Input: vectors of objects, each object representing s dimensions, where v = \{v_1, v_2, \ldots, v_n\}. In our case each object is an image pixel with 3 dimensions (R, G, B). K = number of clusters.
Output: a set of K clusters which minimizes the sum of distance errors.

Algorithm steps:
1. Initialize a random weight for each pixel, using fuzzy weighting with positive weights \{W_{qk}\} in [0, 1].
2. Standardize the initial weights of each qth feature vector over all K clusters.
3. Standardize the weights over k = 1, \ldots, K for each q to obtain W_{qk}.
4. Compute the new centroids C^{(k)}, k = 1, \ldots, K.
5. Update the weights \{W_{qk}\}.
6. Assign each pixel to a cluster based on the maximum weight.
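A minimal sketch of this loop in Python/NumPy follows; it assumes the standard reciprocal-distance membership update mentioned at the start of this section and a fuzzifier p = 2, and all names and default values are our own illustration rather than the authors' implementation.

```python
import numpy as np

def fcm_segment(pixels, K, p=2.0, n_iter=50, seed=0):
    """Sketch of the FCM steps above for (Q, 3) RGB feature vectors."""
    rng = np.random.default_rng(seed)
    W = rng.random((len(pixels), K))                 # steps 1-3: random,
    W /= W.sum(axis=1, keepdims=True)                # standardized weights
    for _ in range(n_iter):
        Wp = W ** p
        # step 4: new centroids as weighted means
        C = (Wp.T @ pixels) / Wp.sum(axis=0)[:, None]
        # step 5: reciprocal-distance update of the weights
        d2 = ((pixels[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + 1e-12
        W = d2 ** (-1.0 / (p - 1.0))
        W /= W.sum(axis=1, keepdims=True)
    return W.argmax(axis=1)                          # step 6: hard labels
```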
We applied the FCM clustering algorithm with the specifications mentioned above to one thousand sputum color images and retained the results for further processing in the following steps. Our algorithm segments the images into nuclei, cytoplasm regions and a clear background; however, the FCM is not sensitive to intensity variation, and therefore the cytoplasm regions are detected as one cluster when we fix the cluster number to three, four, five or six. Moreover, FCM failed in detecting the nuclei; it detected only part of them. By experiment, the FCM algorithm takes fewer than 50 iterations to reach the desired results, in 10 seconds on average.
4 Segmentation Results and Discussion

Here, we present the results obtained with two sample images. The first sample contains red cells surrounded by a lot of debris nuclei and a background exhibiting a large amount of intensity variation in its pixel values, as shown in Figure 1 (a); the second sample is composed of blue-stained cells, shown in Figure 3 (a). Figure 1 (b) and (c) show the segmentation results using HNN and the FCM with the RGB components of the raw image in Figure 1 (a), respectively. As can be seen in the segmentation results of both algorithms, the nuclei of the cells were not detected in the case of HNN (b), and were not accurately represented in (c). For this reason we developed a mask to extract our regions of interest, described in [6], and the result is shown in Figure 1 (d).
Fig. 1. (a) Sample of sputum color image stained with blue and red dyes. (b) and (c) The segmentation results using HNN and the FCM, respectively, with the RGB components of the raw image. (d) The result of the pre-segmentation masking process. (e) and (f) The segmentation results using HNN and FCM with the RGB components of (d), obtained by fixing the cluster number to three, respectively; (g) and (h) are the results obtained by fixing the cluster number to four.
Figure 1 (e) and (f) show the segmentation results using HNN and FCM with the RGB components of (d), obtained by fixing the cluster number to three. We realized that in the case of HNN the nuclei were detected, but not precisely, and in
the case of the FCM only part of the nuclei was detected. We increased the cluster number to four as an attempt to solve the nuclei detection problem. The results are shown in Figure 1 (g) and (h) for HNN and FCM, respectively. Comparing the HNN segmentation result in (g) to the raw image (d), we can say that the nuclei regions were detected perfectly, together with their corresponding cytoplasm regions. However, due to the problem of intensity variation in the raw image (d) and the sensitivity of HNN, the cytoplasm regions were represented by two clusters. These cytoplasm clusters may be merged later if the difference in their mean values is not large. Comparing the FCM segmentation result in (h) to the raw image (d), the nuclei regions are detected, but they present a slight overlap, such that two different nuclei may be considered as one nucleus, which can affect the diagnosis results. The cytoplasm regions are smoother than in the case of HNN, reflecting that the FCM is less sensitive to intensity variation than HNN. A quantitative reflection of the above comparison and discussion is shown in Figure 2, where it can be seen that the segmentation error at convergence is smaller with HNN than with the FCM; however, the FCM converges fifty iterations earlier than HNN. Figure 3 (a) shows a sample of sputum color image stained with blue dye; (b) and (c) show the segmentation results using HNN and the FCM with the RGB components of the raw image (a), respectively. As can be seen in the segmentation results of both algorithms in (b) and (c), the nuclei have not been detected and the background presents a lot of intensity variation. A filter was needed to minimize the effect of the intensity variation in the raw image, as described in [6]. The result of this filter is shown in (d); (e) and (f) are the segmentation results obtained using HNN and FCM with the RGB components of (d) with four clusters. Here, the nuclei have been detected; however, a color cluster is missing in the result of FCM (f). As in the previous case of the red cells, HNN is more sensitive to intensity variation between nuclei–nuclei or nuclei–cytoplasm regions. This is clear in Figure 4, which shows, quantitatively, the energy functions of HNN and FCM during the segmentation process of the blue sample.
Fig. 2. Curves of the HNN and the FCM energy functions during the segmentation process
Artificial Neural Network and Fuzzy Clustering Methods
519
Fig. 3. (a) A sample of sputum color image stained with blue dye; (b) and (c) the segmentation results using HNN and the FCM with the RGB components of the raw image (a), respectively. (d) The result of the pre-segmentation masking process. (e) and (f) The segmentation results using HNN and FCM with the RGB components of (d), obtained by fixing the cluster number to four, respectively.

Fig. 4. Curves of the HNN and the FCM energy functions during the segmentation process
5 Conclusion

In this study, we presented two methods for the segmentation of sputum cell color images, to be used as a basis for a Computer Aided Diagnosis (CAD) system for the early detection of lung cancer. The presented methods use a pre-classification algorithm to extract the regions of interest from the raw color image components in the RGB color space and to remove the debris cells. As shown in the previous section, HNN achieved a better classification result than the FCM clustering algorithm; however, the latter was faster in convergence. In the future, we plan to consider Bayesian decision theory for the detection of lung cancer cells as soon as a more extended dataset is available.
References
1. The National Women's Health Information Center, U.S. Department of Health and Human Services Office on Women's Health, Lung Cancer (2003), http://www.4woman.gov/faq/lung.htm
2. Herb, F.: HuilongGuan Culture, Lung cancer (2002), http://www.4uherb.com/cancer/lung/about.htm
3. Lin, J.S., Cheng, K.S., Mao, C.-W.: A Fuzzy Hopfield Neural Network for Medical Image Segmentation. IEEE Trans. Nuclear Science 43(4) (August 1996)
4. Forero-Vargas, M., Sierra-Ballen, E.L., Alvarez-Borrego, J., Pacheco, J.L.P., Cristobal, G., Alcala, L., Desco, M.: Automatic Sputum Color Image Segmentation for Tuberculosis Diagnosis. In: Proc. SPIE, pp. 251–261 (2001)
5. Sammouda, R., Niki, N., Nishitani, H., Nakamura, S., Mori, S.: Segmentation of Sputum Color Image for Lung Cancer Diagnosis. In: Proceedings of the International Conference on Image Analysis and Processing, Italy, September 1997, p. 243 (1997)
6. Sammouda, R., Niki, N., Nishitani, H., Kyokage, E.: Segmentation of Sputum Color Image for Lung Cancer Diagnosis based on Neural Networks. IEICE Transactions on Information and Systems E81-D(8), 862–871 (1998)
7. Taher, F., Sammouda, R.: Identification of Lung Cancer based on Shape and Color. In: Proceedings of the 4th International Conference on Innovations in Information Technology, Dubai, UAE, November 2007, pp. 481–485 (2007)
8. Bezdek, J.C.: Fuzzy Mathematics in Pattern Classification. Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, NY (1973)
9. Chen, Y.-S.: An Efficient Fuzzy c-means Clustering Algorithm for Image Data. In: Mathematical Problems in Engineering and Aerospace Sciences (2004)
A Multi-stage Approach for 3D Teeth Segmentation from Dentition Surfaces

Marcin Grzegorzek1, Marina Trierscheid1,2, Dimitri Papoutsis2, and Dietrich Paulus1

1 Research Group for Active Vision, University of Koblenz-Landau, Universitätsstr. 1, D-56070 Koblenz, http://www.uni-koblenz.de/agas
2 RV realtime visions GmbH, Technology Center Koblenz, Universitätsstr. 3, D-56070 Koblenz, http://www.realtimevisions.com
Abstract. In this paper, we present a multi-stage approach for teeth segmentation from 3D dentition surfaces based on a 2D model-based contour retrieval algorithm. First, a 3D dentition model is loaded into the system and a range image is created. Second, binarized 2D sectional images are generated and contours are extracted. During several processing steps, a set of tooth contour candidates is produced and evaluated. The best-fitting contour for each tooth is refined using snakes. Finally, the 2D contours are integrated into full 3D segmentation results. Due to its excellent experimental results, our algorithm has been applied in the practical realization of a so-called virtual articulator currently being developed for dentistry. Today, only mechanical articulators are used in dental practice. They are used in the fabrication and testing of removable prosthodontic appliances (dentures), fixed prosthodontic restorations (crowns, bridges, inlays and onlays), and orthodontic appliances. Virtual articulators are supposed to provide the same functionality, however, in a much more flexible and convenient way.
1 Introduction
Dental treatments often require a functional analysis of the movements of the mandible and of the various tooth-to-tooth relationships that accompany those movements. Currently this is done in practice by a device called a mechanical articulator. Mechanical articulators simulate the temporomandibular joints and jaws, to which maxillary and mandibular casts are attached (see Figure 1, left). They are used in the fabrication and testing of removable prosthodontic appliances (dentures), fixed prosthodontic restorations (crowns, bridges, inlays and onlays), and orthodontic appliances. Mechanical articulators, however, feature
Research activities leading to this work have been supported by the German Federal Ministry of Economics and Technology within the funding scheme PRO INNO II. Author's new address: SOVAmed GmbH, Universitätsstr. 3, D-56070 Koblenz, http://www.sovamed.com
Fig. 1. Left: typical mechanical articulator. Right: 3D dentition model for the virtual articulator.
some limitations. They have a small number of degrees of freedom, and their spatial accuracy is rather moderate, at least in comparison to the precision of state-of-the-art 3D scanners nowadays available in the medical industry. For these reasons, a project funded by the German Federal Ministry of Economics and Technology, aiming at conceptualizing and developing a virtual articulator, has been initiated. The virtual articulator is a system with the same practical application as the mechanical one; however, it works in a much more precise and convenient way. The system consists of three parts: an impression tray, a device for the analysis of jaw motion using ultrasonic sensors, and a high-precision optical 3D scanner for the digital acquisition of the dental impression. Thus the anatomical information is combined with the actual motion of the maxillary and mandibular teeth in relation to each other. One goal of the virtual articulator is to determine the teeth antagonists and the occlusion surfaces between them. This analysis can be accomplished without any limitation on the degrees of freedom. However, apart from the teeth areas, neighbouring surfaces such as the gums and teethridge are also included in the dentition models (see Figure 1, right). Therefore, the first and unavoidable step in designing a virtual articulator is the segmentation of teeth from the 3D dentition models. The list of requirements for the teeth segmentation approach has been created by the dental doctors and technicians involved in the project. The segmentation algorithm must: (i) work with a high geometrical accuracy, (ii) require little user interaction, and (iii) provide segmentation results for a jaw with 16 teeth within a few seconds. These requirements have significantly influenced the selection of our segmentation strategy and serve as a basis for its evaluation. Our approach starts by generating 2D slices from the 3D dentition models given as range images; thus, the problem is reduced to a 2D segmentation task. One can distinguish between four main categories of 2D image segmentation algorithms, namely pixel-based, region-based, contour-based, and model-based methods [1]. The pixel-based and region-based approaches work very fast and do not require any user interaction; however, due to their simplicity they are unable to overcome the problem of artefacts in the data (e.g., holes in the 3D dentition models). The contour-based strategy, still fast and fully automatic, is suitable for segmenting structures with clear boundaries. Unfortunately, neighbouring tissue surfaces visualized in the dentition models are very similar to
each other, and there are no clear borders between them. Model-based methods are able to integrate a priori knowledge into the segmentation process, which is a huge advantage if the application domain is as clearly defined as in our case. Therefore, our own approach for the segmentation of teeth in 2D range images uses several model-based techniques, including active contours. Active contours adopt aspects of the contour-based and region-based methods and integrate object-based model knowledge into the segmentation process. Finally, we integrate the 2D image segmentation results back into 3D space, which provides the 3D teeth models. The paper is structured as follows. Section 2 overviews the most relevant approaches for 3D segmentation. In Section 3, our own segmentation method is described. Section 4 comprehensively evaluates the approach using a manually created ground truth. Section 5 concludes the paper and briefly states our future work in the context of the virtual articulator.
2 Related Work
Automatic teeth segmentation algorithms can be used for various purposes, and there is some related work in this area. However, these approaches do not altogether fulfil the project's requirements specified in Section 1. Therefore, the segmentation algorithm used for the virtual articulator has been designed for that purpose in particular, which makes it original and novel. The most closely related work has been published by Kondo et al. in [2]. This paper presents an automated method for tooth segmentation from 3D digitized images captured by a laser scanner. Similar to our approach, the authors avoid the complexity of directly processing 3D mesh data and process range images. They use two range images: first, they obtain the dental arch from the plan-view range image and then compute a panoramic range image of the dental model based on the dental arch. The interstices between the teeth are detected separately in the two range images, and results from both views are combined for a determination of interstice locations and orientations. Finally, the teeth are separated from the gums by delineating the gum margin. The authors tested their algorithm on 34 dental models; no gold standard was used for the evaluation. Moreover, their dental models seem to be very clean and free from the artefacts which our models show. There are further scientific contributions in this area. In [3], an automatic tooth segmentation method using active contours without edges is presented. Here, the authors concentrate on the identification of deceased individuals based on dental characteristics. However, in contrast to our approach, side views of the teeth are segmented in this work. Another related approach has been published in [4], where the authors present a multi-step approach based on geometric types of active contours for volumetric computed tomography (CT). Finally, Zhao et al. introduce an interactive method for teeth segmentation in [5]. Based on curvature values of the triangle mesh, feature points are connected to feature regions and a feature contour is obtained with the help of user-supplied information.
[Figure 2: flow diagram with stages (a)-(h): 3D Jaw Geometry, Reference Points, 2D Sectional Images, 2D Binary Images, 2D Teeth Contours (candidates), 2D Teeth Contours (one per tooth), Refined 2D Teeth Contours, 3D Segmentation Results]
Fig. 2. Processing chain of our segmentation algorithm serving as a road map for Section 3
Fig. 3. Range image generated from a dentition model
3 Segmentation Approach
In this section, we describe our teeth segmentation algorithm. Its processing chain is depicted in Figure 2, where steps (a) to (d) can be considered as preprocessing (Subsection 3.1) and steps (e) to (h) contain the computation of teeth contours in 2D range images as well as their integration into 3D segmentation results (Subsection 3.2).

3.1 Preprocessing
First, a 3D dentition model is loaded into the system in the form of a 3D jaw geometry (STL format), and a range image is created from it (Figure 3). Second, the user sets one reference point near the center of each tooth visible in the range image. Third, slices of the denture surface are cut on multiple planes, and a range image is created from each slice, resulting in several 2D sectional images. Finally, the sectional images are binarized using simple thresholding, so that white pixels represent available geometry and black pixels depict the background.
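One possible reading of the slicing and binarization steps is sketched below in Python/NumPy; the cutting heights and the thresholding direction are our assumptions, not the paper's parameters.

```python
import numpy as np

def binarized_sections(range_image, cutting_heights):
    """Sketch of the sectional-image step: one binary image per plane.

    White (255) marks available geometry at or above the plane height;
    black (0) is background. cutting_heights is a list of z-values.
    """
    return [((range_image >= z) * 255).astype(np.uint8)
            for z in cutting_heights]
```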
Fig. 4. Left: split points of the main contour. Middle: contour segments after splitting. Right: reshaped contour. The circle in the middle of the images marks the reference point.
3.2 2D Teeth Contours and 3D Segmentation Results
For each reference point, and thus for each tooth, 2D contours are extracted from each binary image resulting from the preprocessing. To this end, in each binary image a square proximity region of fixed size around the reference point is taken into consideration. The iterative contour extraction algorithm of Suzuki [6] is applied to each region row by row. As a result, all contours are stored in a tree data structure, the contour tree. Depending on the kind of each contour (white-to-black passage, or black-to-white passage), it is denoted either as internal or external. Obviously, internal contours represent artefacts (holes) in the binary images (black pixels in a white region) and are disregarded from further investigation. The longest external contour of the contour tree which encloses the reference point is chosen to be the main contour for the following steps of the algorithm. The main contour resulting from a region around a molar looks as depicted in Figure 4. In order to reshape this contour so that it matches the tooth outline, split points between the teeth (highlighted in Figure 4) must be determined. For this purpose, dominant points of the contour are detected first. Candidates p for dominant points are identified by inscribing triangles in the contour as in Figure 5, using the algorithm of Chetverikov [7]. Here, a dominant point fulfills the following criteria:

d_{min} \le \|p - p^+\| \le d_{max}, \quad d_{min} \le \|p - p^-\| \le d_{max}, \quad \alpha \le \alpha_{max}   (1)

where \|p - p^+\| = a and \|p - p^-\| = b are Euclidean distances between the points, and \alpha \in [-\pi, \pi] is the opening angle at the point p. In addition to the dominant points, points at which the convexity property of the main contour is not fulfilled are also calculated. They are quantitatively expressed by a measure of nonconvexity; those with a high value of this measure (greater than a particular threshold) are called convexity defects. The split points between the teeth are eventually identified by the intersection between the set of dominant points and the set of convexity defects.

Fig. 5. Finding candidates for dominant points
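A simplified sketch of the triangle test of eq. (1) follows; it illustrates the criterion only and is not the full algorithm of Chetverikov [7], and the single-pass search strategy is our own.

```python
import numpy as np

def dominant_point_candidates(contour, d_min, d_max, alpha_max):
    """Return indices of contour points admitting an inscribed triangle
    that satisfies the three conditions of eq. (1).

    contour : (M, 2) array of ordered contour points
    """
    M = len(contour)
    candidates = []
    for i in range(M):
        p = contour[i]
        for off in range(1, M // 2):
            p_plus = contour[(i + off) % M]
            p_minus = contour[(i - off) % M]
            a = np.linalg.norm(p - p_plus)      # |p - p+|
            b = np.linalg.norm(p - p_minus)     # |p - p-|
            if d_min <= a <= d_max and d_min <= b <= d_max:
                c = np.linalg.norm(p_plus - p_minus)
                # opening angle at p via the law of cosines
                alpha = np.arccos(np.clip((a*a + b*b - c*c) / (2*a*b), -1.0, 1.0))
                if alpha <= alpha_max:
                    candidates.append(i)
                    break
    return candidates
```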
Splitting up the main contour using these points results in several contour segments. Segments not matching the outline of the tooth are discarded using a priori knowledge about the characteristics of these outlines. Finally, the remaining contour segments (Figure 4, right) are used to create a candidate tooth contour for this tooth. They are connected to each other based on the Bresenham line algorithm. Up to this point (step (e) in Figure 2), several candidates for the tooth contours have been retrieved. In the next step, the best-fitting contour is determined for each tooth. For this we have defined certain criteria in order to evaluate the candidates. They are based on contour form features involving the following:

area A: the number of pixels enclosed by the contour
perimeter P: the length of the contour as the sum of distances of consecutive points
compactness c: a measure of the efficiency of the contour to contain a given area A, defined as

c = \frac{P^2}{4\pi A}   (2)

The compactness of a circle equals one; less compact contours result in a value greater than one. Next, from all candidate contours three reference values are identified: the maximum area maxArea, the smallest measured value of compactness minCompactness, and the smallest of all major convexity defects minConvexityDefect. The best-fitting contour for a certain tooth can now be retrieved by evaluating the following simple optimization vector v for each candidate i:

v[1] = \frac{contour[i].area}{maxArea} \cdot \alpha   (3)

v[2] = \frac{minConvexityDefect}{contour[i].majorConvexityDefect} \cdot \beta   (4)

v[3] = \frac{minCompactness}{contour[i].compactness} \cdot \gamma   (5)
The factors α, β, γ are weights for the criteria. The vector having the greatest magnitude finally refers to the "optimal" tooth contour among the candidates. A contour found by the algorithm described above usually encloses most of the tooth area, but the boundaries may be imprecise. This is due to retrieving the contours from sectional images which only represent a part of the dentition model. Therefore, the tooth contour is refined using the snakes algorithm on a gradient image computed from a range image of the model. A snake, a special case of active contours, is an energy-minimizing contour guided by external constraint forces and influenced by image forces that pull it toward features such as lines and edges [8]. The application of snakes results in a correction of the tooth contours toward the outlines of each tooth, as long as this outline is discriminable in the gradient image to a certain extent.
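The candidate evaluation of eqs. (2)-(5) can be sketched as follows; the candidate objects with area, perimeter and majorConvexityDefect attributes, as well as the default weights, are assumptions of this illustration.

```python
import numpy as np

def best_fitting_contour(candidates, alpha=1.0, beta=1.0, gamma=1.0):
    """Pick the candidate maximizing the magnitude of v (eqs. 2-5)."""
    def compactness(c):
        return c.perimeter ** 2 / (4.0 * np.pi * c.area)     # eq. (2)

    max_area = max(c.area for c in candidates)
    min_compactness = min(compactness(c) for c in candidates)
    min_defect = min(c.majorConvexityDefect for c in candidates)

    def magnitude(c):
        v = np.array([
            c.area / max_area * alpha,                       # eq. (3)
            min_defect / c.majorConvexityDefect * beta,      # eq. (4)
            min_compactness / compactness(c) * gamma,        # eq. (5)
        ])
        return np.linalg.norm(v)

    return max(candidates, key=magnitude)
```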
Fig. 6. An example 3D segmentation result for a premolar from two points of view
In the final step of our method, the 2D refined contours computed for a particular tooth are integrated into a complete 3D tooth segmentation result. If a single tooth of the dentition model is needed, the particular contour is taken into account and the subjacent¹ surface is extracted. The set of all teeth surfaces can be retrieved by extracting all surface regions which are enclosed by any tooth contour. An example segmentation result for a premolar visualized from two different points of view can be found in Figure 6.
4 Experiments and Results
For the experiments, we used altogether 28 pairs of 3D dentition models given in the STL format, scanned with an optical 3D dental scanner. We divided them into three categories. The first one (dataset I) consists of 8 pairs acquired from artificial sets of teeth created by an experienced dental technician. It represents ideal teeth in the sense of their anatomical features (e.g., the characteristics of fissures and cusps). The second one (dataset II) includes 17 pairs of dental models which have been acquired from European patients with healthy teeth. Although the surfaces of these models are very smooth, some teeth feature malpositions, which makes the segmentation task more difficult. The third category (dataset III) consists of 4 pairs of dentition models; here, the malposition of teeth is more significant. All teeth have been manually segmented by an experienced dental technician, which provided a gold standard for the evaluation. In our experiments, we have distinguished between the four different kinds of human teeth, namely incisors, canines, premolars and molars². Apparently the difficulty of the segmentation task differs significantly from one teeth category to another, which is why we have evaluated them separately. The accuracy of the 2D tooth contour determination has been evaluated using well-known discrepancy measures: the sensitivity and specificity as well as the Tanimoto and Dice measures.

¹ Considering the point of view from which the range image has been created before.
² Incisors – 1st and 2nd in the quadrant; canines – 3rd; premolars – 4th and 5th; molars – 6th, 7th and 8th.
Table 1. Mean values of the sensitivity (Se), specificity (Sp), Tanimoto measure (Ta), and Dice measure (Di) achieved by our algorithm for teeth contour estimation, calculated for four different types of teeth and three different datasets

Data  Incisors                Canines                 Premolars               Molars
Set   Se   Sp   Ta   Di       Se   Sp   Ta   Di       Se   Sp   Ta   Di       Se   Sp   Ta   Di
I     0.82 1.00 0.76 0.86     0.83 1.00 0.77 0.87     0.99 0.99 0.92 0.96     0.97 0.98 0.92 0.96
II    0.86 0.99 0.79 0.87     0.90 0.99 0.84 0.91     0.99 0.98 0.91 0.95     0.95 0.96 0.85 0.92
III   0.54 0.80 0.45 0.56     0.76 0.99 0.71 0.82     0.96 0.98 0.88 0.94     0.94 0.95 0.84 0.90

Table 2. Standard deviations of the sensitivity (Se), specificity (Sp), Tanimoto measure (Ta), and Dice measure (Di) achieved by our algorithm for teeth contour estimation, calculated for four different types of teeth and three different datasets

Data  Incisors                Canines                 Premolars               Molars
Set   Se   Sp   Ta   Di       Se   Sp   Ta   Di       Se   Sp   Ta   Di       Se   Sp   Ta   Di
I     0.17 0.01 0.14 0.12     0.12 0.01 0.10 0.06     0.02 0.01 0.02 0.01     0.05 0.01 0.05 0.02
II    0.13 0.02 0.11 0.07     0.11 0.01 0.09 0.06     0.02 0.01 0.04 0.02     0.14 0.04 0.13 0.10
III   0.33 0.38 0.26 0.31     0.19 0.01 0.16 0.13     0.10 0.02 0.10 0.06     0.12 0.06 0.14 0.10
Let A and B be two compact sets of image elements of the sizes |A| and |B|.

Dice coefficient:
C_D = \frac{2|A \cap B|}{|A| + |B|}   (6)

Tanimoto coefficient:
C_T = \frac{|A \cap B|}{|A \cup B|}   (7)

Let N be the set of all image elements of the size |N|, and A, B two compact sets of image elements of the sizes |A|, |B|, where B is the gold standard.

Sensitivity of A:
Se = \frac{|A \cap B|}{|B|}   (8)

Specificity of A:
Sp = \frac{|N| - |A \cup B|}{|N| - |B|}   (9)
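These four measures are straightforward to compute from binary masks; a sketch follows, with A the segmentation result and B the gold standard, both as boolean arrays. The function name is ours.

```python
import numpy as np

def discrepancy_measures(A, B):
    """Dice, Tanimoto, sensitivity and specificity (eqs. 6-9) for two
    boolean masks of identical shape; B is the gold standard."""
    inter = np.logical_and(A, B).sum()
    union = np.logical_or(A, B).sum()
    n = A.size
    dice = 2.0 * inter / (A.sum() + B.sum())            # eq. (6)
    tanimoto = inter / union                            # eq. (7)
    sensitivity = inter / B.sum()                       # eq. (8)
    specificity = (n - union) / (n - B.sum())           # eq. (9)
    return dice, tanimoto, sensitivity, specificity
```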
The mean values of the achieved results can be found in Table 1, and the standard deviations in Table 2. As one can see, remarkably high performance has been achieved on all three datasets for premolars and molars (see example results in Figure 7). Incisors cause our algorithm difficulties, especially those of dataset III.
Fig. 7. Above: gold standard of an upper jaw of dataset I in the form of binary masks. Below: corresponding segmentation result.
5 Conclusions
In this work, we presented a new algorithm for the segmentation of teeth in 3D dentition models. The approach will be used in the practical realization of the so-called virtual articulator. The requirements for our method, defined by a dental technician, include a high geometrical accuracy, limited user interaction, and execution-time efficiency. In the development and evaluation phase of our algorithm we focused on the 2D tooth contour determination, which finally served as a basis for the full 3D tooth segmentation. In the only interactive step of our approach, the user is required to set one reference point per tooth manually in a range image. This is necessary, since our initial experiments with fully automatic methods did not satisfy the industrial requirements. Usually, there are 14 to 16 teeth in a dentition model; thus, the manual effort has been significantly reduced in comparison to a previous method, where the teeth contours had to be segmented completely manually. The execution time of a few seconds from setting the reference points until achieving the segmentation results for one dentition model also fulfills the requirements of the virtual articulator. The geometrical accuracy of our segmentation approach has been evaluated for more than 800 teeth against a gold standard created by an experienced dental technician. The high sensitivity and specificity achieved in our experiments, especially for premolars and molars, prove the applicability of our method in the practical realization of the virtual articulator. In the future, we will analyse range images taken from frontal views of the teeth in order to increase the segmentation accuracy for incisors.
References
1. Lehmann, T., Oberschelp, W., Pelikan, E., Repges, R.: Bildverarbeitung für die Medizin: Grundlagen, Modelle, Methoden, Anwendungen. Springer, Heidelberg (1997)
2. Kondo, T., Ong, S.H., Foong, K.W.C.: Tooth segmentation of dental study models using range images. IEEE Transactions on Medical Imaging 23(3), 350–362 (2004)
3. Shah, S., Abaza, A., Ross, A., Ammar, H.: Automatic tooth segmentation using active contour without edges. In: 2006 Biometrics Symposium, Baltimore, USA, August 2006, pp. 1–6. IEEE Computer Society, Los Alamitos (2006)
4. Keyhaninejad, S., Zoroofi, R.A., Setarehdan, S.K., Shirani, G.: Automated segmentation of teeth in multi-slice CT images. In: International Conference on Visual Information Engineering, Bangalore, India, September 2006, pp. 339–344. IEEE Computer Society, Los Alamitos (2006)
5. Zhao, M., Ma, L., Tan, W., Nie, D.: Interactive tooth segmentation of dental models. In: IEEE Conference on Engineering in Medicine and Biology, Shanghai, China, September 2005, pp. 654–657. IEEE Computer Society, Los Alamitos (2005)
6. Suzuki, S., Abe, K.: Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing 30(1), 32–46 (1985)
7. Chetverikov, D.: A simple and efficient algorithm for detection of high curvature points in planar curves. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 746–753. Springer, Heidelberg (2003)
8. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
Global versus Hybrid Thresholding for Border Detection in Dermoscopy Images

Rahil Garnavi¹, Mohammad Aldeen¹, Sue Finch², and George Varigos³

¹ NICTA Victoria Research Laboratory, Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Australia
² Department of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
³ Department of Dermatology, Royal Melbourne Hospital, Melbourne, Australia
[email protected], {aldeen,sfinch}@unimelb.edu.au, [email protected]

Abstract. In this paper we demonstrate the superiority of the automated hybrid thresholding approach to border detection in dermoscopy images over the global thresholding method, through a newly introduced evaluation metric: the Performance Index. The approach incorporates optimal color channels into the hybrid thresholding method, which is a combination of global and adaptive local thresholding, to determine the closest border to that drawn by dermatologists. A statistical analysis and an optimization procedure are used and shown to be convergent in determining the optimal parameters for the local thresholding procedure in order to obtain the most accurate borders. The effectiveness of the approach is tested on 55 high-resolution dermoscopy images of patients, with manual borders drawn by three expert dermatologists, and the union is used as the ground truth. The results demonstrate the significant advantages of the automated hybrid approach over the global thresholding method.

Keywords: Border detection, Histogram thresholding, Dermoscopy, Melanoma, Computer-aided diagnosis.
1 Introduction
Malignant melanoma is one of the most lethal and rapidly increasing cancers. It represents 10% of all cancers in Australia, and its incidence is four times higher than in Canada, the UK and the US, with more than 10,000 cases diagnosed and around 1700 deaths annually [1]. It is well known that the clinical diagnosis of melanoma is inherently subjective and its accuracy is considered to be limited, especially for equivocal lesions [2]. This has led to the emergence of computer-based analysis of dermoscopy images. Automatic border detection is an important initial step towards the development of a computer-aided diagnosis of melanoma. The accuracy of the detected border is crucial, as it contains important diagnostic features such as asymmetry and border irregularity and also reveals information about homogeneity and dermoscopic patterns. Therefore, failure to include the whole lesion when segmentation
is performed can drastically bias the results of the successive steps of the diagnostic system: feature extraction and classification. Numerous border detection methods have been proposed in the literature [3], including histogram thresholding followed by region growing [4], color clustering [5], statistical region merging [6], k-means++ clustering followed by region merging [7], etc. Melli et al. [5] criticized adaptive thresholding methods for not providing accurate segmentation results, due to the lack of a sharp bimodal luminance distribution between the surrounding skin and the lesion. However, in this study we demonstrate that with a proper choice of color channel, adaptive histogram thresholding can produce borders which are highly accordant with those manually drawn by expert dermatologists. In our preliminary study [8] we proposed a hybrid thresholding method, which combines global and adaptive local thresholding and uses a color optimization process to detect the lesion borders, and demonstrated its superiority over other state-of-the-art skin lesion segmentation algorithms. Here, we provide a new analytical and experimental framework to justify the necessity of the second stage of the algorithm, the adaptive thresholding, in the border detection process. The rest of the paper is organized as follows. The border detection method is summarized in Section 2. Experimental results and discussions are presented in Section 3. Section 4 provides the conclusion.
Fig. 1. Three areas generally identified in dermoscopy images: core-lesion, edge-lesion and background skin
2 Hybrid Thresholding Method
Manual borders drawn by dermatologists tend to surround the borders produced by automated methods [3,8]. As shown in Figure 1, three areas are generally identified in dermoscopy images: core-lesion, edge-lesion and background skin. The width of the edge-lesion area is variable, depending on the skin color, lesion color and border fuzziness. Existing border detection methods can easily identify the core-lesion area by finding the sharpest pigment change. However, they often fail to precisely detect the edge-lesion area. In contrast, dermatologists prefer to choose the outermost detectable pigment to minimize the risk of incorrect
diagnosis. We tackle the problem by proposing a two-stage histogram thresholding method [8] which successively identifies the core-lesion and edge-lesion areas, and here we highlight the success of the method in detecting the edge-lesion area. In the following, a summary of the method is provided.
2.1 Forming the Core-Lesion
The core-lesion area is detected by applying global thresholding to the optimal color channel XoYoR obtained in the color optimization procedure [9]. This step includes the pre-processing operations of hair removal [10], noise filtering using a Gaussian low-pass filter (with a 10 × 10 kernel) and intensity adjustment. This is followed by application of the Otsu thresholding method [11] to the XoYoR color channel, and by connected component analysis and morphological operations to obtain the initial border and form the core-lesion area.
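As an illustration, a minimal sketch of this stage is given below, assuming the optimal XoYoR channel is already available as a 2-D array. The scikit-image calls, the Gaussian sigma, the structuring-element size and the threshold polarity (lesion darker than skin) are our substitutions for the paper's 10 × 10 kernel and unspecified morphological operations, not the authors' code:

    import numpy as np
    from skimage.filters import gaussian, threshold_otsu
    from skimage.measure import label, regionprops
    from skimage.morphology import binary_closing, disk

    def core_lesion_mask(channel):
        """channel: 2-D float array holding the optimal XoYoR color channel."""
        smoothed = gaussian(channel, sigma=2)         # noise filtering
        t = threshold_otsu(smoothed)                  # global Otsu threshold
        lesion = smoothed < t                         # assumes lesion darker than skin
        lesion = binary_closing(lesion, disk(5))      # morphological clean-up
        labels = label(lesion)                        # connected component analysis
        regions = regionprops(labels)
        largest = max(regions, key=lambda r: r.area)  # keep the largest component
        return labels == largest.label                # binary core-lesion mask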
2.2 Forming the Edge-Lesion
To expand or shrink the core-lesion boundary to the edge-lesion boundary, an adaptive local thresholding technique based on the Otsu method [11] is applied to the X color channel determined as optimal in the color optimization procedure [9]. Starting from an arbitrary point on the core-lesion boundary, the local threshold is calculated over a window of size W. If the local threshold value is less than a defined threshold called Texpand, the boundary is expanded by one pixel. If it is larger than a defined threshold called Tshrink, the boundary is shrunk. Otherwise, it is in the No Change state, in which, based on the previous moves, we may decide either to move laterally to the adjacent pixel, or to expand or shrink the boundary in a radial manner. To define the threshold values for shrinkage and expansion (Equation 2), a bandwidth (Equation 1) is calculated based on the background skin and core-lesion pixel values. The Otsu method is applied to the background skin area and core-lesion area to obtain estimates of these values.

Bandwidth = B% × (Tskin − TcoreLesion)    (1)

where B, the bandwidth factor, is the extent to which the core-lesion area is expanded towards the background skin.

Texpand = Tskin − Bandwidth,    Tshrink = Tskin + Bandwidth    (2)

The local threshold is calculated for every pixel over its surrounding window, and the process stops when the initial pixel is revisited. Further details may be found in [8].
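A hedged sketch of the per-point decision, using the thresholds of Equations (1)–(2); the full boundary tracing and the lateral-move logic of the No Change state are omitted, and all names are ours rather than the authors':

    from skimage.filters import threshold_otsu

    def expand_shrink_thresholds(t_skin, t_core, b_percent):
        """Equations (1)-(2): t_skin and t_core are Otsu estimates over the
        background-skin and core-lesion areas; b_percent is B in percent."""
        bandwidth = (b_percent / 100.0) * (t_skin - t_core)
        return t_skin - bandwidth, t_skin + bandwidth   # (Texpand, Tshrink)

    def boundary_step(x_channel, point, w, t_expand, t_shrink):
        """Classify one core-lesion boundary point: expand, shrink or no change."""
        r, c = point
        h = w // 2
        window = x_channel[max(r - h, 0):r + h + 1, max(c - h, 0):c + h + 1]
        t_local = threshold_otsu(window)   # local Otsu over the W-sized window
        if t_local < t_expand:
            return 'expand'                # push the boundary out by one pixel
        if t_local > t_shrink:
            return 'shrink'                # pull the boundary in by one pixel
        return 'no_change'                 # lateral move / radial decision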
3 Experiments
The proposed method is tested on a set of 55 high-resolution dermoscopy images taken with a Canon EOS 450D camera at the Royal Melbourne Hospital (24-bit RGB color images, 2000 × 1334, TIFF). To validate the automatic borders produced
by the method, manual borders were independently drawn by three dermatologists using a Wacom Intuos A4-size tablet. Considering the practical nature of melanoma diagnosis, which calls for extreme caution when excluding portions of the image, and taking into account the inter-observer and intra-observer variability in borders drawn by dermatologists, we calculate the union of the three manually drawn borders for each image and consider that as the ground truth. To quantitatively compare the automatic borders to the manual borders, different metrics have been used [3]. Here, we apply the five evaluation metrics of sensitivity, specificity, accuracy, similarity and border error, given by:

Sensitivity = TP / (TP + FN) × 100%    (3)
Specificity = TN / (TN + FP) × 100%    (4)
Accuracy = (TP + TN) / (TP + FP + FN + TN) × 100%    (5)
Similarity = 2 × TP / (2 × TP + FN + FP) × 100%    (6)
BorderError = (FP + FN) / (TP + FN) × 100%    (7)
where TP, TN, FP, and FN refer to true positives, true negatives, false positives and false negatives, respectively.
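For concreteness, the five metrics can be computed from boolean masks as sketched below; auto and manual are our names for the automatic border mask and the union ground truth:

    import numpy as np

    def border_metrics(auto, manual):
        """auto, manual: boolean masks of equal shape."""
        tp = np.sum(auto & manual)
        tn = np.sum(~auto & ~manual)
        fp = np.sum(auto & ~manual)
        fn = np.sum(~auto & manual)
        return {
            'sensitivity':  100.0 * tp / (tp + fn),                    # Eq. (3)
            'specificity':  100.0 * tn / (tn + fp),                    # Eq. (4)
            'accuracy':     100.0 * (tp + tn) / (tp + tn + fp + fn),   # Eq. (5)
            'similarity':   100.0 * 2 * tp / (2 * tp + fn + fp),       # Eq. (6)
            'border_error': 100.0 * (fp + fn) / (tp + fn),             # Eq. (7)
        }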
3.1 Standardisation of the Images
The lesion inside a dermoscopy image generally appears in different sizes and locations. The two metrics of accuracy and specificity include the TN factor, which refers to the number of pixels of the surrounding background skin that are truly detected by the automated method. However, the TN factor highly depends on the size of the lesion and its ratio to the whole image; thus the values of the accuracy and specificity metrics can be biased in images with small lesions. To the best of our knowledge, this issue has not been addressed in previously published studies. To balance the effect of a large TN and normalize the accuracy and specificity metrics, we set a frame of background skin around the lesion, such that the area of the rectangular image frame is twice that of an imaginary rectangle encompassing the lesion. This has the effect that the numbers of background pixels and lesion pixels are roughly the same. Figure 2 shows the segmentation result of a dermoscopy image before and after standardisation.
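A sketch of this standardisation step, under our reading that a frame of twice the bounding-box area corresponds to scaling each side of the box by √2; the exact construction of the frame is not fully specified in the text:

    import numpy as np

    def standardise(image, lesion_mask):
        """Crop so the frame area is about twice the lesion bounding box."""
        rows, cols = np.nonzero(lesion_mask)
        r0, r1 = rows.min(), rows.max()
        c0, c1 = cols.min(), cols.max()
        h, w = r1 - r0 + 1, c1 - c0 + 1
        pad_r = int(h * (np.sqrt(2) - 1) / 2)   # grow each side so that the
        pad_c = int(w * (np.sqrt(2) - 1) / 2)   # frame area is roughly 2*h*w
        r0, r1 = max(r0 - pad_r, 0), min(r1 + pad_r, image.shape[0] - 1)
        c0, c1 = max(c0 - pad_c, 0), min(c1 + pad_c, image.shape[1] - 1)
        return image[r0:r1 + 1, c0:c1 + 1]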
3.2 Optimal Choice of Window and Bandwidth
Two main parameters are involved in the proposed method: the window size (W) and the bandwidth factor (B). In order to determine the optimal values for W and B, a comprehensive set of experiments is performed on the set of 55 dermoscopy images, with W ranging from 30 to 70 and B from 10% to 90% (steps of 10). Different borders can be obtained by using different values of W and B; consequently, 45 borders are obtained for each dermoscopy image. To evaluate the results, each border is compared with the ground truth in terms of the evaluation metrics expressed in Equations (3–7).
Fig. 2. Segmentation result (a) before and (b) after standardisation
Statistical Analysis. Figures 3 and 4 show the mean value and 95% confidence interval (CI) of the sensitivity, specificity, accuracy, similarity and border error metrics for different values of the W and B parameters, respectively. We set levels of acceptability on the lower bound of the confidence interval. Due to the importance of sensitivity, we start the analysis from this metric. As shown in Figure 3(a), given a level of acceptability of 87% for sensitivity, 22 sets of W and B, out of 45, are selected, which are marked by filled circles in the graph. With a level of acceptability of 93% for specificity, nine sets from the previous 22 are nominated, as illustrated in Figure 3(b). With a 93.5% level of acceptability for accuracy, as shown in Figure 3(c), the selection is narrowed down to six sets of W and B, i.e. (30,10), (30,20), (40,20), (30,30), (40,30) and (40,40). Given these six sets, the two metrics of similarity and border error are studied to make the final decision about the optimal values for B and W. As shown in Figure 4(a) and (b), according to both of these metrics the pair (30,30) gives the best results.

Performance Index. It is often the case that different methods provide different results when calculating the five evaluation metrics. The problem that has so far existed is which method provides the best possible result when all five metrics are considered simultaneously. To answer this, we introduce the following procedure. First, we define a measure called the Performance Index (PI):

PI = (SNS + SPC + ACC + SML + BErr) / 5    (8)
where SNS, SPC, ACC, SML and BErr refer to sensitivity, specificity, accuracy, similarity and 100% − borderError, respectively.
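Equation (8) amounts to the following one-liner; note that the border error is folded in as 100 − BorderError, so that higher is uniformly better:

    def performance_index(sns, spc, acc, sml, border_error):
        """PI of Equation (8); all inputs are percentages."""
        return (sns + spc + acc + sml + (100.0 - border_error)) / 5.0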
Fig. 3. Mean and 95% confidence interval of (a) sensitivity, (b) specificity and (c) accuracy metrics for different values of B and W over the image set
Fig. 4. Mean and 95% confidence interval of (a) similarity and (b) border error metrics for different values of B and W over the image set
Fig. 5. Performance Index for different values of B and W over the image set
Second, the index is calculated for different sizes of W and B over the image set of 55 dermoscopy images, using the mean values of the metrics. Third, for each W, a family of PIs versus B is plotted, as shown in Figure 5. Fourth, we choose the maximum PI point, which reveals the optimal combination of W and B. As shown in Figure 5, the maximum value of PI is obtained for (W=30, B=30). This result is in accordance with the results we obtained from the statistical analysis above. We have chosen (W30, B30) to proceed with. However, there are other combinations for which PI is very similar, e.g. (W30, B20) or (W30, B10). As these combinations provide better sensitivity, they would also be reasonable options to consider.

Cross Validation. In order to provide stronger proof of the optimal B and W identified through the statistical analysis and the proposed performance index, we also perform an 11-fold cross-validation process, where the data set of 55 images is iteratively partitioned into a 50-image training set and a 5-image test set, resulting in 11 sets with unique combinations of test and training data. For each of the test sets, a family of PI curves for different W and B is plotted, as illustrated in Figure 6.
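The fold construction can be sketched with scikit-learn's KFold; the per-fold selection of (W, B) by maximum mean PI over the training images is indicated only as a comment, since the authors' code is not available:

    from sklearn.model_selection import KFold

    image_ids = list(range(55))                    # the 55 dermoscopy images
    kf = KFold(n_splits=11, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(image_ids):
        assert len(train_idx) == 50 and len(test_idx) == 5
        # On the 50 training images: pick the (W, B) maximising the mean PI.
        # On the 5 test images: evaluate PI at the selected (W, B).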
Fig. 6. 11-fold cross validation using the performance index evaluation metric

Table 1. Performance Index for different images in each test set

TEST     Set 1  Set 2  Set 3  Set 4  Set 5  Set 6  Set 7  Set 8  Set 9  Set 10 Set 11
Image 1  90.59  91.47  95.27  94.90  89.53  92.81  92.43  92.96  90.08  93.50  94.99
Image 2  89.62  93.80  94.41  92.58  91.47  90.90  90.53  95.35  96.48  89.52  94.44
Image 3  92.01  91.43  92.26  91.36  91.81  80.83  94.14  92.59  83.89  92.57  86.89
Image 4  91.41  92.58  87.99  95.28  89.37  85.82  85.06  94.19  94.00  90.91  94.35
Image 5  91.63  91.40  91.17  91.12  96.28  92.22  96.67  89.28  83.24  94.95  95.87
Mean     91.05  92.14  92.22  93.05  91.69  88.51  91.77  92.87  89.54  92.29  93.31
Std.     0.95   1.05   2.87   1.94   2.79   5.10   4.37   2.28   5.91   2.13   3.63
Figure 6 shows that all training sets converge to the value of (30,30) for B and W, except for set 6, where the optimal PI occurs at W=30 and B=20, whose PI value is almost identical to that of (W=30, B=30). Table 1 shows the resulting PI for the images, and the corresponding mean and standard deviation, for each test set using the optimal setting of (W=30, B=30), which demonstrates the acceptability of the identified setting.
3.3 Global versus Hybrid Thresholding
Figure 7 shows the automatic borders obtained by the global and hybrid approaches, together with the corresponding ground-truth border drawn by dermatologists (the union of the three manual borders), for a sample lesion. It shows that the hybrid approach produces a border which is much closer to the ground truth,
Fig. 7. The global and hybrid borders produced by the automated methods and the manual border drawn by dermatologists

Table 2. Evaluation metrics (mean ± margin of error) for global and hybrid thresholding methods and the difference between them

               ACC         SNS         SPC         SML         BErr        PI
Global         90.90 ±0.7  73.35 ±2.6  99.68 ±0.1  83.94 ±1.8  27.24 ±2.4  84.12 ±1.5
Hybrid         94.85 ±0.5  89.52 ±1.9  97.44 ±0.7  91.97 ±0.9  15.4 ±1.5   91.67 ±0.9
Hybrid−Global  3.95 ±0.7   16.17 ±1.5  −2.23 ±0.6  8.02 ±1.4   11.84 ±2.0  5.6 ±1.0
compared to the border produced by the global thresholding method. In order to objectively quantify, through the evaluation metrics, the extent of the improvement when the lesion border is expanded from the core-lesion area (the result of the global thresholding method) to the edge-lesion area (the result of the hybrid thresholding method with the optimal values of B=30 and W=30), the mean difference between these two methods is calculated. Table 2 gives the 95% confidence intervals for the mean of each metric, for the global and hybrid (W30, B30) methods, together with the mean difference between the two approaches. The Hybrid row shows that the average values for each metric are clearly in an acceptable range. The mean specificity is 97.4, with a 95% confidence interval of 96.9 to 98.1, and we can be confident that specificity will be high when the parameters (W=30, B=30) are used. Table 2 also demonstrates that with respect to all metrics there is a significant improvement when the hybrid thresholding method is applied, with the exception of specificity. However, in the application of border detection in dermoscopy images, dermatologists are more concerned about not leaving out any part of the lesion than about erroneously including some part of the surrounding normal skin. Thus, in this particular application specificity is less important. Moreover, the mean level of specificity is very high in both approaches. As shown in Table 2, sensitivity, similarity and accuracy are increased by 16.17%, 8% and 3.95%,
respectively, and border error is reduced by 11.84%, which demonstrates the significant superiority of the hybrid approach. The Performance Index is also elevated by 5.6%, which confirms the better performance of the hybrid thresholding method.
4 Conclusion
In this paper we demonstrate the superiority of the automated hybrid thresholding approach to border detection in dermoscopy images over the global thresholding method, through a newly introduced evaluation metric, the Performance Index, which is composed of the standard metrics of sensitivity, specificity, accuracy, similarity and border error. A statistical analysis and an optimization procedure are used and shown to be convergent. The effectiveness and validity of the approach is demonstrated by an experiment on 55 high-resolution dermoscopy images. The experimental results clearly show the significant advantages of the proposed hybrid method over the global thresholding approach. In addition to higher performance, other advantages of the method include simplicity and low computational cost.
Acknowledgments. The assistance of Dr. Gayle Ross and Dr. Alana Tuxen in obtaining the manual borders, and of Mr. Michael Purves and Ms. Amanda Rebbechi in taking the dermoscopy images, all from the Royal Melbourne Hospital, is gratefully acknowledged. This research is supported by NICTA Victoria Research Laboratory, Australia.
References

1. Australia skin cancer facts and figures, http://www.cancer.org.au/ (accessed September 2009)
2. Perrinaud, A., Gaide, O., French, L.E., Saurat, J.H., Marghoob, A.A., Braun, R.P.: Can automated dermoscopy image analysis instruments provide added benefit for the dermatologist? British Journal of Dermatology 157, 926–933 (2007)
3. Celebi, M.E., Iyatomi, H., Schaefer, G., Stoecker, W.: Lesion border detection in dermoscopy images. Computerized Medical Imaging and Graphics 33, 148–153 (2009)
4. Iyatomi, H., Oka, H., Celebi, M.E., Hashimoto, M., Hagiwara, M., Tanaka, M., Ogawa, K.: An improved internet-based melanoma screening system with dermatologist-like tumor area extraction algorithm. Computerized Medical Imaging and Graphics 32, 566–579 (2008)
5. Melli, R., Grana, C., Cucchiara, R.: Comparison of color clustering algorithms for segmentation of dermatological images. In: SPIE Medical Imaging, vol. 6144, pp. 3S1–3S9 (2006)
6. Celebi, M.E., Kingravi, H., Iyatomi, H., Aslandogan, Y.A., Stoecker, W.V., Moss, R.H., et al.: Border detection in dermoscopy images using statistical region merging. Skin Research and Technology 14, 347–353 (2008)
7. Zhou, H., Chen, M., Zou, L., Gass, R., Ferris, L., Drogowski, L., Rehg, J.M.: Spatially constrained segmentation of dermoscopy images. In: 5th IEEE International Symposium on Biomedical Imaging, pp. 800–803 (2008)
8. Garnavi, R., Aldeen, M., Celebi, M.E., Finch, S., Varigos, G.: Border detection in dermoscopy images using hybrid thresholding on optimized color channels. Computerized Medical Imaging and Graphics, Special Issue: Skin Cancer Imaging (to appear)
9. Garnavi, R., Aldeen, M., Celebi, M.E., Bhuiyan, A., Dolianitis, C., Varigos, G.: Automatic segmentation of dermoscopy images using histogram thresholding on optimal color channels. International Journal of Medicine and Medical Sciences 1, 126–134 (2010)
10. Lee, T., Ng, V., Gallagher, R., et al.: Dullrazor: A software approach to hair removal from images. Computers in Biology and Medicine 27, 533–543 (1997)
11. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 62–66 (1979)
Unsupervised Segmentation for Inflammation Detection in Histopathology Images

Kristine A. Thomas¹, Matthew J. Sottile¹, and Carolyn M. Salafia²

¹ Department of Computer and Information Science, University of Oregon, 1202 University of Oregon, Eugene, OR 97403-1202, USA
² Placental Analytics, LLC, 93 Colonial Avenue, Larchmont, New York 10538, USA
Abstract. Acute inflammation of the placenta is associated with an increased rate of perinatal morbidity and mortality. This paper presents a novel method for analyzing digitized images of hematoxylin and eosin (H&E) stained histology slides to detect and quantify inflammatory polymorphonuclear leukocytes (PMLs). The first step is isolating the tissue by removing the background and red blood cells via color thresholding. Next, an iterative threshold technique is used to differentiate tissue from nuclei. Nuclei are further segmented to distinguish between nuclei which match the morphological characteristics of PMLs and those which do not, such as connective tissue fibroblasts as well as chorion and amnion epithelium. Quantification of the remaining nuclei, many of which are likely neutrophils, is associated with amniotic fluid proteomic markers of infection and inflammation, as well as with histological grading of neutrophils in the amnion, chorion and decidua, and in the umbilical cord.

Keywords: Medical image processing, computer aided diagnosis, quantitative image analysis, histopathology image analysis, microscopy images, pathological image analysis, image processing, nuclei morphology.
1 Introduction
The diagnosis of intra-amniotic infection is clinically relevant to perinatal morbidity and mortality. Even among expert pathologists, variability in the diagnosed degree of infection occurs with the standard qualitative visual examination of hematoxylin and eosin (H&E) stained placenta histology slides [1]. This leads to a less than optimal scoring system for determining the presence of infection, with a significant possibility of contradictory diagnoses within a set of experts examining the same slide. The goal of this paper is to explore methods to improve diagnostic reliability through automated image analysis. This analysis takes into account known color and shape properties of nuclei related to infection in stained slides. The construction of an automated method for quantification of inflammatory polymorphonuclear leukocytes (PMLs) within H&E slides would be a novel and useful tool. To address this we are developing a computer-aided diagnosis tool
which relies heavily on algorithms that segment and filter out image features that do not fit the criteria of inflammatory PML nuclei. Inflammatory PMLs are indicative of an intra-amniotic infection. By isolating inflammatory PML cells, quantitative data can be gathered about the degree of inflammation within the tissue sample. This paper discusses threshold-based segmentation and shape analysis techniques for the purpose of quantifying inflammatory PMLs. Our technique was developed without the assumption that staining techniques can be exactly controlled and replicated consistently across lab environments. We will introduce a technique for removing background information, which consists of the white background as well as red blood cells. Our color analysis makes no assumptions about the color channel bit resolution. We will also introduce a novel thresholding technique which is iterative in nature and automates the thresholding process to separate stained cell nuclei from the tissue. Finally, the cell nuclei are categorized based on size and shape to determine whether they are likely inflammatory PMLs. We believe that providing automated analysis tools based on known physical properties of inflammation indicators in stained slide images is an important step toward increasing the reliability and agreement of expert pathologist diagnoses.
2 Related Work
Much of the work in the literature related to the automated analysis of pathology slides is based on supervised techniques in which manual specification of parameters (such as thresholds and iteration counts) is required of the user. We believe this is detrimental to the acceptance of such tools by pathologists with limited (if any) familiarity with the meaning of these parameters and the underlying algorithms. A rich body of literature does exist, though, related to stained tissue image analysis for diagnostic purposes, and researchers have proposed analysis methods that increase automation and reduce the burden on the analyst when applying the computational techniques. Wang et al. presented an automated approach to segmentation of cervical tissue but had issues with false positives when their algorithm misclassified red blood cells as their intended target of squamous epithelium [2]. Our proposed method for removing red blood cells prior to applying segmentation techniques solves this problem. Sertel et al. attempted to remove white background pixels to alleviate computational intensity by not analyzing portions of the image with intensity level greater than 200 in all three 8-bit RGB color channels [3]. Upon implementation of this technique we found the fixed threshold of 200 too low for H&E stained tissue. We present an alternative method which is responsive to image conditions rather than a hard-coded value and retains the advantage of reducing the data to pixels which contain tissue rather than background. Classification using shape and color was used by Kong et al. to differentiate between pathological components of H&E stained tissue [4]. Tosun et al. proposed a shape analysis based on object size to group various tissue components [5]. In Section 4.3 we implement a similar approach for differentiating between nuclei types within placental tissue.
At present, automated (or even semi-automated) medical image processing is an open and active area of research. There are no publications known to the authors related to image processing for the purpose of identifying inflammatory PML nuclei. The medical methods of diagnosis for placental inflammation are also an open area of research, and attempts at standardized scoring of intra-amniotic infection are still a source of controversy [6]. Salafia et al. argue that the current qualitative approach needs to be improved upon via a reliable quantitative automated tool. The technique presented in this paper is an answer to that call and would allow a more universal and uniform system of diagnosis for the presence of inflammation through the identification of inflammatory PML nuclei [7].
3 Image Dataset
Samples from placental tissue embedded in paraffin and cut at a thickness of 5 µm are placed on a glass slide. Each slide is prepared using a dual staining process of hematoxylin and eosin (H&E) to create visual contrast between tissue structures. The tissue slide is loaded into an Aperio ScanScope XT and the entire slide is scanned and digitized at 20× magnification with a resolution of 0.5 µm per pixel. The image set for developing and verifying this algorithm contained 450 tissue slides, with 45 covering variations in H&E stain age and 405 covering variations in H&E stain brand. Stain age was varied on 3 sequential tissue slices from each of 15 cases, stained at 0, 2, and 4 days old. The brand variation comprised 8 unique combinations of 4 brands of hematoxylin and 2 types of eosin, plus one slice stained with Myelo IHC. This rotation of 9 stain combinations was performed 3 times over 27 sequential tissue slices from 15 tissue samples, for a total of 405 slides. The goal of this variation in brand and stain age was to determine whether variations that are to be expected in practical laboratory and hospital environments have a negative impact on the performance and reliability of the algorithm. It is not practical to assume that staining conditions can be exactly replicated across a diverse set of lab environments; therefore a robust diagnostic algorithm must take this into account when evaluating its viability.
4 Methods
The following section introduces image segmentation for background detection, a novel iterative threshold selection algorithm for cell nuclei detection, and shape analysis for isolation of the particular nuclei of interest.
4.1 Background Detection (Image Segmentation)
To set the stage for the iterative thresholding technique, the image must first consist of only H&E stained tissue. Removing extraneous features, such as the
white background of the slide and red blood cells, normalizes the image for input into the iterative thresholding technique. The method for removing background and red blood cells makes use of color ratios and is therefore neutral to the number of bits used for color representation. Consider the raw image in Figure 1(a) in the following discussion. Removal of red blood cells is accomplished using a simple threshold: within an image of H&E stained tissue, any pixel with a red-to-blue ratio greater than 1.25 is likely too red to be considered tissue or nuclei and can be masked out and ignored. The mask showing the pixels identified as meeting this criterion, to be filtered out from Figure 1(a), is shown in Figure 1(b).
Fig. 1. The process of removing red blood cells and white background from an image. The red blood cell mask removes pixels with a high red-to-blue ratio. The background mask removes pixels with high values in each color channel. Pixels meeting these criteria are shown as black in Figure 1(b) and Figure 1(c), respectively. Figure 1(d) is the resulting image after masking.
The next step is to remove all background from the image. In the case of slide images, the background is a relative "white" but not necessarily the maximum-intensity RGB color value for white. Basing the threshold on a percentage of the maximum value of each channel helps eliminate manual adjustments when a slide is very dark, very light, overcast with excess dye, or the scanner was not calibrated correctly to the background. We can compute the color channel thresholds for determining what we will consider "white" by first identifying the maximum intensity in each color channel: rmax, gmax, and bmax. Given these values, we can select individual thresholds for each channel as:

tred = 0.95 · rmax;    tgreen = 0.92 · gmax;    tblue = 0.92 · bmax
The multipliers are intended to select very bright pixels, but not strictly those at the maximum value. This is necessary to take into account variations in the lighting and background due to the imaging process. We choose a slightly stricter threshold for the red channel because we must avoid accidentally confusing sparse tissue (with a pink appearance) with background. Given these thresholds, we are then able to determine whether a given pixel should be considered "white":

isWhite(x) = (xred > tred) ∧ (xgreen > tgreen) ∧ (xblue > tblue)

The mask for background pixels to be filtered out of Figure 1(a) is shown in Figure 1(c). Even when there are no "white" pixels within the image, there is
enough variability among the color channels to push the threshold high enough that no pixels meet the criteria for thresholding in all three channels (causing at least one clause in isWhite() to be false, rendering isWhite() false), thereby making the method safe for use when no "white" background is present.
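A compact sketch of the two masks of this section for an RGB array img; casting to float keeps the ratio and percentage thresholds bit-depth neutral, as intended:

    import numpy as np

    def tissue_mask(img):
        """Return True where a pixel is tissue (not blood, not background)."""
        r = img[..., 0].astype(float)
        g = img[..., 1].astype(float)
        b = img[..., 2].astype(float)
        red_blood = (r / np.maximum(b, 1e-6)) > 1.25           # RBC mask
        t_r, t_g, t_b = 0.95 * r.max(), 0.92 * g.max(), 0.92 * b.max()
        white = (r > t_r) & (g > t_g) & (b > t_b)              # isWhite(x)
        return ~(red_blood | white)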
4.2 Cell Nuclei Detection (Iterative Threshold)
Once the blood and background are removed from the image, the remaining pixels represent tissue and cell nuclei. A threshold must be established to differentiate tissue from nuclei. The precise value of this threshold is a function of the image contrast due to lighting conditions during imaging and of shifts in the mean color intensities due to stain aging and brand variations. To eliminate the need for a technician to dial in the optimal threshold value to account for these variations, an automated method was developed. The iterative technique converges on the threshold for obtaining this distinction between tissue and nuclei as follows.

Algorithm: Iterative Threshold Selection.
Input: RGB pixel values. Output: threshold.

    threshold ← meanIntensity(Pixels) / 2
    repeat
        for all pixels p ∈ Pixels
            if p_red < threshold
                remove p from Pixels
        threshold ← meanIntensity(Pixels) / 2
    until nothing is removed from Pixels
    return threshold

The threshold value returned by the iterative threshold selection algorithm is used to separate pixels belonging to tissue from pixels belonging to nuclei, based on a comparison of the red channel value. The red channel is compared because the tissue, which appears as shades of pink, has a much higher red channel value than the nuclei, which are purple or blue. After the threshold selection algorithm was used on the image set, we found the final threshold to be slightly low (dark) for segmenting all of the nuclei in the image, due to some nuclei being overlaid with tissue or less receptive to the dye. We found that adding a 10% correction, which is 25 for the 8-bit encoding used by our slide scanner, brought some of the outlier nuclei into view.
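A direct transcription of the algorithm into Python is given below; red is assumed to be a 1-D array of red-channel values of the remaining tissue pixels, and the 10% correction mentioned above is left to the caller:

    import numpy as np

    def iterative_threshold(red):
        pixels = red.astype(float)
        threshold = pixels.mean() / 2.0
        while True:
            keep = pixels >= threshold   # pixels darker than the threshold are removed
            if keep.all():
                break                    # nothing removed: converged
            pixels = pixels[keep]
            threshold = pixels.mean() / 2.0
        return threshold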
4.3 Cell Nuclei of Interest Detection (Shape Analysis)
It is now essential to determine which of the cell nuclei are the nuclei of interest, that is, those of inflammatory PMLs. This is achieved via shape analysis, as the nuclei of interest are morphologically distinct in size and shape.
The first distinction is that of nucleus size. PML nuclei consistently take on a characteristic size within the image, allowing them to be differentiated from other cell nuclei. Filtering out nuclei that are outside of this range eliminates a large number of non-PML nuclei that have a similar response to the H&E stain. This filtering can be achieved by examining the pixel count of the candidate nuclei. The second distinction necessary to eliminate non-PML nuclei is that of nucleus shape. PML nuclei have a characteristic circular (or near-circular) shape, while other common features of similar size, such as connective tissue fibroblasts, take on a more elliptical and elongated shape. A measure known as the eccentricity can be computed for each feature to perform filtration by shape. A circle has an eccentricity of zero, and increasingly flattened oval shapes have eccentricities that approach 1.0 (Fig. 2). Experiments led to classifying cell nuclei with an eccentricity of 0.95 or above as fibroblasts; this empirical setting was manually verified by a domain expert. We made use of Matlab's built-in function to calculate the eccentricity of each connected group of pixels.
Fig. 2. Examples of shapes with an eccentricity of 0.00, 0.70 and 0.95 respectively
Inflammatory PML nuclei differ from chorion and amnion epithelium in size. Cell size analysis depends on resolution and magnification, which prevents the use of empirical values for the range of normal pixel group sizes, but the range is easily calculated. The size threshold is determined by the following formula: resolution (pixels per µm²) × cell size (µm²) = pixel group size. The threshold range is based on the normal range of cell sizes. Following the computation of the number of pixels in each connected component, connected components are eliminated as noise if their pixel count is below the pixel group size interval, while those above the interval are eliminated as epithelial cells or larger artifacts. The remaining nuclei likely belong to inflammatory PML cells.
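The size and eccentricity filters can be sketched with scikit-image's regionprops in place of the MATLAB built-in; the PML area range in µm² below is a placeholder, since the paper does not state its numeric values:

    from skimage.measure import label, regionprops

    def pml_candidates(nuclei_mask, px_per_um2=4.0,     # 0.5 um/px => 4 px per um^2
                       cell_area_um2=(20.0, 60.0)):     # hypothetical PML size range
        lo = px_per_um2 * cell_area_um2[0]              # pixel group size interval
        hi = px_per_um2 * cell_area_um2[1]
        keep = []
        for region in regionprops(label(nuclei_mask)):
            if region.area < lo:                        # too small: noise
                continue
            if region.area > hi:                        # too large: epithelium/artifact
                continue
            if region.eccentricity >= 0.95:             # elongated: fibroblast
                continue
            keep.append(region)
        return keep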
5 Results
Although literature exists related to the automated analysis of pathology slides [2–5], the presented approach has the following advantages. (1) Our method accounts for the possibility of red blood cells being present in the image and the impact they could have in causing false positives, as was experienced by [2]. Because
Fig. 3. Case study with inflammatory PML nuclei shown in 3(a), fibroblasts in 3(b) and epithelium in 3(c). The masks after thresholding are shown in 3(d), 3(e) and 3(f), respectively. Finally, 3(g), 3(h) and 3(i) show the results after size and shape filtering.
of the negative impact on results, we preprocess the image to remove red blood cells. (2) Our method for detection and removal of background pixels is responsive to image conditions and independent of the bit resolution of the RGB encoding, as opposed to [3], which presented a hard-coded value that we found to produce false positives and that was dependent on 8-bit RGB encoding. (3) Unlike anything previously published, our technique is for the purpose of detecting inflammatory PMLs. The effectiveness of our set of algorithms rests on the ability to distinguish inflammatory PML nuclei from other tissue features in digitized H&E stained histopathology slides. This has been achieved by first isolating tissue from background and red blood cells, then isolating the stained nuclei with the use of automated threshold selection, and finally classifying by known attributes of inflammatory PMLs. During the development and verification process, a domain expert visually inspected processed images and was in agreement with the algorithm regarding the nuclei isolated as likely inflammatory PMLs. The nuclei identified as inflammatory PML cells are quantified via the number of positive pixels, the percentage of positive pixels and the number of positive pixel groups. The positive pixel count is the number of pixels identified as likely belonging to inflammatory PMLs. The percentage of positive pixels is calculated by dividing the positive pixel count by the number of pixels identified as tissue. The number of positive pixel groups is the number of positive-pixel connected components. We implemented our method on an Intel Core 2 Duo T9300
2.5 GHz processor with 4 GB RAM and a 32-bit operating system, using MATLAB R2008b. Execution time ranged from 4.8 to 6.4 seconds for a 2500 × 2500 pixel ROI; this variation is due to the execution time being a function of the number of inflammatory PMLs present in the tissue sample. The original image in Figure 4(a) captures a fetal inflammatory response in the placenta. The results of the segmentation algorithms are shown in Figure 4(b), a contour map corresponding to the number of pixels representing inflammatory PMLs in a 40 × 40 neighborhood. The contour lines outline the areas with the highest concentration in black and the lowest in white. Quantification of the data in Figure 4 indicates that the number of positive pixel groups on the right side of the blood vessel is 10× the amount on the left side. This is an indication that the PMLs have migrated from the blood stream to repair injured tissue on the right side of the vessel.
Fig. 4. An example illustrating the original image (a) and the result of our algorithm (b) with black contour lines indicating regions with high inflammatory PML presence
6 Conclusion
Our ongoing work in this area is focused on performing further validation studies with domain experts. These algorithms will be used in a clinical trial conducted by Placental Analytics, LLC to compare our quantification of PMLs to the qualitative grading of inflammation by several expert pathologists as well as the results of amniotic fluid mass resistance (AFMR) scores. AFMR scores have been shown to predict preterm labor due to intra-amniotic infection [8]. An automated method for quantification of inflammatory PMLs could be a useful tool for pathologists determining the severity of amniotic infection. Automated quantification is more reliable in its reproducibility than the current qualitative approach to diagnostics. Reproducible, reliable and unsupervised segmentation for quantification of inflammatory PML cells can be performed via the use of iterative threshold selection and shape analysis. Acknowledgments. This work was supported through a research contract from Placental Analytics, LLC.
References

1. Redline, R.W., Faye-Petersen, O., Heller, D., Qureshi, F., Savell, V., Vogler, C.: Amniotic infection syndrome: nosology and reproducibility of placental reaction patterns. Pediatric and Developmental Pathology 6(5) (2003)
2. Wang, Y., Turner, R., Crookes, D., Diamond, J., Hamilton, P.: Investigation of methodologies for the segmentation of squamous epithelium from cervical histological virtual slides. In: IMVIP '07: Proceedings of the International Machine Vision and Image Processing Conference, Washington, DC, USA, pp. 83–90. IEEE Computer Society, Los Alamitos (2007)
3. Sertel, O., Kong, J., Shimada, H., Catalyurek, U.V., Saltz, J.H., Gurcan, M.N.: Computer-aided prognosis of neuroblastoma on whole-slide images: Classification of stromal development. Pattern Recognition 42(6) (2009)
4. Kong, J., Sertel, O., Shimada, H., Boyer, K.L., Saltz, J.H., Gurcan, M.N.: Computer-aided evaluation of neuroblastoma on whole-slide histology images: Classifying grade of neuroblastic differentiation. Pattern Recognition 42(6) (2009)
5. Tosun, A.B., Kandemir, M., Sokmensuer, C., Gunduz-Demir, C.: Object-oriented texture analysis for the unsupervised segmentation of biopsy images for cancer detection. Pattern Recognition 42(6) (2009)
6. Naeye, R.L.: Disorders of the placenta, fetus, and neonate: diagnosis and clinical significance. Mosby Year Book, St. Louis (1992)
7. Salafia, C.M., Misra, D., Miles, J.N.V.: Methodologic issues in the study of the relationship between histologic indicators of intraamniotic infection and clinical outcomes. Placenta 30(11) (2009)
8. Buhimschi, I.A., Buhimschi, C.S.: Proteomics of the amniotic fluid in assessment of the placenta. Relevance for preterm birth. Placenta (2008)
Scale-Space Representation of Lung HRCT Images for Diffuse Lung Disease Classification

Kiet T. Vo and Arcot Sowmya

The University of New South Wales, Sydney, Australia
{ktv,sowmya}@cse.unsw.edu.au
Abstract. A scale-space representation based on the Gaussian kernel and Gaussian derivative filters is employed to describe HRCT lung image textures for classifying four diffuse lung disease patterns: normal, emphysema, ground glass opacity (GGO) and honeycombing. The mean, standard deviation, skew and kurtosis, along with the Haralick measures of the filtered ROIs, are computed as texture features. Support vector machines (SVMs) are used to evaluate the performance of the feature extraction scheme. The method is tested on a collection of 89 slices from 38 patients, each slice of size 512×512, 16 bits/pixel in DICOM format. The dataset contains 70,000 ROIs from slices already marked by experienced radiologists. We employ this technique at different scales and different directions for diffuse lung disease classification. The technique presented here has a best overall sensitivity of 84.6% and specificity of 92.3%.

Keywords: HRCT diffuse lung disease, texture classification, scale-space feature extraction.
1 Introduction

High Resolution Computed Tomography (HRCT) is generally considered to be the best imaging modality for assessment of the lung parenchyma in patients likely to have diffuse lung diseases [1]. However, the diagnosis of diffuse lung disease from HRCT images is a difficult task for radiologists because of the complexity and variation of the visual disease patterns in the images. Therefore, the construction of a computer-aided diagnosis system for diffuse lung disease is important in providing the radiologist with a "second opinion". Texture classification has been a significant research topic in image processing, particularly in medical image analysis, and many features have been proposed to represent a texture [2]. The method chosen for feature extraction is clearly critical to the success of texture classification. Five major categories of features for texture identification have been proposed: statistical, geometrical, structural, model-based, and signal processing features. Among these methods, the signal processing approach has advantages in the characterization of the directional and scale features of textures. In the literature on HRCT lung disease classification, typical methods used are based on histogram statistics, co-occurrence matrices, run-length parameters and
fractal features, as in Uppaluri et al. [3], Chabat et al. [4] and Uchiyama et al. [5]. However, in the recent decade, feature extraction methods based on the wavelet transform have attracted more attention [6-8], with several works applying the wavelet transform to lung tissue classification: Shamsheyeva et al. [9] with the quincunx wavelet transform (QWF) combined with gray-level histogram features; Shojaii et al. [10] utilizing the vertical sub-image to extract honeycombing regions; Tolouee et al. [11] with discrete wavelet frames (DWF) and rotated wavelet frames (RWF) to describe features; and experiments conducted by Depeursinge's group using QWF combined with gray-level features, another feature called air-pix (the number of air pixels in each ROI) and a set of clinical features [12-14]. However, these wavelet-based methods suffer from a lack of directional information, a unique and important feature of multi-dimensional signals. Therefore, the contourlet transform proposed by Do [15], with intrinsic multi-dimensional information (a different number of directions at each scale) and even less computational complexity, has recently received increasing attention. In addition, another approach to signal-processing feature extraction is based on scale-space theory [16, 17]. Scale-space theory is a natural framework to construct multi-dimensional multi-scale textures by deploying multi-scale Gaussian derivative filters up to a certain order. Ginneken [18] applied this method to construct a feature vector for detecting abnormalities in chest radiographs. In this paper, we apply the scale-space representation of lung HRCT images, at different scales and different directions of Gaussian derivative filters, to tackle the problem of classifying four diffuse lung disease patterns: normal, emphysema, ground glass opacity (GGO) and honeycombing (HC) (Fig. 1). Bart ter Haar Romeny's group introduced scale-space theory to describe textures, especially in medical applications [18, 19], and the main challenge in applying scale-space theory to the classification of diffuse lung diseases in HRCT lung images is to identify the extent of scale which is appropriate to ensure both high accuracy and low computation time. The second goal of this paper is to evaluate ways to model the outputs of the filters after deploying Gaussian derivatives at different scales and directions. In this paper, we use four basic features, mean, standard deviation, skew and kurtosis, along with the Haralick features. At the classification phase, support vector machines (SVMs) are used to evaluate the performance of the feature extraction scheme. The methodology is described in detail in the next section.
Fig. 1. (a) Emphysema (red) (b) Honeycombing (green) (c) GGO (yellow)
2 Methodology

A texture classification system typically consists of several stages, such as image preprocessing, feature extraction and selection, and classification. Each stage is explained below in the context of scale-space texture classification.

2.1 Scale-Space Theory

A brief description of scale-space theory is given in [16], and the original textbook [17] may be referenced for more details. A linear scale-space is a stack of Gaussian-blurred images. The generating equation of the images L at different scales is the linear isotropic diffusion equation given below:
Δx L(x; σ) = ∂L(x; σ) / ∂t    (1)
where Δ is the Laplacian, t = σ²/2 is the diffusion time, and σ is the scale. Diffusion of intensity is equivalent to convolving the image with Gaussian kernels of variance σ², given below:

G2D(x; σ) = 1 / (2πσ²) · exp(−x² / (2σ²))    (2)
The convolution of L with G2D blurs the image and, depending on the scale of the Gaussian kernel, will emphasize coarser structures in the image, which is desirable in texture analysis as textures are scale dependent. Derivatives provide additional information about textures, but derivatives of discrete data are ill-posed. According to scale-space theory, the observed data is a Gaussian convolution of the original image:
L(x; σ) = L(x) ⊗ G2D(x; σ)    (3)
where ⊗ is the convolution operator. If we take the derivative of both sides we obtain:
∂/∂x L(x; σ) = ∂/∂x [L(x) ⊗ G2D(x; σ)]    (4)
Both convolution and differentiation are linear operators, so we can exchange the order of these two operators:
∂/∂x L(x; σ) = L(x) ⊗ ∂/∂x [G2D(x; σ)]    (5)
which means that the derivative of an observed image can be calculated by convolving the image with the derivative of the Gaussian kernel. The order of the
derivative determines the type of structure extracted from the image; e.g., the first-order derivative emphasizes edges, the second-order emphasizes ridges and corners, and so on. To construct multi-scale texture images we simply convolve the Gaussian derivatives, including the zeroth-order derivative (the Gaussian kernel itself), with the texture image. Moreover, the nth-order derivative at any given orientation can be constructed from n+1 independent orientations. For example, if n = 1, from only Lx and Ly the derivatives in all other orientations can be calculated using:

Lx^θ(x, y) = cos(θ) Lx(x, y) + sin(θ) Ly(x, y)    (6)

where Lx^θ indicates the derivative of the image L(x) in orientation θ.
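A sketch of this filter bank with scipy.ndimage; the order argument differentiates the Gaussian along each axis, and Equation (6) steers the first derivative to an arbitrary orientation:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def steered_first_derivative(image, sigma, theta_deg):
        Lx = gaussian_filter(image, sigma, order=(0, 1))  # d/dx (along columns)
        Ly = gaussian_filter(image, sigma, order=(1, 0))  # d/dy (along rows)
        t = np.deg2rad(theta_deg)
        return np.cos(t) * Lx + np.sin(t) * Ly            # Equation (6)

    def gaussian_scale_space(image, sigmas=(1, 2, 4, 8, 16)):
        return {s: gaussian_filter(image, s) for s in sigmas}  # zeroth order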
2.2 Feature Vector Construction

Once the medical images are preprocessed, Gaussian derivative filters (including the zeroth order) are used for extraction of texture information at different scales and different directions. Mean, standard deviation, skew and kurtosis texture descriptors are then extracted from each filtered ROI. This yields four texture descriptors for each combination of scale and direction. In addition, co-occurrence matrices are also calculated at each scale. A co-occurrence matrix captures the spatial dependence of contrast values, depending on the directions and distances specified. Four co-occurrence matrices are calculated for each filtered matrix at each scale: one for each of the four directions 0, 45, 90, and 135 degrees, at a set distance of one. The following fourteen Haralick texture descriptors are then extracted from each co-occurrence matrix: angular second moment, contrast, correlation, sum of squares (variance), sum average, sum variance, entropy, sum entropy, difference variance, difference entropy, inverse difference moment, information measures of correlation and maximal correlation coefficient [20]. For fast calculation of the Haralick texture features, we use the recently proposed method [21], which uses a recursive blocking algorithm, unrolling and scalar replacement to optimize the code. This approach increased the overall performance of the calculation by a factor of 2. The final texture descriptor vector has 60 elements per scale per direction. Feature reduction is necessary to reduce the feature space. This can be done by feature selection methods; however, in this paper, for a fast and simple procedure, the size of the texture descriptor vector is reduced to 18 by averaging over the four co-occurrence directions. Through the experiments in Section 3, the performance of the four basic measures (mean, standard deviation, skew and kurtosis) and the Haralick measures is evaluated.
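A per-ROI descriptor along these lines can be sketched as follows; mahotas' haralick() (which returns 13 of the 14 descriptors, one row per direction) stands in for the optimised implementation of [21], and the 8-bit quantisation of the filtered ROI is our assumption:

    import numpy as np
    import mahotas
    from scipy.stats import skew, kurtosis

    def roi_descriptor(filtered_roi):
        x = filtered_roi.ravel()
        basic = [x.mean(), x.std(), skew(x), kurtosis(x)]
        q = (255 * (filtered_roi - filtered_roi.min())
             / (np.ptp(filtered_roi) + 1e-9)).astype(np.uint8)   # quantise for GLCM
        haralick = mahotas.features.haralick(q).mean(axis=0)     # average over 4 dirs
        return np.concatenate([basic, haralick])  # approximates the reduced 18-D vector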
2.3 Classifiers

The support vector machine (SVM) has exploded in popularity within the machine learning literature and has more recently received increasing attention from the statistics community as well; SVM has proved to be the best classifier in lung disease categorization [22]. The SVM paradigm was originally designed for the binary (two-class) classification problem [23]. To extend the original SVM to multi-class classification, popular methods include the one-versus-all method using a winner-takes-all strategy; the one-versus-one method implemented by max-wins voting; and the pairwise coupling method (an extension of one-versus-one). The strategies are competitive with each other, and there is no clear superiority of one method over the others [24]. In our experiments, we use an SVM with a Gaussian radial basis function kernel, a quadratic programming method for optimization, and the one-versus-all method for classification of the different diffuse lung disease patterns, because this configuration proved to be the most appropriate in our experiments.
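With scikit-learn standing in for the authors' implementation, the stated configuration corresponds roughly to:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    clf = OneVsRestClassifier(SVC(kernel='rbf', C=1000.0, gamma=1.0))
    # X: (n_rois, n_features) descriptor matrix; y: labels in
    # {normal, emphysema, GGO, honeycombing}
    # clf.fit(X_train, y_train); y_pred = clf.predict(X_test)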
3 Experiments and Results

The dataset is constructed from a collection of 89 HRCT slices from 38 patients, which have been previously investigated and labeled by experienced radiologists; each slice is of size 512×512, 16 bits/pixel in DICOM format. From these labeled slices, 73,000 ROIs of size 32×32 are extracted. The size of the ROIs is chosen to maintain the classification sensitivity with the smallest area. The distribution of the ROIs is detailed in Table 1.

Table 1. Distribution of ROIs over lung disease patterns
               Emphysema  GGO     Honeycombing  Normal
# of ROIs      20,000     18,000  15,000        20,000
# of patients  15         11      8             10
The performance of the experiment is evaluated through two measures: sensitivity and specificity. Sensitivity measures the accuracy among positive instances, while specificity measures the accuracy among negative instances:

Sensitivity = TP / (TP + FN);    Specificity = TN / (TN + FP)
where TP is true positives, FP false positives, TN true negatives and FN false negatives. K-fold cross validation [25] is carried out to compute the classification sensitivity and specificity. The advantage of k-fold cross validation is that all the examples in the dataset are eventually used for both training and testing. For our size of dataset, 10-fold cross validation is appropriate to ensure that the bias is small and the computational time is acceptable. Moreover, in the experiments, multiple-run k-fold (10-run 10-fold) cross validation is also performed to ensure higher replicability and reliability of the results. Each run is executed with different random splits of the data set. For all experiments, six directional Gaussian derivative filters L0, L1^0°, L1^90°, L2^0°, L2^60°, L2^120°, at scales σ = 1, 2, 4, 8 and 16, are used, where the filters are Gaussian derivatives Ln^θ(x, y; σ), with σ the scale, n the order of the derivative and θ the direction in which the derivative is computed. Hence, at each scale, the feature vector consists of 6×4 = 24 basic parameters and 6×14 = 84 Haralick parameters. For the classifier, the initial configuration of the SVMs is set as mentioned in Section 2.3, with the two parameters cost C = 1000 and gamma γ = 1. The results in Table 2 are obtained by applying Gaussian derivatives at different scales with only the 24 basic parameters for each scale. The results show that the best overall sensitivity and specificity are obtained when combining the features of all scales. However, it is clearly seen that using only scale σ = 2 also achieves an acceptable result compared to the best; in other words, features at scale σ = 2 play the most important role in discriminating between the different patterns of diffuse lung diseases.

Table 2. Results without Haralick measures (24 basic parameters at each scale). Best results are highlighted.
Table 2. Results without Haralick measures (24 basic parameters at each scale). Best results are highlighted.

Scale σ       Emphysema        GGO              HC               Normal
              Sens.   Spec.    Sens.   Spec.    Sens.   Spec.    Sens.   Spec.
1             76.5    85.1     72.7    83.5     70.6    82.3     75.9    87.1
2             86.8    89.9     80.5    86.7     81.6    85.4     83.3    90.7
4             78.3    85.3     72.1    84.5     71.8    80.8     80.4    83.7
8             80.1    85.7     75.9    80.6     74.8    79.6     80.5    86.3
16            73.6    80.0     73.3    80.2     70.4    75.5     75.9    83.5
1,2,4         87.2    90.5     81.1    87.5     80.8    85.7     86.5    91.6
1,2,4,8,16    88.5    92.4     83.7    87.7     81.5    85.5     87.4    91.5
To evaluate the contribution of the additional Haralick features, we conducted the same experiments at scale σ = 2 only, because the feature vector would become very large if we combined all features of all scales. The results are given in Table 3. Although the combination of basic and Haralick measures leads to slightly better results, the computational complexity is much higher, which does not seem to warrant their use. It is also noticeable from both tables that the results for ground glass opacity and honeycombing remain relatively low. Lesions such as ground glass opacity have a local mean intensity similar to honeycombing and therefore cannot be differentiated when subjected to scale-space filters. In fact, these two patterns remain the most challenging for classification of diffuse lung diseases so far.
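As a sketch of how such co-occurrence-based measures can be computed, the snippet below uses scikit-image's GLCM utilities; it covers only a subset of the 14 Haralick measures used in the paper, and the function names (graycomatrix/graycoprops) follow recent scikit-image versions.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

roi = (np.random.rand(32, 32) * 255).astype(np.uint8)  # stand-in ROI

# Grey-level co-occurrence matrix over four directions at distance 1.
glcm = graycomatrix(roi, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=256, symmetric=True, normed=True)

# A few Haralick-style statistics derived from the GLCM.
features = {p: graycoprops(glcm, p).ravel()
            for p in ('contrast', 'correlation', 'energy', 'homogeneity')}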
Table 3. Results to evaluate the performance of Haralick features at scale σ = 2. Best results are highlighted.

Feature Set             Emphysema        GGO              HC               Normal
                        Sens.   Spec.    Sens.   Spec.    Sens.   Spec.    Sens.   Spec.
24 basic features       86.8    89.9     80.5    86.7     81.6    85.4     83.3    90.7
84 Haralick features    84.1    90.5     80.9    88.2     79.4    89.1     85.5    93.8
108 combined features   89.5    91.7     80.8    92.5     79.7    90.1     88.4    94.7
4 Conclusion

Although the overall results obtained are only acceptable when compared to the method previously reported by us using wavelet and contourlet features, with a sensitivity of 93.40% and a specificity of 98.40% [26, 27], this study demonstrates the important role of scale-space features in the classification of diffuse lung diseases, particularly features at scale 2. Further investigation will integrate scale-space features with the wavelet- and contourlet-based method to obtain a more robust approach.

Acknowledgments. The use of lung HRCT images from the LMIK database, School of Computer Science and Engineering, UNSW, is gratefully acknowledged.
References 1. Webb, W.R., Muller, N.L., Naidich, D.P.: High-Resolution CT of the Lung. Lippincott Williams & Wilkins, Philadelphia (2001) 2. Materka, A., Strzelecki, M.: Texture Analysis Methods-A Review. Technical University of Lodz, Institute of Electronics, COSTB 11 report, Brussels (1998) 3. Uppaluri, R., Hoffman, E.A., Sonka, M., Hunninghake, G.W., McLennan, G.: Interstitial lung disease: A quantitative study using the adaptive multiple feature method. Am. J. Respir. Crit. Care Med. 159, 519–525 (1999) 4. Chabat, F., Yang, G.-Z., Hansell, D.M.: Obstructive lung diseases: texture classification for differentiation at CT. Radiology 228, 871–877 (2003) 5. Uchiyama, Y., Katsuragawa, S., Abe, H., Shiraishi, J., Li, F., Li, Q., Zhang, C., Suzuki, K., Doi, K.: Quantitative computerized analysis of diffuse lung disease in high-resolution computed tomography. Med. Phys. 30, 2440–2454 (2003)
6. Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 674–693 (1989) 7. Unser, M.: Texture classification and segmentation using wavelet frames. IEEE Transactions on Image Processing 4, 1549–1560 (1995) 8. Wouwer, G.V., Scheunders, P., Dyck, D.V.: Statistical texture characterization from discrete wavelet representations. IEEE Trans. Image Processing 8, 592–598 (1999) 9. Shamsheyeva, A., Sowmya, A.: The anisotropic Gaussian kernel for SVM classification of HRCT images of the lung. In: Proc. Intelligent Sensors, Sensor Networks and Information Processing Conference, pp. 439–444 (2004) 10. Shojaii, R., Alirezaie, J., Babyn, P.: Automatic Segmentation of Abnormal Lung Parenchyma Utilizing Wavelet Transform. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 1, pp. 1217–1220 (2007) 11. Tolouee, A., Abrishami-Moghaddam, H., Garnavi, R., Forouzanfar, M., Giti, M.: Texture Analysis in Lung HRCT Images. In: DICTA '08 on Digital Image Computing: Techniques and Applications, pp. 305–311 (2008) 12. Depeursinge, A., Hidki, A., Platon, A., Poletti, P.-A., Unser, M., Muller, H.: Lung Tissue Classification Using Wavelet Frames. In: Proceedings of the 29th Annual International Conference of the IEEE EMBS, Cité Internationale (2007) 13. Depeursinge, A., Iavindrasana, J., Cohen, G., Platon, A., Poletti, P.A., Muller, H.: Lung Tissue Classification in HRCT Data Integrating the Clinical Context. In: 21st IEEE International Symposium on Computer-Based Medical Systems, CBMS '08, pp. 542–547 (2008) 14. Depeursinge, A., Iavindrasana, J., Hidki, A., Cohen, G., Geissbuhler, A., Platon, A., Poletti, P.-A., Muller, H.: A classification framework for lung tissue categorization. In: SPIE (2008) 15. Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on Image Processing 14, 2091–2106 (2005) 16. Gangeh, M., ter Haar Romeny, B., Eswaran, C.: Scale-Space Texture Classification Using Combined Classifiers, pp. 324–333 (2007) 17. ter Haar Romeny, B.M.: Front-End Vision and Multi-scale Image Analysis: Multi-scale Computer Vision Theory and Applications (Written in Mathematica). Kluwer Academic Publishers, Dordrecht (2003) 18. van Ginneken, B., Katsuragawa, S., ter Haar Romeny, B.M., Doi, K., Viergever, M.A.: Automatic detection of abnormalities in chest radiographs using local texture analysis. IEEE Trans. Med. Imaging 21(2), 139–149 (2002) 19. Gangeh, M.J., Duin, R.P.W., Eswaran, C., ter Haar Romeny, B.M.: Scale Space Texture Classification Using Combined Classifiers with Application to Ultrasound Tissue Characterization, pp. 287–290 (2007) 20. Haralick, R.M., Dinstein, I., Shanmugan, K.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics SMC-3, 610–621 (1973) 21. Miyamoto, E., Merryman, T.: Fast calculation of Haralick texture features (2008) 22. Depeursinge, A., Iavindrasana, J., Hidki, A., Cohen, G., Geissbuhler, A., Platon, A., Poletti, P.-A., Muller, H.: A classification framework for lung tissue categorization. In: SPIE, vol. 6919 (2008) 23. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Kluwer Academic Publishers, Boston (1998)
24. Duan, K., Keerthi, S.S.: Which Is the Best Multiclass SVM Method? An Empirical Study. In: Multiple Classifier Systems, pp. 278–285 (2005) 25. Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice-Hall, London (1982) 26. Vo, K., Sowmya, A.: Directional Multi-scale Modeling of High-Resolution Computed Tomography (HRCT) Lung Images for Diffuse Lung Disease Classification. In: CAIP (2009) 27. Vo, K., Sowmya, A.: Diffuse Lung Disease Classification in HRCT Lung Images Using Generalized Gaussian Density Modeling of Wavelets Coefficients. In: IEEE International Conference on Image Processing, Cairo, Egypt (2009)
Correction of Left Ventricle Strain Signals Estimated from Tagged MR Images

Mina E. Khalil¹, Ahmed S. Fahmy¹, and Nael F. Osman¹,²

¹ Center for Informatics Science, Nile University, Cairo, Egypt
² Johns Hopkins University, School of Medicine, Baltimore, Maryland, USA
Abstract. Strain measurements are quantities used for assessing the regional function of the left ventricle (LV) of the heart. They are computed by tracking the motion of non-invasive, virtual tags in the cardiac muscle over time. Tracking these tags gives information for each region of the cardiac muscle by quantifying its deformation during contraction (systolic period) and relaxation (diastolic period). However, these strain measurements suffer from inaccuracies caused by the degradation of the tags and of the image quality. In this work, numerical simulations are used to investigate the factors contributing to the measurement errors. An empirical model for the estimated strain values is deduced and presented. In addition, a correction method for the measurement errors is proposed based on the empirically-deduced model. The method was validated on real data, and showed potential enhancement by reducing the errors in strain measurements.

Keywords: Strain, Left ventricle regional function, tagged MR.
1 Introduction

Coronary artery disease is the leading cause of death worldwide. Blockages in the coronary artery result in the development of ischemic regions in the myocardium – those with a reduced flow of blood and oxygen. If this condition persists, these regions may be further stressed, causing angina, or lead to infarction, resulting in tissue death or necrosis due to loss of blood flow. Depending on the extent of the infarction, a cardiac arrest might occur, or severe decompensation might lead to heart failure. Accurate quantification and timing of regional myocardial function allows early identification of dysfunction, and is becoming increasingly important for clinical risk assessment, patient management, and evaluation of therapeutic efficacy [9]. Strain analysis allows a direct assessment of the degree of regional myocardial deformation and its timing during the cardiac cycle. Ischemic and infarcted regions have different motion patterns than normal regions, which yield different strain signals [1]. Tissue strain at any time is the percentage change per unit length in some direction [2]. Due to the doughnut shape of the short-axis cross section of the heart, strain is usually estimated in two main directions: radial and circumferential. However, myocardial contraction is principally circumferential [9]. The estimated strain signals (strain vs. time curves) are not only used to determine the local deformation of
myocardium but can also reveal subtle, but important, features relevant to a patient's heart condition [3]. Cardiac Magnetic Resonance (CMR) tagging is considered the gold-standard technique for measuring left ventricular strain [4]. In CMR tagging, non-invasive markers (tag lines) are virtually implanted that move with the motion of the heart muscle, Fig. 1. Tracking the moving tag lines over time allows quantification of the mechanical deformation, which provides a quantitative analysis of the regional function of the heart. For automatic quantification of strain, the Harmonic Phase (HARP) technique automates the process of tracking points inside the heart from tagged MR images [5, 8]. It is based on the idea that the harmonic phase of a point is invariant with time. This means that tracking points amounts to finding the points that have the same harmonic phase values throughout the entire image sequence. After finding these points, it is straightforward to compute the mechanical strain in the two directions, circumferential and radial. Fig. 2-a shows an ideal circumferential strain signal covering the cardiac cycle, starting from the first time frame until 250 msec to cover the systolic period, while the rest of the signal covers the diastolic period. Its frequency response is shown in Fig. 2-b. Unfortunately, the strain signals estimated with the HARP technique can be distorted, resulting in strain signals such as those in Fig. 2-c&d, due to a number of factors that depend on the image quality and acquisition. Factors such as intrinsic image noise, fading of the tags, too small a tag separation, and cardiac twisting have been stated in the literature [5, 6], but they have not been fully studied to investigate their effect on the strain signal. Moreover, little work has been done to propose remedies for these artifacts. In this work, the effect of each factor is studied using numerical simulation to determine the extent of distortion induced in the strain signals. An empirically-deduced model is proposed based on the findings of this study. In addition, a correction method is proposed for the strain signal. The proposed method has been validated on real strain signals and the results show great potential to remove the artifacts from the strain signal.
2 Theory and Methods

2.1 Study of Inherent Causes of Artifacts

Numerical simulations were conducted to study the extent of corruption caused by the factors stated above. In these simulations, tagged MR image sequences are generated for a doughnut-shaped phantom having a longitudinal relaxation time T1 = 850 msec and inner and outer radii of 30 mm and 70 mm, respectively. The tag patterns are generated by simulating the Spatial Modulation of Magnetization (SPAMM) pulse sequence [7]. This is achieved by multiplying a uniform-intensity image, Im_0, with a finite cosine series (Eq. 1) whose frequency corresponds to (1/Tag Separation):

Im_tag(x, y) = Im_0(x, y) [ 1/2 + 1/2 cos(2πx / TS) ]    (1)

where Im_tag is the image after imposing the tag patterns with tag separation TS.
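A minimal sketch of this tagging step, under the reconstructed form of Eq. 1 and with illustrative image dimensions, could look as follows.

import numpy as np

h, w, TS = 256, 256, 6                        # image size and tag separation (pixels)
yy, xx = np.mgrid[0:h, 0:w]
r = np.hypot(xx - w / 2, yy - h / 2)
im0 = ((r >= 30) & (r <= 70)).astype(float)   # doughnut-shaped uniform phantom

# Eq. 1: modulate the phantom with a cosine of frequency 1/TS along x.
im_tag = im0 * (0.5 + 0.5 * np.cos(2 * np.pi * xx / TS))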
Fig. 1. (a) MRI image of the Left Ventricle captured during Systolic Period, where (b) shows the corresponding Tagged MR image
Fig. 2. (a) Smooth real strain signal (b) The Fourier domain spectrum corresponding to strain signal in (a). (c) & (d) distorted real strain signals
After generating the tagged MR images shown in Fig. 3, the cardiac motion is simulated using a predefined function, similar to the signal in Fig. 2-a, having only a DC component and 2 frequency components. The reason for choosing such a function is discussed below. Sample images and strain signals generated by the simulator are shown in Figs. 3 and 4, respectively. The next step is the simulation of the factors causing distortion of the strain signal. This is done as follows: Tag pattern fading depends on the T1 relaxation time, T1 = 850 msec for the cardiac muscle, as well as on the flip angle (α) of the RF pulse sequence [5]. Fading appears in the tagged MR images, as shown in Fig. 3-c, as a reduction in the tag contrast, i.e., the
peak-to-peak magnitude of the finite cosine series (Eq. 2), which decreases exponentially over time and with increasing α of the RF pulse, that is,

Im_tag(x, y, t) = Im_0(x, y) [ 1/2 + 1/2 β(t) cos(2πx / TS) ]    (2)

where β(t) is given as follows:

β(t) = e^(−t/T1) (cos α)^(t/TR)    (3)

where α is the RF pulse sequence flip angle and TR is the repetition time of the pulse sequence. Small tag separation is simulated by assigning tag separations ranging from 1 mm to 6 mm. Twisting of the heart muscle is simulated by rotating the inner and outer radii of the doughnut shape centered in the tagged MR image by an angle θ that varies with time. Additive intrinsic noise is simulated by adding white Gaussian noise with zero mean and different variance values to the complex images.
Fig. 3. Simulated tagged MR image with tag separation = 6 mm, α = 12°, twisting angle = 30° and noise variance = 0.005, captured at (a) 1st time frame, (b) end-systole, (c) middle diastole.
2.2 Strain Curve Modeling

The aim of modeling the strain signal is to extract the strain patterns regardless of the corruption found in the signal. Because of the smoothness of normal strain signals, it was natural to represent them using only the low-frequency coefficients of the Fourier series of the strain signal. In this study, to determine the number of frequency coefficients sufficient to replicate the strain signal, the energy density distribution was calculated from a dataset of 140 smooth strain signals, i.e., strain signals with minimal artifacts, of normal volunteers and patients. It showed that the first 5 Fourier coefficients, represented in Eq. 4, contain 98.97% of the total energy, with a variance of ±0.52%. Moreover, adding higher-order coefficients not only has a minimal effect on extracting the patterns, but also causes distortion in the case of a corrupted strain signal.
Fig. 4. Simulated (a) ideal strain signal, (b) at noise variance = 0.005 and α = 12°, (c) at tag separation = 2 mm, (d) at twisting angle = 30°.
This model was compared with an ideal low-pass filter (LPF) with cut-off frequency equal to that of the first 5 Fourier coefficients. It was also compared with polynomials of 2nd to 7th order.

ε_fit(t) = a0 + a1 cos(wt) + b1 sin(wt) + a2 cos(2wt) + b2 sin(2wt)    (4)

where a0, a1, a2, b1, b2, and w are the model parameters and ε_fit is the model fitted on the strain signal.
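A sketch of fitting this model by linear least squares is shown below; fixing the fundamental frequency w to one cycle per signal length is an assumption made for illustration (the paper treats w as a model parameter).

import numpy as np

def fourier5_fit(strain, w=None):
    t = np.arange(len(strain), dtype=float)
    if w is None:
        w = 2 * np.pi / len(strain)     # assumed fundamental frequency
    # Design matrix for Eq. 4: DC term plus two harmonics.
    A = np.column_stack([np.ones_like(t),
                         np.cos(w * t), np.sin(w * t),
                         np.cos(2 * w * t), np.sin(2 * w * t)])
    coeffs, *_ = np.linalg.lstsq(A, strain, rcond=None)
    return A @ coeffs, coeffs           # fitted signal, (a0, a1, b1, a2, b2)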
Weighted Least Square Curve Fitting
Although least square error (LSE) method can be used to fit the above mentioned model to the raw strain signals, instead weighted least-mean squares (LMS) method is used. This allows the use of priori information about our confidence in each strain measurement; i.e. assign error probability for each point. In fact, the inherent causes of the strain artifacts can be dominant in some particular region of the signal. Tag pattern fading: According to Eq. 2, the tag pattern fades exponentially with time, depending mainly on the flip angle (α) of the RF pulse sequence, since T1 is constant in all regions of the cardiac muscle. Therefore, weights can be assigned as the inverse function of the exponential decaying function of the tag fading (Eq.2). (5) where W1 is the adjusted weights to reduce the fading effect, is defined in Eq.3. Factor of small tag separation and maximum twisting: It was observed, from simulation, that this factor causes positive peaks to be present in the middle of the
564
M.E. Khalil, A.S. Fahmy, and N.F. Osman
strain signal, i.e. at peak systole. Therefore, weights are assigned as Gaussian distribution (Eq.6) with mean equal to the time index of peak strain (end-systole) and variance corresponding to region starting from 25% to 55% of the strain signal. 1
(6)
√2
where W2 is the adjusted weights to reduce the small tag separation and twisting effect, µ is the time index of peak strain and σ is the variance. Additive intrinsic noise: Due to the randomness of its source, we assumed uniform weights at each strain sample. Hence, combining all these information together, we get overall weights along the strain signal (Fig.5). The weights are adjusted so that maximum weight magnitude corresponds to highest confidence.
Fig. 5. The weights of the strain curve fitting problem
3 Results A dataset, acquired from the MESA study containing 840 strain signals for 70 volunteers and patients, is used to evaluate the proposed method. For each subject 12 strain signals were acquired corresponding to different segment of the cardiac muscle. A comparison, based on computing error using Eq.7 and its variance, is done on 140 smoothed strain signals, i.e. with minimal artifacts, between the model presented in this study, and an ideal LPF with cut-off frequency equal to the 5 Fourier coefficients, and also different fitting models such as; 2nd till 7th polynomial. The results of the comparison are shown in Table1.
100
(7)
Where ɛfit is the resultant strain signal from the fitting models, ɛactual is the actual strain signal and T is the time covering the strain signal.
Table 1. Error percentage computed for the different models

Fitting Model      Mean     Variance
2nd poly           32 %     11.5 %
3rd poly           16 %      6 %
4th poly           13.4 %    6.4 %
5th poly            8.1 %    3.9 %
6th poly            7.2 %    3.5 %
7th poly            6.2 %    2.9 %
Low Pass Filter    11.5 %    7.1 %
5 Fourier Coeff.    7.3 %    2.8 %
Table 2. Percentage of strain signals from which the experts were able to extract information, before and after the proposed method

                    Before      After
Strain signal       21.014 %    80.43 %
Systolic Period     49.8 %      83.6 %
Diastolic Period    22.5 %      80.2 %
Fig. 6. Success and failure in 6 real strain signals before (dotted) and after (solid) applying the proposed method. (a)-(d) successful correction while (e)&(f) failure in correcting strain signals.
In the case of distorted strain signals, the performance of the proposed method is evaluated visually by experts in the field of CMR. This evaluation cannot be done
automatically due to the lack of a gold reference for the strain signals. The experts conducted a binary evaluation of the strain signals, i.e., they assigned zero to strain signals from which they could not extract useful information. They evaluated the entire strain signal as well as the parts covering the systolic and diastolic periods only. The experts' evaluation results are shown in Table 2. Success and failure samples are shown in Fig. 6.
4 Discussion

The simulations in this study provided a good understanding of the inherent causes of artifacts and of how these artifacts distort the strain signal. It was found that artifacts resulting from mistracking of the tag lines due to fading of the tag patterns increase as the image quality decreases and as the RF flip angle increases. As for small tag separation, its artifacts appear as positive spikes, as shown in Fig. 4-c, in the end-systolic region of the strain signal, due to the fact that at maximum contraction closely spaced tag lines overlap, making some tag lines disappear. Twisting of the heart muscle showed artifacts quite similar to those coming from small tag separations; they also appear as positive spikes, but over a larger region of the strain curves, as shown in Fig. 4-d.

The ideal LPF was expected to provide a mean error similar to or better than the other models, but Table 1 shows that applying it to smoothed strain signals, i.e., those with minimal artifacts, gives a mean error of 11.5%, which is higher than that of the model proposed in this study. Moreover, in the case of distorted strain signals, its results were biased by the severity of the artifacts found in the signal. Therefore, using an ideal LPF in this case would yield poor improvement in the results. Modeling the strain signal with polynomials of different orders showed that increasing the polynomial order decreases the mean error. The 6th- and 7th-order polynomials have mean errors comparable with the proposed model, but they suffered from overfitting when tested on real distorted strain signals.

Based on the energy density distribution of the strain signals, it was found that the first 5 Fourier coefficients store 98.97% of the total energy, with a variance of ±0.52%. Based on this finding, it can be said that the proposed model is sufficient to replicate the strain signal regardless of the patient's condition. Table 2 shows that the efficiency of extracting information from the strain signals increased to about 80% from about 20% after applying the weighted LMS fitting method. The method showed good potential in reducing the artifacts caused by the fading effect at low signal-to-noise ratio, as well as those due to small tag separation and twisting of the heart muscle. The remaining 20% of the strain signals need further analysis to better understand the inherent causes of artifacts in order to increase the efficiency.
5 Conclusion

In this work, the extent of the distortion induced in the strain signals is studied using numerical simulation. Based on the energy density distribution of the dataset used in this study, an empirically-deduced model is presented. This model is used for the
correction of distorted strain signals using weighted LMS method. Evaluation of the proposed method was done on real strain signals by experts in the field. It showed good potential in enhancing the strain signals.
References 1. Fahmy, A.S., Krieger, A., Osman, N.F.: An integrated System for Real-Time Detection of Stiff Masses with a Single Compression. IEEE Transaction on Biomedical Engineering 53(7) (July 2006) 2. Urheim, S., Edvardsen, T., Torp, H., Angelsen, B., Smiseth, O.A.: Myocardial Strain by Doppler Echocardiography: Validation of a new Method to quantify regional myocardial function. Circulation 102, 1158–1164 (2000) 3. Stuber, M., Scheidegger, Fischer, S.E., Nagel, E., Steinenmann, F., Hess, O.M., Boesiger, P.: Alterations in the local myocardial motion pattern in patients suffering from pressure overload due to aortic stenosis. Circulation 100, 361–368 (1999) 4. Youssef, A., Ibrahim, E.-S.H., Korosoglou, G., Abraham, R., Weiss, R.G., Osman, N.F.: Strain-Encoding cardiovascular magnetic resonance for assessment of right-ventricular regional function. Journal of Cardiovascular Magnetic Resonance 10, 33 (2008) 5. Osman, N.F.: Measuring Regional Cardiac Function Using Harmonic Phase Magnetic Resonance Imaging. A dissertation submitted to the Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy (May 2001) 6. Parthasarathy, V.: Characterization of Harmonic Phase MRI: Theory, Simulation and Applications. A dissertation submitted to the Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy (April 2006) 7. Axel, L., et al.: MR imaging of motion with spatial modulation of magnetization. Radiology 171, 841–845 (1989) 8. Osman, N.F., Kerwin, W.S., McVeigh, E.R., Prince, J.L.: Cardiac Motion Tracking Using CINE Harmonic Phase (HARP) Magnetic Resonance Imaging. Magnetic Resonance in Medicine 42, 1048–1060 (1999) 9. Götte, M.J.W., et al.: Myocardial Strain and Torsion Quantified by Cardiovascular Magnetic Resonance Tissue Tagging: Studies in Normal and Impaired Left Ventricle. Journal of the American College of Cardiology 48(10)
Texture Analysis of Brain CT Scans for ICP Prediction

Wenan Chen, Rebecca Smith, Nooshin Nabizadeh, Kevin Ward, Charles Cockrell, Jonathan Ha, and Kayvan Najarian

Virginia Commonwealth University, Richmond, Virginia 23221, United States
Abstract. Elevated Intracranial Pressure (ICP) is a significant cause of mortality and long-term functional damage in traumatic brain injury (TBI). Current ICP monitoring methods are highly invasive, presenting additional risks to the patient. This paper describes a computerized noninvasive screening method based on texture analysis of computed tomography (CT) scans of the brain, which may assist physicians in deciding whether to begin invasive monitoring. Quantitative texture features extracted using statistical, histogram and wavelet transform methods are used to characterize brain tissue windows in individual slices, and aggregated across the scan. Support Vector Machine (SVM) is then used to predict high or normal levels of ICP using the most significant features from the aggregated set. Results are promising, providing over 80% predictive accuracy and good separation of the two ICP classes, confirming the suitability of the approach and the predictive power of texture features in screening patients for high ICP. Keywords: Computerized tomography (CT) images, Intracranial Pressure (ICP), texture analysis, Grey Level Run Length Method (GLRLM), Dual Tree Complex Wavelet Transform (DTCWT).
1 Introduction
One of the most severe complications after the initial TBI is increased intracranial pressure (ICP), typically resulting from edema caused by the original injury. Due to the rigid structure of the skull, elevated ICP can lead to shift and compression of soft brain tissue. Cerebral blood flow is proportional to the difference between the mean ICP in the brain and the mean arterial pressure. Therefore, as ICP increases, blood flow decreases, causing the brain to become ischemic and in turn leading to further brain damage due to lack of adequate blood flow [1]. ICP must therefore be closely monitored in the period after the initial injury [2]. However, the standard method is highly invasive, involving cranial trepanation and the insertion of an ICP monitor device. This risks both infection and further damage to brain tissue. Consequently, a non-invasive screening method that can evaluate a patient's ICP level could potentially offer valuable assistance to physicians in deciding whether to perform trepanation.
Computed tomography (CT) scans are the standard image modality applied in TBI cases, due to their low cost and the low disruption they present to the patient. They also offer good definition of blood and bone matter. Currently, physicians often study CT scans of the brain when deciding whether to begin invasive ICP monitoring, looking for key signs such as brain midline shift or significant changes in the size of the lateral ventricles [3]. Both of these measures are quantifiable. However, there is no simple manual method to quantify changes in tissue texture, some of which may not be easily visible to the human eye. This has led to interest in the computerized extraction of key texture features. Several prior studies have explored texture analysis of liver CT images in differentiating normal and malignant tissues. These have focused on the use of spatial-domain statistical methods, including the Spatial Grey Level Dependence Matrix (SGLDM) and the Grey Level Run Length Method (GLRLM) [4,5,6]. Relevant to brain CTs, texture analysis has been used to automatically target radiotherapy for head-and-neck cancer based on CT images [7], and to distinguish between healthy tissue and intracerebral hemorrhage (ICH) [8]. However, to the best of the authors' knowledge, texture analysis has not yet been utilized to provide quantitative features for predicting ICP. This study focuses on the prediction of ICP levels using texture features extracted from brain CT scans via multiple processing methods. One possible regularity is that as ICP increases, the density of the brain tissue increases because of compression, which may change the appearance and texture of the CT images. Such possible patterns motivate texture analysis of CT images for ICP prediction. The method consists of four key stages: feature extraction from individual slices, aggregation across a full CT scan, feature selection, then classification via Support Vector Machine (SVM). A flow diagram of the approach is provided in Fig. 1.
Fig. 1. Flow diagram of ICP prediction. The four key stages are shown, along with preliminary selection of slice windows to be analyzed. The five methods of feature extraction are also displayed.
2 Dataset and Labeling
Fifty-six brain CT scans with a spatial resolution of 512×512 pixels and 8-bit grey level were provided by the Carolinas Healthcare System (CHS). These scans cover 17 patients diagnosed with mild to severe TBI. For each patient, ICP values were recorded every hour for each day they were monitored. Each CT scan is assigned an ICP score, calculated as the average of the two ICP readings closest to the time the scan was taken. The ICP scores for the scans are then grouped into two classes, high (ICP > 15 mm Hg) and normal (ICP ≤ 15 mm Hg), because 15 mm Hg is the threshold physicians typically use to distinguish normal from elevated ICP. Using this threshold, the dataset contains 31 normal ICP cases and 25 with high ICP. Five slices covering the region expected to contain ventricle matter were manually selected from each scan for analysis.
3 Methodology

3.1 Texture Feature Extraction
Six rectangular windows are extracted from each of the five selected CT slices; these are chosen to contain primarily neural tissue. Figure 2 illustrates the chosen regions. Each window is processed separately, with a total of 48 texture features being extracted using five methods: Grey Level Run Length Matrix (GLRLM), histogram analysis, Fourier transform, Discrete Wavelet Packet Transform (DWPT), and Dual Tree Complex Wavelet Transform (DT-CWT). Although not all of these features may be useful in the final classification task, the feature selection step removes those considered irrelevant. Furthermore, since this method is ultimately intended to provide a generalized framework for texture-based classification of CT scans, these additional measures may prove useful in analysis tasks involving other types of tissue elsewhere in the body. Grey Level Run Length Method (GLRLM). A grey-level run can be considered a set of consecutive collinear pixels that share the same grey-level value. These runs characterize texture elements in images, and each one is defined by its direction, length, and grey-level value [9]. Four directions are examined in this study (0, 45, 90 and 135 degrees) and the runs for each are used to construct four separate run-length matrices. The element P(i, j) of a run-length matrix P reflects the number of runs of length j of pixels with grey-level value i in the direction specified for the matrix. P is of size (n × k), where n is the maximum grey level and k is the maximum possible run length in the image being analyzed. Eleven features are then calculated from each of the four run matrices: Short Run Emphasis (SRE), Long Run Emphasis (LRE), Grey Level Nonuniformity (GLN), Run Length Nonuniformity (RLN), Run Percentage (RP), Low Gray-Level Run Emphasis (LGRE), High Gray-Level Run Emphasis (HGRE), Short Run Low Gray-Level Emphasis (SRLGE), Short Run High Gray-Level Emphasis (SRHGE), Long Run Low Gray-Level Emphasis (LRLGE), and Long Run High
Fig. 2. Location of the six windows used for texture extraction

Gray-Level Emphasis (LRHGE). These provide a statistical characterization of image texture based on both gray level and run length that can be used as input to the classification process. Their formulations can be found in [10]. Histogram Analysis. Four features are obtained from the histogram for each window: average intensity level, average contrast, smoothness and entropy. The first two measures are respectively defined as the histogram's mean and standard deviation. A measure of smoothness is provided by

R = 1 − 1 / (1 + σ²)    (1)
where σ is the standard deviation. A value approaching 0 indicates low variation in pixel intensity (i.e., smooth texture), while a value approaching 1 shows much higher variation (i.e., rough texture). The final feature, histogram entropy, reflects the randomness of intensity values. Fourier Features. The Fast Fourier Transform (FFT) is applied to each window. The maximum, minimum, average and median of the resulting frequency magnitudes are then recorded as texture features, along with the frequency corresponding to the median of the absolute spectrum values. Discrete Wavelet Packet Transform (DWPT). The standard Discrete Wavelet Transform (DWT) decomposes a signal by passing it through a series of filters at consecutive levels of decomposition. At each level i, both a high-pass and a low-pass filter are applied to the signal passed from level i − 1, extracting detail and approximation coefficients respectively. Only the approximation coefficients are then passed on to the next level i + 1. In image processing, this creates four new sub-images at each level by sub-sampling the input by 2. This
multi-resolution analysis is much more effective in detecting localized texture patterns than FFT's global approach. The Discrete Wavelet Packet Transform (DWPT) extends DWT by also decomposing the detail coefficients at each level [11]. Though this incurs a higher computational cost, it provides a richer representation of texture, which is valuable to the task in this study. Dual Tree Complex Wavelet Transform (DT-CWT). The dual tree complex wavelet transform [12] is designed to overcome several shortcomings of the Discrete Wavelet Transform (DWT) in higher dimensions (e.g., image processing). DWT suffers from problems such as oscillations around singularities, shift variance, aliasing after processing of wavelet coefficients, and lack of directionality. These problems do not occur in the Fourier Transform. DT-CWT addresses these issues by adopting some characteristics of the Fourier Transform while keeping the wavelet advantage in time-frequency analysis. In image processing applications, this method is known to be free of the checkerboard artifact of DWT and to provide 6 directional wavelets in 2D. The reason the Fourier Transform avoids the problems above is that it operates in the complex domain, where the real and imaginary parts form a Hilbert Transform pair. In DT-CWT, the scaling function and wavelet function are also complex-valued. For example, the wavelet function can be expressed as:

ψ_c(t) = ψ_r(t) + jψ_i(t)    (2)

where ψ_r(t) is the real-valued function for the real part and ψ_i(t) is the real-valued function for the imaginary part. In DT-CWT, the real and imaginary parts are processed separately using a DWT scheme. The advantage of DT-CWT in texture analysis is that it is insensitive to the location of texture patterns. It can also capture more directions of texture patterns directly from the image. In this application, each subimage is decomposed to the first level, then the energy of each subband is calculated as in the DWPT case above, as is the entropy of the energy distribution.
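As an illustration of the run-length construction described at the start of this section, a minimal sketch for the 0-degree direction is given below; the quantization to 16 grey levels is an illustrative choice, and only one of the eleven run-length features is shown.

import numpy as np

def glrlm_horizontal(img, levels=16):
    # P[i, j] counts runs of length j+1 with quantized grey level i (0 degrees).
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)
    P = np.zeros((levels, q.shape[1]), dtype=int)
    for row in q:
        val, length = row[0], 1
        for v in row[1:]:
            if v == val:
                length += 1
            else:
                P[val, length - 1] += 1
                val, length = v, 1
        P[val, length - 1] += 1
    return P

P = glrlm_horizontal((np.random.rand(64, 64) * 255).astype(np.uint8))
j = np.arange(1, P.shape[1] + 1)
sre = (P / j**2).sum() / P.sum()      # Short Run Emphasis (SRE)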
3.2 Slice Aggregation and Feature Selection
Since ICP labeling is done on a full-scan basis, it is necessary to combine the features extracted from each of the scan slices to create a new aggregated feature set for classification. For a given slice, the features of the six windows are averaged, giving a total of 48 features per slice. For any given texture feature f, five new aggregate features are calculated across all slices: min(f), max(f), median(f), μ(f), and σ(f). This gives a total of 48×5 = 240 features for the entire scan. This simplified approach is taken to save computational time and to circumvent the issue of large gaps between slices. Feature selection is then applied to the aggregated feature set to remove irrelevant features and improve performance. This is done via a two-stage filter-wrapper
approach. In the first stage, the ReliefF algorithm [13] is used to rank the features in terms of their significance. The top 50 are then passed to the second stage. In the second stage, a Support Vector Machine (SVM) is used in the wrapper to evaluate candidate feature subsets. In the validation step, the 10-fold cross-validation result is used as the accuracy measurement. A Radial Basis Function (RBF) kernel is used with fixed parameters C = 32768 and γ = 0.125. The search method used to construct the feature set is Best First [16]. The feature subset achieving the best validation accuracy for ICP prediction is selected as the final feature set, which contains 10 features. All feature selection steps are performed in the Weka machine learning software [14].
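A sketch of the aggregation step described above is given below; the per-slice feature matrix is a placeholder, while the five aggregate statistics follow the text.

import numpy as np

slice_feats = np.random.rand(5, 48)      # stand-in: 5 slices x 48 features
scan_feats = np.concatenate([slice_feats.min(axis=0),
                             slice_feats.max(axis=0),
                             np.median(slice_feats, axis=0),
                             slice_feats.mean(axis=0),
                             slice_feats.std(axis=0)])
assert scan_feats.shape == (240,)        # 48 features x 5 statistics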
3.3 Classification
Classification is performed using the Support Vector Machine (SVM) algorithm, which is widely used in classification tasks and typically offers good accuracy. To accommodate the relatively low number of scans in the dataset, 10-fold cross-validation is used to measure performance.
4 Results
The results of the proposed method are evaluated using the following criteria: sensitivity, specificity and accuracy [15]. Sensitivity is defined as

sensitivity = (number of True Positives) / (Total number of Positives)    (3)

while specificity is defined as

specificity = (number of True Negatives) / (Total number of Negatives)    (4)
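A small helper implementing Eqs. 3 and 4, plus overall accuracy, is sketched below for binary labels (1 = high ICP, 0 = normal).

import numpy as np

def sens_spec_acc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)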
The classification results are provided in Table 1. It can be seen that the designed ICP prediction method offers very good accuracy, despite the relatively small size of the dataset. Sensitivity and specificity are also good, meaning that the two classes - normal and high ICP - are well distinguished. These results indicate that texture analysis alone has the potential to be highly valuable in a non-invasive ICP screening method.

Table 1. Classification results using SVM

Sensitivity       Specificity       Accuracy
82.25 (±1.74)     81.20 (±0.04)     81.79 (±2.30)
5 Conclusions
This paper provides a non-invasive screening method for ICP prediction based on quantitative texture features extracted from CT brain scans. Feature extraction methods used include histogram analysis, wavelet transform and statistical gray level analysis. These are applied to windows in each slice, and the features aggregated across the entire CT scan. By selecting the most informative features for the classification task, a promising ICP prediction model is built. Due to the number of features extracted, the approach can potentially form a framework for analysis of other body regions; furthermore, new methods of texture analysis can easily be incorporated. Once refined, the method may thus be suitable for incorporation as a module into a wider TBI decision-support system. Future work will involve combining texture analysis with the authors’ existing work on evaluation of mid-line shift, ventricle size and blood volume, with the hope of further improving prediction accuracy. The method will also be tested on a large dataset, possibly allowing the selection of a larger feature set for classification. Finally, the medical interpretations of various texture features will be explored in more depth with the help of physicians.
Acknowledgments The authors would like to acknowledge the Carolinas Healthcare System (CHS) for providing the CT dataset and ICP recordings.
References 1. Koch, M.A., Narayan, R.K., Timmons, S.D.: Traumatic Brain Injury. In: Merck manual online, http://www.merck.com/mmpe/sec21/ch310/ch310a.html (retrieved) 2. Castellano, G., Bonilha, L., Li, L.M., Cendes, F.: Texture Analysis of Medical Images. Clinical Radiology 59(12), 1061–1069 (2004) 3. Sucu, H.K., Gelal, F., Gokmen, M., Ozer, F.D., Tektas, S.: Can midline brain shift be used as a prognostic factor to predict postoperative restoration of consciousness in patients with chronic subdural hematoma? Surgical Neurology 66(2), 178–182 (2006) 4. Chen, E.L., Chung, P., Chen, C.L., Tsa, H.M., Chang, C.I.: An automated diagnostic system for CT liver image classification. IEEE Trans. Biomed. Eng. 45(6), 783–794 (1998) 5. Mir, A.H., Hanmandlu, M., Tandon, S.N.: Texture Analysis of CT Images. IEEE Eng. Med. Biology 14, 781–786 (1995) 6. Mougiakakou, S., Valvanis, I., Nikita, K.S., Nikita, A., Kelekis, D.: Characterization of CT liver lesions based on texture features and a multiple neural network classification scheme. In: 25th Int. Conf. of the IEEE Engineering in Medicine and Biology Society, vol. 2, pp. 1287–1290 (2003) 7. Yu, H., Caldwell, C., Mah, K., Poon, I., Balogh, J., MacKenzie, R., Khaouam, N., Tirona, R.: Automated Radiation Targeting in Head-and-Neck Cancer Using Region-Based Texture Analysis of PET and CT Images. International Journal of Radiation Oncology*Biology*Physics 75(2), 618–625 (2009)
8. Kabara, S., Gabbouj, M., Dastidar, P., Alaya-Cheikh, F., Ryymin, P., Lassonen, E.: CT Image Texture Analysis of Intracerebral Hemorrhage. In: 2003 Finnish Signal Processing Symposium (FINSIG'03), Tampere, Finland, pp. 190–194 (2003) 9. Chu, A., Sehgal, C.M., Greenleaf, J.F.: Use of gray value distribution of run lengths for texture analysis. Pattern Recognition Letters 11(6), 415–420 (1990) 10. Tang, X.: Texture information in run-length matrices. IEEE Trans. Image Proc. 7(11), 1602–1609 (1998) 11. Manthalkar, R., Biswas, P.K., Chatterji, B.N.: Rotation and scale invariant texture features using discrete wavelet packet transform. Pattern Recognition Letters 24(14), 2455–2462 (2003) 12. Kingsbury, N.: Complex Wavelets for Shift Invariant Analysis and Filtering of Signals. Applied and Computational Harmonic Analysis 10(3), 234–253 (2001) 13. Robnik-Sikonja, M., Kononenko, I.: An adaptation of Relief for attribute estimation in regression. In: ICML'97: 14th International Conference on Machine Learning, pp. 296–304 (1997) 14. Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/ 15. Altman, D.G., Bland, J.M.: Diagnostic tests. 1: Sensitivity and specificity. British Medical Journal 308(6943), 1552 (1994) 16. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
Mass Description for Breast Cancer Recognition

Imene Cheikhouhou, Khalifa Djemal, and Hichem Maaref

IBISC Laboratory, Evry Val d'Essonne University, 40 rue du Pelvoux, 91020, Evry, France
{Imen.CheikhRouhou,Khalifa.Djemal,maaref}@iup.univ-evry.fr

Abstract. In this paper, we present a robust shape descriptor named the Mass Descriptor based on Occurrence intersection coding (MDO), which uses contour fluctuation detection. This descriptor allows a good characterization of breast lesions and thus good classification performance. The efficiency of the proposed descriptor is evaluated on the well-known Digital Database for Screening Mammography (DDSM) using the area under the Receiver Operating Characteristic (ROC) curve. Results show that the proposed descriptor performs well in breast mass recognition using a Support Vector Machine (SVM) classifier.
1 Introduction
Cancer is a disease characterized by an abnormal reproduction of cells, which invade and destroy adjacent tissues and may even spread to other parts of the body through a process known as metastasis. Breast cancer is one of the major causes of death among women. Presently, breast radiography, also called mammography, is the most widely used tool to detect this kind of cancer at its starting stage. Mammography makes possible the identification of abnormalities in their initial development, which is a determining factor for success in treatment. According to the BIRADS [1] standard, benign masses have round, smooth and well-circumscribed boundaries. In contrast, malignant masses have fuzzy, ill-defined, irregular and spiculated boundaries with spicules extending into surrounding tissues. However, certain benign entities such as fibroadenomas and cystic masses also have weakly defined boundaries. Similarly, malignant cases may have strong, well-defined boundaries (see Figure 1). Several kinds of shape feature formulations were previously developed to characterize mass boundaries. The most used descriptors are: circularity, rectangularity [2], compactness (C), spiculation index (SI), fractional concavity (Fcc) [3], fractal dimension [4], Fourier descriptors [5] and statistics based on the distribution of the Normalized Radial Length [6]. One important feature in automated malignancy recognition is the characterization of the spiculation level. This characterization should be able to group similar benign lesions in one class and similar malignant lesions in another class regardless of their position in the mammogram. Therefore, the shape description must be invariant to geometric transformations such as translation, rotation and scaling. There are many existing invariant texture analysis methods, such as the Circular Mellin Feature extractor [11] and modified Gabor filters [9]. The Circular Mellin
Fig. 1. Samples of mass margins extracted from BIRADS: a) circumscribed round, b) ill-defined irregular and c) spiculated lesions
feature represents the spectral decomposition of the image scene in the polar-log coordinate system and is invariant to both scale and orientation of the target texture pattern. In texture-based pattern recognition, rotation, scale and translation invariance are required, together with insensitivity to noise and illumination variation; in shape analysis, similar shapes should likewise receive the same descriptor value. In other terms, it is not judicious to obtain different classification outputs for objects having the same form but different positions, different orientations or different scales. Despite the importance of shape descriptor invariance, few contributions have been reported. Barman et al. [8] proposed a shape measure which has reasonable scale invariance. For this reason, we focus on this criterion to conceive an efficient and invariant shape descriptor able to quantify the degree of mass spiculation and to further improve the classification performance. The paper is organized in four sections. The next one details the proposed descriptor and provides examples clarifying the method. Section 3 shows experimental results. A ROC curve is presented to validate the features' ability to discriminate between benign masses and malignant tumors. We also present a comparison with other methods that characterize shape complexity on the same data sets. Finally, Section 4 is devoted to the conclusion.
2 Mass Descriptor Based on Occurrence Intersection Coding (MDO)
Most of the shape analysis methods [5], [6], are focused on computing global measures characterizing the boundary’s shape. Such methods are relatively insensitive to important local changes due to lobulations and spicules. While the majority of benign masses on mammograms are well circumscribed, some present stellate distortions. Also, while most malignant tumors are spiculated, some circumscribed malignant tumors are also encountered. In this way, discrimination between microlobulations in malignant tumors and macrolobulations in benign masses requires a detailed analysis of local characteristics of mass boundaries. Thus, we propose a new descriptor that could distinguish between benign and malignant lesions by detecting local changes in the mass boundary.
Fig. 2. Example illustrating the evolution of lines Δ1 and Δ2 in the domain Ω
2.1 MDO Principles and Computation
As illustrated in Figure 2, in the orthonormal basis (O, i, j), we consider an example of a lesion L in the domain Ω of the mammographic image, which is delimited by the width w and the height h. Let C(x, y) be the contour of L in Ω. Assume that n is the number of points in the contour of the lesion L. We denote by Δ1(x, y, t1, θ) the family of parallel lines with direction θ, whose equation is

Δ1(x, y, t1, θ) : b1 y = a1 x + c1 + t1    (1)
where a1, b1, c1, t1, θ ∈ ℝ. All cases where a1 and b1 are not simultaneously null are taken into account in our procedure. As t1 evolves in time, the line Δ1(x, y, t1, θ) sweeps the domain Ω while preserving the same direction. The major goal of this study is to calculate, at a given time t1 and for a given angle θ, the number of intersections between the lesion and the line Δ1(x, y, t1, θ). Similarly, we denote by Δ2(x, y, t2, θ) the set of parallel lines chosen in the direction (θ + 90°) perpendicular to Δ1, whose equation is

Δ2(x, y, t2, θ) : b2 y = a2 x + c2 + t2    (2)
Once the parameters of the lines Δ1 and Δ2, as well as the initial and final times required to sweep the whole domain Ω, are set, we focus on computing the intersections between these lines and the contour of the lesion. Thus, we calculate the vector S1θ of size n1 (respectively S2θ of size n2) where each element denotes the number of intersections between the contour C(x, y) and the line Δ1(x, y, t1, θ) at a specific time t1 and a specific direction θ (respectively between the contour C(x, y) and the line Δ2(x, y, t2, θ) at a specific time t2 and the direction (θ + 90°)).
S1θ(t1) = Δ1(x, y, t1, θ) ∩ C(x, y),   ti1 ≤ t1 ≤ tf1
S2θ(t2) = Δ2(x, y, t2, θ) ∩ C(x, y),   ti2 ≤ t2 ≤ tf2    (3)
The elements of the vectors S1θ and S2θ depend on the complexity of the contour. In fact, the number of intersections between the contour and each line is around the values (2, 3, 4) if the contour delineates a circumscribed circular or oval lesion. Conversely, the number of intersections can take high values (6, 8, 10, ...) if the contour is more irregular and presents more lobulations or spiculations (as represented in Figures 3 and 4). Thus, if we sum the elements of both vectors S1θ and S2θ, we obtain low values for regular lesions and high values for irregular forms. Nevertheless, the mass size could affect the significance of the measure, since a regular large mass could reach a higher value than a small spiculated mass. Hence, when summing the elements of both vectors S1θ and S2θ, the obtained quantity is not scale invariant and depends strongly on the size of the lesion. For instance, the same lesion considered at two different scales necessarily yields different results. In order to achieve scaling invariance, we preserve only the variations of the topology, which do not depend on the lesion size. We compute the vectors s1θ and s2θ (respectively of sizes m and n) derived from S1θ and S2θ, where each element marks the location of a topology change, as follows:

s1θ(m) = S1θ(t1)  if  g1(t1) · S1θ(t1) ≠ 0,   m ≤ n1    (4)

s2θ(n) = S2θ(t2)  if  g2(t2) · S2θ(t2) ≠ 0,   n ≤ n2    (5)

where

gk(t) = 1 if Skθ(t) ≠ Skθ(t + 1), and 0 otherwise,   k = 1, 2    (6)
The mass descriptor computed for a given angle θ results from the sum of the elements of s1θ and s2θ, which is independent of the considered scale:

MDO = Σ_{i=1..m} s1θ(i) + Σ_{j=1..n} s2θ(j)    (7)
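A minimal sketch of this computation for θ = 0° is shown below; counting contour pixels per image row/column is used as a discrete stand-in for the line-contour intersection counts, which is an approximation of the continuous definition above.

import numpy as np

def mdo_theta0(contour_mask):
    # S1/S2: number of contour pixels met by each horizontal/vertical sweep line.
    S1 = contour_mask.sum(axis=1)
    S2 = contour_mask.sum(axis=0)

    def topology_changes(S):
        # Keep S(t) where S(t) != S(t+1) and S(t) != 0 (Eqs. 4-6).
        keep = np.append(S[:-1] != S[1:], False) & (S != 0)
        return S[keep]

    return topology_changes(S1).sum() + topology_changes(S2).sum()  # Eq. 7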
2.2 Evaluation of the MDO Descriptor
To further explain these terms, we study the case of two different lesions. Lesion 1, represented in Figure 3, has a slightly spiculated form, whereas lesion 2 (Figure 4) has a highly spiculated form. In order to simplify the representation and to make the results more explicit for the reader, we consider the angle θ = 0°, which provides horizontal lines for the set Δ1(x, y, t1, 0°) and vertical lines for the set Δ2(x, y, t2, 90°). The first column of Figure 3 shows the occurrence numbers S1^0°/S2^0° of points belonging to Δ1/Δ2 and to the contour simultaneously. The second column depicts the topology changes s1^0° and s2^0° of S1^0° and S2^0°. For any size
Fig. 3. MDO computation for lesion 1. First column: the occurrence number (S1) of points belonging simultaneously to line 1 and to the contour, and the occurrence number (S2) of points belonging simultaneously to line 2 and to the contour. Second column: the topology changes s1θ and s2θ of S1 and S2.

of this object, S1^0° and S2^0° would have the same template and would be smaller or greater depending on the scaling level. However, s1^0° and s2^0° would preserve the same sequence and the same elements. We remark that, in the case of lesion 1, S1^0° elements do not exceed the value 4 and S2^0° elements do not exceed the value 6. Also, Σ s1^0° = 39 and Σ s2^0° = 47, which results in MDO^0° = 86. These values are relatively low when compared to the values obtained for lesion 2. As represented in Figure 4, S1^0° reaches the value 9 and S2^0° reaches the highest number of intersection points at a given position of Δ2, equal to 12. These larger values lead to higher sums, Σ s1^0° = 239 and Σ s2^0° = 361, which provide a higher MDO^0° = 600. These results show that the proposed descriptor MDO is robust and can distinguish well between smooth and spiculated contours.
3 Experimental Results
The proposed descriptor is evaluated through a publicly available database, the Digital Database for Screening Mammography (DDSM), assembled by a research group at the University of South Florida [12]. In this study, we select only regions
Fig. 4. MDO computation for lesion 2. First column: the occurrence number (S1) of points belonging simultaneously to line 1 and to the contour, and the occurrence number (S2) of points belonging simultaneously to line 2 and to the contour. Second column: the topology changes s1θ and s2θ of S1 and S2.
of interest that depict circumscribed masses or microlobulated/spiculated masses. The considered data set consists of 87 masses (49 benign / 38 malignant), which are partitioned into training (29 benign / 21 malignant) and test (20 benign / 17 malignant) sets. We used as classifier the support vector machine (SVM), a well-known machine learning technique based on statistical learning theory [10]. To evaluate classification performance, we use Receiver Operating Characteristic (ROC) curve analysis, which is now routinely used for many classification tasks. It plots the sensitivity (true positive fraction, TPF) as the ordinate versus the false positive fraction (FPF) as the abscissa. These are defined as

TPF = TP / (TP + FN)  and  FPF = FP / (FP + TN)  [7].

A good classifier has its ROC curve climbing rapidly towards the upper left-hand corner of the graph. This can also be quantified by measuring the area under the ROC curve, Az. The closer the area is to 1, the better the classifier, while the closer the area is to 0.5, the worse the classifier. We compare the results of MDO to both radial length and geometrical features. Kilday [6] developed a set of shape features based on the radial length from the object's centroid to the points on the boundary. These features have had a
Fig. 5. ROC curves obtained with SVM classifier: Comparison between MDO, NRL (A1) and geometrical (Com) features
These features have had good success in computer-aided diagnosis applications, as demonstrated in [13,14]. The normalized radial length is defined as the Euclidean distance from the center of mass of the segmented lesion to the i-th pixel on the mass boundary, normalized with respect to the maximum distance found for that mass. Figure 2 shows the mass center and how the radial distance is calculated. The area ratio A measures the percentage of the tumor lying outside the circular region whose radius is the mean normalized radial length, and alone describes the macroscopic shape of the tumor. Since the area ratio has proven its performance in many applications [13], we follow the behavior of this descriptor on the same database in Figure 5. We also assess the performance of a geometrical feature, the compactness, in differentiating between benign and malignant masses. The compactness (Com) is a measure of contour complexity versus enclosed area, defined as Com = P²/A, where P and A are the mass perimeter and area, respectively. A mass with a rough contour has a higher compactness than a mass with a smooth boundary [15]. Considering the results in Table 1 and Figure 5, we first note that all of the features can effectively discriminate circumscribed benign masses from spiculated malignant tumors. Second, considering each feature individually, we notice that the area ratio A is comparatively weak at classifying circumscribed/spiculated lesions, providing an area under the ROC curve of Az = 0.86. In contrast, the compactness Com provides a satisfying result with Az = 0.90. The MDO feature of the present study clearly outperformed all the other shape-based features and appears to be the most effective for benign/malignant classification of breast masses. As demonstrated in previous work, geometrical features can be very informative and can improve classification results, especially when used in conjunction with other features [16]. However, used individually they cannot provide all the necessary information about mass malignancy. Regarding radial-length
Table 1. Individual performance of the different features in terms of the area under the ROC curve

Tested features              Corresponding Az
Proposed descriptor MDO      0.92 ± 0.01
Geometrical feature Com      0.90 ± 0.01
NRL feature A                0.86 ± 0.03
measures, although they have proven their efficiency in many applications [13], we note that these features give satisfying results for generally round boundaries; in the case of complex shapes, however, the centroid may lie outside the tumor region and is then not a valid point from which to measure distances to the boundary. The highest area under the ROC curve, obtained with the MDO measure and shown as a red line in Figure 5, is Az = 0.92. This high Az value demonstrates the efficiency and robustness of the proposed descriptor. It also shows that a descriptor which preserves the same MDO value for the same shape, independently of translation, rotation and scaling, ensures a good classification rate. This result indicates that the proposed Mass Descriptor based on Occurrence intersection coding (MDO) could be a decisive descriptor in computer-aided diagnosis systems, providing a second opinion to radiologists.
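The ROC/Az evaluation used throughout this section is straightforward to reproduce with standard tools. The sketch below uses synthetic scores rather than the authors' SVM outputs; only the class sizes mirror the test split (20 benign / 17 malignant).

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(0)
    # Synthetic stand-in for SVM decision values on the 37 test masses:
    # benign (label 0) and malignant (label 1) scores from shifted normals.
    y_test = np.array([0] * 20 + [1] * 17)
    scores = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(1.5, 1.0, 17)])

    fpf, tpf, _ = roc_curve(y_test, scores)  # FPF = FP/(FP+TN), TPF = TP/(TP+FN)
    print("Az =", auc(fpf, tpf))             # area under the ROC curve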
4 Conclusion
In this paper, we proposed a new descriptor dedicated to measuring spiculation in the analysis of breast masses. For evaluation, we used the well-known DDSM database and an SVM classifier. When computing the Mass Descriptor based on Occurrence intersection coding (MDO), we observed its ability to capture diagnostically important details of shape related to spicules and lobulations. The proposed descriptor provides high classification accuracy when discriminating between benign breast masses and malignant tumors. In future work, we intend to study the invariance of MDO under different rotations of the pathological masses.
References

1. American College of Radiology: BI-RADS (Breast Imaging Reporting and Data System), French edition realized by SFR, 3rd edn. (2003)
2. Sahiner, B.S., Chan, H.P., Petrick, N., Helvie, M.A., Hadjiiski, L.M.: Improvement of mammographic mass characterization using spiculation measures and morphological features. Med. Phys. 28(7), 1455–1465 (2001)
3. Rangayyan, R.M., Mudigonda, N.R., Desautels, J.E.L.: Boundary modelling and shape analysis methods for classification of mammographic masses. Med. Biol. Eng. Comput. 38, 487–496 (2000)
4. Rangayyan, R.M., Nguyen, T.M.: Fractal Analysis of Contours of Breast Masses in Mammograms. Journal of Digital Imaging 20(3), 223–237 (2007)
5. El-Faramawy, N.M., Rangayyan, R.M., Desautels, J.E.L., Alim, O.A.: Shape factors for analysis of breast tumors in mammograms. In: Canadian Conference on Electrical and Computer Engineering, pp. 355–358 (1996)
6. Kilday, J., Palmieri, F., Fox, M.D.: Classifying mammographic lesions using computer-aided image analysis. IEEE Trans. Med. Imaging 20, 664–669 (1993)
7. Adler, W., Lausen, B.: Bootstrap estimated true and false positive rates and ROC curve. Journal of Computational Statistics and Data Analysis, 1–12 (2008)
8. Barman, H., Granlund, G., Haglund, L.: Feature extraction for computer aided analysis of mammograms. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 1339–1356 (1993)
9. Sim, D.G., Kim, H.K., Oh, D.I.: Translation, Scale and Rotation Invariant Texture Descriptor for Texture-based Image Retrieval. In: International Conference on Image Processing Proceedings, vol. 3, pp. 742–745 (2000)
10. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
11. Vyas, V.S., Rege, P.P.: Radon Transform application for rotation invariant texture analysis using Gabor filters. In: Proceedings of NCC-2007, IIT Kanpur, pp. 439–442 (2007)
12. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, P.: The Digital Database for Screening Mammography. In: 5th International Workshop on Digital Mammography, Toronto, Canada (2000)
13. Chen, C.Y., Chiou, H.J., Chou, Y.C., Wang, H.K., Chou, S.Y., Chiang, H.K.: Computer-aided Diagnosis of Soft Tissue Tumors on High-resolution Ultrasonography with Geometrical and Morphological Features. Academic Radiology, 618–626 (2009)
14. Hadjiiski, L., Chan, H.P., Sahiner, B., Helvie, M.A., Roubidoux, M.A., Blane, C., Paramagul, C., Petrick, M.N., Bailey, J., Klein, K., Foster, M., Patterson, S., Adler, A., Nees, A., Shen, J.: Improvement in Radiologists' Characterization of Malignant and Benign Breast Masses on Serial Mammograms with Computer-aided Diagnosis: An ROC Study. Radiology, 255–265 (2004)
15. Shen, L., Rangayyan, R.M., Desautels, J.E.L.: Detection and classification of mammographic calcifications. International Journal of Pattern Recognition and Artificial Intelligence 7(6), 1403–1416 (1993)
16. Cheikhrouhou, I., Djemal, K., Sellami, D., Maaref, H., Derbel, N.: Empirical Descriptors Evaluation for Mass Malignity Recognition. In: The First International Workshop on Medical Image Analysis and Description for Diagnosis Systems (MIAD'09), Porto, Portugal, January 16-17, pp. 91–100 (2009)
A New Preprocessing Filter for Digital Mammograms

Peyman Rahmati¹, Ghassan Hamarneh², Doron Nussbaum³, and Andy Adler¹

¹ Dept. of Systems and Computer Engineering, Carleton University, ON, Canada
² School of Computing Science, Simon Fraser University, BC, Canada
³ Dept. of Computer Science, Carleton University, ON, Canada
{Prahmati,Adler,Nussbaum}@sce.carleton.ca, [email protected]

Abstract. This paper presents a computer-aided approach to enhancing suspicious lesions in digital mammograms. The developed algorithm improves on a well-known preprocessing filter, contrast-limited adaptive histogram equalization (CLAHE), to remove noise and intensity inhomogeneities. The proposed preprocessing filter, called fuzzy contrast-limited adaptive histogram equalization (FCLAHE), performs non-linear enhancement. The filter eliminates noise and intensity inhomogeneities in the background while retaining the natural gray-level variations of mammographic images within suspicious lesions. We applied the Catarious segmentation method (CSM) to compare segmentation accuracy in two scenarios: with no preprocessing filter, and with the proposed preprocessing filter applied to the original image. The proposed filter has been evaluated on 50 real mammographic images, and the experimental results show an average increase in segmentation accuracy of 14.16% when the new filter is applied.

Keywords: Breast cancer, mammography, image denoising, and segmentation.
1 Introduction

Breast cancer is the leading type of cancer in women and the second most fatal type of cancer in women [1]. Mammography allows for the detection of non-palpable tumors and increases the survival rate [2]. Digital mammography uses x-rays to project structures in the 3D female breast onto a 2D image [2]. The inherently noisy nature of digital mammograms, the low contrast of suspicious areas, and ill-defined mass borders make mass segmentation a challenging problem. The design of an appropriate preprocessing filter is essential for segmentation algorithms to delineate masses with high accuracy. Tumors appear as medium-gray to white areas on digital mammograms [3]. Tumor shapes are described by standardized keywords [4]: the shapes are grouped as oval, irregular, lobulated, or round, and the margins are expressed as circumscribed, obscured, ill-defined or spiculated. The presence of irregularly shaped masses and spicules correlates with increased likelihood of malignancy [5]. Given the low signal-to-noise ratio of mammography images, which greatly decreases the observable details, this paper develops a new strategy to enhance the original image. Previous works on mammographic-image preprocessing used methods such as gamma correction [6], adaptive 2D Wiener filtering [7], contrast-limited
adaptive histogram equalization (CLAHE) [8], and multi-scale wavelet-based enhancement [9]. It is difficult to choose a single best technique for image enhancement, and enhancement is most often evaluated based on the performance of the subsequent segmentation algorithm [10]. The role of a proper preprocessing filter is to provide an enhanced image to be fed into the subsequent blocks of a Computer Aided Diagnosis (CAD) system. For a CAD system to accurately diagnose tumors as either benign or malignant, the inherent nature of mammograms should not be affected by the preprocessing filter; otherwise, important features that are probably useful for tumour classification will be lost. To retain the characteristic intensity variations of mammograms, we propose a new preprocessing filter, named fuzzy contrast-limited adaptive histogram equalization (FCLAHE). This work has two goals: enhancement of mammographic images to achieve better visibility for the human observer (radiologist), and smoothing of inhomogeneities in the background of the main lesion under investigation to decrease the number of probable false positives. This paper is organized as follows: in the next section, we contrast some closely related works with our proposed method. In section 3, the suggested preprocessing filter (FCLAHE) is detailed. The experimental results are presented in section 4, and we draw conclusions in section 5.
2 Related Works

In addition to the noise present in mammograms, some artifacts further complicate the diagnosis and introduce uncertainty into the image interpretation. These artifacts are related to the variability in tissue density and the inhomogeneous nature of tissue in some anatomical structures. They imply that designing algorithms for mammography preprocessing is a significantly more demanding task than for, say, medical images of homogeneous structures. The main goal is to decrease the inhomogeneities, leading to increased accuracy in the subsequent mass segmentation algorithm. There are several works on mammographic-image preprocessing. Baeg et al. [6] used gamma correction for mammographic enhancement. Based on their texture-analysis method, classification of 150 biopsy-proven masses into benign and malignant classes resulted in an area under the Receiver Operating Characteristic (ROC) curve of 0.91. Gamma correction is simple, but its effects are localized rather than global. The adaptive 2D Wiener filtering (A2DWF) noise-reduction algorithm [7] estimates the noise in the neighborhood around each pixel and then adjusts the surrounding region based on that noise estimate. Mayo et al. [11] compared A2DWF, a wavelet filter, a filter based on independent component analysis, and a diffusion-based filter. Although Mayo et al. did not extract features, the noise-removal performance of the compared methods was shown to be similar, based on both visual observation and mean square error. To yield improved diagnostic performance, Mekle et al. [9] used an interactive multi-scale enhancement that incorporates dyadic spline wavelet functions and sigmoidal nonlinear enhancement functions. In a mammogram, masses appear brighter and gradually darken as the image is traversed from the mass core toward the background. Since useful discriminatory
features to classify tumors can be based on the natural intensity variations of the mammogram, it is important that the preprocessing filter retains such intensity characteristics. To the best of our knowledge, designing a preprocessing filter that enhances the original mammographic image while still preserving its natural intensity variations has not been reported previously. In this work, we enhance the original mammograms while maximizing the preservation of their inherent characteristics. Pisano et al. [8] examined several digital mammograms using multiple methods of enhancement, including CLAHE. They concluded that image detail is good and that, in general, lesions appeared obvious compared to the background. They also found that graininess (inhomogeneity) was introduced due to the enhancement of image noise, which might mislead a radiologist into seeing microcalcifications that do not exist. The advantage of CLAHE is that it is straightforward to implement and results in a high-contrast image. Following the conclusions of Pisano et al. on the efficacy of CLAHE [8], the original mammographic images are first processed using CLAHE to eliminate noise, followed by the application of a nonlinear filter, which incorporates the advantages of a nonlinear fuzzy function.
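For reference, CLAHE itself is available in common image libraries. A minimal sketch using scikit-image follows; the clip limit here is illustrative, not the setting used in the paper.

    import numpy as np
    from skimage import exposure

    def clahe(mammogram, clip_limit=0.01):
        """Contrast-limited adaptive histogram equalization of a 2-D
        grayscale image; the clip limit bounds the local contrast
        amplification, which is what limits the noise blow-up."""
        img = mammogram.astype(np.float64)
        img = (img - img.min()) / (img.max() - img.min() + 1e-12)  # to [0, 1]
        return exposure.equalize_adapthist(img, clip_limit=clip_limit)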
3 Preprocessing Filter: FCLAHE

In this section, we develop a preprocessing filter to enhance digital mammograms, called fuzzy contrast-limited adaptive histogram equalization (FCLAHE). This work is inspired by CLAHE [8], with modifications to improve its performance. Pisano et al. [8] examined several image-enhancement methods, and noted that CLAHE "might be helpful in allowing radiologists to see subtle edge information, such as spiculations". The main drawback of CLAHE is that the amount of enhancement is similar for the foreground and the background of the original image (linear filtering). The result is an image with high contrast in both the foreground and the background. In other words, it increases the visibility of the main mass at the cost of simultaneously creating small yet misleading intensity inhomogeneities in the background, i.e., it leads the radiologist to detect more tumors than actually exist in the image (higher false positives). The presence of inhomogeneities in the background can also mislead the segmentation algorithm as to the location of the real mass under investigation. Since CLAHE eliminates noise in exchange for increasing the inhomogeneities in the background, there is a trade-off between the accuracy of the enhanced image and the inhomogeneities in the background. The proposed preprocessing filter aims at an improved compromise between them. In the proposed filter, we retain the advantages of CLAHE, namely the ability to remove noise and obtain a high-contrast image, while remedying its deficiency, the loss of the inherent nature of enhanced mammographic images. Therefore, to preserve the natural gray-level variations of the original mammogram and provide a smoothed image with acceptable brightness, a custom algorithm is used to supply non-linear enhancement.
We use the fuzzy function proposed in [12] in a new form to provide a non-linear adjustment of the image based on its gray-level statistics. The fuzzy function in [12] assigns a membership value between 0 and 1 to each pixel based on the difference between its gray level and that of a seed point located at the mass core. The fuzzy function is defined as [12]:
F(p) = 1 / (1 + βd),   (1)
where p is the intensity of the pixel being processed, d is the intensity difference between p and the seed point, and β controls the opening of the function. The larger the difference, the lower the membership value F(p); and as β increases, the opening of F(p) decreases. Fig. 1 visualizes the behaviour of this function.
Fig. 1. Fuzzy function F(p) membership for various values of β
The proposed non-linear filter, fuzzy contrast-limited adaptive histogram equalization (FCLAHE), is described as:

Î(x, y) = A_r(x, y) + ( I(x, y) − A_r(x, y) ) · F( |I(x, y) − A_r(x, y)| ),   (2)
where A_r(x, y) denotes the average gray level of the pixels located radially at distance r from the seed point; see the red circles in Figure 2(b), drawn at radial distances r of 110 and 121. The seed point is selected by a radiologist at the center of the mass core, the brightest region in the mammogram. F(·) is the fuzzy function defined in (1), I(x, y) is the original intensity of the pixel at coordinate (x, y), and Î(x, y) is the new intensity of the same pixel in the processed image. Equation (2) assumes that the intensities of the pixels at a given radial distance r from the seed point are similar unless there are inhomogeneities in the background. This assumption comes from the fact that masses in digital mammograms appear as brighter areas that gradually darken from the mass core outward toward the background (see Figure 2(a)), an intensity profile resembling the fuzzy function in Figure 1, with d interpreted as the radial distance from the seed point. Wherever there is a bright region in the background (a background inhomogeneity), I(x, y) will differ strongly from A_r(x, y), i.e., Δd = |I(x, y) − A_r(x, y)| will be large. According to (1), F(Δd) will therefore yield a low value. This, in turn, will attenuate the second term on the right-
hand side of (2), and Î(x, y) will be assigned approximately A_r(x, y). If there are no inhomogeneities in the background, then I(x, y) will be close to A_r(x, y), i.e., Δd ≈ 0, and F(0) will yield unity, which in turn assigns I(x, y) to Î(x, y) (i.e., the intensity of the pixel does not change). In effect, equation (2) removes inhomogeneities, the small high-contrast regions in the background that may be mistaken for a real lesion by the subsequent segmentation algorithm, while keeping the brightness and the contrast of the mass core; see Fig. 2. As a result, the proposed FCLAHE algorithm provides a smoothed image that is in close agreement with the nature of the original mammographic image. The major advantage of this filter compared to earlier approaches [6-9] is its ability to eliminate noise without affecting the intensity characteristics of the mammographic image. One of the most common approaches to segmenting mammograms is to exploit the statistical distribution of their intensities. Therefore, to achieve accurate lesion segmentation, it is important that preprocessing minimizes changes to this statistical distribution, or changes it in a controlled way toward a predefined probability distribution function (PDF); see figure 5(c).
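A compact sketch of Eqs. (1)–(2) follows. It assumes the input has already been CLAHE-filtered, takes the seed at the mass core as a (row, column) pair, and approximates A_r(x, y) by the mean gray level over the ring of pixels at each integer radial distance from the seed; these discretization choices are ours, not spelled out in the paper.

    import numpy as np

    def fclahe(img, seed, beta=0.009):
        """FCLAHE, Eqs. (1)-(2): img is the CLAHE-enhanced image (2-D float
        array), seed = (row, col) is the radiologist-selected mass core."""
        rows, cols = np.indices(img.shape)
        r = np.hypot(rows - seed[0], cols - seed[1]).astype(int)  # radial distance
        # A_r: mean gray level over each ring of pixels at distance r from the seed
        counts = np.bincount(r.ravel())
        ring_mean = np.bincount(r.ravel(), weights=img.ravel()) / counts
        A = ring_mean[r]
        d = np.abs(img - A)               # deviation from the radial average
        F = 1.0 / (1.0 + beta * d)        # fuzzy membership, Eq. (1)
        return A + (img - A) * F          # Eq. (2)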
4 Experimental Results

The proposed algorithm was run on a Pentium IV PC, Intel 3.0 GHz, with Windows XP Professional and 3.0 GB RAM, in MATLAB 7.0. We executed our proposed method on 50 mammographic images selected from the Digital Database for Screening Mammography (DDSM) [13]. Fig. 2 shows the images enhanced with FCLAHE for different β values. Note how the result from CLAHE introduces erroneous enhancement of background intensity patterns (red oval in Fig. 2b). To investigate the effect of FCLAHE on the accuracy of the subsequent segmentation algorithm, we performed the Catarious segmentation method (CSM) to delineate the masses [14]. We performed CSM segmentation on both the original images and the images enhanced using either CLAHE or FCLAHE. Then, we evaluated the segmentation accuracy against expert-validated ground-truth delineations using the overlap criterion defined as follows:

OV = (A ∩ B) / A,   (3)
where A is the ground-truth delineation, B is the competing delineated area, and OV = 1 when A and B match perfectly. The bar chart in Fig. 3 shows the overlap values (OV%) between the automatic segmentation and the ground truth for each study case with and without FCLAHE. The average overlap with the expert increased from 52% to 68.36% when the proposed filter was employed (averaged over 50 images), i.e., a 16.36% increase in accuracy. Moreover, the average overlap with the expert was 54.2% when we applied CLAHE. Compared with the CLAHE average segmentation accuracy (54.2%), FCLAHE thus offers a 14.16% increase in accuracy. Fig. 4 depicts qualitative comparative segmentation results. The corresponding quantitative increases in accuracy for these cases were: from 41% to 67.1% for Fig. 4(a), 49% to 80.1% for Fig. 4(b), 55% to 69% for Fig. 4(c), and 46% to 88.2% for Fig. 4(d).
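The overlap criterion of Eq. (3) reduces to a few lines for binary masks; note that, as printed, it normalizes by the ground-truth area A rather than by the union of A and B.

    import numpy as np

    def overlap(ground_truth, segmentation):
        """OV = (A ∩ B) / A for boolean masks, per Eq. (3)."""
        A = ground_truth.astype(bool)
        B = segmentation.astype(bool)
        return np.count_nonzero(A & B) / np.count_nonzero(A)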
It appears that Fig. 4(a) and Fig. 4(c) show smaller accuracy gains after applying FCLAHE because several small bright regions (inhomogeneities) are attached immediately to their main mass boundary. Fig. 5 depicts the effect of CLAHE and FCLAHE on the histogram of the original mammogram for one case study. Fig. 5(a) shows the histogram of the original image, which is spiky and rather asymmetric, while in Fig. 5(b) the histogram of the CLAHE-enhanced image has better symmetry but still suffers from spikes due to the presence of inhomogeneities in the background. The FCLAHE-enhanced image, shown in Fig. 5(c), offers the smoothest and most symmetric histogram, closest to an exponential-type distribution such as the Poisson distribution, which has already been deemed effective for modeling the intensity variations of mammograms [15].
Fig. 2. An example of applying the FCLAHE algorithm to a mammographic image. (a) Original mammographic image. (b) The image in (a) enhanced with CLAHE, which has good contrast but contains small background regions (one of which is marked by a red oval) that may misleadingly be mistaken for a real mass. (c) Our proposed enhancement (FCLAHE) with β = 0.009, which has acceptable contrast while leaving no high-contrast regions (high intensity inhomogeneity) in the background. (d) FCLAHE image with β = 0.02.
Fig. 3. The effect of FCLAHE on the accuracy (OV%) of the CSM for each case number
Fig. 4. Examples of CSM performance on mammographic images with and without the proposed preprocessing filter. First row: delineations obtained on the original mammograms without applying FCLAHE. Second row: final contours of the CSM algorithm (black lines) on images enhanced with FCLAHE. The ground truth is shown in white.
Fig. 5. The effect of the proposed preprocessing filter on the histogram of an original mammogram image. (a) Original mammogram image along with its histogram. (b) CLAHE-enhanced image with its histogram. (c) FCLAHE-enhanced image with its histogram, resembling an exponential PDF.
5 Conclusion

We presented a novel preprocessing filter, referred to as FCLAHE, applied in the context of computer-aided delineation of lesions in mammographic images. The new model incorporates the advantages of the preceding CLAHE preprocessor while addressing its deficiencies. The proposed filter improves on the CLAHE algorithm by applying a nonlinear fuzzy function in a new form. It conserves the brightness of the mass core and appropriately suppresses the graininess and small inhomogeneous regions in the background, so that the enhanced images retain the inherent nature of mammograms. The overlap with expert
delineations increased from 54.2% to 68.36% when FCLAHE was applied, averaged over all 50 images in the database. A potential direction for future work is to apply a PDF-based segmentation method to the FCLAHE-enhanced mammographic images, in anticipation of higher segmentation accuracy due to FCLAHE's preservation of the intensity variations of the mammogram.
References

1. American Cancer Society: Breast Cancer Facts & Figures 2005–2006, pp. 1–28 (2006)
2. Rangayyan, R.M.: Breast cancer and mammography. In: Neuman, M.R. (ed.) Biomedical Image Analysis, pp. 22–27. CRC Press, Boca Raton (2005)
3. Egan, R.L.: Breast Imaging: Diagnosis and Morphology of Breast Diseases. W.B. Saunders Co., Philadelphia (1988)
4. The Mosby Medical Encyclopedia, Revised edn. The Penguin Group, New York (1992)
5. Jeske, J.M., Bernstein, J.R., Stull, M.A.: Screening and Diagnostic Imaging. In: American Cancer Society Atlas of Clinical Oncology, pp. 41–63. B.C. Decker, London (2000)
6. Baeg, S., Kehtarnavaz, N.: Texture based classification of mass abnormalities in mammograms. In: Proc. of the 13th IEEE Symposium on Computer-Based Medical Systems (CBMS), Houston, TX, June 2000, vol. 1, pp. 163–168 (2000)
7. Mayo, P., Rodenas, F., Verdu, G.: Comparing methods to denoise mammographic images. In: Proc. of the 26th Annual Intl. Conference of the Engineering in Medicine and Biology Society (EMBC), vol. 1, pp. 247–250 (2004)
8. Pisano, E.D., Cole, E.B., Hemminger, B.M., Yaffe, M.J., Aylward, S.R., Maidment, A.D.A., Johnston, R.E., Williams, M.B., Niklason, L.T., Conant, E.F., Fajardo, L.L., Kopans, D.B., Brown, M.E., Pizer, S.M.: Image Processing Algorithms for Digital Mammography: A Pictorial Essay. RadioGraphics 20(5), 1479–1491 (2000)
9. Mekle, R., Laine, A.F., Smith, S., Singer, C., Koenigsberg, T., Brown, M.: Evaluation of a multiscale enhancement protocol for digital mammography. In: Proc. of the Wavelet Applications in Signal and Image Processing VIII, San Diego, CA, USA, vol. 4119, pp. 1038–1049 (2000)
10. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Prentice Hall, Upper Saddle River (2002)
11. Mayo, P., Rodenas, F., Verdu, G.: Comparing methods to denoise mammographic images. In: Proc. of the 26th Annual Intl. Conference of the Engineering in Medicine and Biology Society (EMBC), vol. 1, pp. 247–250 (2004)
12. Guliato, D., Rangayyan, R.M., Carnielli, W.A., Zuffo, J.A., Desautels, J.E.L.: Segmentation of breast tumors in mammograms using fuzzy sets. Journal of Electronic Imaging 12(3), 369–378 (2003)
13. Heath, M., Bowyer, K., Kopans, D., Kegelmeyer, W.P.H., Moore, R., Chang, K., MunishKumaran, S.: Current status of the Digital Database for Screening Mammography, http://marathon.csee.usf.edu/Mammography/Database.html (accessed September 15, 2009)
14. Catarious, D.M., Baydush, A.H., Floyd, C.E.: Incorporation of an iterative, linear segmentation routine into a mammographic mass CAD system. Medical Physics 31(6), 1512–1520 (2004)
15. Rahmati, P., Ayatollahi, A.: Maximum Likelihood Active Contours Specialized for Mammography Segmentation. In: The 2nd IEEE International Conference on BioMedical Engineering and Informatics (BMEI'09), China, vol. 1, pp. 257–260 (2009)
Classification of High-Resolution NMR Spectra Based on Complex Wavelet Domain Feature Selection and Kernel-Induced Random Forest

Guangzhe Fan¹, Zhou Wang², Seoung Bum Kim³, and Chivalai Temiyasathit⁴

¹ Dept. of Statistics & Actuarial Science, University of Waterloo, Waterloo, ON, Canada
² Dept. of Electrical & Computer Engineering, University of Waterloo, Waterloo, ON, Canada
³ Dept. of Industrial Systems & Information Engineering, Korea University, Seoul, Korea
⁴ International College, King Mongkut's Inst. of Technology Ladkrabang, Bangkok, Thailand
Abstract. High-resolution nuclear magnetic resonance (NMR) spectra contain important biomarkers that have potential for early diagnosis of disease and subsequent monitoring of its progression. Traditional feature extraction and analysis methods have operated in the original frequency-spectrum domain. In this study, we conduct feature selection based on a complex wavelet transform, making use of its energy shift-insensitive property in a multiresolution signal decomposition. A false discovery rate based multiple testing procedure is employed to identify important metabolite features. Furthermore, a novel kernel-induced random forest algorithm is used for the classification of NMR spectra based on the selected features. Our experiments with real NMR spectra showed that the proposed method leads to a significant reduction in misclassification rate.

Keywords: High-resolution NMR spectrum; Metabolomics; Classification tree; Random forest; Complex wavelet transforms; False discovery rate; Kernel.
1 Introduction

High-resolution nuclear magnetic resonance (NMR) spectroscopy has been found useful in characterizing metabolic variations in response to disease states, genetic modification, and nutritional intake. Given the thousands of feature points in each NMR spectrum, the first step is to identify the features most related to the problem being studied. Many of the feature points are either redundant or irrelevant; removing them can largely reduce the computational cost while improving the stability (e.g., noise robustness) of the subsequent analysis and classification processes. Such dimensionality reduction procedures are mostly carried out directly in the original frequency domain. Widely used methods for identifying important metabolite features in spectral data include principal component analysis (PCA) and partial least squares (PLS) [1,2], but the principal components from PCA or PLS do not provide clear interpretations with respect to the original features. In [3], a genetic programming based method was proposed to select a subset of original metabolite
features in NMR spectra for the classification of genetically modified barley, though the method may not be reliable for high-dimensional and noisy data. The wavelet transform provides a powerful and flexible framework for localized analysis of signals in both time and scale, but its application to NMR spectra has not been fully exploited. In [4,5], the wavelet transform was used for the detection of small chemical species in NMR spectra by suppressing the water signal. In [6], the decimated discrete wavelet transform was employed to analyze mass spectrometry data (similar to NMR spectra) with class information. In [7], NMR spectra were analyzed using complex wavelet transforms, which have the important property of energy shift-insensitivity. In particular, the false discovery rate based multiple testing procedure leads to more reliable feature selection results when intensity and position shifts exist between the multiple NMR spectra being compared (which is always the case for data acquired in practice). NMR spectrum based classification is not only a practically useful application; it also provides a direct test of the quality of the feature extraction procedure. In [7], a simple classification tree algorithm was used. In the present study, we use the Gabor wavelet transform, which achieved the best result in [7]. We employ a new cross-validated testing scheme and identify different feature sets for our usage. To test the results of our feature selection scheme, we also compare three different classification approaches: classification tree, random forest, and kernel-induced random forest, the last being a novel algorithm for the classification of high-resolution NMR spectra. In [8], a classification tree algorithm was described in detail; [9] later proposed the ensemble of classification trees called random forest. Random forest is a powerful classification tool that uses many randomly generated large classification trees and combines their votes for a decision. The instability of a single classification tree is greatly reduced by the ensemble. A kernel-induced random forest method was proposed in [10]. A kernel function is computed for every two observations based on all the features or a reduced feature space. Then the observations are used to classify other observations via a recursive partitioning procedure and its ensemble model. The classification accuracy is improved with the kernel-induced feature space.
2 Method

2.1 Complex Wavelet Transform

We consider complex wavelets as dilated/contracted and translated versions of a complex-valued "mother wavelet" w(x) = g(x) e^{jω_c x}, where ω_c is the center frequency of the modulating band-pass filter, and g(x) is a slowly varying and symmetric real-valued function. The family of wavelets derived from the mother wavelet can be expressed as:
w_{s,p}(x) = (1/√s) w((x − p)/s) = (1/√s) g((x − p)/s) e^{jω_c (x − p)/s}   (1)
where s ∈ R⁺ and p ∈ R are the scale and translation factors, respectively. Considering the fact that g(−x) = g(x), the wavelet transform of a given real signal f(x) can be written as:
F(s, p) = ∫_{−∞}^{∞} f(x) w*_{s,p}(x) dx = [ f(x) * (1/√s) g(x/s) e^{jω_c x/s} ]_{x=p}   (2)
In other words, we can use this to compute the wavelet coefficient F(s, p) at any given scale s and location p. Using the convolution theorem and the shifting and scaling properties of the Fourier transform, it is not difficult to derive that
F(s, p) = (1/2π) ∫_{−∞}^{∞} F(ω) √s G(sω − ω_c) e^{jωp} dω   (3)
where F(ω) and G(ω) are the Fourier transforms of f(x) and g(x), respectively. Now suppose that the function f(x) has been shifted by a small amount Δx, i.e., f′(x) = f(x + Δx). This corresponds to a linear phase shift in the Fourier domain: F′(ω) = F(ω) e^{jωΔx}. Substituting this into Eq. (3) and assuming that Δx is small relative to the envelope function g(x), we can derive
|F′(s, p)| ≈ |F(s, p)|   (4)
This implies that the magnitude (or energy) of the complex wavelet coefficient does not change significantly under a small translation. Such an energy shift-insensitive property is very important in the analysis of NMR spectra, because a small misalignment between multiple NMR spectra is unavoidable (even after preprocessing), and the misalignment may interfere with direct comparisons between NMR spectra. Among the various complex wavelets available, we choose the Gabor wavelets, mainly for two reasons. First, according to the Gabor uncertainty principle, the time-frequency resolution of a signal is fundamentally limited by a lower bound on the product of its bandwidth and duration, and the Gabor filters are the only family of filters that achieve this lower bound [11]. In other words, the Gabor filters provide the best compromise between simultaneous time and frequency signal representations. Second, the Gabor wavelets are easily and continuously tunable in both center frequency and bandwidth.
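A small numerical check of the shift-insensitivity in Eq. (4) can be done with a 1-D Gabor filter; the scale, center frequency, and test signal below are arbitrary choices for illustration.

    import numpy as np

    def gabor_magnitude(f, s=16.0, wc=np.pi / 2):
        """Magnitude of complex Gabor wavelet coefficients of a 1-D signal,
        computed by convolving with (1/sqrt(s)) g(x/s) exp(j*wc*x/s)."""
        x = np.arange(-4 * s, 4 * s + 1)
        kernel = np.exp(-0.5 * (x / s) ** 2) * np.exp(1j * wc * x / s) / np.sqrt(s)
        return np.abs(np.convolve(f, kernel, mode="same"))

    rng = np.random.default_rng(1)
    f = np.cumsum(rng.normal(size=512))      # arbitrary smooth test signal
    m0 = gabor_magnitude(f)
    m1 = gabor_magnitude(np.roll(f, 3))      # same signal shifted by 3 samples
    interior = slice(100, -100)              # ignore boundary effects
    # The re-aligned magnitudes differ only slightly, illustrating Eq. (4):
    print(np.max(np.abs(np.roll(m0, 3) - m1)[interior]), "vs max", m0.max())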
2.2 Feature Selection Based on a Multiple Testing Procedure

The most straightforward approach to feature selection in the wavelet transform domain is thresholding. However, this method may discard small-magnitude coefficients that are in fact important for classification. In this study, we identify complex wavelet coefficient features so as to maximize the separation of classes. More specifically, a multiple testing procedure that controls the false discovery rate (FDR) is employed to identify significant Gabor coefficients that discriminate between the spectra under different conditions. The FDR is the error rate in multiple hypothesis tests and is defined as the expected proportion of false positives among all the hypotheses rejected [12]. In our problem, a rejected hypothesis is interpreted as a coefficient significant for classification.
The FDR-based procedure is explained with our experimental data. Let δ_jk be the magnitude of the Gabor coefficient at the k-th position of the j-th class. Our experimental data comprise 136 NMR spectra, half of which were taken from the zero-SAA phase and the other half from the supplemented-SAA phase. The goal is to identify a set of δ_k that maximizes the separability between the two SAA phases. For each wavelet coefficient, the null hypothesis states that the average magnitudes of the Gabor coefficients are equal between the two SAA phases, and the alternative hypothesis is that they differ. The two-sample t statistic is
t_k = (δ̄_1k − δ̄_2k) / √( σ̂²_1k / n₁ + σ̂²_2k / n₂ )   (5)
where δ̄_1k, σ̂²_1k, and n₁ are the sample mean, the sample variance, and the number of samples from the first condition, respectively; δ̄_2k, σ̂²_2k, and n₂ are obtained analogously from the second condition. By asymptotic theory, t_k approximately follows a t-distribution under the null hypothesis. Using this, the p-value for each δ_k can be obtained. In multiple testing problems, it is well known that applying a single-testing procedure to each hypothesis leads to a rapid accumulation of false positives. To overcome this, methods that control family-wise error rates have been proposed; the most widely used is the Bonferroni method, which applies a more stringent threshold [13]. However, the Bonferroni method is too conservative and often fails to detect the "true" significant features. A more recent multiple testing procedure that controls the FDR was proposed by Benjamini and Hochberg [12]. The advantage of the FDR-based procedure is that it identifies as many significant hypotheses as possible while keeping a relatively small number of false positives [14,15].
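The Benjamini–Hochberg step-up rule is simple to implement; a minimal sketch follows, assuming the p-values for all wavelet coefficients have already been computed from Eq. (5).

    import numpy as np

    def bh_select(pvals, q=0.05):
        """Benjamini-Hochberg procedure: boolean mask of hypotheses rejected
        at FDR level q, i.e. the coefficients kept as significant features."""
        p = np.asarray(pvals)
        m = p.size
        order = np.argsort(p)
        # Find the largest i (1-based) with p_(i) <= q * i / m, then reject
        # the i smallest p-values.
        below = p[order] <= q * np.arange(1, m + 1) / m
        keep = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])
            keep[order[: k + 1]] = True
        return keep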
2.3 Kernel-Induced Classification Tree and Random Forest

A classification model was used to examine the advantage of using the complex wavelet transform and FDR-based feature selection for NMR spectra. We used a classification tree, one of the most widely used classification methods. Classification trees partition the input (feature) space into disjoint hyper-rectangular regions according to performance measures such as misclassification error, the Gini index, and cross-entropy, and then fit a constant model in each disjoint region [16]. The number of disjoint regions (equivalent to the number of terminal nodes in the tree) should be determined appropriately, because a very large tree overfits the training set while a small tree cannot capture important information in the data. In general, there are two approaches to determining the tree size. The first is direct stopping, which attempts to stop tree growth before the model overfits the training set. The second is tree pruning, which removes the leaves and branches of a fully grown tree to find the right tree size. In the present study the Gini index was used as the performance measure, and to determine tree size we stop the growth of a tree when the number of data points in a terminal node reaches five.
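In scikit-learn terms, a roughly equivalent tree (Gini criterion, growth stopped at small nodes, no pruning) could be configured as below; min_samples_leaf=5 is our approximation of the five-point stopping rule, not the authors' exact implementation.

    from sklearn.tree import DecisionTreeClassifier

    # Gini index as splitting criterion; stop splitting once a node would
    # leave fewer than five training points in a leaf (no pruning applied).
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5)
    # Usage: tree.fit(X_train, y_train) on the selected wavelet features,
    # then tree.predict(X_test) for the held-out subject.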
In order to estimate the true misclassification rate of the classification tree models, we used a cross-validation technique. Specifically, we used four-fold cross-validation in which the experimental data were split into four groups corresponding to four subjects. Three subjects were used for training the models, and the remaining subject was used for testing. This process was repeated three more times. The classification results from the four different testing samples were then averaged to obtain the misclassification rates (or cross-validated error rates) of the classification tree models. A kernel is a function K such that for all x_i and x_j ∈ X^p, i, j = 1, 2, …, n,
K(x_i, x_j) = ⟨ φ(x_i), φ(x_j) ⟩   (6)
where φ is a (non-linear) mapping from the input space to an (inner product) feature space. If observation i is fixed in the training sample and observation j is a new input, then the kernel function above can be treated as a new feature defined by observation i, denoted K(x_i, ·). Some popular kernels are the inner product kernel, the polynomial kernel, and the Gaussian (radial basis) kernel. A classification tree model is a recursive partitioning procedure in the feature space. Starting from the root node, at each step a greedy exhaustive search is implemented to find the best splitting rule, such as "Xi
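The kernel-induced construction just described treats each K(x_i, ·) as a new feature and grows the forest in that induced space. A sketch with a Gaussian kernel follows; the kernel choice, its width gamma, and the number of trees are illustrative assumptions, and the exact recursive-partitioning variant of [10] may differ.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics.pairwise import rbf_kernel

    def kernel_induced_rf(X_train, y_train, X_test, gamma=0.1, n_trees=500):
        """Each training point x_i defines a feature K(x_i, .); a random
        forest is then grown on the n_train-dimensional kernel features."""
        K_train = rbf_kernel(X_train, X_train, gamma=gamma)  # K(x_i, x_j)
        K_test = rbf_kernel(X_test, X_train, gamma=gamma)    # K(x_i, .) at test points
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        return forest.fit(K_train, y_train).predict(K_test)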