Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4642
Seong-Whan Lee Stan Z. Li (Eds.)
Advances in Biometrics International Conference, ICB 2007 Seoul, Korea, August 27-29, 2007 Proceedings
Volume Editors Seong-Whan Lee Korea University, Department of Computer Science and Engineering Anam-dong, Seongbuk-ku, Seoul 136-713, Korea E-mail:
[email protected] Stan Z. Li Chinese Academy of Sciences, Institute of Automation Center for Biometrics and Security Research & National Laboratory of Pattern Recognition 95 Zhongguancun Donglu, Beijing 100080, China E-mail:
[email protected]
Library of Congress Control Number: 2007933159
CR Subject Classification (1998): I.5, I.4, K.4.1, K.4.4, K.6.5, J.1
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-74548-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74548-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12114641 06/3180 543210
Preface
Many applications in government, airport, commercial, defense and law enforcement areas have a basic need for automatic authentication of humans, both locally and remotely, on a routine basis. The demand for automatic authentication systems using biometrics, including face, fingerprint, gait, and iris, has been increasing in many aspects of life. The purpose of the 2007 International Conference on Biometrics (ICB 2007) was to provide a platform for researchers, engineers, system architects and designers to report recent advances and exchange ideas in the area of biometrics and related technologies. ICB 2007 received a large number of high-quality research papers. In all, 303 papers were submitted from 29 countries around the world. Of these, 34 papers were accepted for oral presentation and 91 papers were accepted for poster presentation. The program consisted of seven oral sessions, three poster sessions, two tutorial sessions, and four keynote speeches on various topics in biometrics. We would like to thank all the authors who submitted their manuscripts to the conference, and all the members of the Program Committee and reviewers who spent valuable time providing comments on each paper. We would like to thank the conference administrator and secretariat for making the conference successful. We also wish to acknowledge the IEEE, IAPR, Korea Information Science Society, Korea University, Korea University BK21 Software Research Division, Korea Science and Engineering Foundation, Korea University Institute of Computer, Information and Communication, Korea Biometrics Association, Lumidigm Inc., Ministry of Information and Communication Republic of Korea, and Springer for sponsoring and supporting this conference.

August 2007
Seong-Whan Lee Stan Z. Li
Organization
ICB 2007 was organized by the Center for Artificial Vision Research, Korea University.
Executive Committee

General Chair: Seong-Whan Lee (Korea University, Korea)
General Co-chairs: Anil Jain (Michigan State University, USA), Tieniu Tan (Chinese Academy of Sciences, China)
Program Co-chairs: Ruud Bolle (IBM, USA), Josef Kittler (University of Surrey, UK), Stan Li (Chinese Academy of Sciences, China)
Tutorials Chair: Patrick J. Flynn (University of Notre Dame, USA)
Publications Chair: Bon-Woo Hwang (Carnegie Mellon University, USA)
Finance Chair: Hyeran Byun (Yonsei University, Korea)
Sponsorship Chair: Dongsuk Yook (Korea University, Korea)
Registration Chair: Diana Krynski (IEEE, USA)
Program Committee Josef Bigun (Sweden) Frederic Bimbot (France) Mats Blomberg (Sweden) Horst Bunke (Switzerland) Hyeran Byun (Korea) Rama Chellappa (USA) Gerard Chollet (France) Timothy Cootes (UK) Larry Davis (USA) Farzin Deravi (UK) John Daugman (UK) Xiaoqing Ding (China) Julian Fierrez (Spain) Sadaoki Furui (Japan) M. Dolores Garcia-Plaza (Spain) Dominique Genoud (Switzerland) Shaogang Gong (UK) Venu Govindaraju (USA) Steve Gunn (UK) Bernd Heisele (USA)
Kenneth Jonsson (Sweden) Behrooz Kamgar-Parsi (USA) Takeo Kanade (USA) Jaihie Kim (Korea) Naohisa Komatsu (Japan) John Mason (UK) Jiri Matas (Czech Republic) Bruce Millar (Australia) Mark Nixon (UK) Larry O’Gorman (USA) Sharath Pankanti (USA) Jonathon Phillips (USA) Matti Pietikinen (Finland) Ioannis Pitas (Greece) Salil Prabhakar (USA) Ganesh N. Ramaswamy (USA) Nalini Ratha (USA) Marek Rejman-Greene (UK) Gael Richard (France) Arun Ross (USA)
Zhenan Sun (China) Xiaoou Tang (China) Massimo Tistarelli (Italy) Patrick Verlinde (Belgium) Juan Villanueva (Spain)
Yunhong Wang (China) Harry Wechsler (USA) Wei-Yun Yau (Singapore) David Zhang (Hong Kong)
Organizing Committee Yong-Wha Chung (Korea) Hee-Jung Kang (Korea) Jaewoo Kang (Korea) Chang-Su Kim (Korea) Daijin Kim (Korea) Hakil Kim (Korea) Hanseok Ko (Korea)
Heejo Lee (Korea) Chang-Beom Park (Korea) Jeong-Seon Park (Korea) Myung-Cheol Roh (Korea) Bong-Kee Sin (Korea) Sungwon Sohn (Korea) Hee-Deok Yang (Korea)
Additional Reviewers Andrea Abate Mohamed Abdel-Mottaleb Aditya Abhyankar Andy Adler Mohiuddin Ahmad Timo Ahonen Haizhou Ai Jose Alba-Castro Fernando Alonso-Fernandez Meng Ao Babak Nadjar Araabi Arathi Arakala Banafshe Arbab-Zavar Vutipong Areekul Vijayakumar Bhagavatula Manuele Bicego Imed Bouchrika Ahmed Bouridane Christina Braz Ileana Buhan Raffaele Cappelli Modesto Castrill´on-Santana Ee-Chien Chang Jiansheng Chen Weiping Chen Jin Young Choi
Seungjin Choi Rufeng Chu Jonathan Connell Tim Cootes Sarat Dass Reza Derakhshani Jana Dittman Wenbo Dong Bernadette Dorizzi Timothy Faltemier Pedro G´ omez-Vilda Javier Galbally Xiufeng Gao Yongsheng Gao Georgi Gluhchev Berk Gokberk Abdenour Hadid Miroslav Hamouz Asmaa El Hannani Pieter Hartel Jean Hennebert Javier Hernando Sakano Hitoshi Heiko Hoffmann Vincent Hsu Jian Huang
Yonggang Huang Yuan Huaqiang Jens Hube David Hurley Yoshiaki Isobe Jia Jia Kui Jia Andrew Teoh Beng Jin Changlong Jin Alfons Juan Pilsung Kang Tomi Kinnunen Klaus Kollreider Ajay Kumar James T. Kwok Andrea Lagorio J. H. Lai Kenneth Lam Jeremy Lecoeur Joon-Jae Lee Zhen Lei Alex Leung Bangyu Li Yongping Li Zhifeng LI Shengcai Liao Almudena Lindoso Chengjun Liu James Liu Jianyi Liu Rong Liu Wei Liu Xin Liu Zicheng Liu Xiaoguang Lu Sascha M¨ uller Bao Ma Sotiris Malassiotis S´ebastien Marcel Gian Luca Marcialis Ichino Masatsugu Peter McOwan Kieron Messer Krzysztof Mieloch Sinjini Mitra
Pranab Mohanty Don Monro Pavel Mrazek Daigo Muramatsu Vittorio Murino Isao Nakanishi Anoop Namboodiri Loris Nanni Kenneth Nilsson Lawrence O’Gorman Tetsushi Ohki Nick Orlans Carlos Orrite Gang Pan Roberto Paredes Kang Ryoung Park Jason Pelecanos John Pitrelli Norman Poh Xianchao Qiu Ajita Rattani Jose Antonio Rodriguez Yann Rodriguez Fabio Roli Sujoy Roy Mohammad Sadeghi Albert Salah Raul Sanchez-Reillo Michael Schuckers Stephanie Schuckers Caifeng Shan Shiguang Shan Weiguo Sheng Takashi Shinzaki Terence Sim Sridha Sridharan Fei Su Eric Sung Nooritawati Md Tahir Dacheng Tao Daniel Thorpe Jie Tian Kar-Ann Toh Sergey Tulyakov Kaoru Uchida
Andreas Uhl Umut Uludag Sekar V. Mayank Vatsa Raymond Veldhuis Alessandro Verri Claus Vielhauer Hee Lin Wang Jian-Gang Wang Seadrift Wang Yiding Wang Zhuoshi Wei Jing Wen Damon Woodard Xiangqian Wu Wenquan Xu Yong Xu Shuicheng Yan
Xin Yang Dong Yi Lijun Yin Jane You Xiaotong Yuan Khalil Zebbiche Bingjun Zhang Changshui Zhang Chao Zhang Jianguo Zhang Taiping Zhang Yangyang Zhang Guoying Zhao Wei-Shi Zheng Jie Zhou Xiangxin Zhu Xuan Zou
Sponsoring Institutions IEEE Computer Society IEEE Systems, Man and Cybernetics Society International Association for Pattern Recognition Korea Information Science Society Korea Science and Engineering Foundation Korea University Korea University BK21 Software Research Division Korea University Institute of Computer, Information and Communication Korea Biometrics Association Lumidigm Inc. Ministry of Information and Communication Republic of Korea
Table of Contents
Face Recognition Super-Resolved Faces for Improved Face Recognition from Surveillance Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frank Lin, Clinton Fookes, Vinod Chandran, and Sridha Sridharan Face Detection Based on Multi-Block LBP Representation . . . . . . . . . . . . Lun Zhang, Rufeng Chu, Shiming Xiang, Shengcai Liao, and Stan Z. Li
1 11
Color Face Tensor Factorization and Slicing for Illumination-Robust Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong-Deok Kim and Seungjin Choi
19
Robust Real-Time Face Detection Using Face Certainty Map . . . . . . . . . . Bongjin Jun and Daijin Kim
29
Poster I Motion Compensation for Face Recognition Based on Active Differential Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuan Zou, Josef Kittler, and Kieron Messer
39
Face Recognition with Local Gabor Textons . . . . . . . . . . . . . . . . . . . . . . . . . Zhen Lei, Stan Z. Li, Rufeng Chu, and Xiangxin Zhu
49
Speaker Verification with Adaptive Spectral Subband Centroids . . . . . . . . Tomi Kinnunen, Bingjun Zhang, Jia Zhu, and Ye Wang
58
Similarity Rank Correlation for Face Recognition Under Unenrolled Pose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marco K. M¨ uller, Alexander Heinrichs, Andreas H.J. Tewes, Achim Sch¨ afer, and Rolf P. W¨ urtz
67
Feature Correlation Filter for Face Recognition . . . . . . . . . . . . . . . . . . . . . . Xiangxin Zhu, Shengcai Liao, Zhen Lei, Rong Liu, and Stan Z. Li
77
Face Recognition by Discriminant Analysis with Gabor Tensor Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhen Lei, Rufeng Chu, Ran He, Shengcai Liao, and Stan Z. Li
87
Fingerprint Enhancement Based on Discrete Cosine Transform . . . . . . . . Suksan Jirachaweng and Vutipong Areekul
96
Biometric Template Classification: A Case Study in Iris Textures . . . . . . Edara Srinivasa Reddy, Chinnam SubbaRao, and Inampudi Ramesh Babu Protecting Biometric Templates with Image Watermarking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nikos Komninos and Tassos Dimitriou Factorial Hidden Markov Models for Gait Recognition . . . . . . . . . . . . . . . . Changhong Chen, Jimin Liang, Haihong Hu, Licheng Jiao, and Xin Yang
106
114 124
A Robust Fingerprint Matching Approach: Growing and Fusing of Local Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wenquan Xu, Xiaoguang Chen, and Jufu Feng
134
Automatic Facial Pose Determination of 3D Range Data for Face Model and Expression Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaozhou Wei, Peter Longo, and Lijun Yin
144
SVDD-Based Illumination Compensation for Face Recognition . . . . . . . . . Sang-Woong Lee and Seong-Whan Lee
154
Keypoint Identification and Feature-Based 3D Face Recognition . . . . . . . Ajmal Mian, Mohammed Bennamoun, and Robyn Owens
163
Fusion of Near Infrared Face and Iris Biometrics . . . . . . . . . . . . . . . . . . . . . Zhijian Zhang, Rui Wang, Ke Pan, Stan Z. Li, and Peiren Zhang
172
Multi-Eigenspace Learning for Video-Based Face Recognition . . . . . . . . . . Liang Liu, Yunhong Wang, and Tieniu Tan
181
Error-Rate Based Biometrics Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kar-Ann Toh
191
Online Text-Independent Writer Identification Based on Stroke’s Probability Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bangyu Li, Zhenan Sun, and Tieniu Tan
201
Arm Swing Identification Method with Template Update for Long Term Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kenji Matsuo, Fuminori Okumura, Masayuki Hashimoto, Shigeyuki Sakazawa, and Yoshinori Hatori
211
Walker Recognition Without Gait Cycle Estimation . . . . . . . . . . . . . . . . . . Daoliang Tan, Shiqi Yu, Kaiqi Huang, and Tieniu Tan
222
Comparison of Compression Algorithms’ Impact on Iris Recognition Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefan Matschitsch, Martin Tschinder, and Andreas Uhl
232
Standardization of Face Image Sample Quality . . . . . . . . . . . . . . . . . . . . . . . Xiufeng Gao, Stan Z. Li, Rong Liu, and Peiren Zhang
242
Blinking-Based Live Face Detection Using Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin Sun, Gang Pan, Zhaohui Wu, and Shihong Lao
252
Singular Points Analysis in Fingerprints Based on Topological Structure and Orientation Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jie Zhou, Jinwei Gu, and David Zhang
261
Robust 3D Face Recognition from Expression Categorisation . . . . . . . . . . Jamie Cook, Mark Cox, Vinod Chandran, and Sridha Sridharan
271
Fingerprint Recognition Based on Combined Features . . . . . . . . . . . . . . . . Yangyang Zhang, Xin Yang, Qi Su, and Jie Tian
281
MQI Based Face Recognition Under Uneven Illumination . . . . . . . . . . . . . Yaoyao Zhang, Jie Tian, Xiaoguang He, and Xin Yang
290
Learning Kernel Subspace Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bailing Zhang, Hanseok Ko, and Yongsheng Gao
299
A New Approach to Fake Finger Detection Based on Skin Elasticity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jia Jia, Lianhong Cai, Kaifu Zhang, and Dawei Chen
309
An Algorithm for Biometric Authentication Based on the Model of Non–Stationary Random Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vladimir B. Balakirsky, Anahit R. Ghazaryan, and A.J. Han Vinck
319
Identity Verification by Using Handprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hao Ying, Tan Tieniu, Sun Zhenan, and Han Yufei
328
Gait and Signature Recognition Reducing the Effect of Noise on Human Contour in Gait Recognition . . . Shiqi Yu, Daoliang Tan, Kaiqi Huang, and Tieniu Tan
338
Partitioning Gait Cycles Adaptive to Fluctuating Periods and Bad Silhouettes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianyi Liu and Nanning Zheng
347
Repudiation Detection in Handwritten Documents . . . . . . . . . . . . . . . . . . . Sachin Gupta and Anoop M. Namboodiri
356
A New Forgery Scenario Based on Regaining Dynamics of Signature . . . . Jean Hennebert, Renato Loeffel, Andreas Humm, and Rolf Ingold
366
Systems and Applications Curvewise DET Confidence Regions and Pointwise EER Confidence Intervals Using Radial Sweep Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . Michael E. Schuckers, Yordan Minev, and Andy Adler
376
Bayesian Hill-Climbing Attack and Its Application to Signature Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Javier Galbally, Julian Fierrez, and Javier Ortega-Garcia
386
Wolf Attack Probability: A New Security Measure in Biometric Authentication Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masashi Une, Akira Otsuka, and Hideki Imai
396
Evaluating the Biometric Sample Quality of Handwritten Signatures . . . Sascha M¨ uller and Olaf Henniger
407
Outdoor Face Recognition Using Enhanced Near Infrared Imaging . . . . . Dong Yi, Rong Liu, RuFeng Chu, Rui Wang, Dong Liu, and Stan Z. Li
415
Latent Identity Variables: Biometric Matching Without Explicit Identity Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simon J.D. Prince, Jania Aghajanian, Umar Mohammed, and Maneesh Sahani
424
Poster II 2ˆN Discretisation of BioPhasor in Cancellable Biometrics . . . . . . . . . . . . Andrew Beng Jin Teoh, Kar-Ann Toh, and Wai Kuan Yip
435
Probabilistic Random Projections and Speaker Verification . . . . . . . . . . . . Chong Lee Ying and Andrew Teoh Beng Jin
445
On Improving Interoperability of Fingerprint Recognition Using Resolution Compensation Based on Sensor Evaluation . . . . . . . . . . . . . . . . Jihyeon Jang, Stephen J. Elliott, and Hakil Kim
455
Demographic Classification with Local Binary Patterns . . . . . . . . . . . . . . . Zhiguang Yang and Haizhou Ai
464
Distance Measures for Gabor Jets-Based Face Authentication: A Comparative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Gonz´ alez-Jim´enez, Manuele Bicego, J.W.H. Tangelder, B.A.M Schouten, Onkar Ambekar, Jos´e Luis Alba-Castro, Enrico Grosso, and Massimo Tistarelli Fingerprint Matching with an Evolutionary Approach . . . . . . . . . . . . . . . . W. Sheng, G. Howells, K. Harmer, M.C. Fairhurst, and F. Deravi
474
484
Stability Analysis of Constrained Nonlinear Phase Portrait Models of Fingerprint Orientation Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Li, Wei-Yun Yau, Jiangang Wang, and Wee Ser
493
Effectiveness of Pen Pressure, Azimuth, and Altitude Features for Online Signature Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daigo Muramatsu and Takashi Matsumoto
503
Tracking and Recognition of Multiple Faces at Distances . . . . . . . . . . . . . . Rong Liu, Xiufeng Gao, Rufeng Chu, Xiangxin Zhu, and Stan Z. Li
513
Face Matching Between Near Infrared and Visible Light Images . . . . . . . . Dong Yi, Rong Liu, RuFeng Chu, Zhen Lei, and Stan Z. Li
523
User Classification for Keystroke Dynamics Authentication . . . . . . . . . . . . Sylvain Hocquet, Jean-Yves Ramel, and Hubert Cardot
531
Statistical Texture Analysis-Based Approach for Fake Iris Detection Using Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaofu He, Shujuan An, and Pengfei Shi
540
A Novel Null Space-Based Kernel Discriminant Analysis for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tuo Zhao, Zhizheng Liang, David Zhang, and Yahui Liu
547
Changeable Face Representations Suitable for Human Recognition . . . . . Hyunggu Lee, Chulhan Lee, Jeung-Yoon Choi, Jongsun Kim, and Jaihie Kim
557
“3D Face”: Biometric Template Protection for 3D Face Recognition . . . . E.J.C. Kelkboom, B. G¨ okberk, T.A.M. Kevenaar, A.H.M. Akkermans, and M. van der Veen
566
Quantitative Evaluation of Normalization Techniques of Matching Scores in Multimodal Biometric Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . Y.N. Singh and P. Gupta
574
Keystroke Dynamics in a General Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . Rajkumar Janakiraman and Terence Sim
584
A New Approach to Signature-Based Authentication . . . . . . . . . . . . . . . . . Georgi Gluhchev, Mladen Savov, Ognian Boumbarov, and Diana Vasileva
594
Biometric Fuzzy Extractors Made Practical: A Proposal Based on FingerCodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Val´erie Viet Triem Tong, Herv´e Sibert, J´er´emy Lecœur, and Marc Girault
604
On the Use of Log-Likelihood Ratio Based Model-Specific Score Normalisation in Biometric Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . Norman Poh and Josef Kittler Predicting Biometric Authentication System Performance Across Different Application Conditions: A Bootstrap Enhanced Parametric Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Norman Poh and Josef Kittler
614
625
Selection of Distinguish Points for Class Distribution Preserving Transform for Biometric Template Protection . . . . . . . . . . . . . . . . . . . . . . . . Yi C. Feng and Pong C. Yuen
636
Minimizing Spatial Deformation Method for Online Signature Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Li, Kuanquan Wang, and David Zhang
646
Pan-Tilt-Zoom Based Iris Image Capturing System for Unconstrained User Environments at a Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sowon Yoon, Kwanghyuk Bae, Kang Ryoung Park, and Jaihie Kim
653
Fingerprint Matching with Minutiae Quality Score . . . . . . . . . . . . . . . . . . . Jiansheng Chen, Fai Chan, and Yiu-Sang Moon
663
Uniprojective Features for Gait Recognition . . . . . . . . . . . . . . . . . . . . . . . . . Daoliang Tan, Kaiqi Huang, Shiqi Yu, and Tieniu Tan
673
Cascade MR-ASM for Locating Facial Feature Points . . . . . . . . . . . . . . . . . Sicong Zhang, Lifang Wu, and Ying Wang
683
Reconstructing a Whole Face Image from a Partially Damaged or Occluded Image by Multiple Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bon-Woo Hwang and Seong-Whan Lee
692
Robust Hiding of Fingerprint-Biometric Data into Audio Signals . . . . . . . Muhammad Khurram Khan, Ling Xie, and Jiashu Zhang
702
Correlation-Based Fingerprint Matching with Orientation Field Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Almudena Lindoso, Luis Entrena, Judith Liu-Jimenez, and Enrique San Millan Vitality Detection from Fingerprint Images: A Critical Survey . . . . . . . . . Pietro Coli, Gian Luca Marcialis, and Fabio Roli
713
722
Fingerprint Recognition Optimum Detection of Multiplicative-Multibit Watermarking for Fingerprint Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Khalil Zebbiche, Fouad Khelifi, and Ahmed Bouridane
732
Fake Finger Detection Based on Thin-Plate Spline Distortion Model . . . . Yangyang Zhang, Jie Tian, Xinjian Chen, Xin Yang, and Peng Shi
742
Robust Extraction of Secret Bits from Minutiae . . . . . . . . . . . . . . . . . . . . . . Ee-Chien Chang and Sujoy Roy
750
Fuzzy Extractors for Minutiae-Based Fingerprint Authentication . . . . . . . Arathi Arakala, Jason Jeffers, and K.J. Horadam
760
Iris Recognition Coarse Iris Classification by Learned Visual Dictionary . . . . . . . . . . . . . . . Xianchao Qiu, Zhenan Sun, and Tieniu Tan
770
Nonlinear Iris Deformation Correction Based on Gaussian Model . . . . . . . Zhuoshi Wei, Tieniu Tan, and Zhenan Sun
780
Shape Analysis of Stroma for Iris Recognition . . . . . . . . . . . . . . . . . . . . . . . S. Mahdi Hosseini, Babak N. Araabi, and Hamid Soltanian-Zadeh
790
Biometric Key Binding: Fuzzy Vault Based on Iris Images . . . . . . . . . . . . . Youn Joo Lee, Kwanghyuk Bae, Sung Joo Lee, Kang Ryoung Park, and Jaihie Kim
800
Pattern Analysis and Learning Multi-scale Local Binary Pattern Histograms for Face Recognition . . . . . Chi-Ho Chan, Josef Kittler, and Kieron Messer
809
Histogram Equalization in SVM Multimodal Person Verification . . . . . . . Mireia Farr´ us, Pascual Ejarque, Andrey Temko, and Javier Hernando
819
Learning Multi-scale Block Local Binary Patterns for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shengcai Liao, Xiangxin Zhu, Zhen Lei, Lun Zhang, and Stan Z. Li
828
Horizontal and Vertical 2DPCA Based Discriminant Analysis for Face Verification Using the FRGC Version 2 Database . . . . . . . . . . . . . . . . . . . . Jian Yang and Chengjun Liu
838
Video-Based Face Tracking and Recognition on Updating Twin GMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Li Jiangwei and Wang Yunhong
848
Poster III Fast Algorithm for Iris Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Mazur
858
Pyramid Based Interpolation for Face-Video Playback in Audio Visual Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dereje Teferi and Josef Bigun
868
Face Authentication with Salient Local Features and Static Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guillaume Heusch and S´ebastien Marcel
878
Fake Finger Detection by Finger Color Change Analysis . . . . . . . . . . . . . . Wei-Yun Yau, Hoang-Thanh Tran, Eam-Khwang Teoh, and Jian-Gang Wang
888
Feeling Is Believing: A Secure Template Exchange Protocol . . . . . . . . . . . Ileana Buhan, Jeroen Doumen, Pieter Hartel, and Raymond Veldhuis
897
SVM-Based Selection of Colour Space Experts for Face Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohammad T. Sadeghi, Samaneh Khoshrou, and Josef Kittler
907
An Efficient Iris Coding Based on Gauss-Laguerre Wavelets . . . . . . . . . . . H. Ahmadi, A. Pousaberi, A. Azizzadeh, and M. Kamarei
917
Hardening Fingerprint Fuzzy Vault Using Password . . . . . . . . . . . . . . . . . . Karthik Nandakumar, Abhishek Nagar, and Anil K. Jain
927
GPU Accelerated 3D Face Registration / Recognition . . . . . . . . . . . . . . . . Andrea Francesco Abate, Michele Nappi, Stefano Ricciardi, and Gabriele Sabatino
938
Frontal Face Synthesis Based on Multiple Pose-Variant Images for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Congcong Li, Guangda Su, Yan Shang, and Yingchun Li
948
Optimal Decision Fusion for a Face Verification System . . . . . . . . . . . . . . . Qian Tao and Raymond Veldhuis
958
Robust 3D Head Tracking and Its Applications . . . . . . . . . . . . . . . . . . . . . . Wooju Ryu and Daijin Kim
968
Multiple Faces Tracking Using Motion Prediction and IPCA in Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sukwon Choi and Daijin Kim
978
An Improved Iris Recognition System Using Feature Extraction Based on Wavelet Maxima Moment Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Makram Nabti and Ahmed Bouridane
988
Color-Based Iris Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Emine Krichen, Mohamed Chenafa, Sonia Garcia-Salicetti, and Bernadette Dorizzi
997
Real-Time Face Detection and Recognition on LEGO Mindstorms NXT Robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1006 Tae-Hoon Lee Speaker and Digit Recognition by Audio-Visual Lip Biometrics . . . . . . . . 1016 Maycel Isaac Faraj and Josef Bigun Modelling Combined Handwriting and Speech Modalities . . . . . . . . . . . . . 1025 Andreas Humm, Jean Hennebert, and Rolf Ingold A Palmprint Cryptosystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035 Xiangqian Wu, David Zhang, and Kuanquan Wang On Some Performance Indices for Biometric Identification System . . . . . . 1043 Jay Bhatnagar and Ajay Kumar Automatic Online Signature Verification Using HMMs with User-Dependent Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057 J.M. Pascual-Gaspar and V. Carde˜ noso-Payo A Complete Fisher Discriminant Analysis for Based Image Matrix and Its Application to Face Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1067 R.M. Mutelo, W.L. Woo, and S.S. Dlay SVM Speaker Verification Using Session Variability Modelling and GMM Supervectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1077 M. McLaren, R. Vogt, and S. Sridharan 3D Model-Based Face Recognition in Video . . . . . . . . . . . . . . . . . . . . . . . . . 1085 Unsang Park and Anil K. Jain Robust Point-Based Feature Fingerprint Segmentation Algorithm . . . . . . 1095 Chaohong Wu, Sergey Tulyakov, and Venu Govindaraju Automatic Fingerprints Image Generation Using Evolutionary Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1104 Ung-Keun Cho, Jin-Hyuk Hong, and Sung-Bae Cho Audio Visual Person Authentication by Multiple Nearest Neighbor Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1114 Amitava Das Improving Classification with Class-Independent Quality Measures: Q-stack in Face Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1124 Krzysztof Kryszczuk and Andrzej Drygajlo Biometric Hashing Based on Genetic Selection and Its Application to On-Line Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134 Manuel R. Freire, Julian Fierrez, Javier Galbally, and Javier Ortega-Garcia
Biometrics Based on Multispectral Skin Texture . . . . . . . . . . . . . . . . . . . . . 1144 Robert K. Rowe
Other Modalities Application of New Qualitative Voicing Time-Frequency Features for Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1154 Nidhal Ben Aloui, Herv´e Glotin, and Patrick Hebrard Palmprint Recognition Based on Directional Features and Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1164 Yufei Han, Tieniu Tan, and Zhenan Sun Tongue-Print: A Novel Biometrics Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . 1174 David Zhang, Zhi Liu, Jing-qi Yan, and Pengfei Shi Embedded Palmprint Recognition System on Mobile Devices . . . . . . . . . . 1184 Yufei Han, Tieniu Tan, Zhenan Sun, and Ying Hao Template Co-update in Multimodal Biometric Systems . . . . . . . . . . . . . . . 1194 Fabio Roli, Luca Didaci, and Gian Luca Marcialis Continual Retraining of Keystroke Dynamics Based Authenticator . . . . . 1203 Pilsung Kang, Seong-seob Hwang, and Sungzoon Cho Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1213
Super-Resolved Faces for Improved Face Recognition from Surveillance Video Frank Lin, Clinton Fookes, Vinod Chandran, and Sridha Sridharan Image and Video Research Laboratory Queensland University of Technology GPO Box 2434 Brisbane, QLD 4001 Australia {fc.lin,c.fookes,v.chandran,s.sridharan}@qut.edu.au
Abstract. Characteristics of surveillance video generally include low resolution and poor quality due to environmental, storage and processing limitations. It is extremely difficult for computers and human operators to identify individuals from these videos. To overcome this problem, super-resolution can be used in conjunction with an automated face recognition system to enhance the spatial resolution of video frames containing the subject and narrow down the number of manual verifications performed by the human operator by presenting a list of the most likely candidates from the database. As the super-resolution reconstruction process is ill-posed, visual artifacts are often generated as a result. These artifacts can be visually distracting to humans and/or affect machine recognition algorithms. While it is intuitive that higher resolution should lead to improved recognition accuracy, the effects of super-resolution and such artifacts on face recognition performance have not been systematically studied. This paper aims to address this gap while illustrating that super-resolution allows more accurate identification of individuals from low-resolution surveillance footage. The proposed optical flow-based super-resolution method is benchmarked against Baker et al.'s hallucination and Schultz et al.'s super-resolution techniques on images from the Terrascope and XM2VTS databases. Ground truth and interpolated images were also tested to provide a baseline for comparison. Results show that a suitable super-resolution system can improve the discriminability of surveillance video and enhance face recognition accuracy. The experiments also show that Schultz et al.'s method fails when dealing with surveillance footage due to its assumption of rigid objects in the scene. The hallucination and optical flow-based methods performed comparably, with the optical flow-based method producing fewer of the visually distracting artifacts that interfere with human recognition.
Keywords: super-resolution, face recognition, surveillance.
1 Introduction
Faces captured from surveillance footage are usually of poor resolution as they typically occupy a small portion of the camera's field of view. It is extremely challenging for a computer or even a human operator to accurately identify an
individual from a database in such a situation. In addition, a human operator is usually responsible for monitoring footage from several cameras simultaneously, increasing the chance of human error. One solution to the problem would be to complement our natural ability to recognise faces with the computers' power to process large amounts of video data. This paper presents an intelligent surveillance system aided by optical flow-based super-resolution and automatic face recognition. The system operates in a semi-automatic manner where it enhances the surveillance video through super-resolution and displays a list of likely candidates from a database together with the enhanced image to a human operator who then makes the final verification. Super-resolution is aimed at recovering high-frequency detail lost through aliasing in the image acquisition process. As the reconstruction process is ill-posed due to the large number of variables, visual artifacts are usually generated as a result. These artifacts can be visually distracting to humans and/or affect machine recognition algorithms. Although it has been shown that face recognition accuracy is dependent on image resolution [1, 2, 3] and it is known that super-resolution improves image fidelity, the effects of super-resolution on recognition performance have not been systematically studied. This paper aims to address this gap while illustrating that super-resolution allows more accurate identification of individuals from low-resolution surveillance footage. Experiments were conducted to compare the performance of the proposed optical flow-based super-resolution system [12] against two existing methods – a face-specific recognition-based method [10] and a reconstruction-based algorithm that supports independently moving rigid objects [14]. Images from the Terrascope surveillance database [4] were used to illustrate the reconstruction performance of the tested methods and demonstrate the importance of accurate registration in super-resolution. Face identification performance was tested on an Eigenface [5] and Elastic Bunch Graph Matching (EBGM) [6] system using images from the XM2VTS database [7]. The Eigenface method is a baseline holistic system that new methods are usually benchmarked against, while EBGM is a newer technique that is less sensitive to pose and lighting changes. Traditional interpolation methods were also tested for comparison. The outline of the paper is as follows. Section 2 provides background information on super-resolution, the inherent difficulties associated with surveillance footage, as well as an overview of the super-resolution algorithms tested. Experimental methodology and results are presented in Section 3 and concluding remarks are discussed in Section 4.
2 Super-Resolution
Super-resolution image reconstruction is the process of combining low-resolution (LR) images into one high-resolution image. These low-resolution images are aliased and related to each other through sub-pixel shifts; essentially representing different snapshots of the same scene which carry complementary information
[8]. The challenge is to find effective and computationally efficient methods of combining two or more such images.

2.1 Observation Model
The relationship between the ideal high-resolution (HR) image and the observed LR images can be described by the following observation model,

$$ y_k = D B_k M_k x + n_k, \qquad (1) $$

where $y_k$ denotes the $k = 1 \ldots p$ LR images, $D$ is a subsampling matrix, $B_k$ is the blur matrix, $M_k$ is the warp matrix, $x$ is the ideal HR image of the scene which is being recovered, and $n_k$ is the additive noise that corrupts the image. $D$ and $B_k$ simulate the averaging process performed by the camera's CCD sensor, while $M_k$ can be modelled by anything from a simple parametric transformation to motion flow fields. Essentially, given multiple $y_k$'s, $x$ can be recovered through an inversion process. The problem is usually ill-posed, however, due to the large number of pixel values to be estimated from a small number of known pixels. Generally, reconstruction of a super-resolved image is broken up into three stages – motion compensation (registration), interpolation, and blur and noise removal (restoration) [8].
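As an illustration (not part of the original paper), the following Python sketch simulates this observation model for a single frame: a sub-pixel warp stands in for $M_k$, a Gaussian blur for $B_k$, decimation for $D$, and additive Gaussian noise for $n_k$. The kernel width, shift values and noise level are arbitrary assumptions chosen only to keep the example concrete.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def observe(x_hr, dy, dx, blur_sigma=1.0, factor=2, noise_std=2.0, rng=None):
    """Simulate y_k = D B_k M_k x + n_k for one low-resolution frame.

    x_hr       : 2-D array, the ideal high-resolution scene x
    dy, dx     : sub-pixel translation applied by the warp M_k
    blur_sigma : width of the Gaussian kernel standing in for the blur B_k
    factor     : decimation factor of the subsampling operator D
    noise_std  : standard deviation of the additive noise n_k
    """
    rng = np.random.default_rng() if rng is None else rng
    warped = shift(x_hr, (dy, dx), order=1, mode='nearest')      # M_k x
    blurred = gaussian_filter(warped, blur_sigma)                 # B_k M_k x
    decimated = blurred[::factor, ::factor]                       # D B_k M_k x
    return decimated + rng.normal(0.0, noise_std, decimated.shape)  # + n_k

# Example: generate p = 5 low-resolution observations of the same scene.
x = np.random.default_rng(0).uniform(0, 255, (144, 180))
shifts = [(0, 0), (0.5, 0.2), (-0.3, 0.7), (0.1, -0.6), (0.8, 0.4)]
frames = [observe(x, dy, dx) for dy, dx in shifts]
```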
2.2 Approaches to Super-Resolution
Super-resolution techniques can be classed into two categories – reconstruction- and recognition-based. Most super-resolution techniques are reconstruction-based, dating back to Tsai and Huang's work in 1984 [9]. These methods operate directly with the image pixel intensities following the principles of Equation 1 and can super-resolve any image sequence provided the motion between observations can be modelled. Their useful magnification factors are usually low, however, in that the super-resolved image becomes too smooth or blurred when the scale is chosen to be more than 4 [10]. Recognition-based methods approach the problem differently by learning features of the low-resolution input images and synthesising the corresponding high-resolution output [10]. Training is performed by looking at high-resolution and downsampled versions of sample image patches. The reconstruction process involves looking at a patch of pixels in the low-resolution input and finding the closest matching low-resolution patch in the training set, then replacing that patch with the corresponding high-resolution patch. An advantage of these methods is that only one input image is required. Although they output images with sharp edges, visually distracting artifacts are often produced as a by-product.
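A minimal sketch of the patch-replacement idea used by recognition-based methods is shown below. It is illustrative only and does not reproduce Baker et al.'s hallucination algorithm, which couples the patch matching with a learned face prior; the patch size, scale factor and distance measure are assumptions.

```python
import numpy as np

def hallucinate(lr_image, lr_patches, hr_patches, patch=3, scale=2):
    """Replace each low-resolution patch by the high-resolution patch whose
    low-resolution counterpart in the training set is closest in intensity.

    lr_patches : (n, patch*patch) training patches from downsampled images
    hr_patches : (n, patch*scale, patch*scale) corresponding HR patches
    """
    h, w = lr_image.shape
    out = np.zeros((h * scale, w * scale))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            query = lr_image[i:i + patch, j:j + patch].ravel()
            # nearest low-resolution training patch (sum of squared differences)
            best = np.argmin(((lr_patches - query) ** 2).sum(axis=1))
            out[i*scale:(i+patch)*scale, j*scale:(j+patch)*scale] = hr_patches[best]
    # border pixels not covered by a full patch are left at zero in this sketch
    return out
```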
2.3 The Problem with Human Faces in Surveillance Video
As super-resolution is inherently an ill-posed problem, most methods operate within a constrained environment by assuming that the objects are static and only modeling global parametric motion such as translations/rotations, affine
and projective transformation between frames. While they work well with static scenes, performance degrades severely when applied to human faces in surveillance video, as human faces are non-planar, non-rigid, non-Lambertian, and subject to self-occlusion [11]. Optical flow methods can be used to overcome the non-planarity and non-rigidity of the face by recovering a dense flow field to describe local deformations, while the remaining two problems need to be addressed through robust estimation methods.
2.4 Systems Tested
Three super-resolution methods have been included in this set of experiments.

Lin et al. – The proposed system is a reconstruction-based method [12] that uses a robust optical flow method developed by Black et al. [13] to register the local motion between frames. Optical flow techniques operate on the concept of constant intensity, meaning that although the location of a point may change over time, it will always be observed with the same intensity. They also assume that neighbouring pixels in the image are likely to belong to the same surface, resulting in a smoothness constraint that ensures the motion of neighbouring pixels varies smoothly. Most optical flow algorithms break down when these two assumptions are not satisfied in practice. This occurs when motion boundaries, shadows and specular reflections are present. The robust optical flow method used here addresses these two constraint violations through a robust estimation framework, and a graduated non-convexity algorithm is used to recover the optical flow and motion discontinuities.

Schultz et al. – Schultz et al.'s [14] system is a reconstruction-based system capable of handling independently moving objects. However, each object is assumed to be rigid. The system is expected to perform very poorly when applied to surveillance footage, where the subjects' faces not only move around freely, but also change in orientation and shape as they turn around and change facial expressions. The system was included in this set of experiments to highlight the importance of accurate image registration, and to show that more flexible motion models like optical flow are required to obtain good results with surveillance video.

Baker et al. – The hallucination algorithm developed by Baker et al. [10] is a face-specific recognition-based method. The system is trained using full frontal face images and hence the super-resolved images are generated with a face-specific prior. The super-resolved output of the system always contains an outline of a frontal face even when the input images contain none, hence the term hallucination. The method works well if the input image is precisely aligned, as shown in [10]. However, when applied to faces that are not full-frontal pose normalised, distracting visual artifacts are expected to be produced and the appearance of the face may even change completely.
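As an aside (not from the original paper), the following sketch outlines the registration-and-fusion step that a reconstruction-based method of this kind performs. It assumes that a dense flow field has already been estimated for each frame by some robust optical flow algorithm, and it fuses the motion-compensated frames by simple averaging on the high-resolution grid, which is only a crude stand-in for the regularised inversion of Eq. (1).

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def fuse_with_flow(frames, flows, scale=2):
    """Warp each low-resolution frame onto the reference using its flow field
    and average the warped frames on the high-resolution grid.

    frames : list of (h, w) low-resolution images; frames[0] is the reference
    flows  : list of (2, h, w) flow fields mapping reference pixels into each
             frame (the reference frame's flow is all zeros)
    """
    h, w = frames[0].shape
    ys, xs = np.mgrid[0:h*scale, 0:w*scale] / float(scale)   # HR grid in LR coordinates
    acc = np.zeros((h * scale, w * scale))
    for frame, flow in zip(frames, flows):
        fy = zoom(flow[0], scale, order=1)   # upsample the flow to the HR grid
        fx = zoom(flow[1], scale, order=1)
        coords = np.stack([ys + fy, xs + fx])
        acc += map_coordinates(frame, coords, order=1, mode='nearest')
    return acc / len(frames)
```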
3 Experimental Results
Videos from the Terrascope database were used to investigate whether the super-resolution methods were applicable to surveillance footage. The database consists of videos captured by surveillance cameras placed in an office environment. Due to the database containing only twelve subjects, the speech sequences from the XM2VTS database were used for the face recognition experiments to obtain more statistically significant results. The XM2VTS database is a multi-modal (speech and video) database created to facilitate testing of multi-modal speech recognition systems. It contains 295 subjects recorded over four sessions in four months. As the speech sequences contain only frontal faces, they represent the situation where the face detector has found a frontal face suitable for recognition whilst scanning through surveillance footage. The experiments were conducted to simulate a production environment, with no manual human intervention required.
3.1 Preparation
The Terrascope video sequences were captured in colour at 640×480 pixels (px) at 30 frames/sec. These were converted to grayscale without any downsampling before processing since they accurately reflect real-world surveillance footage. The original XM2VTS videos were captured in colour at a resolution of 720×576px with the subject sitting close to and facing the camera, resulting in very high-resolution faces. Hence these frames needed to be downsampled first to simulate surveillance conditions more closely. The images were resized and converted to grayscale to form uncompressed ground-truth high-resolution images at three different resolutions – 180×144px, 120×96px and 90×72px. These images were then downsampled by a factor of two through blurring and decimation to simulate the low-resolution images, which were then used as the input for the super-resolution and interpolation stages. To super-resolve using Schultz et al. and Lin et al.'s methods, the respective algorithms were applied to a moving group of five frames, with the third frame being the reference. Five frames were chosen because it was a good tradeoff between reconstruction quality and computation time [14]. Baker et al.'s method was applied using a single frame – the reference frame for the other two super-resolution methods. To compare the performance of the super-resolution algorithms with interpolation methods, upsampled images were also generated for the reference frame of each 5-frame sequence using bilinear and cubic spline interpolation. For the face recognition experiment, an object detector [15] trained using frontal faces was applied to each of the enhanced images individually. Each image was then segmented and normalised. The CSU Face Identification Evaluation system [16] was then used to evaluate the recognition performance of the super-resolved, interpolated and ground-truth images. Frontal face images from the Face Recognition Grand Challenge (FRGC) [17] Fall2003 and Spring2004 datasets were used to train the facespace for the Eigenface system. A range
(10–500) of values for the Eigenvectors retained were tested. The normalised images from the XM2VTS database were then projected into the facespace and the distance to the enrolment images computed. Both Euclidean (EUC) and Mahalanobis Cosine (MCOS) distance metrics were tested. For the EBGM system, the Gabor jets used to detect the facial features were trained using 70 hand-marked images from the FERET database [18]. The predictive step (PS) and magnitude (MAG) distance metrics were used.
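For illustration only, the detection-and-normalisation step described above might look like the following sketch; the OpenCV frontal-face cascade, the crop size and the histogram equalisation are assumptions rather than the authors' exact settings for the detector of [15].

```python
import cv2

def crop_and_normalise(gray, cascade_path='haarcascade_frontalface_default.xml',
                       out_size=(64, 64)):
    """Detect the largest frontal face in a grayscale frame, crop it and
    normalise its size and intensity range for the recognition stage."""
    detector = cv2.CascadeClassifier(cascade_path)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                        # no frontal face found
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])     # keep the largest detection
    face = cv2.resize(gray[y:y + h, x:x + w], out_size)    # geometric normalisation
    return cv2.equalizeHist(face)                          # simple illumination normalisation
```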
3.2 Results
Figure 1 shows selected enhanced images from the Terrascope database. As expected, all super-resolution algorithms produced sharper images than the interpolation methods. However, the rigid-object assumption of Schultz et al.'s method has resulted in a grid-like noise pattern. The hallucinated face looks reasonably sharp and clean but the subjects take on a different appearance. Lin et al.'s method shows some sharpening noise but it is the most visually correct and the most suitable for human inspection.
Fig. 1. Comparison between enhanced images. (a) bilinear interpolation, (b) cubic spline interpolation, (c) Schultz et al., (d) Baker et al., (e) Lin et al.
Figure 2 contains selected enhanced images from the XM2VTS database at the three resolutions. Schultz et al.'s method no longer generates the grid pattern noise due to the XM2VTS speech sequences containing only frontal faces, making the faces more or less rigid. Baker et al.'s hallucination algorithm did not fare as well, as it is quite sensitive to misalignment of the low-resolution face. While the hallucinated faces looked sharper than those generated by other methods, the faces take on a different appearance and distracting artifacts are present upon closer inspection. Table 1 presents the face recognition rates (ranks 1 and 10) for all combinations of face recognition algorithm, distance metric, resolution and image enhancement method. The recognition rate for a given rank N is the probability
Fig. 2. Comparison between enhanced images. First row 90×72, second row 120×96, third row 180×144. (a) bilinear interpolation, (b) cubic spline interpolation, (c) Schultz et al., (d) Baker et al., (e) Lin et al., (f) ground-truth.
that the true subject is identified in the top N matches returned by the system. For example, by examining the last cell in the bottom right-hand corner of the table, it can be seen that when testing the ground-truth images on the EBGM method with the magnitude distance metric, the probability of returning the correct subject is 72.6%, increasing to 88.8% if the list is expanded from 1 to 10. The Eigenface results given are for 250 retained Eigenvectors, as this gave the best overall recognition performance.

Table 1. Recognition rates for the two face recognition methods at 90×72px, 120×96px and 180×144px (all entries are rank 1 / rank 10). Values in bold in the original table indicate the best performing method (excluding ground-truth).

                    Eigenface EUC    Eigenface MCOS   EBGM PS          EBGM MAG
90×72px
  Bilinear          30.3 / 57.1%     34.6 / 62.2%     35.8 / 66.4%     50.2 / 78.1%
  Cubic spline      30.1 / 57.3%     34.8 / 61.2%     36.3 / 66.6%     51.3 / 77.8%
  Schultz et al.    31.0 / 59.2%     35.4 / 63.6%     36.4 / 67.3%     53.4 / 79.0%
  Baker et al.      35.2 / 66.1%     32.3 / 64.8%     51.7 / 77.0%     61.4 / 87.3%
  Lin et al.        37.2 / 64.6%     41.0 / 69.5%     45.5 / 73.5%     60.6 / 84.6%
  Ground-truth      40.1 / 67.2%     43.2 / 71.4%     53.9 / 79.5%     66.3 / 87.5%
120×96px
  Bilinear          40.2 / 68.6%     46.5 / 72.7%     49.2 / 73.9%     56.6 / 80.9%
  Cubic spline      40.7 / 69.3%     47.6 / 72.8%     49.2 / 73.8%     57.2 / 80.9%
  Schultz et al.    41.0 / 69.2%     46.4 / 72.1%     49.8 / 73.3%     56.9 / 81.2%
  Baker et al.      42.3 / 68.3%     50.6 / 75.2%     57.8 / 77.8%     60.4 / 81.6%
  Lin et al.        47.3 / 73.3%     51.7 / 75.2%     52.8 / 73.8%     63.8 / 85.2%
  Ground-truth      49.2 / 76.5%     49.9 / 74.5%     55.2 / 74.3%     67.4 / 87.6%
180×144px
  Bilinear          49.1 / 73.8%     58.3 / 80.1%     57.2 / 74.9%     65.7 / 85.2%
  Cubic spline      50.2 / 75.0%     59.0 / 80.4%     57.7 / 74.3%     66.3 / 85.7%
  Schultz et al.    49.9 / 75.0%     59.5 / 79.5%     58.8 / 75.2%     67.7 / 85.7%
  Baker et al.      45.6 / 72.5%     52.7 / 75.5%     66.3 / 83.4%     67.1 / 84.9%
  Lin et al.        53.4 / 76.9%     59.5 / 79.4%     60.1 / 75.9%     70.6 / 87.6%
  Ground-truth      52.9 / 77.3%     58.0 / 77.7%     62.9 / 78.4%     72.6 / 88.8%

Schultz et al.'s method in general does not improve recognition performance over simple interpolation techniques, most likely due to its inability to handle non-rigid objects. The optical flow-based method worked very well as expected, since it accurately registers the motion between frames and produces the most visually appealing images. Once again this highlights the importance of accurate registration. The hallucinated images performed surprisingly well despite the presence of severe artifacts. This seems to suggest that the face recognition methods tested are not sensitive to the type of visual artifacts generated by this particular algorithm. The important thing to note here is that while hallucination works well to improve machine recognition, the severe visual artifacts make it less desirable for the proposed application, where a human operator makes the final verification. For the two higher resolutions, the ground-truth and super-resolved images actually lose the lead to interpolated ones in some instances. This can be attributed to the Eigenface and EBGM methods being quite robust to downsampling and to the downsampling process actually smoothing out some illumination variations and noise. The authors obtained similar results, where performance improved by smoothing the images when the resolution was sufficient [19]. This suggests that higher resolution is not necessarily better beyond a certain limit and can actually introduce unwanted noise depending on the face recognition algorithm used.
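For reference, rank-N identification rates of the kind reported in Table 1 can be computed from a probe-versus-gallery distance matrix as in the sketch below; the function is illustrative and is not part of the CSU evaluation toolkit.

```python
import numpy as np

def rank_n_rate(distances, probe_ids, gallery_ids, n):
    """Fraction of probes whose true identity appears among the n gallery
    entries with the smallest distance.

    distances   : (num_probes, num_gallery) distance matrix (NumPy array)
    probe_ids   : (num_probes,) identity label of each probe image
    gallery_ids : (num_gallery,) identity label of each enrolled image
    """
    order = np.argsort(distances, axis=1)[:, :n]           # n closest gallery entries
    hits = [probe_ids[i] in gallery_ids[order[i]] for i in range(len(probe_ids))]
    return float(np.mean(hits))

# e.g. rank_n_rate(d, p, g, 1) and rank_n_rate(d, p, g, 10) would give the two
# figures quoted for each method in Table 1.
```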
4 Conclusion
This paper has presented a simple yet effective way to assist a human operator in identifying a subject captured on video from a database by intelligently narrowing down the list of likely candidates and enhancing the face of the subject. Visual artifacts are often generated due to the super-resolution reconstruction process being ill-posed. These artifacts can be visually distracting to humans and/or affect machine recognition algorithms. As the rank 1 recognition rates are still likely to be poor despite the improvement provided by super-resolution, a fully automated recognition system is currently impractical. To increase accuracy to a usable level, the surveillance system will need to operate in a semi-automated manner by generating a list of top machine matches for subsequent human
recognition. Therefore it is important for the enhanced images to be visually pleasing and not contain excessively distracting artifacts. The proposed optical flow-based super-resolution method has been shown to be superior when compared against two other existing algorithms in terms of visual appearance and face recognition performance on an Eigenface and EBGM system. The system's performance was the most consistent, resulting in visually pleasing images and recognition rates comparable to the hallucination method. Baker et al.'s hallucination algorithm achieves good recognition performance despite the distracting artifacts it generates, which stem from its sensitivity to misalignment of the input images, as often occurs in an automated environment. Schultz et al.'s method has been found to be unsuitable for application to surveillance footage due to its object-rigidity constraint. Its performance was no better than interpolation in many cases, highlighting the importance of accurate registration.
References

1. Gunturk, B., Batur, A., Altunbasak, Y., Hayes III, M., Mersereau, R.: Eigenface-domain super-resolution for face recognition. IEEE Transactions on Image Processing 12(5), 597–606 (2003)
2. Lemieux, A., Parizeau, M.: Experiments on eigenfaces robustness. In: Proc. ICPR-2002, vol. 1, pp. 421–424 (August 2002)
3. Wang, X., Tang, X.: Face Hallucination and Recognition. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 486–494. Springer, Heidelberg (2003)
4. Jaynes, C., Kale, A., Sanders, N., Grossmann, E.: The Terrascope dataset: scripted multi-camera indoor video surveillance with ground-truth. In: Proc. Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 309–316 (October 2005)
5. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
6. Wiskott, L., Fellous, J., Krüger, N., Malsburg, C.: Face recognition by elastic bunch graph matching. In: Sommer, G., Daniilidis, K., Pauli, J. (eds.) CAIP 1997. LNCS, vol. 1296, pp. 456–463. Springer, Heidelberg (1997)
7. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTS: The Extended M2VTS Database. In: Proc. AVBPA-1999, pp. 72–76 (1999)
8. Park, S., Park, M., Kang, M.: Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine 25(9), 21–36 (2003)
9. Tsai, R., Huang, T.: Multiframe image restoration and registration. Advances in Computer Vision and Image Processing 1, 317–339 (1984)
10. Baker, S., Kanade, T.: Limits on Super-Resolution and How to Break Them. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002)
11. Baker, S., Kanade, T.: Super Resolution Optical Flow. Technical Report CMU-RI-TR-99-36, The Robotics Institute, Carnegie Mellon University (October 1999)
12. Lin, F., Fookes, C., Chandran, V., Sridharan, S.: Investigation into Optical Flow Super-Resolution for Surveillance Applications. In: Proc. APRS Workshop on Digital Image Computing 2005, pp. 73–78 (February 2005)
13. Black, M., Anandan, P.: A framework for the robust estimation of optical flow. In: Proc. ICCV-1993, pp. 231–236 (May 1993)
14. Schultz, R., Stevenson, R.: Extraction of High-Resolution Frames from Video Sequences. IEEE Transactions on Image Processing 5(6), 996–1011 (June 1996)
15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001)
16. Bolme, D., Beveridge, R., Teixeira, M., Draper, B.: The CSU Face Identification Evaluation System: Its Purpose, Features and Structure. In: Proc. International Conference on Vision Systems, pp. 304–311 (April 2003)
17. Phillips, P., Flynn, P., Scruggs, T., Bowyer, K., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: Proc. CVPR '05, vol. 1, pp. 947–954 (2005)
18. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1090–1104 (2000)
19. Lin, F., Cook, J., Chandran, V., Sridharan, S.: Face Recognition from Super-Resolved Images. In: Proc. ISSPA 2005, pp. 667–670 (August 2005)
Face Detection Based on Multi-Block LBP Representation Lun Zhang, Rufeng Chu, Shiming Xiang, Shengcai Liao, and Stan Z. Li Center for Biometrics and Security Research & National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95 Zhongguancun Donglu Beijing 100080, China
Abstract. Effective real-time face detection has been made possible by the method of rectangular Haar-like features with AdaBoost learning since Viola and Jones' work [12]. In this paper, we present a new set of distinctive rectangle features, called Multi-block Local Binary Patterns (MB-LBP), for face detection. MB-LBP encodes the intensities of rectangular regions with the local binary pattern operator, and the resulting binary patterns can describe diverse local structures of images. Based on the MB-LBP features, a boosting-based learning method is developed for face detection. To deal with the non-metric feature values of MB-LBP features, the boosting algorithm uses a multi-branch regression tree as its weak classifier. The experiments show that weak classifiers based on MB-LBP are more discriminative than Haar-like features and the original LBP features. Given the same number of features, the proposed face detector achieves a 15% higher correct rate at a false alarm rate of 0.001 than Haar-like features and an 8% higher rate than the original LBP features. This indicates that MB-LBP features can capture more information about the image structure and show more distinctive performance than traditional Haar-like features, which simply measure differences between rectangles. Another advantage of the MB-LBP feature is its smaller feature set, which greatly reduces training time.
1 Introduction
Face detection has a wide range of applications such as automatic face recognition, human-machine interaction, surveillance, etc. In recent years, there has been substantial progress on detection schemes based on the appearance of faces. These methods treat face detection as a two-class (face/non-face) classification problem. Due to the variations in facial appearance, lighting, expressions, and other factors [11], face/non-face classifiers with good performance must be very complex. The most effective way to construct face/non-face classifiers is the learning-based approach, for example, neural network-based methods [10], support vector machines [9], etc. Recently, the boosting-based detector proposed by Viola and Jones [12] has been regarded as a breakthrough in face detection research. Real-time performance is achieved by learning a sequence of simple Haar-like rectangle features. The Haar-like features encode differences in average intensities between two rectangular regions, and they can be calculated rapidly through the integral image [12]. The complete Haar-like feature set is large and contains a mass of redundant information. A boosting algorithm is introduced to
select a small number of distinctive rectangle features and construct a powerful classifier. Moreover, the use of a cascade structure [12] further speeds up the computation. Li et al. extended that work to multi-view faces using an extended set of Haar features and an improved boosting algorithm [5]. However, these Haar-like rectangle features seem too simple, and the detector often needs thousands of rectangle features to reach considerable performance. The large number of selected features leads to high computational cost in both the training and test phases. In particular, in later stages of the cascade, weak classifiers based on these features become too weak to improve the classifier's performance [7]. Many other features have also been proposed to represent facial images, including rotated Haar-like features [6], the census transform [3], sparse features [4], etc. In this paper, we present a new distinctive feature, called the Multi-block Local Binary Pattern (MB-LBP) feature, to represent facial images. The basic idea of MB-LBP is to encode rectangular regions with the local binary pattern operator [8]. MB-LBP features can also be calculated rapidly through the integral image, yet they capture more information about the image structure than Haar-like features and show more distinctive performance. Compared with the original Local Binary Pattern computed in a local 3×3 pixel neighborhood, MB-LBP features can capture large-scale structure that may be the dominant feature of an image. We directly use the output of the LBP operator as the feature value. A problem, however, is that this value is just a symbol representing a binary string. For this non-metric feature value, a multi-branch regression tree is designed as the weak classifier. We implement Gentle AdaBoost for feature selection and classifier construction, and then build a cascade detector. Another advantage of MB-LBP is that the exhaustive set of MB-LBP features is much smaller than that of Haar-like features (about 1/20 of the Haar-like features for a sub-window of size 20 × 20). Boosting-based methods use the AdaBoost algorithm to select a significant feature set from the large complete feature set; this process often takes much time, even several weeks. The small MB-LBP feature set makes this procedure much simpler. The rest of this paper is organized as follows. Section 2 introduces the MB-LBP features. In Section 3, the AdaBoost learning for feature selection and classifier construction is presented, and the cascade detector is also described. The experimental results are given in Section 4. Section 5 concludes this paper.
2 Multi-Block Local Binary Pattern Features
The traditional Haar-like rectangle feature measures the difference between the average intensities of rectangular regions (see Fig. 1). For example, the value of a two-rectangle filter is the difference between the sums of the pixels within two rectangular regions. If we change the position, size, shape and arrangement of the rectangular regions, the Haar-like features can capture the intensity gradient at different locations, spatial frequencies and directions. Viola and Jones [12] applied three kinds of such features for detecting frontal faces. By using the integral image, any rectangle filter type, at any scale or location, can be evaluated in constant time [12]. However, the Haar-like features seem too simple and show some limits [7]. In this paper, we propose a new distinctive rectangle feature, called the Multi-block Local Binary Pattern (MB-LBP) feature. The basic idea of MB-LBP is that the
Fig. 1. Traditional Haar-like features. These features measure the differences between rectangular regions' average intensities.
Fig. 2. Multi-block LBP feature for image representation. As shown in the figure, the MB-LBP features encode rectangular regions’ intensities by local binary pattern. The resulting binary patterns can describe diverse image structures. Compared with original Local Binary Pattern calculated in a local 3×3 neighborhood between pixels, MB-LBP can capture large scale structure.
simple difference rule in Haar-like features is replaced by encoding the rectangular regions with a local binary pattern operator. The original LBP, introduced by Ojala [8], is defined for each pixel by thresholding the 3×3 neighborhood pixel values with the center pixel value. To encode the rectangles, the MB-LBP operator is defined by comparing the central rectangle's average intensity gc with those of its neighborhood rectangles {g1, ..., g8}. In this way, it gives a binary sequence. An output value of the MB-LBP operator can be obtained as follows:

MB-LBP = Σ_{i=1}^{8} s(g_i − g_c) 2^i,   (1)

where g_c is the average intensity of the center rectangle, g_i (i = 1, ..., 8) are those of its neighborhood rectangles, and

s(x) = 1 if x > 0, and s(x) = 0 if x < 0.

A more detailed description of the MB-LBP operator can be found in Fig. 2. We directly use the resulting binary patterns as the feature values of MB-LBP features. Such binary patterns can detect diverse image structures such as edges, lines, spots, flat areas and corners [8], at different scales and locations. Compared with the original Local Binary Pattern computed in a local 3×3 pixel neighborhood, MB-LBP can capture large-scale structures that may be the dominant features of images. In total, we can get 256 kinds of binary patterns, some of which are shown in Fig. 3. In Section 4.1, we conduct an experiment to evaluate the MB-LBP features. The experimental results
Fig. 3. A randomly chosen subset of the MB-LBP features
show that the MB-LBP features are more distinctive than Haar-like features and the original LBP features. Another advantage of MB-LBP is that the exhaustive set of MB-LBP features (rectangles at various scales, locations and aspect ratios) is much smaller than that of Haar-like features. Given a sub-window size of 20 × 20, there are in total 2,049 MB-LBP features, about 1/20 of the number of Haar-like features (45,891). One usually selects significant features from the whole feature set with the AdaBoost algorithm and constructs a binary classifier. Owing to the large Haar-like feature set, the training process usually takes a long time. The much smaller MB-LBP feature set makes the implementation of feature selection significantly easier. It should be emphasized that the value of an MB-LBP feature is non-metric: the output of the LBP operator is just a symbol representing a binary string. In the next section, we describe how to design the weak classifiers based on MB-LBP features and apply the AdaBoost algorithm to select significant features and construct the classifier.
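To make the feature computation above concrete, the following NumPy sketch evaluates one MB-LBP value from an integral image. It is not the authors' implementation: the function names, the clockwise neighbour ordering used for the bit packing, and the example block size are illustrative assumptions.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum table with a zero top row/left column, so that
    rectangle sums can be read off in constant time."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def block_sum(ii, y, x, h, w):
    """Sum of pixels in the h x w rectangle with top-left corner (y, x)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def mb_lbp(ii, y, x, bh, bw):
    """MB-LBP value for a 3x3 grid of bh x bw blocks with top-left corner (y, x).
    The eight neighbour blocks are compared with the centre block's mean intensity
    and the resulting bits are packed into one integer."""
    means = np.empty((3, 3))
    for r in range(3):
        for c in range(3):
            means[r, c] = block_sum(ii, y + r * bh, x + c * bw, bh, bw) / (bh * bw)
    center = means[1, 1]
    # clockwise order of the eight neighbours, starting at the top-left block
    neighbours = [means[0, 0], means[0, 1], means[0, 2], means[1, 2],
                  means[2, 2], means[2, 1], means[2, 0], means[1, 0]]
    value = 0
    for i, g in enumerate(neighbours):
        if g > center:
            value |= 1 << i
    return value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.integers(0, 256, size=(20, 20)).astype(np.float64)
    ii = integral_image(patch)
    print(mb_lbp(ii, 2, 2, 3, 4))   # one feature: a 3x3 grid of 3x4 blocks at (2, 2)
```

Because the nine block means are read from the same integral image, a feature at any scale or location costs the same constant number of lookups, which is what makes exhaustive evaluation over the sub-window feasible.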
3 Feature Selection and Classifier Construction
Although the MB-LBP feature set is much smaller than that of Haar-like features, it still contains much redundant information. The AdaBoost algorithm is used to select significant features and construct a binary classifier. Here, AdaBoost is adopted to solve the following three fundamental problems in one boosting procedure: (1) learning effective features from the large feature set, (2) constructing weak classifiers, each of which is based on one of the selected features, and (3) boosting the weak classifiers into a stronger classifier.
3.1 AdaBoost Learning
We choose the version of boosting called Gentle AdaBoost [2] because it is simple to implement and numerically robust. Given a set of training examples (x1, y1), ..., (xN, yN), where yi ∈ {+1, −1} is the class label of the example xi ∈ R^n, boosting provides a sequential procedure to fit additive models of the form F(x) = Σ_{m=1}^{M} f_m(x). Here the f_m(x) are often called weak learners, and F(x) is called a strong learner. Gentle AdaBoost uses adaptive Newton steps for minimizing the cost
Table 1. Algorithm of Gentle AdaBoost
1. Start with weights w_i = 1/N, i = 1, 2, ..., N, and F(x) = 0.
2. Repeat for m = 1, ..., M:
   (a) Fit the regression function f_m(x) by weighted least squares fitting of Y to X.
   (b) Update F(x) ← F(x) + f_m(x).
   (c) Update w_i ← w_i e^{−y_i f_m(x_i)} and normalize.
3. Output the classifier F(x) = sign[Σ_{m=1}^{M} f_m(x)].
function J = E[e^{−y F(x)}], which corresponds to minimizing a weighted squared error at each step. In each step, the weak classifier f_m(x) is chosen so as to minimize the weighted squared error:

J_wse = Σ_{i=1}^{N} w_i (y_i − f_m(x_i))^2.   (2)
3.2 Weak Classifiers
It is common to define the weak learners f_m(x) to be the optimal threshold classification function [12], which is often called a stump. However, as indicated in Section 2, the value of MB-LBP features is non-metric, so a threshold-based function cannot be used as the weak learner. Here we describe how the weak classifiers are designed. For each MB-LBP feature, we adopt a multi-branch tree as the weak classifier. The multi-branch tree has 256 branches in total, and each branch corresponds to a certain discrete value of the MB-LBP feature. The weak classifier can be defined as:

f_m(x) = a_j,  if x^k = j,  for j = 0, 1, ..., 255,   (3)

where x^k denotes the k-th element of the feature vector x, and a_j, j = 0, ..., 255, are regression parameters to be learned. These weak learners are often called decision or regression trees. We can find the best tree-based weak classifier (the parameters k and a_j that minimize the weighted squared error of Eq. (2)) just as we would learn a node in a regression tree. The minimization of Eq. (2) gives the following parameters:

a_j = (Σ_i w_i y_i δ(x_i^k = j)) / (Σ_i w_i δ(x_i^k = j)).   (4)

As each weak learner depends on a single feature, one feature is selected at each step. In the test phase, given an MB-LBP feature, we can get the corresponding regression
value quickly with such a multi-branch tree. This function is similar to the lookup table (LUT) weak classifier for Haar-like features [1]; the difference is that the LUT classifier partitions a real-valued domain.
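The following NumPy sketch shows how a 256-branch weak learner can be fitted with Eq. (4) and plugged into the Gentle AdaBoost loop of Table 1. It assumes every feature value is an integer in [0, 255]; the function names and the toy data are invented for illustration, and the cascade construction is omitted.

```python
import numpy as np

def fit_lut_weak_classifier(features, y, w, n_bins=256):
    """Fit one multi-branch (lookup-table) weak learner per Eq. (4):
    a_j = sum_i w_i y_i [x_i^k = j] / sum_i w_i [x_i^k = j] for every candidate
    feature k, keeping the feature with the smallest weighted squared error (Eq. (2))."""
    best = None
    for k in range(features.shape[1]):
        x = features[:, k]
        num = np.bincount(x, weights=w * y, minlength=n_bins)
        den = np.bincount(x, weights=w, minlength=n_bins)
        a = np.divide(num, den, out=np.zeros(n_bins), where=den > 0)
        err = np.sum(w * (y - a[x]) ** 2)
        if best is None or err < best[0]:
            best = (err, k, a)
    return best          # (weighted squared error, selected feature index, LUT a_0..a_255)

def gentle_adaboost(features, y, n_rounds):
    """Gentle AdaBoost (Table 1): additive model F(x) = sum_m f_m(x)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    F = np.zeros(n)
    learners = []
    for _ in range(n_rounds):
        _, k, a = fit_lut_weak_classifier(features, y, w)
        fm = a[features[:, k]]            # evaluate the selected weak learner
        F += fm
        w *= np.exp(-y * fm)              # reweight and renormalize
        w /= w.sum()
        learners.append((k, a))
    return learners, np.sign(F)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.integers(0, 256, size=(1000, 50))        # toy MB-LBP feature values
    y = np.where(X[:, 3] > 127, 1.0, -1.0)           # toy labels tied to feature 3
    learners, pred = gentle_adaboost(X, y, n_rounds=5)
    print("selected features:", [k for k, _ in learners])
    print("training accuracy:", np.mean(pred == y))
```

At test time, evaluating a selected weak learner is a single table lookup indexed by the MB-LBP value, which is what keeps the detector fast.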
4 Experiments
In this section, we conduct two experiments to evaluate the proposed method: (1) comparing MB-LBP features with Haar-like features and the original LBP features, and (2) evaluating the proposed detector on the CMU+MIT face database. A total of 10,000 face images were collected from various sources, covering out-of-plane and in-plane rotation in the range of [−30°, 30°]. For each aligned face example, four synthesized face examples were generated by the following random transformations: mirroring, random shifting by +1/−1 pixel, in-plane rotation within 15 degrees, and scaling within 20% variation. The face examples were then cropped and re-scaled to 20×20 pixels. In total, we get a set of 40,000 face examples. More than 20,000 large images which do not contain faces are used for collecting non-face samples.
4.1 Feature Comparison
In this subsection, we compare the performance of MB-LBP features with Haar-like rectangle features and conventional LBP features. In the experiments, we use 26,000 face samples and randomly divide them into two equal parts, one for training and the other for testing. The non-face samples are randomly collected from large images which do not contain faces. Our training set contains 13,000 face samples and 13,000 non-face samples, and the testing set contains 13,000 face samples and 50,000 non-face samples. Based on the AdaBoost learning framework, three boosting classifiers are trained, each containing 50 selected features: Haar-like features, conventional LBP features and MB-LBP features, respectively. They are then evaluated on the test set. Fig. 4(a) shows the curves of the error rate (average of the false alarm rate and false rejection rate) as a function of the number of selected features in the training procedure. We can see the curve
Fig. 4. Comparative results with MB-LBP features, Haar-like features and original LBP features. (a) The curves show the error rate as a function of the selected features in training process. (b) The ROC curves show the classification performance of the three classifiers on the test set.
corresponding to MB-LBP features has the lowest error rate. This indicates that the weak classifiers based on MB-LBP features are more discriminative. The ROC curves of the three classifiers on the test set can be found in Fig. 4(b). At the given false alarm rate of 0.001, the classifier based on MB-LBP features shows a 15% higher correct rate than Haar-like features and an 8% higher rate than the original LBP features. All of the above shows the distinctiveness of MB-LBP features, mainly because the MB-LBP features can capture more information about the image structures.
4.2 Experimental Results on the CMU+MIT Face Set
We trained a cascade face detector based on MB-LBP features and tested it on the MIT+CMU database, which is widely used to evaluate the performance of face detection algorithms. This set consists of 130 images with 507 labeled frontal faces. For training the face detector, all 40,000 collected face samples are used, and the bootstrap strategy is used to re-collect non-face samples. Our trained detector has 9 layers including 470 MB-LBP features. Compared with Viola's cascade detector [12], which has 32 layers and 4,297 features, our MB-LBP detector is much more efficient. From the results, we can see that our method achieves comparable performance with fewer features. The processing time of our detector for a 320×240 image is less than 0.1 s on a P4 3.0 GHz PC.

Table 2. Experimental results on the MIT+CMU set (detection rate at a given number of false alarms)
False Alarms   6      10     21     31     57     78     136    167    293    422
Ours           80.1%  -      85.6%  -      90.7%  -      91.9%  -      93.5%  -
Viola          -      78.3%  -      85.2%  -      90.1%  -      91.8%  -      93.7%
Fig. 5. Some detection results on MIT+CMU set
5 Conclusions
In this paper, we proposed Multi-block Local Binary Pattern (MB-LBP) features as a descriptor for face detection, and implemented a boosting-based detector. To deal with the non-metric feature values of MB-LBP, a multi-branch regression tree is adopted to construct the weak classifiers. First, these features capture more information about image structure than traditional Haar-like features and show more distinctive performance. Second, the smaller complete feature set makes the training process easier. In our experiments, at the given false alarm rate of 0.001, MB-LBP shows a 15% higher correct rate than Haar-like features and an 8% higher rate than the original LBP features. Moreover, our face detector achieves comparable performance on the CMU+MIT database with fewer features.
Acknowledgements This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the AuthenMetric Collaboration Foundation.
References
1. Wu, B., Ai, H.Z., Huang, C., Lao, S.H.: Fast rotation invariant multi-view face detection based on real adaboost. In: FG (2004)
2. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: A statistical view of boosting. Annals of Statistics (2000)
3. Froba, B., Ernst, A.: Face detection with the modified census transform. In: AFGR (2004)
4. Huang, C., Ai, H., Li, Y., Lao, S.: Learning sparse features in granular space for multi-view face detection. In: IEEE International Conference on Automatic Face and Gesture Recognition, April 2006. IEEE Computer Society Press, Los Alamitos (2006)
5. Li, S.Z., Zhu, L., Zhang, Z.Q.: Statistical learning of multi-view face detection. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359. Springer, Heidelberg (2002)
6. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object detection. In: ICIP (2002)
7. Mita, T., Kaneko, T., Hori, O.: Joint haar-like features for face detection. In: ICCV (2005)
8. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition (January 1996)
9. Osuna, E., Freund, R., Girosi, F.: Training support vector machines: an application to face detection. In: CVPR (1997)
10. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)
11. Simard, P.Y., Cun, Y.A.L., Denker, J.S., Victorri, B.: Transformation invariance in pattern recognition - tangent distance and tangent propagation. Neural Networks: Tricks of the Trade (1998)
12. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2001)
Color Face Tensor Factorization and Slicing for Illumination-Robust Recognition Yong-Deok Kim and Seungjin Choi Department of Computer Science Pohang University of Science and Technology San 31 Hyoja-dong, Nam-gu, Pohang 790-784, Korea {karma13,seungjin}@postech.ac.kr
Abstract. In this paper we present a face recognition method based on multiway analysis of color face images, which is robust to varying illumination conditions. Illumination changes cause large variations of color in face images. The main idea is to extract features with minimal color variation while retaining spatial image information. We construct a tensor from the color image ensemble, one of whose coordinates reflects the color mode, and employ the higher-order SVD (a multiway extension of the SVD) of the tensor to extract such features. Numerical experiments show that our method outperforms existing subspace analysis methods, including principal component analysis (PCA), generalized low rank approximation (GLRAM) and concurrent subspace analysis (CSA), in the task of face recognition under varying illumination conditions. The superiority is even more substantial in the case of small training sample size.
1 Introduction
Face recognition is a challenging pattern classification problem, which is encountered in many different areas such as biometrics, computer vision, and human computer interaction (HCI). Cruxes in practical face recognition systems result from varying illumination conditions, various facial expressions, pose variations, and so on. Of particular interest in this paper is the case of varying illumination conditions in a color face image ensemble. Various approaches and methods have been developed for face recognition. Subspace analysis is one of the most popular techniques, demonstrating its success in numerous visual recognition tasks such as face recognition, face detection and tracking. Exemplary subspace analysis methods include singular value decomposition (SVD), principal component analysis (PCA), independent component analysis (ICA), nonnegative matrix factorization (NMF), and Fisher linear discriminant analysis (LDA). All these methods seek a linear representation of the face image ensemble such that basis images and encoding variables are learned, satisfying a certain fitting criterion, with each face image represented by a vector. Face images are formed by the interaction of multiple factors related to illumination, color information, various poses, facial expressions, and identities. A portion of this information is embedded in the 2D spatial structure. Thus, face images
intrinsically fit a multiway representation, known as a tensor, reflecting interactions between different modes. However, the aforementioned subspace analysis methods are confined to at most a 2-way representation. For example, color face images are converted to gray-scale-valued vectors. This vectorization is a typical pre-processing step in conventional subspace methods. It leads to high-dimensional vectors while losing some spatial structure, which results in a curse-of-dimensionality problem, making such methods suffer from the small sample size problem. Recently, there has been a great deal of work on multiway analysis in computer vision. This includes 2D-PCA [1], generalized low rank approximation (GLRAM) [2], concurrent subspace analysis (CSA) [3], tensor faces [4], which employs the multilinear SVD (a.k.a. HOSVD) [5,6], and multilinear ICA [7]. The basic idea of tensor analysis goes back to the Tucker decomposition [8,9]. See [10] for a recent review of tensor factorization. In the case of color face images, illumination change yields large variations of color in face images, even though they have exactly the same pose and facial expression. In this paper, we present a tensor factorization-based method for illumination-robust feature extraction in a color face image recognition task. The method is referred to as color face tensor factorization and slicing (CFTFS). We form a 4-way tensor whose coordinates are associated with the rows and columns of face images, color, and samples. CFTFS employs the multilinear SVD on the 4-way tensor, simultaneously analyzing the subspaces corresponding to rows, columns, and color. It then chooses slices where information about variations on the row and column modes remains but variation on the color mode is minimized. We demonstrate that CFTFS outperforms existing methods, including PCA, GLRAM, and CSA. The useful behavior of the method becomes more substantial, especially in the case of small training sample size. The rest of this paper is organized as follows. In the next section we give a brief overview of tensor algebra and the Tucker decomposition. The proposed method, CFTFS, is presented in Sec. 3. In Sec. 4, numerical experimental results are presented, showing that CFTFS is indeed an effective method for illumination-robust face recognition. Finally, conclusions are drawn in Sec. 5.
2 Background: Multiway Analysis
2.1 Tensor Algebra
A tensor is a multiway array of data. For example, a vector is 1-way tensor and a matrix is 2-way tensor. The N -way tensor X ∈ RI1 ×I2 ×···×IN has N indices (i1 , i2 , . . . , iN ) and its elements are denoted by xi1 i2 ...iN where 1 ≤ in ≤ In . Mode-n vectors of an N -way tensor X are In -dimensional vectors obtained from X by varying index in while keeping the other indices fixed. In matrix, column vectors are referred to as mode-1 vectors and row vectors correspond to mode-2 vectors. The mode-n vectors are column vectors of the matrix X (n) which is the mode-n matricization (matrix unfolding) of the tensor X . The mode-n
Fig. 1. Matricization of a 3-way tensor X ∈ R^{I1×I2×I3} leads to X(1) ∈ R^{I1×I2 I3}, X(2) ∈ R^{I2×I3 I1}, and X(3) ∈ R^{I3×I1 I2}, which are constructed by the concatenation of frontal, horizontal, and vertical slices, respectively
matricization of X ∈ R^{I1×I2×···×IN} is denoted X(n) ∈ R^{In × In+1 In+2 ··· IN I1 I2 ··· In−1}, where In+1 In+2 ··· IN I1 I2 ··· In−1 is the cyclic order after n. The original index in+1 runs fastest and in−1 slowest in the columns of the matrix X(n). A pictorial illustration of the mode-n matricization of a 3-way tensor is shown in Fig. 1.
The scalar product of two tensors X, Y ∈ R^{I1×I2×···×IN} is defined as <X, Y> = Σ_{i1,i2,...,iN} x_{i1 i2 ··· iN} y_{i1 i2 ··· iN}. The Frobenius norm of a tensor X is given by ||X|| = sqrt(<X, X>). The mode-n product of a tensor S ∈ R^{J1×J2×···×Jn×···×JN} by a matrix A(n) ∈ R^{In×Jn} is defined by

(S ×n A(n))_{j1···jn−1 in jn+1···jN} = Σ_{jn=1}^{Jn} s_{j1···jn−1 jn jn+1···jN} a_{in jn},   (1)
leading to a tensor S ×n A(n) ∈ RJ1 ×J2 ×···×In ×···×JN . With the mode-n product, a familiar matrix factorization X = U SV is written as X = S ×1 U ×2 V in the tensor framework.
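To make the mode-n matricization and mode-n product concrete, here is a short NumPy sketch. It is a minimal illustration, not library code: the helper names are invented, and the cyclic unfolding convention follows the definition given above.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: mode n becomes the rows; the remaining modes are
    concatenated in the cyclic order n+1, ..., N, 1, ..., n-1 used in the text."""
    N = X.ndim
    order = [n] + [(n + k) % N for k in range(1, N)]
    return np.transpose(X, order).reshape(X.shape[n], -1)

def mode_n_product(S, A, n):
    """Mode-n product S x_n A with A of size (In, Jn): contracts mode n of S
    (of size Jn) against the columns of A, producing size In along mode n."""
    St = np.moveaxis(S, n, -1)      # bring mode n to the last axis
    out = St @ A.T                  # (..., Jn) @ (Jn, In) -> (..., In)
    return np.moveaxis(out, -1, n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.normal(size=(3, 4))
    U = rng.normal(size=(5, 3))
    V = rng.normal(size=(6, 4))
    # the matrix factorization X = U S V^T written as X = S x_1 U x_2 V
    X = mode_n_product(mode_n_product(S, U, 0), V, 1)
    print(np.allclose(X, U @ S @ V.T))            # True
    T = rng.normal(size=(2, 3, 4))
    print(unfold(T, 0).shape, unfold(T, 1).shape, unfold(T, 2).shape)
```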
2.2 Tucker Decomposition
The Tucker decomposition seeks a factorization model of an N-way tensor X ∈ R^{I1×I2×···×IN} as mode products of a core tensor S ∈ R^{J1×J2×···×JN} and N mode matrices A(n) ∈ R^{In×Jn}:

X ≈ S ×1 A(1) ×2 A(2) · · · ×N A(N),   (2)

x_{i1 i2 ··· iN} ≈ Σ_{j1,j2,...,jN} s_{j1 j2 ··· jN} a^(1)_{i1 j1} a^(2)_{i2 j2} · · · a^(N)_{iN jN}.   (3)
Usually, the mode matrices are constrained to be orthogonal for easy interpretation, and there is no loss of fit. A pictorial illustration of the Tucker decomposition is shown in Fig. 2.

Fig. 2. The 3-way Tucker decomposition
In the Tucker decomposition, not all modes need to be analyzed. If mode n is not analyzed, then the associated component matrix A(n) becomes the In × In identity matrix I_In. We now introduce the terminology "N-way Tucker M decomposition", where N-way means that we work with an N-way tensor and M is the number of analyzed modes. This terminology explains PCA, 2D-PCA, GLRAM, CSA, and HOSVD in a general framework. Table 1 summarizes each method using this terminology in the case of general multiway data and a face image ensemble. Suppose that Ω is the set of analyzed modes. If all mode matrices are orthogonal, minimizing the least squares discrepancy between the data and the model in Eq. (2) is equivalent to maximizing the function

G(A(1), A(2), ..., A(N)) = ||X ×1 A(1) ×2 A(2) · · · ×N A(N)||^2,   (4)

over A(n) for all n ∈ Ω. In the Tucker decomposition, Eq. (4) has no closed-form solution, except for PCA, so a local solution is found iteratively with Alternating Least Squares (ALS). In each step, only one of the component matrices is optimized while keeping the others fixed. Suppose n ∈ Ω and A(1), ..., A(n−1), A(n+1), ..., A(N) are fixed. Then Eq. (4) reduces to a quadratic expression in A(n), which consists of orthonormal columns. We have G(A(n)) = ||K(n) ×n A(n)||^2 = ||A(n) K(n)||^2, where

K(n) = X ×1 A(1) · · · ×n−1 A(n−1) ×n+1 A(n+1) · · · ×N A(N),   (5)
Table 1. Tucker decomposition explains PCA, 2D-PCA, GLRAM, CSA, and HOSVD in a general framework. Assume the last mode is associated with samples. A grayscale face image ensemble constructs a 3-way tensor (rows, columns, samples) and a color face image ensemble a 4-way tensor (rows, columns, color, samples).

Data type              Method   Tucker decomposition   Remark
N-way tensor           PCA      N-way Tucker 1         A(n) = I_In for n ∈ {1, ..., N−1}
                       CSA      N-way Tucker N−1       A(N) = I_IN
                       HOSVD    N-way Tucker N         Jn ≤ In for all n
Face image ensemble    PCA      3-way Tucker 1         A(1) = I_I1, A(2) = I_I2
                       2D-PCA   3-way Tucker 1         A(1) = I_I1, A(3) = I_I3
                       GLRAM    3-way Tucker 2         A(3) = I_I3
                       CSA      4-way Tucker 3         A(4) = I_I4
Table 2. ALS algorithm for Tucker decomposition
Input: X, Ω, Jn for all n ∈ Ω.
Output: S, A(n) for all n ∈ {1, 2, ..., N}.
1. Initialize
   · A(n) ← I_In for all n ∉ Ω.
   · A(n) ← SVDS(X(n), Jn) for all n ∈ Ω.
   · S ← X ×1 A(1) ×2 A(2) · · · ×N A(N).
2. Repeat until convergence, for all n ∈ Ω:
   · K(n) ← X ×1 A(1) · · · ×n−1 A(n−1) ×n+1 A(n+1) · · · ×N A(N).
   · K(n) ← mode-n matricization of K(n).
   · A(n) ← SVDS(K(n), Jn).
3. S ← X ×1 A(1) ×2 A(2) · · · ×N A(N).
and K (n) is the mode-n matricization of K(n) . Hence the columns of A(n) can be found as an orthonormal basis for the dominant subspace of the column space of K (n) . The resulting algorithm is presented in Table 2.
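A minimal NumPy sketch of the ALS loop in Table 2 is given below, assuming orthonormal mode matrices and projections with their transposes; the helper functions and the toy data are illustrative, and SVDS is replaced by a dense SVD for simplicity.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization using the cyclic column order n+1, ..., N, 1, ..., n-1."""
    N = X.ndim
    order = [n] + [(n + k) % N for k in range(1, N)]
    return np.transpose(X, order).reshape(X.shape[n], -1)

def mode_n_product(S, A, n):
    """Mode-n product S x_n A with A of shape (out_dim, in_dim)."""
    return np.moveaxis(np.moveaxis(S, n, -1) @ A.T, -1, n)

def svds_basis(M, J):
    """Leading J left singular vectors of M (the SVDS step in Table 2)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :J]

def tucker_als(X, ranks, analyzed, n_iter=20):
    """ALS for the Tucker decomposition: only modes in `analyzed` (the set Omega)
    get a learned mode matrix; the other modes keep identity matrices."""
    N = X.ndim
    A = [np.eye(X.shape[n]) for n in range(N)]
    for n in analyzed:                                   # initialization
        A[n] = svds_basis(unfold(X, n), ranks[n])
    for _ in range(n_iter):
        for n in analyzed:
            K = X
            for m in range(N):
                if m != n:
                    K = mode_n_product(K, A[m].T, m)     # project the other modes
            A[n] = svds_basis(unfold(K, n), ranks[n])
    S = X
    for n in range(N):
        S = mode_n_product(S, A[n].T, n)                 # core tensor
    return S, A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 9, 10))
    # GLRAM-like setting: analyze rows and columns, leave the sample mode alone
    S, A = tucker_als(X, ranks=[3, 3, 10], analyzed=[0, 1])
    Xhat = S
    for n in range(3):
        Xhat = mode_n_product(Xhat, A[n], n)
    print("relative error:", np.linalg.norm(X - Xhat) / np.linalg.norm(X))
```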
3 Color Face Tensor Factorization and Slicing
In the case of color face images, illumination change yields large variations of color, even though the images have exactly the same pose and facial expression. Conventional methods, such as PCA and GLRAM, convert the color image into a grayscale image and then reduce the dimension. As the illumination changes, the intensity values also vary greatly, and there is no way to reduce this fluctuation in grayscale-image-based methods, since they have already
Fig. 3. CFTFS finds a transformation which reduces the size of image and minimizes the variation on color
thrown away the information on color. If there are large illumination changes but only a small number of samples is available, these methods cannot prevent a sharp decline in face recognition performance. Our proposed method, color face tensor factorization and slicing (CFTFS), solves this problem by conserving the color information and the spatial structure of the original color face image ensemble. With the help of multiway analysis, CFTFS simultaneously analyzes the subspaces of rows, columns, and color. CFTFS then slices a feature tensor where information about variations on the row and column modes is retained but variation on the color mode is minimized. The basic idea of CFTFS is illustrated in Fig. 3. CFTFS uses a 4-way tensor X in which (I1, I2) is the size of the face image, I3 = 3 is the number of color coordinates (RGB), and I4 is the number of samples. The face image data are centered so that they have zero mean:

X_{:,:,:,i4} ← X_{:,:,:,i4} − M for all 1 ≤ i4 ≤ I4,   (6)

where M = (1/I4) Σ_{i4=1}^{I4} X_{:,:,:,i4}. As its name hints, CFTFS consists of two stages: dimension reduction and slicing. At the dimension reduction stage, it performs the 4-way Tucker 3 decomposition, where mode 4 is not analyzed and J1 < I1, J2 < I2, and J3 = I3 = 3. In fact, this is equivalent to CSA, which minimizes

Σ_{i4=1}^{I4} ||X_{:,:,:,i4} − S_{:,:,:,i4} ×1 A(1) ×2 A(2) ×3 A(3)||^2   (7)
over A(1) , A(2) , A(3) , and S :,:,:,i4 for all 1 ≤ i4 ≤ I4 . The color face tensor X :,:,:,i4 is projected to an intermediate feature tensor S :,:,:,i4 = X :,:,:,i4 ×1 A(1) ×2 A(2) ×3 A(3) . Hence the dimension is reduced from I1 × I2 × 3 to J1 × J2 × 3.
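The following NumPy sketch illustrates the per-image projection and the color-mode slicing described in this section. It is only an illustration of the feature extraction step: the mode matrices here are random orthonormal stand-ins, whereas in CFTFS they come from the 4-way Tucker 3 (CSA) fit.

```python
import numpy as np

def mode_n_product(S, A, n):
    """Mode-n product with A of shape (out_dim, in_dim)."""
    return np.moveaxis(np.moveaxis(S, n, -1) @ A.T, -1, n)

def cftfs_features(faces, A1, A2, A3, mean):
    """Centre each I1 x I2 x 3 colour face, project rows, columns and colour with
    the (assumed learned) mode matrices, and keep the third colour-mode slice as
    the illumination-robust feature matrix."""
    feats = []
    for face in faces:
        Y = face - mean
        S = mode_n_product(mode_n_product(mode_n_product(Y, A1.T, 0), A2.T, 1), A3.T, 2)
        feats.append(S[:, :, 2])          # third slice of the colour mode
    return np.stack(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    faces = rng.random((5, 30, 30, 3))                 # five toy 30x30 colour faces
    mean = faces.mean(axis=0)
    A1, _ = np.linalg.qr(rng.normal(size=(30, 7)))     # stand-ins for the learned
    A2, _ = np.linalg.qr(rng.normal(size=(30, 7)))     # Tucker-3 / CSA mode matrices
    A3, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    print(cftfs_features(faces, A1, A2, A3, mean).shape)   # (5, 7, 7)
```

Keeping only one color-mode slice after the projection is exactly what distinguishes the slicing step from plain CSA, which would keep all three slices.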
Since we use SVDS in our algorithm, the first columns of the mode matrices represent the most dominant subspace, the second columns the second most dominant subspace orthogonal to the first, and so on. Thus the third slice S_{:,:,3,i4} of an intermediate feature tensor is the final illumination-robust feature matrix. In the final feature matrix, information of X_{:,:,:,i4} about variations on rows and columns is retained, but variation on color is minimized. The illumination-robust feature extraction for a test color face tensor Y ∈ R^{I1×I2×3} is summarized as

T = (Y − M) ×1 A(1) ×2 A(2) ×3 a3^(3),   (8)

where a3^(3) is the third column vector of A(3).

4 Numerical Experiments
Our MATLAB implementation of CFTFS partly uses the tensor toolbox [11]. We show the effectiveness of our proposed method for illumination-robust face recognition with the CMU PIE face database [12], comparing it with PCA,
Fig. 4. Sample face images. Illumination change yields large variations of color in face images, even though they have exactly the same pose and facial expression.
Fig. 5. Face recognition accuracy (from top to bottom: CFTFS, PCA, GLRAM, and CSA) as a function of the number of features and the training sample ratio. CFTFS has the highest accuracy and is robust to a lack of training samples.
GLRAM, and CSA. The CMU PIE database contains 41,368 face images of 68 people. In our experiments, we use a sub-database containing 1,632 face images. It has 24 varying illumination conditions with exactly the same pose (C27) and facial expression (neutral) for each person. We fix the locations of the two eyes, crop the face, and resize it to 30 × 30 pixels. Sample face images are shown in Fig. 4. Fig. 5 and Table 3 show the recognition results for the 4 methods. The (x, y, z) axes represent the number of features, the training sample ratio, and the recognition accuracy. The numbers of features are {9, 16, 25, 36, 49, 64} and the training sample ratios are

Table 3. Recognition accuracy over various numbers of features and training sample ratios. CFTFS has higher recognition accuracy than the others most of the time. Moreover, the superiority becomes more substantial, especially in the case of small training sample size.
Rows: # of features. Columns: training sample ratio (%) = 10, 20, 30, 40, 50, 60, 70, 80, 90.

CFTFS
  9:  36.67  50.85  61.27  68.04  74.75  78.74  83.06  85.32  88.19
 16:  45.32  63.63  73.45  80.81  85.63  89.81  93.01  95.54  97.45
 25:  52.99  70.82  80.50  86.46  90.88  94.19  96.56  98.10  99.14
 36:  54.62  72.90  82.42  87.87  91.84  95.17  97.03  98.51  99.69
 49:  58.45  75.62  85.07  90.20  93.67  95.97  97.43  99.13  99.63
 64:  61.41  78.12  85.80  90.74  93.79  95.58  97.75  99.23  99.45

PCA
  9:  21.37  35.31  46.41  55.95  64.27  71.62  78.59  84.22  89.85
 16:  26.44  41.71  54.17  65.26  74.04  80.54  87.89  91.89  96.93
 25:  30.14  47.25  59.86  70.21  78.52  85.65  90.48  94.43  97.73
 36:  32.10  50.62  63.77  73.47  81.42  87.89  92.63  96.12  98.50
 49:  34.06  52.32  65.92  76.39  84.32  90.12  93.99  97.70  98.99
 64:  34.72  54.47  67.35  78.28  85.28  90.38  94.49  97.58  99.60

GLRAM
  9:  16.74  28.44  38.18  46.97  54.76  61.10  67.46  72.42  78.90
 16:  19.68  32.87  44.53  54.93  63.31  70.74  79.06  84.46  91.44
 25:  22.76  37.89  50.01  60.25  69.30  78.08  84.58  90.05  95.09
 36:  24.38  40.69  53.62  63.68  72.94  80.82  87.02  92.67  96.84
 49:  26.65  43.19  55.74  67.46  76.53  84.03  89.67  94.54  97.30
 64:  27.90  45.50  58.09  69.45  78.13  84.62  90.49  95.38  98.83

CSA
  9:  16.74  28.44  38.18  46.97  54.76  61.10  67.46  72.42  78.90
 16:  19.68  32.87  44.53  54.93  63.31  70.74  79.06  84.46  91.44
 25:  22.72  37.99  49.67  60.59  68.82  77.02  84.06  90.37  95.06
 36:  20.85  34.78  46.62  57.43  66.02  73.84  80.92  87.16  92.55
 49:  18.34  31.11  41.50  51.35  59.67  67.19  74.55  79.57  86.53
 64:  15.59  26.33  34.61  42.67  49.14  56.00  60.76  66.53  69.63
{0.1, 0.2, ..., 0.9}. The experiments are carried out 20 times independently for each case, and the mean accuracies are reported. It is known that GLRAM and CSA have more image compression ability than PCA. However, our experimental results show that they are not suitable for face recognition under varying illumination conditions. In particular, CSA shows the poorest result since it captures features in which the dominant variations on color remain. The slicing step, the difference between CFTFS and CSA, dramatically increases the recognition performance. As Fig. 5 and Table 3 show, CFTFS has higher recognition accuracy than the others most of the time. Moreover, the superiority of our method becomes more substantial, especially in the case of small training sample size.
5 Conclusions
In this paper, we have presented a method of color face tensor factorization and slicing which extracts an illumination-robust feature. Revisiting the Tucker decomposition, we have explained our algorithm in a general framework together with PCA, 2D-PCA, GLRAM, CSA, and HOSVD. Using the 4-way Tucker 3 decomposition, the subspaces of rows, columns, and color are simultaneously analyzed, and then a feature, in which information about variations on rows and columns is retained while variation on color is minimized, is extracted by slicing. Numerical experiments have confirmed that our method is indeed effective for face recognition under conditions in which large illumination changes exist and only a small number of training samples is available.
Acknowledgments. This work was supported by Korea MIC under the ITRC support program supervised by the IITA (IITA-2006-C1090-0603-0045).
References
1. Yang, J., Zhang, D., Frangi, A.F., Yang, J.Y.: Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 26, 131–137 (2004)
2. Ye, J.: Generalized low rank approximations of matrices. In: Proceedings of International Conference on Machine Learning, Banff, Canada, pp. 887–894 (2004)
3. Xu, D., Yan, S., Zhang, L., Zhang, H.J., Liu, Z., Shum, H.Y.: Concurrent subspace analysis. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, CA, pp. 203–208. IEEE Computer Society Press, Los Alamitos (2005)
4. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear subspace analysis of image ensembles. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin (2003)
5. de Lathauwer, L., de Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21, 1253–1278 (2000)
6. de Lathauwer, L., de Moor, B., Vandewalle, J.: On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000)
7. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear independent component analysis. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, California (2005)
8. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966)
9. Kroonenberg, P.M., de Leeuw, J.: Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika 45, 69–97 (1980)
10. Kolda, T.G.: Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Sandia National Laboratories (2006)
11. Bader, B.W., Kolda, T.G.: Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Mathematical Software 32, 635–653 (2006)
12. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. IEEE Trans. Pattern Analysis and Machine Intelligence 25, 1615–1618 (2003)
Robust Real-Time Face Detection Using Face Certainty Map Bongjin Jun and Daijin Kim Department of Computer Science and Engineering Pohang University of Science and Technology, {simple21,dkim}@postech.ac.kr
Abstract. In this paper, we present a robust real-time face detection algorithm that improves conventional face detection algorithms in three different steps. For the preprocessing step, we revise the modified census transform to compensate for its sensitivity to small changes of pixel values. For the face detection step, we propose difference of pyramid (DoP) images for fast face detection. Finally, for the postprocessing step, we propose a face certainty map (FCM), which contains facial information such as facial size, location, rotation, and confidence value, to reduce the FAR (False Acceptance Rate) while keeping detection performance constant. The experimental results show that the reduction of FAR is ten times better than the existing cascade AdaBoost detector while keeping the detection rate and detection time almost the same.
1 Introduction
Face detection is an essential preprocessing step for face recognition[1], surveillance, robot vision interfaces, and facial expression recognition. It also has many application areas such as picture indexing, tracking, clustering and so on. However, face detection has intrinsic difficulties for the following reasons. First, the face is not a rigid object, i.e. every person has a different facial shape and different forms/locations of facial features such as eyes, nose, and mouth. Second, the face of the same person looks different as the facial expression, facial pose, and illumination conditions change. Finally, it is almost impossible to train on the infinite number of non-face patterns, so unexpected false acceptances or false rejections can occur. The procedure of face detection can be divided into a preprocessing step, a face detection step, and a postprocessing step. First, for the preprocessing step, illumination compensation techniques like histogram equalization[2], normalization to zero mean and unit variance on the analysis window[3], and the modified census transform[4] have been proposed. Second, for detecting faces, many classification algorithms have been proposed to classify face and non-face patterns, such as skin color based approaches[5][6], SVM[7][8], Gaussian mixture models[9], maximum likelihood[10], neural networks[11][8], and AdaBoost[3][4]. Finally, for the postprocessing step, the algorithms usually group detected faces which are located in similar positions. Then, they select only one face from each face group and
Fig. 1. The overall procedure of the proposed algorithm
determine the size, location, and rotation of the selected face. These methods usually show good performance, but have difficulties in learning every non-face pattern in natural scenes. In addition, these methods are somewhat slow due to many computation steps. In this paper, we present a novel face detection algorithm. For the preprocessing step, we revise the modified census transform to compensate for its sensitivity to small changes of pixel values. For the face detection step, we propose difference of pyramid (DoP) images for fast face detection. Finally, for the postprocessing step, we propose a face certainty map (FCM), which contains facial information such as facial size, location, rotation, and confidence value, to reduce the FAR (False Acceptance Rate) with constant detection performance (Fig. 1). The outline of this paper is as follows: in Section 2, we explain the modified census transform and our proposed revised modified census transform for the preprocessing step. In Section 3, we explain face detection using AdaBoost[4], our proposed face detection method, and the combined system. In Section 4, we propose the face certainty map for the postprocessing step. In Section 5, we show the experimental results and analysis. Finally, some conclusions of this work are given in Section 6.
2 Preprocessing Step
2.1 Revised Modified Census Transform
Zabih and Woodfill proposed an illumination-insensitive local transform method called the census transform (CT), which is an ordered set of comparisons of pixel intensities in a local neighborhood representing which pixels have lesser intensity than the center[12]. Let N(x) define a local spatial neighborhood of the pixel at x so that x ∉ N(x), let the comparison function C(I(x), I(x′)) be 1 if I(x) < I(x′), and let ⊗ denote the concatenation operation; then the census transform at x is defined as

T(x) = ⊗_{y∈N(x)} C(I(x), I(y)).   (1)
Fig. 2. (a) Modified Census Transform (MCT): for a 3×3 patch of intensity 90 with a single slightly brighter pixel of 91, comparison against the neighborhood mean 90.1 yields the pattern 001000000 (= 64). (b) Revised Modified Census Transform (RMCT): comparison against 90.1 + r instead yields the pattern 000000000 (= 0).
Since the census transform encodes pixel values by comparison with the center pixel value, it cannot transform pixel values equal to the center pixel value. Fröba and Ernst proposed the modified census transform (MCT) to solve this problem[4]. Let N̄(x) be a local spatial neighborhood of the pixel at x such that N̄(x) = N(x) ∪ x. The intensity mean on this neighborhood is denoted by Ī(x). With this, they reformulate Equation (1) and write the modified census transform as

Γ(x) = ⊗_{y∈N̄(x)} C(Ī(x), I(y)).   (2)
Using Equation (2), they could determine all of the 511 structure kernels defined on a 3 × 3 neighborhood, while CT has only 256 structure kernels. However, as can be seen in Fig. 2-(a), MCT is sensitive to subtle changes of pixel values in a local region. To solve this problem, we revise MCT by the addition of a small value r (= 2 or 3) as

Υ(x) = ⊗_{y∈N̄(x)} C(Ī(x) + r, I(y)).   (3)
We call Equation (3) the revised modified census transform (RMCT). RMCT transforms the pixel values into one of 511 patterns in a 3 × 3 neighborhood. Since the local pattern of pixel value changes in a 3 × 3 neighborhood is insensitive to illumination change, this transform is robust to illumination change (Fig. 2-(b)). Moreover, since RMCT has regular patterns which can represent facial features, it is well suited to classifying face and non-face patterns.
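A minimal NumPy sketch of the RMCT of Eq. (3) is shown below; it is not the authors' code, and the bit ordering within the 3×3 window is an illustrative assumption. The small example reproduces the behaviour illustrated in Fig. 2: a single pixel that is only slightly brighter than its neighbours no longer flips a bit once the offset r is added.

```python
import numpy as np

def rmct(img, r=2):
    """Revised modified census transform (Eq. (3)): for every interior pixel,
    compare the nine 3x3-neighbourhood pixels against the neighbourhood mean
    plus a small offset r, and pack the nine comparison bits into one index."""
    img = img.astype(np.float64)
    H, W = img.shape
    out = np.zeros((H - 2, W - 2), dtype=np.int32)
    # the nine pixels of each 3x3 window, as shifted views of the image
    shifts = [img[dy:H - 2 + dy, dx:W - 2 + dx] for dy in range(3) for dx in range(3)]
    mean = sum(shifts) / 9.0
    for bit, s in enumerate(shifts):
        out |= (s > mean + r).astype(np.int32) << bit
    return out

if __name__ == "__main__":
    patch = np.full((5, 5), 90.0)
    patch[1, 3] = 91.0                  # one slightly brighter pixel, as in Fig. 2
    print(rmct(patch, r=0))             # MCT-like behaviour: the noise pixel sets a bit
    print(rmct(patch, r=2))             # RMCT: the small offset suppresses the noise bit
```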
3 Face Detection Step
3.1 Face Detection Using RMCT and Adaboost
In this section, we present our proposed face detection algorithm using RMCT and adaboost. RMCT transforms the pixel values to one of 511 patterns in 3 × 3 neighborhood. Then, using the training images transformed by RMCT, we construct the weak classifier which classifies the face and non-face patterns
and the strong classifier, which is a linear combination of weak classifiers. Each weak classifier consists of a set of feature locations and the confidence values for each RMCT pattern. In the test phase, we scan the image plane by shifting the scanning window and obtain the confidence value of the strong classifier at each window location. We then determine the window location to be a face region when the confidence value is above the threshold. Moreover, we construct a multi-stage classifier cascade for fast face detection. We expect that image patches which contain background are rejected in an early stage of the cascade, so the detection speed increases[3][4]. For training the multi-stage classifier cascade, the non-face training images for the first classifier are composed of arbitrary non-face images, while those of later classifiers are composed of images which are falsely accepted by the former classifiers.
3.2 Speed Up the Algorithm
The face detector analyzes image patches of pre-defined size. Each window has to be classified either as a face or a non-face. In order to detect faces in the input image, conventional face detection algorithms scan all possible analysis windows. In addition, to find faces of various sizes, the image is repeatedly down-scaled with a pre-defined scaling factor. This is done until the scaled image is smaller than the sub-window size. Although face detection by this full-search algorithm shows optimal performance, it is computationally expensive and slow. To solve this problem, we propose difference of pyramid (DoP) images coupled with a two-dimensional logarithmic search. First, we obtain face candidate regions using DoP. Since the algorithm does not search the whole input image but only the face candidate regions, we expect it to reduce the computation time.
Difference of Pyramid Images. Extracting face candidate regions using motion differences has been adopted for fast face detection in many previous works. However, this approach assumes that the location of the camera is fixed and the background image is constant. Accordingly, it is difficult to adopt this method for still images or image sequences from a moving camera. We propose difference of pyramid images (DoP) to compensate for this problem. In order to obtain face candidate regions using motion differences, at least two images are required, such as a background image and an input image, or the images of the previous and current frames. However, we can obtain face candidate regions from a single image using DoP, since it is computed not from an image sequence but from a single image. Thus, we are able to apply this algorithm to still images and to image sequences from a moving camera as well. Many face detection algorithms construct a number of down-scaled pyramid images and then scan each pyramid image using a scanning window of pre-defined size. In this paper, we construct n down-scaled images which constitute an image pyramid. Then we obtain n−1 DoP images by subtracting the i-th pyramid image from the (i−1)-th pyramid image. Since the sizes of the i-th pyramid
Fig. 3. Difference of Pyramid Images
image and the (i−1)-th pyramid image are different, we first align the two images at their center points and then obtain the DoP image by subtracting corresponding pixel points. Since a plain background shows little change in the DoP image, we can perform fast face detection. The pixel points in the DoP image which have a higher value than the threshold are selected as the face candidate region. Fig. 3 shows some example DoP images.
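The sketch below illustrates one plausible reading of this step, under the assumption that "aligning at the center point" means cropping the larger level to the size of the smaller one around its center before subtracting; the helper names, the scaling factor, and the nearest-neighbour resampling are illustrative choices, not the authors' implementation.

```python
import numpy as np

def build_pyramid(img, n_levels, scale=0.8):
    """Down-scaled image pyramid (nearest-neighbour resampling for brevity)."""
    pyramid = [img.astype(np.float64)]
    for _ in range(n_levels - 1):
        h, w = pyramid[-1].shape
        nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
        ys = (np.arange(nh) * h / nh).astype(int)
        xs = (np.arange(nw) * w / nw).astype(int)
        pyramid.append(pyramid[-1][np.ix_(ys, xs)])
    return pyramid

def dop_candidates(pyramid, thresh):
    """Difference-of-pyramid sketch: align consecutive levels at their centres,
    subtract the overlapping pixels, and mark pixels whose absolute difference
    exceeds the threshold as face-candidate region (in the smaller level)."""
    masks = []
    for big, small in zip(pyramid[:-1], pyramid[1:]):
        h, w = small.shape
        oy, ox = (big.shape[0] - h) // 2, (big.shape[1] - w) // 2   # centre alignment
        diff = np.abs(big[oy:oy + h, ox:ox + w] - small)
        masks.append(diff > thresh)
    return masks

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.full((60, 80), 120.0)
    frame[20:40, 30:50] += rng.normal(0, 30, size=(20, 20))   # textured region
    pyr = build_pyramid(frame, n_levels=4)
    for m in dop_candidates(pyr, thresh=10.0):
        print(m.shape, int(m.sum()), "candidate pixels")
```

Flat background pixels change very little between scales, so only the textured region survives the threshold, which is exactly the property the candidate-region search relies on.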
4 Postprocessing Step
4.1 Face Certainty Map
To minimize the FAR (False Acceptance Rate) and FRR (False Rejection Rate), existing face detection algorithms concentrate on learning optimal model parameters and devising optimal detection algorithms. However, since the model parameters are determined by the face and non-face images in the training set, it is not guaranteed that the algorithms work well for novel images. In addition, there is an infinite number of non-face patterns in the real world, so it is almost impossible to train on every non-face pattern in natural scenes. As a result, face detection algorithms which showed good performance during the training phase show high FRR and FAR in real environments. The face detector described in Section 3.1 determines the image patch in the current scanning window to be a face when the confidence value is above the threshold and to be a non-face when the confidence value is below the threshold. We call a scanning window whose confidence value is above the threshold a detected face window. Fig. 4 shows some examples from the face detection algorithm proposed in Section 3.1: the rectangles in each figure represent the detected face windows. We can see that while the face regions are detected perfectly, there are several falsely accepted regions (blue circle regions). In addition, when we investigate the figures closely, there are many detected face windows near the real face region, while there are few detected face windows near the falsely accepted regions. With this observation, we propose the face certainty
Fig. 4. Face Detection Results Without FCM
map (FCM), which can reduce the FAR with constant detection performance and no additional training. As can be seen in Fig. 4, there are multiple detected face windows, even though there is only one face in the input image. For a real face region, there are detected face windows with the same location but different scales (Fig. 5-(a)) and detected face windows with the same scale but different locations (Fig. 5-(b)). However, falsely accepted regions do not show this property. Consequently, we can determine the regions where multiple detected face windows overlap as face regions, and the regions with no overlapping detected face windows as falsely accepted regions. By adopting this methodology, we can reduce the FAR greatly. We now explain how to apply FCM to the face detection algorithm described in Section 3.1. The detailed explanation of the procedure is given below.
1. For each scanning window centered at (x, y), we compute the confidence value

H_i(Υ) = Σ_{p∈S_i} h_p(Υ(p)),
where i represents the i-th cascade, p represents the p-th feature location, and S_i is the set of feature locations of the i-th cascade.
2. The confidence value cumulated over all n cascades is as follows:

S(x, y) = Σ_{i=1}^{n} H_i(Υ) if H_i(Υ) is above the threshold for all i, and 0 otherwise.   (4)
Fig. 5. Detected face windows: (a) detected face windows which have the same center point (C1) but different scales; (b) detected face windows which have the same scale but different center points
We compute the cumulated confidence value for every pixel position in the image.
3. We also compute Equation (4) for all pyramid images; then we have S_p(x, y), p = 1, ..., m, where m is the total number of constructed pyramid images and the pixel locations (x, y) of each down-scaled pyramid image are translated to the corresponding original image locations.
4. The FCM for location (x, y) consists of four items: S_max(x, y), W_max(x, y), H_max(x, y), and C(x, y). S_max(x, y) is the maximum confidence value among S_p(x, y), p = 1, ..., m; W_max(x, y) and H_max(x, y) are the width and height of the detected face window which has the maximum confidence value; and C(x, y) is the confidence value cumulated over all m pyramid images, C(x, y) = Σ_{p=1}^{m} S_p(x, y).
5. Having constructed the FCM, we can determine the face regions using it. First, we look for the values above the threshold in S_max(x, y). Then we determine the location (x, y) to be the center of a face when C(x, y) is also above its threshold. A non-face region where the maximum confidence value is above the threshold is not classified as a face region, since C(x, y) is lower than the threshold. Consequently, we can reduce the FAR using our proposed FCM.
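The following NumPy sketch illustrates the accumulation logic of the procedure above. It is a simplified illustration rather than the authors' implementation: detections are assumed to be already mapped back to original-image coordinates, the confidence of each surviving window is accumulated at its centre pixel for brevity, and all names, thresholds, and toy detections are invented.

```python
import numpy as np

def build_fcm(detections, image_shape):
    """Face-certainty-map sketch.  `detections` are (x, y, w, h, score) tuples for
    scanning windows whose cascade confidence passed every stage.  For every pixel
    we keep the maximum confidence S_max, the window size at that maximum, and the
    accumulated confidence C."""
    H, W = image_shape
    S_max = np.zeros((H, W))
    W_max = np.zeros((H, W), dtype=int)
    H_max = np.zeros((H, W), dtype=int)
    C = np.zeros((H, W))
    for x, y, w, h, score in detections:
        cy, cx = y + h // 2, x + w // 2          # accumulate at the window centre
        C[cy, cx] += score
        if score > S_max[cy, cx]:
            S_max[cy, cx] = score
            W_max[cy, cx], H_max[cy, cx] = w, h
    return S_max, W_max, H_max, C

def face_centres(S_max, C, s_thresh, c_thresh):
    """A pixel is a face centre only if both the peak and the accumulated
    confidence are high; isolated false accepts fail the second test."""
    return np.argwhere((S_max > s_thresh) & (C > c_thresh))

if __name__ == "__main__":
    dets = [(40, 30, 20, 20, 3.1), (38, 28, 24, 24, 2.8),   # overlapping windows on a true face
            (36, 26, 28, 28, 2.9),
            (100, 80, 20, 20, 3.0)]                          # isolated false accept
    S_max, W_max, H_max, C = build_fcm(dets, (120, 160))
    print(face_centres(S_max, C, s_thresh=2.5, c_thresh=5.0))   # only the true face survives
```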
5 Experimental Results and Discussion
To construct the training face data, we gathered 17,000 face images from the internet and the Postech DB[13]. The gathered face images cover multiple ethnicities, a variety of illumination conditions, and expression variation. Each image is aligned by the eye locations, and we resize the images to a 22 × 22 base resolution. In addition, for robustness to image rotation, we generated another 25,060 face images by rotating the gathered face images by −3, 0, and 3 degrees. For the non-face training data, we collected 5,000 images containing no faces from the internet. Then, we extracted image patches from the collected internet images at random sizes and positions. After that, we generated 60,000 non-face
Fig. 6. Experimental Results
images by resizing the extracted image patches to the same scale as the training face images. We used these 60,000 non-face images as the non-face training data for the first stage of the cascade. For the next stages of the cascade, we used non-face data which were considered face images by the previous cascade stages (i.e., we used the false positives of the previous stages as the non-face training data for the current stage). A validation data set is used for obtaining the threshold value and stopping condition of each stage of the cascade. The validation-set face and non-face images exclude the images used for training. We constructed a validation set of 15,000 face images and 25,000 non-face images in the same way as the training data. For each cascade stage, we preprocessed each face and non-face image using RMCT. Then we chose the feature positions for classification (S_i) and obtained the classification values (H_i(Υ)) and threshold values (T_i) for each position. We constructed the face detector with 4 cascade stages, and the maximal numbers of allowed positions for the stages are 40, 80, 160, and 400, respectively. In addition, the RMCT is defined by 511 patterns in a 3 × 3 neighborhood; accordingly, we cannot apply the 3 × 3 RMCT at the border pixels of the 22 × 22 training images. Thus, we excluded the outer area of each image and used its inner 20 × 20 region. We tested our algorithm on the CMU+MIT frontal face test set. Fig. 6 and Table 1 present the results of face detection. When we used FCM, the reduction of FAR
Table 1. Results of Face Detection

Detector                      Number of False Detections
RMCT, adaboost and FCM        3
RMCT and adaboost             93
Viola-Jones                   78
Rowley-Baluja-Kanade          167
Bernhard Froba                27
is ten times better than that of the cascade adaboost detector at the same detection rate, while the detection time is almost the same. The cascade adaboost detector needs computations for grouping and eliminating overlapped face candidate regions. In contrast, the proposed detector does not need these computations but needs the computation for the FCM. Operating on 320 by 240 pixel images, faces are detected at 23 frames per second on a conventional 3.2 GHz Intel Pentium IV system and at 6 frames per second on an OMAP5912 (ARM9) system.
6
Conclusion
In this paper, we proposed a fast and robust face detection algorithm using difference of pyramid (DoP) images and a face certainty map (FCM). The experimental results showed that the reduction of the FAR is ten times better than that of the existing cascade adaboost detector, while keeping the detection rate and detection time almost the same. Existing adaboost face detection algorithms have to add more cascade stages for non-face images in order to reduce the FAR. However, since this requires more weak classifiers to construct the strong classifiers, the processing time increases. Moreover, as the number of stages in the cascade increases, the FRR also increases. We avoid these drawbacks and increase detection performance by applying the FCM to the existing adaboost face detection algorithm. Since we can reduce the FAR, the number of cascade stages is also minimized, while preserving the same performance as the existing algorithm with more cascade stages. Accordingly, the training time and processing time are shorter than those of the existing algorithm. Furthermore, the FCM can be applied to any face detection algorithm besides adaboost that produces a confidence value or probability. In this work, we applied the algorithm only to frontal faces. A future extension of this work could be a pose- and rotation-invariant face detection algorithm.
Acknowledgements This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University. It was also financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency.
References 1. Lee, H.-S., Kim, D.: Facial expression transformations for expression-invariant face recognition. In: Proc. of International Symposium on Visual Computing, pp. 323–333 (2006) 2. Sung, K.K.: Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, AI Lab, Cambridge (1996) 3. Viola, P., Jones, M.: Fast and Robust Classification using Asymmetric Adaboost and a Detector Cascade. In: Advances in Neural Information Processing System, vol. 14, MIT Press, Cambridge (2002) 4. Froba, B., Ernst, A.: Face detection with the modified census transform. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition. IEEE Computer Society Press, Los Alamitos (2004) 5. Yang, J., Waibel, A.: A real-time face tracker. In: Proc. 3rd Workshop on Appl. of Computer Vision, pp. 142–147 (1996) 6. Dai, Y., Nakano, Y.: Face texture model based on sgld and its application in face detection in a color scene. Pattern Recognition 29, 1007–1017 (1996) 7. Osuna, E.: Support Vector Machines: Training and Applications. PhD thesis, MIT, EE/CS Dept., Cambridge (1998) 8. Mohan, A., Papageorgiou, C., Poggio, T.: Examplebased object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 349–361 (2001) 9. Sung, K.K., Poggio, T.: Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 39–51 (1998) 10. Schneiderman, H., Kanade, T.: A statistical method for 3d object detection applied to face and cars. In: Computer Vision and Pattern Recognition, pp. 746–751 (2000) 11. Rowley, H., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 23–38 (1998) 12. Zabih, R., Woodfill, J.: A non-parametric approach to visual correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence (1996) 13. Kim, H.C., Sung, J.W., Je, H.M., Kim, S.K., Jun, B.J., Kim, D., Bang, S.Y.: Asian Face Image Database PF01. Technical Report, Intelligent Multimedia Lab, Dept. of CSE, POSTECH (2001)
Motion Compensation for Face Recognition Based on Active Differential Imaging Xuan Zou, Josef Kittler, and Kieron Messer Centre for Vision, Speech and Signal Processing University of Surrey, United Kingdom {x.zou,j.kittler,k.messer}@surrey.ac.uk
Abstract. Active differential imaging has been proved to be an effective approach to remove ambient illumination for face recognition. In this paper we address the problem caused by motion for a face recognition system based on active differential imaging. A moving face will appear at two different locations in the ambient illumination frame and the combined illumination frame, and as a result artifacts are introduced into the difference face image. An approach based on motion compensation is proposed to deal with this problem. Experiments on moving faces demonstrate that the proposed approach leads to significant improvements in face identification and verification results.
1
Introduction
The illumination problem in face recognition is one of those challenging problems which remain to be addressed [8]. The variation in face appearance caused by illumination change can be much larger than the variation caused by personal identity [5]. The approaches to this problem can be divided into "Passive" and "Active". In "Passive" approaches, attempts are made to process images which have already been affected by illumination variations. Such approaches are either based on illumination modelling, photometric normalisation, or the use of illumination-insensitive features. "Active" approaches usually involve additional devices (optical filters, active illumination sources or specific sensors) that actively obtain modalities of face images that are insensitive to or independent of illumination change. These modalities include face shape (depth map or surface normal) and thermal/near-infrared images. Active differential imaging is an important variant of active sensing to minimise the illumination problem. An active differential imaging system consists of an active illuminant and an imaging sensor. During the imaging process, two images are captured: the first one is taken when the active illuminant is on, the second one is taken when the illuminant is off. In the first image the scene is illuminated by the combination of both the ambient illumination and the active illumination, while in the second image the face is illuminated only by ambient illumination. Therefore the difference of these two images contains the scene viewed under active illumination only, which is completely independent of ambient illumination. Some sample images are shown in Fig. 1. Near-infrared is often
Fig. 1. Face images under combined illumination (a), ambient illumination (b) and their difference image (c), without any motion
Fig. 2. Face images under combined illumination (a), ambient illumination (b) and their difference image (c), when the face is moving
chosen as the active illumination source for active differential imaging due to its invisibility, which makes the whole system unobtrusive. Recently, the idea of active differential imaging has been applied to illumination-invariant face recognition [9][10][1][6], and significant advantages in face recognition performance are reported for faces under varying illumination. Specific sensors which can perform differential imaging have been proposed [3][7]. However, despite its success in removing ambient illumination for a still scene, the active differential imaging system is detrimentally affected by any motion of the subject in the field of view. A moving subject will appear at two different locations in the combined illumination frame (C-frame) and the ambient illumination frame (A-frame), respectively, which results in a difference image with "artifacts" as shown in Fig. 2. To the best of our knowledge, this problem has not been addressed before. In this paper, we propose a motion compensation technique developed to cope with this problem. The face capture system captures a C-frame and an A-frame alternately. The motion between two C-frames is estimated, and a "virtual" C-frame is computed by performing motion interpolation. The difference image between this "virtual" C-frame and the A-frame captured between the above
two C-frames is used for recognition. The advantage of applying the proposed approach is proved by the improvement in face recognition results on a database containing moving face sequences. The paper is organised as follows: A detailed look at the problem caused by motion is given in Section 2. Section 3 describes the proposed approach to deal with the motion problem. The information about a moving face database for our experiments is provided in Section 4. The details and results of the experiments carried out are presented in Section 5. Conclusions are drawn in Section 6.
2
Motion Problem of Active Differential Imaging
Two problems can be introduced by motion in a capture system based on active differential imaging. The first one is motion blur in each captured image. However, this is a general problem for all imaging systems. Fortunately, face recognition based on the commonly adopted subspace approaches is relatively robust to the degradation in resolution and consequently is not seriously affected by motion blur. Therefore motion blur is not a major problem to be addressed. The second one is posed by the motion artifacts in the difference image. Since the C-frame and A-frame are captured at different times, the difference image will be quite different from what it is supposed to be due to the position change of the object. The displacement between the object positions in the two frames is directly related to the time interval between these two successive captures and the moving speed. As shown in Fig. 2, this problem is significant for a face capture system based on differential imaging. Artifacts are prominent especially around the edges, such as eyelids, nose and mouth. This paper focuses on the second problem. An example is given below to show how the face similarity degrades due to displacement of the faces. A C-frame and an A-frame are taken for a still face with 60 pixels in inter-ocular distance as shown in Fig. 3. We manually shifted the A-frame by 1 to 10 pixels from its original position horizontally or vertically. The faces in the resulting difference images are registered with reference to the original position in the C-frame, cropped and normalised to 55*50 patches, and histogram equalised, as shown in Fig. 4. Fig. 5 shows that the similarities between
Fig. 3. Combined illumination frame (a) and ambient illumination frame (b)
Fig. 4. Original face template (a), and resulting difference images when the A-frame is shifted (-10, -7, -3, 0, 3, 7, 10) pixels horizontally (b) and vertically (c) from the C-frame
the resulting difference images and the face template keep decreasing when the displacement between the C-frame and A-frame in any direction (left, right, up or down) increases. The similarity is measured by the Normalised Correlation (NC) score in the original image space, the Principal Component Analysis (PCA) subspace, and the Linear Discriminant Analysis (LDA) subspace. The PCA and LDA subspaces are built from a near-infrared face database with 2844 images of 237 people. NC in an LDA subspace usually gives much better results for face recognition than NC in the image space and the PCA subspace; however, NC in the LDA subspace is much more sensitive to the displacement than in the other two spaces, according to Fig. 5.
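As a small illustration of the similarity measure used in this experiment, the snippet below computes a normalised correlation score, optionally after projection onto a pre-trained subspace basis; the basis and mean are assumed to be learned offline and are not part of the original description.

```python
import numpy as np

def nc_score(a, b):
    """Normalised correlation between two feature vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nc_in_subspace(patch1, patch2, basis, mean):
    """NC after projecting two normalised face patches onto a PCA or LDA
    basis (columns of `basis`); `basis` and `mean` are assumed inputs."""
    x1 = basis.T @ (patch1.ravel() - mean)
    x2 = basis.T @ (patch2.ravel() - mean)
    return nc_score(x1, x2)
```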
Fig. 5. NC score drops when the A-frame shifts from the C-frame horizontally (a) and vertically (b)
Therefore, the performance of a face recognition system based on active differential imaging will degrade when faces are moving. For a general-purpose camera, the time interval between two frames is 40 ms (CCIR) or 33 ms (EIA), which is long enough that the motion effect may be significant. A hardware solution is specific high-speed sensors, such as the sensor developed by Ni [7], which can provide a capture speed of 100 images per second. However, due to the high price of custom-designed devices, a software solution is always desirable. The problem cannot be solved using only the difference image. First, simply applying subspace approaches does not work: as discussed above, none of the commonly used face subspace representations is both insensitive to this motion effect and discriminative enough for face recognition. Second, motion information cannot be recovered from the difference image to remove the motion effect. It is
also impossible to align the faces in the successive C-frame and A-frame, because the faces are under different illumination in these two frames. We propose to use the two nearest C-frames to obtain the motion information and to interpolate a virtual C-frame; the motion effect can then be removed from the difference image of the A-frame and the virtual C-frame.
3
Motion Compensation for Moving Face
Assume the face is moving at the same speed in the same direction between t_i, the time when the first C-frame C_i is captured, and t_{i+2}, the time when the second C-frame C_{i+2} is captured. We can then apply interpolation to obtain a virtual C-frame C_{i+1} as if "captured" at t_{i+1}, which is the same time at which the frame A_{i+1} is captured. Therefore, the faces in the frame C_{i+1} and the frame A_{i+1} are at exactly the same location. As a result, the motion effect is removed in the difference image between the frame C_{i+1} and the frame A_{i+1}. This approach is illustrated in Fig. 6.
Fig. 6. Illustration of our proposed approach
The robust optical flow estimation method by Black and Anandan [4] is applied to obtain a dense motion field between two successive C-frames: Ci and Ci+2 . For the image intensity function I(x, y, t) defined on the region S, the estimation of optical flow field U = {(us , vs ) | s ∈ S} can be treated as a minimization problem of a cost function E(u, v) of the residual errors from the
data conservation constraint and the spatial coherence constraint. The robust formulation presented in [4] is as below:

E(u, v) = \sum_{s \in S} \Big[ \lambda_D \rho_D(I_x u_s + I_y v_s + I_t, \sigma_D) + \lambda_S \Big( \sum_{n \in G_s} \rho_S(u_s - u_n, \sigma_S) + \sum_{n \in G_s} \rho_S(v_s - v_n, \sigma_S) \Big) \Big]   (1)

where I_x, I_y and I_t are the partial derivatives of I(x, y, t). Both \rho_D and \rho_S are the Lorentzian function:

\rho(x, \sigma) = \log\Big( 1 + \frac{1}{2} \Big(\frac{x}{\sigma}\Big)^2 \Big)   (2)

\lambda_D and \lambda_S are weights for the data conservation term and the spatial coherence term, respectively. \sigma_D and \sigma_S are the parameters controlling the shape of the Lorentzian function and the threshold for outliers. By using the Lorentzian function instead of the quadratic function used in least-squares estimation, the influence of outliers on the data conservation constraint and the spatial coherence constraint can be reduced. A coarse-to-fine strategy is employed to cope with large motions. If u and v represent the horizontal and vertical motion between frames C_i and C_{i+2}, then the motion between C_i and C_{i+1} is u (t_{i+1} - t_i)/(t_{i+2} - t_i) and v (t_{i+1} - t_i)/(t_{i+2} - t_i), based on linear interpolation. C_{i+1} can be warped from C_i based on:

C_{i+1}(p) = \sum_{s=0}^{3} d^s C_i(p_o^s)   (3)

where C_{i+1}(p) is the grey value of a pixel p with coordinates (m, n) in frame C_{i+1}, \{p_o^s\}_{s=0,..,3} are the 4 nearest neighbours of the original subpixel location (m_o, n_o) of p in the frame C_i, with m_o = m - u(m, n) (t_{i+1} - t_i)/(t_{i+2} - t_i) and n_o = n - v(m, n) (t_{i+1} - t_i)/(t_{i+2} - t_i), and \{d^s\}_{s=0,..,3} are the weights related to the distances between (m_o, n_o) and \{p_o^s\}_{s=0,..,3}. Fig. 7 illustrates the interpolation process.

Fig. 7. Illustration of the interpolation process
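The following Python sketch illustrates the warping of Eq. (3): a virtual C-frame is interpolated from C_i using the dense flow estimated between C_i and C_{i+2}, scaled by the capture-time ratio, and then differenced against the A-frame. The flow field is assumed to be given (e.g., from a robust optical-flow estimator); boundary handling and the axis convention for (u, v) are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def warp_virtual_c_frame(c_i, u, v, alpha):
    """Backward-warp C_i by alpha = (t_{i+1}-t_i)/(t_{i+2}-t_i) of the flow
    (u, v), using bilinear interpolation over the 4 nearest source pixels as
    in Eq. (3). The pairing of u/v with the image axes follows Eq. (3); swap
    them if your flow convention differs."""
    h, w = c_i.shape
    m, n = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    mo = m - alpha * u
    no = n - alpha * v
    m0 = np.clip(np.floor(mo).astype(int), 0, h - 2)
    n0 = np.clip(np.floor(no).astype(int), 0, w - 2)
    dm = np.clip(mo - m0, 0.0, 1.0)
    dn = np.clip(no - n0, 0.0, 1.0)
    return ((1 - dm) * (1 - dn) * c_i[m0, n0] +
            (1 - dm) * dn * c_i[m0, n0 + 1] +
            dm * (1 - dn) * c_i[m0 + 1, n0] +
            dm * dn * c_i[m0 + 1, n0 + 1])

def motion_compensated_difference(c_i, a_i1, u, v, alpha=0.5):
    """Difference image between the virtual C-frame and the A-frame A_{i+1}."""
    return warp_virtual_c_frame(c_i, u, v, alpha) - a_i1
```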
4
Capture System and Database Capture
An experimental face capture system based on active differential imaging was used to capture moving face data for our experiments. The image sensor has a high resolution of 755*400 pixels. A face database of 37 subjects was captured in an indoor environment near a window with sunlight coming in. Each subject sat 1 meter away from the camera and was asked to keep moving his/her face. Two sessions of data were recorded for each subject, with 29 C-frames and 29 A-frames captured continuously for each session. The ambient illumination was mainly from the left for one session and from the right for the other. Another 6 difference images for each subject were captured when the subject sat still. These images serve as gallery images for the identification and verification experiments.
5
Experiments
To show the advantage brought by the proposed motion compensation approach, face recognition experiments are conducted on two sets of difference face images for comparison. The first set contains the original difference face images between every pair of C-frame and A-frame, without considering the motion issue. For the second set, the proposed approach is applied to obtain the difference images without motion effect. Faces in both sets are geometrically normalised based on the manually marked eye positions in the corresponding C-frame, cropped to
Fig. 8. Faces in difference images without motion compensation (a) and with motion compensation (b), and corresponding template images (c) for the subjects in (a) and (b)
55*50 image patches, and photometrically normalised using histogram equalisation. Examples of faces in both sets are shown in Fig. 8, and it can be seen that the motion artifacts have been removed after motion compensation. Each set contains 2072 faces (37 subjects × 2 sessions × 28 difference images).
5.1
Improvement in Face Similarity to Template
A histogram of the motion between every two frames for the whole database is shown in Fig. 9(a). For every pair of C-frames, the motion value recorded in the histogram is the average length of the motion vectors of the motion foreground pixels (pixels with motion vector magnitude above a threshold of 0.8). Since faces are normalised based on the eye positions for recognition, the same absolute displacement will have a different influence for faces with different inter-ocular distances. Therefore relative motion, which is the motion value over the inter-ocular distance, is used here to measure motion.
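A minimal sketch of the relative-motion measure described above (average flow magnitude over foreground pixels, divided by the inter-ocular distance); the 0.8 threshold is the one quoted in the text, everything else is an assumed interface.

```python
import numpy as np

def relative_motion(u, v, inter_ocular, threshold=0.8):
    """Average motion-vector length over foreground pixels (magnitude above
    the threshold), normalised by the inter-ocular distance."""
    mag = np.sqrt(u ** 2 + v ** 2)
    fg = mag > threshold
    return float(mag[fg].mean() / inter_ocular) if fg.any() else 0.0
```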
Fig. 9. (a) Histogram of the relative motion in all sequences. (b) Average improvement in NC score after using motion compensation in terms of the relative motion.
According to Fig. 9(b), the NC score in the LDA subspace is improved by motion compensation more significantly than in the PCA subspace and the original image space. When the motion is very small, applying motion compensation brings only a little improvement, but for moderate motion the improvement is significant. When the motion is too large the improvement decreases, because large motion tends to introduce a large pose change, which has a negative influence on motion estimation.
5.2
Face Identification Experiments
Face identification experiments are carried out using a Nearest Neighbor classifier based on the full face, the left/right half of the face, and the fusion of the left/right halves of the face. The similarity is measured by the NC score in the respective LDA subspaces for the full face, left face, or right face. For the fusion case, the Sum rule [2] is applied to fuse the similarity scores in the LDA subspaces for the left face and right face. As
Table 1. Face Identification Rank-1 Error Rates (%) on Moving Faces

                              with motion compensation   without motion compensation
full face                     12.07                      31.71
left face                     14.23                      32.09
right face                    10.28                      26.93
fusion of left/right faces     6.13                      14.96
Table 2. Verification Half Total Error Rates (%) on Moving Faces

                              with motion compensation   without motion compensation
full face                      7.88                      14.13
left face                      8.69                      15.63
right face                     8.72                      15.33
fusion of left/right faces     6.42                      10.95
shown in Table 1, after motion compensation the errors decrease to less than half of the errors achieved without motion compensation. Applying the fusion technique gives the best identification result, with an error rate of 6.13%.
5.3
Face Verification Experiments
For the first round of the verification experiment, the data of one session is used for evaluation and the other session for testing. Then the evaluation and testing sessions are switched for the other round of testing. Every test image is used to make a claim against all 37 identities in the gallery. So in total there are 1036 instances of true claims and 1036*36 instances of false claims during evaluation, and the same numbers of true claims and false claims in the testing stage. The average verification errors are reported in Table 2. Again, a significant improvement is achieved by applying motion interpolation. Fusion has the best performance, with the lowest error rate of 6.42%.
6
Conclusion and Future Work
Motion causes problems for face recognition systems based on active differential imaging. The similarities between probe images and the template decrease due to the artifacts in the captured face images caused by motion. In this paper we proposed an approach based on motion compensation to remove the motion effect in the difference face images. A significant improvement has been achieved in the results of face identification and verification experiments on moving faces. Since the motion artifacts also introduce difficulties in automatic face localisation, we are now investigating the performance gain of the system in the fully automatic operation scenario.
Acknowledgement The support from the Overseas Research Students Awards Scheme (ref. 2003040015) is gratefully acknowledged.
References 1. Hizem, W., Krichen, E., Ni, Y., Dorizzi, B., Garcia-Salicetti, S.: Specific sensors for face recognition. In: Proceedings of IAPR International Conference on Biometric (2006) 2. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) 3. Miura, H., et al.: A 100frame/s cmos. active pixel sensor for 3d-gesture recognition system. In: Proceeding of IEEE Solid-State Circuits Conference, pp. 142–143. IEEE Computer Society Press, Los Alamitos (1999) 4. Black, M.J., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding 63(1), 75–104 (1996) 5. Moses, Y., Adani, Y., Ullman, S.: Face recognition: the problem of compensating for the illumination direction. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801. Springer, Heidelberg (1994) 6. Ni, Y., Krichen, E., Salicetti, S., Dorizzi, B.: Active differential imaging for human face recognition. IEEE Signal Processing letter 13(4), 220–223 (2006) 7. Ni, Y., Yan, X.L.: Cmos active differential imaging device with single in-pixel analog memory. In: Proc. IEEE Eur. Solid-State Circuits Conf., pp. 359–362. IEEE Computer Society Press, Los Alamitos (2002) 8. Zhao, W., Chellappa, R., Rosenfeld, A.: Face recognition: A literature survey. ACM Computing Surveys 35, 399–458 (2003) 9. Zou, X., Kittler, J., Messer, K.: Face recognition using active Near-IR illumination. In: Proceedings of British Machine Vision Conference, pp. 209–219 (2005) 10. Zou, X., Kittler, J., Messer, K.: Ambient illumination variation removal by active Near-IR imaging. In: Proceedings of IAPR International Conference on Biometric, January 2006, pp. 19–25 (2006)
Face Recognition with Local Gabor Textons Zhen Lei, Stan Z. Li, Rufeng Chu, and Xiangxin Zhu Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun Donglu, Beijing 100080, China
Abstract. This paper proposes a novel face representation and recognition method based on local Gabor textons. Textons, defined as a vocabulary of local characteristic features, are a good description of the perceptually distinguishable micro-structures on objects. In this paper, we incorporate the advantages of Gabor features and the textons strategy together to form Gabor textons. And for the specificity of face images, we propose local Gabor textons (LGT) to portray faces more precisely and efficiently. The local Gabor textons histogram sequence is then utilized for face representation and a weighted histogram sequence matching mechanism is introduced for face recognition. Preliminary experiments on the FERET database show promising results of the proposed method. Keywords: local textons, Gabor filters, histogram sequence, face recognition.
1 Introduction
Face recognition has attracted much attention due to its potential value for applications as well as its theoretical challenges. Due to the various changes caused by expression, illumination, pose, etc., face images of the same person can vary greatly at the grey level. How to extract a representation robust to these changes therefore becomes an important problem. Up to now, many representation approaches have been introduced, including Principal Component Analysis (PCA) [11], Linear Discriminant Analysis (LDA) [2], Independent Component Analysis (ICA) [3], etc. PCA provides an optimal linear transformation from the original image space to an orthogonal eigenspace with reduced dimensionality in the sense of the least mean square reconstruction error. LDA seeks to find a linear transformation that maximizes the ratio of the between-class variance to the within-class variance. ICA is a generalization of PCA which is sensitive to the high-order relationships among the image pixels. Recently, texton-based representations have achieved great success in texture analysis and recognition. The term texton was first proposed by Julesz [6] to describe the fundamental micro-structures in natural images and was considered as the atoms of pre-attentive human visual perception. However, it remained a vague concept in the literature because of the lack of a precise definition for grey-level images. Leung and Malik [7] reinvented and operationalized the concept of textons. Textons are defined as a discrete set which is referred to as the vocabulary of local characteristic features of objects. The goal is to build the vocabulary of textons to describe the perceptually distinguishable
micro-structures on the surfaces of objects. Variants of this concept have been applied successfully to the problem of 3D texture recognition [4,7,12]. In this work, we propose a novel local Gabor textons histogram sequence for face representation and recognition. Gabor wavelets, which capture the local structure corresponding to spatial frequency, spatial localization, and orientation selectivity, have achieved great success in face recognition [13]. Therefore, we incorporate the advantages of Gabor features and the textons strategy to form Gabor textons. Moreover, in order to depict face images precisely and efficiently, we propose local Gabor textons (LGT) and then utilize the LGT histogram sequence for face representation and recognition. The rest of this paper is organized as follows. Section 2 details the construction of local Gabor textons. Section 3 describes the LGT histogram sequence for face representation and recognition. The experimental results on the FERET database are presented in Section 4, and in Section 5 we conclude this paper.
2 Local Gabor Textons (LGT) Construction
Texture is often characterized by its responses to a set of orientation- and spatial-frequency-selective linear filters, which is inspired by various evidence of similar processing in the human visual system. Here, we use Gabor filters of multiple scales and orientations, which have been extensively and successfully used in face recognition [13,8], to encode the local structure attributes embedded in face images. The Gabor kernels are defined as follows:

\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\Big( -\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2} \Big) \Big[ \exp(i k_{\mu,\nu} \cdot z) - \exp\Big(-\frac{\sigma^2}{2}\Big) \Big]   (1)

where \mu and \nu define the orientation and scale of the Gabor kernels respectively, z = (x, y), and the wave vector k_{\mu,\nu} is defined as follows:

k_{\mu,\nu} = k_\nu e^{i\phi_\mu}   (2)

where k_\nu = k_{max}/f^\nu, k_{max} = \pi/2, f = \sqrt{2}, and \phi_\mu = \pi\mu/8. The Gabor kernels in (1) are all self-similar since they can be generated from one filter, the mother wavelet, by scaling and rotating via the wave vector k_{\mu,\nu}. Each kernel is a product of a Gaussian envelope and a complex plane wave, and can be separated into real and imaginary parts. Hence, a band of Gabor filters is generated by a set of various scales and rotations. In this paper, we use Gabor kernels at five scales \nu \in \{0, 1, 2, 3, 4\} and four orientations \mu \in \{0, 2, 4, 6\} with the parameter \sigma = 2\pi [8] to derive 40 Gabor filters, comprising 20 real filters and 20 imaginary filters, which are shown in Fig. 1. By convolving face images with the corresponding Gabor kernels, for every image pixel we obtain 40 Gabor coefficients, which are subsequently clustered to form Gabor textons. Textons express the micro-structures in natural images. Compared to the periodic changes of texture images, different regions of faces usually reflect different structures. For example, the eye and nose areas have distinct differences. If we clustered the textons over the whole image, as is done in texture analysis, the size of the texton vocabulary would have to be very large to keep describing the face precisely, which would increase the computational cost dramatically. In order to depict the face in
Fig. 1. Gabor filters with 5 scales and 4 orientations. Left are the real parts and right are the imaginary ones.
Fig. 2. An example of a face image divided into 7 × 8 regions with size 20 × 15
a more efficient way, we propose local Gabor textons to represent the face image. We first divide the face image into several regions of size h × w (Fig. 2); then, for every region, the Gabor response vectors are clustered to form a local vocabulary of textons, which are called local Gabor textons (LGT). Therefore, a series of LGT can be constructed corresponding to the different regions. Specifically, we invoke the K-means clustering algorithm [5] to determine the LGT among the feature vectors. The K-means method is based on the first-order statistics of the data, and finds a predefined number of centers in the data space while guaranteeing that the sum of squared distances between the data points and the centers is minimized. In order to improve the efficiency of the computation, the LGT vocabulary L_i corresponding to the region R_i is computed by the following hierarchical K-means clustering algorithm:
1. Suppose we have 500 face images in all. These images are divided randomly into 100 groups, each of which contains 5 images. All of them are encoded by the Gabor filters.
2. For each group, the Gabor response vectors at every pixel in region R_i of size h × w are concatenated to form a set of 5 × h × w feature vectors. The K-means clustering algorithm is then applied to these feature vectors to form k' centers.
3. The centers of all the groups are merged together and the K-means clustering algorithm is applied again to these 100 × k' centers to form k centers.
4. With the initialization of the k centers derived in step 3, apply the K-means clustering algorithm on all images in region R_i to achieve a local minimum.
These k centers finally constitute the LGT vocabulary L_i corresponding to region R_i, and after performing these operations for all regions, the local Gabor textons vocabularies are constructed.
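The sketch below illustrates, under assumed parameter choices, how the Gabor kernels of Eqs. (1)-(2) and the hierarchical K-means procedure could be realised in Python. The kernel window size, the per-group centre count (the k' of step 2) and the use of scikit-learn's KMeans are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def gabor_kernel(mu, nu, size=13, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Complex Gabor kernel of Eq. (1) for orientation index mu and scale nu."""
    k = (k_max / f ** nu) * np.exp(1j * np.pi * mu / 8)   # wave vector, Eq. (2)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x ** 2 + y ** 2
    kk = np.abs(k) ** 2
    wave = np.exp(1j * (k.real * x + k.imag * y)) - np.exp(-sigma ** 2 / 2)
    return (kk / sigma ** 2) * np.exp(-kk * z2 / (2 * sigma ** 2)) * wave

def gabor_bank():
    """5 scales x 4 orientations -> 20 complex kernels (40 real coefficients)."""
    return [gabor_kernel(mu, nu) for nu in range(5) for mu in (0, 2, 4, 6)]

def lgt_vocabulary(region_features, k=64, k_group=64, n_groups=100, seed=0):
    """Hierarchical K-means for one region R_i.

    region_features: one (h*w, 40) array of Gabor responses per training image.
    """
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(len(region_features)), n_groups)
    centers = []
    for g in groups:                                  # step 2: per-group clustering
        feats = np.vstack([region_features[i] for i in g])
        centers.append(KMeans(n_clusters=k_group, n_init=1,
                              random_state=0).fit(feats).cluster_centers_)
    merged = np.vstack(centers)                       # step 3: cluster the centres
    init = KMeans(n_clusters=k, n_init=1, random_state=0).fit(merged).cluster_centers_
    all_feats = np.vstack(region_features)            # step 4: refine on all data
    return KMeans(n_clusters=k, init=init, n_init=1).fit(all_feats).cluster_centers_
```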
3 LGT Histogram Sequence for Face Representation and Recognition
With the LGT vocabularies, every pixel in region R_i of an image is mapped to the closest texton element in L_i according to the Euclidean distance in the Gabor feature space. After this operation has been carried out in all regions, a texton-labeled image f_l(x, y) is finally formed with values between 1 and k. The LGT histogram, denoted by H_i(\tau), is used to describe the distribution of the local structure attributes over the region R_i. The LGT histogram over region R_i of the labeled image f_l(x, y) induced by the local Gabor textons can be defined as

H_i(\tau) = \sum_{(x,y) \in R_i} I\{f_l(x, y) = \tau\}, \quad \tau = 0, \ldots, k - 1   (3)

in which k is the number of different labels and

I\{A\} = \begin{cases} 1 & A \text{ is true} \\ 0 & A \text{ is false} \end{cases}

The global description of a face image is built by concatenating the LGT histograms, H = (H_1, H_2, \ldots, H_n), which is called the local Gabor textons histogram sequence. The collection of LGT histograms is an efficient face representation which contains information about the distribution of the local structure attributes, such as edges, spots and flat areas, over the whole image. Fig. 3 shows the process of face representation using the LGT histogram sequence. The similarity of different LGT histogram sequences extracted from different images is computed as follows:

S(H, H') = \sum_{i=1}^{n} S^2(H_i, H_i')   (4)

where

S^2(H_i, H_i') = \sum_{\tau=1}^{k} \frac{\big(H_i(\tau) - H_i'(\tau)\big)^2}{H_i(\tau) + H_i'(\tau)}   (5)

is the Chi-square distance commonly used to match two histograms, in which k is the number of bins of each histogram. Previous work has shown that different regions of the face make different contributions to face recognition [14,1], e.g., the areas near the eyes are more important than others. Therefore, different weights can be set for different LGT histograms when measuring the similarity of two images. Thus, (4) can be rewritten as:

S(H, H') = \sum_{i=1}^{n} W_i S^2(H_i, H_i')   (6)
Fig. 3. Face representation using local Gabor textons histogram sequence
where W_i is the weight for the i-th LGT histogram and is learned based on the Fisher separation criterion [5] as follows. For a C-class problem, the similarities of different images from the same person compose the intra-personal similarity class and those of images from different persons compose the extra-personal similarity class, as introduced in [9]. For the i-th LGT histogram H_i, the mean and the variance of the intra-personal similarities can be computed by

m_{i,intra} = \frac{1}{N_{intra}} \sum_{j=1}^{C} \sum_{p=1}^{N_j} \sum_{q=p+1}^{N_j} S^2(H_i^{p,j}, H_i^{q,j})   (7)

\sigma^2_{i,intra} = \frac{1}{N_{intra}} \sum_{j=1}^{C} \sum_{p=1}^{N_j} \sum_{q=p+1}^{N_j} \big( S^2(H_i^{p,j}, H_i^{q,j}) - m_{i,intra} \big)^2   (8)

where H_i^{p,j} denotes the i-th LGT histogram extracted from the p-th image of the j-th class and N_{intra} = \sum_{j=1}^{C} \frac{N_j(N_j - 1)}{2} is the number of intra-personal sample pairs. Similarly, the mean and the variance of the extra-personal similarities of the i-th LGT histograms can be computed by

m_{i,extra} = \frac{1}{N_{extra}} \sum_{j=1}^{C-1} \sum_{s=j+1}^{C} \sum_{p=1}^{N_j} \sum_{q=1}^{N_s} S^2(H_i^{p,j}, H_i^{q,s})   (9)

\sigma^2_{i,extra} = \frac{1}{N_{extra}} \sum_{j=1}^{C-1} \sum_{s=j+1}^{C} \sum_{p=1}^{N_j} \sum_{q=1}^{N_s} \big( S^2(H_i^{p,j}, H_i^{q,s}) - m_{i,extra} \big)^2   (10)

where N_{extra} = \sum_{j=1}^{C-1} \sum_{s=j+1}^{C} N_j N_s is the number of extra-personal sample pairs. Then, the weight for the i-th LGT histogram, W_i, is derived by the following formulation:

W_i = \frac{(m_{i,intra} - m_{i,extra})^2}{\sigma^2_{i,intra} + \sigma^2_{i,extra}}   (11)
Finally, the weighted similarity of LGT histogram sequences (6) is used for face recognition with a nearest neighbor classifier.
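A compact Python sketch of the representation and matching pipeline of Eqs. (3)-(11): label each region by its nearest texton, build the histograms, compare them with the weighted chi-square measure, and learn the region weights from intra-/extra-personal distance statistics. The function boundaries and the small epsilon terms are illustrative choices, not part of the original formulation.

```python
import numpy as np

def label_region(region_feats, textons):
    """Map each 40-dim Gabor response to its nearest local Gabor texton."""
    d = ((region_feats[:, None, :] - textons[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def lgt_histogram(labels, k):
    """H_i of Eq. (3): counts of the k texton labels inside one region."""
    return np.bincount(labels, minlength=k).astype(float)

def chi2(h1, h2, eps=1e-12):
    """Chi-square distance of Eq. (5)."""
    return float((((h1 - h2) ** 2) / (h1 + h2 + eps)).sum())

def weighted_similarity(H1, H2, weights):
    """Weighted histogram-sequence comparison of Eq. (6); with this form the
    nearest neighbour is the gallery entry with the smallest value."""
    return sum(w * chi2(a, b) for w, a, b in zip(weights, H1, H2))

def fisher_weight(intra_dists, extra_dists, eps=1e-12):
    """Eq. (11): region weight from intra-/extra-personal chi-square values."""
    mi, me = np.mean(intra_dists), np.mean(extra_dists)
    return (mi - me) ** 2 / (np.var(intra_dists) + np.var(extra_dists) + eps)
```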
4 Experimental Results and Analysis
In this section, we analyze the performance of the proposed method using the FERET database. The FERET database [10] is a standard testbed for face recognition technologies. In our experiments, we use a subset of the training set containing 540 images from 270 subjects for training. Four probe sets, tested against the gallery set containing 1196 images, are used for testing according to the standard test protocols. The images in the fb and fc probe sets exhibit expression and lighting variation, respectively, and the dup I and dup II probe sets include aging images. All images are rotated, scaled and cropped to 140 × 120 pixels according to the eye positions and then preprocessed by histogram equalization. Fig. 4 shows some examples of the cropped FERET images.
Fig. 4. Example FERET images used in our experiments
There are several parameters which influence the performance of the proposed algorithm. The first one is the size of the local region. If the size is too big (e.g., the whole image), it loses the local spatial information and may not reveal the advantage of local analysis. On the other hand, if the region is too small, it will be sensitive to mis-alignment. Another parameter that needs to be optimized is the size of the LGT vocabulary (the number of local Gabor textons in every vocabulary). To determine the values
Fig. 5. The recognition rates of varying the values of parameters: the size of LGT vocabulary and the region size
Fig. 6. Cumulative match curves on the fb (a), fc (b), dup I (c) and dup II (d) probe sets
of the parameters, we use the training set to cluster the local Gabor textons and test on the fb probe set to evaluate the performance for varying parameter values. The result is shown in Fig. 5. Note that the similarities of the histogram sequences are not weighted here. The region size varies from 5 × 5 to 28 × 24 and the size of the LGT vocabulary varies from 8 to 128. As expected, too large a region size results in a decreased recognition rate because of the loss of spatial information. Considering the trade-off between recognition and computational cost, we choose a region size of 10 × 10 and an LGT vocabulary size of 64. Therefore, the face image here is divided into 14 × 12 regions of size 10 × 10, and for every region, 64 local Gabor textons and the weight of the corresponding LGT histogram are learned from the training set as described in Sections 2 and 3. Finally, the performance of the proposed method is tested on the four probe sets against the gallery. Fig. 6 shows the cumulative match curves (CMC) on the four probe sets and Table 1 shows the rank-1 recognition rates of the proposed method compared to some well-known methods.

Table 1. The rank-1 recognition rates of different algorithms on the FERET probe sets

Methods                  fb     fc     dup I   dup II
PCA                      0.78   0.38   0.33    0.12
LDA                      0.88   0.43   0.36    0.14
Best Results of [10]     0.96   0.82   0.59    0.52
The proposed method      0.97   0.90   0.71    0.67
From the results, we can observe that the proposed method compares favourably with other well-known methods. It significantly outperforms the PCA and LDA methods and is better than the best results in [10] on all four probe sets. These results indicate that the proposed method is accurate and robust to variations of expression, illumination and aging.
5 Conclusions
We have studied the use of textons as a robust approach for face representation and recognition. In particular, we propose local Gabor textons, extracted by using Gabor filters and the K-means clustering algorithm in local regions. In this way, we incorporate the effectiveness of Gabor features and the robustness of the textons strategy simultaneously, and depict face images precisely and efficiently. The local Gabor textons histogram sequence is then proposed for face representation, and a weighted histogram sequence matching mechanism is introduced for face recognition. The preliminary results on the FERET database show that the proposed method is accurate and robust to variations of expression, illumination, aging, etc. However, one drawback of our method is that the length of the feature vector used for face representation slows down the recognition speed. A possible remedy is to apply subspace methods such as PCA, LDA, etc. to reduce the dimensionality of the feature vectors, which will be tested in our future work.
Acknowledgements. This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the AuthenMetric Collaboration Foundation.
References 1. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Proceedings of the European Conference on Computer Vision, Prague, Czech, pp. 469–481 (2004) 2. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans. PAMI 19(7), 711–720 (1997) 3. Comon, P.: Independent component analysis - a new concept? Signal Processing 36, 287–314 (1994) 4. Cula, O., Dana, K.: Compact representation of bidirectional texture functions. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1041–1047. IEEE Computer Society Press, Los Alamitos (2001) 5. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, Chichester (2000) 6. Julesz, B.: Texton, the elements of texture perception, and their interactions 290(5802), 91– 97 (March 1981) 7. Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision 43(1), 29–44 (2001) 8. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11(4), 467– 476 (2002) 9. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian face recognition. Pattern Recognition 33(11), 1771–1782 (2000) 10. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000) 11. Turk, M.A., Pentland, A.P.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 12. Varma, M., Zisserman, A.: Classifying images of materials: achieving viewpoint and illumination independence. In: Proceedings of the European Conference on Computer Vision, pp. 255–271 (2002) 13. Wiskott, L., Fellous, J., Kruger, N., malsburg, C.V.: Face recognition by elastic bunch graph matching. IEEE Trans. PAMI 19(7), 775–779 (1997) 14. Zhang, W.C., Shan, S.G., Gao, W., Zhang, H.M.: Local gabor binary pattern histogram sequence (lgbphs): a novel non-statistical model for face representation and recognition. In: Proceedings of IEEE International Conference on Computer Vision, pp. 786–791. IEEE Computer Society Press, Los Alamitos (2005)
Speaker Verification with Adaptive Spectral Subband Centroids Tomi Kinnunen1 , Bingjun Zhang2 , Jia Zhu2 , and Ye Wang2 1
Speech and Dialogue Processing Lab, Institute for Infocomm Research (I2R), 21 Heng Mui Keng Terrace, Singapore 119613
[email protected] 2 Department of Computer Science School of Computing, National University of Singapore (NUS) 3 Science Drive 2, Singapore 117543 {bingjun,zhujia,wangye}@comp.nus.edu.sg
Abstract. Spectral subband centroids (SSC) have been used as an additional feature to cepstral coefficients in speech and speaker recognition. SSCs are computed as the centroid frequencies of subbands and they capture the dominant frequencies of the short-term spectrum. In the baseline SSC method, the subband filters are pre-specified. To allow better adaptation to formant movements and other dynamic phenomena, we propose to adapt the subband filter boundaries on a frame-by-frame basis using a globally optimal scalar quantization scheme. The method has only one control parameter, the number of subbands. Speaker verification results on the NIST 2001 task indicate that the selection of the parameter is not critical and that the method does not require additional feature normalization.
1
Introduction
The so-called mel-frequency cepstral coefficients [1] (MFCC) have proven to be an efficient feature set for speaker recognition. A known problem of cepstral features, however, is noise sensitivity. For instance, convolutive noise shifts the mean value of the cepstral distribution whereas additive noise tends to modify the variances [2]. To compensate for the feature mismatch between training and verification utterances, normalizations in the feature, model and score domains are commonly used [3]. Spectral subband centroids [4; 5; 6; 7] (SSC) are an alternative to cepstral coefficients. SSCs are computed as the centroid frequencies of subband spectra and they give the locations of the local maxima of the power spectrum. SSCs have been used for speech recognition [4; 5], speaker recognition [7] and audio fingerprinting [6]. Recognition accuracy of SSCs is lower in noise-free conditions compared with MFCCs. However, SSCs can outperform MFCCs in noisy conditions and they can be combined with MFCCs to provide complementary information.
Fig. 1. Computation of the SSC and the proposed OSQ-SSC features. In SSC, the subband boundaries are fixed and in OSQ-SSC, the boundaries are re-calculated for every frame by partitioning the spectrum with optimal scalar quantization.
The key component of the SSC method is the filterbank. Design issues include the number of subbands, the cutoff frequencies of the subband filters, the shape of the subband filters, overlapping of the subband filters, compression of the spectral dynamic range and so on [5]. The parameters of the filterbank can be optimized experimentally for a given task and operating conditions. In this study, our aim is to simplify the parameter setting of the SSC method by adding some self-adaptivity to the filterbank. In particular, we optimize the subband filter cutoff frequencies on a frame-by-frame basis to allow better adaptation to formant movements and other dynamic phenomena. We consider the subbands as partitions or quantization cells of a scalar quantizer. Each subband centroid is viewed as the representative value of that cell and the problem can be defined as joint optimization of the partitions and the centroids. The difference between the conventional SSC method and the proposed method is illustrated in Fig. 1.
2
Spectral Subband Centroids
In the following, we denote the FFT magnitude spectrum of a frame by S[k], where k = 1, . . . , N denotes the discrete frequency index. The index k = N corresponds to the half sample rate f_s/2. The mth subband centroid is computed as follows [5]:

c_m = \frac{\sum_{k=q_l(m)}^{q_h(m)} k\, W_m[k]\, S^{\gamma}[k]}{\sum_{k=q_l(m)}^{q_h(m)} W_m[k]\, S^{\gamma}[k]} ,   (1)

where W_m[k] are the mth bandpass filter coefficients, q_l(m), q_h(m) \in [1, N] are its lower and higher cutoff frequencies and \gamma is a dynamic range parameter.
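A small Python sketch of Eq. (1) for uniform (rectangular) subband filters, i.e. W_m[k] = 1 inside each band; the list of (lower, upper) bin indices and the 1-based indexing convention are assumptions made for the illustration.

```python
import numpy as np

def subband_centroids(spectrum, boundaries, gamma=1.0):
    """Spectral subband centroids of Eq. (1) with uniform subband filters.

    spectrum:   FFT magnitude spectrum S[k] for k = 1..N (1-D array).
    boundaries: list of (q_low, q_high) bin-index pairs, one per subband.
    """
    k = np.arange(1, len(spectrum) + 1, dtype=float)
    s = np.asarray(spectrum, dtype=float) ** gamma
    centroids = []
    for ql, qh in boundaries:
        band = slice(ql - 1, qh)                 # 1-based indices in the text
        centroids.append((k[band] * s[band]).sum() / (s[band].sum() + 1e-12))
    return np.array(centroids)
```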
The shape of the subband filter introduces bias to the centroids. For instance, the triangular-shaped filters used in MFCC computation [1] shift the centroid towards the mid part of the subband. To avoid such bias, we use a uniform filter in (1): W_m[k] = 1 for q_l(m) ≤ k ≤ q_h(m). Furthermore, we set γ = 1 in this study. With these modifications, (1) simplifies to

c_m = \frac{\sum_{k=q_l(m)}^{q_h(m)} k\, S[k]}{\sum_{k=q_l(m)}^{q_h(m)} S[k]} .   (2)

3
Adapting the Subband Boundaries
To allow better adaptation of the subband centroids to formant movements and other dynamic phenomena, we optimize the filter cutoff frequencies on a frame-by-frame basis. We use scalar quantization as a tool to partition the magnitude spectrum into K non-overlapping quantization cells. The subband cutoff frequencies, therefore, are given by the partition boundaries of the scalar quantizer. The expected value of the squared distortion for the mth cell is defined as

e_m^2 = \sum_{k=q(m-1)+1}^{q(m)} p_k (k - c_m)^2 ,   (3)

m_X' > m_X \Rightarrow S^X(n, t_X; r_X(n, t_X; m_X')) \le S^X(n, t_X; r_X(n, t_X; m_X)) .   (2)

Similarity between two test images is then given by:

S(t_G, t_P) = \frac{1}{F} \cdot \sum_n \sum_{m_G, m_P} \frac{\delta\big( r_G(n, t_G; m_G), r_P(n, t_P; m_P) \big)}{\sqrt{m_G + m_P + 1}} ,   (3)

the normalization factor F is defined as the maximal possible similarity:

F = N_N \cdot \sum_{i=1}^{N_M} \frac{1}{\sqrt{2i - 1}} .   (4)

To give an example, let r_G(n, t_G) be [1, 3, 2]. This means (for node n) the most similar model images to the test image t_G are number 1, then 3 and then 2. If r_P is [1, 2, 3], the similarity for this node would be

S = \frac{\frac{1}{\sqrt{1}} + \frac{1}{\sqrt{4}} + \frac{1}{\sqrt{4}}}{\frac{1}{\sqrt{1}} + \frac{1}{\sqrt{3}} + \frac{1}{\sqrt{5}}} = \frac{2}{2.02} = 0.99.   (5)
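The per-node rank-list similarity of Eqs. (2)-(5) can be written compactly as below; ranks are taken 0-based so that the worked example above (S ≈ 0.99) is reproduced, and averaging the per-node scores is equivalent to the global normalisation by F in Eq. (4). This is an illustrative sketch, not the authors' code.

```python
import numpy as np

def node_similarity(rank_g, rank_p):
    """Rank lists are sequences of model-image indices ordered by decreasing
    similarity (rank 0 first). Matching models contribute 1/sqrt(m_G+m_P+1)."""
    pos_p = {model: m for m, model in enumerate(rank_p)}
    score = sum(1.0 / np.sqrt(m_g + pos_p[model] + 1)
                for m_g, model in enumerate(rank_g) if model in pos_p)
    best = sum(1.0 / np.sqrt(2 * m + 1) for m in range(len(rank_g)))
    return score / best

def similarity(ranks_g, ranks_p):
    """S(t_G, t_P): mean per-node similarity over all graph nodes."""
    return float(np.mean([node_similarity(rg, rp)
                          for rg, rp in zip(ranks_g, ranks_p)]))

# Worked example from the text: node_similarity([1, 3, 2], [1, 2, 3]) ~= 0.99
```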
Fig. 2. Once the rank lists have been calculated, they describe the person's identity. For each probe image showing a person in a very different pose than the image in the gallery, the similarity to all gallery images can be calculated by the similarity function S(t_G, t_P)
3
Experiments
3.1
Data
Three poses of the CAS-PEAL face database [13] have been used: PM+45, PM+00 and PM−45. The first N_M identities build the model database, the last N_T identities the testing database. In most experiments N_M = N_T = 500 has been used; variations are shown in Table 2. Twelve of the remaining identities have been labeled manually to find the landmarks in the model and testing databases, as described in Section 2.1.
3.2
Identification Results
Recognition rates are shown in the right half of Table 1. For comparison, the results of direct jet comparison (which is not appropriate for large pose variation) are shown in the left half of that table. Table 2 shows the recognition rates for the case of gallery PM+00 – probe PM−45 for different numbers of identities in the model and testing databases. In general, recognition rates increase if the number of model identities increases or the number of testing images decreases. If the model size is equal to the test size, the results are even better with a higher number of identities, at least up to 500
(a) PM+45   (b) PM+00   (c) PM−45

Fig. 3. Examples of used CAS-PEAL poses

Table 1. Recognition rates for 500 CAS-PEAL identities for the three poses PM+45, PM+00 and PM−45. The left table shows the results of simple jet comparison, the right one the result of our system with 500 different identities in the model database.

                 Probe (simple jet comparison)     Probe (our system)
Gallery          PM+45   PM+00   PM−45             PM+45   PM+00   PM−45
PM+45            100.0     8.0     2.6             100.0    85.0    39.4
PM+00             33.0   100.0    54.2              98.0   100.0    99.0
PM−45              3.2    24.4   100.0              56.4    96.4   100.0

Table 2. Recognition rates for PM+00 – PM−45 for different numbers of model and test identities. The model database consists of the first NM identities, the testing database of the NT last ones of the 1000 used images.

Model \ Test    100     200    300     400    500    600    700    800    900
100            96.0    95.5   89.7    88.5   83.8   81.2   79.6   78.3   75.9
200            99.0    97.5   96.7    95.3   93.4   92.0   90.4   89.6    —
300            99.0    97.5   97.3    96.8   96.8   96.5   96.1    —      —
400           100.0    98.0  100.0    98.8   98.6   98.3    —      —      —
500            99.0    98.5   99.0    99.0   99.0    —      —      —      —
600            99.0    98.0   99.7    99.5    —      —      —      —      —
700           100.0    99.0   99.3     —      —      —      —      —      —
800           100.0    99.5    —       —      —      —      —      —      —
900            98.0     —      —       —      —      —      —      —      —
1
Fig. 4. Identification performance (cumulative match scores) for different poses in gallery (left of ↔) and probe (right of ↔) set. The model database consists of 500 identities, the testing database of 500 different ones.
similarities lead to better recognition rates for even bigger pose differences than 90◦ . What can be observed, as already in Table 1, is that recognition is generally better if the gallery consists of frontal pose. In our experiments, the rank similarity function as defined in (3) performed best. For PM+00↔PM−45 it reached 99% of recognition rate. The Spearman rank-order correlation coefficient lead to only 93.4%.
Fig. 5. Verification performance for PM+00 – PM−45 for 500 model and 500 test identities. a) and b) are without normalization and have an EER of 2.6%. c) and d) are the results with normalization according to (7); the EER is between 0.4% and 0.6%.
3.3
Verification Results
To test our method on verification tasks, each of the 500 gallery identities is compared to all probe identities. This means there are 500 clients and 500·499 imposters. This has been done for the case of gallery PM+00 – probe PM−45. Figure 5 shows the results. Figure 5 a) shows the Correct Acceptance Rate (CAR) over the False Acceptance Rate (FAR), and Figure 5 b) the probability distributions for clients and imposters to attain a certain score. The Equal Error Rate (EER) is 2.6%. To improve the EER in verification experiments, one can use the distribution of similarities to all gallery images to normalize the similarity values of each probe image. For that, each probe image is tested against all gallery identities, and the resulting 500 similarities for each probe image are normalized in the following way:

\bar{S}(t_P) = \frac{1}{N_G} \sum_{t_G=0}^{N_G-1} S(t_G, t_P)   (6)

S_n(t_G, t_P) = \frac{S(t_G, t_P) - \bar{S}(t_P)}{\sqrt{\frac{1}{N_G} \sum_{t_G=0}^{N_G-1} \big( S(t_G, t_P) - \bar{S}(t_P) \big)^2}}   (7)
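Equations (6)-(7) amount to z-normalising, for each probe image, its N_G gallery similarity scores; a minimal sketch:

```python
import numpy as np

def normalize_probe_scores(scores):
    """Eqs. (6)-(7): z-normalise the N_G gallery scores of one probe image."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    std = np.sqrt(((scores - mean) ** 2).mean())
    return (scores - mean) / (std + 1e-12)
```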
Fig. 5 shows the resulting improvement in CAR and EER; the latter lies between 0.4% and 0.6%.
4
Discussion
We have presented a module for a face recognition system which can recognize or verify persons in a pose which is very different from the one enrolled. We have demonstrated the efficiency on 45◦ pose variation. An interesting side result is that the frontal pose is the best to be used as a gallery. For a complete system, the recognition step must be preceded by a rough pose estimation and look-up in the respective pose model. As the database used has considerable scatter in the actual pose angles it can be concluded that our method is robust enough to expect that the full range of poses can be covered with relatively few examples.
Acknowledgments We gratefully acknowledge funding from Deutsche Forschungsgemeinschaft (WU 314/2-2 and MA 697/5-1), the European Union, the European Regional Development Fund, and the land of Northrhine-Westphalia in the program “Zukunftswettbewerb Ruhrgebiet” and the “Ruhrpakt”. Portions of the research in this paper use the CAS-PEAL face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd.
References
1. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1063–1074 (2003)
2. Saul, L.K., Roweis, S.T.: Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
3. Okada, K., von der Malsburg, C.: Pose-invariant face recognition with parametric linear subspaces. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, Washington D.C., pp. 71–76 (2002)
4. Tewes, A.: A Flexible Object Model for Encoding and Matching Human Faces. Shaker Verlag, Ithaca, NY, USA (2006)
5. Tewes, A., Würtz, R.P., von der Malsburg, C.: A flexible object model for recognising and synthesising facial expressions. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 81–90. Springer, Heidelberg (2005)
6. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C — The Art of Scientific Programming. Cambridge University Press, Cambridge (1988)
7. Andrew, L., Rukhin, A.O.: Nonparametric measures of dependence for biometric data studies. Journal of Statistical Planning and Inference 131, 1–18 (2005)
8. Ayinde, O., Yang, Y.H.: Face recognition approach based on rank correlation of gabor-filtered images. Pattern Recognition 35(6), 1275–1289 (2002)
76
M.K. M¨ uller et al.
9. Lades, M., Vorbr¨ uggen, J.C., Buhmann, J., Lange, J., von der Malsburg, C., W¨ urtz, R.P., Konen, W.: Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42(3), 300–311 (1993) 10. Wiskott, L., Fellous, J.M., Kr¨ uger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 775–779 (1997) 11. Fritzke, B.: A growing neural gas network learns topologies. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances NIPS 7, pp. 625–632. MIT Press, Cambridge, MA (1995) 12. Heinrichs, A., M¨ uller, M.K., Tewes, A.H., W¨ urtz, R.P.: Graphs with principal components of Gabor wavelet features for improved face recognition. In: Crist´ obal, G., Javidi, B., Vallmitjana, S. (eds.) Information Optics: 5th International Workshop on Information Optics; WIO’06, American Institute of Physics, pp. 243–252 (2006) 13. Gao, W., Cao, B., Shan, S., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL largescale Chinese face database and baseline evaluations. Technical Report JDL-TR-04FR-001, Joint Research & Development Laboratory for Face Recognition, Chinese Academy of Sciences (2004)
Feature Correlation Filter for Face Recognition Xiangxin Zhu, Shengcai Liao, Zhen Lei, Rong Liu, and Stan Z. Li Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, 100080 Beijing, China {xxzhu,scliao,zlei,rliu,szli}@nlpr.ia.ac.cn http://www.cbsr.ia.ac.cn
Abstract. Correlation filters for pattern recognition have been extensively studied in the areas of automatic target recognition (ATR) and biometrics. Whereas conventional correlation filters operate directly on image pixels, in this paper we propose a novel method, called the "feature correlation filter (FCF)", which extends the concept of the correlation filter to feature spaces. The FCF preserves the benefits of conventional correlation filters, i.e., shift invariance, insensitivity to occlusion, and a closed-form solution, and also inherits the virtues of the feature representations. Moreover, since the feature size is often much smaller than the image size, the FCF method can significantly reduce the storage requirement of a recognition system. Comparative results on the CMU-PIE and FRGC2.0 databases show that the proposed FCFs achieve noteworthy performance improvements over their conventional counterparts. Keywords: Feature correlation filters, Correlation filters, MACE, Face recognition.
1 Introduction
Face recognition has received much attention due to its potential value for applications as well as its theoretical challenges. However, despite the research advances over the years, face recognition is still a highly difficult task in practice due to the large variability of facial images. The variations between images of the same face can generally be divided into two categories: appearance variations and man-made variations. Appearance variations include facial expression, pose, aging, illumination changes, etc. Man-made variations are mainly due to imperfections of the capture devices and image processing technologies, e.g. camera noise and the face registration error resulting from imperfect face detection [1]. The performance of many recognition algorithms degrades significantly in these cases. A common approach to overcome the effect of appearance variations is to use face representations that are relatively insensitive to those variations. To deal with the man-made variations, using correlation to characterize the similarity
between faces is a natural choice, because of its shift-invariant property and the optimality in the presence of additive white Gaussian noise [2]. As far as correlation is mentioned, we often think of the Matched Filter (MF), which is essentially a template image designed to match with a specific pattern. The correlation peak value provides the likelihood measure, and a shift of the test image simply produces a shift of the correlation peak. However the MFs are very sensitive to even slight deformations [3]. Hence, to capture all possible image variations, a huge number of MFs are needed, which is definitely computational and storage impractical [4]. To solve these problems, more sophisticated correlation filter design techniques are proposed in various literatures [5][6][7]. The advanced correlation filters construct a single filter or template from a set of training images and can be computed analytically and effectively using frequency domain techniques. Among such advanced correlation filters, Minimum Average Correlation Energy(MACE) filter [8] is one of the most well known. Savvides et al. showed that the MACE is, to some extent, robust to illumination variations [4]. However, since the correlation filters (including MACE) operate directly on the raw image pixel data (sometimes with illumination normalization [9]) without feature extraction, it may not achieve stability with respect to deformations caused by expression and pose variations, thus may not gain good generalization ability. We think this might be the Achilles heel of this approach [10]. The motivation of this work is to explore potentials of MF types of matchers by combining the advantages of the correlation method and the feature representations of faces. This leads to the idea of the proposed feature correlation filter (FCF) method. To overcome the limitation of the correlation filters, we extend the concept of correlation filter to the feature space; we do the correlation using face image features rather than directly using image pixel values. “The most common motivation for using features rather than the pixels directly is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data”(Viola and Jones [11]). Moreover, the FCF method could significantly reduce the size of templates stored in the system, because the size of feature is often much smaller than the size of image. Many representation approaches have been introduced, such as Eigenface (PCA) [12], Fisherface (LDA) [3], ICA [13], Haar features [11], Gabor wavelet features [14], etc. Among these, we choose to use the PCA features and Gabor wavelet features for constructing FCFs and to demonstrate the effectiveness of the method. By combining the advantages of the correlation filter and feature representation of faces, we expect to gain improved generalization and discriminant capability. First, we formulate the feature extraction procedure as inner products, then perform the correlation on features rather than raw image pixel values. The FCF is constructed by solving an optimization problem in frequency domain. Finally a closed-form solution is obtained. While the focus of this work will be on the MACE filter criterion, it should be stated that all of the results presented here are equally applicable to many
other kinds of the advanced correlation filters with appropriate changes to the respective optimization criteria. The rest of this paper is organized as follows. Section 2 reviews the conventional MACE filter briefly. In Section 3, we introduce the basic concept of the FCF, and explain in detail how to construct the FCF. In Section 4, the experiments on PIE and FRGC2.0 database are presented. Finally, we conclude this paper and point out some further research in Section 5.
2 Minimum Average Correlation Energy (MACE) Filter
Notation: In this and the following section, we denote matrices by lightface characters and vectors by bold characters. Uppercase symbols refer to frequency-domain terms, while lowercase symbols represent quantities in the space domain. Suppose we have N facial images from a certain person. We consider each 2-dimensional image as a d × 1 column vector xi (i = 1, 2, . . . , N) by lexicographically reordering the image, where d is the number of pixels. The discrete Fourier transform (DFT) of xi is denoted by Xi, and we define the training image data matrix in the frequency domain as X = [X1 X2 . . . XN]. X is a d × N matrix. Let the vector h be the correlation filter (correlation template) in the space domain and H be its Fourier transform. The correlation result of the i-th image and the filter can be written as

c_i(m, n) = h(m, n) \circ x_i(m, n) = \langle h, x_i^{m,n} \rangle    (1)
where ◦ denotes correlation and ⟨·,·⟩ denotes the inner product of two vectors. Here x_i^{m,n} is the vector obtained by circularly shifting the i-th training image by m pixels horizontally and n pixels vertically and reordering it into a 1-dimensional vector. Keep in mind that h and xi are both 1-dimensional vectors obtained by reordering a 2-dimensional array. Since the correlation actually operates on a 2-dimensional plane, we use two indices, m and n, to indicate the elements in these vectors. From Eq. (1), we can see that each value in the correlation output plane is simply an inner product of the filter and the shifted input image. By Parseval's theorem, the correlation energy of c_i(m, n) can be rewritten as follows using its frequency-domain representation C_i(u, v):

E_i = \sum_m \sum_n |c_i(m, n)|^2 = \frac{1}{d} \sum_u \sum_v |C_i(u, v)|^2 = \frac{1}{d} \sum_u \sum_v |H(u, v)|^2 |X_i(u, v)|^2 = \frac{1}{d} H^{\dagger} D_i H    (2)

where D_i is a d × d diagonal matrix containing the power spectrum of training image X_i along its diagonal. The superscript † denotes conjugate transpose.
The objective of the MACE filter is to minimize the average correlation energy over the image class while simultaneously satisfying a linear constraint that the correlation values at the origin due to the training images take on pre-specified values stored in the vector u = [u_1 u_2 · · · u_N]^T, i.e.

c_i(0, 0) = X_i^{\dagger} H = u_i    (3)

The average correlation energy over all training images is

E_{avg} = H^{\dagger} D H, \qquad \text{where } D = \frac{1}{N} \sum_{i=1}^{N} D_i    (4)
Minimize E_avg subject to the constraint X† H = u. The solution can be obtained using Lagrange multipliers [8]:

H = D^{-1} X (X^{\dagger} D^{-1} X)^{-1} u    (5)
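As a sketch of Eq. (5), the filter can be computed with 2-D FFTs as follows; the use of NumPy and all variable names are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def mace_filter(images, u=None):
    """Compute a MACE filter from N training images of one person (Eq. (5)).

    images : array of shape (N, h, w), training images of one person.
    u      : length-N vector of constrained origin values (default: all ones).
    Returns the filter H in the frequency domain, reshaped to (h, w).
    """
    N, h, w = images.shape
    # Columns of X are the 2-D DFTs of the training images, vectorized.
    X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)   # d x N
    if u is None:
        u = np.ones(N)
    # D is diagonal with the average power spectrum; store only its diagonal.
    Dvec = (np.abs(X) ** 2).mean(axis=1)
    Dinv_X = X / Dvec[:, None]                      # D^{-1} X
    A = X.conj().T @ Dinv_X                         # X^dagger D^{-1} X   (N x N)
    H = Dinv_X @ np.linalg.solve(A, u)              # Eq. (5)
    return H.reshape(h, w)
```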
3 Feature Correlation Filter for Face Recognition

3.1 MACE in Feature Space
All linear feature extraction methods can be expressed as follows:

y_i = v^{\dagger} x_i    (6)

where v = [v_1 v_2 . . . v_M] is a d × M feature extraction matrix and y_i ∈ R^M is the feature vector of length M; M depends on which feature is used. The discrete Fourier transform (DFT) of v_j is denoted by V_j, and we define V = [V_1 V_2 . . . V_M]. Note that, if we use PCA features, the columns of v in Eq. (6) are the eigenfaces. Extracting Gabor features can also be formulated using Eq. (6), since applying a Gabor wavelet kernel to the input image is simply an inner product. As described in Eq. (1), the correlation plane is a collection of correlation values, each obtained by an inner product of the image and the template. In our FCF method, each correlation value is instead obtained by an inner product of two feature vectors. The feature correlation output of the input image x_i (i = 1, 2, . . . , N) and the template h in the spatial domain can be formulated as

c_i(m, n) = \langle y_h, y_i \rangle = \langle v^{\dagger} h, v^{\dagger} x_i^{m,n} \rangle = \sum_{j=1}^{M} (v_j^{\dagger} h) \cdot (v_j^{\dagger} x_i^{m,n})    (7)
Here · denotes a simple scalar product, and y_h is the feature vector extracted from the template h. Comparing Eq. (7) with Eq. (1), we can see the difference in definition between our FCF method and the conventional correlation method.
The frequency-domain representation of c_i(m, n) is

C_i(u, v) = \frac{1}{d} \sum_{j=1}^{M} (V_j^{\dagger} H) \cdot V_j(u, v) \cdot X_i^{*}(u, v)    (8)

where the superscript ∗ denotes complex conjugation. With Eq. (8), we can obtain the expression of the correlation plane energy:

E_i = \frac{1}{d} \sum_u \sum_v |C_i(u, v)|^2 = \frac{1}{d^3} H^{\dagger} V V^{\dagger} D_i V V^{\dagger} H    (9)

The average correlation plane energy is

E_{avg} = H^{\dagger} V V^{\dagger} D V V^{\dagger} H    (10)

And the constraint at the origin is

X^{\dagger} V V^{\dagger} H = u    (11)

Letting P = V^{\dagger} H, the objective function and the constraint can be rewritten as

E_{avg} = P^{\dagger} V^{\dagger} D V P    (12)

subject to

X^{\dagger} V P = u    (13)

Minimizing E_avg in Eq. (12), subject to the constraint in Eq. (13), we get the closed-form solution:

P = (V^{\dagger} D V)^{-1} V^{\dagger} X \left( X^{\dagger} V (V^{\dagger} D V)^{-1} V^{\dagger} X \right)^{-1} u    (14)

Then we obtain the feature correlation filter P, which is an M × 1 complex vector. Recall that M is the number of features used. Letting \tilde{D} = V^{\dagger} D V and \tilde{X} = V^{\dagger} X, Eq. (14) can be rewritten as

P = \tilde{D}^{-1} \tilde{X} (\tilde{X}^{\dagger} \tilde{D}^{-1} \tilde{X})^{-1} u    (15)
Comparing Eq. (15) with Eq. (5), we can see that the FCF and the MACE filter share the same formulation. Note that for each person we only need to store P in the recognition system, while a conventional correlation filter based system stores a template (filter) H. Typically the size of P is much smaller than the size of H, so the FCF can significantly reduce the size of the templates stored in the system. However, in an FCF system the d × M matrix V also needs to be stored for verification, so this storage advantage may not show up until the number of people enrolled in the system grows larger than M.
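Analogously, a sketch of Eq. (15) under the same conventions as the MACE sketch above (V collects the DFTs of the spatial feature-extraction kernels, e.g. eigenfaces); again this is illustrative code, not the authors' implementation:

```python
import numpy as np

def fcf_filter(images, feature_kernels, u=None):
    """Compute a feature correlation filter P (Eq. (15)).

    images          : (N, h, w) training images of one person.
    feature_kernels : (M, h, w) spatial feature-extraction kernels
                      (e.g. eigenfaces, or Gabor kernels at fixed positions).
    Returns P, an M-dimensional complex vector.
    """
    N = images.shape[0]
    X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)          # d x N
    V = np.stack([np.fft.fft2(v).ravel() for v in feature_kernels], axis=1)     # d x M
    if u is None:
        u = np.ones(N)
    Dvec = (np.abs(X) ** 2).mean(axis=1)               # diagonal of D
    D_tilde = V.conj().T @ (V * Dvec[:, None])         # V^dagger D V   (M x M)
    X_tilde = V.conj().T @ X                           # V^dagger X     (M x N)
    B = np.linalg.solve(D_tilde, X_tilde)              # D_tilde^{-1} X_tilde
    A = X_tilde.conj().T @ B                           # X_tilde^dagger D_tilde^{-1} X_tilde
    return B @ np.linalg.solve(A, u)                   # Eq. (15)
```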
3.2 Face Verification Using FCF
Given a test image x_t, the output correlation plane can be obtained using Eq. (16):

c_t(m, n) = \mathcal{F}^{-1}(C_t(u, v)) = \mathcal{F}^{-1}\left( \sum_{i=1}^{M} V_i(u, v) \cdot P(i) \cdot X_t^{*}(u, v) \right)    (16)

where \mathcal{F}^{-1}(\cdot) is the inverse Fourier transform and X_t is the DFT of x_t. We compute the peak-to-sidelobe ratio (PSR) of c_t(m, n) to measure the peak sharpness:

PSR = \frac{peak - mean}{\sigma}    (17)

where the peak is the largest value in the correlation output, and the mean and σ are the average value and the standard deviation of all the correlation outputs, excluding a 5 × 5 region centered at the peak.
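A sketch of this verification step (Eqs. (16)-(17)); the 5 × 5 exclusion window follows the description above, while the function and variable names are ours:

```python
import numpy as np

def fcf_correlation_plane(test_image, feature_kernels, P):
    """Correlation output c_t(m, n) of a test image with an FCF (Eq. (16))."""
    Xt = np.fft.fft2(test_image)
    V = np.stack([np.fft.fft2(v) for v in feature_kernels], axis=0)   # M x h x w
    Ct = (V * P[:, None, None]).sum(axis=0) * np.conj(Xt)
    return np.real(np.fft.ifft2(Ct))

def peak_to_sidelobe_ratio(c, exclude=5):
    """PSR of a correlation plane (Eq. (17)), excluding a window around the peak."""
    r0, c0 = np.unravel_index(np.argmax(c), c.shape)
    peak = c[r0, c0]
    mask = np.ones_like(c, dtype=bool)
    half = exclude // 2
    mask[max(r0 - half, 0):r0 + half + 1, max(c0 - half, 0):c0 + half + 1] = False
    sidelobe = c[mask]
    return (peak - sidelobe.mean()) / sidelobe.std()
```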
4 Experiments
To demonstrate the effectiveness of our proposed FCF method, several experiments are carried out on the illumination subset of the CMU-PIE database [15] and two subsets of the FRGC2.0 data [16]. The experiments on CMU-PIE show that the FCFs preserve the advantages of the conventional MACE filter, namely shift invariance and insensitivity to partial occlusion. The experiments on the FRGC2.0 database, which is much larger and more difficult, show the improved performance of our FCF methods. In the following experiments, the centers of the eyes of each image are manually located first; rotation and scaling transformations then align the eye centers to predefined locations. Finally, the face image is cropped to a size of 100 × 100 to extract the facial region, which is further normalized to zero mean and unit variance. We choose the most representative holistic and local features, i.e. Eigenface (PCA) and Gabor wavelets, to construct the FCFs. For PCA, we use the eigenfaces that retain 98% of the total variation, i.e., 134 eigenfaces for the experiments on the CMU-PIE database and 400 for the experiments on FRGC2.0. The three most significant eigenfaces are discarded, since they capture the variation due to lighting [3]. The first 5 eigenfaces used for the CMU-PIE database are shown in Fig. 1. We use Gabor kernels at 5 scales and 4 orientations. In order to obtain a reasonably small feature vector, we extract the Gabor features at 49 predetermined positions, shown in Fig. 1. The total number of Gabor features used is 980. The two kinds of FCFs are termed PCA-FCF and Gabor-FCF, respectively. The correlation outputs of MACE, PCA-FCF and Gabor-FCF for typical genuine and imposter images are shown in Fig. 2. The storage costs are listed in Table 1.
Fig. 1. Left: the first 5 Eigenfaces used for PIE database. Right: The 49 predetermined positions for Gabor feature extraction.
Fig. 2. The correlation output planes of MACE, PCA-FCF and Gabor-FCF, respectively (from left to right). Upper row: the correlation outputs of a typical genuine image. Lower row: the correlation outputs of a typical imposter image.

Table 1. The Storage Cost of Each Template (In KBytes)

Method     MACE    PCA-FCF    Gabor-FCF
Storage    80.0    3.2        7.8

4.1 CMU-PIE Database
The illumination subset of the CMU-PIE database we use consists of 67 people with 21 illumination variations (shown in Fig. 3). We use 3 images per person to construct the MACE filter and the FCFs; the other 18 images are used for testing. The test is carried out in three scenarios (shown in Fig. 3): (1) full-size facial images are used for testing; (2) the testing images are partially occluded (10 pixels from the top and 15 pixels from the bottom); (3) the testing images are shifted (10 pixels to the left) to simulate registration error. All three cases are often encountered in real applications. Note that we train the MACE filter and the FCFs using full-size images. The results are shown in Table 2. The performance of the PCA method degrades significantly in the occlusion and shifting scenarios, while the MACE and FCF methods still perform well. The
Fig. 3. Left: the 21 images of different illuminations from Person 2 of the PIE database. Right: the full-size image, partially occluded image and shifted image.

Table 2. The Rank-1 recognition rates on the CMU-PIE database

Methods      Full Size    Partial    Shifted
MACE         100.00%      99.59%     100.00%
PCA          98.76%       24.30%     10.78%
PCA-FCF      100.00%      96.60%     100.00%
Gabor-FCF    100.00%      100.00%    100.00%
results demonstrate that the FCFs preserve the MACE filter's advantages, i.e., insensitivity to occlusion and shift invariance, which are desired in real applications. Since there are only 67 people in the CMU-PIE database and only illumination variation is involved (the 21 images are taken almost simultaneously, so the expression and pose in the images are exactly the same), both MACE and FCF obtain nearly perfect recognition rates.

4.2 FRGC2.0 Database
To further evaluate the performance of the FCF-based method and show its improved recognition ability, we use the FRGC2.0 database [16], which is much larger and more difficult. Two subsets are used: the controlled subset and the uncontrolled one. The controlled images, taken in a studio setting, are full-frontal facial images captured under two lighting conditions and with two facial expressions. The uncontrolled images were taken in varying illumination conditions, e.g., hallways, atria, or outdoors. Each set of uncontrolled images contains two expressions. The images were taken in different semesters, and some images are blurred. Some sample images are shown in Fig. 4. From each subset, we randomly select 222 people and 20 images per person. 5 images of each person are used for training, and the other 15 for testing. The test results are summarized in Table 3. The results show that the FCF methods significantly outperform the MACE filter. This improvement mainly derives from the combination of the advanced correlation filters and the feature representation.
Fig. 4. Some example images from the FRGC database. Top row: controlled images. Bottom row: uncontrolled images.
Table 3. The Rank-1 recognition rates on the FRGC2.0 database

Methods      Controlled    Uncontrolled
MACE         80.99%        34.29%
PCA          76.73%        51.11%
PCA-FCF      91.92%        64.14%
Gabor-FCF    93.84%        67.57%

5 Conclusions
This work aims to bring a promising perspective to the use of correlation filters in biometric recognition. The major contribution is the idea of incorporating feature representations of faces into the correlation filter, to take advantage of both the advanced correlation filter and the feature representation. We formulated the feature extraction as inner products, and obtained a closed-form solution by solving an optimization problem in the frequency domain. The experimental results show that the proposed FCF methods significantly outperform their conventional counterpart. Working with fewer features, the FCF method can significantly reduce the size of the template stored in the system. While the FCFs in this paper are in essence a linear method, in future work we will consider a nonlinear extension. We will also consider other types of features, as well as the selection of the best features and combinations of different feature types for correlation. Acknowledgments. This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the AuthenMetric Collaboration Foundation.
References 1. Rentzeperis, E., Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: Impact of Face Registration Errors on Recognition, presented at Artificial Intelligence Applications and Innovations (AIAI06) (2006) 2. Vanderlugt, A.: Signal detection by complex spatial filtering. IEEE Trans. Inf. Theory 10, 139–145 (1964) 3. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. Pattern Analysis and Machine Intelligence, IEEE Transactions 19, 711–720 (1997) 4. Savvides, M., Kumar, B.V.K.V., Khosla, P.K.: Corefaces - Robust shift invariant PCA based correlation filter for illumination tolerant face recognition. IEEE Comp. Vision and Patt. Rec (CVPR) (2004) 5. Kumar, B.V.K.V., Mahalanobis, A., Juday, R.D.: Correlation Pattern Recognition. Cambridge Univ. Press, Cambridge, U.K. (2005) 6. Vijara Kumar, B.V.K.: Minimum variance synthetic discriminant functions. J. Opt. Soc. Amer. A 3, 1579C1584 (1986) 7. Mahalanobis, A., Vijaya Kumar, B.V.K., Song, S., Sims, S.R.F., Epperson, J.: Unconstrained correlation filters. Appl. Opt. 33, 3751C3759 (1994) 8. Mahalanobis, A., Kumar, B.V.K.V., Casasent, D.: Minimum average correlation energy filters. Appl. Opt. 26, 3630–3633 (1987) 9. Savvides, M., Kumar, B.V.K.V.: Illumination normalization using logarithm transforms for face authentication. In: 4th Int. Conf. AVBPA (2003) 10. Vijaya Kumar, B.V.K., Savvides, M., Xie, C.: Correlation Pattern Recognition for Face Recognition. Proceedings of the IEEE 94, 1963–1976 (2006) 11. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition (CVPR) (2001) 12. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cognitive Neuroscience 3, 72–86 (1991) 13. Bartlett, M.S., Lades, H.M., Sejnowski, T.J.: Independent component representations for face recognition. In: Proceedings of the SPIE, Conference on Human Vision and Electronic Imaging (1998) 14. Wiskott, L., Fellous, J., Kruger, N., malsburg, C.V.: Face recognition by elastic bunch graph matching. Pattern Analysis and Machine Intelligence, IEEE Transactions 19, 775–779 (1997) 15. Sim, T., Baker, S., Bsat, M.: The CMU Pose, Illumination, and Expression (PIE) Database of Human Faces, Robotics Institute, Carnegie Mellon University CMURI-TR-01-02 (2001) 16. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Jin, C., Hoffman, K., Marques, J., Jaesik, M., Worek, W.: Overview of the face recognition grand challenge. In: Proc. IEEE Conf. Comp. Vision Pattern Rec. (CVPR) (2005)
Face Recognition by Discriminant Analysis with Gabor Tensor Representation Zhen Lei, Rufeng Chu, Ran He, Shengcai Liao, and Stan Z. Li Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun Donglu, Beijing 100080, China
Abstract. This paper proposes a novel face recognition method based on discriminant analysis with Gabor tensor representation. Although the Gabor face representation has achieved great success in face recognition, its huge number of features often brings about the problem of curse of dimensionality. In this paper, we propose a 3rd-order Gabor tensor representation derived from a complete response set of Gabor filters across pixel locations and filter types. 2D discriminant analysis is then applied to unfolded tensors to extract three discriminative subspaces. The dimension reduction is done in such a way that most useful information is retained. The subspaces are finally integrated for classification. Experimental results on FERET database show promising results of the proposed method. Keywords: discriminant analysis, Gabor tensor representation, face recognition.
1 Introduction

Face images lie in highly nonlinear and nonconvex manifolds in the image space due to the various changes caused by expression, illumination, pose, etc. Designing a robust and accurate classifier for such a nonlinear and nonconvex distribution is difficult. One approach to simplify the complexity is to construct a local appearance-based feature space, using appropriate image filters, so that the distributions of faces are less affected by various changes. Gabor wavelet-based features have been used for this purpose [5,14,8]. The Gabor wavelets, whose kernels are similar to the two-dimensional (2D) receptive field profiles of the mammalian cortical simple cells, exhibit desirable characteristics of spatial locality and orientation selectivity. The representation is robust to variations due to expression and illumination changes and is one of the most successful approaches for face recognition. However, Gabor features are usually very high-dimensional and there are redundancies among them. It is well known that face images lie in a manifold of intrinsically low dimension; therefore, the Gabor feature representations of faces can be analyzed further to extract the underlying manifold by statistical approaches such as subspace methods.
Subspace methods such as PCA and LDA [1,13] have been extensively studied in face recognition research. PCA uses the Karhunen-Loeve transform to produce the most expressive subspace for face representation and recognition by minimizing the residual of the reconstruction. However, it does not utilize any class information and so it may drop some important clues for classification. LDA was then proposed; it seeks the subspace of features that best separates different face classes by maximizing the ratio of the between-class scatter to the within-class scatter. It is an example of the most discriminating subspace methods. However, because of the usually high dimension of the feature space (e.g. the total number of pixels in an image) and the small sample size, the within-class scatter matrix S_w is often singular, so the optimal solution of LDA cannot be found directly. Therefore, some variants of LDA have been proposed, such as PCA+LDA (Fisher-LDA), Direct LDA (D-LDA), and Null space LDA (N-LDA) [1,19,2]. However, these LDAs all try to solve the singularity problem of S_w instead of avoiding it, and none of them can avoid losing some discriminative information helpful for recognition. Recently, the Two-Dimensional Linear Discriminant Analysis (2D-LDA) method [18,6] has been discussed as a generalization of traditional 1D-LDA.¹ The main idea of the 2D method is to construct the scatter matrices using image matrices directly instead of vectors. As the image scatter matrices have a much smaller size, 2D-LDA significantly reduces the computational cost and avoids the singularity problem. Further generalizations have also been proposed which represent each object as a general tensor of second or higher order, such as PCA and discriminant analysis with tensor representation [15,16]. Previous work [3,9,12] usually performs subspace analysis with a Gabor vector representation. Because of the high dimension of Gabor feature vectors and the scarcity of available data, this leads to the curse of dimensionality problem. Moreover, for objects such as face images, a vector representation ignores higher-order structures in the data. Yan et al. [16] proposed discriminant analysis with tensor representation. It encodes an object as a general tensor and iteratively learns multiple interrelated subspaces for obtaining a lower-dimensional space. However, the computational convergence of the method is not guaranteed and it is hard to extend it to a kernel form. In this paper, we propose a novel discriminant analysis method with Gabor tensor representation for face recognition. Compared to the method in [16], we introduce an alternative way of performing feature extraction on the Gabor tensor, such that we can derive a non-iterative method for discrimination and easily extend it to kernel learning. The algorithm is divided into three steps. First, a face image is encoded by Gabor filters to form a 3rd-order tensor. Second, the 3rd-order tensor is unfolded into three 2nd-order tensors and 2D linear/kernel discriminant analysis is conducted on each of them to derive three discriminative subspaces. Finally, these subspaces are integrated and reduced further for classification with a nearest neighbor classifier. The rest of the paper is organized as follows: Section 2 describes the Gabor tensor representation. Section 3 details the discriminant analysis method with Gabor tensor representation. The experimental results on the FERET database are presented in Section 4, and Section 5 concludes the paper.
¹ LDA based on vectors is noted as 1D-LDA.
2 Gabor Tensor Representation

The representation of faces using Gabor features has been extensively and successfully used in face recognition [14]. Gabor features exhibit desirable characteristics of spatial locality and orientation selectivity and are optimally localized in the space and frequency domains. The Gabor kernels are defined as follows:

\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k_{\mu,\nu} \cdot z) - \exp\left(-\frac{\sigma^2}{2}\right)\right]    (1)

where \mu and \nu define the orientation and scale of the Gabor kernels respectively, z = (x, y), and the wave vector k_{\mu,\nu} is defined as follows:

k_{\mu,\nu} = k_\nu e^{i\phi_\mu}    (2)

where k_\nu = k_{max}/f^{\nu}, k_{max} = \pi/2, f = \sqrt{2}, and \phi_\mu = \pi\mu/8. The Gabor kernels in (1) are all self-similar since they can be generated from one filter, the mother wavelet, by scaling and rotating via the wave vector k_{\mu,\nu}. Each kernel is the product of a Gaussian envelope and a complex plane wave; the first term in the square brackets in (1) determines the oscillatory part of the kernel and the second term compensates for the DC value. Hence, a bank of Gabor filters is generated by a set of various scales and rotations of the kernel. In this paper, we use Gabor kernels at five scales \nu ∈ {0, 1, 2, 3, 4} and eight orientations \mu ∈ {0, 1, 2, 3, 4, 5, 6, 7} with the parameter \sigma = 2\pi [9] to derive the Gabor representation by convolving face images with the corresponding Gabor kernels. For every image pixel we thus have 40 Gabor magnitude coefficients, which can be regarded as a Gabor feature vector of 40 dimensions. Therefore, an h × w 2D image can be encoded by 40 Gabor filters to form a 40 × h × w 3rd-order Gabor tensor. Fig. 1 shows an example of a face image with its corresponding 3rd-order Gabor tensor.

Fig. 1. A face image and its corresponding 3rd-order Gabor tensor
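A sketch of the 40-kernel bank and the resulting 3rd-order tensor following Eqs. (1)-(2); the kernel support size (32 × 32) is an illustrative choice not specified here:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(nu, mu, sigma=2 * np.pi, size=32):
    """Gabor kernel psi_{mu,nu} of Eq. (1) sampled on a size x size grid."""
    k = (np.pi / 2) / (np.sqrt(2) ** nu)         # k_nu = k_max / f^nu
    phi = np.pi * mu / 8.0
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    k2, z2 = k ** 2, x ** 2 + y ** 2
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2)) * carrier

def gabor_tensor(image):
    """Stack the magnitudes of the 40 filter responses into a 40 x h x w tensor."""
    responses = [np.abs(fftconvolve(image, gabor_kernel(nu, mu), mode="same"))
                 for nu in range(5) for mu in range(8)]
    return np.stack(responses, axis=0)
```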
3 Discriminant Analysis with Gabor Tensor Representation

Since it is hard to apply discriminant analysis directly in the 3rd- or higher-order tensor space, we adopt an alternative strategy. The 3rd-order Gabor tensor is first unfolded into three 2nd-order tensors (matrices) along different axes. Fig. 2 shows the modes of the Gabor tensor unfolding. After that, 2D linear/kernel discriminant analysis is conducted on these 2nd-order tensors respectively to extract effective and robust subspaces, which will then be combined for classification.
Fig. 2. Illustration of the three feature matrices unfolded from a 3rd-order tensor along different axes m1, m2, m3
3.1 2D Linear Discriminant Analysis (2D-LDA)

2D-LDA is based on matrices rather than vectors, as opposed to 1D-LDA based approaches. Let the sample set be X = {X_1, X_2, ..., X_n}, where X_i is an r × c unfolded Gabor matrix. The within-class scatter matrix S_w and the between-class scatter matrix S_b based on 2nd-order tensors are defined as follows:

S_w = \sum_{i=1}^{L} \sum_{X_j \in C_i} (X_j - m_i)^T (X_j - m_i)    (3)

S_b = \sum_{i=1}^{L} n_i (m_i - m)^T (m_i - m)    (4)

where m_i = \frac{1}{n_i} \sum_{X_j \in C_i} X_j is the mean of the data matrices in class C_i, and m = \frac{1}{n} \sum_{i=1}^{L} \sum_{X_j \in C_i} X_j is the global mean matrix. 2D-LDA searches for optimal projections such that, after projecting the original data onto these directions, the trace of the resulting between-class scatter matrix is maximized while the trace of the within-class scatter matrix is minimized. Let W denote a c × d (d ≤ c) projection matrix; the r × c unfolded Gabor matrix X is projected onto W by the following linear transformation:

Y = X W    (5)
where the resulting Y is an r × d matrix smaller than X. Denote by \tilde{S}_w and \tilde{S}_b the within-class and between-class scatter matrices of the projected data Y, respectively. 2D-LDA then chooses W so that the following objective function is maximized:

J = \frac{tr(\tilde{S}_b)}{tr(\tilde{S}_w)} = \frac{tr\left(\sum_{i=1}^{L} n_i W^T (m_i - m)^T (m_i - m) W\right)}{tr\left(\sum_{i=1}^{L} \sum_{X_j \in C_i} W^T (X_j - m_i)^T (X_j - m_i) W\right)} = \frac{tr(W^T S_b W)}{tr(W^T S_w W)}
(6)

The optimal projection matrix W_opt can be obtained by solving the following generalized eigenvalue problem

S_w^{-1} S_b W = W \Lambda    (7)

where \Lambda is the diagonal matrix whose diagonal elements are the eigenvalues of S_w^{-1} S_b.

3.2 2D Kernel Discriminant Analysis (2D-KDA)

It is well known that face appearances may lie in a nonlinear low-dimensional manifold due to expression and illumination variations [7]. Linear methods may not be adequate to model such a nonlinear problem. Accordingly, a 2D nonlinear discriminant analysis method based on the kernel trick is proposed here. Like other kernel subspace representations, such as Kernel PCA (KPCA) [11], Kernel 2DPCA [4], and Kernel FDA (KFDA) [17], the key idea of 2D Kernel Discriminant Analysis (2D-KDA) is to solve the problem of 2D-LDA in an implicit feature space F which is constructed by the kernel trick:

\phi: x \in R^d \mapsto \phi(x) \in F    (8)

Given M training samples, denoted by r × c unfolded Gabor matrices A_k (k = 1, 2, ..., M), a kernel-induced mapping function maps each data vector from the original input space to a higher or even infinite dimensional feature space. The kernel mapping on matrices is defined as:
\phi(A) = [\phi(A^1)^T\ \phi(A^2)^T\ \ldots\ \phi(A^r)^T]^T    (9)
where A^i is the i-th row vector (1 × c) of the matrix A and \phi is the kernel mapping function on vectors. Performing 2D-LDA in F means maximizing the following Fisher discriminant function:

J(W) = \arg\max \frac{tr(W^T S_b^{\phi} W)}{tr(W^T S_w^{\phi} W)}    (10)

where S_b^{\phi} and S_w^{\phi} represent the between-class scatter and the within-class scatter, respectively, in F.
S_b^{\phi} = \sum_{i=1}^{L} n_i (\mu_i^{\phi} - \mu^{\phi})^T (\mu_i^{\phi} - \mu^{\phi})    (11)

S_w^{\phi} = \sum_{i=1}^{L} \sum_{A_j \in C_i} (\phi(A_j) - \mu_i^{\phi})^T (\phi(A_j) - \mu_i^{\phi})    (12)

where \mu_i^{\phi} = \frac{1}{n_i} \sum_{j=1}^{n_i} \phi(A_j) and \mu^{\phi} = \frac{1}{n} \sum_{i=1}^{L} n_i \mu_i^{\phi}.
If we follow the conventional kernel analysis as in 1D-KFDA, there are r × M samples \{\phi(A_k^i)^T,\ i = 1, \ldots, r;\ k = 1, \ldots, M\} spanning the kernel feature space, which results in a heavy computational cost for the subsequent optimization procedure. To alleviate the computational cost, following [20], we use M samples to approximate the kernel feature space: \phi_f = [\phi(\bar{A}_1)^T\ \ldots\ \phi(\bar{A}_M)^T]^T, where \bar{A}_k is the mean of the r row vectors of A_k. Thus, (10) can be rewritten as:

J(\alpha) = \arg\max \frac{tr(\alpha^T K_b \alpha)}{tr(\alpha^T K_w \alpha)}    (13)

and the problem of 2D-KDA is converted into finding the leading eigenvectors of K_w^{-1} K_b, with

K_b = \sum_{i=1}^{L} n_i (z_i - z)^T (z_i - z)    (14)

K_w = \sum_{i=1}^{L} \sum_{j \in C_i} (\eta_j - z_i)^T (\eta_j - z_i)    (15)
where \eta_j = [\eta_{1j}^T\ \eta_{2j}^T\ \ldots\ \eta_{rj}^T]^T with \eta_{ij} = [k(\bar{A}_1, A_j^i)\ k(\bar{A}_2, A_j^i)\ \ldots\ k(\bar{A}_M, A_j^i)], where k is the kernel function computing the inner product of two vectors in F; z_i = \frac{1}{n_i} \sum_{j \in C_i} \eta_j, and z is the mean of all \eta_j. Three classes of kernel functions are widely used, i.e. Gaussian kernels, polynomial kernels, and sigmoid kernels; here we use Gaussian kernels in the following experiments.

3.3 Discriminant Analysis with Gabor Tensor Representation

The 2nd-order tensors obtained by unfolding the 3rd-order Gabor tensor along different axes depict faces from different aspects, so the subspaces derived from the different modes of the 2nd-order tensor spaces may contain complementary information helpful for discrimination. Considering this, the subspaces are integrated and reduced further using the PCA method (see the sketch after this section). Suppose Y_1, Y_2, Y_3 are the three subspace projections obtained for each image; they are first transformed into 1D vectors, denoted y_1, y_2, y_3, and concatenated into one 1D vector y = [y_1^T y_2^T y_3^T]^T; PCA is then performed on these combined vectors. Finally, the shorter vectors derived from PCA are used for classification with a nearest neighbor classifier. As mentioned above, we propose a novel linear/kernel discriminant analysis method with Gabor tensor representation, denoted GT-LDA and GT-KDA respectively. It inherits the advantages of the 2D discriminant analysis methods and can therefore effectively avoid the singularity problem. The algorithm is described with the Gabor tensor representation but is not limited to it; in fact, it can be extended to an arbitrary N-order tensor. Specifically, it is not hard to see that 2D-LDA and LDA are both special forms of our proposed method, with N = 2 and N = 1.
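The sketch below illustrates the 2D-LDA step of Eqs. (3)-(7) for one unfolding mode; a GT-LDA pipeline would apply it to all three unfoldings and concatenate the projections before PCA. This is our own reading of the method (including the small regularization term), not the authors' code:

```python
import numpy as np
from scipy.linalg import eigh

def two_d_lda(X, labels, d):
    """2D-LDA on matrix samples (Eqs. (3)-(7)).

    X      : (n, r, c) array of unfolded Gabor matrices.
    labels : length-n array of class indices.
    d      : number of projection directions to keep (d <= c).
    Returns W, a (c, d) projection matrix; project a sample with Y = X_i @ W.
    """
    c = X.shape[2]
    m = X.mean(axis=0)                                   # global mean matrix
    Sw = np.zeros((c, c))
    Sb = np.zeros((c, c))
    for cl in np.unique(labels):
        Xc = X[labels == cl]
        mi = Xc.mean(axis=0)
        diff = Xc - mi
        Sw += np.einsum('nrc,nrk->ck', diff, diff)       # sum (X_j - m_i)^T (X_j - m_i)
        Sb += len(Xc) * (mi - m).T @ (mi - m)
    # Generalized eigenproblem Sb w = lambda Sw w (equivalent to Eq. (7));
    # a small ridge keeps Sw positive definite for the solver.
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(c))
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:d]]

# GT-LDA sketch: unfold a (40, h, w) Gabor tensor along each axis, run
# two_d_lda per mode, then concatenate the projected matrices and apply PCA.
```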
4 Experimental Results and Analysis

In this section, we evaluate the performance of the proposed algorithms (GT-LDA/GT-KDA) using the FERET database [10]. The FERET database is a standard test set for
Fig. 3. Example cropped FERET images used in our experiments
Fig. 4. Cumulative match curves on fb (a) and fc (b) probe sets
face recognition technologies. In our experiments, we use a subset of the training set containing 540 images from 270 subjects for training. Two probe sets named fb (expression) and fc (lighting), which contain images with expression and lighting variation respectively, are used for testing against the gallery set containing 1196 images. All images are cropped to 88 × 80 pixels according to the eye positions. Fig. 3 shows some examples of the cropped FERET images. To prove the advantage of the proposed method, we compare our method (GT-LDA/GT-KDA) with some other well-known methods: PCA, LDA, Gabor+F-LDA and Gabor+KFDA. Fig. 4 demonstrates the cumulative match curves (CMC) of these methods on the fb and fc probe sets, and the rank-1 recognition rates are shown in Table 1.
Table 1. The rank-1 recognition rates on the FERET fb and fc probe sets

Methods        fb (Expression)    fc (Lighting)
PCA            77.99%             38.14%
F-LDA          87.53%             42.78%
Gabor+F-LDA    93.97%             77.84%
Gabor+KFDA     95.56%             78.87%
GT-LDA         98.24%             89.18%
GT-KDA         98.66%             89.69%
From the results, we find that the proposed methods, GT-LDA and GT-KDA, significantly outperform the other methods on both the fb and fc probe sets, and the kernel method, GT-KDA, achieves the best results: 98.66% on fb and 89.69% on fc. These results indicate that the proposed method is accurate and robust to variations of expression and illumination. Moreover, in contrast to the traditional LDA method, the computational cost of the proposed algorithm is not increased very much. In the training phase, although discriminant analysis is performed three times rather than once as in traditional LDA, the computational cost is not much higher because the 2D discriminant analysis is conducted on a lower-dimensional feature space. In the testing phase, the computational cost of the proposed method is nearly the same as that of traditional LDA.
5 Conclusions

In this paper, a novel algorithm, linear/kernel discriminant analysis with Gabor tensor representation, has been proposed for face recognition. A face image is first encoded as a 3rd-order Gabor tensor. By unfolding the Gabor tensor and applying discriminant analysis to the unfolded matrices, we can effectively avoid the curse of dimensionality and overcome the small sample size problem to extract different discriminative subspaces. By combining these discriminant subspaces and applying a further reduction, a robust and effective subspace is finally derived for face recognition. Experimental results on the FERET database have shown the accuracy and robustness of the proposed method to variations of expression and illumination. Acknowledgements. We would like to thank Wei-shi Zheng for helpful discussion. This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the AuthenMetric Collaboration Foundation.
References 1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans. PAMI 19(7), 711–720 (1997) 2. Chen, L., Liao, H., Ko, M., Lin, J., Yu, G.: A new lda-based face recognition system which can solve the small sample size problem. Pattern Recognition (2000)
3. Ki-Chung, C., Cheol, K.S., Ryong, K.S.: Face recognition using principal component analysis of gabor filter responses. In: Proceedings of International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp. 53–57 (1999) 4. Kong, H., Li, X., Wang, L., Teoh, E.K., Wang, J.G., Venkateswarlu, R.: Generalized 2d principal component analysis. In: Proc. International Joint Conference on Neural Networks (2005) 5. Lades, M., Vorbruggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R.P., Konen, W.: Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42, 300–311 (1993) 6. Li, M., Yuan, B.: 2d-lda: A novel statistical linear discriminant analysis for image matrix. Pattern Recognition Letters 26(5), 527–532 (2005) 7. Li, S.Z., Jain, A.K. (eds.): Handbook of Face Recognition. Springer, New York (2005) 8. Liu, C.: Gabor-based kernel PCA with fractional power polynomial models for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5), 572–581 (2004) 9. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11(4), 467– 476 (2002) 10. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000) 11. Sch¨olkopf, B., Smola, A., M¨uller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319 (1999) 12. Shen, L., Bai, L.: Gabor wavelets and kernel direct disciminant analysis for face recognition. In: Int’l Conf on Pattern Recognition (ICPR’04), pp. 284–287 (2004) 13. Swets, D., Weng, J.: Using discriminant eigenfeatures for image retrieval. IEEE Trans. on PAMI 16(8), 831–836 (1996) 14. Wiskott, L., Fellous, J., Kruger, N., malsburg, C.V.: Face recognition by elastic bunch graph matching. IEEE Trans. PAMI 19(7), 775–779 (1997) 15. Xu, D., Yan, S.C., Zhang, L., Zhang, H.J., Liu, Z.K., Shum, H.Y.: Concurrent subspaces analysis. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 203–208. IEEE Computer Society Press, Los Alamitos (2005) 16. Yan, S.C., Xu, D., Yang, Q., Zhang, L., Zhang, H.J.: Discriminant analysis with tensor representation. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 526–532. IEEE Computer Society Press, Los Alamitos (2005) 17. Yang, M.H.: Kernel eigenface vs. kernel fisherface: Face recognition using kernel methods. In: Proc. IEEE Int. Conf on Automatic Face and Gesture Recognition. IEEE Computer Society Press, Los Alamitos (2002) 18. Ye, J., Janardan, R., Li, Q.: Two-dimensional linear discriminant analysis. In: Proceedings of Neural Information Processing Systems (2004) 19. Yu, H., Yang, J.: A direct lda algorithm for high-dimensional data with application to face recognition. Pattern Recognition (2001) 20. Zhang, D., Chen, S., Zhou, Z.: Recognizing face or object from a single image: Linear vs. kernel methods on 2d patterns. In: S·SSPR06, in conjunction with ICPR06, HongKong, China (2006)
Fingerprint Enhancement Based on Discrete Cosine Transform Suksan Jirachaweng and Vutipong Areekul Kasetsart Signal & Image Processing Laboratory (KSIP Lab), Department of Electrical Engineering, Kasetsart University, Bangkok, 10900, Thailand {g4885038,fengvpa}@ku.ac.th http://ksip.ee.ku.ac.th
Abstract. This paper proposes a novel fingerprint enhancement algorithm based on contextual filtering in the DCT domain. All intrinsic fingerprint features, including ridge orientation and frequency, are estimated simultaneously from the DCT analysis, resulting in a fast and efficient implementation. In addition, the proposed approach takes advantage of frequency-domain enhancement, resulting in the best performance in high-curvature areas. Compared with the DFT domain, the DCT has better signal energy compaction and a faster transform with real coefficients. Moreover, the experimental results show that the DCT approach outperforms traditional Gabor filtering, including the fastest separable Gabor filter, in both quality and computational complexity. Keywords: Fingerprint Enhancement, Discrete Cosine Transform Enhancement, Frequency-Domain Fingerprint Enhancement.
1 Introduction

Many fingerprint identification applications play an important role in our everyday life, from personal access control and office time attendance to country border control. To pursue this goal, an automatic fingerprint identification system (AFIS) must prove to be highly reliable. Since most automatic fingerprint identification systems are based on minutiae and ridge matching, these systems rely on good-quality input fingerprint images for minutiae and ridge extraction. Unfortunately, poor fingerprint quality and elastic distortion are major problems for most AFISs, especially large database systems. In order to reduce the error accumulated from the false accept rate and false reject rate, fingerprint quality must be evaluated and enhanced for better recognition results. Based on the filtering domain, most fingerprint enhancement schemes can be roughly classified into two major approaches, i.e. spatial-domain and frequency-domain. Filtering in the spatial domain applies convolution directly to the fingerprint image. On the other hand, filtering in the frequency domain needs Fourier analysis and synthesis: the fingerprint image is transformed, the coefficients are multiplied by filter coefficients, and the filtered Fourier coefficients are inverse transformed back to the enhanced fingerprint image. In fact, if the employed filters are the same, the enhancement results from both domains must be exactly the same by signal processing theory. However, for practical implementation, these
two approaches differ in terms of enhancement quality and computational complexity. Performing fingerprint enhancement in each domain has different advantages and disadvantages. For example, the popular Gabor filters of Hong et al. [1], with spatially adaptable orientation and frequency, are applied to a partitioned fingerprint image. However, this Gabor filter model is based on unidirectional ridge enhancement, resulting in ridge discontinuities and blocking artifacts around highly curved regions. On the other hand, in frequency-domain approaches a natural fingerprint image is localized in a few frequency coefficients, and the filter can easily be designed to cope with high-curvature areas. For example, Kamei et al. [2] introduced a fingerprint filter design in the frequency domain using the discrete Fourier transform. Chikkerur et al. [3] applied the short-time Fourier transform and took advantage of a 2-dimensional filter shaping design adapted to highly curved areas, resulting in better enhancement results. However, compared with spatial-domain approaches, this scheme suffers from the high computational complexity of Fourier analysis and synthesis, even though the Fast Fourier Transform (FFT) is employed. In order to take advantage of frequency-domain fingerprint enhancement with low computational complexity, we propose fingerprint enhancement based on the Discrete Cosine Transform (DCT). The DCT is a unitary orthogonal transform with real coefficients. It is closely related to the Discrete Fourier Transform (DFT), which has complex coefficients. Moreover, it is known that the DCT provides a distinct advantage over the DFT in terms of energy compaction and truncation error [4]; this is why the DCT has been widely employed in general image and video compression standards. Hence, in this paper, we investigate DCT-based fingerprint enhancement for practical implementation, expecting the best enhancement quality with low computational complexity. This paper is organized as follows. Section 2 describes the processes needed to implement enhancement filtering in the DCT domain, including intrinsic parameter estimation and practical filtering. Section 3 shows the experimental evaluation. Finally, Section 4 concludes our work and outlines future research.
2 Proposed Approach

The fingerprint enhancement approach consists of four concatenated processes: discrete cosine transform of the sub-blocks of the partitioned fingerprint, ridge orientation and frequency parameter estimation, filtering in the DCT domain, and inverse discrete cosine transform of the sub-blocks. The advantages of the proposed approach are as follows.

- Fingerprint ridges form a natural sinusoidal image whose spectrum is packed, or localized, in the frequency domain. Hence this spectrum can easily be shaped or filtered in this domain. Moreover, the filter can be specially designed to handle high-curvature ridge areas such as singular points. This is a great advantage over spatial-domain filtering approaches.
- Compared with the discrete Fourier transform, the discrete cosine transform performs better in terms of energy compaction. Moreover, DCT coefficients are real numbers, in contrast to the complex DFT coefficients, so they are easier to handle. Besides, the fast DCT
requires less computation and less memory than the fast Fourier transform (FFT).
- By partitioning the fingerprint into sub-blocks, the proposed approach utilizes spatially contextual information, including instantaneous frequency and orientation. Intrinsic features such as ridge frequency, ridge orientation, and angular bandwidth can be analyzed directly from the DCT coefficients.

Each process of the proposed fingerprint enhancement is explained as follows.

2.1 Overlapping DCT Decomposition and Reconstruction

Conventional fingerprint enhancement schemes applied to non-overlapping blocks of a partitioned fingerprint often suffer from blocking artifacts such as ridge discontinuities and spurious minutiae. To preserve ridge continuity and eliminate blocking artifacts, overlapping blocks are applied to both DCT decomposition and reconstruction, similar to the DFT approach in [3]. However, there is no need to apply any smooth spectral window for the DCT because the overlapping area is large enough to prevent blocking effects, in line with its energy compaction property.

2.2 Intrinsic Parameter Estimation in the DCT Domain

Ridge frequency, ridge orientation, and angular bandwidth can be analyzed directly from the DCT coefficients. Therefore DCT analysis yields an appropriate domain in which to perform fingerprint enhancement and provides the filtering parameters at the same time.

Ridge Frequency Estimation: The ridge frequency (ρ0) is simply obtained by measuring the distance between the origin (0,0) and the highest DCT peak of the high-frequency spectrum, as in the following equation:
\rho_0 = \sqrt{u_0^2 + v_0^2}    (1)
where (u0, v0) is the coordinate of the highest peak of the high-frequency spectrum.
Fig. 1. Figure (a) and (c) represent blocks of a fingerprint model with different frequency. Figure (b) and (d) are DCT coefficients of figure (a) and (c), respectively. Note that DC coefficient is set to zero in order to clearly display high-frequency spectrum.
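A sketch of the frequency estimate of Eq. (1) on a single block using SciPy's 2-D DCT; the block size and the radius used to suppress the DC/low-frequency area are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.fft import dctn

def ridge_frequency(block, dc_radius=3):
    """Estimate rho_0 from the dominant non-DC DCT peak of a block (Eq. (1))."""
    C = dctn(block, norm='ortho')
    A = np.abs(C)
    u, v = np.mgrid[0:A.shape[0], 0:A.shape[1]]
    A[u ** 2 + v ** 2 < dc_radius ** 2] = 0.0     # suppress the DC / low-frequency area
    u0, v0 = np.unravel_index(np.argmax(A), A.shape)
    return np.hypot(u0, v0), (u0, v0)

# Example on a synthetic 32 x 32 sinusoidal "ridge" block.
y, x = np.mgrid[0:32, 0:32]
block = np.cos(2 * np.pi * (0.10 * x + 0.05 * y))
rho0, peak = ridge_frequency(block)
```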
Ridge Orientation Estimation: The dominant orientation of parallel ridges, θ, is closely related to a peak angle, φ, in the DCT coefficients, where φ is measured counterclockwise (if φ > 0) from the horizontal axis to the terminal side of the highest high-frequency spectrum peak (the DC spectrum is not included). However, the relationship between θ and φ is not a one-to-one mapping. The ridge orientation θ, which varies in the
range of 0 to π, is projected onto the peak angle φ, which varies in the range of 0 to π/2. The relationship between the ridge orientation θ0 in the spatial domain and the peak angle φ0 in the frequency domain is described by equation (2), with some examples in Fig. 2:

\phi_0 = \tan^{-1}\left(\frac{v_0}{u_0}\right) = \left|\frac{\pi}{2} - \theta_0\right|, \quad \text{where } 0 \le \theta_0 \le \pi    (2)
Fig. 2. Examples of the relationship between ridge orientation in the spatial domain and peak angle in the DCT domain; all ridge angles refer to the horizontal axis and the DC coefficient is set to zero in order to show the high-frequency spectrum. (Note that only the top-left quarter of the DCT coefficients is zoomed in for a clear view of the high-frequency peak behavior.)
From Fig. 2, a ridge orientation of π−θ has its highest spectrum peak at the same location as a ridge orientation of θ. However, their phase patterns are distinguishable by observation. Therefore, an additional phase analysis is needed to classify the quadrant of the ridge orientation in order to perform fingerprint enhancement correctly. Lee et al. [5] proposed an edge detection algorithm based on DCT coefficients; our fingerprint enhancement modifies Lee's approach via the modulation theorem in order to detect the quadrant of the fingerprint ridge orientation. According to Lee's technique, the orientation quadrant of a single line can be determined by the polarities of the two first AC coefficients, G01 and G10, where Guv is the
Fig. 3. Four polarity patterns indicate (a) a single line orientation ranging from 0 to π/2, (b) a single line orientation ranging from π/2 to π, (c) parallel ridge orientation ranging from 0 to π/2, and (d) parallel ridge orientation ranging from π/2 to π
DCT coefficient at coordinate (u,v), as shown in Fig. 3. In the case of a single line, the polarity of the product of the G01 and G10 coefficients indicates the line orientation. If G01×G10 ≥ 0, the line orientation is in the first quadrant (0 to π/2), as shown in Fig. 3(a). On the other hand, if G01×G10 < 0, the line orientation is in the second quadrant (π/2 to π), as shown in Fig. 3(b). This technique can be applied to detect the orientation of parallel lines or ridges via the modulation theorem, using the pattern of polarities around the high-peak DCT coefficients. To be precise, ridge orientations in the first quadrant (0 to π/2) and in the second quadrant (π/2 to π) are indicated by the same polarities of the 45° and 135° diagonal coefficients, respectively, referred to the highest absolute peak, as shown in Fig. 3(c) and (d).
Fig. 4. The two 2-D perpendicular diagonal vectors, V1 at 45° and V2 at 135°, referred to the highest absolute spectrum peak (the center black pixel, a negative value)
In order to identify the quadrant and avoid the influence of interference, two 2-D perpendicular diagonal vectors, V1 and V2, are formed with a size of 5×3 pixels, centered at the peak position as shown in Fig. 4. The average directional strength of each vector (S1, S2) is then computed by equation (3), after which the quadrant can be classified and the actual fingerprint ridge orientation identified, as shown in equation (4):

S_i = \max_{n=-1,0,1} \frac{\sum_{m=-2}^{2} V_i(u_0 + m, v_0 + n)}{5}, \quad i = 1, 2    (3)

\theta = \begin{cases} \pi/2 - \phi & \text{where } S_1 \ge S_2 \\ \pi - (\pi/2 - \phi) & \text{otherwise} \end{cases}    (4)
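A sketch of the orientation estimate of Eqs. (2)-(4): the peak angle is taken from the dominant DCT peak and the quadrant is resolved by comparing the two diagonal strengths S1 and S2. How the 5 × 3 diagonal windows of Fig. 4 are sampled here is our own reading of the figure, so the code is illustrative only:

```python
import numpy as np

def diagonal_strength(C, u0, v0, sign):
    """S_i of Eq. (3): strongest of three 5-sample diagonal sums through the peak."""
    best = 0.0
    for n in (-1, 0, 1):
        total = 0.0
        for m in range(-2, 3):
            u = u0 + m
            v = v0 + sign * m + n   # sign=+1 -> 45-degree diagonal, -1 -> 135-degree
            if 0 <= u < C.shape[0] and 0 <= v < C.shape[1]:
                total += C[u, v]
        best = max(best, abs(total) / 5.0)
    return best

def ridge_orientation(C, u0, v0):
    """Resolve theta from the DCT peak (u0, v0) using Eqs. (2)-(4)."""
    phi = np.arctan2(v0, u0)            # peak angle, in [0, pi/2]
    s1 = diagonal_strength(C, u0, v0, +1)
    s2 = diagonal_strength(C, u0, v0, -1)
    return np.pi / 2 - phi if s1 >= s2 else np.pi - (np.pi / 2 - phi)
```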
Finally, the estimated ridge frequencies and orientations of the local regions form a frequency field and an orientation field. A Gaussian filter is then applied to smooth both global fields in order to reduce noise effects, as in [1]. Angular Bandwidth Estimation: In singularity regions, the ridge spectrum is not an impulse but spreads out in bandwidth. Therefore, the desired filter of each block must be adapted based on its angular bandwidth. We slightly modify the coherence parameter from Chikkerur's concept in [3], calling it the non-coherence factor. This non-coherence factor represents how widely the ridge orientation can vary in a block that has more than one dominant orientation. The factor is in the range of 0 to 1, where 1 represents a highly non-coherent or highly curved region and 0 represents a uni-orientation region. The non-coherence factor is given by
NC(u_c, v_c) = \frac{\sum_{(i,j) \in W} \left| \sin\big(\theta(u_c, v_c) - \theta(u_i, v_j)\big) \right|}{W \times W}    (5)
where (uc, vc) is the center position of the block, (ui, vj) are the positions of the neighborhood blocks within the W×W window, and the angular bandwidth, φBW, can be estimated by equation (6) as follows:
\phi_{BW}(u_c, v_c) = \sin^{-1}\big( NC(u_c, v_c) \big)    (6)
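A direct transcription of equations (5)-(6) might look as follows; the absolute value inside the sum and the assumption that the W×W neighbourhood lies fully inside the block orientation field are choices made for this sketch.

```python
import numpy as np

def angular_bandwidth(theta_field, uc, vc, W=3):
    """Sketch of equations (5)-(6): non-coherence of the block orientation at
    (uc, vc) with respect to its WxW neighbourhood, and the resulting angular
    bandwidth. theta_field holds one orientation estimate per block."""
    half = W // 2
    acc = 0.0
    for i in range(uc - half, uc + half + 1):
        for j in range(vc - half, vc + half + 1):
            acc += abs(np.sin(theta_field[uc, vc] - theta_field[i, j]))
    nc = acc / (W * W)                    # equation (5), value in [0, 1]
    return np.arcsin(min(nc, 1.0))        # equation (6)
```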
2.2 Enhancement Filtering in DCT Domain
In the DCT domain, the filtering process is not as simple as in the DFT domain [2,3], where only coefficient multiplication is required. The Gabor filter of [1] is modified to work in the DCT domain based on a Cartesian-form representation. Enhancement filtering in the DCT domain can be separated into two arithmetic manipulations, i.e., multiplication and convolution.
1) Filtering by Multiplication: The enhancement filter can be expressed as a product of separable Gaussian functions, similar to the frequency-domain filtering technique in [2], as follows:

F_{fd}(\rho, \phi) = F(\rho, \phi)\, H_f(\rho)\, H_d(\phi)    (7)
where F(ρ, φ) denotes the DCT coefficients in polar-form representation, directly related to the DCT coefficients F(u, v) in rectangular-form representation, and F_fd(ρ, φ) denotes the DCT coefficients of the filtering output. The H_f(ρ) filter, which performs Gaussian-shaped ridge frequency filtering, is given by

H_f(\rho \mid \rho_0, \sigma_\rho, Z) = \frac{1}{Z} \exp\!\left( -\frac{(\rho - \rho_0)^2}{2\sigma_\rho^2} \right), \quad \rho_0 = \sqrt{u_0^2 + v_0^2}, \; \rho_{min} \le \rho_0 \le \rho_{max}    (8)
where ρ0 and σρ are the center of the high-peak frequency group and the filtering bandwidth parameter, respectively. The ρmin and ρmax parameters are minimum and maximum cut-off frequency constraints, which suppress the effects of lower and higher frequencies such as ink, sweat-gland holes, and scratches in the fingerprint. Z is a filtering normalization factor, depending on the filtering energy result. The H_d(φ) filter, which performs the ridge orientation filtering, is given by

H_d(\phi \mid \phi_0, \sigma_\phi, \phi_{BW}) = \begin{cases} \exp\!\left( -\frac{(\phi - \phi_0)^2}{2\sigma_\phi^2} \right), & |\phi - \phi_0| \ge \phi_{BW} \\ 1, & \text{otherwise} \end{cases}    (9)
where φ0 is the peak orientation of the bandpass filter, σφ is the directional bandwidth parameter, and φBW, the angular bandwidth, is given by equation (6).
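A minimal sketch of the multiplicative filtering of equations (7)-(9) is given below. The handling of the cut-off frequencies and the normalisation factor Z are simplifying assumptions, since the paper leaves their exact forms open; the returned weight array is meant to be multiplied element-wise with the DCT coefficients of one block.

```python
import numpy as np

def bandpass_filter(shape, rho0, sigma_rho, phi0, sigma_phi, phi_bw,
                    rho_min=4, rho_max=48):
    """Sketch of equations (7)-(9): separable radial and angular Gaussian
    weights over the DCT coefficient grid of one block. rho_min, rho_max and
    the peak normalisation are placeholder choices."""
    u = np.arange(shape[0])[:, None]
    v = np.arange(shape[1])[None, :]
    rho = np.sqrt(u ** 2 + v ** 2)
    phi = np.arctan2(v, u)                               # angle of each coefficient

    h_f = np.exp(-(rho - rho0) ** 2 / (2 * sigma_rho ** 2))
    h_f[(rho < rho_min) | (rho > rho_max)] = 0.0         # frequency cut-off constraints
    h_f /= max(h_f.max(), 1e-9)                          # crude stand-in for the 1/Z normalisation

    h_d = np.exp(-(phi - phi0) ** 2 / (2 * sigma_phi ** 2))
    h_d[np.abs(phi - phi0) < phi_bw] = 1.0               # flat pass band inside the angular bandwidth
    return h_f * h_d
```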
2) Filtering by Convolution: Since the θ and π−θ ridge orientation coefficients are projected onto the same DCT-domain region, both directional coefficients still remain after the previous filtering. In order to truncate the inappropriate directional coefficients, two diagonal Gabor filters are applied by a convolution operation. The final enhanced DCT coefficients are given by

F_{Enh}(u, v) = F_{fd}(u, v) * H_q(u, v)    (10)
where F_Enh(u, v) denotes the enhanced DCT coefficients in rectangular form, and F_fd(u, v) is the previous filtering result in rectangular form, converted from F_fd(ρ, φ) in polar form. The quadrant correction filter, H_q(u, v), is given by

H_q(u, v) = \begin{cases} \cos\!\left[ \frac{(u+v)\pi}{2} \right] \exp\!\left( -\frac{(u+v)^2}{2\sigma_q^2} \right), & \theta \ge \pi/2 \\ \cos\!\left[ \frac{(u-v)\pi}{2} \right] \exp\!\left( -\frac{(u-v)^2}{2\sigma_q^2} \right), & \text{otherwise} \end{cases}    (11)
where σq is the parameter of the quadrant correction filter and cos(nπ/2) takes only the three values −1, 0 and 1. This convolution operation requires little computation because most of the bandpass-filtered coefficients have already been truncated to zero by the previous operation. In the case of highly curved ridges, the transformed coefficients are projected into a widely curved subband of the DCT domain, as shown in Fig. 5.
Fig. 5. Highly curved ridges in the spatial and frequency (DCT) domains. The signal is localized in a widely curved subband, which can be classified into the principal region (R1) and the reflection region (R2).
From Fig. 5, we approximate the orientation range from θ1 to θ2 by the non-coherence factor of equation (6). The curved subband can be classified into two regions, i.e., the principal region (R1) and the reflection region (R2). The principal region (R1) contains only one diagonal component (45° or 135°), as mentioned before; the 45° and 135° diagonal components are the phase patterns of ridges oriented in the ranges 0° to 90° and 90° to 180°, respectively. The reflection region (R2) is composed of both the 45° and 135° diagonal components, owing to the reflection property of the DCT coefficients. The convolution is therefore applied only in the principal region; a sketch is given below.
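The quadrant correction of equations (10)-(11) can be sketched as a small separable kernel convolved with the band-pass-filtered coefficients. The kernel size, the value of σq, and the fact that the convolution is applied to the whole block rather than only the principal region are simplifications made for this sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def quadrant_correction(f_fd, theta, sigma_q=1.0, size=7):
    """Sketch of equations (10)-(11): convolve the band-pass-filtered DCT
    coefficients with a diagonal cosine-Gaussian kernel so that the reflected
    (pi - theta) component is suppressed."""
    half = size // 2
    u = np.arange(-half, half + 1)[:, None]
    v = np.arange(-half, half + 1)[None, :]
    d = (u + v) if theta >= np.pi / 2 else (u - v)      # choose the diagonal per equation (11)
    h_q = np.cos(d * np.pi / 2) * np.exp(-d ** 2 / (2 * sigma_q ** 2))
    return convolve2d(f_fd, h_q, mode='same')
```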
3 Experimental Evaluation
The experimental results were evaluated on the public fingerprint database FVC2002 Db3a [6] (100 users, 8 images each) in terms of enhancement quality, matching performance, and computational complexity. Each fingerprint image is partitioned into blocks of 16×16 pixels, and a simple segmentation scheme using mean and variance is employed. Five fingerprint enhancement filtering types are evaluated: traditional Gabor filtering with non-quantized orientation (TG) [1], separable Gabor filtering with non-quantized orientation (SG) [7], separable Gabor filtering with 8-quantized orientation (SG8) [8], the short-time Fourier transform approach (STFT) [3], and the proposed approach (DCT). In the spatial-domain approaches, the discrete Gabor filters use the same 25×25 fixed window size. Note that the separable Gabor filters [7,8] were implemented on the fly using a set of a priori created and stored filters; moreover, the symmetry of the 2-D Gabor filter [1] was also exploited. These filtering schemes accelerate the traditional Gabor enhancement process as much as possible. For the STFT [3] and DCT approaches in the frequency domain, the fingerprint image is also partitioned into 16×16 blocks, but each block is transformed with a 32×32 overlapped window to reduce blocking artifacts. Note that the probability estimation in [3] is not included. In order to compare the performance of the various enhancement algorithms, three evaluation methodologies are used: the goodness index [1] of minutiae extraction, the matching performance, and the average execution time. First, the goodness index (GI) from [1] is employed to measure the quality of the minutiae extracted after each fingerprint enhancement algorithm; in this case, we manually marked the minutiae of all fingerprints in FVC2002 Db3a. The goodness index is given by
GI = \frac{\sum_{i=1}^{r} q_i \left[ M_i - L_i - S_i \right]}{\sum_{i=1}^{r} q_i T_i}    (12)
where r is the number of 16×16 windows in the input fingerprint image; qi is the quality factor of the ith window (good = 4, medium = 2, poor = 1), estimated by partitioning and thresholding the dryness factor (mean × variance of the block) and the smudginess factor (mean / variance of the block); Mi is the number of minutiae pairs that match the human expert markup within a tolerance box in the ith window; Li and Si are the numbers of lost and spurious minutiae in the ith window, respectively; and Ti is the number of minutiae extracted by the experts. Second, the enhancement results are tested with our minutiae matching verification algorithm based on Jiang's concept [9], and the equal error rate (EER) is reported. Finally, the average execution time of the fingerprint enhancement process is measured on FVC2002 Db3a (image size 300×300 pixels) on a Pentium M 1.5 GHz with 376 MB RAM. The execution time includes filter parameter estimation (frequency and orientation), the transform (if required), and the filtering process; the segmentation process is not included, and the same segmentation is used for all compared schemes. The objective test results are summarized in Table 1. Contrary to our expectation, the overall execution time of the DCT approach is faster than that of separable Gabor
Table 1. Summary of the performance comparison among various fingerprint enhancement algorithms over the FVC2002 Db3a fingerprint database, Pentium M 1.5 GHz, 376 MB RAM

Fingerprint Enhancement Algorithm | Average Goodness Index (GI) [1] | Our Matching (% EER) | Execution Time (Second)
TG [1]                            | 0.160 | 9.716  | 0.973
SG [7]                            | 0.167 | 9.326  | 0.278
SG8 [8]                           | 0.181 | 12.196 | 0.160
STFT (modified from [3])          | 0.250 | 7.713  | 0.172
DCT (Proposed Approach)           | 0.336 | 6.846  | 0.151

Fig. 6. (a) Original fingerprints #20_5, #40_4 and #107_7 from FVC2002 Db3a; (b) enhanced results from SG [7] (GI = 0.59, 0.19, 0.18); (c) enhanced results from STFT modified from [3] (GI = 0.63, 0.30, 0.47); (d) enhanced results of our proposed DCT-based method (GI = 0.70, 0.32, 0.68)
filtering with 8-quantized orientation. Investigating in depth, we found that even though separable 2-D convolution alone is faster than both FFT and fast DCT analysis and synthesis, the fingerprint intrinsic parameter estimation slowed this approach down, since these parameters are evaluated in the frequency domain. Fig. 6 shows enhancement results for subjective tests, with GI values for objective comparison. Note that the quality of the enhanced fingerprints is improved by
frequency-domain filtering, especially in highly curved ridges. Over the whole FVC2002 Db3a database, both the STFT- and DCT-based approaches performed very well around highly curved areas, with slightly different results around singular point areas.
4 Conclusion and Future Research
In conclusion, this paper proposes a novel fingerprint enhancement approach based on the discrete cosine transform (DCT). The enhancement takes advantage of filtering real DCT coefficients with high energy compaction in the frequency domain. Hence the filtering can be specially designed to accommodate highly curved areas, resulting in less discontinuity and fewer blocking artifacts compared with spatial-domain filtering. For future research, we will conduct exhaustive experiments on all FVC databases in order to demonstrate the efficiency of DCT-based fingerprint enhancement; to achieve this goal, all minutiae in all FVC databases need to be manually marked. We will also exploit an orientation-adaptive filter in the DCT domain in the near future.
Acknowledgments. This work was partially supported by the Department of Electrical Engineering, Kasetsart University, the Thailand Research Fund (TRF) through the Royal Golden Jubilee Ph.D. Program (Grant No. PHD/0017/2549), and the Commission on Higher Education through the TRF Research Scholar (Grant No. RMU4980027).
References
1. Hong, L., Wang, Y., Jain, A.K.: Fingerprint Image Enhancement: Algorithm and Performance Evaluation. IEEE Trans. on Pattern Analysis and Machine Intelligence 20(8), 777–789 (1998)
2. Kamei, T., Mizoguchi, M.: Image Filter Design for Fingerprint Enhancement. In: Proc. ISCV'95, pp. 109–114 (1995)
3. Chikkerur, S., Cartwright, A.N., Govindaraju, V.: Fingerprint Enhancement Using STFT Analysis. Pattern Recognition 40, 198–211 (2007)
4. Rao, K.R., Yip, P.: Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, Boston, MA (1990)
5. Lee, M., Nepal, S., Srinivasan, U.: Role of edge detection in video semantics. In: Proc. Pan-Sydney Workshop on Visual Information Processing (VIP2002). Conferences in Research and Practice in Information Technology, Australia (2003)
6. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Fingerprint Verification Competition 2002. Database available in: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
7. Areekul, V., Watchareeruetai, U., Suppasriwasuseth, K., Tantaratana, S.: Separable Gabor filter realization for fast fingerprint enhancement. In: Proc. Int. Conf. on Image Processing (ICIP 2005), Genova, Italy, pp. III-253–III-256 (2005)
8. Areekul, V., Watchareeruetai, U., Tantaratana, S.: Fast Separable Gabor Filter for Fingerprint Enhancement. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 403–409. Springer, Heidelberg (2004)
9. Jiang, X., Yau, W.Y.: Fingerprint Minutiae Matching Based on the Local and Global Structures. In: Proc. Int. Conf. on Pattern Recognition (15th), vol. 2, pp. 1042–1045 (2000)
Biometric Template Classification: A Case Study in Iris Textures Edara Srinivasa Reddy1, Chinnam SubbaRao2, and Inampudi Ramesh Babu3 1,2
Research Scholar, 3 Professor, Department of Computer Science, Acharya Nagarjuna University, Guntur, A.P, India {edara_67,rinampudi}@yahoo.com
Abstract. Most biometric authentication systems store multiple templates per user to account for variations in biometric data. Therefore, these systems suffer from storage space and computation overheads. To overcome this problem, the paper proposes techniques to automatically select prototype templates from iris textures. The paper has two phases: one is to find the feature vectors from iris textures that have low correlation, and the second is to calculate the Du measure. The Du measure is an effective measure of the similarity between two iris textures, because it takes into consideration three important perspectives: a) information, b) angle and c) energy. Also, the gray-level co-occurrence matrix is used to find the homogeneity and correlation between the textures.
Keywords: Shaker iris, Jewel iris, Flower iris, Stream iris, gray-level co-occurrence matrix, Spectral information divergence, Spectral angle mapper, Du measure.
1 Introduction
A typical iris biometric system operates in two distinct stages: the enrollment stage and the authentication stage [2]. During the enrollment stage, iris textures are acquired and processed to extract a feature set. The complex iris textures carry very distinctive information; the feature set includes distinguishing features such as nodes and end points in the iris texture. The stored feature set, labeled with the user's identity, is referred to as a template. A biometric-based authentication system relies on the stability of this data, but the data acquired from an individual is susceptible to changes due to distance from the sensor, poor sensor resolution, and alterations in iris texture due to cataract operations, among others. To account for such variations, multiple iris features pertaining to different portions of the iris must be stored in the database. There is a trade-off between the number of templates and the storage and computational overheads introduced by multiple templates. An efficient system must therefore select the templates automatically.
2 Basic Types of Iris Textures
Depending on the texture of the human iris, we can group irises into four basic groups: a) Stream iris, b) Jewel iris, c) Shaker iris and d) Flower iris.
Also, there are different combinations of these groups. As a preliminary study we have concentrated on grouping the given database into the above four classes. a) Stream Iris: It contains a uniform fiber structure with subtle variations or streaks of color, as shown in fig. 1.c. The structure of the iris is determined by the arrangement of the white fibers radiating from the center of the iris (or pupil). In this image one can notice that they are uniform and reasonably straight or parallel.
Fig. 1. a & b. Stream Iris
Fig. 1. c. Stream iris Texture
b) Jewel Iris: It contains dot-like pigments in the iris. The jewel iris can be recognized by the presence of pigmentation or colored dots on top of the fibers as shown in fig. 2.c. The dots (or jewels) can vary in color from light orange through black. They can also vary in size from tiny (invisible to the naked eye) to quite large.
Fig. 2.a & b. Jewel iris
Fig. 2. c. Jewel iris Textures
c) Shaker iris: It contains dot-like pigments and rounded openings. The shaker iris is identified by the presence of both flower-like petals in the fiber arrangement and pigment dots or jewels, as shown in fig. 3.c. The presence of even one jewel in an otherwise flower iris is sufficient to cause the iris to exhibit shaker characteristics.
Fig. 3.a & b. Shaker iris
Fig. 3. c. Shaker iris textures
d) Flower iris: It contains distinctly curved or rounded openings in the iris. In a flower iris the fibers radiating from the center are distorted (in one or more places) to
produce the effect of petals (hence the name flower), as shown in fig. 4. In this image one can notice that they are neither regular nor uniform. A flower iris may have only one significant petal, with the remainder of the iris looking like a stream iris.
Fig. 4.a & b. Flower iris
Fig. 4. c. Flower iris textures
3 Template Selection and Updating
The size of the same iris captured at different times may differ in the image as a result of changes in the camera-to-face distance. Due to stimulation by light or for other reasons, such as hippus, the natural continuous movement of the pupil, the pupil may be constricted or dilated [3]. The problem of template selection is: given a set of N iris images corresponding to a single human iris, select k templates that best represent the similarity observed in the N images. The selected templates must be updated from time to time, since users may undergo cataract operations due to aging. To account for such changes, old templates must be replaced with newer ones; some of the minutiae points, such as nodes and end points of iris textures, may be added to or deleted from the iris template. Thus template selection refers to the process by which iris textures are chosen from a given set of samples, whereas template update refers to the process by which existing templates are replaced.
4 Selection of Feature Vectors
The GLCM (gray-level co-occurrence matrix) [4], also known as the gray-level spatial dependence matrix, is a statistical tool used to characterize the texture of an image by calculating how often pairs of pixels with specific values, in a specified spatial relationship, occur in an image, and then extracting statistical measures from this matrix. By default, the spatial relationship is defined as the pixel of interest and the pixel to its immediate right (horizontally adjacent), but one can specify other spatial relationships between the two pixels. Each element (i, j) of the resultant GLCM is simply the number of times that a pixel with value i occurred in the specified spatial relationship to a pixel with value j in the input image. Initially, the dynamic range of the given image is scaled to reduce the number of intensity values from 256 to 8; the number of gray levels determines the size of the GLCM. The gray-level co-occurrence matrix can reveal certain properties of the spatial distribution of the gray levels in the texture image. For example, if most of the entries
in the GLCM are concentrated along the diagonal, the texture is coarse with respect to the specified offset. However, a single GLCM might not be enough to describe the textural features of the input image. For example, a single horizontal offset might not be sensitive to texture with a vertical orientation. For this reason, the gray co-matrix function [5] can create multiple GLCMs for a single input image. To create multiple GLCMs, an array of offsets is passed to the gray co-matrix function. These offsets define pixel relationships of varying direction and distance. For example, one can define an array of offsets that specify four directions (horizontal, vertical and two diagonals) and four distances; in this case, the input image is represented by 16 GLCMs. The statistics are calculated from these GLCMs and then averaged. Statistical description of the GLCM: The statistical parameters that can be derived from the GLCM are contrast, correlation, energy and homogeneity. a)
Contrast measures the local variations in the gray-level co-occurrence matrix. b) Correlation measures the joint probability of occurrence of the specified pixel pairs. c) Energy provides the sum of squared elements in the GLCM. d) Homogeneity measures the closeness of the distribution of elements in the GLCM to the GLCM diagonal. We derive only the correlation, to identify the length of the feature vector that carries unique information from the iris texture. A typical iris exhibits rich texture information in the immediate vicinity of the pupil, which tapers away in intensity as one moves away from the pupil. Thus, the iris templates are taken by segmenting a
Fig. 5. Iris textures after Canny edge detection: a) Stream iris, b) Jewel iris, c) Shaker iris, d) Flower iris
portion of the iris texture near the pupil, and from it the feature vectors are derived using the correlation function. This is implemented by performing Canny edge detection to produce a binary image of the respective iris texture and by calculating the probability of occurrence of pixel pairs. The correlation vs. offset plots for the different iris textures are given below.
Fig. 6a. Correlation of Shaker iris texture
Fig. 6b. Correlation of Jewel iris texture
Fig. 6c. Correlation of Stream iris texture
Fig. 6d. Correlation of Flower iris texture
From the above correlation plots, we can see the peak offsets, which represent the periodic patterns that repeat every few pixels. The correlation function can be used to evaluate coarseness: larger texture primitives give rise to coarse texture and small primitives give fine texture, and if the primitives are periodic, the correlation increases and decreases periodically with the distance [12]. For example, in the case of the flower iris, the peak offsets are at 9 and 13 pixels, but a common offset of 9 is observed for all textures. So we take the 9-pixel nearest neighborhood as the feature vector for estimating the Du measure of iris textures, as sketched below.
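A sketch of this feature-selection step, assuming scikit-image is available (the functions are named greycomatrix/greycoprops in older releases, and the rescaling to 8 gray levels follows the description above), is:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def correlation_vs_offset(texture_img, max_offset=20):
    """Compute GLCM correlation for horizontal offsets 1..max_offset; the peak
    offsets correspond to the periodic ridge patterns discussed above."""
    img = (texture_img // 32).astype(np.uint8)   # rescale 256 gray levels to 8
    corr = []
    for d in range(1, max_offset + 1):
        glcm = graycomatrix(img, distances=[d], angles=[0],
                            levels=8, symmetric=True, normed=True)
        corr.append(graycoprops(glcm, 'correlation')[0, 0])
    return np.array(corr)                        # plot against offset to locate the peaks
```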
5 Du Measure
The Du measure [1] is an effective measure of the similarity between two iris textures, because it takes into consideration three important perspectives: a) information, b) angle and c) energy.
a) Information: It is measured in terms of the SID (spectral information divergence) between the probability mass functions of the two textures f and g: SID(f, g) = D(f||g) + D(g||f), where D(f||g) is the relative entropy of g with respect to f, D(f||g) = Σ f·log(f/g) and D(g||f) = Σ g·log(g/f).
b) Angle: It is measured in terms of the spectral angle mapper, the angle between the two vectors f and g: SAM(f, g) = acos(⟨f, g⟩ / (||f||₂ · ||g||₂)), where ⟨f, g⟩ = Σ fᵢ·gᵢ and ||f||₂ = ⟨f, f⟩^(1/2) is the 2-norm of vector f.
Mixed measure: Du et al. developed the SID-SAM mixed measure, defined by M = SID(f, g) · tan(SAM(f, g)).
c) Energy: It is measured in terms of the average power difference. Instead of the 2-norm, the 1-norm is used: APD(f, g) = (1/N) ||f − g||₁, where ||x||₁ = Σ|x|.
Finally, the Du measure is given by Du(f, g) = APD · M (the mixed measure).
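The definitions above translate almost directly into code; the small epsilon terms are added only to avoid division by zero and logarithms of zero.

```python
import numpy as np

def du_measure(f, g, eps=1e-12):
    """Du measure between two (non-negative) iris texture feature vectors."""
    pf, pg = f / (f.sum() + eps), g / (g.sum() + eps)          # probability mass functions
    sid = np.sum(pf * np.log((pf + eps) / (pg + eps))) + \
          np.sum(pg * np.log((pg + eps) / (pf + eps)))          # spectral information divergence
    sam = np.arccos(np.clip(np.dot(f, g) /
                            (np.linalg.norm(f) * np.linalg.norm(g) + eps), -1.0, 1.0))
    mixed = sid * np.tan(sam)                                    # SID-SAM mixed measure
    apd = np.sum(np.abs(f - g)) / f.size                         # average power difference (1-norm)
    return apd * mixed
```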
6 Iris Classification
When a new iris texture is to be grouped into a predefined class, the Du measure is calculated against a typical template of that class. If the Du score is very high, the texture does not belong to that class; the new template is assigned to the class for which its Du measure is lowest in comparison with the others [1].
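Using the du_measure sketch above, this classification rule can be written as a minimum search over hypothetical class prototypes:

```python
def classify_iris(new_vec, prototypes):
    """Assign the new texture to the class whose prototype gives the lowest
    Du measure; `prototypes` maps class names to representative feature vectors."""
    return min(prototypes, key=lambda c: du_measure(new_vec, prototypes[c]))
```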
7 Experimental Results
For our study we have taken the CASIA iris image database [CAS03a] and the MMU iris database [MMU04a]. The CASIA iris image database contributes a total of 756 iris images, which were taken in two different time frames; each iris image is 8-bit gray scale with resolution 320 × 280. The MMU database contributes a total of 450 images, which were captured with an LG IrisAccess 2200. Our studies conclude that all the iris images in the databases exhibit four different types of textures: Flower, Jewel, Stream and Shaker. Thus there are 6 different combinations for matching, and the SID, SAM, APD and Du values for these 6 combinations are given in the table below.
Table 1. Du measures for different combinations of iris textures
Combination     | SID     | SAM     | MM      | APD     | DU
Stream, Jewel   | 1.51657 | 0.34578 | 0.54618 | 8.3483  | 4.55997
Stream, Flower  | 1.60659 | 0.26289 | 0.43236 | 14.859  | 6.4244
Stream, Shaker  | 1.8626  | 0.29232 | 0.56053 | 4.4424  | 2.4901
Flower, Jewel   | 1.41048 | 0.40956 | 0.6123  | 6.491   | 3.9744
Flower, Shaker  | 2.08906 | 0.27165 | 0.58188 | 10.397  | 6.0498
Jewel, Shaker   | 2.14638 | 0.38986 | 0.88193 | 3.9059  | 3.447
When a new template belongs to the same class, the SID (relative entropy) and APD (average power difference) approach zero; thus the Du measure automatically becomes zero or very small.
References
[1] NIST: Advanced Encryption Standard, AES (2001), http://csrc.nist.gov/publications/fips/fips-197.pdf
[2] Heijmans, H.: Morphological Image Operators. Academic Press, San Diego (1994)
[3] Juels, A., Sudan, M.: A Fuzzy Vault Scheme. In: Lapidoth, A., Teletar, E. (eds.) Proc. IEEE Int'l. Symp. Inf. Theory, p. 408 (2002)
[4] Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Addison-Wesley, Reading (1992)
[5] De Mira Jr., J., Mayer, J.: Image Feature Extraction for Application of Biometric Identification of Iris – A Morphological Approach. In: Proc. IEEE Int'l. Symp. on Computer Graphics and Image Processing, SIBGRAPI'03 (2003)
[6] Uludag, U., Jain, A.K.: Fuzzy Finger Print Vault. In: Proc. Workshop: Biometrics: Challenges Arising from Theory to Practice, pp. 13–16, New York (2004)
[7] Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992)
[8] Teo, C.C., Ewe, H.T.: An Efficient One Dimensional Fractal Analysis for Iris Recognition. In: WSCG'2005, January 31–February 4, 2005, Plzen, Czech Republic (2005)
[9] Ma, L., Tan, T.: Efficient Iris Recognition by Characterizing Key Local Variations. IEEE Trans. Image Processing (2004)
[10] Ives, R., Etter, D., Du, Y.: Iris Pattern Extraction Using Bit Planes and Standard Deviations. In: IEEE Conference on Signals, Systems and Computers (2004)
[11] Tian, Q.-C., Pan, Q., Cheng, Y.-M.: Fast Algorithm and Application of Hough Transform in Iris Segmentation. In: Proceedings of the Third IEEE Conference on Machine Learning and Cybernetics, Shanghai, pp. 26–29 (2004)
[12] Sharma, M., Markou, M., Singh, S.: Evaluation of Texture Methods for Image Analysis. Pattern Recognition Letters.
Protecting Biometric Templates with Image Watermarking Techniques Nikos Komninos and Tassos Dimitriou Athens Information Technology, GR-19002 Peania Attiki, Greece {nkom,tdim}@ait.edu.gr
Abstract. Biometric templates are subject to modification for identity fraud, especially when they are stored in databases. In this paper, a new approach to protecting biometric templates with image watermarking techniques is proposed. The novelty of this approach is that we combine lattice and block-wise image watermarking techniques, which maintain image quality, with cryptographic techniques to embed fingerprint templates into facial images and vice versa, thus protecting them from being modified. Keywords: Biometric templates, fingerprints, face pictures, authentication, watermarking.
1 Introduction
Visual-based biometric systems use feature extraction algorithms to extract discriminant information that is invariant to as many variations embedded in the raw data (e.g. scaling, translation, rotation) as possible. Template-based methods crop a particular subimage (the template) from the original sensory image and extract features from the template by applying global-level processing, without a priori knowledge of the object's structural properties. Compared to geometric feature extraction algorithms, image template approaches need to locate far fewer points to obtain a correct template. For example, in the probabilistic decision-based neural network (PDBNN) face recognition system, only two points (the left and right eyes) need to be located to extract a facial recognition template. An early comparison between the two types of feature extraction methods was made by Brunelli and Poggio [1], who found the template approach to be faster and able to generate more accurate recognition results than the geometric approach. In practice, there is not always a perfect match between templates and individuals. One speaks of a false positive if the biometric recognition system says 'yes' but the answer should be 'no'; a false negative works the other way round: the system says 'no' where it should say 'yes'. One of the main challenges with biometric systems is to minimise the rates of both false positives and false negatives. In theory one is inclined to keep the false positives low, but in practical situations it often works the other way round: people that operate these systems dislike false negatives, because they slow down the process and result in extra work and people complaining.
A potential protection approach for biometric templates in visual-based biometric systems is image watermarking. Watermarking techniques attempt to protect the copyright of any digital medium by embedding a unique pattern or message within the original information. The embedding method involves the use of a number of different authentication, encryption and hash algorithms and protocols to achieve the validity and copy protection of the particular message. One of the most important requirements of watermarking is perceptual transparency between the original and the watermarked work; especially for images, objective metrics are widely used to assess it [6]. The watermark message may have a higher or lower level of perceptibility, meaning that there is a greater or lesser likelihood that a given observer will perceive the difference. In this paper, we apply watermarking techniques to biometric templates to counter serious cases of identity theft. In particular, we embed a person's fingerprint template into his facial image by means of a cryptographic encoder that utilizes encryption algorithms, hash functions and digital signatures. Once the facial image has been watermarked, it can be stored in public databases without risking identity modification or fabrication. Following this introduction, the paper is organized as follows. Section 2 presents current work that combines watermarking and biometric techniques. Section 3 discusses the requirements for efficient watermarking and briefly describes the lattice and block-wise embedding methods and how they can be used along with cryptographic techniques to protect biometric templates. Section 4 evaluates the performance and efficiency of the two embedding methods through simulation tests. Section 5 concludes with remarks on the open issues in watermarking and biometric templates.
2 Related Work
Current research efforts in combining watermarking techniques and visual-based biometric systems follow a hierarchical approach, with the most explored area being that of biometrics. Watermarking techniques, on the other hand, have been explored less in conjunction with biometric templates, even though several attempts at combining watermarking techniques and biometric systems have already been proposed. In Lucilla et al. [5], a technique for the authentication of ID cardholders is presented, which combines dynamic signature verification with hologram watermarks. Two biometric features are already integral parts of ID cards for manual verification: the user's face and signature. This technique embeds dynamic features of the cardholder's signature into the personal data printed on the ID card, thereby facilitating automated user authentication based on the embedded information. Any modification of the image can also be detected and will further disallow the biometric verification of the forger. Jain and Uludag [2] worked on hiding fingerprint minutiae in images. For this purpose, they considered two application scenarios: a set of fingerprint minutiae is transferred as the watermark of an arbitrary image, and a face image is watermarked with fingerprint minutiae. In the first scenario, the fingerprint minutiae are transferred via a non-secure channel hidden in an arbitrary image. Before being embedded into
the host image, the fingerprint minutiae are encrypted, which further increases the security of the data. The produced image is sent through the insecure communication channel. In the end, the image is received and the fingerprint minutiae are extracted, decrypted and ready for any further processing. In the second scenario, a face scan is watermarked with fingerprint minutiae data and the result is encoded in a smart card. For the authentication of a user, the image is retrieved from the smart card, the fingerprint minutiae are extracted from it and they are compared to the minutiae obtained from the user online. The user is authenticated based on the two fingerprint minutiae data sets and the face image. Jain et al. [3] have presented a fingerprint image watermarking method that can embed facial information into host fingerprint images. The considered application scenario in this case is as follows: The fingerprint image of a person is watermarked with face information of the same person and stored on a smart card. At an access control site, the fingerprint of the user is sensed and compared with the one stored on the smart card. After the fingerprint matching has successfully been completed, the facial information can be extracted from the fingerprint image on the smart card and can be used for further authentication purposes.
3 Combining Image Watermarking Techniques with Visual-Based Biometric Systems
Visual-based biometric systems use feature extraction techniques to collect unique facial patterns and create biometric templates. However, biometric templates are subject to fraud, especially in passport cloning and illegal immigration. Image watermarking techniques along with cryptographic primitives can be used to verify the authenticity of a person and also to detect any modification of biometric templates when these are securely stored. Biometric templates of a fingerprint and a face scan can be hashed and encrypted with cryptographic algorithms and then embedded into an image. For example, with the use of hash functions and encryption methods, the owner of a facial image can embed his/her template; the recipient can extract it by decrypting it and can therefore verify that the received image was the one intended by the sender. Encrypting and hashing watermarked information can guarantee the authentication of the owner and the image itself, since the purpose of watermarks is two-fold: (i) they can be used to determine ownership, and (ii) they can be used to detect tampering. There are two necessary features that all watermarks must possess [7]. First, all watermarks should be detectable: in order to determine ownership, it is imperative that one be able to recover the watermark. Second, watermarks must be robust to various types of processing of the signal (i.e. cropping, filtering, translation, compression, etc.); if the watermark is not robust, it serves little purpose, as ownership will be lost upon processing. Another important requirement for watermarks is perceptual transparency between the original and the watermarked work, and for images objective metrics are widely used. The watermarked message may have a higher or lower level of perceptibility, meaning that there is a greater or lesser likelihood that a given observer will perceive the difference. The ideal is to be as imperceptible as possible, and it is required to develop models that are
used to compare two different versions of a work and evaluate any alterations. Evaluating the perceptibility of watermarks can be done with distortion metrics; these metrics do not exploit the properties of the human visual system, but they provide reliable results. There is also an objective criterion that relies on the sensitivity of the eye, called the Watson perceptual distance. It is also known as just-noticeable difference and consists of a sensitivity function, two masking components based on luminance and contrast masking, and a pooling component. Table 1 gives the metrics that are used most often.
Table 1. Quality Measurements
Signal to Noise Ratio (SNR):        SNR = Σ_{m,n} I_{m,n}² / Σ_{m,n} (I_{m,n} − Ĩ_{m,n})²
Peak Signal to Noise Ratio (PSNR):  PSNR = M·N · max_{m,n} I_{m,n}² / Σ_{m,n} (I_{m,n} − Ĩ_{m,n})²
Image Fidelity (IF):                IF = 1 − Σ_{m,n} (I_{m,n} − Ĩ_{m,n})² / Σ_{m,n} I_{m,n}²
Mean Square Error (MSE):            MSE = (1 / (M·N)) Σ_{m,n} (I_{m,n} − Ĩ_{m,n})²
Normalized Cross Correlation (NC):  NC = Σ_{m,n} I_{m,n}·Ĩ_{m,n} / Σ_{m,n} I_{m,n}²
Correlation Quality (CQ):           CQ = Σ_{m,n} I_{m,n}·Ĩ_{m,n} / Σ_{m,n} I_{m,n}
Watson Distance (WD):               D_wat(c₀, c_w) = ( Σ_{i,j,k} |d[i,j,k]|⁴ )^{1/4}

(I denotes the original image, Ĩ the watermarked image, and M×N the image size.)
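The pixel-domain metrics of Table 1 can be computed as follows; the conversion of SNR/PSNR to decibels and of IF to a percentage follows the units used later in Table 2, and the Watson distance is omitted because it requires a perceptual model.

```python
import numpy as np

def quality_metrics(orig, marked):
    """Pixel-domain distortion metrics between an original and a watermarked
    image (float arrays of equal size)."""
    orig, marked = orig.astype(float), marked.astype(float)
    diff = orig - marked
    err = np.sum(diff ** 2)
    mse = err / orig.size
    snr = np.sum(orig ** 2) / err
    psnr = orig.size * np.max(orig ** 2) / err
    imf = 1.0 - err / np.sum(orig ** 2)                  # image fidelity
    nc = np.sum(orig * marked) / np.sum(orig ** 2)       # normalized cross correlation
    cq = np.sum(orig * marked) / np.sum(orig)            # correlation quality
    return dict(MSE=mse, SNR=10 * np.log10(snr), PSNR=10 * np.log10(psnr),
                IF=100 * imf, NC=nc, CQ=cq)
```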
There are plenty of image watermarking techniques available in the literature, but we have combined the lattice and the semi-fragile, or block-wise, embedding methods to take advantage of their unique features. Briefly, the lattice watermarking system embeds only one bit per 256 pixels of an image. Each bit is encoded using a trellis code and produces a sequence of four bits; the trellis coding is a convolutional code with 2³ = 8 states and 2⁴ = 16 possible outputs. After the encoding procedure, the bits need to be embedded in 256 pixels, which means that each of the four bits is embedded in 256/4 = 64 pixels [6]. The block-wise method relies on the basic properties of JPEG compression, where the DCT domain is used. Four bits are embedded in the high-frequency DCT coefficients of each 8×8 (64-pixel) block of the image, and not in the low-frequency ones, in order to avoid visual differences that would lead to unacceptably poor fidelity. By using the block-wise method, the image can host 16 times more information than with the lattice method. Specifically, 28 coefficients are used, which means that each bit is embedded in seven coefficients. The seven coefficients that host one bit are chosen randomly according to a seed number, and thus each coefficient is involved in only one bit [6]. By combining the two methods we can exploit their advantages, particularly in circumstances where both quality and the ability to notice corrupted blocks are essential. In an image, the part that is likely to be illegally altered is watermarked with the block-wise method, while the rest of the image is watermarked with the lattice method. In a facial image, for example, the areas of the eyes, mouth and jaw can be used to embed the fingerprint template, while the surrounding area of the face can be used to embed additional information, such as the name and/or address of the person shown in the photo, with the lattice method. If an adversary changes, for example, the color of
the eyes or some characteristics of the face (e.g. adding a moustache), the combined algorithm is able to determine the modified pixels. This is achieved by comparing the extracted message with the original. The combination of the two embedding methods is implemented in a cryptographic encoder-decoder. The authority that wishes to protect a face or fingerprint photo extracts the biometric template(s) of that person; the template, along with a short description and a unique feature of the image, is fed into a hash function and the result is encrypted with a 1024-bit secret key. The resulting signature, together with the short and the extracted unique descriptions, is embedded with the lattice method, while the biometric template is embedded with the block-wise algorithm. As the unique description we use the sum of the pixel values of the four corner blocks. The Secure Hash Algorithm (SHA) and the Rivest-Shamir-Adleman (RSA) algorithm [8] are used to hash and sign the fingerprint template and the short and unique descriptions. The design of the encoder is illustrated in Fig. 1a.
Fig. 1. Cryptographic Encoder / Decoder
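To make the block-wise idea concrete, the following simplified sketch hides four bits in the 28 high-frequency DCT coefficients of one 8×8 block, seven seeded-random coefficients per bit. The sign-forcing rule and the choice of the u + v ≥ 8 frequency band are illustrative assumptions, not the authors' exact embedding rule.

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_block_bits(block, bits, seed=0, alpha=0.2):
    """Hide up to four bits in the high-frequency DCT coefficients of one
    8x8 block; the same seed must be used at the decoder to recover them."""
    rng = np.random.default_rng(seed)
    c = dctn(block.astype(float), norm='ortho')
    # The 28 positions with u + v >= 8 form the high-frequency band (4 bits x 7 coefficients).
    candidates = [(u, v) for u in range(8) for v in range(8) if u + v >= 8]
    order = rng.permutation(len(candidates))
    for i, bit in enumerate(bits[:4]):
        for k in order[7 * i: 7 * (i + 1)]:
            u, v = candidates[k]
            magnitude = max(abs(c[u, v]), alpha)
            c[u, v] = magnitude if bit else -magnitude    # force the sign to encode the bit
    return idctn(c, norm='ortho')
```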
At the decoder's side, the signature and the short and unique descriptions are extracted from the watermarked image with the lattice method, while the biometric template is extracted with the block-wise algorithm. The first step is then to verify whether the unique description computed from the image matches the extracted one; if the watermark has been copied and embedded in another image, the extracted description will not match. Because the pixel values of the image have been slightly changed to host the watermark, the computed description cannot be exactly the same as the embedded one, but it should be very close; therefore, upper and lower boundaries have been determined for this step of verification.
The next step is to decrypt the signature using the 1024-bit RSA public key and retrieve the hash value. The extracted biometric template and the short and unique descriptions are again fed into the hash function, and the obtained hash value is compared with the one decrypted from the signature. The third step of the decoder is thus to verify whether the decrypted hash value matches exactly the one calculated by the decoder. If both the hash values and the unique descriptions are valid, the authentication process is successful. The whole design of the decoder is presented in Fig. 1b.
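The hash-and-sign steps can be mirrored with a standard cryptographic library. The sketch below uses RSA PKCS#1 v1.5 signing with SHA-1 over a simple concatenation of the fields; the paper does not specify the exact SHA variant, padding scheme or byte layout, and 1024-bit RSA with SHA-1 is no longer considered strong by today's standards.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.exceptions import InvalidSignature

def sign_payload(private_key, template_bytes, short_desc, unique_desc):
    # Hash-then-sign over the concatenated fields (signing hashes internally).
    payload = template_bytes + short_desc + unique_desc
    return private_key.sign(payload, padding.PKCS1v15(), hashes.SHA1())

def verify_payload(public_key, signature, template_bytes, short_desc, unique_desc):
    payload = template_bytes + short_desc + unique_desc
    try:
        public_key.verify(signature, payload, padding.PKCS1v15(), hashes.SHA1())
        return True
    except InvalidSignature:
        return False

# Usage with a 1024-bit key pair, as in the paper.
key = rsa.generate_private_key(public_exponent=65537, key_size=1024)
sig = sign_payload(key, b"minutiae...", b"name/address", b"corner-pixel sum")
assert verify_payload(key.public_key(), sig, b"minutiae...", b"name/address", b"corner-pixel sum")
```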
4 Experimental Results
In order to evaluate the performance and efficiency of the embedding methods, extensive tests were carried out. A number of cases were considered, each with a different variable parameter. A grayscale bitmap image of 300×300 resolution (Fig. 2a) was used for the experiments. The difference between the original and the watermarked image was evaluated against the ideal values for the original image presented in Table 2.
Table 2. Ideal values for the test image
Quality Measurement | Ideal Value
MSE                 | 0
SNR (dB)            | 97
PSNR (dB)           | 104
IF                  | 100
NC                  | 1
CQ                  | 129.923
Watson Distance     | 0
Cases were also considered, each with a different variable parameter, for the lattice and block-wise methods separately in order to assess image quality. When combined image watermarking is performed, a small part of the image is watermarked with the block-wise method and the rest is watermarked with the lattice method; we assume that the small part is the one most vulnerable to illegal modifications. The biometric template is embedded in the small part, while the short and extracted descriptions are embedded in the large part of the image. The experimental tests of the quality measurements were performed using: only lattice; only block-wise; and combined block-wise and lattice. It was found that the lattice method achieves better results than the block-wise method, and, as expected, the values produced by the combined case lie between the values produced by the two methods. In particular, in the case of the lattice algorithm the maximum number of embedded bits is 351 (one bit per 256 pixels). The formulas used to evaluate the differences between two images are presented in Table 1. The tests were executed over a range of parameter values in order to determine the best ones. The parameters are the embedding strength (β) and the lattice spacing (α); α ranged from 0.35 to 5.33 and β from 0.7 to 1.1, with increment steps of 0.02 for α and 0.1 for β.
The measurement values for the lattice method are very close to the ideal ones. More specifically, for the MSE, the values approach zero when low values of α are used; if the value of β is also low, the MSE decreases even further. In the case of SNR and PSNR, the result values are higher when the parameters α and β are low. For the image fidelity (IF), which is defined as a percentage of how identical the images are, a value of 100% is considered optimal, and, as can be noticed from Table 3, the results are very close to this. For the NC and CQ quality measurements, the values move closer to the ideal ones (Table 2) as α and β are decreased. The above observations are also confirmed by the Watson measurement, which is based on luminance, contrast and pooling masking.
Table 3. Results from lattice, block-wise and combined embedding methods
(a) Lattice: alpha (α) = 0.93, beta (β) = 1.0; Block-wise: alpha (a) = 0.1

                    | MSE   | SNR   | PSNR  | IF     | NC      | CQ      | Watson Distance
Lattice (α, β)      | 0.353 | 47.13 | 53.27 | 99.969 | 0.99999 | 129.925 | 31.436
Block-Wise (a)      | 1.428 | 39.33 | 45.29 | 99.983 | 0.99992 | 129.916 | 58.262
Combined (α, β, a)  | 0.389 | 45.14 | 52.52 | 99.966 | 0.99994 | 129.922 | 32.453
Lattice alpha(a), beta(β)
Block-Wise alpha(a)
Combined alpha(a),beta(β), alpha(a)
0.545 43.18 50.77 99.9952 0.99998 129.921 48.901
4.951 33.62 41.18 99.9563 0.99985 129.903 157.506
0.72 41.97 49.55 99.9936 0.99998 129.921 49.136
(b)
Therefore, it could be suggested that the optimum parameter values are those that give the best results. They could even be the zero values. But at the decoder’s side not all the bits are extracted correctly. Specifically when using low values of α and β, the decoder is not able to get the correct embedded bits. In conclusion it can be said that a trade-off between the quality results and the decoder’s result is necessary in order to determine the optimum values. From the tests we concluded that suggested values could be α ≈ 1.53 and β = 0.8 (Table 3b). Similarly, in the case of the block-wise method, the tests were executed for the same image in order to be comparable with those of the lattice method. One major difference is the number of bits that are embedded. Since the method embeds four bits in every 64 pixels and the image has 90000 pixels in total, the number of bits can be hosted in 5625. The size of the information that can be watermarked is significantly
higher and in fact is 16 times greater than the size in the lattice method. Therefore, before even executing the test, it is expected that the results will not be as good. The information in the block-wise method is much more, which means that the alterations in the image will produce worse values in the quality measurements. The observation of the results proves what is being stated in the beginning. The values of the quality measurements are not as good in comparison with those of the lattice method since the measurement of the MSE is higher than the ideal value, which is zero. The values of the SNR and PSNR, which are widely used, show that as the value of the parameter alpha (a) is increased, the result becomes worse. In the case of the IF, NC, and CQ, the measurements seem to be distant from the ideal values as alpha (a) takes higher values. The same conclusion can be phrased for the perceptual distance given by the Watson model, where the results are worse as the value of alpha (a) is increased. It seems that as the value of alpha is increased, the watermarked image has poorer fidelity. So the optimum value of the parameter should perhaps possibly be a small one e.g. 0.01. However, it seems that values below 0.05 do not allow the decoder to get the right message. The chosen value of alpha depends on how sensitive the user wants the method to be in order to locate the corrupted bits and mark the corresponding blocks. Higher values increase the sensitivity but at the same time the quality of the image is reduced. So it is again necessary to make a trade-off between the results and the sensitivity. A possible suggested value could be a ≈ 0.2 (Table 3b). Indeed the results were not as good as those of the lattice method but they were better than those of the block-wise method. In Table 3 some result values of the combination are given in order to compare them with those of the two methods when they are applied individually. Table 3 justifies that the combination produces quality measurements between the two methods. Table 4 presents the maximum number of bits that can be hosted in the image using the two embedding methods and a combination of them. Table 4. Maximum Number of Embedded Bits Lattice Max Embedded Bits
351
BlockWise 5625
Combined 2752
The last test was to verify that in case somebody modifies the block-wise part of the image, which is the biometric template, the decoder realizes the modification, informs the authority that the authentication application failed and outputs a file with the modified blocks marked. The part that is likely to be illegally altered is the eyes or jaw and the biometric template(s) of the facial and/or fingerprint images which are embedded with the block-wise method (Fig. 2b). In the watermarked version the distance between the eyes was changed and this image was inserted in the decoder in order to verify its authenticity. The authentication process failed and a marked image was produced (Fig. 2c). By observing the last image it is clear that the decoder has successfully located the modified blocks.
122
N. Komninos and T. Dimitriou
(a)
(b)
(c)
Fig. 2. Original Image (a), Watermarked Image (b), Marked Image (c)
Throughout the paper we have considered the case where a fingerprint and/or facial template is embedded into a facial image. That does not mean that a facial image cannot be embedded with our method into a fingerprint image, which is illustrated in Fig. 3. Similar to Fig. 2, Fig. 3a is the original image, Fig. 3b is the watermarked and Fig. 3c is the marked image generated by our testbed.
(a)
(b)
(c)
Fig. 3. Original Image (a), Watermarked Image (b), Marked Image (c)
The potential danger with sensitive databases containing biometric identifiers, such as facial, fingerprint images and templates, is that they are likely to be attacked by hackers or criminals. Watermarking the information in these databases can allow the integrity of the contents to be verified. Another danger is that this critical data can be attacked while it is being transmitted. For example, a third party could intercept this data and maliciously alter the data before re-transmitting it to its final destination. The transmission problem is even more critical in cellular and wireless channels. The channels themselves are quite noisy and can degrade the signal quality. Additionally, data transmitted through wireless channels is far from secure as they are omni-directional, and as such can be eavesdropped with relative ease. The growth of the wireless market and e-commerce applications for PDAs requires a robust cryptographic method for data security. There are compact solid state sensors already available in the market, which can capture fingerprints or faces for the purpose of identity verification. These devices can also be easily attached to PDAs and other hand-held cellular devices for use in identification and verification. Considering all the noise and distortion in cellular channels, our combined watermarking technique along with the cryptographic encoder/decoder will mainly work in smudging, compression and filtering. Our cryptographic encoder/decoder will only fail when noise and distortion is detected in the sensitive areas of the images that have been embedded with the block-wise algorithm. If our watermarked image is transferred in a noisy channel, then we need to reduce the amount of information inserted with the block-wise method to have a high rate of success.
Protecting Biometric Templates with Image Watermarking Techniques
123
5 Conclusion Watermarking biometric data is of growing importance as more robust methods of verification and authentication are being used. Biometrics provides the necessary unique characteristics but their validity must be ensured. This can be guaranteed to an extent by watermarks. Unfortunately, they cannot provide a foolproof solution especially when the transmission of data is involved. A receiver can not always determine whether or not he has received the correct data without the sender giving access to critical information (i.e., the watermark). In this paper we have presented a cryptographic encoder/decoder that digitally signs biometric templates, which are embedded with combined lattice and block-wise image watermarking techniques into an image. Combining image watermarking techniques with cryptographic primitives enables us to protect biometric templates that have been generated by a visual-based biometric system without any distortion of the image. Since biometric templates are essential tools for authenticating people, it is necessary to protect them for possible alterations and fabrications in conjunction with their biometric image(s) when these are stored in private/public databases. Image watermarking techniques in conjunction with cryptographic primitives provide a powerful tool to authenticate an image, its biometric template and any additional information that is considered important according to a particular application. In the passport-based scenario, for example, the photograph and the private information (i.e. name/address) of an individual can be protected with the proposed approach. Our results showed that we can combine watermarking techniques to securely embed private information in a biometric image without fading it out.
References 1. Brunelli, R., Poggio, T.: Face recognition: features versus templates. IEEE Trans. On Pattern Analysis and Machine Intelligence 15, 1042–1052 (1993) 2. Jain, A.K., Uludag, U.: Hiding Fingerprint Minutiae in Images. In: Proc. Automatic Identification Advanced Technologies (AutoID), New York, pp. 97–102 (March 14-15, 2002) 3. Jain, A.K., Uludag, U., Hsu, R.-L.: Hiding a face in a fingerprint image. In: Proc. International Conference on Pattern Recognition (ICPR), Canada, (August 11-15, 2002) 4. Kung, S.Y., Mak, M.W., Lin, S.H.: Biometric Authentication: A Machine Learning Approach. Prentice Hall Information and System Sciences Series (2005) 5. Lucilla, C.F., Astrid, M., Markus, F., Claus, V., Ralf, S., Edward, D.J.: Biometric Authentication for ID cards with hologram watermarks. In: Proc. Security and Watermarking of Multimedia Contents SPIE’02, vol. 4675, pp. 629–640 (2002) 6. Peticolas, F., Anderson, R., Kuhn, M.: Information hiding – a survey. IEEE Proceedings 87(7), 1062–1078 (1999) 7. Wong, P.H.W., Au, O.C., Yueng, Y.M.: Novel blind watermarking technique for images. IEEE Trans. On Circuits and Systems for Video Technology 13(8), 813–830 (2003) 8. Hao, F., Anderson, R., Daugman, J.: Combining Crypto with Biometrics Effectively. IEEE Transaction on Computers 55(9), 1081–1088 (2006)
Factorial Hidden Markov Models for Gait Recognition Changhong Chen1, Jimin Liang1, Haihong Hu1, Licheng Jiao1, and Xin Yang2 1
Life Science Research Center, School of Electronic Engineering, Xidian University Xi’an, Shaanxi 710071, China 2 Center for Biometrics and Security Research, Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing 100080, China
[email protected]
Abstract. Gait recognition is an effective approach for human identification at a distance. During the last decade, the theory of hidden Markov models (HMMs) has been used successfully in the field of gait recognition, but the potential of some new HMM extensions still needs to be exploited. In this paper, a novel alternative gait modeling approach based on factorial hidden Markov models (FHMMs) is proposed. FHMMs have a multiple-layer structure and provide an interesting way of combining several features without the need to collapse them into a single augmented feature. We extract uncorrelated features for the different layers and iteratively train the model parameters through the Expectation-Maximization (EM) algorithm and the Viterbi algorithm; the exact forward-backward algorithm is used in the E-step of the EM algorithm. The performance of the proposed FHMM-based gait recognition method is evaluated on the CMU MoBo database and compared with that of HMM-based methods.
Keywords: gait recognition, FHMMs, HMMs, parallel HMMs, frieze, wavelet.
1 Introduction Hidden Markov models have been a dominant technology in speech recognition since the 1980s. HMMs provide a very useful paradigm for modeling the dynamics of speech signals: they offer a solid mathematical formulation for learning HMM parameters from speech observations, and efficient, fast algorithms exist for computing the most likely model given a sequence of observations. Gait recognition is similar to speech recognition in that both operate on time-sequential data. Motivated by the successful application of HMMs to speech recognition, Kale et al. [1, 2] introduced HMMs to gait recognition in recent years and obtained encouraging performance. Several other HMM-based recognition methods [3-5] were subsequently proposed. There are some possible extensions to HMMs, such as factorial HMMs (FHMMs) [6] and coupled HMMs [7]. FHMMs were first introduced by Ghahramani [6] and extend HMMs by allowing the modeling of several loosely coupled stochastic processes. FHMMs have a multiple-layer structure
and provide an interesting way to combine several features without collapsing them into a single augmented feature. In this paper we explore the potential of FHMMs for gait modeling. This paper is structured as follows. Section 2 introduces the image preprocessing and feature extraction methods. Section 3 describes FHMMs in detail and their realization for gait recognition. In Section 4, the proposed method is evaluated on the CMU MoBo database [8], and its performance is compared with that of HMM-based methods. Section 5 concludes the paper.
2 Feature Extraction 2.1 Preprocessing The preprocessing procedure is very important. The CMU MoBo database [8] provides human silhouettes segmented from the background images. However, the silhouettes are noisy and need to be smoothed. First, mathematical morphological operations are used to fill holes and remove some of the noise. Second, we remove larger noise blocks through filtering, since they cannot be eliminated by simple morphological operations. Finally, all the silhouettes are aligned and cropped to the same size. The size is chosen manually and varies with the database; for the CMU MoBo database we choose 640×300, which contains most of the useful information and little noise for most subjects. An example is shown in Fig. 1.
Fig. 1. (a) An example of the original silhouette; (b) the processed silhouette of (a)
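As a rough illustration of this preprocessing chain, the sketch below uses SciPy morphology to fill holes, drop small specks, keep the largest connected component, and paste the cropped silhouette into a fixed 640×300 window (the window size comes from the text above); the function name, the structuring element, and the centering policy are our own assumptions rather than the authors' implementation.

import numpy as np
from scipy import ndimage

def preprocess_silhouette(mask, out_h=640, out_w=300):
    """Clean a binary silhouette and crop it to a fixed-size window."""
    mask = np.asarray(mask).astype(bool)
    # Morphological clean-up: fill interior holes, then open to drop specks.
    mask = ndimage.binary_fill_holes(mask)
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    # Keep only the largest connected component (removes big noise blocks).
    labels, num = ndimage.label(mask)
    if num > 1:
        sizes = ndimage.sum(mask, labels, index=range(1, num + 1))
        mask = labels == (int(np.argmax(sizes)) + 1)
    # Paste the bounding-box crop, centered, into a fixed-size window
    # so that all frames share the same size.
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    out = np.zeros((out_h, out_w), dtype=np.uint8)
    h, w = min(crop.shape[0], out_h), min(crop.shape[1], out_w)
    y0, x0 = (out_h - h) // 2, (out_w - w) // 2
    out[y0:y0 + h, x0:x0 + w] = crop[:h, :w]
    return out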
2.2 Feature Extraction B. Logan [9] pointed out that "there is only an advantage in using the FHMM if the layers model processes with different dynamics; if the features are indeed highly correlated FHMMs do not seem to offer compelling advantages". The choice of features is therefore critical for FHMMs; however, it is a real challenge to choose uncorrelated features from a sequence of gait images. In this paper, two different feature extraction methods are employed for the two layers of the FHMM.
2.2.1 Frieze Feature The first gait feature representation is a frieze pattern [10]. A two-dimensional pattern that repeats along one dimension is called a frieze pattern in the mathematics and geometry literature. Consider a sequence of binary silhouette images b(x, y, t), indexed spatially by pixel location (x, y) and temporally by time t. The first frieze pattern is calculated as FC(x, t) = Σy b(x, y, t), where each column (indexed by time t) is the vertical projection (column sum) of the silhouette image. The second frieze pattern, FR(y, t) = Σx b(x, y, t), is constructed by stacking row projections. FR is considered to contain more information than FC, and some obvious noise can be filtered from FR, as shown in Fig. 2. We choose FR as the feature for the first FHMM layer.
Fig. 2. (a) A silhouette image; its frieze features FC (b) and FR (c); (d) FR after filtering noise
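A minimal sketch of the frieze computation described above: for each frame t, FC stacks the column sums and FR stacks the row sums of the binary silhouette b(x, y, t). The array layout (frames, rows, columns) and the function name are assumptions, and the noise filtering applied to FR in the paper is not reproduced here.

import numpy as np

def frieze_patterns(silhouettes):
    """silhouettes: array of shape (T, H, W) of binary frames b(x, y, t).
    Returns (FC, FR): each has one column per frame t."""
    seq = np.asarray(silhouettes, dtype=np.float64)
    f_c = seq.sum(axis=1).T   # sum over rows y    -> shape (W, T), vertical projections
    f_r = seq.sum(axis=2).T   # sum over columns x -> shape (H, T), horizontal projections
    return f_c, f_r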
2.2.2 Wavelet Feature The wavelet transform can be regarded as a localized time-frequency analysis method, with good time resolution in the high-frequency bands and good frequency resolution in the low-frequency bands. It preserves entropy and can change the energy distribution of an image without damaging the information it carries. Because the wavelet transform acts on the whole image, it reduces the global correlation of the image and spreads quantization error over the whole image, avoiding artifacts. The wavelet transform is therefore well suited to image processing, so we choose the vectors obtained from the wavelet transform of the silhouette images as the feature for the second FHMM layer.
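A sketch of one way to build the wavelet-based feature for a frame, here a single-level 2-D Haar decomposition with PyWavelets flattened into a vector; the wavelet, the decomposition level, and which subbands to keep are not specified in the text above and are assumptions.

import numpy as np
import pywt

def wavelet_feature(silhouette, wavelet="haar"):
    """Single-level 2-D wavelet transform of one silhouette frame,
    flattened into a feature vector for the second FHMM layer."""
    approx, (horiz, vert, diag) = pywt.dwt2(np.asarray(silhouette, dtype=np.float64), wavelet)
    return np.concatenate([c.ravel() for c in (approx, horiz, vert, diag)])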
3 FHMM-Based Gait Recognition FHMMs were first described by Ghahramani [6], who presented the model and introduced several methods to learn its parameters efficiently. Our effort, however, is focused on exploiting the application of FHMMs to gait modeling.
3.1 FHMMs Description The factorial HMM arises by forming a dynamic belief network composed of several layers, each of which can be considered an independent HMM. This is shown in Fig. 3. Each layer has independent dynamics, but the observation vector depends upon the current state in each of the layers. This is achieved by allowing the state variable of the HMM to be composed of a collection of states. That is, we now have a "meta-state" variable which is composed of states as follows:
St = ( St(1), St(2), …, St(M) ),    (1)
where St is the "meta-state" at time t, St(m) is the state of the mth layer at time t, and M is the number of layers.
Fig. 3. (a) Dynamic belief network representation of a hidden Markov model; (b) dynamic belief network representation of a factorial HMM with M = 3 underlying Markov chains
It is assumed for simplicity that the number of possible states in each layer is equal. Let K be the number of states in each layer. A system with M layers requires M transition matrices of size K × K, with zeros representing illegal transitions. It should be noted that this system could still be represented as a regular HMM with a K^M × K^M transition matrix; it is preferable to use the M K × K transition matrices over the K^M × K^M equivalent representation for computational simplicity. It is also assumed that each meta-state variable is a priori uncoupled from the other state variables:

P(St | St−1) = ∏_{m=1}^{M} P(St(m) | St−1(m)).    (2)
As for the probability of the observation given the meta-state, there are two different ways of combining the information from the layers. The first method assumes that the observation is distributed according to a Gaussian with a common covariance and a mean that is a linear combination of the state means; this goes by the name of the "linear" factorial HMM. The second combination method, the "streamed" method, assumes that P(Yt | St) is the product of the distributions of the individual layers (Yt is the observation at time t). More details can be found in [9].
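To make the meta-state factorization concrete, the sketch below evaluates P(St | St−1) as the product of per-layer transition probabilities, Eq. (2), with left-to-right per-layer matrices of the kind used in the initialization of Section 3.2; the 0.5/0.5 split of the transition mass and all names are illustrative assumptions.

import numpy as np

def left_to_right_matrix(K):
    """K x K left-to-right transition matrix: a state may stay or move to its next state."""
    A = np.zeros((K, K))
    for i in range(K):
        A[i, i] = 0.5
        A[i, min(i + 1, K - 1)] += 0.5
    return A

def meta_state_transition(prev_state, state, layer_matrices):
    """P(S_t | S_{t-1}) = product over layers m of A_m[S_{t-1}^m, S_t^m]  (Eq. 2).
    prev_state, state: tuples of per-layer state indices; layer_matrices: list of K x K arrays."""
    prob = 1.0
    for m, A in enumerate(layer_matrices):
        prob *= A[prev_state[m], state[m]]
    return prob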
3.2 Initialization of Parameters (1) Number of states K and number of layers M: the number of states is set to five for the CMU MoBo database. The number of layers depends on the feature vectors extracted; we extract two kinds of feature vectors, so the number of layers is two. (2) The transition matrices: the transition matrices are M matrices of size K × K. Each initial K × K matrix is set as a left-to-right HMM, which only allows transitions from a state to itself and to its next state. (3) Output probability distribution: a gait sequence is always large in size, and the large dimension makes it impractical to calculate a common covariance of the observation, so we employ the "streamed" method of Section 3.1: P(Yt | St) is calculated as the product of the distributions of the individual layers. The models we use are exemplar-based models [2]. The motivation behind using an exemplar-based model is that recognition can be based on the distance between the observed feature vector and the exemplars; the distance metric and the exemplars are obviously the key factors for the performance of the algorithm. Let
Y = {Y1, Y2, …, YT} be the sequence of observation vectors, F^m = {f_1^m, f_2^m, …, f_T^m} be the feature vectors of the observation vectors in layer m, and T be the length of the sequence. The initial exemplar set is denoted as S^m = {s_1^m, s_2^m, …, s_K^m}. We obtain the initial exemplar elements s_k^m by equally dividing the observation sequence into K clusters and averaging the feature vectors of each cluster. We estimate the output probability distribution by an alternative approach based on the distance between the exemplars and the image features; in this way we avoid calculating high-dimensional probability density functions. The output probability distribution of the mth layer is defined as:

b_n(f_t^m) = α δ_n^m e^(−δ_n^m × D(f_t^m, S_n^m)),    (3)

δ_n^m = N_n / Σ_{f_t^m ∈ e_n^m} D(f_t^m, S_n^m),    (4)
where α is a constant, D(f_t^m, S_n^m) is the inner-product distance between the tth feature vector f_t^m and the nth state S_n^m of the mth layer, and δ_n^m is defined by equation (4). N_n is the number of frames belonging to the nth cluster, which is the same for all layers, and e_n^m represents the nth cluster of the mth layer. Let β be a constant. The output probability distribution can then be represented as:

P(Yt | St) = β ∏_{m=1}^{M} b_n(f_t^m).    (5)
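The sketch below mirrors Eqs. (3)-(5): δ comes from the per-cluster average distance to the exemplar, b_n is the scaled exponential of the distance, and the meta-state output probability is the product over layers. The text does not define the inner-product distance precisely, so the normalized form used here is an assumption, as are the names and data layout.

import numpy as np

def inner_product_distance(f, s):
    """D(f, s): taken here as one minus the normalized inner product (an assumption)."""
    f, s = np.asarray(f, float), np.asarray(s, float)
    return 1.0 - float(np.dot(f, s)) / (np.linalg.norm(f) * np.linalg.norm(s) + 1e-12)

def delta_for_cluster(cluster_features, exemplar):
    """delta_n^m = N_n / sum of D(f_t^m, S_n^m) over the frames of cluster n  (Eq. 4)."""
    dists = [inner_product_distance(f, exemplar) for f in cluster_features]
    return len(cluster_features) / (sum(dists) + 1e-12)

def output_probability(features_per_layer, exemplars_per_layer, deltas_per_layer,
                       alpha=1.0, beta=1.0):
    """P(Y_t | S_t) = beta * prod_m alpha * delta * exp(-delta * D(f_t^m, S_n^m))  (Eqs. 3, 5)."""
    p = beta
    for f, s, d in zip(features_per_layer, exemplars_per_layer, deltas_per_layer):
        p *= alpha * d * np.exp(-d * inner_product_distance(f, s))
    return p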
3.3 Estimation of Parameters The factorial HMMs we use are exemplar-based. The model parameters, denoted λ, include the exemplars of each layer, the transition probabilities between states in each layer, and the prior probabilities of each state. The exemplars are initialized as described above and remain unchanged while the other parameters are estimated. The transition probabilities and the prior probabilities are estimated with the Expectation-Maximization (EM) algorithm; the algorithm steps can be found in [6]. The exact forward-backward algorithm [6] is used in the E-step. The naive exact algorithm, which translates the factorial HMM into an equivalent HMM with K^M states and runs the standard forward-backward algorithm, has time complexity O(T K^{2M}). The exact forward-backward algorithm has time complexity O(T M K^{M+1}) because it exploits the independence of the underlying Markov chains to sum over the M K × K transition matrices. The Viterbi algorithm is used to obtain the most probable path and the likelihood. New exemplars, and hence a new output probability distribution, are obtained from the most probable path. The whole process is iterated until the increase in likelihood falls below a small threshold. 3.4 Recognition First, the probe sequence y = {y(1), y(2), …, y(T)} is preprocessed and its features are extracted in the same way as for the training sequences. The output probability distribution of the probe sequence can then be calculated using the states of a training sequence. We obtain the log-likelihood Pj that the probe sequence is generated by the FHMM parameters λj of the jth person in the training database:
Pj = log( P(y | λj) ).    (6)
The above procedure is repeated for every person in the database. If Pm is the largest among all the Pj's, we assign the unknown person to be person m. A key problem in calculating the log-likelihood Pj is how to obtain the clusters of the probe sequence given the FHMM of a training sequence. We calculate the distance between the features of the probe sequence and the exemplars of the training sequence to determine the clusters; the clusters of the same probe sequence therefore vary with different training sequences.
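A small sketch of the identification rule above: evaluate the probe against every enrolled FHMM and keep the identity with the largest log-likelihood. The evaluation function itself (Viterbi/forward over the exemplar-based model) is passed in and is assumed, not shown.

import numpy as np

def identify(probe_features, enrolled_models, fhmm_log_likelihood):
    """Return (m, scores): m maximizes P_j = log P(y | lambda_j) over the enrolled models."""
    scores = [fhmm_log_likelihood(probe_features, model) for model in enrolled_models]
    return int(np.argmax(scores)), scores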
4 Experiment Results We use the CMU MoBo database [8] to evaluate the proposed method. Fronto-parallel sequences are adopted and the images are preprocessed to a size of 640×300. Besides the experiment on the proposed method, three other comparative experiments are conducted. When only one of the two features is used, the one-layer FHMM
reduces to a standard HMM. We report the results of the two single-feature HMMs separately. As shown in Fig. 4, we also report the results of merging the outputs of the two HMMs; following [11], we call this system the 'parallel HMM'. If the decisions of the two HMMs are the same, their common decision is taken as the result of the parallel HMM; otherwise, we sum the corresponding likelihoods of the two HMMs and re-rank them to obtain the final result. The experimental results are also compared with those of [1] and [12].
Fig. 4. Parallel HMM: frieze and wavelet features are extracted from the gait sequence, each feeds an HMM classifier, and the classifier results are merged
4.1 Same-Style Experiments The training and probe data sets are of the same motion style. For this type of experiment, we use two cycles to train and two cycles to test. (a) S vs. S: training on slow walk of some cycles and testing on slow walk of other cycles. (b) F vs. F: training on fast walk of some cycles and testing on fast walk of other cycles. (c) B vs. B: training on walking with a ball for some cycles and testing on walking with a ball for other cycles. (d) I vs. I: training on walking on an incline for some cycles and testing on walking on an incline for other cycles. The results for the same-style experiments are shown in Table 1. Table 1. The results for the same-style experiments
P(%) at rank    HMM[12]        HMM[1]         HMMf           HMMw           pHMM           FHMM
                1      5       1      5       1      5       1      5       1      5       1      5
S vs. S         100    100     72.0   96.0    100    100     100    100     100    100     100    100
F vs. F         96.0   100     68.0   92.0    88.0   100     100    100     96.0   100     100    100
B vs. B         100    100     91.7   100     95.8   100     100    100     100    100     100    100
I vs. I         95.8   100     ---    ---     92.0   100     96.0   100     96.0   100     100    100
4.2 Different-Style Experiments The training and probe data sets are of different motion styles. For this type of experiment, we use four cycles to train and two cycles to test. The CMC curves for the four different-style experiments are given in Fig. 5, and the performance comparison with other methods is shown in Table 2.
Fig. 5. The cumulative matching characteristic curves for the different-style experiments. Exp. HMMf denotes the HMM with frieze vectors; Exp. HMMw the HMM with wavelet-transform vectors; Exp. PHMM the parallel HMM; and Exp. FHMM the factorial HMM. (a) S vs. F; (b) F vs. S; (c) S vs. B; (d) F vs. B. Table 2. The results for the different-style experiments
P(%) at rank    HMM[12]        HMM[1]         HMMf           HMMw           pHMM           FHMM
                1      5       1      5       1      5       1      5       1      5       1      5
S vs. F         ---    ---     32.0   72.0    96.0   100     92.0   100     96.0   100     100    100
F vs. S         ---    ---     56.0   80.0    92.0   100     88.0   96.0    88.0   100     92.0   100
S vs. B         52.2   60.9    ---    ---     66.7   95.8    70.8   95.8    87.5   100     83.3   100
F vs. B         ---    ---     ---    ---     50.0   79.2    54.2   91.7    58.3   75.0    58.3   83.3
(a) S vs. F: training on slow walk and testing on fast walk. (b) F vs. S: training on fast walk and testing on slow walk. (c) S vs. B: training on slow walk and testing on walking with a ball. (d) F vs. B: training on fast walk and testing on walking with a ball. For the same-style experiments, the performance of the FHMM-based gait recognition method is excellent, reaching 100% at rank 1. For the different-style experiments, more experiments are carried out and much better results are obtained than in references [1] and
[12]. For the S vs. F experiment, the FHMM-based gait recognition method reaches 100% at rank 1, which is the best result reported so far. Both the S vs. F and F vs. S experiments achieve higher identification rates than the S vs. B and F vs. B experiments: when people walk carrying a ball, their shape changes considerably. A clear superiority of the FHMM over the single-feature HMMs can be seen in all of these experiments. The FHMM-based method is also better than the parallel-HMM-based method, except in the S vs. B experiment. From the experimental results we can see that the performance of the FHMM-based gait recognition method is superior to that of [1] and [12], and better than using the frieze feature or the wavelet feature individually. It is also slightly better than the parallel HMM, while being simpler to implement and faster. The results show that the FHMM-based method is effective and improves on the performance of the HMM.
5 Conclusion We presented an FHMM-based gait recognition method. The experimental results show that the FHMM is a good extension of the HMM. The FHMM framework provides an interesting way to combine several features without collapsing them into a single augmented feature, and it is simpler to implement than the parallel HMM. However, the features of the different layers must be uncorrelated, and it is a challenging problem to extract uncorrelated but effective features from the same gait sequence. Our future work will concentrate on this area to further improve the performance. Acknowledgments. This work was partially supported by the Natural Science Foundation of China, Grant Nos. 60402038 and 60303022, the Chair Professors of the Cheung Kong Scholars, and the Program for Cheung Kong Scholars and Innovative Research Team in University (PCSIRT).
References 1. Kale, A., Cuntoor, N., Chellappa, R.: A framework for activity-specific human identification. In: Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing (May 2002) 2. Sundaresan, A., RoyChowdhury, A., Chellappa, R.: A Hidden Markov Model Based Framework for Recognition of Humans from Gait Sequences. In: Proceedings of IEEE International Conference on Image Processing. IEEE Computer Society Press, Los Alamitos (2003) 3. Liu, Z., Malave, L., Sarkar, S.: Studies on Silhouette Quality and Gait Recognition. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04). IEEE Computer Society Press, Los Alamitos (2004) 4. Iwamoto, K., Sonobe, K., Komatsu, N.: A Gait Recognition Method using HMM. In: SICE Annual Conference in Fukui, Japan (2003) 5. Chen, C., Liang, J., Zhao, H., Hu, H.: Gait Recognition Using Hidden Markov Model. In: Jiao, L., Wang, L., Gao, X., Liu, J., Wu, F. (eds.) ICNC 2006. LNCS, vol. 4221, pp. 399–407. Springer, Heidelberg (2006)
6. Ghahramani, Z., Jordan, M.: Factorial Hidden Markov Models. Computational Cognitive Science Technical Report 9502 (Revised) (July 1996) 7. Brand, M.: Coupled hidden Markov models for modeling interacting processes. MIT Media Lab Perceptual Computing/Learning and Common Sense Technical Report 405 (Revised) (June 1997) 8. Gross, R., Shi, J.: The CMU Motion of Body (MoBo) Database. Technical report, Robotics Institute (2001) 9. Logan, B., Moreno, P.: Factorial Hidden Markov Models for Speech Recognition: Preliminary Experiments. Cambridge Research Laboratory Technical Report Series (September 1997) 10. Liu, Y., Collins, T., Tsin, Y.: Gait Sequence Analysis using Frieze Patterns. CMU-RI-TR-01-38 11. Logan, B., Moreno, P.: Factorial HMMs for Acoustic Modeling. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 813–816. IEEE Computer Society Press, Los Alamitos (1998) 12. Zhang, R., Vogler, C., Metaxas, D.: Human Gait Recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE Computer Society Press, Los Alamitos (2004)
A Robust Fingerprint Matching Approach: Growing and Fusing of Local Structures Wenquan Xu, Xiaoguang Chen, and Jufu Feng State Key Laboratory on Machine Perception, Center for Information Science, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, P.R. China {xuwq,chenxg,fjf}@cis.pku.edu.cn
Abstract. This paper proposes a robust fingerprint matching approach based on the growing and fusing of local structures. First, we obtain candidate minutiae triangles, the smallest local structure in our approach; then all candidate minutiae structures grow into larger local structures (which we call growing regions), on the basis of which we define the credibility of minutiae triangles and introduce a competition strategy; finally, the growing regions compete and fuse into a much larger local structure (called the fusion region). The matching score is calculated from the evaluation of the growing and fusing of local structures. Results on FVC2004 show that the proposed approach is robust to nonlinear fingerprint deformation and is efficient. Keywords: fingerprint, nonlinear deformation, robust, local, global, grow, fuse.
1 Introduction The nonlinear deformation of a fingerprint, introduced by the elasticity of the finger and the non-uniform pressure applied during acquisition, makes fingerprint matching algorithms less efficient and less accurate. Based on the idea that local regions may be less affected by the global deformation, many fingerprint matching algorithms adopt a local-and-global matching scheme: we first define a local structure that is at least invariant to affine transformation; after obtaining candidate local structures that are possibly matched, we combine them into a global match result. The efficiency of this matching scheme depends on two critical points: first, how reliable the local structure is; second, how well the local match becomes a global match. Much work has addressed the former point. Jain and Hong [2] use a local structure consisting of a minutia and a list of sampling points on the same ridge as the minutia. En Zhu [3] constructs a local structure with the orientation information near the minutia. X. Jiang [4] suggests that the minutiae-triangle local structure is less affected by nonlinear deformation. Y. He and J. Tian [5] construct a local structure with two minutiae, which they assume to be more reliable because it also incorporates ridge-count information. Xuefeng Liang and Tetsuo Asano [8] introduce minutiae polygons that include more information
near bifurcation minutiae. On the latter point there is relatively less work. Jain and Hong [2] adopt an explicit alignment method in the global matching procedure. Bazen and Gerez [6] introduce an implicit alignment using thin-plate splines. Both explicit and implicit alignment need reference points. Chikkerur et al. [7] introduce a graph matching algorithm, CBFS, that requires no alignment. Y. Feng and J. Feng [1] give a definition of local compatibility in an attempt to measure the coherence of local structure pairs and embed it in the global matching process.
Fig. 1. Two impressions of the same fingerprint (FVC2004 DB1_A 14_3 and 14_6). Because of nonlinear deformation, the ridge in the left impression is close to a straight line while the corresponding ridge in the right impression is a curve.
Fig. 2. Minutiae structures: A, minutiae whip; B, minutiae triangle; C, minutiae orientation cross: minutiae structure with local orientation; D, K-plet; E, minutiae stick; F, minutiae polygon.
2 Proposed Method This paper shows a natural way in which local matches grow into a global match using a competition strategy. First, candidate minutiae triangles are obtained; then the candidate minutiae structures grow into larger local structures (called
growing regions), on the basis of which we define the credibility of each minutiae triangle; finally, growing regions compete and fuse into a much larger local structure (called the fusion region). The matching score is calculated from the evaluation of both the growing regions and the fusion region. In this way, we obtain the match result without alignment. The rest of this paper follows the growing and fusing sequence of local structures: from minutiae to minutiae triangles, to growing regions, to the fusion region. 2.1 Matching of Minutiae Structures With the minutiae triangle as the local structure, we adopt the Delaunay triangulation method to obtain minutiae triangles, since it is an equiangular triangulation method that considers both the angles and the side lengths of the triangles, making the triangulation more robust to nonlinear deformation. We define a rotation-invariant feature vector: Vtri = (d_pm, d_pn, ρ_p, ρ_m, θ_pm, θ_pn)
where d_pi denotes the distance between minutia p and minutia i; ρ_p is ∠mpn, ρ_m is ∠nmp and ρ_n is ∠mnp (with ρ_p ≥ ρ_m ≥ ρ_n); and θ_pi denotes the relative radial angle between the directions of minutia p and minutia i (so θ_pm is the relative radial angle between the directions of p and m), as in Fig. 3. The distance between minutiae triangles can then be defined by: Dtri(Vtri1, Vtri2) = Wtri^T |Vtri1 − Vtri2|,
where Wtri is a 1×6 weight vector, and Vtri1 and Vtri2 are the feature vectors of two minutiae triangles. By adjusting Wtri we can increase the contribution of the side lengths and angles of the triangle while decreasing the influence of the minutiae orientations.
Fig. 3. Minutia triangle constructed by three minutiae p, m and n
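A sketch of the triangle construction and comparison described above: Delaunay triangulation over the minutiae positions, the rotation-invariant vector Vtri built from two side lengths, the interior angles, and the relative minutia directions, and the weighted distance Dtri. The ordering conventions, the angle wrapping, and the weight values are illustrative assumptions.

import numpy as np
from scipy.spatial import Delaunay

def minutiae_triangles(minutiae):
    """Delaunay triangulation over minutiae (x, y) positions -> triples of indices."""
    return Delaunay(np.asarray(minutiae)[:, :2]).simplices

def triangle_feature(minutiae, tri):
    """Vtri = (d_pm, d_pn, rho_p, rho_m, theta_pm, theta_pn) for one triangle.
    minutiae: rows (x, y, direction); the vertex with the largest interior angle plays the role of p."""
    pts = np.asarray(minutiae)[list(tri), :2]
    dirs = np.asarray(minutiae)[list(tri), 2]
    angles = []
    for i in range(3):                         # interior angles via the law of cosines
        u = pts[(i + 1) % 3] - pts[i]
        v = pts[(i + 2) % 3] - pts[i]
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        angles.append(np.arccos(np.clip(c, -1.0, 1.0)))
    p, m, n = np.argsort(angles)[::-1]         # rho_p >= rho_m >= rho_n
    wrap = lambda a: np.angle(np.exp(1j * a))  # wrap relative angles to (-pi, pi]
    return np.array([np.linalg.norm(pts[p] - pts[m]),
                     np.linalg.norm(pts[p] - pts[n]),
                     angles[p], angles[m],
                     wrap(dirs[m] - dirs[p]), wrap(dirs[n] - dirs[p])])

def triangle_distance(v1, v2, w=(1.0, 1.0, 1.0, 1.0, 0.5, 0.5)):
    """Dtri = Wtri^T |Vtri1 - Vtri2| with an illustrative weight vector."""
    return float(np.dot(np.asarray(w), np.abs(np.asarray(v1) - np.asarray(v2))))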
2.2 Growing Region and Credibility of Minutiae Structures In this stage, the minutiae triangle grows into the growing region, a larger matched local structure. According to the similarity of the growing regions, we define the credibility of the minutiae structure.
Fig. 4. Left: a core candidate triangle grows into a growing region; Right: the feature vector of a star point in the growing region is defined by the affine-transform-invariant feature vector Vsp = {θ1, θ2, θ3, ϕ1, ϕ2, ϕ3}
In the work of Y. Feng and J. Feng [1], the neighboring minutiae triangles are implicitly used to define the credibility of a minutiae triangle. But because minutiae triangles are constructed independently in the template fingerprint and the query fingerprint, a candidate triangle can be elected only when its neighboring minutiae triangles are also candidate minutiae triangles. In this paper, we give a more reasonable definition of the credibility of a candidate minutiae structure using the neighboring minutiae of the minutiae triangle. The growing region is developed from a core candidate triangle by including every other minutiae triangle that satisfies either of the following conditions: a) the minutiae triangle has at least one common vertex with the core candidate triangle; b) a neighboring triangle of the minutiae triangle has at least one common vertex with the core candidate triangle, as in Fig. 4 left. For each pair of core candidate triangles, we get a pair of growing regions, which is actually a pair of point sets, so computing the similarity of two growing regions can be treated as computing the similarity of two point sets. Since the correspondence of the three vertices of the core candidate triangles is known, we can further simplify it to a string matching problem. Each minutia point Vi in the growing regions (we call it a star point) can be identified by the affine-transformation-invariant feature vector:
Vsp = {r, θ1, θ2, θ3, ϕ1, ϕ2, ϕ3},

where r is the Euclidean distance between the star point and Tc, the center of the core candidate structure; θi is the angle between line TiTc and the minutia direction of Vi; and ϕi is the angle between line TiTc and line ViTc, as in Fig. 4 right. Then we have the definition of the distance between two star points:

Dsp(Vsp^1, Vsp^2) = Wsp1^T f(Vsp^1, Vsp^2),
f(Vsp^1, Vsp^2) = ( (log r^1 − log r^2) / log 0.7,
                    ( |θ1^1 − θ1^2| + |θ2^1 − θ2^2| + |θ3^1 − θ3^2| ) / (3 × 30),
                    ( |ϕ1^1 − ϕ1^2| + |ϕ2^1 − ϕ2^2| + |ϕ3^1 − ϕ3^2| ) / (3 × 20) )^T,
where Wsp1 is a 1× 3 weight vector, and Vsp1 and Vsp 2 are feature vectors of two star points. We can further define the similarity of two star points as:
σ(Vsp^1, Vsp^2) = 100 × exp( −Wsp2^T f²(Vsp^1, Vsp^2) )   if Dsp(Vsp^1, Vsp^2) < bsp,   and 0 otherwise.
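A sketch of the star-point comparison just defined: the three-component difference vector f, the weighted distance Dsp, and the thresholded similarity σ. The weight vectors and threshold are the FVC2004 values quoted in the text below; the absolute value taken on the log-ratio component (so that all components are non-negative) and the array layout are assumptions.

import numpy as np

W_SP1 = np.array([1.2, 1.2, 0.8])
W_SP2 = np.array([0.5, 0.5, 0.5])
B_SP = 1.8

def star_diff(v1, v2):
    """f(V1, V2) for star points v = (r, th1, th2, th3, ph1, ph2, ph3), angles in degrees."""
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    d_r = abs(np.log(v1[0]) - np.log(v2[0])) / abs(np.log(0.7))
    d_theta = np.sum(np.abs(v1[1:4] - v2[1:4])) / (3 * 30.0)
    d_phi = np.sum(np.abs(v1[4:7] - v2[4:7])) / (3 * 20.0)
    return np.array([d_r, d_theta, d_phi])

def star_similarity(v1, v2):
    """sigma = 100*exp(-W_sp2^T f^2) if D_sp = W_sp1^T f < b_sp, else 0."""
    f = star_diff(v1, v2)
    if float(np.dot(W_SP1, f)) >= B_SP:
        return 0.0
    return 100.0 * float(np.exp(-np.dot(W_SP2, f ** 2)))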
In our experiment on FVC2004, Wsp1 = (1.2, 1.2, 0.8 )T , Wsp 2 = (0.5, 0.5, 0.5 )T , and
bsp = 1.8. We convert the unordered star points of the growing region into an ordered sequence by arranging them in increasing order of ϕ1, and then adopt the dynamic-programming approach of [6], which is based on string matching, to obtain the number of matched star points. The credibility of a core candidate triangle pair is determined by the similarity of its growing regions, which represent local areas. In our experiments, the credibility of a minutiae structure pair is defined as:
CR = n, where n is the number of matched minutiae in the pair of growing regions. In this way, a minutiae structure pair gains more credit if it has more matched minutiae in its growing regions. The growing region covers a larger area than the minutiae triangle and is therefore more affected by nonlinear deformation, but compared with the global region it is still less affected. 2.3 Fusing of Growing Regions and the Compatibility of Minutiae Structures Human experts usually pay special attention to the interrelation of candidate minutiae structures and double-check the neighborhood of each minutiae structure. In order to imitate this behavior, we propose a measure of the coherence of local structure pairs, called local structure compatibility. According to the compatibility of minutiae structures, all growing regions are fused into a fusion region. The fusing process is done by majority voting among the candidate local structures, during which the candidate local structures compete against each other.
2.3.1 The Compatibility of Minutiae Structures Some of the candidate minutiae triangle pairs obtained from the local matching process may not be compatible, because of their relative position, their relative pose, or both. In order to capture this difference, we first define the feature vector of two minutiae structures as:
Vmm = {r, ϕ1, ϕ2, ϕ3, θ1, θ2, θ3}, where r is the distance between the minutiae triangles, ϕi is the angle from line pc mc to line pc pi, and θi is the angle from line pc mc to line mc mi, as in Fig. 5 left. Then we define the compatibility of minutiae triangles as:
CO = G( W · |Vmm^1 − Vmm^2| ),
where G(x) is a monotonically decreasing function and W is a weight vector. We simply choose G(x) as:
G(x) = 1 if x < bco, and 0 otherwise, where bco is a predefined boundary. Since Vmm only accounts for the Euclidean distance between two minutiae triangles, it cannot discriminate the topological difference between two pairs of minutiae triangles. Therefore, we define three types of topological condition to check the compatibility of minutiae triangles, as in Fig. 5 right: A, separate; B, sharing one vertex; C, sharing two vertices.
Fig. 5. Left, the feature vector of two triangle local structure can be defined by {r , {ϕi }, {θi }} ; Right, three kinds of topological conditions to check the compatibility of minutiae triangles: A, separate; B, share one vertex; C, share two vertices.
2.3.2 Majority Voting Among Minutiae Structures In the fusing process, the largest possible group of minutiae structures that are most compatible with each other is selected. We hold a majority vote with a competition strategy, in which every pair of minutiae structures scores the others, and only those candidate minutiae structure pairs whose score is larger than a certain value bmv survive and are fused into a global match region. Then all
three vertices of the compatible minutiae triangles, together with their matched neighboring minutiae, are labeled as matched minutiae. Sometimes a minutia in the template fingerprint may have several possible matched minutiae in the query fingerprint, obtained from different local matching regions (LMRs); in this case these minutiae correspondences are not reliable and are omitted.

Algorithm: Majority Voting among Minutiae Structures
1. Let Vote[i], i = 1, 2, ..., n represent the vote of each pair of minutiae structures, and let {Qi, Ti, CRi}, i = 1, 2, ..., n represent the candidate minutiae structure pairs and their credibility.
2. Initialization: Vote[i] = CRi
3. Scoring:
   for i = 1 to n-1
      for j = i+1 to n
         CO = the compatibility of {Qi, Ti} and {Qj, Tj}
         Vote[i] = Vote[i] + CRj × CO
         Vote[j] = Vote[j] + CRi × CO
      end
   end
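A direct, runnable transcription of the voting procedure above; the compatibility function CO of Section 2.3.1 is passed in, and the survival threshold b_mv is illustrative.

def majority_voting(candidates, credibilities, compatibility, b_mv):
    """candidates: list of (Q_i, T_i) minutiae-structure pairs; credibilities: CR_i values.
    Returns the indices of the pairs that survive the vote."""
    n = len(candidates)
    vote = list(credibilities)                    # Vote[i] = CR_i
    for i in range(n - 1):
        for j in range(i + 1, n):
            co = compatibility(candidates[i], candidates[j])
            vote[i] += credibilities[j] * co      # Vote[i] += CR_j * CO
            vote[j] += credibilities[i] * co      # Vote[j] += CR_i * CO
    return [i for i in range(n) if vote[i] > b_mv]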
After the fusing process, all the compatible growing regions are fused into a larger local structure (called the fusion region). The fusion region is a circular area just large enough to hold all the matched minutiae. 2.3.3 Evaluation of the Fusion Region The fusion region is still a local area, but larger than a growing region. The evaluation uses the information of unmatched minutiae in the fusion region, which minimizes the false match rate. It is based on two observations: 1) all growing regions in a genuine match tend to fuse into one connected matched region, whereas in an impostor match the growing regions may fuse into several separate matched regions — that is, in a genuine match almost all minutiae located in the fusion region are matched minutiae, while in impostor matches a good portion of the minutiae located in the fusion region are unmatched; 2) the deformation of the fusion region is consistent in a genuine match, either squeezing, stretching, or rotating in the same direction, which is not true for an impostor match. A criterion function is used to evaluate the fusing result:
CGMA = n / max(n_ft, n_fq),

where n is the number of matched minutiae between the template fingerprint and the query fingerprint, and n_ft and n_fq are the numbers of minutiae located in the fusion region in the template fingerprint and the query fingerprint, respectively.
2.4 Scoring The scoring process uses information from the local as well as the global level to measure the similarity of local structures as they grow. The match score is calculated as:

Score = 100 × ( n² / (n_t n_q) ) × n_LMR × C_GMR,

where n_LMR is the number of growing regions and C_GMR is the evaluation of the fusion region (the CGMA value defined above).
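The fusion-region term and the final score written out as code, under the assumption (left implicit in the text above) that n_t and n_q denote the numbers of minutiae in the template and query fingerprints.

def cgma(n_matched, n_fusion_template, n_fusion_query):
    """CGMA = n / max(n_ft, n_fq): fraction of fusion-region minutiae that are matched."""
    return n_matched / max(n_fusion_template, n_fusion_query)

def match_score(n_matched, n_template, n_query, n_growing_regions, c_gmr):
    """Score = 100 * n^2 / (n_t * n_q) * n_LMR * C_GMR."""
    return 100.0 * n_matched ** 2 / (n_template * n_query) * n_growing_regions * c_gmr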
Fig. 6. Comparing the matching result of fingerprints with large nonlinear deformation between an alignment-based method and our method: A, minutiae correspondences of two impressions of a fingerprint after manual alignment; B and C, minutiae correspondences obtained by our approach

Table 1. Result of our method on the database of FVC2004

Database    EER (%)    FMR100 (%)    FMR1000 (%)    ZeroFMR (%)
DB1_a       3.819      7.363         7.354          12.200
DB2_a       4.712      8.854         8.961          15.769
DB3_a       2.923      4.201         4.249          7.592
3 Experiments and Results Our experiments are performed on the FVC2004 fingerprint databases (DB1_A, DB2_A, DB3_A). Each of the three databases contains 8 impressions of 100 fingers,
800 fingerprints in total. This dataset is available on the FVC2004 website and has proved to be the most difficult for most methods, because no effort was made to control image quality and the sensor platens were not systematically cleaned [9]. We use the standard FVC performance indicators: genuine and impostor score distributions, equal error rate (EER) and ROC [10]. Using only minutiae information, our approach achieves EERs of 3.8%, 4.7% and 2.9% on the three FVC2004 databases (DB1_a, DB2_a, DB3_a, respectively). The average processing time is 0.03 seconds per match on an AMD64 3200+ PC with 1 GB of memory under Windows 2003. Figure 1 shows two impressions of a fingerprint from FVC2004 DB1_A (14_3 and 14_6): because of nonlinear deformation, the ridge in the left impression is approximately a straight line while the corresponding ridge in the right impression is a curve. The minutiae correspondences marked manually after alignment are shown in Fig. 6 left, from which we can see that it is hard to find the correspondences using an alignment-based fingerprint matching algorithm. With our approach, however, we can find almost all the minutiae correspondences located in the overlap region.
4 Conclusion and Future Work In this paper, we present a robust fingerprint matching approach in which fingerprints are matched by local structures in three stages. Local structures grow and compete in each stage and gradually become a global region; as a result, no alignment is needed. The results of our experiments on FVC2004 show that our approach is robust to nonlinear deformation and has a low false accept rate. The triangulation process is the first step of our approach and is thus the most important stage. It is actually a difficult problem for several reasons: C(n,3) different triangles can be generated from n minutiae; a triangle set that is too large has high computational and memory complexity; and a triangle set that is too small often fails to contain common triangles. In our experiments, Delaunay triangulation yields a more reliable triangle set than the k-neighbor method [1][4]. The evaluation and the growing strategy applied to the matched local structures estimate how well the local structures match, so they have to account for the deformation of the fingerprint, which unfortunately lacks an efficient description. In our method we sidestep this problem by placing more stress on the number of matched minutiae. In future work, we will study the nonlinear deformation of fingerprints and give a more reasonable definition of the matched local structure.
Acknowledgments. This work was supported by NSFC (60575002, 60635030), NKBRPC (2004CB318000) and the Program for New Century Excellent Talents in University. We thank all the reviewers for their useful comments.
References 1. Feng, Y., Feng, J.: A Novel Fingerprint Matching Scheme Based on Local Structure Compatibility. In: ICPR2006, vol. 4, track 4, Thu-O-II-1b (2006) 2. Jain, A., Hong, L.: On-line fingerprint verification. IEEE Trans. PAMI. 19(4), 302–314 (1997)
3. Zhu, E.: Fingerprint matching based on global alignment of multiple reference minutiae. Pattern Recognition 38, 1685–1694 (2005) 4. Jiang, X., Yau, W.-Y.: Fingerprint minutiae matching based on the local and global structures. In: ICPR2000, vol. 2, pp. 1038–1041 (2000) 5. He, Y., Tian, J.: Fingerprint Matching Based on Global Comprehensive Similarity. IEEE Trans. PAMI 28(6), 850–862 (2006) 6. Bazen, A.M., Gerez, S.H.: Fingerprint matching by thin-plate spline modeling of elastic deformations. Pattern Recognition 36, 1859–1867 (2003) 7. Chikkerur, S., Govindaraju, V., Cartwright, A.N.: K-plet and Coupled BFS: A Graph Based Fingerprint Representation and Matching Algorithm. In: ICB2006 (2006) 8. Liang, X., Asano, T.: Fingerprint Matching Using Minutia Polygons. In: ICPR2006, vol. 1, track 4, Mon-O-IV-2 (2006) 9. Fingerprint Verification Competition: (FVC2004) (2004), http://bias.csr.unibo.it/fvc2004 10. Cappelli, R., Maio, D., Maltoni, D., Wayman, J., Jain, A.K.: Performance Evaluation of Fingerprint verification systems. IEEE Trans. PAMI. 28(1), 3–18 (2006)
Automatic Facial Pose Determination of 3D Range Data for Face Model and Expression Identification Xiaozhou Wei, Peter Longo, and Lijun Yin Department of Computer Science, State University of New York at Binghamton, Binghamton, NY
Abstract. Many contemporary 3D face recognition and facial expression recognition algorithms depend on locating primary facial features, such as the eyes, nose, or lips; others depend on determining the pose of the face. We propose a novel method for limiting the search space needed to find these "interesting features." We then show that our algorithm can be used in conjunction with surface labeling to robustly determine the pose of a face. Our approach does not require any training, is pose-invariant, and can be applied both to manually cropped models and to raw range data, which can include the neck, ears, shoulders, and other noise. We applied the proposed algorithm to our 3D range model database, and the experiments show promising results for classifying individual faces and individual facial expressions. Keywords: Surface Normal Difference, facial pose detection, 3D range model.
1 Introduction Facial pose estimation is a critical first step towards developing a successful system for both face recognition [21] and facial expression recognition [15]. The majority of existing systems for facial expression recognition [21, 12] and face recognition [13, 14, 18] operate in 2D space. Unfortunately, 2D data is unsatisfactory because it is inherently unable to handle faces with large head rotation, subtle skin movement, or lighting changes under varying postures. With the recent advance of 3D imaging systems [23, 11], research on face and facial expression recognition using 3D data has intensified [8, 9, 10, 16, 19, 22]. However, almost all existing 3D-based recognition systems are based on static 3D facial data, whereas we are interested in face and facial expression recognition in a dynamic 3D space. One of the prerequisites of dynamic 3D facial data analysis is an algorithm that can robustly determine a face's pose. A number of methods have previously been proposed for determining the pose of a face in 2D and 3D space. Most can be broadly categorized as either feature-based [2] or appearance-based [6]. Feature-based methods attempt to relate facial pose to the spatial arrangement of significant facial features, while appearance-based methods consider the face in its entirety [5]. Recently, approaches have been developed that combine feature-based and appearance-based techniques [8, 9], with very encouraging results. Some
Support Vector Regression based approaches [3] have shown impressive results, but they require a training set. In this paper, we describe a pose- and expression-invariant algorithm that can robustly determine a face's pose. We have tested our algorithm on preprocessed, manually cropped 3D facial models and on unprocessed raw 3D data coming directly from our dynamic 3D imaging system. The general framework of our approach is outlined in Figure 1. The first step is to remove the image's boundary, since there is no guarantee that it is smoothly cropped. Then we apply our novel Surface Normal Difference (SND) algorithm, which produces groups of triangles; the groups containing the fewest triangles are ignored, and the triangles in the other groups are labeled as "potentially significant." Principal Component Analysis (PCA) is run on the vertices of the "potentially significant" triangles in order to align the model in the Z direction and determine the location of the nose tip. Finally, we label the very concave "potentially significant" triangles as "significant," and use the resulting groups, together with the symmetry of the face, to find the nose bridge point. Lastly, we evaluate the proposed algorithms through experiments on our dynamic 3D face recognition and dynamic 3D facial expression recognition systems. Each part of our framework is elaborated in the following sections.
Fig. 1. Pipeline for determining the pose of a 3D facial range model
2 Surface Normal Difference (SND) Let us define a spatial triangle as a tuple of three vertices. Assume a triangle t = (v1, v2, v3), which has a three-dimensional normal vector nt consisting of x, y, and z components, and a triangle s = (s1, s2, s3). We call s a "neighbor of" t if the sets {v1, v2, v3} and {s1, s2, s3} are not disjoint:

neighbor(s, t) ⟶ {v1, v2, v3} ∩ {s1, s2, s3} ≠ ∅
Assume a set of triangles, N, which contains all of t's neighboring triangles. For each triangle u in N, with normal nu, we determine the angle θtu between the normal vectors of t and u:

θtu = cos⁻¹( (nt · nu) / (‖nt‖ ‖nu‖) )
We determine the maximum value of θtu and call it θmax. If θmax is greater than a specified angular tolerance, δ, we add t to set G. Otherwise, we add t to set L. We repeat this procedure for all triangles in the facial mesh. Upon completion, we label the triangles in G as “potentially significant” and the triangles in L as “not significant.”
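A sketch of the SND labelling just described, for an indexed triangle mesh: compute unit normals per triangle, find for each triangle the largest angle its normal makes with any vertex-sharing neighbour, and mark it 'potentially significant' when that angle exceeds the tolerance δ. The mesh layout and function names are assumptions.

import numpy as np
from collections import defaultdict

def triangle_normals(vertices, faces):
    """Unit normals for each triangle of an indexed mesh (vertices: Vx3, faces: Fx3 int)."""
    v0, v1, v2 = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)
    return n / (np.linalg.norm(n, axis=1, keepdims=True) + 1e-12)

def snd_labels(vertices, faces, delta_deg=10.0):
    """True where a triangle is 'potentially significant' (theta_max > delta)."""
    normals = triangle_normals(vertices, faces)
    tris_at_vertex = defaultdict(set)            # neighbours share at least one vertex
    for t, face in enumerate(faces):
        for v in face:
            tris_at_vertex[int(v)].add(t)
    significant = np.zeros(len(faces), dtype=bool)
    for t, face in enumerate(faces):
        neighbours = set().union(*(tris_at_vertex[int(v)] for v in face)) - {t}
        for u in neighbours:
            cosang = np.clip(np.dot(normals[t], normals[u]), -1.0, 1.0)
            if np.degrees(np.arccos(cosang)) > delta_deg:
                significant[t] = True
                break
    return significant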
Fig. 2. Neighbor normal illustration: (a) Mesh comprising part of a cheek; (b) Mesh comprising part of an eye
Figure 2a shows a mesh comprising part of a cheek and Figure 2b shows a mesh comprising part of an eye. In both figures, t is the blue triangle. The triangles in N are colored black. nt is represented by the thick yellow line protruding from the blue triangle. nu, for each triangle in N, is represented by the thick green line protruding out of each black triangle. θmax is much less in Figure 2a than in Figure 2b. This is what we expect and what our approach is based on. We have found that many of the triangles in important facial regions, including the nose, eye, and lip regions, have much larger θmax values than those in less important facial regions, such as the cheek and forehead regions.
3 Pose Determination Our pose estimation approach consists of three key steps: region of interest (ROI) detection followed by nose tip and nose bridge determination. Figure 3 illustrates a processed range image going through our pipeline, with Figure 3a showing a processed image from our database. 3.1 Determination of Region of Interest The first step is to remove the 3D range image’s boundary triangles since the boundary may be rough, and a rough boundary would negatively affect our approach’s accuracy. Figure 3b shows the result of this initial step. Next, the Surface Normal Difference (SND) algorithm is applied. The normal of every remaining triangle in the 3D mesh is determined, and the maximum angle that each triangle’s normal makes with an adjacent triangle’s normal is calculated. If this maximum angle is greater than an angular tolerance, δ, both triangles whose normal vectors made the angle are marked as “potentially significant.” Otherwise, the corresponding triangle is marked as “not significant.” This procedure is repeated, by incrementing δ, until the number of “potentially significant” triangles is less than α. We initially set δ to 10 degrees and have empirically set α to 3000. Figure 3c shows the “potentially significant” triangles after applying the SND algorithm. Usually, the SND algorithm will keep a number of small, connected surfaces that are not part of a significant facial region. For example, it may keep a surface corresponding to a pimple on the forehead or a cut on the cheek. Most of the time, these outlying surfaces are composed of a very small number of triangles relative to
the largest remaining connected surface, which usually contains, at a minimum, the eye and nose regions. In order to filter out these outlying surfaces, the maximum number of triangles contained in any connected surface, ρ, is determined; any surface composed of fewer than κ percent of ρ triangles is considered an outlying surface, and all of its triangles are marked as "not significant." We have empirically set κ to 0.1. At this point, a vertex is labeled as "significant" if it is part of a "potentially significant" triangle; all other vertices are labeled as "not significant." We call the mesh composed of the remaining "potentially significant" triangles a Sparse Feature Mesh (SFM). The SFM is shown in Figure 3d. 3.2 Determination of Nose Tip The PCA algorithm is run on the SFM vertices in order to align the model in the Z direction and determine the location of the nose tip (NT), as shown in Figure 3e:

NT ∈ Vsig | max(z)
where Vsig denotes the set of "significant" vertices. Note that the principal depth direction (Z) of a model can be reliably estimated from the minimum eigenvalue, given the SFM vertices, even though the X and Y components may not be aligned to the correct model orientation. The shape index [1], a surface primitive based on surface curvatures, is calculated at each "significant" vertex. All triangles that have at least one very concave vertex (a shape index less than −0.50) are marked as "significant"; all other triangles are marked as "not significant." A number of discrete groups of triangles, usually about 20, remain; these groups usually include the corners of the eyes and the sides of the nose and mouth. The "significant" triangles are shown in Figure 3f. 3.3 Determination of Nose Bridge In order to locate the nose bridge point, all pairs of groups meeting certain general geometric criteria are iterated over, and the symmetry of the shape indices of the vertices near each line connecting a pair of candidate groups (LCG), and near that line translated to the nose tip (LNT), is calculated. The sum of these two symmetry values is minimized and the line perpendicular to the LCG that passes through the nose tip (HNT) is inspected. All the "significant" vertices within an XY distance of γ1 from the line connecting two candidate groups compose an LCG:

LCG(g1, g2) = ∀( v ∈ Vsig | dist_xy(v, g1g2) < γ1 )
where g1 and g2 are candidate groups and γ1 is the 3D length of an arbitrary mesh triangle's side. The two points of maximum concavity on either side of the midpoint of the LCG, at least a certain distance γ2 from the midpoint, are found, and the point between these two maxima with the greatest Z value is referred to as the PBMZ:

PBMZ(V) = v ∈ V | max(z)
where V denotes the vertices in the region of interest. Let B be the set containing the LCG vertices between these two maxima:

B = { v ∈ V | v ≥ max1 ∧ v ≤ max2 }
The symmetry of the shape indices of B about the PBMZ is determined by summing the mean squared differences of the shape indices of the 0.50·|B| vertices in B closest to the PBMZ. If there are not at least 0.25·|B| vertices between the PBMZ and either maximum, the LCG is rejected, because the nose bridge point is expected to be a point of close to maximum Z almost exactly in between two maximum concavities (i.e., the eye corners). The LCG is translated to the nose tip and the above procedure is repeated to determine the symmetry of the LNT:

sym(B) = Σ_{i=1}^{0.25·|B|} ( b_{PBMZ(V)−i} − b_{PBMZ(V)+i} )² / |B|
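A sketch of the symmetry measure above, assuming shape_idx holds the shape indices of the LCG (or LNT) vertices in order along the line and p is the position of the PBMZ within that ordering.

import numpy as np

def line_symmetry(shape_idx, p):
    """sym(B): squared differences of shape indices mirrored about the PBMZ,
    over the 0.25*|B| closest vertices on each side, normalized by |B|."""
    shape_idx = np.asarray(shape_idx, dtype=float)
    size = len(shape_idx)
    total = 0.0
    for i in range(1, int(0.25 * size) + 1):
        if p - i < 0 or p + i >= size:
            break
        total += (shape_idx[p - i] - shape_idx[p + i]) ** 2
    return total / size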
The optimal groups are found by using the symmetry minimization method defined below:

g1_opt, g2_opt ∈ Gsig | min( sym(LCG(g1, g2)) + sym(LNT(g1, g2)) )
If the sum is minimal, we inspect the corresponding HNT, which is composed of the "significant" vertices within an XY distance of γ1 from the line perpendicular to the LCG that passes through the nose tip:

HNT(g1, g2) = ∀( v ∈ Vsig | dist_xy(v, ⊥NT(g1g2)) < γ1 )
where ⊥NT(g1g2) denotes the line perpendicular to g1g2 that passes through the nose tip. The variances of the shape indices of the 0.25·|HNT| vertices closest to the nose tip on each side of the LNT are calculated and compared; the side with the lesser variance is considered the nose bridge side. If either side has fewer than three vertices, the HNT is rejected, because the optimal HNT is expected to have a large number of "significant" vertices. NB1 is the set containing all of the HNT vertices on one side of the LNT and NB2 is the set containing all of the HNT vertices on the other side. Of these two sets, the one containing the HNT vertices on the nose bridge side of the LNT is renamed NBcandidates and the other NBnon-candidates:
NB1 = ∀( v ∈ HNT(g1, g2) | dist_xy(v, LNT(g1, g2)) < 0 )

NB2 = ∀( v ∈ HNT(g1, g2) | v ∉ NB1 )

var′(V, L) = var( v ∈ V_NBcandidates | dist_xy(v, L) … )
30%), they have given only a few sequences (between 20 and 40), so their influence is small. If we compute the average of the EERs computed for each user, we obtain 4.5%, corresponding to fair performance. This value points out another problem with our method: probably because of the small number of problematic users, we are unable to achieve our second objective, which was to identify them before authentication with our clustering methods.
6 Conclusion The work presented in this paper shows that keystroke dynamics can be used to perform authentication or identification in real-world applications (with an EER around 5%). Adapting the thresholds and parameters of the system to user behaviour is a promising way of improving the performance of keystroke dynamics. In addition, combining classifiers by adding a fusion step to the system architecture also improves performance; our experiments show important improvements even with simple classifiers. Our work on parameter adaptation and user classification also shows interesting results, and further improvements remain possible. The authentication of problematic users is still a problem. Keystroke dynamics is therefore beginning to reach maturity, even if a number of problems can occur in real applications: for example, how will the system react when the keyboard is changed? This problem is also present in other biometric systems, and it probably explains why behavioural biometrics remain rather marginal in commercial applications.
References 1. Biopassword: Biopassword, http://www.biopassword.com 2. Gaines, R.S., et al.: Authentication by Keystroke Timing: Some Preliminary Results. Rand Corporation (1980) 3. Ilonen, J.: Keystroke dynamics. In: Advanced Topics in Information Processing (2003) 4. Peacock, A., Ke, X., Wilkerson, M.: Typing Patterns: A Key to User Identification. IEEE Security & Privacy Magazine 2(5), 40–47 (2004) 5. Monrose, F., Rubin, A.D.: Keystroke dynamics as a biometric for authentication. Future Generation Computer Systems 16(4), 351–359 (2000) 6. Yu, E., Cho, S.: Keystroke dynamics identity verification – its problems and practical solutions. Computers and Security 23(5), 428–440 (2004) 7. Chen, W., Chang, W.: Applying Hidden Markov Models to Keystroke Pattern Analysis for Password Verification. In: IEEE International Conference on Information Reuse and Integration, pp. 467–474. IEEE Computer Society Press, Los Alamitos (2004) 8. Revett, K., et al.: Authenticating computer access based on keystroke dynamics using a probabilistic neural network. DSI – Sistemas de Computação e Comunicações, 2006 (to appear) 9. Kittler, J., et al.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) 10. Jain, A.K., Nandakumar, K., Ross, A.: Score Normalization in Multimodal Biometric Systems. Pattern Recognition 38(12), 2270–2285 (2004) 11. Hocquet, S., Ramel, J.-Y., Cardot, H.: Estimation of User Specific Parameters in One-class Problems. In: 18th International Conference on Pattern Recognition, Hong Kong, pp. 449–452 (2006) 12. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (1990)
Statistical Texture Analysis-Based Approach for Fake Iris Detection Using Support Vector Machines Xiaofu He, Shujuan An, and Pengfei Shi Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, 200240, China {xfhe,sj_an,pfshi}@sjtu.edu.cn
Abstract. This paper presents a novel statistical texture analysis based method for detecting fake iris. Four distinctive features based on gray level co-occurrence matrices (GLCM) and properties of statistical intensity values of image pixels are used. A support vector machine (SVM) is selected to characterize the distribution boundary, for it has good classification performance in high dimensional space. The proposed approach is privacy friendly and does not require additional hardware. The experimental results indicate the new approach to be a very promising technique for making iris recognition systems more robust against fake-iris-based spoofing attempts.
1 Introduction
Biometric systems offer great benefits with respect to other authentication techniques; in particular, they are often more user friendly and can guarantee the physical presence of the user [1-2]. Iris recognition is one of the most reliable biometric technologies in terms of identification and verification performance. It mainly uses the iris pattern to recognize and distinguish individuals, since the pattern variability among different persons is enormous. In addition, as an internal (yet externally visible) organ of the eye, the iris is well protected from the environment and stable over time [3-9]. However, it is important to understand that, like any other authentication technique, iris recognition is not totally spoof-proof. The main potential threats for iris-based systems are [10-12]: 1) Eye image: screen image, photograph, paper print, video signal; 2) Artificial eye: glass/plastic, etc.; 3) Natural eye (user): forced use; 4) Capture/replay attacks: eye image, IrisCode template; 5) Natural eye (impostor): eye removed from body, printed contact lens. Recently, the feasibility of the last type of attack has been reported by some researchers [10-14]: they showed that it is actually possible to spoof some iris recognition systems with a well-made iris color lens. Therefore, it is important to detect fake irises as far as possible before subsequent iris recognition. In previous research, Daugman introduced a method using the FFT (Fast Fourier Transform) to check for a printed iris pattern [10-12]. His method detects the high-frequency spectral magnitude in the frequency domain, which appears distinctly and periodically for a printed iris pattern because of the characteristics of periodic dot printing. However, if the input counterfeit iris is defocused and blurred purposely, the counterfeit iris may be accepted as a live one. Some iris camera
manufacturer has also proposed a counterfeit iris detection method that turns an illuminator on and off and checks the specular reflection on the cornea. However, such a method can easily be spoofed by using a printed iris image with the printed pupil region cut out, through which the attacker's own eye looks, producing a genuine corneal specular reflection [13]. Lee et al. [14] proposed a new method of detecting fake iris attacks based on the Purkinje image, using collimated IR-LEDs (Infra-Red Light Emitting Diodes). In particular, they calculated the theoretical positions of and distances between the Purkinje images based on a human eye model. However, this method requires additional hardware and needs the user's full cooperation. To some extent, this interactive mode demands cooperation from a user who must be trained in advance, and it will eventually increase the time of iris recognition. In this paper, we propose a new statistical texture analysis based method for detecting fake irises. Four distinctive features based on the co-occurrence matrix and on properties of the statistical intensity values of image pixels are used. A support vector machine (SVM) is used to characterize the distribution boundary, for it has good classification performance in high-dimensional space and was originally developed for two-class problems. The proposed approach is privacy friendly and does not require additional hardware. The remainder of this paper is organized as follows: the proposed method is described in Section 2; Section 3 reports experiments and results; Section 4 concludes the paper.
2 Proposed Approach
2.1 Preprocessing
In our experiments, we find that the outer portion of the color contact lens (the region closer to the outer circle) provides the most useful texture information for fake iris detection, since this section of the fake iris is insensitive to pupil dilation. In addition, since the iris may be corrupted by the occlusion of eyelashes and eyelids, it is necessary to exclude them as much as possible. Therefore, we extract features only in the lower half of the iris pattern (called the region of interest, ROI), i.e., we minimize the influence of eyelashes and eyelids by abandoning the upper half of the iris pattern. As mentioned above, the useful iris information is distributed in the outer portion of the color contact lens. We have found empirically that the ROI is usually concentric with the outer circle of the iris and that the radius of the ROI area is restricted to a certain range. We therefore detect the outer circle to acquire the most discriminating iris features. The outer boundary is detected using the Hough transform together with the improved Canny edge detector [15-17], as shown in Fig. 1(a). After outer boundary detection, the ROI is estimated according to the radius rROI, chosen empirically, as seen in Fig. 1(b). In order to achieve invariance to translation and scale, the ROI is further normalized to a rectangular block of a fixed size W × H by anti-clockwise unwrapping of the iris ring, as shown in Fig. 1(c).
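A minimal sketch of this unwrapping step is given below. It is an illustrative Python implementation, not the authors' code: the nearest-neighbour sampling, the function name and its defaults are our own choices, with the ROI radius and the 90 × 360 block size taken from the values quoted in Section 3, and the centre and outer radius assumed to come from the Hough/Canny step above.

```python
import numpy as np

def normalize_roi(eye_img, center, r_outer, r_roi=70, width=360, height=90):
    """Unwrap the lower half of the iris ring into a height-by-width block."""
    cx, cy = center
    # Lower half only: with the image y axis pointing down, angles in [0, pi)
    # sweep the region below the horizontal line through the centre.
    thetas = np.linspace(0.0, np.pi, width, endpoint=False)
    radii = np.linspace(r_outer - r_roi, r_outer, height)
    out = np.zeros((height, width), dtype=eye_img.dtype)
    for row, r in enumerate(radii):
        xs = np.clip(np.round(cx + r * np.cos(thetas)).astype(int), 0, eye_img.shape[1] - 1)
        ys = np.clip(np.round(cy + r * np.sin(thetas)).astype(int), 0, eye_img.shape[0] - 1)
        out[row, :] = eye_img[ys, xs]          # nearest-neighbour sampling
    return out
```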
Fig. 1. Preprocessing. (a) Source color contact lens image and outer boundary detection result. (b) The normalization region used for fake detection. (c) Normalized image.
2.2 Feature Vector
The gray level co-occurrence matrix (GLCM) based analysis [18] is one of the most prominent approaches used to extract textural features. Each element P(i, j) of the GLCM represents the relative frequency with which two neighboring pixels, separated by a certain number of columns and lines, occur, one with gray tone i and the other with gray tone j. Such matrices of gray-level spatial-dependence frequencies are a function of the angular relationship between the neighboring resolution cells as well as of the distance between them. It is common practice to utilize four well-known properties of a GLCM, i.e., contrast, correlation, angular second moment (also known as energy) and homogeneity, as the co-occurrence matrix-based features. Although the approach was proposed more than 30 years ago, these properties still remain amongst the most popular and most discriminative types of texture features [19-20]. In this study, two well-known properties of the GLCM, i.e., the contrast con and the angular second moment asm, are utilized for creating the feature vector. In addition, the mean m and standard deviation σ1 of the intensity values of the image pixels are also used as feature values. These four feature values are defined as follows:
m=
1 W ×H
H
W
∑∑ I ( x, y) x =1 y =1
(1)
Statistical Texture Analysis-Based Approach for Fake Iris Detection
σ1 =
1 W ×H N
H
543
W
∑∑ ( I ( x, y) − m)
2
(2)
x =1 y =1 N
con = ∑∑ (i − j )2 P(i, j )
(3)
i =1 j =1 N
N
asm = ∑∑ P(i, j ) 2
(4)
i =1 j =1
Where I denotes the normalized iris image, W is the width of the normalized iris image, H is the height of the normalized iris image. P is the co-occurrence matrix, N denotes the dimension of the co-occurrence matrix. Therefore, these feature values are arranged to form a four dimensional feature vector.
V = [m, σ 1 , con, asm]T
(5)
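For concreteness, the sketch below evaluates Eqs. (1)-(5) on a normalized ROI with NumPy. The single horizontal one-pixel offset used for the co-occurrence counts is an assumption (the paper does not state the GLCM displacement); the 8-level quantization matches the setting reported in Section 3, and the helper name is ours.

```python
import numpy as np

def glcm_features(roi, levels=8, offset=(0, 1)):
    """Return the feature vector V = [m, sigma_1, con, asm] of Eqs. (1)-(5)."""
    img = roi.astype(np.float64)
    m = img.mean()                                    # Eq. (1), mean intensity
    sigma1 = img.std()                                # Eq. (2), standard deviation

    # Quantize to `levels` gray tones and count co-occurring neighbour pairs.
    q = np.clip((img / 256.0 * levels).astype(int), 0, levels - 1)
    dy, dx = offset
    a = q[:q.shape[0] - dy, :q.shape[1] - dx]
    b = q[dy:, dx:]
    P = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(P, (a.ravel(), b.ravel()), 1.0)
    P /= P.sum()                                      # relative frequencies

    i_idx, j_idx = np.indices(P.shape)
    con = np.sum((i_idx - j_idx) ** 2 * P)            # Eq. (3), contrast
    asm = np.sum(P ** 2)                              # Eq. (4), angular second moment
    return np.array([m, sigma1, con, asm])            # Eq. (5)
```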
2.3 Classification
After feature extraction, the ROI is represented as a feature vector of length four. The extracted features are used for classification by an SVM [21], which appears to be a good candidate because of its ability to transform the learning task into a quadratic programming problem in high-dimensional spaces. In addition, SVM was originally developed for two-class problems. In this paper, a radial basis function (RBF) kernel is used:
$$K(x, x_i) = \exp\left\{-\frac{|x - x_i|^2}{\sigma_2^2}\right\}, \quad (6)$$
where $x_i$ comprises the input features, and $\sigma_2$ is the standard deviation of the RBF kernel, which is three in our experiments.
The input to the SVM texture classifier is the feature vector of length four. The small size of this feature vector results in improved generalization performance and classification speed. The sign of the SVM output then represents the class of the iris. For training, +1 was assigned to the live iris class and -1 to the fake iris class. As such, if the SVM output for an input pattern is positive, it is classified as a live iris.
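As a hedged illustration of the training and decision step, scikit-learn's SVC can stand in for the SVM used by the authors. The arrays `training_rois`, `training_labels` and `test_roi` are hypothetical placeholders; C = 10 and the kernel standard deviation of 3 follow the parameter values quoted in Section 3, translated into scikit-learn's gamma convention.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: each row of X_train is the 4-D vector V of Eq. (5);
# labels are +1 for live irises and -1 for fake (colour contact lens) irises.
X_train = np.vstack([glcm_features(roi) for roi in training_rois])
y_train = np.asarray(training_labels)

# scikit-learn's RBF is exp(-gamma * ||x - x'||^2), hence gamma = 1 / sigma_2**2.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0 / 3.0 ** 2)
clf.fit(X_train, y_train)

# A test ROI is declared live when the decision value is positive.
is_live = clf.decision_function(glcm_features(test_roi).reshape(1, -1))[0] > 0
```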
3 Experimental Results
In this work, experiments are performed to evaluate the performance of the proposed method, which is implemented using Matlab 7.0 on an Intel Pentium IV 3.0 GHz PC with 512 MB of memory. We manually collect 2000 live iris images and 250
contact color lens images. 1000 live iris images and 150 contact lens images are used for training, and the rest for testing. The positive samples (the live iris images) come from the SJTU iris database version 3.0 (the iris database of Shanghai Jiao Tong University, version 3.0), which was created using a contactless iris capture device. The negative samples (the color contact lenses, bought on the market) come from images captured in a single session. The size of the eye images is 640×480. The radius rROI of the ROI is 70. The size of each normalized iris image is 90×360. The iris feature vector is a feature vector of length four. Samples of the live and fake iris are shown in Fig. 2. The main drawback of the GLCM approach is the large amount of computer resources required. For example, for an 8-bit image (256 gray levels), the co-occurrence matrix has 65536 elements. The number of gray levels determines the size of the GLCM. To reduce this problem, the number of gray levels is set to 8 when scaling the grayscale values in the image. The parameters of the RBF kernel function are set as follows: the upper bound is 10 and the standard deviation is 3. The correct classification rate is 100%. The average execution time for feature extraction and classification (for testing) is 31.3 ms and 14.7 ms, respectively. The results indicate that the proposed scheme is feasible in practical applications.
Fig. 2. The test examples of live and fake iris. (a) Live eye. (b) Live eye with a color contact lens. (c) Normalized image of ROI of the live eye. (d) Normalized image of ROI of the live eye with a color contact lens.
4 Conclusion
In this paper, we have presented an efficient fake iris detection method based on statistical texture analysis. Four feature values, i.e., the mean and standard deviation of the intensity values of the image pixels and the contrast and angular second moment of
the GLCM, are used to create the feature vector. We choose an SVM to characterize the distribution boundary, for it has good classification performance in high-dimensional space. Experimental results have illustrated the encouraging performance of the current method in accuracy and speed. The correct classification rate was 100%. The average execution time for feature extraction and classification is 31.3 ms and 14.7 ms respectively, using Matlab 7.0 on an Intel Pentium IV 3.0 GHz PC with 512 MB of memory. In future work, we will extend the fake iris database and conduct experiments on more iris databases in various environments, to make the proposed method more stable and reliable.
Acknowledgements The authors would like to thank Mr. Eui Chul Lee (Dept. of Computer Science, Sangmyung University) for his helpful discussions. They also thank the anonymous referees for their constructive comments. This work is funded by the National Natural Science Foundation (No.60427002) and the National 863 Program of China (Grant No. 2006AA01Z119).
References 1. Jain, A.K., Bolle, R.M., Pankanti, S. (eds.): Biometrics: Personal Identification in Networked Society. Kluwer, Norwell, MA (1999) 2. Zhang, D.: AutomatedBiometrics: Technologies andSystems. Kluwer, Norwell, MA (2000) 3. Daugman, J.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1148–1161 (1993) 4. Daugman, J.: The importance of being random: Statistical principles of iris recognition. Pattern Recognition 36(2), 279–291 (2003) 5. Daugman, J.: How iris recognition works. IEEE Trans. on Circuits and Systems for Video Technology 14(1), 21–30 (2004) 6. Wildes, R.P.: Iris recognition: An emerging biometric technology. Proc. IEEE 85(9), 1348–1363 (1997) 7. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personal identification based on iris texture analysis. IEEE Trans. Pattern Anal. Mach. Intell. 25(12), 1519–1533 (2003) 8. Sun, Z., Wang, Y., Tan, T., Cui, J.: Improving iris recognition accuracy via cascaded classifiers. IEEE Trans. on Systems, Man and Cybernetics, Part C 35(3), 435–441 (2005) 9. Park, K.R., Kim, J.: A real-time focusing algorithm for iris recognition camera. IEEE Trans. on Systems, Man and Cybernetics, Part C 35(3), 441–444 (2005) 10. Daugman, J.: Recognizing Persons by their Iris Patterns: Countermeasures against Subterfuge. In: Jain, et al. (eds.) Biometrics. Personal Identification in a Networked Society, pp. 103–121 (1999) 11. Daugman, J.: Demodulation by complex-valued wavelets for stochastic pattern recognition. International Journal of Wavelets, Multiresolution,and Information Processing 1(1), 1–17 (2003) 12. Daugman, J.: Iris Recognition and Anti-Spoofing Countermeasures. In: 7th International Biometrics Conference, London (2004) 13. http://www.heise.de/ct/english/02/11/114/
14. Lee, E.C., Park, K.R., Kim, J.: Fake iris detection by using purkinje image. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, pp. 397–403. Springer, Heidelberg (2005) 15. He, X., Shi, P.: A Novel Iris Segmentation method for Hand-held Capture Device. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3687, pp. 479–485. Springer, Heidelberg (2005) 16. Fleck, M.M.: Some defects in finite-difference edge finders. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(3), 337–345 (1992) 17. Canny, J.: A Computational Approach to Edge Detection. IEEE Transaction on Pattern Analysis and Machine Intelligence 8, 679–714 (1986) 18. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Syst. Man Cybern. 3(6), 610–621 (1973) 19. Soares, J.V., Renno, C.D., Formaggio, A.R., etc.: An investigation of the selection of texture features for crop discrimination using SAR imagery. Remote Sensing of Environment 59(2), 234–247 (1997) 20. Walker, R.F., Jackway, P.T., Longstaff, D.: Genetic algorithm optimization of adaptive multi-scale GLCM features. Int. J. Pattern Recognition Artif. Intell. 17(1), 17–39 (2003) 21. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discovery 2, 955–974 (1998)
A Novel Null Space-Based Kernel Discriminant Analysis for Face Recognition
Tuo Zhao1, Zhizheng Liang1, David Zhang2, and Yahui Liu1
1 Harbin Institute of Technology, [email protected]
2 Hong Kong Polytechnic University
Abstract. The symmetrical decomposition is a powerful method to extract features for image recognition. It reveals the significant discriminative information from the mirror image of symmetrical objects. In this paper, a novel null space kernel discriminant method based on the symmetrical method with a weighted fusion strategy is proposed for face recognition. It can effectively enhance the recognition performance and shares the advantages of Null-space, kernel and symmetrical methods. The experiment results on ORL database and FERET database demonstrate that the proposed method is effective and outperforms some existing subspace methods. Keywords: symmetrical decomposition, symmetrical null-space based kernel LDA, weighted fusion strategy, face recognition.
1 Introduction
Linear Discriminant Analysis (LDA) is a popular method for extracting features, and it has been successfully applied to many fields. In general, the objective of LDA is to seek a linear projection from the image space to a low-dimensional space by maximizing the between-class scatter and minimizing the within-class scatter simultaneously. It has been shown that LDA is better than PCA, especially under illumination variation [1]. In [2], Zhao also favorably supported this point under the FERET testing framework. However, in many practical applications of LDA, when the number of samples is much smaller than the dimension of the sample space, the small sample size (SSS) problem arises [3]. During the past several decades, many methods have been proposed to deal with this problem. Belhumeur [1] used PCA to reduce the dimension of the samples to an intermediate dimension, and then Fisher LDA is used to extract the features. However, Fisher LDA only uses the regularized space of the within-class scatter matrix. Considering this point, Yu [4] proposed Direct LDA. In their method, they projected all the samples directly by removing the null space of the between-class scatter matrix. Chen [5] made a more crucial modification: they first projected all the samples onto the null space of the within-class scatter matrix before doing DLDA, and proposed a more powerful Null-space Based LDA (NLDA). Although the methods mentioned above have proved to be efficient for face recognition, they are still linear techniques in nature. Hence they are inadequate to
describe the complexity of real face images because of illumination, facial expression and pose variations. To deal with this problem, kernel methods including Kernel PCA (KPCA) and Kernel LDA [6, 7] are proposed. The idea of kernel methods is to map all the data by a non-linear function into the feature space and then to perform some operations in this space. In recent years, some researchers introduced the symmetric idea in face recognition. Yang [8] presented a symmetrical principal component analysis (SPCA) algorithm according to the symmetry of human faces. Their method can reduce the sensitivities to outliers. In this paper, we propose a symmetrical null space-based kernel LDA method, which can simultaneously provide the advantages of kernel, symmetrical and Null-space methods. The rest of this paper is organized as follows. Section 2 reviews the related work on Kernel Fisher and its variants. Section 3 proposes our method. The experimental results are shown in section 4. Section 5 gives discussion and conclusions.
2 Related Work
2.1 Fundamentals
In kernel methods, the input space $\mathbb{R}^n$ is first mapped into the feature space F by a non-linear mapping $\phi$, denoted as
$$\phi: \mathbb{R}^n \rightarrow F, \quad x \mapsto \phi(x). \quad (1)$$
Kernel Fisher Discriminant solves the problem of LDA in the feature space F by a set of nonlinear discriminant vectors in the input space, by maximizing the following Fisher criterion:
$$J^{\phi}(\varphi) = \frac{|\varphi^T S_b^{\phi}\varphi|}{|\varphi^T S_t^{\phi}\varphi|}, \quad (2)$$
where $S_b^{\phi}$ and $S_t^{\phi}$ are defined as
$$S_b^{\phi} = \frac{1}{M}\sum_{i=1}^{c} l_i (m_i^{\phi} - m^{\phi})(m_i^{\phi} - m^{\phi})^T, \quad (3)$$
$$S_t^{\phi} = \frac{1}{M}\sum_{i=1}^{M} (\phi(x_i) - m^{\phi})(\phi(x_i) - m^{\phi})^T, \quad (4)$$
where $x_1, x_2, ..., x_M$ is a set of M training samples in the input space, $l_i$ is the number of training samples of class i and satisfies $\sum_{i=1}^{c} l_i = M$, $m_i^{\phi}$ is the mean vector of the mapped training samples of class i, and $m^{\phi}$ is the mean vector across all mapped training samples.
2.2 Kernel Fisher and Its Variant (KFD and NKLDA)
Actually, we can obtain the discriminant vectors with respect to the Fisher criterion by solving the eigenvalue problem $S_b^{\phi}\varphi = \lambda S_t^{\phi}\varphi$. In theory, any solution $\varphi \in F$ must lie in the span of all samples in F, so we can view $\varphi$ as a linear combination of $\phi(x_i)$:
$$\varphi = \sum_{j=1}^{M} a_j \phi(x_j) = Q\alpha, \quad (5)$$
where $Q = [\phi(x_1), \phi(x_2), ..., \phi(x_M)]$ and $\alpha = (a_1, a_2, ..., a_M)^T$. The matrix K is defined as
$$K = \tilde{K} - 1_M\tilde{K} - \tilde{K}1_M + 1_M\tilde{K}1_M. \quad (6)$$
Here $1_M = (1/M)_{M \times M}$, and $\tilde{K} = Q^T Q$ is an $M \times M$ matrix whose elements are
$$\tilde{K}_{ij} = \phi(x_i)^T\phi(x_j) = (\phi(x_i) \cdot \phi(x_j)) = k(x_i, x_j), \quad (7)$$
corresponding to a given nonlinear mapping $\phi$. The new sample vector obtained by the kernel trick, $K(x_i)$, is defined as a column of K:
$$K(x_i) = [k(x_1, x_i), k(x_2, x_i), ..., k(x_M, x_i)]^T. \quad (8)$$
Substituting Eq. (5) into Eq. (2) gives
$$J_K(\alpha) = \frac{|\alpha^T K_b\alpha|}{|\alpha^T K_t\alpha|}, \quad (9)$$
where the new between-class scatter matrix $K_b$ and total scatter matrix $K_t$ are
$$K_b = \frac{1}{M}\sum_{i=1}^{c} l_i (m_i - m)(m_i - m)^T, \quad (10)$$
$$K_t = \frac{1}{M}\sum_{i=1}^{M} (K(x_i) - m)(K(x_i) - m)^T, \quad (11)$$
where $m_i$ is the mean vector of all $K(x_j)$ belonging to class i, and $m$ is the mean vector of all $K(x_j)$. Eq. (9) is a standard Fisher equation, and we obtain the final discriminant vectors by
$$X_i = \alpha \cdot K(x_i). \quad (12)$$
Liu [7] developed the kernel null space method. The between-class scatter matrix $K_b$ is projected onto the null space of the new within-class scatter matrix $K_w$. That is, the null space of $K_w$ in the mapping space is first calculated as
$$Y^T K_w Y = 0, \quad (13)$$
where Y consists of the eigenvectors with zero eigenvalues and $Y^T Y = I$. Then we obtain
$$\tilde{K}_b = Y^T K_b Y. \quad (14)$$
Furthermore, the eigenvectors U of $\tilde{K}_b$ with the first several largest eigenvalues are selected to form the transformation matrix, giving the final features
$$X_i = U^T Y^T K(x_i). \quad (15)$$
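The whole procedure of Eqs. (6)-(15) fits in a short NumPy sketch. It is an illustrative implementation under our own simplifications (the 1/M factors are dropped, which does not change the eigenvectors, and the tolerance used to decide which eigenvalues of K_w count as zero is an assumed value), not the authors' code.

```python
import numpy as np

def nklda_fit(K_raw, labels, n_components):
    """Return W such that the NKLDA features of sample i are W.T @ K[:, i]."""
    labels = np.asarray(labels)
    M = K_raw.shape[0]
    one_M = np.full((M, M), 1.0 / M)
    K = K_raw - one_M @ K_raw - K_raw @ one_M + one_M @ K_raw @ one_M   # Eq. (6)

    m = K.mean(axis=1, keepdims=True)              # mean of the columns K(x_i)
    Kb = np.zeros((M, M))
    Kw = np.zeros((M, M))
    for cls in np.unique(labels):
        cols = K[:, labels == cls]
        mi = cols.mean(axis=1, keepdims=True)
        Kb += cols.shape[1] * (mi - m) @ (mi - m).T    # between-class scatter, Eq. (10)
        Kw += (cols - mi) @ (cols - mi).T              # within-class scatter K_w

    w, V = np.linalg.eigh(Kw)                      # Kw is symmetric and PSD
    Y = V[:, w < 1e-8 * max(w.max(), 1.0)]         # null space of K_w, Eq. (13)

    Kb_null = Y.T @ Kb @ Y                         # Eq. (14)
    wb, U = np.linalg.eigh(Kb_null)
    U = U[:, np.argsort(wb)[::-1][:n_components]]  # leading eigenvectors
    return Y @ U                                   # X_i = U.T @ Y.T @ K(x_i), Eq. (15)
```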
3 Symmetrical Null-Space Based Kernel LDA (SNKLDA)
3.1 Symmetrical Ideas
For any function f, it can be written as $f = f_e + f_o$, in which $f_e = (f + f_m)/2$ and $f_o = (f - f_m)/2$. Here $f_m$ is the symmetrical counterpart of f. $f_e$ and $f_o$ are, respectively, described by a linear combination of a set of even or odd symmetrical basis functions. Thus, any function can be linearly reconstructed by the two basis functions, one even symmetrical and another odd symmetrical. This is the even-odd decomposition principle. We can apply this principle to face images and define symmetry to be the horizontal mirror symmetry with the vertical midline of the image as its axis. According to the odd-even decomposition theory, $x_i$ can be decomposed as $x_i = x_{ei} + x_{oi}$, with $x_{ei} = (x_i + x_{mi})/2$ denoting the even symmetrical image and $x_{oi} = (x_i - x_{mi})/2$ denoting the odd symmetrical image, where $x_{mi}$ is the mirror image of $x_i$. Based on this, symmetrical methods perform classification on two new symmetrical image sets, and then combine the features of the odd and even images as the final features.
3.2 Our Proposed Method
Theorem: For any polynomial kernel function $\phi$ and any two vectors $x_a$ and $x_b$, if $\langle x_a, x_b\rangle = 0$, there exists the relationship $\phi(x_a + x_b) = \phi(x_a) + \phi(x_b)$.
Proof: Because $\phi(x)$ is a polynomial function, $\phi(x)$ can be written as $\phi(x) = \sum_{i=0}^{N} a_i x^i$, and similarly $\phi(x_a + x_b) = \sum_{i=0}^{N} a_i (x_a + x_b)^i$. Since $\langle x_a, x_b\rangle = 0$,
$$\phi(x_a + x_b) = \sum_{i=0}^{N} a_i (x_a + x_b)^i = \sum_{i=0}^{N} \beta_i^0 a_i x_a^i + ... + \sum_{i=0}^{N} \beta_i^j a_i x_a^j x_b^{i-j} + ... + \sum_{i=0}^{N} \beta_i^i a_i x_b^i = \sum_{i=0}^{N} a_i x_a^i + \sum_{i=0}^{N} a_i x_b^i = \phi(x_a) + \phi(x_b),$$
where the $\beta$ are the coefficients of a linear combination. So for new training samples in the feature space, the following equation holds:
$$K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = \langle\phi(x_{oi} + x_{ei}), \phi(x_{oj} + x_{ej})\rangle = \langle\phi(x_{oi}) + \phi(x_{ei}), \phi(x_{oj}) + \phi(x_{ej})\rangle = \langle\phi(x_{oi}), \phi(x_{oj})\rangle + \langle\phi(x_{oi}), \phi(x_{ej})\rangle + \langle\phi(x_{ei}), \phi(x_{oj})\rangle + \langle\phi(x_{ei}), \phi(x_{ej})\rangle. \quad (16)$$
The polynomial kernel $K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = (\langle x_i, x_j\rangle + 1)^d$ is adopted for the symmetrical decomposition. Then we obtain
$$\langle\phi(x_{oi}), \phi(x_{ej})\rangle = (\langle x_{oi}, x_{ej}\rangle + 1)^d = (0 + 1)^d = 1, \quad \langle\phi(x_{ei}), \phi(x_{oj})\rangle = (\langle x_{ei}, x_{oj}\rangle + 1)^d = (0 + 1)^d = 1. \quad (17)$$
Thus, $k(x_{oi}, x_{ej})$ and $k(x_{ei}, x_{oj})$ do not contain valid discriminant information. In [9], Lu et al. discussed the orthogonality of Kernel Symmetrical PCA in the original space and used the polynomial kernel in their experiments. But we find that in the feature space the inner products of $\phi(x_o)$ and $\phi(x_e)$ are constant when the polynomial kernel is applied. Therefore, when a symmetrical method adopts the polynomial kernel, it still only considers the principal components of the even and odd symmetrical images, not the correlative components between them. Therefore, when the Kernel Symmetrical PCA method adopts the polynomial kernel, we can decompose the sample set into two new sets and then perform classification on the even and odd images by combining their features. Since NKLDA is more effective than KPCA in solving the small sample size problem [6], in the following we combine it with symmetrical ideas in discriminant analysis to improve the classification performance. To this end, we propose SNKLDA. The proposed method utilizes the symmetrical properties to perform NKLDA in both the even and odd spaces, and it shares the advantages of the two methods in some sense. In the following, we describe the proposed method. We first convert the training samples into two training sets $\{x_{ei}\}_{1 \le i \le M}$ and $\{x_{oi}\}_{1 \le i \le M}$. Then we project the new sample sets into the feature space by the kernel trick. That is, we obtain the following new kernel training sets:
$$K_e(x_{ei}) = (k(x_{e1}, x_{ei}), k(x_{e2}, x_{ei}), ..., k(x_{eM}, x_{ei}))^T, \quad K_o(x_{oi}) = (k(x_{o1}, x_{oi}), k(x_{o2}, x_{oi}), ..., k(x_{oM}, x_{oi}))^T, \quad 1 \le i \le M. \quad (18)$$
Then we calculate the class means, total means, between-class scatter matrices and within-class scatter matrices of each new kernel training set as follows:
$$m_{ej} = \sum_{j \in c} K(x_{ej})/M, \quad m_e = \sum_{j \in c} m_{ej}/c, \quad m_{oj} = \sum_{j \in c} K(x_{oj})/M, \quad m_o = \sum_{j \in c} m_{oj}/c, \quad (19)$$
$$K_{we} = \sum_{j=1}^{c}\sum_{i \in C_j} (K(x_{ei}) - m_{ej})(K(x_{ei}) - m_{ej})^T, \quad K_{wo} = \sum_{j=1}^{c}\sum_{i \in C_j} (K(x_{oi}) - m_{oj})(K(x_{oi}) - m_{oj})^T, \quad (20)$$
$$K_{be} = \sum_{i \in C_j} (m_{ej} - m_e)(m_{ej} - m_e)^T, \quad K_{bo} = \sum_{i \in C_j} (m_{oj} - m_o)(m_{oj} - m_o)^T. \quad (21)$$
Based on the above steps, we perform feature extraction in both the odd and even spaces. That is, we extract the null space $Y_e$ of $K_{we}$ and $Y_o$ of $K_{wo}$ such that
$$Y_e^T K_{we} Y_e = 0, \quad Y_o^T K_{wo} Y_o = 0. \quad (22)$$
Subsequently, the between-class scatter matrices $K_{be}$ and $K_{bo}$ are projected onto the null spaces of $K_{we}$ and $K_{wo}$:
$$\tilde{K}_{be} = Y_e^T K_{be} Y_e, \quad \tilde{K}_{bo} = Y_o^T K_{bo} Y_o. \quad (23)$$
Furthermore, the eigenvectors $U_e$ of $\tilde{K}_{be}$ and $U_o$ of $\tilde{K}_{bo}$ with the first several largest eigenvalues are selected to form the transformation matrices:
$$X_e = U_e^T Y_e^T K_e(x_e), \quad X_o = U_o^T Y_o^T K_o(x_o). \quad (24)$$
It is obvious that we get two sets of features from Eq. (24), and it is necessary to use these features for classification. Note that in the previous symmetrical methods, the features are usually directly combined as the final features [9, 10]. Different from the previous methods, we assign different weights to fuse the features in the odd and even spaces to improve the classification performance. The final features are defined as
$$X_{fuse}(x) = [w_e X_e(x), w_o X_o(x)], \quad w_e = \left(\frac{R_e}{R_e + R_o}\right)^t, \quad w_o = \left(\frac{R_o}{R_e + R_o}\right)^t, \quad (25)$$
where $w_o$ and $w_e$ are the fusion weights, $R_e$ and $R_o$ are the correct recognition rates with even and odd features respectively, and $t = 0, 1, 2$ in our experiments. To reduce the computational complexity of weight training, we compute the weights on a small set of samples.
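For illustration, the symmetrical decomposition of Section 3.1 and the weighted fusion of Eq. (25) amount to only a few lines. The sketch below assumes face images are given as 2-D arrays and that the recognition rates R_e and R_o have already been measured on the small weight-training subset.

```python
import numpy as np

def even_odd_decompose(face):
    """Split a face image into even/odd symmetric parts about the vertical midline."""
    mirror = face[:, ::-1]                               # horizontal mirror image x_m
    return (face + mirror) / 2.0, (face - mirror) / 2.0  # x_e, x_o

def fuse_features(x_even_feat, x_odd_feat, r_even, r_odd, t=1):
    """Weighted fusion of Eq. (25) applied to the projected feature vectors."""
    w_e = (r_even / (r_even + r_odd)) ** t
    w_o = (r_odd / (r_even + r_odd)) ** t
    return np.concatenate([w_e * x_even_feat, w_o * x_odd_feat])
```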
4 Experimental Results
In order to compare the performance of SNKLDA with other methods, the experiments are performed on two popular face databases: the ORL database and the FERET database. All the following experiments share the same steps. The image data are
preprocessed, including histogram equalization and normalization. For the sake of simplicity, we choose the Nearest Neighbor as the classifier, and the polynomial kernel function is selected as $k(x, y) = (1 + \langle x, y\rangle)^d$ with degree d = 2. In addition, we select c − 1 and N − 1 as the final dimensions in Fisher LDA and KSPCA.
4.1 Experiments on ORL Database
There are 10 different images for each subject in the ORL face database, which is composed of 40 distinct subjects. All the subjects are in an upright, frontal position. The size of each face image is downsampled to 46 × 56. Fig. 1 shows all 10 images of the first person after histogram equalization.
Fig. 1. Samples from the first person in ORL Database
In this set of experiments, the first 10 subjects in the database are selected to determine the weights. In order to reduce variation, we randomly select 2 images from each subject as training samples and the others are testing samples. The results are the average of 20 runs. We obtain an even CRR of 96.44% and an odd CRR of 59.00%. Table 1 shows the performance of SNKLDA under different t.

Table 1. The performance of SNKLDA under different t on the ORL database
K    t=1      t=0      t=2
2    86.44    85.39    86.33
3    92.03    91.35    91.88
4    94.82    94.52    94.75

Table 2. The experiment results on the ORL database
K    Fisher   NLDA     NKLDA    KSPCA    SNKLDA
2    75.72    85.26    84.13    84.22    86.44
3    86.31    91.06    91.02    89.78    92.03
4    91.40    94.18    94.42    92.99    94.82
Then we compare SNKLDA with some methods in the case of different numbers of training samples. The number of training samples per subject, k, varies from 2 to 4. In each round, k images are randomly selected from the database for training and the remaining images of the same subject are used for testing. For each k, 50 tests are performed and the final results are averaged over 50 runs. Table 2 shows the CRR (%) and we can see that the performance of SNKLDA is better than other methods.
4.2 Experiments on FERET Database
In order to further test the capability of the proposed method, experiments are conducted on a dataset with more subjects, namely the FERET database. In the experiments, we select a subset of 200 subjects from the FERET database, with 11 upright, frontal-view images of each subject. The face images in this database involve much more variation in lighting, view and expression. The size of the face images is downsampled to 50 × 60. Fig. 2 shows all 11 images of the first person after histogram equalization.
Fig. 2. Samples from the first person in FERET Database after histogram equalization
In this set of experiments, the first 40 subjects in the database are selected to determine the weights. In order to reduce variation, we randomly select 2 images from each subject as training samples and the others are testing samples. The results are the average of 20 runs. We obtain an even CRR of 80.81% and an odd CRR of 20.56%. Table 3 shows the performance of SNKLDA under different t.

Table 3. The performance of SNKLDA under different t on the FERET database
K    t=1      t=0      t=2
2    68.07    66.70    67.99
3    74.67    73.21    74.64
4    78.34    77.42    78.27

Table 4. The experiment results on the FERET database
K    Fisher   NLDA     NKLDA    KSPCA    SNKLDA
2    31.20    60.00    62.25    56.94    68.07
3    55.64    64.60    68.47    63.88    74.67
4    68.16    68.81    74.00    67.93    78.34

The number of samples in each class, k, varies from 2 to 4. For each k, 100 tests are performed and the final results are averaged over 100 runs. Table 4 shows the experimental results, and we can see that SNKLDA outperforms all the other subspace methods in all cases.
5 Discussions and Conclusions
From the above two experiments we can see that SNKLDA is better than all the other methods. This may come from the fact that the SNKLDA method has the advantages of the NKLDA method in the small sample size problem. Although SNKLDA has twice the computational complexity of NKLDA due to the symmetrical decomposition, the symmetrical decomposition contributes about a 5% increase on FERET. In the above experiments the number of subspace dimensions is (c − 1) × 2, which is much lower than that of the PCA method in the symmetrical methods. It should be noted that extracting complementary discriminative information and the fusion strategy are also important for symmetrical methods. From the experiments we can see that our weighted strategy is better than direct combination.

Fig. 3. Misclassification in even (L), odd (M) and full space with weighted strategy (R) on ORL

In addition, from Fig. 3, we note that the two spaces are not complementary enough to increase the recognition rate greatly by the basic fusion strategy. The misclassified samples in the even space are usually also misclassified in the odd space. Hence how to efficiently use the complementary information in the two spaces deserves further research. In all, the main contributions of this paper are briefly summarized as follows: (a) we proposed the SNKLDA method, which provides better performance than other methods; (b) we proposed a weighted strategy for even and odd features and revealed the bottleneck of the symmetrical method; (c) we pointed out that the SKPCA method with the polynomial kernel still only considers the principal components of even and odd symmetrical images, not the correlative components between even and odd symmetrical images. However, considering the problems in the above discussion, there is still some work to do in the near future. We will pay attention to the following two aspects: (1) fusion strategy: a more powerful fusion strategy should be used to maximize the utilization of the complementary information; (2) kernel function: a novel kernel function should be used to avoid the orthogonality of even and odd features and to generate a mixed space for classification.
References 1. Belhumeur, P.N., Hespanha, J.P., Kiregman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on PAMI. 19(7), 711–720 (1997) 2. Zhao, W., Chellappa, R., Philips, P.J.: Subspace Linear Discriminant Analysis for Face Recognition. Tech Report CAR-TR-914 Center for Automation Research, University of Maryland (1999) 3. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, London (1990) 4. Yu, H., Yang, J.: A Direct LDA algorithm for High-dimensional Data with Application to Face Recognition. Pattern Recognition 34(10), 2067–2070 (2001)
5. Chen, L.F., Liao, H.Y.M., Lin, J.C., Ko, M.T., Yu, G.J.: A new LDA-based Face Recognition System Which Can Solve the Small Sample Size Problem. Pattern Recognition 33(10), 1713–1726 (2000) 6. Yang, M.H.: Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Kernel Methods. In: Int. Conf. on Automatic Face and Gesture Recognition, pp. 215–220 (2002) 7. Liu, W., Wang, Y.H., Li, S.Z., Tan, T.N.: Null Space-based Kernel Fisher Discriminant Analysis for Face Recognition. In: Int. Conf. on Automatic Face and Gesture Recognition, pp. 369–374 (2004) 8. Yang, Q., Ding, X.: Symmetrical PCA in Face Recognition. In: IEEE Int. Conf. on Image Processing, pp. 97–100 (2002) 9. Lu, C., Zhang, C., Zhang, T., Zhang, W.: Kernel Based Symmetrical Principal Component Analysis for Face Classification. Neurocomputing 70(4-6), 904–911 (2007)
Appendix
Let a sample vector be $x = \{x_1, x_2, ..., x_{n-1}, x_n\}$ and its mirror vector be $x_m = \{x_n, x_{n-1}, ..., x_2, x_1\}$, and let another sample vector be $y = \{y_1, y_2, ..., y_{n-1}, y_n\}$ and its mirror vector be $y_m = \{y_n, y_{n-1}, ..., y_2, y_1\}$. Then by symmetrical decomposition, we obtain
$$x_e = \frac{x + x_m}{2} = \left\{\frac{x_1 + x_n}{2}, \frac{x_2 + x_{n-1}}{2}, ..., \frac{x_{n-1} + x_2}{2}, \frac{x_n + x_1}{2}\right\}, \quad (A.1)$$
$$x_o = \frac{x - x_m}{2} = \left\{\frac{x_1 - x_n}{2}, \frac{x_2 - x_{n-1}}{2}, ..., \frac{x_{n-1} - x_2}{2}, \frac{x_n - x_1}{2}\right\}, \quad (A.2)$$
$$y_e = \frac{y + y_m}{2} = \left\{\frac{y_1 + y_n}{2}, \frac{y_2 + y_{n-1}}{2}, ..., \frac{y_{n-1} + y_2}{2}, \frac{y_n + y_1}{2}\right\}, \quad (A.3)$$
$$y_o = \frac{y - y_m}{2} = \left\{\frac{y_1 - y_n}{2}, \frac{y_2 - y_{n-1}}{2}, ..., \frac{y_{n-1} - y_2}{2}, \frac{y_n - y_1}{2}\right\}. \quad (A.4)$$
The inner product is
$$\langle x_e, y_o\rangle = \left(\frac{x_1 + x_n}{2}\right)\left(\frac{y_1 - y_n}{2}\right) + ... + \left(\frac{x_n + x_1}{2}\right)\left(\frac{y_n - y_1}{2}\right). \quad (A.5)$$
It is obvious that $\left(\frac{x_i + x_{n-i+1}}{2}\right)\left(\frac{y_i - y_{n-i+1}}{2}\right) + \left(\frac{x_{n-i+1} + x_i}{2}\right)\left(\frac{y_{n-i+1} - y_i}{2}\right) = \left(\frac{x_i + x_{n-i+1}}{2}\right)\left(\frac{y_i - y_{n-i+1}}{2}\right) - \left(\frac{x_{n-i+1} + x_i}{2}\right)\left(\frac{y_i - y_{n-i+1}}{2}\right) = 0$, so
$$\langle x_e, y_o\rangle = 0. \quad (A.6)$$
In a similar way, we can prove
$$\langle x_o, y_e\rangle = 0. \quad (A.7)$$
Changeable Face Representations Suitable for Human Recognition
Hyunggu Lee, Chulhan Lee, Jeung-Yoon Choi, Jongsun Kim, and Jaihie Kim
School of Electrical and Electronic Engineering, Yonsei University, Biometrics Engineering Research Center (BERC), Republic of Korea
{lindakim,devices,jychoi,kjongss,jhkim}@yonsei.ac.kr
Abstract. In order to resolve the non-revocability of biometrics, changeable biometrics has recently been introduced. Changeable biometrics transforms an original biometric template into a changeable template in a non-invertible manner. The transformed changeable template does not reveal the original biometric template, so that secure concealment of biometric data is possible using a changeable transformation. Changeable biometrics has been applied to face recognition, and there are several changeable face recognition methods. However, previous changeable face transformations cannot provide face images that can be recognized by humans. Hence, ‘human inspection’, a particular property of face biometrics, cannot be provided by previous changeable face biometric methods. In this paper, we propose a face image synthesis method which allows human inspection of changeable face templates. The proposed face synthesis method is based on subspace modeling of the face space, and the face space is obtained by a projection method such as principal component analysis (PCA) or independent component analysis (ICA). The face space is modeled by fuzzy C-means clustering and partition-based PCA. Using the proposed method, human-recognizable faces can be synthesized while providing inter-class discrimination and intra-class similarity. Keywords: changeable face biometrics, non-invertibility, human inspection, face image synthesis, partition-based PCA.
1 Introduction
The human face is a key biometric characteristic for human recognition of personal identity. Current face recognition algorithms can handle large amounts of face biometric data very quickly. However, face recognition algorithms are not perfect. In many cases, face recognition algorithms are not as good as human inspection. Hence, the human ability of face recognition is still an important factor in recognition of a human face. Additionally, there are some examples where the human ability to recognize faces is an important issue, such as human inspection of the pictures of faces in a driver’s license or a passport. In an ideal face recognition system, images of users are stored in a trusted database, and in that case, human inspection is possible using the face images of the database.
However, storing user’s face images can cause privacy violations. In order to resolve this problem, changeable biometrics has been introduced [1]. Ratha et al. [1] proposed a changeable face synthesis method based on morphing. Synthesizing face images by morphing allows human inspection. However, the original face image can be reconstructed when an attacker learns the morphing function. Besides the method of Ratha et al., there are several methods for generating changeable face biometrics [2][3][4]. A common point of these methods is that they cannot provide for human inspection. Boult [2] proposed a transformation method for face recognition, in which a face feature is transformed via scaling and translation. Using robust distance measure, the transformed feature can provide improved performance, and the original biometric data cannot be inverted from their encrypted data, which is cryptographically secure. However, as mentioned, this transform does not allow human inspection. Savvides et al. [3] proposed changeable biometrics for face recognition that uses minimum average correlation energy (MACE) filters and random kernels. However, the original face image can be recovered via deconvolution if the random kernel is known. Moreover, because the face is convolved with a random kernel, their method cannot provide for human inspection. Teoh et al. [4] proposed a new authentication approach called BioHashing which combines user-specific tokenized random vectors with biometric feature vectors to generate a biometric code. The inner product between biometric feature vectors and a set of orthonormal random vectors are calculated and a biometric code is generated by using a predefined threshold. By thresholding, the original biometric can not be recovered from the biometric code. Again, however, their method does not provide for human inspection. In this paper, we propose a subspace based changeable face synthesis method which allows human recognition of the generated faces. The organization of the paper is as follows. Section 2 briefly describes the method for generating changeable coefficient vectors from face images and our motivation for recognizable face synthesis. In Section 3, the subspace modeling and range fitting methods for face synthesis follow. Finally, results of our face synthesis and future work are discussed in Section 4 and Section 5, respectively.
2 Changeable Coefficient Vector Generation
A changeable face template can be generated by combining PCA and ICA coefficient vectors [5]. Two different face feature coefficient vectors, P and I, are extracted from an input face image using PCA and ICA, and the two coefficient vectors are normalized as follows:
$$p = P/|P| = [p_1, p_2, p_3, ..., p_N], \quad i = I/|I| = [i_1, i_2, i_3, ..., i_N]. \quad (1)$$
Then, the two normalized feature coefficient vectors p and i are scrambled using two permutation matrices:
$$p_s = S_{ID}^{PCA}(p), \quad i_s = S_{ID}^{ICA}(i). \quad (2)$$
The permutation matrices $S_{ID}^{PCA}$ and $S_{ID}^{ICA}$ are generated by a random number generator whose seed is the user’s personal ID. Finally, a changeable face coefficient vector c is generated by addition of the scrambled coefficient vectors:
$$c = p_s + i_s. \quad (3)$$
Even if an attacker knows the changeable face coefficient vector c and the scrambling rule, the original PCA or ICA coefficient vector can not be recovered because the changeable coefficient vector is generated by addition. Moreover, due to scrambling and addition, the changeable coefficient vector is not the same as either the PCA or the ICA coefficient vector, so that the changeable coefficient vector cannot represent a human-like image when the changeable coefficient vector is reprojected onto either the PCA or ICA basis. Fig. 1 shows some reprojected images using the PCA and ICA coefficient vectors and using the changeable coefficient vector. Fig. 1 (b) and (c) show reconstruction images using the original subspace coefficient vectors and corresponding subspace basis. Reconstructed images are rather degraded because we do not use the full number of basis vectors. The changeable coefficient vector can be directly reprojected using the PCA or ICA basis. Since the changeable coefficient vector differs from either the PCA or ICA coefficient vector, the reprojected images in Fig. 1 (d) and (e) cannot be recognized as human. Hence humans cannot inspect the resulting changeable face templates as to whether it is similar or not to a corresponding template in the database.
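A minimal sketch of the generation procedure of Eqs. (1)-(3) is given below. Realizing the two permutation matrices with an ID-seeded NumPy generator is an assumption of ours; the paper only states that the permutations come from a random number generator seeded with the user's personal ID.

```python
import numpy as np

def changeable_template(P, I, user_id):
    """Changeable coefficient vector c built from PCA/ICA coefficient vectors."""
    p = np.asarray(P, dtype=float)
    i = np.asarray(I, dtype=float)
    p = p / np.linalg.norm(p)                 # Eq. (1)
    i = i / np.linalg.norm(i)

    rng = np.random.default_rng(user_id)      # seeded with the personal ID
    p_s = p[rng.permutation(p.size)]          # S_ID^PCA, Eq. (2)
    i_s = i[rng.permutation(i.size)]          # S_ID^ICA
    return p_s + i_s                          # Eq. (3)
```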
Fig. 1. Face generation using changeable coefficient vector (a): input face image, (b): reconstruction using the PCA coefficient vector, (c): reconstruction using the ICA coefficient vector, (d): reprojected image using changeable coefficient vector with the PCA basis, (e): reprojected image using changeable coefficient vector with the ICA basis
From the above discussion, it is clear that the changeable face template cannot yield a human face image. In fact, because the original face image cannot be stored in changeable biometrics, there is no specific face image that is associated with the changeable template. However, for human inspection, a new face image ought to be synthesized, and that synthesized face image should be generated from a changeable face template. In this paper, we propose a face synthesis method which can generate a new face image using a changeable coefficient vector from a modeled face subspace. In the following sections, the proposed face synthesis method is described more specifically.
3 Subspace Based Face Synthesis
The proposed face image synthesis method is founded on face space modeling in a subspace such as the PCA or ICA coefficient vector space. As the subspace distribution modeling method, partition-based PCA follows fuzzy C-means clustering [6]. As shown in Fig. 2 (a) and (b), grouping is carried out by the clustering method, and then PCA is applied to each cluster. Hereafter, this type of approach will be called ‘partition-based PCA’. However, due to the differences of the modeled distributions between the face coefficient vector and the changeable coefficient vector, the changeable coefficient vector should be fitted within the modeled distribution of the face coefficient vector when synthesizing a face image. Before range fitting, the changeable coefficient vector is located outside of the modeled distribution of the face coefficient vector, e.g., the triangle points in Fig. 2 (c). After the range fitting process, the range fitted changeable coefficient vectors are located inside of the modeled distribution of the face coefficient vector, e.g., the square points in Fig. 2 (c). Because the face subspace coefficient vector, which exists inside of the modeled distribution of the face coefficient vector, does represent a real face image, the range fitted changeable coefficient vector can also represent a face image that allows human inspection. Details of the distribution modeling of the face coefficient vector and the range fitting of the changeable coefficient vector into that distribution are covered below.

Fig. 2. Conceptual picture of face subspace modeling and range fitting for changeable coefficient vector (a): subspace is partitioned using clustering, (b): PCA is applied to each cluster, (c): range of changeable coefficient vector is fitted into range of coefficient vector of partition-based PCA

3.1 Face Distribution Modeling on the Subspace
We use the fuzzy C-means (FCM) method for subspace modeling, which is a soft clustering method. In order to update the cluster prototypes, the overall information of the data points is utilized with a weight which specifies the responsibility to the cluster prototypes [6]. After partitions are obtained from clustering, PCA is applied to each cluster. Local PCA [7] is a method similar to partition-based PCA, which iteratively updates the partitions and the PCA basis of the partitions using the reconstruction error. However, in our experiments, in order to utilize the local structure of the initial clusters (the FCM result), we use partition-based PCA as a non-iterative version of local PCA.
3.2 Local Surface Determination and Face Synthesis
After FCM clustering, PCA is applied to each cluster in the subspace. Because PCA captures the maximum variation of the distribution, the local surface of each subspace can be determined from the partition-based PCA components. For the kth cluster of the subspace, $p^i_{cluster\,k}$ is the ith element of the partition-based PCA coefficient vector $p_{cluster\,k}$. For face synthesis, $p^i_{cluster\,k}$ should be restricted with an upper and a lower limit, which can be selected using the mean $m^i_{cluster\,k}$ and standard deviation $\sigma^i_{cluster\,k}$ of $p^i_{cluster\,k}$:
$$\min^i_{cluster\,k} \le p^i_{cluster\,k} \le \max^i_{cluster\,k}, \quad \text{where} \quad \min^i_{cluster\,k} = m^i_{cluster\,k} - \alpha\sigma^i_{cluster\,k}, \quad \max^i_{cluster\,k} = m^i_{cluster\,k} + \alpha\sigma^i_{cluster\,k}. \quad (4)$$
As a result, the local surface of the kth cluster can be determined by $\min^i_{cluster\,k}$ and $\max^i_{cluster\,k}$ of $p^i_{cluster\,k}$, with the scale factor $\alpha$ controlling the width of the modeled local surface. For face synthesis, a partition-based PCA coefficient vector $p_{cluster\,k}$ should be selected within the local surface of the kth cluster. The selected $p_{cluster\,k}$ is associated with one synthesized face in the corresponding subspace, and face images can be generated as follows. The synthesized face coefficient vector $\lambda_{syn}$ is the result of reprojecting $p_{cluster\,k}$ onto the corresponding partition-based PCA basis $\Phi_{cluster\,k}$:
$$\lambda_{syn} = \Phi_{cluster\,k}\, p_{cluster\,k}. \quad (5)$$
After the synthesized face coefficient vector $\lambda_{syn}$ is obtained, the synthesized face image $x_{syn}$ can be generated by reprojection with the corresponding subspace basis $\Psi$:
$$x_{syn} = \Psi\lambda_{syn}. \quad (6)$$
Hence, a face image can be synthesized using $p_{cluster\,k}$. However, a changeable face image is synthesized from a changeable coefficient vector c; therefore, the changeable coefficient vector c should be converted accordingly. The conversion of c is done by range fitting; the range fitted changeable coefficient vector is $y_{cluster\,k}$. To do this, the range of the changeable coefficient vector should be determined. After range determination, the changeable coefficient vector is fitted within the distribution of the partition-based PCA coefficient vector $p_{cluster\,k}$. Then $y_{cluster\,k}$ is able to represent a face image that is recognizable by a human, using reprojection as follows:
$$x_{syn} = \Psi\Phi_{cluster\,k}\, y_{cluster\,k}. \quad (7)$$
3.3 Determining the Range of the Changeable Coefficient Vector
Because a changeable coefficient vector is a mixed feature of two different subspace coefficient vectors (PCA and ICA), the distribution of the changeable coefficient vector is unrelated to the original two subspace coefficient vectors. However, if we assume that the original two coefficient vectors are statistically independent, then, because the changeable coefficient vector c is no more than the addition of the PCA and ICA coefficient vectors, the range of c can be determined from the ranges of the PCA and ICA coefficient vectors. The range of $p^i$, the ith element of the PCA coefficient vector p, can be obtained from the training data; it lies between a lower limit $\min^i_{pca}$ and an upper limit $\max^i_{pca}$. In a similar way, the range of $i^i$, the ith element of the ICA coefficient vector i, can be obtained. Because the scrambling rules, i.e., $S_{ID}^{PCA}$ and $S_{ID}^{ICA}$, are stored for changeable coefficient vector generation, they can also be used for range determination of the changeable coefficient vector; hence, scrambling can be ignored for notational simplification. Then, $c^i$ becomes an addition of $p^i$ and $i^i$. Using the independence assumption between $p^i$ and $i^i$, the range of $c^i$, the ith element of the changeable coefficient vector c, can be inferred from the ranges of $p^i$ and $i^i$ as follows:
$$\min^i_c \le c^i \le \max^i_c, \quad \text{where} \quad \min^i_c = \min^i_{pca} + \min^i_{ica}, \quad \max^i_c = \max^i_{pca} + \max^i_{ica}. \quad (8)$$
3.4 Range Fitting of the Changeable Coefficient Vector into the Partition-Based PCA Coefficient Vector
The range of the changeable coefficient vector is then fitted into the range of the partition-based PCA coefficient vector. Using the ranges determined in the previous subsection, the changeable coefficient vector c is converted to be in the range of the partition-based PCA coefficient vector $p_{cluster\,k}$. Because there are no restrictions on the selection of the cluster partition k, k can be selected freely for each person, and the converted coefficient vector is represented by $y_{cluster\,k}$:
$$y^i_{cluster\,k} = \left(\frac{c^i - \min^i_c}{\max^i_c - \min^i_c}\right) \times (\max^i_{cluster\,k} - \min^i_{cluster\,k}) + \min^i_{cluster\,k}. \quad (9)$$
As a result, the synthesized face image $x_{syn}$ can be reprojected from $y_{cluster\,k}$ using the corresponding basis of the partition-based PCA and the basis of the subspace:
$$x_{syn} = \Psi\Phi_{cluster\,k}\, y_{cluster\,k}. \quad (10)$$
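The range-fitting and reprojection steps of Eqs. (4) and (8)-(10) reduce to an element-wise affine mapping followed by two matrix products. The sketch below is ours, not the authors' implementation; all arguments are hypothetical arrays (the element-wise bounds of Eq. (8), the per-dimension mean and standard deviation of cluster k's partition-based PCA coefficients, its basis Phi_k and the subspace basis Psi).

```python
import numpy as np

def synthesize_face(c, min_c, max_c, m_k, sigma_k, Phi_k, Psi, alpha=1.0):
    """Range-fit a changeable coefficient vector and reproject it to an image."""
    min_k = m_k - alpha * sigma_k                       # local surface limits, Eq. (4)
    max_k = m_k + alpha * sigma_k

    # Eq. (9): map each element of c from [min_c, max_c] into [min_k, max_k].
    y_k = (c - min_c) / (max_c - min_c) * (max_k - min_k) + min_k

    return Psi @ (Phi_k @ y_k)                          # Eq. (10): x_syn = Psi Phi_k y_k
```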
4 Experiment and Result
In our experiment, 492 frontal facial images from the AR database [8] are used. Images with occlusion and illumination changes are excluded, for a total of 6
different images per subject. The dimension of PCA and ICA spaces is set to 100. For face space modeling, FCM is applied to the face coefficient vector space. In our experiments, two distinct clusters are formed for each coefficient vector space. When generating a changeable coefficient vector, different scrambling rules are applied to each person. Performance can be evaluated in two aspects. One is the EER using the generated changeable coefficient vector c, and the other is the EER using the range fitted coefficient vector ycluster k . The EER of the changeable coefficient vector c can be used to evaluate the performance of the changeable face recognition system, and the EER of the range fitted coefficient vector ycluster k can be used for quantitative evaluation of human inspectibility. As Table 1 shows, due to different scrambling of changeable coefficient vectors for each person, the EER of the changeable coefficient vector is very low. The EER of ycluster k fitted to ICA clusters is similar to the EER using ICA coefficient vectors, and the EER of ycluster k fitted to PCA clusters is similar to the EER using PCA coefficient vectors. This result is due to the range fitting process. During range fitting, whatever the value of the changeable coefficient vector c, the range fitted coefficient vector ycluster k is located within the local surface of the modeled subspace cluster. Hence, EER of the range fitted coefficient vector is similar to the EER of the corresponding subspace coefficient vector. This implies that inter-class discrimination and intra-class similarity of the original input faces is preserved in the synthesized changeable faces.
Table 1. Performance of changeable coefficient vector and partition-based PCA coefficient vectors of modeled clusters
PCA coefficient vector                           15.03%
ICA coefficient vector                           12.6%
Changeable coefficient vector                     0.02%
Range fitted coefficient vector - ICA cluster    13.83%
Range fitted coefficient vector - PCA cluster    14.88%
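The EER values reported in Table 1 can be estimated from genuine and impostor score distributions as sketched below. This is a generic evaluation helper written for illustration (distance-like scores, lower means more similar), not the authors' protocol.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: operating point where false accept and false reject rates coincide."""
    thresholds = np.unique(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([np.mean(impostor_scores <= t) for t in thresholds])  # impostors accepted
    frr = np.array([np.mean(genuine_scores > t) for t in thresholds])    # genuines rejected
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```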
Fig. 3 shows the result of changeable face image synthesis. Because two clusters are formed for each subspace, two types of face images can be synthesized from each subspace. Fig. 3 (d) and (e) show synthesized face images from the ICA clusters, and Fig. 3 (h) and (i) show synthesized images from the PCA clusters. In both subspaces, using changeable coefficient vectors, synthesized images from different clusters show distinct characteristics, such as gender. Using the proposed face synthesis method, we can represent any person as a different person of either gender. From Fig. 3, we can discern the distinctiveness of the synthesized face image for each person. Hence, the above result shows the inter-class discrimination of the proposed face synthesis method. However, synthesized faces from the same person's images should show similarity between synthesized faces. This intra-class similarity is shown in Fig. 4. The input faces of Fig. 4 (a) contain expressional variation. Fig. 4 (b) shows synthesized images from ICA cluster 1, and Fig. 4 (c) shows synthesized images from ICA cluster 2.

Fig. 3. Synthesized changeable faces (a): input image, (b): reconstructed image by ICA basis, (c): direct reprojection of changeable coefficient vector using ICA basis, (d): synthesized image from cluster 1 of ICA, (e): synthesized image from cluster 2 of ICA, (f): reconstructed image by PCA basis, (g): direct reprojection of changeable coefficient vector using PCA basis, (h): synthesized image from cluster 1 of PCA, (i): synthesized image from cluster 2 of PCA

For Fig. 4 (b) and (c), because images of the same row are synthesized from the same person's input images, images within a row are similar to each other. However, images of different rows are synthesized from images of different persons, and any two images from two different rows are not similar. From Fig. 3 and Fig. 4, it can be seen that synthesized face images can be used to substitute for the original face image. In other words, with respect to inspectability, the proposed face synthesis method shows intra-class similarity while providing inter-class discrimination. The similarity between the EER using the original subspace coefficient vectors and the EER using the changeable coefficient vector in Table 1 supports the above results.
Fig. 4. Synthesized changeable faces of two ICA clusters (a): input faces, (b): synthesized faces from ICA cluster 1, (c): synthesized faces from ICA cluster 2
5 Conclusion and Future Works
The proposed face synthesis method for changeable face biometrics can generate face images which allow human recognition. The quality of the synthesized face image depends on the accuracy of the subspace modeling method and the compactness of the subspace. From the point of view of human inspectability, face image synthesis using ICA cluster modeling is better than PCA cluster modeling. The quality of the synthesized face image also can be improved by using a more accurate subspace modeling method. In this work, face subspace modeling is accomplished by a simple clustering method (FCM) and partition based PCA. However, the face subspace can be modeled by more accurate clustering methods such as Gaussian Mixture Models (GMM) or Gustafson-Kessel clustering, etc. In future works, such subspace modeling methods may be tested for changeable face synthesis.
Acknowledgements. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University.
“3D Face”: Biometric Template Protection for 3D Face Recognition E.J.C. Kelkboom, B. Gökberk, T.A.M. Kevenaar, A.H.M. Akkermans, and M. van der Veen Philips Research, High-Tech Campus 34, 5656AE, Eindhoven {emile.kelkboom,berk.gokberk,tom.kevenaar,ton.h.akkermans, michiel.van.der.veen}@philips.com
Abstract. In this paper we apply template protection to an authentication system based on 3D face data in order to protect the privacy of its users. We use the template protection system based on the helper data system (HDS). The experimental results performed on the FRGC v2.0 database demonstrate that the performance of the protected system is of the same order as the performance of the unprotected system. The protected system has a performance of a FAR ≈ 0.19% and a FRR ≈ 16% with a security level of 35 bits. Keywords: Template protection, privacy protection, helper data system (HDS), 3D face recognition.
1 Introduction
Biometrics is used to recognize people for identification or verification purposes. It is expected that in the near future, biometrics will play an increasing role in many security applications. Today the market is dominated by fingerprint recognition, but for the near future market studies predict that face recognition technologies will also play an important role. This is driven by initiatives like the ePassport, for which the ICAO standardized the face as one of the modalities to be used for verification purposes. Following these trends, recently the European Project “3D Face” [1] was initiated. The principal goals of this project are to (i) improve the performance of classical face recognition techniques by extending them to 3D, (ii) integrate privacy protection technology to safeguard the biometric information, and (iii) deploy the secure face recognition system at several international airports for the purpose of employee access control. In this paper we concentrate on the privacy protection for 3D face recognition. In any biometric system, the storage of biometric information, also called the biometric template, may be a privacy risk. To mitigate these risks, we see in recent literature different theoretical methods of privacy protection, e.g. fuzzy commitment [2], fuzzy vault [3], cancelable biometrics [4], fuzzy extractors [5], and the helper data system (HDS) [6,7]. The general goal of these systems is to (i) prevent identity theft, (ii) introduce versatility, and (iii) prevent cross matching. Also, several attempts were made to integrate these techniques in practical systems for face [8] or fingerprint [9] recognition.
In our work we make use of the HDS template protection approach in the verification setting. We use a 3D face feature extraction algorithm that is based on the maximum and minimum principal curvature directions. The aim is to have at least the same verification performance in the protected case as in the unprotected case. The remainder of the paper is organized as follows. In Section 2, we present a brief description of the feature extraction algorithm followed by the introduction of the HDS template protection system in Section 3. The results are given in Section 4 followed by the conclusions in Section 5.
2 3D Face Feature Extraction
In this work, we use a shape-based 3D face recognizer [10]. It has two main steps: 1) the alignment of faces, and 2) the extraction of surface features from 3D facial data. In the alignment step, each face is registered to a generic face model (GFM) and the central facial region is cropped. The GFM is computed by averaging correctly aligned images from a training set. After the alignment step, we can assume that all faces are transformed in such a way that they best fit the GFM, and have the same position in the common coordinate system. After alignment, the facial surface is divided into 174 local regions. For each region, the maximum and minimum principal curvature directions are computed. Each of the two directions is represented by the azimuthal and the polar angle in the spherical coordinate system. Combining all the regions leads to a feature vector with 174 × 2 × 2 = 696 entries. For matching two feature vectors, the distance is computed using the L1 or the L2 norm.
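As an illustration (not the authors' implementation), the assembly of the 696-dimensional feature vector and the L1/L2 matching can be sketched as follows; the array layout for the per-region angle pairs is an assumption made for the example.

```python
import numpy as np

def build_feature_vector(region_angles):
    """region_angles: array of shape (174, 2, 2) holding, per local region,
    the (azimuth, polar) angles of the maximum and minimum principal
    curvature directions. Returns a flat vector with 174 * 2 * 2 = 696 entries."""
    region_angles = np.asarray(region_angles, dtype=float)
    assert region_angles.shape == (174, 2, 2)
    return region_angles.reshape(-1)

def match_distance(x, y, norm="L1"):
    """Distance between two 696-dimensional feature vectors."""
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return d.sum() if norm == "L1" else np.sqrt((d ** 2).sum())
```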
3 The Template Protection System: Helper Data System
The helper data system (HDS) is shown in Figure 1. It consists of the training, enrollment and verification stages. The inputs to all stages are real-valued feature vectors defined as z ∈ ℝ^k, x ∈ ℝ^k and y ∈ ℝ^k, respectively (k is the number of components of the feature vector). The feature vectors are derived from the 3D face image by the feature extraction algorithm described in Section 2. In each stage, users may have multiple images and therefore multiple feature vectors, which are defined as

\[ (z_{i,j})_t,\quad i = 1,\ldots,N_T;\; j = 1,\ldots,M_{T_i};\; t = 1,\ldots,k, \]
\[ (x_{i,j})_t,\quad i = 1,\ldots,N;\; j = 1,\ldots,M_{E_i};\; t = 1,\ldots,k, \tag{1} \]
\[ (y_{i,j})_t,\quad i = 1,\ldots,N;\; j = 1,\ldots,M_{V_i};\; t = 1,\ldots,k, \]

where N_T is the number of users in the training stage with user i having M_{T_i} images, and N is the number of users in the enrollment and verification stages with user i having M_{E_i} images in the enrollment and M_{V_i} images in the verification stage. The notation (x_{i,j})_t indicates the t-th component of vector x_{i,j}.
Fig. 1. The HDS template protection system; the enrollment (left), and verification stage (right). This figure is adapted from [8,9].
3.1 Training Stage
For template protection, binary feature vectors (binary strings) must be derived from the real-valued feature vectors. This is done by quantizing the feature vector with respect to a single threshold vector. In the training stage, this quantization threshold vector is calculated from the feature vectors of the training population z_{i,j}. As threshold vector, we use the mean of the feature vectors defined as

\[ \mu_T = \frac{1}{N_T}\sum_{i=1}^{N_T}\frac{1}{M_{T_i}}\sum_{j=1}^{M_{T_i}} z_{i,j}. \tag{2} \]
3.2 Enrollment Stage
In the enrollment stage, each user i has M_{E_i} feature vectors x_{i,j}. In the Quantization block, the real-valued feature vectors are quantized into binary feature vectors x_{B_i} using the following equation

\[ (x_{B_i})_t = \begin{cases} 0, & \text{if } (\mu_i)_t < (\mu_T)_t,\\ 1, & \text{if } (\mu_i)_t \geq (\mu_T)_t \end{cases} \quad\text{with}\quad (\mu_i)_t = \frac{1}{M_{E_i}}\sum_{j=1}^{M_{E_i}} (x_{i,j})_t, \tag{3} \]
such that (μ_i)_t is the mean of component t of the feature vectors of user i. The reliability (r_i)_t of each component (x_{B_i})_t is calculated as the ratio

\[ (r_i)_t = \frac{|(\mu_T)_t - (\mu_i)_t|}{(\sigma_i)_t}, \quad\text{with}\quad (\sigma_i)_t = \sqrt{\frac{1}{M_{E_i}-1}\sum_{j=1}^{M_{E_i}}\big((x_{i,j})_t - (\mu_i)_t\big)^2} \tag{4} \]
such that (σ_i)_t is the standard deviation of component t of the feature vectors of user i. In the single-image enrollment scenario, M_{E_i} = 1, we define (σ_i)_t = 1. Also, a secret s_i of L_S bits is randomly generated by the Random Number Generator (RNG) block. The security level of the system is higher at larger secret lengths L_S. A codeword c_i of an error correcting code with L_C bits is obtained by encoding s_i in the ENC block. In our case we use the “Bose-Ray-Chaudhuri-Hocquenghem” (BCH) Error Correction Code (ECC) [11]. For the BCH code, the codeword length is equal to L_C = 2^n − 1, where n is a natural number. The most common codeword lengths for our application are 127, 255, and 511 bits; the length can be freely chosen as long as it is smaller than or equal to the feature vector length k. Examples of some BCH parameter combinations are given in Table 1. In the Reliable Component block, the reliable binary string x_{R_i} is created by cropping the binary feature vector x_{B_i} to the same length as the codeword by selecting the L_C components having the largest reliability (r_i)_t. The indices of the L_C most reliable components are collected in the public helper data w_{1i}. Hereafter, the reliable binary feature vector x_{R_i} is bitwise XOR-ed with codeword c_i. This XOR operation leads to the second helper data w_{2i}. The third and last helper data is the hashed value of secret s_i, indicated as h(s_i). The cryptographic hash function can be considered as a one-way function, which makes it computationally hard to retrieve the secret s_i from its hashed value h(s_i). The protected template corresponds to the three helper data denoted as [X]_i = {h(s_i), w_{1i}, w_{2i}}. The protected template can be considered public and reveals only a minimum amount of information about s_i and x_{i,j}. Therefore it can easily be stored on a less secure local data storage device or on a centralized database, depicted here as the Data Storage.

Table 1. Some examples of BCH parameter combinations

Codeword (L_C)   Secret (L_S)   Correctable bits (η)   BER = η/L_C
127              64             10                     7.9%
127              36             15                     11.8%
255              63             30                     11.8%
255              37             45                     17.7%
511              67             87                     17.0%
511              31             109                    21.3%
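To make the enrollment steps concrete, the following is a minimal sketch (not the project's implementation) of Eqs. (3)-(4) together with the reliable-component selection and the XOR with the codeword; the codeword is taken as given, since BCH encoding is outside the scope of the sketch, and all names are illustrative.

```python
import numpy as np

def enroll(X, mu_T, codeword):
    """X: (M_E, k) real-valued enrollment vectors of one user; mu_T: (k,) training
    threshold; codeword: (L_C,) bit array of an ECC codeword encoding the secret.
    Returns the public helper data w1 (indices) and w2 (masked reliable bits)."""
    mu_i = X.mean(axis=0)
    x_B = (mu_i >= mu_T).astype(np.uint8)                  # Eq. (3)
    if X.shape[0] > 1:
        sigma_i = X.std(axis=0, ddof=1)                    # M_E - 1 in the denominator
        sigma_i[sigma_i == 0] = 1e-12                      # avoid division by zero
    else:
        sigma_i = np.ones(X.shape[1])                      # single-image convention
    r = np.abs(mu_T - mu_i) / sigma_i                      # Eq. (4)
    codeword = np.asarray(codeword, dtype=np.uint8)
    w1 = np.argsort(r)[::-1][:len(codeword)]               # most reliable components
    x_R = x_B[w1]
    w2 = x_R ^ codeword                                    # second helper data
    return w1, w2
```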
3.3 Verification Stage
In the verification stage, a single feature vector y i,j is used. As in classical biometric systems, this feature vector is compared to the reference data stored in the system. In the current setup, y i,j is compared to the protected template [X]i derived in the enrollment stage using a dedicated matching method as follows. In the Quantization block, the binary feature vector y Bi is obtained by quantizing y i,j using Eq. 3, where (µi )t is replaced by (y i,j )t . The same threshold µT is used as in the enrollment stage. In the Reliable Component block, the helper data
w_{1i} is used to select components in y_{B_i} to obtain y_{R_i}. The recovered codeword c_i' is the output of the XOR operation between the helper data w_{2i} and y_{R_i}. Next, this codeword is decoded to recover the (candidate) secret s_i', which is hashed into h(s_i'). In the Comparison block, h(s_i') is matched bitwise with h(s_i) as obtained from the protected template [X]_i. If the hashes are bitwise identical the user is accepted, otherwise rejected. We have a match only if the following is true

\[ h(s_i') = h(s_i) \iff s_i' = s_i \iff \|c_i' \oplus c_i\|_1 = \|(x_{R_i} \oplus w_{2i}) \oplus (w_{2i} \oplus y_{R_i})\|_1 = \|x_{R_i} \oplus y_{R_i}\|_1 \leq \eta \tag{5} \]
where η is the number of bits the ECC can correct and ‖x_{R_i} ⊕ y_{R_i}‖_1 is the hamming distance (HD) between x_{R_i} and y_{R_i}. This means that the number of bit differences between x_{R_i} and y_{R_i} should be equal to or less than η for a match.
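In the deployed system the decision is taken by decoding with the BCH code and comparing hashes; the sketch below only checks the equivalent Hamming-distance condition of Eq. (5), with illustrative names.

```python
import numpy as np

def hamming_check(x_R, y_B, w1, eta):
    """Select the reliable verification bits with w1 and accept iff the Hamming
    distance to the enrolled reliable bits is at most eta (Eq. 5)."""
    y_R = np.asarray(y_B, dtype=np.uint8)[w1]
    hd = int(np.count_nonzero(np.asarray(x_R, dtype=np.uint8) ^ y_R))
    return hd, hd <= eta
```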
4 Verification Performance Results
To analyze the verification performance of the system, we use the FRGC v2.0 database [12], which has 465 subjects with between 1 and 22 3D face images each, for a total of 4007 images. The first version, FRGC v1.0, is used to derive the GFM in the feature extraction algorithm. When applying our feature extraction algorithm to the FRGC images, we obtain feature vectors with k = 696 (see Section 2). Our verification performance test is complex, because the template protection system uses multiple enrollment images. The test protocol we use is elaborated next, followed by the verification performance results.
4.1 Test Protocol
We first divide the FRGC v2.0 database into a training and a test set. The training set is used to obtain the quantization threshold μ_T, while the test set is used to analyze the verification performance. The optimal number of enrollment images, N_enrol, is not known and has to be verified. Its range is set to [1, 10], and two images of each subject are used in the verification stage. In the FRGC v2.0 database, subjects have varying numbers of 3D face images. In order to have the same subjects in each test, only the subjects having at least 12 images are selected for the test set, while the rest is used as the training set. This results in a test set containing 145 subjects with a total of 2347 images. For each N_enrol and codeword length {127, 255, 511} case, 20 verification runs are performed. Each run consists of randomly selecting (N_enrol + 2) images of each subject and performing experiments for the \(\binom{N_{enrol}+2}{N_{enrol}}\) possible combinations of dividing the selected images into N_enrol enrollment images and two verification images. The results are averaged over all combinations and runs. We evaluate the verification performance for both the protected and the unprotected case. For the template protection system, we evaluate its performance by studying the reliable binary feature vectors x_{R_i} and y_{R_i} and assuming a hamming distance classifier given as

\[ HD = \|x_{R_i} \oplus y_{R_i}\|_1. \tag{6} \]
The verification performance test is performed for feature lengths of 127, 255, and 511 bits, corresponding to possible codeword lengths of the ECC code. We also look at the verification performance of the full binary feature vectors x_{B_i} and y_{B_i}, indicated as the “696” bits case. For the unprotected case, we use the real-valued feature vectors x_{i,j} and y_{i,j} and the L1 and the L2 norm as distance measure.
4.2 Performance Results: Protected and Unprotected Templates
Figure 2(a) shows the Equal Error Rate (EER) at different choices of N_enrol. It is clear that N_enrol has an influence on the performance, and we observe that at 7 or more images the performance stabilizes. For the real-valued (unprotected) case the EER is around 7.2%, while for the binary (protected) case the EER is between 3-4%. This shows that in this case our binarization method itself leads to a significant performance improvement. We assume that the performance gain is achieved due to the filtering property of the binarization method on the real-valued feature vectors. The influence of N_enrol on the False Acceptance Rate
Fig. 2. At different Nenrol values (a) shows the EER for each case, (b) the FAR and FRR curves for the 255 bits case, and (c) gives the genuine and imposter distribution. For different codeword lengths, (d) gives the FAR and FRR curves.
Table 2. Verification performance for the protected (reliable binary feature vectors) and unprotected (real-valued and full binary feature vectors) templates. N_enrol is set to 7.

Protected templates:
case   EER    FAR, FRR @ L_S ≈ 65   FAR, FRR @ L_S ≈ 35
127    4.1%   0.023%, 30.0%         0.18%, 17.7%
255    3.7%   0.007%, 32.8%         0.19%, 15.6%
511    3.2%   ≈ 0%, 58.5%           ≈ 0%, 36.8%

Unprotected templates:
case           EER    FRR @ FAR ≈ 0.25%   FAR @ FRR ≈ 2.5%
Binary “696”   3.3%   14.9%               4.7%
Real, L1       7.2%   25.3%               25.2%
Real, L2       7.8%   28.3%               27.5%
(FAR) and False Rejection Rate (FRR) curves is shown in Figure 2(b) for the 255 bits case and is representative of the other cases. It can be seen that with a larger N_enrol, both the EER and the corresponding threshold value, given as the Fractional Hamming Distance (FHD), decrease. FHD is defined as the hamming distance divided by the feature vector length. The EER threshold value also stabilizes at N_enrol larger than 7. The shift of the EER threshold can be explained with Figure 2(c). The genuine distribution shifts to a smaller FHD when N_enrol is increased. By increasing N_enrol, (σ_i)_t and (μ_i)_t can be better estimated and consequently the reliable components can be selected more accurately. A better selection of the most reliable components leads to a smaller FHD at genuine matches, as seen by the shift. On the other hand, when the most reliable components are selected, the imposter distribution curve also shifts to the left. However, the shift of the genuine distribution is greater than that of the imposter distribution, resulting in an EER at a smaller FHD. The performance results for each case are given in Table 2, where N_enrol is set to 7. For the protected case it shows the EER, and the FRR and FAR at the error correction capability of the ECC when L_S ≈ 65 bits and L_S ≈ 35 bits. For the unprotected case the EER, the FRR at a FAR ≈ 0.25%, and the FAR at a FRR ≈ 2.5% are shown. It can be seen that the binarization improves the performance in terms of EER. At a secret length of around 65 bits, codeword lengths 127 and 255 have the best performance, but the FRR is still high (≈ 30%). At a smaller secret length of 35 bits, the FRR decreases to ≈ 15% while maintaining a good FAR ≈ 0.20%. The smaller codewords have a better performance because the threshold corresponding to the EER point shifts to a smaller FHD (see Figure 2(d)). This decrease is larger than the decrease of the error correcting capabilities of the ECC due to smaller codeword lengths (see Table 1).
5 Conclusions
In this work, we successfully combined the HDS template protection system with a 3D face recognition system. The verification performance of the protected templates is of the same order as the performance of the unprotected, realvalued, templates. In order to achieve this improvement we proposed a special
binarization method, which uses multiple enrollment images. In a HDS template protection system, the choice of the operating point is limited by the number of bits the ECC can correct. Using multiple images and varying the number of binary features (corresponding to the codeword length of the ECC) the operating point can be brought closer to the EER point. We obtained the best verification performances at a codeword length of 255 bits with a FAR ≈ 0.19% and a FRR ≈ 16% at 35 bits of security. This is better than the FAR = 0.25% and FRR ≈ 26% performance of the real-valued case. It is expected that if the performance of the real-valued feature vectors is improved, it will further improve the performance of the protected templates. Furthermore, the verification performance of the protected templates can be enhanced with a more robust binarization algorithm. If the resulting binary templates are more robust, the EER will be achieved at a lower fractional Hamming distance. This will give the template protection system the flexibility to choose different operating points, leading to a more secure or a more convenient system.
References
1. 3D Face: http://www.3dface.org/home/welcome
2. Juels, A., Wattenberg, M.: A fuzzy commitment scheme. In: 6th ACM Conference on Computer and Communications Security, pp. 28–36. ACM Press, New York (1999)
3. Juels, A., Sudan, M.: A fuzzy vault scheme. In: Proc. of the 2002 International Symposium on Information Theory (ISIT 2002), Lausanne (2002)
4. Ratha, N.K., Connell, J.H., Bolle, R.M.: Enhancing security and privacy in biometrics-based authentication systems. IBM Systems Journal 40, 614–634 (2001)
5. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: How to generate strong secret keys from biometrics and other noisy data. In: Cachin, C., Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 532–540. Springer, Heidelberg (2004)
6. Verbitskiy, E., Tuyls, P., Denteneer, D., Linnartz, J.P.: Reliable biometric authentication with privacy protection. In: Proc. of the 24th Symp. on Inf. Theory in the Benelux, Veldhoven, The Netherlands, pp. 125–132 (2003)
7. Linnartz, J.-P., Tuyls, P.: New shielding functions to enhance privacy and prevent misuse of biometric templates. In: 4th Int. Conf. on AVBPA (2003)
8. Kevenaar, T.A.M., Schrijen, G.-J., Akkermans, A.H.M., van der Veen, M., Zou, F.: Face recognition with renewable and privacy preserving binary templates. In: 4th IEEE Workshop on AutoID, Buffalo, New York, USA, pp. 21–26. IEEE Computer Society Press, Los Alamitos (2005)
9. Tuyls, P., Akkermans, A.H.M., Kevenaar, T.A.M., Schrijnen, G.J., Bazen, A.M., Veldhuis, R.N.J.: Practical biometric authentication with template protection. In: 5th International Conference, AVBPA, Rye Brook, New York (2005)
10. Gökberk, B., Irfanoglu, M.O., Akarun, L.: 3D shape-based face representation and feature extraction for face recognition. Image and Vision Computing 24, 857–869 (2006)
11. Purser, M.: Introduction to Error-Correcting Codes. Artech House, Boston (1995)
12. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: IEEE CVPR, vol. 2, pp. 454–461. IEEE Computer Society Press, Los Alamitos (2005)
Quantitative Evaluation of Normalization Techniques of Matching Scores in Multimodal Biometric Systems Y.N. Singh and P. Gupta Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, Kanpur-208016, India
[email protected] [email protected]

Abstract. This paper attempts a quantitative evaluation of available normalization techniques for matching scores in multimodal biometric systems. Two new normalization techniques, Four Segments Piecewise Linear (FSPL) and Linear Tanh Linear (LTL), are proposed in this paper. The FSPL normalization technique divides the region of genuine and impostor scores into four segments and maps each segment using a piecewise linear function, while the LTL normalization technique maps the non-overlap regions of the genuine and impostor score distributions to constant functions and the overlap region using the tanh estimator. The effectiveness of each technique is shown using EER and ROC curves on an IITK database of more than 600 people for the following characteristics: face, fingerprint, and offline-signature. The proposed normalization techniques perform better; in particular, LTL normalization is efficient and robust.
1 Introduction
In recent years biometrics has become popular for the automated identification of people based on their distinct physiological and/or behavioral characteristics [1]. Most practical biometric systems are unimodal (i.e., they rely on the evidence of a single biometric trait). Unimodal systems are usually cost-efficient but may not achieve the desired performance because of noisy data, non-universality, lack of uniqueness of the biometric trait, and spoofing attacks [2]. The performance of a biometric system can be improved by combining multiple biometric characteristics. Such systems are referred to as multimodal biometric systems [3]. In multimodal biometric systems, fusion at the matching score level is commonly preferred because matching scores are easily available and contain sufficient information to distinguish between a legitimate user and an impostor. Assume that O_k^G = {r_{k1}^G, r_{k2}^G, ..., r_{kN}^G} is the set of genuine scores of N individuals and O_k^I = {r_{k1}^I, r_{k2}^I, ..., r_{kn}^I} is the set of impostor scores of those individuals, where n = N × (N − 1), for characteristic k. The complete set of matching scores is denoted as O_k, where O_k = O_k^G ∪ O_k^I and |O_k^G ∪ O_k^I| = N + n.
Prior to combining the matching scores of different characteristics, the scores are preprocessed to make them homogeneous. The dissimilarity score r_{ki} of user i for characteristic k can be converted into a similarity score in a common numerical range, say [0, 1], using the formula

\[ r_{ki} = \frac{\max(O_k^G, O_k^I) - r_{ki}}{\max(O_k^G, O_k^I) - \min(O_k^G, O_k^I)}. \]

Alternatively, if the raw scores lie in the range [min(O_k), max(O_k)], they are converted to similarity scores by simply subtracting them from max(O_k) (e.g., max(O_k) − r_{ki}). In the rest of the paper the symbol r_{ki} is used for the similarity score of user i for characteristic k. Further, the matching scores of different characteristics need not be on the same numerical scale. Using a normalization technique, the scores of different characteristics are transformed to a common numerical scale. In this paper the matching scores of the face and fingerprint characteristics are obtained using the Haar wavelet [4] and a minutiae based technique [5], respectively, while global and local features are used to compute the matching scores for offline-signature [6]. The rest of the paper is organized as follows: Section 2 presents the related work in the area of normalization techniques of matching scores in multimodal biometric systems. Section 3 proposes two new normalization techniques of matching scores that improve the system performance. The performance of the normalization techniques is evaluated using different fusion strategies. Normalization and fusion at the matching score level are discussed in Section 4. Experimental results are given in Section 5. Finally, conclusions are presented in the last section.
2 Related Work
Normalization of matching scores in multimodal biometric systems is an important issue that directly affects system performance. In [7], experiments on a database of 100 users for the face, fingerprint and hand-geometry characteristics indicate that the performances of min-max, z-score, and tanh normalization are better than the others. However, the min-max and z-score normalization techniques are sensitive to outliers; hence there is a need for a robust and efficient normalization procedure like tanh normalization. A comprehensive study of normalization-fusion permutations has been done in [8], where Snelick et al. proposed an adaptive normalization technique for matching scores. This technique is computationally intensive and suffers from parameter overhead.

Score Normalization. Score normalization refers to the transformation of scores obtained from different matchers into a common numerical range. A number of normalization techniques, such as min-max, z-score, double sigmoid, tanh, piecewise linear and adaptive normalization, along with their evaluation, are well studied in [7] and [8]. Let n_{ki} be the normalized score corresponding to the similarity score r_{ki}.
Min-Max (MM) - MM normalization transforms the raw scores of O_k into the range [0, 1] using

\[ n_{ki} = \frac{r_{ki} - \min(O_k^G, O_k^I)}{\max(O_k^G, O_k^I) - \min(O_k^G, O_k^I)} \]
Z-Score (ZS) - ZS normalization transforms the scores to a distribution with mean 0 and standard deviation 1. Let μ_{O_k}, δ_{O_k} be the mean and standard deviation of the set O_k; then ZS represents the distance between the raw score r_{ki} and μ_{O_k} in units of δ_{O_k} as

\[ n_{ki} = \frac{r_{ki} - \mu_{O_k}}{\delta_{O_k}} \]
Since μ_{O_k} and δ_{O_k} are sensitive to outliers, z-score is not robust. Statistically, using Grubbs' test [9] one can identify outliers and evaluate the performance of ZS.

Double-Sigmoid (DS) - DS normalization transforms the scores into the range [0, 1] using

\[ n_{ki} = \begin{cases} \dfrac{1}{1+\exp\left(-2\,\frac{r_{ki}-t_k}{t_{kL}}\right)} & \text{if } r_{ki} < t_k,\\[2ex] \dfrac{1}{1+\exp\left(-2\,\frac{r_{ki}-t_k}{t_{kR}}\right)} & \text{otherwise.} \end{cases} \]
where t_k is the reference point, chosen as some value falling in the overlap region of the genuine and impostor scores, and the parameters t_{kL} and t_{kR} are chosen as t_{kL} = t_k − min(O_k^G) and t_{kR} = max(O_k^I) − t_k. DS exhibits a linear characteristic of the scores in the overlap interval [t_k − t_{kL}, t_{kR} − t_k] and a nonlinear characteristic beyond it.

Tanh - Tanh normalization is based on the tanh estimator [10]. It maps the raw scores of O_k into the range [0, 1] as
\[ n_{ki} = 0.5\left[\tanh\left(0.01\,\frac{r_{ki} - \mu_{O_k^G}}{\sigma_{O_k^G}}\right) + 1\right] \]

where μ_{O_k^G} and σ_{O_k^G} are the mean and standard deviation of the genuine matching scores of characteristic k, respectively.

Piecewise-Linear (PL) - The piecewise linear (PL) normalization technique transforms the scores of O_k into the range [0, 1]. The normalization function of PL maps the raw scores using a piecewise linear function as

\[ n_{ki} = \begin{cases} 0 & \text{if } r_{ki} \leq \min(O_k^G),\\ 1 & \text{if } r_{ki} \geq \max(O_k^I),\\ \dfrac{r_{ki} - \min(O_k^G)}{\max(O_k^I) - \min(O_k^G)} & \text{otherwise.} \end{cases} \]
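As an illustration only (not the authors' code), the baseline normalizations above can be sketched as follows; min(O_k^G), max(O_k^I) and the score statistics are assumed to be precomputed and passed in.

```python
import numpy as np

def min_max(r, o_min, o_max):
    """MM: map raw scores linearly onto [0, 1]."""
    return (np.asarray(r, float) - o_min) / (o_max - o_min)

def z_score(r, mu, sigma):
    """ZS: zero mean, unit standard deviation."""
    return (np.asarray(r, float) - mu) / sigma

def tanh_norm(r, mu_g, sigma_g):
    """Tanh estimator, driven by the genuine-score statistics."""
    return 0.5 * (np.tanh(0.01 * (np.asarray(r, float) - mu_g) / sigma_g) + 1.0)

def piecewise_linear(r, min_g, max_i):
    """PL: clamp outside [min(O_G), max(O_I)], linear in between."""
    n = (np.asarray(r, float) - min_g) / (max_i - min_g)
    return np.clip(n, 0.0, 1.0)
```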
Fig. 1. Proposed Score Normalization Techniques (a) Four Segments Piecewise Linear (FSPL) (b) Linear Tanh Linear (LTL)
3 Proposed Score Normalization Techniques
This section proposes two new matching score normalization techniques, Four-Segments-Piecewise-Linear (FSPL) and Linear-Tanh-Linear (LTL) normalization, obtained by combining multiple normalization techniques. The FSPL and LTL techniques take advantage of the characteristics of the piecewise linear function and the tanh estimator for the separability of the genuine and impostor score distributions and for robustness, respectively.
3.1 Four-Segments-Piecewise-Linear (FSPL)
The FSPL normalization technique divides the regions of impostor and genuine scores into four segments and maps each segment using a piecewise linear function (Fig. 1(a)). A reference point t_k is chosen within the overlapping region of O_k^G and O_k^I. The scores between the two extremities of the overlap region are mapped using two linear functions, separately into the range [0, 1] to the left of t_k and into [1, 2] to the right of t_k, as

\[ n_{ki} = \begin{cases} 0 & \text{if } r_{ki} \leq \min(O_k^G),\\ \dfrac{r_{ki} - \min(O_k^G)}{t_k - \min(O_k^G)} & \text{if } \min(O_k^G) < r_{ki} \leq t_k,\\ 1 + \dfrac{r_{ki} - t_k}{\max(O_k^I) - t_k} & \text{if } t_k < r_{ki} \leq \max(O_k^I),\\ 2 & \text{if } r_{ki} > \max(O_k^I). \end{cases} \]
3.2 Linear-Tanh-Linear (LTL)
The LTL normalization technique takes advantage of the characteristics of the tanh estimator. The normalization function of LTL maps the non-overlap region of impostor scores to a constant value 0 and the non-overlap region of genuine scores
to a constant value 1 (Fig. 1(b)). The overlap region between O_k^I and O_k^G is mapped by a nonlinear function using the tanh estimator as

\[ n_{ki} = \begin{cases} 0 & \text{if } r_{ki} \leq \min(O_k^G),\\ 1 & \text{if } r_{ki} \geq \max(O_k^I),\\ 0.5\left[\tanh\left(0.01\,\dfrac{r_{ki} - \mu_{O_k^G}}{\delta_{O_k^G}}\right) + 1.5\right] & \text{otherwise.} \end{cases} \]
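A sketch of the two proposed mappings is given below; it is illustrative only, the reference point t_k and the distribution statistics are assumed to be supplied, and the placement of the 1.5 offset in LTL follows the reconstruction of the formula above.

```python
import numpy as np

def fspl(r, min_g, max_i, t):
    """Four-Segments-Piecewise-Linear mapping onto [0, 2]."""
    r = np.asarray(r, float)
    left = (r - min_g) / (t - min_g)          # overlap, left of t  -> [0, 1]
    right = 1.0 + (r - t) / (max_i - t)       # overlap, right of t -> [1, 2]
    n = np.where(r <= t, left, right)
    n = np.where(r <= min_g, 0.0, n)
    return np.where(r > max_i, 2.0, n)

def ltl(r, min_g, max_i, mu_g, delta_g):
    """Linear-Tanh-Linear mapping: constants outside the overlap region."""
    r = np.asarray(r, float)
    mid = 0.5 * (np.tanh(0.01 * (r - mu_g) / delta_g) + 1.5)
    n = np.where(r <= min_g, 0.0, mid)
    return np.where(r >= max_i, 1.0, n)
```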
The effect of the normalization techniques discussed in the previous section and of the proposed ones on system performance is examined using the following fusion strategies. These fusion strategies take into account the performance of the individual characteristics when weighting their contributions [8].

I. Fusion Strategy A (Assignment of Weights Based on EER). This fusion strategy assigns a weight to each characteristic based on its equal error rate (EER). Weights for more accurate characteristics are higher than those for less accurate characteristics; thus, the weights are inversely proportional to the corresponding errors. Let e_k be the EER of characteristic k; then the weight w_k associated with characteristic k is computed as

\[ w_k = \frac{1}{e_k}\left(\sum_{k=1}^{t}\frac{1}{e_k}\right)^{-1} \tag{1} \]
II. Fusion Strategy B (Assignment of Weights Based on Score Distributions). Here weights are assigned to each characteristic based on its impostor and genuine score distributions. The means of these distributions are denoted by μ_{O_k^I} and μ_{O_k^G}, and the standard deviations by σ_{O_k^I} and σ_{O_k^G}, respectively. A parameter d_k [11] is used as a measure of the separation of these two distributions for characteristic k:

\[ d_k = \frac{\mu_{O_k^G} - \mu_{O_k^I}}{\sqrt{\sigma_{O_k^G}^2 + \sigma_{O_k^I}^2}} \]

If d_k is small, the overlap region of the two distributions is large, and if d_k is large, the overlap region is small. Therefore, weights are assigned to each characteristic proportional to this parameter as

\[ w_k = d_k\left(\sum_{k=1}^{t} d_k\right)^{-1} \tag{2} \]
t k=1
t
k=1
wk ∗ nki ; (∀i)
wk = 1 and the fused score
Quantitative Evaluation of Normalization Techniques of Matching Scores
579
Fig. 2. Block Diagram of Multimodal System
4
System Description
Block diagram of the multimodal biometric verification system based on the fusion of face, fingerprint and signature information at matching score level is shows in Fig. 2. For each characteristic k, first N matchers generate genuine scores rkG1 , rkG2 , . . . , rkGN using the matching of live-template to the template of the same individual stored in the database. Next n matchers generate im postor scores rkI 1 , rkI 2 , . . . , rkI n using matching of live-template to the template of other individual stored in the database. Prior to transformation of scores to a common numerical range matching scores of different characteristics must be homogeneous. In the normalization phase scores obtained from different matchers (genuine and impostor) are scaled to a common numerical range. Finally to obtain the fused scores, genuine and impostor scores of each characteristic are combined separately using the weighted fusion strategies as follows, (n1 , n2 , . . . , nN )G =
t
wk ∗ (nk1 , nk2 , . . . , nkN )G
k=1
and I
(n1 , n2 , . . . , nn ) =
t
I
wk ∗ (nk1 , nk2 , . . . , nkn ) ;
k=1 G
I
The fused matching scores (n1 , n2 , . . . , nN ) ∪ (n1 , n2 , . . . , nn ) are commonly referred as total similarity measures (TSM) of the biometric system. The performance of different normalization techniques for each fusion method is studied against EER values, number of false rejections for subjects and Receiver Operating Characteristics (ROC) curves.
580
5
Y.N. Singh and P. Gupta
Experimental Results
In this section the effect of different normalization techniques on system performance for a multimodal verification system based on face, fingerprint and offline-signature has been discussed using IITK database. For each of these characteristics of total 609 users, live-template is matched against database template, yielding 609 genuine scores and 609 (609x1) impostor scores. The EER values for raw scores for each characteristics are found to be 2.03%, 9.86%, 6.25% for face, fingerprint and signature respectively. The weights for different characteristics for both fusion strategies are calculated according to (1) and (2) which are found as (0.684, 0.123, 0.193) and (0.530, 0.297, 0.177) for face, fingerprint and signature respectively. Table 1. EER Values for (Normalization, Fusion) Combinations (%) Normalizations Fusion Strategy A Fusion Strategy B MM 1.07 0.75 ZS 0.74 0.58 DS 0.77 1.08 Tanh 0.91 0.48 PL 1.08 0.91 FSPL 0.71 0.45 LTL 0.42 0.38
100 90 80 70 60 50 40 30 20 10 0
GAR (%)
GAR (%)
Table 1 shows the EER values against different normalization techniques under two fusion strategies. The best one is the lowest EER value in the individual column. As seen in Table 1, the proposed new normalization technique LTL leads to better performance of EER values 0.42% and 0.38% than any other normalization techniques under fusion strategy A and B, respectively. These two near EER values also lead to conclude that the performance of LTL normalization is least dependent upon the distribution of matching scores. The effect of
0
Quantitative Evaluation of Normalization Techniques of Matching Scores 99.6
99.6
99.4
99.4
99.2
99.2 99
GAR (%)
GAR (%)
99
581
98.8 98.6 98.4
98.8 98.6 98.4
98.2
98.2
98
98
Fusion (A) Fusion (B)
97.8 97.6 0
0.2
0.4
0.6
Fusion (A) Fusion (B)
97.8
0.8
1
0
0.2
0.4
FAR (%)
0.6
0.8
1
FAR (%)
(a) MM
(b) ZS
99.4
99.8
99.2
99.6
98.8
99.4
98.6
99.2
GAR (%)
GAR (%)
99
98.4 98.2 98 97.8
99 98.8 98.6
97.6 98.4
Fusion (A) Fusion (B)
97.4 97.2 0
0.2
0.4
0.6
Fusion (A) Fusion (B)
98.2
0.8
1
0
0.2
0.4
FAR (%)
(c) DS
0.8
1
(d) Tanh
99.5
100
99
99.5
98.5
99
GAR (%)
GAR (%)
0.6
FAR (%)
98 97.5 97
98.5 98 97.5
96.5
97
Fusion (A) Fusion (B)
96 0
0.2
0.4
0.6
Fusion (A) Fusion (B)
96.5
0.8
1
0
0.2
0.4
FAR (%)
0.6
0.8
1
FAR (%)
(e) PL
(f) FSPL 99.85 99.8 99.75
GAR (%)
99.7 99.65 99.6 99.55 99.5 99.45 99.4 Fusion (A) Fusion (B)
99.35 99.3 0
0.2
0.4
0.6
0.8
1
FAR (%)
Fig. 4. Performance of Different Normalization Techniques under Fusion Strategy A and B: (a) MM, (b) ZS, (c) DS, (d) Tanh, (e) PL, (f) FSPL, (g) LTL
different normalization techniques on system performance under fusion strategy A and B are shown in Fig. 3(a) and (b). ROC curves for the face, fingerprint and signature characteristics are also shown on these figures for comparison which shows the improvement in performance after fusion for all normalization techniques. The performance of different normalization techniques: MM, ZS, DS, Tanh, PL, FSPL, and LTL under both fusion starategies are shown in Fig. 4 (a) through (g). 100
It has been observed from Fig. 4(a) that the performance of MM normalization technique under the fusion strategy B is better than that of fusion strategy A. Similarly it is the case with Tanh, FSPL and LTL normalization as shown in Fig. 4(d), Fig. 4(f) and Fig. 4(g), respectively. The performance of ZS normalization is better at lower FAR under fusion strategy A while at higher FAR, fusion strategy B gives better performance as shown in Fig. 4(b). From Fig. 4(c) the performance of DS normalization is better at higher FAR under fusion strategy A. PL normalization achieved better performance at higher FAR under fusion strategy B while at lower FAR, fusion strategy A gives better performance as shown in Fig. 4(e). wPerformance of different normalization techniques under both fusion strategies are compared in Fig. 5(a) and (b). From these figures it is observed that the proposed normalization technique LTL outperform the other normalization techniques under both fusion strategies at lower FAR as well as at higher FAR.
The performance of the other proposed normalization technique, FSPL, is better at higher FAR under both fusion strategies. The robustness of the proposed LTL normalization technique is analyzed under fusion strategy A and fusion strategy B, as shown in Fig. 5(c) and (d), respectively. From these figures it is found that the performance of the biometric system is completely invariant to changes in the standard deviation (SD) of the matching scores. In other words, LTL normalization is insensitive to outliers and hence is a robust matching-score normalization technique.
6 Conclusion
This paper deals with the effect of normalization techniques of matching scores on the performance of multimodal biometric systems using face, fingerprint and offline-signature. The experimental results, obtained on the IITK biometric database of more than 600 individuals, show that the proposed normalization technique Linear-Tanh-Linear (LTL) is efficient and robust. The performance of the Four-Segments-Piecewise-Linear (FSPL) normalization technique is better at low FAR. This analysis of normalization techniques of matching scores suggests that exhaustive testing of score normalization is needed to evaluate the performance of any multimodal biometric system.
References
1. Jain, A.K., Ross, A.: Information Fusion in Biometrics. Pattern Recognition Letters, Special Issue on Multimodal Biometrics, 2115–2125 (2003)
2. Snelick, R., Indovina, M., Yen, J., Mink, A.: Multimodal Biometrics: Issues in Design and Testing. In: ICMI'03, Canada, pp. 68–72 (2003)
3. Jain, A.K., Ross, A.: Learning User Specific Parameters in a Multimodal Biometric System. In: Proc. Inter. Conf. on Image Processing (ICIP), pp. 57–60 (2002)
4. Hassanien, A.E., Ali, J.M.: An Iris Recognition System to Enhance E-security Environment Based on Wavelet Theory. AMO 5(2), 93–104 (2003)
5. Thai, R.: Fingerprint Image Enhancement and Minutiae Extraction. Technical Report, The University of Western Australia (2003)
6. Ismail, M.A., Gad, S.: Off-line Arabic Signature Recognition and Verification. Pattern Recognition 33, 1727–1740 (2000)
7. Jain, A.K., Nandakumar, K., Ross, A.: Score Normalization in Multimodal Biometric Systems. Pattern Recognition 38, 2270–2285 (2005)
8. Snelick, R., Uludag, U., Mink, A., Indovina, M., Jain, A.K.: Large-Scale Evaluation of Multimodal Biometric Authentication Using State-of-the-Art Systems. IEEE Trans. on PAMI 27(3), 450–455 (2005)
9. Grubbs, F.E.: Procedures for Detecting Outlying Observations in Samples. Technometrics 11(1), 1–21 (1969)
10. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley, New York (2005)
11. Bolle, R.M., Pankanti, S., Ratha, N.K.: Evaluation Techniques for Biometrics-Based Authentication Systems (FRR). In: Proc. of 15th International Conference on Pattern Recognition, vol. 2, pp. 831–837 (2000)
Keystroke Dynamics in a General Setting Rajkumar Janakiraman and Terence Sim School of Computing, National University of Singapore, Singapore 117543 {rajkumar,tsim}@comp.nus.edu.sg
Abstract. It is well known that Keystroke Dynamics can be used as a biometric to authenticate users. But most work to date uses fixed strings, such as userid or password. In this paper, we study the feasibility of using Keystroke Dynamics as a biometric in a more general setting, where users go about their normal daily activities of emailing, web surfing, and so on. We design two classifiers that are appropriate for one-time and continuous authentication. We also propose a new Goodness Measure to compute the quality of a word used for Keystroke Dynamics. From our experiments we find that, surprisingly, non-English words are better suited for identification than English words. Keywords: Keystroke dynamics, biometrics.
1 Introduction
Keystroke Dynamics is increasingly being used as a biometric for user authentication, no doubt because keyboards are common input devices, being readily found on computers, telephones, ATM machines, etc. By Keystroke Dynamics we mean the temporal typing pattern (the way you type), rather than the content typed (what you type). Most of the research into Keystroke Dynamics, however, is done on fixed-text input, otherwise called password hardening [3,4,6,10], rather than on free text. Typically, keystroke authentication is performed during user login on a pre-determined string, such as the userid or password. This seems to us to be somewhat limiting, considering that most people continue to use the keyboard well beyond user login. It would certainly be more useful if Keystroke Dynamics could handle free text as well as fixed text. In our literature search, we note that S.J. Shepherd [1] was perhaps the first to explore using Keystroke Dynamics for continuous authentication, using the rate of typing. The system authenticated the user based only on the mean and standard deviation of the Held Times and the Interkey Times, irrespective of the key being pressed. Although it worked for a user population of four, the accuracy of the system is likely to decrease as the number of users increases. There is no guarantee that these features are sufficiently discriminative. Indeed, our experiments conducted with a larger pool of 22 users confirm this. The recent works of Villani et al., Rao et al., and Leggett et al. [7,8,9] studied keystroke verification on fixed text as well as free text. The users were asked to type a pre-determined text of a few hundred keystrokes (much longer
than the usual userid and password), and a text of a few hundred keystrokes of their own choice in the keystroke capture application. This data is then used for training and testing of their verification systems. The general conclusion from their studies is that Keystroke Dynamics works better on fixed text than on free text. We remark that these researchers all used Held Times and Interkey Times (of up to three consecutive keystrokes) as features, and did not consider the actual words being typed. We believe this is the cause of their poor performance. Our work in this paper suggests that Held Times and Interkey Times do indeed depend on the words typed. That is, the timings for ‘THE’ is different for ‘FOR’. By using word-specific Held and Interkey Times, we are able to achieve greater accuracy. In other words, we are using fixed strings within free text for the purpose of discrimination. We show that many fixed strings qualify as good candidates, and this allows us to verify the user as soon as any of these strings are typed. Can a sample of keystroke data identify a user without any constraints on language or application? In other words, we wish to identify a person without any constraint on what he or she types. The person is not required to input a pre-determined text. Can Keystroke Dynamics still be used in such a general setting? In this paper, we attempt to answer this question. The answer will help in the design of continuous authentication systems [5], in which the system continuously checks for the presence of the authorized user after initial login. In such a scenario, it is impractical to demand the user to repeatedly type her userid or any other pre-determined text. Instead, the system has to utilize the typing patterns present in free text for authentication. Perhaps the work closest to ours is that of Gunetti and Picardi [12], in which clever features were devised (along with suitable distance metrics) for free-text authentication. More precisely, Gunetti and Picardi avoided using the usual digraph and trigraph latencies directly as features. Instead, they used the latencies only to determine the relative ordering of different digraphs, and devised a distance metric to measure the difference between two orderings of digraphs, without regard to the absolute timings. The authors reported a False Accept Rate (FAR) of 0.0456% at a False Reject Rate (FRR)1 of 4.0%, which, although worse than their fixed-text system, is state of the art for free-text systems. We begin by analyzing the keystrokes of users as they go about their normal daily activities of emailing, web surfing, etc. We then look for patterns that can be used as a biometric. Such a pattern has to be discriminative, and at the same time common (universal) across all users (because the pattern cannot be used on people who do not type it). Also, for practical purposes, we should not have to wait too long for such a pattern to appear. The pattern should be readily available. We discover that, indeed, discriminative, universal and available patterns do exist even when the typing is unconstrained. Moreover, non-English words are better suited for this task. As far as we can tell, we are the first to investigate the problem of Keystroke Dynamics in a general setting. Our paper makes the following contributions: 1
In the keystroke dynamics literature, FRR and FAR are also known as the False Alarm Rate and the Imposter Pass Rate, respectively.
1. We propose a new Goodness Measure to assess a keystroke pattern based on its discriminability, universality, and availability.
2. We show that Keystroke Dynamics can be used as a biometric even in a general setting.
3. We show that, surprisingly, some non-English words have a higher Goodness Measure than English words.
4. We propose two classifiers that are suitable for one-time and continuous keystroke authentication.
2 Basic Concepts
In this section we explain the basic terminology used in this paper. Fig. 1 shows a typical keystroke stream, collected from a user. Each arrow indicates a keyevent - the down facing arrow indicates a key being depressed, the upward facing arrow indicates the key being released. A pair of keyevents, a press and a release of the same key, form a keystroke.
Fig. 1. Typical keystroke data. Each upward and downward pointing arrow indicates a keyevent.
2.1 Definitions
Held Time (Ht). We define Held Time as the time (in milliseconds) between a key press and a key release of the same key. Fig. 1 shows how the Held Time of the key ‘H’ is determined. Note that Held Time is strictly greater than zero. Interkey Time (It). This is defined as the time in milliseconds between two consecutive keystrokes. In Fig. 1, It(H,E) is the time between the key release of ‘H’ and key press of ‘E’. Interkey Times can be negative, i.e. the second key is depressed before the first key is released. Sequence. We define Sequence as a list of consecutive keystrokes. For example ‘HELLO’ in Fig. 1 is a Sequence. A Sequence can be of any length, the minimum being two. In this example, the Sequence is a valid English word, but this need not be the case. Thus, ‘HEL’ and ‘LLO’ are also valid Sequences from the same keystroke stream in Fig. 1.
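As an illustration (not the authors' tooling), held and interkey times can be computed from a list of timestamped key events as in the following sketch; the event representation is an assumption, and overlapping-key corner cases are ignored.

```python
def timing_features(events):
    """events: list of (timestamp_ms, key, kind) with kind in {'down', 'up'},
    ordered by time. Returns per-keystroke held times and the interkey times
    between consecutive keystrokes (release of one key to press of the next)."""
    keystrokes = []                      # (key, down_time, up_time)
    pending = {}                         # key -> down_time
    for t, key, kind in events:
        if kind == 'down':
            pending[key] = t
        elif kind == 'up' and key in pending:
            keystrokes.append((key, pending.pop(key), t))
    held = [(k, up - down) for k, down, up in keystrokes]
    inter = [(keystrokes[i][0], keystrokes[i + 1][0],
              keystrokes[i + 1][1] - keystrokes[i][2])   # may be negative
             for i in range(len(keystrokes) - 1)]
    return held, inter
```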
Feature Vector (Ft). This is a vector of the Held Times followed by the Interkey Times of a Sequence. For the Sequence ‘THE’, its feature vector is

\[ F_t(\text{THE}) = \big[\, H_t(\text{T}),\; H_t(\text{H}),\; H_t(\text{E}),\; I_t(\text{T,H}),\; I_t(\text{H,E}) \,\big] \tag{1} \]

For a Sequence of length n, the length of the feature vector will be 2n − 1.
2.2 Histogram (Hist_seq)
From the samples of the Sequence appearing in the keystroke data, we estimate the probability density function (pdf) for each element in the feature vector F_t. This information is stored as a normalized histogram for the Sequence, which in turn will be used for classification. We choose to represent the pdf as a histogram rather than as the parameters of a multidimensional Gaussian pdf because we observe that the data are rarely normally distributed. It is well known that a histogram with a fixed bin size is able to represent any pdf more accurately than a single Gaussian distribution. Given two pdfs h_i, h_j, how similar are they? This may be measured using the Bhattacharyya distance or the Kullback-Leibler divergence [11]. We prefer the Bhattacharyya distance [11] because of its symmetry:

\[ Dist_B(h_i, h_j) = \int \sqrt{h_i(x)\,h_j(x)}\,dx \tag{2} \]

This distance is always between 0 and 1, with 1 meaning that the two pdfs overlap perfectly, and 0 meaning that the pdfs do not overlap at all. Since our pdfs are discretized as histograms, the actual computation of Equation (2) is performed by first multiplying the corresponding bins of the two histograms, taking the positive square root of the products, and then summing over all the bins. We will use the Bhattacharyya distance in Classifier B (see Section 2.3).
2.3 Classifier
Classifier A is designed to identify a person from a single instance of a Sequence appearing in the keystroke data. The identity of the person is given by

\[ \arg\max_{person} P(person \mid seq) \tag{3} \]

where

\[ P(person \mid seq) = \prod_{f \in F_t} P(Hist_{seq}^{person} \mid f) \tag{4} \]
Here we are making the Naïve Bayes assumption, i.e. the elements of the Feature Vector Ft are statistically independent. Classifier A is useful for applications where authentication is required immediately without any delay; for example, in a login module that prompts the user to type a system-generated string (such a string is different each time, to guard against replay attacks).
Classifier B is designed to identify a person from multiple instances of the same Sequence appearing in the keystroke data. Here we first build a histogram (Hist_in) from the input keystroke stream and then compare it with the learned histogram (Hist_seq) using the Bhattacharyya distance. The identity of the person is given by

\[ \arg\max_{person} Dist_B(Hist_{seq}, Hist_{in}) \tag{5} \]
Again, we make the Naïve Bayes assumption. Classifier B is useful for applications which can afford to wait and collect enough keystrokes (thereby accumulating more evidence) before authentication.
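A compact sketch of the two decision rules follows; it assumes per-user, per-feature normalized histograms with fixed bin edges have been learned beforehand, and all helper names are illustrative rather than the authors' implementation.

```python
import numpy as np

def classifier_a(feature_vec, user_hists, bins):
    """Single-instance identification (Eqs. 3-4): pick the user maximizing the
    naive-Bayes product of per-feature histogram probabilities.
    user_hists: dict user -> array (num_features, len(bins) - 1)."""
    idx = [int(np.clip(np.digitize(f, bins) - 1, 0, len(bins) - 2))
           for f in feature_vec]
    best, best_logp = None, -np.inf
    for user, hists in user_hists.items():
        logp = sum(np.log(hists[j, b] + 1e-12) for j, b in enumerate(idx))
        if logp > best_logp:
            best, best_logp = user, logp
    return best

def classifier_b(input_hists, user_hists):
    """Multi-instance identification (Eq. 5): pick the user whose learned
    histograms are closest to the input histograms under the Bhattacharyya
    measure, combined over features with the naive-Bayes product."""
    def similarity(h1, h2):
        return float(np.sqrt(h1 * h2).sum(axis=1).prod())
    return max(user_hists, key=lambda u: similarity(input_hists, user_hists[u]))
```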
3 Experiments
All our experiments are based on keystroke data collected from 22 users over a period of two weeks. The users are staff or students from our department, with different typing abilities. Some are trained typists, in that they had undergone typing classes and could type without looking at the keyboard. Others are untrained typists, but are still familiar with the keyboard as they have used it for many years. The users are of Chinese, Indian or European origin, and all are fluent in English. Unlike most other studies, the data we collected were not controlled by any means. Keystrokes were logged as users went about their daily work of using email, surfing the web, creating documents, and so on. The collected data from individual users ranged from 30,000 keyevents to 2 million keyevents. In total 9.5 million keyevents were recorded. The PCs used belonged to each individual user, that is, they were not shared machines nor public access computers. Most PCs had keyboards with the DellTM US layout, although a few users used their own non-Dell laptops. Each user used the same keyboard throughout the data collection, and thus the problem of different keyboards affecting their typing speed did not arise. To collect keystrokes, we wrote a data collector program in Visual C. This was basically a System Wide Keyboard and Mouse hook, which collects keyboard and mouse events regardless of the application. The data collector was installed in the user’s machine running Microsoft Windows XPTM . When activated the program collects all the keyevents and the mouseevents along with the timestamp and name of the application receiving the event. To protect the privacy of the users, each userid and machine id were hashed to a unique number. Also, a shortcut key was provided so that the user can switch it off while typing sensitive information such as passwords or pin numbers. Table 1. The ten most frequently used English words, in descending order of frequency THE, OF, AND, A, TO, IN, IS, YOU, THAT, IT
3.1 Results I - For Sequences That Are English Words
Classifier A and Classifier B were run on selected Sequences from the keystroke data that are English words. The words are selected from a Corpus [2] of the most frequently appearing English words, see Table 1. The user data was split into 10 bins and 10-fold cross-validation was conducted with both classifiers. The presented results are the mean and the standard deviation of the classification accuracy from the cross-validation. Accuracy is computed as a number (probability) between 0 and 1. From Tables 2 and 3 it is evident that Classifier B outperforms Classifier A. But we should note that Classifier A uses only one instance of the Sequence for identification, whereas Classifier B uses multiple instances for a combined result. Classifier A will be suitable for applications like Continuous Authentication [5] which need to authenticate a user immediately upon receiving a biometric sample. Note that Table 2 shows the accuracy for identification tasks (multiclass classification) rather than for verification (two-class classification with a claimed identity). We can in principle get even better performance for verification by choosing a sequence that works best for each person, i.e., since we know the identity of the person being verified, we can sacrifice universality for discriminability.

Table 2. Performance of Classifier A - English Words

Sequence   Mean of Accuracy   Std. dev. of Accuracy
FOR        0.0598             0.1224
TO         0.0838             0.1841
THE        0.0562             0.1504
YOU        0.0512             0.0733
IS         0.0538             0.0432
IN         0.0573             0.0669
AND        0.0878             0.2682
OF         0.0991             0.2809
Classifier B can be aptly called a post-login identifier, as it needs a significant amount of keystroke data to identify the person. The identification accuracy is very high, but the price to pay is the large number of samples required. Although it might not be suitable for Continuous Verification, it can be used as a forensic tool to identify the person after data collection. In our experiments, we also observed that the accuracy of Classifier B increases with the number of samples. We surmise that this is due to better estimation of the histogram.
3.2 Results II - For Non-English Sequences
In order to perform keystroke identification in a general setting, we cannot depend only on English words. Almost all users, even native English speakers, type abbreviated words every so often. For example, 'tmr' is frequently used to mean 'tomorrow'. In this age of short text messaging, such abbreviations are increasingly common. In fact, by regarding such words as coming from a foreign language, it is clear that our approach can be applied to other languages as well. Running Classifier B on non-English Sequences produces Table 4. Generally, the accuracies for non-English Sequences are higher than those for English words. Although the corpus listed 'THE' as the most frequently used word in English, this was no longer the case when non-English Sequences were considered.

Table 3. Performance of Classifier B - English Words

Sequence  Mean of Accuracy  Std. dev. of Accuracy
FOR       0.7955            0.2319
TO        0.9455            0.1184
THE       0.8409            0.2443
YOU       0.7364            0.2985
IS        0.9591            0.0796
IN        1.0000            0.0000
AND       0.8318            0.2191
OF        0.8000            0.1927
4 Goodness Measure
Given that we are allowing non-English text, we can no longer rely on the Corpus to guide us in selecting useful sequences. How then do we select a Sequence for identification? We will need to look for them in the training data. According to Jain [13], a good biometric ought to satisfy seven criteria: Universality, Uniqueness, Permanence, Collectability, Performance, Acceptability, Circumvention. Of these, we only need to consider the first two, because the others have to do with technology or user perception. Universality (commonality) means the biometric should be measurable across all users. Uniqueness (individuality) has to do with its discriminative power. From these two criteria, we derive a new Goodness Measure to assess the quality of a Sequence based on accuracy, availability and universality. First, a few definitions.

1. Universality (U). A Sequence is not useful for identification unless it is commonly used by users. For English text, the Corpus lists 'THE' as frequently occurring because every person uses it. We define universality as

   U = (No. of users having this Sequence in their keystrokes) / (Total no. of users)        (6)
2. Accuracy (A). The classification accuracy of the Sequence. Note that 0 ≤ A ≤ 1. Accuracy is a measure of how discriminative a Sequence is, i.e. its uniqueness.
3. Expectancy (E). Unlike other kinds of biometrics, keystroke dynamics requires the system to wait until the user enters the required text. A particular string may be Universal, but the user may not type it frequently, thus keeping the system waiting for a long time. To capture this notion, we define the Expectancy of a Sequence to be the average number of keystrokes until an instance of the Sequence appears in the text. Intuitively, this measures how readily available the Sequence is. At best, the Sequence could appear on every keystroke (E = 1); at worst, it might never appear in the text (E = ∞). For example, in the keystroke stream 'TO BE OR NOT TO BE', E('TO') = 18/2 = 9 and E('THE') = ∞.

With the above definitions, we can now define the Goodness Measure of a Sequence, Gm, as follows:

   Gm = 0               if E = ∞
   Gm = (U × A) / E     otherwise                                       (7)

Ideally, if all the factors are at their best (A = 1, E = 1, U = 1), Gm will equal 1. In the worst case (A = 0 or E = ∞ or U = 0), Gm will equal 0. When E = 1, Equation (7) reduces to the special case

   Gm = U × A                                                           (8)
which may be interpreted as the Goodness Measure for a fixed-text keystroke identification system. Thus, our Goodness Measure is also applicable to traditional fixed-text systems. Table 4 shows the Goodness Measure of a number of English and non-English sequences. The table is sorted by Accuracy, but a quick glance reveals that a high Accuracy does not imply a high Gm score. For example, the Sequence 'NE' (row four) has a high Accuracy but a low Gm score. The reason is its long Expectancy: one has to wait, on average, over 400 keystrokes before this Sequence appears. For applications that require immediate authentication, 'NE' is a poor choice. Table 4 also shows that the best-performing English words (those that appear in Table 3) rank below non-English Sequences, both in terms of Accuracy and Gm score. Surprisingly, the Sequence 'THE' (third row from the bottom) has a long Expectancy and is not Universal. This yields a low Gm score. We surmise that this counter-intuitive observation is because the subjects in our experiments did not write in complete, grammatical English. This, in turn, probably reflects the informal way English prose is used in everyday communication, rather than an indictment of the subjects' poor command of the language. Finally, the table also highlights the fact that Expectancy is the dominant criterion affecting the Gm score. Many Sequences with approximately equal Accuracy and Universality scores differ greatly in their Gm scores because of different Expectancy values. For Keystroke Dynamics in a free-text setting, waiting for a long-Expectancy Sequence limits how quickly the system can authenticate. Sequences with short Expectancy are more useful in this regard.
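To make the three quantities concrete, the sketch below (a minimal illustration, not the authors' code) computes Expectancy, Universality and the Goodness Measure of Equation (7) from per-user keystroke streams; the Accuracy value is assumed to come from a separate classifier evaluation, and averaging Expectancy over the users who typed the Sequence is our own simplification.

# Minimal sketch of the Goodness Measure (Eq. 7); the accuracy is assumed
# to be supplied by a separate classifier evaluation, as in the paper.
import math

def expectancy(stream, seq):
    """Average number of keystrokes until an instance of `seq` appears."""
    count = stream.count(seq)
    return math.inf if count == 0 else len(stream) / count

def universality(streams, seq):
    """Fraction of users whose keystroke stream contains `seq`."""
    return sum(seq in s for s in streams) / len(streams)

def goodness(streams, seq, accuracy):
    U = universality(streams, seq)
    # Average Expectancy over the users who actually typed the Sequence.
    Es = [expectancy(s, seq) for s in streams if seq in s]
    E = sum(Es) / len(Es) if Es else math.inf
    return 0.0 if math.isinf(E) else U * accuracy / E

streams = ["TO BE OR NOT TO BE", "THE QUICK BROWN FOX"]   # toy data
print(expectancy(streams[0], "TO"))                       # 18 / 2 = 9.0
print(goodness(streams, "TO", accuracy=0.9455))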
Table 4. Performance of Classifier B - non-English sequences

Sequence  Accuracy  Expectancy  Universality  Gm
AN        1.0000    113         1             0.008866
IN        1.0000    125         1             0.007986
NG        0.9955    150         1             0.006636
NE        0.9909    407         1             0.002433
LE        0.9864    294         1             0.003360
RE        0.9818    218         1             0.004498
TI        0.9818    324         1             0.003030
HE        0.9773    157         1             0.006226
EN        0.9773    207         1             0.004722
MA        0.9773    383         1             0.002549
ER        0.9727    245         1             0.003977
OU        0.9682    317         1             0.003051
IT        0.9682    339         1             0.002857
ING       0.9682    345         1             0.002802
AI        0.9636    312         1             0.003083
...       ...       ...         ...           ...
TO        0.9455    411         1             0.002298
THE       0.8409    350         0.95          0.002296
AND       0.8318    814         1             0.001022
FOR       0.7955    1062        1             0.000749

5 Conclusion and Future Work
In this paper, we presented a technique to identify a person based on Keystroke Dynamics in a general setting. This generalizes traditional fixed-text Keystroke Dynamics to free-text systems. Essentially, we identify a person based on a common list of fixed strings which we discover by analyzing users' keystroke logs. Our technique can also be used for verification, in which case each user can have his/her own list of strings. We also found that non-English Sequences were more accurate than English words. This is useful because the prevalence of new communication technologies, such as instant messaging, online chat, text messaging, etc., means that users increasingly use informal English (containing abbreviations and even new words) when composing messages. This is true even for native English speakers. To guide our selection of good non-English words to use, we proposed a novel Goodness Measure based on well-studied properties that all biometrics ought to possess. In the future, we would like to conduct experiments on a larger pool of users to see if our results hold up. Also, we intend to investigate the effect of different keyboards on a person's typing speed, and how we may mitigate this. Finally, it would be interesting to see if keystroke dynamics can distinguish between trained and untrained typists.
References

1. Shepherd, S.J.: Continuous authentication by analysis of keyboard typing characteristics. In: IEEE Conf. on Security and Detection, European Convention, pp. 111–114. IEEE Computer Society Press, Los Alamitos (1995)
2. Fry, E.B., Kress, J.E., Fountoukidis, D.L.: The Reading Teachers Book of Lists, 3rd edn.
3. Monrose, F., Reiter, M.K., Wetzel, S.: Password hardening based on keystroke dynamics. In: Proceedings of the 6th ACM Conference on Computer and Communications Security. ACM Press, New York (1999)
4. Rodrigues, R.N., Yared, G.F.G., Costa, C.R.D., Yabu Uti, J.B.T., Violaro, F., Ling, L.L.: Biometric Access Control Through Numerical Keyboards Based on Keystroke Dynamics. In: International Conference of Biometrics, pp. 640–646 (2006)
5. Kumar, S., Sim, T., Janakiraman, R., Zhang, S.: Using Continuous Biometric Verification to Protect Interactive Login Sessions. In: ACSAC, pp. 441–450 (2005)
6. Joyce, R., Gupta, G.: Identity authentication based on keystroke latencies. In: Communications of the ACM, vol. 33(2), pp. 168–176. ACM Press, New York (1990)
7. Villani, M., Tappert, C., Ngo, G., Simone, J., St. Fort, H., Cha, S.: Keystroke Biometric Recognition Studies on Long-Text Input under Ideal and Application-Oriented Conditions. In: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, p. 39. IEEE Computer Society, Washington (2006)
8. Rao, B.: Continuous Keystroke Biometric System. M.S. thesis, Media Arts and Technology, UCSB (2005)
9. Leggett, J., Williams, G., Usnick, M., Longnecker, M.: Dynamic identity verification via keystroke characteristics. Int. J. Man-Mach. Stud. 35(6), 859–870 (1991)
10. Obaidat, M.S., Sadoun, B.: Keystroke Dynamics Based Authentication. Ch. 10, Textbook (1998), http://web.cse.msu.edu/~cse891/Sect601/textbook/10.pdf
11. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley and Sons, Chichester (2000)
12. Gunetti, D., Picardi, C.: Keystroke analysis of free text. ACM Transactions on Information and System Security 8(3), 312–347 (2005)
13. Jain, A.K.: Biometric recognition: how do I know who you are? In: Proceedings of the 12th IEEE Signal Processing and Communications Applications Conference. IEEE Computer Society Press, Los Alamitos (2004)
A New Approach to Signature-Based Authentication Georgi Gluhchev1, Mladen Savov1, Ognian Boumbarov2, and Diana Vasileva2 1
Institute of Information Technologies, 2, Acad. G. Bonchev Str., Sofia 1113, Bulgaria
[email protected],
[email protected] 2 Faculty of Communication Technologies, Technical University, 8, Kl. Ohridski, 1000 Sofia, Bulgaria
[email protected],
[email protected]

Abstract. A new signature-based authentication approach is described, in which signing clips are analyzed. Using a web camera, a series of frames is acquired that allows investigating the dynamics of the complex "hand-pen". For this, a set of features of the hand, the pen and their mutual disposition at the time of signing is derived. A classification and verification decision-making rule based on the Mahalanobis distance has been used. A class-related feature weighting is proposed for the improvement of accuracy. A Gaussian-based model for the description of skin color is suggested. The preliminary experimental results have confirmed the reliability of the approach.

Keywords: Signature, Authentication, Biometrics, Feature weight, Classification error, Color modeling.
1 Introduction

The signature has been and still is used as a principal means of person authentication. This is due to the comparative stability of the graph and of the movement dynamics stemming from the stereotype built up over the years. To date, two approaches to signature authentication have been developed: off-line and on-line. The development of the off-line methods started in the 1970s [11,12]. They are based on the evaluation of sophisticated static features describing the signature's graph [3,6,7,9,10]. Unfortunately, the graph can be imitated skillfully within the range of the admissible variations of the individual. This makes identification methods using off-line analysis of the signature not quite reliable. To speed up the authentication and increase the reliability, on-line analysis has been introduced, where a pressure-sensitive graphic tablet is used, thus giving the possibility of analysing the graph, dynamics and pressure simultaneously [4,8]. Since it is difficult to imitate dynamics, it is believed that this approach will be more resistant to forgeries. However, the non-standard way of signing may involve deviations in the signature parameters and may cause an increase in the error. While the two above-mentioned approaches encompass the signature graphics, dynamics and pressure, there is one more aspect of the writing process that may contribute to the solution of the authentication problem. This concerns the hand position and parameters, pen orientation and its position relative to the hand during signing,
i.e. the behavior of the complex "hand-pen". This aspect is described in the paper and some preliminary results are reported. To the best of our knowledge, such an investigation, in which interconnected parameters of a different nature reflecting the individuality of the signing subject are analyzed, has not been carried out. The paper is organized in the following way: in section 2 some preprocessing and segmentation steps are presented; section 3 describes feature extraction; section 4 introduces the authentication rule; in section 5 experimental results are presented; in section 6 some problems are discussed and the possibilities for further extension of the approach are outlined.
2 Pre-processing and Segmentation

The approach is based on the processing of images of a signing hand. For this the following scenario is used: the hand enters the field of a web camera mounted above the desk, signs, and leaves the field after that. Thus, a video clip is obtained and saved. Depending on the signature length, the series of images can contain between 100 and 200 frames.

2.1 Image Pre-processing

The image processing consists of several steps: detection of the signature end points, hand-pen extraction, and evaluation of features related to the hand parameters and hand-pen dynamics.

2.2 Detection of the Signature's Start and End Points

The procedure for detection of the signature's start and end points was described in detail in [15]. The absolute difference between the frames and an empty reference frame is used for this. When the hand enters the view field of the camera this difference rises sharply, then oscillates about some constant value and diminishes as the hand withdraws, so a graph with a steep slope at the beginning and at the end of the signing is produced (Fig. 1). Calculating the gradient of this graph provides an easy determination of the "plateau" and hence of the beginning and the end of the signature. The experiments carried out in [15] have shown satisfying precision in the localization of the signature; variations of about 2 frames have been observed. At this initial step the signature duration d, as a number of frames, can be obtained and used further as an identification feature.

2.3 Object Detection

The complex "hand-pen" is the object that we are interested in. Because of the almost uniform background, a fixed threshold can be applied for the object's detection. To evaluate it, a few frames, say K, are recorded in the absence of any object. For every two consecutive empty frames (k, k+1) the maximal absolute difference E_{k,k+1} of the values of the three color components is evaluated:
E_{k,k+1} = max_{(i,j)} | E_{k+1}(i, j) − E_k(i, j) | ,                   (1)
where E_k(i, j) is the color component vector at pixel (i, j) in the k-th image (k = 1, 2, …, K−1) and K is the number of frames in the series. The maximal difference
E = max_k E_{k,k+1}                                                        (2)
is further used as a threshold for the object detection in the image sequence. After this operation, some "salt and pepper"-type noise may remain in the image (Fig. 2b), but it can easily be removed using morphological or heuristic techniques.
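The following sketch is our own illustration of this step, assuming the frames are available as NumPy RGB arrays; it computes the threshold of Equation (2) from the empty frames and then thresholds the difference to an empty reference frame.

# Sketch of the threshold of Eqs. (1)-(2) and object segmentation,
# assuming frames are NumPy uint8 arrays of shape (H, W, 3).
import numpy as np

def empty_frame_threshold(empty_frames):
    """E = max over consecutive empty-frame pairs of the maximal per-pixel
    absolute difference of the color components."""
    diffs = [np.abs(b.astype(int) - a.astype(int)).max()
             for a, b in zip(empty_frames, empty_frames[1:])]
    return max(diffs)

def object_mask(frame, reference, threshold):
    """Pixels whose color difference from the empty reference exceeds the
    threshold are marked as belonging to the hand-pen object."""
    diff = np.abs(frame.astype(int) - reference.astype(int)).max(axis=2)
    return diff > threshold

Residual salt-and-pepper noise in the resulting mask can then be removed, e.g., with a morphological opening.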
Fig. 1. Graph of the absolute differences. The horizontal axis represents the frame numbers, differences are alongside the vertical axis.
Fig. 2. a) Original image; b) background subtracted
2.4 Hand-Pen Segmentation
For the analysis of the hand and pen movement, a separation of the two elements is required. To facilitate the task, a blue color pen is used which makes good contrast
with the predominant skin color (red with blue and green tinges) and the background which is a sheet of white paper. The use of the “red/blue” relation in the pixels leads to a good separation of the hand from the pen (Fig. 3).
3 "Hand-Pen" System Feature Extraction

The essential authentication features include hand characteristics and movement, pen position and the mutual hand-pen disposition.
Fig. 3. Separation of the pen from the hand
3.1 Hand Features
Hand color and hand geometry are biometric parameters specific to the individual and can contribute to the authentication.

Hand color. The hand color can speed up the search and increase the accuracy when a large database of individuals of different races has to be searched. To achieve this, models of the skin color of different races have to be generated. This requires a proper color space of minimal dimension to be selected, on the one hand, and the color components to be separated from the intensity, on the other hand. The YCbCr space is one of the possibilities studied in the literature that satisfies these requirements [2,5]. It is obtained from the RGB space using the non-linear transformation
Y  = 0.299 R + 0.587 G + 0.114 B
Cb = B − Y
Cr = R − Y                                                                 (3)
The first component Y represents the light intensity, i.e. it produces a gray-level image, while Cb and Cr are not influenced by the intensity change. To describe the distribution of the skin color components, a Gaussian function is usually used. Since the parameter distribution depends on the race, a Gaussian mixture model (GMM) is expected to be more adequate for an overall skin color representation. The distribution density function of the mixture is defined as
p(x | Θ) = Σ_{i=1}^{M} P(i) p(x | i, Θ)                                    (4)
where P(i) is the prior probability of the i-th race, and p(x | i, Θ) is its density function. The parameter Θ consists of mean values and covariances that have to be estimated from the experimental data. The Expectation-Maximization algorithm proved to be a suitable technique for the evaluation of Θ in the case of mixed models [1]. To demonstrate the ability of the GMM to distinguish between the races, a mixed distribution is shown in Fig. 4. For the evaluation of its parameters a database created at the Technical University in Sofia is used, including 50 images of individuals from the white, black and yellow races. It can be seen that, even though the three race-related areas overlap, it is possible to obtain a relatively good separation between them.
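As an illustration of this skin-color modelling step (our own sketch, not the authors' implementation), the code below converts RGB pixels to the Cb/Cr components of Equation (3) and fits a Gaussian mixture with scikit-learn; the three-component choice mirrors the three race-related clusters discussed above and is our assumption.

# Sketch of Cb/Cr skin-color modelling with a Gaussian mixture
# (assumes NumPy and scikit-learn are available).
import numpy as np
from sklearn.mixture import GaussianMixture

def rgb_to_cbcr(pixels):
    """pixels: (N, 3) float array of R, G, B values (Eq. 3)."""
    r, g, b = pixels[:, 0], pixels[:, 1], pixels[:, 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return np.stack([b - y, r - y], axis=1)            # (Cb, Cr)

def fit_skin_model(skin_pixels, n_components=3):
    """Fit a GMM (Eq. 4) to the chrominance of labelled skin pixels."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(rgb_to_cbcr(skin_pixels))

def skin_log_density(model, pixels):
    """Per-pixel log-density under the mixture model."""
    return model.score_samples(rgb_to_cbcr(pixels))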
Fig. 4. Projection of a mixture of Gaussian distributions of white, black and yellow skin color. The horizontal axis represents the value of Cb; the probability density is along the vertical axis.

A more sophisticated approach may use color properties of different parts of the hand, thus characterizing the individual more accurately.

Hand position and geometry. The main geometric features of the hand and its position during signing can be evaluated from its upper contour K, obtained by scanning the image column by column and taking the coordinates (x, y) of the first pixel belonging to the hand (x is the column number and y is the row number in the frame). The hand shape will be different for different persons because it depends on the size of the hand, on the way it holds the pen and on its movements while signing.

Hand position. The slope angle α of the line l: y = ax + b, which best approximates K in the sense of minimal mean-square distance, can be used as a general characteristic of the hand position. For that the line parameters a and b are evaluated by minimizing the sum:
S = Σ_{x=0}^{M−1} [ y − (ax + b) ]²                                        (5)
where (x, y) ∈ K and M is the number of contour points.

Geometric features. The geometric features are extracted using the characteristic points of the hand contour. End points and points with curvature larger than a predefined threshold are taken as "characteristic". Different techniques could be used for curvature evaluation. To simplify the calculations, we have used the following formula
c = ( PP_{−q} + PP_{+q} ) / ( P_{−q}P_{+q} ) ,                             (6)

where P denotes the current point, and P_{−q} and P_{+q} are the points q points away from P in both directions. Different geometric parameters of the obtained polygon, like perimeter, area or distances from its centre to specific points, can be measured.

Pen position. The pen position is described by the angle γ of its tilt towards the plane and the angle β of its projection in the plane. Since the pen length l is fixed, the first angle is determined by the ratio of the projection length l′ to l. To determine β and l′, we have to determine the major axis of the pen. For this its centre (Cx, Cy) is evaluated by averaging the x and y coordinates of the pixels belonging to the pen. After that the eigenvalues λ1 and λ2 of the characteristic equation
| Σ − λI | = 0 ,                                                           (7)

where Σ is a covariance matrix and I is the identity matrix, are computed. β is evaluated according to the formula

β = arctg( (cov_{11} − λ_1) / cov_{12} ) ,   (λ_1 ≥ λ_2)                   (8)

To determine the angle γ, a straight line at angle β is drawn through the center and the distance l′ between the utmost pixels of the pen lying on that line is evaluated. γ is obtained from the equation

γ = arccos( l′ / l ) .                                                     (9)
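A small sketch of this pen-orientation step is given below (our own illustration; the pen pixel mask is assumed to have been extracted already, e.g., with the red/blue rule above). Instead of the closed form (8), it takes the major eigenvector of the covariance matrix directly.

# Sketch of Eqs. (7)-(9): pen in-plane angle beta from the covariance of
# the pen pixels, and tilt gamma from the projected length l'.
import numpy as np

def pen_orientation(pen_pixels_xy, pen_length_l):
    """pen_pixels_xy: (N, 2) array of (x, y) coordinates of pen pixels."""
    centre = pen_pixels_xy.mean(axis=0)                  # (Cx, Cy)
    cov = np.cov((pen_pixels_xy - centre).T)             # 2x2 covariance Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)               # solves |Sigma - lambda I| = 0
    major = eigvecs[:, np.argmax(eigvals)]               # major-axis direction
    beta = np.arctan2(major[1], major[0])                # in-plane projection angle
    proj = (pen_pixels_xy - centre) @ major              # coordinates along the axis
    l_proj = proj.max() - proj.min()                     # projected length l'
    gamma = np.arccos(min(l_proj / pen_length_l, 1.0))   # tilt towards the plane
    return beta, gamma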
Hand-pen relative position. The mutual position of the hand and pen is described using the following parameters:
a) The difference δ = α − β.
b) The distances between the pen center and the hand contour. For this, a straight line perpendicular to the pen's longitudinal axis and passing through the center is used. The distances r1(C, H1) and r2(C, H2) between the two cross-points H1 and H2 of that line with the contour are evaluated (Fig. 5).
Fig. 5. Distances between the pen and hand contour
4 Authentication

The set of features described in Section 3 is measured for every individual whose signing has to be recognized. The mean value vectors m_i and covariance matrices S_i are evaluated and used as parameters of the squared Mahalanobis distance

R_i(x) = (x − m_i)^t S_i^{−1} (x − m_i)                                    (10)
between the current signing x and the i-th individual from the database (i = 1, 2, …, N). There are two aspects of the authentication problem: classification and verification. While the classification problem is aimed at the assignment of a class label to an unknown sample, verification is aimed at the confirmation of the claimed class label. This difference is further used to show how the verification accuracy can be improved. Formula (10) treats all the features equally. However, in real cases the contributions of the features may not be equal. It is more reasonable to accept that different features will have different classification value for different individuals. Therefore, it makes sense to try to evaluate feature weights for each class separately and use them as weight factors in the distance evaluation. This means evaluating class-dependent values of the features. In this paper we show that such a class-based evaluation of feature significance is possible and may improve the decision-making. The claim is that not only the feature values characterize the individual, but also the weights of the features are specific to him.

4.1 Feature Weighting
The accuracy of the classification is usually evaluated in terms of correct and wrong answers. This suggests using the function φ = Tp − Fp as a measure of accuracy, where Tp is the number of true-positive answers and Fp is the number of false-positive ones for a particular class. The goal is to find the maximal value of φ as a function of the feature weights t_ij for every class i and every feature j. Since there is
no analytical relationship between the weights and the classification rate, the only way to optimize φ consists in varying the weights t_ij. A good practical scheme for this is suggested by the theory of experiment planning [14]. The gradient of the surface φ is evaluated and the search proceeds in its direction until a maximal value of φ is obtained.

5 Experimental Results

To test the described approach, 14 volunteers took part in the experiments. 10 signatures were acquired from each of them on different days and at different times of the day. In this study the following 8 features have been measured and used for the evaluation of m_i and S_i: 1) signature length d as a number of frames, 2) hand slope α, 3) pen projection angle β, 4) pen slope γ, 5) difference δ = α − β, 6) ratio r1/r2 of the distances between the pen centre and hand contour, 7) perimeter p of the polygon defined by the characteristic points of the upper hand contour, and 8) area of the polygon. The feature "skin colour" was not used because all the volunteers were of the same race.

5.1 Verification
Using the Matlab random number generator, 1000 signatures were simulated for every individual. Assuming independent features, the Mahalanobis distance was evaluated according to the formula

R_{ik} = Σ_{j=1}^{8} t_{ij} (f_{kj} − m_{ij})² / σ_{ij}²                   (11)
Moving in the gradient direction, the best values of t_ij in terms of maximal classification rate were determined. The verification results are shown in Table 1. Some individuals (Nos 3, 8, 9, 11 and 14) are not included in the table because an absolute score of 100% was achieved for them. The first three rows include the values of Tp, Fp and φ in the case of weights equal to 1. For the next rows the scheme described above was applied. It is seen that there is an improvement in classifier performance for all the individuals except No 10, where no effect is observed. The most significant improvement is obtained for Nos 2 and 7. Thus, an average accuracy φ = 99.9% was achieved after the modification of the weights, compared to the initial accuracy of 99.6%. The highest rate of false positives, Fp = 0.4%, was observed for individual No 1. The best reduction in Fp was obtained for No 2, where the accuracy increased from 99.2% to 100%. It was interesting to see how the evaluated weights would affect the results if applied to a classification problem. Using again 1000 simulations per class and the evaluated weight factors for the classes, no significant changes have been observed. Almost the same average accuracy was achieved, resulting in a 0.02% increase in Tp and a 0.04% decrease in Fp. Therefore, for the classification problem there may be no need to evaluate class-dependent feature weight factors.
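The weighted per-class distance of Equation (11) and the accuracy measure φ used to tune the weights can be sketched as follows (our own illustration; the data layout and the simple nearest-model classifier are assumptions, not the authors' experiment-planning procedure).

# Sketch of the weighted per-class distance (Eq. 11) and phi = Tp - Fp.
import numpy as np

def weighted_distance(x, mean, var, t):
    """Diagonal weighted Mahalanobis-style distance (Eq. 11)."""
    return float(np.sum(t * (x - mean) ** 2 / var))

def classify(x, means, variances, weights):
    """Assign x to the class with the smallest weighted distance."""
    return int(np.argmin([weighted_distance(x, m, v, t)
                          for m, v, t in zip(means, variances, weights)]))

def phi(i, samples, labels, means, variances, weights):
    """Tp - Fp for class i under the current weights."""
    preds = [classify(x, means, variances, weights) for x in samples]
    tp = sum(p == i and y == i for p, y in zip(preds, labels))
    fp = sum(p == i and y != i for p, y in zip(preds, labels))
    return tp - fp

The weights t_ij of one class can then be varied in the direction of the (numerically estimated) gradient of phi, as described above.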
Table 1. Experimental results

weights          score     1     2     4     5     6     7    10    12    13
initial values   Tp      999   999   997   996   994   990   998  1000   999
                 Fp        6    17     0     1     2     0     1     1     0
                 φi      993   982   997   995   992   990   997   999   999
new values       Tp      999  1000   998   999   998   999   998  1000  1000
                 Fp        4     1     0     0     0     0     1     0     0
                 φf      995   999   998   999   998   999   997  1000  1000
                 φf − φi   2    17     1     4     6     9     0     1     1
6 Conclusion

A new signature-based authentication approach that takes into account the dynamics of the complex "hand-pen" is suggested. It allows evaluating specific parameters of the hand and pen and their mutual position. While the acquired source information allows for the extraction of many features of different nature, only a small number of global features and a simple classification rule have been used in this investigation. Nevertheless, the tentative results obtained, aimed at illustrating the approach, are quite satisfactory. The assumption that the use of class-specific feature weights will improve the overall accuracy in the case of verification has proven to be correct. The intuitive explanation of this is that if a sample does not belong to a particular class, its features will not be properly weighted when the distance to that class is measured and, as a result, a larger distance value will be obtained. Future work will be aimed at a thorough analysis of the parameter dynamics. Also, the acquisition of more experimental data will be of primary importance. A possible extension of the approach may include different alternatives like using a tablet instead of paper, or capturing the signature after signing and processing it in off-line mode. Thus, more accurate dynamic and/or static information will be obtained.

Acknowledgments. The investigation was supported by the BioSecure Network of Excellence of the 6th European Framework Program, contract No 507634, and the Ministry of Education and Sciences in Bulgaria, contract No 1302/2003.
References

1. Bouguila, N., Ziou, D.: Dirichlet-based probability model applied to human skin detection. In: Proc. Int. Conf. ASSP'04, vol. V, pp. 521–524 (2004)
2. Boumbarov, O., Vassileva, D., Muratovski, K.: Face extraction using 2D color histograms. In: XL Int. Scientific Conf. on Information, Communication and Energy Systems and Technologies (ICEST'2005), Nis, Serbia and Montenegro, vol. 1, pp. 334–337 (2005)
3. Fang, B., Leung, C.H., Tang, Y.Y., Kwok, P.C.K., Tse, K.W., Wong, Y.K.: Off-line signature verification with generated training samples. IEE Proc. Vis. Image Signal Process 149(2), 85–90 (2002)
4. Fink, G., Wienecke, M., Sagerer, G.: Video-Based On-Line Handwriting Recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, pp. 226–230. IEEE, Los Alamitos (2001)
5. Fu, Z., Yang, J., Hu, W., Tan, T.: Mixture clustering using multidimensional histograms for skin detection. In: Proc. of the 17th Int. Conf. on Pattern Recognition (ICPR'04), pp. 549–552 (2004)
6. Hairong, L., Wenyuan, W., Chong, W., Quing, Z.: Off-line Chinese signature verification based on support vector machines. Pattern Recognition Letters 26, 2390–2399 (2005)
7. Ismail, M.A., Gad, S.: Off-line arabic signature recognition and verification. Pattern Recognition 33, 1727–1740 (2000)
8. Jain, A.K., Griess, F.D., Connell, S.D.: On-Line Signature Verification. Pattern Recognition 35(12), 2963–2972 (2002)
9. Ka, H., Yan, H.: Off-line signature verification based on geometric feature extraction and neural network classification. Pattern Recognition 30(1), 9–17 (1996)
10. Kai, H., Yan, H.: Off-line signature verification using structural feature correspondence. Pattern Recognition 35, 2467–2477 (2002)
11. Kuckuk, W., Rieger, B., Steinke, K.: Automatic Writer Recognition. In: Proc. of Carnahan Conf. on Crime Countermeasures, Kentucky, pp. 57–64 (1979)
12. Lantzman, R.: Cybernetics and forensic handwriting investigation. Nauka, Moscow (1968)
13. Munich, M.E., Perona, P.: Camera-Based ID Verification by Signature Tracking. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 782–796. Springer, Heidelberg (1998)
14. Nalimov, V.V., Chernova, N.A.: Statistical Methods in Experiment Planning. Nauka, Moscow (1965)
15. Savov, M., Gluhchev, G.: Automated Signature Detection from Hand Movement. In: Proc. of CompSysTech'04, Rousse, Bulgaria, pp. III.A.3-1–III.A.3-6 (2004)
Biometric Fuzzy Extractors Made Practical: A Proposal Based on FingerCodes Valérie Viet Triem Tong1 , Hervé Sibert2 , Jérémy Lecœur3 , and Marc Girault4 1
Supelec, Campus de Rennes, Avenue de la Boulaie F-35576 Cesson-Sévigné Cedex, France 2 NXP Semiconductors, 9 rue Maurice Trintignant, F-72081 Le Mans Cedex 9, France 3 Irisa-Inria, Campus de Beaulieu, F-35042 Rennes Cedex 4 France Telecom Research and Development, 42 rue des Coutures, BP6243, F-14066 Caen Cedex 4, France
Abstract. Recent techniques based on error-correction enable the derivation of a secret key from the (varying) measured biometric data. Such techniques are opening the way towards broader uses of biometrics for security, beyond identification. In this paper, we propose a method based on fingerprints to associate, and further retrieve, a committed value which can be used as a secret for security applications. Unlike previous work, this method uses a stable and ordered representation of biometric data, which makes it of practical use.
1 Introduction
Biometric authentication refers to verifying physiological or behavioral features such as voiceprint, fingerprint or iris scan [9,2]. From personal device unlock to e-passports, biometry-based authentication schemes have spread into our lives. Such schemes usually run as follows: the user presents his current biometric data, and the verifier checks if the measured data match the biometric reference data acquired during the enrollment. Biometric data have several properties which make them natural candidates for security applications: they are hard to forge, they are unique to each person, and they are a good source of entropy. However, they also have several drawbacks which slow down their adoption: one cannot change his biometric data, biometric data are often easy to steal, and acquisition of biometric data is subject to variations. As the measured data can vary, they cannot be directly used as a password or as a cryptographic secret. Moreover, they have to be sent across a possibly insecure network for matching against reference data stored in a database, whose loss would be disastrous. Several research directions address these concerns. First, recent schemes avoid the storage of the whole reference data. Second, new techniques based on error-correction enable the derivation of a constant secret key
from the (varying) measured biometric data. Such techniques can also naturally address the biometric data storage problem. Still, stolen biometric data are stolen for life. Unlike a cryptographic key, they cannot be updated or destroyed. To solve this last problem, it remains to be able to use biometric data to retrieve a secret value linked to each user, instead of using them directly as a secret. Few such proposals have been made, but they are not practical. In this paper, we propose a practical method based on fingerprints to associate, and further retrieve, a committed value which can be used as a secret for security applications.

* Most of the research work that this article stems from was carried out by the authors at France Telecom Research and Development in Caen (France).

1.1 Related Works
The fact that storing biometric reference data should be avoided in a distant authentication system is commonly accepted, as mentioned by Jain et al. in [16]. In an attempt to do so [8], Jain and Uludag propose to use watermarking to secure the transmission of biometric data across a network. However, they still use a biometric database. In [11], Juels and Wattenberg present the first crypto-biometric system, called fuzzy commitment, in which a cryptographic key is decommitted using biometric data. Fuzziness means that a value close to the original (under some suitable metric) is sufficient to extract the committed value. The scheme is based on the use of error-correcting codes, and runs as follows. Let C ⊂ {0,1}^n be a set of codewords for a suitable error-correcting code. The user chooses a secret codeword c ∈ C, the decommitment key is the enrolled fingerprint x, and the commitment is the pair (c − x, H(c)), where H is a one-way function. When a user tries to decommit a pair (c − x, H(c)) using a decommitment key x′, he attempts to decode c − x + x′ to the closest codeword c′. Decommitment is successful if H(c′) = H(c), which means the user has retrieved his secret c. In this scheme, the key x is built from a fingerprint, as a set of minutiae positions. This yields two shortcomings. First, it does not allow modifications of x, such as re-ordering and addition/deletion of an element in x, although such modifications are frequent in real life. Second, the security proof of this scheme holds only if x is uniformly distributed, which is not the case in reality. In order to overcome these drawbacks, Juels and Sudan propose a fuzzy vault scheme [10]. This scheme may be thought of as an order-invariant version of the fuzzy commitment scheme, obtained by using a generalization of Reed-Solomon codes in which a codeword is thought of as an evaluation of a polynomial over a set of points. The idea is to encode a secret s as a polynomial p of degree d using the Reed-Solomon encoding scheme. The codeword consists of a set of pairs R1 = {(x_i, p(x_i))}_{1≤i≤n}, where the x-coordinates represent the positions of minutiae in the reference fingerprint. A set R2 of chaff points that are not on p is added to R1 to form the vault V. To rebuild the codeword using the Reed-Solomon decoding scheme, the user needs d+1 pairs (x, y) ∈ R1 from the original n points. The method has been successfully tested by Uludag and Jain with the IBM-GTDB database [15]. However, the fuzzy vault has one main drawback, which is its restricted use. Indeed, if the polynomial p is compromised, then the
fingerprint itself is, as all the other minutiae in the fingerprint are the points on p in the vault V. Thus, if this method were used for different applications, with different vaults each time, as the x-coordinates correspond to the minutiae, disclosure of different vaults for the same user would reveal his minutiae. Next, Clancy et al. considered a practical implementation of the fuzzy vault in a secure smartcard [4]. Their experimentation showed that a sufficient security level for the fuzzy vault cannot be obtained with real-life parameters. They thus defined a modified scheme called the fingerprint vault, and proposed a way to find the optimal vault parameters. However, the choice of chaff points also yields uniformity problems that can be exploited by an attacker to distinguish between chaff points and real points in the vault. Chang and Li have analyzed this problem [3] in a general setting. They show that, since secret points are not uniformly distributed, the proper way of choosing chaff points is far from trivial, and there is no known non-trivial bound on the entropy loss.

1.2 Contents of the Paper
In this paper, we propose a new method to associate, and further retrieve, a committed value using a fingerprint. Our method is based on a special fingerprint representation called the FingerCode [14]. The committed value is rebuilt from the FingerCode biometric data and public data we call the FingerKey. The committed value cannot be recovered from the public data, and we show that no illegitimate user is able to forge it. Moreover, using the FingerCode, we avoid the concerns related to minutiae set modifications. Finally, our method does not require storage of any biometric reference, and therefore, when the committed value is used as a (seed for generation of a) secret key in a secure cryptographic protocol, an attacker cannot learn the biometric data. The paper is organized as follows: after the introduction, we present the tools we use and, more particularly, error-correcting codes, the FingerCode definition and its extraction method in Section 2. In Section 3, we describe our construction and give our experimental results. In Section 4, we give some advice on possible applications and security parameters. Finally, we give directions for future work and we conclude.
2 Preliminaries

2.1 Error-Correcting Codes
Like the fuzzy schemes of Juels et al., our scheme relies on using error-correcting codes. We refer the reader to [12,13] for more details on error-correcting codes. The goal of error-correcting codes is to prevent loss of information during the transmission of a message over a noisy channel by adding redundancy to the original message m to form a message m′, which is transmitted instead of m. If some bits are corrupted during the transmission, it remains possible for the
receiver to reconstruct m′, and therefore m. In our scheme, the noise is the result of the use of different measures of the fingerprint. More formally, an error-correcting code over a message space M = {0,1}^k consists of a set of codewords C ⊂ {0,1}^n (n > k) associated with a coding function denoted encode and a decoding function denoted decode, such that encode: M → C injects messages into the set of codewords, and decode: {0,1}^n → C ∪ {∅} maps an n-bit string to the nearest codeword if the string does not have too many errors, and otherwise outputs ∅. We recall that the Hamming distance between binary strings is the number of bit locations where they differ. The distance Δ of a code is the minimum Hamming distance between its codewords. Therefore, to be properly decoded, a string must have at most (Δ−1)/2 errors. In the sequel, we have chosen to use Reed-Solomon codes for their correction power.

2.2 FingerCode
Our purpose is to retrieve constant secret data from varying measures of biometric data. The measure should be reasonably stable, which means slight modifications of the acquisition should result in a low distance from the reference. Therefore, we cannot use a minutiae-based matching method. We propose to use a texture-based method, called FingerCode, which is stable in size and founded on the localization of the morphological centre of the fingertip and the use of a well-known pattern analysis method to characterize a fingerprint. It was introduced by Jain and Prabhakar in [14,6]. A FingerCode is a 640-component vector of numbers that range from 0 to 7, which is ordered and stable in size. The matching is then handled by a simple Euclidean distance. Here is a summary of this method. Using the Bazen and Guerez method [1], an estimation of the block orientation field is computed. Using this orientation, a very simple curvature estimator can be designed for each pixel. The maximum value of this estimator is the morphological center that we are searching for. We extract a circular region of interest around this point in which we consider 5 concentric bands and 16 angular portions, making a total of 80 sectors. The width of the band is related to the image resolution, e.g. for a 500-dpi image, the band is 20 pixels wide. Although we avoid the problem of translation and rotation of the image by using the morphological center and a circular area, we still have to handle the finger pressure differences. Hence, we normalize the image to given values of mean and variance. If normalization is performed on the entire image at once, then the intensity variations due to the finger pressure difference remain. Therefore, it is necessary to normalize each sector of the image separately, in order to obtain an image where the average intensity at a local scale is about the same as that of the whole image. Ridges and valleys can be characterized by their local frequency and orientation. So, by using a properly tuned Gabor filter, we can remove noise and capture this piece of information. An even-symmetric Gabor filter has the following general form in the spatial domain:
G(x, y, f, θ) = exp{ −(1/2) [ x′²/δ_x² + y′²/δ_y² ] } × cos(2π f x′)
x′ = x sin θ + y cos θ
y′ = x cos θ − y sin θ

where f is the frequency of the sinusoidal plane wave along direction θ from the x-axis, and δ_x and δ_y are the space constants of the Gaussian envelope along the corresponding axes. The filtering in the spatial domain is performed with a 19 × 19 mask in 8 directions from 0° to 157.5°, resulting in 8 filtered images of the region of interest. Finally, we compute the FingerCode as the average absolute deviation from the mean (AAD) of each sector of each image. The feature vector is organized as follows: values from 0 to 79 correspond to the 0° filter, the 80 following values to the 22.5° filter, and so on. Of these 80 values, the first 16 are the innermost band values, the first one being the top sector and the other values corresponding to the sectors counted clockwise. The matching is based on the Euclidean distance between FingerCodes. We also use cyclic rotation of the FingerCode, up to 2 steps, to handle rotations of up to 45°.
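A compact sketch of such an even-symmetric Gabor mask and of the per-sector AAD features is given below (our own illustration; the frequency and space-constant values are placeholders, not those tuned by the authors).

# Sketch of an even-symmetric Gabor mask and AAD sector features.
import numpy as np

def gabor_mask(size=19, f=0.1, theta=0.0, dx=4.0, dy=4.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xp = x * np.sin(theta) + y * np.cos(theta)
    yp = x * np.cos(theta) - y * np.sin(theta)
    return np.exp(-0.5 * (xp**2 / dx**2 + yp**2 / dy**2)) * np.cos(2 * np.pi * f * xp)

def aad(values):
    """Average absolute deviation from the mean of one sector."""
    return np.mean(np.abs(values - values.mean()))

def fingercode(filtered_images, sector_masks):
    """filtered_images: 8 Gabor-filtered ROIs; sector_masks: 80 boolean masks.
    Returns the 8 x 80 = 640-component feature vector."""
    return np.array([aad(img[m]) for img in filtered_images for m in sector_masks])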
3 Our Construction
The idea behind our proposal is to use the FingerCode, which offers ordered and stable biometric data, in the fuzzy commitment scheme [11], in order to avoid minutiae-related drawbacks. However, the FingerCode has too high an FRR to be used directly as the decommitment key: error-correction would never be efficient enough to recognize a legitimate user. Therefore, our scheme slightly differs from the fuzzy commitment. It consists of a fuzzy extractor and of a secure sketch according to the definitions of Dodis et al. [5]. In our construction, the secret is represented as a word of d + 1 letters, which correspond to the d + 1 integer coefficients of a polynomial p ∈ Z[X] of degree d. The registration step consists in extracting public data called the FingerKey from the pair (F, p), where F is a FingerCode. To this end, n points of p are randomly chosen, with n > d, and we hide these points using n stable subparts of the FingerCode, as in the fuzzy commitment scheme. To retrieve the secret polynomial p, we use the usual decoding procedure for each point. If at least d + 1 points can be decommitted, then p is obtained by Lagrange interpolation.

3.1 Encoding
The encoding part, or fuzzy extractor as defined in [5], is a set of fuzzy commitments where the committed values are points on a secret polynomial. First, the system chooses a secret polynomial p of degree d (for instance, using a bijection between a secret key chosen by user U and the polynomial space), and n random points p_0, …, p_{n−1} on this polynomial (n > d). These points are represented by l-bit strings, and encoded with an encoding function RS-encode which outputs n codewords c_0, …, c_{n−1} ∈ {0,1}^{l+e}. In our experimentation,
RS-encode is the encoding function for Reed-Solomon codes. The FingerCode F_U of the user is then divided into n parts F_U = f_0 || … || f_{n−1}. The system then computes {δ_i = c_i − f_i, 0 ≤ i ≤ n−1}. The FingerKey KF_U for user U consists of the set of pairs (δ_i, H(c_i)), where H is a one-way function and 0 ≤ i ≤ n−1. This algorithm is described in Figure 1. The FingerKey is then made public for future reference, and may be stored in a server or in a user's smartcard.

Inputs:
  polynomial p of degree d
  integer n with n > d
  family (p_i)_{0≤i≤n−1} of randomly chosen points of p
  FingerCode F_U = f_0 || … || f_{n−1}
Output:
  FingerKey KF_U
Algorithm:
  KF_U ← ∅
  For i going from 0 to n − 1 do
    c_i ← RS-encode(x_i, p(x_i))
    δ_i ← c_i − f_i
    KF_U ← KF_U || (δ_i, H(c_i))
  Return KF_U

Fig. 1. Encoding algorithm: FingerKey extractor for user U
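A toy Python sketch of this encoding step is given below (our own illustration, not the authors' code): the Reed-Solomon encoder is replaced by a placeholder that packs a point into an integer, SHA-256 is used for H, and FingerCode parts are represented as plain integers. All of these choices are assumptions made for readability.

# Toy sketch of the FingerKey extraction (Fig. 1). rs_encode is a
# placeholder for a real Reed-Solomon encoder; values are plain integers.
import hashlib
import random

def H(value):
    return hashlib.sha256(str(value).encode()).hexdigest()

def rs_encode(point):
    # Placeholder: a real implementation would map (x, p(x)) to a codeword.
    x, y = point
    return (x << 32) | y

def make_fingerkey(poly_coeffs, fingercode_parts, n):
    d = len(poly_coeffs) - 1
    assert n > d and len(fingercode_parts) == n
    xs = random.sample(range(1, 10 * n), n)
    points = [(x, sum(c * x**j for j, c in enumerate(poly_coeffs))) for x in xs]
    fingerkey = []
    for (x, y), f in zip(points, fingercode_parts):
        c = rs_encode((x, y))
        fingerkey.append((c - f, H(c)))
    return fingerkey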
3.2 Decoding
In order to retrieve his secret, user U has his FingerCode F′_U = f′_0 || … || f′_{n−1} measured by the system. The system looks up the FingerKey KF_U for U, and uses the RS-decode function to compute, for each value δ_i + f′_i, the closest codeword c′_i. In our experimentation, as we use a Reed-Solomon code for encoding, we use the Reed-Solomon decoding function RS-decode. If the equality H(c′_i) = H(c_i) holds, then the decommitment of the codeword is successful. If at least d + 1 values f′_i (0 ≤ i ≤ n−1) yield a successful decommitment, the user is able to rebuild p using Lagrange interpolation. The user is thus authenticated as U if he succeeds in decommitting at least d + 1 points and can thus rebuild the secret polynomial p. The system can then rebuild the user-chosen secret. The algorithm is given in detail in Figure 2.

3.3 Experimentation
We have experimented with this method on a fingerprint database containing 1000 pictures. In a first attempt, we used FingerCode values rounded to the nearest integer. It turned out that the FingerCode differed too much from the original value to be corrected by error-correction. We then changed our rounding method as follows: values from 0 to 2 were replaced by 0, those from 2 to 4 were replaced by 1, and the others (from 4 to 7) were replaced by 2. Thus, we used transformed FingerCodes whose vector component values were 0, 1 or 2.
Inputs:
  degree d of the committed polynomial
  integer n with n > d
  user FingerCode F′_U = f′_0 || … || f′_{n−1}
  FingerKey for user U: KF_U = (δ_0, H(c_0)) || … || (δ_{n−1}, H(c_{n−1}))
Output:
  polynomial p, or Failure
Algorithm:
  P ← ∅
  For i going from 0 to n − 1 do
    c′_i ← RS-decode(δ_i + f′_i)
    If H(c′_i) = H(c_i) then P ← P ∪ {c′_i}
  If |P| > d, p ← Lagrange interpolation of the elements in P; return p
  Otherwise return Failure

Fig. 2. Decoding algorithm: Secure sketch for a user pretending to be U
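The corresponding decoding step can be sketched as follows (again our own toy illustration, matching the encoder sketch above; the trivial rs_decode and the exact rational Lagrange interpolation are simplifying assumptions, not the authors' implementation).

# Toy sketch of the decoding step (Fig. 2), matching the encoder sketch.
import hashlib
from fractions import Fraction

def H(value):
    return hashlib.sha256(str(value).encode()).hexdigest()

def rs_decode(value):
    # Placeholder: a real system would decode to the nearest RS codeword.
    return int(value)

def lagrange(points):
    """Return the coefficients of the polynomial interpolating the points."""
    coeffs = [Fraction(0)] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            denom *= xi - xj
            basis = [Fraction(0)] + basis          # multiply basis by x
            for k in range(len(basis) - 1):
                basis[k] -= Fraction(xj) * basis[k + 1]
        for k, b in enumerate(basis):
            coeffs[k] += yi * b / denom
    return coeffs

def recover_secret(fingerkey, fingercode_parts, d):
    points = []
    for (delta, h), f in zip(fingerkey, fingercode_parts):
        c = rs_decode(delta + f)
        if H(c) == h:                              # successful decommitment
            points.append((c >> 32, c & 0xFFFFFFFF))
    if len(points) > d:
        return lagrange(points[:d + 1])
    return None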
This yields 3^640 possible values for an entire FingerCode. We divided each FingerCode into sixteen parts, used the Reed-Solomon code RS(2^5, 10) defined by its generator polynomial X^6 + X^5 + 1, and chose secrets to be polynomials of degree 8. Then, no FingerCode coming from a fingerprint of another user could lead to a successful decommitment. For a FingerCode coming from another fingerprint of the legitimate user, the decommitment success rate was more than 60%. As our system is more flexible than the standard procedure and as it allows the regeneration of a user-chosen secret with public data (the FingerKey), we might have expected worse results than standard matching. However, using the standard FingerCode matching on the same picture database, we obtained a FRR of 78% for a FAR of 0.1%. Therefore, it appears that splitting the FingerCode into several parts and using our construction allows for better results than the standard FingerCode matching. In the literature, better results are given for usual FingerCode matching: Prabhakar obtains a 19.32% FRR for a 0.1% FAR. Therefore, it is likely that our results can also be improved, and, using FingerCode enhancements (picture optimization...), we can reasonably expect results for our construction similar to those obtained for standard FingerCode matching.
4 Applications and Security

4.1 Security
The scheme we have presented enables users to retrieve a committed secret, without biometric data storage. Just like other biometric systems, our system should be used in a safe manner. Indeed, measuring biometric data and then using these data to impersonate the legitimate user is still possible. Retrieving a user’s secret is always possible given his FingerKey and a measure of his
FingerCode. Nevertheless, the concealment property ensures that retrieving the secret is unfeasible given the FingerKey only. More formally, the security of the scheme relies on three properties:

1. an attacker who knows only a published FingerKey cannot retrieve the biometric data from which it was computed,
2. (concealment) an attacker who knows only a published FingerKey cannot retrieve the corresponding secret polynomial p,
3. an attacker who knows both a published FingerKey and the polynomial p cannot retrieve the corresponding biometric data.

First, let us notice that retrieving the secret polynomial p from the biometric data (FingerCode) associated with a FingerKey is straightforward. Property 1 arises from the security of the fuzzy commitment scheme [11]. We will show how it extends to Property 3. In the following Theorem, we show that the computational complexity of retrieving p is at least equal to the computational complexity of the inversion of the hash function used. Therefore, Property 2 holds as long as inverting the hash function is not computationally feasible.

Theorem 1. Suppose that an attacker knows the value KF of a FingerKey, and that he has access neither to the biometric data of the user corresponding to KF, nor to a view of an execution of the scheme. Consider the following parameters of our system: ℓ is the length of the FingerCode subcomponents f_i, n is the number of FingerCode subcomponents, Δ is the distance of the error-correcting code used, L is the output length of the one-way function H, d is the degree of the secret polynomial p, and C(H) the complexity of inverting H for a random input. Given the FingerKey KF, the complexity of retrieving p is equal to

min( 2^L , (d + 1) · C(H) , (d + 1) · 2^ℓ / ((Δ−1)/2) ).
Sketch of the proof. We suppose p is chosen in a set whose cardinality is beyond exhaustive search. In order to retrieve p, the attacker has to find d + 1 points on p. As he knows only the FingerKey KF, this means he has to find d + 1 elements in {c_i ∈ C}_{1≤i≤n}. As p is independent from the FingerCode, the values δ_i do not reveal any information about the c_i's. Therefore, the attacker has either to invert H, or to find at least d + 1 values f such that H(RS-decode(δ_i + f)) = H(c_i) for different i's. In order to find some f such that the distance between c_i and δ_i + f is at most (Δ−1)/2, the attacker has to try an average of 2^ℓ / ((Δ−1)/2) values, and to test each value by computing H(RS-decode(δ_i + f)) for a given i. Trying several i's simultaneously would just multiply the complexity by the number of i's targeted, thus it does not reduce the global complexity of the attack, which is thus (d + 1) · 2^ℓ / ((Δ−1)/2). If the attacker chooses to invert H, there should be no algorithm better than brute force for a proper choice of H, so the complexity of inversion of H would
be 2^L. In the case when there exists an algorithm better than exhaustive search to invert H, the complexity of the attack is at most (d + 1) · C(H).
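As a quick numerical illustration of the bound in Theorem 1, the snippet below evaluates the three terms directly; the parameter values are our own placeholders, not those of the experiments reported above.

# Illustrative evaluation of the complexity bound of Theorem 1
# with placeholder parameters (not the values used in the experiments).
L = 160          # hash output length (assumed)
d = 8            # degree of the secret polynomial
ell = 40         # bit-length of one FingerCode subcomponent (assumed)
delta = 11       # distance of the error-correcting code (assumed)
C_H = 2 ** L     # assume no shortcut exists for inverting H

bound = min(2 ** L, (d + 1) * C_H, (d + 1) * 2 ** ell // ((delta - 1) // 2))
print(f"attack complexity ~ 2^{bound.bit_length() - 1}")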
The idea behind the proof is that, as all the values c_i − f_i are given, and as knowing the c_i's yields p, the gap between Properties 1 and 2 lies in the given values H(c_i). At last, Property 3 holds thanks to the fact that the choice of the points on the polynomial is random. Hence, providing p does not give information on the c_i's, the knowledge of which is equivalent to knowing the biometric data. Any implementation of the system should ensure that the FingerCode is not stored, and the user should only present his fingerprint to a trusted device. If this is not the case, then the system could still be secure if the FingerKeys of the user are stored in a user-owned device. For instance, a FingerKey may be stored in a smartcard, which would, on input of the FingerCode value, regenerate the user's secret and perform cryptographic operations using this secret internally.

4.2 Applications
Our system addresses two usual drawbacks of biometrics: it enables the regeneration of constant data, and it also allows these data to be changed in case of loss or theft. This system enables applications beyond the reach of usual biometric identification. For instance, the regenerated value can be used as the seed for obtaining a constant RSA keypair (or any other cryptographic key), used for applications such as digital signature, or encryption/decryption. Moreover, a user can choose different secrets for different applications, and also change any secret that would be compromised. Therefore, this system also offers scalability, and decreases the risks linked to the theft of private data.
5 Conclusion
We have presented a system in which no biometric data has to be stored, and the current biometric data is used to decommit a secret value. This enables the regeneration of several secret values used for various applications for each user. By reducing the need for transmission of biometric data, our system also reduces the risks of biometric data theft. This proposal is the first to combine error-correcting codes with stable and ordered biometric templates. Our experiments, based on the FingerCode, provide encouraging results. We now wish to improve our results by finding more suitable biometric measures. Hybrid methods that mix minutiae and texture features, as introduced in [7], provide more stable biometric measures than the one we have used. One promising research direction is thus to adapt such methods to our construction, in order to improve our experimental results.
References

1. Bazen, A.M., Guerez, H.: Directional field computation for fingerprints based on the principal component analysis of local gradients. In: Proc. of 11th Annual Workshop on Circuits, Systems and Signal Processing (2000)
2. Bolle, R., Pankanti, S.: Biometrics, Personal Identification in Networked Society. Kluwer Academic Publishers, Norwell, MA, USA (1998)
3. Chang, E.-C., Li, Q.: Hiding secret points amidst chaff. In: Vaudenay, S. (ed.) EUROCRYPT 2006. LNCS, vol. 4004, pp. 59–72. Springer, Heidelberg (2006)
4. Clancy, T.C., Kiyavash, N., Lin, D.J.: Secure smartcard-based fingerprint authentication. In: WBMA '03: Proceedings of the 2003 ACM SIGMM Workshop on Biometrics Methods and Applications, pp. 45–52. ACM Press, New York (2003)
5. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. In: Cachin, C., Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004)
6. Jain, A., Prabhakar, S., Hong, L., Pankanti, S.: Filterbank-based fingerprint matching. In: IEEE Transactions on Image Processing, vol. 9(5), pp. 846–859. IEEE Computer Society Press, Los Alamitos (2000)
7. Jain, A., Ross, A., Prabhakar, S.: Fingerprint matching using minutiae and texture features. In: Proc. of International Conference on Image Processing (ICIP), Thessaloniki, Greece, pp. 282–285 (October 2001)
8. Jain, A., Uludag, U.: Hiding biometric data. IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
9. Jain, A.K., Maltoni, D.: Handbook of Fingerprint Recognition. Springer, New York (2003)
10. Juels, A., Sudan, M.: A fuzzy vault scheme. In: Proceedings of the IEEE International Symposium on Information Theory, 2002. IEEE Computer Society Press, Los Alamitos (2002)
11. Juels, A., Wattenberg, M.: A fuzzy commitment scheme. In: CCS '99: Proceedings of the 6th ACM Conference on Computer and Communications Security, pp. 28–36. ACM Press, New York (1999)
12. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes. North-Holland (1977)
13. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes, Part II. North-Holland (1977)
14. Prabhakar, S.: Fingerprint Classification and Matching Using a Filterbank. PhD thesis
15. Uludag, U., Jain, A.K.: Fuzzy fingerprint vault. In: Proc. Workshop: Biometrics: Challenges Arising from Theory to Practice, pp. 13–16 (2004)
16. Uludag, U., Pankanti, S., Prabhakar, S., Jain, A.: Biometric cryptosystems: Issues and challenges (2004)
On the Use of Log-Likelihood Ratio Based Model-Specific Score Normalisation in Biometric Authentication Norman Poh and Josef Kittler CVSSP, University of Surrey, Guildford, GU2 7XH, Surrey, UK
[email protected],
[email protected] Abstract. It has been shown that the authentication performance of a biometric system is dependent on the models/templates specific to a user. As a result, some users may be more easily recognised or impersonated than others. We propose a model-specific (or user-specific) likelihood based score normalisation procedure that can reduce this dependency. While in its original form, such an approach is not feasible due to the paucity of data, especially of the genuine users, we stabilise the estimates of local model parameters with help of the user-independent (hence global) parameters. The proposed approach is shown to perform better than the existing known score normalisation procedures, e.g., the Z-, F- and EER-norms, in the majority of experiments carried out on the XM2VTS database . While these existing procedures are linear functions, the proposed likelihood based approach is quadratic but its complexity is further limited by a set of constraints balancing the contributions of the local and the global parameters, which are crucial to guarantee good generalisation performance.
1 Introduction An automatic biometric authentication system works by first building a model or template for each user. During the operational phase, the system compares a scanned biometric sample with the registered model to decide whether an identity claim is authentic or fake. Typically, the underlying class-conditional probability distributions of scores have a strong user dependent component, modulated by within model variations. This component determines how easy or difficult it is to recognise an individual and how successfully he or she can be impersonated. The practical implication of this is that some user models (and consequently the users they represent) are systematically better (or worse) in authentication performance than others. The essence of these different situations has been popularized by the so called Doddington’s zoo, with each of them characterized by a different animal name such as lamb, sheep, wolf or goat [1]. A sheep is a person who can be easily recognized; a goat is a person who is particularly difficult to be recognized; a lamb is a person who is easy to imitate; and a wolf is a person who is particularly successful at imitating others. In the literature, there are two ways to exploit the Doddington’zoo effect to improve the system performance by using model-specific threshold and by model-specific score normalisation. The term client-specific is more commonly used than model-specific. S.-W. Lee and S.Z. Li (Eds.): ICB 2007, LNCS 4642, pp. 614–624, 2007. c Springer-Verlag Berlin Heidelberg 2007
On the Use of Log-Likelihood Ratio
615
However, we prefer the latter because the source of variability is the model and not the user (or client). For instance, if one constructs two biometric models to represent the same person, these two models may exhibit different performance. In model-specific thresholding, one employs a different decision threshold for each user, e.g. [2,3,4,5]. The model-specific threshold can be a function of a global decision threshold [6,7,8]. In model-specific score normalisation, one uses a one-to-one mapping function such that after this process, only a global threshold is needed. Examples of existing methods are Z-, D- (for Distance), T- (for Test), EER- (for Equal Error Rate) and more recently, F-Norms (for F-ratio). According to [9,10], Z-Norm [10] is impostor-centric, i.e, normalisation is carried out with respect to the impostor distributions calculated “offline” by using additional data. T-Norm [10] is also impostor-centric and its normalisation is a function of a given utterance calculated “online” by using additional cohort impostor models. D-Norm [11] is neither client- nor impostor-centric; it is specific to the Gaussian mixture model (GMM) architecture and is based on Kullback-Leibler distance between two GMM models. EER-norm [9] is client-impostor centric. In [5], a clientcentric version of Z-Norm was proposed. However, this technique requires as many as five client accesses. As a consequence of promoting user-friendliness, one does not have many client-specific biometric samples. F-norm [12] is client-impostor centric; it is designed to cope with learning using as few as one sample per client (apart from those used to build the model). In this paper, we propose a model-specific log-likelihood ratio (MS-LLR) based model-specific score normalisation procedure. While the existing Z-, D- and T-norms are linear functions, the proposed MS-LLR procedure is quadratic. Note that directly estimating the model-specific class-conditional score distributions is difficult because the number of samples available for each user is often very small. As a result, the estimated parameters of the distributions are very unreliable and this leads to unsatisfactory generalisation. We overcome this problem by adapting the model-specific (hence local) parameters from the model-independent (hence global) parameters. An important assumption in MS-LLR is that the class conditional score distributions are Gaussian. When this assumption is likely to be violated, we first transform the scores to exhibit distribution that is closer to Gaussian. The rationale is as follows: if the global (userindependent) class conditional score distributions are obviously violating the Gaussian, e.g., highly skewed, one cannot expect that the MS-LLR will be Gaussian. When we applied the MS-LLR procedure to the individual systems in the XM2VTS score-level fusion benchmark database [13], almost all the systems showed systematic improvement over the baseline and more than half of them were better than the existing normalisation procedures in terms of a posteriori equal error rate (EER). The overall result is that better generalisation performance is obtained in terms of DET curve and of expected performance curve (EPC) where a priori threshold is used. This means that improvement is likely over various operating thresholds.
2 Methodology Let y be the output score of a biometric system and p(y|j, k) be its model-specific class-conditional score distribution, where j ∈ {1, . . . , J} is a model identity and there
616
N. Poh and J. Kittler
are J models. k is the class label which can be client (genuine user) or impostor, i.e., k ∈ {C, I}. A score normalisation procedure based on the log-likelihood ratio framework can be realised as follow: y norm = Ψj (y) = log
p(y|j, C) p(y|j, I)
(1) k We will assume that p(y|j, k) is a Gaussian, i.e., p(y|j, k) = N μj , (σjk )2 , where μkj and σjk are the class conditional mean and standard deviation of user j for k = {C, I}. We refer to μkj and σjk as user-specific statistics. In this case, Ψj (y) can be written as: Ψj (y) =
σjC 1 1 C 2 I 2 (y − μ ) − (y − μ ) + log , j j 2(σjC )2 2(σjI )2 σjI
(2)
Being an LLR, such a user-specific normalization procedure is optimal (i.e., results in the lowest Bayes error) when 1. the parameters μkj , σjk for k ∈ {C, I} and for all j are estimated correctly. 2. the class-conditional scores can be described by the first and second order statistics. The first condition is unlikely to be fulfilled in practice because there is always lack of user-specific training data. For instance, one has only two or three genuine scores to estimate p(y|j, C) but may have more simulated impostor scores, e.g., in the order of hundreds, to estimate p(y|j, I). As a result, in its original form, (2) is not a practical solution. The second condition can be fulfilled by converting any score such that the resulting score distribution confirms better to a Gaussian distribution. In Section 2.1, we present the Z-norm and its variants (D- and T-norms). Other existing score normalisation procedures will also be discussed. In Section 2.2, we will show how to estimate robustly the parameters in (2) in order to fulfill the first condition. We then deal with the second condition in Section 2.3. 2.1 Some Existing Score Normalisation Procedures Three types of score normalisation will be briefly discussed here. They are Z-, EER- and F-norms. Z-norm [2] takes the form.: yjZ =
y − μIj . σjI
(3)
Z-norm is impostor centric because it relies only on the impostor distribution. In fact, it can be verified that after applying Z-norm, the resulting expected value of the impostor scores will be zero across all the models j. The net effect is that applying a global threshold to Z-normalised scores will give better performance than doing so with the baseline unprocessed scores. An alternative procedure that is client-impostor centric is called the EER-norm [9]. It has the following two variants: y T I1 = y − Δtheo j
(4)
Δemp j
(5)
y
T I2
=y−
On the Use of Log-Likelihood Ratio
where Δtheo = j
I μIj σjC +μC j σj σjI +σjC
617
is a threshold found as a result of assuming that the class-
conditional distributions, p(y|j, k) for both k, are Gaussian and Δemp is found empirj ically. In reality, the empirical version (5) cannot be used when only one or two userspecific genuine scores are available1 . Another study conducted in [14] used a rather heuristic approach to estimate the user-specific threshold. This normalization is defined as: y mid = y −
μIj + μC j 2
(6)
The rest of the approaches in [14] can be seen as an approximation to this one. The under-braced term is consistent with the term Δtheo in (4) when one assumes that j σjC = σjI = 1. A significantly different normalisation procedure than the above two is called F-norm [12]. It is designed to project scores into another score space where the expected client and impostor scores will be the same, i.e., one for client and zero for impostor, across all J models. Therefore, F-norm is also client-impostor centric. This transformation is: yjF =
y − μIj γμC j
+ (1 − γ)μC − μIj
.
(7)
where γ has to be tuned. Two sensible default values are 0 when μC j cannot be estimated because no data exists and at least 0.5 when there is only a single user-specific sample. γ thus accounts for the degree of reliability of μC j and should be close to 1 when abundant genuine samples are available. In all our experiments, γ = 0.5 is used when using F-norm. In order to illustrate why the above procedures may work, we carried out an experiment on the XM2VTS database (to be discussed in Section 3). This involved training the parameters of the above score normalisation procedures on a development (training) set and applied it to an evaluation (test) set. We then plotted the model-specific class conditional distribution of the normalised scores, p(y norm |j, k), for all j’s and the two k’s. The distributions are shown in Figure 1. Since there are 200 users in the experiment, each sub-figure shows 200 Gaussian fits on the impostor distributions (the left cluster) and another 200 on the client distributions (right cluster). The normalisation procedures were trained on the development set and were applied on the evaluation set. The figures shown here are the normalised score distributions on the evaluation set. Prior to any normalisation, in (a), the model-specific class conditional score distributions are very different from one model to another. In (b), the impostor score distributions are aligned to centre close to zero. In (c), the impostor distributions centre around zero whereas the client distributions centre around one. Shown in (d) is the proposed MS-LLR score normalisation (to be discussed). Its resulting optimal decision boundary is located close to zero. This is a behaviour similar to EER (which was not shown here due to bad generalisation). Since the distributions in (b), (c) and (d) are better aligned than (a), improvement is expected. 1
In our experiments, due to too few user-specific genuine scores, (4) results in poorer performance than the baseline systems without normalisation. Following this observation, the performance of EER-norm and its variants will not be reported in the paper.
618
N. Poh and J. Kittler
1.5 0.3
4.5
0.5
4
3
likelihood
0.3
likelihood
1 likelihood
likelihood
0.25
3.5
0.4
2.5 2
0.2
0.5
0.1
1.5 1
0.1
0.2 0.15
0.05
0.5 0
−10
0
10 scores
20
(a) baseline
30
40
0
−2
0
2
4 scores
6
8
10
(b) Z-norm
−0.5
0
0.5 scores
1
(c) F-norm
0
−20
−10
0
10 scores
20
30
(d) MS-LLR
Fig. 1. Model-specific distributions p(y norm |j, k) for (a) the baseline system, (b) Z-norm, (c) F-norm and (d) our proposed MS-LLR using one of the 13 XM2VTS systems
2.2 User-Specific Parameter Adaptation In order to make (2) practical enough as a score normalisation procedure, we propose to use the following adapted parameters: μkadapt,j = γ1k μkj + (1 − γ1k )μk k (σadapt,j )2
=
γ2k (σjk )2
+ (1 −
γ2k )(σ k )2
(8) (9)
where γ1k weighs the first moment and γ2k weighs the second moment of the modelspecific class-conditional scores. γtk thus provides an explicit control of contribution of the user-specific information against the user-independent information. Note that while (8) is found by the maximum a posteriori adaptation [15], (9) is not; (9) is motivated by parameter regularisation as in [16] where, in the context of classification, one can adjust between the solution of a linear discriminative analysis and that of a quadratic discriminative analysis. We used a specific set of γtk values as follows: γ1I = 1, γ2I = 1, γ1C = 0.5, γ2C = 0
(10)
The rationale for using the first two constraints in (10) is that the model-specific statistics μIj and σjI can be estimated reliably since a sufficiently large number of simulated impostor scores can be made available by using a development population of users. The rationale of the third (10) and fourth constraints is exactly the opposite of the first two, i.e., due to the lack of user-specific genuine scores, the statistics μCj and σjC cannot be estimated reliably. Furthermore, between these two parameters, the second order moment (σjC ) is more affected than its first order counterpart (μCj ). As a result, if one were to fine tune γtk , the most likely one should be γjC . Our preliminary experiments on the XM2VTS database (to be discussed in Section 3) show that the value of γjC obtained by the cross-validation procedure is not necessarily optimal. Furthermore, in the case of having only one observed genuine training score, cross-validation is impossible. For this reason, we used the default γjC = 0.5 in all our experiments. This hyper-parameter plays the same role as that of γ in the F-norm in (7). Although the F-norm and the proposed MS-LLR are somewhat similar, MS-LLR is a direct implementation of (1) whereas the
On the Use of Log-Likelihood Ratio
619
F-norm, as well as other normalisation procedures surveyed in Section 2.1, are, at best, approximations to (2). In brief, the proposed MS-LLR is based on (2) whose model-specific statistics are obtained via adaptation, i.e., (8) and (9). To further constrain the model, we suggest to use (10). When only one genuine samples is available, we recommend γjC = 0.5. However, when more user-specific genuine samples are available, γjC > 0.5 generalises probably better. 2.3 Improving the Estimate of Parametric Distribution By Score Transformation All the existing procedures mentioned in Section 2.1, as well as our proposed one based on LLR, i.e., (2), strongly rely on the Gaussian assumption on p(y|j, k). There are two solutions to this limitation. Firstly, if the physical characteristic of scores is known, the associated theoretical distribution can be used so that one replaces the Gaussian assumption with the theoretical one in order to estimate p(y|j, k). Unfortunately, very often, the true distribution is not known and/or there is always not enough data to estimate p(y|j, k), especially for the case k = C. Secondly, one can improve the parametric estimation of p(y|j, k) by using an order preserving transformation that is applied globally (independent of any user). When the output score is bounded in [a, b], the following transformation can be used [17]: y−a y = log (11) b−y For example, if y is the probability of being a client given an observed biometric sample x, i.e., y = P (C|x), then a = 0 and b = 1. The above transformation becomes: y P (C|x) y = log = log 1−y P (I|x) p(x|C) P (C) = log + log p(x|I) P (I) p(x|C) = log +const (12) p(x|I)
y The function log 1−y is actually an inverse of a sigmoid (or logistic) function. The
underbraced term is called a log-likelihood ratio (LLR). Therefore, y can be seen as a shifted version of LLR. When the output score is not bounded, in our experience, we do not need to apply any transformation because assuming p(y|j, k) to be Gaussian is often adequate. We believe that the Gaussian distribution exhibits such a good behaviour because it effectively approximates the true distribution using its first two moments. It should be emphasized here that the order preserving transformation discussed here does not guarantee that the resulting score distribution to be Gaussian. In fact, this is not the goal because p(y|k) is in fact a mixture p(y|j, k) for all j’s by definition. Conversely, if p(y|k) is higly skewed, one cannot expect that p(y|j, k) to be Gaussian.
620
N. Poh and J. Kittler
3 Database, Evaluation and Results The publicly available2 XM2VTS benchmark database for score-level fusion [13] is used. The systems used in the experiments are shown in the first column of Table 1. For each data set, there are two sets of scores, i.e., the development and the evaluation sets. The development set is used uniquely to train the parameters of a given score normalisation procedure, including the threshold (bias) parameter, whereas the evaluation set is used uniquely to evaluate the generalisation performance. The fusion protocols were designed to be compatible with the originally defined Lausanne Protocols [18] (LPs). In order to train a user-specific procedure, three user-specific genuine scores are available per client for LP1 whereas only two are available for LP2. Table 1. Absolute performance for the a posteriori selected threshold calculated on the evaluation (test) score set of the 11 XM2VTS systems as well as two whose outputs are post-processed according to the techniques described in Section 2.3
no. 1 2 3 4 5 6 7 8 9 10 11 12 13
system (modality, a posteriori EER (%) feature, classifier) baseline Z-norm F-norm MS-LLR (F,DCTs,GMM) 4.22 4.04 ∗ 3.57 3.79 (F,DCTb,GMM) 1.82 1.92 ∗ 1.43 1.65 (S,LFCC,GMM) 1.15 1.34 0.68 ∗ 0.44 (S,PAC,GMM) 6.62 4.96 4.63 ∗ 4.37 (S,SSC,GMM) 4.53 2.57 2.33 ∗ 2.03 (F,DCTs,MLP) 3.53 3.28 3.14 ∗ 2.89 (F,DCTs,iMLP) 3.53 3.18 3.19 ∗ 2.70 (F,DCTb,MLP) 6.61 6.32 6.53 ∗ 6.31 (F,DCTb,iMLP) 6.61 ∗ 6.35 6.84 6.77 (F,DCTb,GMM) † ∗ 0.55 0.97 0.79 0.78 (S,LFCC,GMM) 1.37 0.99 0.58 ∗ 0.48 (S,PAC,GMM) 5.39 ∗ 4.65 5.28 5.07 (S,SSC,GMM) 3.33 ∗ 2.20 2.60 2.32
Note: Rows 1–9 are from LP1 where 3 genuine samples per client are used for training; whereas rows 10–13 are from LP2 where only two are available for training. ∗ denotes the smallest EER in a row. †: We verified that for this system, the scores between the development and evaluation sets are somewhat different, thus resulting in poor estimation of the parameters of the score normalisation procedures.
The most commonly used performance visualising tool in the literature is the Decision Error Trade-off (DET) curve [19]. It has been pointed out [20] that two DET curves resulting from two systems are not comparable because such comparison does not take into account how the thresholds are selected. It was argued [20] that such threshold should be chosen a priori as well, based on a given criterion. This is because when a biometric system is operational, the threshold parameter has to be fixed a priori. As a result, the Expected Performance Curve (EPC) [20] was proposed and the following criterion is used: 2
Accessible at http://www.idiap.ch/∼norman/fusion
On the Use of Log-Likelihood Ratio
llr z−norm f−norm us−llr
6.5
5.5
HTER(%)
−5
llr z−norm f−norm us−llr
6
5
0
FRR [%]
relative change of EER (%)
5
−10
5
4.5
4
−15
2 3.5
−20 1
z−norm
f−norm
2
5
3 0.1
10
0.2
0.3
0.4
FAR [%]
us−llr
(a) face
(b) DET, face
0.5 α
0.6
0.7
llr z−norm f−norm us−llr
10
llr z−norm f−norm us−llr
5
5
4.5
−30 −40
HTER(%)
FRR [%]
−20
0.9
5.5
0 −10
0.8
(c) EPC, face
20
rel. change of EER (%)
621
4
3.5
2 3
−50 2.5
−60 1
z−norm
f−norm (b) speech
us−llr
(d) speech
0.5
1
2 FAR [%]
5
(e) DET, speech
2 0.1
0.2
0.3
0.4
0.5 α
0.6
0.7
0.8
0.9
(f) EPC, speech
Fig. 2. Performance of the baseline, Z-norm, F-norm and MS-LLR score normalisation procedures on the 11+2 XM2VTS systems in terms of the distribution of relative change of a posteriori EERs for (a) the face and (d) the speech systems shown here in boxplots; in pooled DET curves (b and e); and in pooled EPC curves (c and f). A box in a boxplot contains the first and the third quantile of relative change of a posteriori EERs. The dashed lines ending with horizontal lines show the 95% confidence of the data. Outliers are plotted with “+”. The statistics in (a–c) are obtained from the 7 face systems shown in Table 1 whereas those in (d–f) are obtained from the remaining 6 speech systems.
WERα (Δ) = αFAR(Δ) + (1 − α)FRR(Δ),
(13)
where α ∈ [0, 1] balances FAR and FRR. An EPC is constructed as follows: for various values of α in (13) between 0 and 1, select the optimal threshold Δ on the development (training) set, apply it on the evaluation (test) set and compute the half total error rate (HTER) on the evaluation set. HTER is the average of false acceptance rate (FAR) and false rejection rate (FRR). This HTER (in the Y-axis) is then plotted with respect to α (in the X-axis). The EPC curve can be interpreted similarly to the DET curve, i.e., the lower the curve, the better the generalisation performance. In this study, the pooled version of EPC is used to visualise the performance. This is a convenient way to compare methods on several data sets by viewing only a single curve per method. This is done by calculating the global FAR and FRR over a set of experiments for each of the α values. The pooled EPC curve and its implementation can be found in [13].
622
N. Poh and J. Kittler
We applied the Z-norm, F-norm and the proposed MS-LLR score normalisation procedures on the 11+2 XM2VTS systems, i.e., the 11 original systems and two of which are based on the transformed output using (11). The a posteriori EER’s are shown in norm Table 1. The improvement of each system, i.e., EER EERorig − 1, is shown as boxplots in Figures 2(a and c), pooled DET curves in Figures 2(b and d), and pooled EPC curves in Figures 2(c and f), for the face and the speech systems, respectively. As can be observed, in all experiments, normalised scores give almost always better improvement but there is one exception, notably with system (F,DCTb,GMM). The degradation is possibly due to the mismatch between p(y|j, k) in the development set and the same distribution in the evaluation set.
4 Conclusions The XM2VTS database is collected under relatively controlled conditions. Although baseline performance is already very good, we show that by applying model-specific score normalisation on the output of the resulting systems, one can further improve the system performance. In particular, among the few score normalisation procedures tested, our proposed model-specific log-likelihood ratio-based (MS-LLR) approach performs best . For the speech systems, the reduction of a posteriori EER is 40% on average and can be as high as 60%. For the face systems, this improvement is only up to 10% on average. From the pooled DET and EPC curves, the average results show that MS-LLR performs best; this is followed by F-norm and Z-norm. The EER-norm performs worse than the baseline systems due to overfitting on the development set. This is because only two or three genuine samples are available. Nevertheless, for the F-norm and the proposed MSLLR, thanks to parameter adaptation, the additional genuine scores are fully exploited. This is contrary to the Z-norm which does not make use of such information. We conjecture that the proposed MS-LLR works best because it combines the following strategies: the general LLR framework shown in (1), the Gaussian assumption on the model-specific class conditional score distribution and the constraints in (10). We also observe that when there is a mismatch between the development and the evaluation sets, e.g., due to different noise factors to which a biometric system is vulnerable, the model-specific class conditional distributions will change. As a result, without taking this change into account, any model-specific score normalisation may fail. This calls for predicting this change in order to take the effect of Doddington’s zoo fully into account. An interesting observation is that the speech systems improve much better than the face systems. Finding out why is beyond the scope of this paper and it will be the subject of future investigation. Another potential research direction is to combine the system outputs after applying model-specific score normalisation. Fusion at this level can be intramodal, i.e., involving a single biometric modality, or multimodal, i.e., involving more than one biometric modalities. Since we have already observed somewhat systematic improvement of performance after the score normalisation process, further improvement is to be expected when these outputs are used in the context of fusion. This subject is currently being investigated.
On the Use of Log-Likelihood Ratio
623
Acknowledgment This work was supported partially by the prospective researcher fellowship PBEL2114330 of the Swiss National Science Foundation, by the BioSecure project (www.biosecure.info) and by the Engineering and Physical Sciences Research Council (EPSRC) Research Grant GR/S46543. This publication only reflects the authors’ view.
References 1. Doddington, G., Liggett, W., Martin, A., Przybocki, M., Reynolds, D.: Sheep, Goats, Lambs and Woves: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation. In: Int’l Conf. Spoken Language Processing (ICSLP), Sydney (1998) 2. Furui, S.: Cepstral Analysis for Automatic Speaker Verification. IEEE Trans. Acoustic, Speech and Audio Processing / IEEE Trans. on Signal Processing 29(2), 254–272 (1981) 3. Pierrot, J.-B.: Elaboration et Validation d’Approaches en V´erification du Locuteur, Ph.D. thesis, ENST, Paris (September 1998) 4. Chen, K.: Towards Better Making a Decision in Speaker Verification. Pattern Recognition 36(2), 329–346 (2003) 5. Saeta, J.R., Hernando, J.: On the Use of Score Pruning in Speaker Verification for Speaker Dependent Threshold Estimation. In: The Speaker and Language Recognition Workshop (Odyssey), Toledo, pp. 215–218 (2004) 6. Jonsson, K., Kittler, J., Li, Y.P., Matas, J.: Support vector machines for face authentication. Image and Vision Computing 20, 269–275 (2002) 7. Lindberg, J., Koolwaaij, J.W., Hutter, H.-P., Genoud, D., Blomberg, M., Pierrot, J.-B., Bimbot, F.: Techniques for a priori Decision Threshold Estimation in Speaker Verification. In: ¨ Proc. of the Workshop Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques¨(RLA2C), Avignon, pp. 89–92 (1998) 8. Genoud, D.: Reconnaissance et Transformation de Locuteur, Ph.D. thesis, Ecole Polythechnique F´ed´erale de Lausanne (EPFL), Switzerland (1998) 9. Fierrez-Aguilar, J., Ortega-Garcia, J., Gonzalez-Rodriguez, J.: Target Dependent Score Normalisation Techniques and Their Application to Signature Verification. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 498–504. Springer, Heidelberg (2004) 10. Auckenthaler, R., Carey, M., Lloyd-Thomas, H.: Score Normalization for Text-Independant Speaker Verification Systems. Digital Signal Processing (DSP) Journal 10, 42–54 (2000) 11. Ben, M., Blouet, R., Bimbot, F.: A Monte-Carlo Method For Score Normalization in Automatic Speaker Verification Using Kullback-Leibler Distances. In: Proc. Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Orlando, vol. 1, pp. 689–692 (2002) 12. Poh, N., Bengio, S.: F-ratio Client-Dependent Normalisation on Biometric Authentication Tasks. In: IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, pp. 721–724 (2005) 13. Poh, N., Bengio, S.: Database, Protocol and Tools for Evaluating Score-Level Fusion Algorithms in Biometric Authentication. Pattern Recognition 39(2), 223–233 (2005) 14. Toh, K.-A., Jiang, X., Yau, W.-Y.: Exploiting Global and Local Decision for Multimodal Biometrics Verification. IEEE Trans. on Signal Processing 52(10), 3059–3072 (2004) 15. Gauvain, J.L., Lee, C.-H.: Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Obervation of Markov Chains. IEEE Tran. Speech Audio Processing 2, 290–298 (1994) 16. Friedman, J.: Regularized discriminant analysis. J. American Statiscal Association 84, 165– 175 (1989)
624
N. Poh and J. Kittler
17. Dass, S.C., Zhu, Y., Jain, A.K.: Validating a biometric authentication system: Sample size requirements. IEEE Trans. Pattern Analysis and Machine Intelligence 28(12), 1302–1319 (2006) 18. Matas, J., Hamouz, M., Jonsson, K., Kittler, J., Li, Y., Kotropoulos, C., Tefas, A., Pitas, I., Tan, T., Yan, H., Smeraldi, F., Begun, J., Capdevielle, N., Gerstner, W., Ben-Yacoub, S., Abdeljaoued, Y., Mayoraz, E.: Comparison of Face Verification Results on the XM2VTS Database. In: Proc. 15th Int’l Conf. Pattern Recognition, Barcelona, vol. 4, pp. 858–863 (2000) 19. Martin, A., Doddington, G., Kamm, T., Ordowsk, M., Przybocki, M.: The DET Curve in Assessment of Detection Task Performance. In: Proc. Eurospeech’97, Rhodes, pp. 1895– 1898 (1997) 20. Bengio, S., Mari´ethoz, J.: The Expected Performance Curve: a New Assessment Measure for Person Authentication. In: The Speaker and Language Recognition Workshop (Odyssey), Toledo, pp. 279–284 (2004)
Predicting Biometric Authentication System Performance Across Different Application Conditions: A Bootstrap Enhanced Parametric Approach Norman Poh and Josef Kittler CVSSP, University of Surrey, Guildford, GU2 7XH, Surrey, UK
[email protected],
[email protected] Abstract. The performance of a biometric authentication system is dependent on the choice of users and the application scenario represented by the evaluation database. As a result, the system performance under different application scenarios, e.g., from cooperative user to non-cooperative scenario, from well controlled to uncontrolled one, etc, can be very different. The current solution is to build a database containing as many application scenarios as possible for the purpose of the evaluation. We propose an alternative evaluation methodology that can reuse existing databases, hence can potentially reduce the amount of data needed. This methodology relies on a novel technique that projects the distribution of scores from one operating condition to another. We argue that this can be accomplished efficiently only by modeling the genuine user and impostor score distributions for each user parametrically. The parameters of these model-specific class conditional (MSCC) distributions are found by maximum likelihood estimation. The projection from one operating condition to another is modelled by a regression function between the two conditions in the MSCC parameter space. The regression functions are trained from a small set of users and are then applied to a large database. The implication is that one only needs a small set of users with data reflecting both the reference and mismatched conditions. In both conditions, it is required that the two data sets be drawn from a population with similar demographic characteristics. The regression model is used to predict the performance for a large set of users under the mismatched condition.
1 Introduction The performance of a biometric system involves a biometric database with multiple records of N users. In a typical experimental evaluation, the records of a reference user are compared with the remaining users, hence resulting in impostor match scores. The comparisons among records of the same reference user result in genuine user (or referred to as “client”) match scores. By sweeping through all possible decision threshold values given these two sets of class conditional match scores, one obtains the system performance in terms of pairs of false acceptance (FAR) and false rejection (FRR) rates, respectively. Consequently, the measured performance is inevitably database- and protocol-dependent. For instance, running the same matching algorithm on a single database but with two different protocols (or ways of partitioning the training and test S.-W. Lee and S.Z. Li (Eds.): ICB 2007, LNCS 4642, pp. 625–635, 2007. c Springer-Verlag Berlin Heidelberg 2007
626
N. Poh and J. Kittler
data) may result in two correlated but nevertheless slightly different performance measures. A good example is the XM2VTS database and its two Lausanne protocols. As a result, if one algorithm outperforms another, one cannot be certain that the same relative performance is repeatable when a different set of users is involved. The same concern about the preservation of relative performance applies to changes in application scenarios, e.g., from a controlled to uncontrolled ones, and from cooperative to non-cooperative users. These changes will result in variability of the measured performance. The goal of this paper is to propose a statistical model that can capture the above factors of variability such that given a small set of development (training) data where samples acquired during the reference and mismatched conditions are present, one can apply the model to predict the performance on an evaluation (test) data set composed of samples collected in the mismatched condition only. We use the term mismatched condition as the one that is significantly different relative to the reference one. In this work, we assume that the population of users in both the development and the evaluation sets are drawn from a common demographic population. This itself is a difficult problem because given this common demographic population, one still has to deal with the other three sources of variability. The novelty of this study is to propose an evaluation methodology as well as a statistical model developed for that purpose to avoid collecting more data (although more data is always better) but instead to put more emphasis on predicting the performance under various operating conditions based only on the development data set. The proposed methodology can thus significantly reduce the cost of data collection. This contrasts with the conventional evaluation methodology which is inherently demanding in terms of the amount of data required, e.g., the BANCA [1] and FERET [2] evaluations. Furthermore, given limited resources in data collection, one has to trade-off the amount of data (users) against the number of application scenarios available. Preliminary experiments on 195 experiments carried out on the BANCA database show that predicting the performance from the reference to the mismatched condition is possible even when the performance associated with these two conditions were assessed on two separate populations of users.
2 Towards Generalising Biometric Performance In order to model the change of score distribution, we will need a statistical model that can effectively model the score distribution within a given operating condition. We opt to build a model known as the model-specific class conditional (MSCC) distributions. If there are J users, there will be 2 × J MSCC distributions since each user has two class conditional scores, one reflecting genuine match scores and the other impostor match scores. An MSCC distribution represents a statistical summary of classconditional scores specific to each user. In essence, it captures the sample variability but conditioned on the class label. Then, in order to capture the change from a reference operating condition to a mismatched one, we need a projection function that can map a set of MSCC distributions to another set. One reasonable assumption that is used here is that the effect of the mismatch, as observed given only the scores, is the same
Predicting Biometric Authentication System Performance
627
Fig. 1. A procedure to predict performance as well as its associated confidence intervals based on MFCC distributions under mismatched application scenario and user composition. Since the MSCC parameters under mismatched situation can only be estimated probabilistically, we propose a slightly different version of model-specific bootstrap which also referred to as the bootstrap subset technique [3].
across all the user models of the same group, i.e., the users are drawn from the common demographic population. (e.g., bankers or construction workers, but not both). Figure 1 outlines an algorithm to train and infer the proposed statistical model. In Step 1, the parameters of the MSCC distributions are estimated via the maximum likelihood principle. In Step 2, the projection function between the set of MSCC distributions corresponding to the reference operating condition and the mismatched one is modeled via regression. The projection function has to be learnt from the smaller data set containing biometric samples recorded in both types of conditions. The transformation is then applied to a larger data set recorded in the reference operating condition. Thanks to this projection, the MSCC distributions of the mismatched condition becomes available, while, avoiding the need of actually collecting the same amount of data set for the mismatched condition. In Step 3, we derive the confidence interval around the predicted performance. The state-of-the-art technique to do so in biometric experiments is known as the bootstrap subset technique [3]. This technique is different from the conventional bootstrap because it does not draw the score samples directly but draws the user models acquired for the database. In our context, this technique draws with replacement the user models associated with a pair of MSCC distributions in each round of the bootstrapping process. The bootstrap subset technique assumes that the parameters of the MSCC distributions are known. We propose a Bayesian approach which defines a distribution over each of the MSCC parameters. In order to estimate confidence intervals around the
628
N. Poh and J. Kittler
predicted performance we propose to sample from this distribution J different sets of MSCC parameters (one for each user) in each round of bootstrap. As will be shown, the proposed Bayesian approach gives systematically better estimate of confidence interval than the bootstrap subset technique. Step 4 attempts to derive the system-level class-conditional (SLCC) score distributions from the set of MSCC distributions. Finally, Step 5 visualises the performance in terms of conventional plots, e.g., receiver’s operating characteristic (ROC) and detection error trade-off (DET). The following sections present an algorithm to implement the proposed evaluation methodology shown in Figure 1. 2.1 Model Specific Class Conditional Score Distribution Let the distribution of model-specific class-conditional (MSCC) scores be written as p(y|k, j, m) where y ∈ R is the output of a biometric system, j is the user index and j ∈ [1, . . . , J] and m is the m-th model/template belonging to user j. Without loss of generality, p(y|k) can be described by: p(y|k) =
j
p(y|k) =
m
p(y|k, j, m)P (m|j, k) P (j|k)
(1)
p(y|k, j)P (j|k)
(2)
j
where p(y|k, j, m) is the density of y conditioned on the user index j, true class label k and the m-th model (a discrete variable) specific to user j; P (m|j, k) specifies how probable it is that m is used (when k = C) or abused (when k = I); and, P (j|k) specifies how probable it is that user j uses the system or persons impersonating him/her abuses his identity. All the data sets we deal with have only one model/template per user, i.e., m = 1. As a result, (1) and (2) are identical. The false acceptance rate (FAR) and false rejection rate (FRR) are defined as a function of the global decision threshold Δ ∈ [−∞, ∞] in the following ways: FAR(Δ) = 1 − ΨI (Δ) FRR(Δ) = ΨC (Δ),
(3) (4)
where Ψk is the cumulative density function (cdf) of p(y|k), i.e., Ψk (Δ) = p(y ≤ Δ|k) =
Δ
p(y|k)dy
(5)
−∞
In this study, we assume that p(y|k, j) (or more precisely p(y|k, j, m = 1)) is Gaussian, i.e., p(y|k, j) = N μkj , (σjk )2 , (6) where, μkj and (σjk )2 are respectively the mean and the variance of the underlying scores. When the true score distribution is known, it should be used. However, as is often the case, due to the paucity of user-specific data, especially the genuine user match
Predicting Biometric Authentication System Performance
629
data model 60 0
10
data model
40
0.2 −1
10
−8
−6
−4
−2
0 likelihood
2
4
6
FRR
0.1
FR [%]
scores
0.3
FRR, FAR
0 −8
impostor (model) impostor (data) client (model) client (data) −6
−4
10 5
−2
10
1
0.5
20
2 1 0.5
−3
10
0.2 0.1 −2
0 Scores
2
4
(a) pdf and cdf
6
−6
10
−4
10 FAR
−2
10
0
10
(b) ROC
0.10.2 0.5 1 2
5 10 20 FA [%]
40
60
(c) DET
Fig. 2. (a) Comparison between the pdf’s (top) and cdf’s (bottom) estimated from the model and the data; the same comparison when visualised using (b) ROC in log-log scales and (c) DET (in normal deviate scales). In order to estimate the pdf’s in the top figure of (a), the kernel density (Parzen window) method was used with a Gaussian kernel; the cdf’s of the data used for visualising the ROC curve are based on (5). Table 1. The partition of the data sets. The first row and first column should read “the score set C Ysmall is available”. Data set size small large
reference degraded Genuine (C) impostor (I) Genuine (C) impostor (I) available available available available available available not available not available
scores, one may not be able to estimate the parameters of the distribution reliably. As a result, a practical solution may be to approximate the true distribution with a simpler one, e.g., using only the first two order of moments as represented by a Gaussian distribution. Figure 2 compares the class-conditional score distribution estimated from the data with the one estimated from the model. Referring back to Figure 1, Step 1 is an application of (1); Step 4 of (5) and Step 5 of (3) and (4), respectively. 2.2 Estimation of Score Distribution Projection Function In this section, we develop the projection function from the score distribution of the reference operating condition to a mismatched one. Suppose that for a small number of user models, we have access to both their reference and degraded class-conditional k scores, i.e., {Ysmall |Q} for Q ∈ {ref, deg} (for reference and degraded, respectively) and k ∈ {C, I}. For another set with a much larger number of user models, we have k only access to their data captured in the reference condition, {Ylarge |Q = ref, ∀k } and k we wish to predict the density of the degraded scores {Ylarge |Q = deg, ∀k } which we do not have access to. Table 1 summarises the availability of scores data. In our k experimental setting, {Ylarge |Q = deg, ∀k } serves as the ground-truth (or test) data whereas the other six data sets are considered as the training data in the usual sense.
630
N. Poh and J. Kittler
Let y be the variable representing a score in Ysk where k ∈ {C, I} and s ∈ {small, k large}. pdf k The p(y|k) estimated from Ys as given by (2) is a function of p(y|j, k) ≡ k 2 N μj , (σj ) for all j, k. Therefore, p(y|k) (for a given data set s) can be fully represented by keeping the parameters {μkj , σjk |∀j , s} estimated from Ysk . Our goal is to learn the following projection f : {μkj , σjk |∀j , Q = ref, s} → {μkj , σjk |∀j , Q = deg, s}
(7)
separately for each k from the small data set (s = small) and apply it to the large data set (s = large). One could have also learnt the above transformation jointly for both k ∈ {C, I} if one assumed that the noise source influenced both types of scores in a similar way. Our preliminary experiments show that this is, in general, not the case. For instance, degraded biometric features affect the magnitude of the genuine scores much more than that of the impostor scores. For this reason, it is sensible to model the transformation functions for each class of access claims separately. This is done in two steps for each of the two parameters separately: fμ : {μkj |∀j , Q = ref, s} → {μkj |∀j , Q = deg, s} fσ :
{σjk |∀j , Q
= ref, s} →
{σjk |∀j , Q
= deg, s}
(8) (9)
where fparam is a polynomial regression function for param ∈ {μ, σ}. Note that in (9), the function is defined on the standard deviation and not on the variance because noise is likely to be amplified in the latter case. This observation was supported by our preliminary experiments (not shown here). We have also considered modeling the joint density of {μkj , σjk |∀j } and found that their correlation is extremely weak across the 195 experiments taken from the BANCA database (to be described in Section 3), i.e., on average −0.4 for the impostor class and 0 for the genuine user (client) class. This means that the two Gaussian parameters are unlikely to depend on each other and there is no additional advantage to model them jointly. These two regression functions give us the predicted degraded Gaussian parameters – μ ˆkj and σ ˆjk – given the Gaussian k k parameters of the reference condition–μj and σj . The polynomial coefficients are obtained by minimising the mean squared error between the predicted and the true values of the larger data set given the values of the smaller data set. We used Matlab’s polyfit function for this purpose. As a byproduct, the function also provides a 95% confidence around the predicted mean value, which corresponds to the variance of the prediction, i.e., V ar[fμ (μkj )] for the mean parameter and V ar[fσ (σjk )] for the standard deviation parameter. These two by-products will be used in Section 2.3. In our preliminary experiments, the degree of polynomial was tuned using a two-fold cross validation procedure. However, due to the small number of samples (in fact the number of users) used, which is 26, using a quadratic function or even of higher order does not necessarily generalise better than a linear function. Following the Occam’s Razor principle, we used only a linear function for (8) and (9), respectively. Examples of fitted regression functions for each fμ and fσ conditioned on each class k are shown in Figure 3(a). The four regression functions aim collectively to predict a DET curve in degraded condition given the MSCC parameters of the reference condition. Figure 3(b) shows both the reference and predicted degraded DET curves. Two
Predicting Biometric Authentication System Performance
impostor
60
client 0.4
0
20
FRR [%]
−0.4
40
0.2
μ (noisy)
μ (noisy)
−0.2
−0.2
−0.6 −0.6
−0.4 −0.2 μ (clean) impostor
0.4
0.6 0.8 μ (clean) client
1
10 5
0.3
2
σ (noisy)
σ (noisy)
0.3 0.2 0.1
1 0.5
0.2
0.2
0.1 0.1
0.2 0.3 σ (clean)
0.4
631
0.1
0.1
0.2 0.3 σ (clean)
(a) Regression on MSCC parameters
0.4
clean noisy ML predicted noisy confidence confidence predicted noisy (bset) confidence (bset) confidence (bset) 0.1 0.2
0.5
1
2
5
10
20
40
60
FAR [%]
(b) DET
Fig. 3. One of 195 BANCA experiments showing the regression fits to project the MSCC parameters under the reference condition to those under the degraded conditions. Note that there is also user composition mismatch because the reference and degraded data come from two disjoint sets of users (the g1 and g2 sets according to the BANCA protocols). Four regression fits are shown in (a) for each of the client and impostor classes and for each of the two Gaussian parameters (mean and standard deviation). The regression lines together with their respective 95% confidence intervals were estimated on the development set whereas the data points (each obtained from a user model) were obtained from the evaluation set. Figure (b) shows the DET curves (plotted with ‘◦’) along with its upper and lower confidence bounds (dotted lines) as compared to the original reference (‘♦’) and degraded (‘∗’) DET curves. The 95% confidence intervals around the predicted degraded DET curves were estimated by 100 bootstraps as described in Section 2.3 (not the bootstrap subset technique). In each bootstrap (which aims to produce a DET curve), we sampled the predicted MSCC parameters given the reference MSCC parameters from the four regression models assuming Gaussian distribution.
procedures were used to derive the predicted degraded DET curves, thus resulting in two predicted degraded curves as well as their corresponding confidence intervals. They will be described in Section 2.3. 2.3 Bootstraps Under Probabilistic MSCC Parameters: A Bayesian Approach This section deals with the case where the MSCC parameters can only be estimated probabilistically, e.g., when applying (7) which attempts to project from the reference MSCC parameters to the degraded ones, the exact degraded MSCC parameters are unknown. This is in contrast to the bootstrap subset technique which assumes that the MSCC parameters can be estimated deterministically (which corresponds to the usual maximum likelihood solution). Let μ ˆkj = fμ (μkj ) be the predicted mean parameter given the reference mean parameter μkj via the regression function fμ . The predicted standard deviation is defined similarly, i.e., σ ˆjk = fσ (σjk ). We assume that the predicted mean value is normally distributed, i.e.,
632
N. Poh and J. Kittler
p(ˆ μkj ) = N (fμ (μkj ), V ar[fμ (μkj )]),
(10)
and so is the predicted standard deviation value, i.e., p(ˆ σjk ) = N (fσ (σjk ), V ar[fσ (σjk )]).
(11)
Both distributions p(ˆ μkj ) and p(ˆ σjk ) effectively characterise the variability due to projecting the parameters of the reference MSCC distribution as represented by {μkj , σjk } to the degraded ones {ˆ μkj , σ ˆjk }, which one cannot estimate accurately. In order to take into account the uncertainty associated with the predicted Gaussian parameters, we propose to sample from the distributions p(ˆ μkj ) and p(ˆ σjk ) for all j and k k both classes k, thus obtaining a set of MSCC parameters {vμ j , vσ j |∀j , ∀k } from which we can evaluate a DET curve thanks to the application of (1), (3) and (5) in each round of bootstraps. By repeating this process U times, one obtains U bootstrapped DET curves. We give a brief account here how to obtain the expected DET curve and its corresponding 95% confidence intervals given the U bootstrapped DET curves. This technique was described in [4]. First, we project a DET curve into polar coordinates (r, θ), i.e., radius and DET angle, such that θ = 0 degree is parallel to FRR=0 in normal deviate scale and θ = 90 degree is parallel to FAR=0. To obtain α × 100% confidence, given the set of bootstrapped DET curves in polar coordinates, we estimate the upper and lower bounds: 1−α 1+α ≤ Ψθ (r) ≤ , 2 2 where Ψθ (r) is the empirical cdf of the radius r observed from the U bootstrapped curves for a given θ. Note that each bootstrapped curve cuts through θ exactly once. 1+α The lower, upper and median DET curves are given by setting r to be 1−α 2 , 2 , and 1 2 , respectively. By back-projecting these three curves from the polar coordinates to the DET plane, one obtains the expected DET curve as well as its associated confidence region at the desired α × 100% level of confidence. This technique was described in [4]. A large U is necessary in order to guarantee stable results. Our preliminary experiments show that U > 40 is fine. We used U = 100 throughout the experiments. Our initial experiments show that the most probable DET curve derived this way generalises much better compared to plotting a DET directly from the predicted MSCC parameters, i.e., {ˆ μkj , σ ˆjk |∀j,k } because by using these parameters, one does not take the uncertainty of the prediction into consideration. Our preliminary experiments, as shown for instance in Figure 3(b), suggest that in order to generalise to a different user population and under a mismatched situation, it is better to sample from (10) and (11) for each user and for each class k in each round of bootstraps than to use the bootstrap subset procedure with the predicted MSCC parameters.
3 Database, Performance Evaluation and Results We chose the BANCA database for our experiments for the following reasons:
Predicting Biometric Authentication System Performance
1 0 −1 −2 0
10
20
30 40 50 60 DET angles (degrees)
70
(a) Before correction
80
90
2 normal deviates
2 normal deviates
normal deviates
2
1 0 −1 −2 0
633
10
20
30 40 50 60 DET angles (degrees)
70
80
90
1 0 −1 −2 0
10
20
30 40 50 60 DET angles (degrees)
70
80
90
(b) After correction, (bootstrap (c) After correction, (Bayesian subset) bootstrap)
Fig. 4. Comparison of DET bias due to noise and user composition mismatches on 195 BANCA systems under degraded (Ud) conditions as compared to the reference (Mc) conditions. Each of the three figures here presents the distribution of 195 DET radius bias for all θ ∈ [0, 90] degrees. The upper and lower confidence intervals represent 90% of the data. Prior to bias correction, in (a), the DET bias is large and is systematic (none zero bias estimate); in (b), after bias correction using the predicted MSCC parameters, i.e., following the bootstrap subset technique, the DET radius bias is still large and non-zero; in (c), after bias correction using the Bayesian bootstrap approach, the bias is significantly reduced and is close to zero – indicating the effectiveness of the regression functions in projecting the MSCC parameters from the reference to both the degraded (Ud) conditions. An example of the actual DET curve was shown in Figure 3(b). Similar results were obtained when the experiments were repeated to predict the performance on the adverse (Ua) conditions instead of the degraded (Ud) conditions (not shown here).
(a) Availability of mismatched conditions: It has three application scenarios: controlled, degraded and adverse operating conditions. (b) Different sets of user: It comes with a defined set of protocols that has two partitions of gender-balanced users, called g1 and g2. This allows us to benchmark the quality of performance prediction by training the projection function on g1 and testing it on g2. In each data set, there are only 26 users. 3 genuine scores are available per user; and 4 for the impostor scores to estimate p(y|j, k, m = 1). (c) Availability of many systems: Being a benchmark database, 195 face and speech verification systems have been assessed on this database. These systems were obtained from “ftp://ftp.idiap.ch/pub/bengio/banca/banca scores” as well as from [5]. In order to assess the quality of predicted DET curve, we consider three DET curves: two derived experimentally using the reference and the degraded data, and one based on the MSCC models (i.e., the predicted degraded curve from the reference data). These three curves can be expressed by ruref (θ), rudeg (θ) and rupdeg (θ) (pdeg for predicted degraded curve), respectively, in polar coordinates for convenience with θ ∈ [0, π2 ]. In order to quantify the merit of the proposed approach, we define the bias of the reference and the predicted degraded curves, with respect to the ground-truth degraded DET curve as follows: ref deg biasref u (θ) = ru (θ) − ru (θ)
(12)
biaspdeg (θ) = rupdeg (θ) − rudeg (θ), u
(13)
and
where u is one of the 195 data sets. By performing U = 195 independent experiments, we can then estimate the density p(biasdata (θ)) for each of the θ values for u data ∈ {ref, pdeg} condition) separately. We expect that the expected bias due to the
634
N. Poh and J. Kittler
predicted degraded DET curve, Eu [biaspdeg (θ))] be around zero and to have small conu fidence intervals whereas Eu [biasref u (θ))] to be further away from zero and has comparatively larger confidence intervals. It is desirable to have positive bias for the predicted degraded DET curve, i.e., Eu [biaspdeg (θ))], because systematically overestimating an u error is better than underestimating it. The experimental results for predicting from the reference to the degraded operating condition, summarised over 195 BANCA systems, are shown in Figure 4.
4 Conclusions While prior work has been reported on the effect of the sample and user variability, e.g., [6,3], to the best of our knowledge, none could be used to predict the biometric performance under mismatched conditions. As a result, the conventional methodology in biometric evaluation has always relied on collecting more data and one had to decide on a trade-off between the amount of data and the number of application scenarios available. We propose an evaluation methodology along with an algorithm that does not rely on a large quantity of data. Instead, it attempts to predict the performance under the mismatched condition by adequately modeling the score distribution and then projecting the distribution into one that matches the target mismatched condition for a given target population of users. Generalisation to different population of users (but of the same demographic characteristic) can be inferred from the resulting confidence interval. As a result, significantly fewer data is needed for biometric evaluation and existing databases can be reused to predict the performance behaviour of a system. Our on-going work attempts to remove the Gaussian assumption made regarding the MSCC distribution. One important assumption about the current choice of regression algorithm (polyfit) used as the projection function is that the variance of the error terms is constant. A complete non-parametric modelling of the pair of MSCC parameters would have been more desirable. This issue is also being investigated. Finally, more comprehensive experiments, notably as to how well each predicted DET curve performs, will also be conducted.
Acknowledgment

This work was supported partially by the prospective researcher fellowship PBEL2114330 of the Swiss National Science Foundation, by the BioSecure project (www.biosecure.info) and by the Engineering and Physical Sciences Research Council (EPSRC) Research Grant GR/S46543. This publication only reflects the authors' view.
References

1. Bailly-Baillière, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariéthoz, J., Matas, J., Messer, K., Popovici, V., Porée, F., Ruiz, B., Thiran, J.-P.: The BANCA Database and Evaluation Protocol. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
2. Phillips, P.J., Rauss, P.J., Moon, H., Rizvi, S.: The FERET Evaluation Methodology for Face Recognition Algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 22(10), 1090–1104 (2000)
3. Bolle, R.M., Ratha, N.K., Pankanti, S.: Error Analysis of Pattern Recognition Systems: the Subsets Bootstrap. Computer Vision and Image Understanding 93(1), 1–33 (2004)
4. Poh, N., Martin, A., Bengio, S.: Performance Generalization in Biometric Authentication Using Joint User-Specific and Sample Bootstraps, IDIAP-RR 60, IDIAP, Martigny. IEEE Trans. Pattern Analysis and Machine Intelligence (to appear, 2005)
5. Cardinaux, F., Sanderson, C., Bengio, S.: User Authentication via Adapted Statistical Models of Face Images. IEEE Trans. on Signal Processing 54(1), 361–373 (2006)
6. Doddington, G., Liggett, W., Martin, A., Przybocki, M., Reynolds, D.: Sheep, Goats, Lambs and Wolves: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation. In: Int'l Conf. Spoken Language Processing (ICSLP), Sydney (1998)
Selection of Distinguish Points for Class Distribution Preserving Transform for Biometric Template Protection Yi C. Feng and Pong C. Yuen Department of Computer Science Hong Kong Baptist University {ycfeng, pcyuen}@comp.hkbu.edu.hk
Abstract. This paper addresses the biometric template security issue. Following our previous work on the class distribution preserving transform, the proposed scheme selects the distinguish points automatically. By considering the geometric relationship with the biometric templates, the proposed scheme transforms a real-valued biometric template into a binary string such that the class distribution is preserved, which is proved mathematically. The binary string is then further encoded using BCH coding and hashing to ensure that the template protection algorithm is non-invertible. Two face databases, namely ORL and FERET, are selected for evaluation, and LDA is used for creating the original template. Experimental results show that by integrating the proposed scheme into the LDA (original) algorithm, the system performance can be further improved by 1.1% and 4%, in terms of equal error rate, on the ORL and FERET databases respectively. The results show that the proposed scheme not only preserves the original template's discriminant power, but also improves the performance if the original template is not fully optimized.

Keywords: Biometric template security, face recognition, one-way transform, class distribution preserving.
1 Introduction
Biometric recognition is a reliable, robust and convenient way for person authentication [1,2,4]. With the growing use of biometrics, there is a rising concern about the security and privacy of the biometric template itself. Recent studies have shown that "hill climbing attacks" [5] on a biometric system are able to recover the biometric templates. As a result, templates have to be stored in an encrypted form to protect them. The matching step needs to be performed in the encrypted domain because decrypted templates are insecure. Different biometric template protection algorithms have been proposed in the last few years and can be categorized into two approaches. The first approach is the cancelable template proposed by Ratha et al. [2] in 2001. The basic idea is that the system only uses a cancelable template, which is transformed from the original template using a one-way function. Therefore,
if the cancelable template is lost or stolen, the system may reset and re-issue another cancelable template using a different set of parameters of the one-way function. In their recent study on cancelable biometrics for fingerprints [3], the use of a distortion function caused the FAR to increase by about 5% for a given FRR. While the invertibility of their transform is low, the discriminability of their cancelable template is not known; no analysis of invertibility vs. discriminability was reported in [3].

The second approach combines cryptographic methods with the biometric system and is known as a biometric cryptosystem. This approach consists of two stages [10]. First, the biometric template is transformed into a string. The second step generates a cryptographic key from the string using cryptographic techniques. Some methods, such as the fuzzy commitment scheme [7] and the fuzzy vault scheme [8,9], have been proposed for the second step. However, directly applying error correcting coding can only handle relatively small intra-class variations. In order to handle a relatively large variation, such as in face templates, Ngo et al. [10] proposed a new algorithm for the first step. They employed random projection and other appearance-based dimension reduction methods (PCA, LDA) to further extract the features. After that, a bit string is obtained using a thresholding technique. However, it is not clear whether the algorithm preserves the original template's discriminability.

Along this line, Feng and Yuen [11] proposed a new conceptual model called the Class Distribution Preserving (CDP) Transform. A new scheme was proposed using a set of "distinguish points" and a distance measurement. The "distinguish point" set is randomly generated, which increases the randomness of the representation. A tri-state representation is proposed to overcome the distortion in the transform, but the tri-state bit is hard to estimate. In this paper, based on the concept of the CDP transform, a new scheme is proposed to transform a real-valued biometric template into a bit string. In the proposed scheme, the set of "distinguish points" is selected, and the estimation of the tri-state bit is not required.

The rest of the paper is organized as follows. Section 2 gives a review of the class distribution preserving transform. Section 3 presents our new proposed scheme, while the experimental results are reported in Section 4. Finally, Section 5 gives the conclusion.
2 Class Distribution Preserving Transform
The basic idea of the class distribution preserving (CDP) transform [11] can be illustrated using Fig. 1. Consider a two-class problem with class Ω1 (data points represented by "cross") and Ω2 (data points represented by "circle"). Given distinguish points, say B1 and B2, any real-valued feature vector (biometric template) V in d-dimensional space can be transformed into a k-dimensional (k = 2 in this case) tri-state bit string [b1, b2], where

bi = 0  if d(Bi, V) < thi − r/2
bi = φ  if thi − r/2 ≤ d(Bi, V) ≤ thi + r/2    (1)
bi = 1  if d(Bi, V) > thi + r/2
Fig. 1. Illustration for the Class Distribution Preserving Transform
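A minimal sketch of the tri-state mapping of Eq. (1), assuming Euclidean distance d; the function name and the use of None for the φ state are our own choices, not part of the original scheme:

```python
import numpy as np

def cdp_transform(V, distinguish_points, thresholds, r):
    """Map a real-valued template V to a tri-state string [b_1, ..., b_k]
    over the distinguish points, following Eq. (1)."""
    bits = []
    for Bi, thi in zip(distinguish_points, thresholds):
        d = np.linalg.norm(np.asarray(V) - np.asarray(Bi))  # d(B_i, V)
        if d < thi - r / 2:
            bits.append(0)
        elif d > thi + r / 2:
            bits.append(1)
        else:
            bits.append(None)  # the ambiguous tri-state value phi
    return bits
```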
Although the scheme in [11] works, there is room for improvement. First, all distinguish points are randomly generated and therefore may not generate an optimal set of bit strings. Second, the number of distinguish points depends on the number of classes, which may introduce a complexity burden when the number of classes is large. Finally, it is hard to estimate the parameter r used in calculating the tri-state φ.
3 Our Proposed Scheme
The block diagram of the proposed system is shown in Fig. 2. It consists of enrollment and authentication phases. The template protection algorithm consists of three blocks, namely the class distribution preserving transform, error correction coding and hashing. The error correction coding and hashing steps are used to ensure that the protection algorithm is non-invertible. This paper mainly focuses on the class distribution preserving transform. The proposed scheme follows the concept of the class distribution preserving (CDP) transform [11], but removes the limitations of our previous work [11]. This paper makes two major changes. First, instead of following the scheme in [11], which targets a biometric identification system (one to many), the new scheme is designed for an authentication system (one to one). In practice, most biometric systems perform authentication, such as border control in Hong Kong and China. Second, the distinguish point set is determined so as to preserve the class distribution after transformation. Details of our scheme are discussed as follows.

Fig. 2. Block diagram of the proposed system

3.1 CDP Transform for Authentication System
Assume there are c classes CT = {Ω1, Ω2, ..., Ωc} with cluster centers {M1, M2, ..., Mc}, respectively. For each class Ωo, let ro be the largest distance between any two data points in Ωo. For an authentication system, we only need to solve a two-class problem. Therefore, in determining the distinguish point set for class Ωo, we only need to consider the class Ωo and CT − Ωo = {x ∈ CT | x ∉ Ωo}. The problem is then defined as follows. Given a data point (biometric template) V, we would like to find a transformation T with the following two CDP properties, where d1 and d2 are distance functions in the input and transformed spaces respectively, and to is the threshold in the transformed domain.

– Property One: If d1(V, Mo) < ro, then d2(T(V), T(Mo)) ≤ to. This property states that if V belongs to class Ωo, then after transformation V still belongs to the same class.
– Property Two: If d1(V, Mo) > ro, then d2(T(V), T(Mo)) > to. This property states that if V does not belong to class Ωo, then after transformation V still does not belong to class Ωo.
3.2 Determining Distinguish Points and Thresholds
In order to develop the CDP transformation for an authentication system as discussed in Section 3.1, we need to determine the set of k distinguish points {B1, B2 ... Bk} and the corresponding thresholds {th1, th2 ... thk} for each class. For simplicity, we will represent them in pairs {Bi, thi}.
Let the data points (biometric templates) in each class be represented as CT − Ωo = {P1, P2 ... Pm} and Ωo = {Q1, Q2 ... Qn} with center Mo. Each template will be transformed into a k-dimensional binary string. The proposed method consists of two steps: the first step finds the k directions where the distinguish points are to be located, while the positions as well as the threshold values are determined in the second step.

Step 1. Using Mo as the origin, each template in CT − Ωo can be represented by a unit vector. This step determines k representative directions from the m unit vectors. To do that, our strategy is to determine the k unit vectors with the largest separation (i.e., the largest angles between two vectors). The problem can be solved using a k-means-like algorithm: first, find v1, the unit vector with the largest angles to the other unit vectors; then find v2, the one with the largest angle to v1; then v3, the one with the largest angles to v1 and v2; and so on until k unit vectors are found. With these k unit vectors, classify the unit vectors in CT − Ωo into k clusters, the i-th cluster being composed of the nearest neighbors of vi (i = 1, 2 ... k). The output of this step is k clusters {G1, G2, ..., Gk}, in which {v1, v2 ... vk} are a set of k unit vectors (from Mo).

Step 2. This step determines the exact position of the distinguish point in each direction (unit vector). The position of the i-th distinguish point along the direction of vi can be written as Bi = Mo + ai·vi, where ai is a real-valued parameter that controls the position of Bi along vi. The threshold thi for the point Bi is then equal to thi = |ai − ro|. Therefore, what we need to do is estimate ai. The value of ai can be positive or negative: ai < 0 means that Bi Mo is in the same direction as vi and thi = |Bi Mo| + ro; if ai > 0, Bi Mo is in the opposite direction of vi and thi = |Bi Mo| − ro. Here we define ai as ai = ±(2 + ei)d, (i = 1, 2 ... k), where the parameter d is the largest value of |Pj Mo| (j = 1, 2 ... m) and ei (i = 1, 2 ... k) is a randomly generated value in the range [0, 2]. The rationale for determining ai in this way will be discussed in the next section.
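A sketch of the two steps under a Euclidean-geometry reading of the description above; the greedy direction selection, tie handling and random draws are our own simplifications rather than the authors' exact procedure:

```python
import numpy as np

def distinguish_points(P_out, M_o, r_o, k, seed=0):
    """Step 1: pick k well-separated unit directions from M_o towards the
    outside templates P_out (shape (m, d)). Step 2: place B_i = M_o + a_i v_i
    with a_i = +/-(2 + e_i) d and threshold th_i depending on the sign of a_i."""
    rng = np.random.default_rng(seed)
    U = P_out - M_o
    U = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Step 1: greedy selection of k directions with the largest mutual angles.
    G = U @ U.T
    np.fill_diagonal(G, -np.inf)                  # ignore self-similarity
    chosen = [int(np.argmin(np.max(G, axis=1)))]  # vector farthest from its nearest neighbour
    while len(chosen) < k:
        cos_to_chosen = np.max(U @ U[chosen].T, axis=1)
        chosen.append(int(np.argmin(cos_to_chosen)))
    V = U[chosen]

    # Step 2: position each distinguish point along its direction.
    d = np.max(np.linalg.norm(P_out - M_o, axis=1))    # largest |P_j M_o|
    B, th = [], []
    for v in V:
        a = rng.choice([-1.0, 1.0]) * (2.0 + rng.uniform(0.0, 2.0)) * d
        B.append(M_o + a * v)
        th.append(abs(a) + r_o if a < 0 else abs(a) - r_o)
    return np.array(B), np.array(th)
```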
3.3 Proof of the CDP Properties
Before discussing the proof, let us consider two scenarios in which a distinguish point is located in two different positions with respect to a data point outside the class Ωo, as shown in Fig. 3.

Scenario One. Consider a distinguish point {Bi, thi} where Bi and Pj are located in opposite directions with respect to Ωo with center Mo, as in Fig. 3. In this case, thi = |Bi Mo| + ro. For any data point Pj in group Gi (after the clustering in Step 1) and Qs (s = 1, 2 ... n) in Ωo, if θj is the angle between the direction Bi Mo and Mo Pj, then cos θj > ro / |Pj Mo|. The distances between the distinguish point Bi and Pj, Qs, Mo can be written as

|Bi Pj| ≥ |Bi Mo| + |Pj Mo| cos θj > |Bi Mo| + ro = thi;
|Bi Qs| ≤ |Bi Mo| + |Qs Mo| ≤ |Bi Mo| + ro = thi;    (2)
|Bi Mo| < |Bi Mo| + ro = thi.

That means |Bi Pj| > thi, |Bi Mo| ≤ thi and |Bi Qs| ≤ thi. So, using Bi as a distinguish point, the points Pj, Qs and Mo are transformed into bits bi(Pj) = 1, bi(Mo) = 0, bi(Qs) = 0. That means data points within the class Ωo are transformed into the same bit value, while data points outside the class Ωo are transformed into the opposite value.
Fig. 3. Distinguishing a point from Mo with the pair {Bi, thi}
Scenario Two. In this scenario, Bi (with threshold thi) and Pj are located in the same direction with respect to Ωo with center Mo, as in Fig. 3. In this case, thi = |Bi Mo| − ro. Then the angle θj between the direction Bi Mo and Mo Pj satisfies

cos θj > |Pj Mo| / (2|Bi Mo|) + ro / |Pj Mo| − ro² / (2|Bi Mo| · |Pj Mo|)    (3)

and

|Bi Mo| > 2|Pj Mo|.    (4)

The distances between the distinguish point Bi and Pj, Qs, Mo can be written as

|Bi Qs| ≥ |Bi Mo| − |Qs Mo| ≥ |Bi Mo| − ro = thi;
|Bi Mo| > |Bi Mo| − ro = thi;    (5)
|Bi Pj|² = |Bi Mo|² + |Pj Mo|² − 2 cos θj · |Bi Mo| · |Pj Mo|
         < |Bi Mo|² + |Pj Mo|² − |Pj Mo|² − 2ro|Bi Mo| + ro²
         = |Bi Mo|² − 2ro|Bi Mo| + ro² = thi².
That means |Bi Pj| ≤ thi and |Bi Mo|, |Bi Qs| > thi. So, using Bi as a distinguish point, the points Pj, Qs and Mo are transformed into bits bi(Pj) = 0, bi(Mo) = 1, bi(Qs) = 1. That means data points within the class Ωo are transformed into the same bit value, while data points outside the class Ωo are transformed into the opposite value. So far, we have shown that under the above two scenarios, the transformed bit of any feature vector Q in class Ωo will be the same as that of Mo, and the transformed
bit of any feature vector V in group Gi will be different from that of Mo. Using the same argument as above and considering all k pairs {B1, th1}, {B2, th2} ... {Bk, thk}, the transformed bit string of any feature vector Q in class Ωo will be the same as that of Mo, and there is at least one bit in the transformed bit string of any feature vector V outside class Ωo that differs from that of Mo (because V belongs to some group Gi, and the corresponding pair {Bi, thi} contributes a bit different from that of Mo). Therefore, we can prove that if V belongs to Ωo, then d2(T(V), T(Mo)) = 0; if V does not belong to Ωo, then d2(T(V), T(Mo)) > 0. The function d2 can be the Hamming distance between two bit strings.

One important point about cos θj should be noted. In Scenarios One and Two, cos θj needs to satisfy the following two conditions:

cos θj > ro / |Pj Mo|
cos θj > |Pj Mo| / (2|Bi Mo|) + ro / |Pj Mo| − ro² / (2|Bi Mo| · |Pj Mo|)    (6)
The template bits are embedded by quantizing the selected NDFT magnitude coefficients with step Δ. Writing a coefficient magnitude as m·Δ + r (quotient m, remainder r, m ≠ 0), the quantization rule distinguishes the cases

W(k) = 1  if m = 2k + 1 and |r| ≤ Δ/2, or if m = 2k and |r| > Δ/2;
W(k) = 0  if m = 2k and |r| ≤ Δ/2, or if m = 2k + 1 and |r| > Δ/2,
where |f^w(k)| are the stego-coefficients, which contain the fingerprint template. During the embedding process, two aspects deserve particular attention. (2a) To guarantee that the embedded coefficients remain real numbers after the INDFT, the data embedding is performed under a positive symmetric condition, similar to DFT-based embedding methods; that is, at the chosen frequency points, X(k) = X*(N − k). The positive symmetric condition is defined as:
|X(k)| ← |X(k)| + ε    (10)
|X(N − k)| ← |X(N − k)| + ε    (11)
(2b) Choosing the quantization step Δ: the quantization step Δ = 5120 is used in the proposed scheme. This value ensures the robustness of the algorithm while keeping the distortion inaudible, an important issue with respect to the human auditory system (HAS). After embedding the fingerprint template, we carry out the INDFT of the stego-coefficients to obtain the stego-audio signal. This stego-audio signal, which hides the fingerprint template, can then be securely transmitted to the authentication server for the identification of a person.
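For illustration, a minimal sketch of odd/even quantization-index embedding on the selected magnitude coefficients; the parity convention and the re-centring of each coefficient inside its quantization cell are our own assumptions, not necessarily the exact rule of the scheme:

```python
import numpy as np

DELTA = 5120.0   # quantization step used in the proposed scheme

def embed_bits(mags, bits, delta=DELTA):
    """Embed one template bit per selected NDFT magnitude coefficient by
    forcing the parity of the quantization index to encode the bit."""
    out = np.asarray(mags, dtype=float).copy()
    for i, (c, w) in enumerate(zip(out, bits)):
        m = int(np.floor(c / delta))        # quantization index of the coefficient
        if (m % 2) != int(w):
            m += 1                          # move to the neighbouring cell of correct parity
        out[i] = m * delta + delta / 2.0    # re-centre inside the cell for robustness
    return out

def extract_bits(mags, delta=DELTA):
    """Recover the embedded bits from the parity of the quantization index."""
    return (np.floor(np.asarray(mags, dtype=float) / delta).astype(int)) % 2
```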
2.3 Fingerprint Template Extraction and Verification

At the authentication server end, we perform the NDFT on the stego-signal to extract the fingerprint template. Extracting the data does not require the original audio signal, because
the template/data is embedded by quantizing the NDFT amplitude coefficients. The process of extracting the hidden data is the inverse of the embedding process; the detailed steps are as follows: (a) perform segmentation of the received audio signal; (b) carry out the NDFT on the segmented stego-audio signal using the secret key; (c) extract the fingerprint template by the quantization rule at the chosen frequency points. After extraction, the biometric template is decrypted by the same method as described in subsection 2.1. Finally, we verify the extracted fingerprint template, t′(n), against the pre-stored template database using the following equation:
M = (1/N) Σ_{n=1}^{N} t′(n) ⊕ t(n)    (12)
where N is the size of the fingerprint template, t(n) is the original fingerprint template stored in the database, and t′(n) is the template extracted from the stego-audio signal.
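A small sketch of the verification score of Eq. (12); the function name and the decision threshold are hypothetical:

```python
import numpy as np

def template_match_score(t_extracted, t_stored):
    """Fraction of differing bits between the extracted and stored templates,
    as in Eq. (12); 0 means a perfect match."""
    t1 = np.asarray(t_extracted, dtype=int)
    t2 = np.asarray(t_stored, dtype=int)
    return float(np.mean(t1 ^ t2))

# Example: accept the claimed identity when the score is below a chosen threshold.
# accepted = template_match_score(t_prime, t_db) <= 0.05
```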
3 Experimental Results and Discussions

Experiments are conducted on the public-domain fingerprint image dataset DB3 of FVC2004, which contains a total of 120 fingers and 12 impressions per finger (1440 impressions) collected from 30 volunteers. The size of the fingerprint images is 300×480 pixels, captured at a resolution of 512 dpi.
Fig. 2. Original audio signal without fingerprint template and Stego-audio signal with hidden fingerprint template
To evaluate the performance of the proposed scheme in terms of robustness and inaudibility, we performed a set of experiments including Gaussian noise, low-pass filtering, MP3 compression, re-sampling, and re-quantization on the transmitted stego-signal that contains the hidden biometric-fingerprint template. The original audio signal and the stego-audio signal containing the hidden fingerprint template are depicted in Figure 2, and their signal difference is shown in Figure 3. The presented system shows 100% fingerprint template decoding accuracy without any attacks, as shown in Table 1. We evaluated the performance of our system by calculating the Signal-to-Noise Ratio (SNR), Mean Squared Error (MSE), and Bit Error Rate (BER), whose mathematical formulae are shown in equations (13) to (15), respectively.
SNR (dB) = 10 log10 [ Σ_{n=0}^{N−1} x²(n) / Σ_{n=0}^{M−1} (x(n) − y(n))² ]    (13)

MSE = (1/N) Σ_{n=0}^{M−1} (x(n) − y(n))²    (14)

BER = (1/N) Σ_{n=0}^{N−1} x(n) ⊕ y(n)    (15)
Fig. 3. Signal difference of original and stego-audio signals
where N is the size of the template, x(n) is the original/host audio signal and y(n) is the stego-audio signal in which the fingerprint template is hidden. Hence, it is apparent from the experimental results that the proposed system is an ideal candidate for secure transmission of biometric templates over an insecure communication network. Moreover, it achieves an outstanding decoding performance, even under the different kinds of attacks that could occur during transmission.
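A sketch of the three measures of Eqs. (13)–(15); here the BER is computed between the embedded and extracted bit strings, which is how we read Eq. (15), and all names are ours:

```python
import numpy as np

def stego_quality(x, y, bits_in=None, bits_out=None):
    """SNR (dB) and MSE between host signal x and stego-signal y, plus the
    BER between the embedded and extracted template bits (if given)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    err = x - y
    snr_db = 10.0 * np.log10(np.sum(x ** 2) / np.sum(err ** 2))   # Eq. (13)
    mse = float(np.mean(err ** 2))                                # Eq. (14)
    ber = None
    if bits_in is not None and bits_out is not None:
        ber = float(np.mean(np.asarray(bits_in, int) ^ np.asarray(bits_out, int)))  # Eq. (15)
    return snr_db, mse, ber
```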
Table 1. Experimental Results

Attack                 SNR (dB)   MSE           BER      Accuracy (%)
No Attack              34.9622    3.4076e-004   0        100
Gaussian Noise         32.7750    5.6386e-004   0.0059   99.41
Low-pass Filtering     34.8085    3.5304e-004   0.0020   99.80
128K Mp3 Compression   34.5605    3.7379e-004   0.0000   100
Re-Sampling            34.8861    3.4679e-004   0.0020   99.80
Re-Quantization        34.8991    3.4575e-004   0.0000   100
4 Conclusion

We have presented a novel covert-transmission scheme for biometric-fingerprint templates, in which an audio signal is used as a container to hide the templates, protecting them from attacks and keeping the very existence of the fingerprint templates secret from communication eavesdroppers. We used a chaos- and NDFT-based template hiding scheme and showed that hiding the biometric template in an audio signal does not affect the identification performance of the biometric recognition system. In addition, we performed a series of experiments to evaluate the performance of the system, and the experimental results have shown that the proposed system is robust against noise and attacks and achieves good verification accuracy. Our future work will focus on hiding different types of biometric templates, e.g., face and iris, into audio signals to evaluate the robustness of our system.
Acknowledgements

This project is supported by the 'Sichuan Youth Science and Technology Funds' under grant number 03ZQ026-033 and the 'Southwest Jiaotong University Doctors Innovation Funds 2005'.
Correlation-Based Fingerprint Matching with Orientation Field Alignment Almudena Lindoso, Luis Entrena, Judith Liu-Jimenez, and Enrique San Millan University Carlos III of Madrid, Electronic Technology Department, Butarque 15, 28911 Leganes, Madrid, Spain {alindoso,entrena,jliu,quique}@ing.uc3m.es
Abstract. Correlation-based techniques are a promising approach to fingerprint matching for the new generation of high resolution and touchless fingerprint sensors, since they can match ridge shapes, breaks, etc. However, a major drawback of these techniques is the high computational effort required. In this paper, a coarse alignment step is proposed which reduces the number of correlations that must be performed. Contrary to other alignment approaches based on minutiae or core location, the alignment is based on orientation field estimates. The orientation coherence is also used to identify the best areas for correlation. The accuracy of the approach is demonstrated by experimental results with an FVC2000 fingerprint database. The approach is also very well suited for hardware acceleration due to the regularity of the operations involved.
1 Introduction

Fingerprints are widely used for recognition of a person's identity because of their proven uniqueness, stability and universality. Characteristic fingerprint features are generally categorized into three levels [1]. Level 1 features, or patterns, are the macro details of the fingerprint, such as ridge flow and pattern type (loop, arch, etc.). Level 2 features are the minutiae, such as ridge bifurcations and endings. Level 3 features include all dimensional attributes of the ridge, such as ridge width, shape, pores, incipient ridges, breaks, creases, scars, and other permanent details. Most commercial Automated Fingerprint Identification Systems (AFIS) are based on Level 1 and Level 2 features. This is because the extraction of Level 3 features requires high resolution images, in the order of 1000 pixels per inch (ppi). However, Level 3 features are also claimed to be permanent, immutable and unique according to forensic experts, and can provide discriminatory information for human identification. With the advent of high resolution fingerprint sensors and "touch-less" sensors that eliminate skin deformation, such as those introduced by TBS, Inc [2], recognition considering Level 3 features becomes feasible. However, many of the existing fingerprint matching techniques cannot take full advantage of these new sensors, since they barely consider Level 3 features. This work focuses on correlation-based fingerprint matching. Correlation uses the gray-level information of the fingerprint image, since a gray-level fingerprint contains
much richer, discriminatory information than only the minutiae locations. Correlation-based techniques can take into account Level 3 features as well as other fingerprint features. As a matter of fact, these methods have been used successfully for fingerprint matching with conventional sensors, as demonstrated in the last two fingerprint verification competitions (FVC) [3], [4].

Approaches to correlation-based fingerprint matching have already been proposed [5], [10]. Among the most important aspects of these techniques are the selection of appropriate areas of the fingerprint image for correlation and the computational effort required to consider translation and rotation between the fingerprint images. In order to take into account displacement and rotation, Ouyang et al. propose the use of a local Fourier-Mellin Descriptor (FMD) [6]. However, since the center of relative rotation between two compared fingerprints is unknown, the local FMD has to be extracted for a large number of center locations. Other works correlate ridge feature maps to align and match fingerprint images [7], but do not consider rotation yet.

In this paper we propose techniques that reduce the correlation effort and minimize the effect of deformation by focusing on high-quality and distinctive fingerprint image regions. To this purpose, we introduce a coarse alignment step based on the correlation of the orientation fields of the fingerprints. The alignment dramatically reduces the correlation search space and is further refined at the final correlation step. Due to the lack of high resolution (1000 ppi) fingerprint databases available in the public domain, experimental results have been conducted with a low resolution database (FVC 2000 DB2). Finally, in order to deal with the high computational requirements of correlation techniques, we propose the use of hardware acceleration. As correlation computations are highly regular, they are much more suitable for hardware acceleration than other approaches [8].

The remaining sections of this paper are organized as follows: Section 2 formulates correlation-based fingerprint matching, Section 3 summarizes the fingerprint preprocessing steps, Section 4 describes the matching algorithm including the alignment and correlation region selection approaches, Section 5 analyses the use of hardware acceleration, Section 6 presents the experimental results and, finally, Section 7 presents the conclusions.
2 Correlation-Based Fingerprint Matching

Cross-correlation, or simply correlation, is a measure of image similarity. In order to compensate for variations of brightness, contrast, ridge thickness, etc. that affect the correlation of fingerprint images, the Zero Mean Normalized Cross Correlation (ZNCC) can be used [11]. ZNCC is defined by the expression:

ZNCC(x, y, α) = CC(T − T̄, I(x, y, α) − Ī(x, y, α)) / (‖T − T̄‖ · ‖I(x, y, α) − Ī(x, y, α)‖)    (1)
where CC is the cross-correlation, T is the template image, I(x, y, α) is the input image shifted by x and y pixels in the vertical and horizontal directions, respectively, and rotated by an angle α, T̄ and Ī(x, y, α) are the corresponding mean values, and ‖·‖ denotes the Euclidean norm.
As an alternative, the cross-correlation required to compute ZNCC can be obtained by multiplication in the Fourier domain, using the following formula:
CC(T, I) = F⁻¹(F*(T) · F(I))    (2)
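A minimal sketch of both formulas: ZNCC for one candidate displacement/rotation (Eq. (1)) and the raw cross-correlation over all shifts via the Fourier domain (Eq. (2)); the circular-correlation convention is a simplifying assumption of ours:

```python
import numpy as np

def zncc(t, p):
    """Zero Mean Normalized Cross Correlation between two equally sized
    patches, i.e. Eq. (1) evaluated at a single (x, y, alpha)."""
    t = t - t.mean()
    p = p - p.mean()
    denom = np.linalg.norm(t) * np.linalg.norm(p)
    return float(np.sum(t * p) / denom) if denom > 0 else 0.0

def cross_correlation_fft(t, img):
    """Raw (circular) cross-correlation of template t with image img for all
    shifts, computed by multiplication in the Fourier domain as in Eq. (2)."""
    T = np.fft.fft2(t, s=img.shape)
    I = np.fft.fft2(img)
    return np.real(np.fft.ifft2(np.conj(T) * I))
```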
3 Image Preprocessing

Correlation-based matching requires good image quality because the matching is performed directly on gray-level fingerprint images. Figure 1 summarizes the preprocessing that has been used in our approach, which is based on [9]. The main steps are: normalization, low-frequency noise filtering, orientation field and frequency estimation with their respective coherences, Gabor filtering and, finally, equalization.
Fig. 1. Preprocessing
4 Fingerprint Matching

In the proposed algorithm, matching is divided into three steps: image alignment, selection of correlation regions and correlation-based matching.

4.1 Coarse Alignment

The purpose of the alignment step is to estimate the translation and rotation between the input and template images. This step dramatically reduces the number of
correlations that should be performed. On the other hand, only a coarse alignment is needed since it will be refined later on in the correlation step. Most approaches for alignment use features extracted from the images, such as minutiae or core locations [10], [14]. In order to avoid extracting these features, we propose a new approach based on the correlation of the orientation fields computed in the preprocessing step. More precisely, we compute the correlation of the sine and the cosine of the estimated orientation angle weighted by the coherence (Coh(θ)) as the input image is translated by (x, y) pixels and rotated by an angle α with respect to the template image:
CC_sin(x, y, α) = CC(sin(2θ_T)·Coh(θ_T), sin(2θ_I(x,y,α))·Coh(θ_I(x,y,α)))    (3)

where CC_cos(x, y, α) is defined analogously using cos(2θ), and CC_Coh(x, y, α) is the correlation of the two coherence maps. The best pose is then

(Δx, Δy, Δα) = arg max_{x,y,α} (CC_sin(x, y, α) + CC_cos(x, y, α)) / CC_Coh(x, y, α)    (4)
Note that as the orientation field is rotated, the orientation angle θ_I(x,y,α) must be corrected by the rotation angle. The correlation maximum determines the best translation (Δx, Δy) and rotation (Δα). The computational effort required for this correlation operation is acceptable, since the orientation map is much smaller than the image. A similar alignment approach could be devised based on the estimation of ridge frequency; however, the results are usually much worse.
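A sketch of the alignment criterion of Eqs. (3)–(4); the caller is assumed to supply the input orientation/coherence maps already translated and rotated to each candidate pose, and the exhaustive search over candidates is our own simplification:

```python
import numpy as np

def alignment_score(theta_T, coh_T, theta_I, coh_I, alpha):
    """Score of one candidate pose: theta_I and coh_I are the input orientation
    and coherence maps already placed at the candidate (x, y, alpha); alpha is
    also used to correct the (doubled) orientation angles."""
    s_T = np.sin(2 * theta_T) * coh_T
    c_T = np.cos(2 * theta_T) * coh_T
    s_I = np.sin(2 * (theta_I + alpha)) * coh_I
    c_I = np.cos(2 * (theta_I + alpha)) * coh_I
    cc_sin = np.sum(s_T * s_I)          # CC_sin of Eq. (3)
    cc_cos = np.sum(c_T * c_I)          # CC_cos, defined analogously
    cc_coh = np.sum(coh_T * coh_I)      # normalising term CC_Coh
    return (cc_sin + cc_cos) / cc_coh if cc_coh > 0 else -np.inf

def coarse_align(theta_T, coh_T, candidates):
    """Return the (x, y, alpha) maximising Eq. (4); `candidates` yields tuples
    (x, y, alpha, theta_I_posed, coh_I_posed)."""
    best = max(candidates,
               key=lambda c: alignment_score(theta_T, coh_T, c[3], c[4], c[2]))
    return best[:3]
```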
4.2 Selection of Regions

After alignment, both fingerprints are analyzed in order to determine candidate regions for correlation. Selection of local regions for correlation is required, since using the entire fingerprint would be computationally very expensive and would correlate badly due to fingerprint deformation and noise. On the other hand, the local regions should be highly distinctive. Several approaches for selecting local regions are discussed in [5]. A typical way to choose region candidates consists in computing the auto-correlation of the image in order to determine its most distinguishable parts; however, this approach requires a huge computational effort. Regions around the core, or regions where ridges have high curvature, may be selected as candidates, but correlation results may be poor because these are typically very noisy areas. Besides, a core does not appear in all fingerprints.

Our approach to the selection of regions is based on image quality and fingerprint overlap. The quality of the images is an important factor in the search for region candidates: if a chosen region candidate corresponds to a bad quality area in either of the fingerprints, the verification will fail. For this purpose we use the coherence of the orientation field as a measure of quality [12]. In particular, the product of the coherences of the input and template fingerprints is our basic criterion to select region candidates. This computation is only considered for the overlapping areas of both fingerprints, which can be computed thanks to the relative translation and rotation estimated in the alignment step. Nevertheless, this approach can lead to selecting regions that are not distinctive enough. To avoid this problem, an average filter is applied to the coherence map, thus allowing the selection of regions with lower coherence as long as they have high-coherence neighbours. This correction increases the chances of selecting distinctive regions located in good quality areas of the image.
It must be pointed out that if two samples of the same finger are excessively rotated or translated, insufficient overlap may occur because the images could show non-coincident parts of the same finger. In addition, the overlapping areas may have low quality. In such cases, the matching is attempted with the overlapping portion of the selected region, but the chances that the verification fails increase.
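A sketch of the selection criterion described above: the product of the two coherence maps restricted to the overlap, smoothed by an average filter, with the highest-scoring positions retained as region centres (the block bookkeeping is our own simplification):

```python
import numpy as np

def select_regions(coh_T, coh_I, overlap_mask, n_regions=3, win=5):
    """Rank positions by the smoothed product of template and input coherence
    inside the overlapping area and return the n_regions best (row, col)."""
    score = coh_T * coh_I * overlap_mask
    kernel = np.ones(win) / win
    # separable moving-average filter, applied along rows then columns
    score = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 1, score)
    score = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 0, score)
    best = np.argsort(score, axis=None)[::-1][:n_regions]
    # note: nearby maxima are not suppressed here, so regions may overlap
    return np.column_stack(np.unravel_index(best, score.shape))
```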
4.3 Matching

Finally, in the matching step the input region candidates are correlated with the template using the ZNCC formula described in Section 2. An area corresponding to the input region but a little larger, called the search area, is chosen in the template image to perform the correlation. Figure 2 illustrates the region selection process and the main matching steps. Matching is done considering translation and rotation of the input region over the search area. Thanks to the previous alignment step, translation and rotation can be performed at a fine scale within a reduced search space. In particular, 12 rotations are considered, 6 clockwise and 6 counterclockwise, with an angle step of 2º.
5 Hardware Acceleration The proposed fingerprint verification approach relies basically in correlation computations, which are used for the alignment and the matching steps. As correlation computations are highly regular, they are much more suitable for hardware acceleration than other approaches. Recent results presented in [8] demonstrate that correlation can be accelerated using a FPGA by more than 600 times with respect to a modern PC. This result is obtained by taking advantage of specialized digital signal processing hardware modules, known as DSP slices [16], which are available in modern FPGAs.
718
A. Lindoso et al.
The hardware architecture designed to accelerate correlation computations is summarized in Fig. 3. The architecture follows equation (1) to compute CC as a series of multiply-accumulate (MAC) operations. This approach adapts easily to a variety of correlation sizes, as required for the alingment and the matching steps. The input fingerprint image is stored in the input memory. The template fingerprint image is stored in the input registers of the DSP slices. DSP slices are organized in a matrix, where each slice computes a MAC and passes the result to next slice in the same row. In other words, each row is organized as a systolic array that computes a row correlation. The correlation matrix results from the addition of n consecutive row correlations R(i), i=0,…,n-1. A delay line is inserted after the last slice in every row in order that the result R(i) reaches the first slice in the next row at the required time to be added up. With this approach, a single data is read from the input memory at every clock cycle, which is supplied to the first slice in each row. At the output of the last row, a correlation result is obtained at every clock cycle for the possible displacements of the input image with respect to the template image.
Memory
R(0)
delay line
R(1)
delay line
R(2)
delay line
............ R(n-1)
Fig. 3. Hardware architecture for correlation computation
The proposed architecture can be easily scaled to any number of DSP slices and can be implemented with an FPGA or in an ASIC. In practice, the acceleration requirements for real time applications are usually moderate and can be achieved with a low cost FPGA or application-specific hardware in a cost effective manner.
6 Experimental Results

The proposed algorithm has been tested with the FVC 2000 DB2 set A database [13]. This database consists of 100 different fingers, with 8 samples per finger, giving a total of 800 fingerprints. The image size is 256×364 pixels. For the computation of the orientation field, fingerprint images are divided into blocks of size 64×64. The resulting orientation field has blocks of size 18×24. Initially, several numbers of correlation regions were considered, but after analyzing the results obtained, the number of regions was set to three. After some preliminary experiments, the size of the region was set to 50×50 pixels and the search area to 100×100 pixels.
The tests have been conducted according to the FVC protocol [15]. To determine the FMR (False Match Rate) curve, the first sample of each finger has been matched against the first sample of all fingers. To determine the FNMR (False Non-Match Rate), all the samples of each finger have been matched among themselves. In our case, symmetric matches have been included in the results because the proposed algorithm is not symmetric. The ROC (Receiver Operating Characteristic) curve obtained for the proposed algorithm is shown in Figure 4. The EER (Equal Error Rate) achieved is 9%.
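A generic sketch of how FMR, FNMR and the EER can be computed from the genuine and impostor correlation scores gathered under this protocol (not the official FVC scoring code):

```python
import numpy as np

def fmr_fnmr_eer(genuine, impostor):
    """FMR(t) and FNMR(t) over all candidate thresholds t, plus the EER.
    Higher scores are assumed to indicate a better match."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    fmr = np.array([np.mean(impostor >= t) for t in thresholds])  # impostors accepted
    fnmr = np.array([np.mean(genuine < t) for t in thresholds])   # genuine pairs rejected
    i = int(np.argmin(np.abs(fmr - fnmr)))
    return thresholds, fmr, fnmr, (fmr[i] + fnmr[i]) / 2.0
```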
Fig. 4. ROC curve, in logarithmic scale, of the proposed algorithm tested with the FVC 2000 DB2 A database. FNMR (False Non-Match Rate), FMR (False Match Rate), TR (correlation threshold)
There were 11 FVC 2000 participants [13], and their EERs range from 0.61% to 46.15% for the considered database (DB2). Comparing our results with the participants' results, the EER of the proposed algorithm is below the EER achieved by five of the participants and above the EER achieved by six of the participants. Most of the false matches were due to very bad quality images or insufficient overlap between the images. These cases can be addressed with additional preprocessing steps and, particularly, with slightly higher resolution images. With the image size used in the experiments, the correlation regions contain a significant
portion of the image. This sometimes prevents avoiding low quality areas or using fully overlapped correlation areas of the required size. These cases can also be detected at the preprocessing step, or later by setting thresholds in the alignment and correlation steps. In a practical application, the verification should be rejected if the image quality is bad or the overlap is not sufficient, asking the user for a new fingerprint sample; however, this option has not been used in our tests.

The time required to complete the computation for each single verification is 1.15 seconds, divided into 0.25 seconds for preprocessing and 0.9 seconds for matching. This time has been measured on a PC Pentium IV 3 GHz with 2 GB of RAM for a C implementation of the algorithm. The preprocessing time of the proposed algorithm is below the enrollment time reported for all the participants of FVC 2000 [15] on DB2. Considering both preprocessing and matching times, the total time required for verification is below the time reported for nearly all participants, since only two of them achieve better times. The implementation of the proposed algorithm can be largely optimized. It must be noted that, given a set of parameter values (block size, region size, search area size, etc.), the computation time is independent of the fingerprints considered, because all computations are completed for every matching. Thus, the computational effort can be significantly reduced by aborting a step as soon as a decision can be made in the alignment and region selection steps.

Correlation computations for the coarse alignment step contribute 109 ms to the matching time, and correlation computations for matching contribute 320 ms. Using hardware acceleration for correlation computations, as proposed in Section 5, these times could be substantially reduced: correlation for coarse alignment can be completed in 1 ms and correlation for matching in 8 ms. Including data transfer, the whole correlation computation of the proposed algorithm could be completed in less than 14 ms. These results have been obtained with an XC4VSX55 FPGA [16].
7 Conclusions

Correlation-based techniques are a promising approach to fingerprint matching for the new generation of high resolution and touch-less fingerprint sensors. However, a large number of correlation computations must be performed in order to consider the possible translation and rotation between the fingerprint images. In this paper, a coarse alignment step based on the correlation of the orientation fields of the fingerprints has been proposed. This alignment dramatically reduces the correlation search space and is further refined at the final correlation step. The orientation field coherence is used to weight the contribution of the orientation estimates to the alignment and to select appropriate regions for correlation. The experiments presented in this paper demonstrate that this approach produces acceptable results with low resolution sensors. It can be expected that the matching accuracy will improve with high resolution and touchless fingerprint sensors, as these sensors will be able to show Level 3 features and to reduce deformation. In the past, one of the main drawbacks of correlation-based fingerprint matching approaches was the high computational effort required. With the proposed approach, the computational effort has been reduced to a level comparable to other techniques. Moreover, since this approach relies basically on correlation computations, it is very
suitable for hardware acceleration in order to reduce the verification time. To this purpose, a hardware acceleration architecture has also been proposed that is able to reduce the computational effort required by correlation by two orders of magnitude.
References

1. Jain, A., Chen, Y., Demitrius, M.: Pores and Ridges: Fingerprint Matching using Level 3 Features. In: Proc. 18th Int'l Conf. on Pattern Recognition (ICPR'06), vol. 4, pp. 477–480 (2006)
2. Parziale, G., Diaz-Santana, E.: The Surround Imager: A Multicamera Touchless Device to Acquire 3D Rolled-Equivalent Fingerprints. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, pp. 244–250. Springer, Heidelberg (2005)
3. Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC 2002: Second Fingerprint Verification Competition. In: Proc. 16th Int. Conf. on Pattern Recognition, vol. 3, pp. 811–814 (2002)
4. Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC 2004: Third Fingerprint Verification Competition. In: Proc. Int. Conf. on Biometric Authentication (2004)
5. Bazen, A.M., Verwaaijen, G.T.B., Gerez, S.H., Veelenturf, L.P.J., van der Zwaang, B.J.: A Correlation-Based Fingerprint Verification System. In: Proc. Workshop on Circuits, Systems and Signal Processing (ProRISC), pp. 205–213 (2000)
6. Ouyang, Z., Feng, J., Su, F., Cai, A.: Fingerprint Matching with Rotation-Descriptor Texture Features. In: Proc. 18th Int'l Conf. on Pattern Recognition (ICPR'06), vol. 4, pp. 417–420 (2006)
7. Ross, A., Reisman, J., Jain, A.: Fingerprint Matching using Feature Space Correlation. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, pp. 48–57. Springer, Heidelberg (2002)
8. Lindoso, A., Entrena, L., Ongil-Lopez, C., Liu, J.: Correlation-based fingerprint matching using FPGAs. In: Proc. Int. Conf. on Field Programmable Technology (FPT), pp. 87–94 (2005)
9. Hong, L., Wan, Y., Jain, A.: Fingerprint image enhancement: algorithm and performance evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 777–789 (1998)
10. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, New York (2003)
11. Crouzil, A., Massip-Pailhes, L., Castan, S.: A New Correlation Criterion Based on Gradient Fields Similarity. In: Proc. 13th Int. Conf. on Pattern Recognition, pp. 632–636 (1996)
12. Lim, E., Toh, K.A., Suganthan, P.N., Jiang, X., Yan, W.Y.: Fingerprint quality analysis. In: Proc. Int. Conf. on Image Processing, pp. 1241–1244. IEEE, Los Alamitos (2004)
13. Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC2000: Fingerprint Verification Competition. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 402–412 (2002)
14. Park, C.H., Oh, S.K., Kwak, D.M., Kim, B.S., Song, Y.C., Park, K.H.: A new reference point detection algorithm based on orientation pattern labelling in fingerprint images. In: Perales, F.J., Campilho, A., Pérez, N., Sanfeliu, A. (eds.) IbPRIA 2003. LNCS, vol. 2652, pp. 697–703. Springer, Heidelberg (2003)
15. Wayman, J., Jain, A., Maltoni, D., Maio, D.: Biometric Systems, Technology, Design and Performance Evaluation. Springer, London (2005)
16. Virtex-4 Family Overview (2004), www.xilinx.com, Xilinx Inc.
Vitality Detection from Fingerprint Images: A Critical Survey Pietro Coli, Gian Luca Marcialis, and Fabio Roli University of Cagliari – Department of Electrical and Electronic Engineering Piazza d’Armi – 09123 Cagliari (Italy) {pietro.coli,marcialis,roli}@diee.unica.it
Abstract. Although fingerprint verification systems have reached a high degree of accuracy, it has recently been shown that they can be circumvented by "fake fingers", namely, fingerprint images coming from stamps reproducing a user's fingerprint, which are processed as "alive" ones. Several methods have been proposed for facing this problem, but the issue is far from a final solution. Since the problem is relevant both for the academic and the industrial communities, in this paper we present a critical review of current approaches to fingerprint vitality detection in order to analyze the state of the art and the related open issues.
1 Introduction

In recent years, fingerprint verification systems for personal identity recognition have reached a high degree of accuracy [1]. Fingerprints can reasonably be considered the biometric for which academic and industrial research achieved the highest level of maturity. In fact, many capture devices have been implemented, with powerful software development kits for acquiring, processing and matching fingerprint images. The reason for this success is mainly due to the most claimed characteristic of fingerprints: their uniqueness [2]. In other words, it is claimed that fingerprints are unique from person to person, and the probability of finding two similar fingerprints, characterized, for example, by minutiae, is very low [2].

However, recent works pointed out that current fingerprint capture devices can be deceived by submitting a "fake fingerprint" made up of gelatine or liquid silicone [3]. This fake finger can be obtained by person coercion (the so-called "consensual" method) or from latent fingerprints [3]. The first method is the simplest, as it only requires that the person put his finger on a plasticine-like material. The next step is to drip some liquid silicone over this mould. After its solidification, a fake stamp reproducing the fingerprint of the client can be used for deceiving the acquisition sensor, which processes it as an "alive" fingerprint. An example of "live" and "fake" fingerprint images fabricated by the consensual method and acquired with an optical sensor is given in Figure 2. Obviously, reproducing a fingerprint is not as easy as it may appear, because high quality stamps are necessary for a successful logon to a system protected with a fingerprint authentication module. On the other hand, the issue exists, as pointed out in [3].
Since fingerprints can be reproduced, this has a dramatic impact on the main property which made them so popular and appealing for security applications: their uniqueness. The state-of-the-art literature addresses this crucial problem as follows: is the finger on the acquisition sensor alive or "fake"? In order to handle this problem, several methods to detect fingerprint vitality (or "liveness") have been proposed, and this research field is still very active. A first subdivision of the state-of-the-art approaches can be made by observing that vitality is detected either by extracting vitality measures from the finger, such as the heartbeat or the blood pressure (hardware-based approaches), or from the fingerprint image(s) directly (software-based approaches). As, to our knowledge, no previous paper has critically reviewed the state of the art of fingerprint vitality detection methods, in this paper we propose a taxonomy of current methods (Section 2) and discuss and compare some key features of previous works, such as the material used for fingerprint reproduction and the data sets used (Section 3). In Section 4, we analyze and compare the detection performances reported in the literature. Some conclusions are drawn in Section 5.
2 Fingerprint Vitality Detection: A Taxonomy of Existing Methods

A possible taxonomy of fingerprint vitality detection methods is proposed in Figure 1. Roughly, existing approaches can be subdivided into "hardware-based" and "software-based". The first ones try to detect the vitality of the fingertip put on the sensor by additional hardware able to measure, for example, blood pressure [4], heartbeat [5], fingertip odor [6], or skin impedance [7]. These approaches are obviously expensive, as they require additional hardware, and can be strongly invasive: for example, measuring a person's blood pressure is invasive, as the measurement could be used for other purposes than simply detecting the vitality of the fingertip [8]. Moreover, in certain cases a clever imitator can circumvent these vitality detection methods. Therefore, making the image processing module more "intelligent", that is, able to detect whether a fake finger has been submitted, is an interesting alternative to the hardware-based approaches. Several approaches aimed at extracting vitality features from the fingerprint images directly have been proposed recently [9-15]. The general rationale behind these approaches is that some peculiarities of "live" fingerprints do not hold in artificial reproductions, and they can be detected by a more or less complex analysis of the fingerprint images. The related vitality detection approaches can be named "software-based". In this survey, we focus on the software-based approaches. Therefore, in the taxonomy of Figure 1, the leaf associated with hardware-based approaches is not expanded further.

According to the taxonomy of Figure 1, the initial subdivision of the software-based approaches is based on the kind of features used. If the extracted features derive from the analysis of multiple frames of the same image, captured while the subject keeps his fingertip on the acquisition surface at certain time periods (e.g., at 0 s and at 5 s), the related methods are named "dynamic" (as they use dynamic features).
On the other hand, if features are extracted from a single fingerprint impression or from the comparison of different impressions, the methods are named "static" (as they use static features). The leaves of the taxonomy in Figure 1 describe the software-based approaches according to the physical principle they exploit: perspiration, elastic distortion phenomena, and the intrinsic structure of fingerprints (morphological approaches). According to the proposed taxonomy, in the following sections we review the vitality detection methods proposed in the scientific literature.
[Figure 1: a tree diagram. The root, "Fingerprint vitality detection approaches", splits into "Hardware-based" (requiring additional hardware to be integrated with the main capture device; not expanded further) and "Software-based" (extracting features from fingerprint images). The software-based branch splits into "Static" methods (features extracted from a single impression or by comparing multiple impressions) and "Dynamic" methods (features extracted by comparing multiple frames), whose leaves are labeled as perspiration-based, elastic deformation-based or morphology-based, with the corresponding references [9]-[15] on the edges.]
Fig. 1. The proposed taxonomy of fingerprint vitality detection methods. Edge labels of the form [*] give the numbers of the related references.
2.1 Static Methods

2.1.1 Static Methods Using a Single Impression

Following the path of the tree in Figure 1 from the static-methods junction, we first consider methods which exploit a single impression. These can be further classified into two classes: perspiration-based and morphology-based. For the former we have selected two main works, [15] and [9]. Both study the perspiration phenomenon through a transform: [15] in the wavelet domain, [9] in the Fourier domain. Tan and Schuckers [15] showed how it is possible to clearly distinguish a live from a fake finger by means of the wavelet transform. The rationale of this method is the analysis of the particular shape of the finger surface. In live fingers, in order to guarantee the physiological thermo-regulation, there are many little chinks named "pores" scattered along the center of the ridges. Because of this characteristic, the acquired image of a finger shows a non-regular shape of the ridges. Generally the path of the ridges is
irregular and, if the resolution of the device is high enough, it is possible to observe these pseudo-periodic conformations at the center of the ridges. In the fabrication of an artificial finger these micro-details can be lost, and consequently the corresponding acquired image shows a more regular ridge shape. The authors propose to analyze this characteristic with a wavelet decomposition. In particular, the image is enhanced and converted into a mono-dimensional signal, namely the gray-level profile extracted along the center of the ridges. A wavelet decomposition of this signal is applied with a five-level multiresolution scheme: the standard deviation and the mean value are computed for each set of wavelet coefficients as well as for the original signal and for the last approximation signal. The 14 parameters so obtained are used as a feature vector for the subsequent classification stage. The concept of detecting liveness from the skin perspiration analysis of the pores had already been proposed in [9]. In particular, the authors use one static feature, named SM, based on the Fast Fourier Transform of the fingerprint skeleton converted into a mono-dimensional signal. The rationale is that for a live finger it is possible to clearly notice the regular periodicity due to the pores on the ridges, whereas this regularity is not evident for spoof fingerprint signals. Carrying on with single-impression static methods, another work is worth mentioning. Unlike the previous works, it performs liveness detection by studying the morphology of the fingerprint images. So, referring to Figure 1, the branch ending at the morphology-based methods is related to the work by Moon et al. [11]. This study is based on a different, contrasting argument. Looking at the finger surface with a high-resolution digital single-lens reflex camera, the authors observe that the surface of a fake finger is much coarser than that of a live finger. The main characteristic of this work is that a high-resolution sensor is necessary to successfully capture this difference (1000 dpi, whilst current sensors exhibit 500 dpi on average). Moreover, this approach does not work with the entire image, which is too large because of its resolution, but with sub-samples of a fixed size. To extract this feature, the residual noise returned by a denoising process applied to the original sub-images is considered. The standard deviation of this noise is then computed to highlight the difference between live and fake coarseness.

2.1.2 Static Methods Using Multiple Impressions

Whilst the previous studies search for a liveness indication in intrinsic properties of a single impression, there are other static features based on multiple impressions: in this case the liveness is derived from a comparison between a reference template image and the input image. These methods are represented in Figure 1 by two branches starting from the "Multiple impressions" node: one indicates methods based on elastic-deformation features, the other those based on morphological features. Ref. [10] falls within the first category. Given a genuine query-template pair of fingerprints, the amount of elastic distortion between the two sets of extracted minutiae is measured by a thin-plate spline model. The idea is that live and spoof fingerprints show a different elastic response over repeated acquisitions. The experimental investigation of Coli et al. [14] considers both the elastic-deformation-based method and the morphology-based one.
The elastic deformation is evaluated by computing the averaged sum of all the distances among the matched
minutiae of the input and template fingerprints. The different elastic response of a live finger and of an artificial stamp is reflected in the spread of this mean value. The other static multi-impression measure is based on a morphological investigation. The feature, which makes use of the ridge width, is based on the idea that during the creation of a fingerprint replica there is an unavoidable modification of the thickness of the ridges: first when the user puts his finger on the cast material, then when the stamp is created with liquid silicone.

2.2 Dynamic Methods

Dynamic methods for vitality detection rely on the analysis of different image frames acquired over an interval while the user keeps his finger on the scanner. As schematized in Figure 1, there are two types of dynamic methods: one based on the perspiration phenomenon, the other on the elastic response of the skin. The same property of the skin used for a static measure in [9, 11] is exploited by the dynamic analysis of the first approach [9]: the pores scattered on the fingertip surface are the source of the perspiration process. When the finger is in contact with the surface of the scanner, the skin gets wetter because of an increase in the amount of sweat. This physiological phenomenon can be recorded by acquiring sequential frames over a fixed interval of a few seconds. The variation of the wetness of the fingertip skin is reflected in a variation of the gray-level profile of the acquired images. In order to evaluate this feature, the fingerprint skeleton of the image at 0 and 5 seconds is converted into a pair of mono-dimensional signals (C1, C2). Several statistical measures are proposed on the basis of the obtained signals, in particular [9]: DM1 (total swing ratio), DM2 (min/max growth ratio), DM3 (last-first fingerprint signal difference mean) and DM4 (percentage change of standard deviation). Since Ref. [12] carries out a more complete vitality analysis on different fingerprint scanner technologies, some modifications to the original method were necessary: the dynamic range of the device can produce a saturated signal for an excessive amount of wetness, and in such a situation the feature DM2 loses its original efficacy. In order to avoid this drawback, two new features named DM5 (dry saturation percentage change) and DM6 (wet saturation percentage change) are introduced. With a selection of these measures, Coli et al. [14] perform liveness detection on their extended database. Whereas the previous works can be considered an extension of the static perspiration measures, the work of Antonelli et al. [13] adopts a dynamic procedure in order to perform liveness detection based on elastic deformation. The user, holding his finger on the scanner surface, is invited to apply a rotation of the fingertip. The movement of the fingertip on the surface induces an elastic tension over the whole surface and consequently an elastic deformation that depends on the level of elasticity of the skin, for a live finger, or of the artificial material, for a spoof stamp. A dynamic acquisition extracts a sequence of images at a high frame rate (> 20 fps). The authors name "distortion code" the feature vector encoding the elastic distortion from the current frame to the next one, obtained by computing the optical flow that estimates the variation over time of the rotating images.
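As an illustration of the static perspiration analysis described above for [15], the sketch below shows how a 14-dimensional feature vector could be assembled from the gray-level profile sampled along the ridge centres. The ridge-signal extraction step, the wavelet family (db4) and the toy input signal are assumptions made for this example and are not details taken from the original paper.

```python
import numpy as np
import pywt

def perspiration_features(ridge_signal, wavelet="db4", levels=5):
    """Mean and standard deviation of the ridge-profile signal, of its five
    wavelet detail bands and of the final approximation:
    (1 original + 5 details + 1 approximation) x 2 statistics = 14 values."""
    coeffs = pywt.wavedec(ridge_signal, wavelet, level=levels)  # [cA5, cD5, ..., cD1]
    signals = [np.asarray(ridge_signal, dtype=float)] + coeffs[1:] + [coeffs[0]]
    feats = []
    for s in signals:
        feats.extend([np.mean(s), np.std(s)])
    return np.array(feats)  # 14-dimensional feature vector for the classifier

# 'ridge_signal' would be the 1-D gray-level profile extracted along the ridge
# centres of an enhanced fingerprint image (extraction not shown here).
toy_signal = np.sin(np.linspace(0, 60, 2048)) + 0.3 * np.random.randn(2048)
print(perspiration_features(toy_signal).shape)  # (14,)
```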
3 Previous Works: Fabrication Process of Fake Stamps and Data Sets Used

We believe that a critical review of previous works on fingerprint vitality detection should analyze: (i) the different materials employed for fake stamps and the methods used for creating them; (ii) the characteristics of the data sets used for the vitality detection experiments. Table 1 deals with item (i) for the methods reviewed in Section 2. Item (i) is important because the response of a certain fingerprint scanner varies with the material adopted (e.g., gelatine or silicone). Secondly, the intrinsic quality of the stamp depends on the material used for the cast and the method followed for its creation. With regard to the mould materials, all those employed are able to deceive an optical sensor, as pointed out in [3]. On the other hand, silicone is not effective against capacitive sensors, probably due to the different electrical properties of this material with respect to the skin.

Table 1. Software-based methods for fingerprint vitality detection and the materials and methods used for stamp fabrication

Reference | Scanner | Method | Cast material | Mould material
[9] | Capacitive | Consensual | Rubber | Play-Doh
[10] | Not specified | Consensual | Gum | Gelatine
[11] | Digital camera | Consensual | Not specified | Gelatine, Plastic clay
[13] | Optical | Consensual | Not specified | Silicone, Gelatine, Latex
[12] | Optical, Electro-optical, Capacitive | Consensual | Dental impression | Play-Doh
[14] | Optical | Consensual | Plasticine | Silicone
[15] | Optical, Electro-optical, Capacitive | Consensual | Not specified | Play-Doh, Gelatine
In Table 1 it is worth noting that all the approaches in the state of the art used the consensual method for creating the stamp. This method consists of the following steps: (1) the user puts his finger on a mould material, so that the negative of the fingerprint pattern is reproduced in the mould; (2) the cast material (e.g., liquid silicone with a catalyst) is dripped over the mould, so that the liquid covers the negative fingerprint; (3) after some hours the solidification of the rubber is complete and the cast can be removed from the mould; (4) the remains of the mould are cleaned off the surface of the cast. This method exploits the subject's cooperation, whilst other approaches, e.g., the ones that produce a fake stamp from latent fingerprints, are more complex and require more expert knowledge [3]. Moreover, the quality of such stamps is intrinsically lower than that obtained with the consensual method. The consensual method is used for obtaining high-quality stamps, in order to create a severe performance test for vitality detection.
It is, in fact, easy to see that fraudulent attacks using high-quality stamps are much more difficult to detect, while attacks with low-quality stamps can in many cases be detected even without using a vitality detection module (the fake impression is rejected as an impostor impression). However, it should be noted that several variables are involved in the fraudulent access process, in particular: (1) the initial pressure of the subject on the cast; (2) the mould material dripped over the cast; (3) the contact of the stamp with the acquisition surface. These variables concur to alter the shape of the reproduced fingerprint and, in some cases, these alterations strongly impact the final quality of the obtained image. Figure 2 shows some examples of fake fingerprint images whose different visual quality can easily be observed. It is worth noting that no previous work has devoted much attention to this issue. However, in our opinion, it is important because adding a fake fingerprint detector obviously impacts the false rejection rate of the system, that is, the rate of genuine users rejected due to misclassified live fingerprints. Fake fingerprint images of poor quality, such as those shown in Figure 2(a, c), could be easily rejected without employing a fake detector. It is worth noting that this problem arises independently of the position of the fake detection module in the processing chain (e.g., whether fake detection is done before or after the verification stage).
Fig. 2. Examples of fake fingerprint images from Ref. [9] (a), Ref. [14] (b), Ref. [11] (c)
The second item we raised in this section concerns the characteristics of the data sets used for the vitality detection experiments. Table 2 points out the most important characteristics of the data sets used in previous works. The second column reports the number of different fake fingerprints (subjects), the third the number of impressions for each fingerprint, and the fourth the number of image frames acquired. The fifth column points out whether the subjects used for producing the stamps are the same used as clients, namely, whether the data set contains both fake and live fingerprint images for each individual. The information reported in Table 2 is useful to analyze: (1) the sample size of the data sets used for evaluating the fake detection rate; (2) the protocol adopted in the experiments. With regard to item (1), it is worth noting that collecting such data sets requires considerable resources in terms of volunteers, time, and personnel devoted to stamp fabrication. In particular, volunteers must be trained to press their finger appropriately on the mould material, and a visual analysis of the stamp is necessary in order to obtain good-quality images. Many trials are required to produce an acceptable stamp. Since the solidification of the mould material can require several hours, this limits the number of fake stamps produced per time unit. As a consequence, reported experimental results can be affected by the small sample size of the data sets used, which could lead to an unreliable estimation of the detection performance.
Table 2. Some key characteristics of the data sets used for vitality detection experiments in previous works

Reference | No. fakes | No. impressions | No. frames | Correspondence with clients?
[9] | 18 | 1 | 2 | NO
[10] | 32 | 10 | 0 | YES
[11] | 24 | 1 | 0 | NO
[13] | 12 | 5 | 20 | NO
[12] | 33 | 1 | 2 | NO
[14] | 28 | 2 | 2 | YES
[15] | 80 | 1 | 0 | NO
The differences concerning item (2) above, i.e., the differences in the characteristics of the data sets, are pointed out by the fifth column of Table 2. The term "correspondence with clients" means that for each fake image there is a corresponding live image. This impacts the experimental protocol used and the final goal of the experimentation. In particular, works for which this correspondence is absent aim to show that the proposed feature(s) can distinguish a fake image from a live one; accordingly, they do not require the presence of the related clients. On the other hand, they do not allow evaluating the penetration rate of the stamps in a verification system, that is, assessing the rate of fake fingerprints which would be accepted as live fingerprints. In some cases (Table 2, third and sixth rows), a certain number of live and fake fingerprint frames/impressions is captured from the same subject. This is due to the characteristics of the measure, which requires the comparison of the input impression with the related client template (e.g., the elastic distortion or morphological measures [10, 14], which require an additional minutiae extraction step), and also to the possibility of evaluating the relationship between the fake detection rate and the verification performance. In this case the protocol adopted is a little more complex, because the fake detection features can be extracted only during the comparison phase. As an example, the protocol adopted in [14] is made up of the following steps:
- the second impression was considered as the template of the fingerprint stored in the system database; the minutiae points were manually detected in order to avoid errors due to the minutiae detection algorithm;
- the first and the second frame of the first impression were considered as the images provided by the system during an access attempt; only attempts related to fingerprints of the same subject were considered ("genuine" attempts by live and fake fingers), and even for these images the minutiae points were manually detected.
4 Previous Works: Vitality Detection Performances

Table 3 reports a preliminary comparison of previous works in terms of the overall misdetection (classification) rate, that is, the average of the rate of live fingerprints wrongly classified as fake ones and vice versa. Due to several problems, such as the differences in terms of sensors used, data set size, protocol, and classifiers, it is difficult to compare these values fairly. As an example,
the work by Antonelli et al. [13] uses a threshold-based classifier based on the mono-dimensional distribution of the live and fake classes. This distribution has been computed from the Euclidean distance between template and input distortion codes, and the threshold has been tuned at the "equal error rate" point. This value is obviously different from the overall error rate usually considered for evaluating the performance of vitality detection systems. Moreover, the strong variations in error rates even among similar approaches (e.g., see [10] and [14], or [9] and [12]) suggest that a common experimental protocol is necessary in order to avoid the difficulty in interpreting reported results. Finally, an average error rate of 3-5% even on small data sets makes these systems hardly acceptable for real integration into current fingerprint verification systems, due to their impact on the false rejection rate (i.e., wrongly rejected clients), which could increase. As an example, the best fingerprint verification system at the 2004 edition of the Fingerprint Verification Competition exhibited a 2% equal error rate on average [16].

Table 3. Vitality detection performance reported in previous works

Reference | Classification method | Error rate
[9] | Back-propagation neural network | 0%
[10] | Support Vector Machine | 18%
[11] | Threshold | 0%
[13] | Threshold | 5% (equal error rate)
[12] | Neural Network, Threshold, Linear Discriminant Analysis | 6% (capacitive sensor), 3% (electro-optical sensor), 2% (optical sensor)
[14] | k-NN Classifier | 6%
[15] | Classification Tree | 5% (capacitive sensor), 13% (optical sensor)
5 Conclusions

Fingerprint vitality detection has become a crucial issue in personal verification systems using this biometric. In this paper, we critically reviewed the main approaches to fingerprint vitality detection proposed in recent years. To the best of our knowledge, this is the first survey of fake fingerprint detection methods. In particular, we proposed a possible taxonomy for summarizing the current state of the art and also examined the scientific literature from other points of view, such as the materials employed for producing fake stamps and the data sets used for the experiments. Finally, we reported the detection rates of previous works in order to draw a preliminary comparison among the current approaches. Our future work will analyze the state of the art further and will also perform a fair experimental comparison by adopting an appropriate data set aimed at highlighting the pros and cons of the state-of-the-art methods for fingerprint vitality detection.
References
[1] Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S. (eds.): Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
[2] Bolle, R.M., Connell, J.H., Ratha, N.K.: Biometric perils and patches. Pattern Recognition 35(12), 2727–2738 (2002)
[3] Matsumoto, T., Matsumoto, H., Yamada, K., Hoshino, H.: Impact of artificial 'gummy' fingers on fingerprint systems. In: Proceedings of SPIE, vol. 4677 (2002)
[4] Lapsley, P., Less, J., Pare, D., Hoffman, N.: Anti-Fraud Biometric Sensor that Accurately Detects Blood Flow, SmartTouch, LLC, US Patent #5,737,439 (1998)
[5] Biel, L., Pettersson, O., Philipson, L., Wide, P.: ECG analysis: A new approach in human identification. IEEE Transactions on Instrumentation and Measurement 50(3), 808–812 (2001)
[6] Baldissera, D., Franco, A., Maio, D., Maltoni, D.: Fake Fingerprint Detection by Odor Analysis. In: Proceedings of the International Conference on Biometric Authentication (ICBA06), Hong Kong (January 2006)
[7] Osten, D., Carim, H.M., Arneson, M.R., Blan, B.L.: Biometric, Personal Authentication System, U.S. Patent #5,719,950 (February 17, 1998)
[8] Jain, A.K., Bolle, R., Pankanti, S. (eds.): BIOMETRICS: Personal Identification in Networked Society. Kluwer Academic Publishers, Dordrecht (1999)
[9] Derakhshani, R., Schuckers, S., Hornak, L., O'Gorman, L.: Determination of vitality from a non-invasive biomedical measurement for use in fingerprint scanners. Pattern Recognition 36(2), 383–396 (2003)
[10] Chen, Y., Jain, A.K., Dass, S.: Fingerprint deformation for spoof detection. In: Biometric Symposium, Crystal City, VA (2005)
[11] Moon, Y.S., Chen, J.S., Chan, K.C., So, K., Woo, K.C.: Wavelet based fingerprint liveness detection. Electronics Letters 41(20), 1112–1113 (2005)
[12] Parthasaradhi, S., Derakhshani, R., Hornak, L., Schuckers, S.: Time-series detection of perspiration as a vitality test in fingerprint devices. IEEE Trans. on Systems, Man and Cybernetics, Part C 35(3), 335–343 (2005)
[13] Antonelli, A., Cappelli, R., Maio, D., Maltoni, D.: Fake Finger Detection by Skin Distortion Analysis. IEEE Transactions on Information Forensics and Security 1(3), 360–373 (2006)
[14] Coli, P., Marcialis, G.L., Roli, F.: Analysis and selection of features for the fingerprint vitality detection. In: SSPR/SPR 2006, pp. 907–915 (2006)
[15] Tan, B., Schuckers, S.: Liveness detection for fingerprint scanners based on the statistics of wavelet signal processing. In: Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06) (2006)
[16] Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC2004: Third Fingerprint Verification Competition. In: Zhang, D., Jain, A.K. (eds.) ICB 2004. LNCS, vol. 3072, pp. 1–7. Springer, Heidelberg (2004)
Optimum Detection of Multiplicative-Multibit Watermarking for Fingerprint Images Khalil Zebbiche, Fouad Khelifi, and Ahmed Bouridane School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast BT7 1NN Belfast, UK {kzebbiche01,fkhelifi01,A.Bouridane}@qub.ac.uk
Abstract. Watermarking is an attractive technique which can be used to ensure the security and the integrity of fingerprint images. This paper addresses the problem of optimum detection of multibit, multiplicative watermarks embedded within Generalized Gaussian distribution features in Discrete Wavelet Transform of fingerprint images. The structure of the proposed detector has been derived using the maximum-likelihood approach and the Neyman-Pearson criterion. The parameters of the Generalized Gaussian distribution are directly estimated from the watermarked image, which makes the detector more suitable for real applications. The performance of the detector is tested by taking into account the different quality of fingerprint images and different attacks. The results obtained are very attractive and the watermark can be detected with low detection error. Also, the results reveal that the proposed detector is more suitable for fingerprint images with good visual quality. Keywords: Fingerprint images, multibit watermarking, multiplicative rule, maximum-likelihood.
1 Introduction

Fingerprint-based authentication systems are the most advanced and accepted of the biometric technologies. They have been used by law enforcement agencies and have been progressively automated over the last years. With the recent developments in fingerprint sensing, an increasing number of non-criminal applications are either using or actively considering using fingerprint-based identification. However, biometric-based systems in general, and fingerprint-based systems in particular, are exposed to several threats. Ratha et al. [1] describe eight basic sources of possible attacks on biometric systems, and Schneier [2] identifies many other types of abuse. Watermarking is one of the possible techniques that may be used, and it has been introduced to increase the security and the integrity of fingerprint data [3]-[7]. One of the most important stages in watermarking is the detection stage, which aims to decide whether or not a given watermark has been inserted in an image. This can be seen as a hypothesis test in which the system has to decide between the alternative hypothesis (the image is watermarked) and the null hypothesis (the image is not
watermarked). In binary hypothesis testing two kinds of errors can occur: accepting the alternative hypothesis when the null hypothesis is correct, and accepting the null hypothesis when the alternative hypothesis is true. The first error is often called the false alarm error and the second is usually called the missed detection error. The problem of watermark detection has been investigated by many researchers; however, most of these works consider the case of one-bit watermarking. Assessing the presence of a multibit watermark is more difficult than the one-bit case because the embedded information bits are unknown to the detector. Hernandez et al. [8] derived an optimum detection strategy for the additive watermarking rule, which cannot be used when another embedding rule is adopted. Barni et al. [9] proposed the structure of an optimum multibit detector for multiplicative watermarks embedded in Weibull-distributed features. In this paper, we propose an optimum detector for a multibit, multiplicative watermark embedded in the DWT coefficients of fingerprint images. The structure of the proposed detector is derived using a maximum-likelihood (ML) method based on Bayes' decision theory, whereby the decision threshold is obtained using the Neyman-Pearson criterion. A Generalized Gaussian Probability Density Function (PDF) is used to model the statistical behavior of the coefficients. The performance of the proposed detector is examined through a number of experiments using real fingerprint images of different quality. The rest of the paper is organized as follows: Section 2 shows how the watermark sequence is hidden in the DWT coefficients. Section 3 explains the derivation of the decision rule based on the ML method, while the derivation of the decision threshold is presented in Section 4. The experimental results are provided in Section 5. The conclusion is presented in Section 6.
2 Embedding Stage

The watermark is embedded into the DWT subband coefficients. Let b = {b1 … bNb} be the information bit sequence to be hidden (assuming value +1 for bit 1 and -1 for bit 0) and m = {m1 m2 … mNb} a pseudo-random set uniformly distributed in [-1, 1], generated using a secret key K. The information bits b are hidden as follows: (i) the DWT subband coefficients used to carry the watermark are partitioned into Nb non-overlapping blocks {Bi: 1 ≤ i ≤ Nb}; (ii) the watermark sequence m is split into Nb non-overlapping chunks {Mi: 1 ≤ i ≤ Nb}, so that each block Bk and each chunk Mk will be used to carry one information bit; (iii) each chunk Mk is multiplied by +1 or -1 according to the information bit bk to obtain an amplitude-modulated watermark. Finally, the watermark is embedded using the multiplicative rule, given by:
$$ y_{B_k} = (1 + \gamma M_k b_k)\, x_{B_k} \qquad (1) $$

where $x_{B_k} = \{x_{1B_k}, x_{2B_k}, \ldots, x_{NB_k}\}$ and $y_{B_k} = \{y_{1B_k}, y_{2B_k}, \ldots, y_{NB_k}\}$ are the DWT coefficients of the original image and of the associated watermarked image belonging to the block $B_k$,
respectively. γ is a positive scalar value used to control the strength of the watermark. The larger the strength, the more robust the watermark, but the visual quality of the image may be affected. So, it is important to set γ to a value which maximizes the robustness while keeping the visual quality unaltered.
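A minimal sketch of the embedding rule of Equation (1) is given below. For brevity it modulates only one level-3 detail subband and uses consecutive coefficients as the blocks B_k; the wavelet, the block layout and the key-driven generation of m are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np
import pywt

def embed_multibit(image, bits, gamma=0.05, key=1234, wavelet="db4", level=3):
    """Multiplicative multibit embedding y = (1 + gamma * m_k * b_k) * x,
    applied block-wise to one detail subband of the DWT (Equation 1)."""
    rng = np.random.default_rng(key)                  # plays the role of the secret key K
    coeffs = pywt.wavedec2(np.asarray(image, dtype=float), wavelet, level=level)
    cH, cV, cD = coeffs[1]                            # level-3 detail subbands
    x = cH.ravel().copy()
    m = rng.uniform(-1.0, 1.0, size=x.size)           # pseudo-random sequence in [-1, 1]
    block = x.size // len(bits)                       # non-overlapping blocks B_k
    for k, b in enumerate(bits):                      # each b_k in {-1, +1}
        sl = slice(k * block, (k + 1) * block)
        x[sl] = (1.0 + gamma * m[sl] * b) * x[sl]
    coeffs[1] = (x.reshape(cH.shape), cV, cD)
    return pywt.waverec2(coeffs, wavelet)

# Example: hide 16 bits in a synthetic image.
bits = np.sign(np.random.randn(16)).astype(int)
watermarked = embed_multibit(np.random.rand(256, 256) * 255, bits)
```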
3 Maximum-Likelihood Detection

Since the exact information bit sequence is unknown to the detector in blind watermarking, multibit detection is more difficult than the one-bit case. However, an optimum detector for a multibit watermark can be derived following the same approach as for the one-bit watermark described in [10], [11]. The watermark is detected using ML detection based on Bayes' decision theory, whereby the decision threshold is derived using the Neyman-Pearson criterion, which aims to minimize the missed detection probability for a fixed false alarm rate. According to this approach, the problem is formulated as a statistical hypothesis test. Two hypotheses are established as follows:
H0: the coefficients are marked by a spreading sequence m, modulated by one of the 2^Nb possible bit sequences b.
H1: the coefficients are marked with another possible sequence m', including the null sequence, where m' ≠ m.
The likelihood ratio, denoted by l(y), is defined as:
$$ l(\mathbf{y}) = \frac{f_Y(\mathbf{y} \mid \mathbf{m})}{f_Y(\mathbf{y} \mid \mathbf{m}')} \qquad (2) $$
where $f_Y(\mathbf{y} \mid \mathbf{m})$ and $f_Y(\mathbf{y} \mid \mathbf{m}')$ represent the PDF of y conditioned on the presence of the sequence m and m', respectively. In fact, it has been proved in [10] that for a reasonably small value of the strength γ, the PDF of the coefficients y conditioned on the event m' can be approximated by the PDF of y conditioned on the presence of the null sequence, $f_Y(\mathbf{y} \mid \mathbf{0})$. The likelihood ratio l(y) becomes:

$$ l(\mathbf{y}) = \frac{f_Y(\mathbf{y} \mid \mathbf{m})}{f_Y(\mathbf{y} \mid \mathbf{0})} \qquad (3) $$
Assume that the information bits b and the coefficients in m are independent of each other, as are the DWT coefficients used to carry the watermark. The PDF $f_Y(\mathbf{y} \mid \mathbf{m})$ is then obtained by integrating out the $2^{N_b}$ possible bit sequences:

$$ f_Y(\mathbf{y} \mid \mathbf{m}) = \prod_{k=1}^{N_b} f_Y(\mathbf{y}_k \mid \mathbf{m}_k) = \prod_{k=1}^{N_b} \big[ f_Y(\mathbf{y}_k \mid \mathbf{m}_k, -1)\, p(b_k = -1) + f_Y(\mathbf{y}_k \mid \mathbf{m}_k, +1)\, p(b_k = +1) \big] \qquad (4) $$
By assuming that p(bk= -1) = p(bk= +1) = 1/2, equation (4) can be written as follows:
$$ l(\mathbf{y}) = \frac{\displaystyle \prod_{k=1}^{N_b} \frac{1}{2} \left[ \prod_{i \in B_k} \frac{1}{1-\gamma m_{ki}}\, f_X\!\left(\frac{y_{ki}}{1-\gamma m_{ki}}\right) + \prod_{i \in B_k} \frac{1}{1+\gamma m_{ki}}\, f_X\!\left(\frac{y_{ki}}{1+\gamma m_{ki}}\right) \right]}{\displaystyle \prod_{k=1}^{N_b} \prod_{i \in B_k} f_X(y_{ki})} \qquad (5) $$
Further simplification can be made by taking the natural logarithm of the likelihood ratio; thus the decision rule can be expressed by

$$ z(\mathbf{y}) = \sum_{k=1}^{N_b} \left\{ -\ln 2 + \ln\!\left[ \prod_{i \in B_k} \frac{1}{1-\gamma m_{ki}} \frac{f_X\!\left(\frac{y_{ki}}{1-\gamma m_{ki}}\right)}{f_X(y_{ki})} + \prod_{i \in B_k} \frac{1}{1+\gamma m_{ki}} \frac{f_X\!\left(\frac{y_{ki}}{1+\gamma m_{ki}}\right)}{f_X(y_{ki})} \right] \right\} \qquad (6) $$
For an optimum behavior of the ML detector, it is necessary to describe the PDF of the DWT coefficients of the original image. An initial investigation using various distributions such as Laplacian, Gaussian and Generalized Gaussian has found that the Generalized Gaussian PDF is the most suitable distribution that can reliably model the DWT coefficients of the fingerprint images. It has been found that the Generalized Gaussian can also be used to model the coefficients for each block Bk. The central Generalized Gaussian PDF is defined as:
$$ f_X(x_i; \alpha, \beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\!\big( -(|x_i|/\alpha)^{\beta} \big) \qquad (7) $$
where $\Gamma(\cdot)$ is the Gamma function, i.e., $\Gamma(z) = \int_0^{\infty} e^{-t} t^{z-1}\,dt$, $z > 0$. The parameter α is referred to as the scale parameter and models the width of the PDF peak (standard deviation), while β is called the shape parameter and is inversely proportional to the decreasing rate of the peak. Note that β = 1 yields the Laplacian distribution and β = 2 yields the Gaussian one. The parameters α and β are estimated as described in [12]. Inserting (7) into (6), the decision rule for the Generalized Gaussian model is expressed by:
$$ z(\mathbf{y}) = \sum_{k=1}^{N_b} \left\{ -\ln 2 + \ln\!\left[ \prod_{i \in B_k} \frac{1}{1-\gamma m_{ki}} \exp\!\left( \left(\frac{|y_{ki}|}{\alpha_{B_k}}\right)^{\beta_{B_k}} \left[ 1 - \left(\frac{1}{1-\gamma m_{ki}}\right)^{\beta_{B_k}} \right] \right) + \prod_{i \in B_k} \frac{1}{1+\gamma m_{ki}} \exp\!\left( \left(\frac{|y_{ki}|}{\alpha_{B_k}}\right)^{\beta_{B_k}} \left[ 1 - \left(\frac{1}{1+\gamma m_{ki}}\right)^{\beta_{B_k}} \right] \right) \right] \right\} \qquad (8) $$
The decision rule reveals that an image is watermarked by the sequence m (H0 is accepted) only if z(y) exceeds a threshold λ.
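The following sketch is a direct transcription of the detection statistic of Equation (8), together with one common moment-matching estimator of the Generalized Gaussian parameters; the paper estimates α and β "as described in [12]", so the estimator shown here is an assumption and may differ from the authors' exact procedure.

```python
import numpy as np
from scipy.special import gamma as G
from scipy.optimize import brentq

def ggd_detection_statistic(y_blocks, m_blocks, gamma, alphas, betas):
    """Log-likelihood ratio z(y) of Equation (8). y_blocks and m_blocks are lists
    of 1-D arrays (one per block B_k); alphas, betas are per-block GGD parameters."""
    z = 0.0
    for y, m, a, b in zip(y_blocks, m_blocks, alphas, betas):
        t = (np.abs(y) / a) ** b                              # (|y_ki| / alpha_Bk)^beta_Bk
        log_minus = np.sum(-np.log(1 - gamma * m) + t * (1 - (1 - gamma * m) ** (-b)))
        log_plus = np.sum(-np.log(1 + gamma * m) + t * (1 - (1 + gamma * m) ** (-b)))
        # ln of the sum of the two product terms, computed stably via log-sum-exp
        z += -np.log(2.0) + np.logaddexp(log_minus, log_plus)
    return z

def estimate_ggd_params(x):
    """Moment-matching GGD estimate (one common choice; the bracketing of the shape
    parameter assumes typical wavelet-coefficient statistics)."""
    x = np.asarray(x, dtype=float)
    m1, m2 = np.mean(np.abs(x)), np.mean(x ** 2)
    f = lambda b: G(2.0 / b) ** 2 / (G(1.0 / b) * G(3.0 / b)) - m1 ** 2 / m2
    beta = brentq(f, 0.1, 5.0)
    alpha = m1 * G(1.0 / beta) / G(2.0 / beta)
    return alpha, beta
```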
4 Decision Threshold

The Neyman-Pearson criterion is used in this work to obtain the threshold λ in such a way that the missed detection probability is minimised subject to a fixed false alarm probability $P^*_{FA}$. Fixing the value of $P^*_{FA}$, the threshold λ can be obtained using the relation:

$$ P^*_{FA} = P\big(z(\mathbf{y}) > \lambda \mid H_1\big) = \int_{\lambda}^{+\infty} f_z(z)\, dz \qquad (9) $$

where $f_z(z)$ is the PDF of z conditioned on the event H1. The problem now is to derive a good estimate of $f_z(z)$. One idea is to use Monte Carlo simulations to estimate the false alarm probability for different values of λ and then choose the threshold λ which leads to the desired false alarm rate. However, this approach is computationally very intensive, especially when the Generalized Gaussian PDF is used to model the coefficients, because the parameters β and α are calculated numerically. A simpler solution for deriving the threshold λ is to rely on the central limit theorem and assume that the PDF of z(y) is Gaussian with mean $\mu_z = E[z(\mathbf{y})]$ and variance $\delta_z^2 = V[z(\mathbf{y})]$ [9]. Equation (9) can then be written as:

$$ P^*_{FA} = \frac{1}{2}\, \mathrm{erfc}\!\left( \frac{\lambda - \mu_z}{\sqrt{2\delta_z^2}} \right) \qquad (10) $$

where erfc(·) is the complementary error function, so:

$$ \lambda = \mathrm{erfc}^{-1}\big(2 P^*_{FA}\big) \sqrt{2\delta_z^2} + \mu_z \qquad (11) $$

The mean $\mu_z$ and the variance $\delta_z^2$ are estimated numerically by evaluating z(y) for n fake sequences $\{\mathbf{m}_i : \mathbf{m}_i \in [-1, 1];\ 1 \le i \le n\}$, so that

$$ \hat{\mu}_z = \frac{1}{n} \sum_{i=1}^{n} z_i \qquad (12) $$

and

$$ \hat{\delta}_z^2 = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \hat{\mu}_z)^2 \qquad (13) $$

where $z_i$ represents the log-likelihood ratio corresponding to the sequence $\mathbf{m}_i$ and n is the number of fake sequences used to evaluate z. The selection of n involves a trade-off between computational complexity and accuracy of the results: the larger n, the better the estimates of $\mu_z$ and $\delta_z^2$, but the higher the computational complexity, which makes the approach less usable in real applications, and vice versa.
2000) [13]. Each image is transformed by the DWT using the Daubechies wavelet at the 3rd level to obtain a low-resolution subband (LL3) and high-resolution horizontal (HL3), vertical (LH3) and diagonal (HH3) subbands. For reasons of imperceptibility and robustness, the watermark is embedded in the HL3, LH3 and HH3 subbands. Each subband is partitioned into blocks of size 16×16 (256 coefficients/block). Blind detection is used, so the parameters α and β of each block are directly estimated from the DWT coefficients of the watermarked image, because it was assumed that the watermarked image is close to the original one (strength γ

2, are regarded as non-uniform patterns. The uniform LBP operator, $LBP^{u2}_{P,R}$, is defined as

$$ LBP^{u2}_{P,R}(x,y) = \begin{cases} I\big(LBP_{P,R}(x,y)\big), \quad I(z) \in [0, (P-1)P+1] & \text{if } U(LBP_{P,R}) \le 2 \\ (P-1)P+2 & \text{otherwise} \end{cases} \qquad (2) $$

where $U(LBP_{P,R}) = \big| s(g_{P-1} - g_c) - s(g_0 - g_c) \big| + \sum_{p=1}^{P-1} \big| s(g_p - g_c) - s(g_{p-1} - g_c) \big|$.
The superscript u2 in Equation 2 indicates that the definition relates to uniform patterns with a U value of at most 2. If U(x) is at most 2, the current pixel is labeled by an index function, I(z); otherwise, it is labeled as (P-1)P+2. The index function I(z), containing (P-1)P+2 indices, is used to assign a particular index to each of the uniform patterns. Some researchers used the LBP operator as one of the face normalization techniques [1] and then directly applied an LDA classifier to the LBP image. However, such an approach will fail in the presence of an image translation or even rotation. The histogram approach, which first summarizes the LBP image statistically, has been proposed to alleviate these problems. As keeping the information about the spatial relation of facial regions is very important for face recognition, the face image is first divided into several small non-overlapping regions of the same size. Uniform pattern histograms are computed over the regions and then concatenated into a single histogram representing the face image.

2.2 Multi-scale Local Binary Patterns

By varying the sampling radius R and combining the LBP images, a multiresolution representation based on LBP, called multi-scale local binary patterns [10], can be obtained. This representation has been suggested for texture classification, and the results reported for this application show that its accuracy is better than that of
the single-scale local binary pattern method. In general, this multiresolution representation can be realized in two ways. First, it can be accomplished by increasing the neighborhood size of the operator. Alternatively, one can down-sample the original image with interpolation or low-pass filtering and then apply an LBP operator of fixed radius. However, the general problem associated with multiresolution analysis is the high dimensionality of the representation combined with the small training sample size, which limits the total number of LBP operators to at most 3. One of the approaches [13] is to employ a feature selection technique to minimize redundant information. We propose another method, which achieves dimensionality reduction by feature extraction.

2.3 Our Approach

In our approach, we combine the multi-scale local binary pattern representation with Linear Discriminant Analysis (LDA). Uniform local binary pattern operators at R scales are first applied to a face image. This generates a grey-level code for each pixel at every resolution. The resulting LBP images, shown in Fig. 1, are cropped to the same size and divided into non-overlapping sub-regions, M0, M1, ..., MJ-1. The regional pattern histogram for each scale is computed according to Equation 3:

$$ H^{u2}_{P,r,j}(i) = \sum_{(x',y') \in M_j} B\big( LBP^{u2}_{P,r}(x',y') = i \big) \qquad (3) $$

where $i \in [0, (P-1)P+2)$, $r \in [1, R]$, $j \in [0, J)$, and $B(x) = \begin{cases} 1 & \text{when } x \text{ is true} \\ 0 & \text{otherwise} \end{cases}$.
B(x) is a Boolean indicator. The set of histograms computed at different scales for region Mj provides regional information. By concatenating these histograms into a single histogram, we obtain the final multiresolution regional face descriptor presented in Equation 4.
$$ F_j = \big[ H^{u2}_{P,1,j}\;\; H^{u2}_{P,2,j}\;\; \ldots\;\; H^{u2}_{P,R,j} \big] \qquad (4) $$
This regional facial descriptor can be used to measure the face similarity by summing the similarities between all the regional histograms. However, by directly applying the similarity measurement to the multi-scale LBP histogram [10], the performance will be compromised. The reason is that this histogram is of high dimensionality and contains redundant information. By adopting the idea from [7], the dimension of the descriptor can be reduced by employing principal component analysis (PCA) before LDA. PCA is used to extract the statistically independent information as a basis for LDA to derive discriminative facial features. Thus a regional discriminative facial descriptor, Dj, is defined by projecting the histogram information, Fj, into LDA space Wjlda, i.e.
$$ D_j = \big( W^{lda}_j \big)^{T} F_j \qquad (5) $$
After the projection, the similarity measurement presented below is obtained by summing the similarity, i.e. normalized correlation, of regional discriminative descriptors.
$$ Sim(I, I') = \sum_j \frac{D_j \cdot D_j'}{\lVert D_j \rVert\, \lVert D_j' \rVert} \qquad (6) $$

Fig. 1. a) original image, b) cropped and normalized face image, c)-l) LBPu2 images at different radii. (Note: gray: non-uniform pattern, white: dark spot, black: bright spot, other colors: rotational uniform patterns, where 8 brightness levels of color denote the rotational angle.)
This discriminative descriptor gives 4 different levels of locality: 1) the local binary patterns contributing to the histogram contain information at the pixel level, 2) the patterns at each scale are summed over a small region to provide information at a regional level, 3) the regional histograms at different scales are concatenated to produce multiresolution information, 4) the global description of face is established by concatenating the regional discriminative facial descriptors.
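To make the four levels of locality concrete, the following sketch traces the whole descriptor pipeline of Equations (3)-(6) with scikit-image and scikit-learn. The choice of libraries, the "nri_uniform" LBP implementation (59 bins for P = 8) and the parameter values are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def multiscale_lbp_descriptors(face, radii=range(1, 11), P=8, k=3):
    """Regional multi-scale LBP histograms F_j (Equations 3-4), one per region M_j."""
    h, w = face.shape
    rows = np.linspace(0, h, k + 1, dtype=int)
    cols = np.linspace(0, w, k + 1, dtype=int)
    feats = []
    for j in range(k * k):
        r, c = divmod(j, k)
        hists = []
        for R in radii:  # one uniform-LBP image per sampling radius
            codes = local_binary_pattern(face, P, R, method="nri_uniform")
            region = codes[rows[r]:rows[r + 1], cols[c]:cols[c + 1]]
            hists.append(np.bincount(region.astype(int).ravel(), minlength=59))
        feats.append(np.concatenate(hists))  # F_j, length 59 * len(radii)
    return feats

def region_projector(train_F_j, train_labels, energy=0.98):
    """PCA (retaining ~98% of the energy) followed by LDA for one region (Equation 5)."""
    pca = PCA(n_components=energy).fit(train_F_j)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(train_F_j), train_labels)
    return lambda F: lda.transform(pca.transform(np.atleast_2d(F))).ravel()

def similarity(D, D_prime):
    """Summed normalized correlation of the regional discriminative descriptors (Eq. 6)."""
    return sum(np.dot(d, dp) / (np.linalg.norm(d) * np.linalg.norm(dp) + 1e-12)
               for d, dp in zip(D, D_prime))
```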
3 Experimental Setup

The goals of identification and verification systems are different. Whereas the goal of identification is to recognize an unknown face image, verification validates a person's identity by comparing the captured face image with her/his image template(s) stored in the system database. However, most researchers evaluate their algorithm only in the identification or only in the verification scenario, which makes the results very difficult to compare. In order to ensure reproducibility of the experiments and comparability with other methods, we tested our approach on the well-known FERET and XM2VTS databases using common protocols. For the FERET database [6], the open-source, publicly available CSU face identification evaluation framework [3] was utilized to test the performance of our
method. In this experiment, only frontal faces are considered. They are divided into a standard gallery (fa set) containing 1196 images of 1196 subjects, and four probe sets, namely the fb set (1195 images containing different facial expressions), the fc set (194 images acquired under different lighting conditions), the dup I set (722 images taken a week later) and the dup II set (234 images taken at least a year later). The CSU standard training set, containing 510 images from the fa set and the dup I set, is used for computing the LDA transformation matrix, Wjlda. The XM2VTS frontal face database [4] contains 2360 images of 295 subjects, captured for verification over 4 sessions in a controlled environment. The testing is performed using the Lausanne protocol, which splits the database into training, evaluation and test sets. The training set has 200 subjects as clients, the evaluation set contains an additional 25 subjects as impostors and the test set another 70 subjects as impostors. There are two configurations of the Lausanne Protocol. In our work, we use Configuration I, in which the client images for training and evaluation were acquired from the first three sessions. The decision of acceptance or rejection is based on a measurement of similarity between the gallery and the average of the client's training images with a global threshold. This threshold is selected at the equal error point, EER, at which the false rejection rate is equal to the false acceptance rate on the evaluation set. For both the XM2VTS and FERET databases, face images are extracted using the provided ground-truth eye positions and scaled to a size of 142×120 (rows × columns). The cropped faces are photometrically normalized by histogram equalization. In total, four parameters are available to optimize the performance of our method. The first one is the LBP parameter, the circularly symmetric neighborhood size P. A large neighborhood increases the length of the histogram and slows down the computation of the similarity measure, while a small neighborhood may result in information loss. We have selected a neighborhood of P=8, containing 59 patterns for LBPu2. The second parameter is the total number of multi-scale operators. A small number of operators cannot provide sufficient information for face recognition, while a large-radius operator not only reduces the size of the corresponding LBP images but also decreases the number of uniform patterns, which tends to degrade the system accuracy. In our experiments, R is set to 10, which means that ten LBP operators are employed to represent the face image. After extracting the LBP images, they are cropped to the same size. The third parameter is the number of regions, k. A large number of small regions increases the computation time as well as degrading the system accuracy in the presence of face localization errors, while a big region increases the loss of spatial information. In this work, an image is partitioned into k×k non-overlapping rectangular regions, where k is optimized empirically. The last parameter controls the PCA transformation matrix. In general, some of the higher-order eigenvectors are removed because they do not contribute to the accuracy of face recognition, and this also saves computation time. In our experiments, the number of eigenvectors kept is determined by the requirement to retain 98% of the energy of the signal [3].
4 Results and Discussions

4.1 Experiments in Face Identification: FERET Database

In this test, the recognition rate at rank 1 and two statistical measures are used to compare the performance of the methods. The measures are the mean recognition rate with a 95% confidence interval and the probability of one algorithm outperforming another, denoted by P(Alg 1 > Alg 2). These measures are computed by permuting the gallery and probe sets; see [3] for details. The results with PCA, BIC and EBGM in the CSU system as benchmarks [3] are reported in Table 1 for comparison. The results of the LBPu28,2 regional histogram method with chi-squared similarity measurement (LBPH_Chi) [9], the LBPu28,2 regional histograms projected onto the LDA space with normalized correlation (LBPH+LDA) and our proposed method (MLBPH+LDA), for different k×k regions, are plotted in Fig. 2. Comparing the mean recognition rates of LBPH_Chi and LBPH+LDA, applying LDA to the representation generated by uniform pattern regional histograms clearly improves the performance, but employing the multi-scale LBP improves the recognition rate even further. As expected for the LBP histogram based methods, the mean recognition rate decreases as the window size increases because of the loss of spatial information, but for our method the mean recognition rate is robust over a wide range of regions, 16 ≥ k > 3.

Table 1. Comparisons on the probe sets and the mean recognition rate of the permutation test with 95% confidence interval on the FERET database with the CSU standard training set

Method | k | Fb | Fc | Dup1 | Dup2 | Lower | Mean | Upper
MLBPH+LDA | 11 | 0.986 | 0.711 | 0.722 | 0.474 | 0.844 | 0.885 | 0.925
LBPH+LDA | 16 | 0.977 | 0.747 | 0.710 | 0.491 | 0.819 | 0.856 | 0.900
LBPH_Chi | 16 | 0.964 | 0.588 | 0.648 | 0.487 | 0.744 | 0.791 | 0.838
PCA_MahCos | - | 0.853 | 0.655 | 0.443 | 0.218 | 0.662 | 0.721 | 0.775
Bayesian_MP | - | 0.818 | 0.351 | 0.508 | 0.299 | 0.669 | 0.720 | 0.769
EBGM_Optimal | - | 0.898 | 0.418 | 0.463 | 0.244 | 0.621 | 0.664 | 0.712

Fig. 2. The mean recognition rate with 95% confidence interval for three LBP methods against different k×k regions.

Fig. 3. The mean recognition rate with 95% confidence interval for LBP based methods and PCA MahCosine against varying the standard deviation of the simulated localization error.
For example, the mean recognition rate with k=3 is 84.8%, while with k=11 it is 88.5%. In other words, changing the number of regions, k, only affects the length of the feature vector and the computation time. In the presence of face localization inaccuracies, the performance of face recognition methods involving spatial information as an input parameter degrades; however, our proposed method using a smaller k can be expected to maintain the recognition accuracy. These findings are discussed further in Section 4.2. In Table 1, the parameter k of the LBP based methods is optimized from the point of view of accuracy and compared with the other methods. The LBP-with-LDA based methods clearly outperform the others in all statistical tests and all probe sets. Comparing MLBP and LBP, both with LDA, the accuracy is not significantly different, but MLBPH+LDA is slightly better, as P(MLBPH+LDA > LBPH+LDA) = 0.898.

4.2 Robustness to Face Localization Error

A generic face recognition system first localizes and segments a face image from the background before recognizing it. However, perfect face localization is very difficult to achieve, and therefore a face recognition method capable of working well in the presence of localization errors is highly desirable. In order to comparatively evaluate the effect of face localization error on the recognition rate of our method on the FERET database, the PCA MahCosine, LBPH+LDA and LBPH_Chi face recognition methods are also implemented. The training images and gallery images (fa set) are registered using the ground-truth eye coordinates, but the probe sets (fb, fc, dup I and II) are registered using simulated eye coordinates, which are the ground-truth eye locations displaced by a random vector perturbation (ΔX, ΔY). These vectors are uncorrelated and normally distributed with zero mean and standard deviation σ from 0 to 10. For the LBP based methods, a large region size parameter, k=3, and a small region size, k=10, are tested. The recognition rates of the LBP based methods using the respective values of parameter k, and of PCA MahCosine, against the standard deviation of the simulated localization error are plotted in Fig. 3. Clearly, the recognition rates of the local region based methods outperform that of PCA. Projecting LBP histograms onto LDA spaces provides a better recognition rate than that achieved in the original histogram space, in spite of the localization error. Also, for the local region based histogram methods, the larger the region size the better the recognition rate as the localization error increases. Most importantly, in the presence of localization error, the recognition rate of MLBPH+LDA using a larger window size is more robust than the others. The main reasons for the superior performance are the combination of the histogram approach and the multiresolution representation.

4.3 Experiments in Face Verification: XM2VTS Database

In the verification tests, the total error rate, TER, which is the sum of the false rejection rate and the false acceptance rate, is used to report the performance of the methods. In this experiment, we compare LBPu28,2 with chi-squared similarity (LBPH_Chi), histogram intersection (LBPH_HI), and our proposed method (MLBPH+LDA), together with the AdaBoost classifier for LBPH [2] (LBPH+AdaBoost). Rodriguez [14] found that the total error rate of LBPH+AdaBoost, 7.88% on the test set, is similar to that of LBPH_Chi, namely 6.8%. Nevertheless, we found that the error rate of LBPH+AdaBoost can be reduced to 5.263% if 300 regional histograms (features) are used.
Table 2. Total Error Rate, TER, according to the Lausanne protocol for Configuration I

Method | k | Manual registration: Eva set (%) | Manual registration: Test set (%) | Automatic registration: Eva set (%) | Automatic registration: Test set (%)
MLBPH+LDA | 3 | 1.74 | 1.48 | 1.53 | 1.99
LBPH+LDA | 6 | 8.39 | 6.74 | - | -
LBPH_Chi [9] | 7 | 11.11 | 8.27 | - | -
LBPH_HI | 7 | 10.28 | 7.94 | - | -
LBPH+AdaBoost [2] | - | 7.37 | 5.26 | - | -
LBPH_MAP [14] | - | - | 2.84 | - | -
LBP+LDA [1] | - | - | 9.12 | - | -
LBP+HMM [1] | - | - | 2.74 | - | -
ICPR2000-Best [5] | - | 5.00 | 4.80 | 14.00 | 13.10
AVBPA03-Best [5] | - | 2.21 | 1.47 | 4.98 | 3.86
ICB2006-Best [5] | - | 1.63 | 0.96² | - | 2.07²

² Note: there is a mistake in the TER of the best algorithm reported in ICB 2006, where the TER, 1.57, is not equal to the sum of the false acceptance rate, 0.57, and the false rejection rate, 1.57.
Table 2 reports the comparative results of the above-mentioned methods, as well as of the methods of Rodriguez et al. [14], [1], and the performance of the best ICPR2000 [5], the best AVBPA2003 [5] and the best ICB2006 [5] algorithms under Lausanne protocol Configuration I. Compared to the other LBP based methods, it is clear that our proposed method, MLBPH+LDA, performs better. However, the result of our method with manual registration is not better than the best result in ICB2006, in which the features were extracted by convolving Gabor filters, with 5 scales and 8 orientations, with the face image. Since the method in ICB2006 uses face images of better resolution than ours, we can expect that the face manifolds of the Gabor feature space are simpler and the associated error rate is lower. Nevertheless, our method is more robust than the others in the presence of face localization inaccuracies, as shown in Table 2.
5 Conclusions

In a real face recognition system, a face image is detected, registered and then identified. However, the accuracy of automatic face localization is not perfect, and therefore face recognition methods working successfully in the presence of localization errors are highly desirable. In this paper, a discriminative descriptor containing the information from a multiresolution analysis of the face image is proposed. The descriptor is formed by projecting the local face image information, acquired by multiple LBP operators, into the LDA space. The recognition is performed by measuring the dissimilarity of the gallery and probe descriptors. Our proposed method has been implemented and compared with existing LBP methods as well as other well-known benchmarks in face identification and verification using the FERET and XM2VTS databases, following their standard protocols. In face identification performed on the FERET database, the experimental results clearly show that the mean recognition rate of 88.5%, with a 95% confidence interval, delivered by our method outperforms the other state-of-the-art contenders. In particular, our method achieved the overall best result of 98.6% recognition rate in the
experiment involving the varying facial expression probe set (fb set), while delivering results comparable to the other LBP based methods on the other probe sets. Also, under the simulated localization error test, our proposed method is clearly more robust than the others because it benefits from the multiresolution information captured by the regional histograms. The proposed method has also been tested in verification mode on the XM2VTS database. With manual registration it achieved the third best result, TER=1.48% on the test set, but with fully automatic registration it outperformed all the other methods by a small margin, achieving TER=1.99%. In conclusion, our method achieves results comparable with the state-of-the-art benchmark methods on manually annotated faces, but it is more robust in the presence of localization errors.
Acknowledgements

Partial support from the EU Network of Excellence Biosecure and from the EPSRC Grant GR/S98528/01 is gratefully acknowledged.
References
1. Heusch, G., Rodriguez, Y., Marcel, S.: Local Binary Patterns as an Image Preprocessing for Face Authentication. In: Proc. FGR 2006, pp. 9–14 (2006)
2. Zhang, G., Huang, Z., Li, S.Z.: Boosting local binary pattern (LBP)-based face recognition. In: SinoBiometrics 2004, pp. 179–186 (2004)
3. Beveridge, J.R., Bolme, D., Teixeira, M., Draper, B.: The CSU Face Identification Evaluation System User's Guide: Version 5.0, Technical Report, C.S. Dept., CSU (2003)
4. Messer, K., Matas, J., Kittler, J., Jonsson, K.: XM2VTSDB: The extended M2VTS database. In: Akumuri, S., Kullman, C. (eds.) AVBPA'99, pp. 72–77 (1999)
5. Messer, K., Kittler, J., Short, J., Heusch, G., Cardinaux, F., Marcel, S., Rodriguez, Y., Shan, S., Su, Y., Gao, W., Chen, X.: Performance Characterisation of Face Recognition Algorithms and Their Sensitivity to Severe Illumination Changes. In: ICB 2006, pp. 1–11 (2006)
6. Phillips, P.J., Moon, H.J., Rizvi, S.A., Rauss, P.J.: The FERET Evaluation Methodology for Face Recognition Algorithms. PAMI 22(10), 1090–1104 (2000)
7. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using Class Specific Linear Projection. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1065, pp. 45–56. Springer, Heidelberg (1996)
8. Shan, S., Zhang, W., Su, Y., Chen, X., Gao, W.: Ensemble of Piecewise FDA based on Spatial Histogram of Local (Gabor) Binary Patterns for Face Recognition. In: ICPR (2006)
9. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns: Application to face recognition. PAMI 28(12), 2037–2041 (2006)
10. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. PAMI 24(7), 971–987 (2002)
11. Zhang, W., Shan, S., Zhang, H., Gao, W., Chen, X.: Multi-resolution histogram of local variation patterns (MHLVP) for Robust Face Recognition. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 937–944. Springer, Heidelberg (2005)
12. Huang, X., Li, S.Z., Wang, Y.: Jensen-Shannon Boosting Learning for Object Recognition. In: Proceedings of CVPR (2005)
13. Raja, Y., Gong, S.: Sparse Multiscale Local Binary Patterns. In: Proc. 17th BMVC (2006)
14. Rodriguez, Y., Marcel, S.: Face authentication using adapted local binary pattern histograms. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 321–332. Springer, Heidelberg (2006)
Histogram Equalization in SVM Multimodal Person Verification Mireia Farrús, Pascual Ejarque, Andrey Temko, and Javier Hernando TALP Research Center, Department of Signal Theory and Communications Technical University of Catalonia, Barcelona, Catalonia {mfarrus,pascual,temko,javier}@gps.tsc.upc.edu
Abstract. It has been shown that prosody helps to improve voice spectrum based speaker recognition systems. Therefore, prosodic features can also be used in multimodal person verification in order to achieve better results. In this paper, a multimodal recognition system based on facial and vocal tract spectral features is improved by adding prosodic information. Matcher weighting method and support vector machines have been used as fusion techniques, and histogram equalization has been applied before SVM fusion as a normalization technique. The results show that the performance of a SVM multimodal verification system can be improved by using histogram equalization, especially when the equalization is applied to those scores giving the highest EER values. Keywords: speaker recognition, multimodality, fusion, support vector machines, histogram equalization, prosody, voice spectrum, face.
1 Introduction Multimodal biometric systems, which involve the combination of two or more human traits, are used to achieve better results than the ones obtained in a monomodal recognition system [1]. In a multimodal recognition system, fusion is possible at three different levels: feature extraction level, matching score level and decision level. Fusion at the score level matches the monomodal scores of different recognition systems in order to obtain a single multimodal score, and it is the preferred method by most of the systems. Matching score level fusion is a two-step process which consists of a previous score normalization and the fusion itself [2-5]. The normalization process transforms the non homogeneous monomodal scores into a comparable range of values. Z-score is a conventional affine normalization technique which transforms the scores into a distribution with zero mean and unitary variance [3, 5]. Histogram equalization (HE) is used as a non linear normalization technique which makes equal the statistics of the monomodal scores. HE can be seen as an extension to the whole statistics of the mean and variance equalization performed by the z-score normalization. The fusion process is a combination of the previously normalized scores. In this paper, two fusion methods are used and compared: matcher weighting and support vector machines. In matcher weighting method, each monomodal score is weighted by a factor proportional to its recognition result. A support vector machine is a binary S.-W. Lee and S.Z. Li (Eds.): ICB 2007, LNCS 4642, pp. 819–827, 2007. © Springer-Verlag Berlin Heidelberg 2007
classifier based on a learning fusion technique, where scores are seen as input patterns to be labeled as accepted or rejected. The aim of this work is to improve the results obtained in our recent work based on the fusion of prosody, voice spectrum and face features where different step strategies were applied [6]. The improvement is achieved with previous histogram equalization as a normalization of the scores in a SVM based fusion. In the next section, the monomodal information sources used in this work are described. Z-score and histogram equalization are presented in section 3. Matcher weighting fusion technique and support vector machines are reviewed in section 4 and, finally, experimental results are shown in section 5.
2 Monomodal Sources 2.1 Voice Information In multimodal person recognition only short-term spectral features are normally used as voice information. However, it has been demonstrated that voice spectrum based systems can be improved by adding prosodic information [7]. Spectral parameters are those which only take into account the acoustical level of the signal, like spectral magnitudes, formant frequencies, etc., and they are more related to the physical traits of the speaker. Cepstral coefficients are the usual way to represent the short-time spectral envelope of a speech frame in current speaker recognition systems. However, Frequency Filtering (FF) parameters, presented in [8] and used in this work, become an alternative to the use of cepstrum in order to overcome some of its disadvantages. Several linguistic levels like lexicon, prosody or phonetics are used by humans to recognize others with voice. These levels of information are more related to learned habits and style, and they are mainly manifested in the dialect, sociolect or idiolect of the speaker. Prosodic parameters, in particular, are manifested as sound duration, tone and intensity variation. Although these features don’t provide very good results when they are used alone, they give complementary information and improve the results when they are fused with vocal tract spectrum based systems. The prosodic recognition system used in this task consists of a total of 9 prosodic scores already used in [9]: • • • • • • • • •
• number of frames per word averaged over all words
• average length of word-internal voiced segments
• average length of word-internal unvoiced segments
• mean F0 logarithm
• maximum F0 logarithm
• minimum F0 logarithm
• F0 range (maximum F0 – minimum F0) logarithm
• F0 "pseudo slope": (last F0 – first F0) / (number of frames in word)
• average slope over all segments of a piecewise linear stylization of F0
(An illustrative computation of a few of these measures is sketched below.)
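The following minimal numpy sketch illustrates how a few of the listed prosodic measures could be computed from an F0 contour. It is not the authors' implementation; the function and variable names (prosodic_features, f0, n_frames_per_word) are illustrative assumptions, and where the paper is ambiguous (e.g. "mean F0 logarithm") a plausible reading is chosen in the comments.

```python
import numpy as np

def prosodic_features(f0, n_frames_per_word):
    """Illustrative computation of a few of the prosodic scores listed above.

    f0: 1-D array of F0 values (Hz) for the voiced frames of one word.
    n_frames_per_word: number of frames of every word in the conversation side.
    """
    log_f0 = np.log(f0)
    return {
        "frames_per_word": float(np.mean(n_frames_per_word)),
        "log_mean_f0": float(np.log(np.mean(f0))),   # read as log of the mean F0
        "log_max_f0": float(np.max(log_f0)),
        "log_min_f0": float(np.min(log_f0)),
        "log_f0_range": float(np.log(np.max(f0) - np.min(f0))),
        # "pseudo slope": (last F0 - first F0) / number of frames in the word
        "f0_pseudo_slope": float((f0[-1] - f0[0]) / len(f0)),
    }

# toy usage with a short synthetic F0 contour
print(prosodic_features(np.array([110.0, 118.0, 125.0, 121.0]), [4, 6, 5]))
```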
2.2 Face Information
Facial recognition systems are based on the idea that a face can be represented as a collection of sparsely distributed parts: eyes, nose, cheeks, mouth, etc. Non-negative matrix factorization (NMF), introduced in [10], is an appearance-based face recognition technique related to conventional component analysis techniques; it does not use the information about how the facial images are separated into different facial classes. The most straightforward way to exploit discriminant information in NMF is to discover discriminant projections for the facial image vectors after the projection. The face recognition scores used in this work have been calculated in this way with the NMF-faces method [11], in which the final basis images are closer to facial parts.
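As an illustration of the parts-based decomposition mentioned above, the following hedged sketch runs scikit-learn's generic NMF on vectorized face images; it is plain NMF rather than the discriminant NMF-faces method of [11], and the array shapes, random data and parameter values are assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# X: one non-negative, vectorized face image per row (toy random data here).
rng = np.random.default_rng(0)
X = rng.random((20, 32 * 32))

# Factorize X ≈ W H: rows of H act as parts-like basis images, rows of W are
# the per-image coefficients a matcher could compare.
nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)      # image encodings (20 x 10)
H = nmf.components_           # basis images   (10 x 1024)
print(W.shape, H.shape)
```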
3 Histogram Equalization
Z-score (ZS) is one of the most conventional normalization methods; it transforms the scores into a distribution with zero mean and unitary variance. Denoting by a a raw matching score from the set A of all the original monomodal biometric scores, the z-score normalized biometric is computed as:
x_{ZS} = \frac{a - \mathrm{mean}(A)}{\mathrm{std}(A)} \qquad (1)
where mean(A) is the statistical mean of A and std(A) the standard deviation. Histogram equalization (HE) is a general non parametric method to match the cumulative distribution function (CDF) of some given data to a reference distribution. This technique can be seen as an extension of the statistical normalization made by the z-score to whole biometric statistics. Histogram equalization is a widely used non linear method designed for the enhancement of images. HE employs a monotonic, non linear mapping which reassigns the intensity values of pixels in the input image in order to control the shape of the output image intensity histogram to achieve a uniform distribution of intensities or to highlight certain intensity levels. This method has been also developed for the speech recognition adaptation approaches and the correction of non linear effects typically introduced by speech systems such as microphones, amplifiers, clipping and boosting circuits and automatic gain control circuits [12, 13]. The objective of HE is to find a non linear transformation to reduce the mismatch of the statistics of two signals. In [14, 15] this concept was applied to the acoustic features to improve the robustness of a speaker verification system. On the other hand, in this paper HE is applied to the scores. N intervals with the same probability are assigned in the distributions of both signals. Each interval in the reference distribution, x ∈ [ qi , qi +1 [ , is represented by (xi, F(xi)). xi is the average of the scores B
and F(x_i) is the maximum cumulative distribution value:
x_i = \frac{1}{k_i} \sum_{j=1}^{k_i} x_{ij}, \qquad F(x_i) = \frac{K_i}{M} \qquad (2)
where x_{ij} are the scores in the interval, k_i is the number of scores in the interval, K_i is the number of data in the interval [q_0, q_{i+1}[, and M is the total amount of data. All the scores in each interval of the source distributions are assigned to the corresponding interval in the reference distribution. F(x_i) sets the boundaries [q'_i, q'_{i+1}[ of the intervals in the distribution to be equalized. These boundaries limit the interval of values that fulfils the condition F(q_i) ≤ F(y) < F(q_{i+1}), and all the values of the source signal lying in the interval [q'_i, q'_{i+1}[ will be transformed to their corresponding x_i value.
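A minimal sketch of this score equalization, assuming the source and reference score sets are available as 1-D arrays. The function name equalize_scores and the handling of the upper boundary are my own choices for illustration, not the authors' implementation.

```python
import numpy as np

def equalize_scores(source, reference, n_intervals=1000):
    """Map 'source' scores onto the distribution of 'reference' scores.

    Both distributions are cut into n_intervals equal-probability intervals;
    every source score falling in interval i is replaced by the mean of the
    reference scores in that interval (the x_i of Eq. (2)).
    """
    probs = np.linspace(0.0, 1.0, n_intervals + 1)
    src_bounds = np.quantile(source, probs)       # q'_i boundaries
    ref_bounds = np.quantile(reference, probs)    # q_i boundaries
    out = np.empty_like(source, dtype=float)
    for i in range(n_intervals):
        lo, hi = src_bounds[i], src_bounds[i + 1]
        last = (i == n_intervals - 1)
        mask = (source >= lo) & ((source < hi) | last)
        ref_in = reference[(reference >= ref_bounds[i]) &
                           (reference <= ref_bounds[i + 1])]
        out[mask] = ref_in.mean() if ref_in.size else ref_bounds[i]
    return out
```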
4 Fusion Techniques and Support Vector Machines
One of the most conventional fusion techniques is the matcher weighting (MW) method, where each monomodal score is weighted by a factor proportional to the recognition rate of its biometric, so that the weights for more accurate matchers are higher than those of less accurate matchers. When the Equal Error Rate (EER) is used, the weighting factor of every biometric is proportional to the inverse of its EER. Denoting by w_m and e_m the weighting factor and the EER of the m-th biometric x_m, and by M the number of biometrics, the final fused score u is expressed as [1, 3]:
u = \sum_{m=1}^{M} w_m x_m, \qquad \text{where} \qquad w_m = \frac{1/e_m}{\sum_{m=1}^{M} 1/e_m} \qquad (3)
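A small illustrative sketch of Eqs. (1) and (3), assuming the development-set statistics and the monomodal EERs are given; the toy numbers simply reuse the EERs of Table 2 as an example, and the function names are assumptions.

```python
import numpy as np

def z_score(scores, dev_scores):
    # Eq. (1): normalize with the mean / std estimated on the development set
    return (scores - np.mean(dev_scores)) / np.std(dev_scores)

def matcher_weighting(score_lists, eers):
    # Eq. (3): weights proportional to the inverse EER of each matcher
    inv = 1.0 / np.asarray(eers, dtype=float)
    weights = inv / inv.sum()
    return sum(w * s for w, s in zip(weights, score_lists))

# toy example: prosodic, spectral and face scores for the same two trials,
# weighted with the monomodal EERs of Table 2
fused = matcher_weighting(
    [np.array([0.2, 1.1]), np.array([0.7, 0.9]), np.array([1.5, 2.0])],
    eers=[13.39, 10.10, 2.06],
)
print(fused)
```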
In contrast to the MW that is a linear and a data-driven fusion method, non linear and machine learning based methods may lead to a higher performance. Learning based fusion can be treated as a pattern classification problem in which the scores obtained with individual classifiers are seen as input patterns to be labeled as ‘accepted’ or ‘rejected’. Recent works on statistical machine learning have shown the advantages of discriminative classifiers like SVM [16] in a range of applications. Support vector machine (SVM) is a state-of-the-art binary classifier. Given a linearly separable twoclass training data, SVM finds an optimal hyperplane that splits input data in two classes, maximizing the distance of the hyperplane to the nearest data points of each class. However, data are normally not linearly separable. In this case, non linear decision functions are needed, and an extension to non linear boundaries is achieved by using specific functions called kernel functions [17]. Kernel functions map the data of the input space to a higher dimensional space (feature space) by a non linear transformation. The optimal hyperplane for a non linearly separable data is defined by:
f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + b \qquad (4)
where t_i are the labels, K is a chosen kernel function and \sum_{i=1}^{N} \alpha_i t_i = 0. The vectors x_i are
the support vectors, which determine the optimal separating hyperplane and correspond to the points of each class that are the closest to the separating hyperplane.
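The following hedged sketch shows the learning-based fusion idea with scikit-learn's RBF-kernel SVM: the monomodal scores of a trial are the input pattern, and the value of the decision function (Eq. (4)) serves as the fused similarity score. It is an illustration only; the authors' actual SVM implementation and parameters are not specified here, and all data below are random toy data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy development data: rows = trials, columns = monomodal scores
# (e.g. prosody, voice spectrum, face); 1 = client, 0 = impostor
X_dev = np.vstack([rng.normal(1.0, 0.5, (200, 3)),
                   rng.normal(-1.0, 0.5, (300, 3))])
y_dev = np.hstack([np.ones(200), np.zeros(300)])

svm = SVC(kernel="rbf", gamma="scale")   # Gaussian radial basis kernel, as in the paper
svm.fit(X_dev, y_dev)

# fused score of unseen trials: signed output of the learned decision function
X_test = rng.normal(0.5, 0.5, (5, 3))
print(svm.decision_function(X_test))
```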
5 Recognition Experiments
In the next subsection, the monomodal recognition systems used in the fusion experiments are described. Experimental results obtained with the different normalization and fusion techniques are shown in Section 5.2.
5.1 Experimental Setup
Recognition experiments have been performed with the Switchboard-I speech database [18] and the video and speech XM2VTS database of the University of Surrey [19]. Switchboard-I database, which is a collection of 2430 two-sided telephone conversations among 543 speakers from all areas of the United States, has been used for the speaker recognition experiments. Speaker scores have been obtained by using two different systems: a voice spectrum based recognition system and a prosody based recognition system. The spectrum based speaker recognition system used is a 32-component GMM system with diagonal covariance matrices; 20 Frequency Filtering parameters were generated with a frame size of 30 ms and a shift of 10 ms, and 20 corresponding delta and acceleration coefficients were included. In the prosody based recognition system a 9-feature vector was extracted for each conversation side. The mean and standard deviation over all words were computed for each individual feature. The system was tested using the k-Nearest Neighbor classifier (with k=3), comparing the distance of the test feature vector to the k closest vectors of the claimed speakers and the distance of the test vector to the k closest vectors of the cohort speakers, and using the symmetrized Kullback-Leibler divergence as a distance measure. In both spectral and prosodic systems, each speaker model was trained with 8 conversation sides. Training was performed using splits 1-3 of the Switchboard-I database, and splits 4-6 were provided the cohort speakers and the UBM. Both systems were tested with one conversation-side according to NIST’s 2001 Extended Data task. Face recognition experiments were performed with the XM2VTS database, which is a multimodal database consisting of face images, video sequences and speech recordings of 295 subjects. Only the face images have been used in our experiments. In order to evaluate the verification algorithms on the database, the evaluation protocol described in [19] was followed. The well-known Fisher discriminant criterion was constructed as [20] in order to discover discriminant linear projections and to obtain the facial scores. Fusion experiments have been done at the matching score level. Since both databases contain biometric characteristics belonging to different users, a chimerical
database has been created to perform the experiments. A chimerical database is an artificial database created using two or more monomodal biometric characteristics from different individuals to form artificial (or chimerical) users. In this paper, the chimerical database consists of 30661 users created by combining 179 different voices of the Switchboard-I database with 270 different faces of the XM2VTS database. The scores were then split in two equal sets (development and test) for each recognition system, obtaining a total amount of 46500 scores for each set (16800 clients and 29700 impostors). The kernel function used in the SVM was a Gaussian radial basis function. Scores are always equalized to the histogram corresponding to the best scores involved in the fusion; i.e. those scores that provided the lowest EER. A 1000-interval histogram was applied before each SVM fusion, and both SVM and HE-SVM techniques are compared to a baseline system which uses MW fusion and z-score normalization. 5.2 Verification Results Monomodal systems. Table 1 shows the EER obtained for each prosodic feature in the prosody based recognition system. As it can be seen in the table, features based on fundamental frequency measurements achieve the lowest EER. T
Table 1. EER for each prosodic feature

Feature                                                     EER (%)
Log(#frames/word)                                           30.3
Average length of word-internal voiced segments             31.5
Average length of word-internal unvoiced segments           31.5
Log(mean_F0)                                                19.2
Log(max_F0)                                                 21.3
Log(min_F0)                                                 21.5
Log(range_F0)                                               26.6
F0 'pseudo-slope'                                           38.3
Average slope over all segments of PWL stylization of F0    28.7
The EERs obtained in each monomodal recognition system are shown in Table 2. Note that fusion is only used in the prosodic system, where there are 9 prosodic scores to be combined. In this case, fusion is carried out in one single step, and the results of the three types of fusion mentioned above are presented: (1) matching score fusion with z-score normalization, (2) support vector machines and (3) support vector machines with a previous histogram equalization.

Table 2. EER (%) for each monomodal recognition system

Source              EER (%)
Prosody (ZS-MW)     15.66
Prosody (SVM)       14.65
Prosody (HE-SVM)    13.39
Voice spectrum      10.10
Face                 2.06
Bimodal systems. Table 3 shows the fusion results for two bimodal systems: a prosody based system fused with the voice spectral recognition system, and a voice spectrum based system fused with the face recognition system, using the same fusion methods as above. As in the monomodal systems (Table 2), matcher weighting fusion is slightly worse than the support vector machines.

Table 3. EER (%) for each bimodal recognition system

Source                      ZS-MW    SVM     HE-SVM
Prosody + voice spectrum    7.44     6.84    6.25
Voice spectrum + face       1.83     0.99    1.02
Trimodal system. In [6] several strategies were proposed by fusing the monomodal scores in one, two and three steps. In those experiments the best results were achieved in a two-step configuration, where the 9 prosodic scores were fused in the first step (F1) and the obtained scores were then fused in the second step (F2) with the voice spectral and facial scores (Fig. 1).
Fig. 1. Two-step fusion
The EERs for the selected two-step fusion are presented in Table 4. Once again, the matcher weighting fusion method is clearly outperformed by the support vector machines.

Table 4. EER (%) for the trimodal system

Fusion technique    EER (%)
ZS-MW               1.493
SVM                 0.647
In our trimodal system, equalization has been applied in all the possible combinations:
(1) HE before the first fusion (equalization of the prosodic scores)
(2) HE before the second fusion (equalization of the three modalities)
(3) HE before both fusions F1 and F2
The results are shown in Table 5 and they are compared to the non-equalized SVM fusion.
Table 5. EER (%) applying HE before SVM fusion

Fusion technique    HE before F1    HE before F2    EER (%)
SVM                 –               –               0.647
SVM                 HE              –               0.630
SVM                 –               HE              0.649
SVM                 HE              HE              0.613
As can be seen in the table, the best result is achieved when histogram equalization is used before both F1 and F2. By equalizing only the prosodic scores, the performance of the system is also improved. On the other hand, equalization before the second fusion alone does not improve the performance of the system.
6 Conclusions
In this work, the use of prosody improves the performance of a bimodal system based on vocal tract spectrum and face features. The experiments show that support vector machine based fusion is clearly better than the matcher weighting fusion method. In addition, results are improved by applying histogram equalization as a normalization technique before SVM fusion. The best verification results are achieved when the histograms of the scores with the highest EER values (the prosodic scores in our experiments) are equalized to the distribution of the scores that provide the lowest EER.
References 1. Bolle, R.M., et al.: Guide to Biometrics, p. 364. Springer, New York (2004) 2. Fox, N.A., et al.: Person identification using automatic integration of speech, lip and face experts. In: ACM SIGMM 2003 Multimedia Biometrics Methods and Applications Workshop, Berkeley, CA. ACM, New York (2003) 3. Indovina, M., et al.: Multimodal Biometric Authentication Methods: A COTS Approach. In: MMUA. Workshop on Multimodal User Authentication, Santa Barbara, CA (2003) 4. Lucey, S., Chen, T.: Improved audio-visual speaker recognition via the use of a hybrid combination strategy. In: The 4th International Conference on Audio- and Video- Based Biometric Person Authentication, Guildford, UK (2003) 5. Wang, Y., Tan, T.: Combining fingerprint and voiceprint biometrics for identity verification: and experimental comparison. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072. Springer, Heidelberg (2004) 6. Farrús, M., et al.: On the Fusion of Prosody, Voice Spectrum and Face Features for Multimodal Person Verification. In: ICSLP, Pittsburgh (2006) 7. Campbell, J.P., Reynolds, D.A., Dunn, R.B.: Fusing high- and low-level features for speaker recognition. In: Eurospeech (2003) 8. Nadeu, C., Hernando, J., Gorricho, M.: On the decorrelation of filter bank energies in speech recognition. In: Eurospeech (1995) 9. Peskin, B., et al.: Using prosodic and conversational features for high-performance speaker recognition: Report from JHU WS’02. In: ICASSP (2003)
10. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems: Proceedings of the 2000 Conference. MIT Press, Cambridge (2001) 11. Zafeiriou, S., Tefas, A., Pitas, I.: Discriminant NMF-faces for frontal face verification. In: IEEE International Workshop on Machine Learning for Signal Processing, Mystic, Connecticut, IEEE, Los Alamitos (2005) 12. Hilger, F., Ney, H.: Quantile based histogram equalization for noise robust speech recognition. In: Eurospeech, Aalborg, Denmark (2001) 13. Balchandran, R., Mammone, R.: Non parametric estimation and correction of non linear distortion in speech systems. In: ICASSP (1998) 14. Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. In: ODYSSEY-2001 (2001) 15. Skosan, M., Mashao, D.: Modified Segmental Histogram Equalization for robust speaker verification. Pattern Recognition Letters 27(5), 479–486 (2006) 16. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, Cambridge (2000) 17. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge discovery 2, 121–167 (1998) 18. Godfrey, J.J., Holliman, E.C., McDaniel, J.: Switchboard: Telephone speech corpus for research and development. In: ICASSP (1990) 19. Lüttin, J., Maître, G.: Evaluation Protocol for the Extended M2VTS Database (XM2VTSDB). In: IDIAP, Martigny, Switzerland (1998) 20. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
Learning Multi-scale Block Local Binary Patterns for Face Recognition Shengcai Liao, Xiangxin Zhu, Zhen Lei, Lun Zhang, and Stan Z. Li Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun Donglu, Beijing 100080, China {scliao,xxzhu,zlei,lzhang,szli}@nlpr.ia.ac.cn http://www.cbsr.ia.ac.cn
Abstract. In this paper, we propose a novel representation, called Multiscale Block Local Binary Pattern (MB-LBP), and apply it to face recognition. The Local Binary Pattern (LBP) has been proved to be effective for image representation, but it is too local to be robust. In MB-LBP, the computation is done based on average values of block subregions, instead of individual pixels. In this way, MB-LBP code presents several advantages: (1) It is more robust than LBP; (2) it encodes not only microstructures but also macrostructures of image patterns, and hence provides a more complete image representation than the basic LBP operator; and (3) MB-LBP can be computed very efficiently using integral images. Furthermore, in order to reflect the uniform appearance of MB-LBP, we redefine the uniform patterns via statistical analysis. Finally, AdaBoost learning is applied to select most effective uniform MB-LBP features and construct face classifiers. Experiments on Face Recognition Grand Challenge (FRGC) ver2.0 database show that the proposed MB-LBP method significantly outperforms other LBP based face recognition algorithms. Keywords: LBP, MB-LBP, Face Recognition, AdaBoost.
1
Introduction
Face recognition from images has been a hot research topic in computer vision for the past two decades, because face recognition has potential application value as well as theoretical challenges. Many appearance-based approaches have been proposed to deal with face recognition problems. The holistic subspace approach, such as PCA [14] and LDA [3] based methods, has significantly advanced face recognition techniques. Using PCA, a face subspace is constructed to represent "optimally" only the face; using LDA, a discriminant subspace is constructed to distinguish "optimally" faces of different persons. Another approach is to construct a local appearance-based feature space, using appropriate image filters, so that the distributions of faces are less affected by various changes. Local feature analysis (LFA) [10] and Gabor wavelet-based features [16,6] are among these.
Recently, Local Binary Patterns (LBP) is introduced as a powerful local descriptor for microstructures of images [8]. The LBP operator labels the pixels of an image by thresholding the 3 × 3-neighborhood of each pixel with the center value and considering the result as a binary string or a decimal number. Recently, Ahonen et al proposed a novel approach for face recognition, which takes advantage of the Local Binary Pattern (LBP) histogram [1]. In their method, the face image is equally divided into small sub-windows from which the LBP features are extracted and concatenated to represent the local texture and global shape of face images. Weighted Chi square distance of these LBP histograms is used as a dissimilarity measure of different face images. Experimental results showed that their method outperformed other well-known approaches such as PCA, EBGM and BIC on FERET database. Zhang et al [17] propose to use AdaBoost learning to select best LBP sub-window histograms features and construct face classifiers. However, the original LBP operator has the following drawback in its application to face recognition. It has its small spatial support area, hence the bit-wise comparison therein made between two single pixel values is much affected by noise. Moreover, features calculated in the local 3 × 3 neighborhood cannot capture larger scale structure (macrostructure) that may be dominant features of faces. In this work, we propose a novel representation, called Multi-scale Block LBP (MB-LBP), to overcome the limitations of LBP, and apply it to face recognition. In MB-LBP, the computation is done based on average values of block subregions, instead of individual pixels. This way, MB-LBP code presents several advantages: (1) It is more robust than LBP; (2) it encodes not only microstructures but also macrostructures of image patterns, and hence provides a more complete image representation than the basic LBP operator; and (3) MB-LBP can be computed very efficiently using integral images. Considering this extension, we find that the property of the original uniform LBP patterns introduced by Ojala et al [9] can not hold to be true, so we provide a definition of statistically effective LBP code via statistical analysis. While a large number of MB-LBP features result at multiple scales and multiple locations, we apply AdaBoost learning to select most effective uniform MB-LBP features and thereby construct the final face classifier. The rest of this paper is organized as follows: In Section 2, we introduce the MB-LBP representation. In Section 3, the new concept of uniform patterns are provided via statistical analysis. In Section 4, A dissimilarity measure is defined to discriminate intra/extrapersonal face images, and then we apply the AdaBoost learning for MB-LBP feature selection and classifier construction. The experiment results are given in Section 5 with the FRGC ver2.0 data sets [11]. Finally, we summarize this paper in Section 6.
2
Multi-scale Block Local Binary Patterns
The original LBP operator labels the pixels of an image by thresholding the 3 × 3 neighborhood of each pixel with the center value and considering the result as
Fig. 1. (a) The basic LBP operator. (b) The 9×9 MB-LBP operator. In each sub-region, average sum of image intensity is computed. These average sums are then thresholded by that of the center block. MB-LBP is then obtained.
a binary string or a decimal number. Then the histogram of the labels can be used as a texture descriptor. An illustration of the basic LBP operator is shown in Fig. 1(a). Multi-scale LBP [9] is an extension of the basic LBP with respect to neighborhoods of different sizes. In MB-LBP, the comparison operator between single pixels in LBP is simply replaced with a comparison between average gray-values of sub-regions (cf. Fig. 1(b)). Each sub-region is a square block containing neighboring pixels (or just one pixel in particular). The whole filter is composed of 9 blocks. We take the size s of the filter as a parameter, with s × s denoting the scale of the MB-LBP operator (in particular, 3 × 3 MB-LBP is in fact the original LBP). Note that the scalar values of averages over blocks can be computed very efficiently [13] from the summed-area table [4] or integral image [15]. For this reason, MB-LBP feature extraction can also be very fast: it incurs only a little more cost than the original 3 × 3 LBP operator. Fig. 2 gives examples of MB-LBP filtered face images with 3 × 3, 9 × 9 and 15 × 15 blocks. From this example we can see what influence the parameter s has. For a small scale, the local, micro patterns of a face structure are well represented, which may be beneficial for discriminating local details of faces. On the
Fig. 2. MB-LBP filtered images of two different faces. (a) original images; (b) filtered by 3 × 3 MB-LBP (c) filtered by 9 × 9 MB-LBP; (d) filtered by 15 × 15 MB-LBP.
other hand, using average values over the regions, the large scale filters reduce noise and make the representation more robust; large scale information also provides complementary information to small scale details. But much discriminative information is dropped as well. Normally, filters of various scales should be carefully selected and then fused to achieve better performance.
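A minimal numpy sketch of the s × s MB-LBP code of a single pixel, assuming s is a multiple of 3 and the 3 × 3 block grid fits inside the image. Block sums are read from an integral image (for equal-sized blocks this is equivalent to comparing averages); the neighbour ordering and the >= threshold are illustrative conventions, not necessarily the authors' exact ones.

```python
import numpy as np

def mb_lbp_code(img, x, y, s):
    """MB-LBP code at pixel (x, y) for an s x s operator (s divisible by 3)."""
    b = s // 3                                    # side of one block
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)   # integral image

    def block_sum(r, c):                          # sum of the b x b block at (r, c)
        return ii[r + b, c + b] - ii[r, c + b] - ii[r + b, c] + ii[r, c]

    r0, c0 = y - b // 2 - b, x - b // 2 - b       # top-left block of the 3 x 3 grid
    sums = [[block_sum(r0 + i * b, c0 + j * b) for j in range(3)] for i in range(3)]
    center = sums[1][1]
    # neighbours in a fixed circular order, thresholded by the centre block
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if sums[i][j] >= center else 0 for i, j in order]
    return int("".join(map(str, bits)), 2)

img = np.arange(100, dtype=float).reshape(10, 10)
print(mb_lbp_code(img, 5, 5, 9))                  # 9 x 9 MB-LBP at the image centre
```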
Fig. 3. Differential images. (a)(b) are the original intra-personal images, (c)(d)(e) are the corresponding differential images generated by 3 × 3, 9 × 9 and 15 × 15 MBLBP operators. And (f)(g) are the original extra-personal images, (h)(i)(j) are the corresponding differential images generated by 3 × 3, 9 × 9 and 15 × 15 MB-LBP operators.
Fig. 3 demonstrates MB-LBP features for intra-personal and extra-personal difference images (brighter pixels indicate greater difference). Differential images give us another way to demonstrate the discriminative power of MB-LBP. We see that as the scale of the MB-LBP filter becomes larger, although both the intra-personal and the extra-personal variance decrease, it becomes clearer that the intra-personal differences are smaller than the extra-personal ones.
3
Statistically Effective MB-LBP (SEMB-LBP)
Using the "uniform" subset of LBP codes improves the performance of LBP based methods. A Local Binary Pattern is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa when the binary string is considered circular [9]. It is observed that there are a limited number of transitions or discontinuities in the circular presentation of the 3 × 3 texture patterns; according to Ojala, these uniform patterns are fundamental properties of local image texture and they account for the vast majority of all patterns: 90% for the (8,2) type LBP, 70% for the (16,2) type [9]. However, the original definition of uniform LBP patterns based on the transition of pixel values cannot be used for MB-LBP with blocks containing more
than a single pixel. The reason is obvious: the same properties of circular continuity in 3 × 3 patterns cannot hold when the parameter s becomes larger. Using average gray-values instead of single pixels, MB-LBP reflects the statistical properties of the local sub-region relationship. The larger the parameter s becomes, the harder it is for the circular presentation to be continuous. So we need to redefine uniform patterns. Since the term uniform refers to the uniform appearance of the local binary pattern, we can define our uniform MB-LBP via statistical analysis. Here, we present the concept of Statistically Effective MB-LBP (SEMB-LBP), based on the percentage in distributions, instead of the number of 0-1 and 1-0 transitions as in the uniform LBP. Denote f_s(x, y) as an MB-LBP feature of scale s at location (x, y) computed from original images. Then a histogram of the MB-LBP feature f_s(·, ·) over a certain image I(x, y) can be defined as:
H_s(l) = \sum_{x,y} 1[f_s(x,y) = l], \qquad l = 0, \ldots, L-1 \qquad (1)
where 1(S) is the indicator of the set S, and l is the label of the MB-LBP code. Because all MB-LBP codes are 8-bit binary strings, there are in total L = 2^8 = 256 labels, so the histogram has 256 bins. This histogram contains information about the distribution of the MB-LBP features over the whole image. The histograms on the large still training set of the FRGC ver2.0, for the scales 3 × 3, 9 × 9, 15 × 15, and 21 × 21, are shown in Fig. 4. We sort the bins of a histogram according to their percentage in the histogram. A statistical analysis shows the following:
• For the 3 × 3 LBP operator, the top 58 bins correspond to the so-called uniform patterns;
• However, for MB-LBP filters with blocks containing more than one pixel, the top 58 bins are not the same as those for the 3 × 3 LBP filters.
Fig. 4. Histograms of MB-LBP labels on the large still training set of the FRGC ver2.0. (a)3 × 3, (b)9 × 9, (c)15 × 15, and (d)21 × 21.
To reflect the uniform appearance of MB-LBP, we define the statistically effective MB-LBP (SEMB-LBP) set of scale s as follows:
SEMB\text{-}LBP_s = \{\, l \mid Rank[H_s(l)] < N \,\} \qquad (2)
where Rank[Hs (l)] is the index of Hs (l) after descending sorting, and N is the number of uniform patterns. For the original uniform local binary patterns, N
= 58. However, in our definition, N can be assigned arbitrarily from 1 to 256. A large value of N makes the feature dimension very high, while a small one loses feature variety. Consequently, we adopt N = 63 as a trade-off. Labeling all remaining patterns with a single label, we use the following notation to represent SEMB-LBP:
u_s(x, y) = \begin{cases} Index_s[f_s(x, y)], & \text{if } f_s(x, y) \in SEMB\text{-}LBP_s \\ N, & \text{otherwise} \end{cases} \qquad (3)
where Index_s[f_s(x, y)] is the index of f_s(x, y) in the set SEMB-LBP_s (starting from 0).
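A hedged sketch of Eqs. (1)-(3): the 256-bin label histogram is accumulated over training images, the N most frequent codes form the SEMB-LBP set, and every other code is mapped to the extra label N. Function and variable names are assumptions, not the authors' code, and the training data below are random toy arrays.

```python
import numpy as np

def semb_lbp_mapping(mb_lbp_images, n_keep=63):
    """Return a 256-entry lookup table implementing Eqs. (2) and (3).

    mb_lbp_images: iterable of integer arrays holding MB-LBP codes (0..255)
    of one scale s, computed over the training images.
    """
    hist = np.zeros(256, dtype=np.int64)              # H_s(l) of Eq. (1)
    for codes in mb_lbp_images:
        hist += np.bincount(codes.ravel(), minlength=256)
    rank = np.argsort(-hist)                          # codes sorted by frequency
    lut = np.full(256, n_keep, dtype=np.int64)        # label for all remaining patterns
    for idx, code in enumerate(rank[:n_keep]):
        lut[code] = idx                               # Index_s of Eq. (3)
    return lut

# usage: relabel a filtered image so only statistically effective codes survive
rng = np.random.default_rng(0)
train = [rng.integers(0, 256, (16, 16)) for _ in range(10)]
lut = semb_lbp_mapping(train)
u = lut[train[0]]                                     # u_s(x, y) of Eq. (3)
print(u.min(), u.max())
```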
4
AdaBoost Learning
The above SEMB-LBP provides an over-complete representation. The only question remained is how to use them to construct a powerful classifier. Because those excessive measures contain much redundant information, a further processing is needed to remove the redundancy and build effective classifiers. In this paper we use Gentle AdaBoost algorithm [5] to select the most effective SEMB-LBP feature. Boosting can be viewed as a stage-wise approximation to an additive logistic regression model using Bernoulli log-likelihood as a criterion [5]. Developed by Friedman et al, Gentle AdaBoost modifies the population version of the Real AdaBoost procedure [12], using Newton stepping rather than exact optimization at each step. Empirical evidence suggests that Gentle AdaBoost is a more conservative algorithm that has similar performance to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue. Face recognition is a multi-class problem whereas the above AdaBoost learning is for two classes. To dispense the need for a training process for faces of a newly added person, we use a large training set describing intra-personal or extra-personal variations [7], and train a “universal” two-class classifier. An ideal intra-personal difference should be an image with all pixel values being zero, whereas an extra-personal difference image should generally have much larger pixel values. However, instead of deriving the intra-personal or extra-personal variations using difference images as in [7], the training examples to our learning algorithm is the set of differences between each pair of local histograms at the corresponding locations. The positive examples are derived from pairs of intrapersonal differences and the negative from pairs of extra-personal differences. In this work, a weak classifier is learned based on a dissimilarity between two corresponding histogram bins. Once the SEMB-LBP set are defined for each scale, histogram of these patterns are computed for calculating the dissimilarity: First, a sequence of m subwindows R0 , R1 , . . . , Rm−1 of varying sizes and locations are obtained from an image; Second, a histogram is computed for each SEMB-LBP code i over a subwindow Rj as Hs,j (i) = 1[us (x,y)=i] · 1[(x,y)∈Rj ] ,
i = 0, . . . , N, j = 0, . . . , m − 1
(4)
The corresponding bin difference is defined as
D\bigl(H^1_{s,j}(i), H^2_{s,j}(i)\bigr) = \bigl|H^1_{s,j}(i) - H^2_{s,j}(i)\bigr|, \qquad i = 0, \ldots, N \qquad (5)
The best current weak classifier is the one for which the weighted intra-personal bin differences (over the training set) are minimized while those of the extra-personal pairs are maximized. With the two-class scheme, the face matching procedure works in the following way: it takes a probe face image and a gallery face image as the input, computes a difference-based feature vector from the two images, and then calculates a similarity score for the feature vector using the learned AdaBoost classifier. Finally, a decision is made based on the score, to classify the feature vector into the positive class (coming from the same person) or the negative class (different persons).
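The following simplified sketch illustrates the weak-classifier ingredient described above: the per-bin histogram difference of Eq. (5) as a feature, and a threshold ("stump") chosen by weighted error over intra-/extra-personal pairs. It is only a stand-in for illustration; the paper uses Gentle AdaBoost with regression-style weak learners, which is not reproduced here, and all data below are toy.

```python
import numpy as np

def bin_difference(hist_a, hist_b):
    # Eq. (5): per-bin dissimilarity of two sub-window histograms
    return np.abs(hist_a - hist_b)

def best_stump(features, labels, weights):
    """Pick the (feature, threshold) whose sign best separates intra- (+1)
    from extra-personal (-1) pairs under the current boosting weights."""
    best = (None, None, np.inf)
    for f in range(features.shape[1]):
        for thr in np.unique(features[:, f]):
            pred = np.where(features[:, f] <= thr, 1, -1)   # small difference -> same person
            err = weights[pred != labels].sum()
            if err < best[2]:
                best = (f, thr, err)
    return best

rng = np.random.default_rng(0)
intra = rng.exponential(0.05, (50, 20))     # toy bin differences, same person
extra = rng.exponential(0.15, (50, 20))     # toy bin differences, different persons
X = np.vstack([intra, extra])
y = np.hstack([np.ones(50), -np.ones(50)])
w = np.full(100, 1 / 100)
print(best_stump(X, y, w))
```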
5
Experiments
The proposed method is tested on the Face Recognition Grand Challenge (FRGC) ver2.0 database [11]. The Face Recognition Grand Challenge is a large face recognition evaluation data set, which contains over 50,000 images of 3D scans and high resolution still images. Fig. 5 shows a set of images in FRGC for one subject section. The FRGC ver2.0 contains a large still training set for training still face recognition algorithms. It consists of 12,776 images from 222 subjects. A large validation set is also provided for FRGC experiments, which contains 466 subjects from 4,007 subject sessions.
Fig. 5. FRGC images from one subject session. (a) Four controlled still images, (b) two uncontrolled stills, and (c) 3D shape channel and texture channel pasted on the corresponding shape channel.
The proposed method is evaluated on two 2D experiments of FRGC: Experiment 1 and Experiment 2. Experiment 1 is designed to measure face recognition performance on frontal facial images taken under controlled illumination. In this experiment, only one single controlled still image is contained in one biometric sample of the target or query sets. Experiment 2 measures the effect of multiple
still images on performance. In Experiment 2, each biometric sample contains four controlled images of a person taken in a subject session. There are 16 matching scores between one target and one query sample. These scores are averaged to give the final result. There are total 16028 still images both in target and query sets of experiment 1 and 2. In the testing phase, it will generate a large similarity matrix of 16028 by 16028. Furthermore, three masks are defined over this similarity matrix, thus three performance results will obtained, corresponding to Mask I, II, and III. In mask I all samples are within semesters, in mask II they are within a year, while in mask III the samples are between semesters, i.e., they are of increasing difficulty. In our experiments, All images are cropped to 144 pixels high by 112 pixels wide, according to the provided eyes positions. A boosting classifier is trained on the large still training set. The final strong classifier contains 2346 weak classifiers, e.g. 2346 bins of various sub-region SEMB-LBP histograms, while achieving zero error rate on the training set. The corresponding MB-LBP filter size of the first 5 learned weak classifiers are s = 21, 33, 27, 3, 9. We find that a middle level of scale s has a better discriminative power. To compare the performances of LBP based methods, we also evaluate Ahonen et al’s LBP method [1] and Zhang et al’s Boosting LBP algorithm [17]. We u2 use LBP8,2 operator as Ahonen’s, and test it directly on FRGC experiment 1 and 2 since their method needs no training. For Zhang’s approach, we train an AdaBoost classifier on the large still training set in the same way of our MBLBP. The final strong classifier of Zhang’s contains 3072 weak classifiers, yet it can not achieve zero error rate on the training set. The comparison results are given in Fig. 6, and Fig. 7 describes the verification performances on a receiver operator characteristic(ROC) curve (mask III). From the results we can see that MB-LBP outperforms the other two algorithms on all experiments. The comparison proves that MB-LBP is a more robust representation than the basic LBP. Meanwhile, because MB-LBP provides a more complete image representation that encodes both microstructure and macrostructure, it can achieve zero error rate on the large still training set, while generate well on the validation face images. Furthermore, we can also find that MB-LBP performs well on Mask III, which means that it is robust with time elapse. Experiment 1 Mask I Mask II Mask III MB-LBP 98.07 97.04 96.05 Zhang’s LBP 84.17 80.35 76.67 Ahonen’s LBP 82.72 78.65 74.78
Experiment 2
Method          Mask I    Mask II    Mask III
MB-LBP          99.78     99.58      99.45
Zhang's LBP     97.67     96.73      95.84
Ahonen's LBP    93.62     90.54      87.46
Fig. 6. Verification performance (FAR=0.1%) on FRGC Experiment 1 and 2
Fig. 7. ROC performance on FRGC Experiment 1 and 2
6
Summary and Conclusions
In this paper, we present a Multi-scale Block Local Binary Pattern (MB-LBP) based operator for robust image representation. The Local Binary Pattern (LBP) has been proved to be effective for image representation, but it is too local to be robust. The Multi-scale Block Local Binary Patterns (MB-LBP) use sub-region average gray-values for comparison instead of single pixels. Feature extraction for MB-LBP is very fast using integral images. Considering this extension, uniform patterns may not remain the same as those defined by Ojala et al [9]. To reflect the uniform appearance of the MB-LBP, we define our Statistically Effective MB-LBP (SEMB-LBP) via statistical analysis. Simply using absolute difference between the same bin of two histograms as dissimilarity measure, we finally apply AdaBoost learning to select the most effective weak classifiers and construct a powerful classifier. Experiments on FRGC ver2.0 show that MB-LBP significantly outperforms other LBP based method. Since MB-LBP can be viewed as a certain way of combination using 8 ordinal rectangle features, our future work may be combining eight or more ordinal features [2] in a circular instead of rectangle features. While ordinal filters are sufficient and robust for comparison of image regions, we expect that this kind of combination will be more powerful for image representation. Acknowledgements. This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100-people project, and the AuthenMetric Collaboration Foundation.
References 1. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Proceedings of the European Conference on Computer Vision, Prague, Czech, pp. 469–481 (2004)
2. Balas, B., Sinha, P.: Toward dissociated dipoles: Image representation via non-local comparisons. In: CBCL Paper #229/AI Memo #2003-018, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA (August 2003) 3. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997) 4. Crow, F.: Summed-area tables for texture mapping. In: SIGGRAPH, vol. 18(3), pp. 207–212 (1984) 5. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Technical report, Department of Statistics, Sequoia Hall, Stanford Univerity (July 1998) 6. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11(4), 467–476 (2002) 7. Moghaddam, B., Nastar, C., Pentland, A.: A Bayesain similarity measure for direct image matching. Media Lab Tech Report No.393, MIT (August 1996) 8. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29(1), 51–59 (1996) 9. Ojala, T., Pietikainen, M., Maenpaa, M.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 10. Penev, P., Atick, J.: Local feature analysis: A general statistical theory for object representation. Neural Systems 7(3), 477–500 (1996) 11. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2005) 12. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 80–91 (1998) 13. Simard, P.Y., Bottou, L., Haffner, P., Cun, Y.L.: Boxlets: a fast convolution algorithm for signal processing and neural networks. In: Kearns, M., Solla, S., Cohn, D. (eds.) Advances in Neural Information Processing Systems, vol. 11, pp. 571–577. MIT Press, Cambridge (1998) 14. Turk, M.A., Pentland, A.P.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 15. Viola, P., Jones, M.: Robust real time object detection. In: IEEE ICCV Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, July 13, 2001 (2001) 16. Wiskott, L., Fellous, J., Kruger, N., Malsburg, C.v.d.: Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 775–779 (1997) 17. Zhang, G., Huang, X., Li, S.Z., Wang, Y., Wu, X.: Boosting local binary pattern (LBP)-based face recognition. In: Li, S.Z., Lai, J.-H., Tan, T., Feng, G.-C., Wang, Y. (eds.) SINOBIOMETRICS 2004. LNCS, vol. 3338, pp. 180–187. Springer, Heidelberg (2004)
Horizontal and Vertical 2DPCA Based Discriminant Analysis for Face Verification Using the FRGC Version 2 Database Jian Yang1,2 and Chengjun Liu1 1
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102 2 Department of Computer Science, Nanjing University of Science and Technology, Nanjing 210094, P.R. China {csjyang,chengjun.liu}@njit.edu
Abstract. This paper presents a horizontal and vertical 2D principal component analysis (2DPCA) based discriminant analysis (HVDA) method for face verification. The HVDA method, which derives features by applying 2DPCA horizontally and vertically on the image matrices (2D arrays), achieves high computational efficiency compared with the traditional PCA and/or LDA based methods that operate on high dimensional image vectors (1D arrays). The HVDA method further performs discriminant analysis to enhance the discriminating power of the horizontal and vertical 2DPCA features. Finally, the HVDA method takes advantage of the color information across two color spaces, namely, the YIQ and the YCbCr color spaces, to further improve its performance. Experiments using the Face Recognition Grand Challenge (FRGC) version 2 database, which contains 12,776 training images, 16,028 controlled target images, and 8,014 uncontrolled query images, show the effectiveness of the proposed method. In particular, the HVDA method achieves 78.24% face verification rate at 0.1% false accept rate on the most challenging FRGC experiment, i.e., the FRGC Experiment 4 (based on the ROC III curve). Keywords: Principal Component Analysis (PCA), Biometric Experimentation Environment (BEE), Face Recognition Grand Challenge (FRGC), Fisher Linear Discriminant Analysis (FLD or LDA), feature extraction, face verification, biometrics, color space.
1 Introduction
Principal component analysis (PCA) is a classical technique widely used in pattern recognition. Sirovich and Kirby first applied PCA to represent pictures of human faces [1], and Turk and Pentland further proposed the well-known Eigenfaces method for face recognition [2]. Since then, PCA has become a popular method for face recognition. Due to its simplicity and robustness, PCA was chosen as the baseline algorithm for the Face Recognition Grand Challenge (FRGC) evaluation [10].
PCA-based techniques usually operate on vectors. That is, before applying PCA, the 2D image matrices should be mapped into pattern vectors by concatenating their columns or rows. The pattern vectors generally lead to a high-dimensional space. For example, an image with a spatial resolution of 128 × 128 defines a 16, 384-dimensional vector space. In such a high-dimensional vector space, computing the eigenvectors of the covariance matrix is very time-consuming. Although the singular value decomposition (SVD) technique is effective for reducing computation when the training sample size is much smaller than the dimensionality of the images [1, 2], it does not help much when the training sample size becomes large. For example, for the FRGC version 2 database, the number of training images is 12,776. If all these training images are used, PCA has to compute the eigenvectors of a 12,776 × 12,776 matrix. It should be mentioned that the Fisherfaces method [3] also encounters the same problem as PCA does, since the Fisherfaces method requires a PCA step before applying the Fisher linear discriminant analysis (FLD or LDA). Compared with PCA, the two-dimensional PCA method (2DPCA) [4] is a more straightforward technique for dealing with 2D images (matrices), as 2DPCA works on matrices (2D arrays) rather than on vectors (1D arrays). Therefore, 2DPCA does not transform an image into a vector, but rather, it constructs an image covariance matrix directly from the original image matrices. In contrast to the covariance matrix of PCA, the size of the image covariance matrix of 2DPCA is much smaller. For example, if the image size is 128 × 128, the image covariance matrix of 2DPCA is still 128 × 128, regardless of the training sample size. As a result, 2DPCA has a remarkable computational advantage over PCA. The original 2DPCA method, which focuses on the columns of images and achieves the optimal image energy compression in horizontal direction, however, overlooks the information that might be contained in the image rows. In this paper, we embed both kinds of image information (in rows and columns) into a discriminant analysis framework for face recognition. Specifically, we first perform the imagecolumn based 2DPCA (horizontal 2DPCA) and the image-row based 2DPCA (vertical 2DPCA), and then apply LDA for further feature extraction. The proposed framework is called Horizontal and Vertical 2DPCA based Discriminant Analysis (HVDA). FRGC is the most comprehensive face recognition efforts so far organized by the US government, which consists of a large amount of face data and a standard evaluation method, known as the Biometric Experimentation Environment (BEE) system [10, 6]. The FRGC version 2 database contains both controlled and uncontrolled high resolution images, and the BEE baseline algorithm reveals that the FRGC Experiment 4 is the most challenging experiment, because it evaluates face verification performance of controlled face images versus uncontrolled face images. As the training set for FRGC version 2 database consists of 12,776 high resolution images, face recognition methods have to deal with high-dimensional images and very large data sets [11]. The proposed HVDA method has the computational advantage over the conventional PCA and/or LDA based methods, such as the Eigenfaces and the Fisherfaces methods, for dealing with high resolution images and large training data sets due to the computational efficiency of 2DPCA. 
Recent research in face recognition shows that color information plays an important role in improving face recognition performance [7]. Different color spaces as well as various color configurations within or across color spaces are investigated
and assessed using the FERET and the FRGC Version 2 databases. The experimental results reveal that the color configuration YQCr, where Y and Q color components are from the YIQ color space and Cr is from the YCbCr color space, is most effective for the face recognition task. This color configuration, together with an LDA method, achieves 65% face verification rate at 0.1% false accept rate on FRGC Experiment 4 using the FRGC Version 2 database. This paper, thus, applies the HVDA method in the YCbCr color space for improving face recognition performance.
2 Horizontal and Vertical 2DPCA
This section outlines the two versions of 2DPCA, namely the horizontal 2DPCA and the vertical 2DPCA.
2.1 Horizontal 2DPCA
Given image A, an m × n random matrix, the goal of 2DPCA is to find a set of orthogonal projection axes u_1, ..., u_q so that the projected vectors Y_k = A u_k
(k = 1, 2, ..., q) achieve a maximum total scatter [4]. The image covariance (scatter) matrix of Horizontal 2DPCA is defined as follows [4]:
G_t = \frac{1}{M} \sum_{j=1}^{M} (A_j - \bar{A})^T (A_j - \bar{A}) \qquad (1)
where M is the number of training images, A j is an m × n matrix denoting the j-th training image, and A is the mean image of all training images. The optimal projection axes u1 ,L, u q are chosen as the orthonormal eigenvectors of G t corresponding to q largest eigenvalues λ1 , λ2 ,L, λq [4]. After the projection of images onto these axes, i.e.,
X k = ( A − A)u k , k = 1,2,L, q ,
(2)
we obtain a family of principal component vectors, Y1 ,L, Yq , which form an m × q feature matrix B = [X1 ,L, Xq ] . Let U = [u1 ,L, u q ] , and Equ. (2) becomes B = ( A − A)U
(3)
The feature matrix B contains the horizontal 2DPCA features of image A. 2.2 Vertical 2DPCA
If the input of 2DPCA is the transpose of the m × n image A, then the image covariance matrix defined in Eq. (1) becomes Ht = 1 M
M
∑ (A j =1
j
− A )(A j − A ) T
(4)
Horizontal and Vertical 2DPCA Based Discriminant Analysis for Face Verification
841
Now, H t is an m × m non-negative definite matrix. Let v1 ,L, v p be the orthonormal eigenvectors of H t corresponding to the p largest eigenvalues. After projecting AT onto these eigenvectors, we have
Yk = ( A − A)T v k , k = 1,2,L, p
(5)
Let V = [ v1 ,L, v p ] and CT = [Y1 ,L, Yp ] , then we have CT = ( A − A )T V and C = V T ( A − A )
(6)
C = V T ( A − A ) contains the vertical 2DPCA features of image A. Horizontal 2DPCA operates on image rows while Vertical 2DPCA operates on image columns, as image columns become image rows after the transpose operation.
3 Horizontal and Vertical 2DPCA Based Linear Discriminant Framework This section discusses the Horizontal and Vertical 2DPCA based Discriminant Analysis (HVDA) method, which applies the YQCr color configuration defined by combining the component images across two color spaces: YIQ and YCbCr. 3.1 Fisher Linear Discriminant Analysis
The between-class scatter matrix S b and the within-class scatter matrix S w are defined as follows [13] Sb = 1 M
Sw = 1 M
c
∑ l (m i =1
c
∑l i =1
− m 0 )(m i − m 0 )
(7)
l i li (x ij − m i )(x ij − m i )T ∑ − 1 j =1 i
(8)
i
T
i
where xij denotes the j-th training sample in class i; M is the total number of training samples, li is the number of training samples in class i; c is the number of classes; m i is the mean of the training samples in class i ; m 0 is the mean across all training samples. If the within-class scatter matrix S w is nonsingular, the Fisher discriminant vectors ϕ1 , ϕ 2 ,L , ϕ d can be selected as the generalized eigenvectors of S b and S w corresponding to the d ( d ≤ c − 1 ) largest generalized eigenvalues, i.e., S b ϕ j = λ j S w ϕ j , where λ1 ≥ λ2 ≥ L ≥ λd . These generalized eigenvectors can be obtained using the classical two-phase LDA algorithm [13]. In small sample size cases, the within-class scatter matrix S w is singular because the training sample size is smaller than the dimension of the image vector space. To address this issue of LDA, A PCA plus LDA strategy, represented by Fisherfaces [3],
was developed. In Fisherfaces, N − c principal components are chosen in the PCA phase and then LDA is implemented in the N − c dimensional PCA space. To improve the generalization performance of the Fisherfaces method, the Enhanced Fisher Model (EFM) was developed [5]. The EFM method applies a criterion to choose the number of principal components in the PCA phase to avoid overfitting of the PCA plus LDA framework. In particular, a proper balance should be preserved between the data energy and the eigenvalue magnitude of the within-class scatter matrix. While the spectral energy should be preserved, the trailing eigenvalues of the within-class scatter matrix should not be too small in order to prevent the amplification of noise. It should be pointed out that this criterion is still applicable even when the within-class scatter matrix S_w is nonsingular.
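The PCA-plus-LDA pipeline described here can be sketched as follows; this is only an illustration under our own naming, with a simple pooled within-class scatter rather than the exact EFM estimate, and the choice of n_pcs is left to the caller as the balancing rule discussed above.

```python
import numpy as np
from scipy.linalg import eigh

def pca_lda(X, y, n_pcs, n_lda):
    """X: (N, D) training vectors, y: (N,) class labels.
    PCA keeps n_pcs components (chosen so the trailing within-class
    eigenvalues are not too small), then LDA extracts n_lda features."""
    y = np.asarray(y)
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pcs].T                       # PCA projection, (D, n_pcs)
    Z = Xc @ P
    m0 = Z.mean(axis=0)
    Sb = np.zeros((n_pcs, n_pcs))
    Sw = np.zeros((n_pcs, n_pcs))
    for c in np.unique(y):                 # scatter matrices, cf. Eqs. (7)-(8)
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sb += len(Zc) * np.outer(mc - m0, mc - m0)
        Sw += (Zc - mc).T @ (Zc - mc)
    evals, W = eigh(Sb, Sw)                # generalized eigenvectors of (Sb, Sw)
    W = W[:, np.argsort(evals)[::-1][:n_lda]]
    return mu, P, W                        # final features: ((x - mu) @ P) @ W
```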
3.2 Horizontal and Vertical 2DPCA Based Discriminant Analysis (HVDA)

In our algorithm, the horizontal feature matrix B derived by Horizontal 2DPCA and the vertical feature matrix C from Vertical 2DPCA are processed, respectively, by the EFM method for further feature extraction. The cosine similarity measure is then applied to the extracted features to calculate the similarity score between any pair of query and target images. After similarity score normalization, two kinds of normalized scores, i.e., the normalized scores from the horizontal discriminant features and the normalized scores from the vertical discriminant features, are fused at the classification level. The proposed HVDA framework is illustrated in Fig. 1. Some details on the cosine similarity measure, score normalization and fusion strategy are presented below.
Fig. 1. Illustration of the proposed HVDA framework (Image A → Horizontal 2DPCA / Vertical 2DPCA → EFM (Cos) → Normalization → Fusion)
The cosine similarity measure between two vectors x and y is defined as follows:

$$\delta_{\cos}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|} \quad (9)$$

where $\|\cdot\|$ denotes the Euclidean norm. Suppose that there are M target images x_1, x_2, ..., x_M. For a given query image y, we can obtain a similarity score vector s = [s_1, s_2, ..., s_M]^T by calculating the cosine similarity measure between each pair of x_i and y. Based on the horizontal discriminant features, we can obtain the horizontal similarity score vector s_h and, based on the vertical discriminant features, we can calculate the vertical similarity score
vector s_v. Each score vector is normalized by means of the z-score normalization technique [8]. The normalized scores are as follows:

$$s_i^{new} = \frac{s_i - \mu}{\sigma} \quad (10)$$

where $\mu$ is the mean of s_1, s_2, ..., s_M and $\sigma$ is their standard deviation. After score normalization, the horizontal similarity score vector s_h and the vertical similarity score vector s_v are fused using the sum rule, that is, the final similarity score vector is s_h + s_v.
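A small sketch of this scoring pipeline (cosine scores, z-score normalization, sum-rule fusion) is given below; variable names are ours.

```python
import numpy as np

def cosine_scores(targets, query):
    """targets: (M, d) target features, query: (d,) query features.
    Returns the similarity score vector s of Eq. (9)."""
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return t @ (query / np.linalg.norm(query))

def z_norm(s):
    """Z-score normalization of Eq. (10)."""
    return (s - s.mean()) / s.std()

# sum-rule fusion of the horizontal and vertical scores:
# s_final = z_norm(s_h) + z_norm(s_v)
```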
3.3 HVDA in YQCr Color Space

Recent research on color spaces for face recognition reveals that some color configurations, such as YQCr, can significantly improve the FRGC baseline performance [7]. YQCr is defined by combining the component images across two color spaces: YIQ and YCbCr. YIQ is a color space formerly used in the National Television System Committee (NTSC) television standard [14]. The Y component represents the luminance information and I and Q represent the chrominance information. Remember that in the YUV color space, the U and V components can be viewed as x and y coordinates within the color space. I and Q can be viewed as a second pair of axes on the same graph, rotated 33° clockwise. Therefore, IQ and UV represent different coordinate systems on the same plane. The YIQ system is intended to take advantage of human color-response characteristics. YIQ is derived from the corresponding RGB space as follows:
$$\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.2990 & 0.5870 & 0.1140 \\ 0.5957 & -0.2745 & -0.3213 \\ 0.2115 & -0.5226 & 0.3111 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \quad (11)$$
The YCbCr color space was developed as a part of the ITU-R Recommendation BT.601 for digital video standards and television transmissions [14]. It is a scaled and offset version of the YUV color space. Y is the luminance component and Cb and Cr are the blue and red chrominance components, respectively. YCbCr is derived from the corresponding RGB space as follows:

$$\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} = \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix} + \begin{bmatrix} 65.4810 & 128.5530 & 24.9660 \\ -37.7745 & -74.1592 & 111.9337 \\ 111.9581 & -93.7509 & -18.2072 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \quad (12)$$
Our HVDA method first works on the three color components Y, Q and Cr to derive the similarity scores, and then fuses the normalized similarity scores using the sum rule.
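The two conversions of Eqs. (11)-(12) and the assembly of the YQCr configuration can be written as in the sketch below; it assumes RGB values scaled to [0, 1] and uses our own function name.

```python
import numpy as np

# RGB -> YIQ (Eq. 11) and RGB -> YCbCr (Eq. 12) conversion matrices
M_YIQ = np.array([[0.2990,  0.5870,  0.1140],
                  [0.5957, -0.2745, -0.3213],
                  [0.2115, -0.5226,  0.3111]])

M_YCC = np.array([[ 65.4810, 128.5530,  24.9660],
                  [-37.7745, -74.1592, 111.9337],
                  [111.9581, -93.7509, -18.2072]])
OFF_YCC = np.array([16.0, 128.0, 128.0])

def yqcr(rgb):
    """rgb: (h, w, 3) image with channels in [0, 1].
    Returns the Y, Q and Cr component images used by HVDA."""
    yiq = rgb @ M_YIQ.T
    ycc = OFF_YCC + rgb @ M_YCC.T
    return yiq[..., 0], yiq[..., 2], ycc[..., 2]   # Y, Q, Cr
```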
4 Experiments

We evaluate our HVDA method using the FRGC version 2 database and the associated Biometric Experimentation Environment (BEE) [10]. The FRGC version 2 database contains 12,776 training images, 16,028 controlled target images, and 8,014 uncontrolled query images for Experiment 4. The controlled images have good image quality, while the uncontrolled images display poor image quality, such as large illumination variations, low resolution of the face region, and possible blurring. It is these uncontrolled factors that pose the grand challenge to the face recognition performance. The BEE system provides a computational-experimental environment to support a challenge problem in face recognition or biometrics, which allows the description and distribution of experiments in a common format. The BEE system uses the PCA method that has been optimized for large scale problems as a baseline algorithm, which applies the whitened cosine distance measure for its nearest neighbor classifier [10]. The BEE baseline algorithm shows that Experiment 4, which is designed for indoor controlled single still image versus uncontrolled single still image, is the most challenging FRGC experiment. We therefore choose FRGC Experiment 4 to evaluate our method. In our experiment, the face region of each image is first cropped from the original high-resolution still images and resized to 64 × 64. Fig. 2 shows some example images used in our experiments.
Fig. 2. Example cropped images in FRGC version 2
According to the FRGC protocol, the face recognition performance is reported using the Receiver Operating Characteristic (ROC) curves, which plot the Face Verification Rate (FVR) versus the False Accept Rate (FAR). The ROC curves are automatically generated by the BEE system when a similarity matrix is input to the system. In particular, the BEE system generates three ROC curves, ROC I, ROC II, and ROC III, corresponding to images collected within semesters, within a year and between semesters, respectively [12]. The similarity matrix stores the similarity score of every query image versus target image pair. So, the size of the similarity matrix is T × Q , where T is the number of target images and Q is the number of query images. The proposed HVDA method is trained using the standard training set of the FRGC Experiment 4. In the 2DPCA phase, we choose q=19 in horizontal 2DPCA transform and p=19 in vertical 2DPCA transform, respectively. In the EFM phase, we choose 1000 principal components in the PCA step and 220 discriminant features in the LDA step. The resulting similarity matrix is analyzed by the BEE system and the three ROC curves generated are shown in Fig. 3. The verification rates (%) when the False Accept Rate is 0.1% are listed in Table 1. Table 1 also includes the verification rates reported in recent papers for comparison [7, 9]. These results show that the
HVDA method achieves better face verification performance than those reported before. In addition, we can see that the fusion of Horizontal 2DPCA based Discriminant Analysis (HDA) and Vertical 2DPCA based Discriminant Analysis (VDA) can significantly improve the verification performance.
Fig. 3. ROC curves (ROC I and ROC II) corresponding to the HDA, VDA, and HVDA methods and the BEE Baseline algorithm (verification rate versus false accept rate)
Fig. 3. (continued) ROC curve ROC III.

Table 1. Verification rate (%) comparison when the False Accept Rate is 0.1%

Method           ROC I   ROC II   ROC III
BEE Baseline     13.36   12.67    11.86
YQCr + LDA [7]   64.47   64.89    65.21
MFM-HFF [9]      75.70   75.06    74.33
HVDA             78.65   78.50    78.24
HDA              73.90   73.96    73.89
VDA              73.38   73.73    73.97
5 Conclusions

This paper presents a horizontal and vertical 2DPCA based Discriminant Analysis (HVDA) method for face verification. The HVDA method first integrates the horizontal and vertical 2DPCA features, and then applies the Enhanced Fisher Model (EFM) to improve its discriminatory power. The HVDA method further takes advantage of the color information across two color spaces, YIQ and YCbCr, for enhancing its performance. The proposed method has been tested on FRGC Experiment 4 using the FRGC version 2 database. Experimental results show the feasibility of the proposed method. In particular, the HVDA method achieves a 78.24% face verification rate at 0.1% false accept rate based on the ROC III curve.

Acknowledgments. This work was partially supported by the NJCST/NIJT grant, and by Award No. 2006-IJ-CX-K033 awarded by the National Institute of Justice, Office
of Justice Programs, US Department of Justice. Dr. Yang was also supported by the National Science Foundation of China under Grants No. 60503026, No. 60472060, No. 60473039, and No. 60632050.
References

1. Sirovich, L., Kirby, M.: Low-dimensional procedure for characterization of human faces. J. Optical Soc. of Am. 4, 519–524 (1987)
2. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cognitive Neuroscience 3(1), 71–86 (1991)
3. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19(7), 711–720 (1997)
4. Yang, J., Zhang, D., Frangi, A.F., Yang, J.-Y.: Two-Dimensional PCA: a New Approach to Face Representation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1), 131–137 (2004)
5. Liu, C., Wechsler, H.: Robust coding schemes for indexing and retrieval from large face databases. IEEE Trans. Image Processing 9(1), 132–137 (2000)
6. Liu, C.: Capitalize on Dimensionality Increasing Techniques for Improving Face Recognition Grand Challenge Performance. IEEE Trans. Pattern Analysis and Machine Intelligence 28(5), 725–737 (2006)
7. Shih, P., Liu, C.: Improving the Face Recognition Grand Challenge Baseline Performance Using Color Configurations Across Color Spaces. In: IEEE International Conference on Image Processing (ICIP 2006), Atlanta, GA, October 8-11 (2006)
8. Jain, A., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognition 38, 2270–2285 (2005)
9. Hwang, W., Park, G., Lee, J.: Multiple face model of hybrid Fourier feature for large face image set. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2006)
10. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the Face Recognition Grand Challenge. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2005)
11. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, W.W.: Preliminary Face Recognition Grand Challenge Results. In: Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR'06) (2006)
12. Phillips, P.J.: FRGC Third Workshop Presentation. In: FRGC Workshop (February 2005)
13. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, London (1990)
14. Buchsbaum, W.H.: Color TV Servicing, 3rd edn. Prentice Hall, Englewood Cliffs, NJ (1975)
Video-Based Face Tracking and Recognition on Updating Twin GMMs

Li Jiangwei and Wang Yunhong

Intelligence Recognition and Image Processing Laboratory, Beihang University, Beijing, P.R. China
{jwli,yhwang}@buaa.edu.cn
Abstract. Online learning is a very desirable capability for video-based algorithms. In this paper, we propose a novel framework to solve the problems of video-based face tracking and recognition by online updating twin GMMs. First, considering the differences between the tasks of face tracking and face recognition, the twin GMMs are initialized with different rules for tracking and recognition purposes, respectively. Then, given training sequences for learning, both of them are updated with an online incremental learning algorithm, so the tracking performance is improved and the class-specific GMMs are obtained. Lastly, Bayesian inference is incorporated into the recognition framework to accumulate the temporal information in video. Experiments have demonstrated that the algorithm can achieve better performance than some well-known methods.

Keywords: Face Tracking, Face Recognition, Online Updating, Bayesian Inference, GMM.
1 Introduction

Recently, more and more research interest has shifted from image-based face detection and recognition [1-2] to video-based face tracking and recognition [3-5]. Compared to image-based face technologies, the multiple frames and temporal continuity contained in video facilitate face tracking and recognition. However, large variations of image resolution and pose, poor video quality and partial occlusion are the main problems video-based face technologies have encountered. To deal with these difficulties, many researchers have presented their solutions [3-5], which adopted various strategies to fully use the temporal and spatial information in video. The capability of online learning is very favorable for video-based algorithms. The models can be updated when a new sample arrives, so memory is saved without preserving the sample, and the model adapts to fit current and future patterns as time elapses. Among all video-based face technologies, only a few algorithms [3,4] have introduced an updating mechanism into their framework. In this paper, we propose a new framework for video-based face tracking and recognition based on online updating Gaussian Mixture Models (GMMs). At any instance, we use the new sample to update two GMMs, called the "twin GMMs". As
shown in Fig. 1, in the training stage, according to the different requirements of face tracking and recognition, we design twin initial models, namely a tracking model and a recognition model. Then, the tracking model is used to locate the face in the incoming frame. At each instance, the detected face is learned to update both twin models with different updating rules. By learning all frames in the sequence, the recognition model gradually evolves to the class-specific model, and the tracking model becomes more powerful by merging the learned samples into its framework. In the testing stage, given the testing video, the recognition score is calculated by accumulating the current likelihood and previous posteriors using Bayesian inference. Most traditional methods train gallery models in batch mode and then use them to perform recognition. Contrastively, with online sequential updating, our learning mechanism saves memory and is more adaptive for real-time applications. Moreover, the recognition approach based on Bayesian inference effectively captures temporal information. Experimental results show that our algorithm can effectively track and recognize faces in video even with large variations. Note the differences between our algorithm and some existing updating methods [3, 4]. In our paper, we emphasize the necessity of designing different models to deal with different tasks. Compared to [3, 4], we note the distinctions between face tracking and recognition, so twin models with different initialization and learning strategies for each task are proposed. It is a fairly delicate learning mechanism. Furthermore, with our learning mechanism, the evolved class-specific models for distinct individuals have different forms. These advantages make the algorithm flexible.
Fig. 1. Online updating twin models
2 The Framework of Updating

GMM is a special form of HMM and has been well studied for many years. The GMM assumes that the probability that the observed data belongs to this model takes the following form:

$$G(\mathbf{x}) = p(\mathbf{x} \mid \lambda_l) = \sum_{m=1}^{l} \alpha_m N(\mathbf{x}, \mu_m, \theta_m) \quad (1)$$

where $N(\mathbf{x}, \mu_m, \theta_m)$ denotes the multi-dimensional normal distribution with mean $\mu_m$ and covariance matrix $\theta_m$, and $\alpha_m$ is the weight of the corresponding component, satisfying:

$$\alpha_m \geq 0, \; m = 1, \ldots, l, \quad \text{and} \quad \sum_{m=1}^{l} \alpha_m = 1 \quad (2)$$
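For reference, the mixture density of Eq. (1) can be evaluated as in the following sketch (using SciPy; parameter names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x | lambda_l) of Eq. (1); weights: (l,), means: (l, d), covs: (l, d, d)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(weights, means, covs))
```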
In the following, we begin with the initialization of the twin GMMs. We then show how the incoming sample is learned to update the twin models in the training stage.

2.1 Initialization
Motivated by the distinctions between face tracking and face recognition, it is necessary to initialize the face tracking model and the face recognition model in different ways. For face tracking, considering that there exist numerous variations in the face pattern, the initial tracking model must be trained on a large scale of data samples. For face recognition, to hasten the evolution from the initial model to a class-specific model, the recognition model can learn on much fewer samples. In addition, to ensure proper convergence, the recognition model is initialized with enough components to spread over the face space. The dimension of all training data is reduced to d by PCA to prevent "the curse of dimensionality". The method begins with the initialization of the twin GMMs. Denote $G_T(\mathbf{x}) = p(\mathbf{x} \mid \lambda_{l_1})$ as the initial tracking model with $l_1$ components and $G_R(\mathbf{x}) = p(\mathbf{x} \mid \lambda_{l_2})$ the initial recognition model with $l_2$ components. Further assume there exists a training face set with p (p > 5000) samples. The initialization proceeds as follows:

• Initialization of the tracking model
For $G_T(\mathbf{x})$, considering the diversities of face patterns in the feature space, the whole set of p samples is used to train the model using an unsupervised learning method [6], which is characterized as being capable of selecting the number of components and not requiring careful initialization. In $G_T(\mathbf{x})$, each Gaussian component can be treated as a pose manifold. Since it is random for a face in video to be in a certain pose, we discard the learned weight coefficients for all components and fix them as $1/l_1$, namely all components in $G_T(\mathbf{x})$ are equally weighted. So the initial parameters of
$G_T(\mathbf{x})$ are $\{l_1, 1/l_1, \mu_{(m,0)}, \theta_{(m,0)}\}$, where $l_1$ is the number of components, and $1/l_1$, $\mu_{(m,0)}$ and $\theta_{(m,0)}$ are the initial weight, mean and covariance of each Gaussian component.

• Initialization of the recognition model
For $G_R(\mathbf{x})$, we randomly select $l_2$ points from the set as the mean vectors, and initialize the weight $\alpha_{(m,0)} = 1/l_2$. To weaken the influence of the training data on the class-specific models, only q (q ≪ p) … $l_2$. This is beneficial for the fast evolution of $G_R(\mathbf{x})$ from the initial recognition model to the class-specific models as well.

2.2 Updating Process
In the training stage, with the initial tracking model $G_T(\mathbf{x})$, we can use the model to continuously track the face. Denote $\{I_0, \ldots, I_t, \ldots, I_N\}_i$ the i-th incoming video sequence. The updating process can be expressed as:

$$G_T(\mathbf{x}) \oplus \{I_0, \ldots, I_t, \ldots, I_N\}_i \rightarrow G_T(\mathbf{x}), \qquad G_R(\mathbf{x}) \oplus \{I_0, \ldots, I_t, \ldots, I_N\}_i \rightarrow G_i(\mathbf{x}) \quad (4)$$

where $\oplus$ is the operator of incremental updating, and $G_i(\mathbf{x})$ is the class-specific model for the i-th sequence. In video, the dynamics between frames can facilitate the tracking process. It is formulated as a Gaussian function:

$$K(s_t, s_{t-1}) = \exp\{-(s_t - s_{t-1}) C_{t-1}^{-1} (s_t - s_{t-1})\} \quad (5)$$

where $s_t$ is the current state variable, including face position and pose information. From the current frame $I_t$, according to Eq. (5), draw y image patches around the previous position of the detected face as face candidates and normalize them into
d-dimensional vectors $\{F_{(1,t)}, \ldots, F_{(r,t)}, \ldots, F_{(y,t)}\}$. The face is detected with the maximum likelihood rule:

$$F_t^* = \arg\max_i G_T(F_{(i,t)}) \quad (6)$$
After we obtain the face, we use it to update both models. Note that the twin models should be updated in different ways:
• Updating of the tracking model
Given the model $G_T(\mathbf{x})$ with the parameter $\{l, 1/l, \mu_{(m,t-1)}, \theta_{(m,t-1)}\}$ at time t − 1, where m indexes the Gaussian components, for the new sample $F_t^*$ we first find its ownership:

$$o_{(m,t)}(F_t^*) = N(F_t^*, \mu_{(m,t-1)}, \theta_{(m,t-1)}), \qquad m^* = \arg\max_m o_{(m,t)}(F_t^*) \quad (7)$$

In Eq. (7), the probability of $F_t^*$ in the $m^*$-th component is the largest. All weights are kept invariant, and we only update the parameters of the $m^*$-th component with the rate $\lambda_T$:

$$\varsigma = F_t^* - \mu_{(m^*,t-1)}, \qquad \mu_{(m^*,t)} = \mu_{(m^*,t-1)} + \lambda_T\, o_{(m^*,t)}(F_t^*)\, \varsigma$$
$$\theta_{(m^*,t)} = \theta_{(m^*,t-1)} + \lambda_T\, o_{(m^*,t)}(F_t^*)\, (\varsigma \varsigma^T - \theta_{(m^*,t-1)}) \quad (8)$$
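A minimal sketch of this single-component update (Eqs. (7)-(8)) might look as follows; it treats the face sample as a d-dimensional vector and, as stated in the text, leaves the weights untouched. Function and variable names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_tracking_model(F, means, covs, lam_T):
    """One online update of the tracking GMM: only the best-matching
    component is adapted. F: new face sample (d,)."""
    own = np.array([multivariate_normal.pdf(F, mean=m, cov=c)
                    for m, c in zip(means, covs)])       # ownerships, Eq. (7)
    k = int(np.argmax(own))
    zeta = F - means[k]
    means[k] = means[k] + lam_T * own[k] * zeta            # mean update, Eq. (8)
    covs[k] = covs[k] + lam_T * own[k] * (np.outer(zeta, zeta) - covs[k])
    return means, covs
```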
To facilitate face tracking in the current video and simultaneously avoid over-fitting, we keep the weights of all components invariant and only update the mean and the covariance of the component with the highest ownership score. This prevents the model from converging to a class-specific model, so as to still keep good tracking performance when the following video comes.

• Updating of the recognition model
For $G_R(\mathbf{x})$, we use an existing technique for updating. There are several incremental learning methods for GMM [7,8], while only [8] can update the model one sample at a time, so the method in [8] is used. Assume the parameter is $\{l_{t-1}, \alpha_{(m,t-1)}, \mu_{(m,t-1)}, \theta_{(m,t-1)}\}$ at time t − 1. As the new data
$F_t^*$ comes, for each component we first calculate its ownership confidence score:

$$o_{(m,t)}(F_t^*) = \alpha_{(m,t-1)} N(F_t^*, \mu_{(m,t-1)}, \theta_{(m,t-1)}) / G_R(\mathbf{x}) \quad (9)$$
Then use the score to update the corresponding weight:
$$\alpha_{(m,t)} = \alpha_{(m,t-1)} + \lambda_R \left( \frac{o_{(m,t)}(F_t^*)}{1 - l_{t-1} C} - \alpha_{(m,t-1)} \right) - \lambda_R \frac{C}{1 - l_{t-1} C} \quad (10)$$
In Eq. (10), $\lambda_R$ determines the updating rate, and $C = \lambda N / 2$ is a constant, where $N = d + d(d+1)/2$ is the number of parameters specifying each mixture component.
Check all $\alpha_{(m,t)}$. If $\alpha_{(m,t)} < 0$, it means too few data belong to the component m, so cancel this component, set $l_t = l_{t-1} - 1$ and renormalize $\alpha_{(m,t)}$. The remaining parameters are updated as:
$$\varsigma = F_t^* - \mu_{(m,t-1)}, \qquad \mu_{(m,t)} = \mu_{(m,t-1)} + \lambda_R \frac{o_{(m,t)}(F_t^*)}{\alpha_{(m,t-1)}} \varsigma,$$
$$\theta_{(m,t)} = \theta_{(m,t-1)} + \lambda_R \frac{o_{(m,t)}(F_t^*)}{\alpha_{(m,t-1)}} (\varsigma \varsigma^T - \theta_{(m,t-1)}) \quad (11)$$
Then use the new parameter $\{l_t, \alpha_{(m,t)}, \mu_{(m,t)}, \theta_{(m,t)}\}$ for next updating.

2.3 Updating Results
Except for the above updating rules, note the following additions: (1) For the face recognition model, to learn more intra-personal patterns and tolerate face location error, at any instance the model is updated with additional generated virtual samples, obtained by locating the face with errors and by the mirror operation. (2) The two models should be updated with different updating rates. Generally, the updating rate of the face recognition model is much faster than that of the face tracking model, i.e., $\lambda_T \ll \lambda_R$.

$$|\langle \mathbf{f}, \tilde{\mathbf{f}} \rangle| \leq \|\mathbf{f}\| \|\tilde{\mathbf{f}}\|, \qquad \cos(\theta) = \frac{|\langle \mathbf{f}, \tilde{\mathbf{f}} \rangle|}{\|\mathbf{f}\| \|\tilde{\mathbf{f}}\|} = \frac{|\mathbf{f}^T \tilde{\mathbf{f}}|}{\|\mathbf{f}\| \|\tilde{\mathbf{f}}\|} \leq 1 \quad (2)$$
where $\mathbf{f} = (f_1, f_2, \ldots, f_{k-1}, f_k, f_{k+1}, \ldots)$ and $\tilde{\mathbf{f}} = (\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_{k-1}, \tilde{f}_k, \tilde{f}_{k+1}, \ldots)$ are vector forms of the 2D patterns f and f̃ from frames F and F̃ respectively, and cos(θ) ∈ [0, 1] is the similarity measure between the patterns. The most parallel pattern f̃ in the frame F̃ is found by maximizing cos(θ). This can be done by repetitive scalar products. That is, the pattern f from frame F is glided over a neighborhood of the expected parallel local neighborhood of f̃ in the frame F̃, and the most similar pattern is selected as a match. The motion vector for the point at the center of pattern f is calculated as the displacement between pattern f and pattern f̃, as illustrated in Fig. 4. That is:
Fig. 4. Motion Vector and unknown frames in a sequence
$$MV(k, l) = x + iy \quad (3)$$

where x and y are the horizontal and vertical displacements respectively of the block/pattern f, (k, l) is the index for the center of pattern f, and $i = \sqrt{-1}$.

3.2 Pyramid Based Approach for Block Matching
Using a large block size while calculating the motion vectors above gives rise to what is known as the block effect in the interpolated frame, whereas making the block size small may result in multiple similarities and a probable choice of a false match for the motion vector. To solve this problem, a pyramid based frame interpolation method which makes use of both large and small block sizes hierarchically is employed to calculate the motion vector in the right direction. That is, initially a large block size is used to get crude directional information on which way the block is moving. Here it is unlikely to get a false match as the size of the block is very large. This motion vector is used to determine the appropriate search area in the next step of the block matching process so that the probability of a false match is minimal. The search area is also reduced to a smaller local neighborhood because we have information on where to look for the pattern. Then, we reduce the size of the block by a factor of n and calculate the motion vector again (Fig. 5). The motion vector calculated for each block is again used to determine where the search area should be in the next iteration. This process is repeated until a satisfactory result is achieved in the interpolated frame. The final complex valued motion vector matrix is used to interpolate the new frames between F and F̃. The number of frames to be interpolated depends on the norm of the motion vector and is determined at run-time.
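A rough sketch of this coarse-to-fine matching is given below; the block sizes, the search range and the cosine score of Eq. (2) used for matching are illustrative assumptions on grayscale NumPy frames, and the function names are ours.

```python
import numpy as np

def best_match(F, G, k, l, size, search, guess=(0, 0)):
    """Displacement of the block of side `size` centred at (k, l) in frame F,
    searched within +-search pixels of `guess` in frame G (cos(theta), Eq. 2)."""
    h = size // 2
    f = F[k - h:k - h + size, l - h:l - h + size].astype(float).ravel()
    f /= np.linalg.norm(f) + 1e-9
    best, best_sim = guess, -1.0
    for dy in range(guess[0] - search, guess[0] + search + 1):
        for dx in range(guess[1] - search, guess[1] + search + 1):
            r, c = k + dy - h, l + dx - h
            if r < 0 or c < 0 or r + size > G.shape[0] or c + size > G.shape[1]:
                continue                                  # candidate outside frame
            g = G[r:r + size, c:c + size].astype(float).ravel()
            sim = abs(f @ g) / (np.linalg.norm(g) + 1e-9)
            if sim > best_sim:
                best_sim, best = sim, (dy, dx)
    return best

def pyramid_motion(F, G, k, l, sizes=(32, 8, 3), search=8):
    """Coarse-to-fine estimate: large blocks give a crude direction,
    smaller blocks refine it over a progressively narrower search area."""
    guess = (0, 0)
    for size in sizes:
        guess = best_match(F, G, k, l, size, search, guess)
        search = max(2, search // 2)
    return guess
```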
Fig. 5. Pyramid based iteration where n=4 and number of iterations =3
For a length of 2 time units, the actual frame interpolation is done by dividing the motion vector at point (k, l) by 2 and moving the block in frame F̃ centered at (k+x, l+y) to the new frame at (k+x/2, l+y/2). Then the block at the new location in frame F̃ is moved back the same distance. That is, let F be the frame at $t_0$, F̃ the frame at $t_1$ and F′ be the frame at $(t_1 - t_0)/2$; then

$$F'(k + x/2,\, l + y/2) = \tilde{F}(k + x,\, l + y) \quad (4)$$
$$F'(k, l) = \tilde{F}(k + x/2,\, l + y/2) \quad (5)$$
4
Video Synthesis
Audio data is extracted from the input audio-video signal and forwarded to the speech recognizer. The recognizer uses the HMM models for the recognition and returns a transcription file containing the start and end time of each digit spoken in the audio signal. A search for the prompted text is done against the transcription file and the time gap of each prompted digits within the video signal is captured. Then, the image sequences are arranged according to the order of the prompted text. The discontinuity between the image sequences of each digit is minimized by interpolating frames using the pyramid based multiresolution frame interpolation technique summarized in section 3. The interpolated frames are attached to a silence sound and are inserted to their proper locations in the video signal to decrease the discontinuity of utterances. Finally, the video is played to represent the digit sequence prompted by the system.
Fig. 6. Results of the various frame interpolation methods: (a) original frame 1, (b) original frame 2, (c) using large blocks, (d) using small blocks, (e) using pyramid method
Fig. 7. Process flowchart
5 Experiment

The experiments are conducted on all the digit speaking face videos of the XM2VTS database. The accuracy of the text-prompted video signal is mainly dependent on the performance of the speech recognition system. The accuracy of our HMM based speech recognition system is 94%. The pyramid based frame interpolation algorithm gives optimal result when the final block size of the
pyramid is 3. The discontinuity of the reshuffled video signal is reduced significantly, as evaluated visually by the authors. The time it takes a person to speak a digit is enough to interpolate the necessary frames between digits. Moreover, a laptop computer or a portable DVD player can be used to play back the video for an audio-visual recognition system. However, there is still visible blur around the eyes and the mouth of the subject. This is due to the fact that differences between the frames are likely to appear around the eyes and the mouth, such as a varied state of the eyes, teeth, mouth opening, etc. (Fig. 6). Such changes are difficult to interpolate in real time as they do not exist in both the left and right frames. Biometric authentication and liveness detection systems that make use of motion information of the face and lips and text-prompted audio-video are easy targets of the playback attacks described here.
6 Conclusion

The risk of spoofing and impersonation is forcing biometric systems to incorporate liveness detection. Assuring liveness, especially on remotely controlled systems, is a challenging task. The proposed method shows a way to produce playback attacks against text-prompted systems using audio and video in real time. The result shows that methods which assure liveness remotely by relying on apparent motion can be targets of such attacks. Our results suggest the need to increase the sophistication level of biometric systems to stand up against advanced playback attacks. Most video-based face detection and recognition systems search for the best image from a sequence of images and use it for extracting features, assuming it will yield better recognition. However, this could have a negative impact, as some of the blurry or faulty frames that are dropped could be our only hope to tell whether a video signal is a playback or a live one. Therefore, we conclude that audio-visual recognition systems can withstand such playback attacks by analyzing the area around the eyes and the mouth of the new frames between prompted text or digits to identify whether they are artificial.
Acknowledgment

This work has been sponsored by the Swedish International Development Agency (SIDA).
Face Authentication with Salient Local Features and Static Bayesian Network

Guillaume Heusch 1,2 and Sébastien Marcel 1

1 IDIAP Research Institute, rue du Simplon 4, 1920 Martigny, Switzerland
2 Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
{heusch,marcel}@idiap.ch
Abstract. In this paper, the problem of face authentication using salient facial features together with statistical generative models is addressed. Actually, classical generative models, and Gaussian Mixture Models in particular, make strong assumptions on the way observations derived from face images are generated. Indeed, systems proposed so far consider that local observations are independent, which is obviously not the case in a face. Hence, we propose a new generative model based on Bayesian Networks using only salient facial features. We compare it to Gaussian Mixture Models using the same set of observations. Conducted experiments on the BANCA database show that our model is suitable for the face authentication task, since it outperforms not only Gaussian Mixture Models, but also classical appearance-based methods, such as Eigenfaces and Fisherfaces.
1 Introduction

Face recognition has been an active research area for three decades, and a huge variety of different systems are now capable of recognizing people based on their face image, at least in so-called controlled conditions (good illumination, fixed pose). Existing algorithms are often divided into two categories, depending on the information they use to perform the classification: appearance-based methods (also called holistic) typically use the whole face as input to the recognition system. On the other hand, feature-based methods consider a set of local observations derived from a face image. Such observations may be geometric measurements (distance between the eyes, etc.), particular blocks of pixels, or local responses to a set of filters, for instance. Examples of appearance-based systems include the well-known Principal Component Analysis (PCA) [19], Linear Discriminant Analysis (LDA) [4] as well as Independent Component Analysis (ICA) [3], to name a few. These projection techniques are used to represent face images as a lower-dimensional vector, and the classification itself is actually performed by comparing these vectors according to a metric in the subspace domain (or using a more sophisticated classification technique, such as a Multi-Layer Perceptron or Support Vector Machines). On the other hand, feature-based approaches try to derive a model of an individual's face based on local observations. Examples of such systems include
the Elastic Bunch Graph Matching (EBGM) [20], recent systems using Local Binary Patterns (LBP) [1] [17], and also statistical generative models: Gaussian Mixture Models (GMM) [6] [13], Hidden Markov Models (HMM) [15] [18] or their variants [5] [14]. Face recognition systems using local features were empirically shown to perform better compared to holistic methods [5] [12] [13]. Moreover, they also have several other advantages: first, face images are not required to be precisely aligned. This is an important property, since it increases robustness against imprecisely located faces, which is a desirable behaviour in real-world scenarios. Second, local features are also less sensitive to small variations in pose and in illumination conditions. In this paper we will focus on statistical generative models and propose a new model, based on static Bayesian Networks, especially dedicated to the data we have to deal with, that is, the human face. Actually, we think that classical statistical models (GMM and HMM), although successful, are not really appropriate to properly describe the set of local observations extracted from a face image. Indeed, GMM as applied in [5] model the distribution of overlapping blocks among the whole face image, thus considering each block to be independent with respect to the others. Furthermore, it was shown in [13] that better results are obtained by modelling each part of the face using a different GMM. However, the model likelihood is computed as the product of the GMM likelihoods, hence again considering the different face parts independently. Obviously, this is not the case due to the nature of the "face object". Consider the two eyes for instance: the block containing one eye is likely to be related somehow to the block containing the other eye. Going one step further, HMM-based approaches, as well as their variants (2D-HMM, coupled HMM), are able to add structure to the observations and therefore usually perform better. Examples of embedded dynamic Bayesian Networks (which are nothing else but an extension of the HMM framework) applied to face recognition can be found in [14]. However, such systems cannot introduce causal relationships between observations themselves; they mainly act on their ordering. By using static Bayesian Networks, it is then possible to model causal relationships between a set of different observations represented by different variables. Hence, in this contribution we propose a first attempt, to our knowledge, to derive a statistical generative model based on this paradigm and especially dedicated to the particular nature of the human face. Conducted experiments on the BANCA [2] database show a performance improvement over a GMM-based system making the independence assumption between different facial features. The remainder of this paper is organized as follows. Section 2 describes the general framework to perform face authentication with statistical models. Then, Bayesian Networks are briefly introduced before presenting the proposed generative model used to represent a face. The BANCA database, the experimental framework and obtained results are discussed in Sec. 4. Finally, the conclusion is drawn in Sec. 5, and some possible future research directions are outlined.
2 Face Authentication Using Generative Models
In the framework of face authentication, a client claims its identity and supports the claim by providing an image of its face to the system. There are then two different possibilities: either the client is claiming its real identity, in which case it is referred to as a true client, or the client is trying to fool the system, and is referred to as an impostor. In this open-set scenario, subjects to be authenticated may or may not be present in the database. Therefore, the authentication system is required to give an opinion on whether the claimant is the true client or an impostor. Since modelling all possible impostors is obviously not feasible, a so-called world-model (or universal background model) [5] [10] is trained using data coming from different identities, and will be used to simulate impostors. More formally, let us denote $\lambda_{\bar{C}}$ as the parameter set defining the world-model whereas $\lambda_C$ represents the client-specific parameters. Given a client claim and its face representation X, an opinion on the claim is given by the following log-likelihood ratio:

$$\Lambda(X) = \log P(X \mid \lambda_C) - \log P(X \mid \lambda_{\bar{C}}) \quad (1)$$

where $P(X \mid \lambda_C)$ is the likelihood of the claim coming from the true client and $P(X \mid \lambda_{\bar{C}})$ represents the likelihood of the claim coming from an arbitrary impostor. Based on a threshold $\tau$, the claim is accepted if $\Lambda(X) \geq \tau$ and rejected otherwise. In order to find the parameters $\lambda_{\bar{C}}$ of the world model, and since we are dealing with a model containing unobserved (or hidden) variables, the well-known Expectation-Maximisation (EM) algorithm [9] in the Maximum Likelihood (ML) learning framework is used. However, when it comes to client parameter estimation, ML learning cannot be reliably used due to the small amount of available training data for each client; instead the Maximum A Posteriori (MAP) criterion is used [10] [5]. In this case, client-specific parameters are adapted from the world-model parameters (i.e. the prior) using client data in the following manner:

$$\lambda_C^{MAP} = \alpha \cdot \lambda_C^{ML} + (1 - \alpha) \cdot \lambda_{\bar{C}} \quad (2)$$

where $\lambda_C^{ML}$ denotes the client parameters obtained from a Maximum Likelihood estimation. The adaptation parameter $\alpha$ is used to weight the relative importance of the obtained ML statistics with respect to the prior.
Proposed Model Bayesian Networks
In this section, we will briefly describe the framework used to build the statistical generative model to represent a face. Bayesian networks (also known as belief networks) provide an intuitive way to represent the joint probability distribution
over a set of variables: random variables are represented as nodes in a directed acyclic graph, and links express causality relationships between these variables. More precisely, defining $Pa(X_i)$ as the parents of the variable $X_i$, the joint probability encoded by such a network over the set of variables $X = (X_1, \ldots, X_n)$ is given by the following chain rule:

$$P(X) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)) \quad (3)$$
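Eq. (3) is straightforward to evaluate once the conditional distributions are available; the following sketch uses dictionaries and caller-supplied CPD functions, which are our own conventions, not part of the paper.

```python
def joint_probability(assignment, cpds, parents):
    """Chain rule of Eq. (3): P(X) = prod_i P(X_i | Pa(X_i)).
    assignment: {name: value}, parents: {name: [parent names]},
    cpds: {name: function(value, parent_values) -> probability}."""
    p = 1.0
    for name, value in assignment.items():
        pa_values = tuple(assignment[q] for q in parents[name])
        p *= cpds[name](value, pa_values)
    return p
```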
Hence, a Bayesian Network is fully defined by the structure of the graph and by its parameters, which consist in the conditional probability distributions of each variable given its parents. Note however that a variable may have no parents, in which case its probability distribution is a prior distribution.

Inference. The task of inference in Bayesian Networks consists in computing probabilities of interest, once evidence has been entered into the network (i.e. when one or more variables have been observed). In other words, entering evidence consists in either fixing the state of a discrete variable to one of its possible values or assigning a value in the case of a continuous variable. We are then interested in finding the effect this evidence has on the distribution of the other unobserved variables. There are many different algorithms for performing inference; the most renowned is certainly the belief propagation due to Pearl [16], which is a generalisation of the forward-backward procedure for HMM. However, it becomes problematic when applied to multiply-connected networks. Another more generic method is the Junction Tree algorithm [8], which can compute such posterior probabilities in any kind of network and is also the most efficient algorithm to perform exact inference.

Learning. Learning in Bayesian Networks refers either to structure learning, parameter learning or both [11]. In our case, we are considering networks of fixed structure. Hence, parameters are learned using the classical EM algorithm [9] with either the ML or the MAP criterion described previously (Sec. 2).

3.2 Face Representation
Figure 1 depicts the proposed model to represent a face using salient facial features. Shaded nodes are representing visible observations (eyebrows, eyes, nose and mouth) derived from the face image, whereas white nodes are representing the hidden causes that generated these observations. This model can be understood as follows: a face is described by a set of unknown dependencies between eyebrows and eyes (node BE), eyes and nose (node EN ) and nose and mouth (node N M ). These combinations then generate a certain type of facial features (such as a small nose, or broad lips for instance) which are represented by the nodes at the second level. And finally, these types of facial features then generate the corresponding observations.
Fig. 1. Static Bayesian Network for Face Representation (hidden nodes B, E, N, M, BE, EN, NM; observed nodes Olb, Orb, Ole, Ore, On, Om)
In this network, hidden nodes are discrete-valued and observed nodes are multivariate Gaussians. The likelihood of the face representation defined by X = (Olb, Orb, Ole, Ore, On, Om) is obtained by first inferring the distribution of the hidden variables once observations have been entered in the network, and then by summing out over the states of the hidden variables. Note that our model introduces relationships between observations: if the node Ole is observed, information about the node Ore can be inferred through the node E, for instance.
4 Experiments and Results

4.1 The BANCA Database
The BANCA database [2] was especially meant for multi-modal biometric authentication and contains 52 clients (English corpus), equally divided into two groups g1 and g2 used for development and evaluation respectively. Each corpus is extended with an additional set of 30 other subjects and is referred to as the world model. Image acquisition was performed with two different cameras: a cheap analogue webcam, and a high-quality digital camera, under several realistic scenarios: controlled (high-quality camera, uniform background, controlled lighting), degraded (webcam, non-uniform background) and adverse (high-quality camera, arbitrary conditions). Figure 2 shows examples of the different acquisition scenarios. In the BANCA protocol, seven distinct configurations for the training and testing policy have been defined. In our experiments, the configuration referred to as Match Controlled (Mc) has been used. Basically, it consists in training the system with five images per client acquired during the first controlled session. Then, the testing phase is performed with images acquired during the remaining sessions under the controlled scenario.
Fig. 2. Example of the different scenarios in the BANCA database: (a) controlled, (b) degraded, (c) adverse
4.2 Experimental Framework

Each image was first converted to grayscale and processed by an Active Shape Model (ASM) in order to locate the facial features [7]. Then, histogram equalization was applied on the whole image so as to enhance its contrast. Blocks centered on a subset of facial features were extracted (Fig. 1), and in order to increase the amount of training data, shifted versions were also considered. Hence, in our experiments we use the original extracted block as well as 24 other neighbouring blocks, resulting from extractions with shifts of 2, 3 and 4 pixels in each direction. Each block is finally decomposed in terms of the 2D Discrete Cosine Transform (DCT) in order to build the final observation vectors. Face authentication is subject to two types of error: either the true client is rejected (false rejection) or an impostor is accepted (false acceptance). In order to measure the performance of authentication systems, we use the Half Total Error Rate (HTER), which combines the False Rejection Rate (FRR) and the False Acceptance Rate (FAR) and is defined as:

$$HTER = \frac{FAR + FRR}{2} \; [\%] \quad (4)$$

Hyperparameters, such as the threshold $\tau$, the dimension of the DCT feature vectors, the cardinality of the hidden nodes and the adaptation parameter $\alpha$, were selected using the validation set along a DET curve at the point corresponding to the Equal Error Rate (EER), where the false acceptance rate equals the false rejection rate. HTER performance is then obtained on the evaluation set with these selected hyperparameters.
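As a small illustration (with hypothetical score arrays as input), the FAR, FRR and the HTER of Eq. (4) at a given threshold can be computed as follows:

```python
import numpy as np

def far_frr_hter(client_scores, impostor_scores, tau):
    """FAR, FRR and the resulting HTER (all in percent) at threshold tau."""
    far = float(np.mean(np.asarray(impostor_scores) >= tau)) * 100.0
    frr = float(np.mean(np.asarray(client_scores) < tau)) * 100.0
    return far, frr, (far + frr) / 2.0
```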
4.3 Results
Here we present face authentication results obtained with the proposed model based on Bayesian Networks (BNFACE), and also with a baseline GMM system. Since the main assumption that drove us towards our approach was that blocks containing facial features should not be treated independently, we reproduced the experiment with the so-called Partial Shape Collapse GMM (PSC-GMM) first presented in [13]. However, in order to yield a fair comparison, we use exactly the same set of features produced for our model, and
hence did not take into account the nose bridge and both cheek regions used in [13]. In our experiments, DCT feature vectors of dimension 64 were used, and the cardinality of the discrete variables was set to 3 at the first level and to 8 at the second level. Regarding the PSC-GMM model, we used 512 Gaussians for each of the six GMM corresponding to the six extracted facial features, as suggested in [13]. In Tab. 1, we report the results obtained by our approach (BNFACE), by our implementation of the PSC-GMM and by the GMM approach as published in [5]. Note that in [5] only the results on g2 are available. The proposed BNFACE model outperforms the corresponding PSC-GMM approach on both sets of the BANCA database. Moreover, the obtained results on the test set g2 are better than those obtained with a single GMM [5]. This comparison is interesting since this GMM-based system uses many more features extracted from the whole face image. Note also that our model contains the fewest client-specific parameters to be learned.

Table 1. HTER Performance on the Mc protocol of the BANCA database

FA system    HTER on g1 [%]   HTER on g2 [%]   number of parameters
BNFACE       9.01             5.41             5225
PSC-GMM      11.31            11.34            6 · 33280
GMM [5]      not available    8.9              9216
Since the results presented in [13] are reported in terms of EER on a graph, we also compare EER performance on both development and test sets in Tab. 2. Note however that the numeric results from [13] are estimated from the graph, and are thus subject to small imprecisions. Once again, we notice that the proposed approach performs better than the PSC-GMM using the same features, as can also be seen on the DET curves (Fig. 3). However, the results of the original PSC-GMM are better. This can be explained by the fact that it uses more features than our model to perform the face authentication task. Note also that our model provides better performance than classical appearance-based models such as Eigenfaces and Fisherfaces, as provided in [13].

Table 2. EER Performance on the Mc protocol of the BANCA database

FA system         EER on g1 [%]   EER on g2 [%]
BNFACE            9.01            4.84
PSC-GMM           11.31           6.92
PSC-GMM [13]      3.9             4.1
Fisherfaces [13]  10.2            11.5
Eigenfaces [13]   13.8            14.0
Fig. 3. DET curves obtained on the Mc protocol of the BANCA database for the BNFACE (solid line) and the PSC-GMM (dashed line) models: (a) DET curves on g1, (b) DET curves on g2
5 Conclusion and Future Directions

In this paper, we proposed a new statistical generative model to represent human faces, and applied it to the face authentication task. The main novelty of our approach consists in introducing dependencies between observations derived from salient facial features. As shown by the conducted experiments on a benchmark database, our main hypothesis seems to be verified, since our model performs better than systems relying on the independence assumption between facial features. Moreover, the obtained results also compare favourably against classical holistic methods such as Eigenfaces and Fisherfaces. However, this work is a preliminary attempt to use static Bayesian Networks in face recognition and many issues are still open. Indeed, future research directions are manifold. First, causal relationships between facial features are not known (at least to our knowledge) and finding the right structure for the network is not straightforward. Second, it will be interesting to use other facial features possibly carrying more discriminative information, such as skin texture for instance, and incorporate them into the network.
Acknowledgments

This work has been funded by the GMFace project of the Swiss National Science Foundation (SNSF) and the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2). Software was implemented using the TorchVision library (http://torch3vision.idiap.ch) and experiments were carried out using the PyVerif framework (http://pyverif.idiap.ch).
References

1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face Recognition With Local Binary Patterns. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 469–481. Springer, Heidelberg (2004)
2. Bailly-Baillière, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariéthoz, J., Matas, J., Messer, K., Popovici, V., Porée, F., Ruiz, B., Thiran, J.-P.: The Banca Database and Evaluation Protocol. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
3. Bartlett, M., Movellan, J., Sejnowski, T.: Face Recognition by Independent Component Analysis. IEEE Trans. on Neural Networks 13(6), 1450–1464 (2002)
4. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
5. Cardinaux, F., Sanderson, C., Bengio, S.: User Authentication via Adapted Statistical Models of Face Images. IEEE Trans. on Signal Processing 54(1), 361–373 (2005)
6. Cardinaux, F., Sanderson, C., Marcel, S.: Comparison of MLP and GMM classifiers for face verification on XM2VTS. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688. Springer, Heidelberg (2003)
7. Cootes, T.F., Taylor, C.J., Cooper, D., Graham, J.: Active Shape Models: Their Training and Applications. Computer Vision & Image Understanding 61(1), 38–59 (1995)
8. Cowell, G., Dawid, P., Lauritzen, L., Spiegelhalter, J.: Probabilistic Networks and Expert Systems. Springer, Heidelberg (1999)
9. Dempster, A., Laird, N., Rubin, D.: Maximum Likelihood From Incomplete Data via the EM Algorithm. The Journal of the Royal Statistical Society 39, 1–37 (1977)
10. Gauvain, J.-L., Lee, C.-H.: Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. IEEE Transactions on Speech and Audio Processing 2(2), 291–298 (1994)
11. Heckerman, D.: Tutorial on Learning With Bayesian Networks. In: Learning in Graphical Models, pp. 301–354. MIT Press, Cambridge (1999)
12. Heisele, B., Ho, P., Wu, J., Poggio, T.: Face Recognition: Component-based versus Global Approaches. Computer Vision and Image Understanding 91(1), 6–21 (2003)
13. Lucey, S., Chen, T.: A GMM Parts Based Face Representation for Improved Verification through Relevance Adaptation. In: IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 855–861. IEEE Computer Society Press, Los Alamitos (2004)
14. Nefian, A.: Embedded Bayesian Networks for Face Recognition. In: IEEE Intl. Conference on Multimedia and Expo (ICME). IEEE Computer Society Press, Los Alamitos (2002)
15. Nefian, A., Hayes, M.: Hidden Markov Models for Face Recognition. In: IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 2721–2724. IEEE Computer Society Press, Los Alamitos (1998)
16. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco (1988)
17. Rodriguez, Y., Marcel, S.: Face Authentication Using Adapted Local Binary Pattern Histograms. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 321–332. Springer, Heidelberg (2006)
Face Authentication with Salient Local Features
887
18. Samaria, F., Young, S.: HMM-based Architecture for Face Identification. Image and Vision Computing 12(8), 537–543 (1994) 19. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. In: IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 586–591. IEEE Computer Society Press, Los Alamitos (1991) 20. Wiskott, L., Fellous, J.-M., Kr¨ uger, N., Von Der Malsburg, C.: Face Recognition By Elastic Bunch Graph Matching. In: 7th Intl. Conf. on Computer Analysis of Images and Patterns (CAIP), pp. 456–463 (1997)
Fake Finger Detection by Finger Color Change Analysis Wei-Yun Yau1, Hoang-Thanh Tran2, Eam-Khwang Teoh2, and Jian-Gang Wang1 1
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 2 Nanyang Technological University, EEE, Singapore
[email protected],
[email protected],
[email protected],
[email protected] Abstract. The reliability of a fingerprint recognition system is seriously undermined if the fingerprint scanner can be spoofed by a fake finger. Therefore, fake finger detection is necessary. This work introduces a new approach to detect fake fingers based on the property of color change exhibited by a real live finger when it touches a hard surface. The force exerted when the finger presses on the hard surface changes the blood perfusion, which results in a whiter color appearance compared to a normal, uncompressed region. A method to detect and quantify such color change is proposed and used to differentiate a real finger from fakes. The proposed approach is privacy friendly, fast and does not require any special action from the user or prior user training. The preliminary experimental results indicate that the proposed approach is promising in detecting fake fingers made using gelatin, which are particularly hard to detect. Keywords: Fake finger, Color change, blood perfusion.
1 Introduction The identity of an individual is a very critical “asset” that facilitates myriad daily activities such as financial transactions, access to places and buildings, and access to computerized accounts. However, the traditional means of identity authentication using tokens such as a card and key, or a personal identification number and password, are becoming vulnerable to identity theft. Therefore, the ability to correctly authenticate an individual using biometrics is becoming important. Unfortunately, a biometric system is also not fool-proof. It is subject to various threats, including attacks at the communication channels (such as replay attacks), the software modules (such as replacing the matching module), the database of the enrolled users and the sensor with fakes [1], [2]. Recently, several researchers have shown that it is possible to spoof fingerprint recognition systems with fake fingers [3],[4]. These attacks range from enhancing the latent prints left on the finger scanner with pressure and/or background materials, to creating fingerprint molds using materials such as silicone, gelatin and Play-Doh, as well as the use of cadaver fingers [5].
S.-W. Lee and S.Z. Li (Eds.): ICB 2007, LNCS 4642, pp. 888–896, 2007. © Springer-Verlag Berlin Heidelberg 2007
In order to counter such spoof attacks, fingerprint recognition system vendors have considered several approaches to detect the liveness of the finger. In general, these approaches can be classified into the following three categories [6]:
1. Analysis of skin details in the acquired images: minute details of the fingerprint images are used, e.g. detecting sweat pores [5] and the coarseness of the skin texture. A high resolution sensor is usually needed. Furthermore, sweat pores and skin texture vary with finger condition, such as dry or wet fingers, and thus these approaches usually have a large false error rate.
2. Analysis of static properties of the finger: additional hardware is used to capture information such as temperature, impedance or other electrical measurements, odor, and spectroscopy [12], in which the finger is exposed to multiple wavelengths and the spectrum of the reflected light is analyzed to determine the liveness of the finger. Except for the spectroscopy technique, the other approaches can be easily defeated. However, the spectroscopy technique requires an expensive sensing mechanism.
3. Analysis of dynamic properties of the finger: analysis of properties such as pulse oximetry, blood pulsation, perspiration, skin elasticity and distortion [6]. The former two approaches can only detect a dead or an entirely fake finger but cannot detect a fake layer attached to a real finger. In addition, they may reveal the medical condition of the user. To measure perspiration, the user has to place the finger on the sensor for quite some time, which may not be feasible for people with dry fingers.
A summary of the various liveness detection approaches is given in [7],[8]. Most of the approaches, except the spectroscopy and the skin distortion techniques, are not able to detect a fake finger layer made using gelatin, since gelatin contains moisture and has properties, such as its electrical properties, quite similar to those of human skin. However, the spectroscopy technique requires expensive equipment, while measuring the skin distortion requires the user to touch and twist the finger, which is not user-friendly and requires user training. In this paper, molds made using gelatin are investigated, as this is the most difficult attack to detect and the attacker can easily destroy any evidence by simply eating the gelatin mold. This paper proposes the use of a dynamic property of the skin: its color. As the finger is pressed on the hard surface of the fingerprint scanner, the color of the skin changes in the region in contact with the scanner. Such color change is dynamic and occurs only in a live finger. A method to detect and measure the color change is proposed and used to differentiate between a real and a fake finger. Section 2 describes the property of the finger color change while Section 3 describes the approaches used to detect and measure the color change. This is followed by Section 4 describing the experimental results before Section 5 concludes the paper.
2 Finger Color Change The main idea of the proposed approach to detect a fake finger is based on the change in the color of the finger portion in contact with any hard
surface, such as when the finger is pressed on the finger scanner. As a real live finger is pressed on the hard surface of the scanner, the applied force will cause interaction among the fingernail, bone and tissue of the fingertip. This will alter the hemodynamic state of the finger, resulting in various patterns of blood volume or perfusion [9]. Such patterns are observable at the fingernail bed [9],[10] and at the surrounding skin region in contact with the scanner. The compression of the skin tissue at the region in contact with the scanner causes the color of that region to become whiter and less reddish compared to the normal portion. This is because blood carries hemoglobin, which is red in color. When the finger is pressed onto the hard surface, the amount of blood that can flow to the skin region in contact with the surface is limited by the force exerted, which constricts the capillaries at the fingertip. With less blood flow, the color of the skin region becomes less reddish, resulting in a whiter appearance compared to the normal finger. Figure 1 shows the color change before (Fig. 1a and 1c) and after the finger is pressed on a hard surface (Fig. 1b and 1d). For comparison, Figure 2 shows the color of a fake finger made using gelatin before and after the finger is pressed. As shown in Figures 1 and 2, the portion of the finger in contact with the hard surface shows a change in color. We postulate this property to be true even for people
Fig. 1. Images of a real finger before (a, c) and after (b, d) pressing on a hard surface
Fig. 2. Images of a fake finger before (a, c) and after (b, d) pressing on a hard surface
of different age, ethnicity or gender, and not to be sensitive to the type or condition of the skin, such as dry, wet or oily. However, a fake finger will not have such a property, as the mold contacts the surface first and the user has to be careful when pressing the soft gelatin mold on the hard surface. As this property is dynamic and repeatable for all live fingers, it can be used to detect the liveness of the finger. The proposed method requires only an ordinary low-cost digital camera, such as those commonly used in mobile phones or PCs.
3 Fake Finger Detection Methodology To quantify the change in the color of the finger, we first model the background region using a Gaussian model when the finger is not present. When a substantial change is detected, a finger is present. This image is saved as the initial image, I_I. Then, images are continuously captured until the finger is pressed on the sensor and then lifted off. The finger image with the entire finger pressed on the sensor, which gives the largest contact area, is taken as the desired image, I_D. Image I_I is then aligned to I_D based on the tip and the medial axis of the finger, as shown in Figure 3. The tip is defined as the point where the medial axis of the fingertip cuts the border of the fingertip.
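As an illustration of this acquisition step, the following Python/NumPy sketch models the empty background with a per-pixel Gaussian and picks I_I (the first frame in which a finger is present) and I_D (the frame with the largest contact area). The function names, the z-score threshold and the presence fraction are assumptions made for the example, not values from the paper.

```python
import numpy as np

def fit_background(frames):
    """Per-pixel Gaussian model of the empty scene: mean and std over frames."""
    stack = np.stack(frames).astype(np.float64)        # shape (T, H, W, 3)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def foreground_mask(frame, bg_mean, bg_std, z_thresh=3.0):
    """Pixels deviating by more than z_thresh standard deviations are foreground."""
    z = np.abs(frame.astype(np.float64) - bg_mean) / bg_std
    return (z > z_thresh).any(axis=-1)

def select_initial_and_desired(frames, bg_mean, bg_std, presence_frac=0.02):
    """I_I: first frame where a finger appears; I_D: frame with the largest contact area."""
    initial, desired, best_area = None, None, -1.0
    for f in frames:
        area = foreground_mask(f, bg_mean, bg_std).mean()
        if area > presence_frac and initial is None:
            initial = f
        if area > best_area:
            best_area, desired = area, f
    return initial, desired
```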
Fig. 3. Alignment of finger using the finger tip. The red line is the medial axis while the red square dot at the boundary between the finger and background is the tip point.
Subsequently, the foreground region of the finger is smoothed using an averaging filter. Then the region is divided into n1 × n2 non-overlapping square blocks of size s × s, beginning from the medial axis of the finger. Since we are interested in detecting the color change only, we convert the original image from RGB into the CIE La*b* color space, which provides good results experimentally. The color of the fingertip before touching the hard surface is homogeneous. Thus we regard the chrominance component of all the n1 × n2 blocks in the initial image I_I (before pressing) as modeled by a single Gaussian distribution, μ_o, σ_o. This is taken as the reference chrominance value for the normal, un-pressed fingertip region, R_o. For each block in the pressed image I_D, clustering using hierarchical k-means [11] is performed on the chrominance component. Then the dominant clusters with homogeneous chrominance values, of center μ_1 and standard deviation σ_1 different from μ_o, σ_o, found from the n1 × n2 blocks are taken as the reference value for the compressed region R_1 of a pressed fingertip. This is then repeated for all the k images of I_I and I_D in the training sets to obtain the overall reference values for the normal region R_o(μ_ro, σ_ro) and the pressed region R_1(μ_r1, σ_r1) of the individual. Given a pixel x_{i,j}, its similarity with respect to the normal region R_o or the pressed region R_1 can be quantified using the distance measure:

$$D_{i,j}^{t}\big(x_{i,j} \mid \mu_{r_t}, \sigma_{r_t}\big) \;=\; \frac{(x_{i,j}-\mu_{r_t})^{2}}{\sigma_{r_t}^{2}}\,; \qquad t = 0, 1 \tag{1}$$
The pixel x_{i,j} is assigned to its proper region, R, based on the following thresholding operation:

$$R(x_{i,j}) \;=\;
\begin{cases}
0, & D_{i,j}^{0} < \alpha_o\\
1, & D_{i,j}^{1} < \alpha_1\\
2, & \text{otherwise}
\end{cases} \tag{2}$$
where α_o, α_1 are the thresholds for the similarity measure to the regions R_o and R_1 respectively. For each block n among the n1 × n2 blocks of I_I and I_D, the likelihood that n belongs to the category normal (R_o), pressed (R_1) or otherwise (R_2) can be obtained from the dominant homogeneous region assigned to the pixels in it as:

$$R(n_{i,j}) \;=\; \operatorname{mod}\big(R(x_{i,j}) \mid x_{i,j} \in n_{i,j}\big) \tag{3}$$

where mod(·) denotes the most frequently occurring (dominant) label among the pixels of the block.
When verifying the liveness of a finger, the category assigned to each of the n1 × n2 blocks in images I_I and I_D is determined using equations (2) and (3). Then the finger is considered real if it satisfies the following criteria:

$$
\begin{aligned}
& R(n_{i,j} \mid I_I) \in R_o \quad \forall\, n \mid I_I\\
\wedge\;\; & \operatorname{mod}\big(R(n_{i,j} \mid I_D)\big) \notin R_2\\
\wedge\;\; & \sum \big(R(n_{i,j} \mid I_D) - R(n_{i,j} \mid I_I)\big) > \alpha_R
\end{aligned} \tag{4}
$$
where αR is the threshold for real finger verification.
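A minimal Python sketch of the block-wise classification and the liveness criteria of Eqs. (1)–(4) is given below. It is only an illustrative reading of the method, not the authors' implementation; the function names are hypothetical, chrominance pixels are treated as small vectors, and the last criterion of Eq. (4) is interpreted here as the number of blocks that change category between I_I and I_D.

```python
import numpy as np

def assign_region(x, mu, sigma, alpha):
    """Eqs. (1)-(2): distance of a chrominance pixel x to the normal (t=0) and
    pressed (t=1) reference regions, followed by thresholding."""
    d = [np.sum((np.asarray(x) - np.asarray(mu[t])) ** 2 / np.asarray(sigma[t]) ** 2)
         for t in (0, 1)]
    if d[0] < alpha[0]:
        return 0              # normal region R_o
    if d[1] < alpha[1]:
        return 1              # pressed region R_1
    return 2                  # otherwise, R_2

def block_label(block_pixels, mu, sigma, alpha):
    """Eq. (3): a block takes the dominant (most frequent) label of its pixels."""
    labels = [assign_region(x, mu, sigma, alpha) for x in block_pixels]
    return int(np.bincount(labels, minlength=3).argmax())

def is_live(blocks_II, blocks_ID, mu, sigma, alpha, alpha_R):
    """Eq. (4): all blocks of I_I must look normal, I_D must not be dominated by the
    'otherwise' class, and enough blocks must change category between I_I and I_D."""
    lab_I = [block_label(b, mu, sigma, alpha) for b in blocks_II]
    lab_D = [block_label(b, mu, sigma, alpha) for b in blocks_ID]
    if any(l != 0 for l in lab_I):
        return False
    if int(np.bincount(lab_D, minlength=3).argmax()) == 2:
        return False
    changed = sum(1 for a, b in zip(lab_I, lab_D) if a != b)
    return changed > alpha_R
```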
4 Experimental Results In order to evaluate the proposed approach, a database of images was collected using the prototype setup shown in figure 4.
Fig. 4. Prototype system setup for data collection, consisting of a light source, a glass plate and a camera. The camera used is a PC camera with 0.5-Mpixel resolution, while the light source is obtained from a series of LEDs.
The images were collected from 25 human subjects using a white LED light source. Prior to any capture, the background image without the presence of a finger was first captured. Then two images were acquired from each subject, one before the finger pressed the glass and the other after the finger pressed the glass. No guidance was given to the subjects on how they should press their finger, except that they were told to press as if they were using a fingerprint recognition system. This was then repeated with the subject wearing a gelatin mold to simulate a fake finger. A new gelatin mold was made for each subject, since the gelatin mold quality deteriorates with time. Thus we have a total of 50 images (before and after pressing) for real fingers and another 50 images (before and after pressing) for fake fingers in the dataset. All these images were used in the performance evaluation of the proposed system.
Based on this dataset, we are able to correctly detect all real fingers and achieve 80% accuracy in detecting the fake finger as fake. Table 1 shows the results obtained.

Table 1. Result for real and fake finger detection

              Correct detection rate   False detection rate
Real finger   100%                     0%
Fake finger   80%                      20%
Figure 5 shows some sample results obtained for the real and the fake finger respectively. The errors in detecting the fake finger occur when the gelatin mold is made very thin and is properly stuck to the real finger before touching the glass surface, or are due to errors in the segmentation process. We found that in some cases the fake finger was detected even before pressing. This is because the mold was not well made and contained bubbles, which causes it to be rejected as the homogeneity constraint of the finger is no longer valid.
Fig. 5. Results of color change detection obtained on a real finger and the corresponding fake finger of the same subject: (a, g) real finger before pressing, (b, h) real finger after pressing, (c, i) change detected; (d, j) fake finger before pressing, (e, k) fake finger after pressing, (f, l) change detected. The colored part is the detected change.
The advantages of the proposed method are that it is fast, since the capture is done in real time and does not have to wait for perspiration to set in as in [5]. It also does not require the use of expensive hardware as in [12], and has no implications for the privacy of the user as it does not reveal the medical condition of the person. In addition, it does not require careful interaction of the user with the scanner as in [6], and thus can be readily deployed for a large user population without prior user training.
5 Conclusion and Future Works This paper presented a new approach to detect fake fingers based on the property of color change exhibited by a real live finger when the finger touches a hard surface. Such a condition arises naturally due to blood perfusion, which is universal. The proposed approach is privacy friendly, fast and does not require any special action from the user or prior user training. The preliminary experimental results indicate that the proposed approach is promising in detecting fake fingers made using gelatin, which are particularly hard to detect.
We are currently working on the use of different light sources to improve the accuracy of detecting fake fingers, as well as collecting more data, including data from different ethnic backgrounds, for the purpose of studying the effect of skin color and pressure variation. We recognize that the current setup is only applicable to optical scanners. Thus another direction of future work is to investigate the possibility of using the front, back or side view of the fingertip to detect the color change property. These views are useful for non-optical fingerprint scanners, where a small digital camera can be installed to capture an appropriate image for detecting the color change property and thus verify the validity of the finger.
References
1. Ratha, N.K., Connell, J.H., Bolle, R.M.: Enhancing security and privacy in biometrics-based authentication systems. IBM Syst. J. 40(3), 614–634 (2001)
2. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
3. Matsumoto, T., Matsumoto, H., Yamada, K., Hoshino, S.: Impact of artificial “Gummy” fingers on fingerprint systems. In: Proc. SPIE, vol. 4677, pp. 275–289 (2002)
4. Putte, T., Keuning, J.: Biometrical fingerprint recognition: Don’t get your fingers burned. In: Proc. 4th Working Conf. Smart Card Research and Adv. App., pp. 289–303 (2000)
5. Parthasaradhi, S.T.V., Derakhshani, R., Hornak, L.A., Schuckers, S.A.C.: Time-series detection of perspiration as a liveness test in fingerprint devices. IEEE Trans. SMC–Part C 35(3), 335–343 (2005)
6. Antonelli, A., Cappelli, R., Maio, D., Maltoni, D.: Fake finger detection by skin distortion analysis. IEEE Trans. Info. Forensics & Security 1(3), 360–373 (2006)
7. Schuckers, S.: Spoofing and anti-spoofing measures. Inform. Security Tech. Rep. 7(4), 56–62 (2002)
8. Valencia, V., Horn, C.: Biometric liveness testing. In: Woodward Jr., J.D., Orlans, N.M., Higgins, R.T. (eds.) Biometrics, McGraw Hill, New York (2002)
9. Mascaro, S.A., Asada, H.H.: The common patterns of blood perfusion in the fingernail bed subject to fingertip touch force and finger posture. Haptics-e 4(3), 1–6 (2006)
10. Mascaro, S.A., Asada, H.H.: Understanding of fingernail–bone interaction and fingertip hemodynamics for fingernail sensor design. In: Proc. 10th Int. Symp. Haptic Interfaces for Virtual Environment and Teleoperator Systems, pp. 106–113 (2002)
11. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Chichester (2001)
12. Nixon, K., et al.: Novel spectroscopy-based technology for biometric and liveness verification. In: Proc. SPIE, vol. 5404, pp. 287–295 (2004)
Feeling Is Believing: A Secure Template Exchange Protocol Ileana Buhan, Jeroen Doumen, Pieter Hartel, and Raymond Veldhuis University of Twente, Enschede, The Netherlands {ileana.buhan,jeroen.doumen,pieter.hartel, r.n.j.veldhuis}@utwente.nl
Abstract. We use grip pattern based biometrics as a secure side channel to achieve pre-authentication in a protocol that sets up a secure channel between two hand held devices. The protocol efficiently calculates a shared secret key from biometric data. The protocol is used in an application where grip pattern based biometrics is used to control access to police hand guns. Keywords: Ad-hoc authentication, fuzzy cryptography, biometrics.
1 Introduction We are developing smart guns with grip pattern biometrics [11] for the police, to reduce the risk of officers being shot with their own weapon in a take-away situation [7]. In this scenario a police handgun authenticates the owner by grip pattern biometrics integrated with the grip of the gun. Police officers often work in teams of two and each officer must be able to fire the other officer's weapon. Normally, teams are scheduled in advance so that the appropriate templates can be loaded into the weapons at the police station. However, in emergency situations this is not possible; in this case police officers have to team up unprepared and exchange templates in the field. Biometric data is sensitive information, thus during the exchange the templates must be protected. Officers may work with colleagues from other departments, even from neighboring countries, so a shared key, or a public key infrastructure in which the certificates associated with the keys must be verifiable on-site, is not realistic. Also, one cannot expect a police officer to perform a complicated interfacing operation with his gun in the field. In this paper we present a solution that is both simple and effective. Each police officer owns a gun that holds the owner's stored biometric grip pattern template and is equipped with a grip pattern sensor and a short range radio. The users swap the devices, so that each device can measure the grip pattern of the other user. The devices are then returned to their owners. Each device now contains a genuine template of its owner and a measurement of the other user, called the guest. The devices calculate a common key from the owner's template and the guest measurement. The act of the guest putting her hand on the user's device corresponds to sending a message on a secure side channel. Therefore, we have termed our protocol Feeling is Believing (FiB). Securing the exchange of biometric templates is an instance of the pairing problem. As described by Saxena [9], the pairing problem is to enable two devices, which share no prior context, to agree upon a security association that they can use to protect their subsequent communication.
S.-W. Lee and S.Z. Li (Eds.): ICB 2007, LNCS 4642, pp. 897–906, 2007. © Springer-Verlag Berlin Heidelberg 2007
Secure pairing must be resistant to a man-in-the-middle adversary who tries to impersonate one or both of these devices in the process. To achieve secure pairing one cannot rely on any previously shared secret information. This problem can be solved in cryptography by the issue of certificates. As said before, due to the ad-hoc nature of the pairings we cannot rely on an existing communications channel to a certification authority that could validate certificates. Our approach is to use an additional physically authenticated side channel which is governed by humans. Related Work. Balfanz et al. [9] propose physical contact between devices. This type of channel has the property that the user can control precisely which devices are communicating. The authors extend this approach to location limited channels, where they propose short range wireless infrared communication in which the secret information is communicated when devices are in the line of sight. McCune et al. [6] propose to use a visual channel and take photographs of the hash codes of the public keys. This represents a major breakthrough, especially from the point of view of user friendliness. In the same line of work, Goodrich et al. [4] propose a human-assisted audio channel as a secure side channel for authentication. They use a text-to-speech engine for vocalizing a sentence derived from the hash of a device's public key. The pre-authentication channel is used mostly to authenticate public keys. The hash of the public key is either vocalized [4] or photographed [6]. Others [12,10] use a Diffie-Hellman like key agreement scheme where short sequences transmitted on the private channel authenticate the key sent on the main channel. Our protocol is a Diffie-Hellman like protocol in the sense that both parties contribute equally to the session key. FiB achieves mutual authentication and the partial keys are extracted from the biometric identification data of the individuals. Contribution. FiB can perform secure biometric template exchange in an ad-hoc situation, its main merit being user friendliness. We stress that the application domain for FiB is more general than the gun application: FiB can securely exchange templates for any type of biometric system. FiB is formally verified to prove the secrecy and origin authentication of templates. The workload of an intruder who is trying to guess the session key is evaluated in 9 different scenarios. To the best of our knowledge, this is the first time that a biometric is used as a location limited channel. Using biometrics and cryptography together brings up challenges. Biometric data is usually noisy. However, the noise is neither reproducible nor uniformly random, thus cryptographic algorithms have difficulties in adjusting to it. We propose a new method for correcting errors in the keys generated from biometric data using a user-specific error profile.
2 FiB Protocol Description FiB solves the problem of secure template exchange and provides a solution for securing the ad-hoc transfer of private information between mobile, biometrically enabled devices in a scenario where no pre-distributed keys are available. The main threat is an intruder loading his template into one of the participating devices. Thus, we require the authorization of the device owner before his template is transferred. To preserve the privacy of the biometric template, it must be encrypted before being sent out.
Fuzzy Key Extraction. Before we turn to the protocol, we introduce the main tool. A fuzzy extractor as defined by Dodis et al. [1] is a public function that robustly extracts a binary sequence K from a noisy measurement F with the help of some public string H. Enrollment is performed by a function Gen, which on the input of the noise-free biometric T and the chosen binary string K computes a public string H = Gen(T, K). During authentication, a second function Reg takes as input a noisy measurement F and the public string H and outputs a binary string K′ = Reg(F, H). In a perfect world, K′ = K, but in reality they might be slightly different. However, we expect the keys to be close in terms of their Hamming distance when T and F are similar enough. Both T and F are multidimensional feature vectors and, while for most of the features the authentication will work correctly, for some features the noise might be too large. We write K′ = K + e, where we assume that the Hamming weight of the error e, denoted wt(e), satisfies wt(e) ≤ t, 0 ≤ t ≪ n. FiB Protocol preliminaries. Before we delve into the description of the protocol, which represents the core of our solution, we describe the context. Enrollment of users takes place before the protocol starts. During enrollment a low-noise measurement T is taken for each user. A key K of length n is generated and the function Gen(T, K) outputs the helper data H. Then the user-specific error profile E is computed. The error profile is used by the protocol to lower the error rates of the biometric system, see Section 4. This information is loaded into the device of the user. After the enrollment we have achieved that: (1) the identity of the user can be verified by his own device, and (2) a device is prepared to run the FiB protocol, which allows secure template transfer so that another device can verify the identity of the user. The Protocol. The principals that interact during the protocol are two devices Da and Db. Before the protocol starts each device knows the data of its owner, i.e. the template, the helper data constructed from the template, the key and the error profile. Hence initially Da knows {Ta, Ha, Ea, Ka} and Db knows {Tb, Hb, Kb, Eb}. The message flow of the FiB protocol is shown in Figure 1. By K′a||K′b we mean a combination of K′a and K′b, for example concatenation. Actions of device Da (Initiator). We assume that Da starts the protocol. When an unknown grip pattern is detected, Da broadcasts its helper data Ha on the wireless channel. Ha is constructed such that it does not reveal any significant information about T or K. Upon receiving message 4 from Db, Da uses the received Hb and Fb to extract K′b. The second part of the message is used to help Da recover K′a. Since Ka and K′a are close, Da can use the error profile Ea to recover K′a by flipping carefully chosen bits in Ka until it can successfully decrypt {Fa}K′a. Since Da can recognize a measurement coming from its own user, Da can check the decryption results. When Da successfully finds K′a it sends message 7 to Db. Da verifies in step 11 that Tb matches the measurement received on the secure side channel in step 1. Actions of device Db (Responder). Device Db receives Ha and detects an unknown grip pattern, Fa. The rest of the operations are similar to those of Da. Both participants have to perform the same amount of computation. FiB is a solution that offers confidentiality during data transfer and authentication.
However, FiB does not guarantee what happens to the templates after the protocol ends.
Fig. 1. FiB protocol. The message flow depicted in the figure is: (1) biometric side channel: Da obtains the guest measurement Fb and Db obtains Fa; (2) Da → Db: Ha; (3) Db: K′a = Reg(Fa, Ha); (4) Db → Da: Hb, {Fa}K′a; (5) Da: K′b = Reg(Fb, Hb); (6) Da: K′a = Correct(); (7) Da → Db: {Fb}K′a||K′b; (8) Db: K′b = Correct(); (9) Da → Db: {Ta}K′a||K′b; (10) Db → Da: {Tb}K′a||K′b; (11) Da: Verify(Fb, Tb) and Db: Verify(Fa, Ta). Steps (2), (4), (7), (9) and (10) take place over the wireless channel.
We emphasize that loading a template into one's device is similar to handing over a key to the device: at any time the owner of the template can access that device. In the scenario of the smart gun we assume that the sensitive information is stored in a tamper-resistant storage environment from which the templates cannot be “taken out”.
3 Security Evaluation for FiB Protocol There are two distinct, rigorous views of cryptography that have been developed over the years. One is a formal approach where cryptographic operations are seen as black box functions represented by symbolic expressions and their security properties are modelled formally. The other is based on a detailed computational model where cryptographic operations are seen as strings of bits and their security properties are defined in terms of probability and computational complexity of successful attacks. In the following we look at both aspects of security. 3.1 Formal Verification of the FiB Protocol with CoProVe We have formally verified that FiB satisfies secrecy of the templates and mutual authentication. The adversary, named Eve, is a Dolev-Yao [3] intruder that has complete control of the communication channel. She can listen to, or modify messages on the main communication channel between the devices but cannot access the secure side channel. The tool used for this purpose is the constraint based security protocol verifier CoProVe by Corin and Etalle [2]. An earlier version of the protocol was verified and found buggy, the published version of the protocol above fixes the flaw found. A (security) protocol is normally verified using a model of the protocol, to avoid getting bogged down in irrelevant detail. The quality of the model then determines the accuracy of the verification results. The basic difference between a protocol and a model lies in the assumptions made when modelling the protocol. We believe that the following assumptions are realistic:
1. No biometric errors. We assume that the correction mechanism always works perfectly and thus the initiator knows the key used by the sender. Thus, we look only at complete protocol rounds. When the initiator cannot work out the key the protocol is aborted. In this case we assume that Eve cannot work out the key either.
2. Modelling the secure side channel. We assume that when the protocol starts device Da knows Fa and device Db knows Fb, while Eve knows neither because she cannot eavesdrop on the secure side channel.
3. Classifier-based verification in step 11 removed. Because systems without an equational theory, such as CoProVe, cannot compare two terms, the last check before accepting a template cannot be modelled. This check prevents an intruder from modifying messages 9 and 10 in the protocol.
We have verified the model in Figure 1 with the assumptions above. We argue that the above abstractions do not affect the secrecy and the authentication properties. Verification with CoProVe explores a scenario in which one of the parties involved in the protocol plays the role of the initiator (i.e. the party starting the protocol) and the other plays the role of the responder. A third party, the intruder, learns all messages exchanged by the initiator and the responder. The intruder can devise new messages and send them to honest participants, as well as replay or delete messages. Should the intruder learn a secret key and a message encrypted with that key, then the intruder also knows the message. This is the classical Dolev-Yao intruder [3]. We have explored two scenarios that we believe to be realistic and representative of real attacks. In the first scenario two honest participants, Alice and Bob, are plagued by a powerful intruder who is assumed to know Tb (because the intruder might have communicated with Bob in a previous session). In this scenario the intruder tries to load her own template Ti into Da so as to be able to fire Alice's weapon. This impersonation attack is found to be impossible by CoProVe under the assumption that the intruder does not know Fa or Ta. This is a realistic assumption since these are not transmitted in the clear and the key used to encrypt them is computed from data sent over the secure side channel. Verification thus shows that the intruder cannot impersonate Db even though the intruder has knowledge of Tb. In the second scenario one participant, Alice, has to deal with the intruder, who does not have useful initial knowledge. This scenario represents the case that Alice's device was stolen and the intruder tries to load his template into Alice's device and use it. Verification shows that Ta remains secret when Alice is the initiator of the protocol. This means that an intruder cannot trick Alice into disclosing her template data if she does not provide the intruder with a sample of her biometric data. When Alice is the responder and the intruder has full access to the device (i.e. the intruder can submit his own biometric data), Ta will still not be disclosed. This is because, before the device sends anything useful on the wireless link, it will check whether its owner is there after step 3. As a result, in both scenarios we have verified the secrecy of the templates as well as authentication of the responder to the initiator and authentication of the initiator to the responder.
both Alice (device Da) and Bob (device Db) have to guess the session key, how much more difficult is it for Eve (the intruder) to do the same?”, “What happens if Eve already knows the templates of Alice and/or Bob?”, “What kind of guarantees does this protocol offer?” To answer these questions we study the following scenarios:
AE(0) No previous contact between Alice and Eve.
AE(1) Eve records a measurement of Alice's biometric. From the helper data sent out in the clear Eve constructs her own noisy estimate K″a of Alice's key.
AE(2) Eve had a protocol run with Alice, knows Ta and can therefore construct Ka.
We denote by W(x → y) the average number of trials that Eve has to do to guess y when she knows x, using the best guessing strategy. We analyze Eve's workload to guess K′a in the three scenarios above. In scenario AE(2), Eve knows Ka and has to guess K′a = Ka + e where wt(e) ≤ t. Since Eve has no information about the distribution of the errors, she has to correct up to t noise errors. As the key length is n, there are $\binom{n}{i}$ different error patterns if the actual number of errors is i, thus on average she will have to guess:

$$W(K_a \to K'_a) \;\approx\; \frac{1}{2}\sum_{i=0}^{t}\binom{n}{i} + \frac{1}{2}.$$

In scenario AE(1), Eve knows K″a = Ka + e′ and has to guess K′a = Ka + e, thus K′a = K″a + e′ + e. Since wt(e′ + e) ≤ 2t, Eve has workload:

$$W(K''_a \to K'_a) \;\approx\; \frac{1}{2}\sum_{i=0}^{2t}\binom{n}{i} + \frac{1}{2}.$$

In scenario AE(0) Eve has no information on Alice, thus she has to brute force all possibilities. The number of trials is approximately:

$$W(0 \to K'_a) \;\approx\; \frac{2^{n}+1}{2}.$$

The scenarios for Bob are analogous:
BE(0) No previous contact between Bob and Eve.
BE(1) Eve records a measurement of Bob.
BE(2) Eve had a previous round with Bob, thus knows Tb and can construct Kb.
Eve's workload for guessing K′b is equal to that for guessing K′a in the analogous scenario. To achieve her goal of loading her template into one of the devices, Eve has to guess K′a||K′b in all scenarios. Table 1 summarizes her workload. In each row we have the information that Eve knows about Bob and in each column the information that Eve knows about Alice. Due to the message flow in the protocol (see Figure 1), Eve might have an advantage if she has information about Alice: Eve can intercept message (4), {Fa}K′a, and recover K′a if the biometric allows a decision on whether two measurements come from the same individual. This explains the plus sign between the work of guessing K′a and the work of guessing K′b in the columns where Eve has some knowledge about Alice.
Table 1. Guesswork required for Eve to compute the session key

        | AE(0)                   | AE(1)                     | AE(2)
BE(0)   | W(0→K′a) · W(0→K′b)     | W(K″a→K′a) + W(0→K′b)     | W(Ka→K′a) + W(0→K′b)
BE(1)   | W(0→K′a) · W(K″b→K′b)   | W(K″a→K′a) + W(K″b→K′b)   | W(Ka→K′a) + W(K″b→K′b)
BE(2)   | W(0→K′a) · W(Kb→K′b)    | W(K″a→K′a) + W(Kb→K′b)    | W(Ka→K′a) + W(Kb→K′b)
We can estimate an upper bound for the workload of Alice and Bob according to Pliam's [8] equation:

$$W_{Alice}(K_a \to K'_a) \;\le\; \frac{1}{2}\Big(\sum_{i=0}^{t}\binom{n}{i}+1\Big) \;-\; \frac{1}{2}\sum_{i=0}^{t}\binom{n}{t}\,\sum_{j=1}^{n}\Big|E_j(\sigma,q)-\frac{1}{2^{n}}\Big|.$$
Here Ej (σ, q) represents the distribution of the noise, which is described below. The best case scenario for Eve, however unlikely, is [BE(2),AE(2)] when she had a previous round with both Alice and Bob. Even though she has both Ka and Kb her workload is at least twice as high compared to Alice or Bob. Moreover while for Eve all error patterns are equally likely, Alice and Bob have the error profile that makes recovery faster.
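The guesswork formulas above are easy to evaluate numerically. The short Python sketch below does so for illustrative parameter values (a key length n = 40 and an error-correction capacity t = 3 are assumptions chosen for the example, not figures from the paper).

```python
from math import comb

def W_known_key(n, t):        # Eve knows K_a, must guess K'_a = K_a + e with wt(e) <= t
    return 0.5 * sum(comb(n, i) for i in range(t + 1)) + 0.5

def W_own_measurement(n, t):  # Eve only has her own noisy estimate: up to 2t bit errors
    return 0.5 * sum(comb(n, i) for i in range(2 * t + 1)) + 0.5

def W_blind(n):               # no information at all: brute force
    return (2 ** n + 1) / 2

n, t = 40, 3                  # illustrative values only
print(W_known_key(n, t), W_own_measurement(n, t), W_blind(n))
# Even in Eve's best case [BE(2), AE(2)] her cost is W_known_key for Alice's half
# plus W_known_key for Bob's half, i.e. at least twice the work of Alice or Bob.
```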
4 Experimental Validation with Real Life Data The Correct function used during the FiB protocol uses a semi-known plain text attack to recover the key. We link the keys used in the protocol to the biometric data of a user using the template protection scheme proposed by Linnartz and Tuyls [5]. The purpose of using this function is to lower the error rate of the system by flipping bits of the result of function Reg in the order defined by the user-specific error profile E. Key search algorithm. In classical symmetric cryptography, to decrypt a message encrypted with a key K one must possess K. In particular, with a key K′ that differs in only one bit from K, decryption will fail. The FiB protocol uses this apparent disadvantage of symmetric key cryptography as an advantage: K′ is used to form the session key. The noise of the measurements is used as random salt [13] for the session key. The key search algorithm makes it possible to recover K′. We start the key search by assuming there are no errors in K′, and we use K for decryption. If decryption fails we assume that there is a one-bit error. We start flipping one bit of the key according to the position indicated by the error profile, until we have exhausted the error profile. Then we assume that two bits are wrong and we try all combinations of two bits from the error profile. Finally, if we reach the limit on the number of trials, we assume that the key is coming from an intruder. The recovery of K′ is a semi-known plain text attack. When the correct value of K′ is discovered the initiator will recognize the message encrypted with K′. This is possible since the encrypted message is a biometric template. The initiator of the protocol possesses a fresh measurement of this template and hence is able to recognize a correct match. The verification is performed by a classifier-based matching algorithm designed for this particular biometric.
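A minimal sketch of this key search is given below, assuming a bit-vector key, an error profile that orders bit positions by decreasing error probability, and a caller-supplied try_decrypt callback that returns the plaintext only when the decrypted measurement is recognised as the owner's; the names and the trial limit are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def correct_key(own_key, error_profile, ciphertext, try_decrypt, max_trials=10000):
    """Search for K' = K + e by flipping bits of own_key in the order suggested by the
    user-specific error profile. `try_decrypt(key, ciphertext)` must return the plaintext
    if it is recognised as the owner's fresh measurement, and None otherwise."""
    order = sorted(range(len(own_key)), key=lambda i: error_profile[i], reverse=True)
    trials = 0
    for n_errors in range(0, len(order) + 1):          # 0 errors first, then 1, 2, ...
        for positions in combinations(order, n_errors):
            candidate = list(own_key)
            for p in positions:
                candidate[p] ^= 1
            trials += 1
            if trials > max_trials:
                return None                             # give up: treat as an intruder
            plain = try_decrypt(tuple(candidate), ciphertext)
            if plain is not None:
                return tuple(candidate)                 # recovered K'
    return None
```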
Error profile computation. Linnartz et al. [5] propose a multiple quantization level system with odd–even bands. The embedding of binary data is done by shifting the template distribution to the center of the closest even–odd q interval if the value of the key bit is a 1, or to the center of an odd–even q interval if the value of the key bit is a 0. The calculation Gen(t_i, k_i) = h_i proceeds component-wise as follows, for i = 1, …, n:

$$\mathrm{Gen}(t_i,k_i) \;=\; h_i \;=\;
\begin{cases}
\big(2p+\tfrac{1}{2}\big)\,q - t_i, & k_i = 1\\
\big(2p-\tfrac{1}{2}\big)\,q - t_i, & k_i = 0
\end{cases}$$

where p ∈ ℤ is chosen such that h_i ≤ 2q. During authentication the key is recovered by computing:

$$\mathrm{Reg}(f_i,h_i) \;=\;
\begin{cases}
1, & 2pq \le f_i+h_i < (2p+1)\,q\\
0, & (2p-1)\,q \le f_i+h_i < 2pq
\end{cases}$$

However, during Reg, whenever the difference between the measured value f_i and the template value t_i is greater than q/2 we get an error in the key computation. Each key bit is extracted independently. Thus, the error profile is a vector of length n. The i-th value of the error profile represents the probability of the i-th key bit being computed wrongly. We assume that the features are independent and normally distributed. The error profile is computed component-wise, using the function:

$$E_i(\sigma,q) \;=\; 2\sqrt{2}\,\sigma\sum_{i=0}^{\infty}\int_{\frac{(1+4i)\,q}{2\sqrt{2}\,\sigma}}^{\frac{(3+4i)\,q}{2\sqrt{2}\,\sigma}} e^{-x^{2}}\,dx \;\approx\; 2\sqrt{2}\,\sigma\int_{\frac{q}{2\sqrt{2}\,\sigma}}^{\frac{3q}{2\sqrt{2}\,\sigma}} e^{-x^{2}}\,dx,$$
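For a single feature, the Gen and Reg functions reconstructed above can be sketched as follows. This is an illustrative Python reading of the Linnartz–Tuyls quantisation scheme, not the authors' code; the choice of p as the nearest admissible interval centre is an assumption.

```python
import math

def gen(t_i, k_i, q):
    """Helper data for one feature: shift t_i to the centre of the nearest q-interval
    whose parity encodes the key bit k_i."""
    if k_i == 1:
        p = round((t_i / q - 0.5) / 2.0)    # nearest centre of the form (2p + 1/2) q
        return (2 * p + 0.5) * q - t_i
    else:
        p = round((t_i / q + 0.5) / 2.0)    # nearest centre of the form (2p - 1/2) q
        return (2 * p - 0.5) * q - t_i

def reg(f_i, h_i, q):
    """Recover the key bit from a noisy feature f_i and helper data h_i:
    the parity of the q-band containing f_i + h_i encodes the bit."""
    band = math.floor((f_i + h_i) / q)
    return 1 if band % 2 == 0 else 0
```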
During enrollment several different measurements have to be made for each user. The error profile is based on the fact that in practical situations the estimated standard deviation is different for different users. To compute the error profile we use the same data, after dimensionality reduction, that the classifier-based matcher uses. Results. We implemented the key extraction described above. We wanted to see the influence of the correction mechanism on the overall error rates. The evaluation is performed on real life grip pattern biometric data collected from 41 police officers. A detailed description of this biometric can be found in Veldhuis et al. [11]. Each of the 41 officers contributed 25 different measurements. Approximately 75% of these samples (18) are used for training the algorithm and 25% (7) are used for testing. First, we reduce the dimensionality of the data to 40 independent features. For training and testing we use the same data that is used for verification by the classifier-based recognition algorithm. Second, we embed the key bits using the Linnartz and Tuyls scheme above. Figure 2 presents the results obtained from the collected data. We offer three conclusions from this evaluation. The first conclusion is that, as expected, the larger the quantization step the lower the FRR but the higher the FAR. We tested 8 different values for q, ranging from 1 to 5 in increments of 0.5. We did not try larger values for q because the FAR becomes unacceptably large. The second conclusion is that the influence of the correction algorithm is significant. For example, for q = 3, without correction the FRR = 13.93% and the FAR = 0%. When we correct 1 bit the FRR goes down to 5.92%,
Fig. 2. Results on grip pattern data: FAR and FRR (error rate, in %) as a function of the quantization step q, without correction and after 1-, 2- and 3-bit correction
while the FAR retains the same value of 0%. After correcting 2 bits the FRR goes down to 2.43% while the FAR remains equal to 0%. Correcting 3 bits further reduces the FRR to 1.74% while the FAR increases only slightly, to 0.07%. The third conclusion is that the correction mechanism is stable, meaning that the effect of correction is independent of the time when the data is collected. Data was collected during two sessions and the performance of the correction algorithm is similar on both sets.
5 Conclusions The contributions of this paper are threefold. Firstly, we propose FiB, a protocol that can exchange biometric templates securely even though no prior security association exists between the participants. We are confident in the guarantees offered by FiB because we formally verify the protocol to prove secrecy and origin authentication. Also, from the information-theoretic point of view, the workload of Eve is at least double that of Alice or Bob, even in the unlikely scenario where Eve has interacted with both Alice and Bob and possesses the same key material. Moreover, in this case the biometric has to be favorable to Eve in that she has to be able to verify whether two noisy measurements come from the same user or not. Secondly, for the first time we propose to use biometrics as a secure side channel. The advantage of using biometrics compared to any other type of side channel is the extreme user friendliness. Thirdly, a new mechanism for correcting biometric errors based on a user-specific error profile is proposed. We present an evaluation of the performance in terms of FAR and FRR and show that the correction algorithm significantly improves the overall results. Our experiments show that by correcting only 1 bit in the overall key the FRR is reduced by approximately 50% while the FAR is not increased significantly. We believe that our approach can be applied to other types of biometrics.
Acknowledgements The authors would like to thank Ricardo Corin and Sandro Etalle for helping formally verify the protocol using CoProVe.
References 1. Boyen, X., Dodis, Y., Katz, J., Ostrovsky, R., Smith, A.: Secure remote authentication using biometric data. In: Cramer, R.J.F. (ed.) EUROCRYPT 2005. LNCS, vol. 3494, pp. 147–163. Springer, Heidelberg (2005) 2. Corin, R., Etalle, S.: An improved constraint-based system for the verification of security protocols. In: Hermenegildo, M.V., Puebla, G. (eds.) SAS 2002. LNCS, vol. 2477, pp. 326– 341. Springer, Heidelberg (2002) 3. Dolev, D., Yao, A.: On the security of public key protocols. Information Theory, IEEE Transactions on 29, 198–208 (1983) 4. Goodrich, M.T., Sirivianos, M., Solis, J., Tsudik, G., Uzun, E.: Loud and clear: Humanverifiable authentication based on audio. In: 26th IEEE International Conference on Distributed Computing Systems (ICDCS 2006), Lisboa, Portugal, 4-7 July 2006, p. 10. IEEE Computer Society Press, Los Alamitos (2006) 5. Linnartz, J.P, Tuyls, P.: New shielding functions to enhance privacy and prevent misuse of biometric templates. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 393–402. Springer, Heidelberg (2003) 6. McCune, J., Perrig, A., Reiter, M.: Seeing-is-believing: using camera phones for humanverifiable authentication. In: Security and Privacy, 2005 IEEE Symposium on, pp. 110–124. IEEE Computer Society Press, Los Alamitos (2005) 7. NJIT: Personalized weapons technology project, progress report. Technical report, New Jersey Institute of Technology (April 2001) 8. Pliam, J.O.: Guesswork and variation distance as measures of cipher security. In: Heys, H.M., Adams, C.M. (eds.) SAC 1999. LNCS, vol. 1758. Springer, Heidelberg (2000) 9. Saxena, N., Ekberg, J., Kostiainen, K., Asokan, N.: Secure device pairing based on a visual channel (short paper). SP, 306–313 (2006) 10. Vaudenay, S.: Secure communications over insecure channels based on short authenticated strings. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 309–326. Springer, Heidelberg (2005) 11. Veldhuis, R.N.J., Bazen, A.M., Kauffman, J.A., Hartel, P.H.: Biometric verification based on grip-pattern recognition. In: Security, Steganography, and Watermarking of Multimedia Contents VI, Proceedings of SPIE, San Jose, California, USA, January 18-22, 2004, vol. 5306, pp. 634–641 (2004) 12. Wong, F.L., Stajano, F.: Multi-channel protocols for group key agreement in arbitrary topologies. In: 4th IEEE Conference on Pervasive Computing and Communications Workshops (PerCom 2006 Workshops), Pisa, Italy, 13-17 March 2006, pp. 246–250. IEEE Computer Society Press, Los Alamitos (2006) 13. Wu, T.D.: The secure remote password protocol. In: Proceedings of the Network and Distributed System Security Symposium, NDSS 1998, San Diego, California, USA. The Internet Society (1998)
SVM-Based Selection of Colour Space Experts for Face Authentication Mohammad T. Sadeghi1 , Samaneh Khoshrou1 , and Josef Kittler2 1
Signal Processing Research Lab., Department of Electronics University of Yazd, Yazd, Iran 2 Centre for Vision, Speech and Signal Processing School of Electronics and Physical Sciences University of Surrey, Guildford GU2 7XH, UK {M.Sadeghi,J.Kittler}@surrey.ac.uk
Abstract. We consider the problem of fusing colour information to enhance the performance of a face authentication system. The discriminatory information potential of a vast range of colour spaces is investigated. The verification process is based on the normalised correlation in an LDA feature space. A sequential search approach which is in principle similar to the “plus L and take away R” algorithm is applied in order to find an optimum subset of the colour spaces. The colour based classifiers are combined using the SVM classifier. We show that by fusing colour information using the proposed method, the resulting decision making scheme considerably outperforms the intensity based verification system.
1 Introduction
The spectral property of the skin albedo is known to provide useful biometric information for face recognition. Measured in terms of colour, it has been exploited directly in terms of features derived from the colour histogram [7]. Such features are then used to augment the face feature space in which the matching of a probe image against a template is carried out. Alternatively, the spectral information can be used indirectly, by treating each R, G, B colour channel as a separate image. The face matching scores computed for the respective channels are then combined to reach the final decision about the probe image identity. This indirect approach has the advantage that an existing face recognition system can simply be applied to create the respective colour space experts, without any structural redesign. This indirect approach has been investigated extensively in [13] [6]. In these studies the colour face experts were designed in different colour spaces, such as R, G, B, the intensity image, normalised green and an opponent colour channel, or colour channels decorrelated by the Principal Component Analysis (PCA). The merits of the information conveyed by the various channels individually, as well as their combined effect, have been evaluated. It has been demonstrated that some of the colour spaces provide more powerful representation of the face skin properties than others, offering significant improvements in performance as compared to the raw intensity image. In S.-W. Lee and S.Z. Li (Eds.): ICB 2007, LNCS 4642, pp. 907–916, 2007. c Springer-Verlag Berlin Heidelberg 2007
this paper we extend the work to other colour representations suggested in the literature. In total, 15 different colour spaces are considered, giving rise to 45 different colour channels. Assuming that face experts based on all these channels are available, it is pertinent to ask which channels provide complementary information and how the expert scores should be fused to achieve the best possible performance of the face recognition system. In [12] the fusion problem was solved by selecting the best expert or a group of experts dynamically with the help of a gating function learnt for each channel. In the present study we formulate the colour expert fusion problem as a feature selection problem. The colour spaces for fusion are selected using a sequential search approach similar to the Plus L and Take Away R algorithm. An important contributing factor in the selection algorithm is the “fusion rule” used for combining the colour based classifiers. In [11] untrained methods such as averaging and voting schemes were used for this purpose. However, in different applications it has been demonstrated that trained approaches such as Support Vector Machines (SVMs) have the potential to outperform the simple fusion rules, especially when enough training data is available. The main aim of this paper is to study the performance of the proposed colour selection algorithm using SVM classifiers. Surprisingly good results are obtained using the proposed method. The paper is organised as follows. In the next section different colour spaces adopted in various machine vision applications are reviewed. The face verification process is briefly discussed in Section 3. The proposed method of colour space selection is described in Section 4. The experimental set-up is detailed in Section 5. Section 6 presents the results of the experiments. Finally, in Section 7 the paper is drawn to a conclusion.
2 Colour Spaces
For computer displays, it is most common to describe colour as a set of three primary colours: Red, Green and Blue. However, it has been demonstrated that in different applications using other colour spaces can be beneficial. In this section some of the most important colour spaces are reviewed. Considering the R, G, B system as the primary colour space, we can classify the other colour spaces into two main categories: linear and nonlinear transformations of the R, G, B values.
2.1 Linear Combination of R, G, B
The CMY-based colour space is commonly used in colour printing systems. The name CMY refers to cyan, magenta and yellow. The RGB values can be converted to CMY values using:

$$C = 255 - R, \qquad M = 255 - G, \qquad Y = 255 - B \tag{1}$$

There are several CIE-based colour spaces, but all are derived from the fundamental XYZ space [2]. A number of different colour spaces including YUV,
YIQ, YES and YCbCr are based on separating luminance from chrominance (lightness from colour). These spaces are useful in compression and other image processing applications. Their formal definitions can be found in [2]. I1I2I3, or Ohta's features [8], were first introduced for segmentation as optimised colour features and are given by:

$$I_1 = \frac{R+G+B}{3}, \qquad I_2 = R - B, \qquad I_3 = 2G - R - B \tag{2}$$
The LEF colour space defines a colour model that combines the additivity of the RGB model with the intuitiveness of the hue-saturation-luminance models by applying a linear transformation to the RGB cube.
2.2 Nonlinear Combination of R, G, B
The chromaticities for the normalised RGB are obtained by normalising the RGB values with the intensity value, I:

$$r = R/I, \qquad g = G/I, \qquad b = B/I \tag{3}$$
where I = (R + G + B)/3. Similar equations are used for normalising the XYZ values. The result is a 2D space known as the CIE chromaticity diagram. The opponent chromaticity space is also defined as

$$rg = r - g, \qquad yb = r + g - 2b \tag{4}$$
Kawato and Ohya [5] have used the ab space, which is derived from the NCC rg-chromaticities as:

$$a = r + \frac{g}{2}, \qquad b = \frac{\sqrt{3}}{2}\,g \tag{5}$$

In [16], two colour spaces, namely P1 and P2, have been defined by circulating the r, g and b values in equation (4). The log-opponent (or log-opponent chromaticity) space has been applied to image indexing in [1]. The space is given by:

$$Ln_{rg} = \ln(R/G) = \ln R - \ln G, \qquad Ln_{yb} = \ln\!\Big(\frac{R\cdot G}{B^{2}}\Big) = \ln R + \ln G - 2\ln B \tag{6}$$
The TSL (Tint–Saturation–Lightness) colour space is also derived from the NCC rg-chromaticities. The l1l2l3 colour space, as presented in [4], has been adopted for colour-based object recognition. Many people find HS-spaces (HSV, HSB, HSI, HSL) intuitive for colour definition. For more information about the relevant equations used in this study, the reader is referred to [3].
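A few of the linear and nonlinear conversions listed above can be computed directly from the R, G, B values, as in the following illustrative NumPy sketch (the dictionary layout and the small constants added to avoid division by zero or log of zero are assumptions of the example).

```python
import numpy as np

def colour_channels(rgb):
    """rgb: float array of shape (..., 3) with R, G, B in [0, 255].
    Returns a dict with some of the channels discussed above."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    I = (R + G + B) / 3.0 + 1e-9                        # intensity
    out = {
        "C": 255 - R, "M": 255 - G, "Y": 255 - B,       # CMY, Eq. (1)
        "I1": I, "I2": R - B, "I3": 2 * G - R - B,      # Ohta features, Eq. (2)
        "r": R / I, "g": G / I, "b": B / I,             # NCC rgb, Eq. (3)
    }
    out["rg"] = out["r"] - out["g"]                     # opponent chromaticity, Eq. (4)
    out["yb"] = out["r"] + out["g"] - 2 * out["b"]
    eps = 1e-9
    out["Lnrg"] = np.log(R + eps) - np.log(G + eps)     # log-opponent, Eq. (6)
    out["Lnyb"] = np.log(R + eps) + np.log(G + eps) - 2 * np.log(B + eps)
    return out
```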
3 Face Verification Process
The face verification process consists of three main stages: face image acquisition, feature extraction, and finally decision making. The first stage involves sensing and image preprocessing, the result of which is a geometrically registered and photometrically normalised face image. Briefly, the output of a physical sensor (camera) is analysed by a face detector and, once a face instance is detected, the position of the eyes is determined. This information allows the face part of the image to be extracted at a given aspect ratio and resampled to a pre-specified resolution. The extracted face image is finally photometrically normalised to compensate for illumination changes. The raw colour camera channel outputs, R, G and B, are converted according to the desired image representation spaces. In this study the different colour spaces reviewed in the previous section were considered. In the second stage of the face verification process the face image data is projected into a feature space. The final stage of the face verification process involves matching and decision making. Basically the features extracted for a face image to be verified, x, are compared with a stored template, µi, that was acquired on enrolment. In [14], it was demonstrated that the Gradient Direction (GD) metric and the Normalised Correlation (NC) function in the Linear Discriminant Analysis (LDA) feature space work effectively in face verification systems. In this study we adopted the NC measure in the LDA space. The score, s, output by the matching process is then used for decision making. In our previous studies [11] [6], global or client specific thresholding methods were used for decision making. In this work, the above mentioned thresholding methods are compared to decision making using Support Vector Machines. If the score computation is applied to different colour spaces separately, we end up with a number of scores, sk = s(xk), k = 1, 2, . . . , N, which then have to be fused to obtain the final decision. The adopted fusion method is studied in the next section.
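A minimal sketch of the matching stage is given below: the probe and the stored template are projected with a pre-trained LDA basis and compared by normalised correlation, with acceptance decided by a threshold chosen on the evaluation set. The function names and the use of a single global threshold are assumptions of the example.

```python
import numpy as np

def nc_score(probe, template, lda_basis):
    """Project a probe image vector and a client template into the LDA feature space
    (lda_basis: D x d projection matrix, assumed trained beforehand) and return the
    normalised correlation between the two feature vectors."""
    x = lda_basis.T @ probe
    mu = lda_basis.T @ template
    return float(x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu) + 1e-12))

def accept(probe, template, lda_basis, threshold):
    """Global-threshold decision; the threshold would be set on the evaluation set,
    e.g. to satisfy the Equal Error Rate criterion."""
    return nc_score(probe, template, lda_basis) >= threshold
```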
4 Colour Space Selection
One of the most exciting research directions in the field of pattern recognition and computer vision is classifier fusion. Multiple expert fusion aims to make use of many different designs to improve the classification performance. The approach we adopted for selecting the best colour space(s) is similar in principle to the sequential feature selection methods in pattern recognition [10]. In this study, the Sequential Forward Selection (SFS), Sequential Backward Selection (SBS) and Plus 'L' and Take Away 'R' algorithms were examined for selecting an optimum subset of the colour spaces. Two untrained fusion rules, the sum rule and the voting scheme, and a trained fusion method, the Support Vector Machine, are used in order to combine the scores of the colour-based classifiers. The selection procedure keeps adding or taking away features (colour spaces in our case) until the best evaluation performance is achieved; a sketch of the forward step is given below. The selected colour spaces are then used in the test stage.
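The sketch below illustrates the forward step only: starting from an empty set, the colour space whose addition most reduces the evaluation-set error of the fused classifier is added at each iteration. The evaluation function, which would fuse the per-space scores (e.g., by the sum rule or an SVM) and return the total error rate, is left abstract; all names are illustrative.

```python
def sequential_forward_selection(colour_spaces, evaluate, max_size=None):
    """Greedy SFS over colour-space experts.
    `evaluate(subset)` must return the total error rate of the fused
    classifier built from `subset` on the evaluation set (lower is better)."""
    selected, best_err = [], float("inf")
    candidates = list(colour_spaces)
    while candidates and (max_size is None or len(selected) < max_size):
        errs = [(evaluate(selected + [c]), c) for c in candidates]
        err, best_c = min(errs, key=lambda t: t[0])
        if err >= best_err:          # no further improvement -> stop
            break
        selected.append(best_c)
        candidates.remove(best_c)
        best_err = err
    return selected, best_err
```

The Plus 'L' and Take Away 'R' variant used in the experiments alternates L such forward steps with R backward removals, which allows it to recover from early greedy mistakes.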
4.1 Support Vector Machines
A Support Vector Machine is a two-class classifier showing superior performance to other methods in terms of Structural Risk Minimisation [15]. For a given training sample {xi, yi}, i = 1, ..., N, where xi ∈ R^D is an object marked with a label yi ∈ {−1, 1}, it is necessary to find the direction w along which the margin between objects of the two classes is maximal. Once this direction is found, the decision function is determined by the threshold b:

    y(x) = sgn(w · x + b)                                         (7)
The threshold is usually chosen to provide equal distance to the closest objects of the two classes from the discriminant hyperplane w · x + b = 0, which is called the optimal hyperplane. When the classes are not linearly separable, some objects can be shifted by a value δi towards the right class. This converts the original problem into one which exhibits linear separation. The parameters of the optimal hyperplane and the optimal shifts can be found by solving the following quadratic programming problem:

    minimise    w · w + C Σ_{i=1}^{N} δi
    subject to: yi (w · xi + b) ≥ 1 − δi,  δi ≥ 0,  i = 1, ..., N          (8)

where the parameter C defines the penalty for shifting the objects that would otherwise be misclassified in the case of linearly non-separable classes. The QP problem is usually solved in its dual formulation:

    maximise    Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xi · xj
    subject to: Σ_{i=1}^{N} αi yi = 0,  0 ≤ αi ≤ C,  i = 1, ..., N          (9)

Those training objects xi with αi > 0 are called Support Vectors, because only they determine the direction w:

    w = Σ_{i=1, αi>0}^{N} αi yi xi                                          (10)
The dual QP problem can be rapidly solved by the Sequential Minimal Optimisation method, proposed by Platt [9]. This method exploits the presence of linear constraints in (9). The QP problem is iteratively decomposed into a series of one variable optimisation problems which can be solved analytically. For the face verification problem, the size of the training set for clients is usually less than the one for impostors. In such a case, the class of impostors is represented better. Therefore, it is necessary to shift the optimal hyperplane
towards the better represented class. In this work, the size of the shift is determined in the evaluation step considering the Equal Error Rate criterion.
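As a rough illustration of this decision stage, the sketch below trains a linear SVM on evaluation-set feature vectors and then shifts the decision threshold so that FAR and FRR are equal on that set. It relies on scikit-learn purely for convenience; the labels, the brute-force threshold search and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_with_eer_shift(X_eval, y_eval, C=1.0):
    """Train a linear SVM and pick the bias shift that equalises FAR and FRR
    on the evaluation data. y_eval: +1 for clients, -1 for impostors."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_eval, y_eval)
    d = clf.decision_function(X_eval)            # signed distances to hyperplane
    best_shift, best_gap = 0.0, np.inf
    for shift in np.linspace(d.min(), d.max(), 1000):
        pred = np.where(d - shift >= 0, 1, -1)
        far = np.mean(pred[y_eval == -1] == 1)   # impostors accepted
        frr = np.mean(pred[y_eval == 1] == -1)   # clients rejected
        if abs(far - frr) < best_gap:
            best_gap, best_shift = abs(far - frr), shift
    return clf, best_shift
```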
5 Experimental Design
The aim of the experiments is to show that by fusing the sensory data used by component experts, the performance of the multiple classifier system improves considerably. We use the XM2VTS database¹ and its associated experimental protocols for this purpose. The XM2VTS database is a multi-modal database consisting of face images, video sequences and speech recordings taken of 295 subjects at one-month intervals. Since the data acquisition was distributed over a long period of time, significant variability of appearance of clients, e.g. changes of hair style, facial hair, shape and presence or absence of glasses, is present in the recordings.

For the task of personal verification, a standard protocol for performance assessment has been defined. The so-called Lausanne protocol randomly splits all subjects into client and impostor groups. The client group contains 200 subjects; the impostor group is divided into 25 evaluation impostors and 70 test impostors. The XM2VTS database contains 4 sessions. Eight images from the 4 sessions are used. From these sets of face images, a training set, an evaluation set and a test set are built. There exist two configurations that differ by the selection of particular shots of people into the training, evaluation and test sets. The training set is used to construct client models. The evaluation set is selected to produce client and impostor access scores, which are used for designing the required classifier. The score classification is done either by the SVM classifier or by thresholding. The thresholds are set either globally (GT) or using the client-specific thresholding (CST) technique [6]. According to the Lausanne protocol, the threshold is set to satisfy the Equal Error Rate criterion, i.e. the operating point where the false rejection rate (FRR) is equal to the false acceptance rate (FAR). False acceptance is the case where an impostor, claiming the identity of a client, is accepted. False rejection is the case where a client, claiming his true identity, is rejected. The evaluation set is also used in the fusion experiments (classifier combination) for training. The SVM-based sequential search algorithms pick the best colour spaces using this set of data. Finally, the test set is selected to simulate realistic authentication tests where the impostor's identity is unknown to the system. The performance measures of a verification system are the False Acceptance rate and the False Rejection rate.

The original resolution of the image data is 720 × 576. The experiments were performed with relatively low resolution face images, namely 64 × 49. The results reported in this article have been obtained by applying a geometric face registration based on manually annotated eye positions. Histogram equalisation was used to normalise the registered face photometrically.
http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb/
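The Equal Error Rate criterion used for the global threshold (GT) can be made concrete with a few lines of code operating on genuine (client) and impostor score arrays; this is our own illustration of the criterion, not code from the XM2VTS distribution.

```python
import numpy as np

def eer_threshold(client_scores, impostor_scores):
    """Return the global threshold where FAR ~= FRR, plus the two rates."""
    candidates = np.sort(np.concatenate([client_scores, impostor_scores]))
    best = (None, 1.0, 1.0, np.inf)
    for t in candidates:
        far = np.mean(impostor_scores >= t)    # impostors wrongly accepted
        frr = np.mean(client_scores < t)       # clients wrongly rejected
        if abs(far - frr) < best[3]:
            best = (t, far, frr, abs(far - frr))
    return best[:3]

# Example with synthetic scores (higher score = better match)
rng = np.random.default_rng(1)
thr, far, frr = eer_threshold(rng.normal(0.8, 0.1, 400), rng.normal(0.5, 0.1, 40000))
print(thr, far, frr)
```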
6 Experimental Results
Table 1 shows the performance of the face verification system using the individual colour spaces considering the first configuration of the Lausanne protocol. The decision boundary in the NC space was determined using the SVMs. The values in the table indicate the FAR and FRR in both the evaluation and test stages. As we expect, the best performance is obtained neither in the original RGB spaces nor in the intensity space. Some other colour spaces, such as U in the YUV space or the opponent chromaticities, can individually lead to better results. Table 2 shows some results of the same experiments considering the second XM2VTS protocol configuration. As mentioned earlier, in similar experiments the global or client-specific thresholding techniques were used for decision making. Figure 1 contains plots of the total error rate in different colour spaces in the evaluation and test stages. These results demonstrate that although in most of the cases the SVMs work better than the GT technique, the CST leads to better or comparable results.

Table 1. Identity verification results using different colour spaces. The classification boundary was determined using the SVMs (configuration 1).

subspace    R      G      B      I      H      Sat    Val    r      g
FAR Eval.   2.19   1.97   1.75   2.14   1.92   1.76   2.19   1.93   1.48
FRR Eval.   2.17   2      1.83   2.17   1.83   1.66   2.16   1.83   1.5
FAR Test    2.41   1.97   1.97   2.17   1.84   1.81   2.46   2.08   1.47
FRR Test    1.5    1.75   1.75   1.25   0.5    1.25   2      0.75   1

subspace    b      T(TSL) S(TSL) L(TSL) V(YUV) rg     U(YUV) Cr     I2
FAR Eval.   1.69   1.52   1.31   2.11   2.31   1.49   1.91   1.82   2.31
FRR Eval.   1.66   1.5    1.33   2.166  2.33   1.5    1.83   1.83   2.33
FAR Test    1.73   1.32   1.53   2.1491 2.26   1.60   1.75   2.21   2.3
FRR Test    1.25   1      1.75   1.5    0.75   1.25   0      1.5    0.75

subspace    I3     E(LEF) F(LEF) X(CIE) Y(CIE) Z(CIE) Y(YES) E(YES) S(YES)
FAR Eval.   1.67   2.1    1.66   2.27   2.11   2      2.05   1.82   2
FRR Eval.   1.6    2.166  1.67   2.33   2.166  2      2      1.83   2
FAR Test    1.69   2.21   1.60   2.38   2.14   2.25   2.10   1.65   1.82
FRR Test    0.75   0.5    0.5    1.25   1.5    1.75   1.5    0.75   0.25

subspace    I(YIQ) Q(YIQ) a(ab)  b(ab)  Lnrg   Lnyb   l1     l2     l3
FAR Eval.   2.17   1.85   1.79   1.52   1.22   1.49   2.05   2.35   1.71
FRR Eval.   2.16   1.83   1.83   1.5    1.16   1.5    2      2.33   1.67
FAR Test    2.48   1.75   1.92   1.58   1.35   1.59   1.74   2.46   1.58
FRR Test    1.5    0.75   1.25   1.25   1.75   1.25   1      1.25   1.50

subspace    L(HSL) Xn     Yn     Zn     C(CMY) M(CMY) Y(CMY) bg
FAR Eval.   2.33   1.80   1.65   1.52   2.47   2.02   1.74   1.66
FRR Eval.   2.33   1.83   1.66   1.5    2.5    2      1.83   1.66
FAR Test    2.42   1.90   1.69   1.57   2.79   2.02   1.97   1.73
FRR Test    1      1      1.25   0.5    1.75   1.75   1.75   0.75
Table 2. Identity verification results using some of the colour spaces (configuration 2)

subspace    R     G     B     I     H     S     V     r     g     b
FAR Eval.   1.26  1.25  1.25  1.24  1.25  1.23  1.25  1.36  0.74  1.34
FRR Eval.   1.25  1.25  1.25  1.25  1.25  1.25  1.25  1.25  0.75  1.25
FAR Test    1.65  1.67  1.87  1.80  1.14  1.71  1.61  2.02  0.76  1.67
FRR Test    1.5   1.5   1.5   1.5   0.5   1     1.5   1.5   0.5   0.75
Fig. 1. Verification results in different colour spaces. The decision boundary has been determined using GT, SVMs or CST. [Four panels plot the total error rate (TER) against the subspace ID: (a) Config. 1, Evaluation; (b) Config. 2, Evaluation; (c) Config. 1, Test; (d) Config. 2, Test.]
In the next step, the adopted search method, the Plus 'L' and Take Away 'R' algorithm, was used for selecting a subset of colour spaces. Figure 2 shows the resulting error rates for different numbers of colour spaces in Configurations 1 and 2. In the search algorithm, L = 2 and R = 1. Before fusing, the scores associated with each colour space were appropriately normalised. The normalised scores were then considered as a feature vector. The SVM classifier was finally used for decision making. Table 3 contains the fusion results obtained using the adopted "Plus 2 and take away 1" algorithm for Configurations 1 and 2. These results were obtained by score fusion using the averaging rule or the SVMs. Note that, using the search algorithm, colour spaces are selected from the evaluation data for the whole data set. However, the algorithm is flexible so that, for different conditions, different spaces can be adopted adaptively. In the case of the experimental protocols of the XM2VTS database, using the SVM classifier the Lnrg, a(ab) and V(HSV) spaces have been selected for the first configuration, while b(ab), U(YUV), bg (opp-chroma), Cr, X(CIE), I, V(HSV), H(HSV) and S(YES) have been adopted for the second one. These results first of all demonstrate that the proposed colour selection algorithm considerably improves the performance of the face verification system. Moreover, they also show that although not much gain in performance can be
Fig. 2. SVM-based Plus 2 and Take Away 1 results. [Four panels plot FAR, FRR and HTER (% of error) against the number of subspaces: (a) Config. 1, Evaluation; (b) Config. 2, Evaluation; (c) Config. 1, Test; (d) Config. 2, Test.]

Table 3. Verification results using the proposed colour selection and fusion methods

                             Evaluation            Test
Configuration  Fusion rule   FAR    FRR    TER     FAR    FRR    TER
1              Averaging     0.48   0.5    0.98    0.55   1      1.55
1              SVMs          0.5    0.5    1       0.59   1      1.59
2              Averaging     0.19   0.25   0.44    0.27   0.25   0.52
2              SVMs          0.24   0.25   0.49    0.38   0      0.38
obtained from the SVMs in the colour spaces individually (Figure 1), combining the colour-based classifiers using SVMs would overall be beneficial.
7 Conclusions
We addressed the problem of fusing colour information for face authentication. In a face verification system which is based on the normalised correlation measure in the LDA face space, an SVM-based sequential search approach similar to the "plus L, take away R" algorithm was applied in order to find an optimum subset of the colour spaces. Using the proposed method, the performance of the verification system was considerably improved compared to using the intensity image alone or any of the other colour spaces individually. Within the framework of the proposed method, we showed that fusing the colour-based classifiers using SVMs outperforms the simple averaging rule.

Acknowledgements. The financial support from the Iran Telecommunication Research Centre is gratefully acknowledged. Partial support from the EU Network of Excellence Biosecure and from the EPSRC Grant GR/S98528/01 is also gratefully acknowledged.
References 1. Berens, J., Finlayson, G.: Log-opponent chromaticity coding of colour space. In: Proceedings of the Fourth IEEE International Conference on Pattern Recognition, pp. 1206–1211. IEEE Computer Society Press, Los Alamitos (2000) 2. Colantoni, P., et al.: Color space transformations. Technical report, http://www.raduga-ryazan.ru/files/doc/colorspacetransform95.pdf 3. Foley, J., van Dam, A., Feiner, S., Hughes, J.: Computer graphics: principles and practice, 2nd edn. Addison-Wesley Longman Publishing Co. Inc., Boston, MA, USA (1996) 4. Gevers, T., Smeulders, A.: Colour based object recognition. In: ICIAP, vol. 1, pp. 319–326 (1997) 5. Kawato, S., Ohya, J.: Real-time detection of nodding and head-shaking by directly detecting and tracking the ”between-eyes”. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 40–45. IEEE Computer Society Press, Los Alamitos (2000) 6. Kittler, J., Sadeghi, M.: Physics-based decorrelation of image data for decision level fusion in face verification. In: Roli, F., Kittler, J., Windeatt, T. (eds.) MCS 2004. LNCS, vol. 3077, pp. 354–363. Springer, Heidelberg (2004) 7. Marcel, S., Bengio, S.: Improving face verification using skin colour information. In: 16th International Conference on Pattern Recognition, vol. 2, pp. 20378–20382 (2002) 8. Ohta, Y., Kanade, T., Sakai, T.: Colour information for region segmentation. Computer Graphics and Image Processing 13(3), 222–241 (1980) 9. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, Redmond, Washington (April 1998) 10. Pudil, P., Novovicova, J., Kittler, J.: Floating search methods in feature selection. Pattern Recognition Letters 15, 1119–1125 (1994) 11. Sadeghi, M., Khoshrou, S., Kittler, J.: Colour feature selection for face authentication. In: Proceedings of the International Conference on Macine Vision Applications, MVA’07, Japan, May 2007 (2007) 12. Sadeghi, M., Khoshrou, S., Kittler, J.: Confidence based gating of colour features for face authentication. In: Proceedings of the 7th International Workshop on Multiple Classifier System, MCS’07, Czech Republi, May 2007, pp. 121–130 (2007) 13. Sadeghi, M., Kittler, J.: A comparative study of data fusion strategies in face verification. In: The 12th European Signal Processing Conference, Vienna, Austria, 6-10 September 2004 (2004) 14. Sadeghi, M., Kittler, J.: Decision making in the LDA space: Generalised gradient direction metric. In: The 6th International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004, pp. 248–253 (2004) 15. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 16. Vertan, C., Cuic, M., Boujemaa, N.: On the introduction of a chrominance spectrum and its applications. In: Proceedings of the First International Conference on Colour in Graphics and Image Processing, 1-4 October 2000, pp. 214–218 (2000)
An Efficient Iris Coding Based on Gauss-Laguerre Wavelets

H. Ahmadi 1,2, A. Pousaberi 1, A. Azizzadeh 3, and M. Kamarei 1

1 Dept. of Electrical and Computer Engineering, University of Tehran
2 Dept. of Electrical and Computer Engineering, University of British Columbia
3 Research Center, Ministry of Communication, Tehran, Iran
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper preliminary results of a new iris recognition algorithm using a Gauss-Laguerre filter of circular harmonic wavelets are presented. Circular harmonic wavelets (CHWs), applied in this paper for iris pattern extraction, are polar-separable wavelets with harmonic angular shape. The main focus of this paper is on iris coding using Gauss-Laguerre CHWs, which constitute a family of orthogonal functions satisfying the wavelet admissibility condition required for a multiresolution pyramid structure. It is shown that Gauss-Laguerre wavelets, having rich frequency extraction capabilities, are powerful tools for coding of iris patterns. By judicious tuning of the Laguerre parameters, a 256-byte binary code is generated for each iris. A fast matching scheme based on the Hamming distance is used to compute the similarity between pairs of iris codes. Preliminary experimental results on CASIA and our own database indicate that the performance of the proposed method is highly accurate with zero false rate and is comparable with the Daugman iris recognition algorithm well publicized in the literature.

Keywords: Biometrics, Iris recognition, Gauss-Laguerre wavelets, Circular harmonic wavelets.
1 Introduction

Security and surveillance of information and personnel are becoming more and more important, in part due to the rapid development of information technology (IT) and its widespread applications in all aspects of daily life. For surveillance of personnel, biometrics, the science of personal identification that utilizes physical/biological or behavioral characteristics of an individual, is now widely used in numerous security applications. For a biometric system to be a valid candidate for use in human identification, it is required that it embodies a number of unique characteristics of an individual that remain consistent during the life of a person. Fingerprints, voiceprints, retinal blood vessel patterns, face, iris pattern and handwriting are examples of biometrics that can be used in place of non-biometric methods. Iris-based biometrics is considered to provide highly accurate personal identification, mainly due to the rich texture of the iris pattern as well as the nonintrusive nature of biometric extraction [1].
Iris recognition is based on the visible features of the human iris (see Fig. 1), which include rings, furrows, freckles, and the iris corona. These unique features are protected by the body's own mechanisms and cannot be modified without risk. As such, iris features are considered to be among the most accurate and reliable biofeatures for personal identification [1] and have been a subject of wide attention during the last decade. The standard iris recognition process consists of the following three major steps:

• Preprocessing, which includes image capturing, image filtering and enhancement (optional), followed by iris localization, normalization, denoising and image enhancement.
• Iris feature extraction.
• Iris feature classification.
Several algorithms have been proposed by numerous groups for iris-based biometrics, of which the most publicized and accepted algorithm belongs to John Daugman [2]. Daugman used multiscale quadrature wavelets to extract phase structure information of the iris texture and generate a 2048-bit iris code. For identification, the difference between a pair of iris codes is evaluated using the Hamming distance criterion. In Daugman's scheme, it has been shown that a Hamming distance of less than 0.34 measured with any of the iris templates in the database was acceptable for identification. Since the introduction of Daugman's algorithm, numerous attempts have been made by other groups to arrive at compatible or alternative algorithms, and several claims have been made on the adequacy of these algorithms for iris recognition. Daugman used the following three innovations for feature extraction and subsequent coding:

1. An integro-differential scheme for iris segmentation.
2. A Cartesian-to-polar transformation followed by normalization.
3. A 256-byte binary code based on Gabor filters.
These three steps have also been utilized by other research groups, in which alternative algorithms, including modifications of Gabor filters, have been proposed. Among them, the algorithms proposed by Wildes [3] and Ma [4] are claimed to be comparable with Daugman's scheme in accuracy of recognition. In this paper, we present an alternative iris coding scheme which utilizes Gauss-Laguerre filters during iris coding, where it is shown that the results are commensurable with those of Daugman. The proposed algorithm using Gauss-Laguerre wavelets, first introduced in this paper, was tested using the CASIA database as well as data generated by this research group. The rich information of the Gauss-Laguerre filter in various frequency bands enables an effective extraction of the necessary iris patterns and generates a meaningful, unique code for each individual. The proposed algorithm also includes an algorithm based on the Radon transformation for iris localization, which is computationally efficient and contributes significantly to increasing the speed of iris localization. In the sequel, a brief reference is made to some of the related research works on iris-based biometrics, followed by an overview of our proposed algorithm in Section 2. A description of circular harmonic wavelets (CHW) is included in Section 3. Image preprocessing for segmentation, feature extraction and pattern matching are
Fig. 1. Typical iris images: (left) CASIA database, (right) our captured iris image
given in Sections 4 and 5, respectively. Experimental results on the iris databases are reported in Section 6. Further work is described in Section 7. Finally, Section 8 concludes this paper.
2 A Brief Overview of the Literature and the Proposed Scheme

Biometrics using iris features is relatively new, with most of the work done during the last decade. A pioneering work was done by John Daugman [2], in which multiscale quadrature wavelets were used to extract texture phase structure information of the iris to generate a 2,048-bit iris code. The Hamming distance was used to evaluate the difference between a pair of iris representations. It was shown that a Hamming distance of lower than 0.34 with any of the iris templates in the database suffices for identification. Ma et al. [4], [5] adopted a well-known texture analysis method (multichannel Gabor filtering) to capture both global and local details in an iris image. They studied Gabor filter families for feature extraction. Wildes et al. [3] considered a Laplacian pyramid constructed at four different resolution levels and used normalized correlation for matching. Boles and Boashash [6] used the zero-crossings of a 1D wavelet transform at various resolution levels to distinguish the texture of the iris. Tisse et al. [7] constructed the analytic image (a combination of the original image and its Hilbert transform) to demodulate the iris texture. Lim et al. [8] used the 2D Haar wavelet and quantized the 4th-level high frequency information to form an 87-bit binary feature vector, and applied an LVQ neural network for classification. Woo Nam et al. [9] exploited scale-space filtering to extract unique features that use the direction of concavity of the image from an iris image. A modified Haralick's co-occurrence method with a multilayer perceptron has also been introduced for extraction and classification of iris images [10]. In prior work carried out by this group, Daubechies-2 wavelets were utilized for the analysis of local patterns of the iris [11], [12].

A typical automatic iris recognition system (AIRS) requires the implementation of several steps as indicated in Fig. 2. In the first step, an imaging system must be designed to capture a sequence of iris images with sufficient detail from the subject. After image capturing, image preprocessing is applied to the images, which includes several important stages consisting of identification of the boundaries of the iris, image enhancement, normalization and coordinate transformation. To implement an automatic iris recognition system (AIRS), we used a new algorithm for iris segmentation that is considered to perform faster than the currently reported methods. This was followed by a new iris feature extraction and coding algorithm using
Gauss-Laguerre wavelets. For the matching process, as the last step in the identification process and for comparing an iris code with the database, the Hamming distance between the class of each iris code and the input code was utilized. Descriptions of the details of the proposed algorithms are given in the sections that follow.
3 Gauss-Laguerre Wavelets
The circular harmonic wavelets (CHWs) are polar-separable wavelets with harmonic angular shape. They are steerable in any desired direction by simple multiplication with a complex steering factor, and as such they are referred to as self-steerable wavelets. The CHWs were first introduced in [14], which utilizes concepts from circular harmonic functions (CHFs) employed in optical correlations for rotation-invariant pattern recognition. The same functions also appear in harmonic tomographic decomposition and have been considered for the analysis of local image symmetry. In addition, CHFs have recently been employed for the definition of rotation-invariant pattern signatures [8]. A family of orthogonal CHWs forming a multiresolution pyramid, referred to as the circular harmonic pyramid (CHP), is utilized for coefficient generation and coding. In essence, each CHW pertaining to the pyramid represents the image by translated, dilated and rotated versions of a CHF. At the same time, for a fixed resolution, the CHP orthogonal system provides a local representation of the given image around a point in terms of CHFs. The self-steerability of each component of the CHP can be exploited for pattern analysis in the presence of rotation (in addition to translation and dilation) and, in particular, for pattern recognition irrespective of orientation [15]. CHFs are complex, polar-separable filters characterized by a harmonic angular shape, a useful property for building rotationally invariant descriptors. A scale parameter is also introduced to perform a multiresolution analysis. The Gauss-Laguerre filters (CHFs), which constitute a family of orthogonal functions satisfying the wavelet
Fig. 2. Flow diagram of system for eye localization and iris segmentation
admissibility condition required for multiresolution wavelet pyramid analysis, are used. Similar to Gabor wavelets, any image may be represented by translated, dilated and rotated replicas of the Gauss-Laguerre function. For a fixed resolution, the Gauss-Laguerre CHFs provide a local representation of the image in the polar coordinate system centered at a given point, named the pivot. This representation is called the Gauss-Laguerre transform [16]. A circular harmonic function is a complex polar-separable filter characterized by a harmonic angular shape, which is represented in polar coordinates as

    L_k^(n)(r, θ) = h(r) e^{inθ}                                              (1)

where h(r) is

    h(r) = (−1)^k 2^{(n+1)/2} π^{n/2} [k!/(n+k)!]^{1/2} r^n L_k^n(2πr²) e^{−πr²}   (2)

Here r and θ are polar coordinates and L_k^n(·) is the generalized Laguerre polynomial:

    L_k^n(r) = Σ_{h=0}^{k} (−1)^h C(n+k, k−h) r^h / h!                        (3)

For different radial orders k and angular orders n, these functions are called Gauss-Laguerre (GL) functions and form an orthogonal basis set under the Gaussian function. As any CHF, GL functions are self-steering, i.e. they rotate by an angle φ when multiplied by the factor e^{jnφ}. In particular, the real and imaginary parts of each GL function form a geometrically in-phase-quadrature pair. Moreover, GL functions are isomorphous with their Fourier transform. It is shown in [15] that each GL function defines an admissible dyadic wavelet. Thus, the redundant set of wavelets corresponding to different GL functions constitutes a self-steering pyramid, useful for local and multiscale image analysis. The real part of a GL function is depicted in Fig. 3 (for better visualization, the resolution of the filter is enhanced), and the GL pyramid (real part of the CHW) is shown in Fig. 4. An important feature of GL functions as applied to an iris recognition system is that GL functions with various degrees of freedom can be tuned to significant visual features. For example, for n = 1, GL functions are tuned to edges, for n = 2 to ridges, for n = 3 to equiangular forks, and for n = 4 to orthogonal crosses, irrespective of their actual orientation. Let I(x, y) be the observed image. For every site of the plane, it is possible to perform the GL analysis by convolving it with each properly scaled GL function,

    g_{jnk}(x, y) = (1/a^{2j}) L_k^n(r cosθ / a^{2j}, r sinθ / a^{2j})        (4)

where a^{2j} are the dyadic scale factors.
Fig. 3. Real part of GL function for n = 4, k = 0, j = 2
Fig. 4. GL pyramid. Real part of the CHW for n = 2, k = 0, 1, 2, j = 0, 1, 2.
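To make the definitions above concrete, the following sketch samples a Gauss-Laguerre filter g_jnk on a small grid using equations (1), (2) and (4), with SciPy's generalized Laguerre polynomial. The grid size, the coordinate normalisation and the unit scale a = 2 are illustrative choices, not parameters taken from the paper.

```python
import numpy as np
from math import factorial
from scipy.special import eval_genlaguerre

def gauss_laguerre_filter(size=33, n=2, k=0, j=0, a=2.0):
    """Complex Gauss-Laguerre CHF sampled on a size x size grid (Eqs. 1-2, 4)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1] / float(size)  # normalised coords
    scale = a ** (2 * j)
    r = np.hypot(x, y) / scale
    theta = np.arctan2(y, x)
    norm = ((-1) ** k) * 2 ** ((n + 1) / 2.0) * np.pi ** (n / 2.0) \
           * np.sqrt(factorial(k) / factorial(n + k))
    h = norm * r ** n * eval_genlaguerre(k, n, 2 * np.pi * r ** 2) * np.exp(-np.pi * r ** 2)
    return (h * np.exp(1j * n * theta)) / scale

# Real part of a filter tuned to ridges (n = 2), as used for iris texture
flt = gauss_laguerre_filter(n=2, k=0, j=0)
print(flt.shape, np.abs(flt).max())
```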
4 Image Preprocessing

Preprocessing in iris recognition consists of three steps: iris edge detection, normalization and enhancement. A captured image contains not only the iris but also segments of the eyelid, eyelashes, pupil and sclera, all of which are undesirable. The distance between the camera and the eye and the environmental lighting conditions (dilation of the pupil) can influence the size of the iris. Therefore, image preprocessing is a necessary step before any feature extraction can be applied, to overcome these problems. However, because the proposed algorithm is robust against illumination and contrast effects, we have not considered enhancement or noise removal during iris preprocessing.

4.1 Iris Localization
The iris boundaries are considered to be composed of two non-concentric circles. For iris recognition, it is necessary to determine the inner and outer boundaries, their radii and centers. Several approaches have been proposed using iris edge detection. Our method is based on iris localization using the Radon transform and a local search in the iris image. The details of the approach are given in [13]. The key contribution of the Radon transformation and the consequent local search approach is its fast response in identifying the iris boundaries. For instance, the detection time on the CASIA image database [17] was only 0.032 seconds on average, evaluated on a platform with a 2.2 GHz Intel CPU, 512 MB
RAM and MATLAB source code. In comparison with existing methods, our approach exhibited a high success rate and a low detection time. The results of detection on the two databases [13] show a high degree of reliability in iris segmentation.

4.2 Iris Normalization
Different image acquisition processes and conditions can influence the results of identification. The dimensional incongruities between eye images are mostly due to the stretching of the iris caused by pupil expansion/contraction under varying illumination. Other factors that contribute to such deformations include variation of the camera-to-eye distance and rotation of the camera or head. Hence a solution must be contrived to remove these deformations. The normalization process projects the iris region into a constant dimensional ribbon so that two images of the same iris under different conditions have characteristic features at the same spatial location. Daugman [2] suggested a normal Cartesian-to-polar transform that remaps each pixel in the iris area into a pair of polar coordinates (r, θ), where r and θ are on the intervals [0, 1] and [0, 2π], respectively. In this paper, we have used Daugman's approach on images of size 512 × 128, and 12.5% of the lower part of the normalized image was discarded.
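A minimal sketch of this Cartesian-to-polar (rubber-sheet) remapping is given below. It assumes the pupil and iris boundaries have already been localised as circles and simply samples the iris annulus into a 512 × 128 strip; the circle parameterisation and the nearest-neighbour sampling are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def normalise_iris(img, pupil, iris, width=512, height=128):
    """Unwrap the iris annulus into a width x height polar strip.
    pupil, iris: (cx, cy, radius) of the inner and outer boundary circles."""
    thetas = np.linspace(0, 2 * np.pi, width, endpoint=False)
    radii = np.linspace(0, 1, height)
    strip = np.zeros((height, width), dtype=img.dtype)
    for i, r in enumerate(radii):
        for j, t in enumerate(thetas):
            # linearly interpolate between the pupil and iris boundary circles
            x = (1 - r) * (pupil[0] + pupil[2] * np.cos(t)) + r * (iris[0] + iris[2] * np.cos(t))
            y = (1 - r) * (pupil[1] + pupil[2] * np.sin(t)) + r * (iris[1] + iris[2] * np.sin(t))
            strip[i, j] = img[int(round(y)) % img.shape[0], int(round(x)) % img.shape[1]]
    return strip
```

The lower 12.5% of such a strip would then be discarded, as described above.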
5 Feature Extraction

As stated earlier, GL functions provide a self-steering pyramidal analysis structure that can be used to extract iris features and to code the iris efficiently. Using GL functions, where by proper choice of their parameters it is possible to generate a set of redundant wavelets, enables an accurate extraction of the complex texture features of an iris. The redundancy of the wavelet transform and the higher degree of freedom in selecting the parameters of the GL function, as compared with those of Gabor wavelets, make GL functions highly suitable for iris feature extraction. The self-steering pyramid structure of GL-based image analysis is distinctly different from Gabor wavelets: by choosing the parameters of the GL functions, we are able to carry out both local and multiscale analysis of a given image in a more effective manner. To take advantage of the degrees of freedom provided by the GL function, it is necessary that the parameters of the filters be tuned to significant visual features and iris texture patterns so that the desirable frequency information of the iris patterns can be extracted. In our experiments it was found that quality results were obtained for n = 2, where several simulation runs were carried out to verify the selection of parameters in which the filters were convolved directly with the mapped iris image. The other parameters were also adjusted by trial and error and by observing the simulation results. The output of the filtered image is a complex-valued code that contains the iris information for coding. Based on the sign of each entry, we assign +1 to positive values and 0 to the others. Finally, a 2048-bit binary code is generated for each iris; a sample of the code is depicted in Fig. 5. For the matching process, the Hamming distance is used. Rotation is accounted for by shifting the code against the templates in the database. We take into account 12 bit shifts in both directions (left and right) and 2 bit shifts upwards and downwards. The minimum distance between the input code and all codes in the database for each class is used for decision making.
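The matching step therefore reduces to Hamming distances between binary codes under a small set of circular shifts that compensate for eye rotation. A compact sketch operating on 2-D binary code arrays is given below; the shift ranges (±12 columns, ±2 rows) follow the text, while the code layout and the helper names are illustrative assumptions.

```python
import numpy as np

def hamming_distance(code_a, code_b):
    """Fraction of disagreeing bits between two binary iris codes."""
    return np.count_nonzero(code_a != code_b) / code_a.size

def min_shifted_distance(query, template, max_col_shift=12, max_row_shift=2):
    """Minimum Hamming distance over circular column/row shifts of the query."""
    best = 1.0
    for dr in range(-max_row_shift, max_row_shift + 1):
        for dc in range(-max_col_shift, max_col_shift + 1):
            shifted = np.roll(np.roll(query, dr, axis=0), dc, axis=1)
            best = min(best, hamming_distance(shifted, template))
    return best

# Toy example with random 16 x 128 bit codes (2048 bits)
rng = np.random.default_rng(2)
a = rng.integers(0, 2, size=(16, 128), dtype=np.uint8)
print(min_shifted_distance(a, np.roll(a, 3, axis=1)))   # ~0 once the shift is found
```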
Fig. 5. Step-by-step flowchart (from left to right, top to bottom): input image, segmented iris, normalized iris, complex-valued iris code, binary code.
6 Experimental Results

To evaluate the performance of the proposed algorithm, it was tested on two iris databases: (a) the CASIA database [17] and (b) a set of iris images collected locally using an iris camera system built at our research lab. Our preliminary database constructed locally has 280 images from 40 distinct subjects. Three images were taken at the first session, followed by 4 images taken at a second session. Our local iris imaging system was a modified webcam with infrared lighting. For each iris class,
Fig. 6. The distribution of matching distances between authorized and impostor users. A reasonable separation exists between them.
Fig. 7. Four iris images that failed in the verification process. As can be seen, all are heavily occluded by eyelids and eyelashes.
we chose three samples taken at the first session for training, and all samples captured at the second session serve as test samples. This is also consistent with the widely accepted practice for testing biometric algorithms [18]. We tested the proposed algorithm in two modes: identification and verification. In the identification tests, the correct classification rate was 100% and complete separability between inter-class and intra-class distances was achieved on both databases. In verification mode, a 1:1 search for each image was done. The algorithm was able to successfully verify each input image except in the case of 4 iris images in CASIA and 0 images in our local database. The discrepancy and the error can easily be attributed to poor image quality and the image capturing process. Fig. 6 shows the distribution of the intra-class and inter-class matching distances of the algorithm. The four images that failed correct identification are shown in Fig. 7; as can be seen, the images are extremely occluded or have excessive pupil dilation. The system was implemented in MATLAB on a 1.8 GHz Intel processor with 256 MB RAM. The entire processing time, including localization, preprocessing and normalization, feature extraction and matching, is only 0.2 + 0.05 + 0.15 + 0.05 = 0.45 seconds.
7 Further Work

This paper outlines the results of our preliminary attempt at designing an effective iris recognition system for personal identification. As such, we plan modifications to improve the system both in hardware and in the software algorithm. In future work, we will attempt to improve the identification results, for which image enhancement will be incorporated during the preprocessing stage. Such preprocessing is anticipated to increase classification accuracy, i.e., to increase the separation distance between inter- and intra-classes. It is also planned to examine a new scheme in XOR classification in which circular shifting to simulate eye rotation will be considered. Masking of the eyelids and eyelashes is also planned as part of future work.
8 Conclusion

In this paper, a new and efficient algorithm for fast iris recognition is proposed, for which Gauss-Laguerre (GL) wavelets have been utilized. The Gauss-Laguerre filter is conceived to be more suitable for the extraction of iris texture information than the commonly used Gabor filter. By adjusting the Laguerre parameters and applying the filter to the normalized iris, a 256-byte binary code is generated for each individual. An exclusive-OR classifier was used to compute and evaluate the matching score. Despite the absence of iris enhancement and denoising during image preprocessing, the results showed that the algorithm is sufficiently robust against contrast and illumination factors. Experimental results on the two databases indicated superior performance of the GL-based algorithm in both identification and verification modes.
References 1. Jain, A., Bolle, R., Pankanti, S.: Biometrics: Personal Identification in a Networked Society. Kluwer Academic Publishers, Dordrecht (1999) 2. Daugman, J.: High Confidence Visual Recognition of Persons by a Test of Statistical Independence. IEEE Trans. PAMI 15, 1048–1061 (1993)
3. Wildes, R.P.: Iris Recognition: An Emerging Biometric Technology. In: Proc. IEEE, vol. 85, pp. 1348–1363 (1997) 4. Ma, L., Tan, T., Wang, Y., Zhang, D.: Efficient Iris Recognition by Characterizing Key Local Variations. IEEE Trans. Image Processing 13 (2004) 5. Ma, L., Wang, Y., Tan, T.: Personal Iris Recognition Based on Multichannel Gabor Filtering. In: ACCV2002, Melbourne Australia (2002) 6. Boles, W., Boashash, B.: A Human Identification Technique Using Images of the Iris and Wavelet Transform. IEEE Trans. Signal Processing 46, 1085–1088 (1998) 7. Tisse, C., Martin, L., Torres, L., Robert, M.: Person Identification Technique Using Human Iris Recognition. In: Proc. Vision Interface, pp. 294–299 (2002) 8. Lim, S., Lee, K., Byeon, O., Kim, T.: Efficient Iris Recognition through Improvement of Feature Vector and Classifier. ETRI Journal 23 (2001) 9. Nam, K.W., Yoon, K.L., Bark, J.S., Yang, W.S.: A Feature Extraction Method for Binary Iris Code Construction. In: Proc. 2nd Int. Conf. Information Technology for Application (2004) 10. Jaboski, P., Szewczyk, R., Kulesza, Z.: Automatic People Identification on the Basis of Iris Pattern Image Processing and Preliminary Analysis. In: Proc. Int. Conf. on Microelectronics, Yugoslavia, vol. 2, pp. 687–690 (2002) 11. Poursaberi, A., Araabi, B.N.: A Half-Eye Wavelet Based Method for Iris Recognition. In: Proc. 5th Int. Conf. Intelligent Systems Design and Applications (ISDA), Wroclaw Poland (2005) 12. Poursaberi, A., Araabi, B.N.: Iris Recognition for Partially Occluded Images: Methodology and Sensitivity Analysis. J. Applied Signal Processing (to be published in February 2006) 13. Torkamani, A., Azizzadeh, A.: Iris Detection as Human Identification. J. IET Image Processing Journal (to be published 2007) 14. Jacovitti, G., Neri, A.: Multiscale Image Features Analysis with Circular Harmonic Wavelets. In: Proc. SPIE 2569, Wavelets Appl. Signal Image Process, vol. 2569, pp. 363– 372 (1995) 15. Jacovitti, G., Neri, A.: Multiresolution Circular Harmonic Decomposition. IEEE Trans. Signal Processing 48, 3242–3247 (2000) 16. Capdiferro, L., Casieri, V., Jacovitti, G.: Multiple Feature Based Multiscale Image Enhancement. In: Proc. IEEE DSP. IEEE Computer Society Press, Los Alamitos (2002) 17. www.sinobiometrics.com 18. Mansfield, T., Kelly, G., Chandler, D., Kane, J.: Biometric Product Testing Final Report. In: issue 1.0, Nat’l Physical Laboratory of UK (2001)
Hardening Fingerprint Fuzzy Vault Using Password∗

Karthik Nandakumar, Abhishek Nagar, and Anil K. Jain

Department of Computer Science & Engineering, Michigan State University, East Lansing, MI – 48824, USA
{nandakum,nagarabh,jain}@cse.msu.edu
Abstract. Security of stored templates is a critical issue in biometric systems because biometric templates are non-revocable. Fuzzy vault is a cryptographic framework that enables secure template storage by binding the template with a uniformly random key. Though the fuzzy vault framework has proven security properties, it does not provide privacy-enhancing features such as revocability and protection against cross-matching across different biometric systems. Furthermore, non-uniform nature of biometric data can decrease the vault security. To overcome these limitations, we propose a scheme for hardening a fingerprint minutiae-based fuzzy vault using password. Benefits of the proposed password-based hardening technique include template revocability, prevention of cross-matching, enhanced vault security and a reduction in the False Accept Rate of the system without significantly affecting the False Reject Rate. Since the hardening scheme utilizes password only as an additional authentication factor (independent of the key used in the vault), the security provided by the fuzzy vault framework is not affected even when the password is compromised. Keywords: Biometric template security, fuzzy vault, hardening, password, fingerprint, minutiae, helper data.
1 Introduction Biometric systems have attained popularity because they provide a convenient and reliable way to authenticate a user as opposed to traditional token-based (e.g., smart cards) and knowledge-based (e.g., passwords) authentication. However, it is now well-known that biometric systems are vulnerable to attacks. One of the most serious attacks is against the stored templates. A stolen biometric template cannot be easily revoked and it may be used in other applications that employ the same biometric trait. Table 1 presents a summary of the approaches that have been proposed for biometric template protection. We propose a hybrid approach where the biometric features are hardened using password before a secure sketch (fuzzy vault) is constructed. 1.1 Fuzzy Vault Framework Fuzzy vault [1] is a cryptographic framework that binds the biometric template with a uniformly random key to build a secure sketch of the template. Only the secure sketch ∗
Research supported by ARO grant no. W911NF-06-1-0418.
Table 1. Summary of biometric template protection approaches

Encryption
  Methodology: Template is encrypted using well-known cryptographic techniques.
  Advantages: Matching algorithm and accuracy are unaffected.
  Limitations: Template is exposed during every authentication attempt.

Noninvertible transform (e.g., [2, 3])
  Methodology: One-way function is applied to the biometric features.
  Advantages: Since transformation occurs in the same feature space, matcher need not be redesigned.
  Limitations: Usually leads to increase in the FRR.

Hardening / Salting (e.g., [4])
  Methodology: User-specific external randomness is added to the biometric features.
  Advantages: Increases the entropy of biometric features resulting in low FAR.
  Limitations: If the user-specific random information is compromised, there is no gain in entropy.

Key generation (e.g., [5])
  Methodology: A key is derived directly from biometric features.
  Advantages: Most efficient and scalable approach.
  Limitations: Tolerance to intra-user variations is limited, resulting in high FRR.

Secure sketch (e.g., [1, 6-9])
  Methodology: A sketch is derived from the template; the sketch is secure because the template can be reconstructed only if a matching biometric query is presented.
  Advantages: More tolerant to intra-user variations in biometric data; can be used for securing external data such as cryptographic keys.
  Limitations: Template is exposed during successful authentication. Non-uniform nature of biometric data reduces security.

Proposed hardened fuzzy vault
  Methodology: A hybrid approach where the biometric features are hardened (using password) before a secure sketch (vault) is constructed.
  Advantages: Hardening increases the entropy thereby improving the vault security; also enhances user privacy.
  Limitations: Not user-friendly; user needs to provide both the password and the biometric during authentication.
(vault) is stored and if the original template is "uniformly random", it is infeasible (or computationally hard) to retrieve either the template or the key without any knowledge of the user's biometric data. The fuzzy vault scheme can secure biometric features that are represented as an unordered set. Let MT = {x_1, x_2, ..., x_r} denote a biometric template with r elements. The user selects a key K, encodes it in the form of a polynomial P of degree n and evaluates the polynomial P on all the elements in MT. The points lying on P, {(x_i, P(x_i))}_{i=1}^{r}, are hidden among a large number (s) of random chaff points that do not lie on P, {(x_j, y_j) | x_j ≠ x_i ∀ i = 1, ..., r, y_j ≠ P(x_j)}_{j=1}^{s}. The union of genuine and chaff point sets constitutes the vault V. In the absence of user's biometric data, it is computationally hard to identify the genuine points in V, and hence the template is secure. During authentication, the user provides a biometric query denoted by MQ = {x'_1, x'_2, ..., x'_r}. If MQ overlaps substantially with MT, the user can identify many points in V that lie on the polynomial. If the number of discrepancies between MT and MQ is less than (r−n)/2, Reed-Solomon decoding can be applied to reconstruct P and
the authentication is successful. On the other hand, if MT and MQ do not have sufficient overlap, it is infeasible to reconstruct P and the authentication is unsuccessful. The vault is called fuzzy because it can be decoded even when MT and MQ are not exactly the same; this fuzziness property compensates for intra-user variations observed in biometric data. Security of the fuzzy vault framework has been studied in [1, 6] and bounds on the entropy loss of the vault have been established. 1.2 Limitations of Fuzzy Vault Framework Though the fuzzy vault scheme has proven security properties [1, 6], it has the following limitations. (i) The security of the vault can be compromised if the same biometric data is reused for constructing different vaults (with different polynomials and random chaff points) [10, 11]. If a person has access to two vaults obtained from the same biometric data, he can easily identify the genuine points in the two vaults by correlating the abscissa (x) values in the two vaults. Due to this reason, the vault is not revocable, i.e., if a vault is compromised, a new vault cannot be created from the same biometric data by merely binding it with a different key. Further, this vulnerability allows cross-matching of templates across different systems. Thus, the fuzzy vault framework does not have privacy-enhancing properties. (ii) It is possible for an attacker to exploit the non-uniform nature of biometric features and develop attacks based on statistical analysis of points in the vault. (iii) Since the number of chaff points in the vault is much larger than the number of genuine points, it is possible for an adversary to substitute a few points in the vault using his own biometric features [10, 11]. This allows both the original user and the adversary to be successfully authenticated using the same identity. Thus, an adversary can deliberately increase the false accept rate of the system. (iv) As a genuine user is being authenticated, his original template is exposed temporarily, which may be gleaned by an attacker. While better fuzzy vault constructions that do not involve chaff points [6] can prevent vulnerabilities (ii) and (iii), they do not address limitations (i) and (iv). By using a password as an additional factor for authentication, the above limitations of a fuzzy vault system can be easily alleviated. In this paper, we propose a scheme for hardening a fingerprint-based fuzzy vault using password. One of the main advantages of password-based hardening is enhanced user privacy. Further, the proposed scheme has been designed such that password is only an additional layer of authentication and the security provided by the basic fuzzy vault framework is not affected even if the password is compromised. Hence, the proposed approach provides higher level of security as long as the password is secure. When the password is compromised, template security falls to the same level as in the fuzzy vault.
2 Hardening Fuzzy Vault Using Password
Our vault hardening scheme consists of three main steps (see Fig. 1). Firstly, a random transformation function derived from the user password is applied to the
biometric template. The transformed template is then secured using the fuzzy vault framework. Finally, the vault is encrypted using a key derived from the password. Random transformation of the template using password enhances user privacy because it enables the creation of revocable templates and prevents cross-matching of templates across different applications. The distribution of transformed template is statistically more similar to uniform distribution than the distribution of original template. This provides better resistance against attacks on the vault. Furthermore, the additional variability introduced by password-based transformation decreases the similarity between transformed templates of different users. This reduces the False Accept Rate of the system substantially. If we assume client-server architecture for the biometric system (as shown in Fig. 1) where feature extraction and transformation are applied at the client side and matching is performed at the server, the server never sees the original template. Only the transformed template would be revealed during successful vault decoding and the original template is never exposed at the server.
Fig. 1. Operation of the hardened fuzzy vault: (a) enrollment and (b) authentication stages. In this figure, I represents the identity of the user, W is the user password, MT (MQ) represents the biometric template (query), MT^W (MQ^W) represents the template (query) after transformation using the password, E and D represent the encryption and decryption keys generated from the password, and V and VE represent the plaintext and encrypted vaults.
Two common methods for cracking a user password are dictionary attacks and social engineering techniques. In the proposed system, password is implicitly verified during authentication by matching the transformed biometric features. Even if an adversary attempts to guess the password, it is not possible to verify the guess without knowledge of the user’s biometric data. This provides resistance against dictionary attacks to learn the password. However, it is still possible to glean the user password through social engineering techniques. Therefore, password based transformation alone is not sufficient to ensure the security of the biometric template. Due to this
reason, we use the fuzzy vault framework to secure the transformed biometric template. Note that the key used in constructing the fuzzy vault that secures the transformed template is still uniformly random and independent of the password. Therefore, even if the password is compromised, the security of the vault is not affected and it is computationally hard for an attacker to obtain the original biometric template. Finally, the vault is encrypted using a key derived from the password. This prevents substitution attacks against the vault because an adversary cannot modify the vault without knowing the password or the key derived from it.
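The paper does not specify the key-derivation function or cipher used for this outer encryption layer; the sketch below uses PBKDF2 for key derivation and Fernet (an AES-based construction from the third-party cryptography package) purely as illustrative stand-ins.

```python
import base64, hashlib, os
from cryptography.fernet import Fernet  # any standard symmetric cipher would do

def derive_vault_key(password: str, salt: bytes) -> bytes:
    """Derive a symmetric key from the user password (PBKDF2-HMAC-SHA256)."""
    raw = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000, dklen=32)
    return base64.urlsafe_b64encode(raw)

salt = os.urandom(16)                      # stored alongside the encrypted vault
key = derive_vault_key("user-password", salt)
vault_bytes = b"...serialised vault points..."
encrypted_vault = Fernet(key).encrypt(vault_bytes)
assert Fernet(key).decrypt(encrypted_vault) == vault_bytes
```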
3 Fingerprint-Based Fuzzy Vault Implementation

A number of techniques have been proposed for constructing a fuzzy vault using fingerprint minutiae (e.g., [13, 14]). The proposed hardening scheme is based on the fingerprint-based fuzzy vault implementation described in [12], which has the highest genuine accept rate and a very low false accept rate among the known implementations of fingerprint-based fuzzy vault. In this implementation, the Reed-Solomon polynomial reconstruction step is replaced by a combination of Lagrange interpolation and Cyclic Redundancy Check (CRC) based error detection. Each minutia point is represented as an element in the Galois field GF(2^16) by applying the following procedure. Let (u, v, θ) be the attributes of a minutia point, where u and v indicate the row and column indices in the image, and θ represents the orientation of the minutia with respect to the horizontal axis. The minutia attributes are uniformly quantized and expressed as binary strings Qu, Qv and Qθ of lengths Bu, Bv and Bθ bits, respectively. The values of Bu, Bv and Bθ are chosen to be 6, 5 and 5, respectively, so that a 16-bit number can be obtained by concatenating the bit strings Qu, Qv and Qθ. A fixed number (denoted by r) of minutiae are selected based on their quality. A randomly generated key K of size 16n bits is represented as a polynomial P of degree n. The polynomial P is evaluated at the selected minutiae and these points constitute the locking set. A large number (denoted by s, s >> r) of chaff points are randomly generated and the combined set of minutiae and chaff is randomly reordered to obtain the vault V. To facilitate the alignment of query minutiae to the template, we extract and store a set of high curvature points (known as helper data) from the template image. The helper data itself does not leak any information about the minutiae, yet contains sufficient information to align the template and query fingerprints [12]. During authentication, the helper data extracted from the query image is aligned to the template helper data using the trimmed Iterative Closest Point (ICP) algorithm [15]. The aligned query minutiae are used to coarsely filter out the chaff points in the vault. A minutiae matcher [16] is then applied to find correspondences between the query minutiae and the remaining points in the vault. Vault points having a matching minutia in the query constitute the unlocking set. For interpolation of a polynomial of degree n, at least (n+1) projections are needed. Therefore, if the size of the unlocking set is less than (n+1), authentication fails. If the unlocking set has (n+1) or more elements, all possible subsets of size (n+1) are considered. Each of these subsets gives rise to a candidate polynomial and CRC-based error detection identifies the valid polynomial. If a valid polynomial is found, the authentication is successful.
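The 16-bit minutia representation described above (6 + 5 + 5 bits for the quantised u, v and θ) can be sketched as follows; the image dimensions used for the quantisation and the helper names are illustrative, not taken from the implementation in [12].

```python
def quantise(value, max_value, bits):
    """Uniformly quantise value in [0, max_value) to an integer of `bits` bits."""
    level = int(value * (1 << bits) / max_value)
    return min(level, (1 << bits) - 1)

def minutia_to_element(u, v, theta, rows=640, cols=480):
    """Pack a minutia (u, v, theta) into one element of GF(2^16):
    6 bits for u, 5 bits for v, 5 bits for theta (Bu=6, Bv=5, Btheta=5)."""
    qu = quantise(u, rows, 6)
    qv = quantise(v, cols, 5)
    qt = quantise(theta, 360.0, 5)
    return (qu << 10) | (qv << 5) | qt

print(hex(minutia_to_element(320, 240, 87.5)))
```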
4 Hardened Fuzzy Vault Implementation

The key component of a hardened fingerprint-based fuzzy vault scheme is the feature transformation module, which transforms the minutiae features using a password. We employ simple operations of translation and permutation as the transformation functions because they do not affect the intra-user variability of the minutiae features, thereby maintaining the false reject rate to a great extent.

4.1 Minutiae Transformation

We assume that the password is of length 64 bits (8 characters), which is divided into 4 units of 16 bits each. We classify the minutiae into 4 classes by grouping minutiae lying in each quadrant of the image into a different class and assign one password unit to each class. We generate a permutation sequence of 4 numbers by applying a one-way function on the password. Using this sequence, we permute the 4 quadrants of the image such that the relative positions of minutiae within each quadrant are not changed. Each 16-bit password unit is assumed to be in the same format as the 16-bit minutia representation described in Section 3. Hence, the password unit can be divided into three components Tu, Tv and Tθ of lengths Bu, Bv and Bθ bits, respectively. The values of Tu and Tv are considered as the amount of translation along the vertical and horizontal directions, respectively, and Tθ is treated as the change in minutia orientation. The new minutiae attributes are obtained by adding the translation values to the original values modulo the appropriate range, i.e., Q'u = (Qu + Tu) mod 2^Bu, Q'v = (Qv + Tv) mod 2^Bv and Q'θ = (Qθ + Tθ) mod 2^Bθ. To prevent overlapping of minutiae from different quadrants, the minutia location is wrapped around in the respective quadrant if it has been translated beyond the boundary. The effect of the minutiae transformation using the password is depicted in Fig. 2, and a code sketch is given at the end of this section.

4.2 Encoding the Hardened Vault

The transformed minutiae are encoded in a vault using the procedure described in Section 3. The vault and helper data are further encrypted using a key generated from the password. This layer of encryption prevents an impostor without knowledge of the password from modifying the vault.

4.3 Decoding the Hardened Vault

During authentication, the encrypted vault and helper data are first decrypted using the password provided by the user. The template and query helper data sets are aligned and the password-based transformation scheme described in Section 4.1 is applied to the aligned query minutiae. Good quality minutiae are then selected for decoding the vault. Apart from the well-known factors like partial overlap, non-linear distortion and noise that lead to differences in the template and query minutiae sets of the same user, the password-based transformation scheme introduces additional discrepancies. If a minutia lies close to the quadrant boundary, the same minutia may fall in different quadrants in the template and the query due to imperfect alignment. This reduces the
number of minutiae correspondences and leads to a small decrease in the genuine accept rate. Another problem arising from imperfect alignment is that the same minutia point may appear at opposite ends of a quadrant in the template and the query after the transformation, because the minutiae are translated within their respective quadrants modulo the quadrant size. To address this problem, we add a border of width 15 pixels around each quadrant, and minutiae within 15 pixels of the quadrant boundary are duplicated on the border at the opposite end of the quadrant.
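The password-based transformation of section 4.1 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the one-way function (SHA-256), the byte layout of the password units, and the wrap-around over the full quantization range (rather than within each quadrant, with the 15-pixel border duplication described above) are simplifying assumptions.

```python
import hashlib

BU, BV, BT = 6, 5, 5      # bit lengths of Qu, Qv and Qtheta (see section 3)

def quadrant_permutation(password: bytes):
    """Derive a permutation of the 4 quadrants from the password via a one-way function."""
    digest = hashlib.sha256(password).digest()
    return sorted(range(4), key=lambda q: digest[q])   # repeatable pseudo-random order

def transform_minutia(qu, qv, qt, quadrant, password: bytes):
    """Translate one quantized minutia with the 16-bit password unit of its quadrant.

    The wrap-around here is over the full quantization range; the paper instead
    wraps within the quadrant and duplicates boundary minutiae on a 15-pixel border.
    """
    unit = int.from_bytes(password[2 * quadrant:2 * quadrant + 2], "big")  # 8-char password
    tu = (unit >> (BV + BT)) & ((1 << BU) - 1)
    tv = (unit >> BT) & ((1 << BV) - 1)
    tt = unit & ((1 << BT) - 1)
    return (qu + tu) % (1 << BU), (qv + tv) % (1 << BV), (qt + tt) % (1 << BT)
```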
5 Experimental Results
The proposed password-based fuzzy vault hardening scheme has been tested on the FVC2002-DB2 and MSU-DBI fingerprint databases. FVC2002-DB2 [17] is a public domain database with 800 images (100 fingers × 8 impressions/finger) of size 560×296. Only the first two impressions of each finger were used in our experiments; the first impression was used as the template to encode the vault and the second impression was used as the query in vault decoding. The MSU-DBI database [18] consists of 640 images (160 fingers × 4 impressions/finger) of size 640×480. Two impressions of each finger collected six weeks apart were used in our experiments.
Fig. 2. Minutiae transformation using password. (a) and (b) show the original and transformed minutiae, respectively. The number at each corner indicates the permutation of quadrants.
The criteria used for evaluating the performance are failure to capture rate (FTCR), genuine accept rate (GAR) and false accept rate (FAR). When the number of minutiae in the template and/or query fingerprint is less than the required number of genuine points, we call it a failure to capture. The parameters used in the vault implementation were chosen as follows. Since the number of minutiae varies across users, using a fixed value of r (the number of genuine minutiae used to construct the vault) for all users leads to a large FTCR. To overcome this problem, we fix the range of r (set to
18-24 and 24-30 for the FVC and MSU databases, respectively) and determine its value individually for each user. The number of chaff points (s) is chosen to be 10 times the number of genuine points in the vault. The choice of n requires a compromise between the vault security and the acceptable values of GAR and FAR. Table 2 shows that the proposed system leads to a small decrease in the GAR for all values of n. This is due to misclassification of a few minutiae at the quadrant boundaries and the inability of the minutiae matcher to effectively account for nonlinear deformation in the transformed minutiae space. Although the minutiae matcher [16] used here can tolerate deformation to some extent by employing an adaptive bounding box, it is designed to work in the original minutiae space, where the deformation is consistent in a local region. Since the minutiae transformation makes the deformation inconsistent across regions, the number of correspondences found by the matcher decreases. Fig. 3 shows a pair of images for which the vault without hardening could be decoded, but the hardened vault could not be decoded. From Table 2, we also observe that the FAR of the system is zero for all values of n. This is due to the transformation of minutiae using the password, which makes the distribution of minutiae more random and reduces the similarity between minutiae sets of different users. This enables the system designer to select a wider range of values for n without compromising the FAR. For example, for the FVC database, the original fuzzy vault required n=10 to achieve 0% FAR (corresponding GAR is 86%). For the hardened fuzzy vault, even n=7 gives 0% FAR (corresponding GAR is 90%).

Table 2. Genuine Accept Rates (GAR), False Accept Rates (FAR) and Failure to Capture Rates (FTCR) of the hardened fuzzy vault for the FVC2002-DB2 and MSU-DBI databases. Here, n represents the degree of the polynomial used in vault encoding.

FVC2002-DB2               FTCR (%)   n = 7: GAR(%) / FAR(%)   n = 8: GAR(%) / FAR(%)   n = 10: GAR(%) / FAR(%)
Vault without hardening   2          91 / 0.13                91 / 0.01                86 / 0
Hardened vault            2          90 / 0                   88 / 0                   81 / 0

MSU-DBI                   FTCR (%)   n = 10: GAR(%) / FAR(%)  n = 11: GAR(%) / FAR(%)  n = 12: GAR(%) / FAR(%)
Vault without hardening   5.6        85 / 0.08                82.5 / 0.02              78.8 / 0
Hardened vault            5          80.6 / 0                 75.6 / 0                 73.8 / 0
6 Security Analysis
The hardened fuzzy vault system has two independent layers of security, namely, the password and the biometric. An impostor can gain access to the system only if both layers of security are compromised simultaneously. We now analyze the security of the system when one of the layers is compromised.
Compromised Password: Suppose an impostor gains access to the password of a genuine user. The impostor can at most generate the decryption key that allows him to decrypt the vault. However, to be successfully authenticated, he still would have to decode the vault by identifying the genuine minutia points from the vault, which is computationally hard. Suppose an attacker attempts a brute-force attack on the proposed system by trying to decode the vault using all combinations of (n+1) points in the vault. If n = 10, r = 30 and s = 300, the total number of possible combinations is C(330,11); among these combinations, C(30,11) combinations will successfully decode the vault. The expected number of combinations that need to be evaluated is 2 × 10^12, which corresponds to ~40 bits of security. Security can be improved by adding a larger number of chaff points (e.g., when s = 600 in the above system, we can achieve ~50 bits of security) at the expense of increased storage requirements. Further improvement in the template security can be achieved by replacing the random permutation and translation functions (which are invertible) with a non-invertible transform [2]. Though non-invertible transforms usually result in a decrease in GAR, they can provide an additional 50-70 bits of security [2] if the goal is to prevent an attacker from learning the original biometric template of the genuine user.
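The ~40-bit figure follows directly from the stated parameters; the short calculation below reproduces it (plain combinatorics, not taken from the paper).

```python
from math import comb, log2

def vault_security_bits(n, r, s):
    """Security (in bits) of decoding the vault by brute force.

    Of C(r+s, n+1) candidate (n+1)-subsets only C(r, n+1) interpolate the correct
    polynomial, so the expected number of trials is roughly their ratio.
    """
    return log2(comb(r + s, n + 1) / comb(r, n + 1))

print(f"{vault_security_bits(10, 30, 300):.1f}")   # ~40 bits, as stated above
print(f"{vault_security_bits(10, 30, 600):.1f}")   # ~50 bits with twice the chaff
```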
Fig. 3. An example of false reject in password-based vault hardening. (a) Template and aligned query minutiae prior to hardening and the corresponding minutiae matches found by the matcher, (b) Template and query minutiae after vault hardening and the corresponding minutiae matches found by the matcher. While the minutiae matches marked with circles were found in both (a) and (b), the matches marked with squares were detected only in (a) and not in (b). Since the number of minutia correspondences prior to hardening is 12, the vault can be successfully decoded because n was set to 10. After hardening, the number of minutia matches is only 9; hence, the vault cannot be decoded for n = 10.
Compromised Biometric: Suppose an impostor gains access to the biometric template of a genuine user through covert means (e.g., by lifting a fingerprint impression of the genuine user without his knowledge). He will still have to guess the password to be authenticated. The guessing entropy of an 8-character password is between 18-30 bits [19]. Although this level of security may be insufficient in practical applications, it is still better than the fuzzy vault framework and most of the other approaches presented in Table 1, which offer no security when the biometric is compromised. When an adversary has no knowledge of the user's password and biometric data, the security of the hardened fuzzy vault is the combination of the security provided by the password and biometric layers. If n = 10, r = 30, s = 300 and the password is 8 characters long, the security of the hardened vault is between 58-70 bits.
7 Summary
We have proposed an algorithm to harden a fingerprint-based fuzzy vault using a user password. Based on permutations and translations generated from the user password, we modify the minutiae in a fingerprint before encoding the fuzzy vault. Our hardening technique addresses some of the major limitations of the fingerprint-based fuzzy vault framework and provides enhanced security and privacy. Experiments on two fingerprint databases show that the proposed algorithm reduces the False Accept Rate of the system with some loss in the Genuine Accept Rate. An impostor cannot circumvent the hardened fuzzy vault system as long as the password and the biometric features are not both compromised.
References 1. Juels, A., Sudan, M.: A Fuzzy Vault Scheme. In: Proceedings of IEEE International Symposium on Information Theory, Lausanne, Switzerland, p. 408 (2002) 2. Ratha, N., Chikkerur, S., Connell, J.H., Bolle, R.M.: Generating Cancelable Fingerprint Templates. IEEE Trans. on PAMI 29(4), 561–572 (2007) 3. Savvides, M., Kumar, B.V.K.V., Khosla, P.K.: Cancelable biometric filters for face recognition. In: Proceedings of ICPR, Cambridge, UK, August 2004, vol. 3, pp. 922–925 (2004) 4. Teoh, A.B.J., Goh, A., Ngo, D.C.L.: Random Multispace Quantization as an Analytic Mechanism for BioHashing of Biometric and Random Identity Inputs. IEEE Trans. on PAMI 28(12), 1892–1901 (2006) 5. Monrose, F., Reiter, M.K., Li, Q., Wetzel, S.: Cryptographic Key Generation from Voice. In: Proc. IEEE Symp. Security and Privacy, Oakland, May 2001, pp. 202–213 (2001) 6. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. In: Proceedings of International Conference on Theory and Applications of Cryptographic Techniques, May 2004, pp. 523–540 (2004) 7. Hao, F., Anderson, R., Daugman, J.: Combining Crypto with Biometrics Effectively. IEEE Trans. on Computers 55(9), 1081–1088 (2006) 8. Sutcu, Y., Li, Q., Memon, N.: Protecting Biometric Templates with Sketch: Theory and Practice. IEEE Trans. on Information Forensics and Security 2007 (to appear)
9. Draper, S.C., Khisti, A., Martinian, E., Vetro, A., Yedidia, J.S.: Using Distributed Source Coding to Secure Fingerprint Biometrics. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, Hawaii, vol. 2, pp. 129–132 (April 2007) 10. Boult, T.E., Scheirer, W.J., Woodworth, R.: Fingerprint Revocable Biotokens: Accuracy and Security Analysis. In: Proc. of CVPR, Minneapolis (June 2007) 11. Scheirer, W.J., Boult, T.E.: Cracking Fuzzy Vaults and Biometric Encryption, Univ. of Colorado at Colorado Springs, Tech. Rep. (February 2007) 12. Nandakumar, K., Jain, A.K., Pankanti, S.: Fingerprint-based Fuzzy Vault: Implementation and Performance, Michigan State Univ. Tech. Rep. TR-06-31 (2006) 13. Yang, S., Verbauwhede, I.: Automatic Secure Fingerprint Verification System Based on Fuzzy Vault Scheme. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, USA, March 2005, vol. 5, pp. 609–612 (2005) 14. Uludag, U., Pankanti, S., Jain, A.K.: Fuzzy Vault for Fingerprints. In: Proceedings of Fifth International Conference on AVBPA, Rye Town, USA, July 2005, pp. 310–319 (2005) 15. Chetverikov, D., Svirko, D., Stepanov, D., Krsek, P.: The Trimmed Iterative Closest Point Algorithm. In: Proc. of ICPR, Quebec City, Canada, August 2002, pp. 545–548 (2002) 16. Jain, A.K., Hong, L., Bolle, R.: On-line Fingerprint Verification. IEEE Trans. on PAMI 19(4), 302–314 (1997) 17. Maio, D., Maltoni, D., Wayman, J.L., Jain, A.K.: FVC2002: Second Fingerprint Verification Competition. In: Proc. of ICPR, Quebec City, August 2002, pp. 811–814 (2002) 18. Jain, A.K., Prabhakar, S., Ross, A.: Fingerprint Matching: Data Acquisition and Performance Evaluation. Michigan State Univ. Tech. Rep. TR99-14 (1999) 19. Burr, W.E., Dodson, D.F., Polk, W.T.: Information Security: Electronic Authentication Guideline. NIST Special Report 800-63 (April 2006)
GPU Accelerated 3D Face Registration / Recognition Andrea Francesco Abate, Michele Nappi, Stefano Ricciardi, and Gabriele Sabatino Dipartimento di Matematica e Informatica, Università degli Studi di Salerno, 20186, Fisciano (SA), Italy {abate,mnappi,sricciardi,gsabatino}@unisa.it
Abstract. This paper proposes a novel approach to both registration and recognition of faces in three dimensions. The presented method is based on a normal map metric, used both to align a captured face to a reference template and to compare any two faces in a gallery. As the metric involved is highly suited to computation on vector processors, we propose an implementation of the whole framework on last-generation graphics boards, to exploit the potential of GPUs for large scale biometric identification applications. This work shows how affordable consumer-grade hardware can allow ultra-rapid comparison between face descriptors through its highly specialized architecture. The approach also addresses facial expression changes by means of subject-specific weighting masks. We include preliminary results of experiments conducted on a proprietary gallery and on a subset of the FRGC database.
1 Introduction
Three-dimensional face representation is the object of growing interest in the biometrics research community, as witnessed by the large number of recognition approaches proposed in recent years, whose main focus has been accuracy and robustness, often treating the required computing time as a minor issue. This is easily understandable considering the serious challenges of face recognition, which sometimes push researchers to exploit metrics involving time-intensive computation. According to the literature, 3D based methods can exploit a plurality of metrics [1], some of which, like Eigenface [2], Hausdorff distance [3] and Principal Component Analysis (PCA) [4], were originally proposed for 2D recognition and then extended to range images. Other approaches have been developed specifically to operate on 3D shapes [5], like those exploiting the Extended Gaussian Image [6], the Iterative Closest Point (ICP) method [7], canonical images [8] or normal maps [9]. One line of work is represented by multi-modal approaches, which typically combine 2D (intensity or colour) and 3D (range images or geometry) facial data and, in some cases, different metrics, to improve recognition accuracy and/or robustness over conventional techniques [10-12]. However, as the diffusion of this biometric increases, the need for one-to-many comparison on large galleries becomes more frequent and crucial to many applications. Unfortunately, a matching time in the range of seconds (or even minutes) is not rare in 3D face recognition, so, as claimed by Bowyer et al. in their 2006 survey, “one attractive line of research involves methods to speed up the 3D
matching” [1]. In this regard, the launch of multi-core CPUs on the market could be appealing to biometric systems developers, but in reality not many of the most established face recognition algorithms can take advantage of multithreaded processing and, even when they can, the overall theoretical speedup is by a factor of 2 or 4 (for PC and workstation class machines), while database size could grow by orders of magnitude. It is worth noting that one of the most stable technological trends in recent years for PCs and workstations has been the leap in both computing power and flexibility of the specialized processors on which graphics boards are based: the Graphical Processing Units (GPUs). Indeed, GPUs arguably represent today's most powerful and affordable computational hardware, and they are advancing at an incredible rate compared to CPUs, with performance growing approximately 1.7 to 2.3 times/year versus a maximum of 1.4 times/year for CPUs. As an example, the recent G80 GPU core from Nvidia Corp. (one of the market leaders together with ATI Corp.) features approximately 681 million transistors, resulting in a highly parallel architecture based on 128 programmable processors and 768 MB of VRAM with 86 GB/sec of transfer rate over a 384-bit wide bus. The advantages of using these specialized processors for general purpose applications, a task referred to as General Purpose computation on GPU or GP-GPU, were marginal until high level languages for GPU programming emerged. Nevertheless, as GPUs are inherently vector processors and not true general-purpose processing units, not every algorithm or data structure is suited to fully exploit this potential. In this paper, we present a method to register and recognize a face in 3D by means of the same normal map metric. As this metric represents geometry in terms of coloured pixels, it is particularly suited to take full advantage of vector processors, so we propose a GPU implementation aimed at maximizing the comparison speed for large scale identification applications. This paper is organized as follows. In Section 2 the proposed methodology is presented in detail. In Section 3 experimental results are shown and briefly discussed. The paper concludes in Section 4.
2 Description of Proposed Methodology
In the following subsections 2.1 to 2.4 we describe in depth the proposed face recognition approach and its implementation via GPU. A preliminary face registration is required for the method to perform optimally but, as it is based on the same metric exploited for face matching, we first describe the normal map based comparison, then we present the alignment algorithm and the adaptations necessary to efficiently compute the metric on the GPU.
2.1 Representing Face Through Normal Map and Comparing Faces Through Difference Map
Whether a subject is being enrolled for the first time or a new query (a subject to be recognized) is submitted to the recognition pipeline, a preliminary face capture is performed and the resulting range image is converted into a polygonal mesh M. We intend to represent face geometry by storing the normals of mesh M in a bidimensional
matrix N with dimensions l×m. To correlate the 3D space of normals to the 2D domain of matrix N we project each vertex in M onto a 2D surface using a spherical projection (suitably adapted to the mesh size). Then we sample the mesh by means of mapping coordinates and quantize the three scalar components of each normal as an RGB coded colour, storing it in a bitmap N using the same coordinates as array indexes. More precisely, we assign to each pixel (i, j) in N, with 0 ≤ i < l and 0 ≤ j < m, the three scalar components of the normal to the point of the mesh surface with mapping coordinates (i/l, j/m). The resulting sampling resolution is 1/l for the s range and 1/m for the t range. The normal components are stored in pixel (i, j) as RGB colour components. We refer to the resulting matrix N as the normal map of mesh M. A normal map with a standard colour depth of 24 bits allows 8-bit quantization for each normal component; this precision proved to be adequate for the recognition process. To compare the normal map N_A from the input subject to another normal map N_B previously stored in the reference database, we compute the angle between each pair of normals represented by the colours of the pixels with corresponding mapping coordinates, and store it in a new difference map D. The colour components r, g and b are normalized from the spatial domain to the colour domain, so that 0 ≤ r_{N_A}, g_{N_A}, b_{N_A} ≤ 1 and 0 ≤ r_{N_B}, g_{N_B}, b_{N_B} ≤ 1. The value θ, with 0 ≤ θ < π, is the angular difference between the pixels with coordinates (x_{N_A}, y_{N_A}) in N_A and (x_{N_B}, y_{N_B}) in N_B, and it is stored in D as a grey-scale image (see Fig. 1). To reduce the effects of residual face misalignment during the acquisition and sampling phases, we calculate the angle θ using a k × k (usually 3 × 3 or 5 × 5) matrix of neighbouring pixels. Summing every grey level in D results in a histogram H(x) that represents the angular distance distribution between meshes M_A and M_B. On the X axis we represent the resulting angles between each pair of comparisons (sorted from 0 degrees to 180 degrees), while on the Y axis we represent the total number of differences found. This means that two similar faces will have a histogram H(x) with very high values at small angles, while two distinct faces will have their differences spread over a wider range. We define a similarity score through a weighted sum of H and a Gaussian function G, as in (1), where σ and k control the recognition sensitivity:

similarity\_score = \sum_{x=0}^{k} H(x) \cdot \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^{2}}{2\sigma^{2}}}    (1)
Fig. 1. Face capture, Normal Map generation and resulting Difference Map for two subjects
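A minimal NumPy sketch of this comparison is given below. The decoding of the 8-bit RGB values back to normal components and the one-degree histogram binning are illustrative assumptions, and the k × k neighbourhood averaging used to reduce misalignment effects is omitted.

```python
import numpy as np

def difference_map(na, nb):
    """Per-pixel angle (degrees) between normals stored as 8-bit RGB in two normal maps."""
    va = na[..., :3].astype(np.float64) / 255.0 * 2.0 - 1.0   # decode RGB to [-1, 1]
    vb = nb[..., :3].astype(np.float64) / 255.0 * 2.0 - 1.0
    va /= np.linalg.norm(va, axis=2, keepdims=True)
    vb /= np.linalg.norm(vb, axis=2, keepdims=True)
    dot = np.clip(np.sum(va * vb, axis=2), -1.0, 1.0)
    return np.degrees(np.arccos(dot))                          # theta in [0, 180]

def similarity_score(diff, sigma=4.5, k=50):
    """Gaussian-weighted sum of the angular-difference histogram H(x), as in Eq. (1)."""
    hist, _ = np.histogram(diff, bins=181, range=(0, 181))     # one-degree bins
    x = np.arange(k + 1)
    weights = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return float(np.dot(hist[:k + 1], weights))
```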
2.2 Face Registration
A precise registration of the captured face is required by the normal map based comparison to achieve the best recognition performance. So, the obvious choice could be to use the most established 3D shape alignment method, the Iterative Closest Point (ICP), for this purpose. Unfortunately, ICP is a time-expensive algorithm. The original method proposed by Chen and Medioni [13] and Besl and McKay [14] features O(N^2) time complexity, which has been lowered to O(N log N) by other authors [15] and further reduced by means of heuristic functions or by shape voxelization and distance pre-computing in its most recent versions [16]. Nevertheless, the best performances in face recognition applications are in the range of many seconds to minutes, depending on source and target shape resolution. As the whole normal-map based approach aims to maximize the overall recognition speed by reducing the comparison time, we considered ICP not well suited to this approach. So we introduce pyramidal-normal-map based face alignment. A pyramidal normal map is simply a set of normal maps relative to the same 3D surface ordered by progressively increasing size (in our experiments each map differs from the following one by a factor of 2). Our purpose is to exploit this set of local curvature descriptors to perform a fast and precise alignment between two 3D shapes, measuring the angular distance (on each axis) between an unregistered face and a reference template and reducing it to the point where it does not significantly affect recognition precision. The template is a generic neutral face mesh whose centroid corresponds to the origin of the reference system. To achieve a complete registration the captured face has to match the position and rotation of the reference template. Scale matching, instead, is not needed as the spherical projection applied to generate the normal map is invariant to object size. The first step in the alignment procedure is therefore to compute the face's centroid, which allows the reference template position to be matched by offsetting all vertices by the distance from the centroid to the origin. Similarly, rotational alignment can be obtained through a rigid transformation of all vertices once the angular distance between the two surfaces has been measured. As we intend to measure this distance iteratively and with progressively greater precision, we decide to rotate the reference template instead of the captured face. The reason is simple: because the template used for any alignment is always the same, we can precompute the corresponding normal map for every discrete step of rotation once and offline, drastically reducing the time required for alignment. Before the procedure begins, a set-up is required to compute a pyramidal normal map for the captured face (a set of four normal maps with sizes ranging from 16x16 to 128x128 has proved to be adequate in our tests). At this time the variables controlling the iteration are initialised, namely the initial size m of the normal maps, the angular range reduction factor k, and R, the maximum angular range for the algorithm to operate, i.e. the maximum misalignment allowed between the two surfaces. We found that for biometric applications a good compromise between robustness and speed is reached setting this value to 180°, with k=4, but even R=360° can be used if required. In the first iteration, the smallest normal map in the pyramid is compared to each of the k^3 pre-computed normal maps of the same size relative to the coarsest rotation steps of the template.
The resulting difference maps are evaluated to find the one with the highest similarity score, which represents the best approximation to
alignment (on each axis) for that level of the pyramid. Then the template is rotated according to this first estimate. The next iteration starts from this approximation, comparing the next normal map in the pyramid (with size m=m*2) to every template normal map of corresponding size found within a range which is now centred on the previous approximation and whose width has been reduced by a factor k. This scheme is repeated for i iterations until the range's width falls below a threshold value T. At this point the sum of all i approximations found for each axis is used to rotate the captured face, thus resulting in its alignment to the reference template (see Fig. 2). Using the above mentioned values for initialisation, four iterations (i=4) with angular steps of 45°, 11.25°, 2.8° and 0.7° are enough to achieve an alignment adequate for recognition purposes. As the number of angular steps is constant for each iteration level, the total number of template normal maps generated offline for i iterations is ik^3, and the same applies to the total number of comparisons. It has to be noted that the time needed for a single comparison (difference map computation) is independent of mesh resolution and depends only on normal map size, whereas the time needed to precompute each template normal map depends on its polygonal resolution.
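The coarse-to-fine search can be summarized as in the sketch below, which reuses the difference_map and similarity_score helpers sketched earlier. The template_maps lookup stands for the pre-computed template normal maps described above, and the exact windowing of the candidate angles is an assumption; only the k^3 comparisons per level and the range reduction by a factor k follow the text.

```python
import itertools
import numpy as np

def pyramidal_align(face_pyramid, template_maps, R=180.0, k=4, iterations=4):
    """Coarse-to-fine estimate of the 3-axis rotation aligning a face to the template.

    face_pyramid: the captured face's normal maps, smallest first.
    template_maps(level, (rx, ry, rz)): lookup of the template normal map
    pre-rendered offline for that rotation and pyramid level.
    """
    estimate = np.zeros(3)
    half_range = R / 2.0
    for level in range(iterations):
        offsets = np.linspace(-half_range, half_range, k)   # k candidates per axis -> k^3 maps
        best, best_score = estimate, -np.inf
        for d in itertools.product(offsets, repeat=3):
            cand = estimate + np.array(d)
            score = similarity_score(
                difference_map(face_pyramid[level], template_maps(level, tuple(cand))))
            if score > best_score:
                best, best_score = cand, score
        estimate = best
        half_range /= k                                      # shrink the search window by k
    return estimate                                          # rotation applied to the face
```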
Fig. 2. Face captured (shaded) and reference template (wireframe) before (left) and after (center) alignment. Right: normal maps before (up) and after (bottom) alignment
The template does not need to have the same resolution and topology as the captured face; it is sufficient that it has a number of polygons greater than the number of pixels in the largest normal map in the pyramid and a roughly regular distribution of vertices. Finally, another advantage of the proposed algorithm is that no preliminary rough alignment is needed for the method to converge, provided the initial face misalignment (for each axis) is within R.
2.3 Storing Facial Expressions in Alpha Channel
To improve robustness to facial expressions we introduce the expression weighting mask, a subject-specific pre-calculated mask aimed at assigning different relevance to different face regions. This mask, which has the same size as the normal map and the difference map, contains for each pixel an 8-bit weight encoding the local face surface
rigidity based on the analysis of a set of facial expressions of the same subject (see Fig. 3). In fact, for each subject enrolled, eight expressions (a neutral one plus seven variations) are acquired and compared to the neutral face, resulting in seven difference maps. More precisely, given a generic face with its normal map N0 (neutral face) and the set of normal maps N1, N2, …, Nn (the expression variations), we first calculate the set of difference maps D1, D2, …, Dn resulting from {N0 − N1, N0 − N2, …, N0 − Nn}. The average of the set {D1, D2, …, Dn} is the expression weighting mask, which is multiplied by the difference map in each comparison between two faces. We can augment each 24-bit normal map with the expression weighting mask normalized to 8 bits. The resulting 32 bit per pixel bitmap can be conveniently managed via various image formats like the Portable Network Graphics format (PNG), which is typically used to store for each pixel 24 bits of colour and 8 bits of alpha channel (transparency) in RGBA format. When comparing any two faces, the difference map is computed on the first 24 bits of colour information (the normals) and multiplied by the alpha channel (the mask).
Fig. 3. Facial expressions exploited in Expression Weighting Mask
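A possible rendering of the mask construction and its packing into the alpha channel is sketched below, again reusing the difference_map and similarity_score helpers from Section 2.1; the normalization of the averaged difference map to 8 bits is an assumption about a detail the text leaves open.

```python
import numpy as np

def expression_weighting_mask(neutral_map, expression_maps):
    """Average of the difference maps between the neutral face and its expression variants."""
    diffs = [difference_map(neutral_map, nm) for nm in expression_maps]
    mask = np.mean(diffs, axis=0)
    return np.round(mask / mask.max() * 255).astype(np.uint8)   # 8-bit alpha channel

def pack_rgba(normal_map, mask):
    """Store the 24-bit normal map plus the 8-bit mask as one 32-bit RGBA bitmap (e.g. PNG)."""
    return np.dstack([normal_map, mask])

def weighted_similarity(rgba_a, rgba_b, sigma=4.5, k=50):
    """Comparison using the first 24 bits for the normals and the alpha channel as weight."""
    diff = difference_map(rgba_a, rgba_b)
    weighted = diff * (rgba_a[..., 3].astype(np.float64) / 255.0)
    return similarity_score(weighted, sigma, k)
```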
2.4 Implementing the Proposed Method Via GP-GPU
As briefly explained in the introduction to this paper, GPUs can vastly outperform CPUs on some computational tasks, but two key requirements have to be satisfied: (1) the algorithm and the data types on which it operates should conform as much as possible to the computational architecture of the GPU and to its specialized memory, the VRAM; (2) the data exchange with main memory and CPU should be carefully planned and minimized where possible. Because the descriptor used in our approach to face recognition is an RGBA coded bitmap, the second part of the first requirement is fully satisfied, but the first part is not so trivial. Indeed, while the comparison of two normal maps requires the pixel-by-pixel computation of a dot product, a task easily performed on multiple pixels in parallel via pixel shaders, the computation of the histogram and similarity score is, by its algorithmic nature, not well suited to an efficient GPU implementation. This is mainly due to the lack of methods to access and write VRAM as freely as the CPU can access RAM. For this reason we decided to split the face comparison step from the rank assignment step, by means of a two-stage strategy which relies on the GPU to perform a huge number of comparisons in the fastest possible time, and on the CPU, thanks to its general purpose architecture, to work on the results produced by the GPU to provide rank statistics. We addressed the second requirement with an optimised arrangement of descriptors which minimizes the number of data transfers to and from main memory (RAM) and, at the same time, allows the vector units on the GPU to work efficiently (see
Fig. 4). Indeed, we arranged every 1024 normal maps (RGBA, 24+8 bits) in a 32x32 cluster, resulting in a single 32-bit 4096x4096 bitmap (assuming each descriptor is 128x128 pixels). This kind of bitmap reaches the maximum size a GPU can currently manage, reducing the number of exchanges with VRAM by a factor of 1,000. The overhead due to descriptor arrangement is negligible, as this task is performed during enrolment when the normal map and expression weighting mask are computed and stored in the gallery. Up to 15,000 templates can be stored within 1 GB of VRAM. On a query, the system loads the maximum allowed number of clusters from main memory into available free VRAM, then the GPU code is executed in parallel on every available pixel shader unit (so the computing time reduction is linear in the number of shader units) and the result is written to a specifically allocated Framebuffer Object (FBO) as a 32x32 cluster of difference maps. In the next step the FBO is flushed to RAM and the CPU starts computing the similarity score of each difference map, storing each score and its index. This scheme repeats until all template clusters have been sent to and processed by the GPU and all returned difference map clusters have been processed by the CPU. Finally, the sorted score vector is output. We implemented this algorithm using the OpenGL 2.0 library and the GLSL programming language.
Fig. 4. Schematic representation of GPU accelerated normal map matching
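The descriptor clustering and the CPU-side scoring stage can be sketched as follows; the GPU pass itself (the pixel shader writing the per-pixel angles into the FBO) is not shown, and it is assumed here that the returned grey levels have already been mapped back to angular differences.

```python
import numpy as np

MAP, GRID = 128, 32        # descriptor size and cluster layout (32 x 32 = 1024 descriptors)

def pack_atlas(descriptors):
    """Tile up to 1024 RGBA descriptors into one 4096x4096 bitmap to be uploaded to VRAM."""
    atlas = np.zeros((GRID * MAP, GRID * MAP, 4), dtype=np.uint8)
    for idx, d in enumerate(descriptors):
        r, c = divmod(idx, GRID)
        atlas[r * MAP:(r + 1) * MAP, c * MAP:(c + 1) * MAP] = d
    return atlas

def rank_difference_atlas(diff_atlas, n_valid, sigma=4.5, k=50):
    """CPU stage: similarity score and ranking for each difference map in a returned cluster."""
    scores = []
    for idx in range(n_valid):
        r, c = divmod(idx, GRID)
        tile = diff_atlas[r * MAP:(r + 1) * MAP, c * MAP:(c + 1) * MAP]
        scores.append((idx, similarity_score(tile, sigma, k)))
    return sorted(scores, key=lambda t: t[1], reverse=True)   # best match first
```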
3 Experiments
To test the proposed method, four experiments using two different 3D face datasets have been conducted. We built the first dataset by acquiring 235 different individuals (138 males and 97 females, ages ranging from 19 to 40) in an indoor environment by means of a structured light scanner, the Mega Capturor II from Inspeck Corp. For each subject, eight expressions have been captured (including the neutral one) and each resulting 3D surface has an average of 60,000-80,000 polygons, with a minimum detail of about 1.5 millimetres. For the second dataset we used 1024 face shapes from release 2/experiment 3s of the FRGC database, disregarding texture data. This dataset underwent a pre-processing stage including mesh subsampling to one fourth of the original resolution, mesh cropping to eliminate unwanted details (hair, neck, ears, etc.) and mesh filtering to reduce capture noise and artifacts. For all experiments we set σ = 4.5 and k = 50 for the Gaussian function, and the normal map size is 128×128 pixels.
The first experiment, whose results are shown in Fig. 5-a, measures the overall recognition accuracy of the proposed method through the Receiver Operating Characteristic (ROC) curve. The histogram compares the baseline algorithm (blue column, implemented exploiting the FRGC framework and applied on the pre-processed dataset described above) respectively to: the proposed method on the FRGC dataset using embedded alignment info (violet column), the proposed method on the FRGC dataset with pyramidal-normal-map based alignment (green column), and the proposed method and alignment on our gallery allowing the use of the expression weighting mask (orange). The result shown in the third column (green) is slightly better than the one measured in the second column (violet), as the alignment performed by the proposed algorithm proved to be more reliable than the landmarks embedded in FRGC. The best score is achieved in the fourth column (orange), as in this case we exploit both the proposed alignment method and the weighting mask to better address expression variations. The second experiment is meant to measure alignment accuracy, using the first dataset with 235 neutral faces as gallery and 705 (235*3) opened-mouth, closed-eyes and smile variations as probes. Moreover, the probes have been rotated by known angles on the three axes to stress the algorithm. The results are shown in Fig. 5-b: after four iterations, 95.1% of probes have been re-aligned with a tolerance of less than two degrees, and for 73.1% of them the alignment error is below one degree. The purpose of the third group of experiments is to measure the effect of pose variations and probe misalignment on recognition performance without the alignment step. Also in this case we used the neutral faces as gallery and opened-mouth, closed-eyes and smile variations, additionally rotated by known angles, as probes. The results in Fig. 5-c show that for a misalignment within one degree the recognition rate is 98.1%, which drops to 94.6% if the misalignment reaches two degrees. As the average computational cost of a single comparison (128x128 sized normal maps) is about 3 milliseconds on an AMD Opteron 2.6 GHz based PC, the total time needed for alignment is slightly more than 0.3 seconds, allowing an almost real-time response. The overall memory requirement to completely store the template's precomputed normal maps is just 4 MB. Finally, the fourth experiment shows in Fig. 6 how many templates could theoretically be compared to the query within 1 second if they fit entirely in VRAM, proving that, time-wise, the GPU based version of the proposed method easily outperforms any CPU based solution, whatever the processor chosen. To this aim we replicated the 1024 templates from the FRGC subset to fill all available VRAM. The system was able to compare about 85,000 templates per second (matching one-to-15,360 in 0.18 sec on a GeForce 7950 GTX/1024 with 32 pixel shaders) versus about 330 for the CPU based version (AMD). In the same figure we compare the performance of different CPUs and GPUs, including recently released GPU cores, based on specs reported by the two main manufacturers (NVidia 8800 GTS and 8800 GTX featuring 96 and 128 programmable unified shaders respectively).
Comparing these results to ICP based registration and recognition methods (typically requiring from a few seconds to tens of seconds for a single one-to-one match) clearly shows that the proposed approach is worth using, at least time-wise, regardless of the dataset size, as in a real biometric application the pre-processing phase (mesh
Fig. 5. ROC curve (a), alignment accuracy (b) and its relevance to recognition (c)

[Fig. 6 bar chart values, in comparisons/sec: CPU (Xeon P4 3.20GHz) 247; CPU (AMD Opteron 2.4GHz) 332; GPU nVidia G70 7950 GTX 85,643; GPU ATI X1950 R580 91,732; GPU nVidia G80 8800 GTS 166,356 (?); GPU nVidia G80 8800 GTX 206,112 (?)]
Fig. 6. Number of comparisons/sec for various computational hardware. In the graph CPU means only the CPU is exploited, while GPU means that the CPU (AMD Opteron 2.4 GHz) + GPU work together according to the proposed scheme. (?) is just an estimate based on specs.
subsampling, filtering, cropping performed within 1 second in the tested framework) has to be performed only once at enrolment time.
4 Conclusions and Future Works
We presented a 3D face registration and recognition method optimized for large scale identification applications. The proposed approach showed good accuracy and robustness and proved to be highly suited to take advantage of the GPU architecture, allowing a face to be registered and compared to many thousands of templates in less than a second. As the recent release of Nvidia's "Cuda" GPU programming environment promises further advances in terms of general purpose capability, we are currently working to fully implement the method on the GPU, including those stages (such as normal map and histogram computation) which are still CPU based in this proposal.
References [1] Bowyer, K.W., Chang, K., Flynn, P.A.: A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition. In: Computer Vision and Image Understanding, vol. 101, pp. 1–15. Elsevier, Amsterdam (2006) [2] Zhang, J., Yan, Y., And Lades, M.: Face Recognition: Eigenface, Elastic Matching, and Neural Nets. Proc. of the IEEE 85(9), 1423–1435 (1997) [3] Achermann, B., Bunke, H.: Classifying range images of human faces with Hausdorff distance. In: 15-th International Conference on Pattern Recognition, September 2000, pp. 809–813 (2000) [4] Hesher, C., Srivastava, A., Erlebacher, G.: A novel technique for face recognition using range images. In: Seventh Int’l Symposium on Signal Processing and Its Applications (2003) [5] Lu, X., Colbry, D., Jain, A.K.: Three-dimensional model based face recognition. In: 7th IEEE Workshop on Applications of Computer Vision, pp. 156–163 (2005) [6] Tanaka, H.T., Ikeda, M., Chiaki, H.: Curvature-based face surface recognition using spherical correlation principal directions for curved object recognition. In: Third International Conference on Automated Face and Gesture Recognition, pp. 372–377 (1998) [7] Medioni, G., Waupotitsch, R.: Face recognition and modeling in 3D. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG 2003), October 2003, pp. 232–233 (2003) [8] Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Expression-invariant 3D face recognition. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 62–70. Springer, Heidelberg (2003) [9] Abate, A.F., Nappi, M., Ricciardi, S., Sabatino, G.: Fast face recognition based on normal map. In: Proceedings of ICIP 2005, IEEE International Conference on Image Processing, Genova, Italy, July 2005. IEEE Computer Society Press, Los Alamitos (2005) [10] Tsalakanidou, F., Tzovaras, D., Strintzis, M.G.: Use of depth and color eigenfaces for face recognition. Pattern Recognition Letters 24(9-10), 1427–1435 (2003) [11] Papatheodorou, T., Rueckert, D.: Evaluation of Automatic 4D Face Recognition Using Surface and Texture Registration. In: Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004, pp. 321– 326. IEEE Computer Society Press, Los Alamitos (2004) [12] Gokberk, B., Salah, A.A., Akarun, L.: Rank-based decision fusion for 3D shape-based face recognition. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 1019–1028. Springer, Heidelberg (2005) [13] Chen, Y., Medioni, G.: Object modeling by registration of multiple range images. Image and Vision Computing 10, 145–155 (1992) [14] Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE Transaction on Pattern Analysis and Machine Intelligence 14, 239–256 (1992) [15] Jost, T., Hügli, H.: Multi-resolution ICP with heuristic closest point search for fast and robust 3D registration of range images. In: Fourth International Conference on 3-D Digital Imaging and Modeling, October 06 - 10, 2003, pp. 427–433 (2003) [16] Yan, P., Bowyer, K.: A Fast Algorithm for ICP-Based 3D Shape Biometrics. In: Proceedings of the ACM Workshop on Multimodal User Authentication, December 2006, pp. 25– 32. ACM, New York (2006)
Frontal Face Synthesis Based on Multiple Pose-Variant Images for Face Recognition Congcong Li, Guangda Su, Yan Shang, and Yingchun Li Electronic Engineering Department, Tsinghua University, Beijing, 100084, China
[email protected]
Abstract. Pose variance remains a challenging problem for face recognition. In this paper, a stereoscopic synthesis method for generating a frontal face image is proposed to improve the performance of automatic face recognition systems. Through this method, a frontal face image is generated based on two pose-variant face images. Before the synthesis, face pose estimation, feature point extraction and alignment are executed on the two non-frontal images. Benefiting from the high accuracy of pose estimation and alignment, the composed frontal face retains the most important features of the two corresponding non-frontal face images. Experimental results show that using the synthetic frontal image achieves a better recognition rate than using the non-frontal ones.
Keywords: Face recognition, pose estimation, face alignment, stereoscopy, texture synthesis.
1 Introduction
In recent years, face recognition techniques have advanced considerably. Many algorithms have achieved satisfactory recognition performance in controlled conditions, in which faces are in frontal pose with uniform illumination and neutral expression. However, there are still many open problems when face recognition technology is put into real world applications. The Face Recognition Vendor Test (FRVT) 2002 reports [1] that recognition under illumination, expression and pose variations still remains challenging. Results show that the recognition rate decreases sharply when the face rotates to a large angle. Aiming to improve face recognition performance under pose variance, face synthesis has received much attention. There have been considerable discussions of synthesizing face images in novel views, which can be roughly divided into two categories: those based on a 3D head model and those based on 2D image statistics [2]. One common way to synthesize novel views of a face is to recover its 3D structure. Many current algorithms utilize a morphable 3D model to generate face images of novel views from a single image. These methods share a common problem: when only one face image is available, the texture within the occluded region becomes undefined. Vetter et al. [3, 4] use the linear object class approach to deal with this
problem. It is assumed that a new face's texture can be represented as a linear combination of the textures from a group of example faces in the same view, and that the combination coefficients can be used to synthesize the face image in another view. However, another difficulty appears. Since the generated texture is a linear combination of the textures in the training database, some individual characteristic information would be lost, such as small scars, beauty spots, and so on. Methods based on 2D images can also produce promising results for face synthesis. Stereopsis and projective geometry are often utilized. The applicability of these techniques depends heavily on whether most of the face region is visible in all source images. For building statistical models of face images, AAM (Active Appearance Models) [5] and ASM (Active Shape Models) [6] are widely used. Both methods can help extract corresponding feature points in different facial images. In this work, we present a complete scheme from pose-variant image capture to novel frontal image synthesis and recognition. This scheme generates a frontal facial image based on multiple pose-variant facial images. The pose estimation algorithm and the ASM algorithm for face alignment are both improved to ensure the synthesis accuracy. The synthesis stage is divided into shape reconstruction and texture synthesis. Shape reconstruction is carried out mainly based on stereoscopy and partly assisted by a trained model. The proposed scheme solves the following problems:
1. It transforms the 2-D rotated facial images to a 2-D frontal view, which is proven more suitable for face recognition by our experimental results.
2. Since the synthesis is based on more than one image, the texture contained in the two input images can cover most of the face, so that details are retained.
3. It overcomes the computationally expensive problem of synthesis, and thus is suitable for real-time face identification applications.
The rest of the paper is organized as follows: Section 2 provides an overview of the proposed method; Section 3 briefly introduces the pose estimation method and Section 4 introduces the face alignment method, both of which affect the final accuracy of face reconstruction. Section 5 describes the synthesis process, including shape reconstruction and texture synthesis. Experimental results are given in Section 6, and the paper is concluded in Section 7.
2 Overview of the Proposed Scheme
Aiming to generate a frontal-pose face image for recognition, an integrated scheme is designed, including image collection, image preprocessing, pose estimation, face alignment, facial shape reconstruction and texture synthesis, as shown in Fig. 1. The Multi-Channel Image Input & Process Module by Tsinghua University captures face images and performs face detection in parallel in four channels. Face synthesis is then carried out on the input images with the help of pose estimation and face alignment.
Fig. 1. Overview of the proposed scheme
3 Pose Estimation
In order to reconstruct a frontal face, the first step is to estimate the poses of the input face images. The face images are preprocessed through feature positioning and normalization. They are rectified and normalized geometrically according to the automatically located key-point positions. In the geometric normalization step, not only the eyes but also the middle point of the chin are located automatically. Then each face image is scaled and rotated so that the eyes lie on a horizontal line and the distance between the chin point and the center of the eyes equals a predefined length. After that, the face image is cropped to a given size. Examples of training images in the TH face database [7] are shown in Fig. 2. We collect 780 images of 60 people with left-right poses ranging from -45 degrees to 45 degrees at intervals of 15 degrees and with up-down poses ranging from -20 degrees to 20 degrees at intervals of 10 degrees. Features are extracted from the images using composite PCA (principal component analysis), projecting the face images onto their eigenspace. Let X_i ∈ R^N be a set of samples representing face images as column vectors. The transformation
Fig. 2. The training face images in TH database and their corresponding normalized images
matrix T can be formed from the eigenvectors normalized to unit length. The projection of X_i onto the N-dimensional subspace can be expressed as

\alpha = \{\alpha_1, \ldots, \alpha_N\} = X_i^{T} \cdot T    (1)
The shape feature is shown in Fig. 3. The feature points represent geometric characteristics. AB and A'B' denote the distance between the two eyes when the pose angle is 0 and β degrees, respectively. Setting the radius to 1,

A'E' = A'B' = \sin(\theta + \beta) + \sin(\theta - \beta) = 2\sin\theta\cos\beta    (2)

Since the distance AB = 2\sin\theta, the pose angle is

\beta = \arccos\left(\frac{A'E'}{AB}\right)    (3)

After the two groups of features are obtained, the weights of the two parameters α and β are set. The new eigenvector ξ is

\xi = p \cdot \alpha + q \cdot \beta, \quad \text{where } p + q = 1    (4)

SVM (support vector machine) is used to find the optimal linear hyperplane for which the expected classification error on unseen test samples is minimized [8]. According to the structural risk minimization principle, a function that classifies the training data accurately will generalize best regardless of the dimensionality of the input space. Each training sample x_i is associated with a coefficient a_i. Those samples whose coefficient a_i is nonzero are the Support Vectors (SV) of the optimal hyperplane. f(x) is the optimal SVM classification function, with y_i ∈ {+1, −1}:

f(x) = \sum_i y_i a_i K(x_i, x) + b    (5)

where K is a kernel function. Here we use a linear kernel, φ(x_i) = x_i, so K(x_i, x_j) = x_i · x_j = x_i^T x_j.
Fig. 3. Shape feature points and the configuration of pose variance
The PCA projection values of the samples in the eigenspace are used as SVM input parameters, and the optimal hyperplane that correctly separates the data points is found. Combining the PCA features and the SVM classifier yields better classification results, and thus more accurate pose angle estimation.
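Equation (3) amounts to a few lines of code; the sketch below is illustrative only, with the frontal inter-eye distance AB assumed to be known from the normalized training images.

```python
from math import acos, degrees, hypot

def yaw_from_eye_distance(left_eye, right_eye, frontal_eye_distance):
    """Estimate the left-right pose angle from Eq. (3): beta = arccos(A'E' / AB)."""
    observed = hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])   # A'E'
    ratio = min(observed / frontal_eye_distance, 1.0)   # guard against noise making ratio > 1
    return degrees(acos(ratio))

# Example: eyes 42 px apart in the rotated image, 60 px apart frontally -> about 45.6 degrees
print(round(yaw_from_eye_distance((100, 120), (142, 120), 60.0), 1))
```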
4 Face Alignment
An improved ASM (active shape models) method is chosen to extract the face feature points in this paper. It is hard for the conventional ASM to obtain an accurate result for each feature point; moreover, its performance depends heavily on the initial positions of the landmarks. According to the facial structure, edge information and facial part information are introduced into the matching process of ASM, which improves the performance of ASM [9, 10]. The face images are first normalized, and a total of 105 feature points extracted by the improved ASM algorithm are selected to represent the face shape, as shown in Fig. 4.
Fig. 4. Feature points extracted by the improved ASM algorithm
Although the improved ASM algorithm provides fine alignment, the accuracy of the points can still be improved. For images with different left-right poses, feature points with the same definition in different images should have the same y-coordinate value. However, we found that the ASM results did not always meet this principle, although together they always describe the contours of facial parts well. Therefore, to ensure the correspondence between feature points in different images, the feature points of some facial parts, such as the face contour, are connected and fitted by polynomial curves. As shown in Fig. 5, half of the contour is fitted in two stages. In the first stage, shown on the left, the feature points are roughly fitted and the point level with the corner of the mouth is taken as a subdivision point. The contour is then fitted separately by two curves, as shown on the right, and the contour feature points are adjusted onto the curves in a well-proportioned distribution. Besides fitting the contour, similar operations are also applied to the other facial parts. The alignment of the face feature points provides an important basis for the shape reconstruction of the frontal face, described in the next section.
Fig. 5. A two-stage fitting to the contour with polynomial curves
5 Face Synthesis
A face image can be separated into shape information and texture information. If we have these two kinds of information, we can reconstruct a facial image.
5.1 Shape Reconstruction
The shape of the novel frontal facial image is reconstructed based on the aligned feature points extracted from the source images. Two-view stereoscopy is introduced into the shape reconstruction, as shown in Fig. 6 below.
Fig. 6. Demonstration of the stereoscopy of shape reconstruction
For those feature points that are occluded in a source image with large rotation, we utilize a mean 3D model generated from 30 training subjects to reproduce their positions. We assume the shape of a human face is symmetric in the horizontal direction, so we can generate the positions of feature points in the occluded part from the visible part of the same image with the help of this model. Although the model-generated feature point positions are only approximations to their actual positions, the accuracy of these positions can be improved by a series of detailed iterative measures.
Fig. 7. A mean 3D model in our experiment
5.2 Texture Synthesis
For each source image and the reconstructed image, after the feature points are positioned, triangulation is used to partition the face into multiple triangles. The triangulation follows the same principle for all images and is realized by applying Delaunay triangulation [11] to a standard frontal shape. The standard frontal shape can be a mean shape computed from a series of training images. Then a triangle-based affine transform is used to warp the source face images to fit the destination shape. Equation (6) describes this affine transform. Here (x', y') is the coordinate in the destination image corresponding to a point (x, y) in the source image. Since the coordinates of the vertexes of the corresponding triangles in the source and destination images are known, the affine transform in Equation (6) can be solved.
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} O_x \\ O_y \end{bmatrix}    (6)
Fig. 8 shows an example of the texture synthesis. The left two images are the source images with feature points extracted by ASM; the middle is the destination shape and its triangulation; the right one is the pure texture for the destination image. For source images with very small rotation, there is very little occluded region, so the whole texture of each source image is used to synthesize a frontal facial texture and the mean of the two is taken as the texture for the destination image. However, for source images with large rotation, the situation is different. For each source image, we use the non-occluded half to generate
Fig. 8. Process of texture synthesis
the texture of the corresponding half of the destination image, and a smoothing filter is applied to the boundary in the middle.
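Solving Equation (6) for one triangle pair reduces to a small linear system. The sketch below, with hypothetical coordinates, recovers the six affine parameters from three corresponding vertices and maps a single point; the per-pixel texture filling and the boundary smoothing are omitted.

```python
import numpy as np

def affine_from_triangles(src_tri, dst_tri):
    """Solve Eq. (6) for a, b, c, d, Ox, Oy from three corresponding vertices."""
    A, rhs = [], []
    for (x, y), (xp, yp) in zip(src_tri, dst_tri):
        A.append([x, y, 0, 0, 1, 0])    # equation for x' = a*x + b*y + Ox
        A.append([0, 0, x, y, 0, 1])    # equation for y' = c*x + d*y + Oy
        rhs.extend([xp, yp])
    a, b, c, d, ox, oy = np.linalg.solve(np.array(A, float), np.array(rhs, float))
    return np.array([[a, b], [c, d]]), np.array([ox, oy])

def warp_point(M, offset, point):
    """Apply the recovered affine transform to one source point."""
    return M @ np.asarray(point, float) + offset

# Hypothetical triangle pair: uniform scaling by 1.2 plus a translation of (2, 3)
M, t = affine_from_triangles([(0, 0), (10, 0), (0, 10)], [(2, 3), (14, 3), (2, 15)])
print(warp_point(M, t, (5, 5)))   # -> [8. 9.]
```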
6 Experiment Results
In this section we describe our experimental results on the TH face database [7]. The proposed synthesis method is used to generate frontal facial images based on different sets of facial images with different poses. The performance of face recognition based on the synthetic images is then tested and compared with that based on the original face-rotated images, as shown in Table 1. Experiments are based on an identification task. The task is completed by a common Gabor-PCA algorithm, which is trained using another 600 frontal images from the TH face database. The recognition rate is the fraction of probes ranked first in the identification task. The gallery consists of images of 2000 people with one frontal image per person from the TH face database. Probe sets include image sets of different facial poses and the synthetic face image sets. Each probe set contains 200 face images and is used to test recognition performance separately. In our experiments, the mean time cost of the proposed method is measured. A single run of the whole scheme takes 0.98 seconds on average, which indicates that this method can be used for real-time identification applications. Fig. 9 and Fig. 10 show some examples of the synthesis results. Fig. 9 gives two examples of face synthesis based on images with left-right rotation angles (yaw angles). Fig. 10 gives three examples of face synthesis based on images with up-down rotation angles (pitch angles). For each person, the first two columns are the original images with rotation; the third column is the synthetic frontal facial image generated from the two images in the same row by our method; the fourth column is the real frontal facial image of the same person, shown here for comparison.
Fig. 9. Examples of face synthesis based on images with different yawing angles (left-right rotation angles)
Fig. 10. Examples of face synthesis based on images with different pitching angles (up-down rotation angles)

Table 1. Recognition performance of the original image sets and the synthetic image sets

Original Image Set   Recognition Rate   Original Image Set   Recognition Rate   Synthetic Image Set   Recognition Rate
L 15                 71.4%              R 15                 70.1%              L 15+R 15             88.5%
L 30                 58.2%              R 30                 57.8%              L 30+R 30             76.5%
L 45                 25.0%              R 45                 25.2%              L 45+R 45             52.6%
L 15                 71.4%              R 30                 57.8%              L 15+R 20             85.6%
L 15                 71.4%              R 45                 25.2%              L 15+R 45             79.0%
L 30                 58.2%              R 45                 25.2%              L 30+R 45             62.7%
U 10                 62.5%              D 10                 62.8%              U 10+D 10             79.9%
U 20                 45.0%              D 20                 42.3%              U 20+D 20             66.7%
U 10                 64.0%              D 20                 42.3%              U 10+D 20             70.9%
Table 1 shows the recognition performance of the different probe sets. In Fig. 9, Fig. 10, and Table 1, "L xx" means that the images in the set have a face pose rotated xx degrees to the left; similarly, "R" means "Right", "U" means "Up", and "D" means "Down". The name "L xx + R yy" in the fifth column means that the images in that set are synthesized from the corresponding images in the "L xx" and "R yy" sets. From Table 1 we can see that the recognition rates in the last column are much higher than the other two in the same row, which indicates that the frontal face synthesis helps to improve the recognition performance significantly.
7 Conclusions

In this paper, we proposed a scheme for pose-variant face recognition. In order to overcome the difficulty brought by non-frontal face images, a stereoscopic synthesis method is presented to generate a frontal face image based on two
pose-variant face images, which are captured at the same time. To ensure the accuracy of the shape reconstruction, we introduced a combined eigenspace analysis and SVM classification method for facial pose estimation, and an improved ASM method for facial feature point extraction and alignment. With more than one input image, the whole face texture is almost completely covered, so that the synthesized frontal face retains the individual texture characteristics. Although a small amount of unavoidable estimation and alignment error may affect the final reconstruction accuracy, the experimental results show that most of the information important for recognition is retained and helps to improve the recognition performance. Since the time cost is modest, this method is suitable for real-time identification applications.
References

1. Phillips, P.J., Grother, P., Ross, J., Blackburn, D., Tabassi, E., Bone, M.: Face Recognition Vendor Test 2002: Evaluation Report (March 2003)
2. Du, Y., Lin, X.: Multi-view face image synthesis using factorization models. In: International Workshop on Human-Computer Interaction (2004)
3. Vetter, T., Poggio, T.: Linear object classes and image synthesis from a single example image. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 733–742 (1997)
4. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. IEEE Trans. Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
5. Cootes, T., Taylor, C., Cooper, D., et al.: Active shape models - their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995)
6. Vetter, T.: Synthesis of novel views from a single face image. International Journal of Computer Vision 28(2), 103–116 (1998)
7. Li, C., Su, G., Meng, K., Zhou, J.: Technology Evaluations on TH-FACE Recognition System. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, pp. 589–597. Springer, Heidelberg (2005)
8. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
9. Gu, H., Su, G., Du, C.: Feature Points Extraction from Faces. In: Image and Vision Computing (IVCNZ'03), pp. 154–158 (2003)
10. Du, C., Su, G., Lin, X., Gu, H.: An Improved Multi-resolution Active Shape Model for Face Alignment. Journal of Optoelectronics. Laser 15(12), 706–710 (2004) (in Chinese)
11. Hassanpour, R., Atalay, V.: Delaunay Triangulation based 3D Human Face Modeling from Uncalibrated Images. In: Int. Conf. Computer Vision and Pattern Recognition, p. 75. IEEE Comput. Soc. Press, Los Alamitos (2004)
Optimal Decision Fusion for a Face Verification System

Qian Tao and Raymond Veldhuis

Signals and Systems Group, Faculty of EEMCS, University of Twente, The Netherlands
Abstract. Fusion is a popular practice to increase the reliability of biometric verification. In this paper, optimal fusion at the decision level by the AND rule and the OR rule is investigated. Both a theoretical analysis and experimental results are given. Comparisons are presented between fusion at the decision level and fusion at the matching score level. For our face verification system, decision fusion proves to be a simple, practical, and effective approach, which significantly improves the performance of the original classifier.
1 Introduction

Fusion is a popular practice to increase the reliability of biometric verification by combining the outputs of multiple classifiers. Often, fusion is done at the matching score level, because this combines good performance with a simple implementation. In decision fusion, each classifier outputs an accept or reject decision and the fusion is done based on these decisions. The diagram of decision fusion is drawn in Fig. 1.
Fig. 1. Diagram of optimal decision fusion
In the literature, fusion at the matching score level is more frequently discussed [2] [3] [6] [5]. In this paper, however, we will show that fusion at the decision level by the AND rule and the OR rule can be applied in an optimal way such that it always gives an improvement in terms of error rates over the classifiers that are fused. Here optimal is taken in the Neyman-Pearson sense [9]: at a given false-accept rate α, the decision-fused classifier has a false-reject rate β that is minimal and never larger than the false-reject rates of the classifiers that are fused at the same α.
In this paper we apply optimal decision fusion to a likelihood-ratio-based face verification system. At the decision level the classifier outputs binary values: 0 for reject and 1 for accept. At the matching score level, the classifier outputs the log likelihood ratio. Optimal decision fusion by the AND rule and the OR rule is compared to matching score fusion by the sum rule. This paper is organized as follows. In Section 2 a theoretical analysis of optimal decision fusion is given. In Section 3 the application to face verification is described, and the results of optimal decision fusion on this system are shown. Section 4 gives the conclusions.
2 Optimal Decision Fusion

2.1 Optimal Decision Fusion Theory

Suppose we have two (or more) classifiers which output binary decisions. Assume that the decisions are statistically independent. (Note that this independence may arise from independent classifiers or from independent samples.) Each decision D_i is characterized by two error probabilities: the first is the probability of a false accept, the false-accept rate (FAR), α_i, and the second is the probability of a false reject, the false-reject rate (FRR), β_i. To analyze the AND rule it is more convenient to work with the detection probability or detection rate p_{d,i} = 1 - β_i. It is assumed that p_{d,i} is a known function of α_i, p_{d,i}(α_i), known as the ROC (Receiver Operating Characteristic). In practice, the ROC has to be derived empirically. After application of the AND rule to the decisions D_i, i = 1, ..., N, we have, under the important assumption that all decisions are statistically independent, that

\alpha = \prod_{i=1}^{N} \alpha_i   (1)

p_d(\alpha) = \prod_{i=1}^{N} p_{d,i}(\alpha_i)   (2)
with α the false-accept rate and p_d the detection rate of the fused decision, respectively. Optimal AND-rule fusion can be formally defined by finding

\hat{p}_d(\alpha) = \max_{\alpha = \prod_{i=1}^{N} \alpha_i} \; \prod_{i=1}^{N} p_{d,i}(\alpha_i)   (3)
(3) means that the resulting detection rate p_d at a certain α is the maximal value of the product of the detection rates over some combination of α_i's under the condition that \alpha = \prod_{i=1}^{N} \alpha_i. In other words, the α_i's of the component classifiers are tuned so that the fused classifier gives the maximal detection rate at a fixed \alpha = \prod_{i=1}^{N} \alpha_i. Likewise, if we define the reject rate for the impostors p_{r,i} = 1 - α_i, the optimal decision fusion by the OR rule can be similarly formulated as

\hat{p}_r(\beta) = \max_{\beta = \prod_{i=1}^{N} \beta_i} \; \prod_{i=1}^{N} p_{r,i}(\beta_i)   (4)
where \hat{p}_d(α) and \hat{p}_r(β) are the optimized ROCs by the AND rule and the OR rule, respectively. For the AND rule, it is easily proved that the optimized detection rate \hat{p}_d(α) is never smaller than any of the p_{d,i}'s at the same FAR α:

\hat{p}_d(\alpha) \geq p_{d,i}(\alpha), \quad i = 1, \ldots, N   (5)

because, by definition,

\hat{p}_d(\alpha) = \max_{\alpha = \prod_{i=1}^{N} \alpha_i} \prod_{i=1}^{N} p_{d,i}(\alpha_i) \;\geq\; \left. \prod_{j=1}^{N} p_{d,j}(\alpha_j) \right|_{\prod_{i=1}^{N} \alpha_i = \alpha}   (6)
As it holds for any classifier that p_{d,i}(1) = 1, (5) readily follows by setting α_j = α and α_i = 1 for i ≠ j. For the OR rule, it can be similarly proved that the optimized reject rate \hat{p}_r(β) is never smaller than any of the p_{r,i}'s at the same FRR β. By solving the optimization problems in (3) and (4), the operating points of all component classifiers are obtained, and hence the fused classifier yields the optimal performance in the Neyman-Pearson sense. Because in real situations the ROCs, i.e. \hat{p}_d(α) or \hat{p}_r(β), are characterized by a set of discrete operating points rather than analytically, the optimization in (3) and (4) must be solved numerically. In [11] the problem is reformulated in a logarithmic domain as an unconstrained Lagrange optimization problem.

2.2 Optimal Decision Fusion on Identical Classifiers

In this section we discuss, in particular, optimal decision fusion on identical classifiers. This is a very useful setting in real applications, as will be shown in Section 3. Fusion on identical classifiers, in practice, means that given one classifier and multiple independent input samples, we make an optimal fusion of the multiple output decisions. In this paper, for simplicity, we analyze optimal fusion of two decisions; fusion of three or more decisions can be done in a similar manner. Because the classifiers are identical, we have p_{d,1} = p_{d,2} and the optimization problem can be formulated as

p_{\mathrm{fusion}}(x; \alpha) = p_d(x) \cdot p_d\!\left(\frac{\alpha}{x}\right)   (7)

\hat{p}_{\mathrm{fusion}}(\alpha) = \max_{\alpha \leq x \leq 1} \{\, p_{\mathrm{fusion}}(x; \alpha) \,\}   (8)

where x is a search variable and \hat{p}_{\mathrm{fusion}}(α) is the detection rate at α under optimal AND fusion. The optimum can be found by looking for the stationary point where the derivative of (7) with respect to x is zero. This derivative can be written as

p'_{\mathrm{fusion}}(x; \alpha) = p'_d(x)\, p_d\!\left(\frac{\alpha}{x}\right) - \frac{\alpha}{x^2}\, p_d(x)\, p'_d\!\left(\frac{\alpha}{x}\right)   (9)
Obviously, when x = √α, i.e. α₁ = α₂ = √α, the derivative reaches zero. However, for some ROCs and for some α this stationary point corresponds to a minimum; the optimum is then found at the border, either α₁ = 1 or α₂ = 1, which means only one of the two ROCs is taken. In practice, therefore, under the optimal solution either the two component classifiers work at identical operating points, or one of them has no effect at all. Although the former situation happens more often in practice, the latter does occur in certain cases.
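A minimal sketch of the numerical search implied by (7)-(8) for two identical, independent classifiers is given below: the empirical ROC is interpolated and x is grid-searched over [α, 1]. This is a simplified illustration, not the logarithmic-domain Lagrange reformulation of [11]; the names and the grid size are ours.

```python
import numpy as np

def and_fuse_identical(far, pd, alphas, n_grid=1000):
    """Optimal AND fusion of two identical classifiers (Eqs. 7-8).

    far, pd: arrays describing the component ROC (detection rate vs. FAR),
             with far sorted in increasing order.
    alphas:  target FARs of the fused classifier (0 < alpha <= 1).
    Returns the optimized detection rates for each target FAR.
    """
    pd_of = lambda a: np.interp(a, far, pd)      # interpolated empirical ROC
    fused = []
    for a in alphas:
        x = np.linspace(a, 1.0, n_grid)          # alpha_1 = x, alpha_2 = a / x
        fused.append(np.max(pd_of(x) * pd_of(a / x)))
    return np.array(fused)
```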
Fig. 2. Optimal decision fusion on ROC, example 1: (a) optimal AND fusion, (b) optimal OR fusion

Fig. 3. Optimal decision fusion on ROC, example 2: (a) optimal AND fusion, (b) optimal OR fusion
Examples are shown to illustrate the optimal decision fusion. Fig. 2 and Fig. 3 show two examples. In both figures, the solid line represents the original ROC, the dots represent the candidate points in the search for the optimum by (7) with different x, and the dashed line represents the resulting optimal ROC. Improvements in performance can be clearly seen in both cases. Furthermore, it can be observed that the OR rule is more suitable for the ROC in Fig. 2, and the AND rule is more suitable for the ROC in Fig. 3. In Fig. 3 (b) we can see that for a certain range of α the two component classifiers do not work at the same operating point; instead only one of the two is taken. To better explain the improvement brought by fusion, Fig. 4 visualizes the different decision boundaries of the original classifier, AND fusion, OR fusion, and the sum rule. The crosses represent the scattering of two independent matching scores (in this case the
log likelihood ratio) for the user, and the circles represent the scattering of two independent matching scores for the impostors. In Fig. 4 (b), (c), and (d), fusion facilitates decision boundaries that span a two-dimensional space, in which the two "clouds" are better separated than in the one-dimensional case of Fig. 4 (a) (where only one dimension is valid). Even better separation can be expected in higher-dimensional spaces.
Fig. 4. Boundaries of different classifiers based on the original classifier: (a) original classifier, (b) AND fusion, (c) OR fusion, (d) sum rule
3 Application of Optimal Decision Fusion on a Face Verification System

3.1 The Face Verification System on a Mobile Device

In Section 2 the optimal decision fusion theory was presented. In this section we describe a real application of a biometric verification system on which the optimal decision fusion is applied. In a larger context, our biometric verification system acts as a link between a user and a private PN (personal network) via an intermediate MPD (mobile personal device) [7]. To achieve high security for the PN it is specifically demanded, among other requirements, that authentication be done not only at logon time but also on an ongoing basis, in order to prevent the scenario in which an MPD is taken away by an impostor after the user has logged in.
We use the face as the biometric, and a camera on the MPD as the biometric sensor. In our standard system, features are extracted from each frame of the face image, and a decision of YES or NO is made. In our original face recognition system, face detection is done by the Viola-Jones method [10], and face registration is done by aligning prominent facial landmarks, also detected by the Viola-Jones method. Illumination normalization is done by applying local binary patterns (LBP) [4] [1] as a preprocessing method. A likelihood ratio classifier is used which is based on the relative distribution between the user and the background data [8]. As the user-specific distribution has to be learned from extensive user training data, which is beyond most public face databases, we collected our own dataset under laboratory conditions. More than 500 frames of face images are collected per subject. (The database is still under construction, but the data used in this paper is available on request.) In our new system with decision fusion, multiple frames taken at certain intervals are used as the input, and the decision is made based on optimal fusion. It can be argued that the independence assumption becomes less valid when the intervals are chosen too small, but we will show that even in the case of partial dependency the decision fusion brings improvements to the performance of the system.

3.2 Experiments Setup

In the experiments, the face images are collected at a frequency of 5 frames per second and stored as a function of time. For each subject, the data are collected independently in different sessions under different illuminations. Examples of the cross-session data are shown in Fig. 5.
Fig. 5. The face data of a user collected in different sessions
We use the data of two independent sessions for training and testing. First, the classifier is trained on the first session. Second, the classifier is tested on the second session, and a ROC is obtained. This ROC represents the component classifier in the decision fusion. Optimal decision fusion is then made on the ROC according to (3) or (4). Finally, the optimal decision fusion scheme is tested on multiple inputs from the second session, with each component classifier working at its optimal operating point.
Fig. 6. Experiment results with randomly chosen samples: (a) scatter plot, (b) ROC

Fig. 7. Experiment results with samples chosen at a time interval of 0.5 seconds: (a) scatter plot, (b) ROC

Fig. 8. Experiment results with samples chosen at a time interval of 15 seconds: (a) scatter plot, (b) ROC
3.3 Results on Optimal Decision Fusion

In the following experiments, optimal decision fusion is done on two samples. The samples are taken in three ways: (1) the two samples are taken randomly; (2) the two samples are taken at a short interval of 0.5 seconds; (3) the two samples are taken at a longer interval of 15 seconds. For comparison, we also perform sum rule matching score fusion, which is the theoretically optimal scheme for log likelihood ratio matching scores. Fig. 6, Fig. 7, and Fig. 8 show the results of these three sampling schemes, respectively.
Improvements in performance can be clearly seen in Fig. 6, Fig. 7, and Fig. 8, with the EER (equal error rate) reduced to less than half of the original value. In Fig. 7 and Fig. 8 there exists a certain correlation between the two samples, but despite this partial dependency the OR rule still works very well and yields a performance comparable to or even better than sum rule matching score fusion. The scatter plots indicate that in certain cases a corner-shaped OR rule boundary is favored over a straight-line sum rule boundary.

3.4 Outliers and OR-Rule Optimal Decision Fusion

Outliers, in face verification, are face images which belong to the user but deviate from the user distribution because of extraordinary expressions or poses. Outliers occur in biometric verification and cause rejections of the genuine user. In our ongoing face verification system on an MPD, this harms the convenience aspect of the system [7]. Fig. 9 illustrates outlier faces rejected by the original classifier.
Fig. 9. Outliers in user data which are rejected by the classifier
Fortunately, the optimal decision fusion by the OR rule can effectively reduce the FRR caused by the outliers at almost no expense in FAR. Suppose the outlier distribution of the genuine user sample x is denoted by Ψ_{Go}(x), with a prior probability given by a small quantity p_o, and suppose the distribution of the genuine user sample in normal cases is Ψ_G(x), with a prior probability of 1 - p_o. Taking the outlier distribution into account, the probability Ψ(x) of a genuine user sample x can be expressed as

\Psi(x) = (1 - p_o) \cdot \Psi_G(x) + p_o \cdot \Psi_{Go}(x)   (10)
For two samples x₁ and x₂, assuming independence, their joint probability is

\Psi(x_1, x_2) = (1 - p_o)^2 \, \Psi_G(x_1)\Psi_G(x_2) + p_o^2 \, \Psi_{Go}(x_1)\Psi_{Go}(x_2) + p_o(1 - p_o) \, \Psi_{Go}(x_1)\Psi_G(x_2) + p_o(1 - p_o) \, \Psi_G(x_1)\Psi_{Go}(x_2)   (11)
The four terms in (11) describe the probabilities of the four different joint occurrences of the two samples, corresponding to Fig. 10. Note that the second term, which describes the simultaneous occurrence of two outliers, is extremely small due to the factor p_o². In this case the OR rule boundary, denoted by the solid line, works better than the sum rule boundary, denoted by the dotted line, with fewer false rejections. Real examples in our experiments also confirm the advantage of the OR rule, as shown in Fig. 11. In this experiment the cross-session data are more extensive, and therefore the outlier effects are more prominent.
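The difference between the two boundaries in Fig. 10 can be sketched as follows, assuming two independent log-likelihood-ratio scores and a common acceptance threshold; the threshold and score values below are placeholders, not values from the paper.

```python
def or_rule_accept(s1, s2, t1=0.0, t2=0.0):
    """OR-rule decision fusion: accept if either classifier accepts (corner boundary)."""
    return (s1 >= t1) or (s2 >= t2)

def sum_rule_accept(s1, s2, t_sum=0.0):
    """Sum-rule score fusion: accept on the summed score (straight-line boundary)."""
    return (s1 + s2) >= t_sum

# Hypothetical genuine attempt in which the first sample is an outlier:
s1, s2 = -8.0, 3.0
print(or_rule_accept(s1, s2))    # True: the normal sample rescues the genuine user
print(sum_rule_accept(s1, s2))   # False: the outlier drags the sum below threshold
```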
Fig. 10. The distribution of the two samples in fusion, taking into account the outlier distribution (quadrants: no outliers, one outlier, one outlier, two outliers). The solid lines are the OR rule boundary, and the dotted line is the sum rule boundary.
Fig. 11. Experiment results with samples with outliers: (a) scatter plot, (b) ROC
It can be seen that in realistic situations, in the presence of outliers, the OR rule works best. Comparing the OR rule performance with the sum rule performance in Fig. 11 (b), it can be seen that the FRR is effectively reduced at the same FAR.
4 Conclusions

In this paper, optimal fusion at the decision level by the AND rule and the OR rule is proposed and investigated. Both the theoretical analysis and the experimental results are given, showing that optimal decision fusion always improves on the performance of the original classifier. For our face verification system, decision fusion proves to be a simple, practical, and effective approach which significantly improves the performance of the system. The improvement brought by optimal decision fusion in FAR at a fixed FRR (or in FRR at a fixed FAR) is very desirable for biometric systems.
References

1. Heusch, G., Rodriguez, Y., Marcel, S.: Local binary patterns as image preprocessing for face authentication. In: IEEE International Conference on Automatic Face and Gesture Recognition. IEEE Computer Society Press, Los Alamitos (2006)
2. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
3. Kittler, J., Li, Y., Matas, J., Sanchez, M.: Combining evidence in multimodal personal identity recognition systems. In: Bigün, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206. Springer, Heidelberg (1997)
4. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
5. Ross, A., Jain, A.: Information fusion in biometrics. Pattern Recognition Letters 24(13) (2003)
6. Ross, A., Nandakumar, K., Jain, A.: Handbook of Multibiometrics. Springer, Heidelberg (2006)
7. Tao, Q., Veldhuis, R.: Biometric authentication for mobile personal device. In: First International Workshop on Personalized Networks, San Jose, USA (2006)
8. Tao, Q., Veldhuis, R.: Verifying a user in a personal face space. In: 9th Int. Conf. Control, Automation, Robotics, and Vision, Singapore (2006)
9. van Trees, H.L.: Detection, Estimation, and Modulation Theory. John Wiley and Sons, New York (1969)
10. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2004)
11. Zhang, W., Chang, Y., Chen, T.: Optimal thresholding for key generation based on biometrics. In: International Conference on Image Processing (2004)
Robust 3D Head Tracking and Its Applications

Wooju Ryu and Daijin Kim

Intelligent Multimedia Laboratory, Dept. of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea
{wjryu,dkim}@postech.ac.kr
Abstract. Head tracking is a challenging and useful task in the field of computer vision. This paper proposes a fast 3D head tracking method that works robustly under a variety of difficult conditions. First, we obtain pose robustness by using the 3D cylindrical head model (CHM) and a dynamic template. Second, we obtain robustness to fast head movement by using the dynamic template. Third, we obtain illumination robustness by modeling the illumination basis vectors and adding them to the previous input image to adapt to the current input image. Experimental results show that the proposed head tracking method outperforms tracking methods using the fixed and dynamic templates alone, in terms of a smaller pose error and a higher successful tracking rate, and that it tracks the head successfully even when the head moves fast under rapidly changing poses and illuminations, at a speed of 10-15 frames/sec. The proposed head tracking method has versatile applications, such as a head-gesture TV remote controller for handicapped people and a drawing tool driven by head movement for entertainment.
1 Introduction

For 3D head tracking, many researchers have used simple geometric head models such as a cylinder [1], [2], an ellipsoid [3], or a head-like 3D shape [4] to recover the global head motion. They assume that the shape of the head model does not change during tracking, which means that it does not have shape parameters. The global head motion can be represented by a rigid motion, which can be parameterized by 6 parameters: three for 3D rotation and three for 3D translation. Therefore, the number of model parameters is only 6. Among the three different geometric 3D head models, we take the cylindrical head model due to its robustness to pose variation, its general applicability, and its simplicity. It is more appropriate for approximating the 3D shape of generic faces than the ellipsoid model. Also, it requires a small number of parameters and its fitting performance is less sensitive to initialization than the head-like 3D shape model. To be more robust against extreme poses and fast head movement, the dynamic template technique has been proposed [2]. Although the dynamic template technique can handle extreme head movement, it has the problem that head tracking can fail due to accumulated fitting error. To remedy this problem, [2]
proposed the re-registration technique, which stores reference frames and uses them when the fitting error becomes large. The dynamic template may cover gradually changing illumination because the template is updated every frame. However, it cannot cover all kinds of illumination changes, specifically rapidly changing illumination. [1] removed the illumination effects by adding illumination basis vectors to the template. This approach can cover rapidly changing illumination. We derive a novel full-motion recovery under perspective projection that combines the dynamic template and re-registration technique [2] with the removal of illumination effects by adding illumination basis vectors [1]. Also, we update the reference frames according to the illumination condition of the input image. This approach provides a new head tracking method which is robust to extreme head poses, fast head movement, and rapidly changing illumination.
2 Full Motion Recovery Under the Rapidly Changing Illuminations

2.1 Full Motion Recovery

The cylinder head model assumes that the head is shaped as a cylinder and the face is approximated by the cylinder surface. A 3D cylinder surface point is represented as x = [x y z]^T and the 2D image pixel coordinate is represented as u = [u v]^T. If we take the perspective projection function, the 2D image pixel coordinate u is given by

u = P(x) = \frac{f_L}{z} [\, x \;\; y \,]^T,   (1)

where f_L is the focal length. When the cylinder surface point x is transformed by the rigid motion vector p, the rigid transformation function M(x; p) of x can be represented by

M(x; p) = R x + T,   (2)
where R ∈ R^{3×3} and T ∈ R^{3×1} are the 3D rotation matrix and the 3D translation vector, respectively. We take the twist representation [8], whose detailed derivation is given in [9]. According to the twist representation, the 3D rigid motion model M(x; p) is given by

M(x; p) = \begin{bmatrix} 1 & -w_z & w_y \\ w_z & 1 & -w_x \\ -w_y & w_x & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}   (3)

where p = [\, w_x \; w_y \; w_z \; t_x \; t_y \; t_z \,]^T is the 3D full head motion parameter vector. The warping function W(x; p) is completely defined by using P(x) and M(x; p) as

W(x; p) = P(M(x; p))   (4)
        = \frac{f_L}{-x w_y + y w_x + z + t_z} \begin{bmatrix} x - y w_z + z w_y + t_x \\ x w_z + y - z w_x + t_y \end{bmatrix}   (5)
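A minimal sketch of the warp in (1)-(5), assuming the linearized (twist) rotation above, is as follows; NumPy, names ours.

```python
import numpy as np

def warp(x, p, fL):
    """W(x; p): project a 3D cylinder-surface point to 2D pixel coordinates.

    x:  (3,) point [x, y, z] on the cylinder surface.
    p:  (6,) motion vector [wx, wy, wz, tx, ty, tz] (twist representation).
    fL: focal length.
    """
    wx, wy, wz, tx, ty, tz = p
    R = np.array([[1.0, -wz,  wy],
                  [ wz, 1.0, -wx],
                  [-wy,  wx, 1.0]])                          # linearized rotation, Eq. (3)
    X = R @ np.asarray(x, float) + np.array([tx, ty, tz])    # M(x; p), Eq. (2)
    return fL * X[:2] / X[2]                                 # projection, Eqs. (4)-(5)
```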
When we consider the illumination basis vectors, the objective function for robust 3D head tracking is given by

\min \; \sum_{x} \Big[ I(W(x; p), t) - \sum_{i=1}^{b_N} (q_i + \Delta q_i)\, b_i(x) - I(W(x; p + \Delta p), t + 1) \Big]^2,   (6)
where p and q_i are the 3D rigid motion parameters and the i-th illumination coefficient, respectively, and Δp and Δq_i are the parameter updates computed by solving the optimization problem.

2.2 Linear Approximation

To solve Eq. (6), we need to approximate the nonlinear expression by a linear one in Δp and Δq as

I(W(x; p), t) - \sum_{i=1}^{b_N} (q_i + \Delta q_i) b_i(x) - I(W(x; p + \Delta p), t + 1) \;\approx\; I(W(x; p), t) - I(W(x; p), t + 1) - \sum_{i=1}^{b_N} q_i b_i(x) - \sum_{i=1}^{b_N} \Delta q_i b_i(x) - \nabla I \, \frac{\partial W}{\partial p} \, \Delta p.   (7)
Let us define the error image, the steepest descent image, and the Hessian matrix H as

E(x) = I(W(x; p), t) - I(W(x; p), t + 1) - \sum_{i=1}^{b_N} q_i b_i(x)   (8)

SD(x) = \Big[ \nabla I \frac{\partial W}{\partial p_1}, \ldots, \nabla I \frac{\partial W}{\partial p_6}, b_1(x), \ldots, b_{b_N}(x) \Big]   (9)

H = \sum_{x} SD(x)^T SD(x).   (10)

Then, the model parameters [\, \Delta p \;\; \Delta q \,]^T can be obtained as

\begin{bmatrix} \Delta p \\ \Delta q \end{bmatrix} = H^{-1} \sum_{x} SD(x)^T E(x).   (11)
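A condensed sketch of one solve of (8)-(11) is given below, assuming the template pixels, warped input pixels, illumination basis, and steepest-descent entries are already available as flat arrays; this is our illustrative reorganization, not the authors' code.

```python
import numpy as np

def update_step(T, I_warp, q, B, grad_J):
    """One Gauss-Newton style solve for [dp; dq] (Eqs. 8-11).

    T:      (N,) template pixels I(W(x; p), t)
    I_warp: (N,) warped input pixels I(W(x; p), t+1)
    q:      (bN,) current illumination coefficients
    B:      (N, bN) illumination basis vectors b_i(x)
    grad_J: (N, 6) steepest-descent entries grad(I) * dW/dp_k
    """
    E = T - I_warp - B @ q                    # error image, Eq. (8)
    SD = np.hstack([grad_J, B])               # steepest-descent image, Eq. (9)
    H = SD.T @ SD                             # Hessian, Eq. (10)
    delta = np.linalg.solve(H, SD.T @ E)      # Eq. (11)
    return delta[:6], delta[6:]               # (delta_p, delta_q)
```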
2.3 Parameter Update

At every frame, we iteratively update the parameters Δp and Δq simultaneously. Before the iterative process, we set the previous input image patch as the current template and set the initial parameters p = p_0 and q = 0, where p_0 holds the previous motion parameters. At every iteration, we compute the new error image and the steepest descent image. When computing the new error image, the parameter p in the first term I(W(x; p), t) should be kept at p_0 because this term is used as the template image. Table 1 summarizes the overall process of the iterative update of p and q, where ε₁ and ε₂ are threshold values.
Table 1. Overall process of the parameter update

(1) Set the previous input image patch as the template image.
(2) Initialize the parameters as p = p_0 and q = 0.
(3) Compute E(x) and SD(x).
(4) Compute the Hessian matrix H.
(5) Compute the incremental parameters Δp and Δq.
(6) Update the parameters by p ← p + Δp and q ← q + Δq.
(7) If ||Δp|| < ε₁ and ||Δq|| < ε₂ then stop. Otherwise go to (3).
3 Robust 3D Head Tracking
The dynamic template and re-registration method was first suggested in [2] for robust head tracking. In this section we review the dynamic template, and explain how to generate the illumination basis vectors and how the re-registration algorithm is modified for rapidly changing illumination.

3.1 Dynamic Template
We briefly review the dynamic template method for robust head tracking [2]. Since a fixed template cannot cover all kinds of head motion, we use a dynamic template to obtain long-term robustness to head motion. Because the cylindrical model cannot represent the head shape exactly, the template from the initial frame cannot cover the current input image when the head pose changes extremely. The dynamic template method takes the previous input image patch as the current template image.

3.2 Illumination Basis Vectors
Although the dynamic template approach can cover gradually changing illumination, it cannot cover rapidly changing illumination effectively. To tackle rapidly changing illumination, we propose to use a linear model. To build the illumination model, we generate head images whose illumination is changed in five different directions (left, right, up, down, and front side), collect the illumination images, and apply principal component analysis (PCA) to the collected images after subtracting the mean image.
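A minimal sketch of building the illumination basis by PCA over the differently lit head images is given below, assuming the images are already warped to the template and flattened to vectors; the number of basis vectors kept is our choice, not a value from the paper.

```python
import numpy as np

def illumination_basis(images, n_basis=4):
    """PCA basis of illumination variation.

    images: (M, N) array of M flattened illumination images of the template.
    Returns (n_basis, N) basis vectors b_i(x) and the (N,) mean image.
    """
    mean = images.mean(axis=0)
    centered = images - mean                              # subtract the mean image
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:n_basis], mean                             # principal components
```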
3.3 Re-registration
We describe which frames are referenced and how the re-registration process is executed when the illumination vectors are considered. While tracking the head motion, the fitting error can accumulate. Therefore, when the fitting error exceeds a certain threshold value, we need to re-register to prevent the accumulated error. In the early stage of tracking, before the accumulated error exceeds the threshold value, we record the input image I and its motion parameter p
as reference frames (the reference DB). The reference frames are classified by head pose (w_x, w_y, w_z), and each class in the reference DB is represented by one reference frame. When re-registration is executed, the reference frame which corresponds to the current head pose is selected from the reference DB. If the illumination has changed, we cannot use the old reference DB because the current input image has a different illumination condition from the reference frame. Therefore, we update the reference DB when the norm of the illumination parameter q is larger than a threshold value, after the re-registration is performed.
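The reference DB can be sketched as a small dictionary keyed by a quantized head pose. The pose bin size and the two thresholds below are placeholders, not values from the paper, and the fitting/re-registration calls are abstracted away.

```python
import numpy as np

REF_DB = {}   # pose class -> (reference image patch, motion parameters)

def pose_class(p, bin_deg=15.0):
    """Quantize (wx, wy, wz) into a pose bin used as the DB key."""
    return tuple(np.round(np.degrees(p[:3]) / bin_deg).astype(int))

def after_frame(I, p, q, fit_error, err_thresh=1e3, q_thresh=5.0):
    key = pose_class(p)
    if fit_error < err_thresh:
        REF_DB.setdefault(key, (I.copy(), p.copy()))      # record a good frame
    elif key in REF_DB:
        I_ref, p_ref = REF_DB[key]                        # re-register from the DB
        # ... restart the fitting from (I_ref, p_ref) ...
        if np.linalg.norm(q) > q_thresh:                  # illumination has changed,
            REF_DB[key] = (I.copy(), p.copy())            # so refresh this entry
```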
4 Experimental Results

We developed the illumination-robust 3D head tracking system using the real-time algorithm explained above. We used a desktop PC (Pentium IV, 3.2 GHz) and a Logitech web camera. The average tracking speed is about 10 frames per second when the illumination basis vectors are used, and about 15 frames per second when they are not. For automatic initialization, we used a face and eye detector based on MCT+Adaboost [10] and an AAM face tracker to find the face boundary and facial feature points.

4.1 Extreme Pose and Fast Head Movement
In this experiment, we compared 3D head tracking using the fixed template and the dynamic template. First, we captured a test sequence with extreme head motion in tilting and yawing. Fig. 1-(a) shows the result of the 3D head tracking, where the rows correspond to tracking with the fixed template and tracking with the dynamic template, respectively. As can be seen, tracking with the fixed template starts to fail from the 48-th frame and loses the head completely at the 73-rd frame; after the 73-rd frame, tracking cannot be performed any more. However, we obtain stable head tracking throughout the entire sequence using the dynamic template. Fig. 1-(b) compares the measured 3D head poses (tilting, yawing, rolling) when the pose changes greatly, where the left and right columns correspond to the fixed template and the dynamic template, respectively, and the ground-truth poses are denoted by the dotted line. Second, we evaluated our head tracking system when the head moves very fast. The test sequence has 90 frames with one tilt and two yaws. Fig. 2-(a) shows the result of 3D head tracking on this test sequence. While tracking with the fixed template fails at frame 31, tracking with the dynamic template succeeds throughout the whole test sequence. Fig. 2-(b) compares the measured 3D head poses when the head moves fast. As seen in the above experiments, the fixed template produces large errors in the measured 3D head poses, whereas the dynamic template produces very accurate 3D head pose measurements.
Fig. 1. Comparison of the head tracking results and the measured 3D head poses in the case of changing poses
Fig. 2. Comparison of head tracking results and the measured 3D head poses in the case of the fast head movement
4.2 Rapidly Changing Illumination

We tested how the proposed head tracking system performs when the illumination condition changes rapidly. The test sequence has three rapidly changing illuminations (front, left, and right).
Fig. 3. Comparison of the tracking results under the rapidly changing illumination
Fig. 3 compares the head tracking results, where the rows correspond to head tracking with the dynamic template only and head tracking with the dynamic template plus the illumination basis vectors. As can be seen, when we use the dynamic template only, the tracking results are very unstable under rapidly changing illumination. On the other hand, when we use the dynamic template with the illumination basis vectors, the tracking results are stable throughout the entire sequence even though the illumination condition changes rapidly.

4.3 Rapidly Changing Pose, Head Movement and Illumination
To test the proposed head tracking method more thoroughly, we built a head movement database¹ of 15 different people which includes five different illuminations (left, right, up, down, and front). The ground truth of the head rotation is measured by a 3D object tracker (Fastrak system). Table 2 summarizes the head tracking experiments on the IMH DB, where the number after each test sequence denotes the number of frames, r_track denotes the ratio of the number of successfully tracked frames to the total number of frames, and the average pose error E_p is computed by summing the tilting, yawing, and rolling pose errors between the ground truth and the estimated angles for every frame and averaging these sums over the entire sequence. From this table, we can see that the proposed tracking method using the dynamic template with the illumination basis vectors successfully tracks all image sequences in the IMH DB.
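For reference, the two measures can be computed as sketched below, assuming per-frame success flags and per-frame estimated and ground-truth (tilt, yaw, roll) angles, and taking absolute angle differences; the array names are ours.

```python
import numpy as np

def tracking_rate(success_flags):
    """r_track: fraction of frames that were tracked successfully."""
    return float(np.mean(success_flags))

def average_pose_error(est_angles, gt_angles):
    """E_p: per-frame sum of |tilt| + |yaw| + |roll| errors, averaged over frames.

    est_angles, gt_angles: (n_frames, 3) arrays of angles in degrees.
    """
    per_frame = np.abs(est_angles - gt_angles).sum(axis=1)
    return float(per_frame.mean())
```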
4.4 Application 1: Face Remote Controller
We developed the face remote controller (FRC) as a real-world application. The FRC is a remote controller operated by head gestures instead of the hand. The head gestures are used to move the current cursor left, right, up, and down, where the cursor is designed to move discretely between buttons, and eye blinking is used to generate the button click event. We apply the FRC to a TV remote controller system using a CCD camera that can zoom in/out and is located on top of the TV.
¹ We call this database the IMH DB [11].
Table 2. The tracking results on the IMH DB

| Sequence (frames) | Fixed template | Dynamic template | Dynamic template with the illumination basis vectors |
| Seq. 1 (679) | r_track = 0.05 | r_track = 0.89 | r_track = 1, E_p = 3.89 |
| Seq. 2 (543) | r_track = 0.1 | r_track = 0.54 | r_track = 1, E_p = 3.34 |
| Seq. 3 (634) | r_track = 0.06 | r_track = 0.26 | r_track = 1, E_p = 3.75 |
| Seq. 4 (572) | r_track = 0.07 | r_track = 0.44 | r_track = 1, E_p = 4.81 |
| Seq. 5 (564) | r_track = 0.06 | r_track = 0.9 | r_track = 1, E_p = 5.75 |
| Seq. 6 (663) | r_track = 0.08 | r_track = 0.39 | r_track = 1, E_p = 4.41 |
| Seq. 7 (655) | r_track = 0.07 | r_track = 0.48 | r_track = 1, E_p = 3.89 |
| Seq. 8 (667) | r_track = 0.05 | r_track = 0.88 | r_track = 1, E_p = 4.29 |
| Seq. 9 (588) | r_track = 0.09 | r_track = 0.38 | r_track = 1, E_p = 4.49 |
| Seq. 10 (673) | r_track = 0.23 | r_track = 0.48 | r_track = 1, E_p = 2.92 |
| Seq. 11 (672) | r_track = 0.07 | r_track = 0.42 | r_track = 1, E_p = 5.30 |
| Seq. 12 (504) | r_track = 0.04 | r_track = 0.59 | r_track = 1, E_p = 8.21 |
| Seq. 13 (860) | r_track = 0.37 | r_track = 0.53 | r_track = 1, E_p = 5.38 |
| Seq. 14 (694) | r_track = 0.13 | r_track = 0.72 | r_track = 1, E_p = 2.19 |
| Seq. 15 (503) | r_track = 0.25 | r_track = 0.58 | r_track = 1, E_p = 3.79 |
The TV watcher sits in a chair approximately 5 meters away from the TV. Fig. 4 shows how cursor movement and button clicks are performed with head gestures and eye blinking, and how the FRC is applied to the TV remote controller system.
Fig. 4. Cursor movement and click using the head gesture and the eye blinking
4.5 Application 2: Drawing Tool by Head Movement

We also applied head movement to develop a drawing tool. Basically, we use the head movement instead of a mouse or tablet pen to move the mouse cursor, where the center point of the front cylinder surface is used as the position of the mouse cursor. We define three states, "Wait", "Move", and "Draw", to organize the drawing tool by head movement (DTHM). When the state is "Wait", the DTHM system does nothing and just waits for an eye blink.
Fig. 5. Drawing tool by head movement
Once an eye blink occurs in the "Wait" state, the DTHM system changes the state from "Wait" to "Move". In the "Move" state, we can move the mouse cursor to where we want to start drawing. If there is no movement in the "Move" state for some time, the state changes from "Move" to "Draw", and then we can draw a shape by moving the head. To stop drawing and move the mouse cursor again, we stop moving the head and wait for some time until the state changes from "Draw" back to "Move". Fig. 5 shows how the DTHM system operates. Initially, the system is in the "Wait" state and changes to the "Move" state when the eye blinks. Then, the system changes to the "Draw" state after a period without movement, and we draw the character "R" by moving the cursor appropriately. We change the state from "Draw" to "Move" to move the cursor to the next drawing position.
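The Wait/Move/Draw logic can be summarized as a small state machine, as sketched below; the event names and the dwell-time threshold are ours, and the idle timer is assumed to reset whenever the state changes.

```python
def next_state(state, eye_blink, idle_time, dwell=1.5):
    """Transition rule of the drawing tool (DTHM).

    state:     one of "Wait", "Move", "Draw".
    eye_blink: True if an eye blink was detected in this frame.
    idle_time: seconds without head movement since the last transition.
    """
    if state == "Wait" and eye_blink:
        return "Move"                    # a blink starts cursor movement
    if state == "Move" and idle_time >= dwell:
        return "Draw"                    # holding still starts drawing
    if state == "Draw" and idle_time >= dwell:
        return "Move"                    # holding still again stops drawing
    return state
```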
5 Conclusion

We proposed a new framework for 3D head tracking using the 3D cylinder head model, which combines several techniques: the dynamic template, re-registration, and the removal of illumination effects using illumination basis vectors. We also proposed a new objective function that adds a linear illumination model to the existing objective function based on LK image registration, and derived an iterative update formula for the model parameters, i.e., the rigid motion parameter p and the illumination coefficient vector q, at the same time. We modified the overall process of the existing re-registration technique so that the reference frames can be updated when the illumination condition changes rapidly. We performed intensive 3D head tracking experiments using the IMH DB and evaluated the head tracking performance in terms of pose error and successful tracking rate. The experimental results showed that the proposed head tracking method is more accurate and stable than the other tracking methods using the fixed and dynamic templates. We also developed the face TV remote controller for handicapped people and the drawing tool by head movement for entertainment, to demonstrate the versatile applicability of the proposed 3D head tracking method.
Acknowledgement

This work was financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency. Also, it was partially supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea.
References

1. Cascia, M., Sclaroff, S., Athitsos, V.: Fast, reliable head tracking under varying illumination: An approach based on robust registration of texture-mapped 3D models. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 22, 322–336 (2000)
2. Xiao, J., Moriyama, T., Kanade, T., Cohn, J.: Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology 13, 85–94 (2003)
3. Basu, S., Essa, I., Pentland, A.: Motion regularization for model-based head tracking. In: Proceedings of the International Conference on Pattern Recognition (ICPR), vol. 3, p. 611 (1996)
4. Malciu, M., Preteux, F.: A robust model-based approach for 3D head tracking in video sequences. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, p. 169. IEEE Computer Society Press, Los Alamitos (2000)
5. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679 (1981)
6. Baker, S., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework: Part 1. Technical Report CMU-RI-TR-02-16, Robotics Institute, Carnegie Mellon University (2002)
7. Baker, S., Gross, R., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework: Part 3. Technical Report CMU-RI-TR-03-35, Robotics Institute, Carnegie Mellon University (2003)
8. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8–15. IEEE Computer Society Press, Los Alamitos (1998)
9. Murray, R., Li, Z., Sastry, S.: A Mathematical Introduction to Robotic Manipulation. CRC Press, Boca Raton, USA (1994)
10. Froba, B., Ernst, A.: Face detection with the modified census transform. In: Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, pp. 91–96. IEEE Computer Society Press, Los Alamitos (2004)
11. Ryu, W., Sung, J., Kim, D.: Asian Head Movement Video Under Rapidly Changing Pose, Head Movement and Illumination AHM01, Technical Report. Intelligent Multimedia Lab, Dept. of CSE, POSTECH (2006)
Multiple Faces Tracking Using Motion Prediction and IPCA in Particle Filters

Sukwon Choi and Daijin Kim

Intelligent Multimedia Laboratory, Dept. of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea
{capriso,dkim}@postech.ac.kr
Abstract. We propose an efficient real-time face tracking system that can track a fast-moving face and cope with illumination changes. To achieve these goals, we use the active appearance model (AAM) to represent the face image, due to its simplicity and flexibility, and adopt the particle filter framework to track the face, due to its robustness. We modify the particle filter framework as follows. To track a fast-moving face, we predict the motion using motion history and motion estimation, so that the required number of particles can be reduced. For the observation model, we use the AAM to obtain an accurate face region, and update the model using incremental principal component analysis (IPCA). The occlusion handling scheme incorporates the motion history to handle a moving face under occlusion. We have also expanded our approach to a multiple-face tracking system. Experimental results demonstrate the robustness and effectiveness of the proposed system.
1 Introduction

Many research groups in the computer vision community have been interested in topics related to the human face, such as face detection, face recognition, and face tracking. Among them, face tracking is an important task that can be applied to several applications such as human-computer interaction, facial expression recognition, and robotics. There are several works related to face tracking. Bayesian filters such as the Kalman filter [1] and the particle filter [2], [3], [4] are the most popular techniques for face tracking. The Kalman filter assumes that the state transition model obeys a Gaussian function, whereas the particle filter allows an arbitrary model. The particle filter shows better performance on face tracking because the dynamics of the moving face obeys a non-linear/non-Gaussian function. The particle filter consists of an observation model and a state transition model. The observation model has to measure the likelihood of each particle. For this purpose, we need an appearance model. There are several studies of face modeling algorithms. Turk et al. [5] use PCA for eigenface analysis. Jepson et al. [6] introduce an online learning algorithm, namely the online appearance model (OAM), which assumes that each pixel can be explained by mixture components, and Zhou et al. [7] modify it. However, these approaches cannot cope
with facial variations such as illumination or expression. Cootes et al. [8] introduce the AAM to represent the face using an eigenface approach. The drawback of the AAM is that not all variations can be covered during the training phase. Hamlaoui et al. [9] extend Zhou's work and use the AAM as their appearance model, but there is no scheme to update the observation model. The state transition model describes the dynamics of the moving objects between two frames. There are two ways to approximate the motion model: using a motion model learned from video examples, or using a fixed constant-velocity model. However, these approaches do not work well when the objects move very fast. The image registration technique introduced by Lucas and Kanade [10] is another approach that can be used for object tracking; however, it is a gradient-based approach, which often gets trapped in local minima. Zhou et al. [7] use an adaptive velocity model to track objects effectively. However, if the particles are insufficient or the variance is relatively small, a fast-moving face cannot be tracked. We propose to use IPCA [11], [12], [13] and a motion prediction model to track fast-moving faces under illumination change. This tracking system is expanded to multiple-face tracking for a real-time system. This paper is organized as follows. We review the particle filter in Section 2. In Section 3 and Section 4, we explain the observation model and the state transition model. We present how to handle occlusion while the face is moving in Section 5. Section 6 shows how we expand our system to multiple-face tracking. The experimental results are presented in Section 7 and the conclusions are drawn in Section 8.
2 Particle Filter

The particle filter estimates the states {θ_1, ..., θ_t} recursively using a sampling technique. To estimate the states, the particle filter approximates the posterior distribution p(θ_t | Y_{1:t}) with a set of samples {θ_t^{(1)}, ..., θ_t^{(P)}} and a noisy observation sequence {Y_1, ..., Y_t}. The particle filter consists of two components, the observation model and the state transition model, which can be defined as

\text{State Transition Model:} \quad \theta_t = F_t(\theta_{t-1}, U_t), \qquad \text{Observation Model:} \quad Y_t = H_t(\theta_t, V_t).   (1)
The transition function F_t approximates the dynamics of the object being tracked using the previous state θ_{t-1} and the system noise U_t. The measurement function H_t models the relationship among the noisy observation Y_t, the hidden state θ_t, and the observation noise V_t. We can characterize the transition probability p(θ_t | θ_{t-1}) with the state transition model, and the likelihood p(Y_t | θ_t) with the observation model. We use the maximum a posteriori (MAP) estimate to obtain the state estimate θ̂_t.
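A generic sketch of this recursion is given below: each particle is propagated through the state transition model, weighted by the observation likelihood, and the best-scoring particle is taken as the MAP estimate. The transition and likelihood callables are placeholders to be supplied by the tracker; names are ours.

```python
import numpy as np

def particle_filter_step(particles, transition, likelihood, observation, rng):
    """One time step of a particle filter with a MAP state estimate.

    particles:  (P, d) array of states from the previous time step.
    transition: callable(state, rng) -> new state   (theta_t = F_t(theta_{t-1}, U_t))
    likelihood: callable(observation, state) -> p(Y_t | theta_t)
    """
    moved = np.array([transition(s, rng) for s in particles])
    weights = np.array([likelihood(observation, s) for s in moved])
    map_state = moved[np.argmax(weights)]                  # MAP estimate
    weights = weights / weights.sum()
    idx = rng.choice(len(moved), size=len(moved), p=weights)
    return moved[idx], map_state                           # resampled set, estimate
```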
3 Observation Model
The observation model in the particle filter finds the relationship between the observed data and the state. We use the AAM to represent the face image due to its simplicity and flexibility. We design the observation likelihood using the warped image and the reconstructed image of the AAM. To make the observation model cope with illumination change, we use IPCA to update the AAM basis vectors.

3.1 AAM-Based Observation Model
We use the AAM for our observation model. Let us define the reconstructed image at time t as Î_t, and the warped image at time t as Ĩ_t. The AAM tries to minimize the AAM error, which is the distance between the reconstructed image and the warped image. Thus, we approximate the AAM error using the Mahalanobis distance between Î_{t-1} and Ĩ_t. To prevent the observation likelihood of a good particle from being spoiled by a few outliers, we use robust statistics [14] to decrease the weight of outliers. The observation likelihood is

p(I_t \mid \theta_t) \propto \exp\!\Big( - \sum_{l=1}^{N} \rho\Big( \frac{\tilde{I}_t(l) - \hat{I}_{t-1}(l)}{\sigma_R(l)} \Big) \Big),   (2)

where

\rho(x) = \begin{cases} \frac{1}{2} x^2, & \text{if } |x| < \xi \\ \xi |x| - \frac{1}{2}\xi^2, & \text{if } |x| \geq \xi \end{cases}   (3)
where N is the number of pixels in the template image, l is the pixel index, σ_R is the standard deviation of the reconstructed image, and ξ is a threshold that determines whether a pixel is an outlier or not. We select the best particle, i.e., the one with the largest observation likelihood, as the MAP estimate.
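A minimal sketch of (2)-(3) is given below, applying the robust function to the normalized residuals of one particle; the default ξ follows the experimental setting in Section 7, and the names are ours.

```python
import numpy as np

def rho(x, xi):
    """Robust function of Eq. (3): quadratic inside [-xi, xi], linear outside."""
    a = np.abs(x)
    return np.where(a < xi, 0.5 * a**2, xi * a - 0.5 * xi**2)

def observation_likelihood(I_warp, I_recon, sigma_R, xi=1.7):
    """Unnormalized p(I_t | theta_t) of Eq. (2) for one particle.

    I_warp:  warped input pixels (I~_t)
    I_recon: reconstructed pixels (I^_{t-1})
    sigma_R: per-pixel standard deviation sigma_R(l)
    """
    r = (I_warp - I_recon) / sigma_R
    return float(np.exp(-np.sum(rho(r, xi))))
```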
3.2 Update Observation Model
After estimating the state, we perform AAM fitting to get an accurate face region, and then we obtain the 2D global pose parameter θ̂_t. Next, we need to update the observation model to cope with illumination change. Before updating the observation model, we check the goodness of the fit to decide whether the update should be made. The AAM error is not appropriate for measuring the goodness of fit, because an ill-fitted result and an illumination change both increase the AAM error. For this purpose, we use the number of outliers obtained from the OAM. A pixel is declared an outlier if its normalized value, normalized by the mean and the variance of the OAM component, is larger than a certain threshold ξ. We compute the number of outliers for each component and take the average number of outliers, N_outlier. If N_outlier is smaller than a certain threshold N_0, we update the AAM basis vectors using IPCA with the new input image Ĩ_t.
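A sketch of this update rule follows: outliers are counted against the online appearance model and, only when their fraction is small enough, the new warped image is fed to an incremental PCA. scikit-learn's IncrementalPCA is used here as one possible IPCA implementation (the paper's IPCA of [11]-[13] is not reproduced); the thresholds follow the experimental settings of Section 7.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

N_BASIS = 5                                  # appearance basis vectors (Section 7)
ipca = IncrementalPCA(n_components=N_BASIS)
pending = []                                 # partial_fit needs >= N_BASIS samples

def maybe_update_basis(I_warp, oam_mean, oam_std, xi=1.7, n0_ratio=0.18):
    """Feed the warped image to the incremental PCA only when the fit looks good."""
    z = np.abs(I_warp - oam_mean) / oam_std  # normalized pixel values (OAM)
    if np.mean(z > xi) >= n0_ratio:          # too many outliers: skip the update
        return False
    pending.append(I_warp.ravel())
    if len(pending) >= N_BASIS:              # enough samples for one IPCA batch
        ipca.partial_fit(np.asarray(pending))
        pending.clear()
    return True
```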
4 State Transition Model
The state transition model describes the dynamics of the moving object. In our system, we use motion history and motion estimation to predict the location of the face with low computation.

4.1 Motion Prediction Model Using Motion History
We implemented the tracking system of Zhou et al. [7] and evaluated its tracking performance and operating speed. Fig. 1-(a) shows the tracking performance when 20 particles are used to track a fast-moving face, where the vertical axis denotes the pixel displacement between two consecutive frames and the circles denote cases of failed tracking. We can see that their tracking algorithm often fails when the pixel displacement between two consecutive frames is greater than 20. Fig. 1-(b) shows the average operating speed as the number of particles is changed. The number of particles should be less than 30 to guarantee 15 frames per second. To realize real-time face tracking, we propose an adaptive state transition model using a motion prediction model based on the motion history.
Fig. 1. Tracking experimentation results of the Zhou et al. [7] algorithm: (a) tracking performance, (b) operating speed (frames per second vs. number of particles)
We assume that the dynamics of the moving face obeys a constant acceleration model. Then the velocity v_{t-1} and the acceleration a_{t-1} at time t - 1 are obtained as

v_{t-1} = \hat{\theta}_{t-1} - \hat{\theta}_{t-2},   (4)
a_{t-1} = v_{t-1} - v_{t-2},   (5)

where θ̂_{t-1} is the motion state, i.e., the global 2D pose parameter of the AAM-fitted image at time t - 1. The motion velocity at the current time is predicted from the motion velocity and the acceleration at the previous time. Then the velocity of the moving face at time t and the effective velocity v̄ between the two consecutive frames can be obtained as

\tilde{v}_t = v_{t-1} + a_{t-1},   (6)
\bar{v}_t = \frac{v_{t-1} + \tilde{v}_t}{2}.   (7)
In a real situation, the actual motion of the moving face does not obey the constant velocity model, so the effective velocity obtained from the motion history model may be overestimated or underestimated. To overcome this problem, we suggest using a motion prediction model that performs the motion estimation technique [15] around the estimated motion state θ̄_t = θ̂_{t-1} + v̄_t.
θt
(p)
˜t + U , =θ t
(9)
where Ut is the system noise that follows Gaussian distribution N (0, σt2 ). Optionally, we can include the adaptive velocity model into our state transition model. Since we can obtain the predicted motion state of the face using motion estimation technique, only a small number of particles is required. So, we can apply the adaptive velocity model without suffering from the heavy computational load. The particles are generated as: (p)
θt
(p)
˜ t + VLS {It ; θ ˜t} + U , =θ t
(10)
˜ t } is a function that computes the adaptive velocity at the startwhere VLS {It ; θ ˜ t in the input image It . ing point θ 4.2
Noise Variance and Number of Particles
We adaptively changes the noise variance and the number of particles. They are ˆt. proportional to the Mahalanobis distance, Dt , between Iˆt and I˜t at θ σt = σ0 ×
Dt , D0
Pt = P0 ×
Dt . D0
(11)
We restrict the range of standard deviation as [σmin , σmax ] and the number of particles as [Pmin , Pmax ] to ensure certain degree of frame rate and performance.
Multiple Faces Tracking Using Motion Prediction
5
983
Occlusion Handling
Our system detects occlusion using the number of outliers. If the face is occluded, the number of outliers Noutlier is larger than certain threshold, N0 , because the occlusion makes large number of outliers. When the occlusion is declared, we will get a bad tracking result if we use adaptive velocity model or face detector. Hence, we stop these methods and use only motion history with maximizing the number of particles and the variance. We can generate a new set of particles based on the Eq. (12). (p) ¯ t + U (p) . θt = θ (12) t
6 Multiple Faces Tracking
We have presented the single face tracking algorithm so far, and have extended it to multiple faces tracking. To check for a new incoming face, we periodically invoke the face detector based on Adaboost [16] learning. If we find a new face, we detect the eyes in the face region. With the geometric information of the eyes' positions, we initialize the parameters of the AAM and perform AAM fitting. The result of the AAM fitting is used to initialize a new tracker. The overall procedure of the multiple faces tracking algorithm is not very different from the single tracking case: for each tracker, we use the motion prediction model, perform AAM fitting on the MAP estimate, and then update the observation model if possible.
7 Experimental Results
We implemented the proposed system in a Windows C++ environment. Our system processes at least 15 frames per second for single face tracking on a Pentium 4 CPU at 3.0 GHz with 2 GB RAM and a Logitech web camera. All the experiments in this section use the following parameters. We captured the test sequences of size 320 × 240 at 15 fps. The size of the AAM template image is 45 × 47, and we used 5 shape and 5 appearance basis vectors. We set ξ to 1.7 to declare an outlier, and N_0/N to 0.18 to declare occlusion. We performed 3 experiments to demonstrate the robustness of the proposed face tracking system. In the first experiment, we show that the proposed system is robust and effective in terms of the moving speed of faces, the required number of samples, and the tracking performance. The second experiment shows the performance of the tracking system under occlusions. In the last experiment, we show that our system can also handle the multiple faces tracking problem.
7.1 Tracking Fast Moving Face
We captured test sequences with high/medium/low speeds of the moving face. We compared the face tracking systems that use the motion prediction model and the adaptive velocity model.
Fig. 2. The pixel displacement of the face in test sequences with high/medium/low speed of moving face
We obtained the pixel displacement of the face for each sequence as in Fig. 2. Tracking is performed using 20 particles. The horizontal axis denotes the frame number, and the vertical axis denotes the pixel displacement of the face. To measure the tracking accuracy, we compute the error between the estimated location and the ground truth of the face as e_g = (q_x − g_x)² + (q_y − g_y)², where q_x, q_y are the horizontal/vertical translations of the estimated location of the face, and g_x, g_y are the translations of the ground truth. Fig. 3 compares the graphs of e_g for each velocity model. As can be seen, e_g of the motion prediction model shows lower values when the face moves fast.
Fig. 3. Tracking accuracy under the various speeds of moving face: (a) high speed, (b) medium speed, (c) low speed. Each panel plots the error against the frame number for the adaptive velocity and motion prediction models.
Fig. 4 shows the result of the face tracking using the test sequence with a high speed of moving face. The motion prediction model successfully tracks the face, while the adaptive velocity model fails to track the face. To compare the required number of particles, we performed another tracking experiment using a range of [10, 60] for the number of particles and 20 particles at initialization. Fig. 5 shows the required number of particles and the AAM error of the tracking systems with the two velocity models. As shown in Fig. 5, we need fewer particles with the motion prediction model due to its better tracking performance, so we can save computation.
7.2 Occlusion Handling
We present the performance of our occlusion handling method. As shown in Fig. 6, the moving face is heavily occluded. On the upper right corner of each
Fig. 4. 1st row: original test sequence, 2nd row: tracking result of the adaptive velocity model, 3rd row: tracking result of the motion prediction model, columns 1-5: frames 155, 158, 161, 164, 167

Fig. 5. The required number of particles (a) and the AAM error (b) of the tracking systems with the two velocity models
Fig. 6. Occlusion handling, 1st row : frame 43, 56, 65, 2nd row : frame 277, 281, 285, 3rd row : frame 461, 463, 465
Fig. 7. Tracking results, 1st row : frame 190, 200, 210, 2nd row : frame 230, 330, 350
image, we can see a small template image that displays the outliers as white pixels. Our occlusion detection scheme detected outliers and occlusion successfully.
7.3 Multiple Faces Tracking
In this section, we present our multiple faces tracking system. Fig. 7 shows that our system is extended to a multiple faces tracking system. We can see that the scale, rotation, and translation of the three faces are changing, but our system shows good tracking performance.
8 Conclusion
We presented a face tracking system that can track a fast moving face. The proposed system uses IPCA and the motion prediction model to make the observation model and the state transition model adaptive. With these approaches, the face is successfully tracked using a few particles and a small variance. We have found that these approaches generate particles more efficiently, save unnecessary computational load, and improve tracking performance. In the case of occlusion, we include only the motion history to approximate the dynamics of the occluded face. Our system has been implemented and tested in a real-time environment. The experimental results of the proposed face tracking system show its robustness and effectiveness in terms of the moving speed of faces, the required number of samples, the tracking performance, and occlusion handling. Also, we have applied our system to multiple faces and found that it handles them properly.
Acknowledgement This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University. Also, it was partially supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea.
References 1. Azarbayejani, A., Pentland, A.: Recursive estimation of motion, structure and focal length. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 562– 575 (1995) 2. Doucet, A., Godsill, S.J., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3), 197–209 (2000) 3. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–189 (2002)
4. Okuma, K., Taleghani, A., de Freitas, N., Little, J., Lowe, D.: A boosted particle filter: Multitarget detection and tracking. In: Proceddings of European Conference on Computer Vision, pp. 28–39 (2004) 5. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 72–86 (1991) 6. Jepson, A.D., Fleet, D.J., El-Maraghi, T.: Robust online appearance model for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1296–1311 (2003) 7. Zhou, S., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing 13(11), 1491–1506 (2004) 8. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Proceedings of 5th European Conference on Computer Vision, pp. 484–498 (1998) 9. Hamlaoui, S., Davoine, F.: Facial action tracking using particle filters and active appearance models. In: Joint sOc-EUSAI conference, pp. 165–169 (2005) 10. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 11. Hall, P., Marshall, D., Martin, R.: Incremental eigenanalysis for classification. In: Proceedings of British Machine Vision Conference, pp. 286–295 (1998) 12. Artac, M., Jogan, M., Leonardis, A.: Incremental PCA for on-line visual learning and recognition. In: International Conference on Pattern Recognition, pp. 781–784 (2002) 13. Ross, D.A., Lim, J., Yang, M.-H.: Adaptive probabilistic visual tracking with incremental subspace update. In: Proceedings of 8th European Conference on Computer Vision, vol. 2, pp. 470–482 (2004) 14. Huber, P.J.: Robust statistics. John Wiley, Chichester (1982) 15. Bhaskaran, V., Konstantinides, K.: Image and video compression standards. Kluwer Academic Publishers, Dordrecht (1997) 16. Froba, B., Ernst, A.: Face detection with the modified census transform. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition. IEEE Computer Society Press, Los Alamitos (2004)
An Improved Iris Recognition System Using Feature Extraction Based on Wavelet Maxima Moment Invariants Makram Nabti and Ahmed Bouridane Institute for Electronics, Communications and Information Technology (ECIT), School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Northern Ireland, UK, BT7 1NN. {mnabti01,a.bouridane}@qub.ac.uk
Abstract. Human recognition technology based on biometrics has received increasing attention over the past decade. Iris recognition is considered to be the most reliable biometric authentication system and is becoming the most promising technique for high security. In this paper, we propose a multiscale approach for iris localization by using wavelet modulus maxima for edge detection, a fast and a compact method for iris feature extraction based on wavelet maxima components and moment invariants. The features are represented as feature vector, thus allowing us to also propose a fast matching scheme based on exclusive OR operation. Experimental results have shown that the performance of the proposed method is very encouraging and comparable to the well known methods used for iris texture analysis. Keywords: biometrics, iris recognition, multiscale edge detection, wavelet maxima, moment invariants.
1 Introduction

Consistent automatic recognition of individuals has long been an important goal, and it has taken on new importance in recent years. The use of biometric signatures, instead of tokens such as identification cards or computer passwords, continues to gain increasing attention as a means of identification and verification of individuals for controlling access to secured areas, materials, or systems, because an individual's biometric data is unique and cannot be transferred. Biometrics refers to automated methods of identifying a person or verifying the identity of a person based on a physiological or behavioral characteristic. Examples of physiological characteristics include hand and finger images, facial characteristics, and iris recognition. Signature verification and speaker verification are examples of behavioral characteristics [2,3]. Biometrics has the potential for high reliability because it is based on the measurement of an intrinsic physical property of an individual. The iris is an overt body that is available for remote (i.e., noninvasive) assessment. The variability of features of any one iris is well enough constrained to make possible a fully automated recognition and verification system based upon machine vision, and even identical twins have distinct iris features [2].
The iris is an annular area between the pupil and the white sclera in the eye; it has a rich texture based on interlacing features, called the texture of the iris. This texture is well known to provide a signature that is unique to each subject. Compared with the other biometric signatures mentioned above, the iris is generally considered more stable and reliable for identification [3]. The authentication system based on iris recognition is reputed to be the most accurate among all biometric methods because of its acceptance, reliability and accuracy. Ophthalmologists originally proposed that the iris of the eye might be used as a kind of optical fingerprint for personal identification [1]. Their proposal was based on clinical results showing that every iris is unique and that it remains unchanged in clinical photographs. The human iris begins to form during the third month of gestation. The structure is complete by the eighth month of gestation, but pigmentation continues into the first year after birth. It has been discovered that every iris is unique, that no two people, not even identical twins, have correlated iris patterns [3], and that the iris is stable throughout human life. It has been suggested in recent years that human irises might be as distinct as fingerprints for different individuals, leading to the idea that iris patterns may contain unique identification features. A number of groups have explored iris recognition algorithms, and some systems have already been implemented and put into commercial practice by companies such as Iridian Technologies, whose system is based on the use of Daugman's algorithm.

1.1 Related Work

Most works on personal identification and verification using iris patterns were done in the 1990s. Daugman [1] developed a feature extraction process based on information from a set of 2-D Gabor filters. He generated a 256-byte code by quantizing the local phase angle according to the outputs of the real and imaginary parts of the filtered image. The Wildes system made use of a Laplacian pyramid constructed with four different resolution levels to generate the iris code [4]. It also exploited a normalized correlation based on goodness-of-match values and Fisher's linear discriminant for pattern matching. Boles [5] implemented a system operating on a set of 1-D signals composed of normalized iris signatures at a few intermediate resolution levels, obtaining the iris representation of these signals via the zero-crossings of the dyadic wavelet transform. Tan [6] generates a bank of 1D intensity signals from the iris image and filters these 1D signals with a special class of wavelet. The positions of local sharp variations are recorded as the features.

1.2 Outline

In this paper, we first present a multiscale approach for edge detection based on wavelet maxima, which can provide significant edges: noise disappears as the scale increases (up to a certain level), and fewer texture points produce local maxima, enabling us to find the real geometrical edges of the image and thereby yielding an efficient detection of the significant circles for the inner and outer iris boundaries and the eyelids. A new approach is proposed for feature extraction to make the feature vector compact and efficient by using wavelet maxima components and the moment invariants technique; the resulting feature vector is invariant to
translation, rotation, and scale changes. A fast matching scheme based on the exclusive OR operation is also presented. The remainder of this paper is organized as follows. Section 2 describes the iris localization step using our proposed multiscale edge detection approach. The mapping and normalization steps are presented in Section 3. Feature extraction and matching are given in Section 4 and Section 5, respectively. Experimental results and discussions are reported in Section 6. Section 7 concludes this paper.
2 Iris Localization

Image acquisition captures the iris as part of a larger image that also contains data derived from the immediately surrounding eye region. Therefore, prior to performing iris pattern matching, it is important to localize that portion of the acquired image that corresponds to the iris. Figure 1 depicts the portion of the image derived from inside the limbus (the border between the sclera and the iris) and outside the pupil (the iris image is from the CASIA iris database).
Fig. 1. Eye image
If the eyelids are occluding part of the iris, then only the portion of the image below the upper eyelid and above the lower eyelid should be included. The eyelid boundary can also be irregular due to the presence of eyelashes. From these observations, it can be said that in iris segmentation problems a wide range of edge contrasts must be taken into consideration, and iris segmentation must be robust and effective.

2.1 Multiscale Edge Detection

In our proposed method [10], a multiscale edge detection is used to extract the points of sharp variation (edges) via the modulus maxima, where the local maxima are detected so as to produce only single-pixel edges. The resolution of an image is directly related to the appropriate scale for edge detection. A high resolution and a small scale will result in noisy and discontinuous edges; a low resolution and a large scale will result in undetected edges. The scale controls the significance of the edges to be shown. Edges of higher significance are more likely to be preserved by the wavelet transform across the scales. Edges of lower significance are more likely to disappear when the scale increases.
Mallat and Hwang [7, 8] proved that the maxima of the wavelet transform modulus can detect the location of irregular structures. The wavelet transform characterizes the local regularity of signals by decomposing signals into elementary building blocks that are well localized both in space and frequency. This not only explains the underlying mechanism of classical edge detectors, but also indicates a way of constructing optimal edge detectors under specific working conditions.

2.2 Proposed Method

Assume f(x, y) is a given image of size M × N. At each scale j with j > 0 and S_0 f = f(x, y), the wavelet transform decomposes S_{j−1} f into three wavelet bands: a lowpass band S_j f, a horizontal highpass band W_j^H f and a vertical highpass band W_j^V f. The three wavelet bands (S_j f, W_j^H f, W_j^V f) at scale j are of size M × N, which is the same as the original image, and all filters used at scale j (j > 0) are upsampled by a factor of 2^j compared with those at scale zero. In addition, the smoothing function used in the construction of a wavelet reduces the effect of noise. Thus, the smoothing step and the edge detection step are combined to achieve the optimal result. At each level of wavelet decomposition the modulus M_j f of the gradients can be computed by:

M_j f = sqrt( |W_j^H f|² + |W_j^V f|² )    (1)

and the associated phase A_j f is obtained by:

A_j f = tan⁻¹( W_j^V f / W_j^H f )    (2)
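As an illustration of Eqs. (1) and (2), the modulus and phase maps at one scale can be computed from the two highpass bands as below. This is a generic NumPy sketch, not the authors' implementation; the thresholding helper and its parameter are our assumptions.

```python
import numpy as np

def modulus_and_phase(w_h, w_v):
    """Gradient modulus (Eq. 1) and phase (Eq. 2) from the horizontal and
    vertical wavelet bands at one scale (both arrays of shape M x N)."""
    modulus = np.sqrt(w_h ** 2 + w_v ** 2)
    phase = np.arctan2(w_v, w_h)   # quadrant-aware version of tan^-1(W_V / W_H)
    return modulus, phase

def edge_candidates(modulus, threshold):
    """Keep only points whose modulus exceeds a threshold; the local maxima along
    the gradient direction would then be retained as multiscale edge points."""
    return modulus > threshold
```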
A Hough transform is then used to localize the iris and pupil circles. The eyelids are isolated using the horizontal multiscale edges (figure 2-b) with a linear Hough transform, while the eyelashes are isolated using a thresholding technique (figure 3).
(a)
(b)
Fig. 2. Edge detection, (a) pupil edge detection, (b) Edges for eyelids detection
Fig. 3. Iris localization, Black regions denote detected eyelid and eyelash regions
3 Iris Normalization

After determining the limits of the iris in the previous phase, the iris should be isolated and stored in a separate image. The dimensional variations between eye images are mainly due to the stretching of the iris caused by pupil dilation from varying levels of illumination, image capture distances, head incline, and other factors. For these reasons it is necessary to normalize the iris region; for this purpose all the points within the boundary of the iris are remapped (figure 4) from Cartesian coordinates to polar coordinates (r, θ) as:

I(x(r, θ), y(r, θ)) → I(r, θ)    (3)
where r is on the interval [0, 1] and θ is the angle over [0, 2π]. In this model a number of data points are selected along each radial line, and this is defined as the radial resolution. The number of radial lines going around the iris region is defined as the angular resolution, as in (figure 4-a). In the new coordinate system, the iris can be represented over a fixed parameter interval (figure 4-b).
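The remapping of Eq. (3) can be sketched as follows, assuming the pupil and iris boundaries have already been found as circles by the localization step. The radial and angular resolutions match the values used in Fig. 4; the nearest-pixel sampling (instead of interpolation) is a simplification of ours.

```python
import numpy as np

def normalize_iris(image, pupil, iris, radial_res=15, angular_res=60):
    """Unwrap the iris ring into a fixed-size polar image I(r, theta), Eq. (3).

    pupil, iris: (cx, cy, radius) circles from the localization step.
    """
    out = np.zeros((radial_res, angular_res), dtype=image.dtype)
    thetas = np.linspace(0.0, 2.0 * np.pi, angular_res, endpoint=False)
    rs = np.linspace(0.0, 1.0, radial_res)
    for i, r in enumerate(rs):
        for j, t in enumerate(thetas):
            # interpolate between the pupil and iris boundary along each radial line
            x = (1 - r) * (pupil[0] + pupil[2] * np.cos(t)) + r * (iris[0] + iris[2] * np.cos(t))
            y = (1 - r) * (pupil[1] + pupil[2] * np.sin(t)) + r * (iris[1] + iris[2] * np.sin(t))
            out[i, j] = image[int(round(y)), int(round(x))]
    return out
```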
(a)
(b)
Fig. 4. Normalized iris, (a) Normalized iris portion with radial resolution of 15 pixels, and angular resolution of 60 pixels, (b) Iris normalized into polar coordinates
4 Feature Extraction

Feature extraction is the crucial step in an iris recognition system; therefore, what kind of features should be extracted from images? It is clear that the extracted features should meet at least the following requirements: they should be significant, compact, and fast to compute. For these reasons, and to achieve a compact and efficient feature vector, wavelet maxima components and moment invariants techniques are used.
4.1 Wavelet Maxima Components for Feature Extraction

Wavelet decomposition provides a very elegant approximation of images and a natural setting for multi-level analysis. Since wavelet transform maxima provide useful information about texture and edge analysis [7], we propose to use this technique for fast feature extraction by using the wavelet components. Wavelet maxima have been shown to work well in detecting edges, which are likely the key features in a query; moreover, this method provides useful information about texture features by using the horizontal and vertical details.

4.2 Proposed Method

As described in [7], to obtain the wavelet decomposition a pair of discrete filters H, G has been used as follows:

Table 1. Response of filters H, G
H: 0, 0, 0.125, 0.375, 0.375, 0.125, 0, 0
G: 0, 0, -2, 2, 0, 0
At each scale s, the algorithm decomposes the normalized iris image I(x, y) into I(x, y, s), W^v(x, y, s) and W^h(x, y, s), as shown in figures (5, 6):
- I(x, y, s): the image smoothed at scale s.
- W^h(x, y, s) and W^v(x, y, s) can be viewed as the two components of the gradient vector of the analyzed image I(x, y) in the horizontal and vertical direction, respectively.

At each scale s (s = 0 to s = S − 1, where S is the number of scales of decomposition), the image I(x, y) is smoothed by a lowpass filter:

I(x, y, s+1) = I(x, y, s) * (H_s, H_s)    (4)

The horizontal and vertical details are obtained respectively by:

W^h(x, y, s) = (1/λ_s) · I(x, y, s) * (G_s, D)    (5)
W^v(x, y, s) = (1/λ_s) · I(x, y, s) * (D, G_s)    (6)

- We denote by D the Dirac filter whose impulse response is equal to 1 at 0 and 0 otherwise.
- We denote by A * (H, L) the separable convolution of the rows and columns, respectively, of image A with the 1-D filters H and L.
- G_s, H_s are the discrete filters obtained by appending 2^s − 1 zeros between consecutive coefficients of H and G.
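One decomposition level following Eqs. (4)-(6) can be sketched as below using separable convolution with the upsampled filters. This is an illustrative sketch under our assumptions: only the nonzero taps of H and G from Table 1 are used, the normalization constant λ_s is passed in by the caller, and the boundary handling is SciPy's default rather than the paper's.

```python
import numpy as np
from scipy.ndimage import convolve1d

H = np.array([0.125, 0.375, 0.375, 0.125])   # lowpass filter (nonzero taps of Table 1)
G = np.array([-2.0, 2.0])                     # highpass filter (nonzero taps of Table 1)

def upsample(filt, s):
    """Insert 2**s - 1 zeros between consecutive coefficients (a trous scheme)."""
    if s == 0:
        return filt
    up = np.zeros((len(filt) - 1) * (2 ** s) + 1)
    up[::2 ** s] = filt
    return up

def decompose_level(img, s, lam=1.0):
    """One level of the wavelet maxima decomposition (Eqs. 4-6)."""
    h_s, g_s = upsample(H, s), upsample(G, s)
    smoothed = convolve1d(convolve1d(img, h_s, axis=1), h_s, axis=0)   # Eq. (4)
    w_h = convolve1d(img, g_s, axis=1) / lam    # Eq. (5): G_s along rows, Dirac along columns
    w_v = convolve1d(img, g_s, axis=0) / lam    # Eq. (6): Dirac along rows, G_s along columns
    return smoothed, w_h, w_v
```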
Fig. 5. Wavelet maxima vertical components at scale 2 with intensities along a specified column

Fig. 6. Wavelet maxima horizontal components at scale 2 with intensities along a specified column
- As explained in [7], due to discretization the wavelet modulus maxima of a step edge do not have the same amplitude at all scales as they should in a continuous model. The constants λ_s compensate for this discrete effect.

4.3 Moment Invariants for Feature Vector Representation

The theory of moments provides an interesting series expansion for representing objects. It is also suitable for mapping the wavelet maxima to vectors so that their similarity distance can be measured in a simple way [9]. Certain functions of moments are invariant to geometric transformations such as translation, scaling, and rotation. Such features are useful in the identification of objects with unique signatures regardless of their location, size, and orientation [9]. A set of seven 2-D moment invariants that are insensitive to rotation, translation and scaling [9] has been computed for each horizontal and vertical wavelet maxima component from scale 1 to scale 5. Therefore, ten wavelet maxima components (i.e., H1, V1, H2, V2, H3, V3, H4, V4, H5, V5) are obtained, thus making a feature vector of size 70 (7×10) for every iris image.
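The 70-dimensional feature vector described above can be sketched as follows, using OpenCV's Hu moments as a stand-in for the seven 2-D moment invariants; the binarization scheme shown before matching is our assumption, not the paper's encoding.

```python
import numpy as np
import cv2

def wavelet_maxima_moments(components):
    """Compute 7 Hu moment invariants for each of the ten wavelet maxima
    components (H1, V1, ..., H5, V5), giving a 70-element feature vector."""
    feats = []
    for comp in components:                       # expects 10 arrays
        m = cv2.moments(np.abs(comp).astype(np.float64))
        feats.extend(cv2.HuMoments(m).flatten())  # 7 invariants per component
    return np.array(feats)                        # shape (70,)

def binarize(features):
    """Encode the feature vector as a binary code (one possible scheme:
    thresholded log-scaled moments) so that XOR-based matching can be applied."""
    logs = -np.sign(features) * np.log10(np.abs(features) + 1e-30)
    return (logs > np.median(logs)).astype(np.uint8)
```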
5 Matching

It is important to present the obtained vector as a binary code because it is easier to determine the difference between two binary code-words than between two numerical vectors. In fact, Boolean vectors are always easier to compare and to manipulate. We have applied the Hamming Distance matching algorithm for the matching of two samples. It is basically an Exclusive OR (XOR) function between two bit patterns. The Hamming Distance is a measure that delineates the differences between iris codes. Every
bit of the presented iris code is compared to every bit of the referenced iris code, so that if the two bits are the same, e.g. two 1's or two 0's, the system assigns a value '0' to that comparison, and if the two bits are different, the system assigns a value '1' to that comparison. The equation for iris matching is as follows:

HD = (1/N) Σ_i P_i ⊕ R_i    (7)
where N is the dimension of the feature vector, P_i is the i-th component of the presented feature vector and R_i is the i-th component of the referenced feature vector. The Match Ratio between two iris templates is computed by:

Ratio = (T_z / T_b) × 100    (8)
where T_z is the total number of zeros calculated from the Hamming distance vector and T_b is the total number of bits in the iris template.
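Assuming both templates have already been encoded as binary vectors of equal length, Eqs. (7) and (8) amount to the following; the function names are ours.

```python
import numpy as np

def hamming_distance(presented, referenced):
    """Eq. (7): fraction of differing bits between the two binary templates."""
    xor = np.bitwise_xor(presented, referenced)
    return xor.sum() / presented.size

def match_ratio(presented, referenced):
    """Eq. (8): percentage of agreeing bits (zeros of the XOR vector)."""
    xor = np.bitwise_xor(presented, referenced)
    return 100.0 * (xor.size - xor.sum()) / xor.size
```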
6 Results and Analysis

The proposed algorithm has been assessed using the CASIA iris image database, which consists of 80 persons, 108 eye image sets and 756 eye images. From the results shown in Table 2, we can see that Daugman's method and the proposed method have the best performance, followed by the Li Ma and Tan method.

Table 2. Accuracy and speed comparison
Methods     Correct recognition rate (%)    Speed (ms)
Daugman     99.90                           310
Tan         99.23                           110
Proposed    99.50                           85
Daugman analyzed the iris texture by computing and quantizing the similarity between the quadrature wavelets and each local region, which requires that the size of the local region be small enough to achieve high accuracy. In [6], a special class of wavelet has been adopted to represent the local texture information of the iris. Our proposed method achieves higher accuracy based on a texture analysis method using the wavelet maxima modulus for edge detection to localize the iris region and wavelet maxima components for feature extraction. The result of this feature extraction is a set of moments (7 elements) at different wavelet maxima decomposition levels for each iris image; these features are called wavelet maxima moments in the proposed method. These features significantly represent iris images; they are compact and can be easily used to compute similarity distances. Finally, wavelet maxima moments are invariant to all affine transforms, namely translation, scaling and rotation. From the point of view of complexity, as shown in Table 2, our proposed method clearly outperforms the other two. In particular, it can be said that, although Daugman's method achieves a very slightly better recognition rate (+0.4%), our
proposed method is about 3.65 times faster. Compared against Tan's method, our proposed solution achieves a better recognition rate (+0.5%) and is about 1.3 times faster. It is worth noting that the speed results were obtained on a PC running a Windows OS with a 1.8 GHz clock speed. Implementations were carried out in MATLAB.
7 Conclusion

In this paper, we have proposed some optimized and robust methods for improving the accuracy of a human identification system based on iris patterns from a practical viewpoint. To achieve this, efficient methods based on a multiscale edge detection approach using the wavelet maxima modulus and on wavelet maxima moments have been presented. Our extracted features are invariant to all affine transforms, namely translation, scaling and rotation. Through various experiments, we show that the proposed methods can be used for personal identification systems in an efficient way.
References 1. Daugman, J.: High Confidence Visual Recognition of Persons by a Test of Statistical Independence. IEEE Trans. Pattern Analysis and Machine Intelligence 15(11), 1148–1161 (1993) 2. Jain, A., Maltoni, D., Maio, D., Wayman, J.: Biometric systems, Technology, Design and Performance Evaluation. Springer, London (2005) 3. Muron, A., pospisil, j.: The human iris structure and its usages. Physica 39, 87–95 (2000) 4. Wildes, R.: Iris Recognition: An Emerging Biometric Technology. Proc. IEEE 85, 1348– 1363 (1997) 5. Boashash, B., Boles, W.: A Human Identification Technique Using Images of the Iris and Wavelet Transform. IEEE Trans. Signal Processing 46(4), 1185–1188 (1998) 6. Ma, L., Tan, T., et al.: Efficient Iris Recognition by Characterizing Key Local Variations. IEEE Trans. on Image Processing 13, 739–750 (2004) 7. Mallat, S., Hwang, W.: Singularity Detection and Processing with Wavelets. IEEE Trans. Information Theory 38(2), 617–643 (1992) 8. Mallat, S.: A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, London (1998) 9. Jain, A.K.: Fundamentals of digital image processing. Prentice-Hall, Inc., Englewood Cliffs (1989) 10. Nabti, M., Ghouti, L., Bouridane, A.: An efficient iris segmentation technique based on a multiscale approach, Special issue on advances in biometrics. The Mediterranean Journal of Computers and Networks 2(4), 151–159 (2006)
Color-Based Iris Verification Emine Krichen, Mohamed Chenafa, Sonia Garcia-Salicetti, and Bernadette Dorizzi Institut National des Télécommunications, 9 Rue Charles Fourier 91160 Evry France {emine.krichen,sonia.salicetti,bernadette.dorizzi}@int-evry.fr
Abstract. In this paper we propose a novel iris recognition method for iris images acquired under normal light illumination. We exploit the color information as we compare the distributions of common colors between a reference image and a test image using a modified Hausdorff distance. Tests have been made on the UBIRIS public database and on the IRIS_INT database acquired by our team. Comparisons with two iris reference systems in controlled scenario show a significant improvement when using color information instead of texture information. On uncontrolled scenarios, we propose a quality measure on colors in order to select good images from bad ones in the comparison process. Keywords: Iris recognition; Hausdorff distance; quality measure, Gaussian Mixture Models.
1 Introduction

Iris acquisition is a challenging task. Indeed, a small and in most cases dark object has to be acquired from a relatively long distance (from 35 cm to 3 m). The iris is also a moving target, hidden in another moving object (the eye), almost covered by eyelids, and bounded nearly in the middle by a dark hole (the pupil) which can dilate and contract depending on the illumination intensity during the acquisition process. These dilations and contractions of the pupil induce changes in the iris texture which, unfortunately, are nonlinear. Nevertheless, the most challenging problem is that the iris is located behind the cornea, a highly reflective mirror, which makes it impossible to acquire irises in normal light without some constraints and extra tools. Figure 1 (left) shows an iris acquired under normal illumination conditions and without any particular constraint. Reflections do not permit any accurate processing in this case. One solution would be the use of a strong source of illumination on the eye (typically a flash). Figure 1 (middle image) shows the kind of images obtained using this technique with normal light illumination. Reflections are then mostly removed, except the ones resulting from the use of the flash itself. Despite the relatively good quality of this image, in general these alternatives are not sufficient for a performant texture analysis. In practice, near infrared illumination is considered to be the most appropriate for iris acquisition. Indeed, it allows tackling the problem of cornea reflection. Moreover, near infrared waves penetrate the cornea and allow getting very good quality iris images with all the richness of the iris texture, even with dark irises. Figure 1 (right
image) shows an iris image acquired under near infrared illumination. As can be seen the image quality is substantially higher than the one of the two irises previously presented. All commercial iris solutions are running today on images acquired under this kind of illumination [1]. However, there are applicative situations where we have at disposal a color camera with more or less high resolution. This is the case on PDA and smartphones. Our aim, in this article is to study whether it is possible to use these cameras and more especially the color information of the iris to produce a system sufficiently accurate to perform iris identity verification. Very few works report results on this type of images. In [2], we proposed a small modification of the standard iris techniques by using wavelet packets instead of the classic Gabor wavelet. We ran preliminary experiments on a private database using a 3 million pixels camera, indoor and with a flash (figure 1 (middle) is an image from such database). We also studied the contribution of color information to the wavelet method. We showed that the mean of the inter-class distribution using color information is higher than the one obtained with grey level images. In other publications [3] presenting experiments on normal light images, researchers just transform color irises into grey level ones and apply on this type of images the algorithms they developed previously for near infrared images. Our paper is organised as follows: section 2 describes all the modules composing our proposed algorithm, including iris segmentation, normalization and recognition. In section 3 we introduce the databases on which we have run our experiments namely IRIS_INT and UBIRIS. UBIRIS [4] is an iris database acquired under normal light illumination. It is a public database, which contains about 250 persons, some of them with 2 sessions. In the second session, irises are acquired under uncontrolled mode (outdoor, uncontrolled illumination, …). Experimental results will be given in section 4. Also, as benchmarking tools for our system, we consider two reference iris systems, Masek [6] (used by NIST) and OSIRIS, a new reference system developed by our team in the framework of BioSecure. We demonstrate that the use of color information can highly outperform methods only based on texture information. We also propose a global quality measure on color images in order to select test images during the matching process. Finally, conclusions are given and future works is discussed.
Fig. 1. Iris acquisition under normal light conditions without any restrictions (left image), normal light iris acquisition under a controlled scenario with a flash (middle image from IRIS_INT database) and near infrared iris acquisition (right image from CASIAv2 database)
2 Iris Recognition on Color Images

The iris rim is segmented from the original iris images using the Hough Transform method proposed and explained in [5] and developed in open source by Masek [6]. Once the iris and pupil circles have been detected, we normalize the iris rim using the rubber sheet model proposed by J. Daugman in [7]. The segmentation and normalization processes are shown in Figure 2.
Fig. 2. Iris segmentation (Top images): the iris and pupil contours are considered as circles and are detected using the Hough Transform. Iris normalization (Bottom images): the iris is normalized using the rubber sheet model.
Our iris recognition method is composed of several steps: first we need to compress the colors present in the reference image into a fixed number of colors. We have used minimum variance quantization in order to reduce the number of colors [9]. From each original image, we produce a compressed image with its corresponding color map (figure 3). Both pieces of information will be used in the next stages. We need to compress the number of colors of the images because at least 10,000 colors are present in an iris image, even in dark and poorly textured irises, which makes any color-based method very heavy to develop. Once the reference iris image is compressed, we process the test image using the color map obtained for the reference image. We assign each color in the map to the nearest color present in the test iris image using the 3-dimensional color cube (RGB space). Then, for each color in the map, we compare the distributions of the corresponding pixel positions in the reference and test images. This comparison is made using a modified Hausdorff Distance [8]. The Hausdorff Distance is an asymmetric distance which measures the maximum distance of each point of a set to the nearest point in the other set. Formula 1 shows the original expression of the Hausdorff Distance between a set X and a set Y, where x is an element of X, y of Y, and d denotes the Euclidean distance:

d_H(X, Y) = max{ sup_{x∈X} inf_{y∈Y} d(x, y), sup_{y∈Y} inf_{x∈X} d(x, y) }.    (1)
Fig. 3. Calculation of scores between Image 1 and Image 2 including: the compression of the two images; the creation of a set of pixels whose values are equal to each color and finally the computation of the Hausdorff Distance between set X and Y
This formulation of the Hausdorff Distance has a great limitation if we want to apply it to our problem, because it works at the pixel level; indeed, a single outlier pixel can dramatically modify the result of the Hausdorff Distance. For instance, this will occur with pixels that correspond to noise such as eyelids, spot light reflections or eyelashes. For these reasons, we have used a modified form of the Hausdorff Distance in which we take the minimum between the average distance from X to Y and the average distance from Y to X, where each average is computed over the distances from the elements of one set to the nearest element of the other set. This modified Hausdorff Distance is expressed in Formula 2:

d_Hmod(X, Y) = min[ (1/N_x) Σ_{x∈X} inf_y d(x, y),  (1/N_y) Σ_{y∈Y} inf_x d(y, x) ]    (2)
Finally, the score between two images is computed by averaging the modified Hausdorff distances obtained for each color, over all possible colors c of the map (see Formula 3). X_c and Y_c are the pixel positions corresponding to color c in the reference image I_ref and the test image I_test, respectively:

S(I_ref, I_test) = mean_c( d_Hmod(X_c, Y_c) )    (3)
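A compact sketch of the per-color matching of Formulas (2) and (3) is given below. It assumes the reference color map has already been computed (e.g., by minimum variance quantization as described above) and is given as an array of RGB triples; SciPy's cKDTree is used both for the nearest-color assignment and for the nearest-point distances.

```python
import numpy as np
from scipy.spatial import cKDTree

def modified_hausdorff(X, Y):
    """Formula (2): min of the two mean nearest-neighbour distances between
    the pixel-position sets X and Y (arrays of shape (n, 2))."""
    d_xy = cKDTree(Y).query(X)[0].mean()
    d_yx = cKDTree(X).query(Y)[0].mean()
    return min(d_xy, d_yx)

def color_score(ref_img, test_img, colormap):
    """Formula (3): average the modified Hausdorff distance over the colors of
    the reference color map; each pixel is assigned to its nearest map color."""
    def positions_per_color(img):
        pix = img.reshape(-1, 3).astype(float)
        labels = cKDTree(colormap).query(pix)[1]
        rows, cols = np.divmod(np.arange(pix.shape[0]), img.shape[1])
        return [np.stack([rows[labels == c], cols[labels == c]], axis=1)
                for c in range(len(colormap))]

    ref_sets, test_sets = positions_per_color(ref_img), positions_per_color(test_img)
    dists = [modified_hausdorff(x, y)
             for x, y in zip(ref_sets, test_sets) if len(x) > 0 and len(y) > 0]
    return float(np.mean(dists))
```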
In order to accelerate our algorithm, only distributions with comparable numbers of pixels are used for comparison. The whole recognition process is described in Figure 4. Based on the dominant color of the iris, we made a coarse pre-classification between irises. We notice that two main classes of colors are present in the iris images (blue and brown images). The Hue distribution (from the HSV color space instead of the RGB color space [10]) can be used to automatically classify the irises into 2 classes. The Hue value indicates the nature of the color and is expressed in terms of angle (degrees). For brown irises, the Hue distribution falls into the range of 0° (90% of the
Fig. 4. Scores computation between Image 1 and Image 2 including: the compression of the two images; the creation of a set of pixels whose values correspond to a given color and finally the computation of the modified Hausdorff Distance between set X (reference) and Y (test)
irises in the databases at hand), while the Hue values for blue irises vary around 200°. Figure 5 shows one Hue distribution of each class. In order to classify the two kinds of irises (brown and blue irises) we have used a simple Gaussian Mixture Model [12] with 3 Gaussians for each class. Some images may contain eyelid occlusion; as eyelids fall in the range of 300° to 360°, these images can affect the Hue distribution by introducing a new peak (see Figure 5 right). As the discrimination between the 2 classes is mainly due to other value ranges of the distribution, eyelids do not affect the classification capacity of the GMM. In order to pre-classify an iris test image, we select the class (brown or blue) corresponding to the GMM giving the highest probability. After this pre-classification, only comparisons between reference and test irises belonging to the same class are performed (the reference image is stored with its associated class label). In case no comparison is performed, a maximum score is given to the corresponding test image.
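The two-class pre-classification could be implemented along these lines with scikit-learn's GaussianMixture, using 3 components per class as described above; the hue extraction via OpenCV and the function interfaces are our assumptions rather than the authors' code.

```python
import numpy as np
import cv2
from sklearn.mixture import GaussianMixture

def hue_values(bgr_img, mask=None):
    """Hue channel (0-179 in OpenCV, i.e. degrees/2) of the iris pixels."""
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].reshape(-1, 1).astype(float)
    return hue if mask is None else hue[mask.reshape(-1) > 0].reshape(-1, 1)

def train_class_models(brown_hues, blue_hues):
    """One 3-component GMM per color class, fitted on controlled-session hues."""
    gmm_brown = GaussianMixture(n_components=3).fit(brown_hues)
    gmm_blue = GaussianMixture(n_components=3).fit(blue_hues)
    return gmm_brown, gmm_blue

def classify_iris(hues, gmm_brown, gmm_blue):
    """Pick the class whose GMM gives the higher average log-likelihood."""
    return "brown" if gmm_brown.score(hues) > gmm_blue.score(hues) else "blue"
```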
Fig. 5. Brown iris Hue histogram (left image), blue iris Hue histogram (middle image) and brown iris with strong Eyelids presence (right image)
3 Evaluation 3.1 Normal Light Iris Databases There are few iris databases acquired under normal light condition. In this work we have used a quite small database acquired by our team, called IRIS_INT as development set and we made the tests on UBIRIS database. IRIS_INT contains 700 images from 70 persons. There are between 7 and 23 iris images per person. The images have been taken with a flash at a distance of 20-30 cms; the camera has a 23mm focus, the image resolution is 2048x1536 and the captured irises have 120-150 pixels of radius. Some variations in lighting and position have been introduced in the database. We have chosen to capture only the left iris of each person. In fact, we acquired the top left quarter of the subject’s face. This database includes some other variations including out of focus images, eyelids and eyelashes occlusions, lenses and eyeglasses effects. UBIRIS iris database was developed by the Universidade da Beira Interior, with the aim of testing the robustness of various iris recognition algorithms to degradations. Several intra-class variations or image degradations, such as illumination, contrast, reflections, defocus, and occlusions are present in this database. The database contains 1,877 gray-level images from 241 persons, captured in two different sessions. Some images are shown in Figure 6. In the first session, noise factors are minimized as the irises are acquired indoor, in a controlled illumination environment (dark room). Images are available at two different resolutions, a high resolution (800*600) and a low resolution (200*150). In our experiments we used the smallest resolution available in order to simulate PDA or smartphone limitation in terms of resolution. Indeed, if we consider the half top of a face acquired with a PDA or smartphone camera of 2 megapixels, an iris in average will be represented by an image of size 400*300 which doubles the resolution that we consider. We have used 1727 images from the 1,877 which are available. The deleted images correspond to very degraded images (closed eye, blurring, occlusion).
Fig. 6. Some low quality images (highly covered by eyelids and spot lights) acquired in session 2 (uncontrolled images) of the UBIRIS database
3.2 Experiments

We have used two reference systems in order to benchmark our system, although those systems were developed and optimized on different data (irises acquired under near infrared illumination). These two systems, Masek [6] from the University of Western
Australia, and OSIRIS developed under the BioSecure European Network of Excellence [11] correspond to implementations of Daugman’s work [7] although none of them can pretend to be equivalent to Daugman’s system in terms of optimization and recognition performance. In the present work, all developments are made on IRIS_INT database while tests are made on UBIRIS database. We developed two different protocols (or scenarios) on UBIRIS. In the first protocol, called “controlled scenario”, we only consider images from the first session, that is images acquired indoor, in a dark room under controlled illumination conditions. In the second protocol, “called uncontrolled scenario”, images of the first session are used as references while those of the second session are used as test images. In both cases, only one image is considered as reference. Development experiments. We made all possible comparisons between images of IRIS_INT database. It has allowed us to optimize some parameters in the system such as the number of colors in the compressed images. We considered from 30 to 300 colors in the color map, and the best results were obtained for color maps of 60 colors. Indeed, if too many colors are considered, the similarity between reference and test images of the same person is lost. On the other hand if too few colors are considered the essential information to discriminate a client and an impostor will be also lost. Also in this case too many pixel positions will be associated to each color in the color map, which increasing considerably the computational load when computing the modified Hausdorff distances. Controlled scenario. For the controlled scenario on the UBIRIS database we obtain 1.2% of EER, and 2.1% of FRR with 0.1% of FAR. These results are very encouraging as both OSIRIS and MASEK perform poorly on the same data (of course on grey-level images), from 5 to 10 times worst than our color-based approach (see Table1). Such a difference between systems can only be explained by the fact that color carries some information which is definitively lost when we just use grey-level images and texture information only. Indeed, the quality of texture information is bad and not sufficient for such analysis. Also, the bad performance of both Masek and OSIRIS confirm that texture-based methods suffer greatly from the low resolution of the images (small size of irises, a radius lower than 50 pixels). Uncontrolled scenario. Uncontrolled illumination changes affect both color and texture of the image. Tests show a huge performance degradation compared to the controlled scenario, up to 22% of FRR at 0.1% of FAR for our color-based approach that still outperforms significantly OSIRIS and Masek. Table 1 summarizes the performance obtained in each scenario and with the three different systems. Table 1. Benchmarking performances on UBIRIS database under different scenarios with Color-based system, OSIRIS and Masek systems
          Controlled scenario        Uncontrolled scenario
          EER        FRR             EER        FRR
Color     1.2        2.1             8          22
OSIRIS    5          10              14.5       40
Masek     7.1        20              25         80
To cope with bad quality images of this uncontrolled scenario, like images which are covered by a high number of spot lights as shown in Figure 6, we introduce a quality measure on iris images at the acquisition step. This way images would be rejected right after acquisition and thus not processed by the system; system performance would this way be increased. In fact, in the UBIRIS database brown irises suffer much more from illumination changes than blue irises: for brown irises the ‘red’ colors (range 0°-50°) turn to purple in most cases (less than 360°). Therefore, to detect bad quality images we only use the GMM trained on brown irises (acquired in the controlled scenario) for the two classes separation by applying a threshold on the probability given by the GMM. Figure 7 shows the distribution of the resulting GMM probabilities on brown irises, when considering all the brown iris images from both the first session (red points) and the second session (blue points). The principle to select test images by the proposed quality measure is the following: only irises that have a probability comparable to those obtained on iris images of the first session (controlled scenario) are kept. Irises from the first session have a minimum probability of 0.42; so we fixed the selection threshold at 0.4. This means that irises with a probability lower than 0.4 are rejected and not considered in the test process. Using this selection process, 80% of brown irises from the second session (uncontrolled scenario) are rejected. The EER falls from 8% to 1.5%, and FRR reaches 3.8% at FAR of 0.1% (versus 22% without considering the selection process).
Fig. 7. GMM probabilities on brown irises from session1 (Red) and Session2 (Blue)
4 Conclusions

In this paper we presented a novel iris verification method entirely based on the color information, using the minimum variance principle for iris color reduction and a modified Hausdorff distance for matching. We also proposed a pre-selection method
based on the use of Gaussian Mixture Models that characterize two classes of colors (blue and brown or more generally clear and dark) for iris images encoded in the Hue axis of the HSV color space. These models are also used to detect images corrupted by a variety of strong illumination effects. Our tests show that in the controlled acquisition mode our system outperforms significantly two iris reference systems based on texture information. Nevertheless, in uncontrolled conditions, tests show a huge decrease in performance for our method although it remains better than those obtained with texture-based systems (22% vs. 40% and 80%). In order to cope with image degradation, typical in the uncontrolled scenario, we proposed a quality measure based on the GMMs above mentioned to reject bad quality images. This selection process drastically improves the results of our approach reducing the error rates roughly by a factor 6, while rejecting 80% of the images acquired in uncontrolled conditions. Future work will be focused on fusion strategies of texture-based systems and our color-based approach on images acquired under normal illumination conditions.
References 1. http://www.iridiantech.com/index2.php 2. Emine Krichen, M., Mellakh, A., Garcia-Salicetti, S., Dorizzi, B.: Iris Identification Using Wavelet Packets. In: 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, 23-26 August 2004, vol. 4, pp. 335–338 (2004) 3. Sun, Z., Wang, Y., Tan, T., Cui, J.: Improving iris recognition accuracy via cascaded classifiers. IEEE Transactions on Systems, Man, and Cybernetics, Part C 35(3), 435–441 (2005) 4. http://iris.di.ubi.pt 5. Wildes, R.: Automated iris recognition: An emerging biometric technology. Proceedings of the IEEE 85 (9), 1348–1363 (1997) Awarded IEEE Donald G. Fink Prize Paper Award 6. http://www.csse.uwa.edu.au/ pk/studentprojects/libor/ 7. Daugman, J.: How iris recognition works. IEEE Transactions on Circuits and Systems fo Video Technology 14(1) (January 2004) 8. Dubuisson, M.-P., Jain, A.K.: A modified Hausdorff distance for object matching. Pattern Recognition (1994). In: Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference, 9-13 October 1994, vol. 1, pp. 566–568 (1994) 9. Mathworks, Inc., Matlab Image Processing Toolbox, Ver. 2.2 Natick, MA (1999) 10. Smith, A.R.: Color Gamut Transform Pair. Computer Graphics 12(3), 12–19 (1978) 11. http://www.biosecure.info
Real-Time Face Detection and Recognition on LEGO Mindstorms NXT Robot Tae-Hoon Lee Center for Cognitive Robotics, Korea Institute of Science and Technology, 39-1 Hawolgok-dong, Seongbuk-gu, Seoul 136-791, Republic of Korea
Abstract. This paper addresses a real-time implementation of face recognition system with LEGO Mindstorms NXT robot and wireless camera. This system is organized to capture an image sequence, find the features of face in the images, and recognize and verify a person. Moreover, this system can collect the facial images of various poses due to movable robot, which enables this system to increase performance. The current implementation uses the LDA(Linear Discriminant Analysis) learning algorithm considering the number of training data. We have made several tests on video data, and measured the performance and the speed of the proposed system in real environment. Finally, the result has confirmed that the proposed system is better than conventional PC-based systems. Keywords: LEGO Mindstorms NXT robot, real-time face recognition.
1 Introduction
During the past few years, we have witnessed the explosion in interest and progress in automatic face recognition technology. Face recognition technologies have been developed and come into our life. For instance, applications such as intelligent building, PC security system based on face analysis start to appear in recent years. For applying face recognition technologies to real applications, many kinds of systems have been developed before 21 century. In the early 1990s, Gilbert et al. introduced a real-time face recognition system using custom VLSI hardware for fast correlation in an IBM compatible PC[2]. Five years later, Yang et al. introduced a parallel implementation of face detection algorithm using a TMS320C40 chip[3]. On the other hand, IBM introduced a commercial chip, ZISC which can compute the classification in RBF(Radial Basis Function) based neural network[4]. However, these efforts did not make successful results because they could not cope with real problems caused by illumination change, pose variation, human aging, lens distortion, and so forth. So recent face researches are focused on solving these problems. For illumination invariant face recognition, S. Zhou and
This research was performed when the author was a summer student intern at the Center for Cognitive Robotics, Korea Institute of Science and Technology.
R. Chellappa proposed rank constraint recognition [5] and A. Georghiades et al. proposed the 'Illumination cone'. For pose variation, R. Gross proposed 'Eigen light-fields' [7], and 3D face recognition based on a morphable model was proposed by V. Blanz and T. Vetter [8]. These approaches deal with illumination changes using multiple training data from a database, which have already been collected in a well-structured studio. However, it can be difficult to collect such facial data in a real environment. The limited number of registered facial images is also one of the reasons why the recognition rate decreases. Since a face recognition system generally has a fixed position, it requires well-posed and frontal faces. Recently, for overcoming this limitation, many studies have been reported [1]. In order to solve the training data problem, we adopt a movable camera, using a wireless camera, and collect more training data instead of improving the learning algorithms. We can collect multiple facial data in different poses and conditions using an active robot, which is controlled by predefined rules. Then LDA can categorize these data into each manifold. The remaining parts of this paper are organized as follows: In Section 2, we introduce the hardware structure regarding the movable camera. Our face recognition process based on LDA learning is presented in Section 3. In Section 4, the applicability of the proposed method is illustrated via some experiments. Finally, in Section 5, concluding remarks are given.
2 Hardware Design
Most face recognition systems run with a non-movable camera. Therefore, they have the defect that they cannot deal with various face poses. In order to cope
Fig. 1. The block diagram of the proposed system: the robot (client), consisting of the NXT brick, sensors, motors, and an RF camera, communicates with the host PC, which receives the video through a 2.4 GHz USB RF receiver and exchanges actuator and sensor data over a Bluetooth network via a USB Bluetooth adaptor
with diverse facial images, the face recognition system has to train on many faces collected in a well-designed studio or recognize only well-posed frontal faces. Considering these restricted conditions, we construct a prototype system based on the LEGO Mindstorms NXT robot with a wireless camera. The LEGO Mindstorms NXT is well known as a tiny embedded system for students [9]. This system can collect arbitrarily posed facial images and train on them. In this section, we introduce our system's hardware architecture. This system is composed of a server and a client system, as shown in Fig. 1. The former is exactly the same as a conventional PC, which plays a role in the face recognition process, and the latter consists of a camera and a LEGO Mindstorms NXT robot. The two systems can communicate with each other via Bluetooth. The LEGO Mindstorms NXT brick has a 32-bit Atmel ARM7 processor with an Atmel AVR coprocessor. The processors have 256 KB Flash memory and 512 bytes of memory, respectively. Additionally, it has a Bluetooth module for wireless communication, through which a server can control its motor module. The LEGO Mindstorms NXT robot has too small a memory to recognize a person (see Table 1). So we attached a wireless camera and established communication between the server and the camera using radio frequency (RF) devices. The host system analyzes the video frames and sends control signals to the LEGO Mindstorms NXT robot for collecting data while moving around. The LEGO Mindstorms NXT robot is decorated with LEGO blocks and the camera, as shown in Fig. 2.
Fig. 2. LEGO Mindstorms NXT robot with wireless camera
In this way we can capture randomly posed facial images with the camera on the mobile robot. Our mobile robot is built from the LEGO Mindstorms NXT. The robot's controller is the NXT brick, a small embedded system that plays the role of the brain of the LEGO Mindstorms NXT robot. The specification of this hardware is shown in Table 1 [15].
Table 1. Specification of LEGO Mindstorms NXT
  Processor    32-bit ARM7, 250 MHz
  RAM          64 KB
  Storage      256 KB flash memory
  Controller   Atmel AVR microcontroller, 4 MHz
  RAM          512 bytes
  Display      100 x 64 pixel LCD matrix
  Input        four 6-wire cable digital platform
  Output       three 6-wire cable digital platform
As mentioned, the face recognition process requires high computing power, whereas the NXT brick has low computing power. Since we cannot process image information on the NXT brick, we implemented a host PC-based face recognition system and set up wireless communication between the robot and the PC. In this system, the wireless camera captures video frames and sends them to the host PC, and the robot acts as an actuator for the camera, driven by signals from the host PC. In other words, we take pictures using the wireless camera and transfer the image data through radio frequency using a transmitter. On the PC, our program receives an image from the USB RF receiver, processes the image data, and controls the robot using Bluetooth and RF communications.
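To make the control flow concrete, a minimal host-side sketch is given below. It is not the authors' implementation: the robot-command helper and the face-analysis callback are hypothetical placeholders, and only the frame-grabbing call corresponds to a real OpenCV API.

```python
# Sketch of the host (PC) control loop; the robot-command helper is a hypothetical
# placeholder, since the original control protocol is not given in the paper.
import cv2

def send_robot_command(command):
    """Placeholder: forward a motion command to the NXT brick over Bluetooth."""
    pass  # e.g. via a Bluetooth/serial library; not specified in the paper

def host_loop(analyze_face, receiver_index=0):
    """Grab frames from the USB RF receiver, analyze them, and steer the robot."""
    capture = cv2.VideoCapture(receiver_index)   # RF receiver exposed as a capture device
    while capture.isOpened():
        ok, frame = capture.read()
        if not ok:
            break
        face = analyze_face(frame)               # detection + recognition on the host PC
        if face is None:
            send_robot_command("return_to_previous_position")   # lost the face: retry
        else:
            send_robot_command("small_step_towards_face")       # collect a new pose
```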
3 Face Analysis

3.1 Face Detection and Feature Extraction
As the first step of the entire process, face detection greatly affects the overall system performance. In fact, successful face detection is a prerequisite for the success of the following face recognition and verification tasks. Our system extracts faces using the OpenCV face detector [14], which uses a Haar cascade classifier. This algorithm was proposed by P. Viola and M. Jones [13] and is based on AdaBoost, which is known as one of the best methods in terms of performance and quality. We trained it on a sufficient number of faces with various poses, also collected by the movable robot. After face detection, we resize the facial image to a default rectangle and extract the face region with an ellipse mask, because the face image contains background area. We then apply histogram equalization, which is known to be effective under quite different illumination conditions. Even if it cannot normalize locally illuminated shapes, it works under the general illumination changes caused by windows and lights in an indoor environment.
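The detection and preprocessing chain just described can be sketched with the OpenCV Python bindings as follows; the 40 × 40 output size follows the paper, while the cascade file name and the ellipse parameters are illustrative assumptions.

```python
import cv2
import numpy as np

# Assumed cascade file; any frontal-face Haar cascade shipped with OpenCV would do.
detector = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

def preprocess_face(gray_frame):
    """Detect the largest face, crop, resize to 40x40, mask with an ellipse, equalize."""
    faces = detector.detectMultiScale(gray_frame, scaleFactor=1.2, minNeighbors=4)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])      # keep the largest detection
    face = cv2.resize(gray_frame[y:y + h, x:x + w], (40, 40))
    mask = np.zeros((40, 40), dtype=np.uint8)
    cv2.ellipse(mask, (20, 20), (16, 20), 0, 0, 360, 255, -1)   # elliptical face region
    face = cv2.bitwise_and(face, face, mask=mask)           # suppress background corners
    return cv2.equalizeHist(face)                           # illumination normalization
```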
3.2 Face Classification
The classification of feature vectors in face images requires an essential compression of large images into a small vector. We used PCA (Principal Component Analysis) [11] and LDA as feature extractors. Firstly, PCA is mathematically
considered as an orthogonal linear transformation, which is a kind of mapping function into another space. In pattern recognition fields, PCA is used for dimensionality reduction. In this paper, facial images (40 × 40) are represented as 1 × 1600 vectors. Since we collected many images per person, a face is considered as a vector of high dimension. Let the training set of N face images be Γ_1, Γ_2, Γ_3, ..., Γ_N. The average face of the training set is

Ψ = (1/N) Σ_{i=1}^{N} Γ_i.   (1)

Each face differs from the average by the vector

Φ_i = Γ_i − Ψ.   (2)

For Φ with zero empirical mean, obtained by Equation 2, the PCA transformation is given by

Y = W^T Φ = Σ V^T,   (3)

where W^T is a transformation matrix and W Σ V^T is the singular value decomposition of Φ. We can consider Y as the vector of a face in the new space and recognize a person using this vector. Generally, PCA gives a good transformation and largely distributed vectors. However, it has a drawback if there are many data in the same class. In this paper, we need another method because we consider several posed faces of the same persons. LDA is a class-specific method in that it tries to shape the scatter in order to make it more reliable for classification. This method selects the transformation matrix W of Equation 3 in such a way that the ratio of the between-class scatter and the within-class scatter is maximized. Let the between-class scatter matrix be defined as

S_B = Σ_{i=1}^{c} N_i (μ_i − μ)(μ_i − μ)^T   (4)

and the within-class scatter matrix be defined as

S_W = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)^T   (5)

where μ_i is the mean image of class X_i, and N_i is the number of samples in class X_i. If S_W is nonsingular, the optimal projection W_opt is chosen as the matrix with orthonormal columns which maximizes the ratio of the determinant of the between-class scatter matrix of the projected samples to the determinant of the within-class scatter matrix of the projected samples, i.e.,

W_opt = argmax_W |W^T S_B W| / |W^T S_W W| = [W_1 W_2 ⋯ W_m]   (6)
where {W_i | i = 1, 2, ..., m} is the set of generalized eigenvectors of S_B and S_W corresponding to the m largest generalized eigenvalues {λ_i | i = 1, 2, ..., m}, i.e.,

S_B w_i = λ_i S_W w_i,   i = 1, 2, ..., m.   (7)
Note that there are at most c − 1 nonzero generalized eigenvalues, so an upper bound on m is c − 1, where c is the number of classes. LDA has lower error rates than PCA and requires less computation time [11]. When the number of training samples per class is small, it is known that PCA gives better results than LDA [12]. In this paper, we use more than 20 images per class.
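A compact NumPy/SciPy sketch of the PCA and LDA projections of Equations (1)–(7) is given below; it assumes the training images are already vectorized as rows of X with integer class labels y, and it is an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def pca(X, n_components):
    """Equations (1)-(3): center the data and project with the leading singular vectors."""
    psi = X.mean(axis=0)                      # average face, Eq. (1)
    Phi = X - psi                             # difference faces, Eq. (2)
    U, S, Vt = np.linalg.svd(Phi, full_matrices=False)
    W = Vt[:n_components].T                   # transformation matrix
    return psi, W, Phi @ W                    # projections Y = W^T Phi for every sample

def lda(Y, y, n_components):
    """Equations (4)-(7): maximize between-class over within-class scatter."""
    mu = Y.mean(axis=0)
    d = Y.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Yc = Y[y == c]
        mc = Yc.mean(axis=0)
        Sb += len(Yc) * np.outer(mc - mu, mc - mu)   # Eq. (4)
        Sw += (Yc - mc).T @ (Yc - mc)                # Eq. (5)
    vals, vecs = eigh(Sb, Sw)                 # generalized eigenproblem, Eq. (7)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order]                     # columns W_1 ... W_m of Eq. (6)
```

Identification can then be performed by projecting a probe image with both matrices and selecting the class whose training projections (for example, the class mean) lie closest.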
4 Experimental Results and Analysis
We performed several experiments to verify our approach. Firstly, we collected 281 facial images. Secondly, we performed preprocessing including histogram equalization and masking. Finally, we recognized an input person by matching it with all trained faces; a person is identified by selecting the index with the minimum error. In this experiment, we performed this process repeatedly: one of the 7 trained persons was randomly presented and the system repeatedly recognized the detected facial image.

4.1 Collecting Facial Images
For collecting facial images, a wireless camera is used in order to transfer the video stream over radio frequency. This device is made up of a wireless camera and an RF receiver. The detailed information is given in Table 2 [16].

Table 2. Specification of wireless camera
  Image Sensor            1/4" Color CCD
  Horizontal Resolution   300 TV lines
  Sync                    Internal; PAL 625 lines interlaced, NTSC 525 lines interlaced
  Light Sensitivity       10 Lux
  Signal to Noise Ratio   42 dB or more
  Gamma                   0.45
  Frequency               2.410 GHz, 2.430 GHz, 2.450 GHz and 2.470 GHz
In this paper, the robot moves around and collects facial images on the desk. It tries to track a face and move a small step until it loses the face region. If it loses the face region, it returns to the previous status and position to try again. The server system carries out the whole algorithm and controls the robot based on the analyzed signal. The server system extracts facial images using the OpenCV face detector and normalizes the size of the faces to 40 × 40 pixels in every captured image.
4.2 Performance in Face Recognition
We have performed experiments with images captured in real time with our prototype system. We used a data set of 7 persons (more than 20 images per person) for training and tried to recognize persons from newly captured images in an indoor environment.
Fig. 3. Examples of collected facial images
This system can detect a face and recognize a person at 12–16 frames per second. This speed was measured from user input to the final stage, with the result depending on the number of objects in an image. The system captures a frame from the wireless camera, preprocesses it, detects a face, extracts feature vectors and identifies a person. All of these stages have to operate in real time. In addition, the movable robot continuously tries to train a manifold for each person. In general, even though the training stage takes much time, it is done as a background process and the user is not aware of it. We also measured the processing time of each stage using an internal timer; the results are shown in Table 3. Even though the training stage takes too much time to be performed in real time, we can run it as a background process so that, from the user's point of view, it appears to be done in real time.
Table 3. Speeds of each process
  Face detection                                        111 ms
  Preprocessing (masking and histogram equalization)     67 ms
  PCA training                                          250 ms
  LDA training                                          221 ms
  Classification                                         38 ms
Fig. 4. Experimental results (recognition rate in % for each of the 7 test subjects and on average, comparing PCA+1, PCA+N and LDA+N)
We can also carry it out in real time without further effort if the number of persons in training is small enough. The recognition performance of the system is highly dependent on the accuracy of face extraction. In the next test, we measured the classification accuracy assuming correct face extraction, which means that we discarded the data wrongly extracted by the OpenCV face detector. This experiment compared three approaches in order to verify the effect of multiple training data. The first uses the PCA method with only one image per person for training (PCA+1). The second uses the PCA method with N images per person (PCA+N). The third uses the LDA method with more than N images per person for training (LDA+N); N = 20 was used in this paper. All of them use just one image for testing. The result in Fig. 4 shows that the LDA method with the collected training images gives the best result.
5 Conclusions
Until now, most face recognition systems have used a fixed camera and therefore cannot deal with various face poses. In order to cope with diverse facial images,
the face recognition system either has to be trained on many faces collected in a well-designed studio or has to recognize a well-posed frontal face. In this paper, we constructed a prototype system to remove these restrictions. In order to collect enough facial images in real environments, we combined several components: a conventional PC, a movable robot built from the LEGO Mindstorms NXT, and RF wireless communication. This system can actively capture an image sequence, find the facial features in the images, and recognize a person. In addition, it can collect facial images of various poses thanks to the movable robot, which enables the system to increase its performance. We carried out several experiments, and the results confirmed that the proposed system is better than conventional PC-based systems.
Acknowledgments
The author would like to thank Dr. Bum-Jae You at the Center for Cognitive Robotics, Korea Institute of Science and Technology and Dr. Sang-Woong Lee at Carnegie Mellon University for their kind support and suggestions for this research.
References
1. Tan, X., Chen, S., Zhou, Z.-H., Zhang, F.: Face recognition from a single image per person: A survey. Pattern Recognition 39(9), 1725–1745 (2006)
2. Gilbert, J.M., Yang, W.: A Real-Time Face Recognition System Using Custom VLSI Hardware. In: Proc. of Computer Architectures for Machine Perception Workshop, New Orleans, USA, pp. 58–66 (1993)
3. Yang, F., Paindavoine, M., Abdi, H.: Parallel Implementation on DSPs of a Face Detection Algorithm. In: Proc. of International Conference on the Software Process, Chicago, USA (1998)
4. IBM ZISC036 Data Sheet: http://www.ibm.com
5. Zhou, S., Chellappa, R.: Rank constrained recognition under unknown illuminations. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures (2003)
6. Georghiades, A., Belhumeur, P., Kriegman, D.: From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 643–660 (2001)
7. Gross, R., Matthews, I., Baker, S.: Appearance-Based Face Recognition and Light-Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(4), 449–465 (2004)
8. Blanz, V., Vetter, T.: Face Recognition Based on Fitting a 3D Morphable Model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1063–1074 (2003)
9. Sharad, S.: Introducing Embedded Design Concepts to Freshmen and Sophomore Engineering Students with LEGO MINDSTORMS NXT. In: IEEE International Conference on Microelectronic Systems Education, pp. 119–120. IEEE Computer Society Press, Los Alamitos (2007)
10. LEGO Mindstorms NXT Hardware Developers Kit: http://mindstorms.lego.com/overview/nxtreme.aspx
11. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 71–86 (1991)
12. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
13. Viola, P., Jones, M.: Robust Real-time Face Detection. International Journal of Computer Vision 57(2) (2004)
14. http://sourceforge.net/projects/opencvlibrary/
15. http://mindstorms.lego.com/
16. http://www.miracleon.com/products/cm32c.php
Speaker and Digit Recognition by Audio-Visual Lip Biometrics

Maycel Isaac Faraj and Josef Bigun

Halmstad University, School of Information Science, Computer and Electrical Engineering (IDE), Box 823, SE-301 18 Halmstad
{maycel.faraj,josef.bigun}@ide.hh.se
Abstract. This paper proposes a new robust bi-modal audio-visual digit and speaker recognition system using lip-motion and speech biometrics. To increase the robustness of digit and speaker recognition, we propose a method using speaker lip-motion information extracted from video sequences of low resolution (128 × 128 pixels). In this paper we investigate a biometric system for digit recognition and speaker identification based on line-motion estimation combined with speech information and Support Vector Machines. The acoustic and visual features are fused at the feature level, showing favourable results, with digit recognition between 83% and 100% and speaker recognition at 100% on the XM2VTS database.
1 Introduction
In recent years, some techniques have been suggested that combine visual features to improve the recognition rate in acoustically noisy environments that have background noise or cross talk among speakers [1][2][3][4][5]. The present work is a continuation of [6]. The dynamic visual features are suggested based on the shape and intensity of the lip region [7][8][9][10][11], because changes in the mouth shape, including the lips and tongue, carry significant phoneme-discrimination information. So far the visual representation has been based on shape models to represent changed mouth shapes that rely exclusively on the accurate detection of the lip contours, often a challenging task under varying illumination conditions and rotations of the face. Another disadvantage is the fluctuating computation time due to the iterative convergence process of the contour extraction. The motion in dynamic lip images can be modelled by moving-line patterns generating planes in space-time that encode the normal velocity of lines, also known as normal image velocity; further details can be found in [6][12]. Here we use direct feature fusion to obtain the audio-visual observation vectors by concatenating the audio and visual features. The observation sequences are then modelled with a Support Vector Machine (SVM) classifier for digit and speaker recognition respectively. The studies [13][14][15] reported good performance with Support Vector Machine (SVM) classifiers in recognition, whereas traditional methods for speaker recognition are GMMs [16] and artificial neural networks [17]. By investigating SVMs instead of the more common GMMs [6],
we wanted to study the influence of the classification method on the performance of speaker recognition and digit recognition. In previous work [6], we introduced a novel feature extraction method for lip motion used in a speaker verification system within the framework of the well known Gaussian Mixture Models. Here, we extend the previous work [6] by studying a novel quantization technique for lip features. Furthermore, digit recognition is presented together with biometric speaker identification using an SVM classifier. An extension of this work is under review as a journal article in IEEE Transactions on Computers. The remainder of the paper is organized as follows. In Section 2 we briefly describe the lip-motion technique for the mouth region along with our quantization (feature-reduction) method, followed by acoustic feature extraction in Section 3. Section 4 describes the SVM classifier used for digit and speaker recognition, and Section 5 the database and the experimental setup. Finally, experimental results are shown with a discussion of the experiments and the remaining issues in Sections 6 and 7.
2 Visual Features by Normal Image Velocity
Bigun et al. proposed a different motion estimation technique based on an eigenvalue analysis of the multidimensional structure tensor [18], allowing the minimization process of fitting a line or a plane to be carried out without the Fourier Transform. Applied to optical-flow estimation, known as the 3D structure-tensor method, the eigenvector belonging to the largest eigenvalue of the tensor is directed in the direction of the contour motion, if motion is present. However, this method can be excessive for applications that need only line-motion features. We assume that the local neighbourhood in the lip image contains parallel lines or edges, as this is supported by real data [19]. Lines in a spatio-temporal image translated with a certain velocity in the normal direction will generate planes whose normal can be estimated in a total-least-square-error (TLS) sense as the local directions of the lines in 2D manifolds using complex arithmetic and convolution [18]. The velocity component of translation parallel to the line cannot be calculated; this is referred to as the aperture problem. We denote the normal unit vector as k = (k_x, k_y, k_t)^T; the projection of k onto the x–y coordinate axes represents the direction vector of the line's motion. The normal k of the plane then relates to the velocity vector v_a as follows:

v = v_a = − (k_t / (k_x² + k_y²)) (k_x, k_y)^T = − (1 / ((k_x/k_t)² + (k_y/k_t)²)) (k_x/k_t, k_y/k_t)^T,   (1)

where v is the normal image flow. The normal velocity estimation problem becomes a problem of solving the tilts (tan γ_1 = k_x/k_t) and (tan γ_2 = k_y/k_t) of the motion plane in the xt and yt manifolds, which are obtained from the eigenvalue analysis of the 2D structure tensor [18]. Using complex numbers and smoothing, the angles
Fig. 1. Illustration of velocity estimation quantification and reduction
of the eigenvectors are given effectively as complex values such that the magnitude is the difference of the eigenvalues of the local structure tensor in the xt manifold, whereas the argument is twice the angle of the most significant eigenvector, approximating 2γ_1. The function f represents the continuous local image, whose sampled version can be obtained from the observed image sequence. Thus, the arguments of ũ_1 and ũ_2 deliver the TLS estimations of γ_1 and γ_2 in the local 2D manifolds xt and yt respectively, but in the double angle representation [20], leading to the estimated velocity components as follows:

tan γ̃_1 = k_x/k_t = tan(½ arg(ũ_1))  ⇒  ṽ_x = tan γ̃_1 / (tan² γ̃_1 + tan² γ̃_2)   (2)

tan γ̃_2 = k_y/k_t = tan(½ arg(ũ_2))  ⇒  ṽ_y = tan γ̃_2 / (tan² γ̃_1 + tan² γ̃_2)   (3)
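An illustrative reconstruction of this double-angle estimation is sketched below, with Gaussian smoothing standing in for the local averaging of [18]; it operates on a small spatio-temporal volume V[t, y, x] and is an assumption-laden sketch rather than the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def _smooth_complex(c, sigma):
    # Smooth real and imaginary parts separately with a Gaussian window.
    return gaussian_filter(c.real, sigma) + 1j * gaussian_filter(c.imag, sigma)

def tilt_angles(volume, sigma=2.0):
    """Estimate the tilts gamma_1, gamma_2 of the motion plane from a volume V[t, y, x].

    The complex moment (f_x + i f_t)^2, averaged over a local window, has an argument
    approximating 2*gamma_1 in the xt manifold; analogously (f_y + i f_t)^2 in yt.
    """
    ft, fy, fx = np.gradient(volume.astype(float))
    gamma1 = 0.5 * np.angle(_smooth_complex((fx + 1j * ft) ** 2, sigma))
    gamma2 = 0.5 * np.angle(_smooth_complex((fy + 1j * ft) ** 2, sigma))
    return gamma1, gamma2

def normal_velocity(gamma1, gamma2, eps=1e-9):
    """Equations (2)-(3): turn the tilts into the normal image velocity components."""
    t1, t2 = np.tan(gamma1), np.tan(gamma2)
    denom = t1 ** 2 + t2 ** 2 + eps
    return t1 / denom, t2 / denom
```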
The tilde over v_x and v_y denotes that these quantities are estimations of v_x and v_y. With the calculated 2D-velocity feature vectors (v_x, v_y)^T in each mouth-region frame (128 × 128 pixels) we have dense 2D-velocity vectors. To extract statistical features from the 2D normal velocity and to reduce the amount of data without degrading identity-specific information excessively, we reduce the 2D velocity feature vectors (v_x, v_y)^T at each pixel to 1D scalars, where the expected directions of motion are 0°, 45°, −45° – marked with 3 different greyscale shades in 6 regions in Fig. 1. The motion vectors within each region become real scalars that take the signs + or − depending on which direction they move relative to their expected spatial directions (differently shaded boxes):

f(p, q) = ‖(v_x(p, q), v_y(p, q))‖ · sgn((v_x(p, q), v_y(p, q))),   p, q = 0 ... 127.   (4)
The next step is to quantize the estimated velocities from arbitrary real scalars to a more limited set of values. We found that direction and speed quantization significantly reduces the impact of noise on the motion information around the lip area. The quantized speeds are obtained from the data by applying a mean approximation as follows:

g(l, k) = Σ_{p,q=0}^{N−1} f(Nl + p, Nk + q),   l, k = 0 ... (M − 1),   (5)
where N and M represent the window size of the boxes (Fig. 1) and the number of boxes, respectively. The statistics of lip-motion are represented by 144-dimensional (M ×M ) feature vectors. The original dimension before reduction is 128×128×2 = 32768.
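Assuming the dense velocity fields and a per-pixel map of expected directions (following the region layout of Fig. 1) are available, the reduction of Equations (4)–(5) to the 144-dimensional vector can be sketched as follows; the box size N = 10 and the M = 12 boxes per side are assumed values, as the exact box size is not stated.

```python
import numpy as np

def lip_motion_features(vx, vy, expected_dir, N=10, M=12):
    """Reduce dense velocity fields to an M*M feature vector (Eqs. (4)-(5)).

    vx, vy       : 128x128 velocity components of one mouth-region frame
    expected_dir : 128x128x2 unit vectors giving each pixel's expected direction
                   (0, +45 or -45 degrees, following the region layout of Fig. 1)
    N, M         : box size and number of boxes per side (N*M <= 128), assumed values
    """
    speed = np.hypot(vx, vy)                                   # |(vx, vy)| per pixel
    sign = np.sign(vx * expected_dir[..., 0] + vy * expected_dir[..., 1])
    f = speed * sign                                           # signed scalar, Eq. (4)
    g = np.zeros((M, M))
    for l in range(M):
        for k in range(M):
            g[l, k] = f[N * l:N * (l + 1), N * k:N * (k + 1)].sum()   # Eq. (5)
    return g.ravel()                                           # 144-dimensional vector
```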
3 Acoustic Features
The Mel-Frequency Cepstral Coefficient (MFCC) is a commonly used instance of the filter-bank-based features [21] that can represent the speech spectrum. Here, the input signal is pre-emphasized and divided into 25-ms frames every 10 ms. A Hamming window is applied to each frame, and the MFCC vector is computed from the FFT-based, mel-warped, log-amplitude filter bank followed by a cosine transform and cepstral filtering. The speech features in this study were the MFCC vectors generated by the Hidden Markov Model Toolkit (HTK) [22] processing the data stream from the XM2VTS database. This MFCC vector contains 12 cepstral coefficients extracted from the Mel-frequency spectrum of the frame with normalized log energy, 13 delta coefficients (velocity), and 13 delta-delta coefficients (acceleration).
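A comparable 39-dimensional feature stream can be produced outside HTK, for instance with librosa; the sketch below is an assumed approximation of the HTK configuration described above, not a bit-exact reproduction.

```python
import librosa
import numpy as np

def acoustic_features(path):
    """12 MFCCs plus energy-like coefficient with deltas and delta-deltas,
    25 ms Hamming window, 10 ms shift, 16 kHz audio."""
    signal, sr = librosa.load(path, sr=16000)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
                                window="hamming")
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T     # one 39-dimensional vector per frame
```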
4 Classification by Support Vector Machine
The SVM formulation is based on the Structural Risk Minimization principle, which minimizes an upper bound on the generalization error, as opposed to the Empirical Risk Minimization [23][24]. An SVM is a discrimination-based binary method using a statistical algorithm. The basic idea in training an SVM system is finding a hyperplane w · x + b = 0 as a decision boundary between two classes. For a linearly separable training dataset of labelled pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ ℝ^n and y_i ∈ {+1, −1}, the following inequality is verified for each observation (feature vector):

d_i(w^T x_i + b) ≥ 1 − ξ_i,   ξ_i > 0,   for i = 1, 2, ..., l,   (6)

where d_i is the label for sample x_i, which can be +1 or −1; w and b are the weights and bias that describe the hyperplane; ξ represents the number of data
samples left inside the decision area, controlling the training errors. In our experiments we use an RBF kernel as the inner-product kernel function:

K(x, y) = exp(−γ‖x − y‖²),   γ > 0.   (7)
When conducting digit-classification experiments, we will need to choose between multiple classes. The best method of extending the two-class classifiers to multiclass problems appears to be application dependent. For our experiments we use the one against one approach. It simply constructs for each pair of classes an SVM classifier which separates those classes. All tests in this paper were performed using the SVM toolkit [25].
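Because scikit-learn wraps LIBSVM and also uses the one-against-one strategy for multiclass problems, the classifier can be approximated as in the sketch below; the feature matrices and the γ and C values are assumptions, not values from the paper.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_digit_classifier(X_train, y_train, gamma=0.01, C=10.0):
    """One-against-one RBF-kernel SVM over fused audio-visual feature vectors.

    X_train: one concatenated (acoustic + lip-motion) vector per utterance;
    y_train: the digit (or speaker) labels. gamma and C are illustrative values.
    """
    model = make_pipeline(StandardScaler(),
                          SVC(kernel="rbf", gamma=gamma, C=C,
                              decision_function_shape="ovo"))
    return model.fit(X_train, y_train)
```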
5 XM2VTS Database
All experiments in this paper are conducted on the XM2VTS database, currently the largest publicly available audio-visual database [26]. The XM2VTS database contains images and speech of 295 subjects (male and female), captured over 4 sessions. In each session, the subject is asked to pronounce three sentences while the video sequence is recorded; we use only "0 1 2 3 4 5 6 7 8 9". It is worth noting that the XM2VTS data is difficult to use as is for digit recognition experiments because the speech and lip motions are not annotated. Before defining a protocol we thus needed to annotate both speech and visual data, which we did nearly 100% automatically by speech segmentation. For each speaker of the XM2VTS database, the utterance "0 1 2 3 4 5 6 7 8 9" was divided into single-digit subsequences 0 to 9. We used Hidden Markov Models to automatically segment the digit sequences; furthermore, we manually verified and corrected the segmentation results so as to eliminate the impact of database segmentation errors when interpreting our recognition results. We propose two protocol setups for the XM2VTS database: protocol 1 is the well known Lausanne protocol [26], used for speaker identification, and protocol 2 is used for digit recognition. Protocol 2 is also suggested by other studies [14]. Protocol 1 – the training set contains 225 subjects, with 200 subjects as clients and 25 subjects as impostors, using sessions 1, 2 and 3. The training group is also used in evaluation. For testing, session 4 of the training group is used together with yet another 70 subjects as impostors. Protocol 2 – the speakers were involved both in training and testing the SVMs; we used 4 different pronunciations for training and testing. The training and test samples were completely disjoint.
6 Experimental Results
We want to quantify the performance of our visual features in speaker recognition and digit recognition as a stand-alone and as an audio-complementary modality. First the text-prompted speaker-recognition tests using protocol 1 are presented, and then the digit-recognition test results using protocol 2. In our experiments we use direct fusion at the feature level, which is detailed in [19].
Table 1. Speaker identification rate by SVM using word "7" for 295 speakers
  Kernel   Audio recognition rate   Visual recognition rate   Audio-Visual recognition rate
  RBF      92%                      80%                       100%
6.1 Speaker-Identification System by SVM
A smaller dataset of 100 speakers was first tested on all digits; the most significant word for the speaker recognition rate was digit "7", which gave the highest recognition rate. The experiment follows protocol 1 using all 295 speakers:
– Partition the database for training, evaluation, and testing according to protocol 1.
– Train the SVM for an utterance so that the classification score L (the mean of the classification equation (6) over an utterance) is positive for the user and negative for impostors.
– L is compared to a threshold T.
  • Find the threshold T such that False Acceptance is equal to False Rejection using the evaluation set.
  • Using the threshold T, the decision is made according to the rule: if L > T accept the speaker, else reject her/him.
However, protocol 1 is designed for verification (1:1 matching). To perform identification (1:many matching) we proceed as follows:
– Identify the speaker from a group of speakers.
  • We construct classifiers to separate each speaker from all other speakers in the training set.
  • The speaker identity is determined by the classifier that yields the largest likelihood score.
Table 1 shows the results of using SVM classifiers with an RBF kernel function using only one word (digit) to recognize the speaker identity. The recognition performance obtained when using coefficients from both the dynamic image and the speech is considerably higher than when using a single modality based on speech parameters. These results show that our features can perform well in identification problems.

6.2 Digit Recognition System by SVM
In Table 2, we compare the systems based on only acoustic, only visual, and merged audio-visual feature information. We obtain the best recognition rate, 100%, for the digits "1, 6, and 7". One cause of the variation of the results in Table 2 is that there is not enough information (especially visual information) for certain utterances. This is not surprising because the XM2VTS database was collected for identity recognition and not digit recognition.
Table 2. Digit-recognition rate of all digits using protocol 2 in one-against-one SVM
  Word   Audio features   Visual features   Audio-Visual features
  0      89%              70%                92%
  1      90%              77%               100%
  2      86%              60%                89%
  3      90%              75%                96%
  4      89%              55%                85%
  5      90%              50%                83%
  6      100%             90%               100%
  7      93%              100%              100%
  8      91%              54%                83%
  9      90%              49%                85%
During the segmentation we could verify that, when uttering the words from 0 to 9 in a sequence without silence between words, the words "4, 5, 8, 9" are pronounced in shorter time-lapses and the amount of visual data is notably smaller in comparison to other digits. Additionally, the amount of speech for each speaker differs when uttering the same word or digit, depending on the manner and speed of the speaker. Overall, digit recognition with the visual features gives ≈ 68%, while the audio-visual features give ≈ 90%.
7 Conclusion and Discussion
In this paper we described a system utilizing lip movement information in dynamic image sequences of numerous speakers for robust digit and speaker recognition, without using an iterative algorithm or assuming successful lip-contour tracking. In environments such as airports, outdoor traffic or train stations, an automatic digit recognition or speaker recognition system based only on acoustic information would with high probability be unsuccessful. Our experimental results support the importance of adding a lip-motion representation to speaker or digit recognition systems, which can be installed for instance in mobile devices as a complement to acoustic information. We presented a novel lip-motion quantization and recognition results of lip-motion features as a standalone modality and as a complement to audio for speaker and digit recognition tasks, using extensive tests. Improvements of the audio-based recognition rate, utilizing our motion features, are provided for digits as well as identities. Our main goal is to present an effective feature-level extraction for lip movement sequences which in turn can be used for identification and digit recognition, as shown here, and also for speaker verification [19] using a different state-of-the-art approach, GMMs. From the perspective of classification performance with visual information, the digit utterances in the XM2VTS database contained less relevant information for the digits "4, 5, 8, 9". The poor recognition performance of these digits indicates that the XM2VTS database does not contain sufficient amounts of visual information on lip movements. Not surprisingly, if the visual feature extraction is made on a sufficient
amount of visual speech data, the available modelling for recognition tasks appears to be sufficient for successful recognition.
References
1. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.: Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91(9), 1306–1326 (2003)
2. Brunelli, R., Falavigna, D.: Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995)
3. Chibelushi, C., Deravi, F., Mason, J.: A review of speech-based bimodal recognition. IEEE Transactions on Multimedia 4(1), 23–37 (2002)
4. Duc, B., Fischer, S., Bigun, J.: Face authentication with sparse grid Gabor information. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4(21), pp. 3053–3056 (1997)
5. Tang, X., Li, X.: Video based face recognition using multiple classifiers. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition FGR2004, pp. 345–349. IEEE Computer Society, Los Alamitos (2004)
6. Faraj, M.I., Bigun, J.: Person verification by lip-motion. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), pp. 37–45 (2006)
7. Luettin, J., Maitre, G.: Evaluation protocol for the extended M2VTS database (XM2VTSDB). IDIAP Communication 98-054, Technical report R-21, IDIAP (1998)
8. Bigün, J., Borgefors, G., Chollet, G. (eds.), Dieckmann, U., Plankensteiner, P., Wagner, T.: Acoustic-labial speaker verification. In: AVBPA 1997. LNCS, vol. 1206, pp. 301–310. Springer, Heidelberg (1997)
9. Jourlin, P., Luettin, J., Genoud, D., Wassner, H.: Acoustic-labial speaker verification. In: Bigün, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 319–326. Springer, Heidelberg (1997)
10. Chen, T.: Audiovisual speech processing. IEEE Signal Processing Magazine 18(1), 9–21 (2001)
11. Liang, L., Zhao, X.L.Y., Pi, X., Nefian, A.: Speaker independent audio-visual continuous speech recognition. In: IEEE International Conference on Multimedia and Expo 2002, ICME 02, Proceedings, vol. 2, pp. 26–29 (2002)
12. Kollreider, K., Fronthaler, H., Bigun, J.: Evaluating liveness by face images and the structure tensor. In: AutoID 2005: Fourth Workshop on Automatic Identification Advanced Technologies, pp. 75–80. IEEE Computer Society Press, Los Alamitos (2005)
13. Wan, V., Campbell, W.: Support vector machines for speaker verification and identification. In: Proceedings of the 2000 IEEE Signal Processing Society Workshop, Neural Networks for Signal Processing X, vol. 2, pp. 775–784 (2000)
14. Gavat, I., Costache, G., Iancu, C.: Robust speech recognizer using multiclass SVM. In: 7th Seminar on Neural Network Applications in Electrical Engineering, NEUREL 2004, pp. 63–66 (2004)
15. Clarkson, P., Moreno, P.: On the use of support vector machines for phonetic classification. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, vol. 2, pp. 585–588. IEEE Computer Society Press, Los Alamitos (1999)
16. Reynolds, D., Quatieri, T., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1-3), 19–41 (2000)
17. Farrell, K., Mammone, R., Assaleh, K.: Speaker recognition using neural networks and conventional classifiers. IEEE Transactions on Speech and Audio Processing 2(1), 194–205 (1994)
18. Bigun, J., Granlund, G., Wiklund, J.: Multidimensional orientation estimation with applications to texture analysis of optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(8), 775–790 (1991)
19. Faraj, M.I., Bigun, J.: Audio-visual person authentication using lip-motion from orientation maps. Pattern Recognition Letters (article accepted for publication, February 2, 2007)
20. Granlund, G.H.: In search of a general picture processing operator. Computer Graphics and Image Processing 8(2), 155–173 (1978)
21. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)
22. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book (for HTK version 3.0) (2000), http://htk.eng.cam.ac.uk/docs/docs.shtml
23. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
24. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
25. Chang, C.C., Lin, C.J.: LIBSVM - a library for support vector machines. Software (2001), available at http://www.csie.ntu.edu.tw/cjlin/libsvm
26. Messer, K., Matas, J., Kittler, J., Luettin, J.: XM2VTSDB: The extended M2VTS database. In: Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA 1999), pp. 72–77 (1999)
Modelling Combined Handwriting and Speech Modalities

Andreas Humm, Jean Hennebert, and Rolf Ingold

Université de Fribourg, Boulevard de Pérolles 90, 1700 Fribourg, Switzerland
{andreas.humm, jean.hennebert, rolf.ingold}@unifr.ch
Abstract. We are reporting on consolidated results obtained with a new user authentication system based on combined acquisition of online handwriting and speech signals. In our approach, signals are recorded by asking the user to say what she or he is simultaneously writing. This methodology has the clear advantage of acquiring two sources of biometric information at no extra cost in terms of time or inconvenience. We are proposing here two scenarios of use: spoken signature where the user signs and speaks at the same time and spoken handwriting where the user writes and says what is written. These two scenarios are implemented and fully evaluated using a verification system based on Gaussian Mixture Models (GMMs). The evaluation is performed on MyIdea, a realistic multimodal biometric database. Results show that the use of both speech and handwriting modalities outperforms significantly these modalities used alone, for both scenarios. Comparisons between the spoken signature and spoken handwriting scenarios are also drawn.
1 Introduction
Multimodal biometrics has raised a growing interest in the industrial and scientific communities. The potential increase of accuracy combined with better robustness against forgeries makes indeed multimodal biometrics a promising field. In our work, we are interested in building multimodal authentication systems using speech and handwriting as modalities. Speech and handwriting are indeed two major modalities used by humans in their daily transactions and interactions. Also, these modalities can be acquired simultaneously with no inconvenience, just asking the user to say what she/he is signing or writing. Finally, speech and handwriting taken alone do not compare well in terms of performance against more classical biometric systems such as iris or fingerprint. Merging both biometrics will potentially lead to a competitive system.

1.1 Motivations
Many automated biometric systems based on speech alone have been studied and developed in the past, as reviewed previously [1]. Numerous biometric systems based on signature have also been studied and developed in the past [2][3]. Biometric systems based on online handwriting are less numerous; however, we can refer to [4] or [5] as examples of state-of-the-art systems.
Our proposal here is to record speech and handwriting signals where the user reads aloud what she or he is writing. Such acquisitions are referred to here and in our related works as CHASM, for combined handwriting and speech modalities. (We note that such signals could also be used to recognize the content of what is said or written; here, however, we focus on the task of user authentication.) In this work, we define two scenarios. In the first one, called spoken signatures, a bimodal signature with voice is acquired. In this case, the user is simply asked to say the content of the signature, corresponding in most cases to his or her name. This scenario is similar, in essence, to text-dependent password-based systems where the signature and speech content remain the same from access to access. Thanks to the low quantity of data required to build the biometric templates, this scenario would fit commercial applications running, for example, in banks. In the second scenario, called spoken handwriting, the user is asked to write and read synchronously the content of several lines of a given random piece of text. This scenario is less applicable to commercial applications because of the larger quantity of data required to build the models. However, it could be used for forensic applications. Comparisons that we will draw between these scenarios will, of course, have to be weighed in light of the difference in quantity of data. Our motivation to perform a synchronized acquisition is multiple. Firstly, it avoids doubling the acquisition time. Secondly, the synchronized acquisition will probably give better robustness against intentional imposture. Indeed, imitating simultaneously the voice and the writing of somebody has a much higher cognitive load than for each modality taken separately. Finally, the synchronization patterns (i.e., where users synchronize) or the intrinsic deformation of the inputs (mainly the slowdown of the speech signal) may be dependent on the user, therefore bringing an extra piece of useful biometric information.

1.2 Related Work
Several related works have already shown that using speech and signature modalities together permits significant improvements in authentication performances in comparison to systems based on speech or signature alone. In [6], a tablet PC system based on online signature and voice modalities is proposed to ensure the security of electronic medical records. In [7], an online signature verification system and a speaker verification system are also combined. Both sub-systems use Hidden Markov Models (HMMs) to produce independent scores that are then fused together. In [8], tests are reported for a system where the signature verification part is built using HMMs and the speaker verification part uses either dynamic time warping or GMMs. The fusion of both systems is performed at the score level and results are again better than for the individual systems. In [9], the SecurePhone project is presented where multimodal biometrics is used to secure access and authenticate transactions on a mobile device. The biometric modalities include face, signature and speech signals. The main difference between these works and our CHASM approach lies in the acquisition procedure. In our case, the speech and signature data streams
are recorded simultaneously, asking the user to actually say the content of the signature or text. Our procedure has the advantage of shortening the enrollment and access time for authentication and will potentially allow for more robust fusion strategies upstream in the processing chain. This paper is actually reporting on consolidated evaluation results of our CHASM approach. It presents novel conclusions regarding comparison of performance of spoken signature and spoken handwriting. Individual analysis and performance evaluation of spoken signatures and spoken handwriting have been presented in our related works [10][11][12]. The remainder of this paper is organized as follows. In section 2, we give an overview of MyIDea, the database used for this work and of the evaluation protocols. In section 3 we present our modelling system based on a fusion of GMMs. Section 4 presents the experimental results. Finally, conclusions and future work are presented.
2 CHASM Database
2.1 MyIDea Database
CHASM data have been acquired in the framework of the MyIDea biometric data collection [13][14]. MyIDea is a multimodal database that contains many other modalities such as fingerprint, talking face, etc. The "set 1" of MyIDea is already available for research institutions. It includes about 70 users that have been recorded over three sessions spaced in time. This set is here considered as a development set. A second set of data is planned to be recorded in the near future and will be used as an evaluation set in our future work (the data set used to perform the experiments reported in this article has been given the reference myidea-chasm-set1 by the distributors of MyIDea). CHASM data have been acquired with a WACOM Intuos2 graphical tablet and a standard computer headset microphone (Creative HS-300). For the tablet stream, (x, y)-coordinates, pressure, azimuth and elevation angles of the pen are sampled at 100 Hz. The speech waveform is recorded at 16 kHz and coded linearly on 16 bits. The data samples are also provided with timestamps to allow a precise synchronization of both streams. The timestamps are especially important for the handwriting streams, as the graphical tablet does not send data samples when the pen is out of range. In [15], we provide more comments on spoken signature and spoken handwriting data and on the way users synchronize their acoustic events with signature strokes. In [16], we report on a usability survey conducted on the subjects of MyIDea. The main conclusions of the survey are the following. First, all recorded users were able to perform the signature or handwriting acquisition: speaking and signing or writing at the same time did not prevent any acquisition from happening. Second, the survey shows that such acquisitions are acceptable from a usability point of view.
2.2 Recording and Evaluation Protocols
Spoken signatures. In MyIDea, six genuine spoken signatures are acquired for each subject per session. This leads to a total of 18 true acquisitions after the three sessions. After acquiring the genuine signatures, the subject is also asked to imitate six times the signature of another subject. Spoken signature imitations are performed in a gender-dependent way by letting the subject have access to the static image and to the textual content of the signature to be forged. Access to the voice recording is not given for imitation, as this would lead to a too difficult task considering the high cognitive load and would be practically infeasible in the limited time frame of the acquisition. This procedure leads to a total of 18 skilled forgeries after the three sessions, i.e. six impostor signatures on three different subjects. Two assessment protocols have been defined on MyIDea with the objective of being as realistic as possible (see [16] for details). The first one is called without time variability, where signatures for training and testing are taken from the same session. The second protocol is called with time variability, where the signatures for training are taken from the first session while for testing they are taken from a different session. To compare with the skilled forgeries described above, we also test with random forgeries, taking the accesses from the remaining users. These protocols are strictly followed here.

Spoken handwriting. For each of the three sessions, the subject is asked to read and write a random text fragment of about 50 to 100 words. The subject is allowed to train for a few lines on a separate sheet in order to become accustomed to the procedure of talking and writing at the same time. After acquiring the genuine handwriting, the subject is also asked to imitate the handwriting of another subject (same gender) and to synchronously utter the content of the text (skilled forgeries). In order to do this, the imitator has access to the static handwriting data of the subject to imitate. Access to the voice recording is again not given for imitation. This procedure leads to a total of three impostor attempts on different subjects after the three sessions. An assessment protocol for spoken handwriting is also available with MyIDea [16] and is followed for the tests in this paper. In short, this protocol trains the models on data from session one and tests them on data from sessions two and three. As for spoken signatures, we also test against skilled forgeries and random forgeries. It actually corresponds to a text-prompted scenario where the system prompts the subject to write and say a random piece of text each time an access is performed. This kind of scenario allows the system to be more secure against spoofing attacks where the forger plays back a pre-recorded version of the genuine data. This scenario also has the advantage of being very convenient for the subject, who does not need to remember any password phrase.
3 System Description
As illustrated in Fig. 1, our system models the speech and handwriting signals independently to obtain two scores that are finally fused.
Fig. 1. CHASM handwriting system
3.1 Feature Extraction
For each point of the handwriting, we extract 25 dynamic features based on the x and y coordinates, the pressure and the angles of the pen, in a similar way as in [17] and [10]. This feature extraction was originally proposed to model signatures; however, it can be used without modification in the case of handwriting, as nothing specific to signatures was included in the computation of the features. The features are mean and standard deviation normalized on a per-user basis. For the speech signal, we compute 12 Mel Frequency Cepstral Coefficients (MFCC) and the energy every 10 ms on a window of 25.6 ms. We realized that the speech signal contains a lot of silence, which is due to the fact that writing is usually slower than speaking. It is known, in the speech domain, that silence parts impair the estimation of models. We therefore implemented a procedure to remove all the silence parts of the speech signal. This silence removal component uses a classical energy-based speech detection module based on a bi-Gaussian model. MFCC coefficients are mean and standard deviation normalized using normalization values computed on the speech part of the data.
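The silence removal step can be sketched as a two-component Gaussian mixture fitted on the per-frame log energy, keeping only the frames assigned to the higher-energy component; this is an assumed reconstruction of the bi-Gaussian detector, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def remove_silence(frame_features, log_energy):
    """Keep only speech frames, where speech is the higher-energy Gaussian component."""
    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(log_energy.reshape(-1, 1))
    speech_label = int(np.argmax(gmm.means_.ravel()))   # component with the larger mean
    keep = labels == speech_label
    return frame_features[keep]
```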
3.2 GMMs System
GMMs are used to model the likelihoods of the features extracted from the handwriting and from the speech signal. One could argue that GMMs are actually not the most appropriate models in this case, as they do not intrinsically capture the time-dependent specificities of speech and handwriting. However, a GMM is well suited to handle the text-independent constraint of the spoken handwriting scenario. We also wanted to have similar types of models for both scenarios to draw fair comparisons. Furthermore, GMMs are well-known flexible modelling tools able to approximate any probability density function. With GMMs, the probability density function p(x_n|M_client), or likelihood of a D-dimensional feature vector x_n given the model of the client M_client, is estimated as a weighted sum of multivariate Gaussian densities

p(x_n|M_client) ≅ Σ_{i=1}^{I} w_i N(x_n, μ_i, Σ_i)   (1)
in which I is the number of mixtures, wi is the weight for mixture i and the Gaussian densities N are parameterized by a mean D × 1 vector μi , and a D × D covariance matrix, Σi . In our case, we make the hypothesis that the
features are uncorrelated and we use diagonal covariance matrices. By making the hypothesis of observation independence, the global likelihood score for the sequence of feature vectors X = {x_1, x_2, ..., x_N} is computed with

S_c = p(X|M_client) = Π_{n=1}^{N} p(x_n|M_client)   (2)
The likelihood score S_w of the hypothesis that X is not from the given client is here estimated using a world GMM model M_world, or universal background model, trained by pooling the data of many other users. The decision whether to reject or to accept the claimed user is performed by comparing the ratio of client and world scores against a global threshold value T. The ratio is here computed in the log-domain with R_c = log(S_c) − log(S_w). The training of the client and world models is usually performed with the Expectation-Maximization (EM) algorithm, which iteratively refines the component weights, means and variances to monotonically increase the likelihood of the training feature vectors. Another way to train the client model is to adapt the world model using a Maximum A Posteriori criterion (MAP) [18]. In our experiments we used the EM algorithm to build the world model, applying a simple binary splitting procedure to increase the number of Gaussian components through the training procedure. The world model is trained by pooling the available genuine accesses in the database (the skilled forgery attempts are excluded from training the world model, as including them would lead to optimistic results; ideally, a fully independent set of users would be preferable, but this is not possible considering the small number of users (≈ 70) available). In the results reported here, we used MAP adaptation to build the client models. As suggested in many papers, we perform only the adaptation of the mean vector μ_i, leaving untouched the covariance matrix Σ_i and the mixture coefficient w_i.
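A sketch of this GMM/UBM chain with scikit-learn is given below; the relevance factor r and the mean-only MAP update follow the standard recipe of [18] and are an assumed reconstruction rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_world_model(X_world, n_components=256):
    """World (universal background) model trained by EM on pooled genuine data."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           max_iter=200, random_state=0).fit(X_world)

def map_adapt_means(world, X_client, r=16.0):
    """Client model: MAP adaptation of the means only (weights, covariances untouched)."""
    client = GaussianMixture(n_components=world.n_components, covariance_type="diag")
    client.weights_ = world.weights_
    client.covariances_ = world.covariances_
    client.precisions_cholesky_ = world.precisions_cholesky_
    post = world.predict_proba(X_client)                    # responsibilities per component
    n_i = post.sum(axis=0)                                  # soft counts
    x_bar = (post.T @ X_client) / np.maximum(n_i, 1e-10)[:, None]
    alpha = (n_i / (n_i + r))[:, None]                      # r is an assumed relevance factor
    client.means_ = alpha * x_bar + (1.0 - alpha) * world.means_
    return client

def log_likelihood_ratio(client, world, X):
    """R_c = log S_c - log S_w, accumulated over the access X."""
    return client.score_samples(X).sum() - world.score_samples(X).sum()
```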
3.3 Score Fusion
We obtain the spoken handwriting (sh) score by applying a weighted summation of the handwriting (hw) and speech (sp) log-likelihood ratios with R_c,sh = W_sp R_c,sp + W_hw R_c,hw. This is a reasonable procedure if we assume that the local observations of both sub-systems are independent. This is however clearly not the case, as the users are intentionally trying to synchronize their speech with the handwriting signal. Time-dependent score fusion procedures or feature fusion followed by joint modelling would be more appropriate than the approach taken here. More advanced score recombination could also be applied, such as, for example, classifier-based score fusion. We report here our results with or without a z-norm score normalization preceding the summation. The z-norm is here applied globally on both speech and signature scores for all test accesses, in a user-independent way. As the mean and standard deviation of the z-norm are estimated a posteriori on the same data set, z-norm results are of course unrealistic but give an optimistic estimation of what the fusion performances could be with such a normalisation.
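In score form, the fusion described above amounts to a few lines; the weights W_sp = W_hw = 0.5 correspond to the (.5/.5) setting reported in the tables, and the z-norm statistics are computed globally over the test scores as stated.

```python
import numpy as np

def znorm(scores):
    """Global, user-independent z-normalization of a set of access scores."""
    return (scores - scores.mean()) / scores.std()

def fuse(R_speech, R_handwriting, w_sp=0.5, w_hw=0.5, use_znorm=False):
    """Weighted sum of the speech and handwriting log-likelihood ratios."""
    if use_znorm:
        R_speech, R_handwriting = znorm(R_speech), znorm(R_handwriting)
    return w_sp * R_speech + w_hw * R_handwriting
```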
4 Experimental Results
We report our results in terms of Equal Error Rates (EER), which are obtained for the value of the threshold T where the impostor False Acceptance and client False Rejection error rates are equal.

4.1 Spoken Signature
Table 1 summarizes the results with our best MAP system (128 Gaussians for the client and world models) in terms of EER for the different protocols. The following conclusions can be drawn. The speech modelling performs as well as the signature in the case of single-session experiments (without time variability). However, when multi-session accesses are considered, signature performs better than speech. Signature and speech modalities both suffer from time variability, but to different degrees. It is probable that users show a larger intra-variability for the speech than for the signature modality. This could be amplified here because users are probably not used to slowing down their speech to the pace of handwriting. Another explanation could be the acquisition conditions, which are more difficult to control in the case of the speech signal: different positions of the microphone, environmental noise, etc. Another conclusion from Table 1 is that skilled forgeries systematically and significantly decrease the performance in comparison to random forgeries. For the protocol with time variability, a drop of about 200% relative performance is observed for the signature modality and about 50% for the speech modality. We have to note here that the skilled forgers do not try to imitate the voice of the user but actually say the genuine verbal content, which is very probably the source of the loss of performance. Also from Table 1, we can conclude that the sum fusion, although very straightforward, systematically brings a clear improvement in the results in comparison to the modalities taken alone. Interestingly, the z-norm fusion is better than the sum fusion for the protocol without time variability and is worse in the case of the protocol with time variability. An interpretation of this is proposed in [11].

4.2 Spoken Handwriting
Table 2 summarizes the results with our best MAP system (256 Gaussians for the client and world models), comparing random versus skilled forgeries.

Table 1. Summary of spoken signature results in terms of Equal Error Rates. Protocol with and without time variability, skilled and unskilled forgeries.
                          without time variability     with time variability
                          random      skilled          random      skilled
  signature               0.4 %       3.9 %            2.7 %       7.3 %
  speech                  0.8 %       2.7 %            12.4 %      17.1 %
  sum fusion (.5/.5)      0.2 %       0.9 %            1.7 %       5.0 %
  z-norm fusion (.5/.5)   0.1 %       0.7 %            2.3 %       8.6 %
Table 2. Spoken handwriting results in terms of Equal Error Rates, with time variability. Comparison of random versus skilled forgeries.
                          random      skilled
  handwriting             4.0 %       13.7 %
  speech                  1.8 %       6.9 %
  sum fusion (.5/.5)      0.7 %       6.9 %
  z-norm fusion (.5/.5)   0.3 %       4.0 %
The following conclusions can be drawn. For the handwriting, skilled forgeries decrease the performance in a significant manner. This result is understandable, as the forger is intentionally imitating the handwriting of the genuine user. For the speech signal, skilled forgeries also decrease the performance. As the forger does not try to imitate the voice of the genuine user, this result can be surprising. However, it can be explained by the fact that the forger is actually saying the exact same verbal content as the one used by the user at training time. When building a speaker model, the characteristics of the speaker are of course captured, but also, to some extent, the content of the speech signal itself. Results using the z-norm fusion are also reported in Table 2, showing an advantage over the sum fusion. As a conclusion of these experiments with spoken handwriting, we can reasonably say that the speech modelling performs on average better than the handwriting. Intuitively, one could argue that this is understandable, as handwriting is a gesture that is more or less fully learned (a behavioral biometric), while speech contains information that depends on both learned and physiological features (a behavioral and physiological biometric).

4.3 Comparison of Spoken Signatures and Spoken Handwriting
We can directly compare the results obtained with spoken signature and spoken handwriting data, as our experiments are performed on the same database, with the same users and the same acquisition conditions. Results for spoken handwriting in Table 2 can be compared with results for spoken signatures in Table 1, for the protocol with time variability. The signature modality of spoken signatures provides better results than the handwriting modality of spoken handwriting. This can be explained in the following way. Handwriting is a taught gesture that is designed to be understood by everyone: in school, every child learns to write the different characters in more or less the same way. In contrast, a signature is built to be an individual characteristic of a person that should not be imitable and that is used for authentication purposes. A comparison of the speech modality in Tables 1 and 2 shows that spoken handwriting provides better results than spoken signatures. An explanation lies in the quantity of speech data available: while the average length of the speech is about two seconds for a signature, spoken handwriting provides about two minutes of speech. The speech model is therefore more precise for spoken
handwriting than for spoken signatures. If we compare the z-norm fusion results of Tables 1 and 2, we observe that spoken handwriting performs better than spoken signatures. However, we should note that this conclusion also depends on the quantity of data: with less handwriting data, the conclusion might be reversed.
5 Conclusions and Future Work
We presented consolidated results obtained with a new user authentication system based on the combined acquisition of online handwriting and speech signals. It has been shown that the modelling of the signals can be performed advantageously using GMMs trained with a MAP adaptation procedure. A simple fusion of the GMM scores leads to significant improvements in comparison to systems where the modalities are used alone. From a usability point of view, this gain in performance is obtained at no extra cost in terms of acquisition time, as both modalities are recorded simultaneously. The proposed bi-modal speech and handwriting approach therefore seems to be a viable alternative to systems using single modalities. In our future work, we plan to investigate modelling techniques that are more robust against time variability and forgeries. We have identified potential directions such as HMMs, time-dependent score fusion, joint modelling, etc. Also, as soon as an extended set of spoken signature data becomes available, experiments will be conducted according to a development/evaluation set framework. We will also investigate whether the biometric performance is impaired by the signal deformations induced by the simultaneous recordings. Acknowledgments. We warmly thank Asmaa El Hannani for her precious help with GMM based systems. This work was partly supported by the Swiss NSF program "Interactive Multimodal Information Management (IM2)", as part of NCCR, and by the EU BioSecure IST-2002-507634 NoE project.
References 1. Reynolds, D.: An overview of automatic speaker recognition technology. In: Proc. IEEE ICASSP, vol. 4, pp. 4072–4075 (2002) 2. Plamondon, R., Lorette, G.: Automatic signature verification and writer identification - the state of the art. Pattern Recognition 22(2), 107–131 (1989) 3. Leclerc, F., Plamondon, R.: Automatic signature verification: the state of the art– 1989-1993. Int’l J. Pattern Rec. and Art. Intelligence 8(3), 643–660 (1994) 4. Liwicki, M., Schlapbach, A., Bunke, H., Bengio, S., Mari´ethoz, J., Richiardi, J.: Writer identification for smart meeting room systems. In: Proceedings of the 7th International Workshop on Document Analysis Systems, pp. 186–195 (2006) 5. Nakamura, Y., Kidode, M.: Online writer verification using kanji handwriting. In: Gunsel, B., Jain, A.K., Tekalp, A.M., Sankur, B. (eds.) MRCS 2006. LNCS, vol. 4105, pp. 207–214. Springer, Heidelberg (2006) 6. Krawczyk, S., Jain, A.K.: Securing electronic medical records using biometric authentication. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 1110–1119. Springer, Heidelberg (2005)
7. Fuentes, M., et al.: Identity verification by fusion of biometric data: On-line signature and speech. In: Proc. COST 275 Workshop on The Advent of Biometrics on the Internet, Rome, Italy, November 2002, pp. 83–86 (2002) 8. Ly-Van, B., et al.: Signature with text-dependent and text-independent speech for robust identity verification. In: Proc. Workshop MMUA, pp. 13–18 (2003) 9. Koreman, J., et al.: Multi-modal biometric authentication on the securephone pda. In: Proc. Workshop MMUA, Toulouse (2006) 10. Humm, A., Hennebert, J., Ingold, R.: Gaussian mixture models for chasm signature verification. In: 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, Washington (2006) 11. Hennebert, J., Humm, A., Ingold, R.: Modelling spoken signatures with gaussian mixture model adaptation. In: 32nd ICASSP, Honolulu (2007) 12. Humm, A., Ingold, R., Hennebert, J.: Spoken handwriting verification using statistical models. In: Accepted for publication ICDAR (2007) 13. Dumas, B., et al.: Myidea - multimodal biometrics database, description of acquisition protocols. In: proc. of Third COST 275 Workshop (COST 275), Hatfield (UK) (October 27 - 28 2005), pp. 59–62 (2005) 14. Hennebert, J., et al.: Myidea database (2005), http://diuf.unifr.ch/go/myidea 15. Humm, A., Hennebert, J., Ingold, R.: Scenario and survey of combined handwriting and speech modalities for user authentication. In: 6th Int’l. Conf. on Recent Advances in Soft Computing (RASC 2006), Canterburry, Kent, UK, pp. 496–501 (2006) 16. Humm, A., Hennebert, J., Ingold, R.: Combined handwriting and speech modalities for user authentication. Technical Report 06-05, University of Fribourg, Department of Informatics (2006) 17. Van Ly, B., Garcia-Salicetti, S., Dorizzi, B.: Fusion of hmm’s likelihood and viterbi path for on-line signature verification. In: Biometrics Authentication Workshop, Prague (May 15th 2004) 18. Reynolds, D., Quatieri, T., Dunn, R.: Speaker verification using adapted gaussian mixture models. Digital Signal Processing 10, 19–41 (2000)
A Palmprint Cryptosystem Xiangqian Wu1 , David Zhang2 , and Kuanquan Wang1 1 School of Computer Science and Technology, Harbin Institute of Technology (HIT), Harbin 150001, China {xqwu, wangkq}@hit.edu.cn http://biometrics.hit.edu.cn 2 Biometric Research Centre, Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
[email protected] Abstract. Traditional cryptosystems are based on passwords, which can be cracked (simple ones) or forgotten (complex ones). This paper proposes a novel cryptosystem based on palmprints, which directly uses the palmprint as a key to encrypt/decrypt information. The information in a palmprint is so complex that it is very difficult, if not impossible, to crack the system, while users need not remember anything to use it. In the encrypting phase, a 1024-bit binary string is extracted from the palmprint using differential operations. The string is then translated into a 128-bit encrypting key using a hash function and, at the same time, an error-correcting code (ECC) is generated. A general encryption algorithm uses the 128-bit encrypting key to encrypt the secret information. In the decrypting phase, the 1024-bit binary string extracted from the input palmprint is first corrected using the ECC. The corrected string is then translated into a decrypting key using the same hash function. Finally, the corresponding general decryption algorithm uses the decrypting key to decrypt the information. The experimental results show that the accuracy and security of this system can meet the requirements of most applications.
1 Introduction
Information security is becoming increasingly important nowadays. Cryptology is one of the most effective ways to enhance information security. In traditional cryptosystems, information is encrypted using passwords. Simple passwords are easy to memorize but also easy to crack, while complex passwords are difficult to crack but also difficult to remember. To overcome this problem, some biometric feature-based encrypting/decrypting algorithms have been developed [1, 2, 3, 4, 5, 6, 7]. The palmprint is a relatively new biometric feature [8, 9, 10, 11, 12, 13] and has several advantages compared with other currently available features [14]: palmprints contain more information than fingerprints, so they are more distinctive; palmprint capture devices are much cheaper than iris devices; palmprints also contain additional distinctive features such as principal lines and wrinkles,
Fig. 1. An example of the palmprint and the normalized image: (a) original palmprint; (b) cropped image.
which can be extracted from low-resolution images; and a highly accurate biometric system can be built by combining all the features of palms, such as palm geometry, ridge and valley features, and principal lines and wrinkles. Therefore, palmprints are well suited to implementing a cryptosystem. To date, we have not found any literature discussing palmprint encryption. In this paper, we use error-correcting theory to design a palmprint cryptosystem. When palmprints are captured, the position and direction of a palm may vary, so even palmprints from the same palm may show some rotation and translation. Furthermore, palms differ in size. Hence palmprint images should be oriented and normalized before feature extraction and matching. In this paper, we use the preprocessing technique described in [13] to align and normalize the palmprints. After preprocessing, the central part of the image, which is 128 × 128, is cropped to represent the whole palmprint. Fig. 1 shows a palmprint and the normalized image. The rest of this paper is organized as follows. Section 2 describes the feature extraction and matching. Section 3 presents the palmprint cryptosystem. Section 4 contains some experimental results and analysis. Section 5 provides some conclusions.
2 Feature Extraction and Matching
2.1 DiffCode Extraction
Let I denote a palmprint image and Gσ denote a 2D Gaussian filter with variance σ. The palmprint is first filtered by Gσ as below:

If = I ∗ Gσ,    (1)

where ∗ is the convolution operator. Then the difference of If in the horizontal direction is computed as follows:

D = If ∗ b,    (2)
b = [−1, 1],    (3)

where ∗ is the convolution operator. Finally, the palmprint is encoded according to the sign of each pixel of D:

C(i, j) = 1 if D(i, j) > 0, and 0 otherwise.    (4)
C is called the DiffCode of the palmprint I. The size of the preprocessed palmprint is 128 × 128. Additional experiments show that a 32 × 32 image is enough for DiffCode extraction and matching. Therefore, before computing the DiffCode, we resize the image from 128 × 128 to 32 × 32. Hence the size of the DiffCode is 32 × 32. Fig. 2 shows some examples of DiffCodes. As this figure shows, the DiffCode preserves the structural information of the lines on a palm.
Fig. 2. Some examples of DiffCodes. (a) and (b) are two palmprint samples from a palm; (c) and (d) are two palmprint samples from another palm; (e)-(h) are the DiffCodes of (a)-(d), respectively.
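The DiffCode computation of Eqs. (1)-(4) can be sketched as follows. The value of σ, the block-averaging resize and the SciPy helpers are assumptions; the paper does not specify these implementation details.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve

def diffcode(palm_img, sigma=1.0):
    """Sketch of the 32x32 DiffCode of a preprocessed 128x128 palmprint."""
    # Resize 128x128 -> 32x32 by averaging 4x4 blocks (one simple choice).
    img = palm_img.astype(float).reshape(32, 4, 32, 4).mean(axis=(1, 3))
    smoothed = gaussian_filter(img, sigma)               # If = I * G_sigma, Eq. (1)
    diff = convolve(smoothed, np.array([[-1.0, 1.0]]))   # horizontal difference, Eqs. (2)-(3)
    return (diff > 0).astype(np.uint8)                   # sign encoding, Eq. (4)
```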
2.2 Similarity Measurement of DiffCode
Because all DiffCodes have the same length, we can use the Hamming distance to define their similarity. Let C1, C2 be two DiffCodes; their Hamming distance H(C1, C2) is defined as the number of the places where the corresponding values of C1 and C2 are different. That is,

H(C1, C2) = Σ_{i=1}^{32} Σ_{j=1}^{32} C1(i, j) ⊗ C2(i, j),    (5)

where ⊗ is the logical XOR operation.
The matching distance of two DiffCodes C1 and C2 is defined as the normalized Hamming distance:

D(C1, C2) = H(C1, C2) / (32 × 32).    (6)
Actually, D(C1, C2) is the percentage of places where C1 and C2 have different values. Obviously, D(C1, C2) lies between 0 and 1, and the smaller the matching distance, the greater the similarity between C1 and C2. The matching score of a perfect match is 0. Because of imperfect preprocessing, there may still be a small translation between palmprints captured from the same palm at different times. To overcome this problem, we vertically and horizontally translate C1 by a few points to obtain the translated C1T and, at each translated position, compute the matching distance between C1T and C2. The final matching distance is taken to be the minimum matching distance over all translated positions.
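A minimal sketch of the matching described above; the wrap-around handling of shifted borders (np.roll) is an assumption, since the paper does not state how border pixels are treated after translation.

```python
import numpy as np

def matching_distance(c1, c2, max_shift=2):
    """Normalised Hamming distance between two 32x32 DiffCodes, Eqs. (5)-(6),
    minimised over small vertical/horizontal translations of c1."""
    best = 1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(c1, dy, axis=0), dx, axis=1)
            d = np.count_nonzero(shifted ^ c2) / (32 * 32)
            best = min(best, d)
    return best
```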
3 Palmprint Cryptosystem
In general, the palmprints captured from the same hand at different times are not exactly the same. However, they are similar enough to establish that they come from the same hand. That is, when the matching distance between the DiffCodes C1 and C2 is less than a threshold T, they should be regarded as being computed from the same hand, and C2 should be able to decrypt information which was encrypted using C1. However, in general symmetric cryptosystems (e.g., AES), decryption cannot succeed if the encrypting key and the decrypting key are not exactly the same. To overcome this problem, we must transform C2 into C1 before using it for decryption. Since both C1 and C2 are binary strings of the same length, we can use error-correcting coding theory to encode C1 and obtain its error-correcting code, which can correct fewer than T × 1024 errors, and then use this error-correcting code to correct C2. If the matching distance between C1 and C2 is less than T, which means that C1 and C2 come from the same hand, C2 can be exactly transformed into C1 using the error-correcting code, and the corrected C2 can then be used for decryption. The principle of the palmprint cryptosystem is shown in Fig. 3. In the encrypting phase, the 32 × 32 = 1024-bit DiffCode is extracted from the palmprint. The DiffCode is then encoded into a fixed-length palmprint key (HC) using a hash function (e.g., MD5) and, at the same time, an error-correcting code (ECC) of the DiffCode is generated using an existing algorithm (e.g., BCH). A general encryption algorithm (e.g., AES) uses this palmprint key to encrypt the secret information S. In the decrypting phase, the 1024-bit DiffCode extracted from the input palmprint is first corrected using the ECC. The corrected string is then encoded into a palmprint key (HC) using the same hash function. Finally, the corresponding general decryption algorithm uses this key to decrypt the information (S). To overcome the translation problem, we take the 144 × 144 central part of the palmprint in the preprocessing of the decryption phase, and then resize it to 36 × 36 to compute the DiffCode. That is, in the decryption phase, we obtain a DiffCode of size 36 × 36. From this larger DiffCode, we can obtain 25 DiffCodes of size 32 × 32, which are used one by one for decryption until one succeeds. This process is equivalent to translating the DiffCode vertically and horizontally from −2 to +2 points.

Fig. 3. Palmprint cryptosystem: (a) encrypting phase; (b) decrypting phase.
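The two phases of Fig. 3 can be sketched as below. hashlib.md5 is the hash named by the paper; bch_parity, bch_correct, aes_encrypt and aes_decrypt are hypothetical helper names standing in for a BCH codec and an AES implementation, and the DiffCode bits are passed one value per byte for brevity rather than packed 8 bits per byte.

```python
import hashlib

def enroll(diffcode_bits, secret, bch_parity, aes_encrypt):
    # Encrypting phase: derive the 128-bit palmprint key HC and the helper ECC.
    key = hashlib.md5(bytes(diffcode_bits)).digest()   # 16 bytes = 128 bits
    ecc = bch_parity(diffcode_bits)                    # stored alongside the ciphertext
    return aes_encrypt(key, secret), ecc

def release(candidate_diffcodes, ciphertext, ecc, bch_correct, aes_decrypt):
    # Decrypting phase: try the 25 translated 32x32 DiffCodes one by one.
    for bits in candidate_diffcodes:
        corrected = bch_correct(bits, ecc)
        key = hashlib.md5(bytes(corrected)).digest()
        plain = aes_decrypt(key, ciphertext)           # assume None on failure
        if plain is not None:
            return plain
    return None
```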
4 Experimental Results and Analysis
We employed the PolyU Palmprint Database [15] to test our system. This database contains 7,752 grayscale images captured from 386 different palms by a CCD-based device. These palmprints were taken from people of different ages and both sexes, and were captured in two sessions, at an interval of around two months, each time taking about 10 images from each palm. Therefore, the database contains about 20 images of each palm. The size of the images in the database is 384 × 284. In our
experiments, all images were preprocessed using the preprocessing technique described in [13] and the central 128 × 128 part of the image was cropped to represent the whole palmprint. In the system, the hash, error-correcting and encrypting algorithms are MD5, BCH and AES, respectively. For an (n, k, t) BCH code, n, k and t denote the length of the code, the length of the information and the number of errors which can be corrected by this code, respectively. For our system, t can be computed from the distance threshold T as follows:

t = 1024 × T.    (7)

And k should satisfy the following condition:

k ≥ 1024.    (8)
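A small helper expressing Eqs. (7) and (8); the rounding of 1024 × T to the nearest integer is an assumption, chosen to be consistent with the threshold/error-bit pairs in Table 1.

```python
def bch_requirements(T, n_bits=1024):
    # Eq. (7): number of correctable errors implied by the distance threshold T
    # (e.g. T = 0.2949 gives t = 302).
    t = round(n_bits * T)
    # Eq. (8): the BCH message length k must be at least n_bits; if k > n_bits,
    # the DiffCode is zero-padded before encoding.
    k_min = n_bits
    return t, k_min
```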
If k > 1024, we can append (k − 1024) zeros to the 1024-bit DiffCode to obtain a message of length k and then encode it using BCH encoding. Therefore, for the error-correcting encoding, we should know the distance threshold of the system, which depends on the application. To investigate the relationship between the threshold and accuracy, each sample in the database is matched against the other palmprints in the same database. A matching between palmprints captured from the same palm is defined as a genuine matching; otherwise, the matching is defined as an impostor matching. A total of 30,042,876 (7,752 × 7,751/2) matchings have been performed, of which 74,086 are genuine matchings. The FAR and FRR at different thresholds are plotted in Fig. 4. Some typical FARs, FRRs, the corresponding thresholds and the numbers of error bits are listed in Table 1. A threshold can be selected according to the requirements of the application. In our experiments, we choose the distance threshold as 0.2949. According to Table 1, the corresponding FAR, FRR and number of errors which should be corrected are 0.0012%, 3.0169% and 302, respectively. According to the theory of BCH error-correcting codes, a (4095, 1412, 302) BCH code can be used in our system. We now analyze attacks on this system. If the attack happens at Point A (see Fig. 3), that is, the attacker uses some palmprints to attack the system, the probability of successfully decrypting the message is about 0.0012% ≈ 10^−5, which means that to decrypt the message a cracker has to try about 10^5 different palmprints, and it is very difficult to obtain so many palmprints in a short time. If the attack happens at Point B (see Fig. 3), that is, the cracker attacks the system by directly generating the DiffCode for error correction, the probability of successfully decrypting the message is

p = (C_1024^0 + C_1024^1 + · · · + C_1024^301 + C_1024^302) / 2^1024 ≈ 2^−134.    (9)

If the attack happens at Point C (see Fig. 3), that is, the cracker generates the corrected DiffCode to attack the system, the probability of success is 2^−1024. If the attack happens at Point D (see Fig. 3), that is, the cracker generates the hashed code to attack the system, the probability of success is 2^−128.
Fig. 4. The FAR and FRR at different thresholds (error rate in % versus matching distance).

Table 1. Typical FAR, FRR, corresponding thresholds and numbers of error bits

Threshold   Number of Error Bits   FAR (%)   FRR (%)
0.3799      389                    2.6862    0.0240
0.3750      384                    2.0133    0.0283
0.3701      379                    1.4784    0.0382
0.3652      374                    1.0976    0.0523
0.3604      369                    0.7799    0.0764
0.3555      364                    0.5458    0.1018
0.3496      358                    0.3786    0.1386
0.3447      353                    0.2558    0.2150
0.3398      348                    0.1663    0.2899
0.3350      343                    0.1092    0.3791
0.3301      338                    0.0663    0.4964
0.3252      333                    0.0399    0.6973
0.3203      328                    0.0238    0.9038
0.3154      323                    0.0135    1.2277
0.3096      317                    0.0075    1.5445
0.3047      312                    0.0042    1.9533
0.2998      307                    0.0022    2.4370
0.2949      302                    0.0012    3.0169
0.2900      297                    0.0006    3.6788
0.2852      292                    0.0003    4.4751
5 Conclusions
This paper proposed a palmprint cryptosystem. The system extracts a binary DiffCode feature from the palmprint and uses error-correcting theory to remove the difference between DiffCodes from the same palm. The system can effectively encrypt and decrypt messages, and it is almost impossible to crack.
Acknowledgements This work is partially supported by the National Natural Science Foundation of China (No. 60441005), the Key-Project of the 11th-Five-Year Plan of Educational Science of Hei Longjiang Province, China (No. HZG160), the Science and Technology Project of the Education Department of Hei Longjiang Province (No. 11523026) and the Development Program for Outstanding Young Teachers in Harbin Institute of Technology.
References 1. Uludag, U., Pankant, S., Prabhakar, S., Jain, A.K.: Biometric cryptosystems: issues and challenges. Proceedings of the IEEE 92, 948–960 (2004) 2. Freire-Santos, M., Fierrez-Aguilar, J., Ortega-Garcia, J.: Cryptographic key generation using handwritten signature. In: Proc. of SPIE, Biometric Technologies for Human Identificatin III (2006) 3. Uludag, U., Pankant, S., Jain, A.K.: Fuzzy vault for fingerprints. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 310–319. Springer, Heidelberg (2005) 4. Monrose, F., Reiter, M.K., Li, Q., Wetzel, S.: Using voice to generate cryptographic keys. In: A Speaker Odyssey, The Speaker Recognition Workshop, pp. 202–213 (2001) 5. Juels, A., Sudan, M.: A fuzzy vault scheme. In: Proc. IEEE International Symposium on Information Theory, IEEE Computer Society Press, Los Alamitos (2002) 6. Soutar, C., Roberge, D., Stojanov, S.A., Gilroy, R., Kumar, B.V.K.V.: Biometric encryption. ICSA Guide to Cryptography (1999) 7. Monrose, F., Reiter, M.K., Li, Q., Lopresti, D.P., Shih, C.: Towards speechgenerated cryptographic keys on resource constrained devices. In: Proc. 11th USENIX Security Symposium, pp. 283–296 (2002) 8. Zhang, D.: Palmprint Authentication. Kluwer Academic Publishers, Dordrecht (2004) 9. Wu, X., Zhang, D., Wang, K.: Palmprint Recognition. Scientific Publishers, China (2006) 10. Wu, X., Wang, K., Zhang, D.: Fisherpalms based palmprint recognition. Pattern Recognition Letters 24, 2829–2838 (2003) 11. Duta, N., Jain, A., Mardia, K.: Matching of palmprint. Pattern Recognition Letters 23, 477–485 (2001) 12. Han, C., Chen, H., Lin, C., Fan, K.: Personal authentication using palm-print features. Pattern Recognition 36, 371–381 (2003) 13. Zhang, D., Kong, W., You, J., Wong, M.: Online palmprint identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1041–1050 (2003) 14. Jain, A., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology 14, 4–20 (2004) 15. PolyU Palmprint Palmprint Database (http://www.comp.polyu.edu.hk/∼ biometrics/)
On Some Performance Indices for Biometric Identification System Jay Bhatnagar and Ajay Kumar Biometrics Research Laboratory Department of Electrical Engineering Indian Institute of Technology Delhi, New Delhi, India
[email protected],
[email protected] Abstract. This paper investigates a new approach to formulate performance indices of biometric system using information theoretic models. The performance indices proposed here (unlike conventionally used FAR, GAR, DET etc.) are scalable in estimating performance of large scale biometric system. This work proposes a framework for identification capacity of a biometric system, along with insights on number of cohort users, capacity enhancements from user specific statistics etc. While incorporating feature level information in a rate-distortion framework, we derive condition for optimal feature representation. Furthermore, employing entropy measures to distance (hamming) distribution of the encoded templates, this paper proposes an upper bound for false random correspondence probability. Our analysis concludes that capacity can be the performance index of a biometric system while individuality expressed in false random correspondence can be the performance index of the biometric trait and representation. This paper also derives these indices and quantifies them from system parameters. Keywords: Identification capacity, Joint source-channel coding, Individuality, FRC (false random correspondence probability).
1 Introduction
There has been significant interest in large-scale applications of biometrics for secure personal authentication. Proposing scalable performance indices for large-scale identification systems using biometrics is a challenging problem in biometrics research. The challenge lies in devising indices and evolving their interdependence to convey security, accuracy, privacy and other measurable characteristics of biometric systems. The authors in [1] discuss system-specific indices such as FAR, FRR, GAR, ROC, etc., as measures of the imperfect accuracy in relation to signal capacity and representation limitations. However, error rates are performance indices that depend on the choice of threshold and on the user at that threshold. Further, error rates are computed by integrating regions of known distributions (genuine and impostor). It would be computationally efficient (for reasons of scalability) to develop an average performance measure of a biometric identification system. If such a measure can be computed from constrained statistical information instead of complete distributions, it leads to efficient tools for proposing performance models [2].
Information conveyed in biometric is considered to be inherent part of individuality of a person, and a natural code generated from user specific characteristics – physiological or behavioral. A population of user biometric templates which is uniquely indexed (uniqueness) and maximally separable (distinctiveness) is desirable for reliable identification, however this may not guarantee security [1]. It is well known that in practical applications, user biometric data does not provide error free (perfect) identification. Loss of uniqueness and distinctiveness can be attributed to: (1) statistical variations in biometric data of a user (in-class) and, (2) statistical variations in the biometric data after inter-user comparisons. Authors of [3] allude the need to model information content (statistical measure) of biometric template in relation to the number of users that can be reliably identified. In the same work [3], the above reference to information content is stated as template capacity or inherent signal capacity (and individuality) of a biometric. Authors in [4] use Kullback-Leibler` divergence or the relative entropy D ( p || u ) between (p) population distribution and (u) user distribution per feature to model uniqueness and individuality for face biometric. The approach of [4] demonstrates one of the few important distance measures from information theory perspective. Capacity and error rates have been widely known in information theory literature [2]. For the framework proposed here, Capacity represents: how many users on average can be reliably recognized by the identification system, for a given data base and noise model. Restated, capacity ties the decision accuracy in terms of number of users to database quality and system noise. Similarly, individuality sets a limit to identification capacity in absence of noise (in-class). The key contributions of this paper can be summarized as follows: (i) Formulate and quantify capacity of a biometric identification system (section 3) (ii) Estimate the average number of cohort users from the capacity derived in (i) (iii) Derive false random correspondence probability with optimality for feature representation (section 4) Biometric signals of our interest could be at feature level or score level (class genuine and imposter). Several reasons motivate the choice of matching score and feature template, respectively, in the proposed analysis of capacity and individuality. Due to higher dimensionality, it is computationally difficult to extract accurate crossuser information (inter-class statistics) at feature level; though the same can be tractable using matching scores. Also, with number of users tending to infinity (in the asymptote), the i.i.d. (independent identically distributed) assumption for feature vectors is weakened due to correlation present in multiple biometric acquisitions. Traditionally, signal received at the channel output is used for making decision in communication system [2]. Analogously, for biometric system it is matching score which represents the signal space used for decision making and not the query template itself. Therefore, in the framework for capacity of biometric system it is proposed to employ statistics from matching scores. Uniqueness and distinctiveness are statistical properties of biometric feature template and the feature representation, which together define individuality in this work. Hence, discussion on individuality (section 4) entails feature level information.1 1
Information: refers Shannon measure for statistical information, expressed in terms of entropy [2].
On Some Performance Indices for Biometric Identification System
1045
2 Motivation: Review of the Noisy Source-Channel Model

In the basic information model given by Shannon [5], i.i.d. (independent identically distributed) symbols S from a source with self-information or entropy H(S) are transmitted on a memoryless additive white Gaussian noise channel N. Information theorems define maximum and minimum bounds on information rates for achievable reliability and fidelity, respectively [2]. Reliability (capacity) gives an average measure of erroneous decisions due to noisy observations made over an unreliable channel; it measures the accuracy in identifying the source constrained by the channel noise. Fidelity gives an average measure of error in the representation of the biometric source. The information theorems on source coding and information capacity cover these concepts [2]. A generalized form of the information theorems is the joint source-channel coding theorem [6], which marks the need for a combined source-channel model. This result states that a loss in signal-to-noise ratio (alternately, a loss of discernability) at the source coder has an equivalent measure in a loss of channel reliability (an increase in the bit error rate of the channel); this is also known as the source-channel separation principle. The separation principle provides us with a framework to decompose the class genuine (and, likewise, class imposter) distribution separately into source statistics and noise statistics, which otherwise may be perceived only as a source distribution. We now formulate the capacity of a biometric identification system. Biometric system capacity: the maximum information (maximum number of users) per use of the noisy identification channel (in-class, inter-class variations) for which identification can be guaranteed at near-zero error rate. Capacity (pattern recognition context): capacity is measured as the maximum average mutual information on the observation space of all measurable patterns X containing class k, which maximizes the occurrence of a pattern of class k given the distribution of k. The average mutual information gives an average measure of the statistics for the optimal decision rule (maximum likelihood) [2]:

C =: max I(S; X), ∀ p(S).    (1)
Fig. 1. Schematic of Biometric systems as identification channel
The linear channel model shown in the figure above assumes Gaussian noise N ~ fN(0, σN²) and a desired signal modeled with statistics S ~ f(0, SG). The resulting expression for capacity can be shown [2] to be:

C = (1/2) log2 (1 + SG / σN²).    (2)
The main reasons that influence choice of Gaussian are [4]: (a) Closed form expression is analytically tractable, (b) Tends to be a good reflection of real world distributions, and (c) Has the highest entropy for given variance and is useful in estimating bounds. In this paper variance has been used interchangeably for statistics. In context of the proposed approach signal statistics will cover all desired information or signals of interest for identification. Similarly, noise statistics will capture the mechanisms that restrict reliable identification.
3 A New Framework for Capacity

In this section we combine the tools discussed so far to develop a model for estimating the capacity of the biometric system. From a signal theory perspective [7], M registered users are required to be represented by an M-ary signal probability space. However, measures on the signal space of matching scores can be uniquely defined for a single user. A challenging problem in formulating capacity with score-level information for M users is that different users registered in the biometric system have different sample spaces (or class distributions) and a unique probability space cannot apparently be defined. The following model removes this shortcoming and facilitates the use of score-level signals to formulate a new framework for biometric system capacity. Figure 2 depicts a model for estimating identification capacity. Let {ĝm} and {îm} denote the median scores taken from the genuine and imposter distributions for user m ∈ M, at which the respective distributions peak. Let A be the desired user transmitting the information sequence {ĝm} and let B be the interfering user trying to corrupt these symbols by transmitting an information sequence {îm}.

Fig. 2. A two channel biometric identification system
1047
∧
Thus for each transmitted genuine signal from { g m } in the indexing order m, B transmits an interference symbol by transmitting the same indexed imposter signal ∧
∧
∧
from { i m } . Further, information symbols { g m } and { i m } are subject to random spreading given by additive white Gaussian channels. The spreading for A is from a genuine Gaussian channel f N ( 0 , σ g2 ( m ) ) and that for B is from imposter Gaussian
channel f N ( 0 , σ i2( m ) ) .Variances given by
σ g2 , σ i2
respectively are averages for
genuine and imposters from all enrolled users M. (Figure 2) and denote the average trend to model the biometric system2 noise. This almost completes formulating the model for M-ary hypotheses on a single probability space [8]. The resulting model looks Neyman-Pearson type used in jamming/intrusion systems, with target A and interferer B. Figure 3 is simplified form of Figure 2 incorporating a the worst case additive noise f N ( 0 , σ i2 ) for both sets of information sequences. Figure 4 ∧
∧
illustrates a useful statistical distance measure between { g m } = s 1 and { i m } = s 2 given by d m . Discernability (decidability index) improves with increase in d m . As observed in Figure 3, the transmitter is a 2-ary or binary hypothesis transmitter for M signal pairs that are randomly spaced, since d m is a random variable. The resulting average
signal
energy
for
randomly
chosen
signal
pair
can
be
given: ( s + s ) / 2 ≅ 2.5d , on the assumption that s1 ≅ d m . The average for M2 1
2 2
2 m
2
ary set of signal pairs can be given as 2.5d m . This completes formulating the model for biometric identification channel. Therefore, the signal to noise ratio as required in (2) can be expressed as:
SG
σ
2 N
=
2 .5 d max( σ
2 g
2
,σ
2 i
.
)
Fig. 3. A Generalized model for biometric identification system with random coding
(3)
1048
J. Bhatnagar and A. Kumar
Fig. 4. Statistical distance measures and distributions for user m
The achievable rate R of identification channel can be given using (2):
C1 =
1 2 .5 d 2 log 2 [1 + ]. 2 max( σ g2 , σ i2 )
(4)
For model variations to incorporate new information/statistics (independent to & additional of existing signal statistics) and its effect on capacity, variance of new statistics can be added in the numerator of the log argument in (4), similarly any additions to noisy statistics can be accounted in the denominator. One such example is that of capacity improvement from incorporating quality indices at score level. If we apply the fact that the median (peak) genuine scores are different for different users, then the variability of the peaking of matching score from different users, offers a user-specific measure to improve classification performance [9]. Performance improvement using this signal statistics can be given by extracting the variance of the ∧
peaks of genuine scores from all M users, denoted as σ work new system capacity can be given by:
2 g
. Then under revised frame
1 σ g + 2 .5 d 2 . C 2 = log 2 [1 + ] 2 max( σ g2 , σ i2 )
(5)
∧ 2
We will now propose the last setting formulating capacity with use of cohorts score. In the inception of cohorts [9], it was shown that an extended score template incorporating neighborhood non-matching scores and a matching score gives substantial improvement in classification performance. Not all non-matching scores do well in the cohorts set. Only close non-matching scores/users to the class matching score will constitute a complement set which gives an appended decision space or increased confidence level of decision. The notion of cohorts illustrates that loss of uniqueness (and, accuracy) due to system noise can be compensated by exploiting the structure of interfering bins [9] and augmenting close non-matching scores in the new decision space. It is proposed to employ one possible selection of cohorts using 3σ bound. 3σ rule is a loose probability bound in Markov sense [8], in that it includes all scores lying within 33 % probability about the mean/median genuine score. Markov inequality is a weak approximation as it gives the loosest bound in estimating
On Some Performance Indices for Biometric Identification System
1049
the spread of a random variable about its median. The rule of 3σ will select ± σ ,±2σ ,±3σ points about the peak matching (genuine) score for each user to estimate cohorts. Also towards the tails of the class genuine distribution, a nonmatching score may not give accurate information on the class/ class neighbors. Denote the average of such M-variances as σ c . 2
C3 =
1 log 2 [1 + 2
σ
2 c
∧ 2
+σ
g
+ 2 .5 d
2
(6)
].
max( σ g2 , σ i2 )
From the definition of identification capacity the number of users ( χ ) that can be identified reliably.
χ = CM .
(7)
If, C ∈ [0,1] , then 1 − C = C ∈ [0,1] . This gives the complement set C , of the maximum average unreliability of the identification channel. In terms of average number of close/unreliable users, it can be formulated that;
N = CM , N ≤ M .
(8)
For a fixed number of users M, greater the value of C, smaller is the useful cohort set given by (8). This is particularly the case with highly individual biometrics. For highly individual biometrics, C is expected to be closer to 1 than 0.5 (also, validated by results for capacity in section 5). However, the actual number of (useful) close cohorts for users varies with user. A variation of (4) as given below can provide per user capacity. The user specific cohort requirement can be obtained as follows:
Cm =
2 . 5 d m2 1 log 2 [1 + 2 max( σ g2 ( m ) , σ
2 i(m )
)
(9)
].
Similarly, extrema of identification capacity for biometric system can be proposed to give peak capacity and minimum capacity based on the following expressions obtained from system parameters. (
SG
V
2 N
) max
(
2 . 5 d m2 max( V g2 ( m ) , V
2 i(m )
)
) max
,
(
SG
V
2 N
) min
(
2 . 5 d m2 max( V g2 ( m ) , V
2 i(m )
)
) min (10)
4 Individuality and False Random Correspondence Probability In biometrics literature, the individuality is related to and expressed in terms of probability of false random correspondence [10]. The prototype proposed in [10] for random correspondence of fingerprint defines a tolerance area which actually is tolerable distortion measure, as we introduce later in this section. Random correspondence was introduced in [10] as the limiting error rate arising from intra
1050
J. Bhatnagar and A. Kumar
class variability given noisy versions of a user template. Infact, intra class variability contains some measure of random correspondence along with system noise. For a really noisy biometric system, attributing intra class variability to random correspondence may be quite inaccurate. Individuality: As minimum information computed on population statistics comprising of templates from all users (one per user), subject to hamming distortion given by δ. FRC (false random correspondence): Probability that two different user templates are δ similar, for which the rate distortion and distortion rate exist and are equivalent. In documenting iris as a highly individual biometric, Daugman [11] used variability measures at feature level characterized for Gabor features. However, after [11] not much work on random correspondence in the literature employed entropy or entropy rate (rate distortion theory). A plausible explanation for the above could be that the Daugman’s approach in [11] used an exhaustive database on which probability measures and variability were shown to fit a binomial distribution for normalized hamming distances (scores) owing to encoding scheme from the Gabor features. Thus [11] illustrated an empirical model for uniqueness of iris which was an accurate asymptote for population statistics. We point here that encoding refers to feature representation method and its binarization. Author in [11] showed that encoding iris based on Iris code of Gabor features gives code with binomial distribution (independent realizations of many Bernoulli) with peak at 0.5 which signifies maximum source entropy code for the binomial family. By encoding variations based on hamming distance, source coding used in [11] applies a hamming distortion metric to locally source code partitions of the template along a trajectory in the iris. This condition also provides a maximally spaced code book as entropy of hamming distance (which is binomial distributed) peaks at 0.5 for best uniqueness, a clairvoyant choice in favor of the Gabor representation. It is therefore good to infer that uniqueness/individuality is dependent on the choice for representation of feature statistics. Applying rate-distortion concepts [12] it is proposed that approach employed in [11] and the approach in [10] are two related manifestations of a more general approach. Furthermore, it is analytically difficult to accurately evolve parametric models for larger number of features and/or larger population that can be generalized for all biometrics [12]. In this section the rate-distortion frame work will be revived to propose that: 1) individuality is a property of the source that can be measured in representation statistics (source coding features) given some minimum distortion, 2) individuality is related to false random correspondence probability for a distortion constraint δ. The purpose is to seek the minimum information rate for an average distortion constraint or the average information rate for minimum distortion (both are equivalent approaches in giving optimal source representation [12]).For a choice of representation method (PCA, LDA, Gabor etc.) and given a distortion constraint, smaller the rate-distortion more efficient is the representation technique. In that sense, rate-distortion is a direct measure of individuality. We propose to use a minimum limiting distortion to generate a source code book and study the distance distribution on this code book to formulate false random correspondence. 
This approach is rationally similar to that discussed in [2]. The following corollary is useful in providing the analytical basis to false random correspondence based on distance measures.
On Some Performance Indices for Biometric Identification System
1051
∧
Corollary 1. Given a source code book F that lies in the rate-distortion region for
avg. minimum distortion D , If the distribution of hamming metric defined on source ∧
code book F denoted by {d i }, ∀i ∈ I ≥
KM follows a binomial distribution with
probability = 0.5, then minimum of achievable rate condition is implied. Proof: We use result proved as bound for rate-distortion in appendix A ∧
∧
I ( f k ; f k ) ≥ h( f k ) − h( f k − f k ) . ∧
I ( fk ; f k ) ≥
(11)
∧ 1 log e (2πe)σ k2 − h( f k − f k ) . 2
(12)
We use the following inequality for any source coding model [14] ∧
∧
h ( f k ) ≥ h( f k ) ≈ h( d k ) ≥ h( f k − f k ) .
(13)
h(d k ) , denotes the entropy of hamming distance distribution, hamming distance computed between code vectors for feature k ∈ K .The variability/uncertainty in d k denotes average hamming distortion, for equivalent to variance of f k . ∧
h(d k ) ≥ h( f k − f k ) is important to tighten the inequality further in (12). Also, for k independent observations (appendix B).
∑ I( f
∀k ∈K
∧
k; f k)≥
1 log e (2πe)σ k2 − h(d k ) = R( D) ∀k∈K 2
∑
(14)
Given d k ∈ (0,1,...., N ) is a discrete random variable, {d k } gives independent realizations of K random variables for K different features. We define, d i =
∑d
k
; i ∈ I . This gives realizations from independent K
∀k∈K
partitions and hence independent random variables. (14) shows addition of hamming metrics from K partitions (features) of M-templates to give the hamming distance ∧
distribution for F . We characterize the independence of hamming metrics { d i } by binomial distribution parameterized by f ( p, I ) [8]. Minimization of (14) is possible if
entropy
h( d i ) =
∑h
(m)
(d k ) is
maximum,
entropy
for
k∈K
binomial f ( p, I ) maximizes for p = 0.5, proving the corollary. (Indexing by m denotes a template-m which can have a maximum I as hamming metric in distance computations across the M template population). Therefore, a minimum rate achievability using (14) can also be viewed as the condition for accurately transferring
1052
J. Bhatnagar and A. Kumar
uniqueness from source statistics onto the space of representation statistics, given the distortion constraint. Conversely, a source representation for which the above condition is achieved with equality gives optimal rate-distortion. As a simple illustration, we cite the Iris code and its optimality in capturing uniqueness of iris biometric using a Gabor representation [11]. The corollary with propositions presented in this section help in formulating the false random correspondence probability of a biometric. If, we approximate ∧
H ( F ) ≅ H ( F ) = h ( d ) Since, h ( d ) ≥ H ( D ) , Define, d i ≤ δ ; δ > 0 ; where
δ
denotes tolerable hamming distortion which is binary hamming metric for tolerable
distortion given by D .The tolerable hamming threshold given by δ corresponds to conditional entropy given by h( d / d ≤ δ ) . An exponent of conditional entropy to base 2 gives the average total number of sequences that lie within a hamming distance of δ . This gives an analytical expression for the average false random correspondence probability as:
2 h ( d / d ≤δ ) P (false correspondence) = . 2 h(d )
(15)
Clearly, from (15), increase in δ reflects greater distortion, thereby increasing the numerator with conditional entropy in the exponent. This results in a higher false correspondence. The basic form of entropy function will depend on the choice of feature representation. Thus, uniqueness is dependent on the choice of feature representation. An interesting observation from this section is in noting that (15) is the zero noise reliability of the biometric system. Thus, (15) defines a minimum (best) error rate of noiseless biometric system, for a target population.
5 Experiments In order to estimate the capacity formulated in section 3, we perform experiments on real biometric samples. Hand images of 100 users (10 images of each user) employed in [13] were used to simultaneously extract palmprint and hand geometry features. The feature extraction for each of the modalities is same as detailed in [13]. Class genuine and imposter score distributions, using Euclidean distance, were generated to ∧ 2
extract the following parameters: σ , d , σ , σ g and σ c . 2 i
2
2 g
2
Table-1 as given
below shows system parameters. Some significant insights can be acquired from Table 2 which shows capacity and number of cohort users based on formulae (6), (7), (8) and (9). Capacity increase from incorporating user specific statistics and cohorts is about twice the capacity prior to the additional statistics, this explains improved identification performance of the biometric system as a result of incorporating additional information gains. Clearly,
Table 1. System parameters from the experiments

Biometric Modality    σg²       σi²       d̄²        σ̂g²       σc²
Palmprint             1.5e+4    6.3e+4    2.2e+4    1.4e+4    3.5e+4
Hand geometry         2.988     7.652     1.295     0.893     2.55
Table 2. System capacity and required number of cohorts

Biometric Modality    C1      C2      C3      N1     N2
Palmprint             0.45    0.53    0.7     55     47
Hand geometry         0.25    0.31    0.45    75     69
Clearly, palmprint gives a superior identification performance, based on the measurements for this database. The difference in identification capacity between palmprint and hand geometry remains approximately the same for every additional statistic. N1 and N2 indicate the number of cohort users for two distinct cases: (1) raw capacity and (2) capacity with user-specific statistics; adding the user-specific statistics reduces the required number of cohorts.
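The capacities and cohort counts of Table 2 can be reproduced from the Table 1 parameters with Eqs. (4)-(8); the snippet below is only a numerical check, with M = 100 users as in the experiments.

```python
import math

def capacity(d2, var_g, var_i, extra=0.0):
    # Eqs. (4)-(6): C = 0.5 * log2(1 + (extra + 2.5*d2) / max(var_g, var_i)).
    return 0.5 * math.log2(1.0 + (extra + 2.5 * d2) / max(var_g, var_i))

# Palmprint parameters from Table 1.
var_g, var_i, d2, var_pk, var_co = 1.5e4, 6.3e4, 2.2e4, 1.4e4, 3.5e4
M = 100
C1 = capacity(d2, var_g, var_i)                          # ~0.45, as in Table 2
C2 = capacity(d2, var_g, var_i, extra=var_pk)            # ~0.53
C3 = capacity(d2, var_g, var_i, extra=var_pk + var_co)   # ~0.70
N1, N2 = round((1 - C1) * M), round((1 - C2) * M)        # Eq. (8): 55 and 47 cohorts
print(C1, C2, C3, N1, N2)
```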
6 Conclusions

This paper formulates a new framework for performance indices of a biometric identification system. Section 3 proposed a system-level performance index, namely the identification capacity. It was also illustrated in Section 3 that boosting capacity is possible using extra information/variability from statistics such as user quality and cohorts. Capacity is useful to compare noisy performance at the system level and hence can be considered a local performance index. Experimental results in Section 5 gave real values for capacity from the input statistics and an inference on the number of cohort users. Section 4 formulates a generalized approach for false random correspondence (and individuality) using rate-distortion and a Hamming distortion measure. Individuality is a global performance index, in that it gives the upper limit to the performance of the biometric trait for a noiseless biometric system. It is also important to point out that the information theoretic approach relies mainly on statistical averages; hence performance indices such as capacity and false random correspondence indicate average trends. Several open problems spring from this analysis; the most natural is asking how capacity can be related to a minimum test sample size and training sample size to
guarantee a stable ROC, since both capacity & reliability are system performance indices like ROC. Study of individuality is apparently closer to the theory of evolution than information theory. Our future work will consider extending the current analysis to other biometric traits such as fingerprint, iris, face, and gait and study the resulting performance measures from experimental work. A more challenging problem will be to evolve performance indices for multi biometric system that employ various types of fusion principles.
Acknowledgement This work is partially supported by the research grant from Ministry of Information and Communication Technology, Government of India, grant no. 12(54)/2006-ESD.
References 1. Jain, A.K., Pankanti, S., Prabhakar, S., Hong, L., Ross, A., Wayman, J.L.: Biometrics: A Grand Challenge. In: Proc. ICPR, UK, vol. II, pp. 935–942 (2004) 2. Gallager, R G.: Information Theory and Reliable Communication. John Wiley, Chichester (1968) 3. Jain, A.K., Ross, A., Pankanti, S.: Biometrics: A Tool for Information Security. IEEE Trans. Information Forensics and Security 1(2), 125–143 (2006) 4. Adler, A., Youmaran, R., Loyka, S.: Towards a Measure of Biometric Information (February 2006), http://www.sce.carleton.ca/faculty/adler//publications 5. Slepian, D.: Key Papers in the Development of Information Theory. IEEE Press, New York (1974) 6. Vembu, S., Verdú, S., Steinberg, Y.: The Source-Channel Separation Theorem Revisited. IEEE Trans. Information Theory 41(1), 44–54 (1995) 7. Simon, M., Hinedi, S., Lindsey, W.: Digital communication Techniques: Signal Design and Detection. Prentice-Hall, NJ (1995) 8. Feller, W.: An Introduction to Probability Theory and Applications. John Wiley & Sons, Chichester (1971) 9. Aggarwal, G., Ratha, N., Bolle, R.M.: Biometric Verification: Looking Beyond Raw Similarity Scores. In: Workshop on Biometrics (CVPR), New York, pp. 31–36 (2006) 10. Pankanti, S., Prabhakar, S., Jain, A.K.: On the Individuality of Fingerprints. IEEE Trans. PAMI 24(8), 1010–1025 (2002) 11. Daugman, J.: Probing the uniqueness and randomness of Iris Codes: Results from 200 billion iris pair comparisons. Proc. of the IEEE 94(11), 1927–1935 (2006) 12. Cover, T M., Thomas, J A.: Elements of Information Theory. John Wiley & Sons, Chichester (1991) 13. Kumar, A., Zhang, D.: Feature selection and combination in biometrics. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 813–822. Springer, Heidelberg (2005)
Appendix A

Definitions: R(D): the rate distortion function is the infimum of achievable rates R such that R(D) is in the rate-distortion region of the source (K-feature statistics for all M users) for the corresponding distortion constraint D. We rephrase the need for the Gaussian assumption in source coding (based on estimation theory): for a Gaussian distribution under mean squared error, the conditional mean of {f̂} is the optimal estimator of the source {f}; f ∈ F, f̂ ∈ F̂. Furthermore,
mean square distortion D as the distortion measure in Gaussian frame work is adopted for following reasons:1) generally useful for image information, 2) gives the minimum rate for representation error(tightened by the fact that rate at min. distortion is equivalent to distortion at min. rate), which is required in formulating individuality. Though the population statistics for F need not actually be Gaussian, we employ Normal approximation to M-user statistics to deduce the achievable worse case lower bound for rate-distortion R ( D ) . Problem: To formulate rate distortion R ( D ) in source coding Feature statistics of Kdifferent features per template, for M user population, approx. by Gaussian (for large number of users); and for avg. distortion measure given by ∧
D ≥ E F − Fk
2
;k ∈ K
Hence, to show that:
R ( D ) = min
K
k =1
σ k2
1
∑ 2 log
e
Dk
;
∑D
k
= D
∀k∈K
Solution: Let the K- different features per template be i.i.d (For non-i.i.d we subtract a template self-covariance from σ k , in the numerator of the result to prove). Idea is to source 2
code each k ∈ K features and then concatenate the K code blocks to generate the template code. We prove the stated result for k = 1 , which can be easily generalized under uncorrelated ness to k = K . ∧
∧
I ( f k ; f k ) = h( f k ) − h( f k / f k ) ∧ ∧ 1 = log e (2πe)σ k2 − h( f k − f k / f k ) 2 ∧ 1 ≥ log e (2πe)σ k2 − h( f k − f k ) ; (Conditioning reduces 2 entropy)
1056
J. Bhatnagar and A. Kumar ∧ 1 log e (2πe)σ k2 − h( N (0, E ( f k − f k ) 2 ) ; 2 1 1 ≥ log e (2πe)σ k2 − log e (2πe) Dk 2 2 2 ∧ σ 1 R( Dk ) = log e [ k ] = inf I ( f k ; f k ) 2 Dk
=
For K independent variables (feature sets) can be generalized as below:
R(D) =
∑ ∀k
R(Dk ) =
∑ ∀k
∧ σ2 1 log e [ k ] = inf I ( F ; F ) 2 Dk
Automatic Online Signature Verification Using HMMs with User-Dependent Structure J.M. Pascual-Gaspar and V. Cardeñoso-Payo ECA-SIMM, Dpto. Informática, Universidad de Valladolid, Campus Miguel Delibes s/n, 47011 Valladolid, Spain {jmpascual,valen}@infor.uva.es
Abstract. A novel strategy for Automatic online Signature Verification based on hidden Markov models (HMMs) with user-dependent structure is presented in this work. Under this approach, the number of states and Gaussians giving the optimal prediction results are selected independently for each user. With this simple strategy, just three genuine signatures can be used for training, with an EER under 2.5% obtained for the basic set of raw signature parameters provided by the acquisition device. This result improves by a factor of six on the accuracy obtained with the typical approach, in which a claim-independent structure is used for the HMMs.
1 Introduction
Signature verification is of particular importance within the framework of biometrics, because of its long-standing tradition in many identity verification scenarios [1]. In on-line Automatic Signature Verification (ASV) systems, the signer uses special hardware to produce her signature, so that a time sequence of parameters (position, inclination, pressure, ...) is generated by the device and can be processed by a system in order to characterize spatial and temporal features of the signature. These features are then used to build a model or template which can later be used to verify the claimed identity of the same signer with a minimum controlled risk that a forged signature could be taken as genuine. This particular instance of a pattern recognition problem is heavily influenced by the intra- and inter-user variability affecting the signing process, so that proper selection of a good model for every signature is a key step of any signature verification system. Hidden Markov Models (HMMs) are a widely used probabilistic framework for modelling patterns of temporal sequences. They have been successfully applied to speech recognition tasks [2], on-line handwriting [3] and on-line signature verification [4,5]. An HMM can be roughly described as a graph of interconnected emitting states whose topology is defined by means of a transition matrix, made of probability values for every particular state-to-state transition, and where the probability of emission of a given output value is usually modelled by a superposition of Gaussian distributions.
The number of states in the model and the number of Gaussian distributions associated with each state constitute the structural parameters of the model. Fixing these structural parameters is a highly specialized, problem-dependent task carried out by domain experts, since it is hard to find general efficient algorithms to infer these parameters from the data being modelled. Given the structure of the model, the classical Baum-Welch algorithm (see [2]) can be successfully applied to estimate the values of the transition probabilities and of the weights and statistical moments of the Gaussians which provide the maximum expected likelihood for the observed time sequence. These values represent what we call the statistical parameters of the model (a minimal training sketch under these conventions follows Table 1). Table 1 provides a view of the current state of the art in ASV using different modelling alternatives, sometimes combining local and global signature parameters. This gives a reference point for the expected accuracy of present and future systems.

Table 1. Some ASV systems, their error rates and employed techniques

Author                       Date  Employed technique        % Error
Nelson et al. [6]            1994  Distance based            EER: 6
Yang et al. [5]              1995  HMM                       FR: 1.75, FA: 4.44
Kashi et al. [7]             1997  HMM                       EER: 2.5
Nalwa [8]                    1997  Own algorithm             EER: between 2 and 5
DiLeece et al. [9]           2000  Multiexpert system        FR: 3.2, FA: 0.55
Jain et al. [10]             2002  String matching           FR: 3.3, FA: 2.7
Igarza et al. [11]           2003  HMM                       EER: 9.25
Ortega et al. [12]           2003  HMM                       EER: 0.98
Hansheng et al. [13]         2004  Linear regression (ER2)   EER: 0.2
Fierrez-Aguilar et al. [14]  2005  Local and Global fusion   EER: 0.24
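As a minimal illustration of the split between structural and statistical parameters, the sketch below trains one signature HMM with a fixed Bakis (left-to-right, no-skip) topology and a chosen (NS, NG), leaving Baum-Welch to estimate the statistical parameters. It assumes the third-party hmmlearn package and random stand-in data, so it illustrates the training step rather than reproducing the system described in this paper.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM   # assumed third-party package

def train_signature_hmm(train_seqs, n_states, n_gauss):
    """Train a Bakis (left-to-right, no-skip) GMM-HMM with fixed structural
    parameters (n_states, n_gauss); Baum-Welch estimates the statistical ones."""
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states):                    # stay in the state or move to the next
        transmat[i, i] = 0.5
        transmat[i, min(i + 1, n_states - 1)] += 0.5
    startprob = np.zeros(n_states)
    startprob[0] = 1.0

    model = GMMHMM(n_components=n_states, n_mix=n_gauss,
                   covariance_type="diag", n_iter=10,
                   init_params="mcw", params="mcw")   # keep the fixed topology
    model.startprob_ = startprob
    model.transmat_ = transmat
    X = np.vstack(train_seqs)                    # stacked (x, y, p, azimuth, elevation) frames
    model.fit(X, lengths=[len(s) for s in train_seqs])
    return model

# Hypothetical usage with three genuine "signatures" of random data:
rng = np.random.default_rng(0)
train = [rng.standard_normal((200, 5)) for _ in range(3)]
hmm = train_signature_hmm(train, n_states=32, n_gauss=1)
print(hmm.score(train[0]))                       # log-likelihood of one signature
```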
In this work we evaluate the influence of a claim-dependent selection of the structural parameters of the HMM on the accuracy of ASV. In section 2, a brief discussion of user-dependent and user-independent structure selection is made. The signature database used in this work is described in section 3 and the set of experiments is presented in section 4. The results and discussion in section 5 will show that user-dependent selection of the optimal model structure provides an average EER below 2.5%, even for skilled forgeries. If we take into account that these experiments were carried out using just 5 raw signature parameters and 3 training signatures per user, we find the final accuracy values really promising and competitive with state-of-the-art ASV systems.
2 User-Dependent Structure of HMMs
HMMs have been successfully applied to high-accuracy ASV systems for the last decade [15,11,5]. Although these systems differ in several important aspects
(e.g. signature pre-processing techniques, signature database characteristics, likelihood and score normalization techniques, ...), they always share a common architectural strategy, since they are based on a global set of structural parameters which, in most cases, are experimentally evaluated using all possible signatures of the database. The structure of the model which provides the best overall performance over all signers is selected. This is what we refer to as the ASV approach based on HMMs with user-independent structure (HMM-UIS). As an alternative to the HMM-UIS approach, the selection of the best structural parameters of the model could be carried out independently for each user, exploiting specific characteristics of the signature. This is what we call ASV based on HMMs with user-dependent structure (HMM-UDS). Under this approach, the model would be adapted to the user or, at least, to a given class of users sharing a set of statistically similar features. Although some heuristics have been evaluated to guide this selection, in the present paper we concentrate on an exhaustive search procedure for the optimal model structure. Although this approach is not directly usable in practical systems, we can use the results as an upper bound for the best accuracy which could be obtained when switching to a UDS. To further support our view of the importance of a user-dependent selection of the structural parameters of the model, we include Figure 1 to illustrate the correlation between form and structural parameters of the HMM for a given fictitious stroke of a signature. The five subfigures illustrate five different ways of covering a sample signature trace with states, each with the same fixed number of Gaussians per state. The same number of degrees of freedom (NS × NG = 16) has been used in all cases, ranging from the coarse-grained solution of figure 1-b), in which there is no dynamic modelling provided by state changes, to the one in figure 1-f), where 16 states are used to model the stroke dynamics. For a given number of degrees of freedom, it should be expected that a model with a lower number of Gaussians would perform better when using raw features without time derivatives, since it will better resemble the time evolution of the stroke. At the same time, the number of states will be highly influenced by the number of statistically different samples available to initialize the model. As a consequence, the compromise between NS and NG will be highly influenced both by the geometric and temporal characteristics of the signature and by the observable variability of features along it over the set of training samples. This is what motivated the experimental study we present in the following sections.
3 Data Acquisition and Pre-processing
All the experiments have been carried out using the MCYT signature database [16], which contains on-line signatures acquired with a WACOM Intuos A6 USB digital tablet. A total of 333 different users contributed to the database (the delivered and filtered version of the MCYT database contains just 330 signers, but the raw, complete original version was used in this study), each of them producing 25 genuine signatures across five time-spaced sessions.
Fig. 1. Samples of how to model a signature's stroke: (a) trace to model; (b) 1 state, 16 Gaussians per state; (c) 2 states, 8 Gaussians per state; (d) 4 states, 4 Gaussians per state; (e) 8 states, 2 Gaussians per state; (f) 16 states, 1 Gaussian per state.
In these sessions each user also produced 25 'over-the-shoulder' skilled forgeries of 5 different users. With this procedure, a total of 333 × (25 + 25) = 16650 signatures were produced, 8325 genuine and 8325 skilled forgeries. The input tablet provided a series of 100 vector samples per second, each vector including the raw parameters we used in this experiment: pen coordinates X and Y, pressure level P and pen orientation in terms of azimuth and elevation (see [16] for further details).

Table 2. MCYT global statistics summary, genuine values with forgery values in parentheses

         Length (cm)     Duration (s)    Speed (cm/s)
Mean     23.98 (24.25)   5.79 (7.15)     5.71 (4.59)
σ (%)    18% (38%)       41% (74%)       28% (56%)
Max.     47.51 (71.28)   20.20 (53.06)   19.46 (23.96)
Min.     6.12 (3.44)     0.57 (0.50)     1.29 (0.51)
The basic statistics of relevant global signature parameters for this database are shown in Table 2, both for genuine instances and forgeries. The signature length turns out to be the most stable feature and the easiest to reproduce, while duration and speed show higher deviation both intra-user and, especially, inter-user. A geometric normalization is performed to remove the absolute position offsets, since a grid of cells was used during the acquisition process to capture several signatures per sheet.
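The exact normalization is not detailed beyond removing the absolute position offset; one simple possibility, sketched below under the assumed column ordering (X, Y, pressure, azimuth, elevation), is to shift each signature so that its minimum pen coordinates sit at the origin.

```python
import numpy as np

def normalise_position(sig):
    """Remove absolute position offsets from a signature array whose first
    two columns are assumed to be the pen coordinates X and Y (sketch)."""
    out = np.asarray(sig, dtype=float).copy()
    out[:, 0] -= out[:, 0].min()   # anchor X to the left-most pen position
    out[:, 1] -= out[:, 1].min()   # anchor Y to the lowest pen position
    return out
```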
4 Experimental Setup
A single experiment can be defined as a function E(NU , NS , NG , IK ) → EER depending on the user identity NU , the number of states NS , the number of
Gaussians per state NG, and the kind of forgery considered in the evaluation phase IK (K = S, R; S: skilled, R: random). In each experiment the same three multi-session signatures were used for training (the first ones from each of the 3 central sessions). Two collections of experiments have been carried out in order to compare the HMM-UIS and HMM-UDS strategies. In both of them a Bakis left-to-right no-skip topology was chosen for the HMMs, and EER was used as the objective function for the optimization in the HMM-UDS strategy. In the first collection of experiments, the HMM-UIS strategy was evaluated using 49 different structure configurations, MUIS = (NS, NG)7×7, with both NS and NG ranging from 1 to 64 in powers of 2. The average EER obtained with each model was calculated, and the number of users for whom a model structure could be trained was also annotated. This is a relevant parameter, since it is not always possible to initialize a model with any number of states and Gaussians for every user. In these experiments, we evaluated EER using only random forgeries. In the second collection of experiments, we evaluated the influence of selecting optimal values of NS and NG independently for each user (HMM-UDS). Here, the models were trained under the same experimental conditions used in the first collection, to allow comparison between both approaches. A set of 555 possible structures was evaluated for each user, MUDS = (NS, NG)111×5, with the number of states ranging from 1 to 111 and the number of Gaussians from 1 to 5. From these models, the optimal model (Mopt) was selected as the one with the lowest EER or, for equal EER values, the lower number of states. The average EER over all the Mopt models was taken as the global average accuracy. Evaluation was carried out using both random and skilled forgeries in this case.
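The following sketch mirrors this exhaustive HMM-UDS search: it scans the (NS, NG) grid, scores genuine and impostor signatures with user-supplied training and scoring callbacks, and keeps the structure with the lowest EER, preferring fewer states on ties. The EER helper and callback names are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def eer(genuine, impostor):
    """Approximate equal error rate from genuine and impostor score arrays."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    best = 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)     # false acceptance rate at threshold t
        frr = np.mean(genuine < t)       # false rejection rate at threshold t
        best = min(best, max(far, frr))  # value near the FAR/FRR crossover
    return best

def select_user_structure(train_fn, score_fn, genuine_sigs, impostor_sigs,
                          max_states=111, max_gauss=5):
    """Exhaustive HMM-UDS search: train_fn(ns, ng) returns a model, or None if
    the structure is not trainable for this user; score_fn(model, sig) scores."""
    best_eer, best_ns, best_ng = 1.1, None, None
    for ng in range(1, max_gauss + 1):
        for ns in range(1, max_states + 1):
            model = train_fn(ns, ng)
            if model is None:
                continue
            g = [score_fn(model, s) for s in genuine_sigs]
            i = [score_fn(model, s) for s in impostor_sigs]
            e = eer(g, i)
            if e < best_eer or (e == best_eer and ns < best_ns):
                best_eer, best_ns, best_ng = e, ns, ng
    return best_eer, best_ns, best_ng
```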
5 Results and Discussion
Table 3 shows error rates using the 49 models of the MUIS test matrix. The number in parentheses in each cell represents the number of users NVU for whom a valid model structure was trainable. As expected, NVU decreases as the number of degrees of freedom NFD increases, since for some users there is not enough initialization data. Six of these configurations (NFD = 1024, 2048 and 4096) were not trainable for any user and are shown as empty cells in the table. Since any ASV system should balance accuracy and good generalization capabilities, the model structure with the lowest EER for which all subjects in the database can be trained is chosen: the configuration composed of 32 states and one Gaussian per state. This HMM-UIS configuration produced an EER of 16.29% using just three training samples. Of course, better EERs exist in the table, but they come at the cost of smaller generalization capabilities, since the number of valid trainable users is really small. Table 4 shows error rates obtained using the HMM-UDS approach, which are clearly lower than those of the HMM-UIS approach. Adapting the number of states individually for each user drastically improved the accuracy of the system, although the impact of the number of Gaussians per state was not as relevant as the optimization of the number of states.
Table 3. Errors as %EER with the HMM-UIS approach tested using random forgeries (number of trainable users in parentheses; empty cells were not trainable for any user)

NS \ NG        1            2            4            8           16           32           64
1       36.43 (333)  35.19 (333)  33.08 (333)  31.88 (333)  30.19 (333)  28.64 (333)  28.29 (324)
2       35.55 (333)  33.93 (333)  31.77 (332)  30.11 (324)  28.21 (306)  27.61 (287)  29.92 (260)
4       34.80 (333)  32.40 (327)  29.84 (319)  27.88 (310)  26.76 (277)  29.53 (210)  37.47 (90)
8       31.11 (333)  29.71 (330)  26.80 (312)  25.83 (287)  27.28 (241)  35.91 (104)  42.59 (7)
16      24.20 (333)  23.74 (321)  22.38 (306)  22.90 (254)  30.47 (99)   55.88 (9)    —
32      16.29 (333)  16.27 (309)  15.85 (259)  16.36 (107)  21.04 (8)    —            —
64      11.82 (324)  11.48 (262)  8.94 (107)   8.56 (7)     —            —            —
Table 4. Results with HMM-UDS models

NG        % EER (IK = R)   % EER (IK = S)
1         3.83             3.29
2         3.46             3.29
3         4.08             3.42
4         4.63             3.71
5         4.96             3.59
NG opt.   2.33             2.06
Average EER using both random and skilled forgeries is shown in this table and, again, just 3 signatures were always used for training. The first five rows of the table represent the error when NG was fixed and the optimum number of states was selected. As expected, the best results arise for models with low NG values (1 and 2). The last row shows the average error obtained when both the number of states and the number of Gaussians per state are selected to optimize EER. With this 'two-dimension' optimization the error rate is reduced by 33% and 37% for random and skilled forgeries respectively. Figure 2 shows the histograms of NS and NG for the experiments with random (a, b) and skilled (c, d) forgeries. From figures 2-a and 2-c it can be seen that the upper limit of 111 states per model could be increased in future work, with better results expected, because many models reached their best performance with the highest values of NS. With respect to the number of Gaussian distributions, it seems that for random forgeries (fig. 2-b) a low number of Gaussians per state performs better, whereas in the case of skilled forgeries (fig. 2-d) a higher number of Gaussians discriminates this kind of impostor better. To illustrate the relationship between signature complexity and number of states, signatures with different visual complexities are plotted in figure 3 beside their optimal number of states (one Gaussian per state was used in these signatures). After all these experiments, we came to the conclusion that a reliable ASV-HMM system must be accurate for the majority of users, reporting a low mean error rate.
Fig. 2. NS and NG histograms for the random forgery (a, b) and skilled forgery (c, d) tests.
Fig. 3. Samples of signatures modelled using the HMM user-dependent structure approach: (a) 12 states; (b) 55 states; (c) 93 states.
We also concluded that it is very important for real application acceptance that the system works properly for very different types of signatures: it is not desirable to have users with high error rates because their signatures are simplistic or inconsistent. The histogram in figure 4-a) illustrates the distribution of the number of users sharing the same EER interval when random forgeries were used. We emphasize the following three results: a) a high number of users yield no verification errors (28% of the models give 0% EER); b) 86% of the models have an EER lower than 5%; c) only three models report an EER over 15%, 15.81% being the worst EER result of our system. Table 4 shows that no significant differences are found with respect to the random case when skilled forgeries are used. In fact, only a slight improvement can be observed, which could be attributed to the fact that forged signatures were produced without information on the signature dynamics, which is difficult to infer for complex signatures. In spite of these similar average EER results, the appearance of the EER histogram in figure 4-b) is completely different from the one for random forgeries. Many of the genuine signatures of the users turned out to be difficult to forge when a different optimal number of states and Gaussians is chosen for each user, since the dynamics are hidden in these optimal numbers of degrees of freedom.
Fig. 4. EER histograms (in %): (a) random forgeries; (b) skilled forgeries.
Fig. 5. Relevant global parameters in terms of the optimal claim-dependent Ns and Ng: (a) duration vs. Ns; (b) duration vs. Ns × Ng; (c) # pen-ups vs. Ns × Ng. Linear fits reported in the panels: y = 0.0242·x + 2.4235 (r² = 0.27), y = 0.0277·x + 2.2287 (r² = 0.51), and y = 0.1313·x + 14.783 (r² = 0.2743).
A higher number of models show a low EER (230 producing 0% EER), although there also exists a group of users who produced simplistic or inconsistent signatures that are easier to forge, leading to a final average EER similar to the one found for random forgeries. This might have been clearly improved with a higher control over the acquisition process. Recent studies on the combination of local and global features for ASV have provided reference rankings of the relevance of several global features [15]. The total length of a signature and the number of strokes (or the basically equivalent number of pen-ups) are shown to be the most relevant of these global parameters. In order to test the correlation between the optimum number of degrees of freedom (NS, NG) of the user-dependent models and the value of these global parameters for a given user, we have carried out the linear correlation analysis shown in figure 5. In these figures, we find that there is a reasonably good linear correspondence (r² = 0.51) between the length of a signature and the number NS × NG, which could provide a basic guideline for a more efficient model selection strategy. As for the number of pen-ups, there is no such clear correspondence. This could be related to the fact that independent strokes could be better modelled by separate HMM models whose results are then merged, compared to the single HMM-based model we are using here. Even less information can be extracted about the correspondence between signature length and the optimum number of states. In any case, we conclude that further research is worthwhile on
the correspondence between global signature parameters and optimum structural parameters of the models.
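A minimal version of the linear correlation analysis behind Figure 5 is sketched below; it simply fits a least-squares line and reports its r². The function and variable names are illustrative.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Least-squares line y = a*x + b and its r^2, as in the Fig. 5 analysis."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a, b = np.polyfit(x, y, 1)
    y_hat = a * x + b
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical usage: signature lengths vs. optimal NS * NG per user
a, b, r2 = linear_fit_r2([10, 20, 30, 40], [15, 18, 22, 25])
print(a, b, round(r2, 3))
```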
6 Conclusions
In this work, we provided experimental evidence that data-driven, user-dependent structure optimization of HMM models can bring lower EERs in ASV systems. The influence of two relevant structural parameters, the number of states NS in the model and the number of Gaussians per state NG, was evaluated, and NS was the parameter which provided the greater observable improvement. HMM-UDS strategies lead to more accurate and reliable ASV systems using a smaller number of training signatures, which always represents an advantage for practical use cases. User adaptation is shown to cope well with intra-user variability while providing good inter-user discrimination. An EER of 2.33% for random forgeries and 2.06% for skilled forgeries has been obtained. This represents a factor-of-6 gain over the HMM-UIS strategy for the random forgery scenario under the same experimental conditions. Since optimization was carried out using an exhaustive search, it might not be usable in practical systems. Nevertheless, the results provide a lower bound for the best obtainable EER, which encourages further experimentation on data-driven model selection strategies. Also, a better parameterization including time-dependent features will of course provide an overall increase in accuracy, according to the results found in other works [15].
Acknowledgements

This work has been partially supported by the Spanish Ministry of Education, under contract TIC2003-08382-C05-03, and by the Consejería de Educación de la Junta de Castilla y León, under research project VA053A05.
References

1. Plamondon, R., Srihari, S.N.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 63–84 (2000)
2. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
3. Hu, J., Brown, M.K., Turin, W.: HMM based on-line handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 18(10), 1039–1045 (1996)
4. Fierrez-Aguilar, J.: Adapted Fusion Schemes for Multimodal Biometric Authentication. PhD thesis, Esc. Técnica Superior de Ing. de Telecomunicación (2006)
5. Yang, L., Widjaja, B.K., Prasad, R.: Application of hidden Markov models for signature verification. Pattern Recognition 28(2), 161–170 (1995)
6. Nelson, W., Turin, W., Hastie, T.: Statistical methods for on-line signature verification. IJPRAI 8(3), 749–770 (1994)
7. Kashi, R.S., Hu, J., Nelson, W.L., Turin, W.: On-line handwritten signature verification using hidden Markov model features. Document Analysis and Recognition 2 (1997)
8. Nalwa, V.S.: Automatic on-line signature verification. Proceedings of the IEEE 85(2), 215–239 (1997)
9. Lecce, V.D., Dimauro, G., Guerriero, A., Impedovo, S., Pirlo, G., Salzo, A.: A multi-expert system for dynamic signature verification. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 320–329. Springer, Heidelberg (2000)
10. Jain, A., Griess, F., Connell, S.: On-line signature verification. Pattern Recognition 35(12), 2963–2972 (2002)
11. Igarza, J.J., Goirizelaia, I., Espinosa, K., Hernáez, I., Méndez, R., Sánchez, J.: Online handwritten signature verification using hidden Markov models. In: Sanfeliu, A., Ruiz-Shulcloper, J. (eds.) CIARP 2003. LNCS, vol. 2905, pp. 391–399. Springer, Heidelberg (2003)
12. Ortega-Garcia, J., Fierrez-Aguilar, J., Martin-Reillo, J., Gonzalez-Rodriguez, J.: Complete signal modeling and score normalization for function-based dynamic signature verification. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 658–667. Springer, Heidelberg (2003)
13. Lei, H., Palla, S., Govindaraju, V.: ER2: An intuitive similarity measure for on-line signature verification. In: IWFHR '04: Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition (IWFHR'04), Washington, DC, USA, pp. 191–195. IEEE Computer Society Press, Los Alamitos (2004)
14. Fierrez-Aguilar, J., Nanni, L., Lopez-Peñalba, J., Ortega-Garcia, J., Maltoni, D.: An on-line signature verification system based on fusion of local and global information. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 523–532. Springer, Heidelberg (2005)
15. Fierrez-Aguilar, J., Krawczyk, S., Ortega-Garcia, J., Jain, A.K.: Fusion of local and regional approaches for on-line signature verification. In: Li, S.Z., Sun, Z., Tan, T., Pankanti, S., Chollet, G., Zhang, D. (eds.) IWBRS 2005. LNCS, vol. 3781, pp. 188–196. Springer, Heidelberg (2005)
16. Ortega, J., Fierrez, J., Simon, D., Gonzalez, J., Hernaez, I., Igarza, J.J., Vivaracho, C., Escudero, D., Moro, Q.: MCYT baseline corpus: a bimodal biometric database. IEEE Proc. Visual Image Signal Processing 150(6), 395–401 (2003)
A Complete Fisher Discriminant Analysis for Based Image Matrix and Its Application to Face Biometrics R.M. Mutelo, W.L. Woo, and S.S. Dlay School of Electrical, Electronic and Computer Engineering University of Newcastle Newcastle upon Tyne, NE1 7RU United Kingdom {risco.mutelo,w.l.woo,s.s.dlay}@ncl.ac.uk
Abstract. This paper presents a Complete Orthogonal Image Discriminant (COID) method and its application to biometric face recognition. The novelty of the COID method comes from 1) the derivation of two kinds of image discriminant features, image regular and image irregular, in the feature extraction stage, and 2) the development of the complete OID (COID) features, based on the fusion of the two kinds of image discriminant features, used in classification. The COID method first derives a feature image of the face image with reduced dimensionality of the image matrix by means of two-dimensional principal component analysis, and then performs discriminant analysis in double discriminant subspaces in order to derive the image regular and irregular features, making it more suitable for the small sample size problem. Finally, it combines the image regular and irregular features, which are complementary, to achieve better discriminant features. The feasibility of the COID method has been successfully tested using the ORL images, where it proved superior to the 2DFLD method on face recognition (73.8%). Keywords: Biometrics, Face Recognition, Fisher Discriminant Analysis (FDA), two-dimensional image, feature extraction, image representation.
1 Introduction

Over the last few years, biometrics has attracted high interest in the security technology marketplace. Biometric technologies use physiological or behavioural characteristics to identify and authenticate users attempting to gain access to computers, networks, and physical locations. The fastest growing areas of advanced security involve biometric face recognition technologies. Biometric face recognition technology offers great promise in its ability to identify a single face, from multiple lookout points, from a sea of hundreds of thousands of other faces. However, building an automated face recognition system is very challenging. In real-world applications, particularly in image recognition, there are many small sample size (SSS) problems in the observation space (input space). In such problems, the number of training samples is less than the dimension of the feature vectors. A good face recognition methodology should consider representation as well as classification issues. Straightforward image
projection techniques such as two-dimensional principal component analysis (2DPCA) [1], two-dimensional reduction PCA (2D RPCA) [2] and two-dimensional FLD [3, 4] are among the most popular methods for representation and recognition. This paper introduces a novel Complete Orthogonal Image Discriminant (COID) method for biometric face representation and recognition. The novelty of the COID method comes from 1) the derivation of two kinds of image discriminant features, image regular and image irregular, in the feature extraction stage, and 2) the development of the complete OID (COID) features based on the fusion of the two kinds of image discriminant features used in classification. In particular, the COID method first derives a feature image of the face image with reduced dimensionality of the image matrix by means of two-dimensional principal component analysis, and then performs discriminant analysis in double discriminant subspaces in order to derive the image regular and irregular features, making it more suitable for the small sample size problem. The 2DPCA-transformed face images preserve the spatial structure that defines the 2D face image and exhibit strong characteristic features with reduced noise and redundancy, whereas 2DFLD further reduces redundancy and represents orthogonal discriminant features explicitly.
2 Outline of 2DPCA and 2DFLD

Given a set of M training samples X_1, X_2, ..., X_M in the input space ℜ^{m×n}, the aim is to project the image X_i, an m × n face matrix, onto B by the following linear transformation:

Y_i = X_i B    (1)

Thus, we obtain an m × d dimensional projected matrix Y_i, which is the feature matrix of the image sample X_i. In the 2DPCA method, the total scatter S_t of the projected samples is introduced to measure the discriminatory power of the projection vector B; the total scatter of the projected samples can be characterized by the trace of the scatter matrix of the projected feature vectors. In 2DFLD, the between-class scatter S_b and the within-class scatter S_w matrices of the projected samples are introduced instead. From this point of view, we adopt the following criteria:

J(β_i) = β_iᵀ S_t β_i    (2)

J(ϕ_i) = (ϕ_iᵀ S_b ϕ_i) / (ϕ_iᵀ S_w ϕ_i),   ϕ_i ≠ 0    (3)

where β_i and ϕ_i are the unitary vectors that maximize the 2DPCA criterion (2) and the 2DFLD criterion (3), called the optimal projection axes β_opt and ϕ_opt. In general, it is not enough to have only one optimal projection axis; we usually need to select a set of projection axes, Β = [β_1, β_2, ..., β_d] and Φ = [ϕ_1, ϕ_2, ..., ϕ_d], subject to orthonormal constraints. The scatter matrices are defined as follows:
S_t = (1/M) Σ_{i=1}^{M} (X_i − X_0)ᵀ (X_i − X_0)    (4)

S_b = (1/M) Σ_{i=1}^{C} L_i (X̄_i − X_0)ᵀ (X̄_i − X_0)    (5)

S_w = (1/M) Σ_{i=1}^{C} Σ_{j=1}^{L_i} (X_ij − X̄_i)ᵀ (X_ij − X̄_i)    (6)

where X_ij denotes the j-th training sample in class i, L_i is the number of training samples in class i, X̄_i is the mean of the training samples in class i, C is the number of image classes, and X_0 is the global mean across all training samples.
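Assuming NumPy-style image arrays, the following sketch computes the image scatter matrices of eqs. (4)-(6); the variable names and random test data are illustrative only.

```python
import numpy as np

def image_scatter_matrices(X, labels):
    """Image scatter matrices St, Sb, Sw of eqs. (4)-(6) for a set of m x n
    image matrices X (shape M x m x n) with integer class labels."""
    M = len(X)
    X0 = X.mean(axis=0)                                   # global mean image
    St = sum((Xi - X0).T @ (Xi - X0) for Xi in X) / M
    Sb = np.zeros_like(St)
    Sw = np.zeros_like(St)
    for c in np.unique(labels):
        Xc = X[labels == c]
        Xc_mean = Xc.mean(axis=0)                         # class mean image
        Sb += len(Xc) * (Xc_mean - X0).T @ (Xc_mean - X0)
        Sw += sum((Xi - Xc_mean).T @ (Xi - Xc_mean) for Xi in Xc)
    return St, Sb / M, Sw / M

# Hypothetical usage on random "images" (10 classes, 4 images each):
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 112, 92))
labels = np.repeat(np.arange(10), 4)
St, Sb, Sw = image_scatter_matrices(X, labels)
print(St.shape)   # (92, 92), i.e. n x n
```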
3 Complete Orthogonal Image Discriminant Approach

In this section, we examine the problem in the whole input space rather than in the space transformed by 2DPCA first.

3.1 Fundamentals

Suppose we have a set of M training samples X_1, X_2, ..., X_M in the input space of size m × n; by definition rank(X_i) = min(m, n). From the denominator in (3), we have

rank(S_w) = rank( (1/M) Σ_{i=1}^{C} Σ_{j=1}^{L_i} (X_ij − X̄_i)ᵀ (X_ij − X̄_i) ) ≤ (M − C) · min(m, n)    (7)

so that S_w is non-singular when

M ≥ C + n / min(m, n).    (8)
If the within-class covariance operator S_w is invertible, ϕ_iᵀ S_w ϕ_i > 0 always holds for every nonzero vector ϕ_i. Therefore, the Fisher criterion can be directly employed to extract a set of optimal discriminant vectors. However, in some cases it is almost impossible to make S_w invertible because of the limited amount of training samples in real-world applications. This means that there always exist vectors satisfying ϕ_iᵀ S_w ϕ_i = 0. In [5], it is shown that these vectors are from the null space of S_w for the vectorized Fisher discriminant approach. These vectors turn out to be very effective if they satisfy ϕ_iᵀ S_b ϕ_i > 0 at the same time [5]. The positive image between-class scatter makes the data well separable when the within-class scatter is zero. In such a case, the Fisher criterion degenerates into the following between-class covariance criterion:

J_b(ϕ_i) = ϕ_iᵀ S_b ϕ_i,   ||ϕ_i|| = 1    (9)
As a special case of the Fisher criterion, it is reasonable to use the image between-class scatter to measure the discriminatory ability of a projection axis when the image within-class covariance is zero.

3.2 Optimal Orthogonal Image Discriminant Vectors

The input space ℜ^{m×n} is relatively large and mostly empty (containing noise and redundancies), so it is computationally intensive to calculate the optimal discriminant vectors directly. Our strategy is to reduce the feasible solution space (search space) where the two kinds of discriminant vectors might hide. Suppose β_1, β_2, ..., β_d are the eigenvectors corresponding to the largest positive eigenvalues of S_t. We can define the subspace ψ_t = span{β_1, β_2, ..., β_d}, whose orthogonal complementary space is denoted by ψ_t^⊥ = span{β_{d+1}, β_{d+2}, ..., β_n}. The eigenvectors {β_{d+1}, β_{d+2}, ..., β_n} have eigenvalues relatively close to zero, and in 2DPCA they are considered to contain very little or no discriminant information. Since n = rank(S_t), none of the eigenvalues of S_t are negative; therefore, we refer to ψ_t^⊥ in this paper as the positive null space of S_t. Since ℜ^{m×n} = ψ_t ⊕ ψ_t^⊥, it follows that an arbitrary vector ϕ_i ∈ ℜ^{m×n} can be uniquely represented in the form ϕ_i = φ_i + ς_i with φ_i ∈ ψ_t and ς_i ∈ ψ_t^⊥. We can define the mapping ℜ^{m×n} → ψ_t using

ϕ_i = φ_i + ς_i → φ_i    (10)

where φ_i is the orthogonal projection of ϕ_i onto ψ_t. The eigenvectors ς_i have eigenvalues relatively close to zero and are thus discarded in 2DPCA. It follows that J(ϕ_i) = J(φ_i). Since the new search space is much smaller (of lower dimensionality), it is easier to derive discriminant vectors from it. The aim at this point is to calculate the Fisher optimal discriminant vectors in the reduced search space ψ_t. According to linear algebra theory [6], ψ_t is isomorphic to the m × d dimensional matrix space ℜ^{m×d}. The corresponding isomorphic mapping is

ϕ_i = Β η_i,  where Β = (β_1, β_2, ..., β_d), η_i ∈ ℜ^{m×d}    (11)
Under the isomorphic mapping ϕ_i = Β η_i, the criterion functions (3) and (9) in the subspace are converted into

J(ϕ_i) = [η_iᵀ (Βᵀ S_b Β) η_i] / [η_iᵀ (Βᵀ S_w Β) η_i]   and   J_b(ϕ_i) = η_iᵀ (Βᵀ S_b Β) η_i    (12)

J(η_i) = (η_iᵀ Ŝ_b η_i) / (η_iᵀ Ŝ_w η_i), η_i ≠ 0   and   J_b(η_i) = η_iᵀ Ŝ_b η_i, ||η_i|| = 1    (13)

where Ŝ_b = Βᵀ S_b Β and Ŝ_w = Βᵀ S_w Β. This means that J(η_i) is a generalized Rayleigh quotient, and J_b(η_i) a Rayleigh quotient, in the isomorphic space ℜ^{m×d}. Now, the problem of calculating the optimal discriminant vectors in the subspace ψ_t is transformed into the extremum problem of the (generalized) Rayleigh quotient in the isomorphic space ℜ^{m×d}. Therefore we can obtain the discriminant feature matrix Π_j by the following transformation:

Π_j = X_j Τ,  where Τ = (ϕ_1, ϕ_2, ..., ϕ_q) = (Β η_1, Β η_2, ..., Β η_q) = Β (η_1, η_2, ..., η_q)    (14)
The transformation in (14) can be divided into two transformations:

Y_j = X_j Β,  where Β = (β_1, β_2, ..., β_d)    (15)

Π_j = Y_j Λ,  where Λ = (η_1, η_2, ..., η_q)    (16)
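The following sketch illustrates transformation (15), i.e. the projection of each image onto the d leading eigenvectors of S_t (the 2DPCA step discussed next); the discriminant step of (16) is illustrated in the sketch at the end of Section 3. The function name and usage are illustrative assumptions.

```python
import numpy as np

def twodpca_projection(X, d):
    """Transformation (15): project each m x n image onto the d leading
    eigenvectors of the image total scatter St (this step is exactly 2DPCA)."""
    M = len(X)
    X0 = X.mean(axis=0)
    St = sum((Xi - X0).T @ (Xi - X0) for Xi in X) / M
    evals, evecs = np.linalg.eigh(St)            # St is symmetric
    B = evecs[:, np.argsort(-evals)[:d]]         # n x d projection matrix
    return np.stack([Xi @ B for Xi in X]), B
```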
The transformation (15) is exactly 2DPCA [1]. Looking back at (13), the two matrices Ŝ_b and Ŝ_w are the between-class and within-class scatter matrices in ℜ^{m×d}. Therefore, 2DPCA is first used to reduce the dimension of the input space. This low dimensionality is important for learning: performing 2DFLD on the transformed images results in better generalization to unseen images.

3.3 Two Kinds of Image Discriminant Features

Our strategy is to split the space ℜ^{m×d} into two subspaces, the positive null space and the range space of Ŝ_w, and then use the Fisher criterion to derive the image regular discriminant features from the range space and the between-class scatter criterion to derive the image irregular discriminant features from the null space. Suppose α_1, α_2, ..., α_d are the orthonormal eigenvectors of Ŝ_w, where the first q of them correspond to the largest positive eigenvalues. We define the subspace Θ_w = span{α_1, α_2, ..., α_q}, the range space, and its orthogonal complementary space Θ_w^⊥ = span{α_{q+1}, α_{q+2}, ..., α_d}. For a nonzero vector η_i in Θ_w^⊥, the image within-class scatter and between-class scatter become η_iᵀ Ŝ_w η_i = 0 and η_iᵀ Ŝ_b η_i > 0; therefore the criterion J_b(η_i) = η_iᵀ Ŝ_b η_i is used to derive the discriminant feature matrix. On the other hand, since every nonzero vector η_i in Θ_w satisfies η_iᵀ Ŝ_w η_i > 0, it is feasible to derive the optimal regular discriminant vectors from Θ_w using the standard Fisher criterion J(η_i). To calculate the optimal image regular discriminant vectors in Θ_w, note that the dimension of Θ_w is m × q; the space Θ_w is isomorphic to the matrix space ℜ^{m×q} and the corresponding isomorphic mapping is

η = Β_1 ξ,  where Β_1 = (α_1, α_2, ..., α_q)    (17)
Under this mapping, the Fisher criterion J(η_i) is converted into

J(ξ_i) = (ξ_iᵀ S̃_b ξ_i) / (ξ_iᵀ S̃_w ξ_i),  ξ_i ≠ 0    (18)

where S̃_b = Β_1ᵀ Ŝ_b Β_1 and S̃_w = Β_1ᵀ Ŝ_w Β_1. Thus the solutions are the generalized eigenvectors of the generalized eigenequation S̃_b ξ_i = λ_i S̃_w ξ_i. Letting ϑ_1, ϑ_2, ..., ϑ_f be the f largest eigenvectors of S̃_w⁻¹ S̃_b, we obtain the image regular discriminant vectors η_i = Β_1 ϑ_i, i = 1, ..., f, using (17). Similarly, the space of optimal image irregular discriminant vectors within Θ_w^⊥ is isomorphic to the matrix space ℜ^{m×(d−q)} and the corresponding isomorphic mapping is

η = Β_2 ξ,  where Β_2 = (α_{q+1}, α_{q+2}, ..., α_d)    (19)

Under this mapping, the criterion Ĵ_b(η_i) is converted into

Ĵ_b(ξ_i) = ξ_iᵀ S̄_b ξ_i,  ||ξ_i|| = 1    (20)

where S̄_b = Β_2ᵀ Ŝ_b Β_2. Letting υ_1, υ_2, ..., υ_f be the eigenvectors of S̄_b corresponding to the f largest eigenvalues, we obtain η̂_i = Β_2 υ_i as the optimal irregular discriminant vectors
with respect to Ĵ_b(η). The linear discriminant transformation in (16) can now be performed in ℜ^{m×d}. After the projection of the image sample Y_j onto the regular discriminant vectors η = (η_1, η_2, ..., η_f), we obtain the image regular discriminant feature matrix:

Π_1 = Y_j (η_1, η_2, ..., η_f) = X_j Β_1 ϑ    (21)

where Β_1 = (α_1, α_2, ..., α_q) and ϑ = (ϑ_1, ϑ_2, ..., ϑ_f). After the projection of the sample Y_j onto the irregular discriminant vectors η̂_1, η̂_2, ..., η̂_f, we obtain the image irregular discriminant feature matrix:

Π_2 = Y_j (η̂_1, η̂_2, ..., η̂_f) = X_j Β_2 υ    (22)

where Β_2 = (α_{q+1}, α_{q+2}, ..., α_d) and υ = (υ_1, υ_2, ..., υ_f).

3.4 Fusion of Two Kinds of Discriminant Features

We propose a simple fusion strategy based on a summed normalized distance. The distance between two arbitrary feature matrices is defined as
D(Π_i, Π_j) = Σ_{k=1}^{f} [ Σ_{z=1}^{q} (T_zk^i − T_zk^j)² ]^{1/2}    (23)

where Π_i = [T_1^i, T_2^i, ..., T_f^i] and Π_j = [T_1^j, T_2^j, ..., T_f^j]. For a given image sample we denote a feature matrix Π = [Π_1, Π_2], where Π_1, Π_2 are the regular and irregular discriminant feature matrices of the same pattern. The summed normalized distance between a sample Π and the training sample Π^r = [Π_1^r, Π_2^r] is defined by

D(Π, Π^r) = θ_r · D(Π_1, Π_1^r) / Σ_{g=1}^{M} D(Π_1, Π_1^g) + θ_i · D(Π_2, Π_2^r) / Σ_{g=1}^{M} D(Π_2, Π_2^g)    (24)
where θ_r and θ_i are the fusion coefficients. These coefficients determine the weight of the regular and irregular discriminant information at the decision level.
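The sketch below gathers sections 3.3 and 3.4 into code: it splits the reduced space by the range and positive null space of Ŝ_w, derives regular and irregular projections, and fuses the two distances as in (23)-(24). It is an illustrative reading of the equations (including the summed column-wise Euclidean distance assumed for (23)), not the authors' implementation; all names are assumptions.

```python
import numpy as np

def class_scatter(Y, labels):
    """Between/within-class scatter of 2DPCA feature matrices Y (M x m x d)."""
    M, _, d = Y.shape
    Y0 = Y.mean(axis=0)
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sb += len(Yc) * (mc - Y0).T @ (mc - Y0)
        Sw += sum((Yi - mc).T @ (Yi - mc) for Yi in Yc)
    return Sb / M, Sw / M

def coid_projections(Y, labels, f):
    """Regular (range-space Fisher) and irregular (null-space between-class)
    projection matrices, following eqs. (17)-(22)."""
    Sb, Sw = class_scatter(Y, labels)
    w, A = np.linalg.eigh(Sw)
    tol = 1e-8 * max(w.max(), 1.0)
    B1, B2 = A[:, w > tol], A[:, w <= tol]       # range / positive null space
    reg = irr = None
    if B1.shape[1]:                              # eq. (18): Fisher criterion
        ev, V = np.linalg.eig(np.linalg.inv(B1.T @ Sw @ B1) @ (B1.T @ Sb @ B1))
        reg = B1 @ V[:, np.argsort(-ev.real)[:f]].real
    if B2.shape[1]:                              # eq. (20): between-class criterion
        ev, V = np.linalg.eigh(B2.T @ Sb @ B2)
        irr = B2 @ V[:, np.argsort(-ev)[:f]]
    return reg, irr

def matrix_distance(P, Q):
    """Eq. (23): summed column-wise Euclidean distance between feature matrices."""
    return np.sum(np.sqrt(np.sum((P - Q) ** 2, axis=0)))

def fused_distance(P1, P2, train1, train2, r, theta_r=1.0, theta_i=1.0):
    """Eq. (24): summed normalized distance of a probe (P1, P2) to training
    sample r, each stream normalized by its total distance to all M samples."""
    n1 = sum(matrix_distance(P1, T) for T in train1)
    n2 = sum(matrix_distance(P2, T) for T in train2)
    return (theta_r * matrix_distance(P1, train1[r]) / n1
            + theta_i * matrix_distance(P2, train2[r]) / n2)
```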
4 Results and Discussions

Our analysis was performed on the Olivetti Research Laboratory (ORL) face database, which contains 40 persons, each with 10 different images. For some subjects, the images were taken at different times, and they contain quite a high degree of variability in lighting, facial expression (open/closed eyes, smiling/non-smiling, etc.), pose (upright, frontal position, etc.), scale, and facial details (glasses/no glasses). Examples of sample images from the ORL face database are shown in Fig. 1.
Fig. 1. Example ORL images with spatial resolution 112 × 92. Note that the images vary in pose, size, and facial expression.
4.1 ORL Face Database

Firstly, an analysis was performed using one image sample per class for training. Thus, the total number of training samples is 40 and the number of testing samples is 360; that is, the total number of training samples and the number of image classes are both 40. According to (8), the image within-class scatter matrix Ŝ_w is singular. It is almost impossible to make S̃_w invertible because of the limited amount of training samples used here, as is usually the case in real-world applications. That is, there always exist vectors satisfying ξ_iᵀ S̃_w ξ_i = 0.
Fig. 2. Magnitudes of the eigenvalues of S_t in descending order for 2DPCA (x-axis: number of eigenvalues; y-axis: magnitude of eigenvalues).
Fig. 3. Face recognition performance of the COID method and 2DPCA (x-axis: number of projection vectors; y-axis: recognition accuracy in %). The 2DFLD method is inapplicable when a limited number of training samples is given.
The 2DLDA method is inapplicable since S̃_w is singular and thus suffers from the SSS problem. Therefore the 2DPCA and COID methods were used for feature extraction. The size of the image scatter matrix S_t was 92 × 92 with rank(S_t) = 92, and the largest d principal components are used as projection axes. According to section 3.2, most of the discriminant information is preserved. From Fig. 2, it is reasonable to use the 10 eigenvectors corresponding to the largest eigenvalues. The 2DPCA-transformed feature matrix Y_j is of size 112 × 10; thus, it follows that S̃_b and S̃_w are 10 × 10. However, rank(Ŝ_w) = 0, therefore the positive null space Θ_w^⊥ is used to extract the optimal discriminant vectors, where Β_1 = 0 and Β_2 = (α_1, α_2, ..., α_10) since q = 0. The results in Fig. 3 lead to the following findings: 1) the classification performance with the COID outputs is better than that achieved by the 2DPCA method. The best performance of 73.6% was achieved by COID, while 2DPCA obtained an accuracy of 63.6%. The difference is that COID evaluates a smaller discriminant scatter matrix S̄_b of size 10 × 10 more accurately, in contrast to the scatter matrix of 2DPCA of size 92 × 92.
Table 1. CPU times (seconds) for feature extraction at top recognition accuracy for training (Pentium IV, 3.00 GHz, 496 MB RAM)

Method           Accuracy (%)   Dimensions   Feature extraction time (s)
2DPCA            86.9           112×4        0.438
2DLDA            88.0           112×2        0.562
Image Irregular  77.5           112×2        1.143
Image Regular    84.38          112×2        1.260
COID             89.9           112×2        2.150
2) The recognition accuracy increases as the number of eigenvectors in Β_2 = (α_1, α_2, ..., α_10) used is increased. In particular, the face recognition performance of COID becomes stable when 4 and 5 eigenvectors are used, where a top recognition accuracy of 73.6% is reached, after which the recognition accuracy decreases as noise and redundancies are introduced by the smaller eigenvectors, making the system unstable. A similar trend is also observed for 2DPCA. We then analysed the performance of 2DPCA [1], 2DFLD [3] and COID when two image samples per class are used for training and all the remaining image samples per class for testing. Again 10 eigenvectors were used to transform the 2DPCA space. Then the spaces Θ_w = span{α_1, α_2, ..., α_4} and Θ_w^⊥ = span{α_5, α_6, ..., α_10} were defined as in section 3.2. The fusion coefficients θ_r = 2 and θ_i = 0.8 were used to combine the two kinds of discriminant information. The results in Table 1 lead to the following findings: 1) the COID method performs better than the 2DPCA and 2DLDA methods, with an accuracy of 89.9% correct recognition when 112 × 2 features are used. In addition, the dimensions of the feature matrix for the COID method are comparable to 2DPCA and 2DLDA. 2) Although the image regular features show superior performance to the irregular features, they are complementary to each other, as shown by the COID results. As the discriminatory power depends on both the within-class and between-class covariance matrices, regular features contain more discriminatory information. However, COID (compared to 2DPCA and 2DLDA) takes more time for feature extraction, as it requires a two-phase process which derives the discriminant information from a double discriminant subspace, making it a more powerful discriminant approach.
5 Conclusions

A novel Complete Orthogonal Image Discriminant (COID) method is developed. The COID method performs discriminant analysis in double discriminant subspaces: regular and irregular. The novelty of the COID method comes from 1) the derivation of the two kinds of image discriminant features, image regular and image irregular, in the feature extraction stage, and 2) the development of the complete OID (COID) features based on the fusion of the two kinds of image discriminant features used in classification. In particular, the COID method first derives a feature image of the face image with reduced dimensionality of the image matrix by means of two-dimensional principal component analysis. This low dimensionality is important for the discriminant
learning, leading to better generalization to unseen images, as the number of examples required for attaining a given level of performance grows exponentially with the dimensionality of the underlying representation space. The reasons behind integrating the 2DPCA and the 2DFLD representations are twofold. Firstly, the 2DPCA-transformed face images preserve the spatial structure that defines the 2D face image and exhibit strong characteristic features with reduced noise and redundancy. Secondly, 2DFLD further reduces redundancy and represents orthogonal discriminant features explicitly. Since the image regular and irregular features are complementary for achieving better discriminant features, they are combined together in order to enhance the performance. In addition, the results show that COID is more suitable for the Small Sample Size (SSS) problem due to its double discriminant subspace.
References

[1] Yang, J., Zhang, D., Frangi, A.F., Yang, J.-y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 131–137 (2004)
[2] Mutelo, R.M., Khor, L.C., Woo, W.L., Dlay, S.S.: Two-dimensional reduction PCA: a novel approach for feature extraction, representation, and recognition. Presented at Visualization and Data Analysis 2006 (2006)
[3] Li, M., Yuan, B.: 2D-LDA: A statistical linear discriminant analysis for image matrix. Pattern Recognition Letters 26, 527–532 (2005)
[4] Yang, J., Zhang, D., Yong, X., Yang, J.-y.: Two-dimensional discriminant transform for face recognition. Pattern Recognition 38, 1125–1129 (2005)
[5] Chen, L.-F., Liao, H.-Y.M., Ko, M.-T., Lin, J.-C., Yu, G.-J.: A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33, 1713–1726 (2000)
[6] Kreyszig, E.: Introductory Functional Analysis with Applications. John Wiley & Sons, Chichester (1978)
SVM Speaker Verification Using Session Variability Modelling and GMM Supervectors M. McLaren, R. Vogt, and S. Sridharan Speech and Audio Research Laboratory Queensland University of Technology, Brisbane, Australia {m.mclaren,r.vogt,s.sridharan}@qut.edu.au Abstract. This paper demonstrates that modelling session variability during GMM training can improve the performance of a GMM supervector SVM speaker verification system. Recently, a method of modelling session variability in GMM-UBM systems has led to significant improvements when the training and testing conditions are subject to session effects. In this work, session variability modelling is applied during the extraction of GMM supervectors prior to SVM speaker model training and classification. Experiments performed on the NIST 2005 corpus show major improvements over the baseline GMM supervector SVM system.
1 Introduction
Commonly, text-independent speaker verification systems employ Gaussian mixture models (GMMs) trained using maximum a-posteriori (MAP) adaptation from a universal background model (UBM) to provide state-of-the-art performance [1,2,3]. The GMM-UBM approach involves generative modelling whereby the distribution that produced some observed data is determined. A major challenge in the design of speaker verification systems is the task of increasing robustness under adverse conditions. During the GMM training process, adverse effects due to session variability contribute to errors in the distribution estimation. Recently, a method was proposed to directly model session variability in telephony speech during model training and testing [4,5]. The main assumption is that session effects can be represented as a set of offsets from the true speaker model means. A further assumption is that the offsets are constrained to a low-dimensional space. Session variability modelling attempts to directly model the session effects in the model space, removing the need for discrete session categories and data labelling required for regular handset and channel normalisation and feature mapping [6,7]. Direct modelling of session effects has led to a significant increase in robustness to channel and session variations in GMM-based speaker verification systems. Results show a reduction of 46% in EER and 42% in minimum detection cost over baseline GMM-UBM performance on the Mixer corpus of conversational telephony data [5]. Session variability modelling aims to relate the session effects across the mixture components of a model. For this technique, the speaker dependent information of a model's mixture components can be conveniently represented as a
GMM mean supervector formed through the concatenation of the GMM component means. In contrast to the traditional GMM-UBM classifier, the support vector machine (SVM) is a two-class, discriminative classifier whereby the maximum margin between classes is determined. SVMs utilise a kernel to linearly separate two classes in a high-dimensional space. A SVM speaker verification system recently presented by Campbell et al. has utilised GMM mean supervectors as features to provide performance comparable to state-of-the-art GMM-UBM classifiers [8]. The fundamental differences between GMM and SVM classification bring into question whether techniques used to improve GMM systems based on distribution estimation can also enhance SVM classification based on margin maximisation. This paper aims to demonstrate that robust modelling techniques developed for generative modelling can improve the performance of discriminative classifiers. The approach taken involves modelling session variability in the GMM mean supervector space prior to SVM speaker model training and classification. A description of the common GMM-UBM system and recent research into session variability modelling is presented in Section 2. A brief summary of support vector machines is presented in Section 3 along with details regarding the extraction of session variability modelled supervectors for SVM training. Presented in Section 4 is the experimental configuration with the results of the system evaluated using the NIST 2005 database in Section 5.
2 Modelling Session Variability in the GMM Mean Supervector Space

2.1 The GMM-UBM Classifier
In the context of speaker verification, GMMs are trained using features extracted from a speech sample to represent the speech production of a speaker. In such a system, MAP adaptation is employed to adapt only the means of the UBM to model a speaker [1]. The classifier computes a likelihood ratio using the trained speaker model and the UBM, giving a measure of confidence as to whether a particular utterance was produced by the given speaker. GMM-UBM classifiers provide state-of-the-art performance when coupled with a combination of robust feature modification and score normalisation techniques [3,9]. The GMM likelihood function is

g(x) = Σ_{c=1}^{C} ω_c N(x; μ_c, Σ_c),    (1)
where ω_c are the component mixture weights, μ_c the means, and Σ_c the covariances of the Gaussians. A mean supervector can be obtained by concatenating the mean vectors, μ = [μ_1ᵀ · · · μ_Cᵀ]ᵀ. As only the means are adapted during speaker model training, the speaker model can be compactly represented by the common UBM and a speaker-dependent GMM mean supervector offset.
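A rough sketch of this supervector construction is given below: relevance-MAP adaptation of the UBM means to one utterance, followed by concatenation. It assumes diagonal covariances, and the relevance factor (matching the adaptation factor of 8 quoted later in Sect. 4.1) and all names are illustrative, not the exact recipe of this system.

```python
import numpy as np

def mean_supervector(frames, ubm_means, ubm_covs, ubm_weights, relevance=8.0):
    """Sketch of relevance-MAP mean adaptation followed by supervector
    concatenation. frames: (T, d); UBM parameters: (C, d) arrays and (C,) weights."""
    diff = frames[:, None, :] - ubm_means[None, :, :]            # (T, C, d)
    log_g = -0.5 * (np.sum(diff ** 2 / ubm_covs, axis=2)
                    + np.sum(np.log(2 * np.pi * ubm_covs), axis=1))
    log_p = np.log(ubm_weights) + log_g                          # (T, C)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                      # responsibilities
    n_c = post.sum(axis=0)                                       # zeroth-order stats
    f_c = post.T @ frames                                        # first-order stats
    alpha = (n_c / (n_c + relevance))[:, None]
    adapted = alpha * (f_c / np.maximum(n_c, 1e-10)[:, None]) + (1 - alpha) * ubm_means
    return adapted.reshape(-1)                                   # concatenated means

# Toy usage with a hypothetical 4-component, 3-dimensional UBM:
rng = np.random.default_rng(0)
means, covs, weights = rng.standard_normal((4, 3)), np.ones((4, 3)), np.full(4, 0.25)
sv = mean_supervector(rng.standard_normal((100, 3)), means, covs, weights)
print(sv.shape)   # (12,) = C * d
```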
2.2 Session Variability Modelling
Attempts to directly model session variability in GMM-UBM based speaker verification systems have provided significant performance improvements when using telephony speech [5]. The purpose of session variability modelling is to introduce a constrained offset of the speaker's mean vectors to represent the effects introduced by the session conditions. In other words, the Gaussian mixture model that best represents the acoustic observations of a particular recording is the combination of a session-independent speaker model and an additional session-dependent offset from the true model means. This can be represented in terms of the GMM component mean supervectors as

μ_h(s) = m + y(s) + U z_h(s).    (2)
Here, the speaker s is represented by the offset y(s) from the speaker independent (or UBM) mean supervector m. To represent the conditions of the particular recording (designated with the subscript h), an additional offset of U z h (s) is introduced where z h (s) is a low-dimensional representation of the conditions in the recording and U is the low-rank transformation matrix from the constrained session variability subspace to the GMM mean supervector space. Speaker models are trained through the simultaneous optimisation of the model parameters y(s) and z h (s), h = 1, ..., H over a set of training observations. The speaker model parameters are optimised according to the maximum a posteriori (MAP) criterion often used in speaker verification systems [10,2]. The speaker offset y(s) has a prior as described by Reynolds [1] while the prior for each of the session factors z h (s) is assumed to belong to a standard normal distribution, N (0, I). An efficient procedure for the optimisation of the model parameters is described in [5]. The session variability vectors are not actually retained to model the speaker but their estimation is necessary to accurately estimate the true speaker means. A similar optimisation process is used during testing.
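As a purely illustrative stand-in for the joint optimisation described above (the actual procedure follows the MAP criterion of [5]), the toy sketch below alternates least-squares estimates of the speaker offset y(s) and session factors z_h(s) for a handful of session supervectors; the ridge term standing in for the standard-normal prior and all names are assumptions.

```python
import numpy as np

def decompose_supervectors(mu_sessions, m, U, n_iter=20, reg=1.0):
    """Toy alternating estimate of eq. (2), mu_h(s) = m + y(s) + U z_h(s),
    for H session supervectors of one speaker (a least-squares stand-in,
    not the MAP procedure of [5])."""
    H, D = mu_sessions.shape
    R = U.shape[1]
    y = np.zeros(D)
    Z = np.zeros((H, R))
    A = np.linalg.inv(U.T @ U + reg * np.eye(R)) @ U.T    # ridge projector onto the session subspace
    for _ in range(n_iter):
        for h in range(H):                                # session factors given y
            Z[h] = A @ (mu_sessions[h] - m - y)
        residual = mu_sessions - m - Z @ U.T              # speaker offset given session terms
        y = residual.mean(axis=0)
    return y, Z

# Hypothetical toy usage with synthetic data:
rng = np.random.default_rng(1)
D, R, H = 100, 5, 3
U = rng.standard_normal((D, R)) * 0.1
m = rng.standard_normal(D)
y_true = rng.standard_normal(D) * 0.5
mu = np.stack([m + y_true + U @ rng.standard_normal(R) for _ in range(H)])
y_est, Z_est = decompose_supervectors(mu, m, U)
print(np.corrcoef(y_true, y_est)[0, 1])
```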
3 Support Vector Machines
A support vector machine (SVM) performs classification by mapping observations to a high-dimensional, discriminative space while maintaining good generalisation characteristics [11]. SVM training involves the positioning of a hyperplane in the high-dimensional space such that the maximum margin exists between classes; a procedure unlike distribution estimation for GMMs. The term support vectors refers to the training vectors which are located on or between the class boundaries and, as a result, contribute to the positioning of the separating hyperplane. A kernel function K(X a , X b ) = φ(X a ) · φ(X b ) is used to compare observations in the high-dimensional space to avoid explicitly evaluating the mapping function φ(X).
3.1 A GMM Supervector SVM Speaker Verification System
In the context of speaker verification, it is necessary to be able to compare two utterances of varying lengths when using SVMs. A method of achieving this is to train a GMM through mean adaptation to represent each utterance, from which mean supervectors can be extracted. SVMs using GMM supervectors as feature vectors have demonstrated promising capabilities when only feature mapping and feature normalisation are applied [8]. The process of producing a GMM mean supervector to represent an utterance can be viewed as a kernel. Essentially, features from a variable length sequence of feature vectors are being transformed from the input space to the SVM feature space. In the given context, the SVM feature space has a dimension determined by the length of the GMM mean supervector. The SVM system implemented in this work uses the mean offsets from a gender dependent UBM as input supervectors for SVM classification. That is, the supervector representing utterance X a is the difference between the supervector μa extracted from the mean adapted GMM trained from X a and the supervector m taken from the gender dependent UBM. The motivation for removing the UBM mean bias is to reduce the rounding errors accumulated when equating dot products of floating point representations in high dimensions. The input supervectors are also scaled to have unit variance in each dimension based on statistics from the background dataset. The aim of this process is to allow each dimension of the supervector an equal opportunity to contribute to the SVM. The SVM kernel using background data scaling can then be formulated as, K(X a , X b ) = (μa − m)T B −1 (μb − m),
(3)
where B is the diagonal covariance matrix of the background data. This background dataset is a collection of non-target speakers used to provide negative examples in the SVM training process. 3.2
Incorporating Session Variability into GMM Supervectors
The session variability modelling technique described in Section 2.2 is employed during GMM training to estimate and remove the contribution of the session conditions from the adapted model means. The trained model means can then be represented in GMM supervector form by y(s) in (2). This session-independent speaker model provides a method of incorporating session variability modelling into SVM classification. This differs from Campbell’s nuisance attribute projection (NAP) in which subspaces in the SVM kernel contributing to variability are removed through projection [12]. The following experiments attempt to model session variability into the GMM supervectors during GMM training in order to demonstrate the possible advantages that such techniques for generative modelling may impart on discriminative classification.
SVM Speaker Verification Using Session Variability Modelling
4
1081
Experiments
Evaluation of the proposed method was conducted using the NIST 2005 speaker recognition corpus consisting of conversational telephone speech from the Mixer Corpus. Focus was given to 1-sided training and testing using the common evaluation condition, restricted to English dialogue as detailed in the NIST evaluation plan [13]. The performance measures used for system evaluation were the equal error rate (EER) and minimum decision cost function (DCF) Further experiments involving score-normalisation were conducted on the systems to aid in the comparison of the GMM and SVM domains [6]. A set of 55 male and 87 female T-Norm models were trained to estimate the score normalisation parameters. 4.1
GMM-UBM System
As a point of reference, a baseline GMM-UBM system was implemented. The system uses MAP adaptation with an adaptation factor of 8 and feature-warped MFCC features with appended delta coefficients [7]. Throughout the trials, 512 GMM mixture components were used. Gender dependent UBMs were trained using a diverse selection of 1818 utterances from both Mixer and Switchboard 2 corpora. The GMM-UBM system employing the session variability modelling technique presented in [5] used gender dependent transform matrices U with a session subspace dimension of Rs = 50 trained from the same data as was used to train the UBMs. 4.2
GMM Supervector SVM System
The training of SVM speaker models required the production of several sets of utterance supervectors. A GMM mean supervector was produced to represent each (1) utterance in the background data set, (2) training utterance and (3) testing utterance. The background dataset consisted of all utterances used to train the UBMs. The difference between the standard SVM and the session variability modelling SVM system is the method used to train the GMMs to represent each utterance prior to extraction of the supervectors. The baseline SVM system used standard MAP adapted GMMs to represent each utterance while the session SVM system employed session variability modelling training. In the latter system, session variability modelling was applied to the GMM of each utterance including those in the background data set. For both systems, the supervectors were used to train one-sided SVM speaker models using LIBSVM [14]. A single supervector was used to represent the target training utterance while non-target training utterances were represented by the gender dependent background data set. The SVM employed a linear-based kernel using background data scaling as detailed in (3).
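A minimal sketch of the one-sided speaker-model training is given below. The paper uses LIBSVM directly; the sketch uses scikit-learn's SVC (which wraps the same library), and `phi` is assumed to be the scaling map from the kernel sketch above. All names are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of one-sided SVM speaker-model training: one target supervector as the
# positive class, the background supervectors as the negative class, and a
# linear kernel over the background-scaled inputs.
def train_speaker_model(target_supervector, background_supervectors, phi):
    X = np.vstack([phi(target_supervector)[None, :],
                   np.apply_along_axis(phi, 1, background_supervectors)])
    y = np.concatenate([[1.0], -np.ones(len(background_supervectors))])
    model = SVC(kernel="linear")
    model.fit(X, y)
    return model

def score(model, test_supervector, phi):
    # signed distance to the separating hyperplane is used as the trial score
    return model.decision_function(phi(test_supervector)[None, :])[0]
```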
5 Results
A comparison of performance between different system configurations is shown in Figure 1 with resulting EER and minimum DCF points detailed in Table 1. Results of systems including score normalisation are also detailed in this table. These results show that a distinct performance gain can be achieved in discriminative classifiers when robust modelling techniques are applied during generative model training. This is evident by the observed performance variation between the two discriminative classifiers. The minimum DCF of the SVM system was reduced from .0258 to .0185 when session variability modelling was applied; a 28% relative improvement. In terms of EER, the session SVM system has a gain of 13% over the reference SVM configuration.
Fig. 1. DET plot for the 1-side condition comparing GMM-UBM and GMM mean supervector SVM systems, with and without session variability modelling
A comparison between the reference GMM-UBM and SVM systems shows the SVM configuration having a gain of 38% in minimum DCF and 30% in EER over the GMM-UBM. Similarly, an improvement of 35% and 10% in minimum DCF and EER respectively is found between the two session configurations. A significant improvement is shown through the GMM supervector SVM classification over the baseline GMM-UBM configuration which reflects the findings in [12]. Noteworthy is the performance of the reference SVM system being similar to that of the session GMM-UBM system throughout the mid to high false alarm range.
Table 1. Minimum DCF and EER results for 1-side condition for GMM-UBM and GMM mean supervector SVM systems, including T-Norm results

System            | Standard EER | Standard Min. DCF | T-Norm EER | T-Norm Min. DCF
Reference GMM-UBM | 9.15%        | .0418             | 9.95%      | .0392
Session GMM-UBM   | 6.23%        | .0286             | 5.58%      | .0239
Reference SVM     | 6.38%        | .0258             | 6.15%      | .0240
Session SVM       | 5.58%        | .0185             | 5.26%      | .0189
Session Fused     | 4.41%        | .0168             | 4.74%      | .0160
Table 1 shows that a significant advantage was found through the application of T-Norm to the session GMM-UBM configuration, supporting previous results indicating that the session GMM-UBM system responds particularly well to score normalisation [5]. Conversely, the session SVM system showed little change through the normalisation technique while the reference GMM-UBM and SVM configurations both showed similar, moderate improvements when applying score-normalisation. The modest improvements due to T-Norm for the session SVM system suggests that this system may produce scores that are less prone to output score variations across different test utterances. The scores from both the session GMM-UBM and the session SVM system were linearly fused to minimise the mean-squared-error. The DET plot demonstrates that performance is further boosted through this process. The fused system gave a relative improvement of 9% in minimum DCF and 21% in EER over the session SVM configuration. This result indicates that complementary information is found between the two systems despite session variability modelling being incorporated in both. Applying T-Norm to this fused system provided mixed results. Future work will investigate further score normalisation methods for the GMM mean supervector SVM system using session variability modelling. A comparison between Campbell’s method of nuisance attribute projection (modelling session variability in the SVM kernel) with GMM supervectors [12] and the work presented in this paper would also be of interest.
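For reference, the T-Norm normalisation applied in these experiments can be sketched as follows. This is a generic illustration of test normalisation, not the authors' code; `score_fn` and the cohort handling are assumptions for the example.

```python
import numpy as np

# Sketch of T-Norm: a raw trial score is normalised by the mean and standard
# deviation of the same test utterance scored against a cohort of T-Norm
# speaker models (55 male / 87 female models in these experiments).
def t_norm(raw_score, test_utterance, cohort_models, score_fn):
    cohort_scores = np.array([score_fn(m, test_utterance) for m in cohort_models])
    return (raw_score - cohort_scores.mean()) / (cohort_scores.std() + 1e-10)
```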
6 Conclusions
This paper has demonstrated that employing robust modelling techniques during GMM training improves the performance of a GMM mean supervector SVM speaker verification system. This is of interest due to the fundamental differences between the two classification systems: GMMs are based on distribution estimation, whereas SVM classification is based on margin maximisation. Applying session variability modelling during the training of the GMM mean supervectors for SVM classification showed significant performance gains when evaluated using the NIST 2005 SRE corpus and was superior to the session GMM-UBM configuration. Fusion of the session GMM-UBM and session SVM systems displayed performance above either configuration on its own.
Acknowledgments. This research was supported by the Australian Research Council (ARC) Discovery Grant Project ID: DP0557387.
References 1. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10(1), 19–41 (2000) 2. Gauvain, J.L., Lee, C.H.: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing 2(2), 291–298 (1994) 3. Przybocki, M., Martin, A.: NIST Speaker Recognition Evaluation Chronicles. In: Odyssey Workshop (2004) 4. Kenny, P., Dumouchel, P.: Experiments in speaker verification using factor analysis likelihood ratios. In: Odyssey: The Speaker and Language Recognition Workshop, pp. 219–226 (2004) 5. Vogt, R., Sridharan, S.: Experiments in Session Variability Modelling for Speaker Verification. IEEE International Conference on Acoustics, Speech and Signal Processing 1, 897–900 (2006) 6. Auckenthaler, R., Carey, M., Lloyd-Thomas, H.: Score normalization for textindependent speaker verification systems. Digital Signal Processing 10(1), 42–54 (2000) 7. Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. Proc. Speaker Odyssey 2001 (2001) 8. Campbell, W.M., Sturim, D.E., Reynolds, D.A.: Support vector machines using GMM supervectors for speaker verification. Signal Processing Letters 13(5), 308– 311 (2006) 9. Vogt, R.: Automatic Speaker Recognition Under Adverse Conditions. PhD thesis, Queensland University of Technology, Brisbane, Queensland (2006) 10. Reynolds, D.A.: Comparison of background normalization methods for textindependent speaker verification. Proc. Eurospeech 97 (1997) 11. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995) 12. Campbell, W.M., Sturim, D.E., Reynolds, D.A., Solomonoff, A.: SVM Based Speaker Verification using a GMM Supervector Kernel and NAP Variability Compensation. IEEE International Conference on Acoustics, Speech and Signal Processing 1, 97–100 (2006) 13. The NIST 2006 Speaker Recognition Evaluation Plan (2006), Available at http://www.nist.gov/speech/tests/spk/2004/SRE-04 evalplan-v1a.pdf 14. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. (2001), Software available at http://www.csie.ntu.edu.tw/∼ cjlin/libsvm
3D Model-Based Face Recognition in Video Unsang Park and Anil K. Jain Department of Computer Science and Engineering Michigan State University 3115 Engineering Building East Lansing, MI 48824, USA {parkunsa,jain}@cse.msu.edu
Abstract. Face recognition in video has gained wide attention due to its role in designing surveillance systems. One of the main advantages of video over still frames is that evidence accumulation over multiple frames can provide better face recognition performance. However, surveillance videos are generally of low resolution containing faces mostly in non-frontal poses. Consequently, face recognition in video poses serious challenges to state-of-the-art face recognition systems. Use of 3D face models has been suggested as a way to compensate for low resolution, poor contrast and non-frontal pose. We propose to overcome the pose problem by automatically (i) reconstructing a 3D face model from multiple non-frontal frames in a video, (ii) generating a frontal view from the derived 3D model, and (iii) using a commercial 2D face recognition engine to recognize the synthesized frontal view. A factorization-based structure from motion algorithm is used for 3D face reconstruction. The proposed scheme has been tested on CMU’s Face In Action (FIA) video database with 221 subjects. Experimental results show a 40% improvement in matching performance as a result of using the 3D models. Keywords: Face recognition, video surveillance, 3D face modeling, view synthesis, structure from motion, factorization, active appearance model.
1 Introduction

Automatic face recognition has now been studied for over three decades. While substantial performance improvements have been made in controlled scenarios (frontal pose and favorable lighting conditions), the recognition performance is still brittle with pose and lighting variations [12, 13]. Until recently, face recognition was mostly limited to one or more still shot images, but the current face recognition studies are attempting to combine still shots, video and 3D face models to achieve better performance. In particular, face recognition in video has gained substantial attention due to its applications in deploying surveillance systems. However, face images captured in surveillance systems are mostly off-frontal and have low resolution. Consequently, they do not match very well with the gallery that typically contains frontal face images.
Fig. 1. Face recognition system with 3D model reconstruction and frontal view synthesis (pipeline: input video → reconstructed 3D model (shape and texture) → synthesized frontal view from the 3D model → matched against the frontal gallery → identity)
There have been two main approaches to overcome the problem of pose and lighting variations: (i) view-based and (ii) view synthesis. View-based methods enroll multiple face images under various pose and lightings and match the probe image with the gallery image with most similar pose and lighting conditions [1, 2]. Viewsynthesis methods generate synthetic views from the input probe images with similar pose and lighting conditions as in gallery data to improve the matching performance. The desired view can be synthesized by learning the mapping function between pairs of training images [3] or by using 3D face models [4, 11]. The parameters of the 3D face model in the view synthesis process can also be used for face recognition [4]. Some of the other approaches for face recognition in video utilize appearance manifolds [14] or probabilistic models [15], but they require complicated training process and have been tested only on a small database. The view-synthesis method is more appealing than the view-based method in two respects. First, it is not practical to collect face images at all possible pose and lighting conditions for the gallery data. Second, the state-of-the-art face recognition systems [5] perform the best in matching two near-frontal face images. We propose a face recognition system that identifies the subject in a video which contains mostly non-frontal faces. We assume that only the frontal pose is enrolled in gallery data. This scenario is commonly observed in practical surveillance systems. The overall system is depicted in Fig. 1. One of the main contributions of the proposed work is to utilize 3D reconstruction techniques [6, 7, 8] for the purpose of handling the pose variation in the face recognition task. Unlike the morphable model based approach [4], comprehensive evaluations of 3D reconstruction from 2D images for the face recognition task have not been reported. Most of the effort in 3D face model reconstruction from 2D video has focused on accurate facial surface reconstruction, but the application of the resulting model for recognition has not been extensively explored. The contributions of our work are: (i) quantitative evaluation of
the performance of the factorization algorithm in structure from motion, (ii) view synthesis using structure from motion and its use in face recognition, and (iii) evaluation on a public domain video database (CMU's Face In Action database [10]) using a commercial state-of-the-art face recognition engine (FaceVACS from Cognitec [5]).
2 3D Face Reconstruction Obtaining a 3D face model from a sequence of 2D images is an active research problem. Morphable model (MM) [4], stereography [19], and Structure from Motion (SfM) [17, 6] are well known methods in 3D face model construction from 2D images or video. MM method has been shown to provide accurate reconstruction performance, but the processing time is overwhelming for use in real-time systems. Stereography also provides good performance and has been used in commercial applications [19], but it requires a pair of calibrated cameras, which limits its use in many surveillance applications. SfM gives reasonable performance, ability to process in real-time, and does not require a calibration process, making it suitable for surveillance applications. Since we are focusing on face recognition in surveillance video, we propose to use the SfM technique to reconstruct the 3D face models. 2.1 Tracking Facial Feature Points We use 72 facial feature points that outline the eyes, eyebrows, nose, mouth and facial boundary. While the actual number of feature points is not critical, there needs to be sufficient number of points to capture the facial characteristics. Number of points used in face modeling using AAM vary in the range 60~100. The predefined facial feature points are automatically detected and tracked by the Active Appearance Model (AAM) [18], which is available as a SDK in public domain [9]. We train the AAM model on a training database with about ± 45° yaw, pitch and roll variations. As a result, the facial feature points from a face image within ±45° variations can be reliably located in the test images. The feature points detected in one frame are used as the initial locations for searching in the next frame, resulting in more stable point correspondences. An example of feature point detection is shown in Fig. 2. 2.2 3D Shape Reconstruction The Factorization method [6] is a well known solution for the Structure from Motion problem. There are different factorization methods depending on the rigidity of the object [7, 8], to recover the detailed 3D shape. We regard the face as a rigid object and treat the small expression changes as noise in feature point detection, resulting in recovering only the most dominant shape from video data. We use orthographic projection model that works reasonably well for face reconstruction when the distance between camera and object is a few meters. Under orthographic projection model, the relationship between 2D feature points and 3D shape is given by W = M⋅S,
(1)
\[
W = \begin{bmatrix}
u_{11} & u_{12} & \cdots & u_{1p} \\
u_{21} & u_{22} & \cdots & u_{2p} \\
\vdots & & & \vdots \\
u_{f1} & u_{f2} & \cdots & u_{fp} \\
v_{11} & v_{12} & \cdots & v_{1p} \\
v_{21} & v_{22} & \cdots & v_{2p} \\
\vdots & & & \vdots \\
v_{f1} & v_{f2} & \cdots & v_{fp}
\end{bmatrix},
\quad
M = \begin{bmatrix}
i_{1x} & i_{1y} & i_{1z} \\
i_{2x} & i_{2y} & i_{2z} \\
\vdots & \vdots & \vdots \\
i_{fx} & i_{fy} & i_{fz} \\
j_{1x} & j_{1y} & j_{1z} \\
j_{2x} & j_{2y} & j_{2z} \\
\vdots & \vdots & \vdots \\
j_{fx} & j_{fy} & j_{fz}
\end{bmatrix},
\quad
S = \begin{bmatrix}
S_{x1} & S_{x2} & \cdots & S_{xp} \\
S_{y1} & S_{y2} & \cdots & S_{yp} \\
S_{z1} & S_{z2} & \cdots & S_{zp}
\end{bmatrix}
\qquad (2)
\]
where u_{fp} and v_{fp} in W represent the row and column pixel coordinates of the pth point in the fth frame, each pair of i_f^T = [i_{fx} i_{fy} i_{fz}] and j_f^T = [j_{fx} j_{fy} j_{fz}] in M represents the rotation matrix with respect to the fth frame, and S represents the 3D shape. The translation term is omitted in Eq. (1) because all 2D coordinates are centered at the origin. The rank of W in Eq. (2) is 3 in the ideal noise-free case. The solution of Eq. (1) is obtained by a two-step process: (i) find an initial estimate of M and S by singular value decomposition, and (ii) apply metric constraints on the initial estimates. By a singular value decomposition of W, we obtain

W = U · D · V^T ≈ U′ · D′ · V′^T,   (3)
where U and V are unitary matrices of size 2F×2F and P×P, respectively, and D is a matrix of size 2F×P for F frames and P tracked points. Given U, D and V, U′ is the first three columns of U, D′ is the first three rows and columns of D, and V′^T is the first three rows of V^T, imposing the rank-3 constraint on W. Then M′ and S′, the initial estimates of M and S, are obtained as

M′ = U′ · D′^{1/2},   S′ = D′^{1/2} · V′^T.   (4)
Fig. 2. Pose estimation scheme (input video → feature point detection → 3D reconstruction → iterative fitting of the 3D model to feature points → estimated facial pose in yaw, pitch and roll)
To impose the metric constraints on M′, a 3×3 correction matrix A is defined such that

([i_f j_f]^T · A) · (A^T · [i_f j_f]) = E,   (5)
where i_f is the fth i vector in the upper half rows of M, j_f is the fth j vector in the lower half rows of M, and E is the 2×2 identity matrix. The constraints in Eq. (5) need to be imposed across all frames. There is one i_f and one j_f vector per frame, which together generate three constraints. Since A · A^T is a 3×3 symmetric matrix, there are six unknown variables, so at least two frames are required to solve Eq. (5). In practice, to obtain a robust solution, more than two frames are needed and the solution is obtained by the least-squares method. The final solution is obtained as

M = M′ · A,   S = A^{−1} · S′,   (6)
where M contains the rotation information between each frame and the 3D object and S contains the 3D shape information. We will provide the lower bound evaluation of the performance of Factorization method on synthetic and real data in Section 3. 2.3 3D Facial Pose Estimation We estimate the facial pose in a video frame to select the best texture model for the 3D face construction. The video frame with facial pose close to frontal is a good candidate for texture mapping because it covers most of the face area. If a single profile image is used for texture mapping, the quality of the 3D model will be poor in the occluded region. When a frame in a near-frontal pose is not found, two frames are used for texture mapping. There are many facial pose estimation methods in 2D and 3D domains [20]. Because the head motion occurs in 3D domain, 3D information is necessary for accurate pose estimation. We estimate the facial pose in [yaw, pitch, roll] (YPR) values as shown in Fig. 2. Even though all the rotational relationships between the 3D shape and the 2D feature points in each frame are already obtained by the matrix M in factorization process, it reveals only the first two rows of the rotation matrix for each frame, which generates inaccurate solutions in obtaining YPR values especially in noisy data. Therefore, we use the gradient descent method to iteratively fit the reconstructed 3D shape to the 2D facial feature points. 2.4 Texture Mapping We define the 3D face model as a set of triangles and generate a VRML object. Given the 72 points obtained from the reconstruction process, 124 triangles are generated. While the triangles can be obtained automatically by Delaunay triangulation process [16], we use a predefined set of triangles for efficiency sake because the number and configuration of the feature points are fixed. The corresponding set of triangles can be obtained from the video frames with a similar process. Then, the VRML object is generated by mapping the triangulated texture to the 3D shape. The best frame to be used in texture mapping is selected based on the pose estimation as described earlier. When all the available frames deviate
Fig. 3. Texture mapping. (a) a video sequence used for 3D reconstruction; (b) single frame with triangular meshes; (c) two frames with triangular meshes; (d) reconstructed 3D face model with one texture in (b); (e) reconstructed 3D face model with two texture mappings in (c). The two frontal poses in (d) and (e) are correctly identified in matching experiment.
significantly from the frontal pose, two frames are used for texture mapping as described in Fig. 3. Even though both the synthetic frontal views in Figs. 3 (d) and (e) are correctly recognized, the one in (e) looks more realistic. When more than one texture is used for texture mapping, a sharp boundary is often observed across the boundary where two different textures are combined because of the difference in illumination. However, the synthetic frontal views are correctly recognized in most cases regardless of this artifact.
3 Experimental Results We performed a number of experiments to i) evaluate the lower performance bound of the Factorization algorithm in terms of the rotation angle and number of frames in synthetic and real data, ii) reconstruct 3D face models on a large public domain video database, and iii) perform face recognition using the reconstructed 3D face models. 3.1 3D Face Reconstruction with Synthetic Data We first evaluate the performance of the Factorization algorithm using synthetic data. A set of 72 facial feature points are obtained from a ground truth 3D face model. A sequence of 2D coordinates of the facial feature points are directly obtained from the ground truth. We take the angular values for the rotation in steps of 0.1° in the range (0.1°, 1°) and in steps of 1° in the range (1°, 10°). The number of frames range from 2 to 5. The RMS error between the ground truth and the reconstructed shape is shown in Fig. 4 (a). While the number of frames required for the reconstruction in the noiseless case is two (Sec. 2.2), in practice more frames are needed to keep the error small. As long as the number of frames is more than two, the errors are negligible.
Fig. 4. (a) RMS error between reconstructed shape and ground truth. (b) RMS error between reconstructed and ideal rotation matrix, Ms.
3.2 3D Face Reconstruction with Real Data For real data, noise is present in both the facial feature point detection and the correspondences between detected points across frames. This noise is not random and its affect is more pronounced at points of self-occlusion and on the facial boundary. Since AAM does use feature points on the facial boundary, the point correspondences are not very accurate in the presence of self-occlusion. Reconstruction experiments are performed on real data with face rotation from -45° to +45° across 61 frames. We estimate the rotation between successive frames as 1.5° (61 frames varying from -45° to +45°) and obtain the reconstruction error with rotation in steps of 1.5° in the range (1.5°, 15°). The number of frames used is from 2 to 61. A direct comparison between the ground truth and the reconstructed shape is not possible in case of real data because the ground truth is not known. Therefore, we measure the orthogonality of M to estimate the reconstruction accuracy. Let M be a 2F×3 matrix as shown in Eq. (3) and M(a:b,c:d) represent the sub matrix of M from rows a to b and columns c to d. Then, Ms = M×M’ is a 2Fx2F matrix where all elements in Ms(1:F, 1:F) and Ms(F+1:2F, F+1:2F) are equal to 1 and all elements in Ms(1:F, F+1:2F) and Ms(F+1:2F, 1:F) are equal to 0 if M is truly an orthogonal matrix. We measure the RMS difference between the ideal Ms and the calculated Ms as the reconstruction error. The reconstruction error for real data is shown in Fig. 4 (b).
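The orthogonality-based error measure described above can be written compactly. The sketch below is an illustration, not the authors' code; it assumes M stacks the i vectors over the j vectors as in Eq. (2) and interprets the ideal Ms as the block-constant pattern described in the text.

```python
import numpy as np

# Sketch of the reconstruction-error proxy for real data: Ms = M M^T is
# compared against an ideal block pattern (ones in the two diagonal F x F
# blocks, zeros elsewhere) and the RMS difference is reported.
def reconstruction_error(M):
    F = M.shape[0] // 2
    Ms = M @ M.T
    ideal = np.zeros((2 * F, 2 * F))
    ideal[:F, :F] = 1.0
    ideal[F:, F:] = 1.0
    return float(np.sqrt(np.mean((Ms - ideal) ** 2)))
```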
Fig. 5. Examples where 3D face reconstruction was not successful. (a), (b), (c) and (d) Failure of feature point detection using AAM; (e), (f) and (g) Deficiency of motion cue, resulting in (h) a failure of SfM to reconstruct the 3D face model.
Based on experiments with real data, it is observed that the number of frames needed for reconstruction is more for real data than the synthetic data, but the error decreases quickly as the number of frames increases. The slight increase in error with larger pose differences is due to error in point correspondences from self-occlusion. 3.3 Face Recognition with Pose Correction We have used a subset of CMU’s Face In Action (FIA) video database [10] that includes 221 subjects for our matching experiments. To demonstrate the advantage of using reconstructed 3D face models for recognition, we are primarily interested in video sequences that contain mostly non-frontal views for each subject. Since the reconstruction with SfM performs better when there are large motions, both left and right non-frontal views are collected for each subject in FIA database, if available, resulting, on average, about 10 frames per subject. When there is a sufficient interframe motion and the feature point detection performs well, it is possible to obtain the 3D face model from only 3 different frames, which in consistent with the results shown in Fig. 4. The number of frames that is required for the reconstruction can be determined based on the orthogonality of M. We successfully reconstructed 3D face models for 197 subjects out of the 221 subjects in the database. The reconstruction process failed for 24 subjects either due to poor facial feature point detection in the AAM process or the deficiency of motion cue, which caused a degenerate solution in the factorization algorithm. Example images where AAM or SfM failed are shown in Fig. 5. The failure occurs due to large pose or shape variations that were not represented in the samples used to train the AAM model. All the reconstructed 3D face models are corrected in their pose to make all yaw, pitch, and roll values equal to zero. The frontal face image can be obtained by projecting the 3D model in the 2D plane. Once the frontal view is synthesized, FaceVACS® face recognition engine from Cognitec [5] is used to generate the matching score. This engine is one of the best commercial 2D face recognition systems. The face recognition results for frontal face video, non-frontal face video
Fig. 6. Face recognition performance with 3D face modeling
Fig. 7. 3D model-based face recognition results on six subjects (Subject IDs in the FIA database are 47, 56, 85, 133, and 208). (a) Input frames; (b), (c) and (d) reconstructed 3D face models at right, left, and frontal views, respectively; (e) frontal images enrolled in the gallery database. All the frames in (a) are not correctly identified, while the synthetic frontal views in (d) obtained from the reconstructed 3D models are correctly identified for the first four subjects, but not for the last subject (#208). The reconstructed 3D model of the last subject appears very different from the gallery image, resulting in the recognition failure.
and non-frontal face video with 3D face modeling are shown in Fig. 6 based on 197 subjects for which the 3D face reconstruction was successful. The CMC curves show that the FaceVACS engine does extremely well for frontal pose video but its performance drops drastically for non-frontal pose video. By using the proposed 3D face modeling, the rank-1 performance in non-frontal scenario improves by about 40%. Example 3D face models and the synthesized frontal views from six different subjects are shown in Fig. 7.
4 Conclusions We have shown that the proposed 3D model based face recognition from video provides a significantly better performance, especially when many non-frontal views are observed. The proposed system synthesizes the frontal face image from the 3D reconstructed models and matches it against the frontal face enrolled in the gallery. We have automatically generated 3D face models for 197 subjects in the Face In Action database (session 1, indoor, camera 5) using the SfM factorization method. The experimental results show substantial improvement in the rank-1 matching performance (from 30% to 70%) for video with non-frontal pose. The entire face recognition process (feature point tracking, 3D model construction, matching) takes ~10 s per subject on a Pentium IV PC for 320x240 frames. We are working to improve the matching speed and the texture blending technique to remove the sharp boundary that is often observed when mapping multiple textures.
References [1] Pentland, A., Moghaddam, B., Starner, T.: View-based and Modular Eigenspace for Face Recognition. In: Proc. CVPR, pp. 84–91 (1994) [2] Chai, X., Shan, S., Chen, X., Gao, W.: Local Linear Regression (LLR) for Pose Invariant Face Recognition. In: Proc. AFGR, pp. 631–636 (2006) [3] Beymer, D., Poggio, T.: Face Recognition from One Example View. In: Proc. ICCV, pp. 500–507 (1995) [4] Blanz, V., Vetter, T.: Face Recognition based on Fitting a 3D Morphable Model. IEEE Trans. PAMI 25, 1063–1074 (2003) [5] FaceVACS Software Developer Kit, Cognitec, http://www.cognitec-systems.de [6] Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: A factorization method. Int. Journal of Computer Vision 9(2), 137–154 (1992) [7] Xiao, J., Chai, J., Kanade, T.: A Closed-Form Solution to Non-Rigid Shape and Motion Recovery. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 668–675. Springer, Heidelberg (2004) [8] Brand, M.: A Direct Method for 3D Factorization of Nonrigid Motion Observation in 2D. In: Proc. CVPR, vol. 2, pp. 122–128 (2005) [9] Stegmann, M.B.: The AAM-API: An Open Source Active Appearance Model Implementation. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2879, pp. 951–952. Springer, Heidelberg (2003) [10] Goh, R., Liu, L., Liu, X., Chen, T.: The CMU Face In Action (FIA) Database. In: Zhao, W., Gong, S., Tang, X. (eds.) AMFG 2005. LNCS, vol. 3723, pp. 255–263. Springer, Heidelberg (2005) [11] Zhao, W., Chellappa, R.: SFS Based View Synthesis for Robust Face Recognition. In: Proc. FGR, pp. 285–292 (2000) [12] Phillips, P.J., Grother, P., Micheals, R.J., Blackburn, D.M., Tabassi, E., Bone, J.M.: FRVT: 2002: Evaluation Report, Tech. Report NISTIR 6965, NIST (2003) [13] Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Worek, W.: Preliminary Face Recognition Grand Challenge Results. In: Proc. AFGR, pp. 15–24 (2006) [14] Lee, K., Ho, J., Yang, M., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. CVPR I, 313–320 (2003) [15] Zhou, S., Krueger, V., Chellappa, R.: Probabilistic recognition of human faces from video. Computer Vision and Image Understanding 91, 214–245 (2003) [16] Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The Quickhull Algorithm for Convex Hulls. ACM Trans. Mathematical Software 22(4), 469–483 (1996) [17] Ullman, S.: The Interpretation of Visual Motion. MIT Press, Cambridge, MA (1979) [18] Matthews, I., Baker, S.: Active Appearance Models Revisited. International Journal of Computer Vision 60(2), 135–164 (2004) [19] Maurer, T., Guigonis, D., Maslov, I., Pesenti, B., Tsaregorodtsev, A., West, D., Medioni, G.: Performance of Geometrix ActiveIDTM 3D Face Recognition Engine on the FRGC Data. In: Proc. CVPR, pp. 154–160 (2005) [20] Tu, J., Huang, T., Tao, H.: Accurate Head Pose Tracking in Low Resolution Video. In: Proc. FGR, pp. 573–578 (2006)
Robust Point-Based Feature Fingerprint Segmentation Algorithm Chaohong Wu, Sergey Tulyakov, and Venu Govindaraju Center for Unified Biometrics and Sensors (CUBS) SUNY at Buffalo, USA
Abstract. A critical step in automatic fingerprint recognition is the accurate segmentation of fingerprint images. The objective of fingerprint segmentation is to decide which part of the images belongs to the foreground containing features for recognition and identification, and which part to the background with the noisy area around the boundary of the image. Unsupervised algorithms extract blockwise features. Supervised method usually first extracts point features like coherence, average gray level, variance and Gabor response, then a Fisher linear classifier is chosen for classification. This method provides accurate results, but its computational complexity is higher than most of unsupervised methods. This paper proposes using Harris corner point features to discriminate foreground and background. Shifting a window in any direction around the corner should give a large change in intensity. We observed that the strength of Harris point in the foreground area is much higher than that of Harris point in background area. The underlying mechanism for this segmentation method is that boundary ridge endings are inherently stronger Harris corner points. Some Harris points in noisy blobs might have higher strength, but it can be filtered as outliers using corresponding Gabor response. The experimental results proved the efficiency and accuracy of new method are markedly higher than those of previously described methods.
1 Introduction

The accurate segmentation of fingerprint images is a key component in achieving high performance in automatic fingerprint recognition systems. If more background areas are included in the segmented fingerprint of interest, more false features may be introduced into the detected feature set; if some parts of the foreground are excluded, useful feature points may be missed. There are two types of fingerprint segmentation algorithms: unsupervised and supervised. Unsupervised algorithms extract blockwise features such as the local histogram of ridge orientation [1,2], gray-level variance, the magnitude of the gradient in each image block [3], and Gabor features [4,5]. In practice, the presence of noise, low contrast areas, and inconsistent contact of a fingertip with the sensor may result in lost minutiae or additional spurious minutiae. Supervised methods usually first extract several features such as coherence, average gray level, variance and Gabor response [5,6,7], and then a simple linear classifier is chosen for classification. This approach provides accurate results, but its computational complexity is higher than that of most unsupervised methods. Segmentation in low quality images faces several challenging technical problems. The first problem is the presence of noise caused by dust and grease on the surface of live-scan fingerprint scanners. The second problem is ghost images of fingerprints remaining
from the previous image acquisition [7]. The third problem is low contrast fingerprint ridges caused by inconsistent contact pressure or a dry/wet finger surface. The fourth problem is indistinct boundaries when features are computed over fixed-size windows. The final problem is that segmentation features are sensitive to image quality. This paper proposes using Harris corner point features [8,9] to discriminate foreground from background. The Harris corner detector was originally developed to provide features for motion tracking; it significantly reduces the amount of computation compared to tracking every pixel. It is translation and rotation invariant but not scale invariant. We found that the strength of a Harris point in the foreground area is much higher than that of a Harris point in the background area. Some Harris points in noisy blobs might have high strength, but these can be filtered out as outliers using the corresponding Gabor response. The experimental results show that the efficiency and accuracy of the new method are much better than those of previously described methods. Furthermore, this segmentation algorithm can detect an accurate boundary of the fingerprint ridge region, which is very useful for removing spurious boundary minutiae; most current segmentation methods cannot provide consistent boundary minutiae filtering.
2 Features for Fingerprint Segmentation

Feature selection is the first step in designing a fingerprint segmentation algorithm. There are two general types of features used for fingerprint segmentation, i.e., block features and pointwise features. In [5,10] the selected point features include the local mean, local variance or standard deviation, and Gabor response of the fingerprint image. The local mean is calculated as Mean = Σ_w I and the local variance as Var = Σ_w (I − Mean)², where w is the window centered on the processed pixel. The Gabor response is the smoothed sum of Gabor energies for eight Gabor filter responses. Usually the Gabor response is higher in the foreground region than in the background region. The coherence feature indicates how strongly the gradients in the local window centered on the processed point agree with the dominant orientation. Usually the coherence will be much higher in the foreground than in the background, but it may be influenced significantly by boundary and noise signals. Therefore, the coherence feature alone is not sufficient for robust segmentation, and a systematic combination of these features is necessary.

Coh = |Σ_w (G_{s,x}, G_{s,y})| / Σ_w |(G_{s,x}, G_{s,y})| = √((G_xx − G_yy)² + 4G_xy²) / (G_xx + G_yy)   (1)

Because pointwise segmentation methods are time consuming, blockwise features are usually used in commercial automatic fingerprint recognition systems. Block mean, block standard deviation, block gradient histogram [1,2] and block average magnitude of the gradient [11] are the most common block features for fingerprint segmentation. In [12] a gray-level intensity-derived feature called the block clusters degree (CluD) is introduced; CluD measures how well the ridge pixels are clustered.

E(x, y) = log( Σ_r Σ_θ |F(r, θ)|² )   (2)
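A minimal sketch of the pointwise coherence feature of Eq. (1) is given below. This is not the authors' code; the window size and the small regularisation constant are assumptions for the example.

```python
import numpy as np
from scipy import ndimage

# Illustrative sketch of the coherence of Eq. (1), computed from the
# squared-gradient sums Gxx, Gyy, Gxy over a local window at every pixel.
def coherence(image, window=15):
    gy, gx = np.gradient(image.astype(float))
    box = np.ones((window, window))
    gxx = ndimage.convolve(gx * gx, box)
    gyy = ndimage.convolve(gy * gy, box)
    gxy = ndimage.convolve(gx * gy, box)
    num = np.sqrt((gxx - gyy) ** 2 + 4.0 * gxy ** 2)
    den = gxx + gyy
    return num / (den + 1e-10)     # close to 1 where one orientation dominates
```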
Fig. 1. (a) and (b) are two original images, (c) and (d) are FFT energy maps for images (a) and (b), (e) and (f) are Gabor energy maps for images (a) and (b), respectively
Texture features, such as Fourier spectrum energy [6], Gabor features [4,13] and Gaussian-Hermite moments [14], have been applied to fingerprint segmentation. Ridges and valleys in a fingerprint image are generally observed to form a sinusoidal-shaped plane wave with a well-defined frequency and orientation [15], whereas non-ridge regions do not follow this surface wave model. In background and noisy regions, it is assumed that there is very little structure and hence very little energy content in the Fourier spectrum. Each value of the energy image E(x, y) indicates the energy content of the corresponding block, so the fingerprint region may be differentiated from the background by thresholding the energy image. The logarithm of the energy is used to convert its large dynamic range to a linear scale (Equation 2), and the region mask is obtained by thresholding E(x, y). However, uncleaned trace finger ridges and straight stripes are unfortunately included in the regions of interest (Figure 1(c)).
The Gabor filter-based segmentation algorithm is now the most commonly used method [4,13]. An even symmetric Gabor filter has the following spatial form:

g(x, y; θ, f, σ_x, σ_y) = exp{ −(1/2) [ x_θ² / σ_x² + y_θ² / σ_y² ] } cos(2π f x_θ)   (3)

For each block of size W × W centered at (x, y), eight directional Gabor features are computed, and the standard deviation of the eight Gabor features is used for segmentation. The magnitude of the Gabor feature is defined as

G(X, Y; θ, f, σ_x, σ_y) = Σ_{x0=−w/2}^{w/2−1} Σ_{y0=−w/2}^{w/2−1} I(X + x0, Y + y0) g(x0, y0; θ, f, σ_x, σ_y)   (4)

However, fingerprint images with low contrast, false trace ridges, or a noisy complex background cannot be segmented correctly by the Gabor filter-based method (Figure 1(e)). In [14], a similarity is established between Hermite moments and Gabor filters, and Gaussian-Hermite moments have been successfully used to segment fingerprint images. Orthogonal moments use orthogonal polynomials as transform kernels and produce minimal information redundancy; Gaussian-Hermite moments (GHM) can represent local texture features with minimal noise effect.
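The blockwise Gabor feature can be sketched as follows. This is an illustration rather than the authors' implementation: it uses a single σ instead of separate σ_x and σ_y, and the frequency, kernel size and block size are placeholder values.

```python
import numpy as np
from scipy import ndimage

# Sketch of the blockwise Gabor feature: build an even-symmetric Gabor kernel
# per Eq. (3), filter the image at eight orientations, and use the standard
# deviation of the eight block responses as the segmentation feature.
def gabor_kernel(theta, f=0.1, sigma=4.0, size=17):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-0.5 * (x_t**2 + y_t**2) / sigma**2) * np.cos(2 * np.pi * f * x_t)

def gabor_block_feature(image, block=16):
    responses = [ndimage.convolve(image.astype(float), gabor_kernel(t))
                 for t in np.arange(8) * np.pi / 8]
    h, w = image.shape
    feat = np.zeros((h // block, w // block))
    for bi in range(h // block):
        for bj in range(w // block):
            sl = (slice(bi * block, (bi + 1) * block),
                  slice(bj * block, (bj + 1) * block))
            block_vals = [np.abs(r[sl]).sum() for r in responses]
            feat[bi, bj] = np.std(block_vals)   # high std suggests a ridge block
    return feat
```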
3 Harris Corner Points

3.1 Review of Harris Corner Points

We propose using Harris corner point features [8,9] to discriminate foreground from background. The Harris corner detector was originally developed to provide features for motion tracking, and it significantly reduces the amount of computation compared to tracking every pixel. Shifting a window in any direction around a corner should give a large change in intensity, and corner points provide repeatable points for matching, so several efficient detection methods have been designed [8,9]. The gradient is ill defined at a corner, so edge detectors perform poorly at corners; however, in the region around a corner the gradient takes two or more distinct values, so a corner point can easily be recognized by examining a small window. Given a point I(x, y) and a shift (Δx, Δy), the auto-correlation function E is defined as

E(x, y) = Σ_{w(x,y)} [I(x_i, y_i) − I(x_i + Δx, y_i + Δy)]²   (5)

where w(x, y) is a window function centered on the image point (x, y). For a small shift [Δx, Δy], the shifted image is approximated by a Taylor expansion truncated to the first-order terms,

I(x_i + Δx, y_i + Δy) ≈ I(x_i, y_i) + [I_x(x_i, y_i)  I_y(x_i, y_i)] [Δx; Δy]   (6)
where I_x(x_i, y_i) and I_y(x_i, y_i) denote the partial derivatives in x and y, respectively. Substituting the approximation of Equation (6) into Equation (5) yields

E(x, y) = Σ_{w(x,y)} [I(x_i, y_i) − I(x_i + Δx, y_i + Δy)]²
        = Σ_{w(x,y)} ( I(x_i, y_i) − I(x_i, y_i) − [I_x  I_y] [Δx; Δy] )²
        = Σ_{w(x,y)} ( [I_x  I_y] [Δx; Δy] )²
        = [Δx, Δy] [ Σ_w I_x²,  Σ_w I_x I_y ;  Σ_w I_x I_y,  Σ_w I_y² ] [Δx; Δy]
        = [Δx, Δy] M(x, y) [Δx; Δy]   (7)

That is,

E(Δx, Δy) = [Δx, Δy] M(x, y) [Δx; Δy]   (8)

where M(x, y) is a 2×2 matrix computed from the image derivatives, called the auto-correlation matrix, which captures the intensity structure of the local neighborhood:

M = Σ_{x,y} w(x, y) [ I_x(x_i, y_i)²,  I_x(x_i, y_i) I_y(x_i, y_i) ;  I_x(x_i, y_i) I_y(x_i, y_i),  I_y(x_i, y_i)² ]   (9)
3.2 Strength of Harris-Corner Points of a Fingerprint Image

In order to detect interest points, the original measure of corner response in [8] is

R = det(M) / trace(M) = λ₁λ₂ / (λ₁ + λ₂)   (10)

The auto-correlation matrix M captures the structure of the local neighborhood. Based on the eigenvalues (λ₁, λ₂) of M, interest points are located where there are two strong eigenvalues and the corner strength is a local maximum in a 3×3 neighborhood. To avoid the explicit eigenvalue decomposition of M, trace(M) is calculated as I_x² + I_y², det(M) is calculated as I_x² I_y² − (I_x I_y)², and

R = det(M) − k × trace(M)²   (11)

To segment the fingerprint area (foreground) from the background, the following "corner strength" measure is used, because there is one undecided parameter k in Equation (11):

R = (I_x² I_y² − I_xy²) / (I_x² + I_y²)   (12)
3.3 Harris-Corner-Points Based Fingerprint Image Segmentation

We found that the strength of a Harris point in the fingerprint area is much higher than that of a Harris point in the background area, because boundary ridge endings inherently possess higher corner strength. Most high quality fingerprint images can be easily segmented by choosing an appropriate threshold value. In Figure 2, a corner strength of 300 is selected to distinguish corner points in the foreground from those in the background. A convex hull algorithm is then used to connect the Harris corner points located on the foreground boundary.
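A sketch of this segmentation step is given below. It is illustrative rather than the authors' code: the Gaussian smoothing scale is an assumption, and the threshold depends on the image scale and smoothing (300 is the value quoted for Figure 2).

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import ConvexHull

# Compute the corner-strength measure of Eq. (12) at every pixel, keep local
# maxima whose strength exceeds a threshold, and take the convex hull of the
# surviving points as the foreground boundary.
def harris_strength(image, sigma=2.0):
    gy, gx = np.gradient(image.astype(float))
    ixx = ndimage.gaussian_filter(gx * gx, sigma)
    iyy = ndimage.gaussian_filter(gy * gy, sigma)
    ixy = ndimage.gaussian_filter(gx * gy, sigma)
    return (ixx * iyy - ixy ** 2) / (ixx + iyy + 1e-10)   # Eq. (12)

def segment(image, threshold=300.0):
    r = harris_strength(image)
    maxima = (r == ndimage.maximum_filter(r, size=3)) & (r > threshold)
    pts = np.argwhere(maxima)                  # (row, col) corner locations
    hull = ConvexHull(pts[:, ::-1])            # convex hull in (x, y) order
    return pts, hull                           # hull vertices bound the foreground
```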
Fig. 2. A fingerprint with harris corner strength of (b)10, (c)60, (d)200, and (e)300. This fingerprint can be successfully segmented using corner response threshold of 300.
Fig. 3. A fingerprint with harris corner strength of (a)100, (b)500, (c)1000, (d) 1500 and (e)3000. Some noisy corner points can not be filtered completely even using corner response threshold of 3000.
With this measure it is relatively easy to segment fingerprint images for subsequent image enhancement, feature detection and matching. However, two technical problems need to be solved. First, different "corner strength" thresholds are necessary to achieve good segmentation results for images of different quality, based on an analysis of image characteristics. Second, some Harris points in noisy blobs might have high strength and cannot be removed simply by choosing one threshold. When a single threshold is applied to all the fingerprint images in a whole database, not all the corner points in the background of a fingerprint image are removed, and some corner points in noisy regions cannot be thresholded away even using a high threshold value (Figure 3). In order to deal with such situations, we implemented a heuristic selection algorithm using the corresponding Gabor response (Figure 4).
Fig. 4. Segmentation result and final feature detection result for the image shown in the Figure 1(a). (a) Segmented fingerprint marked with boundary line, (b) final detected minutiae.
4 Experimental Results The proposed methodology is tested on FVC2002 DB1 and DB4, each database consists of 800 fingerprint images (100 distinct fingers, 8 impressions each). Image size is 374 × 388 and the resolution is 500dpi. To evaluate the methodology of adapting a gaussian kernel to the local ridge curvature of a fingerprint image, we modified Gabor-based fingerprint enhancement algorithm [15,16] with two kernel sizes: the smaller one in high-curvature regions and the larger one in pseudo-parallel ridge regions, minutiae are detected using chaincode-based contour tracing [17], the fingerprint matcher developed by Jea et al. [18] is used for performance evaluation.
Fig. 5. Boundary spurious minutiae filtering. (a) and (b) incomplete filtering using NIST method, (c) and (d) proposed boundary filtering.
Fig. 6. ROC curves for (a) FVC2002 DB1 and (b) FVC2002 DB4
Our methodology has been tested on low quality images from FVC2002. To validate the efficiency of the proposed segmentation method, the widely used Gabor filter-based segmentation algorithm [4,13] and NIST segmentation [19] are used for comparison. The proposed segmentation method has a remarkable advantage over current methods in terms of filtering spurious boundary minutiae. Figure 5 (a) and (b) show unsuccessful boundary minutiae filtering using the NIST method [19], which removes spurious minutiae pointing to invalid blocks and spurious minutiae near invalid blocks, where invalid blocks are defined as blocks with no detectable ridge flow. However, boundary blocks are more complicated, so the method in [19] fails to remove most boundary minutiae. Figure 5 (c) and (d) show the filtering results of the proposed method. Comparing Figure 5(a) with (c) and Figure 5(b) with (d), 30 and 17 boundary minutiae are filtered, respectively. Performance evaluations for FVC2002 DB1 and DB4 are shown in Figure 6. For DB1, the EER with false boundary minutiae filtering using the proposed segmentation mask is 0.0106, while the EER with NIST boundary filtering is 0.0125. For DB4, the EER using the proposed segmentation mask is 0.0453 and the EER with NIST boundary filtering is 0.0720.
5 Conclusions

In this paper, a robust interest-point-based fingerprint segmentation method is proposed for fingerprints of varied image quality. The experimental results, compared with those of previous methods, validate that our algorithm has better performance even for low quality images, in terms of including less background and excluding less foreground. In addition, this robust segmentation algorithm is capable of efficiently filtering spurious boundary minutiae.
References 1. Mehtre, B.M., Chatterjee, B.: Segmentation of fingerprint images – a composite method. Pattern Recognition 22(4), 381–385 (1989) 2. Mehtre, B.M., Murthy, N.N., Kapoor, S., Chatterjee, B.: Segmentation of fingerprint images using the directional image. Pattern Recognition 20(4), 429–435 (1987) 3. Ratha, N.K., Chen, S., Jain, A.K.: Adaptive flow orientation-based feature extraction in fingerprint images. Pattern Recognition 28(11), 1657–1672 (1995) 4. Alonso-Fernandez, F., Fierrez-Aguilar, J., Ortega-Garcia, J.: An enhanced gabor filter-based segmentation algorithm for fingerprint recognition systems. In: Pan, Y., Chen, D.-x., Guo, M., Cao, J., Dongarra, J.J. (eds.) ISPA 2005. LNCS, vol. 3758, pp. 239–244. Springer, Heidelberg (2005) 5. Bazen, A., Gerez, S.: Segmentation of fingerprint images. In: Proc. Workshop on Circuits Systems and Signal Processing (ProRISC 2001), pp. 276–280 (2001) 6. Pais Barreto Marques, A.C., Gay Thome, A.C.: A neural network fingerprint segmentation method. In: Fifth International Conference on Hybrid Intelligent Systems(HIS 05), p. 6 (2005) 7. Zhu, E., Yin, J., Hu, C., Zhang, G.: A systematic method for fingerprint ridge orientation estimation and image segmentation. Pattern Recognition 39(8), 1452–1472 (2006) 8. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. in Alvey Vision Conference, pp. 147–151 (1988) 9. Mikolajczyk, K., Schmid, C.: Scale affine invariant interest point detectors. International Journal of Computer Vision 60(1), 63–86 (2004) 10. Klein, S., Bazen, A.M., Veldhuis, R.: fingerprint image segmentation based on hidden markov models. In: 13th Annual workshop in Circuits, Systems and Signal Processing, in Proc. ProRISC 2002 (2002) 11. Maio, D., Maltoni, D.: Direct gray-scale minutiae detection in fingerprints. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(1), 27–40 (1997) 12. Chen, X., Tian, J., Cheng, J., Yang, X.: Segmentation of fingerprint images using linear classifier. EURASIP Journal on Applied Signal Processing 2004(4), 480–494 (2004) 13. Shen, L., Kot, A., Koo, W.: Quality measures of fingerprint images. In: Proc. Int. Conf. on Audio- and Video-Based Biometric Person Authentication, pp. 266–271 (2001) 14. Wang, L., Suo, H., Dai, M.: Fingerprint image segmentation based on gaussian-hermite moments. In: Li, X., Wang, S., Dong, Z.Y. (eds.) ADMA 2005. LNCS (LNAI), vol. 3584, pp. 446–454. Springer, Heidelberg (2005) 15. Hong, L., Wan, Y., Jain, A.K.: “Fingerprint image enhancement: Algorithms and performance evaluation”. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 777–789 (1998) 16. Wu, C., Govindaraju, V.: Singularity preserving fingerprint image adaptive filtering. In: International Conference on Image Processing, pp. 313–316 (2006) 17. Wu, C., Shi, Z., Govindaraju, V.: Fingerprint image enhancement method using directional median filter. In: Biometric Technology for Human Identification. SPIE, vol. 5404, pp. 66–75 (2004) 18. Jea, T.Y., Chavan, V.S., Govindaraju, V., Schneider, J.K.: Security and matching of partial fingerprint recognition systems. In: SPIE Defense and Security Symposium. SPIE, vol. 5404 (2004) 19. Watson, C.I., Garris, M.D., Tabassi, E., Wilson, C.L., MsCabe, R.M., Janet, S.: User’s Guide to NIST Fingerprint Image Software2(NFIS2). NIST (2004) 20. Otsu, N.: A threshold selection method from gray level histograms. IEEE Transactions on Systems, Man and Cybernetics 9, 62–66 (1979)
Automatic Fingerprints Image Generation Using Evolutionary Algorithm Ung-Keun Cho, Jin-Hyuk Hong, and Sung-Bae Cho Dept. of Computer Science, Yonsei University Biometrics Engineering Research Center 134 Sinchon-dong, Seodaemun-ku Seoul 120-749, Korea {bearoot,hjinh}@sclab.yonsei.ac.kr,
[email protected]

Abstract. Constructing a fingerprint database is important to evaluate the performance of an automatic fingerprint recognition system. Because of the difficulty in collecting fingerprint samples, there are only a few benchmark databases available. Moreover, various types of fingerprints are required to get a fair assessment of how robust the system is against various environments. This paper presents a novel method that generates various fingerprint images automatically from only a few training samples by using the genetic algorithm. Fingerprint images generated by the proposed method have similar characteristics to those collected from the corresponding real environment. Experiments with real fingerprints verify the usefulness of the proposed method.

Keywords: fingerprint identification, performance evaluation, genetic algorithm, image filtering, image generation.
1 Introduction

Due to the persistence and individuality of fingerprints, fingerprint recognition has become a popular personal identification technique [1]. Recently, it has become important to evaluate the robustness of such systems for practical applications [2]. Performance evaluation, mostly dependent on benchmark databases, is a difficult process because of the lack of public fingerprint databases involving large samples. Except for some popular fingerprint databases such as the NIST database [3] and the FVC databases [2], researchers usually rely on a small-scale database collected by themselves to evaluate their system. The construction of fingerprint databases requires an enormous effort and tends to be incomplete, costly and unrealistic [4]. Moreover, the database should include samples collected from various environments in order to estimate the robustness of the system under realistic applications [5]. In order to measure the performance of fingerprint recognition systems from various points of view, researchers have proposed several performance evaluation protocols and databases. Jain et al. developed a twin-test by measuring the similarity of identical twins' fingerprints [6], while Pankanti et al. theoretically estimated the individuality of fingerprints [1]. Hong et al. reviewed performance evaluation for
biometrics systems including fingerprint verification systems [7]. Khanna and Weicheng presented benchmarking results using NIST special database 4 [3], and Maio, et al. initiated several competitions of fingerprint verification such as FVC [2]. Simon-Zorita, et al. collected MCYT Fingerprint Database in consideration of position variability control and image quality [5]. There were some works on the generation of synthetic images for constructing databases with little cost and effort. FaceGen [9] is a modeler for generating and manipulating human faces. It manages shape, texture, expression, phones, and accessories (hair, glasses), and it can also reconstruct 3D face from single images and were used to test face recognition techniques [10]. Cappelli, et al. developed a software (SFinGE) [4] that heuristically generated fingerprint images according to some parameters, where the synthetic databases were used as one of benchmark databases in FVC [2]. In this paper, we propose a novel method that generates fingerprint images from only a few initial samples, in which the images get similar characteristics to ones manually collected from the corresponding real environment. When a target environment is given, the proposed method constructs a set of filters that modifies an original image so as to become similar to that collected in the environment. A proper set of filters is found by the genetic algorithm [8,11], where the fitness evaluation is conducted using various statistics of fingerprints to measure the similarity.
2 Proposed Method
2.1 Overview
The proposed method works similarly to a simple genetic algorithm, as shown in Fig. 1, and the fitness evaluation optimizes a filter set that generates fingerprint images corresponding to a given environment. In the initialization step, the parameters of the genetic algorithm are set, including the population size, the maximum number of generations, the chromosome length, the selection strategy, the selection rate, the crossover rate and the mutation rate. The chromosome length determines the number of filters in the composite filter, where each gene in the chromosome represents the corresponding filter in the pool of filters. A target environment is also determined during initialization so that the proposed method can generate images with characteristics similar to fingerprints collected in that environment. Only a few samples are required to calculate several statistics for the target environment used to evaluate a chromosome. After initializing the population, the proposed method iteratively conducts fitness evaluation, selection, crossover and mutation until a termination condition is satisfied, where the last three steps work as in the standard genetic algorithm. In particular, the fitness of a chromosome is estimated according to the similarity between a few real images from the target environment and the images generated after filtering. The value of each gene specifies a filter to apply to the images of the training database. If we collect some samples from a target environment, the proposed method automatically analyzes the environment and finds a proper set of filters without any expert knowledge.
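To make the loop described above concrete, the following is a minimal Python sketch of the evolutionary search. It is not the authors' implementation: the filter-pool size and GA parameters mirror values reported later in the paper, the fitness function is assumed to be supplied externally (Section 2.3), and a simple truncation-style selection stands in for the roulette-wheel selection actually used.

```python
import random

# Assumed stand-ins for the components described in Sections 2.2-2.3.
FILTER_POOL_SIZE = 70          # indices 0..69, index 0 = "None"
CHROMOSOME_LENGTH = 5          # at most five filters per composite filter

def random_chromosome():
    return [random.randrange(FILTER_POOL_SIZE) for _ in range(CHROMOSOME_LENGTH)]

def evolve(fitness, pop_size=50, generations=100,
           selection_rate=0.7, crossover_rate=0.7, mutation_rate=0.05):
    """Generic GA loop: evaluate, select, crossover, mutate (fitness is maximized)."""
    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:max(2, int(selection_rate * pop_size))]   # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < crossover_rate:                    # one-point crossover
                cut = random.randrange(1, CHROMOSOME_LENGTH)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [random.randrange(FILTER_POOL_SIZE)             # per-gene mutation
                     if random.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)
```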
Fig. 1. Overview of the proposed method
2.2 Image Filter Pool
Popular image filters are used to reproduce the effects observed in real environments. Even though each filter has a simple effect, as shown in Table 1, the filters can produce various results when appropriately composed with each other. The order and type of filters used in the filter set are determined by the genetic algorithm, because it is practically impossible to test all possible compositions. Table 1 describes the image filters used in this paper. They are widely used to reduce noise, smooth images, or emphasize image details [12]. Typically, there are several categories of image filters, such as histogram-based filters, mask filters and morphological filters. Various parameters such as mask types make the effect of filtering more diverse. In total, 70 filters constitute the filter pool.
Table 1. Image filters used in this paper
Group                   Filter                                                                      Index
Histogram               Brightness (3 values), Contrast (3 values), Stretch, Equalize, Logarithm    1~9
Mask                    Blur (6 masks), Sharper (4 masks), Median (10 masks)                        10~29
Morphology (10 masks)   Erosion, Dilation, Opening, Closing                                         30~69
None                    None                                                                        0
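To illustrate how a chromosome is decoded into a composite filter, the sketch below maps gene indices to image operations and applies them in gene order. The OpenCV operations and the particular index-to-filter assignments are illustrative assumptions only; they stand in for the 70-filter pool of Table 1, whose exact masks and parameters are not listed in full here.

```python
import cv2
import numpy as np

def make_filter_pool():
    """Illustrative subset of a filter pool indexed by gene value (0 = 'None')."""
    pool = {0: lambda img: img}                                            # index 0: no-op
    pool[1] = lambda img: cv2.convertScaleAbs(img, alpha=1.0, beta=20)     # histogram: brightness +
    pool[5] = lambda img: cv2.equalizeHist(img)                            # histogram: equalization
    pool[10] = lambda img: cv2.blur(img, (3, 3))                           # mask: blur
    pool[20] = lambda img: cv2.medianBlur(img, 3)                          # mask: median
    pool[30] = lambda img: cv2.erode(img, np.ones((1, 3), np.uint8))       # morphology: erosion 1x3
    pool[40] = lambda img: cv2.dilate(img, np.ones((3, 1), np.uint8))      # morphology: dilation 3x1
    return pool

def apply_chromosome(image, chromosome, pool):
    """Apply the filters encoded by the chromosome, in gene order."""
    out = image
    for gene in chromosome:
        out = pool.get(gene, pool[0])(out)
    return out
```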
2.3 Fitness Evaluation
The fitness of a filter set is estimated by measuring the similarity between fingerprints collected from the target environment and images generated by the composite filter. Several representative features of fingerprints, such as the mean and variance of images, directional contrasts [13], average ridge thickness and interval [14], singularities [13] and minutiae [1], are used to design the fitness evaluation function. As mentioned before, fingerprints are easily affected by the input environment, so the statistics of the obtained fingerprints reflect that environment. Table 2 lists the features of fingerprint images and the reason for using each feature in the fitness evaluation.
Table 2. Features of fingerprint images for fitness evaluation
Feature                 Description                                       Purpose
Mean                    The mean of gray values                           Measurement of whole gray level
Variance                The variance of gray values                       Uniformity of gray values
Directional contrast    The mean of block directional difference          Distinctness between ridges and valleys
Thickness               The mean of ridge thickness                       Measurement of ridge thickness
Interval                The mean of valley thickness                      Measurement of valley thickness
Singularity             The region of discontinuous directional field     Major features of fingerprints
Minutiae                Ending and bifurcation points of ridges           Major features of fingerprint recognition
There are 4 directional contrasts, obtained by estimating the block directional difference in 8 cardinal directions without regard for the opposite direction. Singularities are detected by the Poincare index [13], a popular method for computing core and delta points based on the orientation field. By comparing the output of the algorithm with the judgment of human experts, 3 types of singularities are defined as follows:
− Missing singularity: a real singularity which the algorithm cannot detect.
− Spurious singularity: a point which the algorithm detects but which is not real.
− Paired singularity: a real singularity which the algorithm detects correctly.
Minutiae points are extracted through Gabor filtering, binarization and thinning of the ridges [1]. Each minutia candidate can be labeled as an ending, a bifurcation or nothing, both by the algorithm and by human experts, which defines 8 types of combinations: ending-ending, ending-bifurcation, ending-nothing, bifurcation-ending, bifurcation-bifurcation, bifurcation-nothing, nothing-ending, nothing-bifurcation. With these various fingerprint statistics, the fitness evaluation function, in which the weights are heuristically determined, is defined as follows. The statistics of the
target environment are calculated from the environment database. All the values are normalized from 0 to 1.

fitness(i) = w1 · (mean_i − mean_target)
           + w2 · (variance_i − variance_target)
           + w3 · Σ_{j=1..4} (contrast_i,j − contrast_target,j)
           + w4 · (thickness_i − thickness_target)
           + w5 · (interval_i − interval_target)
           + w6 · Σ_{c ∈ singularityType} (singularity_i(c) − singularity_target(c))
           + w7 · Σ_{c ∈ minutiaeType} (minutiae_i(c) − minutiae_target(c))          (1)
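A minimal sketch of how Eq. (1) might be evaluated is shown below. The routines that extract the statistics (mean, variance, the four directional contrasts, ridge thickness, valley interval, and the per-type singularity and minutiae counts) are assumed to exist elsewhere; the weights follow the (1, 1, 1, 2, 2, 1, 3) setting reported in Section 3.2, and the use of absolute differences with a negated sum (so that larger is better for a maximizing GA) is our assumption about the intended convention.

```python
import numpy as np

WEIGHTS = dict(mean=1, variance=1, contrast=1, thickness=2,
               interval=2, singularity=1, minutiae=3)

def fitness(stats_i, stats_target, weights=WEIGHTS):
    """Weighted sum of differences between generated and target statistics.
    stats_* are dicts with scalar entries, a 4-vector 'contrast' entry, and
    per-type dicts for 'singularity' and 'minutiae', all normalized to [0, 1]."""
    f = 0.0
    for key in ('mean', 'variance', 'thickness', 'interval'):
        f += weights[key] * abs(stats_i[key] - stats_target[key])
    f += weights['contrast'] * np.abs(np.asarray(stats_i['contrast'])
                                      - np.asarray(stats_target['contrast'])).sum()
    for key in ('singularity', 'minutiae'):
        f += weights[key] * sum(abs(stats_i[key][c] - stats_target[key][c])
                                for c in stats_target[key])
    # Smaller difference means a better match; negate if the GA maximizes fitness.
    return -f
```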
3 Experimental Results
3.1 Experimental Environment
The usefulness of the proposed method is verified by comparing the fingerprints collected from real environments with the generated ones. The fingerprint database used in this work was constructed by the Computer Vision Laboratory at Inha University; three fingerprint images were captured from each finger according to the input pressure (high (H), middle (M) and low (L)) [15]. Forty-two fingerprint images of fourteen fingers are used as training data, while forty-five fingerprint images of fifteen fingers are used as test data. In the experiment, we aim to generate fingerprints of high and low pressure from those of middle pressure. Using the training data, two filter sets (M→H, M→L) are evolved by the proposed method, one for each environment. Real fingerprints of high and low pressure in the training data are used to calculate the statistics of the target environments, as shown in Fig. 1. After evolution, the test data are used to estimate the performance of the proposed method by measuring the similarity between the real fingerprints of high and low pressure and those generated by the filters from fingerprints of middle pressure. Fig. 2 shows the distribution of features for the training data. The input pressure affects the values of the fingerprint features, which may also influence the performance of fingerprint recognition. The ridges of highly pressed fingerprints tend to become connected and thus produce spurious bifurcation points, while fingerprints of low pressure tend to generate spurious ending points. Naturally, the thickness and interval of ridges separate clearly according to the input pressure. On the other hand, singularity is less affected by the input pressure since it is calculated from a global feature, the orientation field. Fingerprints collected with middle input pressure yield better minutiae extraction than the others. In particular, the 'real ending-extracted bifurcation' and 'real bifurcation-extracted ending' categories of minutiae strongly show the effect of the input pressure.
Fig. 2. Minutiae analysis of the training data
3.2 Analysis of the Process of Evolution
The parameters of the genetic algorithm in the experiment are set as follows: 100 generations, a population size of 50, a chromosome length of 5, a selection rate of 0.7, a crossover rate of 0.7 and a mutation rate of 0.05. At most five filters are used to compose a filter set because the chromosome length is set to five. Roulette-wheel selection is used as the basic
Fig. 3. Fitness through evolution (left: high pressure, right: low pressure)
Fig. 4. Fingerprints produced by the proposed method
Table 3. Filter sets obtained through evolution
Environment   Filter type
High          Highpass 3×3 #2   Erosion 1×3   Closing 3×1   Closing Rectangle 3×3   Dilation Diamond 3×3
Low           None              None          Stretch       None                    None
selection mechanism. The weights used in the fitness function are set to (1, 1, 1, 2, 2, 1, 3), since the ridge information is highly affected by the input pressure. Better filter sets are obtained by the proposed method through evolution, as shown in Fig. 3. The maximum and average fitness increase as the generation grows for both target environments. Fig. 4 shows the resulting fingerprints, which look similar to those collected from the target environments, while Table 3 presents the best filter sets obtained in the last generation.
Fig. 5. Minutiae analysis of the test data
3.3 Analysis of Generated Fingerprints
We have analyzed the resulting fingerprints by comparing them with fingerprints collected from the target environment. As shown in Fig. 5, the statistics of the fingerprints of middle pressure have been changed to be close to those calculated from the target environments, especially for the mean, directional contrasts, ridge thickness and interval. Since singularity depends almost entirely on the orientation field of the original images, however,
the proposed method is not able to model it correctly for the target environment. Fig. 5 also shows that the proposed method reproduces the distribution of extracted minutiae well: the generated fingerprints show aspects similar to those of the target environments in most minutiae-extraction categories. Generated-high and generated-low denote the results of the proposed method. Fig. 6 shows the impostor/genuine distributions and FMR/FNMR curves [4] on the collected and generated fingerprint databases using the VeriFinger 4.2 SDK. Although they originate from the same source fingerprints, the generated images yield lower recognition performance than the originals and differ little from those collected in the target environment.
Fig. 6. Impostor/genuine distribution and FMR/FNMR curves
4 Concluding Remarks
In this paper, we have proposed a novel method that automatically generates fingerprint images by using the genetic algorithm. Various simple image filters are used to construct a composite filter, where the genetic algorithm searches for their proper types and order. We have conducted experiments on a real database collected according to the input pressure, where the fingerprints generated by the proposed method showed characteristics similar to those collected from real environments in terms of various fingerprint image statistics. The generated images can be used to evaluate the performance of fingerprint recognition systems. Moreover, the proposed method can also be applied to fingerprint image enhancement by modifying the fitness evaluation module.
As future work, in order to generate realistic fingerprints more precisely, we will develop a fingerprint model that characterizes them with various measures. Heuristic filters that deform fingerprints might also be used to introduce additional effects.
Acknowledgments. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University.
References [1] Pankanti, S., Prabhakar, S., Jain, A.: On the individuality of fingerprints. IEEE Trans. Pattern Analysis and Machine Intelligence 24(8), 1010–1025 (2002) [2] Cappelli, R., Maio, D., Maltoni, D., Wayman, J.L., Jain, A.K.: Performance evaluation of fingerprint verification systems. IEEE Trans. Pattern Analysis and Machine Intelligence 28(1), 3–18 (2006) [3] Khanna, R., Weicheng, S.: Automated fingerprint identification system (AFIS) benchmarking using the National Institute of Standards and Technology (NIST) Special Database 4. In: Proc. 28th Int. Carnahan Conf. on Security Technology, pp. 188–194 (1994) [4] Maltoni, D.: Generation of Synthetic Fingerprint Image Databases. In: Ratha, N., Bolle, R. (eds.) Automatic Fingerprint Recognition Systems, Springer, Heidelberg (2004) [5] Simon-Zorita, D., Ortega-Garcia, J., Fierrez-Aguilar, J., Gonzalez-Rodriguez, J.: Image quality and position variability assessment in minutiae-based fingerprint verification. IEEE Proc. Vision, Image Signal Process 150(6), 402–408 (2003) [6] Jain, A., Prabhakar, S., Pankanti, S.: On the similarity of identical twin fingerprints. Pattern Recognition 35(11), 2653–2663 (2002) [7] Hong, J.-H., Yun, E.-K., Cho, S.-B.: A review of performance evaluation for biometrics systems. Int. J. Image and Graphics 5(2), 501–536 (2005) [8] Goldberg, D.: Genetic Algorithm in Search, Optimization and Machine Learning. Addison Wesley, Reading (1989) [9] Blanz, V., Vetter, T.: A Morphable Model for the Synthesis of 3D Faces. In: Proceedings of Computer Graphics SIGGRAPH, pp. 187–194 (1999) [10] Orlans, N., Piszcz, A., Chavez, R.: Parametrically controlled synthetic imagery experiment for face recognition testing. In: Proc. of the 2003 ACM SIGMM workshop on Biometrics methods and applications, pp. 58–64 (2003) [11] Cho, U.-K., Hong, J.-H., Cho, S.-B.: Evolutionary singularity filter bank optimization for fingerprint image enhancement. In: Rothlauf, F., Branke, J., Cagnoni, S., Costa, E., Cotta, C., Drechsler, R., Lutton, E., Machado, P., Moore, J.H., Romero, J., Smith, G.D., Squillero, G., Takagi, H. (eds.) EvoWorkshops 2006. LNCS, vol. 3907, pp. 380–390. Springer, Heidelberg (2006) [12] Gonzalez, R., Woods, R.: Digital Image Processing. Addison-Wesley, Reading, MA (1992) [13] Karu, K., Jain, A.: Fingerprint Classification. Pattern Recognition 29(3), 389–404 (1996) [14] Lim, E., Jiang, X., Yau, W.: Fingerprint quality and validity analysis. IEEE Int. Conf. on Image Processing 1, 22–25 (2002) [15] Kang, H., Lee, B., Kim, H., Shin, D., Kim, J.: A study on performance evaluation of fingerprint sensors. In: Proc. 4th Int. Conf. Audio-and Video-based Biometric Person Authentication, pp. 574–583 (2003)
Audio Visual Person Authentication by Multiple Nearest Neighbor Classifiers Amitava Das Microsoft Research – India 196/36 2nd Main , Sadashivnagar, Bangalore 560 080, India
[email protected]
Abstract. We propose a low-complexity audio-visual person authentication framework based on multiple features and multiple nearest-neighbor classifiers, which instead of a single template uses a set of codebooks or a collection of templates. Several novel, highly discriminatory speech and face image features are introduced, along with a novel "text-conditioned" speaker recognition approach. Powered by discriminative scoring and a novel fusion method, the proposed MNNC method delivers not only excellent performance (0% EER) but also a significant separation between the scores of clients and imposters, as observed in trials run on a unique multilingual 120-user audio-visual biometric database created for this research.
Keywords: Speaker recognition, face recognition, audio-visual biometric authentication, fusion, multiple classifiers, feature extraction, VQ, multimodal.
1 Introduction
Multimodal biometric authentication is attracting a lot of interest these days, as it offers significantly higher performance and more security than unimodal methods using a single biometric. Multimodal methods also make the system more robust to sensor failures and adverse background conditions such as poor illumination or high background noise. However, proper fusion of multiple biometrics remains a challenging and crucial task. Among the various multimodal biometric methods reported so far, audio-visual person authentication [3-8][17] offers some unique advantages. First of all, it uses two biometrics (speech and face image) which people share quite comfortably in everyday life. No additional private information is given away. These biometrics are not associated with any stigma (unlike fingerprints, which are also taken from criminals). They are contact-free, and the sensors for them are everywhere thanks to the worldwide spread of camera-fitted mobile phones. Another unique advantage of this combination is that the combined use of an intrinsic biometric (face) along with a performance biometric (voice) offers heightened protection from imposters, while providing flexibility in terms of changing the 'password'. The challenges faced by audio-visual authentication are: a) poor performance of face recognition, especially under expression, pose and illumination variation, b) poor performance of speaker recognition in noise, c) increased complexity of the system,
and d) the lack of a proper fusion method to exploit the best of the two biometrics. Some other requirements for any biometric system are: a) the amount of training material should be small and enrollment should be fast, b) enrollment of a new user should not require any massive re-estimation, c) ease of using the system during enrollment and actual usage, d) complexity should be minimal so that the system can be scaled up to handle a large number of users or run in an embedded device, e) high accuracy, or in other words a significant separation between client and imposter score distributions, f) robustness to multiple sessions of testing, g) ability to change the "password", and h) robustness to various imposter attacks. The majority of recent audio-visual biometric authentication methods [3-8] use separate speech- and face-based classifiers and then apply various late fusion methods [12], such as sum, product, voting, mixture-of-experts, etc., to fuse the scores of the two modes. A few recent methods (e.g. [8]) proposed feature-level fusion as well. For the speech mode, various text-independent speaker recognition methods, using GMM [11] or VQ [10], are predominantly used. These methods offer lower complexity than HMM [13] or DTW [14] based text-dependent speaker recognition methods. Text-dependent methods achieve better accuracy but require a large amount of training data. For the face mode, the majority of methods [1] are essentially variations of the PCA-based approach [2]. Since PCA-based methods suffer severely from pose and illumination variations, various pose normalization methods [1] were proposed using 2D or 3D models, followed by a PCA-based dimension reduction and a nearest-neighbor matching of the reduced-feature template. Several methods (e.g. [9]) use multiple frames of a video and analyze trajectories or face dynamics in the feature space for face recognition. Note that all these methods require quite high complexity, and some, especially the HMM-based ones, need a massive amount of training data as well. PCA-based methods also face the problem of re-estimation every time a new user is enrolled. In this paper, we propose a low-complexity multimodal biometric authentication framework based on multiple features and multiple nearest neighbor classifiers (MNNC). The proposed MNNC framework uses multiple features and, for each feature, a set of codebooks or a collection of templates (as opposed to a single template). We use it here for audio-visual person authentication, but MNNC can be used for any detection/verification task involving multiple classifiers. We also introduce several new speech and face features and a novel "text-conditioned" speaker recognition approach. Coupled with discriminative scoring and a novel fusion method, the proposed MNNC method delivers excellent performance (0% EER) as well as a non-overlapping distribution of the client and imposter scores, as demonstrated in trials run on a unique 120-user multilingual MSRI bimodal biometric database. The proposed method meets all the requirements mentioned above while operating at very low complexity. A real-time AVLOG person authentication system prototype has been created based on this research and is being used by employees of our organization on a trial basis. This paper is organized as follows: Section 2 details the basic architecture and describes the computation of the scoring and the fusion method. Section 3 presents the various feature extraction methods.
Section 4 presents details of the MSRI database and the experimental setup, followed by the results. Finally, the summary and conclusions are presented in Section 5.
2 Person Authentication by Multiple Nearest Neighbor Classifiers
In the proposed architecture, multiple features F1, F2, F3, ..., FL are extracted from the various biometrics used. For each feature Fi, we have a dedicated nearest neighbor classifier, which compares Fi with a set of codebooks, one for each of the N users enrolled, computes a score Ri and multiplies it by a suitable weight. A suitable fusion method combines the various scores Ri to form a final score Rfinal, which is then compared with a threshold to decide whether to accept or reject the identity claim.
Fig. 1. Architecture of the proposed MNNC framework
2.1 Score Computation for Person Authentication
During enrollment, for each user Pj and for each feature vector Fi, a codebook CBij is designed from the feature vector ensemble [Fi1 Fi2 Fi3 .. FiT] of size T obtained from the training data. Any standard clustering method, such as K-means, can be used to design the codebooks. During authentication, a set of Q biometric samples is presented (Figure 1), along with an identity claim 'k'. We need to verify whether the feature set [F1 F2 F3 .. FL], extracted from the biometric samples, belongs to person Pk or not. We present next the scoring method for person verification (the basic idea can be extended to the person-identification problem as well). Let us consider just a single nearest neighbor classifier, which has N codebooks CBk, k=1,2,..,N, for the N users enrolled in the system. Each CBk has M codevectors, CBk = [Ckm], m=1,2,..,M.
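A minimal sketch of the enrollment step under these definitions: one K-means codebook of M codevectors is built per user and per feature from that user's training ensemble. scikit-learn is used here purely for illustration; any standard clustering method would do, as stated above.

```python
import numpy as np
from sklearn.cluster import KMeans

def enroll_user(feature_ensembles, M=8):
    """feature_ensembles: list over features i of arrays of shape (T_i, dim_i).
    Returns one codebook (M codevectors) per feature for this user."""
    codebooks = []
    for ensemble in feature_ensembles:
        km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(np.asarray(ensemble))
        codebooks.append(km.cluster_centers_)        # CB_i: array of shape (M, dim_i)
    return codebooks
```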
Given a feature vector F (one of the L feature vectors Fi, i=1,2,..,L; we drop the index "i" here for convenience) and a claim "k", we find a similarity-dissimilarity ratio R as follows:
Step 1: Given the identity claim k, compute two distances, Dtrue and Dimp, as follows. Dtrue is the minimum distance of F from the codebook CBk of the claimed person Pk: Dtrue = min {Dkm}, where Dkm = ||F − Ckm||^2, m=1,2,...,M, Ckm being a codevector of codebook CBk. Dimp is the minimum distance of F from the set of codebooks of all other persons except person Pk: Dimp = min {Dnm}, where Dnm = ||F − Cnm||^2, m=1,2,...,M; n=1,2,...,N and n ≠ k. Note that when a sequence of feature vectors [Fij, j=1,2,3,...,J] is given (as in the case of the MFCC sequence of all J frames of a speech utterance), Dtrue and Dimp are computed by first computing the corresponding distance for each feature vector Fij and then accumulating them, e.g. Dtrue(J) = Dtrue(Fi1) + Dtrue(Fi2) + ... + Dtrue(FiJ).
Step 2: Compute the score as R = Dtrue / Dimp.
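The two-step scoring above can be sketched as follows; frame-wise distances are accumulated over the utterance before forming the ratio, as described in Step 1.

```python
import numpy as np

def min_dist(frames, codebook):
    """Sum over frames of the squared Euclidean distance to the nearest codevector."""
    frames = np.atleast_2d(frames)                                         # (J, dim)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)     # (J, M)
    return d.min(axis=1).sum()

def similarity_dissimilarity_ratio(frames, codebooks, claim_k):
    """codebooks: dict user_id -> (M, dim) codebook; claim_k: claimed identity."""
    d_true = min_dist(frames, codebooks[claim_k])
    d_imp = min(min_dist(frames, cb) for uid, cb in codebooks.items() if uid != claim_k)
    return d_true / d_imp          # R < 1 suggests a client, R > 1 an imposter
```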
In conventional person verification using a single feature, this score R is compared against a threshold T, and the identity claim is rejected/accepted if R is greater/less than T. In our system, we compute this ratio Ri for each of the L features Fi, i=1,2,..,L, and then fuse them. If all the users being tested are pre-enrolled, it can be shown that Ri should be less than 1 for a client and greater than 1 for an imposter. Note that Ri is essentially a likelihood-ratio measure.
2.2 Fusion Methods Used in the Proposed MNNC Framework
Knowing this property of the scores Ri, it makes sense to use the product rule, i.e. Rfinal = R1 × R2 × ... × RL. This should perform well, separating the client and imposter distributions. However, one or two feature scores will often misbehave, making the product score inaccurate. Therefore, we propose a "trend-modified product" (TMP) rule, described below and sketched in the code that follows:
1. Given the scores, find whether the majority are greater than 1 (indicated by a flag J=0) or the majority are less than 1 (indicated by a flag J=1).
2. If J equals 1, the final score is the product of those scores which are less than a threshold (THmin); else, if J=0, the final score is the product of the scores Ri which are greater than a threshold (THmax).
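A sketch of the TMP rule as stated above. The thresholds THmin and THmax are system parameters whose values are not given in the paper; the fallback to the plain product when no score passes the threshold is our addition to keep the sketch well defined.

```python
import numpy as np

def tmp_fusion(scores, th_min=1.0, th_max=1.0):
    """Trend-modified product of per-feature ratio scores R_i (R < 1 suggests a client)."""
    scores = np.asarray(scores, dtype=float)
    majority_below_one = (scores < 1.0).sum() > (scores >= 1.0).sum()
    if majority_below_one:                       # J = 1: keep only scores below TH_min
        kept = scores[scores < th_min]
    else:                                        # J = 0: keep only scores above TH_max
        kept = scores[scores > th_max]
    if kept.size == 0:                           # assumed fallback: plain product rule
        kept = scores
    return float(np.prod(kept))
```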
With the features proposed here (Section 3), we observed that usually only one or two feature scores fail at most, and most of the time the majority trend (J=0) is observed. This helps TMP work well. To illustrate this, we show the results of a sample speaker verification trial (Table 1). TMP does work better (lower EER; better client-imposter score separation) than the product rule in this example.
Table 1. Results of MNNC-based speaker verification using the product rule and TMP
Only Speech (MFCC: CB size=4, Dim=9; CMD: CB size=10, Dim=9)
                EER    FAR    FRR    Mc     Mi      D(Mc,Mi)   Dovlp   Dsep
Product Fusion  5.5    61.6   61.7   1.31   30.24   29.24      24      -
TMP Fusion      0.75   1      20.5   0.75   45.4    44.4       6       -
2.3 MNNC-Based Audio-Visual Person Authentication System -- AVLOG
We designed a real-time person authentication system called AVLOG for log-in and other access-control applications. The biometrics used here are speech utterances of a user-selected password and multiple profiles of the user's face. During testing and enrollment, a person is asked to look at 3 to 5 (this number is a system parameter) spots on the screen, and a single web-cam captures the multiple profiles. Then the person is asked to speak his or her password. During enrollment, 4 samples of the password are taken. During testing, the user needs to say the password only once. A password is a set of 4 words, such as an address or names -- something which is easy to remember and preferably spoken in the native language of the person. As a result, the passwords are unique (very unlikely to be the same) and phonetically and linguistically quite different from each other. We call this approach "text-conditioned" speaker recognition, which differs from the conventional "text-dependent" or "text-independent" approaches. Details of the feature extraction methods are presented next.
3 Feature Extraction Details
We want to keep the training and test data to a minimum, so the challenge is how to extract multiple meaningful features from limited data. For the face, we use multiple poses, and for each pose we use a highly discriminatory "transformed face profile" (TFP) feature [16], which captures the essence of a person's face quite well. For speech, we use the conventional MFCCs (Mel-Frequency Cepstral Coefficients) and also introduce a novel speech feature called Compressed MFCC Dynamics (CMD).
3.1 Speech Feature 1: Text-Conditioned MFCC
Text conditioning (the use of a unique password per user) leads to high discrimination, and the extracted MFCC sequences become more separable. Figure 2 shows the effect of the proposed text-conditioning approach. It shows the MFCCs of the two passwords of two users plotted in a 2-D MFCC (MFCC components 1 and 2) space. The conventional case (any password) on the left shows overlap, while the text-conditioned case on the right shows a distinct separation of the data from the two speakers.
Fig. 2. Scatter plots of 2 MFCC components; blue dot/cross: two utterances of speaker 1, red dot/cross: two utterances of speaker 2. Note that the text-conditioned data are separable.
Table 2. Speaker identification accuracies of TIVQ and TCVQ for various codebook sizes (N) and code-vector dimensions (K) for experiments run on the MSRI database

        TIVQ                      TCVQ
        K=8     K=8     K=13      K=8     K=8     K=13
N=4     51.4    60.1    65.9      97.8    98.3    97.4
N=8     76.2    81.7    84        100     100     100
N=64    93.1    94.9    95.6      100     100     100
Text conditioning allows a simple VQ-based classifier to perform significantly better than conventional text-independent VQ (TIVQ) speaker recognition systems [10] at significantly lower complexity. We call our speaker-recognition method "text-conditioned VQ" or TCVQ. Table 2 compares the identification accuracies of TCVQ against TIVQ for various codebook sizes and dimensions, for trials run on our MSRI database. The discriminatory power of the text-conditioned MFCC feature allows TCVQ to outperform TIVQ while using a smaller number of codebook parameters, which amounts to lower complexity and memory usage [17]. Text-conditioned MFCC offers two other advantages, as evident in Figure 3: a) there is a significantly wider separation of the client from the imposters, and b) the system can quit earlier (without processing the entire utterance) while giving 100% accuracy.
3.2 Speech Feature 2: Compressed MFCC Dynamics (CMD) Signature
When a person says something, his or her characteristics are best captured in the dynamics of the utterance. Traditionally, researchers have used a combination of 13-dimensional MFCC, 13 delta-MFCC and 13 delta-delta-MFCC coefficients, a 39-dimensional vector in total, to capture such dynamics. This gives a small benefit but makes the feature space very large, increasing confusability and complexity. We introduce a novel speech feature called "Compressed MFCC Dynamics" or CMD, which captures the above-mentioned speech dynamics efficiently. The CMD is computed as follows. The MFCCs of the entire utterance are collected to form a 2-D array called the MFCCgram (Figure 4). On this 2-D array, we apply the DCT and then keep a small set (M) of DCT coefficients. This fixed M-dimensional vector (typically M=15 works well) forms the CMD. Figure 4 shows the MFCCgram and the CMD parameters for 3 users and their 2 passwords. Figure 4 clearly shows how well the CMD feature captures speaker characteristics. Note the similarities within a
Fig. 3. Plot of the accumulated distance of the target speaker (client) and the 4 closest imposters over time (expressed as speech frame numbers), plotted for conventional (left) and text-conditioned (right) MFCC
Fig. 4. MFCCgram and the corresponding CMD signatures of two utterances of the passwords of 3 speakers of the MSRI database. CMD dimension=15.
speaker and the differences across speakers. Note that the MFCCgram is variable in the X (time) dimension, but the CMD has a fixed dimension, which makes classifier design easy.
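A minimal sketch of the CMD computation: a 2-D DCT of the MFCCgram followed by retaining a small fixed number of coefficients (M = 15 as suggested above). Which coefficients to keep is not specified in the paper; keeping the low-order (top-left) block is an assumption made here.

```python
import numpy as np
from scipy.fftpack import dct

def cmd_signature(mfccgram, M=15):
    """mfccgram: (num_frames, num_mfcc) array. Returns a fixed M-dimensional CMD vector."""
    # 2-D DCT (type-II, orthonormal) applied along both axes of the MFCCgram.
    coeffs = dct(dct(mfccgram, axis=0, norm='ortho'), axis=1, norm='ortho')
    # Assumed selection: keep a small top-left block of low-order coefficients.
    k = int(np.ceil(np.sqrt(M)))
    return coeffs[:k, :k].flatten()[:M]
```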
3.3 Face Image Feature: Transformed Face Profile (TFP)
The face features are extracted in a manner similar to the CMD. From each face profile, after face detection [15], the DCT is applied to the cropped gray-level image. A set of selected DCT coefficients is stored [16] as the Transformed Face Profile (TFP) signature. Figure 5 shows the discriminatory power of the TFP across users and its ability to form natural clusters in a 3D TFP space (formed by two selected maxima and one minimum). Also note how well the TFP tolerates expression variations.
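The TFP extraction can be sketched in the same spirit, mirroring the CMD sketch above; the cropping step and the exact coefficient-selection rule are assumptions, since the paper only states that a set of selected DCT coefficients of the detected gray-level profile is stored.

```python
import numpy as np
from scipy.fftpack import dct

def tfp_signature(face_gray, num_coeffs=15):
    """face_gray: cropped, gray-level face profile (2-D array) after face detection."""
    coeffs = dct(dct(face_gray.astype(float), axis=0, norm='ortho'), axis=1, norm='ortho')
    # Assumed selection rule: keep a fixed low-order block of DCT coefficients.
    k = int(np.ceil(np.sqrt(num_coeffs)))
    return coeffs[:k, :k].flatten()[:num_coeffs]
```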
Fig. 5. Left: Central-profile faces and the corresponding TFP signatures for 4 users. Right: Clustering of 3 TFP points (2 maxima, 1 minimum) for 10 central-profile samples of 8 users
4 Database, Experiments and Results
Due to our text-conditioning requirements (a unique password for each user), we could not use any conventional audio-visual databases. We therefore created a unique multilingual audio-visual biometric MSRI database which captured 3 face profiles (left, center and right) and unique multilingual passwords from 120 people. For each user, there are 10 training and 10 test samples of the password and 10-20 face images for each profile. The face images were captured with a digital camera in normal office lighting conditions, and there are considerable expression variations. For the proposed MNNC method, we report results only for the person verification task. The system can easily be configured for both closed-set and open-set person identification tasks. The various system parameters are: MFCC codebook size, MFCC dimension, number of CMDs, CMD dimension, number of TFPs and TFP dimension. The results presented in Table 3 are for the individual biometrics as well as the combined audio-visual system. Only a minimal number of system parameters are used, leading to a storage of only 486 numbers per user and approximately 4000 multiply-add operations for each user-trial. In comparison, a 128x39 text-independent VQ speaker recognition subsystem alone would have needed 5000 numbers per user to be stored and about 5x10^5 multiply-adds per user-trial.
Table 3. Performance of the proposed MNNC method. Shown here are the EER, FAR, FRR as well as the means of the client and imposter distributions (Mc, Mi), the distance between them D(Mc,Mi) and the amount of overlap, Dovlp (or non-overlap, Dsep), between the tails of the distributions of client and imposter scores
Only Speech (MFCC: CB size=4, Dim=9; CMD: CB size=10, Dim=9)
                EER    FAR    FRR    Mc     Mi      D(Mc,Mi)   Dovlp   Dsep
MFCC            0.83   0.8    2.5    0.9    1.28    0.28       0.2     -
CMD             13.5   87     0.25   1.67   13.1    12.1       32.5    -
Product Fusion  5.5    61.6   61.7   1.31   30.24   29.24      24      -
TMP Fusion      0.75   1      20.5   0.75   45.4    44.4       6       -

Only Face (CB size=8, Dim=15)
                EER    FRR    FAR    Mc     Mi        D(Mc,Mi)   Dovlp   Dsep
Left TFP        4.6    13     24     0.59   36        35         2.11    -
Center TFP      7.8    16     41     0.64   43.2      42.2       3.34    -
Right TFP       3.3    21.7   8.3    0.43   48.1      47.1       1.5     -
TMP Fusion      0.09   0.09   0.09   0.22   1.8E+07   1.8E+07    1.3     -

Speech+Face (Speech: MFCC CB size=4, Dim=9; CMD number=10, Dim=9; Face: TFP number=8, TFP Dim=15)
                EER    FRR    FAR    Mc     Mi        D(Mc,Mi)   Dovlp   Dsep
Product Fusion  0      0      0      0.33   6E+08     6E+08      -       10
TMP Fusion      0      0      0      0.11   7.5E+08   7.5E+08    -       44
From the results in Table 3, the following observations can be made:
a) The combination of multiple features and multiple biometrics gives better results than unimodal methods using a single biometric.
b) The proposed trend-modified product (TMP) fusion provides better client-to-imposter separation than the conventional product fusion.
c) The MNNC framework proposed here performs quite well, significantly separating the score distributions of the client and imposters (no overlap; 0% EER).
The proposed MNNC method requires very low complexity. To provide some insight into this, let us compare the storage and computation requirements of MNNC with two hypothetical AV biometric systems: one using PCA [2] for face and text-independent VQ [10] for speech, and one using PCA for face and a high-performance text-dependent DTW-based system (as in [14]) for speech. As mentioned earlier, these sub-systems are the most popular ones used in recently proposed audio-visual authentication systems. The MIPS complexity is expressed here in terms of the number of multiply-add operations per user-trial (for 120 users), and the memory complexity is expressed in terms of the number of real constants required to be stored per user. For the MIPS part, the PCA complexity dominates; for the memory, the DTW method dominates. As seen in Table 4, the proposed MNNC method requires significantly less complexity and storage per user than any of these conventional methods.
Table 4. Complexity and memory usage comparison of MNNC with conventional methods
Method    MIPS      Memory     Comments
VQ+PCA    10^8      O(6000)    4 sec test utterance; CB size=128; dim=39; PCA: 60 eigenvalues per profile; 40x40 size image
DTW+PCA   10^8      O(30000)   5-template DTW, 4 second test utterance; method as in [14]; PCA as above
MNNC      O(4000)   O(500)     4 second test utterance; other parameters as in Table 3

5 Summary and Conclusions
We proposed a low-complexity, high-performance audio-visual person authentication method based on multiple features and a multiple nearest neighbor classifier framework. A few novel speech and face features were proposed, which are highly compact and discriminatory. The use of a judicious fusion method, combined with the discriminative power of the features, led the proposed method to achieve high performance (0% EER) and a wide separation (no overlap) of the score distributions of the client and imposters. We created a unique multilingual 120-user bimodal audio-visual MSRI biometric database to test our proposed approach. A real-time AVLOG access control system has been built using the basic principles proposed here, and the system is being tested by various employees of our organization. The performance of the real-time prototype is similar to the data presented here. We also have versions which are robust to background noise. Our current and future focus includes: a) mobile and small-footprint embedded implementations, b) expanding our database to a larger number of users, and c) investigating a few more promising features.
Acknowledgment. The databases used in this research have been prepared by interns working at MSRI.
References 1. Zhao, W.Y., et al.: Face recognition: A Literature Survey. ACM Comp. Surveys, 399–458 (2003) 2. Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of cognitive neuroscience 3, 71–86 (1991) 3. Chibelushi, C., et al.: A Review of Speech Based Bimodal Recognition. IEEE trans. Multimedia, 23–37 (2002) 4. Kanak, A., et al.: Joint Audio Video Processing for Biometric Speaker Identification. In: Proc. ICASSP-03 (2003) 5. Marcel, S., et al.: Bi-Modal Face & Speech Authentication: A Bio Login Demonstration System. In: Proc. MMUA-06 (May 2006) 6. Hazen, T., et al.: Multi-modal Face and Speaker Identification on Handheld Device. In: Proc. MMUA (May 2003) 7. Yacoub, S., et al.: Fusion of Face and Speech Data for Person identity Verification. IEEE trans. neural Network (September 1999) 8. Wu, Z., Cai, L., Meng, H.: Multi-level Fusion of Audio and visual Features for Speaker identification. In: Proc. ICB 2006, pp. 493–499 (2006) 9. Biuk, Z., Loncaric, S.: Face Recognition from Multi-pose Image Sequence. In: Proc. 2nd Int’l. Symp. on Image and Signal Processing and Analysis, pp. 319–324 (2001) 10. Soong, F.K., Rosenberg, A.E., Juang, B.-H., Rabiner, L.R.: A vector quantization approach to speaker recognition. AT&T Journal 66, 14–26 (1987) 11. Reynolds, D., et al.: Speaker Verification using adapted GMM. Digital Signal Processing 10(1-3) (2000) 12. Kittler, J., et al.: Combining Evidence in Multimodal personal identity recognition systems. In: Proc. Int. Conf. on Audio & Video Based Person Authentication (1997) 13. Das, A., Ram, V.: Text-dependent speaker-recognition – A survey and State of the Art. Tutorial presented at ICASSP-2006, Toulouse (May 2006) 14. Ram, V., Das, A., Kumar, P.: Text-dependent speaker-recognition using one-pass dynamic programming. In: Proc. ICASSP, Toulouse, France (May 2006) 15. Viola, P., Jones, M.: Robust Real-time Object Detection. In: Proc. ICCV-2001 (2001) 16. Das, A., et al.: Face Recognition from Images with high Pose Variations by Transform Vector Quantization. In: Kalra, P., Peleg, S. (eds.) ICVGIP 2006. LNCS, vol. 4338, pp. 674–685. Springer, Heidelberg (2006) 17. Das, A., et al.: Audio Visual Biometric Recognition by Vector Quantization. In: Proc. IEEE/ACL SLT workshop (December 2006) 18. Das, A.: Audio-Visual Biometric Recognition. In: ICASSP07 ( accepted tutorial) (2007)
Improving Classification with Class-Independent Quality Measures: Q-stack in Face Verification Krzysztof Kryszczuk and Andrzej Drygajlo Swiss Federal Institute of Technology Lausanne (EPFL), Signal Processing Institute http://scgwww.epfl.ch/
Abstract. Existing approaches to classification with signal quality measures make a clear distinction between the single- and multiple-classifier scenarios. This paper presents a uniform approach to dichotomization based on the concept of stacking, Q-stack, which makes use of class-independent signal quality measures and baseline classifier scores in order to improve classification in uni- and multimodal systems alike. In this paper we demonstrate the application of Q-stack to the task of biometric identity verification using face images and associated quality measures. We show that the use of the proposed technique allows the error rates to be reduced below those of the baseline classifiers in single- and multi-classifier scenarios. We discuss how Q-stack can serve as a generalized framework in any single, multiple, and multimodal classifier ensemble.
Keywords: statistical pattern classification, quality measures, confidence measures, classifier ensembles, stacking.
1 Introduction
Biometric identity verification systems frequently face the challenge of non-controlled data acquisition conditions. In such conditions, biometric signals may suffer from quality degradation due to extraneous, identity-independent factors. It has been demonstrated in numerous reports that degradation of biometric signal quality is a frequent cause of significant deterioration of classification performance [1,2,3], also in multimodal systems, which systematically outperform their unimodal counterparts [4,1,5]. Seeking to improve the robustness of classifiers to degraded data quality, researchers started to incorporate quality measures into the classification process, at the same time making a clear distinction between using quality information in single-classifier systems [2,6,7,3], as opposed to multi-classifier and multimodal biometric systems [1,5,8,9]. The application of quality measures in multimodal systems has received far more attention, with the dominant intuitive notion that a classifier that has higher-quality data at its disposal ought to be more credible than a classifier that operates on noisy signals [1]. In [5], Toh et al. mention that the introduction of quality measures provides additional degrees of freedom which help maximize class separation, but they neither explicitly state the actual reason for this improved separation nor suggest if and how this effect should
apply to single-classifier systems. However, the impact of signal quality measures on the classification performance of a single classifier has also been noticed. In [3], Gales and Young propose to use a parallel model architecture to account for varying speech quality in speaker verification. In [2,6], adaptive threshold selection helps reduce errors. In turn, while quality-dependent model or threshold selection/adaptation is shown to work for single classifiers, the method is not generalized to multiple-classifier systems. We argue that in fact the very same mechanism governs the use of quality measures in single- and multi-classifier systems alike, and propose a quantitative rather than intuitive perspective on the role of quality measures in classification. In [10] we have proposed Q-stack, a novel theoretical framework for improving classification with class-independent quality measures. Q-stack is based on the concept of classifier stacking [11]. In the Q-stack scheme, a classifier ensemble is used in which the first classifier layer is made of the baseline unimodal classifiers, and the second, stacked classifier operates on features composed of the normalized similarity scores and the relevant quality measures. The concept of concatenating scores and quality measures into one feature vector was previously used in [9], but only in the context of multimodal fusion using a likelihood ratio-based classifier. In [10] we have demonstrated, using synthetic datasets, that Q-stack allows for improved class separation using class-independent quality measures for both uni- and multi-modal classification, provided that a statistical dependence exists between the features of the stacked classifier. The nature of the stacked classifier is chosen depending on the actual structure of the classified data. The importance of the dependence between classification scores and quality measures was also stressed in [7], but its implications were not extended beyond error prediction for a single classifier. In this paper we present Q-stack as a general framework for classification with quality measures that is applicable to uni-classifier, multiple-classifier and multimodal biometric verification. We demonstrate the principles and performance of Q-stack on a real biometric dataset, the face part of the BioSec database [12], using two different face image matchers, a quality measure that is correlated with the classification scores of one of the matchers, and three different stacked classifiers. We give evidential support to the following hypotheses:
1. A score-dependent quality measure provides an additional dimension in which a stacked classifier can separate the classes better than the baseline classifier that uses only the similarity scores.
2. The proposed method allows biometric verification to be improved in single- and multi-classifier scenarios alike.
3. In a multi-classifier system, the quality measure needs to be dependent on at least one classifier in order to observe the benefits of Q-stack.
The paper is structured as follows. Section 2 describes the principles of the proposed Q-stack method. In Section 3 we discuss the application of Q-stack to uni- and multi-classifier face matching, and we give the experimental results with their discussion. Section 4 concludes the paper.
2 Q-Stack - Using Quality Measures to Improve Classification
Consider two class data-generating processes A and B, which generate features subjected to k arbitrary base classifiers C1,2,...,k, each returning a scalar similarity score x1,2,...,k. Concatenate these scores into x = [x1, x2, ..., xk], where the vector x is an instance of a multivariate random variable X. The distribution of X is affected by a noise-generating process N that interacts according to some function γ with the class-generating processes A and B, causing signal degradation. The nature of γ need not be given explicitly [10]. Instead, the interaction between A, B and N manifests itself in the impact of noise instances n on the corresponding observed score instances x. Consequently, a causal dependence between N and X can be observed. In practice it may not be feasible to measure n directly. Instead, a set of j scalar quality measures qm = [qm1, qm2, ..., qmj] can be collected, where qm denotes an instance of a random variable QM. By definition, QM is dependent on N, and therefore it also inherits a dependence on X. At the same time, quality measures do not carry class-selective information, p(qm|A) = p(qm|B). Let us now concatenate the training scores x and the relevant quality measures qm into evidence vectors e = [x, qm], and analyze the separation between classes A and B in the (k + j)-dimensional evidence space defined by all components of the evidence vectors. Under the assumption of equal priors, P(A) = P(B), class separation can be expressed in terms of the divergence between the class-conditional joint distributions p(e|A) and p(e|B). In [13], Koval et al. have shown that the divergence between joint class-conditional distributions is greater for dependent than for independent classification features. Consequently, since p(QM|A) = p(QM|B), the existence of statistical dependencies between X and QM guarantees that

∫_{−∞}^{∞} |p(x|B) − p(x|A)| dx  <  ∫_{−∞}^{∞} |p(e|B) − p(e|A)| de.        (1)
A detailed proof of (1) is beyond the scope of this paper. An intuitive understanding of this result is shown in Figure 1. Here, the evidence consists of one class-selective score and one quality measure, e = [x, qm]. In both subplots, the marginal class-conditional distributions of the evidence remain unchanged. The variables X and QM are independent in the left subplot and dependent in the right one. Note that in the independent case the class separation is defined entirely by p(x|A) and p(x|B). In the presence of a dependence between X and QM, classes A and B are clearly better separated. For more details the reader is referred to [10]. As a consequence of (1), classification in the evidence space is guaranteed to be more accurate than using the base scores x alone, as long as there is a dependence between X and QM. For classification in the evidence space, the scores of the base classifiers x become part of the feature vector for a new stacked classifier [11], hence the coined name Q-stack. The stacked classifier can be chosen arbitrarily, depending on the actual joint distributions of the evidence e.
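The intuition of Figure 1 can be reproduced with a small synthetic experiment: when the quality measure tracks the class-independent degradation that perturbs the score, a classifier trained on the pair (x, qm) separates the classes better than one trained on x alone, even though qm by itself carries no class information. The construction below is only an illustrative sketch, not the one used in [10].

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.default_rng(0)
n = 5000
s = np.concatenate([-np.ones(n), np.ones(n)])    # class-dependent part of the score
nu = rng.normal(0.0, 1.0, 2 * n)                 # class-independent degradation
x = s + nu                                       # baseline classifier score
qm_dep = nu + rng.normal(0.0, 0.3, 2 * n)        # quality measure correlated with x
qm_indep = rng.normal(0.0, 1.0, 2 * n)           # quality measure independent of x
y = (s > 0).astype(int)

for name, feats in [('score only', x[:, None]),
                    ('score + independent qm', np.column_stack([x, qm_indep])),
                    ('score + dependent qm', np.column_stack([x, qm_dep]))]:
    acc = LDA().fit(feats, y).score(feats, y)
    print('%-24s accuracy = %.3f' % (name, acc))
```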
Fig. 1. Graphical representation of the impact of dependence between scores x and quality measures qm on class separation in the evidence space e = [x, qm]: a. independent x and qm, b. dependent x and qm.
Figure 2 shows a block diagram of Q-stack applied in a single-classifier scenario. Features extracted from a single signal are classified by a single base classifier. At the same time, quality measures (one or more) are collected. Classification scores and quality measures are combined into evidence vectors and classified by a stacked classifier operating in the evidence space. In Figure 3 the very same structure is applied to multimodal classification. Two signals are classified in parallel, resulting in a score vector x, which is further combined with the respective quality measures into an evidence vector e = [x, qm]. The evidence vector e becomes the feature vector for the stacked classifier. Note that if no quality measures are present, the architecture shown in Figure 3 simply performs multimodal score-level fusion.
Fig. 2. Q-stack in single base classifier scenario
3 Q-Stack for Face Verification
The experiments presented here give an embodiment of the proposed Q-stack method. We show that the improvements in system performance do indeed hinge on the statistical dependence of the signal quality measures on
Fig. 3. Q-stack in multimodal/multiclassifier scenario
the corresponding classifier scores, using face matching as an example. In our experiments we used face images of 200 subjects (training set: 50 subjects, testing set: 150 subjects) from the BioSec database, baseline corpus. For details regarding the BioSec database the reader is referred to [12]. The experiments presented in this paper involve one-to-one sample matching. All face images have been registered manually in order to avoid the impact of face localization algorithms on the matching performance. All images have been photometrically normalized [14]. In our experiments we used the following two face matchers: 1. DCT - local DCTmod2 features and a Bayes classifier based on the feature distributions approximated by Gaussian Mixture Models (GMM) [15] (the scores produced by the DCT matcher are denoted as x1); 2. PCA - the Mahalanobis distance between global PCA feature vectors [16]. The PCA projection space was found using all images from the training dataset. The scores produced by the PCA matcher are denoted as x2. The two face matchers were chosen because they operate on very different features: the local DCTmod2 features encode mostly high spatial frequencies, while the projection of the face images onto the PCA subspace emphasizes lower spatial frequencies. In order to test the hypothesis that a quality measure needs to exhibit a statistical dependence on the classification scores, for the experiments reported here a quality measure QM was chosen that correlates well with the scores of one of the matchers, but not both. Namely, we used the normalized 2-dimensional cross-correlation with an average face template [17], which due to its nature was expected to be dependent on the scores x2 of the PCA classifier. If qmα and qmβ are the quality measures computed for each of the two matched face images, then the resulting combined quality measure used as evidence was computed as qm = sqrt(qmα · qmβ), following [1]. In order to maintain a consistent notation with Section 2 and with [10], we denote here the class of genuine client match scores as Class A, and the class of imposter match scores as Class B. The class-conditional distributions of the classifier scores x1 (DCT matcher), x2 (PCA matcher), and the quality measures QM are shown in Figure 4. The distributions of the quality measures in Figure 4c show that the used quality measures are indeed class-independent and cannot help separate the classes on their own.
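A sketch of the quality measure used in these experiments: the normalized cross-correlation of each registered, photometrically normalized face image with an average-face template, combined over the two matched images by a geometric mean. Computing the correlation at zero shift (the images are already registered) and the way the average-face template is built are simplifying assumptions relative to [17].

```python
import numpy as np

def ncc_quality(face, template):
    """Normalized cross-correlation (zero shift) between a registered face and the average face."""
    f = (face - face.mean()) / (face.std() + 1e-12)
    t = (template - template.mean()) / (template.std() + 1e-12)
    return float((f * t).mean())

def combined_quality(face_a, face_b, template):
    """qm = sqrt(qm_alpha * qm_beta), following [1]; correlations assumed non-negative for aligned faces."""
    qm_alpha, qm_beta = ncc_quality(face_a, template), ncc_quality(face_b, template)
    return float(np.sqrt(max(qm_alpha * qm_beta, 0.0)))
```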
Fig. 4. Class-conditional distributions of scores x1, x2 and quality measures qm

Table 1. Correlation coefficients between evidence components x1, x2, qm

Training                        Testing
      x1     x2     qm                x1      x2     qm
x1    1.00   0.57   0.02        x1    1.00    0.36   -0.11
x2    0.57   1.00   0.46        x2    0.36    1.00   0.50
qm    0.02   0.46   1.00        qm    -0.11   0.50   1.00
As we have discussed before, a quality measure needs to be dependent on the scores in order to improve the class separation. Since correlation between random variables entails dependence, we computed the pair-wise Pearson correlation coefficients between x1, x2 and qm for both the training and testing data sets (see Table 1). A strong correlation is evident between the scores x1 and x2, which is a consequence of the fact that both classifiers operate on the same input signals. However, the quality measure QM is strongly correlated only with the scores x2 originating from the PCA classifier. The experiments were conducted as follows. First, the baseline classifiers DCT and PCA were trained using the dedicated training data set of the BioSec database, which is disjoint from the testing set and originates from a separate set of users, and the quality measures qm were collected. The proposed Q-stack method was applied to evidence vectors composed of e = [x1, qm], e = [x2, qm], e = [x1, x2] and e = [x1, x2, qm]. The combinations e = [x1, qm] and e = [x2, qm] are examples of Q-stack applied to a single classifier. The evidence vector e = [x1, x2] is Q-stack with only score-level evidence from two classifiers, which is equivalent to multi-classifier fusion [4]. Finally, the evidence vector e = [x1, x2, qm] represents Q-stack applied to multi-classifier fusion with quality measures. The scores x1, x2 and quality measures qm were normalized to zero mean and unit variance using normalization parameters estimated on the training dataset. Three different stacked classifiers separating e|A from e|B were used: a Support Vector Machine classifier with a Radial Basis Function kernel (SVM), a Bayes classifier using Gaussian approximations of the class-conditional distributions (Bayes), and a Linear Discriminant-based classifier (LDA).
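The experimental pipeline just described can be sketched as follows: scores and quality measures are z-normalized with training-set statistics, concatenated into evidence vectors e = [x1, x2, qm], and a stacked classifier is trained on them (an RBF-kernel SVM here; the Bayes and LDA variants would be trained analogously). scikit-learn is assumed purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

def train_q_stack(x1, x2, qm, labels):
    """x1, x2: baseline matcher scores; qm: quality measures; labels: 1 = genuine (A), 0 = imposter (B)."""
    evidence = np.column_stack([x1, x2, qm])             # e = [x1, x2, qm]
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    return model.fit(evidence, labels)

def verify(model, x1, x2, qm):
    """Classify new evidence vectors; returns 1 (accept claim) or 0 (reject)."""
    evidence = np.column_stack([np.atleast_1d(x1), np.atleast_1d(x2), np.atleast_1d(qm)])
    return model.predict(evidence)
```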
Fig. 5. Class separation in the evidence space e = [x1, qm] for the DCT classifier. Bold lines show the Q-stack decision boundaries for the following stacked classifiers: SVM (dash-dot), Bayes (dashed) and LDA (solid). The thin dashed line marked τ1 shows the corresponding decision boundary of the baseline DCT classifier.
Figure 5 shows the application of Q-stack to the DCT classifier with evidence vector e = [x1, qm], and Figure 6 shows the application of Q-stack to the PCA classifier with evidence vector e = [x2, qm], for both the training (a) and testing (b) datasets. The bold lines in both figures show the class decision boundaries for the corresponding stacked classifiers, optimized on the testing dataset. The thin dashed lines marked τ1 and τ2 represent the baseline classification boundaries for x1 and x2, respectively. In Figures 5 and 6 the areas confined by the bold (Q-stack) and the thin lines show the loci of observations that can be classified more accurately by Q-stack than by the baseline classifiers (τ1 and τ2). The gains in accuracy due to classification in the space of evidence e rather than in the score space x1 or x2 are summarized in Table 2. Classification results are reported in terms of total classification accuracy AC, Half-Total Error Rate HTER [4], and error rates for the individual classes ERA and ERB, obtained on the testing dataset. The weak correlation between the quality measure and the scores x1 of the DCT classifier (Table 1) results in marginal improvements in classification accuracy. Indeed, as shown in Figure 5, the Q-stack decision boundary does not deviate much from the baseline classification boundary given by x1 = τ1. The exception here is the stacked SVM classifier, whose decision boundary shape suggests overfitting. The strong correlation between the quality measure QM and the scores x2 of the PCA classifier is reflected in Q-stack decision boundaries that deviate consistently and significantly from that of the baseline classifier given by x2 = τ2 (Figure 6). The large areas confined by the Q-stack and τ2 decision bounds suggest a significant improvement of classification accuracy in the e = [x2, qm] space compared with the baseline classifier. This improvement is indeed consistent for all stacked classifiers. In the case of classification in the e = [x1, x2] space, Q-stack is equivalent to any trained classifier fusion method, here using the SVM, Bayes and LDA classifiers. The results shown in Table 2 show an improvement in classification
Fig. 6. Class separation in the evidence space e = [x2, qm] for the PCA classifier. Bold lines show the Q-stack decision boundaries for the following stacked classifiers: SVM (dash-dot), Bayes (dashed) and LDA (solid). The thin dashed line marked τ2 shows the corresponding decision boundary of the baseline PCA classifier.

Table 2. Comparison of classification performance achieved by the baseline classifiers and using the proposed Q-stack method. Experimental results are reported for the SVM, Bayes and LDA stacked classifiers operating in the evidence space.

                                          Accuracy[%]  HTER[%]  ERA[%]  ERB[%]
Baseline
  DCT: e = [x1]                             83.97       14.92    13.21   16.64
  PCA: e = [x2]                             70.62       26.68    15.25   13.36
SVM
  DCT + QM: e = [x1, qm]                    84.97       14.95    14.83   15.07
  PCA + QM: e = [x2, qm]                    79.23       20.68    20.54   20.82
  DCT + PCA: e = [x1, x2]                   87.41       12.73    12.96   12.51
  DCT + PCA + QM: e = [x1, x2, qm]          87.78       12.12    11.96   12.28
LDA
  DCT + QM: e = [x1, qm]                    83.85       14.83    12.79   16.88
  PCA + QM: e = [x2, qm]                    77.32       21.53    19.75   23.31
  DCT + PCA: e = [x1, x2]                   85.91       13.06    11.46   14.66
  DCT + PCA + QM: e = [x1, x2, qm]          88.68       11.57    11.96   11.19
Bayes
  DCT + QM: e = [x1, qm]                    84.85       14.65    13.88   15.43
  PCA + QM: e = [x2, qm]                    76.41       21.56    18.42   24.70
  DCT + PCA: e = [x1, x2]                   86.23       13.04    11.92   14.17
  DCT + PCA + QM: e = [x1, x2, qm]          87.03       12.30    11.25   13.34
accuracy over both baseline systems, which is an expected result. The classification results for e = [x1, x2] serve as the multi-classifier baseline for comparison with the results obtained by applying Q-stack to fusion with e = [x1, x2, qm]. Despite the fact that QM correlates well only with x2 but not with x1, adding the quality measure to the evidence vector (e = [x1, x2, qm]) resulted in further consistent improvements in classification performance. Q-stack applied to multi-classifier fusion with quality measures proved to deliver the lowest error rates of all compared systems.
4 Conclusions
In this paper we have presented Q-stack, a uniform method of incorporating class-independent quality measures in unimodal classification and multi-classifier systems. The method is based on the improved class separation due to the dependence between the classifier scores and the quality measures. We have demonstrated the method to be effective in improving both single- and multi-classifier face verification. We have also shown that the benefits that can be expected from the application of Q-stack hinge on the statistical dependencies between the quality measures and the baseline classifier similarity scores, and on a proper selection of the stacked classifier according to the class-conditional evidence distributions. The method generalizes well to other modalities; we have also obtained promising results for the fingerprint and speech modalities. Multiple quality measures can be incorporated by simply adding them to the evidence vector. The results suggest that particular attention must be paid to the development of classifier-quality measure ensembles, rather than to classifier-independent quality measures alone.
Acknowledgments This work was partially funded by the Swiss National Science Foundation (SNSF) and the National Centre of Competence in Research (NCCR) IM2.MPR. We wish to thank Prof. Javier Ortega-Garcia (UAM) for making the face part of the BioSec database available for our experiments.
References
1. Fierrez-Aguilar, J.: Adapted Fusion Schemes for Multimodal Biometric Authentication. PhD thesis, Universidad Politecnica de Madrid (2006)
2. Kryszczuk, K., Drygajlo, A.: Gradient-based image segmentation for face recognition robust to directional illumination. In: Visual Communications and Image Processing 2005, Beijing, China, July 12-15, 2005 (2005)
3. Gales, M.J.F., Young, S.J.: Robust continuous speech recognition using parallel model compensation. IEEE Transactions on Acoustics, Speech and Signal Processing 4(5) (1996)
4. Bengio, S., Marcel, C., Marcel, S., Mariethoz, J.: Confidence measures for multimodal identity verification. Information Fusion 3(4), 267–276 (2002)
5. Toh, K.A., Yau, W.Y., Lim, E., Chen, L., Ng, C.H.: Fusion of auxiliary information for multi-modal biometrics authentication. In: Proceedings of International Conference on Biometrics, Hong Kong. LNCS, pp. 678–685. Springer, Heidelberg (2004)
6. Wein, L., Baveja, M.: Using fingerprint image quality to improve the identification performance of the U.S. VISIT program. In: Proceedings of the National Academy of Sciences (2005)
7. Grother, P., Tabassi, E.: Performance of biometric quality measures. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(4), 531–543 (2007)
8. Poh, N., Heusch, G., Kittler, J.: On combination of face authentication experts by a mixture of quality dependent fusion classifiers. In: Proceedings of the 7th International Workshop on Multiple Classifier Systems, Prague, Czech Republic (2007)
9. Nandakumar, K., Chen, Y., Dass, S.C., Jain, A.K.: Quality-based score level fusion in multibiometric systems. In: Proceedings of the International Conference on Pattern Recognition, Hong Kong, China, vol. 4, pp. 473–476 (2006)
10. Kryszczuk, K., Drygajlo, A.: Q-stack: uni- and multimodal classifier stacking with quality measures. In: Proceedings of the International Workshop on Multiple Classifier Systems, Prague, Czech Republic (2007)
11. Wolpert, D.: Stacked generalization. Neural Networks 5, 241–259 (1992)
12. Fierrez, J., Ortega-Garcia, J., Torre-Toledano, D., Gonzalez-Rodriguez, J.: BioSec baseline corpus: A multimodal biometric database. Pattern Recognition 40(4), 1389–1392 (2007)
13. Koval, O., Voloshynovskiy, S., Pun, T.: Error exponent analysis of person identification based on fusion of dependent/independent modalities. In: Proceedings of SPIE Photonics West, Electronic Imaging 2006, Multimedia Content Analysis, Management, and Retrieval 2006 (EI122) (2006)
14. Gross, R., Brajovic, V.: An image preprocessing algorithm for illumination invariant face recognition. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, Springer, Heidelberg (2003)
15. Sanderson, C.: Automatic Person Verification Using Speech and Face Information. PhD thesis, Griffith University, Queensland, Australia (2003)
16. Turk, M.A., Pentland, A.P.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
17. Kryszczuk, K., Drygajlo, A.: On combining evidence for reliability estimation in face verification. In: Proc. of EUSIPCO 2006, Florence (September 2006)
Biometric Hashing Based on Genetic Selection and Its Application to On-Line Signatures Manuel R. Freire, Julian Fierrez, Javier Galbally, and Javier Ortega-Garcia Biometric Recognition Group - ATVS, Escuela Politecnica Superior, Universidad Autonoma de Madrid C/ Francisco Tomas y Valiente 11, E-28049 Madrid, Spain {m.freire,julian.fierrez,javier.galbally,javier.ortega}@uam.es
Abstract. We present a general biometric hash generation scheme based on vector quantization of multiple feature subsets selected with genetic optimization. The quantization of subsets overcomes the dimensionality problem of other hash generation algorithms, while the feature selection step using an integer-coding genetic algorithm makes it possible to exploit all the discriminative information found in large feature sets. We provide experimental results of the proposed hashing for verification of on-line signatures. Development and evaluation experiments are reported on the MCYT signature database, comprising 16,500 signatures from 330 subjects. Keywords: Biometric Hashing, Biometric Cryptosystems, Feature Selection, Genetic Algorithms.
1 Introduction
The application of biometrics to cryptography is receiving increasing attention from the research community. Cryptographic constructions known as biometric cryptosystems, which use biometric data, have recently been proposed, exploiting the advantages of authentication based on something that you are (e.g., your fingerprint or signature) instead of something that you know (e.g., a password) [1,2]. A review of the state of the art in biometric cryptosystems is reported in [2]. It establishes a commonly accepted classification of biometric cryptosystems, namely: (i) key release, where a secret key and a biometric template are stored in the system, the key being released after a valid biometric match, and (ii) key generation, where a template and a key are combined into a unique token, such that the key can be reconstructed only if a valid biometric trait is presented. This last scheme has the particularity that it is also a form of cancelable biometrics [3] (i.e., the key can be changed), and it is secure against system intruders since the stored token reveals information about neither the key nor the biometric. Within key generation biometric cryptosystems, the biometric template can be extracted using a biometric hashing scheme, where a binary string is obtained from the biometric sample (see Fig. 1). In this architecture, biometric cryptosystems have stronger security constraints than biometric hashing schemes, where the extraction of a stable binary representation of the biometric is generally prioritized.
[Fig. 1 diagram: a biometric sample and an input token feed the biometric hashing block (preprocessing, feature extraction, feature selection and binarization); a transformation stage within the biometric cryptosystem then produces the secure token.]
Fig. 1. A generic biometric cryptosystem, where the biometric is binarized using a biometric hashing scheme
We present a biometric hashing scheme based on genetic selection, extending the idea of feature subset concatenation presented in [4]. In that previous work, there was no clear indication of which of all the possible feature subsets should be used for the biometric hash. Moreover, there is a need for a dimensionality reduction criterion when dealing with high-dimensional vectors. We provide a solution to this problem using feature subset selection based on a genetic algorithm (GA) [5], leading to a practical implementation of biometric hashing. The proposed hash generation scheme can be applied to any biometric trait represented as a fixed-sized feature vector. In this work, we present a case study for the application of this scheme to the verification of on-line signatures, where dynamic information of the signing process is available. Within biometric hashing, handwritten signature has an interesting application in authentication and identity management, due to its widespread social and legal acceptance [6,7]. This paper is structured as follows. In Sect. 2 we outline related work on hashing and biometric cryptosystems. The proposed hashing scheme is presented in Sect. 3.1, and the feature selection algorithm using GA is detailed in Sect. 3.2. A case study in biometric hash generation from on-line signatures is reported in Sect. 4. Finally, some conclusions and future work are discussed in Sect. 5.
2 Related Work
Several biometric cryptosystems for key generation have been proposed in the literature. The fuzzy vault scheme [8] establishes a framework for biometric cryptosystems. In this construction, a secret (typically, a random session key) is encoded using an unordered set of points A, resulting in an indivisible vault V. The original secret can only be reconstructed if another set B is presented and overlaps substantially with A. The fuzziness of this construction fits well with the intra-variability of biometrics. Uludag et al. [9] proposed a biometric cryptosystem for fingerprints based on the fuzzy vault, where the encoding and the decoding sets were vectors of minutiae data. Their scheme was further developed in [10], where the fuzzy vault for fingerprints is enhanced with helper
data extracted from the orientation field flow curves. Other works have applied the fuzzy vault to on-line signature data using function-based information [11]. Hoque et al. [4] present a biometric hashing scheme where the generated hash plays the role of a cryptographic key. Their work identifies the problem of intra-variability and proposes a hashing based on vector quantization of feature subsets. However, the problem of the high dimensionality of the feature vector is not considered in their contribution. Also, their evaluation considers the hash as a final cryptographic key and not as a building block of a biometric cryptosystem, and therefore performance is measured in terms of exact matching between two hashes. Another approach was presented in [12], where Vielhauer et al. propose a biometric hashing scheme for statistical features of on-line signatures. Their work is based on user-dependent helper data, namely an Interval Matrix. Vielhauer and Steinmetz further applied this scheme to biometric hash generation using handwriting [13]. Another approach to crypto-biometrics using handwritten signature is BioHashing, where pseudo-random tokens and biometrics are combined to achieve higher security and performance [14,15]. This scheme has also been applied to face biometrics in [16]. Information-theoretical approaches to crypto-biometrics have also been presented. One example is the work of Dodis et al. [17], where a theoretical framework is presented for cryptography with fuzzy data (here, biometrics). They propose two primitives: a secure sketch, which produces public information about a biometric signal that does not reveal details of the input, and a fuzzy extractor, which extracts nearly uniform randomness from a biometric input in an error-tolerant way helped by some public string. Also, they propose an extension of the fuzzy vault which is easier to evaluate theoretically than the original formulation of Juels and Sudan [8].
3 Biometric Hash Generation
We present a biometric hash generation scheme based on the concatenation of binary strings extracted from a set of feature vector subsets. We extend the previous work by Hoque et al. [4], where vector quantization is applied to feature subsets, which are further concatenated to form the biometric hash. We provide a solution for high-dimensional vectors by means of feature selection based on an integer-coding genetic algorithm.
3.1 Feature Subset Concatenation
Given a feature vector x = [x1, . . . , xN] with xi ∈ R, a biometric hash h = [h1, . . . , hL] with hi ∈ {0, 1} of dimension L is extracted. Let x^j with j = 1, . . . , D be formed by a subset of features of x of dimension M (M < N), with possibly overlapping features for different j. Let C^j be a codebook obtained by vector quantization of feature subset x^j using a development set of features x^j_k, k = 1, . . . , K. We define h for an input feature vector xT as:

    h(xT) = concat_{j=1,...,D} ( f(x^j_T, C^j) )                                  (1)
where f is a function that assigns the nearest-neighbour codeword, and concat(·) denotes the concatenation of binary strings. The codebooks C^j are computed with vector quantization as follows. Let x^j_k, k = 1, . . . , K, be the feature vector subsets forming a development set. The k-means algorithm is used to compute the centroids of the underlying clusters, for a given number of clusters Q. Then, the centroids are ranked based on their distance to the mean of all centroids. Finally, binary codewords of size q = log2 Q are defined as the position of each centroid in the ranking using Gray coding [18].
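A sketch of this codebook construction and of the hash of Eq. (1) is given below; it assumes the feature subsets have already been selected (Sect. 3.2) and uses the scikit-learn k-means implementation, with Q = 8 clusters as in the experiments of Sect. 4.

```python
import numpy as np
from sklearn.cluster import KMeans

def gray_code(n, bits):
    g = n ^ (n >> 1)                                   # binary-reflected Gray code
    return [(g >> b) & 1 for b in reversed(range(bits))]

def build_codebook(dev_subset_vectors, Q=8):
    """Cluster development subset vectors (K x M) and assign Gray-coded codewords
    to the centroids ranked by distance to the mean of all centroids."""
    centroids = KMeans(n_clusters=Q, n_init=10).fit(dev_subset_vectors).cluster_centers_
    ranking = np.argsort(np.linalg.norm(centroids - centroids.mean(axis=0), axis=1))
    bits = int(np.log2(Q))
    codewords = np.zeros((Q, bits), dtype=int)
    for rank, idx in enumerate(ranking):
        codewords[idx] = gray_code(rank, bits)
    return centroids, codewords

def biometric_hash(x, subsets, codebooks):
    """h(x): concatenate, over subsets, the codeword of the nearest centroid (Eq. 1)."""
    h = []
    for subset, (centroids, codewords) in zip(subsets, codebooks):
        j = int(np.argmin(np.linalg.norm(centroids - x[subset], axis=1)))
        h.extend(codewords[j])
    return np.array(h, dtype=int)
```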
3.2 Feature Selection Using Genetic Algorithms
GA are non-deterministic methods inspired by natural evolution, which apply the rules of selection, crossover and mutation to a population of possible solutions in order to optimize a given fitness function [5]. In the present work, a GA with integer coding is implemented in order to obtain the best subsets of M features. Integer coding has been used instead of binary coding, since the latter does not fit well when the number of features is fixed.

Algorithm 1. Feature subset selection using GA
Input: n, S, θ
Output: A
  F ← S
  A ← ∅
  for i ← 1 to n do
    B ← GA(F)  {Call GA, returns a sorted list of candidate subsets}
    for all b ∈ B do
      if |b ∩ a| ≤ θ, ∀a ∈ A then
        A ← A ∪ b
      end if
    end for
    N ← ∅
    for all a ∈ A do
      N ← N ∪ a
    end for
    N ← unique(N)  {Remove repeated items}
    F′ ← ∅
    for j ← 1 to |F|/2 do
      F′ ← F′ ∪ Nj
    end for
    F ← F − F′
  end for
The proposed iterative algorithm for feature subset selection using GA is presented in Algorithm 1. Note that the proposed algorithm can be easily modified to use a different feature selection technique such as SFFS [19]. In words, Algorithm 1 receives the number of iterations n, the initial feature set S, and the threshold θ, which represents the maximum number of overlapped features
permitted among different subsets. For the feature set F, initially equal to S, the feature selection algorithm is called (here, the GA). From its output (a sorted list of subsets of size M), subsets are selected iteratively if no previously selected subset overlaps with the one at hand in more than the threshold θ. In this way, the threshold settles the degree of correlation allowed among the different subsets. In the proposed scheme, the fitness function of the GA has been defined as f = EER^(-1), where the EER is computed for skilled forgeries on a development set different from the training set used for vector quantization (see Sect. 4.1). After the subset selection, the half of the features with the best performance is removed from F, and the algorithm is iterated. This strategy was followed in order to avoid the possible loss of less discriminative sets of features that nevertheless provide complementary information to the previously selected features.
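The selection loop of Algorithm 1 can be sketched as follows; run_ga stands in for the genetic search of this section and is assumed to return candidate subsets sorted by descending fitness (ascending EER).

```python
def select_subsets(run_ga, n, S, theta):
    """Iterative feature-subset selection (Algorithm 1)."""
    F = list(S)                                # current feature pool
    A = []                                     # accepted subsets (best first)
    for _ in range(n):
        for b in run_ga(F):                    # b is a set of M feature indices
            # accept b only if it shares at most theta features with every accepted subset
            if all(len(b & a) <= theta for a in A):
                A.append(b)
        # collect the features of the accepted subsets, best-performing first, without repeats
        N = list(dict.fromkeys(f for a in A for f in sorted(a)))
        F_best = N[: len(F) // 2]              # best-performing half of the pool
        F = [f for f in F if f not in F_best]  # iterate on the remaining features
    return A
```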
4 Case Study: Biometric Hash Generation from Feature-Based Information of On-Line Signatures

4.1 Signature Database and Experimental Protocol
The MCYT on-line signature corpus is used for the experiments [20]. This database contains 330 users with 25 genuine signatures and 25 skilled forgeries per user, captured at four acquisition sites. Forgers were asked to imitate the signatures after observing their static images; they practiced copying each signature at least 10 times and then wrote the forgeries naturally, without breaks or slowdowns. For the experiments presented here, we have followed a 2-fold cross-validation strategy. The database has been divided into two sets: a development set, formed by the even users, and an evaluation set, with the odd users. The development set has been further partitioned into a training set (for vector quantization), with the even users of the development set, and a testing set (for GA optimization), with the rest. Evaluation experiments were conducted as follows. A binary hash was generated for each genuine signature in the database, and compared to the hashes of the remaining genuine signatures of the user at hand, all her skilled forgeries, and the first genuine signature of the remaining users (random forgeries). Genuine and impostor matching scores were calculated using the similarity function sH(h1, h2) = qmax − dH(h1, h2), where dH represents the Hamming distance and h1 and h2 are binary vectors of size qmax [21]. The EER between genuine scores and either skilled or random forgery scores using this similarity measure is used as the performance criterion of the hashing scheme. Examples of the matching of genuine and skilled hashes are included in Fig. 2.
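The matching score above reduces to the complement of a Hamming distance; a direct sketch is shown below (for θ = 0 the hashes are 75 bits long, cf. Fig. 2 and Table 1).

```python
import numpy as np

def hash_similarity(h1, h2):
    # s_H(h1, h2) = q_max - d_H(h1, h2), with d_H the Hamming distance [21]
    h1, h2 = np.asarray(h1, dtype=int), np.asarray(h2, dtype=int)
    return int(h1.size - np.count_nonzero(h1 != h2))
```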
4.2 Feature Extraction from On-Line Signatures
For the experiments we use an on-line signature representation based on global features [7]. In particular, a 100-dimensional global feature vector is extracted from each on-line signature [22], including features based on timing information, number of strokes, geometry, etc.
[Fig. 2 shows four example 75-bit hash comparisons (two genuine matches and two skilled-forgery matches), with similarity values of 59, 54, 46 and 48.]
Fig. 2. Examples of the matching between a genuine template and a genuine test hash (a), and between a genuine template and a skilled forgery (b). The binary strings correspond to the real hashes obtained with overlap θ = 0. Different bits are represented in bold.
Each feature xi is normalized into the range [0, 1] using tanh-estimators [23]. The normalization is given by:

    xi = (1/2) [ tanh( 0.1 (xi − μi) / σi ) + 1 ]                                  (2)

where μi and σi are the mean and the standard deviation of the feature xi for the genuine signatures in the database.
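A direct sketch of this normalization is given below; it simply evaluates Eq. (2) element-wise, with μ and σ estimated beforehand from the genuine signatures.

```python
import numpy as np

def tanh_normalize(x, mu, sigma):
    # Tanh-estimator normalization of Eq. (2): maps each feature into [0, 1].
    x = np.asarray(x, dtype=float)
    return 0.5 * (np.tanh(0.1 * (x - mu) / sigma) + 1.0)
```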
Fig. 3. Results of the first (a) and second (b) execution of the GA. Following the 2-fold cross-validation approach, set 1 (left) corresponds to the development set formed by the odd users, and set 2 (right) by the even users.

Table 1. Biometric hashing scheme performance for different overlapping threshold θ

Overlap (θ)   Number of subsets   Hash length   EER skilled (%)   EER random (%)
     0                25                75            18.83             8.02
     1               171               513            12.99             3.50
     2               530              1590            12.15             3.16

4.3 Implementation of the Genetic Algorithm
In the proposed integer-coding genetic selection, a string of the genetic population is represented by a vector of length M , where M is the dimension of the subsets to be found. Each element of the string is an integer in the range [1, 100] and corresponds to a feature in the original set.
Fig. 4. EER (in %) for increasing number of subsets of the biometric hash for overlapping threshold θ = 0, 1 and 2, respectively
The configuration set of the genetic algorithm is the following:
– Population size: 100, randomly generated in the first generation.
– String size: 4, representing the number M of features in the subset.
– Stop condition: after completing 100 generations.
– Selection: binary tournament.
– Crossover: one-point, with 85% probability.
– Mutation: mutated elements are randomly assigned a value in the range [1, 100] that is not present in the string, with 5% probability.
The output of our genetic algorithm is the whole set of the strings produced along the evolution of the GA, sorted by descending order of fitness.
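One generation of an integer-coded GA with the configuration listed above can be sketched as follows; the fitness function (the inverse EER of Sect. 3.2) is assumed to be supplied by the caller, and no repair of duplicated feature indices after crossover is shown.

```python
import random

def next_generation(population, fitness, n_features=100, p_cross=0.85, p_mut=0.05):
    """Binary tournament selection, one-point crossover and mutation over
    integer-coded strings (lists of feature indices in [1, n_features])."""
    def tournament():
        a, b = random.sample(population, 2)
        return list(max(a, b, key=fitness))

    offspring = []
    while len(offspring) < len(population):
        p1, p2 = tournament(), tournament()
        if random.random() < p_cross:                    # one-point crossover
            cut = random.randrange(1, len(p1))
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for child in (p1, p2):
            for i in range(len(child)):
                if random.random() < p_mut:              # mutate to an unused index
                    child[i] = random.choice(
                        [f for f in range(1, n_features + 1) if f not in child])
            offspring.append(child)
    return offspring[:len(population)]
```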
4.4 Experimental Results
Development Experiments. Algorithm 1 was executed for parameters n = 2, S = [1, 100] and θ = 0, 1, and 2, respectively. The number of clusters in the k-means algorithm was fixed to 8. As a result, codewords extracted from each subset have a bit size of log2 8 = 3. In Fig. 3(a) we present the evolution of the first iteration of the feature subset selection algorithm, corresponding to an execution of the genetic algorithm. We observe that the best string (feature subset) converges to an EER value of about 24% for skilled forgeries. Results of the second execution are presented in Fig. 3(b). Interestingly, the best EER is only slightly worse when considering the 50 features discarded from the first iteration.
Evaluation Results. The proposed hashing scheme was evaluated for the subsets and the codebooks obtained in the development experiments. Matching scores were computed using the similarity measure described in Sect. 4.1. EERs using an increasing number of subsets are presented in Fig. 4 for θ = 0, 1 and 2. The evaluation results are summarized in Table 1. We observe that the best EER for skilled and random forgeries is achieved when a high number of subsets is considered (θ = 2). However, it is worth noting that a large hash length does not imply higher security, since large hashes include redundant information.
5 Conclusions
We have proposed a general hash generation scheme for fixed-length feature-based approaches to biometrics. Our scheme includes a feature selection step based on genetic algorithms, providing a practical solution for high-dimensional feature vectors. The proposed scheme can be used as a building block of biometric cryptosystems. Experiments have been conducted on signature data from the MCYT database using 2-fold cross-validation. We have studied the effect of using subsets with a variable number of overlapping features. We observed that the best EER is achieved with the configuration that involves more feature subsets (overlapping threshold of 2). A future direction of this research will be the comparison of the GA with other feature selection strategies. Also, the effect of other parameters such as the quantization size or the subset length will be considered. The redundancy of the resulting hashes is also yet to be studied, as well as the application of the proposed hashing to other biometrics.
Acknowledgements This work has been supported by Spanish Ministry of Education and Science (project TEC2006-13141-C03-03) and BioSecure NoE (IST-2002-507634). M. R. F. is supported by a FPI Fellowship from Comunidad de Madrid. J. F. is supported by a Marie Curie Fellowship from European Commission. J. G. is supported by a FPU Fellowship from the Spanish Ministry of Education and Science.
References
1. Jain, A.K., Ross, A., Pankanti, S.: Biometrics: A tool for information security. IEEE Trans. on Information Forensics and Security 1(2), 125–143 (2006)
2. Uludag, U., Pankanti, S., Prabhakar, S., Jain, A.K.: Biometric cryptosystems: Issues and challenges. Proceedings of the IEEE 92(6), 948–960 (2004)
3. Bolle, R.M., Connell, J.H., Ratha, N.K.: Biometric perils and patches. Pattern Recognition 35(12), 2727–2738 (2002)
4. Hoque, S., Fairhurst, M., Howells, G., Deravi, F.: Feasibility of generating biometric encryption keys. Electronics Letters 41(6), 309–311 (2005)
5. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co. Inc., Boston, MA, USA (1989)
6. Plamondon, R., Srihari, S.N.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Trans. PAMI 22(1), 63–84 (2000)
7. Fierrez, J., Ortega-Garcia, J.: On-line signature verification. In: Jain, A.K., Ross, A., Flynn, P. (eds.) Handbook of Biometrics (to appear)
8. Juels, A., Sudan, M.: A fuzzy vault scheme. Design Code. Cryptogr. 38(2), 237–257 (2006)
9. Uludag, U., Pankanti, S., Jain, A.K.: Fuzzy vault for fingerprints. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 310–319. Springer, Heidelberg (2005)
10. Uludag, U., Jain, A.K.: Securing fingerprint template: Fuzzy vault with helper data. In: Proc. CVPRW, p. 163. IEEE Computer Society, Los Alamitos (2006)
11. Freire-Santos, M., Fierrez-Aguilar, J., Ortega-Garcia, J.: Cryptographic key generation using handwritten signature. In: Proc. SPIE, vol. 6202, pp. 225–231 (2006)
12. Vielhauer, C., Steinmetz, R., Mayerhoefer, A.: Biometric hash based on statistical features of online signatures. In: Proc. ICPR, vol. 1, pp. 123–126 (2002)
13. Vielhauer, C., Steinmetz, R.: Handwriting: Feature correlation analysis for biometric hashes. EURASIP JASP 2004(4), 542–558 (2004)
14. Teoh, A.B., Goh, A., Ngo, D.C.: Random multispace quantization as an analytic mechanism for biohashing of biometric and random identity inputs. IEEE Trans. PAMI 28(12), 1892–1901 (2006)
15. Lumini, A., Nanni, L.: An improved biohashing for human authentication. Pattern Recognition 40(3), 1057–1065 (2007)
16. Ngo, D.C.L., Teoh, A.B.J., Goh, A.: Biometric hash: high-confidence face recognition. IEEE Trans. Circ. Syst. Vid. 16(6), 771–775 (2006)
17. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. In: Cachin, C., Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004)
18. Lin, S., Costello, D.J.: Error Control Coding, 2nd edn. Prentice-Hall, Inc., Upper Saddle River, NJ, USA (2004)
19. Pudil, P., Novovicova, J., Kittler, J.: Floating search methods in feature selection. Pattern Recogn. Lett. 15(11), 1119–1125 (1994)
20. Ortega-Garcia, J., et al.: MCYT baseline corpus: A bimodal biometric database. IEE Proc. Vision, Image and Signal Processing 150(6), 395–401 (2003)
21. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, San Diego (2006)
22. Fierrez-Aguilar, J., et al.: An on-line signature verification system based on fusion of local and global information. In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 523–532. Springer, Heidelberg (2005)
23. Jain, A.K., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognition 38(12), 2270–2285 (2005)
Biometrics Based on Multispectral Skin Texture Robert K. Rowe Lumidigm, Inc, 801 University Blvd., SE, Suite 302, Albuquerque, NM 87106
[email protected] Abstract. Multispectral imaging (MSI) of the skin provides information about both the surface and subsurface characteristics of the skin tissue. These multispectral characteristics may be described as textures and used to determine the identity of an individual. In this investigation, a multi-day, multi-person study was conducted to compare the performance of a matcher based on multispectral texture analysis with a conventional minutiae-based fingerprint matcher. Both matchers used MSI data from exactly the same area of the sensor. Performance of the two methods was compared for a range of simulated sensor areas. The performance of the textural matcher was nearly equivalent to that of the minutiae matcher for the larger sensor areas within the limited scope of this study. For small sensor areas, the performance of the texture matcher was significantly better than the minutia matcher operating over the identical small region. Keywords: multispectral skin texture, skin biometrics, multispectral imaging, local consistency.
1 Introduction

Fingerprint-based biometric sensors are used across a broad range of applications, from law enforcement and civil identification to commercial access control and even in some consumer devices such as laptops and cell phones. In the latter cases, there is a need to reduce the size of the sensor in order to reduce the area of the device that the sensor occupies and also, generally, to reduce the cost of the sensor. However, the performance of a contact fingerprint sensor degrades as the size decreases [1]. Because of this, some manufacturers [2, 3, 4] produce long and narrow fingerprint sensors that simulate a larger-area sensor by combining a series of narrow images collected while the user “swipes” their finger across the sensor surface. Such a sensor configuration places a burden on the user and limits the applications in which this type of solution is considered. One way to reduce the sensing area while maintaining a simple, single-touch user interface is to measure a property of the skin that is locally consistent while still being distinct from person to person. In this way, a small-area sensor would be able to perform a biometric match using a skin location never previously enrolled, as long as the optical properties of the enrolled and tested skin sites were “similar enough”. In this paper, we examine the feasibility of using multispectral imaging (MSI) for small-area sensors. The matching is done using textural descriptors of the multispectral
data combined with a classification methodology that seeks to find characteristics of the data that are locally consistent. We test the performance of this matcher against that of a commercial minutiae matcher operating on fingerprint images developed from the same multispectral data used for the texture matcher.
2 Relationship to Other Work

A number of papers have been published on texture-based matching of fingerprints. For example, Jain [5] proposed a local texture analysis using Gabor filters applied to tessellated regions around the core point. Lee and Wang [6] and Hamamoto [7] also used Gabor filters localized at the core. Coetzee and Botha [8] and Willis and Myers [9] proposed Fourier analysis of fingerprint texture as the basis for biometric determinations. Tico et al. [10] applied wavelet analysis to fingerprint images to distinguish between them. Apart from differences in basis functions and other methodological matters, the key difference between these prior efforts and the one reported in this paper is that, to this investigator’s knowledge, all prior investigations are based on conventional fingerprint images. As such, the observed textural pattern is extracted from a single image, which limits the information content and can be adversely affected by artifacts due to effects such as dry skin, poor contact between the skin and sensor, and other operationally important matters. In contrast, the present investigation is based on multiple images taken with a robust imaging methodology which contains information about both the surface and subsurface characteristics of the skin while minimizing sampling artifacts. In fact, the one data plane in the MSI stack most similar to conventional optical fingerprinting techniques was explicitly removed from the texture analysis in order to avoid the presence of spurious effects, as described later in this paper. The topic of how to perform biometric matching using small-area fingerprint sensors is an active area of investigation. One common approach is to build up a large enrollment image by piecing together a series of small fingerprint images taken over multiple placements of the finger in a process known as mosaicing [1, 11]. Alternatively, some authors have combined the minutiae information together instead of the images themselves [12, 13, 14, 15]. In contrast to either of these approaches, the work described in this paper is associated with finding those characteristics of the multispectral skin texture that are locally consistent in order that an enrollment measurement may be made at one skin site and successfully verified at a different but nearby skin site. One manufacturer of a facial recognition system [16] says they use “surface texture analysis” to augment and improve the performance of their commercial system. The major difference between such a surface-based method and the methods and systems described in this paper is that the MSI sensor has been developed to obtain a significant portion of information from features below the surface of the skin. Moreover, the plurality of wavelengths, illumination angles, and optical polarization conditions used in this investigation yields additional information beyond that available through simple surface reflectance measurements.
3 Multispectral Skin Sensing

3.1 Hardware and Raw MSI Data

In order to capture information-rich data about the surface and subsurface features of the skin of the finger, the MSI sensor collects multiple images of the finger under a variety of optical conditions. The raw images are captured using different wavelengths of illumination light, different polarization conditions, and different illumination orientations. In this manner, each of the raw images contains somewhat different and complementary information about the finger. The different wavelengths penetrate the skin to different depths and are absorbed and scattered differently by various chemical components and structures in the skin. The different polarization conditions change the degree of contribution of surface and subsurface features to the raw image. Finally, different illumination orientations change the location and degree to which surface features are accentuated. Fig. 1 shows a simplified schematic of the major optical components of an MSI fingerprint sensor. Illumination for each of the multiple raw images is generated by one of the light emitting diodes (LEDs). The figure illustrates the case of polarized, direct illumination being used to collect a raw image. The light from the LED passes through a linear polarizer before illuminating the finger as it rests on the sensor platen. Light interacts with the finger and a portion of the light is directed toward the imager through the imaging polarizer. The imaging polarizer is oriented with its optical axis to be orthogonal to the axis of the illumination polarizer, such that light with the same polarization as the illumination light is substantially attenuated by the polarizer. This severely reduces the influence of light reflected from the surface of the skin and emphasizes light that has undergone multiple optical scattering events after penetrating the skin. The second direct-illumination LED shown in Fig. 1 does not have a polarizer placed in the illumination path. When this LED is illuminated, the illumination light is randomly polarized. In this case the surface-reflected light and the deeply penetrating light are both able to pass through the imaging polarizer in equal proportions. As such, the image produced from this non-polarized LED contains a much stronger influence from surface features of the finger. Importantly, all of these direct-illumination sources (both polarized and non-polarized) as well as the imaging system are arranged to avoid any critical-angle phenomena at the platen-air interfaces. In this way, each illuminator is certain to illuminate the finger and the imager is certain to image the finger regardless of whether the skin is dry, dirty, or even not in contact with the sensor. This aspect of the MSI imager is distinctly different from most other conventional fingerprint imaging technologies and is a key aspect of the robustness of the MSI methodology. In addition to the direct illumination illustrated in Fig. 1, the MSI sensor also integrates a form of total internal reflectance (TIR) imaging. In this illumination mode, one or more LEDs illuminate the side of the platen. A portion of the illumination light propagates through the platen by making multiple TIR reflections at the platen-air interfaces. At points where the TIR is broken by contact with the skin, light enters the skin and is diffusely reflected. A portion of this diffusely reflected light is directed toward the imaging system and passes through the imaging polarizer
Fig. 1. Optical configuration of an MSI sensor. The red lines illustrate the direct illumination of a finger by a polarized LED.
(since this light is randomly polarized), forming an image for this illumination state. Unlike all of the direct illumination states, the quality of the resulting raw TIR image is critically dependent on having skin of sufficient moisture content and cleanliness making good optical contact with the platen, just as is the case with conventional TIR sensors. In practice, MSI sensors typically contain multiple direct-illumination LEDs of different wavelengths. For example, the Lumidigm J110 MSI sensor used to collect the data used in this investigation has four direct-illumination wavelength bands (430, 530, and 630 nm as well as a white light) in both polarized and unpolarized configurations. When a finger is placed on the sensor platen, eight direct-illumination images are captured along with a single TIR image. The raw images are captured on a 640 x 480 image array with a pixel resolution of 525 ppi. All nine images are captured in approximately 500 ms. An example of the nine images captured during a single finger placement is illustrated in Fig. 2. The upper row shows the raw images for unpolarized illumination wavelengths of 430, 530, and 630 nm, as well as white light. The lower row shows the corresponding images for the cross-polarized case as well as the TIR image. The grayscale for each of the raw images has been expanded to emphasize the features.

3.2 Composite Fingerprint Generation and Matching

It can be seen from Fig. 2 that there are a number of features present in the raw data, including the textural characteristics of the subsurface skin, which appear as mottling that is particularly pronounced under blue (430 nm) and green (530 nm) illumination wavelengths. As well, the relative intensities of the raw images under each of the illumination conditions are very indicative of the spectral characteristics (i.e., color) of the finger or other sample (note that the relative intensities have been obscured in Fig. 2 to better show the comparative details of the raw images).
Fig. 2. Raw MSI images. The upper row of images corresponded to non-polarized illumination of various wavelengths (from left to right: 430, 530, 630nm and white light). Directly below each of these images is the corresponding cross-polarized illumination case. The image at the extreme right of the lower row is the TIR image.
The set of raw images shown in Fig. 2 can be combined together to produce a single representation of the fingerprint pattern. This fingerprint generation relies on a wavelet-based method of image fusion to extract, combine and enhance those features that are characteristic of a fingerprint. The wavelet decomposition method that is used is based on the dual-tree complex wavelet transform (DTCWT) [17]. Image fusion occurs by selecting and compiling the coefficients with the maximum absolute magnitude in the image at each position and decomposition level [18]. An inverse wavelet transform is then performed on the resulting collection of coefficients, yielding a single, composite image. An example of the result of applying the compositing algorithm to two placements of the same finger is given in Fig. 3. The composite fingerprint images can then be used with conventional fingerprint matching software. For this investigation, a commercial minutiae-based feature extraction and matching routine (NEC, NECSAM FE4, ver. 1.0.2.0, PPC2003) was applied to the composite fingerprints as a control to compare to texture matching results.
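The fusion rule just described (keep, at every position and decomposition level, the coefficient with the largest magnitude across the image stack, then invert) can be sketched as below; dtcwt_forward and dtcwt_inverse are placeholders for any dual-tree complex wavelet implementation, and handling of the lowpass residual is omitted for brevity.

```python
import numpy as np

def fuse_msi_stack(images, dtcwt_forward, dtcwt_inverse, nlevels=4):
    """Wavelet-based fusion of an MSI image stack into a composite fingerprint.
    dtcwt_forward(img, nlevels) -> list of complex coefficient arrays, one per level
    dtcwt_inverse(coeff_list)   -> reconstructed image"""
    pyramids = [dtcwt_forward(img, nlevels) for img in images]
    fused = []
    for level in range(nlevels):
        stack = np.stack([p[level] for p in pyramids])           # (n_images, ...) complex
        winner = np.abs(stack).argmax(axis=0)                     # max-magnitude selection [18]
        fused.append(np.take_along_axis(stack, winner[None, ...], axis=0)[0])
    return dtcwt_inverse(fused)
```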
Fig. 3. Composite fingerprint images extracted from 2 different sets of multispectral data like those shown in Fig. 2
3.3 Multispectral Texture Analysis

The DTCWT process that was used to generate composite fingerprint images was also used to provide the spectral-textural features of the multispectral data. The coefficients from the third level of the DTCWT decomposition of the multispectral image stack were used as features for the texture analysis in this investigation. Since the strength and quality of the raw TIR image plane is highly variable (dependent on skin moisture, good contact, etc.), this plane was omitted from the multispectral texture analysis but not from the corresponding composite fingerprint. Anderson et al. [19] have defined a set of image features based on a DTCWT decomposition that they refer to as inter-level products. These features represent the conjugate products of the DTCWT coefficients in two adjacent decomposition levels. They have also defined an inter-coefficient product (elsewhere in their publications it is referred to as “same-level product”) that represents the conjugate product of adjacent coefficients in the same decomposition level. These conjugate products have been shown to represent the fundamental features of the image while being insensitive to translation and some amount of rotation. In a similar way, we define an inter-image product, P, as the conjugate product of coefficients at some direction, d, and decomposition level, k, generated by any 2 of the raw multispectral images, i and j, in a multispectral image stack at location x, y:

    P_i,j(x, y, d, k) = C_i(x, y, k) · C_j*(x, y, k)                               (1)
where Ci(x,y,k) is the complex coefficient for image i at decomposition level k and location x,y while C*j(x,y,k) is the conjugate of the corresponding complex value for image j. For purposes of this investigation, we compiled all real and imaginary components of all the conjugate products generated from each unique image pair as a feature vector. For 8 raw image planes, this results in a 384 element vector (28 conjugate products/direction, 6 directions, 2 scalar values (real and imaginary) per product for i≠j, plus 8 conjugate products/direction, 6 directions, 1 scalar value (real only) for i=j). In addition, the isotropic magnitudes of the coefficients were added to the feature vector, where the isotropic magnitude is simply the sum of the absolute magnitudes over the 6 directional coefficients. Finally, the mean DC values of each of the raw images over the region of analysis were added to the feature vector. Concatenating all of these values resulted in a 400-element feature vector at each element location.
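A sketch of the per-location feature construction is given below; coeffs is assumed to hold the level-3 DTCWT coefficients of the 8 retained image planes (6 directional coefficients each) at one element location, and dc the mean DC values of the raw images over the analysis region. The element count reproduces the 384 + 8 + 8 = 400 figure quoted above.

```python
import numpy as np
from itertools import combinations

def texture_feature_vector(coeffs, dc):
    """coeffs: complex array of shape (8 images, 6 directions) at one location (level 3).
    dc: mean DC value of each of the 8 raw images over the analysis region."""
    feats = []
    # Inter-image products P_ij = C_i * conj(C_j), Eq. (1), for i != j
    for i, j in combinations(range(coeffs.shape[0]), 2):       # 28 unique image pairs
        p = coeffs[i] * np.conj(coeffs[j])                      # 6 directional products
        feats.extend(p.real)
        feats.extend(p.imag)                                    # 28 * 6 * 2 = 336 values
    for i in range(coeffs.shape[0]):                            # i == j: real part only
        feats.extend((coeffs[i] * np.conj(coeffs[i])).real)     # 8 * 6 = 48 values
    # Isotropic magnitudes: sum of |C| over the 6 directions, per image
    feats.extend(np.abs(coeffs).sum(axis=1))                    # 8 values
    feats.extend(dc)                                            # 8 values
    return np.asarray(feats, dtype=float)                       # 336 + 48 + 8 + 8 = 400
```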
4 Experimental Data and Analysis

Multispectral data were collected on 21 volunteers over a period of approximately 2 weeks. The participants were office workers whose ages ranged from the mid-20s to the mid-60s. They were of mixed gender and race. Each participant made two visits separated by at least one day. During a visit, each of the 4 fingers on the right hand was measured 3 times, yielding 12 multispectral data sets collected per person per visit. This resulted in a total dataset comprised of 504 multispectral image stacks taken over 84 unique fingers.
For purposes of this investigation, the data collected during a participant’s first visit was treated as enrollment data and data from the second visit was used for testing. A summary of the spans of time between enrollment and verification for the 21 study participants is: 1 day (16 people), 3 days (2 people), 4 days (1 person), 5 days (1 person) and 11 days (1 person). Since the dataset used for this analysis was of a relatively modest size, the biometric task that was tested was chosen to be personalization, which requires far less data than verification (or any other “open” formulation) to obtain stable results. In this testing scenario, there are N enrolled people and each test sample is assumed to be one of the enrollees. No attempt is made to determine or discriminate against non-enrolled test samples. The estimate for identity is simply the enrollment sample that is the closest match to the test sample. Multiple random trials were conducted to determine the performance of both the multispectral texture matcher as well as the fingerprint matcher used as a control for each tested condition. Each trial was based on enrollment by 6 randomly selected fingers from 6 randomly selected people. These same data were also used to build a classification model as described below. The data from the same fingers taken during the second visit for each of the 6 enrolled fingers were used for performance testing. In order to augment the calibration model, the data from one randomly selected finger from each of the other 15 people in the dataset who were not enrolled for a particular trial were included as part of the calibration dataset. Multispectral texture analysis was performed on an element-by-element basis over a particular analysis region. Each analysis region comprised a contiguous area of the imager with pixel dimensions of 64 x 48. The number of analysis regions used for each of the 4 cases that were studied varied from 5 x 5 (320 x 240 pixels, 0.61” x 0.46”) to 1 x 1 (64 x 48 pixels, 0.12” x 0.09”). In each of the cases studied, the comparison was made to a fingerprint that was generated from the multispectral data and masked to cover exactly the same area as used for the multispectral match. The four cases are illustrated and further described in Fig. 4. For all cases, the analysis regions were centered on the active image area. The calibration data for a given analysis region was used as an input into a Fisher linear discriminant analysis (FLDA) algorithm to develop a classification model. Prior to processing the calibration data, the data were normalized such that the standard deviation of each of the elements of the feature vector was 1.0 over the calibration set. The same normalization factor was also applied to the test data. The test samples were then matched to the enrollment samples by projecting the difference in feature vectors onto the FLDA factors and accumulating an RMS summary of the resulting projection. The differences were accumulated for all elements within an analysis region and for all analysis regions used in a particular case. The estimate of the identity of the test sample was then selected as the enrollment that had the smallest accumulated match value associated with it. The comparison to fingerprint minutiae matching was done by masking the MSI-derived fingerprint to the same area as used for the corresponding MSI texture analysis. The minutiae matching was then performed between the test image and the 6*3 = 18 enrolled images.
The ID of the enrolled image with the largest match value was recorded for that particular test image.
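A sketch of the texture-matching decision just described is given below; the FLDA projection matrix W is assumed to have been fit on the normalized calibration feature vectors, and per-element feature vectors are assumed to be stacked row-wise for each sample.

```python
import numpy as np

def match_identity(test_feats, enrolled_feats, W):
    """test_feats: (n_elements, 400) normalized feature vectors of the test sample.
    enrolled_feats: dict mapping enrolled id -> (n_elements, 400) feature vectors.
    W: FLDA projection matrix of shape (400, n_factors).
    Returns the enrolled identity with the smallest accumulated RMS match value."""
    scores = {}
    for identity, enr in enrolled_feats.items():
        proj = (test_feats - enr) @ W                            # project feature differences
        scores[identity] = float(np.sqrt(np.mean(proj ** 2)))    # RMS accumulation
    return min(scores, key=scores.get)
```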
Fig. 4. Illustration of sensor sizes used in this study using the same images as in Fig 3. The largest sensor size is at the top and is 320 x 240 (0.61” x 0.46”) followed by 192 x 144 (0.37” x 0.27”), 192 x 48 (0.37” x 0.09”) and 64 x 48 (0.12” x 0.09”) on the bottom.
5 Results

The performance of each test configuration was estimated by aggregating the results of 20 random trials conducted for each of the test conditions. The summary of these results is given in Fig. 5. A comparison of the results of this study indicates that the performance of the multispectral texture matcher is approximately the same as the fingerprint matcher for
Fig. 5. Results for fingerprint matching (left) and multispectral skin texture matching (right) for 4 different sensor sizes. The boxes indicate the inner two quartiles of the 20 random trials and the red line is the median of the 20 trials. The notch indicates the uncertainty of the median at a 5% significance level.
the two largest sensor areas that were tested. Further work with larger datasets will be required to quantify the performance differences between a minutiae matcher and a multispectral texture matcher over the same, relatively large area of the finger. However, the results from this modestly sized study do indicate that as the sensor size decreases, the performance of the texture-based matcher is maintained better than that of the minutiae-based matcher, which degrades sharply for the smallest sensor areas tested. This difference in performance is possibly due to the property of local consistency of the multispectral texture: skin proximal to the point of enrollment has approximately the same properties as the enrollment site itself. Therefore, placement-to-placement variation (which becomes more severe as the sensor size decreases) affects the texture matcher far less than the minutiae matcher.
References
1. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
2. http://www.authentec.com
3. http://www.upek.com
4. http://www.atmel.com/products/biometrics/
5. Jain, A.K., Prabhakar, S., Hong, L., Pankanti, S.: Filterbank-based fingerprint matching. IEEE Transactions on Image Processing 9, 846–859 (2000)
6. Lee, C.J., Wang, S.D.: Fingerprint feature extraction using Gabor filters. Electronics Letters 35(4), 288–290 (1999)
7. Hamamoto, Y.: A Gabor filter-based method for fingerprint identification. In: Jain, L.C., Halici, U., Hayashi, I., Lee, S.B. (eds.) Intelligent Biometric Techniques in Fingerprint and Face Recognition, CRC Press, Boca Raton, FL (1999)
8. Coetzee, L., Botha, E.C.: Fingerprint recognition with a neural-net classifier. In: Proc. South African Workshop on Pattern Recognition, 1st edn. vol. 1, pp. 33–40 (1990)
9. Willis, A.J., Myers, L.: A cost-effective fingerprint recognition system for use with low-quality prints and damaged fingertips. Pattern Recognition 34(2), 255–270 (2001)
10. Tico, M., Kuosmanen, P., Saarinen, J.: Wavelet domain features for fingerprint recognition. Electronics Letters 37(1), 21–22 (2001)
11. Brown, L.G.: Image registration techniques. ACM Computing Surveys 24(4), 326–376 (1992)
12. Yau, W.Y., Toh, K.A., Jiang, X., Chen, T.P., Lu, J.: On fingerprint template synthesis. In: Proc. Int. Conf. on Control Automation Robotics and Vision, 6th edn. (2000)
13. Toh, K.A., Yau, W.Y., Jiang, X., Chen, T.P., Lu, J., Lim, E.: Minutiae data synthesis for fingerprint identification applications. In: Proc. Int. Conf. on Image Processing, vol. 3, pp. 262–265 (2001)
14. Jain, A.K., Ross, A.: Fingerprint mosaicking. In: Proc. Int. Conf. on Acoustic Speech and Signal Processing, vol. 4, pp. 4064–4067 (2002)
15. Ramoser, H., Wachmann, B., Bischof, H.: Efficient alignment of fingerprint images. In: Proc. Int. Conf. on Pattern Recognition (16th), vol. 3, pp. 748–751 (2002)
16. http://www.visionics.com/trends/skin.html
17. Kingsbury, N.: Complex wavelets for shift invariant analysis and filtering of signals. Journal of Appl. and Comput. Harmonic Analysis 10, 234–253 (2001)
18. Hill, P., Canagarajah, N., Bull, D.: Image fusion using complex wavelets. In: Proc. 13th British Machine Vision Conference, Cardiff, UK (2002)
19. Anderson, R., Kingsbury, N., Fauqueur, J.: Robust rotation-invariant object recognition using edge-profile clusters. Submitted to the European Conference on Computer Vision (May 2006)
Application of New Qualitative Voicing Time-Frequency Features for Speaker Recognition Nidhal Ben Aloui1,2, Hervé Glotin1, and Patrick Hebrard2 1
Université du Sud Toulon-Var, Laboratoire LSIS, B.P. 20 132 - 83 957 La Garde, France {benaloui,glotin}@univ-tln.fr 2 DCNS - Division SIS, Le Mourillon, B.P. 403 - 83 055 Toulon, France {nidhal.ben-aloui,patrick.hebrard}@dcn.fr
Abstract. This paper presents original and efficient Qualitative Time-Frequency (QTF) speech features for speaker recognition based on a qualitative representation of mid-term speech dynamics. For each frame of around 150 ms, we estimate and binarize the voicing activity of 6 frequency sub-bands. We then derive the Allen temporal-relations graph between these 6 time intervals. This set of temporal relations, estimated at each frame, feeds a neural network that is trained for speaker recognition. Experiments are conducted on fifty speakers (male and female) of the reference radio database ESTER (40 hours) with continuous speech. Our best model generates around 3% frame class error, without using frame-continuity information, which is similar to the state of the art. Moreover, QTF yields a simple and light representation, using only 15 integers to code speaker identity.
1
Introduction
Classical models of speech recognition assume that a short-term analysis of the acoustic signal is essential for accurately decoding the speech signal. Most of these systems are based on maximum-likelihood-trained Gaussian mixture speaker models with diagonal covariance matrices. They may use, for example, principal component analysis on the output of 24 mel-frequency channels, or more usually around 256 components calculated from the cepstral coefficients and their deltas [1], or voicing-related features for speaker recognition, since such features capture characteristics of speakers [2]. This paper presents an alternative view, in which the relevant time scale requires an accurate description longer than the phonetic segment, wedded to the dynamics of TF voicing level intervals. We assume that voicing reflects a singular property of the modulation spectrum that may provide a qualitative framework to generate a speaker identity model.
A variety of studies have shown that intelligibility depends on the integrity of the low-frequency modulation spectrum [3]. Within reverberant environments the greatest impact of acoustic reflection is between 2 and 6 Hz. An effect of reverberation is to jumble the spectral content of the acoustic signal across both time and frequency, particularly the portion of the spectrum below 1500 Hz. Although reverberation is known to interfere with intelligibility, the basis of its deleterious impact is not well understood. On the other hand, it has been established that phonological perception is a sub-band (SB) process [4]. Moreover, this can be linked to speech production dynamics [5]. This has inspired various algorithms for robust speech recognition [6, 7], also linked to the TF voicing level [6, 8]. Time-frequency dynamics therefore seems to be important in speech perception, and because a natural representation of temporal events has been proposed by Allen J.F., we earlier proposed a quantal time-frequency dynamics paradigm for robust speech feature extraction [9]; that approach is adapted here to the specific task of speaker recognition.
2
Features Extraction
2.1
Voicing Extraction
We use the voicing measure [6, 10], correlated with SNR and equivalent to the harmonicity index (HNR) [11, 12], to estimate the average voicing per utterance. It is extracted from the autocorrelogram of a demodulated signal. In the case of Gaussian noise, the correlogram of a noisy frame is less modulated than that of a clean one [11]. The peaks in the autocorrelogram of the demodulated frame isolate the various harmonics in a signal. This can be used to separate a mixture of harmonic noises and a dominant harmonic signal. Interestingly, such separation can be accomplished efficiently using a time window whose duration is in the same range as the average phoneme duration. Before the autocorrelation, we compute the demodulated signal by half-wave rectification followed by band-pass filtering in the pitch domain ([90, 350] Hz). For each frame of 128 ms, we calculate the ratio R = R1/R0, where R1 is the local maximum in the time-delay segment corresponding to the fundamental frequency, and R0 is the cell energy. This measure is strongly correlated with SNR in the 5-20 dB range [11]. Figure 1 shows the voicing levels for each SB and each frame. These values will be thresholded to obtain the qualitative dynamic features used to estimate speaker identity. The sub-band (SB) definitions follow Fletcher's studies and later work such as Allen J.B.'s papers [4, 5, 6].1 The SB ranges (Hz) are: [216, 778]; [707, 1631]; [1262, 2709]; [2121, 3800]; [3400, 5400]; [5000, 8000].
Note that ALLEN J.B and ALLEN J.F. are two different authors, the first worked on speech analysis, the second on generic time representation. Our model is based on both approaches.
Fig. 1. From Voicing to the Allen’s Interval : (a) voicing signal (b) the voicing level by SB (c) the binarized voicing levels by SB using s Tb threshold
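To make the frame-level computation concrete, the following sketch (not the authors' implementation; the filter order and all function and variable names are our own assumptions) estimates the voicing ratio R = R1/R0 for one sub-band frame from the autocorrelation of the half-wave-rectified, band-pass-filtered signal, as described above.

# Illustrative sketch of the per-frame voicing ratio R = R1/R0.
import numpy as np
from scipy.signal import butter, filtfilt

def voicing_ratio(frame, fs, pitch_range=(90.0, 350.0)):
    """Return R = R1/R0 for one sub-band frame (~128 ms of samples)."""
    # Demodulation: half-wave rectification followed by band-pass filtering
    # in the pitch domain [90, 350] Hz.
    rectified = np.maximum(frame, 0.0)
    b, a = butter(4, [pitch_range[0] / (fs / 2), pitch_range[1] / (fs / 2)], btype="bandpass")
    envelope = filtfilt(b, a, rectified)

    # Autocorrelation of the demodulated signal.
    acf = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
    r0 = acf[0] + 1e-12                   # cell energy
    lo = int(fs / pitch_range[1])         # smallest plausible pitch-period lag
    hi = int(fs / pitch_range[0])         # largest plausible pitch-period lag
    r1 = acf[lo:hi].max()                 # local maximum in the pitch-lag segment
    return r1 / r0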
2.2
Allen Interval Graphs
An algebra of temporal intervals has been defined in [13, 14], in which 13 atomic relations between two time intervals are described. From our observations of the voicing levels by SB (cf. Fig. 1-(b)), we propose to apply J.F. Allen's time representation to each SB and each voicing activity region (which can be 200 ms long). Allen's time relationships are presented in Figure 2: X is the sliding interval which gives, progressively with Y, the 13 Allen interval relations. We define the algebraic distance d between the two nearest interval relations as 1 and increment it as we move away from that relation; this gives the numeric value used in this paper for each relation. The "b" symbol is coded as "1", "m" as "2", ..., for the 13 relations of Figure 2. Moreover, the "no-relation" is coded as "0"; it can occur between two SBs if one or both have no frame with enough voicing level. In order to obtain and characterize exactly these six SB intervals, we binarize the voicing matrix sMR (the superscript s means that a voicing matrix is built for each speaker). First of all, we must find the best threshold for each sub-band. Indeed, after some experiments we noticed that it is better to set one threshold per SB. We therefore look for the threshold that gives a fixed proportion (30%, 50% or 70%) of 1s in each sub-band when binarizing the voicing matrix sMR. We first search for the threshold sTbi in Υ = [mean(sMRbi) − τ, mean(sMRbi) + τ], by steps of 0.01 with τ = 0.4, such that:

Card( sRbi(t) > sTbi ) / Card( sw ) = P ± ε,    (1)
Relation        Symbol   Inverse symbol
X before Y      b        a
X meets Y       m        mi
X overlaps Y    o        oi
X starts Y      s        si
X during Y      d        di
X finishes Y    f        fi
X equals Y      eq       eq
Fig. 2. The interval construction structure with the 13 symbols. ( no=“no-relation”).
where t is the time index for speaker s, T is the threshold used, bi represents a SB with i ∈ {1..6}, and P defines the target proportion of 1s in the binarized interval. ε is a tolerance (±1%), and sw is the total number of frames for speaker s. If no threshold value satisfies these conditions, we set sTbi = mean(sMRbi). After this iterative threshold search, we obtain for each speaker a threshold vector sT = [sT1 sT2 sT3 sT4 sT5 sT6], where Ti is the threshold of SB i. By applying sT to sMR we obtain the binarized matrix, a short sample of which is shown in Fig. 1-c. We then extract the Allen graph per speaker for all frames having at least 4 binarized voicing intervals simultaneously equal to 1. The Allen graph corresponding to Figure 1-c is represented by the atomic relation matrix in Figure 3 (the non-redundant, useful relations are shown in bold). For each speaker, we obtain full graphs which are used to discriminate one speaker from another with a Multi-Layer Perceptron (MLP); a brief explanation of this method is given below. All Allen graphs are symmetric about the main diagonal (see Fig. 3): each of the seven relation types has an inverse relation in the same graph. This follows from the logic between two intervals: if interval A is BEFORE B, we can at the same time assert that B is AFTER A. Thus the useful information is contained in the first 15 relations, ordered from left to right, from sub-band I1 to sub-band I5. For our example, one input vector at time t of our Multi-Layer Perceptron classifier is QTF(t) = [di di di oi oi d d d d s oi d oi f d]. Figure 4 shows the histogram of all QTF vectors for "Yves Decaens".
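As an illustration of the threshold search of Eq. (1), a minimal sketch follows (our own code and variable names, not the authors'); it scans candidate thresholds around the sub-band mean and falls back to the mean when no candidate reaches the target proportion P.

# Sketch of the per-sub-band threshold search of Eq. (1) and the binarization.
import numpy as np

def subband_threshold(voicing, P=0.7, eps=0.01, tau=0.4, step=0.01):
    mean = voicing.mean()
    for T in np.arange(mean - tau, mean + tau, step):
        ratio = np.mean(voicing > T)        # Card(R_bi(t) > T_bi) / Card(w)
        if abs(ratio - P) <= eps:
            return T
    return mean                              # no threshold satisfies Eq. (1)

def binarize(MR, P=0.7):
    """MR: (6, n_frames) voicing matrix -> binary matrix, one threshold per sub-band."""
    T = np.array([subband_threshold(MR[b], P) for b in range(MR.shape[0])])
    return (MR > T[:, None]).astype(int), T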
        I1   I2   I3   I4   I5   I6
  I1  ( eq   di   di   di   oi   oi )
  I2  ( d    eq   d    d    d    d  )
  I3  ( d    di   eq   s    oi   d  )
  I4  ( d    di   si   eq   oi   f  )
  I5  ( o    di   o    o    eq   d  )
  I6  ( o    di   di   fi   di   eq )
Fig. 3. The Allen graph corresponding to the second part of Figure 1-c. The upper-right triangle of this matrix contains the 15 relation values that are given to an MLP for speaker identity modelling; they are ordered from 1 to 15 as 1 = (I1, I2), 2 = (I1, I3), ..., 15 = (I5, I6). The lower-left triangle of the matrix need not be used, owing to the symmetry between the two triangles.
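The mapping from a pair of voicing intervals to an integer relation code can be sketched as follows. Note that the paper only fixes the codes b → 1, m → 2, ... and no-relation → 0; the exact enumeration order of the remaining relations below is our own assumption.

# Illustrative coding of the 13 Allen relations between two intervals.
def allen_code(x, y):
    """x = (xs, xe), y = (ys, ye): [start, end) voicing intervals; None if no activity."""
    if x is None or y is None:
        return 0                                    # "no-relation"
    (xs, xe), (ys, ye) = x, y
    relations = [
        ("b",  xe < ys), ("m",  xe == ys), ("o",  xs < ys < xe < ye),
        ("s",  xs == ys and xe < ye), ("d",  xs > ys and xe < ye),
        ("f",  xs > ys and xe == ye), ("eq", xs == ys and xe == ye),
        ("a",  xs > ye), ("mi", xs == ye), ("oi", ys < xs < ye < xe),
        ("si", xs == ys and xe > ye), ("di", xs < ys and xe > ye),
        ("fi", xs < ys and xe == ye),
    ]
    for code, (_, holds) in enumerate(relations, start=1):
        if holds:
            return code                             # integer code of the atomic relation
    return 0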
The “no-relation” occurs when some binarized SBs contain only zeros. We see that the “no-relation” represents around 10% of all relations; see also Table 1 for an estimate of the discriminative power of this relation. In Figure 5 we present the log ratio of the normalized QTF distributions of two speakers (Yves Decaens and Patricia Martin), who have both spoken for about 50 minutes. It clearly shows strong differences for certain relations and certain SB couples. We demonstrate in this paper that this information is discriminative for a speaker identification task by training a simple MLP. For this purpose, we train an MLP on these 15 integers using the Torch toolbox [15], a machine learning library written in simple C++ and distributed under the BSD license. Torch is currently developed by the IDIAP team and has been designed to be time-efficient and modular, as it is a research-oriented library. Our MLP is based on the least mean square criterion (other machine learning algorithms, such as Support Vector Machines, could have been used).
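The original experiments use the Torch C++ library with a least-mean-square criterion; purely as an illustration of the classifier's input/output shape (15 integer QTF inputs, 50 speaker classes), an equivalent MLP could be sketched with scikit-learn as below. This is a hypothetical substitute, not the authors' setup, and the hidden size and iteration count only mirror the "nhu"/"iter" knobs of Table 1.

# Hypothetical MLP sketch on the 15-integer QTF inputs (not the Torch/C++ original).
from sklearn.neural_network import MLPClassifier

def train_qtf_mlp(X_train, y_train, nhu=600, max_iter=1000):
    """X_train: (n_frames, 15) relation codes in [0, 13]; y_train: speaker labels."""
    mlp = MLPClassifier(hidden_layer_sizes=(nhu,), max_iter=max_iter)
    return mlp.fit(X_train, y_train)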
3
Database
Our experiments are conducted on a corpus issued from the Phase 1 of ESTER 2006 evaluation campaign (Evaluation Campaign for the Rich Transcription of French Broadcast News) [16]. ESTER implements three tasks, named transcription (T), segmentation (S) and information extraction (E). We have focused our work on SVL Speaker tracking and SRL Speaker diarization. This acoustic corpus Phase 1 is issued from two different sources, named France Inter (Inter) and Radio France International (RFI). It contains about 40 hours of manually transcribed radio broadcast news. This corpus is divided into three separated parts for training, developing and testing activities respectively. The training part (train) contains 30 hours and 40 minutes and the development part (dev) 4 hours and 40 minutes. The test part (test) contains 4 hours and 40 minutes. The unseen source in the test data is meant to evaluate the impact of the knowledge of the document source on performances.
4
Experimental Results and Interpretation
We give the class error rate results in the left columns of Table 1, and their falsification in the right columns of the same table. We see that the best parametrisation is obtained for 70% of 'active' intervals in each SB, yielding a class error rate of 2.5%. Comparison with ESTER Phase 1 results [17] is promising: we obtain the same order of speaker identification performance for this task on continuous broadcast news. Table 1. The "Real data" columns present the class error results for the 50-speaker identification with different P values (MLP methods, P = 0.3, 0.5 or 0.7). The best score is around 2.5%. The "Falsified data" columns give, for comparison, a model falsification with only two relations: no-relation versus all other relations (see text).
                         Real data (class error)              Falsified data (class error)
P     Set Nb   nhu       iter    Train%   Dev%    Test%       iter    Train%   Dev%    Test%
0.3   27338    300       1206     8.57    12.37   12.21       132     85.21    85.01   84.95
               600       1678     9.97    16.18   16.27       123     85.15    84.44   85.29
               600       1472     8.20    13.62   13.30       110     85.14    84.21   84.50
               1000      1242    26.16    32.93   33.68       138     84.99    85.74   86.12
0.5   27306    300       1431     8.58    13.32   13.11        95     85.22    84.60   84.96
               600       1940     6.62     9.05    9.11       140     85.23    85.33   84.80
               600        675    34.95    43.22   42.58       128     85.24    84.86   84.57
               1000      1560    10.20    15.17   15.11       130     85.07    85.24   85.70
0.7   18166    600        874     1.56     2.68    3.01       159     86.03    86.74   86.03
               300        819     1.49     2.7     2.92       141     85.75    86.56   85.90
               600        873     1.44     2.55    2.46       149     86.11    86.41   86.25
               1000       923     1.46     2.57    2.48       150     86.11    86.92   86.01
We ran a falsification experiment to verify whether each relation encodes useful information, or whether only the difference between the “no-relation” and all other relations is relevant. Thus, in each set (train, dev and test), we used only a 1-bit input parameter for our MLP, replacing all atomic relations different from the “no-relation” by 1 and keeping all the no-relations (originally coded by 0). The right-hand part of Table 1 gives these falsification results: we observe that the worst score is approximately 85%, indicating that most of the coding information is carried by the nature of the relations. But we also measure that this is better than a random system (around 93%, cf. the formula in footnote 2). This demonstrates that the no-relation by itself contains some speaker identity information.
2 Let Pk be the frequency of the class Ck; the error rate of a random classifier is

ER_rand = 1 − Σ_{k=1}^{c} (P_k)^2 = 1 − Σ_{k=1}^{c} ( card(C_k) / card(C) )^2,

where c is the number of classes and card(C_k) is the number of samples in the class C_k.
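For reference, the footnote's random-classifier error rate can be checked with a few lines of NumPy (our own sketch); with 50 equally frequent classes it gives 1 − 1/50 = 0.98, while the unbalanced ESTER speaker durations reported in the text bring it down to about 0.93.

# Sketch: random-classifier error rate from per-class sample counts.
import numpy as np

def random_error_rate(class_sizes):
    p = np.asarray(class_sizes, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)     # = 1 - sum_k P_k^2

# e.g. random_error_rate([1] * 50) == 0.98 for 50 equally frequent speakers.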
[Fig. 4 image: bar histogram of total atomic constraints (0-350) for speaker Yves Decaens, plotted over the atomic constraint relations no, b, m, o, s, d, f, eq, a, mi, oi, si, di, fi.]
Fig. 4. Histogram of the QTF MLP inputs for the male speaker “Yves DECAENS”. The atomic relations are on the abscissa from left to right (no: no-relation, b: before, m: meets, o: overlaps, s: starts, d: during, f: finishes, eq: equals, and the respective symmetric relations a, mi, oi, si, di, fi). Each of the 15 MLP inputs is shown, from left (blue, SB couple (I1,I2)) to right (dark red, SB couple (I5,I6)). We note that the QTF distributions vary more across relations than across SB couples. The first input, in dark blue on the left of each histogram (i.e. the couple (I1,I2)), has a majority of “no-relation” and “during”. The 15th input (the node for couple (I5,I6)), on the right of each histogram, has a majority of the “during” relation, i.e. the voicing interval in sub-band 5 is most of the time included in the voicing interval of SB 6. The “meets” relations are not present for this speaker. The “equals” relation seems to be an informative relation, showing great contrast across SB couples.
Fig. 5. This figure shows the base-10 logarithm of the ratio between the normalized QTF distributions. The orders of the atomic relations and SB couples are the same as in the previous figure. As can be seen, there are often large differences for at least half of the relations or SB couples. The MLP successfully learns each of these singular distributions and thus discriminates speaker identity well. Note that this figure represents only part of the QTF information: there may exist additional intra-frame discriminative information among the 15 parameters calculated on each frame, which is erased in this global histogram summing all training frames.
5
Discussion and Conclusion
A better optimisation could be obtained by combining different parameters for each sub-band. Moreover, this preliminary iterative threshold search should be replaced by a more global optimisation scheme. The total number of MLP weights is around 39,000 (= 600*(15+50)), which is smaller than the usual parameter size. Nevertheless, our approach generates the same order of error as other experiments conducted on ESTER Phase 1 using classical MG HMM models on
classical spectral parameters. Another major difference and originality of our qualitative approach is its parsimonious and simple integer representation of each speaker in a very small integer subspace (50 speaker identities coded in [0:13]^15). A systematic falsification of each relation would give, for each relation, an idea of how much information it carries for encoding speaker identity. This should show whether speakers are discriminated by particular relations, or whether only the joint set of relations encodes the speaker identity. Experiments are currently testing our qualitative model on ESTER Phase 2, which is a bigger database, with more speakers and with published SRL and SVL scores for MG HMM models [17].
Acknowledgements We thank Odile Papini for her information on logical atomic time representation, and ADER for the CIFRE convention between DCN, LSIS UMR and USTV.
References [1] Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-Garcia, J., Petrovska, D., Reynolds, D.A.: A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing 4, 430–451 (2004) [2] Hayakawa, S., Takeda, K., Itakura, F.: Speaker Identification Using Harmonic Structure of LP-residual Spectrum. In: Big¨ un, J., Borgefors, G., Chollet, G. (eds.) AVBPA 1997. LNCS, vol. 1206, pp. 253–260. Springer, Heidelberg (1997) [3] Greenberg, S., Arai, T., Grant, W.: The role of temporal dynamics in understanding spoken language. Dynamics of Speech Production and Perception Nato Advanced Studies Series, Life and Behavioural Sciences 374, 171–190 (2006) [4] Fletcher, H.: The nature of speech and its interpretation. J. Franklin Inst. 193(6), 729–747 (1922) [5] Allen, J.B.: How do humans process and recognise speech. IEEE Trans. on Speech and Signal Processing 2(4), 567–576 (1994) [6] Glotin, H.: Elaboration and comparatives studies of robust adaptive multistream speech recognition using voicing and localisation cues. Inst. Nat. Polytech Grenoble & EPF Lausanne IDIAP (2001) [7] Morris, A., Hagen, A., Glotin, H., Bourlard, H.: Multi-stream adaptive evidence combination for noise robust ASR. int. journ. Speech Communication, special issue on noise robust ASR 17(34), 1–22 (2001) [8] Glotin, H., Vergyri, D., Neti, C., Potamianos, G., Luettin, G.: Weighting schemes for audio-visual fusion in speech recognition. In: IEEE int. conf. Acoustics Speech & Signal Process (ICASSP) Salt Lake City-USA (September 2001) [9] Glotin, H.: When Allen J.B. meets Allen J.F.: Quantal Time-Frequency Dynamics for Robust Speech Features. Research Report LSIS 2006.001 Lab Systems and Information Sciences UMR CNRS (2006) [10] Glotin, H.: Dominant speaker detection based on harmonicity for adaptive weighting in audio-visual cocktail party ASR. In: Adaptation methods in speech recognition ISCA Workshop September, Nice (2001)
[11] Berthommier, F., Glotin, H.: A new SNR-feature mapping for robust multistream speech recognition. In: Proc. Int. Congress on Phonetic Sciences (ICPhS) Berkeley University Of California, Ed., San Francisco 1 of XIV August, pp. 711–715 (1999) [12] Yumoto, E., Gould, W.J., Bear, T.: Harmonic to noise ratio as an index of the degree of hoarseness. The Acoustic Society of America 1971, 1544–1550 (1982) [13] Allen, J.F.: An Interval-Based Representation of Temporal Knowledge. In: Proceedings of 7th IJCAI August, pp. 221–226 (1981) [14] Allen, J.F.: Maintaining Knowledge About Temporal Intervals. Communications of the ACM 26(11), 832–843 (1983) [15] Collobert, R., Bengio, S., Marihoz, J.: Torch: a modular machine learning software library. Laboratoire IDIAP IDIAP-RR 02-46 (2002) [16] Gravier, G., Bonastre, J.F., Galliano, S., Geoffrois, E., Mc Tait, K., Choukri, K.: The ESTER evaluation campaign of Rich Transcription of French Broadcast News. In: Language Evaluation and Resources Conference (April 2004) [17] Galliano, S., Geoffrois, E., Mostefa, D., Choukri, K., Bonastre, J.-F., Gravier, G.: The Ester Phase 2: Evaluation Campaign for the Rich Transcription of French Broadcast News. In: European Conf. on Speech Communication and Technology pp. 1149–1152 (2005)
Palmprint Recognition Based on Directional Features and Graph Matching Yufei Han, Tieniu Tan, and Zhenan Sun Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences P.O. Box 2728, Beijing, P.R. China, 100080 {yfhan,tnt,znsun}@nlpr.ia.ac.cn
Abstract. Palmprint recognition, as a reliable personal identity check method, has been receiving increasing attention during recent years. According to previous work, local texture analysis supplies the most promising framework for palmprint image representation. In this paper, we propose a novel palmprint recognition method by combining statistical texture descriptions of local image regions and their spatial relations. In our method, for each image block, a spatial enhanced histogram of gradient directions is used to represent discriminative texture features. Furthermore, we measure similarity between two palmprint images using a simple graph matching scheme, making use of structural information. Experimental results on two large palmprint databases demonstrate the effectiveness of the proposed approach.
1 Introduction Biometrics identifies different people by their physiological and behavioral differences, such as face, iris, retina, gait, etc. [1]. As an alternative personal identity authentication method, it has attracted increasing attention during recent years. In the field of biometrics, palmprint is a novel but promising member. Most discriminating patterns of a palmprint can be captured by low-resolution capture devices, such as a low-cost CCD camera [3]. The large region of the palm supplies stable line patterns which are difficult to fake. A key issue in palmprint analysis is finding a proper descriptor to represent its line patterns [5]. In previous work, the local texture based approach has proved to be the most efficient [1,2,3,4,5,6]. Since the line patterns of a palmprint are spread over different image areas, both the description of local patterns and their spatial relations are important for describing a palmprint accurately. Therefore, component-based image representation supplies a reasonable framework, following which we can design efficient palmprint recognition methods that adopt local image features. In this paper, we introduce a novel approach for palmprint recognition which considers both the texture information of local image regions and the spatial relationships between these regions. In our method, a straightforward extraction of image gradient vectors is performed over the whole palmprint image plane. The derived vector field is then divided into blocks. Statistical texture features of each block are encoded by a local
direction histogram (LDH) proposed in this paper, which describes distribution properties of the directional vectors. All localized blocks form a global graph. The similarity of two palmprint images is measured by a simple graph matching method, previously utilized in [9]. The remainder of this paper is organized as follows. Details about LDH scheme and the graph matching method are described in section 2 and 3 respectively. In section 4, experimental results are reported. Section 5 concludes the paper.
2 Local Direction Histogram Based Palmprint Description Image gradient has been used as an efficient indicator of local image texture [8]. As reported in [7], it is closely related to the reflectance and surface normal of the imaged objects, which determine the inherent structure of the corresponding image patches. Benefiting from this, histograms of gradient directions have been widely used as powerful image descriptors [10][11], capturing the characteristic patterns of local image regions. In our work, we utilize histograms of oriented gradients to represent the texture information of palmprint images. We calculate gradient vectors with the first-order Gaussian derivative operator ∇G, in order to avoid the noise sensitivity caused by directly differentiating the original intensities. It should be noted that the size of the ∇G operator is chosen according to the needs of the application; a proper choice makes a good trade-off between noise tolerance and locality of description. For a local image region W, the gradient magnitude M and direction angle θ are expressed respectively as follows:
M = sqrt( (∇Gx ∗ W)^2 + (∇Gy ∗ W)^2 ),    θ = tan^-1( (∇Gy ∗ W) / (∇Gx ∗ W) )    (1)
The whole procedure of palmprint feature extraction is illustrated in Fig. 1. Before feature extraction and matching, input palmprint images are first normalized to regulate translation, rotation and scale variation among different palms, following the method employed in [3]. In our work, assume the sizes of the normalized palmprint image and of the ∇G operator are N by N and m by m, respectively. After calculating the gradient vector at each sample position, the derived vector-valued image has size (N−m+1) by (N−m+1). To focus on local texture representation, we divide the whole vector field into r*r square blocks, each with side length (N−m+1)/r. Each block is again divided into 4 sub-regions of the same size, and a 6-bin direction histogram covering the 360-degree range of gradient directions is constructed in each sub-region. As in [10], samples are added to the histogram weighted by their corresponding gradient magnitudes. This histogram contains information about the distribution of the direction angles, which represents details of the image structure in a statistical way. In addition, spatial information is also important for a more efficient representation, so we combine the histograms of these four neighboring sub-regions to yield a 24-bin spatially enhanced histogram and normalize it to unit length, namely the local direction histogram, as shown in Fig. 2. In this way, local statistical descriptions and their spatial arrangement are concatenated into a more discriminative
texture feature of the block region. An alternative way to construct the histograms is to add gradient vectors into the bins with equal weights, rather than weighted by their magnitudes. The gradient magnitude is sensitive to changes in the illumination settings, such as changes in the direction of light sources, whereas the direction angle is more stable under those variations; this scheme therefore improves the robustness of the texture representation. Both types of histogram-based features are evaluated in the experiments. Finally, we take each block as a node of a graph map, associated with its local direction histogram, as also illustrated in Fig. 1. As a result, a palmprint image is represented by such a map in two different respects: the bins of the local direction histograms represent the statistical characteristics of texture patterns at the region level, while the topological relations among nodes describe the spatial layout of the blocks and thus produce a more global description of the whole image.
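A minimal sketch of the LDH computation follows (our own code, not the authors'); the Gaussian scale, the block partitioning helper and the normalization choice are assumptions, and the weighted flag switches between the magnitude-weighted and equal-weight variants discussed above.

# Sketch: Gaussian-derivative gradients and a 24-bin local direction histogram per block.
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_field(image, sigma=1.5):
    gx = gaussian_filter(image.astype(float), sigma, order=(0, 1))  # d/dx of Gaussian
    gy = gaussian_filter(image.astype(float), sigma, order=(1, 0))  # d/dy of Gaussian
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi)                   # direction in [0, 2*pi)
    return magnitude, angle

def local_direction_histogram(mag_block, ang_block, bins=6, weighted=False):
    """24-bin LDH of one block: 6-bin histograms of its 4 sub-regions, concatenated."""
    h, w = mag_block.shape
    ldh = []
    for sub_m, sub_a in (
        (mag_block[:h // 2, :w // 2], ang_block[:h // 2, :w // 2]),
        (mag_block[:h // 2, w // 2:], ang_block[:h // 2, w // 2:]),
        (mag_block[h // 2:, :w // 2], ang_block[h // 2:, :w // 2]),
        (mag_block[h // 2:, w // 2:], ang_block[h // 2:, w // 2:]),
    ):
        weights = sub_m.ravel() if weighted else None                # weighted vs non-weighted LDH
        hist, _ = np.histogram(sub_a.ravel(), bins=bins, range=(0, 2 * np.pi), weights=weights)
        ldh.extend(hist)
    ldh = np.asarray(ldh, dtype=float)
    return ldh / (ldh.sum() + 1e-12)                                 # normalize the histogram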
Fig. 1. Diagram of palmprint feature extraction
Fig. 2. Generation of local direction histogram
3 Graph Matching Approach Each node of the graph representation has two attributes: the local direction histogram feature associated with it, and its spatial relation to the other nodes. It is natural to measure the similarity of two palmprint images by comparing the nodes of the corresponding graphs. In our work, we adopt a simple graph matching algorithm proposed by Sun et al. [9]. Each node of one palmprint image is expected to be most similar to the corresponding node of another palmprint image if the two images are captured from the same palm. As in [9], we first define the conditions that matching nodes should satisfy. Assuming the graph representations of two palmprint images are {Ai} and {Bj} (i, j = 1, 2...S) respectively, two matching nodes should have the same spatial position in each graph. Furthermore, the texture patterns of a matching node pair Ai and Bi should be the most similar, under a certain metric, among all pairs composed of Ai and the individual Bj (j = 1, 2...S). We therefore count the number of matching pairs to evaluate the resemblance between two images: the higher it is, the more likely the two images come from the same palm. Through this procedure, we make full use of both the texture and the structural properties of each node to achieve accurate classification. In our paper, we use the Chi-square distance between the local direction histograms {LDHAi} and {LDHBj} (i, j = 1, 2...S) to measure the texture similarity of two nodes:

χ²(LDHAi, LDHBj) = Σ_{k=1}^{24} (LDHAi^k − LDHBj^k)^2 / (LDHAi^k + LDHBj^k)    (2)
Based on this metric, the Chi-square distance between the histogram features of two matching nodes should also be lower than a prefixed threshold [9]. The following pseudocode illustrates how to compute the number of matching pairs N step by step:

Begin
  N = 0
  for each node Ai do
  {
    for each node Bj do
    {
      compute χ²(LDHAi, LDHBj);   (j = 1, 2...S)
      if χ²(LDHAi, LDHBj) < χ²(LDHAi, LDHBi)  (j ≠ i)
        break;
    }
    if χ²(LDHAi, LDHBi) is the minimal among all χ²(LDHAi, LDHBj)
    {
      if χ²(LDHAi, LDHBi) < Threshold
        N = N + 1;
    }
  }
End
To obtain a normalized matching score ranging between 0 and 1, we can divide N by S, the total number of nodes in each graph map. In the following experiments, we directly employ N as the matching score for convenience.
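The matching procedure above can be summarized in a short sketch (our own code; A and B are the lists of per-block LDH vectors of two images in the same spatial order, and threshold is the prefixed χ² threshold of [9]).

# Sketch of the matching-pair count N used as the matching score.
import numpy as np

def chi_square(p, q):
    return np.sum((p - q) ** 2 / (p + q + 1e-12))

def matching_score(A, B, threshold):
    n = 0
    for i, a in enumerate(A):
        d_ii = chi_square(a, B[i])
        # B_i must be the most similar block to A_i among all B_j ...
        if all(d_ii <= chi_square(a, B[j]) for j in range(len(B)) if j != i):
            # ... and the distance must also fall below the prefixed threshold.
            if d_ii < threshold:
                n += 1
    return n          # optionally divide by len(A) for a score in [0, 1]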
4 Verification Experiments In this section, we test the performance of the proposed approach on the PolyU Palmprint Database [12] and the CASIA Palmprint Database [13]. In the first dataset, images belonging to the same palm contain deformations of the skin surface, such as contraction and stretching (see Fig. 7). Images in the second database are mostly captured with fewer such variations between intra-class samples. 4.1 Verification Experiment on the PolyU Palmprint Database
The PolyU database [12] contains 7,752 palmprint images from 386 palms. Each palm has two sessions of images, with an average time interval of two months between sessions. The lighting conditions and the focus of the imaging device were changed between the two capture occasions [3], which challenges the robustness of recognition algorithms. After preprocessing, regions of interest of size 128 × 128 are obtained. The ∇G operator has a size of 9 by 9. After gradient calculation, the whole vector-valued image is divided into 6*6 = 36 blocks. All images are used to estimate the intra-class distribution. We select five images randomly from each session to form inter-class samples; if a session contains fewer than five images, we make up ten samples in total from the class to which the session belongs, including all images contained in the session. Therefore, 74,068 intra-class matchings and 7,430,500 inter-class matchings are involved in the experiment. In this experiment, we evaluate two LDH-based methods. The first constructs histograms with samples weighted by their corresponding gradient magnitudes, while the second gives each sample the same weight. Three other state-of-the-art algorithms, namely fusion code [4], competitive code [5] and ordinal code [6], are implemented for further comparison. Experimental results are given in Table 1 and Fig. 3.
Table 1. Comparisons of performances on the PolyU database
Algorithm              EER [14]   d' [14]
Fusion code [4]        0.21%      5.40
Competitive code [5]   0.04%      5.84
Ordinal code [6]       0.05%      6.90
Weighted LDH           0.10%      7.13
Non-weighted LDH       0.08%      6.12
Fig. 3. ROC curves on the PolyU database
We name the first method "Weighted LDH" and the second "Non-weighted LDH". As shown in Fig. 3, in terms of ROC curves both LDH-based methods achieve performance comparable to ordinal code [6] and competitive code [5]. Notably, they are more accurate than fusion code [4]. Of the two proposed approaches, non-weighted LDH performs noticeably better, being more robust to intra-class appearance variations. 4.2 Verification Experiments on the CASIA Palmprint Database
The CASIA Palmprint database [13] contains 4,512 24-bit color palmprint images from 564 palms. Each palm has 8 images. During image capture, subjects are required to lay their palms on a uniformly colored background (see Fig. 4(a)), and palmprint images are then captured by a common CMOS camera above the palm. There are no pegs to restrict the posture and position of the palm. The original size of each image is 640 × 480. After preprocessing, we crop a square region of size 176 × 176 as the region of interest (ROI), shown in Fig. 4(b). We again adopt a 9 × 9 ∇G operator and divide the whole vector-valued image into 6*6 = 36 blocks. All possible intra-class pairs are used to simulate the genuine distribution, and one image is selected randomly from each class to estimate the impostor distribution. Thus, in total 15,792 intra-class comparisons and 158,766 inter-class comparisons are performed. Table 2 and Fig. 5 show the performances of the two proposed LDH methods and the other three state-of-the-art algorithms. As we can see, the non-weighted LDH method achieves the highest accuracy, followed by weighted LDH, ordinal code [6], competitive code [5] and fusion code [4]. Compared with the results in Section 4.1, both LDH-based approaches perform better. This is because the deformation of the skin surface (see Fig. 7) in intra-class samples of the PolyU database [12] changes the reflectance and geometrical structure of the skin, which reduces the similarity of the local distributions of image gradient directions.
Fig. 4. (a) Palm images in the CASIA database. (b) Cropped ROI.
Table 2. Comparisons of performances on the CASIA database
Algorithm              EER [14]   d' [14]
Fusion code [4]        0.57%      3.73
Competitive code [5]   0.19%      3.82
Ordinal code [6]       0.08%      5.65
Weighted LDH           0.05%      9.30
Non-weighted LDH       0.04%      10.51
Furthermore, Fig. 6 shows the intra-class and inter-class matching score distributions using non-weighted LDH on the PolyU [12] and the CASIA [13] databases. As shown in the figure, for most intra-class samples the number of matching block pairs is higher than ten, about 1/3 of the total number of blocks. Therefore, our approach needs only a fraction of the whole image area to deliver a valid classification, and it can be used to handle palmprint images containing occluded regions or impaired palm skin. In contrast, the state-of-the-art algorithms [4][5][6] require most of the image region to be matched to achieve successful recognition.
Fig. 5. ROC curves on the CASIA database
Fig. 6. Distribution of matching scores on the PolyU (a) and the CASIA (b) databases
Fig. 7. Deformation of skin surface in the PolyU database
5 Conclusions In this paper, we have proposed a novel palmprint recognition method by utilizing LDH based local texture descriptor and graph matching. It involves three main parts, namely, gradient vector calculation, local direction histogram generation and graph matching. By organizing local texture features to form a graph based representation, we can differentiate two palmprint images from both fine details and global structural information through a simple graph matching procedure. Our extensive experimental results have demonstrated validity of the approach. In our method, it is an important issue to choose a proper number of histogram bins and size of local image blocks. However, this problem is still not well addressed in our paper and needs further work in the future. In a further step, we will investigate how to find more efficient structural features to improve descriptive power of palmprint representations. Acknowledgments. Experiments in the paper use the PolyU Palmprint Database 2nd collected by the Biometric Research Center at the Hong Kong Polytechnic University. This work is funded by research grants from the National Basic Research Program (Grant No. 2004CB318110), the Natural Science Foundation of China (Grant No. 60335010, 60121302, 60275003, 60332010, 69825105 60605008) and the Chinese Academy of Sciences.
References 1. Kong, W.K., Zhang, D., Li, W.X.: Palmprint feature extraction using 2-D Gabor filters. Pattern recognition 36, 2339–2347 (2003) 2. You, J., Li, W.X., Zhang, D.: Hierarchical palmprint identification via multiple feature extraction. Pattern recognition 35(4), 847–859 (2002) 3. Zhang, D., Kong, W.K., You, J., Wong, M.: Online Palmprint Identification. IEEE Trans on PAMI 25(9), 1041–1050 (2003) 4. Kong, W.K., Zhang, D.: Feature-Level Fusion for Effective Palmprint Auentication. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 520–523. Springer, Heidelberg (2004) 5. Kong, W.K., Zhang, D.: Competitive Coding Scheme for Palmprint Verification. In: Proc.of the 17th ICPR, vol. 14, pp. 520–523 (2004)
6. Sun, Z.N., Tan, T.N., Wang, Y.H., Li, S.Z.: Ordinal Palmprint Representation for Personal Identification. In: Proc. of CVPR 2005, vol. 1, pp. 279–284 (2005) 7. Chen, H.F., Belhumeur, P.N., Jacobs, D.W.: In Search of Illumination Invariants. In: Proc. of CVPR 2000, vol. I, pp. 254–261 (2000) 8. Sun, Z.N., Tan, T.N., Wang, Y.H.: Robust Direction Estimation of Graident Vector Field for Iris Recognition. In: Proc. of ICPR 2004, vol. 2, pp. 783–786 (2004) 9. Sun, Z.N., Tan, T.N., Qiu, X.C.: Graph Matching Iris Image Blocks with Local Binary Pattern. In: Zhang, D., Jain, A.K. (eds.) Advances in Biometrics. LNCS, vol. 3832, pp. 366–372. Springer, Heidelberg (2005) 10. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision(IJCV) 60, 90–110 (2004) 11. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: Proc.of CVPR 2005, vol. 1, pp. 863–886 (2005) 12. PolyU Palmprint Database, http://www.comp.polyu.edu.hk/ biometrics/ 13. CASIA Palmprint Database, http://www.cbsr.ia.ac.cn 14. Daugman, J., Williams, G.: A Proposed Standard for Biometric Decidability. In: Proc. CardTech/SecureTech Conference, Atlanta, GA, pp. 223–234 (1996)
Tongue-Print: A Novel Biometrics Pattern David Zhang1,*, Zhi Liu2, Jing-qi Yan2, and Peng-fei Shi2 1
Biometrics Research Centre, Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong 2 Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai China
[email protected],{liu_zhi,jingqiy}@sjtu.edu.cn
Abstract. The tongue is a unique organ in that it can be stuck out of mouth for inspection, and yet it is otherwise well protected in the mouth and is difficult to forge. The tongue also presents both geometric shape information and physiological texture information which are potentially useful in identity verification applications. Furthermore, the act of physically reaching or thrusting out is a convincing proof for the liveness. Despite these obvious advantages for biometrics, little work has hitherto been done on this topic. In this paper, we introduce this novel biometric and present a verification framework based on the tongue-prints. The preliminary experimental results demonstrate the feasibility of the tongue biometrics. Keywords: Biometrics, tongue-print, verification.
1 Introduction The reliable automatic recognition of identities has long been an attractive goal, with biometrics [1][2] such as fingerprints, palmprints and iris images already being widely used in a number of identity recognition systems. The list of physiological and behavioral characteristics that have so far been developed and implemented in such systems is long, and includes the face, iris, fingerprint, palmprint, hand shape, voice, signature and gait. However, countering forgery has been a common challenge for traditional biometrics: many traditional biometrics are unreliable in that features may be forged, for example by using a fake iris. Tightened security requires noninvasive biometrics that are anti-counterfeiting and can provide liveness verification, so it is necessary to find new biometrics that fulfil these requirements. The tongue may offer a solution to this difficulty, having as it does many properties that make it suitable for use in identity recognition. To begin with, the tongue is unique to each person in its shape (see Fig. 1) and in its surface textures (see Fig. 2). Second, the tongue is the only internal organ that can quite normally and easily be exposed for inspection. This is useful because it is the exposed portion of the tongue that carries a great deal of shape and textural information, which can be acquired in images that we call “tongue-prints”. Third, according to our long-term observation, the shape of an individual tongue is constant, notwithstanding its instinctive squirm, and
its physiological textures are invariant even as the coating of the tongue changes. Fourth, as the human tongue is contained in the mouth, it is isolated and protected from the external environment, unlike the fingers, for example. Finally, the process of tongue inspection is also a reliable proof of life.
Fig. 1. Some samples with different shapes from the frontal and profile views. (a) Different shapes from the frontal view; (b) different shapes from the profile view.
Fig. 2. Some samples with different textures
This promising combination of characteristics was the inspiration for the development of the on-line verification system based on the tongue-prints that is described here. This system extracts both the shape and textural features of the tongue and uses them for recognition. The shape vector represents the shape features of the tongue, specifically its length, bend, thickness, and width, and the curvature of its tip. The texture codes represent the textural features of the central part of the tongue. Fig. 3 gives a block diagram illustrating the framework. The modules are described in the following sections. The remainder of this paper is organized as follows: Section 2 describes the preprocessing of tongue-print images. Section 3 introduces our feature extraction and recognition framework, describing how we extract shape features and analyze the texture of the tongue. Section 4 presents the experimental results. Section 5 offers our conclusion.
Fig. 3. The block diagram of our tongue-print verification procedure
2 Tongue Image Preprocessing Before feature extraction, it is necessary to preprocess the captured tongue images to obtain an outline of the area of the tongue and to eliminate the non-tongue background. This is done using a tongue contour detection method described in our
Fig. 4. The region we determined using the corners of the mouth and the tip of the tongue. (a) Frontal view; (b) profile view.
previous work [7][8]. The corners of the mouth and the tip of the tongue (see Fig. 4) can then be used to determine the region of interest (ROI) in the captured tongue image. These ROIs provide the main visual features of the tongue used in the subsequent procedures.
3 Tongue-Print Recognition In this section we introduce a feature extraction and recognition framework that makes use of both the shape and texture of the tongue. The shape vector represents the geometrical features of the tongue while the texture codes represent the textural features of the central part of the tongue. 3.1 Shape Feature Extraction The shape of the tongue (to be represented as a shape vector) is measured using a set of control points. The control points are P1, P2, …,P11, Ptip and Pm (shown in Fig. 5). These control points demarcate important areas of the ROI (here, the part below the segment LP1, P 2 ). The following describes how five measures, length, bend, thickness, width of the tongue, and the curvature of its tip, are formed as our measurement vectors:
Fig. 5. The tongue feature model for the frontal and profile view images. (a) is the frontal view; and (b) is the profile view.
1) Width: We define four segments L_{P3,P4}, L_{P5,P6}, L_{P7,P8}, L_{P9,P10} that are parallel to the segment L_{P1,P2} in the region of interest mentioned above. These segments follow the rule formalized by Eq. (1):

d(L_{P1,P2}, L_{P3,P4}) = d(L_{P3,P4}, L_{P5,P6}) = d(L_{P5,P6}, L_{P7,P8}) = d(L_{P7,P8}, L_{P9,P10})    (1)

where d(·) is the distance between two parallel segments. We then use the lengths of these five segments to construct the width vector W.
2) Length: The length of the tongue in the profile view is defined by the distance between Ptip and Pm (Ptip denotes the tip of the tongue, and Pm denotes the corner of the mouth, as shown in Fig. 5 (b)) as follows:
Length = Ptip − Pm .
(2)
3) Thickness: Take a line between Ptip and Pm (shown in Fig. 5 (b)) and extend the lines ( LP 3, P 4 , LP 5, P 6 , LP 7, P 8 , LP 9, P10 ) so that they intersect with the segment
LPm Ptip . The points of intersection are labeled as: Pa1 , Pa 2 , Pa 3 , Pa 4 . Crossing these points, we can get a set of orthogonal lines of the segment LPm Ptip . The lengths of these lines within the contour of the profile view are used for the thickness vector T . In addition, the orthogonal lines that cross
Pa1 Pa 4 and Pa 2
respectively intersect the contour of the tongue at Pb1 , Pb 2 and Pb 3 .
Fig. 6. Total Curvature Measures. L1: length of the segment between point; L2: length of the segment between
Q2
Q1
and
and its succeeding point; a1: interior angle at
Q2 ;
Q1
and its preceding
L3: length of the segment between
Q1 ; a2: interior angle at Q2 .
4) Curvature of the tip of the tongue: We measure the curvature of the tip of the tongue by using the Total Curvature Function (TCF) [6]. The Total Curvature Function is an approximate estimation method and it is defined for one segment between the two points Q1 and Q2 , as illustrated in Fig. 6. In this figure, the curvature at
Q1 can be formulated as: C1 = a1/( L1 + L 2)
and the curvature at
Q2 is formulated as: C 2 = a 2 /( L 2 + L3) .
Thus, the total curvature value of the segment L2 between formulated as:
(3)
(4)
Q1 and Q2 is
Tongue-print: A Novel Biometrics Pattern
TC = L 2 ∗ (C1 − C 2) .
1179
(5)
We then use these TC to build the vector Cur by using the curvature values at the control points P3, P 4,… P9, P10 (shown in Fig. 5(a)).
Pb 3 and the segment LPb1Pb 2 is
5) Bend: The distance between the middle point formulated as in Eq. (6)
b = D ( Pb 3 , LPb1Pb 2 )
(6)
where D(i) computes the distance between the Pb 3 and LPb1Pb 2 . Then, we can use b to describe the degree of bend of the tongue. The measurement of b is illustrated in Fig. 5(b). As the components of these vectors are of different sizes and they have a large dynamic range, it is necessary to normalize them into a single, common range. The five measurement vectors are then combined to form the shape vector that represents the tongue shape. 3.2 Texture Feature Extraction The textural features of the tongue are primarily found on the central part of its surface. To extract this information, we set up a sub-image of the segmented tongue image as a region of interest (ROI). This region is selected under the coordinates system Pcorner OPtip with 256*256 pixels, corresponding to the rectangular area enclosed by the white line in Fig. 7 (a). To extract the texture features, we apply a powerful texture analysis tool, a two dimensional Gabor filter. Gabor filters have been widely used to extract local image features [9][10]. A 2-D Gabor filter in the spatial domain has the following general form [9]:
G ( x, y , θ , u , σ ) =
1 2πσ 2
exp{
x2 + y2 }exp{2π i(uxcosθ +uysinθ )} 2σ 2
(7)
where i = −1 ; u is the frequency of the sinusoidal wave; θ controls the orientation of the function; and σ is the standard deviation of the Gaussian envelope. Gabor filters are robust against variations in image brightness and contrast and can be said to model the receptive fields of a simple cell in the primary visual cortex. In order to make the Gabor filter more robust against brightness, it is set to zero DC (direct current) with the application of the following formula [9]: n
G '( x, y,θ , μ , σ ) = G ( x, y,θ , μ , σ ) − where (2n + 1) is the size of the filter. 2
n
∑ ∑ G (i, j,θ , μ , σ )
i =− n j =− n
(2n + 1)
2
(8)
1180
D. Zhang et al.
(a)
Fig. 7. (a) shows the ROI; (b) and (c) are original samples of the textures in the ROI. (d) and (f) are respectively the real parts of features from (b) and(c). (e) and (g) are the imaginary parts of (b) and (c).
An input tongue sub-image I ( x, y ), x, y ∈ Ω ( Ω is the set of image points) is convolved with G ' . Then, the sample point in the filtered image is coded to two bits, (br , bi ) using the following rules:
br = 1 br = 0 bi = 1 bi = 0
if Re[ I
⊗ G '] ≥ 0
if Re[ I
⊗ G '] < 0
if Im[ I
⊗ G '] ≥ 0
if Im[ I
⊗ G '] < 0
(9)
Using this coding method means that only the phase information in the sub-images is stored in the texture feature vector. This texture feature extraction method was introduced by Daugman for use in iris recognition [11]. Fig. 7 (d)(e)(f)(g) show the features generated in this procedure
Tongue-print: A Novel Biometrics Pattern
1181
3.3 Recognition In this step, the Mahalanobis distance is used for the tongue shape matching and the Hamming distance is used for the tongue texture code matching. Using these two kinds of distances gives us two matching scores. Because the shape feature vector and texture codes are non-homogeneous and are suitable for different matchers, in this tongue-print based verification method we exploit the matching score level fusion [3]. In our experience, the tongue shape information is more important than texture information. Thus, we apply the following strategy to get the decision results.
S = w1SS + w2 ST
(10)
where S is the final matching score and the S S , ST are respectively the matching scores in the shape matching module and texture matching module and w1 , w2 are their corresponding weight values (in our study, w1
= 0.6 and w2 = 0.4 ).
4 Experiments and Results 4.1 Database Our database contains 134 subjects. The subjects were recorded in five separate sessions uniformly distributed over a period of five months. Within each session ten image pairs for each subject, a front view and a profile view, were taken using our self-designed tongue-print capture device. In total, each subject provided 50 image pairs. We collected the tongue images from both men and women and across a wide range of ages. The distribution of the subjects is listed in Table 1. We called this tongue image database TB06. Table 1. Composition of the tongue image database
Number of samples Percentage (%)
male 89 66.4
Sex female 45 33.6
20-29 81 60.4
Age 30-39 32 23.9
40-49 21 15.7
4.2 Experimental Results A matching is counted as a correct matching if two tongue images were collected from the same tongue; otherwise it is an incorrect matching. In our study, the Minimum Distance Classifier [12] is applied for its simplicity. The verification result is obtained using the matching score level fusion [3]. The performance of the verification system when using TB06 is represented by Receiver Operating Characteristic (ROC) curves, which are a plot of the genuine acceptance rate against the false acceptance rate for all possible operating points. From Fig. 8, we can see that combining shape and texture features for the tongue verification produces a better
1182
D. Zhang et al.
Fig. 8. ROC curve used to illustrate the verification test results
performance than using them singly. When the FAR is equal to 2.9%, we get the Genuine Accept Rate of 93.3%. These results demonstrate that the tongue biometric is feasible.
5 Conclusions As the only internal organ that can be protruded from the body, the human tongue is well protected and is immune to forgery. The explicit features of the tongue cannot be reverse engineered, meaning that tongue verification protects the privacy of users better than other biometrics. This paper presents a novel tongue-print based verification approach. Using a uniform tongue image database containing sample images collected from 134 people, experiments produced a 93.3% recognition rate. These promising results suggest that tongue-prints of the human tongue qualify as a feasible new member of the biometrics family.
Acknowledgment This work was supported in part by the UGC/CRC fund from the HKSAR Government, the central fund from the Hong Kong Polytechnic University and the NSFC fund under the contract No. 60332010.
References 1. Zhang, D.: Automated Biometrics—Technologies and Systems. Kluwer Academic, Boston (2000) 2. Jain, A., Bolle, R., Pankanti, S.: Biometrics: Personal Identification in Networked Society. Kluwer Academic, Boston (1998)
Tongue-print: A Novel Biometrics Pattern
1183
3. Ross, A., Jain, A., Qian, J.-Z.: Information Fusion in Biometrics. Pattern Recognition Letters 24(13), 2115–2125 (2003) 4. Hughe, G.E.: On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory 14(l), 55–63 (1968) 5. Deng, J.-W., Tsui, H.T: A novel two-layer PCA/MDA scheme for hand posture recognition. In: Proceedings of 16th International Conference on Pattern Recognition, vol. 1, pp. 283–286 (2002) 6. Pikaz, A., Dinstein, I.: Matching of partially occluded planar curves. Pattern Recognition 28(2), 199–209 (1995) 7. Pang, B., Zhang, D., Wang, K.Q.: The bi-elliptical deformable contour and its application to automated tongue segmentation in Chinese medicine. IEEE Transactions on Medical Imaging 24(8), 946–956 (2005) 8. zhi, L., Yan, J.-q., Zhou, T., Tang, Q.-l.: Tongue Shape Detection Based on B-Spline. In: ICMLC2006, vol. 6, pp. 3829–3832 (2006) 9. Kong, W.K., Zhang, D., Li, W.: Palmprint feature extraction using 2-D Gabor filters. Pattern Recognition 36(10), 2339–2347 (2003) 10. Jain, A., Healey, G.: A multiscale representation including opponent color features for texture recognition. IEEE Transactions on Image Processing 7(1), 124–128 (1998) 11. Daugman, J.: High confidence visual recognition of persons by a test of statistical independence. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1148–1161 (1993) 12. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 4–37 (2000)
Embedded Palmprint Recognition System on Mobile Devices Yufei Han, Tieniu Tan, Zhenan Sun, and Ying Hao Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences P.O. Box 2728, Beijing, P.R. China, 100080 {yfhan,tnt,znsun,yhao}@nlpr.ia.ac.cn
Abstract. There are increasing requirements for mobile personal identification, e.g. to protect identity theft in wireless applications. Based on built-in cameras of mobile devices, palmprint images may be captured and analyzed for individual authentication. However, current available palmprint recognition methods are not suitable for real-time implementations due to the limited computational resources of handheld devices, such as PDA or mobile phones. To solve this problem, in this paper, we propose a sum-difference ordinal filter to extract discriminative features of palmprint using only +/- operations on image intensities. It takes less than 200 ms for our algorithm to verify the identity of a palmprint image on a HP iPAQ PDA, about 1/10 of state-of-the-art methods ' complexity, while this approach also achieves high accuracy on the PolyU palmprint database. Thanks to the efficient palmprint feature encoding scheme, we develop a real-time embedded palmprint recognition system, working on the HP PDA.
1 Introduction There are about 1.5 billion mobile devices currently in use. With fast development of wireless network and embedded hardware, people enjoy the convenience of mobile commerce, mobile banking, mobile office, mobile entertainment, etc. However, at the same time, the risk of identity theft is increasing. The mostly used identity authentication method is password, but it can be cracked or forgotten. So biometrics is emerging to enhance the security of digital life. Fingerprint, face and voice recognition have been put into use on embedded mobile devices such as PDA and mobile phones. But they still have many disadvantages. For fingerprint recognition, only a small portion of mobile devices have a fingerprint sensor, and some people do not have clear fingerprints [1]. Face and voice recognition are sometimes not accurate and robust enough for personal recognition [1]. In contrast, palmprint could supply an alternative way for mobile authentication when these methods fail. According to work [2,3], large areas of human palms supply enough and robust representative features for identity authentication. Furthermore, palmprint recognition is testified to achieve high accuracy in real world use based on low-resolution (