Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2688
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Josef Kittler
Mark S. Nixon (Eds.)
Audio- and Video-Based Biometric Person Authentication 4th International Conference, AVBPA 2003 Guildford, UK, June 9-11, 2003 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Josef Kittler University of Surrey Center for Vision, Speech and Signal Proc. Guildford, Surrey GU2 7XH, UK E-mail:
[email protected] Mark S. Nixon University of Southampton Department of Electronics and Computer Science Southampton, SO17 1BJ, UK E-mail:
[email protected] Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at .
CR Subject Classification (1998): I.5, I.4, I.3, K.6.5, K.4.4, C.2.0 ISSN 0302-9743 ISBN 3-540-40302-7 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein Printed on acid-free paper SPIN 10927847 06/3142 543210
Preface
This book collects the research work presented at the 4th International Conference on Audio- and Video-Based Biometric Person Authentication, which took place at the University of Surrey, Guildford, UK, in June 2003. We were pleased to see a surge of interest in AVBPA. We received many more submissions than before, and this reflects not just the good work put in by previous organizers and participants, but also the increasing world-wide interest in biometrics. With grateful thanks to our program committee, we had a fine program indeed.

The papers concentrate on major established biometrics such as face and speech, and we continue to see the emergence of gait as a new research focus, together with other innovative approaches including writer and palmprint identification. The face-recognition papers show advances not only in recognition techniques, but also in application capabilities and covariate analysis (now with the inclusion of time as a recognition factor), and even in synthesis to evaluate wider recognition capability. Fingerprint analysis now includes the study of the effects of compression, and new approaches to compression, together with refined study of holistic vs. minutiae-based approaches and feature set selection, areas of interest to the biometrics community as a whole. The gait presentations focus on new approaches for temporal recognition together with analysis of performance capability and new approaches to improve generalization in performance. The speech papers reflect the wide range of possible applications together with new uses of visual information. Interest in data fusion continues to increase.

But it is not just the more established areas that were of interest at AVBPA 2003. As ever in this innovative technology, there are always new ways to recognize people, as reflected in papers on on-line writer identification and palmprint analysis. Iris recognition is also represented, as are face and person extraction in video. The growing industry in biometrics was reflected in presentations with a specific commercial interest: there are papers on smart cards, wireless devices, architectures, and implementation factors, all of considerable consequence in the deployment of biometric systems. A competition for the best face-authentication (verification) algorithms took place in conjunction with the conference, and the results are reported here. The papers are complemented by invited presentations by Takeo Kanade (Carnegie Mellon University), Jerry Friedman (Stanford University), and Frederic Bimbot (INRIA).

All in all, AVBPA continues to offer a snapshot of research in this area from leading institutions around the world. If these papers and this conference inspire new research in this fascinating area, then AVBPA 2003 can be deemed truly a success.
April 2003
Josef Kittler and Mark S. Nixon
Organization
AVBPA 2003 was organized by
– the Centre for Vision, Speech and Signal Processing, University of Surrey, UK, and
– TC-14 of IAPR (International Association for Pattern Recognition).
Executive Committee

Conference Co-chairs: Josef Kittler and Mark S. Nixon, University of Surrey and University of Southampton, UK
Local Organization: Rachel Gartshore, University of Surrey
Program Committee Samy Bengio (Switzerland) Josef Bigun (Sweden) Frederic Bimbot (France) Mats Blomberg (Sweden) Horst Bunke (Switzerland) Hyeran Byun (South Korea) Rama Chellappa (USA) Gerard Chollet (France) Timothy Cootes (UK) Larry Davis (USA) Farzin Deravi (UK) Sadaoki Furui (Japan) M. Dolores Garcia-Plaza (Spain) Dominique Genoud (Switzerland) Shaogang Gong (UK) Steve Gunn (UK) Bernd Heisele (USA) Anil Jain (USA) Kenneth Jonsson (Sweden) Seong-Whan Lee (South Korea) Stan Li (China) John Mason (UK) Jiri Matas (Czech Republic) Bruce Millar (Australia) Larry O’Gorman (USA) Sharath Pankanti (USA)
P. Jonathon Phillips (USA) Salil Prabhakar (USA) Nalini Ratha (USA) Marek Rejman-Greene (UK) Gael Richard (France) Massimo Tistarelli (Italy) Patrick Verlinde (Belgium) Juan Villanueva (Spain) Harry Wechsler (USA) Pong Yuen (Hong Kong)
Sponsoring Organizations

University of Surrey
International Association for Pattern Recognition (IAPR)
Institution of Electrical Engineers
British Machine Vision Association
Department of Trade and Industry (DTI)
Springer-Verlag GmbH
Table of Contents
Face I Robust Face Recognition in the Presence of Clutter . . . . . . . . . . . . . . . . . . . . . . . . . 1 A.N. Rajagopalan, Rama Chellappa, and Nathan Koterba An Image Preprocessing Algorithm for Illumination Invariant Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Ralph Gross and Vladimir Brajovic Quad Phase Minimum Average Correlation Energy Filters for Reduced Memory Illumination Tolerant Face Authentication . . . . . . . . . . . .19 Marios Savvides and B.V.K. Vijaya Kumar Component-Based Face Recognition with 3D Morphable Models . . . . . . . . . . . 27 Jennifer Huang, Bernd Heisele, and Volker Blanz
Face II A Comparative Study of Automatic Face Verification Algorithms on the BANCA Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 M. Sadeghi, J. Kittler, A. Kostin, and K. Messer Assessment of Time Dependency in Face Recognition: An Initial Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Patrick J. Flynn, Kevin W. Bowyer, and P. Jonathon Phillips Constraint Shape Model Using Edge Constraint and Gabor Wavelet Based Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Baochang Zhang, Wen Gao, Shiguang Shan, and Wei Wang Expression-Invariant 3D Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62 Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel
Speech Automatic Estimation of a Priori Speaker Dependent Thresholds in Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Javier R. Saeta and Javier Hernando A Bayesian Network Approach for Combining Pitch and Reliable Spectral Envelope Features for Robust Speaker Verification . . . 78 Mijail Arcienega and Andrzej Drygajlo
Cluster-Dependent Feature Transformation for Telephone-Based Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .86 Chi-Leung Tsang, Man-Wai Mak, and Sun-Yuan Kung Searching through a Speech Memory for Text-Independent Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Dijana Petrovska-Delacr´etaz, Asmaa El Hannani, and G´erard Chollet
Poster Session I LUT-Based Adaboost for Gender Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Bo Wu, Haizhou Ai, and Chang Huang Independent Component Analysis and Support Vector Machine for Face Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Gianluca Antonini, Vlad Popovici, and Jean-Philippe Thiran Real-Time Emotion Recognition Using Biologically Inspired Models . . . . . . . 119 Keith Anderson and Peter W. McOwan A Dual-Factor Authentication System Featuring Speaker Verification and Token Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Purdy Ho and John Armington Wavelet-Based 2-Parameter Regularized Discriminant Analysis for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 Dao-Qing Dai and P.C. Yuen Face Tracking and Recognition from Stereo Sequence . . . . . . . . . . . . . . . . . . . . . 145 Jian-Gang Wang, Ronda Venkateswarlu, and Eng Thiam Lim Face Recognition System Using Accurate and Rapid Estimation of Facial Position and Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 Takatsugu Hirayama, Yoshio Iwai, and Masahiko Yachida Fingerprint Enhancement Using Oriented Diffusion Filter . . . . . . . . . . . . . . . . . 164 Jiangang Cheng, Jie Tian, Hong Chen, Qun Ren, and Xin Yang Visual Analysis of the Use of Mixture Covariance Matrices in Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .172 Carlos E. Thomaz and Duncan F. Gillies A Face Recognition System Based on Local Feature Analysis . . . . . . . . . . . . . .182 Stefano Arca, Paola Campadelli, and Raffaella Lanzarotti Face Detection Using an SVM Trained in Eigenfaces Space . . . . . . . . . . . . . . . .190 Vlad Popovici and Jean-Philippe Thiran Face Detection and Facial Component Extraction by Wavelet Decomposition and Support Vector Machines . . . . . . . . . . . . . . . . . 199 Dihua Xi and Seong-Whan Lee
U-NORM Likelihood Normalization in PIN-Based Speaker Verification Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 D. Garcia-Romero, J. Gonzalez-Rodriguez, J. Fierrez-Aguilar, and J. Ortega-Garcia Facing Position Variability in Minutiae-Based Fingerprint Verification through Multiple References and Score Normalization Techniques . . . . . . . . . 214 D. Simon-Zorita, J. Ortega-Garcia, M. Sanchez-Asenjo, and J. Gonzalez-Rodriguez Iris-Based Personal Authentication Using a Normalized Directional Energy Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 Chul-Hyun Park, Joon-Jae Lee, Mark J.T. Smith, and Kil-Houm Park An HMM On-line Signature Verification Algorithm . . . . . . . . . . . . . . . . . . . . . . . 233 Daigo Muramatsu and Takashi Matsumoto Automatic Pedestrian Detection and Tracking for Real-Time Video Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 Hee-Deok Yang, Bong-Kee Sin, and Seong-Whan Lee Visual Features Extracting & Selecting for Lipreading . . . . . . . . . . . . . . . . . . . . 251 Hong-xun Yao, Wen Gao, Wei Shan, and Ming-hui Xu An Evaluation of Visual Speech Features for the Tasks of Speech and Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . . . . .260 Simon Lucey Feature Extraction Using a Chaincoded Contour Representation of Fingerprint Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 Venu Govindaraju, Zhixin Shi, and John Schneider Hypotheses-Driven Affine Invariant Localization of Faces in Verification Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 M. Hamouz, J. Kittler, J.K. Kamarainen, and H. K¨ alvi¨ ainen Shape Based People Detection for Visual Surveillance Systems . . . . . . . . . . . . 285 M. Leo, P. Spagnolo, G. Attolico, and A. Distante Real-Time Implementation of Face Recognition Algorithms on DSP Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 Seong-Whan Lee, Sang-Woong Lee, and Ho-Choul Jung Robust Face-Tracking Using Skin Color and Facial Shape . . . . . . . . . . . . . . . . . 302 Hyung-Soo Lee, Daijin Kim, and Sang-Youn Lee
Fingerprint Fusion of Statistical and Structural Fingerprint Classifiers . . . . . . . . . . . . . . . . 310 Gian Luca Marcialis, Fabio Roli, and Alessandra Serrau
Learning Features for Fingerprint Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 318 Xuejun Tan, Bir Bhanu, and Yingqiang Lin Fingerprint Matching with Registration Pattern Inspection . . . . . . . . . . . . . . . 327 Hong Chen, Jie Tian, and Xin Yang Biometric Template Selection: A Case Study in Fingerprints . . . . . . . . . . . . . . 335 Anil Jain, Umut Uludag, and Arun Ross Orientation Scanning to Improve Lossless Compression of Fingerprint Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 Johan Th¨ arn˚a, Kenneth Nilsson, and Josef Bigun
Image, Video Processing, Tracking A Nonparametric Approach to Face Detection Using Ranklets . . . . . . . . . . . . 351 Fabrizio Smeraldi Refining Face Tracking with Integral Projections . . . . . . . . . . . . . . . . . . . . . . . . . . 360 Gin´es Garc´ıa Mateos Glasses Removal from Facial Image Using Recursive PCA Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369 Jeong-Seon Park, You Hwa Oh, Sang Chul Ahn, and Seong-Whan Lee Synthesis of High-Resolution Facial Image Based on Top-Down Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 Bon-Woo Hwang, Jeong-Seon Park, and Seong-Whan Lee A Comparative Performance Analysis of JPEG 2000 vs. WSQ for Fingerprint Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 Miguel A. Figueroa-Villanueva, Nalini K. Ratha, and Ruud M. Bolle
General New Shielding Functions to Enhance Privacy and Prevent Misuse of Biometric Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 Jean-Paul Linnartz and Pim Tuyls The NIST HumanID Evaluation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 Ross J. Micheals, Patrick Grother, and P. Jonathon Phillips Synthetic Eyes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .412 Behrooz Kamgar-Parsi, Behzad Kamgar-Parsi, and Anil K. Jain
Poster Session II Maximum-Likelihood Deformation Analysis of Different-Sized Fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421 Yuliang He, Jie Tian, Qun Ren, and Xin Yang Dental Biometrics: Human Identification Using Dental Radiographs . . . . . . . 429 Anil K. Jain, Hong Chen, and Silviu Minut Effect of Window Size and Shift Period in Mel-Warped Cepstral Feature Extraction on GMM-Based Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 C.C. Leung and Y.S. Moon Discriminative Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 Florent Perronnin and Jean-Luc Dugelay Cross-Channel Histogram Equalisation for Colour Face Recognition . . . . . . . 454 Stephen King, Gui Yun Tian, David Taylor, and Steve Ward Open World Face Recognition with Credibility and Confidence Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462 Fayin Li and Harry Wechsler Enhanced VQ-Based Algorithms for Speech Independent Speaker Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470 Ningping Fan and Justinian Rosca Fingerprint Fusion Based on Minutiae and Ridge for Enrollment . . . . . . . . . . 478 Dongjae Lee, Kyoungtaek Choi, Sanghoon Lee, and Jaihie Kim Face Hallucination and Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486 Xiaogang Wang and Xiaoou Tang Robust Features for Frontal Face Authentication in Difficult Image Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495 Conrad Sanderson and Samy Bengio Facial Recognition in Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505 Dmitry O. Gorodnichy Face Authentication Based on Multiple Profiles Extracted from Range Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515 Yijun Wu, Gang Pan, and Zhaohui Wu Eliminating Variation of Face Images Using Face Symmetry . . . . . . . . . . . . . . .523 Yan Zhang and Jufu Feng Combining SVM Classifiers for Multiclass Problem: Its Application to Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531 Jaepil Ko and Hyeran Byun A Bayesian MCMC On-line Signature Verification . . . . . . . . . . . . . . . . . . . . . . . . 540 Mitsuru Kondo, Daigo Muramatsu, Masahiro Sasaki, and Takashi Matsumoto
Illumination Normalization Using Logarithm Transforms for Face Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 Marios Savvides and B.V.K. Vijaya Kumar Performance Evaluation of Face Recognition Algorithms on the Asian Face Database, KFDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 Bon-Woo Hwang, Hyeran Byun, Myoung-Cheol Roh, and Seong-Whan Lee Automatic Gait Recognition via Fourier Descriptors of Deformable Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566 Stuart D. Mowbray and Mark S. Nixon A Study on Performance Evaluation of Fingerprint Sensors . . . . . . . . . . . . . . . 574 Hyosup Kang, Bongku Lee, Hakil Kim, Daecheol Shin, and Jaesung Kim An Improved Fingerprint Indexing Algorithm Based on the Triplet Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584 Kyoungtaek Choi, Dongjae Lee, Sanghoon Lee, and Jaihie Kim A Supervised Approach in Background Modelling for Visual Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592 P. Spagnolo, M. Leo, G. Attolico, and A. Distante Human Recognition on Combining Kinematic and Stationary Features . . . . 600 Bir Bhanu and Ju Han Architecture for Synchronous Multiparty Authentication Using Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609 Sunil J. Noronha, Chitra Dorai, Nalini K. Ratha, and Ruud M. Bolle Boosting a Haar-Like Feature Set for Face Verification . . . . . . . . . . . . . . . . . . . . 617 Bernhard Fr¨ oba, Sandra Stecher, and Christian K¨ ublbeck The BANCA Database and Evaluation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 625 Enrique Bailly-Bailli´ere, Samy Bengio, Fr´ed´eric Bimbot, Miroslav Hamouz, Josef Kittler, Johnny Mari´ethoz, Jiri Matas, Kieron Messer, Vlad Popovici, Fabienne Por´ee, Belen Ruiz, and Jean-Philippe Thiran A Speaker Pruning Algorithm for Real-Time Speaker Identification . . . . . . . 639 Tomi Kinnunen, Evgeny Karpov, and Pasi Fr¨ anti “Poor Man” Vote with M -ary Classifiers. Application to Iris Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647 V. Vigneron, H. Maaref, and S. Lelandais Handwriting, Signature, Palm Complete Signal Modeling and Score Normalization for Function-Based Dynamic Signature Verification . . . . . . . . . . . . . . . . . . . . . . . 658 J. Ortega-Garcia, J. Fierrez-Aguilar, J. Martin-Rello, and J. Gonzalez-Rodriguez
Personal Verification Using Palmprint and Hand Geometry Biometric . . . . . 668 Ajay Kumar, David C.M. Wong, Helen C. Shen, and Anil K. Jain A Set of Novel Features for Writer Identification . . . . . . . . . . . . . . . . . . . . . . . . . . 679 Caroline Hertel and Horst Bunke Combining Fingerprint and Hand-Geometry Verification Decisions . . . . . . . . 688 Kar-Ann Toh, Wei Xiong, Wei-Yun Yau, and Xudong Jiang Iris Verification Using Correlation Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697 B.V.K. Vijaya Kumar, Chunyan Xie, and Jason Thornton
Gait Gait Analysis for Human Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706 A. Kale, N. Cuntoor, B. Yegnanarayana, A.N. Rajagopalan, and R. Chellappa Performance Analysis of Time-Distance Gait Parameters under Different Speeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .715 Rawesak Tanawongsuwan and Aaron Bobick Novel Temporal Views of Moving Objects for Gait Biometrics . . . . . . . . . . . . . 725 Stuart P. Prismall, Mark S. Nixon, and John N. Carter Gait Shape Estimation for Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734 David Tolliver and Robert T. Collins
Fusion Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features . . . . . . . . . . . . . . . . . 743 Niall Fox and Richard B. Reilly Scalability Analysis of Audio-Visual Person Identity Verification . . . . . . . . . . 752 Jacek Czyz, Samy Bengio, Christine Marcel, and Luc Vandendorpe A Bayesian Approach to Audio-Visual Speaker Identification . . . . . . . . . . . . . . 761 Ara V. Nefian, Lu Hong Liang, Tieyan Fu, and Xiao Xing Liu Multimodal Authentication Using Asynchronous HMMs . . . . . . . . . . . . . . . . . . 770 Samy Bengio Theoretic Evidence k-Nearest Neighbourhood Classifiers in a Bimodal Biometric Verification System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778 Andrew Teoh Beng Jin, Salina Abdul Samad, and Aini Hussain
Poster Session III Combined Face Detection/Recognition System for Smart Rooms . . . . . . . . . . 787 Jia Kui and Liyanage C. De Silva Capabilities of Biometrics for Authentication in Wireless Devices . . . . . . . . . 796 Pauli Tikkanen, Seppo Puolitaival, and Ilkka K¨ ans¨ al¨ a Combining Face and Iris Biometrics for Identity Verification . . . . . . . . . . . . . . 805 Yunhong Wang, Tieniu Tan, and Anil K. Jain Experimental Results on Fusion of Multiple Fingerprint Matchers . . . . . . . . . 814 Gian Luca Marcialis and Fabio Roli Predicting Large Population Data Cumulative Match Characteristic Performance from Small Population Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821 Amos Y. Johnson, Jie Sun, and Aaron F. Bobick A Comparative Evaluation of Fusion Strategies for Multimodal Biometric Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .830 J. Fierrez-Aguilar, J. Ortega-Garcia, D. Garcia-Romero, and J. Gonzalez-Rodriguez Iris Feature Extraction Using Independent Component Analysis . . . . . . . . . . .838 Kwanghyuk Bae, Seungin Noh, and Jaihie Kim BIOMET: A Multimodal Person Authentication Database Including Face, Voice, Fingerprint, Hand and Signature Modalities . . . . . . . . 845 Sonia Garcia-Salicetti, Charles Beumier, G´erard Chollet, Bernadette Dorizzi, Jean Leroux les Jardins, Jan Lunter, Yang Ni, and Dijana Petrovska-Delacr´etaz Fingerprint Alignment Using Similarity Histogram . . . . . . . . . . . . . . . . . . . . . . . . 854 Tanghui Zhang, Jie Tian, Yuliang He, Jiangang Cheng, and Xin Yang A Novel Method to Extract Features for Iris Recognition System . . . . . . . . . .862 Seung-In Noh, Kwanghyuk Bae, Yeunggyu Park, and Jaihie Kim Resampling for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869 Xiaoguang Lu and Anil K. Jain Toward Person Authentication with Point Light Display Using Neural Network Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878 Sung-Bae Cho and Frank E. Pollick Fingerprint Verification Using Correlation Filters . . . . . . . . . . . . . . . . . . . . . . . . . 886 Krithika Venkataramani and B.V.K. Vijaya Kumar On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 895 J.K. Schneider, C.E. Richardson, F.W. Kiefer, and Venu Govindaraju
A JC-BioAPI Compliant Smart Card with Biometrics for Secure Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903 Michael Osborne and Nalini K. Ratha Comparison of MLP and GMM Classifiers for Face Verification on XM2VTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911 Fabien Cardinaux, Conrad Sanderson, and S´ebastien Marcel Fast Frontal-View Face Detection Using a Multi-path Decision Tree . . . . . . . 921 Bernhard Fr¨ oba and Andreas Ernst Improved Audio-Visual Speaker Recognition via the Use of a Hybrid Combination Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 929 Simon Lucey and Tsuhan Chen Face Recognition Vendor Test 2002 Performance Metrics . . . . . . . . . . . . . . . . . . 937 Patrick Grother, Ross J. Micheals, and P. Jonathon Phillips Posed Face Image Synthesis Using Nonlinear Manifold Learning . . . . . . . . . . .946 Eunok Cho, Daijin Kim, and Sang-Youn Lee Pose for Fusing Infrared and Visible-Spectrum Imagery . . . . . . . . . . . . . . . . . . . 955 Jian-Gang Wang and Ronda Venkteswarlu
AVBPA2003 Face Authentication Contest Face Verification Competition on the XM2VTS Database . . . . . . . . . . . . . . . . . 964 Kieron Messer, Josef Kittler, Mohammad Sadeghi, Sebastien Marcel, Christine Marcel, Samy Bengio, Fabien Cardinaux, C. Sanderson, Jacek Czyz, Luc Vandendorpe, Sanun Srisuk, Maria Petrou, Werasak Kurutach, Alexander Kadyrov, Roberto Paredes, B. Kepenekci, F.B. Tek, G.B. Akar, Farzin Deravi, and Nick Mavity Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .975
Robust Face Recognition in the Presence of Clutter

A.N. Rajagopalan¹, Rama Chellappa², and Nathan Koterba²

¹ Indian Institute of Technology, Madras, India
  [email protected]
² Center for Automation Research, University of Maryland, College Park, USA
  {rama,nathank}@cfar.umd.edu
Abstract. We propose a new method within the framework of principal component analysis to robustly recognize faces in the presence of clutter. The traditional eigenface recognition method performs poorly when confronted with the more general task of recognizing faces appearing against a background. It misses faces completely or throws up many false alarms. We argue in favor of learning the distribution of background patterns and show how this can be done for a given test image. An eigenbackground space is constructed and this space in conjunction with the eigenface space is used to impart robustness in the presence of background. A suitable classifier is derived to distinguish non-face patterns from faces. When tested on real images, the performance of the proposed method is found to be quite good.
1 Introduction
Two of the very successful and popular approaches to face recognition are the Principal Components Analysis (PCA) [1] and Fisher's Linear Discriminant (FLD) [2]. Methods based on PCA and FLD work quite well provided the input test pattern is a face, i.e., the face image has already been cropped out of a scene. The problem of recognizing faces in still images with a cluttered background is more general and difficult as one doesn't know where a face pattern might appear in a given image. A good face recognition system should i) detect and recognize all the faces in a scene, and ii) not mis-classify background patterns as faces. Since faces are usually sparsely distributed in images, even a few false alarms will render the system ineffective. Also, the performance should not be too sensitive to any threshold selection. Some attempts to address this situation are discussed in [1, 3] where the use of distance from eigenface space (DFFS) and distance in eigenface space (DIFS) are suggested to detect and eliminate non-faces for robust face recognition in clutter. In this study, we show that DFFS and DIFS by themselves (in the absence of any information about the background) are not sufficient to discriminate against arbitrary background patterns. If the threshold is set high, traditional eigenface recognition (EFR) invariably ends up missing faces. If the threshold is lowered to capture faces, the technique incurs many false alarms.
One possible approach to handle clutter in still images is to use a good face detection module to find face patterns and then feed only these patterns as inputs to the traditional EFR scheme. In this paper, we propose a new methodology within the PCA framework to robustly recognize frontal faces in a given test image with background clutter. Towards this end, we construct an ‘eigenbackground space’ which represents the distribution of the background images corresponding to the given test image. The background is learnt ‘on the fly’ and provides a sound basis for eliminating false alarms. An appropriate pattern classifier is derived and the eigenbackground space together with the eigenface space is used to simultaneously detect and recognize faces. Results are given on several test images to validate the proposed method.
2 Eigenface Recognition in Clutter
In the EFR technique, when a face image is presented to the system, its weight vector is determined with respect to the eigenface space. In order to perform recognition, the difference error between this weight vector and the a priori stored mean weight vector corresponding to every person in the training set is computed. This error is also called the distance in face space (DIFS). That face class in the training set for which the DIFS is minimum is declared as the recognized face provided the difference error is less than an appropriately chosen threshold. The case of a still image containing a face against background is much more complex and some attempts have been made to tackle it [1, 3]. In [1], the authors advocate the use of distance from face space (DFFS) to reject non-face patterns. The DFFS can be looked upon as the error in the reconstruction of a pattern. It has been pointed out in [1] that a threshold θ_DFFS could be chosen such that it defines the maximum allowable distance from the face space. If DFFS is greater than θ_DFFS, then the test pattern is classified as a non-face image. In a more recent work [3], DFFS together with DIFS has been suggested to improve performance. A test pattern is classified as a face and recognized provided its DFFS as well as DIFS values are less than suitably chosen thresholds θ_DFFS and θ_DIFS, respectively. Although DFFS and DIFS have been suggested as possible candidates for discriminating against background patterns, it is difficult to conceive that by learning just the face class we can segregate any arbitrary background pattern against which face patterns may appear. It may not always be possible to come up with threshold values that will result in no false alarms and yet can catch all the faces. To better illustrate this point, we show some examples in Fig. 1(a) where faces appear against background. Our training set contains faces of these individuals. The idea is to locate and recognize these individuals in the test images when they appear against clutter. The DFFS and DIFS values corresponding to every subimage pattern in these images were calculated and an attempt was made to recognize faces based on these values as suggested in [3]. It turns out that not only do we catch the face but also end up with many false alarms (see Fig. 1(b)) since information about the background is completely
ignored. It is interesting to note that some of the background patterns have been wrongly identified as one of the individuals in the training set. If the threshold values are made smaller to eliminate false alarms, we end up missing some of the faces. Thus, the performance of the EFR technique is quite sensitive to the threshold values chosen.
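To make the two quantities concrete, the following NumPy sketch (an illustration, not the authors' code; the helper names and the vectorized-window convention are assumptions) computes the DFFS and DIFS of a candidate window against a previously learnt eigenface basis:

```python
import numpy as np

def dffs_difs(window, mean_face, eigenfaces, class_means):
    """DFFS/DIFS for one candidate window.

    window     : (N,) vectorized image patch (e.g. a 27x27 window -> N = 729)
    mean_face  : (N,) mean of the face training set
    eigenfaces : (L, N) orthonormal eigenface basis (rows)
    class_means: (q, L) stored mean weight vectors, one per person
    """
    x = window.astype(float) - mean_face
    w = eigenfaces @ x                      # weight vector in the eigenface space
    x_hat = eigenfaces.T @ w                # reconstruction from the L principal projections
    dffs = np.sum((x - x_hat) ** 2)         # distance FROM face space
    difs = np.sum((class_means - w) ** 2, axis=1)  # distance IN face space, per class
    return dffs, int(np.argmin(difs)), float(difs.min())

# A window is accepted as a recognized face in [3] only if
# dffs < theta_dffs and difs.min() < theta_difs, with thresholds chosen from training data.
```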
3 Background Representation
If only the eigenface space is learnt, then background patterns with relatively small DFFS and DIFS values will pass for faces and this can result in an unacceptable number of false alarms. We argue in favor of learning the distribution of background images specific to a given scene. A locally learnt distribution can be expected to be more effective (than a universal background class learnt as in [4, 5], which is quite data intensive) for capturing the background characteristics of the given test image. By constructing the eigenbackground space for the given test image and comparing the proximity of an image pattern to this subspace versus the eigenface subspace, background patterns can be rejected.

3.1 The Eigenbackground Space
We now describe a simple but effective technique for constructing the 'eigenbackground space'. It is assumed that faces are sparsely distributed in a given image, which is a reasonable assumption. Given a test image, the background is learnt 'on the fly' from the test image itself. Initially, the test image is scanned for those image patterns that are very unlikely to belong to the 'face class'.

• A window pattern x in the test image is classified (positively) as a background pattern if its distance from the eigenface space is greater than a certain (high) threshold θ_b.

Note that we use DFFS to initially segregate only the most likely background patterns. Since the background usually constitutes a major portion of the test image, it is possible to obtain a sufficient number of samples for learning the 'background class' even if the threshold θ_b is chosen to be large for higher confidence. Since the number of background patterns is likely to be very large, these patterns are distributed into K clusters using simple K-means clustering so that K pattern centers are returned. The mean and covariance estimated from these clusters allow us to effectively extrapolate to other background patterns in the image (not picked up due to the high value of θ_b) as well. The pattern centers, which are much fewer in number as compared to the number of background patterns, are then used as training images for learning the eigenbackground space. Although the pattern centers belong to different clusters, they are not totally uncorrelated with respect to one another and further dimensionality reduction is possible. The procedure that we follow is similar to that used to create the eigenface space. We first find the principal components of the background pattern centers, or the eigenvectors of the covariance matrix C_b
of the set of background pattern centers. These eigenvectors can be thought of as a set of features which together characterize the variation among pattern centers of the background space. The subspace spanned by the eigenvectors corresponding to the largest K′ eigenvalues of the covariance matrix C_b is called the eigenbackground space. The significant eigenvectors of the matrix C_b, which we call 'eigenbackground images', form a basis for representing the background image patterns in the given test image.
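A minimal sketch of this construction is given below (illustrative only, not the authors' implementation; the scanning stride and the numerical value of the DFFS threshold θ_b are assumptions, while the 600 pattern centers and 100 eigenbackground images follow the settings reported in Section 6):

```python
import numpy as np
from sklearn.cluster import KMeans

def eigenbackground_space(test_image, mean_face, eigenfaces,
                          win=27, stride=1, theta_b=2500.0,
                          n_clusters=600, n_eigenbg=100):
    """Learn the eigenbackground space 'on the fly' from a single test image.

    Windows whose DFFS exceeds theta_b are treated as very likely background,
    clustered into pattern centers, and PCA is applied to the centers.
    """
    H, W = test_image.shape
    patches = []
    for r in range(0, H - win + 1, stride):
        for c in range(0, W - win + 1, stride):
            x = test_image[r:r + win, c:c + win].astype(float).ravel()
            d = x - mean_face
            recon = eigenfaces.T @ (eigenfaces @ d)
            if np.sum((d - recon) ** 2) > theta_b:      # high DFFS -> background
                patches.append(x)
    patches = np.array(patches)
    k = min(n_clusters, len(patches))
    centers = KMeans(n_clusters=k, n_init=4).fit(patches).cluster_centers_

    mean_bg = centers.mean(axis=0)
    # PCA on the pattern centers: leading eigenvectors of their covariance matrix C_b
    _, _, vt = np.linalg.svd(centers - mean_bg, full_matrices=False)
    eigenbackground = vt[:n_eigenbg]                    # K' eigenbackground images
    return mean_bg, eigenbackground
```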
4 The Classifier
Let the face class be denoted by ω_1 and the background class be denoted by ω_2. Assuming the conditional density function for the two classes to be Gaussian,

\[
f(x \mid \omega_i) = \frac{1}{(2\pi)^{N/2}\,|C_i|^{1/2}}\,\exp\!\Big(-\tfrac{1}{2}\,d_i(x)\Big) \qquad (1)
\]

where d_i(x) = (x − μ_i)^t C_i^{-1} (x − μ_i). Here, μ_1 and μ_2 are the means while C_1 and C_2 are the covariance matrices of the face and the background class, respectively. If the image pattern is of size M × M, then N = M². Diagonalization of C_i results in d_i(x) = (x − μ_i)^t (Φ_i Λ_i^{-1} Φ_i^t)(x − μ_i) = y_i^t Λ_i^{-1} y_i, where Φ_i is a matrix containing the eigenvectors of C_i and is of the form [φ_{i1} φ_{i2} . . . φ_{iN}]. The weight vector y_i = Φ_i^t (x − μ_i) is obtained by projecting the mean-subtracted vector x onto the subspace spanned by the eigenvectors in Φ_i. Written in scalar form, d_i(x) becomes d_i(x) = Σ_{j=1}^{N} y_{ij}²/λ_{ij}. Since d_1(x) is approximated using only L′ principal projections, we seek to formulate an estimator for d_1(x) as follows:

\[
\hat{d}_1(x) = \sum_{j=1}^{L'} \frac{y_{1j}^2}{\lambda_{1j}} + \frac{1}{\rho_1}\sum_{j=L'+1}^{N} y_{1j}^2
= \sum_{j=1}^{L'} \frac{y_{1j}^2}{\lambda_{1j}} + \frac{\epsilon_1^2(x)}{\rho_1} \qquad (2)
\]

where ε_1²(x) is the reconstruction error in x with respect to the eigenface space. This is because ε_1²(x) can be written as ε_1²(x) = ‖x − x̂_f‖², where x̂_f is the estimate of x when projected onto the eigenface space. Because x̂_f is computed using only L′ principal projections in the eigenface space, we have

\[
\epsilon_1^2(x) = \Big\| x - \Big(\mu_1 + \sum_{j=1}^{L'} y_{1j}\,\phi_{1j}\Big)\Big\|^2
= \Big\| \sum_{j=1}^{N} y_{1j}\,\phi_{1j} - \sum_{j=1}^{L'} y_{1j}\,\phi_{1j}\Big\|^2
= \Big\| \sum_{j=L'+1}^{N} y_{1j}\,\phi_{1j}\Big\|^2
= \sum_{j=L'+1}^{N} y_{1j}^2
\]

as the φ_{1j}s are orthonormal. In a similar vein, since d_2(x) is approximated using only K′ principal projections,

\[
\hat{d}_2(x) = \sum_{j=1}^{K'} \frac{y_{2j}^2}{\lambda_{2j}} + \frac{\epsilon_2^2(x)}{\rho_2} \qquad (3)
\]
where ε_2²(x) is the reconstruction error in x with respect to the eigenbackground space and ε_2²(x) = Σ_{j=K′+1}^{N} y_{2j}² = ‖x − x̂_b‖². Here, x̂_b is the estimate of x when projected onto the eigenbackground space. From equations (1) and (2), the density estimate based on d̂_1(x) can be written as the product of two marginal and independent Gaussian densities in the face space F and its orthogonal complement F⊥, i.e.,

\[
\hat{f}(x \mid \omega_1) =
\left[\frac{\exp\!\Big(-\frac{1}{2}\sum_{j=1}^{L'} \frac{y_{1j}^2}{\lambda_{1j}}\Big)}
{(2\pi)^{L'/2}\prod_{j=1}^{L'}\lambda_{1j}^{1/2}}\right]
\cdot
\left[\frac{\exp\!\Big(-\frac{\epsilon_1^2(x)}{2\rho_1}\Big)}
{(2\pi\rho_1)^{(N-L')/2}}\right]
= f_F(x \mid \omega_1)\cdot\hat{f}_{F^{\perp}}(x \mid \omega_1) \qquad (4)
\]

Here, f_F(x|ω_1) is the true marginal density in the face space while f̂_{F⊥}(x|ω_1) is the estimated marginal density in F⊥. Along similar lines, the density estimate for the background class can be expressed as

\[
\hat{f}(x \mid \omega_2) =
\left[\frac{\exp\!\Big(-\frac{1}{2}\sum_{j=1}^{K'} \frac{y_{2j}^2}{\lambda_{2j}}\Big)}
{(2\pi)^{K'/2}\prod_{j=1}^{K'}\lambda_{2j}^{1/2}}\right]
\cdot
\left[\frac{\exp\!\Big(-\frac{\epsilon_2^2(x)}{2\rho_2}\Big)}
{(2\pi\rho_2)^{(N-K')/2}}\right]
= f_B(x \mid \omega_2)\cdot\hat{f}_{B^{\perp}}(x \mid \omega_2) \qquad (5)
\]

Here, f_B(x|ω_2) is the true marginal density in the background space while f̂_{B⊥}(x|ω_2) is the estimated marginal density in B⊥. The optimal values of ρ_1 and ρ_2 can be determined by minimizing the Kullback-Leibler distance [6] between the true density and its estimate. The resultant estimates can be shown to be

\[
\rho_1 = \frac{1}{N-L'}\sum_{j=L'+1}^{N}\lambda_{1j}
\qquad \text{and} \qquad
\rho_2 = \frac{1}{N-K'}\sum_{j=K'+1}^{N}\lambda_{2j} \qquad (6)
\]

Thus, once we select the L′-dimensional principal subspace F, the optimal density estimate f̂(x|ω_1) has the form given by equation (4) where ρ_1 is as given above. A similar argument applies to the background space also. Assuming equal a priori probabilities, the classifier can be derived as

\[
\log \hat{f}_{F^{\perp}}(x \mid \omega_1) - \log \hat{f}_{B^{\perp}}(x \mid \omega_2)
= \frac{\epsilon_2^2(x)}{2\rho_2} - \frac{\epsilon_1^2(x)}{2\rho_1}
+ \frac{N-K'}{2}\log(2\pi\rho_2) - \frac{N-L'}{2}\log(2\pi\rho_1) \qquad (7)
\]
When L′ = K′, i.e., when the number of eigenfaces and eigenbackground images is the same, and when ρ_1 = ρ_2, i.e., when the arithmetic mean of the eigenvalues in the orthogonal subspaces is the same, the above classifier interestingly simplifies to

\[
\log \hat{f}_{F^{\perp}}(x \mid \omega_1) - \log \hat{f}_{B^{\perp}}(x \mid \omega_2) = \epsilon_2^2(x) - \epsilon_1^2(x) \qquad (8)
\]
which is simply a function of the reconstruction error. Clearly, the face space would favour a better reconstruction of face patterns while the background space would favour the background patterns.
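For concreteness, a small sketch of equations (6) and (7) follows (an assumed NumPy fragment, not code from the paper); when L′ = K′ and ρ_1 = ρ_2 the score reduces, up to a positive factor, to the reconstruction-error difference of equation (8):

```python
import numpy as np

def residual_rho(eigvals, n_principal):
    """rho: average of the eigenvalues left outside the principal subspace (eq. 6)."""
    return float(np.mean(eigvals[n_principal:]))

def classifier_score(eps1_sq, eps2_sq, rho1, rho2, N, L_p, K_p):
    """log f_{F-perp}(x|w1) - log f_{B-perp}(x|w2), equation (7).

    eps1_sq, eps2_sq : reconstruction errors w.r.t. the eigenface / eigenbackground space.
    A positive score favours the face class, a negative one the background class.
    """
    return (eps2_sq / (2 * rho2) - eps1_sq / (2 * rho1)
            + (N - K_p) / 2 * np.log(2 * np.pi * rho2)
            - (N - L_p) / 2 * np.log(2 * np.pi * rho1))
```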
5 The Proposed Method
Once the eigenface space and the eigenbackground space are learnt, the test image is examined again, but now for the presence of faces at all points in the image. For each of the test window patterns, the classifier proposed in Section 4 is used to determine whether a pattern is a face or not. Ideally, one must use equation (7), but for computational simplicity we use equation (8), which is the difference in the reconstruction error. The classifier works quite well despite this simplification. To express the operations mathematically, let the subimage pattern under consideration in the test image be denoted as x. The vector x is projected onto the eigenface space as well as the eigenbackground space to yield estimates of x as x̂_f and x̂_b, respectively. If

\[
\|x - \hat{x}_f\|^2 < \|x - \hat{x}_b\|^2
\quad \text{and} \quad
\|x - \hat{x}_f\|^2 < \theta_{DFFS} \qquad (9)
\]

where θ_DFFS is an appropriately chosen threshold, then recognition is carried out based on its DIFS value. The weight vector W corresponding to pattern x in the eigenface space is compared (in the Euclidean sense) with the pre-stored mean weights of each of the face classes. The pattern x is recognized as belonging to the i-th person if

\[
i = \arg\min_{j} \|W - m_j\|^2, \; j = 1, \ldots, q
\quad \text{and} \quad
\|W - m_i\|^2 < \theta_{DIFS} \qquad (10)
\]

where q is the number of face classes or people in the database and θ_DIFS is a suitably chosen threshold. In the above discussion, since a background pattern will be better approximated by the eigenbackground images than by the eigenface images, it is to be expected that ‖x − x̂_b‖² would be less than ‖x − x̂_f‖² for a background pattern x. On the other hand, if x is a face pattern, then it will be better represented by the eigenface space than the eigenbackground space. Thus, learning the eigenbackground space helps to reduce the false alarms considerably. Moreover, the threshold value can now be raised comfortably without generating false alarms because the reconstruction error of a background pattern would continue to remain a minimum with respect to the background space only. Knowledge of the background leads to improved performance (fewer misses as well as fewer false alarms) and reduces sensitivity to the choice of threshold values (properties that are highly desirable in a recognition scenario).
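The complete decision rule of equations (8)–(10) for one window can be sketched as follows (illustrative NumPy only; the function and variable names are assumptions, and the simplified reconstruction-error test is used in place of equation (7)):

```python
import numpy as np

def detect_and_recognize(x, mean_face, eigenfaces, mean_bg, eigenbackground,
                         class_means, theta_dffs, theta_difs):
    """Classify one window x: return a person index, or None for background."""
    d_f = x - mean_face
    w = eigenfaces @ d_f                                 # weights in the eigenface space
    err_face = np.sum((d_f - eigenfaces.T @ w) ** 2)     # ||x - x_hat_f||^2
    d_b = x - mean_bg
    err_bg = np.sum((d_b - eigenbackground.T @ (eigenbackground @ d_b)) ** 2)

    # Equation (9): closer to the face space than to the background space,
    # and below the DFFS threshold.
    if err_face >= err_bg or err_face >= theta_dffs:
        return None
    # Equation (10): nearest stored mean weight vector, gated by the DIFS threshold.
    difs = np.sum((class_means - w) ** 2, axis=1)
    i = int(np.argmin(difs))
    return i if difs[i] < theta_difs else None
```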
6 Experimental Results
Because our experiment requires individuals in the test images (with background clutter) to be the same as the ones in the training set, we generated our own face database. The training set consisted of images of size 27 × 27 pixels of 50 subjects with 10 images per subject. The number of significant eigenfaces was found to be 50 for satisfactory recognition. For the purpose of testing, we captured images in which subjects in the database appeared (with an approximately frontal pose) against different types of background. Some of the images were captured within the laboratory. For other types of clutter, we used big posters with different types of complex background. Pictures of the individuals in our database were then captured with these posters in the backdrop. We captured about 400 such test images each of size 120 × 120 pixels. If a face pattern is recognized by the system, a box is drawn at the corresponding location in the output image. Thresholds θ_DFFS and θ_DIFS were chosen to be the maximum of all the DFFS and DIFS values, respectively, among the faces in the training set (which is a reasonable thing to do). The threshold values were kept the same for all the test images and for both the schemes as well. For the proposed scheme, the number of background pattern centers was chosen to be 600 while the number of eigenbackground images was chosen to be 100, and these were kept fixed for all the test images. The number of eigenbackground images was arrived at based on the accuracy of reconstruction of the background patterns. Due to space constraints, only a few representative results are given here (see Fig. 1 and Fig. 2). The figures are quite self-explanatory. We observe that traditional EFR (which does not utilize background information) confuses too many background patterns (Fig. 1(b)) with faces in the training set. If θ_DFFS is decreased to reduce the false alarms, then it ends up missing many of the faces. On the other hand, the proposed scheme works quite well and recognizes faces with very few false alarms, if any. When tested on all the 400 test images, the proposed method has a detection capability of 80% with no false alarms, and the recognition rate on these detected images is 78%. Most of the frontal faces are caught correctly. Even if θ_DFFS is increased to accommodate slightly difficult poses, we have observed that the results are unchanged for the proposed method. This can be attributed to the fact that the proximity of a background pattern continues to remain with respect to the background space despite changes in θ_DFFS.
7 Conclusions
In the literature, the eigenface technique has been demonstrated to be very useful for face recognition. However, when the scheme is directly extended to recognize faces in the presence of background clutter, its performance degrades as it cannot satisfactorily discriminate against non-face patterns. In this paper, we have presented a robust scheme for recognizing faces in still images of natural scenes. We argue in favor of constructing an eigenbackground space from the background images of a given scene. The background space, which is created 'on the fly' from the test image, is shown to be very useful in distinguishing non-face patterns. The scheme outperforms the traditional EFR technique and gives very good results with almost no false alarms, even on fairly complicated scenes.

Fig. 1. (a) Sample test images. Results for (b) traditional EFR, and (c) the proposed method. Note that traditional EFR has many false alarms

Fig. 2. Some representative results for the proposed method
References

[1] M. Turk and A. Pentland, "Eigenfaces for recognition", J. Cognitive Neuroscience, vol. 3, pp. 71-86, 1991.
[2] P. Belhumeur, J. Hespanha and D. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection", IEEE Trans. Pattern Anal. and Machine Intell., vol. 19, pp. 711-720, 1997.
[3] B. Moghaddam and A. Pentland, "Probabilistic visual learning for object representation", IEEE Trans. Pattern Anal. and Machine Intell., vol. 19, pp. 696-710, 1997.
[4] K. Sung and T. Poggio, "Example-based learning for view-based human face detection", IEEE Trans. Pattern Anal. and Machine Intell., vol. 20, pp. 39-51, 1998.
[5] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection", IEEE Trans. Pattern Anal. and Machine Intell., vol. 20, pp. 23-38, 1998.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1991.
An Image Preprocessing Algorithm for Illumination Invariant Face Recognition

Ralph Gross and Vladimir Brajovic

The Robotics Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213
{rgross,brajovic}@cs.cmu.edu
Abstract. Face recognition algorithms have to deal with significant amounts of illumination variations between gallery and probe images. State-of-the-art commercial face recognition algorithms still struggle with this problem. We propose a new image preprocessing algorithm that compensates for illumination variations in images. From a single brightness image the algorithm first estimates the illumination field and then compensates for it to mostly recover the scene reflectance. Unlike previously proposed approaches for illumination compensation, our algorithm does not require any training steps, knowledge of 3D face models or reflective surface models. We apply the algorithm to face images prior to recognition. We demonstrate large performance improvements with several standard face recognition algorithms across multiple, publicly available face databases.
1 Introduction
Besides pose variation, illumination is the most significant factor affecting the appearance of faces. Ambient lighting changes greatly within and between days and among indoor and outdoor environments. Due to the 3D shape of the face, a direct lighting source can cast strong shadows that accentuate or diminish certain facial features. Evaluations of face recognition algorithms consistently show that state-of-the-art systems cannot deal with large differences in illumination conditions between gallery and probe images [1, 2, 3]. In recent years many appearance-based algorithms have been proposed to deal with the problem [4, 5, 6, 7]. Belhumeur showed [5] that the set of images of an object in fixed pose but under varying illumination forms a convex cone in the space of images. The illumination cones of human faces can be approximated well by low-dimensional linear subspaces [8]. The linear subspaces are typically estimated from training data, requiring multiple images of the object under different illumination conditions. Alternatively, model-based approaches have been proposed to address the problem. Blanz et al. [9] fit a previously constructed morphable 3D model to single images. The algorithm works well across pose and illumination; however, the computational expense is very high. In general, an image I(x, y) is regarded as the product I(x, y) = R(x, y)L(x, y) where R(x, y) is the reflectance and L(x, y) is the illuminance at each point
(x, y) [10]. Computing the reflectance and the illuminance fields from real images is, in general, an ill-posed problem. Therefore, various assumptions and simplifications about L, or R, or both are proposed in order to attempt to solve the problem. A common assumption is that L varies slowly while R can change abruptly. For example, homomorphic filtering [11] uses this assumption to extract R by high-pass filtering the logarithm of the image. Closely related to homomorphic filtering is Land's "retinex" theory [12]. The retinex algorithm estimates the reflectance R as the ratio of the image I(x, y) and its low-pass version that serves as an estimate for L(x, y). At large discontinuities in I(x, y), "halo" effects are often visible. Jobson [13] extended the algorithm by combining several low-pass copies of the logarithm of I(x, y) using different cut-off frequencies for each low-pass filter. This helps to reduce halos, but does not eliminate them entirely. In order to eliminate the notorious halo effect, Tumblin and Turk introduced the low curvature image simplifier (LCIS) hierarchical decomposition of an image [14]. Each component in this hierarchy is computed by solving a partial differential equation inspired by anisotropic diffusion [15]. At each hierarchical level the method segments the image into smooth (low-curvature) regions while stopping at sharp discontinuities. The algorithm is computationally intensive and requires manual selection of no less than 8 different parameters.
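For reference, the single-scale retinex estimate discussed above can be written in a few lines; this sketch (using a Gaussian low-pass filter with an assumed σ, and not the method proposed in this paper) is the kind of baseline in which halo artifacts appear near large discontinuities in I(x, y):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(image, sigma=15.0, eps=1e-6):
    """Estimate log-reflectance as log(I) - log(lowpass(I)) (single-scale retinex)."""
    I = image.astype(float) + eps
    L_est = gaussian_filter(I, sigma) + eps     # smooth illumination estimate
    return np.log(I) - np.log(L_est)            # log R = log I - log L
```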
2 The Reflectance Perception Model
Our algorithm is motivated by two widely accepted assumptions about human vision: 1) human vision is mostly sensitive to scene reflectance and mostly insensitive to the illumination conditions, and 2) human vision responds to local changes in contrast rather than to global brightness levels. These two assumptions are closely related since local contrast is a function of reflectance. Having these assumptions in mind our goal is to find an estimate of L(x, y) such that when it divides I(x, y) it produces R(x, y) in which the local contrast is appropriately enhanced. In this view R(x, y) takes the place of perceived sensation, while I(x, y) takes the place of the input stimulus. L(x, y) is then called perception gain which maps the input sensation into the perceived stimulus, that is:

\[
I(x, y)\,\frac{1}{L(x, y)} = R(x, y) \qquad (1)
\]

With this biological analogy, R is mostly the reflectance of the scene, and L is mostly the illumination field, but they may not be "correctly" separated in a strict physical sense. After all, humans perceive reflectance details in shadows as well as in bright regions, but they are also cognizant of the presence of shadows. From this point on, we may refer to R and L as reflectance and illuminance, but they are to be understood as the perceived sensation and the perception gain, respectively. To derive our model, we turn to evidence gathered in experimental psychology. According to Weber's Law the sensitivity threshold to a small intensity
Fig. 1. (a) Compressive logarithmic mapping emphasizes changes at low stimulus levels and attenuates changes at high stimulus levels. (b) Discretization lattice for the PDE in Equation (5)
change increases proportionally to the signal level [16]. This law follows from experimentation on brightness perception that consists of exposing an observer to a uniform field of intensity I in which a disk is gradually increased in brightness by a quantity ∆I. The value ∆I from which the observer perceives the existence of the disk against the background is called the brightness discrimination threshold. Weber noticed that ∆I/I is constant for a wide range of intensity values. Weber's law gives a theoretical justification for assuming a logarithmic mapping from input stimulus to perceived sensation (see Figure 1(a)). Due to the logarithmic mapping, when the stimulus is weak, for example in deep shadows, small changes in the input stimulus elicit large changes in perceived sensation. When the stimulus is strong, small changes in the input stimulus are mapped to even smaller changes in perceived sensation. In fact, local variations in the input stimulus are mapped to the perceived sensation variations with the gain 1/I, that is:

\[
I(x, y)\,\frac{1}{I_\Psi(x, y)} = R(x, y), \qquad (x, y) \in \Psi \qquad (2)
\]

where I_Ψ(x, y) is the stimulus level in a small neighborhood Ψ in the input image. By comparing Equation (1) and (2) we arrive at the model for the perception gain:

\[
L(x, y) \doteq I_\Psi(x, y) = I(x, y) \qquad (3)
\]
where the neighborhood stimulus level is by definition taken to be the stimulus at point (x, y). As seen in Equation 4 we regularize the problem by imposing a smoothness constraint on the solution for L(x, y). The smoothness constraint takes care of producing I_Ψ; therefore, the replacement by definition of I_Ψ by I
in Equation 3 is justified. We do not need to specify any particular region Ψ. The solution for L(x, y) is found by minimizing:

\[
J(L) = \int_\Omega \rho(x, y)\,(L - I)^2\, dx\, dy \;+\; \lambda \int_\Omega (L_x^2 + L_y^2)\, dx\, dy \qquad (4)
\]
where the first term drives the solution to follow the perception gain model, while the second term imposes a smoothness constraint. Here Ω refers to the image. The parameter λ controls the relative importance of the two terms. The space varying permeability weight ρ(x, y) controls the anisotropic nature of the smoothing constraint. The Euler-Lagrange equation for this calculus of variation problem yields:

\[
L + \frac{\lambda}{\rho}\,(L_{xx} + L_{yy}) = I \qquad (5)
\]
Discretized on a rectangular lattice, this linear partial differential equation becomes:

\[
L_{i,j} + \lambda \left[
\frac{1}{h\rho_{i,j-1/2}}\,(L_{i,j} - L_{i,j-1}) +
\frac{1}{h\rho_{i,j+1/2}}\,(L_{i,j} - L_{i,j+1}) +
\frac{1}{h\rho_{i-1/2,j}}\,(L_{i,j} - L_{i-1,j}) +
\frac{1}{h\rho_{i+1/2,j}}\,(L_{i,j} - L_{i+1,j})
\right] = I \qquad (6)
\]

where h is the pixel grid size and the value of each ρ is taken in the middle of the edge between the center pixel and each of the corresponding neighbors (see Figure 1(b)). In this formulation, ρ controls the anisotropic nature of the smoothing by modulating permeability between pixel neighbors. Equation 6 can be solved numerically using multigrid methods for boundary value problems [17]. Multigrid algorithms are fairly efficient having complexity O(N), where N is the number of pixels [17]. Running our non-optimized code on a 2.4GHz Pentium 4 produced execution times of 0.17 seconds for a 320x240-pixel image, and 0.76 seconds for a 640x480-pixel image. The smoothness is penalized at every edge of the lattice by weights ρ (see Figure 1(b)). As stated earlier, the weight should change proportionally with the strength of the discontinuities. We need a relative measure of local contrasts that will equally "respect" boundaries in shadows and bright regions. We call again upon Weber's law and modulate the weights ρ by Weber's contrast¹:

\[
h\rho_{\frac{a+b}{2}} = \frac{\Delta I}{I} = \frac{|I_a - I_b|}{\min(I_a, I_b)} \qquad (7)
\]

where ρ_{(a+b)/2} is the weight between two neighboring pixels whose intensities are I_a and I_b.
¹ In our experiments equally good performance can be obtained by using Michelson's contrast (I_a − I_b)/(I_a + I_b).
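To make the procedure concrete, the following is a minimal sketch (not the authors' implementation) of Equations (6) and (7): the edge weights are taken from Weber's contrast and the resulting sparse linear system is relaxed with plain Gauss-Seidel sweeps rather than the multigrid solver used in the paper. The function names, the default λ and the choice h = 1 are assumptions made for illustration only, and the loop-based solver is unoptimized.

```python
import numpy as np

def illumination_field(I, lam=7.0, iters=100, eps=1e-6):
    """Estimate the smooth illumination field L of image I (Eq. 6).

    Edge weights follow Weber's contrast (Eq. 7, with h = 1); the linear
    system is relaxed with Gauss-Seidel sweeps instead of multigrid.
    """
    I = I.astype(np.float64)
    L = I.copy()                                   # initial guess: L = I (Eq. 3)
    # Per-edge weights 1/rho between vertical and horizontal neighbours.
    wv = np.abs(I[1:, :] - I[:-1, :]) / (np.minimum(I[1:, :], I[:-1, :]) + eps)
    wh = np.abs(I[:, 1:] - I[:, :-1]) / (np.minimum(I[:, 1:], I[:, :-1]) + eps)
    H, W = I.shape
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                num, den = I[i, j], 1.0
                if i > 0:
                    num += lam * wv[i-1, j] * L[i-1, j]; den += lam * wv[i-1, j]
                if i < H - 1:
                    num += lam * wv[i, j] * L[i+1, j]; den += lam * wv[i, j]
                if j > 0:
                    num += lam * wh[i, j-1] * L[i, j-1]; den += lam * wh[i, j-1]
                if j < W - 1:
                    num += lam * wh[i, j] * L[i, j+1]; den += lam * wh[i, j]
                L[i, j] = num / den                # Gauss-Seidel update of Eq. 6
    return L

def compensate(I, lam=7.0):
    """Divide out the estimated illumination, returning R = I / L (Eq. 1)."""
    L = illumination_field(I, lam)
    return I.astype(np.float64) / (L + 1e-6)
```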
Fig. 2. Result of removing illumination variations with our algorithm for a set of images from the PIE database (top: original PIE images, bottom: processed PIE images)
3 Face Recognition across Illumination
3.1 Databases and Algorithms
We use images from two publicly available databases in our evaluation: the CMU PIE database and the Yale database. The CMU PIE database contains a total of 41,368 images taken from 68 individuals [18]. The subjects were imaged in the CMU 3D Room using a set of 13 synchronized high-quality color cameras and 21 flashes. For our experiments we use images from the more challenging illumination set which was captured without room lights (see Figure 2). The Yale Face Database B [6] contains 5760 single light source images of 10 subjects each seen under 576 viewing conditions: 9 different poses and 64 illumination conditions. Figure 3 shows examples for original and processed images. The database is divided into different subsets according to the angle the light source direction forms with the camera's axis (12°, 25°, 50° and 77°). We report recognition accuracies for two algorithms: Eigenfaces (Principal Component Analysis (PCA)) and FaceIt, a commercial face recognition system from Identix. Eigenfaces [19] is a standard benchmark for face recognition algorithms [1]. FaceIt was the top performer in the Facial Recognition Vendor Test 2000 [2]. As comparison we also include results for Eigenfaces on histogram equalized and gamma corrected images.
3.2 Experiments
The application of our algorithm to the images of the CMU PIE and Yale databases results in accuracy improvements across all conditions and all algorithms. Figure 4 shows the accuracies of both PCA and FaceIt for all 13 poses of the PIE database. In each pose separately the algorithms use one illumination condition as gallery and all other illumination conditions as probe. The reported
Fig. 3. Example images from the Yale Face Database B before and after processing with our algorithm (top: original Yale images, bottom: processed Yale images)
Fig. 4. Recognition accuracies on the PIE database ((a) PCA, (b) FaceIt; x-axis: gallery pose, y-axis: recognition accuracy). In each pose separately the algorithms use one illumination condition as gallery and all other illumination conditions as probe. Both PCA and FaceIt achieve better recognition accuracies on the images processed with our algorithm (SI) than on the original. The gallery poses are sorted from right profile (22) to frontal (27) and left profile (34)
results are averages over the probe illumination conditions in each pose. The performance of PCA improves from 17.9% to 48.6% on average across all poses. The performance of FaceIt improves from 41.2% to 55%. On histogram equalized and gamma corrected images PCA achieves accuracies of 35.7% and 19.3%, respectively. Figure 5 visualizes the recognition matrix for PCA on PIE for frontal pose. Each cell of the matrix shows the recognition rate for one specific gallery/probe illumination condition. It is evident that PCA performs better in wide regions of
the matrix for images processed with our algorithm. For comparison the recognition matrix for histogram equalized images is shown as well.
Fig. 5. Visualization of PCA recognition rates on PIE for frontal pose ((a) original, (b) histogram equalized, (c) our algorithm). Gallery illumination conditions are shown on the y-axis, probe illumination conditions on the x-axis, both spanning illumination conditions from the leftmost illumination source to the rightmost illumination source
Fig. 6. Recognition accuracies on the Yale database ((a) PCA, (b) FaceIt; x-axis: probe subset, y-axis: recognition accuracy). Both algorithms used images from Subset 1 as gallery and images from Subset 2, 3 and 4 as probe. Using images processed by our algorithm (SI) greatly improves accuracies for both PCA and FaceIt
We see similar improvements in recognition accuracies on the Yale database. In each case the algorithms used Subset 1 as gallery and Subsets 2, 3 and 4 as probe. Figure 6 shows the accuracies for PCA and FaceIt for Subsets 2, 3 and 4. For PCA the average accuracy improves from 59.3% to 93.7%. The accuracy of FaceIt improves from 75.3% to 85.7%. On histogram equalized and gamma corrected images PCA achieves accuracies of 71.7% and 59.7%, respectively.
4 Conclusion
We introduced a simple and automatic image-processing algorithm for compensation of illumination-induced variations in images. The algorithm computes the estimate of the illumination field and then compensates for it. At the high level, the algorithm mimics some aspects of human visual perception. If desired, the user may adjust a single parameter whose meaning is intuitive and simple to understand. The algorithm delivers large performance improvements for standard face recognition algorithms across multiple face databases.
Acknowledgements The research described in this paper was supported in part by National Science Foundation grants IIS-0082364 and IIS-0102272 and by U.S. Office of Naval Research contract N00014-00-1-0915.
References
[1] Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The FERET evaluation methodology for face-recognition algorithms. IEEE PAMI 22 (2000) 1090–1104
[2] Blackburn, D., Bone, M., Philips, P.: Facial recognition vendor test 2000: evaluation report (2000)
[3] Gross, R., Shi, J., Cohn, J.: Quo vadis face recognition? In: Third Workshop on Empirical Evaluation Methods in Computer Vision (2001)
[4] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE PAMI 19 (1997) 711–720
[5] Belhumeur, P., Kriegman, D.: What is the set of images of an object under all possible lighting conditions. Int. J. of Computer Vision 28 (1998) 245–260
[6] Georghiades, A., Kriegman, D., Belhumeur, P.: From few to many: Generative models for recognition under variable pose and illumination. IEEE PAMI (2001)
[7] Riklin-Raviv, T., Shashua, A.: The Quotient image: class-based re-rendering and recognition with varying illumination conditions. In: IEEE PAMI (2001)
[8] Georghiades, A., Kriegman, D., Belhumeur, P.: Illumination cones for recognition under variable lighting: Faces. In: Proc. IEEE Conf. on CVPR (1998)
[9] Blanz, V., Romdhani, S., Vetter, T.: Face identification across different poses and illumination with a 3D morphable model. In: IEEE Conf. on Automatic Face and Gesture Recognition (2002)
[10] Horn, B.: Robot Vision. MIT Press (1986)
[11] Stockham, T.: Image processing in the context of a visual model. Proceedings of the IEEE 60 (1972) 828–842
[12] Land, E., McCann, J.: Lightness and retinex theory. Journal of the Optical Society of America 61 (1971)
[13] Jobson, D., Rahman, Z., Woodell, G.: A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. on Image Processing 6 (1997)
[14] Tumblin, J., Turk, G.: LCIS: A boundary hierarchy for detail-preserving contrast reduction. In: ACM SIGGRAPH (1999)
[15] Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE PAMI 12 (1990) 629–639
[16] Wandell, B.: Foundations of Vision. Sunderland MA: Sinauer (1995)
[17] Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C. Cambridge University Press (1992)
[18] Sim, T., Baker, S., Bsat, M.: The CMU Pose, Illumination, and Expression (PIE) database. In: IEEE Int. Conf. on Automatic Face and Gesture Recognition (2002)
[19] Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3 (1991) 71–86
Quad Phase Minimum Average Correlation Energy Filters for Reduced Memory Illumination Tolerant Face Authentication
Marios Savvides and B.V.K. Vijaya Kumar
Electrical and Computer Engineering Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh PA 15217, USA
[email protected] [email protected]
Abstract. In this paper we propose reduced memory biometric filters for performing distortion tolerant face authentication. The focus of this research is on implementing authentication algorithms on small form factor devices with limited memory and computational resources. We compare the full complexity minimum average correlation energy filters for performing illumination tolerant face authentication with our proposed quad phase minimum average correlation energy filters [1] utilizing a Four-Level correlator. The proposed scheme requires only 2 bits/frequency in the frequency domain, achieving a compression ratio of up to 32:1 for each biometric filter while still attaining very good verification performance (100% in some cases). The results we show are based on the illumination subsets of the CMU PIE database [2] on 65 people with 21 facial images per person.
1 Introduction
Biometric authentication systems are actively being researched for access control, and a growing interest is emerging where these systems need to be integrated into small form factor devices such as credit cards, PDAs, cell phones and other devices with limited memory and computational resources, with memory being the most costly resource in such systems. Traditional correlation filter based methods have not been favored in areas of pattern recognition, mainly because the filters employed were matched filters [3], which meant that as many filters as training images were used. This leads to a large amount of memory needed for storing these filters and, more importantly, one would have to perform cross-correlation with each of the training images (or matched filters) for each test image. Clearly this is very expensive computationally and requires huge memory resources.
Fig. 1. Correlation schematic block diagram. A single correlation filter is synthesized from many training images and stored directly in the frequency domain. FFT’s are used to perform cross-correlation fast and the correlation output is examined for sharp peaks
Recent work using advanced correlation filter designs has shown them to be successful for performing face authentication in the presence of facial expressions [4][5]. Advanced correlation filters [6] such as the minimum average correlation energy (MACE) filters [1] synthesize a single filter template from a set of training images and produce sharp distinct correlation peaks for the authentic class and no discernible peaks for impostor classes. MACE filters are well suited in applications where high discrimination is required. In authentication applications we are typically given only a small number of training images. These are used to synthesize a single MACE filter. This MACE filter will typically produce sharp distinct peaks only for the class it has been trained on, and will automatically reject any other classes without any a priori information about the impostor classes. Previous work applying these types of filters for eye detection can be found in [7].
1.1 Minimum Average Correlation Energy Filters
Minimum Average Correlation Energy (MACE)[1] filters are synthesized in closed form by optimizing a criterion function that seeks to minimize the average correlation energy resulting from cross-correlations with the given training images while satisfying linear constraints to provide a specific peak value at the origin of the correlation plane for each training image. In doing so, the resulting correlation outputs from the training images resemble 2D-delta type outputs, i.e. sharp peaks at the origin with values close to zero elsewhere. The position of the detected peak also provides the location of the recognized object. The MACE filter is given in the following closed form equation:
h = D^{-1} X (X^+ D^{-1} X)^{-1} u    (1)
Assuming that we have N training images, X in Eq. (1) is an LxN matrix, where L is the total number of pixels of a single training image (L = d1 x d2). X contains the Fourier transforms of each of the N training images lexicographically reordered and placed along each column. D is a diagonal matrix of dimension LxL containing the average power spectrum of the training images lexicographically reordered and placed along its diagonal. u is a column vector with N elements, containing the corresponding desired peak values at the origin of the correlation plane of the
training images. The MACE filter is formulated directly in the frequency domain for efficiency. Note that + denotes complex conjugate transpose. Also, h is a column vector that needs to be lexicographically re-ordered to form the 2-D MACE filter. In terms of memory requirements, h is typically stored as complex values with 32-bit real and imaginary components. For example, for a 64x64 resolution image, that would need 32x2x64x64 bits ~ 32 KB for a single MACE filter array stored in the frequency domain as shown in Fig. 1.
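As an illustration of Eq. (1), the sketch below synthesizes a MACE filter with NumPy FFTs. It is a hedged example rather than the authors' code; the function name and the default choice of u (all peak constraints set to 1) are assumptions.

```python
import numpy as np

def mace_filter(train_images, peaks=None):
    """Synthesize a MACE filter, h = D^-1 X (X^+ D^-1 X)^-1 u  (Eq. 1).

    train_images: list of N equally sized 2-D arrays.
    Returns the filter as a 2-D array in the frequency domain.
    """
    N = len(train_images)
    d1, d2 = train_images[0].shape
    # X: L x N matrix of lexicographically reordered 2-D Fourier transforms.
    X = np.column_stack([np.fft.fft2(img).ravel() for img in train_images])
    # D: average power spectrum of the training images (diagonal, kept as a vector).
    D = np.mean(np.abs(X) ** 2, axis=1)
    u = np.ones(N) if peaks is None else np.asarray(peaks, dtype=np.float64)
    Dinv_X = X / D[:, None]                    # D^-1 X
    A = X.conj().T @ Dinv_X                    # X^+ D^-1 X   (N x N)
    h = Dinv_X @ np.linalg.solve(A, u)         # length-L vector, frequency domain
    return h.reshape(d1, d2)
```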
1.2 Peak-to-Sidelobe Ratio (PSR) Measure
The Peak-to-Sidelobe Ratio (PSR) is a metric used to test whether a test image belongs to the authentic class. First, the test image is cross-correlated with the synthesized MACE filter, then the resulting correlation output is searched for the peak correlation value. A rectangular region (we use 20x20 pixels) centered at the peak is extracted and used to compute the PSR as follows. A 5x5 rectangular region centered at the peak is masked out and the remaining annular region, defined as the sidelobe region, is used to compute the mean and standard deviation of the sidelobes. The peak-to-sidelobe ratio is given as follows:
PSR = (peak − mean) / σ    (2)
The peak-to-sidelobe ratio measures the peak sharpness in a correlation output, which is exactly what the MACE filter tries to optimize; hence the larger the PSR, the more likely the test image belongs to the authentic class. It is also important to realize that the authentication decision is not based on a single projection but on many projections which should produce a specific response in order to belong to the authentic class, i.e. the peak value should be large, and the neighboring correlation values, which correspond to projections of the MACE point spread function with shifted versions of the test image, should yield values close to zero. Another important property of the PSR metric is that it is invariant to any uniform scale changes in illumination. This can easily be verified from Eq. (2), as multiplying the test image by any constant scale factor can be factored out from the peak, mean and standard deviation terms to cancel out.
Fig. 2. Peak-to-sidelobe ratio computation uses a 20x20 region of the correlation output centered at the peak
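A minimal sketch of the PSR computation of Eq. (2), using the 20x20 sidelobe region and 5x5 mask described above; the function name and the edge handling near image borders are illustrative choices, not the authors' implementation.

```python
import numpy as np

def psr(correlation_output, side=20, mask=5):
    """Peak-to-sidelobe ratio (Eq. 2) of a correlation plane."""
    c = np.abs(correlation_output)
    pi, pj = np.unravel_index(np.argmax(c), c.shape)   # peak location
    half, m = side // 2, mask // 2
    r0, c0 = max(pi - half, 0), max(pj - half, 0)
    region = c[r0:pi + half, c0:pj + half].astype(np.float64).copy()
    peak = c[pi, pj]
    # Exclude the central mask x mask block; the rest is the sidelobe region.
    ci, cj = pi - r0, pj - c0
    region[max(ci - m, 0):ci + m + 1, max(cj - m, 0):cj + m + 1] = np.nan
    sidelobe = region[~np.isnan(region)]
    return float((peak - sidelobe.mean()) / sidelobe.std())
```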
2 Quad Phase MACE Filters – Reduced Memory Representation
It is well known that in the Fourier domain, phase information is more important than magnitude for performing image reconstruction [8][9]. Since phase contains most of the intelligibility of an image, and can be used to retrieve the magnitude information, we propose to reduce the memory storage requirement of MACE filters by preserving and quantizing the phase of the filter to 4 levels. Hence the resulting filter will be named the Quad-Phase MACE filter, where each element in the filter array takes on ±1 for the real component and ±j for the imaginary component in the following manner:

H_{QP-MACE}(u, v) = \begin{cases} +1, & \Re\{H_{MACE}(u, v)\} \ge 0 \\ -1, & \Re\{H_{MACE}(u, v)\} < 0 \end{cases} + \begin{cases} +j, & \Im\{H_{MACE}(u, v)\} \ge 0 \\ -j, & \Im\{H_{MACE}(u, v)\} < 0 \end{cases}    (3)
Essentially 2 bits per frequency are needed to encode the 4 phase levels, namely π/4, 3π/4, 5π/4, 7π/4. Details on partial information filters can be found in [10].
2.1 Four-Level Correlator – Reduced Complexity Correlation
The QP-MACE filter described in Eq. (3) has unit magnitude at all frequencies, encoding 4 phase levels. In order to produce sharp correlation outputs (that resemble delta-type outputs), the phase should cancel out when the QP-MACE filter is multiplied with the conjugate of the Fourier transform of the test image, so as to provide a large peak. The only way the phases will cancel out is if the Fourier transform of the test image is phase quantized in the same way, so that the phases cancel to produce a large peak at the origin. Therefore in the described architecture we also propose to quantize the Fourier transform of the test images as in Eq. (3).
Fig. 3. Correlation outputs: (left) full-phase MACE filter (Peak = 1.00, PSR = 66); (right) Quad-Phase MACE filter using the Four-Level Correlator (Peak = 0.97, PSR = 48)
Fig. 4. Sample images of Person 2 from the Illumination subset of PIE database captured with no background lighting
This effectively results in a Four-Level correlator in the frequency domain, where multiplication involves only performing sign changes, thus partly reducing the computational complexity of the correlation block in the authentication process. Obtaining the quad-phase MACE filters and quad-phase Fourier transform arrays is achieved very simply (we do not need to implement the if...then branches shown in Eq. (3)); we need only to extract the sign bit from each element in the array.
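The quantization of Eq. (3) and the Four-Level correlator can be sketched as follows (an illustrative example, not the authors' code; mace_filter and psr refer to the earlier sketches).

```python
import numpy as np

def quad_phase(H):
    """Quantize a frequency-domain array to the four phase levels of Eq. (3):
    +/-1 for the real part and +/-j for the imaginary part, i.e. only the
    two sign bits per frequency need to be stored."""
    return np.where(H.real >= 0, 1.0, -1.0) + 1j * np.where(H.imag >= 0, 1.0, -1.0)

def four_level_correlate(qp_filter, test_image):
    """Correlate a quad-phase filter with a test image whose Fourier
    transform is quantized in the same way (Four-Level correlator)."""
    qp_test = quad_phase(np.fft.fft2(test_image))
    return np.fft.ifft2(qp_filter * np.conj(qp_test))

# Example use (assuming the earlier sketches):
#   qp = quad_phase(mace_filter(train_images))
#   score = psr(four_level_correlate(qp, probe_image))
```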
3 Experiments Using CMU PIE Database
For applications such as face authentication, we can assume that the user will be cooperative and that he/she will be willing to provide a suitable face pose in order to be verified. However, illumination conditions cannot be controlled, especially for outdoor authentication. Therefore our focus in this paper is to provide robust face authentication in the presence of illumination changes. To test our proposed method, we used the illumination subset of the CMU PIE database containing 65 people, each with 21 images captured under varying illumination conditions. There are 2 sessions of this dataset, one captured with background lights on (easier dataset), and another captured with no background lights (harder dataset). The face images were extracted and normalized for scale using selected ground truth feature points provided with the database. The resulting face images used in our experiments were of size 100x100 pixels. We selected 3 training images from each person to build their filter. The images selected were those of extreme lighting variation, namely images 3, 7 and 16 shown in Fig. 4. The same image numbers were selected for every one of the 65 people, and a single MACE filter was synthesized for each person from those images using Eq. (1); similarly a reduced memory Quad Phase MACE filter was also synthesized using Eq. (3). For each person's filter, we performed cross-correlation with the whole dataset (65*21=1365 images) to examine the resulting PSRs for images from that person and all the other impostor faces. This was repeated for all people (a total of 88,725 cross-correlations), for each of the two illumination datasets (with and without
background lighting). We have observed in these results that there is a clear margin of separation between the authentic class and all other impostors (shown as the bottom line plot depicting the maximum impostor PSR among all impostors) for all 65 people, yielding 100% verification performance for both the full-complexity MACE filters and the reduced-memory Quad-Phase MACE filters. Figure 5 shows a PSR comparison plot for both types of filters for Person 2 for the dataset that was captured with background lights on (note that this plot is representative of the comparison plots of the other people in the database). Since this is the easier illumination dataset, it is reasonable that we observe that the authentic PSRs are very high in comparison to Figure 6, which shows the comparison plot for the harder dataset captured with no lights on. The 3 distinct peaks shown are those that belong to the 3 training images (3, 7, 16) used to synthesize Person 2's filter. We observe that while there is a degradation in PSR performance using QP-MACE filters, this degradation is non-linear, i.e. the PSR degrades more for the very large PSR values of the full-complexity MACE filters (but still provides a large margin of separation from the impostor PSRs), whereas this is not the case for the low PSR values resulting from the original full-complexity MACE filters. We see that for the impostor PSRs, which are in the 10 PSR range and below, QP-MACE achieves very similar performance to the full-phase MACE filters. Another very important observation that was consistent throughout all 65 people is that the impostor PSRs are consistently below some threshold (e.g. 12 PSR). This observed upper bound is irrespective of illumination or facial expression change as reported in [4]. This property makes MACE type correlation filters ideal for verification, as we can select a fixed global threshold, above which the user gets authorized, and this is irrespective of what type of distortion occurs, and even irrespective of the person to be authorized. In contrast, however, this property does not hold in other approaches such as traditional Eigenface or IPCA methods, whose residue or distance to face space is highly dependent on any illumination changes.
Fig. 5. PSR plot for Person 2 comparing the performance of full-complexity MACE filters and the reduced-complexity Quad Phase MACE filter using the Four-Level Correlator on the easier illumination dataset that was captured with background lights on
Fig. 6. PSR plot for Person 2 comparing the performance of full-complexity MACE filters and the reduced-complexity Quad Phase MACE filter using the Four-Level Correlator on the harder illumination dataset that was captured with background lights off
Fig. 7. (left) PSF of full-phase MACE filter; (right) PSF of QP-MACE filter
Examining the point spread functions (PSF) of the full-phase MACE filter and the QP-MACE filter shows that they are very similar, as illustrated in Fig. 7 for Person 2. Since the magnitude response of the QP-MACE filter is unity at all frequencies, it effectively acts as an all-pass filter. This explains why we are able to see more salient features (lower spatial frequency features) of the face, while in contrast the full-complexity MACE filter emphasizes higher spatial frequencies; hence we are able to see only edge outlines of the mouth, nose, eyes and eyebrows. MACE filters work as well as they do in the presence of illumination variations because they emphasize higher spatial frequency features such as outlines of the nose, eyes, mouth, their size and the relative geometrical structure between these features on the face. The majority of illumination variations affect the lower spatial frequency content of images, and these frequencies are attenuated by the MACE filters; hence the output is unaffected. Shadows, for example, will introduce new features that have higher spatial frequency content; however, MACE filters look at the whole image and do not focus on any one single feature, thus these types of filters provide a graceful degradation in performance as more distortions occur.
4 Conclusions
We have shown that our proposed Quad Phase MACE (QP-MACE) filters perform comparably to the full-complexity MACE filters, achieving 100% verification rates on both illumination datasets of the CMU PIE database using only 3 training images. These Quad-Phase MACE filters occupy only 2 bits per frequency (essentially 1 bit each for the real and imaginary component). Assuming that the full-phase MACE filter uses 32-bit components, occupying 64 bits per frequency for complex data, the proposed Quad-Phase MACE filters require only 2 bits per frequency, achieving a compression ratio of up to 32:1. A 64x64 pixel biometric filter will only require 1 Kilobyte of memory for storage, making this scheme ideal for implementation on limited memory devices. This research is supported in part by SONY Corporation.
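The storage figures quoted above can be checked with a few lines of arithmetic (illustrative only):

```python
# Bytes needed for a 64x64 frequency-domain biometric filter.
full_bytes = 64 * 64 * 2 * 32 // 8   # 32-bit real + imaginary parts -> 32768 bytes (32 KB)
quad_bytes = 64 * 64 * 2 * 1 // 8    # one sign bit per component    ->  1024 bytes (1 KB)
ratio = full_bytes // quad_bytes     # 32:1 compression
```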
References
[1] A. Mahalanobis, B.V.K. Vijaya Kumar, and D. Casasent: Minimum average correlation energy filters. Appl. Opt. 26, pp. 3633-3640, 1987.
[2] T. Sim, S. Baker, and M. Bsat: The CMU Pose, Illumination, and Expression (PIE) Database of Human Faces. Tech. Report CMU-RI-TR-01-02, Robotics Institute, Carnegie Mellon University, January 2001.
[3] A. VanderLugt: Signal detection by complex spatial filtering. IEEE Trans. Inf. Theory 10, pp. 139-145, 1964.
[4] M. Savvides, B.V.K. Vijaya Kumar and P. Khosla: Face verification using correlation filters. Proc. of Third IEEE Automatic Identification Advanced Technologies, Tarrytown, NY, pp. 56-61, 2002.
[5] B.V.K. Vijaya Kumar, M. Savvides, K. Venkataramani and C. Xie: Spatial Frequency Domain Image Processing for Biometric Recognition. Proc. of Intl. Conf. on Image Processing (ICIP), Rochester, NY, 2002.
[6] B.V.K. Vijaya Kumar: Tutorial survey of composite filter designs for optical correlators. Applied Optics 31, 1992.
[7] R. Brunelli, T. Poggio: Template Matching: Matched Spatial Filters and beyond. Pattern Recognition, Vol. 30, No. 5, pp. 751-768, 1997.
[8] S. Unnikrishna Pillai and Brig Elliott: Image Reconstruction from One Bit of Phase Information. Journal of Visual Communication and Image Representation, Vol. 1, No. 2, pp. 153-157, 1990.
[9] A. V. Oppenheim and J. S. Lim: The importance of phase in signals. Proc. IEEE 69, pp. 529-532, 1981.
[10] B.V.K. Vijaya Kumar: A Tutorial Review of Partial-Information Filter Designs for Optical Correlators. Asia-Pacific Engineering Journal (A), Vol. 2, No. 2, pp. 203-215, 1992.
Component-Based Face Recognition with 3D Morphable Models
Jennifer Huang 1, Bernd Heisele 1,2, and Volker Blanz 3
1 Center for Biological and Computational Learning, M.I.T., Cambridge, MA, USA
[email protected]
2 Honda Research Institute US, Boston, MA, USA
[email protected]
3 Computer Graphics Group, Max-Planck-Institut, Saarbrücken, Germany
[email protected]
Abstract. We present a novel approach to pose and illumination invariant face recognition that combines two recent advances in the computer vision field: component-based recognition and 3D morphable models. First, a 3D morphable model is used to generate 3D face models from three input images from each person in the training database. The 3D models are rendered under varying pose and illumination conditions to build a large set of synthetic images. These images are then used to train a component-based face recognition system. The resulting system achieved 90% accuracy on a database of 1200 real images of six people and significantly outperformed a comparable global face recognition system. The results show the potential of the combination of morphable models and component-based recognition towards pose and illumination invariant face recognition based on only three training images of each subject.
1 Introduction
The need for a robust, accurate, and easily trainable face recognition system becomes more pressing as real world applications such as biometrics, law enforcement, and surveillance continue to develop. However, extrinsic imaging parameters such as pose, illumination and facial expression still cause much difficulty in accurate recognition. Recently, component-based approaches have shown promising results in various object detection and recognition tasks such as face detection [7, 4], person detection [5], and face recognition [2, 8, 6, 3]. In [3], we proposed a Support Vector Machine (SVM) based recognition system which decomposes the face into a set of components that are interconnected by a flexible geometrical model. Changes in the head pose mainly lead to changes in the position of the facial components, which could be accounted for by the flexibility of the geometrical model. In our experiments, the component-based system consistently outperformed global face recognition systems in which classification was based on the whole face pattern. A major drawback of the system was the need for a large number of training images taken from different viewpoints
and under different lighting conditions. These images are often unavailable in real-world applications. In this paper, the system is further developed through the addition of a 3D morphable face model to the training stage of the classifier. Based on only three images of a person’s face, the morphable model allows the computation of a 3D face model using an analysis by synthesis method [1]. Once the 3D face models of all the subjects in the training database are computed, we generate arbitrary synthetic face images under varying pose and illumination to train the component-based recognition system. The outline of the paper is as follows: Section 2 briefly explains the generation of 3D head models. Section 3 describes the component-based face detector trained from the synthetic images. Section 4 describes the component-based face recognizer, which was trained from the output of the component-based face detection unit. Section 5 presents the experiments on component-based and global face recognition. Finally, Section 6 summarizes results and outlines future work.
2 Generation of 3D Face Models
We first generate 3D face models based on three training images of each person (frontal, half-profile and profile view). An example of the triplet of training images used in our experiments is shown in the top row of Figure 1. The bottom row shows two synthetic images created by rendering the newly generated 3D face model. The main idea behind the morphable model approach is that given a sufficiently large database of 3D face models any arbitrary face can be generated by morphing the ones in the database. An initial database of 3D models was
Fig. 1. Generation of the 3D model. The top images were used to compute the 3D model. The bottom images are synthetic images generated from the 3D model
Component-Based Face Recognition with 3D Morphable Models
29
built by recording the faces of 200 subjects with a 3D laser scanner. Then 3D correspondences between the head models were established in a fully automatic way using techniques derived from optical flow computation. Based on these correspondences, a new 3D face model can be generated by morphing between the existing models in the database. To create a 3D face model from a set of 2D face images, an analysis by synthesis loop is used to find the morphing parameters such that the rendered images of the 3D model are as close as possible to the input images. A detailed description of the morphable model approach including the analysis by synthesis algorithm can be found in [1]. Using the 3D models, synthetic images such as the ones in Figure 2 can easily be created by rendering the models. The 3D morphable model also provides the full 3D correspondence between the head models, which allows for automatic extraction of facial components.
Fig. 2. Synthetic training images. Synthetic face images were generated from the 3D head models under different illuminations (top row) and different poses (bottom row)
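As a minimal illustration of the morphing idea (not the analysis-by-synthesis fitting of [1], which is considerably more involved), a new head model can be formed as a convex combination of exemplar models that are already in full vertex-wise correspondence; the function name and array layout below are assumptions.

```python
import numpy as np

def morph(shapes, textures, coeffs):
    """Morph between registered exemplar head models.

    shapes:   array (M, 3V) - M exemplar shape vectors (x, y, z per vertex)
    textures: array (M, 3V) - M exemplar texture vectors (r, g, b per vertex)
    coeffs:   array (M,)    - morphing coefficients, normalised to sum to 1
    """
    a = np.asarray(coeffs, dtype=np.float64)
    a = a / a.sum()
    return a @ shapes, a @ textures
```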
3 Component-Based Face Detection
The component-based detector detects the face in a given input image and extracts the facial components which are later used to recognize the face. We used the two-level component-based face detection system described in [3]. The architecture of the system is schematically shown in Figure 4. The first level consists of fourteen independent component classifiers (linear SVMs). Each component classifier was trained on a set of extracted facial components¹ and on a set of randomly selected non-face patterns. As mentioned in the previous section, the components could be automatically extracted from the synthetic images since the full 3D correspondences between the face models were known. Figure 3 shows examples of the fourteen components for three training images. On the second level, the maximum continuous outputs of the component classifiers within rectangular search regions around the expected positions of the components were
¹ The shape of the components was learned by an algorithm described in [4] to achieve optimal detection results.
Fig. 3. Examples of the fourteen components extracted from a frontal view and half-profile view of a face
Fig. 4. System overview of the component-based face detector. A 58x58 window is shifted over the input image; component experts (linear SVMs for the left eye, nose, mouth, etc.) are shifted over the window, and for each component k the maximum expert output within its search region and its location (Ok, Xk, Yk) are determined; the combined outputs (O1, X1, Y1, ..., O14, X14, Y14) are fed to a combination classifier (linear SVM) for the final face/background decision
used as inputs to a combination classifier (linear SVM), which performed the final detection of the face.
4 Component-Based Face Recognition
The component-based face recognizer uses the output of the face detector in the form of extracted components. First, synthetic faces were generated at a resolution of 58×58 for the six subjects by rendering the 3D face models under varying pose and illumination. Specifically, the faces were rotated in depth from 0◦ to 34◦ in 2◦ increments and rendered with two illumination models at each pose. The first model consisted of ambient light alone. The second model included
Fig. 5. Composite of the nine components retained for face recognition
Fig. 6. Histogram-equalized inner face region, added as a tenth component for face recognition
ambient light and a directed light source, which was pointed at the center of the face and positioned between −90° and 90° in azimuth and 0° and 75° in elevation. The angular position of the directed light was incremented by 15° in both directions. From the fourteen components extracted by the face detector, only nine components were used for face recognition. Five components were eliminated because they strongly overlapped with other components or contained little gray value structure (e.g. cheeks). Figure 5 shows the composite of the nine extracted components used for face recognition for some example images. In addition, a histogram-equalized inner face region was added to improve recognition². Some examples of this inner face region are shown in Figure 6. The component-based face detector was applied to each synthetic face image in the training set to detect the components and thereby the facial region. Histogram equalization was then performed on the bounding box around the components. The gray pixel values of each component were then taken from the histogram equalized image and combined into a single feature vector.
² The location of the face region was computed by taking the bounding box around the other nine detected components and then subtracting from the larger edge to form a square component. This square was then normalized to 40×40 and histogram-equalized.
Fig. 7. Examples of the real test set. Note the variety of poses and illumination conditions ture vectors were constructed for each person, and corresponding classifiers were trained. A face recognition system consisting of second-degree polynomial SVM classifiers was trained on these feature vectors in a one vs. all approach. In other words, an SVM was trained for each subject in the database to separate her/him from all the other subjects. To determine the identity of a person at runtime, we compared the normalized outputs of the SVM classifiers, i.e. the distances to the hyperplanes in the feature space. The identity associated with the face classifier with the highest normalized output was taken to be the identity of the face.
5 Results
A test set was created by taking images of the six people in the database. The subjects were asked to rotate their faces in depth and the lighting conditions were changed by moving a light source around the subject. The test set consisted of 200 images of each person under various pose and illumination conditions. Figure 7 contains examples of the images in the test set³. The component-based face recognition system was compared to a global face recognition system; both systems were trained and tested on the same images. In contrast to the component-based classifiers, the input vector to the whole-face detector and recognizer consisted of the histogram-equalized gray values from the entire 58×58 facial region⁴. The resulting ROC curves of global and component-based recognition on the test set can be seen in Figure 8. The component-based system achieved a recognition rate of 90%, which is approximately 50% above the recognition rate of the global system.
³ Training and test set will be made available on our website upon publication.
⁴ For a detailed description of the whole face system see [3].
Fig. 8. ROC curves for the component-based and the global face recognition systems (x-axis: false recognition rate, y-axis: recognition rate). Both systems were trained and tested on the same data
This large discrepancy in results can be attributed to two main factors: First, the components of a face vary less under rotation than the whole face pattern, explaining why the component-based recognition is more robust against pose changes. Second, in contrast to the training data, the backgrounds in the test images were non-uniform. Component-based recognition only used face parts as input features for the classifier while the input features to the global system occasionally contained distracting background parts.
6 Conclusion and Future Work
This paper presented a new development in component-based face recognition by the incorporation of a 3D morphable model into the training process. This combination allowed the training of a face recognition system which required only three face images of each person. From these three images, 3D face models were computed and then used to render a large number of synthetic images under varying poses and lighting conditions. The synthetic images were then used to train a component-based face detection and recognition system. A global face detection and recognition system was also trained for comparison. Results on 1200 real images of six subjects show that the component-based recognition system clearly outperforms a comparable global face recognition system. Component-based recognition was at around 90% for faces rotated up to
approximately half-profile in depth. Future work includes increasing the size of the database and the range of the pose and illumination conditions which the system can handle.
References
[1] V. Blanz and T. Vetter. A morphable model for synthesis of 3D faces. In Computer Graphics Proceedings SIGGRAPH, pages 187–194, Los Angeles, 1999.
[2] R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.
[3] B. Heisele, P. Ho, and T. Poggio. Face recognition with support vector machines: global versus component-based approach. In Proc. 8th International Conference on Computer Vision, volume 2, pages 688–694, Vancouver, 2001.
[4] B. Heisele, T. Serre, M. Pontil, and T. Poggio. Component-based face detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 657–662, Hawaii, 2001.
[5] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 23, pages 349–361, April 2001.
[6] A. V. Nefian and M. H. Hayes. An embedded HMM-based approach for face detection and recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, pages 3553–3556, 1999.
[7] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 746–751, 2000.
[8] L. Wiskott. Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis. PhD thesis, Ruhr-Universität Bochum, Bochum, Germany, 1995.
A Comparative Study of Automatic Face Verification Algorithms on the BANCA Database
M. Sadeghi, J. Kittler, A. Kostin, and K. Messer
Centre for Vision Speech and Signal Processing, School of Electronics and Physical Sciences, University of Surrey, Guildford GU2 7XH, UK
{m.sadegi,j.kittler,a.kostin,k.messer}@surrey.ac.uk
http://www.ee.surrey.ac.uk/CVSSP/
Abstract. The performance of different face identity verification methods on the BANCA database is compared. As part of the comparison, we investigate the effect of representation on different approaches to face verification. Two conventional dimensionality reduction methods, namely Principal Component Analysis and Linear Discriminant Analysis, are studied as well as the use of the raw image space. The results of the comparison show that when the training set size is limited, a better performance is achieved using the Normalised Correlation method in the LDA space, while the Support Vector Machine classifier is superior when a large enough training set is available. Moreover, the SVM is almost insensitive to the choice of representation. However, a dimensionality reduction can be beneficial if constraints on the size of the template are imposed.
1 Introduction
The European Project Banca is developing the technology for multimodal biometric access control to teleservices. The focus in the project is on conventional biometrics such as voice characteristics and frontal face. Voice and face have the advantage that they are the biometric modalities used for personal identity authentication by humans. There is, therefore, a particular interest in making these technologies successful. This paper presents a study of 3 face verification algorithms on a new database recorded by the project, referred to as the Banca database. The problem of face verification involves four basic subproblems: face detection and localisation, geometric and photometric normalisation, face representation, and decision making. In this paper we shall not be concerned with the first two stages of the face verification process. Instead, our focus is on the last two stages with emphasis on face representation. The conventional wisdom in face authentication is to project the face image into a lower dimensional space. The motivation for dimensionality reduction is multifold. Image data is inherently of high dimensionality and designing a verification system in the image space would lead to a computationally complex
decision rule. There is also the argument based on the peaking phenomenon, which dictates that the ratio of training set size and pattern dimensionality should be of an order of magnitude to prevent over-training. As training sets available for face verification system design are invariably small, a significant reduction in dimensionality is normally sought. Third, the face image data is very highly correlated. The use of classical pattern recognition approaches on such data sets leads to unstable decision rules which generalise extremely poorly to unseen patterns. The classical methods used for face representation are the Principal Component Analysis [8] (PCA) and Linear Discriminant Analysis (LDA) [1]. The former method projects the input image into eigenfaces which decorrelate the image data features, whereas the latter maps the input image into fisher faces which maximise the class separability. Recently, the Independent Component Analysis (ICA) [6] has been investigated in the context of face representation. However, the disadvantage of the ICA technique is that there is no natural way to identify which and how many of the ICA axes should be used to define the dimensionality reducing transformation. The above arguments for dimensionality reduction do not hold for the Support Vector Machine approach to pattern recognition. If a decision rule is trained by minimising the structural risk, the peaking phenomenon is not exhibited. Consequently, there is no need for dimensionality reduction prior to classifier design. This suggests that, using SVMs, one can design a face verification rule directly in the image space without incurring any additional classification error. It is then pertinent to pose the question whether there are any benefits at all in performing a dimensionality reduction and conducting the SVM verification process in the feature space. The aim of this paper is to address this question and to compare the performance of an SVM decision rule in the original, PCA and LDA spaces. Such studies have been carried out earlier, e.g. in [4, 2]. We complement this work in several respects. First of all we focus the comparison on the relationship of the original image space and PCA/LDA subspaces. Second, we shall investigate the effect of the dimensionality of the representation space on the verification performance. This is also the first time that we report experimental results for fully automatically registered Banca database probe images. The paper is organised as follows. In the next section the methods compared are overviewed. In Section 3, the experimental design, data sets and protocol are introduced first. The results are then presented and discussed. The paper is drawn to conclusion in Section 4.
2 Experimental Design
We compare three methods of face verification in three representation spaces. The verification methods are the Support Vector Machine (SVM) which has been detailed in [5], and the conventional decision rules based on the Euclidean distance (ED) and the normalised correlation (NC). The representation spaces
explored include the original image space, PCA and LDA. Since in the Banca project we are interested in the open-set verification problem, where new clients can be enrolled without having to recompute the feature spaces, we used the XM2VTS database [7] to estimate the matrix of second order statistical moments which define the dimensionality reducing transformations. Experiments with the above classifiers and representation spaces were conducted on the Banca database for fully automatically registered images. For benchmarking purposes we also obtained the results for manually registered images to gauge the sensitivity of the verification methods to image misregistration.
The BANCA database used in the study has been designed in order to test multi-modal identity verification systems under different scenarios. Different cameras and microphones have been used in different environments to create three different scenarios: Controlled, Degraded and Adverse. The database has been recorded in several languages in different countries. Two sections of the data, the English and French databases, are now available. Each section contains 52 subjects (26 males and 26 females). Each subject participated in 12 recording sessions in different conditions and with different cameras. Sessions 1-4 contain data under Controlled conditions while sessions 5-8 and 9-12 contain the Degraded and Adverse scenarios respectively. Figure 1 shows a few examples of the face data. Each session contains two recordings per subject, a true client access and an informed impostor attack. For the face image database, 5 frontal face images have been extracted from each video recording to be used as client images, and 5 impostor ones. In order to create more independent experiments, images in each session have been divided into two groups of 26 subjects (13 males and 13 females). Thus, considering the subjects' gender, each session can be divided into 4 groups.
In the BANCA protocol, 7 distinct experimental configurations have been specified, namely Matched Controlled (MC), Matched Degraded (MD), Matched Adverse (MA), Unmatched Degraded (UD), Unmatched Adverse (UA), Pooled test (P) and Grand test (G). Figure 2 describes the usage of the different sessions in each configuration. "TT" refers to the client training and impostor test session, and "T" depicts client and impostor test sessions. As mentioned, 4 groups of data can be considered in each session. The decision function can be trained using only 5 client images per person from the same group and all client images from the other groups. The original resolution of the image data is 720 × 576. The experiments were performed with relatively low resolution face images, namely 61 × 57. The method discussed in [3] was used for the automatic localisation of the eye centres. Histogram equalisation was used to normalise the registered face photometrically. The thresholds in the decision making system have been determined based on the Equal Error Rate criterion, i.e. where the false rejection rate (FRR) is equal to the false acceptance rate (FAR).
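As an illustration of the scoring and thresholding just described, the sketch below computes a normalised correlation score between a probe and a client template (both assumed to be already projected into the chosen representation space) and picks a decision threshold with the Equal Error Rate criterion. The function names are assumptions, not the Banca reference implementation.

```python
import numpy as np

def normalised_correlation(probe, template):
    """NC score between a probe vector and a client template."""
    return float(np.dot(probe, template) /
                 (np.linalg.norm(probe) * np.linalg.norm(template)))

def eer_threshold(client_scores, impostor_scores):
    """Threshold where the false rejection rate is closest to the false acceptance rate."""
    client_scores = np.asarray(client_scores)
    impostor_scores = np.asarray(impostor_scores)
    best_t, best_gap = None, np.inf
    for t in np.sort(np.concatenate([client_scores, impostor_scores])):
        frr = np.mean(client_scores < t)      # clients wrongly rejected
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        if abs(frr - far) < best_gap:
            best_t, best_gap = t, abs(frr - far)
    return best_t
```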
Fig. 1. Examples of the BANCA database images. Top: Controlled, Middle: Degraded and Bottom: Adverse scenarios
Fig. 2. The usage of the different sessions in the BANCA experimental configurations ("TT": clients training and impostor test, "T": clients and impostor test)
3 Experimental Results
Table 1 contains a summary of the results obtained on the test set when manually annotated eye positions were used for the face registration. The values in the table indicate the FAR, FRR and Half Total Error Rates (HTER), i.e. the average of FAR and FRR. The experimental results in the original, PCA and LDA spaces are reported in the table for each scenario and verification method. In the original space, in most of the cases, a better performance is achieved using the SVM method. The main exceptions are the unmatched scenarios (UD and UA) where the methods based on the scoring functions (ED and NC) are better. However, projecting the data into the LDA space significantly improves the performance of the method based on the normalised correlation scores, to the point that overall, the best performance is achieved by means of the NC method in the LDA space. The only exception to this is the protocol configuration G where the SVM classifier performs better. It seems that the SVM verification method is particularly good when the size of the data set used for training this decision rule is reasonably large. The performance of the SVM method in such a condition has been improved by projecting the data into the LDA space. Another interesting observation is that when the training set is not large enough but the environmental conditions are the same (matched scenarios), the SVM may be aided by projecting the probe image into an LDA feature space. In the other cases, these results demonstrate that projecting the image data into the PCA and LDA spaces does not improve the performance of the system based on the SVM method. Note that, in these experiments, the image size and shape have been optimised for the benefit of the SVM classifier. Slightly better results than those reported
Table 1. ID verification results on the BANCA test configurations using different methods of decision making (Euclidean Distance, Normalised Correlation and Support Vector Machine) in different representation spaces. Eyes were localised manually for the face registration purpose. FAR: False Acceptance Rate, FRR: False Rejection Rate and HTER: Half Total Error Rate

                        E.D.                    N.C.                    SVM
               FAR    FRR    HTER      FAR    FRR    HTER      FAR    FRR    HTER
MC  Original  14.90  12.56  13.73     14.81  12.44  13.62      1.63   9.23   5.43
    PCA       13.75  13.20  13.47      9.42   8.72   9.07      2.98   8.85   5.91
    LDA       10.86  10.89  10.88      4.23   5.64   4.93      1.06  10.77   5.91
MD  Original  16.63  17.56  17.10     16.54  17.56  17.05      3.37  16.79  10.08
    PCA       16.82  17.05  16.93     12.69  13.33  13.01      4.23  17.31  10.77
    LDA       20.38  21.41  20.89      6.92   8.20   7.56      1.73  13.33   7.53
MA  Original  16.73  18.33  17.53     16.73  17.95  17.34      4.52  19.10  11.81
    PCA       15.38  17.43  16.41     13.65  15.51  14.58      4.62  19.23  11.92
    LDA       21.25  21.28  21.26      8.17   7.31   7.74      3.37  15.90   9.63
UD  Original  22.02  23.33  22.68     22.12  23.08  22.60      4.33  46.54  25.43
    PCA       20.86  21.79  21.33     19.61  21.41  20.51      5.19  42.44  23.81
    LDA       27.88  26.02  26.95     15.57  16.41  15.99      1.73  58.85  30.29
UA  Original  26.35  27.05  26.70     25.58  27.05  26.31      5.10  55.13  30.11
    PCA       27.21  28.46  27.83     28.55  29.10  28.83      6.06  53.33  29.70
    LDA       29.90  28.33  29.11     21.25  19.23  20.24      1.35  58.72  30.03
P   Original  27.18  26.41  26.79     26.57  26.62  26.60      3.69  36.97  20.33
    PCA       26.53  26.23  26.38     21.41  21.53  21.47      4.74  34.87  19.81
    LDA       24.10  24.74  24.42     14.58  15.00  14.79      1.38  42.78  22.08
G   Original  16.06  17.05  16.55     16.03  17.14  16.58      3.62   4.87   4.25
    PCA       13.62  15.29  14.46     11.86  12.39  12.12      3.65   5.09   4.37
    LDA       21.41  21.11  21.26      5.19   5.25   5.22      2.40   4.74   3.57
here can be obtained with Normalised Correlation if the geometric normalisation is optimised for this matching rule. As we wanted to retain as many parameters as possible fixed for all the matching methods, this inevitably degraded the performance of some of them. One interesting difference between SVM and NC methods is the effect of the scenarios on the false rejection and false acceptance rates. As mentioned, the classifier parameters have been set using the Equal Error Rate criterion in the training stage. In the test stage using the NC approach, when the performance degrades, both the false acceptance rate and the false rejection rate degrade in a balanced manner. In contrast, SVM seems to maintain the false acceptance rate relatively low in all scenarios. The overall rate, expressed as Half Total Error Rate, dramatically increases mainly because of a disproportionate increase in the false rejection rate. The differences in the behaviour of these two verification
Table 2. Automatic ID verification results on the BANCA test configurations. Eyes were localised automatically for the face registration purpose

                        E.D.                    N.C.                    SVM
               FAR    FRR    HTER      FAR    FRR    HTER      FAR    FRR    HTER
MC  Original  26.92  25.64  26.28     26.83  25.51  26.17      3.94  35.51  19.73
    PCA       24.81  25.25  25.03     22.69  23.20  22.95      5.58  31.79  18.69
    LDA       29.42  27.56  28.49     22.31  22.94  22.62      3.75  40.13  21.94
MD  Original  22.40  22.43  22.42     22.69  22.18  22.43      5.58  26.03  15.80
    PCA       21.53  22.05  21.79     19.80  19.10  19.45      6.35  25.38  15.87
    LDA       24.03  24.74  24.39     17.59  17.69  17.64      6.25  26.54  16.39
MA  Original  21.63  21.79  21.71     21.63  21.92  21.78      4.81  34.49  19.65
    PCA       20.76  21.92  21.34     19.71  21.66  20.68      5.48  32.18  18.83
    LDA       25.96  26.53  26.25     17.11  17.56  17.33      5.67  31.15  18.41
UD  Original  29.03  27.69  28.36     28.65  27.94  28.30      4.71  56.15  30.43
    PCA       26.15  27.31  26.73     25.28  28.97  27.13      6.06  51.92  28.99
    LDA       31.92  31.28  31.60     25.00  25.89  25.44      3.27  67.56  35.42
UA  Original  27.69  31.41  29.55     26.92  31.79  29.36      4.52  61.79  33.16
    PCA       27.69  31.41  29.55     30.28  31.15  30.72      6.15  57.05  31.60
    LDA       32.69  30.64  31.66     27.69  26.53  27.11      3.17  65.38  34.28
P   Original  30.89  30.29  30.59     30.89  29.74  30.32      4.39  51.15  27.77
    PCA       28.39  29.35  28.87     27.30  28.07  27.69      5.93  46.92  26.43
    LDA       29.77  30.76  30.27     25.41  24.91  25.16      3.40  57.69  30.54
G   Original  24.77  24.87  24.82     24.87  25.21  25.04      7.05  19.70  13.38
    PCA       23.62  23.03  23.32     19.39  20.59  19.99      7.44  19.27  13.35
    LDA       28.91  28.63  28.77     18.20  18.58  18.39      8.14  24.06  16.10
methods could be exploited in intramodal fusion, where expert diversity is the key prerequisite of any improvement in the performance of a multiple expert system. Table 2 contains the results of similar experiments in which the face registration step was based on automatically localised eye positions. These results demonstrate that in most of the cases a better performance is again achieved using the NC method within the LDA space. However, SVM seems less sensitive to localisation errors, especially for the matched scenarios. Thus, in the MC, MD and MA configurations, the results using SVM in the original or PCA space are better than or comparable with the NC results. In identity verification based on the scoring functions (ED or NC), the False Acceptance Rate and False Rejection Rate are affected equally, whereas with the SVM method, in most of the experimental configurations (such as UD, UA and P), the FRR increases dramatically. The automatic verification results for the G protocol again emphasise the superiority of the SVM classifier where the training set is large enough. The other noticeable issue is that the PCA space seems slightly less sensitive to errors in face registration than the LDA space. Although the performance of the SVM in the manually annotated data experiments might be improved by projecting
the probe image into an LDA space, in the automatic verification experiments, where the face registration error can be high, the benefits are not the same. In the next step, we wanted to investigate the effect of the dimensionality of the representation space on the verification performance. In the PCA space, the d eigenvectors corresponding to the largest eigenvalues were considered as the best d-dimensional axes. In the LDA space, the French part of the BANCA database was used for selecting the most representative features for different numbers of features, d. These subspaces were then used for the English data representation. The verification experiments were repeated in each subspace. We applied the plus-L-minus-R method of feature selection to select an optimum subset of the LDA features; a sketch of this selection procedure is given below. Figure 3 contains plots of the experimental results for the different BANCA configurations. The left plots show the results in the PCA space, while the right plots show the results in the LDA subspaces. In these experiments, face images were registered automatically. The results in the figures show an overall tendency for the performance to improve monotonically as more features are added. In most of the experiments the improvements are rapid initially, but the error rates then saturate and very little benefit is gained from increasing the dimensionality. Thus the main advantage of dimensionality reduction is in reducing the size of the template, which may be important in some applications. It cannot be overemphasised that this monotonic behaviour is characteristic only of the experimental design adopted here, where the feature selection is based on a fully independent data set comprising records of a different population of subjects. When the dimensionality reducing transformation is based on the same population of subjects as that involved in testing, the performance versus dimensionality curves often exhibit peaking, and in such cases we can not only reduce the storage required but also improve the verification system performance.
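The plus-L-minus-R procedure referred to above can be summarised as in the following minimal sketch. It assumes a scoring callback (here called evaluate) that measures verification quality of a candidate feature subset on an independent validation set; the callback name and the l, r and target_dim parameters are illustrative and not taken from the paper.

```python
def plus_l_minus_r(n_features, evaluate, l=2, r=1, target_dim=40):
    """Sequential plus-L-minus-R selection over feature indices.
    `evaluate(subset)` is assumed to return a quality score (higher is better),
    e.g. minus the HTER measured on an independent validation set."""
    selected, remaining = [], list(range(n_features))
    while len(selected) < target_dim and remaining:
        # Plus-L: greedily add the l most useful features
        for _ in range(min(l, len(remaining))):
            best = max(remaining, key=lambda f: evaluate(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        # Minus-R: greedily drop the r least useful features
        for _ in range(min(r, max(len(selected) - 1, 0))):
            worst = min(selected, key=lambda f: evaluate([g for g in selected if g != f]))
            selected.remove(worst)
            remaining.append(worst)
    return selected[:target_dim]
```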
4
Conclusions
We compared the performance of different face identity verification approaches in three different representation spaces. The study involved two conventional dimensionality reduction methods, namely the Principal Component Analysis and the Linear Discriminant Analysis which were compared with the raw image space. The comparison was carried out using a recently recorded multimodal biometric database of talking faces, known as the Banca database. The two representation spaces were determined using a population of faces which was completely independent of the set of subjects used in testing. The results of the comparison showed that overall a better performance is achieved using the Normalised Correlation method within the LDA space. However, if a large enough training set is available, the SVMs are more effective in extracting discriminatory information from the data. In contrast to the methods based on scoring functions, the SVMs are relatively insensitive to the choice of representation. In other words, the performance in the original image space defines a target which the dimensionality reducing representation asymptotically
approaches as the number of bases increases. There are minor exceptions to this behaviour. In such circumstances the PCA or LDA representation may deliver slightly better results. Nevertheless, overall, not much gain in performance can be expected from projecting the face data into a subspace. However, a dimensionality reduction can be beneficial if constraints on the size of the template are imposed.
Fig. 3. HTER vs. the number of features for different experimental protocols (left plots: PCA subspace; right plots: LDA subspace)
Acknowledgements The financial support from the EU project Banca is gratefully acknowledged.
Assessment of Time Dependency in Face Recognition: An Initial Study Patrick J. Flynn1, Kevin W. Bowyer1, and P. Jonathon Phillips2 1
Dept of Computer Science and Engineering University of Notre Dame, Notre Dame, IN 46556 USA {flynn,kwb}@nd.edu 2 National Institute of Standards and Technology 100 Bureau Dr., Stop 8940, Gaithersburg, MD 20899 USA
[email protected] Abstract. As face recognition research matures and products are deployed, the performance of such systems is being scrutinized by many constituencies. Performance factors of strong practical interest include the elapsed time between a subject’s enrollment and subsequent acquisition of an unidentified face image, and the number of images of each subject available. In this paper, a long-term image acquisition project currently underway is described and data from the pilot study is examined. Experimental results suggest that (a) recognition performance is substantially poorer when unknown images are acquired on a different day from the enrolled images, (b) degradation in performance does not follow a simple predictable pattern with time between known and unknown image acquisition, and (c) performance figures quoted in the literature based on known and unknown image sets acquired on the same day may have little practical value.
1
Introduction
Although automatic face recognition has a rich history [4], it has only recently emerged as a potentially viable component of authentication and access control systems. The research community has responded to this increased interest in various ways. Firstly, the variety of approaches to face recognition continues to broaden (e.g., new modalities [2][5][6]). Secondly, standardized approaches for assessment (and more importantly, comparison) have emerged [1] and been used in meaningful evaluations of vendor systems [7][8]. Such efforts require databases that are large enough and controlled well enough to admit meaningful statistical analyses, but also are representative of applications. The emergence of viable systems is shifting emphasis from development of new algorithmic techniques to gaining an understanding of the basic properties of face recognition systems. One of the least well-understood phenomena is variation of face appearance over short (weeks or months), medium (years), and long (decades) periods of time. This paper presents an
ongoing collection effort to support understanding the short and medium term changes in face appearance. Initial analyses are measuring the effects of temporal variation on algorithm performance and estimating the distribution of matches of images of the same person. The database will be released to the research community to support development of algorithms that are robust to temporal variations. The control of location, camera, and lighting allows variations in face appearance due to elapsed time to be investigated without masking from environmental changes. It has been observed [7,8] that face recognition systems are challenged in uncontrolled lighting (e.g., outdoors or uneven indoor ambient illumination). The database described here contains images with uncontrolled lighting that can be used as challenge data. The de facto standard in the area of performance evaluation of face identification algorithms is the FERET methodology [1]. This methodology and most subsequent work employ the concept of a training image set used to develop the identification technique, a gallery image set that embodies the set of persons enrolled in the system, and a probe image set containing images to be identified. Identification of a probe image yields a ranked set of matches, with rank 1 being the best match. Results are presented as cumulative match characteristics (CMC), where the x-axis denotes a rank threshold and the y-axis is the fraction of experiments that yield a correct match at ranks equal to or lower than the threshold. General aspects of the FERET methodology include precise specification of training, gallery, and probe image sets drawn from a large database of face images, defined methods for computing performance metrics, and sequestration of test set images until after the test is performed. In FERET tests in March of 1997, two algorithms were able to achieve 95% or greater correct identification of the rank-one match, based on a gallery of 1196 neutral-expression face images and a probe set of 1195 alternative-expression face images of the same subjects taken on the same day (Figure 3 of [1]). These algorithms operated in partially automatic mode, meaning that the eye coordinate locations were manually identified and supplied to the algorithms. When a somewhat smaller probe set was used, containing normal-expression images taken in a different image acquisition session, all algorithms scored less than 60% correct on a rank-one match (Figure 4 of [1]). This dramatic performance difference clearly points to the importance of studying how face identification performance changes as probe images are acquired at varying lengths of time separation from the gallery images. The complete FERET database was assembled in 1996 and has 14,126 images from 1,199 subjects [1]. Face images taken for one subject at one image acquisition session typically include multiple standard lighting conditions and multiple facial expressions. Some subjects participated in multiple image acquisition sessions separated by as much as two years in time, but following subjects over time was not a specific focus of the FERET effort. Only a handful of subjects participated in as many as ten different sessions. There are a number of face database efforts described in the literature. The XM2VTS database assembled at Surrey [10] contains 295 subjects with images taken at one-month intervals and has been used for face authentication (verification) research. 
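The CMC reporting described above can be reproduced with a few lines of code; the sketch below is a minimal illustration under the closed-universe assumption, with hypothetical array names (distances, gallery_ids, probe_ids) rather than names taken from any particular toolkit.

```python
import numpy as np

def cmc(distances, gallery_ids, probe_ids):
    """Cumulative match characteristic from a (n_probes x n_gallery) distance matrix.
    Returns cmc[k] = fraction of probes whose correct gallery match appears at
    rank k+1 or better (rank 1 = closest gallery image)."""
    order = np.argsort(distances, axis=1)          # gallery indices, best first
    ranked_ids = np.asarray(gallery_ids)[order]    # identities in ranked order
    hits = ranked_ids == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)                # closed universe: a hit always exists
    n_gallery = distances.shape[1]
    return np.array([(first_hit <= k).mean() for k in range(n_gallery)])
```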
The PIE database at Carnegie Mellon University [11] contains 41,368 images of 68 people collected over a three-month period, reflecting a large variety of
poses and lighting conditions. Other databases assembled recently include the AR [12] and Oulu databases [13].
2
Data Collection
To assess the performance of face recognition systems under large elapsed time, a suitable database of such imagery must be obtained. Such a collection will be assembled during 2002-2004. The acquisition plan has the following elements:
a. Weekly acquisitions of each subject.
b. At a minimum, four high-resolution color images (two facial expressions, two controlled studio lighting configurations) and two images with "unstructured" lighting will be taken of each subject.
c. Subjects will participate in the study as long as possible over the two-year period.
Fig. 1. Ten FERET-style images of one subject taken over a period of eleven weeks
Fig. 2. Two representative "unstructured" images from the Spring 2002 pilot collection
In Spring 2002, a pilot acquisition study was undertaken to obtain experience with a large-scale study as well as to prompt the development of necessary hardware and software support. The Spring 2002 study involved ten acquisition sessions conducted over an eleven-week period (spring break occurred in the middle of the project). Color images were acquired with Sony MVC-95 cameras, which provided 1600x1200 images in JPEG format with minimal compression artifacts visible. Ten views of a single subject (one from each week of the pilot study) appear in Fig. 1. Three SmithVictor A120 lights with Sylvania Photo-ECA bulbs provided studio lighting. The lights were located approximately eight feet in front of the subject; one was approximately four feet to the left, one was centrally located, and one was located four feet to the right. All three lights were trained on the subject face. One lighting configuration had the central light turned off and the others on. This will be referred to as “FERET style lighting” or “LF”. The other configuration has all three lights on; this will be called “mugshot lighting” or “LM”. In nine of the ten weeks of the Spring 2002 study, two additional images were obtained for each subject, under less wellcontrolled lighting and camera configuration. Generally, these images were taken in a hallway outside the laboratory, with a different camera and subject position each week. Fig. 2 shows two representative images of this “unstructured” type; the subject is the same as that in Fig. 1. The Spring 2002 study yielded 3378 color images. Fig. 3 depicts the number of subjects participating in each week of the project. Eye coordinates in each image were selected manually and used for image registration in the PCA system described below.
3
Experimental Designs and Results
In this section, we describe a series of experiments designed to investigate the baseline performance of a face recognition system employing the data described in Section 2, and to investigate performance variations due to elapsed time. The software suite used in these experiments was developed at Colorado State University [3]. In keeping with the evaluation methodology introduced in the FERET study [1], each experiment is characterized by three image sets, all disjoint.
Fig. 3. Participation counts per week
a. The training set is used to form the projection matrix used in the PCA technique; in the experiments reported here, 1115 images drawn from the FERET data set (neutral facial expression) formed the training set. We experimented with other sets of training data and did not discover significant variations in performance. (A sketch of this PCA pipeline follows the list.)
b. The gallery set contains the set of "enrolled" images of subjects to recognize. All galleries used in this paper were drawn from the Spring 2002 image acquisitions described above.
c. The probe set is a set of images to be recognized via matching against the gallery. We employ a closed universe assumption (every probe will have a corresponding match in the gallery). Probes were drawn from the Spring 2002 database discussed above.
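A minimal sketch of how the three image sets interact in a PCA-based identification run is given below; the function and array names are illustrative, and the plain Euclidean ranking shown here is only one of the metrics compared in the experiments that follow.

```python
import numpy as np

def train_pca(training_images, n_components=100):
    """Form the PCA projection matrix from the training set.
    `training_images` is an (n_images x n_pixels) array of registered faces."""
    mean = training_images.mean(axis=0)
    # SVD of the centered data gives the eigenvectors of the covariance matrix
    _, _, vt = np.linalg.svd(training_images - mean, full_matrices=False)
    return mean, vt[:n_components]                 # basis: (n_components x n_pixels)

def project(images, mean, basis):
    """Project gallery or probe images into the PCA subspace."""
    return (images - mean) @ basis.T

def identify(probe_coeffs, gallery_coeffs, gallery_ids):
    """Closed-universe identification: rank gallery entries by distance to each probe."""
    d = np.linalg.norm(probe_coeffs[:, None, :] - gallery_coeffs[None, :, :], axis=2)
    return np.asarray(gallery_ids)[np.argsort(d, axis=1)]   # ranked identities per probe
```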
Gallery and probe selections were made to allow straightforward experimental comparisons akin to those typically reported in the literature. For the sake of brevity, we report four such experiments in this paper. However, the availability of several weeks of data on the same subjects allows for other studies explicitly addressing time dependence. 3.1
Experiment 1
The scenario for this experiment is a typical enroll-once identification setup. The 62 gallery images were neutral-expression, LF images of all subjects photographed in session 1 of the Spring 2002 study. The 438 probe images were all neutralexpression, LF images of subjects in sessions 2 through 10 of the Spring 2002 study. Hence, this experiment controls for same lighting and type of expression. For each subject, there is one enrolled gallery image and up to nine probe images, each acquired in a distinct later session. Figure 4 shows a cumulative match characteristic (CMC) plot for this recognition study. Since the Colorado State PCA software supports several metrics for subspace matching, we experimented with several choices. This plot illustrates a striking difference in performance between the straightforward Euclidean metric and the Mahalanobis metric that attempts to factor out scaling and correlation effects. The approximately 95% first-rank recognition result using the Mahalanobis angle metric is an encouraging result. The quality of the Mahalanobis angle metric has been noted by other researchers [3,9]. 3.2
Experiment 2
This experiment controls for different expressions in the gallery and probe sets. The gallery used was identical to that of Experiment 1. The 438 probe images were alternate-expression FERET-lit subjects. CMC curves are depicted in Figure 5. As expected, performance in this experiment degrades significantly in comparison to Experiment 1, when the same expression is used in gallery and probe. Note that the performance degradation of the Euclidean metric is more than twice that of the Mahalanobis angle metric. This suggests that the Mahalanobis angle metric is effectively normalizing out some of the variation due to the change in expression. However, the performance degradation of more than 10% for the Mahalanobis angle metric indicates that more needs to be done to handle variation in expression.
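The two subspace metrics compared in these experiments can be written, in one common formulation, as in the sketch below; this is a sketch rather than the exact definition used by the Colorado State software, and the eigenvalues passed to the whitening step are assumed to come from the PCA training stage.

```python
import numpy as np

def euclidean_distance(a, b):
    """Plain Euclidean distance between two PCA coefficient vectors."""
    return np.linalg.norm(a - b)

def mahalanobis_angle_distance(a, b, eigenvalues):
    """'Mahalanobis angle': whiten each coordinate by its eigenvalue, then use
    the negative cosine of the angle, so smaller values mean a better match."""
    w = 1.0 / np.sqrt(eigenvalues)          # per-axis whitening weights
    aw, bw = a * w, b * w
    return -np.dot(aw, bw) / (np.linalg.norm(aw) * np.linalg.norm(bw))
```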
Fig. 4. CMC plot for experiment 1
Fig. 5. CMC plot for Experiment 2
Fig. 6. Rank-1 correct match percentage for ten different delays between gallery and probe acquisition (x-axis: delay in weeks between gallery and probe acquisition)
Fig. 7. Rank-1 correct match percentage for eight experiments with gallery and probe separated by one week (x-axis: session weeks)
3.3
Experiment 3
Experiment 3 was designed to reveal any obvious effect of elapsed time (between gallery and probe acquisition) on performance. The experiment consisted of nine subexperiments. The gallery set is the same as that used in Experiments 1 and 2. Each of the probes was a set of neutral-expression, FERET-lit images taken within a single session after session 1 (i.e., subexperiment 1 used session 2 images in its probes, subexperiment 2 used session 3, and so forth). Figure 6 plots, for each week, the percentage of top-ranked matches that were correct in the nine sub-experiments (the graph depicts performance between 90% and 100%). The graph reveals differences in performance from week to week, but there is no clearly discernable trend in the results. Week 5 has the worst results of the ten weeks and week 6 is essentially the same as week 1 in performance. 3.4
Experiment 4
Experiment 4 was designed to examine the performance of the face recognition system with a constant delay of one week between gallery and probe acquisitions. It consists of nine sub-experiments: the first used images from session 1 as a gallery and session 2 as probe, the second used session 2 as gallery and session 3 as probe, and so on. All images were neutral-expression subjects with FERET-style lighting. The top-
rank-correct percentages for this batch of experiments appear in Figure 7. We note an overall higher level of performance with one week of delay than with delays larger than one week (as plotted in Figure 6). However, there is no clear trend in performance with an increasing number of weeks between gallery and probe acquisition.
4
Conclusions and Future Work
In this paper, we have described a new data set for face recognition research containing images of several dozen subjects taken weekly over a ten-week interval. Another such collection effort with expanded scope is underway. We also describe the results of some baseline performance experiments using the collected data. We observed superior performance of the Mahalanobis angle metric over the Euclidean metric. Any delay between acquisition of gallery images and probes caused recognition system performance degradation. More than one week’s delay yielded poorer performance than a single week’s delay. However, there is no discernible trend (using the data in the pilot study) that relates the size of the delay to the performance decrease. This motivates the development of a larger database covering more subjects and a longer period of time. Such an acquisition is underway. Additional experiments to be performed include: identification of subjects who are consistently difficult to recognize correctly, determination of the effect of lighting change (from FERET lighting to another structured lighting scheme or to unstructured lighting) on performance, investigation of metrics for matching, and repetition of earlier experiments with new image data.
Acknowledgement and Disclaimer This research was supported by the Defense Advanced Research Project Agency (DARPA) under AFOSR award F49620-00-1-0388 and ONR award N00014-02-10410. Commercial equipment is identified in this work in order to adequately specify or describe the subject matter. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment identified is necessarily the best available for this purpose.
References
[1] P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET Evaluation Methodology for Face-Recognition Algorithms. IEEE Trans. on PAMI 20, 10 (Oct 2000), 1090-1104.
[2] A.J. O'Toole, T. Vetter, H. Volz, and E.M. Slater. Three dimensional caricatures of human heads: distinctiveness and the perception of facial age. Perception 26, 719-732.
[3] W.S. Yambor, B.A. Draper and J.R. Beveridge. Analyzing PCA-based Face Recognition Algorithms: Eigenvector Selection and Distance Measures. Proc. 2nd Workshop on Empirical Evaluation in Computer Vision, Dublin, Ireland, July 1, 2000.
[4] R. Chellappa, C.L. Wilson, and S. Sirohey. Human and Machine Recognition of Faces: A Survey. Proc. IEEE 83(5), 705-740, May 1995.
[5] D.A. Socolinsky, L.B. Wolff, J.D. Neuheisel, and C.K. Eveland. Illumination Invariant Face Recognition Using Thermal Infrared Imagery. Proc. CVPR 2001, vol. I, 527-534, December 2001.
[6] V. Blanz, S. Romdhani and T. Vetter. Face identification across different poses and illuminations with a 3D morphable model. Proc. 5th IEEE Int. Conf. Automatic Face and Gesture Recognition, 202-207, 2002.
[7] D.M. Blackburn, J.M. Bone and P.J. Phillips. FRVT 2000 results. http://www.frvt.org/FRVT2000.
[8] P.J. Phillips, P. Grother, R. Micheals, D.M. Blackburn, E. Tabassi, J.M. Bone. Face Recognition Vendor Test 2002: Evaluation Report. NISTIR 6965, 2003, http://www.frvt.org.
[9] H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face recognition algorithms. Perception 30:303-321, 2001.
[10] J. Matas, M. Hamouz, K. Jonsson, J. Kittler, Y. Li, C. Kotropoulos, A. Tefas, I. Pitas, T. Tan, H. Yan, F. Smeraldi, J. Bigun, N. Capdevielle, W. Gerstner, S. Ben-Yacoub, Y. Abdeljaoued, E. Mayoraz. Comparison of face verification results on the XM2VTS database. Proc. ICPR 2000, Barcelona, vol. 4, pp. 858-863, Sept. 2000.
[11] T. Sim, S. Baker, and M. Bsat. The CMU Pose, Illumination, and Expression (PIE) Database. Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 46-51, May 2002.
[12] AR Face database. http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html.
[13] E. Marszalec, B. Martinkauppi, M. Soriano, M. Pietikäinen (2000). A physics-based face database for color research. Journal of Electronic Imaging, Vol. 9, No. 1, pp. 32-38.
Constraint Shape Model Using Edge Constraint and Gabor Wavelet Based Search Baochang Zhang1, Wen Gao1, 2, Shiguang Shan 2, and Wei Wang 3 1
Computer College, Harbin Institute of Technology Harbin, China, 150001 2 ICT-YCNC FRJDL, Institute of Computing Technology CAS, Beijing, China, 100080 3 College of Computer Science Beijing Polytechnic University, Beijing, China, 100022 {bczhang,wgao,sgshan,wwang}@jdl.ac.cn
Abstract. The Constraint Shape Model is proposed to extract facial features using two different search methods for contour points and control points. In the proposed algorithm, salient facial features, such as the eyes and the mouth, are first localized and utilized to initialize the shape model and provide region constraints on iterative shape searching. For the landmarks on the face contour, the edge intensity is exploited to construct better local texture matching models. For the control points, a Gabor wavelet based method with a multi-frequency search strategy is used. Experiments conducted on a database containing 500 labeled face images show that the proposed method performs significantly better in terms of a deliberate performance evaluation measure. The proposed method is robust to variations in illumination and facial expression and can easily be applied to other textured objects. Keywords: Shape Model, Gabor Wavelet, Facial Feature Extraction.
1
Introduction
Accurate facial feature point extraction is important for the success of applications such as face authentication, expression analysis and animation. Extensive research has been conducted over the past 20 years. Kass et al. [1] introduced Active Contour Models, an energy minimization approach for shape alignment. Wiskott et al. [2] used Gabor wavelets to generate a data structure named the Elastic Bunch Graph to locate facial features. It searches for facial points over the whole image and uses the distortion of the graph to adjust the feature points, so it is time-consuming and computationally demanding.
The Active Shape/Appearance Models (ASM and AAM), proposed by Cootes et al. [3, 4], have been demonstrated to be successful in shape localization. In ASM, the local appearance model, which represents the local statistics around each landmark, efficiently finds the best candidate point for each landmark in the image. Based on this accurate modeling of the local features, ASM obtains good results in shape localization. AAM combines constraints on both shape and texture; here "texture" means the intensity patch contained in the shape after warping to the mean shape. Two linear mappings are assumed for the optimization: from appearance variation to texture variation, and from texture variation to position variation. The shape is extracted by minimizing the texture reconstruction error. Generally, the feature points used by ASM/AAM lie on the outline contours of the facial features, i.e. the outlines of the face, eyebrows, eyes, nose and mouth (see Fig. 1). Although these points are enough to describe the shape of a face, they are insufficient for determining the positions of the control points (see Fig. 1). In this paper, we propose a method called the Constraint Shape Model (CSM), in which different search methods are used to find contour points and control points. For the landmarks on the face contour, the edge intensity is exploited to construct better local texture matching models. The magnitude and phase of the Gabor features, which encode the local structure of the face, are used to align the control points and provide an accurate guide for the search. The proposed method is a coarse-to-fine approach to searching for facial control points. Compared with the ASM, the CSM achieves more accurate results, as demonstrated by our experiments.
2
Statistical Shape Model
Here we describe briefly the shape models used to represent deformable object classes. Statistical Shape Models (SSMs) are built from a training set of annotated images, in which corresponding points have been marked. The points from each image are represented as a vector x after alignment to a common co-ordinate frame [5]. Eigenanalysis is applied to the aligned shape vectors, producing a set of modes of variation P. The model has parameters b controlling the shape, represented as x = \bar{x} + Pb, where \bar{x} is the mean aligned shape. The shape x can be placed in the image frame by applying an appropriate pose transform. Neglecting alignment, the process of estimating the model parameters b for a given shape x proceeds by minimizing the residual r = x - x_0, where x_0 is the current reconstruction of the model. Least squares error estimation therefore seeks to minimize E_1(b) = r^T r; specifically we wish to find \delta b so as to minimize E_1(b + \delta b). This can be shown to have a solution of the form

\delta b = (P^T P)^{-1} P^T \delta x \qquad (1)

which, since P is orthonormal, simplifies to

\delta b = P^T \delta x \qquad (2)
This is the standard SSM parameter update equation for iterative search. The update vector \delta x is obtained by searching the local image area around each landmark point. Models of local image appearance for each landmark point are built from the training set. These are used at each iteration of the search to determine the best local landmark position. In this paper, the Constraint Shape Model will be introduced, which is based on the statistical shape model, while making full use of the salient facial features to initialize the shape model.
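A sketch of the resulting iterative search loop is given below; find_best_landmarks stands for the local searches described in the following sections, the pose transform is omitted for brevity, and the iteration count and tolerance are illustrative values.

```python
import numpy as np

def shape_search(b, x_mean, P, find_best_landmarks, max_iter=20, tol=1e-3):
    """Iterative shape search built around the update db = P^T dx (eq. 2).
    `find_best_landmarks(x)` is assumed to return, for the current shape x,
    the best local landmark positions found by the profile/jet searches."""
    for _ in range(max_iter):
        x = x_mean + P @ b                 # current model shape, x = x_mean + P b
        x_new = find_best_landmarks(x)     # local search around each landmark
        db = P.T @ (x_new - x)             # parameter update (P orthonormal)
        b = b + db
        if np.linalg.norm(db) < tol:       # converged
            break
    return b, x_mean + P @ b
```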
3
Constraint Shape Model
Based on the located three salient facial features (eyes, nose), in this section, Constraint Shape Model (CSM) is proposed with several constraints. Also, the edge constraints are introduced to improve the local texture model for the landmarks on the face contour. 3.1
Initialization of Constraint Shape Model
Of all the facial features, the two irises are the most distinct, so they are relatively easy to locate, and they provide enough information to initialize the translation, scale and in-image-plane rotation parameters of the shape model. The mouth center can also be detected in an expected area determined by the iris locations to provide further constraints on initializing the Constraint Shape Model. Refer to [6] for details of the algorithm; after locating the irises, we can estimate the rough position of the mouth center. In the search process, the initialization of the mean shape is very important, because a good initialization is less likely to lead to an incorrect local minimum. With the positions of the two irises, we can calculate the scale, rotation and translation parameters of the target face in the image. As in ASM, the shape of a model is described as X = (x_1, y_1, ..., x_n, y_n), where (x_i, y_i) (1 ≤ i ≤ n) is the coordinate of the i-th landmark. The initial positions of the model points X_i in the image can be calculated by X_i = T_{x_t, y_t, s, \theta}(\bar{X} + Pb), where the function T_{x_t, y_t, s, \theta} is a transform consisting of a rotation by \theta, a scaling by s, and a translation by (x_t, y_t). The initial value of the statistical shape parameter b is set to zero. Since the initial shape is then already close to its true position, better results can be achieved in fewer iterations [7]. The size of the local search window is also an important part of the search strategy. Given the locations of the irises, the overall search length can be determined according to the distance between them: we choose a longer search distance when the distance between the two iris points is large. Furthermore, the window size should differ for landmarks on different parts of the shape. Because feature points in the constrained parts (eyes and mouth) are more likely to be close to their true positions, a smaller window size is sufficient; in other parts, a larger window is more appropriate to guarantee that the true point lies within the search region.
3.2
Contour Points Searching by Edge Constraint in Local Texture Model Matching
Local texture model matching is conducted under the assumption that the normalized derivative profile, {g_i}, follows a Gaussian distribution. The matching degree of a probe sample, g_s, to the reference model is given by

f(g_s) = (c - \bar{k})\,(g_s - \bar{g})^T S_g^{-1} (g_s - \bar{g}) \qquad (3)

where \bar{g} is the mean of {g_i} (1 ≤ i ≤ N), N is the number of sample images used for establishing the Point Distribution Model (PDM), and S_g is the covariance matrix. The main idea of finding matching points is to minimize f(g_s), which is equivalent to maximizing the probability that g_s comes from the distribution; c is a constant and \bar{k} is the mean gradient intensity at the target point. With this strategy, points with strong edge information are more likely to be chosen as the best candidate. For face images, landmarks on the contour part have clearly strong edge intensity, while landmarks on the other parts do not necessarily have this property. Therefore, we only apply edge constraints to the landmarks on the face contour.
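A minimal sketch of this edge-constrained profile matching is given below; the candidate profiles, their mean gradient intensities and the constant c are assumed inputs, and the code simply evaluates eq. (3) for each candidate along the search profile.

```python
import numpy as np

def best_profile_match(candidates, g_mean, S_inv, grad_means, c=1.0):
    """Pick the candidate minimising f(g) = (c - k_bar) (g - g_mean)^T S^-1 (g - g_mean),
    where k_bar is the mean gradient intensity at the candidate point, so that
    candidates lying on strong edges are favoured (used for contour landmarks only).
    `candidates` is a list of normalised derivative profiles; `grad_means` holds
    the corresponding mean gradient intensities (both assumed precomputed)."""
    def f(g, k_bar):
        d = np.asarray(g) - g_mean
        return (c - k_bar) * float(d @ S_inv @ d)
    scores = [f(g, k) for g, k in zip(candidates, grad_means)]
    return int(np.argmin(scores))        # index of the best matching candidate
```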
4
Control Points Searching by Gabor Wavelet Based Method
Control points are the key points controlling the shape of the eyes, nose and mouth, as shown in Fig. 1. 4.1
Gabor Measurement of Feature Points
Daugman pioneered the use of the 2D Gabor wavelet representation in computer vision in the 1980s [8]. The Gabor wavelets, whose kernels are similar to the 2D receptive field profiles of the mammalian cortical simple cells, exhibit desirable characteristics of spatial locality and orientation selectivity. The biological relevance and computational properties of Gabor wavelets for image analysis have been described in [8]. The Gabor wavelet representation facilitates recognition and feature extraction because it captures the local structure corresponding to spatial frequency (scale), spatial localization, and orientation selectivity [9]. A complex-valued 2D Gabor function is a plane wave restricted by a Gaussian envelope:
Fig. 1. One example of a marked face. Green points are control points (19 in total); yellow points are contour points (84 in total)
Fig. 2. The real part of the 40 Gabor kernels used in this paper

\varphi_j(\vec{x}) = \frac{k_j^2}{\sigma^2} \exp\!\left(-\frac{k_j^2 x^2}{2\sigma^2}\right)\left[\exp\!\left(i\,\vec{k}_j\cdot\vec{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right],
\quad \vec{k}_j = \begin{pmatrix} k_{jx} \\ k_{jy} \end{pmatrix} = \begin{pmatrix} k_v \cos\phi_u \\ k_v \sin\phi_u \end{pmatrix},
\quad k_v = 2^{-\frac{v+2}{2}}\pi
(4)
Here 5 frequencies and 8 orientations are used: \phi_u = u\pi/8, j = u + 8v, v = 0,...,4, u = 0,...,7. For a given pixel \vec{x} of an image with gray level L(\vec{x}), the convolution is defined as

J_j(\vec{x}) = \int L(\vec{x}\,')\,\varphi_j(\vec{x} - \vec{x}\,')\,d^2\vec{x}\,'
(5)
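For concreteness, the kernel bank of eq. (4) can be generated as in the sketch below; the window size and the value sigma = 2*pi are assumptions commonly used with Gabor jets rather than values stated here.

```python
import numpy as np

def gabor_kernels(size=33, sigma=2 * np.pi, n_freq=5, n_orient=8):
    """Generate the complex Gabor kernels of eq. (4) on a size x size grid
    (40 kernels for 5 frequencies and 8 orientations, indexed j = u + 8v)."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    kernels = []
    for v in range(n_freq):
        k_v = 2.0 ** (-(v + 2) / 2.0) * np.pi
        for u in range(n_orient):
            phi_u = u * np.pi / n_orient
            kx, ky = k_v * np.cos(phi_u), k_v * np.sin(phi_u)
            ksq, rsq = kx * kx + ky * ky, xs * xs + ys * ys
            envelope = (ksq / sigma ** 2) * np.exp(-ksq * rsq / (2 * sigma ** 2))
            wave = np.exp(1j * (kx * xs + ky * ys)) - np.exp(-sigma ** 2 / 2)  # DC-free carrier
            kernels.append(envelope * wave)
    return kernels   # list of n_freq * n_orient complex (size x size) arrays

# A jet at a pixel is then the vector of the 40 filter responses, obtained by
# convolving the image with each kernel (e.g. via an FFT-based convolution).
```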
When all 40 kernels are used, as shown in Fig. 2, 40 complex coefficients are obtained. This vector is called a jet and is used to represent the local features. A jet can be expressed as J_j = a_j \exp(i\phi_j), where the magnitudes a_j(\vec{x}) vary slowly with position, and the phases \phi_j(\vec{x}) rotate at a rate approximately determined by the fre-
quency of the kernel. A set of jets referring to one fiducial point is called a bunch. Two similarity functions are applied. One is the phase-insensitive similarity function

S_a(J, J') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \,\sum_j a_j'^2}} \qquad (6)
j
It varies smoothly with the change of position and we use it for recognition. The other is the phase-sensitive similarity function

S_\phi(J, J') = \frac{\sum_j a_j a'_j \cos\!\left(\phi_j - \phi'_j - \vec{d}\cdot\vec{k}_j\right)}{\sqrt{\sum_j a_j^2 \,\sum_j a_j'^2}} \qquad (7)
j
It changes quickly with the change of location and is used for feature adjusting; the displacement between two jets can be estimated using the following formulation [1].
Fig. 3. One example of a jet

d(J, J') = \begin{pmatrix} d_x \\ d_y \end{pmatrix} = \frac{1}{\Gamma_{xx}\Gamma_{yy} - \Gamma_{xy}\Gamma_{yx}} \begin{pmatrix} \Gamma_{yy} & -\Gamma_{xy} \\ -\Gamma_{yx} & \Gamma_{xx} \end{pmatrix} \begin{pmatrix} \Phi_x \\ \Phi_y \end{pmatrix}
(8)
This equation yields a straightforward method for estimating the displacement between two jets taken from object locations. The range of displacement can be increased by using the low-frequency kernels only. For the lowest frequency the estimated displacement can be up to 8 pixels, while it is about 2 pixels for the highest frequency. So we can proceed with the next higher frequency level and refine the result [10, 11]. If one has access to the whole image of jets, one can work iteratively. Assume J is
to be accurately positioned in the neighborhood of points x m in an image. Comparing →
→
J with J m = J ( x ) , yield an estimated displacement d = d ( J , J m ) . Then the jet J m is →
→
taken from the position x m = x m + d and the displacement is estimated again. The new displacement will be smaller and can be estimated more accurately with high frequency, converging to eventually to subpixel accuracy. The details of this method are discussed in the 4.2. 4.2
Gabor Wavelet Based Search Method
We can estimate the displacement between two jets up to 8 pixels. By comparing the jet bunch feature, we can get the best fitting jet at a new position. Here we use a coarse to fine approach to search the local points: 1. 2.
Initialize starting points by mean shape model and salient facial features, set frequency of the lowest level, fre = 0. For each feature point, we compute Gabor parameters at the position of P0 , get
3.
the Jet J 0 . For each Bunch, the size is k . For example k = 60 , and J k , i = 1....60 , then dis-
4.
tance between J 0 and J i is computed using d i (J 0 , J k ) . It will include frequency from 0 to fre; r We compute the new position P ' = P0 + di , and get J i' , which is Jet of Pi '
5.
We compute the similarity between J i' and J i using S iφ J i' , J i , and the po-
r
i
(
)
sition of the highest value is candidate one, represented as Pi ' ( P0 ). Then 6.
compared with threshold value λ , if it is more than λ , then end. We will increase frequency by fre = fre + 1, go to step 2
58
Baochang Zhang et al.
The experiment shows that this searching method is very efficient at term of speed and accuracy.
5
Facial Feature Points Extraction
Procedure of CSM is performed through matching the feature point in the image using edge constraints and Gabor wavelet based search method, and then adjusts the affine shape model just like ASM. The procedure is as following: a.
Initialize the starting points using the iris location, by translation, scaling and rotation, match X = X + PB to initial X 0 = M ( s0 ,θ 0 )[ X + PB0 ] + t 0 by change B , At the starting point, B0 = 0 .
b.
Matching the feature points of contour part using Grey feature, first ith point of
X 0 is represented as
we search along a line passing through ( xi , yi ) and perpendicular to the boundary formed by the landmark and its neighbors, the number of points is
( xi , y i ) ,
n s ( ns > n p , for example 20) which is named searching
profile. The objective of the process is to find sub-profile by comparing the gray value of ith point by equation (3). The best matching one is the new point of next searching process
( xi ' , y i ' ) .
All contour points of X will be processed as
above procedure, then we get X (contour part). Searching control points using the method in 4.2, we get control part of X ' . Using X' ,X0, we can find the change of s,θ , t , which is 1 + ds, dθ , dt , details of '
c. d.
this method is showed in ASM[3]. And when X 0 is changed to X ' , then we get dB using formulation (2), so we can easily get the result using X1 = M(s0 (1+ ds),θ 0 + dθ )[X + P(B0 + dB)]+ t0 + dt . e.
6
Computing d ( X 0 , X1 ) , if d ( X 0 , X1 ) < λ , X1 is the result, or else X 0 = X1 then go to b), Starting the process again.
Performance Evaluation and Experiment Result
Performance evaluation is really an important problem for different approaches to facial feature extraction. In this paper, we propose to evaluate the performance by using average error, which is defined as the following distance between the manually labeled shape and the resulting shape of our method: E=
1 N 1 n ∑ ∑ dist( Pij , Pij' ) N i =1 n j =1
(9)
Constraint Shape Model Using Edge Constraint and Gabor Wavelet Based Search
59
Table 1. Performance comparison of different methods
Method ASM IASM CSM
Average Error 3.10 2.84 1.94
Improvement ----8.3% 37.4%
Table 2. Experiment on different frequency
Fre
0
0-1
0-2
0-3
0-4
Average Error
2.10
2.04
1.97
1.87
1.74
1.94
Error
Fig. 4. Sample Face Images in the test database
a)
b)
c)
e)
d)
f)
Fig. 5. Result of the Constraint Shape Model a, b) Facial Feature Points in normal face c, d) Facial Feature Points with face deformation e, f) Facial feature Points with illumination
60
Baochang Zhang et al.
Where N is the total number of the probe images, n is the number of the landmark points in the shape (n=103), Pij is the j-th landmark point in the manually labeled '
shape of the i-th test image manually labeled, Pij is the j-th landmark point in the resulting shape of the proposed method for the i-th test image. The function dist ( p1, p 2) is the Euclidean distance between the two points. To evaluate our method, experiments are conducted on the 500 faces database above mentioned (showed in Figure 4). As Table 1 shows, the average error for standard ASM is about 3.10 pixels. Another method is Improved Active Shape Model (IASM), The details are dealt with in [7]. Error of IASM is reduced to 2.84 pixels. When both edge constrain and Gabor wavelet based search method are applied, the error is reduced to 1.94 pixels per point, and it is really good. Table 2 shows, the more frequency are used, the less error. And ‘0-4’ means that we use frequency from 0 to 4. Figure 5 demonstrates some search results.
7
Conclusions and Future Work
In this paper, to solve the Facial feature extraction problem, we propose Constraint Shape Model. First, salient facial features, such as eyes and the mouth, are localized and utilized to initialize the shape model and provide region constraints on the subsequent iterative shape searching. The edge information is also exploited to construct better local texture models for landmarks on the face contour. In addition, Gabor wavelet based search strategy is applied to this, which is robust to illumination and facial deformation. Future work will be focused on robust local texture models and candidate points searching method to find facial feature in texture objects with rotation in depth and poor quality. How to use the result of the proposed method for object recognition is also a future research effort.
Acknowledgement This research is sponsored partly by Natural Science Foundation of China (No.69789301), National Hi-Tech Program of China (No.2001AA114190), and Sichuan Chengdu YinChen Net. Co. (YCNC).
References
[1] M. Kass, A. Witkin, and D. Terzopoulos. Active contour models. 1st International Conference on Computer Vision, London, June 1987, pp. 259-268.
[2] L. Wiskott, J.M. Fellous, N. Kruger, and C. von der Malsburg. Face Recognition by Elastic Bunch Graph Matching. IEEE Trans. on PAMI, Vol. 19, No. 7, 1997.
[3] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1): pp. 38-59, 1995.
[4] T. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. In Proceedings of the 5th European Conference on Computer Vision, 1998, vol. 2, pp.
[5] M. Rogers and J. Graham. Robust active shape model search. In Proceedings of the European Conference on Computer Vision, May 2002.
[6] B. Cao, S.G. Shan, and W. Gao. Localizing the iris center by region growing search. Proceedings of ICME 2002.
[7] Wei Wang, Shiguang Shan, et al. An Improved Active Shape Model for Face Alignment. ICMI 2002.
[8] J.G. Daugman. Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, vol. 20, pp. 847-856, 1980.
[9] Chengjun Liu and Harry Wechsler. Gabor Feature Based Classification Using the Enhanced Fisher Linear Discriminant Model for Face Recognition. IEEE Trans. Image Processing, vol. 11, no. 4, 2002.
[10] L. Wiskott, J.M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 775-779, 1997.
[11] F. Smeraldi and J. Bigun. Retinal vision applied to facial features detection and face authentication. Pattern Recognition Letters, 23:463-475, 2002.
Expression-Invariant 3D Face Recognition Alexander M. Bronstein1, Michael M. Bronstein1, and Ron Kimmel2 1 Technion – Israel Institute of Technology Department of Electrical Engineering, Haifa 32000, Israel {alexbron,bronstein}@ieee.org 2 Technion – Israel Institute of Technology Department of Computer Science, Haifa 32000, Israel
[email protected] Abstract. We present a novel 3D face recognition approach based on geometric invariants introduced by Elad and Kimmel. The key idea of the proposed algorithm is a representation of the facial surface, invariant to isometric deformations, such as those resulting from different expressions and postures of the face. The obtained geometric invariants allow mapping 2D facial texture images into special images that incorporate the 3D geometry of the face. These signature images are then decomposed into their principal components. The result is an efficient and accurate face recognition algorithm that is robust to facial expressions. We demonstrate the results of our method and compare it to existing 2D and 3D face recognition algorithms.
1
Introduction
Face recognition is a biometric method that, unlike other biometrics, is non-intrusive and can be used even without the subject's knowledge. State-of-the-art face recognition systems are based on a 40-year heritage of 2D algorithms, dating back to the early 1960s [1]. The first face recognition methods used the geometry of key points (like the eyes, nose and mouth) and their geometric relationships (angles, lengths, ratios, etc.). In 1991, Turk and Pentland introduced the revolutionary idea of applying principal component analysis (PCA) to face imaging [2]. This has become known as the eigenface algorithm and is now a gold standard in face recognition. Later, algorithms inspired by eigenfaces that use similar ideas were proposed (see [3], [4], [5]). However, all the 2D (image-based) face recognition methods appear to be sensitive to illumination conditions, head orientations, facial expressions and makeup. These limitations of 2D methods stem directly from the limited information about the face contained in a 2D image. Recently, it became evident that the use of 3D data of the face can be of great help, as 3D information is viewpoint- and lighting-condition independent, i.e. it lacks the "intrinsic" weaknesses of 2D approaches. Gordon showed that combining frontal and profile views can improve recognition accuracy [6]. This idea was extended by Beumier and Acheroy, who compared cen-
tral and lateral profiles from the 3D facial surface, acquired by a structured light range camera [7]. This approach demonstrated better robustness to head orientations. Another attempt to cope with the problem of head pose using 3D morphable head models is presented in [8], [9]. Mavridis et al. incorporated a range map of the face into the classical face recognition algorithms based on PCA and hidden Markov models [10]. Particularly, this approach showed robustness to large variations in color and illumination and use of cosmetics, and also allowed separating the face from cluttered background. However, none of the approaches proposed heretofore was able to overcome the problems resulting from the non-rigid nature of the human face. For example, Beumier and Acheroy failed to perform accurate global surface matching, and observed that the recognition accuracy decreased when too many profiles were used [7]. The difficulty in performing accurate surface matching of facial surfaces was one of the primary limiting factors of other 3D face recognition algorithms as well. In this work, we present a geometric framework for efficient and accurate face recognition using 3D data (patented, [11]). Our method is based on geometric invariants of the human face and performs a non-rigid surface comparison, allowing deformations, typical to the human face due to facial expressions.
2
Non-rigid Surface Matching
Classical surface matching methods, based on finding a Euclidean transformation of two surfaces which maximizes some shape similarity criterion (see, for example, [12], [13], [14]), are suitable mainly for rigid objects. Human face can not be considered a rigid object since it undergoes deformations resulting from facial expressions. On the other hand, the class of transformations that a facial surface can undergo is not arbitrary, and empirical observations show that facial expressions can be modeled as isometric (or length-preserving) transformations. Such transformations do not stretch and do not tear the surface, or more rigorously, preserve the surface metric. The family of surfaces resulting from such transformations is called isometric surfaces. The requirement of a deformable surface matching algorithm is to find a representation, which is the same for all isometric surfaces. Schwartz et al. were the first to use multidimensional scaling (MDS) as a tool for studying curved surfaces by planar models. In their pioneering work, they applied an MDS technique to flatten convoluted cortical surfaces of the brain, onto a plane, in order to study their functional architecture [15]. Zigelman et al. [16] and Grossman et al. [17] extended some of these ideas to the problem of texture mapping and voxelbased cortex flattening. A generalization of this approach was introduced in the recent work of Elad and Kimmel [18], as a framework for object recognition. They introduced an efficient algorithm to construct a signature for isometric surfaces. This method, referred to as bending-invariant canonical forms, is the core of our 3D face recognition framework.
2.1
Bending-Invariant Canonical Forms
Consider a polyhedral approximation of the facial surface, S. One can think of such an approximation as if obtained by sampling the underlying continuous surface on a finite set of points p_i (i = 1,...,n), and discretizing the metric \delta associated with the surface,

\delta(p_i, p_j) = \delta_{ij}.
(1)
Writing the values of \delta_{ij} in matrix form, we obtain the matrix of mutual distances between the surface points. For convenience, we define the squared mutual distances,

(\Delta)_{ij} = \delta_{ij}^2.
(2)
The matrix \Delta is invariant under isometric surface deformations, but it is not a unique representation of isometric surfaces, since it depends on an arbitrary ordering of the points. We would like to obtain a geometric invariant which is unique for isometric surfaces on one hand, and allows using simple rigid surface matching algorithms to compare such invariants on the other. Treating the squared mutual distances as a particular case of dissimilarities, one can apply a dimensionality-reduction technique called multidimensional scaling (MDS) in order to embed the surface into a low-dimensional Euclidean space R^m. This is equivalent to finding a mapping between two metric spaces,

\varphi : (S, \delta) \to (\mathbb{R}^m, d); \qquad \varphi(p_i) = x_i,
(3)
which minimizes the embedding error

\epsilon = f\!\left(\delta_{ij} - d_{ij}\right); \qquad d_{ij} = \left\| x_i - x_j \right\|_2.
(4)
The obtained m-dimensional representation is a set of points x_i in R^m (i = 1,...,n), corresponding to the surface points p_i. Different MDS methods can be derived using different embedding error criteria [19]. A particular case is classical scaling, introduced by Young and Householder [20]. The embedding in R^m is performed by double-centering the matrix \Delta,

B = -\tfrac{1}{2}\, J \Delta J
(5)
(here J = I - ½U; I is an n×n identity matrix, and U is a matrix consisting entirely of ones). The first m eigenvectors e_i, corresponding to the m largest eigenvalues of B, are used as the embedding coordinates

x_i^j = e_{ij}; \qquad i = 1,...,n; \; j = 1,...,m,
(6)
where x ij denotes the j-th coordinate of the vector xi. Eigenvectors are computed using a standard eigendecomposition method. Since only m eigenvectors are required (usually, m=3), the computation can be done efficiently (e.g. by power methods).
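To make the classical-scaling step concrete, here is a minimal numerical sketch. It assumes the squared geodesic distance matrix has already been computed (e.g. by FMTD), and it uses the standard classical-MDS centering J = I - (1/n)U; the square-root-of-eigenvalue scaling of the coordinates is the usual classical-scaling convention rather than something stated explicitly in eq. (6).

```python
import numpy as np

def canonical_form(delta_sq, m=3):
    """Classical scaling (cf. eqs. 5-6): embed n points with squared geodesic
    distances `delta_sq` ((n x n) matrix) into R^m."""
    n = delta_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix J = I - (1/n) U
    B = -0.5 * J @ delta_sq @ J              # double-centered matrix, eq. (5)
    eigvals, eigvecs = np.linalg.eigh(B)     # symmetric eigendecomposition
    idx = np.argsort(eigvals)[::-1][:m]      # indices of the m largest eigenvalues
    # scale eigenvectors so that inner products reproduce B (classical MDS convention)
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))
```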
We will refer to the set of points xi obtained by MDS as the bending-invariant canonical form of the surface; when m=3, it can be plotted as a surface. Standard rigid surface matching methods can be used in order to compare between two deformable surfaces, using their bending-invariant representations instead of the surfaces themselves. Since the canonical form is computed up to a translation, rotation, and reflection transformation, to allow comparison between canonical forms, they must be aligned. This is possible, for example, by setting the first-order moments (center of mass) and the mixed second-order moments to zero (see [21]). 2.2
Measuring Geodesic Distances on Triangulated Manifolds
One of the crucial steps in the construction of the canonical form of a given surface, is an efficient algorithm for the computation of the geodesic distances on surfaces, that is, δ ij. A computationally inefficient distance computation algorithm was one of the disadvantages of the work of Schwartz et al. and actually limited practical applications of their method. A numerically consistent algorithm for distance computation on triangulated domains, henceforth referred to as fast marching on triangulated domains (FMTD), was used by Elad and Kimmel [18]. FMTD was proposed by Kimmel and Sethian [22] as a generalization of the fast marching method [23]. Using FMTD, the geodesic distances between a surface vertex and the rest of the n surface vertices can be computed in O(n) operations. We use this method for the bending invariant canonical form computation.
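FMTD itself is beyond a short sketch, but as a rough stand-in the geodesic distance matrix can be approximated by running Dijkstra's algorithm on the mesh edge graph; this is less accurate than fast marching (paths are constrained to mesh edges) and is shown only to illustrate where the per-source distance computations fit into the pipeline.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def edge_graph_distances(vertices, triangles):
    """Approximate all-pairs geodesic distances on a triangulated surface by
    shortest paths over the mesh edge graph (a crude stand-in for FMTD)."""
    v = np.asarray(vertices, dtype=float)
    edges = set()
    for a, b, c in triangles:                           # collect undirected edges once
        edges.update({tuple(sorted((a, b))),
                      tuple(sorted((b, c))),
                      tuple(sorted((c, a)))})
    i, j = np.array(sorted(edges)).T
    w = np.linalg.norm(v[i] - v[j], axis=1)             # Euclidean edge lengths
    n = len(v)
    graph = coo_matrix((w, (i, j)), shape=(n, n))
    return dijkstra(graph, directed=False)              # (n x n) distance matrix
```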
3
Range Image Acquisition
Accurate acquisition of the facial surface is crucial for 3D face recognition. Many of commercial range cameras that are available in the market today are suitable for face recognition application. Roughly, we distinguish between active and passive range sensors. The majority of passive range cameras exploit stereo vision, that is, the 3D information is established from correspondence between pixels in images viewed from different points. Due to the computational complexity of the correspondence problem, passive stereo is usually unable to produce range images in real time. Active range image acquisition techniques usually use controlled illumination conditions for object reconstruction. One of the most popular approaches known as structured light, is based on projecting a pattern on the object surface and extracting the object geometry from the deformations of the pattern [24]. A more robust and accurate version of this approach uses a series of black and white stripes projected sequentially and is known as coded light. The patterns form a binary code, that allows the reconstruction of the angle of each point on the surface with respect to the optical axis of the camera. Then one can compute the depth using triangulation. In this paper, we use the coded light technique for 3D surface acquisition. Using 8 binary patterns, we obtained 256 depth levels which yielded depth resolution of about 1 mm. In our setup, we used an LCD projector with refresh rate of 70Hz controlled
via the DV interface. Images were acquired at the rate of 30 frames per second by a black-and-white FireWire CCD camera with a resolution of 640× 480 pixels, 8 bit.
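A rough sketch of how such a coded-light sequence might be decoded is given below. This is not the authors' implementation: the thresholding, the code-to-column mapping, and all geometry constants are made-up placeholders standing in for a calibrated projector-camera pair. It only illustrates how 8 thresholded stripe images combine into a 256-level code from which depth is triangulated.

```python
import numpy as np

def decode_stripe_code(images, threshold=128):
    """Combine 8 binary stripe images (H x W each) into an 8-bit code per pixel."""
    bits = [(img > threshold).astype(np.uint16) for img in images]
    code = np.zeros_like(bits[0], dtype=np.uint16)
    for k, b in enumerate(bits):          # bit k of the projected pattern sequence
        code |= b << k
    return code                           # values in [0, 255] -> 256 depth levels

def depth_from_code(code, cols, baseline=0.3, focal=800.0, code_to_col=2.5):
    """Toy triangulation: treat the decoded code as a projector column index and
    use a disparity-like relation; a real system would use calibrated geometry."""
    disparity = np.abs(cols - code * code_to_col) + 1e-6
    return baseline * focal / disparity

H, W = 4, 6
rng = np.random.default_rng(1)
imgs = [rng.integers(0, 256, size=(H, W)) for _ in range(8)]
code = decode_stripe_code(imgs)
cols = np.tile(np.arange(W), (H, 1)).astype(float)
print(depth_from_code(code, cols).shape)   # (4, 6)
```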
4
3D Face Recognition Using Eigenforms
As a first step, using the range camera we acquire a 3D image, which includes the range image (geometry) and the 2D image (texture) of the face. The range image is converted into a triangulated surface and smoothed using splines. Regions outside the facial contour are cropped, and the surface is decimated to a size of approximately 2000-2500 vertices. Next, the bending-invariant canonical form of the face is computed and aligned using the procedure described in Section 2.1.
Since there is a full correspondence between the texture image pixels $a_n$ and the canonical surface vertices $(x_1^n, x_2^n, x_3^n)$, the face texture image can be mapped onto the aligned canonical surface in the canonical form space. By interpolating $a_n$ and $x_3^n$ onto a Cartesian grid in the $X_1 X_2$ plane, we obtain the flattened texture $\tilde{a}$ and the canonical image $\tilde{x}$, respectively. Both $\tilde{a}$ and $\tilde{x}$ preserve the invariance of the canonical form to isometric transformations, and can be represented as images (Fig. 1). Application of eigendecomposition is straightforward in this representation. Like in eigenfaces, we have a training set, which is a set of duplets of the form $\{\tilde{x}_n, \tilde{a}_n\}_{n=1}^{N}$. Applying eigendecomposition separately to the sets of $\tilde{a}$ and $\tilde{x}$, we produce two eigenspaces corresponding to the flattened textures and the canonical images. We term the respective sets of eigenvectors $e_a^n$ and $e_x^n$ the eigenforms. For a new subject represented by $(\tilde{x}', \tilde{a}')$, the decomposition coefficients are computed according to

$$\alpha = \left[ e_a^1, \ldots, e_a^N \right]^T (\tilde{a}' - \bar{a}) , \quad \beta = \left[ e_x^1, \ldots, e_x^N \right]^T (\tilde{x}' - \bar{x}) , \quad (10)$$

where $\bar{a}$ and $\bar{x}$ denote the averages of $\tilde{a}_n$ and $\tilde{x}_n$ over the training set, respectively. The distance between two subjects represented by $(\tilde{x}_1, \tilde{a}_1)$ and $(\tilde{x}_2, \tilde{a}_2)$ is computed as a weighted Euclidean distance between the corresponding decomposition coefficients, $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$.
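A minimal sketch of this matching step is given below, assuming the eigenvector matrices and training means are already available; the weights and all data here are illustrative placeholders rather than the authors' settings.

```python
import numpy as np

def eigenform_coeffs(a_flat, x_flat, E_a, E_x, a_mean, x_mean):
    """Project a flattened texture and canonical image onto their eigenspaces.

    E_a, E_x: (D, N) matrices whose columns are the eigenvectors e_a^n, e_x^n.
    """
    alpha = E_a.T @ (a_flat - a_mean)
    beta = E_x.T @ (x_flat - x_mean)
    return alpha, beta

def eigenform_distance(c1, c2, w_alpha=1.0, w_beta=1.0):
    """Weighted Euclidean distance between two (alpha, beta) coefficient pairs."""
    (a1, b1), (a2, b2) = c1, c2
    return np.sqrt(w_alpha * np.sum((a1 - a2) ** 2) + w_beta * np.sum((b1 - b2) ** 2))

# Toy usage with random data standing in for real flattened images.
rng = np.random.default_rng(0)
D, N = 1024, 20
E_a, E_x = rng.normal(size=(D, N)), rng.normal(size=(D, N))
a_mean, x_mean = rng.normal(size=D), rng.normal(size=D)
subj1 = eigenform_coeffs(rng.normal(size=D), rng.normal(size=D), E_a, E_x, a_mean, x_mean)
subj2 = eigenform_coeffs(rng.normal(size=D), rng.normal(size=D), E_a, E_x, a_mean, x_mean)
print(eigenform_distance(subj1, subj2))
```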
5
Results
The experiments were performed on a 3D face database consisting of 64 children and 93 adults (115 males and 42 females). The texture and the range images (acquired at a resolution of 640 x 480) were decimated to a scale of 1:8 and cropped outside of the facial contour. The database contained several instances of identical twins (Alex and
Mike). Four approaches were compared: (i) eigendecomposition of range images; (ii) combination of texture and range images in the eigenfaces scheme, as proposed by Mavridis et al.; (iii) eigendecomposition of canonical images; and (iv) our eigenforms algorithm.
Fig. 1. Texture flattening by interpolation onto the X1X2 plane: texture mapping on the facial surface (A) and on the canonical form (B); the resulting flattened texture (C) and the canonical image (D)
Using each of these four algorithms, we found the closest matches between a reference subject and the rest of the database. Different instances of the identical twins (Alex and Mike) were chosen as the reference subjects. Fig. 2 shows a significant improvement in recognition accuracy when canonical images are used instead of range images. Even without using the texture information, we obtain accurate recognition of the twins. Fig. 3 compares the method of Mavridis et al. (eigendecomposition of texture and range images) and our eigenforms method (eigendecomposition of flattened textures and canonical images). Our method made no mistakes in distinguishing between Alex and Mike. One can also observe that the conventional approach is unable to cope with significant deformations of the face (e.g. inflated cheeks), and finds a subject with fat cheeks (Robert 090) as the closest match. This result is typical for eigenfaces, as well as for straightforward range image eigendecomposition.
Fig. 2. The closest matches, obtained by eigendecomposition of range images (A) and canonical images (B). Wrong matches are italicized
Fig. 3. The closest matches, obtained by the method of Mavridis et al. (A) and eigenforms (B). Wrong matches are italicized. Note the inability of the conventional method to cope with subjects with exaggerated facial expression (Alex 20, sixth column)
6
Conclusions
We proposed an algorithm capable of extracting the intrinsic geometric features of facial surfaces using geometric invariants, and applying eigendecomposition to the resulting representation. We obtained very accurate face recognition results. Unlike previously proposed solutions, the use of bending-invariant canonical representation makes our approach robust to facial expressions and transformations typical of nonrigid objects.
Experimental results showed that the proposed algorithm outperforms the 2D eigenfaces approach, as well as the straightforward incorporation of range images into the eigenfaces framework proposed by Mavridis et al. In particular, we observed that even very significant deformations of the face do not confuse our algorithm, unlike conventional approaches.
References
[1] Bledsoe, W. W. The model method in facial recognition. Technical report PRI 15, Panoramic Research Inc., Palo Alto (1966).
[2] Turk, M., Pentland, A. Face recognition using eigenfaces, Proc. CVPR, pp. 586-591 (1991).
[3] Belhumeur, V. I., Hespanha, J. P., Kriegman, D. J. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. PAMI 19(7), pp. 711-720 (1997).
[4] Frey, B. J., Colmenarez, A., Huang, T. S. Mixtures of local linear subspaces for face recognition, Proc. CVPR, pp. 32-37 (1998).
[5] Moghaddam, B., Jebara, T., Pentland, A. Bayesian face recognition. Technical Report TR2000-42, Mitsubishi Electric Research Laboratories (2000).
[6] Gordon, G. Face recognition from frontal and profile views, Proc. Int'l Workshop on Face and Gesture Recognition, pp. 47-52 (1996).
[7] Beumier, C., Acheroy, M. P. Automatic face authentication from 3D surface, Proc. British Machine Vision Conf. (BMVC), pp. 449-458 (1998).
[8] Huang, J., Blanz, V., Heisele, B. Face recognition using component-based SVM classification and morphable models, SVM 2002, pp. 334-341 (2002).
[9] Blanz, V., Vetter, T. A morphable model for the synthesis of 3D faces, Proc. SIGGRAPH, pp. 187-194 (1999).
[10] Mavridis, N., Tsalakanidou, F., Pantazis, D., Malassiotis, S., Strintzis, M. G. The HISCORE face recognition application: Affordable desktop face recognition based on a novel 3D camera, Proc. Int'l Conf. on Augmented Virtual Environments and 3D Imaging (ICAV3D), Mykonos, Greece (2001).
[11] Bronstein, A. M., Bronstein, M. M., Kimmel, R. 3-Dimensional face recognition, US Provisional patent No. 60/416,243 (2002).
[12] Faugeras, O. D., Hebert, M. A 3D recognition and positioning algorithm using geometrical matching between primitive surfaces, Proc. 7th Int'l Joint Conf. on Artificial Intelligence, pp. 996-1002 (1983).
[13] Besl, P. J. The free form matching problem. In: Freeman, H. (ed.) Machine vision for three-dimensional scenes, Academic Press, New York (1990).
[14] Barequet, G., Sharir, M. Recovering the position and orientation of free-form objects from image contours using 3D distance maps, IEEE Trans. PAMI, 19(9), pp. 929-948 (1997).
[15] Schwartz, E. L., Shaw, A., Wolfson, E. A numerical solution to the generalized mapmaker's problem: flattening nonconvex polyhedral surfaces, IEEE Trans. PAMI, 11, pp. 1005-1008 (1989).
[16] Zigelman, G., Kimmel, R., Kiryati, N. Texture mapping using surface flattening via multi-dimensional scaling, IEEE Trans. Visualization and Comp. Graphics, 8, pp. 198-207 (2002).
[17] Grossman, R., Kiryati, N., Kimmel, R. Computational surface flattening: a voxel-based approach, IEEE Trans. PAMI, 24, pp. 433-441 (2002).
[18] Elad, A., Kimmel, R. Bending invariant representations for surfaces, Proc. CVPR (2001).
[19] Borg, I., Groenen, P. Modern multidimensional scaling - theory and applications, Springer (1997).
[20] Young, G., Householder, G. S. Discussion of a set of points in terms of their mutual distances, Psychometrika 3 (1938).
[21] Tal, A., Elad, M., Ar, S. Content based retrieval of VRML objects - an iterative and interactive approach, EG Multimedia, 97 (2001).
[22] Kimmel, R., Sethian, J. A. Computing geodesic paths on manifolds, Proc. US National Academy of Science 95, pp. 8431-8435 (1998).
[23] Sethian, J. A. A review of the theory, algorithms, and applications of level set methods for propagating surfaces, Acta Numerica (1996).
[24] Horn, E., Kiryati, N. Towards optimal structured light patterns, Image and Vision Computing, 17, pp. 87-97 (1999).
Automatic Estimation of a Priori Speaker Dependent Thresholds in Speaker Verification

Javier R. Saeta1 and Javier Hernando2

1 Biometric Technologies S.L., Barcelona, Spain
[email protected]
2 TALP Research Center, Universitat Politecnica de Catalunya, Barcelona, Spain
[email protected]

Abstract. The selection of a suitable threshold is considered essential for the correct performance of automatic enrollment in speaker verification. Conventional methods have had to face the scarcity of data and the problem of an a priori decision, using biased client scores, impostor data, variances, a speaker independent threshold, or some combination of them. Because of this lack of data, means and variances are in most cases estimated with very few scores. Noise or simply poor-quality utterances, when compared to the client model, can lead to scores which produce a high variance in the estimations. These scores are outliers and affect the correct estimation of the mean and, especially, the standard deviation. We propose here an algorithm to discard outliers. The method consists of iteratively selecting the most distant score with respect to the mean. If this score goes beyond a certain threshold, the score is removed and the mean and standard deviation estimations are recalculated. When there are only a few utterances to estimate the mean and variance, this method leads to a great improvement. Text dependent and text independent experiments have been carried out using a telephonic multi-session database in Spanish with 184 speakers, which was recently recorded by the authors.
1
Introduction
Speaker verification is used to authenticate a speaker and to ensure that the person is who (s)he claims to be. This requires the prior creation of a probabilistic model of the speaker. The accuracy usually depends on the quality and the number of utterances available for that speaker. In other words, the amount of data is essential for the correct creation of the model. Once the model is built, we are ready to test whether an utterance belongs to a certain speaker model. The utterance from the speaker is compared to the speaker
model. To make a decision, it is necessary to exceed a predefined threshold that has to be set during enrollment. In Gaussian models, the speaker verification decision is normally based on the likelihood ratio (LR), which is given by the test utterance X, the speaker model Y and the non-speaker model $\bar{Y}$, as follows:

$$LR(X) = \log \frac{p(X|Y)}{p(X|\bar{Y})} \gtrless \Theta(Y) \quad (1)$$
The speaker is accepted if the LR is above the threshold $\Theta(Y)$, and rejected if it is below. In research tasks, this threshold value can be fixed a posteriori. It is even possible to use a speaker independent threshold. On the other hand, this makes no sense in commercial applications, where the threshold has to be set a priori. Furthermore, the threshold should be speaker dependent to minimize computational cost and to take into account speaker peculiarities. Moreover, the lack of training material is a common problem in real applications, so threshold estimation methods are conditioned by the fact that there are often only a few utterances from every speaker.
The threshold can be computed from client data, from impostor data, or from both, but it is often difficult to use impostor material, for instance in phrase-prompted cases [9]. Mean and standard deviation estimates of impostor or client scores are frequently used to compute thresholds. Their estimation is difficult when we have only a few utterances, because some scores can be very distant from the mean. To cope with this problem, we will define an algorithm to remove those unrepresentative scores. Moreover, the algorithm will work with data from clients only. It seeks to be robust to the scarcity of training utterances and the difficulty of getting data from impostors.
In this paper, we review theoretical aspects and the state of the art of a priori decision threshold estimation in Section 2. In Section 3, we introduce a new algorithm for the determination of speaker dependent thresholds. We describe the experimental setup in Section 4 and, finally, present the conclusions.
2
Theoretical Aspects
The performance of speaker verification systems is usually established by means of the equal error rate (EER). The EER is obtained when the false rejection rate (FRR) and the false acceptance rate (FAR) are equal. However, in real cases, the EER is less significant because a certain FAR or FRR is usually required. To obtain a specific rate, we only have to adjust our threshold.
There have been several approaches to the problem of automatic threshold estimation. Normally, they use utterances from clients, from impostors, or from both. In [1], we can find a threshold estimated as a linear combination of the impostor scores mean ($\hat{M}_X$) and the standard deviation from impostors ($\hat{\sigma}_X$), as follows:

$$\Theta_x = \alpha (\hat{M}_X - \hat{\sigma}_X) + \beta \quad (2)$$
where α = 0.61, according to [1], and β should be obtained empirically. Three more speaker dependent threshold estimation methods similar to (2) are introduced in [6, 7]:
$$\Theta_x = \hat{M}_X + \alpha \hat{\sigma}_X^2 \quad (3)$$

where $\hat{\sigma}_X^2$ is the variance estimate of the impostor scores.
The other two methods also use the client scores mean $\hat{M}_{\bar{X}}$. These scores are biased because they are also employed to create the model:

$$\Theta_x = \alpha \hat{M}_{\bar{X}} + (1 - \alpha) \hat{M}_X \quad (4)$$

The last method adds a speaker independent threshold, $\Theta_{SI}$, as in:

$$\Theta_x = \Theta_{SI} + \alpha (\hat{M}_{\bar{X}} - \hat{M}_X) \quad (5)$$
This method can be seen as a fine adjustment of a speaker independent threshold. Other approaches to speaker dependent threshold estimation are based on a normalization of the client scores ($S_M$) by the mean ($\hat{M}_X$) and standard deviation ($\hat{\sigma}_X$) of the impostor scores [10]:

$$S_{M,norm} = \frac{S_M - \hat{M}_X}{\hat{\sigma}_X} \quad (6)$$
The methods which perform like (6) can also be seen as znorm [5]. We should also make reference to another threshold normalization technique, hnorm [4], which is based on a handset-dependent normalization. Some other methods are based on FAR and FRR curves [8]. The speaker utterances used to train the model are also employed to get the FRR curve, while a set of impostor utterances is used to obtain the FAR curve. The threshold is adjusted to equalize both curves. There are also other approximations [9] based on the idea of the difficulty of getting impostor utterances which fit the client model, especially in phrase-prompted cases. In these cases, it is difficult to obtain the whole phrase, and the alternative is to use some words from other speakers or from different databases to complete the whole phrase. Finally, it is worth noting that there are other methods which use different estimators for the mean and variance. In [2], we can observe two of them, classified according to the percentage of frames used. Instead of employing 100% of the frames, they use the 95% most typical frames, discarding the 2.5% maximum and minimum frame likelihood values, or the 95% best frames, removing the 5% minimum values. With the selection of 95% of the frames, we remove those frames which are out of the range of typical frame likelihood values. (Note that α takes different values in (2), (3), (4) and (5).)
3
New Speaker Dependent Algorithm for Threshold Estimation
As we have pointed out before, in commercial applications, the decision about the threshold should be reached during enrollment. Furthermore, it is difficult to get enough data to obtain a good estimation for the mean and variance of the client or impostor scores. Normally, we have only a few utterances and sometimes getting impostor data becomes even harder. That is the reason why we decide to use only data from clients as follows:
$$\Theta_x = \hat{M}_{\bar{X}} - \alpha \hat{\sigma}_{\bar{X}} \quad (7)$$

where $\hat{M}_{\bar{X}}$ is the biased client scores mean, $\hat{\sigma}_{\bar{X}}$ is the standard deviation, and α is a constant which has to be set experimentally on a development population. This equation is basically a combination of (2) and (3), but only with data from clients.
The main problem when we have only a few utterances is that some of them may produce non-representative scores. This is common when an utterance has background noise, is recorded with a very different handset, or simply when the speaker is sick or tired. These 'outliers' affect the mean estimation and, especially, the standard deviation estimation. The influence of outliers becomes even more significant when the standard deviation or the variance is multiplied by a constant, as in expressions (2), (3) and (7). Moreover, our previous experiments have shown that a few speakers concentrate the majority of errors; their threshold is probably wrongly fixed due to the outliers. Our goal is therefore to minimize their presence.
For this purpose, we propose an algorithm which computes mean and standard deviation estimations and then decides whether the estimations will improve with the exclusion of one or several scores from the computation. Roughly speaking, the idea consists of removing those scores which can lead to a wrong estimation because they are outliers. Of course, in some cases, we will not obtain any improvement by removing the outliers. The proposed algorithm begins by considering the most distant score with respect to the mean, and continues with the second most distant if necessary. The main questions here are: 1) how to decide on the elimination of a score, and 2) when to stop the algorithm.
To solve the first question, we use a parameter to control the difference between the standard deviation estimation with and without the most distant score, the possible outlier. We define ∆ as the percentage of variation of the standard deviation above which we consider discarding a score. ∆ decides whether the score is considered an outlier or not: if the percentage of variation exceeds ∆, we confirm the score as an outlier. When we find an outlier, we recalculate the mean and standard deviation estimations without it and then look for the next most distant score. And here we have the second question: when to stop the iterations. It is necessary to define $\sigma_{TH}$ as the flooring standard deviation, i.e., the minimum standard
deviation from which we decide to stop the process. If $\sigma_{TH}$ is reached, the algorithm stops.
Summarizing: first we compute the mean and standard deviation. Then we locate the most distant score from the mean, remove this score, and compute the mean and standard deviation again. We compare the new standard deviation with the old one. If the variation of the standard deviation is higher than ∆ and $\sigma_{TH}$ has not been reached yet, we start a new iteration. If the variation is higher than ∆ but the new standard deviation falls below $\sigma_{TH}$, the standard deviation is fixed to $\sigma_{TH}$. In any other case, the algorithm stops.
With this algorithm, we discard outliers and minimize the effect they have on the standard deviation: we remove those outliers which produce a variation of the standard deviation higher than ∆. The improvement in terms of FAR and FRR is more important when the number of utterances is low, because mean and variance estimation then becomes more difficult.
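The sketch below implements the iterative procedure as we read it from the description above; it is an illustration rather than the authors' code, and the parameter values (∆, the flooring standard deviation, and the α used for the threshold) are placeholders to be tuned on development data.

```python
import numpy as np

def robust_mean_std(scores, delta=0.2, sigma_th=0.05):
    """Iteratively discard outlier scores before estimating mean and std.

    delta:    minimum relative change of the std that confirms an outlier.
    sigma_th: flooring standard deviation; iteration stops once reached.
    """
    s = np.asarray(scores, dtype=float)
    mean, std = s.mean(), s.std()
    while len(s) > 2 and std > sigma_th:
        idx = np.argmax(np.abs(s - mean))          # most distant score
        trimmed = np.delete(s, idx)
        new_mean, new_std = trimmed.mean(), trimmed.std()
        change = abs(std - new_std) / max(std, 1e-12)
        if change <= delta:                        # not an outlier: stop
            break
        s, mean, std = trimmed, new_mean, new_std  # confirmed outlier: iterate
    return mean, max(std, sigma_th)                # floor the std at sigma_th

# Example: one noisy enrollment score among otherwise consistent ones.
scores = [1.10, 1.05, 1.12, 0.98, -0.40]
mean, std = robust_mean_std(scores)
threshold = mean - 0.8 * std                       # as in Eq. (7); alpha is illustrative
print(round(threshold, 3))
```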
4
Experimental Setup
4.1
Database
Our experimental database was especially designed for speaker recognition and was recorded by the authors. It has 184 speakers, 106 male and 78 female. It is a multi-session telephonic database in Spanish, with calls from fixed and mobile phones. 520 calls were made from a fixed phone and 328 from a mobile phone. One hundred speakers have at least 5 sessions. The average number of sessions per speaker is 4.55, and the average time between sessions per speaker is 11.48 days. Each session has:
- Four different 8-digit numbers, repeated twice.
- Two different 4-digit numbers, repeated twice.
- Six different isolated words.
- Five different sentences.
- One minute long read paragraph.
- One minute of spontaneous speech.
4.2
Verbal Information Verification
The speaker verification is performed in combination with a speech recognizer. During enrollment, utterances catalogued as "no voice" are discarded. This selection ensures a minimum quality for the threshold. Tests are performed with 8-digit utterances. The speech recognizer discards digits with a low probability and selects utterances which contain exactly 8 digits. Thus, verbal information [3] is applied here as a filter to remove low-quality utterances.
Table 1. Error rates for text dependent and text independent experiments

                      TD (digits)          TI (free speech)
                      FAR      FRR         FAR      FRR
Baseline (3 ses.)     4.18     15.09       15.02    33.93
Modified (3 ses.)     3.72     13.40       15.02     7.45
Baseline (4 ses.)     4.13      9.03       18.00    13.62
Modified (4 ses.)     4.24      7.40        9.99     6.94

4.3
Speaker Verification System
Utterances are processed in 20 ms frames, Hamming windowed and pre-emphasized with a filter with a zero at 0.97. The feature set is formed by 12th order Mel-frequency cepstral coefficients (MFCC) and the normalized log energy. Delta and delta-delta MFCC are computed to form a 39-dimensional vector for each frame. Cepstral Mean Subtraction is also applied. Left-right HMM models with 2 states per phoneme and one mixture per state are obtained for each digit. Client and world models have the same topology. Gaussian Mixture Models (GMM) with 64 mixtures are employed to model spontaneous speech.
4.4
Results
Our text dependent experiments have been carried out with digits, using speakers with a minimum of 5 sessions. This yields 100 clients. We use 3 or 4 sessions for enrollment and the remaining sessions for client tests. Speakers with more than one session and fewer than 5 sessions act as impostors. 8-digit and 4-digit utterances are employed for enrollment, whereas only 8-digit utterances are used for tests. In text independent experiments, one-minute-long spontaneous speech utterances are used to train and to test the model. The number of sessions chosen for training is the same as in the text dependent case.
Table 1 shows FAR and FRR for the text dependent and text independent experiments. The baseline experiments do not use the algorithm proposed in this paper, while the modified experiments include the algorithm when computing thresholds. As can be seen in Table 1, FAR and FRR are, as expected, higher in 3 session experiments than in 4 session ones. Furthermore, it is important to note that fixed and mobile sessions are used indistinctly for training and testing, which increases the EER. We can also observe in the table that error rates are considerably reduced in all experiments. The error reduction is much more significant in the text independent experiments. The reason is that the threshold for (1) is computed with only 3 or 4 scores, and in this case we clearly see the importance of removing the outliers. In text dependent experiments, however, the threshold is computed with digit utterances. There are 12 utterances per session, although some of them are discarded by the speech recognizer, as explained in Section 4.2. This means that we can have up to 48 utterances for 4 session experiments, which implies many more scores than in the
text independent case. In any case, FAR and FRR are reduced in the 3 session experiments, and the FRR decreases from 9.03% to 7.40% in the 4 session experiments. In the 3 session text independent experiments, the FRR decreases from 33.93% to 7.45%, and a large improvement over the baseline is also observed for the 4 session experiments.
5
Conclusions
The automatic estimation of speaker dependent thresholds has proved to be a key factor in speaker verification enrollment. Threshold computation methods have to deal with the lack of data and with the fact that the decision should be taken a priori for real-time applications. These methods are usually a linear combination of the estimated means and variances of client and/or impostor scores. When there are only a few utterances to create the model, the correct estimation of means and variances from client scores becomes an important challenge. In this paper, we have proposed an algorithm that alleviates the problem of a low number of utterances. It removes outliers and contributes to better estimations. Experiments on our database with one hundred clients have shown an important reduction in error rates. The improvements have been higher in text independent experiments than in text dependent experiments, because the former use only a few scores; in these cases, the influence of outliers is more relevant.
References
[1] S. Furui, "Cepstral Analysis for Automatic Speaker Verification", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254-272, 1981.
[2] F. Bimbot, D. Genoud, "Likelihood Ratio Adjustment for the Compensation of Model Mismatch in Speaker Verification", Proc. Eurospeech'97, pp. 1387-1390.
[3] Q. Li, B. H. Juang, Q. Zhou, C. H. Lee, "Verbal Information Verification", Proc. Eurospeech'97, pp. 839-842.
[4] D. A. Reynolds, "Comparison of Background Normalization Methods for Text-Independent Speaker Verification", Proc. Eurospeech'97, pp. 963-966.
[5] G. Gravier, G. Chollet, "Comparison of Normalization Techniques for Speaker Verification", Proc. RLA2C, Avignon, 1998, pp. 97-100.
[6] J. B. Pierrot, J. Lindberg, J. Koolwaaij, H. P. Hutter, D. Genoud, M. Blomberg, F. Bimbot, "A Comparison of A Priori Threshold Setting Procedures for Speaker Verification in the CAVE Project", Proc. ICASSP'98, pp. 125-128.
[7] J. Lindberg, J. Koolwaaij, H. P. Hutter, D. Genoud, J. B. Pierrot, M. Blomberg, F. Bimbot, "Techniques for A Priori Decision Threshold Estimation in Speaker Verification", Proc. RLA2C, Avignon, 1998, pp. 89-92.
[8] W. D. Zhang, K. K. Yiu, M. W. Mak, C. K. Li, M. X. He, "A Priori Threshold Determination for Phrase-Prompted Speaker Verification", Proc. Eurospeech'99, pp. 1203-1206.
[9] A. C. Surendran, C. H. Lee, "A Priori Threshold Selection for Fixed Vocabulary Speaker Verification Systems", Proc. ICSLP'00, vol. II, pp. 246-249.
[10] N. Mirghafori, L. Heck, "An Adaptive Speaker Verification System with Speaker Dependent A Priori Decision Thresholds", Proc. ICSLP'02, pp. 589-592.
A Bayesian Network Approach for Combining Pitch and Reliable Spectral Envelope Features for Robust Speaker Verification

Mijail Arcienega and Andrzej Drygajlo

Swiss Federal Institute of Technology Lausanne, Signal Processing Institute
{mijail.arcienega,andrzej.drygajlo}@epfl.ch
http://ltswww.epfl.ch

Abstract. In this paper, we provide a new approach to the design of robust speaker verification in noisy environments, using principles based on the missing data theory and Bayesian networks. This approach integrates high-level information concerning the reliability of pitch and spectral envelope features into the missing feature compensation process in order to increase the performance of Gaussian mixture models (GMM) of speakers. A Bayesian network approach for modeling statistical dependencies between reliable prosodic and spectral envelope features is presented. Within this approach, conditional statistical distributions (represented by GMMs) of the features are simultaneously exploited to increase the recognition score, particularly in very noisy conditions. Data masked by noise can be discarded, and the Bayesian network can be used to infer the likelihood values and compute the recognition scores. The system is tested on a challenging text-independent telephone-quality speaker verification task.
1
Introduction
In human decoding of speech, suprasegmental information plays an important role. Suprasegmental features, in particular prosodic features such as the pitch, are known to carry information regarding the identity of the speaker. However, the pitch alone is not discriminative enough for speaker verification, and spectral envelope features can be advantageously used to complete the set of acoustic parameters. Gaussian mixture models (GMMs) [1] are widely used to characterize the spectral envelope and have been successfully applied to text-independent speaker recognition systems. It is well known that the effect of channel distortions and noise on the performance of such systems is of serious concern. Prosodic features are known to be less affected by these impairments than spectral envelope features. Suprasegmental features, such as the pitch, are therefore worth re-examining for speaker recognition systems. The information carried by the pitch, which is not present in unvoiced regions, is not easily modeled. A new statistical modeling approach is necessary to
represent, in a unified framework, spectral envelope features and pitch-related features, as well as their dependencies. Moreover, since the pitch contour varies slowly in comparison to the spectral envelope features, it can be downsampled in order to reduce redundant data (Figure 1).
Bayesian networks [2] are powerful tools for modeling statistical dependencies between features; furthermore, they can also be used for inferring missing or unobserved features. In our previous work [3], a Bayesian network incorporated into a speaker verification system has already proved its capacity to exploit the information carried by additional features such as the pitch and the voicing status. On the other hand, the missing feature theory has recently been successfully applied to speech and speaker recognition systems dealing with noisy speech [4, 5]. Following one of the missing feature approaches, a classification is made in order to separate reliable features from unreliable (masked by noise) ones. After such a classification, only reliable features are taken into account during the recognition phase.
In this paper, the missing feature theory is incorporated into a multi-rate Bayesian network based system. While the Bayesian network is capable of taking advantage of the dependencies between the pitch and the spectral envelope features, the missing feature approach is used for discarding features masked by noise. In this way only reliable features account for the likelihood score. Two auxiliary features are introduced: the voicing status and the reliability status of the spectral envelope features. The pitch, as well as the spectral envelope features, is modeled in a different manner depending on the voicing status.

Fig. 1. Structure of the features
2
The Bayesian Network
Spectral envelope features that belong to voiced regions present common characteristics (presence of the response to the glottal excitation, specific distribution of formants, etc.). These characteristics are different than those of features from unvoiced regions. In order to better capture the variations and to better model
the distributions of voiced and unvoiced features, the Bayesian network introduced in this paper includes the voicing status s as an auxiliary feature. At time t, the voicing status can be $s_t = 1$ (voiced) or $s_t = 2$ (unvoiced). The main idea is to build conditional models (for the pitch as well as for the spectral envelope features), given the voicing status s. Suppose that at time t, there are L feature vectors $x_{t,l}$, $l = 1, \ldots, L$, associated with the pitch $\wp_t$ (Figure 1). Spectral envelope features aim at maximizing the exclusion of the characteristics of the harmonics that represent the pitch. Therefore the pitch and the spectral envelope carry complementary and uncorrelated information, and one can assume that, given the voicing status $s_t$, $x_{t,l}$ and $\wp_t$ are conditionally independent, i.e.

$$p(x_{t,l} | \wp_t, s_t) = p(x_{t,l} | s_t) . \quad (1)$$

Fig. 2. Graph of the Bayesian network associating the voicing status s, the pitch $\wp$ and the spectral envelope x at time t
Following the definitions presented below, x and $\wp$ are d-separated when s is instantiated. The graph associated with the features at time t is shown in Figure 2.

Definition 1. A causal network is a directed graph where directed links define the cause-effect relationships between the variables represented by the nodes.

Definition 2. A Bayesian network is a causal network represented by an acyclic directed graph where the links are quantified by conditional probabilities.

Definition 3. Two variables A and B in a Bayesian network are d-separated if there is a third variable C in the path between A and B such that

$$P(A | B, C) = P(A | C) , \quad (2)$$

in other words, A and B are conditionally independent given C (we also say that C has been instantiated).
2.1
The Conditional Models
Although Bayesian networks are often defined only by discrete probability tables, this paper incorporates continuous density functions also. The final score for a given utterance remains a log-likelihood. Under a given voicing status, spectral envelope features xt,l are supposed to be independent and identically distributed. They are therefore represented by
only two conditional probability density functions p(x|s = 1) and p(x|s = 2). Two Gaussian mixture models (GMMs) are used for representing these pdfs. In this manner, the statistical characteristics of the spectral envelope features are defined by the sets of parameters

$$\lambda_i^x = \{ c_{i,m}^x, \mu_{i,m}^x, \sigma_{i,m}^x \} , \quad (3)$$

where i = 1, 2 represents the voicing status and $m = 1, \ldots, M_i^x$; $M_i^x$ being the number of mixtures associated with each GMM. Since every vector $x_{t,l}$ is associated with one voicing status s, the goal during the training stage is to separate these vectors into two groups, one for each voicing status, and then to train both GMMs separately.
The pitch modeling also depends on the voicing status. In voiced zones, one GMM is used for modeling the statistical properties of the pitch values. $p(\wp | s = 1)$ is then completely defined by

$$\lambda^{\wp} = \{ c_m^{\wp}, \mu_m^{\wp}, \sigma_m^{\wp} \} , \quad (4)$$

where $m = 1, \ldots, M_1^{\wp}$; $M_1^{\wp}$ being the number of mixtures used for modeling the distribution of the pitch. In unvoiced regions, a value for the pitch does not physically exist; nevertheless, a table of discrete probabilities can still be used to represent it. If we set $\wp_t = 0$ in these regions, the probability $p(\wp = 0 | s = 2)$ will always equal one. The pitch can therefore be characterized by $p(\wp = 0 | s = 2) = 1$ and $p(\wp \neq 0 | s = 2) = 0$.
Finally, the voicing status s probabilities are defined by two weights, $w_1$ and $w_2$, that represent the probability of being in a voiced zone, p(s = 1), and the probability of being in an unvoiced zone, p(s = 2), respectively. The set of training data that belongs to an utterance, $O = \{\eta_1, \ldots, \eta_T\}$, where $\eta_t = \{\wp_t, x_{t,l}\}$, $l = 1, \ldots, L$, and the sequence of states $S = \{s_1, \ldots, s_T\}$ are therefore completely modeled by the Bayesian network represented in Figure 2 with parameters λ:

$$p(s = i) = w_i , \quad p(x|s = i) \text{ defined by } \lambda_i^x , \quad p(\wp|s = 1) \text{ defined by } \lambda^{\wp} , \quad p(\wp = 0|s = 2) = 1 ; \; p(\wp \neq 0|s = 2) = 0 . \quad (5)$$

The main problem when training the network is to determine, in a reliable manner, the sequence of states $S = \{s_1, \ldots, s_T\}$ for an observed sample O. This sequence of states relies on a correct voiced/unvoiced decision. One procedure for extracting reliable pitch estimates, associated with a reliable voiced/unvoiced decision, is presented in [6]. Once the sequence S is defined, the vectors x are separated into two groups. The multivariate probability density function of the vectors in each group, modeled by a GMM with a diagonal covariance matrix, is trained using the Expectation-Maximization (EM) algorithm. The parameters for the model of the pitch in voiced zones are calculated in the same manner.
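As an illustration of this training step, the sketch below (our own assumption, using scikit-learn's GaussianMixture rather than the authors' implementation) splits the frames by the voiced/unvoiced decision and fits separate diagonal-covariance GMMs for the spectral envelope features and for the voiced pitch values; the mixture counts are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_conditional_models(X, pitch, voiced, n_mix_x=8, n_mix_pitch=4):
    """Fit p(x|s=1), p(x|s=2) and p(pitch|s=1) with diagonal-covariance GMMs.

    X:      (T, D) spectral envelope feature vectors
    pitch:  (T,)   pitch values (0 in unvoiced frames)
    voiced: (T,)   boolean voiced/unvoiced decision
    """
    gmm_voiced = GaussianMixture(n_mix_x, covariance_type="diag").fit(X[voiced])
    gmm_unvoiced = GaussianMixture(n_mix_x, covariance_type="diag").fit(X[~voiced])
    gmm_pitch = GaussianMixture(n_mix_pitch, covariance_type="diag").fit(
        pitch[voiced].reshape(-1, 1))
    w1 = voiced.mean()                      # p(s=1); p(s=2) = 1 - w1
    return {"x_voiced": gmm_voiced, "x_unvoiced": gmm_unvoiced,
            "pitch": gmm_pitch, "w_voiced": w1}

# Toy usage with synthetic frames.
rng = np.random.default_rng(0)
T, D = 400, 12
X = rng.normal(size=(T, D))
voiced = rng.random(T) > 0.4
pitch = np.where(voiced, rng.normal(120.0, 20.0, T), 0.0)
models = train_conditional_models(X, pitch, voiced)
print(models["w_voiced"])
```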
3
Incorporating the Reliability Auxiliary Feature
Contrary to the pitch, spectral envelope features are easily corrupted by noise. The missing feature theory is incorporated in order to reduce the influence of noise on these features. We use the a posteriori SNR criterion for the classification between reliable and unreliable features. The threshold is fixed to τ = 6 dB, which corresponds to a threshold of 0 dB for the a priori SNR. Assuming no a priori knowledge about the features corrupted by noise, we can use the unbounded integration method [5]. This technique is known as marginalization. The resulting marginal pdf remains a GMM, which allows us to calculate the likelihood of the spectral envelope feature vector. This technique of the missing feature theory can be incorporated in the Bayesian network as follows:
– First, we introduce an auxiliary variable $m_i$ that indicates the reliability status of each vector component $x_i$ (reliable: $m_i = 0$; unreliable: $m_i = 1$). For a feature vector x this variable will also be a vector m.
– Second, all unreliable components are replaced by a unique symbol U, referred to as "missing" or "unreliable".
In this case, $x_{t,l}$ and $\wp_t$ are conditionally independent given $s_t$ and $m_{t,l}$. This can be expressed as:

$$p(x_{t,l}, \wp_t | s_t, m_{t,l}) = p(x_{t,l} | \wp_t, s_t, m_{t,l}) \cdot p(\wp_t | s_t, m_{t,l}) . \quad (6)$$

Moreover, $\wp_t$ is independent of $m_{t,l}$, therefore

$$p(x_{t,l}, \wp_t | s_t, m_{t,l}) = p(x_{t,l} | s_t, m_{t,l}) \, p(\wp_t | s_t) . \quad (7)$$

$p(x|s)$ can be written as a product where each factor depends only on one vector component $x_i$ (i.e. $p(x|s) = \prod_i p(x_i | s)$). For each spectral envelope vector, if $m_i = 0$ the probability $p(x_i | s, m_i = 0)$ is calculated by using the corresponding factor of $p(x|s)$. If $m_i = 1$ the probability is calculated by using a probability mass distribution (pmd) equivalent to the one used for the pitch in the unvoiced regions: when the component $x_i = U$, $p(x_i = U | m_i = 1) = 1$ and $p(x_i \neq U | m_i = 1) = 0$. Here, $x_{t,l}$ and $\wp_t$ are d-separated when $s_t$ and $m_{t,l}$ are instantiated. The missing status $m_{t,l}$ depends on s: unvoiced features, having lower energy, are more influenced by noise and are therefore more likely to be labelled as missing. The graph representing these relationships for L = 1 appears in Figure 3; a simplified representation is proposed as well. The multi-rate Bayesian network graph is presented in Figure 4.

Fig. 3. Graph of the Bayesian network incorporating the missing status. A proposed simplified representation for this graph is also shown

Fig. 4. Multi-rate Bayesian network graph incorporating the missing status
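The marginalization step can be illustrated as follows. This is a sketch under our own assumptions (diagonal covariances, numpy only), not the authors' code: for a diagonal-covariance GMM, integrating out the unreliable dimensions amounts to evaluating each mixture component over the reliable dimensions only.

```python
import numpy as np

def marginal_gmm_loglik(x, reliable, weights, means, variances):
    """Log-likelihood of x under a diagonal GMM using only reliable dimensions.

    x:         (D,) feature vector
    reliable:  (D,) boolean mask (True where m_i = 0, i.e. the component is reliable)
    weights:   (M,)   mixture weights
    means:     (M, D) component means
    variances: (M, D) component variances (diagonal covariances)
    """
    xr = x[reliable]
    mu = means[:, reliable]
    var = variances[:, reliable]
    # Per-component Gaussian log-density over the reliable dimensions only.
    log_comp = -0.5 * (np.log(2.0 * np.pi * var) + (xr - mu) ** 2 / var).sum(axis=1)
    log_comp += np.log(weights)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp over components

# Toy usage: 2-component GMM in 4 dimensions, last dimension masked by noise.
rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])
mu = rng.normal(size=(2, 4))
var = np.ones((2, 4))
x = rng.normal(size=4)
mask = np.array([True, True, True, False])
print(marginal_gmm_loglik(x, mask, w, mu, var))
```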
4
Likelihood Estimation
Let $O = \{\eta_1, \ldots, \eta_T\}$ be a test sequence, $S = \{s_1, \ldots, s_T\}$ the corresponding voicing status sequence, and $M = \{m_{1,l}, \ldots, m_{T,l}\}$ the sequence of reliability
status. Given the model, the state $s_t$ and the missing status $m_{t,l}$, the conditional likelihood of the couple $\eta_t = (\wp_t, x_{t,l})$ is (Equations 1 and 7):

$$p(\eta_t | s_t, \lambda) = p(\wp_t, x_{t,l} | s_t, m_{t,l}, \lambda) = p(x_{t,l} | s_t, m_{t,l}, \lambda) \cdot p(\wp_t | s_t, \lambda) , \quad (8)$$
We are interested in calculating p(O|S, M, λ), which gives us a likelihood score for the whole utterance. This is calculated as follows:

$$p(O | S, M, \lambda) = \prod_{t=1}^{T} p(\eta_t | s_t, m_{t,l}, \lambda) , \quad (9)$$

We can separate the influence of the pitch and of the spectral envelope:

$$p(O | S, M, \lambda) = p(X | S, M, \lambda) \cdot p(P | S, \lambda) , \quad (10)$$

where X is the set of spectral envelope feature vectors and P the set of pitch values. This decomposition can go further. The set of spectral envelope feature vectors X can be separated into two sets $X_V$ and $X_U$ containing the voiced and unvoiced vectors respectively, with their likelihoods being $p(X_V | M_V, \lambda_1^x)$ and $p(X_U | M_U, \lambda_2^x)$. The pitch likelihood can also be simplified. Indeed, $p(P | S, \lambda) = p(P_V | \lambda^{\wp})$, since $p(P_U | s = 2) = 1$. Therefore,

$$p(O | S, M, \lambda) = p(X_V | M_V, \lambda_1^x) \cdot p(X_U | M_U, \lambda_2^x) \cdot p(P_V | \lambda^{\wp}) . \quad (11)$$

Since $p(x_i = U | m_i = 1) = 1$, then

$$p(x_{t,l} | m_{t,l}, \lambda) = p(x_{t,l}^P | \lambda) , \quad (12)$$

where $x_{t,l}^P$ represents the reliable part of the spectral envelope vector. Finally we obtain:

$$p(O | S, M, \lambda) = p(X_V^P | \lambda_1^x) \cdot p(X_U^P | \lambda_2^x) \cdot p(P_V | \lambda^{\wp}) . \quad (13)$$

In order to obtain the log-likelihood LL, and at the same time make this measure independent of the number of elements in the test sequence O, we take the logarithm of both sides in Equation 13 and divide by T. If at the same time we bring out the number of frames in each group in order to normalize the separate likelihoods, we obtain:

$$LL(O | S, M, \lambda) = \frac{N_1}{T} LL_L(X_V^P | \lambda_1^x) + \frac{N_2}{T} LL_L(X_U^P | \lambda_2^x) + \frac{N_1}{T} LL(P_V | \lambda^{\wp}) , \quad (14)$$

where $LL_L(\cdot)$ is the normalized likelihood for a group of L spectral envelope feature vectors. As this equation shows, the individual likelihoods are weighted by coefficients that represent the proportion of each group in the utterance.
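A sketch of how the final score in (14) might be assembled from already computed per-frame log-likelihoods is given below; it is our own illustration, with the per-frame values taken as inputs rather than computed by the actual models.

```python
import numpy as np

def combined_log_likelihood(ll_x_voiced, ll_x_unvoiced, ll_pitch_voiced, T):
    """Weighted combination of normalized log-likelihoods as in Eq. (14).

    ll_x_voiced:     per-frame log-likelihoods of reliable voiced envelope features
    ll_x_unvoiced:   per-frame log-likelihoods of reliable unvoiced envelope features
    ll_pitch_voiced: per-frame log-likelihoods of the (downsampled) pitch values
    T:               total number of frames in the test utterance
    """
    n1, n2 = len(ll_x_voiced), len(ll_x_unvoiced)
    ll_v = np.mean(ll_x_voiced) if n1 else 0.0      # normalized group likelihoods
    ll_u = np.mean(ll_x_unvoiced) if n2 else 0.0
    ll_p = np.mean(ll_pitch_voiced) if n1 else 0.0
    return (n1 / T) * ll_v + (n2 / T) * ll_u + (n1 / T) * ll_p

# Toy usage with random per-frame scores.
rng = np.random.default_rng(0)
print(combined_log_likelihood(rng.normal(size=60), rng.normal(size=40),
                              rng.normal(size=12), T=100))
```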
5
Experiments
In the experiments presented here, 180 speakers of the Switchboard Database were used to build a universal background model (UBM). 90 other speaker models were adapted for performing the verification task. Approximately one minute of speech per speaker was used to train the models. Utterances from another session with durations of 30 seconds were used for tests. 512 mixtures were used for the spectral envelope features UBM and 4 mixtures for the pitch model. During the test phase, white Gaussian noise at different signal-to-noise ratios (SNRs) was added to the test utterances. As we can see in Figure 5, by incorporating the pitch and the conditional models, the effects of the noise, reflected in the score, are drastically reduced. A further improvement is noticeable when discarding unreliable information of the spectral envelope features.
Table 1. Definition of Likelihoods

Notation           Likelihood calculation
LL(X)              Classical MFCC-UBM-GMM system
LL(O|S, λ)         Bayesian network incorporating pitch
LL(O|S, M, λ)      Bayesian network incorporating reliability status and pitch

Fig. 5. Equal Error Rate (EER) for different Signal-to-Noise Ratios (SNRs), comparing LL(X|λ), LL(O|S,λ) and LL(O|S,M,λ) (y-axis: EER [%]; x-axis: clean speech and SNRs of 20, 10, 5 and 0 dB)

6
Conclusion
The Bayesian network approach presented in this paper has proved its capacity to represent and exploit the information carried by auxiliary features such as pitch, voicing status and reliability status. The use of conditional GMMs, given information about voiced and unvoiced segments and given the mechanism of detection and compensation of unreliable features, helps to better model the variability of noisy speech. This clearly improves the performance of speaker verification.
References
[1] Reynolds, D. A., Rose, R.: Robust text-independent speaker identification using Gaussian mixture models. IEEE Trans. Speech and Audio Processing 3 (1995) 72-83
[2] Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers (1988)
[3] Arcienega, M., Drygajlo, A.: A Bayesian network approach for combining pitch and spectral envelope features for speaker verification. In: COST 275 Workshop "The Advents of Biometrics over the Internet", Rome, Italy (2002) 99-102
[4] Renevey, P., Drygajlo, A.: Missing feature theory and probabilistic estimation of clean speech components for robust speech recognition. Volume 6, Budapest, Hungary (1999) 2627-2630
[5] Drygajlo, A., El-Maliki, M.: Integration and imputation methods for unreliable feature compensation in GMM based speaker verification. Crete, Greece (2001) 107-112
[6] Arcienega, M., Drygajlo, A.: Robust voiced/unvoiced decision associated to continuous pitch tracking in noisy telephone speech. Volume 4, Denver, Colorado, USA (2002) 2433-2436
Cluster-Dependent Feature Transformation for Telephone-Based Speaker Verification

Chi-Leung Tsang1, Man-Wai Mak1, and Sun-Yuan Kung2

1 Center for Multimedia Signal Processing, Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
2 Dept. of Electrical Engineering, Princeton University, USA

Abstract. This paper presents a cluster-based feature transformation technique for telephone-based speaker verification when labels of the handset types are not available during the training phase. The technique combines a cluster selector with cluster-dependent feature transformations to reduce the acoustic mismatches among different handsets. Specifically, a GMM-based cluster selector is trained to identify the cluster that best represents the handset used by a claimant. Handset-distorted features are then transformed by a cluster-specific feature transformation to remove the acoustic distortion before being presented to the clean speaker models. Experimental results show that cluster-dependent feature transformation with a number of clusters larger than the actual number of handsets can achieve a performance level very close to that achievable by the handset-based transformation approaches.
1
Introduction
Recently, speaker verification over the telephone has attracted much attention, primarily because of the proliferation of electronic banking and electronic commerce. Although substantial progress in telephone-based speaker verification has been made, sensitivity to handset variations remains a challenge. To enhance the practicality of these systems, techniques that make speaker verification systems handset invariant are indispensable. We have previously proposed a handset compensation approach [1] that aims to resolve the handset variation problem. The approach extends the ideas of stochastic matching [2] where the parameters of non-linear feature transformations are estimated under a maximum-likelihood framework. To adopt the transformations to telephone-based speaker verification, a GMM-based handset selector was also proposed. In addition, we have proposed a divergence-based handset selector with out-of-handset (OOH) rejection capability in [3] to handle
the utterances obtained from 'unseen' handsets. The selector is able to identify the 'seen' handsets and reject the 'unseen' ones so that appropriate compensation techniques can be applied to the distorted features obtained from these handsets. Although promising results have been obtained, the approach assumes that the labels of the handset types are known during the training phase so that handset-dependent feature transformations can be derived for the 'seen' handsets. This requirement, however, is difficult to fulfill in practical situations. For utterances obtained from a telephone conversation, the only known information is the telephone number. As a telephone number can be associated with several handsets, determining the handset type from a given phone number alone is not very reliable. Without a reliable method to identify the handset type, the approaches proposed in [1] and [3] become less useful.
To address the above problem, this paper proposes to use cluster-dependent feature transformations, instead of handset-dependent feature transformations, for channel compensation. In this cluster-based approach, a two-level clustering procedure is used to create a number of clusters from a telephone speech corpus such that each cluster represents a group of handsets with similar characteristics, and one set of transformation parameters is derived for each of the clusters. A cluster selector is also proposed to select the cluster that best represents the handset in a verification session. The distorted vectors are then transformed according to the transformation parameters associated with the identified cluster.
(This work was supported by The Hong Kong Polytechnic University Grant No. A442 and HKSAR RGC Grant No. PolyU5129/01E. S.Y. Kung was also a Distinguished Chair Professor of The Hong Kong Polytechnic University.)
2
Stochastic Feature Transformation
The key idea of stochastic matching [2] is to transform the distorted data to fit the clean speech models. Assuming that the telephone channel is represented by a cepstral bias b, the transformed vectors $\hat{x}_t$ can be written as

$$\hat{x}_t = f_{\nu}(y_t) = y_t + b \quad (1)$$

where $y_t$ is a D-dimensional distorted vector, $\nu = \{b_i\}_{i=1}^{D}$ is the set of transformation parameters, and $f_{\nu}(\cdot)$ denotes the transformation function. Given distorted speech $y_t$, $t = 1, \ldots, T$, and an M-center Gaussian mixture model (GMM) $\Lambda_X = \{\omega_j^X, \mu_j^X, \Sigma_j^X\}_{j=1}^{M}$ with mixing coefficients $\omega_j^X$, mean vectors $\mu_j^X$ and covariance matrices $\Sigma_j^X$ derived from the clean speech of several speakers (ten speakers in this work), the maximum-likelihood estimates of ν can be iteratively computed via the expectation-maximization (EM) algorithm [4] as follows [1]:

$$b_i = \frac{\sum_{t=1}^{T} \sum_{j=1}^{M} h_j(f_{\nu}(y_t)) \, (\sigma_{ji}^X)^{-2} (\mu_{ji}^X - y_{t,i})}{\sum_{t=1}^{T} \sum_{j=1}^{M} h_j(f_{\nu}(y_t)) \, (\sigma_{ji}^X)^{-2}} \quad i = 1, \ldots, D \quad (2)$$

where $h_j(f_{\nu}(y_t))$ is the posterior probability given by

$$h_j(f_{\nu}(y_t)) = \frac{\omega_j^X \, p(f_{\nu}(y_t) | \mu_j^X, \Sigma_j^X)}{\sum_{l=1}^{M} \omega_l^X \, p(f_{\nu}(y_t) | \mu_l^X, \Sigma_l^X)} \quad (3)$$

where $p(f_{\nu}(y_t) | \mu_j^X, \Sigma_j^X) \equiv \mathcal{N}(f_{\nu}(y_t); \mu_j^X, \Sigma_j^X)$ is a normal density with mean $\mu_j^X$ and covariance $\Sigma_j^X$.
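A sketch of one EM-style re-estimation of the bias in (2)-(3), for a diagonal-covariance clean-speech GMM, is given below; it is our own illustration (synthetic data, made-up model sizes), not the authors' code.

```python
import numpy as np

def update_bias(Y, b, weights, means, variances):
    """One re-estimation of the cepstral bias b from distorted vectors Y (T x D).

    weights, means, variances describe the clean-speech GMM Lambda_X with
    diagonal covariances: shapes (M,), (M, D), (M, D).
    """
    X_hat = Y + b                                             # f_nu(y_t) = y_t + b
    # Posterior h_j(f_nu(y_t)) for every frame and mixture, as in Eq. (3).
    log_comp = (-0.5 * (np.log(2 * np.pi * variances)
                        + (X_hat[:, None, :] - means) ** 2 / variances).sum(-1)
                + np.log(weights))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    h = np.exp(log_comp)
    h /= h.sum(axis=1, keepdims=True)                         # (T, M)
    inv_var = 1.0 / variances                                 # (M, D)
    num = (h[:, :, None] * inv_var * (means - Y[:, None, :])).sum(axis=(0, 1))
    den = (h[:, :, None] * inv_var).sum(axis=(0, 1))
    return num / den                                          # new bias, Eq. (2)

# Toy usage: recover a known channel bias from synthetic distorted frames.
rng = np.random.default_rng(0)
M, D, T = 2, 12, 200
weights, means, variances = np.full(M, 0.5), rng.normal(size=(M, D)), np.ones((M, D))
true_b = rng.normal(scale=0.5, size=D)
Y = means[rng.integers(0, M, T)] + rng.normal(size=(T, D)) - true_b
b = np.zeros(D)
for _ in range(5):                                            # a few iterations
    b = update_bias(Y, b, weights, means, variances)
print(np.round(b - true_b, 2))                                # close to zero
```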
3
Hierarchical Clustering
3.1
Unsupervised Handset Clustering
The clustering algorithm is based on the EM algorithm [4]. Let $\mathcal{Y} = \{Y_u; u = 1, \ldots, U\}$ be a set of vector sequences derived from U utterances, and $C = \{C^{(1)}, C^{(2)}, \ldots, C^{(N)}\}$ be a set of clusters derived from $\mathcal{Y}$, where N is the number of clusters. Given a vector sequence $Y_u$ derived from an utterance of an unknown handset, the posterior probability that $Y_u$ is generated by the n-th cluster $C^{(n)}$ is

$$P(C^{(n)} | Y_u, \Lambda) = \frac{\pi^{(n)} \, p(Y_u | Y_u \in C^{(n)}, \phi^{(n)})}{\sum_{k=1}^{N} \pi^{(k)} \, p(Y_u | Y_u \in C^{(k)}, \phi^{(k)})} \quad 1 \leq n \leq N \quad (4)$$

where $\Lambda = \{\pi^{(k)}, \phi^{(k)}; k = 1, \ldots, N\}$ and $\phi^{(k)} = \{\mu^{(k)}, \Sigma^{(k)}; k = 1, \ldots, N\}$, where $\pi^{(k)}$, $\mu^{(k)}$, and $\Sigma^{(k)}$ denote respectively the mixture coefficient, mean vector, and covariance matrix of the k-th component density (cluster). Therefore, the vector sequence Y belongs to the $n^*$-th cluster $C^{(n^*)}$ if $P(C^{(n^*)} | Y, \Lambda) > P(C^{(n)} | Y, \Lambda) \; \forall \, n \neq n^*$.
Using this clustering algorithm, we can divide the set of vector sequences $\mathcal{Y}$ into N different clusters, with each cluster containing the vector sequences that are close to each other in the Mahalanobis sense. Specifically, the cluster $C^{(n)}$, where $1 \leq n \leq N$, contains the set of vector sequences $\mathcal{Y}^{(n)}$ such that $\mathcal{Y}^{(n)} = \{Y; Y \in C^{(n)} \text{ and } P(C^{(n)} | Y, \Lambda) > P(C^{(k)} | Y, \Lambda) \; \forall \, k \neq n\}$. Note also the following properties of the $\mathcal{Y}^{(n)}$'s: $\bigcup_{n=1}^{N} \mathcal{Y}^{(n)} = \mathcal{Y}$ and $\mathcal{Y}^{(n)} \cap \mathcal{Y}^{(m)} = \emptyset$ $\forall \, n \neq m$.
3.2
Cluster Selector
In our previous work [1], a handset selector is designed to identify the most likely handset used by the claimants. The handset’s identity was then used to select the transformation parameters to recover the distorted speech. Although results have shown that the handset selector is able to identify the ten handsets in HTIMIT at a rate of 98.29%, it may be difficult to derive one set of transformation parameters for each handset when no labels of the handset types are available during the training phase. To address the above problem, we propose to use a cluster selector with cluster-dependent feature transformation. The cluster selector is constructed by a two-level clustering procedure. In the first level, the EM-algorithm is used to create one cluster for each group of similar handsets. That is, the utterances from all types of handsets that the users may use for verification are grouped together to form one cluster. This cluster is then divided into N clusters, where N > 1, using the clustering algorithm described in Section 3.1, with each resulting cluster
containing only the utterances from handsets with similar characteristics. Then, in the second level, a cluster-specific GMM is derived for each cluster using the utterances in that cluster. For each cluster, the estimation algorithm described in Section 2 is used to determine a set of transformation parameters that aim to remove the distortion introduced by the handsets belonging to that particular cluster. During verification, the transformation parameters corresponding to the most likely cluster to which the handset belongs are used to transform the distorted features to fit the clean speaker models. Specifically, during verification, an utterance of claimant's speech obtained from an unknown handset is fed to N cluster-dependent GMMs (denoted as $\{\Omega_n\}_{n=1}^{N}$). The cluster that best represents the handset is selected according to

$$n^* = \arg\max_{n=1}^{N} \sum_{t=1}^{T} \log p(y_t | \Omega_n) \quad (5)$$
where p(yt |Ωn ) is the likelihood of the n-th cluster. Then, the transformation parameters corresponding to the n∗ -th cluster are used to transform the distorted vectors.
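A sketch of this verification-time selection is given below, again using scikit-learn GMMs as stand-ins for the cluster models and the bias transformation of Section 2; the toy clusters, offsets, and mixture counts are our own illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_cluster_and_transform(Y, cluster_gmms, cluster_biases):
    """Pick the cluster whose GMM gives the highest total log-likelihood
    for the distorted utterance Y (T x D) and apply its bias transformation."""
    scores = [gmm.score_samples(Y).sum() for gmm in cluster_gmms]   # Eq. (5)
    n_star = int(np.argmax(scores))
    return n_star, Y + cluster_biases[n_star]

# Toy setup: two "clusters" of distorted speech with different offsets.
rng = np.random.default_rng(0)
D = 12
gmms, biases = [], []
for offset in (-1.0, 1.0):
    data = rng.normal(loc=offset, size=(500, D))
    gmms.append(GaussianMixture(4, covariance_type="diag").fit(data))
    biases.append(-offset * np.ones(D))       # bias that maps the cluster back to 0
Y_test = rng.normal(loc=1.0, size=(50, D))
n_star, Y_hat = select_cluster_and_transform(Y_test, gmms, biases)
print(n_star, np.round(Y_hat.mean(), 2))
```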
4
Experiments and Results
4.1
Uncoded and Coded Corpora
HTIMIT [5] and GSM-transcoded HTIMIT containing resynthesized GSM coded speech [6] were used to evaluate the proposed approach. HTIMIT was obtained by playing back a subset of the TIMIT corpus through 9 different telephone handsets (cb1-cb4, el1-el4, and pt1) and a Sennheizer head-mounted microphone (senh). The GSM-transcoded corpus was obtained by encoding the speech in HTIMIT using a GSM coder. The encoded utterances were then decoded to produce resynthesized speech. Feature vectors were extracted from each of the utterances in the uncoded and coded corpora. The feature vectors were 12-dimensional mel-frequency cepstrum coefficients (MFCC) [7]. These vectors were computed every 14 ms using a Hamming window of 28 ms. Speakers in the corpora were divided into a speaker set (50 male and 50 female) and an impostor set (25 male and 25 female). Each speaker was assigned a personalized 32-center GMM that models the characteristics of his/her own voice. (We chose to use GMMs with 32 centers because of the limited amount of data for each speaker; we found that the EM algorithm becomes numerically unstable when the number of centers is larger than 32.) Each GMM was trained by using the feature vectors derived from the HTIMIT SA and SX sentence sets (with a total of 7 sentences) of the corresponding speaker. A collection of all SA and SX sentences uttered by all speakers in the speaker set was used to train a 64-center GMM background model (Mb). The handset "senh" in HTIMIT was used as the enrollment handset, and utterances obtained from it were considered to be clean.
4.2
Cluster-Dependent Feature Transformation
All of the HTIMIT utterances from the 9 different handsets (cb1-cb4, el1-el4, and pt1) were put together to form one cluster, and then the clustering algorithm in Section 3.1 was used to divide this cluster into N clusters (N = 3, 6, 9, 12 and 15 in this work). In HTIMIT, each utterance has ten different versions, each of them being produced by playing the corresponding clean TIMIT utterance to one of the ten handsets. As a result, the corpus contains an identical number of utterances from each speaker for all handsets. This enables the clustering process to take the handset characteristics, rather than the speaker characteristics, into account. After the clustering process, each cluster will contain the speech of different speakers from the same handset (or a group of handsets with similar characteristics). For the n-th cluster (n = 1, ..., N), 70 utterances corresponding to that cluster were selected to create a 2-center GMM $\Lambda_{Y_n}$, i.e. M = 2 in (2). In order to minimize speaker/utterance variation and to retain handset variation between a distorted cluster and the features extracted from the enrollment handset, the 70 utterances corresponding to the distorted cluster and the enrollment handset must have identical contexts and they must be produced by the same set of speakers. For example, Utterance k of cluster n will have context identical to Utterance k of handset "senh", and so on, and they are produced by the same speaker. Specifically, 70 utterances from handset "senh" were used to create $\Lambda_{X_1}, \ldots, \Lambda_{X_N}$, with $\Lambda_{X_n}$ being created from the same set of sentences used to create $\Lambda_{Y_n}$ for n = 1, ..., N. As a result, $\{\Lambda_{X_n}, \Lambda_{Y_n}\}_{n=1}^{N}$ forms a set of GMM pairs representing the statistical difference between the enrollment handset "senh" and the verification handsets in each of the clusters. Then, for each pair of $\{\Lambda_{X_n}, \Lambda_{Y_n}\}_{n=1}^{N}$, a set of transformation parameters $\nu_n$ was computed using the estimation formulae described in Section 2. As claimants may use the enrollment handset "senh" for verification, a set of feature transformation parameters was also derived for handset "senh" by creating a 2-center GMM $\Lambda_X$ using the SA and SX sentences obtained from senh. This handset-dependent feature transformation will be used when speech from the enrollment handset is fed to the verification system. The same procedures were also applied to the GSM-transcoded HTIMIT corpus, and a set of GSM-based feature transformation parameters $\nu_n^{(GSM)}$ were computed for each cluster.
4.3
4.3 Coder-Dependent Cluster Selectors
Two cluster selectors, each of them consisting of N+1 64-center GMMs {Ω_n^(i); i = 1, 2 and n = 1, . . . , N+1}, were constructed from the SA and SX sentence sets of the uncoded and coded corpora (i = 1 for the uncoded corpus and i = 2 for the GSM-transcoded corpus). For example, GMM Ω_n^(i) for n = 1, . . . , N represents the characteristics of speech derived from the n-th cluster of the i-th corpus, while GMM Ω_{N+1}^(i) represents the characteristics of speech derived from
the enrollment handset (senh) of the i-th corpus. Here, we treat all the speech from the enrollment handset as one cluster. Unlike the speaker models described in Section 4.1, the amount of data in each cluster allows us to use many more centers for each GMM. However, we chose to use only 64 centers because our objective is to capture the handset characteristics rather than the characteristics of individual speakers. As the handset characteristics should be broader than the speaker characteristics in the feature space, deriving a small number of centers from the utterances of many speakers should prevent the centers from capturing the speaker characteristics.
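A minimal sketch of such a coder-dependent cluster selector is shown below: it returns the index of the cluster whose 64-center GMM gives the highest log-likelihood for the claimant's feature sequence. The use of scikit-learn GaussianMixture models, the average per-frame log-likelihood, and the maximum-likelihood rule itself are assumptions about how the selector referred to as (5) operates.

```python
import numpy as np

def select_cluster(Y, cluster_gmms):
    """Y: (num_frames, 12) MFCC sequence of the claimant's utterance.
    cluster_gmms: list of N+1 fitted 64-center GaussianMixture models,
    the last one representing the enrollment handset 'senh'."""
    scores = [g.score(Y) for g in cluster_gmms]   # average log-likelihood per frame
    return int(np.argmax(scores))                 # index of the selected cluster
```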
4.4 Verification Procedures
During verification, a vector sequence Y derived from a claimant’s utterance (SI sentence) was fed to a coder-dependent (uncoded or GSM) cluster selector corresponding to the coder being used by the claimant. According to the outputs of the cluster selector (5), a set of coder-dependent transformation parameters were selected. Note that in addition to the N clusters, the output of the cluster selector used in this experiment can also be “senh”. The features were transformed and then fed to a 32-center GMM speaker model (Ms) to obtain a score log p(Y|Ms), which was then normalized according to [8]

S(Y) = log p(Y|Ms) − log p(Y|Mb)   (6)
where Mb is a 64-center GMM background model. S(Y) was compared with a threshold to make a verification decision. In this work, the threshold for each speaker was adjusted to determine the equal error rate (EER). For ease of comparison, we collected the scores of 100 speakers, each being impersonated by 50 impostors, to compute the speaker-independent EER. There were 300 client speaker trials (100 client speakers × 3 sentences per speaker) and 150,000 impostor trials (50 impostors per speaker × 100 client speakers × 3 sentences per impostor).
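The normalized score in (6) and the threshold decision can be written compactly as below; using GaussianMixture.score, which returns the average per-frame log-likelihood rather than the total, is an assumption that only rescales the score.

```python
def normalized_score(Y, speaker_gmm, background_gmm):
    """S(Y) = log p(Y|Ms) - log p(Y|Mb), here averaged over frames."""
    return speaker_gmm.score(Y) - background_gmm.score(Y)

def verify(Y, speaker_gmm, background_gmm, threshold):
    """Accept the claimed identity if the normalized score exceeds the threshold."""
    return normalized_score(Y, speaker_gmm, background_gmm) > threshold
```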
4.5 Verification Results
The results using uncoded HTIMIT and GSM-transcoded HTIMIT are summarized in Table 1. A baseline experiment (without using the cluster selectors and feature transformations), an experiment using CMS as channel compensation, and an experiment using a handset selector and handset-dependent feature transformation were also conducted for comparison. The EERs of 100 genuine speakers and 50 impostors were averaged to give an average EER. Columns labeled “HTIMIT” and “GSM-HTIMIT” respectively show the performance for HTIMIT speech and GSM-transcoded HTIMIT speech. For uncoded HTIMIT, Table 1 shows that the cluster-dependent feature transformation approach, with the number of clusters (N) between 3 and 15, can significantly reduce the error rates compared to the baseline and the CMS methods. However, under handset mismatch conditions with N = 3 and
Table 1. Equal error rates (in %) under handset match/mismatch conditions for uncoded HTIMIT and GSM-transcoded HTIMIT. Transformation methods include the baseline, cepstral mean subtraction (CMS), handset-dependent feature transformation, and cluster-dependent feature transformation. FT stands for zero-th order stochastic feature transformation

Row  Transformation Method   Clusters Used (N)   HTIMIT Mismatch   HTIMIT Matched   GSM-HTIMIT Mismatch   GSM-HTIMIT Matched
1    Baseline                N/A                 23.51             3.09             24.77                 5.99
2    CMS                     N/A                 11.81             6.95             14.72                 8.50
3    FT (Handset-based)      N/A                  7.10             3.19             10.18                 4.32
4    FT (Cluster-based)      3                    9.02             3.12             11.55                 4.64
5    FT (Cluster-based)      6                    8.66             3.12             11.02                 4.89
6    FT (Cluster-based)      9                    7.91             3.20             10.18                 4.48
7    FT (Cluster-based)      12                   7.85             3.13              9.83                 4.59
8    FT (Cluster-based)      15                   7.62             2.98             10.14                 4.87
N = 6 (Rows 4 and 5, Column 4), the average EERs are higher than that of the handset-dependent feature transformation approach (Row 3, Column 4). As we have used the utterances from 9 different handsets (cb1-cb4, el1-el4, and pt1) for finding the clusters, for N = 3 or N = 6 the clustering algorithm may not be able to create enough clusters such that each of them contains only the utterances from handsets with similar characteristics. If a cluster contains utterances from different kinds of handsets, a global feature transformation will result, which may reduce the capability of the transformation to recover the speech patterns. This problem can be solved by using more clusters. For instance, for cluster-dependent feature transformation with N = 9, 12, and 15 (Rows 6 to 8, Column 4), average EERs comparable to that of the handset-dependent feature transformation were obtained. Table 1 also shows that under matched conditions, varying the number of clusters does not affect the EER significantly. When utterances from the enrollment handset were fed to the cluster selector, most of them were recognized as “senh” by the cluster selector and transformed by the transformation parameters of the enrollment handset. As only a small number of utterances were transformed incorrectly (by the cluster-based transformation), varying the number of clusters has little effect on the EER. Similar results were also obtained from the GSM-transcoded speech. In particular, Table 1 shows that under handset mismatch conditions, the cluster-dependent feature transformation with N = 9 (Row 6, Column 6) achieves an average EER that is the same as that of the handset-dependent transformation (Row 3, Column 6). With N = 12 and N = 15 (Rows 7 and 8, Column 6), the cluster-dependent feature transformation even outperforms the handset-dependent feature transformation.
Based on the above experimental results, we conjecture that the cluster-dependent feature transformation approach can achieve a low average EER provided that the number of clusters is large enough to prevent global transformations from occurring. However, increasing the number of clusters will also decrease the number of utterances available for training the transformation parameters. Although we have not attempted to determine the maximum number of clusters that can be used, the number can be increased as long as there are sufficient utterances in each cluster to create the GMMs and to estimate the transformation parameters.
5 Conclusions
This paper has demonstrated that cluster-dependent stochastic feature transformation is an effective channel compensation approach for telephone-based speaker verification. Although it is not guaranteed to outperform handset-dependent feature transformation, it will be very useful when labels of the handset types are not available during the training phase. Results based on 150 speakers of HTIMIT and GSM-transcoded HTIMIT show that combining cluster-dependent feature transformation and cluster identification can significantly reduce the verification error rate. We also found that cluster-dependent feature transformation with a number of clusters larger than the actual number of handsets can achieve a performance level very close to that achievable by the handset-dependent transformation approach. We are currently extending this cluster-based approach to telephone corpora, such as SPIDRE, where no handset labels are available for either enrollment or verification.
References
[1] M. W. Mak and S. Y. Kung, “Combining stochastic feature transformation and handset identification for telephone-based speaker verification,” in Proc. ICASSP’2002, 2002, pp. I701–I704.
[2] A. Sankar and C. H. Lee, “A maximum-likelihood approach to stochastic matching for robust speech recognition,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 3, pp. 190–202, 1996.
[3] C. L. Tsang, M. W. Mak, and S. Y. Kung, “Divergence-based out-of-class rejection for telephone handset identification,” in Proc. ICSLP’02, 2002, pp. 2329–2332.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. of Royal Statistical Soc., Ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[5] D. A. Reynolds, “HTIMIT and LLHDB: speech corpora for the study of handset transducer effects,” in Proc. ICASSP’97, 1997, vol. 2, pp. 1535–1538.
[6] Eric W. M. Yu, M. W. Mak, and S. Y. Kung, “Speaker verification from coded telephone speech using stochastic feature transformation and handset identification,” in Pacific-Rim Conference on Multimedia 2002, 2002, pp. 598–606.
[7] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. on ASSP, vol. 28, no. 4, pp. 357–366, August 1980.
[8] C. S. Liu, H. C. Wang, and C. H. Lee, “Speaker verification using normalized log-likelihood score,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 1, pp. 56–60, 1996.
Searching through a Speech Memory for Text-Independent Speaker Verification
Dijana Petrovska-Delacrétaz¹, Asmaa El Hannani¹, and Gérard Chollet²
¹
DIVA Group, University of Fribourg, Informatics Dept., Switzerland {dijana.petrovski,asmaa.elhannani}@unifr.ch 2 ENST, TSI, Paris, France
[email protected] Abstract. Current state-of-the-art speaker verification algorithms use Gaussian Mixture Models (GMM) to estimate the probability density function of the acoustic feature vectors. Previous studies have shown that phonemes have different discriminant power for the speaker verification task. In order to better exploit these differences, it seems reasonable to segment the speech into distinct speech classes and carry out the speaker modeling for each class separately. Because transcribing databases is a tedious task, we prefer to use data-driven segmentation methods. If the number of automatic classes is comparable to the number of phonetic units, we can make the hypothesis that these units correspond roughly to the phonetic units. We have decided to use the well-known Dynamic Time Warping (DTW) method to evaluate the distance between two speech segments. If the two speech segments belong to the same speech class, we could expect that the DTW distortion measure can capture the speaker-specific characteristics. The novelty of the proposed method is the combination of the DTW distortion measure with data-driven segmentation tools. The first experimental results of the proposed method, in terms of Detection Error Tradeoff (DET) curves, are comparable to current state-of-the-art speaker verification results, as obtained in NIST speaker recognition evaluations.
1
Introduction
Current state-of-the-art speaker verification algorithms use Gaussian Mixture Models (GMM) to estimate the probability density function of the acoustic feature vectors [1]. Various studies [2], [3], [4], [5] have shown that phonemes have different discriminant power for the speaker verification task. In order to better exploit these differences, it seems reasonable to segment the speech into distinct speech classes and carry out the speaker modeling for each class separately. We call these experiments segmental speaker recognition experiments. The speech classes can be determined using two different approaches. The first choice is to use Large Vocabulary Continuous Speech Recognition (LVCSR) with previously trained phone models and a language model, generally a bigram or a trigram stochastic grammar. With such systems, we can segment the speech data into
phones. The second possibility is to use data-driven techniques based on Automatic Language Independent Speech Processing (ALISP) tools [6], which provide a general framework for creating speech units (denoted here as ALISP units) with little or no supervision. The number of speech classes depends on the sharpness of the speech segmentation that we want to obtain. Because transcribing databases is a tedious task, we prefer to use data-driven segmentation methods. If the number of automatic classes is comparable to the number of phonetic units, we can make the hypothesis that these units correspond roughly to the phonetic units. It is clear that if we would like to use such methods for speech recognition, we would have to study the correspondence of the automatically acquired ALISP units with phone units. On the other hand, applying the ALISP segmentation for speaker verification purposes does not require the correspondence of the ALISP units with the phone units to be known, and the ALISP recognizer is used only to segment the speech data. After the segmentation step, the speaker modeling can be done for each class separately. In [4], [5] we have already used a data-driven speech segmentation tool to obtain 8 speech classes. The number of classes was chosen in order to have enough data for each class when dealing with the 2 min of enrollment speech data used to build the speaker models. In those experiments, we studied speaker modeling algorithms such as Multiple Layer Perceptrons (MLP) and GMMs. The goal of this work is to try to model the speech in more than 8 classes. The number of speech classes used in this work is 64, which is comparable to a pseudo-phonetic segmentation. It is obvious that when using so many classes, the classical speaker modeling method has to be redefined. Speaker modeling with GMMs is still possible, but more difficult because of the lack of client speaker data. MLPs could not be applied because of the lack of sufficient data for the client class. Therefore, we have decided to use the well-known Dynamic Time Warping method to evaluate the distance between two speech patterns. This method can be used on both short and long speech data. If the two speech patterns belong to the same speech class, we could expect that the DTW distortion measure can capture the speaker-specific characteristics. The DTW distance measure has already been used for text-dependent speaker recognition experiments [7], [8], [9], [10]. The novelty of the proposed method is its combination with the ALISP units. The speaker verification approach chosen in this paper is the following: the segments found in the enrollment client speech data constitute the speaker-specific speech memory (denoted here also as the Client-Dictionary). Another set of speakers is used to model the non-speaker (world) speech memory (denoted here as the World-Dictionary). The speaker verification step is done by combining these two dictionaries. The DTW distance measure is applied in order to find the similarities of the incoming test segments to the client and world dictionaries, on a segmental level. The final client score is related to the class-specific DTW distance measures. Furthermore, normalization techniques are also applied, leading to some improvements. The outline of this paper is the following: in Sect. 2 we present the proposed method in more detail. Sect. 3 describes the database used and
the experimental protocol. The evaluation results are reported in Sect. 4. The conclusions and perspectives are given in Sect. 5.
2 Searching through a Speech Memory for Speaker Verification
2.1 Data-Driven Speech Segmentation
The steps needed to acquire and model the set of data-driven speech units, denoted here as Automatic Language Independent Speech Processing (ALISP) units [6], are briefly described in this section. Instead of the widely used phonetic labels, data-driven labels automatically determined from the training corpus are used. The set of symbolic units is automatically acquired through temporal decomposition, vector quantization, segment labeling and Hidden Markov Modeling, as shown in Fig. 1. After a classical pre-processing step leading to acoustic feature vectors, temporal decomposition [11] is used for the initial segmentation of the speech data into quasi-stationary segments. At this point, the speech is segmented into spectrally stable portions. For each segment, its gravity center frame is determined. A vector quantization algorithm is used to cluster the gravity center frames of the spectrally stable speech segments. The codebook size defines the number of ALISP symbols. The initial labeling of the entire speech segments is achieved by minimizing the cumulated distance of all the vectors of a speech segment to the nearest centroid of the codebook. The result of this step is an initial segmentation and labeling. These labels are used as the initial transcriptions of the ALISP speech units. Hidden Markov Modeling is further applied for a better coherence of the initial ALISP units.
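The vector-quantization and labeling steps can be sketched as below; treating the mean frame of each quasi-stationary segment as its gravity-center frame and using k-means for the codebook are simplifying assumptions, and the temporal decomposition and HMM refinement stages are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_segments(segments, n_symbols=64):
    """segments: list of (num_frames, dim) arrays produced by temporal decomposition.
    Returns one ALISP symbol index per segment."""
    centres = np.vstack([seg.mean(axis=0) for seg in segments])   # gravity-centre frames (approximated)
    codebook = KMeans(n_clusters=n_symbols, n_init=10).fit(centres).cluster_centers_
    labels = []
    for seg in segments:
        # cumulated distance of all frames of the segment to each codebook centroid
        d = np.linalg.norm(seg[:, None, :] - codebook[None, :, :], axis=2).sum(axis=0)
        labels.append(int(np.argmin(d)))
    return labels
```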
Fig. 1. Unsupervised Automatic Language Independent Speech Processing (ALISP) unit acquisition, and their HMM modeling
Fig. 2. Illustration of the proposed speaker verification method based on searching in the client and world speech dictionaries, representing the client and world speech memory
2.2 Segmental Speaker Modeling Using Dynamic Time Warping
A speaker verification system is composed of training and testing phases. During the training (also known as the enrollment phase), the client model is constructed. The proposed client model is built with segments found in the client's enrollment speech data, denoted here as the Client-Dictionary. The non-speaker (or world) model is built with segments found in the speech data representing the world speakers, denoted here as the World-Dictionary. During the testing phase (see Fig. 2), each of the test speech segments is compared, using a DTW distance measure, to the Client-Dictionary and to the World-Dictionary. In the next step, the ratio of the number of times that a segment belonging to the speaker is chosen, nb(Sc), to the mean value of the number of times a world segment is chosen, mnb(SW), is estimated. This ratio represents the score S of the claimed speaker:

S = nb(Sc) / mnb(SW).

These experiments were primarily designed for the purpose of speaker clustering. If the “pattern” of the relative frequencies remains the same as the duration of test speech data increases, we can suppose that the world speakers with the same relative frequencies could be estimated as being equally spaced to the client speaker. The reason for choosing the mean value in the denominator is to “diminish” the discrepancy between the number of speech segments available in the Client- and in the World-Dictionaries. The final decision is taken by comparing this score to a threshold (found on an independent development set). The hypothesis that the claimed speaker S is the true speaker is valid if

nb(Sc) / mnb(SW) > δ,

with δ being the decision threshold.
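The following sketch shows the segment matching and the score above: a plain DTW distortion between two segments, and a scoring function that counts, per ALISP class, whether the nearest dictionary segment belongs to the client or to a world speaker. The length normalisation of the DTW cost, the dictionary data structures, and the owner labels are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain DTW between segments a (n, d) and b (m, d) with Euclidean local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)                     # length-normalised distortion

def claimant_score(test_segments, client_dict, world_dict, world_ids):
    """test_segments: list of (alisp_class, features); the dictionaries map an
    ALISP class to a list of (features, owner) pairs, owner being 'client' or a
    world-speaker id. Returns S = nb(Sc) / mnb(SW)."""
    nb_client = 0
    world_counts = {w: 0 for w in world_ids}
    for c, seg in test_segments:
        candidates = client_dict.get(c, []) + world_dict.get(c, [])
        if not candidates:
            continue
        owner = min(candidates, key=lambda p: dtw_distance(seg, p[0]))[1]
        if owner == "client":
            nb_client += 1
        else:
            world_counts[owner] += 1
    mnb_world = np.mean(list(world_counts.values())) if world_counts else 1.0
    return nb_client / max(mnb_world, 1e-9)
```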
3 Database and Experimental Setup
We have used the National Institute of Standards and Technology (NIST) 2001 cellular speaker verification database. In order to evaluate the proposed method, 4 disjoint sets, denoted here as the World-ALISP-Set, Threshold-Set, Impostor-Set and Evaluation-Set, are chosen among the available data. The World-ALISP-Set is used for two purposes: to build the gender-dependent ALISP recognizers, and to model the world speakers. Two gender-dependent ALISP speech recognizers are obtained with data coming from 60 female and 59 male speakers. Because gender is a priori provided information in the NIST evaluations, we will use gender-dependent subsets of each of the above-mentioned sets without repeating this explicitly each time. Because of the lack of speech data, the same speakers are used to model the world speakers. In the current best performing speaker verification approaches, improvement is obtained using different normalization techniques. For our experiments, the Z-normalization [12] is applied. 21 female and 16 male speakers are chosen for the Impostor-Set, necessary for this normalization. The decision threshold δ is estimated from the Threshold-Set, built from 16 female and 16 male speakers. The evaluation of our speaker verification method is done with the Evaluation-Set, composed of 14 female and 14 male speakers. The speech parameterization for the temporal decomposition is done with Linear Prediction Cepstral Coefficients (LPCC), calculated on 20 ms windows with a 10 ms shift (this choice is due to implementation facilities). For the Hidden Markov Modeling step, we have used the Mel Frequency Cepstral Coefficients (MFCC). They are generally used for common speech and speaker verification purposes. The window and shift values are kept the same as for the LPCC parameterization. In order to accelerate the search, we have restricted the number of speech units in the client and world dictionaries. The 5 longest segments per class and per world speaker are chosen for the World-Dictionary, and the 15 longest ones for the Client-Dictionary. In this way, each client speaker is represented by approximately 1000 segments. The world speakers are represented by approximately 20000 segments.
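A minimal sketch of this dictionary construction is given below, in a class-indexed (features, owner) format compatible with the scoring sketch above; the input segment list and its fields are assumed for illustration.

```python
def build_dictionary(segments, per_class, owner_of=lambda spk: spk):
    """segments: list of (alisp_class, speaker_id, features) with features of shape
    (num_frames, dim). Keeps the per_class longest segments for each class and
    speaker, as a dict: class -> list of (features, owner)."""
    by_key = {}
    for c, spk, feat in segments:
        by_key.setdefault((c, spk), []).append(feat)
    dictionary = {}
    for (c, spk), feats in by_key.items():
        feats.sort(key=len, reverse=True)                       # longest segments first
        dictionary.setdefault(c, []).extend(
            (f, owner_of(spk)) for f in feats[:per_class])
    return dictionary

# client_dict = build_dictionary(client_segments, per_class=15,
#                                owner_of=lambda spk: "client")
# world_dict  = build_dictionary(world_segments,  per_class=5)   # 5 per class and per world speaker
```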
4 Speaker Verification Results
4.1 ALISP Speech Segmentation
The main characteristic of the proposed method is the segmentation of the speech data, followed by the Dynamic Time Warping (DTW) similarity measure used to compare different speech segments. In order to give an idea of the speech segmentation resulting from the ALISP recognizer, an example is given in Fig. 2. The mean duration of the ALISP units is 60 ms for the female data and 100 ms for the male data. It would be interesting to compare our ALISP-based units with classical phone units.
Fig. 3. Two examples of client-client tests with different durations. The relative frequency histograms represent the relative number of times the world or the client segment is chosen as the nearest segment to the test segment
Fig. 4. Two examples of client-impostor tests with different durations. The relative frequency histograms represent the relative number of times the world or the client segment is chosen as the nearest segment to the test segments
4.2 Speaker Modeling
In this section some examples are shown, corresponding to client-client and client-impostor tests. They are chosen to represent the 'good' and the 'bad' cases. In Fig. 3 two examples of client-client tests are shown, from the female data set. The first 60 bars represent the relative frequency of the number of times each of the 60 world speaker identities was chosen as the best matching unit. The last bar represents the same value, but for the claimed speaker identity. In the left panel of Fig. 3, we can observe that a clear preference is given to the segments belonging to the client enrollment data. This is not the case in the right panel, where the algorithm does not show a clear preference for the real speaker. In Fig. 4, two typical examples of client-impostor tests are reported. It seems that the algorithm has more difficulty in choosing one preferred speaker. For the shorter test data (right panel) even a small preference for the client data is observed. One possible reason for these bad results could be the short test data duration.
Fig. 5. DET performance of the proposed speaker verification method. The evaluation is done with 14 female and 14 male client speakers, originating from the NIST 2001 cellular data
4.3 DET Curves
The proposed speaker verification method is experimentally evaluated with DET curves. The results obtained with our evaluation set (which is a subset of the full NIST 2001 cellular evaluation data) are shown in Fig. 5. The female and male results are combined together. The figure also shows the influence of the Z-normalization, which, as expected, gives better results compared to the experiments done without normalization. The equal error rate for the normalized experiments is about 15%. As a first attempt to evaluate the performance of the proposed method, the results are satisfactory. For comparison, we can indicate that the 2002 NIST cellular results vary from 8% to 25%.
5 Discussion and Conclusions
The presented experiments on searching through a speech memory for text-independent speaker verification seem to be competitive with state-of-the-art methods. In [5] we used a similar data-driven segmentation method, but with only 8 speech classes. In order to have a finer segmentation, we have increased the number of data-driven speech classes to 64. Due to the increased number of speech classes, we had to adapt the speaker modeling method to a larger number of classes. This is the reason why we used the DTW distance measure to estimate the similarity between two speech segments.
The results are obtained without fine tuning of the different parameters. This leaves us some possibilities to further improve the proposed method. Different parameters could be varied: we can change the size of the Speaker/World Dictionaries; another normalization technique could be applied; the scoring method could be based on better exploiting the overall distance measure between the test segment and the reference segments; different local constraints for the DTW could be envisaged. Another improvement could be foreseen from some higher-level information resulting from the segmentation of the speech data. At this moment we have not exploited the speech classes in an optimal way. We even treated all the speech data without discarding the segments that are detected as belonging to the 'silence' class. It is also important to repeat the experiments with more speakers in the evaluation set.
Acknowledgments
Our thanks go to F. Bimbot for his tsd95 package, to J. Černocký for the ALISP tools, and to the HTK team.
References
[1] Reynolds, D. A., Quatieri, T. F., Dunn, R. B.: Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, Special Issue on the NIST'99 evaluations, Vol. 10(1-3), 19–41, January/April/July 2000
[2] Eatock, J. P., Mason, J. S.: A Quantitative Assessment of the Relative Speaker Discriminant Properties of Phonemes. Proc. ICASSP, Vol. 1, 133–136 (1994)
[3] Olsen, J.: A Two-stage Procedure for Phone Based Speaker Verification. In J. Bigün, G. Chollet, G. Borgefors, editors, First International Conference on Audio and Video Based Biometric Person Authentication (AVBPA), Springer Verlag: Lecture Notes in Computer Science 1206, 199–226 (1997)
[4] Petrovska-Delacrétaz, D., Černocký, J., Hennebert, J., Chollet, G.: Text-independent Speaker Verification Using Automatically Labeled Acoustic Segments. In International Conference on Spoken Language Processing (ICSLP), Sydney, Australia (1998)
[5] Petrovska-Delacrétaz, D., Černocký, J., Chollet, G.: Segmental Approaches for Automatic Speaker Verification. Digital Signal Processing, Special Issue on the NIST'99 evaluations, Vol. 10(1-3), 198–212, January/April/July 2000
[6] Chollet, G., Černocký, J., Constantinescu, A., Deligne, S., Bimbot, F.: Towards ALISP: a proposal for Automatic Language Independent Speech Processing. In Keith Ponting, editor, NATO ASI: Computational Models of Speech Pattern Processing, Springer Verlag (1999)
[7] Rosenberg, A. E.: Automatic Speaker Verification: A Review. Proc. IEEE, Vol. 64, No. 4, April (1976) 475–487
[8] Rabiner, L., Schafer, R. W.: Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs, NJ (1978)
[9] Furui, S.: Cepstral Analysis Technique for Automatic Speaker Verification. IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 29, No. 2 (1981) 254–272
[10] Pandit, M., Kittler, J.: Feature Selection for a DTW-Based Speaker Verification System. Proc. ICASSP, Seattle, Vol. 2 (1998) 769–772
[11] Atal, B.: Efficient Coding of LPC Parameters by Temporal Decomposition. Proc. IEEE ICASSP (1983) 81–84
[12] Reynolds, D. A.: Comparison of Background Normalization Methods for Text-Independent Speaker Verification. Proc. Eurospeech, Rhodes (1997) 963–966
LUT-Based Adaboost for Gender Classification Bo Wu, Haizhou Ai, and Chang Huang Computer Science and Technology Department Tsinghua University, Beijing, P.R. China
[email protected] Abstract. There are two main approaches to the problem of gender classification, Support Vector Machines (SVMs) and Adaboost learning methods: SVMs achieve a better correct rate but are more computation intensive, while Adaboost methods are much faster with slightly worse performance. For possible real-time applications the Adaboost method seems the better choice. However, the existing Adaboost algorithms take simple threshold weak classifiers, which are too weak to fit complex distributions, as the hypothesis space. Because of this limitation of the hypothesis model, it is hard for the training procedure to converge. This paper presents a novel Look Up Table (LUT) weak classifier based Adaboost approach to learn a gender classifier. This algorithm converges quickly and results in efficient classifiers. The experiments and analysis show that the LUT weak classifiers are more suitable for the boosting procedure than threshold ones. Keywords: gender classification, Adaboost
1
Introduction
Face is one of the most important biometric features of humans and contains lots of useful information, such as gender, age, ethnicity and identity. Gender classification is to tell the gender of a person according to his/her face. In early research work, this problem was regarded as a psychological issue or a purely experimental matter; the size of the sample sets was rather small, at most a few hundred, and the main method used was the neural network. Golomb et al. [1] developed a two-layer SEXNET with 30 by 30 pixel face samples; Cottrell and Metcalfe [2] trained a BP network to identify gender and expression with Principal Component Analysis (PCA) as a pre-process; Edelman et al. [3] used different parts of the human face, up-half, down-half and whole, to train linear networks, and compared their performance on gender classification; O'Toole et al. [4] analysed the relation between facial femininity, masculinity, attractiveness and recognizability through a series of interesting experiments, and investigated the statistical structure of the information in the human face through the PCA method. They also constructed a 3D PCA model to extract features from which a gender perceptron was trained [5]. With the significant progress in human face detection algorithms and the increasing requirements for advanced surveillance and
monitoring systems, some new powerful methods have been introduced for gender classification, and the size of the experimental databases has grown to thousands. Moghaddam and Yang [6] developed a method based on RBF-kernel SVMs and achieved very good results (only 3.4% error) on the FERET database compared to some classical methods, such as the RBF network, Fisher Linear Discriminant, etc.; Shakhnarovich et al. [7] applied a threshold weak classifier based Adaboost algorithm to face detection, gender and ethnic classification, reaching a 78% correct rate in gender classification, which is even better than SVMs on their large database with more than 3000 faces. They also built an automatic gender and ethnic classification system under the same unified framework. Adaboost is a very promising method due to both its good correct rate and extremely high speed. However, the problem with existing boosting approaches is that the training procedure is very time consuming, more than 24 hours, and the resulting perceptron-like classifier usually involves too many features, so it is hard to adapt the method to particular application environments, such as a real-time monitoring system. In this paper, we develop an LUT weak classifier Adaboost method, and compare its performance with threshold weak classifier Adaboost and SVMs on the gender classification problem. Our main contribution is the finding that LUT weak classifiers, instead of simple threshold ones, are extremely well suited to the Adaboost learning procedure and generate efficient classifiers with fewer features. The paper is organized as follows: in Section 2, the Adaboost algorithm is reviewed and then our LUT version is introduced. In Section 3, the gender classification procedure is described. The experimental results are shown in Section 4 and some conclusions are given in Section 5.
2 The LUT Adaboost
Adaboost [8] is a learning algorithm that selects a set of weak classifiers from a large hypothesis space to construct a strong classifier. The final decision function of the strong classifier is
h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ threshold, and h(x) = 0 otherwise.   (1)
Basically speaking, the performance of the final strong classifier originates in the characteristics of its weak hypothesis space. In [7] threshold weak classifiers are used as the hypothesis space input to the boosting procedure; in this paper we use LUT weak classifiers instead. We argue that LUT weak classifiers are more general and better suited than simple threshold ones for Adaboost training, because in nature the distribution of samples tends to be multi-Gaussian. The Adaboost algorithm [10] for the two-class classification problem is as follows:
• Given example images (x_1, y_1), . . . , (x_u, y_u), where y_i = 0, 1 for negative and positive examples respectively.
• Initialize weights w_{1,i} = 1/2m, 1/2n for y_i = 0, 1 respectively, where m and n are the number of negatives and positives respectively.
• For t = 1, . . . , T:
  1. Normalize the weights, w_{t,i} := w_{t,i} / Σ_j w_{t,j}, so that w_t is a probability distribution.
  2. For each feature j, train a classifier h_j which is restricted to using a single feature. The error is evaluated with respect to w_t: ε_j = Σ_i w_{t,i} |h_j(x_i) − y_i|.
  3. Choose the classifier h_t with the lowest error ε_t.
  4. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if x_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).
• The final strong classifier is
  h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, and h(x) = 0 otherwise,
  where α_t = −log β_t.
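The listing above translates directly into the short training loop below; the precomputed matrix of weak-classifier outputs and the clipping of the weighted error are illustrative assumptions, not part of the original algorithm.

```python
import numpy as np

def adaboost(weak_preds, y, T):
    """weak_preds: (num_weak, num_samples) array of {0,1} outputs of the candidate
    weak classifiers on the training set; y: {0,1} labels; T: boosting rounds."""
    m, n = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * n))          # initial weights
    alphas, chosen = [], []
    for _ in range(T):
        w = w / w.sum()                                         # step 1: normalize
        errs = (weak_preds != y).astype(float) @ w              # step 2: weighted errors
        j = int(np.argmin(errs))                                # step 3: best weak classifier
        eps = float(np.clip(errs[j], 1e-10, 1 - 1e-10))         # clipped for stability
        beta = eps / (1.0 - eps)
        e = (weak_preds[j] != y).astype(float)                  # e_i = 1 if misclassified
        w = w * beta ** (1.0 - e)                               # step 4: update weights
        alphas.append(-np.log(beta))
        chosen.append(j)
    return chosen, np.array(alphas)

def strong_classify(weak_preds, chosen, alphas):
    score = alphas @ weak_preds[chosen]                         # sum_t alpha_t h_t(x)
    return (score >= 0.5 * alphas.sum()).astype(int)
```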
In the above procedure, the weights are adjusted in each step according to whether a sample is classified correctly, in order to place more emphasis on wrongly classified samples. This repeated weight adjustment may result in sample distributions of multi-Gaussian type, such as in Fig. 1. In this case, simple threshold weak classifiers will not be as adaptable as the LUT ones. Even in the case of a single-peak distribution, LUT weak classifiers can exclude more negative samples. This makes LUT-based Adaboost learning converge much faster than the simple threshold-based one, especially in the case of dispersed distributions of the sample features. In practice, as in [7, 9], the Haar features computed via the integral image are used to generate LUT weak classifiers. Suppose a Haar feature f(x) is normalized to the domain [0, 1] and the size of the LUT is n; then the k-th LUT item corresponds to a range R_k = [(k−1)/n, k/n], that is to say, equally spaced ranges are used. Count the number of positive samples (w_1) and negative ones (w_2) whose feature f(x) belongs to this range over all training samples, and introduce the LUT weak classifier as follows:
h_LUT(x) = 1 if f(x) ∈ R_k and P_1^(k) > P_2^(k), and h_LUT(x) = 0 otherwise,   (2)

where P_1^(k) = P(f(x) ∈ R_k | w_1) and P_2^(k) = P(f(x) ∈ R_k | w_2). The superiority of the LUT weak classifiers is shown in Section 4.
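A LUT weak classifier of this kind can be trained from weighted histograms of a single normalized Haar feature, as sketched below; using the current boosting weights in place of the class-conditional probabilities of (2) is an assumption that keeps the sketch compatible with the training loop above.

```python
import numpy as np

def train_lut_weak(f, y, w, n_bins=8):
    """f: feature values in [0, 1]; y: labels in {0, 1}; w: boosting weights.
    Returns a boolean LUT of length n_bins (True -> predict 1 for that bin)."""
    bins = np.minimum((f * n_bins).astype(int), n_bins - 1)
    p1 = np.zeros(n_bins)                       # weighted mass of positives per bin
    p2 = np.zeros(n_bins)                       # weighted mass of negatives per bin
    np.add.at(p1, bins[y == 1], w[y == 1])
    np.add.at(p2, bins[y == 0], w[y == 0])
    return p1 > p2

def lut_predict(lut, f):
    n_bins = len(lut)
    bins = np.minimum((f * n_bins).astype(int), n_bins - 1)
    return lut[bins].astype(int)
```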
Fig. 1. LUT weak classifier versus simple threshold classifier for a histogram distribution of a particular rectangle feature on all positive samples
Fig. 2. Sample normalization
Fig. 3. Some samples in our gender database

Fig. 4. Samples' distributions: (a)-(h) are the samples' distributions after 0, 3, 35, 60, 80, 95, 105 and 115 rounds of boosting respectively. The solid line represents female and the dashed line male
3 Gender Classification
For a practical problem like gender classification, in order to make the sample distribution as compact as possible, some geometric alignment and gray-level normalization are performed automatically before the samples are fed into the learning algorithm, see Fig. 2. With the female faces as the positive samples and the male faces as the negative ones, the LUT Adaboost algorithm in Section 2 is used to train the gender classifier.
Fig. 5. Convergence speed of the threshold Adaboost algorithm and the LUT one. Vertical coordinate is the error rates on the training set

Fig. 6. The ROC curves for the detection of males in our gender face database (LUT size = 8 vs. threshold)
Table 1. Gender classification results with RBF-kernel SVM and threshold/LUT weak classifier based Adaboost methods with 200 features selected
Method              RBF-kernel SVM        Threshold Adaboost      LUT Adaboost
Resolution          24×24      36×36      24×24      36×36        24×24      36×36
Total rate          90.13%     90.55%     85.46%     85.51%       87.92%     88.00%
Female              90.74%     91.56%     85.33%     84.37%       88.44%     86.96%
Male                89.34%     89.25%     85.63%     86.96%       87.25%     89.34%
There is not yet a standard database for gender classification. Shakhnarovich et al. [7] collected a gender face set from the World Wide Web that has about 4,500 different faces in total and performed 5-fold cross-validation on it. Moghaddam and Yang [6] selected 1,044 males and 711 females from the FERET face database as their test set. In our experiments, we collect about 5,500 males and 5,500 females from the FERET database [11] and a large number of WWW pictures as the training set, and 1,300 females and 1,300 males from only the WWW pictures as the test set, which is independent of the training set. Thus our gender face database has about 13,600 different faces, covering almost all races and ages. Fig. 3 shows some samples in our database.
4 Experiments
In order to see whether the LUT weak classifier is more suitable for Adaboost, we trace the sample distribution of the weak feature boosted in each round. Fig. 4 shows the evolution of the distributions through the boosting process. It can be seen that although at the beginning both the positive and negative samples tend to a single-Gaussian distribution, after several rounds of boosting the main clusters of the two classes largely overlap and some small peaks appear, which makes it very difficult to separate them by only one threshold. Thus the complicated distributions arising in the boosting procedure make the threshold weak classifier inefficient. In addition, Fig. 5 shows the error rates in the first 250 boosting rounds of both the threshold weak classifier algorithm and the LUT Adaboost one. It is clear that the convergence speed of LUT Adaboost is much faster than the threshold one's, and in our experiments there is no problem of over-fitting.
Table 2. Gender classification results with LUT weak classifier based Adaboost with 200 features selected
LUT size
4
Total Female Male
86.84% 86.22% 87.63%
Total Female Male
8
64
128
86.42% 87.41% 85.16%
82.09% 82.59% 81.45%
87.00%
16 32 Resolution = 24× 24 87.92% 85.46% 87.59% 88.44% 85.56% 87.86% 87.25% 85.35% 87.25% Resolution = 36× 36 87.71% 88.00% 86.26%
84.88%
84.17%
88.00% 85.73%
87.19% 88.39%
85.26% 84.40%
85.41% 82.59%
86.96% 89.34%
85.19% 87.63%
Fig. 7. Automatic gender classification: + represents landmark and F and M represent Female and Male respectively (These pictures come from the CMU test set for upright face detection)
For the LUT Adaboost algorithm, the number of LUT entries, i.e. the LUT size, is also an important parameter. In the experiments, we train a series of Adaboost gender classifiers with the same number of weak classifiers but different LUT sizes on the training set described in Section 3. For comparison, results of the RBF-kernel SVM and the threshold weak classifier based Adaboost are also listed in Table 1. Table 2 gives the results of the LUT scheme. It can be seen that LUT classifiers are better than the threshold ones at both high (36 by 36 pixels) and low (24 by 24 pixels) sample resolutions. However, the performance of the LUT method does not strictly increase when the LUT size grows. One possible explanation is that although with more LUT entries the weak classifier can fit a more complicated model, the number of samples for each entry decreases, so that the training algorithm cannot evaluate the real distribution correctly. The ROC curves for gender classification with LUT and threshold Adaboost on our set are shown in Fig. 6. For possible real applications, we have integrated our gender classification algorithm into an automatic face detection and facial feature landmark extraction system, in which the face detection is also based on a boosted cascade of LUT weak classifiers and the landmark extraction is based on a Simple Direct Appearance Model (SDAM) approach [10] for only 3 feature points. Thus we obtain an automatic gender classification system whose input is real-life photos and whose output is possible face blocks labelled with gender. Some results are shown in Fig. 7.
5 Conclusion
In this paper, we proposed an LUT weak classifier based Adaboost learning algorithm for gender classification. Similar to the threshold weak classifiers used in [7, 8, 9], the LUT ones are also based on Haar features computed via the integral image, but the latter have a stronger ability than the former to model complex distributions of training samples, such as multi-Gaussian ones. This makes the LUT weak classifier very suitable for the boosting procedure. Experimental results show the efficiency of our algorithm. Although we discuss our method on gender classification problems, it can be used in other pattern recognition problems too.
References
[1] Golomb, D. T. Lawrence, and T. J. Sejnowski. SEXNET: A neural network identifies sex from human faces. In Advances in Neural Information Processing Systems, pp. 572–577, 1991.
[2] G. W. Cottrell and J. Metcalfe. EMPATH: Face, emotion, and gender recognition using holons. In Advances in Neural Information Processing Systems, pp. 564–571, 1991.
[3] Edelman, D. Valentin, and H. Abdi. Sex classification of face areas: how well can a linear neural network predict human performance. Journal of Biological Systems, Vol. 6(3), pp. 241–264, 1998.
[4] Alice J. O'Toole et al. The Perception of Face Gender: The Role of Stimulus Structure in Recognition and Classification. Memory and Cognition, Vol. 26, pp. 146–160, 1997.
[5] Alice J. O'Toole, Thomas Vetter, et al. The role of shape and texture information in sex classification. Technical Report No. 23, 1995.
[6] Moghaddam and M. H. Yang. Gender Classification with Support Vector Machines. IEEE Trans. on PAMI, Vol. 24, No. 5, pp. 707–711, May 2002.
[7] G. Shakhnarovich, P. Viola and B. Moghaddam. A Unified Learning Framework for Real Time Face Detection and Classification. IEEE Conf. on AFG, 2002.
[8] Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on Machine Learning, pp. 148–156, Morgan Kaufmann, 1996.
[9] P. Viola and M. Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. In CVPR 2001.
[10] S. Z. Li, Y. ShiCheng, H. Zhang, and Q. Cheng. Multi-View Face Alignment Using Direct Appearance Models. IEEE Conf. on AFG, 2002.
[11] P. J. Phillips, H. Wechsler, J. Huang, and P. Rauss. The FERET database and evaluation procedure for face recognition algorithms. Image and Vision Computing J., Vol. 16, No. 5, pp. 295–306, 1998.
Independent Component Analysis and Support Vector Machine for Face Feature Extraction Gianluca Antonini, Vlad Popovici, and Jean-Philippe Thiran Signal Processing Institute Swiss Federal Institute of Technology Lausanne CH-1015 Lausanne, Switzerland {Gianluca.Antonini,Vlad.Popovici,JP.Thiran}@epfl.ch http://ltswww.epfl.ch
Abstract. We propose Independent Component Analysis representation and Support Vector Machine classification to extract facial features in a face detection/localization context. The goal is to find a better space in which to project the data in order to build ten different face-feature classifiers that are robust to illumination variations and bad environmental conditions. The method was tested on the BANCA database in different scenarios: controlled conditions, degraded conditions and adverse conditions.
1 Introduction
One of the most remarkable abilities of human vision is the face detection-recognition process. Due to variations in illumination, background and facial expressions, it may be complex for a computer to perform such a task. Face detection-recognition algorithms are generally made up of three different steps: localization of the face region, extraction of meaningful facial features, and normalization of the image with respect to these features to perform the recognition step. In this paper we focus our attention on the facial feature extraction issue. Among all the possible classifications of the existing face detection algorithms, for our purposes we will consider the Holistic Face Models (HFM) and the Local Features Face Models (LFFM) [1]. In the HFM approach the image region containing the whole face is selected manually and a representation of the face patch is learned from examples. It is clear that the major problem with this approach is to capture all the face-class variance. Moreover, it is very difficult to model the geometric relationships between the different face parts. In the LFFM approach, the basic idea is to represent the face with a set of meaningful features and not as a whole. In this way, it is easier to use any geometric information we may have about the face class (collinear eye positions, vertical symmetry, etc.).
Work partially performed in the BANCA project of the IST European program with the financial support of the Swiss OFES and with the support of the IM2-NCCR of the Swiss NFS.
Feature-based face detection is not a new technique. It has previously been investigated, for instance, in [2]. In those works the authors propose an implementation of local feature detectors via Principal Component Analysis (PCA) based classification of the neighborhoods of local maxima of the Harris corner detector. Our approach belongs to the Local Features Face Model category. We propose to use Independent Component Analysis (ICA), instead of PCA, as the linear transformation on the image patches and to perform the SVM classification in the ICA space. The goal is to provide a robust representation of the patches (by ICA) and to train ten different classifiers (by SVM) for ten classes of features representing the face. Our algorithm can be seen as a pre-filtering stage of a face detection system. The rest of the paper is organized as follows. Section 2 gives a brief introduction to ICA. Section 3 does the same for the SVM method. Experiments and results are shown in Section 4, followed by some conclusions.
2 Independent Component Analysis
2.1 Overview
Assume we are observing m linear mixtures x_1, ..., x_m of n independent components

x_j = a_{j1} s_1 + a_{j2} s_2 + ... + a_{jn} s_n   (1)
Each mixture x_j, as well as each independent component s_k, is a random variable. Using vector-matrix notation we can write

x = As   (2)
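As a small illustration of this generative model, the sketch below mixes three synthetic, non-Gaussian sources with a random matrix A and recovers estimates of the independent components with FastICA; the synthetic sources and scikit-learn's FastICA are assumptions used only for demonstration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(2 * t),                    # three independent, non-Gaussian sources
          np.sign(np.sin(3 * t)),
          rng.laplace(size=t.size)]
A = rng.normal(size=(3, 3))                 # unknown mixing matrix
x = s @ A.T                                 # observed mixtures, x = A s

ica = FastICA(n_components=3, random_state=0)
s_hat = ica.fit_transform(x)                # estimated independent components
A_hat = ica.mixing_                         # estimated mixing matrix
```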
The model in Eq. 1 is called independent component analysis, or the ICA model. It is a generative model, which means that the observations are generated by a mixing process of latent variables which are the independent components. These variables are not directly observable and have to be estimated along with the mixing matrix A. The basic assumption in ICA is that the latent variables s_i are statistically independent. Technically, it means that the joint probability density is factorizable into the product of the respective marginal densities:

p(s) = Π_{i=1}^{N} p_i(s_i).   (3)
From probability theory, the Central Limit Theorem tells us that the distribution of a sum of independent random variables tends toward a Gaussian distribution, under certain conditions. We can use this result to assert, intuitively, that a mixture of the s_i is more Gaussian distributed than each of
them. So, one criterion to estimate the independent components is to minimize the gaussianity of the s_i through some measures of nongaussianity like kurtosis and negentropy [3]. Another approach inspired by information theory, using the concept of differential entropy, is the minimization of the mutual information

I(y_1, y_2, ..., y_n) = Σ_{i=1}^{n} H(y_i) − H(y).   (4)
Mutual information is equivalent to the Kullback-Leibler divergence between the joint density f(y) and the product of its marginal densities. It therefore represents a natural measure of the dependence between random variables and also takes into account the high-order statistics. The concepts of mutual information, negentropy and projection pursuit are all closely related [4]. Because negentropy is invariant under invertible linear transformations, finding an invertible transformation that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized. Again, a single direction that maximizes negentropy is a form of projection pursuit and could also be interpreted as the estimation of a single component.
2.2 Why ICA?
Much of the information that perceptually distinguishes faces is contained in the higher order statistics of the images [5]. Since ICA captures more than second-order statistics (covariance), it appears more appropriate than PCA. The technical reason is that second-order statistics correspond to the amplitude spectrum of the image (actually, the Fourier transform of the autocorrelation function of an image corresponds to its power spectrum, the square of the amplitude spectrum). The remaining information, the high-order statistics, corresponds to the phase spectrum. This is the informative part of a signal: if we remove the phase information, an image looks like noise.
3 Support Vector Machine
In this section we briefly sketch the SVM algorithm and its motivation. A more detailed description of SVM can be found in [6]. The task of learning from examples, for a two-class pattern recognition problem, can be formulated as follows: given a set of functions {fα }α∈Λ ,
f_α : R^n → {−1, +1}   (5)

and a set of examples

{(x_i, y_i), i = 1, . . . , l} ⊂ R^n × {−1, +1},   (6)
each one generated according to an unknown probability distribution function P(x, y), we want to find the function f_{α*} which minimizes the risk of misclassification of new patterns drawn randomly from P, given by the risk functional

R(α) = (1/2) ∫ |f_α(x) − y| dP(x, y).   (7)

The risk functional is upper bounded by the sum of the empirical risk and the Vapnik-Chervonenkis (VC) confidence term (see [6]). While in practice the risk functional cannot be minimized directly, one can try to minimize its upper bound. In the case of SVM, the empirical risk is kept constant, say zero, and a minimizer for the confidence term is sought. Let us consider first the simple case of linearly separable data. We are searching for an optimal separating (hyper-)plane¹

⟨w, x⟩ + b = 0   (8)

which minimizes the VC confidence term while providing the best generalization. The decision function is

f(x) = sgn(⟨w, x⟩ + b)   (9)

Geometrically, the problem to be solved is to find the hyperplane that maximizes the sum of distances to the closest positive and negative training examples. This distance is called the margin (see Figure 1) and the optimal plane is obtained by maximizing 2/‖w‖ or, equivalently, by minimizing ‖w‖² subject to y_i(⟨w, x_i⟩ + b) ≥ 1. In the case that the two classes overlap in feature space, one way to find
Fig. 1. Two possible solutions for the separating plane problem: (a) a possible separating plane; (b) the optimal separating plane. A better generalization is expected from the second case
¹ We use ⟨·, ·⟩ to denote the inner product operator.
the optimal plane is to relax the above constraints by introducing some slack variables ξ_i (for more details see [6]). Introducing the Lagrange multipliers α_i, we can express the decision function as a function of them:

f(x) = sgn( Σ_{i∈S} y_i α_i ⟨x, x_i⟩ + b )   (10)

where S = {i | α_i > 0}, with the new constraints:

Σ_{i=1}^{l} y_i α_i = 0 and α_i ≥ 0, ∀i = 1, . . . , l   (11)
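To make (10) concrete, the sketch below evaluates the dual-form decision function over a set of support vectors using the RBF kernel given later in (13); the argument names and the default value of γ are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.1):
    """K(x, z) = exp(-gamma * ||x - z||^2), cf. (13)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, sv_labels, alphas, b, gamma=0.1):
    """f(x) = sgn( sum_{i in S} y_i * alpha_i * K(x, x_i) + b ), cf. (10)."""
    s = sum(y_i * a_i * rbf_kernel(x, x_i, gamma)
            for x_i, y_i, a_i in zip(support_vectors, sv_labels, alphas))
    return int(np.sign(s + b))
```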
The vectors x_i, i ∈ S are called support vectors and are the only examples from the training set that affect the shape of the separating boundary. To generalize the linear case one can project the input space into a higher-dimensional space in the hope of a better training-class separation. In the case of SVM this is achieved by using the so-called "kernel trick". In essence, it replaces the inner product ⟨x_i, x_j⟩ in (10) with a kernel function K(x_i, x_j). As the data vectors are involved only in these inner products, the optimization process can be carried out in the feature space directly. Some of the most used kernel functions are the polynomial kernel

K(x, z) = (⟨x, z⟩ + 1)^d   (12)

and the RBF kernel

K(x, z) = exp(−γ ‖x − z‖²)   (13)

4 Methods and Results
4.1 The BANCA Database
The BANCA database is a multimodal and multi-language database. It has been recorded in 3 different scenarios: controlled, degraded and adverse. For each of the four languages (French, English, Spanish and Italian), there are 52 subjects (26 males and 26 females) each performing 12 recording sessions, with 2 recordings per session (4 sessions per scenario). In all, there are 6240 images per language. A more detailed description of the BANCA database can be found in [7].
4.2 Methods
In our experiments we have used the English subset of the BANCA database, composed of 6240 images, which has been divided into two equally sized subsets for training and testing (in such a way that images of the same person cannot appear both in training and testing). In the test set we have taken 3120 positive examples and 3120 negative examples, so our test set size is 6240. The size of
Fig. 2. The ten features (a) and the "clouds" of Harris corners in the three different conditions (b, c, d)
our patches is 32x32. The features we have considered are: the left corner of the left eye (P1), the central point of the left eye (P2), the right corner of the left eye (P3), the left corner of the right eye (P4), the central point of the right eye (P5), the right corner of the right eye (P6), the left and right nostrils (P7 and P8), and the left and right corners of the mouth (P9 and P10): ten classes in total (Figure 2(a)). For each patch we first perform a PCA projection to reduce the dimensionality, passing from 1024 (32x32) components to 50 components (retaining about 95% of the total variance). Starting from the PCA-transformed space we apply ICA and we train the SVM classifier in the ICA space using the radial basis function kernel. We evaluate the robustness of our classifiers in terms of accuracy and number of false positives. We use a test set composed of the manually selected feature points as positive examples and a set of random points extracted with a Harris corner detector [8] as negative examples. The results are shown in Table 1.
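The per-feature training described above can be sketched as a simple scikit-learn pipeline; the PCA/ICA dimensionalities follow the text, while the SVC hyperparameters and the function name are assumptions left at library defaults.

```python
from sklearn.decomposition import PCA, FastICA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_feature_classifier(patches, labels):
    """patches: (num_samples, 1024) vectors from 32x32 gray-level patches;
    labels: 1 for the facial feature class (e.g. P5), 0 for negative patches."""
    model = make_pipeline(
        PCA(n_components=50),        # 1024 -> 50 components (about 95% of the variance)
        FastICA(n_components=50),    # project onto independent components
        SVC(kernel="rbf"),           # RBF-kernel SVM trained in the ICA space
    )
    model.fit(patches, labels)
    return model

# Ten such classifiers would be trained, one per feature class P1..P10.
```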
4.3 Results
In figures 3(a)-3(f) we show, as an example, the results obtained by applying our models to a "cloud" of corners from the three different scenarios, using the Harris corner detector (figures 2(b), 2(c), 2(d)). It is important to underline that, in order to avoid scanning the image at all positions, we used a Harris corner detector [8] as a prefiltering stage. It turned out that the corner detector can be tuned to pick out enough corners such that there are always corners sufficiently close to the real feature positions.

Fig. 3. The features extracted from a "cloud" of corners in the three different scenarios: (a) feature P5 in controlled, (b) feature P7 in controlled, (c) feature P8 in degraded, (d) feature P6 in degraded, (e) feature P5 in adverse, and (f) feature P10 in adverse conditions
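As an illustration of such a prefiltering stage (a sketch only, not the authors' code), OpenCV's Harris-based corner detector can be tuned to return a generous set of candidate corners; the threshold values below are hypothetical.

```python
import cv2

def candidate_corners(gray_image, max_corners=500):
    # Harris-based corner detection used as a prefilter: only the returned candidate
    # positions are handed to the patch classifiers, instead of scanning every pixel.
    corners = cv2.goodFeaturesToTrack(gray_image, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=5,
                                      useHarrisDetector=True, k=0.04)
    return [] if corners is None else corners.reshape(-1, 2)
```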
5 Conclusions and Future Work
In this work we have proposed the combined use of ICA and SVM for the facial feature extraction problem. The algorithm can be used in a face detection system as a preprocessing step, before other kinds of information (such as the geometric symmetries between the different feature positions) are exploited. In order to find a better space in which to project the data for the classification step, it would be interesting to investigate recent extensions of ICA such as overcomplete ICA [9] and topographic ICA [10].
Table 1. Numerical results on a set of 6240 test patches

Feature   Accuracy (%)   False positives
P1        93.32          57
P2        93.00          37
P3        93.30          55
P4        97.01          24
P5        93.40          26
P6        92.91          26
P7        98.05          21
P8        97.80           7
P9        92.70          52
P10       95.20          30
References
[1] K.-K. Sung and T. Poggio. Learning human face detection in cluttered scenes. In V. Hlavac and R. Sara, editors, Computer Analysis of Images and Patterns, pages 432-439. Springer, Berlin, 1995.
[2] Miroslav Hamouz, Josef Kittler, Jiri Matas, and Petr Bílek. Face detection by learned affine correspondences. LNCS, 2396:566-575, 2002.
[3] A. Hyvaerinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.
[4] A. Hyvaerinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE-NN, 10(3):626, May 1999.
[5] M. S. Bartlett, H. M. Lades, and T. J. Sejnowski. Independent component representations for face recognition. In Proceedings of the SPIE, volume 3299, pages 528-539, 1998.
[6] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[7] S. Bengio, F. Bimbot, J. Mariéthoz, V. Popovici, F. Porée, E. Bailly-Baillière, G. Matas, and B. Ruiz. Experimental protocol on the BANCA database. IDIAP-RR 05, IDIAP, 2002.
[8] Chris Harris and Mike Stephens. A combined corner and edge detector. Proceedings Fourth Alvey Vision Conference, pages 147-151, 1988.
[9] Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337-365, 2000.
[10] Aapo Hyvärinen, Patrik O. Hoyer, and Mika Inki. Topographic independent component analysis. Neural Computation, 13(7):1527-1558, 2001.
Real-Time Emotion Recognition Using Biologically Inspired Models Keith Anderson and Peter W. McOwan Department of Computer Science, Queen Mary College University of London Mile End Road, London, E1 4NS, UK
Abstract. A fully automated, multi-stage architecture for emotion recognition is presented. Faces are located using a tracker based upon the ratio template algorithm [1]. Optical flow of the face is subsequently determined using a multi-channel gradient model [2]. The speed and direction information produced is then averaged over different parts of the face and ratios taken to determine how facial parts are moving relative to one another. This information is entered into multi-layer perceptrons trained using back propagation. The system then allocates any facial expression to one of four categories: happiness, sadness, surprise, or disgust. The three key stages of the architecture are all inspired by biological systems. This emotion recognition system runs in real-time and has a range of applications in the field of human-computer interaction.
1 Introduction
Expressions form a significant part of human interaction and providing computers with the ability to recognise and make use of such non-verbal information could open the way to new and exciting paradigms in human-computer interaction. The human face is a key element of non-verbal communication, with our facial expressions acting as a type of social semaphore. Without the ability to express and recognise emotions, social interaction is less fulfilling and stimulating, with either or both parties often unable to fully understand the meaning of the other. The importance of signalling emotional state in communication is demonstrated by the use of emoticons in internet chat rooms. Many chat room users use a whole range of these emoticons (widely understood symbolic abbreviations such as :) for happy) to enhance interaction and give an insight into their current emotional state. Without such information what is meant as a light-hearted joke can, for example, easily be misinterpreted using this form of communication, leading to the other party becoming upset due to being unclear as to how the statement was meant. Computer emotion recognition would have a range of useful applications. For example, allowing a robot to understand the emotions of the user would enhance its effectiveness at performing many tasks.
Emotion recognition also has a part to play in software or educational tutorials [3], measurement tools for behavioural science [4], as a mechanism for detecting deceit [5] and in the development of socially intelligent software tools for autistic children [6]. A number of attempts to solve the problem of emotion recognition have been made in the past, with early attempts measuring the movement of dots applied to the face of subjects [7]. However, since then, less intrusive methods have been developed, and these approaches have met with varying degrees of success. They have generally attempted to solve the problem by use of various optical flow algorithms [8,9,10,11]. Many of these systems are set up to recognise some or all of six key emotions universally associated with unique facial expressions [8,9]. These emotions are happiness, sadness, surprise, disgust, anger, and fear. The bulk of other work in this area has involved attempts to recognise individual or combined action units [4,10,12,13] described by the Facial Action Coding System [14]. In general, these approaches have all required some form of manual pre-processing before emotion recognition is possible. This paper discusses an approach to recognising only the basic emotions. However, rather than focusing solely on the problem of emotion recognition, this paper describes a fully automated multi-stage emotion recognition architecture. It allows the emotional state of any user who sits in front of a computer equipped with a camera to be determined in real-time, even in cluttered and dynamic scenes. As the system is still in development, it currently only recognises four of the six basic emotions (happiness, sadness, disgust, surprise). There are three main components to the system: the face tracker, the optical flow algorithm, and emotion recognition (Fig. 1). The face tracker is a modification of, and an extension to, the ratio template algorithm [2]. Optical flow is determined by a real-time version of a multi-channel gradient model [15], whilst the emotion recognition system currently uses multi-layer perceptrons trained using backpropagation. These techniques are all relatively quickly processed, allowing the combined system to run at 4fps on a 384x247 image on a 450MHz Pentium III machine with Matrox Genesis DSP boards. The following three sections of the paper describe each of the three main components of the system in more detail. Results are then given, and proposals made for further work.
Fig. 1. System summary
2 Face Tracking
The face tracking component of the system is founded upon a modified version of the ratio template algorithm [16]. The ratio template algorithm, originally described by Sinha [1], operates by matching ratios of averaged luminance using a spatial face model. It is able to detect frontal views of faces under a range of lighting conditions, although it is ineffective when the subject is illuminated from beneath. The method is able to handle limited changes in scale, yaw, pitch, and tilt of the head, with its strengths lying in its ease of implementation, its speed, and its tolerance to the different illuminations characteristic of an unstructured indoor environment. The ratio template algorithm also provides a rough spatial map locating facial features. This is of importance to the emotion recognition component of the system as it provides information as to the position of different facial features. A detailed description of the ratio template approach can be found in [16]. The version of the ratio template algorithm described in [16] is modified in this system by the inclusion of biological proportions (the golden ratio) into the face model, and by examination of higher order relationships in addition to the original ratio measures. The information provided by the modified ratio template algorithm is then combined with that given by simple morphological eye/mouth detection, image motion, and matching density to allocate a single face probability to each location in the scene [17]. Fig. 2 summarises the key stages of the face tracker.
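To illustrate the kind of test a ratio template performs, the following sketch compares averaged luminance over a few face-model regions; the specific regions and bright/dark relations are hypothetical stand-ins for the template of Sinha [1] and its modified version [16], not the actual model.

```python
import numpy as np

# Illustrative region boxes (row0, row1, col0, col1) inside a candidate face window,
# and a few expected brightness relations of the kind a ratio template encodes
# (e.g. forehead brighter than the eye regions). These particular regions and
# relations are hypothetical.
REGIONS = {"forehead": (0, 8, 4, 20), "left_eye": (8, 14, 4, 11),
           "right_eye": (8, 14, 13, 20), "nose": (14, 20, 9, 15)}
RELATIONS = [("forehead", "left_eye"), ("forehead", "right_eye"), ("nose", "left_eye")]

def region_mean(window, box):
    r0, r1, c0, c1 = box
    return window[r0:r1, c0:c1].mean()

def ratio_template_score(window, min_ratio=1.1):
    # Count how many expected bright/dark relations hold for this candidate window;
    # a high count suggests a frontal face at this location.
    means = {name: region_mean(window, box) for name, box in REGIONS.items()}
    return sum(means[a] / (means[b] + 1e-6) > min_ratio for a, b in RELATIONS)
```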
Fig. 2. Summary of face tracker employed by emotion recognition system
Parallels can be drawn between the stages of the face tracker and biological face detection systems. There are cells in the human cortex that fire specifically when a face in a particular pose is present in their receptive field [18], with the ratio template performing the same function. Also, the advantages of, and biological plausibility for, the taking of luminance ratios have been previously identified [15]. The alterations made to the ratio template are inspired by the naturally occurring golden ratio. The use of eye/mouth identification is supported by studies into human face detection strategies [18,19]. Face detecting cells respond to the relative positions of features within a face and any change to this arrangement of facial features reduces the cells’ response [18]. There have also been suggestions that humans possess specialised detectors for eyes [19]. The morphological eye/mouth detection mimics these effects by checking for the presence of eyes, and checking they and the mouth are in the correct relative positions.
3 Motion Detection
Once a face has been located in the scene by the face tracker, an optical flow algorithm determines the motion of the face. Using motion information for the purposes of emotion recognition simplifies the task by ignoring variations in texture of different people's faces. Hence, the facial motion patterns seen when each of the basic emotions is expressed are similar, independent of who expresses the emotion. Interestingly, facial motion alone has already been shown to be a useful cue in the field of human face recognition [20]. In our system, the multi-channel gradient model (MCGM) is employed to determine facial optical flow. The MCGM is based on a model of the human cortical motion pathway, and operates in 3 dimensions (two spatial, and one temporal), recovering the dense velocity field of the image at each location. The model involves the application of a range of spatial and temporal differential filters to the image, with appropriate ratios being taken to recover speed and direction [2]. The MCGM is chosen for the extraction of motion information for several reasons. Firstly, by using a ratio stage, it is robust to changes in scene luminance, thus removing problems associated with Fourier energy or template matching methods of recovering optical flow. Also, a real-time version of the MCGM has been implemented on a machine using Matrox Genesis DSP boards [15]. Finally, there is evidence to show that the MCGM is a biologically plausible model of the human cortical motion pathway, particularly as it has been shown to correctly predict a number of motion-based optical illusions [2]. As part of the emotion recognition system, the MCGM is only active where a face is found in the scene by the face tracker, thus reducing processing time. The speed and direction information produced is then provided to the emotion recognising component of the system.
4 Emotion Recognition
The recognition system aims to map the optical flow output provided by the MCGM to one of four of the prototypical emotional states (happiness, sadness, disgust, surprise). Multi-layer perceptrons trained using back-propagation are used for this purpose. They are trained using image sequences taken from the Cohn-Kanade [21,22] facial expression database. The frame rates of the Cohn-Kanade sequences have been reduced to the 4fps at which our current combined face tracking/emotion recognition system runs. The image sizes have also been reduced such that the faces in the sequence are at a scale detectable by the face tracker (faces approximately 45x55 pixels in size). Rather than inputting the raw optical flow output directly into the perceptrons, the motion (speed and direction) data provided by the MCGM is condensed into a more efficient form. Such condensation of data reduces the amount of information entered into the perceptrons, thereby reducing the perceptron size needed to effectively learn the problem. This step is required as the smaller the network size, the faster the output can be obtained, and the better suited it is for use in real-time systems. The data is condensed by averaging the motion of key facial regions. Currently, motion is averaged according to the regions of the spatial template used by the modified ratio template algorithm (Fig. 2), with two extra chin regions (Fig. 3). A problem that has to be addressed to permit effective emotion recognition is the removal of the effects of overall head motion. By moving the head as a whole whilst expressing an emotion, a dramatic change is seen in the optical flow output: e.g. the optical flow of someone smiling whilst moving their head downwards would look very different from that of someone smiling whilst raising their head. An effective way of removing the effect of overall head motion is to take ratios of motion, e.g. the ratio of left cheek motion to left side of chin. By taking ratios it is possible to determine how different facial parts are moving relative to one another, independently of how the head is moving globally. The set of ratios taken has been determined empirically (Fig. 3), with this set of ratios being asymmetrical down the middle of the face. Use of asymmetry is effective as the motion seen when basic emotions are expressed is symmetrical, so it can be coded by the motion of half a face. Therefore there is no point in taking ratios of motion on the left side of the face that are the same as on the right side of the face. Use of asymmetrical ratios doubles the information coded by the same number of ratios. Once the representation of the data to be input into the networks has been chosen, it is then necessary to determine the structure of the emotion recognising networks themselves. Thus far, two approaches have been taken to the design of the recognition system. One involves the use of a single large network with a separate output node for each emotion, whilst the second uses four smaller individual networks each trained to recognise a different emotion. Both network types are trained with the ratios of averaged facial motion obtained from 4 consecutive frames of image sequence recorded at 4fps. These four frames represent the start phase of one of the four emotions the system attempts to recognise. The training set consists of 130 emotion examples.
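A minimal sketch of this data-condensation step is given below; the region masks, the particular ratio pairs, and the assumption that dense per-pixel speed maps are already available from the optical flow stage are all illustrative, not the authors' MCGM-based implementation.

```python
import numpy as np

def region_averaged_speed(speed, region_masks):
    # speed: HxW array of per-pixel motion magnitude for one frame of optical flow;
    # region_masks: dict region name -> boolean HxW mask (cheeks, chin, forehead, ...).
    return {name: float(speed[mask].mean()) for name, mask in region_masks.items()}

def motion_ratios(region_speed, ratio_pairs, eps=1e-6):
    # Ratios between regions (e.g. left cheek / left chin) largely cancel overall
    # head motion; the particular pairs used here would be chosen empirically.
    return np.array([region_speed[a] / (region_speed[b] + eps) for a, b in ratio_pairs])

def classify_emotion(four_frame_ratios, mlp):
    # four_frame_ratios: list of 4 ratio vectors (one per frame at 4 fps);
    # mlp: a trained multi-layer perceptron with one output per emotion.
    x = np.concatenate(four_frame_ratios).reshape(1, -1)
    return mlp.predict(x)[0]
```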
Fig. 3. Ratios of averaged motion taken. Arrows indicate the ratios
In both network architectures, upon input of a four-frame sequence to the fully-trained network, the node/network giving the highest output determines the emotion to which the sequence is categorised. As with the face tracker and optical flow algorithm, the emotion recognition component of the system is inspired by biology, as it uses biologically motivated neural networks.
5 Results
The two approaches were tested on 57 sequences, not included in the training set, taken from the Cohn-Kanade database [21,22] of four emotions, happiness, sadness, disgust, and surprise. The overall matching rate is 86% for the large network and 77% for the smaller individual networks. The recognition rates for each individual emotion are shown in Table 1. The single large network gave better results on this data set than the four smaller networks. However, the results given by the four individual networks are still good, and it may be necessary to use this approach when the system is expanded further (see Section 6). Although the system was tested on a small data-set, these results demonstrate that even though the frame rate is low (4fps), the faces are small in size and the optical flow data averaged, it is still possible to get high recognition rates. Obviously, due to the need for speed, this approach is not the most accurate for recognising emotion. However, it is still sufficiently accurate for use in the application domains for which it has been designed, and future modifications could enhance the technique. Table 1. Emotion recognition results
Emotion     Percentage correct (single network)   Percentage correct (individual networks)
Happiness   82%                                    76%
Sadness     93%                                    93%
Surprise    87%                                    57%
Disgust     80%                                    80%
6 Future Work
A number of avenues still remain open to investigation. Obviously, the approach must be extended to all six of the basic emotions. Currently, only four have been used as fewer examples of the sadness and fear expressions are present with the correct action units in the Cohn-Kanade expression database. Thus far the motion information provided by the MCGM has only been averaged over regions according to the spatial map of the ratio template algorithm. However, the choice of regions over which to do the spatial averaging is important, as otherwise important data relating to salient face emotional movement could be lost. Ideally one wants to choose regions for averaging such that the face is separated into a set of individual parts that move in a concerted manner when different emotions are expressed. Therefore, new regions will be chosen so as to take account of the underlying muscle structure of the face. This approach should be more effective than simply using the regions of the face template used by the modified ratio template algorithm. It is also possible that better results would be achieved by averaging the motion information over different regions according to the specific emotion. For example, it may be more sensible to focus on the motion of the mouth region for one emotion, and the forehead for another. Changes to the ratios of averaged motion taken could also be made according to the emotion. These modifications would require the use of individual networks for each emotion as the data representation would differ for each emotion characterised. The use of thresholding also needs to be considered. Currently the system assumes an emotion is expressed at all time instants, but it is desirable that the system recognise when no emotion is expressed. Therefore, thresholds will be included such that the outputs of the networks must exceed a given constant for an input to be deemed an emotion. This may necessitate the inclusion of non-emotion examples in the training sets used by the networks. Finally, other network types will be studied. By training a Self Organising Map (SOM) to recognise the six basic emotions, each basic emotion will have its own region of the map that is sensitive to its motion signature. Importantly, it is thought that if, for instance, the SOM had a 2-dimensional topology it could be used to recognise emotional mixes. When exposed to an emotional mix, regions of the map between the regions sensitive to the component emotions of the mix may become activated.
7 Summary
A fully automated system for the recognition of four of the six emotions universally associated with unique facial expressions has been presented. This real-time system, inspired by biology, achieved a correct recognition rate of 86% when tested on a section of the Cohn-Kanade facial expression database. Approaches to enhance the accuracy and applicability of the system further have been discussed.
References
[1] Sinha P.: Perceiving and Recognising Three-Dimensional Forms. PhD dissertation, M.I.T. Available at http://theses.mit.edu:80/Dienst/UI/2.0/Describe/0018.mit.theses%2f1995-70?abstract
[2] Johnston A., McOwan P.W., Benton C.P.: Robust Velocity Computation from a Biologically Motivated Model of Motion Perception. Proceedings of the Royal Society of London, Vol. 266. (1999) 509-518
[3] Picard R.W.: Towards Agents that Recognize Emotion. Actes Proceedings IMAGINA. (1998) 153-155
[4] Bartlett M.S., Huger J.C., Ekman P., Sejnowski T.J.: Measuring Facial Expressions by Computer Image Analysis. Psychophysiology, Vol. 36. (1999) 253-263
[5] Bartlett M.S., Donato G., Movellan J.R., Huger J.C., Ekman P., Sejnowski T.J.: Face Image Analysis for Expression Measurement and Detection of Deceit. Proceedings of the 6th Annual Joint Symposium on Neural Computation. (1999)
[6] Ogden B.: Interactive Vision in Robot-Human Interaction. Progression Report. (2001) 42-55
[7] Himer W., Schneider F., Kost G., Heimann H.: Computer-based Analysis of Facial Action: A New Approach. Journal of Psychophysiology, Vol. 5(2). (1991) 189-195
[8] Yakoob Y., Davis L.: Recognizing Facial Expressions by Spatio-Temporal Analysis. IEEE CVPR. (1993) 70-75
[9] Rosenblum M., Yakoob Y., Davis L.: Human Emotion Recognition from Motion Using a Radial Basis Function Network Architecture. IEEE Workshop on Motion of Non-Rigid and Articulated Objects. (1994)
[10] Lien J.J., Kanade T., Cohn J.F., Li C.: Automated Facial Expression Recognition Based on FACS Action Units. Third IEEE International Conference on Automatic Face and Gesture Recognition. (1998) 390-395
[11] Essa I.A., Pentland A.P.: Coding, Analysis, Interpretation, and Recognition of Facial Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19(7). (1997) 757-763
[12] Lien J.J., Kanade T., Cohn J.F., Li C.: Automated Facial Expression Recognition Based on FACS Action Units. Third IEEE International Conference on Automatic Face and Gesture Recognition. (1998) 390-395
[13] Tian Y., Kanade T., Cohn J.F.: Recognizing Action Units for Facial Expression Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23(2). (2001) 390-395
[14] Ekman P., Friesen W.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, CA. (1978)
[15] McOwan P.W., Benton C., Dale J., Johnston A.: A Multi-differential Neuromorphic Approach to Motion Detection. International Journal of Neural Systems, Vol. 9. (1999) 429-434
[16] Scassellati B.: Eye Finding via Face Detection for a Foveated, Active Vision System. Proceedings of the Fifteenth National Conference on Artificial Intelligence. (1998)
[17] Anderson K., McOwan P.W.: Robust Real-Time Face Tracker for Cluttered Environments. Submitted to Computer Vision & Image Understanding.
[18] Tovée M.J.: An Introduction to the Visual System. Cambridge University Press. (1996)
[19] Hietanen J.K.: Does your gaze direction and head orientation shift my visual attention? Neuroreport, Vol. 10. (1999) 3443-3447
[20] Hill H., Johnston A.: Categorising Sex and Identity from the Biological Motion of Faces. Current Biology, Vol. 11. (2001) 880-885
[21] Lien J.J.J., Kanade T., Cohn J.F., Li C.C.: Detection, Tracking, and Classification of Subtle Changes in Facial Expression. Journal of Robotics and Autonomous Systems, Vol. 31. (2000) 131-146
[22] Cohn J.F., Zlochower A., Lien J., Kanade T.: Automated Face Analysis by Feature Point Tracking Has High Concurrent Validity with Manual FACS Coding. Psychophysiology, Vol. 36. (1999) 35-43
A Dual-Factor Authentication System Featuring Speaker Verification and Token Technology Purdy Ho and John Armington Hewlett-Packard, USA {purdy.ho, john.armington}@hp.com
Abstract. This paper presents a secure voice authentication system combining speaker verification and token technology. The dual-factor authentication system is especially designed to counteract imposture by pre-recorded speech and the text-to-speech voice cloning (TTSVC) technology, as well as to regulate the inconsistency of audio characteristics among different handsets. The token device generates and prompts a one-time passcode (OTP) to the user. The spoken OTP is then forwarded simultaneously to both a speaker verification module, which verifies the user's voice, and a speech recognition module, which converts the spoken OTP to text and validates it. Thus, the OTP protects against recorded speech or voice cloning attacks and speaker verification protects against the use of a lost or stolen token device. We show the preliminary results of our Support Vector Machine (SVM)-based speaker verification algorithm, handset identification algorithm, and the system architecture of our design.
1 Introduction
Voice applications are moving into the mainstream today thanks to standards such as VoiceXML and services such as voice portals. While these standards and services are maturing, security has not kept pace. Traditional static touchtone passwords used in phone channels are subject to eavesdropping and can easily be forgotten by users, thus leaving room for improvement. Voice is the most natural way of communicating over the phone channels, so speaker verification is better suited to integrate into the phone network than many other biometric authentication technologies. The task of speaker verification [2, 7] is to determine a binary decision of whether or not an unknown utterance is spoken by the person of the claimed identity. Speaker verification can be implemented in two different approaches: text-dependent and text-independent. Both approaches are subject to prerecorded speech attack. They also share a vulnerability known as text-to-speech voice cloning (TTSVC, e.g. the Text-to-Speech Voice Cloning System by AT&T Labs, http://www.naturalvoices.att.com/), a highly sophisticated technique of synthesizing speech using a person's voice and prosodic features, which will likely become more commonplace over time. Thus, we enhanced the security of single-factor speaker
verification by adding the time-synchronous one-time passcode (OTP), which is prompted to the user in an out-of-band manner. In addition, our system also takes into account the degradation of speaker verification accuracy introduced by the inconsistency of different handsets. We implemented the SVM-based handset identifier[9], which identifies the handset that an utterance is coming from prior to speaker verification. The outline of this paper is as follows: Section 2 defines our SVM-based speaker verification and presents preliminary results. Section 3 illustrates our SVM-based handset identifier and presents results. Section 4 explains the system architecture. Section 5 states the potential problems of existing systems and describes the benefits of using our proposed system. Section 6 concludes the paper.
2 Speaker Verification Using Support Vector Machines
We use the SVM classifier to perform speaker verification. A detailed description of SVM theory can be found in [12] and [1]. The general SVM decision function has the following form:

f(x) = Σ_{i=1}^{l} y_i α_i K(x_i, x) + b    (1)

where x_i ∈ R^n, i = 1, 2, . . . , l, are the training data. Each point x_i belongs to one of the two classes identified by the label y_i ∈ {−1, 1}. The coefficients α_i and b are the solutions of a quadratic programming problem [12]; α_i are non-zero for support vectors and are zero otherwise. Speech data are not linearly separable, so we mapped x_i (training points) and x (test points) in the input space to Φ(x_i) and Φ(x) in a higher-dimensional feature space. We still look for a linear separation, but in a different feature space [1]. The mapping Φ(x_i) · Φ(x) is represented by a Gaussian kernel function K(x_i, x), as shown in Eq. (2); σ was determined by experiments and was chosen to be 0.5.

K(x_i, x) = exp( −||x_i − x||² / (2σ²) )    (2)

Classification of a test data point x is performed by computing the sign of the right-hand side of Eq. (1). The distance from this point x to the decision plane is given by the following equation:

d(x) = ( Σ_{i=1}^{l} y_i α_i K(x_i, x) + b ) / || Σ_{i=1}^{l} α_i y_i x_i ||    (3)

The sign of d (the same as that given by Eq. (1)) is the classification result for x, and |d| is the distance from x to the decision plane.
2.1 Training and Testing
The data used in speaker verification were utterances recorded from 25 speakers over different handsets within a week. Each speaker recorded about 50-60 utterances, 3 seconds each. Four out of the 25 speakers were female. Each audio waveform was sampled at 8KHz. We computed Mel-Frequency Cepstral Coefficients (MFCCs) [11] at 100 frames/sec. The audio signals were pre-emphasized and hamming-windowed before being transformed into the cepstral domain. Thirteen MFCCs were computed per frame for all utterances. Twenty-five Gaussian-kernel SVMs were trained, one for each speaker. Four utterances from each speaker were used for training. The positive training data of a speaker SVM came from the MFCC vectors of the 4 utterances from that particular speaker, while the negative training data of each speaker SVM consisted of about 5,000 MFCC vectors randomly chosen from the rest of the speakers. This is the typical SVM 1-vs-rest training method. The test data were the rest of the utterances that were not used for training. An average of 50 utterances from each speaker were used for testing. Each test utterance was again broken down into frames and converted to MFCC vectors. The Gaussian-kernel SVM takes each frame of the utterance as its input and computes d(x_i) according to Eq. (3), where x_i is the feature vector of a frame and i = 1, 2, . . . , l are the frame indices. The sign of d(x_i) is the recognition output of this particular frame. To determine whether the whole utterance X comes from a particular speaker, we used a voting mechanism in which the total number of positive recognition outputs of this utterance was divided by the total number of frames l in this particular utterance. This quantity is called the characteristic distance, d(X)+, as shown in Eq. (4). We claim that the utterance is spoken by the claimed speaker if this quantity is positive.

d(X)+ = ( Σ_{i=1}^{l} sign(d(x_i)) ) / l    (4)
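The frame-level voting scheme can be sketched as follows; librosa is used here as an illustrative stand-in for the MFCC front end (the paper uses the Auditory Toolbox [11]), and the per-speaker SVM is assumed to be already trained.

```python
import numpy as np
import librosa

def utterance_score(wav_path, speaker_svm):
    # 13 MFCCs per frame at 8 kHz, roughly 100 frames/sec as described above.
    audio, sr = librosa.load(wav_path, sr=8000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=80).T  # (frames, 13)
    frame_signs = np.sign(speaker_svm.decision_function(mfcc))  # sign of d(x_i) per frame
    return frame_signs.sum() / len(frame_signs)                 # characteristic distance d(X)+

# Accept the claimed identity if utterance_score(...) > 0.
```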
The preliminary SVM testing results show that we have about 5% errors when tested on speech data coming from a variety of handsets (most speaker verification systems report better results when tested on speech data from the same phone channel). Previous attempts at using SVM for speaker verification were reported in [13] and [3].
3 SVM-Based Handset Identifier
We reimplemented the SVM-based handset identifier from our previous work [9] to improve speaker verification accuracy when dealing with a variety of handsets. Our handset identifier was trained and tested on four different handsets: carbon-button (CB), electret (EL), cordless (CD), and headset (HS), based on the LLHDB corpus [10]. Thus, each speaker had 4 handset-dependent SVM-based speaker verification models, each trained on one of the four handsets.
Fig. 1. The speaker verification process with the embedded handset identifier
In speaker verification, the handset from which the utterance is coming is first determined by the SVM-based handset identifier. The utterance is then forwarded to the handset-dependent SVM-based speaker verification model, as shown in Figure 1. Previous attempts at handset identification can be found in [6], [10], [5], [4], etc. Since this paper focuses on system design, we did not compare our algorithm to these previous ones; however, some comparisons can be found in our previous paper [9].
3.1 Training and Testing
The audio waveforms from LLHDB were sampled at 8kHz with a 16-bit resolution. We chose the Rainbow passages (a ninety-seven word passage containing a broad range of phonemes, in [8]) from 5 males and 5 females recorded on each of these 4 handsets (CB, EL, CD, and HS) as training data. We converted each Rainbow waveform into 13 MFCCs per frame at 100 frames/sec. Each audio sample was pre-emphasized and hamming-windowed before being transformed into the cepstral domain. Four Gaussian-kernel SVMs were trained with σ = 3.6 to distinguish utterances on one type of handset from those on the others using the one-vs-rest approach, i.e. the positive training class of a handset SVM comes from the MFCC vectors of that particular type of handset, while its negative training data consist of the MFCC vectors from all the other handset types, as shown in Table 1.

Table 1. Positive and negative training classes of each SVM

Gaussian-kernel SVM   Positive class   Negative class
Carbon-button         CB               EL, CD, and HS
Electret              EL               CB, CD, and HS
Cordless              CD               CB, EL, and HS
Headset               HS               CB, EL, and CD
Each test utterance is broken down into frames and converted to MFCC feature vectors in the same way as in training. The Gaussian-kernel SVM takes each frame of the utterance as its input and computes d(x_i) according to Eq. (3), where x_i is the feature vector of a frame and i = 1, 2, . . . , l are the frame indices, such that the utterance is X = {x_1, x_2, . . . , x_l}. The sign of d(x_i) (+1 or −1) is the recognition output of this particular frame. To determine whether the whole utterance comes from a particular handset, we used the voting mechanism again: the total number of positive recognition outputs of this utterance was divided by the total number of frames l in this particular utterance. This quantity, the characteristic distance d(X)+_j shown in Eq. (5), is computed for each SVM j = 1, 2, . . . , q (where q is the number of handset types). The maximum characteristic distance d_y(X) among the q SVMs is then taken, and y is the class label of the utterance X given by the j which yields the maximum characteristic distance:

d(X)+_j = ( Σ_{i=1}^{l} sign(d(x_i)) ) / l    (5)

with d_y(X) = max_{j=1,...,q} d(X)+_j.

The utterances used for testing also came from the LLHDB, but were independent of the training set. Each speaker in the LLHDB was prompted to read 10 sentences. We chose the "Ten Sentences" recorded by 20 males and 20 females from each of the four handsets (a total of 10x40x4 utterances) as the testing set. Our preliminary result of testing these utterances against the trained handset identifier was about 92% accuracy.
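A minimal sketch (illustrative only, not the authors' code) of combining the four one-vs-rest handset SVMs by maximum characteristic distance, and of routing the utterance to the matching handset-dependent speaker model:

```python
import numpy as np

HANDSETS = ["CB", "EL", "CD", "HS"]

def identify_handset(mfcc_frames, handset_svms):
    # handset_svms: dict handset label -> one-vs-rest SVM trained as in Table 1.
    # The characteristic distance d(X)+_j of Eq. (5) is the mean sign of the
    # per-frame SVM outputs; the handset with the largest value is selected.
    scores = {h: float(np.sign(svm.decision_function(mfcc_frames)).mean())
              for h, svm in handset_svms.items()}
    return max(scores, key=scores.get)

def verify_with_handset_routing(mfcc_frames, handset_svms, speaker_models):
    # speaker_models: dict handset label -> handset-dependent speaker SVM
    # for the claimed speaker.
    handset = identify_handset(mfcc_frames, handset_svms)
    d_plus = float(np.sign(speaker_models[handset].decision_function(mfcc_frames)).mean())
    return d_plus > 0   # accept the claimed identity if d(X)+ is positive
```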
4 System Architecture
This section provides information on the protocols and administrative processes required to provide the necessary security framework in which speaker verification can operate. To aid in understanding, references to processes on the block diagram in Figure 2 below are noted with numbers matching those on the diagram.
4.1 Token Technology
Many schemes are available today for implementing one-time passcodes. These range from public domain schemes such as S/Key (FreeBSD Handbook, Chapter 10.5, http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/skey.html) and SASL (The One-Time-Password SASL Mechanism, http://www.faqs.org/rfcs/rfc2444.html) to proprietary schemes using hardware tokens, such as those from RSA Security and ActivCard. The different schemes have different costs, levels of convenience, and practicality for a given purpose. Token authentication is a common technique in multi-factor authentication used by many on-line systems today. It provides a method of communicating to the user a one-time passcode via a channel that is out-of-band with respect to the channel being authenticated. One-time passcode, by definition, means that it must provide a reliable method of replay prevention.
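The paper does not prescribe a particular OTP algorithm; as one concrete possibility, the following sketch shows an HMAC-based one-time passcode (HOTP, RFC 4226) and its time-synchronous variant, which is the style of scheme hardware display tokens typically implement.

```python
import hmac, hashlib, struct, time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    # HMAC-based one-time passcode (RFC 4226): HMAC-SHA1 over the moving counter,
    # dynamic truncation, then reduction modulo 10^digits.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = (int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

def totp(secret: bytes, step: int = 30) -> str:
    # Time-synchronous variant: the counter is derived from the current time window,
    # which is how a display token and the validation server stay in sync.
    return hotp(secret, int(time.time() // step))
```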
4.2 Voice Portal Authentication
We assume that speaker verification is mainly used in connection with a voice portal, and that the portal middleware security software enforces access control to downstream applications. Figure 2 is a block diagram that illustrates how the various components interact, as well as the high-level data flows.
Fig. 2. Authentication for voice portal systems
1. The portal homepage server 180 determines the claimed identity of the caller by prompting the caller, matching the caller-ID against a database of previous callers, or a combination of the two.
2. 180 then passes the claimed identity (audio form) of the caller to the authentication subsystem 300 along with a request for authentication, via the policy server 650 and its agent 640.
3. The authentication subsystem's speech recognition module 310 converts the claimed identity (audio form) to text for retrieving the trained SVM-based speaker classifier associated with the caller's claimed identity from the general directory 150.
4. 300 prompts the caller, by the portal's text-to-speech module 250 via 650 and 640, to speak the one-time passcode. The OTP is communicated in an out-of-band manner, such as displaying on a token device (e.g. a watch or a handheld).
5. The user speaks the passcode, and the voice portal subsystem's audio I/O module 270 returns it to 300, via 650 and 640.
6. The SVM-based handset identifier, embedded in the speaker verification module 320, determines the specific handset that the utterances are coming from. It then forwards these utterances to the handset-dependent SVM-based speaker verification model trained on the specific handset and the claimed speaker. If d(X)+ in Eq. (4) is positive, the module will return a positive response to the policy server 650.
7. While step 6 is in progress, 310 converts the OTP to text. It then forwards both the claimed identity (text form) and the text OTP to the OTP validation subsystem 130 (via 650). 130 matches the text against the OTP specified for the claimed user and returns the results to 650.
8. The policy server 650 combines the two authentication results (from steps 6 and 7) and grants the user's access request only if both steps are verified successfully.
9. Ideally, the template adaptation module 330 can update the enrolled templates on the general directory 150 with the new samples and retrain the speaker SVM to improve the performance of future authentication.
5 Problems and Solutions
Problems and solutions associated with previous attempts to build secure voice-enabled applications:

Problem 1: Touchtone static passwords are subject to eavesdropping, are easy to forget, and are not hands-free.
Solution 1: Replace static touchtone passwords with spoken OTPs so that users do not have to remember and type in passwords.

Problem 2: Existing (and future) single-factor speaker verification authentication systems are (or will be) subject to compromise by TTSVC technology.
Solution 2: Add a spoken OTP prompted via a secure out-of-band mechanism as a countermeasure.
Problem 3: Distortions from different telephone handsets make speaker verification challenging and negatively impact performance.
Solution 3: A handset identification algorithm [9], as described in Section 3, should be used to compensate for this.
6 Conclusion
The voice authentication system described in this paper combines single-factor speaker verification with token technology. This design enhances the security of voice portals by ensuring the user's real-time access and preventing impostors from using pre-recorded speech or TTSVC technology. The preliminary SVM speaker verification shows promising results (5% error rate) tested on speech data coming from different handsets. Our system can also improve verification accuracy by adding the SVM-based handset identifier [9] to determine the handset from which an utterance is coming prior to speaker verification. In the future, we will implement the adaptation algorithm so that speaker verification models can be updated upon successful logon. This will keep the database fresh and accommodate changes in voice over time.
References
[1] C. Burges. A tutorial on support vector machines for pattern recognition. Bell Laboratories, Lucent Technologies, 1998.
[2] J.P. Campbell. Speaker recognition: A tutorial. Proceedings of the IEEE, 85(9), 1997.
[3] Y. Gu and T. Thomas. A text-independent speaker verification system using support vector machines classifier. Eurospeech, 2001.
[4] L.P. Heck, Y. Konig, M.K. Sönmez, and M. Weintraub. Robustness to telephone handset distortion in speaker recognition by discriminative feature design. Speech Communication, 31, 2000.
[5] L.P. Heck and M. Weintraub. Handset-dependent background models for robust text-independent speaker recognition. IEEE ICASSP, pages 1071-1074, 1997.
[6] S.P. Kishore and B. Yegnanarayana. Identification of handset type using autoassociative neural networks. The 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999.
[7] J.M. Naik. Speaker verification: A tutorial. IEEE Communications Magazine, 1990.
[8] F. Nolan. The Phonetic Bases of Speaker Recognition. Cambridge University Press, 1983.
[9] Purdy Ho. A Handset Identifier Using Support Vector Machines. In IEEE International Conference on Spoken Language Processing, Denver, CO, USA, 2002.
[10] D.A. Reynolds. HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects. IEEE ICASSP, pages 1535-1538, 1997.
[11] M. Slaney. Auditory toolbox, version 2. Technical Report, Interval Research Corporation, 1998.
[12] V. Vapnik. Statistical learning theory. John Wiley and Sons, New York, 1998.
[13] V. Wan and W. Campbell. Support vector machines for speaker verification and identification. IEEE Proceedings, 2000.
Wavelet-Based 2-Parameter Regularized Discriminant Analysis for Face Recognition
Dao-Qing Dai (Center for Computer Vision, Faculty of Mathematics and Computing, Zhongshan University, Guangzhou 510275, China; [email protected]) and P. C. Yuen (Department of Computer Science, Hong Kong Baptist University, Hong Kong; [email protected])

Abstract. This paper addresses the small-size problem in Fisher Discriminant Analysis. We propose to use the wavelet transform for preliminary dimensionality reduction and a two-parameter regularization scheme for the within-class scatter matrix. The novelty of the proposed method comes from: (1) A wavelet transform with linear computational complexity is used to carry out the preliminary dimensionality reduction instead of employing a principal component analysis; the wavelet filtering also acts to smooth out noise. (2) An optimal solution is found in the full space instead of a sub-optimal solution in a restricted subspace. (3) A detailed analysis of the contribution of the eigenvectors of the within-class scatter matrix to the overall classification performance is carried out. (4) An enhanced algorithm is developed and applied to face recognition. The recognition accuracy (rank 1) for the Olivetti database, using only three images of each person as the training set, is 96.7859%. The experimental results show that the proposed algorithm can further improve the recognition performance.
1 Introduction
Biometric-based identity verification is now a very active field for computer scientists. Linear (Fisher) Discriminant Analysis (LDA) is a well-known and popular statistical method in pattern recognition and classification. The basic idea is to optimize the discriminant criterion, in which the ratio (Fisher index) between the inter-class distance and the intra-class distance is maximized. In the last decade, the LDA approach has been successfully applied in face recognition technology. Several LDA-based face recognition systems [2, 3, 5, 7, 11] have been developed and encouraging results have been achieved. However, the LDA approach suffers from the well-known small sample size problem, that is, the sample size is small compared with the dimension of the feature vector. A number of research works have been proposed to solve this problem [1, 3, 6, 9, 12].
The research of this author is partly supported by NSF of China, NSF of GuangDong and ZAAC.
Among them, one idea is to perform a dimension reduction on the input feature vector before applying the LDA. That is, techniques are employed to reduce the dimension from d to d', where d' < d, so that in R^{d'} the eigenvalue problem becomes well posed. A popular technique is the use of principal component analysis (PCA) [10] for dimension reduction. The FisherFace of [1], the most discriminant features of [9] and the subspace methods of [12], among others, are examples of this nature. This approach is straightforward but suffers from a number of limitations. First, the selection of principal components for PCA is still an unsolved problem. Different methods for the selection of principal components for recognition have been proposed, but there is no consensus. Second, usually a much smaller dimension d' is chosen compared with the dimension of the original data, i.e. d' ≪ d, so some of the useful discriminant information from the image will also be removed. We propose to use the wavelet transform for preliminary dimensionality reduction and a two-parameter regularization scheme for the within-class scatter matrix. The organization of this paper is as follows. We will give a brief review of principal component analysis (PCA) and linear discriminant analysis (LDA) in Section 2. Details of the proposed regularization method will be reported in Section 3. Section 4 reports the experimental results of the proposed RDA algorithm on face recognition.
2 A Brief Review on Principal Component Analysis and Linear Discriminant Analysis
For classification purposes, suppose that we have c classes of patterns ω_1, ω_2, · · · , ω_c of dimension d. The mean of each class is m_i = E(x | x ∈ ω_i), i = 1, 2, · · · , c, where E(η) denotes the mathematical expectation of the random variable η. The mean of the total assembly ω = ∪_{i=1}^{c} ω_i is m = E(x | x ∈ ω). The scatter of the means {m_1, m_2, · · · , m_c} about the total mean m is the between-class covariance matrix C_b = E((ξ − m)(ξ − m)^T). The superscript T denotes matrix transpose. The matrix C_b measures the distribution of the class means m_i about the total mean m. For each class ω_i (i = 1, 2, . . . , c), the scatter about the mean m_i is denoted by C_i = E((x − m_i)(x − m_i)^T | x ∈ ω_i). The pooled covariance is the within-class covariance matrix C_w, which is given by C_w = Σ_{i=1}^{c} p_i C_i. The total population scatter matrix C is given by C = E((x − m)(x − m)^T | x ∈ ω), which is the sum of the within-class and between-class covariance matrices, i.e., C = C_b + C_w. When the mathematical expectation of a random variable is obtained through the sample average, the estimated rank of the within-class matrix C_w satisfies

Rank(C_w) ≤ min(d, #(ω) − c)    (1)

where #(ω) is the number of training samples in ω.
2.1 Principal Component Analysis (PCA)
For the input vector x we want to find a vector y of rank f such that the reconstruction error E(||x − y||²) is minimized. The solution is obtained through a whitening transform which transforms the symmetric matrix C into a diagonal matrix with its diagonal elements {σ_1, σ_2, · · · , σ_d} in non-increasing order. The feature vector y is obtained from the input vector x by y = T^T x, where T consists of the first f rows of the whitening transform. In the transformed domain, the feature vectors are uncorrelated since the correlation matrix T^T C T is diagonal. Sirovich and Kirby [8] used PCA for the characterization of human face images. Turk and Pentland [10] developed a face recognition system based on PCA. They project images to a lower-dimensional subspace under the minimum reconstruction error criterion, E(||x − y||²) = Σ_{i=f+1}^{d} σ_i. In face recognition applications the columns of the transform matrix T, when properly reshaped and scaled, are called eigenfaces. The number of eigenfaces is usually determined from the relative reconstruction error

ε = ( Σ_{i=f+1}^{d} σ_i ) / ( Σ_{i=1}^{d} σ_i ).

A smaller ε corresponds to better reconstruction quality.
2.2 Linear Discriminant Analysis (LDA)
For a linear transform W : R^d → R^f, the Fisher linear discriminant index, which measures the separability of clusters in the feature space, is defined as In(W) = tr((W^T C_w W)^{-1} (W^T C_b W)). The optimal class separation transform T, which maximizes the Fisher index, is obtained by solving the eigenvalue problem

(C_w^{-1} C_b) T = T D    (2)

where D is a d × f diagonal matrix. The index In(T) will be the trace of the matrix D, In(T) = tr(D). From (1), when #(ω) < d + c, the small sample problem occurs and the matrix C_w is not invertible.
3 Proposed Method

3.1 Pre-dimension Reduction
Suppose that we are given a linear transform L : R^d → R^{d'} with L L^T = I, where I is the identity matrix. The original data x are transformed into x̂. We apply the LDA on the intermediate data x̂ to get data y_1. On the other hand, we can use LDA directly on the original data to get data y_2. We establish the relationship between y_1 and y_2. We denote by Ĉ_w and Ĉ_b respectively the within-class and between-class covariance matrices of the transformed data set x̂. For the application of the LDA method, from (2) we have

(C_w^{-1} C_b) T = T D,    (Ĉ_w^{-1} Ĉ_b) T̂ = T̂ D̂,

where T and T̂ are respectively the Fisher transforms. We can show that T̂ is the image of T under the transform L, that is, we have

T̂ = L^T T.    (3)
In face recognition applications, the columns of the Fisher transforms T and T̂, when properly reshaped and scaled, are also called eigenfaces. The main concern in using the operator L is dimension reduction. Two important examples of the operator L are:
– If L is the PCA transform, the L-transformed eigenfaces T̂ are the PCA transform of the original eigenfaces T.
– If L is the wavelet transform (WT), the L-transformed eigenfaces T̂ are the wavelet transform of the original eigenfaces T.
Equation (3) justifies the use of PCA or WT for dimension reduction. WT has been successfully applied to image compression in the past ten years [4]. We choose WT for dimension reduction instead of PCA since (i) it is a linear transform with computational complexity proportional to the size of the input data; (ii) with few coefficients it can capture most of the image energy and the main image features; and (iii) it acts as a kind of noise reduction, which is useful for scanned images. Figure 1 is an example diagram of a two-dimensional WT.
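As an illustration of this preliminary reduction step, the sketch below uses PyWavelets to keep the level-3 approximation coefficients as the reduced feature vector; the library choice is an assumption, and the "db6" name stands in for the Daubechies D6 wavelet mentioned in Fig. 1 (naming conventions differ between libraries).

```python
import numpy as np
import pywt

def wavelet_features(image, wavelet="db6", level=3):
    # Level-3 2-D wavelet decomposition; the approximation sub-band (coeffs[0])
    # retains most of the image energy and the main facial features.
    coeffs = pywt.wavedec2(np.asarray(image, dtype=float), wavelet, level=level)
    return coeffs[0].ravel()   # reduced-dimension feature vector fed to the LDA stage
```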
3.2 Regularization
It is well known from matrix theory that the symmetric matrix C^r can be decomposed as C^r = U Λ U^T, where Λ is a diagonal matrix with non-increasing diagonal elements and U is a unitary matrix satisfying U U^T = I. The columns of the matrix U control the orientation and the corresponding diagonal elements of the matrix Λ determine its size. In this paper we shall use the pooled within-class covariance matrix C_w to get the orientation unitary matrix U, so that variation information of the data is still kept. We point out that it is also possible to use the total covariance matrix C = C_b + C_w for the same reason. Suppose that we have the decomposition C_w = Σ_{i=1}^{d} λ_i u_i u_i^T, where λ_1 ≥ λ_2 ≥ . . . ≥ λ_d are the eigenvalues of C_w and u_i (i = 1, 2, · · · , d) are the corresponding normalized eigenvectors. We set U = (u_1, u_2, · · · , u_d). To get a regularization scheme, we use a two-parameter family of regularizations C^r by specifying the diagonal elements of the matrix Λ as λ^r_i (i = 1, 2, · · · , d), given by

λ^r_i = λ_i + α   for i = 1, 2, · · · , r,
λ^r_i = β         for i = r + 1, r + 2, · · · , d.
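A minimal numerical sketch (not the authors' code) of building the two-parameter regularized within-class matrix and solving the resulting discriminant problem; the use of SciPy's generalized eigensolver is an illustrative choice.

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Cw, Cb, alpha, beta, r, n_features):
    # Eigendecomposition of the pooled within-class covariance: Cw = U diag(lam) U^T,
    # with eigenvalues sorted in non-increasing order.
    lam, U = np.linalg.eigh(Cw)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # Two-parameter regularization of the spectrum: lam_i + alpha for the first r
    # eigenvalues, beta for the remaining ones (beta > 0 keeps C^r invertible).
    lam_r = np.concatenate([lam[:r] + alpha, np.full(len(lam) - r, beta)])
    Cr = U @ np.diag(lam_r) @ U.T
    # Fisher transform: leading eigenvectors of the generalized problem Cb t = d Cr t,
    # which plays the role of Eq. (2) with Cw replaced by the regularized C^r.
    vals, vecs = eigh(Cb, Cr)
    return vecs[:, np.argsort(vals)[::-1][:n_features]]
```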
Fig. 1. Wavelet transform of an image. (a) The original image. (b) Three level decomposition of the image (a) with Daubechies’ wavelet D6 . (c) Reconstruction from approximation coefficients of level 3. Notice that contours of the face appear highlighted
The motivation for our scheme is illustrated in Figure 2. Once we have correctly obtained the orientation matrix, which can be found from prior knowledge, e.g. from images with normal pose, facial expression, etc., our task is to choose the weight that each orientation vector contributes. We shall discuss the problem of updating the orientation matrix elsewhere.
4 Results and Conclusion
This section reports the evaluation results of the proposed RDA algorithm on face recognition. A standard and internet-available database from the Olivetti Research Laboratory is selected for evaluation. The Olivetti set contains 400 images, with resolution 92x112, of 40 persons with variations in pose, illumination and facial expression. In the experiment, three images of each person are used for the training set; the remaining seven images are used as the test set. The WT-reduced dimension is 168, hence the resulting LDA problem is ill-posed. We use cross-validation to select the optimal values of the regularization parameters α and β. The rank-one recognition rates are shown in Figure 3. We notice that over the parameter ranges the recognition rates are greater than 89%, which shows that the WT step of dimension reduction captured useful information for classification. When α = β and they are large (on the diagonal direction), the matrix C^r becomes a multiple of the identity matrix, which means that all variations in all directions are treated as the same constant. When α = 0 and β is large (on the β axis), the vectors in the null space of C_w are ignored; this is equivalent to taking C^r as the pseudo-inverse of C_w. When β is very small (noting that it cannot be zero) and α is large (approximately on the α axis), the vectors in the null space of C_w play a principal role. We point out that the optimal parameters α and β are not unique in general. In Table 1, we list the recognition rates of several trivial cases. The simple regularization (α = β = 10^−8) attains a recognition rate of only 82.1429%. The optimal recognition accuracy is higher than those of the three trivial cases.

Fig. 2. With correctly obtained feature vectors u_1 and u_2, the effect of linear combinations λ_1 u_1 + λ_2 u_2, with real numbers λ_1 and λ_2, on classification. (a) Four samples {1, 2, 3, 4}. (b) With the linear combination 0.2 u_1 + u_2 the samples are grouped into {1, 2} and {3, 4}. (c) But with the linear combination u_1 + 0.2 u_2 they are grouped into {1, 4} and {2, 3}

Table 1. The recognition rates (rank 1) when α, β correspond to trivial cases and to the optimal values

α = β = 10^−8   α = β = 10^6   α = 0, β = 10^6   α = 10^6, β = 10^−4   optimal α, β
82.1429%        93.5714%       92.8571%          90.3571%              96.7859%
Fig. 3. Cross-validation plot of the recognition rates corresponding to the parameters α, β ranging from 1 to 500. Three images out of ten for each person are used for training; the remaining seven images form the test set
Table 2. Average recognition rates (rank 1) of 40 experiments, using 3 images of each person as training set

    # of persons   α = β = 10^-8   α = 0, β = 10^6   α = β = 10^6   α = 10^6, β = 10^-4   optimal α, β
    5              79.7857%        99.5000%          99.5714%       86.2143%              99.4286%
    10             79.3214%        98.5714%          98.5000%       87.6786%              98.8214%
    15             81.7381%        97.8333%          97.3809%       89.6905%              98.2857%
    20             81.9464%        97.5536%          96.3393%       89.9286%              97.8750%
    25             82.0571%        96.4857%          96.0000%       90.9143%              97.3286%
    30             81.5357%        95.2143%          95.3095%       91.0000%              97.1190%
    35             81.8469%        94.0306%          94.5204%       90.5714%              96.9592%
    39             82.0421%        92.9487%          93.5897%       90.2930%              96.6484%
To further test the proposed method, more experiments are carried out. In each experiment we use a subset of the 40 persons, again with three images of each person as the training set and the remaining seven as the test set. We run the program 40 times. The average recognition rates are listed in Table 2, which supports essentially the same observations as Table 1. Finally, we list some LDA-based results on the Olivetti database in Table 3.
Table 3. Comparison between the proposed method and existing LDA-based methods on the Olivetti database (a: extracted from Ref. [12])

    Method                 LDA^a     Weighted LDA [12]   Yu et al. [11]   Proposed
    Recognition accuracy   62.76%    92.75%              90.80%           96.79%
In summary, we proposed a wavelet-enhanced LDA system for human face recognition. We justified the use of the WT for preliminary dimension reduction and used a regularization scheme for the eigenvalues of the covariance matrix to overcome the small sample size problem. Moreover, an enhanced algorithm was developed and applied to face recognition. We did not discuss updating of the orientation matrix in this paper; this will be reported in forthcoming publications.
References

[1] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 19 (7) (1997), 711-720.
[2] R. Chellappa, C. Wilson, S. Sirohey, Human and machine recognition of faces: a survey, Proc. IEEE, Vol. 83 (5) (1995), 705-740.
[3] D. Q. Dai and P. C. Yuen, Regularized discriminant analysis and its applications to face recognition, Pattern Recognition, Vol. 36 (2003), 845-847.
[4] I. Daubechies, Ten lectures on wavelets, CBMS-NSF Regional Conf. Series in Appl. Math., Vol. 61, SIAM, Philadelphia, PA, 1992.
[5] K. Etemad and R. Chellappa, Discriminant analysis for recognition of human face images, J. Opt. Soc. Am. A, 14 (1997), 1724-1733.
[6] W. J. Krzanowski, P. Jonathan, W. V. McCarthy and M. R. Thomas, Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data, Appl. Statist., Vol. 44 (1995), 101-115.
[7] P. J. Phillips, H. Moon, S. A. Rizvi and P. J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22 (10) (2000), 1090-1104.
[8] L. Sirovich and M. Kirby, Application of the Karhunen-Loeve procedure for the characterization of human faces, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 12 (1990), 103-108.
[9] D. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 18 (8) (1996), 831-836.
[10] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience, Vol. 3 (1) (1991), 72-86.
[11] H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data - with application to face recognition, Pattern Recognition, Vol. 34 (2001), 2067-2070.
[12] W. Zhao, R. Chellappa and P. J. Phillips, Subspace linear discriminant analysis for face recognition, Technical Report CAR-TR-914, Center for Automation Research, University of Maryland, College Park, 1999.
Face Tracking and Recognition from Stereo Sequence

Jian-Gang Wang, Ronda Venkateswarlu, and Eng Thiam Lim

Institute for Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613
{jgwang,vronda,etlim}@i2r.a-star.edu.sg
Abstract. In this paper, we present a face recognition system that is able to detect, track and recognize a person walking toward a stereo camera. By integrating stereo and intensity pairs, the pose can be detected and tracked. Face images suitable for recognition can be selected automatically from the stereo sequence based on the pose. The straightforward Fisherface method is then applied to the selected face images. Results of the tracking and recognition show that this video-based stereo approach provides more robust performance since the 3D information is taken into account.
1 Introduction
Sensitivity to variations in pose is a challenging problem in face recognition using appearance-based methods. Video-based face and/or expression recognition [2, 5, 10, 8, 11, 12] offers several advantages over still-image-based approaches under a relatively controlled environment, e.g., visitor access, ATM, or HCI, where faces are relatively large in the images. In such applications, good frames can be selected to perform classification. Our main focus in this paper is to detect, track and estimate the pose of the face automatically using stereo vision, enabling us to select frontal poses for face recognition as the person moves in front of the stereo camera. Face recognition from stereo sequences has previously been studied in [11]. In this paper, we present a substantially different approach in three major aspects: first, we use disparity, instead of image differences, as the silhouette segmentation cue, so the segmentation performs better even under strong background motion. Second, the head pose can be estimated from stereo in our system; hence pre-selection of face image candidates for recognition becomes easier and better founded. Third, the Fisherface technique, instead of graph matching, is applied to the selected face images to identify persons. Head detection, tracking and pose estimation are described in Section 2. Section 3 briefly describes the Fisherface technique as applied to the selected face images. Section 4 discusses tests and results obtained from the recognition system. Section 5 concludes the work with future enhancements to the system.
2 Head/Face Detection, Tracking and Pose Estimation
We propose a hybrid approach to detect and track the head/face in real time. The signal flow diagram is shown in Fig. 1. A disparity map of the face is obtained at frame rate by commercially available stereo software [9]. Combining disparity and intensity pairs, either the head or the face is detected and tracked automatically. The head is tracked if the face features are not available, e.g., when the person is far away from the stereo head or shows a profile view; the face and its 3D pose are tracked once the facial features are available.

Fig. 1. Flowchart of the head/face-tracking algorithm: person-of-interest extraction by disparity segmentation, separation of the head from the head-and-shoulder component using distance and watershed transforms, initial head contour fitting, feature extraction (anti-max-median filter), head contour refinement, and face/3D pose tracking when features are available (head detection and tracking otherwise)
Knowing the face features within the face contour, the contour can be refined, which helps subsequent frame-to-frame feature extraction and pose estimation.

2.1 Head Detection and Feature Extraction
Object-Oriented Disparity Segmentation. In this paper, we want to track only the person nearest to the camera. Since we are using a stereo system, the person of interest can be detected quickly in the image by thresholding the distance from the stereo head. The thresholds are selected based on peak analysis of the disparity histogram. Figure 2 shows two people in front of the camera being separated using the disparity map.
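A minimal NumPy sketch of this disparity-histogram thresholding is given below; the bin count, minimum peak size, and margin are hypothetical parameters, and the simple peak-picking heuristic stands in for the paper's peak analysis.

```python
import numpy as np

def nearest_person_mask(disparity, n_bins=64, min_count=500, margin=2.0):
    """Keep the pixels belonging to the highest-disparity (nearest) histogram peak."""
    valid = disparity > 0                          # 0 marks pixels with no stereo match
    hist, edges = np.histogram(disparity[valid], bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks = np.where(hist >= min_count)[0]         # bins populated enough to be a person
    if peaks.size == 0:
        return np.zeros_like(valid)
    near_peak = centers[peaks[-1]]                 # peak with the largest disparity
    return valid & (disparity >= near_peak - margin)
```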
Fig. 2. Separating the human of interest using the histogram of the disparity image. (a) left image; (b) right image; (c) disparity; (d) near face; (e) far face and background
Head Location Using Morphological Watersheds. For computational efficiency, we locate people's faces using the shape cue of their silhouettes instead of skin color. As discussed, in the segmented disparity image (silhouette) the head and shoulder are connected, because they have nearly the same distance to the camera. Observing the shape of a person's silhouette, the head and shoulder can be separated from each other using blob analysis. In the disparity segmentation result, the head and shoulder can be thought of as two touching blobs, because the shape of the head and the shape of the shoulder are significantly different. A novel method is proposed in this paper which employs the morphological watershed transform in conjunction with the distance transform to separate the head from the shoulder. The watershed is an efficient tool for detecting touching objects. To minimize the number of valleys found by the watershed transform, one needs to maximize the contrast of the objects of interest. The distance transform determines the shortest distance between each blob pixel and the blob's background, and assigns this distance to the pixel. Here, we use the distance transform to produce a maximum in the head blob and in the shoulder blob, respectively; in addition, the head and the shoulder have touching zones of influence. Applying the watershed transform to the resulting distance map yields a watershed line between the head and the shoulder, and the head is located using this watershed line. We have tested our algorithm on a face database built in our lab, containing 1000 face images of 100 student volunteers with 10 different head poses. On-line experiments are also carried out. An example is shown in Fig. 3. The disparity segmentation of a face image is shown in Fig. 3(a), and the "Chamfer" distance transform of Fig. 3(a) is shown in Fig. 3(b). When the watershed line (Fig. 3(c)) is overlaid on the original segmented disparity image, the head is located; see Fig. 3(d). The results for a sequence of images are shown in Fig. 4. Although the algorithm works well, the watershed line may occasionally not correspond to the neck. Fortunately, this can be detected by checking the aspect ratio of the output elliptical head; the ratio of the head contour is assumed to be 1.2 in our experiment. If the watershed line is found not to correspond to the neck, i.e., the difference between the ratio of the output head and the assumed one is significant, a candidate head is placed below the vertical maximum of the silhouette, in a manner similar to Darrell [6], and is refined in the feature tracking stage.
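The distance-transform-plus-watershed step can be sketched with SciPy and scikit-image as follows; the marker-selection distance of 20 pixels is an illustrative assumption, and the paper's Chamfer distance transform is replaced here by the Euclidean distance transform.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_head_from_shoulder(silhouette):
    """Separate touching head/shoulder blobs in a binary silhouette (a sketch).

    The distance transform produces one maximum inside the head and one inside
    the shoulder; flooding the negated distance map from those maxima yields a
    watershed line roughly at the neck.
    """
    silhouette = silhouette.astype(bool)
    dist = ndi.distance_transform_edt(silhouette)
    peaks = peak_local_max(dist, min_distance=20, labels=silhouette.astype(int))
    markers = np.zeros(dist.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-dist, markers, mask=silhouette)
```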
Fig. 3. Separating the head from the shoulder using morphological watersheds. (a) disparity segmentation; (b) distance transform of (a); (c) watershed line of (b); (d) overlay of (c) on (a)

Fig. 4. Face contour location on a sequence of real face images; first row: disparity; second row: watershed segmentation; third row: elliptical contour
Feature Extraction. Feature extraction is a hard problem even after the face region has been located. In a typical face image, the eyebrows, eyes, mouth and nostrils are darker than the surrounding skin. Based on this observation, a new method, called the "anti-max-median filter", is proposed in this paper to extract the features. The features, including the eyes, eyebrows, nostrils and mouth, are detected by subtracting the max-median filtered image from the original image. An example can be seen in Fig. 5. Although median filtering is normally somewhat slower than convolution, due to the requirement of sorting all the pixels in each neighborhood by gray level, we adopt it because only the pixels inside the face region are filtered; there are also algorithms that speed up the process.
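The sketch below illustrates the idea of this dark-feature extraction; since the exact max-median filter is not specified here, it is approximated by an ordinary median filter with a hypothetical window size, so the result is only an analogue of the paper's anti-max-median filter.

```python
import numpy as np
from scipy import ndimage as ndi

def dark_facial_features(face, size=15, thresh=30):
    """Extract dark facial structures by subtracting a smoothed background estimate."""
    face = face.astype(np.int16)
    background = ndi.median_filter(face, size=size)   # stands in for the max-median filter
    residual = face - background                      # eyes/brows/nostrils/mouth go negative
    return residual < -thresh                         # binary feature map
```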
Fig. 5. Feature extraction. (a) Original image; (b) Max-median filter of (a); (c) Thresholding of (a) - (b)
2.2 Face Tracking
In order to track the face robustly even when the illumination changes, the areas around the detected features are tracked by mean-subtracted normalized correlation, which is accurate and can cope with illumination changes. This computation is affordable because the search region is restricted to the interior of the elliptical face contour. The positions of the features in a frame are predicted from the previous two frames:

    x(t) = x(t-1) + c (x(t-1) - x(t-2))    (1)
where x(-1) = x(0), and c is a parameter. The two centers of the eyebrows and the center of the two nostrils form a triangle, which we call the "feature triangle"; see E1E2N in Fig. 6(a). Using the "feature triangle", the contour ellipse can be refined based on the gradient model developed by Birchfield [3]. The face is modeled as a vertical ellipse in [3]; however, we extend Birchfield's model because we can estimate the symmetry axis of the face from the "feature triangle". The initial contour is modeled as an ellipse (x0, y0, b0, α), where (x0, y0), the initial ellipse center, is set to the center of the "feature triangle", α is the angle between the symmetry axis of the "feature triangle" and the vertical direction (see Fig. 6(a)), and b0 is the initial length of the minor axis of the ellipse. Taking α into account, the face contour corresponds to the ellipse (x, y, b, α) for which the normalized sum of the gradient magnitude along the perimeter of the ellipse is maximum:
    G(x, y, b, α) = max_{x, y, b} ( (1/N) Σ_{i=1}^{N} n(i) · g(x, y, b, α) )    (2)
where n(i) is the unit vector normal to the ellipse at pixel i, g is the intensity gradient at pixel i, and · denotes the dot product. As mentioned in Section 2.1, the aspect ratio of the ellipse is assumed to be 1.2. Using the above constraints, including the direction of the ellipse major axis and the ellipse center, the contour of the face is tracked when the face features are available. For initialization and re-initialization, which are done automatically, the contour of the head is fitted as an ellipse using the edges of the head region. The results of the ellipse refinement for Fig. 5(a) are shown in Fig. 6(b) and (c).
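A sketch of the contour score of Eq. (2) is given below: it samples candidate ellipse points, rotates them by α, and averages the gradient component along the outward normals. The parametrization, the 1.2 axis ratio, and the use of the absolute dot product follow the description above, while the sampling density and helper names are illustrative.

```python
import numpy as np

def ellipse_gradient_score(image, x0, y0, b, alpha, ratio=1.2, n_points=64):
    """Average gradient component along the outward normals of a tilted ellipse."""
    gy, gx = np.gradient(image.astype(float))          # gy: d/drow, gx: d/dcolumn
    a = ratio * b                                       # semi-major axis (vertical before tilt)
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    px, py = b * np.cos(t), a * np.sin(t)               # canonical ellipse points
    nx, ny = np.cos(t) / b, np.sin(t) / a               # outward normals (unnormalized)
    norm = np.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    ca, sa = np.cos(alpha), np.sin(alpha)               # tilt by alpha, shift to (x0, y0)
    rx, ry = ca * px - sa * py + x0, sa * px + ca * py + y0
    rnx, rny = ca * nx - sa * ny, sa * nx + ca * ny
    cols = np.clip(np.round(rx).astype(int), 0, image.shape[1] - 1)
    rows = np.clip(np.round(ry).astype(int), 0, image.shape[0] - 1)
    dots = rnx * gx[rows, cols] + rny * gy[rows, cols]
    return np.abs(dots).mean()                          # (1/N) sum_i |n(i) . g(i)|
```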
Fig. 6. "Feature triangle" and the refinement of the face contour. (a) "feature triangle"; (b) gradient image of Fig. 6(a); (c) refined ellipse
2.3 Pose Estimation
The 3D coordinates of the feature points can be computed from their disparity values. Let the 3D coordinates of the three feature points of the "feature triangle" be E1(Xe1, Ye1, Ze1), E2(Xe2, Ye2, Ze2) and Nc(XN, YN, ZN), respectively, with respect to the camera. The 3D position of the face is then taken as the average of E1, E2 and Nc. The unit normal of the facial plane is

    nf = n1 × n2    (3)
    n1 = N[E2E1]    (4)
    n2 = N[E2Nc]    (5)

where N[·] represents the normalization operator.
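Equations (3)-(5) amount to a cross product of two normalized edge vectors of the feature triangle; a small NumPy sketch (with illustrative names) is:

```python
import numpy as np

def face_position_and_normal(E1, E2, Nc):
    """3D face position and unit facial-plane normal from the feature triangle."""
    E1, E2, Nc = (np.asarray(p, dtype=float) for p in (E1, E2, Nc))
    position = (E1 + E2 + Nc) / 3.0                    # average of the three feature points
    n1 = (E1 - E2) / np.linalg.norm(E1 - E2)           # N[E2E1]
    n2 = (Nc - E2) / np.linalg.norm(Nc - E2)           # N[E2Nc]
    nf = np.cross(n1, n2)                              # Eq. (3)
    return position, nf / np.linalg.norm(nf)
```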
3 Face Recognition
"Fisherface", which applies a Principal Component Analysis (PCA) projection before the discriminant projection to avoid a possible singularity, is used to identify faces in this paper. The Fisherface method maximizes the ratio of between-class scatter to within-class scatter. The optimal projection W_opt is given by

    W_opt^T = W_fisher^T W_pca^T    (6)

    W_pca = arg max_W | W^T S_T W |    (7)

    W_fisher = arg max_W | W^T W_pca^T S_b W_pca W | / | W^T W_pca^T S_w W_pca W |    (8)

where S_T is the total scatter matrix, S_b the between-class scatter matrix and S_w the within-class scatter matrix. The difference in the feature space, which is used for classification, is

    d = || W_opt^T S_testing - W_opt^T S_learning ||    (9)

where S_testing and S_learning are the sample images from the test and learning databases, respectively. Experiments [1] demonstrated that the Fisherface method has error rates lower than those of the Eigenface technique in tests on the Harvard and Yale face databases, and it appears to be the best at simultaneously handling variation in lighting and expression.
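For illustration, a compact NumPy sketch of the Fisherface projection (PCA followed by LDA in the reduced space) is given below; the choice of n_pca, the SVD-based PCA step, and the generalized-eigenproblem solver are implementation assumptions, not details taken from [1] or from this paper.

```python
import numpy as np

def fisherface(X, labels, n_pca):
    """Fisherface projection W_opt = W_pca W_fisher (cf. Eqs. (6)-(8), a sketch).

    X: (n_samples, n_pixels) training images, labels: class label per row.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA step: leading right singular vectors span the total-scatter subspace.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[:n_pca].T                               # (n_pixels, n_pca)
    Y = Xc @ W_pca                                     # samples in the PCA subspace

    classes = np.unique(labels)
    Sw = np.zeros((n_pca, n_pca))
    Sb = np.zeros((n_pca, n_pca))
    for c in classes:
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sw += (Yc - mc).T @ (Yc - mc)                  # within-class scatter
        Sb += len(Yc) * np.outer(mc, mc)               # between-class scatter (data centred)
    # LDA step: generalized eigenproblem Sb w = lambda Sw w, keep top C-1 directions.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][: len(classes) - 1]
    W_fisher = eigvecs[:, order].real
    return W_pca @ W_fisher
```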
4 Performance Results
We have evaluated our algorithm on stereo sequences captured as a person moves in front of a stereo camera. An SVS 'MEGA-D MegaPixel' stereo head is used for the stereo computation. SVS [9] runs at rates of up to 45 Hz at an image resolution of 320×240, and gives for each pixel the disparity between the two images from the stereo head. The calibration and rectification of the cameras are done automatically using the SVS library. The parameters, including the baseline b and the focal length of the lenses f, are obtained from the camera calibration process and are used to compute the 3D position of the head. Face detection, tracking and pose estimation run on a Pentium III 733 MHz PC at about 25 frames per second.

Database. Twenty sequences, five from each subject, were used for learning. Eight near-frontal views (-15° to 15°) are selected from each stereo sequence. Thus this is a 4-class classification problem with 40 image samples for each class. All images are automatically normalized [4] to the same size (40×50), centered at the tracked face contour (ellipse); the size of the detected face is used as an estimate of scale. We consider the face without expression (neutral face) in this paper. For n classes with m samples per class, the samples are stored and numbered in the learning database as follows:
Fig. 7. Some samples from the training database
Some of the samples for the first and fourth subjects in the learning database are shown in Fig. 7.

Recognition Performance. The following tests were conducted to evaluate the stereo-tracking-based face recognition performance.

Generalization. In this case, twelve new sequences, three for each subject, are used for testing. This test gives an indication of the generalization ability of the system. Two near-frontal views from the sequence are selected for testing.

Rejection. In this case, the testing set consists of face images of a subject who does not appear in the learning database.

An average recognition rate of 93% has been obtained on the database mentioned above. For example, the generalization experiment for the 4th subject is shown in Fig. 8(a). We can see that the testing views from the 4th subject have the smallest differences d (defined in Eq. (9)), which are quite small, with respect to all learning views of the 4th subject (samples from the 5×25th to the 5×32th); hence, they are correctly classified. An example of the rejection experiment for the 4th subject is shown in Fig. 8(b). 120 views, 8 views from each learning sequence (5 sequences in total) for each subject, of the first three subjects
are used for training; 24 samples, 8 from each testing sequence (3 sequences in total), of the 4th subject are used for testing. As Fig. 8(b) shows, the testing samples are different from all training samples and are therefore rejected.
Fig. 8. Results of the generalization and rejection experiments. (a) generalization; (b) rejection
5 Conclusion
A face recognition system that is able to detect, track and recognize a person walking toward a stereo camera has been developed. Initialization and re-initialization of the head/face tracking are automatic. A novel approach, which uses the morphological watershed to separate the head and shoulder in the user's silhouette, is shown to be a fast and efficient algorithm. To deal with variations in pose, near-frontal views are selected automatically from the stereo sequence for face recognition; this selection benefits from the real-time stereo head/face tracking subsystem. The face recognition approach is applicable to practical applications such as visitor identification, ATMs and HCI, where faces are relatively large in the images. Current research is addressing further testing of the face recognition approach on larger databases.
References

[1] Belhumeur, P. N., Hespanha, J. P., Kriegman, D. J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, 711-720 (1997).
[2] Bouguet, J.-Y., Girod, B., Gokturk, S. B., Tomasi, C.: Model-based face tracking for view-independent facial expression recognition. Proceedings Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 20-21 May (2002), 272-278.
[3] Birchfield, S.: An elliptical head tracking using intensity gradients and color histograms. Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara, California, 232-237, June (1998).
[4] Brunelli, R., Poggio, T.: Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 10, 1042-1052 (1993).
[5] Choudhury, T., Clarkson, B., Jebara, T., Pentland, A.: Multimodal person recognition using unconstrained audio and video. Proceedings International Conference on Audio- and Video-Based Person Authentication, 176-181 (1999).
[6] Darrell, T., Gordon, G., Harville, M., Woodell, J.: Integrated person tracking using stereo, color and pattern detection. International Journal of Computer Vision, Vol. 37, No. 2, 175-185 (2000).
[7] Etemad, K., Chellappa, R.: Face Recognition using Discriminant Eigenvectors. Proceedings International Conference on Acoustics, Speech and Signal Processing, Atlanta, GA, May (1996), 2148-2151.
[8] Howell, A. J., Buxton, H.: Towards unconstrained face recognition from image sequences. Proceedings Second IEEE International Conference on Automatic Face and Gesture Recognition (1996), 224-229.
[9] Konolige, K.: Small Vision Systems: Hardware and Implementation. Eighth International Symposium on Robotics Research, Hayama, Japan, October (1997). http://www.ai.sri.com/~konolige.
[10] Li, Y., Gong, S., Liddell, H.: Video-based online face recognition using identity surfaces. Second International Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time, 40-46, Vancouver, Canada, July (2001).
[11] Steffens, J., Elagin, E., Hart, N.: PersonSpotter - Fast and robust system for human detection, tracking and recognition. Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition (1998), 516-521.
[12] Wexhsler, H., Kakkad, V., Huang, J., Gutta, S., Chen, V.: Automatic video-based person authentication using RBF network. Proceedings International Conference on Audio- and Video-based Person Authentication, 85-92 (1997).
Face Recognition System Using Accurate and Rapid Estimation of Facial Position and Scale

Takatsugu Hirayama, Yoshio Iwai, and Masahiko Yachida

Graduate School of Engineering Science, Osaka University
1-3, Machikaneyama, Toyonaka, Osaka 560-8531, Japan
[email protected] {iwai,yachida}@sys.es.osaka-u.ac.jp
Abstract. Face recognition technology needs to be robust for arbitrary facial appearances. In this paper, we propose a face recognition system that is efficient and robust for facial scale variations. In the process of face recognition, facial position detection incurs the highest computational cost. To estimate both facial position and scale, there needs to be a trade-off between accuracy and efficiency. To resolve the trade-off, we propose a method that estimates facial position in parallel with facial scale. We apply the method to our proposed system and demonstrate the advantages of the proposed system through face recognition experiments. The proposed system is more efficient than any other system and can maintain high face recognition accuracy for facial scale variations.
1 Introduction
A security system using face recognition technology is suited to various high-security situations. To develop more versatile security systems, many researchers have proposed methods for recognizing a face in an image. A face's appearance changes according to facial pose, expression, lighting, and age. In initial work, some researchers proposed methods such as template matching [1], "Eigenface" algorithms [2], and "Fisherface" algorithms [3]. These methods are effective for recognizing a face normalized for position, scale, direction, and expression under the same illumination. However, security systems using these methods constrain users to remain stationary when the systems record and match their facial images. It is difficult for users to submit to the above constraints. To construct a more useful security system, it is desirable to develop a face recognition method that relaxes those constraints. In recent work, researchers have mainly proposed face recognition methods that are robust for various facial appearances. A method using a face graph has the highest recognition performance of those methods [4]. The nodes that construct the face graph are positioned according to facial feature points (FFPs). Typically, methods using face graphs consist of facial position detection and FFP extraction. In addition, a method that matches a face model graph with a facial image by using raster scanning [5, 6, 7, 8, 9] has been proposed to detect the facial position.
Fig. 1. Face representation. Gray nodes on contour, black nodes on facial organ.
The elastic matching method [5, 6] and our flexible feature matching method [8, 9] have been proposed to extract FFPs. These methods can robustly recognize faces with small variations. However, as large variations decrease the accuracy of matching, these methods require the use of multiple model graphs to maintain accuracy, a requirement that imposes a high computational cost. Krüger et al. regard this high cost as a significant problem [7]. To further improve the accuracy of face recognition, we take into account facial scale and propose a face recognition system that is efficient and robust for facial scale variations. It is vital to resolve the trade-off between accuracy and efficiency when estimating facial position and scale. Generally, if the facial position and scale are estimated by using a geometrical model [5, 6, 7, 8, 9, 10], the scale cannot be estimated unless the position is estimated accurately, and vice versa. It is difficult to solve this type of dependency when the computational capacity is limited. We aim to solve the above dependency. In this work, we propose a method that estimates facial position in parallel with facial scale. The proposed method has two advantages: (1) the method can accurately detect a face with scale variations; (2) the method can detect a face while incurring only a small computational cost, because it uses only an average face graph and our scale conversion method [11]. The scale conversion estimates facial scale, and does so more efficiently than other scaling methods. We implement our face recognition system based on the proposed method and demonstrate the advantages of the system through face recognition experiments.
2 Face Recognition System
A face is represented as a graph where 30 nodes are placed on the FFPs (Fig. 1). Each node extracts the local feature at the corresponding location. Each node also has a Euclidean distance to the other nodes. The proposed system is summarized in Fig. 2. The system consists of facial position detection (the proposed method), FFP extraction (the flexible feature matching method), and face identification procedures. Model images are registered in the face model database. The size of the faces in these images is nearly equal across samples, and the model graphs, an average model graph and a scale dictionary are created from those images. The average model graph is generated by taking the average information of the local features and the Euclidean
Fig. 2. Proposed system. (a) Overview. (b) Facial position detection procedure.
distances of the available model graphs. The scale dictionary is used in the scale conversion method.

2.1 Facial Position Detection
Overview. Facial position is estimated in parallel with facial scale. The proposed method is summarized in Fig. 2(b). In this method, a face is represented as a graph where 20 nodes are placed on the facial organs (black nodes in Fig. 1). The method is composed of four states: position estimation by global scanning, position estimation by local scanning, scale conversion, and verification. Facial position and scale are estimated by moving between these states. Although position estimation by global scanning has a low computational cost, it does not provide stable estimation; however, we regard the positions estimated in this state as central reference positions for a more detailed search of a face. Position estimation by local scanning performs a detailed search around the central positions. The local scanning state has a higher computational cost than the global scanning state, but it can perform estimation more accurately. The scale conversion state estimates the facial scale. The verification state checks the validity of a graph with the estimated position and scale. In the remainder of Section 2.1, we explain in detail the four states and the transitions between these states.

Position Estimation by Global Scanning. Position estimation by global scanning is based on the concept of dynamic link architecture [12, 13, 14]. First, the method places n_G × n_G gridded sampling points (at intervals of d_G pixels) on the whole image, and extracts the local features at these points. We can then determine the similarity between the local feature D_l of sampling point l and the local feature V_m of node m of the average face graph by the following equation:

    φ_ml = 1 - || V_m - D_l ||,   l = 1, ..., n_G^2.    (1)
Fig. 3. Position estimation. (a) Global scanning. (b) Local scanning.
We call this similarity the local feature similarity, φ_ml. We can estimate the facial position by searching for the sampling point l having the best similarity (Fig. 3(a)), and by placing the average graph so that node m is fixed at point l. This sequence of processes is applied to every node (m = 1, ..., 20) of the average graph; therefore, the method estimates at most m facial positions.

Position Estimation by Local Scanning. Position estimation by local scanning is also based on the dynamic link architecture concept. This state differs from position estimation by global scanning in the size of the search area and in the number of local features used to calculate the local feature similarity. First, in this state, the method places n_L × n_L (< n_G × n_G) gridded sampling points (at intervals of d_L × s_opt pixels) around the facial position estimated in the previous state. Here, s_opt is the facial scale estimated by the scale conversion state (described in a later section). In the initial state, d_L × s_opt × n_L is nearly equal to the facial scale, and d_L decreases in proportion to the number of iterations. Equation (2) is calculated for node m of the average graph:

    φ_ml = 1 - || V_m - D_l(s_opt) ||,   l = 1, ..., n_L^2.    (2)
This equation differs from Eqn. (1) in that the local features are normalized for facial scale. Next, the method searches for the sampling point l with the best φ_ml, and places a shadow graph so that its node m is fixed at point l. The shadow graph is a graph with the position (x, y) and the scale (s_opt) estimated in the previous state. Then, n_D local features, Z_n, are extracted from the input image at the white nodes of the graph in Fig. 3(b). The sum of local feature similarities is calculated by

    ψ = φ_ml + Σ_{n=1}^{n_D} [ 1 - || V_n - Z_n(s_opt) || ].    (3)
The sequence of these processes is applied to every node (m = 1, . . . , 20) of the average graph. We regard the position having the maximum summation as the estimation result of this state.
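The two scanning states reduce to the same grid-search primitive around a node; the following sketch shows that primitive, where extract_feature is a hypothetical stand-in for the Gabor-jet extraction (with any scale normalization folded into it) and the grid geometry is passed in explicitly.

```python
import numpy as np

def best_grid_point(node_feature, extract_feature, center, n, step):
    """Scan an n x n grid and return the point maximizing phi = 1 - ||V_m - D_l||."""
    cx, cy = center
    offsets = (np.arange(n) - n // 2) * step
    best_phi, best_xy = -np.inf, center
    for dy in offsets:
        for dx in offsets:
            d = extract_feature(cx + dx, cy + dy)      # local feature D_l at the grid point
            phi = 1.0 - np.linalg.norm(np.asarray(node_feature) - np.asarray(d))
            if phi > best_phi:
                best_phi, best_xy = phi, (cx + dx, cy + dy)
    return best_xy, best_phi
```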
Scale Conversion. Details of the scale conversion method are described in [11]. First, we explain how to make the scale dictionary. N_t images (model images) are used for training. Graphs are generated by mapping elastic average graphs G_ave with various scales (elastic ratio s_T) to the facial position (x_true, y_true) of each training image. The suitability of the graphs can be measured by the global feature similarity π (details of the similarity are described in a later section). The similarities are registered as training data in a database called the scale dictionary. The scale conversion is performed by using the scale dictionary. This process begins by deriving r_i for the shadow graph G according to the following equation:

    r_i = arg min_{s_T} [ π(G(x, y, s_opt)) - π(G_ave(x_true, y_true, s_T)) ]
          for each training image, i = 1, ..., N_t,  T = 1, ..., T.    (4)
The second term, π(G_ave(x_true, y_true, s_T)), in the brackets of Eqn. (4) is the training data in the scale dictionary. After r_i is derived, a probability distribution of r_i is calculated from the statistics of r_i. We estimate the facial scale according to this distribution, i.e., the scale ratio t_new is the value of r_i with the highest probability. The elastic ratio s_opt is updated by t_new (s_opt ← s_opt × s_new, with s_new = 1/t_new). A new shadow graph with the estimated scale is generated using s_opt, and s_opt is also used to normalize the local features for the facial scale.

Verification. We verify the estimated position and scale by using the global feature similarity:
    π(G(x, y, s_opt)) = (1/20) · Σ_{m=1}^{20} V_m W_m(G(x, y, s_opt)) / sqrt( Σ_{m=1}^{20} V_m^2 · Σ_{m=1}^{20} W_m^2(G(x, y, s_opt)) ).    (5)
Here, G(x, y, s_opt) is the shadow graph with estimated position (x, y) and scale s_opt, and W_m represents the local features of the shadow graph.

State Transition. In this section, we describe the transitions between the above four states. As shown in Fig. 2(b), the proposed method is composed of two stages: the first stage alternately performs position estimation by global scanning and scale conversion via the verification state. The second stage performs position estimation by local scanning instead of the global scanning state employed in the first stage. In each stage, the transition to the scale conversion state arises when the global feature similarity is close to the training data of the scale dictionary. The transition to the end state arises when the similarity exceeds a certain threshold value, or when the number of transitions exceeds a certain threshold. The proposed method updates the estimated position and scale by using the beam search method. During the search, the method stores the shadow
graph in a priority queue, and selects c shadow graphs according to priority. The selected graphs are used to update the estimation at the next state. We regard the global feature similarity as the priority. Although the computational cost of the beam search depends on c, each estimation update from the c shadow graphs can be performed independently and can be distributed. Therefore, we consider that such distributed processing can resolve the trade-off between accuracy and efficiency of the estimation.
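A small sketch of the global feature similarity used for verification and as the beam-search priority is shown below; it follows the normalized-correlation reading of Eq. (5) adopted above (averaged over the 20 nodes), which is one plausible interpretation of the formula, and the array layout is an assumption.

```python
import numpy as np

def global_feature_similarity(V, W):
    """Eq. (5): normalized correlation of graph features, averaged over 20 nodes.

    V, W: arrays of shape (20, d) holding the average-graph features and the
    features extracted at the shadow-graph nodes, respectively.
    """
    num = float(np.sum(V * W))
    den = float(np.sqrt(np.sum(V ** 2) * np.sum(W ** 2)))
    return num / (20.0 * den)
```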
2.2 Facial Feature Point Extraction
After the facial position detection procedure, we have rough estimates of the FFP positions, which are used as initial information for finding the exact positions. We apply the flexible feature matching method to the FFP extraction procedure. Details of the method are described in [8, 9]. First, the shadow graph is compared with every model graph in the database. The network response O_id is evaluated for every model according to the following equation:

    O_id(G) = Σ_{m∈G} ( || X_m - W_m × s_opt || + u Σ_{n∈G-m} | x_mn - w_mn / s_opt | ).    (6)

Here, the parameter G in Eqn. (6) denotes the node set of a graph. X_m represents the local features of the model graph, x_mn is the Euclidean distance on the model graph, and w_mn is the Euclidean distance on the shadow graph G. The local features and the Euclidean distances in Eqn. (6) are normalized by s_opt for the facial scale. The parameter u is a multiplicative weight. The model graph with the minimum network response is used as the reference graph in the following iteration process. By activating neighborhood connections around the nodes of the shadow graph, a candidate for a new shadow graph is created and the network response O_ffp is computed:

    O_ffp(G) = Σ_{m∈G} ( || R_m - W_m × s_opt || + u Σ_{n∈G-m} | r_mn - w_mn / s_opt | ).    (7)
If the network response O_ffp of the candidate graph is smaller than that of the shadow graph, the candidate graph becomes the new shadow graph. The flexible feature matching procedure repeats this process by employing a random diffusion process. Here, R_m in Eqn. (7) represents the local features of the reference graph, and r_mn is the Euclidean distance between nodes m and n on the reference graph.

2.3 Face Identification
The most valid shadow graph can be considered an accurate representation of the face in the input image. If the network response O_ffp of the shadow graph is smaller than a certain threshold, the system accepts the identification, i.e., the model corresponding to the reference graph is judged to be of the person matching the input face.
3 Experimental Results
We conducted face recognition experiments to verify whether our system is effective. The experiments were performed on 100 images (one image per person, 256 × 256 pixels, 8-bit gray scale) from the Purdue University AR face database [15]. The face in each image is an expressionless frontal pose. A subset of the images includes faces with eyeglasses and illumination variations. Of these images, 50 were registered as models in the database; the others were unregistered. The FFPs of the models were extracted manually. Test images consisted of the following 300 images: the above 100 original images (test set 100%) and 200 images that were scaled to 80% and 120% of their original size (test sets 80% and 120%). The images were scaled by the bi-cubic interpolation of Adobe Photoshop 7.0 on MacOS X. We used Gabor features as the local features. The Gabor features are extracted by using the Gabor wavelet transformation [16], which convolves an image with Gabor wavelets and is robust to variations of illumination [6]. The Gabor wavelet transformation was performed using our 16 Gabor wavelets (2 scales and 8 orientations) [11]. The following parameters were used in the facial position detection procedure: n_G = 12, d_G = 20, n_L = 5, d_L = 30, n_D = 2, N_t = 20, T = 11, s_T = 0.5 to 1.5 in steps of 0.1, and c = 10. Parameter d_G was nearly equal to the scale of the larger Gabor wavelet. For n_D, local features were extracted at the eyes. Distributed processing, which could be used to speed up the beam search, was not introduced into our proposed system. We compared the performance of the proposed system with that of the previous system described in [11]. The previous system detects the facial position by raster scanning at 5-pixel intervals. After the facial position detection procedure, the previous system estimates the facial scale by using the scale conversion method.

3.1 Performance of Facial Position Detection
Examples of facial position detection results are shown in Fig. 4. These results indicate that the proposed system can accurately estimate facial position and scale from images that show variations of facial scale, occlusion of the eyes, and various illumination conditions. We compare the performance of the proposed system with that of the previous system in Table 1, which reports the average error between the estimated position and the ground truth, the average of the estimated scale, and the average computational cost, measured as the number of local feature extractions. We confirmed that the estimation accuracy of the proposed system is more stable than that of the previous system; therefore, we consider that the proposed system can accurately detect a face under variations of scale. The biggest advantage of the proposed system is its low computational cost: about 1/100th that of the previous system. By introducing distributed processing into the beam-search iteration, we could further improve the computational speed of the proposed system.
Fig. 4. Results of facial position detection
Table 1. Performance of the facial position detection procedure

    method     test set   position error (pixel)   estimated scale (%)   cost
    proposed   100%       4.66                     100+1.4               628
               80%        4.09                     80+7.4                1141
               120%       7.97                     120+7.7               714
    previous   100%       3.14                     100+1.6               52100
               80%        7.52                     80+4.2                52100
               120%       19.5                     120+10                52100
3.2 Face Identification Accuracy
The accuracy of face identification is represented by the ROC curves in Fig. 5. Each curve connects the identification rates obtained with different thresholds used to judge whether the system accepts the identification result. The identification rate of an unregistered person (the vertical axis in the figure) is the rate at which the system rejects the identification result. The proposed system's accuracy is higher than that of the previous system and remains good under variations of facial scale. We confirmed that the accuracy of facial position and scale estimation affects that of face identification.
4 Conclusions
We proposed a face recognition system that is sufficiently efficient and robust for facial scale variations. The system efficiently estimates facial position in parallel with facial scale. We verified the advantages of the system through face recognition experiments. The results demonstrated that the system can accurately estimate position and scale of facial images with eyeglasses and illumination variations, and that estimation using our system is faster than an orthodox facial position detection system. The results also demonstrated that the system can maintain high person identification accuracy for facial scale variations. In future work, we will improve on the proposed system in order to accurately detect a face with various poses and expressions.
Fig. 5. Identification rate. (a) Proposed system. (b) Previous system. Each plot shows the identification rate of registered persons (%) on the horizontal axis against the identification rate of unregistered persons (%) on the vertical axis, with curves for the 100%, 80% and 120% test sets
References

[1] X. Song, C. Lee, G. Xu and S. Tsuji: Extracting facial features with partial feature template, Proceedings of the Asian Conference on Computer Vision, pp. 751-754 (1994).
[2] M. Turk and A. Pentland: Eigenface for recognition, Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86 (1991).
[3] P. Belhumeur, J. Hespanha and D. Kriegman: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 711-720 (1997).
[4] O. Ayinde and Y.-H. Yang: Face Recognition Approach Based on Rank Correlation of Gabor-Filtered Images, Pattern Recognition, Vol. 35, pp. 1275-1289 (2002).
[5] L. Wiskott, J. M. Fellous, N. Krüger and C. von der Malsburg: Face recognition and gender determination, Proceedings of the International Workshop on Automatic Face and Gesture Recognition, pp. 92-97 (1995).
[6] L. Wiskott, J. M. Fellous, N. Krüger and C. von der Malsburg: Face recognition by Elastic Bunch Graph Matching, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 775-779 (1997).
[7] N. Krüger, M. Pötzsch and C. von der Malsburg: Determination of face position and pose with a learned representation based on labelled graphs, Image and Vision Computing, Vol. 15, pp. 665-673 (1997).
[8] D. Pramadihanto, Y. Iwai and M. Yachida: A flexible feature matching for automatic face and facial points detection, Proceedings of the 14th International Conference on Pattern Recognition, pp. 324-329 (1998).
[9] D. Pramadihanto, Y. Iwai and M. Yachida: Integrated Person Identification and Expression Recognition from Facial Images, IEICE Trans. on Information and Systems, Vol. E84-D, No. 7, pp. 856-866 (2001).
[10] A. Lanitis, C. J. Taylor and T. F. Cootes: Automatic Interpretation and Coding of Face Images Using Flexible Models, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 743-756 (1997).
[11] T. Hirayama, Y. Iwai and M. Yachida: Face Recognition Based on Efficient Facial Scale Estimation, Proceedings of the Second International Workshop on Articulated Motion and Deformable Objects (AMDO 2002), pp. 201-212 (2002).
[12] D. Pramadihanto, H. Wu and M. Yachida: Face Identification under Varying Pose Using a Single Example View, The Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J80-D-II, No. 8, pp. 2232-2238 (1997).
[13] W. Konen, T. Maurer and C. von der Malsburg: A fast dynamic link matching algorithm for invariant pattern recognition, Neural Networks, Vol. 7, pp. 1019-1030 (1994).
[14] R. P. Würtz: Object Recognition Robust Under Translation, Deformation, and Changes in Background, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 769-774 (1997).
[15] A. M. Martinez and R. Benavente: The AR face database, CVC Technical Report 24 (1998).
[16] J. G. Daugman: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, Journal of the Optical Society of America A, Vol. 2, pp. 1160-1169 (1985).
Fingerprint Enhancement Using Oriented Diffusion Filter

Jiangang Cheng, Jie Tian, Hong Chen, Qun Ren, Xin Yang

Biometrics Research Group, Institute of Automation, Chinese Academy of Sciences
P.O. Box 2728, Beijing, 100080, China
[email protected], [email protected]
http://www.fingerpass.net
Abstract. Fingerprint enhancement is a critical step in a fingerprint identification system. Recently, anisotropic nonlinear diffusion filters have been applied to fingerprint preprocessing. Impressive results are the main reason for using nonlinear diffusion filtering in image processing; poor efficiency, especially the computational load, is the main reason for not using it. In order to improve the efficiency, a novel piecewise nonlinear diffusion for fingerprint enhancement is presented. We show that an anisotropic diffusion equation with a diffusion tensor that depends continuously on the gradient is not necessary for smoothing the fingerprint. We simplify the anisotropic nonlinear diffusion in order to satisfy the requirements of a real-time fingerprint recognition system. According to the local character of the fingerprint, the diffusion filter is steered by the orientation of the ridges. Experimental results illustrate that our enhancement algorithm can satisfy the requirements of an AFIS.
1 Introduction

The fingerprint is a typical flow-like pattern. In an Automatic Fingerprint Identification System (AFIS), fingerprint enhancement plays a key role. Many automatic fingerprint matching algorithms depend on the comparison of minutiae, or Galton's characteristics. Reliably extracting minutiae from the input fingerprint images is a difficult task. There are many approaches to enhancing fingerprint images in order to avoid creating fake minutiae and missing genuine minutiae. Among these approaches, the most flexible one consists of estimating the orientation by means of a structure tensor [1], [2], [3]. These ideas can be further developed into an on-line AFIS.

This paper is supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 60225008, the Special Project of National Grand Fundamental Research 973 Program of China under Grant No. 2002CCA03900, the National High Technology Development Program of China under Grant No. 2002AA234051, and the National Natural Science Foundation of China under Grant Nos. 60172057, 69931010, 60071002, 30270403, 60072007.
Corresponding author: Jie Tian; Telephone: 8610-62532105; Fax: 8610-62527995.
Researchers in the computer vision field have been interested in analyzing oriented flow-like patterns for more than ten years [1]. Several nonlinear diffusion filters have been built to enhance flow-like images. Impressive results are the main reason for using nonlinear diffusion filtering in image processing: unlike linear diffusion filtering, edges remain well-localized and can even be enhanced in a nonlinear diffusion system. Poor efficiency is the main reason for not using nonlinear diffusion filtering. In this paper, we adopt the idea of nonlinear diffusion to enhance the oriented flow-like pattern, in a way that overcomes the efficiency disadvantage of nonlinear diffusion. Section 2 introduces the related work on nonlinear diffusion developed by Weickert in [1], [2], [5]. In Section 3 we give an explicit solution of the nonlinear diffusion. Section 4 presents our method to enhance the fingerprint, and in Section 5 experimental results illustrate the performance of our method. We conclude in the final section.
2 Related Work

Define the fingerprint image as (I_ij)_{N×M}. An anisotropic diffusion filter with a diffusion tensor evolves the initial image I under an evolution equation of type:
    ∂u/∂t = div(D ∇u),   u(i, j; 0) = I_ij.

P(Fc | X) / P(F̄c | X) > δ, where P(Fc | X) is the probability that the claimed fingerprint Fc corresponds to the input fingerprint X, and P(F̄c | X) = Σ_{i≠c} P(Fi | X). Assuming uniform probabilities P(Fi) for all fingerprints, applying a Bayesian approach, and computing the log-likelihood ratio, the above criterion becomes:
    log P(X | Fc) - log P(X | F̄c) > ∆    (1)
where P(X | Fc) is the conditional probability of X given Fc. A positive value of ∆ = log δ indicates a valid claim, and a negative value indicates an impostor attempt.
The first term on the left side of equation (1) is directly related to the scores provided by the verification system, and the second term is denoted the normalization factor. One approach to estimating this factor is:

    log P(X | F̄c) ≈ log Σ_{Fi ∈ C, i≠c} P(X | Fi)    (2)
where C is a set of fingerprints, denoted the cohort set, selected to calculate the normalization factor. This set can be formed by fingerprints taken from the global population, or by the fingerprints which best represent the population near the claimed fingerprint. If we assume that the sum in (2) is dominated by the nearest impostor fingerprint, the estimation of the normalization factor reduces to:

    log P(X | F̄c) ≈ max_{Fi ∈ C, i≠c} [ log P(X | Fi) ]    (3)
We have used this latter approach, denoted best reference normalization, in our system evaluation, to better differentiate clients and impostors. The cohort set has been reduced to the maximum score (best reference) attained by the input fingerprint X against the set formed by the whole global fingerprint population except the claimed fingerprint. We have combined both procedures, multiple references and score normalization, in order to evaluate verification performance. The two approaches are depicted in Fig. 7: curve (h) is obtained by normalizing the scores of curve (g); in the case of curve (i), once the normalization of the scores in curves (a), (b) and (c) is applied, the maximum normalized score is selected. In conclusion, a notable performance improvement is achieved in both cases of score normalization, since the best 2.3% EER, achieved in Fig. 6(g), is now significantly improved to 1.0% in curve (h) and 0.9% in curve (i).
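In log-score terms, best reference normalization amounts to subtracting the top impostor score from the claimed-identity score, as in the hedged sketch below; the dictionary layout and the threshold argument are illustrative assumptions.

```python
def best_reference_normalized_score(scores, claimed, delta=0.0):
    """Best-reference normalization: claimed log-score minus the best impostor log-score.

    scores: mapping identity -> log-likelihood (or matcher) score of the input
    fingerprint against that identity's reference(s); claimed: claimed identity.
    Returns (normalized_score, accept) where accept follows Eq. (1) with threshold delta.
    """
    cohort_best = max(s for ident, s in scores.items() if ident != claimed)
    normalized = scores[claimed] - cohort_best
    return normalized, normalized > delta
```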
Fig. 6. DET curves. (a), (b), (c): no position control cases; (d), (e), (f): position controlled cases; (g): MAX-Rule case

Fig. 7. DET curves for Best Reference Score Normalization. Curve (h): normalization of scores in curve (g). Curve (i): normalization of scores in curves (a), (b), (c) and selection of the maximum result
5 Conclusions
The experiments carried out in this paper have made it possible to evaluate the performance of the proposed minutiae-based automatic fingerprint verification system over a large population from the MCYT_Fingerprint database, in which the variability factors of the acquisition process with an optical device are sufficiently represented. We have analysed the influence of controlling the position of the finger over the sensor screen during acquisition, and we conclude that the system EER decreases significantly if we increase: i) the control level of both the reference pattern and the test patterns (the initial 8.0% EER for baseline scores improves to 3.3% EER when a high control level is applied to tests and reference); ii) the number of reference patterns, if the possible variability due to the finger position during the acquisition process is considered (2.3% EER has been attained with multiple references). Experiments with the best reference normalization technique demonstrate a great improvement in discriminating clients and impostors, resulting in an EER of less than 1%. The results achieved with multiple-reference strategies and best reference normalization deserve to be considered as highly efficient alternatives for automatic fingerprint verification systems. We emphasize that these results are attained on an unsupervised database, in which only position, but not image quality, was controlled. In fact, we can state that about 5% of the acquired images are of very bad quality, 20% of low quality, 55% of medium quality, and 20% of high quality. Quality supervision and labeling of MCYT_Fingerprint is currently being carried out in order to provide, in the near future, results as a function of the quality of the stored fingerprints.
Iris-Based Personal Authentication Using a Normalized Directional Energy Feature

Chul-Hyun Park (1), Joon-Jae Lee (2), Mark J. T. Smith (3), and Kil-Houm Park (1)

(1) School of Electrical Engineering and Computer Science, Kyungpook National University, Daegu, Korea
    [email protected], [email protected]
(2) Division of Internet Engineering, Dongseo University, Busan, Korea
    [email protected]
(3) School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907-2035, USA
    [email protected]
[email protected] Abstract. In iris-based biometric systems, iris images acquired by a video or CCD camera generally have a lot of contrast or brightness differences in an image or between images due to the different extent of camera focusing or illumination conditions. To extract the discriminatory iris features robust to such differences, this paper presents a new normalization scheme of the directional energy that will be used as the iris feature. The proposed method first performs band-pass filtering on the input iris image to reduce the effect of high frequency noise and DC energy difference, then decomposes the image into several directional subband outputs using a directional filter bank (DFB). The directional energy values of the iris pattern are extracted from the decomposed subband outputs on a block-by-block basis and then are normalized with respect to the total block energy. Matching is performed by finding the Euclidean distance between the input and enrolled template feature vector. Experimental results show that the proposed method is robust to changes in illumination or contrast and the overall feature extraction and matching procedures are effective.
1 Introduction
Active studies have recently been performed in the field of biometrics, which attempts to identify individuals based on their physiological or behavioral characteristics. Biometric features are attractive because they have a low potential for loss, theft, and forgery, and personal authentication and identification systems that use these features have achieved high accuracy. Among biometric technologies, iris-based systems are particularly advantageous in terms of fidelity, since iris images include rich discriminatory patterns such as the arching ligament, crypts, ridges, and a zigzag collarette (refer to Fig. 3) [1]. Furthermore, the iris patterns are immutable over time and well protected by the eyelids. The iris is the annular and pigmented part that surrounds the pupil of the eye, and is made up of muscular tissue that regulates the diameter of the pupil.
The iris has diverse and distinctive patterns, but exploiting its geometric features directly is not as easy as in fingerprints or faces. Thus, most existing iris recognition methods extract either multiresolution or directional features of the iris pattern using Gabor filters or wavelet transforms [2], [3], [4], [5]. Since most of these methods use two- or four-level quantized values of the Gabor-filtered or wavelet-transformed image as a feature vector, they are robust to contrast or illumination changes. However, they do not utilize a significant component of the rich discriminatory information available in the iris pattern. Therefore, in order to extract distinctive iris features that are robust to contrast and illumination differences within an image or between images, this paper presents a new normalization method for the directional energy that is used as the iris feature. The proposed method first detects the iris area in the input image, then establishes a region of interest (ROI) for feature extraction. The extracted ROI is converted to polar coordinates, after which band-pass filtering is performed to reduce the effect of noisy components. The ROI is then decomposed into 8 directional subband outputs using a directional filter bank (DFB) [6], [7]. The normalized directional energy features are extracted from the directional subband outputs. To account for rotational offset, the proposed method generates additional feature vectors, in which various rotations are considered, by simply shifting the decomposed directional subband images and recalculating the feature values. The rotational alignment is achieved by finding the minimum Euclidean distance between the corresponding feature vectors. Experimental results demonstrate that the normalized directional energy feature is not only robust to brightness or contrast differences within an image or between images, but also represents the rich, distinctive iris features very well.
2 Iris Localization and ROI Extraction
Once an eye image is acquired by a video or CCD camera, iris localization is performed first. Since an iris is circular and much darker than the neighboring (white) sclera, the iris region can be easily detected in the input image. The outer and inner boundaries are detected as shown in Fig. 2 using a circular edge detector [1] and as a result, the center coordinates and radii of the outer and inner boundaries are obtained. The detected iris region is then converted from Cartesian coordinates to polar coordinates to facilitate feature extraction and rotational alignment. As the center of the iris generally differs from that of the pupil, the iris region is normalized to one with the fixed width and height (2N × 8N ) as illustrated in Fig. 2(b) for the same portions of the irises to be matched. In this work, N was set to 72. In most cases, the iris images contain a lot of noisy components like glint due to the illuminator, and pattern occlusions by the eyelashes or eyelids. Thus, the proposed method excludes the upper and lower 90◦ cones and the outer half of the iris region, where occlusions by the eyelids commonly occur or glint appears due to the illuminator (see Fig. 2), and exploits the inner half of the left and right 90◦ cones of the iris region as a ROI for feature extraction.
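The Cartesian-to-polar normalization described above can be summarized in code. The following sketch, in Python with NumPy, assumes that the circular edge detector has already produced the centers and radii of the inner (pupil) and outer (iris) boundaries; the function name, the nearest-neighbour sampling, the linear interpolation between the two non-concentric circles, and the default output size are illustrative choices of this sketch, not the authors' implementation.

```python
import numpy as np

def unwrap_iris(image, pupil_xy, pupil_r, iris_xy, iris_r, height=144, width=576):
    """Map the annular iris region to a fixed-size polar strip (2N x 8N, N = 72)."""
    h, w = image.shape
    out = np.zeros((height, width), dtype=image.dtype)
    thetas = np.linspace(0.0, 2.0 * np.pi, width, endpoint=False)
    radii = np.linspace(0.0, 1.0, height)
    for j, theta in enumerate(thetas):
        # Because the pupil and iris circles are generally not concentric,
        # sample along the segment joining the two boundaries at this angle.
        x_in = pupil_xy[0] + pupil_r * np.cos(theta)
        y_in = pupil_xy[1] + pupil_r * np.sin(theta)
        x_out = iris_xy[0] + iris_r * np.cos(theta)
        y_out = iris_xy[1] + iris_r * np.sin(theta)
        for i, r in enumerate(radii):
            x = int(round((1 - r) * x_in + r * x_out))
            y = int(round((1 - r) * y_in + r * y_out))
            if 0 <= x < w and 0 <= y < h:
                out[i, j] = image[y, x]
    return out
```

The ROI would then be taken from the inner half of the left and right 90° cones of this strip, following the exclusion rule above.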
Fig. 1. Sample image and detected inner and outer boundaries of iris

Fig. 2. Illustration of the Cartesian to polar coordinates conversion and region of interest extraction. (a) The original iris region. (b) The region converted to polar coordinates (2N × 8N) and partitioned into regions of interest (R1, R2, R3, R4)
3 Feature Extraction
Iris patterns include many linear features that can be considered as a combination of directional linear pattern components. As such, the unique characteristics of an iris pattern can be effectively represented by a feature vector constructed by extracting the linear components of an iris according to directionality. In this sense, a DFB is suitable for extracting iris features, since it can accurately decompose an image into directional subband outputs. The proposed method first decomposes the extracted ROIs into 8 directional subband outputs using an 8-band DFB and extracts the iris features from the decomposed directional subband outputs.

3.1 Directional Decomposition
In the proposed method, the ROI images R1, R2, R3, and R4 (see Fig. 2) are each decomposed into 8 directional subband outputs using an 8-band DFB. Since the DFB partitions the two-dimensional spectrum of an image into wedge-shaped directional passband regions accurately and efficiently, as shown in Fig. 3(a), each directional component or feature can be captured effectively in its subband image. The subband images are rectangular in shape because of the down-sampling operators involved in the DFB. For an N × N image, the first half of the 2^n subband outputs is N/2^(n-1) × N/2 in size, while the other half is N/2 × N/2^(n-1). Fig. 4 shows an example of the ROI images and the directional subband images decomposed by the 8-band DFB. The post-sampling DFB method was employed in this work to achieve geometrically accurate subband representations [7].

Fig. 3. Frequency partition map of (a) input and (b) 8 subband outputs

Fig. 4. Sample ROI images and their decomposed outputs: (a)-(d) show ROI images R1, R2, R3, and R4; (e)-(h) show decomposed subband outputs of (a)-(d)

3.2 Extraction of Normalized Directional Energy Feature
One of the intuitive features that can be extracted from the directionally decomposed subband images is the directional energy. Directional energy can be a good feature when the illumination conditions and the extent of camera focusing are similar; unfortunately, however, most iris images differ considerably in brightness or contrast. Since the iris images are acquired by a video camera under various internal and external illumination conditions, they exhibit contrast and brightness differences within an image and between images. Therefore, to extract an iris feature that represents the directional diversity of the iris pattern well and is at the same time robust to brightness or contrast changes, image normalization is necessary, yet this is not easy for iris images in which such brightness or contrast differences exist. To solve this problem, the proposed method employs the ratio of the directional energy in each block, which we call the normalized directional energy, instead of the directional energy itself.
Fig. 5. Procedure for feature extraction (ROI image Rn → BPF → directional decomposition → feature values)
In the proposed method, the high-frequency components above π/6 and the DC component are removed by the band-pass filter, to reduce the effect of noise and to avoid the non-uniform distribution of DC energy that occurs in a real-world DFB due to finite precision. Thereafter the normalized directional energy features are extracted from the band-pass filtered image.

Let e_kθ^(n) denote the energy value of subband θ (which we call S_kθ^(n)); more specifically, S_kθ^(n) corresponds to the k-th block B_k^(n) of the n-th ROI image R_n. Let ê_kθ^(n) be the normalized energy value of e_kθ^(n), and let c_kθ^(n)(x, y) be the coefficient value at pixel (x, y) in subband S_kθ^(n). Now, for all n ∈ {0, 1, 2, 3}, k ∈ {0, 1, 2, ..., 35}, and θ ∈ {0, 1, 2, ..., 7}, the feature value v_kθ^(n) is given as

    v_kθ^(n) = nint( v_max × ê_kθ^(n) )                                   (1)

where

    ê_kθ^(n) = e_kθ^(n) / Σ_{θ=0}^{7} e_kθ^(n) ,                          (2)

    e_kθ^(n) = Σ_{(x,y) ∈ S_kθ^(n)} | c_kθ^(n)(x, y) − c̄_kθ^(n) | ,       (3)

nint(x) is the function that returns the nearest integer to x, c̄_kθ^(n) is the mean of the coefficient values c_kθ^(n)(x, y) in the subband S_kθ^(n), and v_max is a positive integer normalization constant. The procedure for feature extraction is illustrated in Fig. 5.
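A compact sketch of this block-wise feature computation is given below, in Python with NumPy. The DFB decomposition itself is not implemented; the function assumes the 8 subband images of one ROI are already available, and the 6 × 6 block grid used to obtain the 36 blocks per subband, as well as the use of the sum of absolute deviations as the block energy, are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def normalized_directional_energy(subbands, v_max=255, rows=6, cols=6):
    """Block-wise normalized directional energy features, following Eqs. (1)-(3)."""
    assert len(subbands) == 8
    n_blocks = rows * cols
    energies = np.zeros((n_blocks, 8))
    for theta, sb in enumerate(subbands):
        h, w = sb.shape
        bh, bw = h // rows, w // cols
        for k in range(n_blocks):
            r, c = divmod(k, cols)
            block = sb[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(float)
            energies[k, theta] = np.abs(block - block.mean()).sum()   # e_k,theta, Eq. (3)
    totals = energies.sum(axis=1, keepdims=True) + 1e-12              # sum over the 8 directions
    normalized = energies / totals                                    # e-hat_k,theta, Eq. (2)
    return np.rint(v_max * normalized).astype(int)                    # v_k,theta, Eq. (1)
```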
4 Matching
To achieve rotational alignment between the input and template feature vectors, the proposed method generates additional feature vectors, in which various rotations are considered, by shifting the directional subband images and recalculating the feature values. The minimum Euclidean distance between the corresponding feature vectors is then used for rotational alignment. The minimum unit of horizontal shift in the subband images is a one-pixel shift in the 4 to 7 directional subband images; this one-pixel shift corresponds to a four-pixel shift in the image domain. For an N × N ROI image R_n (see Fig. 2), a four-pixel shift in the ROI image is equivalent to a rotation of 45 × (4/N) degrees in the original image. For a 72 × 72 image, the minimum unit of rotation that can be effectively compensated for is 2.5°. As such, the feature vector generated by the proposed method is invariant to small perturbations within ±1.25°.

Handling rotations as described above means that pixels outside the ROI boundary become involved. To address this issue, we establish a wider ROI (N′ > N) as shown in Fig. 6(a)-(d) and extract the input feature vectors, in which the various rotations are considered, from the shaded area, as illustrated in Fig. 6(e)-(h). When the ROI images shift as in Fig. 6(a)-(d), the corresponding subband outputs must be shifted as in Fig. 6(e)-(h).

The iris pattern matching is based on finding the Euclidean distance between the input and template feature vectors. Let V_j^R denote the j-th feature value of the input feature vector in which an R × 45 × (4/N) degree rotation is considered, and let T_j denote the j-th feature value of the template feature vector. Then the Euclidean distance D_E between the input and template feature vectors is given by

    D_E = min_R sqrt( Σ_{j=1}^{M} (V_j^R − T_j)^2 )                       (4)

where R ∈ {−8, −7, ..., −2, −1, 0, 1, 2, ..., 7, 8} and M is the size of the feature vector. The range of R is established on the assumption that the rotation of the input iris image is between −20° and +20°. The input iris is accepted if the final distance is below a certain threshold; otherwise, it is rejected.
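The matching step reduces to a minimum over precomputed rotation-shifted feature vectors. The sketch below, in Python with NumPy, assumes those vectors have already been produced by shifting the subband images as described; the dictionary-based interface and the thresholding helper are illustrative choices of this sketch.

```python
import numpy as np

def match_distance(rotated_inputs, template):
    """Minimum Euclidean distance over candidate rotations, Eq. (4).

    rotated_inputs maps each rotation index R in {-8, ..., 8} to the feature
    vector recomputed from the correspondingly shifted subband images.
    """
    t = np.asarray(template, dtype=float)
    return min(np.linalg.norm(np.asarray(v, dtype=float) - t)
               for v in rotated_inputs.values())

def accept(rotated_inputs, template, threshold):
    """Accept the input iris if the best-aligned distance falls below the threshold."""
    return match_distance(rotated_inputs, template) < threshold
```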
5 Experimental Results
For the experiments, we acquired a total of 434 iris images from 10 persons using a digital movie camera and a 50 W halogen lamp. The iris images were captured from a distance of about 15-20 cm, and the light was located below the camera so that the glint only appeared in the lower 90° cone of the iris. The acquired iris images were color images of size 640 × 480, but they were converted into 256-level grayscale images. In our experiments, the performance of the proposed method was compared with that of the Gabor filter bank-based method in [2]. In order to estimate the performance as a personal verification or authentication method, the genuine and imposter distributions of the distances were first examined. A genuine distribution represents the distribution of the distances between all possible intra-class images in all the test images, while an imposter distribution represents the distribution of the distances between all possible inter-class images in all the test images. The more separated the two distributions and the smaller the standard deviation of each distribution, the more advantageous for a personal verification method.
Fig. 6. Shifts of (a)-(d) ROI images R1, R2, R3, R4 (within the wider N′ × N′ region) and (e)-(h) the decomposed directional subband outputs of (a)-(d) (0-3 and 4-7 direction subband outputs)
A decidability index is also a good measure of how well the two distributions are separated [2]. Let µ_G and µ_I be the means of the two distributions, respectively, and let σ_G and σ_I be their standard deviations. Then the decidability index DI can be defined as

    DI = |µ_G − µ_I| / sqrt( (σ_G^2 + σ_I^2) / 2 ) .                      (5)
In addition to the decidability index, an equal error rate (EER), which is the error rate at which the false accept rate (FAR) equals the false reject rate (FRR), is often used as a compact measure of the verification accuracy of a biometric system. Here, FAR is the rate at which an imposter is incorrectly accepted as genuine and FRR is the rate at which a genuine user is incorrectly rejected as an imposter. Table 1 shows the decidability index, EER, and genuine acceptance rate (GAR) at the EER for the proposed method; the GAR indicates the rate at which a genuine user is correctly accepted as genuine. We can see that the proposed method has better characteristics in the genuine and imposter distributions, and also a lower EER than the Gabor filter bank-based method. The performance of a verification system can also be evaluated using a receiver operating characteristic (ROC) curve, which graphically demonstrates how the GAR changes with a variation in FAR. The ROC curve for the proposed method is shown in Fig. 7. With a Gabor filter bank, there are always overlapping or missing subband regions, as illustrated in Fig. 8(a), whereas the DFB tiles the entire frequency space with contiguous passband regions, as shown in Fig. 8(b).
Table 1. Performance comparison between the Gabor filter bank-based method and the proposed method

Method                     Decidability index   EER (GAR)
Gabor filter bank-based    3.0498               4.25% (95.74%)
Proposed                   3.3638               3.80% (96.23%)
Fig. 7. ROC curve (genuine acceptance rate vs. false acceptance rate) for the proposed method and the Gabor filter bank-based method
Fig. 8. Subband ranges used for generating the feature vector by (a) the Gabor filter bank-based method and (b) the proposed method
Accordingly, the DFB can represent linear patterns, such as those found in iris patterns, more effectively than a Gabor filter bank [8]. In addition, since the proposed method uses the normalized directional energy instead of the thresholded (1-bit) directional subband output, it can provide more distinctive information than the Gabor filter bank-based method, which uses only the 2-bit quantized values of the Gabor-filtered images as the iris features. The performance comparison results in Table 1 and the ROC curve in Fig. 7 show that the proposed method has better characteristics.
6 Conclusion
We have presented a new iris-based personal verification method based on the extraction of normalized directional energy features. In the proposed method, the diverse iris pattern is represented in the form of directional energy. Since the directional energy itself is sensitive to changes in brightness or contrast, the proposed method employs a directional energy that is normalized by the subband block energy. The experimental results show that the proposed method is effective in extracting iris features that are robust to illumination variations. We are currently working on 1) extending the ROI region, and 2) combining multiple features extracted from different matchers.
References
[1] Daugman, J. G., Downing, C.: Epigenetic randomness, complexity, and singularity of human iris patterns. Proceedings of the Royal Society B 268 (2001) 1737–1740
[2] Daugman, J. G.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Machine Intell. 15 (1992) 1148–1161
[3] Wildes, R. P.: Iris recognition: An emerging biometric technology. Proc. IEEE 85 (1997) 1348–1363
[4] Boles, W. W., Boashash, B.: A human identification technique using images of the iris and wavelet transform. IEEE Trans. Signal Processing 46 (1998) 1185–1188
[5] Lim, S., Lee, K., Byeon, O., Kim, T.: Efficient iris recognition through improvement of feature vector and classifier. ETRI Journal 23 (2001) 61–70
[6] Bamberger, R. H., Smith, M. J. T.: A filter bank for the directional decomposition of images: Theory and design. IEEE Trans. Signal Processing 40 (1992) 882–893
[7] Park, S., Smith, M. J. T., Mersereau, R. M.: A new directional filter bank for image analysis and classification. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing 3 (1999) 1417–1420
[8] Bamberger, R. H., Smith, M. J. T.: A multirate filter bank based approach to the detection and enhancement of linear features in images. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing 4 (1991) 2557–2560
An HMM On-line Signature Verification Algorithm

Daigo Muramatsu and Takashi Matsumoto

Electrical, Electronics and Computer Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo, 169-8555 Japan
[email protected]

Abstract. Authentication of individuals is rapidly becoming an important issue. On-line signature verification is one of the methods that use biometric features of individuals. This paper proposes a new HMM algorithm for on-line signature verification incorporating signature trajectories. The algorithm utilizes only pen position trajectories; no other information is used, which makes the algorithm simple and fast. A preliminary experiment was performed, and the intersection of FAR and FRR was 2.78%.
1 Introduction
Personal identity verification has a great variety of applications, including electronic commerce, access to computer terminals and buildings, credit card verification, and so on. Algorithms for personal identification can be roughly classified into four categories, depending on whether they are static or dynamic and whether they are biometric or physical/knowledge-based, as shown in Fig. 1. Fingerprint, iris, DNA, and face, for example, are static and biometric. Dynamic biometric methods include voice and on-line signature. Schemes that use passwords are static and knowledge-based, whereas methods using magnetic cards and IC cards are static and physical. There are at least two reasons why on-line pen-input signature verification is one of the most promising schemes for personal authentication.
Fig. 1. Authentication Methods
First, the signature has a long history and is already well established in many cultures. Second, with the growing number of pen-input devices, including PDAs and tablet PCs among others, the pen-input environment is rapidly becoming a popular platform. In terms of training data sets, there are two cases to be distinguished: (a) forgery data as well as authentic signature data are available; (b) no forgery data are available. This paper proposes a new algorithm for pen-input on-line signature verification using a discrete HMM incorporating signature trajectories, where no forgery data are available for training. There are three types of forgery: (A) random forgery, where the forger has no access to the authentic signature; (B) simple forgery, where the forger knows the name of the person whose signature is to be authenticated; (C) skilled forgery, where the forger can view and practice the authentic signature. Type (C) forgery is one of the most difficult situations to deal with. A preliminary experiment was performed on a database consisting of 1848 genuine signatures and 3170 skilled forgery signatures from fourteen individuals, where the forgery data was not used for training. The intersection of the FAR and FRR curves gives 2.78%, which looks promising since only the pen position trajectories are used, instead of pen pressure/pen inclination trajectories [1]-[7].
2 The Algorithm
The overall algorithm is shown in Fig. 2. It consists of three sub-algorithms: i) preprocessing, ii) HMM generation, and iii) verification.
2.1 Preprocessing
Typical raw data taken from the digitizer (shown in Fig. 3) are

    (x(t), y(t)) ∈ R^2 ,  t = 1, 2, ..., T                                (1)

Fig. 2. Overall Algorithm
Fig. 3. Typical Raw Data
Fig. 4. Quantized Directions (L = 16)

Here (x(t), y(t)) is the pen position, and the sampling rate is 100 points/sec. Let

    θ̃ := tan^{-1} [ (y(t+1) − y(t)) / (x(t+1) − x(t)) ] ,  −π/2 < θ̃ ≤ π/2            (2)

and define θ by

    θ = { θ̃        (x(t+1) − x(t) ≥ 0)
          θ̃ + π    (x(t+1) − x(t) < 0)                                               (3)

In order to formulate the problem in terms of a discrete HMM, consider the quantized angle information defined by

    V(t) = { 1   (θ < −π/2 + π/L  or  θ ≥ 3π/2 − π/L)
             n   (−π/2 + (2n−3)π/L ≤ θ < −π/2 + (2n−1)π/L) ,  n = 2, 3, ..., L       (4)

The pen position (x(t), y(t)) is thus transformed into one of the L discretized angles shown in Fig. 4, so that the given trajectory (1) becomes a sequence of discrete symbols:

    O(t) := V(t) ∈ {1, 2, ..., L}                                                    (5)
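The preprocessing of Eqs. (2)-(5) can be written compactly. The sketch below, in plain Python, uses atan2 as an implementation convenience to obtain the angle in the (−π/2, 3π/2] range of Eqs. (2)-(3) in one step; the function name and the handling of zero displacement are assumptions of this sketch rather than details of the paper.

```python
import math

def quantize_directions(points, L=16):
    """Convert a pen-position trajectory [(x, y), ...] into symbols in {1, ..., L}."""
    symbols = []
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        theta = math.atan2(y1 - y0, x1 - x0)       # full direction angle in (-pi, pi]
        if theta < -math.pi / 2:                   # shift into (-pi/2, 3*pi/2], as in Eqs. (2)-(3)
            theta += 2 * math.pi
        # Eq. (4): sector 1 straddles the -pi/2 direction; the others follow in order.
        if theta < -math.pi / 2 + math.pi / L or theta >= 3 * math.pi / 2 - math.pi / L:
            symbols.append(1)
        else:
            n = int((theta + math.pi / 2 + math.pi / L) // (2 * math.pi / L)) + 1
            symbols.append(min(max(n, 2), L))
    return symbols
```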
2.2 HMM Generation
Proposed HMM Structure. The HMM is a general doubly stochastic structure that is applicable to a broad class of problems where time evolution is important. Every general discipline must be tailored before being applied to a specific type of problem, which is a basic engineering task, and the HMM is no exception: its general framework must be carefully tuned to the on-line signature verification problem. In order to make the HMM precise, let us first recall the output symbols defined in the previous section:

    O(t) = V(t) ,  t = 1, ..., T − 1                                      (6)
An HMM of a signature,

    H = H({a_ij}, {b_jk}, {π_j}, N) ,                                     (7)

is defined by the joint probability distribution of {Q(t), O(t)}_{t=1}^{T−1} given H:

    P({Q(t), O(t)}_{t=1}^{T−1} | H) = π_{Q(1)} × Π_{t=1}^{T−2} a_{Q(t)Q(t+1)} × Π_{t=0}^{T−2} b_{Q(t+1)V(t+1)}      (8)

where Q(t) stands for the hidden state at time t, {a_ij} for the state transition probabilities, {b_jk} for the output emission probabilities, and {π_j} for the initial state probabilities. Learning in the HMM amounts to an estimation of the parameters {a_ij}, {b_jk}, {π_j} and N, whereas verification is to compute the likelihood P({O(t)} | H) given a test signature {O_test(t)}_{t=1}^{T_test−1} and template signatures {O_m(t)}_{t=1}^{T_m−1}, and to make decisions from it. In order to tune the HMM to our current type of problem, we will use the left-to-right model, so we put the following constraints on {a_ij} and {π_j}:

i) a_ij = 0 unless i = j or i = j − 1;
ii) a_NN = 1, so that the {a_ij} matrix is restricted to the bidiagonal form

    {a_ij} = [ a_11  a_12                              ]
             [       a_22  a_23                        ]
             [             ...    ...                  ]
             [              a_{N−1,N−1}  a_{N−1,N}     ]
             [  0                              1       ]                  (9)

iii) {π_j} = (1, 0, ..., 0)                                                (10)

Note that i) demands that state q_i cannot jump to q_{i+k}, k ≥ 2; ii) indicates that the last state q_N cannot transit to any other state, which corresponds to the causality of a trajectory associated with H; and iii) demands that the initial state Q(1) must always be q_1. The reasons for these constraints will become clear below.

Learning. The well-known Baum-Welch learning algorithm uses a gradient search to compute

    arg max_{ {a_ij}, {b_jk}, {π_j} }  P({O(t)}_{t=1}^{T−1} | {a_ij}, {b_jk}, {π_j}, N)      (11)
with N fixed. The efficiency of this method naturally depends on the problem at hand. In the current on-line signature verification problem, there are three major hurdles:
– In a realistic signature verification problem, only very few data sets are available for training: individuals are willing to register only a few signatures. Registering ten signatures is out of the question, and even registering five is felt to be inconvenient, unless there is a particular reason.
– The objective function (11) is typically non-convex with respect to the parameters, and serious problems are created by local minima.
– The dimension of the parameter space is determined by the number of non-zero parameters of {a_ij}, {b_jk}, {π_j} and N, which amounts to 2N − 1 + N × L + N; the computational effort is significant.
Our algorithm given below is fast because it is one-shot, and hence non-iterative, and, most importantly, it works well.

Model Generation. When a data set is given for learning, our proposed algorithm first attempts to associate a clear meaning with the states. The algorithm, however, is still an HMM in that it has nontrivial {b_jk}. Let the data set {O(t)}_{t=1}^{T−1} be given.
(12)
make a division between O(t) and O(t + 1), if V (t) = V (t + 1), i.e., if the angle value changes. Consider the following simple example:
5, 5, 5 8, 8 9, 9, 9, 9 10, 10, 10 q1 q2 q3 q4
The first data 5 means that the trajectory angle is in the direction 5 as shown in Fig. 4 . After three such angles, the angle value changes to 8 so that a division is made as indicated by a vertical line {|}, and so forth. The three repeated symbols 5, 5, 5 are associated with state q1 , the two repeated symbols 8, 8 are associated with state q2 , and so forth. Thus, given a training data set, the number of states is well defined, whereas the number of states in a general HMM learning is one of the most difficult parameters to be estimated. Step.2 Learning.{aij }, {πj } Let n(O, qi ) be the number of repetitions of the symbol V (t) associated with qi . Define aij = 0 (i = j, i = j − 1) (13) n(O, qi ) − 1 n(O, qi ) 1 aii+1 = n(O, qi )
aii =
(1 ≤ i ≤ N − 1) (1 ≤ i ≤ N − 1)
aN N = 1
(14) (15) (16)
238
Daigo Muramatsu and Takashi Matsumoto
π = (1, 0, ......, 0)
(17)
Since each signature data generates one HMM, care needs to be exercised in learning output emission probabilities in order to circumvent overfitting. Step.3 Learning {bjk } Let V (j) be the angle value associated with state qj and define bjk :=
α Z(σv , L)
x=k+0.5
x=k−0.5
e
(−
(V (j)−x)2 2 2σv
)
dx +
1−α L
(18)
k = 1, 2, ......, L where Z(σv , L) is the normalization constant in order that L
bjk = 1
(19)
k=1
is satisfied. This is a straightforward smoothing/flooring which prevents overfitting. 2.3
Verification
The purpose of signature verification is to infer whether a given test signature was written by the registered person or not. In order to accomplish it, we compute a particular performance index based on log likelihood and compare it with threshold values. Given training data set m −1 D := {D1 , ..., Dm , ..., DM }, Dm = {Om (t)}Tt=1
(20)
consisting of M signature trajectories from a registered person, we create associated HMM’s by the above algorithm. Note that the above HMM generation is one shot, i.e., each training data generates one HMM. This implies that a test sigtest −1 can generate an HMM, call it Htest . nature trajectory Dtest = {Otest (t)}Tt=1 Recall Dm , m = 1, ......, M , the template data sets, and consider,given Htest , m −1 marginalization with respect to {Q(t)}Tt=1 : m −1 P (Dm |Htest ) = P ({Q(t), Om (t)}Tt=1 |Htest ) (21) AllP aths
It should be noted that in (21), the likelihood is evaluated with respect to Htest , wheres the data set is Dm . This paper proposes the following performance index derived from (21): Θm,test = ln P (Dm |Htest ) We infer that
(22)
An HMM On-line Signature Verification Algorithm
Dtest is authentic
if m f (Θm,test , λm ) ≥ G Dtest is forgery if m f (Θm,test , λm ) < G where 1 (Θtest,m ≥ λm ) f (Θtest,m , λm ) = 0 (Θtest,m < λm )
239
(23)
λm is a threshold value, which is computed from the training data set Dm :
M M M
1 1 1 λm := Θm,n − c × (Θm,n − Θm,l )2 (24) M n=1 M n=1 M l=1
m = 1, 2, ......, M, n = 1, 2, ......, M, l = 1, 2, ......, M G is an empirical value indicating the number of times that the performance index exceeds threshold. The error rates in the experiment to be reported below will be given as a function of parameter c.
3
Experiment
Error
This section reports our preliminary experiment using the algorithm described above. Fourteen individuals participated the experiment. The data was taken for the period of three months. There were 1848 authentic signatures and 3170 skilled forgery signatures. The forgery data set was not used for HMM learning, it was used for test only. Table1 shows the details. Fig. 5 shows average verification error as a function of parameter c described above, where the intersection of FRR and FAR curves gives 2.78%.
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
FRR
0
0.2
0.4
FAR
0.6 0.8 1.0 Parameter c
1.2
1.4
Fig. 5. Average Verification Error
4
Conclusions and Future Works
Since the proposed algorithm utilizes only pen position trajectories, there are at least three directions to pursue:
240
Daigo Muramatsu and Takashi Matsumoto
Table 1. The Data Used for Experiment Authentic Forgery Individual Test Template Test Total Genetation A 204 25 585 814 B 45 5 81 131 C 141 15 237 393 D 25 5 68 98 E 187 25 435 647 F 153 15 357 525 G 56 5 71 132 H 54 5 73 132 I 205 10 288 503 J 210 20 396 626 K 73 5 69 147 L 102 10 156 268 M 94 5 81 180 N 134 15 273 422 Total 1683 165 3170 5018
1. Attempt to install the algorithm on a PDA where computational power is severely limited; 2. Attempt to improve the verification ability by using additional information such as pen pressure, pen inclinations and so forth; 3. Try different HMM topologies.
References [1] Y. Komiya and T. Matsumoto, ”On-line Pen Input Signature Verification PPI (pen-Position/ pen-Pressure/ pen-Inclinations)”, Proc. IEEE SMC’99, Vol.4, pp. 41-46, 1999. [2] T. Ohishi, Y. Komiya and T. Matsumoto, ”An On-Line Pen Input Signature Verification Algorithm”, Proc. IEEE ISPACS 2000, Vol.2, pp. 589-592, 2000. [3] T. Ohishi, Y. Komiya and T. Matsumoto, ”On-line Signature Verification using Pen Position, Pen Pressure and Pen Inclination Trajectories”, Proc. ICPR 2000, Vol. 4, pp. 547-550, 2000. [4] D. Sakamoto, T. Ohishi, Y. Komiya, H. Morita and T. Matsumoto, ”On-line Signature Verification Algorithm Incorporating Pen Position, Pen Pressure and Pen Inclination Trajectories”, Proc. IEEE ICASSP 2001, Vol. 2, pp. 993-996, 2001. [5] Y. Komiya, T. Ohishi, T. Matsumoto: ”A Pen Input On-Line Signature Verifier Integrating Position, Pressure and Inclination Trajectories”, IEICE Trans. Information Systems, vol. E84 - D, No.7, July, 2001. [6] Y. Komiya, H. Morita, T. Matsumoto, ”Pen-input On-line Signature Verification with Position, Pressure, Inclination Trajectories”, Proc. IEEE IPDPS 2001, pp. 170, April 2001.
An HMM On-line Signature Verification Algorithm
241
[7] T. Ohishi, Y. Komiya, H. Morita, T. Matsumoto, ”Pen-input On-line Signature Verification with Position, Pressure, Inclination Trajectories”, Proc. IEEE IPDPS 2001, pp. 170, April 2001. [8] H.Yasuda, T.Takahashi and T.Matsumoto,”A Discrete HMM For Online Handwriting Recognition”, International Journal of Pattern Recognition and Artificial Intelligence, Vol.14, No.5, pp.675-688, 2000.
Automatic Pedestrian Detection and Tracking for Real-Time Video Surveillance Hee-Deok Yang1 , Bong-Kee Sin2 , and Seong-Whan Lee1, 1
2
Center for Artificial Vision Research, Korea University Anam-dong, Seongbuk-ku, Seoul 136-701, Korea {hdyang,swlee}@image.korea.ac.kr Department of Computer Multimedia, Pukyong National University Daeyeon 3-dong, Nam-ku, Pusan 608-737, Korea
[email protected] Abstract. This paper presents a method for tracking and identifying pedestrians from video images taken by a fixed camera at an entrance. A pedestrian may be totally or partially occluded in a scene for some period of time. The proposed approach uses the appearance model for the identification of pedestrians and the weighted temporal texture features. We compared the proposed method with other related methods using color and shape features, and analyzed the features’ stability. Experimental results with various real video data revealed that real time pedestrian tracking and recognition is possible with increased stability over 5–15% even under occasional occlusions in video surveillance applications.
1
Introduction
Tracking humans in video is an important task in many applications such as video surveillance and virtual reality interface as they are the primary actors in those task domains. In the past few years, a number of real time systems have been developed to detect and track people. VSAM [1] is a CMU development for tracking people and moving objects. The system’s infrastructure involves fourteen cameras. Roh and Lee [2] introduced a system for detecting and tracking multiple people who are totally or partially occluded occasionally. Haritaoglu et al. [3] introduced the system W4 for detecting and tracking multiple people or part of their bodies. Darrell et al. [4] used disparity and color information for extracting and tracking individual persons. More recently there has been an increasing interest in integrating the temporal information in tracking system [6]. Among the systems listed above, Roh and Lee’s [2] and W4 [3] system already employed temporal information to improve the performance of tracking and identification. With temporal information, they could narrow down the search range significantly. Pedestrian tracking in general
To whom all correspondence should be addressed. This research was supported by Creative Research Initiatives of the Ministry of Science and Technology, Korea.
J. Kittler and M.S. Nixon (Eds.): AVBPA 2003, LNCS 2688, pp. 242–250, 2003. c Springer-Verlag Berlin Heidelberg 2003
Automatic Pedestrian Detection and Tracking
243
consists of two subtasks of target detection and verification [5] [6]. And a working is required to answer the questions: “Is the pedestrian detected correctly?” and “Does the pedestrian detected in the previous frame match any pedestrian of the current frame, or vice versa?” Usually the number of candidate pedestrians in a scene is small, but it is not rare to process a large number of candidates to equally many candidates in the previous frame. In this paper, we propose a pedestrian tracking method based on the appearance model using the temporal texture features of the image sequence. Temporal texture is a set of pairs of a texture value and an associated weight. The weight connotes the size, duration, position, velocity and frequency of appearance of the texture region, as well as the number of the pedestrians adjacent to the target pedestrian. Finally, we employed a simple method of recognition for the verification task.
2
Overview
The overall system organization is shown in Fig. 1. The system consists of three parts: pedestrian detection, pedestrian tracking and face recognition. The pedestrian detection, in turn, is decomposed into several sequential subtasks of motion detection, candidate region detection, and pedestrian detection. In pedestrian tracking, the pedestrian hypotheses are confirmed for identification, and then passed to match with pedestrians tracked in the preceding frames. Finally the face recognition, itself being a challenging task, is achieved by a simple system which consists of face detection and identification. Note that there are feedback loops in the candidate region detection module, and pedestrian tracking module.
Fig. 1. System overview
244
3
Hee-Deok Yang et al.
Pedestrian Detection
The goal of pedestrian detection is to locate all and only the pedestrians in a scene. The task is discussed in two parts: detection of candidate regions and detection of individual pedestrians. A candidate region contains any and sometimes many moving pedestrians in a scene, and the pedestrian detection explicitly divides the region into individual pedestrians. 3.1
Candidate Region Detection
Candidate region detection is a simple process of locating any moving objects in a scene. Here we took the methods of adaptive background subtraction and three-frame differencing. Let Bn (x) denote the current background intensity value at pixel x at time n learned by observation over time, and Tn (x) the difference threshold. B0 (x) is initially set to the value of the first frame, i.e., B0 (x) = I0 (x), and T0 (x) is supplied externally. Bn (x) and Tn (x) are updated over time as follows [1]: Bn+1 (x) = Tn+1 (x) =
αBn (x) + (1 − α)In (x), x is non-moving x is moving Bn (x),
αTn (x) + (1 − α)(5 × |In (x) − Bn (x)|), x is non-moving Tn (x), x is moving
(1)
(2)
where α is a time constant denoting the relative magnitude of the influence from the new input. 3.2
Pedestrian Detection
Usually a foreground region contains more than one pedestrian. In this case we need to partition such a region into individual pedestrians. In the first step of pedestrian detection we apply morphological operators such as erosion and dilation to eliminate noise. The resulting image is referred to as a silhouette image. The following segmentation step mainly concerns locating heads in the silhouette image, by shape analysis and vertical projection. By observing the curvature of the boundary of a silhouette, the part of a pedestrian that looks like a head may be located. The point of maximal convex value is a good indicator of head, while the point of minimal concave value is similarly a good indicator of separation between two pedestrians. In this step we consider four types of concave points:
Automatic Pedestrian Detection and Tracking
– – – –
case case case case
1 2 3 4
: : : :
space space space space
between between between between
245
shoulder and background pedestrians pedestrian and background pedestrian and obstacle
Next, the system divides the region into sub-regions, each corresponding to an individual pedestrian with a head detected in the preceding stage. Finally the torso axis is estimated from the head center down, and each pedestrian segment is estimated according to the distance from the torso axis.
4
Pedestrian Tracking
The temporal information along the video frame sequence is conjectured to have a potential for greater accuracy and speed-up. This point will be discussed here together with a solution for brief occlusions. 4.1
Temporal Texture Feature
For accurate and faster pedestrian tracking, we selected the temporal texture feature as a supplement to the appearance model. This should capture the interframe correlation and continuation for successful tracking. Let us start with the definition; The temporal texture(T ) is defined as a set of pairs of texture values(t) and its temporal weights (w) as
T = {(t1 , w1 , ), (t2 , w2 ), ..., (ti , wi )}
(3)
where i is the number of intensity clusters. The texture value, measured in intensity unit, represents the mean or center of a texture cluster in the intensity space. The first step of calculating the temporal texture is clustering textures. For each pedestrian, an intensity histogram is created and the mean intensity value of pedestrian is calculated. In this way, two or three intensity clusters for each pedestrian can be obtained. The temporal weight is the coherency of the intensity cluster. It is a function of the size of intensity clusters and the variance of the intensity values. The relationship between each parameter is
wi ∝
1 ∆sti · ∆mti
(4)
where ∆sti is the difference of number of pixels at texture value ti and ∆mti is the difference of mean at texture value ti . The velocity of the pedestrian is temporal in that it changes at every frame while the size of pedestrian is assumedly not temporal, and hence not explicitly
246
Hee-Deok Yang et al.
used in the model. This has a nontrivial influence on the amount of computation. For this reason we include it as a temporal texture feature. The weight W is the coherency of the pedestrian. It is a function of the size of pedestrian, the frequency of the associated texture, the set of temporal weight and the existence of adjacent objects. The relation between each parameter is
Wn ∝
Σwn · T An Γ (n) · ∆Sn · ∆pn · ∆vn
(5)
where n is frame number in an image sequence, Sn is the number of pixels in the region of a pedestrian, ∆Sn = Sn − Sn−1 , p is the position of a pedestrian and ∆pn = pn − pn−1 , v = ∆pn /∆t, ∆vn = vn − vn−1 , and finally ∆t is the time interval between frames. The adjacency function Γ measures the 2D geometric adjacency or separation between two pedestrians. It is related to the shape of target pedestrian and the distance from the other pedestrian, as: Γp (t) =
Ap (xtp − xtq )2 + (ypt − yqt )2
(6)
where xt , y t is the center coordinate of a target pedestrian at time t, respectively, and A is the length of the target in the direction that the adjacency occurs. 4.2
Occlusion Detection
Let us consider the situation where two pedestrians are being tracked through occasional occlusions. We can determine an approximate depth for each target with a separate model for occlusion such as:
OAB = max
δ(IAB (x), DAB (x))
(7)
x∈R
0, a = b , A and B represent two different pedestrians, R 1, otherwise represents the overlapping region between the two pedestrians, DAB (x) denotes the nearer pedestrian where δ(a, b) =
arg min {|x − Ac|, |x − Bc|} {A,B}
where Ac and Bc are pedestrian centers, IAB (x) denotes the pedestrian whose intensity at x, IA (x) or IB (x), more similar to the current observation I(x) arg min {|I(x) − IA (x)|, |I(x) − IB (x)|}. {A,B}
Automatic Pedestrian Detection and Tracking
5 5.1
247
Experimental Results and Analysis Experimental Environment
Our tracking system was implemented and evaluated on a Pentium IV–1.7 GHz PC, which Microsoft Windows 2000T M . The video images were acquired at the rate of 30 fps using a Meteor II frame grabber and Jai 3CCD camera with the resolution set to 320 x 240 pixels. The video captured two or three persons entering an office room together. The data set includes 600 frames in total. 5.2
Experimental Results
Fig. 2. shows some results for tracking three people in different occlusion situations. You can see two persons are continuously tracked throughout the occlusion. The last frame of Fig. 2(a) shows the case in which one pedestrian is missing when two persons overlap. In this case the heads of the two pedestrians overlap, and one head is lost. Fig. 2(b) shows a different result obtained by incorporating temporal weights. Table 1 shows a brief description about the content of the test data and the result of detection and tracking. The number of pedestrians is the number of individual pedestrians who have ever appeared in the scene. The numbers of false detections and false trackings are the number of error frames in which one or more pedestrians are missing or incorrect. In case of false tracking, an error
(a) result of detected pedestrian without temporal weight
(b) result of detected pedestrian with temporal weight
Fig. 2. Tracking examples (scene I), without top and with bottom temporal weights
248
Hee-Deok Yang et al.
Table 1. Tracking result from three sample scenes; the detection error includes both pedestrian misses and ghosts(false positives), and the tracking error includes loss or incorrect location of pedestrians, counted in frames from the total frames number total number of number number of pedestrians frames of detection errors of tracking errors scene I 3 200 2 1 scene II 2 150 4 2 scene III 3 250 8 3
may affect the next frame. In the current experiment we simply ignored all the consequential errors. Now let us assume that we have a complete and high-performance tracking system. As a probe into the feasibility of an intelligent integrated system, we combined the tracking system with our existing face recognition system for real time surveillance task. The face recognizer is based on the support vector machine as reported in [7]. For detailed description, please refer to the paper. In the current setup of the system, we have not yet carried out a systematic evaluation. Thus, for now, the best result we can provide is a few snapshots of the working system. Fig. 3 shows some frames containing the result of person tracking and, if sufficiently near, identification. Note that the pedestrian #1 in frame #68 is changed to ”Jeong” in frame #99, and the pedestrian #2 is changed to ”Lee”. The information about the name and the face is stored in the database. 5.3
Performance Analysis
In the final set of tests, we compared the performance of the texture feature used in our system with the color feature used in [2] and the shape feature of the temporal templates used in [3]. The quantitative comparison of the performance is based on the following measure [2]
(a) #68
(b) #75
(c) #99
(d) #115
Fig. 3. Example of pedestrian tracking and face recognition
Automatic Pedestrian Detection and Tracking
(a) shape
(b) color
(c) texture
(d) texture + occlusion model
249
Fig. 4. Comparison of the stability of each feature
x,y∈P (MP
− I(x, y))
(8)
|P |
I(x,y)
where P is a set of foreground pixels of the target pedestrian, MP = x,y∈P |P | is the mean intensity in P , I(x, y) is the pixel intensity for the texture feature at (x, y), (the color value for the color feature, 0 or 1 for the shape feature), and |P | is the number of pixels of the target pedestrian in P . Fig. 4 shows the variation of each feature as a function of time for the pedestrian #1 in Fig. 2. The horizontal axis is of the discrete time frame, while the vertical axis is of the normalized variance measured by Equation 8. The figure shows that the texture feature is more stable than the shape or the color feature. Fig. 4(d) implies that the stability can improve if we introduce a face recognizer and an occlusion model.
6
Conclusion and Further Research
In this paper, we proposed an appearance model using temporal textures for robust multiple pedestrian tracking in video streams. The texture features include temporal texture value, size, position, velocity, and frequency of the texture value region and they are temporal in that they vary over time with associated weights and an occlusion model. The experimental result shows that the stability of the proposed method in the tracking. The temporal features are not complete. Rather they can be further generalized with any kind of features in the appearance model for the pedestrian tracking process. One problem with temporal texture arises when the
250
Hee-Deok Yang et al.
pedestrians are in a uniform or clothes with the same color and intensity. To solve this problem, we introduced other features such as position and face recognition result. Although face recognition is the most discriminating feature, the computational cost of face analysis is too expensive and the accuracy is often not sufficiently high. However, we applied face recognition to our system when the region of pedestrian is sufficiently large for recognition. As the result of applying face recognition, we could increase the discrimination and the security level. Future research will be focused on the tracking system using multiple cameras. By using multiple cameras, we believe that we can reduce the tracking error and expand tracking area.
References [1] Collins, T. et al.: A System for Video Surveillance and Monitoring:VSAM Finial Report. Technical report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University (May 2000) 242, 244 [2] Roh, H.-K., Lee, S.-W.: Multiple People Tracking Using an Appearance Model Based on Temporal Color. Proc. of 1st IEEE Int’l Workshop on Biologically Motivated Computer Vision, Seoul, Korea (May 2000) 369–378 242, 248 [3] Haritaoglu, I., Harwood, D., Davis, L. S.: W4: Who? When? Where? What? A Real Time System for Detecting and Tracking People. Proc. of Int’l Conf. on Face and Gesture Recognition, Nara, Japan (April 1998) 222–227 242, 248 [4] Darrell, T. et al.: Integrated Person Tracking Using Stereo, Color, and Pattern Detection. Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbera, California, (1998) 601–608 242 [5] Intille, S. S., Davis, J. W., Bobick, A. F.: Real-Time Closed-World Tracking. Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Puerto Rico, (Jun. 1997) 697–703 243 [6] Baoxin, L., Chellappa, R.: A Generic Approach to Simultaneous Tracking and Verification in video. IEEE Trans. on Image Processing, Vol. 11, No. 5, (2002) 530–544 242, 243 [7] Xi, D., Lee, S.-W.: Face Detection and Facial Feature Extraction Using Support Vector Machines. Proc. of 16th Int’l Conf. on Pattern Recognition, Quebec City, Canada, (August 2002) 209-212 248
Visual Features Extracting & Selecting for Lipreading* Hong-xun Yao1, Wen Gao1,2, Wei Shan1, and Ming-hui Xu1 1
Department of Computer Science and Engineering, Harbin Institute of Technology Harbin, 150001, China {yhx,sw,mhx}@vilab.hit.edu.cn 2 Institute of Computing Technology, Chinese Academy of Sciences Beijing, 100080, China
[email protected] Abstract. This paper has put forward a way to select and extract visual features effectively for lipreading. These features come from both lowlevel and high-level, those are compensatory each other. There are 41 dimensional features to be used for recognition. Tested on a bimodal database AVCC which consists of sentences including all Chinese pronunciation, it achieves an accuracy of 87.8% from 84.1% for automatic speech recognition by lipreading assisting. It improves 19.5% accuracy from 31.7% to 51.2% for speakers dependent and improves 27.7% accuracy from 27.6% to 55.3% for speakers independent when speech recognition under noise conditions. And the paper has proves that visual speech information can reinforce the loss of acoustic information effectively by improving recognition rate from 10% to 30% various with the different amount of noises in speech signals in our system, the improving scope is higher than ASR system of IBM. And it performs better in noisy environments.
1
Introduction
Lipreading computing plays a very important role in automatic speech recognition, human computer interface, and content-based data compression, et al. Most studies have demonstrated that accuracy rate of automatic speech recognition (ASR) systems has been improved by visual information aided, especially in noise environment or multiple talkers. When a person intends to understand more about what he heard, he would subconsciously use visual information that came from the talker’s facial expression, lip movement and gestures. In reverse, the result would be influenced or disturbed for human speech perception when the lip movements were going along with interferential acoustic signal sequences. The computer lipreading system aims at extracting motion information of the mouth effectively and correctly. Through integrating visual features and acoustic *
This paper has been supported by Chinese National High Technology Plan “Multi-model perception techniques” (2001AA114160).
J. Kittler and M.S. Nixon (Eds.): AVBPA 2003, LNCS 2688, pp. 251-259, 2003. Springer-Verlag Berlin Heidelberg 2003
252
Hong-xun Yao et al.
features, ASR performance could be improved by supplements visual features for acoustic information. Various visual features for ASR have been purposed in many literatures [1, 2, 4, 6, 8, 9]. Then, some fusion strategies [5, 6, 7, 8, 9, 10] will be applied to improve ASR performance. In general, all visual features extracted can be grouped into two categories: low-level pixels based features, and high-level lip shape based ones. In the former, ROI (region of interest) pixel values in an image are employed directly or after some image transform (DWT, DCT, PCA et al). In the latter, the features come from lip contours which are described by a shape model or special points on those contour positions. In fact, in most lipreading systems (including the research in IBM), the low-level visual features come from whole face region by PCA. However, as we know, if features come from whole face region by PCA, the features must cover of personal face looks, which are disadvantageous for lipreading. And investigators could not distinguish which dimensional features have more contribution for lipreading and which for others. In the same way, the high level features come from the coordinates of the points position which are on lip contours, therefore, the features may contain personal lip trait, which are no help for lipreading, either. Thus, we think over that lipreading visual features should come from those interior pixels inside inner lip contours. And it would be proved out effective for those features in next section in the paper.
2
Visual Features Selecting and Extracting
We wish to extract features come from effective region and significative to contribute for lipreading. As we know, lip shape is most important to supply lipreading information. Thus, some features must come from lip shape directly. However, only this type features are not enough because the shape model may not consider all relevant speech information such as motion of tongue and tooth. Therefore, the Shape Interior Analysis is essential complementarity. Both type features should be applied in our system. We hope to get peak power to integrate these features in proper manner. We extract high-level significative features from the lip shape model, which are defined as Ps and extract low-level pixel intensity features from the ROI (Shape Interior Analysis) which are called Pi . The final feature vector P consists of the fusion from both type features by getting rid of their relativity. 2.1
Features from Shapes
The lip contours model [3, 4] can been described by several curves (Fig. 1). This lip contour model is described by four parabola curves. Inner lip and outer lip are described with two parabolas separately. The equations are shown in following:
Visual Features Extracting & Selecting for Lipreading
253
Fig. 1. A lip contour model
x2 , Yuo = h1 × 1 − 2 w0
x2 Ylo = − h4 × 1 − 2 w0
1)
x2 Yui = h2 × 1 − 2 , w1
x2 Y li = − h3 × 1 − 2 w1
2)
(
(
where Equation (1) describes the lip outer contour cures (upper and lower). Equation (2) represents inner contour ones. Thus, parameters of lip contour model are: w0 , w1 ,
h1 , h2 , h3 , h4 . Now we get the shape features of a lip: w0 , w1 , h1 , h2 , h3 , h4 . To achieve the best recognition performance, we assemble the useful features in several means and compare their recognition rate, shown in Table 1. From that, we found the recognition rate of Group 4 is lower than that of group 5, it means that the area between outer and inner lip contours S lip will make things worse because it only represents personal lip trait and no help for lipreading. Group 6 works better than group 5 reminds us that h2 + h3 is more robust than h2 and h3 . The same conclusion is drawn from group 7 and group 6. So finally the shape feature Ps is defined as ( w0 , w1 , h2 + h3 ,
h1 + h4 ) T . Table 1. Direct parameters selected
Group 1:  w0 + w1, h1, h2, h3, h4
Group 2:  w0, w1, h2, h3, h1 + h4
Group 3:  w0 + w1, h2 + h3, h1 + h4
Group 4:  w0, w1, h1, h2, h3, h4, S_lip
Group 5:  w0, w1, h1, h2, h3, h4
Group 6:  w0, w1, h1, h2 + h3, h4
Group 7:  w0, w1, h2 + h3, h1 + h4
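To make the shape model concrete, the following sketch (Python/NumPy; the function names are ours and only illustrate Equations (1)-(2) and the selected feature group, not the authors' implementation) evaluates the four parabolas and assembles the shape feature Ps.

```python
import numpy as np

def lip_contours(w0, w1, h1, h2, h3, h4, n=50):
    """Evaluate the four parabolic lip curves of Equations (1)-(2)."""
    x_out = np.linspace(-w0, w0, n)          # outer-contour abscissae
    x_in = np.linspace(-w1, w1, n)           # inner-contour abscissae
    y_uo = h1 * (1 - x_out**2 / w0**2)       # upper outer lip
    y_lo = -h4 * (1 - x_out**2 / w0**2)      # lower outer lip
    y_ui = h2 * (1 - x_in**2 / w1**2)        # upper inner lip
    y_li = -h3 * (1 - x_in**2 / w1**2)       # lower inner lip
    return (x_out, y_uo, y_lo), (x_in, y_ui, y_li)

def shape_feature(w0, w1, h1, h2, h3, h4):
    """Group-7 style shape feature Ps = (w0, w1, h2 + h3, h1 + h4)^T."""
    return np.array([w0, w1, h2 + h3, h1 + h4])
```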
[Figure 2: bar chart of recognition rate (%) against feature group (A-G); the plotted rates lie between roughly 14.6% and 20.7%.]
Fig. 2. Recognition rates of the different combinations of direct parameter features
2.2
Features from Shape Interior Information
If the features of Section 2.1 can be regarded as necessary, the features from the shape interior are what make the representation sufficient. The main image information comes from the region inside the inner lip contour, which carries the most important information about lip movement. In particular, the intensity changes inside the inner lip contour indicate whether, and to what extent, the tongue and teeth are exposed. We divide the inner mouth into three equal parts, as shown in Fig. 3. The first-order features E_A, E_B, E_C are the mean intensities of the upper section A, middle section B and lower section C, and indicate whether the upper teeth, the tongue or the lower teeth are exposed. The second-order features sigma_A, sigma_B, sigma_C are the intensity variations of sections A, B and C. In addition, the average intensity of the inner mouth, E_i, is also included, giving 7 dimensions of high-level features. In the size-normalized 32 x 16 ROI, we collect the intensities of all pixels inside the inner lip contour one by one; where no corresponding point exists, the value 0 is assigned to the occluded position so that all vectors have the same dimensionality and correspondence. The Principal Component Analysis method is then used to obtain an M-dimensional representation of each mouth shape, corresponding to the M largest PCA eigenvalues. Different feature dimensionalities give different lipreading recognition rates, as shown in Fig. 4; the best result occurs at M = 15. An alternative, automatic feature selection method uses a SOFM (self-organising feature map) neural network, which helps solve the problem of effective feature selection; its structure is shown in Fig. 5.
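A minimal sketch of how these interior features and the PCA representation could be computed is given below (Python/NumPy; the band splitting, the use of the standard deviation for the second-order features and all names are our assumptions, not the authors' code).

```python
import numpy as np

def interior_features(roi, inner_mask):
    """roi: size-normalized grey-level ROI (2-D array); inner_mask: boolean mask of
    the inner-mouth region. Returns [E_A, sigma_A, E_B, sigma_B, E_C, sigma_C, E_i]."""
    bands = np.array_split(np.arange(roi.shape[0]), 3)   # three horizontal bands A, B, C
    feats = []
    for band in bands:
        vals = roi[band][inner_mask[band]]
        vals = vals if vals.size else np.zeros(1)        # occluded band treated as 0
        feats += [vals.mean(), vals.std()]               # mean and spread of intensity
    feats.append(roi[inner_mask].mean() if inner_mask.any() else 0.0)
    return np.array(feats)

def pca_representation(X, m=15):
    """Project zero-padded inner-mouth pixel vectors X (N x D) onto the m leading PCs."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T                                 # N x m low-level features
```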
Fig. 3. Three parts inside lip inner contour
[Figure 4: recognition rate (%) plotted against the number of low-level features (10-55); the best rate occurs at 15 features.]
Fig. 4. Recognition rates for different dimensionalities of the low-level features
Fig. 5. Self-organising feature maps model
The SOFM network consists of two layers of cells, an input layer and an output layer, with a weight connecting every input cell to every output cell. The SOFM is trained by unsupervised competitive learning. The inputs to the network consist of two parts: the high-level features from the lip shape model and the low-level pixel intensity features from the ROI. Comparing the results, we find that although the features are selected by different methods, they play a similar role in the subsequent recognition stage. 2.3
Features Extracting
To describe the lip contours accurately with parabolas, lip pixels must first be distinguished from other pixels. However, skin colour is close to lip colour, as shown in Fig. 6. To enhance the intensity of the lip region we adopt the lip colour transformation method (LCTM) of [4]; the result is shown in the third row of Fig. 6. Unlike methods that operate only on grey-level images, LCTM exploits chromatic information to distinguish lip colour (LC) from skin colour (SC), and the enhancement makes the lip region stand out. Matching the parabolas to the actual lip contours is easier and more accurate in the enhanced images; the figure illustrates the difference between the grey-level and enhanced images. Energy minimization is then used to draw the lip contour model towards the actual contour positions. The energy functions are defined by the integrals in Equation (3), each corresponding to one of the lip contours.
Fig. 6. Images in the first row are the originals, those in the second row are the corresponding grey-level images, and those in the third row are the enhanced images
Fig. 7. The result of contour matching by energy minimization
$$E_i = -\frac{1}{|\Gamma_i|} \int_{\Gamma_i} \Phi_e(\mathbf{x})\, ds, \qquad i = 1, 2, 3, 4, \qquad (3)$$

where $|\Gamma_i|$ denotes the length of the curve $\Gamma_i$, $\Gamma_i$ is one curve of the lip contour model, and $\Phi_e(\mathbf{x})$ is the intensity distribution function of the pixels lying on the model contours. $E_1, E_2, E_3, E_4$ are the energy functions of the four contours. We use a gradient descent algorithm to adjust the parameters of the lip contour model so as to minimize the energy functions quickly; in this way the curves are drawn towards the actual lip contour. Fig. 7 shows the result of contour matching by energy minimization.
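The fitting step can be sketched as a simple numerical gradient descent over the six contour parameters (Python/NumPy; energy_of_params, the step size and the finite-difference gradient are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

def fit_lip_model(params, energy_of_params, lr=0.5, steps=200, eps=1e-3):
    """params: array (w0, w1, h1, h2, h3, h4); energy_of_params: callable returning
    the total contour energy E1 + E2 + E3 + E4 for a parameter vector (Equation (3))."""
    p = np.asarray(params, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(p)
        for k in range(p.size):                       # central-difference gradient
            d = np.zeros_like(p)
            d[k] = eps
            grad[k] = (energy_of_params(p + d) - energy_of_params(p - d)) / (2 * eps)
        p -= lr * grad                                # gradient descent step
    return p
```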
3
Experiment
3.1
Bimodal Database AVCC
For the Chinese lipreading experiments we recorded our own aligned audio-visual continuous Chinese utterance database, AVCC. The database has two advantages. First, it covers all the vowels and consonants of Chinese, and the syllable frequencies are close to their actual frequencies in the language. Second, the utterances are not pronounced in isolation but are cut from continuous sentences, so the image sequences form a prominent part of each utterance without redundant mouth shapes such as the closed mouth.
Table 2. Results of speaker dependent and speaker independent

                 Speaker Dependent (%)               Speaker Independent (%)
                 Clean   5dB    0dB    -5dB          Clean   5dB    0dB    -5dB
Visual           40.2    40.2   40.2   40.2          30.9    30.9   30.9   30.9
Audio            84.1    70.7   57.3   31.7          65.9    41.9   27.6   14.6
Visual-audio     87.8    84.1   74.4   51.2          81.3    63.0   55.3   37.8
The database consists of 82 Chinese utterances cut from 246 sentences (3 samples of each utterance, taken from different linguistic contexts), pronounced by 10 talkers, 5 male and 5 female. The video frame size is 256 x 256 at 25 fps, and the audio is recorded simultaneously at 11.025 kHz with 8-bit resolution. All pictures are face images against a simple background under natural light.
3.2
Results
We use a semi-continuous HMM with 6 states and 8 modes per state for training and recognition. A late integration strategy is adopted to combine the acoustic result with the visual one (a simple sketch of such a decision-level combination is given after Table 3). Extensive experiments show that performance peaks when the fused feature vector has 41 dimensions: the 4-dimensional shape feature Ps is robust to image transforms such as translation, scaling, rotation and lighting, while the 37-dimensional interior feature Pi carries important information about the inside of the mouth. Speaker dependent and speaker independent results are shown in Table 2; the SOFM-based method gives similar results and is omitted. Table 2 shows that the improvement is larger under noisy conditions. Compared with the work in [9], summarized in Table 3, the IBM system achieves a higher recognition rate for audio-only speech recognition, but the improvement obtained by integrating lipreading into automatic speech recognition is smaller than in our system. They select features by a two-level PCA from more than 6000 dimensions to obtain a final 41-dimensional visual feature vector, yet those features appear less effective than ours, which are selected from only 11 high-level and 302 low-level features to form the final 41-dimensional vector. In Table 3, AV-HiLDA(FF), AV-MS-1(DF), AV-MS-2(DF), AV-MS-PROD(DF) and AV-MS-UTTER(DF) denote fusion by audio/visual feature fusion (FF) or decision fusion (DF). Table 3. Results reported in [9]
                   Clean    Noisy
Audio-only         85.56    51.90
AV-HiLDA(FF)       86.16    63.03
AV-MS-1(DF)        85.38    63.39
AV-MS-2(DF)        85.08    61.62
AV-MS-PROD(DF)     85.81    64.79
AV-MS-UTTER(DF)    86.53    64.73
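As a hedged illustration of the late-integration step mentioned above (the weight value, the score dictionaries and the function name are illustrative assumptions, not the exact scheme used in this paper), a decision-level combination of the acoustic and visual classifier scores can be sketched as follows.

```python
def late_fusion_decision(audio_loglik, visual_loglik, lam=0.7):
    """audio_loglik, visual_loglik: dicts mapping each candidate utterance class to the
    log-likelihood of its acoustic / visual HMM; lam weights the acoustic stream."""
    fused = {c: lam * audio_loglik[c] + (1 - lam) * visual_loglik[c]
             for c in audio_loglik}
    return max(fused, key=fused.get)        # class with the highest fused score

# Example: visual evidence can overturn a noisy acoustic decision.
audio = {"ba": -120.0, "pa": -118.5}        # acoustic HMM log-likelihoods (noisy)
visual = {"ba": -40.0, "pa": -55.0}         # visual HMM log-likelihoods
print(late_fusion_decision(audio, visual))  # -> "ba"
```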
4
Conclusion
In this paper we have argued that lipreading visual features should come from the lip shape and from the pixels inside the inner lip contour. The shape feature Ps is stable and invariant to image transforms such as translation, rotation and lighting; Pi supplies additional lip movement information such as the visibility of the teeth and tongue. Experiments with the combined feature P were carried out on our bimodal database AVCC, containing 82 Chinese utterances. Under noisy conditions, lipreading improves accuracy by 19.5% (from 31.7% to 51.2%) for the speaker dependent case and by 27.7% (from 27.6% to 55.3%) for the speaker independent case. In general, the visual information improves the speech recognition rate by 10%-30%. The results show that the features extracted in this paper are effective in reducing the error rate caused by noise, and their improvement exceeds that of the IBM ASR system; the lower the SNR, the better they perform. Even with clean speech the visual information increases the recognition rate to some degree. The features extracted and selected in this paper are therefore more effective and bring real-time operation closer.
References
[1] S. Dupont, J. Luettin, "Audio-Visual Speech Modeling for Continuous Speech Recognition", IEEE Transactions on Multimedia, Vol. 2, No. 3, September 2000.
[2] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, R. Harvey, "Extraction of Visual Features for Lipreading", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 2, February 2002.
[3] M. E. Hennecke, K. V. Prasad, D. G. Stork, "Using Deformable Templates to Infer Visual Speech Dynamics", 28th Annual Asilomar Conference on Signals, Systems and Computers, Volume 1, pp. 578-582, Pacific Grove, CA, IEEE Computer Society Press, 1994.
[4] H. Yao, W. Gao, J. Li, Y. Lv, R. Wang, "Real-time Lip Locating Method for Lip-Movement Recognition", Chinese Journal of Software, 2000, 11(8): 1126-1132.
[5] G. Gravier, G. Potamianos, C. Neti, "Asynchrony modeling for audio-visual speech recognition", Proc. Human Language Technology Conference, San Diego, 2002.
[6] G. Gravier, S. Axelrod, G. Potamianos, C. Neti, "Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR", Proc. Int. Conf. Acoust. Speech Signal Process., Orlando, 2002.
[7] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, "Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop", Proc. IEEE Work. Multimedia Signal Process., Cannes, 2001.
[8] I. Matthews, G. Potamianos, C. Neti, J. Luettin, "A comparison of model and transform-based visual features for audio-visual LVCSR", Proc. IEEE Int. Conf. Multimedia Expo., Tokyo, 2001.
[9] G. Potamianos, J. Luettin, C. Neti, "Hierarchical discriminant features for audio-visual LVCSR", ICASSP, Salt Lake City, May 2001.
[10] J. Luettin, G. Potamianos, C. Neti, "Asynchronous stream modeling for large-vocabulary audio-visual speech recognition", ICASSP, Salt Lake City, May 2001.
An Evaluation of Visual Speech Features for the Tasks of Speech and Speaker Recognition Simon Lucey Advanced Multimedia Processing Laboratory Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA 15213, USA
[email protected] Abstract. In this paper an evaluation of visual speech features is performed specifically for the tasks of speech and speaker recognition. Unlike acoustic speech processing, we demonstrate that the features employed for effective speech and speaker recognition are quite different to one another in the visual modality. Area based features (i.e. raw pixels) rather than contour features (i.e. an atomized parametric representation of the mouth, e.g. outer and inner labial contour, tongue, teeth, etc.) are investigated due to their robustness and stability. For the task of speech reading we demonstrate empirically that a large proportion of word unit class distinction stems from the temporal rather than static nature of the visual speech signal. Conversely, for the task of speaker recognition static representations suffice for effective performance although modelling the temporal nature of the signal does improve performance. Additionally, we hypothesize that traditional hidden Markov model (HMM) classifiers may, due to their assumption of intra-state observation independence and stationarity, not be the best paradigm to use for modelling visual speech for the purposes of speech recognition. Results and discussion are presented on the M2VTS database for the tasks of isolated digit, speech and text-dependent speaker recognition.
1
Introduction
It is largely agreed upon that the majority of visual speech information stems from a subject's mouth [5]. The field of audio-visual speech processing (AVSP) is still in a state of relative infancy; during the period of its short existence, a majority of the work performed has been towards the goal of finding the best mouth representation for the tasks of audio-visual speech and speaker recognition. Usually, these representations are based on the techniques used to initially locate and track the mouth, due to their ability to parametrically describe the mouth in a compact enough form for use in statistical classification. This paper concentrates on the evaluation of area features as opposed to contour features, due to their robustness and stability. Area based representations are concerned with transforming the whole input region of interest (ROI) mouth
intensity image into a meaningful feature vector. Contour based representations are concerned with parametrically atomizing the mouth, based on a priori knowledge of the components of the mouth (i.e. outer and inner labial contour, tongue, teeth, etc.). In a recent paper by Potamianos et al. [9] a review was conducted between area and contour features for the tasks of speechreading on a large audio visual database. In this paper it was shown that area representations obtained superior performance. Area based representations of the mouth were shown to be robust to noise and compression artifacts and are the mouth representation of choice in current AVSP work. It is widely accepted that for acoustic speech and speaker recognition applications cepstral [11] features work well in both applications respectively. Like many aspects of acoustic speech processing, this rationale has been applied to visual speech processing applications with minimal analysis and evaluation of the validity of such an assumption in the visual modality. In this paper we explore a number of visual speech representations for the tasks of speech and speaker recognition and demonstrate that the modelling of visual speech for the tasks of speech and speaker recognition are different in terms of the features and classifiers used.
2
A Brief Review of Area Based Representations
The most common technique used to gain a holistic compact representation of a mouth is through the use of principal component analysis (PCA) [1], which attempts to find a subspace capturing the main linear modes of variation of the mouth ROI intensity image. Linear discriminant analysis (LDA) [8] generates a subspace based on a measure of class discrimination. LDA representations have become extremely useful in AV speech [10, 8] and speaker recognition applications. PCA and LDA are referred to as data driven, as they both require training observations of mouth ROI images to create their compact representation of the mouth. Other data-driven transforms have been employed on the mouth region, such as the maximum likelihood linear transform (MLLT) [10] and independent component analysis (ICA) [4], albeit with minimal improvement over traditional PCA and LDA techniques. Non-data driven transforms have also been used, such as the discrete wavelet transform (DWT) [9], discrete cosine transform (DCT) [10] or multiscale spatial analysis (MSA) [6], either directly or as a pre-processing stage for visual feature extraction. These non-data driven approaches have the benefit of not being dependent on a training ensemble, but bring minimal a priori knowledge about the mouth to the problem of visual speech and speaker recognition.
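As a hedged illustration of these data-driven subspaces (a sketch only; scikit-learn's PCA and LinearDiscriminantAnalysis are used for brevity and are not the implementation evaluated in this paper), mouth-ROI vectors could be projected as follows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def build_subspaces(X, labels, n_pca=20, n_pre=50):
    """X: (N, D) vectorised mouth-ROI intensity images; labels: class label per row
    (word classes for WLDA-style features, subject identities for SLDA-style features)."""
    pca20 = PCA(n_components=n_pca).fit(X)          # holistic PCA subspace
    pre = PCA(n_components=n_pre).fit(X)            # pre-projection before LDA
    lda = LinearDiscriminantAnalysis().fit(pre.transform(X), labels)
    return pca20, pre, lda

def project(pca20, pre, lda, y):
    """Return the PCA and LDA feature vectors for a single vectorised mouth image y."""
    y = y.reshape(1, -1)
    return pca20.transform(y)[0], lda.transform(pre.transform(y))[0]
```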
3
Evaluation of Speech Features
The actual evaluation of visual speech features is not an easy task as an inherent problem with extracting speech features is in getting an accurate measure of how well a given speech feature works when compared against another. Generally an
accurate measure of the quality of visual features is indicative of how well it performs in the task it is being used for, which in this case is visual speech and text dependent speaker recognition. As previously mentioned, only area features shall be investigated in this paper due to their robustness and ability to holistically represent the mouth. Data-driven feature extraction approaches were investigated solely in this evaluation due to their natural ability to bring a priori knowledge of the mouth to the representation. For purposes of notation the mouth image matrix I(x, y) is expressed as the vectorized column vector y = vec(I). The tasks of speech and speaker recognition were tested with the following visual features, PCA: in which PCA was used to create a twenty dimensional subspace ΦP CA preserving the 20 highest linear modes of mouth variation. This feature extraction approach was employed for both speech and speaker recognition. SLDA: in which LDA was used to create a twenty dimensional subspace ΦSLDA for the speaker recognition task using a priori knowledge of the subject classes to generate the 20 most discriminant basis vectors. MRPCA: in which the mean removed mouth sub-image y∗ is calculated from a given temporal mouth sub-image sequence Y = {y1 , . . . , yT } such that, yt∗ = yt − y,
where $\bar{y}$ is the temporal mean, $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$.  (1)
This approach is very similar to cepstral mean subtraction [11] used on acoustic cepstral features to improve recognition performance by providing some invariance to unwanted variations. In the visual scenario this unwanted variation usually stems from subject appearance. Mean-removal PCA (MRPCA) uses these newly adjusted y* mouth sub-images to create a new twenty dimensional subspace ΦMRPCA preserving the 20 highest modes of mean removed mouth variation. This approach was first proposed by Potamianos et al. [9] for improved visual speech recognition performance. WLDA: in which LDA was used to create a nine dimensional subspace ΦWLDA for the speech recognition task using a priori knowledge of the word classes to generate the 9 most discriminant basis vectors. Mean removal, similar to the approach used for MRPCA, was first employed to remove unwanted subject variances from the WLDA feature extraction process. A compact representation of the mouth sub-image y can be obtained by the linear transform

$$o = \Phi^{\top} y \qquad (2)$$
such that o is the compactly represented visual speech observation feature vector. Illumination invariance was obtained by normalising the vectorised mouth intensity sub-image y to a zero-mean unit-norm vector. For the generation of the LDA subspaces, PCA was first employed to preserve the first 50 linear modes of variation, in order to remove any low energy noise that may
corrupt classification performance. For all subspaces, shots one to three of the M2VTS [7] database were used as training mouth observations, with shot four being used for testing in the speech and speaker recognition tasks. In all cases delta (i.e. first order derivative) features were appended to static features. 3.1
Training of Hidden Markov Models
Hidden Markov models (HMMs) were used to model the video utterances using HTK ver 2.2. [11]. The first three shots of the M2VTS database were used to train the visual HMMs with shot four being used for testing. The database consisted of 36 subjects (male and female) speaking four repetitions (shots) of ten French digits from zero to nine. In the task of speech recognition the word error rate (WER) was used as a measure of performance for the ten digits being recognized in the M2VTS database. Speaker recognition encapsulates two tasks, namely speaker identification and verification. Speaker error rate (SER) was used to gauge the effectiveness of visual features for speaker identification. The SER metric was deemed useful enough for gauging the effectiveness of visual features in speaker recognition as good performance in the speaker identification task generally translates well for the verification task. Due to the relatively small size of the M2VTS database and the requirement for separate speaker dependent digit HMMs, all speaker dependent HMM digit models were trained by initializing training with the previously found speaker independent or background digit model. This approach prevented variances in each model becoming too small and allows each model to converge to sensible values for the task of text dependent speaker recognition. 3.2
Speech Recognition Performance
Table 1 shows the WER for the task of digit recognition on the M2VTS database. Raw PCA features have the worst WER performance out of all the visual features evaluated. There is little difference between the MRPCA and WLDA area representations of the mouth in terms of WER at the normal video sample rate of 40ms, with WLDA visual features performing slightly better. Acoustic MFCC features were also evaluated in Table 1 for comparison with their visual counterparts. The train and test sets of each feature type were evaluated in terms of WER. The difference between train and test WERs is very important as this gives an indication of how undertrained a specific speech recognition classifier is using a certain type of feature [2]. The train WER is also very important as it gives a rough estimate of the lower Bayes error for that feature representation, with the test WER giving an estimate of the upper Bayes error. Both train and test errors are essential to properly evaluate a feature set. There are very large differences between train and test WERs for all visual feature sets in comparison to the differences seen in the acoustic MFCC feature set. Additionally, the test WERs for all visual features are quite large, which is in stark contrast to the acoustic MFCCs which received negligible error. This may indicate the inherent variability of the chosen visual features is higher than
Table 1. WER rates for train and test sets on the M2VTS database (note best performing visual features have been highlighted)

Features  Dim  Sampling  Mixtures  States  Train WER(%)  Test WER(%)
PCA        40   40ms       3         3        14.19         31.43
PCA        40   10ms       3         3        21.43         39.71
PCA        40   10ms       3         9         8.07         28.57
MRPCA      40   40ms       3         3         9.71         25.71
MRPCA      40   10ms       3         3        13.52         30.57
MRPCA      40   10ms       3         9         5.33         23.14
WLDA       18   40ms       3         3        10.38         23.43
WLDA       18   10ms       3         3        17.11         33.43
WLDA       18   10ms       3         8        12.76         28.57
MFCC       26   10ms       3         3         1.44          1.62
those found in conventional acoustic features, or that the visual features do not provide enough distinction between word classes using a standard HMM classifier. Similar results were received by Cox et al. [2] pertaining to the undertrained nature of standard HMM based visual speech recognition classifiers. Initially, one may assume the undertrained nature of the visual HMM classifiers may be attributed to the acoustic modality having four times as many training observations as the visual modality. This is due to the acoustic speech signal being sampled at 10ms intervals, with the visual speech signal being sampled at a coarser 40ms interval. To partially remedy this situation, the visual features were up-sampled to 10ms intervals using simple linear interpolation (see the footnote below). Inspecting Table 1 one can see that the WER increases when testing is performed on the interpolated visual features using the same topology (i.e. number of states and mixtures) HMM classifier for all visual feature types. However, when the number of HMM states is increased the WER performance of all interpolated visual features improves. For PCA and MRPCA representations the WER actually surpasses those seen at normal sample rates. The interpolated MRPCA based HMM classifier with extra states receives a WER that marginally surpasses that for the normally sampled WLDA classifier. Additionally, the train WER for the interpolated MRPCA classifier, with extra states, is half of that for the normally sampled WLDA classifier, indicating that the increase in classifier complexity may provide additional word class distinction. The interpolated WLDA features, using an increased number of states, still receive a poorer WER than realised with the originally sampled WLDA features with fewer states. The lack of performance improvement in the WLDA representations, using interpolation with an increased number of states, indicates that some vital discriminative information pertaining to the temporal nature of the utterance is being thrown
Footnote: Interpolation of visual features occurred prior to the calculation of delta features, which were used in all experiments. It must be noted that when interpolation was employed on static and previously calculated delta visual features minimal change in WER was experienced.
away in comparison to the PCA and MRPCA representations. This could be attributed to the majority of discriminatory information between words being contained in the temporal nature of the pronunciation not the static appearance. A major drawback in WLDA feature extraction seems to stem from its inability to form a discriminative subspace based on the dynamic, not just static, nature of the signal. Potamianos et al. [8] devised an approach to circumvent this limitation by incorporating contextual information about adjacent frames into the construction of a discriminative subspace. Although showing some improvement, this approach fails to address some of the fundamental problems associated with using a standard HMM classifier for speech reading. The performance improvement from the interpolation of PCA and MRPCA features along with the increase in HMM states for their respective HMM classifiers can be considered to be counter intuitive, as no extra information is being added to the interpolated visual features apart from the delta features which are dependent on the sample rate of the signal. The benefit of interpolating visual features can be understood from work done by Deng [3] concerning standard HMM based speech recognition. Deng has argued that the use of many states in a standard HMM can approximate continuously varying, non-stationary, patterns in a piecewise constant fashion. Further, it was found in previous acoustic speech recognition work [3], that as many as ten states are needed to model strongly dynamic speech segments in order to achieve a reasonable recognition performance. Similar results were found by Matthews et al. [6] for visual speech recognition where as many as nine states were required, after visual feature interpolation, to achieve reasonable WERs. It has been postulated by Deng [3] that employing extra states in a standard HMM to better model the non-stationary dynamic nature of a signal in a piece-wise manner has obvious shortcomings. This is due to the many free and largely independent parameters needing to be found by the addition of extra states which requires a large amount of training observations for reliable classification. The problems concerning the lack of training observations can be partially combated through the interpolation. Such trends can however, be much more effectively and accurately described by simple deterministic functions of time which require a very small number of parameters, as opposed to using many HMM states to approximate them piecewise constantly. This indicates that, unlike the acoustic modality, the use of a standard HMM may be suboptimal for the purposes of modelling the non-stationary nature of the visual speech modality effectively for speech recognition. 3.3
Speaker Recognition Performance
Table 2 shows the SER for the task of text dependent speaker identification. The use of SLDA in this instance is of considerable benefit over the traditional PCA representation of the mouth. Intuitively, this makes considerable sense as a person’s identity can be largely represented by the static representation of that person’s mouth. This result differs to those found in visual speech recognition, which found the discriminant nature of WLDA to be of limited use due
Table 2. SER for train and test sets on the M2VTS database (note best performing visual features have been highlighted)

Features  Dim  Sampling  Mixtures  States  Train SER(%)  Test SER(%)
PCA        40   40ms       2         2        0.38          28.00
PCA        40   10ms       2         2        0.67          28.29
SLDA       40   10ms       2         2        0.19          19.71
SLDA       40   40ms       2         2        0.19          19.71
MFCC       26   10ms       3         2        0.00           9.72
to the majority of the class distinction between words existing in the temporal correlations in an utterance rather than the static appearance of the mouth. The up-sampling of visual features was also investigated, but from an exhaustive search through HMM topologies, there was no improvement in SER from the optimal topologies used at the normally sampled rates. This result can be attributed to two things. Firstly, there is an inherent lack of training observations for generating a subject dependent digit HMM, making the generation of suitably complex HMMs difficult. Secondly, the piece-wise temporal approximation made by a standard HMM suffices for the task of visual speaker recognition due to its natural ability to discriminate based on static features, as indicated by the superior performance of SLDA over PCA features. Interestingly, the performance of the acoustic and visual classifiers are relatively close, with both classifiers being marginally undertrained. This result was to be expected due to the lack of training data associated with each subject and digit.
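A sketch of the text-dependent identification step described above is given below (Python; the score() interface and the helper names are assumptions, e.g. an hmmlearn-style model, and this is not the HTK setup actually used in the experiments).

```python
def identify_speaker(obs, speaker_models):
    """obs: (T, d) visual feature sequence for one digit utterance;
    speaker_models: dict speaker_id -> trained HMM exposing score(obs) -> log-likelihood
    (each assumed to be initialised from the background digit model)."""
    scores = {spk: model.score(obs) for spk, model in speaker_models.items()}
    return max(scores, key=scores.get)                 # maximum-likelihood speaker

def speaker_error_rate(test_set, speaker_models):
    """test_set: list of (obs, true_speaker) pairs; returns SER in percent."""
    wrong = sum(identify_speaker(obs, speaker_models) != spk for obs, spk in test_set)
    return 100.0 * wrong / len(test_set)
```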
4
Discussion
In this paper feature extraction techniques for the visual speech modalities, pertaining to the tasks of speech and speaker recognition, were evaluated. For speechreading it was shown that MRPCA mouth features, at an interpolated sample rate, gave superior WERs over all those evaluated. Although, WLDA features, based on a static discriminant space, perform almost as well and do not require interpolation and have a much smaller dimensionality. For both feature sets the benefit of mean subtraction was shown, with the improved performance being linked to unwanted subject variabilities being removed. An interesting point was also raised about the validity of using a standard HMM for speech recognition in the visual modality, as the quasi stationary assumption made for the acoustic modality does not seem to hold as well in the visual modality. Visual speaker recognition achieved excellent results using the SLDA mouth feature. This can be attributed to the more static nature of the speaker recognition task, which is easily accommodated by the LDA feature extraction procedure and standard HMM topology.
References [1] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. In International Conference on Accoustics, Speech and Signal Processing, Adelaide, Australia, 1994. 261 [2] S. Cox, I. Matthews, and J. A. Bangham. Combining noise compensation with visual Information in speech recognition. In Auditory-Visual Speech Processing, Rhodes, 1997. 263, 264 [3] L. Deng. A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal. Signal Processing, 27:65–78, 1992. 265 [4] M. S. Gray, J. R. Movellan, and T. J. Sejnowski. A comparison of local versus global image decompositions for visual speechreading. In 4th Joint Symposium on Neural Computation, pages 92–98, 1997. 261 [5] F. Lavagetto. Converting speech into lip movements: A multimedia telephone for hard hearing people. IEEE Transactions on Rehabilitation Engineering, 3(1):90– 102, March 1995. 260 [6] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham. Lipreading using shape, shading and scale. In Auditory-Visual Speech Processing, pages 73–78, Sydney, Australia, 1998. 261, 265 [7] S. Pigeon. The M2VTS database. Laboratoire de Telecommunications et Teledection, Place du Levant, 2-B-1348 Louvain-La-Neuve, Belgium, 1996. 263 [8] G. Potamianos and H. P. Graf. Linear discriminant analysis for speechreading. In IEEE Second Workshop on Multimedia Signal Processing, pages 221–226, 1998. 261, 265 [9] G. Potamianos, H. P. Graf, and E. Cosatto. An image transform approach for HMM based automatic lipreading. In International Conference on Image Processing, volume 3, pages 173–177, 1998. 261, 262 [10] G. Potamianos, J. Luettin, and C. Neti. Hierarchical discriminant features for audio-visual LVCSR. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 165–168, 2001. 261 [11] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book (for HTK version 2.2). Entropic Ltd., 1999. 261, 262, 263
Feature Extraction Using a Chaincoded Contour Representation of Fingerprint Images Venu Govindaraju1 , Zhixin Shi, and John Schneider2 1
2
Center of Excellence for Document Analysis and Recognition (CEDAR) State University of New York at Buffalo, Buffalo, NY 14228, U.S.A.
[email protected] http://www.cedar.buffalo.edu Ultra-Scan Corporation, 4240 Ridge Lea Rd, Amherst, New York 14226
[email protected] http://www.ultra-scan.com
Abstract. A feature extraction method using the chaincode representation of fingerprint ridge contours is presented for use by Automatic Fingerprint Identification Systems. The representation allows efficient image quality enhancement and detection of fine feature points called minutiae. Enhancement is accomplished by binarization and smoothing followed by estimation of the ridge contours field of flow. The original gray scale image is then enhanced using connected component analysis and a dynamic filtering scheme that takes advantage of the knowledge gained from the estimated direction flow of the contours. The minutiae are generated using a sophisticated ridge contour following procedure. Visual inspection of several hundred images indicates that the method is very effective.
1
Introduction
Automatic Fingerprint Identification System (AFIS) is an important biometric technology. Fingerprint images can be obtained from ink impressions or by direct live scanning of the fingerprints by sensors [13] such as with ultrasound technology [9]. Feature (minutiae) extraction is a key step in accurate functioning of any AFIS. Due to imperfections of the image acquisition process, minutiae extraction methods are prone to missing some real minutiae while picking up spurious points (artifacts) [3, 8]. Image imperfections can also cause errors in determining the location coordinates of the true minutiae and their relative orientation in the image. Most feature extraction algorithms described in the literature extract minutiae from a thinned skeleton image that is generated from a binarized fingerprint image. Thinning is a lossy and computationally expensive operation and the accuracy of the output skeletal representation varies for different algorithms. In this paper we introduce the use of chaincode representation as an efficient alternative for processing fingerprint images. It circumvents most of the problems associated with thinning and skeleton images. The first step is to binarize the
fingerprint image (section 2.1). The next step averages neighboring pixels to generate smooth chaincodes without introducing spurious breaks in contours (section 2). This is important because an end point of a ridge contour is a vital minutia point. The ridge flow field is estimated from a subset of selected chaincodes as described in section 2.2. The original gray scale image is enhanced using a dynamically oriented filtering scheme together with the estimated direction field information (section 2.3). The enhanced fingerprint image henceforth can be used for all subsequent processing. The algorithm for extracting minutiae using chaincode contours of the enhanced images is described in section 3. Some experimental results using NIST datasets are presented in section 4. The chaincode representation is procedurally described as follows. Given a binary image, it is scanned from top to bottom and right to left, and transitions from white (background) to black (foreground) are detected. The contour is then traced counterclockwise (clockwise for interior contours) and expressed as an array of contour elements (Figure 1(a)). Each contour element represents a pixel on the contour, contains fields for the x,y coordinates of the pixel, the slope or direction of the contour into the pixel, and auxiliary information such as curvature. The slope convention used by the algorithms described is as shown in Figure 1(b).
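A minimal sketch of the contour-element array and the slope convention is given below (Python; the field layout mirrors Figure 1, but the exact code-to-direction mapping and all names are our assumptions, not the authors' data structures).

```python
from dataclasses import dataclass

# Assumed eight-direction slope convention in the spirit of Figure 1(b): code -> (dx, dy).
SLOPE_STEPS = {0: (1, 0), 1: (1, 1), 2: (0, 1), 3: (-1, 1),
               4: (-1, 0), 5: (-1, -1), 6: (0, -1), 7: (1, -1)}

@dataclass
class ContourElement:
    x: int            # pixel column
    y: int            # pixel row
    slope: int        # direction code (0-7) of the contour into this pixel
    curvature: float  # auxiliary information, e.g. local turning

def to_chaincode(pixels):
    """Convert an ordered list of traced contour pixels (x, y) into contour elements."""
    inverse = {step: code for code, step in SLOPE_STEPS.items()}
    elements = []
    for (x0, y0), (x1, y1) in zip(pixels, pixels[1:]):
        code = inverse[(x1 - x0, y1 - y0)]      # assumes 8-connected consecutive pixels
        elements.append(ContourElement(x1, y1, code, 0.0))
    return elements
```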
2
Fingerprint Image Enhancement
Direct binarization using standard techniques renders images unsuitable for extraction of fine and subtle features such as minutiae points. Therefore it is necessary to: (i) improve the clarity of ridge structures of fingerprint images (ii) maintain their integrity, (iii) avoid introduction of spurious structures or artifacts, and
[Figure 1: (a) a chain-coded contour stored as an array of contour elements, each with fields x, y, slope, status, mode and curvature; (b) the slope convention, with the eight contour directions numbered 0-7.]
Fig. 1. Chain code contour representation: (a) contour element, (b) slope convention. Data field in the array contains positional and slope information of each component of the traced contour. Properties stored in the information fields are: coordinates of bounding box of a contour, number of components in the corresponding data fields, area of the closed contour, and a flag which indicates whether the contour is interior or exterior
(iv) retain the connectivity of the ridges while maintaining separation between ridges. There are two types of fingerprint image enhancement methods described in the literature; those that work on binarized images and those that work on gray-scale images [6, 5, 10]. The binarization-based methods require a specially designed binarization algorithm to ensure the quality of the resultant images so that the connectivity information lost during binarization can be at least partially recovered. The gray-scale based methods start with a direction field that captures the local orientation information of the ridge contours, followed by the application of a bank of filters to improve the quality of the image [2]. The directional field itself is typically computed by the gradient method. However, computation of the gradients is inefficient and lacks robustness in noisy images. The method presented in this paper combines aspects of both approaches described above. We first use a local-global binarization algorithm to obtain a binary fingerprint image that is of sufficient quality to retain and discern the ridges, and maintain local orientations. However, some of the ridge contours might fragment during this process. The local directional field is estimated using a fast chaincode-base algorithm [4] and is localized by the use of a 15×15 mask. The tradeoffs that affect the size of this mask are as follows. Larger masks retain the orientation while compromising the integrity of the ridges. To enhance the fingerprint image we apply a simple anisotropic filter on the gray-scale image. This method is similar to the one proposed in [1]. The filter is adaptive and has an elliptical shape. It is applied to the fingerprint image with it major axis aligned parallel to the local ridge direction. Since the shape of the filter is controlled by the estimated local ridge orientation, we avoid the need for computing local ridge frequency which is required by most filtering algorithms [2]. 2.1
Binarization
Our binarization algorithm is tuned for efficiency. Experiments on images from the NIST fingerprint image database DB 4 show that a binarization algorithm using a single global threshold cannot give satisfactory results. Noise in inked fingerprints produces non-uniform ink density, non-printed areas, and the presence of stains and noise. To overcome these problems posed by the presence of noise, we apply a simple global threshold algorithm in each partitioned local area of size 15x15 pixels. Within this small local area the pixel density does not vary significantly, allowing the rendering of distinct ridge contours without much blurring. In order to obtain smooth edges on the ridge contours, a 3x3 mask is applied to the gray-scale image as a quick equalization process before initiating the local-global thresholding described. Methods described in the literature use contrast enhancement or mean and variance based image normalization [1, 6, 12], which cause interference between sweat pores and ridge edges. For minutiae based feature extraction methods, it is preferable that the sweat pores be treated as noise and eliminated. Figure 2(b) shows the binary fingerprint image obtained by the method described.
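A sketch of this local-global thresholding is shown below (Python/NumPy; the per-block threshold used here is the block mean, which is an assumption since the exact per-block rule is not specified, and ridges are assumed darker than the background).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def binarize_fingerprint(gray, block=15):
    """Local-global binarization: a 3x3 smoothing pass, then one global threshold
    applied independently inside every block x block tile of the smoothed image."""
    img = uniform_filter(gray.astype(float), size=3)      # quick 3x3 equalisation
    out = np.zeros_like(img, dtype=np.uint8)
    h, w = img.shape
    for r in range(0, h, block):
        for c in range(0, w, block):
            tile = img[r:r + block, c:c + block]
            thr = tile.mean()                             # assumed per-tile threshold
            out[r:r + block, c:c + block] = (tile < thr)  # ridges assumed dark
    return out
```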
Fig. 2. Direction field computed from chaincode: (a) enhanced gray-scale image, (b) binary image, and (c) direction field image generated using chaincode representation and contour following
2.2
Orientation Field Using Chaincode
Chaincode representation of object contours is extensively used in document analysis and recognition research [4]. It has proven to be an efficient and effective representation especially for handwritten documents. Unlike the thinned skeletons, chaincode is a lossless representation in the sense that the pixel image can be fully recovered from its chaincode representation. The chaincode captures the contour boundary information from the edges of the fingerprint ridges. Tracing the chaincode contour provides the local ridge direction at each boundary pixel. To calculate the direction field for the local ridge orientations, we divide the image into 15×15 pixel blocks and use the ridge directions to estimate the ridge orientation in each block. Following are the algorithmic steps. a. Filter the small components that could be from noise or other fragments using the width of the ridges as a guide for estimating the threshold under which components are likely to be noise. b. End points are detected (section 3) and it is determined if they are actual ridge ending minutiae. These points are not used in the computation of the direction flow field as directions around end points can be ambiguous. Figure 2(a) shows the direction field image generated from the chaincode image. Other direction field estimation algorithms described in the literature compute the gradient at every pixel [1, 2, 12]. The method described using chaincode is more efficient and accurate (section 4). 2.3
Enhancement Using Anisotropic Filter
Fingerprint minutiae extraction algorithms depend on the quality of binarization. Imperfections in binarization often lead to broken ridges or touching ridges which in turn create spurious points. To maintain the separation between ridges one could lower the threshold level in binarization. Our approach equalizes the
272
Venu Govindaraju et al.
pixel values within the same ridges by raising the gray-values of the uneven pixels inside the ridges. Specifically, we use a directional anisotropic filter that has an elliptical shape with its major axis aligned parallel to the local ridge direction. The filter smoothes the pixels along the ridge direction as opposed to the direction across the ridges. A structure-adaptive anisotropic filtering technique has been used by previous researchers for image filtering [1, 14]:

$$H(\mathbf{x}_0, \mathbf{x}) = V + S\,\rho(\mathbf{x}-\mathbf{x}_0)\,\exp\!\left(-\left[\frac{((\mathbf{x}-\mathbf{x}_0)\cdot\mathbf{n})^2}{\sigma_1^2(\mathbf{x}_0)} + \frac{((\mathbf{x}-\mathbf{x}_0)\cdot\mathbf{n}_\perp)^2}{\sigma_2^2(\mathbf{x}_0)}\right]\right)$$

where n and n⊥ are mutually orthogonal unit vectors and n is parallel to the ridge direction. The shape of the kernel is controlled by σ1²(x0) and σ2²(x0). The region constraint ρ satisfies ρ(x) = 1 when |x| < r, where r is the maximum support radius. Two additional parameters, S and V, are for phase intensity control and control of peripheral pixels (near the outskirts of the kernel) respectively. As per [1], we take V = -2 and S = 10 in our experiments. σ1²(x0) and σ2²(x0) control the shape of the Gaussian kernel. As functions of x0 they should be estimated using the frequency information around x0, but the filter is not sensitive to their values as long as σ2²(x0) is of the order of the average ridge width. For our experiments we also set σ1²(x0) = 4 and σ2²(x0) = 2 [1]. Figure 2(b) shows the enhanced fingerprint image and Figure 2(c) shows the binary fingerprint image obtained from the enhanced image.
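The kernel can be written down directly; the sketch below (Python/NumPy, using the parameter values quoted above; the window radius is an assumption) builds H(x0, ·) for a given local ridge direction.

```python
import numpy as np

def anisotropic_kernel(theta, radius=6, sigma1_sq=4.0, sigma2_sq=2.0, S=10.0, V=-2.0):
    """Oriented anisotropic kernel H(x0, x) for local ridge direction theta (radians)."""
    n = np.array([np.cos(theta), np.sin(theta)])          # along the ridge
    n_perp = np.array([-np.sin(theta), np.cos(theta)])    # across the ridge
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    d = np.stack([xs, ys], axis=-1).astype(float)         # displacement x - x0
    rho = (np.hypot(xs, ys) < radius).astype(float)       # region constraint rho
    quad = (d @ n) ** 2 / sigma1_sq + (d @ n_perp) ** 2 / sigma2_sq
    return V + S * rho * np.exp(-quad)
```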
3
Minutiae Extraction Using Chaincode
Most of the fingerprint minutiae extraction methods are thinning-based by which the skeletonization process converts each ridge contour to one pixel wide. The minutiae points are detected by tracing the thin ridge contours. When the trace stops, an end point is marked. Bifurcation points are those with more than two neighbors [12]. In practice, thinning methods have been found to be sensitive to noise and the skeleton structure does not match up with the intuitive expectation. The alternate method of using chaincoded contours is presented here. The direction field estimated from chaincode gives the orientation of the ridges and information on any structural imperfections such as breaks in ridges, spurious ridges and holes. The standard deviation of the orientation distribution in a block is used to determine the quality of the ridges in that block. For example in Figure 2 the directions of the ridges at the bottom of the image are misleading. We have used contour tracing in other handwriting recognition applications [4, 11]. We consistently trace the ridge contours of the fingerprint images in a counter-clock-wise fashion. When we arrive at a point where we have to make a sharp left turn we mark a candidate for a ridge ending point. Similarly when we arrive at a sharp right turn, the turning location marks a bifurcation point (Figure 3 (a)).
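A sketch of this candidate-marking pass over a traced contour is given below (Python; the two turn predicates are assumed to implement the significance test described after Figure 3, and the function names are ours).

```python
def mark_turn_candidates(contour, is_sharp_left, is_sharp_right):
    """contour: ordered (x, y) points traced counter-clockwise around a ridge.
    is_sharp_left / is_sharp_right: predicates over (contour, index), e.g. built on
    the S(P_in, P_out) test described in the next section."""
    endings, bifurcations = [], []
    for i in range(len(contour)):
        if is_sharp_left(contour, i):
            endings.append(contour[i])        # sharp left turn -> ridge-ending candidate
        elif is_sharp_right(contour, i):
            bifurcations.append(contour[i])   # sharp right turn -> bifurcation candidate
    return endings, bifurcations
```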
[Figure 3: sketches of (a) sharp left and right turns along a chain-coded contour, and (b) the normalised vectors P_in = (x1, y1) and P_out = (x2, y2) together with the thresholding line.]
Fig. 3. (a) Minutiae location in chaincode contours, (b) the distance between the thresholding line and the y-axis gives a threshold for determining a significant turn

To determine the significant left and right turning contour points from among the candidates marked during the trace, we compute the vector Pin leading in to the candidate point P from several of its preceding contour points and the vector Pout going out of P to several subsequent contour points. These vectors are normalized and placed in a Cartesian coordinate system with Pin along the x-axis (Figure 3(b)). The turning direction is determined by the sign of

$$S(P_{in}, P_{out}) = x_1 y_2 - x_2 y_1$$

where S(Pin, Pout) > 0 indicates a left turn and S(Pin, Pout) < 0 indicates a right turn. A threshold T is then selected such that any significant turn satisfies the condition

$$x_1 x_2 + y_1 y_2 < T.$$

Since the threshold T is the x-coordinate of the thresholding line in Figure 3(b), it can be empirically determined to be a number close to zero. This ensures that the angle θ made by Pin and Pout is close to or less than 90°. A turning point location is typically made up of several contour points, and we define the location of a minutia as the center point of this small group of turning pixels. The minutiae density per unit area is not allowed to exceed a certain value: if candidate minutiae form a cluster whose density exceeds this value, all candidates in the cluster are replaced by a single minutia located at the center of the cluster.
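The turn test itself can be sketched as follows (Python/NumPy; the neighbourhood length k and the exact threshold value are assumptions, not the values used in the experiments).

```python
import numpy as np

def turn_type(points, idx, k=4, T=0.1):
    """Classify the contour point at index idx as 'left', 'right' or None.
    points: (N, 2) array of closed-contour coordinates; k: number of neighbours used
    for P_in and P_out; T: dot-product threshold close to zero."""
    p = points[idx]
    p_in = p - points[idx - k]                     # vector leading into the point
    p_out = points[(idx + k) % len(points)] - p    # vector leading out of the point
    p_in = p_in / (np.linalg.norm(p_in) + 1e-9)
    p_out = p_out / (np.linalg.norm(p_out) + 1e-9)
    if np.dot(p_in, p_out) >= T:                   # x1*x2 + y1*y2 < T must hold
        return None                                # not a sharp enough turn
    s = p_in[0] * p_out[1] - p_in[1] * p_out[0]    # S(P_in, P_out) = x1*y2 - x2*y1
    return "left" if s > 0 else "right"            # left: ridge ending, right: bifurcation
```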
4
Experimental Results
Experiments are underway with the NIST datasets. Using the Goodness Index (GI) (Equation 1) described in [7, 2] we compute the goodness of the minutiae detected:

$$GI = \frac{\sum_{i=1}^{r} q_i (p_i - d_i - i_i)}{\sum_{i=1}^{r} q_i t_i} \qquad (1)$$
Fig. 4. Example images showing our preliminary test. Original gray-scale fingerprint image and corresponding direction field images generated from chaincode representation shown on the top and the corresponding candidate minutiae detected using the chaincode based minutiae extraction method on another fingerprint image
where r is the total number of 15×15 image blocks; pi , is the number of minutiae paired in the ith block; di is the number of missing minutiae according to the algorithm in the ith block; ii , is the number of spuriously inserted minutiae generated by the algorithm in the ith block; ti is the true number of minutiae in the ith block; and qi is a factor which represents the image quality in the ith block (good=4, medium=2, poor=1). A high value of GI indicates a high degree of reliability of the extraction algorithm. The maximum value, GI = 1, is reached when all true minutiae are detected and no spurious minutiae are generated. Our test on a few hundred NIST images shows the GI index range from 0.25 to 0.70. Figure 4 shows some examples from our preliminary tests.
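A direct transcription of Equation (1) is given below (Python; the per-block tuples are an assumption about how the counts might be stored).

```python
def goodness_index(blocks):
    """blocks: iterable of per-block tuples (q, p, d, i, t) where q is the quality
    factor (good=4, medium=2, poor=1), p the paired, d the missed, i the spurious
    and t the true number of minutiae in that block."""
    num = sum(q * (p - d - i) for q, p, d, i, t in blocks)
    den = sum(q * t for q, p, d, i, t in blocks)
    return num / den if den else 0.0

# Example with two good blocks and one poor block (illustrative counts only).
print(goodness_index([(4, 5, 1, 0, 6), (4, 3, 0, 1, 3), (1, 2, 1, 1, 4)]))  # -> 0.6
```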
5
Conclusion
This paper describes novel use of the chaincode image representation for the purpose of fingerprint image enhancement and minutiae extraction. This method
is more efficient and accurate when compared to thinning based methods with the additional advantage of being a lossless representation.
Acknowledgments We would like to thank Chaohong Wu and Tsai-Yang Jea for their assistance in implementing some of the techniques described in this paper.
References [1] Greenberg,S. Aladjem,M. and Kogan,D.: Fingerprint Image Enhancement using Filtering Techniques, Real-Time Imaging 8 ,227-236 (2002) 270, 271, 272 [2] Hong,L., Wan,Y.and Jain,A.: Fingerprint Image Enhancement: Algorithm and Performance Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 20: 777 -789. 1998 270, 271, 273 [3] Jain,A., Hong,L. and Bolle,R.: On-Line Fingerprint Verification. IEEE-PAMI, Vol.19, No.4, pp. 302-314, Apr. 1997. 268 [4] Madhvanath,S., Kim,G., and Govindaraju,V.: Chain Code Processing for Handwritten Word Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21: 928-932, 1997. 270, 271, 272 [5] Maio,D. and Maltoni,D.: Direct Gray-Scale Minutiae Detection in Fingerprints. IEEE Transactions on Pattern Analysis and Machine Intelligence 19: 27 -40, 1997 270 [6] O’Gorman,L.and Nickerson,J. V.: An Approach to Fingerprint Filter Design. Pattern Recognition 22 :29 -38, 1989 270 [7] Ratha,N. K., Chen,S., and Jain,A.: Adaptive Flow Orientation-Based Feature Extraction in Fingerprint Images. Pattern Recognition, Vol. 28, No. 11, pp. 1657-1672, Nov. 1995. 273 [8] Ratha,N., Karu,K., Chen,S., and Jain,A.: A Real Time Matching System for Large Fingerprint Databases. IEEE-PAMI, Vol. 18, No. 8, pp. 799-813, Aug. 1996. 268 [9] Schneider,J. K., and Glenn,W. E.: Surface Feature Mapping Using High Resolution C-Span Ultrasonography. US Patent 5587533, 1996. 268 [10] Sherlock,D., Monro,D. M. and Millard,K.: Fingerprint Enhancement by Directional Fourier Filtering. IEEE Proceedings on Visual Imaging Signal Processing, 141: 87-94, 1994 270 [11] Shi,Z., and Govindaraju,V.: Segmentation and Recognition of Connected Handwritten Numeral Strings. Journal of Pattern Recognition, Pergamon Press, Vol. 30, No. 9, pp.1501-1504, 1997. 272 [12] Simon-Zorita,D, Ortega-Garcia,J., Cruz-Llanas,S. Sanchez-Bote,J. L., and GlezRodriguez,J.: An Improved Image Enhancement Scheme for Fingerprint Minutiae Extraction in Biometric Identification, Proc. of 3rd International Conference on Audio- and Video- Based Biometric Person Authentication AVBPA’01 (Halmstad, Sweden, June 2001), J.Bigun and F.Smeraldi Eds, LNCS2091, pp. 217-222 270, 271, 272 [13] Xia,X. and O’Gorman, L.: Innovations in fingerprint capture devices. Journal of Pattern Recognition, Pergamon Press, Vol. 36, No. 2, pp. 361-370, 2002. 268 [14] Yang,G. Z., Burger,P., Firmin,D. N. and Underwood,S. R.: Structure Adaptive Anisotropic Filtering.Image and Vision Computing 14: 135-145, 1996 272
Hypotheses-Driven Affine Invariant Localization of Faces in Verification Systems M. Hamouz1 , J. Kittler1 , J. K. Kamarainen2 , and H. Kälviäinen2 1
Centre for Vision, Speech and Signal Processing University of Surrey, United Kingdom {m.hamouz,j.kittler}@eim.surrey.ac.uk 2 Laboratory for Information Processing Lappeenranta University of Technology, Finland {jkamarai,Heikki.Kalviainen}@lut.fi
Abstract. We propose a novel framework for localizing human faces in client authentication scenarios based on correspondences between triplets of detected Gabor-based local features and their counterparts in a generic affine invariant face appearance model. The method is robust to partial occlusion, feature detector failure and copes well with cluttered background. The method was tested on the BANCA database and produced promising results.
1
Introduction
The accuracy of face detection and localization significantly influences the overall performance of face recognition systems and therefore this topic attracts great attention from researchers and companies. In spite of the considerable past research effort, face detection and localization still remain a challenging problem because faces are non-rigid and have a high degree of variability in shape, colour and texture. The main contribution of this paper is a novel localization framework which dichotomizes the face detection problem into the modelling of the face-class variability and dealing with the imaging effects. This algorithm produces not only the location of faces found in the image, but also the positions of facial features which are an inherent part of the output. A client specific template is used in a final step to make sure the best facial hypothesis out of the ordered list given by a generic model is chosen.
2
State of the Art
In order to detect a face, a model of the face instance in the image has to be created. The construction of a general model of the face class has been tackled in the literature in basically two ways: the global-appearance modelling approach and the local-operator approach.
This work was supported by the EU Project BANCA [1].
Global-Appearance-Based Detection In this approach the face class is modelled as a cluster in a high-dimensional space where the separation from a nonface class is carried out using various classifiers. Huge training sets are required to learn the decision surface reliably. The imaging effects (scale, rotation, perspective) are removed in the upper-level of the system by using a so called “sliding window”. The concept of a sliding window is the root idea of these methods. To remove imaging effects, exhaustive scanning with the window has to be carried out in multiple scales and rotations. This has huge implications for the model of the face class. Since it is not possible to scan all possible scales and rotations (which are inherently continuous variables), scale and rotational discretization has to be introduced. This operation makes the modelling of face class appearance difficult and prone to false detections and misalignments. In other words a face/non-face classifier has to learn all possible fluctuations of misaligned faces that do not fit exactly the chosen scale and rotation samples in order not to miss any face instance. As a result, not only the precision of localization decreases (since the classifier cannot distinguish between slightly misaligned faces) but the cluster of faces becomes less compact and thus more difficult to learn. The representative work of this approach can be found in [14, 16]. Local Operator Methods In this framework, local feature detectors are used. A face is represented by a shape (configuration) model together with models of local appearance. In the work of Weber et al. [18] faces are represented as constellations of rigid features (parts). Variability is represented by a joint probability density function on the shape of the constellation. A similar approach was proposed by Vogelhuber and Schmidt [17]. The work of Yow et al. [20] also falls into this category. The main drawback of the above-mentioned approaches seems to be that they do not exploit all the available photometric information and use only small patches of the face, which leads to an increased false-alarm rate. Moreover the removal of imaging effects (like scale and rotation) seems not to be an integral part and focus of the algorithms, but rather an ad hoc solution. Important group of methods are the Active Shape or Appearance Models [2, 13, 21, 3], where the facial shape and appearance are modelled independently. Although they seem to be ideal as a final localization step, reliable face detection using another method has to be employed first, since these iterative methods need a good initial position and size estimate to converge.
3 Hypotheses-Driven Affine Invariant Localization of Faces in Verification Systems
Our approach was motivated by the active appearance models, but we use a simpler (affine) shape model built over selected features in order to allow an exhaustive search for faces with unknown position, size and rotation. The approach breaks the task into three separate steps:
i) the removal of the imaging effects (introduced by the 3D → 2D capture);
ii) distinguishing face from background by selecting plausible facial hypotheses conforming to a generic face/background model;
iii) selection of the best hypothesis by the use of client-specific information.

First, a local-operator search for a set of ten facial features is performed. These features are the eye corners, eye centres, nostrils and mouth corners. This particular choice was used in our previous experiments and, in general, any facial features can be used in this algorithm. The complexity of the feature detectors is much smaller than that of a global-appearance face detector and thus false-positive errors occur. The search for local features is performed in a discretized scale and rotation space, but since small parts of the face are being detected, the localization error introduced by the discretization is much smaller than in the case of a whole-face detector. In contrast to local-appearance models, our search for a face instance in the image is navigated only by the evidence coming from the local detectors. By using correspondences between the features found by the local detectors in the image and their counterparts in the normalized face-space coordinates (see below), full affine invariance is achieved and thus imaging effects are removed before classification into the face/non-face classes. This is the main difference between our approach and the sliding-window based methods, where a face patch is not affinely aligned and the face/non-face classifier has to be trained to cope with all the geometric variability. Additionally, not every patch of the image has to be verified (reducing the false-positive error), since the number of plausible transformation hypotheses, defined by the correspondences, is quite small. The search through subsets of features which invoke face hypotheses is heavily reduced by using geometric feature-configuration constraints learned from the training set (in this context called “confidence regions”). They are expressed in terms of a distribution of transformations between the extracted data and the model. In order to reduce the inherent face variability, each face is registered in a common coordinate system. In this coordinate system, which we refer to as “face space”, all faces are geometrically normalized and consequently photometrically correlated. In such a space, both the natural biological shape variability of faces and the distortions introduced by the scene capture are removed. As a result of face normalization, the face features become tightly distributed and the whole face class very compact. This greatly simplifies the face appearance modelling. We assume that the geometric frontal human face variability can be, to a large extent, modelled by affine transformations [19, 15]. To determine an affine transformation from one face to another, we need to define correspondences of three reference points. As a good choice we propose the midpoint between the eye centres, along with the left eye centre and the point on the vertical axis of facial symmetry half way between the tip of the nose and the mouth. These facial points define the “face space” used in our experiments, see Figure 1. Previously [4], we established experimentally that the total variance of the set of XM2VTS training images registered in the face space was minimized by this particular selection of reference points. Note that two of these reference points are not directly detectable. However, their choice is quite effective at normalizing the width and the height of each face.
Fig. 1. Groundtruth and the construction of face space
Consequently, in the coordinate system defined by these points, the face features we aim to detect become tightly clustered. To hypothesize a face we need to detect at least three face features of different types. As there are ten features on each face that our detectors attempt to extract, there are potentially many triplets that may generate a successful hypothesis. However, some configurations are insufficient to define the mapping to the face space, such as a triplet composed of features close to each other (e.g. eye features). If fewer than three features are detected, the face will not be found. On the other hand, an excessive number of false positives will increase the number of triplets, and thus face hypotheses, that have to be verified. In verification scenarios, client-specific information can and should be used to ensure the best choice of face position. Our system outputs a limited number of the best facial hypotheses found in the image (i.e. the results of segmenting face from background) and, by matching client templates, ensures that the face is not missed. In the case of an impostor access, the localization error can grow significantly due to the dissimilarity between the impostor and the client's template (i.e. non-facial hypotheses can be chosen), but this does not increase the impostor acceptance error.
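As an illustration only, the three reference points could be derived from annotated landmarks roughly as in the sketch below; the landmark names and the approximation of the third point as a simple midpoint are assumptions, since the exact groundtruth format is not specified here.

```python
import numpy as np

def face_space_reference_points(left_eye, right_eye, nose_tip, mouth_center):
    """Build the three reference points described in the text from (x, y) landmarks."""
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)
    nose_tip = np.asarray(nose_tip, dtype=float)
    mouth_center = np.asarray(mouth_center, dtype=float)

    p1 = 0.5 * (left_eye + right_eye)        # midpoint between the eye centres
    p2 = left_eye                            # the left eye centre itself
    p3 = 0.5 * (nose_tip + mouth_center)     # half way between nose tip and mouth
    return np.stack([p1, p2, p3])            # 3 x 2 array of reference points
```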
4 Evidence Detection in Gabor Feature Space
Use of Gabor filters to extract and represent salient sub-parts of facial images is not novel. Many successful face detection and recognition applications have been reported, e.g. [10, 12, 9, 11]. However, most of the methods tightly combine the Gabor filters into the whole system and invariance is established in higher-level processing, for example by using a graph structure [10], where full invariance cannot be achieved. In our case, as the face detection is partitioned into separate sub-systems, the Gabor features can be used in the task where they are especially useful: the invariant detection of simple structures, the face evidence. Gabor filters have been popular in feature extraction due to their optimal joint localization of energy in space and spatial frequency. The minimal joint uncertainty is achieved by an elementary function consisting of a complex sinusoidal plane wave multiplied by a Gaussian:
\[
\psi(x, y; f, \theta) = \frac{f^2}{\pi\gamma\eta}\,
e^{-\left(\frac{f^2}{\gamma^2} x'^2 + \frac{f^2}{\eta^2} y'^2\right)}
e^{j 2\pi f x'}, \qquad
x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta , \tag{1}
\]

where f is the frequency of the sinusoidal plane wave, θ is the anti-clockwise rotation of the Gaussian envelope and the sinusoid, γ is the spatial width of the filter along the major axis, and η the spatial width along the minor axis (perpendicular to the sinusoid). A filter response for an image ξ(x, y) can be calculated at any location (x, y) = (x₀, y₀) using the convolution

\[
r_\xi(x, y; f, \theta) = \psi(x, y; f, \theta) * \xi(x, y)
= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \psi(x - x_\tau, y - y_\tau; f, \theta)\, \xi(x_\tau, y_\tau)\, dx_\tau\, dy_\tau . \tag{2}
\]
Gabor filters with different values of f and θ can be used to inspect the same events at different orientations and scales [6]. Using a proper sampling lattice for the filter parameters, the feature space can be discretized in terms of location, orientation, and scale to form a Gabor feature space. For object recognition any classifier can be used, e.g. the k-nearest neighbour (k-NN) decision rule or a sub-cluster classifier [7]. The classifiers are trained using objects (face features) in a standard pose, while in the testing phase invariance is achieved by search operations carried out in the Gabor feature space. A detailed discussion of the construction of a proper Gabor feature space can be found in [7].
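A minimal Python sketch of Eqs. (1) and (2) is given below; the filter size, the sampling lattice of frequencies and orientations, and the use of the response magnitude are illustrative choices rather than the parameters used by the authors.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(f, theta, gamma=1.0, eta=1.0, size=21):
    """Gabor elementary function of Eq. (1); gamma and eta set the envelope widths."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xr = x * np.cos(theta) + y * np.sin(theta)       # x' in Eq. (1)
    yr = -x * np.sin(theta) + y * np.cos(theta)      # y' in Eq. (1)
    envelope = np.exp(-((f**2 / gamma**2) * xr**2 + (f**2 / eta**2) * yr**2))
    carrier = np.exp(2j * np.pi * f * xr)
    return (f**2 / (np.pi * gamma * eta)) * envelope * carrier

def gabor_responses(image, frequencies, orientations):
    """Magnitudes of the convolution responses of Eq. (2) on a (frequency, orientation) lattice."""
    image = np.asarray(image, dtype=np.float64)
    return np.stack([np.abs(fftconvolve(image, gabor_kernel(f, t), mode="same"))
                     for f in frequencies for t in orientations])
```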
5 Hypotheses Generation and Verification
By establishing correspondences between image and face-space features, the transformation from shape-free face-space coordinates into image-space coordinates can be computed by solving a simple system of linear equations. This transformation can be learned from the training-set faces and a probabilistic model built. We use a Gaussian model, which approximates this type of data well. As the appearance model, a Support Vector Machine was trained on facial and background 20x20 pixel image patches (the faces were registered into face space). During detection, only the triplets with favourable transformation scores are considered for the appearance test. Classification is then performed on the underlying image patch, which is assigned an appearance score computed by the SVM; this score expresses the consistency of the patch with the face class. The discriminant function is depicted in Figure 2(a). The conceptual details of the transformation model, the appearance model and the algorithm implementation can be found in our previous work [4].
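The linear system mentioned above can be written down explicitly. The following sketch recovers the six affine parameters from three face-space/image correspondences and scores a hypothesis under a Gaussian model of the parameters; the parameter ordering and the unnormalized scoring function are illustrative assumptions, not the exact formulation of [4].

```python
import numpy as np

def affine_from_triplet(face_space_pts, image_pts):
    """Solve for A, t with image_pt = A @ face_space_pt + t from three non-collinear correspondences."""
    X = np.asarray(face_space_pts, dtype=float)   # shape (3, 2)
    Y = np.asarray(image_pts, dtype=float)        # shape (3, 2)
    M = np.zeros((6, 6))
    b = np.zeros(6)
    for k in range(3):
        x, y = X[k]
        M[2 * k]     = [x, y, 1, 0, 0, 0]          # unknowns: a11, a12, tx, a21, a22, ty
        M[2 * k + 1] = [0, 0, 0, x, y, 1]
        b[2 * k], b[2 * k + 1] = Y[k]
    a11, a12, tx, a21, a22, ty = np.linalg.solve(M, b)
    return np.array([[a11, a12], [a21, a22]]), np.array([tx, ty])

def transformation_score(params, mean, cov):
    """Gaussian plausibility of a hypothesised transformation; `params` is the 6-vector of
    affine parameters, `mean`/`cov` would be learned from the training set."""
    d = np.asarray(params, dtype=float) - mean
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))
```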
Fig. 2. (a) Linear discriminant function produced by the SVM. (b) Distribution of localization errors d_eye on BANCA (English part, protocol G) for client accesses, impostor accesses and the minimum error over 15 hypotheses; the uppermost curve denotes the best hypothesis out of 15, chosen using the groundtruth data

5.1 Final Position Refinement Using Client Templates
Given a limited number of best face hypotheses (e.g. 15 per image), more exhaustive tests can be performed to choose the best candidate for the face location. In this case, since the detector is an element of a face verification system, we use the normalized correlation of a candidate patch and the client template in the LDA space as the selection criterion. The hypothesis yielding the highest correlation is taken as the localization output. It should be noted that the correlation is performed using a higher-resolution patch (61x45 pixels) than in the case of the generic model. This is computationally affordable as the number of hypotheses to be tested is very small, and the higher resolution promotes a more precise localization. Details about the verification setup can be found in [8].
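A sketch of the normalized-correlation selection criterion follows; the feature-space projection and the client-template format are assumed, and the usage comment refers to hypothetical attribute names.

```python
import numpy as np

def normalized_correlation(patch, template):
    """Normalized correlation between a candidate patch and a client template
    (both assumed to be vectors already projected into the chosen feature space)."""
    p = np.asarray(patch, dtype=float).ravel()
    t = np.asarray(template, dtype=float).ravel()
    p = p - p.mean()
    t = t - t.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float(p @ t / denom) if denom > 0 else 0.0

# The hypothesis with the highest score would be kept, e.g.:
# best = max(hypotheses, key=lambda h: normalized_correlation(h.patch_lda, client_template))
```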
6 Experimental Results
As mentioned above, the BANCA database was used for testing the overall performance of the verification system. A scale-independent localization criterion was used to evaluate the localization error. This error measure is quite strict and has been previously used by other authors [5]. It is defined as

\[
d_{eye} = \frac{\max(d_l, d_r)}{\lVert C_l - C_r \rVert},
\]

where C_l, C_r are the groundtruth eye-centre coordinates and d_l, d_r are the distances between the detected eye centres and the groundtruth ones.
Fig. 3. Typical examples of localization in the client case
The evaluation of the localization performance involved 5460 images (2340 client accesses and 3120 impostor accesses). Fifteen facial hypotheses were generated for each image. The distribution function of the localization error is shown in Figure 2(b). As expected, the impostor localization error is higher than in the client case. If we consider a face correctly localized when d_eye < 0.25 [5], then in the case of client accesses the detection rate is 93.6%. For impostors the success rate is lower (83%), but this is not surprising as the final selection is based on the client template. In fact this contributes to impostors being correctly rejected. It should be noted that, due to the current setup, we localize only the biggest face (there is usually only one face in the scene), but the method is easily extendible to scenes containing multiple faces. The experiments on the BANCA database showed that the method copes well with large variations in scale and rotation. An interesting behaviour was observed when comparing the verification performance on data generated with the help of the groundtruth (i.e. the face chosen out of 15 that is closest to the groundtruth) and on data determined automatically. The performance was better in the second case, which reflects the fact that although the manual groundtruth registration data provides the best localization of the face from the operator's point of view, it is not necessarily the best data from the automatic verification point of view. As shown in Figure 3, the system deals well with rotated heads and with images where the eyes are occluded or closed.
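For reference, the localization criterion defined above can be computed as follows (the eye centres are assumed to be given as (x, y) pairs):

```python
import numpy as np

def d_eye(detected_left, detected_right, true_left, true_right):
    """Scale-independent eye-localization error of [5]; a face is counted as
    correctly localized when d_eye < 0.25."""
    tl = np.asarray(true_left, dtype=float)
    tr = np.asarray(true_right, dtype=float)
    dl = np.linalg.norm(np.asarray(detected_left, dtype=float) - tl)
    dr = np.linalg.norm(np.asarray(detected_right, dtype=float) - tr)
    return max(dl, dr) / np.linalg.norm(tl - tr)
```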
7 Conclusion
We proposed a novel framework for localizing human faces based on correspondences between triplets of detected local features and their counterparts in an affine invariant face appearance model which can be used in face verification
scenarios. The method is robust to partial occlusion or feature detector failure since a face may be detected if only three out of ten currently used detectors succeed. The method was tested on a difficult data set and exhibited promising results.
References
[1] http://falbala.ibermatica.com/banca
[2] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Learning to identify and track faces in image sequences. In ICCV, 317–322, 1998.
[3] S. Gong, A. Psarrou, and S. Romdhani. Corresponding dynamic appearances. Image and Vision Computing, 20:307–318, 1997.
[4] M. Hamouz, J. Kittler, J. Matas, and P. Bílek. Face detection by learned affine correspondences. In Proceedings of Joint IAPR International Workshops SSPR02 and SPR02, 566–575, August 2002.
[5] O. Jesorsky, K. J. Kirchberg, and R. W. Frischholz. Robust face detection using the Hausdorff distance. In J. Bigun and F. Smeraldi, editors, AVBPA 2001, volume 2091 of Lecture Notes in Computer Science, 90–95, Halmstad, Sweden, 2001. Springer.
[6] J.-K. Kamarainen, V. Kyrki, and H. Kälviäinen. Fundamental frequency Gabor filters for object recognition. In 16th International Conference on Pattern Recognition ICPR, 1:628–631, Quebec, Canada, 2002.
[7] J.-K. Kamarainen, V. Kyrki, H. Kälviäinen, M. Hamouz, and J. Kittler. Invariant Gabor features for evidence extraction. In Proceedings of MVA2002 IAPR Workshop on Machine Vision Applications, 228–231, 2002.
[8] A. Kostin, M. Sadeghi, J. Kittler, and K. Messer. On representation spaces for SVM based face verification. In M. Falcone, A. Ariyaeeinia, and A. Paoloni, editors, The Advent of Biometrics on the Internet, 9–16, 2002.
[9] J. Lampinen, T. Tamminen, T. Kostiainen, and I. Kalliomäki. Bayesian object matching based on MCMC sampling and Gabor filters. In Proc. SPIE Intelligent Robots and Computer Vision XX: Algorithms, Techniques, and Active Vision, volume 4572, 41–50, 2001.
[10] M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300–311, March 1993.
[11] S. J. McKenna and S. Gong. Real-time pose estimation. Journal of Real-Time Imaging, 4:333–347, 1998.
[12] Hyun Jin Park and Seung Yang. Invariant object detection based on evidence accumulation and Gabor features. Pattern Recognition Letters, 22:869–882, 2001.
[13] S. Romdhani, S. Gong, and A. Psarrou. A generic face appearance model of shape and texture under very large pose variations from profile to profile views. In Proc. of ICPR, 1:1060–1064, 2000.
[14] K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–50, January 1998.
[15] F. Torre, S. Gong, and S. McKenna. View-based adaptive affine tracking. In Proc. of ECCV, 1:828–842, 1998.
[16] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, 1:511–518, 2001.
[17] V. Vogelhuber and C. Schmid. Face detection based on generic local descriptors and spatial constraints. In Proc. of International Conference on Computer Vision, 1084–1087, 2000.
[18] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In Proc. 6th Europ. Conf. Comput. Vision, Dublin, Ireland, 1:18–32, 2000.
[19] K. Yow and R. Cipolla. Towards an automatic human face localization system. In Proc. of BMVC, 2:307–318, 1995.
[20] K. Yow and R. Cipolla. Feature-based human face detection. Image and Vision Computing, 15:713–735, 1997.
[21] S. Y. Li, S. Gong, and H. Liddell. Modelling faces dynamically across views and over time. In Proc. of ICCV, 554–559, 2001.
Shape Based People Detection for Visual Surveillance Systems

M. Leo, P. Spagnolo, G. Attolico, and A. Distante

Istituto di Studi sui Sistemi Intelligenti per l'Automazione - C.N.R.
Via Amendola 166/5, 70126 Bari, Italy
{leo,spagnolo,attolico,distante}@ba.issia.cnr.it
Abstract. People detection in outdoor environments is one of the most important problems in the context of video surveillance. In this work we propose an example-based learning technique to detect people in dynamic scenes. The classification is based on people's shape and not on the image content. First, motion information and background subtraction are used to highlight the objects of interest; then geometric and statistical information is extracted from the horizontal and vertical projections of the detected objects to represent the people's shape. Finally, a supervised three-layer neural network is used to classify the objects. Experiments have been performed on real image sequences acquired in a parking area. The results show that the proposed method is robust, reliable and fast, and that it can be easily adapted for the detection of any other moving object in the scene.
1 Introduction
The people detection problem has been studied extensively in the literature due to many promising applications in several areas, above all visual surveillance. The availability of new technologies and the corresponding decrease in their costs have drawn more attention to the development of real-time surveillance systems. Surveillance systems have already been installed in many locations such as highways, streets, stores, homes and offices, but they have often been used only as recording supports for monitoring the situation. A visual system that automatically detects abnormal situations and gives alerts for possible anomalous behaviours would be greatly appreciated. The first step is then to build a visual system able to detect moving objects, recognize people and identify illegal behaviours. In this paper, we focus on the problem of people detection in outdoor environments with a static TV camera. Our application context is a parking area through which people and cars pass. In this context the main aim is to detect the presence of people in order to recognize their gestures and identify illegal actions such as thefts. We propose a new technique that combines a high detection rate with fast computational time in order to process image sequences in real time. Detecting people in images is more challenging than detecting other objects because people are articulated in their shape and can assume a variety of postures, so it is non-trivial to define a single
model that captures all these possibilities. Moreover, people wear different clothes of different colours, so the intra-class variation in the people class can be very high, making recognition by colour-based and fine edge-based techniques difficult. In the literature two main categories of approaches for moving-object classification have been developed: shape-based classification and motion-based classification. Motion-based classification is performed by extracting information on the periodic properties of motion. In [1] time-frequency analysis was applied to detect and characterize periodic motion, and in [2] residual optical flow was used to analyse the rigidity and periodicity of moving entities. Shape-based classification has been performed by extracting different descriptions of the moving regions. The descriptions can be based on the intensity gradient image [12], image blob dispersedness, area and ratio of the blob bounding box [3,4], mean and standard deviation of silhouette projection histograms [5], or wavelet representations [6,7,8]. The two categories, shape-based and motion-based classification, have also been effectively combined for moving-object classification [9]. Usually, motion-based classification techniques require a lot of time for parameter computation and cannot run in real time. Moreover, they require that the orientation and the apparent size of the segmented objects do not change during several periods, and this constraint cannot be enforced in a video-surveillance context where there are no rules governing the movement. Shape-based classification techniques, instead, do not need temporal information (they work on a single frame) and can be implemented in real time because the parameter calculation is faster and simpler. In this work, a shape-based classification technique that uses statistical and geometric information extracted from the horizontal and vertical projections of the binary silhouette of the moving object has been adopted. The use of horizontal and vertical projections has several advantages: it reduces the effect of image noise while still maintaining all the information needed to distinguish one moving object from others in a video-surveillance context, and it can be performed in a very small amount of time even without specialized hardware or algorithmic tricks. In this work, seven parameters that incorporate geometric and statistical information about the moving object have been extracted from the projections of the binary shapes and have been provided as input to a neural classifier. The paper is organized as follows: in section 2 a brief overview of the approach for highlighting objects of interest against a static background is given; in section 3 the feature extraction step and the feature learning step are detailed. Finally, the experimental results obtained on real image sequences are reported in section 4.
2 Object Detection and Binary Shape Extraction
The first problem that a visual system has to solve in order to recognize any object is to extract from a complex image the objects of interest that are candidates to be matched with a searched model. In our context the objects of interest are both moving objects and static objects that differ from a background model. For this reason we have implemented an object detection algorithm that uses an adaptive background subtraction scheme. The key idea is to maintain a statistical model of the background: for each pixel a running average and a form of standard deviation are maintained. So,
in the test phase, a pixel is labelled as foreground if its intensity value differs from the running average by more than two times the standard deviation [3]. Any background subtraction approach is sensitive to variations in the illumination conditions. To solve this problem, it is necessary to frequently update the running average and the standard deviation of all background pixel values. For the updating, an exponential filter has been used; the implemented updating equations are described in [10]. After these steps, a reliable model of the background is available at each frame, so it is possible to accurately extract the objects of interest. However, the results of the object detection subsystem described above cannot be used directly by the object recognition system since they present an undesirable drawback: shadows. Each foreground object contains its own shadow, because this area also differs from the background. It is necessary to remove the shadow, because it radically changes the structure of the object shape. Starting from the observation that shadows move with their own objects but do not have a fixed texture, the removal algorithm proposed in [10] has been implemented. First, the image is segmented by calculating the photometric gain for each pixel; then segments that present the same correlation in the background image and in the current image are eliminated. Finally, an additional step is performed in order to further simplify the subsequent classification phase: all moving blobs with an area lower than an appropriate threshold are removed. This step allows the system to concentrate only on objects of interest such as a car or a person. At this point each detected object is represented by a binary shape that can be provided as input to the classifier. In the presence of many objects, each of them is considered separately by the motion detection algorithm. Each blob is detached from the remaining ones and analysed at different times. For each frame the algorithm extracts a number of binary images equal to the number of different objects detected in the scene.
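A minimal sketch of such an adaptive background model is shown below; the initial deviation, the update gain and the per-pixel update rule are assumptions for illustration (the exact updating equations are those of [10]), and shadow removal is omitted.

```python
import numpy as np

class AdaptiveBackground:
    """Running-average background model with an exponential update (illustrative sketch)."""

    def __init__(self, first_frame, alpha=0.05, k=2.0):
        self.mean = first_frame.astype(np.float32)
        self.std = np.full_like(self.mean, 10.0)   # assumed initial deviation
        self.alpha = alpha                          # exponential-filter gain
        self.k = k                                  # threshold factor (2 x std in the text)

    def apply(self, frame):
        frame = frame.astype(np.float32)
        foreground = np.abs(frame - self.mean) > self.k * self.std
        # update the statistics only where the pixel still looks like background
        bg = ~foreground
        self.mean[bg] += self.alpha * (frame[bg] - self.mean[bg])
        self.std[bg] += self.alpha * (np.abs(frame[bg] - self.mean[bg]) - self.std[bg])
        return foreground
```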
3 Feature Extraction and Learning
After the detection of the objects of interest, it is necessary to extract some attributes for modelling and automatically recognizing their shapes (a pattern recognition problem). The pattern recognition problem can be decomposed into a feature extraction problem and a classification problem. In the first, a set of coefficients (the pattern) that describes the input information in a significant way has to be found. Since the extracted characteristics are provided as input to the classifier, the objective is to reduce the number of coefficients while still preserving the relevant information of the input shape. In order to satisfy these two contrasting requirements, in this work each binary image is processed as follows: the horizontal and vertical projections of the whole image are computed, and from these projections the most important geometric and statistical properties are extracted. In the first step the system generates a new representation of the binary image of the target. Each item of this new representation is the sum of the black points in the rows of the binary image for the vertical projection, and the sum of the black points in the columns for the horizontal projection, that is:
\[
\mathrm{HorProj}(j) = \sum_{i=1}^{N} I(i,j), \qquad
\mathrm{VerProj}(i) = \sum_{j=1}^{M} I(i,j) \tag{1}
\]
where N is the number of rows and M is the number of columns of the binary image I. In this way the information is collected in a set of coefficients equal in number to the sum of the rows and columns of the image, obtaining a substantial reduction of the representation coefficients: the initial coefficients of the binary image representation were M×N, whereas after the projections their number becomes M+N. In addition, since the target is a small part of the whole image, many coefficients of the projections are zero. These coefficients are removed by the system and the number of coefficients decreases further. Starting from the remaining coefficients, the system evaluates seven parameters that compose the pattern associated with each object of interest. The parameters P1, ..., P7 are:
1. Maximum value of the horizontal projection:
\[
P_1 = \max_{j=1,\dots,M} \mathrm{HorProj}(j) = \max_{j=1,\dots,M} \sum_{i=1}^{N} I(i,j) \tag{2}
\]

2. Maximum value of the vertical projection:
\[
P_2 = \max_{i=1,\dots,N} \mathrm{VerProj}(i) = \max_{i=1,\dots,N} \sum_{j=1}^{M} I(i,j) \tag{3}
\]

3. Sum of the coefficients of the horizontal projection (or, equivalently, of the vertical projection):
\[
P_3 = \sum_{j=1}^{M} \mathrm{HorProj}(j) = \sum_{j=1}^{M}\sum_{i=1}^{N} I(i,j) = \sum_{i=1}^{N} \mathrm{VerProj}(i) = \sum_{i=1}^{N}\sum_{j=1}^{M} I(i,j) \tag{4}
\]

4. Mean value of the horizontal projection:
\[
P_4 = \sum_{j=1}^{M} \mathrm{HorProj}(j) / K \tag{5}
\]
where K ≤ M is the number of non-zero items in the horizontal projection.

5. Mean value of the vertical projection:
\[
P_5 = \sum_{i=1}^{N} \mathrm{VerProj}(i) / H \tag{6}
\]
where H ≤ N is the number of non-zero items in the vertical projection.

6. Standard deviation of the horizontal projection:
\[
P_6 = \sum_{j=1}^{M} \Bigl( \mathrm{HorProj}(j) - \sum_{j=1}^{M} \mathrm{HorProj}(j)/K \Bigr) \tag{7}
\]

7. Standard deviation of the vertical projection:
\[
P_7 = \sum_{i=1}^{N} \Bigl( \mathrm{VerProj}(i) - \sum_{i=1}^{N} \mathrm{VerProj}(i)/H \Bigr) \tag{8}
\]
The parameters P1, P2 and P3 provide geometric information: the maximum value of the horizontal projection is the height of the object; the maximum value of the vertical projection is the width of the object; and P3 is its area. The parameters P4, P5, P6 and P7 provide, instead, statistical information: P4 and P5 are normalized by the number of non-zero coefficients in the respective projections and supply information on the centroid of the object. In the same way, P6 and P7 are normalized by the number of coefficients in the respective projections and retain information about the object shape. This method is invariant to the translation of the object in the scene, and in this way it overcomes the problems of using sliding windows [16] to search for the target. The possibility of easily adapting this method to detect other moving objects in the scene is another advantage of the proposed approach. Figure 1 shows the scheme of the whole people detection system, and Figure 2 shows the horizontal and vertical projections of two images containing a person and a car. The extracted parameters are provided as input to the classifier. The classification is performed by a three-layer neural network trained with the backpropagation algorithm on a set of positive examples taken from selected people images and negative examples taken from other moving objects in the scene.
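A direct transcription of the projections and the seven parameters into Python is sketched below; the treatment of the deviation measures P6 and P7 (absolute deviations over the non-zero items) is an interpretation, since the printed form of Eqs. (7) and (8) is ambiguous.

```python
import numpy as np

def projection_features(binary):
    """Compute HorProj, VerProj and the seven parameters P1..P7 of Eqs. (2)-(8)
    from a binary silhouette (nonzero = object pixel)."""
    I = (np.asarray(binary) > 0).astype(np.float64)   # N rows x M columns
    hor = I.sum(axis=0)        # HorProj(j): column sums, length M
    ver = I.sum(axis=1)        # VerProj(i): row sums, length N
    K = np.count_nonzero(hor)  # non-zero items of the horizontal projection
    H = np.count_nonzero(ver)  # non-zero items of the vertical projection
    P1 = hor.max()                         # object height
    P2 = ver.max()                         # object width
    P3 = hor.sum()                         # object area (= ver.sum())
    P4 = hor.sum() / K if K else 0.0       # mean of the horizontal projection
    P5 = ver.sum() / H if H else 0.0       # mean of the vertical projection
    P6 = np.abs(hor - P4)[hor > 0].sum()   # deviation of the horizontal projection (interpreted)
    P7 = np.abs(ver - P5)[ver > 0].sum()   # deviation of the vertical projection (interpreted)
    return np.array([P1, P2, P3, P4, P5, P6, P7])
```

The resulting seven-dimensional pattern would then be fed to the neural classifier described in the text.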
Fig. 1. The people detection system
Fig. 2. Two examples of feature extraction by projection: (a), (d) two binary images containing a person and a car; (b), (e) horizontal projections; (c), (f) vertical projections
Table 1. The test sequences and their details

Sequence number                                        1     2     3    4
Number of frames                                     894   387  1058   51
Number of binary images generated                   1294   387  1854   51
Binary images with a walking person                 1109     0  1468    0
Binary images with a group of walking people         185     0   386    0
Binary images with a car                               0   387     0    0
Binary images with groups of cars and walking people   0     0     0   51
Table 2. The classification results

Sequence number                                                  1     2     3    4
Positive response of the neural network (people detected)     1294     0  1854    0
Negative response of the neural network (people not detected)    0   387     0   51
Fig. 4. (a)-(f) Three examples of test images correctly processed; (g)-(h) an example of test images wrongly processed: the person and the car move together and their binary shapes are spoiled
4 Experimental Setup and Results
The experiments have been performed on real image sequences acquired with a static Dalsa CA-D6 TV camera (528 × 512 pixels) connected to a stream store able to hold up to two hours of recorded images; the frame rate is 30 Hz. The processing is performed on a Pentium IV with 1.5 GHz and 128 MB of RAM. The sequences have been acquired in a parking area with the TV camera placed on the sixth floor (22 meters) of a neighbouring building. The parking area is heavily frequented and it is very likely to
have many moving targets in the same frame. This is often a problem for people recognition algorithms because the targets become shapeless. The proposed system partially overcomes this problem: when the moving objects are detached, they are analysed separately; conversely, if different blobs of the targets are connected, the system processes the whole binary shape. Each frame can thus produce one or more binary images belonging to the four categories listed below:

1. one walking person
2. a group of walking people
3. one other moving object (probably a vehicle)
4. groups of moving objects and walking people
The neural classifier has been trained with binary images containing a single person or a single car, and it has been tested on four different sequences. Table 1 shows the details of each sequence. The third column gives the number of binary images extracted from all the frames of the sequence. Notice that the number of binary images extracted from each frame is equal to the number of disconnected moving targets in the scene. The last four columns show the types of images extracted from each sequence. The first and third sequences contain, respectively, two and three persons walking in different directions, the second contains only a car, and the fourth contains a car and a person connected together for the whole time (they move at the same time). In the third sequence there are some frames where the blobs of the three people are connected and thus the binary image contains the whole group shape. The classification results are shown in Table 2. The system is able to detect people even when they are grouped, when they assume unexpected postures, when a partial occlusion occurs, or when carried objects modify the usual human body shape. Figure 4 shows some examples of images extracted from test sequences 1, 3 and 4 and the corresponding binary images extracted by the algorithm described in section 2. In (a), (c) and (e) people are correctly classified and enclosed in a white rectangle. In (g) the person and the car move together and the person is not recognized. The system fails to detect a person only when its blob is connected with the blob of a car (sequence 4). In this case detection based on the binary image is hard even for the human eye, and an approach based on the recognition of human components in grey-level images [11] appears more appropriate. Notice that in sequence 4 the car and the person move at the same time and their blobs are connected: this is an unusual circumstance (only a few frames), especially if the monitored area is wide. In addition, it is important to observe that thefts or other illegal actions are carried out against static objects and when the parking area is not much frequented. Heuristic methods based on temporal consistency, prediction of movements or domain knowledge can be implemented in order to track people even when the dynamics of the scene inhibit our system. The proposed target classification system is very fast: each binary image is processed in 0.016 seconds, and several binary images can be processed on parallel processors. If the number of parallel processors is greater than the number of extracted binary images, the time needed to classify all the detected objects in a frame remains around 0.016 seconds.
5 Conclusions and Future Works
This work deals with the problem of people detection in outdoor environments in the context of video surveillance. An example-based learning technique to detect people in dynamic scenes has been proposed. The classification is based purely on the people's shape and not on the image content. Adaptive background subtraction has been used for detecting the objects of interest; then geometric and statistical information extracted from the horizontal and vertical projections is used to represent the people's shape; finally, a supervised three-layer neural network has been used to classify the extracted patterns. The experiments show that both a single person and a group of people are correctly detected even when other moving objects are in the scene. People are not detected only when their blob is connected with the blob of a moving car, which widely modifies the whole binary shape. In this case people detection from the binary shape is hard even for human eyes. In conclusion, it is possible to assert that the proposed method is robust, reliable and fast, and that it can be easily adapted for the detection of other moving objects in the scene. Future work will deal with the problem of gesture recognition of the detected people in order to identify illegal behaviours such as thefts or damage.
References
[1] R. Cutler, L. Davis: Robust real-time periodic motion detection, analysis and applications, IEEE Trans. Pattern Anal. and Mach. Intell. 22(8) (2000), pp. 781-796
[2] A. J. Lipton: Local application of optical flow to analyse rigid versus non rigid motion. http://www.eecs.lehigh.edu/FRAME/Lipton/iccvframe.html
[3] R. T. Collins et al.: A system for video surveillance and monitoring: VSAM, final report, CMU-RI-TR-00-12, Technical Report, Carnegie Mellon University, 2000
[4] A. J. Lipton, H. Fujiyoshi, R. S. Patil: Moving target classification and tracking from real time video, Proceedings of the IEEE-WACV, 1998, pp. 8-14
[5] Y. Kuno, T. Watanable, Y. Shimosakoda, S. Nakagawa: Automated detection of human for visual surveillance system, Proceedings of ICPR, 1996, pp. 865-869
[6] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, T. Poggio: Pedestrian Detection using wavelet templates, in CVPR97, pp. 193-199, 1997
[7] C. Papageorgiou, T. Poggio: A Trainable System for Object Detection, International Journal of Computer Vision, vol. 38, no. 1, pp. 15-33 (2000)
[8] M. Leo, G. Attolico, A. Branca, A. Distante: Object classification with multiresolution wavelet decomposition, in Proc. of SPIE Aerosense 2002, conference on Wavelet Applications, 1-5 April 2002, Orlando, Florida, USA
[9] I. Haritaoglu, D. Harwood, L. Davis: W4: real time surveillance of people and their activities, IEEE Trans. Pattern Anal. and Mach. Intell. 22(8) (2000), pp. 809-830
[10] P. Spagnolo, A. Branca, G. Attolico, A. Distante: Fast Background Modeling and Shadow Removing for Outdoor Surveillance, in Proc. of IASTED VIIP 2002, 9-12 September 2002, Malaga, Spain
[11] A. Mohan, C. Papageorgiou, T. Poggio: Example-based Object Detection in Images by Components, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23, No. 4, April 2001
[12] L. Zhao, C. Thorpe: Stereo and Neural Network based Pedestrian Detection, IEEE Transactions on Intelligent Transportation Systems, Vol. 1, No. 3, September 2000
Real-Time Implementation of Face Recognition Algorithms on DSP Chip

Seong-Whan Lee, Sang-Woong Lee, and Ho-Choul Jung

Center for Artificial Vision Research, Korea University
Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
{swlee,sangwlee,hcjung}@image.korea.ac.kr
Abstract. This paper describes a prototype implementation of face recognition and verification algorithms in a stand-alone system using the TI TMS320C6711 chip. The system is organized to capture an image sequence, find the facial features in the images, and recognize and verify a person. The current implementation uses PCA (Principal Component Analysis) and SVM (Support Vector Machines). We have made several tests on video data and measured the performance and speed of the proposed system in a real environment. Finally, the results confirm that the proposed system can be applied to various applications which are impossible for conventional PC-based systems.
1 Introduction
During the past few years, we have witnessed an explosion of interest and progress in automatic face recognition technology. Many systems, implemented on workstations or PCs, are already deployed as components of intelligent building systems and security systems for gate control. However, hardware cost and volume often limit the application of face recognition or verification to systems such as interactive toys, mobile devices, and so on. Therefore, face recognition systems need to become smaller and to use faster algorithms. This paper introduces a face recognition and verification system realized on top of a stand-alone platform. The main issues of this work are (i) to design a compact hardware system using a DSP for face recognition, (ii) to implement complex algorithms such as SVM and PCA on this system, and (iii) to demonstrate a prototype in real time. The remainder of this paper is organized as follows. Section 2 introduces some previous systems for face recognition and verification. Sections 3 and 4 describe the hardware and software architectures of the implementation based on the TMS320C6711 chip for face recognition and verification. The experimental results are summarized in Section 5.
This research was supported by Creative Research Initiatives of the Ministry of Science and Technology, Korea.
2 Related Work
Any typical face recognition system performs three separable functions. The first is face detection, the second is feature extraction and the third is classification. After the first step of extracting faces, we create a number of feature vectors, which capture most of the important characteristics of the face image. Because calculating these feature vectors is demanding, face recognition systems have traditionally been large systems with large memories and fast processors. The training process, a separate part of the third stage, is an additional computationally intensive task because the feature vectors usually live in a high-dimensional space. Therefore, there has been a recent surge of interest in small-volume, high-speed face recognition systems. Technically speaking, two competing methods are used to build such systems: one is VLSI implementation, and the other is DSP (Digital Signal Processor) implementation. In the early 1990s, Gilbert et al. introduced a real-time face recognition system using custom VLSI hardware for fast correlation in an IBM-compatible PC [9]. Five years later, Yang et al. introduced a parallel implementation of a face detection algorithm using a TMS320C40 chip [7]. On the other hand, IBM introduced a commercial chip, ZISC, which can compute the classification in an RBF (Radial Basis Function) based neural network [5]. In order to compute massive vectors, Genov and Cauwenberghs introduced a support vector ‘machine’, called Kerneltron, which computes wavelets, kernel functions and the SVM algorithm with a mixed VLSI architecture [4]. However, these systems did not implement all the stages of a face recognition system. There are several other systems, such as Argus [6] and FaceIt [8], that implement all stages on a PC or workstation. Unlike these, we have implemented the entire pipeline using only a TMS320C6711 DSP chip. This enables the face recognition system to be applied to diverse applications.
3 System Design
Most real-time image processing tasks are time-critical and highly computation-intensive. We have chosen the approach of using a high-performance DSP chip, namely a floating-point TMS320C6711. The implementation used the C6711 DSK (DSP Starter Kit) and IDK (Image Developer's Kit) from Texas Instruments, Inc. Because the C6711 DSK platform was developed for evaluation, it has many limitations for developing a large application. Thus, we re-designed the DSK's circuit and combined other devices. The resulting system is shown in Table 1.

3.1 System Organization
The current system is composed of a keypad module, a main board, the IDK and a host PC, as shown in Fig. 1. The main board contains two McBSPs (Multi-channel Buffered Serial Ports) and an EMIF (External Memory InterFace) to communicate with other devices, and supplies 5V power to all the modules. Users use a 4x4 keypad for typing their own ID number and selecting modes.
Table 1. Specification of the system

                TI DSK with IDK          Our System
DSP             TMS320C6711-150MHz       TMS320C6711-150MHz
SDRAM           16 MByte                 32 MByte
Flash ROM       128 KByte                1 MByte
User Input      4 Dip Switch             4x4 Keypad
Output          VGA Display              VGA Display/LCD Display
General Port    McBSP                    RS-232C
Fig. 1. Block diagram of the system
In addition to the buttons, the keypad module has an RS-232C adapter and a baud-rate converter for serial communication. The IDK is a daughter board that captures images from a CCD camera and displays the image sequence on a VGA display, and the host PC analyzes and stores the results from the C6711 through an RS-232C port. By properly deleting or overwriting data in this stand-alone system, we can operate all functions without the host PC.

3.2 Boot Process for Large Applications
When a DSP/BIOS application is booted from ROM, the following procedure [10] should take place:

– Upon reset, the TMS320C6711 initializes a segment of memory (copy from ROM into RAM), including custom boot code.
– The custom boot code then executes and copies an additional section that initializes program and data memory as needed.
– DSP/BIOS startup procedures will initialize C variables in the .bss section with data contained in the .cinit section, as well as perform general DSP/BIOS setup.
– The application will begin execution at main().
Fig. 2. The flowchart of the system
Generally, a program for general control is executed from internal memory, but our face recognition code is too large to be loaded into the 64K internal memory. Thus, in our system, several things were considered. First, the boot code was linked with the Visual Linker tool so that it is loaded (from flash memory into internal memory) and executed (at 00000000H) properly upon reset. Next, the boot code enables the user code to be executed in SDRAM (80000000H) for fast access. These sections must be properly linked so that they can be loaded into ROM and run from RAM. Finally, the ROM image must be created and programmed into ROM.
4 Implementation of Face Recognition Algorithms
In this section, we describe the software architecture of the face recognition and verification system. It will be loaded onto the platform designed in the previous section.

4.1 Overview
Figure 2 describes the internal computational structure of the face recognition/verification system. First, users select the mode of operation among face registration, deletion, recognition and verification. In the registration and deletion processes, users present their ID to the system. Next, some face images are captured by a CCD camera. In order to reduce the dimension of the
input data, characteristic features are generated by the feature extractor. The features represent the original image approximately, in a very compact form. In recognition/verification mode, the system compares the test features with the stored features. In recognition mode, the system searches for the most similar user in the stored data. In verification mode, users submit their own ID as the verification key, and the system then decides whether the current user is the owner of the input ID or not. After each process, the system returns to wait for a key stroke.

4.2 Algorithm
Face Detection As the first step of the entire process, face detection greatly affects the overall system performance. In fact, successful face detection is a prerequisite to the success of the following face recognition and verification tasks. Moreover, it is important to reduce the memory and computation requirements for the stand-alone system. Thus, we have chosen a simple and fast face detection algorithm using geometric relations of the eyes and nose/lip information. The procedure is as follows. First, we detect significant edges using the Sobel operator in the image; noise is suppressed by a Gaussian filter prior to the application of the Sobel operator. Next, we apply 8-connected component labelling in the image to capture blobs and find the candidates for the eye regions. The algorithm handles the equivalences during the first scan by merging equivalence classes as soon as a new equivalence is found. Then, we find the eyebrow regions based on the eye-region candidates and we check for the nose below the centre of the two eyes. The face region is obtained from the geometric relations between the eye regions and the nose/lip regions. Finally, face detection is completed by normalizing the face region to a standard size of 32x32 pixels. Normalization of the face region reduces the effect of variations in distance and location.

Feature Extraction The extraction of feature vectors from face images is essentially a compression of the large image into a small vector. We used PCA [2] as the feature extractor. Let the training set of M face images be Γ₁, Γ₂, Γ₃, ..., Γ_M. The average face of the training set is Ψ = (1/M) Σ_{i=1}^{M} Γ_i. Each face differs from the average by the vector Φ_i = Γ_i − Ψ. This set {Φ_i} is then subject to PCA, to produce a set of orthogonal vectors and their associated eigenvalues which best describe the distribution of the data. These eigenvectors can be thought of as a set of features that together characterize the variation among face images.

Classification The classification is one of the most important components in a face recognition/verification system. Let us briefly describe the high-performance classifier, the SVM [3]. The SVM provides good generalization in pattern classification problems without any domain knowledge by means of the structural risk minimization principle. If we assume that the input space is linearly separable for simplicity, the decision hyperplane in the high-dimensional space is defined by

\[
w \cdot x + b = 0 , \tag{1}
\]
where x is a training pattern vector and (w, b) ∈ ℝⁿ × ℝ are the parameters that determine the hyperplane. The constraint given by equation (2) must be satisfied to ensure that all training patterns are correctly classified:

\[
y_i (w \cdot x_i + b) - 1 \ge 0 , \quad \forall i \tag{2}
\]

where y_i is the label given with x_i. Because this simple model of the SVM, called the maximal margin classifier, cannot be applied to real applications, we adopt the dual objective function

\[
L(w, b, \xi, \alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle , \tag{3}
\]
where ξ is a slack variable and the α_i are Lagrange multipliers. One problem, however, is that solving this dual objective function takes too much computation, so several approaches, such as Chunking, Decomposition and SMO (Sequential Minimal Optimization), have been proposed [1]. The Chunking algorithm uses the fact that the value of the quadratic form is the same if you remove the rows and columns of the matrix that correspond to zero Lagrange multipliers. Therefore Chunking drastically reduces the size of the matrix from the number of training examples squared to approximately the number of non-zero Lagrange multipliers squared. The Decomposition algorithm shows that a large QP (Quadratic Programming) problem can be broken down into a series of smaller QP subproblems. As long as at least one example that violates the KKT (Karush-Kuhn-Tucker) conditions is added to the examples of the previous sub-problem, each step reduces the overall objective function and maintains a feasible point that obeys all of the constraints. The main advantage of this decomposition is that it suggests algorithms with memory requirements linear in the number of training examples and linear in the number of support vectors. SMO is a simple algorithm that quickly solves the QP problem in SVM training without any extra matrix storage and without invoking an iterative numerical routine for each sub-problem. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values. We use the Decomposition algorithm considering the limited memory on the DSP.
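To make the two algorithmic building blocks concrete, the following sketch shows an eigenface-style PCA basis and the evaluation of a trained SVM decision function. The function names and the linear-kernel default are illustrative assumptions, and the decomposition-based training itself is not reproduced here.

```python
import numpy as np

def pca_basis(faces, n_components):
    """Eigenface basis: `faces` is an (M, D) array of vectorized training face images."""
    mean = faces.mean(axis=0)
    phi = faces - mean                               # the vectors Phi_i = Gamma_i - Psi
    # eigen-decomposition of the small M x M matrix (standard eigenface trick)
    evals, evecs = np.linalg.eigh(phi @ phi.T)
    order = np.argsort(evals)[::-1][:n_components]
    basis = phi.T @ evecs[:, order]
    basis /= np.linalg.norm(basis, axis=0)           # unit-norm eigenvectors in image space
    return mean, basis                               # project a face with (x - mean) @ basis

def svm_decision(x, support_vectors, alphas, labels, b, kernel=np.dot):
    """Decision value of a trained SVM: f(x) = sum_i alpha_i y_i K(s_i, x) + b.
    The alphas and support vectors would come from the decomposition training used on the DSP."""
    return sum(a * y * kernel(s, x)
               for a, y, s in zip(alphas, labels, support_vectors)) + b
```

In a verification setting the sign of the decision value, or a threshold on it, would accept or reject the claimed identity.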
5 Experimental Results and Analysis
We have performed experiments in real time with images captured by our prototype system, shown in Figure 3(a). The images shown in Figure 3(b) were taken when the system was active. We used a data set of 20 persons (5 images per person) for training and tried to recognize persons from newly captured images in an indoor environment. The system can detect a face and recognize a person at a rate of 12-16 frames per second. This speed was measured from user input to the final stage, with the result depending on the number of objects in the image. The system captures a frame
Fig. 3. The implemented face recognition/verification system: (a) prototype system; (b) system in action

Fig. 4. Performance of the system: (a) average time; (b) result of recognition experiments
through the IDK, preprocesses it, detects a face, extracts feature vectors and identifies a person. All of these stages have to operate in real time. In order to identify the processing bottleneck, we also measured the processing time of each stage using timer0 of the TMS320C6711 chip. As expected, the critical steps are the training stage and the registration stage, primarily due to the massive matrix calculations and the writing of new data into flash ROM. Apart from these stages, the recognition and verification modes take less than 50 ms, which shows that our system is useful in real-time applications. The recognition performance of the system is highly dependent on the accuracy of face extraction. In the next test, we measured the classification accuracy assuming correct face extraction. This experiment was done by increasing the number of training samples and test samples per person. The result in Figure 4 shows that the number of training samples is more important than the number of test samples. We have also decided on a trade-off between speed and performance through the number
of training samples. Figure 4 shows that the training stage takes a few seconds. But this is not the main problem, since the training stage does not need real-time speed in some applications. Taking this into account, the result is deemed quite satisfactory for a prototype system and for launching a commercial product.
6 Conclusion and Future Work
A real-time face recognition/verification system was developed with an effective implementation of the complex algorithms on a DSP-based hardware platform. The results of several experiments under real conditions showed that the system works well and is applicable to real-time tasks. Even in the worst case, the complete system requires only about 90 ms per frame. This level of performance was achieved through a careful system design of both software and hardware, and points to the possibility of various applications. However, in the training stage and the registration stage, users have to wait several seconds for a response. It remains future work to make the training stage of the SVM algorithm faster or to produce a VLSI implementation for massive vector calculations. Further hardware integration may also be considered for a faster system.
References
[1] Scholkopf, B., Burges, C., Smola, A.: Advances in Kernel Methods - Support Vector Learning. MIT Press (1998)
[2] Turk, M., Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1) (1991) 77-86
[3] Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
[4] Genov, R., Cauwenberghs, G.: Kerneltron: Support Vector ‘Machine’ in Silicon. Pattern Recognition with Support Vector Machines, Lecture Notes in Computer Science, 2388 (2002) 120-134
[5] IBM ZISC036 Data Sheet, http://www.ibm.com
[6] Sukthankar, R., Stockton, R.: Argus: The Digital Doorman. IEEE Intelligent Systems, 16(2) (2001) 14-19
[7] Yang, F., Paindavoine, M., Abdi, H.: Parallel Implementation on DSPs of a Face Detection Algorithm. Proc. of International Conference on the Software Process, Chicago, USA (1998)
[8] http://www.identix.com/datasheets/faceit/tech5.pdf
[9] Gilbert, J. M., Yang, W.: A Real-Time Face Recognition System Using Custom VLSI Hardware. Proc. of Computer Architectures for Machine Perception Workshop, New Orleans, USA (1993) 58-66
[10] Texas Instruments: TMS320C6000 Tools: Vector Table and Boot ROM Creation (Rev. C). Application Report SPRA544C (2002)
Robust Face-Tracking Using Skin Color and Facial Shape

Hyung-Soo Lee¹, Daijin Kim², and Sang-Youn Lee³

¹ Department of Computer Science and Engineering, Pohang University of Science and Technology
[email protected]
² Department of Computer Science and Engineering, Pohang University of Science and Technology
[email protected]
³ Multimedia Technology Laboratory, Korea Telecom, 17 Woomyeon-Dong, Seocho-Gu, Seoul, 137-792, Korea
[email protected]

Abstract. This paper proposes a robust face tracking algorithm based on the CONDENSATION algorithm that uses skin color and facial shape as observation measures. Two independent trackers are used for robust tracking: one tracks the skin-colored region and the other tracks the facial-shape region. The two trackers are coupled by an importance sampling technique, where the skin-color density obtained from the skin-color tracker is used as the importance function to generate samples for the shape tracker. The samples of the skin-color tracker within the chosen shape region are updated with higher weights. The proposed face tracker shows more robust tracking performance than a skin-color based tracker or a facial-shape based tracker in the presence of a cluttered background and/or illumination changes.
1 Introduction
Tracking the human face is a key component of video surveillance and monitoring systems and can be used as a preprocessing step for high-level processing such as face recognition, access control, and facial expression recognition. There have been many face tracking algorithms, which can be classified into several categories according to the cues used. One category is the color-based methods. These use the property that the facial colors of different people cluster in certain transformed 2-D color spaces. Yang, Qian, Jang, and Gong (see [1, 2, 3], and [4]) used this kind of algorithm for face tracking. It is easy to implement, fast, and has a low computational cost. However, when the illumination conditions change drastically and the background contains skin-like colors, it is not robust. Color adaptation methods have been proposed to compensate for this problem. Gong [4] used a Gaussian mixture model to represent the skin color and updated its parameters over time as the illumination conditions changed. Qian [2] collected pixel samples whose filtering scores based on
Robust Face-Tracking Using Skin Color and Facial Shape
303
the general color model exceed prescribed threshold and updated color model. Jang [3] proposed an adaptive color model which used the basic structure of CONDENSATION. These algorithms, however, use only skin-color cannot help misidentifying skin-colored objects in the background, and mistaking them for a face. Another category of the face tracking algorithm is the shape based method. Birchfield [5] used the property that the human facial shape is elliptical. This approach is not influenced by background color and illumination changes, but when the background is highly cluttered, it is not robust. The other category of the face-tracking algorithm uses both color and shape. Birchfield [6] used gradient intensity for measuring an object’s boundary shape and a color histogram for measuring color of object’s interior, but it did not integrate the two measurement scores well. In this paper, we propose a face tracking algorithm based on the CONDENSATION algorithm. We use skin color and facial shape as the observation measure. To effectively integrate each measure, we constructed two separate trackers which use skin color and facial shape as the observation measure, respectively. We use the importance sampling technique to limit the sampling region of the two trackers. Experimental result shows the robustness of our algorithm in clutter background, in the presence of skin-colored objects. This paper is organized as follows. Section 2 describes the theoretical backgrounds of the CONDENSATION algorithm and the importance sampling. Section 3 describes the proposed face tracking system that consists of two independent trackers. Section 4 shows the simulation results how the proposed face tracking system is working robust in a variety of difficult environments. Finally, a conclusion is drawn.
2 Face Tracking System
In this section, we present in detail how our proposed face tracking system is implemented. Our face tracking system uses the CONDENSATION algorithm and importance sampling. We use skin color and facial shape as the observation measures. Because the two measures are difficult to integrate directly, we construct two separate trackers which use skin color and facial shape as the observation measure, respectively. We call the tracker using skin color TRACKER I and the tracker using facial shape TRACKER II.

TRACKER I uses a standard CONDENSATION algorithm [7] to track the skin-colored pixels, where a color histogram is taken as the measurement. After all steps of TRACKER I are finished, we have the density of skin-colored pixels represented by T_Color ~ {s_t^{i,color}, π_t^{i,color}}.

TRACKER II uses the density of skin-colored pixels obtained from TRACKER I as the importance function and generates shape samples from this density. This approach is advisable because the assumption that faces are located in the skin-colored region is reasonable. We use an elliptical shape to model the facial shape and use it in the measurement step. After all steps of TRACKER II are finished, we have the density of facial shapes represented by T_Shape ~ {s_t^{i,shape}, π_t^{i,shape}}.

After we locate the facial region from the density of TRACKER II, we assign high weights to the samples of TRACKER I which are located in the facial region. This updating is heuristic to some degree, but it allows more samples to be generated in the selected facial region in the next frame. Fig. 1 shows the overall procedure of the proposed face tracking system.

Fig. 1. Proposed Face Tracking System

2.1 TRACKER I
TRACKER I uses the standard CONDENSATION algorithm to track the skin-colored pixels. To use the CONDENSATION algorithm, we must define the state vector s, the dynamic model p(s_t^i | s_{t-1}^i), and the measurement method.

State Vector and Dynamic Model. We define the state vector s as the position of a pixel in the image plane (x, y). It is represented as s_t^{i,color} = (s_x, s_y), where t is the frame number and i denotes the i-th sample in the frame. The dynamic model can be represented as

p(s_t^i \mid s_{t-1}^i) = \frac{1}{\sqrt{2\pi}\,\alpha} \exp\left( -\frac{1}{2} \frac{\| s_t^i - s_{t-1}^i \|^2}{\alpha^2} \right).   (1)

Measurement. It is well known that the skin-color distributions of different people in a certain transformed 2-D color space are similar [1]. RGB representation is the basic color representation method in most video cameras. RGB representation expresses not only the color but the brightness as well. But the brightness can be changed easily by a change in illumination
conditions. So we use the chromaticity space as the color model to remove brightness from the skin-color representation. Chromatic colors can be defined as follows:

r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}.   (2)
We build a two-dimensional color histogram to represent the facial skin-color model, which consists of 100 bins along each axis. The measurement weight π of a state vector s is obtained from the color histogram at the chromaticity of the corresponding pixel.

Behavior of TRACKER I. In every frame, the tracker selects samples from the sample distribution of the previous frame and predicts new sample positions in the current frame. Then it measures the observation weights of the predicted samples. After performing these three steps for all samples, we have the sample distribution of the current frame, which is used in the next frame. The color samples track the skin-colored region successfully in a limited environment. However, when other skin-colored objects appear in the image plane, the tracker may be distracted by them, so we need to use shape information as well.
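As a rough illustration of this color measurement, the following sketch builds the r-g histogram of Eq. (2) with 100 bins per axis and evaluates the weight of a sample position; the function names, array layouts and histogram normalisation are our own assumptions, not part of the paper.

```python
import numpy as np

def to_chromaticity(rgb):
    """Convert an (H, W, 3) RGB image to the (r, g) chromaticity space of Eq. (2)."""
    rgb = rgb.astype(np.float64)
    s = rgb.sum(axis=2)
    s[s == 0] = 1.0                      # avoid division by zero on black pixels
    return rgb[..., 0] / s, rgb[..., 1] / s

def build_skin_histogram(skin_pixels_rgb, bins=100):
    """2-D r-g histogram (normalised to [0, 1]) built from example skin pixels."""
    r, g = to_chromaticity(skin_pixels_rgb.reshape(-1, 1, 3))
    hist, r_edges, g_edges = np.histogram2d(r.ravel(), g.ravel(),
                                            bins=bins, range=[[0, 1], [0, 1]])
    hist /= max(hist.max(), 1.0)
    return hist, r_edges, g_edges

def color_weight(sample_xy, frame_rgb, hist, r_edges, g_edges):
    """Measurement weight pi of a color sample s = (x, y): histogram value at its chromaticity."""
    x, y = sample_xy
    r, g = to_chromaticity(frame_rgb[y:y + 1, x:x + 1, :])
    ri = np.clip(np.searchsorted(r_edges, r[0, 0]) - 1, 0, hist.shape[0] - 1)
    gi = np.clip(np.searchsorted(g_edges, g[0, 0]) - 1, 0, hist.shape[1] - 1)
    return hist[ri, gi]
```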
2.2 TRACKER II
As mentioned above, TRACKER II uses the density of skin-colored pixels as the importance function and generates samples from it. We again need to define the state vector s, the dynamic model p(s_t^i | s_{t-1}^i), and the measurement method. The shape model for face representation is created using the B-spline method [10].

State Vector and Dynamic Model. We define the state vector s as the center position of the shape template in the image plane (x, y). It is represented as s_t^{i,shape} = (s_x, s_y), where t is the frame number and i denotes the i-th sample in the frame. The same dynamic model as in TRACKER I is used for TRACKER II.

Sample Generation. TRACKER I generates samples from its own sample distribution of the previous frame. For TRACKER II, however, we use the density of TRACKER I, g_color ~ {s_t^{i,color}, π_t^{i,color}}, as the importance function. TRACKER II therefore generates its samples from this density, and the observation weight of each sample is given by

\pi_t^{i,shape} = \frac{f_{t,shape}(s_t^{i,shape})}{g_{t,color}(s_t^{i,color})} \, p(Z_t \mid x_{t,shape} = s_t^{i,shape}),   (3)

where

f_{t,shape}(s_t^{i,shape}) = p(x_t = s_t^i \mid Z_{t-1}) = \sum_{j=1}^{N} \pi_{t-1}^{j,shape} \, p(x_t = s_t^i \mid x_{t-1} = s_{t-1}^j).   (4)
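A minimal sketch of this importance-sampling weighting is given below, under the simplifying assumption that both densities are kept as weighted sample sets and that the shape likelihood p(Z_t | x_t,shape) is supplied by a separate measurement function; all names and the value of α are illustrative.

```python
import numpy as np

def gaussian_transition(x, x_prev, alpha):
    """Dynamic model of Eq. (1): isotropic Gaussian on the position difference."""
    d2 = np.sum((x - x_prev) ** 2)
    return np.exp(-d2 / (2.0 * alpha ** 2)) / (np.sqrt(2.0 * np.pi) * alpha)

def shape_weights(color_samples, color_weights, prev_shape_samples,
                  prev_shape_weights, shape_likelihood, alpha=5.0):
    """Importance-weighted shape weights, following Eqs. (3)-(4).

    color_samples      : (M, 2) positions drawn from the color tracker (importance fn g)
    color_weights      : (M,)   their normalised weights g_t
    prev_shape_samples : (N, 2) shape samples of the previous frame
    prev_shape_weights : (N,)   their normalised weights
    shape_likelihood   : callable (x, y) -> p(Z_t | x_t,shape = (x, y))
    """
    new_weights = np.zeros(len(color_samples))
    for i, s in enumerate(color_samples):
        # Eq. (4): prediction prior f, summed over the previous shape samples
        f = sum(w * gaussian_transition(s, sp, alpha)
                for sp, w in zip(prev_shape_samples, prev_shape_weights))
        g = max(color_weights[i], 1e-12)        # importance function value
        # Eq. (3): correction factor f/g times the shape likelihood
        new_weights[i] = (f / g) * shape_likelihood(*s)
    total = new_weights.sum()
    return new_weights / total if total > 0 else new_weights
```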
Fig. 2. Generating facial shape samples from color samples
Fig. 2 shows how the shape samples are generated from the skin-colored pixel density. Fig. 2-(a) shows the original image; the small dotted pixels in Fig. 2-(b) represent the color samples used for the importance function, while the larger, brighter dotted pixels in Fig. 2-(b) represent the color samples selected for propagating the shape samples. Fig. 2-(c) shows some shape templates whose centers are the selected samples, represented as bright pixels.

Measurement. We have to measure and compute the weights of the new samples generated from the importance function. We define the shape model for face representation using a B-spline. We use N = 100 control points and construct an elliptical shape model [11]. A Canny edge detector provides the oriented edge features, which are discretized into four directions. Using the Canny edge detector, we obtain the edge image map and the directional map of the current frame. A sophisticated measurement method has been proposed by Nishihara [9]. It requires not only large gradient magnitudes around the boundary but also gradient directions that are perpendicular to the boundary. The measurement density can then be given by

p(z_t \mid x_t = s_t^{i,shape}) = \frac{1}{N} \sum_{k=1}^{N} |n(k) \cdot d(k)|,   (5)

where n(k) is the unit normal to the ellipse at the k-th control point, d(k) is the unit directional vector of the edge image map at the k-th pixel, N is the total number of control points, and · denotes the dot product. n(k) is determined when the facial shape model is defined and d(k) is measured from the Canny edge map. This means that the weight of a sample is larger when edge information corresponding to each control point of the shape template is present in the image plane and the direction at each control point is similar to the corresponding edge direction in the image. After measuring the weights of all shape samples, we determine the location of the face by averaging the positions of the 10 samples with the largest weights.

Fig. 3. Tracking result when the face and a skin-colored object are present simultaneously
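A minimal sketch of the measurement density of Eq. (5) is given below, assuming the ellipse control points and their unit normals are precomputed and that the Canny result is stored as a boolean edge mask plus a per-pixel unit direction map; these array layouts are assumptions for illustration.

```python
import numpy as np

def shape_measurement(center, control_offsets, normals, edge_dirs, edge_mask):
    """Average |n(k) . d(k)| over the N control points of the ellipse template.

    center          : (2,) candidate face center (x, y)
    control_offsets : (N, 2) control-point offsets of the ellipse, relative to the center
    normals         : (N, 2) unit normals n(k) of the ellipse at each control point
    edge_dirs       : (H, W, 2) unit edge-direction vectors d from the Canny edge map
    edge_mask       : (H, W) boolean map, True where an edge pixel was detected
    """
    h, w = edge_mask.shape
    points = np.round(control_offsets + center).astype(int)
    score, n_points = 0.0, len(points)
    for (x, y), n in zip(points, normals):
        if 0 <= x < w and 0 <= y < h and edge_mask[y, x]:
            score += abs(float(np.dot(n, edge_dirs[y, x])))
        # non-edge or out-of-image control points contribute zero
    return score / n_points
```

The candidate face location is then taken as the average position of the highest-weighted shape samples, as described above.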
3 Experimental Results and Discussion
Our proposed face tracking algorithm is implemented on a P3-800 MHz system with a 320×240 image size. We used a WDVR-430B grabber board to capture image sequences from a PC camera. The grabber board captures 10 frames per second, and our system processes 8 frames per second with this board. We conducted several experiments in a variety of environments to show the robustness of the proposed face tracking system.

Fig. 3 shows the tracking result when the face and a skin-colored object are present in the image plane at the same time. In the figure, the dark rectangle represents the skin color region and the light rectangle represents the facial location determined by the positions of the shape samples. In this experiment, we try to "steal" the ellipse from the face with a hand gesture. The skin-color region expands as the hand moves, but the facial location does not change. Even though the hand is a skin-colored object, it cannot steal the ellipse. If skin color information were used alone, the tracker could easily follow the hand instead of the face.

Fig. 4 shows the result when another face appears during the tracking process. Although the newly appeared face has an elliptical shape, its size differs from that of the tracked face. Therefore the shape templates on the newly appeared face receive lower weights than those on the tracked face, and the system keeps tracking the original subject robustly.

Fig. 5 shows that the system also tracks well in a cluttered background. If shape information were used alone, the system would easily be attracted by the clutter; the color information limits the search area as intended.
Fig. 4. Tracking result when another face appears in the image plane
Fig. 5. Tracking result for a face in a cluttered background
4 Conclusion
In this paper, we proposed a robust face tracking algorithm based on the CONDENSATION algorithm and importance sampling. We presented two independent trackers which use skin color and facial shape information as measures, respectively. We used the importance sampling technique to limit the sampling region and, as a result, to improve the tracking performance. We performed a variety of face tracking experiments, and the system showed good robustness even when other skin-colored objects or other faces appeared in the image, and even with a cluttered background or changing illumination. Compared with other face tracking systems, which use only skin color or only facial shape as the measurement cue, our proposed system shows better and more robust tracking performance.
Acknowledgements

The authors would like to thank the Ministry of Education of Korea for its financial support toward the Electrical and Computer Engineering Division at POSTECH through its BK21 program. This research was also partially supported by a grant (R01-1999-000-00224-0) from the Korea Science & Engineering Foundation.
References
[1] Yang, J., Waibel, A.: A Real-Time Face Tracker. Proceedings of WACV (1996) 142–147
[2] Qian, R. J., Sezan, M. I., Matthews, K. E.: A Robust Real-Time Face Tracking Algorithm. Int. Conf. on Image Processing (1998) 131–135
[3] Jang, G. J., Kweon, I. S.: Robust Real-Time Face Tracking Using Adaptive Color Model. International Symposium on Mechatronics and Intelligent Mechanical System for 21 Century, Changwon, Korea (2000)
[4] Raja, Y., McKenna, S. J., Gong, S.: Colour Model Selection and Adaptation in Dynamic Scenes. 5th European Conference on Computer Vision (1998) 460–474
[5] Birchfield, S.: An Elliptical Head Tracker. 31st Asilomar Conference (1997) 1710–1714
[6] Birchfield, S.: Elliptical Head Tracking Using Intensity Gradients and Color Histograms. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, California (1998) 232–237
[7] Isard, M., Blake, A.: CONDENSATION – Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision 29(1) (1998) 5–28
[8] Isard, M., Blake, A.: ICONDENSATION: Unifying Low-Level and High-Level Tracking in a Stochastic Framework. ECCV (1) (1998) 893–908
[9] Nishihara, H. K., Thomas, H. J., Huber, E.: Real-Time Tracking of People Using Stereo and Motion. Proceedings of the SPIE, volume 2183 (1994) 266–273
[10] Blake, A., Isard, M., Reynard, D.: Learning to Track the Visual Motion of Contours. Artificial Intelligence 78 (1995) 101–134
[11] Blake, A., Isard, M.: Active Contours. Springer-Verlag (1998)
Fusion of Statistical and Structural Fingerprint Classifiers

Gian Luca Marcialis, Fabio Roli, and Alessandra Serrau

Department of Electrical and Electronic Engineering, University of Cagliari
Piazza d'Armi, 09123 Cagliari, Italy
{marcialis,roli,serrau}@diee.unica.it
Abstract. Classification is an important step towards fingerprint recognition. In the classification stage, fingerprints are usually assigned to one of the five classes "A", "L", "R", "T", "W". The aim is to reduce the number of comparisons that are necessary for recognition. Many approaches to fingerprint classification have been proposed so far, but very few works have investigated the potential of combining statistical and structural algorithms. In this paper, an approach to the fusion of statistical and structural fingerprint classifiers is presented, and experiments that show the potential of such fusion are reported.
1 Introduction
In recent years, Automated Fingerprint Identification Systems (AFISs) have become very important for identification through the recognition of fingerprints found at crime scenes. The identification time strictly depends on the number of fingerprints stored in the database, and it is well known that this number can be very large (more than 70 million fingerprints in the FBI database). The usual solution is to classify the given fingerprint into one of the following five classes: Arch (A), Tented Arch (T), Left Loop (L), Right Loop (R), and Whorl (W) [1]. The next step is to recognise the fingerprint by a search among the fingerprints associated with the identified class. This strategy obviously reduces the identification time.

The proposed approaches to automatic fingerprint classification can be coarsely subdivided into the two main categories of statistical and structural approaches. In the statistical approaches, fingerprints are characterised by a set of measurements, called a feature vector, which is extracted from the fingerprint image and used for classification [2, 3]. Structural approaches basically use syntactic or structural pattern-recognition methods [4-8]: fingerprints are described by production rules or relational graphs, and parsing processes or graph matching algorithms are used for classification.

It is worth remarking that the structural approach to fingerprint classification has not received much attention until now. However, a simple visual analysis of the "structure" of fingerprint images allows one to see that structural information can be very useful for distinguishing fingerprint classes of the "arch" and "whorl" types. Such "structure" can be extracted by a segmentation of directional images, that is, by segmenting fingerprint images into regions characterised by homogeneous ridge directions (Figure 1). On the other hand, it is easy to see that structural information is not particularly appropriate for distinguishing fingerprint classes of the "right loop", "left loop" and "tented arch" types (Figure 2). Accordingly, the combination of statistical and structural fingerprint classifiers should be investigated. With regard to this issue, it is worth noting that very few works have investigated the potential of such fusion [5-7].

In this paper, we show by experiments that: fusion of structural and statistical fingerprint classifiers can outperform the best individual classifier; the structural classifier outperforms the statistical one for fingerprint classes with a clear "structure", in particular the "A" class (Figure 1). In Section 2, the structural fingerprint classifier we used is briefly described. In Section 3, the statistical classifier and the fusion rules are described. In Section 4, experimental results are reported. Conclusions are drawn in Section 5.
Fig. 1. “Structure” of Fingerprint Classes A and W extracted by a segmentation of directional images
Fig. 2. “Structure” of Fingerprint Classes L, R, T extracted by a segmentation of directional images
2 Structural Fingerprint Classification
2.1 An Overview of Our Approach
The structural classification is carried out by the following modules:

1. The pre-processor module. It enhances the quality of the original fingerprint image. Then, the orientation field is computed according to [3] and segmented by the algorithm presented in [8]. Further information on this module can be found in [6,7].
2. The Directed Positional Acyclic Graph (DPAG) generator module. The "structure" of the fingerprint is represented by the graph extracted from the segmented image. For the sake of brevity, we describe in more detail only the graph generator.
3. The Recursive Neural Network [9], a machine learning model that is able to learn structured inputs in the form of DPAGs.

2.2 Graph Representation of the Segmented Images
It is quite easy to see by a visual analysis of fingerprint images that the fingerprint "structure" can be extracted by segmenting the fingerprint image into regions characterised by homogeneous ridge directions (Figures 1 and 2). Such regions are associated with the nodes of the Directed Positional Acyclic Graph that we selected for representing such a segmentation. A DPAG is a directed acyclic graph in which a "child node" (a node linked to another node through an edge) can be distinguished from the others because its "position" with respect to the "father node" is univocally determined: we can identify the first child, the second child, and so on. The maximum number of children nodes, called the "out-degree", is given. In our DPAG, a "super-source" node is also defined: the "super-source" is the node from which it is possible to reach all nodes of the graph by following the direction of each edge. The generation of the DPAG from a given segmented image is carried out by the following algorithm:

• The whole orientation field image is the super-source.
• The regions of the segmented orientation field image are first ordered according to the relative positions of their centres of mass. The first region R1 is assigned as the first child of the super-source.
• The sub-image starting from the x-coordinate of R1's centre of mass is partitioned into od rectangles, where od is the out-degree of the DPAG. Figure 3 shows an example of such rectangles starting from the region labelled '2'; it also shows the centres of mass of the regions as small filled circles. In the example, od = 8.
• The first barycentre belonging to a region adjacent to R1 that is found in the i-th rectangle becomes the i-th child of the node.
• The same process is repeated for the children nodes until all regions have been considered.

Figure 3 shows an example of a DPAG extracted from a fingerprint image segmentation. Further details on the algorithm for graph generation can be found in [7]. The representation is completed by attaching to each graph node a feature vector containing the local characteristics of the region (area, average directional value, etc.) and the geometrical and spectral relations among adjacent regions (relative positions, differences among directional average values, etc.). With respect to the DPAG representation presented in [6,7], we added novel features based on the probability distribution of the orientations in each region. In particular, we achieved good results
by using non-parametric and parametric "distance" measures between the orientation distributions of father and child nodes.
Fig. 3. Example of DPAG extracted from a fingerprint segmentation
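A minimal sketch of the region-ordering and child-assignment steps of this algorithm is given below, assuming each region is summarised only by its centre of mass and an adjacency list; the rectangle partition and the stopping test are simplified, so this is an illustration rather than the authors' implementation.

```python
import numpy as np

def build_dpag(centers, adjacency, out_degree=8):
    """Toy DPAG construction from segmented regions.

    centers   : list of (x, y) centres of mass, one per region
    adjacency : dict region index -> set of adjacent region indices
    Returns   : dict node -> list of length out_degree (child indices or None)
    """
    n_regions = len(centers)
    # order regions by centre of mass (here simply lexicographically on y, then x)
    order = sorted(range(n_regions), key=lambda i: (centers[i][1], centers[i][0]))
    children = {'S': [None] * out_degree}        # 'S' is the super-source
    children['S'][0] = order[0]                  # the first region is its first child
    visited = {order[0]}
    queue = [order[0]]
    while queue:
        node = queue.pop(0)
        children[node] = [None] * out_degree
        x0 = centers[node][0]
        # partition the sub-image to the right of x0 into out_degree vertical rectangles
        width = max(max(c[0] for c in centers) - x0, 1.0)
        for neigh in sorted(adjacency.get(node, ())):
            if neigh in visited:
                continue
            slot = int(out_degree * (centers[neigh][0] - x0) / (width + 1e-9))
            slot = min(max(slot, 0), out_degree - 1)
            if children[node][slot] is None:     # first barycentre found in that rectangle
                children[node][slot] = neigh
                visited.add(neigh)
                queue.append(neigh)
    return children
```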
2.3 Recursive Neural Networks for Fingerprint Classification
Our approach relies on the Recursive Neural Network [9], a machine learning architecture which is capable of learning hierarchical data structures, such as the structural representation of fingerprints employed in this paper. The input to the network is a labeled DOAG U, where the label U(v) at each vertex v is a real-valued feature vector associated with a fingerprint region, as described in Section 2.2. A hidden state vector X(v) ∈ R^n is associated with each node v; this vector contains a distributed representation of the sub-graph dominated by v (i.e., all the vertices that can be reached by a directed path starting from v). The state vector is computed by a state transition function f that combines the state vectors of v's children with a vector encoding the label of v. Computation proceeds recursively from the frontier to the super-source (the vertex dominating all other vertices). The base step of this computation is X(v) = 0 if v is a missing child. The transition function f is computed by a multi-layer perceptron that is replicated at each node in the DOAG, sharing weights among replicas. Classification with recursive neural networks is performed by adding an output function g that takes as input the hidden state vector X(s) associated with the super-source s. Function g is also implemented by a multi-layer perceptron. The output layer uses softmax functions (normalized exponentials), so that Y can be interpreted as a vector of conditional probabilities of the classes given the input graph, i.e., Y_i = P(C = i | U), C being a multinomial class variable. Training relies on maximum likelihood. Further details can be found in [7,9]. In order to take into account ambiguous fingerprints (the so-called "cross-referenced" fingerprints), characterised by two class labels, a "soft" target vector was introduced in the training phase. The two classes were considered to have the same
probability. Experiments showed that this “soft” target reduces the noise-labelling introduced by the cross-referenced fingerprints.
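A minimal sketch of such a recursive forward pass is given below, assuming single-layer transition and output networks and nodes visited in reverse topological order; the dimensions, weight layout and activation are illustrative, not those used in the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(labels, children, order, W_label, W_state, b, W_out, b_out, n_hidden):
    """Recursive forward pass over a labeled DOAG.

    labels   : dict node -> label vector U(v)
    children : dict node -> list of children (None marks a missing child)
    order    : nodes in reverse topological order (frontier first, super-source last)
    """
    X = {}                                             # hidden state X(v) per node
    for v in order:
        kids = children.get(v, [])
        # concatenate the children states; X = 0 for a missing child (base step)
        kid_states = np.concatenate(
            [X[c] if c is not None and c in X else np.zeros(n_hidden) for c in kids]
        ) if kids else np.zeros(0)
        z = W_label @ labels[v] + b
        if kid_states.size:
            z = z + W_state[:, :kid_states.size] @ kid_states
        X[v] = np.tanh(z)                              # transition function f (an MLP in the paper)
    s = order[-1]                                      # super-source
    return softmax(W_out @ X[s] + b_out)               # class posteriors Y_i = P(C = i | U)
```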
3 Fingerprint Classification by Fusion of Structural and Statistical Classifiers
In order to investigate the potential of combining statistical and structural methods, we coupled our structural approach with the statistical approach proposed by Jain et al. [2]. The core of that approach is a novel representation scheme (called FingerCode) which is able to represent in a numerical feature vector both the minute details and the global ridge and furrow structures of fingerprints. Several strategies were evaluated for combining the statistical and structural classifiers [10]. We first assessed the performance of simple combination rules, namely the mean rule and the product rule, which require the assumption of error independence among the combined classifiers. Then, we adopted the so-called "meta-classification" (or "stacked") approach to classifier combination, which uses an additional classifier for the combination [11]. In particular, a K-Nearest Neighbour classifier was used.
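A minimal sketch of the three fusion schemes applied to the two vectors of class posteriors is given below; the tiny k-NN meta-classifier only illustrates the stacked approach, and its parameters are assumptions rather than the ones used in the experiments.

```python
import numpy as np

def mean_rule(p_rnn, p_mlp):
    """Average the class posteriors of the two classifiers and pick the largest."""
    return np.argmax((p_rnn + p_mlp) / 2.0)

def product_rule(p_rnn, p_mlp):
    """Multiply the class posteriors (error-independence assumption) and pick the largest."""
    return np.argmax(p_rnn * p_mlp)

def knn_stacked(p_rnn, p_mlp, train_stack, train_labels, k=5):
    """Stacked combination: a k-NN meta-classifier on the concatenated posteriors.

    train_stack  : (M, 2*C) concatenated [p_rnn, p_mlp] vectors from a validation set
    train_labels : (M,) their true class labels (numpy array)
    """
    query = np.concatenate([p_rnn, p_mlp])
    dists = np.linalg.norm(train_stack - query, axis=1)
    nearest = train_labels[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]                   # majority vote among the k neighbours
```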
4 Experimental Results
4.1 The Data Set
The well-known NIST-4 database containing five fingerprint classes (A, L, R, W, T) was used for the experiments. In particular, the first 1,800 fingerprints (f0001 through f0900 and s0001 through s0900) were used for classifier training. The next 200 fingerprints were used as the validation set, and the last 2,000 fingerprints as the test set. The NIST-4 data set also contains 700 ambiguous fingerprints, which are labelled with two target classes instead of one.

4.2 Results for the Individual and Combined Classifiers
Table 1 reports the overall accuracy on the test set of the following individual and combined classifiers:

1. The Recursive Neural Network trained with our structural representation ("RNN").
2. The Multi-Layer Perceptron trained with the statistical representation called "FingerCode" [2] ("MLP").
3. The combination of RNN and MLP with the mean rule ("Mean").
4. The combination of RNN and MLP with the product rule ("Product").
5. The combination of RNN and MLP according to the stacked approach with a KNN classifier ("KNN").
Table 1. Percentage accuracy of the individual classifiers and the combined classifiers on NIST-4 test set
RNN    MLP    Mean    Product    KNN
76.9   86.0   87.9    88.3       89.0
The best performance was achieved by the KNN classifier, but all the combination rules outperformed the best individual classifier. This points out the benefits of such a combination for fingerprint classification. The good performance of the Mean and Product rules, which assume independence among the classifiers' outputs, points out the complementarity between the structural and statistical representations. This complementarity is also highlighted by using an oracle, that is, an "ideal" combiner able to select the classifier, if any, that correctly classifies the input pattern. The very high performance of such an oracle, namely 94.0% accuracy, is a theoretical upper bound on the performance achievable with the fusion of structural and statistical classifiers. The obtained accuracy (89.0%) points out the potential of our fusion approach (Table 1).

4.3 Comparison among Classifiers
Figure 4 shows the accuracy-rejection curves of the individual classifiers (RNN and MLP) and the best combined classifier on the NIST-4 test set; the accuracy always increases as the rejection rate increases. Figure 5 shows the accuracy-rejection curves of the structural, statistical and combined classifiers for the A class. It points out that the structural classifier outperforms the statistical one for the A class. The superiority of the structural classifier increases with the rejection rate. This increase is larger than that of the combined classifier, and the structural classifier outperforms the combined one beyond a 3% rejection rate; the combined classifier outperforms the structural one only for small rejection rates. This effect points out that A-class fingerprints can be recognised better by a "structural" representation that exploits the information contained in such fingerprints. Even in this case, the combination of structural and statistical classifiers is able to improve performance by exploiting the evident complementarity among the diverse representations of fingerprints. For the other classes, the accuracy-rejection curves are similar to the ones shown in Figure 4: the statistical classifier always outperforms the structural one, but the fusion of these classifiers once more exhibits superior performance.
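As a rough illustration, an accuracy-rejection curve of this kind can be computed from per-sample confidences as sketched below; taking the maximum class posterior as the confidence and rejecting the least confident samples first is our assumption, since the paper does not state its exact rejection criterion.

```python
import numpy as np

def accuracy_rejection_curve(posteriors, labels, reject_rates):
    """Accuracy over the non-rejected samples, rejecting the least confident ones first.

    posteriors   : (M, C) class posteriors per test sample
    labels       : (M,)   true class indices
    reject_rates : iterable of rejection rates in [0, 1)
    """
    confidence = posteriors.max(axis=1)
    predicted = posteriors.argmax(axis=1)
    order = np.argsort(confidence)                     # least confident first
    m = len(labels)
    curve = []
    for r in reject_rates:
        kept = order[int(r * m):]                      # drop the r*m least confident samples
        curve.append((r, float(np.mean(predicted[kept] == labels[kept]))))
    return curve
```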
5 Conclusions
The structural approach to fingerprint classification has not received much attention until now. However, a simple visual analysis of the "structure" of fingerprint images allows one to see that structural information can be very useful for distinguishing strongly structured fingerprint classes, like the Arch class. On the other hand, the structural approach should be combined with the statistical one for effective classification of the other classes. In this paper, the structural approach and its combination with the statistical one were investigated.
Fig. 4. Accuracy-Rejection curves of the individual classifiers and the best combined classifier on NIST-4 test set
Fig. 5. Accuracy-Rejection curves of the individual classifiers and the best combined classifier for the A class
The reported results pointed out that the structural classifier can distinguish the A class much better than the statistical one (Figure 5). Moreover, the fusion of structural and statistical classifiers can improve on the performance of the best individual classifier (Table 1). Further investigations will address the feature extraction process and the effectiveness of the DPAG representation with larger fingerprint directional images. Other combination rules will also be investigated.
Acknowledgements

The authors wish to thank Anil K. Jain for providing them with the FingerCode representation of the NIST-4 data set, and Raffaele Cappelli and Davide Maltoni for the results of their image segmentation algorithm. Thanks also go to Paolo Frasconi, who introduced the authors to the use of recursive neural networks.
References
[1] E.R. Henry, Classification and Uses of Fingerprints, Routledge, London, 1900.
[2] A.K. Jain, S. Prabhakar, L. Hong, A Multichannel Approach to Fingerprint Classification, IEEE Transactions on PAMI, vol. 21, no. 4, pp. 348-358, 1999.
[3] G.T. Candela, P.J. Grother, C.I. Watson, R.A. Wilkinson and C.L. Wilson, PCASYS - A Pattern-Level Classification Automation System for Fingerprints, NIST Technical Report NISTIR 5647, 1995.
[4] B. Moayer, K.S. Fu, A Syntactic Approach to Fingerprint Pattern Recognition, Pattern Recognition, vol. 7, pp. 1-23, 1975.
[5] R. Cappelli, D. Maio and D. Maltoni, A Multi-Classifier Approach to Fingerprint Classification, Pattern Analysis and Applications, vol. 5, no. 2, pp. 136-144, 2002.
[6] G.L. Marcialis, F. Roli, P. Frasconi, Fingerprint Classification by Combination of Flat and Structural Approaches, Proc. of the 3rd International Conference on Audio- and Video-Based Person Authentication (AVBPA '01), J. Bigun and F. Smeraldi Eds., Springer LNCS 2091, pp. 241-246, 2001.
[7] Y. Yao, G.L. Marcialis, M. Pontil, P. Frasconi and F. Roli, Combining Flat and Structural Representations with Recursive Neural Networks and Support Vector Machines, Pattern Recognition, vol. 36, no. 2, pp. 397-406, 2003.
[8] D. Maio, D. Maltoni, A Structural Approach to Fingerprint Classification, Proc. 13th ICPR, Vienna, pp. 578-585, 1996.
[9] P. Frasconi, M. Gori, A. Sperduti, A General Framework for Adaptive Processing of Data Structures, IEEE Trans. on Neural Networks, vol. 9, no. 5, pp. 768-786, 1998.
[10] F. Roli and J. Kittler Eds., Multiple Classifier Systems, Springer LNCS 2364, 2002.
[11] G. Giacinto and F. Roli, Ensembles of Neural Networks for Soft Classification of Remote Sensing Images, European Symposium on Intelligent Techniques, 20-21 March 1997, Bari, Italy, pp. 166-170.
Learning Features for Fingerprint Classification

Xuejun Tan, Bir Bhanu, and Yingqiang Lin

Center for Research in Intelligent Systems, University of California, Riverside, CA 92521, USA
{xtan,bhanu,yqlin}@cris.ucr.edu
Abstract. In this paper, we present a fingerprint classification approach based on a novel feature-learning algorithm. Unlike current research for fingerprint classification that generally uses visually meaningful features, our approach is based on Genetic Programming (GP), which learns to discover composite operators and features that are evolved from combinations of primitive image processing operations. Our experimental results show that our approach can find good composite operators to effectively extract useful features. Using a Bayesian classifier, without rejecting any fingerprints from NIST-4, the correct rates for 4 and 5-class classification are 93.2% and 91.2% respectively, which compare favorably and have advantages over the best results published to date.
1 Introduction
The Henry System is a systematic method for classifying fingerprints into five classes: Right Loop (R), Left Loop (L), Whorl (W), Arch (A), and Tented Arch (T). Figure 1 shows examples of each class. This system of fingerprint classification is commonly used by almost all developers and users. The most widely used approaches to fingerprint classification are based on the number and relations of the singular points (SPs), which are defined as the points where a fingerprint's orientation field is discontinuous. Using SPs as reference points, Karu and Jain [8] present a classification approach based on the structural information around the SPs. Most other research uses a similar method: first find the SPs and then use a classification algorithm to discriminate among the areas around the SPs for the different classes. Several representations based on principal components analysis (PCA) [10], self-organizing maps (SOM) [11] and Gabor filters [12] have been used. The problems with these approaches are: (a) it is not easy to detect the SPs, and some fingerprints do not have SPs; (b) the uncertainty in the location of the SPs is large, which strongly affects the classification performance since the features around the SPs are used. Cappelli et al. present a structural analysis of a fingerprint's orientation field [9]. Jain and Minut propose a classification algorithm based on finding the kernel that best fits the flow field of the given fingerprint [15]. Neither of these approaches needs to find the SPs. Researchers have also tried different ways of combining classifiers to improve the classification performance. Senior [13] combines Hidden Markov Models (HMM), decision trees and PCASYS (a standard fingerprint classification algorithm) [10]. Yao et al. [14] present new fingerprint classification algorithms based on two machine learning approaches: support vector machines (SVMs) and recursive neural networks (RNNs).

The features used in those approaches are well-defined, conventional, known features. Unconventional features discovered by the computer have never been used in fingerprint classification. In most imaging applications, the task of finding a good feature is equivalent to finding a good point in the search space of composite operators, where a composite operator consists of primitive operators and can be viewed as a selected combination of primitive operations applied to images. Our Genetic Programming (GP) based approach may try many unconventional ways of combining primitive operations that may never be imagined by humans and yield exceptionally good results. The parallelism of GP and the speed of computers allow the search space explored by GP to be much larger than that explored by human experts. As the search goes on, GP will gradually shift the population of composite operators towards the portion of the space containing good composite operators.
Fig. 1. Examples of fingerprints from each class of Henry System for fingerprint classification
Genetic Programming (GP) was first proposed by Koza in [1]. Poli [2] used GP to develop effective image filters to enhance and detect features of interest or to build pixel-classification-based segmentation algorithms. Stanhope and Daida [3] used GP paradigm for the generation of rules for target/clutter classification and rules for the identification of objects. Howard et al. [4] applied GP for automatic detection of ships in low-resolution SAR imagery using an approach that evolves detectors. Roberts and Howard [5] used GP to develop automatic object detectors in infrared images. The contributions of our work are: (a) we develop an approach to learn the composite operator based on primitive features automatically. This may help us to find some useful unconventional features, which are beyond the imagination of humans. The primitive operators defined in this paper are very basic and easy to compute. (b) Primitive operators are separated into computation operators and feature generation
operators. Features are computed wherever feature generation operators are used. (c) Results are shown on the entire NIST-4 fingerprint database and are compared with other published research.
2 Technical Approach
Figure 2 shows the block diagram of our approach. During training, GP is used to generate composite operators, which are applied to the primitive features generated from the original orientation field. The feature vectors used for fingerprint classification are generated by the composite operators, and a Bayesian classifier is used for classification. During training, a fitness value is computed according to the classification result and used to drive the evolution. During testing, the learned composite operator is applied directly to generate the feature vectors. Note that, in our approach, we do not need to find any reference points. The major design considerations of GP include:
Fig. 2. Block diagram of our approach
The Set of Terminals: For a fingerprint, we can estimate the orientation field [6]. The set of terminals used in this paper are called primitive features, which are generated from the orientation field. The primitive features used in our experiments are: 1) the original orientation image; 2) mean, standard deviation, min, max and median images obtained by applying 3×3 and 5×5 templates to the orientation image; 3) edge images obtained by applying Sobel filters along the horizontal and vertical directions of the orientation image; 4) a binary image obtained by thresholding the orientation image with a threshold of 90 (note that the local orientation θ ∈ [0, 180)); and 5) images obtained by applying sin and cos operations to the orientation image. These 16 images are input
to the composite operators. GP determines which operations are applied to them and how to combine the results.

Table 1. Primitive operators used in our approach
Computation Operators
• ADD_OP, SUB_OP, MUL_OP and DIV_OP: A+B, A−B, A×B and A/B. If a pixel in B has value 0, the corresponding pixel in A/B takes the maximum pixel value in A.
• MAX2_OP and MIN2_OP: max(A,B) and min(A,B).
• ADD_CONST_OP, SUB_CONST_OP, MUL_CONST_OP and DIV_CONST_OP: A+c, A−c, A×c and A/c.
• SQRT_OP and LOG_OP: sign(A)×sqrt(|A|) and sign(A)×log(|A|).
• MAX_OP, MIN_OP, MED_OP, MEAN_OP and STD_OP: max(A), min(A), med(A), mean(A) and std(A); replace the pixel value by the maximum, minimum, median, mean or standard deviation in a 3×3 block.
• BINARY_ZERO_OP and BINARY_MEAN_OP: threshold/binarize A by zero or by the mean of A.
• NEGATIVE_OP: −A.
• LEFT_OP, RIGHT_OP, UP_OP and DOWN_OP: left(A), right(A), up(A) and down(A); move A to the left, right, up or down by 1 pixel, padding the border with zeros.
• HF_DERIVATIVE_OP and VF_DERIVATIVE_OP: HF(A) and VF(A); Sobel filters along the horizontal and vertical directions.

Feature Generation Operators
• SPE_MAX_OP, SPE_MIN_OP, SPE_MEAN_OP, SPE_ABS_MEAN_OP and SPE_STD_OP: max2(A), min2(A), mean2(A), mean2(|A|) and std2(A).
• SPE_U3_OP and SPE_U4_OP: µ3(A) and µ4(A); skewness and kurtosis of the histogram of A.
• SPE_CENTER_MOMENT11_OP: µ11(A); first-order central moment of A.
• SPE_ENTROPY_OP: H(A); entropy of A.
• SPE_MEAN_VECTOR_OP and SPE_STD_VECTOR_OP: mean_vector(A) and std_vector(A); a vector containing the mean or standard deviation value of each row/column of A.
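For concreteness, a few of the operators in Table 1 can be sketched on a NumPy orientation image as follows; this is a simplified illustration of our own, not the authors' implementation, and the 3×3 neighbourhood handling is an assumption.

```python
import numpy as np

def sqrt_op(a):
    """SQRT_OP: sign(A) * sqrt(|A|)."""
    return np.sign(a) * np.sqrt(np.abs(a))

def mean_op(a):
    """MEAN_OP: replace each pixel by the mean of its 3x3 neighbourhood (zero-padded)."""
    padded = np.pad(a.astype(np.float64), 1, mode='constant')
    out = np.zeros(a.shape, dtype=np.float64)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy:1 + dy + a.shape[0], 1 + dx:1 + dx + a.shape[1]]
    return out / 9.0

def left_op(a):
    """LEFT_OP: shift A one pixel to the left, padding the border with zeros."""
    out = np.zeros_like(a)
    out[:, :-1] = a[:, 1:]
    return out

def spe_std_op(a):
    """SPE_STD_OP: feature-generation operator returning (image, scalar feature std2(A))."""
    return a, float(a.std())
```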
The Set of Primitive Operators: A primitive operator takes one or two input images, performs a primitive operation on them and outputs a resultant image. Suppose that 1) A and B are images of the same size and c is a constant, c ∈ [−100, +100]; and 2) for operators which take two images as input, the operations are performed on a pixel-by-pixel basis. Currently, there are two kinds of primitive operators in our approach: computation operators and feature generation operators. Table 1 explains the meaning of these operators in detail. For computation operators, the output is an image, which is generated by applying the corresponding operations to the input image. For feature generation operators, however, the output includes an image and a real number or vector. The output image is the same as the input image and is passed as the input image to the next node in the composite operator. The real number or vector becomes an element of the feature vector, which is used for classification. Thus, the size of the feature
vector depends on the number of feature generation operators that are part of the composite operator.

Generation of New Composite Operators: Composite operators are represented by binary trees whose internal nodes represent the primitive operators and whose leaf nodes represent the primitive features. The GP search is performed by reproduction, crossover and mutation operations. The initial population is randomly generated. The reproduction operation used in our approach is based on tournament selection. To perform crossover, two composite operators are selected on the basis of their fitness values. One internal node in each of the two parents is randomly selected, and the two subtrees rooted at these nodes are exchanged between the parents. In this way, two new composite operators are created. Once a composite operator is selected for the mutation operation, an internal node of the binary tree representing this operator is randomly selected, and the subtree rooted at this node is deleted, including the selected node. Another binary tree is randomly generated and replaces the deleted subtree. The resulting new binary tree replaces the old one in the population. We use steady-state GP in our experiments; a detailed description of it can be found in Koza [1].

The Fitness Measure: During training, at every generation and for each composite operator proposed by GP, we compute the feature vectors and estimate the Probability Distribution Function (PDF) for each class using all the available feature vectors. Suppose the feature vectors of each class have a normal distribution, v_{i,j}, where i = 1,2,3,4,5 and j = 1,2,...,n_i, with n_i the number of training feature vectors for class i, denoted ω_i. Then, for each i, we estimate the mean µ_i and covariance matrix Σ_i from all v_{i,j}, and the PDF of ω_i can be expressed as:
p(x \mid \omega_i) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)   (1)
According to Bayesian theory, we have
v \in \omega_k \iff p(v \mid \omega_k) \cdot p(\omega_k) = \max_{i=1,2,3,4,5} \left( p(v \mid \omega_i) \cdot p(\omega_i) \right)   (2)
where x ∈ {v_{i,1}, v_{i,2}, ..., v_{i,n_i}}, n is the size of the feature vector and v is a feature vector to be classified. During training, we estimate p(x | ω_i) and then use the entire training set to perform the classification. The Percentage of Correct Classification (PCC) is taken as the fitness value of the composite operator:
\text{Fitness Value} = \frac{n_c}{n_s} \times 100\%   (3)
where n_c is the number of correctly classified fingerprints in the training set and n_s is the size of the training set. Note that if |Σ_i| = 0 for some ω_i in equation (1), we simply set the fitness value of the composite operator to 0. During testing, we still use equation (2) to
obtain the classification results on the testing set, however, none of the testing fingerprints is used in the training. Parameters and Termination: The key parameters are maximum size of composite operator (150), population size (100), number of generations (100), crossover rate (0.6), and mutation rate (0.05). The GP stops whenever it finishes the pre-specified number of generations. Composite Operator for 5-class classification, size 61: ( (SUB_OP) ( (MIN_OP) ( (HF_DERIVATIVE_OP) ( (HF_DERIVATIVE_OP) ( (ADD_CONST_OP) ( (MUL_OP) ( (SPE_STD_VECTOR_OP) ( (STDV_OP) ( (SPE_CENTER_MOMENT11_OP) ( (SQRT_OP) ( (SUB_CONST_OP) ( (VF_DERIVATIVE_OP) ( (MEAN_OP) ( (INPUT_OP: 0) ) ) ) ) ) ) ) ) ( (SUB_CONST_OP) ( (HF_DERIVATIVE_OP) ( (SUB_CONST_OP) ( (HF_DERIVATIVE_OP) ( (ADD_CONST_OP) ( (SUB_CONST_OP) ( (ADD_CONST_OP) ( (MUL_OP) ( (SPE_STD_OP) ( (MEAN_OP) ( (LOG_OP) ( (SPE_MEAN_VECTOR_OP) ( (SQRT_OP) ( (RIGHT_OP) ( (SPE_MIN_OP) ( (ABS_OP) ( (MEAN_OP) ( (INPUT_OP: 0) ) ) ) ) ) ) ) ) ) ) ( (SUB_CONST_OP) ( (SPE_MEAN_VECTOR_OP) ( (SPE_STD_VECTOR_OP) ( (SPE_MIN_OP) ( (STDV_OP) ( (SPE_CENTER_MOMENT11_OP) ( (SPE_U3_OP) ( (SPE_STD_VECTOR_OP) ( (SPE_MIN_OP) ( (STDV_OP) ( (SPE_CENTER_MOMENT11_OP) ( (SPE_U3_OP) ( (UP_OP) ( (SPE_MEAN_OP) ( (INPUT_OP: 1) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ( (SUB_CONST_OP) ( (SPE_MEAN_OP) ( (SQRT_OP) ( (SUB_CONST_OP) ( (SPE_U3_OP) ( (SPE_U4_OP) ( (SPE_STD_VECTOR_OP) ( (SPE_MIN_OP) ( (STDV_OP) ( (SPE_CENTER_MOMENT11_OP) ( (SQRT_OP) ( (SUB_CONST_OP) ( (VF_DERIVATIVE_OP) ( (INPUT_OP: 13) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
Fig. 3. Learned composite operator for 5-class classification
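A rough sketch of the fitness evaluation of Eqs. (1)-(3) is given below: one Gaussian per class is fitted to the feature vectors produced by a candidate composite operator and the training set is then scored. The jitter term and the use of log-probabilities are our own simplifications, not part of the paper.

```python
import numpy as np

def fit_class_gaussians(features, labels, n_classes):
    """Estimate (mean, covariance, prior) per class, as in Eq. (1)."""
    params = []
    for c in range(n_classes):
        x = features[labels == c]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])   # small jitter for stability
        params.append((mu, cov, len(x) / len(features)))
    return params

def log_posterior(x, params):
    """log[p(x | w_i) p(w_i)] per class, used for the decision rule of Eq. (2)."""
    scores = []
    for mu, cov, prior in params:
        d = x - mu
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * (d @ np.linalg.solve(cov, d) + logdet) + np.log(prior))
    return np.array(scores)

def fitness(features, labels, n_classes):
    """Fitness of Eq. (3): percentage of correctly classified training fingerprints."""
    params = fit_class_gaussians(features, labels, n_classes)
    correct = sum(int(np.argmax(log_posterior(x, params)) == y)
                  for x, y in zip(features, labels))
    return 100.0 * correct / len(labels)
```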
3 Experiments
3.1 Database
The database used in our experiments is the NIST Special Database 4 (NIST-4) [11]. The size of the fingerprint images is 480×512 pixels with a resolution of 500 DPI. NIST-4 contains 2000 pairs of fingerprints. Some sample fingerprints are shown in Figure 1. We use the first 1000 pairs of fingerprints for training and the second 1000 pairs for testing. In order to reduce the effect of overfitting, of the 1000 training pairs we use only the first 500 pairs to estimate the parameters of each class and use the entire training set to validate the training results.

Table 2. Confusion matrix of the testing results for 5- and 4-class classification
3.2 Experimental Results
We performed the experiments 10 times and took the best result as the learned composite operator. Figure 3 shows the best composite operator for 5-class classification, whose size is 61. Out of these 61 operators, 21 are feature generation operators, and the length of the feature vector is 87. The size of the best composite operator for 4-class classification is 149, which is much larger and not convenient to show directly. Obviously, such composite operators are not easy for humans to construct. Note that it is possible to perform feature selection to reduce the size of the feature vectors. During training, our approach runs slowly; it usually takes about 60 minutes for one generation to evolve. In testing, however, since the learned composite operator only needs to be applied to the corresponding primitive features, it runs very fast. On a SUN Ultra II workstation with a 200 MHz CPU, the average run-times for testing one fingerprint for 5-class and 4-class classification are 40 ms and 71 ms, respectively.

Table 3. Classification results on NIST-4
Approach                   Class #   Error rate %    Reject rate %   Dataset                        Comments
Karu and Jain 1996 [8]     5         14.6            zero            4000 images, no training       Decision based on topological information
                           4         8.6
Jain and Minut 2002 [15]   4         8.7             zero            Same as above                  Hierarchical kernel fitting
Jain et al. 1999 [12]      5         14.6            1.8             Training: first 2000 images    KNN
                           4         8.5                             Testing: second 2000 images
                           5         13.6                                                           Neural Network
                           4         7.9
                           5         10.0                                                           KNN+NN, two-stage classifier
                           4         5.2
Senior 2001 [13]           4         Average 8.5 (1) zero            Same as above                  Neural Network fusion with priors
Yao et al. 2003 [14]       5         10.0            1.8             Same as above                  SVM+RNN
                           4         5.3
This paper                 5         8.4             zero            Same as above                  GP-based learned features + Bayesian classifier
                           4         6.7

(1) 3.1%, 4.2%, 4.5% and 22.3% for R, L, W and A/T, respectively.

Table 2 shows the confusion matrix of our testing results on the second 1000 pairs of fingerprints in NIST-4. Note that, because of bad image quality, the ground truths of some fingerprints provided in the NIST-4 database contain two classes; for example, the ground truths of f0008_10 include classes T and L. As other researchers did in their experiments, we only use the first ground-truth label to estimate the parameters of the
classifier. However, in testing, we use all the ground-truth labels and consider a test as correctly classified if the output of the system matches one of the ground truths. The PCC is 93.2% and 91.2% for 4- and 5-class classification, respectively. The classes R, L, W, A and T are uniformly distributed in NIST-4. In nature, however, their frequencies of occurrence are 31.7%, 33.8%, 27.9%, 3.7% and 2.9%, respectively. From Table 2, we observe that most of the classification errors are related to classes A and T. Considering that A and T occur less frequently in nature, our approach is expected to perform even better in the real world. Table 3 shows the results on the NIST-4 database reported by other researchers. Considering that we have not rejected any fingerprints from NIST-4, our result is one of the best. For the 5-class classification, our result has a 1.6% advantage over the result reported in [12], even though [12] uses a reject rate of 1.8%.
4 Conclusions
In this paper, we proposed a learning algorithm for fingerprint classification based on GP. Our experimental results show that the primitive operators we selected are effective and that GP can find good composite operators, beyond what humans would imagine, to extract feature vectors for fingerprint classification. The experimental results on the NIST-4 fingerprint database show that our approach is one of the best. Without rejecting any fingerprints, our approach is promising and has advantages over the best results reported in the literature.
Acknowledgments

This work is supported in part by a grant from SONY, DiMI, I/O Software and F49620-02-1-0315. The contents and information do not necessarily reflect the positions or policies of the sponsors.
References
[1] J.R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs, MIT Press, 1994.
[2] R. Poli, Genetic programming for feature detection and image segmentation, Evolutionary Computation, T.C. Forgarty Ed., pp. 110-125, 1996.
[3] S.A. Stanhope and J.M. Daida, Genetic programming for automatic target classification and recognition in synthetic aperture radar imagery, Proc. Evolutionary Programming VII, pp. 735-744, 1998.
[4] D. Howard, S.C. Roberts, and R. Brankin, Target detection in SAR imagery by genetic programming, Advances in Eng. Software, 30(5), pp. 303-311, May 1999.
[5] S.C. Roberts and D. Howard, Evolution of vehicle detectors for infrared line scan imagery, Proc. Evolutionary Image Analysis, Signal Processing and Telecommunications, pp. 110-125, 1999.
[6] A.M. Bazen and S.H. Gerez, Systematic methods for the computation of the directional fields and singular points of fingerprints, IEEE Trans. on PAMI, vol. 24, no. 7, pp. 905-919, July 2002.
[7] C.I. Watson and C.L. Wilson, NIST special database 4, fingerprint database, U.S. National Institute of Standards and Technology, 1992.
[8] K. Karu and A.K. Jain, Fingerprint classification, Pattern Recognition, 29(3), pp. 389-404, 1996.
[9] R. Cappelli, A. Lumini, D. Maio and D. Maltoni, Fingerprint classification by directional image partitioning, IEEE Trans. PAMI, vol. 21, no. 5, pp. 402-421, 1999.
[10] G.T. Candela, P.J. Grother, C.I. Watson, R.A. Wilkinson and C.L. Wilson, PCASYS --- a pattern-level classification automation system for fingerprints, Technical Report NISTIR 5647, NIST, Apr. 1995.
[11] U. Halici and G. Ongun, Fingerprint classification through self-organizing feature maps modified to treat uncertainties, Proc. IEEE, vol. 84, no. 10, pp. 1497-1512, Oct. 1996.
[12] A.K. Jain, S. Prabhakar and L. Hong, A multichannel approach to fingerprint classification, IEEE Trans. on PAMI, vol. 21, no. 4, pp. 348-359, Apr. 1999.
[13] A. Senior, A combination fingerprint classifier, IEEE Trans. on PAMI, 23(10), pp. 1165-1174, 2001.
[14] Y. Yao, G.L. Marcialis, M. Pontil, P. Frasconi and F. Roli, Combining flat and structured representations for fingerprint classification with recursive neural networks and support vector machines, Pattern Recognition, vol. 36, no. 2, pp. 397-406, Feb. 2003.
[15] A.K. Jain and S. Minut, Hierarchical kernel fitting for fingerprint classification and alignment, Proc. ICPR, vol. 2, pp. 469-473, 2002.
Fingerprint Matching with Registration Pattern Inspection*

Hong Chen, Jie Tian, and Xin Yang

Biometrics Research Group, Institute of Automation, Chinese Academy of Science
PO Box 2728, Beijing 100080, China
[email protected]
[email protected]

Abstract. The "registration pattern" between two fingerprints is the optimal registration of each part of one fingerprint with respect to the other fingerprint. Registration patterns generated from impostor matching attempts are different from those generated from genuine matching attempts, although they may share some similarities in terms of minutiae. In this paper, we present an algorithm that uses minutiae, associate ridges and orientation fields to determine the registration pattern between two fingerprints and match them. The proposed matching scheme has two stages. An offline training stage derives a genuine registration pattern base from a set of genuine matching attempts. An online matching stage then registers the two fingerprints and determines the registration pattern; only if the pattern is a genuine one is a further fine matching conducted. The genuine registration pattern base was derived using a set of fingerprints extracted from the NIST Special Database 24. The algorithm has been tested on the second FVC2002 database. Experimental results demonstrate the performance of the proposed algorithm.
1 Introduction
Great improvement has been achieved in the development of on-line fingerprint sensing techniques and automatic fingerprint recognition algorithms. However, many challenging problems still exist. One challenge is how to reliably and adequately extract and record a fingerprint's features in a form convenient for matching. Another challenging problem is the matching of non-linearly distorted fingerprints. The distortion consists of two parts. First, the acquisition of a fingerprint is a 3D-to-2D warping process, and a fingerprint captured next time with a different contact center will result in a different warping mode.

* This paper is supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 60225008, the Prophasic Project of National Grand Fundamental Research 973 Program of China under Grant No. 2002CCA03900, and the National Natural Science Foundation of China under Grant No. 79990580.
The other possible source of distortion is the non-orthogonal pressure people exert on the sensor. How to cope with these non-linear distortions in the matching algorithm is a real challenge.

Most automatic fingerprint verification systems are based on minutiae matching, since minutiae are the most reliable feature for fingerprint matching. There are three major drawbacks to these methods. I) Imperfections in the minutiae extraction algorithm or strong noise in the image may introduce false minutiae or miss genuine minutiae. II) The rich discriminative information in the ridge/valley structure is not used by these methods. III) In order to allow greatly distorted fingerprints to be recognized, these methods have to use a large bounding box and consequently sacrifice the accuracy of the system. Jain et al. [3, 4] proposed a novel filterbank-based fingerprint feature representation method that handles the first two drawbacks well. Fan et al. [11] used a set of geometric masks to record part of the rich information of the ridge structure, and Tian et al. [13] utilized both the minutiae and part of the associate ridges. However, these methods do not solve the problem of distortion.

Recently, some methods have been presented that explicitly deal with the problem of non-linear distortion in fingerprint images and the matching of such images. Ratha et al. [8] proposed a method to measure the forces and torques on the scanner directly with the aid of specialized hardware, and Dorai et al. [9] proposed a method to detect and estimate distortion occurring in fingerprint videos. But these methods do not help when the images have already been collected. Maio and Maltoni et al. [5] proposed a plastic distortion model to "describe how fingerprint images are deformed when the user improperly places his/her finger on the sensor plate". This model helps in understanding the process; however, due to the insufficiency and uncertainty of the information, it is very difficult to automatically and reliably estimate the parameters of that model. Senior et al. [7] proposed a method to convert a distorted fingerprint image into an equally spaced fingerprint before matching and improved the matching accuracy. However, if the compression or traction force is parallel to the local ridge orientation, the inter-ridge space does not change and this method cannot detect it; moreover, the method cannot handle the distortion introduced by a different contact center. Bazen et al. [6] used a thin-plate spline (TPS) model to describe the non-linear distortions between the two sets of possibly matching minutiae pairs. By normalizing the input fingerprint with respect to the template, this method is able to perform a very tight minutiae matching and thus improve the performance. However, the TPS model focuses on smoothly interpolating images over scattered data; when applied to fingerprint recognition, it can make two fingerprints more similar to each other, whether or not they come from the same finger. Using a series of triangles with relatively small local deformations that can accumulate into a large global distortion, Kovács-Vajna [12] proposed a fingerprint verification method which is able to cope with strong deformation of fingerprint images. However, triangles with small deformations may add up to an odd global deformation pattern, which is infeasible for a genuine matching attempt but leaves some possibility open for an impostor's match.
In this paper, we introduce a novel fingerprint verification algorithm based on the determination and inspection of the registration pattern (RP) between two fingerprints. The algorithm first coarsely aligns the two fingerprints, then determines the possible RP by optimally registering each part of the two fingerprints, and next inspects the possible RP against a genuine RP space. If the RP appears to be a genuine one, a further fine matching is conducted. This paper is organized as follows. First, the feature
representation and extraction method is introduced. Next, the matching algorithm is explained. Experimental results of the algorithm are given in Section 4. Section 5 contains the summary and discussion.
2 Feature Representation and Extraction
For an input fingerprint image, we use the method described in [1, 2] to estimate the local orientation field O, enhance the image, and obtain the thinned ridge map T. The thinned ridge map is post-processed using the method described in [14], and the minutiae set M is then detected. For each minutia mi in M, we trace the associated ridges in T and record sample points P at constant intervals. The minutiae set and the sample points on the associated ridges provide information for both alignment and discrimination, while the orientation field provides very good global information about a fingerprint. We therefore choose the following features to represent a fingerprint: F = (M, P, O). An example of the feature set of a live-scan fingerprint is provided in Figure 1.
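As an illustration of how the feature set F = (M, P, O) might be organised in an implementation, the following minimal Python sketch defines one possible container; the class and field names are not from the paper and are only assumptions.

```python
# Hypothetical container for F = (M, P, O); names are illustrative, not from the paper.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Minutia:
    x: float                      # position (pixels)
    y: float
    angle: float                  # local ridge direction (radians)
    ridge_samples: np.ndarray     # sample points P on the associated ridge, shape (k, 2)

@dataclass
class FingerprintFeatures:
    minutiae: List[Minutia]           # M (with the attached ridge samples P)
    orientation_field: np.ndarray     # O, block-wise orientation estimates (radians)
```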
Fig. 1. Feature set of a live-scan fingerprint image: (a) input image; (b) orientation field; (c) ridge map. In some regions where minutiae are very sparse, the sample points on the associated ridges provide additional information for registration and discrimination
3 The Matching Algorithm
The matching algorithm has two stages. The offline training stage, presented in Section 3.3, derives a genuine RP base from a set of genuine matching attempts. The online matching stage consists of four sub-stages. First, the coarse registration presented in Section 3.1 aligns the two fingerprints and finds the possible correspondences between the two feature sets. Second, the RP determination presented in Section 3.2 registers each portion of the fingerprint optimally and thus determines the RP. Third, the RP inspection presented in Section 3.4 defines a genuine RP space and verifies whether a possible RP is a genuine one. Fourth, the fine matching stage decides the correspondence of the two feature sets and produces a matching score.
Fig. 2. Flowchart of the coarse registration with feedback: sorted global registration candidates are verified one by one by transforming the template orientation field and matching it against the input fingerprint's orientation field; a candidate that fits well yields the alignment parameter, otherwise the next candidate is tried or the attempt is rejected
3.1 Coarse Global Registration with Feedback
The task of the coarse global registration is to align the two fingerprints and find the possible corresponding point pairs between the two feature sets. We revise the registration method described in [10] and introduce an orientation-field matching-degree feedback mechanism to improve the robustness of the global alignment. To estimate the registration parameters, we use the minutiae set to construct the local structure set {Fl1, Fl2, ..., Fln}. Each local structure in the input fingerprint is compared with each local structure in the template fingerprint. Each comparison generates a registration parameter and a similarity score:
$$MFl_{p,q} = \left( Fl^t_p,\ Fl^i_q,\ (dx, dy, rot),\ s_{p,q} \right), \qquad (1)$$

where the definition of the similarity score $s_{p,q}$ is the same as in [11]. These comparisons give a possible correspondence list of feature points in the two sets:

$$L_{corr} = \{\, (p^t_a,\ p^i_b,\ MFl_{p,q}) \mid p^t_a \in Fl^t_p,\ p^i_b \in Fl^i_q \,\}. \qquad (2)$$
We cluster the registration parameters in {MFlp,q} into several candidate groups. The parameters in each group are averaged to generate a candidate global registration parameter, and the sum of the similarity scores in each group becomes the power of that candidate. The candidate parameters are sorted by their power and verified one by one using the orientation field information to choose the best global registration parameter. Figure 2 shows the flowchart of this procedure.
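A minimal sketch of this candidate-generation step is given below, assuming the local-structure comparisons have already produced (dx, dy, rot) estimates and similarity scores. The clustering tolerances and the greedy grouping strategy are assumptions, and the orientation-field feedback loop of Figure 2 (which verifies the sorted candidates one by one) is not shown.

```python
import numpy as np

def candidate_registrations(params, scores, dxy_tol=15.0, rot_tol=np.deg2rad(10)):
    """Group (dx, dy, rot) estimates into candidate global registrations,
    each weighted by the sum of its similarity scores (its 'power')."""
    clusters = []   # each cluster: {'members': [...], 'scores': [...]}
    for p, s in zip(params, scores):
        for c in clusters:
            centre = np.mean(c['members'], axis=0)
            if (abs(p[0] - centre[0]) < dxy_tol and
                    abs(p[1] - centre[1]) < dxy_tol and
                    abs(p[2] - centre[2]) < rot_tol):
                c['members'].append(p)
                c['scores'].append(s)
                break
        else:
            clusters.append({'members': [p], 'scores': [s]})
    candidates = [(np.mean(c['members'], axis=0), sum(c['scores'])) for c in clusters]
    return sorted(candidates, key=lambda t: -t[1])   # highest power first
```

The sorted list would then be handed to the orientation-field verification loop, which accepts the first candidate whose transformed template orientation field fits the input well enough.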
3.2 Registration Pattern Determination
The next step is the determination of the RP that optimally registers each part of the two fingerprints. We take the input image as the "standard image" and register the template fingerprint, treated as "a distorted image", with respect to the standard one. The feature set of the template fingerprint is first aligned with the input fingerprint using the global alignment parameter obtained in the previous step. Next, we tessellate the
overlap portion of the input image into seven non-overlapping hexagons of radius R. Then, we compute the optimal alignment parameter of the template image with respect to each hexagon in the input image. The registration method is the same as that described in Section 3.1 except that, first, the search space is greatly reduced, since the search region is restricted to the hexagon and its neighborhood, and second, the sample points on the associated ridges are utilized to provide more information for registration. The possible correspondence list is extended with the possible correspondences of the sampled feature points. We illustrate the orientation and type of the sample points in Figure 3.
Fig. 3. Orientation and type of sample points on associated ridges
The registration parameters, taken as a whole, describe the RP between the two images:

$$RP = \big( (dx_1, dy_1, rot_1),\ (dx_2, dy_2, rot_2),\ \ldots,\ (dx_7, dy_7, rot_7) \big). \qquad (3)$$
3.3 Learning Genuine Registration Patterns
Some fingerprints from different fingers may have similar flow patterns, and many of their minutiae can be matched if we use a loose bounding box in order to allow for large distortion. However, when we analyzed such pairs of fingerprints in detail, we found that their RPs differed from those obtained in true matching attempts. To learn the genuine RPs, we used a set of distorted fingerprint images to derive a genuine RP base (GRPB). This set of images was extracted from NIST Special DB24 [15]. The database contains 100 MPEG-2 compressed digital videos of live-scan fingerprint data, in which users were required to place their finger on the sensor and deliberately exaggerate the distortion once the finger touched the surface.
Fig. 4. A genuine RP derived from a true matching attempt: (a) template image; (b) input image; (c) registration pattern
We matched the images from the same finger one against another and computed the RPs; these RPs formed our GRPB. In our experiment, we use seven fixed-size hexagons with R = 45 pixels. In most cases they cover most of the overlap portion, and
the alignment parameters of these hexagons represent the whole RP well. Figure 4 shows a genuine RP our algorithm derived from two images in NIST Special DB24.
3.4 Registration Pattern Inspection
We define the distance between two RPs as

$$d(RP_i, RP_j) = \sqrt[3]{\sum_{k} \Big( |dx_k^i - dx_k^j|^3 + |dy_k^i - dy_k^j|^3 + \big(R \cdot |rot_k^i - rot_k^j|\big)^3 \Big)}, \qquad (4)$$

and a genuine RP space as

$$S_{GRP} = \{\, RP \mid \exists\, RP_i \in GRPB,\ d(RP, RP_i) < Thr_{gspace} \,\}. \qquad (5)$$
When we match fingerprints, each matching attempt generates a possible RP. If the possible RP belongs to SGRP, a further fine matching is conducted. If not, the matching attempt is rejected. Figure 5 shows a fake RP our algorithm detected.
Fig. 5. A fake RP detected by our algorithm: (a) template image; (b) input image; (c) registration pattern. The two fingerprints have some similarities in both minutiae and flow patterns; inadequate contrast in the image prevented rejection at the coarse global registration stage, although the two fingerprints in fact belong to different pattern types
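A small sketch of the RP distance and of the inspection against the genuine RP base, following Eqs. (4) and (5) as reconstructed above, is shown below; the cubic exponent on the rotation term and the array layout are assumptions.

```python
import numpy as np

R = 45.0  # hexagon radius in pixels (Section 3.3)

def rp_distance(rp_a, rp_b):
    """Distance between two registration patterns (Eq. 4); each RP has shape (7, 3)."""
    d = np.abs(np.asarray(rp_a, float) - np.asarray(rp_b, float))  # |dx|, |dy|, |rot| per hexagon
    terms = d[:, 0] ** 3 + d[:, 1] ** 3 + (R * d[:, 2]) ** 3
    return float(np.cbrt(terms.sum()))

def is_genuine_rp(rp, grpb, thr_gspace):
    """RP inspection (Eq. 5): accept if some genuine RP in the base is close enough."""
    return any(rp_distance(rp, rp_i) < thr_gspace for rp_i in grpb)
```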
3.5 Fine Matching
This is the final matching step. In the list of possible corresponding point pairs refined in the RP determination stage, each feature point in the overlap portion of the template fingerprint may have one or more corresponding feature points in the input fingerprint. The conflict of one feature point corresponding to more than one feature point is resolved by a simple rule: assign the feature point to the candidate with which it has the largest sum of similarity scores, and delete all the other correspondences. The matching score is then computed as
$$M = \frac{\sum_i s_i}{m \times \max(n_{input},\ n_{template})}, \qquad (6)$$
where m is the number of matching feature points, ninput and ntemplate are the numbers of feature points in the overlap portion of the input fingerprint and template fingerprint respectively, and si is the similarity score in the final correspondence list.
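As a worked example of the score computation, the sketch below follows Eq. (6) in the form reconstructed here (summed similarities normalised by m times the larger feature-point count); if the original normalisation differs, only the last line changes.

```python
def matching_score(similarities, n_input, n_template):
    """Final matching score of Eq. (6) as reconstructed above."""
    m = len(similarities)          # number of matched feature points
    if m == 0:
        return 0.0
    return sum(similarities) / (m * max(n_input, n_template))
```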
4 Experiments and Results
We have tested our algorithm on the second FVC2002 database [16]. This database contains 880 optical 8-bit gray-scale fingerprint images (296 × 560 pixels), 8 prints of each of 110 fingers, captured at 569 dpi. Since this resolution differs from that of the images in NIST Special DB24, which were captured at 500 dpi, we scaled the data in the GRPB accordingly. Each fingerprint in the test set was matched against all the other fingerprints in the set, giving a total of (880 × 872) / 2 = 383,680 impostor matching attempts and (110 × 8 × 7) / 2 = 3,080 genuine matching attempts. To examine the effectiveness of the RP inspection method, we computed matching scores both with and without it. The ROC (Receiver Operating Characteristic) curves are shown in Figure 6. The equal error rate (EER) is observed to be approximately 0.51% and 1.1%, respectively.
Fig. 6. ROC showing the effectiveness of the algorithm on the second FVC2002 database
5 Summary and Future Work
We have introduced a novel fingerprint verification algorithm based on the determination and inspection of the RP between two fingerprints. The coarse global registration with feedback is capable of aligning two fingerprints with very high accuracy and robustness. The inspection of the possible RP successfully detected some dangerous impostor matching attempts in which the input had a flow pattern and minutiae configuration similar to the template image; such cases are also the main reason for the high false matching rates (FMR) of traditional matching algorithms. Currently, a possible RP is inspected by a one-to-one check against the genuine RPs. We are working on deriving a knowledge base from these patterns.
References
[1] A. K. Jain, L. Hong, and R. Bolle, "On-line fingerprint verification", IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 4, pp. 302-314, 1997.
[2] L. Hong, Y. Wan, and A. K. Jain, "Fingerprint image enhancement: algorithms and performance evaluation", IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 8, pp. 777-789, 1998.
[3] A. K. Jain, S. Prabhakar, L. Hong, and S. Pankanti, "Filterbank-based fingerprint matching", IEEE Trans. on Image Processing, vol. 9, no. 5, pp. 846-859, 2000.
[4] A. Ross, A. K. Jain, and J. Reisman, "A hybrid fingerprint matcher", Pattern Recognition, 2003.
[5] R. Cappelli, D. Maio, and D. Maltoni, "Modelling plastic distortion in fingerprint images", in Proc. ICAPR 2001, Rio de Janeiro, Mar. 2001.
[6] A. M. Bazen and S. H. Gerez, "Elastic minutiae matching by means of thin-plate spline models", in Proc. 16th ICPR, Québec City, Canada, Aug. 2002.
[7] A. Senior and R. Bolle, "Improved fingerprint matching by distortion removal", IEICE Trans. Inf. and Syst., Special Issue on Biometrics, E84-D(7): 825-831, Jul. 2001.
[8] N. K. Ratha and R. M. Bolle, "Effect of controlled acquisition on fingerprint matching", in Proc. 14th ICPR, Brisbane, Australia, Aug. 1998.
[9] C. Dorai, N. Ratha, and R. Bolle, "Detecting dynamic behavior in compressed fingerprint videos: Distortion", in Proc. CVPR 2000, Hilton Head, SC, Jun. 2000.
[10] X. Jiang and W. Yau, "Fingerprint minutiae matching based on the local and global structures", in Proc. 15th ICPR, Barcelona, Spain, Sept. 2000.
[11] K. C. Fan, C. W. Liu, and Y. K. Wang, "A randomized approach with geometric constraints to fingerprint verification", Pattern Recognition, vol. 33, pp. 1793-1803, 2000.
[12] Z. M. Kovács-Vajna, "A fingerprint verification system based on triangular matching and dynamic time warping", IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 11, pp. 1266-1276, 2000.
[13] Y. He, J. Tian, X. Luo, and T. Zhang, "Image enhancement and minutia matching in fingerprint verification", Pattern Recognition Letters, vol. 24, no. 9-10, pp. 1349-1360, 2003.
[14] X. Luo and J. Tian, "Knowledge based fingerprint image enhancement", in Proc. 15th ICPR, Barcelona, Spain, Sept. 2000.
[15] C. I. Watson, NIST Special Database 24: Digital Video of Live-Scan Fingerprint Data, U.S. National Institute of Standards and Technology, 1998.
[16] D. Maio, D. Maltoni, R. Cappelli, J. L. Wayman, and A. K. Jain, "FVC2002: Second fingerprint verification competition", in Proc. 16th ICPR, Québec City, Canada, Aug. 2002.
Biometric Template Selection: A Case Study in Fingerprints Anil Jain, Umut Uludag, and Arun Ross Michigan State University, East Lansing, MI, USA 48824 {jain,uludagum,rossarun}@cse.msu.edu
Abstract. A biometric authentication system operates by acquiring biometric data from a user and comparing it against the template data stored in a database in order to identify a person or to verify a claimed identity. Most systems store multiple templates per user to account for variations in a person’s biometric data. In this paper we propose two techniques to automatically select prototype fingerprint templates for a finger from a given set of fingerprint impressions. The first method, called DEND, performs clustering in order to choose a template set that best represents the intra-class variations, while the second method, called MDIST, selects templates that have maximum similarity with the rest of the impressions and, therefore, represent typical measurements of biometric data. Matching results on a database of 50 different fingers, with 100 impressions per finger, indicate that a systematic template selection procedure as presented here results in better performance than random template selection.
1 Introduction
A biometric authentication system uses the physiological (fingerprints, face, hand geometry, iris) and/or behavioral traits (voice, signature, keystroke dynamics) of an individual to identify a person or to verify a claimed identity [1]. A typical biometric system operates in two distinct stages: the enrollment stage and the authentication stage. During enrollment, a user's (the enrollee) biometric data (e.g., fingerprints) is acquired and processed to extract a feature set (e.g., minutiae points) that is stored in the database. The stored feature set, labeled with the user's identity, is referred to as a template. In order to account for variations in the biometric data of a user, multiple templates corresponding to each user may be stored. During authentication, a user's biometric data is once again acquired and processed, and the extracted feature set is matched against the template(s) stored in the database in order to identify a previously enrolled individual or to validate a claimed identity. The matching accuracy of a biometrics-based authentication system relies on the stability (permanence) of the biometric data associated with an individual over time. In reality, however, the biometric data acquired from an individual is susceptible to changes introduced due to improper interaction with the sensor (e.g., partial fingerprints, change in pose
Fig. 1. Variations in fingerprints: Two impressions of a fingerprint acquired at different time instances exhibiting partial overlap
during face-image acquisition), modifications in sensor characteristics (e.g., optical vs. solid-state fingerprint sensor), variations in environmental factors (e.g., dry weather resulting in faint fingerprints) and temporary alterations in the biometric trait itself (e.g., cuts/scars on fingerprints). In other words, the biometric measurements tend to have a large intra-class variability. Thus, it is possible for the stored template data to be significantly different from those obtained during authentication (Figure 1), resulting in an inferior performance (higher false rejects) of the biometric system. In order to account for the above variations, multiple templates, that best represent the variability associated with a user’s biometric data, should be stored in the database. For example, one could store multiple impressions pertaining to different portions of a user’s fingerprint in order to deal with the problem of partially overlapping fingerprints. Similarly, a user’s face image acquired from multiple viewpoints may be stored in order to account for variations in a person’s pose. There is a tradeoff between the number of templates, and the storage and computational overheads introduced by multiple templates. For an efficient functioning of a biometric system, this selection of templates should be done automatically. However, there is limited literature dealing with the problem of automatic template selection in a biometric system. We propose techniques to automatically select the templates in order to account for variations observed in a user’s biometric data as well as to adequately represent typical data values. Although we consider a fingerprint-based biometric system as our test-bed, the techniques presented in this paper may be applied to other types of biometric traits (such as face and hand geometry) as well.
2 Template Selection
The problem of template selection with regard to fingerprints may be posed as follows: Given a set of N fingerprint images corresponding to a single finger, select K templates that ‘best’ represent the variability as well as the typicality observed in the N images, K < N . Currently, we assume that the value of K is predetermined. This systematic selection of templates is expected to result in
a better performance of a fingerprint matching system compared to a random selection of K templates out of the N images.

It is important to note that template selection is different from template update. The term template update is used to refer to one of the following situations: (i) Template aging: certain biometric traits of an individual vary with age. The hand geometry of a child, for example, changes rapidly during the initial years of growth. To account for such changes, old templates have to be regularly replaced with newer ones; the old templates are said to undergo aging. (ii) Template improvement: a previously existing template may be modified to include information obtained at a more recent time instance. For example, minutiae points may be added to, or deleted/modified from, the template of a fingerprint based on information observed in recently acquired impressions [2, 3, 4]. As another example, Liu et al. [5] update the eigenspace in a face recognition system via decay parameters that control the influence of old and new training samples of face images. Thus, template selection refers to the process by which prototype templates are chosen from a given set of samples, whereas template update refers to the process by which existing templates are either replaced or modified.

We propose the following two methods for template selection:

Method 1 (DEND): In this method, the N fingerprint impressions corresponding to a user are grouped into K clusters, such that impressions within a cluster are more similar than impressions from different clusters. Then, for each cluster, a prototype (representative) impression that typifies the members of that cluster is chosen, resulting in K template impressions. To perform clustering, it is required to compute the (dis)similarity between fingerprint impressions. This measure of (dis)similarity is obtained by matching the minutiae point sets of the fingerprint impressions. Our matching algorithm is based on an elastic string matching technique that outputs a distance score indicating the dissimilarity of the minutiae sets being compared [6]. Since our representation of the N fingerprint impressions is in the form of an N × N dissimilarity matrix instead of an N × d pattern matrix (d is the number of features), we use hierarchical clustering [7]. In particular, we use an agglomerative complete-link clustering algorithm. The output of this algorithm is a dendrogram, a binary tree in which each terminal node corresponds to a fingerprint impression and the intermediate nodes indicate the formation of clusters (see Figure 2). The template set T, |T| = K, is selected as follows:

1. Generate the N × N dissimilarity matrix M, where entry (i, j), i, j ∈ {1, 2, ..., N}, is the distance score between impressions i and j.
2. Apply the complete-link clustering algorithm on M and generate the dendrogram D. Use the dendrogram D to identify K clusters.
3. In each of the clusters identified in step 2, select the fingerprint impression whose average distance from the rest of the impressions in the cluster is minimum. If a cluster has only 2 impressions, choose either one of them at random.
4. The impressions selected in step 3 constitute the template set T.
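A minimal sketch of this DEND selection, assuming the N × N dissimilarity matrix of distance scores is already available, is given below; it uses SciPy's complete-link hierarchical clustering, and ties inside two-member clusters are broken deterministically rather than at random.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def select_templates_dend(dissimilarity, K):
    """DEND-style template selection from an N x N dissimilarity matrix."""
    D = np.asarray(dissimilarity, dtype=float)
    Z = linkage(squareform(D, checks=False), method='complete')   # complete-link dendrogram
    labels = fcluster(Z, t=K, criterion='maxclust')                # cut into K clusters
    templates = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        # prototype: impression with the smallest average distance within its cluster
        avg = D[np.ix_(members, members)].mean(axis=1)
        templates.append(int(members[np.argmin(avg)]))
    return sorted(templates)
```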
Fig. 2. Dendrogram generated using the 25 fingerprint impressions of one finger. The circles on the subtrees indicate impressions enclosed by the clusters for K = 5

We refer to the above algorithm as DEND since it uses the dendrogram to choose the representative templates.

Method 2 (MDIST): The second method sorts the fingerprint impressions based on their average distance score with the other impressions, and selects those impressions that correspond to the K smallest average distance scores. Here, the rationale is to select templates that exhibit maximum similarity with the other impressions and, hence, represent typical data measurements. We refer to this method as MDIST since templates are chosen using a minimum-distance criterion. Thus, for every user:

1. Find the pair-wise distance scores between the N impressions.
2. For the j-th impression, compute its average distance score, dj, with respect to the other (N − 1) impressions.
3. Choose the K impressions that have the smallest average distance scores. These constitute the template set T.

The choice of the value of K is application dependent. Larger K values would mean storing more templates per user, and this may not be feasible in systems with limited storage capacities. Moreover, during authentication, matching a query (input) image with a large number of templates per user would be computationally demanding. Smaller K values, on the other hand, may not sufficiently capture the intra-class variability nor the typicality of the impressions, leading to inferior matching performance. Therefore, a reasonable value of K, taking the aforementioned factors into account, has to be specified.
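For comparison, a minimal sketch of the MDIST selection on the same dissimilarity matrix could look as follows.

```python
import numpy as np

def select_templates_mdist(dissimilarity, K):
    """MDIST: keep the K impressions with the smallest average distance to the others."""
    D = np.asarray(dissimilarity, dtype=float)
    n = D.shape[0]
    avg = (D.sum(axis=1) - np.diag(D)) / (n - 1)   # average distance to the other N-1 impressions
    return sorted(np.argsort(avg)[:K].tolist())
```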
Fig. 3. The cluster membership (K = 5) for the dendrogram shown in Figure 2. At most 5 members are indicated for each cluster. The prototype template in each cluster is marked with a thick border. Note that the cluster in (b) has only one member
Fig. 4. The prototype templates of a finger selected using the MDIST algorithm
3 Experimental Results
In order to study the effect of automatic template selection on fingerprint matching, we need several impressions per finger (∼ 25). Standard fingerprint databases (e.g., FVC 2002 [8]) do not contain a large number of impressions per finger. Therefore, we collected 100 impressions each of 50 different fingers (10 fingers each of 5 different individuals) in our laboratory using the Identix BioTouch USB 200 optical sensor (255 × 256 images, 380 dpi). The data was acquired over a period of two months with no more than 5 impressions of a finger per day. The 100 impressions of each finger were partitioned into two sets: template selection was done using the first 25 impressions, and the matching performance of the selected templates was tested using the remaining 75 impressions (test set). Figure 2 shows the dendrogram obtained using the 25 fingerprint impressions of one finger. On setting K = 5, the resulting clusters and their prototypes as computed using the DEND algorithm are shown in Figure 3; some clusters are seen to have only one member, suggesting the possible existence of outliers. The various prototypes are observed to have different regions of overlap with respect to the extracted minutiae points. The prototypes, for the same finger, computed using the MDIST algorithm are shown in Figure 4. In order to assess the matching performance of the proposed techniques (for K = 5), we match every image in the test set (50 fingers, 75 impressions per finger) against the selected templates (5 per finger). When a test image is matched with the template set of a finger, 5 different distance scores are obtained; the minimum of these scores is reported as the final matching score. Thus, we obtain 187,500 matching scores (75 × 50 × 50) using the selected template sets. Figure 5(a) shows the ROC (Receiver Operating Characteristic) curves representing the matching performance of the template sets selected using both algorithms. The Equal Error Rates (EER) of DEND and MDIST are observed to be 7.95% and 6.53%, respectively. Now, for the 50 fingers, there are a total of $\binom{25}{5}^{50} - 1$ non-selected template sets. It is computationally prohibitive to generate the matching scores and the ROC curves corresponding to all these permutations. Therefore, we chose 53,130 permutations (assuming that the impression indices in the template sets of all 50 fingers are the same) and computed their EER. The histogram of EER values is shown in Figure 5(b). In this histogram, the vertical dashed lines indicate the EER values corresponding to the DEND
and MDIST algorithms. The percentage of non-selected template sets that have a lower EER than the template sets selected with the proposed methods is 37.7% and 0% for DEND and MDIST, respectively, thereby suggesting that systematic template selection is better than random selection. The table in Figure 6(a) lists the impressions that were selected as templates for one finger using the DEND and MDIST algorithms at different K values. The impression index indicates the acquisition time of the impressions, a lower index referring to an earlier time instance. We see that there is no direct relationship between an impression index and its choice as a template. This result suggests that template selection is a necessary step prior to matching. Figure 6(b) shows the EER of the proposed selection methods at different values of K.

Fig. 5. (a) The ROC curves (genuine accept rate vs. false accept rate) for the DEND and MDIST algorithms. (b) The EER histogram for the non-selected template sets

Fig. 6. (a) The selected templates for a finger using the DEND and MDIST algorithms at different K values:

K   DEND                             MDIST
1   2                                2
3   4, 8, 25                         2, 4, 13
5   4, 5, 9, 22, 25                  2, 3, 4, 12, 13
7   2, 5, 7, 9, 19, 22, 23           2, 3, 4, 12, 13, 21, 23
9   2, 5, 7, 8, 9, 19, 20, 22, 23    2, 3, 4, 12, 13, 20, 21, 23, 25

(b) The EER (%) of the fingerprint matcher at different values of K (number of templates)
4 Discussion and Future Work
A systematic procedure for template selection is critical to the performance of a biometric system. Based on our experiments, we observe that the MDIST algorithm for template selection results in a better matching performance than DEND. This may be attributed to the fact that MDIST chooses a template set that typifies those candidate impressions that are similar and occur frequently. However, the DEND method captures the variations associated with the fingerprint impressions. We also observe that systematic template selection results in a better performance than random selection of templates. The template selection mechanism has to be applied periodically, and in an incremental fashion, as more and more biometric samples of an individual are acquired after repeated use of the system. Currently, we are working on employing the template selection techniques to update the template set of a user in fingerprint systems. We are also examining ways to determine the value of K automatically. It may be necessary to store different numbers of templates for different users. Future work would involve testing similar techniques on face and hand biometric data.
References
[1] J. L. Wayman, "Fundamentals of biometric authentication technologies," International Journal of Image and Graphics, vol. 1, no. 1, pp. 93–113, 2001.
[2] X. Jiang and W. Ser, "Online fingerprint template improvement," IEEE Transactions on PAMI, vol. 24, pp. 1121–1126, August 2002.
[3] K. A. Toh, W. Y. Yau, X. D. Jiang, T. P. Chen, J. Lu, and E. Lim, "Minutiae data synthesis for fingerprint identification application," in Proc. International Conference on Image Processing (ICIP), vol. 3, pp. 262–265, 2001.
[4] A. K. Jain and A. Ross, "Fingerprint mosaicking," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, May 2002.
[5] X. Liu, T. Chen, and S. M. Thornton, "Eigenspace updating for non-stationary process and its application to face recognition," to appear in Pattern Recognition, Special Issue on Kernel and Subspace Methods for Computer Vision, 2003.
[6] A. K. Jain, L. Hong, and R. Bolle, "On-line fingerprint verification," IEEE Transactions on PAMI, vol. 19, pp. 302–314, April 1997.
[7] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988.
[8] D. Maio, D. Maltoni, R. Cappelli, J. L. Wayman, and A. K. Jain, "FVC2002: Fingerprint verification competition," in Proc. International Conference on Pattern Recognition (ICPR), Quebec City, Canada, pp. 744–747, August 2002.
Orientation Scanning to Improve Lossless Compression of Fingerprint Images
Johan Thärnå, Kenneth Nilsson, and Josef Bigun
School of Information Science, Computer and Electrical Engineering (IDE), Halmstad University, P.O. Box 823, 301 18 Halmstad, Sweden
{Kenneth.Nilsson,Josef.Bigun}@ide.hh.se
Abstract. While standard compression methods available include complex source encoding schemes, the scanning of the image is often performed by a horizontal (row-by-row) or vertical scanning. In this work a new scanning method, called ridge scanning, for lossless compression of fingerprint images is presented. By using ridge scanning our goal is to increase the redundancy in the data and thereby increase the compression rate. By using orientations, estimated from the linear symmetry property of local neighbourhoods in the fingerprint, a scanning algorithm which follows the ridges and valleys is developed. The properties of linear symmetry are also used for a segmentation of the fingerprint into two parts, one which lacks orientation and one that has it. We demonstrate that ridge scanning increases the compression ratio for Lempel-Ziv coding as well as recursive Huffman coding by approximately 3% on average. Compared to JPEG-LS, using ridge scanning and recursive Huffman the gain is 10% on average.
1 Introduction
The compression performance is sensitive to how data is presented to the sequential encoders, which inherently assume data to be one dimensional. Scanning corresponds to the order in which the pixels of the fingerprint are read. Horizontal scanning (row-by-row) is a common scanning method adopted in many lossless image compression algorithms, such as JPEG-LS [8] and PNG [3]. To the best of our knowledge, published lossless compression schemes adapted to fingerprints are lacking. The aim of scanning by following the ridges in the fingerprint is to reduce the local variance in the resulting data input to the encoder, and thereby increase the performance of the compression. We do not present a full lossless compression scheme but rather an important element of such a scheme: the scanning. We also present results showing that JPEG-LS does not present an advantage as compared to a straightforward use of Lempel-Ziv or recursive Huffman coding together with ridge scanning.
1.1 Linear Symmetry and Local Orientations
Linear symmetry [2, 1] is a way of describing an image by simple patterns, i.e. line patterns, and how they are rotated. The complex moments I20 and I11 can be used to estimate the linear symmetry of an image neighbourhood and are defined in the frequency domain as

$$I_{11} = \int\!\!\int (\omega_x + j\omega_y)(\omega_x - j\omega_y)\, |F(\omega_x, \omega_y)|^2\, d\omega_x\, d\omega_y, \qquad (1)$$

$$I_{20} = \int\!\!\int (\omega_x + j\omega_y)^2\, |F(\omega_x, \omega_y)|^2\, d\omega_x\, d\omega_y. \qquad (2)$$

I20 and I11 can be computed in the spatial domain by [2]

$$I_{11} = |z(x, y)| * h(x, y), \qquad (3)$$

$$I_{20} = z(x, y) * h(x, y), \qquad (4)$$

where h(x, y) is a Gaussian window function,

$$h(x, y) = e^{-\frac{x^2 + y^2}{2\sigma_h^2}}, \qquad (5)$$

and z(x, y) is the squared complex-valued gradient image, computed as

$$z(x, y) = (\hat{f}_x + j\hat{f}_y)^2. \qquad (6)$$
Thus, I20 and I11 are formed by Gaussian averaging of z(x, y) and |z(x, y)| respectively, in an image neighbourhood whose size is defined by σh. Equations 3 and 4 state that if all the gradient vectors in the local neighbourhood, defined by σh, point in the same direction, the magnitude of I20 will be the same as I11. If instead the variance in direction of the vectors within the neighbourhood is large, the magnitude of I20 will decrease compared to the case where they all have the same direction. Thus a certainty measure of linear symmetry for the neighbourhood, and for the angular information contained within I20, is |I20|/I11. The dominant orientation in the local neighbourhood is extracted from I20 as

$$\theta = \tfrac{1}{2}\arg(I_{20}), \qquad (7)$$

where the factor 1/2 is due to the double-angle representation of z(x, y) in Equation 6. The left panel of Figure 1 displays the magnitude of I20 for the fingerprint in Figure 3, the middle panel shows the magnitude of the low-pass filtered I20 image, and the right panel shows θ. We use the local orientation θ to estimate the appropriate direction for the scanning.
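A compact sketch of this orientation estimation (Eqs. 3–7), assuming Sobel derivatives for the gradient and SciPy's Gaussian filtering for the window h, is given below; the derivative operator and the default value of σh are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def linear_symmetry(image, sigma_h=3.0):
    """Estimate I20, I11, the orientation theta and the certainty |I20|/I11."""
    f = np.asarray(image, dtype=float)
    fx = sobel(f, axis=1)                       # horizontal derivative estimate
    fy = sobel(f, axis=0)                       # vertical derivative estimate
    z = (fx + 1j * fy) ** 2                     # squared complex gradient, Eq. (6)
    # Gaussian averaging with window sigma_h (Eqs. 3-5); complex parts filtered separately
    i20 = gaussian_filter(z.real, sigma_h) + 1j * gaussian_filter(z.imag, sigma_h)
    i11 = gaussian_filter(np.abs(z), sigma_h)
    theta = 0.5 * np.angle(i20)                 # Eq. (7)
    certainty = np.abs(i20) / np.maximum(i11, 1e-9)
    return i20, i11, theta, certainty
```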
Fig. 1. Left: displaying |I20 | for the fingerprint in figure 3. Middle: the low-pass filtered (downsampling followed by upsampling) version of the left image. Right: θ for the fingerprint (not low-pass filtered)
Segmentation. The certainty measure of linear symmetry offers the possibility of segmenting the image into two parts. A boolean (binary) image, Lb, is created, indicating whether the certainty measure exceeds a certain threshold or not. Figure 2 displays, from left to right, the binary image Lb, the part of the fingerprint that exceeds the threshold value, and the part that does not. When computing the certainty measure, a low-pass filtered version of I20 is used. This gives the ability to encode the two parts separately, as they are ideally uncorrelated. If the value of |I20| exceeds the threshold, the corresponding position in Lb is set to the estimated orientation angle θ; if not, the corresponding position of Lb is set to a reserved value.
1.2 Ridge Scanning
Scanning corresponds to the order that a set of pixel samples are read. The usual method is the horizontal (row-by-row) scanning. However, we argue that the scanning method should depend on the information theoretic encoding approach taken at the final stage of compression.
Fig. 2. Left: the binary image Lb . Middle: the part of the fingerprint that exceeds the threshold for linear symmetry. Right: the part of the fingerprint that lacks linear symmetry
346
Johan Th¨ arn˚ a et al.
Fig. 3. Left: a fingerprint. Middle: the corresponding scanning directions (downsampled). Right: the scanning order for ridge scanning for the part of the fingerprint with distinct linear symmetry
The scanning order can be defined by any function, as long as there is a one-to-one mapping. Using the ridge scanning approach, the scanning-order function is defined by the local orientations of the image. Figure 3 shows a fingerprint with its corresponding scanning directions. To reduce the overhead cost, a downsampled version of the estimated orientation image is used; it is upsampled by the decoder to determine the scanning order and thereby decode the information. Figure 4 shows the scanning-order calculation for the encoding/decoding procedures using ridge scanning. The encoder calculates a downsampled version of the local orientation image, L, and upsamples it again to make the scanning consistent with the decoder. When the encoder has determined the scanning-order matrix V, the pixels within the image can be scanned and put in a vector y. The value of y at position n is determined by y(n) = A(V(n)), where A is the input image. The vector y can then be compressed. The decoder receives the downsampled orientation image together with the compressed vector, upsamples the orientation image, and calculates the scanning order. The decompressed data y can then be re-ordered in the correct way.
Fig. 4. Determining the scanning order for the encoder/decoder using ridge scanning
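The mapping y(n) = A(V(n)) and its inverse are simple permutations of the pixel values; a minimal sketch is shown below, where scan_order stands for the permutation of pixel indices derived from the (upsampled) orientation image — how that permutation is built is not shown here.

```python
import numpy as np

def apply_scan_order(image, scan_order):
    """Encoder side: read pixels in the prescribed order, y(n) = A(V(n))."""
    return image.ravel()[scan_order]           # vector y handed to the entropy coder

def invert_scan_order(y, scan_order, shape):
    """Decoder side: put decompressed values back at their original positions."""
    flat = np.empty(len(y), dtype=y.dtype)
    flat[scan_order] = y
    return flat.reshape(shape)
```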
The right panel of Figure 3 displays the determined scanning order for the part of the fingerprint that was captured by the segmentation. Low gray intensity indicates a low scanning-order number; the scanning-order number increases with intensity.
1.3 Compression Algorithms
Lempel-Ziv Coding – ZIP. The Lempel-Ziv coding algorithm [9] is a lossless compression algorithm, categorized as variable-to-fixed-length coding, meaning that the input block is of variable size while the output size is fixed. The output size depends on the size of the dictionary. The Lempel-Ziv algorithm exploits redundant patterns present in the input data.

Huffman Coding. Static Huffman coding [5] is a fixed-to-variable-length encoding algorithm. Huffman coding provides shorter codes for frequently used symbols at the price of longer codes for symbols occurring at a lower frequency. When the probabilities pi of the different symbols are known, it is possible to assign a different number of bits to each symbol, based on the symbol's probability of occurrence. The theoretical minimum average number of bits required to encode a data source with Huffman coding is given by the entropy of the data source, where the entropy η is defined as [4]

$$\eta = -\sum_{i} p_i \log_2 p_i. \qquad (8)$$

Recursive Huffman. Skretting et al. [6] have proposed a recursive splitting scheme for signals before entropy coding them using Huffman coding. The algorithm splits the signal into shorter parts with lower entropy, which can then be Huffman encoded separately. The algorithm includes a balancing procedure, so that the splitting does not continue if the size of the compressed data increases. The scheme seems well suited to the ridge-scanning approach, as the scanning ideally proceeds along the ridges and valleys, where the variance is locally low. In this paper we test the hypothesis that ridge scanning, compared to horizontal scanning, should increase the redundancy in the data and therefore further increase the compression rate.

JPEG-LS. JPEG-LS [8] is an image compression algorithm that can be run in two different modes, lossless or near-lossless. Near-lossless compression corresponds to visually small changes between the reconstructed and the original data that can nevertheless damage a fingerprint, e.g. a faint minutia point can disappear. The algorithm features two different run modes: Run Mode, a run-length encoder, and Regular Mode, a predictive encoder which predicts the value of the current pixel from four of its surrounding pixels. Results of JPEG-LS (in lossless mode) are provided as a benchmark for the other results.
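The entropy bound of Eq. (8) above is easy to evaluate on the scanned vectors; a minimal sketch follows.

```python
import numpy as np

def entropy_bits_per_symbol(data):
    """Empirical entropy (Eq. 8): a lower bound on the average bits per symbol."""
    _, counts = np.unique(np.asarray(data).ravel(), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```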
Fig. 5. The fingerprints used for testing. First row: 6_3, 7_4, 9_2, 10_8, 12_2; second row: 18_3, 64_3, 86_8, 91_3, 99_6
2 Experiments and Results
Segmenting the image using the Lb representation creates two ideally uncorrelated parts, which are scanned differently. The part lacking linear symmetry is scanned vertically, while the part with distinct local orientations is scanned using ridge scanning. The vector formed from the part of the image that lacks linear symmetry, r, consists mainly of scanned areas with low variance in pixel colour. The variance within the ridge-scanned part, w, is typically higher (though the local variance is low). We encode the two vectors independently. The origin of the scanning is determined by the global orientation of the fingerprint, while the ridge-scanning order is determined by the local orientation.
2.1 Data Set
The fingerprints used in the tests are from database "db2a" of the Fingerprint Verification Competition 2000 (FVC2000) [7], available from the University of Bologna, Italy. The database consists of 8 impressions each of 100 fingers, giving a total of 800 fingerprints. The images are of size 364×256 pixels with a resolution of 500 dpi (dots per inch). The sensor used to capture the fingerprints was a low-cost capacitive sensor, and the images are stored in uncompressed Tagged Image File Format (TIFF). The fingerprints 6_3, 7_4, 9_2, 10_8, 12_2, 18_3, 64_3, 86_8, 91_3, and 99_6 shown in Figure 5 were used in the tests.
Table 1. Abbreviations for Table 2

abbreviation   scanning     compression method
hz             horizontal   zip
sH             any          static Huffman
hrH            horizontal   recursive Huffman
rrH            ridge        recursive Huffman
rz             ridge        zip
Table 2. Bits per pixel (bpp). The original fingerprint has 8 bpp; the lowest value in each row is marked with *

fingerprint  JPEG-LS   hz        sH        hrH       rrH       rz
6_3          6.0510    6.1476    5.7722    5.5273    5.3656*   5.9847
7_4          4.3854    4.2787    4.3318    4.1798    4.0034*   4.2790
9_2          4.6481    4.4079    4.3209    4.1678    3.9503*   4.3316
10_8         4.7169    4.2373    4.1210    4.0251    3.9266    3.8983*
12_2         5.9876    6.5312    6.4168    5.7771    5.7119*   6.4357
18_3         5.2967    4.9169    4.7275    4.4775    4.2844*   4.6476
64_3         3.6756    3.3162    3.5804    3.5852    3.3001    3.1614*
86_8         6.2515    6.2125    5.8616    5.6176    5.3872*   5.8939
91_3         6.2286    6.7074    6.6716    6.2085    6.0462*   6.6700
99_6         5.9128    6.8651    6.3944    6.0276    5.8786*   6.5146
average      5.3154    5.3621    5.2198    4.9594    4.7854*   5.1817
The fingerprints used in the tests were clipped to 352×256 pixels, giving a total file size of 90,112 bytes. This was done to obtain a more suitable image size when repeatedly divided by two in the downsampling process.
2.2 Compression Results
In the results for the compression algorithms, the overhead, i.e. the size of the downsampled orientation image (↓L in Figure 4), is not taken into account. If the orientation image is downsampled by a factor of 16 and the orientation is quantized to 62 levels, the uncompressed orientation image is 264 bytes, which represents less than 0.3% of the total image size. The number of bits per pixel (bpp) for the best compression methods for the different encoding approaches is displayed in Table 2, with the lowest value for each fingerprint highlighted; Table 1 lists the abbreviations used. The available implementation of JPEG-LS is provided as a benchmark against the other algorithms. From Table 2, the average gain of ridge scanning over horizontal scanning is 3.4% for zip and 3.5% for recursive Huffman. Comparing ridge scanning with JPEG-LS, the average gain is 2.5% for zip (rz) and 10% for recursive Huffman (rrH).
The total compression rate is 34% for JPEG-LS, 35% for ridge-scanning zip (rz), and 40% for ridge-scanned recursive Huffman (rrH).
3 Conclusions and Future Work
We have investigated the use of linear symmetry (i.e. local orientations) to determine an alternative way of scanning a fingerprint, compared to horizontal (row-by-row) scanning. The results provide evidence that ridge scanning applied in a lossless compression scheme increases the redundancy in the data and improves compression by information-theoretic techniques. While we could clearly provide evidence for superior performance compared to JPEG-LS, more tests are needed to confirm the advantage of ridge scanning in conjunction with recursive Huffman and Lempel-Ziv compression. A more specific algorithm needs to be developed to take full advantage of the structures within the fingerprint. Considering not only the local orientations but also the dominant frequency of the fingerprint, and creating separate vectors for different phases of the ridge/valley pattern, will probably further decrease the local variance in the data. Applying different encoding algorithms to the two parts (r and w) of the segmented image has not been tested, but might be useful as they are uncorrelated.
References
[1] J. Bigun, "Recognition of local symmetries in gray value images by harmonic functions," in Ninth International Conference on Pattern Recognition, Rome, pp. 345–347, 1988.
[2] J. Bigun and G. H. Granlund, "Optimal orientation detection of linear symmetry," in First International Conference on Computer Vision, ICCV (London), pp. 433–438, Washington, DC, June 8–11, 1987. IEEE Computer Society Press.
[3] G. Randers-Pehrson et al., PNG (Portable Network Graphics) Specification, Version 1.2. PNG Development Group, July 1999.
[4] F. Halsall, Data Communications, Computer Networks and Open Systems. Addison-Wesley, 1996.
[5] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proc. IRE, 40:1098–1101, 1952.
[6] K. Skretting, J. H. Husøy, and S. O. Aase, "Improved Huffman coding using recursive splitting," in NORSIG-99 Conference, 9–11 Sep. 1999.
[7] D. Maio, D. Maltoni, R. Cappelli, J. L. Wayman, and A. K. Jain, "FVC2000: Fingerprint verification competition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):402–412, March 2002.
[8] M. Weinberger and G. Seroussi, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS."
[9] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, IT-23(3), May 1977.
A Nonparametric Approach to Face Detection Using Ranklets
Fabrizio Smeraldi
Queen Mary, University of London, Mile End Road, London E1 4NS, UK
(Part of this research was carried out at Halmstad University, Sweden.)
Abstract. Ranklets are multiscale, orientation-selective, nonparametric rank features similar to Haar wavelets, suitable for characterising complex patterns. In this work, we employ a vector of ranklets to encode the appearance of an image frame representing a potential face candidate. Classification is based on density estimation by means of regularised histograms. Our procedure outperforms SNoW, linear and polynomial SVMs (based on independently published results) in face detection experiments over the 24’045 test images in the MIT-CBCL database.
1 Introduction
The term “nonparametrics” refers to statistical techniques that make no assumptions about the distribution of the observables. To this category belong some of the classification algorithms recently developed within learning theory, such as Support Vector Machines (SVMs) [1], that have shown a good performance in face detection tasks [2, 3]. However, nonparametrics is more commonly used with reference to statistical methods based on ranks [4]. Closely related to these are rank based features, that have been widely applied in the context of stereo correspondence [5, 6, 7]. Their main advantages consist in robustness to outliers and invariance under monotonic transformations, e.g. brightness and contrast changes and gamma correction. In this paper we propose a nonparametric approach to face detection using an image representation based on ranklets, a family of multiscale rank features featuring directional selectivity [8]. Classification is performed based on direct density estimation using frequency histograms regularised by convolution with a Gaussian kernel. Assumptions on the underlying distribution of the data are kept to a bare minimum. Our system makes use of the nonparametric approach both in the feature extraction stage and in the classification stage, whereas other procedures utilise nonparametric classifiers either directly on the intensity level of the images [2] or on the responses of a set of linear filters [3]. Face detection experiments over the 24 045 test images of the MIT-CBCL face database show that our system performs better than linear and polynomial
Fig. 1. The three Haar wavelets h1 (x), h2 (x) and h3 (x) (from left to right). Letters in parentheses refer to “treatment” and “control” pixel sets (see Sect. 2.2)
SVMs and the SNoW (Sparse Network of Winnows) algorithm (according to independently published results). The combination of ranklets and regularised histograms yields a significantly lower error rate, thus proving a promising technique for recognising deformable patterns in low resolution images.
2 Ranklets: A Family of Wavelet-Style Rank Features
Ranklets are computed starting from a ranking of the intensity values in (part of) the image, i.e. from a permutation π of the integers from 1 to N that expresses the relative order of the pixel intensity values. In the next section we recall the definition of a classic non-parametric hypothesis test, the Wilcoxon rank-sum test, from which ranklets derive. We will then specialise our notation to pixel values in image neighbourhoods in Sect. 2.2.
2.1 The Wilcoxon Rank-Sum Test
The Wilcoxon rank-sum test (also known as the Mann-Whitney U-test) is a hypothesis test designed for the comparison of two treatments [4]. Suppose that N quantities are split into two groups of n "treatment" and m "control" observations (according to the standard terminology). We ask whether the treatment observations are significantly higher than the controls. To this end, we define the Wilcoxon rank-sum statistic $W_s$ as the sum of the treatment ranks: $W_s = \sum_{i=1}^{n} \pi(i)$. The logic behind this definition is that high values of the treatment observables relative to the control observables will result in a large value of $W_s$. After an experiment is performed, the treatment values are judged to be significantly higher than the controls if $W_s$ is above a critical value τ; the value of τ determines the confidence level of the test. In the next section, we apply this test to pixel values to determine intensity variations between image regions.
2.2 Definition of Ranklets
Given an image I, we indicate with π W (I(x)) the rank of the intensity I(x) of pixel x among the intensity of the pixels in a suitably sized window W . To simplify matters, we assume that no two pixels have the same intensity; ties can be broken at random when they occur (see also Sect. 4).
A Nonparametric Approach to Face Detection Using Ranklets
353
The Wilcoxon test can be used to determine intensity variations among conveniently chosen subsets of the pixels in W. As we have shown in [8], considering the central pixel $x_0$ in W as the only "treatment" observation and the N − 1 pixels in $W \setminus \{x_0\}$ as the "control" observations reproduces the Rank transform introduced by Zabih and Woodfill [7]. Ranklets are obtained by splitting the N pixels in W into two groups of size N/2, thus assigning half of the pixels to the "treatment" group and half to the "control" group. This introduces a new degree of freedom, namely the geometric arrangement of the two regions in W, that can be exploited to obtain orientation-selective features. To this end, we define the "treatment" and "control" groups starting from the three Haar wavelets [9] $h_j(x)$, j = 1, 2, 3, displayed in Fig. 1. We identify the local neighbourhood W with the support of the $h_j$, and we define the set of "treatment" pixels $T_j$ as the counter-image of +1 under $h_j$: $T_j = h_j^{-1}(+1)$, and the set of "control" pixels $C_j$ as the counter-image of −1: $C_j = h_j^{-1}(-1)$. For each of the three resulting partitions of W, $W = T_j \cup C_j$, we compute the Wilcoxon statistics as $W_s^j = \sum_{x \in T_j} \pi^W(I(x))$. We can conveniently replace $W_s^j$ with the equivalent Mann-Whitney statistics

$$W_{YX}^j = W_s^j - \frac{(N/2 + 1)\,N}{4}, \qquad (1)$$

which has an immediate interpretation in terms of pixel comparisons. As can easily be shown [4], $W_{YX}^j$ is equal to the number of pixel pairs $(x_p, y_q)$ with $x_p \in T_j$ and $y_q \in C_j$ such that $I(x_p) > I(y_q)$. Its possible values therefore range from 0 to the number of pairs $(x_p, y_q) \in T_j \times C_j$, which is $N^2/4$. For this reason, ranklets are conveniently defined as

$$R_j = \frac{W_{YX}^j}{N^2/4} - 1. \qquad (2)$$
Thus the value of $R_j$ will be 1 if and only if, for all possible pairs of one pixel in $T_j$ and one pixel in $C_j$, the first pixel is brighter than the second. If the opposite is true, $R_j$ will be −1.
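A minimal sketch of the ranklet computation on a single square window is given below; ties are averaged rather than broken at random, and the particular assignment of the +1 (treatment) halves and quadrants is an assumption meant to mirror Fig. 1.

```python
import numpy as np
from scipy.stats import rankdata

def ranklets(window):
    """Compute R1, R2, R3 (Eqs. 1-2) on a square window with even side length."""
    W = np.asarray(window, dtype=float)
    N = W.size
    ranks = rankdata(W, method='average').reshape(W.shape)   # ties averaged, not randomised
    h, w = W.shape
    T1 = np.zeros(W.shape, dtype=bool); T1[:, w // 2:] = True    # right half (h1)
    T2 = np.zeros(W.shape, dtype=bool); T2[:h // 2, :] = True    # top half (h2)
    T3 = np.zeros(W.shape, dtype=bool)                           # opposite quadrants (h3)
    T3[:h // 2, w // 2:] = True
    T3[h // 2:, :w // 2] = True
    values = []
    for T in (T1, T2, T3):
        Ws = ranks[T].sum()                      # Wilcoxon rank sum over the treatment pixels
        Wyx = Ws - (N / 2 + 1) * N / 4           # Mann-Whitney statistic, Eq. (1)
        values.append(Wyx / (N ** 2 / 4) - 1.0)  # ranklet, Eq. (2)
    return values
```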
2.3 Geometric Interpretation
The geometric interpretation of the ranklets Rj is straightforward given the properties of WYj X and of the Haar wavelets hj . Consider for instance R1 and suppose that the local neighbourhood W straddles a vertical edge, with the darker side on the left (where C1 is located) and the brighter side on the right (corresponding to T1 ). Then R1 will be close to +1, as many pixels in T1 will have a higher intensity than those in C1 . Conversely, R1 will be close to −1 if the dark and bright side of the edge are reversed. Horizontal edges or other patterns with no global left-right variation of intensity will give a value close to zero. Therefore, R1 responds to vertical edges in the images. By a similar argument R2 detects horizontal edges, while R3 is sensitive to corners formed by horizontal and vertical lines. These response patterns closely match those of the corresponding Haar wavelets hj .
2.4 Multiscale Ranklets
Due to the close correspondence between Haar wavelets and ranklets, the multiscale nature of the former directly extends to the latter. To each translation and scaling of the $h_j$, specified by $(x_0, s)$, we associate the sets of treatment and control pixels defined by

$$T_{j;(x_0,s)} = \{x \mid h_j((x - x_0)/s) = +1\}, \qquad C_{j;(x_0,s)} = \{x \mid h_j((x - x_0)/s) = -1\}. \qquad (3)$$
We can then compute the value of $R_j(x_0, s)$ over the local neighbourhood $W_{(x_0,s)} = T_j \cup C_j$, with $N = \#W_{(x_0,s)}$.
2.5 Efficiency Considerations
As seen in Sect. 2.2, $W_{YX}^j$ equals the number of pairs $(x_p, y_q) \in T_j \times C_j$ such that the intensity at $x_p$ is larger than the intensity at $y_q$. Since $N^2/4$ such pairs can be formed out of the N pixels in W, it would seem that the number of comparisons required should grow with the square of the area of W. Notice however that these pairwise comparisons are never explicitly carried out; the value of $W_{YX}^j$ is obtained by subtracting a constant from the Wilcoxon statistics $W_s^j$ (see Eq. 1). In turn, $W_s^j$ is computed by ranking the pixels in W, which only requires on the order of N log N operations. The intuitively appealing interpretation in terms of a count of pixel pairs turns out to be a "free" by-product of the sorting operation.
3 Classification
Classification is performed over vectors of ranklets extracted from training and test images (see Sect. 4). We proceed by estimating a probability density P(x|face) starting from the facial images in the training set, and we classify the test images as faces or non-faces by thresholding such distribution: P(x|face) > τ . This is equivalent to applying the Bayesian rule to the face vs non-face problem under the assumption that x is uniformly distributed. In the realistic case of a detector that scans a large number of frames only a small fraction of which actually represent faces, this reduces to assuming a uniform distribution for P(x|non-face), which would seem a reasonable guess in the absence of further information. We thus avoid the difficult problem of estimating P(x|non-face) for arbitrary image backgrounds. 3.1
3.1 Density Estimation
Density estimation is an ill-posed and difficult problem. In the case of face detection, however, the use of histograms has been shown to give good results [3].
Training data are in general insufficient to provide a reliable estimate of a multi-dimensional histogram. Therefore, the original multivariate distribution must be factored into a product of univariate densities, thus implicitly assuming the independence of the corresponding observables. This amounts to imposing a constraint on the form of the distribution, so that the choice of the "independent" variables is crucial for the performance of the algorithm. In [3], a number of "visual attributes" is obtained by grouping together wavelet coefficients selected according to their frequency and orientation characteristics; these quantities are then effectively treated as uncorrelated in the decision rule. We decided to approximate P(x|face) as a product of one-dimensional densities along the Principal Components (PCs) of the training data:

$$P(x|\mathrm{face}) \approx \prod_{i=1}^{D} P_i(x \cdot e_i) \qquad (4)$$
where x is a feature vector (centred around the average), D is the dimensionality of the feature space and the e_i are the PCs, that satisfy C e_i = σ_i² e_i, where C is the correlation matrix of the training data. In the following we assume that the σ_i are sorted in decreasing order. Let us emphasise that the scalar products x · e_i, though uncorrelated, are in general not independent (though they might be, for instance in the case that P(x|face) were Gaussian). We are therefore introducing an approximation. However, this choice has the advantage of reproducing the second order moments of P(x|face) exactly. Experimentally we find that, while for low values of i the P_i are definitely multi-modal, for higher values of i they tend to a Gaussian (Fig. 2). This is consistent with the fact that the low-variance components are more affected by noise. As a consequence, it makes sense, both in terms of speed and memory efficiency and for the purpose of noise control, to replace the estimated densities for high values of i with a Gaussian, thus rewriting (4) as

$$P(x|\mathrm{face}) \approx \prod_{i=1}^{d} P_i(x \cdot e_i) \, G\big(x_\perp; (\sigma_{d+1}, \ldots, \sigma_D)\big) \qquad (5)$$
where x_⊥ denotes the projection of x along Span{e_{d+1}, ..., e_D}. In a further approximation step, we replace G with the one-dimensional Gaussian G(||x_⊥||; σ_{d+1}). For low values of d, x_⊥ can be obtained as x_⊥ = x − Σ_{i=1}^{d} (x · e_i) e_i, which lowers both computation and storage requirements.
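As an illustrative sketch (ours; the variable names and the use of the sample covariance are our assumptions about details summarised above), the principal components and the per-component projections that feed the factored density of Eq. (5) can be obtained as follows:

```python
import numpy as np

def fit_pc_projections(X, d):
    """PCs of the training data and the projections x . e_i for the first d
    components; the projections are the inputs to the histograms of Sect. 3.2."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]           # variances in decreasing order
    sigma2, E = eigvals[order], eigvecs[:, order]
    proj = Xc @ E[:, :d]                        # shape: (n_samples, d)
    return mean, E, sigma2, proj
```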
3.2 Regularised Histograms
The univariate distributions Pi (x · ei ) along the first d Principal Components are estimated by means of frequency histograms Hi . The support Si of each Hi is chosen to span the projections x · ei of all the training vectors, with the ten largest and smallest values removed to limit the influence of outliers. We then divide Si into 500 equally spaced bins. Due to the relatively low number of training examples, the estimated histogram is affected by high levels of noise (Fig. 2),
leading to poor classification results. Contrary to what might be expected, reducing the number of bins in H_i does not lead to an improvement. Instead, performance is dramatically improved (see Sect. 4) if the H_i are regularised by convolution with a Gaussian kernel K, to give H̃_i = K ∗ H_i (Fig. 2). We finally construct our estimate of P_i(x · e_i) as

$$P_i(x \cdot e_i) \approx \begin{cases} \tilde{H}_i(x \cdot e_i)/A & \text{if } x \in S_i, \\ G(x, \sigma_i) & \text{otherwise,} \end{cases} \qquad (6)$$

where A is a normalisation factor and the Gaussian model G is used outside the support of the histogram (σ_i is obtained from the same eigenvalue problem that determines the PCs e_i).

Fig. 2. Histograms: along the 1st PC (H_1, left); same, regularised (H̃_1, centre); along the 180th PC, regularised (H̃_180, right). Note how H̃_180 is closer to a Gaussian
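A possible implementation of the regularised histogram estimate, offered as a sketch only (the bin count and trimming follow the text; whether σ = 20 is measured in bins is our assumption):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def regularised_histogram(projections, n_bins=500, trim=10, sigma=20.0):
    """Estimate one P_i: trimmed support, 500 bins, Gaussian-smoothed counts."""
    p = np.sort(np.asarray(projections, dtype=float))
    p = p[trim:-trim]                           # drop the ten largest/smallest values
    counts, edges = np.histogram(p, bins=n_bins, range=(p[0], p[-1]))
    smoothed = gaussian_filter1d(counts.astype(float), sigma)   # H~_i = K * H_i
    density = smoothed / (smoothed.sum() * np.diff(edges))      # normalise to a pdf
    return density, edges
```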
4 Experimental Results
We present the results of face detection experiments on the images of the MIT-CBCL face database [10], which consists of low-resolution grey level images (19 × 19 = 361 pixels) of faces and non-faces. A training set of 2 429 faces and a test set of 472 faces and 23 573 non-faces are provided. All facial images occupy nearly the entire frame; considerable pose changes are represented (Fig. 3). The database also contains a set of 4 548 non-faces intended for training. In our experiments we have discarded this training set of non-faces, since the notion of "non-face prototypes" appears to be problematic (see Sect. 3).

Fig. 3. Training faces (left), test faces (centre), and test non-faces (right) from the MIT-CBCL set. Notice how a few of the test faces in the database are incorrectly framed (centre, lower right)

Images are encoded as a vector of multiscale ranklets obtained by locating one of 5 windows W_k of decreasing size at a number of sampling points on the image. Window sizes vary from 18 × 18 to 2 × 2 pixels. Sampling points are arranged in 5 overlapping square grids covering the entire image area. The sampling grid for the largest window only includes the centre of the image, while the smallest window is sampled at 100 different locations. For each window and location three ranklets R_j are computed, giving a total of 190 × 3 = 570 features (note that we are mapping the originally 361-dimensional images to a higher-dimensional feature space). For the regularising kernel described in Sect. 3.2, a standard deviation σ = 20 has been used. We obtained optimal results by estimating 180 univariate densities P_i, i.e. by setting d = 180 in (5). With this choice, the classifier (including the principal components and the histograms) requires only 14% of the storage space occupied by the feature vectors of the training set.

By varying the threshold for acceptance, a Receiver Operating Characteristic (ROC) curve is obtained (Fig. 4). The Equal Error Rate (EER) of the system in this configuration is 13%. For comparison, when regularisation is omitted, the EER is 33%.

In our system, when two pixels in W happen to have the same intensity value, the tie is broken at random. However, in the case of nearly uniform patterns, this procedure can lead to random feature vectors. We address this problem by considering, for each W, the number of tied pixels as a fourth feature, in addition to the R_j. This additional feature may be thought of as "flagging" a situation in which the representation is unreliable. This brings the dimensionality of the feature vector to 190 × 4 = 760. As can be seen, performance is slightly increased (Fig. 4), with an EER of 11%.

According to independent results published in [11], the SNoW and linear SVMs yield EERs in excess of 20% on the same database. The EER for polynomial SVMs is around 15%. It should be noted that all these methods employ the 4 548 non-face training examples in the database, which are not used in our system. As we have shown in [8], ranklets outperform the Rank and Census transforms and Haar wavelets (experiments were performed on the same dataset using a simpler, distance-based classifier).
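For concreteness, a hedged sketch of the per-window feature encoding with tie counts (the Haar-style splits and the tie measure are our reading of the description above; the 190 window placements would be supplied by the caller, and `ranklet_by_ranking` is the sketch given in Sect. 2.5):

```python
import numpy as np

def haar_split(window, orientation):
    """Split a square pixel block into treatment/control halves for h_1, h_2, h_3."""
    h, w = window.shape
    if orientation == 1:                        # vertical edge: right vs. left half
        return window[:, w // 2:], window[:, :w // 2]
    if orientation == 2:                        # horizontal edge: top vs. bottom half
        return window[:h // 2, :], window[h // 2:, :]
    quad = np.zeros_like(window, dtype=bool)    # corner pattern: diagonal quadrants
    quad[:h // 2, :w // 2] = quad[h // 2:, w // 2:] = True
    return window[quad], window[~quad]

def encode_windows(windows):
    """Three ranklets plus a tie count per window (cf. the 190 x 4 = 760 features)."""
    features = []
    for win in windows:
        win = np.asarray(win, dtype=float)
        ties = win.size - np.unique(win).size   # crude count of tied pixels
        for j in (1, 2, 3):
            t, c = haar_split(win, j)
            features.append(ranklet_by_ranking(t.ravel(), c.ravel()))
        features.append(ties)
    return np.array(features)
```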
5 Conclusions
We have presented a face detection system based on ranklets, a family of rank features similar to Haar wavelets as far as orientation selectivity and multiscale structure are concerned. Their definition in terms of the Mann-Whitney statistic provides an efficient computing scheme and an intuitively appealing interpretation in terms of pairwise comparisons of pixel values.
Fig. 4. ROC curves for ranklets (◦) and ranklets with added tie counts (∗)

Classification is performed using a practical nonparametric density estimation algorithm based on regularised histograms. This procedure is both memory and time efficient and results in reliable classification on the non-Gaussian, multi-modal distribution underlying the face detection problem. Experimental results over a test set of 24 045 images confirm the consistently good performance of our system. The low EER (11%) obtained represents an improvement over other algorithms, including the SNoW and linear and polynomial SVMs applied directly to the intensity data. In future work we plan to complement our face detector with a multiresolution scanning procedure for the localisation of faces irrespective of size, background and position in the image.
References
[1] Vapnik, V. N.: The nature of statistical learning theory. Springer-Verlag (1995)
[2] Osuna, E., Freund, R., Girosi, F.: Training Support Vector Machines: an application to face detection. In: Proceedings of CVPR '97 (1997)
[3] Schneiderman, H., Kanade, T.: A statistical method for 3D object recognition applied to faces and cars. In: Proceedings of CVPR, IEEE (2000) 746–751
[4] Lehmann, E. L.: Nonparametrics: Statistical methods based on ranks. Holden-Day (1975)
[5] Bhat, D. N., Nayar, S. K.: Ordinal measures for visual correspondence. In: Proceedings of CVPR (1996) 351–357
[6] Kendall, M., Gibbons, J. D.: Rank correlation methods. Edward Arnold (1990)
[7] Zabih, R., Woodfill, J.: Non-parametric local transforms for computing visual correspondence. In: Proceedings of the 3rd ECCV (1994) 151–158
[8] Smeraldi, F.: Ranklets: orientation selective non-parametric features applied to face detection. In: Proc. of the 16th ICPR, Quebec, CA, Volume 3 (2002) 379–382
[9] Daubechies, I.: Ten lectures in wavelets. SIAM, Philadelphia, USA (1992)
[10] MIT Center for Biological and Computational Learning: CBCL face database no. 1. http://www.ai.mit.edu/projects/cbcl (2000)
[11] Alvira, M., Rifkin, R.: An empirical comparison of SNoW and SVMs for face detection. Technical Report AI Memo 2001-004 – CBCL Memo 193, MIT (2001) http://www.ai.mit.edu/projects/cbcl
Refining Face Tracking with Integral Projections
Ginés García Mateos
Dept. de Informática y Sistemas, Universidad de Murcia
30071 Espinardo, Murcia, Spain
[email protected] Abstract. Integral projections can be used, by themselves, to accurately track human faces in video sequences. Using projections, the tracking problem is effectively separated into the vertical, horizontal and rotational dimensions. Each of these parts is solved, basically, through the alignment of a projection signal –a one-dimensional pattern– with a projection model. The effect of this separation is an important improvement in feature location accuracy and computational efficiency. A comparison has been done with respect to the CamShift algorithm. Our experiments have also shown a high robustness of the method to 3D pose, facial expression, lighting conditions, partial occlusion, and facial features.
1 Introduction and Related Research
A wide range of approaches has been proposed to deal with the problem of human face tracking. Most of them are based on skin-like color detection [1]–[8], which is a well-known robust method to track human heads and hands. Color detection is usually followed by a location of individual facial features, for example, using PCA [2], splines [3] or integral projections [4], [5], [6]. However, in most of the existing research, projections are processed in a rather heuristic way. Other kinds of approaches are adaptations of face detectors to tracking, e.g. based on contour modelling [9], [10], eigenspaces [11] and texture maps [12]. These methods are computationally expensive and sensitive to facial expressions, so efficiency is usually achieved at the expense of reduced location accuracy. In this paper¹, we prove that projections can not only be used to track human faces, but that the dimensionality reduction involved in projection yields substantial improvements in computational efficiency and location accuracy. The technique has been implemented using the Intel IPL and OpenCV libraries [13], and compared with the CamShift algorithm [1]. Each frame typically requires less than 5 ms on a standard PC, achieving a location accuracy of about 3 mm.
2 Working with Integral Projections
An integral projection is a one-dimensional pattern, obtained through the sum of a given set of pixels along a given direction. Let i(x, y) be an image and R(i) a region in it; the vertical integral projection of R(i), denoted by P_VR(i), is given by

$$P_{VR(i)}(y) = \sum_{x} i(x, y), \qquad \forall (x, y) \in R(i).$$

The horizontal projection of R(i), denoted by P_HR(i), can be defined in a similar way.

¹ This work has been supported by the Spanish MCYT grant DPI-2001-0469-C03-01.

Fig. 1. Integral projection models and reprojection. a-c) Mean and variance of the vertical projection of the face a), and the horizontal projection of eyes b) and mouth c) regions. d) Reprojection of the model by matrix product

2.1 Integral Projection Models
In order to establish a formal framework for the use of projections, it is adequate to distinguish between integral projection signals and models. A projection signal is a discrete function S : {s_min, .., s_max} → ℝ. A projection model describes a variety of signals, usually obtained from different instances of a kind of object. We propose a gaussian-style modelling, in which the model corresponding to a set of signals is given by the mean and variance at each point in the range of the signals. Thus, a projection model can be expressed as a pair of functions:
– M : {m_min, .., m_max} → ℝ. Mean at each point in the range of the signals.
– V : {m_min, .., m_max} → ℝ. Variance at each point.
Fig. 1 shows three typical examples of integral projection models, corresponding to the vertical projection of the whole face, and the horizontal projection of eyes and mouth regions. In this case, the training set contained 45 face instances.
Useful information can be obtained from simple heuristic analysis of integral projections, e.g. searching for global minima [4], applying fuzzy logic [6] or thresholding projections [5], without the need of using explicit models. However, these techniques are normally very sensitive to outliers. Working with explicit integral projection models entails the following advantages:
– Models are learnt from examples, avoiding the ad hoc nature of heuristic methods. This allows an easier adaptation to similar applications.
– A distance function can be defined to measure the likelihood that a certain signal S is an instance of a particular model (M, V). In our case, this distance has the form of a mean squared difference,

$$d(S, (M, V)) = \frac{1}{\|C\|} \sum_{i \in C} \frac{(S(i) - M(i))^2}{V(i)}, \qquad (1)$$

with C = {s_min, ..., s_max} ∩ {m_min, ..., m_max}. (2)
– The model can be visualized and interpreted by a human observer and, more interestingly, an approximate reconstruction can be obtained by reprojecting the model. For example, if vertical and horizontal projections are applied, the reprojection can be computed through a matrix product. An example of reprojection from a projection model is shown in Fig. 1d).
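As an illustration only (the function names and the equal-length/aligned-signal assumption are ours), a projection model and the distance of Eq. (1) can be computed as follows:

```python
import numpy as np

def fit_projection_model(signals):
    """Gaussian-style model of Sect. 2.1: pointwise mean M and variance V over a
    set of equally long training projection signals."""
    S = np.asarray(signals, dtype=float)        # shape: (n_signals, length)
    return S.mean(axis=0), S.var(axis=0) + 1e-6 # small floor avoids division by zero

def model_distance(signal, M, V):
    """Variance-weighted mean squared difference of Eq. (1) over the common
    support C (signal and model are assumed already aligned)."""
    C = min(len(signal), len(M))
    s = np.asarray(signal, dtype=float)[:C]
    return np.mean((s - M[:C]) ** 2 / V[:C])
```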
2.2 Alignment of Integral Projections
Alignment is the basic concern when dealing with projections; corresponding features in the images should be projected into the same locations. The problem of aligning 1-D signals is equivalent to object location in 2-D images. However, using projections involves a reduction in the number of degrees of freedom; in practice, only two parameters have to be considered: scale and translation. Assuming signal S is an instance of a given model (M, V), the alignment problem can be formulated as finding the values of scale d and translation e such that, after aligning S, corresponding pixels are projected in S and (M, V) into the same locations. The aligned S, denoted by S′, is given by

$$S' : \{(s_{\min} - e)/d, \ldots, (s_{\max} - e)/d\} \to \mathbb{R}; \qquad S'(i) = S(di + e). \qquad (3)$$
Note that the distance function defined in (1, 2) requires S and (M, V) to be properly aligned. Furthermore, if S is an instance of the model, this distance can be used as a goodness-of-alignment measure: a low value will be obtained for a good alignment, and a high value otherwise. Thus, substituting S in (1) with S′ in (3), the alignment problem is reformulated as finding the pair of values (d, e) which minimize

$$\frac{1}{\|C\|} \sum_{i \in C} \frac{(S(di + e) - M(i))^2}{V(i)}. \qquad (4)$$
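A brute-force search over candidate (d, e) pairs is one simple way to minimise (4); the paper does not prescribe an optimisation strategy, so the candidate grids below are purely illustrative:

```python
import numpy as np

def align_signal(signal, M, V, scales, translations):
    """Return the (d, e) pair minimising the alignment error of Eq. (4)."""
    signal = np.asarray(signal, dtype=float)
    idx = np.arange(len(M), dtype=float)
    best_err, best_de = np.inf, (1.0, 0.0)
    for d in scales:
        for e in translations:
            src = d * idx + e                              # positions S(d*i + e)
            valid = (src >= 0) & (src <= len(signal) - 1)
            if valid.sum() < 2:
                continue
            s = np.interp(src[valid], np.arange(len(signal)), signal)
            err = np.mean((s - M[valid]) ** 2 / V[valid])
            if err < best_err:
                best_err, best_de = err, (d, e)
    return best_de, best_err

# hypothetical usage:
# (d, e), err = align_signal(sig, M, V, np.linspace(0.8, 1.2, 21), range(-10, 11))
```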
3 Face Tracking Using Projections
The tracking algorithm is part of an iterative process, which recalculates the location of the face and facial features in a sequence of images. The face detection technique described in [8] is used as initialization. For each face being tracked, a bounding ellipse and the location of eyes and mouth are computed. The input to the tracker is a face model, the state of tracking in the previous image, it−1 , and a new image it . The face model consists of a set of projection models of the whole face and some parts in it, where the locations of facial features are known. This model is computed using the first image of the sequence, where eyes and mouth locations are given by the face detector, as in Fig. 2a). Basically, the algorithm consists of three main steps in which the vertical, horizontal and orientation parameters of the new location are estimated independently. This process is explained in more detail in the following subsections.
Fig. 2. Face model and preprocessing step. a) Sample face used to compute a face model. b) Expected location of the bounding ellipse and facial features in a new image, using locations in the previous one. c) Face region extracted from b), using r_tolerance = 25%. d) Face region wrapped and segmented, according to the model

3.1 Preprocessing Step
Kalman filters [2], [3] and skin-color detection [1], [4], [5] have been commonly applied in face tracking as prediction filters. In our case, we have proved that integral projections can be used to solve the problem by themselves. Thus, a null predictor is used: the locations of the bounding ellipse and the facial features in i_{t−1}(x, y) are taken as the expected locations in i_t(x, y). In the preprocessing step, a rectangle in i_t(x, y) containing the bounding ellipse and rotated with respect to the eye-line (the line going through the expected locations of both eyes) is wrapped into a rectangle of predefined size given by the face model, see Fig. 2c). In fact, an area bigger than just the bounding ellipse is wrapped. The size of this additional area is a percentage of the model size, denoted by r_tolerance (set to 25% in the experiments). The inner part of the ellipse is segmented, as shown in Fig. 2d), and taken as the input for the vertical alignment step.
3.2 Vertical Alignment Step
In this step, using vertical projections, the vertical translation and scale of the face in the new image are estimated. Firstly, the vertical projection of the segmented face region, P_VFACE, is computed, see Fig. 3c). This signal is aligned with respect to the vertical projection model (M_VFACE, V_VFACE), see Fig. 3d). Finally, the alignment parameters (d, e) are used to align the face vertically.
3.3 Horizontal Alignment Step
After the vertical alignment step, the y coordinates of mouth and eyes are known², so eyes and mouth regions are segmented (including the tolerance area) according to the model, see Fig. 3g). Using horizontal projections of these regions, the algorithm estimates the horizontal translation and scale of the face. Actually, only the horizontal projection of the eyes region, P_HEYES, is considered. As can be seen in Fig. 3i), P_HEYES is usually more stable than P_HMOUTH, producing a more reliable alignment.

² Although both eyes might not have exactly the same y coordinate, since the face could be rotated. Anyway, a small rotation is assumed in the worst case.

Fig. 3. Vertical and horizontal alignment steps. Upper row: vertical alignment. Lower row: horizontal alignment. a),f) Images used to compute the model. b),g) Segmented face, eyes and mouth regions in i_t(x, y), using the locations in i_{t−1}(x, y). c),h) Vertical and horizontal projections of the models (dashed lines) and the regions in b),g) (solid lines). d),i) The same projections after alignment. e),j) Vertical and horizontal alignment of b),g), using alignment parameters obtained in d),i), respectively

3.4 Orientation Estimation Step
Face orientation³ is computed through the estimation of the eye-line. While in the previous steps both eyes were supposed to be located along y_eyes, here the y coordinates of the right and left eyes (y_eye1 and y_eye2, respectively) are estimated independently. This process is illustrated in Fig. 4. After steps 2 and 3, the locations of the eyes are approximately known, so regions EYE1 and EYE2 can be segmented. Aligning the vertical projections of these regions, the values of y_eye1 and y_eye2 are computed.
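The following sketch strings the alignment steps together for one frame; the model layout (a dict of (M, V) pairs plus an eye-row band) is our assumption, and `align_signal` is the sketch given after Eq. (4):

```python
import numpy as np

def track_step(face_region, model, scales, shifts):
    """Vertical and horizontal alignment of Sect. 3 for a face region that has
    already been mapped to the model frame."""
    pv = face_region.sum(axis=1)                       # vertical projection
    (dv, ev), err_v = align_signal(pv, *model["face_v"], scales, shifts)

    top, bottom = model["eye_rows"]                    # rows of the eyes band
    ph = face_region[top:bottom, :].sum(axis=0)        # horizontal projection of eyes
    (dh, eh), err_h = align_signal(ph, *model["eyes_h"], scales, shifts)

    # the summed signal-to-model distance doubles as a tracking-reliability score
    return {"vertical": (dv, ev), "horizontal": (dh, eh), "reliability": err_v + err_h}
```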
4 Experimental Results
The purpose of these experiments was to assess the location accuracy of the integral projection method, its robustness to a variety of non-trivial conditions, and to compare its computational efficiency with respect to other approaches.

³ Orientation is here considered in a planar sense, i.e. rotation with respect to the image plane.
Fig. 4. Orientation estimation step. a) A sample rotated face, after vertical and horizontal alignment. b),d) Right and left eyes segmented, respectively. c),e) Vertical projections of the right and left eyes (solid lines), after alignment with the model (dashed lines). f) Final location of the facial features and the bounding ellipse
The tracking algorithm described in this paper has been implemented using the Intel IPL and OpenCV image processing libraries [13]. OpenCV provides a color-based face tracker, called CamShift [1], which has been used for comparison. This algorithm, by itself, only computes the bounding ellipse, but not the facial feature locations. To solve this problem, we have assumed that eyes and mouth move coherently with respect to the bounding ellipse. Both algorithms have been applied to four video sequences⁴, adding up to a total of 1915 frames (~72 sec.). The first two sequences were captured from TV (news programs) at a 640x480 resolution; another from an inexpensive video-conference camera; and the last one was extracted from a DVD movie. All the sequences contain wide changes in facial expression and 3D pose; the last two also include samples of partial occlusion, varying illumination and faster movements. The obtained location accuracy and execution times are summarized in Table 1. Fig. 5 shows feature location errors along time for the first two sequences; signal-to-model distances (see Sect. 2.2) of the resulting aligned projections are also shown. Finally, some sample frames for the worst error cases are presented in Fig. 6. Errors are expressed as Euclidean distances in millimeters, in the face plane, assuming an interocular distance of 70 mm and using a manual labelling of facial features as ground-truth. In all the sequences, the integral projection tracker exhibits a clearly better performance, both in time and accuracy. The average location error –always below 4 mm– contrasts greatly with the error obtained by CamShift, above 10 mm. In the last sequence, CamShift could not be applied due to the presence of skin-like color background. The CPU time required for each frame depends on the face size and is typically below 5 ms. This allows real-time processing without much CPU usage. Moreover, the algorithm is between 103% and 48% faster than CamShift, even when the latter is only applied to a part of the images.
⁴ Test videos and results available at: http://dis.um.es/~ginesgm/fip/demos.html.
Fig. 5. Integral projection face tracker results. a),b) Average feature location error in mm (thin lines) and signal-to-model distance (thick lines), for each frame in the first two sequences, respectively. c),d),e) Location errors for the right c) and left d) eyes, and mouth e), for the second sequence. The dashed circle represents the iris size

Table 1. Average and maximum feature location errors (eyes and mouth), and execution time (per frame, not including input/output), using integral projections and the CamShift algorithm. Computer used: AMD Athlon processor at 1.2 GHz

Sequence    Source    Length    Face size       Int. Projections                        CamShift
file name             (frames)  X x Y (pixels)  Loc. err. avg./max. (mm)  Time (ms)     Loc. err. avg./max. (mm)  Time (ms)
tl5-02.avi  TV        281       97x123          3.41 / 13.0               4.02          10.3 / 28.8               8.17
a3-05.avi   TV        542       101x136         1.95 / 9.76               4.62          9.22 / 24.1               7.52
ggm2.avi    QuickCam  656       70x91           1.83 / 9.29               3.69          12.4 / 30.8               5.45
13f.avi     DVD       436       146x176         3.61 / 14.2               8.35          Unable to work
Another interesting result from the experiments is the possibility of using the signal-to-model distance as a reliability degree of tracking. As shown in Fig. 5, a direct relationship exists between this value and the location error. Furthermore, a very high value (usually over 20) is obtained when the face disappears. Thus, it has been used to detect the end of tracking in a sequence.
5 Conclusions
We have presented a new approach to human face tracking in video sequences, which is exclusively based on the alignment of integral projections. The tracking problem is decomposed into three main steps, where vertical, horizontal and orientation parameters are estimated independently. Although this separability is questionable in the general case of object location, our experiments have extensively proved its feasibility in face tracking. The proposed technique exhibits a very high computational efficiency, while achieving a better location accuracy than other more sophisticated existing trackers, without losing track of the face in any frame. Each frame typically requires less than 5 ms on a standard PC, with an error of about 3 mm. Our experiments have also shown a high robustness to facial expressions, partial occlusion, lighting conditions and 3D pose. Another two interesting advantages, over color-based trackers, are that the algorithm is not affected by background distractors, and it can be applied on grey-scale images or under changing lighting or acquisition conditions.

Fig. 6. Sample frames, showing worst error cases. Eyes and mouth regions (right) are segmented according to the tracker results. Dist refers to signal-to-model distance
References
[1] Bradski, G. R.: Computer Vision Face Tracking For Use in a Perceptual User Interface. Intel Technology Journal Q2'98 (1998)
[2] Spors, S., Rabenstein, R.: A Real-Time Face Tracker for Color Video. IEEE Intl. Conference on Acoustics, Speech, and Signal Processing, Utah, USA (2001)
[3] Kaucic, R., Blake, A.: Accurate, Real-Time, Unadorned Lip Tracking. Proc. of 6th Intl. Conference on Computer Vision (1998) 370–375
[4] Sobottka, K., Pitas, I.: Segmentation and Tracking of Faces in Color Images. Proc. of 2nd Intl. Conf. on Aut. Face and Gesture Recognition (1996) 236–241
[5] Stiefelhagen, R., Yang, J., Waibel, A.: A Model-Based Gaze Tracking System. Proc. of IEEE Intl. Symposia on Intelligence and Systems (1996) 304–310
[6] Pahor, V., Carrato, S.: A Fuzzy Approach to Mouth Corner Detection. Proc. of ICIP-99, Kobe, Japan (1999) I-667–I-671
[7] Schwerdt, K., Crowley, J. L.: Robust Face Tracking Using Color. Proc. of 4th Intl. Conf. on Aut. Face and Gesture Recognition, Grenoble, France (2000) 90–95
[8] García-Mateos, G., Ruiz, A., López-de-Teruel, P. E.: Face Detection Using Integral Projection Models. Proc. of IAPR Intl. Workshops S+SSPR'2002, Windsor, Canada (2002) 644–653
[9] Isard, M., Blake, A.: Contour Tracking by Stochastic Propagation of Conditional Density. Proc. 4th Eur. Conf. on Computer Vision, Cambridge, UK (1996) 343–356
[10] Vieren, C., Cabestaing, F., Postaire, J.: Catching Moving Objects with Snakes for Motion Tracking. Pattern Recognition Letters, 16 (1995) 679–685
[11] Pentland, A., Moghaddam, B., Starner, T.: View-Based and Modular Eigenspaces for Face Recognition. Proc. CVPR'94, Seattle, Washington, USA (1994) 84–91
[12] La Cascia, M., Sclaroff, S., Athitsos, V.: Fast, Reliable Head Tracking Under Varying Illumination: An Approach Based on Registration of Texture-mapped 3D Models. IEEE PAMI, 22(4) (2000) 322–336
[13] Intel Corporation: IPL and OpenCV: Intel Open Source Computer Vision Library. http://www.intel.com/research/mrl/research/opencv/
Glasses Removal from Facial Image Using Recursive PCA Reconstruction
Jeong-Seon Park¹, You Hwa Oh¹,², Sang Chul Ahn², and Seong-Whan Lee¹
¹ Center for Artificial Vision Research, Korea University, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
{jspark, swlee}@image.korea.ac.kr
² Imaging Media Research Center, Korea Institute of Science and Technology, Hawolgok-dong 39-1, Seongbuk-ku, Seoul 136-791, Korea
{yhoh, asc}@imrc.kist.re.kr
Abstract. This paper proposes a new method of removing glasses from human frontal face images. We first detect regions occluded by glasses. Then we generate a naturally looking glassless facial image by recursive PCA reconstruction. The resulting image has no trace of glass frame, nor the reflection and shade made by the glasses. The experimental results show that the proposed method is an effective solution to the problem of glass occlusion, and, we believe, it can be applied to enhancing the performance of face recognition systems.
1 Introduction
Automatic face recognition has become a hot research issue due to its potential in a wide range of applications such as access control, human-computer interaction and automatic search in large-scale face databases [1]. One important requirement for successful face recognition is robustness to variations coming from different lighting conditions, facial expression, pose, size, and occlusion by other objects. Among others, glasses are the most common occluding objects, affecting the performance of a face recognition system significantly. Recently, a few methods for glasses extraction and removal have been reported by some researchers [2, 3, 4, 5]. Jing et al. [2] extracted glasses using a deformable contour and removed glasses from facial images. Lanitis et al. [3] showed that their flexible model, now called the active appearance model, could be used to remove small occlusions by glasses. Saito et al. [4] generated glassless facial images using PCA (Principal Component Analysis); however, the results were not good enough and left some traces of glass frames. More recently, Wu et al. [5] detected glass frames using a 3D Hough transform of stereo facial images.
To whom all correspondence should be addressed. This research was supported by Creative Research Initiatives of the Ministry of Science and Technology, Korea.
Fig. 1. Example of face detection procedure
This paper proposes a method for removing glasses from frontal face images by recursive application of PCA, in order to obtain natural looking glassless facial images. For this we first locate the regions of glass occlusion and then generate an image compensating for the occlusion. In this work the problem of occlusion includes not only the glass frames but also the reflection and shade created by the glasses.
2 Glasses Removing Method
The proposed glasses removing method consists of two steps: face detection and recursive reconstruction of a glassless facial image. Here we first describe the face detection process, with a brief review and discussion of the existing methods. Then we present a new glasses removing method based on recursive PCA reconstruction, which generates a glassless facial image from a given facial image.

2.1 Face Detection
The proposed method starts from the extraction of a gray facial image of fixed size based on color and shape information. The procedure is illustrated in Fig. 1.
Detection of Eye Candidates Using Color Information. In order to find eye candidates, we focus on skin color and black and white in the normalized color space. Then we apply the GSCD (Generalized Skin Color Distribution) transform and the BWCD (Black and White Color Distribution) transform to the input color image. Fig. 1(b-1) shows eye candidate regions in the BWCD image. On the other hand, we remove every non-skin color region in the GSCD image using a morphological closing filter, as in Fig. 1(a-2). Hair regions in the BWCD image are removed as in Fig. 1(b-2). Then, eye candidates in (c) are generated by using the combined image of (a-2) and (b-2).

Determination of an Exact Eye Position. In general, eye candidates from this processing step may not be located accurately. In this step we measure the reconstruction error, which is the disparity between the input face and its image reconstructed by PCA, in order to decide which eye candidate position is accurate. Every candidate position is moved slightly, horizontally and vertically, for trial as shown in Fig. 1(d), and their reconstruction errors are measured. The eye position with the minimum reconstruction error is chosen as the exact eye position. Then, the facial region is located with the exact eye positions and normalized as in Fig. 1(e), and the normalized image is filtered with a mask to remove regions other than the face, giving the output of Fig. 1.
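A minimal sketch of the eye-position refinement just described (the interface and search range are our assumptions; the paper gives no code):

```python
import numpy as np

def refine_eye_position(candidate, crop_face, mean_face, eigenfaces, offsets=range(-2, 3)):
    """Choose the eye position whose induced face crop has the smallest PCA
    reconstruction error.  `crop_face(pos)` must return a normalized face vector."""
    best_err, best_pos = np.inf, candidate
    for dx in offsets:
        for dy in offsets:
            pos = (candidate[0] + dx, candidate[1] + dy)
            face = crop_face(pos) - mean_face
            coeffs = eigenfaces.T @ face               # project onto the eigenfaces
            recon = eigenfaces @ coeffs                # PCA reconstruction
            err = np.linalg.norm(face - recon)         # reconstruction error
            if err < best_err:
                best_err, best_pos = err, pos
    return best_pos, best_err
```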
2.2 Simple PCA Reconstruction
A typical method of generating glassless facial images is the simple PCA-based reconstruction method developed by Saito et al. [4]. It simply combines the upper half of the facial region in the PCA-reconstructed image and the lower half of the input image. The representational power of PCA depends on the training set. For instance, if the training images do not contain glasses, reconstructed images of faces wearing glasses cannot represent the glasses properly. In other words, for input faces wearing glasses, the PCA tries to represent the glass region in the reconstructed image, although the eigenfaces are obtained from a training set of glassless facial images. Therefore, errors spread out over the whole reconstructed image, resulting in a degradation of quality with some remaining traces of glass frames. Fig. 3(b) shows examples of glasses removal results for some facial images with glasses. Even with eigenfaces obtained from a training set of glassless facial images, it is evident that the simple PCA reconstruction has some limitations in generating natural looking glassless facial images.

2.3 The Proposed Recursive PCA Reconstruction
A glass occlusion region includes not only the frame of the glasses but also the reflection by the lens and the shade cast by the glasses. In order to remove glasses and generate natural looking glassless facial images, we should find the glass occlusion region and generate a seamless glassless image. The proposed glasses removing method is composed of an off-line process, which generates eigenfaces from training glassless facial images, and an on-line process, which detects glass frames using color and edge information and recursively reconstructs the glassless facial image. We describe the detailed on-line procedure, shown in Fig. 2. The input image Γ wearing glasses can be expressed by the mean image ϕ and a weighted sum of eigenfaces µ_k,

$$\hat{\Gamma} = \varphi + \sum_{k=1}^{M} \omega_k \cdot \mu_k, \qquad k = 1, \cdots, M \qquad (1)$$

where ω_k is the weight of the k-th eigenface and Γ̂ is the reconstructed image. The difference image between the input image Γ and its reconstructed image Γ̂ is calculated by Eq. (2) below:

$$D(i) = \big(\hat{\Gamma}_i \cdot |\Gamma_i - \hat{\Gamma}_i|\big)^{1/2}, \qquad i = 1, \cdots, N \times N \qquad (2)$$

where N is the size of the input image. Then we perform difference stretching using the facial color information in order to represent the glass occlusion region clearly, as shown in Fig. 2(c). A remaining difficult problem is finding the glass frame around the eyebrows. We resort to the glass frame image extracted from the original color image, as shown in Fig. 2(d). This can enhance the differences around the eyebrows by replacing a difference error with an intensity of the glass frames, if the latter has a larger value than the former in the region having smaller errors.

Fig. 2. Glasses removing procedure of the proposed recursive PCA reconstruction
Fig. 3. Examples of glasses removal: (a) input faces wearing glasses, (b) glasses-removed faces by the simple PCA reconstruction, (c) glasses-removed faces by the proposed recursive PCA reconstruction

After the glass occlusion region is found in the previous step, the region can be compensated with the reconstructed image (or the mean image), as shown in Fig. 2(e). In the first iteration of the PCA reconstruction, the mean image is used for the compensation of the occluded region. From the second iteration, the previously reconstructed image is used. That is,

$$\Gamma_1 = w \cdot \varphi + (1 - w) \cdot \Gamma \quad (t = 1), \qquad \Gamma_t = w \cdot \hat{\Gamma}_{t-1} + (1 - w) \cdot \Gamma \quad (t > 1) \qquad (3)$$

where t is the iteration index of the recursive PCA reconstruction. The iteration stops when the difference between the currently reconstructed image and the previously reconstructed image becomes less than a given threshold ε:

$$\|\hat{\Gamma}_t - \hat{\Gamma}_{t+1}\| \leq \varepsilon \qquad (4)$$
As a result of the recursive PCA reconstruction and compensation, a natural looking glassless facial image is generated in the last step of the proposed method, as shown in Fig. 2(e). Examples of glasses-removed images produced by the proposed method are shown in Fig. 3(c).
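A hedged sketch of the recursive reconstruction loop of Eqs. (1)-(4); the blending weight, the stopping threshold and the mask-restricted compensation reflect our reading of the procedure rather than the authors' exact settings:

```python
import numpy as np

def remove_glasses(face, mean_face, eigenfaces, occlusion_mask,
                   w=0.5, eps=1e-3, max_iter=50):
    """Recursive PCA reconstruction: compensate the occluded pixels, reconstruct,
    and repeat until consecutive reconstructions stop changing (Eq. 4)."""
    gamma = face.astype(float)
    # t = 1: occluded pixels are blended with the mean image (Eq. 3, first case)
    current = np.where(occlusion_mask, w * mean_face + (1 - w) * gamma, gamma)
    prev_recon = None
    for _ in range(max_iter):
        coeffs = eigenfaces.T @ (current - mean_face)
        recon = mean_face + eigenfaces @ coeffs            # Eq. (1)
        if prev_recon is not None and np.linalg.norm(recon - prev_recon) <= eps:
            break
        prev_recon = recon
        # t > 1: blend the previous reconstruction into the occluded region
        current = np.where(occlusion_mask, w * recon + (1 - w) * gamma, gamma)
    return recon
```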
3 Experimental Results
3.1 Experimental Data
The experimental images were captured by a PC-100 digital camera at a size of 320×240 pixels. A total of 824 color images were captured from 42 persons; for each person, 8 images with various facial expressions and 4 or 5 images with a few different glasses were taken. The training set consists of the facial images of 30 persons and the test set of those of 18 persons. Furthermore, several artificial glasses images, with various thicknesses and shapes of glass frames and lens reflections, were generated by removing the pixels surrounding the glasses in the facial images.
Fig. 4. Comparison of distances between images
3.2 Experimental Results and Discussion
In the experiments, we measured the accuracy of the proposed method using the Euclidean distance between images and compared it with the results of the previous method. First, the accuracy of the proposed method was measured by comparing the Euclidean distance from one original glassless facial image to another original glassless one (distance 1), to a facial image with glasses (distance 2) and to the glasses-removed image (distance 3), as shown in Fig. 4. The distance between the original and its glasses-removed image is very similar to that to another glassless image.

We also measured the distances between original glassless facial images and their reconstructed images while increasing the number of iterations. Fig. 5(a) shows the change of the average Euclidean distance between the original glassless image and the reconstructed image of the input glasses image, and (b) shows the change of the average Euclidean distance between the original image and the compensated images in the eigen-space, from a recognition point of view. As the trend of monotonically decreasing distance shows, we can conclude that the glasses-removed facial images become similar to the original glassless ones.

Fig. 6 shows sample results of Saito et al.'s method [4] and those of the proposed method. In this experiment, the subjects tried four kinds of glasses. The glasses-removed images and their Euclidean distances were compared between the previous method and the proposed method. These experimental results show that the proposed method can produce more natural looking facial images, which look more similar to the original glassless image than those of the previous method.

Fig. 5. Difference between the original non-occluded image and the compensated images along the iterations

Fig. 6. Results of the previous method and the proposed method
4 Concluding Remarks
This paper proposed a new glasses removing method to generate glassless facial images. The proposed method is based on recursive PCA reconstruction. First, a normalized face image is automatically extracted by using color and shape information, and then a natural looking glassless facial image is generated from the normalized facial image with recursive PCA reconstruction. In order to generate a glassless facial image, we detected the occlusion region caused by glasses and generated a glasses-removed facial image by the recursive PCA reconstruction and compensation. Since the proposed method can extract and remove every occlusion made by glasses, more natural looking glassless facial images are possible, where the occlusion regions include not only the frame of the glasses but also the reflection made by the lens and the shade cast by the glasses. Most regions except the glasses occlusion can be compensated by the input image, so the output image after glasses removal remains similar to the input without losing inherent features. This method can be applied to removing occlusions caused not only by glasses but also by other objects, with some modifications. Moreover, by applying this technique to a face recognition system, we can expect an improvement in recognition performance.
References
[1] Chellappa, R., Wilson, C. L., Sirohey, S.: Human and Machine Recognition of Faces: A Survey. Proc. of IEEE, Vol. 83, No. 5 (May 1995) 705–740
[2] Jing, Z., Mariani, R.: Glasses Detection and Extraction by Deformable Contour. Proc. of Int'l Conf. on Pattern Recognition, Barcelona, Spain (2000) 933–936
[3] Lanitis, A., Taylor, C. J., Cootes, T. F.: Automatic Interpretation and Coding of Face Images Using Flexible Models. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7 (July 1997) 743–756
[4] Saito, Y., Kenmochi, Y., Kotani, K.: Estimation of eyeglassless facial images using principal component analysis. Proc. of Int'l Conf. on Image Processing, Vol. 4, Kobe, Japan (Oct. 1999) 197–201
[5] Wu, H. et al.: Glasses Frame Detection with 3D Hough Transform. Proc. of 16th Int. Conf. on Pattern Recognition, Vol. II, Quebec, Canada (Aug. 2002) 346–349
[6] Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience, Vol. 12, No. 1 (1991) 71–86
[7] Kim, J. K. et al.: Automatic FDP/FAP generation from an image sequence. Proc. of IEEE Int'l Symposium of Circuits and Systems, Geneva, Switzerland (May 2000) 40–43
Synthesis of High-Resolution Facial Image Based on Top-Down Learning
Bon-Woo Hwang¹, Jeong-Seon Park², and Seong-Whan Lee²
¹ VirtualMedia, Inc., #1808 Seoul Venture Town, Yoksam-dong, Kangnam-gu, Seoul 135-080, Korea
[email protected]
² Center for Artificial Vision Research, Korea University, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
{jspark, swlee}@image.korea.ac.kr
Abstract. This paper proposes a method of synthesizing a high-resolution facial image from a low-resolution facial image based on top-down learning. A face is represented by a linear combination of prototypes of shape and texture. With the shape and texture information about the pixels in a given low-resolution facial image, we can estimate optimal coefficients for a linear combination of prototypes of shape and those of texture by solving a least square minimization. Then the high-resolution facial image can be synthesized by using the optimal coefficients for the linear combination of the high-resolution prototypes. The encouraging results of the proposed method show that it can be used to improve the performance of face recognition by enhancing the low-resolution facial images captured by surveillance systems.
1 Introduction
There is a growing interest in surveillance systems for security areas such as international airports, borders, sports grounds, and safety areas, and various research efforts on face recognition have been carried out for a long time. But there still exist a number of difficult problems such as estimating facial pose, facial expression variations, resolving object occlusion, changes of lighting conditions, and, in particular, the low-resolution images captured by surveillance systems in public areas.

Handling low-resolution images is one of the most difficult and commonly occurring problems in various image processing applications, such as scientific, medical, astronomical, or weather image analysis, image archiving, retrieval and transmission, as well as video surveillance or monitoring [1]. Numerous methods have been reported in the area of estimating or synthesizing high-resolution images from a series of low-resolution images or a single low-resolution image. Super-resolution is a typical example of techniques synthesizing a high-resolution image from a series of low-resolution images [2], whereas interpolation produces a large image from only one low-resolution image.

In this paper, we are concerned with building a high-resolution facial image from a low-resolution facial image. Our task is distinguished from previous works that built high-resolution images mainly from scientific images or image sequence video data, and focused on removing occlusion or noise [3]. The proposed method is a top-down, object-class-specific and model-based approach. It is highly tolerant to sensor noise, incompleteness of input images and occlusion by other objects. The top-down approach to interpreting images of variable objects is now attracting considerable interest among many researchers [4][5]. The motivation for top-down learning lies in its potential for deriving high-level knowledge from a set of prototypical components. This paper proposes a high-resolution face synthesis method from only one low-resolution image based on top-down learning. The 2D morphable face model [5] is used in top-down learning, and a mathematical procedure for solving least square minimization (LSM) is applied to the model.

⋆ To whom all correspondence should be addressed. This research was supported by Creative Research Initiatives of the Ministry of Science and Technology, Korea.
2 High-Resolution Face Synthesis
2.1 Overview
In order to synthesize a high-resolution facial image from only one low-resolution image, we used a top-down learning or example-based approach. Suppose that a sufficiently large number of facial images is available for off-line training; then we can represent any input face by a linear combination of a number of facial prototypes [7]. Moreover, if we have a pair of a low-resolution facial image and its corresponding high-resolution image for the same person, we can obtain an approximation to the deformation required for the given low-resolution facial image by using the coefficients of the examples. Then we can obtain the high-resolution facial image by applying the estimated coefficients to the corresponding high-resolution example faces, as shown in Fig. 1. Our goal is to find an optimal parameter set α which best estimates the high-resolution image from a given low-resolution image. The proposed method is based on the morphable face model introduced by Poggio et al. [6] and developed further by Vetter et al. [7][8].

Fig. 1. Basic idea of the proposed synthesis method based on top-down learning

Assuming that the pixelwise correspondence between facial images has already been established [7], the 2D shape of a face is coded as the displacement field from a reference image. So the shape of a facial image is represented by a vector S = (dx_1, dy_1, ···, dx_N, dy_N)^T ∈ ℝ^{2N}, where N is the number of pixels in the image and (dx_k, dy_k) is the x, y displacement of a point that corresponds to a point x_k in the reference face, which can be denoted by S(x_k). The texture is coded as the intensity map of the image which results from mapping the face onto the reference face. Thus, the shape-normalized texture is represented as a vector T = (i_1, ···, i_N)^T ∈ ℝ^N, where i_k is the intensity or color of a point that corresponds to a point x_k among the N pixels in the reference face, which can be denoted by T(x_k).

Next, we transform the orthogonal coordinate system by principal component analysis (PCA) into a system defined by eigenvectors s_p and t_p of the covariance matrices C_S and C_T on the set of M faces, where S̄ and T̄ represent the mean shape and the mean texture, respectively. Then, a facial image can be represented by

$$S = \bar{S} + \sum_{p=1}^{M-1} \alpha_p s_p, \qquad T = \bar{T} + \sum_{p=1}^{M-1} \beta_p t_p \qquad (1)$$

where α, β ∈ ℝ^{M−1}. Before explaining the proposed synthesis procedure, we define two types of warping processes, forward and backward warping. Forward warping warps a texture expressed in the reference shape onto each input face by using its shape information; this process results in an original facial image. Backward warping warps an input face onto the reference face by using its shape information; this process results in texture information expressed in the reference shape. The mathematical definition and more details about forward and backward warping can be found in reference [7].

The synthesis procedure consists of 4 steps, starting from a low-resolution facial image and ending with a high-resolution face, as shown in Fig. 2. Here the displacement of the pixels in an input low-resolution face which correspond to those in the reference face is known. Steps 1 and 4 follow previous studies of morphable face models [4][7]. Steps 2 and 3 are carried out by a similar mathematical procedure, except that the shape at a pixel is a 2D vector while the texture is a 1D (or 3D for RGB color images) vector. Therefore, we will describe only Step 2, the estimation of high-resolution shape information from low-resolution shape information.
Fig. 2. Synthesis procedure from a low-resolution image to a high-resolution one

Step 1. Obtain the texture of the low-resolution facial image by backward warping.
Step 2. Estimate a high-resolution shape from the given low-resolution shape.
Step 3. Estimate a high-resolution texture from the low-resolution texture obtained at Step 1.
Step 4. Synthesize a high-resolution facial image by forward warping the estimated texture with the estimated shape.
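To make the four steps concrete, here is a compact Python sketch under simplifying assumptions of ours: pixel-wise correspondences are taken as given, the forward/backward warps are caller-supplied functions, and the prototype statistics are passed as plain tuples.

```python
import numpy as np

def estimate_high_res(low_vec, mean_lo, mean_hi, bases_lo, bases_hi):
    """Steps 2/3: fit prototype coefficients on the low-resolution part and apply
    them to the high-resolution prototypes (least squares, cf. Sect. 2.3)."""
    alpha, *_ = np.linalg.lstsq(bases_lo, low_vec - mean_lo, rcond=None)
    return mean_hi + bases_hi @ alpha

def synthesize(low_image, low_shape, shape_stats, texture_stats,
               backward_warp, forward_warp):
    """Steps 1-4; *_stats are (mean_lo, mean_hi, bases_lo, bases_hi) tuples."""
    low_texture = backward_warp(low_image, low_shape)             # Step 1
    hi_shape = estimate_high_res(low_shape, *shape_stats)         # Step 2
    hi_texture = estimate_high_res(low_texture, *texture_stats)   # Step 3
    return forward_warp(hi_texture, hi_shape)                     # Step 4
```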
2.2 Problem Definition for Synthesis
Let us define S⁺ = (dx_1, dy_1, ···, dx_L, dy_L, dx_{L+1}, dy_{L+1}, ···, dx_{L+H}, dy_{L+H})^T to be a new shape vector obtained by simply concatenating the low-resolution shape and the high-resolution shape, where L is the number of pixels in the low-resolution image and H is the number of pixels in the high-resolution image. Similarly, let us define T⁺ = (i_1, ···, i_L, i_{L+1}, ···, i_{L+H})^T to be a new texture vector. Then, by applying PCA to both the shape S⁺ and texture T⁺, the face image in Eq. (1) can be expressed as

$$S^+ = \bar{S}^+ + \sum_{p=1}^{M-1} \alpha_p s^+_p, \qquad T^+ = \bar{T}^+ + \sum_{p=1}^{M-1} \beta_p t^+_p \qquad (2)$$

where α, β ∈ ℝ^{M−1}. Since there is shape information about only the low-resolution facial image, we need an approximation to the deformation required for the low-resolution facial image by using coefficients of the bases, as shown in Fig. 1. The goal is to find an optimal set α_p that satisfies

$$\tilde{S}^+(x_j) = \sum_{p=1}^{M-1} \alpha_p s^+_p(x_j), \qquad j = 1, \cdots, L, \qquad (3)$$
where xj is a pixel in the low-resolution facial image, L the number of pixels in low-resolution image, and M − 1 the number of bases.
We assume that the number of observations, L, is larger than the number of unknowns, M − 1. Generally there may not exist a set of α_p that perfectly fits S̃⁺. So, the problem is to choose α̂ so as to minimize the error. For this, we define an error function, E(α), the sum of squared errors, which measures the difference between the known displacements of pixels in the low-resolution input image and its represented ones:

$$E(\alpha) = \sum_{j=1}^{L} \Big( \tilde{S}^+(x_j) - \sum_{p=1}^{M-1} \alpha_p s^+_p(x_j) \Big)^2 \qquad (4)$$

where x_1, ···, x_L are pixels in the low-resolution facial image. Then the problem of synthesis is formulated as finding α̂ which minimizes the error function:

$$\hat{\alpha} = \arg\min_{\alpha} E(\alpha). \qquad (5)$$
2.3 Solution by Least Square Minimization
The solution to Eqs. (4)-(5) is nothing more than the least square solution. Eq. (3) is equivalent to the following equation:

$$\begin{pmatrix} s^+_1(x_1) & \cdots & s^+_{M-1}(x_1) \\ \vdots & \ddots & \vdots \\ s^+_1(x_L) & \cdots & s^+_{M-1}(x_L) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_{M-1} \end{pmatrix} = \begin{pmatrix} \tilde{S}^+(x_1) \\ \vdots \\ \tilde{S}^+(x_L) \end{pmatrix} \qquad (6)$$

This can be rewritten as

$$\mathbf{S}^+ \alpha = \tilde{\mathbf{S}}^+, \qquad (7)$$

where

$$\mathbf{S}^+ = \begin{pmatrix} s^+_1(x_1) & \cdots & s^+_{M-1}(x_1) \\ \vdots & \ddots & \vdots \\ s^+_1(x_L) & \cdots & s^+_{M-1}(x_L) \end{pmatrix}, \quad \alpha = (\alpha_1, \cdots, \alpha_{M-1})^T, \quad \tilde{\mathbf{S}}^+ = (\tilde{S}^+(x_1), \cdots, \tilde{S}^+(x_L))^T. \qquad (8)$$

The least square solution to an inconsistent S⁺α = S̃⁺ of L equations in M − 1 unknowns satisfies S⁺ᵀS⁺α* = S⁺ᵀS̃⁺. If the columns of S⁺ are linearly independent, then S⁺ᵀS⁺ is non-singular and

$$\alpha^* = (\mathbf{S}^{+T} \mathbf{S}^+)^{-1} \mathbf{S}^{+T} \tilde{\mathbf{S}}^+. \qquad (9)$$
The projection of S̃⁺ onto the column space is therefore Ŝ⁺ = S⁺α*. By using Eq. (9), we obtain the high-resolution facial image

$$S(x_{L+j}) = \bar{S}^+(x_{L+j}) + \sum_{p=1}^{M-1} \alpha^*_p s^+_p(x_{L+j}), \quad (j = 1, \ldots, H), \qquad (10)$$
where x_{L+1}, ···, x_{L+H} are pixels in the high-resolution facial image, and L and H are the numbers of pixels in the low-resolution and high-resolution facial images, respectively. By using Eq. (10), we can get the correspondence of all pixels. Previously we made the assumption that the columns of S⁺ are linearly independent. Otherwise, Eq. (9) may not be satisfied. If S⁺ has dependent columns, the solution α* will not be unique. The optimal solution in this case can be obtained by the pseudoinverse of S⁺ [5]. But, for our purpose of effectively synthesizing a high-resolution facial image from a low-resolution one, this is unlikely to happen.
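For illustration (ours, not the authors' code), Eqs. (9)-(10) translate directly into a few lines of linear algebra; `S_low`/`S_high` denote the low- and high-resolution rows of the concatenated prototype matrix:

```python
import numpy as np

def lsm_coefficients(S_low, s_tilde):
    """Closed-form least squares of Eq. (9): alpha* = (S^T S)^{-1} S^T s~."""
    gram = S_low.T @ S_low                      # assumed non-singular (Sect. 2.3)
    return np.linalg.solve(gram, S_low.T @ s_tilde)

def high_res_values(mean_high, S_high, alpha_star):
    """Eq. (10): apply the fitted coefficients to the high-resolution prototypes."""
    return mean_high + S_high @ alpha_star
```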
3 Experimental Results and Analysis
3.1 Face Database
For testing the proposed method, we used about 200 images of Caucasian faces that were rendered from a database of 3D head models recorded by a laser scanner (Cyberware™) [7][8]. The original images were color images of size 256 × 256 pixels. They were converted to 8-bit gray level and resized to 16 × 16 and 32 × 32 to obtain low-resolution facial images using the Bicubic interpolation technique. PCA was applied to a random subset of 100 facial images to construct the bases of the defined face model. The other 100 images were used for testing our algorithm. Next, we use a hierarchical, gradient-based optical flow algorithm to obtain a pixel-wise correspondence [7]. The correspondence is defined between a reference facial image and every image in the database. It is estimated by the local translation between corresponding gray level intensity patches.

3.2 High-Resolution Facial Synthesis
As mentioned before, 2D-shape and texture of facial images are treated separately. Therefore, a facial image is synthesized by combining both of the estimated shape and the estimated texture. Fig. 3 shows the examples of the high-resolution facial image synthesized from two kinds of low-resolution images and its average intensity error per pixel from original high-resolution image. Fig. 3(a) shows the input low-resolution images, Fig. 3(b) to (d) the synthesized high-resolution images using Bilinear interpolation, Bicubic interpolation and by the proposed method, respectively. Fig. 3(e) is the original high-resolution facial images. As shown in Fig. 3, the synthesized images by the proposed method are more similar to the original ones and clearer than others. The average intensity error per pixel of those is reduced to almost half than that of input image which is made by the nearest neighbor interpolation from 32 × 32 image. Fig. 4 shows the mean synthesis errors in shape, texture and facial image from the original high-resolution image. Horizontal axes of Fig. 4 (a) and (b) represent the input low-resolution, two interpolation methods and the proposed synthesis method. Vertical axes of them represent the mean displacement error
Fig. 3. Examples of synthesized high-resolution face
per pixel for the shape vectors and the mean intensity error per pixel (for images with 256 gray levels) for the texture and image vectors, respectively. Err_Sx and Err_Sy in Fig. 4(a) are the mean displacement errors along the x- and y-axes for shape, respectively, and Err_T and Err_I in Fig. 4(b) are the mean intensity errors for texture and for image, respectively.
4 Conclusions and Further Research
In this paper, we proposed an efficient method of synthesizing a high-resolution facial image based on top-down learning. The proposed method consists of the following steps: computing the linear coefficients that minimize the error, or difference, between the given shape/texture and the linear combination of the shape/texture prototypes in the low-resolution image, and then applying the coefficient estimates to the shape and texture prototypes of the high-resolution facial image, respectively. We interpret the encouraging performance of our method as evidence supporting the hypothesis that the human visual system may synthesize high-resolution information from low-resolution information using prototypical examples in top-down learning. The synthesized results look very natural and plausible, much like the original high-resolution facial images, when the displacement between the pixels in an input face and the corresponding pixels in the reference face is known.
Fig. 4. Mean synthesis errors
It is a challenge for researchers to obtain the correspondence between the reference face and a given facial image in low-resolution vision tasks.
Acknowledgement We would like to thank the Max-Planck-Institute for providing the MPI Face Database.
References
[1] Tom, B., Katsaggelos, A. K.: Resolution Enhancement of Monochrome and Color Video Using Motion Compensation. IEEE Trans. on Image Processing, Vol. 10, No. 2 (Feb. 2001) 278–287 377
[2] Baker, S., Kanade, T.: Limits on Super-Resolution and How to Break Them. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 24, No. 9 (Sep. 2002) 1167–1183 378
[3] Windyga, P. S.: Fast impulsive noise removal. IEEE Trans. on Image Processing, Vol. 10, No. 1 (2001) 173–178 378
[4] Jones, M. J., Sinha, P., Vetter, T., Poggio, T.: Top-down learning of low-level vision tasks [brief communication]. Current Biology, Vol. 7 (1997) 991–994 378, 379
[5] Hwang, B.-W., Lee, S.-W.: Reconstruction of Partially Damaged Face Images Based on a Morphable Face Model. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 25, No. 3 (2003) 365–372 378, 382
[6] Beymer, D., Shashua, A., Poggio, T.: Example-Based Image Analysis and Synthesis. AI Memo 1431/CBCL Paper 80, Massachusetts Institute of Technology, Cambridge, MA (Nov. 1993) 378
[7] Vetter, T., Troje, N. E.: Separation of texture and shape in images of faces for image coding and synthesis. Journal of the Optical Society of America A, Vol. 14, No. 9 (1997) 2152–2161 378, 379, 382
[8] Blanz, V., Romdhani, S., Vetter, T.: Face Identification across Different Poses and Illuminations with a 3D Morphable Model. Proc. of the 5th Int’l Conf. on Automatic Face and Gesture Recognition, Washington, D. C. (2002) 202–207 378, 382
A Comparative Performance Analysis of JPEG 2000 vs. WSQ for Fingerprint Image Compression
Miguel A. Figueroa-Villanueva, Nalini K. Ratha, and Ruud M. Bolle
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, New York 10598, USA
[email protected] {ratha, bolle}@us.ibm.com
Abstract. The FBI Wavelet Scalar Quantization (WSQ) compression standard was developed by the US Federal Bureau of Investigation (FBI). The main advantage of WSQ-based fingerprint image compression has been its superiority in preserving the fingerprint minutiae features even at very high compression rates, where standard JPEG compression techniques were unable to preserve them. With the advent of the JPEG 2000 compression standard, which is based on wavelet transforms and moves away from DCT-based methods, we were motivated to investigate whether the same advantage still persists. In this paper, we describe a set of experiments we carried out to compare the performance of WSQ with JPEG 2000. The performance analysis is based on three public databases of fingerprint images acquired using different imaging sensors. Our analysis shows that JPEG 2000 provides better compression with less impact on the overall system accuracy performance.
1 Introduction
Image compression and decompression is an essential requirement in large-scale fingerprint identification systems used in both civil and criminal applications. Because of the peculiarities of a fingerprint image and the features used in an automatic fingerprint recognition system, general image compression techniques do not work well. The FBI Wavelet Scalar Quantization (WSQ) [1] compression standard was developed by the US Federal Bureau of Investigation (FBI) while it was converting its criminal fingerprint database, which comprises some 200 million inked fingerprint cards, to digital format. This process was initiated as part of an effort to incorporate an Automated Fingerprint Identification System (AFIS), besides the added benefits of digital imagery such as accessibility and preservation, among others. Each inked card has a fingerprint impression area to be digitized of about 39 square inches. Currently, the cards are being scanned at a resolution of 500 pixels per inch and 8 bpp (gray-scale). This amounts to about 10 MB per card and a total storage of about 2000 TB. Incorporating image compression is also very important when the communication bandwidth is limited.
Preliminary tests [2] indicated that the minutiae detection process was less accurate on fingerprint images that had been compressed using JPEG [3] rather than WSQ. In addition, JPEG suffered from blocking artifacts and inferior root mean square error (RMSE) when applied to fingerprints at high compression rates. It was clear that the JPEG standard did not provide the level of quality needed for archival purposes at the target bitrates around 0.75 bpp. Brislawn [4] shows an example of a fingerprint image digitized from an inked card, zoomed at the core, and the same image compressed at 0.75 bpp using JPEG and FBI WSQ, respectively; one can clearly see the blocking artifacts that overwhelm the JPEG compressed images. However, the next-generation JPEG compression standard, JPEG 2000 [5], has since been developed and remedies many of the deficiencies in JPEG for which the WSQ standard was originally created.

The FBI WSQ and JPEG 2000 compression schemes are both transform-based. The major differences are the decomposition structure of the wavelet transform and the scanning order of the quantized coefficients in the entropy coding step. FBI WSQ uses a fixed wavelet packet basis with (9,7) biorthogonal filters, while JPEG 2000 uses Mallat's algorithm, or the pyramidal approach, with the same set of filters for lossy encoding. The wavelet decomposition structure influences the number and length of the zero run-lengths. The vertical bitplane scanning order of JPEG 2000 allows more control over the target compression rate, with further quantization if needed, than the raster scanning used in FBI WSQ; this scanning order also influences the number and lengths of the zero run-lengths. More details can be found in [5, 1, 6].

Several other fingerprint compression methods have been developed. Shnaider and Paplinski [7] used vector quantization with a simpler wavelet decomposition structure for a gain in speed. Jang and Kinsner [8] proposed an adaptive wavelet decomposition structure based on a multifractal measure. However, although both of these methods have a smaller computational complexity, their accuracy was below that of the FBI WSQ standard. Kasaei et al. [9] used piecewise-uniform pyramid lattice vector quantization with an adaptive wavelet packet decomposition, as well as with a fixed 73-subband wavelet packet transform that is very similar to the FBI WSQ basis tree; this method has been shown to provide better PSNR rates than the FBI standard.

Our motivation has been to analyze the effectiveness of the newer JPEG 2000 standard in addressing the issues with the original JPEG, in particular the distortions created by the JPEG algorithm. In addition, we focus on the accuracy, or error rates, obtained when an image compression technique is used. We targeted our evaluation on a set of large public databases to verify the usefulness of the newer standard, and we use several comparison indices in terms of the overall error rate change in an authentication system.
2 Experiments
The comparative performance analysis involved testing the two compression techniques on a large set of public fingerprint images. The software used in the
experiment is an FBI-certified IBM WSQ encoder/decoder and the JasPer 1.500.4 JPEG 2000 implementation [10]. Figure 1 shows a block diagram of the test setup. Note that all the experiments used two compression rates: 2.25 bpp and 0.75 bpp. These rates were selected because the FBI compliance tests for WSQ certification are performed at these rates. However, we are more interested in the higher compression rate (0.75 bpp), as the minutiae feature distortions can be significant at higher compression rates. Note that the FBI WSQ encoder tends to overcompress the images when supplied with the target rate. The JPEG 2000 compressor is much more stable and was therefore given as input a compression rate derived from the average output file size of the FBI WSQ encoder, to keep the effective compression rates as close as possible. The average compression achieved by JPEG 2000, while close, has been marginally higher.

Fig. 1. Test setup diagram

The performance metrics used to compare the two compression schemes are: (a) image-based metrics; and (b) error rate analysis in terms of ROCs.
– Image-based metrics: The peak signal-to-noise ratio (PSNR) is computed for each of the reconstructed images (after decoding) relative to the original image. Although PSNR does not always correlate with visual perception, it does provide some indication of the level of distortion, or raw difference, introduced by the algorithms, as can be seen from the following:

$$\mathrm{PSNR} = 20 \log_{10}\frac{255}{\mathrm{RMSE}}, \qquad (1)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{NM}\sum_{i=0}^{N-1}\sum_{j=0}^{M-1}\big(f(i,j) - \hat{f}(i,j)\big)^2}, \qquad (2)$$
where f(i, j) and f̂(i, j) are the original and reconstructed N x M 8-bit grey-scale images, respectively.
– ROC analysis: The data sets are given as input to a state-of-the-art IBM Fingerprint Matcher. The matcher returns a score from 0 to 100 for every pair in the data set, indicating the similarity between the pair. By putting a threshold on the score returned by the matcher, an automated system can classify the
Table 1. Data sets description

  name    source                      size (pixels)   samples   resolution
  DB2     Low-cost capacitive sensor  256 x 364       100 x 8   500 ppi
  DB3     Optical sensor              448 x 478       100 x 8   500 ppi
  NIST4   Scanned inked cards         512 x 512       250 x 2   500 ppi
pair as a match or a mismatch. An ROC curve is obtained by calculating the false accepts (i.e., the matcher classifies a pair from different fingers as a match) and the false rejects (i.e., the matcher classifies a pair from the same finger as a mismatch) at different operating points as determined by the threshold value. For each data set, an ROC is obtained for the original data set and for the compressed (encoded/decoded) versions of the data set produced by each algorithm. The image-based metrics describe the visual distortion that may have been caused by the compression, whereas the accuracy analysis in terms of ROCs measures the ultimate performance degradation due to compression. The latter is more relevant in e-commerce/m-commerce types of applications.
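For reference, the image-based metric of Eqs. (1)-(2) can be computed in a few lines; the snippet below is only an illustration of the formula, not the code used in the experiments.

import numpy as np

def psnr(original, reconstructed):
    # Both inputs are equally sized 8-bit grey-scale images (e.g., uint8 arrays).
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    rmse = np.sqrt(np.mean(diff ** 2))                                     # Eq. (2)
    return float('inf') if rmse == 0 else 20.0 * np.log10(255.0 / rmse)   # Eq. (1)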
2.1 Data Sets
The experiments are done on databases 2 and 3 from the Fingerprint Verification Competition (FVC2000) [11]. These databases contain images obtained from capacitive and optical sensors, in groups of 8 corresponding to the same finger. In addition, a set of NIST4 images, composed of digitized FBI inked cards, was used; these were obtained in groups of 2 corresponding to the same finger. Table 1 describes these databases, and Figure 2 shows an example image from each.
Fig. 2. A fingerprint image sample from (a) DB2, (b) DB3, and (c) NIST4
Table 2. Result summary: (a) mean and standard deviation of PSNR differences and (b) FRR decrease by using FBI WSQ instead of JPEG 2000

  (a)
  Dataset   0.75 bpp: µ_{J-W}  σ_{J-W}    2.25 bpp: µ_{J-W}  σ_{J-W}
  DB2       1.03 dB            0.4131     3.13 dB            1.1205
  DB3       0.89 dB            0.5869     1.51 dB            0.8530
  NIST4     0.88 dB            0.6489     1.26 dB            0.6563

  (b)
  Dataset   0.75 bpp: FAR 10^-3  10^-4    2.25 bpp: FAR 10^-3  10^-4
  DB2       8.02%                3.51%    -10.1%               -4.98%
  DB3       4.18%                4.50%    0.74%                3.80%
  NIST4     11.81%               9.62%    2.07%                2.57%

3 Results
In this section we present and discuss the results obtained from our experiments. We first examine the effect on PSNR to give a preliminary notion of how the algorithms behave, and subsequently discuss the ROC plots, which measure the performance degradation incurred by choosing one algorithm over the other.
3.1 PSNR Results
The images compressed with JPEG 2000 have higher PSNR values than their respective samples compressed with FBI WSQ. This is very consistent, given that the difference¹ in PSNR is greater than zero in the overwhelming majority of samples. The average differences range from 0.87 dB to over 3 dB, as summarized in Table 2(a), which is a very significant amount. As stated previously, this does not prove that there is an advantage in compressing with JPEG 2000, but it does indicate that the FBI WSQ algorithm introduces significantly greater raw distortion, and it is consistent with the results obtained in the next section.
3.2 ROC Results
In the following figures, we can observe that the ROC curves for the JPEG 2000 compressed cases are consistently better than those for the FBI WSQ data. Notice that in Figures 3(a)-(c) we have emphasized the region of [10^-5, 10^-3] false accept rate (FAR), because it is unusual for an application to operate at higher FAR regardless of how low the false reject rate (FRR) may be². To do so would be highly insecure (e.g., one can guarantee an FRR of 0% by granting access to everyone). For this operable region we can clearly see that JPEG 2000 performs better than FBI WSQ for all the datasets
¹ Notice that throughout this report we subtract the values obtained with FBI WSQ from those obtained with JPEG 2000, and not the other way around. This is true for the ROC rates and score values as well.
² Although the plots are zoomed in on the region of interest, they are mostly consistent throughout the entire range.
at the higher compression rate (0.75 bpp). At the lower compression rate (2.25 bpp) the difference is smaller, and the curves sometimes cross each other, making the FBI WSQ case better in some instances. However, there is no significant gain in compression at 2.25 bpp; therefore, the 0.75 bpp plots provide the more interesting scenario. Table 2(b) summarizes the decrease in FRR obtained from using FBI WSQ to compress rather than JPEG 2000 at the 10^-3 and 10^-4 FAR operating points. Notice that JPEG 2000 outperforms FBI WSQ by 3.5% up to over 11%. This performance decrease can translate into a very high volume of persons who may be denied access to a service, or who need to be further investigated, if we consider the high number of transactions in B2C settings or the high volume of people in security screenings such as at airports.

Fig. 3. ROC curves in logarithmic scale for (a) DB2 at 0.75 bpp, (b) DB3 at 0.75 bpp, and (c) NIST4 at 0.75 bpp
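The operating points quoted from Table 2(b) can be read off from the genuine and impostor score distributions in the way sketched below; the score arrays and the threshold scan are our own simplification, not the IBM matcher pipeline itself.

import numpy as np

def frr_at_far(genuine, impostor, target_far=1e-3):
    # genuine/impostor: matcher scores (0-100) for same-finger and different-finger pairs
    for t in np.unique(np.concatenate([genuine, impostor])):   # increasing thresholds
        far = np.mean(np.asarray(impostor) >= t)               # false accept rate
        if far <= target_far:
            return float(np.mean(np.asarray(genuine) < t))     # false reject rate at that FAR
    return 1.0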
3.3 Statistical Significance
The results of the previous two sections make it clear that JPEG 2000 appears to outperform the FBI WSQ compression scheme. However, one possible explanation for this difference is pure chance: it might stem from the specific test case (i.e., the particular set of images in the databases used) and not be generally true for the compression schemes. To answer this question, a statistical significance analysis is performed to assess whether the difference is significant or is in fact just due to chance. We test two things. First, we would like to know whether the difference in the PSNR values is significant. Second, and most importantly, we want to know whether the difference in the ROC curves is significant. For this second test we note that the ROC is an abstraction composed of two different cases: the genuine pair case (i.e., when a pair really comes from the same finger) and the impostor pair case (i.e., when a pair does not come from the same finger). The matcher scores for each of these cases follow two distributions: for the genuine pair case they oscillate around a high-valued mean, and for the impostor case around a lower-valued mean. For this reason, and to gain more insight into the effect of the compression algorithms on the system, we test the scores of these two cases separately instead of the ROCs.
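A minimal version of the PSNR significance test can be written with SciPy's one-sample t-test on the per-image differences; the variable names and the one-sided conversion below are our own, and the paper does not specify which statistics package was actually used.

import numpy as np
from scipy import stats

def paired_psnr_test(psnr_jpeg2000, psnr_wsq):
    # Per-image PSNR differences: JPEG 2000 minus FBI WSQ, as in the rest of the report.
    d = np.asarray(psnr_jpeg2000) - np.asarray(psnr_wsq)
    t_stat, p_two = stats.ttest_1samp(d, 0.0)
    # One-sided p-value for H1: JPEG 2000 yields a higher mean PSNR.
    p_one = p_two / 2.0 if t_stat > 0 else 1.0 - p_two / 2.0
    return t_stat, p_one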
Table 3. Results of PSNR, genuine score, and impostor score significance tests

  source             PSNR: Pval  95% C.I.     Genuine: Pval  95% C.I.          Impostor: Pval  95% C.I.
  DB2 @ 0.75 bpp     0           [1.003, ∞)   1.627e-8       [ 0.870, 1.791]   0.817           [-0.017, 0.013]
  DB3 @ 0.75 bpp     0           [0.857, ∞)   8.998e-5       [ 0.359, 1.077]   0.197           [-0.004, 0.019]
  NIST4 @ 0.75 bpp   0           [0.827, ∞)   0.026          [ 0.085, 1.323]   1.42e-5         [ 0.017, 0.045]
  DB2 @ 2.25 bpp     0           [3.063, ∞)   0              [-46.63, -45.02]  0               [-6.058, -6.030]
  DB3 @ 2.25 bpp     0           [1.457, ∞)   0.384          [-0.413, 0.161]   0.043           [-0.022, -0.0004]
  NIST4 @ 2.25 bpp   0           [1.210, ∞)   0.413          [-0.481, 1.167]   0.417           [-0.018, 0.007]
Table 3 shows the results of the one-sided t-test for the PSNR values. The p-values for this test indicate that the probability of observing the obtained results by chance, given that there is in fact no difference in performance, is essentially zero. Hence, there is no doubt that FBI WSQ introduces more raw distortion to the image than JPEG 2000. The corresponding 95% confidence intervals of the parameter tested (i.e., the mean difference in PSNR) help us quantify the difference in PSNR. Table 3 also shows the results of the two-sided t-tests for the genuine and impostor scores. For the 2.25 bpp case the results show some inconsistencies: the test fails for DB3, and for DB2 a decrease in the genuine scores of the JPEG 2000 compressed data is detected, together with a decrease in the impostor scores. However, for the higher compression rate there is very little doubt that there was an increase in the scores for the JPEG 2000 compressed data when the samples were truly from the same finger (the genuine case). When the samples come from different fingers (the impostor case), there is no significant difference in most of the cases, and the probability of obtaining the results by mere chance is very high, so the test fails. Intuitively, it is easy to agree with these results. When the images come from the same finger, one can expect a decrease in the matching score if the algorithm introduces more distortion, which is the case for FBI WSQ as observed from the PSNR results. Conversely, when the images come from different fingers, the distortion introduced should not make them any more different than they already are.
4 Conclusions
Given that JPEG 2000 performs better than FBI WSQ at high compression rates, one can question the need for a domain-specific compression algorithm. It is clear that JPEG 2000 was designed for general image compression and is still able to outperform FBI WSQ on fingerprint compression. There are costs associated with keeping a compression standard at the state of the art, and it seems that FBI WSQ is due for an update. If the maintenance costs are too
high for an insignificant gain in compression, then it might be better to replace it. It is true, however, that given the infrastructure already available for FBI WSQ, there might be an even greater cost in replacing the standard. Although it is difficult to justify replacing the standard for use with the FBI's criminal databases, one does not have these restrictions when it comes to B2C activities, or when implementing a security screening with a user database that does not involve the FBI's criminal databases. Such settings are easily conceived in work areas (e.g., banks) where access is restricted to certain authorized users. If we consider the widespread availability of JPEG, we can confidently assume that it will be less expensive for a sensor or fingerprint software developer to support JPEG 2000 than to support FBI WSQ. This comes with the added advantage that JPEG 2000 will continue to be improved upon and maintained, since it is a much larger community effort.
References [1] Federal Bureau of Investigation, “WSQ Gray-Scale Fingerprint Image Compression Specification,” Document No. IAFIS-IC-0110 (v2), Feb. 1993, Drafted by T. Hopper, C. M. Brislawn and J. N. Bradley. 385, 386 [2] T. Hopper and F. Preston, “Compression of grey-scale fingerprint images,” in Proceedings Data Compression Conference, Snowbird, UT, March 1992, pp. 309– 318, IEEE Computer Society. 386 [3] W. B. Pennebaker and J. L. Mitchell, “The JPEG Still Image Data Compression Standard,” in Van Nostrand-Reinhold, 1993. 386 [4] C. M. Brislawn, The FBI Fingerprint Image Compression Standard, http://www.c3.lanl.gov/˜brislawn/FBI/FBI.html. 386 [5] M. Boliek, C. Christopoulos, and Majani E. (editors), “JPEG2000 Part I Final Draft International Standard,” Aug. 2000, number ISO/IEC FDIS 15444-1, ISO/IEC JTC1/SC29/WG1 N1855. 386 [6] C. M. Brislawn, J. N. Bradley, R. J. Onyshezak, and T. Hopper, “The FBI compression standard for digitized fingerprint images,” in Proc. SPIE, Denver, CO, Aug. 1996, vol. 2847, pp. 344–355. 386 [7] M. Shnaider and A. P. Paplinski, “Compression of fingerprint images using wavelet transform and vector quantization,” in ISSPA-96, Gold Coast, Australia, August 1996, pp. 437–440. 386 [8] E. Jang and W. Kinsner, “Multifractal Wavelet Compression of Fingerprints,” in Proceedings of IEEE Communications, Power and Computing Conference, Winnipeg, MB, May 1997, pp. 313–321. 386 [9] S. Kasaei and M. Deriche, “Fingerprint Compression using a Piecewise-Uniform Pyramid Lattice Vector Quantization,” in ICASSP-97, Munich, Germany, April 1997, pp. 3117–3120. 386 [10] M. D. Adams, JasPer Software Reference Manual, 2000, http://www.ece.uvic.ca/˜mdadams/jasper/jasper.pdf. 387 [11] Fingerprint Verification Competition 2000, http://bias.csr.unibo.it/fvc2000/default.asp. 388
New Shielding Functions to Enhance Privacy and Prevent Misuse of Biometric Templates
Jean-Paul Linnartz and Pim Tuyls
WY7, Nat.Lab., Philips Research, 5656 AA Eindhoven, The Netherlands
[email protected] [email protected]

Abstract. In biometrics, a human being needs to be identified based on some characteristic physiological parameters. Often this recognition is part of some security system. Secure storage of reference data (i.e., user templates) of individuals is a key concern. It is undesirable that a dishonest verifier can misuse parameters that he obtains before or during a recognition process. We propose a method that allows a verifier to check the authenticity of the prover in a way that the verifier does not learn any information about the biometrics of the prover, unless the prover willingly releases these parameters. To this end, we introduce the concept of a delta-contracting and epsilon-revealing function which executes preprocessing in the biometric authentication scheme. It is believed that this concept can become a building block of a public infrastructure for biometric authentication that nonetheless preserves privacy of the participants.
1 Introduction
Measurement of distinguishing features of physical objects and living beings can be used to identify them and distinguish them from others. In some cases, there is a desire to add cryptographic properties to this identification process. In biometrics, a human being is identified by measuring a set of parameters of the body. Biometric data are said to identify a person based on "who he is", rather than on "what he has" (such as a smartcard) or "what he knows" (such as a password). An unresolved issue, however, is that when deployed at large scale, a citizen loses privacy, as he must reveal his identifying biometric data to his bank, the government, his employer, the car rental company, the owner of a discotheque or nightclub, and so on. Each of them will obtain the same measured data, and unless special precautions are taken there is no guarantee that none of these parties will ever misuse the biometric data to impersonate the citizen. This paper proposes, reviews and analyzes a novel technique to enhance the privacy and security of authentication and key establishment in biometric applications. In particular, we prevent misuse of templates. Following the tradition in
cryptography to name our role players, we will say that prover Peggy allows verifier Victor to measure her object [1] or physiological parameters [2], called "Prop." We are interested not only in security breaches due to a dishonest Peggy, but also in those resulting from an unreliable Victor.

Often a distinction is made between identification and verification. Identification estimates which object is presented by searching for a match in a database of reference data for many objects: Victor a priori does not know whether he sees Prop1 (belonging to Peggy) or Prop2 (belonging to Petra). Verification, on the other hand, attempts to establish whether the object presented truly is the object Prop that a known prover Peggy claims it to be. Peggy provides not only Prop but also a message in which she claims to be Peggy and which can be linked to Prop in some direct or implicit way. Here we address primarily the setting of verification; that is, Victor is assumed to have some a priori knowledge about Prop in the form of certified reference data, but at the start of the protocol he is not yet sure whether Prop or a fake replacement is present.

To further develop the insight into the security aspects of verification, we distinguish (possibly against common practice in the biometric literature) between verification and authentication. In a typical verification situation, the reference data itself allows a malicious Victor to artificially construct measurement data that will pass the verification test, even if Prop itself has never been available. In authentication, the reference data gives insufficient information to allow Victor to (effectively) construct valid measurement data. While such protection is not yet mature for biometric authentication, it is common practice with computer passwords. When a computer verifies a password, it does not compare the password p typed by the user with a stored reference copy. Instead, the password is processed by a cryptographic one-way function F and the outcome is compared against a locally stored reference string F(p) [3]. This prevents an attack from the inside in which the unencrypted or decryptable passwords of the users could be stolen.

The main difference with biometrics is that noise or other aberrations are unavoidable during measurement. Noisy measurement data must be quantized into discrete values before they can be processed by any cryptographic function, and due to external noise the outcome of the quantization may differ from experiment to experiment. In particular, if Peggy's physiological parameter takes on a value close to a quantization threshold, minor amounts of noise can change the outcome. Minor changes at the input of a cryptographic function are amplified, and the outcome bears no resemblance to the expected outcome. This effect, identified as 'confusion' and 'diffusion' [3], makes it less trivial to use biometric data as input to a cryptographic function. In particular, the comparison of measured data with reference data cannot be executed in the encrypted domain. It is an essential part of this paper to discuss whether the measurement itself must be stored, or whether it suffices to store and exchange only a (one-way) cryptographic derivative. Storing reference data (user templates) and protecting their privacy are well recognized as key concerns in biometric authentication [4, 15]. Preferably, the derivative should not allow an attacker to construct fake data.
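For comparison, the password analogue mentioned above is easy to state in code; the toy sketch below is ours, with SHA-256 standing in for the one-way function F, and only illustrates that verification happens on F(p), never on p itself.

import hashlib

def enroll(password: str) -> str:
    # Store only the digest F(p); the plain password is never kept.
    return hashlib.sha256(password.encode()).hexdigest()

def verify(password: str, stored_digest: str) -> bool:
    # Recompute F(p) for the typed password and compare in the hashed domain.
    return hashlib.sha256(password.encode()).hexdigest() == stored_digest

(A real password system would additionally use salting and a deliberately slow hash; the point here is only the exact-match comparison in the hashed domain, which the noise in biometric measurements prevents.)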
It was previously known that enrollment data can be encrypted. However, a security weakness appears when the data must be decrypted during authentication. This problem was also
addressed in [6, 7, 9, 10]. In the current paper we further develop and generalize the mathematical formulation, including an information-theoretic evaluation of concealing properties, and we analyze a new solution.

Before an authentication can take place, Prop must have gone through an enrollment phase. During this phase, Peggy and Prop visit a Certification Authority, where Prop's parameters are measured. These measurements are processed and stored for later use. In an on-line application, such reference data can be stored in a central (possibly even publicly accessible) database, or the data can be certified with a digital signature of the Certification Authority and given to Peggy. In the latter case, it is Peggy's responsibility to securely give this certified reference data to Victor.

Key establishment: In addition to authentication, the system can use Prop's (biometric) parameters to generate a secret key [11, 12]. An important property is that Victor should not be able to calculate this key on his own by misusing reference data that is offered to him; Victor must measure Prop, otherwise he should not be able to find the key. We illustrate this by the example of access to a database of highly confidential encrypted documents to which only a (set of) specific users is allowed access. The computer retrieval system authenticates humans and retrieves a decryption key from their biometric parameters. This system must be protected against a dishonest software programmer Mallory who has access to the biometric reference data of all users. If Mallory downloads the complete reference data file and all encrypted documents, and possibly reads all the software code of the system, she should still not be able to decrypt any document. We distinguish between two attacks:
1. Misuse of templates: A dishonest Victor can attempt to calculate the parameters of Prop, or to establish the key, without having access to the object. This corresponds to a system operator who attempts to retrieve user passwords from the reference database of data strings F(p).
2. Misuse of measurement data: After having had an opportunity to measure Prop, a dishonest Victor misuses his measurement data. This corresponds to grabbing all keystrokes, including the plain passwords typed by a user.
This text primarily addresses attack 1. Attack 2 typically is prevented by mutual cryptographic authentication of Victor and Peggy in addition to the biometric exchange of data from Prop [14].
2 Model
In order to study countermeasures against misuse of templates, we consider the system depicted in Figure 1. This authentication system consists of a mechanism to extract a measurement Y of the object, some signal processing function G(W,Y) and a cryptographic function F. F is one-way in the sense that it is "easy" to calculate the output given the input signal but it is computationally "infeasible" to find a valid input given an output value [4]. An important aim is to propose and study appropriate choices for G to enhance the reliability and reproducibility of the detection and to
shield the information (or 'entropy') in the authentication secret Z from the reference data. The reference data consists of two parts: the cryptographic key value V, against which the processed measurement data U is compared, and the data W, which assists in achieving reliable detection.
Fig. 1. Model of authentication of Prop i and generation of a seed for key establishment. Noise N occurs during measurement Y of parameter X. The delta-contracting function G and the hash F are invoked to create U, which then is verified against reference template (W,V)
Peggy authenticates herself with Prop as follows:
• When she claims to be Peggy, she sends her identifier message to Victor and makes Prop available for measurement.
• Victor retrieves the authentication challenge W from an on-line trusted database. Alternatively, in an off-line application Peggy could provide Victor with the reference data (V, W) together with a certificate that this data is correct.
• Peggy allows Victor to take a (possibly noisy) measurement Y = X + N of the physiological properties X of Prop.
• Victor calculates Z = G(W, Y).
• Optional for key establishment: Victor can extract further cryptographic keys from Z, for instance to generate an access key.
• Victor calculates the cryptographic hash function F(Z).
• The output U = F(Z) is compared with the reference authentication response V. If U = V, the authentication is successful.¹ ²
Here, X, N, and Y are real (or complex) valued vectors of length n1, X ∈ Rn1. Vector W contains n2 values, typically real, complex or high resolution digital numbers, that control the function G(W,X). Further, Z, U and V are discrete-valued (typically binary) vectors of length n3, n4, and n4, resp. During authentication, Z is the estimate of the authentication secret S that was chosen during enrollment, which we will describe next.
¹ In a networked system, the creation of U is typically executed locally at the verifier, whereas V is stored in a central database. Either Victor sends U to the database and the verification is done at the database, or the database sends V to Victor and Victor himself compares U with V.
² Note that here we require an exact match. Checking for imperfect matches would not make sense because of the cryptographic operation F. Measurement imperfections (noise) are eliminated by the use of W and the δ-contracting property of G.
Fig. 2. Enrollment of Prop i, involving the estimation of X, the choice of S, and the calculation of V and W. Here G^{-1} is the inverse of the delta-contracting function and F a hash function
During enrollment, a secret S (S ∈ {0,1}^{n3}) is chosen, and the corresponding V = F(S) (V ∈ {0,1}^{n4}) is calculated. Further, X is measured. The enrollment can be performed under more ideal circumstances, or can be repeated to reduce the variance of the noise; we therefore assume that N ≈ 0 during enrollment and that X is available. Thirdly, a value for W is calculated such that not only G(W, X) = S but also, during authentication, G(W, Y) = S for Y ≈ X. We call this property δ-contracting.

Definition 1: Let G(W, Y): R^{n1+n2} → {0,1}^{n3} be a function and δ ≥ 0 a non-negative real number. The function G is called "δ-contracting" if and only if for all X ∈ R^{n1} there exist (an efficient algorithm to find) at least one vector W ∈ R^{n2} and one binary string S ∈ {0,1}^{n3} such that G(W, Y) is constant on a ball with radius δ around X, i.e., G(W, X) = G(W, Y) = S for all Y ∈ R^{n1} such that ||X − Y|| ≤ δ.

Any function is 0-contracting. The δ-contracting property ensures that, despite the noise, for a specific Prop all likely measurements Y will be mapped to the same value of Z.

Definition 2: Let G(W, X): R^{n1+n2} → {0,1}^{n3} be a function. The function G is called "versatile" if and only if for all S ∈ {0,1}^{n3} and all X ∈ R^{n1} there exists (an efficient algorithm to find) at least one vector W ∈ R^{n2} such that G(W, Y) = S.

A trivial ∞-contracting function is G(W, X) = constant; however, this function is not versatile. The property of versatility is relevant particularly for key establishment. A trivial versatile and ∞-contracting function is G(W, X) = C(W); however, in this solution W reveals the secret S, or at least the conditional entropy H(S|W) = 0.

Theorem: If W is a constant, i.e., if G(W, Y) = C(Y), then either the largest contracting range of G is δ = 0 or G(W, Y) is a constant independent of Y.

Proof: Assume G is δ-contracting with δ > 0. Choose two points Y1 and Y2 such that G(W, Y1) = Z1 and G(W, Y2) = Z2. Define a vector r = λ(Y2 − Y1) such that 0 < ||r|| < δ. Then Z1 = G(W, Y1) = G(W, Y1 + r) = G(W, Y1 + 2r) = ... = Z2. Thus G(W, Y1) = G(W, Y2), i.e., G is constant.

Corollary: The desirable property that biometric data can be verified in the encrypted domain cannot be achieved unless Prop-specific data W is used. Biometric authentication that attempts to process Y without such "helper" data is doomed to store decryptable user templates.

Definition 3: Let G(W, Y): R^{n1+n2} → {0,1}^{n3} be a δ-contracting function with δ ≥ 0, and let ε ≥ 0 be a non-negative real number. The function G is called "ε-revealing" if
and only if for all X ∈ Rn1 there exists (an efficient algorithm to find) a contracting vector W ∈ Rn2 such that the mutual information I(W;S) < ε . Hence W conceals S: it reveals only a well-defined, small amount of information about S. Similarly, we require that V conceals S. However we do not interpret this in the information theoretic sense but in the complexity theoretic sense, i.e., the computational effort to obtain a reasonable estimate of (X or) S from V is prohibitively large, even though in the information theoretic sense V may (uniquely) define S.
3 Proposed System
We have developed several constructions of δ-contracting and ε-revealing biometric authentication systems; we describe one here and refer to a forthcoming paper [8] for a more elaborate discussion. For simplicity we adopt a model in which X and N are zero-mean i.i.d. jointly Gaussian random vectors with variances σ_x² and σ_n², respectively. For the i-th dimension (i = 1, 2, ..., n1, with n1 = n2) of Y, W and Z, the δ-contracting function is

$$z_i = \begin{cases} 1 & \text{if } 2nq \le y_i + w_i < (2n+1)q \text{ for some integer } n, \\ 0 & \text{if } (2n-1)q \le y_i + w_i < 2nq \text{ for some integer } n, \end{cases}$$
with q a quantization step size. During enrollment, x_i is measured and the C.A. finds a w_i such that the value of x_i + w_i is pushed to the nearest lattice point, where x_i + w_i + δ will be quantized to the same z_i for any small δ. This can be interpreted as a form of Quantization Index Modulation watermarking [5]. For the i-th dimension of S, the value of w_i is

$$w_i = \begin{cases} \big(2n + \tfrac{1}{2}\big)\,q - x_i & \text{if } s_i = 1, \\ \big(2n - \tfrac{1}{2}\big)\,q - x_i & \text{if } s_i = 0, \end{cases}$$
where n = ..., −1, 0, 1, 2, ... is chosen such that −q < w_i < q. The value of n is discarded, but the values of w are released as helper data. We analyse the case of a single specific dimension, in which a secret message s ∈ {0, 1} is verified. The contraction range δ equals q/2. The probability that an honest couple Peggy-Victor makes an error in one dimension equals

$$P_e = 2Q\!\left(\frac{q}{2\sigma_n}\right) - 2Q\!\left(\frac{3q}{2\sigma_n}\right) + 2Q\!\left(\frac{5q}{2\sigma_n}\right) - \ldots$$
where Q(x) is the tail integral of the Gaussian pdf with unit variance. In a practical situation, one can apply error-correction decoding to further reduce the error rate compared to Figure 3.
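The per-dimension construction above can be prototyped as follows; this is a sketch under our reading of the scheme, with vectorized NumPy and our own variable names, not the authors' implementation.

import numpy as np

def enroll_helper(x, s, q):
    # w_i = (2n + 1/2)q - x_i if s_i = 1, w_i = (2n - 1/2)q - x_i if s_i = 0,
    # with the integer n chosen so that -q < w_i < q.
    target = np.where(np.asarray(s) == 1, 0.5 * q, -0.5 * q)
    n = np.round((np.asarray(x) - target) / (2.0 * q))
    return 2.0 * n * q + target - np.asarray(x)

def extract_secret(y, w, q):
    # z_i = 1 on intervals [2nq, (2n+1)q) and z_i = 0 on intervals [(2n-1)q, 2nq).
    return (np.floor((np.asarray(y) + np.asarray(w)) / q).astype(int) % 2 == 0).astype(int)

With these definitions, extract_secret(x + noise, w, q) returns s whenever each noise component stays within the contraction range q/2.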
Fig. 3. Uncoded error probability per dimension as a function of q/σ_n
The next analysis quantifies ε by calculating the leakage of information under our assumptions on the statistical behavior of the input signals X and W, where the statistics of W are determined by those of X and S. The signals in all dimensions are treated identically, so we omit the index i. We observe that for s_i = 1, w = (2n + 1/2)q − x, so

$$f_W(w \mid s = 1) = \begin{cases} \displaystyle\sum_{n=-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left(-\frac{\big((2n + \tfrac{1}{2})q - w\big)^2}{2\sigma_x^2}\right) & \text{for } |w| < q, \\ 0 & \text{for } |w| > q. \end{cases}$$
Figure 4 plots q·f_W(w) as a function of w/q. The solid lines depict f_W(w|s=0) and the crosses depict f_W(w|s=1). Information leaks whenever f_W(w|s=1) ≠ f_W(w|s=0). The symmetry properties f_W(w|s) = f_W(q − w|s) and f_W(w|s=1) = f_W(−w|s=0) apply. f_W(w|s=1) has a maximum at w = q/2, which corresponds to highly likely values of x near x = 0. The unconditional probability density of W follows from f_W(w) = f_W(w|s=1)P(s=1) + f_W(w|s=0)P(s=0). Despite the suggestion of Figure 4, it is neither true that f_W(w|s=1) = 1 − f_W(−w|s=1) nor that f_W(w) is constant. Using Bayes' rule, the a posteriori probability p_{w1} of s = 1 can be expressed as

$$p_{w1} = P(s = 1 \mid W = w) = \frac{f_W(w \mid s = 1)\, P(s = 1)}{f_W(w)}.$$
Similarly, we define p_{w0}. Then the mutual information I(W;S) follows from

$$I(W;S) = H(S) - \int_{-q}^{q} H(S \mid W = w)\, f_W(w)\, dw.$$
Here H(S) stands for the information-theoretic entropy of a discrete random variable S, defined as H(S) = −Σ_i P(S=i) log₂ P(S=i). Since S takes the value 0 or 1 with probability 0.5, H(S) = 1 bit. Thus,
$$I(W;S) = H(S) + \int_{-q}^{q} \big\{ p_{w1} \log p_{w1} + p_{w0} \log p_{w0} \big\}\, f_W(w)\, dw$$

$$I(W;S) = 1 + \frac{1}{2}\int_{-q}^{q} f_W(w \mid s=1) \log \frac{f_W(w \mid s=1)}{2 f_W(w)}\, dw + \frac{1}{2}\int_{-q}^{q} f_W(w \mid s=0) \log \frac{f_W(w \mid s=0)}{2 f_W(w)}\, dw$$

Expanding the logarithm into separate terms, i.e., applying the rule log(a/b) = log a − log b, we get

$$I(W;S) = 1 + \frac{1}{2}\int_{-q}^{q} f_W(w \mid s=1) \log f_W(w \mid s=1)\, dw + \frac{1}{2}\int_{-q}^{q} f_W(w \mid s=0) \log f_W(w \mid s=0)\, dw - \int_{-q}^{q} f_W(w) \log 2 f_W(w)\, dw$$

Or simply,

$$I(W;S) = \int_{-q}^{q} f_W(w \mid s=1) \log f_W(w \mid s=1)\, dw - \int_{-q}^{q} f_W(w) \log f_W(w)\, dw.$$

Figure 5 shows that quantization values as crude as q/σ_n = 1 are sufficient to ensure small leakage (ε = 0; i − −) gs[i][j] = gl[i + rt − m][j];
Synthetic Eyes
Fig. 2. (a) Different stages showing how the operator Lower Lower eyeLid (LLL) acts. Top left: the real image. Top right: the model partial circle fit to the iris and its center, obtained by the eyefinder program. Bottom left: the iris and the lower part of the eye are stretched downward. Bottom right, i.e., the outcome of the operator LLL: the white of the eye has been pulled in so that the iris agrees with the model partial circle. (b) Cumulative effect of 3 operators to produce a synthetic image (stages showing how individual operators act have been omitted). Top left is the real image. Top right is the image after the application of the operator RLL (Raise Lower eyeLid). Bottom right is the outcome of the operator RUL (Raise Upper eyeLid) applied to the image on top of it. Bottom left is the outcome of the operator RB (Raise eyeBrow) applied to the image to its right. In short, the bottom left is generated to produce the appearance of an upward head tilt

where i_j is the row index of a pixel in the j-th column below the eyebrow or between the eye and the eyebrow, depending on j, and m is a small positive integer which decides how far the eyebrow is lowered. Note that, in each column, m pixels move into w_o from w_l. RB has a similar behavior, but m is a small negative integer; here some pixels are pushed out of the window, and pixels with new gray-level values (comparable to those between the eye and the eyebrow) are created to enlarge the area between the eye and the eyebrow. TB is a mixture of LB and RB, and AB is a variation of LB.
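The following loose sketch conveys the idea of the column-wise eyebrow shift behind LB; the boundary array brow_row, the window layout and the index bookkeeping are our own simplifications and do not reproduce the paper's gs/gl and rt notation.

import numpy as np

def shift_eyebrow(g, brow_row, m):
    # g: 2D grey-level window (row 0 at the top); brow_row[j]: per-column row index
    # just below the eyebrow region; m > 0 lowers the eyebrow, as in operator LB
    # (the RB case, m < 0, would additionally need new pixel values near the top).
    out = g.copy()
    for j in range(g.shape[1]):
        for i in range(brow_row[j], m - 1, -1):   # work upward from the boundary row
            out[i, j] = g[i - m, j]               # each pixel takes the value m rows above
    return out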
Fig. 3. The window at left is subject to many unwanted lighting conditions. The slightly smaller window at right is far simpler to handle
Fig. 4. The same eye of the same person cropped out of his photos taken over a one-year period, under different lighting conditions. Some of the full-face photos from which these eyes were cropped out are shown on the right
Operators 5-8. Opening the eye further usually requires the display of some or all of the covered portion of the iris. Hence, along with other sub-tasks, the (entire) iris must be constructed from the visible portion. We model the iris image with a circle. Fitting a small circle to quantized points with regression techniques is typically unreliable, so we use a matched-filter-like approach to find the iris, which we describe below. Let r denote its radius, and i_c and j_c the row and column index of the pixel at its center. To calculate these parameters, we extract the visible portion of the iris. Let len_i denote the length of the iris on row i. While j_c is calculated from the len_i's directly, r and i_c are calculated by fitting the model to the len_i's. Because r is only several pixels long, the quantization error is significant and attempts should be made to reduce its impact. While fitting a circle to the iris, we allow for 4 levels of quantization. Let f be the fraction (in the vertical direction) of the lowest iris pixel in column j_c which is actually covered by the iris. We allow f to take on the values 1, 0.75, 0.5, or 0.25. We allow the radius of the model circle r_M to vary over several pixels in increments of 1/8 pixel. Furthermore, for each value of r_M, we allow the 4 values of f mentioned above. For each pair of r_M and f, we move the model circle along the vertical direction on the extracted iris and calculate the degree of mismatch between the corresponding segments of the model and the iris. The model configuration (specified by r_M and f) together with its placement (specified by i_c) that gives rise to the minimum discrepancy determines the values of r_M, f and i_c. Once the entire iris is obtained, we can display it to the desired extent (see Figure 2 for LLL). Operators LUL, RUL and RLL can be described similarly.

Operator 9: This operator applies an intensity gradient in one or several directions.
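The matched-filter-style search over (r_M, f, i_c) described for operators 5-8 can be sketched as a brute-force minimization. The chord model, the grids and the treatment of the sub-pixel fraction f as a vertical offset below are our own modelling choices, not the paper's exact procedure.

import numpy as np

def fit_iris(len_i, r_min, r_max):
    # len_i[i]: measured visible iris chord length on row i (0 where the iris is occluded).
    len_i = np.asarray(len_i, dtype=float)
    rows = np.arange(len(len_i), dtype=float)
    visible = len_i > 0
    best_params, best_mismatch = None, np.inf
    for rM in np.arange(r_min, r_max + 1e-9, 0.125):        # 1/8-pixel radius increments
        for f in (1.0, 0.75, 0.5, 0.25):                     # four quantization levels for f
            for ic in rows:
                d = rows + (1.0 - f) - ic                    # signed row offset from the centre
                chord = 2.0 * np.sqrt(np.maximum(rM ** 2 - d ** 2, 0.0))
                mismatch = np.sum((chord[visible] - len_i[visible]) ** 2)
                if mismatch < best_mismatch:
                    best_params, best_mismatch = (rM, f, ic), mismatch
    return best_params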
Fig. 5. Top left is a real image. The other 5 are synthetic images which were automatically generated from the top left image
Fig. 6. Some photos of a subject in non-FERET set A. Rotation, scaling, and intensity normalization make the eye images cropped out of these photos (almost) identical. That is, for images of this subject correct recognition did not require synthetic images (unlike the subject in Figure 4)

2.2 Image Rescaling
To compare images, we register them against each other. This requires scaling them to the same size, or more precisely, to the same number of pixels, by increasing the number of pixels in some and decreasing it in others. Consequently, we must create new pixels, namely pixels with new gray-level values. The rescaling process must therefore be carried out with a great deal of care; see [3] for details of our approach.
2.3 Intensity Normalization
To reduce the impact of variations in ambient light, we normalize the intensities in all the windows covering the eye so that they have the same total brightness. Intensity normalization should be performed so that the quality of the image is not (significantly) affected; for details see [3].

Window Size. We have experimented with different window sizes. We believe that the window should be picked so as to reduce shading that might be caused by certain lighting conditions. For example, the left of Figure 3 shows a window which is slightly too large, hence admitting unwanted lighting conditions, whereas the window on the right is much better protected. The size of this window is 40 x 32 pixels, cropped out of a close-up image with a resolution of 256 x 384.
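A minimal sketch of the brightness normalization step follows; the target level and the clipping are our own choices, and the paper's exact procedure is given in [3].

import numpy as np

def normalize_intensity(window, target_total=None):
    # Rescale the eye window so that its summed grey-level equals a common target.
    w = window.astype(np.float64)
    if target_total is None:
        target_total = 128.0 * w.size        # an arbitrary common brightness level
    w *= target_total / max(w.sum(), 1e-9)
    return np.clip(w, 0, 255).astype(np.uint8)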
Table 1. Summary of experimental results. For each subject 8 synthetic images were produced

  Image Source      Image Collection Info.               Real Images Error (percentage)   Real+Synthetic Error
  FERET                                                  5 (15%)                          0
  non-FERET set A   Images collected within a few days   2 (7%)                           0
  non-FERET set B   Images collected over many months    12 (44%)                         0

3 Experiments
Images used in the experiments were mostly from the FERET database, i.e., close-up frontal-view images. However, because our FERET images provided no more than 2 images per person, we also included 6 "local" subjects, from the Naval Research Laboratory and Michigan State University, each providing 10 images taken in the same style as FERET images. Windows covering an eye and the eyebrow were cropped out automatically as follows. First, the two eyes were located using the eye locator program that we have developed. Next, the image was rotated, scaled, and translated so that the two eyes were located on pre-specified pixels. Then the window was cropped out and its intensity was normalized in the manner described earlier. The size of these images (or windows) was 40 x 32 pixels. Images were compared by calculating the sum of squared differences of the gray-level values at corresponding pixels. We used this similarity measure because it is perhaps the most straightforward and commonly used similarity measure; evaluation of the merits of different similarity measures is beyond the scope of this work.
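The similarity measure and the nearest-neighbour lookup used in the tests reduce to a few lines; the function names below are our own sketch.

import numpy as np

def ssd(a, b):
    # Sum of squared grey-level differences over the 40 x 32 window.
    d = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sum(d * d))

def closest_match(test_window, training_windows):
    # Index of the most similar training image under the SSD measure.
    return int(np.argmin([ssd(test_window, t) for t in training_windows]))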
3.1 Test 1: Real Images Only
We used 40 FERET subjects, each providing 2 images, and 6 non-FERET subjects, each providing 10 images. The test set had a total of 94 images, composed of 40 FERET images and 54 non-FERET images. The 40 FERET images belonged to 40 different subjects, whereas the non-FERET images belonged to 6 subjects (each contributing 9 images). The training set was composed of 91 images, including 1 image, say S0, of the test subject. For each image in the test set, its closest match in the training set was determined. If the closest match in the training set was S0, i.e., if it belonged to the test subject, the answer was regarded as correct; otherwise it was wrong. Of the 40 FERET test images, 35 (85%) found the correct match. Of the 54 non-FERET test images, 40 (74%) found the correct match.
3.2 Test 2: Real and Synthetic Images
We used S0 and generated 8 synthetic images of S0. These 8 synthetic images were generated as follows. Two CS operators were applied to the real image (producing two different synthetic lighting conditions). Operators LB and LLL in combination were applied, creating the appearance of looking up into the camera (downward head tilt). Operators RLL, RUL and RB in combination were applied, creating the appearance of an upward head tilt (see Figure 2b). Operators RLL and LUL in combination were applied, creating a squint and possibly a slight smile. Operator CS was then applied to the above three synthetic images. These 8 "derivatives" of S0 were then added to the training set, while the same test set was used. If S0 or one of its derivatives was the closest match, the answer was considered correct; otherwise wrong. We obtained correct answers for all of the FERET and non-FERET test images. The results are summarized in Table 1. We note that we did not obtain error-free results when fewer than 8 synthetic images were generated. The results are examined further in the Discussion.
4 Discussion
In general, humans do not need to closely examine people's eyes in order to recognize them. Furthermore, it is not an easy task for humans to recognize a given person in a photo if the presented photo displays only the eye and the eyebrow of the person. But would this imply that the eye does not possess unique and sufficient recognition information? Having so many recognition cues at their disposal, humans have not had the need to specialize in recognition through the eye alone. Our experimental results are preliminary; nevertheless, they suggest that the eye is rich in discriminatory information, more than is normally utilized by humans. This wealth of information, however, can be exploited by machines. Variations in the appearance of the eye and its associated eyebrow are caused by many factors, including head tilt, the degree to which the eye is open, the relative position of the eyebrow with respect to the eye, lighting conditions, certain expressions, etc. A versatile recognition system would have already seen examples of such variations. But real images often do not provide an adequate number of these examples. Even if several photos of a given subject are available, they may not represent the entire space of possible eye/eyebrow variations. Furthermore, some of the real images may be too similar to each other and thus redundant. Synthetic images, on the other hand, can enrich the training set so that it includes the desired variations. This has been indicated by our experimental results. As indicated in Table 1, for the non-FERET set A, i.e., for subjects whose images were taken within a few days, we obtained relatively good results even without synthetic images, while for set B subjects synthetic images were greatly needed. The reason for such wide differences among these subjects appears to be as follows. The problem is much easier when images of the subject of interest are taken on the same day (or a few days apart) and under similar lighting conditions. Under such conditions, different images of the same subject can be so similar to each other that (small) variations in their eyes or eyebrows may
not cause sufficient recognition difficulty, and there would not be much need for synthetic images. However, when images are taken on different days, perhaps months apart, under considerably different conditions, there will be large differences among them. Variations such as head tilt, expression, etc., then add to the lack of similarity, resulting in incorrect recognition when only the available real images are used. In these difficult cases, synthetic images, which alleviate the differences in pose, expression, or lighting condition, are very helpful. Many, if not most, real-life scenarios fall into the latter category, where synthetic images can be of major help. Finally, we speculate that a similar improvement may be achieved when solving the face recognition problem, i.e., when the training set is enriched with carefully generated synthetic images of the entire face.
References
[1] Belhumeur, P. N., Hespanha, J., and Kriegman, D., "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997. 412
[2] Hjelmas, E., Wroldsen, J., "Recognizing faces from the eyes only," Proc. Scandinavian Image Analysis Conference, 1997. 413
[3] Kamgar-Parsi, Behrooz, Kamgar-Parsi, Behzad, "Recognizing Eyes," Naval Research Laboratory, NCARAI Technical Report, 2002. 417
[4] Kotropoulos, C., Tefas, A., and Pitas, I., "Frontal face authentication using morphological elastic graph matching," IEEE Trans. Image Processing, vol. 9, pp. 555-560, 2000. 412
[5] Moghaddam, B., Wahid, W., and Pentland, A., "Beyond eigenfaces: Probabilistic matching for face recognition," Proc. Int'l Conf. Automatic Face and Gesture Recognition, pp. 30-35, Nara, Japan, April 1998. 412
[6] Penev, P. S., and Atick, J. J., "Local feature analysis: A general statistical theory for object representation," Network: Computation in Neural Systems, vol. 7, pp. 477-500, 1996. 412
[7] Sim, T. and Kanade, T., "Combining Models and Exemplars for Face Recognition: An Illuminating Example," Proc. CVPR 2001 Workshop on Models versus Exemplars in Computer Vision, Dec. 2001. 412
[8] Lee, Kuang-Chih, Ho, J., and Kriegman, D. J., "Nine points of light: acquiring subspaces for face recognition under variable lighting," Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 519-526, Dec. 2001. 412
[9] Turk, M., and Pentland, A., "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991. 412
[10] Wiskott, L., Fellous, J. M., Krüger, N., and von der Malsburg, C., "Face Recognition by Elastic Bunch Graph Matching," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, pp. 775-779, 1997. 412
[11] Zhang, J., Yan, Y., and Lades, M., "Face recognition: Eigenfaces, elastic matching, and neural nets," Proc. IEEE, vol. 85, pp. 1422-1435, 1997. 412
Maximum-Likelihood Deformation Analysis of Different-Sized Fingerprints*
Yuliang He, Jie Tian**, Qun Ren, and Xin Yang
Biometrics Research Group, Institute of Automation, Chinese Academy of Sciences
P.O. Box 2728, Beijing, 100080, China
[email protected] http://www.fingerpass.net
Abstract. This paper introduces a probabilistic formulation in terms of maximum-likelihood estimation to calculate the optimal deformation parameters, such as scale, rotation and translation, between a pair of fingerprints acquired by different image capturers from the same finger. This uncertainty estimation technique allows parameter selection to be performed by choosing parameters that minimize the deformation uncertainty and maximize the global similarity between the pair of fingerprints. In addition, we use a multi-resolution search strategy to find the optimal deformation parameters in the space of possible deformation parameters. We apply the method to fingerprint matching in a pension fund management system in China, a fingerprint-based personal identification application. The performance of the method shows that it is effective in estimating the optimal deformation parameters between a pair of fingerprints.
1 Introduction
Fingerprint-based biometric systems have attracted great interest from researchers seeking new algorithms and techniques for fingerprint recognition in the last decade. Great progress has been made in the development of on-line fingerprint sensing techniques [1, 2, 3] and, as a consequence, several small and inexpensive sensing elements have reached the market. Significant improvements have been achieved on the algorithmic side as well [2, 3]. However, a large number of challenging problems [4, 5] still exist. For example, a pension fund management system in China, a fingerprint-based personal identification application system, requires that the matching
* This paper is supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 60225008, the Special Project of the National Grand Fundamental Research 973 Program of China under Grant No. 2002CCA03900, the National High Technology Development Program of China under Grant No. 2002AA234051, and the National Natural Science Foundation of China under Grant Nos. 60172057, 69931010, 60071002, 30270403, and 60072007.
** Corresponding author: Jie Tian; Telephone: 8610-62532105; Fax: 8610-62527995.
algorithm be more tolerant of the deformations and non-linear distortions that influence the performance of matching algorithms on fingerprints acquired by different image capturers. Although many methods [3, 4, 5] have been proposed and have succeeded in dealing with similar problems, they, to the best of our knowledge, are designed only for identifying a pair of fingerprints acquired by the same image capturer and may be invalid when applied to a pair of fingerprints produced by different image capturers. In addition, it is very difficult to find an effective and efficient optimization algorithm capable of automatically calculating the optimal deformation parameters between the pair of fingerprints. Therefore, we introduce in this paper a probabilistic formulation in terms of maximum-likelihood estimation to automatically model the deformations, such as scale, rotation, and translation, between a pair of fingerprints. This uncertainty estimation technique adopts the strategy of replacing the local similarity between two fingerprints with their global similarity when estimating their deformation parameters. To search for the optimal deformation parameters in the space of possible deformation parameters, we use a multi-resolution search strategy that examines a hierarchical decomposition of this space into cells and determines which cells could contain a position satisfying the acceptance criterion. The motivation of our effort is that a good understanding of the deformation dynamics can be very helpful for designing new robust (deformation-tolerant) fingerprint matching algorithms. The performance of our method shows that it is an effective way of dynamically estimating the deformation parameters between a pair of fingerprints.
2 Deformation Analysis
Pressing the finger's tip against the plain surface of an on-line acquisition sensor produces, as the main effect, a 3D to 2D mapping of the finger skin. The user's random placement brings out deformations, such as rotation and translation, between a pair of fingerprints acquired by the same image capturer from the same finger. There is also scale deformation between a pair of fingerprints acquired by different image capturers. These deformations greatly influence the performance of a fingerprint-matching algorithm. Many methods [1, 8, 9] in the literature explicitly attempt to model fingerprint deformations for a pair of fingerprints acquired by the same image capturer. However, few are designed for a pair of fingerprints acquired by different image capturers. Moreover, many fingerprint-based personal identification application systems, such as the aforementioned pension fund management system in China, require that their matching algorithms be tolerant of factors, such as resolution and size, introduced by these image capturers, which greatly influence the performance of these algorithms. Hence, to model the deformations between a pair of fingerprints acquired by different image capturers for fingerprint matching, we investigate the characteristics
of the mapping and introduce a probabilistic formulation in terms of maximum-likelihood estimation.

2.1 Fingerprint Minutiae
A fingerprint is the pattern of ridges and valleys on the surface of a finger. The uniqueness of a fingerprint is determined by the overall pattern of ridges and valleys as well as by local ridge anomalies (a ridge bifurcation or a ridge ending, called minutiae points), which possess the discriminatory information. Our representation of a fingerprint consists of positional, directional, and type information of its minutiae. Let M(F) = {(x_i, y_i, α_i, β_i)}, i = 1, 2, ..., m(F), be the minutiae set of fingerprint F containing the position information (x, y), the directional information α, and the minutiae type information β (β = 0 indicates a ridge ending and β = 1 indicates a bifurcation) for the m(F) minutiae elements in fingerprint F. The parameter m(F) is the number of minutiae in fingerprint F. For convenience, M(F, i) is used to represent the ith minutia M_i in fingerprint F.
2.2 Affine Transformation
For a pair of fingerprints F and G acquired from the same finger, e.g., an ink-on-paper fingerprint and a live-scanned fingerprint, there are deformations between them, including translation, scale, and rotation. Let M(F) = {M(F, i) = (x_i, y_i, α_i, β_i), 1 ≤ i ≤ m(F)} and M(G) = {M(G, j) = (x_j, y_j, α_j, β_j), 1 ≤ j ≤ m(G)} be the minutiae sets of fingerprints F and G respectively, where m(F) and m(G) are the numbers of minutiae elements in fingerprints F and G respectively. Both M(F) and M(G) can be considered sets of discrete points at the locations of the occupied pixels in fingerprints F and G. Considering the minutiae sets M(F) and M(G), we impose an affine transformation model T that relates these two sets, i.e., T: M(F) → M(G). The parameterized affine transformation model T is invoked when it can be safely assumed that spatial variations of pixels in a region can be represented by a low-order polynomial [7]. Five random variables θ (−90° < θ < 90°), t_x, t_y, s_x and s_y in the affine transformation model are used to describe the rotation, scale and translation deformations between fingerprints F and G, where t_x and t_y represent horizontal and vertical translation respectively, s_x and s_y correspond to horizontal and vertical scale respectively, and θ corresponds to rotation. These random variables can be thought of as five functions that map the input fingerprint minutiae set M(F) into the template fingerprint G; the type of M(F, i) remains unchanged. The new minutia M(F, i) can be represented by (x'_i, y'_i, α'_i, β_i) = (s_x x_i cos θ + s_y y_i sin θ + t_x, −s_x x_i sin θ + s_y y_i cos θ + t_y, arctan[(s_y α_i)/s_x], β_i).
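As an illustration of the transformation model T above, the following Python sketch applies a candidate parameter set (θ, t_x, t_y, s_x, s_y) to a minutiae set. It is only a sketch of one plausible reading of the transformation formula above; the function name and the treatment of the direction component α are assumptions, not part of the original paper.

```python
import numpy as np

def transform_minutiae(minutiae, theta, tx, ty, sx, sy):
    """Apply the affine model T to a minutiae set M(F).

    minutiae: array of shape (m, 4) with rows (x, y, alpha, beta).
    Returns the transformed set with the type field beta unchanged.
    Assumption: positions are rotated/scaled/translated as in the text,
    and the direction is remapped by arctan((sy * alpha) / sx).
    """
    x, y, alpha, beta = minutiae.T
    c, s = np.cos(theta), np.sin(theta)
    x_new = sx * x * c + sy * y * s + tx
    y_new = -sx * x * s + sy * y * c + ty
    alpha_new = np.arctan2(sy * alpha, sx)   # plausible reading of arctan[(sy*alpha)/sx]
    return np.stack([x_new, y_new, alpha_new, beta], axis=1)
```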
2.3 Constructing the Probability Density Function (PDF)
M(F), the minutiae set of fingerprint F, is mapped into the template fingerprint G using the affine transformation model T. To formulate the problem in terms of maximum-likelihood estimation of the deformations, including scale, translation, and rotation, we must have some set of measurements that are a function of these deformation parameters between fingerprints F and G. Similar to methods based on the Hausdorff
distance [7], we use the Hausdorff distance from each minutia in fingerprint F (with respect to the deformation parameters specified by θ, t_x, t_y, s_x, and s_y) to the closest occupied minutia in fingerprint G as our set of measurements. Let d_i(θ, t_x, t_y, s_x, s_y) denote these Hausdorff distances. In general, these distances can be found quickly for any θ, t_x, t_y, s_x, and s_y, if we pre-compute the Hausdorff distance transform of fingerprint G by Equation (1), on condition that the distance between (x'_i, y'_i) and (x_j, y_j) is less than ε_l and |α'_i − α'_j|

+ b that minimizes the objective function for penalty parameter C, subject to the linear constraints, with n training examples {x_i, y_i}, i = 1, ..., n, x_i ∈ R^d, y_i ∈ {−1, 1}, where y_i is the class label:

\min_{\mathbf{w}, b, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i, \quad \text{subject to}\ \ y_i((\mathbf{x}_i \cdot \mathbf{w}) + b) \ge 1 - \xi_i,\ \xi_i \ge 0        (1)
f(x) is a function that maximizes the margin, which is the distance from x to the hyperplane separating the classes. The margin of an example (x, y) with respect to f is defined as y · f(x) [8]. The parameter C gives the trade-off between margin maximization and training error minimization. SVM can be applied to nonlinear separation as a linear machine in a high-dimensional feature space using a kernel function. The linear SVM can be rewritten as follows:
f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i^{T}\mathbf{x} + b, \qquad \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i        (2)
where α_i is a coefficient weight. In a feature space with a nonlinear mapping Φ: R^d → R^D,

f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i \Phi(\mathbf{x}_i)^{T}\Phi(\mathbf{x}) + b,

and Φ(x)^T Φ(z) can be replaced by a kernel function K: R^d × R^d → R, K(x, z), avoiding the explicit mapping Φ.
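A minimal Python sketch of the kernelized decision function described above; the RBF kernel choice and all variable names are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), one common kernel choice
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, the kernelized decision function."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```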
The principal merits of SVM in decomposition methods for the multiclass problem are presented in [9]. First, SVM learning algorithms generate dichotomizers that adapt to the complexity of the problem, selecting a number of support vectors proportional to the problem complexity, since the dichotomies induced by decomposition methods can correspond to two-class classification problems with different levels of complexity. Second, SVM can perform accurate nonlinear separation of two classes in the input space using a kernel function.
3 Decomposition Methods for Multiclass Problem
Learning machines implementing decomposition methods are composed of two parts: one is decomposition (encoding) and the other is reconstruction (decoding) [10]. For a more detailed summary, refer to [10], [11]. In the decomposition step, we generate a decomposition matrix D ∈ {−1, 0, +1}^{L×K} that specifies how the K classes are used to train L dichotomizers f_1, ..., f_L. The dichotomizer f_l is trained according to row D(l, ·): if D(l, k) = +1, all examples of class k are positive; if D(l, k) = −1, all examples of class k are negative; and if D(l, k) = 0, none of the examples of class k participates in the training of f_l [11]. The columns of D are called codewords [12]. In the reconstruction step, a simple nearest-neighbor rule is commonly used: the class output is selected that minimizes some similarity measure S: R^L × {−1, 0, 1}^L → [0, ∞] between f(x) and column D(·, k) [12].

class\_output = \arg\min_{k} S(f(\mathbf{x}), D(\cdot, k))        (3)
When the similarity measure is defined based on the margin, the method is called margin decoding:

S(f(\mathbf{x}), D(\cdot, k)) = \sum_{l} f_l(\mathbf{x})\, D(l, k)        (4)
When the classifier outputs a hard decision, h(x) ∈ {−1, 1}, the method is called Hamming decoding:

S_H(h(\mathbf{x}), D(\cdot, k)) = 0.5 \times \sum_{l} (1 - h_l(\mathbf{x})\, D(l, k))        (5)
3.1 One Per Class
Each dichotomizer f_i has to separate a single class from all the others. Therefore, if we have K classes, we need K dichotomizers. In reconstruction, max-win decoding is
commonly used for this scheme, i.e., a new input x is classified as the class j whose dichotomizer f_j gives the highest value under the similarity measure. In real applications, the decomposition of a polychotomy gives rise to complex dichotomies that in turn need complex dichotomizers [12]. One of the merits of OPC is that it trains on all the classes at once, which can be beneficial when the training sample per class is small, as in face recognition. However, it can produce complex dichotomies because it groups various classes into one, so it needs an accurate base classifier such as SVM, and the performance of OPC depends heavily on the performance of the base classifiers.
3.2 Pairwise Coupling
Each dichotomizer f_ij is to separate a class i from a class j for each possible pair of classes, so a dichotomizer is trained on the samples related to the two classes only. This can be a merit in that simpler dichotomies can be made. It can be a demerit when we consider the decoding procedure: if an input x that belongs neither to class i nor to class j is fed into f_ij, a nonsense output can come out [13]. As the number of classes increases, the performance is lowered by these nonsense outputs, so for good performance only the relevant classifiers should be decoded. The number of dichotomizers is K C_2 = K(K − 1)/2, and Hamming decoding is frequently used. To reduce decoding time, tree-structured decoding is applied in [3]. The decomposition matrices of the OPC and PWC schemes, for K = 4, are given in Fig. 1(a) and (b).
D_OPC =
[ +1  −1  −1  −1 ]
[ −1  +1  −1  −1 ]
[ −1  −1  +1  −1 ]
[ −1  −1  −1  +1 ]
(a) OPC

D_PWC =
[ +1  −1   0   0 ]
[ +1   0  −1   0 ]
[ +1   0   0  −1 ]
[  0  +1  −1   0 ]
[  0  +1   0  −1 ]
[  0   0  +1  −1 ]
(b) PWC

Fig. 1. Decomposition matrices. Each row corresponds to one dichotomy and each column to one class. The number of classes is 4, and the numbers of machines for OPC and PWC are 4 and 6 respectively.
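The following Python sketch shows how the OPC and PWC decomposition matrices of Fig. 1 can be generated for K classes and how Hamming decoding (Eq. 5) picks a class from the hard outputs of the L dichotomizers. It is an illustrative sketch only; the function names are not from the paper.

```python
import numpy as np
from itertools import combinations

def opc_matrix(K):
    # One row per class: that class is +1, all other classes are -1.
    return 2 * np.eye(K, dtype=int) - 1

def pwc_matrix(K):
    # One row per class pair (i, j): class i is +1, class j is -1, others 0.
    rows = []
    for i, j in combinations(range(K), 2):
        row = np.zeros(K, dtype=int)
        row[i], row[j] = 1, -1
        rows.append(row)
    return np.array(rows)

def hamming_decode(h, D):
    """h: hard outputs of the L dichotomizers in {-1, +1}; D: L x K matrix.
    Returns the class whose codeword (column of D) is closest in Hamming distance."""
    scores = 0.5 * np.sum(1 - h[:, None] * D, axis=0)   # Eq. (5) for every column k
    return int(np.argmin(scores))
```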
4 Sequential Combination of Basic Schemes
OPC and PWC are frequently used as decomposition methods for the multiclass problem. However, each has its own problems due to its intrinsic decomposition algorithm: OPC suffers from complex dichotomizers, and PWC has the problem that nonsense outputs from unrelated dichotomizers are considered in the final decision. The training error bounds of OPC and PWC are presented respectively in [11]:
\epsilon_K = K\,\epsilon_b        (6)

\epsilon_K \le (K - 1)\,\epsilon_b        (7)
where ε_b is the error rate of the dichotomizers, i.e., the average number of training examples misclassified, and K is the number of classes. The confidence level of a complex dichotomizer might be lower than that of a simple dichotomizer. In other words, since each PWC dichotomizer is trained on only two classes, one would expect the PWC classification error rate to be lower than the corresponding error rate of OPC. This agrees with the fact that the error bound in (7) is lower than that in (6) [11]. On these grounds, we propose a new method combining OPC and PWC with a rejection option. Most of the recognition process is done by OPC. If OPC outputs an unclear result on a given input due to its complexity, we reject the decoding output of OPC and then consult the proper PWC classifier. Fig. 2 illustrates the architecture of the proposed method.
Fig. 2. Reconstruction with rejection based on OPC and PWC decompositions
In OPC decomposition with max-win decoding, we can make a reject condition as follows. Let s_1 be the largest output and s_2 the second largest output among {f_1(x), ..., f_L(x)}, where class c_i = arg_i{s_1 = f_i(x)} and class c_j = arg_j{s_2 = f_j(x)}, and we define i = c_i and j = c_j; then the final decision is made as in (9):

reject = \begin{cases} \text{true} & \text{if } s_1 - s_2 \le \theta \\ \text{false} & \text{otherwise} \end{cases}        (8)

class\_output = \begin{cases} c_i & \text{if } reject = \text{false} \\ c_i & \text{if } reject = \text{true and } SVM_{ij} \ge 0 \\ c_j & \text{if } reject = \text{true and } SVM_{ij} < 0 \end{cases}        (9)
where SVM_ij is a binary classifier trained with class i as the positive class and class j as the negative class.
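A minimal Python sketch of the decision rule in Eqs. (8)-(9): OPC max-win decoding with rejection, falling back to the pairwise SVM of the two top-ranked classes. The dictionary-of-pairwise-classifiers interface and the threshold value are assumptions for illustration.

```python
import numpy as np

def classify_with_rejection(x, opc_outputs, pairwise_svms, theta=0.2):
    """x: feature vector; opc_outputs: the K one-per-class SVM outputs f_1(x), ..., f_K(x);
    pairwise_svms: dict mapping (i, j), i < j, to a scoring function trained with class i
    as positive and class j as negative. Implements Eqs. (8)-(9)."""
    order = np.argsort(opc_outputs)[::-1]
    c_i, c_j = int(order[0]), int(order[1])
    s1, s2 = opc_outputs[c_i], opc_outputs[c_j]
    if (s1 - s2) > theta:                 # Eq. (8): confident, accept the OPC winner
        return c_i
    # Eq. (9): rejected, consult the pairwise classifier for the two top-ranked classes
    key, flip = ((c_i, c_j), 1.0) if c_i < c_j else ((c_j, c_i), -1.0)
    score = flip * pairwise_svms[key](x)  # sign chosen so that positive means class c_i
    return c_i if score >= 0 else c_j
```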
5 Experimental Results and Discussion
We demonstrate our method on face images from the ORL dataset. The image set consists of 400 images, ten images per individual. The images of each person differ in lighting, facial expression, and pose. Since we focus on the classification performance of SVM with partially automatic recognition, we build the training and testing data sets for SVM from PCA outputs. The best feature dimension of the PCA outputs is determined by the highest recognition performance of the baseline algorithm with the preprocessing procedure using eye positions described in the FERET face recognition evaluation protocol [14]. We perform PCA using all the samples and divide them into two equal parts, used as gallery and probe set respectively, to find the best recognition performance. Consequently, we determined a 48-dimensional feature space from our experiment. In the preprocessing stage, we performed rotation, scaling, and translation, and then applied histogram equalization to flatten the distribution of image intensity values. Fig. 3 shows examples of the normalized face images, whose dimension is 1024.
Fig. 3. Normalized face images for one individual in the ORL database
The 48-dimensional data for SVM are shifted by the mean and scaled by their standard deviation. We used the SMOBR [15] SVM implementation with various kernels and values of C. In our experiment on face recognition performance with decomposition methods, we sequentially select one image of each person for testing and the remaining nine images for training, and repeat this ten times, so the numbers of samples for training and testing are 360 and 40 respectively.
5.1 Change of Error Rate by Rank
Our proposed method considers only the highest and second-highest ranked candidates, which means that the right answer should be one of those two candidates. Fig. 4 shows the error rate as the rank changes. OPC with margin decoding and an RBF kernel with σ = 1 and C = 4 were used for this experiment. We can see in Fig. 4(a) that the error rate decreases as the rank increases. Moreover, as shown in Fig. 4(b), the difference in error rate between rank 1 and rank 2 (3.25%) is distinguishable from that of the other pairs.
Fig. 4. (a) Error rate by rank. (b) Difference of error rate between ranks
5.2 Test on the Rejected Samples
Fig. 5(a) shows the number of rejected samples as a function of the threshold. Positive in the figure denotes rejected samples that would have been correctly classified by OPC without the rejection option. Negative indicates rejected samples that would have been misclassified by OPC if the rejection option were absent. Reject indicates the total number of rejected samples. From the graph, we can see that the number of Negative samples is larger than that of Positive samples for thresholds between 0.1 and 0.3. The output of PWC with Hamming decoding for each rejected sample is shown in Fig. 5(b). o−× indicates a sample correctly classified before rejection but misclassified by the corresponding PWC, and ×−o indicates the reverse. The average number of ×−o samples is larger than that of o−× samples.
Fig. 5. (a) The number of rejected samples by threshold. (b) Output of PWC after rejection by threshold

5.3 Performance Comparison
In Table 1, we summarize the error rates of OPC, PWC, and the proposed method. The table shows that the proposed method performs better than PWC and OPC on the ORL face database. In our method, we set the threshold to 0.2 while varying the parameter C.
Table 1. Comparison results of error rate (%)
6 Conclusion
In this paper, we discussed the strengths and weaknesses of two representative decomposition methods, OPC and PWC. We then introduced a new method combining OPC and PWC with a rejection option, using SVMs as base classifiers. Our proposed method can overcome the limitation of OPC by rejecting doubtful decisions from OPC; the rejected instance is fed to PWC. The nonsense output problem of PWC can be surmounted by considering only the two classes ranked highest by OPC. The experimental results on the ORL face database show that our proposed method can reduce the error rate on a real dataset. The disadvantage of our method is that it requires more binary machines (K(K + 1)/2, where K is the number of classes) and more effort in selecting suitable model parameters for both PWC and OPC.
Acknowledgments
This work was supported in part by the Biometrics Engineering Research Center (KOSEF).
References
[1] Vapnik, V. N.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
[2] Phillips, P. J.: Support vector machines applied to face recognition. Advances in Neural Information Processing Systems II, MIT Press (1998) 803-809
[3] Guo, G., Li, S. Z., Chan, K. L.: Support vector machines for face recognition. Image and Vision Computing, Vol. 19 (2001) 631-638
[4] Heisele, B., Ho, P., Poggio, T.: Face recognition with support vector machines: global versus component-based approach. Proc. of IEEE International Conference on Computer Vision (2001) 688-694
[5] Hastie, T., Tibshirani, R.: Classification by pairwise coupling. Advances in Neural Information Processing Systems, Vol. 10, MIT Press (1998)
[6] Dietterich, T. G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, Vol. 2 (1995) 263-286
[7] Hansen, L., Salamon, P.: Neural network ensembles. IEEE Trans. on PAMI, Vol. 12, No. 10 (1990) 993-1001
[8] Allwein, E. L., Schapire, R. E., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. Proc. of International Conference on Machine Learning (2000) 9-16
[9] Valentini, G.: Upper bounds on the training error of ECOC-SVM ensembles. Technical Report TR-00-17, DISI - Dipartimento di Informatica e Scienze dell'Informazione (2000)
[10] Masulli, F., Valentini, G.: Comparing decomposition methods for classification. Proc. of International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, Vol. 2 (2000) 788-791
[11] Klautau, A., Jevtic, N., Orlitsky, A.: Combined binary classifiers with applications to speech recognition. Proc. of International Conference on Spoken Language Processing (2002) 2469-2472
[12] Alpaydin, E., Mayoraz, E.: Learning error-correcting output codes from data. Proc. of International Conference on Artificial Neural Networks (1999)
[13] Moreira, M., Mayoraz, E.: Improved pairwise coupling classification with correcting classifiers. Proc. of European Conference on Machine Learning (1998) 160-171
[14] Phillips, P. J., Moon, H., Rizvi, S. A., Rauss, P. J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. PAMI, Vol. 22, No. 10 (2000) 1090-1104
[15] Almeida, M. B.: SMOBR - A SMO program for training SVM. Univ. of Minas Gerais, Dept. of Electrical Engineering. http://www.cpdee.ufmg.br/~barros/
A Bayesian MCMC On-line Signature Verification Mitsuru Kondo, Daigo Muramatsu, Masahiro Sasaki, Takashi Matsumoto Department of Electrical, Electronics and Computer Engineering Graduate School of Science and Engineering, Waseda University 3-4-1 Ohkubo, Shinjuku-ku, Tokyo, Japan {mitsuru,daigo,sas}@matsumoto.elec.waseda.ac.jp
[email protected] http://www.matsumoto.elec.waseda.ac.jp/
Abstract. Authentication of individuals is rapidly becoming an important issue. The authors have previously proposed a pen-input on-line signature verification algorithm. The algorithm considers a writer's signature as a trajectory of pen position, pen pressure and pen inclination which evolves over time, so that it is dynamic and biometric. In our previous work, genuine signatures were separated from forgery signatures in a linear manner. This paper proposes a new algorithm which performs nonlinear separation using Bayesian MCMC (Markov Chain Monte Carlo). A preliminary experiment is performed on a database consisting of 1852 genuine signatures and 3170 skilled forgery signatures from fourteen individuals. FRR 0.81% and FAR 0.87% are achieved. Since no fine tuning was done, this preliminary result looks very promising.
1 Introduction
Personal identity verification has a great variety of applications, including e-commerce, access to computer terminals and buildings, and credit card verification, to name a few. Algorithms for personal identity verification can be roughly classified into four categories, depending on whether they are static or dynamic and biometric or physical/knowledge-based, as shown in Figure 1 (this figure has been partly inspired by a brochure from Cadix Corp., Tokyo). Fingerprints, iris, retina, DNA, face, and blood vessels, for instance, are static and biometric. Algorithms which are biometric and dynamic include lip movements, body movements and on-line signature. Schemes which use passwords are static and knowledge-based, whereas methods using magnetic cards and IC cards are physical. In [1]-[3], we proposed the algorithm PPI (pen-position/pen-pressure/pen-inclination) for on-line pen-input signature verification. The algorithm considers an individual's
1 There are three types of forgery signatures: Random forgery, Simple forgery and Skilled forgery. A forgery is called Random when the forger has no access to the genuine signature. A forgery is called Simple when the forger knows only the name of the person who writes the genuine signature. A forgery is called Skilled when the forger can view and practice the genuine signature.
signature as a trajectory of pen position, pen pressure, pen inclination and pen velocity which evolves over time, so that it is dynamic and biometric. Since the number of Tablet PCs and PDAs is rapidly increasing, on-line signature verification seems to be one of the most competitive schemes.
Fig. 1. Authentication methods
2 The Algorithm

2.1 Overall Algorithm

Fig. 2. Overall algorithm
Figure 2 describes the overall algorithm. A database of signatures is divided into two groups: signatures for learning and signatures for testing.

2.2 Feature Extraction
The raw data available from our tablet (WACOM Art Pad 2 pro, serial interface) consists of five-dimensional time series data:

(x(j), y(j), p(j), px(j), py(j)) \in R^2 \times \{0,1,\ldots,255\} \times R^2, \quad j = 1,2,\ldots,J        (2.1)

where (x(j), y(j)) ∈ R^2 is the pen position at time j, p(j) ∈ {0,1,...,255} represents the pen pressure, and (px(j), py(j)) ∈ R^2 is the pen inclination with respect to the x- and y-axes, as shown in Figure 3. Define

X_g = \frac{\frac{1}{J}\sum_{j=1}^{J} x(j) - x_{\min}}{x_{\max} - x_{\min}} \times L        (2.2)
Y_g = \frac{\frac{1}{J}\sum_{j=1}^{J} y(j) - y_{\min}}{y_{\max} - y_{\min}} \times L        (2.3)
where x_min and x_max stand for the minimum and the maximum values of x(j), and y_min and y_max stand for the minimum and the maximum values of y(j), respectively, as shown in Figure 4. L is a scale parameter to be chosen. The pair (X_g, Y_g) can be thought of as the centroid of the signature data (2.1).
Fig. 3. Raw data from the tablet

Fig. 4. x_min, x_max, y_min and y_max of a signature

Let
V(j) = (dx(j), dy(j)) := \left( \frac{x(j) - x_{\min}}{x_{\max} - x_{\min}} \times L - X_g,\ \frac{y(j) - y_{\min}}{y_{\max} - y_{\min}} \times L - Y_g \right), \quad j = 1,2,\ldots,J        (2.4)

be the vector of relative pen position with respect to the centroid. Then the vector length f(j) and the vector angle θ(j) of each pen position are given by

f(j) = \sqrt{dx(j)^2 + dy(j)^2}, \quad j = 1,2,\ldots,J        (2.5)

\theta(j) = \begin{cases} \tan^{-1}\frac{dy(j)}{dx(j)} & (dx(j) > 0) \\ \operatorname{sign}(dy(j)) \times \frac{\pi}{2} & (dx(j) = 0) \\ \tan^{-1}\frac{dy(j)}{dx(j)} + \pi & (dx(j) < 0,\ dy(j) \ge 0) \\ \tan^{-1}\frac{dy(j)}{dx(j)} - \pi & (dx(j) < 0,\ dy(j) < 0) \end{cases}, \quad j = 1,2,\ldots,J        (2.6)

Our feature consists of the following five-dimensional data
(\theta(j), f(j), p(j), px(j), py(j)) \in R^2 \times \{0,1,\ldots,255\} \times R^2, \quad j = 1,2,\ldots,J        (2.7)
where J is the number of data points. A typical original signature trajectory, shown in Figure 5(a), is converted into the relative trajectory shown in Figure 5(b).
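The feature extraction of Eqs. (2.2)-(2.7) can be sketched in Python as follows; the array layout of the raw tablet samples and the function name are assumptions for illustration.

```python
import numpy as np

def extract_features(raw, L=1.0):
    """raw: array of shape (J, 5) with columns (x, y, p, px, py) from the tablet.
    Returns an array (J, 5) with columns (theta, f, p, px, py) as in Eq. (2.7)."""
    x, y, p, px, py = raw.T
    xmin, xmax, ymin, ymax = x.min(), x.max(), y.min(), y.max()
    # Centroid of the normalized trajectory, Eqs. (2.2)-(2.3)
    Xg = (x.mean() - xmin) / (xmax - xmin) * L
    Yg = (y.mean() - ymin) / (ymax - ymin) * L
    # Relative pen position with respect to the centroid, Eq. (2.4)
    dx = (x - xmin) / (xmax - xmin) * L - Xg
    dy = (y - ymin) / (ymax - ymin) * L - Yg
    f = np.sqrt(dx ** 2 + dy ** 2)      # vector length, Eq. (2.5)
    theta = np.arctan2(dy, dx)          # vector angle, Eq. (2.6) (arctan2 covers all cases)
    return np.stack([theta, f, p, px, py], axis=1)
```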
Fig. 5. (a) Original signature trajectories. (b) Relative trajectories

2.3 Angle-Arc Length Distance Measure
Let

|\theta_I(j) - \theta_T(k)|\, S(p_I(j), p_T(k))\, S(f_I(j), f_T(k)), \quad j = 1,2,\ldots,J, \quad k = 1,2,\ldots,K        (2.8)

be the local angle arc length distance measure, where the subscript I denotes data from the input signature and T data from the template signature. J is the number of input data points and K is the number of template data points. The function S is defined by

S(u, v) = \begin{cases} 1 & (u = v) \\ |u - v| & (u \ne v) \end{cases}        (2.9)
Since 1 is the minimum value of p_I(j) and p_T(k), the function S(p_I(j), p_T(k)) puts a penalty on discrepancies between p_I(j) and p_T(k). Similarly, S(f_I(j), f_T(k)) puts a penalty on the length of each local trajectory. The following is our total angle arc length distance measure:

D1 := \min_{\substack{j_s \le j_{s+1} \le j_s + 1 \\ k_s \le k_{s+1} \le k_s + 1}} \sum_{s=1}^{S} |\theta_I(j_s) - \theta_T(k_s)|\, S(p_I(j_s), p_T(k_s))\, S(f_I(j_s), f_T(k_s))        (2.10)
where j_1 = k_1 = 1, j_S = J, and k_S = K are fixed. Because of the sequential nature of the distance function, dynamic programming is a feasible means of computation.
D1(0, 0) = 0

D1(j_s, k_s) = \min \begin{cases} D1(j_s - 1, k_s - 1) + |\theta_I(j_s) - \theta_T(k_s)| \times S(p_I(j_s), p_T(k_s))\, S(f_I(j_s), f_T(k_s)) \\ D1(j_s - 1, k_s) + |\theta_I(j_s) - 0| \times S(p_I(j_s), 0)\, S(f_I(j_s), 0) \\ D1(j_s, k_s - 1) + |0 - \theta_T(k_s)| \times S(0, p_T(k_s))\, S(0, f_T(k_s)) \end{cases}        (2.11)
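A minimal Python sketch of the dynamic-programming recursion (2.11) for the total distance D1; the function S follows Eq. (2.9), the boundary handling is an assumption, and all names are illustrative.

```python
import numpy as np

def S(u, v):
    # Eq. (2.9): penalty factor, equal to 1 when the values match
    return 1.0 if u == v else abs(u - v)

def total_distance_D1(theta_I, p_I, f_I, theta_T, p_T, f_T):
    """DP computation of D1 following Eq. (2.11)."""
    J, K = len(theta_I), len(theta_T)
    D = np.full((J + 1, K + 1), np.inf)
    D[0, 0] = 0.0
    for j in range(1, J + 1):
        for k in range(1, K + 1):
            match = abs(theta_I[j-1] - theta_T[k-1]) * S(p_I[j-1], p_T[k-1]) * S(f_I[j-1], f_T[k-1])
            skip_input = abs(theta_I[j-1]) * S(p_I[j-1], 0) * S(f_I[j-1], 0)
            skip_template = abs(theta_T[k-1]) * S(0, p_T[k-1]) * S(0, f_T[k-1])
            D[j, k] = min(D[j-1, k-1] + match,
                          D[j-1, k] + skip_input,
                          D[j, k-1] + skip_template)
    return D[J, K]
```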
2.4 Pen Inclination Distances
Define the pen-inclination distances

D2 := \min_{\substack{j_{s'} \le j_{s'+1} \le j_{s'} + 1 \\ k_{s'} \le k_{s'+1} \le k_{s'} + 1}} \sum_{s'=1}^{S'} |px_I(j_{s'}) - px_T(k_{s'})|        (2.12)

D3 := \min_{\substack{j_{s''} \le j_{s''+1} \le j_{s''} + 1 \\ k_{s''} \le k_{s''+1} \le k_{s''} + 1}} \sum_{s''=1}^{S''} |py_I(j_{s''}) - py_T(k_{s''})|        (2.13)
which are computable via DP also.

2.5 Template Generation
To choose three template signatures, we compute the sum of the distance measures between each signature in a group of genuine signatures for learning and all the others, sort the signatures according to these distances, and then choose the three signatures with the smallest distances. These are used as templates.

2.6 Bayesian MCMC
Given a feature vector (D1, D2, D3), our previous algorithm proposed in [1]-[3] performs linear separation of genuine signatures from forgery signatures. We naturally expect that an appropriate nonlinear separation would improve the performance, even though our linear scheme already performed reasonably well. Let D = {x_m, t_m} (x_m = (D1_m, D2_m, D3_m), t_m ∈ {0, 1}, m = 1, 2, ..., M) be the training data set for learning, where t_m = 1 if x_m is genuine, whereas t_m = 0 if x_m is a forgery. This paper attempts a semi-parametric approach to estimate the probability density of a given test datum x_test = (D1_test, D2_test, D3_test) being genuine. A possible means for this purpose is to prepare a three-layer perceptron:

f(\mathbf{x}_m; \mathbf{w}) := \sum_{i=1}^{h} a_i\, \sigma\!\left( \sum_{j=1}^{n} b_{ij} x_{mj} + c_i \right)        (2.14)
where

\sigma(u) := \frac{1}{1 + e^{-u}}        (2.15)

\mathbf{w} = (\{a_i\}, \{b_{ij}\}, \{c_i\})        (2.16)
One of the fundamental issues will naturally be how one learns w. The Bayesian scheme used in this paper considers the likelihood function defined by

\log P(\{\mathbf{x}_m\} \mid \mathbf{w}, H) = \sum_{m} t_m \log(f(\mathbf{x}_m; \mathbf{w})) + (1 - t_m)\log(1 - f(\mathbf{x}_m; \mathbf{w})) = G(D \mid \mathbf{w}, H), \quad m = 1,2,\ldots,M        (2.17)
which is the cross-entropy function defined in [4]. The symbol H stands for the model under consideration, in particular, a three-layer perceptron with a specific number of hidden units. In order to define the prior distribution, we decompose w as

\mathbf{w} = (\mathbf{w}_1, \ldots, \mathbf{w}_C), \quad \mathbf{w}_c \in R^k        (2.18)

\mathbf{w}_c = (w_{c1}, \ldots, w_{ck}), \quad c = 1, 2, 3        (2.19)

where w_c, c = 1, 2, 3, is the vector of weight parameters between the cth input and the hidden units, w_4 = (w_{41}, ..., w_{4k}) represents the bias vector for the hidden units, and w_5 = (w_{51}, ..., w_{5k}) is the vector between the hidden units and the output unit; thus C = 5 in the present situation. The prior distribution for w is defined by
P(\mathbf{w} \mid \alpha, H) = \prod_{c=1}^{C} \frac{1}{Z_W(\alpha_c)} \exp\!\left(-\frac{\alpha_c}{2} \sum_{i} w_{ci}^2\right), \quad \alpha = (\alpha_1, \ldots, \alpha_C), \quad \alpha_c \in R        (2.20)
where Z_W(α_c) is the normalization constant, c = 1, 2, ..., C. This is the so-called "weight-decay" prior, which favors small values of w and decreases the tendency to overfit. The Bayes formula gives the posterior distribution of w as

P(\mathbf{w} \mid D, \alpha, H) = \frac{P(D \mid \mathbf{w}, H)\, P(\mathbf{w} \mid \alpha, H)}{P(D \mid \alpha, H)} = \frac{1}{Z_M(\alpha)} \exp\!\left(-\Bigl(-G(\mathbf{w}) + \sum_{c=1}^{C} \alpha_c E_{W_c}(\mathbf{w}_c)\Bigr)\right)        (2.21)
where

E_{W_c}(\mathbf{w}_c) = \frac{1}{2} \sum_{i} w_{ci}^2        (2.22)
The posterior distribution of the hyperparameter α is also given by the Bayes formula:

P(\alpha \mid D, H) = \frac{P(D \mid \alpha, H)\, P(\alpha \mid H)}{P(D \mid H)}        (2.23)
where we assume the prior for each component of α to be a Gamma distribution:

P(\alpha_c \mid H) = \frac{(\psi_{\alpha_c}/2\kappa_{\alpha_c})^{\psi_{\alpha_c}/2}}{\Gamma(\psi_{\alpha_c}/2)}\, \alpha_c^{\psi_{\alpha_c}/2 - 1} \exp\!\left(-\frac{\psi_{\alpha_c}\,\alpha_c}{2\kappa_{\alpha_c}}\right)        (2.24)

where \kappa_{\alpha_c}
(> 0) is a width parameter which stands for the mean of the Gamma distribution (2.24), \psi_{\alpha_c} (> 0) is a shape parameter, and Γ(·) denotes the gamma function:

\Gamma(a) = \int_{0}^{\infty} b^{a-1} \exp(-b)\, db, \quad a > 0        (2.25)
The posterior distributions of w and α are often very complicated, so that analytical expressions for the posterior distributions are generally impossible. There are at least two means of overcoming this difficulty. The first method attempts to compute an approximate gradient of log(P(α | D, H)) with respect to α and perform approximate optimization resulting in α_MP. Given this approximate value, one can perform optimization of P(w | D, α_MP, H) with respect to w, which results in w_MP. While there are cases where this scheme works, there are other situations where this method fails. The method we use in this paper utilizes Markov Chain Monte Carlo (MCMC) to draw samples from the posterior distributions of w and α and perform the approximate integral

P(\mathbf{x}_{test} \text{ is genuine}) = \iint f(\mathbf{x}_{test}; \mathbf{w})\, P(\mathbf{w} \mid D, \alpha)\, P(\alpha \mid D)\, d\mathbf{w}\, d\alpha \approx \frac{1}{S} \sum_{j=1}^{S} f(\mathbf{x}_{test}; \mathbf{w}^{(j)})        (2.26)
S stands for the number of samples. In order to draw samples from the posteriors of w and α, the scheme performs alternate iterations of two different operations:

(A) the hyperparameter α^{(j)} is updated via Gibbs sampling;        (2.27)

(B) the parameter w^{(j)} is updated via the Metropolis algorithm.        (2.28)

At the end of the Jth iteration, \{\mathbf{w}^{(j)}, \alpha^{(j)}\}_{j=1}^{J} are considered to be samples from the posterior distributions. The samples from the first K iterations, \{\mathbf{w}^{(j)}, \alpha^{(j)}\}_{j=1}^{K}, are discarded, and the samples from the last (J − K) iterations, \{\mathbf{w}^{(j)}, \alpha^{(j)}\}_{j=K+1}^{J}, are used for verification. We now rewrite \{\mathbf{w}^{(j)}, \alpha^{(j)}\}_{j=K+1}^{J} as \{\mathbf{w}^{(j)}, \alpha^{(j)}\}_{j=1}^{S}, where S = J − K. In the experiments given in Section 3, J was 1000 and S was 100.
Let x_test = (D1_test, D2_test, D3_test) be a test datum. With the samples \{(\mathbf{w}^{(j)}, \alpha^{(j)})\}_{j=1}^{S}, the perceptron produces ensemble outputs \{f(\mathbf{x}_{test}; \mathbf{w}^{(j)})\}_{j=1}^{S} between 0 and 1. The proposed algorithm predicts that the test signature is genuine if (2.29) is satisfied, while the test signature is predicted to be a forgery if (2.30) is satisfied.
0.5 \le \frac{1}{S} \sum_{j=1}^{S} f(\mathbf{x}_{test}; \mathbf{w}^{(j)}) \le 1.0        (2.29)

0 \le \frac{1}{S} \sum_{j=1}^{S} f(\mathbf{x}_{test}; \mathbf{w}^{(j)}) < 0.5        (2.30)
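A minimal Python sketch of the verification decision in Eqs. (2.26), (2.29) and (2.30): the MCMC weight samples are averaged through the perceptron and the mean output is thresholded at 0.5. The sigmoid/perceptron helper mirrors Eqs. (2.14)-(2.15); the sample container format is an assumption.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def perceptron(x, a, b, c):
    # Eq. (2.14): f(x; w) = sum_i a_i * sigmoid(sum_j b_ij * x_j + c_i)
    return float(np.dot(a, sigmoid(b @ x + c)))

def is_genuine(x_test, weight_samples):
    """weight_samples: list of (a, b, c) tuples drawn by MCMC after burn-in.
    Returns True if the ensemble average output is >= 0.5 (Eq. 2.29)."""
    outputs = [perceptron(x_test, a, b, c) for (a, b, c) in weight_samples]
    return float(np.mean(outputs)) >= 0.5
```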
3 Experiment

This section reports a preliminary experiment using the algorithm described above. Fourteen individuals participated in the experiment. The data were taken over a period of three months. There are 1852 genuine signatures and 3170 skilled forgery signatures. The signatures are divided into two groups as shown in Table 1.

Table 1. Database
Signatures for Learning        Signatures for Testing
Genuine    Forgery             Genuine    Forgery
418        1573                1431       1601
Table 2 shows verification error rates as a function of the number of hidden units. FAR represents False Acceptance Rate and FRR represents False Rejection Rate.

Table 2. Verification Error Rate
Hidden units    FRR %    FAR %
4               1.08     1.50
5               0.81     1.25
6               0.88     1.37
7               0.95     0.87
8               1.42     1.12
9               1.28     1.03
Since our previous scheme [1]-[3] contains a free parameter c to be chosen, a direct comparison is impossible; however, the minimum FAR of 0.87% and the minimum FRR of 0.81% of the proposed algorithm are very encouraging.
4 Discussion
The number of hidden units can also be estimated within the present framework, which we will pursue in the future.
References
[1] T. Ohishi, Y. Komiya and T. Matsumoto. On-line Signature Verification using Pen Position, Pen Pressure and Pen Inclination Trajectories. Proc. ICPR 2000, Vol. 4, pp. 547-550, September 2000.
[2] H. Morita, D. Sakamoto, T. Ohishi, Y. Komiya and T. Matsumoto. On-Line Signature Verifier Incorporating Pen Position, Pen Pressure and Pen Inclination Trajectories. Proc. 3rd AVBPA, Sweden, pp. 318-323, June 2001.
[3] M. Kondo, D. Sakamoto, M. Sasaki and T. Matsumoto. A New On-line Signature Verification Algorithm Incorporating Pen Velocity Trajectories. Proc. IEEE ISPACS 2002, Taiwan R.O.C., November 2002.
[4] D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1991.
Illumination Normalization Using Logarithm Transforms for Face Authentication Marios Savvides and B.V.K. Vijaya Kumar Dept. of Electrical and Computer Engineering, Carnegie Mellon University 5000 Forbes Ave, Pittsburgh, USA
[email protected] [email protected]
Abstract. In this paper we propose an algorithm that can easily be implemented on small form factor devices to perform illumination normalization in face images captured under various lighting conditions for face verification. We show that logarithm transformations on images suffering from significant illumination variation produce face images that are substantially improved for performing face authentication. We present illumination-normalized images from the CMU PIE database to demonstrate the improvement obtained using this non-linear preprocessing approach. We show that we get improved face verification performance using this scheme when training on frontally illuminated face images and testing on images captured under variable illumination.
1 Introduction
In some face verification applications, it can be argued that the system user will cooperate to provide a face image with a nominal pose and a neutral expression. It is, however, not reasonable to assume that the user will always have control over the surrounding lighting conditions during the authentication. Outdoor authentication systems are examples where lighting conditions will vary, and providing simple solutions for correcting lighting variations for face authentication is the focus of this paper. We have used advanced correlation filter methods to evaluate the verification improvement using the proposed illumination normalization method. Researchers have dealt with illumination variation in many ways [1][2]. In this paper, we propose and investigate a computationally simple pre-processing scheme for illumination normalization that can boost the performance of current face authentication algorithms. This enables both enrollment (training) and testing to be performed on devices with limited computational resources.
Fig. 1. Mapping of pixel intensities using the logarithm (base 10) transformation. Pixel intensity 0 represents black, and 255 is white. Thus, we observe that by this transformation dark pixels (possibly face image regions dominated by shadows) are greatly enhanced
2 Logarithm Transformation for Face Image Enhancement
Lighting variations resulting from the changing positions of various light sources cause portions of the face to be illuminated while other portions may be in complete darkness (common when illuminating faces during the night). Figure 1 shows the mapping of pixel intensities ranging from 0 to 255 (assuming 8-bit grayscale images), shown on the horizontal axis, to the logarithm-transformed values on the vertical axis. A face image that is partially shadowed will contain pixel intensities in the 10-70 range in the non-illuminated image region. Taking the log transform non-linearly maps the pixel intensities and, as a result, the shadowed regions are intensity-enhanced, producing a final image with more intelligibility in the low-illumination regions. To evaluate the effects of such transformations, we performed some experiments on the illumination subset of the CMU PIE database [3] that was captured under no ambient lighting conditions. This dataset exhibits the largest illumination variation, and many images contain substantial shadows. Fig. 2 shows three images selected from Person 2 in the PIE database (image 1 on the top row, image 3 on the middle row, and image 19 on the bottom row). The left column shows the original images and the right column shows the resulting enhanced log-transformed images. We can see that the transformation has increased the intelligibility of the darkened regions. In image 1, the left portion of Person 2's face is originally completely in the dark; however, the post-processed image shows substantial detail around the eye and mouth that was not previously visible. Image 3 has also been enhanced significantly, bringing many features into visibility. The bottom row shows image 19, which is captured with frontal illumination. Observe the specularities
visible in the original image, which are attenuated in the log-transformed image.
Fig. 2. Person 2 from the illumination subset of the CMU PIE database captured with no background lighting. (Top row) Image 1. (Middle row) Image 3. (Bottom row) Image 19. (Left) Original images; (right) corresponding log-transformed images
Clearly, this type of enhancement improves image quality whether the degradation is due to poor lighting, the presence of shadows, or even artifacts due to specularities.
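A minimal sketch of the log-based illumination normalization described above, in Python with NumPy; the use of log(1 + I) and the rescaling back to the 8-bit range are common implementation choices assumed here, not specified by the paper.

```python
import numpy as np

def log_normalize(image):
    """Non-linearly remap an 8-bit grayscale face image to enhance dark (shadowed) regions.

    image: 2-D uint8 array with values in [0, 255].
    Returns a uint8 array of the same shape.
    """
    img = image.astype(np.float64)
    out = np.log1p(img)                     # log(1 + I) avoids log(0) at black pixels
    out = out / np.log1p(255.0) * 255.0     # rescale so 255 maps back to 255 (assumed convention)
    return out.astype(np.uint8)
```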
3 Face Authentication
We used the minimum average correlation energy (MACE) filter [4][5][6] for performing face authentication. We will briefly describe these filters in the next section. In these experiments we used face images of 65 people from the illumination subset of the CMU PIE database, with each person having 21 face images captured under varying illumination conditions, as shown in Fig. 3. These images were cropped and scaled to 100 × 100 pixels using ground-truth feature points (eyes and nose) retrieved from the CMU PIE database. In this experiment, we select 3 training images with frontal lighting to simulate probable enrollment conditions. Testing, however, is performed on the whole database, to see the verification performance. The three images selected for training are numbers 7, 10, and 19.
Fig. 3. Sample images (un-enhanced) of Person 2 from the Illumination subset of PIE database captured with no background lighting
We used the MACE-type filter design to synthesize a single filter for each person using just the 3 training images (nos. 7, 10, 19) from that person. We then performed cross-correlation with the whole database to examine the correlation outputs (and compute a metric called the peak-to-sidelobe ratio, which we define later) resulting from the 21 images of the authentic person and the 64 × 21 impostor images. We repeated this experiment using log-transformed images from the database for both training and testing.
Fig. 4. A schematic of correlation filter-based face verification [7][8]. A single correlation filter is synthesized from a few training images and stored directly in the frequency domain. FFTs are used to perform cross-correlation of a test image with the stored template, and the correlation output is examined for sharp peaks
4 Minimum Average Correlation Energy (MACE) Filters
Minimum Average Correlation Energy (MACE) [4] filters are synthesized in closed form by optimizing a criterion that seeks to minimize the average correlation energy resulting from cross-correlations with the given training images while satisfying linear constraints to provide a specific value at the origin of the correlation plane for each training image. In doing so, the resulting correlation outputs from the training images exhibit sharp peaks at the origin with values close to zero elsewhere. The location of the detected peak also provides the location of the recognized object in the scene. The MACE filter is synthesized directly in the frequency domain using the following closed form equation:
\mathbf{h} = \mathbf{D}^{-1}\mathbf{X}\,(\mathbf{X}^{+}\mathbf{D}^{-1}\mathbf{X})^{-1}\mathbf{u}        (1)
Assuming that we have N training images, X in Eq. (1) is an L × N matrix, where L is the total number of pixels in a single training image (L = d1 × d2). Matrix X contains along its columns lexicographically re-ordered versions of the 2-D Fourier transforms of the N training images. D is a diagonal matrix of dimension L × L containing along its diagonal the average power spectrum of the training images, and u is a row vector with N elements containing the corresponding desired values at the origin of the correlation plane for the training images. Typically, these constraint values are set to 1 for all training images from the authentic class. There is another variant of the MACE filter called the unconstrained minimum average correlation energy (UMACE) filter [9], which also tries to minimize the average correlation energy but, instead of constraining the correlation outputs at the origin to a specific value, only tries to maximize the peak value at the origin. This optimization results in the following filter equation:
\mathbf{h} = \mathbf{D}^{-1}\mathbf{m}        (2)
where m is a row vector containing the average of the Fourier transforms of the training images. UMACE filters are computationally attractive as they do not require a matrix inverse, unlike the original MACE filters.
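The following Python/NumPy sketch shows how Eqs. (1) and (2) can be evaluated to synthesize MACE and UMACE filters in the frequency domain. The per-frequency diagonal of D is stored as a vector, and the conjugate transpose is used for X^+; array shapes and function names are assumptions for illustration.

```python
import numpy as np

def synthesize_filters(train_images, u=None):
    """train_images: list of N same-size 2-D arrays (one person's training faces).
    Returns (h_mace, h_umace) as 2-D frequency-domain filters."""
    N = len(train_images)
    shape = train_images[0].shape
    # Columns of X are the lexicographically reordered 2-D FFTs of the training images.
    X = np.stack([np.fft.fft2(img).ravel() for img in train_images], axis=1)   # L x N
    d = np.mean(np.abs(X) ** 2, axis=1)          # diagonal of D: average power spectrum (length L)
    if u is None:
        u = np.ones(N)                           # desired correlation values at the origin
    Dinv_X = X / d[:, None]                      # D^{-1} X without forming the L x L matrix
    # MACE, Eq. (1): h = D^{-1} X (X^+ D^{-1} X)^{-1} u
    A = X.conj().T @ Dinv_X                      # N x N matrix X^+ D^{-1} X
    h_mace = Dinv_X @ np.linalg.solve(A, u)
    # UMACE, Eq. (2): h = D^{-1} m, with m the average of the training-image Fourier transforms
    m = X.mean(axis=1)
    h_umace = m / d
    return h_mace.reshape(shape), h_umace.reshape(shape)
```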
The peak-to-sidelobe ratio (PSR) is the metric used to determine whether a test image belongs to the authentic class or not. First, the test image is cross-correlated with the synthesized MACE filter and the resulting correlation output is searched for the peak value. A rectangular region (we use 20 × 20 pixels) centered at the peak is extracted and used to compute the PSR as follows: a 5 × 5 rectangular region centered at the peak is masked out, and the remaining annular region, defined as the sidelobe region, is used to compute the mean and standard deviation of the sidelobes. The peak-to-sidelobe ratio is then defined as
PSR = \frac{peak - mean}{\sigma}        (3)
The PSR measures the peak sharpness in a correlation output, which is exactly what MACE-type filters try to maximize; therefore, the larger the PSR, the more likely it is that the test image belongs to the authentic class. In most applications of correlation filters, only the value of the correlation peak is examined for classification. It is therefore important to note that the authentication decision in our approach (using the PSR metric) is not based on a single projection or inner product but on many projections which should produce a specific response in order for the input to be declared as belonging to the authentic class: the peak value should be large, and the neighboring correlation values (which correspond to inner products of the MACE point spread function with shifted versions of the test image) should be close to zero. This is true for both types of MACE filters described above. More importantly, the PSR metric is also invariant to any uniform change in illumination, which is important in this application. Multiplying the input image by a constant scaling factor k results in a peak, mean and σ that are also scaled by k, which cancels out in the calculation of the PSR in Eq. (3).
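A minimal Python sketch of the PSR computation of Eq. (3) from a correlation plane: locate the peak, mask a 5x5 region around it inside a 20x20 window, and use the remaining annular sidelobe region for the mean and standard deviation. Window sizes follow the text; the frequency-domain correlation convention and edge handling are assumptions.

```python
import numpy as np

def correlate(test_image, h_freq):
    # Frequency-domain cross-correlation of the test image with the stored filter
    return np.real(np.fft.ifft2(np.fft.fft2(test_image) * np.conj(h_freq)))

def psr(corr_plane, window=20, mask=5):
    """Peak-to-sidelobe ratio, Eq. (3): (peak - sidelobe mean) / sidelobe std."""
    r, c = np.unravel_index(np.argmax(corr_plane), corr_plane.shape)
    peak = corr_plane[r, c]
    half, m = window // 2, mask // 2
    r0, c0 = max(0, r - half), max(0, c - half)
    region = corr_plane[r0:r + half, c0:c + half]
    keep = np.ones(region.shape, dtype=bool)
    # Exclude the central mask x mask area; what remains is the sidelobe region.
    keep[max(0, r - r0 - m):r - r0 + m + 1, max(0, c - c0 - m):c - c0 + m + 1] = False
    sidelobes = region[keep]
    return (peak - sidelobes.mean()) / sidelobes.std()
```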
5 Results
We show example PSR plots in Fig. 5 that demonstrate the effect of using log-transformed images with MACE-type filters. Authentic images that exhibit large variation from the training images due to illumination changes perform poorly in comparison, yielding low PSR values (solid plots at the top), below 13 in many cases. Using log-transformed images increases the margin of separation between impostor PSRs and authentic PSRs, providing an increased verification accuracy, as summarized in Table 1.

Table 1. Average verification accuracy at zero False Acceptance Rate (FAR = 0) across all 65 people
Images used        MACE       UMACE
Original           93.50 %    93.59 %
Log Transformed    97.18 %    97.09 %
Fig. 5. PSR plots for Persons 12 and 45 comparing the performance of using log-transformed face images with MACE filters, shown as the dashed line. The solid lines are the PSRs using the raw database images. The top plots belong to the authentic person and the bottom plots are the maximum impostor PSRs. For images yielding low PSRs for the authentic person

|I_t(x, y) - m(x, y)| > \sigma(x, y)        (2)

or

|I_t(x, y) - I_{t-1}(x, y)| > P(x, y)        (3)
where I_t(x, y) and I_{t-1}(x, y) are the intensity values for the pixel (x, y) at times t and t−1, m(x, y) and σ(x, y) are the mean and variance, respectively, of the intensity values observed during the supervised training period, and P(x, y) is the maximum difference between intensity values that are consecutive in time, observed during the whole training period:
P(x, y) = \max_{t \in T} |I_t(x, y) - I_{t-1}(x, y)|, \quad T = \text{frames of the training set}        (4)
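A minimal Python sketch of the per-pixel foreground test in Eqs. (2)-(4): a pixel is marked as moving when it deviates from the trained background statistics or changes too quickly between consecutive frames. The array layout and the use of element-wise NumPy operations are implementation assumptions.

```python
import numpy as np

def train_background(frames):
    """frames: array (T, H, W) of grayscale training frames with no moving objects.
    Returns the per-pixel mean m, deviation sigma, and maximum frame-to-frame change P."""
    m = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    P = np.abs(np.diff(frames, axis=0)).max(axis=0)     # Eq. (4)
    return m, sigma, P

def foreground_mask(frame, prev_frame, m, sigma, P):
    """Apply the detection rule of Eqs. (2)-(3) to the current frame."""
    cond_model = np.abs(frame - m) > sigma              # Eq. (2)
    cond_temporal = np.abs(frame - prev_frame) > P      # Eq. (3)
    return cond_model | cond_temporal
```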
After this step, in the resulting binary image there are many small clusters of pixels that must be removed: a one-step filter removes blobs whose size is lower than a certain threshold. Finally, we obtain an image with only foreground objects, each of which has been extracted with its own shadow. Because the presence of shadows changes the shape of objects in an unpredictable way, causing serious trouble in the following object recognition step, the algorithm proposed in [10] has been used for removing shadows. It starts from the assumption that a shadow is an abnormal illumination of a part of an image due to the interposition of an opaque object with respect to a bright point-like illumination source. From this assumption, we can note that shadows move with their own objects but also that they do not have a fixed texture, as real objects do: they are half-transparent regions which retain the representation of the underlying background surface pattern. Therefore, our aim is to examine the parts of the image that have been detected as moving regions by the previous segmentation
step but whose structure is substantially unchanged with respect to the corresponding background. To do this, first a segmentation procedure is applied to recover large regions characterized by a constant photometric gain; then, for each segment previously detected, the correlation between pixels is calculated and compared with the same value calculated in the background image: segments whose correlation is not substantially changed are marked as shadow regions and removed. Thus, the final image contains only moving objects without shadows: these are the input for the object recognizer described in [11].
3 Background Updating
Any background subtraction approach is sensitive to variations in illumination; to solve this problem, the background model must be updated. Traditional updating algorithms have a serious problem: they update only the pixels that have been labeled as 'background' in the last frames. If a moving object is present in a region, the corresponding background pixels are left unchanged. In particular, in the presence of slowly moving objects, such as a person staying in a certain region for a certain period of time, this can invalidate the results. In addition, erroneously labeled foreground and background points can determine a wrong update of the background model. The proposed approach allows all the pixels of the background to be updated, even if they correspond to points that at time t are masked by foreground objects: every background pixel can be updated even if it is currently invisible. In the literature, many approaches update the background at each frame; in our case, the implemented surveillance system works at a frame rate of 30 Hz, so it is not necessary to update the background at each frame, because relevant variations in a very short time are not probable. The proposed approach, in accordance with the background model implemented, consists of a stack update: the number of frames used for calculating the new values of mean and variance is fixed (e.g., 80, 100 or 150 frames). During this period, new values of the statistical parameters are calculated at the static points of the image. The updating rules (5)-(7) allow a parameter α (related to the system frame rate), taking values in the range (0, ..., 1), to control the updating process by weighting the two terms differently:
m_{n+1}(x, y) = \alpha \cdot m_n(x, y) + (1 - \alpha) \cdot m_{n-1}(x, y)        (5)

\sigma_{n+1}(x, y) = \alpha \cdot \sigma_n(x, y) + (1 - \alpha) \cdot \sigma_{n-1}(x, y)        (6)

P_{n+1}(x, y) = \alpha \cdot P_n(x, y) + (1 - \alpha) \cdot P_{n-1}(x, y)        (7)
where n+1 indicates the new updated values, n indicates the values calculated during the last observation period, and n−1 indicates the old values of the parameters. The updating procedure described above must be applied only to pixels that have been classified as static for most of the observation period (i.e., 80%). Because the continuous variations of the lighting conditions in an outdoor environment determine uniform variations in all the image intensity values, it is possible to update all pixels in the image. The idea is that pixels with the same mean intensity value will
assume the same value after updating. So, it is possible to update pixels covered by foreground objects by simply calculating the average update of the pixels with the same intensity value. Therefore, the updating value for each pixel of the background model covered by a foreground object is estimated by averaging all the different values m_n(x, y) exhibited by all the pixels {(x, y)} with the same intensity value m_{n-1}(x, y) = b_i:
\mu(b_i) = \frac{1}{N(b_i)} \sum_{\{(x, y) \in I_t \,\mid\, m_{n-1}(x, y) = b_i\}} m_n(x, y)        (8)
where {bi}i=1,...,n are the n different intensity values that each pixel can assume, and N(bi) is the number of pixels in the background model with intensity mean value bi. So, the update rule for the mean will be:
m_{n+1}(x, y) = \alpha \cdot \mu(m_{n-1}(x, y)) + (1 - \alpha) \cdot m_{n-1}(x, y)        (9)
In an analogous way, σ and P can be updated. Moreover, in order to avoid the accumulation of error over time, the background model is periodically re-initialized in all the regions labeled as static for a long time interval. In this case, α can be set to very low values, allowing the system to re-initialize itself.
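A minimal Python sketch of the stack update of Eqs. (5)-(9): visible (static) pixels are blended directly, while pixels covered by foreground objects receive the average update μ(b_i) of the visible pixels that share their old mean intensity. Binning intensities to 8-bit integers is an implementation assumption.

```python
import numpy as np

def update_mean(m_old, m_new, covered, alpha=0.5):
    """m_old: previous background mean m_{n-1}; m_new: mean computed over the latest stack
    of frames (valid where covered is False); covered: boolean mask of pixels hidden by
    foreground objects during the stack. Returns the updated mean m_{n+1}."""
    m_next = alpha * m_new + (1.0 - alpha) * m_old            # Eq. (5) for visible pixels

    # Eq. (8): average new value mu(b_i) of the visible pixels for each old intensity bin b_i
    bins = np.round(m_old).astype(int)
    mu = np.zeros(256)                                        # bins with no visible pixel keep 0 here
    for b in np.unique(bins[~covered]):
        mu[b] = m_new[(bins == b) & ~covered].mean()

    # Eq. (9): covered pixels are updated through mu of their own old intensity value
    est = mu[bins]
    m_next[covered] = alpha * est[covered] + (1.0 - alpha) * m_old[covered]
    return m_next
```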
4 Experimental Results
The experiments have been performed on real image sequences acquired with a static Dalsa CA-D6 TV camera with 528 × 512 pixels; the selected frame rate is 30 Hz. The processing is performed on a Pentium IV at 1.5 GHz with 128 MB of RAM. The characteristics of each test sequence are summarized in Table 1. The sequences were acquired in a real archeological site while people simulated the movements normally performed by intruders.

Table 1. The test sequences
Sequence number    Frames    Training frames    Updating-stack frames
1                  894       200                150
2                  387       150                100
3                  1058      200                150
4                  551       150                100
The results obtained by applying the proposed motion detection algorithm are very encouraging. In the following images, the obtained results are compared with those provided by two very common motion detection approaches. In particular, the first column of Fig. 1 shows two images of a sequence; the second column illustrates the results obtained by applying a statistical background model like the one proposed in [8]: it is evident that the presence of small moving trees heavily affects the correct extraction of shapes, a problem clearly mentioned in that paper. The third column shows the results obtained by implementing the approach described in [6]: it can be seen that
the moving trees are not detected by the system, but in regions where the background model is obsolete due to the presence of a slowly moving person, the result is affected by large noise. Finally, the last column depicts the results obtained using the proposed approach: moving trees are not detected and the quality of the resulting shapes is higher than in the previous cases, even in regions where people stayed for quite a long period of time.
Fig. 1. Comparison of the results obtained by applying, to a pair of images (1st column), a traditional statistical background modelling (2nd column), a supervised approach using a traditional background update (3rd column), and the proposed algorithm (4th column)
5
Conclusions and Future Work
This work deals with the problem of outdoor motion detection in the context of video surveillance. A supervised approach to background subtraction has been implemented to reduce the number of false alarms caused by small movements of background objects. A new updating algorithm allows the update of all the background pixels, even those covered by foreground objects. The experiments show that the proposed approach works better than other similar techniques proposed in the literature, in particular in the presence of slowly moving objects. Future work will investigate the improvement of the training algorithm using an unsupervised approach. This is a very important requirement for a real implementation of a visual surveillance system, which needs to be very reliable. Another important improvement of the system is expected from a more correct detection of background object movements even when they exhibit intensity changes greater than those observed during the training period. All these goals must be achieved while reducing the dependency of the algorithm on the conditions that occurred during the training period.
References
[1] S. Fejes, L.S. Davis, Detection of independent motion using directional motion estimation, Technical Report CAR-TR-866, CS-TR-3815, Univ. of Maryland, Aug. 1997.
[2] S. Fejes, L.S. Davis, What can projections of flow fields tell us about the visual motion, in Proc. Intern. Conf. on Computer Vision (ICCV'98), 1998, pp. 979-986.
[3] S. Fejes, L.S. Davis, Exploring visual motion using projections of flow fields, in Proc. of the DARPA Image Understanding Workshop, pp. 113-122, New Orleans, LA, 1997.
[4] L. Wixson, M. Hansen, Detecting salient motion by accumulating directionally-consistent flow, in Proc. of Intern. Conf. on Computer Vision, 1999, vol. II, pp. 797-804.
[5] C. Anderson, P. Burt, G. Van Der Wal, Change detection and tracking using pyramid transformation techniques, in Proc. of SPIE - Intelligent Robots and Computer Vision, Vol. 579, pp. 72-78, 1985.
[6] I. Haritaoglu, D. Harwood, L. Davis, A Fast Background Scene Modeling and Maintenance for Outdoor Surveillance, ICPR, pp. 179-183, Barcelona, 2000.
[7] C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: Real-time tracking of the human body, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7): pp. 780-785, 1997.
[8] T. Kanade, T. Collins, A. Lipton, Advances in Cooperative Multi-Sensor Video Surveillance, DARPA Image Understanding Workshop, Morgan Kaufmann, Nov. 1998, pp. 3-24.
[9] H. Fujiyoshi, A. J. Lipton, Real-time human motion analysis by image skeletonisation, IEEE WACV, Princeton, NJ, October 1998, pp. 15-21.
[10] P. Spagnolo, A. Branca, G. Attolico, A. Distante, Fast Background Modeling and Shadow Removing for Outdoor Surveillance, IASTED VIIP, 2002, pp. 668-671.
[11] M. Leo, G. Attolico, A. Branca, A. Distante, Object classification with multiresolution wavelet decomposition, in Proc. of SPIE Aerosense 2002, Conference on Wavelet Applications, 1-5 April 2002, Orlando, Florida, USA.
Human Recognition on Combining Kinematic and Stationary Features Bir Bhanu and Ju Han Center for Research in Intelligent Systems, University of California Riverside, CA 92521, USA {bhanu,jhan}@cris.ucr.edu
Abstract. Both the human motion characteristics and body part measurement are important cues for human recognition at a distance. The former can be viewed as kinematic measurement while the latter is stationary measurement. In this paper, we propose a kinematic-based approach to extract both kinematic and stationary features for human recognition. The proposed approach first estimates 3D human walking parameters by fitting the 3D kinematic model to the 2D silhouette extracted from a monocular image sequence. Kinematic and stationary features are then extracted from the kinematic and stationary parameters, respectively, and used for human recognition separately. Next, we discuss different strategies for combining kinematic and stationary features to make a decision. Experimental results show a comparison of these combination strategies and demonstrate the improvement in performance for human recognition.
1
Introduction
In many applications of personnel identification, established biometrics, such as fingerprints, face or iris, may be obscured. Gait, which concerns recognizing individuals by the way they walk, can be used as a biometric to recognize people in these situations. However, most existing gait recognition approaches [1, 2, 3, 4] only consider human walking frontoparallel to the image plane. In this paper, we propose a kinematic-based approach to recognize humans by gait which relaxes this condition. The proposed approach estimates 3D human walking parameters by fitting the 3D kinematic model to the 2D silhouette extracted from a monocular image sequence. Since both the human motion characteristics and body part measurements are important cues for human recognition at a distance, kinematic and stationary features are extracted from the estimated parameters and used for human recognition separately. Moreover, we combine the classifiers based on stationary and kinematic features to increase the accuracy of human recognition. Experimental results show a comparison of different combination strategies and demonstrate the improvement in performance for human recognition.
Fig. 1. Diagram of the proposed approach for human gait analysis
2
Technical Approach
In our approach, we first build a 3D human kinematic model for regular human walking. The model parameters are then estimated by fitting the 3D human kinematic model to the extracted 2D human silhouette. Finally, stationary and kinematic features are extracted from these parameters for human recognition. The realization of our proposed approach is shown in Figure 1.

2.1 Human Kinematic Model
A human body is considered as an articulated object, consisting of a number of body parts. The body model adopted here is shown in Figure 2(a), where a circle represents a joint and a rectangle represents a body part (N: neck, S: shoulder, E: elbow, W: waist, H: hip, K: knee, and A: ankle). Most joints and body part ends can be represented as spheres, and most body parts can be represented as cones. The whole human kinematic model is represented as a set of cones connected by spheres [5]. Figure 2(b) shows that body parts can be approximated well in this manner; however, the head is approximated only crudely by a sphere and the torso is approximated by a cylinder with two spheroid ends.

Fig. 2. (a) 3D Human Kinematic Model; (b) Body part geometric representation

Matching between 3D Model and 2D Silhouette: The matching procedure determines a parameter vector X so that the proposed 3D model fits the given 2D silhouette as well as possible. Each 3D human body part is modeled by a cone with two spheres s_i and s_j at its ends, as shown in Figure 2(b) [5]. Each sphere s_i is fully defined by 4 scalar values, (x_i, y_i, z_i, r_i), which define its location and size. Given these values for the two spheroid ends (x_i, y_i, z_i, r_i) and (x_j, y_j, z_j, r_j) of a 3D human body part model, its projection P(ij) onto the image plane is the convex hull of the two circles defined by (x_i, y_i, r_i) and (x_j, y_j, r_j). If the 2D human silhouette is known, we may find the relative 3D body part locations and orientations with prior knowledge of the camera parameters. We propose a method to perform a least squares fit of the 3D human model to the 2D human silhouette, that is, to estimate the set of sphere parameters X = {X_i : (x_i, y_i, z_i, r_i)} by choosing X to minimize

error(X; I) = Σ_{(x′,y′) ∈ I} (P_X(x′, y′) − I(x′, y′))²,    (1)
where I is the silhouette binary image, P_X is the binary projection of the 3D human model onto the image plane, and x′, y′ are image plane coordinates.

Model Parameter Selection: Human motion is very complex because of its many degrees of freedom (DOFs). To simplify the parameter estimation procedure, we use the following reasonable assumptions: (1) the camera is stationary; (2) people are walking in front of the camera at a distance; (3) people are moving in a constant direction; (4) the swing direction of the arms and legs is parallel to the moving direction. According to these assumptions, we do not need to consider the waist joint, and only need to consider one DOF for each remaining joint. Therefore, the elements of the parameter vector X of the 3D human kinematic model are defined as: (a) Radius r_i (11): torso (3), shoulder, elbow, hand, hip, knee, ankle, toe, and head; Length l_i (9): torso, inter-shoulder, inter-hip, upper arm, forearm, thigh, calf, foot, and neck; (b) Location (x, y) (2); Angle θ_i (11): neck, left upper arm, left forearm, right upper arm, right forearm, left thigh, left calf, left foot, right thigh, right calf, and right foot. With these 33 stationary and kinematic parameters, the projection of the human model can be completely determined.

Fig. 3. Human silhouette width variation in a video sequence (circles represent frames selected as key frames for stationary parameter estimation)

2.2 Model Parameter Estimation
Assuming that people are the only moving objects in the scene, their silhouette can be extracted by a simple background subtraction method [7]. After the silhouette has been cleaned by a pre-processing procedure, its height, width, and centroid are easily extracted for motion analysis. The human moving direction is estimated through the silhouette width variation in the video sequence [7]. Stationary Parameter Estimation: The stationary parameters include body part lengths and joint radii. Human walking is a cyclic motion, so a video sequence can be divided into motion cycles and studied separately. The walking cycle can be detected by exploiting the silhouette width variation in a sequence, as shown in Figure 3. In each walking cycle, the silhouette with minimum width corresponds to the most occlusion; the silhouette with maximum width corresponds to the least occlusion and is more reliable for model parameter estimation. To estimate the stationary parameters, we first select 4 key frames (see Figure 3) from one walking cycle, and then perform the matching procedure on these frames as a whole, because the human silhouette from a single frame might not be reliable due to noise. The corresponding feature vector thus includes 20 common stationary parameters and 13 × 4 individual kinematic parameters. Then the set of parameters is estimated from these initial parameters by choosing a parameter vector X to minimize the least squares error in equation (1) under the same kinematic constraints. The parameters are initialized according to human statistical information. After the matching algorithm has converged, the estimated stationary parameters are obtained.
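The key-frame selection just described can be prototyped in a few lines. The sketch below is a simplified illustration (the smoothing window and the peak-detection rule are assumptions, not taken from the paper) that picks, within a sequence, frames of maximum silhouette width.

```python
import numpy as np

def select_key_frames(widths, num_key_frames=4, smooth=3):
    """Pick key frames where the silhouette width is locally maximal.

    widths : 1-D array of silhouette widths, one value per frame
    """
    w = np.convolve(widths, np.ones(smooth) / smooth, mode="same")  # light smoothing
    # local maxima: at least as wide as both neighbours
    peaks = [i for i in range(1, len(w) - 1) if w[i] >= w[i - 1] and w[i] >= w[i + 1]]
    # keep the most prominent maxima (least occluded silhouettes)
    peaks.sort(key=lambda i: w[i], reverse=True)
    return sorted(peaks[:num_key_frames])
```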
Kinematic Parameter Estimation: To reduce the search space and make our matching algorithm converge faster, we use a linear prediction of the parameters from the previous frames as the initialization for the current frame. After the matching algorithm has converged, the estimated kinematic parameters are obtained for each frame.

2.3 Kinematic and Stationary Feature Classifiers
In our approach, kinematic features are the mean and standard deviation values extracted from the kinematic parameters of each frame in the whole image sequence containing one human walking cycle. Assuming that human walking is symmetric, that is, the motion of the left body parts is the same as or similar to that of the right body parts, the kinematic feature vector x_k selected for human recognition includes 10 elements: the mean and standard deviation of the angles of the neck, upper arm, forearm, thigh, and leg. Stationary features are directly selected from the estimated stationary parameters of each sequence containing human walking. Among the model stationary parameters, the joint radii depend on human clothing, and the inter-shoulder and inter-hip lengths can hardly be estimated due to the camera view (human walking within a small angle of the frontoparallel direction). Assuming the body part lengths are symmetric for the left and right body parts, the stationary feature vector x_s selected for human recognition includes 7 elements: neck length, torso length, upper arm length, forearm length, thigh length, calf length, and foot length. After the kinematic and stationary features are extracted, they are used to classify different people separately. For simplicity, we assume that the feature vector x (x may be x_s or x_k) for a person ω_i is normally distributed in the feature space, and that each of the independent features has a Gaussian distribution with the same standard deviation. Under this assumption, a minimum distance classifier is established: x is assigned to the class whose mean vector has the smallest Euclidean distance to x.
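A minimal sketch of such a minimum distance (nearest class mean) classifier is given below; the function names and the way class means are stored are assumptions for illustration only.

```python
import numpy as np

def fit_class_means(features, labels):
    """Compute one mean feature vector per class (person)."""
    classes = sorted(set(labels))
    return {c: np.mean([f for f, l in zip(features, labels) if l == c], axis=0)
            for c in classes}

def classify_min_distance(x, class_means):
    """Assign x to the class whose mean has the smallest Euclidean distance."""
    return min(class_means, key=lambda c: np.linalg.norm(x - class_means[c]))
```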
2.4 Classifier Combination Strategies
To increase the efficiency and accuracy of human recognition, we need to combine the two classifiers in some way. Kittler et al. [8] demonstrate that the commonly used classifier combination schemes can be derived from a uniform Bayesian framework under different assumptions and using different approximations. We use these derived strategies to combine the two classifiers in our experiments. In our human recognition problem with M people in the database, two classifiers with feature vectors x_s and x_k, respectively, are combined to make a decision on assigning each sample to one of the M people (ω_1, ..., ω_M). The feature space distribution of each class ω_i is modeled by the probability density functions p(x_s|ω_i) and p(x_k|ω_i), and its a priori probability of occurrence is P(ω_i). Under the assumption of equal priors, the classifier combination strategies are described as follows:
– Product rule: assign {x_s, x_k} to ω_i if p(x_s|ω_i) p(x_k|ω_i) = max_{k=1}^{M} p(x_s|ω_k) p(x_k|ω_k)
– Sum rule: assign {x_s, x_k} to ω_i if p(x_s|ω_i) + p(x_k|ω_i) = max_{k=1}^{M} (p(x_s|ω_k) + p(x_k|ω_k))
– Max rule: assign {x_s, x_k} to ω_i if max{p(x_s|ω_i), p(x_k|ω_i)} = max_{k=1}^{M} max{p(x_s|ω_k), p(x_k|ω_k)}
– Min rule: assign {x_s, x_k} to ω_i if min{p(x_s|ω_i), p(x_k|ω_i)} = max_{k=1}^{M} min{p(x_s|ω_k), p(x_k|ω_k)}

In our application, the estimate of the a posteriori probability is computed as

P(ω_i | x) = exp{−||x − µ_i||²} / Σ_{k=1}^{M} exp{−||x − µ_k||²},    (2)

where x is the input of the classifier and µ_i is the ith class center.

Fig. 4. Sample human walking sequences in our database
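The four combination rules and the posterior estimate of equation (2) can be sketched as follows; treating the exponentiated negative squared distances as both the class likelihoods and the posteriors is a simplification made for this illustration.

```python
import numpy as np

def posteriors(x, class_means):
    """Estimate P(omega_i | x) as in equation (2), from distances to class centers."""
    d2 = np.array([np.sum((x - mu) ** 2) for mu in class_means])
    e = np.exp(-d2)
    return e / e.sum()

def combine(p_stationary, p_kinematic, rule="product"):
    """Combine two sets of per-class scores and return the winning class index."""
    rules = {
        "product": p_stationary * p_kinematic,
        "sum": p_stationary + p_kinematic,
        "max": np.maximum(p_stationary, p_kinematic),
        "min": np.minimum(p_stationary, p_kinematic),
    }
    return int(np.argmax(rules[rule]))
```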
3
Experimental Results
The video data used in our experiment are real human walking data recorded in an outdoor environment. Eight different people walk within [−45°, 45°] with respect to the frontoparallel direction. We manually divide the video data into single-cycle sequences with an average of 16 frames. In each sequence, only one person walks, along a constant direction. There are a total of 110 single-cycle sequences in our database, and the number of sequences per person ranges from 11 to 16. The image size is 180 × 240. Figure 4 shows some sample sequences in our database. We use a genetic algorithm for model parameter estimation. Each of the extracted kinematic and stationary features is normalized as (x − µ)/σ, where x is the specific feature value, and µ and σ are the mean and standard deviation of that feature over the entire database. Recognition results in our experiments are obtained using the leave-one-out method. Performance of Stationary Feature Classifier: The recognition rate with all 7 stationary features is 62%. Table 1 shows the human recognition performance using different numbers of stationary features. From this table, we can see
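The evaluation protocol (per-feature z-score normalization over the whole database followed by leave-one-out nearest-class-mean classification) might look like the sketch below; the data layout, with one feature vector and one integer person label per single-cycle sequence, is an assumption made for the example, and the decision rule is the same nearest-mean rule sketched earlier.

```python
import numpy as np

def leave_one_out_accuracy(features, labels):
    """Leave-one-out recognition rate of a nearest-class-mean classifier
    after z-score normalization of every feature over the whole database."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # (x - mu) / sigma per feature

    correct = 0
    for i in range(len(X)):
        train = np.ones(len(X), dtype=bool)
        train[i] = False                               # hold out sequence i
        means = {c: X[train & (y == c)].mean(axis=0) for c in np.unique(y[train])}
        pred = min(means, key=lambda c: np.linalg.norm(X[i] - means[c]))
        correct += int(pred == y[i])
    return correct / len(X)
```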
Table 1. Comparison of performance using different numbers of stationary features

Feature Size   Stationary Features                                   Recognition Rate
1              neck                                                  31%
2              neck, torso                                           32%
3              neck, torso, upper arm                                45%
4              neck, torso, upper arm, forearm                       50%
5              neck, torso, upper arm, forearm, thigh                55%
6              neck, torso, upper arm, forearm, thigh, calf          59%
7              neck, torso, upper arm, forearm, thigh, calf, foot    62%
Table 2. Comparison of performance using mean and standard deviation features computed from each body part angle variation sequence over a single-cycle sequence

Feature Size   Kinematic Features             Recognition Rate
5              Mean                           50%
5              Standard Deviation             49%
10             Mean and Standard Deviation    72%
that the recognition rate increases as the number of features increases. Therefore, each of these features makes its own contribution to the overall recognition performance using stationary features. On the other hand, the contribution varies among different features. For example, adding torso length to the feature vector with neck length yields a 1% improvement, while adding upper arm length to the feature vector with torso and neck length yields a 13% improvement. As a result, better recognition performance might be achieved by using a weighted Euclidean distance instead of the regular Euclidean distance. This requires a training procedure. However, due to the high feature space dimension (7) and the small number of classes (8) in the database, overfitting becomes a big problem in this situation, i.e., training achieves a high recognition rate on training data and a low recognition rate on testing data. Therefore, we do not carry out weight training in this paper. We expect such a procedure to be carried out when a large database with a large number of classes (people) becomes available. Performance of Kinematic Feature Classifier: The recognition rate with all 10 kinematic features is 72%. Table 2 shows that the mean and standard deviation features computed from the body part angle variation over a single-cycle sequence achieve similar recognition rates, 50% and 49%, respectively. Table 3 shows the human recognition performance using different numbers of kinematic features. As with the stationary features, the recognition
Table 3. Comparison of performance using different numbers of kinematic features

Feature Size   Kinematic Features                      Recognition Rate
2              neck                                    34%
4              neck, upper arm                         51%
6              neck, upper arm, forearm                57%
8              neck, upper arm, forearm, thigh         63%
10             neck, upper arm, forearm, thigh, leg    72%
Table 4. Comparison of performance using different combination strategies

Combination Rule   Recognition Rate
Product Rule       83%
Sum Rule           80%
Max Rule           73%
Min Rule           75%
rate increases as the number of features increases. We also expect a weight training procedure to be carried out on a large human walking database in the future. Performance with Classifier Combination: Table 4 shows the human recognition performance for classifier combination with different strategies. Considering that the recognition rates of the stationary and kinematic classifiers are 62% and 72%, respectively, all four rules achieve better recognition performance. Among the combination strategies, the product rule achieves the best recognition rate of 83%. The sum rule also achieves a good recognition rate of 80%. The recognition rates achieved by the max and min rules are only slightly better than that of the kinematic classifier (72%). The sum rule has been mathematically shown to be robust to errors by Kittler et al. [8]. We believe that the main reason for the good performance achieved by the product rule is that the conditional independence assumption (the features used in different classifiers are conditionally statistically independent) holds in our application. The poorer performance of the max and min rules may come from their reliance on order statistics and their sensitivity to noise. Similar results are found in Shakhnarovich and Darrell's work on combining face and gait features [9].
4
Conclusions
In this paper, we propose a kinematic-based approach for human recognition. The proposed approach estimates 3D human walking parameters by fitting the kinematic model to the 2D silhouette extracted from a monocular image sequence. The kinematic and stationary features are extracted from the estimated
parameters, and used for human recognition separately. Next, we use different strategies to combine the two classifiers to increase the accuracy of human recognition. Experimental results show that our proposed approach achieves the highest recognition rate, 83%, by using the product rule to combine the classifiers of stationary and kinematic features. Note that this performance is achieved with people walking from −45° to 45° with respect to the frontoparallel direction, and with low-resolution human walking sequences. With higher-resolution human walking sequences and a weight training procedure for the weighted Euclidean distance, we expect better recognition performance in our future work.
Acknowledgment This work was supported in part by grants F49620-97-1-0184, F49620-02-1-0315 and DAAD19-01-0357; the contents and information do not necessarily reflect the position or policy of the U.S. Government.
References
[1] S. A. Niyogi, E. H. Adelson. Analyzing and recognizing walking figures in XYT. In Proc. IEEE Conference on CVPR, pp. 469-474, 1994.
[2] J. J. Little, J. E. Boyd. Recognizing people by their gait: the shape of motion. Videre: Journal of Computer Vision Research, 1(2):469-474, 1998.
[3] H. Murase and R. Sakai. Moving object recognition in eigenspace representation: gait analysis and lip reading. Pattern Recognition Letters, 17(2):155-162, 1996.
[4] P. S. Huang, C. J. Harris and M. S. Nixon. Recognizing humans by gait via parametric canonical space. Artificial Intelligence in Engineering, 13:359-366, 1999.
[5] M. H. Lin. Tracking articulated objects in real-time range image sequences. In Proc. ICCV, pp. 648-653, 1999.
[6] S. Wachter and H.-H. Nagel. Tracking of persons in monocular image sequences. In Proc. IEEE Workshop on Nonrigid and Articulated Motion, pp. 2-9, 1997.
[7] B. Bhanu and J. Han. Individual recognition by kinematic-based gait analysis. In Proc. International Conference on Pattern Recognition, (3):343-346, 2002.
[8] J. Kittler, M. Hatef, R. Duin, and J. Matas. On Combining Classifiers. IEEE Trans. PAMI, 20(3):226-239, 1998.
[9] G. Shakhnarovich and T. Darrell. On probabilistic combination of face and gait cues for identification. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pp. 169-174, 2002.
Architecture for Synchronous Multiparty Authentication Using Biometrics Sunil J. Noronha, Chitra Dorai, Nalini K. Ratha, and Ruud M. Bolle IBM T.J. Watson Research Center P.O. Box 704, Yorktown Heights, New York 10598, USA {noronha,dorai,ratha,bolle}@us.ibm.com http://www.research.ibm.com
Abstract. Biometrics-based remote individual authentication has become widespread recently. However, several existing business systems and processes often require the participation of multiple parties synchronously in real time. Further, new e-business processes can be enabled by technology that allows multiple participants to authenticate themselves synchronously and persistently. In this paper, we distinguish traditional workflow processes that require multiparty authentication from the synchronous multiparty authentication needed in business and consumer scenarios. A new system and method for multiparty authentication and authorization using real-time biometrics is proposed. We define real-time biometrics to include certificates that witness the simultaneous acquisition of biometric signals from multiple parties and certificates that prove that the parties continuously provided the biometric signals over an unbroken interval of time. We also present novel business processes based on this technology, such as remote (web-based) owner access to bank lockers controlled by a designated bank officer even when the officer is not physically present at the bank, notarization of a document remotely, and signing an "e-Will" without being present in the attorney's office.
1
Introduction
Remote individual positive identification using biometrics is becoming widespread. However, several existing business systems and processes require the synchronous participation and authentication of multiple parties in real time. Further, new business processes can be enabled by technology that allows multiple participants to remotely authenticate themselves synchronously. Existing solutions have so far failed to synchronously authenticate and authorize multiple parties using biometrics, particularly in a networked environment. With the prolific growth of the Internet, many commercial applications are being explored that are remotely operated and possibly unattended. For example, an e-commerce system may use a fingerprint of the customer to validate a transaction over the Web, such as an airline ticket purchase. Other examples of remote biometric authentication include point-of-sale transaction authorization based on fingerprints.
Beyond authentication problems where one person has to be authenticated, there are (business) processes where multiple parties have to be authenticated more or less at the same time (synchronously). Several exemplary authentication scenarios are now described where multiple parties have to be authenticated at the same time (synchronously), or where one or more of the identities have to be authenticated during a period of time (persistent synchronicity). The authentication scenarios include (a) a vault in a bank that can only be opened by two bank employees, where each employee has a separate key; (b) a locker or safe deposit box in a vault that is opened through the process of a bank employee opening the vault with a key (or two employees with two separate keys) and the safe deposit box owner opening his or her box with a key, in combination with a key used by the bank employee; and (d) a notary public witnessing the execution of a document by verifying the identity of the signer through conventional means, and authenticating the document by signing the notary stamp. Other such applications exist in the military and similar areas, where more than one authority is required to execute a transaction, such as the release of a weapon. If any transaction is executed during a multi-party meeting (e.g., if the participants vote on an important decision) and the authenticity of the transaction needs to be proved later, it is generally not sufficient to have authenticated the participants at the beginning of the meeting. Instead, it may be necessary to prove that all parties simultaneously participated in the transaction. This problem may be referred to as one of synchronous biometric authentication. Further, when the transaction spans a significant portion of time, it is often necessary to prove that parties were not absent during any part of the transaction (e.g., never left the meeting). This problem may be referred to as one of persistent biometric authentication.
2
Related Work
Previous work in the area of multi-party authentication mostly focuses on encryption-based dynamic peer groups [1]. The methods and services described in that work neither use biometrics nor a synchronous and persistent verification method. Yet another crypto-based method is described in [2] to support document conferencing. That work supports tracking multiple authors of evolving documents until the document is completed. A multi-party biometrics-based system is described in [3]. However, the focus is on designing protocols to support at least k matches from a set of n authorized parties and does not involve the concepts of synchronicity and persistence of the biometric signal. A traditional single-party remote authentication is described in U.S. Patent No. 5,930,804 to Yuan Pin Yu et al. [4]. Existing Internet-based systems do not allow for multi-party authentication and authorization in a synchronous fashion. A typical technique used for authenticating multiple parties in a sequential fashion (flow process) is the use of digital certificates. However, these certificates do not support simultaneous real
time signatures, or the presence of the signatories over an extended period of time. Nor do they link the signatures to the signers in a non-repudiable manner, as fingerprints or other biometrics can do. In this paper, we present a novel system and architecture for identifying, verifying, and/or authenticating the biometrics of two or more persons simultaneously and continuously within a prescribed time period. The proposed architecture can be used to achieve higher levels of security, as well as to implement important applications such as co-guarantorship, which would otherwise be very difficult or impossible to realize in an electronic, network-based system.
3
Architecture of the Proposed Multi-party Authentication System
In this section we describe both the overall architecture and the server and client components of the proposed multi-party authentication system. Figure 1 presents a high-level architecture diagram of a multi-party authentication system where each party (e.g., an initiating party, Party 1, Party 2, ..., Party K) is centrally authenticated on one of many authentication servers, referred to as Synchronicity and Persistence Validation (SPV) servers. The SPV server may be considered a multi-component transaction management server. The network can be a public network, such as the Internet, or a private network. Bidirectionally connected to the network are multiple client computer devices, such as a workstation, a PC, or a handheld device, each with a user interface through which a user can communicate with other devices on the network. Beyond the traditional user interface there are other input devices for acquiring biometric signals. The acquired biometric signals are processed, such as by being compressed, enhanced and/or analyzed, by validation client subsystems (VCS). The biometric signals, such as those representing one or more of fingerprints, voice prints and/or retinal images, can be authenticated through the authentication servers that are components of the SPV servers. In the system of Figure 1, multiple authentication server/database pairs may be used. In this case, for every party's userID, not only the userID is known, but also an identifier for an associated authentication server. The biometric templates, the database, and the other related components can form part of a biometrics processor component of the SPV server. The input devices may be implemented using video capture devices such as digital cameras that generate images of a portion of each user at each client input device, such as an image of a fingerprint, an image of the user's face, an image of the user's iris or retina, or any other part of the user that is suitable for generating biometric input data. The image data may be transmitted as, for example, 30 frames per second (or less) video data, or at super-video rates (greater than 30 frames per second), or it may be compressed, such as by using MPEG techniques, before being transmitted over the network. The use of compressed image data is also useful for concealing client responses to challenges issued by the SPV server. At least one of the Synchronicity and Persistence Validation (SPV) servers is connected to the network, and communicates with the various clients via their respective VCS.
612
Sunil J. Noronha et al.
Initiating Party Validation Client Subsystem (VCS)
Synchronicity and Persistence Validation (SPV) Server
VCS
VCS
VCS
Party 1
Party 2
Party K
Real-time biometrics are herein considered to include first certificates that witness the simultaneous acquisition of biometric signals from multiple parties, and second certificates that prove that the parties continuously provided the biometric signals over an unbroken interval of time. Business processes based on this technology are thus covered by the proposed architecture.
3.1 Components of the Synchronicity and Persistence Validation Server
Figure 2(a) shows the components of the SPV server, namely a policy coordinator, a challenge generator, a timing coordinator, and a biometrics processor. The subsystems of the SPV server represent four underlying core technologies that provide for: the generation of a (common) challenge to (all) the clients by the challenge generator of the SPV server; an algorithm at each client to respond to the challenge (part of the VCS); an algorithm at each client to combine the basic biometric signals with a response to the challenge (also part of the VCS); and an algorithm at the SPV server (the policy coordinator) to ascertain the "policy" desired by the user towards synchronicity and persistence, and to verify, using the timing coordinator and the biometrics processor, that the desired policy has been satisfied. Multiparty Challenging Process (Server): Figure 2(b) illustrates with a flowchart how the system functions. A user (e.g., the initiating party) initiates a synchronous, multi-party transaction at the client. The client contacts the SPV server designated for the transaction. The SPV server, based on the policy defined for the type of transaction, as maintained by the policy coordinator, contacts the other parties (e.g., Party 1, Party 2, ..., Party K) defined by the policy. The SPV server then requests the parties to provide their biometric signals, and also generates a common challenge to all of the involved parties.
Fig. 2. (a) Components of the Synchronicity and Persistence Validation Server; (b) Flowchart of the working of the system
At the next step, the clients receiving the challenge add their response to the biometrics stream. In the present system the client response data is inserted steganographically into a compressed video stream. The SPV server separates the biometric signals and the response from each of the clients, passes the response to the timing coordinator, and passes the biometric signals to the biometrics verifier. A time stamp of the clients and the SPV server is one example of a response to an SPV-server-initiated challenge. The responses received from the clients are used to determine whether the biometric signal acquisition was synchronous, and whether the biometric signals persisted for at least the duration of time specified by the policy. The SPV server, based on the results of the processing by the response validator, timing coordinator, and biometrics verifier, then certifies (if appropriate) the completion of the multi-party authorization requested by the initiating party at the client. In summary, the steps involved include: on receipt of the transaction request, the SPV server contacts all the parties involved; the SPV server generates a continuous set of common challenges for all the involved clients; locally, the clients acquire the biometrics, respond to the challenge, and send the biometric streams to the SPV server; the SPV server verifies that the signals are not stale (to prevent a replay attack) by checking the correctness of the responses to the challenges (e.g., by checking the timing of the responses relative to the time the challenge(s) were issued); and the SPV server authenticates all, or selected portions, of the biometric signals against stored templates.
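As a rough illustration of the server-side freshness check described above, the sketch below verifies that each client's response echoes the issued challenge and arrives within an allowed delay; the message fields and the timeout value are assumptions for the example, not details of the actual system.

```python
import time
import secrets

def issue_challenge():
    """Generate a challenge token together with its issue time."""
    return {"nonce": secrets.token_hex(16), "issued_at": time.time()}

def response_is_fresh(challenge, response, max_delay_s=2.0):
    """Accept a client response only if it echoes the challenge nonce
    and was produced within max_delay_s of the challenge being issued."""
    return (
        response.get("nonce") == challenge["nonce"]
        and 0.0 <= response.get("timestamp", -1.0) - challenge["issued_at"] <= max_delay_s
    )
```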
The components used during the foregoing processing include, but need not be limited to: (a) clock synchronization through a stream of challenges generated by the SPV server, e.g., time stamps, the mean of every data frame, the variance of every data frame, and pseudo-random coded sets of challenges; (b) liveness detection using challenge/response at the image level; or (c) data hiding in compressed bit streams and image content-based hash functions that may be used as keys for encoding the challenge response into an auxiliary channel of video or other data transmitted over the network to the SPV server.

Timing Coordinator: Synchronicity is implemented and guaranteed via an SPV-server-controlled protocol. Time for synchronicity verification is measured with a reasonably accurate clock and does not require knowledge of the absolute time, e.g., Universal Coordinated Time (UTC); a Trusted Time Server calibrated to UTC may be used to certify that the transaction not only occurred synchronously between multiple parties, but also occurred at a specified absolute moment in time. The Trusted Time Server could form a part of the Timing Coordinator, or it could be a third-party service.

Persistence Verification Process: Persistence can be implemented as follows. The SPV server continually generates unique timing tokens (UTTs), and continually evaluates response tokens (RTs) received from the clients. If the sequence of RTs from each client meets certain specifications (optionally controlled by the policy coordinator), such as a minimum threshold between RT time stamps, the transaction is deemed persistent for that client. An intersection of persistence intervals between multiple participating clients is then employed to certify the synchronicity and persistence of the multiparty transaction. Other variants of the persistence protocol may use simpler RTs in which the process of acquiring a biometric ensures the continuity of the transaction. In this case freshly generated SPV server time stamps need not be included in the RTs.

Policy Coordinator: The policy coordinator contains the policy parameters of a transaction, such as the synchronization delay, persistence time, level of security desired, number of parties to be authenticated, etc., and serves them to the other components of the SPV server.

Client Process: Biometric Acquisition, Challenge Response, and Liveness Detection. Figure 3(a) is a flow chart showing the overall flow of biometric acquisition, challenge response, and optional liveness detection for a client process. First, the initiating party issues a transaction request to the SPV server. The SPV server obtains the policy information from the policy coordinator, contacts all client parties involved in the transaction, and requests the participating clients to begin acquiring biometric information using the biometric input devices. The SPV server generates challenges using the challenge generator and transmits the generated challenges to the participating clients.
Fig. 3. (a) Overall flow of the operations of the client process; (b) Example of a multi-party authentication process
The participating clients return responses to the challenges. The client responses are inserted steganographically into the compressed video stream that conveys the biometric signals generated from the biometric input devices. The SPV server extracts the responses from the compressed video streams, preferably by reversing the procedure used by the clients to insert them, and then validates the responses to the challenges to ascertain the liveness or timeliness of the responses from the various participating clients. Assuming that the responses are received by the SPV server in a timely manner, as possibly defined by the policy currently in effect, the SPV server provides the received biometric signals to the biometrics processor and the received time stamps to the timing coordinator. The SPV server then validates, based on the presence or absence of authentication outputs from the biometrics processor and the timing coordinator, the synchronicity and persistence of the responses, as specified by the policy that is currently in effect. The final decision with respect to authentication of the parties to the transaction is then passed to the initiating client and the method terminates. As has been mentioned, a video-compression-based data hiding technique may be employed by the clients to conceal their response(s) in the biometric video signal.
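The persistence check sketched below illustrates the interval logic described earlier: each client is considered persistent over the span of its response tokens as long as consecutive time stamps are close enough, and the transaction is certified only if all clients' persistence intervals overlap for the required duration. The function names and the gap threshold are assumptions made for this illustration, not part of the described system.

```python
def persistence_interval(timestamps, max_gap_s=1.0):
    """Return (start, end) of the longest unbroken run of response-token
    time stamps, where consecutive stamps differ by at most max_gap_s."""
    ts = sorted(timestamps)
    best = (ts[0], ts[0])
    start = ts[0]
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > max_gap_s:       # gap too large: the run is broken
            start = cur
        if cur - start > best[1] - best[0]:
            best = (start, cur)
    return best

def transaction_is_persistent(per_client_timestamps, required_duration_s):
    """Certify persistence if all clients' intervals overlap long enough."""
    intervals = [persistence_interval(ts) for ts in per_client_timestamps]
    common_start = max(s for s, _ in intervals)
    common_end = min(e for _, e in intervals)
    return common_end - common_start >= required_duration_s
```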
3.2 Application Scenarios for the Multi-party Authentication Process
We show an application scenario for the multi-party authentication process described above in Figure 3(b). The figure illustrates the scenario using two diagrams. The first diagram identifies the actors (denoted by stick figures) and systems (denoted by other symbols) involved in the scenario; the numbered arrows represent the sequence and directionality of interactions between the actors and the systems. The second diagram more clearly portrays the time sequence of interactions in the same scenario by plotting the interactions vertically in chronological order, where each vertical column represents a timeline for the actor or system drawn at the head of the column. The example application scenario is a request to notarize an electronic document. Here the eNotary system takes on the role of a notary public to witness the execution of a document by verifying the identity of the signer remotely, and authenticating the document by signing the notary stamp digitally. Other multiparty scenarios include a request to open a bank locker (e.g., a safety deposit box) that is co-owned by two clients and requires the participation of a bank manager client, and an electronic will (eWill) application that requires the participation of a testator client and a witness client.
4
Conclusion
A novel system and method for multiparty authentication is described in this paper. The multi-party authentication process uses synchronous and persistent biometric signals received from the parties to a transaction, based on a policy, to approve a transaction request. The biometric signals are preferably expressed as compressed video signals carrying steganographically inserted challenge-response data. Several business applications based on the multiparty authentication engine are described.
References
[1] G. Ateniese, M. Steiner and G. Tsudik, "New multiparty authentication services and key agreement protocols", IEEE Journal on Selected Areas in Communications, Vol. 18, No. 4, April 2000, pp. 1-13.
[2] A. Goh, C.K. Ng and W.K. Yip, "Multiparty authentication mechanism for network-mediated document-object conferencing and collaborative processing", Proc. of IEEE TENCON, Vol. 3, pp. 427-432, 2000.
[3] M. Peyravian, S. M. Matyas, A. Roginsky and N. Zunic, "Multiparty Biometric-Based Authentication", Computers & Security, Vol. 19, No. 4, pp. 369-374, 2000.
[4] Y-P. Yu, S. Wong and M. B. Hoffberg, "Web-based biometric authentication system and method", US Patent 5,930,804, July 1999.
Boosting a Haar-Like Feature Set for Face Verification Bernhard Fröba, Sandra Stecher, and Christian Küblbeck Fraunhofer-Institute for Integrated Circuits Am Weichselgarten 3, D-91058 Erlangen, Germany bdf,stechesa,
[email protected] Abstract. This paper describes our ongoing work in the field of face verification. We propose a novel verification method based on a set of haar-like features which is optimized using AdaBoost. Seven different types of generic kernels constitute the starting base for the feature extraction process. The convolution of the different kernels with the face image, each varying in size and aspect ratio, leads to a high-dimensional feature space (270,000 features for an image of size 64×64). As the number of features quadruples the number of pixels in the original image, we try to determine only the most discriminating features for the verification task. The selection of a few hundred of the most discriminative features is performed using the AdaBoost training algorithm. Experimental results are presented on the M2VTS database according to the Lausanne Protocol, where we show that a reliable verification system can be realized representing a face with only 200 features.
1
Introduction
Face verification – the task of verifying a person's identity just using her face – has become more and more interesting during the last few years. It is still a research topic to design fast and reliable face verification algorithms that get by with a limited number of training pictures per person. One often used method is, for example, Turk and Pentland's method based on Principal Component Analysis (PCA), Eigenfaces [6], combined with a simple Euclidean metric for face recognition. Moreover, G. Guo and H. Zhang [2] use the same PCA feature representation for face recognition, but for feature optimization and classification they apply an algorithm called AdaBoost. Another class of face identification methods is based on Elastic Graph Matching [8], where mainly Gabor wavelets are used as image features. In this paper we develop a method for face verification which is based on an algorithm originally used for face detection. P. Viola and M. Jones [7] use a boosted set of haar-like features for fast face detection. The AdaBoost algorithm serves to optimize the feature set and to train the classifier. The haar-features are, in a more common sense, edge orientation filters, but both fine, local object structures and larger, more global intensity transitions can be represented by these haar-like features.
Fig. 1. Examples of the chosen face cutting from face images of the M2VTS database

In the following sections we describe the haar-kernels used in our work and how they are computed. Thereafter we outline the main ideas of the AdaBoost algorithm and show how it can be used for feature selection. We also present the distance metric used in our work. Experimental results are presented on the M2VTS database [5] using the Lausanne Protocol for evaluation. We show that a reliable verification system can be realized representing a face with only 200 features. This result is encouraging because one aim in practical applications is to achieve good recognition performance while having a small feature representation which can, for example, be stored on a smart card.
2
Preprocessing
A very important step for every face verification algorithm is a suitable face cutting, which has to be the same for every face in the verification process. Experiments showed that better results are achieved by omitting the mouth region, because variations in this area can be quite large. Moreover, it was observed that including the forehead and the hair line improves the results. In Figure 1 some chosen face cuttings are presented. The original images are taken from the M2VTS database.
3
Feature Description
Here we use a set of seven different kernels (kernel types 1-7), each varying in size and aspect ratio (Fig. 2), for the extraction of our haar-like features. Kernels of type 1 extract horizontal intensity variations, kernels of type 2 vertical variations, and kernels of type 3 mark a saddle point. Kernels of type 4 or 5 are adequate to find diagonal structures, and type 6 and 7 kernels are designed to search out horizontally or vertically oriented low-high-low or high-low-high transitions. Each feature is determined by
– the x- and y-size of the kernel,
– the type of the kernel, and
– the x- and y-position in the face image,
so that several features can be calculated for each image pixel.
Fig. 2. Seven base kernels. Each kernel is divided into different kernel areas with a constant weighting factor. For the evaluation of a specific kernel at a certain face-image position, all pixels lying under one kernel area are summed up and weighted with the weighting factor of this kernel area. To obtain the final feature value, all subtotals are added up
For even kernel sizes it is necessary to evaluate kernels also at sub-pixel positions, because the border of two kernel areas might fall between two image-pixel borders, or the kernel might not overlap whole image pixels at its margin (Fig. 3). As the kernel has to lie totally inside the face image for a valid convolution, the number of features per pixel decreases for pixels closer to the image margin, because bigger kernels would partially lie outside the image region there. The smallest implemented kernel size is 2 × 2 and the maximum size is 255 × 255, but it is also restricted by the face-image size itself. If there are no restrictions on the desired kernel size and kernel type, with all possible sizes and aspect ratios evaluated at each image pixel, the number of possible features is enormous.
Fig. 3. (4 × 4)-kernel of type 1 (Fig. 2): The evaluation is done at face-image pixel (x, y). For the upper left kernel pixel, which partially covers four image pixels, the values of the concerned image pixels (1, 1), (1, 2), (2, 1) and (2, 2) are quartered, and the four sub-pixels are summed up and weighted with the kernel weighting factor
In our setting, the number of all features which can be calculated quadruples the number of pixels in the face image, so that it is necessary to restrict the minimal and maximal size of the applied kernels, and to define a grid with variable step size so that the kernels do not have to be evaluated at each pixel position. The most remarkable advantage of the approach seems to be the fact that, depending on the used kernel size, a more or less global edge information is extracted at the different positions in the image. The idea is that very fine and local structures, e.g. around the eye region, are better described by features extracted with small kernels, whereas more global structures such as the nose, cheeks and forehead might be characterized much better by features from bigger kernels. Furthermore, there exists a very efficient calculation method for these haar-like features. Feature values are not computed directly from the original face image, but on the basis of the so-called Integral Image [7], which has to be generated from the original face image as an intermediate step. The value of the Integral Image at pixel (x, y) is the sum of all original face-image pixels placed left of and above (x, y), including this pixel. With the Integral Image it is possible to determine the sum of image pixels in each kernel area in a constant number of operations, so the evaluation complexity for each kernel type is independent of the actual kernel size.
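The integral image trick is standard and can be sketched as follows; the function names and the inclusive-corner convention are choices made for this illustration, and the type-1 kernel is shown only as one example of how a haar-like feature reduces to a few box sums (the sign convention of the two halves is an assumption, not taken from the paper).

```python
import numpy as np

def integral_image(img):
    """Cumulative sum so that ii[y, x] is the sum of img over all pixels
    at or above-left of (x, y), including (x, y) itself."""
    return np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)

def box_sum(ii, x0, y0, x1, y1):
    """Sum of the image over the rectangle [x0, x1] x [y0, y1] (inclusive),
    computed from at most four integral-image lookups."""
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

def type1_feature(ii, x, y, w, h):
    """Type-1-like haar feature: one half weighted +1, the other -1
    (illustrative convention; whole-pixel sizes only)."""
    half = w // 2
    left = box_sum(ii, x, y, x + half - 1, y + h - 1)
    right = box_sum(ii, x + half, y, x + w - 1, y + h - 1)
    return left - right
```

The cost of a feature evaluation is a handful of additions regardless of the kernel size, which is exactly the property exploited above.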
4
Feature Selection and Classification
As already mentioned in the previous section, the number of features obtained for one face is quite large, even if the kernel sizes are restricted and the kernels are not evaluated at every possible location. To get an idea of the numbers: for an image of size 64 × 64, a minimal kernel width and height of 6 pixels, a maximal kernel width and height of 31 pixels, and a step size of 3 for the grid (only every third pixel in every third row is taken as a position for the evaluation of the kernel), the number of features is about 270,000. To reduce complexity and to improve verification results we try to select only a small subset of the most discriminating features for face verification. To achieve this we use the AdaBoost algorithm as described in [2].
4.1 Basic Idea of Boosting
"Boosting is a general method for improving the accuracy of any given learning algorithm." [1]. The basic idea of the boosting algorithm is to select some predictions from a multitude of quite uncertain predictions, which just have to be slightly better than a random guess, and to combine these so-called weak predictions into one single strong prediction. Based on a training set of face images we try to find an optimal combination of single features for the final optimized feature set by boosting. Ideally those features should be selected which are best suited to discriminate between different persons.
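To make the basic idea concrete, the sketch below shows a generic discrete AdaBoost loop over single-feature weak classifiers; it is a simplified illustration, not the exact variant used by the authors (how the errors are measured over client and impostor pairs, and the global threshold value, follow the paper rather than this sketch).

```python
import numpy as np

def adaboost_feature_selection(distances, labels, num_rounds, t_f=2.0):
    """Simplified discrete AdaBoost selection of single-feature weak classifiers.

    distances : (num_pairs, num_features) per-feature distances between two face images
    labels    : 1 for genuine pairs (same person), 0 for impostor pairs
    A weak classifier accepts a pair if its feature distance is below t_f.
    """
    n_pairs, n_features = distances.shape
    w = np.full(n_pairs, 1.0 / n_pairs)          # sample weights
    predictions = (distances < t_f).astype(int)  # weak decisions, one column per feature
    chosen, alphas = [], []

    for _ in range(num_rounds):
        w = w / w.sum()
        # weighted error of every candidate feature under the current weights
        errors = np.array([np.sum(w * (predictions[:, j] != labels))
                           for j in range(n_features)])
        j_best = int(np.argmin(errors))
        eps = errors[j_best]
        if eps >= 0.5:                            # no better than a random guess: stop
            break
        beta = max(eps, 1e-12) / (1.0 - eps)
        chosen.append(j_best)
        alphas.append(np.log(1.0 / beta))         # importance weight of the feature
        # down-weight pairs the chosen feature already classifies correctly
        correct = predictions[:, j_best] == labels
        w = w * np.where(correct, beta, 1.0)

    return chosen, alphas
```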
4.2 Boosting a Feature Set
In our implementation a weak predictor is a single feature f_i, f_i ∈ I, where I is the set of all possible features. A single feature f_i is determined by kernel type, kernel size and position within the face image. The distance metric of the weak classifier is the Euclidean distance d_{e,s} of two corresponding features f_{i,p1} and f_{i,p2} from two different pictures p1 and p2,

d_{e,s} = ||f_{i,p1} − f_{i,p2}||.    (1)
If the distance d_{e,s} is smaller than the classifier threshold t_f, the two face images are classified as belonging to the same person (positive verification result). Otherwise the decision is that they are from two different persons (negative verification result). In the present setup the classifier threshold t_f is a global constant for all weak classifiers. For boosting we have to determine the two error rates, the false accept rate (FAR) and the false reject rate (FRR), for each feature over the whole training set. The total FRR is the mean of all class-specific FRRs. A class-specific FRR is obtained using a leave-one-out strategy, i.e., for all patterns in the training set of a person, one is fixed and compared against the remaining patterns of that person. The FAR is computed from impostor experiments, where all persons from the training set who do not belong to the current class serve as impostors. Again, the mean class-specific FAR is used as the feature-specific FAR. Boosting is an iterative procedure, where the number of boosting rounds corresponds to the number of features to be chosen, which must be specified by the operator. Additionally, a weighting factor w_i, which determines the importance of the feature, is obtained automatically for each selected feature by the AdaBoost algorithm. AdaBoost selects the feature with the lowest combined error rate over the whole training set in the current round. Additionally, training patterns which are already well represented by the chosen features are assigned a smaller influence on the training procedure. Thus, as the training process proceeds, boosting focuses more on the harder cases among the training data. The feature selection process results in a new feature set J which is a subset of the original feature set I. The AdaBoost algorithm is implemented according to the algorithm proposed by P. Viola and M. Jones in [7]. The final classifier used for verification can be regarded as a nearest neighbor classifier with a weighted Euclidean distance. The distance d_{e,a} of two pictures p1 and p2 using the feature weighting factors w is defined as follows:

d_{e,a} = Σ_{j ∈ J} w_j ||f_{j,p1} − f_{j,p2}||    (2)
where j specifies a certain feature from the feature set J. Finally, the result is again thresholded with a threshold t to decide whether the face images are from the same person or not. If d_{e,a} exceeds the threshold t, the test image does not belong to the person it was tested on and is rejected. For d_{e,a} < t the test image is assigned to the claimed person.
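As an illustration of the verification rule of Eq. (2), the following Python sketch computes the weighted distance over the boosted feature subset and applies the global threshold. The data layout (dictionaries of scalar feature responses) and all names are our assumptions, not taken from the paper.

```python
def verify(features_test, features_ref, selected, weights, threshold):
    """Weighted nearest-neighbour style decision of Eq. (2).

    features_test, features_ref: dict feature index -> scalar Haar-like response
    selected: iterable with the indices of the boosted feature subset J
    weights:  dict feature index -> AdaBoost weight w_j
    """
    d = sum(weights[j] * abs(features_test[j] - features_ref[j]) for j in selected)
    return d < threshold   # True: both images attributed to the same person
```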
Table 1. Face verification results on the M2VTS database according to the Lausanne protocol (Configuration 1). Minimal kernel size: 6; maximal kernel size: 31; step size for grid: 3; non-client-specific thresholds

Number of |        Evaluation set           |              Test set
features  | FAR=FRR | FRR=0    | FAR=0      | FAR=FRR    | FRR=0    | FAR=0
          | FAR FRR | FAR  FRR | FAR  FRR   | FAR  FRR   | FAR  FRR | FAR    FRR
50        | 6.4 7.3 | 84   0   | 0    57.0  | 6.9  12.81 | 83.0 0   | 0.024  73.4
100       | 5.9 6.5 | 79.0 0   | 0    51.2  | 6.1  10.8  | 78.0 0   | 0.020  67.6
200       | 6.4 6.3 | 81.6 0   | 0    49.5  | 6.9  8.8   | 80.2 0   | 0.026  66.1
300       | 6.4 6.1 | 83.2 0   | 0    50.0  | 6.8  8.8   | 81.9 0   | 0.024  65.6
5   Experimental Results
Our verification results are obtained on the M2VTS database according to the Lausanne protocol on Configuration 1 as defined in [3]. In all experiments we use common settings. The kernel width and height were restricted to a minimum size of 6 pixels and a maximum size of 31 pixels. Smaller kernel sizes increase the amount of data and the complexity too much, so we agreed on this setting for the kernel sizes. Further experiments have shown that larger kernel sizes do not enhance the verification result. Besides this, a grid with step size 3 was used, so that kernels were evaluated only at every third pixel in every third row. The single feature threshold was globally set to t_f = 2 for all features, and the final evaluation threshold t was not client specific. In Table 1 we show some experimental results using the standard settings just described. The number of features to be chosen by boosting is varied for the different tests. By analyzing the verification results it can be observed that selecting more features does not necessarily improve the results. At this point 200 features seem to be the best choice for our verification task. Finally, it should be pointed out that it is not possible in practice to select an arbitrary number of features: if the error rate of the best remaining feature approaches 0.5, which means that a random guess yields almost the same result, the boosting process is stopped. In Fig. 4 the three most discriminating features are visualized. The first selected feature (kernel type: 2, width: 20, height: 15) lies around the nose region and evaluates horizontal intensity transitions. This is a quite non-intuitive choice, but it can be observed that for some persons this area is almost homogeneous, while for others the lower edge of the eyeglasses is visible in the upper kernel area, as can be seen in the face image in the middle. The second chosen feature (kernel type 3, width: 27, height: 27) appraises a kind of saddle point on the left cheek, including the left eye and parts of the nose. Third, a feature (kernel type 5, width: 11, height: 20) above the left eye is selected. This feature evaluates diagonal intensity variations, where the intensity transition from the eyebrow to the forehead is determined.
Fig. 4. Visualization of the best three features chosen by the AdaBoost algorithm. The best (1.) feature (kernel type: 2, width: 20, height: 15) evaluates the nose region, the second (2.) one (kernel type 3, width: 27, height: 27) extracts features from the left cheek, including the left eye and parts of the nose, and the third feature (3.) (kernel type 5, width: 11, height: 20) is positioned over the left eye
6   Conclusions
In this paper we presented a face verification system based on a sparse Haar-feature representation. The features are selected from a large set of possible features using AdaBoost. Selecting only the 200 most discriminative features, we reach recognition rates comparable to the state of the art; see [4].
Acknowledgments The work described here was supported by the German Federal Ministry of Education and Research (BMBF) under the project EMBASSI.
References
[1] Y. Freund and R. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, September 1999.
[2] Guodong Guo and Hong-Jiang Zhang. Boosting for fast face recognition. Technical report, Microsoft Research, February 2001.
[3] J. Luettin and G. Maître. Evaluation protocol for the extended M2VTS database (XM2VTSDB). IDIAP-COM 05, IDIAP, 1998.
[4] J. Matas, M. Hamouz, K. Jonsson, J. Kittler, Y. Li, C. Kotropoulos, A. Tefas, I. Pitas, T. Tan, H. Yan, F. Smeraldi, J. Bigun, N. Capdevielle, W. Gerstner, S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz. Comparison of face verification results on the XM2VTS database. In Proc. Int. Conf. on Pattern Recognition, ICPR-15. IEEE Computer Society, 2000.
[5] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maître. XM2VTSDB: The extended M2VTS database. In Second International Conference on Audio- and Video-based Biometric Person Authentication, pages 71–77, 1999.
[6] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–591, June 1991.
[7] Paul Viola and Michael Jones. Robust real-time object detection, 2001.
[8] Laurenz Wiskott. Phantom faces for face analysis. Pattern Recognition, 30(6):837–846, 1997.
The BANCA Database and Evaluation Protocol

Enrique Bailly-Baillière⁴, Samy Bengio¹, Frédéric Bimbot², Miroslav Hamouz⁵, Josef Kittler⁵, Johnny Mariéthoz¹, Jiri Matas⁵, Kieron Messer⁵, Vlad Popovici³, Fabienne Porée², Belen Ruiz⁴, and Jean-Philippe Thiran³

¹ IDIAP, CP 592, rue du Simplon 4, 1920 Martigny, Switzerland
² IRISA (CNRS & INRIA) / METISS, Campus de Beaulieu, 35042 Rennes, France
³ EPFL, STI-ITS, 1015 Lausanne, Switzerland
⁴ University Carlos III de Madrid, Madrid, Spain
⁵ University of Surrey, Guildford, Surrey, GU2 7XH, UK
[email protected]

Abstract. In this paper we describe the acquisition and content of a new large, realistic and challenging multi-modal database intended for training and testing multi-modal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios, controlled, degraded and adverse, over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://www.ee.surrey.ac.uk/Research/VSSP/banca.
1   Introduction
In recent years the cost and size of biometric sensors and processing engines has fallen, a growing trend towards e-commerce, teleworking and e-banking has emerged, and people's attitude to security since September 11th has shifted. For these reasons there has been a rapid increase in the use of biometric technology in a range of different applications. For example, at Schiphol Airport in Amsterdam, frequent flyers are able to use their iris scans to check in for flights. The same technology is also used to grant access to airport personnel in secure areas. In Spain, fingerprint scans are used on social security cards, and in the U.S.A. the Federal Bureau of Prisons uses hand geometry to track the movements of its prisoners, staff and visitors within its prisons. However, even though these are all fairly reliable methods of biometric personal authentication, they are not accepted by users in all but these high-security situations, since they require user co-operation and are considered intrusive. In contrast, personal identification systems based on the analysis of speech and face images are non-intrusive and more user-friendly. Moreover, personal identity can often be ascertained without the client's assistance. However, speech and image-based systems are more susceptible to impostor attack, especially if the
impostor possesses information about a client, e.g. a photograph or a recording of the client's speech. Multi-modal personal verification is one of the most promising approaches to user-friendly (hence acceptable) highly secure personal verification systems [5]. BANCA is a European project whose aim is to develop and implement a secure system with enhanced identification, authentication and access control schemes for applications over the Internet such as tele-working and Web- or remote-banking services. One of the major innovations targeted by this project is to obtain an enhanced security system by combining classical security protocols with robust multi-modal verification schemes based on speech and face images. In order to build reliable recognition and verification systems, they need training, and in general the larger the training set, the better the performance achieved [8]. However, the volume of data required for training a multi-modal system based on the analysis of video and audio signals is in the order of TBytes (1000 GBytes). It is only recently that technology allowing manipulation and effective use of such amounts of data has become available. For the BANCA project there was a need for a multi-modal database that would include various but realistic recording scenarios, using different kinds of material and in different European languages. We are at present aware of only three publicly available medium or large scale multi-modal databases: the database collected within the M2VTS project, comprising 37 subjects [3], the DAVID-BT database [2], and the database collected within the Extended M2VTS EU project [10]. A survey of audio-visual databases prepared by Chibelushi et al. [7] lists many others, but these are either mono-modal or small, e.g. FERET [12], Yale [15], Harvard [13] and Olivetti [14]. From the point of view of database size, DAVID-BT is comparable with the M2VTS database: 31 clients, 5 sessions. However, the speech part of DAVID-BT is significantly larger than that of M2VTSDB. On the other hand, the quality and reproducibility of the data available on an SVHS tape is low. The XM2VTS database, together with the Lausanne protocol [10], contains 295 subjects recorded over 4 sessions. However, it was not possible to use it, as the controlled recording environment was not realistic enough compared to the real-world situations when one makes a transaction at home through a consumer web cam or through an ATM in a variety of surroundings. Therefore it was decided that a new database for the project would be recorded and a new experimental protocol using the database defined [6]. Typically, an evaluation protocol defines a set of data, how it should be used by a system to perform a set of experiments, and how the performance of the system should be computed [11]. Ideally, the protocol should be designed in such a manner that no bias in the performance is introduced, e.g. the training data is not used for testing. It should also represent a realistic operating scenario. Performing experiments according to a defined protocol allows different institutions to easily assess their results when compared to others. The purpose of this paper is to present the BANCA database and its associated protocol.
The rest of this paper is organised as follows. In the next section we define the task of identity verification. In section 3 the BANCA database specification is detailed. Information about the BANCA protocol designed for training and testing personal verification algorithms is given in section 4 before some conclusions are drawn.
2   Identity Verification
Identity Verification (IV) can be defined as the task that consists in verifying the identity X claimed (explicitly or implicitly) by a person U, using a sample y from this person, for instance an image of the face of U, a speech signal produced by U, etc. By comparing the sample to some template (or model) of the claimed identity X, the IV system outputs a decision of acceptance or rejection. The process can be viewed as a hypothesis testing scheme, where the system has to decide between the following alternatives:
– U is the true client (acceptance, denoted \hat{X}),
– U is an impostor (rejection, denoted \hat{\bar{X}}).
In practice, an IV system can produce 2 types of errors:
– False Acceptance (FA) if the system has wrongly accepted an impostor,
– False Rejection (FR) if a true client has been rejected by the system.
In practical applications, these 2 types of error have an associated cost, denoted C_FA and C_FR respectively. Moreover, in order to measure the quality of the system independently of the distribution of the accesses, we define the following quantities:
– the False Acceptance Rate (P_FA) is the ratio between the number of FA and the number of impostor accesses,
– the False Rejection Rate (P_FR) is the ratio between the number of FR and the number of client accesses.
IV approaches are usually based on the characterization of the hypotheses X and \bar{X} by a client template and a non-client template respectively, which are learned during a training (or enrollment) phase (the non-client model may even be trained during a preliminary phase, also called the installation phase, and is often the same for every client, in which case it is called the world model). Once the template for client X has been created, the system becomes operational for verifying identity claims on X. In the context of performance evaluation, this is referred to as the test phase. Conventionally, the procedure used by an IV system during the test phase can be decomposed as follows:
– feature extraction, i.e. transformation of the raw sample into a (usually) more compact representation,
– score computation, i.e. output of a numerical value S_X(y) based on a (normalized) distance between y and the templates for X (and \bar{X}),
– decision by comparing the score S_X(y) to a threshold Θ, independent of X (a small score-thresholding sketch is given below).
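The following small Python sketch illustrates how P_FA and P_FR can be estimated from impostor and client scores at a given threshold Θ. It assumes similarity scores (a claim is accepted when the score reaches the threshold); this convention and all names are our assumptions rather than a prescription of the protocol.

```python
def error_rates(client_scores, impostor_scores, theta):
    """False-acceptance and false-rejection rates at decision threshold theta.

    client_scores:   scores S_X(y) of genuine client accesses
    impostor_scores: scores S_X(y) of impostor accesses
    """
    fa = sum(s >= theta for s in impostor_scores)   # impostors wrongly accepted
    fr = sum(s < theta for s in client_scores)      # true clients wrongly rejected
    p_fa = fa / len(impostor_scores)
    p_fr = fr / len(client_scores)
    return p_fa, p_fr
```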
3   The BANCA Database
3.1   The Acquisition System
To record the database, two different cameras were used: a cheap analogue web cam and a high-quality digital camera. For the duration of the recordings the cameras were left in automatic mode. In parallel, two microphones, a poor-quality one and a good-quality one, were used. The database was recorded onto a PAL DV system. PAL DV is a proprietary format which captures video at a colour sampling resolution of 4:2:0. The audio was captured in both 16-bit and 12-bit resolution at 32 kHz. The video data is lossily compressed at the fixed ratio of 5:1; the audio data remains uncompressed. This format also defines a frame-accurate timecode which is stored on the cassette along with the audio-visual data. This video hardware can easily be interfaced to a computer, allowing frame-accurate retrieval of the data in the database onto the computer disk.
3.2   The Specification
The BANCA database was designed in order to test multi-modal IV with various acquisition devices (2 cameras and 2 microphones) and under several scenarios (controlled, degraded and adverse). For 4 different languages (English, French, Italian and Spanish), video and speech data were collected for 52 subjects (26 males and 26 females) on 12 different occasions, i.e. a total of 208 subjects. Each language - and gender - specific population was itself subdivided into 2 groups of 13 subjects, denoted in the following g1 and g2. Each subject recorded 12 sessions, each of these sessions containing 2 recordings: 1 true client access and 1 informed (the actual subject knew the text that the claimed identity subject was supposed to utter) impostor attack. The 12 sessions were separated into 3 different scenarios: – controlled (c) for sessions 1-4, – degraded (d) for sessions 5-8, – adverse (a) for sessions 9-12. The web cam was used in the degraded scenario, while the expensive camera was used in the controlled and adverse scenarios. The two microphones were used simultaneously in each of the three scenarios with each output being recorded onto a separate track of the DV tape. During each recording, the subject was prompted to say a random 12 digit number, his/her name, their address and date of birth. Each recording took an average of twenty seconds. Table 1 gives an example of the speech a subject would be expected to utter at a single session (i.e. two recordings). For each session the true client information remained the same. For different sessions the impostor attack information changed to another person in their group. More formally, in a given session, the impostor accesses by subject X were successively made with a claimed identity corresponding to each other subject
Table 1. Example of the speech uttered by a subject at one of the twelve BANCA sessions

True Client                       Impostor Attack
038921674501                      857901324602
Annie Other                       Gertrude Smith
9 St Peters Street                12 Church Road
Guildford                         Portsmouth
Surrey GU2 4TH                    Hampshire PO1 3EF
20.02.1971                        12.02.1976
from the same group (as X). In other words, all the subjects in group g recorded one (and only one) impostor attempt against each other subject in g, and each subject in group g was attacked once (and only once) by each other subject in g. Moreover, the sequence of impostor attacks was designed so as to make sure that each identity was attacked exactly 4 times in each of the 3 different conditions (hence 12 attacks in total). In the rest of this paper the following notation will be used:
– X_i^g : subject i in group g, g ∈ {g1, g2}, i ∈ [1, 13],
– y_k(X) : true client record from session k by subject X, k ∈ [1, 12],
– z_l(X) : impostor record (from a subject X' ≠ X) claiming identity X during session l, l ∈ [1, 12].
For each language, an additional set of 30 other subjects, 15 males and 15 females, recorded one session (audio and video). This set of data is referred to as world data. These individuals claimed two different identities, recorded by both microphones. Finally, any data outside the BANCA database will be referred to as external data. Figure 1 shows a few examples of the face data from the English part of the database, whilst Figure 2 shows a few examples of face data from the French part.
4   Experimental Protocol
In verification, two types of protocols exist: closed-set and open-set. In closed-set verification the population of clients is fixed. This means that the system design can be tuned to the clients in the set. Thus both the adopted representation (features) and the verification algorithm applied in the feature space are based on some training data collected for this set of clients. Anyone who is not in the training set is considered an impostor. The XM2VTS protocol is an example of this type of verification problem formulation. In open-set verification we wish to add new clients to the list without having to redesign the verification system. In particular, we want to use the same feature
Fig. 1. Examples of the BANCA database images taken from the English part of the database. Top: controlled, middle: degraded, bottom: adverse scenarios
space and the same design parameters such as thresholds. In such a scenario the feature space and the verification system parameters must be trained using data completely independent from that used for specifying the client models. The BANCA protocol is an example of an open-set verification protocol. In this paper, we present a configuration of the BANCA protocol using only one language [6]; other protocols, taking into account all 5 languages, will be presented later.
4.1   A Monolingual Protocol
In order to define an experimental protocol, it is necessary to define a set of evaluation data (or evaluation set), and to specify, within this set, which data are to be used for the training phase (enrollment) and which are to be used for the test phase (test accesses). Moreover, before becoming operational, the development of an IV system usually requires the adjustment of a number of configuration parameters (model size, normalization parameters, decision thresholds, etc.). It is therefore necessary to define a development set, on which the system can be calibrated and
Fig. 2. Examples of the BANCA database images taken from the French part of the database. Top: controlled, middle: degraded, bottom: adverse scenarios
adjusted, and for which it is permitted to use the knowledge of the actual subject identity during the test phase. Once the development phase is finished, the system performance can then be assessed on the evaluation set (without using the knowledge of the actual subject identity during the test phase). To avoid any methodological flaw, it is essential that the development set is composed of a subject population distinct from that of the evaluation set. In order to carry out realistic (and unbiased) experiments, it is necessary to use different populations and data sets for development and for evaluation. We distinguish further between 2 circumstances: single-modality evaluation experiments and multi-modality evaluation experiments. In the case of single-modality experiments, we need to distinguish only between two data sets: the development set and the evaluation set. In that case, g1 and g2 are used alternatively as development set and evaluation set (when g1 is used as development set, g2 is used as evaluation set, and vice versa). In the case of multi-modality experiments, it is necessary to introduce a third set of data: the (fusion) tuning set used for tuning the fusion parameters, i.e. the way to combine the outputs of each modality. If the tuning set is identical
Table 2. The usage of the different sessions in the seven BANCA experimental configurations (“TT”: client training and impostor test, “T”: client and impostor test)

Session   MC   MD   MA   UD   UA   P    G
   1      TT             TT   TT   TT   TT
   2      T                        T    T
   3      T                        T    T
   4      T                        T    T
   5           TT                       TT
   6           T         T         T    T
   7           T         T         T    T
   8           T         T         T    T
   9                TT                  TT
  10                T         T    T    T
  11                T         T    T    T
  12                T         T    T    T
to the development set, this may introduce a pessimistic bias in the estimation of the tuning parameters (biased case). Another solution is to use three distinct sets for development, tuning and evaluation (unbiased case). In that case, we expect the experimenters to use data from the other languages as development set, while g1 and g2 are used alternatively for tuning and evaluation. In the BANCA protocol, seven distinct experimental configurations have been specified, which identify which material can be used for training and which for testing. In all configurations, the true client records for the first session of each condition are reserved as training material, i.e. the true client records from sessions 1, 5 and 9. In all experiments, the client model training (or template learning) is done on at most these 3 records. The seven configurations are Matched Controlled (MC), Matched Degraded (MD), Matched Adverse (MA), Unmatched Degraded (UD), Unmatched Adverse (UA), Pooled test (P) and Grand test (G). Table 2 describes the usage of the different sessions in each configuration (an illustrative encoding is given in the sketch at the end of this subsection). “TT” refers to the client training and impostor test session, and “T” denotes client and impostor test sessions. A more detailed description of the seven configurations can be found in Annex A. For example, for configuration MC the true client data from session 1 is used for training and the true client data from sessions 2, 3 and 4 are used for client testing. All the impostor attack data from sessions 1, 2, 3 and 4 is used for impostor testing. From analysing the performance results on all seven configurations it is possible to measure:
– the intrinsic performance in a given condition,
– the degradation from a mismatch between controlled training and uncontrolled test,
– the performance in varied conditions with only one (controlled) training session,
– the potential gain that can be expected from more representative training conditions.
It is also important to note that for the purpose of the protocol, 5 frontal images have been extracted from each video recording to be used as true client and impostor attack images.
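For convenience, the session usage of Table 2 (together with the impostor sessions of Tables 3–5) can be encoded programmatically; the following Python dictionary is one possible such encoding, with key names chosen by us.

```python
# Client training ("TT"), client test ("T") and impostor test sessions
# per BANCA configuration, following Table 2 and Tables 3-5.
CONFIGURATIONS = {
    "MC": {"client_train": [1],       "client_test": [2, 3, 4],
           "impostor_test": [1, 2, 3, 4]},
    "MD": {"client_train": [5],       "client_test": [6, 7, 8],
           "impostor_test": [5, 6, 7, 8]},
    "MA": {"client_train": [9],       "client_test": [10, 11, 12],
           "impostor_test": [9, 10, 11, 12]},
    "UD": {"client_train": [1],       "client_test": [6, 7, 8],
           "impostor_test": [5, 6, 7, 8]},
    "UA": {"client_train": [1],       "client_test": [10, 11, 12],
           "impostor_test": [9, 10, 11, 12]},
    "P":  {"client_train": [1],       "client_test": [2, 3, 4, 6, 7, 8, 10, 11, 12],
           "impostor_test": list(range(1, 13))},
    "G":  {"client_train": [1, 5, 9], "client_test": [2, 3, 4, 6, 7, 8, 10, 11, 12],
           "impostor_test": list(range(1, 13))},
}
```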
4.2   Performance Measure
In order to visualize the performance of the system, irrespective of its operating condition, we use the conventional DET curve [9], which plots on a log-deviate scale the False Rejection Rate P_FR as a function of the False Acceptance Rate P_FA. Traditionally, the point on the DET curve corresponding to P_FR = P_FA is called the EER (Equal Error Rate) and is used to measure the closeness of the DET curve to the origin. The EER value of an experiment is reported on the DET curve, to comply with this tradition. Figure 3 shows an example DET curve. We also recommend measuring the performance of the system for 3 specific operating conditions, corresponding to 3 different values of the Cost Ratio R = C_FA / C_FR, namely R = 0.1, R = 1, R = 10. Assuming equal a priori probabilities of genuine clients and impostors, these situations correspond to 3 quite distinct cases:
– R = 0.1: a FA is an order of magnitude less harmful than a FR,
– R = 1: a FA and a FR are equally harmful,
– R = 10: a FA is an order of magnitude more harmful than a FR.
When R is fixed and when P_FR and P_FA are given, we define the Weighted Error Rate (WER) as:

WER(R) = (P_FR + R · P_FA) / (1 + R).   (1)
P_FR and P_FA (and thus WER) vary with the value of the decision threshold Θ, and Θ is usually optimized so as to minimize WER on the development set D:

\hat{Θ}_R = arg min_{Θ_R} WER(R).   (2)

The a priori threshold thus obtained is always less efficient than the a posteriori threshold that optimizes WER on the evaluation set E itself:

Θ_R^* = arg min_{Θ_R} WER(R).   (3)
The latter case does not correspond to a realistic situation, as the system is being optimized with the knowledge of the actual test subject identities on the evaluation set. However, it is interesting to compare the performance obtained with a priori and a posteriori thresholds in order to assess the reliability of the threshold setting procedure.
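The following Python sketch illustrates Eqs. (1) and (2): it computes WER(R) and selects the a priori threshold by minimizing WER over a list of candidate thresholds on the development scores. The score convention (accept when the similarity score reaches the threshold) and all names are our assumptions.

```python
import numpy as np

def wer(p_fr, p_fa, r):
    """Weighted error rate, Eq. (1)."""
    return (p_fr + r * p_fa) / (1.0 + r)

def a_priori_threshold(dev_client, dev_impostor, r, candidates):
    """Pick the threshold minimising WER(R) on the development set (Eq. 2)."""
    dev_client, dev_impostor = np.asarray(dev_client), np.asarray(dev_impostor)
    best_theta, best_wer = None, np.inf
    for theta in candidates:
        p_fa = np.mean(dev_impostor >= theta)   # impostors accepted
        p_fr = np.mean(dev_client < theta)      # clients rejected
        w = wer(p_fr, p_fa, r)
        if w < best_wer:
            best_theta, best_wer = theta, w
    return best_theta, best_wer
```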
Fig. 3. An example DET curve (P_FR [%] versus P_FA [%]) used to help measure performance. Curves are for groups g1 and g2 using protocols P and G
5   Distribution
It is intended to make the database available to the research community to train and test their verification algorithms. The reader is pointed to [1] to find out which parts of the database are currently available. Part of the database has already been made available for standards work on MPEG-7 [4]. To date, 14 copies of this benchmark MPEG-7 test set have been distributed. Although this data was captured in connection with biometric verification, many other uses are envisaged, such as animation and lip-tracking.
6   Conclusion
In this paper, a new multi-modal database and its associated protocol have been presented which can be used for realistic identity verification tasks using up to two modalities. The BANCA database offers the research community the opportunity to test their multi-modal verification algorithms on a large, realistic and challenging database. It is hoped that this database and protocol will become a standard, like the XM2VTS database, which enables institutions to easily compare the performance of their own algorithms to others.
7   BANCA Partners
EPFL (Switzerland); IRISA (France); University Carlos III de Madrid (Spain); Ibermatica (Spain); Thales (France); BBVA (Spain); Oberthur (France); Institut Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP) (Switzerland); Université de Louvain (France); University of Surrey (UK).
Acknowledgments This research has been carried out in the framework of the European BANCA project, IST-1999-11169. For IDIAP this work was funded by the Swiss OFES project number 99-0563-1 and for EPFL by the Swiss OFES project number 99-0563-2.
References
[1] The BANCA Database – English part; http://www.ee.surrey.ac.uk/Research/VSSP/banca.
[2] BT-DAVID; http://faith.swan.ac.uk/SIPL/david.
[3] The M2VTS database; http://www.tele.ucl.ac.be/M2VTS/m2fdb.html.
[4] MPEG-7 Overview; http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm.
[5] M. Acheroy, C. Beumier, J. Bigün, G. Chollet, B. Duc, S. Fischer, D. Genoud, P. Lockwood, G. Maitre, S. Pigeon, I. Pitas, K. Sobottka, and L. Vandendorpe. Multi-modal person verification tools using speech and images. In Multimedia Applications, Services and Techniques (ECMAST 96), Louvain-la-Neuve, 1996.
[6] S. Bengio, F. Bimbot, J. Mariéthoz, V. Popovici, F. Porée, E. Bailly-Baillière, G. Matas, and B. Ruiz. Experimental protocol on the BANCA database. Technical Report IDIAP-RR 02-05, IDIAP, 2002.
[7] C. C. Chibelushi, F. Deravi, and J. S. D. Mason. Survey of audio visual speech databases. Technical report, University of Swansea.
[8] P. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.
[9] A. Martin et al. The DET curve in assessment of detection task performance. In Eurospeech'97, volume 4, pages 1895–1898, 1997.
[10] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. In Second International Conference on Audio- and Video-based Biometric Person Authentication, March 1999.
[11] P. J. Phillips, A. Martin, C. L. Wilson, and M. Przybocki. An introduction to evaluating biometric systems. IEEE Computer, pages 56–63, February 2000.
[12] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss. The FERET database and evaluation procedure for face-recognition algorithm. Image and Vision Computing, 16:295–306, 1998.
[13] ftp://hrl.harvard.edu/pub/faces.
[14] http://www.cam-orl.co.uk/facedatabase.html.
[15] http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
A   Detailed Description of the Different Protocols
Tables 3, 4 and 5 describe more formally the 7 training-test configurations.
Table 3. Description of protocols MC, MD and MA

MC (use only controlled data) / MD (use only degraded data) / MA (use only adverse data):
– Client training: ∀ X_i, i = 1, …, 13, train with y_k(X_i), where k = 1 (MC), k = 5 (MD), k = 9 (MA).
– Non-client training: all World data + any external data (it is forbidden to use other client data from the same group).
– Client testing: ∀ X_i, i = 1, …, 13, test with y_k(X_i), where k = 2, 3, 4 (MC), k = 6, 7, 8 (MD), k = 10, 11, 12 (MA).
– Impostor testing: ∀ X_i, i = 1, …, 13, test with z_l(X_i), where l ∈ {1, 2, 3, 4} (MC), l ∈ {5, 6, 7, 8} (MD), l ∈ {9, 10, 11, 12} (MA).
– Number of tests per experiment: client: 13 × 3 = 39; impostor: 13 × 4 = 52.
– Total number of image tests: client: 2 × 5 × 2 × 39 = 780; impostor: 2 × 5 × 2 × 52 = 1040.
Table 4. Description of protocols UD and UA

UD (use controlled data for training and degraded data for testing) / UA (use controlled data for training and adverse data for testing):
– Client training: ∀ X_i, i = 1, …, 13, train with y_k(X_i), k = 1 (both UD and UA).
– Non-client training: all World data + any external data (it is forbidden to use other client data from the same group).
– Client testing: ∀ X_i, i = 1, …, 13, test with y_k(X_i), where k = 6, 7, 8 (UD), k = 10, 11, 12 (UA).
– Impostor testing: ∀ X_i, i = 1, …, 13, test with z_l(X_i), where l ∈ {5, 6, 7, 8} (UD), l ∈ {9, 10, 11, 12} (UA).
– Number of tests per experiment: client: 13 × 3 = 39; impostor: 13 × 4 = 52.
– Total number of image tests: client: 2 × 5 × 2 × 39 = 780; impostor: 2 × 5 × 2 × 52 = 1040.
Table 5. Description of protocols P and G

P (use controlled data for training and all data for testing) / G (use all data for training and all data for testing):
– Client training: ∀ X_i, i = 1, …, 13, train with y_k(X_i), where k = 1 (P), k = 1, 5, 9 (G).
– Non-client training: all World data + any external data (it is forbidden to use other client data from the same group).
– Client testing: ∀ X_i, i = 1, …, 13, test with y_k(X_i), k = 2, 3, 4, 6, 7, 8, 10, 11, 12 (both P and G).
– Impostor testing: ∀ X_i, i = 1, …, 13, test with z_l(X_i), l ∈ {1, …, 12} (both P and G).
– Number of tests per experiment: client: 13 × 9 = 117; impostor: 13 × 12 = 156.
– Total number of image tests: client: 2 × 5 × 2 × 117 = 2340; impostor: 2 × 5 × 2 × 156 = 3120.
A Speaker Pruning Algorithm for Real-Time Speaker Identification

Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti

University of Joensuu, Department of Computer Science
P.O. Box 111, 80101 Joensuu, Finland
{tkinnu,ekarpov,franti}@cs.joensuu.fi
Abstract. Speaker identification is a computationally expensive task. In this work, we propose an iterative speaker pruning algorithm for speeding up the identification in the context of real-time systems. The proposed algorithm reduces computational load by dropping out unlikely speakers as more data arrives into the processing buffer. The process is repeated until there is just one speaker left in the candidate set. Care must be taken in designing the pruning heuristics, so that the correct speaker will not be pruned. Two variants of the pruning algorithm are presented, and simulations with TIMIT corpus show that an error rate of 10 % can be achieved in 10 seconds for 630 speakers.
1   Introduction
The speaker identification task is defined as follows: given an unknown speaker and a set of N candidate speakers, find the most similar speaker among the candidates [1, 2]. More precisely, this is a closed-set speaker identification task, which means that the unknown speaker is assumed to be one of the candidate speakers. In general, a speaker identification system usually consists of the following four parts: feature extraction, speaker modeling, pattern comparison, and decision logic. Given an unknown speaker's voice sample and the stored candidate speaker models, the system first computes feature vectors from the given speech sample. Then the feature vectors are compared against all of the N models using the pattern comparison algorithm. The result of this phase is a list of match scores, which can be either similarity or dissimilarity values. The decision logic finally makes a one-out-of-N decision, e.g. selects the speaker with the maximum degree of similarity. Speech user interfaces and speaker adaptation methods in speech recognition systems are examples of potential applications of speaker identification technology. In such systems, the identification time must be minimized so that the system works in real time, or near real time. Speaker identification is a computationally expensive problem [1, 2]. The identification time is dominated by two factors: the number of speakers (N) and the number of the unknown speaker's feature vectors (M). Identification requires N·M distance (or similarity) computations. By reducing the number of
speakers or feature vectors, identification time can be significantly reduced. Silence detection is a simple example of reducing the number of feature vectors [8]. In this work we aim to reduce the number of computations by reducing the number of candidate speakers. The basic idea is that when a given amount of new data (feature vectors) comes in, we drop a certain number of candidate speakers. The process is repeated until finally just one speaker is left in the candidate set. We assign this speaker to be the most similar to the unknown speech sample.
2   Principle of Speaker Pruning
The principle of the proposed speaker pruning is illustrated in Fig. 1. The ellipses represent the models of the speakers in the speaker database, and the “x” dots are the feature vectors of the unknown speaker. Initially, all the speakers are in the database. When more data comes in, a few of the most dissimilar speakers are pruned away, and they are no longer used in the pattern comparisons. The process is repeated until there is only one speaker left in the candidate set. The decision on the speaker identity is the last speaker left in the set. The speaker pruning framework described above is in a general form. The following issues must be taken into consideration in the actual implementation:
1. what are the features,
2. what is the representation of the speaker models, and what is the pattern comparison method,
3. what is the pruning criterion (algorithm),
4. how many new vectors are read from the input buffer prior to the next pruning,
5. how many speakers are pruned at each iteration.
Fig. 1. Illustration of speaker pruning
In this work the feature vectors are composed of the 12 lowest mel-frequency cepstral coefficients (MFCC) computed using 27 mel-spaced filters. The 0th cepstral coefficient is excluded. The analysis frame is windowed by a 30 ms Hamming window, and the frame shift is 10 ms. The signal is pre-emphasized by the filter H(z) = 1 − 0.97 z^{-1}. Before the feature extraction, silent frames are removed based on simple short-term
energy thresholding [3]. All speech sample durations in this paper refer to silence-removed speech. All speakers are modeled by a codebook [4, 7] of 64 vectors using the generalized Lloyd algorithm (GLA) as the clustering method [5]. The pattern comparison method is the average quantization error (or distortion) D(X, C) between the test vector sequence X and the codebook C [7]. The speaker with the minimum quantization error is selected as the best matching candidate. The design issues 3, 4 and 5 are discussed in detail in the next section.
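As an illustration of the pattern comparison step, the following Python sketch computes an average quantization distortion D(X, C) between a sequence of MFCC vectors and a speaker codebook and performs the minimum-distortion decision. The use of the plain Euclidean distance and all names are our assumptions.

```python
import numpy as np

def avg_quantization_error(X, codebook):
    """Mean distance from each test vector to its nearest codebook vector."""
    X = np.asarray(X, dtype=float)          # (n_vectors, dim) MFCC test sequence
    C = np.asarray(codebook, dtype=float)   # (codebook_size, dim) speaker codebook
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean()

def identify(X, codebooks):
    """Closed-set identification: return the speaker with minimum distortion."""
    scores = {spk: avg_quantization_error(X, cb) for spk, cb in codebooks.items()}
    return min(scores, key=scores.get)
```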
3   Speaker Pruning Algorithm
A pruning algorithm with two variants is proposed. The variants are referred to as Static and Adaptive pruning. These will be described in Sections 3.1 and 3.2. The details of the control parameters of the algorithms will be discussed in Section 3.3. The following notations will be used:
– X : feature vectors of the unknown speaker,
– C_i : the model (codebook) of the ith speaker,
– D(X, Y) : dissimilarity of vector sequences X and Y,
– M : the number of new vectors read at each iteration,
– K : the number of pruned speakers at each iteration.
3.1   Static Pruning
The basic idea in Static pruning is to maintain an ordered list of the best matching speakers. At each iteration, the K worst matching speakers are pruned from the list. As new vectors arrive from the input buffer, the dissimilarity values between the augmented vector set and the remaining speaker models are updated. Note that, in practice, the re-evaluation of the dissimilarities can be done fast by using cumulative counts of distances. The pseudocode of the method is given in Fig. 2.
3.2   Adaptive Pruning
In the second variant, Adaptive pruning, the pruning criterion is data-driven: the number of speakers to be pruned depends on the current distribution of the dissimilarity values between the unknown speaker and the remaining speaker models. Based on the mean value µ and standard deviation σ of the dissimilarity distribution, a pruning threshold θ is set, and all speakers above this threshold are pruned out. After pruning, the dissimilarity distribution changes, and its mean value and standard deviation must be updated to obtain the updated pruning threshold. The pseudocode is given in Fig. 3.
Let C = {C1, …, CN} be the set of all speaker models;
Let X = Ø;
WHILE (C ≠ Ø AND vectors left in input buffer) DO
    Insert M new vectors from input buffer to set X;
    Re-evaluate dissimilarities D(X, Ci) for all Ci in C;
    Remove the K most dissimilar models from C;
END
RETURN arg min_i { D(X, Ci) | Ci ∈ C };

Fig. 2. Static pruning algorithm
Let C = {C1, …, CN} be the set of all speaker models;
Let X = Ø;
WHILE (C ≠ Ø AND vectors left in input buffer) DO
    Insert M new vectors from input buffer to set X;
    Re-evaluate dissimilarities D(X, Ci) for all Ci in C;
    Compute µ and σ of the distribution { D(X, Ci) | Ci ∈ C };
    Let θ = µ + η σ be the pruning threshold;
    Remove all speakers i from C satisfying D(X, Ci) > θ;
END
RETURN arg min_i { D(X, Ci) | Ci ∈ C };

Fig. 3. Adaptive pruning algorithm
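For concreteness, the following Python sketch is a runnable counterpart of the adaptive pruning pseudocode in Fig. 3 (the static variant differs only in how the survivors are chosen). It assumes a non-empty sequence of feature vectors; names, defaults and the plain Euclidean distortion are our own choices.

```python
import numpy as np

def distortion(X, C):
    """Average distance from each vector in X to its nearest code vector in C."""
    X, C = np.asarray(X, dtype=float), np.asarray(C, dtype=float)
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1).min(axis=1).mean()

def adaptive_pruning(vectors, codebooks, M=5, eta=0.9):
    """Adaptive speaker pruning in the spirit of Fig. 3.

    vectors:   (n, dim) array of feature vectors of the unknown speaker (n >= 1)
    codebooks: dict speaker_id -> (codebook_size, dim) codebook array
    M:         pruning interval, eta: threshold width (theta = mu + eta * sigma)
    """
    vectors = np.asarray(vectors, dtype=float)
    alive, n_read = set(codebooks), 0
    while len(alive) > 1 and n_read < len(vectors):
        n_read = min(n_read + M, len(vectors))          # read M new vectors
        X = vectors[:n_read]
        d = {s: distortion(X, codebooks[s]) for s in alive}
        theta = np.mean(list(d.values())) + eta * np.std(list(d.values()))
        survivors = {s for s in alive if d[s] <= theta}
        alive = survivors or {min(d, key=d.get)}        # never prune every speaker
    d = {s: distortion(vectors[:max(n_read, M)], codebooks[s]) for s in alive}
    return min(d, key=d.get)
```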
3.3   Controlling the Pruning
Both the static and adaptive variants have a parameter M, which will be referred to as the pruning interval. It simply specifies the number of vectors read from the input buffer before the next pruning. The variants differ in the way the number of pruned speakers is defined. The static variant has a parameter K which specifies the number of speakers pruned at each iteration. In the adaptive variant, the number of pruned speakers depends on the distribution of the current dissimilarity values. The parameter η determines the degree of the thresholding: the larger η is, the fewer speakers are pruned, and vice versa. The formula for calculating the pruning threshold has the following interpretation. From visual inspection of speaker dissimilarity values on the TIMIT corpus we found that the speaker dissimilarity distribution follows more or less a Gaussian curve, as shown in Fig. 4. Because of this, the pruning threshold corresponds to a certain confidence interval of the normal distribution, and η specifies its width. Speakers above the upper limit of the confidence interval will be pruned. For instance, if η = 1 then speakers above the 68 % confidence interval will be pruned; that is approximately 16 % of the speakers.
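Under this Gaussian approximation, the expected fraction of speakers pruned with threshold θ = µ + ησ is the normal tail probability 1 − Φ(η). A short check (using SciPy; an illustration of ours, not part of the original experiments) reproduces the quoted figure of roughly 16 % for η = 1:

```python
from scipy.stats import norm

def expected_pruned_fraction(eta):
    """Fraction of speakers above theta = mu + eta*sigma under a Gaussian
    approximation of the dissimilarity distribution."""
    return norm.sf(eta)            # survival function, 1 - Phi(eta)

print(round(expected_pruned_fraction(1.0), 3))   # ~0.159, i.e. about 16 %
```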
Fig. 4. Examples of typical dissimilarity value distributions from the TIMIT corpus (N = 630 speakers); the histogram shows the number of speakers as a function of dissimilarity
Fig. 5. Evolution of the dissimilarity values and the pruning threshold over time
An example of how the distributions and thresholds change over time is shown in Fig. 5. Our general observation is that the variance of the dissimilarity values decreases with time, which can be explained by the fact that the “outlier” speakers are pruned out and the remaining speakers are close to the unknown speaker.
4   Experiments
For experiments, we used the American English TIMIT corpus [6] with all of the 630 speakers included. The length of training data (silence-removed speech) for speaker models was on average 8.8 s. We used 8 kHz sampling frequency and 16-bit resolution. The features were extracted as described in Section 2.
4.1   Results
We consider the trade-off between identification error rate and the average time spent on the identification. By lengthening the pruning interval or by decreasing the number of pruned speakers, we expect a smaller error rate, but at the cost of increased identification time. From several runs with different parameter combinations we can plot the error rate as a function of identification time. These curves are shown in Figures 6 and 7 for the static and adaptive variants, respectively.
Fig. 6. Evaluation of the static variant using different pruning intervals (M = 2, 5, 10): error rate (%) as a function of average identification time (s)
Fig. 7. Evaluation of the adaptive variant using different values of the η parameter (η = 0.1, 0.5, 0.9): error rate (%) as a function of average identification time (s)
Next, we compared the two variants by selecting the best curve from each variant. These are shown in Fig. 8. The pruning interval for the static variant is M = 5 vectors (or 50 milliseconds), and the parameter for the adaptive variant is η = 0.9. An experiment without any pruning and using all available test data was also carried out. In this case, for the N = 630 speakers, only one speaker was misclassified and the error rate is therefore about 0.15 %, with an average identification time of about 230 seconds. The high identification rate is due to the fact that TIMIT was recorded in a noise-free laboratory environment.
Fig. 8. Comparison of the proposed variants: error rate (%) as a function of average identification time (s)
4.2   Discussion
The following observations are made from the results. In the static variant, the pruning interval has only a small effect on the error rate and identification time. With an average identification time of 10 seconds, the error rate is 40 % in the best case. With an identification time of 50 seconds, the error rate drops below 0.5 %. In the adaptive variant, the parameter η has some effect on the performance for small identification times. For high identification times the curves do not show significant differences. An error rate of 10 % is reached in 10 seconds of identification time. With 25 seconds, an error rate of less than 0.5 % is achieved. This is half of the time of the static variant for the same error rate. We also ran a few tests with larger pruning intervals and η parameters (up to M = 30 and η = 2.0), but the results were poor. For instance, using η = 2.0, an error rate of 0.3 % took more than 100 seconds to reach. We decided to include only the best results here. The results for the best parameter combinations of both variants are shown in Fig. 8. It is evident that the adaptive variant works better in general. It reaches a lower error rate with the same identification time. The adaptive method reaches an error rate of 0.46 % with 24 seconds of speech, whereas the static method spends over 60 seconds to reach the same error rate. Compared to the full search, which reaches an error rate of 0.15 % in 230 seconds, the speed-up is significant in both cases. In the case of only 3 seconds spent on identification, the error rate for the adaptive method is 53 %. Therefore, the methods need further optimization in order to work in real time with an acceptable error rate.
5   Conclusions
In this paper we have presented a method for reducing the computational load in real-time speaker identification systems. Two variants of the algorithm with different pruning heuristics were presented, and their performance on the TIMIT corpus was studied. The adaptive method outperforms the static variant in every case. With the adaptive method, a 10 % error rate can be reached in 10 seconds with 630 speakers. Both variants are easy to implement. In the future, we plan to extend the algorithm to use time-dependent values for the M, K and η parameters. For instance, the pruning interval M should initially be large, so that the unknown speaker's feature vector distribution stabilizes, and then it should be decreased gradually to make the identification faster.
References
[1] J. Campbell, "Speaker Recognition: A Tutorial," Proc. IEEE, 85(9), pp. 1437–1462, 1997.
[2] S. Furui, "Recent Advances in Speaker Recognition," Pattern Recognition Letters, 18, pp. 859–872, 1997.
[3] J. R. Deller Jr., J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, New York, 2000.
[4] T. Kinnunen and I. Kärkkäinen, "Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification," Proc. Joint IAPR International Workshop on Statistical Pattern Recognition (S+SPR2002), pp. 681–688, Windsor, Canada, 2002.
[5] Y. Linde, A. Buzo, and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. on Comm., 28(1), pp. 84–95, 1980.
[6] Linguistic Data Consortium, http://www.ldc.upenn.edu/
[7] F. K. Soong, A. E. Rosenberg, B.-H. Juang, and L. R. Rabiner, "A Vector Quantization Approach to Speaker Recognition," AT&T Technical Journal, 66, pp. 14–26, 1987.
[8] D. Burileanu, L. Pascalin, C. Burileanu, and M. Puchiu, "An Adaptive and Fast Speech Detection Algorithm," Proc. 3rd International Workshop on Text, Speech and Dialogue (TSD 2000), pp. 177–182, Brno, Czech Republic, 2000.
“Poor Man” Vote with M-ary Classifiers. Application to Iris Recognition

V. Vigneron¹,², H. Maaref², and S. Lelandais²

¹ LIS, avenue Félix Viallet, 38031 Grenoble cedex, France
² LSC FRE 2494, 40 rue du Pelvoux, 91020 Evry Courcouronnes, France
[email protected]

Abstract. Achieving good performance in biometrics requires matching the capacity of the classifier or set of classifiers to the size of the available training set. A classifier with too many adjustable parameters (large capacity) is likely to learn the training set without difficulty but be unable to generalize properly to new patterns. If the capacity is too small, the training set might not be learned without appreciable error. There is thus an advantage in controlling the capacity through a variety of methods involving not only the structure of the classifiers themselves, but also the properties of the input space. This paper proposes an original nonparametric method to optimally combine multiple classifier responses. Highly favorable results have been obtained using this method.
1   Ensemble of Classifiers
One recent trend in computational learning looks at what Valiant called the “theory of the learnable” [23]. Suppose we have a set of n samples and use these to fit a finite number M of classifiers from a family F of possible classifiers. Then the probability that a chosen classifier g is consistent with the training set yet has an overall error rate of at least E_n is at most M(1 − E_n)^n. Using a single classifier has shown certain limitations in achieving satisfactory recognition performance, and this leads us to use multiple classifiers, which is now a common practice [14]. Readers can find surveys in Ripley [20] and in Devroye et al. [4]. Some recent directions include mixtures of experts [11], boosting methods [5], bagging methods [2], query by committee [7], stacked regression [24], and distributed estimation for data fusion [1, 8, 22]. These papers show that the multiple-classifier approach produces a promising improvement in recognition performance. The efficacy of the method is explained by the following argument: most classifiers share the feature that the solution space is highly degenerate. The post-training distribution of classifiers trained on different training sets chosen according to the density of the samples p(x) will be spread out over a multitude of nearly equivalent solutions. The ensemble is a particular sample from the set of these solutions. The basic idea of the
ensemble approach is to eliminate some of the generalization errors using the differentiation within the realized solutions of the learning problem. The variability of the errors made by the classifiers of the ensemble has shown that the consensus improves significantly on the performance of the best individual in the ensemble¹. In [10], Hansen et al. used a digit recognition problem to illustrate how the ensemble consensus outperformed the best individuals by 25%. The marginal benefit obtained by increasing the ensemble size is usually low due to correlation among the errors made by the participating classifiers on an input x [9]: most classifiers will get the right answer on easy inputs, while many classifiers will make mistakes on “difficult” inputs. The number of classifiers can be very high (some hundreds), so it is difficult to understand their decision characteristics. Some of the above-referenced papers used a simple combination scheme which just cascades multiple classifiers. This scheme results in reduced robustness of the aggregated classifier, due to classifier interdependencies that may reduce the recognition performance. Our present interest is a new scheme to improve class separation performance when combining multiple classifiers using information-theoretic learning. The paper is organized as follows. Section 2 explains the relation between independence and the collective decision of M-ary classifiers. Section 3 presents a theoretical sketch of distribution-free classification (DFC). Section 4 provides an independence measure based on mutual information to evaluate the classifier combination. Section 5 presents an experiment where few samples are available in the case of iris recognition, and further investigations are finally discussed.
2   Relations between Independence and M-ary Classifiers Collective Decision
One can often read in the literature the following sentence: “to combine a set of classifiers, it is better to choose independent classifiers”, e.g. in Nadal et al. [16]. This idea is vague, as illustrated by the following example.

Example 1. Consider a binary classification problem (with equiprobable classes w1 and w2) with two classifiers C1 and C2 whose outputs are ς1 and ς2. Suppose there is no reject mechanism and that their performances are homogeneous for the two classes, i.e. their probabilities of correct classification λ1 and λ2 satisfy p(ς1 = w1|w1) = p(ς1 = w2|w2) = λ1 and p(ς2 = w1|w1) = p(ς2 = w2|w2) = λ2. Evidently, p(ς1 = w1|w2) = p(ς1 = w2|w1) = 1 − λ1 and p(ς2 = w1|w2) = p(ς2 = w2|w1) = 1 − λ2. The probabilities of the outputs ς1 and ς2 are

p(ς1 = w1) = p(ς1 = w1|w1) p(w1) + p(ς1 = w1|w2) p(w2) = λ1 · (1/2) + (1 − λ1) · (1/2) = 1/2,
p(ς2 = w1) = p(ς2 = w1|w1) p(w1) + p(ς2 = w1|w2) p(w2) = λ2 · (1/2) + (1 − λ2) · (1/2) = 1/2.
¹ This analysis is certainly true for the situation described here, wherein the classifiers of the ensemble see different training patterns; it can be effective even when all the classifiers are trained using the same training set [9].
Suppose the two classifiers are independent; then we have p(ς1 = w1, ς2 = w1) = p(ς1 = w1) p(ς2 = w1) = (1/2)(1/2) = 1/4. Similarly, p(ς1 = w1, ς2 = w2) = p(ς1 = w2, ς2 = w1) = p(ς1 = w1) p(ς2 = w2) = (1/2)(1/2) = 1/4. Thus the performance of the ensemble is independent of the performance of the individuals. This is possible only if λ1 = λ2 = 1/2, i.e. the two classifiers are independent only if they are random (recognition rate of 50%)! This example suggests that interesting classifiers should not be independent in the classical sense. Still, one useful class of probabilities consists of conditional probabilities [19]: we can instead impose that the classifiers be independent for each class. Conditional independence of the two random variables ς1 and ς2 given the class corresponds to the following case:

p(ς1 = wj, ς2 = wℓ | wi) = p(ς1 = wj | wi) p(ς2 = wℓ | wi),   ∀ 1 ≤ i, j, ℓ ≤ K.
(1)
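A small numerical check of Example 1 (our own illustration, with names of our choosing): for two class-conditionally independent classifiers with accuracy 0.9 on equiprobable classes, the joint output distribution differs from the product of its marginals, i.e. the outputs are not statistically independent unless the accuracy drops to 0.5.

```python
import itertools

def joint_and_marginals(lam1, lam2):
    """Joint output distribution of two class-conditionally independent binary
    classifiers with accuracies lam1, lam2 and equiprobable classes w1, w2."""
    def cond(lam, out, true):                 # p(output | true class)
        return lam if out == true else 1 - lam
    joint = {}
    for o1, o2 in itertools.product((1, 2), repeat=2):
        joint[(o1, o2)] = sum(0.5 * cond(lam1, o1, w) * cond(lam2, o2, w) for w in (1, 2))
    m1 = {o: sum(v for (a, _), v in joint.items() if a == o) for o in (1, 2)}
    m2 = {o: sum(v for (_, b), v in joint.items() if b == o) for o in (1, 2)}
    return joint, m1, m2

joint, m1, m2 = joint_and_marginals(0.9, 0.9)
print(joint[(1, 1)], m1[1] * m2[1])   # 0.41 vs 0.25: not unconditionally independent
```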
Suppose that prior probabilities p(wi ) = 1/K, ∀i and that both classifiers admit a conditional probability and satisfy Eq. (1). Then, p(ς1 = wj , ς2 = w ) = p(ς1 = w )p(ς2 = w ), ∀1 ≤ j, ≤ K.
(2)
K K 1 where p(ς1 = wj ) = i=1 p(ς1 = wj |wi )p(wi ) = K i=1 p(ς1 = wj |wi ) and K 1 K p(ς = w |wi ). Hence, p(ς1 = p(ς2 = w ) = i=1 p(ς1 = w |wi )p(wi ) = K 2 i=1 K 1 K wj , ς2 = w ) = i=1 p(ς1 = wj , ς2 = w |wi )p(wi ) = K i=1 p(ς1 = wj , ς2 = w |wi ). Eq. (2) becomes: K 1 p(ς1 = wj , ς2 = w |wi ) = K i=1
1 K2
K K i=1
i =1
p(ς1 = wj |wi )p(ς2 = w |wi ),(3)
K K 1 1 Replacing (1) in (3) give K i=1 p(ς1 = wj , ς2 = w |wi ) = K 2 i=1 p(ς1 = 1 K K wj |wi )p(ς2 = w |wi ) + K 2 i=1 i =i p(ς1 = wj |wi )p(ς2 = w |wi ) . At the end : (K − 1)
K i=1
p(ς1 = wj , ς2 = w |wi ) =
K K
p(ς1 = wj |wi )p(ς2 = w |wi )
(4)
i=1 i =i
Eq. (4) can be satisfied only if at least one of both conditions is true: p(ς1 = wj |wi ) = 1/K, ∀1 ≤ i, j ≤ K or p(ς1 = w |wi ) = 1/K, ∀1 ≤ i, ≤ K. Example 2. Consider once again our toy example restricted to binary classification and decide that λ1 = λ2 ≈ 1. Similarly, p(ς1 = w2 , ς2 = w1 |w1 ) ≈ 1 and p(ς1 = w1 |w1 )p(ς2 = w1 |w1 ) ≈ λ1 λ2 ≈ 1. We conclude that the two classifiers performed well even they are conditionaly independent. Further, conditional independence will be the evaluation criterion of the classifiers.
3 Distribution-Free Classification
Suppose that the classifiers of the ensemble are each trained on independently chosen training sets of s samples selected according to p(x). We have K classes, each denoted wi, 1 ≤ i ≤ K. Let p(wi|x) be the probability that x comes from wi. The classifier has K possible categories to choose from. It is well known that the Bayes classifier represents the optimum in the sense of minimal classification error [4]. The Bayes classifier selects the class w*(x) = wj if p(wj|x) = max_{wi∈Ω} p(wi|x), where Ω is a partition of the feature space. Using Bayes' formula, this posterior probability is p(wj|x) = p(x|wj)p(wj) / Σ_{i=1}^{K} p(x|wi)p(wi), where p(x|wi) (the likelihood function of class wi) and p(wi) are usually unknown but can be estimated from the learning data set. Suppose there are a number of classifiers Ck, 1 ≤ k ≤ M, each of which produces the output ςk(x) (written ςk for readability). Then the Bayes formula gives

p(wj | ς1, . . . , ςM) = p(ς1, . . . , ςM | wj) p(wj) / Σ_{i=1}^{K} p(ς1, . . . , ςM | wi) p(wi).   (5)

Table 1. DFC learning algorithm

Learning Algorithm
Inputs: {x1, . . . , xn}
Init: partition the output space of each Ck into qk bins Lk1, . . . , Lkqk; create Q = Πk qk sets of exactly K counters associated with the K classes, denoted^a n^{wi}_{i1,...,iM}, 1 ≤ i ≤ K.
For each x ∈ {x1, . . . , xn}:
  1. Compute the outputs ς1(x), . . . , ςM(x) of the classifiers.
  2. For each Ck, find ik such that ςk(x) ∈ Lkik, 1 ≤ ik ≤ qk.
  3. Increment the counter n^{wi=wtrue}_{i1,...,iM} matching the true class wtrue of x.
  4. For each possible combination of indices i1, . . . , iK, 1 ≤ ik ≤ qk, collect the aggregated classification response yi1,...,iK(x) as
     yi1,...,iK(x) = wj if n^{wj}_{i1,...,iM} = max_{1≤ℓ≤K} n^{wℓ}_{i1,...,iM} > 0, and w0 otherwise.
Output: {yi1,...,iK(x1), . . . , yi1,...,iK(xn)}
^a i1, . . . , iM are the indices of the set of counters.
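A minimal Python sketch of the counter-based learning rule of Table 1 is given below; the data layout, function names and the encoding of the rejection class are our own assumptions, not the authors' implementation:

import numpy as np
from collections import defaultdict

def dfc_learn(bin_indices, labels, K):
    # bin_indices: (n, M) integer array, bin_indices[t, k] = i_k for classifier C_k on sample x_t
    # labels: (n,) true classes in {0, ..., K-1}
    counters = defaultdict(lambda: np.zeros(K, dtype=int))
    for idx, w_true in zip(map(tuple, bin_indices), labels):
        counters[idx][w_true] += 1              # step 3 of Table 1
    return counters

def dfc_response(counters, idx, K):
    # aggregated response y_{i1,...,iM}: majority class of the counters, or rejection (w0)
    n = counters.get(tuple(idx))
    if n is None or n.sum() == 0:
        return -1                               # -1 stands here for the rejection class w0
    return int(np.argmax(n))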
Estimating the unknown likelihoods p(ς1, . . . , ςM | wj) is of primary importance. They could be estimated using some parametric model, but we have no prior information on the collective functioning of the classifiers. This motivates us to search for a distribution-free estimate of the error, using a histogram-based rule with a fixed partition, the number of bins in the partition being "not too large". Such a learning algorithm is given in Table 1. The interesting point lies in the fact that no prior knowledge about the classifiers is needed. The output space of the classifiers is a discrete space with K + 1 distinct points corresponding to the K possible classes, the supplementary one being the
rejection class². The most natural division of such a space consists in considering each point as a cell, i.e. Li = {wi}, 1 ≤ i ≤ K, and L0 = {w0} (rejection class). The classification algorithm is given in Table 2. If the objective is to provide subjective probabilities rather than to classify, step 3 changes: subjective probabilities can then be computed from n^{wj}_{i1,...,iM} for each class as

P(wj) = n^{wj}_{i1,...,iM} / Σ_{j'} n^{wj'}_{i1,...,iM}  if ∃ n^{wj}_{i1,...,iM} ≠ 0, and P(wj) = 0 otherwise.   (6)

In the case of a rank classifier, the output is a K-dimensional vector ςk = (r1k, . . . , rKk), where rjk is the output of classifier k corresponding to class wj. Each axis rjk is subdivided into qkj intervals L^{kj}, which can be of different sizes, e.g. L^{kj}_ℓ, 1 ≤ ℓ ≤ qkj, is an interval defined by its lower and upper bounds [L^{kj}_{ℓ−1}, L^{kj}_ℓ]. The index ik in the algorithm is computed by combining the per-axis indices in a mixed-radix fashion, ik = Σ_{j=1}^{K} i_{kj} Π_{j'=1}^{j−1} q_{kj'}, where the indices ikj are such that rjk ∈ L^{kj}_{ikj}.

Table 2. Application of the DFC algorithm for a given input x

Classification Algorithm
1. Compute the outputs ς1(x), . . . , ςM(x) for the given input x.
2. For each ςk(x), find ik such that ςk(x) ∈ Lkik.
3. Compute yi1,...,iK(x) as the aggregated classification response.
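The combination of the per-axis bin indices ikj into the single index ik can be sketched as follows; NumPy's standard mixed-radix indexing is used here as one concrete realization, and the exact formula of the paper may differ in detail:

import numpy as np

def rank_bin_index(r, edges):
    # r: length-K rank vector output by classifier C_k; edges[j]: boundaries [L_0, ..., L_{q_kj}] of axis j
    i_kj = [int(np.clip(np.searchsorted(e, x, side='right') - 1, 0, len(e) - 2)) for x, e in zip(r, edges)]
    dims = [len(e) - 1 for e in edges]                  # q_kj bins per axis
    return i_kj, int(np.ravel_multi_index(i_kj, dims))  # single flattened index i_k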
4 Independence Measure of the Classifiers Ensemble
In Section 2 we showed that conditional independence of two classifiers C1 and C2 requires p(ς1, ς2 | wi) = p(ς1 | wi) p(ς2 | wi). This is a desirable property of the ensemble. The Kullback-Leibler (KL) divergence can be considered as a kind of distance between two probability densities, because it is always non-negative and equals zero iff the two distributions are equal³ (see Hyvärinen et al. [18] for more details). For two probability density functions (pdf's) f and g it is defined as

D(f ‖ g) = Σ_x f(x) log [ f(x) / g(x) ].   (7)
To apply the Kullback-Leibler divergence here, one can measure the discrepancy between the joint density p(ς1, ς2 | wi) and the factorized density p(ς1 | wi) p(ς2 | wi), i.e.

D( p(ς1, ς2 | wi) ‖ p(ς1 | wi) p(ς2 | wi) ) = Σ_{ς1,ς2} p(ς1, ς2 | wi) log [ p(ς1, ς2 | wi) / ( p(ς1 | wi) p(ς2 | wi) ) ].   (8)

² If the objective is just to give subjective probabilities and not to classify, we only need to collect the counter values n^{wtrue}_{i1,...,iM}.
³ This is a direct consequence of the (strict) convexity of the negative logarithm. Note that the KL divergence is not a proper distance measure because it is not symmetric.
The smaller D( p(ς1, ς2 | wi) ‖ p(ς1 | wi) p(ς2 | wi) ) is, the more independent the classifiers are given the class wi. For a finite number of classes {w0, w1, . . . , wK}:

D(x) = Σ_{wi} Σ_{ς1,ς2} p(ς1, ς2, wi) log [ p(ς1, ς2 | wi) / ( p(ς1 | wi) p(ς2 | wi) ) ].   (9)
Rearranging (9), we find for a given x:

D(x) = − Σ_{wi} Σ_{ς1,ς2} Σ_{j=1}^{2} p(ς1, ς2, wi) log p(ςj | wi) + Σ_{wi} Σ_{ς1,ς2} p(ς1, ς2, wi) log p(ς1, ς2 | wi)   (10)
     = Σ_{j=1}^{2} Hj − H(ς1, ς2; w0, . . . , wK),   (11)

where Hj = − Σ_{wi} Σ_{ς1,ς2} p(ς1, ς2, wi) log p(ςj | wi) is the entropy of the j-th classifier and H(ς1, ς2; w0, . . . , wK) denotes the total entropy over the whole set of classes {w0, . . . , wK}. From (11), and noting that D(x) ≥ 0, we see that the inequality H(ς1, ς2; w0, . . . , wK) ≤ Σ_j Hj holds, with equality iff the conditional mutual information is zero, D(x) = 0, i.e. the multivariate probability is fully factorized: p(ς1, ς2 | wi) = p(ς1 | wi) p(ς2 | wi). The mutual information between the two classifiers C1 and C2 measures the quantity of information that each classifier conveys about the other; it can be considered as a measure of statistical correlation between the classifiers (see [18]). In other words, if D(x) is large, then once ς1 is known, ς2 does not bring much additional information (most of the information in ς1 is already in ς2). Hence, a small value of D(x) is preferable when combining classifiers. Let us denote n^{wi}_{i1,i2} = Σ_{i3,...,iM} n^{wi}_{i1,i2,...,iM} for simplicity. Note that this computation requires only a minor rearrangement of the counters produced by the algorithm of Section 3. The various probabilities appearing in Eq. (9) can be estimated by the following marginals:
p(ς1, ς2, wi) = n^{wi}_{i1,i2} / Σ_{wi} Σ_{i1,i2} n^{wi}_{i1,i2},   (12)
p(ς1, ς2 | wi) = n^{wi}_{i1,i2} / Σ_{i1,i2} n^{wi}_{i1,i2},   (13)
p(ς1 | wi) = Σ_{i2} n^{wi}_{i1,i2} / Σ_{i1,i2} n^{wi}_{i1,i2},   (14)
p(ς2 | wi) = Σ_{i1} n^{wi}_{i1,i2} / Σ_{i1,i2} n^{wi}_{i1,i2}.   (15)
By inserting Eqs. (12)-(15) into (9), we obtain

D(x) = Σ_{wi} Σ_{i1,i2} [ n^{wi}_{i1,i2} / Σ_{α,β} n^{wi}_{α,β} ] log { [ n^{wi}_{i1,i2} / Σ_{α,β} n^{wi}_{α,β} ] / ( [ Σ_α n^{wi}_{i1,α} / Σ_{α,β} n^{wi}_{α,β} ] [ Σ_α n^{wi}_{α,i2} / Σ_{α,β} n^{wi}_{α,β} ] ) },   (16)

and, after some algebraic manipulation:

D(x) = Σ_{wi} Σ_{i1,i2} [ n^{wi}_{i1,i2} / Σ_{α,β} n^{wi}_{α,β} ] log [ n^{wi}_{i1,i2} Σ_{α,β} n^{wi}_{α,β} / ( ( Σ_α n^{wi}_{α,i2} ) ( Σ_α n^{wi}_{i1,α} ) ) ].   (17)
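Eq. (17) can be evaluated directly from the counter table. The sketch below (array layout and the handling of empty cells are our own choices) estimates the class-conditional mutual information between the two classifiers:

import numpy as np

def conditional_mutual_information(n):
    # n: array of shape (K, Q1, Q2) with n[i, a, b] = n^{w_i}_{a,b}; returns the estimate of D(x) in Eq. (17)
    n = n.astype(float)
    per_class = n.sum(axis=(1, 2), keepdims=True)                 # sum_{alpha,beta} n^{w_i}_{alpha,beta}
    p_joint = n / np.maximum(per_class, 1e-12)                    # p(ς1, ς2 | w_i)
    p1 = p_joint.sum(axis=2, keepdims=True)                       # p(ς1 | w_i)
    p2 = p_joint.sum(axis=1, keepdims=True)                       # p(ς2 | w_i)
    ratio = np.where(p_joint > 0, p_joint / np.maximum(p1 * p2, 1e-12), 1.0)
    return float((p_joint * np.log(ratio)).sum())                 # summed class-conditional mutual information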
5 Application of M-ary Classifiers to Iris Recognition
Iris recognition combines computer vision, pattern recognition, statistics, and the human-machine interface. The purpose is real-time, high-confidence recognition of a person's identity by mathematical analysis of the random patterns that are visible within the iris of an eye from some distance. Because the randomness of iris patterns has very high dimensionality, recognition decisions are usually made with confidence levels high enough to support rapid and reliable exhaustive searches through large databases. Iris recognition technology identifies people by the unique patterns of the iris - the colored ring around the pupil of the eye. A first process was developed by Daugman [3]. It permitted efficient comparisons based on information from a set of Gabor wavelets, a specialized filter bank that extracts information from a signal at various locations and scales. Feature extraction here uses a 4-step protocol: (i) localisation of the eye regions in the face (Fig. 1.a), (ii) detection of the outlines of the eyes (Fig. 1.b), (iii) localisation of the pupils, (iv) extraction of the gradient vector field (Fig. 1.c). The pupil detection (step (iii)) uses the luminance image of the face: the image of the face is divided into 4 almost equal rectangles of size ℓ/2 × L/2, where ℓ is the width of the face and L its height. On each eye region, the luminance image is then computed and filtered by a retinal filter (to correct local variations of lighting), from which the norm and the direction of the gradient are extracted. Conversion of an iris image into a numeric code that can easily be manipulated is essential to its use. Once the image has been obtained, an iris code is computed based on the information from the gradient field. Iris codes derived from this process are compared with previously learned iris codes.

Data Collection. A series of experiments was performed for two tasks: (i) evaluating features, and (ii) evaluating recognition performance by combining multiple classifiers. The database and features to be used are as follows: let B0 = {(x1, w1), . . . , (xn, wn)} be a database of n = 30 eye images, xi an N-dimensional vector composed of a set of feature cells xij identified by the pixel Pj = (aj, bj), 1 ≤ j ≤ N, and wi the label attached to the i-th image. Two numerical features that
Fig. 1. (a) Eye region detection. (b) Iris outlines extraction. (c) Computed iris gradient vector field
have good recognition performance in practice are used in this experiment: the first feature, called CGB (Contour-Based Gradient distribution) [12], is computed by applying the Sobel operator to the normalized mesh R and computing the gradient-direction distribution map. The second feature, called DDD (Directional Distance Distribution) [17], is computed using distance information: each pixel in the binary map R shoots rays in eight directions and each ray computes the distance to the nearest pixel of opposite color (black or white). Both the CGB and DDD maps can be represented by an N = 256-dimensional feature vector⁴ x = (x1, . . . , xN). Due to the small size of the dataset (n ≪ N), performance evaluation of the aggregated classifier is done by bootstrap (see Kallel et al. [13] for details). In this experiment, an M-ary classifier (M = 100) is trained on the basis of the algorithm reported in Table 1: one half of the classifiers with CGD-based inputs, one half with DDD-based inputs. A partial classifier is a classifier which takes into consideration only N′ inputs, N′ ≪ N (see [21]). For a given classifier, N′ randomly chosen positions in the gradient vector field of the mesh R are memorized. For a new image, gradient values are collected at the same positions. Other positions are randomly selected for a new classifier. Hence, we evaluate the M-ary classifiers under the assumption of a "degenerate" (reduced) feature space.

Methods. Classically, when abundant data are available, the first step consists in dividing the supplied data into two sets: a test set and a training set. This is not possible here due to the small size of the dataset. Since no classical inference is possible, owing to the intrinsic complexity of the problem, our last resort is to construct an estimate of the density function without imposing structural assumptions. Using resampling methods such as the bootstrap [6], the information contained in the observed dataset B0, drawn from the empirical distribution F0
⁴ CGB contains local information about the image, because the edge operator can extract only the local gradient direction. In contrast, DDD captures global information, since directional distance information provides a rough sketch of the global pattern.
such that {(x1, w1), . . . , (xn, wn)} ~iid F0, is extended to many generated data sets B*b, 1 ≤ b ≤ B. These samples are called bootstrap samples (see [2]). Table 3 describes the training algorithm. The aggregated classification response y*b_{i1,...,iK}(x) is then updated according to Table 1.

Table 3. Experimental protocol including the bootstrap resampling method

Training Algorithm
Input: the 30 initial CGD and DDD meshes R.
Init: B0 = empty list; B*b = empty lists (1 ≤ b ≤ B).
For each Ck, 1 ≤ k ≤ M:
  1. Choose pixels P_i^(k) = (a_i^(k), b_i^(k)), 1 ≤ i ≤ N′ ≪ N, randomly in the mesh R.
  2. Get {x01, . . . , x0N′} with x0i ← P_i^(k).
EndFor
3. Draw random samples B*b, 1 ≤ b ≤ B, with replacement from B0.
For each B*b, 1 ≤ b ≤ B:
  4. Construct the classifiers C_k^{*b}, 1 ≤ k ≤ M (see Table 1)^a using the bootstrap sample.
  5. Collect the aggregated classification responses y*b_{i1,...,iK}(x).
EndFor
^a The choice of classifier type has little impact on the results.
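A schematic version of the protocol of Table 3, with the partial classifiers replaced by a toy 1-nearest-neighbour rule purely for illustration (all names, the feature subsampling and the classifier choice are our assumptions, since Table 3 leaves the classifier type open), could look as follows:

import numpy as np

rng = np.random.default_rng(0)

def train_bootstrap_ensemble(X, y, M=100, n_feat=32, B=25):
    # X: (n, N) feature vectors on the mesh R; y: (n,) labels; each partial classifier sees n_feat positions
    X, y = np.asarray(X, float), np.asarray(y)
    n, N = X.shape
    positions = [rng.choice(N, size=n_feat, replace=False) for _ in range(M)]     # steps 1-2
    ensemble = []
    for b in range(B):
        boot = rng.integers(0, n, size=n)                                         # step 3: resample with replacement
        for p in positions:
            ensemble.append((p, X[boot][:, p], y[boot]))                          # steps 4-5 (data kept for a 1-NN stand-in)
    return ensemble

def predict_majority(ensemble, x, K):
    votes = np.zeros(K, dtype=int)
    for p, Xp, yp in ensemble:
        votes[yp[np.argmin(np.linalg.norm(Xp - x[p], axis=1))]] += 1              # each partial classifier votes
    return int(np.argmax(votes))                                                  # final decision by majority vote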
At the end of the procedure in Table 3, a new observation x is classified by majority voting over the predictions of all classifiers.

Results. Table 4 compares the recognition rates of the DDD-based, CGD-based and mixed CGD+DDD classifiers on the original samples B0. It is apparent that the CGD feature has better recognition performance than the DDD feature, which indicates a better discriminating power. The combination of CGD-based and DDD-based classifiers was also tested and shows some further improvement. Additional tests show that these recognition rates improve in all cases when the size of the pattern vector is increased. Although 3% may appear to be a small increase, it should be borne in mind that even small percentage increases are difficult to obtain when the overall classification accuracy already exceeds 80%. We can therefore conclude that the bootstrap is a useful technique for improving the performance of a classifier, as the algorithm forces the classification to concentrate on those observations that are more difficult to classify. We can also conclude that the rank-based classifiers produce a significant improvement over the class-based classifiers. Note that no separate test step is necessary with the bootstrap.

Table 4. Comparison of recognition rates (%) with bootstrap standard deviation

classifier   rank-based (%)     class-based (%)
CGD          99.521 ± 3.07      96.65 ± 6.40
DDD          99.492 ± 22.36     96.55 ± 6.40
CGD+DDD      99.638 ± 11.0      96.98 ± 7.34
6 Conclusion
A nonparametric method for distribution-free classification was described and evaluated on biometric data. Our experimental results indicate that good classification performance is achieved, although there is room for improvement and the results are not yet comparable with the recognition rates reported for industrial devices (see [3]). Further work should address more advanced digital image processing and the use of colour images.
References

[1] U. Beyer and F. Smieja. Learning from examples, agent teams, and the concept of reflection. International Journal of Pattern Recognition and Artificial Intelligence, 1994.
[2] L. Breiman. Bagging predictors. Technical Report TR-421, Statistics Department, University of California, Berkeley, 1994.
[3] J. Daugman. High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. PAMI, 15(11), pp. 1148-1161, Nov. 1993.
[4] L. Devroye, L. Györfi and G. Lugosi. A Probabilistic Theory of Pattern Recognition, Applications of Mathematics, Springer, 1991.
[5] H. Drucker, C. Cortes, L. Jackel, Y. LeCun and V. Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6), pp. 1289-1301, 1994.
[6] B. Efron (1979). The convex hull of a random set of points. Biometrika, 52, pp. 331-342.
[7] Y. Freund, H. Seung, E. Shamir and N. Tishby. Information, prediction and query by committee. Neural Information Processing Systems, S. Hanson, J. Cowan, C. Giles (eds.), pp. 483-490, Denver, 1993.
[8] J. Gubner. Distributed estimation and quantization. IEEE Trans. on Information Theory, 39(4), pp. 1456-1459, 1993.
[9] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. on PAMI, 12, pp. 993-1001, 1990.
[10] L. K. Hansen, C. Liisberg and P. Salamon. Ensemble methods for recognition of handwritten digits. Proc. IEEE Signal Processing Workshop, S. Y. Kung, F. Fallside, J. A. Sorensen and C. A. Kamm (eds.), Piscataway, NJ, pp. 540-549, 1992.
[11] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, pp. 181-214, 1994.
[12] A. K. Jain. Fundamentals of Digital Image Processing, Prentice-Hall Inc., 1989.
[13] R. Kallel, M. Cottrell and V. Vigneron. Bootstrap for neural model selection. Proc. of the 8th European Symposium on Artificial Neural Networks, Bruges, 22-24 April, 2000.
[14] J. Kittler and M. Hatef. Improving recognition rates by classifier combination. Proc. IWFHR'96, pp. 81-101, 1996.
[15] A. N. Kolmogorov and S. V. Fomin. Elements of the Theory of Functions and Functional Analysis, Vol. 1, Graylock Press, 1957.
[16] J. P. Nadal, R. Legault and C. Suen. Complementary algorithms for the recognition of totally unconstrained handwritten numerals. 10th International Conference on Pattern Recognition, pp. 443-446, Atlantic City, 1990.
[17] I. S. Oh, J. S. Lee and C. Y. Suen. Analysis of class separation and combination of class-dependent features for handwriting recognition. IEEE Trans. on PAMI, 21(10), pp. 1099-1125, 1999.
[18] A. Hyvärinen, J. Karhunen and E. Oja. Independent Component Analysis, John Wiley & Sons, 2001.
[19] J. Pearl, D. Geiger and T. Verma. Conditional independence and its representations. Kybernetika, 25(2), 1989.
[20] B. D. Ripley. Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[21] C. M. Soares, C. L. Fróes da Silva, M. De Gregorio and F. M. G. França. A software implementation of the WISARD classifier. Proceedings of the Brazilian Symposium on Artificial Neural Networks, Belo Horizonte, MG, December 9-11, 1998, vol. II, pp. 225-229.
[22] K. Tumer and J. Ghosh. A framework for estimating performance improvements in hybrid pattern classifiers. Proc. World Congress on Neural Networks, San Diego, pp. 220-225, 1994.
[23] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27, pp. 1134-1142, 1984; reprinted in Shavlik & Dietterich, 1990.
[24] D. Wolpert. Stacked generalization. Neural Networks, 5(2), pp. 241-259, 1992.
Complete Signal Modeling and Score Normalization for Function-Based Dynamic Signature Verification

J. Ortega-Garcia, J. Fierrez-Aguilar, J. Martin-Rello, and J. Gonzalez-Rodriguez

Biometrics Research Lab., ATVS, Universidad Politécnica de Madrid, Spain
{jortega,jfierrez}@diac.upm.es
http://www.atvs.diac.upm.es
Abstract. In this contribution a function-based approach to on-line signature verification is presented. An initial set of 8 time sequences is used; then first and second time derivatives of each function are computed over these, so 24 time sequences are simultaneously considered. A valuable function normalization is applied as a preliminary stage to a continuous-density HMM-based complete signal modeling scheme of these 24 functions, so no derived statistical features are employed, fully exploiting in this manner the HMM modeling capabilities of the inherent time structure of the dynamic process. In the verification stage, scores are considered not as absolute but rather as relative values with respect to a reference population, permitting the use of a best-reference score-normalization technique. Results using the MCYT_Signature subcorpus on 50 clients are presented, attaining an outstanding best figure of 0.35% EER for skilled forgeries, when signer-dependent thresholds are considered.
1 Introduction
Automatic signature verification has been an intense research field because of the social and legal acceptance and widespread use of the written signature as a personal authentication method. Biometric recognition techniques have made possible notable improvements in the objective assessment of quantitative similarities between handwritten samples, leading to the development of automatic on-line signature verification systems [1], [2]. Nevertheless, in the last years only minor improvements in the kind of analysis, characteristic selection or performance evaluation have been reported [3]. The inherently behavioral nature of the on-line signing process makes the input information better suited to be considered as a random process rather than as a deterministic signal. This dynamic input information, acquired through a time-sampling procedure, must consequently be treated as discrete-time random sequences. Many feature-based approaches use these sequences to derive statistical parameters, but the use of complete sequences has so far yielded better
results [4], as reducing time sequences to just a few statistical features diminishes our ability to characterize this dynamic process precisely. In any case, this time-based sequence characterization is strongly related to the way in which a reference model is established, as a competitive modeling process able to cope with these complete random sequences is needed. Hidden Markov Models (HMMs) [5] have shown this capability for other behavioral biometric traits, like speech, outperforming other classical approaches such as distance measures, (weighted) cross-correlation or dynamic time warping (dynamic string matching). With respect to on-line signature recognition, HMMs have also been used [6]-[8], but not in all cases taking advantage of the complete sequences involved. Our recent work in the field [9] has been oriented to exploiting dynamic signature information as complete time sequences [10] by means of continuous-density HMMs, in order to derive function-based on-line signature verification systems. In the present contribution, the effects of: i) considering complete input sequences, ii) computing first and second time-sequence derivatives [11], iii) statistically normalizing the complete sequence set, and iv) considering HMM outputs as relative scores with respect to a reference population through a best-reference score normalization technique [12], are analyzed. Results using the MCYT_Signature database [13] are presented, yielding remarkable performance in both common and user-specific threshold settings [14].
2 MCYT_Signature Database
The number of existing large public databases oriented to the performance evaluation of recognition systems based specifically on signatures is quite limited. In this context, the MCYT project, oriented to the acquisition of multiple biometric traits (namely, fingerprints and signatures), was launched. The expected number of individuals in the database is roughly 450; about 350 of them are currently being acquired following a single-session procedure (although by the time this contribution was written, fewer than 100 were supervised and fully available), and about 100 more individuals are planned to be acquired by mid 2003 in a multi-session procedure, in order to incorporate intrinsic short-term signature variability, signature size variability, and over-the-shoulder and time-constrained forgeries. In order to acquire the dynamic signature sequences, a WACOM graphics tablet, model INTUOS A6 USB, has been employed. The graphics tablet resolution is 2,540 lines per inch (100 lines/mm), and the precision is ±0.25 mm. The maximum detection height is 10 mm (so pen-up movements are also considered), and the capture area is 127 × 97 mm. This tablet provides the following discrete-time dynamic sequences (the dynamic range of each sequence is specified): i) position in x-axis, xt: [0-12,700], corresponding to 0-127 mm; ii) position in y-axis, yt: [0-9,700], corresponding to 0-97 mm; iii) pressure pt applied by the pen: [0-1,024]; iv) azimuth angle γt of the pen with respect to the tablet: [0-3,600], corresponding to 0°-360°; and v) altitude angle φt of the pen with respect to the tablet: [300-900], corresponding to 30°-90°.
The sampling frequency of the acquired signals is set to 100 Hz, which satisfies the Nyquist sampling criterion, as the maximum frequencies of the related biomechanical sequences are always below 20-30 Hz. Each target user produces 25 genuine signatures, and 25 skilled forgeries are also captured for each user. These skilled forgeries are produced by the 5 subsequent target users by observing the static images of the signature to imitate, trying to copy it at least 10 times, and then producing, with natural dynamics, the valid acquired forgeries. In this way, shape-based highly skilled forgeries are usually obtained, as shown in Fig. 1. Following this procedure, user n (ordinal index) produces 5 samples of his/her genuine signature, and then 5 skilled forgeries of client n-1; then again 5 new samples of his/her genuine signature, and then 5 skilled forgeries of user n-2; this procedure is iterated by user n, alternating genuine signatures with imitations of previous users n-3, n-4 and n-5. Summarizing, user n finally produces 25 samples of his/her own signature (in groups of 5 samples) and 25 skilled forgeries (5 samples each of users n-1, n-2, n-3, n-4 and n-5). Vice versa, for user n, 25 skilled forgeries will be produced by users n+1, n+2, n+3, n+4, and n+5 (in groups of 5 samples each).
Fig. 1. From (a) to (c), genuine client signature samples; from (d) to (f), skilled forgeries, each produced by a different individual
3 Analysis of Complete Sequences

3.1 Function-Based Approach
Different approaches are considered in the literature to extract signature information; they can be divided into [1]: i) function-based approaches, in which signal processing methodology is applied to the dynamically acquired time sequences (i.e., velocity, acceleration, force or pressure), and ii) feature-based approaches, in which statistical parameters are derived from the acquired information; regarding the latter, different levels of description can be specified, so it is possible to use and combine shape-based global static (i.e., aspect ratio, center of mass or horizontal span ratio), global dynamic (i.e., total signature time, time-down ratio or average speed) or local (stroke direction, curvature or slope tangent) parameters. The behavioral nature of handwriting means that dynamic signal information should be considered as random processes rather than as deterministic signals; in order to adapt our processing techniques to this specific nature, and to better exploit the instantaneous dynamic information that the on-line acquisition process offers, the complete time functions sampled as time sequences by the acquisition device are considered in this contribution.
3.2 Basic Functions
The basic function-based representation of each signature consists of five time sequences xt, yt, pt, γt and φt, where t is the discrete time index, given by the input device and characterizing instantaneous dynamic properties of the signing process. Fig. 2(a) shows a genuine sample signature, and Fig. 2(b) shows the corresponding basic sequences associated with it. The horizontal axis reflects the number of acquired samples; in this particular case, more than 400 samples (corresponding to more than 4 s) of each sequence are acquired.
Fig. 2. On-line signature acquisition. (a) Client signature; (b) Basic dynamic signals obtained from the graphics tablet for the client signature in (a) during the handwriting process: position in x-axis, position in y-axis, pressure, azimuth, and altitude (from top to bottom, respectively)
3.3 Extended Functions
From the basic sequence set we have derived some other signal-based sequences. Previous results with other dynamic sequences (i.e., tangential acceleration, normal acceleration, or instantaneous displacement) [9] have shown good levels of performance; in the present contribution, three derived dynamic sequences have been used as extended functions, namely:

• Path-tangent angle: θt = tan⁻¹(ẏt / ẋt)
• Path velocity magnitude (speed): vt = (ẋt² + ẏt²)^(1/2)
• Log curvature radius: ρ*t = log ρt = log(vt / θ̇t)

in which log(·) is applied in order to reduce the dynamic range of the function values. In all cases, discrete time derivatives have been computed using the second-order regression procedure (described in the next subsection). Thus, regarding our function-based instantaneous vector set, including the 5 basic time sequences and the 3 extended ones, we get

ut = [xt, yt, pt, γt, φt, θt, vt, ρ*t],   1 ≤ t ≤ T,   (1)

where T is the time duration of the considered signature.
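The extended functions translate directly into code; in the sketch below a centred difference stands in for the second-order regression derivative described in Section 3.4, and the small eps guard is our own addition:

import numpy as np

def extended_functions(x, y, eps=1e-8):
    # x, y: pen position sequences (100 Hz). Returns theta_t, v_t, rho*_t of Eq. (1).
    xd, yd = np.gradient(x), np.gradient(y)           # centred-difference stand-in for the Eq. (2) regression
    theta = np.arctan2(yd, xd)                        # path-tangent angle
    v = np.sqrt(xd ** 2 + yd ** 2)                    # path velocity magnitude (speed)
    theta_d = np.gradient(np.unwrap(theta))           # unwrap to avoid 2*pi jumps before differentiating
    rho_star = np.log(v / (np.abs(theta_d) + eps) + eps)   # log curvature radius (eps guards against log 0)
    return theta, v, rho_star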
3.4 First and Second Time Derivatives
First and second time derivatives of the complete instantaneous function-based vector sets have been shown to be highly effective in taking into account the velocity and acceleration of change of each instantaneous vector set [11]. Because of the discrete nature of the above-mentioned functions, first time derivatives are calculated using a second-order regression, expressed through the operator Δ, namely:

ḟt ≈ Δft = Σ_{τ=1}^{2} τ (f_{t+τ} − f_{t−τ}) / ( 2 Σ_{τ=1}^{2} τ² ),   (2)

where ΔΔ is computed by applying equation (2) again, this time on Δ. In this way, each signature can be formally described as a global set V of time vectors, V = [v1 v2 … vT], where vt is a column vector including the 24 considered sequences:

vt = [ut, Δut, ΔΔut],   1 ≤ t ≤ T.   (3)
3.5 Statistical Signal Normalization

A final statistical normalization, oriented to obtain zero-mean and unit-standard-deviation function values, which has been shown to increase the verification performance [9], is applied to the global set V:

wt = vt − (1/T) Σ_{τ=1}^{T} vτ,   1 ≤ t ≤ T,   (4)

so that a new global normalized vector function set O = [o1 o2 … oT] is obtained through

o_t^n = w_t^n / sqrt( (1/(T−1)) Σ_{τ=1}^{T} (w_τ^n)² ),   1 ≤ t ≤ T, 1 ≤ n ≤ 24,   (5)

where w_t^n is the n-th component of vector wt. This means that the global vector set O comprises 24 statistically normalized complete time sequences; taking into account that the sampling rate is 100 samples/s, every dynamic signature is characterized by a parameter rate of 2,400 values/s.
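Eqs. (2)-(5) can be sketched as follows (padding at the sequence ends and the eps guard are our own choices; T and the 8 basic/extended components follow the text):

import numpy as np

def delta(F):
    # second-order regression derivative of Eq. (2), applied column-wise to F of shape (T, d)
    Fp = np.pad(F, ((2, 2), (0, 0)), mode='edge')
    T = len(F)
    num = sum(tau * (Fp[2 + tau:2 + tau + T] - Fp[2 - tau:2 - tau + T]) for tau in (1, 2))
    return num / (2 * (1 ** 2 + 2 ** 2))

def build_normalized_set(U):
    # U: (T, 8) matrix of u_t. Returns O of shape (T, 24) following Eqs. (3)-(5).
    V = np.hstack([U, delta(U), delta(delta(U))])               # v_t = [u_t, Δu_t, ΔΔu_t]
    W = V - V.mean(axis=0)                                      # Eq. (4): zero mean per component
    return W / np.sqrt((W ** 2).sum(axis=0) / (len(W) - 1) + 1e-12)   # Eq. (5): unit standard deviation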
4 Signature Modeling through Hidden Markov Models
In order to derive real recognition benefits in terms of enhanced verification performance, this complete function-based signature characterization process is strongly related to the way in which a reference model is established: a competitive modeling scheme, capable of coping with this global set of random sequences, with a strong underlying time-basis, is required. Hidden Markov Models (HMMs) [5] were introduced in the pattern recognition field as a robust method to model the variability of discrete time random signals, where time or context information is available.
HMMs have shown this capability for other behavioral biometric traits, like speech, outperforming other classical pattern recognition approaches: distance measures, (weighted) cross-correlation or dynamic time warping (dynamic string matching). With respect to on-line signature recognition, HMMs have also been used [6]-[8], although clear performance improvements have not always been documented, mainly due to reasons such as the use of incomplete sets of time sequences or the employment of discrete- instead of continuous-density models.
4.1 HMM Modeling of Time Sequences
Basically, the HMM models a doubly stochastic process governed by an underlying Markov chain with a finite number of states and a set of random functions, each of which is associated with the output observation of one state. At discrete instants of time t, the process is in one of the states and generates an observation symbol according to the random function corresponding to that state. The model is hidden in the sense that the underlying state which generated each symbol cannot be deduced from the symbol observation. An example of a left-to-right state transition topology considering a continuous-density random output function for each state is depicted in Fig. 3.
Fig. 3. HMM with left-to-right topology. The model has 4 states, with no transition skips between them; the continuous probability density functions associated with each state are also shown
Formally, an HMM is described by the following elements:

• N, the number of hidden states {S1, S2, …, SN}. The state at time t is denoted qt.
• A state transition matrix A = {aij}, where aij = Pr[qt+1 = Sj | qt = Si], 1 ≤ i, j ≤ N.
• The observation symbol probability density function in state j, bj(o), 1 ≤ j ≤ N.
• The initial state distribution π = {πi}, where πi = Pr[q1 = Si], 1 ≤ i ≤ N.
In this contribution, bj(o) has been modeled as a mixture of M multivariate Gaussian densities:

bj(o) = Σ_{m=1}^{M} cjm N(o, µjm, Ujm),   1 ≤ j ≤ N,   (6)
where N(o, µjm, Ujm) is a normal distribution with mean µjm and diagonal covariance matrix Ujm. Thus, the observation symbol density functions can be parameterized by the set B = {cjm, µjm, Ujm}, 1 ≤ j ≤ N, 1 ≤ m ≤ M. A particular HMM is described by the set λ = (π, A, B), which will represent the K reference signature samples of a given target user. The score of an input signature O = [o1 o2 … oT] claiming the identity λ is calculated as (1/T) · log(Pr[O | λ]) using the Viterbi algorithm, which considers just the locally optimal state sequence.
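For reference, the state-conditional log-density of Eq. (6) with diagonal covariances can be evaluated as in the sketch below; only the emission term is shown, not the full HMM training or Viterbi decoding, and all names are ours:

import numpy as np

def log_emission(o, c, mu, var):
    # log b_j(o) of Eq. (6) for one state: c (M,), mu (M, 24), var (M, 24) diagonal covariances, o (24,)
    log_gauss = -0.5 * (np.log(2 * np.pi * var) + (o - mu) ** 2 / var).sum(axis=1)
    return float(np.logaddexp.reduce(np.log(c) + log_gauss))     # log of the weighted mixture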
4.2 Score Normalization
If the verification problem is considered as a form of hypothesis test, it is reasonable for the signature verification system to score an input signature O and a given identity λi not with an absolute quantity related to Pr[O | λi], but rather with a (relative) measure of Pr(λi | O) / Pr(λ̄i | O), where λ̄i stands for "an antithetical identity with respect to λi". The underlying idea is that such normalization helps to separate client and impostor scores even further, and to group together what is most similar to non-client or impostor scores. This idea has proved effective in several biometric recognition systems [12]. The selection of the reference set representing the background population (an opposite identity with respect to the client), or cohort, is very important in practice, and better results are obtained if the cohort set for user λi is based on models similar to it. In our case, the implementation of the score normalization idea is based on

log Pr(O | λi) − max_{λ ∈ Ref, λ ≠ λi} log Pr(O | λ),   (7)
which, as stated above, improves the separation of client and impostor score distributions. The left term in (7) is directly related to the score provided by the HMM and will be referred to as the raw score. The right term in (7) is the normalization factor, where the cohort set is reduced in this case to just the maximum score of the input signature O against a reference set of signature-based identity models (or best reference) different from the claimed one. This best-reference score normalization stage has been incorporated in the implemented on-line signature verification system, and a comparison of no score normalization, best reference with a casual cohort (models of a separate population) and best reference with a skilled cohort (models of forgers of λi) will be performed.
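In code, the best-reference normalization of Eq. (7) amounts to subtracting the strongest cohort score (a minimal sketch; the threshold handling is only indicated):

def best_reference_score(raw_score, cohort_scores):
    # raw_score: (1/T) log Pr[O | lambda_i] for the claimed model;
    # cohort_scores: scores of the same signature against the reference (cohort) models
    return raw_score - max(cohort_scores)          # Eq. (7)

# decision: accept the claim if best_reference_score(...) exceeds a (global or user-specific) threshold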
5 System Performance
A HMM configuration with N=4 states, M=8 mixture densities per state and K=6 training reference samples has been used for the evaluation. This configuration provides, as demonstrated in [9], good generalization performance.
A testing sub-corpus of the MCYT database has been selected, consisting of 50 client signers, each providing 15 signatures and 15 skilled forgeries (from 3 forgers out of 5); as 6 genuine samples are used to train each client model, a total of 450 client and 750 skilled impostor attempts are considered for the global evaluation. Best-reference score normalization has been tested on the described subcorpus with two different types of normalization cohorts: casual and skilled. The casual cohort is composed of 50 external signers (a separate set, common to all users) from the MCYT_Signature subcorpus. The skilled cohort comprises samples from the 2 remaining forgers for each user, not previously included in the impostor attempts. Verification results for the above-mentioned sub-corpus, always considering skilled forgeries as impostor samples, are shown in Fig. 4 in the form of DET plots (type I vs. type II errors on a normal deviation scale) for no score normalization, best-reference score normalization with the casual cohort, and best-reference score normalization with the skilled cohort, in all cases using global decision thresholds; this representation clearly improves, in terms of separation between similar curves and precision near the origin, on the traditional ROC curves.
Fig. 4. DET plot showing verification results: for raw scores (no normalization) (solid line); for best reference normalization (BRef) using casual cohort (dashed line); and for (BRef) using skilled cohort (dotted line)
Global EER verification results in the case of a global decision threshold (common to all users), and average EER in the case of user-dependent thresholds [14], are shown with and without score normalization in Table 1.

Table 1. Signature Verification Results (Skilled Forgeries)

EER                       Raw scores   Best Reference (Casual Cohort)   Best Reference (Skilled Cohort)
Global Threshold          4.83 %       1.75 %                           1.21 %
User-Specific Threshold   0.98 %       0.56 %                           0.35 %
6 Conclusions and Future Work
The use of a complete function-based sequence set, considering Δ and ΔΔ time-derived sequences, with statistical normalization of the function values, modeling through continuous-density HMMs, and employing score normalization techniques, has been shown to produce notable performance improvements in on-line signature verification; in our case, from a 4.83% EER obtained by using a global threshold on the described database, we decrease to 1.75% when best-reference normalization with a casual cohort set is used. When we use skilled-cohort normalization, the global EER goes down to just 1.21%, due to the good adaptation of the reference cohort to the skilled forgery attempts. On the other hand, if we consider user-specific (or user-dependent) thresholds, raw scores go down to a 0.98% average EER, as thresholds are specifically adapted to each client's particular score distribution. Applying best-reference normalization in this case, we obtain outstanding results of 0.56% average EER for the casual cohort, and 0.35% average EER for the skilled cohort. These final results are highly remarkable, especially taking into account the state-of-the-art on-line signature verification performance levels [3]. Future work will include unimodal fusion of our function-based approach with feature-based approaches, in order to exploit both perspectives. Also, results on the entire MCYT_Signature subcorpus, considering short-term signature variability, signature size variability, and over-the-shoulder and time-constrained forgeries, will be explored.
Acknowledgment

The authors wish to thank the Spanish Ministry of Science and Technology for supporting this research under project TIC2000-1669-C04-01.
References

[1] R. Plamondon and G. Lorette, "Automatic Signature Verification and Writer Identification - The State of the Art", Pattern Recognition, vol. 22, no. 2, pp. 107-131, 1989.
[2] F. Leclerc and R. Plamondon, "Automatic Signature Verification: The State of the Art, 1989-1993", Intl. Jour. of Pattern Rec. and Machine Intell., vol. 8, no. 3, pp. 643-660, 1994.
[3] R. Plamondon and S. N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey", IEEE Trans. Pattern Anal. and Machine Intell., vol. 22, no. 1, pp. 63-84, Jan. 2000.
[4] M. Parizeau and R. Plamondon, "A Comparative Analysis of Regional Correlation, Dynamic Time Warping, and Skeletal Tree Matching for Signature Verification", IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 12, no. 7, pp. 710-717, July 1990.
[5] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[6] J. G. A. Dolfing, E. H. L. Aarts, and J. J. G. M. van Oosterhout, "On-line Signature Verification with Hidden Markov Models", Proc. of the IEEE 14th International Conference on Pattern Recognition, pp. 1309-1312, Brisbane, Australia, August 1998.
[7] L. Yang, B. K. Widjaja, and R. Prasad, "Application of Hidden Markov Models for Signature Verification", Pattern Recognition, vol. 28, no. 2, pp. 161-170, 1995.
[8] R. S. Kashi, J. Hu, W. L. Nelson, W. Turin, "On-line Handwritten Signature Verification using Hidden Markov Model Features", Proc. of the 4th Intl. Conf. on Document Analysis and Recognition, vol. 1, pp. 253-257, 1997.
[9] J. Ortega-Garcia, J. Gonzalez-Rodriguez, D. Simon-Zorita, and S. Cruz-Llanas, "From Biometrics Technology to Applications regarding Face, Voice, Signature and Fingerprint Recognition Systems", in Biometrics Solutions for Authentication in an E-World (D. Zhang, ed.), pp. 289-337, Kluwer Academic Publishers, July 2002.
[10] W. Nelson, W. Turin, and T. Hastie, "Statistical Methods for On-Line Signature Verification", Int. Journ. Pattern Rec. and Mach. Intell., vol. 8, no. 3, pp. 749-770, 1994.
[11] F. K. Soong and A. E. Rosenberg, "On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition", IEEE Trans. on Acoust., Speech and Signal Proc., vol. ASSP-36, no. 6, pp. 871-879, 1988.
[12] S. Furui, "An Overview of Speaker Recognition Technology", ESCA Workshop on Automatic Speaker Recognition, Martigny (Switzerland), pp. 1-9, April 1994.
[13] J. Ortega-Garcia, et al., "MCYT: A Multimodal Biometric Database", Proc. of COST-275 Biometric Recognition Workshop, Rome (Italy), Nov. 2002. Available at: http://
[14] A. K. Jain, F. D. Griess, S. D. Connell, "On-Line Signature Verification", Pattern Recognition, vol. 35, no. 12, pp. 2963-2972, Dec. 2002.
Personal Verification Using Palmprint and Hand Geometry Biometric

Ajay Kumar¹, David C. M. Wong¹, Helen C. Shen¹, and Anil K. Jain²

¹ Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
{ajaykr,csdavid,helens}@cs.ust.hk
² Pattern Recognition and Image Processing Lab, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824.
[email protected]
Abstract. A new approach for personal identification using hand images is presented. This paper attempts to improve the performance of a palmprint-based verification system by integrating hand geometry features. Unlike other bimodal biometric systems, the user does not have to undergo the inconvenience of passing through two sensors, since the palmprint and hand geometry features are acquired from the same image, using a digital camera, at the same time. Each of these gray-level images is aligned and then used to extract palmprint and hand geometry features. These features are then examined for their individual and combined performance. The image acquisition setup used in this work is inherently simple; it does not employ any special illumination, nor does it use any pegs that might cause inconvenience to users. Our experimental results on an image dataset from 100 users confirm the utility of combining hand geometry features with those from palmprints, and achieve promising results with a simple image acquisition setup.
1 Introduction
Reliable personal authentication is key to security in the networked society. Physiological characteristics of humans, i.e. biometrics, are typically time-invariant, easy to acquire, and unique to every individual. Biometric features such as the face, iris, fingerprint, hand geometry, palmprint and signature have therefore been suggested for security in access control. Most of the current research in biometrics has focused on fingerprints and faces [1]. The reliability of personal identification using the face is currently low, as researchers today continue to grapple with the problems of pose, makeup, orientation and gesture [2]. Fingerprint identification is widely used in personal identification as it works well in most cases. However, it is difficult to acquire unique fingerprint features, i.e. minutiae, for some classes of persons such as manual laborers and elderly people. Therefore some researchers have
recently investigated the utility of palmprint features for personal authentication. Moreover, additional biometric features, such as palmprints, can easily be integrated with an existing authentication system to provide an enhanced level of confidence in personal authentication.
1.1 Prior Work
Two kinds of biometric indicators can be extracted from low-resolution² hand images: (i) palmprint features, which are composed of principal lines, wrinkles, minutiae, delta points, etc., and (ii) hand geometry features, which include the area/size of the palm and the lengths and widths of the fingers. The problem of personal verification using palmprint features has drawn considerable attention, and researchers have proposed various methods [3]-[15] for the verification of palmprints. Palmprints can be considered as a texture that is random rather than uniform, but unique to every individual. Therefore texture analysis of palmprint images using Gabor filters [3], wavelets [4], the Fourier transform [5], and local texture energy [6] has been proposed in the literature. Compared to fingerprints, palmprints have a large number of creases. Wu et al. [7] have characterized these creases by directional line-energy features and used them for palmprint identification. The endpoints of some prominent principal lines, i.e. the heart line, head line, and life line, are rotation-invariant. Therefore the authors of [8]-[9] have used these endpoints and midpoints for the registration of geometrical and structural features of principal lines for palmprint matching. Duta et al. [10] have suggested that the connectivity of extracted palm lines is not important; therefore they use a set of feature points along the prominent palm lines, instead of the extracted palm lines as in [9], to generate the matching score for palmprint authentication. The palmprint pattern also contains ridges and minutiae, similar to a fingerprint pattern. However, in palmprints the creases and ridges often overlap and cross each other. Therefore Funda et al. [11] have suggested the extraction of local palmprint features, i.e. ridges, by eliminating the creases. However, this work [11] is limited to the extraction of ridges and does not go on to demonstrate the success of these extracted ridges in the identification of palmprints. Chen et al. [12] have attempted to estimate palmprint crease points by generating a local gray-level directional map. These crease points are connected together to isolate the creases in the form of line segments, which are used in the matching process. However, the method suggested in [12] is not detailed enough to establish the robustness of these partially extracted creases for the matching of palmprints. Some related work on palmprint verification also appears in [13] and [14]. A recent paper by Han et al. [15] uses morphological and Sobel edge features to characterize palmprints and a trained neural network classifier for their verification. The palmprint authentication in [5]-[12] utilizes inked palmprint images, while the recent work in [4], [15] has shown the utility of inkless palmprint images acquired with a digital scanner. Some promising results on palmprint images acquired from an image acquisition system using a CCD-based digital camera appear in [3] and [13].
² High-resolution hand images, on the order of 500 dpi, can also be used to extract fingerprint features. However, a database of such images would require large storage and processing resources, and this is left for future work.
The US patent office has issued several patents [16]-[19] for devices that measure hand geometry features for personal verification. Some related work using low-resolution digital hand images appears in [20] and [21]. The authors have used fixation pegs to restrict hand movement, but have shown promising results. However, the results in [20]-[21] may be biased by the small size of the database, and an imposter can easily violate the integrity of such a system by using a fake hand [22].
1.2 Proposed System
The palmprint and hand geometry images can be extracted from a hand image in a single shot at the same time. Unlike other multibiometric systems (e.g. face and fingerprint [23], voice and face [24], etc.), a user does not have to undergo the inconvenience of passing through multiple sensors. Furthermore, the fraud associated with a fake hand in hand-geometry-based verification systems can be checked by the integration of palmprint features. This paper presents a new method of personal authentication using palmprint and hand geometry features simultaneously acquired from a single hand image. The block diagram of the proposed verification system is shown in figure 1. The hand images of every user are used to automatically extract the palmprint and hand geometry features. This is achieved by first thresholding the images acquired from the digital camera. The resulting binary image is used to estimate the orientation of the hand, since in the absence of pegs users do not necessarily align their hand in a preferred direction. The rotated binary image is used to compute hand geometry features. This image also serves to estimate the center of the palmprint from the residue of a morphological erosion with a known structuring element (SE). This center point is used to extract a palmprint image of fixed size from the rotated gray-level hand image. Each of these palmprint images is then used to extract significant features. Thus the palmprint and hand geometry features of an individual are obtained from the same hand image. Two schemes for the fusion of features, fusion at the decision level and at the representation level, were considered. The decision-level fusion gave better results, as detailed in Section 5.
Fig. 1. Block diagram of the Personal Verification System using Palmprint and Hand Geometry
Fig. 2. Acquisition of a typical image sample using digital camera
2 Image Acquisition & Alignment
Our image acquisition setup is inherently simple and does not employ any special illumination (as in [3]), nor does it use any pegs that might cause inconvenience to users (as in [20]). An Olympus C-3020 digital camera (1280 × 960 pixels) was used to acquire the hand images, as shown in figure 2. The users were only requested to make sure that (i) their fingers do not touch each other and (ii) much of the back side of their hand touches the imaging table.

2.1 Extraction of Hand Geometry Images
Each of the acquired images has to be aligned in a preferred direction so as to capture similar features for matching. An image thresholding operation is used to obtain the binary hand-shape image. The threshold value is automatically computed using Otsu's method [25]. Since the image background is stable (black), the threshold can be computed once and used subsequently for other images. The binarized oval shape of the hand can be approximated by an ellipse, whose best-fit parameters, for a given binary hand shape, are computed on the basis of moments [26]. The orientation of the binarized hand image is approximated by the major axis of the ellipse, and the required angle of rotation is the difference between the normal and the orientation of the image. As shown in figure 3, the binarized image is rotated and used for computing the hand geometry features. The estimated orientation of the binarized image is also used to rotate the gray-level hand image, from which the palmprint image is extracted as detailed in the next subsection.
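A sketch of the binarization and orientation estimation with OpenCV is given below; the moment-based angle is the standard orientation of the equivalent ellipse, and the rotation call in the trailing comment is only indicative:

import cv2
import numpy as np

def binarize_and_orient(gray):
    # gray: 8-bit hand image on a dark background; returns the Otsu mask and its orientation in degrees
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    m = cv2.moments(mask, binaryImage=True)
    theta = 0.5 * np.arctan2(2.0 * m['mu11'], m['mu20'] - m['mu02'])   # major-axis orientation of the best-fit ellipse
    return mask, np.degrees(theta)

# the binary and gray-level images can then be rotated by the required angle, e.g.
# M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0); rotated = cv2.warpAffine(img, M, (w, h))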
Fig. 3. Extraction of two biometric modalities from the hand image, (a) captured image from the digital camera, (b) binarized image and ellipse fitting to compute the orientation (c) binary image after rotation, (d) gray scale image after rotation (e) ROI, i.e. palmprint, obtained from the erosion of (c) with SE, (f) segmented palmprint from (e)
2.2 Extraction of Palmprint Images
Every binarized hand-shape image is subjected to morphological erosion, with a known binary structuring element, to compute the region of interest, i.e. the palmprint. Let R be the set of non-zero pixels in a given binary image and SE be the set of non-zero pixels of the structuring element. The morphological erosion is defined as

R ⊖ SE = {g : SE_g ⊆ R},   (1)
where SE_g denotes the structuring element with its reference point shifted by g pixels. A square structuring element (SE) is used to probe the composite binarized image. The center of the binary hand image after erosion, i.e. the center of the rectangle that encloses the residue, is determined. These center coordinates are used to extract a square palmprint region of fixed size, as shown in figure 3.
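The erosion-based ROI localization can be sketched as below; the structuring-element size is an illustrative choice (the paper does not state it), while the 300 × 300 crop follows the value given in Section 5:

import cv2
import numpy as np

def extract_palmprint(mask, gray, size=300, se_side=121):
    # erode the rotated binary mask with a square SE and crop a size x size palm region (se_side is illustrative)
    residue = cv2.erode(mask, np.ones((se_side, se_side), np.uint8))       # Eq. (1)
    ys, xs = np.nonzero(residue)
    cy = (ys.min() + ys.max()) // 2                                        # centre of the enclosing rectangle
    cx = (xs.min() + xs.max()) // 2
    half = size // 2
    return gray[cy - half:cy + half, cx - half:cx + half]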
2.3 Normalization of Palmprints
The extracted palmprint images are normalized to have a pre-specified mean and variance. The normalization is used to reduce possible imperfections in the image due to sensor noise and non-uniform illumination. The normalization method employed in this work is the same as that suggested in [27] and is sufficient for the quality of the acquired images. Let the gray level at (x, y) in a palmprint image be denoted I(x, y), and let φ and ρ be the mean and variance of the image, computed from the gray levels of its pixels. The normalized image I′(x, y) is computed pixel-wise as follows:

I′(x, y) = φd + λ   if I(x, y) > φ,
I′(x, y) = φd − λ   otherwise,      where λ = sqrt( ρd {I(x, y) − φ}² / ρ ),   (2)
where φd and ρd are the desired values for the mean and variance. These values are pre-tuned according to the image characteristics I(x, y). In all our experiments, the values of φd and ρd were fixed to 100. Figure 4 (a)-(b) shows a typical palmprint image before and after normalization.
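Eq. (2) is a pixel-wise operation and vectorizes directly; a minimal sketch with φd = ρd = 100 as stated above:

import numpy as np

def normalize_palmprint(img, mean_d=100.0, var_d=100.0):
    # pixel-wise normalization of Eq. (2) to the pre-specified mean and variance (both 100 here)
    img = img.astype(float)
    mean, var = img.mean(), img.var()
    lam = np.sqrt(var_d * (img - mean) ** 2 / max(var, 1e-12))
    return np.where(img > mean, mean_d + lam, mean_d - lam)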
3 Feature Extraction

3.1 Extraction of Palmprint Features
The palmprint pattern is mainly made up of palm lines, i.e. principal lines and creases. Line feature matching [8], [15] is reported to be powerful and to offer high accuracy in palmprint verification. However, it is very difficult to accurately characterize these palm lines, i.e. their magnitude and direction, in noisy images. Therefore a robust but simple method is used here. Line features are detected from the normalized palmprint images using four directional line-detector masks [28]. Each of these masks detects lines oriented at 0° (h1), 45° (h2), 90° (h3) or 135° (h4). The spatial extent of these masks was empirically fixed at 9 × 9. Each of these masks is used to filter I′(x, y) as follows:

I1(x, y) = h1 * I′(x, y),   (3)
where '*' denotes discrete 2D convolution. The four filtered images I1(x, y), I2(x, y), I3(x, y), and I4(x, y) are then used to generate a final image If(x, y) by gray-level voting.
Fig. 4. Palmprint feature extraction; (a) segmented image, (b) image after normalization, filtered images with directional mask at orientation 0o in (c), 90o in (d), 45o in (e), 135o in (f), (g) image after voting, and (h) features extracted from each of the overlapping blocks
I f ( x, y ) = max { I1 ( x, y ) , I 2 ( x, y ) , I 3 ( x, y ) , I 4 ( x, y ) }
(4)
The resultant image represents the combined directional map of the palmprint I(x, y). This image If(x, y) is characterized by a set of localized features, i.e. standard deviations, and used for verification. If(x, y) is divided into a set of n overlapping blocks, and the standard deviation of gray levels in each of these blocks is used to form the feature vector

v_palm = {σ1, σ2, …, σn}
(5)
where σ1 is the standard deviation of gray levels in the first overlapping block (figure 4(h)).
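The following sketch summarises Section 3.1 (Equations (3)-(5)). The particular 9 × 9 masks shown are simple stand-ins for the line detectors of [28], and the block tiling is indicative only; the paper reports 144 overlapping blocks for a 300 × 300 image.

```python
import numpy as np
from scipy.signal import convolve2d

def palmprint_features(norm_img, mask_size=9, block=24, overlap=6):
    """Directional line filtering (Eq. 3), gray-level voting (Eq. 4)
    and block-wise standard deviation features (Eq. 5)."""
    # Illustrative 9x9 masks: horizontal, vertical and two diagonal lines.
    base = np.zeros((mask_size, mask_size)); base[mask_size // 2, :] = 1.0
    masks = [base, base.T, np.eye(mask_size), np.eye(mask_size)[::-1]]
    masks = [m - m.mean() for m in masks]                 # zero-mean line masks

    filtered = [convolve2d(norm_img, m, mode='same') for m in masks]   # Eq. (3)
    i_f = np.max(np.stack(filtered), axis=0)                           # Eq. (4)

    # Standard deviation in overlapping blocks (stride = block - overlap).
    stride = block - overlap
    feats = []
    for r in range(0, i_f.shape[0] - block + 1, stride):
        for c in range(0, i_f.shape[1] - block + 1, stride):
            feats.append(i_f[r:r + block, c:c + block].std())
    return np.asarray(feats)                                           # v_palm, Eq. (5)
```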
3.2
Extraction of Hand Geometry Features
The binary image³, as shown in figure 3(c), is used to compute significant hand geometry features. A total of 16 hand geometry features were used (figure 5): 4 finger
³ This work uses the palm side of the hand images to compute hand geometry features, while prior work [20]-[21] uses the other side of the hand images.
lengths, 8 finger widths (2 widths per finger), palm width, palm length, hand area, and hand length. Thus the hand geometry of every hand image is characterized by a feature vector v_hg of length 1 × 16.
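Only a partial sketch of these measurements is practical here, since the per-finger lengths and widths require finger-tip and valley localisation; the global quantities below are illustrative and the function name is ours.

```python
import numpy as np

def hand_geometry_features(binary_hand):
    """Partial, illustrative sketch of the hand-geometry measurements.
    Only global measurements (hand area, hand length, palm width) are
    computed; the 4 finger lengths and 8 finger widths used in the paper
    need finger landmark detection, which is not shown."""
    mask = binary_hand > 0
    ys, xs = np.nonzero(mask)

    hand_area   = float(mask.sum())                # number of hand pixels
    hand_length = float(ys.max() - ys.min() + 1)   # extent along the fingers
    palm_width  = float(mask.sum(axis=1).max())    # widest row of the hand mask
    return np.array([hand_area, hand_length, palm_width])
```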
Fig. 5. Hand Geometry Feature Extraction
4
Information Fusion and Matching Criterion
Multiple pieces of evidence can be combined by a number of information fusion strategies that have been proposed in the literature [29]-[31]. In the context of biometrics, three levels of information fusion have been suggested: (i) fusion at the representation level, where the feature vectors of multiple biometrics are concatenated to form a combined feature vector, (ii) fusion at the decision level, where the decision scores of multiple biometric systems are combined to generate a final decision score, and (iii) fusion at the abstract level, where multiple decisions from multiple biometric systems are consolidated [31]. The first two fusion schemes, i.e. (i) and (ii), are more relevant for a bimodal biometric system and were considered in this work. The similarity measure between v1 (the feature vector from the user) and v2 (the stored template of the claimed identity) is used as the matching score and is computed as follows:

α = Σ v1 v2 / sqrt( Σ v1² Σ v2² )                                       (6)
The similarity measure defined in the above equation computes the normalized correlation between the feature vectors v1 and v2. During verification, a user is required to indicate his/her identity. If the matching score (6) is less than some desired threshold, the user is assumed to be an imposter; otherwise he/she is decided to be genuine.
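The matching and the max-rule decision fusion used later in the experiments can be sketched as follows; the function names are ours and the threshold is assumed to be supplied externally.

```python
import numpy as np

def matching_score(v1, v2):
    """Normalized correlation of Eq. (6) between a live feature vector
    v1 and the stored template v2 of the claimed identity."""
    return np.dot(v1, v2) / np.sqrt(np.dot(v1, v1) * np.dot(v2, v2))

def verify(v_query, v_template, threshold):
    """Accept the claim as genuine when the score reaches the threshold."""
    return matching_score(v_query, v_template) >= threshold

def decision_level_fusion(score_palm, score_hand):
    """Max-rule decision-level fusion: the higher of the two modality
    scores is taken as the combined score."""
    return max(score_palm, score_hand)
```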
5
Experiments and Results
The experiments reported in this paper utilize inkless palmprint images obtained from a digital camera, as discussed in section 2. We collected 1000 hand images, 10 samples from each user, for 100 users. The first five images from each user were used for training and the rest were used for testing. The palmprint images, of size 300 × 300 pixels, were automatically extracted from each of these images as described in section 2.2. Each of the palmprint images was divided into 144 overlapping blocks of size 24 × 24 pixels, with an overlap of 6 pixels (25 %). Thus a 1 × 144 feature vector was obtained from every palmprint image. Figure 6 shows the distribution of imposter and genuine matching scores from the palmprint and hand geometry. The receiver operating characteristics for three distinct cases, (i) hand geometry alone, (ii) palmprint alone, and (iii) decision level fusion with the max rule, i.e. the higher of the similarity measures from hand geometry and palmprint, are shown in figure 7.
Fig. 6. Distribution of genuine and imposter scores from the two biometrics
Fig. 7. Comparative performance of palmprint and geometry features (on 500 images) using decision level fusion
Some users failed to touch their palm/fingers to the imaging board. It was difficult to use such images, mainly due to the change in scale, and these images were marked as bad. A total of 28 such images were identified and removed. The FAR and FRR scores for the 472 test images, using total minimum error as the criterion, are shown in table 1. The comparative performance of the two fusion schemes is displayed in figure 8. The cumulative distribution of combined matching scores for the two classes, using decision level fusion (max rule), is shown in figure 9.

Table 1. Performance scores for total minimum error on 472 test images

                           FAR       FRR       Decision Threshold
Palmprint                  4.49 %    2.04 %    0.9830
Hand Geometry              5.29 %    8.34 %    0.9314
Fusion at Representation   5.08 %    2.25 %    0.9869
Fusion at Decision         0 %       1.41 %    0.9840
Fig. 8. Comparative performance of the two fusion schemes on 472 test images
6
Fig. 9. Cumulative distribution of the two classes of similarity scores for 472 test images
Conclusions
The objective of this work was to investigate the integration of palmprint and hand geometry features, to achieve higher performance than may be possible with a single biometric indicator alone. The results obtained in figure 6, from 100 users, confirm our objective. These results should be seen in the context of the simple image acquisition setup; further improvement in performance, in the presence of a controlled illumination/environment, is intuitively expected. The achieved results are significant since the two biometric traits were derived from the same image, unlike other bimodal biometric systems, which require two different sensors/images. Our results also show that the decision level fusion scheme, with the max rule, achieves better performance than fusion at the representation level.
References
[1] A. K. Jain, R. Bolle, and S. Pankanti, Biometrics: Personal Identification in Networked Society, Kluwer Academic, 1999.
[2] M.-H. Yang, D. J. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Trans. Patt. Anal. Machine Intell., vol. 24, pp. 34-58, Jan. 2002.
[3] W. K. Kong and D. Zhang, “Palmprint texture analysis based on low-resolution images for personal authentication,” Proc. ICPR-2002, Quebec City (Canada).
[4] A. Kumar and H. C. Shen, “Recognition of palmprints using wavelet-based features,” Proc. Intl. Conf. Sys., Cybern., SCI-2002, Orlando, Florida, Jul. 2002.
[5] W. Li, D. Zhang, and Z. Xu, “Palmprint identification by Fourier transform,” Int. J. Patt. Recognit. Art. Intell., vol. 16, no. 4, pp. 417-432, 2002.
[6] J. You, W. Li, and D. Zhang, “Hierarchical palmprint identification via multiple feature extraction,” Pattern Recognit., vol. 35, pp. 847-859, 2002.
[7] X. Wu, K. Wang, and D. Zhang, “Fuzzy directional energy element based palmprint identification,” Proc. ICPR-2002, Quebec City (Canada).
[8] W. Shu and D. Zhang, “Automated personal identification by palmprint,” Opt. Eng., vol. 37, no. 8, pp. 2359-2362, Aug. 1998.
[9] D. Zhang and W. Shu, “Two novel characteristics in palmprint verification: datum point invariance and line feature matching,” Pattern Recognit., vol. 32, no. 4, pp. 691-702, Apr. 1999.
[10] N. Duta, A. K. Jain, and K. V. Mardia, “Matching of palmprints,” Pattern Recognit. Lett., vol. 23, no. 4, pp. 477-485, Feb. 2002.
[11] J. Funada, N. Ohta, M. Mizoguchi, T. Temma, K. Nakanishi, A. Murai, T. Sugiuchi, T. Wakabayashi, and Y. Yamada, “Feature extraction method for palmprint considering elimination of creases,” Proc. 14th Intl. Conf. Pattern Recognit., vol. 2, pp. 1849-1854, Aug. 1998.
[12] J. Chen, C. Zhang, and G. Rong, “Palmprint recognition using crease,” Proc. Intl. Conf. Image Process., pp. 234-237, Oct. 2001.
[13] D. G. Joshi, Y. V. Rao, S. Kar, and V. Kumar, “Computer vision based approach to personal identification using finger crease pattern,” Pattern Recognit., vol. 31, no. 1, pp. 15-22, 1998.
[14] S. Y. Kung, S. H. Lin, and M. Fang, “A neural network based approach to face/palm recognition,” Proc. Intl. Conf. Neural Networks, pp. 323-332, 1995.
[15] C.-C. Han, H.-L. Cheng, C.-L. Lin, and K.-C. Fan, “Personal authentication using palm-print features,” Pattern Recognit., vol. 36, pp. 371-381, 2003.
[16] D. P. Sidlauskas, “3D hand profile identification apparatus,” U. S. Patent No. 4736203, 1988.
[17] I. H. Jacoby, A. J. Giordano, and W. H. Fioretti, “Personal identification apparatus,” U. S. Patent No. 3648240, 1972.
[18] R. P. Miller, “Finger dimension comparison identification system,” U. S. Patent No. 3576538, 1971.
[19] R. H. Ernst, “Hand ID system,” U. S. Patent No. 3576537, 1971.
[20] R. Sanchez-Reillo, C. Sanchez-Avila, and A. Gonzales-Marcos, “Biometric identification through hand geometry measurements,” IEEE Trans. Patt. Anal. Machine Intell., vol. 22, no. 10, pp. 1168-1171, 2000.
[21] A. K. Jain, A. Ross, and S. Pankanti, “A prototype hand geometry based verification system,” Proc. 2nd Intl. Conf. Audio Video based Biometric Personal Authentication, Washington D. C., pp. 166-171, Mar. 1999.
[22] B. Miller, “Vital signs of identity,” IEEE Spectrum, vol. 32, no. 2, pp. 22-30, 1994.
[23] L. Hong and A. K. Jain, “Integrating face and fingerprint for personal identification,” IEEE Trans. Patt. Anal. Machine Intell., vol. 20, pp. 1295-1307, Dec. 1998.
[24] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, “Fusion of face and speech data for person identity verification,” IEEE Trans. Neural Networks, vol. 10, pp. 1065-1074, 1999.
[25] N. Otsu, “A threshold selection method from gray-scale histogram,” IEEE Trans. Syst., Man, Cybern., vol. 8, pp. 62-66, 1978.
[26] S. Baskan, M. M. Bulut, and V. Atalay, “Projection based method for segmentation of human face and its evaluation,” Pattern Recognit. Lett., vol. 23, pp. 1623-1629, 2002.
[27] L. Hong, Y. Wan, and A. K. Jain, “Fingerprint image enhancement: Algorithm and performance evaluation,” IEEE Trans. Patt. Anal. Machine Intell., vol. 20, pp. 777-789, Aug. 1998.
[28] J. R. Parker, Algorithms for Image Processing and Computer Vision, John Wiley & Sons, 1997.
[29] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Patt. Anal. Machine Intell., vol. 20, pp. 226-239, Mar. 1998.
[30] S. Prabhakar and A. K. Jain, “Decision level fusion in fingerprint verification,” Pattern Recognit., vol. 35, pp. 861-874, 2002.
[31] A. Ross, A. K. Jain, and J.-Z. Qian, “Information fusion in biometrics,” Proc. AVBPA’01, Halmstad, Sweden, pp. 354-359, Jun. 2001.
A Set of Novel Features for Writer Identification Caroline Hertel and Horst Bunke Department of Computer Science, University of Bern Neubrückstrasse 10, CH-3012 Bern, Switzerland
[email protected] Abstract. A system for writer identification is described in this paper. It first segments a given page of handwritten text into individual lines and then extracts a set of features from each line. These features are subsequently used in a k-nearest-neighbor classifier that compares the feature vector extracted from a given input text to a number of prototype vectors coming from writers with known identity. The proposed method has been tested on a database holding pages of handwritten text produced by 50 writers. On this database a recognition rate of about 90% has been achieved using a single line of handwritten text as input. The recognition rate is increased to almost 100% if a whole page of text is provided to the system. Keywords: personal identification; handwriting analysis; writer identification; feature extraction; k-nearest neighbor classifier.
1
Introduction
The identification of persons based on biometric measurements has become a very active area of research [1, 2, 3]. Many biometric modalities, including facial images, fingerprints, retina patterns, voice, signature, and others have been investigated. In the present paper we consider the problem of personal identification using samples of handwritten text. The objective is to identify the writer of one or several given lines of handwritten text. In contrast with signature verification [4] where the identity of an individual is established based on a predefined, short sequence of characters, the methods proposed in the present paper are completely text-independent. That is, any text consisting of one or a few lines may be used to establish the identity of the writer. In particular, we don’t suppose that the meaning (i.e. the ASCII transcription) of the given handwritten text is known. In contrast with signature verification, which is often performed in the on-line mode (where the writer is connected to the system via an electronic pen or a mouse and the writing is recorded as a time-dependent process) we assume off-line handwritten text as input modality. That is, only an image of the handwriting is available, without any temporal information. Applications of the proposed approach are forensic writer identification, the retrieval of handwritten documents from a database, or authorship determination of historical manuscripts. J. Kittler and M.S. Nixon (Eds.): AVBPA 2003, LNCS 2688, pp. 679–687, 2003. c Springer-Verlag Berlin Heidelberg 2003
For a survey covering work in automatic writer identification and signature verification until the end of the 1980’s see [4]. An extension including work until 1993 has been published in [5]. In [6] a system for writer identification using textural features derived from the grey-level co-occurrence matrix and Gabor filters is described. For this method whole pages of handwritten text are needed. Similarly, in [7, 8] a system for writer verification is described. It takes two pages of handwritten text as input and determines if they have been produced by the same writer. The features used to characterize a page of text include writing slant and skew, character height, stroke width, frequency of loops and blobs, and others. Morphological features obtained from transforming the projection of the thinned writing are computed in [9]. In this approach only single words are used to establish the identity of a writer. In contrast to [4, 5, 6, 7, 8, 9] the method proposed in [10] works on an intermediate level using text lines as basic input units from which features are computed. In the current paper a continuation and extension of this work is presented. The novel contribution of the paper is a significantly extended set of features that are suitable to characterize an individual’s handwriting. This new set of features has been tested on a data set that is an extension of the data set described in [10] from 20 to 50 writers. On this extended data set a recognition rate of about 90% has been achieved using a single line of handwritten text as input. The recognition rate is increased to almost 100% if a whole page of text is provided to the system. The remainder of the paper is organized as follows. In the next section the new set of features is introduced. Then a series of experiments with the new features are described in Sect. 3. Finally, conclusions are drawn in Sect. 4.
2
Features for Writer Identification
In this paper four groups of novel features for writer identification are introduced. They will be described in the following four sub-sections. Additionally some features that have already been used in other systems [7, 8, 10] will be briefly sketched in Sect. 2.5. Throughout this section we assume that a page of handwritten text has been segmented into individual lines. The segmentation methods used in the present paper are the same as those described in [10].
2.1
Connected Components
Some people tend to write a whole word in a single, continuous stroke, while others break up a word into a number of components. The features introduced in this sub-section attempt to model this behavior. From the binary image of a line of text, connected components are extracted first. Each connected component C is described by its bounding box (x1 (C), y1 (C), x2 (C), y2 (C)), where (x1 (C), y1 (C)) and (x2 (C), y2 (C)) are the coordinates of the left-lower and right-upper corner of the bounding box of C,
Fig. 1. Five sample text lines from the data set used in the experiments; these lines have been produced by different writers
respectively. Given all connected components of a line of text, the average distance between two successive bounding boxes is computed first. For this purpose we order all connected components according to their x1 -value. Given the ordered list (C1 , C2 , · · · , Cn ) we calculate the average value of (x1 (Ci+1 ) − x2 (Ci )). This quantity is used as a feature that is potentially useful for writer discrimination. The next two features are the average distance of two consecutive words and the average within-word distance of connected components. In order to compute these two features, a clustering procedure is applied that groups connected components together if they are likely to belong to the same word. This clustering procedure uses a threshold t on the distance of two consecutive connected components, Ci and Ci+1 . If (x1 (Ci+1 ) − x2 (Ci )) < t then it is assumed that Ci and Ci+1 belong to the same word. Otherwise, Ci is considered to be the last component of a word wj and Ci+1 the first component of the following word wj+1 . Other features derived from connected components are the average, median, and standard deviation of the length (x2 (C)−x1 (C)) of connected components C in a line of text, and the average number of black-to-white transitions within each connected component. 2.2
Enclosed Regions
If we analyze the closed loops occurring in handwritten text we observe certain properties that are specific to individual writers. For example, the loops of some writers are of circular shape while the loops of other writers tend to be more elliptical. To simplify our computational procedures, we don’t analyze the loops directly, but the blobs that are enclosed by a loop. These blobs can be easily computed by standard region growing algorithms. For a graphical illustration see Figs. 1 and 2. In Fig. 1 five lines of text from the database used in the experiments described in Sect. 3 are shown. They have been produced by different writers,
Fig. 2. Blobs enclosed by loops: (a) extracted from the first line in Fig. 1; (b) extracted from the second line in Fig. 1
as can be readily seen. In Fig. 2 the blobs enclosed by the loops corresponding to the first two text lines are displayed. One notices a clear difference in the shape of these blobs. In the next paragraph we describe features derived from such blobs. The first feature is the average of the form factor f = 4Aπ/l² taken over all blobs of one text line, where A is the area of the blob under consideration and l the length of its boundary. The second feature is similar. It measures the roundness r = l²/A of an object. Again the average over all blobs in a line of text is taken. The last feature is the average size of the blobs in a text line.
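The blob shape features can be computed as in the sketch below; representing each blob by an (area, perimeter) pair is an assumption made for the example, since in practice both quantities come from a region-growing and contour-tracing step.

```python
import numpy as np

def blob_shape_features(blobs):
    """Average form factor, roundness and size over the blobs of one
    text line. Each blob is an (area, perimeter) pair."""
    areas  = np.array([a for a, _ in blobs], dtype=float)
    perims = np.array([p for _, p in blobs], dtype=float)

    form_factor = 4.0 * np.pi * areas / perims ** 2     # f = 4*A*pi / l^2
    roundness   = perims ** 2 / areas                   # r = l^2 / A
    return form_factor.mean(), roundness.mean(), areas.mean()
```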
2.3
Lower and Upper Contour
The lower (upper) contour of a line of text is defined as the sequence of pixels obtained if only the lower(upper)-most pixel in each column of the text image is considered. Obviously, if there are gaps between words or parts of a word in the text, these gaps will be present in the lower (upper) contour as well. Gaps of this kind are eliminated by simply shifting the following pixels of the lower (upper) contour by the amount of the gap to the left. After this operation there is exactly one black pixel at each x-coordinate in the lower (upper) contour. However, there are usually discontinuities in the y-coordinates of two consecutive points. These discontinuities are eliminated by shifting the following elements along the y-axis by an appropriate amount. A graphical illustration of this procedure is shown in Fig. 3. The sequence of pixels resulting from the operations described in the previous paragraph is called the characteristic lower (upper) contour. A visual analysis reveals that these characteristic contours are quite different from one writer to another. An example is shown in Fig. 4. Comparing Fig. 4 with Fig. 3 we notice a clear difference between the two contours. From both the characteristic lower and upper contour of a line of text a number of features are extracted. The first feature is the slant of the characteristic contour. It is obtained through linear regression analysis. The second feature is the mean squared error between the regression line and the original curve. The next two features measure the frequency of the local maxima and minima on the characteristic contour. A local maximum (minimum) is defined as a point on the characteristic contour such that there is no other point within a neighborhood of given size that has a larger (smaller) y-value. Let m be the number of local maxima and l be the length of the contour. Then the frequency of the local
Fig. 3. Illustration of characteristic contour extraction, using the first line in Fig. 1: (a) lower contour; (b) lower contour after gap elimination; (c) lower contour after elimination of discontinuities in y-direction (characteristic contour)
Fig. 4. Characteristic contour of second line in Fig. 1
maxima is simply the ratio m/l. The frequency of the local minima is defined analogously. Moreover, the local slope of the characteristic contour to the left of a local maximum within a given distance is computed, and the average value, taken over the whole characteristic contour, is used as a feature. The same operation is applied for the local slope to the right of a local maximum. Finally, similar features are computed for local minima.
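A sketch of some characteristic-contour features follows; the neighbourhood size is an assumed parameter, since the paper speaks only of "a neighborhood of given size", and the local-slope features around extrema are omitted for brevity.

```python
import numpy as np

def contour_features(y, window=5):
    """Slant (linear regression), regression error and extrema
    frequencies of a characteristic contour, given as one y-value
    per x-position."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)

    # Slant of the contour and the mean squared regression error.
    slope, intercept = np.polyfit(x, y, 1)
    mse = np.mean((y - (slope * x + intercept)) ** 2)

    # Local maxima/minima: no larger (smaller) value inside the window.
    n_max = n_min = 0
    for i in range(len(y)):
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        seg = y[lo:hi]
        if y[i] == seg.max():
            n_max += 1
        if y[i] == seg.min():
            n_min += 1
    return slope, mse, n_max / len(y), n_min / len(y)
```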
2.4
Fractal Features
In [11, 12] it was shown that methods based on fractal geometry are useful to derive features that characterize certain handwriting styles. While the purpose in those papers was to distinguish between legible and poorly formed handwritings, we take a broader view in this sub-section and aim at features that are useful for writer identification. The basic idea behind the features proposed in [11, 12] is to measure how the area A (i.e. the number of pixels) of a handwritten text grows when we apply a dilation operation [13] on the binary image. In order to make the features used in this paper invariant with respect to the writing instrument (i.e. stroke width), a thinning operation is applied first. Then the writing is dilated using a diskshaped kernel of increasing radius d = 1, 2, · · · and the quantity ln(A(d)) − ln(d) is recorded as a function of ln(d). This function is also called evolution graph [11, 12]. Typically, the evolution graph can be segmented into three parts, each of which behaves more or less linearly. As an example, the evolution graphs derived from the first two lines in Fig. 1 are shown in Fig. 5. The endpoints of the straight line segments of each evolution graph are computed by means
Fig. 5. Evolution graphs and their approximation by three straight line segments, derived from the first line (left) and second line (right) in Fig. 1
of an exhaustive search over all possible points on the x-axis. The objective of this search procedure is to minimize the mean squared error between the original points of the evolution graph and the straight line segments used for approximation. Eventually, the slope of each of the three straight line segments is used as a feature to characterize the given handwritten text line. The differences between the two handwriting styles shown in the first two lines of Fig. 1 clearly manifest themselves in the two evolution graphs, and the slopes of the three straight line segments. Using a disk-shaped dilation kernel results in an evolution graph that is invariant under rotation of the original image. However, the direction of individual strokes and stroke segments is a very important characteristic of a person’s handwriting style. Because of this observation, not only disks, but also ellipsoidal dilation kernels are used. To generate this kind of kernels, three parameters are involved, namely the length of the ellipse’s two main axes and the rotation angle. Through systematic variation of these parameters a set of 18 dilation kernels are generated. For each of these kernels an evolution graph similarly to Fig. 5 is derived and the slope of the three characteristic straight line segments is computed. Thus a total of 57 (= 3 + 18 × 3) different fractal features are generated.
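The evolution-graph computation can be sketched as follows; the radius range is an assumption, and the three-segment fitting and the ellipsoidal kernels are omitted.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def evolution_graph(thinned, radii=range(1, 16)):
    """Points (ln d, ln A(d) - ln d) of the evolution graph: the thinned
    writing is dilated with disk kernels of growing radius d and the
    number of black pixels A(d) is recorded. Fitting three straight
    segments to these points (not shown) yields the fractal features."""
    pts = []
    for d in radii:
        yy, xx = np.mgrid[-d:d + 1, -d:d + 1]
        disk = (xx ** 2 + yy ** 2) <= d ** 2            # disk-shaped kernel
        area = binary_dilation(thinned > 0, structure=disk).sum()
        pts.append((np.log(d), np.log(area) - np.log(d)))
    return np.array(pts)
```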
2.5
Basic Features
In addition to the features described in the previous sections, some features that were used already in [7, 8, 10] were included in the experiments described in Sect. 3. These features correspond to writing skew and slant, the height of the three main writing zones, and the width of the writing. Computational procedures for extracting these features from a line of text can be found in [10].
3
Experimental Results
The experiments described in this section are based on the IAM database [14]. This database comprises about 1’500 pages of handwritten text, produced by over 500 writers. A subset of 250 pages written by 50 writers was selected. Each writer contributed 5 pages of text. One page comprises about 8 lines of text on the average. The total data set consists of 2185 lines of text. The original data format is a binary image for each page of handwritten text. From these images the text lines were extracted first. The corresponding procedures are described in [14]. From each individual line of text, the features introduced in Sect. 2 were extracted. As the ranges of the individual features are quite different, a feature normalization procedure was applied resulting in features that all have zero mean and a standard deviation equal to one. Out of a potentially large number of classifiers, a simple, Euclidean-distance based 5-nearest-neighbor (5-NN) classifier was adopted for the experiments. This classifier determines the five nearest neighbors to each input feature vector and decides for the class that is most often represented. In case of a tie, the class with the smallest sum of distances is chosen. The number of nearest neighbors to be taken into account was experimentally determined. The advantage of this classifier is its conceptual simplicity and the fact that no classifier training is needed. In the experiments the whole set of handwritten text lines was split into five portions of equal size. One portion was used as test set and the other four as prototypes for the 5-NN classifier. This procedure was repeated four times such that each portion was used once as test set, and the average recognition rate obtained from these five runs was recorded. A summary of our experimental results is provided in Table 1. In order to see how the proposed method for writer identification scales up with a growing number of classes, i.e. writers, the experiments were not only run on the full data set produced by 50 writers, but also on a subset that came from 20 writers. Each of the groups of features introduced in Sect. 2 was tested individually (see the first five rows in Table 1). Additionally the union of all these features was tested (see row six). If we compare the individual groups of features with each other we notice that the blob features yield the lowest classification rate (see 2nd row). A possible explanation of this fact is the rather small number of these features (only three). Next are the connected component based features with a recognition performance of about 53% (31%) for the 20 (50) class problem. The features derived from the characteristic lines are doing quite well and are comparable in performance to the basic features. The best performance among the individual groups of features is achieved by the fractal features. In row six of Table 1, the performance of the union of all features is recorded. On the small data set a recognition rate of 96% is achieved. The performance decreases to about 90% for the case of 50 writers. The small data set is the same as the one used in the work reported in [10]. On this data set a recognition rate of about 88% was achieved with a nearest-neighbor classifier in [10], using a simpler set of features
Table 1. Correct recognition rate for various sets of features

features                     20 writers   50 writers
connected components         53.6         31.8
enclosed regions             36.0         18.4
lower and upper contour      76.0         52.8
fractal features             92.6         84.2
basic features               75.6         57.9
all features                 96.4         90.7
combination of all lines     100.0        99.6
than the ones employed in this paper. Hence the new set of features proposed in the present paper lead to a clearly improved recognition rate. In an additional experiment, reported in the last row in Table 1, it was assumed that a whole page of text is written by one single person. In other words, all individual lines on the same page must come from the same writer. Consequently, the results obtained for the individual lines on a page were combined with each other. Simple majority voting was applied to determine the writer of a page. Ties were broken based on the distances output by the individual 5-NN classifiers. Under this combination strategy, a recognition rate of 100% was obtained for the small data set. On the 50-writer data set, all pages but one were correctly assigned, which is equivalent to a recognition rate of 99.66%.
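The classification procedure just described (z-score feature normalisation, Euclidean 5-NN voting and per-page majority voting) can be sketched as follows; the function names are ours, and tie-breaking follows the rule stated above.

```python
import numpy as np
from collections import Counter

def znorm(features, mean, std):
    """Normalize line features to zero mean and unit standard deviation."""
    return (features - mean) / std

def knn_classify(x, proto_feats, proto_labels, k=5):
    """Euclidean-distance k-NN vote over the prototype line vectors;
    ties are broken by the smaller summed distance to the tied class."""
    d = np.linalg.norm(proto_feats - x, axis=1)
    idx = np.argsort(d)[:k]
    votes = Counter(proto_labels[i] for i in idx)
    best = max(votes.values())
    tied = [lab for lab, v in votes.items() if v == best]
    return min(tied, key=lambda lab: sum(d[i] for i in idx if proto_labels[i] == lab))

def classify_page(line_predictions):
    """Simple majority vote over the per-line decisions of one page."""
    return Counter(line_predictions).most_common(1)[0][0]
```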
4
Conclusions
Handwriting is a modality that can be used for the identification of persons. In the present paper the problem of text-independent writer identification for the case of off-line handwritten text was addressed. The approach proposed in this paper is applicable as soon as at least a single line of text is available from a writer. Thus it is positioned between other approaches proposed in the literature that use either complete pages of text or just single words. In the present paper a number of novel features have been proposed. These features are rather powerful and lead to quite high recognition rates in two experiments involving 20 and 50 writers, respectively. There are several applications for which handwriting based person identification is important. Examples include forensic science, handwritten text retrieval from databases as well as digital libraries including historical archives. Another application example is personal handwriting recognition systems that automatically adapt themselves to a particular writer in a multi-user environment. In our future work we want to further upgrade the system described in this paper by including more writers in the database and exploring additional characteristic features. Also the application of feature selection algorithms is of potential interest [15].
Acknowledgment We want to thank Simon Günter for providing guidance and many hints to the first author of this paper.
References [1] A. K. Jain, R. Bolle, S. Pankanti (eds.). Biometrics. Personal Identification in Networked Society, Kluwer Academic. 1999. 679 [2] A. K. Jain, L. Hong, S. Pankanti. Biometrics identification. Comm. ACM 43(2), pp. 91 – 98. 2000. 679 [3] J. Bigun, I. Smeraldi (eds.). Audio- and video-based biometric person authentication. Proc. of the 3rd Int. Conf. AVBPA, Halmstadt, Sweden. 2001. 679 [4] R. Plamondon and G. Lorette. Automatic signature verification and writer identification - the state of the art. Pattern Recognition, 22, pp. 107–131. 1989. 679, 680 [5] F. Leclerc and R. Plamondon. Automatic signature verification: The state of the art 1989-1993. In Progress in Automatic Signature Verification edited by R. Plamandon, World Scientific Publ. Co., pp. 13–19. 1994. 680 [6] H. E. S.Said, G. S.Peake, T. N.Tan and K. D.Baker. Personal identification based on handwriting. Pattern Recognition, 33, pp. 149–160. 2000. 680 [7] S.-H. Cha and S. Srihari. Writer identification: statistical analysis and dichotomizer. In Ferrie, F. J. et al. (eds.):SSPR and SPR 2000, Springer LNCS 1876, pp. 123–132. 2000. 680, 684 [8] S.-H. Cha and S. Srihari. Multiple feature integration for writer verification. In Schomaker, L. R. B., Vuurpijl, L. G. (eds.): Proc. 7th Int. Workshop Frontiers in Handwriting Recognition, pp. 333–342. 2000. 680, 684 [9] E. N. Zois and V. Anastassopoulos. Morphological waveform coding for writer indentification. Pattern Recognition, 33(3), pp. 385–398. 2000. 680 [10] U. V. Marti, R. Messerli and H. Bunke. Writer identification using text line based features. Proceedings of 6th ICDAR. pp. 101–105. 2001. 680, 684, 685 [11] V.Bouletreau, N.Vincent, R.Sabourin and H.Emptoz. Synthetic parameters for handwriting classification. In Proceedings of the 4th Int. Conf. on Document Analysis and Recognition, Ulm, Germany, pp. 102–106. 1997. 683 [12] V.Bouletreau, N.Vincent, R.Sabourin and H.Emptoz. Handwriting and signature: One or two personally indentifiers?. In Proceedings of the 14th Int. Conf. on Pattern Recognition, Brisbane, Australia, pp. 1758–1760. 1998. 683 [13] P. Soille. Morphological Image Analysis, Springer Verlag, Berlin. 1999. 683 [14] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for off-line handwriting recognition. In Int. Journal of Document Analysis and Recognition, vol. 5, pp. 39-46. 2002. 685 [15] J. Kittler, P. Pudil, P. Somol. Advances in statistical feature selection. In Singh, S., Murshed, N., Kropatsch, W. (eds.): Advances in Pattern Recognition, Springer, pp. 425 – 434. 2001. 686
Combining Fingerprint and Hand-Geometry Verification Decisions Kar-Ann Toh, Wei Xiong, Wei-Yun Yau, and Xudong Jiang Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613 {katoh,wxiong,wyyau,xdjiang}@i2r.a-star.edu.sg
Abstract. This paper proposes to combine the fingerprint and hand-geometry verification decisions using a reduced multivariate polynomial model. The main advantage of this method over neural-network-based methods is that only a single step is required for training and the training is optimal. Numerical experiments using a database containing over 100 identities show a significant improvement of the Receiver Operating Characteristics as compared to those of the individual biometrics. Moreover, the result outperforms a few commonly used methods on the same database.
1
Introduction
Due to possible increase in degree of freedom, fusion of multiple classifiers especially taking different input sets may allow alleviation of problems intrinsic to individual classifiers. By exploiting the specialist capabilities of each classifier, a combined classifier may yield results which would not be possible in a single classifier. The biometric verification problem can be considered as such classification problem wherein a decision is made upon whether or not a claimed identity is genuine with inference to some matching criteria. We thus treat the problem of combining multi-modal biometrics as a classifier decisions combination problem in this paper. Generally, the approaches for classifiers combination differs in terms of assumptions about classifier dependencies, type of classifier outputs, combining strategies and combining procedures [1]. Two main types of combination can be identified: classifier selection and classifier fusion. The difference between these two types lies on whether the classifiers are assumed to be complementary or competitive. Classifier selection assumes that each classifier is a “local expert” while classifier fusion assumes that all classifiers are trained over the entire feature space (see e.g. [1]). In this paper, our focus will be on classifier fusion and main effort will be on arriving at a fusion methodology that optimizes the accuracy of the combined decision. According to the information adopted, three levels of combination can be identified ([2]): (i) abstract level, (ii) rank level, and (iii) measurement level. At abstract level, the output information taken from each classifier is only a possible label for each pattern class, whereas at rank level, the output information J. Kittler and M.S. Nixon (Eds.): AVBPA 2003, LNCS 2688, pp. 688–696, 2003. c Springer-Verlag Berlin Heidelberg 2003
taken from each classifier is a set of ordered possible labels which is ranked by decreasing confidence measure. At measurement level, the output information taken from each classifier is a set of possible labels with associated confidence measure. In this way, with the measurement outputs taken from each individual system, the decision is brought forward to the final output of the combined system. We shall work at the measurement level to combine the fingerprint and the hand-geometry decisions. While the statistical approach (see e.g. [3]) had received considerable attention, many classifiers, predictors or estimators by themselves can be used for data fusion and pattern recognition, we shall briefly review some of the commonly used ones. Spline interpolation possesses good approximation capability, but the selection of its control points requires much knowledge regarding the distribution of data used. The Feedforward Neural Network (FNN) has been shown to be a universal approximator (see e.g. [4]), however, the training process remains much to be a trial and error effort since no learning algorithm can guarantee convergence to global optimal solution within finite iterations. Backpropagation of error gradients has proven to be useful in FNN learning, but a large number of iterations is usually needed for adapting the weights. The problem becomes more severe especially when a high level of accuracy is required. In this paper, we propose a reduced multivariate polynomial model to circumvent the iterative training problem. The paper is organized as follows. In the following section, the problem of combining classifier decisions is stated before some preliminaries on the optimal weighting method are provided. With these backgrounds in place, a reduced multivariate polynomials model is introduced in section 3. In section 4, the proposed model is tested using physical data from the fingerprint and hand-geometry verification systems. Finally, in section 5, some concluding remarks are drawn.
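As an illustration of single-step training of a measurement-level score combiner, the sketch below fits a generic low-order polynomial expansion of the two match scores by regularised least squares; it is not the authors' reduced multivariate polynomial model itself, whose exact form is introduced later in the paper, and all names are ours.

```python
import numpy as np

def poly_features(scores, order=2):
    """Generic low-order polynomial expansion of the two match scores
    (fingerprint, hand geometry), used here for illustration only."""
    f, h = scores[:, 0], scores[:, 1]
    cols = [np.ones_like(f)]
    for k in range(1, order + 1):
        cols += [f ** k, h ** k, (f + h) ** k]
    return np.column_stack(cols)

def train_combiner(train_scores, train_labels, ridge=1e-3):
    """Single-step (closed-form) regularized least-squares training."""
    P = poly_features(train_scores)
    A = P.T @ P + ridge * np.eye(P.shape[1])
    return np.linalg.solve(A, P.T @ train_labels)   # weight vector

def combined_score(weights, scores):
    """Fused decision score for new (fingerprint, hand-geometry) pairs."""
    return poly_features(scores) @ weights
```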
2
Problem Definition and Preliminaries
2.1
Problem Definition
Assume that each false positive poses the same amount of risk, every false negative presents identical liability, and the system is under random attack. It remains an issue how to combine a set of learned classifiers that are correlated. The higher the degree of correlation, the larger the amount of agreement or linear dependence among the classifiers or estimators will be. This correlation also reflects the amount of redundancy within the set of classifiers. Here, the problem of correlation, which can produce unreliable estimates, is referred to as the multicollinearity problem [5]. With these preliminaries, we define our problem of combining classifier decisions in the following.
2.2
The Problem of Combining Classifier Decisions
Given (l, m, n, p, q) as positive integers and consider two sets of data: a training set Strain = {xi ∈ Rp , yi ∈ R}, i = 1, ..., m and a test set Stest = {xi ∈ Rp , yi ∈
R}, i = 1, ..., n. Given a set of functions F = {f̂j(x, y)}, j = 1, ..., l,
Table 2. Comparison 2: the probe's correct identity occurs in the top 5

          A       B      C      D      E      F      G      CPU/Subject
SPScorr   90 %    87 %   80 %   52 %   43 %   48 %   44 %   t < 26† sec
SPSmwd    98 %    90 %   81 %   46 %   43 %   46 %   43 %   t < 27† sec
CMU       100 %   90 %   83 %   59 %   50 %   53 %   43 %   t > 3‡ min
USF       82 %    76 %   54 %   48 %   48 %   41 %   34 %   t > 3‡ min
HumanID Gait Challenge dataset, with promising results. Our algorithm produces competitive classification results while reducing computational cost. The results show rank 5 classification numbers competitive with algorithms [5] and [17]. The rank 1 numbers are notably lower for tests D and F. Unfortunately segmentation errors and the test covariates are conflated in this dataset, making it difficult to determine the causal factor in classification error. The classification time for a subject can be factored into two sources, building the shape representation and testing against the database of known subjects. The shape modeling requires approximately 20 seconds, and is performed once per subject. Each comparison against a member of the gallery database requires approximately 8 milliseconds. Consequently our method scales well with additional members, as the computation cost of adding new tests is low. The results suggest that shape is an appropriate biometric, but that it is a tool best employed when the probe subject is viewed under conditions similar to the gallery subjects. This suggests that local models for each camera would be most successful in a typical surveillance environment. The rank performance of our method indicates that gait shape is an effective winnowing feature, reducing the number of candidates that more computationally intensive methods must analyze. We plan to apply our technique to human activity analysis, by defining collections of key-shapes that are associated with target activities rather than an identity. We are currently exploring bootstrap estimates of cluster statistics as an analytical tool for determining cluster validity, a problem that is frequently left as an engineering detail in the grouping literature.
Acknowledgment This work is supported by DARPA/IAO HumanID under ONR contract N00014-00-1-0915 and by NSF/RHA grant IIS-0208965.
References [1] C. Ben-Abdelkader, R. Cutler, and L. S. Davis. Motion-based recognition of people in eigengait space. In IEEE Conf Automatic Face and Gesture Recognition, pages 254–259, 2002. 735 [2] C. Ben-Abdelkader, R. Cutler, and L. S. Davis. Stride and cadence as a biometric in automatic person identification and verification. In IEEE Conf Automatic Face and Gesture Recognition, pages 357–362, 2002. 735 [3] A. F. Bobick and A. Y. Johnson. Gait recognition using static, activity-specific parameters. In IEEE Computer Vision and Pattern Recognition, pages I:423–430, 2001. 735 [4] F. R. K. Chung. Spectral Graph Theory. AMS, 2nd edition, 1997. 736 [5] R. T. Collins, R. Gross, and J. Shi. Silhouette-based human identification from body shape and gait. In IEEE Conf Automatic Face and Gesture Recognition, pages 351–356, 2002. 735, 739, 740 [6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82– 98, January 1999. 735 [7] J. J.Little and J. E.Boyd. Recognizing people by their gait: The shape of motion. In Videre (online journal), volume 1(2), Winter 1998. 735 [8] A. Kale, A. N. Rajagopalan, N. Cuntoor, and V. Kruger. Gait-based recognition of humans using continuous HMMs. In IEEE Conf Automatic Face and Gesture Recognition, pages 321–326, 2002. 735 [9] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955. 738 [10] L. Lee and W. E. L. Grimson. Gait analysis for recognition and classification. In IEEE Conf Automatic Face and Gesture Recognition, pages 148–155, 2002. 735 [11] Y. Liu, R. Collins, and Y. Tsin. Gait sequence analysis using frieze patterns. In European Conference on Computer Vision, pages II: 657–671., 2002. 735 [12] K. V. Mardia and I. L. Dryden. Statistical Shape Analysis. John Wiley and Son, 1st edition, 1999. 736 [13] M.Nixon, J.Carter, D.Cunado, P.Huang, and S.Stevenage. Automatic gait recognition. In A.Jain, R.Bolle, and S.Pankanti, editors, Biometrics: Personal Identification in Networked Society, pages 231–249. Kluwer Academic Publishers, 1999. 735 [14] H. Murase and R. Sakai. Moving object recognition in eigenspace representation: Gait analysis and lip reading. Pattern Recognition Letters, 17(2):155–162, Feb 1996. 735 [15] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Neural Information Processing Systems, 2002. 738 [16] S. A. Niyogi and E. H. Adelson. Analyzing and recognizing walking figures in xyt. In IEEE Proceedings Computer Vision and Pattern Recognition, pages 469–474, 1994. 735 [17] P. J. Philips, S. Sarkar, I. Robledo, P. Grother, and K. Bowyer. Baseline results for the challenge problem of human ID. In IEEE Conf Automatic Face and Gesture Recognition, 2002. 735, 739, 740
[18] P. J. Philips, S. Sarkar, I. Robledo, P. Grother, and K. Bowyer. Gait identification challenge problem: Data sets and baseline. 1:385–388, 2002. 735, 739 [19] J. Shi and J. Malik. Normalized cuts and image segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. 736 [20] L. Wang, H. Ning, W. Hu, and T. Tan. Gait recognition based on procrustes shape analysis. In International Conference on Image Processing (ICIP), pages III: 433–436, 2002. 735 [21] J. H. Yoo, M. S. Nixon, and C. J. Harris. Model-driven statistical analysis of human gait motion. In International Conference on Image Processing, pages I: 285–288, 2002. 735
Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features Niall Fox and Richard B. Reilly Dept. of Electronic and Electrical Engineering University College Dublin, Belfield, Dublin 4, Ireland {niall.fox,richard.reilly}@ee.ucd.ie
Abstract. This paper presents a speaker identification system based on dynamical features of both the audio and visual modes. Speakers are modeled using a text dependent HMM methodology. Early and late audio-visual integration are investigated. Experiments are carried out for 252 speakers from the XM2VTS database. From our experimental results, it has been shown that the addition of the dynamical visual information improves the speaker identification accuracies for both clean and noisy audio conditions compared to the audio only case. The best audio, visual and audio-visual identification accuracies achieved were 86.91%, 57.14% and 94.05% respectively.
1
Introduction
Recently there has been significant interest in multi-modal human computer interfaces, especially audio-visual (AV) systems for applications in areas such as banking, and security systems [1], [3]. It is known that humans perceive in a multimodal manner, and the McGurk effect demonstrates this fact [10]. People with impaired hearing use lip-reading to complement information gleaned from their perceived degraded audio signal. Indeed, synergistic integration has already been achieved for the purpose of AV speech recognition [18]. Previous work in this area is usually based on either the use of audio, [16], [15] or static facial images (face recognition) [2]. Previous multi-modal AV systems pre-dominantly use the static facial image as the visual mode and not the dynamical visual features [1], [19]. It is expected that the addition of the dynamical visual mode should complement the audio mode, increase the reliability for noisy conditions and even increase the identification rates for clean conditions. Also, it would be increasingly difficult for an imposter to impersonate both audio and dynamical visual information simultaneously. Recently, some work has been carried on the use of the dynamical visual mode for the purpose of speech recognition [9], [17]. Progress in speech based bimodal recognition is documented in [4]. The aim of the current study was to implement and compare various methods of integrating both dynamic visual and audio features for the purpose of speaker identification (ID) and to achieve a more reliable and secure system compared to the audio only case. J. Kittler and M.S. Nixon (Eds.): AVBPA 2003, LNCS 2688, pp. 743-751, 2003. Springer-Verlag Berlin Heidelberg 2003
2
Audio and Visual Segmentation
The XM2VTS database [8], [11] was used for the experiments described in this paper. The database consists of video data recorded from 295 subjects in four sessions, spaced monthly. The first recording per session of the third sentence (“Joe took fathers green shoe bench out”) was used for this research. The audio files were manually segmented into the seven words. The audio segmentation times were converted into visual frame numbers, to carry out visual word segmentation. Some sentences had the start of Joe clipped or it was totally missing. Due to this and other errors in the sentences, only 252 out of a possible 295 subjects were used for our experiments. Visual features were extracted from the mouth ROI (region of interest). This ROI was segmented manually by locating the two labial corners. A 98× 98 pixel block was extracted as the ROI. Manual segmentation was only carried out for every 10th frame, and the ROI coordinates for the intermediate frames were interpolated.
3
Audio and Visual Feature Extraction
The audio signal was first pre-emphasised to increase the acoustic power at higher frequencies using the filter H(z) =1/(1-0.97z -1). The pre-emphasised signal was divided into frames using a Hamming window of length 20 ms, with overlap of 10 ms to give an audio frame rate, FA,, of 100 Hz. Mel-frequency cepstral coefficients (MFCC’s) [5] of dimension 8 were extracted from each frame. The energy [20] of each frame was also calculated and used as a 9th static feature. Nine first order differences or delta features were calculated between adjacent frames and appended to the static audio features to give an audio feature vector of dimension 18. Transform based features were used to represent the visual information based on the Discrete Cosine Transform (DCT) because of its high energy compaction [12]. The 98× 98 colour pixel blocks were converted to gray scale values. No further image pre-processing was implemented, and the DCT was applied to the gray scale pixel blocks. The first 15 coefficients were used, taken in a zig-zag pattern. Calculating the difference of the DCT coefficients over k frames forms the visual feature vector. This was carried out for two values of k giving a visual feature vector of dimension 30. The static coefficients were discarded. The values of k depended on the visual feature frame rate. The visual features can have two frame rates: (1) Asynchronous visual features, have a frame rate, FV, of 25 fps or equivalently, 25 Hz, i.e. that of the video sequence. The optimum values of k used were determined empirically to be 1 and 2. (2) Synchronous visual features, have a frame rate of 100 fps, i.e. that of the audio features. Since the frame rate is higher than the asynchronous case, the values of k must be higher to give the same temporal difference. Two sets of synchronous k values, (3,6) and (5,10), were tested. In general, delta(k1,k2), refers to the use of the k values, k1 and k2, for the calculation of the visual feature vector, where k2 > k1.
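The visual feature computation described above can be sketched as follows. The coefficient selection uses a diagonal (zig-zag-like) ordering, which for the first 15 coefficients collects the same low-frequency set as a true zig-zag scan; the function names are ours and SciPy is assumed to be available.

```python
import numpy as np
from scipy.fft import dctn

def top_dct_coeffs(block, n=15):
    """First n 2-D DCT coefficients of a gray-scale mouth ROI, taken in
    a diagonal (zig-zag-like) low-frequency order."""
    c = dctn(block.astype(float), norm='ortho')
    idx = sorted(((i, j) for i in range(c.shape[0]) for j in range(c.shape[1])),
                 key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([c[i, j] for i, j in idx[:n]])

def delta_visual_features(dct_frames, k1, k2):
    """delta(k1, k2) visual vectors: differences of the per-frame DCT
    coefficients over k1 and k2 frames; the static coefficients are
    discarded. dct_frames is an (N, 15) array of per-frame coefficients."""
    T = np.asarray(dct_frames, dtype=float)
    feats = []
    for m in range(k2, len(T)):
        feats.append(np.concatenate([T[m] - T[m - k1], T[m] - T[m - k2]]))
    return np.array(feats)          # shape (N - k2, 30)
```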
A sentence observation, O = O1 … Ok … OM, consists of M words, where M = 7 here. A particular AV word, Ok, has NA audio frames and NV visual frames. In general NA ≠ 4 × NV, even though FA = 4 × FV. This is due to the fact that when NA and NV were determined, the initial frame number and final frame number values were rounded down and up respectively. A sequence of audio and synchronous visual frame observations is given by Equation (1). When the visual features are calculated, according to Equation (2), k2 frames are dropped. In Equation (2), o_n^{V} refers to the nth synchronous visual feature vector of dimension 30, and Tm refers to the top 15 DCT transform coefficients of the mth interpolated visual frame. Hence, to ensure that there are NA visual feature frames, the NV DCT visual frames are interpolated to NA + k2 frames (refer to Fig. 1).

O_k = o_1^{i} … o_n^{i} … o_{N_A}^{i},   i ∈ {A, V}                                       (1)

o_n^{V} = [T_m − T_{m−k1}, T_m − T_{m−k2}],   1 ≤ n ≤ N_A,  1 + k2 ≤ m ≤ N_A + k2          (2)
Fig. 1. Frame interpolation and visual feature calculation for a specific word consisting of NA audio frames
4
Speaker Identification and Hidden Markov Modeling
Speaker ID is discussed in this paper as opposed to speaker verification. To test the importance of integration based on the use of dynamic audio and visual features, a text dependent speaker ID methodology was used. For text dependent modeling [6], the speaker says the same utterance for both training and testing. It was employed, as opposed to text independent modeling [16], because of the database used in this study. Also, text independence has been found to give less performance than text dependence [7].
Each word consists of NA or NV frame observations given above by Equation (1). Speaker Si is represented by M speaker dependent word models, Sik, for i = 1 … N, k = 1 … M, where N = 252 and M = 7 here. There are M background models, Bk. Three sessions were used for training and one session for testing. The M background speaker independent word HMMs were trained using three of the sessions for all the speakers. These background models capture the AV speech variation over the entire database. Since there were only three training word utterances per speaker, there was insufficient training data to train a speaker dependent HMM, which was initialized with a prototype model. Hence the background word models were used to initialise the training of the speaker dependent word models. A sentence observation, O, is tested against all N speakers, Si, and the speaker that gives the maximum score is chosen as the identified speaker. To score an observation O against speaker Si, M separate scores, P(Ok/Sik), are calculated, one for each word in O, 1 ≤ k ≤ M. The M separate scores are normalised with respect to the frame length of each word by dividing by Fk, and are then summed to give an overall score P(O/Si), as shown in Equation (3). O is also scored against the background models to give an additional score P(O/B), also shown in Equation (3).

log(P(O/Si)) = (1/M) Σ_{k=1..M} log(P(Ok/Sik)) / Fk ,
log(P(O/B))  = (1/M) Σ_{k=1..M} log(P(Ok/Bk)) / Fk                                         (3)
The two scores in Equation (3) are subtracted to give an overall measure of the likelihood that speaker Si produced the observation O, as shown in Equation (4). The subtraction of the background score provides a normalisation of the complete speaker score, Di. Di is calculated for each of the N speakers, and O is identified as speaker Si using the maximum value of Di, i = 1 … N.

Di = log(P(O/Si)) − log(P(O/B))                                                             (4)
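The scoring of Equations (3) and (4) can be sketched as follows, assuming that the per-word Viterbi log-likelihoods and word frame counts have already been produced by the HMMs; the function names are ours.

```python
import numpy as np

def speaker_score(word_loglikes_spk, word_loglikes_bg, frame_counts):
    """Eqs. (3)-(4): per-word log-likelihoods are normalised by the
    word's frame count, averaged over the M words, and the background
    score is subtracted to give the decision score D_i."""
    spk = np.mean(np.asarray(word_loglikes_spk) / np.asarray(frame_counts))
    bg  = np.mean(np.asarray(word_loglikes_bg) / np.asarray(frame_counts))
    return spk - bg

def identify(scores_per_speaker):
    """The observation is assigned to the speaker with the largest D_i."""
    return int(np.argmax(scores_per_speaker))
```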
5
Audio-Visual Integration
The two main problems concerning AV integration are when and how the integration should take place. Integration can take place at three levels: early, middle and late [6]. Only early and late integration are discussed in this study. Early Integration (EI). The audio and visual modality features are combined, and then used for training and testing of a single AV classifier. The visual frame rate is first synchronised with the audio frame rate. Equation (5) and Fig. 2 show how the synchronous visual feature vector is concatenated to the audio feature vector.

o_n^{AV} = [o_n^{A}, o_n^{V}],   1 ≤ n ≤ N_A                                                (5)

Fig. 2. Audio, delta(5,10) visual case and AV feature blocks (audio block: A(9) and ∆A(9); visual block: ∆5 V(15) and ∆10 V(15))
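Both integration schemes can be sketched as follows; the late-integration function follows the additive weighted form of Equation (6) below, applied to the normalised modality scores, and the names are ours.

```python
import numpy as np

def early_integration(audio_feats, visual_feats):
    """Eq. (5): frame-synchronous concatenation of the 18-dimensional
    audio vector and the 30-dimensional visual vector."""
    return np.concatenate([audio_feats, visual_feats], axis=-1)

def late_integration(p_audio, p_visual, lambda_a):
    """Weighted additive late integration of the normalised audio and
    visual scores, following the form of Eq. (6); lambda_a weights the
    audio mode."""
    return p_audio ** lambda_a + p_visual ** (1.0 - lambda_a)
```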
This method of integration has several disadvantages. The audio or visual mode data quality is not taken into account, resulting in an equal weighting in the AV feature vector. The feature vector has a higher dimension, requiring more training data. This is a problem for training speaker dependent models. However, EI has the advantage that it is easy to implement both in training and classification. Late Integration (LI). LI requires two independent classifiers to be trained, one classifier for each mode. For speaker ID there are two options for the position at which to late integrate the speaker scores. The Viterbi word scores may be integrated, or the scores according to Equation (3) may be integrated. The advantages of late integration include the ability to account for mode reliabilities, small feature vector dimensions, and ease of adding other modes to the system. For LI the two scores are weighted to account for the reliability of the modes. The two scores may be integrated via addition or multiplication. Equation (6) shows the use of weights for the case of additive integration, where λA is the weight of the audio score. The audio score can be late integrated with either the asynchronous or the synchronous visual scores. Prior to LI the audio and visual scores are normalised.

P(O_AV/Si) = (P(O_A/Si))^{λA} + (P(O_V/Si))^{1−λA}                                          (6)
(6)
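A sketch of this weighted additive late integration over normalised per-speaker scores (illustrative names only; the score normalisation itself is not shown):

```python
import numpy as np

def late_integration(audio_scores, visual_scores, lambda_a):
    """Fused score of Equation (6) for every speaker.

    audio_scores, visual_scores : arrays of normalised scores P(O_A|S_i) and P(O_V|S_i)
    lambda_a                    : audio weight in [0, 1]
    """
    a = np.asarray(audio_scores, dtype=float)
    v = np.asarray(visual_scores, dtype=float)
    return a ** lambda_a + v ** (1.0 - lambda_a)

def identify(audio_scores, visual_scores, lambda_a):
    # The identified speaker is the one with the largest fused score.
    return int(np.argmax(late_integration(audio_scores, visual_scores, lambda_a)))
```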
6 Experiments
Left-to-right HMMs were used in the classification experiments. The models were trained using the Baum-Welch algorithm and tested using the Viterbi algorithm [14], implemented using the HMM toolkit, HTK [20]. The audio features were calculated using HTK. The seven background models were trained using three sessions and tested using one session. This gave 3*N (756) training examples per background model. To test that the background models were trained correctly and to test the fusion methodologies, speech recognition experiments were carried out. A six state, two mixture HMM topology was used for the audio and EI AV models. A one state, one mixture topology was used for the asynchronous models and a six state, two mixture topology for the synchronous models. Each model was tested N times to give 7*N (1764) word tests in total, where N = 252, the number of speakers tested. Speaker ID experiments were carried out for N subjects. A one state, one mixture HMM topology was used for the audio, asynchronous visual and EI AV modes. A one state, two mixture HMM topology was used for the synchronous visual mode. These HMM topologies, which gave the best results, were found by exhaustive search. The first three sessions were used for training and the fourth session was used for testing. Two sets of synchronous visual feature k values, (3,6) and (5,10), and one set of asynchronous visual feature k values, (1,2), were tested. Additive white Gaussian noise was applied to the clean audio at signal-to-noise ratios (SNR) ranging from 48 dB to -12 dB in steps of 6 dB. All models were trained using clean speech and tested at the various SNR values. Optimum λ_A values were determined by exhaustive search for each noise level, by testing λ_A values from 0 to 1 in steps of 0.01.
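For illustration, corrupting a clean waveform at a prescribed SNR can be sketched as follows; the exact noise-generation code used in the experiments is not given in the paper, so this is only a plausible reading of the procedure.

```python
import numpy as np

def add_white_noise(clean, snr_db, rng=None):
    """Add white Gaussian noise to a clean signal at the requested SNR (in dB)."""
    rng = rng or np.random.default_rng(0)
    clean = np.asarray(clean, dtype=float)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

# SNR sweep used for testing: 48 dB down to -12 dB in 6 dB steps.
snr_values = list(range(48, -13, -6))
# Optimum audio weights can then be found by an exhaustive search over
# lambda_A in {0.00, 0.01, ..., 1.00} at each SNR.
```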
7 Results and Discussion
Table 1 shows the experimental results for clean audio. The audio word recognition performed extremely well, with an accuracy of 99.04%, which verifies that the seven background word models were trained correctly. The asynchronous word recognition performed poorly, at 26.81%. This may be due to the low number of states and mixtures employed (both one), which in turn follows from the low number of asynchronous frames per word. Interpolation of the asynchronous visual frames has the effect of increasing the amount of training data; this permits better training of the HMMs, resulting in a better visual accuracy of 52.84%. In both cases of synchronous visual speaker ID the results are similar, which suggests that further improvement in the AV system depends on the integration and not on the features employed.

Table 1. Word recognition and speaker ID results for clean audio

Classifier Modality     Visual Features   Word recognition (%)   Speaker Identification (%)
Audio                   N/A               99.04                  86.91
Synchronous Visual      delta(5,10)       52.5                   57.14
Audio-Visual (EI)       delta(5,10)       95.07                  80.16
Synchronous Visual      delta(3,6)        52.84                  53.97
Audio-Visual (EI)       delta(3,6)        97.17                  80.56
Asynchronous Visual     delta(1,2)        26.81                  55.56
Fig. 3a shows the results for EI. There may be several reasons why EI performed so poorly. The visual features may not have been synchronised properly with the audio features; this may have occurred when visual frames were dropped to calculate the delta features, or because of the overlapping audio frames. Another reason for the poor EI performance may be the lack of training data for the dimensionally larger AV feature vectors. Fig. 3b shows the results of LI speaker ID. The AV LI scores are synergistic, giving a significant improvement over the audio-only case.

Fig. 3a. EI speaker ID rates (%) versus SNR (dB) using delta(5,10) visual features (252 speakers; curves: Audio, Asynch-Visual, Synch-Visual, A-V)

Fig. 3b. LI speaker ID rates (%) versus SNR (dB) using delta(5,10) visual features, additive Viterbi score LI of normalised scores (252 speakers; curves: Audio, SynchV, AsynchV, Audio-SynchV, Audio-AsynchV)
Fig. 4 shows how the audio weights varied with SNR. The continuous line shows the audio weights that gave the best LI results. The general profile is as expected, with higher SNRs requiring higher audio weights and vice versa. The vertical error bars show the audio weights that gave an LI score within 98% of the maximum
score. This shows that some flexibility is permitted in the audio weights, and this should be kept in mind when implementing adaptable audio weights.

Fig. 4a. Audio weights versus SNR (dB) for the synchronous visual case of Fig. 3b

Fig. 4b. Audio weights versus SNR (dB) for the asynchronous visual case of Fig. 3b
8 Further Developments and Conclusion
LI based on the use of dynamic features gives good results. The LI results show that the addition of the dynamic visual mode not only increases the results for low SNR values but also increases the results for clean audio, giving a speaker ID system of higher accuracy and more robustness to audio noise. The poor EI results were expected to be due to the lack of training data. However, both EI and LI speech recognition had 3*N training samples per model and synergistic EI was not achieved (see Table 1). This would suggest that the use of more training data may not yield synergistic EI. To achieve synergistic EI, further analysis of the feature extraction methods and AV feature synchronisation may be required. For an AV system that is robust to real world conditions, it is not sufficient to demonstrate robustness to audio noise only; robustness to visual mode degradation is also necessary. Effects of visual degradation, such as frame rate decimation, noise and compression artifacts [13], have not been reported widely in the literature. It is expected that frame rate decimation would affect the dynamic visual features more so than other visual degradations. Further image pre-processing may yield higher visual accuracies. ROI down-sampling may further compact the visual feature vector and may improve the EI results, due to the reduced amount of training data required. In conclusion, the results show that the addition of the dynamic visual information improves the speaker ID accuracies for both clean and noisy audio conditions compared to the audio-only case.
Acknowledgements This work was supported by Enterprise Ireland's IP2000 program.
References
[1] Brunelli, R. and Falavigna, D.: Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 10, pp. 955-966, Oct. 1995
[2] Brunelli, R. and Poggio, T.: Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052, 1993
[3] Chen, T.: Audiovisual Speech Processing. IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, Jan. 2001
[4] Chibelushi, C. C., Deravi, F., and Mason, J. S. D.: A Review of Speech-Based Bimodal Recognition. IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23-36, Mar. 2002
[5] Davis, S. and Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980
[6] Lucey, S.: Audio-Visual Speech Processing. PhD thesis, Queensland University of Technology, Brisbane, Australia, Apr. 2002
[7] Luettin, J.: Speaker verification experiments on the XM2VTS database. IDIAP Communication 98-02, IDIAP, Martigny, Switzerland, Aug. 1999
[8] Luettin, J. and Maitre, G.: Evaluation Protocol for the XM2VTSDB Database (Lausanne Protocol). IDIAP Communication 98-05, IDIAP, Martigny, Switzerland, Oct. 1998
[9] Matthews, I., Cootes, T. F., Bangham, J. A., Cox, J. A., and Harvey, R.: Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, Feb. 2002
[10] McGurk, H. and MacDonald, J.: Hearing Lips and Seeing Voices. Nature, vol. 264, pp. 746-748, Dec. 1976
[11] Messer, K., Matas, J., Kittler, J., Luettin, J., and Maitre, G.: XM2VTSDB: The Extended M2VTS Database. Proceedings of the Second International Conference on Audio- and Video-based Biometric Person Authentication (AVBPA'99), Washington D.C., pp. 72-77, Mar. 1999
[12] Netravali, A. N. and Haskell, B. G.: Digital Pictures. Plenum Press, pp. 408-416, 1998
[13] Potamianos, G., Graf, H., and Cosatto, E.: An Image Transform Approach for HMM Based Automatic Lipreading. Proceedings of the IEEE International Conference on Image Processing, Chicago, vol. 3, pp. 173-177, 1998
[14] Rabiner, L. R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989
[15] Ramachandran, R. P., Zilovic, M. S., and Mammone, R. J.: A Comparative Study of Robust Linear Predictive Analysis Methods with Applications to Speaker Identification. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 2, pp. 117-125, Mar. 1995
[16] Reynolds, D. A. and Rose, R. C.: Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, Jan. 1995
[17] Scanlon, P. and Reilly, R.: Visual Feature Analysis for Automatic Speechreading. DSP Research Group, UCD, Dublin, Ireland, 2001
[18] Silsbee, P. and Bovik, A.: Computer Lipreading for Improved Accuracy in Automatic Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 337-350, Sept. 1996
[19] Yacoub, S. B. and Luettin, J.: Audio-Visual Person Verification. IDIAP Communication 98-18, IDIAP, Martigny, Switzerland, Nov. 1998
[20] Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., and Woodland, P.: The HTK Book (for HTK Version 3.1). Microsoft Corporation, Cambridge University Engineering Department, Nov. 2001
Scalability Analysis of Audio-Visual Person Identity Verification

Jacek Czyz (1), Samy Bengio (2), Christine Marcel (2), and Luc Vandendorpe (1)

(1) Communications Laboratory, Université catholique de Louvain, B-1348 Belgium
[email protected]
(2) IDIAP, CH-1920 Martigny, Switzerland
{Samy.Bengio,Christine.Marcel}@idiap.ch
Abstract. In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system conciliates the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited, as in a smart card. Our experimental results show that multimodal fusion allows the user template size to be reduced significantly while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.
1 Introduction
With the advent of the digital communication and information society, reliable and user-friendly personal identity verification becomes more and more indispensable and critical. Biometrics, which measures a physiological or behavioural characteristic of a person, such as voice, face, fingerprints, iris, etc., provides an effective and inherently more reliable way to carry out personal identification [4]. Several factors influence the choice of a biometric trait for a particular application. Among them, distinctiveness and user friendliness are certainly the most important. For distinctiveness, the biometric trait should be distributed with a large variance inside the target population. At the same time, it should ideally remain constant for a given person, or vary with a small variance. As for user friendliness, the sensors that capture the biometric traits should interfere with the user as little as possible. Also the trait recordings should be done in an unconstrained and contactless manner. These two requirements are unavoidably contradictory. Therefore, it has been suggested to combine or fuse several easily accepted biometric traits, in order to achieve an acceptable level of distinctiveness and user friendliness at the same time. This technique is known as multimodal biometrics. The aptitude of multimodal biometrics for increasing correct verification rates over monomodal biometrics has been demonstrated in several previous studies (see, for example, [3], [5], [8]).
A promising application consists in combining biometric efficiency and smart card (SC) security, by storing the user template on a SC [9]. SC’s allow the template to be securely protected and avoid storing biometric data in a central server (central storage is not well accepted by users). However storage space of SC’s and transmission speed between server and SC’s limit the user template size. It is therefore important to evaluate performance as a function of the template size. In this work, we present an identity verification system based on the fusion of the face image and text independent speech data of a user. We analyse its scalability by evaluating the performance at different user template sizes. Although less accurate, text independent speaker verification allows more variability in the content uttered by the user. For this reason it is more user-friendly than text dependent speaker verification. The system presented is modular: each monomodal algorithm (face and speech) outputs a matching score reflecting its confidence in the presumed identity. The matching scores are then conciliated using a fusion algorithm which outputs the final authentication decision. Our analysis of the experimental results on a realistic database shows that the fused system face-speech requires a much smaller number of parameters to represent a user than the best monomodal algorithm at the same performance level. Fusion can therefore help in reducing the storage space required for client data and thus improve the scalability of the verification system. The paper is organised as follows. The monomodal algorithms and the fusion techniques employed are presented in section 2. In section 3, the scalability analysis is described. The database and the experimental protocol are presented in section 4. We discuss the results in section 5, and we draw conclusions in the last section.
2 Fusion of Face and Speaker Verification Algorithms
When the identity of a user has to be verified, speech and face are recorded and compared to the previously created user template. A score reflecting the quality of the match between the template and the data to verify is computed. The fusion of the two scores resulting from the speech and face algorithms leads to the final decision. Hereafter we briefly describe the speaker and face verification algorithms and the fusion techniques.

Face Verification Algorithm. The first step involves localisation and registration of the face part in the input image. In our implementation, we have skipped this step by manually locating the eye coordinates in the image. While often done in the literature, this optimistically biases the verification performance. After localisation, the face image is cropped and histogram equalisation is applied to reduce the effect of lighting variation. The Fisherface approach [1] is used to extract features from the gray level face image. This feature extraction technique is based on Principal Component Analysis (PCA) and on Linear Discriminant Analysis (LDA). LDA effectively projects the face vector into a subspace where within-class variations are minimised while between-class variations are maximised. Formally, given a set of face vectors x_i, each belonging to one of c classes {C_1, C_2, ..., C_c}, we compute the between-class scatter matrix S_b

S_b = \sum_{i=1}^{c} (\mu_i - \mu)(\mu_i - \mu)^T

and the within-class scatter matrix S_w

S_w = \sum_{i=1}^{c} \sum_{x_k \in C_i} (x_k - \mu_i)(x_k - \mu_i)^T

where \mu_i and \mu are respectively the class conditional mean and the global mean. It is known that the projection matrix W which maximises the class separability criterion

J = \frac{|W^T S_b W|}{|W^T S_w W|}

is a solution of the eigenproblem

S_w^{-1} S_b W = W \Lambda \tag{1}
(1)
where the diagonal matrix Λ contains the eigenvalues. In order to prevent Sw from being singular, an initial dimensionality reduction must be applied. This is achieved by taking the principal components of the face images. The face score sf is computed by matching the newly acquired LDA face projection x xT xt . Note that to to the user template xt using normalised correlation sf = xx t compute the LDA basis, at least two images per person are required. Speaker Verification Algorithm The speaker verification algorithm used to compute the speech score is text independent and based on Gaussian Mixture Models (GMM) [7]. A parameterisation of the raw voice is performed, creating a vector of Linear Frequency Cepstral Coefficients (LFCC) for each section of 10ms of speech. On top of these coefficients, their first derivatives, as well as the log of the energy, are kept. Finally, a cepstral mean subtraction is performed in order to normalise the data. The user template is represented by a GMM that was adapted using a Maximum A Posteriori method from a general World Model GMM trained on a separate population of speakers. The speech score ss is computed by estimating the log-likelihood ratio of a speech sequence of LFCC features X = {x1 , x2 , . . . , xT } pronounced by the speaker i versus the world of all speakers Ω (world model) ss = log p(X| i) − log p(X|Ω) The densities p(X| i) and p(X|Ω) given the ith speaker and world GMM models of N Gaussians can be computed as follows: p(X) =
T t=1
p(xt ) =
T N t=1 n=1
wn · N (xt ; µn , Σ n )
(2)
Scalability Analysis of Audio-Visual Person Identity Verification
755
where N (xt ; µn , Σ n ) is a Gaussian with mean µn ∈ Rd where d is the number 2 of features and with standard deviation Σ n ∈ Rd : 1 1 −1 T N (x; µ, Σ) = (x − µ) exp − Σ (x − µ) (3) d 2 (2π) 2 |Σ| Note that Σ is diagonal in the proposed implementation. The parameters that form the user template in this model are the means µn of the N Gaussians, since the other parameters are fixed during the adaptation procedure and equal to those of the corresponding world model. Fusion Algorithms The fusion of the face and speech scores is performed using a second level classifier. That is, the two scores are considered as input features for a classifier which is trained on genuine and impostor score examples. In our experiments, we have opted for two fusion techniques. The first technique is based on a multi-layered perceptron (MLP) which can be viewed as a universal classifier. The MLP has two inputs where the two scores are fed in, and one output for the final fusion score s = m i=1 tanh(ws,i ss + wf,i sf + bi ) where m is the number of hidden units and the parameters w and b are chosen to minimise the EER on the training set. In the second fusion technique, a new score s is computed by averaging the weighted scores s = ws ss +wf sf . The fusion score s is then thresholded to obtain the final decision. The weights ws and wf and the threshold are found so as to minimise the EER on the training set.
3
Scalability Analysis
As stated in the introduction, we are interested in evaluating the performance of the monomodal and the multimodal algorithms at different user template sizes. While this template is relatively small for the face modality, it can be very large for the speech modality, mainly because of the large number of Gaussians that are necessary to represent faithfully the probability densities. For the face modality, the user template size is determined by the LDA subspace dimensionality. The LDA basis vectors, solution of (1), are ranked according to the magnitude of their corresponding eigenvalue. This magnitude is an indicator of the discriminatory power of the corresponding eigenvector. The performance of the face verification algorithm is then assessed at various numbers of LDA basis vectors, by gradually removing the less discriminative ones. For the speech modality, the number of parameters of the user template depends on the number of Gaussians in the GMM and the feature vector size, i.e. the number of LFCC. The optimal number of Gaussians is normally determined by Maximum Likelihood on the world model. Here, we have trained the GMM on the world model with the number of Gaussians selected in the following set of values: 10, 25, 50, 100, 200 and 300. For each number of Gaussians, the performance of the speaker verification algorithm is assessed. We also studied
756
Jacek Czyz et al.
Fig. 1. First frame of the 12 sessions of the BANCA database
the performance variation when k, the number of LFCC used to parameterise the raw voice varied between 4, 8, 12 and 16. Note that the derivatives of the LFCC and the signal energy are added to the feature vector. The feature vector size is therefore 2k + 1. The number of parameters of the multimodal template is simply the sum of the face and speech templates. We studied the performance variation of the multimodal system at different sizes of the speech and face templates.
4
Database and Experimental Protocol
The experiments presented in the next section were performed on the English part of the BANCA database. This recently recorded database and the accompanying experimental protocol are described in detail in [2]. We give hereafter a short description. The data set contains voice and video recordings of 52 people in several environmental conditions. It is subdivided into two groups of 26 subjects (13 males and 13 females), denoted in the following by g1 and g2. Each subject recorded 12 sessions distributed over several months, each of these sessions containing 2 records: one true user access and one impostor attack. The impostor attacks are attempted only for subjects of the same sex, within the same group. The 12 sessions were separated into 3 different scenarios: controlled for sessions 1 to 4, degraded for sessions 5 to 8 and adverse for sessions 9 to 12. A low-cost camera has been used to record the sessions in the degraded scenario. For this scenario, the background noise for speech and video was unconstrained and the lighting uncontrolled, simulating a user authentificating himself in an office or at home using a home PC and a low cost web-cam. A more expensive camera was used for the controlled and adverse scenarios. The adverse scenario simulates a cash withdrawal machine, and was recorded outdoors. From one video session (about 30 seconds), five frames per person were randomly selected for face verification. At the same time, about 15 seconds of speech were recorded and used for speaker verification. During an impostor attack, the impostor utters the same text as the user that he is imposting. An additional set of 30 other subjects, 15 males and 15 females, recorded one session (audio and video) for each scenario. This set of data is used as world data. Figure 1 shows a subject
Scalability Analysis of Audio-Visual Person Identity Verification
757
of the BANCA database in the 12 sessions. The face images have already been located and registered. Notice how image quality varies accross the sessions. In our testing protocol, session 1 only is used to enrol a new user, that is, to create its user template. In [2], this protocol is referred to as protocol P. This demanding feature of the testing protocol was introduced because having to record several enrolment sessions may be tedious for the users in realistic applications. The remaining sessions are used to simulate genuine and impostors accesses. The testing protocol specifies a validation set, used to set the speech and face algorithm parameters as well as to train the fusion algorithm. A second set, the evaluation set, is used to assess the global system. Group g1 (group g2) is successively validation (evaluation) set and evaluation (validation) set. As in cross-validation, results from the two configurations are averaged. Since only one session is available to create the user template, the LDA basis has to be computed with another face data set comprising several images per person. We chose the XM2VTS face database [6] for availability reasons. As this database contains 295 persons, the user template size is limited to 294 numbers. As in any biometric system, two types of errors are possible: false acceptance when an impostor claim is accepted and false rejection when a genuine claim is rejected. These two errors depend on the biometric system threshold. To assess the performance, we adopted the following methodology. The threshold corresponding to the equal error (EER), that is, when the false acceptance rate (FAR) and the false rejection rate (FRR) are equal, is adjusted on the validation set. With this threshold, the system is tested on the evaluation set which leads to a false acceptance rate (FAR) and a false rejection rate (FRR) . From this two errors, we compute the half total error rate (HTER) which reflects the global performance of the verification algorithm HTER = (FAR+FRR)/2.
Speaker verification
Face modality
0.16
0.16
0.14
0.14
0.12
0.12
HTER
0.18
HTER
0.18
0.1
0.1
0.08
0.08
0.06
0.06
0.04
0.04
0.02 0
50
100 150 200 Number of parameters
(a)
250
300
0.02 0
5000
10000 15000 Number of parameters
20000
(b)
Fig. 2. (a) HTER versus user template size for face modality. (b) HTER versus number of Gaussians needed to represent user template for speech modality
758
5
Jacek Czyz et al.
Experimental Results and Discussion
According to the protocol described in the previous section, we studied the variation of the HTER as a function of the user template size. Figure 2(a) shows the variation of the HTER versus the user template size (expressed in number of parameters needed to store it) for the face verification algorithm. From the figure, the HTER decreases significantly with the first 100 basis vectors. The minimum HTER is reached at 150 vectors, and increases above 150 vectors. This means that the last features extracted (from 150 to 294) slightly degrades the classification and should not be included in the user templates. The minimum HTER obtained is 14.3%. This high value can be partially explained by the fact that only one enrolment session is available for creating the user template. Scalability results for the speech modality are shown on figure 2(b). The curve on this figure corresponds to the variation of the HTER with the number of Gaussians and 16 LFCC coefficients. Since the other LFCC parameterisations offer higher error rates at equal number of parameters of the user template, we only present results with 16 LFCC coefficients. The best HTER for the speech modality is equal to 4.7% and is obtained with 200 Gaussians. It requires 13400 parameters to be stored. The results of the fusion experiments are presented on figure 3(a) and 3(b) for MLP and weighted averaging fusion respectively. From these figures, it appears that the multimodal fusion always outperforms the best single modality (speech in our case). The lowest fusion HTER obtained is equal to 2.38%. The improvement thus reaches almost 50% in spite of the weakness of the face algorithm. Furthermore, the MLP fusion achieves an HTER of 3.77%, i.e. better than the speech modality alone, with only 50 Gaussians instead of 300. In this case, the number of parameters to be stored for the user template is 3500 (3350 for speech and 150 for face), which is almost 4 times less than what is needed for the
Fusion of face and speech: MLP
Fusion of speech and face: weighted averaging
0.18
0.18 face size: 50 face size: 100 face size 150 face size 200 voice alone
0.16 0.14
0.14 0.12
HTER
HTER
0.12 0.1
0.1
0.08
0.08
0.06
0.06
0.04
0.04
0.02 0
face size: 50 face size: 100 face size 150 face size 200 voice alone
0.16
5000
10000 15000 Number of parameters
(a)
20000
0.02 0
5000
10000 15000 Number of parameters
20000
(b)
Fig. 3. (a) HTER for MLP fusion vs. number of Gaussians and face template size. (b) HTER for weighted averaging fusion vs. number of Gaussians and face template size
Scalability Analysis of Audio-Visual Person Identity Verification
759
system using the speech modality only. The speaker verification algorithm with 50 Gaussians for user template achieves an HTER of 6.62%. This result may be of practical interest when storage space is of concern, for example in a biometric system coupled with smart cards [9]. The limited storage and transmission speed of a smart card require user templates as small as possible. The fusion of modalities is therefore a way of improving the performance and reducing the number of parameters needed to be stored and transmitted, without decreasing performance.
6 Conclusions
A multimodal identity verification system using the speech and the face image of a user has been presented. The experiments were conducted on a realistic database and according to a test protocol that allows only one enrolment session. The results show that the text independent speaker verification algorithm is robust and provides good results in spite of the uncontrolled nature of the data. In comparison, the face verification algorithm appears to be weak. A substantial improvement is gained when the outputs of the two monomodal algorithms are fused using simple techniques, with the performance getting close to real world application requirements. An empirical analysis of the algorithm scalability with respect to the user template size has been presented. It shows that fusion may help in reducing the number of parameters that need to be stored while keeping a satisfactory level of performance. Future work will be devoted to the design of a fully automatic audio-visual authentication system with automatic face location and registration.

Acknowledgements. This work was carried out within the framework of the European Project IST BANCA. We thank the CVSSP laboratory at the University of Surrey (UK) for providing the eye coordinates for the BANCA database.
References [1] P. Belhumeur, J. Hespanha and D. Kriegman, “Face recognition: Eigenfaces vs. Fisherfaces: Recognition using class specific projection”, IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7), 1997. 753 [2] S. Bengio, F. Bimbot, J. Mariethoz, V. Popovici, F. Por´ee, E. Bailly-Balliere, G. Matas and B. Ruiz “Experimental protocol on the BANCA database” Technical Report IDIAP-RR 02-05, IDIAP, 2002. 756, 757 [3] B. Duc, E. S. Bigun, J. Bigun, G. Maitre, and S. Fischer. “Fusion of audio and video information for multi modal person authentication” Pattern Recognition Letters, 18:835-843, 1997. 752 [4] A. Jain, R. Bolle and S. Pankanti “Biometrics: personal identification in a networked society”, Kluwer Academic Publishers, 1999. 752
[5] J. Kittler, M. Hatef, R. P. W. Duin and J. Matas “On combining classifiers” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, No. 3, pp. 226-239, 1998. 752 [6] K. Messer, J. Matas, J. Kittler, J. Luettin and G. Maitre “XM2VTSDB: The extended M2VTS database” in Proc. of Int. Conf. on Audio and Video based Biometric Person Authentication, Washington, USA, 1999. 757 [7] D. A. Reynolds and R. C. Rose “Robust Text-Independent Speaker identification using Gaussian mixture speaker models” in IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995. 754 [8] A. Ross, A. Jain and J.-Z. Qian “Information fusion in Biometrics” in Proc. of Int. Conf. on Audio and Video based Biometric Person Authentication, Halmstad, Sweden, 2001. 752 [9] R. Sanchez-Reillo “Including Biometric Authentication in a smart card operating system”, Int. Conf. on Audio- and Video-based Person Authentication, Halmstad, Sweden, 2001. 753, 759
A Bayesian Approach to Audio-Visual Speaker Identification

Ara V. Nefian (1), Lu Hong Liang (1), Tieyan Fu (2), and Xiao Xing Liu (1)

(1) Microprocessor Research Labs, Intel Corporation
{ara.nefian,lu.hong.liang,xiao.xing.liu}@intel.com
(2) Computer Science and Technology Department, Tsinghua University
[email protected]

Abstract. In this paper we describe a text dependent audio-visual speaker identification approach that combines face recognition and audio-visual speech-based identification systems. The temporal sequence of audio and visual observations obtained from the acoustic speech and the shape of the mouth are modeled using a set of coupled hidden Markov models (CHMM), one for each phoneme-viseme pair and for each person in the database. The use of the CHMM in our system is justified by the capability of this model to describe the natural audio and visual state asynchrony as well as their conditional dependence over time. Next, the likelihood obtained for each person in the database is combined with the face recognition likelihood obtained using an embedded hidden Markov model (EHMM). Experimental results on the XM2VTS database show that our system improves the accuracy of the audio-only or video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 to 30 dB.
1 Introduction
Increased interest in robust person identification systems leads to complex systems that often rely on the fusion of several types of sensors. Audio-visual speaker identification (AVSI) systems are particularly interesting due to their increased robustness to acoustic noise. These systems combine acoustic speech features with facial or visual speech features to reveal the identity of the user. As in audio-visual speech recognition, the key issues for robust AVSI systems are the visual feature extraction and the audio-visual decision strategy. Audio-visual fusion methods [22, 3] can be broadly grouped into two categories: feature fusion and decision fusion systems. In feature fusion systems the observation vectors, obtained through the concatenation of acoustic and visual speech feature vectors, are described using a hidden Markov model (HMM). However, the audio and visual state synchrony assumed by these systems may not describe accurately the audio-visual speech generation. In comparison, in decision level systems the class conditional likelihood of each modality is combined at phone or word level. Some of the most successful decision fusion models include the multi-stream HMM [18] and the product HMM [21, 4].
Fig. 1. The audio-visual speaker identification system

The Bayesian models also revealed their high modeling accuracy for face recognition. Recent face recognition systems using embedded Bayesian networks [14] showed their improved performance over some template-based approaches [23, 1, 6, 17]. The Bayesian approach to audio-visual speaker identification described in this paper (Figure 1) starts with the detection of the face and mouth in a video sequence. The facial features are used in the computation of the face likelihood (Section 2), while the visual features of the mouth region together with the acoustic features determine the likelihood of the audio-visual speech (Sections 3 and 4). Finally the face and the audio-visual speech likelihoods are combined in a late integration scheme to reveal the identity of the user (Section 5).
2 The Face Model
While HMMs are very successful in speech or gesture recognition, an equivalent two-dimensional HMM for images has been shown to be impractical due to its complexity [5]. Figure 2a shows a graph representation of the 2D HMM, with the square nodes representing the discrete hidden nodes and the circles describing the continuous observation nodes. In recent years, several approaches to approximate a 2D HMM with computationally practical models have been investigated [20, 14, 7]. In this paper, the face images are modeled using an embedded HMM (EHMM) [15]. The EHMM used for face recognition is a hierarchical statistical model with two layers of discrete hidden nodes (one layer for each data dimension) and a layer of observation nodes. In an EHMM both the "parent" and the "child" layer of hidden nodes are described by a set of HMMs (Figure 2b). The states of the HMMs in the "parent" and "child" layers are referred to as the super states and the states of the model, respectively. The hierarchical structure of the EHMM, or of embedded Bayesian networks [14] in general, reduces significantly the complexity of these models compared to the 2D HMM. The sequence of observation vectors for an EHMM is obtained from a window that scans the image from left to right and top to bottom, as shown in Figure 2c. Using the images in the training set, an EHMM is trained for each person in the database by means of the EM algorithm described in [15]. Recognition is carried out via the Viterbi decoding algorithm [12].

Fig. 2. A 2D HMM for face recognition (a), an embedded HMM (b) and the facial feature block extraction for face recognition (c)
3 The Audio-Visual Speech Model
A coupled HMM (CHMM) [2] can be seen as a collection of HMMs, one for each data stream, where the hidden backbone nodes at time t for each HMM are conditioned by the backbone nodes at time t - 1 of all the related HMMs. Throughout this paper we use a CHMM with two channels, one for the audio and the other for the visual observations (Figure 3a). The parameters of a CHMM with two channels are defined below:

\pi_0^c(i) = P(q_1^c = i)
b_t^c(i) = P(O_t^c \mid q_t^c = i)
a_{i|j,k}^c = P(q_t^c = i \mid q_{t-1}^a = j,\; q_{t-1}^v = k)

where c \in \{a, v\} denotes the audio and visual channels respectively, and q_t^c is the state of the backbone node in the cth channel at time t. For a continuous mixture with Gaussian components, the probabilities of the observed nodes are given by:

b_t^c(i) = \sum_{m=1}^{M_i^c} w_{i,m}^c\, N(O_t^c;\, \mu_{i,m}^c,\, U_{i,m}^c)

where O_t^c is the observation vector at time t corresponding to channel c, and \mu_{i,m}^c, U_{i,m}^c and w_{i,m}^c are the mean, covariance matrix and mixture weight corresponding to the ith state, the mth mixture and the cth channel. M_i^c is the number of mixtures corresponding to the ith state in the cth channel. In our audio-visual speaker identification system, each CHMM describes one of the possible phoneme-viseme pairs as defined in [16], for each person in the database.

Fig. 3. A directed graph representation of a two channel CHMM with mixture components (a) and the state diagram representation of the CHMM used in our audio-visual speaker identification system (b)
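A sketch of these per-channel quantities, assuming diagonal covariances as in the paper; the names and array layouts are illustrative only and not taken from the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def emission_prob(obs, weights, means, diag_covars):
    """b_t^c(i): Gaussian-mixture observation probability of one state in one channel.

    weights     : (M,) mixture weights w_{i,m}^c
    means       : (M, d) mixture means mu_{i,m}^c
    diag_covars : (M, d) diagonal covariances U_{i,m}^c
    """
    return float(sum(w * multivariate_normal.pdf(obs, mean=mu, cov=np.diag(cv))
                     for w, mu, cv in zip(weights, means, diag_covars)))

def coupled_transition(table, i, j, k):
    """a_{i|j,k}^c: probability of state i in channel c, given the audio backbone state j
    and the visual backbone state k at the previous time step; `table` has shape (N_c, N_a, N_v)."""
    return table[i, j, k]
```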
4 Training the CHMM
The training of the CHMM parameters for the task of audio-visual speaker identification is performed in two stages. First, a speaker-independent background model (BM) is obtained for each CHMM corresponding to a viseme-phoneme pair. Next, the parameters of the CHMMs are adapted to a speaker specific model using a maximum a posteriori (MAP) method. To deal with the requirements of a continuous speech recognition system, two additional CHMMs are trained to model the silence between consecutive words and sentences.

4.1 Maximum Likelihood Training of the Background Model
In the first stage, the CHMMs for isolated phoneme-viseme pairs are initialized using the Viterbi-based method described in [13], followed by the estimation-maximization (EM) algorithm [10]. Each of the models obtained in the first stage is extended with one entry and one exit non-emitting state (Figure 3b). The use of the non-emitting states also enforces the phoneme-viseme synchrony at the model boundaries. Next, the parameters of the CHMMs are refined through the embedded training of all CHMMs from continuous audio-visual speech [10]. In this stage, the labels of the training sequences consist only of the sequence of phoneme-visemes, with all boundary information being ignored. We will denote the mean, covariance matrices and mixture weights for mixture m, state i, and channel c of the trained CHMM corresponding to the background model as (\mu_{i,m}^c)_{BM}, (U_{i,m}^c)_{BM} and (w_{i,m}^c)_{BM} respectively.

4.2 Maximum a Posteriori Adaptation
In this stage of the training, the state parameters of the background model are adapted to the characteristics of each speaker in the database. The new state parameters for all CHMMs, \hat{\mu}_{i,m}^c, \hat{U}_{i,m}^c and \hat{w}_{i,m}^c, are obtained through Bayesian adaptation [19]:

\hat{\mu}_{i,m}^c = \theta_{i,m}^c\, \mu_{i,m}^c + (1 - \theta_{i,m}^c)(\mu_{i,m}^c)_{BM} \tag{1}
\hat{U}_{i,m}^c = \theta_{i,m}^c\, U_{i,m}^c - (\hat{\mu}_{i,m}^c)^2 + \left((\mu_{i,m}^c)_{BM}\right)^2 + (1 - \theta_{i,m}^c)(U_{i,m}^c)_{BM} \tag{2}
\hat{w}_{i,m}^c = \theta_{i,m}^c\, w_{i,m}^c + (1 - \theta_{i,m}^c)(w_{i,m}^c)_{BM} \tag{3}

where \theta_{i,m}^c is a parameter that controls the MAP adaptation for mixture component m in channel c and state i. The sufficient statistics of the CHMM states corresponding to a specific user, \mu_{i,m}^c, U_{i,m}^c and w_{i,m}^c, are obtained using the EM algorithm from the available speaker dependent data as follows:

\mu_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)\, O_{r,t}^c}{\sum_{r,t} \gamma_{r,t}^c(i,m)}
U_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)\,(O_{r,t}^c - \mu_{i,m}^c)(O_{r,t}^c - \mu_{i,m}^c)^T}{\sum_{r,t} \gamma_{r,t}^c(i,m)}
w_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)}{\sum_{r,t}\sum_{k} \gamma_{r,t}^c(i,k)}

where

\gamma_{r,t}^c(i,m) = \frac{\frac{1}{P_r}\sum_{j} \alpha_{r,t}(i,j)\,\beta_{r,t}(i,j)}{\frac{1}{P_r}\sum_{i,j} \alpha_{r,t}(i,j)\,\beta_{r,t}(i,j)} \cdot \frac{w_{i,m}^c\, N(O_{r,t}^c \mid \mu_{i,m}^c, U_{i,m}^c)}{\sum_{k} w_{i,k}^c\, N(O_{r,t}^c \mid \mu_{i,k}^c, U_{i,k}^c)}

and \alpha_{r,t}(i,j) = P(O_{r,1}, \ldots, O_{r,t} \mid q_{r,t}^a = i, q_{r,t}^v = j) and \beta_{r,t}(i,j) = P(O_{r,t+1}, \ldots, O_{r,T_r}, q_{r,t}^a = i, q_{r,t}^v = j) are the forward and backward variables respectively [10], computed for the rth observation sequence O_{r,t} = [(O_{r,t}^a)^T, (O_{r,t}^v)^T]^T. The adaptation coefficient is

\theta_{i,m}^c = \frac{\sum_{r,t} \gamma_{r,t}^c(i,m)}{\sum_{r,t} \gamma_{r,t}^c(i,m) + \delta}

where \delta is the relevance factor, which is set to \delta = 16 in our experiments. Note that as more speaker dependent data for a mixture m of state i and channel c becomes available, the contribution of the speaker specific statistics to the MAP state parameters increases (Equations (1)-(3)). On the other hand, when less speaker specific data is available, the MAP parameters are very close to the parameters of the background model.
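For illustration, the mean adaptation of Equation (1) with the relevance-factor rule can be sketched as follows; this is a simplified, single-state view and the names are not from the paper.

```python
import numpy as np

def map_adapt_means(gamma, obs, means_bm, delta=16.0):
    """MAP-adapt the mixture means of one state and one channel.

    gamma    : (T, M) occupancies gamma_{r,t}^c(i, m), all speaker-dependent frames pooled
    obs      : (T, d) speaker-dependent observations O_{r,t}^c
    means_bm : (M, d) background-model means (mu_{i,m}^c)_BM
    delta    : relevance factor (set to 16 in the paper)
    """
    counts = gamma.sum(axis=0)                                     # sum_{r,t} gamma(i, m)
    mu_spk = (gamma.T @ obs) / np.maximum(counts, 1e-10)[:, None]  # speaker sufficient statistic
    theta = counts / (counts + delta)                              # adaptation coefficient
    # Equation (1): interpolate between speaker statistics and the background model.
    return theta[:, None] * mu_spk + (1.0 - theta)[:, None] * means_bm
```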
5 Recognition
Given an audio-visual test sequence, the recognition is performed in two stages. First the face likelihood and the audio-visual speech likelihood are computed separately. To deal with the variation in the relative reliability of the audio and visual features of speech at different levels of acoustic noise, we modified the observation probabilities used in decoding such that \tilde{b}_t^c(i) = [b_t^c(i)]^{\lambda_c}, c \in \{a, v\}, where the audio and video stream exponents \lambda_a and \lambda_v satisfy \lambda_a, \lambda_v \ge 0 and \lambda_a + \lambda_v = 1. Then the overall matching score of the audio-visual speech and face model is computed as

L(O^f, O^a, O^v \mid k) = \lambda_f L(O^f \mid k) + \lambda_{av} L(O^a, O^v \mid k) \tag{4}

where O^a, O^v and O^f are the acoustic speech, visual speech and facial sequences of observations, L(* \mid k) denotes the observation likelihood for the kth person in the database, and \lambda_f, \lambda_{av} \ge 0, \lambda_f + \lambda_{av} = 1 are weighting coefficients for the face and audio-visual speech likelihoods.
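A small sketch of this late combination over all enrolled identities (illustrative names only):

```python
def combined_score(face_loglike, av_loglike, lambda_f):
    """Equation (4) for one candidate identity k, with lambda_av = 1 - lambda_f."""
    return lambda_f * face_loglike + (1.0 - lambda_f) * av_loglike

def identify(face_loglikes, av_loglikes, lambda_f):
    """Pick the person k in the database with the highest combined score."""
    scores = [combined_score(f, a, lambda_f) for f, a in zip(face_loglikes, av_loglikes)]
    return max(range(len(scores)), key=lambda k: scores[k])
```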
6 Experimental Results
The audio-visual speaker identification system presented in this paper was tested on digit enumeration sequences from the XM2VTS database [11]. For parameter adaptation we used four training sequences from each of the 87 speakers in our training set while for testing we used 320 sequences. In our experiments the acoustic observation vectors consist of 13 Mel frequency cepstral (MFC) coefficients with their first and second order time derivatives, extracted from windows of 25.6ms, with an overlap of 15.6ms. The extraction of the visual features starts with the face detection system described in [9] followed by the detection and tracking of the mouth region using a set of support vector machine classifiers. The features of the visual speech are obtained from the mouth region through a cascade algorithm described in [8]. The pixels in the mouth region are mapped to a 32-dimensional feature space using the principal component analysis. Then, blocks of 15 consecutive visual observation vectors are concatenated and projected on a 13 class, linear discriminant space. Finally, the resulting vectors, with their first and second order time derivatives are used as the visual observation sequences. The audio and visual features are integrated using a CHMM with three states in both the audio and video chains with no back transitions (Figure 3b). Each state has 32 mixture components with diagonal covariance matrices. In our system, the facial features are obtained using a sampling window of size 8 × 8 with 75% overlap between consecutive windows. The observation vectors corresponding to each position of the sampling window consist of a set of 2D discrete Cosine transform (2D DCT) coefficients. Specifically, we used nine 2D DCT coefficients obtained from a 3 × 3 region around the lowest frequency in the 2D DCT domain. The faces of all people in the database are modeled using EHMM with five super states and 3,6,6,6,3 states per super state respectively. Each state of the hidden nodes in the “child” layer of the EHMM is described by a mixture of three Gaussian density functions with diagonal covariance matrices. To evaluate the behavior of our speaker identification system in environments affected by acoustic noise, we corrupted the testing sequences with white Gaussian noise at different SNR levels, while we trained on the original clean acoustic sequences.
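For illustration, extracting the 2D DCT facial observation vectors with an 8 x 8 window, 75% overlap and a 3 x 3 block of low-frequency coefficients could be sketched as follows; this is not the authors' implementation, and the function name is a placeholder.

```python
import numpy as np
from scipy.fft import dctn

def facial_observations(gray_face, block=8, overlap=0.75, keep=3):
    """Scan the image left-to-right, top-to-bottom and return one observation per window."""
    step = max(1, int(round(block * (1.0 - overlap))))   # 75% overlap -> step of 2 pixels
    rows, cols = gray_face.shape
    observations = []
    for r in range(0, rows - block + 1, step):
        for c in range(0, cols - block + 1, step):
            coeffs = dctn(gray_face[r:r + block, c:c + block], norm='ortho')
            observations.append(coeffs[:keep, :keep].ravel())  # nine low-frequency 2D DCT coefficients
    return np.array(observations)
```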
Fig. 4. The error rate (%) versus SNR (dB) of the audio-only (\lambda_v, \lambda_f) = (0.0, 0.0), video-only (\lambda_v, \lambda_f) = (1.0, 0.0), audio-visual (\lambda_f = 0.0) and face + audio-visual (\lambda_v, \lambda_f) = (0.5, 0.3) speaker identification systems

Figure 4 shows the error rate of the audio-only (\lambda_v, \lambda_f) = (0.0, 0.0), video-only (\lambda_v, \lambda_f) = (1.0, 0.0), audio-visual (\lambda_f = 0.0) and face-audio-visual (\lambda_v, \lambda_f) = (0.5, 0.3) speaker identification systems at different SNR levels.
7 Conclusions
In this paper, we described a Bayesian approach to text dependent audio-visual speaker identification. Our system uses a hierarchical decision fusion approach. At the lower level, the acoustic and visual features of speech are integrated using a CHMM and the face likelihood is computed using an EHMM. At the upper level, a late integration scheme combines the likelihoods of the face and the audio-visual speech to reveal the identity of the speaker. The use of strongly correlated acoustic and visual temporal features of speech together with overall facial characteristics makes the current system very difficult to break, and increases its robustness to acoustic noise. The study of the recognition performance in environments corrupted by acoustic noise shows that our system outperforms the audio-only baseline system by a wide margin.
References [1] P.N. Belhumeur, J.P Hespanha, and D.J. Kriegman. Eigenfaces vs Fisherfaces: Recognition using class specific linear projection. In Proceedings of Fourth Europeean Conference on Computer Vision, ECCV’96, pages 45–58, April 1996. 762 [2] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 994–999, 1997. 763
[3] C. Chibelushi, F. Deravi, and S.D. Mason. A review of speech-based bimodal recognition. IEEE Transactions on Multimedia, 4(1):23–37, March 2002. 761 [4] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2:141–151, September 2000. 761 [5] S. Kuo and O.E. Agazzi. Keyword spotting in poorly printed documents using pseudo 2-D Hidden Markov Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8):842–848, August 1994. 762 [6] A. Lawrence, C.L. Giles, A.C Tsoi, and A.D. Back. Face recognition : A convolutional neural network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997. 762 [7] Jia Lia, A. Najmi, and R.M. Gray. Image classification by a two-dimensional hidden markov model. IEEE Transactions on Signal Processing, 48(2):517–533, February 2000. 762 [8] L. Liang, X. Liu, X. Pi, Y. Zhao, and A. V. Nefian. Speaker independent audiovisual continuous speech recognition. In International Conference on Multimedia and Expo, volume 2, pages 25–28, 2002. 766 [9] R. Lienhart and J. Maydt. An extened set of Haar-like features for rapid objection detection. In IEEE International Conference on Image Processing, volume 1, pages 900–903, 2002. 766 [10] X. Liu, L. Liang, Y. Zhao, X. Pi, and A. V. Nefian. Audio-visual continuous speech recognition using a coupled hidden Markov model. In International Conference on Spoken Language Processing, 2002. 764, 765 [11] J. Luettin and G. Maitre. Evaluation protocol for the XM2FDB database. In IDIAP-COM 98-05, 1998. 766 [12] A. V. Nefian and M. H. Hayes. Face recognition using an embedded HMM. In Proceedings of the IEEE Conference on Audio and Video-based Biometric Person Authentication, pages 19–24, March 1999. 763 [13] A. V. Nefian, L. Liang, X. Pi, X. Liu, and C. Mao. A coupled hidden Markov model for audio-visual speech recognition. In International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 2013–2016, 2002. 764 [14] Ara V. Nefian. Embedded Bayesian networks for face recognition. In IEEE International Conference on Multimedia and Expo, volume 2, pages 25–28, 2002. 762, 763 [15] Ara V. Nefian and Monson H. Hayes III. Maximum likelihood training of the embedded HMM for face detection and recognition. In IEEE International Conference on Image Processing, volume 1, pages 33–36, 2000. 762, 763 [16] C. Neti, G. Potamianos, J. Luettin, I. Matthews, D. Vergyri, J. Sison, A. Mashari, and J. Zhou. Audio visual speech recognition. In Final Workshop 2000 Report, 2000. 764 [17] Jonathon Phillips. Matching pursuit filters applied to face identification. IEEE Transactions on Image Processing, 7(8):1150–1164, August 1998. 762 [18] G. Potamianos, J. Luettin, and C. Neti. Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 169–172, 2001. 761 [19] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn. Speaker verification using an adapted gaussian mixture model. Digital Signal Processing, 10:19–41, 2000. 765 [20] F. Samaria. Face Recognition Using Hidden Markov Models. PhD thesis, University of Cambridge, 1994. 762 [21] L.K. Saul and M.L. Jordan. Boltzmann chains and hidden Markov models. In G. Tesauro, David S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7. The MIT Press, 1995. 761
[22] S.B. Yacouband, S. Luettin, J. Jonsson, K. Matas, and J. Kittler. Audio-visual person verification. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 580–585, 1999. 761 [23] M. Turk and A.P. Pentland. Face recognition using eigenfaces. In Proceedings of International Conference on Pattern Recognition, pages 586 – 591, 1991. 762
Multimodal Authentication Using Asynchronous HMMs Samy Bengio Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4, 1920 Martigny, Switzerland
[email protected] http://www.idiap.ch/~bengio
Abstract. It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audio-visual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other state-of-the-art techniques.
1 Introduction
Biometric identity verification systems use the characteristics of a person to either accept or reject the identity claim made by that person [11]. While several such systems are based on only one characteristic, or modality (such as a spoken sentence or a face), several recent methods have been proposed in order to combine more than one modality in the hope of obtaining more robust decisions [9]. Most of these combination methods are in fact based either on the decisions (accept or reject the access) or on the scores (often real values) obtained by each unimodal algorithm, in order to take a global and hopefully more robust decision. In this paper we would like to propose a combination method at the level of the raw data. We will concentrate for this purpose on the difficult task of audio-visual authentication based on two streams of data: an audio stream representing a spoken sentence of a person trying to access the system, and a corresponding video stream of the face of the person pronouncing the sentence. Trying to combine the models at the level of the raw data in that case is complex for many reasons: first, each stream may have been preprocessed at different frame rates, chosen according to prior knowledge of each stream; second, simply up-sampling or down-sampling the streams in order to get the same number of frames in each stream might not be the optimal way of combining the streams. We propose in this paper a solution that could overcome this limitation. In a recent paper [2], we proposed an algorithm to train Asynchronous Hidden Markov Models (AHMMs) in order to model the joint probability of pairs
of sequences of data representing the same sequence of events, even when the events are not synchronized between the sequences. In fact, the model is able to desynchronize the streams by temporally stretching one of them in order to obtain a better match between the corresponding frames. The model can thus be directly applied to the problem of audio-visual speaker verification where, for instance, the lips sometimes start to move before any sound is heard. The paper is organized as follows: in the next section, we review the model of AHMMs, followed by the corresponding EM training algorithm. Related models are then presented and implementation issues are discussed. Finally, experiments on an audio-visual text-dependent speaker verification task based on the M2VTS database are presented, followed by a conclusion.
2 The Asynchronous Hidden Markov Model
Let us denote the two asynchronous sequences to model as X = x_1^T and Y = y_1^S, where T and S are respectively the lengths of sequences X and Y, with S \le T without loss of generality¹. We are thus interested in modeling p(x_1^T, y_1^S). As this is intractable if we do it directly by considering all possible combinations, we introduce a hidden variable Q which represents the state, as in the classical HMM formulation [7], and which is synchronized with the longest sequence. Let N be the number of states. Moreover, in the model presented here, we always emit x_t at time t and sometimes emit y_s at time t. Let us first define \epsilon(i, t) = P(\tau_t = s \mid \tau_{t-1} = s-1, q_t = i, x_1^t, y_1^s) as the probability that the system emits the next observation of sequence Y at time t while in state i. The additional hidden variable \tau_t = s can be seen as the alignment between Y and Q (and X, which is aligned with Q). Hence, we model p(x_1^T, y_1^S, q_1^T, \tau_1^T).

2.1 Likelihood Computation
Using classical HMM independence assumptions, a simple forward procedure can be used to compute the joint likelihood of the two sequences, by introducing the following intermediate variable \alpha for each state and each possible alignment between the sequences X and Y:

\alpha(i, s, t) = p(q_t = i, \tau_t = s, x_1^t, y_1^s)

\alpha(i, s, t) = \epsilon(i, t)\, p(x_t, y_s \mid q_t = i) \sum_{j=1}^{N} P(q_t = i \mid q_{t-1} = j)\, \alpha(j, s-1, t-1)
\; + \; (1 - \epsilon(i, t))\, p(x_t \mid q_t = i) \sum_{j=1}^{N} P(q_t = i \mid q_{t-1} = j)\, \alpha(j, s, t-1) \tag{1}
¹ In fact, we assume that for all pairs of sequences (X, Y), sequence X is always at least as long as sequence Y. If this is not the case, a straightforward extension of the proposed model is then necessary.
which is very similar to the corresponding α variable used in normal HMMs. It can then be used to compute the joint likelihood of the two sequences as follows:

$$p(x_1^T, y_1^S) = \sum_{i=1}^{N} p(q_T{=}i,\ \tau_T{=}S,\ x_1^T,\ y_1^S) = \sum_{i=1}^{N} \alpha(i, S, T). \qquad (2)$$
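For illustration, the following minimal sketch implements the forward recursion of Eqs. (1)–(2) in Python/NumPy. It assumes, for simplicity, that the emission densities and ε(i, t) have already been evaluated and are passed in as arrays, that the initial state distribution is uniform, and that no numerical scaling is applied (a log-domain version would be needed for long sequences); all names are ours, not the paper's.

```python
import numpy as np

def ahmm_forward(trans, emit_xy, emit_x, eps):
    """Joint likelihood p(x_1^T, y_1^S) of an asynchronous HMM (sketch).

    trans   : (N, N) array, trans[j, i] = P(q_t = i | q_{t-1} = j)
    emit_xy : (T, S, N) array, emit_xy[t, s, i] = p(x_t, y_s | q_t = i)
    emit_x  : (T, N) array,    emit_x[t, i]     = p(x_t | q_t = i)
    eps     : (T, N) array,    eps[t, i]        = epsilon(i, t)
    """
    T, S, N = emit_xy.shape
    # alpha[i, s, t] = p(q_t = i, tau_t = s, x_1^t, y_1^s);
    # s = 0 means no observation of Y has been emitted yet.
    alpha = np.zeros((N, S + 1, T))
    init = np.full(N, 1.0 / N)                       # assumed uniform initial states
    alpha[:, 0, 0] = (1.0 - eps[0]) * emit_x[0] * init      # emit x_1 only
    alpha[:, 1, 0] = eps[0] * emit_xy[0, 0] * init          # emit x_1 and y_1
    for t in range(1, T):
        for s in range(0, S + 1):
            pred_same = alpha[:, s, t - 1] @ trans          # tau unchanged
            alpha[:, s, t] = (1.0 - eps[t]) * emit_x[t] * pred_same
            if s >= 1:
                pred_prev = alpha[:, s - 1, t - 1] @ trans  # tau advanced by one
                alpha[:, s, t] += eps[t] * emit_xy[t, s - 1] * pred_prev
    # Eq. (2): sum over final states, with the whole of Y emitted.
    return alpha[:, S, -1].sum()
```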
2.2
An EM Training Algorithm
An EM training algorithm can also be derived in the same fashion as for classical HMMs. We here sketch the resulting algorithm without going into more detail². Backward Step: Similarly to the forward step based on the α variable used to compute the joint likelihood, a backward variable β can also be derived as follows:

$$\beta(i,s,t) = p(x_{t+1}^T,\ y_{s+1}^S \mid q_t{=}i,\ \tau_t{=}s)$$

$$\beta(i,s,t) = \sum_{j=1}^{N} \epsilon(j,t{+}1)\, p(x_{t+1}, y_{s+1} \mid q_{t+1}{=}j)\, P(q_{t+1}{=}j \mid q_t{=}i)\, \beta(j, s{+}1, t{+}1) \;+\; \sum_{j=1}^{N} \big(1-\epsilon(j,t{+}1)\big)\, p(x_{t+1} \mid q_{t+1}{=}j)\, P(q_{t+1}{=}j \mid q_t{=}i)\, \beta(j, s, t{+}1). \qquad (3)$$
E-Step: Using both the forward and backward variables, one can compute the posterior probabilities of the hidden variables of the system, namely the posterior on the state when it emits on both sequences, the posterior on the state when it emits on sequence X only, and the posterior on transitions. Let α₁(i, s, t) be the part of α(i, s, t) where state i emits on Y at time t:

$$\alpha_1(i,s,t) = \epsilon(i,t)\, p(x_t, y_s \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, s-1, t-1) \qquad (4)$$

and similarly, let α₀(i, s, t) be the part of α(i, s, t) where state i does not emit on Y at time t:

$$\alpha_0(i,s,t) = \big(1-\epsilon(i,t)\big)\, p(x_t \mid q_t{=}i) \sum_{j=1}^{N} P(q_t{=}i \mid q_{t-1}{=}j)\, \alpha(j, s, t-1). \qquad (5)$$
Then the posterior on state i when it emits jointly on both sequences X and Y is

$$P(q_t{=}i,\ \tau_t{=}s \mid \tau_{t-1}{=}s{-}1,\ x_1^T, y_1^S) = \frac{\alpha_1(i,s,t)\,\beta(i,s,t)}{P(x_1^T, y_1^S)}, \qquad (6)$$
² The full derivations can be found in the appendix of [1].
the posterior on state i when it emits the next observation of sequence X only is

$$P(q_t{=}i,\ \tau_t{=}s \mid \tau_{t-1}{=}s,\ x_1^T, y_1^S) = \frac{\alpha_0(i,s,t)\,\beta(i,s,t)}{P(x_1^T, y_1^S)}, \qquad (7)$$

and the posterior on the transition between states i and j is

$$P(q_t{=}i,\ q_{t-1}{=}j \mid x_1^T, y_1^S) = \frac{P(q_t{=}i \mid q_{t-1}{=}j)}{P(x_1^T, y_1^S)} \left[ \sum_{s=1}^{S} \alpha(j, s-1, t-1)\, p(x_t, y_s \mid q_t{=}i)\, \epsilon(i,t)\, \beta(i,s,t) \;+\; \sum_{s=0}^{S} \alpha(j, s, t-1)\, p(x_t \mid q_t{=}i)\, \big(1-\epsilon(i,t)\big)\, \beta(i,s,t) \right]. \qquad (8)$$
M-Step: The Maximization step is performed exactly as in normal HMMs: when the distributions are modeled by exponential functions such as Gaussian Mixture Models, then an exact maximization can be performed using the posteriors. Otherwise, a Generalized EM is performed by gradient ascent, back-propagating the posteriors through the parameters of the distributions.
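As a small illustration of this exact M-step, the sketch below (our own code, not the paper's) updates a single diagonal Gaussian per state from the state posteriors; the experiments in the paper actually use 10-component mixtures, so this is a deliberate simplification, and the posterior array `gamma` is assumed to come from the E-step above.

```python
import numpy as np

def mstep_gaussian(x, gamma, var_floor=1e-4):
    """M-step for diagonal-Gaussian emissions p(x_t | q_t = i) (simplified sketch).

    x     : (T, D) observation sequence
    gamma : (T, N) state posteriors P(q_t = i | X, Y) from the E-step
    Returns updated means (N, D) and diagonal variances (N, D).
    """
    weights = gamma.sum(axis=0)                      # (N,) total posterior mass per state
    means = (gamma.T @ x) / weights[:, None]         # posterior-weighted average
    second = (gamma.T @ (x ** 2)) / weights[:, None] # posterior-weighted E[x^2]
    variances = np.maximum(second - means ** 2, var_floor)
    return means, variances
```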
3
Related Models
The present AHMM model is related to the Pair HMM model [5], which was proposed to search for the best alignment between two DNA sequences; it was thus designed and used mainly for discrete sequences. Moreover, the architecture of the Pair HMM is such that a given state always emits on either one or two sequences, while in the proposed AHMM each state can emit on one or on both sequences, depending on ε(i, t), which is learned. In fact, when ε(i, t) is deterministic and depends solely on i, we can recover the Pair HMM model by slightly transforming the architecture. The AHMM is also very similar to the asynchronous version of Input/Output HMMs [3], which was proposed for speech recognition applications. The main difference during recognition is that in AHMMs both sequences are considered as outputs, while in Asynchronous IOHMMs one of the sequences (the shorter one, the output) is conditioned on the other (the input). The resulting Viterbi decoding algorithm (used in the recognition experiments) is thus different, since in Asynchronous IOHMMs one of the sequences, the input, is known during decoding, which is not the case in AHMMs.
4
Implementation Issues
The proposed algorithms (either likelihood estimation or training) have a complexity of O(N²ST), where N is the number of states (assuming the worst case of ergodic connectivity), S is the length of sequence Y, and T is the length
of sequence X. This can quickly become intractable if both X and Y are longer than, say, 1000 frames. It can, however, be reduced when a priori knowledge about the possible alignments between X and Y is available. For instance, one can force the alignment between x_t and y_s to be such that |t − (T/S)·s| < k, where k is a constant representing the maximum stretching allowed between X and Y, which should depend on neither S nor T (a small illustration of this constraint is given after the list below). In that case, the complexity (both in time and space) becomes O(N²Tk), which is k times the usual complexity of HMM algorithms. In order to implement this system, we thus need to model the following distributions:
– P(q_t=i | q_{t−1}=j): the transition distribution, as in normal HMMs;
– p(x_t | q_t=i): the emission distribution in the case where only X is emitted at time t, as in normal HMMs;
– p(x_t, y_s | q_t=i): the emission distribution in the case where both sequences are emitted at time t. This distribution could be implemented in various forms, depending on the assumptions made about the data:
  • x_t and y_s are independent given state i (which is not the same as saying that X and Y are independent, of course):

$$p(x_t, y_s \mid q_t{=}i) = p(x_t \mid q_t{=}i)\, p(y_s \mid q_t{=}i) \qquad (9)$$

  • y_s is conditioned on x_t:

$$p(x_t, y_s \mid q_t{=}i) = p(y_s \mid x_t, q_t{=}i)\, p(x_t \mid q_t{=}i) \qquad (10)$$

  • the joint probability is modeled directly, possibly forcing some parameters of p(x_t | q_t=i) and p(x_t, y_s | q_t=i) to be shared. In the experiments described later in the paper, we chose the latter implementation, with no sharing except during initialization;
– ε(i, t) = P(τ_t=s | τ_{t−1}=s−1, q_t=i, x_1^t, y_1^s): the probability of emitting on sequence Y at time t in state i. Under various assumptions, this probability could be made independent of i, independent of s, or independent of x_t and y_s. In the experiments described later in the paper, we chose the latter implementation.
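To make the banded alignment constraint concrete, the fragment below (our own illustration, not code from the paper) enumerates which (t, s) alignment cells survive under |t − (T/S)·s| < k; only cells close to the linear alignment s ≈ (S/T)·t are kept, which is what brings the cost down from the full O(N²ST) table.

```python
def banded_cells(T, S, k):
    """Alignment cells (t, s) allowed by the constraint |t - (T/S)*s| < k.

    T, S : lengths of the long (X) and short (Y) sequences
    k    : maximum stretching allowed, counted in frames of X
    """
    return [(t, s) for t in range(1, T + 1)
                   for s in range(1, S + 1)
                   if abs(t - (T / S) * s) < k]

# Hypothetical example: 200 audio frames, 50 video frames, k = 17.
cells = banded_cells(200, 50, 17)
print(len(cells), "of", 200 * 50, "alignment cells are computed")
```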
5
Experiments
Audio-visual text-dependent speaker verification experiments were performed using the M2VTS database [6], which contains 185 recordings of 37 subjects, each containing acoustic and video signals of the subject pronouncing the French digits from zero to nine. The video consisted of 286×360 pixel color images at a 25 Hz frame rate, while the audio was recorded at 48 kHz using 16-bit PCM coding. The audio data was down-sampled to 8 kHz, and every 10 ms a vector of 16 MFCC coefficients and their first derivatives, as well as the derivative of the log
energy, was computed, for a total of 33 features. Each image of the video stream (25 per second) was coded using 12 shape features and 12 intensity features, as described in [4]. The first derivatives of these features were also computed, for a total of 48 features. In the following, we compared 6 different models:
– an AHMM trained on both voice and face data, as explained in this paper,
– an HMM trained on the fusion of voice and face data (obtained by up-sampling the face data so that the two streams have the same number of frames),
– an HMM trained on the voice data only,
– an HMM trained on the face data only,
– a Gaussian Mixture Model (GMM) trained on the voice data only,
– a fusion of the voice-only GMM and the face-only HMM, performed using a multi-layer perceptron with the two scores as input.
In all cases, we used the classical speaker verification technique: computing the difference between the log likelihood of the data given the client model and the log likelihood of the data given the world model (a model created with data not coming from the target client), and accepting the access when this difference was higher than a given threshold. Although the M2VTS database is one of the largest databases of its type, it is still too small to obtain statistically significant results directly. Hence, in order to increase the significance level of the experimental results, a 4-fold cross-validation method was used as follows. We used only 36 subjects, separated into 4 groups. For each subject, there were 5 different recording sessions; we used the first 2 sessions to create a client model and the last 3 sessions to estimate the quality of the model. For each group, we used the other 3 groups to create a world model (using only the first 2 sessions per client). Moreover, for each client in one of the other three groups, we adapted a client-specific model (using a simple MAP adaptation method [8]) from the world model (again using only the first 2 sessions of the client). Using these client-specific models, we selected a global threshold such that it yielded an Equal Error Rate (EER, the point at which the False Acceptance Rate, FAR, is equal to the False Rejection Rate, FRR). Finally, we adapted (using MAP again) a client-specific model from the world model for each client of the current test group and computed the Half Total Error Rate (HTER, the average of the FAR and the FRR) on the last three accesses of each test client using the global threshold previously found. Hence, all results presented here can be seen as unbiased, since no parameters (including the threshold) were computed using the test accesses. The HMM topologies were as follows: we used left-to-right HMMs for each word of the vocabulary, which consisted of the following 11 (French) words: zero, un, deux, trois, quatre, cinq, six, sept, huit, neuf, silence. Each model had between 3 and 9 states, including non-emitting begin and end states. In each emitting state, there were 3 distributions: p(x_t | q_t), the emission distribution of audio-only data, which consisted of a mixture of 10 Gaussians (of dimension 33); p(x_t, y_s | q_t), the joint emission distribution of audio and video
data, which also consisted of a mixture of 10 Gaussians (of dimension 33+48=81); and ε(i, t), the probability that the system should emit on the video sequence, which was implemented for these experiments as a simple table (but still trained, of course). Training of the AHMM was done using the EM algorithm described in this paper. However, in order to keep the computational time tractable, a constraint was imposed on the alignment between the audio and video streams: we did not consider alignments where audio and video information were farther than 0.68 seconds from each other (equivalent to 17 video frames). The GMM models used a silence removal technique based on an unsupervised bi-Gaussian method in order to remove all non-informative frames. In order to show the interest of robust multimodal speaker verification, we injected various levels of noise into the audio stream during test accesses (training was always done using clean audio). The noise was taken from the Noisex database [10] and was injected so as to reach signal-to-noise ratios of 10 dB, 5 dB and 0 dB. Note that all the hyper-parameters of these systems, such as the number of Gaussians in the mixtures, the number of EM iterations, or the minimum value of the Gaussian variances, were not tuned on the M2VTS dataset but on a speech recognition task using the Numbers'95 database. Figure 1 presents the results. For each method and each level of noise injected into the audio stream, we report the Half Total Error Rate (HTER), a measure often used to assess the quality of a verification system. As can be seen, the AHMM yielded better and more stable results as soon as the noise level in the audio stream became significant. For almost clean data, the audio-only GMM and the fusion of the GMM score with the face HMM score performed better, but their performance quickly deteriorated with the addition of noise.
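The decision rule and the error measures used in this evaluation protocol can be summarized with the short sketch below (our own code, with hypothetical score arrays): the client/world log-likelihood ratio is thresholded, the threshold is chosen at the EER on development scores, and the HTER is the average of FAR and FRR on test scores.

```python
import numpy as np

def llr_accept(ll_client, ll_world, threshold):
    """Accept the claim when log p(data|client) - log p(data|world) > threshold."""
    return (ll_client - ll_world) > threshold

def far_frr(impostor_scores, client_scores, threshold):
    far = np.mean(impostor_scores > threshold)    # impostors wrongly accepted
    frr = np.mean(client_scores <= threshold)     # true clients wrongly rejected
    return far, frr

def eer_threshold(impostor_scores, client_scores):
    """Pick the threshold where FAR and FRR are (approximately) equal."""
    candidates = np.sort(np.concatenate([impostor_scores, client_scores]))
    gaps = [abs(np.subtract(*far_frr(impostor_scores, client_scores, th)))
            for th in candidates]
    return candidates[int(np.argmin(gaps))]

def hter(impostor_scores, client_scores, threshold):
    far, frr = far_frr(impostor_scores, client_scores, threshold)
    return 0.5 * (far + frr)
```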
6
Conclusion
In this paper, we proposed the use of a novel asynchronous HMM architecture for the task of text-dependent multimodal person authentication. An EM training algorithm was given, and speaker verification experiments were performed on a multimodal database, yielding significant improvements on noisy audio data. Several implementation options were proposed, but only the simplest ones were tested here; the other options should thus be investigated in future work.
Acknowledgements This research has been partially carried out in the framework of the Swiss NCCR project (IM)2. The author would like to thank Stephane Dupont for providing the extracted visual features used in the paper.
[Figure 1: HTER (%) versus noise level (clean, 10 dB, 5 dB, 0 dB) for the voice GMM, voice HMM, voice+face HMM, voice+face AHMM, the fusion of the voice GMM and face HMM scores, and the face HMM.]

Fig. 1. HTER (the lower the better) of various systems under various noise conditions during test (from 10 to 0 dB additive noise). The proposed model is the AHMM using both audio and video streams
References
[1] S. Bengio. An asynchronous hidden Markov model for audio-visual speech recognition. Technical Report IDIAP-RR 02-26, IDIAP, 2002.
[2] S. Bengio. An asynchronous hidden Markov model for audio-visual speech recognition. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15, 2003.
[3] S. Bengio and Y. Bengio. An EM algorithm for asynchronous input/output hidden Markov models. In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996.
[4] S. Dupont and J. Luettin. Audio-visual speech modelling for continuous speech recognition. IEEE Transactions on Multimedia, 2:141–151, 2000.
[5] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[6] S. Pigeon and L. Vandendorpe. The M2VTS multimodal face database (release 1.00). In Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication (AVBPA), 1997.
[7] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[8] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 2000.
[9] A. Ross, A. K. Jain, and J. Z. Qian. Information fusion in biometrics. In Proceedings of the 3rd International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), pages 354–359, 2001.
[10] A. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical report, DRA Speech Research Unit, 1992.
[11] P. Verlinde, G. Chollet, and M. Acheroy. Multi-modal identity verification using expert fusion. Information Fusion, 1:17–33, 2000.
Theoretic Evidence k-Nearest Neighbourhood Classifiers in a Bimodal Biometric Verification System
Andrew Teoh Beng Jin¹, Salina Abdul Samad², and Aini Hussain²
¹ Faculty of Information Science and Technology (FIST), Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, 75450 Melaka, Malaysia
[email protected]
² Department of Electrical, Electronic and System Engineering, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
{salina,aini}@ukm.edu.my
Abstract. A bimodal biometric verification system based on facial and vocal biometric modules is described in this paper. The system under consideration is built in parallel: the matching scores reported by the two classifiers are fused using a theoretic evidence k-NN (tekNN) scheme based on Dempster-Shafer (D-S) theory. In this technique, each nearest neighbour of a pattern to be classified is regarded as an item of evidence supporting certain hypotheses concerning the pattern's class membership. Unlike statistically based fusion approaches, tekNN based on D-S theory is able to represent uncertainty and lack of knowledge. The use of tekNN therefore leads to a ternary decision scheme, {accept, reject, inconclusive}, which provides more secure protection. In the experimental results, the speech and facial biometric modules perform equally well, giving 93.5% and 94.0% verification rates, respectively. A 99.86% recognition rate is obtained when the two modules are fused. In addition, an 'unbalanced' case has been created to investigate the robustness of the technique.
1
Introduction
The identification of humans, for example for financial transactions, access control or computer access, has mostly been conducted using ID numbers, such as a PIN or a password. The main problem with such numbers is that they can be used by unauthorized persons. Instead, biometric verification systems use unique personal features of the user himself to verify the identity claimed [1]. However, a major problem with biometrics is that the physical appearance of a person tends to vary with time. In addition, correct verification may not be guaranteed due to sensor noise and limitations of the feature extractor and matcher. One solution to cope with these limitations is to combine several biometrics in a multi-modal identity verification system. By using multiple biometric traits, a much higher
accuracy can be achieved. Even when one biometric feature is somehow disturbed, for example in a noisy environment, the other traits can still lead to an accurate decision. A bimodal biometric verification system based on facial and vocal modalities is described in this paper. Each module of the system has been fine-tuned to deal with the problems that may occur in real-world applications, such as poor-quality images obtained from a low-cost PC camera and the channel distortion or convolution noise caused by the use of various types of microphones. The system is built in parallel: the scores delivered by the two classifiers are fused using theoretic evidence k-NN (tekNN) based on Dempster-Shafer (D-S) theory. The main advantage of the tekNN classifier over other parametric or nonparametric statistical classifiers is its uncertainty management [2], which provides a ternary decision scheme, {accept, reject, inconclusive}.
2
Biometrics Modules and Decision Fusion Scheme
2.1
Face Verification
In our system, shown in Fig. 1, the Eigenface approach [3] is used in the face detection and face recognition modules. The main idea of the Eigenface approach is to find the vectors that best account for the distribution of face images within the entire image space; these vectors define the face space. They are the eigenvectors of the covariance matrix of the original face images, and since they are face-like in appearance they are called Eigenfaces.
Fig. 1. Face Verification System
Face detection is accomplished by calculating the sum of squared errors between a region of the scene and its reconstruction from the Eigenfaces, a measure of Distance From Face Space (DFFS) that indicates how face-like a region is. By sweeping a window across the scene and computing the DFFS at each location, the most probable location of the face can be estimated [4].
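A minimal sketch of this detection step follows (our own code, not the authors'): the eigenfaces and mean face are assumed to be precomputed, and the window size and step are hypothetical parameters; each candidate window is projected onto the face space and its reconstruction error (DFFS) is used as a face-likeness score.

```python
import numpy as np

def dffs(window, mean_face, eigenfaces):
    """Distance From Face Space: reconstruction error of a candidate window.

    window     : (h*w,) flattened candidate region
    mean_face  : (h*w,) average training face
    eigenfaces : (K, h*w) orthonormal eigenvectors of the face covariance matrix
    """
    centered = window - mean_face
    coeffs = eigenfaces @ centered                  # projection onto the face space
    reconstruction = eigenfaces.T @ coeffs
    return np.sum((centered - reconstruction) ** 2)

def detect_face(scene, mean_face, eigenfaces, win=(64, 64), step=8):
    """Sweep a window over the scene and return the location with the lowest DFFS."""
    h, w = win
    best, best_loc = np.inf, (0, 0)
    for r in range(0, scene.shape[0] - h + 1, step):
        for c in range(0, scene.shape[1] - w + 1, step):
            score = dffs(scene[r:r + h, c:c + w].ravel(), mean_face, eigenfaces)
            if score < best:
                best, best_loc = score, (r, c)
    return best_loc, best
```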
From the extracted face, the eye coordinates are determined with a hybrid rule-based approach and a contour mapping technique [5]. Based on the information obtained, scale normalization and lighting normalization are applied to produce a head-in-box format. The Eigenface-based face recognition method is divided into two stages: (i) the training stage and (ii) the operational stage. At the training stage, the representation of each normalized face image i_n in a lower-dimensional subspace (the Eigenface space) is computed by the operation:

$$\varpi_k = u_k^T (i_n - \bar{i}) \qquad (1)$$
where n = 1, ..., M, k = 1, ..., K, u_k is the k-th Eigenface, and ī is the average face. Next, the training facial images are projected onto the eigenspace to generate the representations of the facial images in terms of Eigenfaces:

$$\Omega_i = [\varpi_{i1},\ \varpi_{i2},\ \ldots,\ \varpi_{iK}] \qquad (2)$$
where i = 1, 2, ..., M. At the operational stage, an incoming facial image is projected onto the same eigenspace, and the similarity measure, the Mahalanobis distance between the input facial image and the template, is computed in the eigenspace. Let φ_O denote the representation of the input face image with claimed identity C and φ_C denote the representation of the C-th template. The similarity function between φ_O and φ_C is defined as follows:

$$F_1(\varphi_O, \varphi_C) = \lVert \varphi_O - \varphi_C \rVert_m \qquad (3)$$

where ||·||_m denotes the Mahalanobis distance.
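As an illustration of Eqs. (1)–(3), the following sketch (ours, not the authors' implementation) projects a normalized face image onto the eigenfaces and scores it against the claimed template; it assumes the Mahalanobis distance is taken in the eigenspace with a diagonal covariance given by the PCA eigenvalues, which the paper does not state explicitly.

```python
import numpy as np

def eigenface_coeffs(image, mean_face, eigenfaces):
    """Eqs. (1)-(2): project a normalized face image onto the K eigenfaces."""
    return eigenfaces @ (image - mean_face)          # (K,) coefficient vector

def mahalanobis_score(probe_coeffs, template_coeffs, eigenvalues):
    """Eq. (3): Mahalanobis distance in eigenspace, assuming a diagonal
    covariance with the PCA eigenvalues on the diagonal."""
    diff = probe_coeffs - template_coeffs
    return float(np.sum(diff ** 2 / eigenvalues))
```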
2.2
Speaker Verification
The speaker verification module includes three important stages: endpoint detection, feature extraction, and pattern comparison. The endpoint detection stage aims to remove silent parts from the raw audio signal. Noise reduction techniques are used to reduce the noise in the speech signal: simple spectral subtraction [6] is first used to remove additive noise prior to endpoint detection. Then, in order to cope with the channel distortion or convolution noise introduced by a microphone, the zeroth-order cepstral coefficient is discarded and the remaining coefficients are appended with delta feature coefficients [7]. In addition, the cepstral components are weighted adaptively to emphasize the narrowband components and suppress the broadband components [8]. The cleaned audio signal is converted to 12th-order linear prediction cepstral coefficients (LPCC). Fig. 2 shows the process used in the front-end module.
Fig. 2. The front-end of the speaker verification module
As with the face recognition module, the speaker verification module also consists of two stages: (i) the training stage and (ii) the operational stage. At the training stage, four sample utterances of the same words from the same speaker are collected and trained using the modified k-means algorithm [7] in order to handle a wide range of individual speech variations. At the operational stage, we opted for a well-known pattern-matching algorithm, Dynamic Time Warping (DTW) [9], to compute the distance between the trained template and the input sample. Let φ_O represent the input speech sample with the claimed identity C and φ_C the C-th template. The similarity function between φ_O and φ_C is defined as follows:

$$F_1(\varphi_O, \varphi_C) = \lVert \varphi_O - \varphi_C \rVert \qquad (4)$$

where ||·|| denotes the distance score resulting from DTW.
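A standard DTW distance with symmetric unit step costs can be sketched as follows (our own code; the local path constraints and normalization actually used in [9] may differ):

```python
import numpy as np

def dtw_distance(template, sample):
    """Dynamic Time Warping distance between two cepstral feature sequences.

    template : (T1, D) array of feature frames
    sample   : (T2, D) array of feature frames
    """
    T1, T2 = len(template), len(sample)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(template[i - 1] - sample[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[T1, T2] / (T1 + T2)                    # length-normalized score
```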
2.3
Evidence Theoretic k-NN Classifier Based on D-S Theory
The Dempster-Shafer (D-S) theory of evidence [10] is a powerful tool for representing uncertain knowledge. In D-S theory, a problem is represented by a set Θ = {θ_1, ..., θ_K} called the frame of discernment. A basic belief assignment (BBA) m is a function that assigns a value in [0, 1] to every subset A of Θ and satisfies m(∅) = 0 and Σ_{A⊆Θ} m(A) = 1. m(A) can be interpreted as the measure of the belief that is committed exactly to A, given the available evidence. Two evidential functions derived from the BBA are the credibility function Bel and the plausibility function Pl, defined respectively as Bel(A) = Σ_{B⊆A} m(B) and Pl(A) = Σ_{A∩B≠∅} m(B) for all A ⊆ Θ. Any subset A of Θ such that m(A) > 0 is called a focal element of m. Given two BBAs m_1 and m_2 representing two independent sources of evidence, Dempster's rule defines a new BBA m = m_1 ⊕ m_2 satisfying m(∅) = 0 and

$$m(A) = \frac{1}{K} \sum_{A_1 \cap A_2 = A} m_1(A_1)\, m_2(A_2) \qquad (5)$$

where K is defined by

$$K = \sum_{A_1 \cap A_2 \neq \emptyset} m_1(A_1)\, m_2(A_2) \qquad (6)$$

for all A ≠ ∅.
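For the two-class frame used here, a BBA has at most three focal elements: {genuine}, {impostor}, and the whole frame Ω. Dempster's rule of Eqs. (5)–(6) can then be written compactly, as in the sketch below (our own code; the dictionary representation is a hypothetical convenience, and the example masses are invented for illustration).

```python
def dempster_combine(m1, m2):
    """Combine two BBAs over the frame {'genuine', 'impostor'} by Dempster's rule.

    Each BBA is a dict with keys 'genuine', 'impostor' and 'Omega' (the whole
    frame), whose values sum to 1. Assumes the two BBAs are not in total
    conflict (K > 0).
    """
    def intersect(a, b):
        if a == 'Omega':
            return b
        if b == 'Omega':
            return a
        return a if a == b else None                   # None = empty intersection

    combined = {'genuine': 0.0, 'impostor': 0.0, 'Omega': 0.0}
    conflict = 0.0
    for a1, v1 in m1.items():
        for a2, v2 in m2.items():
            inter = intersect(a1, a2)
            if inter is None:
                conflict += v1 * v2                    # mass assigned to the empty set
            else:
                combined[inter] += v1 * v2
    K = 1.0 - conflict                                 # Eq. (6): total non-conflicting mass
    return {a: v / K for a, v in combined.items()}     # Eq. (5)

# Example: face evidence mildly supports 'genuine', voice evidence is uncertain.
m_face = {'genuine': 0.6, 'impostor': 0.1, 'Omega': 0.3}
m_voice = {'genuine': 0.3, 'impostor': 0.2, 'Omega': 0.5}
print(dempster_combine(m_face, m_voice))
```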
For a bimodal biometric verification system, a two-class classification problem is considered. The set of classes is denoted by Ω = {ω_genuine, ω_impostor}. The available
information is assumed to consist of a training set T = {(x^(1), ω^(1)), ..., (x^(N), ω^(N))} of two-dimensional patterns x^(i), i = 1, ..., N, and their corresponding class labels ω^(i) taking values in Ω. The similarity between patterns is assumed to be correctly measured by a certain distance function d(·,·). Let x be the test input to be classified based on the information contained in T. Each pair (x_i, ω_i) constitutes a distinct item of evidence regarding the class membership of x. If x is "close" to x_i according to the relevant metric d, then one will be inclined to believe that both vectors belong to the same class. On the contrary, if d(x, x_i) is very large, then the consideration of x_i leaves us in a situation of almost complete ignorance concerning the class of x. Consequently, this item of evidence may be postulated to induce a basic belief assignment (BBA) m(· | x_i) over Ω defined by:

$$m(\{\omega_q\} \mid x_i) = \alpha\, \exp(-\gamma_q d^2) \qquad (7)$$

$$m(\Omega \mid x_i) = 1 - \alpha\, \exp(-\gamma_q d^2) \qquad (8)$$
where d = d(x, x_i), ω_q is the class of x_i (ω_i = ω_q), and α is a parameter such that 0 < α