Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4844
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha (Eds.)
Computer Vision – ACCV 2007 8th Asian Conference on Computer Vision Tokyo, Japan, November 18-22, 2007 Proceedings, Part II
Volume Editors Yasushi Yagi Osaka University The Institute of Scientific and Industrial Research 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan E-mail:
[email protected] Sing Bing Kang Microsoft Corporation 1 Microsoft Way, Redmond WA 98052, USA E-mail:
[email protected] In So Kweon KAIST School of Electrical Engineering and Computer Science 335 Gwahag-Ro Yusung-Gu, Daejeon, Korea E-mail:
[email protected] Hongbin Zha Peking University Department of Machine Intelligence Beijing, 100871, China E-mail:
[email protected]
Library of Congress Control Number: 2007938408
CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-76389-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-76389-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12183685 06/3180 543210
Preface
It is our great pleasure to welcome you to the Proceedings of the Eighth Asian Conference on Computer Vision (ACCV07), which was held November 18–22, 2007 in Tokyo, Japan. ACCV07 was sponsored by the Asian Federation of Computer Vision.

We received 640 abstracts by the abstract submission deadline, 551 of which became full submissions. This is the largest number of submissions in the history of ACCV. Out of these 551 full submissions, 46 were selected for oral presentation and 130 as posters, yielding an acceptance rate of 31.9%. Following the tradition of previous ACCVs, the reviewing process was double blind. Each of the 31 Area Chairs (ACs) handled about 17 papers and nominated five reviewers for each submission (from 204 Program Committee members). The final selection of three reviewers per submission was done in such a way as to avoid conflict of interest and to evenly balance the load among the reviewers. Once the reviews were done, each AC wrote summary reports based on the reviews and their own assessments of the submissions. For conflicting scores, ACs consulted with reviewers, and at times had us contact authors for clarification.

The AC meeting was held in Osaka on July 27 and 28. We divided the 31 ACs into 8 groups, with each group having 3 or 4 ACs. The ACs could confer within their respective groups, and were permitted to discuss with pre-approved “consulting” ACs outside their groups if needed. The ACs were encouraged to rely more on their perception of the papers vis-à-vis the reviewer comments, rather than strictly on numerical scores alone. This year, we introduced the category “conditional accept”; this category is targeted at papers with good technical content but whose writing requires significant improvement.

Please keep in mind that no reviewing process is perfect. As with any major conference, reviewer quality and timeliness of reviews varied. To minimize the impact of the variation of these factors, we chose highly qualified and dependable people as ACs to shepherd the review process. We all did the best we could given the large number of submissions and the limited time we had. Interestingly, we did not have to instruct the ACs to revise their decisions at the end of the AC meeting—all the ACs did a great job in ensuring the high quality of accepted papers. That being said, it is possible there were good papers that fell through the cracks, and we hope such papers will quickly end up being published in other good venues.

It has been a pleasure for us to serve as ACCV07 Program Chairs, and we can honestly say that this has been a memorable and rewarding experience. We would like to thank the ACCV07 ACs and members of the Technical Program Committee for their time and effort spent reviewing the submissions. The ACCV Osaka team (Ryusuke Sagawa, Yasushi Makihara, Tomohiro Mashita, Kazuaki Kondo, and Hidetoshi Mannami), as well as our conference secretaries (Noriko
Yasui, Masako Kamura, and Sachiko Kondo), did a terrific job organizing the conference. We hope that all of the attendees found the conference informative and thought provoking. November 2007
Yasushi Yagi Sing Bing Kang In So Kweon Hongbin Zha
Organization
General Chair: Katsushi Ikeuchi (University of Tokyo, Japan)
General Co-chairs: Naokazu Yokoya (NAIST, Japan), Rin-ichiro Taniguchi (Kyuushu University, Japan)
Program Chair: Yasushi Yagi (Osaka University, Japan)
Program Co-chairs: In So Kweon (KAIST, Korea), Sing Bing Kang (Microsoft Research, USA), Hongbin Zha (Peking University, China)
Workshop/Tutorial Chair: Kazuhiko Sumi (Mitsubishi Electric, Japan)
Finance Chair: Keiji Yamada (NEC, Japan)
Local Arrangements Chair: Yoshinari Kameda (University of Tsukuba, Japan)
Publication Chairs: Hideo Saito (Keio University, Japan), Daisaku Arita (ISIT, Japan)
Technical Support Staff: Atsuhiko Banno (University of Tokyo, Japan), Daisuke Miyazaki (University of Tokyo, Japan), Ryusuke Sagawa (Osaka University, Japan), Yasushi Makihara (Osaka University, Japan)

Area Chairs

Tat-Jen Cham (Nanyang Tech. University, Singapore) Koichiro Deguchi (Tohoku University, Japan) Frank Dellaert (Georgia Inst. of Tech., USA) Martial Hebert (CMU, USA) Ki Sang Hong (Pohang University of Sci. and Tech., Korea) Yi-ping Hung (National Taiwan University, Taiwan) Reinhard Klette (University of Auckland, New Zealand) Chil-Woo Lee (Chonnam National University, Korea) Kyoung Mu Lee (Seoul National University, Korea) Sang Wook Lee (Sogang University, Korea) Stan Z. Li (CASIA, China) Yuncai Liu (Shanghai Jiaotong University, China) Yasuyuki Matsushita (Microsoft Research Asia, China) Yoshito Mekada (Chukyo University, Japan) Yasuhiro Mukaigawa (Osaka University, Japan)
P.J. Narayanan (IIIT, India) Masatoshi Okutomi (Tokyo Inst. of Tech., Japan) Tomas Pajdla (Czech Technical University, Czech) Shmuel Peleg (The Hebrew University of Jerusalem, Israel) Jean Ponce (Ecole Normale Superieure, France) Long Quan (Hong Kong University of Sci. and Tech., China) Ramesh Raskar (MERL, USA) Jim Rehg (Georgia Inst. of Tech., USA) Jun Sato (Nagoya Inst. of Tech., Japan) Shinichi Sato (NII, Japan) Yoichi Sato (University of Tokyo, Japan) Cordelia Schmid (INRIA, France) Christoph Schnoerr (University of Mannheim, Germany) David Suter (Monash University, Australia) Xiaoou Tang (Microsoft Research Asia, China) Guangyou Xu (Tsinghua University, China)
Program Committee Adrian Barbu Akash Kushal Akihiko Torii Akihiro Sugimoto Alexander Shekhovtsov Amit Agrawal Anders Heyden Andreas Koschan Andres Bruhn Andrew Hicks Anton van den Hengel Atsuto Maki Baozong Yuan Bernt Schiele Bodo Rosenhahn Branislav Micusik C.V. Jawahar Chieh-Chih Wang Chin Seng Chua Chiou-Shann Fuh Chu-song Chen
Cornelia Fermuller Cristian Sminchisescu Dahua Lin Daisuke Miyazaki Daniel Cremers David Forsyth Duy-Dinh Le Fanhuai Shi Fay Huang Florent Segonne Frank Dellaert Frederic Jurie Gang Zeng Gerald Sommer Guoyan Zheng Hajime Nagahara Hanzi Wang Hassan Foroosh Hideaki Goto Hidekata Hontani Hideo Saito
Hiroshi Ishikawa Hiroshi Kawasaki Hong Zhang Hongya Tuo Hynek Bakstein Hyun Ki Hong Ikuko Shimizu Il Dong Yun Itaru Kitahara Ivan Laptev Jacky Baltes Jakob Verbeek James Crowley Jan-Michael Frahm Jan-Olof Eklundh Javier Civera Jean Martinet Jean-Sebastien Franco Jeffrey Ho Jian Sun Jiang Yu Zheng
Jianxin Wu Jianzhuang Liu Jiebo Luo Jingdong Wang Jinshi Cui Jiri Matas John Barron John Rugis Jong Soo Choi Joo-Hwee Lim Joon Hee Han Joost Weijer Jun Sato Jun Takamatsu Junqiu Wang Juwei Lu Kap Luk Chan Karteek Alahari Kazuhiro Hotta Kazuhiro Otsuka Keiji Yanai Kenichi Kanatani Kenton McHenry Ki Sang Hong Kim Steenstrup Pedersen Ko Nishino Koichi Hashimoto Larry Davis Lisheng Wang Manabu Hashimoto Marcel Worring Marshall Tappen Masanobu Yamamoto Mathias Kolsch Michael Brown Michael Cree Michael Isard Ming Tang Ming-Hsuan Yang Mingyan Jiang Mohan Kankanhalli Moshe Ben-Ezra Naoya Ohta Navneet Dalal Nick Barnes
Nicu Sebe Noboru Babaguchi Nobutaka Shimada Ondrej Drbohlav Osamu Hasegawa Pascal Vasseur Patrice Delmas Pei Chen Peter Sturm Philippos Mordohai Pierre Jannin Ping Tan Prabir Kumar Biswas Prem Kalra Qiang Wang Qiao Yu Qingshan Liu QiuQi Ruan Radim Sara Rae-Hong Park Ralf Reulke Ralph Gross Reinhard Koch Rene Vidal Robert Pless Rogerio Feris Ron Kimmel Ruigang Yang Ryad Benosman Ryusuke Sagawa S.H. Srinivasan S. Kevin Zhou Seungjin Choi Sharat Chandran Sheng-Wen Shih Shihong Lao Shingo Kagami Shin’ichi Satoh Shinsaku Hiura Shiguang Shan Shmuel Peleg Shoji Tominaga Shuicheng Yan Stan Birchfield Stefan Gehrig
Stephen Lin Stephen Maybank Subhashis Banerjee Subrata Rakshit Sumantra Dutta Roy Svetlana Lazebnik Takayuki Okatani Takekazu Kato Tat-Jen Cham Terence Sim Tetsuji Haga Theo Gevers Thomas Brox Thomas Leung Tian Fang Til Aach Tomas Svoboda Tomokazu Sato Toshio Sato Toshio Ueshiba Tyng-Luh Liu Vincent Lepetit Vivek Kwatra Vladimir Pavlovic Wee-Kheng Leow Wei Liu Weiming Hu Wen-Nung Lie Xianghua Ying Xianling Li Xiaogang Wang Xiaojuan Wu Yacoob Yaser Yaron Caspi Yasushi Sumi Yasutaka Furukawa Yasuyuki Sugaya Yeong-Ho Ha Yi-ping Hung Yong-Sheng Chen Yoshinori Kuno Yoshio Iwai Yoshitsugu Manabe Young Shik Moon Yunde Jia
Zen Chen Zhifeng Li Zhigang Zhu
Zhouchen Lin Zhuowen Tu Zuzana Kukelova
Additional Reviewers Afshin Sepehri Alvina Goh Anthony Dick Avinash Ravichandran Baidya Saha Brian Clipp Cédric Demonceaux Christian Beder Christian Schmaltz Christian Wojek Chunhua Shen Chun-Wei Chen Claude Pégard D.H. Ye D.J. Kwon Daniel Hein David Fofi David Gallup De-Zheng Liu Dhruv K. Mahajan Dipti Mukherjee Edgar Seemann Edgardo Molina El Mustapha Mouaddib Emmanuel Prados Frank R. Schmidt Frederik Meysel Gao Yan Guy Rosman Gyuri Dorko H.J. Shim Hang Yu Hao Du Hao Tang Hao Zhang Hirishi Ohno Hiroshi Ohno Huang Wei Hynek Bakstein
Ilya Levner Imran Junejo Jan Woetzel Jian Chen Jianzhao Qin Jimmy Jiang Liu Jing Wu John Bastian Juergen Gall K.J. Lee Kalin Kolev Karel Zimmermann Ketut Fundana Koichi Kise Kongwah Wan Konrad Schindler Kooksang Moon Levi Valgaerts Li Guan Li Shen Liang Wang Lin Liang Lingyu Duan Maojun Yuan Mario Fritz Martin Bujnak Martin Matousek Martin Sunkel Martin Welk Micha Andriluka Michael Stark Minh-Son Dao Naoko Nitta Neeraj Kanhere Niels Overgaard Nikhil Rane Nikodem Majer Nilanjan Ray Nils Hasler
Nipun Kwatra Olivier Morel Omar El Ganaoui Pankaj Kumar Parag Chaudhuri Paul Schnitzspan Pavel Kuksa Petr Doubek Philippos Mordohai Reiner Schnabel Rhys Hill Rizwan Chaudhry Rui Huang S.M. Shahed Nejhum S.H. Lee Sascha Bauer Shao-Wen Yang Shengshu Wang Shiro Kumano Shiv Vitaladevuni Shrinivas Pundlik Sio-Hoi Ieng Somnath Sengupta Sudipta Mukhopadhyay Takahiko Horiuchi Tao Wang Tat-Jun Chin Thomas Corpetti Thomas Schoenemann Thorsten Thormaehlen Weihong Li Weiwei Zhang Xiaoyi Yu Xinguo Yu Xinyu Huang Xuan Song Yi Feng Yichen Wei Yiqun Li
Yong Ma Yoshihiko Kawai
Zhichao Chen Zhijie Wang
Sponsors
Sponsor: Asian Federation of Computer Vision
Technical Co-sponsors: IPSJ SIG-CVIM, IEICE TG-PRMU
Table of Contents – Part II
Poster Session 4: Face/Gesture/Action Detection and Recognition Palmprint Recognition Under Unconstrained Scenes . . . . . . . . . . . . . . . . . . Yufei Han, Zhenan Sun, Fei Wang, and Tieniu Tan
1
Comparative Studies on Multispectral Palm Image Fusion for Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Hao, Zhenan Sun, and Tieniu Tan
12
Learning Gabor Magnitude Features for Palmprint Recognition . . . . . . . . Rufeng Chu, Zhen Lei, Yufei Han, Ran He, and Stan Z. Li
22
Sign Recognition Using Constrained Optimization . . . . . . . . . . . . . . . . . . . . Kikuo Fujimura and Lijie Xu
32
Poster Session 4: Image and Video Processing Depth from Stationary Blur with Adaptive Filtering . . . . . . . . . . . . . . . . . . Jiang Yu Zheng and Min Shi
42
Three-Stage Motion Deblurring from a Video . . . . . . . . . . . . . . . . . . . . . . . . Chunjian Ren, Wenbin Chen, and I-fan Shen
53
Near-Optimal Mosaic Selection for Rotating and Zooming Video Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nazim Ashraf, Imran N. Junejo, and Hassan Foroosh
63
Video Mosaicing Based on Structure from Motion for Distortion-Free Document Digitization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihiko Iketani, Tomokazu Sato, Sei Ikeda, Masayuki Kanbara, Noboru Nakajima, and Naokazu Yokoya Super Resolution of Images of 3D Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . Uma Mudenagudi, Ankit Gupta, Lakshya Goel, Avanish Kushal, Prem Kalra, and Subhashis Banerjee Learning-Based Super-Resolution System Using Single Facial Image and Multi-resolution Wavelet Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shu-Fan Lui, Jin-Yi Wu, Hsi-Shu Mao, and Jenn-Jier James Lien
73
85
96
Poster Session 4: Segmentation and Classification Statistical Framework for Shot Segmentation and Classification in Sports Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Yang, Shouxun Lin, Yongdong Zhang, and Sheng Tang
106
Sports Classification Using Cross-Ratio Histograms . . . . . . . . . . . . . . . . . . . Balamanohar Paluri, S. Nalin Pradeep, Hitesh Shah, and C. Prakash
116
A Bayesian Network for Foreground Segmentation in Region Level . . . . . Shih-Shinh Huang, Li-Chen Fu, and Pei-Yung Hsiao
124
Efficient Graph Cuts for Multiclass Interactive Image Segmentation . . . . Fangfang Lu, Zhouyu Fu, and Antonio Robles-Kelly
134
Feature Subset Selection for Multi-class SVM Based Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Wang
145
Evaluating Multi-class Multiple-Instance Learning for Image Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinyu Xu and Baoxin Li
155
Poster Session 4: Shape TransforMesh: A Topology-Adaptive Mesh-Based Approach to Surface Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrei Zaharescu, Edmond Boyer, and Radu Horaud
166
Microscopic Surface Shape Estimation of a Transparent Plate Using a Complex Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masao Shimizu and Masatoshi Okutomi
176
Shape Recovery from Turntable Image Sequence . . . . . . . . . . . . . . . . . . . . . H. Zhong, W.S. Lau, W.F. Sze, and Y.S. Hung
186
Shape from Contour for the Digitization of Curved Documents . . . . . . . . Frédéric Courteille, Jean-Denis Durou, and Pierre Gurdjos
196
Improved Space Carving Method for Merging and Interpolating Multiple Range Images Using Information of Light Sources of Active Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Furukawa, Tomoya Itano, Akihiko Morisaka, and Hiroshi Kawasaki
206
Shape Representation and Classification Using Boundary Radius Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hamidreza Zaboli and Mohammad Rahmati
217
Optimization A Convex Programming Approach to the Trace Quotient Problem . . . . . Chunhua Shen, Hongdong Li, and Michael J. Brooks
227
Learning a Fast Emulator of a Binary Decision Process . . . . . . . . . . . . . . . Jan Šochman and Jiří Matas
236
Radiometry Multiplexed Illumination for Measuring BRDF Using an Ellipsoidal Mirror and a Projector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Mukaigawa, Kohei Sumino, and Yasushi Yagi
246
Analyzing the Influences of Camera Warm-Up Effects on Image Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Holger Handel
258
Geometry Simultaneous Plane Extraction and 2D Homography Estimation Using Local Feature Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ouk Choi, Hyeongwoo Kim, and In So Kweon
269
A Fast Optimal Algorithm for L2 Triangulation . . . . . . . . . . . . . . . . . . . . . . Fangfang Lu and Richard Hartley
279
Adaptively Determining Degrees of Implicit Polynomial Curves and Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bo Zheng, Jun Takamatsu, and Katsushi Ikeuchi
289
Determining Relative Geometry of Cameras from Normal Flows . . . . . . . Ding Yuan and Ronald Chung
301
Poster Session 5: Geometry Highest Accuracy Fundamental Matrix Computation . . . . . . . . . . . . . . . . . Yasuyuki Sugaya and Kenichi Kanatani
311
Sequential L∞ Norm Minimization for Triangulation . . . . . . . . . . . . . . . . . Yongduek Seo and Richard Hartley
322
Initial Pose Estimation for 3D Model Tracking Using Learned Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Wimmer and Bernd Radig
332
Multiple View Geometry for Non-rigid Motions Viewed from Translational Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng Wan, Kazuki Kozuka, and Jun Sato Visual Odometry for Non-overlapping Views Using Second-Order Cone Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jae-Hak Kim, Richard Hartley, Jan-Michael Frahm, and Marc Pollefeys
342
353
Pose Estimation from Circle or Parallel Lines in a Single Image . . . . . . . . Guanghui Wang, Q.M. Jonathan Wu, and Zhengqiao Ji
363
An Occupancy – Depth Generative Model of Multi-view Images . . . . . . . Pau Gargallo, Peter Sturm, and Sergi Pujades
373
Poster Session 5: Matching and Registration Image Correspondence from Motion Subspace Constraint and Epipolar Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shigeki Sugimoto, Hidekazu Takahashi, and Masatoshi Okutomi Efficient Registration of Aerial Image Sequences Without Camera Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shobhit Niranjan, Gaurav Gupta, Amitabha Mukerjee, and Sumana Gupta
384
394
Simultaneous Appearance Modeling and Segmentation for Matching People Under Occlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhe Lin, Larry S. Davis, David Doermann, and Daniel DeMenthon
404
Content-Based Matching of Videos Using Local Spatio-temporal Fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gajinder Singh, Manika Puri, Jeffrey Lubin, and Harpreet Sawhney
414
Automatic Range Image Registration Using Mixed Integer Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shizu Sakakubara, Yuusuke Kounoike, Yuji Shinano, and Ikuko Shimizu Accelerating Pattern Matching or How Much Can You Slide? . . . . . . . . . . Ofir Pele and Michael Werman
424
435
Poster Session 5: Recognition Detecting, Tracking and Recognizing License Plates . . . . . . . . . . . . . . . . . . Michael Donoser, Clemens Arth, and Horst Bischof
447
Action Recognition for Surveillance Applications Using Optic Flow and SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Somayeh Danafar and Niloofar Gheissari
457
The Kernel Orthogonal Mutual Subspace Method and Its Application to 3D Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuhiro Fukui and Osamu Yamaguchi
467
Viewpoint Insensitive Action Recognition Using Envelop Shape . . . . . . . . Feiyue Huang and Guangyou Xu
477
Unsupervised Identification of Multiple Objects of Interest from Multiple Images: dISCOVER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Devi Parikh and Tsuhan Chen
487
Poster Session 5: Stereo, Range and 3D Fast 3-D Interpretation from Monocular Image Sequences on Large Motion Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jong-Sung Kim and Ki-Sang Hong
497
Color-Stripe Structured Light Robust to Surface Color and Discontinuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kwang Hee Lee, Changsoo Je, and Sang Wook Lee
507
Stereo Vision Enabling Precise Border Localization Within a Scanline Optimization Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefano Mattoccia, Federico Tombari, and Luigi Di Stefano
517
Three Dimensional Position Measurement for Maxillofacial Surgery by Stereo X-Ray Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naoya Ohta, Kenji Mogi, and Yoshiki Nakasone
528
Stereo Total Absolute Gaussian Curvature for Stereo Prior . . . . . . . . . . . . . . . . . . Hiroshi Ishikawa
537
Fast Optimal Three View Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Byröd, Klas Josephson, and Kalle Åström
549
Stereo Matching Using Population-Based MCMC . . . . . . . . . . . . . . . . . . . . Joonyoung Park, Wonsik Kim, and Kyoung Mu Lee
560
Dense 3D Reconstruction of Specular and Transparent Objects Using Stereo Cameras and Phase-Shift Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masaki Yamazaki, Sho Iwata, and Gang Xu
570
Image and Video Processing Identifying Foreground from Multiple Images . . . . . . . . . . . . . . . . . . . . . . . Wonwoo Lee, Woontack Woo, and Edmond Boyer
580
Image and Video Matting with Membership Propagation . . . . . . . . . . . . . . Weiwei Du and Kiichi Urahama
590
Temporal Priors for Novel Video Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . Ali Shahrokni, Oliver Woodford, and Ian Reid
601
Content-Based Image Retrieval by Indexing Random Subwindows with Randomized Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Raphaël Marée, Pierre Geurts, and Louis Wehenkel
611
Poster Session 6: Face/Gesture/Action Detection and Recognition Analyzing Facial Expression by Fusing Manifolds . . . . . . . . . . . . . . . . . . . . Wen-Yan Chang, Chu-Song Chen, and Yi-Ping Hung
621
A Novel Multi-stage Classifier for Face Recognition . . . . . . . . . . . . . . . . . . . Chen-Hui Kuo, Jiann-Der Lee, and Tung-Jung Chan
631
Discriminant Clustering Embedding for Face Recognition with Image Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youdong Zhao, Shuang Xu, and Yunde Jia
641
Privacy Preserving: Hiding a Face in a Face . . . . . . . . . . . . . . . . . . . . . . . . . Xiaoyi Yu and Noboru Babaguchi
651
Face Mosaicing for Pose Robust Video-Based Recognition . . . . . . . . . . . . Xiaoming Liu and Tsuhan Chen
662
Face Recognition by Using Elongated Local Binary Patterns with Average Maximum Distance Gradient Magnitude . . . . . . . . . . . . . . . . . . . . Shu Liao and Albert C.S. Chung
672
An Adaptive Nonparametric Discriminant Analysis Method and Its Application to Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang Huang, Yong Ma, Yoshihisa Ijiri, Shihong Lao, Masato Kawade, and Yuming Zhao
680
Discriminating 3D Faces by Statistics of Depth Differences . . . . . . . . . . . . Yonggang Huang, Yunhong Wang, and Tieniu Tan
690
Kernel Discriminant Analysis Based on Canonical Differences for Face Recognition in Image Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wen-Sheng Vincent Chu, Ju-Chin Chen, and Jenn-Jier James Lien
700
Person-Similarity Weighted Feature for Expression Recognition . . . . . . . . Huachun Tan and Yu-Jin Zhang
712
Converting Thermal Infrared Face Images into Normal Gray-Level Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mingsong Dou, Chao Zhang, Pengwei Hao, and Jun Li Recognition of Digital Images of the Human Face at Ultra Low Resolution Via Illumination Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jen-Mei Chang, Michael Kirby, Holger Kley, Chris Peterson, Bruce Draper, and J. Ross Beveridge
722
733
Poster Session 6: Math for Vision Crystal Vision-Applications of Point Groups in Computer Vision . . . . . . . Reiner Lenz
744
On the Critical Point of Gradient Vector Flow Snake . . . . . . . . . . . . . . . . . Yuanquan Wang, Jia Liang, and Yunde Jia
754
A Fast and Noise-Tolerant Method for Positioning Centers of Spiraling and Circulating Vector Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ka Yan Wong and Chi Lap Yip
764
Interpolation Between Eigenspaces Using Rotation in Multiple Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomokazu Takahashi, Lina, Ichiro Ide, Yoshito Mekada, and Hiroshi Murase Conic Fitting Using the Geometric Distance . . . . . . . . . . . . . . . . . . . . . . . . . Peter Sturm and Pau Gargallo
774
784
Poster Session 6: Segmentation and Classification Efficiently Solving the Fractional Trust Region Problem . . . . . . . . . . . . . . . Anders P. Eriksson, Carl Olsson, and Fredrik Kahl
796
Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomoyuki Nagahashi, Hironobu Fujiyoshi, and Takeo Kanade
806
Backward Segmentation and Region Fitting for Geometrical Visibility Range Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erwan Bigorgne and Jean-Philippe Tarel
817
Image Segmentation Using Co-EM Strategy . . . . . . . . . . . . . . . . . . . . . . . . . Zhenglong Li, Jian Cheng, Qingshan Liu, and Hanqing Lu
827
Co-segmentation of Image Pairs with Quadratic Global Constraint in MRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yadong Mu and Bingfeng Zhou
837
Shape from X Shape Reconstruction from Cast Shadows Using Coplanarities and Metric Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Kawasaki and Ryo Furukawa
847
Evolving Measurement Regions for Depth from Defocus . . . . . . . . . . . . . . . Scott McCloskey, Michael Langer, and Kaleem Siddiqi
858
A New Framework for Grayscale and Colour Non-Lambertian Shape-from-shading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . William A.P. Smith and Edwin R. Hancock
869
Face A Regularized Approach to Feature Selection for Face Detection . . . . . . . Augusto Destrero, Christine De Mol, Francesca Odone, and Alessandro Verri
881
Iris Tracking and Regeneration for Improving Nonverbal Interface . . . . . . Takuma Funahashi, Takayuki Fujiwara, and Hiroyasu Koshimizu
891
Face Mis-alignment Analysis by Multiple-Instance Subspace . . . . . . . . . . . Zhiguo Li, Qingshan Liu, and Dimitris Metaxas
901
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
911
Palmprint Recognition Under Unconstrained Scenes

Yufei Han, Zhenan Sun, Fei Wang, and Tieniu Tan

Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing, P.R. China, 100080
{yfhan,znsun,fwang,tnt}@nlpr.ia.ac.cn
Abstract. This paper presents a novel real-time palmprint recognition system for cooperative user applications. The system is the first to capture and recognize palmprint images without contact under unconstrained scenes. Its novelty lies in two aspects. The first is a novel design of the image capturing device. The hardware can reduce the influence of background objects and segment out hand regions efficiently. The second is a process of automatic hand detection and fast palmprint alignment, which aims to obtain normalized palmprint images for subsequent feature extraction. The palmprint recognition algorithm used in the system is based on an accurate ordinal palmprint representation. By integrating the power of the novel imaging device, the palmprint preprocessing approach and the palmprint recognition engine, the proposed system provides a friendly user interface and achieves good performance under unconstrained scenes.
1 Introduction

Biometric technology identifies different people by their physiological and behavioral differences. Compared with traditional security authentication approaches, such as keys or passwords, biometrics is more accurate and dependable, and more difficult to steal or fake. In the family of biometrics, palmprint is a novel but promising member. The large region of the palm supplies plenty of line patterns which can be easily captured in a low-resolution palmprint image. Based on those line patterns, palmprint recognition can achieve highly accurate identity authentication. In previous work, several successful recognition systems have been proposed for practical palmprint-based identity check [1][2][3]; the best-known was developed by Zhang et al. [1]. During image capturing, users are required to place their hands on a plate with pegs controlling the displacement of the hands. High-quality palmprint images are then captured by a CCD camera fixed in a semi-closed environment with a uniform light condition. To align the captured palmprint images, a preprocessing algorithm [2] is adopted to correct the rotation of those images and crop square ROIs (regions of interest) of the same size. Details about this system can be found in [2]. In addition, Connie et al. proposed a peg-free palmprint recognition system [3], which captures palmprint images with an optical scanner. Subjects are allowed to place their hands more freely on the platform of the scanner without pegs. As a result,
palmprint images with different sizes, translations and rotation angles are obtained. As in [2], an alignment process is involved to obtain normalized ROI images. However, efficient as these systems are, there are still some limitations. Firstly, some users may feel uncomfortable with pegs restricting their hands during image capture. Secondly, even without pegs, subjects' hands are required to contact the plate of the device or the platform of the scanner, which is not hygienic. Thirdly, semi-closed image capturing devices usually increase the volume of recognition systems, which makes them inconvenient for portable use. Thus, it is necessary to improve the design of the HCI (human-computer interface) in order to make the whole system easy to use.

Recently, active near infrared (NIR) imagery technology has received more and more attention in face detection and recognition, as seen in [4]. Given a near infrared light source shining on objects in front of the camera, the intensity of the reflected NIR light is attenuated at a large scale as the distance between the objects and the light source increases. This property provides a promising solution to eliminate the influence of backgrounds when palmprint images are captured under unconstrained scenes. Based on this technology, in this paper we propose a novel real-time palmprint recognition system. It is designed to conveniently localize and obtain normalized palmprint images under cluttered scenes. The main contributions are as follows:

First, we present a novel design of a portable image capturing device, which mainly consists of two web cameras placed in parallel. One is used for active near infrared imagery to localize hand regions. The other captures the corresponding palmprint images in visible light, preparing for further feature extraction.

Second, we present a novel palmprint preprocessing algorithm, utilizing the color and shape information of hands for fast and effective hand region detection, rotation correction and localization of the central palm region. As far as we know, there is no similar work reported in the previous literature.

The rest of the paper is organized as follows: Section 2 presents a description of the whole architecture of the recognition system. In Section 3, the design of the human-computer interface of the system is described in detail. Section 4 briefly introduces the ordinal palmprint representation. Section 5 evaluates the performance of the system. Finally, in Section 6, we conclude the paper.
2 System Overview

We adopt a common PC with an Intel Pentium 4 3.0 GHz CPU and 1 GB RAM as the computation platform. Based on it, the recognition system is implemented using Microsoft Visual C++ 6.0. It consists of five main modules, as shown in Fig. 1. After starting the system, users are required to open their hands in a natural manner and place the palm region toward the imaging device at a distance between 35 cm and 50 cm from the cameras. The palm surface is approximately orthogonal to the optical axis of the cameras. In-plane rotation of the hand is restricted to between -15 and 15 degrees deviated from the vertical orientation. The imaging device then captures two images for each hand with two cameras placed in parallel. One is a NIR hand image captured with active NIR lighting; the other is a color hand image with background objects, obtained under normal environment lighting. Both of them contain the complete hand region, as shown in Fig. 2. After that, an efficient palmprint preprocessing
algorithm is performed on the two captured images to quickly obtain one normalized palmprint image, making use of both the shape and skin color information of the hand. Finally, robust palmprint feature templates are extracted from the normalized image using the ordinal code based approach [5]. A fast Hamming distance calculation is applied to measure the dissimilarity between two feature templates. An example of the whole recognition process can be seen in the supplementary video of this paper.
3 Smart Human-Computer Interface

The HCI of the system mainly consists of two parts, the image capturing hardware and the palmprint preprocessing procedure, as shown in Fig. 1. In a hand image captured under an unconstrained scene, unlike those captured by the devices in [1][2][3], there is not only a hand region containing palmprint patterns, but also background objects of different shapes, colors and positions, as denoted in Fig. 2. Even within the hand, there still exists rotation, scale variation and translation of palmprint patterns due to different hand displacements. Thus, before further palmprint feature encoding, the HCI should localize the candidate hand region and extract a normalized ROI (region of interest), which contains palmprint features without much geometric deformation.

3.1 Image Capturing Device

Before palmprint alignment, it is necessary to segment hand regions from unconstrained scenes. This problem could be solved by background modeling and subtraction or by labeling skin-color regions. However, both methods suffer from unconstrained backgrounds or varying light conditions. Our design of the imaging device aims to solve the problem at the sensor level, in order to localize foreground hand regions more robustly by simple image binarization. The appearance of the image capturing device is shown in Fig. 2(a). The device has two common CMOS web cameras placed in parallel. We mount near infrared (NIR) light-emitting diodes on the device, evenly distributed around one camera, similar to [4], so as to provide straight and uniform NIR lighting. The near infrared light emitted by these LEDs has a wavelength of 850 nm. In a further step, we make use of a band-pass optical filter fixed on the camera lens to cut off light of all wavelengths other than 850 nm. Most environment light is cut off because its wavelength is less than 700 nm. Thus, the light received by the camera consists only of the reflected NIR LED light and the NIR components of environment light, such as lamp light and sunlight, which are much weaker than the NIR LED light. Notably, the intensity of the reflected NIR LED light is inversely proportional to high-order terms of the distance between the object and the camera. Therefore, assuming the hand is the nearest among all objects in front of the camera during image capturing, the intensities of the hand region in the corresponding NIR image should be much larger than those of the background. As a result, we can segment out the hand region and eliminate the background by fast image binarization, as denoted in Fig. 2(b). The other
camera in the device captures color scene images, obtaining clear palmprint patterns and preserving the color information of the hand. An optical filter is fixed on the lens of this camera to filter out infrared components in the reflected light, as is widely done in digital cameras to avoid red-eye. The two cameras work simultaneously. In our device, the resolution of both cameras is 640*480. Fig. 2(b) shows a pair of example images captured by the two cameras at the same time. The upper one is the color image and the bottom one is the NIR image. The segmentation result is shown in the upper row of Fig. 2(c). In order to focus on hand regions with a proper scale in further processing, we adopt a scale selection on the binary segmentation results to choose candidate foreground regions. The selection criterion is grounded on the fact that the area of a hand region in a NIR image is larger when the hand is nearer to the camera. We label all connected binary foreground regions after segmentation and calculate the area of each connected component, then choose those labeled regions whose areas fall within a predefined narrow range as the candidate foreground regions, like the white region shown in the image at the bottom of Fig. 2(c).
Fig. 1. Flowcharts of the system
Fig. 2. (a) Image capturing device (b) Pair-wise color and NIR images (c) Segmented foreground and candidate foreground region selection
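To make the scale-selection step concrete, the following is a minimal sketch of how a NIR frame could be binarized and filtered by connected-component area with OpenCV. It is only an illustration of the idea, not the authors' implementation; the intensity threshold and the area bounds are hypothetical values that would need tuning to the actual device.

```python
import cv2
import numpy as np

def candidate_hand_regions(nir_image, intensity_thresh=80,
                           min_area=20000, max_area=60000):
    """Binarize an NIR frame and keep connected components whose area
    falls in a pre-defined range (hand held 35-50 cm from the camera).
    Threshold and area bounds are illustrative, not the paper's values."""
    _, binary = cv2.threshold(nir_image, intensity_thresh, 255, cv2.THRESH_BINARY)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    mask = np.zeros_like(binary)
    for i in range(1, num):                      # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:
            mask[labels == i] = 255              # keep candidate foreground
    return mask
```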
3.2 Automated Hand Detection

Hand detection is posed as a two-class problem of classifying the input shape pattern into hand-like and non-hand classes. In our system, a cascade classifier is trained to detect hand regions in the binary foreground, based on the work reported in [6], where Eng-Jon Ong et al. make use of such a classifier to classify different hand gestures. In our application, the cascade classifier should be competent for two tasks. Firstly, it should differentiate the shape of an open hand from all other kinds of shapes. Secondly, it should reject open hands whose in-plane rotation angle deviates outside the restricted range. To achieve these goals, we first construct a positive dataset containing binary open left hands, as illustrated in Fig. 3(a). In order to make the classifier tolerate a certain in-plane rotation, the dataset consists of left hands with seven discrete rotation angles, sampled every 5 degrees from -15 to 15 degrees deviated from the vertical orientation; a part of those binary hands are collected from [11]. For each angle, there are about 800 hand images with slight variations in finger posture, also shown in Fig. 3(a). Before training, all positive data are normalized into 50*35 images. The negative dataset contains two parts. One consists of binary images containing non-hand objects, such as human heads, turtles and cars, partly from [10]. The other contains left hands with rotation angles outside the restricted range and right hands with a variety of displacements. There are in total more than 60,000 negative images. Fig. 3(b) shows example negative images. Based on those training data, we use the Float AdaBoost algorithm to select the most efficient Haar features to construct the cascade classifier, as in [6]. Fig. 3(c) shows the six most efficient Haar features obtained after training. We see that they represent discriminative shape features of the open left hand. During detection, rather than the exhaustive search across all positions and scales in [6], we apply the classifier directly around the candidate binary foreground regions
Fig. 3. (a) Positive training data (b) Negative training data (c) Learned efficient Haar features (d) Detected hand region
to search for open left hands with a certain scale. Therefore, we can detect different hands at a relatively stable scale, which reduces the influence of scale variations on palmprint patterns. Considering the mirror symmetry between left and right hands, to detect right hands we simply apply a symmetry transform to the images and run the classifier in the same way on the flipped images. Fig. 3(d) shows detection results. Once the hand is detected, all other non-hand connected regions are removed from the binary hand image. The whole detection can be finished within 20 ms.

3.3 Palmprint Image Alignment

The palmprint alignment procedure eliminates rotation and translation of palmprint patterns in order to obtain a normalized ROI. Most alignment algorithms calculate the rotation angle of the hand by localizing key contour points in the gaps between fingers [2][3]. However, in our application, different finger displacements may change local contours and make it difficult to detect the gap regions, as denoted in Fig. 4. To solve this problem, we adopt a fast rotation angle estimation based on moments of the hand shape. Let R be the detected hand region in a binary foreground image. Its orientation θ can be estimated from its moments [7]:

\theta = \frac{1}{2} \arctan\left( \frac{2\mu_{1,1}}{\mu_{2,0} - \mu_{0,2}} \right)    (1)

where \mu_{p,q} (p, q = 0, 1, ...) is the (p, q)-order central moment, defined as

\mu_{p,q} = \sum_{x}\sum_{y} \left( x - \frac{1}{N}\sum_{x}\sum_{y} x \right)^{p} \left( y - \frac{1}{N}\sum_{x}\sum_{y} y \right)^{q}, \quad (x, y) \in R    (2)
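As an illustration of Eqs. (1) and (2), the sketch below estimates the hand orientation from central moments and applies the corresponding rotation correction. It is an assumed implementation built on OpenCV's moment computation, not the code used in the system; the degree/sign conventions follow the paper's description of rotating by −θ.

```python
import cv2
import numpy as np

def hand_orientation(binary_hand):
    """Estimate the in-plane orientation (degrees) of a binary hand region
    from its central moments, following Eq. (1)."""
    m = cv2.moments(binary_hand, binaryImage=True)
    theta = 0.5 * np.arctan2(2.0 * m["mu11"], m["mu20"] - m["mu02"])
    return np.degrees(theta)

def correct_rotation(image, angle_deg):
    """Rotate the image by -angle_deg about its center so the hand becomes
    vertically oriented (applied to both the NIR and the color image)."""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), -angle_deg, 1.0)
    return cv2.warpAffine(image, M, (w, h))
```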
Compared with key point detection, the moments are calculated over the whole hand region rather than only the contour points. Thus, the estimate is more robust to local changes in the contour. To reduce the computation cost, the original binary image is down-sampled to 160*120, and the moments are calculated on the down-sampled version. After obtaining the rotation angle θ, the hand region is rotated by −θ degrees to obtain a vertically oriented hand, as shown in Fig. 4. Simultaneously, the corresponding color image is also rotated by −θ, in order to keep the hand orientations consistent in both images. In a further step, we locate the central palm region in a vertically oriented open hand by analyzing the difference in connectivity between the palm region and the finger region. Although the shape and size of hands vary a lot, the palm region of each hand is roughly rectangular. In contrast, stretched fingers do not form a connected region like the palm. Based on this property, we employ an erosion operation on the binary hand image to remove the finger regions. The basic idea behind this operation is the run-length code of a binary image. We perform a raster scan on each row to calculate the maximum length W of connective sequences in the row. Any row whose W is less than a threshold K1 is eroded. After all rows are scanned, the same operation is performed on each column, and columns whose maximum length W is less than K2 are removed. Finally, a rectangular palm region is cropped from the hand, and the coordinates (x_p, y_p) of its central point are derived as the localization result. In order to
cope with the varying sizes of different hands, we choose the values of K1 and K2 adaptively. Before row erosion, the distance between each point in the hand region and the nearest edge point is calculated by a fast distance transform. The central point of the hand is defined as the point with the largest distance value. Assuming A is the maximum length of connective sequences in the row passing through the central point, K1 is defined as follows:

K_1 = A \times p\%    (3)

p is a pre-defined threshold. K2 is defined in the same way:

K_2 = B \times q\%    (4)

B is the maximum length of connective sequences in the column passing through the central point after row erosion, and q is another pre-defined threshold. Compared with fixed values, adaptive K1 and K2 lead to more accurate localization of the central palm region, as denoted in Fig. 5(b). Fig. 5(a) shows the whole erosion procedure. Due to the visual disparity between the two cameras in the imaging device, we cannot use (x_p, y_p) to localize the ROI in the corresponding color image directly. Although the visual disparity could be estimated by 3D scene reconstruction, this approach would place a heavy computational burden on the system. Instead, we apply a fast correspondence estimation based on template matching. Assuming C is a color hand image after rotation correction, we convert C into a binary image M by setting all pixels with skin color to 1, based on the probability distribution model of skin color in RGB space [8]. Given the binary version of the corresponding NIR image, with a hand region S located at (x_n, y_n), template matching is conducted as in Eq. (5), also denoted in Fig. 6:
f(m, n) = \sum_{x}\sum_{y} \left[ M(x + m, y + n) \oplus S(x, y) \right], \quad (x, y) \in S    (5)

where ⊕ is the bitwise AND operator, f(m, n) is a matching energy function, and (m, n) is a candidate position of the template. The optimal displacement (x_o, y_o) of the hand shape S in M is defined as the candidate position where the matching energy achieves its maximum. The central point (x_c, y_c) of the palm region in C can then be estimated by the following equations:

x_c = x_p + x_o - x_n, \quad y_c = y_p + y_o - y_n    (6)

Fig. 4. Rotation correction
Fig. 5. (a) Erosion procedure (b) Erosion with fixed and adaptive thresholds
With (x_c, y_c) as its center, a 128*128 sub-image is cropped from C as the ROI, which is then converted to a gray-scale image for feature extraction.
Fig. 6. Translation estimation
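Since M and S are binary masks, the matching energy of Eq. (5) can be computed as a cross-correlation, and the ROI crop then follows Eq. (6). The sketch below is one possible realization under that assumption and is not the authors' released code; the variable names and the omitted border handling are illustrative only.

```python
import cv2
import numpy as np

def locate_palm_center(skin_mask, hand_shape, palm_center_nir, hand_pos_nir):
    """Estimate the palm center in the rotated color image (Eqs. (5)-(6)).
    skin_mask:       binary skin-color mask M of the rotated color image
    hand_shape:      binary hand region S cropped from the rotated NIR image
    palm_center_nir: (xp, yp), palm center found by the erosion step
    hand_pos_nir:    (xn, yn), top-left corner of S in the NIR image"""
    m = (skin_mask > 0).astype(np.float32)
    s = (hand_shape > 0).astype(np.float32)
    # For 0/1 masks, cross-correlation equals the bitwise-AND sum of Eq. (5).
    score = cv2.matchTemplate(m, s, cv2.TM_CCORR)
    _, _, _, (xo, yo) = cv2.minMaxLoc(score)          # optimal displacement
    xp, yp = palm_center_nir
    xn, yn = hand_pos_nir
    return xp + xo - xn, yp + yo - yn                 # Eq. (6)

def crop_roi(color_image, center, size=128):
    """Crop a size x size ROI around the palm center and convert it to gray
    (border handling omitted for brevity)."""
    xc, yc = int(center[0]), int(center[1])
    half = size // 2
    roi = color_image[yc - half:yc + half, xc - half:xc + half]
    return cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
```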
4 Ordinal Palmprint Representation

In previous work, the orthogonal line ordinal feature (OLOF) [5] provides a compact and accurate representation of negative line features in palmprints. The orthogonal line ordinal filter [5] F(x, y, θ) is designed as follows:

F(x, y, \theta) = G(x, y, \theta) - G(x, y, \theta + \pi/2)    (7)

G(x, y, \theta) = \exp\left[ -\left( \frac{x\cos\theta + y\sin\theta}{\delta_x} \right)^{2} - \left( \frac{-x\sin\theta + y\cos\theta}{\delta_y} \right)^{2} \right]    (8)
G(x,y,θ) is a 2D anisotropic Gaussian filter, and θ is the orientation of the Gaussian filter. The ratio between δx and δy is set to be larger than 3, in order to obtain a weighted average of a line-like region. In each local region in a palmprint image,
three such ordinal filters, with orientations of 0, π/6 and π/3, are used to convolve the region. Each filtering result is then encoded as 1 or 0 according to whether its sign is positive or negative. Thousands of ordinal codes are concatenated into a feature template. The similarity between two feature templates is measured by a normalized Hamming distance, which ranges between 0 and 1. Further details can be found in [5].
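For readers unfamiliar with [5], the following sketch illustrates the flavor of the orthogonal line ordinal filter of Eqs. (7) and (8) and of the Hamming-distance matching. The filter size and the Gaussian scales are assumed values, and the code is a simplified stand-in rather than the recognition engine used in the system.

```python
import cv2
import numpy as np

def line_gaussian(size, theta, delta_x=6.0, delta_y=1.5):
    """2D anisotropic Gaussian G(x, y, theta) of Eq. (8); delta_x/delta_y > 3."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(u / delta_x) ** 2 - (v / delta_y) ** 2)

def ordinal_code(roi, orientations=(0.0, np.pi / 6, np.pi / 3)):
    """Encode an ROI into binary ordinal codes with three OLOF filters (Eq. (7))."""
    roi = roi.astype(np.float32)
    codes = []
    for theta in orientations:
        olof = line_gaussian(31, theta) - line_gaussian(31, theta + np.pi / 2)
        response = cv2.filter2D(roi, cv2.CV_32F, olof)
        codes.append(response > 0)                    # sign -> 1/0
    return np.concatenate([c.ravel() for c in codes])

def hamming_distance(code_a, code_b):
    """Normalized Hamming distance between two binary templates (0..1)."""
    return np.count_nonzero(code_a != code_b) / float(code_a.size)
```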
5 System Evaluation

The performance of the system is evaluated in terms of verification rate [9], which is obtained through one-to-one image matching. We collected 1,200 normalized palmprint ROI images from 60 subjects using the system, with 10 images for each hand. Fig. 7 illustrates six examples of ROI images. During the test, there are in total 5,400 intra-class comparisons and 714,000 inter-class comparisons. Although the recognition accuracy of the system depends on the effectiveness of both the alignment procedure of the HCI and the palmprint recognition engine, the latter is not the focus of this paper. Thus we do not include performance comparisons between the ordinal code and other state-of-the-art approaches. Fig. 8 shows the genuine and imposter distributions, and Fig. 9 shows the corresponding ROC curve. The equal error rate [9] of the verification test is 0.54%. From the experimental results, we can see that the ROI regions obtained by the system are suitable for palmprint feature extraction and recognition. Besides, we also recorded the time cost for obtaining one normalized palmprint image using the system, which includes the time for image capturing, hand detection and palmprint alignment. The average time cost is 1.2 seconds. Thus, our system is competent for point-of-sale identity check.
Fig. 7. Six examples of ROI images
Fig. 8. Distributions of genuine and imposter
Fig. 9. ROC curve of the verification test
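The equal error rate quoted above is the operating point where the false reject rate on intra-class comparisons equals the false accept rate on inter-class comparisons. A minimal sketch of how such an EER could be computed from two hypothetical arrays of matching distances (not the actual experimental data) is:

```python
import numpy as np

def equal_error_rate(genuine, imposter, num_thresholds=1000):
    """Sweep the decision threshold over the Hamming-distance range [0, 1]
    and return the point where FRR meets FAR."""
    best_gap, eer = np.inf, 1.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        frr = np.mean(genuine > t)      # genuine pairs wrongly rejected
        far = np.mean(imposter <= t)    # imposter pairs wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```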
6 Conclusion

In this paper, we have proposed a novel palmprint recognition system for cooperative user applications, which achieves real-time non-contact palmprint image capture and recognition directly under unconstrained scenes. Through the design of the system, we aim to provide a more convenient human-computer interface and to reduce restrictions on users during palmprint-based identity check. The core of the HCI in the system consists of a binocular imaging device and a novel palmprint preprocessing algorithm. The former delivers fast hand region segmentation based on NIR imaging technology. The latter efficiently extracts a normalized ROI from the hand region based on the shape and color information of human hands. Benefiting further from the powerful recognition engine, the proposed system achieves accurate recognition and convenient use at the same time. As far as we know, this is the first attempt to solve the problem of obtaining normalized palmprint images directly from cluttered backgrounds. However, accurate palmprint alignment has not been fully addressed in the proposed system. In our future work, an important issue is to improve the performance of the system by further reducing the alignment error. In addition,
we should improve the imaging device to deal with the influence of the NIR component in environment light, which varies greatly in practical use.

Acknowledgments. This work is funded by research grants from the National Basic Research Program (Grant No. 2004CB318110), the Natural Science Foundation of China (Grant No. 60335010, 60121302, 60275003, 60332010, 69825105, 60605008) and the Chinese Academy of Sciences.
References

1. Zhang, D., Kong, W.K., You, J., Wong, M.: Online Palmprint Identification. IEEE Trans. on PAMI 25(9), 1041–1050 (2003)
2. Kong, W.K.: Using Texture Analysis in Biometric Technology for Personal Identification. MPhil Thesis, http://pami.uwaterloo.ca/~cswkkong/Sub_Page/Publications.htm
3. Connie, T., Jin, A.T.B., Ong, M.G.K., Ling, D.N.C.: Automated palmprint recognition system. Image and Vision Computing 23, 501–515 (2005)
4. Li, S.Z., Chu, R.F., Liao, S.C., Zhang, L.: Illumination Invariant Face Recognition Using Near-Infrared Images. IEEE Trans. on PAMI 29(4), 627–639 (2007)
5. Sun, Z.N., Tan, T.N., Wang, Y.H., Li, S.Z.: Ordinal Palmprint Representation for Personal Identification. Proc. of IEEE CVPR 2005, vol. 1, 279–284 (2005)
6. Ong, E., Bowden, R.: A Boosted Classifier Tree for Hand Shape Detection. In: Proc. of International Conference on Automatic Face and Gesture Recognition, pp. 889–894 (2004)
7. Jain, A.K.: Fundamentals of Digital Image Processing, p. 392. Prentice Hall, Upper Saddle River, NJ 07458
8. Jones, M.J., Rehg, J.M.: Statistical Color Models with Application to Skin Color Detection. International Journal of Computer Vision 46(1), 81–96 (2002)
9. Daugman, J., Williams, G.: A Proposed Standard for Biometric Decidability. In: Proc. of CardTech/SecureTech Conference, Atlanta, GA, pp. 223–234 (1996)
10. http://www.cis.temple.edu/~latecki/TestData/mpeg7shapeB.tar.gz
11. UST Hand Image database, http://visgraph.cs.ust.hk/Biometrics/Visgraph_web/index.html
Comparative Studies on Multispectral Palm Image Fusion for Biometrics Ying Hao, Zhenan Sun, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, CAS
Abstract. Hand biometrics, including fingerprint, palmprint, hand geometry and hand vein patterns, have attracted extensive attention in recent years. Physiologically, skin is a complex multi-layered tissue consisting of various types of components. Optical research suggests that different components appear when the skin is illuminated with light sources of different wavelengths. This motivates us to extend the capability of the camera by integrating information from multispectral palm images into a composite representation that conveys richer and denser patterns for recognition. In addition, the usability and security of the whole system may be boosted at the same time. In this paper, a comparative study of several pixel-level multispectral palm image fusion approaches is conducted, and several well-established criteria are utilized as objective fusion quality evaluation measures. Among the compared approaches, the Curvelet transform is found to perform best in preserving discriminative patterns from multispectral palm images.
1 Introduction
Hand, as a tool for humans to perceive and reconstruct the surrounding environment, is the most used among body parts in our daily life. Due to its high acceptance by human beings, its prevalence in the field of biometrics is not surprising. Fingerprint [1], hand geometry [2], palmprint [7][8], palm-dorsa vein patterns [3], finger vein [4] and palm vein [5] are all good examples of hand biometric patterns. These modalities have been explored by earlier researchers and can be divided into three categories:

- Skin surface based modalities. Examples are fingerprint and palmprint. Both traits explore information from the surface of the skin and have received extensive attention, and both are recognized as having the potential to be used in high-security scenarios;
- Internal structure based modalities, which extract information from the vein structure deep under the surface for recognition. Although new in the biometric family, the high constancy and uniqueness of the vein structure make this category more and more active nowadays [3][9];
- Global structure based modalities. The only example of this category is hand geometry. Hand geometry is a good choice for small-scale applications thanks to its high performance-price ratio.
No matter which category of modality one chooses to work on, a closer look at the appearance of skin is beneficial. Physiologically, human skin consists of many components, such as cells, fibers, veins and nerves, which give skin a multi-layered structure. At the outermost layer, numerous fine furrows, hair and pores are scattered over the surface of the skin, while veins, capillaries and nerves form a vast network inside [6]. Optical studies have demonstrated that light with a longer wavelength tends to penetrate the skin more deeply; for example, near infrared light from 600 nm to 1000 nm typically penetrates the skin to about 1-3 mm. Therefore, different visual contents, with different optical properties, are detected with incident light of different wavelengths [11]. The uniqueness of human skin, including its micro, meso and macro structures, is a product of random factors during embryonic development.

Enlightened by the success and fast development of the above-mentioned hand biometrics, each of which reflects only one aspect of the hand, we believe that the best potential of biometric features in the hand region is yet to be discovered. The purpose of this work is to exploit the correlative and complementary nature of multispectral hand images for image enhancement, filtering and fusion. Taking palmprint and vein as examples, the common characteristic of the two modalities is that they both utilize moderate-resolution hand imagery and share similar discriminative information: line-like patterns. On the other hand, their intrinsic physiological nature gives the two traits distinct advantages and disadvantages. More precisely, the palmprint is related to the outermost skin pattern; therefore, its appearance is sensitive to illumination conditions, aging, skin disease, abrasion, etc. In contrast, the hand vein pattern, as an interior structure, is robust to the above-mentioned external factors. However, vein image quality varies dramatically across the population and in cases of blood vessel constriction resulting from extremely cold weather. Several advantages can be obtained by fusing the two spectral hand images. First of all, a more user-friendly system can be developed by alternatively combining the two traits or choosing the appropriate one for recognition according to the corresponding imaging quality; secondly, forgery is much more difficult for such an intelligent system and hence the system is more secure; and finally, the recognition performance might be boosted.

In this work, we designed a device to automatically and periodically capture visible and near infrared spectral images. Two sets of lights are turned on in turn so that palmprint and vein images are captured. With the images at hand, we validated the idea of image fusion. Several pixel-level image fusion approaches are performed to combine the original images into a composite one, which is expected to convey more information than its inputs.

The rest of this paper is organized as follows. The hardware design of the image capture device is presented in Section 2, followed by a brief introduction of the four methods in Section 3. The proposed fusion scheme is introduced in Section 4 and Section 5 includes experimental results as well as performance evaluation. Finally, conclusions and discussion are presented in Section 6.
2 Hardware Design
Fig. 1 illustrates the principal design of the device we developed to capture images in both the visible (400-700 nm) and near-infrared (800-1000 nm) spectra. The device works in a sheltered environment, and the light sources are carefully arranged so that the palm region is evenly illuminated. An infrared-sensitive CCD camera is fixed at the bottom of the inner enclosure and connected to a computer via a USB interface. An integrated circuit board is mounted near the camera for illumination, and different combinations of light wavelengths can be obtained by replacing the board. By default, the two sets of lights are turned on in turn so that only the intended layer of the hand is visible to the camera. When illuminated with visible light, an image of the hand skin surface, namely the palmprint, is captured; when the NIR light is on, deeper structures as well as some dominant surface features, such as the principal lines, are captured. Manual control of the two lights is also possible by sending instructions from the computer to the device. A pair of images captured with the device is shown in Fig. 2(a)(b), where (a) is the palmprint image and (b) is the vein image. The two images clearly emphasize quite different components of the hand.
Fig. 1. Multispectral palm image capture device, where LEDs of two wavelengths are controlled by a computer
3 Image Fusion and Multiscale Decomposition
The concept of image fusion refers to integrating information from different images for better visual or computational perception. Image fusion sometimes refers specifically to pixel-level fusion, while a broader definition also includes feature-level and matching-score-level fusion. In this work, we focus on pixel-level fusion because it incurs minimum information loss. The key issue in image fusion is to faithfully preserve important information while suppressing noise. Discriminative information in palmprint and vein, or more
specifically principal lines, wrinkle lines, ridges and blood vessels, all takes the form of line-like patterns. Therefore, the essential goal is to preserve these patterns as faithfully as possible. In the field of pixel-level fusion, multiscale decomposition (MSD), such as pyramid decomposition and wavelet decomposition, is often applied because it typically provides better spatial and spectral localization of image information, and the decorrelation between subbands allows for more reliable feature selection [19]. The methods used in this paper also follow this direction, while the evaluation measures are applied to feature-level representations rather than intensities in order to suit the biometric context.
We selected four multiscale decomposition methods for comparison. The gradient pyramid is obtained by applying four directional gradient operators to each level of a Gaussian pyramid. The four operators correspond to the horizontal, vertical and two diagonal directions, so image features are indexed by orientation and scale. The morphological pyramid is constructed by successive morphological filtering and sub-sampling. Morphological filters, such as opening and closing, are designed to preserve the edges and shapes of objects, which makes this approach suitable for the task at hand. The shift-invariant digital wavelet transform was proposed to overcome the wavy effect normally observed in fusion based on the traditional wavelet transform; it uses an over-complete wavelet basis, and downsampling is replaced by dilated analysis filters. In our implementation, the Haar wavelet is used and the decomposition level for the above three methods is three. The Curvelet transform is a somewhat more complex multiscale transform [12][13][15][14] designed to efficiently represent edges and other singularities along curves. Unlike the wavelet transform, it has directional parameters, and its coefficients have a high degree of directional specificity; large coefficients in the transform domain therefore suggest strong lines in the original image.
These methods are not new in the field of image fusion [16][17][18][19][21]. However, earlier researchers either focused on remote sensing applications, which involve a trade-off between spectral and spatial resolution, or pursued general-purpose fusion schemes. This work is one of the first to adopt and compare them in the context of hand-based biometrics.
4 Proposed Fusion Method
Our fusion method is composed of two steps: a preprocessing step that adjusts dynamic ranges and removes noise from the vein images, and a fusion step that combines information from the visible and infrared images.
4.1 Preprocessing
When illuminated with visible light, images of the fine skin structures are captured. In contrast to their behavior at visible wavelengths, cameras usually have a much
lower sensitivity to infrared light. The camera therefore operates in a low-luminance regime, and its AGC (Auto Gain Control) takes effect to maintain the output level. This amplifies the signal and the noise alike, producing noisy IR images.
The first stage of preprocessing is to distinguish between the two spectra. The relatively large difference between the camera responses to the two wavelengths makes NIR images consistently darker than visible images, so the separation is accomplished simply by comparing average intensities. This is followed by a normalization step that modifies the dynamic range of the vein image so that its mean and standard deviation equal those of the palmprint image. The underlying reason is that equal dynamic ranges across source images help to produce comparable coefficients in the transform domain. Finally, bilateral filtering is applied to reduce noise in the infrared images. Bilateral filtering is a non-iterative scheme for edge-preserving smoothing [10]. The response at a pixel x is defined as a weighted average of similar and nearby pixels, where the weight combines a domain component based on spatial distance and a range component based on intensity similarity between x and its neighbors. The desired behavior is therefore achieved both in smooth regions and at boundaries.
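A minimal sketch of this preprocessing pipeline is shown below, assuming OpenCV is available; the bilateral-filter parameters are illustrative assumptions and are not taken from the paper.

```python
import cv2
import numpy as np

def preprocess_pair(img_a, img_b):
    """Separate a visible/NIR pair, normalize the NIR (vein) image to the dynamic
    range of the visible (palmprint) image, and denoise it with a bilateral filter.
    A sketch under stated assumptions, not the authors' exact implementation."""
    # 1. Spectrum separation: the darker image is assumed to be the NIR capture.
    if img_a.mean() < img_b.mean():
        vein, palm = img_a.astype(np.float32), img_b.astype(np.float32)
    else:
        vein, palm = img_b.astype(np.float32), img_a.astype(np.float32)

    # 2. Normalization: match the mean and standard deviation of the vein image
    #    to those of the palmprint image.
    vein = (vein - vein.mean()) / (vein.std() + 1e-6)
    vein = np.clip(vein * palm.std() + palm.mean(), 0, 255).astype(np.uint8)

    # 3. Edge-preserving denoising of the NIR image (Tomasi & Manduchi [10]);
    #    diameter 9, sigmaColor 25, sigmaSpace 9 are illustrative values.
    vein = cv2.bilateralFilter(vein, 9, 25, 9)
    return palm.astype(np.uint8), vein
```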
4.2 Fusion Scheme
According to the generic framework proposed by Zhang et al. [19], image fusion schemes are composed of (a) multiscale decomposition, which maps the source intensity images to more efficient representations; (b) an activity measure that estimates the quality of each input; (c) a coefficient grouping method that determines whether cross-scale correlation is considered; (d) a coefficient combining method in which a weighted sum of the source representations is calculated; and finally (e) consistency verification to ensure that neighboring coefficients are computed in a similar manner. As a domain-specific fusion scheme, the methods applied in this work can be regarded as instances of this framework. For each of the multiscale decomposition methods mentioned in Section 3, the following scheme is applied:
Activity measure - A coefficient-based activity measure is used: the absolute value of each coefficient is taken as the activity measure at the corresponding scale, position and, where applicable, orientation.
Coefficient combining method - Whatever linear combination of coefficients is adopted, the basic calculation is a weighted sum. We use the popular scheme proposed by Burt [20] for the high-frequency coefficients and simple averaging for the base-band approximation.
Consistency verification - Consistency verification is conducted block-wise; a majority filter is applied in a local 3-by-3 window when the choose-max operation is used in coefficient combination.
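As an illustration of the scheme, the sketch below fuses a palmprint/vein pair with a standard decimated Haar DWT from PyWavelets, using absolute coefficient value as the activity measure, choose-max for the detail subbands and averaging for the base band; the shift-invariant transform and the consistency-verification step of the full scheme are omitted, so this is a simplified stand-in rather than the authors' implementation.

```python
import numpy as np
import pywt

def dwt_fusion(palm, vein, wavelet="haar", levels=3):
    """Pixel-level fusion by multiscale decomposition (simplified sketch)."""
    ca = pywt.wavedec2(palm.astype(np.float32), wavelet, level=levels)
    cb = pywt.wavedec2(vein.astype(np.float32), wavelet, level=levels)

    fused = [(ca[0] + cb[0]) / 2.0]                  # average the approximation band
    for (ha, va, da), (hb, vb, db) in zip(ca[1:], cb[1:]):
        fused.append(tuple(
            np.where(np.abs(a) >= np.abs(b), a, b)   # choose-max on |coefficient|
            for a, b in zip((ha, va, da), (hb, vb, db))))
    return pywt.waverec2(fused, wavelet)             # composite image
```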
5 Experimental Results
To evaluate the proposed fusion scheme, we collected a database from 7 subjects. Three pairs of images were captured for each hand, producing a total of 84 images.
5.1 Subjective Fusion Quality Evaluation
The proposed scheme was applied to each pair of visible and NIR images, and the fused images produced by the four decomposition methods were examined subjectively. Fig. 2 shows such an example. The morphological pyramid, although it produces the most visible vein pattern in the fused images, sometimes introduces artifacts. The other three methods appear similar to the human eye, so an objective evaluation is necessary for a more detailed comparison.
Fig. 2. Palmprint and vein pattern images captured using the self-designed device, together with the fused images: (a) visible image; (b) infrared image; (c) fused image with gradient pyramid; (d) fused image with morphological pyramid; (e) fused image with shift-invariant DWT; (f) fused image with Curvelet transform
5.2 Objective Fusion Quality Evaluation
Many fusion quality evaluation measures have been proposed [22][23], and we choose four of them for our application.
The root mean square error (RMSE) between an input image A and the fused image F is defined in Eq. (1):

RMSE_{AF} = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\left[A(i,j) - F(i,j)\right]^2}{N^2}}    (1)

Mutual information (MI) measures statistically how much information the fused image F conveys about the input image A. Let p_A(x) and p_F(y) denote the marginal distributions of A and F, and p_{AF}(x, y) their joint distribution. The mutual information between A and F is defined in Eq. (2):

MI_{AF} = \sum_{x}\sum_{y} p_{AF}(x, y)\,\log\frac{p_{AF}(x, y)}{p_A(x)\,p_F(y)}    (2)

The universal image quality index (UIQI) was proposed to evaluate the similarity between two images and is defined in Eq. (3). Its three factors measure, respectively, the correlation coefficient, the closeness of mean luminance, and the contrast similarity of the two images or image blocks A and F:

UIQI_{AF} = \frac{\sigma_{AF}}{\sigma_A\sigma_F}\cdot\frac{2\mu_A\mu_F}{\mu_A^2 + \mu_F^2}\cdot\frac{2\sigma_A\sigma_F}{\sigma_A^2 + \sigma_F^2} = \frac{4\,\sigma_{AF}\,\mu_A\mu_F}{(\mu_A^2 + \mu_F^2)\,(\sigma_A^2 + \sigma_F^2)}    (3)
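As a concrete illustration, the following sketch computes the three measures above on a pair of grayscale images; the histogram bin count used for MI is an illustrative assumption (the paper does not specify one).

```python
import numpy as np

def rmse(a, f):
    a, f = a.astype(np.float64), f.astype(np.float64)
    return float(np.sqrt(np.mean((a - f) ** 2)))                  # Eq. (1)

def mutual_information(a, f, bins=64):
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(), bins=bins)
    p_af = joint / joint.sum()
    p_a = p_af.sum(axis=1, keepdims=True)
    p_f = p_af.sum(axis=0, keepdims=True)
    nz = p_af > 0
    return float(np.sum(p_af[nz] * np.log(p_af[nz] / (p_a @ p_f)[nz])))  # Eq. (2)

def uiqi(a, f):
    a, f = a.astype(np.float64).ravel(), f.astype(np.float64).ravel()
    mu_a, mu_f = a.mean(), f.mean()
    var_a, var_f = a.var(), f.var()
    cov_af = np.mean((a - mu_a) * (f - mu_f))
    return float(4 * cov_af * mu_a * mu_f /
                 ((mu_a ** 2 + mu_f ** 2) * (var_a + var_f)))     # Eq. (3)
```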
The general-purpose criteria above are usually applied to intensity images. However, in order to predict the performance of the proposed method in the context of biometrics, we apply them to feature-level representations. In the field of palmprint recognition, the best algorithms reported in the literature are those based on binary textural features [7][8]. These methods represent line-like patterns and have been shown to provide a stable and powerful representation of palmprints. We applied a multiscale version of the Orthogonal Line Ordinal Feature (OLOF) to the fused image as well as to the palmprint image to obtain feature-level representations; the average results on the collected database are shown in Table 1. Textural features are not suitable for the vein image because its true features are sparse and false features are widespread.
From Table 1, the Curvelet-transform-based method clearly outperforms the other methods in that it retains most of the information available in the palmprint. Its disadvantage is the much longer time required to compute the coefficients. We also used the average local entropy to estimate the information gain from the palmprint to the fused image; the result is shown in Fig. 3. The Curvelet-based approach is the only one that conveys more information than the original palmprint representation. We can therefore safely conclude that the Curvelet-based method yields a richer representation and is more faithful to the source representations. The superior performance of the Curvelet transform mainly results from its built-in ability to represent line singularities.
Table 1. Objective Fusion Quality Evaluation

Method                  RMSE(F, Palm)   MI(F, Palm)   UIQI(F, Palm)   Time (s)
Gradient Pyramid        0.4194          0.3300        0.5880           0.5742
Morphological Pyramid   0.4539          0.2672        0.4800           1.2895
Shift-Invariant DWT     0.4313          0.3083        0.5583           1.9371
Curvelet Transform      0.3773          0.4102        0.7351          18.6979
Fig. 3. The average local entropy of the fused image with respect to local window size (4 to 16 pixels), for the original palmprint image and the images fused with the Curvelet transform, morphological pyramid, gradient pyramid and shift-invariant DWT
The gradient pyramid performs second best, which suggests good edge preservation but lower orientation resolution compared with the Curvelet transform. The morphological pyramid introduces too many artifacts, which is the main cause of its performance degradation.
6 Conclusion and Discussion
In this paper, we proposed the idea of multispectral palm image fusion for biometrics. This concept extends the visual capability of the camera and should improve the user-friendliness, security and, hopefully, the recognition performance of the original palmprint-based biometric system. Several image-fusion-based approaches were evaluated in the context of discriminative features. Experimental results suggest
that the Curvelet transform outperforms several other carefully selected methods in terms of well-established criteria. Further work along the proposed direction will include the following:
- Image collection from more spectra. The results presented in Section 5 have proved the superior performance of the Curvelet transform in combining palmprint and vein images. To explore the full potential of hand biometrics, we will improve the device to capture images from more spectra. Although line-like patterns are dominant in palmprint and vein images, they are not necessarily suitable for other components, so further schemes need to be studied based on an examination of the meaningful physiological characteristics of each skin component.
- Recognition based on the fused image. Currently, the database is not large enough to produce convincing recognition performance. A well-defined database will be collected in the near future, and the proposed method will be tested and compared with fusion at other levels.
Acknowledgments. This work is funded by research grants from the National Basic Research Program (Grant No. 2004CB318110), the Natural Science Foundation of China (Grant Nos. 60335010, 60121302, 60275003, 60332010, 69825105 and 60605008) and the Chinese Academy of Sciences.
References
1. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of Fingerprint Recognition. Springer, Heidelberg (2003)
2. Bolle, R., Pankanti, S., Jain, A.K.: Biometrics: Personal Identification in Networked Society. Springer, Heidelberg (1999)
3. Lin, C.-L., Fan, K.-C.: Biometric Verification Using Thermal Images of Palm-Dorsa Vein Patterns. IEEE Trans. on Circuits and Systems for Video Technology 14(2), 199–213 (2004)
4. Finger Vein Authentication Technology, http://www.hitachi.co.jp/Prod/comp/finger-vein/global/
5. Fujitsu Palm Vein Technology, http://www.fujitsu.com/global/about/rd/200506palm-vein.html
6. Igarashi, T., Nishino, K., Nayar, S.K.: The Appearance of Human Skin. Technical Report CUCS-024-05, Columbia University (2005)
7. Wai-Kin Kong, A., Zhang, D.: Competitive Coding Scheme for Palmprint Verification. In: Intl. Conf. on Pattern Recognition, vol. 1, pp. 520–523 (2004)
8. Sun, Z., Tan, T., Wang, Y., Li, S.Z.: Ordinal Palmprint Recognition for Personal Identification. In: Proc. of Computer Vision and Pattern Recognition (2005)
9. Wang, L., Leedham, G.: Near- and Far-Infrared Imaging for Vein Pattern Biometrics. In: Proc. of the IEEE Intl. Conf. on Video and Signal Based Surveillance (2006)
10. Tomasi, C., Manduchi, R.: Bilateral Filtering for Gray and Color Images. In: Proc. of Sixth Intl. Conf. on Computer Vision, pp. 839–846 (1998)
11. Anderson, R.R., Parrish, J.A.: Optical Properties of Human Skin. In: The Science of Photomedicine, ch. 6. Plenum Press, New York (1982)
12. Donoho, D.L., Duncan, M.R.: Digital Curvelet Transform: Strategy, Implementation and Experiments, http://www-stat.stanford.edu/~donoho/Reports/1999/DCvT.pdf
13. Candès, E.J., Donoho, D.L.: Curvelets – A Surprisingly Effective Nonadaptive Representation for Objects With Edges. In: Schumaker, L.L., et al. (eds.) Curves and Surfaces. Vanderbilt University Press, Nashville, TN (1999)
14. Curvelet website, http://www.curvelet.org/
15. Starck, J.L., Candès, E.J., Donoho, D.L.: The Curvelet Transform for Image Denoising. IEEE Transactions on Image Processing 11(6), 670–684 (2002)
16. Choi, M., Kim, R.Y., Nam, M.-R., Kim, H.O.: Fusion of Multispectral and Panchromatic Satellite Images Using the Curvelet Transform. IEEE Geoscience and Remote Sensing Letters 2(2) (2005)
17. Nencini, F., Garzelli, A., Baronti, S., Alparone, L.: Remote Sensing Image Fusion Using the Curvelet Transform. Information Fusion 8(2), 143–156 (2007)
18. Zhang, Q., Guo, B.: Fusion of Multisensor Images Based on Curvelet Transform. Journal of Optoelectronics Laser 17(9) (2006)
19. Zhang, Z., Blum, R.S.: A Categorization of Multiscale-Decomposition-Based Image Fusion Schemes with a Performance Study for a Digital Camera Application. Proc. of the IEEE 87(8), 1315–1326 (1999)
20. Burt, P.J., Kolczynski, R.J.: Enhanced Image Capture Through Fusion. In: IEEE Intl. Conf. on Computer Vision, pp. 173–182. IEEE Computer Society Press, Los Alamitos (1993)
21. Sadjadi, F.: Comparative Image Fusion Analysis. In: IEEE Computer Vision and Pattern Recognition, vol. 3 (2005)
22. Petrović, V., Cootes, T.: Information Representation for Image Fusion Evaluation. In: Intl. Conf. on Information Fusion, pp. 1–7 (2006)
23. Petrović, V., Xydeas, C.: Objective Image Fusion Performance Characterisation. In: Intl. Conf. on Computer Vision, pp. 1868–1871 (2005)
24. Wang, Z., Bovik, A.C.: A Universal Image Quality Index. IEEE Signal Processing Letters 9(3), 81–84 (2002)
Learning Gabor Magnitude Features for Palmprint Recognition

Rufeng Chu, Zhen Lei, Yufei Han, Ran He, and Stan Z. Li

Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{rfchu,zlei,yfhan,rhe,szli}@nlpr.ia.ac.cn
http://www.cbsr.ia.ac.cn
Abstract. Palmprint recognition, as a new branch of biometric technology, has attracted much attention in recent years. Various palmprint representations have been proposed for recognition. The Gabor feature has been recognized as one of the most effective representations for palmprint recognition, and Gabor phase and orientation representations have been studied extensively. In this paper, we explore a novel Gabor magnitude feature-based method for palmprint recognition. The novelties are as follows. First, we propose an illumination normalization method for palmprint images to decrease the influence of illumination variations caused by different sensors and lighting conditions. Second, we propose to use Gabor magnitude features for palmprint representation. Third, we use AdaBoost learning to select the most effective features and apply linear discriminant analysis (LDA) to further reduce the dimension for palmprint recognition. Experimental results on three large palmprint databases demonstrate the effectiveness of the proposed method. Compared with state-of-the-art Gabor-based methods, our method achieves higher accuracy.
1 Introduction

Biometrics is an emerging technology that uses unique and measurable physical characteristics to identify a person. Such physical attributes include the face, fingerprint, iris, palmprint, hand geometry, gait and voice. Biometric systems have been successfully used in many different application contexts, such as airports, passports and access control. Compared with other biometric technologies, palmprint recognition has a relatively short history and has received increasing interest in recent years.
Various techniques have been proposed for palmprint recognition in the literature [1,2,3,4,5,6,7,8,9,10]. They can be classified into three main categories according to the palmprint feature representation. The first category is based on structural features, such as line features [1] and feature points [2]. The second is based on holistic appearance features, such as PCA [3], LDA [4] and KLDA [5]. The third is based on local appearance features, such as PalmCode [7], FusionCode [8], Competitive Code [9] and Ordinal Code [10]. Among these representations, the Gabor feature is one of the most efficient for palmprint recognition. Zhang et al. [7] proposed a texture-based method for online palmprint recognition, where a 2D Gabor filter was used to extract the
phase information (called PalmCode) from low-resolution palmprint images. Kong and Zhang [8] improved the efficiency of the PalmCode method by fusing the codes computed in four different orientations (called FusionCode), where multiple Gabor filters are employed to extract phase information from a palmprint image. To further improve the performance, Kong and Zhang [9] proposed another Gabor-based method, the competitive code. The competitive coding scheme uses multiple 2D Gabor filters to extract orientation information from palm lines based on the winner-take-all competitive rule [9]. Combined with angular matching, promising performance has been achieved.
Gabor phase and orientation features have been studied extensively in existing work [7,8,9]. In this paper, we explore a Gabor magnitude feature representation for palmprint recognition. First, to increase the generalization capacity and decrease the influence of illumination variations due to different sensors and lighting environments, we propose an illumination normalization method for palmprint images. Second, multi-scale, multi-orientation Gabor filters are used to extract Gabor magnitude features for palmprint representation. Because the original feature set is of high dimensionality, we use AdaBoost learning to select the most effective features from the large candidate set, followed by linear discriminant analysis (LDA) for further dimensionality reduction. Experimental results demonstrate the good performance of the proposed method; compared with state-of-the-art Gabor-based methods, it achieves higher accuracy. Moreover, the processing speed of the method is high: in the testing phase, the execution times for illumination normalization, feature extraction, projection from feature space to the LDA subspace, and matching for one image are 30 ms, 20 ms, 1.5 ms and 0.01 ms, respectively.
The rest of this paper is organized as follows. In Section 2, we introduce the illumination normalization method. In Section 3, we describe the Gabor magnitude features for palmprint representation. Section 4 gives the details of the statistical learning used for feature selection and the classifier. Experimental results and conclusions are presented in Sections 5 and 6, respectively.
2 Illumination Normalization

Because of different sensors and lighting environments, palmprint images vary significantly, as shown in the top row of Fig. 1. A robust illumination preprocessing method helps to diminish the influence of illumination variations and increase the robustness of the recognition method.
In general, an image I(x, y) is regarded as the product I(x, y) = R(x, y)L(x, y), where R(x, y) is the reflectance and L(x, y) is the illuminance at each point (x, y). The reflectance R depends on the albedo and surface normal and is the intrinsic representation of an object, while the luminance L is an extrinsic factor. The illumination normalization problem therefore reduces to obtaining R from an input image I. However, estimating the reflectance and the illuminance is an ill-posed problem. A common assumption used to make it tractable is that the illumination L varies slowly while the reflectance R can change abruptly. In our work, we introduce an anisotropic approach to estimate the illumination field L(x, y), which has previously been used
Fig. 1. Examples of the palmprint images from different sensors before and after illumination normalization. Top: Original palmprint images. Bottom: Corresponding processed palmprint images. The images are taken from the PolyU Palmprint Database [12] (first two columns), UST Hand Database [13] (middle two columns) and CASIA Palmprint Database [14] (last two columns).
for face recognition [11]. We then estimate the reflectance R(x, y) as the ratio of the image I(x, y) and L(x, y). The luminance function is estimated as an anisotropically smoothed version of the original image, obtained by minimizing the cost function

J(L) = \int_y\int_x \rho(x, y)\,(L - I)^2\,dx\,dy + \lambda\int_y\int_x (L_x^2 + L_y^2)\,dx\,dy    (1)

where the first term is the data term and the second is a regularization term that imposes a smoothness constraint. The parameter \lambda controls the relative importance of the two terms, and \rho is Weber's local contrast between a pixel a and its neighbor b in either the x or y direction [11]; this space-varying permeability weight \rho(x, y) controls the anisotropic nature of the smoothing constraint. Applying the Euler-Lagrange equation, Eq. (1) is transformed into the following partial differential equation (PDE):

L + \frac{\lambda}{\rho}(L_{xx} + L_{yy}) = I    (2)

The PDE approach is easy to implement. With this regularized approach, the influence of illumination variations is diminished while the edge information of the palmprint image is preserved. Fig. 1 shows some examples from several different palmprint databases before and after processing with the method. In Section 5, we further evaluate the effectiveness of the illumination normalization method on a large palmprint database.
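The sketch below illustrates the idea with a simple Jacobi-style iteration written directly for the Euler-Lagrange condition of the cost in Eq. (1), i.e. rho * (L - I) = lambda * (Lxx + Lyy). It is a minimal sketch under stated assumptions: rho is approximated by a crude local-contrast map, lambda and the iteration count are illustrative, boundaries are handled periodically, and the discretization differs from the authors' implementation (which applies Weber contrast to neighbor pairs).

```python
import numpy as np

def estimate_reflectance(img, lam=10.0, iters=200):
    """Anisotropically smoothed luminance estimate L and reflectance R = I / L.
    Simplified sketch, not the authors' exact scheme."""
    I = img.astype(np.float64) + 1.0
    # Crude local-contrast weight: high contrast -> trust the data term more.
    gx, gy = np.gradient(I)
    rho = 1.0 + np.hypot(gx, gy) / (I + 1e-6)
    L = I.copy()
    for _ in range(iters):
        # 4-neighbor sum; np.roll gives periodic borders (a simplification).
        N = (np.roll(L, 1, 0) + np.roll(L, -1, 0) +
             np.roll(L, 1, 1) + np.roll(L, -1, 1))
        L = (rho * I + lam * N) / (rho + 4.0 * lam)   # Jacobi update
    R = I / (L + 1e-6)
    return R / R.max()                                # reflectance in [0, 1]
```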
3 Gabor Magnitude Features for Palmprint Representation

Gabor features exhibit desirable characteristics of spatial locality and orientation selectivity, and are optimally localized in the space and frequency domains. The Gabor kernels can be defined as follows [15]:

\psi_{\mu,v}(z) = \frac{\|k_{\mu,v}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{\mu,v}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\,k_{\mu,v}\cdot z) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right]    (3)
where \mu and v define the orientation and scale of the Gabor kernels respectively, z = (x, y), and the wave vector k_{\mu,v} is defined as

k_{\mu,v} = k_v e^{i\phi_\mu}    (4)

where k_v = k_{max}/f^v, k_{max} = \pi/2, f = \sqrt{2}, and \phi_\mu = \pi\mu/8. The Gabor kernels in Eq. (3) are all self-similar, since they can be generated from one filter, the mother wavelet, by scaling and rotation via the wave vector k_{\mu,v}. Each kernel is the product of a Gaussian envelope and a complex plane wave; the first term in the square brackets of Eq. (3) determines the oscillatory part of the kernel, and the second term compensates for the DC value. A bank of Gabor filters is thus generated by a set of scales and rotations. In our experiments, we use Gabor kernels at five scales v \in \{0, 1, 2, 3, 4\} and eight orientations \mu \in \{0, 1, 2, 3, 4, 5, 6, 7\}, with the parameter \sigma = 2\pi, to derive the Gabor representation by convolving the palmprint image with the corresponding kernels.
Let I(x, y) be the gray-level distribution of a palmprint image. The convolution of I with a Gabor kernel \psi_{\mu,v} is defined as

F_{\mu,v}(z) = I(z) * \psi_{\mu,v}(z)    (5)

where z = (x, y) and * denotes the convolution operator. The Gabor magnitude feature is defined as

M_{\mu,v}(z) = \sqrt{\mathrm{Im}(F_{\mu,v}(z))^2 + \mathrm{Re}(F_{\mu,v}(z))^2}    (6)

where Im(·) and Re(·) denote the imaginary and real parts, respectively. For each pixel position (x, y) in the palmprint image, 40 Gabor magnitudes are calculated to form the feature representation.
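For illustration, the following is a minimal NumPy/SciPy sketch of the 5 x 8 filter bank of Eqs. (3)-(6); the kernel support size (31 x 31) is an assumption not stated in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(mu, v, size=31, sigma=2 * np.pi, kmax=np.pi / 2, f=np.sqrt(2)):
    """Gabor kernel of Eq. (3) for orientation mu (0..7) and scale v (0..4)."""
    k = kmax / (f ** v)
    phi = np.pi * mu / 8.0
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    z2 = x ** 2 + y ** 2
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * z2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)  # DC-free
    return envelope * carrier

def gabor_magnitudes(img):
    """Convolve the image with the filter bank (Eq. (5)) and return the 40
    magnitude maps of Eq. (6), shape (40, H, W)."""
    feats = []
    for v in range(5):
        for mu in range(8):
            resp = fftconvolve(img.astype(np.float64), gabor_kernel(mu, v), mode="same")
            feats.append(np.abs(resp))          # sqrt(Im^2 + Re^2)
    return np.stack(feats, axis=0)
```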
4 Statistical Learning of the Best Features and Classifiers

The whole set of Gabor magnitude features is of high dimension: for a palmprint image of size 128 × 128, there are about 655,360 features in total. Not all of them are useful or equally useful, and some may even harm performance. A straightforward implementation would be both computationally expensive and inefficient. In this work, we first use AdaBoost learning to select the most informative features, and then apply linear discriminant analysis (LDA) to the selected Gabor magnitude features for further dimension reduction.

4.1 Feature Selection by AdaBoost Learning

Boosting can be viewed as a stage-wise approximation to an additive logistic regression model using the Bernoulli log-likelihood as a criterion [16]. AdaBoost is a typical instance of boosting learning; it has been successfully used for face detection [17] as an effective feature selection method. There are several different versions of the AdaBoost algorithm [16], such as Discrete AdaBoost, Real AdaBoost, LogitBoost and Gentle AdaBoost. In this work, we apply Gentle AdaBoost learning to select the most discriminative Gabor magnitude features and remove useless and redundant features. Gentle AdaBoost is a modified version of the Real AdaBoost algorithm and is defined in Fig. 2.
Input: Sequence of N weighted examples {(x_1, y_1, w_1), (x_2, y_2, w_2), ..., (x_N, y_N, w_N)};
       integer T specifying the number of iterations.
Initialize: w_i = 1/N, i = 1, 2, ..., N; F(x) = 0.
For t = 1, ..., T:
  (a) Fit the regression function f_t(x) by weighted least squares of y_i to x_i with weights w_i.
  (b) Update F(x) <- F(x) + f_t(x).
  (c) Update w_i <- w_i e^{-y_i f_t(x_i)} and renormalize.
Output the classifier sign[F(x)] = sign[\sum_{t=1}^{T} f_t(x)].

Fig. 2. Algorithm of Gentle AdaBoost
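For illustration, a minimal Python sketch of the Fig. 2 procedure with single-feature regression stumps is given below, so that each boosting round also selects one feature. The stump's coarse threshold grid is an assumption for brevity, not the authors' weak learner.

```python
import numpy as np

def gentle_adaboost(X, y, T):
    """Gentle AdaBoost (Fig. 2) with two-valued regression stumps on single
    features. X: (N, D) feature-difference vectors, y in {-1, +1}."""
    N, D = X.shape
    w = np.full(N, 1.0 / N)
    F = np.zeros(N)
    selected = []
    for _ in range(T):
        best = None
        for d in range(D):
            for thr in np.percentile(X[:, d], [25, 50, 75]):   # coarse grid
                idx = X[:, d] <= thr
                # weighted least-squares fit of a two-valued stump
                a = np.sum(w[idx] * y[idx]) / max(np.sum(w[idx]), 1e-12)
                b = np.sum(w[~idx] * y[~idx]) / max(np.sum(w[~idx]), 1e-12)
                pred = np.where(idx, a, b)
                err = np.sum(w * (y - pred) ** 2)
                if best is None or err < best[0]:
                    best = (err, d, thr, a, b, pred)
        _, d, thr, a, b, pred = best
        selected.append((d, thr, a, b))        # d indexes the chosen Gabor feature
        F += pred                              # (b) F(x) <- F(x) + f_t(x)
        w *= np.exp(-y * pred)                 # (c) reweight and renormalize
        w /= w.sum()
    return selected, np.sign(F)
```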
Empirical evidence suggests that Gentle AdaBoost is a more conservative algorithm with performance similar to both Real AdaBoost and LogitBoost, and it often outperforms them, especially when stability is a crucial issue [16].
While the above AdaBoost procedure essentially learns a two-class classifier, we convert the multi-class problem into a two-class one using the idea of intra- and extra-class differences [18]. Here, however, the difference data are derived from each pair of Gabor magnitude features at corresponding locations rather than from the images: the positive examples are derived from pairs of intra-personal differences and the negative ones from pairs of extra-personal differences. In this work, each weak classifier in AdaBoost learning is constructed from a single Gabor magnitude feature, so the AdaBoost learning algorithm can be viewed as a feature selection algorithm [17,19]. With the selected feature set, a range of statistical methods can be used to construct an effective classifier. In the following, we introduce LDA for further dimension reduction and use the cosine distance for palmprint recognition, expecting it to achieve better performance.

4.2 LDA with Selected Features

LDA is a well-known method for feature extraction and dimension reduction that maximizes the extra-class distance while minimizing the intra-class distance. Let the sample set be X = {x_1, x_2, ..., x_n}, where x_i is the feature vector of the i-th sample. The within-class scatter matrix S_w and the between-class scatter matrix S_b are defined as

S_w = \sum_{i=1}^{L} \sum_{x_j \in C_i} (x_j - m_i)^T (x_j - m_i)    (7)

S_b = \sum_{i=1}^{L} n_i (m_i - m)^T (m_i - m)    (8)

where m_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j is the mean vector of class C_i and m = \frac{1}{n}\sum_{i=1}^{L}\sum_{x_j \in C_i} x_j is the global mean vector.
LDA seeks a projection matrix W that maximizes the objective function

J = \frac{tr(W^T S_b W)}{tr(W^T S_w W)}    (9)

The optimal projection matrix W_{opt} can be obtained by solving the generalized eigenvalue problem

S_w^{-1} S_b W = W \Lambda    (10)

where \Lambda is a diagonal matrix whose diagonal elements are the eigenvalues of S_w^{-1} S_b. Given two input vectors x_1 and x_2, their subspace projections are computed as v_1 = W^T x_1 and v_2 = W^T x_2, and the following cosine distance is used for matching:

H(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}    (11)

where \|\cdot\| denotes the norm operator. In the test phase, the projections v_1 and v_2 are computed from two input vectors x_1 and x_2, one for the input palmprint image and the other for an enrolled palmprint image. By comparing the score H(v_1, v_2) with a threshold, a decision is made as to whether x_1 and x_2 belong to the same person.
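A minimal sketch of Eqs. (7)-(11) on the AdaBoost-selected features is given below; in practice S_w would be regularized and the generalized eigenproblem solved more carefully, so this is an illustration rather than the authors' implementation.

```python
import numpy as np

def fit_lda(X, labels, dim):
    """LDA projection matrix W from Eqs. (7)-(10). X: (n, d), labels: (n,)."""
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                      # Eq. (7)
        Sb += len(Xc) * np.outer(mc - m, mc - m)           # Eq. (8)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)  # Eq. (10)
    order = np.argsort(-evals.real)
    return evecs[:, order[:dim]].real                      # projection matrix W

def cosine_score(W, x1, x2):
    """Matching score of Eq. (11) between two selected-feature vectors."""
    v1, v2 = W.T @ x1, W.T @ x2
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```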
5 Experiments

To evaluate the performance of the proposed palmprint recognition method, three large palmprint databases are used: the PolyU Palmprint Database [12], the UST Hand Image Database [13] and the CASIA Palmprint Database [14]. These databases are among the largest in the public domain. We train the classifiers and evaluate the effectiveness of the illumination normalization method on the PolyU Palmprint Database. To explore the generalization of the classifier, we further evaluate the proposed method on the other two databases and compare it with state-of-the-art Gabor-based recognition methods [7,8,9].

5.1 Evaluation on the PolyU Palmprint Database

The PolyU Palmprint Database [12] contains 7752 images corresponding to 386 different palms. Around twenty samples from each palm were collected in two sessions, with some illumination variation between the sessions. We select 4000 images from 200 different palms collected in both sessions as the testing set, with 20 images per palm; the remaining 3752 images from 186 palms are used for training. All input palmprint images are normalized to 128 × 128 using the method proposed in [7].
In the training phase, the positive training samples are derived from intra-class pairs of Gabor features and the negative ones from extra-class pairs. Two Gabor magnitude feature-based classifiers are trained: an AdaBoost-based classifier and an LDA-based classifier using AdaBoost-selected features. These two methods are named "GMBoost" and "GMBoostLDA", respectively. Moreover, to
evaluate the effectiveness of the illumination normalization method, we also train two classifiers and test their performance on palmprint images without illumination normalization. The first two classifiers are trained on palmprint images without illumination normalization; 882 of the most effective features are selected by the AdaBoost procedure from the original 655,360 Gabor magnitude features, with zero training error on the training set. For LDA, 181 feature dimensions are retained, which is optimal on the test set. The other two classifiers are trained on palmprint images with illumination normalization; 615 of the most effective features are selected, again with zero training error, and the optimal LDA dimension found on the test set is 175. The first five most effective features learned by Gentle AdaBoost are shown in Fig. 3, in which the position, scale and orientation of the corresponding Gabor kernels are indicated on an illumination-normalized palmprint image.
Fig. 3. The first 5 features and associated Gabor kernels selected by AdaBoost learning
In the testing phase, we match palmprints across sessions: each image from the first session is matched with all images from the second session. This generates 20,000 intra-class (positive) and 380,000 extra-class (negative) pairs. Fig. 4 shows the ROC curves derived from the scores of the intra- and extra-class pairs. All these Gabor magnitude feature-based methods achieve good verification performance. The "GMBoostLDA" methods perform better than the "GMBoost" methods, which indicates that applying LDA to AdaBoost-selected features is a good scheme for palmprint recognition. Among these classifiers, "GMBoostLDA with illumination normalization" performs best, which demonstrates the effectiveness of the proposed illumination normalization method.
The processing speed of the proposed method is high. In the testing phase, only the features selected by AdaBoost learning need to be extracted with the Gabor filters, which greatly reduces the computational cost. On a P4 3.0 GHz PC, the execution times for illumination normalization, feature extraction, projection from feature space to the LDA subspace, and matching for one image are 30 ms, 20 ms, 1.5 ms and 0.01 ms, respectively. In the next subsection, we further evaluate our best classifier on the other two databases to explore its generalization capacity and compare it with state-of-the-art Gabor-based recognition methods.

5.2 Evaluation on the UST Hand Image Database and the CASIA Palmprint Database

The UST hand image database [13] contains 5,660 hand images corresponding to 566 different palms, 10 images per palm. All images are captured using a digital camera with
Fig. 4. Verification performance comparison on PolyU Palmprint Database
a resolution of 1280 × 960 pixels and 24-bit color. In total, 25,470 intra-class (genuine) and 15,989,500 extra-class (impostor) match pairs are generated from the UST database. The CASIA palmprint database [14] contains 4,796 images corresponding to 564 different palms, with 8 to 10 samples per palm. All images are captured using a CMOS camera with a resolution of 640 × 480 pixels and 24-bit color. In total, 18,206 intra-class (genuine) and 11,480,204 extra-class (impostor) match pairs are generated from the test set.
Fig. 5 shows the ROC curves derived from the scores of the intra- and extra-class samples. According to the ROC curves, the proposed method outperforms the state-of-the-art Gabor-based recognition methods on both databases. Note that our classifier is trained on the PolyU database and tested on the UST and CASIA palmprint databases. Two accuracy measures are computed for further comparison in Table 1: the equal error rate (EER) and d' (d-prime) [20]. d' is a statistical measure of how well a biometric system can discriminate between different individuals, defined as
Fig. 5. Comparative results with state-of-the-art Gabor-based recognition methods. Left: ROC curves on UST Hand Image Database. Right: ROC curves on CASIA Palmprint Database.
d' = \frac{|m_1 - m_2|}{\sqrt{(\delta_1^2 + \delta_2^2)/2}}    (12)

where m_1 and \delta_1 denote the mean and standard deviation of the intra-class (genuine) scores, and m_2 and \delta_2 those of the extra-class (impostor) scores. The larger the d' value, the better a biometric system performs [20].

Table 1. Comparison of accuracy measures for different classifiers on the UST and CASIA databases

                            UST database        CASIA database
Algorithm                   EER (%)    d'       EER (%)    d'
Palm Code (θ = 45°) [7]     1.77       3.39     0.95       3.58
Fusion Code [8]             0.75       3.40     0.57       3.80
Competitive Code [9]        0.38       3.51     0.19       3.81
Proposed method             0.35       5.36     0.17       5.57
From the experimental results, we can see that the proposed method achieves good performance on both the EER and the discriminating index. This also suggests the good generalization capacity of the proposed method, which works well on different types of palmprint images.
6 Conclusions

In this paper, we have proposed a Gabor magnitude feature-based learning method for palmprint recognition. To decrease the influence of illumination variations, we introduced an illumination normalization method for palmprint images. Multi-scale, multi-orientation Gabor filters are then used to extract Gabor magnitude features, and a powerful classifier is constructed from these features by statistical learning. The experimental results show that Gabor magnitude features combined with statistical learning are powerful enough for palmprint recognition. Compared with state-of-the-art Gabor-based methods, our method achieves better performance on two large palmprint databases.
Acknowledgements This work was supported by the following funding resources: National Natural Science Foundation Project #60518002, National Science and Technology Supporting Platform Project #2006BAK08B06, National 863 Program Projects #2006AA01Z192 and #2006AA01Z193, Chinese Academy of Sciences 100 people project, and the AuthenMetric Collaboration Foundation.
References
1. Zhang, D., Shu, W.: Two novel characteristics in palmprint verification: Datum point invariance and line feature matching. Pattern Recognition 32, 691–702 (1999)
2. Duta, N., Jain, A., Mardia, K.: Matching of palmprints. Pattern Recognition Letters 23, 477–485 (2001)
3. Lu, G., Zhang, D., Wang, K.: Palmprint recognition using eigenpalms features. Pattern Recognition Letters 24, 1463–1467 (2003)
4. Wu, X., Zhang, D., Wang, K.: Fisherpalms based palmprint recognition. Pattern Recognition Letters 24, 2829–2838 (2003)
5. Wang, Y., Ruan, Q.: Kernel Fisher discriminant analysis for palmprint recognition. In: Proceedings of International Conference on Pattern Recognition, vol. 4, pp. 457–461 (2006)
6. Kumar, A., Shen, H.: Palmprint identification using palmcodes. In: ICIG 2004. Proceedings of the Third International Conference on Image and Graphics, Hong Kong, China, pp. 258–261 (2004)
7. Zhang, D., Kong, W., You, J., Wong, M.: On-line palmprint identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1041–1050 (2003)
8. Kong, W., Zhang, D.: Feature-Level Fusion for Effective Palmprint Authentication. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 761–767. Springer, Heidelberg (2004)
9. Kong, W., Zhang, D.: Competitive coding scheme for palmprint verification. In: Proceedings of International Conference on Pattern Recognition, vol. 1, pp. 520–523 (2004)
10. Sun, Z., Tan, T., Wang, Y., Li, S.: Ordinal palmprint representation for personal identification. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 279–284. IEEE Computer Society Press, Los Alamitos (2005)
11. Gross, R., Brajovic, V.: An image preprocessing algorithm for illumination invariant face recognition. In: Proc. 4th International Conference on Audio- and Video-Based Biometric Person Authentication, Guildford, UK, pp. 10–18 (2003)
12. PolyU Palmprint Database, http://www.comp.polyu.edu.hk/biometrics/
13. UST Hand Image Database, http://visgraph.cs.ust.hk/biometrics/Visgraph_web/index.html
14. CASIA Palmprint Database, http://www.cbsr.ia.ac.cn/
15. Daugman, J.G.: Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Trans. ASSP 36, 1169–1179 (1988)
16. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Technical report, Department of Statistics, Sequoia Hall, Stanford University (1998)
17. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii. IEEE Computer Society Press, Los Alamitos (2001)
18. Moghaddam, B., Nastar, C., Pentland, A.: A Bayesian similarity measure for direct image matching. Media Lab Tech Report No. 393, MIT (1996)
19. Shan, S., Yang, P., Chen, X., Gao, W.: AdaBoost Gabor Fisher classifier for face recognition. In: Proceedings of IEEE International Workshop on Analysis and Modeling of Faces and Gestures, Beijing, China, pp. 279–292. IEEE Computer Society Press, Los Alamitos (2005)
20. Daugman, J., Williams, G.: A Proposed Standard for Biometric Decidability. In: Proc. CardTech/SecureTech Conference, pp. 223–234 (1996)
Sign Recognition Using Constrained Optimization

Kikuo Fujimura 1 and Lijie Xu 2

1 Honda Research Institute USA
2 Ohio State University
Abstract. Sign recognition has been one of the challenging problems in computer vision for years. In many sign languages, signs formed by two overlapping hands are part of the vocabulary. In this work, an algorithm for recognizing such signs with overlapping hands is presented. Two formulations are proposed for the problem. In both approaches, the input blob is converted into a graph representing the finger and palm structure, which is essential for sign understanding. The first approach uses graph subdivision as its basic framework, while the second casts the problem as a label assignment problem and applies integer programming to find an optimal solution. Experimental results illustrate the feasibility of our approaches.
1 Introduction
There have been many approaches to sign recognition [1]. Among the many elements important for sign recognition, hand tracking and hand shape analysis are of basic importance. For hand detection, many approaches use color or motion information [6,7]. It turns out, however, that hand tracking using color is a non-trivial task except in well-controlled environments, as lighting changes pose challenging conditions [3,10,15,17]. Using special equipment such as data gloves is one way to overcome this difficulty. When the hand is given in a magnified view, hand shape analysis becomes feasible [16], although body and arm posture information may be lost. Successful results have been reported using multiple cameras to extract 3D information [9,13]. Even though redundancy makes the problem more approachable, handling the bulk of data in real time poses another challenge for efficient computation, and model fitting for 3D data is known to be computationally demanding.
Stereo vision is a popular choice for many tasks, including man-machine interaction and robotics [9]. However, it still fails to provide sufficient detail in depth maps for tasks such as counting the fingers in a given hand posture; stereo images have mainly been useful for large-scale gesture recognition such as pointing motions. In contrast, coded light or recent techniques such as space-time stereo generally provide much better depth resolution than traditional stereo. Such a device is expected to provide a high-quality image
sequence even for posture analysis. Time-of-flight sensors also provide a resolution that is sufficient for hand shape analysis [5,12,13,18]. For our work, we opt to use this type of device as a suitable alternative to stereo.
Much of the work on hand shape and motion analysis deals primarily with a single hand or two non-overlapping hands [4,8,11,13,16]. Whereas the analysis of independent hand shapes is a basic task, the analysis of overlapping hands presents another level of challenge in gesture recognition. Natural gestures (including sign languages, as in Fig. 1) often use signs formed by two overlapping hands. Motivated by this, we present approaches for analyzing various hand patterns formed by two overlapping hands. In particular, we focus on how to separate the two hands from a single blob representing overlapping hands. In addition, we present a real-time, non-intrusive sign recognition system that can recognize signs made by one hand, two non-overlapping hands, and two overlapping hands.
The rest of the paper is organized as follows. In Section 2, we outline our solution approaches. Sections 3 and 4 present two formulations of the problem, and Section 5 describes the entire system. Experimental results are presented in Section 6, and Section 7 contains concluding remarks.
Fig. 1. Examples of signs formed by overlapping hands
2 Flow of the Algorithm
We present two algorithms to address the problem of overlapping hands. The two algorithms share a common basic part, namely steps 1-3 of the following procedure:

1. Extraction of the arm blob(s) from the image
2. Overlapping hand detection and palm detection
3. Graph formation
4. Separation of overlapped hands (two methods are proposed)
5. Sign recognition
Hand segmentation is an important step for most sign recognition methods. In this work, we use depth streams for hand blob segmentation. Even though
it is conceptually simple to segment hands in depth images, especially when the hands are in front of the body, it still requires work to identify the palm and finger parts within the arm blob. This part is described in Section 5. (When the blob is determined to contain only one hand, steps 3 and 4 are skipped.) For the second step, we use the observation that for overlapping hands, the blob's area is larger than that of a single-hand blob and the convex hull of the blob is significantly larger than the blob itself.
Next, a graph structure is extracted from the hand blob. For this operation, the medial axis transform (roughly corresponding to the skeleton of the blob) is generated by a thinning operation (Fig. 2). The skeleton represents the general shape of the hand structure, but it contains short fragmentary branches that do not represent fingers. Also, for two fingers that are connected as in Fig. 1(c), the connecting point may not become a node (branching point) of the skeleton. We therefore create an augmented graph (which we call G hereafter) from the skeleton. This is accomplished by removing short fragments (e.g., the 'dangling branch' in Fig. 2) and dividing long edges into shorter pieces (Fig. 2), in particular at high-curvature points. After the connectivity graph G is formed, the remaining task is to determine which parts of G come from the right hand and which from the left.
Fig. 2. Example of the skeleton structure (left) and of the augmented graph structure (right)
Two methods are presented. The first algorithm is based on a tree search paradigm, while the second one is formulated by using constrained optimization.
3 Tree Search Framework
The first algorithm for hand disambiguation uses tree search. Given an augmented graph G, we form two subgraphs H1 and H2 such that G = H1 ∪ H2. Moreover, each Hi is required to satisfy a few conditions so that it represents a proper hand shape: a connected subgraph H1 (H2) is formed such that it contains the palm corresponding to the right (left) hand, respectively. Our strategy is to generate H1 and H2 systematically and pick the pair that best matches our definition of 'hands'. The outline of the algorithm is summarized as follows.
1. Create a DAG (directed acyclic graph) from G.
2. Separate the tree G into two parts:
   (a) Perform a DFS (depth-first search) from the source node (the top node of the tree).
   (b) Let H1 be the connected subgraph formed by the scan; this is a candidate hand structure.
   (c) For each H1, reduce the graph G to obtain the remaining part, which in turn forms the second hand structure H2.
3. Evaluate the resulting hand structure pair H1 and H2.
4. Select the pair with the best evaluation as the final answer.
Fig. 3. Example of tree search
3.1 Evaluation Function
After each scan of the tree, we are left with a pair H1 and H2. The evaluation of this pair is based on several criteria; a sketch of such an evaluation is given after the list.
1. Each of H1 and H2 must be connected. This follows from the natural requirement that each hand is a connected entity.
2. The distance of any part of H from the palm must be within a certain limit. This discourages fingers that are longer than a certain limit.
3. For two segments within a subgraph, the angle formed by the segments cannot be too small. This discourages fingers that bend at extremely sharp angles; likewise, fingers bending outward are discouraged.
4. A branching node in H must stay within a certain distance from the palm. This discourages fingers that branch near the fingertip.
These criteria are encoded in the decision process, and the pairs with the best evaluation are considered.
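The sketch below, referenced above, scores a candidate pair (H1, H2) against these criteria using NetworkX. The edge attributes ('length', 'sharp_turn'), the distance thresholds and the penalty weights are illustrative assumptions, not the authors' values.

```python
import networkx as nx

def evaluate_split(G, H1_nodes, H2_nodes, palms, max_finger_len=80, max_branch_dist=40):
    """Higher score = better split; -inf if a hand subgraph is disconnected."""
    score = 0.0
    for nodes, palm in ((H1_nodes, palms[0]), (H2_nodes, palms[1])):
        H = G.subgraph(nodes)
        if not nx.is_connected(H):                  # criterion 1: each hand connected
            return float("-inf")
        dist = nx.single_source_dijkstra_path_length(H, palm, weight="length")
        for n, d in dist.items():
            if d > max_finger_len:                  # criterion 2: fingers not too long
                score -= d - max_finger_len
            if H.degree(n) > 2 and d > max_branch_dist:
                score -= 10.0                       # criterion 4: no branching near fingertips
        for _, _, data in H.edges(data=True):
            score -= data.get("sharp_turn", 0.0)    # criterion 3: penalize sharp bends
    return score
```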
4 Optimization Framework
The second framework reduces the problem to a labeling problem which is formulated as the following optimization problem. We continue to use graph G. A segment (or edge) si in G is to be assigned a label by a function f (that is, f (si ) is either Right or Left). Each si has an estimate of its likelihood of having
labeling f(si). This estimate comes from a heuristic: for example, if s is near the left-hand palm, its likelihood of being part of the right hand is relatively low. For this purpose, a non-negative cost function c(s, f(s)) is introduced to represent this likelihood. Further, we consider two neighboring segments si and sj to be related, in the sense that we would like them to have the same label; each edge e in graph G has a non-negative weight indicating the strength of this relation. Moreover, certain pairs of labels are more similar than others, so we impose a distance d(·) on the label set, with larger distances indicating less similarity. The total cost of a labeling f is given by

Q(f) = \sum_{s \in S} c(s, f(s)) + \sum_{e = (s_i, s_j)} w_e\, d(f(s_i), f(s_j))
In our problem, the table in Fig. 4 is to be completed, where the binary variable Aij indicates whether segment si belongs to hand j. For c(i, j), the Euclidean distance from segment si to the palm of hand j is used. Since each (thin) segment si belongs to only one hand, \sum_j A_{ij} = 1 holds. In addition to this constraint, a number of related constraints are considered:
1. Neighboring segments should have a similar label.
2. Thick parts represent more than one finger.
3. Thin parts represent one finger.
Fig. 4. Assignment table
It turns out that this is an instance of the Uniform Labeling Problem, which can be expressed as an integer program by introducing auxiliary variables Ze for each edge e to express the distance between the labels, with Zej expressing the absolute value |Apj − Aqj|. Following Kleinberg and Tardos [2], we can rewrite our optimization problem as follows:

\min \; \sum_{i=1}^{N}\sum_{j=1}^{M} c(i, j)\,A_{ij} + \sum_{e \in E} w_e Z_e

subject to

\sum_j A_{ij} = 1, \quad i = 1, 2, \dots, N, \text{ if the } i\text{th segment is thin}

\sum_j A_{ij} = 2, \quad i = 1, 2, \dots, N, \text{ if the } i\text{th segment is thick}

Z_e = \frac{1}{2}\sum_j Z_{ej}, \quad e \in E

Z_{ej} \ge A_{pj} - A_{qj}, \quad e = (p, q), \; j = 1, \dots, M

Z_{ej} \ge A_{qj} - A_{pj}, \quad e = (p, q), \; j = 1, \dots, M

A_{ij} \in \{0, 1\}, \quad i = 1, 2, \dots, N, \; j = 1, 2, \dots, M

\mathrm{length}(s_1) + \mathrm{length}(s_2) + \cdots < MAXLEN \quad \text{for each hand}

Here, c(i, j) represents the cost (penalty) for segment i to belong to hand j, where j is either Right or Left. If segment i is far from the left palm, then c(i, Left) is large. For time-varying image sequences, the previous c(i, Left) may be used. This factor is very powerful, assuming that the palm locations are correctly detected. The terms involving Ze and Zej come from constraint 1 above (neighboring segments should have a similar label). The weight we represents the strength of the relation between graph nodes a and b (where e is the edge connecting a and b, and a and b represent segments). For example, if a and b make a sharp turn, we is a high number, since a and b are likely to belong to different hands. The weight is given by we = e^{−αde}, where de is the depth difference between the two adjacent segments and α is selected experimentally. As an additional constraint, Aij + Akj < 2 holds for all j if segments si and sk make a sharp turn, and further constraints may be added for thick segments. Finally, the total length of the fingers must not exceed a certain limit.
In general, solving an integer program optimally is NP-hard. However, we can relax the above problem to a linear program with Aij ≥ 0, which can be solved efficiently using a publicly available library. Kleinberg and Tardos [2] describe a method for rounding the fractional solution so that the expected objective value Q(f) is within a factor of 2 of the optimum. In our experiments, the relaxed linear program always returns an integer solution.
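As an illustration of the relaxation, the sketch below builds the two-label (Right/Left) LP with SciPy's linprog for the thin-segment case; the thick-segment, sharp-turn and finger-length constraints are omitted, and Z_e is substituted directly into the objective as one half of the sum of the Z_ej variables.

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_labeling(cost, edges, weights):
    """LP relaxation of the uniform labeling problem above.
    cost: (N, 2) array of c(i, j); edges: list of (p, q) segment index pairs;
    weights: per-edge w_e. Returns one label (0 or 1) per segment."""
    N, M = cost.shape                       # M == 2 labels
    E = len(edges)
    n_vars = N * M + E * M                  # A_ij first, then Z_ej
    c = np.concatenate([cost.ravel(), np.repeat(0.5 * np.asarray(weights), M)])

    # Equality constraints: sum_j A_ij = 1 for every (thin) segment i.
    A_eq = np.zeros((N, n_vars))
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0
    b_eq = np.ones(N)

    # Z_ej >= |A_pj - A_qj|, written as two "<= 0" inequalities per (e, j).
    A_ub = np.zeros((2 * E * M, n_vars))
    for e, (p, q) in enumerate(edges):
        for j in range(M):
            z = N * M + e * M + j
            r = 2 * (e * M + j)
            A_ub[r, p * M + j], A_ub[r, q * M + j], A_ub[r, z] = 1, -1, -1
            A_ub[r + 1, p * M + j], A_ub[r + 1, q * M + j], A_ub[r + 1, z] = -1, 1, -1
    b_ub = np.zeros(2 * E * M)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (N * M) + [(0, None)] * (E * M))
    A = res.x[:N * M].reshape(N, M)
    return A.argmax(axis=1)                 # in practice the solution is integral
```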
5 Sign Recognition System and Components

5.1 Palm Detection
To build an entire system for sign recognition, the hand shape analysis module has to be integrated with many other modules such as a motion tracker and pattern classifier. Here, a description is given for modules that are highly related to shape analysis, namely, finger and palm detection. For other related modules, see [18]. To locate fingers within the arm blob, each branch of the skeleton is traced at a regular interval, while measuring width at each position. When the width is smaller than a certain limit, we consider these pixels to belong to a finger. For palm detection, the following steps work well for our experiment.
1. Trace all branches of the skeleton. At regular intervals, shoot rays emanating from a point on the skeleton. Pick the chord with the shortest length and call it the width at that point (Fig. 7, right).
2. From all widths, pick the top few that are widest, and among them choose the one closest to the finger positions as the palm candidate.
3. For the chord at the selected point, take the center point of the chord and define it as the palm center to be used for the rest of sign recognition.
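A simplified sketch of this procedure is shown below. It approximates the shortest-chord width with a distance transform (the distance from a skeleton or blob point to the nearest background pixel is half the local width) instead of the ray-shooting step above; this substitution, and the number of candidate points, are assumptions made for brevity.

```python
import cv2
import numpy as np

def locate_palm_center(hand_mask, finger_tips):
    """Approximate palm-center detection. hand_mask: binary blob image;
    finger_tips: list of (x, y) fingertip coordinates from the finger detector."""
    dist = cv2.distanceTransform((hand_mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    # Candidate palm centers: the few widest points of the blob.
    flat = np.argsort(dist.ravel())[::-1][:5]
    candidates = np.column_stack(np.unravel_index(flat, dist.shape))[:, ::-1]  # (x, y)
    # Choose the wide point closest to the detected fingers.
    tips = np.asarray(finger_tips, dtype=np.float64)
    d_to_tips = [np.min(np.linalg.norm(tips - c, axis=1)) for c in candidates]
    return tuple(candidates[int(np.argmin(d_to_tips))])
```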
5.2 Experimental Results
The algorithm has been implemented in C and tested using alphabet letters used in JSL (Japanese Sign Language). Our experiments show that the second framework has more successful results. The primary reason is that defining an evaluation function that works uniformly for all words is difficult. For example, for ‘X’, we want sharp turns to be minimal, while for ‘B’, we want to keep some of the sharp turns in the pattern. This requires case-based analysis or substantial post-processing. Currently, it takes approximately 0.5 seconds to resolve overlapping hand cases as in Fig. 5. The second framework also fails at times, for example, when unnecessary fragments remain after pruning in the tree structure. Fig. 6 shows an example of a JSL sentence consisting of three words, ‘convenient store’, ‘place’, and ‘what’, illustrating a sign consisting of a non-overlapping 2-hand sign (convenient store) and 1-hand signs (place and what). Currently, our system has approximately 50 recognizable words. For 1-hand signs and non-overlapping 2-hand signs, recognition speed is less than 0.2 sec per word (after the whole word has been seen) on a laptop computer.
Fig. 5. Examples of separating two overlapping hands. From the top left, the signs represent ‘A’, ‘K’, ‘G’, ‘B’, ‘X’, ‘Well’, ‘Meet’, and ‘Letter’ in JSL. Palm centers are marked by circles and the hand separation is shown by hands with different shades.
Fig. 6. Example of sign recognition. The gray images represent depth profiles obtained by the camera, while the black and white images show processing results. Palm centers are marked as circles. Top row: ‘Convenient store (24 hours open)’. The right and left hands represent ‘2’ and ‘4’, respectively, while the hands draw a circular trajectory in front of the body to represent ‘open’. Middle row: ‘place’. The open right hand is moved downward. Bottom row: ‘what’. Only the pointing finger is up and it is moved left and right. The total meaning of the sequence is ‘Where is a convenient store?’. Currently, the system performs word-by-word recognition.
The module has also been incorporated in the word recognition module that takes hand shape, movement, and location into consideration and shows near real-time performance.
Fig. 7. Snapshot of the recognition system (left). Illustration for palm detection (right). For a point on the skeleton, a chord is generated to measure the width at that point. The one that gives rise to the maximum chord length is a candidate for locating the palm center.
6 Concluding Remarks
We have presented an algorithm to separate two overlapping hands when fingers of the right and left hands may touch or cross. This is a step toward recognizing various signs formed by two-hand formations. Two algorithms have been proposed and examined. Algorithm performance has been experimentally illustrated by showing some overlapping hand patterns taken from JSL words. Some works attempt to identify individual fingers in the hand (in addition to counting the number of fingers) from hand appearance. Our work has analyzed finger formation and classified the pattern into several types such as ‘L’ shape and ‘V’ shape. No attempt has been made at this point to identify fingers. We have also been able to integrate our algorithm into a word-based sign recognition system. Compared with existing approaches, the salient features of the constructed system are as follows. (i) Signs are recognized without using any special background, marks, or gloves. (ii) Each computation module is computationally inexpensive, thereby achieving fast sign recognition. (iii) Due to depth analysis, the method is less susceptible to illumination changes. The focus of the present work has been to divide a given arm blob into two separate parts representing the left and right hands. The problem is formulated in a graph division and labelling paradigm, and experimental results have shown that the method has reasonable performance. Our algorithm requires that much of the fingers be clearly visible. Challenges remain since sign languages usually use overlapping patterns that are more complex than those presented in this paper (e.g., ones involving occlusion). We leave this as our future research subject.
References
1. Ong, S., Ranganath, S.: Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 873–891 (June 2005)
2. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: Metric partitioning and Markov random fields. Journal of the ACM 49(5) (2002)
3. Hamada, Y., Shimada, N., Shirai, Y.: Hand shape estimation under complex backgrounds for sign language recognition. In: Proc. of Symposium on Face and Gesture Recognition, Seoul, Korea, pp. 589–594 (2004)
4. Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Analysis and Machine Intelligence 20(12), 1371–1375 (1998)
5. Mo, Z., Neumann, U.: Real-time hand pose recognition using low-resolution depth images. In: Int. Conf. on Computer Vision and Pattern Recognition, New York City (2006)
6. Imagawa, K., Lu, S., Igi, S.: Color-based hand tracking system for sign language recognition. In: Proc. Automatic Face and Gesture Recognition, pp. 462–467 (1998)
7. Polat, E., Yeasin, M., Sharma, R.: Robust tracking of human body parts for collaborative human computer interaction. Computer Vision and Image Understanding 89(1), 44–69 (2003)
8. Wilson, A., Bobick, A.: Parametric hidden Markov models for gesture recognition. IEEE Trans. on Pattern Anal. Mach. Intel. 21(9), 884–900 (1999)
9. Jojic, N., Brumitt, B., Meyers, B., Harris, S., Huang, T.: Detection and estimation of pointing gestures in dense disparity maps. In: Proc. of the 4th Intl. Conf. on Automatic Face and Gesture Recognition, Grenoble, France (2000)
10. Bretzner, L., Laptev, I., Lindeberg, T.: Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering. In: Proc. of the 5th Intl. Conf. on Automatic Face and Gesture Recognition, Washington D.C., May 2002, pp. 423–428 (2002)
11. Pavlovic, V., et al.: Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Trans. on Pattern Anal. Mach. Intel. 19(7), 677–695 (1997)
12. Iddan, G.J., Yahav, G.: 3D imaging in the studio. In: SPIE, vol. 4298, p. 48 (2000)
13. Malassiotis, S., Aifanti, N., Strintzis, M.G.: A gesture recognition system using 3D data. In: 1st Intl. Symp. on 3D Data Processing, Visualization, and Transmission, Padova, Italy (June 2002)
14. Vogler, C., Metaxas, D.: ASL recognition based on a coupling between HMMs and 3D motion analysis. In: Proc. Int. Conf. Computer Vision, Bombay (1998)
15. Zhu, Y., Xu, G., Kriegman, D.J.: A real-time approach to the spotting, representation, and recognition of hand gestures for human-computer interaction. Computer Vision and Image Understanding 85(3), 189–208 (2002)
16. Athitsos, V., Sclaroff, S.: An appearance-based framework for 3D handshape classification and camera viewpoint estimation. In: Proc. of the 5th Intl. Conf. on Automatic Face and Gesture Recognition, Washington D.C., May 2002, pp. 45–50 (2002)
17. Zhu, X., Yang, J., Waibel, A.: Segmenting hands of arbitrary color. In: Proc. of the 4th Intl. Conf. on Automatic Face and Gesture Recognition, Grenoble, March 2000, pp. 446–453 (2000)
18. Fujimura, K., Liu, X.: Sign recognition using depth image streams. In: Proc. of the 7th Symposium on Automatic Face and Gesture Recognition, Southampton, UK (May 2006)
Depth from Stationary Blur with Adaptive Filtering
Jiang Yu Zheng and Min Shi
Department of Computer Science, Indiana University Purdue University Indianapolis (IUPUI), USA
Abstract. This work achieves an efficient acquisition of scenes and their depths along long streets. A camera is mounted on a vehicle moving along a path and a sampling line properly set in the camera frame scans the 1D scene continuously to form a 2D route panorama. This paper extends a method to estimate depth from the camera path by analyzing the stationary blur in the route panorama. The temporal stationary blur is a perspective effect in parallel projection yielded from the sampling slit with a physical width. The degree of blur is related to the scene depth from the camera path. This paper analyzes the behavior of the stationary blur with respect to camera parameters and uses adaptive filtering to improve the depth estimation. It avoids feature matching or tracking for complex street scenes and facilitates real time sensing. The method also stores much less data than a structure from motion approach does so that it can extend the sensing area significantly. Keywords: Depth from stationary blur, route panorama, 3D sensing.
1 Introduction

For pervasive archiving and visualization of large-scale urban environments, mosaicing views from a translating camera and obtaining depth information has become an interesting topic in recent years. Along a camera path, however, overlapping consecutive 2D images perfectly is impossible due to the inconsistent motion parallax from drastically changed depths in urban environments. Approaches to tackle this problem so far include (1) the 1D-2D-3D approach [1] that collects slit views continuously from a translating camera under a stable motion on a vehicle. The generated route panorama (RP) [2][3][4] avoids image matching and stitching. Multiple route panoramas from different slits are also matched to locate 3D features in the 3D space [1][4][5][6]. (2) The 2D-3D-2D approach that mosaics 2D images through matching, 3D estimation [7], and re-projection to a 2D image. If scenes are close to a single depth plane, photomontage can select scenes seamlessly [8] to result in a multiperspective projection image. Alternatively, images at intermediate positions can also be interpolated to form a parallel-perspective image [9]. (3) The 3D-1D-2D approach obtains a long image close to perspective images at each local position. The 1D sampling slit is shifted dynamically [10] according to a dominant depth measured by laser [11][12] or the image velocity in a video volume [13][14][15][16]. This work aims to scan long route panoramas and the depth with a 1D slit as it is the simplest approach without image matching. We analyze the stationary blur [2][24], a perspective effect in the parallel projection due to using a non-zero width
slit. The degree of blurring is related to the scene depth from the camera path as well as the camera parameters [17]. By using differential filters to evaluate the contrast in the RP against the original contrast in the image, we can obtain a depth measure at strong spatial-temporal edges. This paper further adjusts camera parameters such as the vehicle speed, camera focal length and resolution to increase the blur effect, which increases the sensitivity of the method and improves the depth estimation. Adaptive filtering for various depths is implemented to reduce the depth errors. In the next section, we first extend the path to a general curve to obtain a geometric projection of the route panoramas. Then we analyze the physical model of the slit scanning and introduce the stationary blur in Section 3. Section 4 is devoted to a depth calculation method and Section 5 develops a filtering approach adaptive to various depths. Section 6 introduces the experiments followed by a conclusion.
2 Acquisition of Route Panoramas Along Streets in Urban Areas We define the slit-scanning scheme model along a smooth camera path on a horizontal plane. A video camera is mounted on a four-wheeled vehicle moving at a speed V. Denoting the camera path by S(t) in a global coordinate system where t is the scanning time in frame number, such a path is an envelope of circular segments with changing curvature κ(t), where κ(t)=0 for a straight segment. The vehicle keeps V=|V| as constant as possible and the variation can be obtained from GPS. In order to produce good shapes in a route panorama, a vertical Plane of Scanning (PoS) is set in the 3D space through the camera focal point as the camera moves along a path. This ensures that vertical lines in the 3D space appear vertically in the route panorama even if the camera is on a curved path.
Fig. 1. A section of 2D RP from slit scanning. Different horizontal contrasts appear at different depths.
To create the route panorama, we collect temporal data continuously from the slit of one pixel width, which generates an RP image with time t coordinate horizontally and the slit coordinate y vertically (Fig. 1). A fixed sampling rate m (frame/sec), normally selected as the maximum reachable rate of the camera, is used for the scanning. At each instance or position on the path, a virtual camera system O-XYZ(t) can be set such that the image frame is vertical and the X axis is aligned with the moving direction V. Within each PoS, we can linear-transform data on the slit l to a vertical slit l’ in O-XYZ(t). This converts a general RP to a basic one that has a vertical and smooth image surface along the camera path. A 3D point P(X,Y,Z) in O-XYZ(t) has the image projection as
I(x, y, t): \quad x = \frac{Xf}{Z}, \qquad y = \frac{Yf}{Z} \qquad (1)

where f is the camera focal length. The projection of P in the basic RP is then I(t, y) = I(x, y, t)|_{x \in l'}, calculated by

I(t, y): \quad t = \frac{S}{r}, \qquad y = \frac{Yf}{Z}, \qquad r = \frac{V}{m} \qquad (2)
where V=|V|, S=|S|, and r (meter/frame) is the camera sampling interval on the path. We define a path-oriented description of the camera rather than using a viewpoint-oriented representation. As depicted in Fig. 2, the camera moves on a circular path with a radius of curvature R=1/κ, where κ<0, κ=0, and κ>0 for convex, linear, and concave paths, respectively. The camera translation and rotation velocities are V(V,0,0) and Ω(0,β,0), where β is a piece-wise constant related to the vehicle steering status and is estimated from the GPS output. Because V is along the tangent of the path, a four-wheeled vehicle has the motion constraint

V = \frac{\partial S(t)}{\partial t} = R\beta \qquad (3)
where R and β have the same sign.
Fig. 2. Relation of circular path segments and the camera coordinate systems. (a) Convex path, where R < 0; (b) concave path, where R > 0.
On the other hand, the relative velocity of a scene point P(X, Y, Z) with respect to the camera is

\frac{\partial P(t)}{\partial t} = -V + \Omega \times P(t) \qquad (4)

\frac{\partial (X(t), Y(t), Z(t))}{\partial t} = -(V, 0, 0) + (0, \beta, 0) \times (X(t), Y(t), Z(t)) \qquad (5)
When the point is viewed through the slit at time t, i.e., the point is on the PoS, we have

\frac{\partial X(t)}{\partial t} = -V + \beta Z(t), \qquad
\frac{\partial Y(t)}{\partial t} = 0, \qquad
\frac{\partial Z(t)}{\partial t} = -\beta X(t) = -\frac{\beta Z(t)}{\tan\alpha} \qquad (6)

using tanα = Z/X. Taking the temporal derivative of (1), the image velocity v is
v = \frac{\partial x}{\partial t} = f\,\frac{\partial X/\partial t}{Z(t)} - fX\,\frac{\partial Z(t)/\partial t}{Z(t)^2} = f\,\frac{\partial X/\partial t}{Z(t)} - x\,\frac{\partial Z(t)/\partial t}{Z(t)} \qquad (7)
Filling the results from (5) and (6) into (7), we obtain the image velocity on slit l as

v = -\frac{f\,(V - \beta Z(t))}{Z(t)} + \frac{x\beta}{\tan\alpha} = -\frac{fV}{Z(t)} + \frac{V}{R}\left(f + \frac{x}{\tan\alpha}\right) \qquad (8)
From (8), the depth Z(t) and the 3D point can be obtained as

Z(t) = \frac{fV}{\frac{V}{R}\left(f + \frac{x}{\tan\alpha}\right) - v} = \frac{f^2 V}{\frac{V}{R}\,(f^2 + x^2) - f v}, \qquad
X(t) = \frac{Z(t)}{\tan\alpha}, \qquad Y(t) = \frac{Z(t)\, y}{f} \qquad (9)
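As a small numeric illustration of Eq. (9) as reconstructed above, the sketch below recovers depth Z from the image velocity v measured on the slit. All parameter values are illustrative placeholders.

```python
# Depth from image velocity on the slit, following Eq. (9); R = inf gives a straight path.
import math

def depth_from_velocity(v, f, V, R, x, alpha):
    """Z(t) = f V / ((V/R)(f + x/tan(alpha)) - v)."""
    curvature_term = 0.0 if math.isinf(R) else (V / R) * (f + x / math.tan(alpha))
    return f * V / (curvature_term - v)

# Faster (more negative) image velocity on the slit implies a smaller depth.
print(depth_from_velocity(v=-2.0, f=500.0, V=1.0, R=float('inf'), x=0.0, alpha=math.pi / 2))  # 250.0
print(depth_from_velocity(v=-5.0, f=500.0, V=1.0, R=float('inf'), x=0.0, alpha=math.pi / 2))  # 100.0
```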
[Table: relation among the scene depth relative to the just-sampling depth (Z > Zj, Z = Zj, Z < Zj), the image velocity (|v| < 1, |v| = 1, |v| > 1, and its relation to S/4), the sampling behavior (overlapped sampling, just-sampling with neither missing nor overlapping scenes, missing some details), and the resulting blur (stationary blur in the RP wider than in the image, same as in the image, or motion blur in the image, shorter than in the image).]
For each subsequent frame (f > 1), Mf is estimated by iterating the following steps toward the last frame.
Feature point tracking: All the image features are tracked from the previous frame to the current frame by using standard template matching with the Harris corner detector [10]. The RANSAC approach [11] is also employed to eliminate outliers.
Extrinsic camera parameter estimation: The extrinsic camera parameter Mf is estimated using the tracked position (ufp, vfp) and its corresponding 3-D position Sp = (xp, yp, zp). Here, extrinsic camera parameters are obtained by minimizing \sum_p E_{fp}, the sum of the error function defined in Eq. (2), using the Levenberg-Marquardt method. For the 3-D position Sp of the feature point p, the estimated result in the previous iteration is used.
Estimation of 3-D feature position: For every feature point p in the current frame, its 3-D position Sp = (xp, yp, zp) is refined by minimizing the error function \sum_{i=1}^{f} E_{ip}.
Addition and deletion of feature points: In order to obtain accurate estimates of camera parameters, good features should be selected. The set of features to be tracked is updated by evaluating the reliability of features [9].
By iterating the above steps, extrinsic camera parameters Mf and 3-D feature positions Sp are estimated.
Step (b): Generation of preview image and instruction for user
In parallel with the camera parameter estimation, a coarse preview of the mosaic image is rendered. This preview is updated in real time, using captured images and the estimated camera parameters. With this preview, the user can easily recognize which part of the document is still left to be captured, and figure out where to move the camera in the subsequent frames.
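The following rough sketch shows how one iteration of the per-frame estimation in Section 2.2 could look, using off-the-shelf OpenCV routines (goodFeaturesToTrack, calcOpticalFlowPyrLK, solvePnPRansac) in place of the authors' Harris-corner template tracker and Levenberg-Marquardt minimization; camera intrinsics K are assumed known, and all names are hypothetical.

```python
# One tracking/estimation iteration, sketched with OpenCV substitutes.
import cv2
import numpy as np

def track_and_estimate(prev_gray, cur_gray, prev_pts, points_3d, K):
    """prev_pts: (N,1,2) float32 feature positions; points_3d: (N,3) 3-D positions.
    Returns tracked points and the 3x4 extrinsic matrix M_f of the current frame."""
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, prev_pts, None)
    ok = status.ravel() == 1
    cur_pts, obj = cur_pts[ok], points_3d[ok]

    # RANSAC pose estimation (stands in for the paper's reprojection-error minimization).
    _, rvec, tvec, inliers = cv2.solvePnPRansac(obj.astype(np.float32),
                                                cur_pts.astype(np.float32), K, None)
    R, _ = cv2.Rodrigues(rvec)
    M_f = np.hstack([R, tvec])                # 3x4 extrinsic matrix [R | t]
    return cur_pts, M_f, inliers

def add_new_features(gray, existing_pts, max_corners=500):
    """Replenish the feature set with new corners (cf. 'Addition and deletion')."""
    pts = cv2.goodFeaturesToTrack(gray, max_corners, qualityLevel=0.01, minDistance=7)
    return pts if existing_pts is None else np.vstack([existing_pts, pts])
```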
2.3 Off-Line Stage for Parameter Refinement and Target Shape Estimation
This section describes the process to globally optimize the estimated extrinsic camera parameters and 3-D feature positions, and to approximate the shape of the target by fitting a parameterized surface.
Step (c): Detection of Reappearing Features
Due to the camera motion, most of the image features come into the image, move across toward the end of the image, and disappear. Some features, however, reappear in the image, as shown in Figure 3. In this step, these reappearing features are detected, and distinct tracks belonging to the same reappearing feature are linked to form a single long track. This will give tighter constraints among camera parameters in temporally distinct frames, and thus makes it possible to suppress cumulative errors in the global optimization step described later. Reappearing features are detected by examining the similarity of the patterns among features belonging to distinct tracks. First, templates of all the features are projected to the fitted surface (described later). Next, feature pairs whose distance in 3-D space is less than a given threshold are selected and tested with the normalized cross correlation function. If the correlation is higher than a threshold, the feature pair is regarded as reappearing features (see Figure 3).
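A minimal sketch of this reappearing-feature test is given below; it assumes templates have already been warped onto the fitted surface so that perspective distortion is removed, and the thresholds are illustrative.

```python
# Pair nearby features and compare their rectified templates with NCC.
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized patches."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b))

def link_reappearing(features, dist_thresh=0.01, ncc_thresh=0.8):
    """features: list of (position_3d, rectified_template).
    Returns index pairs judged to be the same physical feature."""
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            p_i, t_i = features[i]
            p_j, t_j = features[j]
            if np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)) < dist_thresh:
                if ncc(t_i, t_j) > ncc_thresh:
                    pairs.append((i, j))      # merge the two tracks into one
    return pairs
```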
Fig. 3. Detection of re-appearing features. (a) camera path, posture and 3-D feature positions, (b) temporarily distinct frames in the input video, (c) templates of the same feature in different frame images, (d) templates without perspective distortion.
Fig. 4. Target shape estimation by polynomial surface fitting
Step (d): Global Optimization of 3D Reconstruction
Since the 3-D reconstruction process described in Section 2.2 is an iterative process, its result is subject to cumulative errors. In this method, by introducing the bundle-adjustment framework [12], the extrinsic camera parameters and 3-D feature positions are globally optimized so as to minimize the sum of re-projection errors E_{all} = \sum_f \sum_p E_{fp}.
Step (e): Target Shape Estimation by Surface Fitting
In this step, assuming the curve of the target lies along one direction, the target shape is estimated using the 3-D point clouds optimized in the previous step (d). First, as shown in Figure 4, the principal direction of curvature is computed from the 3-D point clouds. Next, the 3-D position of each feature point is projected to a plane perpendicular to the direction of minimum principal curvatures. Finally, a polynomial equation of variable order is fitted to the projected 2-D coordinates, and the target shape is estimated.
Let us consider for each 3-D point Sp a point cloud Rp which consists of feature points lying within a certain distance from Sp. First, the directions of maximum and minimum curvatures are computed for each Rp using a local quadratic surface fit. Then, a voting method is applied to determine the dominant direction Vmin = (vmx, vmy, vmz) of minimum principal curvatures for the whole target. Next, the 3-D position Sp of each feature point is projected to a plane whose normal vector N coincides with Vmin, i.e. P(x, y, z) = vmx x + vmy y + vmz z = 0. The projected 2-D coordinate (\bar{x}_p, \bar{y}_p) of Sp is given as follows:

\begin{pmatrix} \bar{x}_p \\ \bar{y}_p \end{pmatrix} = \begin{pmatrix} V_1 \\ V_2 \end{pmatrix} S_p \qquad (3)

where V1 is a unit vector parallel to the principal axis of inertia of the projected 2-D coordinates (\bar{x}, \bar{y}), and V2 is a unit vector which is perpendicular to V1
and Vmin, i.e. V2 = V1 × N. Finally, the shape parameters (a0, a1, ..., aq) are obtained by fitting the following variable-order polynomial equation to (\bar{x}, \bar{y}):

\bar{y} = f(\bar{x}) = \sum_{i=0}^{q} a_i\, \bar{x}^i \qquad (4)
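The numpy sketch below combines the projection of Eqs. (3)-(4), as reconstructed above, with the order-selection criterion G = J + 2 q eps^2 introduced in Eqs. (5)-(6) just below. It is an illustration, not the authors' code; Vmin and the noise level eps are assumed to be given.

```python
# Project 3-D feature points onto the plane normal to V_min and fit a polynomial
# cross-section whose order q is chosen by the geometric AIC criterion.
import numpy as np

def fit_cross_section(points_3d, v_min, eps, q_max=8):
    """points_3d: (N,3) optimized feature positions; v_min: dominant direction of
    minimum principal curvature; eps: noise level (e.g. eps = C * mean depth)."""
    n = v_min / np.linalg.norm(v_min)

    # Remove the component along n and build an orthonormal basis (V1, V2) in the plane.
    proj = points_3d - points_3d @ n[:, None] * n
    centered = proj - proj.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v1 = vt[0]                                # principal axis of inertia of projected points
    v2 = np.cross(v1, n)
    x_bar, y_bar = centered @ v1, centered @ v2   # Eq. (3)

    # Variable-order fit y_bar = sum_i a_i x_bar**i; order chosen by G = J + 2*q*eps**2.
    best_q, best_G, best_coeffs = None, np.inf, None
    for q in range(1, q_max + 1):
        coeffs = np.polyfit(x_bar, y_bar, q)
        J = float(np.sum((np.polyval(coeffs, x_bar) - y_bar) ** 2))   # residual
        G = J + 2.0 * q * eps ** 2
        if G < best_G:
            best_q, best_G, best_coeffs = q, G, coeffs
    return best_q, best_coeffs, (v1, v2, n)
```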
Using the geometric AIC [13], the optimal order q in the above equation is determined as the q which minimizes the following criterion:

\text{G-AIC} = J + 2\,(N(m - r) + q + 1)\,\epsilon^2 \qquad (5)

where J is the residual, N is the number of points, m is the dimension of the observed data, and r is the number of constraint equations in fitting Eq. (4) to the projected 2-D coordinates. \epsilon, called the noise level, is the average error of the estimated feature position along the \bar{y} axis. Here, the order q is independent of m, r, N, thus the actual criterion to be minimized is given as follows:

G = J + 2\, q\, \epsilon^2 \qquad (6)

In our method, the noise level is approximated as \epsilon = C l, where l is the average of the depth of each feature point in the camera coordinates of every frame, and C is a constant, which is empirically set to 0.007. In the case of a target with multiple curved surfaces, e.g. the thick bound book shown in Figure 1(a), the target is first divided with a line where the normal vector of the locally fitted quadratic surface varies discontinuously, and the shape parameters are computed for each part of the target.
Step (f): Mosaic Image Generation
Finally, a mosaic image is generated by using the extrinsic camera parameters and surface shape parameters. Let us consider a 2-D coordinate (m, n) on the unwrapped mosaic image, as shown in Figure 4. Here, the relation between (m, n) and its corresponding 3-D coordinate (\bar{x}, f(\bar{x}), \bar{z}) on the fitted surface is given as follows:

(m, n) = \left( \int_{0}^{\bar{x}} \sqrt{1 + \left\{ \tfrac{d}{dx} f(x) \right\}^2 }\, dx, \ \bar{z} \right) \qquad (7)

The relationship between the coordinate (\bar{x}, f(\bar{x}), \bar{z}) and its corresponding 2-D coordinate (u_f, v_f) on the f-th image plane is given by the following equation:

\begin{pmatrix} a u_f \\ a v_f \\ a \end{pmatrix} = M_f \begin{pmatrix} V_1 \\ V_2 \\ N \end{pmatrix} \begin{pmatrix} \bar{x} \\ f(\bar{x}) \\ \bar{z} \end{pmatrix} \qquad (8)

The pixel value at (m, n) on the unwrapped mosaic image is given by computing the average of the pixel values at all the corresponding coordinates (u_f, v_f) in the input image, given by Eqs. (7) and (8). After an unwrapped mosaic image is generated by the above process, the shade induced by the curved shape of the target is removed. In this process, the following assumptions are made: the background of the target page is white, and the target is illuminated by a parallel light source. In the proposed method, the vertical direction in the mosaic image coordinate (m, n) is defined to coincide with the direction of the minimum principal curvature of the target. Thus, under
a parallel light source, the effect of shade is uniform for pixels having the same m coordinate on the mosaic image. If we can assume that, in any column of the mosaic image, there exists at least one pixel belonging to the background, a new pixel value Inew(m, n) after shade removal can be computed as follows:

I_{new}(m, n) = \frac{I_{max}\, I(m, n)}{\max\big( I(m + u, n + v); \ \forall (u, v) \in W \big)} \qquad (9)
where W is a rectangular window whose height is larger than its width, e.g. a window with the size of 5×500 pixel, and Imax is the maximum possible intensity value of an image (typically 255).
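A short sketch of this shade-removal step using scipy's maximum filter is given below. The window shape is a placeholder interpreting the "tall window" example in the text as (rows, columns); this interpretation is an assumption.

```python
# Divide each pixel by the local maximum over a tall window W, Eq. (9).
import numpy as np
from scipy.ndimage import maximum_filter

def remove_shade(mosaic, window=(500, 5), i_max=255.0):
    """mosaic: 2-D grayscale unwrapped mosaic image.  Returns the de-shaded image."""
    local_max = maximum_filter(mosaic.astype(float), size=window)
    out = i_max * mosaic / np.maximum(local_max, 1e-8)
    return np.clip(out, 0, i_max).astype(mosaic.dtype)
```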
3 Experiments
We have developed a prototype video mosaicing system which consists of a laptop PC (Pentium-M 2.1 GHz, Memory 2GB) and an IEEE 1394 type web-cam (Aplux C104T, VGA, 15fps). Experiments are performed on two curved targets: one on a thick bound book with 2 curved pages and the other on a curved poster on a cylindrical column. In both experiments, the intrinsic camera parameters are calibrated in advance using Tsai’s method [14], and are fixed throughout image capturing. A thick bound book captured in the first experiment is shown in Figure 5. It is composed of 2 pages of curved surfaces: one page with texts and the other with pictures and figures. Plus marks (+) were printed on 40mm grid points on both pages for quantitative evaluation (described later). The target is captured as a video of 200 frames at 7.5fps. Sampled frames of the captured images are shown in Figure 6. Tracked feature points are depicted with cross marks. The 3-D reconstruction result is shown in Figure 7. The curved line shows the camera path, pyramids show the camera postures in every 10 frames, and the point cloud shows 3-D positions of feature points. The shape of the target estimated after 3 iterations of the reappearing feature detection and surface fitting process is shown in Figure 8(a). The optimal orders of the polynomial surface, which are automatically determined by geometric AIC [13], are 5 and 4 for the left and right pages, respectively. The unwrapped mosaic image is shown in Figure 8(b). The resolution of the mosaic image is 3200 × 2192. We evaluate the distortion in the generated mosaic image quantitatively, using plus marks (+) printed on the target paper at every 40mm grid positions. First, the positions of the plus marks are acquired manually in the generated

Table 1. Distances of adjacent grid points on the mosaic image [pixels (percentage from average)]

page    average         maximum         minimum         std. dev.
left    338.5 (100.0)   348.0 (102.8)   331.0 (97.8)    3.77 (1.11)
right   337.6 (100.0)   345.0 (102.2)   331.1 (98.1)    2.75 (0.81)
Fig. 5. Thick bound book with curved surface
Fig. 6. Sampled frames of input images and tracked features (Book): 1st, 100th, 200th, and 300th frames
Fig. 7. Estimated extrinsic camera parameters and 3-D positions of features (Book): (a) frontal view, (b) side view
Fig. 8. (a) Estimated target shape and (b) unwrapped mosaic image (Book)
Fig. 9. Poster on a cylindrical column
Fig. 10. Sampled frames of input images and tracked features (Poster): 1st, 35th, 87th, and 142nd frames
Fig. 11. Estimated extrinsic camera parameters and 3-D positions of features (Poster): (a) frontal view, (b) side view
Fig. 12. (a) Estimated target shape and (b) unwrapped mosaic image (Poster)
mosaic image. The distances between adjacent plus marks are then computed in units of pixels. The average, maximum, minimum and standard deviation of the distances are shown in Table 1. The percentage of each value from the average distance is also shown in parentheses. Here, the standard deviation can be considered as the average distortion in the generated image. In this experiment, the average distortions for the left and right pages are only 1.1% and 0.8%, respectively. We can confirm that the geometric distortion has been correctly removed. The other experiment is performed on a poster on a cylindrical column shown in Figure 9. Sampled frames of the captured images with tracked feature points, and the 3-D reconstruction result, are shown in Figures 10 and 11, respectively. The shape of the target estimated after 3 iterations is shown in Figure 12(a). The optimal order of the polynomial surface fitted to the target is 4. The unwrapped mosaic image is shown in Figure 12(b). The resolution of the mosaic image is 3200 × 5161. Although the target has few text features, the distortion has been successfully removed in the resultant image. The performance of our system measured in the first experiment is as follows: 22 seconds for initial 3-D reconstruction, 188 seconds for camera parameter refinement and surface fitting, and 410 seconds for generating the final mosaic image.
4 Conclusion
A novel video mosaicing method for generating a high-resolution and geometric distortion-free mosaic image for curved documents has been proposed. With this method based on 3-D reconstruction, the 6-DOF camera motion and the shape of the target document are estimated. Assuming the curve of the target lies along one direction, the shape model is fitted to the feature point cloud and an unwrapped image is automatically generated. In experiments, a prototype system based on the proposed method has been developed and successfully demonstrated. Our future work is to reduce the computational cost, and to implement the algorithm on a mobile PC.
References
1. Szeliski, R.: Image Mosaicing for Tele-Reality Applications. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 230–236. IEEE Computer Society Press, Los Alamitos (1994)
2. Capel, D., Zisserman, A.: Automated Mosaicing with Super-resolution Zoom. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 885–891. IEEE Computer Society Press, Los Alamitos (1998)
3. Chiba, N., Kano, H., Higashihara, M., Yasuda, M., Osumi, M.: Feature-based Image Mosaicing. In: Proc. IAPR Workshop on Machine Vision Applications, pp. 5–10 (1998)
4. Hsu, C.T., Cheng, T.H., Beuker, R.A., Hong, J.K.: Feature-based Video Mosaicing. In: Proc. IEEE Int. Conf. on Image Processing, vol. 2, pp. 887–890. IEEE Computer Society Press, Los Alamitos (2000)
5. Lhuillier, M., Quan, L., Shum, H., Tsui, H.T.: Relief Mosaicing by Joint View Triangulation. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 785–790. IEEE Computer Society Press, Los Alamitos (2001)
6. Takeuchi, S., Shibuichi, D., Terashima, N., Tominage, H.: Adaptive Resolution Image Acquisition Using Image Mosaicing Technique from Video Sequence. In: Proc. IEEE Int. Conf. on Image Processing, vol. 1, pp. 220–223. IEEE Computer Society Press, Los Alamitos (2000)
7. Cao, H., Ding, X., Liu, C.: A Cylindrical Surface Model to Rectify the Bound Document Image. In: Proc. Int. Conf. on Computer Vision, vol. 1, pp. 228–233 (2003)
8. Brown, M.S., Tsoi, Y.C.: Undistorting Imaged Print Materials using Boundary Information. In: Proc. Asian Conf. on Computer Vision, vol. 1, pp. 551–556 (2004)
9. Sato, T., Kanbara, M., Yokoya, N., Takemura, H.: Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera. Int. J. of Computer Vision 47(1-3), 119–129 (2002)
10. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Proc. Alvey Vision Conf., pp. 147–151 (1988)
11. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24(6), 381–395 (1981)
12. Triggs, B., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle Adjustment – A Modern Synthesis. In: Proc. Int. Workshop on Vision Algorithms, pp. 298–372 (1999)
13. Kanatani, K.: Geometric Information Criterion for Model Selection. Int. J. of Computer Vision 26(3), 171–189 (1998)
14. Tsai, R.Y.: An Efficient and Accurate Camera Calibration Technique for 3D Machine Vision. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 364–374. IEEE Computer Society Press, Los Alamitos (1986)
Super Resolution of Images of 3D Scenes
Uma Mudenagudi, Ankit Gupta, Lakshya Goel, Avanish Kushal, Prem Kalra, and Subhashis Banerjee
Department of Computer Science and Engineering, IIT Delhi, New Delhi, India
[email protected], {pkalra,suban}@cse.iitd.ernet.in, [email protected]

Abstract. We address the problem of super resolved generation of novel views of a 3D scene with the reference images obtained from cameras in general positions; a problem which has not been tackled before in the context of super resolution and is also of importance to the field of image based rendering. We formulate the problem as one of estimation of the color at each pixel in the high resolution novel view without explicit and accurate depth recovery. We employ a reconstruction based approach using the MRF-MAP formalism and solve using graph cut optimization. We also give an effective method to handle occlusion. We present compelling results on real images.
1 Introduction
In this paper we address the problem of estimating a single high resolution (HR) image of a 3D scene given a set of low resolution (LR) images obtained from different but known viewpoints with no restrictions on either the scene geometry or the camera position. The super resolved image can correspond to one of the known viewpoints in the input set, or even be of a novel view. There have been several different approaches to image super resolution, with estimation of high resolution images from multiple low resolution observations related by small 2D motions being by far the most common one [1,2,3]. Most of these approaches assume that the low resolution images can be accurately registered by a 2D affine transformation or a homography, and attempt to super resolve by reversing the degradation caused by blur or defocus and sampling. There has been very little work on super resolution when the scene is 3D and the input cameras are generally positioned. The super resolution problem becomes considerably harder when, because of the depth variation in the scene and the arbitrary placement of the input cameras, there is no simple registration transformation between the input images. To super resolve, one then needs dense depth estimation at each pixel using multiple-view geometry; a problem which has not been entirely solved. Though qualitative estimation of such depth information is possible, high precision depth recovery required for accurate and high resolution rendering of a 3D scene is still a challenging problem. Super resolution rendering of a novel view of a 3D scene is an even harder problem because then one needs to effectively handle occlusion. For super resolution of one of the input views the key problem is to determine which pixels/colors
from the other views need to be used for generating each high resolution pixel in the target view. However, for a novel view, the visibility information itself may change significantly, and this needs to be handled. The main contributions of this paper are as follows: 1. We give an MRF-MAP formulation of super resolution reconstruction of 3D scenes from a set of calibrated LR images obtained from arbitrary 3D positions and solve using graph cut optimization. 2. We show that the super resolved novel view synthesis problem can be posed as one of first generating a set of pixels in the source LR views, depending on their color, which are most likely to contribute to a target pixel in the novel view, and then using standard techniques of super resolution to recover the high frequency details. In this sense, our approach is closely related to the novel view synthesis methods of [4,5], which however do not address the issues of super resolution. 3. We give an effective method for handling occlusion using photo-consistency [6]. We demonstrate this with results on real images. Our approach is motivated by [4,5], where the novel view generation problem has also been formulated as one of estimating the color at each generated pixel, without explicit depth recovery. In [4] the virtual cameras are placed within the scene itself, and the visibility information does not change significantly between the input cameras. The color selection is done by geometric analysis of “line of sight”. [5] relies on a probabilistic analysis of the possible colors at a target pixel and uses a database of color patches, computed from the input images themselves, as regularization priors. Occlusion is not explicitly handled. We draw insights from these works to formulate a scheme for super resolution rendering of novel views which can explicitly handle occlusion. In Section 2, we describe the image formation process, and the formulation of novel view super resolution using the MRF-MAP formalism. We present an effective method to resolve occlusion using photo-consistency in Section 3. We give the energy minimization using graph cut in Section 4. In Section 5, we present results to demonstrate the effectiveness of our method. We conclude in Section 6.
2 Novel View Super Resolution

2.1 MRF-MAP Formulation
We are given a set of 2D LR images g1, . . . , gn with their 3 × 4 projective camera matrices P1, . . . , Pn. Let Ci be the camera center for the camera Pi. Let f be the high resolution image seen from the novel view with projection matrix Pcam. The projection matrix of the k th observed image, Pk, projects a homogeneous 3D point X to a homogeneous 2D point xk = λ(x, y, 1)T [7], i.e. xk = Pk X, where the equality is up to scale. We compute Pcam by first multiplying the matrix of camera internals with the desired magnification, and then multiplying with the external parameters of the novel view in a standard way. The main task is to assign a color to
a high resolution pixel p which is photo-consistent in the input images, i.e., the corresponding pixels in the input LR reference images have similar color, unless occluded [8,9,6]. We need to compute the expected color for a HR pixel p. Ideally, this would require the computation of a dense depth map from the input images to obtain complete registration, and a reliable solution to this problem is still elusive in the computer vision literature. We compute the expected label (color) in a given image using the photo-consistency constraint. We model f as a Markov Random Field (MRF) [10] and use a maximum a posteriori (MAP) estimate as the final solution. The problem of estimating f can be posed as a labeling problem where each pixel is assigned a color label. We formulate the posterior energy function as

E(f, z|g) = \sum_{p \in S} \sum_{k=1}^{n} \alpha_k(p, p'(z)) \big( h(p) * f(p, z) - g_k(p'(z)) \big)^2 + \lambda_s \sum_{p,q \in N} V_{p,q}(f(p), f(q)) \qquad (1)
where S is the set of sites (pixels) in the novel view, z is a depth along the back-projected ray from site p in the novel view camera, f(p, z) is the label at site p and at depth z, h(p) is the space invariant PSF of the camera for the k th image, p'(z) = Pk(X(z)) is the projection of the back projected 3D point at depth z onto the k th LR image, gk(p'(z)) is the expected color label when z along the back-projected ray from p is projected to the k th LR image, αk(p, p'(z)) is a weight associated with the k th LR image, and Vp,q = min(θ, |f(p) − f(q)|) is a smoothness prior. We compute the weights αk(p, p'(z)) in the following way.
Fig. 1. (a) Limiting the zone of influence of a LR pixel in space. Only those HR pixels whose centers lie inside the circle are influenced by the LR pixel. (b) Zone of influence of LR pixel.
We define a zone of influence for every low resolution pixel such that the LR pixel contributes to the reconstruction of only those high resolution pixels whose centers lie within the zone of influence (see Figures 1(a) and 1(b)). Specifically, we define the following functions to determine whether a LR pixel p'(z) = (x', y') contributes to the reconstruction of a high resolution pixel p = (x, y). Let p''(z) = (x'', y'') be the projection of p, in real coordinates, which is projected from z along the back projected ray from the pixel p, into the k th LR image. We define
\alpha_k(p, p'(z)) = \begin{cases} 1 & \text{if } d((x', y'), (x'', y'')) < \theta_1 \\ 0 & \text{otherwise} \end{cases} \qquad (2)
αk(p, p'(z)) = 1 signifies that the high resolution pixel p is within the zone of influence of the LR pixel p'(z) in the k th LR image. Here d() is the Euclidean distance function and θ1 is a threshold. We set the threshold to the σ value of the low resolution blur function. In most of our experiments the value of the spatial blur σ evaluates to 0.45 of an LR pixel. Minimization of the energy function given in Equation 1 over all possible depths and colors is computationally intractable. In what follows we explain a pre-processing step for pruning the potential color labels gk(p'(z)) for a site p by considering only those depths which are photo-consistent at p [6]. We first outline a strategy for the simpler problem of super resolution of one of the input reference images and then outline a procedure for novel view super resolution.
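A tiny sketch of the zone-of-influence weight of Eq. (2) is given below; the default threshold follows the 0.45-pixel blur sigma mentioned above, and the function name is hypothetical.

```python
# Zone-of-influence weight: an LR pixel contributes to an HR pixel only if the
# real-valued projection of the HR pixel falls within theta_1 of the LR pixel centre.
import math

def alpha_k(lr_pixel, projected, theta1=0.45):
    """lr_pixel: integer LR coordinates (x', y'); projected: real-valued
    projection (x'', y'') of the HR pixel into the same LR image."""
    d = math.hypot(lr_pixel[0] - projected[0], lr_pixel[1] - projected[1])
    return 1.0 if d < theta1 else 0.0
```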
2.2 Pruning of Color Labels Using Photo-Consistency
Consider, first, the problem of super resolution of one of the source images. We finely sample the back-projected ray from a site in the target super resolved view between a range Zmin to Zmax. For each z in this range, we project the 3D point X(z) to each of the LR input images and compute

c(p, z) = \{ g_k(p'(z)) \mid 1 \le k \le n \} \qquad (3)
as the set of possible colors corresponding to depth z. For each z we do a nearest neighbor clustering of c(p, z) in YCbCr space, with a suitable distance threshold on color differences, and remove all but the top few dominant clusters. If there are no dominant clusters then we remove z from the list of candidate depths at p. For all candidate z values at site p we write

C(p, z) = \{ I(p, z, i) \}_{i = 1, \dots, m_{pz}} \qquad (4)
Fig. 2. (a) Clusters at a depth z and patches in each of the input images projected from z along the back projected ray from the pixel p in the super resolved novel view. (b) Cameras C1 to C3 see the color of the surface B1 and the rest of the cameras see the color of B2.
where m_{pz} is the number of dominant clusters, and I(p, z, i) is the set of the image indices that belong to the i th dominant cluster. For each depth z, we define a correlation function

fcorr(z) = \frac{\sum_{i=1}^{m_{pz}} corr(patch(I(p, z, i)))}{\sum_{i=1}^{m_{pz}} |I(p, z, i)|^2} \qquad (5)
where patch(I(p, z, i)) is the set of 3 × 3 patches around the LR pixel gk(p') and corr(patch(I(p, z, i))) is the sum of pair-wise correlations between patches of I(p, z, i). We form a set of candidate depths dp at site p by choosing the local minima of fcorr(z). The intra-cluster correlation does not guarantee that the candidate depths are correct, but one can only infer that these depths satisfy the photo-consistency constraint [6,9]. In this process we get the possible color clusters for each pixel and there can be two possibilities:
1. Only one cluster: In this case the pixel is photo-consistent in all the views at this particular depth, and hence the dominant color cluster can be chosen unambiguously solely based on photo-consistency.
2. More than one dominant cluster: This suggests that either the 3D point which projects to the pixel is occluded in some of the source images (see Figure 2(b)); or that the high resolution pixel is on an edge, so that it projects onto more than one color in the LR images.
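An illustrative sketch of the per-depth pruning step is shown below: a greedy nearest-neighbour clustering of the candidate colours in YCbCr, followed by the intra-cluster patch-correlation score (the denominator follows Eq. (5) as reconstructed above). Thresholds and names are placeholders.

```python
# Cluster candidate colours at one depth and score the clusters with fcorr(z).
import numpy as np
from itertools import combinations

def cluster_colors(colors, thresh=20.0):
    """colors: (n,3) YCbCr samples -> list of index lists (one per cluster)."""
    clusters = []
    for k, c in enumerate(colors):
        for cl in clusters:
            if np.linalg.norm(c - np.mean(colors[cl], axis=0)) < thresh:
                cl.append(k)
                break
        else:
            clusters.append([k])
    return clusters

def fcorr(clusters, patches):
    """patches: list of 3x3 patches, one per input image, at this depth."""
    def ncc(a, b):
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float(np.mean(a * b))
    num = sum(ncc(patches[i], patches[j])
              for cl in clusters for i, j in combinations(cl, 2))
    den = sum(len(cl) ** 2 for cl in clusters)
    return num / den
```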
In view of the above, we write the energy to be minimized as

\min E(f|g) = \sum_{p \in S} \sum_{z \in d_p} \sum_{i=1}^{m_{pz}} \beta(p, z, i) \sum_{k \in I(p,z,i)} \alpha_k(p, p') \big\| h(p) * f(p, z) - g_k(p'(z)) \big\|^2 + \lambda_s \sum_{p,q \in N} V_{p,q}(f(p), f(q)) \qquad (6)
where αk(p, p') is given by Equation 2 and β(p, z, i) is the weight given to the cluster i at candidate depth z for pixel p. We set the cluster weight β(p, z, i) as follows. In case there is only one dominant cluster, we set β(p, z, i) = 1. If there is more than one dominant cluster we determine if p is an edge point. This check is easy for super resolution of one of the source views because then the LR reference pixel corresponding to p is available. If p is an edge point we set the weights β(p, z, i) for all the clusters corresponding to neighboring colors of p as

\beta(p, z, i) = \frac{\exp(b\, N_i)}{\sum_{k=1}^{m_{pz}} \exp(b\, N_k)} \qquad (7)
where N_i = \frac{\text{number of images in the } i\text{th cluster}}{\text{total number of images}} and b is a tuning parameter. If p is not an edge point we look up the color of p in the LR image and set β(p, z, i) = 1 for the corresponding cluster and all others to zero.
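A minimal sketch of the edge-pixel cluster weights of Eq. (7) follows: a softmax over the fraction of input images falling in each cluster, with b an illustrative tuning value.

```python
# Softmax cluster weights beta(p, z, i) from cluster occupancy fractions N_i.
import numpy as np

def cluster_weights(cluster_sizes, total_images, b=5.0):
    n = np.asarray(cluster_sizes, dtype=float) / float(total_images)   # N_i
    e = np.exp(b * n)
    return e / e.sum()

# Example: three clusters containing 6, 3 and 1 of 10 input images.
print(cluster_weights([6, 3, 1], 10))
```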
Note that the above energy function given in Equation 6 is not in a form that is amenable to minimization by graph cut. Moreover, in case of super resolution reconstruction of a novel view, the problem of choosing the photo-consistent set of colors becomes considerably harder because then one needs an effective strategy for resolving occlusion. In the next section we explain how we can handle occlusion using the photo-consistency constraint [6,9].
3 Occlusion Handling Using Photo-Consistency
Clearly, if the above procedure returns multiple dominant clusters for a high resolution pixel p in the target novel view due to occlusion, there is no way to resolve the correct cluster from an analysis based on a single pixel. Seitz and Dyer [8,9] and Kutulakos and Seitz [6] develop a theory for occlusion analysis using photo-consistency, which gives the ordinal visibility constraint [8,9]. If X(z1 ) and X(z2 ) are any two points on two back projected rays then X(z1 ) can occlude X(z2 ) only if z1 < z2 . This allows us to resolve occlusion in a way similar to the voxel coloring algorithm of [8,9]. We scan the back projected rays from each pixel in a target novel view image in the order of depth (z) values. For a given z value along a ray through p, we project onto the source images and mark source image pixels if they satisfy photo-consistency. For the subsequent z values we consider only the unmarked pixels in the source images. We outline the algorithm to resolve occlusion using photo-consistency in Algorithm 1.
Algorithm 1. Algorithm to resolve occlusion

For each z from Zmin to Zmax:
  for each pixel p in the novel view image that has z as a candidate depth (z ∈ dp from the previous analysis in Section 2.2):
    set C = Φ.
    for all input LR images gk (indexed by k):
      a. project X(z) on the ray back projected through p onto gk. Let p'(z) be the projected pixel in gk (after round off).
      b. push the color of p'(z) into C if p'(z) is not marked.
    end.
    if C is photo-consistent then
      a. retain the clusters with color similar to the photo-consistent set and prune all others from C(z, i). Set the weights β(z, p, i) for the remaining clusters in C(z, i) according to their membership as given in Equation 7.
      b. retain this z value in dp and remove all others.
      c. mark all pixels p'(z) in gk which are unmarked.
    endif.
  end.
end.
Fig. 3. (a) Row 1: one of the input images (out of 40) and the photo-consistent depth map; Row 2: occlusion-map (the pixels for which more than one cluster is formed are marked white) and the generated novel view. (b) Super resolved novel view using 20 images. Note that though the depth map generated is not accurate, it provides the correct photo-consistent color in the novel view.
The space carving theory of [6] guarantees that the algorithm outlined above will converge to a photo-consistent shape. Consequently, the procedure guarantees a unique z value for all p, which though may not be “correct”, nevertheless provides a photo-consistent color assignment, which is all that we require for super resolved novel view generation. We show one of the input images, the depth map using the unique z values, the image where the pixels for which multiple clusters are formed are marked white (occlusion-map, with slight abuse of notation) and the generated novel view in Figure 3(a), and the super resolved novel view in Figure 3(b). The final energy is then given as
E(f|g) = \sum_{p \in S} \sum_{i=1}^{m_{pz'}} \beta(p, z', i) \sum_{k \in I(p, z', i)} \alpha_k(p, p'(z')) \big\| h(p) * f(p, z') - g_k(p'(z')) \big\|^2 + \lambda_s \sum_{p,q \in N} V_{p,q}(f(p), f(q)) \qquad (8)
where z' is the unique depth value we get after resolving the occlusion. The energy in Equation 8 is clearly amenable to minimization by graph cut.
4 Energy Minimization Using Graph-Cut
Graph cut [11] can minimize only graph-representable energy functions. An energy function is graph-representable if and only if each term Vp,q satisfies the regularity constraint [12]. The energy function of Equation 8 is not in graph-representable form. The data term of site p also depends on the neighbors of p because of the blurring operator. We approximate the data term of Equation 8 as follows. Consider the blur kernel h(p) with value wpp at the center and wpq at the neighbor q of p. Writing f(p, z') as fp, αk(p, p'(z')) as αk, β(p, z', i) as β, I(p, z', i) as Ii, gk(p'(z')) as gk(p'), and denoting as Nsp the set of spatial neighbors of p, the data term is given by
\sum_{p \in S} \sum_{i=1}^{m_{pz}} \beta \sum_{k \in I_i} \alpha_k \big\| h(p) * f_p - g_k(p') \big\|^2
= \sum_{p \in S} \sum_{i=1}^{m_{pz}} \beta \sum_{k \in I_i} \alpha_k \Big( w_{pp} f_p - g_k(p') + \sum_{q \in N_{sp}} w_{pq} f_q \Big)^2

= \sum_{p \in S} \sum_{i=1}^{m_{pz}} \beta \sum_{k \in I_i} \alpha_k \Big[ \big( w_{pp} f_p - g_k(p') \big)^2 + \Big( \sum_{q \in N_{sp}} w_{pq} f_q \Big)^2 + 2 \big( w_{pp} f_p - g_k(p') \big) \sum_{q \in N_{sp}} w_{pq} f_q \Big]. \qquad (9)
We write the overall energy as

E(f|g) = \sum_{p \in S} \Big[ D^*_p(f_p) + \sum_{q \in N_{sp}} \phi_{pq}(f_p, f_q) + \lambda_s \sum_{q \in N_{sp}} V_{p,q}(f_p, f_q) \Big] \qquad (10)
where, collecting all terms that depend on a single pixel into the data term, we have

D^*_p(f_p) = \sum_{i=1}^{m_{pz}} \beta \sum_{k \in I_i} \alpha_k \Big[ \big( w_{pp} f_p - g_k(p') \big)^2 + \Big( \sum_{q \in N_{sp}} w_{pq} f_q \Big)^2 \Big] \qquad (11)
mpz
φpq (fp , fq ) =
i=1
β
αk
2(wpp fp − gk (p )) (wpq fq )
(12)
k∈Ii
The above expression is still not graph representable because of the term fp fq in φpq. We further approximate wpp fp − gk(p') = Δkp in φpq and hold Δkp constant during a particular α-expansion move during the minimization using graph cut. It is easy to verify that φpq satisfies the regularity condition in [12] with this approximation. Unfortunately, it cannot be guaranteed that every step of energy minimization of the approximate energy using α-expansion also minimizes the original energy. However, in our experiments we find that this is almost always the case. [13] gives a different approximation for spatial deconvolution using graph cut, where they can give such a guarantee.
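The following small numpy sketch illustrates the approximation, not the authors' implementation: during one α-expansion move, Δkp = wpp fp − gk(p') is evaluated from the current labelling and frozen, so φpq becomes linear in fq. All input values are illustrative placeholders for a single pixel p, one cluster, and one neighbour q.

```python
# Data and pairwise terms of Eqs. (11)-(12) with Delta_kp frozen during a move.
def data_term(f_p, g_k, alpha_k, beta, w_pp, w_pq_fq_sum):
    """D*_p(f_p) for one cluster/image (the outer sums are omitted)."""
    return beta * alpha_k * ((w_pp * f_p - g_k) ** 2 + w_pq_fq_sum ** 2)

def pairwise_term(delta_kp, w_pq, f_q, alpha_k, beta):
    """phi_pq(f_p, f_q) with Delta_kp held constant at the current labelling."""
    return beta * alpha_k * 2.0 * delta_kp * (w_pq * f_q)

# Delta_kp is computed once per expansion move from the current labels:
f_p_current, g_k, w_pp = 120.0, 118.0, 0.6
delta_kp = w_pp * f_p_current - g_k
print(pairwise_term(delta_kp, w_pq=0.1, f_q=130.0, alpha_k=1.0, beta=1.0))
```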
5 Results
In this section we present some results of novel view super resolution. In all our experiments we obtained the reference images from image sequences captured using a hand-held video camera (Sony), and the sequences were calibrated using an automatic tracking and calibration software, Boujou [14]. We provide the input and the output images in the supplemental material through the index.html link.
5.1 Novel View Generation
In our first example we select 11 frames from the image sequence and use 10 of them as reference images to generate the missing 11th image, so that the reconstructed image can be compared with the ground truth. In Figure 4(a) we
Fig. 4. (a) Row 1 : two input images out of 10, Row 2: generated novel view and difference with ground truth. (b) Super resolved novel view with magnification of 2× in each direction.
show two of the reference images, the generated novel view and the difference image with ground truth. In Figure 4(b) we show the super resolved output corresponding to the novel view. In Figure 5 we show a close up of interpolated 11th image, the corresponding super resolved novel view and the difference image. The difference image clearly shows the missing high frequency components in the interpolated image.
Fig. 5. Close-ups of the interpolated image, the super resolved image, and the difference image of Figure 4
5.2 Occlusion Handling
In this example we show that we can effectively handle occlusion. Four out of 10 input images are shown in Figure 6(a). Note that different colors are visible in between the petals of the flower because of occlusion. The occlusion has been properly handled in the reconstructed super resolved view of one of the source images, as shown in Figure 6(b). In Figure 6(c) and Figure 6(d) we show the super resolved novel view using 10 and 20 images. In the case of super resolution of one of the source views the cluster is selected correctly, and in the case of super resolution of a novel view, the cluster selection improves when we increase the number of images, as seen in the super resolved view using 20 images compared to using 10 images.
Fig. 6. Resolving occlusion: (a) four of the 10 input images, where different colors are visible in between the petals of the flower, (b) super resolved image of one of the source views, (c) super resolved novel view using 10 images and (d) super resolved novel view using 20 images
Fig. 7. Novel views and the corresponding difference images with ground truth (a) using 10 and 40 images, (b) using FOVs of 11.5° and 7.5°
5.3 Effect of Number of Views and Field of View
In this example we show the novel view generation using different numbers of input images keeping the same field of view (FOV), and different FOVs keeping the same number of input images. In the first experiment we consider an image sequence of 61 images and generate the novel view using 10, 20, 30, 40 and 50 images. The corresponding rms errors between the ground truth and the novel view are 20.2027, 17.7744, 16.8216, 16.3895 and 16.3895 respectively. Two of the generated novel views, using 10 and 40 images, and the corresponding difference images with ground truth are shown in Figure 7(a). In the second experiment we used 10 images from three input sets with FOVs of 11.5°, 9.0° and 7.5°. The rms errors corresponding to the three sets are 30.3203, 20.2027 and 15.6003 respectively. The generated novel views using FOVs of 11.5° and 7.5°, and the corresponding difference images with the ground truth, are shown in Figure 7(b).
6 Conclusions
In this paper we have addressed the problem of super resolved generation of novel views of a 3D scene with the reference images obtained from cameras in general
positions. We have posed novel view super resolution problem in the MRF-MAP framework and proposed a solution using graph cut. We have formulated the problem as one of estimation of the color at each pixel in the high resolution novel view without explicit and accurate depth recovery. We have also presented an effective method to resolve occlusion. We have presented compelling results on real images.
References
1. Irani, M., Peleg, S.: Improving resolution by image registration. CVGIP: Graphical Models and Image Processing 53, 231–239 (1991)
2. Chaudhuri, S., Joshi, M.V.: Motion-Free Super-Resolution. Springer, Heidelberg (2004)
3. Borman, S., Stevenson, R.L.: Linear Models for Multi-Frame Super-Resolution Restoration under Non-Affine Registration and Spatially Varying PSF. In: Computational Imaging II. Proceedings of the SPIE, vol. 5299, pp. 234–245 (2004)
4. Irani, M., Hassner, T., Anandan, P.: What Does the Scene Look Like from a Scene Point? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 883–897. Springer, Heidelberg (2002)
5. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-Based Rendering Using Image-Based Priors. International Journal of Computer Vision 63, 141–151 (2005)
6. Kutulakos, K.N., Seitz, S.M.: A Theory of Shape by Space Carving. International Journal of Computer Vision 38, 199–218 (2000)
7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
8. Seitz, S.M., Dyer, C.R.: Photorealistic Scene Reconstruction by Voxel Colouring. In: Proceedings of the IEEE CVPR, pp. 1067–1073. IEEE Computer Society Press, Los Alamitos (1997)
9. Seitz, S.M., Dyer, C.R.: Photorealistic Scene Reconstruction by Voxel Colouring. International Journal of Computer Vision 35, 151–173 (1999)
10. Geman, S., Geman, D.: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741 (1984)
11. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on PAMI 23, 1222–1239 (2001)
12. Kolmogorov, V., Zabih, R.: What Energy Functions Can Be Minimized via Graph Cuts? IEEE Transactions on PAMI 26, 147–159 (2004)
13. Raj, A., Singh, G., Zabih, R.: MRFs for MRIs: Bayesian Reconstruction of MR Images via Graph Cuts. In: Proceedings of the IEEE CVPR, pp. 1061–1068. IEEE Computer Society Press, Los Alamitos (2006)
14. 2d3 Ltd (2002), http://www.2d3.com
Learning-Based Super-Resolution System Using Single Facial Image and Multi-resolution Wavelet Synthesis Shu-Fan Lui, Jin-Yi Wu, Hsi-Shu Mao, and Jenn-Jier James Lien Robotics Laboratory, Dept. of Computer Science and Information Engineering National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan {kilo, Curtis, marson, jjlien}@csie.ncku.edu.tw http://robotics.csie.ncku.edu.tw
Abstract. A learning-based super-resolution system consisting of training and synthesis processes is presented. In the proposed system, a multi-resolution wavelet approach is applied to carry out the robust synthesis of both the global geometric structure and the local high-frequency detailed features of a facial image. In the training process, the input image is transformed into a series of images of increasingly lower resolution using the Haar discrete wavelet transform (DWT). The images at each resolution level are divided into patches, which are then projected onto an eigenspace to derive the corresponding projection weight vectors. In the synthesis process, a low-resolution input image is divided into patches, which are then projected onto the same eigenspace as that used in the training process. Modeling the resulting projection weight vectors as a Markov network, the maximum a posteriori (MAP) estimation approach is then applied to identify the best-matching patches with which to reconstruct the image at a higher level of resolution. The experimental results demonstrate that the proposed reconstruction system yields better results than the bi-cubic spline interpolation method. Keywords: Super-resolution, learning-based, reconstruction, Markov network, multi-resolution wavelets, maximum a posteriori.
1 Introduction
Super-resolution (SR) is an established technique for expanding the resolution of an image, since directly up-sampling a low-resolution image to produce a high-resolution equivalent invariably results in blurring and a loss of the finer (i.e. higher-frequency) details in the image. SR has attracted increasing attention in recent years for a variety of applications, ranging from remote sensing to military surveillance and face zooming in video sequences. Some studies [8], [9] and [15] use interpolation methods, such as bilinear and bi-cubic interpolation, to obtain the high-resolution image. However, it is difficult to interpolate the details well, such as textured and corner-like regions. Considering that a single low-resolution image may not provide enough information to recover the high-frequency details, multi-frame image reconstruction algorithms [5], [13] and [14] were developed to facilitate the reconstruction of single high-resolution images from a sequence of low-resolution
images. Some of these methods enlarge images successfully, but in this paper we focus on reconstructing the high-resolution image from a single low-resolution image, since this is often more useful in daily life. Recently, the application of SR to the resolution enlargement of single images has attracted particular attention. For example, Freeman et al. [6], [7] learned the relationship between corresponding high- and low-resolution pairs from the training data and modeled the images as a Markov network. Although the algorithms presented in these studies were applicable to the resolution scaling of generic images, they were unsuited to the manipulation of facial images characterized by pronounced geometric structures, as described in [1]. Baker and Kanade [1], [2] developed a learning-based facial hallucination method in which a process of Bayesian inference was applied to derive the high-frequency components of a facial image from a parent structure. Meanwhile, Liu et al. [10] presented a two-step procedure for hallucinating faces based on the integration of global and local models. The structure of the face is generated in the global model and the local texture is obtained in the local model. However, the methods proposed in [1], [2] and [10] were sometimes difficult to apply in practice since they rely on complicated probabilistic models.
In general, the objective of SR reconstruction is to preserve the high-frequency details of an image while simultaneously expanding its scale to the desired resolution. Various researchers have demonstrated the successful interpolation and reconstruction of high-resolution images using wavelet algorithms [3], [4], [11]. Such algorithms have a number of fundamental advantages when applied to image reconstruction, including the ability not only to keep the geometric structure of the original image, but also to preserve a greater proportion of the high-frequency components than other methods, such as interpolation. Accordingly, the current study develops a learning-based SR method based upon a multi-resolution wavelet synthesis approach.
The SR system consists of two basic processes, namely a training process and a synthesis (reconstruction) process. In the training process, the input image is decomposed into a series of increasingly lower-resolution images using a discrete wavelet transformation (DWT) technique. The image at each resolution level is divided into patches, which are then projected onto an eigenspace to compute the corresponding projection weight vectors. The projection weight vectors generated from all of the patches within the image at each level are grouped into a projection weight database. In the subsequent synthesis procedure, the input image is first interpolated to one of double size (Fig. 2, I). By applying the three corresponding Haar filters to it, the three initial sub-bands are created. The three sub-bands, together with the input image, are then divided into patches, which are then mapped onto the eigenspace associated with the lowest-resolution input image in the training process. The resulting projection weight vector for each patch is then compared with those within the corresponding projection weight database. The best-fitting vector for each patch is identified by modeling the spatial relationships between patches as a Markov network and using the maximum a posteriori (MAP) criterion. The corresponding patch is then mapped back to the input image.
Having replaced all of the patches in the three subbands with the best-fitting patches from the training database, the resulting image is synthesized using an inverse wavelet transformation process and used as the input image for an analogous synthesis procedure performed at the next (higher) resolution level. This up-sampling procedure is repeated iteratively until the required image resolution has been obtained.
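A compact way to see the whole synthesis stage is as an iterative doubling loop. The following Python sketch is not code from the paper; it only illustrates the control flow, and the callables passed in (upsample, analyze, replace_details, synthesize) stand for the bicubic interpolation, Haar analysis, MAP-based patch replacement and inverse DWT steps described above.

```python
def super_resolve(low_res_img, n_levels, upsample, analyze, replace_details, synthesize):
    """Iteratively double the resolution n_levels times.

    upsample(img)                 -> 2x interpolated image
    analyze(img)                  -> (LL, LH, HL, HH) sub-bands
    replace_details(subbands, l)  -> sub-bands with LH/HL/HH patches replaced by
                                     best-fitting training patches at level l
    synthesize(subbands)          -> inverse Haar DWT of the four sub-bands
    """
    img = low_res_img
    for level in range(n_levels, 0, -1):      # e.g. level 2 (coarsest) down to level 1
        subbands = analyze(upsample(img))     # interpolate, then split into sub-bands
        subbands = replace_details(subbands, level)
        img = synthesize(subbands)            # next higher-resolution estimate
    return img
```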
2 Training Process
Figure 1 presents a schematic illustration of the training process within the current SR system. As shown, the input image is decomposed into two lower-resolution images, each divided into four sub-bands. At each resolution level, patches from each sub-band are concatenated to form a patch vector, which is then projected onto the corresponding eigenspace to obtain the corresponding projection weight vector. The vectors associated with each image resolution level are grouped to construct a projection weight database.
Fig. 1. Flowchart of the SR system training process. The original input training image (Level 0) is successively transformed using the Haar discrete wavelet transformation (DWT) to generate lower-resolution images (Levels 1 and 2, respectively) comprising four sub-bands. At each level, 3×3-pixel patches from each sub-band are concatenated to form a 3×3×4-dimensional vector, with one patch-based eigenspace per level. These vectors are projected onto the eigenspace to generate a corresponding projection weight vector. The projection weight vectors produced at each level are grouped to form a projection weight database.
2.1 Feature Extraction Using Multi-resolution Wavelet Analysis
Discrete wavelet transformation (DWT) is a well-established technique used in a variety of signal processing and image compression/expansion applications.
In the current SR system, DWT is applied to decompose the input image into four sub-bands, namely LL, LH, HL and HH, where L denotes a low-pass filter and H a high-pass filter. Each of these sub-bands can be thought of as a smaller-scale version of the original image, representing different properties of the image, such as its vertical and horizontal information. The LL sub-band is a coarse approximation of the original image, containing its overall geometrical structure, while the other three sub-bands contain the high-frequency components of the image, i.e. the finer image details. In the synthesis stage of the proposed SR system, the LL sub-band is used as the basis for estimating the best-fitting patches in the remaining three sub-bands. Having identified these patches, the four sub-bands are synthesized using an inverse DWT process to construct an input image for the subsequent higher-level image reconstruction process. Note that in the current system, the Haar DWT function is employed to carry out the wavelet transformation in the training and synthesis procedures since, for this function, the direct transform is equal to its inverse.
The multi-resolution wavelet approach described above can be repeated as many times as required to achieve the desired scale of enlargement. Assuming that the input image has a 48×64-pixel resolution, the output image should therefore have a resolution of 192×256 pixels. Accordingly, the SR procedure involves the use of two consecutive reconstruction processes, i.e. an initial process to enlarge the input image from 48×64 pixels to 96×128 pixels, and then a second process to produce the final enlarged image with the desired resolution of 192×256 pixels.
2.2 Patch-Based Eigenspace Construction
As described in the previous section, four sub-band images are required for the wavelet reconstruction process. However, initially only a single low-resolution image is available. Therefore, the problem arises as to how the coefficients of the other three sub-bands may be estimated. In the current study, this problem is resolved by using a learning-based approach [1], [2] to estimate the missing high-frequency image details on the basis of the image information contained within the training datasets. As shown in Fig. 1, the current SR system incorporates two training datasets, namely projection weight databases 1 and 2, respectively. Database 1 is constructed by applying DWT to each of the original (i.e. Level 0) w×h-pixel training images, resulting in the creation of four sub-band images, i.e. LL, LH, HL and HH. These sub-band images are then further divided into an array of 3×3-pixel patches. Since the image has a total of four sub-bands, a 36-dimensional patch-based vector can be constructed by concatenating the 9 pixels in the corresponding patches of each sub-band. Assuming that a single sub-band image can be divided into M patches and that the training dataset contains a total of N original images, an M×N by 36 patch-based matrix can be constructed. If 100 images of 192×256-pixel resolution are used to train the database, there will be over 250,000 patch-based vectors, so PCA is applied to reduce the dimensionality and thereby speed up the computation. Having constructed projection weight database 1 for the Level-1 resolution images, a similar procedure is performed to compute the projection weight vectors for projection weight database 2, corresponding to the lowest-resolution (Level-2) images.
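As a concrete illustration of the training stage, the sketch below implements one level of an averaging Haar decomposition, builds the 36-D patch vectors from non-overlapping 3×3 patches of the four sub-bands, and derives a PCA eigenspace and projection-weight database with numpy. This is not the authors' code: the number of retained eigenvectors and the unnormalized (averaging) Haar variant are assumptions made only for illustration.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar-style DWT; returns the LL, LH, HL, HH sub-bands.
    img is a 2-D float array with even height and width."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical average
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def patch_vectors(subbands, size=3):
    """Concatenate corresponding size x size patches of the four sub-bands into
    (size*size*4)-dimensional vectors (36-D for 3x3 patches, non-overlapping)."""
    h, w = subbands[0].shape
    vecs = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            vecs.append(np.concatenate(
                [sb[y:y + size, x:x + size].ravel() for sb in subbands]))
    return np.asarray(vecs)

def build_weight_database(training_images, n_components=20):
    """PCA eigenspace and projection-weight database for one resolution level.
    n_components (the retained dimensionality) is an illustrative choice."""
    data = np.vstack([patch_vectors(haar_dwt2(img)) for img in training_images])
    mean = data.mean(axis=0)
    # eigenvectors of the patch covariance obtained via SVD
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    eigenspace = vt[:n_components]                 # (k, 36)
    weights = (data - mean) @ eigenspace.T         # projection weight vectors (M*N, k)
    return mean, eigenspace, weights
```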
3 Synthesis Process
3.1 Wavelet Synthesis and Patch-Based Weight Vector Creation
As described in Section 2.1, each input image is decomposed into four sub-bands. Sub-band LL contains the global image structure, but lacks the finer details of the image. Therefore, the LL sub-band is used as the input image in the reconstruction process. The patches within the other three sub-bands are then estimated and used to reconstruct the higher-resolution image, as described below.
Fig. 2. Flowchart of the SR system synthesis process. The input image, with a quarter of the resolution of the desired output, is interpolated and a Haar filter applied to produce four sub-bands. These sub-bands are then divided into patches and projected onto patch-based eigenspace 2 to generate the corresponding projection weights. Using the maximum a posteriori criterion, the projection weights are then compared with those within projection weight database 2. The best-fitting matches are used to improve the patches in the original input image, which is then synthesized using an inverse wavelet transformation process to construct a new image with twice the resolution of the original input image. Using this new image as an input, the reconstruction process is repeated using patch-based eigenspace 1 and projection weight database 1 to synthesize the required highest-resolution image.
Fig. 3. Each search vector is constructed by concatenating four 3×3-pixel patches from the four sub-band images. These search vectors are then compared with the weight projection vectors within the training database(s) to identify the best-fitting match. The training image patches corresponding to the identified weight projection vector are then mapped back to the input image. Note that the search vector and the training databases are modeled as a Markov network.
As shown in Fig. 2, the input image at the lowest level of resolution, i.e. w/4×h/4 pixels, is interpolated and partitioned into four sub-bands using a Haar filter. Adopting the same approach as that applied in the training process, the four sub-bands are divided into patches, and the pixels within corresponding patches in the four different sub-bands are then concatenated to form a search vector (see Fig. 3). All of the search vectors relating to the image are then projected onto eigenspace 2 to create a projection weight set W2i, where i = 1, …, N, in which N is the total number of patches within one sub-band. The search vectors are then compared with the projection weight vectors contained within database 2 to identify the best-matching vectors; the search patches are processed in raster-scan order. Having found the best-matching vector in the database, the corresponding patches are used to replace the patches in the LH, HL and HH sub-bands of the original input image. Once all of the original patches have been replaced, an inverse wavelet transformation process is performed to synthesize a higher-resolution image with a w/2×h/2-pixel resolution. This image is then used as the input to a second reconstruction process, performed using eigenspace 1 and projection weight database 1, to reconstruct the desired w×h-pixel resolution image (see Fig. 2).
3.2 Markov Network: Identifying Best-Matching Vectors Using the Maximum a Posteriori Approach
In the current SR system, the patches within the four sub-bands are modeled as a Markov network. In this network, the observed nodes, y, represent the transformed patches, and the aim is to estimate the hidden nodes, x, from y. In the current network, the hidden nodes represent the fine details of the original images, i.e. the patches in sub-bands LH, HL and HH. Since the spatial relationship between x and y is modeled as a Markov network (see Fig. 3), the joint probability over the transformed patches y [7] can be expressed as follows:
p(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) = \prod_{(i,j)} \Psi(x_i, x_j) \prod_k \Phi(x_k, y_k)    (1)
where Ψ and Φ are the learned pair-wise compatibility functions, i and j are neighboring nodes, and N is the total number of patches in each sub-band. To estimate \hat{x}_j, the best-matching patch for node j, we adopt the MAP estimation described in [7] and take the maximum over the other variables in the posterior probability:
\hat{x}_j^{MAP} = \arg\max_{x_j} \max_{[\,\text{all } x_i,\ i \neq j\,]} p(y_1, y_2, \ldots, y_N, x_1, x_2, \ldots, x_N)    (2)
Using the definition of conditional probability and Bayes' theorem, the posterior probability can be written as
p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{1}{Z} \prod_{(i,j)} \Psi(x_i, x_j) \prod_k \Phi(y_k, x_k)    (3)
where Z is a normalization constant; the probability of any given transformed patch for each node is therefore proportional to the product of the pair-wise compatibility functions. In this Markov network, the relationship between an observation node and its hidden node is modeled by Φ, and the relationship between neighboring hidden nodes is modeled by Ψ: Φ is used to find the best-matching patch for each node from the training data, while Ψ is used to ensure that neighboring patches are consistent with each other. The compatibility function Φ(x, y) measures the similarity between two patch vectors and is defined as
\phi(x_j, y_j) = e^{-|y_j - x_j|^2 / 2\sigma_1^2}    (4)
where y_j and x_j represent patch vectors. Specifically, y_j is the observed node and consists of patches from the transformed images, and x_j is the hidden node, which is to be found in projection weight database 1. σ_1 is a standard deviation, taken in our system to be the eigenvalue corresponding to the eigenvector. The compatibility function between the hidden nodes is defined as
\psi(x_i, x_j) = e^{-|d_{ji} - d_{ij}|^2 / 2\sigma_2^2}    (5)
where d_{ij} denotes the pixels in the overlapping region between nodes i and j, and σ_2 is the standard deviation of the overlapping region. The pixels in the overlapping region should agree to guarantee that neighboring nodes are compatible. The best match for each search vector is obtained by solving these compatibility functions.
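The sketch below evaluates the two compatibility functions of Eqs. (4) and (5) and uses them in a simple greedy raster-scan search, scoring each training candidate against the observed patch and its already-chosen left neighbour. This is a simplified stand-in for the joint Markov-network MAP inference of [7]; the greedy strategy and the data layout (per-candidate overlap regions supplied by the caller) are assumptions made purely for illustration.

```python
import numpy as np

def phi(x, y, sigma1):
    """Eq. (4): similarity between an observed patch vector y and a candidate x."""
    return np.exp(-np.sum((np.asarray(y) - np.asarray(x)) ** 2) / (2.0 * sigma1 ** 2))

def psi(overlap_i, overlap_j, sigma2):
    """Eq. (5): agreement of the overlapping pixels of two neighbouring patches."""
    return np.exp(-np.sum((np.asarray(overlap_i) - np.asarray(overlap_j)) ** 2)
                  / (2.0 * sigma2 ** 2))

def best_match(y, candidates, candidate_overlaps, left_overlap, sigma1, sigma2):
    """Greedy approximation of the MAP search for one node: score every training
    candidate by phi(candidate, y), weighted by psi of its overlap with the patch
    already chosen for the left neighbour, and return the index of the best one."""
    best_idx, best_score = None, -1.0
    for k, x in enumerate(candidates):
        score = phi(x, y, sigma1)
        if left_overlap is not None:
            score *= psi(candidate_overlaps[k], left_overlap, sigma2)
        if score > best_score:
            best_idx, best_score = k, score
    return best_idx
```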
4 Experimental Results
To evaluate the performance of the proposed SR system, a training database was compiled comprising 100 images. Although the figures presented in this study feature facial images, the current SR system is not limited to the processing of a specific class
of images. Hence, a general training database was constructed containing images from the CMU PIE database, a self-compiled database of facial images, and the Internet. Note that the images were not pre-processed in any way.
Fig. 4. Comparison between the enlargement results of the current SR method and those obtained using a bi-cubic spline interpolation method: (a.1) low-resolution input with 48×64 resolution; (a.2) original high-resolution image with 192×256 resolution; (a.3) bi-cubic spline interpolation result; (a.4) enlargement result obtained using the current method; and (a.5) and (a.6) difference images of (a.2) minus (a.3) and (a.2) minus (a.4), respectively. Note that the average difference is calculated as the sum of the gray values in the difference image divided by the total number of pixels in the image. For these examples the average differences are 9.7 gray values/pixel for bi-cubic interpolation versus 3.3 for the current method in (a), and 9.5 versus 4.0 in (b). The results show that the proposed SR method achieves a better enlargement performance than bi-cubic interpolation. Figure 4(b) presents a corresponding set of images for a different input image.
Increasing the patch size to 5×5 pixels was found to generate no significant improvement in the enlargement results. Typical experimental results obtained when processing facial images are shown in Figs. 4 and 5.
Fig. 5. Further results: (a) input (48×64); (b) bi-cubic spline interpolation (192×256); (c) current method (192×256); (d) high resolution (192×256).
In general, super-resolution can be
performed using a variety of interpolation techniques, including the nearest-neighbor, bi-linear and bi-cubic spline schemes [10]. In the current study, the enlargement results obtained using the proposed SR system are compared with those obtained from the bi-cubic spline interpolation method. As shown in Figs. 4 and 5, the reconstruction results generated by the SR method are both quantitatively and qualitatively superior to those obtained using the interpolation technique.
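The "average difference" quality measure quoted in Figs. 4 and 5 can be computed as below, assuming the difference image holds absolute gray-value differences (the paper does not state the sign convention explicitly).

```python
import numpy as np

def average_difference(ground_truth, reconstruction):
    """Sum of absolute gray-value differences divided by the number of pixels,
    i.e. the 'average difference' reported in Figs. 4 and 5."""
    gt = np.asarray(ground_truth, dtype=np.float64)
    rc = np.asarray(reconstruction, dtype=np.float64)
    return np.abs(gt - rc).sum() / gt.size
```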
5 Conclusions
This study has presented a learning-based super-resolution system for the reconstruction of high-resolution images from a single low-resolution image. In the proposed method, a multi-resolution wavelet approach is used to synthesize both the global geometrical structure of the input image and the local high-frequency details. The SR system incorporates a training process and a synthesis process. In the training process, the high-resolution input image is transformed into a series of lower-resolution images using the Haar discrete wavelet transform (DWT). The images at each resolution level are divided into four sub-bands (LL, LH, HL and HH), and the corresponding patches from each sub-band are combined to form a search vector. The search vector is then projected onto an eigenspace to derive the corresponding projection weight vector. All of the projection weight vectors associated with the image at its current level of resolution are then grouped into a projection weight vector database. In the reconstruction process, the LL sub-band is utilized as the input image to ensure that the global, low-frequency components of the original image are retained during the enlargement process. A high-resolution image of the desired scale is obtained by performing the reconstruction process iteratively, taking the output image at one resolution level as the input to the synthesis process at the next (higher) resolution level. During each reconstruction process, the high-frequency components
of the image are replaced by the best-fitting patches identified when modeling the patches in the subbands as a Markov network and performing a maximum a posteriori search of the projection weight database such that the finer details of the image are preserved as the image resolution is progressively enlarged.
References
1. Baker, S., Kanade, T.: Hallucinating Faces. In: Proc. of Inter. Conf. on Automatic Face and Gesture Recognition, pp. 83–88 (2000)
2. Baker, S., Kanade, T.: Limits on Super-Resolution and How to Break Them. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002)
3. Chan, R., Chan, T., Shen, L., Shen, Z.: Wavelet Algorithms for High-Resolution Image Reconstruction. SIAM Journal on Scientific Computing 24(4), 1408–1432 (2003)
4. Chan, R., Riemenschneider, S., Shen, L., Shen, Z.: Tight Frame: An Efficient Way for High-Resolution Image Reconstruction. Applied and Computational Harmonic Analysis 17, 91–115 (2004)
5. Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Fast and Robust Multi-Frame Super-Resolution. IEEE Trans. Image Processing 13(10), 1327–1344 (2004)
6. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22(2), 56–65 (2002)
7. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low-Level Vision. IJCV 40(1), 25–47 (2000)
8. Hou, H.S., Andrews, H.C.: Least Squares Image Restoration Using Spline Basis Functions. IEEE Transactions on Computers C-26(9), 856–873 (1977)
9. Hou, H.S., Andrews, H.C.: Cubic Splines for Image Interpolation and Digital Filtering. IEEE Transactions on Acoustics, Speech, and Signal Processing 26(6), 508–517 (1978)
10. Liu, C., Shum, H., Zhang, C.: A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model. CVPR 1, 192–198 (2001)
11. Nguyen, N., Milanfar, P.: A Wavelet-Based Interpolation-Restoration Method for Superresolution. Circuits, Systems, Signal Processing 19, 321–338 (2000)
12. Pratt, W.K.: Digital Image Processing (1991)
13. Schultz, R.R., Stevenson, R.L.: Extraction of High-Resolution Frames from Video Sequences. IEEE Transactions on Image Processing 5(6), 996–1011 (1996)
14. Shekarforoush, H., Chellappa, R.: Data-Driven Multi-Channel Super-Resolution with Application to Video Sequences. JOSA A 16, 481–492 (1999)
15. Ur, H., Gross, D.: Improved Resolution from Subpixel Shifted Pictures. CVGIP: Graphical Models and Image Processing 54, 181–186 (1992)
Statistical Framework for Shot Segmentation and Classification in Sports Video
Ying Yang1,2, Shouxun Lin1, Yongdong Zhang1, and Sheng Tang1
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100085, China
{yyang, sxlin, zhyd, ts}@ict.ac.cn
Abstract. In this paper, a novel statistical framework is proposed for shot segmentation and classification. The proposed framework segments and classifies shots simultaneously, using the same difference features, based on statistical inference. The task of shot segmentation and classification is taken as finding the most probable shot sequence given the feature sequence, and it can be formulated by a conditional probability which can be divided into a shot sequence probability and a feature sequence probability. The shot sequence probability is derived from the relations between adjacent shots via a Bi-gram, and the feature sequence probability depends on the inherent character of a shot, modeled by an HMM. Thus, the proposed framework considers intra-shot characteristics when segmenting shots and inter-shot characteristics when classifying them, which yields more accurate results. Experimental results on soccer and badminton videos are promising and demonstrate the effectiveness of the proposed framework. Keywords: shot, segmentation, classification, statistical framework.
1 Introduction
In recent years, there has been increasing research interest in sports video analysis due to its tremendous commercial potential, in applications such as sports video indexing, retrieval and abstraction. Sports video can be decomposed into several types of video shots, which are sequences of frames taken contiguously by a single camera. In sports video analysis, shot segmentation and classification play an important role, since shots are often the basic processing unit and give potential semantic hints. Much work has been done on shot segmentation and classification, most of which takes shot segmentation and classification as a two-stage successive process and performs the two stages independently, using different features and without considering the relationship between them. Many shot segmentation algorithms have
This research was supported by National Basic Research Program of China (973 Program, 2007CB311100), Beijing Science and Technology Planning Program of China (D0106008040291), and National High Technology and Research Development Program of China (863 Program, 2007AA01Z416).
been developed by measuring the similarity of adjacent shots [1]. However, these algorithms perform poorly on sports video due to the frequent panning and zooming caused by fast camera motion. After shot segmentation, shot classification is performed on each segment for higher-level video content analysis. Some work classified sports video shots based on the domain rules of a certain sports game [2,3], such as classifying soccer video shots into long shots, medium shots and close-up shots using dominant color. Others tried a unified method to solve this problem using an SVM or HMM [4,5,6,7,8]. For these approaches, since classification is done independently after segmentation, incorrect segmentation has a negative effect on shot classification.
In this paper, a novel statistical framework is proposed for shot segmentation and classification. Compared with previous work, the proposed framework classifies and segments shots simultaneously, using the same difference features, based on statistical inference. The task of shot segmentation and classification is taken as finding the most probable shot sequence given the feature sequence, and it can be formulated by a conditional probability which can be divided into a shot sequence probability and a feature sequence probability. The shot sequence probability is derived from the relations between adjacent shots, modeled by a Bi-gram, and the feature sequence probability depends on the inherent characteristics of a shot, modeled by an HMM (Hidden Markov Model). Therefore, the proposed framework segments shots considering intra-shot characteristics and classifies shots considering inter-shot characteristics, which amounts to a global search over all possible shot sequences for the best shot sequence matching the feature sequence.
The rest of the paper is organized as follows. In Section 2, the main idea of the statistical framework is presented. Section 3 gives the details of shot segmentation and classification based on the proposed statistical framework. To evaluate the performance of the framework, two applications to soccer and badminton videos and the analysis of the results are described in Section 4. Finally, conclusions are drawn and future work is discussed in Section 5.
2 Main Idea of the Statistical Framework
Suppose that a video stream is composed of a sequence of shots, denoted by H = h_1 h_2 ... h_t. After feature extraction, the video stream can be viewed and manipulated as a sequence of feature vectors, denoted by O = o_1 o_2 ... o_T. Hence, the task of shot segmentation and classification can be seen as mapping the sequence of feature vectors to a sequence of shots, and the best mapping is the expected result of shot segmentation and classification. In the proposed framework, the task is therefore interpreted as finding a shot sequence that maximizes the conditional probability of H given O, namely, finding
\hat{H} = \arg\max_{H} \{ P(H \mid O) \} = \arg\max_{H} \{ P(H) \cdot P(O \mid H) / P(O) \}    (1)
The equation is transformed by applying Bayes' theorem. Since P(O) is constant because O is a known sequence, the problem can be simplified as follows:
\hat{H} = \arg\max_{H} \{ P(H) \cdot P(O \mid H) \}    (2)
The calculation of the above probability involves two types of probability distribution, i.e. P(H) and P(O|H). The former indicates the probability of the shot sequence without the effect of features, and the latter indicates the probability of the feature sequence under a given shot sequence. We therefore call them the Shot Sequence Probability (SSP) and the Feature Sequence Probability (FSP), respectively.
2.1 Shot Sequence Probability
Shot sequence H is composed of successive shots of different categories, so the shot sequence probability depends on the transition probabilities between adjacent shots. Since the shot sequence is a temporal sequence, the appearance of the present shot is related only to the appearances of the shots prior to it. Therefore, the shot sequence can be taken as a Markov process. We suppose the shot sequence is a first-order Markov chain, namely, the appearance of the present shot is related only to the last shot, which can be formulated by the following equation:
P(h_m \mid h_1 h_2 \ldots h_{m-1}) = P(h_m \mid h_{m-1})    (3)
This assumption is reasonable since sports video shots regularly alternate to exhibit certain semantic content according to the play status. Hence, the shot sequence probability P(H) can be calculated by
P(H) = P(h_1 h_2 \ldots h_{t-1} h_t) = P(h_1) \cdot P(h_2 \mid h_1) \cdots P(h_t \mid h_{t-1})    (4)
where P(h_i) denotes the initial distribution probability of shot h_i, and P(h_j|h_i) denotes the transition probability between shots h_i and h_j, which we call the Bi-gram in our framework. The SSP can therefore be deduced from the Bi-gram.
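A minimal sketch of Eq. (4), computing the log of the shot sequence probability from an initial distribution and a Bi-gram table; the probability values in the comment are made up purely for illustration, and the small floor used for missing entries is a guard added here, not part of the paper.

```python
import math

def shot_sequence_log_prob(shots, initial, bigram, floor=1e-12):
    """log P(H) = log P(h1) + sum_t log P(h_t | h_{t-1})  (Eq. 4).
    `initial` maps a shot type to P(h1); `bigram` maps (prev, cur) to P(cur|prev)."""
    logp = math.log(max(initial.get(shots[0], 0.0), floor))
    for prev, cur in zip(shots, shots[1:]):
        logp += math.log(max(bigram.get((prev, cur), 0.0), floor))
    return logp

# illustrative usage (made-up numbers):
# initial = {"LS": 0.6, "MS": 0.2, "CS": 0.2}
# bigram = {("LS", "CS"): 0.5, ("CS", "LS"): 0.4, ("LS", "MS"): 0.3, ...}
# shot_sequence_log_prob(["LS", "CS", "LS"], initial, bigram)
```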
2.2 Feature Sequence Probability
The feature sequence is also a temporal sequence, which indicates a certain shot sequence. Since the HMM has been successfully used in speech recognition to model the temporal evolution of features in a word [9], we use an HMM to model a sports video shot, since words and shots have a similar temporal structure. Hence, the feature sequence of a shot is an observed sequence of a given shot HMM, and each emitting state of the HMM produces a feature vector in the feature sequence [9]. The feature sequence probability P(O|H) can therefore be transformed into
P(O \mid H) = \sum_{S} P(O, S \mid H)    (5)
where S = s_1 s_2 ... s_T is the state sequence which emits the feature sequence O = o_1 o_2 ... o_T through the link of all the shot HMMs, which we call a super HMM. A super HMM is obtained by concatenating the corresponding shot HMMs using a pronunciation lexicon [10]. P(O, S|H) can then be derived from
P(O, S \mid H) = \prod_{t=1}^{T} P(o_t, s_t \mid s_{t-1}, H) = \prod_{t=1}^{T} \left[ P(s_t \mid s_{t-1}, H) \cdot P(o_t \mid s_t) \right]    (6)
P(o_t|s) is the emission probability distribution of state s, and P(s_t|s_{t-1}, H) is the transition probability between two states. State transitions within an HMM are determined from the HMM parameters, and state transitions between HMMs are determined by the Bi-gram. Therefore, these two components can be derived from the parameters of the shot HMMs and the Bi-gram.
3 Shot Segmentation and Classification in Sports Video
In this section, we present the semantic shot segmentation and classification based on the statistical framework presented in Section 2, and shots are classified into three categories including Long Shot (LS), Medium Shot (MS) and Close-up Shot (CS). As described in Section 2, the task of shot segmentation and classification is taken as solving the problem of a maximum conditional probability P(H|O), which is dependent on SSP and FSP derived from Bi-gram and the parameters of shot HMM. Hence, Bi-gram and shot HMM are the two keys to shot segmentation and classification. The whole framework is shown in Fig.1.
Fig. 1. Framework for shot segmentation and classification
Since the parameters of an HMM are estimated by the EM algorithm using the feature vector sequence [11], appropriate features are vital to the HMM construction and can better explain the temporal evolution in a shot. Section 3.1 therefore introduces feature extraction. Sections 3.2 and 3.3 present the shot HMM and Bi-gram constructions, respectively. Section 3.4 discusses the procedure of simultaneous shot segmentation and classification.
3.1 Feature Extraction
As mentioned in Section 2, the feature sequence of a shot is the observed vector sequence of the shot HMM. Each shot is therefore partitioned into segments, features are extracted from each segment, and the features of all the segments form the feature vector sequence. A shot segment can be one or more consecutive frames and is called a Shot Segment Unit (SSU). Given the length of an SSU, a Segmenting Rate (SR) is required at the frame level to determine the spacing of two successive SSUs. To retain more information about the shot, the size of SR may be smaller than that of SSU, as shown in Fig. 2.
After the magnitudes of SSU and SR are set, a feature vector is extracted from each SSU. Two classes of features, color related and motion related, are used in our work because they are generic and can be easily computed, and they can be extended to most categories of sports game. Features are first extracted from each frame, and then the feature values of an SSU are set as the mean of the corresponding feature values of all the frames in it.
Fig. 2. SSU segmentation
Color Related Features. Since the three categories of shots have different playfield and player sizes (for example, LS has the largest playfield view, while MS has a smaller playfield and a whole player), frames in each type of shot have a distinct color distribution which differentiates them from other types of shots. Color features are derived from the L, U and V components, since the CIE LUV color space is approximately perceptually uniform, and are computed by the following equations:
L_f = \sum_{(x,y) \in f} L(x, y) / N_f, \quad U_f = \sum_{(x,y) \in f} U(x, y) / N_f, \quad V_f = \sum_{(x,y) \in f} V(x, y) / N_f    (7)
where L_f, U_f and V_f are the three basic color features of a frame f, N_f is the number of pixels in f, and L(x, y), U(x, y), V(x, y) are the L, U, V components of pixel (x, y) in f, respectively.
Motion Related Features. Motion is another important factor for distinguishing shots, since different shots reflect different camera motions; for example, LS usually has much smaller motion than CS. Three basic motion features are extracted for a frame: the frame difference D_f, the compensated frame difference C_f, and the motion magnitude M_f.
D_f = \sum_{i=0}^{255} \left( H(f, i) - H(f-1, i) \right)^2 / N_f    (8)
where H(f, i) is the number of pixels of color i in frame f. The other two basic motion features are built on block-based motion analysis. For each block B_f(s, t) at location (s, t) in the present frame f, a block B^*_{f-1}(u^*, v^*) from the previous frame f-1 is found to best match it. C_f and M_f are then computed as follows, where G_f is the set of all blocks in frame f:
C_f = \sum_{B_f(s,t) \in G_f} \sum_{i=0}^{255} \left( H_f(B_f(s,t), i) - H_{f-1}(B^*_{f-1}(u^*, v^*), i) \right)^2, \quad M_f = \sum_{B_f(s,t) \in G_f} \left( (s - u^*)^2 + (t - v^*)^2 \right)^{1/2}    (9)
Differences are widely used in signal processing and can capture the trend of a transformation. Therefore, differences are used to express the trend of variation
of color and motion for the different types of shots. Given the 6 basic features, the 1st and 2nd differences of the basic features are computed as follows:
\nabla^1 Fea_f = Fea_f - Fea_{f-1}, \quad \nabla^2 Fea_f = \nabla^1 Fea_f - \nabla^1 Fea_{f-1}    (10)
where Fea_f denotes one of the 6 basic features and \nabla^k Fea_f (k = 1, 2) denotes the k-th difference of basic feature Fea_f of frame f. As a result, there is an 18-D feature vector for each frame and SSU.
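A small sketch of the feature construction of Eq. (10): given the six basic colour and motion features per SSU, the first and second differences are appended to form the 18-D vectors. Setting the undefined leading differences to zero is our assumption; the paper does not specify how the first SSUs are handled.

```python
import numpy as np

def ssu_feature_sequence(basic_features):
    """basic_features: array of shape (T, 6) holding the six basic colour and
    motion features (L, U, V, D, C, M) of each SSU.  Returns the (T, 18) sequence
    with the 1st and 2nd differences of Eq. (10) appended."""
    f = np.asarray(basic_features, dtype=np.float64)
    d1 = np.zeros_like(f)
    d1[1:] = f[1:] - f[:-1]            # 1st difference
    d2 = np.zeros_like(f)
    d2[1:] = d1[1:] - d1[:-1]          # 2nd difference
    return np.hstack([f, d1, d2])      # 18-D vector per SSU
```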
3.2 Shot HMM Construction
Since there are 3 categories of shots to be classified, a general solution is to build 3 HMMs modeling the 3 types of shots, respectively. However, this method only considers the temporal evolution of SSUs within a single shot, and does not take into account the temporal evolution of SSUs at the transitions between adjacent shots. In fact, in sports videos, the alternation of the various types of shots exhibits certain rules; for example, MS is mainly followed by LS, while LS is often followed by CS. Therefore, to better model the temporal evolution of SSUs both within and between shots, a context-dependent shot model is introduced, which is defined as the tri-shot HMM. The tri-shot HMM, just as its name indicates, models the temporal evolution of SSUs in three shots: the prior shot, the present shot and the next shot. Compared with an ordinary shot model, a tri-shot model is trained using the feature vector sequences of the 3 involved shots. Since we have 3 categories of shots, there are 3^3 = 27 tri-shots in total.
Fig. 3. Topology of HMM structure
For simplicity, we use a left-to-right HMM to represent each tri-shot HMM. The middle states are emitting states with output probability distributions, as shown in Fig. 3. Each HMM contains 5 states, and Gaussian Mixture Models are used to express the continuous observation densities of the emitting states.
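The left-to-right topology of Fig. 3 can be written down as an initial transition matrix as in the sketch below; the self-transition value is purely illustrative, and in the paper the actual parameters (including the Gaussian-mixture emission densities) are estimated with EM via HTK rather than fixed by hand.

```python
import numpy as np

def left_to_right_transmat(n_states=5, stay=0.6):
    """Initial transition matrix for a left-to-right HMM: each state may stay in
    place or move to the next state; the last state is absorbing.  `stay` is an
    illustrative self-transition probability, refined later by EM training."""
    a = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        a[i, i] = stay
        a[i, i + 1] = 1.0 - stay
    a[-1, -1] = 1.0
    return a

# One such 5-state model with Gaussian-mixture emissions would be trained for each
# of the 3^3 = 27 tri-shot contexts, using the 18-D SSU sequences of the
# (previous, current, next) shot triples as training data.
```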
3.3 Bi-gram Construction
As described in Section 2, the Bi-gram denotes the transition probability between two adjacent shots, which indicates the possibility of a transition from one shot to another. In sports video, the various types of shots display different play status and do not appear randomly. For example, after an LS is shown to exhibit the global game status, an MS or CS is often shown to track the player or ball. Therefore, the Bi-gram
can be calculated according to the statistics of the appearances of each pair of shots in the training sports video. The following formula embodies the basic idea of the derivation of the Bi-gram:
P(h_j \mid h_i) = \begin{cases} \alpha \, N(h_i, h_j) / N(h_i), & \text{if } N(h_i) \neq 0 \\ 1/l, & \text{otherwise} \end{cases}    (11)
where N(h_i, h_j) is the number of times shot h_j follows shot h_i and N(h_i) is the number of times that shot h_i appears; l is the total number of distinct shot models, and α is chosen to ensure that \sum_{j=1}^{l} P(h_j \mid h_i) = 1.
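A sketch of the Bi-gram estimation of Eq. (11) from labeled training shot sequences; the handling of the normalization constant α shown here (a simple row renormalization) is an assumption, since the paper does not detail it.

```python
from collections import Counter

def estimate_bigram(training_shot_sequences, shot_types=("LS", "MS", "CS")):
    """P(h_j | h_i) = alpha * N(h_i, h_j) / N(h_i) when N(h_i) > 0, and 1/l
    otherwise, with each row renormalized to sum to one (Eq. 11)."""
    pair_counts, single_counts = Counter(), Counter()
    for seq in training_shot_sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            single_counts[prev] += 1
    l = len(shot_types)
    bigram = {}
    for hi in shot_types:
        if single_counts[hi] == 0:
            row = {hj: 1.0 / l for hj in shot_types}          # unseen shot type
        else:
            row = {hj: pair_counts[(hi, hj)] / single_counts[hi] for hj in shot_types}
            norm = sum(row.values())                          # alpha = 1 / norm
            row = {hj: p / norm for hj, p in row.items()}
        for hj, p in row.items():
            bigram[(hi, hj)] = p
    return bigram
```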
3.4 Procedure of Shot Segmentation and Classification
With the Bi-gram and tri-shot HMMs constructed, a sports video can be segmented and classified into the 3 types of shots. Since the log operator transforms products into sums, in practice the products of probabilities in the equations presented in Section 2 are computed as sums of the corresponding log probabilities. As mentioned in Section 2, a super HMM is obtained by concatenating the corresponding shot HMMs using a pronunciation lexicon, so each shot HMM in the super HMM is considered as a node. Hence, the task of shot segmentation and classification is to find the path of maximum log probability from the start node to the end node in the super HMM, as shown in Fig. 1. For an unknown shot sequence of T SSUs whose feature vector sequence is O (O = o_1 o_2 ... o_T), each path from the start node to the end node in the super HMM which passes through exactly T emitting HMM states is a potential recognition hypothesis. The log probability of each path is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding SSU. Intra-HMM transitions are determined from the HMM parameters, and inter-HMM transitions are determined by the Bi-gram [12]. Thus, the path of maximum log probability gives the best result of shot segmentation and classification: the nodes (shot HMMs) on the best path are the optimal result of shot classification, and since each state in a shot HMM matches an SSU, the boundaries of each shot are its first and last SSUs, which realizes shot segmentation simultaneously.
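Conceptually, the search for the maximum-log-probability path is a Viterbi decoding over the states of the super HMM; the paper's implementation relies on the HTK decoder, so the following numpy sketch only illustrates the underlying dynamic program over precomputed log probabilities.

```python
import numpy as np

def viterbi(log_emis, log_trans, log_start):
    """Best state path through a concatenated ('super') HMM.

    log_emis:  (T, S) log P(o_t | s) for every SSU t and emitting state s
    log_trans: (S, S) log transition probabilities; entries inside a shot HMM come
               from its parameters, entries linking two shot HMMs come from the Bi-gram
    log_start: (S,)   log initial state probabilities
    """
    T, S = log_emis.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_start + log_emis[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans           # (S, S): prev -> cur
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_emis[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]   # mapping states back to shot HMMs yields labels and boundaries
```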
4 Experimental Results and Analysis
We first evaluated the performance of our proposed statistical framework on soccer and badminton videos and compared it with that of a general two-stage method of shot segmentation and classification using the same features and test videos; we then analyzed the experimental results and parameters. The test videos are CIF (352 × 288, 25 fps) videos obtained from the FIFA World Cup 2006 and the 2005 World Badminton Championship, comprising 4 full half-time soccer videos and 4 full-game badminton videos. The implementation of the proposed
framework is based on HTK 3.3 [12]. The ground-truth boundaries and type of each shot are labeled manually. Table 1 shows the test corpus; each shot lasts at least 1 s, i.e. 25 frames.

Table 1. Test Videos

Name        Match (2006)              Length (min)
Soccer1     GER-ITA (7-4)             46:42
Soccer2     ENG-POR (7-1)             46:38
Soccer3     GER-ARG (6-30)            46:05
Soccer4     SUI-UKR (6-26)            47:04

Name        Match (2005-8-21)         Length (min)
Badminton1  INA-THA (Mixed Doubles)   24:11
Badminton2  MAS-INA (Men's Singles)   26:12
Badminton3  CHN-INA (Mixed Doubles)   26:24
Badminton4  NZL-CHN (Mixed Doubles)   17:09
Results are assessed by the recall and precision rates, which are computed by
Recall = Correct / Ground-truth, \quad Precision = Correct / Detected    (12)
where Detected is the number of shots obtained by shot segmentation and Correct is the number of correctly classified shots. A correct classification means not only that the shot type is correctly recognized, but also that the overlap between the range of the classified shot and the actual shot is more than 90 percent of the length of the actual shot. The proposed framework is tested on soccer and badminton videos since they are completely different types. Experimental results are shown in Table 2.

Table 2. Experimental results on soccer and badminton video

            Ground-truth         Detected             Correct
Shot Type   soccer  badminton    soccer  badminton    soccer  badminton
LS          570     156          658     165          559     152
MS          392     103          474     121          326     80
CS          402     215          395     266          313     212
Total       Soccer: Recall = 87.8%, Precision = 78.5%;  Badminton: Recall = 93.7%, Precision = 80.4%
On average, the proposed framework achieves an 87.8% recall and 78.5% precision rate on soccer videos, and a 93.7% recall and 80.4% precision rate on badminton videos. The results are promising, which demonstrates the effectiveness of our general framework for shot segmentation and classification. We found that false alarms are mainly caused by the misclassification of CS and MS and by over-segmentation of CS and MS; the performance can be improved by applying more sophisticated features instead of simple color and motion features.
Experiments Using SVM. We applied the general method proposed in [5] for shot segmentation and classification, using the same features, for comparison. To simplify the procedure, we used the SVM to classify manually segmented shots in the above soccer videos. We chose C = 2^1 and γ = 2^{-10} by cross-validation, and
the experimental results are shown in Table 3. The total precision rate is 70.3%, which is far less than the 78.5% precision rate achieved by our method. Note that shot classification using the SVM is performed on the manually segmented shots, which avoids wrong classifications caused by inaccurate segmentation. Thus, the proposed framework performs much better.

Table 3. Experimental results of SVM classification

Shot Type   Ground-truth   Correct   Precision (%)
LS          570            530       92.9
MS          392            181       46.2
CS          402            248       61.7
Total       Precision = 70.3%
Experiments Using Various Orders of Difference Features. Two further types of feature vectors are tested to demonstrate that the differences improve performance: a 6-D feature vector (without difference features) and a 12-D feature vector (with 1st difference features). As can be seen from Table 4, performance is significantly improved with higher-order difference features.

Table 4. Experimental results using various orders of difference features

Name      6-D feature vector        12-D feature vector       18-D feature vector
          Recall : Precision (%)    Recall : Precision (%)    Recall : Precision (%)
Soccer1   83.5 : 72.5               85.8 : 74.7               86.1 : 76.2
Soccer2   84.1 : 73.2               87.3 : 74.1               87.9 : 78.9
Soccer3   83.2 : 68.8               87.9 : 77.1               90.1 : 79
Soccer4   80.6 : 75.6               85.2 : 78.5               86.7 : 80
Experiments Using Various Combinations of SSU and SR. As described in Section 3.1, the sizes of SSU and SR determine the fineness and the number of feature vectors in a shot, respectively. Eight groups of different SSU and SR settings are tested on soccer video using the 18-D features to find the best combination of SSU and SR.
Fig. 4. Experimental results using various combinations of SSU and SR
Results are shown in Fig. 4, where the combination of SSU and SR is denoted by SSU SR (frames). As we can see, performances are much better when SR is more than 10 frames and the ratio of SSU to SR is in [1.5, 2]. So in our work, the sizes of SSU and SR are set as 20 and 10 frames, respectively.
5 Conclusions
A statistical framework for shot segmentation and classification in sports video is presented in this paper. The main idea of the proposed framework is that the task of shot segmentation and classification is formulated as a conditional probability which incorporates both intra-shot and inter-shot information. Thus, the proposed method realizes shot segmentation and classification simultaneously, and achieves much better performance than a general two-stage method on soccer videos. Experimental results on badminton videos are also promising, which demonstrates that the framework can be extended to other sports videos. Furthermore, the difference features introduced from the speech recognition area are applied in our framework and have proved their superiority in improving performance. In future work, we will employ higher-level semantic features to enhance the performance and apply the framework to event detection in sports video.
References
1. Hanjalic, A.: Shot-Boundary Detection: Unraveled and Resolved? IEEE Trans. Circuits and Systems for Video Technology 12, 90–105 (2002)
2. Lexing, X., Peng, X., Chang, S.-F., Divakaran, A., Sun, H.: Structure Analysis of Soccer Video with Domain Knowledge and Hidden Markov Models. Pattern Recognition Letters 24(15), 767–775 (2003)
3. Ekin, A., Tekalp, A.M., et al.: Automatic Soccer Video Analysis and Summarization. IEEE Trans. Image Processing 12, 796–807 (2003)
4. Dahyot, R., Rea, N., Kokaram, A.: Sports Video Shot Segmentation and Classification. In: SPIE Int. Conf. Visual Communication and Image Processing, pp. 404–413 (2003)
5. Wang, L., Liu, X., Lin, S., Xu, G., Shum, H.-Y., et al.: Generic Slow-Motion Replay Detection in Sports Video. IEEE ICIP, 1585–1588 (2004)
6. Duan, L.-Y., Xu, M., Tian, Q.: Semantic Shot Classification in Sports Video. In: SPIE Proc. Storage and Retrieval for Media Databases, pp. 300–313 (2003)
7. Duan, L.Y., Xu, M., Tian, Q., et al.: A Unified Framework for Semantic Shot Classification in Sports Video. IEEE Trans. on Multimedia 7, 1066–1083 (2005)
8. Xu, M., Duan, L., Xu, C., Tian, Q.: A Fusion Scheme of Visual and Auditory Modalities for Event Detection in Sports Video. IEEE ICASSP 3, 189–192 (2003)
9. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989)
10. Ney, H., Ortmanns, S.: Progress in Dynamic Programming Search for LVCSR. Proceedings of the IEEE 88, 1224–1240 (2000)
11. Bilmes, J.: A Gentle Tutorial on the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report of University of Berkeley, ICSI-TR-97-021 (1998)
12. Young, S., Evermann, G., et al.: The HTK Book (for HTK version 3.3). Cambridge University Tech Services Ltd. (2005), http://htk.eng.cam.ac.uk/
Sports Classification Using Cross-Ratio Histograms
Balamanohar Paluri, S. Nalin Pradeep, Hitesh Shah, and C. Prakash
Sarnoff Innovative Technologies Pvt Ltd
Abstract. The paper proposes a novel approach for the classification of sports images based on the geometric information encoded in the image of a sport's field. The proposed approach uses the invariant nature of the cross-ratio under projective transformation to develop a robust classifier. For a given image, cross-ratios are computed for the points obtained from the intersection of lines detected using the Hough transform. These cross-ratios are represented by a histogram which forms a feature vector for the image. An SVM classifier trained on a priori model histograms of cross-ratios for sports fields is used to decide the most likely sport's field in the image. Experimental validation shows robust classification using the proposed approach for images of Tennis, Football, Badminton and Basketball taken from dissimilar view points.
1 Introduction
The exponential growth of photographic content during the last decade has fueled the requirement for intelligent content management systems. One of the essential parts of such systems is automated classification of image content. In this paper, we address the identification of sports based on the sport field in an image using a robust classification mechanism. Conventional approaches discussed in the literature for sport identification are primarily related to video; for example, Wang et al. [1] distinguish sports video shots using color and motion features. Takagi et al. [2] proposed an HMM-based video classification system using camera motion parameters, Kobla et al. [3] applied replay detection, text and motion features with Bayesian classifiers to identify sports video, and Messer et al. [4] employed neural networks and texture codes on semantic cues for the same purpose. More recently, Wang et al. [5] classified sports videos with a pseudo-2D-HMM using visual and audio features. In these approaches, sports classification depends on cues like color, texture and spatial arrangement, which are not preferable to use because of the variability of these features between images of the same sport for two different fields and views. Also, none of the currently existing approaches applied to classify sports videos can be directly used for classifying images of sports fields, since they also employ temporal features. As opposed to existing approaches, we have opted to use the geometric information of a field to identify the sport. The motivation for this approach is attributed to the fact that sports fields have dominant geometric structures on
Fig. 1. Invariant nature of cross ratio for four collinear points subjected to a projective transformation
a planar surface. These structures are well defined and uniform across different fields for the same sport. Moreover, these structures consist of lines that are in stark contrast to the sports field, such as white lines over a green ground, making it easy to identify such lines using conventional image processing techniques. These observations motivated the use of geometric information to develop a robust classifier. The idea of using geometric information like projective invariance has been used for object recognition [6,7,8], but to the best of our knowledge there has been no prior work related to the classification of sport's field images using projective invariance. In the proposed approach, we exploit the invariance of the cross ratio under projective transformation. Thus, in any view of a sport's field, four corresponding collinear points have the same cross ratio. Since this is true only for the four corresponding points, we use a histogram of cross-ratios for a given sport's field image to describe a model for that sport. To ensure selection of the same points in each frame, we limit our consideration to points of intersection of dominant lines only. The paper is organized as follows. Section 2 explains the proposed approach. The results are presented in Section 3. Finally, Section 4 concludes the paper and suggests future work.
2 Proposed Approach
The planar geometric structures in images of a sport’s field, i.e. lines to demarcate a sport’s field, taken from different view points, are related with each other under a projective transformation. Such a transformation does not preserve
Fig. 2. Interim steps for calculation of the feature vector: (a) input image; (b) Canny edge detections; (c) Hough transform based line detections and the points of intersection of the lines; (d) histogram of cross ratios calculated using the intersection points
distances or angles; however, incidence and cross-ratios are invariant under it, see [9]. In the following paragraphs, we build upon the invariant property of both these attributes to develop a robust classifier for sports images.
2.1 Cross Ratio Histogram
For a line, the incidence relations are: 'lies on' between points and lines (as in 'point P lies on line L'), and 'intersects' (as in 'line L1 intersects line L2'). The cross ratio is a ratio of ratios of distances. Given four collinear points p_1, p_2, p_3 and p_4 in P^2, and denoting the Euclidean distance between two points p_i and p_j as \Delta_{ij}, the cross ratio can be defined, as shown in Figure 1, as
\tau_{p_1 p_2 p_3 p_4} = \frac{\Delta_{13}\,\Delta_{24}}{\Delta_{14}\,\Delta_{23}}    (1)
Although cross ratio is invariant once the order of the points has been chosen, its value is different depending on this order. Four points can be chosen in
4! = 24 different orders, but in fact only six distinct values are produced, which are related by the set
\left\{ \tau,\ \frac{1}{\tau},\ 1 - \tau,\ \frac{1}{1 - \tau},\ \frac{\tau - 1}{\tau},\ \frac{\tau}{\tau - 1} \right\}    (2)
Thus, for any given four points, we use the minimum of the above cross ratio values as their representative. To classify a given unknown sport's field image, Canny edge detection [10] is performed on the image. The dominant lines in the image are then identified using a Hough transform [11]. Since the lines drawn on a sport's field are dominant compared to other edges produced by noise or players, these dominant lines are easily identified using the above approach. On each line obtained, we find the points of intersection with the rest of the lines. These intersection points are then used for the cross ratio calculation, as they will be consistently detected in any view of a sport's field. Also, for all the coplanar lines, such points of intersection will correspond to the same physical point on the field, irrespective of the viewing angle of the camera, due to the invariance of incidence relations. From the entire set of points obtained on a line, we form subsets of four points each. For each subset we calculate the representative cross-ratio as explained using the set in equation (2). Thus, for a given image, we obtain a large number of these cross ratio values, which encode the geometric structures present in the field.
Ideally, if we were able to consistently reproduce the corresponding points in each input image for a given sport, we could match the cross ratio values individually and obtain a good classification. But this is normally not the case; there are two issues which make the direct comparison of cross ratio values infeasible. Firstly, it is challenging to consistently establish correspondence between points across images of a sport's field from different view points, because some points might not be detected due to noise in the image, as well as because the orientations of the sports fields can be significantly different. Secondly, there might be small variations in the cross ratio values due to noise in the measurement of lines. To overcome these issues, a histogram-based representation of the cross ratios is used. For each new image to be classified, the above steps are performed to obtain a histogram of cross ratios. This histogram is then used as a feature vector describing the dominant geometric structures in the image. A trained SVM classifier is used to decide the class to which each feature vector and its corresponding image belong. The entire process with intermediate results is pictorially depicted in Figure 2.
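The representative cross ratio of Eqs. (1) and (2) and the resulting histogram feature can be sketched as follows. The number of histogram bins and the value range are illustrative choices, as the paper does not state them, and degenerate point configurations (coincident points, τ equal to 0 or 1) are not handled in this sketch.

```python
import numpy as np
from itertools import combinations

def representative_cross_ratio(p1, p2, p3, p4):
    """Minimum of the six order-dependent cross-ratio values (Eqs. 1 and 2)
    for four collinear 2-D points, as used in the paper."""
    def d(a, b):
        return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    tau = (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))
    values = [tau, 1.0 / tau, 1.0 - tau, 1.0 / (1.0 - tau),
              (tau - 1.0) / tau, tau / (tau - 1.0)]
    return min(values)

def cross_ratio_histogram(lines_points, bins=30, value_range=(0.0, 3.0)):
    """lines_points: for each detected line, the list of its intersection points
    with the other lines.  Every 4-point subset on a line contributes one
    representative cross ratio; the normalised histogram is the feature vector."""
    ratios = []
    for pts in lines_points:
        for quad in combinations(pts, 4):
            ratios.append(representative_cross_ratio(*quad))
    hist, _ = np.histogram(ratios, bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)
```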
2.2
Support Vector Machine Classifier
Since multiple sports are being considered, there is a need for a multi-class SVM classifier. The SVM is originally designed for binary classification only. However, multiple strategies for extending the binary classifier to a multi-class classifier have been discussed in the literature, such as one-against-all [12], one-against-one [13], and half-against-half [14]. In the case of one-against-all (OVA), the
Table 1. Cross-ratio values for a horizontal and a vertical line on the outer boundary of the sports field, for each sport. These values are calculated directly from the standard field measurements.

Sport        Horizontal line   Vertical line
Tennis       1.0208            1.0989
Badminton    1.08              1.412
Football     1.333             1.05
Basketball   0.91              0.96
N-class problem is decomposed into a series of N binary problems. In one-against-one (OVO), for an N-class problem, N(N−1)/2 binary classifiers are trained and the appropriate class is found by a voting scheme. For a half-against-half (HAH) extension, a classifier is built by recursively dividing the training dataset of N classes into two subsets. We have used a modified OVA approach. OVA is one of the earliest approaches for multi-class SVMs and gives good results for most problems. For an N-class problem, it constructs N binary SVMs; each binary SVM is trained with all the samples belonging to one class as positive samples and all the samples from the remaining classes as negative. Given a sample to classify, all N SVMs are evaluated and the label of the class with the largest value of the decision function is chosen. One drawback of this method is that when the results from multiple classifiers are combined, each classifier is given the same importance irrespective of its competence. Hence, instead of directly comparing the decisions, we use reliability measures to make the final decision, which makes the multi-class framework more robust. The static reliability measure proposed in [15] has been used in our multi-class framework.
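The sketch below illustrates this kind of one-vs-all fusion with scikit-learn; the per-classifier `reliability` weights are only a stand-in for the static reliability measure of [15], whose exact form is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

class ReliabilityWeightedOVA:
    """One-vs-all SVM fusion; decision values are scaled by a per-classifier
    reliability weight before the arg-max (a stand-in for the measure in [15])."""

    def __init__(self, n_classes, **svm_params):
        self.models = [SVC(kernel="poly", degree=2, **svm_params)
                       for _ in range(n_classes)]
        self.reliability = np.ones(n_classes)

    def fit(self, X, y):
        for k, model in enumerate(self.models):
            binary_y = (y == k).astype(int)          # class k vs the rest
            model.fit(X, binary_y)
            # Illustrative reliability: training accuracy of the binary SVM.
            self.reliability[k] = model.score(X, binary_y)
        return self

    def predict(self, X):
        scores = np.column_stack([m.decision_function(X) for m in self.models])
        return np.argmax(scores * self.reliability, axis=1)
```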
3
Experimental Results
Images of tennis, basketball, badminton, and football fields have been used for experimental evaluation, since these sports possess good geometric structure in their respective fields. The first column of Table 1 shows the cross-ratio values for the points of intersection of the vertical lines with the outer horizontal boundary line of the field. Similarly, the second column shows the cross-ratio values for the points of intersection of the horizontal lines with the outer vertical boundary line of the field. It can be observed that the differences in geometric structure between sports fields are reflected well in the cross-ratio values, and hence these can be used as a basis for classification of the sports field. Cross-ratio histograms of images of two different fields for each of the four sports considered are shown in figure 3. It can be noted that, for the same sport, images taken of different fields from different viewpoints generate similar cross-ratio histograms. We collected 200 images of each sport's field from videos, the Internet, and photographs of various fields, ensuring that the images for each
Fig. 3. The first and second columns show two separate views of the sports field with the detected lines and points of intersection overlaid on the gray image. The third column shows the cross-ratio histograms for both views (in each row, blue represents the histogram of the first-column image and red that of the second-column image).
class were taken under different viewpoints and illumination conditions. Of these images, 50 from each class were used to generate the cross-ratio histogram feature vectors for training the SVM classifier. The remaining 150 images from each class were mixed to form a collective database of 600 images, which was used as the test set to evaluate the performance of the trained SVM classifier. To test the robustness of the proposed system when presented with non-sport images, we introduced 100 random images that did not belong to any of the four sports fields. For the SVM classifier, we experimented with polynomial and radial basis kernel functions over a variety of parameter values. Of the two, we observed that the second-order polynomial kernel gave higher accuracy.
Table 2. Classification accuracies in percentage for each class. The overall classification accuracy is 91.857%.

Sport        Tennis   Badminton   Basketball   Football   Non-sport
Tennis       96.66    2           0.66         0          0.66
Badminton    2.66     94          0.66         1.33       1.33
Basketball   0.66     2           91.33        3.33       2.66
Football     0.66     0.66        3.33         90.66      4.66
Non-sport    2        1           5            8          84
The performance metric is defined as the percentage of correct classifications per sport. Classification accuracies for each class are given in Table 2. The overall classification error on the test dataset was 8.14%. The results indicate that, for real-world datasets under different conditions and viewpoints, the classification accuracies are very good. We also noted that most of the non-sport images were segregated well, and the few that were misclassified as one of the four sports consisted mainly of images of buildings.
4
Conclusion
This paper proposed an approach for sports-image classification based on the geometric structure of the sports field. Our experimental validation shows encouraging results and high classification accuracy even for widely separated views of a sports field. One observation from the experimental results was that the proposed approach performs very well when used to classify images of the sports fields it was trained for, but the performance drops, from a classification error of 6.83% to 8.14%, with the introduction of random images having prominent geometric structures. We are therefore currently working on improving the system's ability to discern between sport and non-sport images using other invariant cues from the geometry of the sports fields, to improve the robustness and accuracy of the proposed approach. To address this, we are exploring the relative spatial distribution of intersection points and invariant measures for lines under projective transformation to further augment the feature vector for the classifier. We are also gathering more data for sports such as hockey, baseball, and lacrosse to extend the classifier.
References
1. Wang, D.H., Tian, Q., Gao, S., Sung, W.K.: News sports video shot classification with sports play field and motion features. In: ICIP, pp. 2247–2250 (2004)
2. Takagi, S., Hattori, S., Yokoyama, K., Kodate, A., Tominaga, H.: Sports video categorizing method using camera motion parameters. In: ICME 2003. Proceedings of the 2003 International Conference on Multimedia and Expo, pp. 461–464. IEEE Computer Society, Washington, DC, USA (2003)
3. Kobla, V., DeMenthon, D., Doermann, D.: Identifying sports videos using replay, text, and camera motion features (2000)
4. Messer, K., Christmas, W., Kittler, J.: Automatic sports classification. In: Proc. ICPR (2002)
5. Wang, J., Xu, C., Chng, E.: Automatic sports video genre classification using pseudo-2D-HMM. In: ICPR 2006. Proceedings of the 18th International Conference on Pattern Recognition, pp. 778–781. IEEE Computer Society, Washington, DC, USA (2006)
6. Lei, G.: Recognition of planar objects in 3-D space from single perspective views using cross ratio. IEEE Transactions on Robotics and Automation 6(4), 432–437 (1990)
7. Weiss, I., Ray, M.: Model-based recognition of 3D objects from single images. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 116–128 (2001)
8. Rajashekhar, Chaudhuri, S., Namboodiri, V.: Image retrieval based on projective invariance, pp. I: 405–408 (2004)
9. Mundy, J.L., Zisserman, A.: Geometric Invariance in Computer Vision. MIT Press, Cambridge, MA, USA (1992)
10. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
11. Hough, P.: Method and means for recognizing complex patterns. US Patent 3,069,654 (1962)
12. Vapnik, V.: Statistical Learning Theory. Wiley, Chichester (1998)
13. Kreßel, U.H.G.: Pairwise classification and support vector machines, pp. 255–268 (1999)
14. Lei, H., Govindaraju, V.: Half-against-half multi-class support vector machines. In: Multiple Classifier Systems, pp. 156–164 (2005)
15. Liu, Y., Zheng, Y.F.: One-against-all multi-class SVM classification using reliability measures. In: International Joint Conference on Neural Networks, vol. 2, pp. 849–854 (2005)
A Bayesian Network for Foreground Segmentation in Region Level
Shih-Shinh Huang¹, Li-Chen Fu², and Pei-Yung Hsiao³
¹ Dept. of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, R.O.C.
² Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
³ Dept. of Electrical Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, R.O.C.
Abstract. This paper presents a probabilistic approach for automatically segmenting foreground objects from a video sequence. In order to save computation time and be robust to noise, a region detection algorithm incorporating edge information is first proposed to identify the regions of interest. Next, we consider the motion of the foreground objects, and hence exploit the temporal coherence property of the detected regions. The foreground segmentation problem is thus formulated as follows: given two consecutive image frames and the previously obtained segmentation result, we simultaneously estimate the motion vector field and the foreground segmentation mask in a mutually supporting manner. To represent the conditional joint probability density function in a compact form, a Bayesian network is adopted to model the interdependency of these two elements. Experimental results for several video sequences are provided to demonstrate the effectiveness of our proposed approach.
1
Introduction
Foreground segmentation is a fundamental element in developing a system that can deal with high-level tasks of computer vision. Accordingly, the objective of this paper is to propose a new approach to automatically segment the objects of interest from the video. 1.1
State of the Art
Generally speaking, techniques for foreground segmentation can be grouped into two categories, namely background subtraction and motion-based foreground segmentation. For background subtraction, a simple and intuitive way is to model the background by independently representing the gray-level or color intensity of every pixel in terms of a uni-modal or uniform distribution [1,2]. If the intensity of each pixel is due to the light reflected from a particular surface under a single source of lighting, a uni-modal distribution will be sufficient to represent the pixel value. However, in the real world, the model
of a pixel value in most video sequences is characterized by a multi-modal distribution, and hence the use of a mixture of Gaussian distributions is common [3,4,5]. For example, Friedman and Russell [4] modeled the pixel intensity as a mixture of three Gaussian distributions corresponding to three hypotheses: road, vehicle, and shadow. In [5], this idea is extended to model the scene using multiple Gaussian distributions and to develop a fast approximate method for incrementally updating the parameters. However, not all distributions are Gaussian [6]. In [7,6], a non-parametric background model based on kernel functions is employed to represent the color distribution of each background pixel. This kind of approach is a generalization of the mixture of Gaussians, but unfortunately its computational complexity is high. Moreover, the performance of these methods deteriorates when the background exhibits noise or dynamic properties. This is because they carry out foreground segmentation at the pixel level without taking spatial relations among pixels into consideration. The approaches to motion-based foreground segmentation arise from the fact that the images of foreground objects are generally accompanied by motion. Accordingly, motion-based foreground segmentation refers to dividing an image into a set of motion-coherent objects which are associated with different labels. To achieve this, the first step is to iteratively compute the dominant motion and the significant areas consistent with the current dominant motion [8]. A drawback of these methods is that their performance deteriorates in the absence of a well-defined dominant motion. To overcome this problem, color information is introduced to obtain more accurate segmentation results. Color segmentation algorithms, such as watershed or mean-shift, are first applied to obtain an initial segmentation, and then the regions with similar motion and intensity are merged. Finally, a spatiotemporal constraint is imposed to classify each segmented region as foreground or background. This kind of approach is generally referred to as a region-merging approach [9,10,11]. Nevertheless, motion-based foreground segmentation approaches are seriously limited by the accuracy of the motion estimation algorithm. Thus, the segmentation results may become unsatisfactory when the image sequence exhibits severe noise or when the foreground object undergoes large displacement. 1.2
Approach Overview
To reduce the dimensionality of the problem and to provide more accurate segmentation results, an algorithm that incorporates edge information with intensity is used to identify the regions of interest, and the neighborhood relations among them are expressed by a region adjacency graph (RAG). Let G = {V, E} be a RAG, where V = {s_i : i ∈ [1, ..., R]} is the set of nodes in the graph and each node s_i corresponds to a region; E is the set of edges, and (s_i, s_j) ∈ E if the regions s_i and s_j are adjacent. Here, R is the number of extracted regions of interest. The set of neighboring regions of s_i is denoted as N(s_i) = {s_j : (s_i, s_j) ∈ E and s_i ≠ s_j}. The main drawback of the methods [12,9] that directly estimate the motion followed by foreground segmentation is that their performance is limited by
the motion estimation. Instead, we model the relationship between the motion vector field D_t and the foreground segmentation mask F_t through a Bayesian network. To impose temporal coherence, the main idea is to reinforce the interdependency between D_t and F_t so that motion estimation and foreground segmentation can mutually support each other. The remainder of this paper is organized as follows. In Section 2, we present an algorithm to extract regions of interest. Based on the obtained regions, Section 3 describes how to model the foregoing interdependency among variables through a Bayesian network and how to infer the motion vector field and foreground segmentation mask simultaneously. Section 4 gives the probability definition of each term mentioned above. In Section 5, we demonstrate the effectiveness of the developed approach by providing satisfactory experimental results. Finally, we conclude the paper in Section 6 with some insightful discussions.
2
Interested Region Detection
Traditionally, color segmentation followed by a region-merging algorithm [9,10] is a widely used method for region generation. The approaches based on such a region-merging strategy usually result in over-segmentation. Here, we propose an algorithm that incorporates both intensity and edge information to identify the regions of interest, so that the noise effect can be successfully removed and the number of regions can be greatly decreased. This algorithm mainly consists of two steps, namely CDM (change detection mask) detection and region generation. 2.1
CDM Detection
First of all, an initial change detection mask (CDMi_t) between two consecutive image frames I_{t-1} and I_t is computed by a thresholding operation. Fig. 1(a) shows an example of CDMi between frames 39 and 40 of the Hall Monitoring video sequence. However, taking only intensity into account for change detection suffers from the problem that non-changed pixels are easily mis-detected as changed ones. Here we incorporate the moving edge into the CDM detection step for identifying the regions with motion. The moving-edge (ME) map is defined as

\mathrm{ME}_t = \{\, e \in E_t \mid \min_{x \in DE_t} \| e - x \| \le T_{change} \,\},    (1)

where E_t is the edge map of the current frame I_t, DE_t is the difference edge map obtained by applying the Canny edge detector to the difference image D_t, and T_{change} is a distance threshold. In our implementation, T_{change} is set to 2. Fig. 1(b) shows the ME of frame 40 of the Hall Monitoring video sequence. Clearly, moving edges arise only from the boundaries of foreground objects, which are invariant to the noise effect. Taking the pixels in ME as initial seed points, we then use the region growing algorithm [13] to expand them pixel by pixel within CDMi. The output of the region growing algorithm is denoted
as CDMu. Note that the noise effect is almost removed, as shown in Fig. 1(c), after incorporating the extracted moving edges. In order to maintain temporal stability, which means that pixels belonging to the previous object mask should also belong to the current change mask, we set all pixels in the previous object mask OM_{t-1} as changed in the resulting change detection mask (CDM_t). Fig. 1(d) shows the finally obtained change detection mask.
Fig. 1. An example of the algorithm for interested region detection
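A possible implementation of the moving-edge map of Eq. (1) is sketched below with OpenCV, using a distance transform to evaluate the minimum-distance test; the Canny thresholds are illustrative.

```python
import cv2
import numpy as np

def moving_edge_map(frame_prev, frame_curr, t_change=2, canny_lo=50, canny_hi=150):
    """ME_t: edges of the current frame lying within t_change pixels of an
    edge of the frame-difference image, following Eq. (1)."""
    g_prev = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g_curr = cv2.cvtColor(frame_curr, cv2.COLOR_BGR2GRAY)
    edges_curr = cv2.Canny(g_curr, canny_lo, canny_hi)          # E_t
    diff = cv2.absdiff(g_curr, g_prev)                          # difference image
    edges_diff = cv2.Canny(diff, canny_lo, canny_hi)            # DE_t
    # Distance from every pixel to the nearest difference-edge pixel.
    dist = cv2.distanceTransform(255 - edges_diff, cv2.DIST_L2, 3)
    return (edges_curr > 0) & (dist <= t_change)                # boolean ME_t mask
```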
2.2
Region Generation
Based on the assumption that the foreground boundaries always coincide with color-segmentation boundaries, the colors of the pixels in the CDM are first quantized by a k-means clustering algorithm in RGB color space, followed by a connected-component finding algorithm, so as to extract a set of homogeneous color regions. Currently, the number of quantized colors used is 12. We show all regions in different colors in Fig. 1(e). Finally, a data structure called the region adjacency graph (RAG) is used to represent the spatial relations among these regions of interest.
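The region-generation step and the construction of the RAG might look like the following sketch, assuming scikit-learn for the colour clustering and SciPy for connected components; the `min_size` filter is an added assumption, while the restriction to CDM pixels and the 12 quantized colours follow the text, and `build_rag` is only a hypothetical helper.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def generate_regions(image_rgb, cdm_mask, n_colors=12, min_size=50):
    """Quantise the colours of CDM pixels with k-means, then label connected
    components of equal quantised colour as regions of interest."""
    h, w, _ = image_rgb.shape
    pixels = image_rgb[cdm_mask].astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)

    quantised = np.full((h, w), -1, dtype=np.int32)
    quantised[cdm_mask] = km.labels_

    region_map = np.zeros((h, w), dtype=np.int32)
    next_id = 1
    for c in range(n_colors):
        labels, n = ndimage.label(quantised == c)        # 4-connected components
        for comp in range(1, n + 1):
            comp_mask = labels == comp
            if comp_mask.sum() >= min_size:              # drop tiny fragments
                region_map[comp_mask] = next_id
                next_id += 1
    return region_map                                     # 0 = not a region of interest

def build_rag(region_map):
    """Neighbour sets N(s_i) of the RAG, from 4-connected region adjacency."""
    V = set(np.unique(region_map)) - {0}
    N = {s: set() for s in V}
    for a, b in [(region_map[:, :-1], region_map[:, 1:]),
                 (region_map[:-1, :], region_map[1:, :])]:
        diff = (a != b) & (a > 0) & (b > 0)
        for si, sj in zip(a[diff], b[diff]):
            N[si].add(sj)
            N[sj].add(si)
    return V, N
```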
3
Bayesian Foreground Segmentation
In our formulation, the problem of foreground segmentation is described as follows. Given the previous foreground segmentation mask F_{t-1} and two consecutive image frames I_{t-1} and I_t, we define the joint conditional probability density function of two variables, the motion vector field D_t and the foreground segmentation mask F_t, that is, Pr(D_t, F_t | I_{t-1}, I_t, F_{t-1}). To be clearer, D_t is a random variable that represents the motion of each region. For simplicity, we use a translational motion model to describe the region motion in this research. As for F_t, it is a random variable that assigns a label, either background (b) or foreground (f), to each region. From Bayes' rule, the a posteriori probability density function of D_t and F_t given I_{t-1}, I_t, and F_{t-1} can be expressed as

\Pr(D_t, F_t \mid F_{t-1}, I_{t-1}, I_t) = \frac{\Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t)}{\Pr(F_{t-1}, I_{t-1}, I_t)}.    (2)
Because Pr(F_{t-1}, I_{t-1}, I_t) is constant with respect to the unknowns, the estimation of D_t and F_t is performed by computing the maximum a posteriori (MAP) solution, that is,

(\tilde{D}_t, \tilde{F}_t) = \arg\max_{(D_t, F_t)} \Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t).    (3)
In order to represent the joint probability density function in a compact form, we must identify the dependency among these variables. A Bayesian network
in Fig. 2(a) is adopted to model the interrelationships among these variables in three aspects. For explanation, we decompose it into three sub-graphs as shown in Fig. 2(b)(c)(d). Under such modeling, Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t) can be decomposed into five terms and expressed as

\Pr(D_t, F_{t-1}, F_t, I_{t-1}, I_t) = \Pr(I_t \mid F_t)\,\Pr(I_{t-1} \mid D_t, I_t) \times \Pr(F_{t-1} \mid D_t, F_t)\,\Pr(D_t)\,\Pr(F_t).    (4)
The first term, Pr(I_t | F_t), models the observation likelihood and is commonly used in traditional background subtraction approaches. The graphical model expressing this term is shown in Fig. 2(b). The second term, Pr(I_{t-1} | D_t, I_t), stands for the displaced frame difference (DFD), which was widely used in [14] for estimating the motion vector field. Fig. 2(c) shows its graphical model. The third term, Pr(F_{t-1} | D_t, F_t), represents the temporal coherence [10], which requires that the foreground segmentation mask F_t be nearly consistent with F_{t-1} through motion compensation using D_t (see Fig. 2(d)). Finally, the last two terms, Pr(D_t) and Pr(F_t), model the prior probabilities of D_t and F_t, respectively.
Fig. 2. Graphical Models
4
Probability Modeling
In this section, we concentrate on defining the five terms described in (4). The term Pr(I_t | F_t) evaluates how well the currently observed image I_t fits the labels given in F_t. Here, the label assigned to a region site s in F_t is denoted as F_t(s), and it is either b (background) or f (foreground). Accordingly, the two conditional probability density functions Pr(I_t(s) | F_t(s) = b) and Pr(I_t(s) | F_t(s) = f) must be defined to model the background likelihood and foreground likelihood, respectively. We define these two functions in the same way as [15]. 4.1
Temporal Constraint
Temporal information is a fundamental element for maintaining the consistency of the results over time. The use of motion information across an image sequence is a common way to impose temporal coherence.
Displaced Frame Difference. The conditional probability Pr(I_{t-1} | I_t, D_t) quantifies how well the estimated motion vector fits the consecutively observed images. As in [14], we model this term by a Gibbs distribution with the following energy function:

U_{DFD}(I_{t-1} \mid I_t, D_t) = \lambda_{DFD} \sum_{s_i \in V} \sum_{(x,y) \in s_i} e(x, y)^2,    (5)

where

e(x, y) = \| I_t(x, y) - I_{t-1}((x, y) - D_t(s_i)) \|    (6)

is the displaced frame difference (DFD) and λ_{DFD} is the parameter controlling the contribution of this term.

Tracked Foreground Difference. The term Pr(F_{t-1} | D_t, F_t) models the temporal coherence between the segmented foreground masks F_{t-1} and F_t at two consecutive time instants. It states that a region site in the current frame will probably keep the same label as its corresponding region, obtained through motion compensation using D_t, in the previous frame. Therefore, we define the energy function of Pr(F_{t-1} | D_t, F_t) as

U_{TFD}(F_{t-1} \mid D_t, F_t) = - \lambda_{TFD} \sum_{s_i \in V} \sum_{(x,y) \in s_i} \delta(F_{t-1}(x', y') - F_t(x, y)),    (7)

where δ(x) is the Kronecker delta function and (x', y') is the pixel corresponding to (x, y) ∈ s_i in the previous frame, that is, (x', y') = (x, y) - D_t(s_i). 4.2
Spatial Constraint
The spatial constraint expresses the property that adjacent region sites tend to have the same label. In other words, it encourages smooth motion estimates in D_t and contiguous components in F_t.

Motion Smoothness. Pr(D_t) represents the prior probability of the motion vector field, and the motion smoothness constraint is imposed to define this term. Suppose that s_i and s_j are two neighboring region sites and b_{ij} denotes the length of the common border between s_i and s_j. We define Pr(D_t) by a Gibbs distribution with the energy function

U_{MS}(D_t) = \lambda_{MS} \sum_{s_i \in V} \sum_{s_j \in N(s_i)} b_{ij}\, \| D_t(s_i) - D_t(s_j) \|,    (8)

where ||·|| is the Euclidean distance.

Foreground Smoothness. As for Pr(F_t), we relate it to foreground smoothness, which means that two neighboring region sites should be likely
assigned the same label with high probability. Similarly to Pr(D_t), Pr(F_t) is defined by a Gibbs distribution with the energy function

U_{FS}(F_t) = \lambda_{FS} \sum_{s_i \in V} \sum_{s_j \in N(s_i)} V_{FS}(s_i, s_j),    (9)

where

V_{FS}(s_i, s_j) = \begin{cases} -b_{ij} & \text{if } F_t(s_i) = F_t(s_j) \\ +b_{ij} & \text{otherwise.} \end{cases}    (10)
Once the energy functions modeling the interdependency among the aforementioned variables are defined, the next step is to find a solution by maximizing (4). However, due to the nature of the problem, this probability distribution is non-convex, and it is difficult to estimate D_t and F_t simultaneously. Therefore, we employ an iterative optimization strategy [16] to find the solution, which yields the result of the foreground segmentation.
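As an illustration, the total energy decreased by such an iterative strategy could be evaluated as sketched below; the region, neighbour and border-length structures are assumed to come from the RAG of Section 2, `log_likelihood` stands in for the region likelihood of [15], grayscale frames are assumed, and the λ weights are free parameters.

```python
import numpy as np

def total_energy(regions, F_t, D_t, F_prev, I_t, I_prev, neighbours, border_len,
                 log_likelihood, lam_dfd=1.0, lam_tfd=1.0, lam_ms=1.0, lam_fs=1.0):
    """Sum of the observation, DFD (5)-(6), TFD (7), motion-smoothness (8) and
    foreground-smoothness (9)-(10) terms for a candidate (D_t, F_t).
    regions: {s: [(x, y), ...]}, D_t: {s: (dx, dy)}, F_t: {s: 'f' or 'b'},
    F_prev: {(x, y): label}, neighbours: [(s_i, s_j)], border_len: {(s_i, s_j): b_ij}."""
    E = 0.0
    h, w = I_t.shape[:2]
    for s, pixels in regions.items():
        dx, dy = D_t[s]
        E += -log_likelihood(s, F_t[s])              # observation term, -log Pr(I_t | F_t)
        for (x, y) in pixels:
            xp, yp = int(x - dx), int(y - dy)        # motion-compensated location
            if 0 <= xp < w and 0 <= yp < h:
                diff = float(I_t[y, x]) - float(I_prev[yp, xp])
                E += lam_dfd * diff ** 2                          # Eq. (5)-(6)
                E -= lam_tfd * (F_prev.get((xp, yp)) == F_t[s])   # Eq. (7)
    for si, sj in neighbours:                         # adjacent region pairs
        b = border_len[(si, sj)]
        E += lam_ms * b * np.linalg.norm(np.subtract(D_t[si], D_t[sj]))  # Eq. (8)
        E += lam_fs * (-b if F_t[si] == F_t[sj] else b)                  # Eq. (9)-(10)
    return E
```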
5
Experiment
This section presents experimental results of the proposed approach. Both a subjective evaluation and an objective evaluation in comparison with Lee's approach [17] are presented to demonstrate the effectiveness of our approach. 5.1
Subjective Evaluation
The first image sequence we use here is the Hall Monitoring MPEG-4 test sequence in CIF format at 10 fps. In Fig. 3, the first row illustrates the original frames, and the numbers shown in the top-left corner are the frame numbers. In particular, frame 10 is used here to demonstrate that our approach can automatically detect newly introduced objects. In Lee's approach, by contrast, the noisy pixels due to light fluctuation (red circles in Fig. 3(b)) and the shadow areas (blue circles in Fig. 3(b)) are both mis-classified as foreground regions. It can therefore be seen that, by performing foreground segmentation at a more semantic region level and using an appropriate scaling factor, both the noise and the shadow effect can be successfully removed by our proposed approach, as shown in Fig. 3(c). 5.2
Objective Evaluation
Besides the subjective evaluation, we take two videos from the contest held by IPPR (Image Processing and Pattern Recognition), Taiwan, 2006, with manually generated ground-truth images, to validate the effectiveness of our approach from an objective viewpoint. Here, an error rate ε similar to [1] is adopted for the objective evaluation. It is defined as ε = N_e / N_I, where N_I is the frame size and N_e is the number of mis-classified pixels. The two videos, with frame size
Fig. 3. Hall Monitoring Sequence. (a) are frames 10, 20, 40, 60, and 80. (b) and (c) are the segmentation results of Lee’s approach and our proposed approach, respectively.
Fig. 4. IPPR Contest Sequence One. (a) are original images exhibiting similar appearance and large movement; (b) and (c) are the segmentation results of Lee’s approach and our approach, respectively.
320 × 240 from the IPPR contest (http://www.ippr.org.tw/), are unfortunately of low quality and exhibit severe noise. The challenge of the first video lies in that a person has an appearance similar to the background and the foreground objects undergo large movements in the scene. However, by modeling the interdependency between D_t and F_t, we can
Fig. 5. Error Rate
obtain accurate motion estimates that facilitate the foreground segmentation, as demonstrated in the bottom-right picture in Fig. 4(c). The error rate curves over 150 frames for Lee's approach and our approach are shown in Fig. 5(a); the average error rates are 1.1% and 0.5%, respectively. The second IPPR contest sequence shows two persons walking through a gallery. The strong light shed through the windows on the left is blocked by the walking persons, and severe shadow areas are formed on the ground and wall. Due to space limitations, the results for this video sequence are omitted from this paper. For this video, the average error rates (see Fig. 5(b)) of Lee's approach and our approach are 2.3% and 1.0%, respectively.
6
Conclusion
In this paper, we propose a region detection algorithm that incorporates moving edges with intensity differences to extract the regions of interest for subsequent foreground segmentation at the region level. This greatly reduces the influence of noise and at the same time reduces the computational complexity. Under the proposed Bayesian network representing the crucial interdependency of the variables involved, the foreground segmentation task is naturally formulated as an inference problem with imposed spatiotemporal constraints corresponding to the abovementioned interdependency relationships. The solution to this problem is obtained by an iterative optimization strategy that maximizes the objective function of D_t and F_t under the MAP criterion. The main advantage is that motion estimation and foreground segmentation are achieved simultaneously in a mutually supporting manner, so that more accurate results can be obtained.
Acknowledgement This work was supported by the National Science Council under the project NSC 94-2752-E-002-007-PAE and NSC 96-2516-S-390-001-MY3.
References
1. Chien, S.Y., Ma, S.Y., Chen, L.G.: Efficient Moving Object Segmentation Algorithm Using Background Registration Technique. IEEE Trans. on Circuits and Systems for Video Technology 12(7), 577–586 (2002)
2. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Real-Time Surveillance of People and Their Activities. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(8), 809–830 (2000)
3. Benedek, C., Sziranyi, T.: Markovian Framework for Foreground-Background-Shadow Separation of Real World Video Scenes. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 898–907. Springer, Heidelberg (2006)
4. Friedman, N., Russell, S.: Image Segmentation in Video Sequence: A Probabilistic Approach. In: Intl. Conf. on Uncertainty in Artificial Intelligence (1997)
5. Stauffer, C., Grimson, W.: Adaptive Background Mixture Models for Real-Time Tracking. IEEE Conf. on Computer Vision and Pattern Recognition 2, 246–252 (1999)
6. Elgammal, A., Harwood, D., Davis, L.S.: Non-parametric Model for Background Subtraction. European Conf. on Computer Vision 2, 571–767 (2000)
7. Sheikh, Y., Shah, M.: Bayesian Modeling of Dynamic Scenes for Object Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(1), 1778–1792 (2005)
8. Wang, J.Y.A., Adelson, E.H.: Representation Moving Images with Layers. IEEE Trans. on Image Processing 3(5), 625–638 (1994)
9. Tsaig, Y., Averbuch, A.: Automatic Segmentation of Moving Objects in Video Sequences: A Region Labeling Approach. IEEE Trans. on Circuits and Systems for Video Technology 12(7), 597–612 (2002)
10. Patras, I., Hendriks, E.A., Lagendijk, R.L.: Video Segmentation by MAP Labeling of Watershed Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(3), 326–332 (2001)
11. Altunbasak, Y., Eren, P.E., Tekalp, A.M.: Region-Based Parametric Motion Segmentation Using Color Information. Graphical Models and Image Processing: GMIP 60(1), 13–23 (1998)
12. Chen, P.C., Su, J.J., Tsai, Y.P., Hung, Y.P.: Coarse-To-Fine Video Object Segmentation by MAP Labeling of Watershed Regions. Bulletin of the College of Engineering, N.T.U. 90, 25–34 (2004)
13. Adams, R., Bischof, L.: Seeded Region Growing. IEEE Trans. on Pattern Analysis and Machine Intelligence 16(6), 641–647 (1994)
14. Wang, Y., Loe, K.F., Tan, T., Wu, J.K.: Spatiotemporal Video Segmentation Based on Graphic Model. IEEE Trans. on Image Processing 14(7), 937–947 (2005)
15. Huang, S.S., Fu, L.C., Hsiao, P.Y.: A Bayesian Network for Foreground Segmentation. In: IEEE Intl. Conf. on Systems, Man, and Cybernetics. IEEE Computer Society Press, Los Alamitos (2006)
16. Chang, M.M., Tekalp, A.M., Sezan, M.I.: Simultaneous Motion Estimation and Segmentation. IEEE Trans. on Image Processing 6(9), 1326–1333 (1997)
17. Lee, D.S.: Effective Gaussian Mixture Learning for Video Background Subtraction. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(5), 827–832 (2005)
Efficient Graph Cuts for Multiclass Interactive Image Segmentation
Fangfang Lu¹, Zhouyu Fu¹, and Antonio Robles-Kelly¹,²
¹ Research School of Information Sciences and Engineering, Australian National University
² National ICT Australia, Canberra Research Laboratory, Canberra, Australia
Abstract. Interactive Image Segmentation has attracted much attention in the vision and graphics community recently. A typical application for interactive image segmentation is foreground/background segmentation based on user specified brush labellings. The problem can be formulated within the binary Markov Random Field (MRF) framework which can be solved efficiently via graph cut [1]. However, no attempt has yet been made to handle segmentation of multiple regions using graph cuts. In this paper, we propose a multiclass interactive image segmentation algorithm based on the Potts MRF model. Following [2], this can be converted to a multiway cut problem first proposed in [2] and solved by expansion-move algorithms for approximate inference [2]. A faster algorithm is proposed in this paper for efficient solution of the multiway cut problem based on partial optimal labeling. To achieve this, we combine the one-vs-all classifier fusion framework with the expansion-move algorithm for label inference over large images. We justify our approach with both theoretical analysis and experimental validation.
1
Introduction
Image segmentation is a classical problem in computer vision. The purpose is to split the pixels into disjoint subsets so that the pixels in each subset form a region with similar and coherent features. This has usually been posed as a problem of pixel-level grouping based on certain rules or affinity measures. The simplest way to do this is region growing starting from some initial seeds [3]. More sophisticated clustering methods were introduced later to solve the grouping problem, characterized by the work of Belongie et al. on the EM algorithm for segmentation [4] and of Shi and Malik on normalized cuts [5]. While early work on segmentation mostly focused on unsupervised learning and clustering, a new paradigm of interactive segmentation has recently come into fashion following the pioneering work of Boykov and Jolly [1]. In [1], pixels are labeled by the user with a brush to represent the foreground and background regions. The purpose is to assign the remaining unlabeled pixels to the foreground/background classes. This can be treated as a supervised version
National ICT Australia is funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
of segmentation, providing more accuracy and robustness through user interaction. A number of methods have been proposed to extend [1] in different ways, including grab cut [6], which presents a more convenient mode of user interaction than brush labellings, and lazy snapping [7], which applies an over-segmentation as a preprocessing step. Most of the previous methods for interactive segmentation consider only binary segmentation, i.e. foreground and background. An exception is the random walk segmentation algorithm proposed by Grady in [8], which can cope with multiclass segmentation. However, it is based on solving linear systems of equations and is not as fast as graph cuts. To our knowledge, the problem of interactive segmentation with multiple classes or regions under the graph cut framework has not yet been studied. In this paper, we generalize binary interactive image segmentation to multiclass segmentation based on the Markov random field (MRF) model, which can then be treated within the multiway cut framework of [2]. Moreover, we develop a strategy for the efficient computation of the multiway cut cost which combines the one-vs-all classifier fusion framework with the expansion-move algorithm proposed in [9]. Some theoretical analysis is also presented on the optimality of the proposed approach. The remainder of this paper is organized as follows. In Section 2, we briefly present the background material on the MRF model for binary segmentation and the graph cut algorithm. In Section 3, we propose the model for multiclass interactive segmentation and the details of the labeling algorithm. Experimental results are provided in Section 5. Conclusions are given in Section 6.
2
Binary Image Segmentation Using Graph Cuts
2.1
An MRF Formulation
Markov random fields (MRFs) are undirected graphical models. They provide a probabilistic framework for solving labeling problems over graph nodes. Let G = ⟨V, E⟩ denote a graph with node-set V = {V_1, ..., V_N} and edge-set E = {E_{u,v} | u, v ∈ V}. Each node u ∈ V is associated with a label variable X_u which takes a discrete value in {1, ..., K}, where K is the number of label classes. Each E_{u,v} ∈ E represents a pairwise relationship between node u and node v. The optimal set of labels is obtained by maximizing the following function of joint distributions over the label set:

\max_{X} \; P(X) = \frac{1}{Z} \prod_{E_{u,v} \in E} \psi_{u,v}(X_u, X_v) \prod_{u \in V} \phi_u(X_u),    (1)

where φ_u(X_u) and ψ_{u,v}(X_u, X_v) are unary and binary potential functions which determine the compatibility of the assignment of nodes in the graph to the label classes, and u ∼ v implies that u and v are neighbors, i.e. E_{u,v} ∈ E. Z = \sum_X \prod_{u,v \in E} \psi_{u,v}(X_u, X_v) \prod_{u \in V} \phi_u(X_u) is the unknown normalization factor, which is independent of the assignment of X and can thus be omitted for the purpose of inference.
Taking the negative logarithm of Equation 1, we can convert the MRF formulation into an equivalent energy minimization problem:

\min_{X} \; -\log P(X) = \sum_{u \in V} c_u(X_u) + \sum_{u \sim v} w_{u,v}(X_u, X_v),    (2)

where c_u(X_u) = -\log φ_u(X_u) and w_{u,v}(X_u, X_v) = -\log ψ_{u,v}(X_u, X_v). Here the binary weight term w_{u,v}(a, b) is determined by the interactions between neighboring nodes. A special form of the binary interaction term is the Potts prior for smoothness constraints, which will be introduced shortly. In this paper, we focus on the Potts MRF model for multiclass image segmentation. 2.2
Solving Binary MRF with Graph Cuts
We focus in this section on the general Potts prior MRF with the following form of the binary interaction term:

w_{u,v}(a, b) = \lambda_{u,v}\,\delta(a, b), \qquad \delta(a, b) = \begin{cases} 1 & a \ne b \\ 0 & a = b \end{cases}, \qquad a, b \in \{0, 1\}.    (3)
This simply penalizes disparate labels of adjacent nodes and enforces smoothness of the labels across a local neighborhood. The conventional Potts prior is a special case of the above equation with constant λ_{u,v}'s across all neighboring sites u and v. The MRF cost function in Equation 2 can then be re-written for the Potts model as

\min_{X} \; -\log P(X) = \sum_{u=1}^{N} c_u(X_u) + \sum_{u \sim v} \lambda_{u,v}\,\delta(X_u, X_v).    (4)
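For a 4-connected pixel grid with a constant λ, the Potts energy of Eq. (4) can be evaluated directly, as in this short NumPy sketch.

```python
import numpy as np

def potts_energy(unary, labels, lam=1.0):
    """Eq. (4): sum of unary costs plus lambda times the number of
    4-connected neighbour pairs with different labels.
    unary: (H, W, K) array of c_u(k);  labels: (H, W) integer label map."""
    h, w, _ = unary.shape
    data_term = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    cuts = (labels[:, :-1] != labels[:, 1:]).sum() + (labels[:-1, :] != labels[1:, :]).sum()
    return data_term + lam * cuts
```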
The above cost function can be naturally linked with graph cuts. To see this, we first augment the original undirected graph G = ⟨V, E⟩ defined over the image pixels, where each pixel is treated as a node in the node-set V, by two additional nodes s and t and edge-links between each pixel and the two terminal nodes. The graph nodes are then split into two disjoint subsets V_1* = {V_1, s} and V_2* = {V_2, t} such that V_1 ∪ V_2 = V. The cut metric C(G) of the graph G is then defined as the weighted sum of the edges linking nodes in subset V_1* and nodes in V_2*:

C(G) = \sum_{u \in V_1^*,\, v \in V_2^*} w_{u,v} = \sum_{u \in V_1} w_{u,t} + \sum_{u \in V_2} w_{u,s} + \sum_{u \in V_1,\, v \in V_2} w_{u,v},    (5)
where w_{u,t} and w_{u,s} define the edge weights between node u and the terminal nodes, and w_{u,v} defines the edge weight between node u and node v in different subsets. Suppose s and t represent the binary classes 0 and 1, respectively, and define w_{u,s} = c_u(1), w_{u,t} = c_u(0) and w_{u,v} = λ_{u,v}. Then minimizing the cost function in Equation 4 is equivalent to minimizing the cut metric C(G) in Equation 5. As a result, pixel
u is assigned to class 1 if it is both linked to the terminal node t representing class 1 and broken from the terminal node s representing class 0. Graph min-cut is a well studied problem and is equivalent to the max-flow problem, which can be solved efficiently in polynomial time [10]. For binary classes, we can obtain the globally optimal solution for the min-cut. An example of a binary cut for a toy problem of 9 nodes is illustrated in Figure 1(a). For the sake of visualization, terminal links that have been severed are not shown here.
(a) Binary Cut
(b) Multiway Cut
Fig. 1. Illustration of Binary and Multiway Cuts
In interactive segmentation, we have to keep the membership of the user-specified pixels fixed, i.e. brush-labeled foreground pixels must remain in the foreground, and likewise for labeled background pixels. These hard constraints on the assignment of labeled pixels can be naturally enforced within the graph cut framework by manipulating the edge weights of the links between the nodes of labeled pixels and the terminal nodes. For each labeled foreground (class 1) pixel u_f, we can set w_{u_f,t} to an infinitely large value and w_{u_f,s} to 0, so that u_f always remains linked to the foreground node t; otherwise an infinite cost would be incurred by assigning u_f to the background node s. We can set the weights of the terminal links for labeled background pixels in a similar way. To summarize, we list the rules for setting the edge weights for binary interactive segmentation in Table 1(a).
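A sketch of this construction is given below, using NetworkX's minimum cut as the solver; `BIG` plays the role of the infinite weight for the hard constraints, and the unary costs and λ values are assumed to be precomputed.

```python
import networkx as nx

def binary_interactive_cut(n_pixels, edges, cost0, cost1, fg_seeds, bg_seeds, BIG=1e9):
    """Binary segmentation by s-t min-cut with the edge weights of Table 1(a).
    edges: list of (u, v, lambda_uv); cost0/cost1: c_u(0), c_u(1) per pixel;
    fg_seeds/bg_seeds: indices of brush-labelled pixels (classes 1 and 0)."""
    G = nx.DiGraph()
    s, t = "s", "t"                              # s ~ class 0, t ~ class 1
    fg, bg = set(fg_seeds), set(bg_seeds)
    for u in range(n_pixels):
        # t-link weights: an infinite weight keeps a seed attached to its terminal.
        w_t = BIG if u in fg else (0.0 if u in bg else cost0[u])
        w_s = BIG if u in bg else (0.0 if u in fg else cost1[u])
        G.add_edge(u, t, capacity=w_t)
        G.add_edge(t, u, capacity=w_t)
        G.add_edge(s, u, capacity=w_s)
        G.add_edge(u, s, capacity=w_s)
    for u, v, lam in edges:                      # n-links between neighbouring pixels
        G.add_edge(u, v, capacity=lam)
        G.add_edge(v, u, capacity=lam)
    _, (s_side, t_side) = nx.minimum_cut(G, s, t)
    return {u: (1 if u in t_side else 0) for u in range(n_pixels)}
```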
3
Multiclass Interactive Image Segmentation
3.1
A Multiway Cut Formulation for Multiclass Segmentation
The graph cut framework proposed for binary segmentation in the last section can be extended to multiclass segmentation in a straightforward manner via multiway cuts. Multiway cuts are the generalization of binary cuts and were first proposed by Boykov et al. in [2]. For m-way multiway cuts, an extra node-set of m terminal nodes T = {t_1, ..., t_m} is inserted into the original graph G = ⟨V, E⟩, and links are established from each t-node t_j (j = 1, ..., m) to each node in the node-set V of graph G. The purpose is to split the nodes of this
Table 1. (a) Edge weights for binary segmentation. (b) Edge weights for multiclass segmentation.

(a)
edge       weight          for
{u, v}     λ_{u,v}         u, v ∈ V and u ∼ v
{u, t}     ∞               u ∈ C1
           0               u ∈ C0
           c_u(0)          otherwise
{u, s}     ∞               u ∈ C0
           0               u ∈ C1
           c_u(1)          otherwise

(b)
edge        weight           for
{u, v}      λ_{u,v}          u, v ∈ V and u ∼ v
{u, t_1}    ∞                u ∈ C1
            0                u ∈ C_j (j ≠ 1)
            d_u − c_u(1)     otherwise
...
{u, t_m}    ∞                u ∈ C_m
            0                u ∈ C_j (j ≠ m)
            d_u − c_u(m)     otherwise
augmented graph into m disjoint subsets V_1* = {t_1, V_1}, ..., V_m* = {t_m, V_m}, such that each subset contains one and only one t-node from T and V_1 ∪ ... ∪ V_m = V. An example of a 3-way cut is illustrated in Figure 1(b). According to [2], we can establish the link between the Potts MRF cost function in Equation 4 and the multiway cut metric as follows. For an arbitrary node u in V, after the multiway cut it remains linked to only a single terminal node, and the edges linking it to all other terminal nodes are severed. Define

d_u = \sum_{j=1}^{m} c_u(j),    (6)
the total sum of the costs of assigning pixel u to all the different classes. We then define the edge weights of the t-links as w_{u,t_j} = d_u − c_u(j), while the edge weights between neighboring nodes in V remain the same as in the binary cut, i.e. w_{u,v} = λ_{u,v}. The multiway cut metric then becomes

C(G) = \sum_{\forall i,j \in \{1,...,m\}} \; \sum_{u \in V_i^*,\, v \in V_j^*} w_{u,v}
     = \sum_{i=1}^{m} \sum_{u \in V_i} \sum_{j=1,\, j \ne i}^{m} w_{u,t_j} + \sum_{i,j} \sum_{u \in V_i,\, v \in V_j} w_{u,v}
     = \sum_{u \in V} \sum_{j \ne X_u} (d_u - c_u(j)) + \sum_{u \sim v} \lambda_{u,v}\,\delta(X_u, X_v)
     = (m-2) \sum_{u \in V} d_u + \sum_{u \in V} c_u(X_u) + \sum_{u \sim v} \lambda_{u,v}\,\delta(X_u, X_v).    (7)
Since du is independent of label vectors Xu ’s, the multiway cut is equivalent to the multiclass Potts MRF cost defined in Equation 4 up to a constant additive factor. In the case of binary cut where m = 2, the two costs are exactly the same. Thus, inference of a multiclass Potts MRF model can be recast as a multiway cut problem. We can also encode hard constraints for labeled pixels into the edge weights of t-links in exactly the same way as we did for binary segmentation.
The resulting rules for choosing edge weights for multiclass segmentation tasks are summarized in Table 1(b).
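As a small illustration, the t-link weights of Table 1(b) could be assembled as follows; `BIG` again stands in for the infinite weight used for the hard constraints.

```python
import numpy as np

def multiway_tlink_weights(unary, seed_class, BIG=1e9):
    """unary: (n_pixels, m) array of costs c_u(j); seed_class[u] = j if pixel u is
    brush-labelled with class j, -1 otherwise. Returns (n_pixels, m) t-link weights
    w_{u,t_j} following Table 1(b): d_u - c_u(j) for free pixels, BIG/0 for seeds."""
    n, m = unary.shape
    d = unary.sum(axis=1, keepdims=True)             # d_u = sum_j c_u(j), Eq. (6)
    w = d - unary                                     # w_{u,t_j} = d_u - c_u(j)
    seeded = seed_class >= 0
    w[seeded, :] = 0.0                                # zero links to the other terminals
    w[seeded, seed_class[seeded]] = BIG               # infinite link to the seed's own terminal
    return w
```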
4
Efficient Implementation of Multiway Cuts
Although multiclass interactive image segmentation can be transformed the multiway cut problem, computing the minimum value of multiway cuts is a NP hard problem. An expansion-move algorithm was proposed by Boykov et.al. in [9] to iteratively update the current labels for lower cost. However, a number of expansion and move steps have to be computed alternately to recover a suboptimal labeling for the purpose of multiclass segmentation. Each step involves solving a graph cut over all pixels in the image. This is quite inefficient for images with moderately large sizes. To make efficient inference of multiway cut without compromising on the accuracy, we present an alternative inference algorithm in this section. Strategy. Our inference algorithm makes use of the one-against-all classifier fusion framework to transform the multiway cut problem into a set of binary cut subproblems. Each binary cut subproblem has two classes, where the positive class is chosen from one of the original label classes, and the negative class is constructed by the rest of the classes. The global optimal solution of the binary subproblem can be obtained by using the max flow algorithm in [10]. We can then obtain a partial solution of the multiway cut problem by voting on the binary cut results. Pixels with consistent labeling across all binary cuts are then assigned with the consistent labels. To resolve the ambiguities in the labels of remaining pixels, we build a graph over pixels with ambiguous labels apply the expansion-move algorithm [9] to solve the multiway cut problem for the labels of the ambiguous pixels with the labels of all other pixels fixed, which is more efficient than applying the expansion-move operations to all unlabeled pixels from scratch and more accurate than resolving the ties arbitrarily. The algorithm is summarized in the Table 2. Analysis. Several points deserve further explanation here. To obtain the cost cu (j) of assigning pixel u to each different region j, we first compute the posterior probability bu,j of pixel u in class j, which can obtained via the following Bayes rule P (u|Cj ) P (u|Cj )P (Cj ) = (9) bu,j = j P (u|Cj )P (Cj ) j P (u|Cj ) 1 is the prior probability of class j, which is assumed to be equal m for every class. P (u|Cj ) the conditional probability of pixel u in the jth region. The class conditional probabilities are obtained by indexing the normalised color histogram built from labeled pixels in the region. The edge weight term λu,v is given by ||I(v) − I(u)||2 ||pv − pu ||2 − ) (10) λu,v = βexp(− σc2 σs2 where p(Cj ) =
Table 2. Efficient Graph Cut Algorithm for Multiclass Segmentation
Input: image I with pixels labeled for m different regions.
1. Compute the cost c_u(j) of assigning each unlabeled pixel u to the jth region.
2. Repeat the following operations for each region index j:
   – Establish a binary graph cut problem with the jth region as class 1 and all other regions as class 0. Set c^*_u(1) = c_u(j) and c^*_u(0) = \frac{1}{m-1} \sum_{i \ne j} c_u(i) as the new unary costs for pixel u, and \lambda^*_{u,v} = \big(1 - \frac{1}{2m-2}\big)\,\lambda_{u,v} as the binary costs for neighboring pixels u and v. The meaning of this choice of weights will become evident later.
   – Solve the binary cut problem with the new unary and binary costs and recover the label X_u^{(j)} for each pixel u.
3. Vote on X_u^{(1)}, ..., X_u^{(m)} from the results of the m binary cuts:

X_u = \begin{cases} j & \text{if } X_u^{(j)} = 1 \text{ and } X_u^{(i)} = 0 \text{ for } i \ne j \\ \text{undefined} & \text{otherwise.} \end{cases}    (8)

4. Add the unlabeled pixels with resolved labels to the set of labeled pixels. Run an additional graph cut optimization over the subset of unresolved pixels to obtain their labels, using the neighboring pixels with known labels as hard constraints.
Output: labels X_u for each pixel u in image I
where I(u) and p_u denote the pixel value and the coordinates of pixel u, σ_c and σ_s are two bandwidth parameters, and β gauges the relative importance of data consistency and smoothness; β is set to 10 in the following experiments. The above choice of binary weight accounts for the image contrast and favors segmentation along edges, as justified in [1]. After obtaining the posterior probabilities for pixel u, we can compute the cost of assigning u to class j as c_u(j) = \sum_{i \ne j} P(u | C_i); that is, the larger the posterior probability of class j, the smaller the cost of assigning u to that class. Notice that the costs for the different classes are normalized such that \sum_{i=1}^{m} c_u(i) = 1. In Step 2 of the above algorithm, we have to solve m binary cut problems. Notice that any two binary problems share the same graph topology and differ only in the weights of the edges linked to the terminal nodes; hence, starting from the second binary cut subproblem, we can capitalize on the solution of the previous subproblem when solving the current one. This is achieved by the dynamic graph cuts framework proposed by Kohli and Torr in [11], which updates the labels iteratively from previous results instead of starting from scratch. After running the binary cuts for all m subproblems, we get m votes for the label of each pixel u. If the votes are consistent, i.e. Equation 8 is satisfied for a certain
region j which gets all m votes from the binary cut results, then we can assign pixel u to the jth region with confidence. We can also prove the following.

Proposition. The cost function of the multiway cut is the aggregation of the costs of the binary cut subproblems, up to a scaling and translation factor.

Proof. The cost of the multiway cut problem is given by
f(X) = \sum_{u} c_u(X_u) + \sum_{u,v \,|\, u \sim v} \lambda_{u,v}\,\delta(X_u, X_v) = \sum_{j=1}^{m} \sum_{u \,|\, X_u = j} c_u(j) + \sum_{u,v \,|\, u \sim v,\, X_u \ne X_v} \lambda_{u,v},

where a | b is a notation indicating a sum over variable(s) a given condition(s) b. The cost of the jth binary cut is given by

f^{(j)}(X) = \sum_{u \,|\, X_u = j} c^*_u(1) + \sum_{u \,|\, X_u \ne j} c^*_u(0) + \sum_{u,v \,|\, u \sim v,\, X_u = j,\, X_v \ne j} \lambda^*_{u,v}
           = \sum_{u \,|\, X_u = j} c_u(j) + \frac{1}{m-1} \sum_{u \,|\, X_u \ne j} \sum_{i \,|\, i \ne j} c_u(i) + \sum_{u,v \,|\, u \sim v,\, X_u = j,\, X_v \ne j} \lambda^*_{u,v}.

The sum over all the binary costs is then given by

\sum_{j=1}^{m} f^{(j)}(X) = \sum_{j=1}^{m} \Big[ \sum_{u \,|\, X_u = j} c^*_u(1) + \sum_{u \,|\, X_u \ne j} c^*_u(0) + \sum_{u,v \,|\, u \sim v,\, X_u = j,\, X_v \ne j} \lambda^*_{u,v} \Big]
  = \frac{m-2}{m-1}\,N + \Big( 1 + \frac{m-2}{m-1} \Big) \Big[ \sum_{j=1}^{m} \sum_{u \,|\, X_u = j} c_u(j) + \sum_{u,v \,|\, u \sim v,\, X_u \ne X_v} \lambda_{u,v} \Big],

where N is the total number of pixels under study. The second row of the above equation holds due to the identity below, which is true for normalized values of c_u(j) over j:

\sum_{j=1}^{m} \sum_{u \,|\, X_u \ne j} \sum_{i \,|\, i \ne j} c_u(i) = (m-2)\,N + (m-2) \sum_{j=1}^{m} \sum_{u \,|\, X_u = j} c_u(j).
This concludes the proof. The above proposition states that minimizing f(X) is equivalent to minimizing \sum_j f^{(j)}(X), due to their linear relation. Hence we can treat the consistent labels over all binary cuts as an approximate partial optimal labeling and solve for the labels of the few remaining inconsistent pixels. This can be done using the expansion-move algorithm for the multiway cut in [9]. As a result, we only need to build a much smaller graph to resolve the ambiguities in the labeling of the remaining pixels together with their labeled neighbors, as compared with the total number of unlabeled pixels from the beginning.
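The voting scheme of Table 2 then reduces to a few lines, as sketched below; `solve_binary_cut` is assumed to be any binary s-t cut solver with the stated interface, and the clean-up over ambiguous pixels is delegated to an `expansion_move` routine.

```python
import numpy as np

def multiclass_by_voting(unary, edges, solve_binary_cut, expansion_move):
    """unary: (n_pixels, m) costs c_u(j); edges: list of (u, v, lambda_uv).
    Returns a full label vector; ambiguous pixels are resolved by expansion moves."""
    n, m = unary.shape
    votes = np.zeros((n, m), dtype=int)
    for j in range(m):
        c1 = unary[:, j]                                    # c*_u(1) = c_u(j)
        c0 = (unary.sum(axis=1) - unary[:, j]) / (m - 1)    # c*_u(0)
        scale = 1.0 - 1.0 / (2 * m - 2)
        bin_edges = [(u, v, scale * lam) for u, v, lam in edges]
        x_j = solve_binary_cut(c0, c1, bin_edges)           # 0/1 label per pixel
        votes[np.asarray(x_j) == 1, j] = 1
    labels = np.full(n, -1, dtype=int)
    consistent = votes.sum(axis=1) == 1                     # exactly one positive vote
    labels[consistent] = votes[consistent].argmax(axis=1)
    ambiguous = np.where(~consistent)[0]
    if len(ambiguous) > 0:                                  # Step 4: small multiway cut
        labels = expansion_move(unary, edges, labels, ambiguous)
    return labels
```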
5
Experimental Results
We present experimental results for the multiclass interactive segmentation of 7 real-world color images. Pixels were selected by the user with scribbles to define the different regions. The original images overlaid with the scribble labellings are shown in the left column of Figure 2, where differently colored brushes indicate different regions in the image. First, we compared the efficiency of the proposed algorithm with the conventional expansion-move algorithm for multiclass image segmentation. Specifically, we adopted the α-expansion algorithm [9] over the alternative α-β swap algorithm in our implementation. We ran our experiments on an Intel P4 workstation with a 1.7 GHz CPU and 1 GB RAM. We used the max-flow code of Boykov and Kolmogorov [10] for solving the graph cuts for both binary inference and α-expansion, which also includes an implementation of the dynamic graph cuts algorithm of [11]. The wrapper code is written in Matlab. The average execution time over 5 trials for the segmentation of each image is reported in Table 3.

Table 3. Comparison of speed for the conventional expansion-move based multiway graph cut algorithm [9] and the proposed efficient graph cut algorithm

Image Name   Image Size     Classes   α-Expansion (s)   Efficient Graph Cuts (s)
face*        132 × 130      3         0.0976            0.0836
bush*        520 × 450      3         2.2765            0.9999
grave*       450 × 600      3         3.5721            1.1672
garden*      450 × 600      3         2.6871            1.1407
tower        800 × 600      3         5.5650            2.9180
lake         800 × 600      4         5.8270            3.4522
beach        1600 × 1200    4         26.3098           13.2942
From the table, we can see that the proposed efficient graph cut algorithm is consistently faster than the expansion-move algorithm, with a speed-up factor of roughly 2 on average. The two methods achieve comparable results for binary labeling problems and, not surprisingly, for the images we tested the two algorithms under study achieve similar segmentation results. Hence, in the following we only show the segmentation results of our algorithm compared with those produced by maximum-likelihood-estimation (MLE) based pixel-level segmentation, which is equivalent to the MRF model with zero second-order pairwise terms. The segmentation results of the pixel-level segmentation and of our method are shown in the middle and right columns of Figure 2, respectively. We visualize the segmentation results with an alpha matte αI + (1 − α)L, with α = 0.3, where I is the color of the image pixel and L is the color of the brush label of the region it belongs to. From these results, we can see that the proposed algorithm captures the structure of the regions labeled with the brush, while the pixel-level approach is prone to producing small scattered regions.
Fig. 2. Segmentation results on example images. Left Column: Original color images with brush labellings. Middle Column: Results of pixel based approach. Right Column: Results of multiclass graph cuts.
6
Conclusions
An energy minimization approach to multiclass image segmentation based on the Markov random field model has been proposed in this paper. The cost function of the MRF model can be optimized by solving an equivalent multiway graph cut problem. An efficient implementation of the multiway cut is developed, and experimental results demonstrate the encouraging performance of the proposed approach.
References
[1] Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in n-d images. In: Intl. Conf. on Computer Vision, pp. 105–112 (2001)
[2] Boykov, Y., Veksler, O., Zabih, R.: Markov random fields with efficient approximations. In: Intl. Conf. on Computer Vision and Pattern Recognition (1998)
[3] Gonzalez, R.C., Wintz, P.: Digital Image Processing. Addison-Wesley Publishing Company Limited, London, UK (1986)
[4] Belongie, S., Carson, C., Greenspan, H., Malik, J.: Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In: Proc. of Intl. Conf. on Computer Vision (1998)
[5] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
[6] Rother, C., Kolmogorov, V., Blake, A.: Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Trans. on Graphics 23(3), 309–314 (2004)
[7] Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. on Graphics 23(3) (2004)
[8] Grady, L.: Random walks for image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(11), 1768–1783 (2006)
[9] Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001)
[10] Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004)
[11] Kohli, P., Torr, P.: Efficiently solving dynamic markov random fields using graph cuts. In: Intl. Conf. on Computer Vision (2005)
Feature Subset Selection for Multi-class SVM Based Image Classification
Lei Wang
Research School of Information Sciences and Engineering, The Australian National University, ACT 0200, Australia
Abstract. Multi-class image classification can benefit much from feature subset selection. This paper extends an error bound of binary SVMs to a feature subset selection criterion for multi-class SVMs. By minimizing this criterion, the scale factors assigned to each feature in a kernel function are optimized to identify the important features. This minimization problem can be efficiently solved by gradient-based search techniques, even if hundreds of features are involved. In addition, considering that image classification is often a small-sample problem, the regularization issue is investigated for this criterion, showing its robustness in this situation. An experimental study on multiple benchmark image data sets demonstrates the effectiveness of the proposed approach.
1 Introduction
Multi-class image classification often involves (i) high-dimensional feature vectors: an image is often reshaped into a long vector, leading to hundreds of dimensions; and (ii) redundant information: only part of an image, for instance the foreground, is really useful for classification. In the past few years, multi-class Support Vector Machines (SVMs) have been successfully applied to image classification [1,2]. However, the feature subset selection issue, that is, identifying p important features from the original d ones, has not received enough attention.¹ Classical feature subset selection techniques may be used to find the important dimensions. For instance, in [3], a subset of 15 features is selected from 29 by using the wrapper approach. However, this approach often has a heavy computational load and cannot cope with image classification problems, which generally involve hundreds of features. Also, the selected features may not fit the SVM classifier well when the selection criteria have no connection with the SVMs. Recent advances in binary SVMs provide a ray of hope. In [4], a leave-one-out error bound of the binary SVMs (the radius-margin bound) is minimized to optimize the kernel parameters. The minimization is solved by gradient-based search techniques which can handle a large number of free kernel parameters.
¹ Please note that this is different from the problem of finding an optimal p-dimensional combination of the original d features, for instance in the way of PCA or LDA. In feature subset selection, the features are kept individual, and feature dependence is left to the classifier, for instance the SVMs, which handles it automatically.
Also, it is found that the optimized kernel parameters can reflect the importance of the features for classification. These properties are exactly what we are seeking. Unfortunately, this error bound cannot be straightforwardly applied to the multi-class scenario because it is rooted in the theoretical results of binary SVMs, for instance the VC-dimension theory upon which it is developed. A possible way may be to apply the radius-margin bound to each of the binary SVMs obtained in a decomposition of the multi-class SVMs. However, by doing so, the selected features will only be good at discriminating a certain pair of classes. Also, how to integrate these selection results is another issue. Hence, this paper seeks a one-shot feature selection from the perspective of multi-class classification. To realize this, this paper extends the radius-margin bound to a feature subset selection criterion for the multi-class SVMs. Firstly, the relationship between the binary radius-margin bound and the classical class separability concept is discussed. Enlightened by it, this work redefines the radius and margin to accommodate multiple classes, forming a new criterion. Its derivative with respect to the kernel parameters can also be calculated analytically, so that gradient-based search techniques remain applicable. Minimizing this criterion can efficiently optimize hundreds of kernel parameters simultaneously. These optimized parameters are then used to identify the important features. This approach is quite meaningful for practical image classification because it facilitates the reduction of system complexity and feature discovery from the perspective of multi-class classification. Experimental results on benchmark data sets demonstrate the effectiveness of this approach.
2 Background
Let D denote a set of m training samples, D = {(x_i, y_i)} ∈ (R^d × Y)^m, where R^d denotes a d-dimensional input space, Y denotes the label set of x, and the size of Y is the number of classes, c. A kernel is defined as k_θ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, where φ(·) is a possibly nonlinear mapping from R^d to a feature space F, and θ is the kernel parameter set. Due to the limit of space, the introduction of multi-class SVMs is omitted; this information can be found in [1,2].

The radius-margin bound. It is an upper bound of the test error estimated via a leave-one-out cross-validation procedure. Let L_m = L((x_1, y_1), ..., (x_m, y_m)) be the number of errors in this procedure. It is shown in [4] that

  L_m \le \frac{4R^2}{\gamma^2} = 4R^2\|w\|^2   (1)

where R is the radius of the smallest sphere enclosing all the m training samples in F, γ the margin, w the normal vector of the optimal separating hyperplane, and γ^{-1} = ‖w‖. The R² is obtained by solving

  R^2 = \max_{\beta \in \mathbb{R}^m} \sum_{i=1}^{m} \beta_i k_\theta(x_i, x_i) - \sum_{i,j=1}^{m} \beta_i \beta_j k_\theta(x_i, x_j)   (2)
  subject to: \sum_{i=1}^{m} \beta_i = 1;\ \beta_i \ge 0\ (i = 1, 2, \cdots, m)

where β_i is the i-th Lagrange multiplier. The ‖w‖² is obtained by solving

  \frac{1}{2}\|w\|^2 = \max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k_\theta(x_i, x_j)   (3)
  subject to: \sum_{i=1}^{m} \alpha_i y_i = 0;\ \alpha_i \ge 0\ (i = 1, 2, \cdots, m)

where α_i is the i-th Lagrange multiplier. The derivative of R² w.r.t. the t-th kernel parameter, θ_t, is shown in [4] as

  \frac{\partial R^2}{\partial \theta_t} = \sum_{i=1}^{m} \beta_i^0 \frac{\partial k_\theta(x_i, x_i)}{\partial \theta_t} - \sum_{i,j=1}^{m} \beta_i^0 \beta_j^0 \frac{\partial k_\theta(x_i, x_j)}{\partial \theta_t}   (4)

where β_i^0 is the solution of Eq. (2). Similarly, the derivative of ‖w‖² w.r.t. θ_t is

  \frac{\partial \|w\|^2}{\partial \theta_t} = -\sum_{i,j=1}^{m} \alpha_i^0 \alpha_j^0 y_i y_j \frac{\partial k_\theta(x_i, x_j)}{\partial \theta_t}   (5)

where y_i ∈ {+1, −1} is the label of x_i and α_i^0 is the solution of Eq. (3). Thus, the radius-margin bound can be efficiently minimized by using gradient search methods. As seen, this bound is rooted in binary SVMs and cannot be directly applied to the multi-class case. An extension of this bound to a feature selection criterion for multi-class SVMs is proposed below.
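As an illustration of how Eq. (2) can be evaluated in practice, the following is a minimal sketch (not from the paper) that solves the radius QP for a given kernel matrix with SciPy's SLSQP solver; the function name and setup are our own.

```python
import numpy as np
from scipy.optimize import minimize

def squared_radius(K):
    # Eq. (2): R^2 = max_beta  sum_i beta_i K_ii - beta' K beta
    #          s.t. sum_i beta_i = 1, beta_i >= 0
    m = K.shape[0]
    diag = np.diag(K)
    objective = lambda b: -(b @ diag - b @ K @ b)      # negate: SLSQP minimizes
    constraints = ({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},)
    bounds = [(0.0, None)] * m
    b0 = np.full(m, 1.0 / m)
    res = minimize(objective, b0, method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return -res.fun                                    # the squared radius R^2
```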
3 Proposed Approach
Class separability is a concept widely used in pattern recognition. The scatter matrix based measure is often favored thanks to its simplicity. The Between-class scatter matrix (S_B) and Total scatter matrix (S_T) are defined as

  S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^{\top},\quad S_T = \sum_{i=1}^{c}\sum_{j=1}^{n_i} (x_{ij} - m)(x_{ij} - m)^{\top}

where n_i is the size of class i, x_{ij} the j-th sample of class i, m_i the mean of class i, and m the mean of all c classes. The following derives the class separability in the feature space F. Let D_i be the training samples from class i and D be the set of all training samples. The m_i^φ and m^φ are the mean vectors of D_i and D in F, respectively. K_{A,B} is the kernel matrix where {K_{A,B}}_{ij} = k_θ(x_i, x_j) with the constraints x_i ∈ A and x_j ∈ B. The traces of S_B^φ and S_T^φ can be expressed as

  \mathrm{tr}(S_B^{\phi}) = \sum_{i=1}^{c} \frac{1}{n_i}\,\mathbf{1}^{\top} K_{D_i,D_i}\,\mathbf{1} - \frac{1}{n}\,\mathbf{1}^{\top} K_{D,D}\,\mathbf{1};\quad \mathrm{tr}(S_T^{\phi}) = \mathrm{tr}(K_{D,D}) - \frac{1}{n}\,\mathbf{1}^{\top} K_{D,D}\,\mathbf{1}   (6)

where n = \sum_{i=1}^{c} n_i and 1 is a column vector whose components are all 1. Based on these, the class separability in F can be written as C = tr(S_B^φ)/tr(S_T^φ). In the case of two classes, it can be shown that

  \mathrm{tr}(S_B^{\phi}) = \frac{n_1 n_2}{n_1 + n_2}\,\|m_1^{\phi} - m_2^{\phi}\|^2;\quad \mathrm{tr}(S_T^{\phi}) = \sum_{i=1}^{2}\sum_{j=1}^{n_i}\|\phi(x_{ij}) - m^{\phi}\|^2   (7)

and it can be proven (the proof is omitted) that

  \gamma^2 \le 4^{-1}\,\frac{n_1 + n_2}{n_1 n_2}\,\mathrm{tr}(S_B^{\phi});\quad R^2 \ge \frac{1}{n_1 + n_2}\,\mathrm{tr}(S_T^{\phi}).   (8)

That is, tr(S_B^φ) measures the squared distance between the class means in F, and tr(S_T^φ) measures the squared scattering radius of the training samples in F if divided by (n_1 + n_2). Conceptually, ‖m_1^φ − m_2^φ‖² and γ² reflect the similar property of data separability, and tr(S_T^φ) positively correlates with R². Theoretically, γ² is upper bounded by a functional of tr(S_B^φ) and R² is lower bounded by tr(S_T^φ)/(n_1 + n_2). When minimizing the radius-margin bound, γ² is maximized while R² is minimized. This requires tr(S_B^φ) to be maximized and tr(S_T^φ) to be minimized (although this does not necessarily have γ² maximized or R² minimized). Hence, minimizing the radius-margin bound can be viewed as approximately maximizing the class separability in F. Part of this analysis can also be seen in [5].

Enlightened by this analogy, this work extends the radius-margin bound to accommodate multiple classes. In the multi-class case, tr(S_T^φ)/n measures the squared scattering radius of all training samples in F. Considering the analogy between tr(S_T^φ) and R², the R² is redefined as the squared radius of the sphere enclosing all the training samples in the c classes,

  R_c^2 = \min_{\hat{c},\,\hat{R}^2:\ \forall x_i \in D,\ \|\phi(x_i) - \hat{c}\|^2 \le \hat{R}^2} (\hat{R}^2).   (9)

For tr(S_B^φ) in the multi-class case, it can be proven that

  \mathrm{tr}(S_B^{\phi}) = \sum_{1 \le i < j \le c} \frac{n_i n_j}{n}\,\|m_i^{\phi} - m_j^{\phi}\|^2.   (10)

Noting the analogy between ‖m_1^φ − m_2^φ‖² and γ², the margin is redefined as

  \gamma_c^2 = \sum_{1 \le i < j \le c} P_i P_j \gamma_{ij}^2 = \sum_{1 \le i < j \le c} P_i P_j \|w_{ij}\|^{-2}   (11)

where γ_{ij} is the margin between classes i and j, and P_i = n_i/n is the prior probability of the i-th class estimated from the training data. By doing so, a feature selection criterion for multi-class SVMs is obtained. The optimal kernel parameters are sought by minimizing this criterion,

  \theta = \arg\min_{\theta \in \Theta} \left(\frac{R_c^2}{\gamma_c^2}\right).   (12)

The derivative of this criterion w.r.t. the t-th kernel parameter, θ_t, is

  \frac{\partial}{\partial \theta_t}\left(\frac{R_c^2}{\gamma_c^2}\right) = \frac{1}{\gamma_c^4}\left(\gamma_c^2 \frac{\partial R_c^2}{\partial \theta_t} - R_c^2 \frac{\partial \gamma_c^2}{\partial \theta_t}\right).   (13)

The calculation of ∂R_c²/∂θ_t and ∂‖w_{ij}‖²/∂θ_t follows Eqs. (4) and (5). This criterion can therefore still be minimized by using gradient-based search techniques. Finally, please note that the obtained criterion is not necessarily an upper bound of the leave-one-out error of the multi-class SVMs. It is only a criterion reflecting the idea of the radius-margin bound and working for the multi-class case. An elegant extension of the radius-margin bound to multi-class SVMs is given in [6].

Regularization issue. When optimizing a criterion with multiple parameters, regularization is often needed to avoid over-fitting the training samples, especially when the number of training samples is small. To check whether this is also needed for this new criterion, a regularized version is given as

  J(\theta) = (1 - \lambda)\,\frac{R_c^2}{\gamma_c^2} + \lambda\,\|\theta - \theta_0\|^2   (14)

where λ (0 ≤ λ < 1) is the regularization parameter which penalizes the deviation of θ from a preset θ_0. Such a regularization imposes a Gaussian prior over the parameter θ to be estimated. When using the elliptical Gaussian RBF kernel defined in Section 4, θ_0 can be chosen by applying the constraint θ_1 = θ_2 = ··· = θ_d and solving

  \theta_0 = \arg\min_{\theta \in \Theta,\ \theta_1 = \theta_2 = \cdots = \theta_d} \left(\frac{R_c^2}{\gamma_c^2}\right).   (15)

Since the number of free parameters reduces to one, over-fitting is less likely to occur. It is worth noting that the obtained θ_0 is already optimal for the minimization of R_c²/γ_c² with a spherical Gaussian RBF kernel. It is a good starting point for the subsequent multi-parameter optimization.
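To make Eq. (6) concrete, here is a minimal sketch (our own illustration, not the paper's code) that evaluates the kernel class-separability measure C = tr(S_B^φ)/tr(S_T^φ) directly from a kernel matrix and the class labels.

```python
import numpy as np

def class_separability(K, labels):
    # Eq. (6): tr(S_B) = sum_i (1/n_i) 1' K_{Di,Di} 1 - (1/n) 1' K 1
    #          tr(S_T) = tr(K) - (1/n) 1' K 1;   C = tr(S_B) / tr(S_T)
    n = K.shape[0]
    pooled = K.sum() / n
    tr_SB = -pooled
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        tr_SB += K[np.ix_(idx, idx)].sum() / len(idx)
    tr_ST = np.trace(K) - pooled
    return tr_SB / tr_ST
```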
4 Experimental Results
The elliptical Gaussian RBF kernel, k(x, y) = \exp\left(-\sum_{i=1}^{d} \frac{(x_i - y_i)^2}{2\sigma_i^2}\right), is used. For binary SVMs, the optimized σ_i can reflect the importance of feature i for classification [4]: the larger the σ_i, the less important feature i. Now, the multi-class case is studied with the proposed criterion. The BFGS Quasi-Newton optimization method is used. For the convenience of optimization, g_i (g_i = 1/2σ_i²) is optimized instead. A multi-class SVM classifier using the selected features is then trained and evaluated by using [7]. Following [4], the feature selection criteria "Fisher score", "Pearson correlation coefficient" and "Kolmogorov-Smirnov test" are compared with the proposed criterion.

USPS for optical digit recognition. This data set contains 7291 training images and 2007 test images from the ten digit classes "0" to "9". Each 16 × 16 thumbnail image is reshaped into a 256-dimensional feature vector consisting of its gray values. The proposed feature selection criterion is minimized on the training set to optimize the g_i for each dimension. The 256 features are then sorted in descending order of the g values, and the top k features are used for classification.
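The kernel and the ranking-by-g step admit a direct implementation; the sketch below is our own illustration (array shapes and names are assumptions, not from the paper).

```python
import numpy as np

def elliptical_rbf(X, Y, g):
    # k(x, y) = exp(-sum_i g_i (x_i - y_i)^2), with g_i = 1 / (2 sigma_i^2)
    # X: (n, d), Y: (m, d), g: (d,)  ->  kernel matrix of shape (n, m)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2 * g).sum(axis=-1)
    return np.exp(-d2)

def top_k_features(g, k):
    # a larger optimized g_i marks a more important feature
    return np.argsort(-np.asarray(g))[:k]
```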
Fig. 1. Examples of image classification tasks: (a) optical character recognition; (b) facial image classification; (c) textured image classification.

Fig. 2. Results for the USPS data set: (a) feature selection result, i.e. test error versus the number of selected features for the Pearson correlation coefficient, the Kolmogorov-Smirnov test, the Fisher criterion score and the proposed criterion; (b) the optimized values of g = 1/2σ² shown as a 16 × 16 map.
As shown in Figure 2(a), the test error for the proposed criterion decreases quickly with k. With only 50 features selected, it reaches 6.7% (the lowest reported error is 4.3% when all 256 features are used [8]). Compared with the other three selection criteria, this criterion gives the best feature selection performance. Now, the optimized g_1, ..., g_256 are reshaped back into a 16 × 16 matrix and shown in Figure 2(b). In this map, each block corresponds to one of the 256 g's and its value is reflected by the gray level. As seen, the blocks with the larger g values scatter over the central part of this map, whereas those with the lower values are mostly at the borders and corners. This suggests that the pixels at the borders and corners are less important for discrimination. This observation matches the images in Figure 1(a) well, in that the digits are displayed in the central part and surrounded by a uniform black background.

ORL for facial image classification. This database contains 40 subjects, each with 10 gray-level facial images, as shown in Figure 1(b). Each image is resized to 16 × 16 and a 256-dimensional feature vector is obtained. The 400 images are randomly split into 10 pairs of training/test subsets of equal size (200 each), forming ten 40-class classification problems. For each problem, the g values are optimized on the training subset and the features are sorted accordingly. The feature selection result averaged over the 10 classification problems is reported in Figure 3(a).
Fig. 3. Results for the ORL face data set: (a) feature selection result, i.e. test error versus the number of selected features for the Pearson correlation coefficient, the Kolmogorov-Smirnov test, the Fisher score and the proposed criterion; (b) the optimized values of g = 1/2σ² shown as a shaded 16 × 16 map, with the forehead, left and right eyes, nose, mouth and chin regions marked.
By using only the top 50 selected features, the test error can be as low as 11.10 ± 2.87%, while the test error with all 256 features is 7.95 ± 2.36%. Again, the proposed criterion achieves the best selection performance, especially for the top 50 features. The optimized g values are also reshaped back and plotted, with a shading effect, in Figure 3(b). As seen, the blocks with larger g values (whiter) roughly present a human face, where the eyes, nose, mouth, forehead and chin are marked. This result is consistent with our daily experience that we often recognize a person by looking at the eyes, while the cheek, shown as the darker areas, is used less. The subjects in this database often have different hairstyles at the forehead, and this may explain why the forehead is also shown as a whiter region.

Brodatz for textured image classification. This database includes 112 different textured images, as shown in Figure 1(c). The top 10 images are used in this experiment. For each of them, sixteen 128 × 128 sub-images are cropped, without overlapping, from the original 512 × 512 image to form a texture class. By doing so, a ten-class textured image database of 160 samples is created. Again, they are randomly split into 10 pairs of training (40 samples)/test (120 samples) subsets. By utilizing a bank of Gabor filters (6 orientations and 4 scales), the mean, variance, skewness and kurtosis of the values of the Gabor-filtered images are extracted, leading to a 96-dimensional (6 × 4 × 4) feature vector [9]. Using the proposed criterion, the 96 g values are optimized and the features are sorted. The test error for different numbers of selected features is plotted in Figure 4(a). This criterion and the Kolmogorov-Smirnov test show comparable performance, with the former slightly better when selecting the top 10 features. Using this criterion, the lowest test error of 1.42 ± 1.36% is achieved with the top 50 selected features, whereas the error with all 96 features is 2.08 ± 1.06%. This indicates that half of the features are actually useless for discrimination. Again, the optimized g values are shown in Figure 4(b). It is found that the features from the 25th to the 48th (the "variance" features) and those from the 80th to the 96th (the "kurtosis" features) dominate the features with larger g values, suggesting that these two types of features are more useful for discriminating the textured images.
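The texture descriptor above (four statistics per Gabor response) can be computed as in the following sketch; this is our own illustration, assuming the 24 filtered images are already available.

```python
import numpy as np
from scipy import stats

def gabor_statistics(filtered_images):
    # filtered_images: iterable of Gabor responses (6 orientations x 4 scales = 24).
    # Each response contributes its mean, variance, skewness and kurtosis,
    # giving a 96-dimensional feature vector.
    feats = []
    for r in filtered_images:
        v = np.asarray(r, dtype=float).ravel()
        feats += [v.mean(), v.var(), stats.skew(v), stats.kurtosis(v)]
    return np.asarray(feats)
```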
Fig. 4. Results for the Brodatz data set: (a) feature selection result, i.e. test error versus the number of selected features for the Pearson correlation coefficient, the Kolmogorov-Smirnov test, the Fisher score and the proposed criterion; (b) the optimized value of g = 1/2σ² plotted against the feature number.

Table 1. Feature selection performance with regularization (ORL data set). Average test error (in %) for different numbers of selected features.

  λ value    1            2            5            10           20           50
  0          68.85±2.73   51.80±8.16   33.75±3.28   24.65±3.94   16.80±3.28   11.10±2.87
  0.1        70.10±4.01   62.85±10.15  40.70±4.74   30.15±2.24   19.60±3.85   11.75±2.20
  0.25       70.10±4.01   62.05±11.10  40.30±5.06   28.80±1.46   19.15±3.42   12.20±2.38
  0.5        73.50±5.25   57.10±12.12  36.05±5.23   29.15±4.78   20.20±3.12   12.55±2.75
  0.75       77.40±6.82   53.85±9.92   35.05±4.73   28.80±4.19   20.17±2.72   12.95±3.07
  0.9        74.45±6.09   55.40±11.29  35.60±4.40   28.10±3.27   21.25±2.44   12.60±3.18
  0.99       79.60±7.40   64.15±8.67   37.15±6.03   28.25±3.43   22.10±2.73   12.85±3.35
Finally, the benefit of regularization is investigated for the proposed criterion on the ORL and Brodatz data sets. For the ORL, there are only 5 training samples in each class, whereas the number of features is 256. For the Brodatz, the case is 4 versus 96. Both training sets are small with respect to the number of features. In this experiment, θ_0 is obtained by solving Eq. (15). By gradually increasing the regularization parameter, λ, from 0 to 0.99, the g values are optimized and the features are selected. The average test error of a multi-class SVM classifier with the selected features is reported in Table 1. It is found that the lowest error in each column is consistently obtained with λ = 0, that is, with no regularization imposed. Simply applying the regularization leads to inferior selection performance. This result suggests that the proposed criterion is quite robust to the small sample problem in this classification task. A similar behaviour is observed on the Brodatz data set in Table 2. These results are somewhat surprising, and further analysis leads to the following experiment. Looking into the ORL facial image classification problem, it is found that the total number of training samples is still comparable to the number of features, that is, 5 × 40 = 200 vs. 256, because there are 40 classes. As defined in the proposed criterion, the radius R_c² is estimated by using all 200 training samples, although the margin γ_ij² is estimated from the 10 training samples of each pair of classes. Hence, such a training set is not a typical small sample for estimating R_c².
Table 2. Feature selection performance with regularization (Brodatz data set). Average test error (in %) for different numbers of selected features.

  λ value    1            2            5            10           15           20
  0          44.42±6.11   21.58±5.01   5.92±4.91    3.33±3.24    2.00±2.19    2.17±1.12
  0.1        50.33±6.70   25.33±4.32   7.75±5.31    4.25±2.95    3.17±2.32    2.67±2.07
  0.25       50.42±6.67   25.00±4.84   9.42±3.73    3.58±1.80    3.25±2.10    2.67±2.22
  0.5        50.08±7.27   23.92±3.49   9.67±3.38    4.42±2.42    2.83±2.01    2.75±1.84
  0.75       48.42±7.76   25.08±3.98   9.25±4.09    5.17±2.99    2.92±1.72    3.25±2.10
  0.9        47.58±8.04   23.58±3.89   9.42±4.94    5.58±2.99    3.08±2.15    2.50±1.80
  0.99       47.00±7.64   24.08±4.87   8.92±3.33    5.92±2.73    3.08±2.19    2.92±1.85
Fig. 5. Results for the ORL data set with a smaller number of classes: test error versus the number of selected features for λ = 0, 0.1, 0.25, 0.5, 0.75, 0.9 and 0.99, with (a) 30 classes, (b) 20 classes, (c) 10 classes and (d) 8 classes.
Believing that regularization is necessary in the presence of a small sample problem, this paper further conducts the following experiment: maintain 5 training samples in each class and gradually reduce the number of classes to 30, 20, 10 and 8. After that, following the previous procedures, feature selection with the proposed criterion is performed for these four new classification problems, respectively. The result is plotted in Figure 5 and shows what is expected. The selection performance of the criterion without regularization (labelled λ = 0) degrades as the number of classes decreases. When there are only 8 classes (or, equivalently, 5 × 8 = 40 training samples), the necessity and benefit of regularization can be clearly seen. In summary, two points can be drawn from Tables 1 and 2 and Figure 5: (1) thanks to the way R_c² is defined, regularization is not a concern for the proposed criterion when the number of training samples in
each class is small but the number of classes is large. This case is quite reasonable for multi-class classification; (2) when the total number of training samples is really small, employing the regularized version of the proposed criterion can achieve better feature selection performance.
5 Conclusion
A novel feature subset selection approach has been proposed for multi-class SVM based image classification. It handles selection among hundreds of features, making it well suited to image classification, where high-dimensional feature vectors are common. This approach produces overall better performance than the compared selection criteria and shows robustness to the small sample case in multi-class image classification. These results preliminarily verify its effectiveness. More theoretical and experimental studies will be conducted in future work.
References
1. Kim, K.I., Jung, K., Park, S.H., Kim, H.J.: Support vector machines for texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(11), 1542–1550 (2002)
2. Chapelle, O., Haffner, P., Vapnik, V.: Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks 10(5), 1055–1064 (1999)
3. Luo, T., Kramer, K., Goldgof, D., Hall, L., Samson, S., Remsen, A., Hopkins, T.: Recognizing plankton images from the shadow image particle profiling evaluation recorder. IEEE Transactions on Systems, Man and Cybernetics, Part B 34(4), 1753–1762 (2004)
4. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for Support Vector Machines. Machine Learning 46(1-3), 131–159 (2002)
5. Wang, L., Chan, K.L.: Learning kernel parameters by using class separability measure. Presented at the Sixth Kernel Machines Workshop, in conjunction with Neural Information Processing Systems (2002)
6. Darcy, Y., Guermeur, Y.: Radius-margin bound on the leave-one-out error of multi-class SVMs. Technical Report No. 5780, INRIA (2005)
7. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
8. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA (2002)
9. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(8) (1996)
Evaluating Multi-Class Multiple-Instance Learning for Image Categorization

Xinyu Xu and Baoxin Li

Department of Computer Science and Engineering, Arizona State University
[email protected],
[email protected]

Abstract. Automatic image categorization is a challenging computer vision problem, to which Multiple-Instance Learning (MIL) has emerged as a promising approach. Typical current MIL schemes rely on binary one-versus-all classification, even for inherently multi-class problems. There are a few drawbacks with binary MIL when applied to a multi-class classification problem. This paper describes Multi-class Multiple-Instance Learning (McMIL) for image categorization, which bypasses the necessity of constructing a series of binary classifiers. We analyze McMIL in depth to show why it is advantageous over binary MIL when strong target concept overlaps exist among the classes. We systematically evaluate McMIL using two challenging image databases and compare it with state-of-the-art binary MIL approaches. McMIL achieves competitive classification accuracy, robustness to labeling noise, and effectiveness in capturing the target concepts using a smaller amount of training data. We show that the learned target concepts from McMIL conform to human interpretation of the images.

Keywords: Image Categorization, Multi-Class Multiple-Instance Learning.
1 Introduction

Multiple-Instance Learning (MIL) is a generalization of supervised learning in which training class labels are associated with sets of patterns, or bags, instead of individual patterns. While every pattern may possess an associated true label, it is assumed that pattern labels are only indirectly accessible through bag labels [1]. MIL has been successfully applied to many applications such as drug activity prediction [8], image indexing for content-based image retrieval [1, 13, 21, 5, 2, 14, 6], object recognition [6], and text classification [1]. In many real-world multi-class classification problems, a common treatment adopted by traditional MIL is the "one-versus-all" (OVA) approach, which constructs c binary classifiers, f_j(x), j = 1,...,c, each trained to distinguish one class from the remaining ones, and labels a novel input with the class giving the largest f_j(x), j = 1,...,c. Although OVA delivers good empirical performance, it has potential drawbacks. First, it is somewhat heuristic [15]. The binary classifiers are obtained by training on separate binary classification problems, and thus it is unclear whether their real-valued outputs are on comparable scales [15]. This can be a problem, since situations often arise where several binary classifiers acquire identical decision values, making the class prediction based on them ambiguous. Second, binary OVA has been
criticized for its inadequacy in dealing with rather asymmetric problems [15]. When the number of classes becomes large, each binary classifier becomes highly unbalanced, with many more negative examples than positive examples. In the case of non-separable classes, the class with the smaller fraction of instances tends to be ignored, leading to degraded performance [18]. Third, with respect to Fisher consistency, it is unclear whether OVA targets the Bayes rule in the absence of a dominating class [10]. Recently, a Multi-class Multiple-Instance Learning (McMIL) approach has been proposed which considers multiple classes jointly by simultaneously minimizing a Support Vector Machine (SVM) objective function. In this paper we focus on the systematic evaluation of McMIL using image categorization as a case study, and we show why it is advantageous over binary OVA approaches.
2 Related Work

There has been abundant work in the literature on image categorization. Due to space limits, we only review the MIL-based approaches. Andrews et al. [1] proposed MI-SVM and applied it to small-scale image annotation. [12] proposed the Diverse Density (DD) MIL framework and [13] used it to classify natural scene images. [20] proposed EM-DD, which views the knowledge of which instance corresponds to the label of the bag as a missing attribute. [2] performed an instance-based feature mapping through a chosen metric distance function, and a sparse SVM was adopted to select the target concepts. [6] proposed MILES for image categorization and object recognition. MILES does not impose the assumption relating instance labels to bag labels, and it does not rely on searching for local maximizers of the DD function, which makes MILES quite efficient. [14], within the CBIR setting, presented MISSL, which transforms any MI problem into an input for a graph-based single-instance semi-supervised learning method. Unlike most prior MIL algorithms, MISSL makes use of unlabeled data. Most of these works belong to the binary MIL approaches, and thus potentially suffer from the drawbacks mentioned in the previous section. The work of [5] is most closely related to ours. They proposed DD-SVM. Their major contribution is that the image label is no longer determined by a single region; rather, it is determined by some number of Instance Prototypes (IPs) learned from the DD function. A nonlinear mapping is defined using the IPs to map every training image to a point in a bag feature space. Finally, for each category, an SVM is trained to separate that category from all the others. Note that the multi-class classification here is essentially achieved by OVA. The method described in this paper differs from DD-SVM in that a multi-class feature space is built that enables us to train the SVM only once to yield the multi-class label, leading to increased classification accuracy.
3 Methodology of McMIL

For completeness of the paper, we describe the McMIL approach in this section. Let us first define the problem and the notation. Let D = {(x_1, y_1), ..., (x_{|D|}, y_{|D|})} be the labeled dataset, where x_i, i = 1,...,|D|, is called a bag in the MIL setting, and y_i ∈ {1,...,c} denotes the multi-class label. In the MIL model, x_i = {x_{i1}, ..., x_{i|x_i|}}, where x_{ij} ∈ ℝ^d represents the jth instance in the ith bag. Our goal is to induce a classifier from patterns to multi-class labels, i.e. a multi-class separating function f: ℝ^d → y.
3.1 Learning Instance Prototypes

The first step in the training phase of McMIL is to learn the IPs for each category according to the DD function [12]. Because the DD function is defined within the binary framework, McMIL learns a set of IPs specifically for every class: the ith DD function uses all of the examples in the ith class with positive labels and some examples from other classes with negative labels. McMIL applies EM-DD to locate the multiple local peaks, using every instance in every positive bag with uniform weights as the starting point of the Quasi-Newton search. The IPs that have larger DD values and are distinct from each other are then selected to represent the target concepts [5].

3.2 Multi-class Feature Mapping

Let IP_m = {(I_m^1, I_m^2, ..., I_m^{n_m})}, m ∈ {1, 2,..., c}, be the set of instance prototypes learned by EM-DD for the mth category. Each I_m^k = [x_m^k, s_m^k], k = 1, 2,..., n_m, denotes the kth IP in the mth class, and n_m is the number of IPs in the mth class. The learned IP consists of two parts: the ideal attribute value x and the weight factor s. To facilitate the multi-class classification, a similarity mapping kernel ψ is defined which maps a bag of instances into a multi-class feature space, based on a similarity metric function h: ℝ^d × ℝ^d → ℝ. The mapping ψ is defined as

  \psi(x_i) = \left[ h(x_i, I_1^1), ..., h(x_i, I_1^{n_1}), ..., h(x_i, I_c^1), ..., h(x_i, I_c^{n_c}) \right]^{T}   (1)

The similarity function h is defined as

  h(x_i, I_m^k) = \max_{j=1,...,|x_i|} \left\{ \exp\left[ -\frac{\|x_{ij} - x_m^k\|^2}{s_m^k} \right] \right\},\quad k = 1,..., n_m,\ m \in \{1,..., c\}   (2)

Each feature corresponds to the similarity between a bag and one IP, which is the similarity between I_m^k and the instance in x_i that is closest to I_m^k. Note that, while this is similar to the bag feature space mapping proposed by [5], an important difference between the two mappings is that the mapping in our method simultaneously incorporates the similarity between a bag and each IP of every category into one feature matrix, while in binary MIL [5, 2, 6] separate feature matrices are needed, each of which only considers the similarity between a bag and each IP of the one category under consideration. For a given training set of bags with multi-class labels, D = {(x_1, y_1), ..., (x_{|D|}, y_{|D|})}, applying the above mapping yields the following kernel similarity matrix

  S = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_{|D|} \end{bmatrix}
    = \begin{bmatrix}
        h(x_1, I_1^1), ...., h(x_1, I_1^{n_1}), ...., h(x_1, I_c^1), ...., h(x_1, I_c^{n_c}) \\
        h(x_2, I_1^1), ...., h(x_2, I_1^{n_1}), ...., h(x_2, I_c^1), ...., h(x_2, I_c^{n_c}) \\
        \vdots \\
        h(x_{|D|}, I_1^1), ..., h(x_{|D|}, I_1^{n_1}), ..., h(x_{|D|}, I_c^1), ..., h(x_{|D|}, I_c^{n_c})
      \end{bmatrix}   (3)
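A minimal sketch of the mapping in Eqs. (1)-(3) is given below; it is our own illustration, with the prototypes of all classes assumed to be concatenated into a single array and the scale factors s collected correspondingly.

```python
import numpy as np

def bag_features(bag, prototypes, scales):
    # psi(x_i) of Eq. (1): for each instance prototype, the best match over the
    # bag's instances under the exponential similarity of Eq. (2).
    # bag: (n_inst, d); prototypes: (n_ip, d); scales: (n_ip,)
    d2 = ((bag[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)  # (n_inst, n_ip)
    return np.exp(-d2 / scales).max(axis=0)

# S of Eq. (3): one row of similarities per training bag
# S = np.vstack([bag_features(b, prototypes, scales) for b in bags])
```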
3.3 Multi-class SVM Training and Classification

Given the kernel similarity matrix S, the multi-class classification is achieved by a multi-class SVM [16, 19], which allows the computation of a multi-class classifier by considering all the classes at once. The formulation is

  \min_{w,b,\xi}\ \frac{1}{2}\sum_{m=1}^{c} w_m^{T} w_m + C \sum_{i=1}^{|D|}\sum_{m \ne y_i} \xi_i^m   (4)
  s.t.\ \ w_{y_i}^{T}\psi(x_i) + b_{y_i} \ge w_m^{T}\psi(x_i) + b_m + 2 - \xi_i^m,
        \xi_i^m \ge 0,\ \ i = 1,...,|D|,\ m \in \{1,...,c\} \setminus \{y_i\}

where w is a vector orthogonal to the separating hyperplane, |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin, and C is a parameter that controls the penalty due to misclassification. Nonnegative slack variables ξ_i^m are introduced in each constraint to account for the cost of the errors. In the testing phase, we first project the test image into the multi-class feature space using the learned IPs. The decision function is then given by

  \arg\max_{m=1,...,c}\left( w_m^{T}\psi(x) + b_m \right)   (5)

which is the same as the OVA decision rule. Note that in the multi-class SVM there are c decision functions, but all are obtained by solving the single problem of Eq. (4).
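As a hedged illustration only: the sketch below trains scikit-learn's Crammer-Singer multi-class SVM on the bag-feature matrix and applies the arg-max rule of Eq. (5). This is a related "all-together" formulation used as a stand-in, not the exact Weston-Watkins problem of Eq. (4); the argument names are our own placeholders for the mapped training/test bags and their labels.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_and_predict(S_train, y_train, S_test):
    # Stand-in for Eq. (4): an all-together multi-class SVM (Crammer-Singer),
    # not the exact formulation used in the paper.
    clf = LinearSVC(multi_class='crammer_singer', C=1.0)
    clf.fit(S_train, y_train)
    # Eq. (5): argmax_m  w_m^T psi(x) + b_m
    scores = S_test @ clf.coef_.T + clf.intercept_
    return np.argmax(scores, axis=1)
```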
Fig. 1. The target concepts of four categories (People, Portrait, Scene, Structure) in the SIMPLIcity-II data set, represented by conceptual regions. For "People", the various shapes of humans are returned as target concepts. For "Portrait", the target concepts are mainly skin and hair with varying colors. For "Scene", the major target concepts are mountain, sky, sea water and plants. For "Structure", the target concepts are various kinds of building structures such as walls, windows and roofs.
3.4 The Learned Target Concept

Each instance prototype represents one kind of target concept that distinguishes one class from another, so it is important to visualize the target concepts. To do so, in each class, for each feature defined by one positive instance prototype, we find the training image that maximizes the corresponding feature, and then go further to locate the region that is closest to the corresponding prototype. The target concepts for a category are thus defined by the regions that maximize the similarity
between images and the instance prototypes. Fig. 1 shows the target concepts of four classes in the SIMPLIcity-II dataset (Sec. 4).
4 The Advantage of Direct Multi-Class MIL

The McMIL approach has an advantage over binary MIL when strong target concept overlaps exist among the classes. To show this advantage, suppose we have a dataset with three categories: People, Portrait and Scene. Typical People IPs include "hair", "face", "clothes", "building" and "water", and typical Portrait IPs include "hair", "face" and "clothes". Fig. 2(a) shows the characteristics of the feature vectors for a People image and a Portrait image. We can see that the People IPs are inclusive of the Portrait IPs. Now let us simplify the problem by assuming that the target concepts of People are summarized by "face" and "non-face", and the target concept of Portrait is summarized by the "face" IP.
Fig. 2. (a) The characteristics of the feature vectors for a People image and a Portrait image; top: the features of a People image defined by the Portrait IPs; bottom: the features of a Portrait image defined by the People IPs. (b) and (c) illustrate the advantage of McMIL over OVA MIL: (b) the feature spaces of the People and Portrait classifiers in the OVA MIL approach; (c) the multi-class feature space in the proposed McMIL approach.
Let us first consider the OVA MIL approach. OVA MIL will construct 3 binary classifiers. Fig. 2(b) shows the feature spaces of the People and Portrait classifiers in the OVA MIL approach. In the People classifier (top part of Fig. 2(b)), the features are defined in terms of the People IPs: face and non-face. It is likely that the three People training images will fall into circle A, because the similarity between a People bag and the People IPs (face or non-face) would be large. We can expect that the three Portrait training images will more likely fall into circle B, since the similarity between a Portrait bag and the face IP is large but the similarity between a Portrait bag and the non-face IP would be small. In the same sense, the three Scene images will fall into circle C. Because the bags are clearly separable in the People feature space, we would expect the People binary classifier to do a good job in classifying a new test point. Now let us turn to the Portrait binary classifier, whose feature is defined only by the face IP. As shown in the bottom part of Fig. 2(b), the People and Portrait bags become linearly non-separable in the Portrait feature space, in that the feature values defined by the Portrait face for the People bags and the Portrait bags are both likely to be
large, and so People bags and Portrait bags will overlap with each other. As a result, the Portrait classifier will be more likely to make prediction errors when classifying a new test input. Now let us predict the label for a new test image whose true label is People. We feed it into each of the three binary classifiers and obtain the following decision values: People classifier 0.3218, Portrait classifier 0.3825, and Scene classifier -0.7726. Note that to classify this new image into the right category, the Portrait classifier should give us a negative decision value. But because of the Portrait classifier's poor classification performance, the new image will be associated with the wrong label Portrait, since 0.3825 > 0.3218 > -0.7726. Now let us turn to the multi-class SVM classification, whose feature space is illustrated in Fig. 2(c). Note that the multi-class feature space is defined by the IPs of all the classes, so McMIL has three features, defined by the People face (face 1), the Portrait face (face 2) and non-face. In this multi-class feature space, the probability that the multi-class SVM makes a prediction error is greatly reduced, because the dimension of the feature space is enlarged and linearly non-separable points are likely to become separable in a higher-dimensional feature space. This is essentially similar to applying a kernel to make a linearly non-separable feature space nonlinearly separable. In fact, the multi-class feature matrix S is a kernel similarity matrix, whose kernel mapping is defined by Eq. (1).
5 Evaluation of McMIL

5.1 Experiment Design

Using image categorization as a case study, we evaluate the performance of McMIL on two challenging image data sets: the COREL database [2, 5, 6] and SIMPLIcity-II. The images in SIMPLIcity-II are selected from the SIMPLIcity image library [11, 17]. The COREL image database consists of 10 categories, with 100 images per category. The SIMPLIcity-II data set consists of 5 categories, with 100 images per category. Images in both data sets are in JPEG format of size 384 × 256 or 256 × 384. Using the COREL dataset, we compare the classification accuracy of McMIL with that of several state-of-the-art methods, including MILES [6], 1-norm SVM [2], DD-SVM [5], k-means-SVM [7], Hist-SVM [4] and MI-SVM [1]. Using the COREL dataset, we also comprehensively evaluate McMIL against DD-SVM. Using SIMPLIcity-II, we compare the robustness of McMIL with that of DD-SVM in classifying images where strong target concept overlaps exist among the classes. To facilitate a fair comparison with other methods, we use the same feature matrix for all the methods. In all of the experiments, the images within each category were randomly partitioned in half to form a training set and a test set. We repeated each experiment for 5 random splits and report the average of the results obtained over the 5 random test sets, as done in the compared approaches in the literature.

5.2 Results

Classification accuracy on the COREL data set. Table 1 shows the classification accuracy for each class obtained by McMIL and by DD-SVM and MILES. For McMIL, we present results for two variants: one is McMIL90, which, instead
of using all the examples in all the other classes with negative labels, uses only 10 images per negative class to learn the IPs for each class. The other variant is McMIL450, which uses all the examples in all the other classes with negative labels to learn the IPs for each class. Somewhat surprisingly, McMIL90 performs much better than DD-SVM (which uses 450 negative examples) and MILES. This further implies that McMIL is able to achieve higher classification accuracy using a much smaller amount of negative training data. The left table in Table 2 shows the average classification accuracies of McMIL in comparison with other approaches. McMIL90 performs much better than Hist-SVM, k-means-SVM, MI-SVM and 1-norm SVM, and slightly better than DD-SVM and MILES. Moreover, McMIL450 outperforms Hist-SVM, k-means-SVM, MI-SVM and 1-norm SVM by a large margin, and it performs comparably with MILES. McMIL450 outperforms DD-SVM, though the difference is not statistically significant.

Table 1. Classification accuracies (in %) for each class obtained by McMIL90, McMIL450, DD-SVM and MILES on the COREL dataset.

  Class   McMIL90       McMIL450       DD-SVM   MILES
  0       71.6 ± 4.7    70.4 ± 3.8     67.7     68.8
  1       68.0 ± 6.8    69.6 ± 4.19    68.4     66.0
  2       76.8 ± 2.7    74.8 ± 2.66    74.3     75.7
  3       90.8 ± 2.7    91.6 ± 3.37    90.3     90.3
  4       100.0 ± 0.0   100.0 ± 0.0    99.7     99.7
  5       81.6 ± 6.1    79.6 ± 5.32    76.0     77.7
  6       94.4 ± 1.9    95.6 ± 2.6     88.3     96.4
  7       94.4 ± 3.4    96.4 ± 2.6     93.4     95.0
  8       67.2 ± 4.7    62.8 ± 8.37    70.3     71.0
  9       88.8 ± 1.6    88.8 ± 4.04    87.0     85.4
Table 2. The classification accuracy (in %) obtained by McMIL and other methods (with the corresponding 95% confidence intervals) using the COREL (left table) and SIMPLIcity-II (right table) dataset
Sensitivity to labeling noise. In multi-class classification, labeling noise is defined as the probability that an image is mislabeled. It is important to study the sensitivity of McMIL to labeling noise because, in many practical applications, it is usually impossible to get a "clean" data set and the labeling process is often subjective [6]. In this experiment, we used 500 images from Class 0 to Class 4, with 100 images per class. The training and test sets have equal size. In the multi-class case, a training set with a given level of labeling noise is generated as follows: for the i-th class, i = 0,...,4, we randomly pick d% of the training images and change their labels to some j, j ≠ i, where each value of j other than i is equally likely; for the i-th class, i = 0,...,4, we also randomly pick d% of the images from among all the other classes (200 images in this case) and change their labels to i. We compared McMIL with DD-SVM for d = 0, 2, 4, ..., 20 and report the average classification accuracy over 5 random test sets, as shown in Fig. 3(a). We use 50 positive images and 200 negative images per class to learn the IPs for both McMIL and DD-SVM. We observe that for d = 0, 2, 4, 6, 8, the classification accuracies of McMIL are 2% higher than those of DD-SVM. As we increase the percentage of training images with label noise from 10 to 20, the accuracy of McMIL remains roughly the same, but the accuracy of DD-SVM drops very quickly. This indicates that our method is less sensitive to labeling noise than DD-SVM.
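For concreteness, a minimal sketch of the first half of this noise protocol (flipping d% of each class's own labels; the symmetric second half is analogous) is given below; the helper name and the random-generator choice are ours, not the paper's.

```python
import numpy as np

def flip_class_labels(labels, d_percent, num_classes, seed=0):
    # labels: integer NumPy array. For each class i, relabel d% of its images
    # with a class j != i chosen uniformly at random.
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for i in range(num_classes):
        idx = np.where(labels == i)[0]
        k = int(round(len(idx) * d_percent / 100.0))
        for p in rng.choice(idx, size=k, replace=False):
            others = [j for j in range(num_classes) if j != i]
            noisy[p] = rng.choice(others)
    return noisy
```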
Fig. 3. Comparison of McMIL with DD-SVM in terms of sensitivity to labeling noise (a), sensitivity to the number of negative images (b), sensitivity to the number of categories (c), and sensitivity to the number of sample images (d).
Sensitivity to number of training negative images. We used 1000 images from Class 0 to Class 9 (training and test sets are of equal size) to analyze the performance of McMIL as the number of negative examples used for learning IPs for each class
varies from 90 to 450. That is, in the ith (i = 1,...,5) experiment, we randomly choose 10 × i negative examples from the 50 training images of each of the other classes, and run the experiment for 5 random splits. In all these experiments, the number of positive examples remains 50 when learning the IPs for the jth class, j = 1,..., 10. As indicated by Fig. 3(b), even when the number of negative images drops to 90, McMIL still performs very well. This is an advantage over binary MIL, since the performance of most binary MIL methods degrades significantly as the amount of negative training data decreases. Why does the performance of direct multi-class MIL not degrade as the amount of negative training data decreases? We attempt to answer this question as follows: for the OVA MIL approaches, a series of one-versus-all binary classifiers is obtained, each trained to separate one class from all the rest. This requires the IPs learned for each class not only to represent the target concepts of the underlying class but also to distinguish the underlying class from the rest. This is because, in the definition of DD [12], the target concepts are the intersection points of all the positive bags that do not intersect the negative bags, so the intersection points of all the positive bags that also intersect negative bags must be excluded by learning from a large number of negative examples. However, in direct multi-class MIL, the requirement for the IPs to distinguish one class from another is relaxed, because the "distinguishing" power of the classifier can be compensated for by learning in the multi-class feature space, thanks to the nice property of the multi-class feature matrix analyzed in Sec. 4. Therefore, in order to accurately capture the target concepts of the underlying class (i.e. the instance prototypes), the number of negative training examples used by McMIL can be much smaller than that used by DD-SVM.

Sensitivity to the number of categories. In this experiment, a total of 10 data sets are used, with the number of categories in a data set varying from 10 to 19. A data set with i categories contains 100 × i images from Class 0 to Class i-1. In learning the IPs for each class, we use the images in the current class with positive labels and all the images in the other classes with negative labels. As shown by Fig. 3(c), McMIL outperforms DD-SVM consistently by a large margin for each data set.

Sensitivity to the number of training images. We compared McMIL with DD-SVM as the size of the training set varies from 100 to 500. 1,000 images from Class 0 to Class 9 are used, with training sets of size 100, 200, 300, 400 and 500 (the number of images from each category is the size of the training set divided by 10). As indicated in Fig. 3(d), McMIL performs consistently better than DD-SVM as the number of training images increases from 100 to 500.

Classification accuracy on SIMPLIcity-II. We built SIMPLIcity-II specifically to test the robustness of our method in classifying images where strong target concept overlaps exist among the classes. While small, the SIMPLIcity-II dataset is in fact very challenging, for various reasons. First, the images in a semantic category are not "pure", so target concept overlaps occur more frequently than in the COREL dataset. Second, the data contain noise, in that a negative bag may contain positive instances as well.
Third, the images are very diverse in the sense that they have various backgrounds, colors and combinations of semantic concepts, and the images are photographed as both wide-range and close-up shots.
The right table of Table 2 shows the average classification accuracy of McMIL and DD-SVM on the SIMPLIcity-II dataset over 5 random test sets (training and test sets are of equal size). Despite the aforementioned difficulties, McMIL achieves reasonably good results on this dataset, demonstrating that it is indeed a promising approach. We also found that DD-SVM made a much higher classification error than McMIL in classifying People and Portrait images: DD-SVM misclassifies 9.2% of People images as Portrait and 8.6% of Portrait images as People, whereas McMIL only produces errors of 6.4% and 6.8% for the two classes, respectively.
6 Concluding Remarks

In this paper, we described a multi-class multiple-instance learning approach. We have carried out a systematic evaluation to demonstrate the benefits of direct McMIL. Our empirical studies show that, in addition to overcoming the drawbacks of OVA MIL approaches, McMIL is able to achieve higher classification accuracy than most existing OVA MIL approaches even when using only a small amount of training data, and it is less sensitive to labeling noise. These two benefits are significant in practical applications. The ability of McMIL to classify images with strong target concept overlap is found to be superior to that of DD-SVM. One drawback of McMIL is that, as the number of classes becomes large, it becomes computationally inefficient, since the gradient search for the IPs takes quite a lot of time. One way to increase the efficiency of McMIL is to use only a small number of negative examples; another is to bypass the search for IPs by combining MILES [6] with McMIL. Our future work will explore these possibilities.
References
1. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. NIPS 15, 561–568 (1998)
2. Bi, J., Chen, Y., Wang, J.Z.: A sparse support vector machine approach to region-based image categorization. CVPR (1), 1121–1128 (2005)
3. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
4. Chapelle, O., Haffner, P., Vapnik, V.N.: Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks 10(5), 1055–1064 (1999)
5. Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions. Journal of Machine Learning Research 5, 913–939 (2004)
6. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-Instance Learning via Embedded Instance Selection. IEEE Trans. on PAMI 28(12), 1931–1947 (2006)
7. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual categorization with bags of keypoints. In: ECCV 2004 Workshop on Statistical Learning in Computer Vision, pp. 59–74 (2004)
8. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89, 31–71 (1997)
9. Hsu, C.-W., Lin, C.-J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13(2), 415–425 (2002)
10. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81 (2004)
11. Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. on PAMI 25(9), 1075–1088 (2003)
12. Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. NIPS 10, 570–576 (1998)
13. Maron, O., Ratan, A.L.: Multiple-instance learning for natural scene classification. ICML, 341–349 (1998)
14. Rahmani, R., Goldman, S.A.: MISSL: multiple-instance semi-supervised learning. ICML, 705–712 (2006)
15. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, pp. 211–214. MIT Press, Cambridge, MA (2002)
16. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
17. Wang, J.Z., Li, J., Wiederhold, G.: SIMPLIcity: Semantics-sensitive Integrated Matching for Picture Libraries. IEEE Trans. on PAMI 23(9), 947–963 (2001)
18. Wang, L., Shen, X., Zheng, Y.F.: On L-1 norm multi-class Support Vector Machines. In: Proc. 5th Intl. Conf. on Machine Learning and Applications (2006)
19. Weston, J., Watkins, C.: Multi-class support vector machines. In: Verleysen, M. (ed.) Proc. ESANN 1999, D. Facto Press, Brussels (1999)
20. Zhang, Q., Goldman, S.A.: EM-DD: An improved multiple-instance learning technique. NIPS 14, 1073–1080 (2002)
21. Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multiple-instance learning. In: Proc. ICML, pp. 682–689 (2002)
TransforMesh: A Topology-Adaptive Mesh-Based Approach to Surface Evolution

Andrei Zaharescu, Edmond Boyer, and Radu Horaud

INRIA Rhone-Alpes, 655 avenue de l'Europe, Montbonnot, 38330, France
[email protected]

Abstract. Most of the algorithms dealing with image-based 3-D reconstruction involve the evolution of a surface based on a minimization criterion. The mesh parametrization, while allowing for an accurate surface representation, suffers from the inherent problem of not being able to reliably deal with self-intersections and topology changes. As a consequence, an important number of methods choose implicit representations of surfaces, e.g. level set methods, that naturally handle topology changes and intersections. Nevertheless, these methods rely on space discretizations, which introduce an unwanted precision-complexity trade-off. In this paper we explore a new mesh-based solution that robustly handles topology changes and removes self-intersections, therefore overcoming the traditional limitations of this type of approach. To demonstrate its efficiency, we present results on 3-D surface reconstruction from multiple images and compare them with state-of-the-art results.
1 Introduction

A vast number of problems in the area of image-based 3-D modeling are cast as energy minimization problems in which 3-D shapes are optimized such that they best explain image information. More specifically, when performing 3-D reconstruction from multiple images, one typically attempts to recover 3-D shapes by evolving a surface with respect to various criteria, such as photometric or geometric consistency. Meshes, either triangular, polygonal or even tetrahedral, are one of the most widely used forms of shape representation. Traditionally, an initial mesh is obtained using some well known method, e.g. bounding boxes or visual hulls, and then deformed over time such that it minimizes an energy, typically based on some form of image similarity measure. There exist two main schools of thought on how to cast the above-mentioned problem. Lagrangian methods propose an intuitive approach where the surface representation, i.e. the mesh, is directly deformed over time. Meshes present numerous advantages, among which adaptive resolution and compact representation, but raise two major issues of concern when evolved over time: self-intersections and topology changes. It is critical to deal with both issues when evolving a mesh over time. In order to address this, McInerney and Terzopoulos [1] proposed topology-adaptive deformable curves and meshes, called T-snakes and T-surfaces. However, in solving the intersection problem, the authors use a spatial grid, thus imposing a fixed spatial resolution. In addition, only offsetting motions, i.e. inflating or deflating, are allowed. Lauchaud et al. [2] proposed a heuristic method where merges and splits are performed in near-boundary cases:
when two surface boundaries are closer than a threshold and facing each other, an artificial merge is introduced; a similar procedure is applied for a split, when the two surface boundaries are back to back. Self-intersections are avoided in practice by imposing a fixed edge size. Eulerian methods formulate the problem as time variation over a sampled space, most typically fixed grids. In such a formulation, the mesh, also called the interface, is implicitly represented. One of the most successful methods, the level set method [3] [4], represents the interface as the zero set of a function. Such an embedding within an implicit function allows one to automatically handle mesh topology changes, i.e. merges and splits. In addition, such methods allow for an easy computation of geometric properties. Nevertheless, this method does not come without drawbacks. The amount of space required to store the implicit representation is much larger than when storing a mesh parametrization. In addition, at each step of the evolution, the mesh has to be recovered from the implicit function. This operation is limited by the grid resolution. Consequently, a mesh with variable resolution cannot be properly preserved by the embedding function. In light of this, we propose a mesh-based Lagrangian approach. To solve the self-intersection and topology change issues, we adopt the algorithm proposed by Jung et al. [5] in the context of mesh offsetting and extend it to the more general situations faced when modeling from multiple images. We have successfully applied our approach to surface reconstruction using multiple cameras. The 3-D reconstruction literature is vast. Recently, Furukawa et al. [6] provided some of the most impressive results. They use, however, a mesh-based solution [2], mentioned above, that imposes equal face sizes. Pons et al. [7] provide an elegant level-set based implementation. Hernandez et al. [8] provide an in-between solution: they choose an explicit mesh representation, but immerse the mesh within a vector field with forces proportional to the image consistencies at given places within a grid. Self-intersections and topology changes are not handled. Other approaches are surveyed in the recent overview of multi-view stereo reconstruction algorithms [9]. Our contribution with respect to these methods is to provide a purely mesh-based solution that does not constrain meshes and allows faces of all sizes as well as topology changes, with the goal of increasing precision without sacrificing complexity when dealing with surface evolution problems. The rest of the paper is organized as follows. Section 2 introduces the algorithm that handles self-intersections and topology changes. In Section 3, we present its application to the 3-D surface reconstruction problem. Section 4 shows results on well known data sets and makes comparisons with state-of-the-art approaches, before concluding in Section 5.
2 TransforMesh - A Topology-Adaptive Self-intersection Removal Mesh Solution As stated earlier, the main limitations that prevent many applications from using meshes are self-intersections and topology changes, which frequently occur when evolving meshes. In this paper, we show that such limitations can be overcome using a very intuitive, geometrically driven solution. In essence, the approach preserves mesh consistency by detecting self-intersections and considering the subset of the original mesh surface that is outside with respect to the mesh orientation. A toy example is
illustrated in Figure 1. Our method is inspired by the work of [5], proposed in the context of mesh offsetting or mesh expansion. We extend it to the general situation of recovering a valid mesh, i.e. a manifold, from a self-intersecting one with possible topological issues. The proposed algorithm has the great advantage of dealing with topological changes naturally, much in the same fashion as level-set based solutions, making it a viable solution for surface evolution with meshes. The only requirement is that the input mesh be the result of the deformation of an oriented and valid mesh. The following sections detail the sequential steps of the algorithm.
Fig. 1. A toy example of TransforMesh at work
2.1 Pre-treatment Most mesh-related algorithms suffer from numerical problems due to degenerate triangles, mainly triangles having area close to zero. In order to remove those triangles, two operations are performed: edge collapse and edge flip.
2.2 Self-intersections The first step of the algorithm consists of identifying self-intersections, i.e. edges along which triangles of the mesh intersect. In practice, one would have to perform n²/2 checks to see if two triangles intersect, which can become quite expensive when the number of facets is large. In order to decrease the computational time, we use a bounding box test to determine which bounding boxes (of triangles) intersect, and only for those perform a triangle intersection test. We use the fast box intersection method implemented in [10] and described in [11]. The complexity of the method is O(n log^d(n) + k) in running time and O(n) in space, where n is the number of triangles, d the dimension (3 in the 3-D case), and k the output complexity, i.e., the number of pairwise intersections of the triangles.
2.3 Valid Region Growing The second step of the algorithm consists of identifying valid triangles in the mesh. To this purpose, a valid region growing approach is used to propagate validity labels on the triangles that compose the outside of the mesh. Initialization. Initially, all the triangles are marked as non-visited. This corresponds to Figure 2(a). Three queues are maintained: one named V, of valid triangles, i.e. triangles outside and without intersections; one named P, of partially valid triangles, i.e. only part
Fig. 2. Valid Region Growing (2-D simplified view). The selected elements are in bold.
of the triangle is outside, and finally one named G, where all the valid triangles will be stored until stitched together into a new mesh (in Section 2.4). All that follows is performed within an outer loop, while there still exists a valid seed triangle. Seed-triangle Finding. A seed-triangle is defined as a non-visited triangle without intersections whose normal does not intersect any other triangle of the same connected component. This corresponds to Figure 2(b). In other words, a seed-triangle is a triangle that is guaranteed to be on the exterior. This triangle is crucial, since we start our valid region growing from it. If found, the triangle is added to V and marked as valid; otherwise, the algorithm jumps to the next stage (Section 2.4). The next two steps are performed within an inner loop until both V and P are empty. Valid Queue Processing. While V is not empty, pop a triangle t from the queue, add it to G and for each neighbouring triangle N(t) perform the following: if N(t) is non-visited and has no intersections, then add it to V; if N(t) is non-visited and has intersections, then add it to P together with the entrance segment and direction, corresponding in this case to the oriented half-edge (see Figure 2(c)). Partially-Valid Queue Processing. While P is not empty, pop a triangle t from the queue, together with the entrance half-edge ft. All the intersection segments between this triangle and all the other triangles have been computed previously. Let St = {sti} represent all the intersection segments between triangle t and the other triangles. In addition, let Ht = {htj | j = 1..3} represent the triangle half-edges. A constrained 2-D triangulation is performed in the triangle plane, using [12], to ensure that all segments in both St and Ht appear in the new mesh structure and that propagation can be achieved in a consistent way. A fill-like traversal is performed from the entrance half-edge to adjacent triangles, stopping on constraint edges, as depicted in Figure 3. A crucial aspect for ensuring a natural handling of topological changes is choosing the correct side of continuation of the "fill"-like region growing when crossing a partially valid triangle. The correct orientation is chosen such that, if the original
Fig. 3. Partial Triangle Traversal: (a) triangle intersections; (b) partial triangle traversal.
normals are maintained, the two newly formed sub-triangles would preserve the watertightness constraint of a manifold. This condition can also be cast as follows: the normals of the two sub-triangles should oppose each other when the two sub-triangles are "folded" on the common edge. A visual representation of the two cases is shown in Figure 4.
Fig. 4. Partial Triangle Crossing Cases (side view)
2.4 Triangle Stitching The region growing algorithm described previously iterates until all the triangles have been selected. In the chosen demo example, this corresponds to Figure 2(e). At this stage, what remains is to stitch together the 3-D triangle soup (the G queue) in order to obtain a valid, manifold mesh. We adopt a method similar in spirit to [13,14]. In most cases this is a straightforward operation, reduced to identifying the common vertices and edges, followed by stitching. However, there are three special cases in which performing a simple stitching would violate the mesh constraints and produce locally non-manifold structures. The special cases, shown in Figure 5, arise from performing stitching in places where the original structure should have been maintained. We adopt the naming convention from [13], calling them the singular vertex case, the
singular edge case and the singular face case. All cases are easy to identify using only local operations and are identified after all the stitching has been performed. In the singular vertex case, in order to detect whether a vertex v is singular, we adopt the following algorithm: mark all the facets incident to the vertex v as non-visited. Then start from a facet of v, mark it visited and do the same with its non-visited neighbours that are also incident to v (neighbours are chosen based on half-edges). The process is repeated until all the neighbouring facets are processed. If by doing so we have exhausted all the neighbouring facets, vertex v is non-singular; otherwise it is singular, so a copy of it is created and added to all the remaining non-visited facets. In order to detect a singular edge e, all we have to do is count the number of triangles that share that edge. If it is greater than 2, we have a singular edge case, and two additional vertices and a new edge are added to account for it.
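To make the singular-vertex test concrete, the following sketch flood-fills the fan of facets incident to a vertex and reports the vertex as singular when the fan is not connected. The `facets_around` and `neighbours` accessors are hypothetical stand-ins for whatever half-edge data structure is used; this illustrates the test, it is not the authors' implementation.

```python
def is_singular_vertex(mesh, v):
    """Return True if the facets incident to v do not form a single
    connected fan (the singular vertex case of Sect. 2.4).
    mesh.facets_around(v) and mesh.neighbours(f) are assumed helpers."""
    incident = set(mesh.facets_around(v))       # all facets touching v
    if not incident:
        return False
    visited, stack = set(), [next(iter(incident))]
    while stack:
        f = stack.pop()
        if f in visited:
            continue
        visited.add(f)
        # grow only through neighbours that are also incident to v
        stack.extend(g for g in mesh.neighbours(f)
                     if g in incident and g not in visited)
    # if the flood fill did not reach every incident facet, v is singular
    return visited != incident
```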
Fig. 5. Special cases encountered while stitching a triangle soup: (a) singular vertex; (b) singular edge; (c) singular face.
Fig. 6. An example of how a singular vertex occurs in a typical self-intersection removal situation, due to an inverted triangle (marked in red): (a) original mesh; (b) resulting mesh.
Fig. 7. Different topological change examples (2-D simplified view): (a) join case; (b) split case; (c) inside carving case. The outline of the final surface obtained after self-intersection removal is shown in bold blue.
In practice, only the singular vertex case appears in real triangular mesh test cases. Such an example has been created to illustrate such a scenario and it is shown in Figure 6. 2.5 Topological Changes The partial-triangle crossing technique described earlier ensures a natural handling of the topological changes (splits and joins) that plagued the mesh approaches until now. Representative cases are illustrated in Figure 7.
3 Using TransforMesh to Perform Mesh Evolutions Our original motivation in developing a mesh self-intersection removal algorithm was to perform mesh evolutions, in particular when recovering 3-D shapes from multiple calibrated images. As stated earlier, few efforts have been put into mesh-based solutions for such 3-D surface reconstruction problems, mostly due to the topological issues raised by mesh evolutions. However, meshes allow one to focus on the region of interest in space, namely the shape's surface, and, as a result, lower the complexity and lead to better precision with respect to volumetric approaches. In this section we present the application of TransforMesh to the surface reconstruction problem. Often such a problem is cast as an energy minimization over a surface. We decided to start from exact visual hull reconstructions, obtained using [15], and to further improve the mesh using photometric constraints by means of an energy functional described in [7]. The photometric constraints are cast as an energy minimization problem using a similarity measure between pairs of cameras that are close to each other. We denote by S ⊂ R³ the 3-D surface. Let I_i : Ω_i ⊂ R² → R^d be the image captured by camera i (d = 1 for grayscale and d = 3 for color images). The perspective projection of camera i is represented as Π_i : R³ → R². Since the method uses visibility, consider S_i as the part of surface S visible in image i. In addition, the back-projection of the image from camera i onto the surface is represented as Π_i^{-1} : Π_i(S) → S_i. Armed with the above notation, one can compute a similarity measure M_ij of the surface S as the similarity measure between image I_i and the reprojection of image I_j into the other camera i via the surface S. Summing across all the candidate stereo pairs, one can write:

\[ M(S) = \sum_i \sum_{j \neq i} M_{ij}(S) \tag{1} \]

\[ M_{ij}(S) = M\big|_{\Omega_i \cap \Pi_i(S_j)}\left( I_i,\; I_j \circ \Pi_j \circ \Pi_{i,S}^{-1} \right) \tag{2} \]

Finally, the surface evolution equation at a point x is given by:

\[ \frac{\partial S}{\partial t} = -\left( \lambda_1 E_{\mathrm{smooth}} + E_{\mathrm{img}} \right) N \tag{3} \]
where Esmooth depends on the curvature H (see [6]), N represents the surface normal and Eimg is a photoconsistency term that is a summation across pairs of cameras which depends upon derivatives of the similarity measure M, of the images I, of the projection matrices Π and on the distance xz (see [7] for more details).
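As an illustration only, one explicit Euler discretization of Eq. (3) moves every vertex along its normal with a speed combining the smoothness and photo-consistency terms. The helpers passed in (`vertex_normal`, `mean_curvature`, `photo_term`) are hypothetical placeholders for the quantities described above, so this is a minimal sketch rather than the authors' implementation.

```python
import numpy as np

def evolve_step(V, F, vertex_normal, mean_curvature, photo_term,
                lam1=0.3, dt=1e-3):
    """One explicit Euler step of dS/dt = -(lam1*E_smooth + E_img)*N,
    applied per vertex of a triangle mesh (V: vertex array, F: faces)."""
    V_new = V.copy()
    for i in range(len(V)):
        n = vertex_normal(V, F, i)                  # unit surface normal
        speed = lam1 * mean_curvature(i) + photo_term(i)
        V_new[i] = V[i] - dt * speed * n
    return V_new
```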
In the original paper [7], the surface evolution equation was implemented within the level-set framework. We extend it to meshes using the TransforMesh algorithm described in the previous section. The original solution performs surface evolution using a coarse-to-fine approach in order to escape local minima. Traditionally, in level-set approaches, the implicit function that embeds the surface S is discretized evenly on a 3-D grid. As a side effect, all the facets of the recovered surface are of approximately equal size. In contrast, mesh-based approaches do not impose such a constraint and allow facets of all sizes on the evolving surface. This is particularly useful when starting from visual hulls, for which the initial mesh contains triangles of all dimensions. In addition, the dimension of visual hull facets appears to be relevant information, since regions where the visual reconstruction is less accurate, i.e. concave regions on the observed surface, are described by bigger facets on the visual hull. Thus, we adopt a coarse-to-fine approach in which bigger triangles are moved until they stabilize, followed by dimension reduction via edge splits. The algorithm iterates at a smaller scale until the desired smallest edge size is obtained. Therefore, the algorithm uses a multi-scale approach, starting from scale s_max down to s_min = 1 in λ_2 = √2 increments, using Δt = 0.001 as the time step. A vertex can move at most 10% of the average length of its incoming half-edges. The original surface, obtained from a visual hull, is evolved using equation (3), where cross-correlation was used as the similarity measure and λ_1 = 0.3. Every 5 iterations, TransforMesh is run in order to remove the self-intersections and allow for any topological changes.
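The schedule just described can be sketched as the following outer loop; `gradient_step`, `transformesh`, and `split_long_edges` are assumed callables standing in for the evolution step, the self-intersection removal of Section 2, and the edge-split refinement, so this only illustrates the control flow.

```python
import math

def coarse_to_fine_evolution(mesh, s_max, gradient_step, transformesh,
                             split_long_edges, iters_per_scale=100):
    """Evolve at a given edge-length scale, clean up with TransforMesh
    every 5 iterations, then refine the scale by sqrt(2) down to 1."""
    scale = s_max
    while scale >= 1.0:
        for it in range(iters_per_scale):
            mesh = gradient_step(mesh, dt=1e-3)    # one step of Eq. (3)
            if (it + 1) % 5 == 0:
                mesh = transformesh(mesh)          # remove self-intersections
        scale /= math.sqrt(2.0)                    # next, finer scale
        if scale >= 1.0:
            mesh = split_long_edges(mesh, max_edge=scale)
    return mesh
```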
4 Results We have tested the mesh evolution algorithm with the datasets provided by the Multi-View Stereo Evaluation site [9] (http://vision.middlebury.edu/mview/) and our results are comparable with the state of the art: we rank in the top 1-3 (depending on the data set and ranking criteria chosen) and the results are within sub-millimeter accuracy. Detailed results are extracted from the website and presented in Table 1 (consult the website for detailed info and running times). We have also included results by Furukawa et al. [6], Pons et al. [7] and Hernandez et al. [8], considered to be the state of the art. The differences between all methods are very small, ranging from 0.01 mm to 0.3 mm. Some of our reconstruction results are shown in Figure 8. The algorithm reaches a good solution without the presence of a silhouette term in the evolution equation. In a typical evolution scenario, there are more self-intersections at the beginning, but, as the algorithm converges, intersections rarely occur. Additionally, in the temple case, we performed a test where we started from one big sphere as
Table 1. 3-D Rec. Results. Accuracy: the distance d in mm that brings 90% of the result R within the ground-truth surface G. Completeness: the percentage of G that lies within 1.25 mm of R.
Method                Temple Ring       Temple Sparse Ring   Dino Ring         Dino Sparse Ring
                      Acc.     Compl.   Acc.     Compl.      Acc.     Compl.   Acc.     Compl.
Pons et al. [7]       0.60mm   99.5%    0.90mm   95.4%       0.55mm   99.0%    0.71mm   97.7%
Furukawa et al. [6]   0.55mm   99.1%    0.62mm   99.2%       0.33mm   99.6%    0.42mm   99.2%
Hernandez et al. [8]  0.52mm   99.5%    0.75mm   95.3%       0.45mm   97.9%    0.60mm   98.52%
Our results           0.55mm   99.2%    0.78mm   95.8%       0.42mm   98.6%    0.45mm   99.2%
Fig. 8. Reconstruction results obtained in the temple and in the dino case: (a) Dino - input; (b) Dino - start; (c) Dino - results; (d) Dino - closeup; (e) Temple - input; (f) Temple - start; (g) Temple - results; (h) Temple - closeup.
the startup condition, in order to check whether the topological split operation performs properly. Proper convergence was obtained. We acknowledge that TransforMesh was not put to a thorough test using the current data sets, which might leave the reader suspicious about special cases in which the method could fail. We have implemented a mesh morphing algorithm in order to test the robustness of the method. We have successfully morphed meshes with a topology significantly different from that of the surface of departure. Results will be detailed in another publication. Implementation Notes. In our implementation we have made extensive use of the CGAL (Computational Geometry Algorithms Library) [16] C++ library, which provides excellent implementations of various algorithms, among which n-dimensional fast box intersections, 2-D constrained Delaunay triangulation, triangular meshes and support for exact arithmetic kernels. The running time of TransforMesh depends greatly on the number of self-intersections, since more than 80% of the running time is spent detecting them. Typically, the running time for performing the self-intersection test is under 1 second for a mesh with 50,000 facets, where exact arithmetic is used for triangle intersections and the number of self-intersections is in the range of 100.
5 Conclusion We have presented a fully geometric efficient Lagrangian solution for triangular mesh evolutions able to handle topology changes gracefully. We have tested our method in the
context of multi-view stereo 3-D reconstruction and we have obtained top ranking results, comparable with state-of-the-art methods in the literature. Our contribution with respect to the existing methods is to provide a purely geometric mesh-based solution that does not constrain meshes and that allows for facets of all sizes as well as for topology changes. Acknowledgements. This research was supported by the VISIONTRAIN RTN-CT2004-005439 Marie Curie Action within the European Community’s Sixth Framework Programme. This paper reflects only the author’s views and the Community is not liable for any use that may be made of the information contained therein.
References
1. McInerney, T., Terzopoulos, D.: T-snakes: Topology adaptive snakes. Medical Image Analysis 4(2), 73–91 (2000)
2. Lachaud, J.O., Taton, B.: Deformable model with adaptive mesh and automated topology changes. In: Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling (2003)
3. Osher, S., Fedkiw, R.: Level Set Methods and Dynamic Implicit Surfaces. Springer, Heidelberg (2003)
4. Osher, S., Sethian, J.: Fronts propagating with curvature dependent speed: algorithms based on the Hamilton-Jacobi formulation. Journal of Computational Physics 79(1), 12–49 (1988)
5. Jung, W., Shin, H., Choi, B.K.: Self-intersection removal in triangular mesh offsetting. Computer-Aided Design and Applications 1(1-4), 477–484 (2004)
6. Furukawa, Y., Ponce, J.: Accurate, dense and robust multi-view stereopsis. CVPR (2007)
7. Pons, J.P., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision 72(2), 179–193 (2007)
8. Hernandez, C.E., Schmitt, F.: Silhouette and stereo fusion for 3-D object modeling. Computer Vision and Image Understanding 96(3), 367–392 (2004)
9. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–526. IEEE Computer Society Press, Los Alamitos (2006)
10. Kettner, L., Meyer, A., Zomorodian, A.: Intersecting sequences of dD iso-oriented boxes. In: Board, C.E. (ed.) CGAL-3.2 User and Reference Manual (2006)
11. Zomorodian, A., Edelsbrunner, H.: Fast software for box intersection. International Journal of Computational Geometry and Applications (12), 143–172 (2002)
12. Hert, S., Seel, M.: dD convex hulls and Delaunay triangulations. In: Board, C.E. (ed.) CGAL-3.2 User and Reference Manual (2006)
13. Gueziec, A., Taubin, G., Lazarus, F., Horn, B.: Cutting and stitching: Converting sets of polygons to manifold surfaces. IEEE Transactions on Visualization and Computer Graphics 7(2), 136–151 (2001)
14. Shin, H., Park, J.C., Choi, B.K., Chung, Y.C., Rhee, S.: Efficient topology construction from triangle soup. In: Proceedings of Geometric Modeling and Processing (2004)
15. Franco, J.S., Boyer, E.: Exact polyhedral visual hulls. In: British Machine Vision Conference, vol. 1, pp. 329–338 (2003)
16. Board, C.E.: CGAL-3.2 User and Reference Manual (2006)
Microscopic Surface Shape Estimation of a Transparent Plate Using a Complex Image Masao Shimizu and Masatoshi Okutomi Graduate School of Science and Engineering, Tokyo Institute of Technology
Abstract. This paper proposes a method to estimate the surface shape of a transparent plate using a reflection image on the plate. The reflection image on a transparent plate is a complex image that consists of a reflection on the surface and on the rear surface of the plate. A displacement between the two reflection images holds the range information to the object, which can be extracted from a single complex image. The displacement in the complex image depends not only on the object range but also on the normal vectors of the plate surfaces, plate thickness, relative refraction index, and the plate position. These parameters can be estimated using multiple planar targets with random texture at known distances. Experimental results show that the proposed method can detect microscopic surface shape differences between two different commercially available transparent acrylic plates.
1 Introduction
Shape estimation of a transparent object is a difficult problem of computer vision that has attracted many researchers. The background changes its apparent position through a moving transparent object according to the refraction of light [1]. Based on that observation, some techniques have been proposed to estimate the shape and the refraction index of a transparent object: using optical flow [5], detecting the positional change of structured lights [4], and measuring the effects on the shape of a structured light [3] caused by a moving object. Other methods that have been investigated include using polarized light [6]. Meanwhile, a monocular range estimation method, named reflection stereo, using a complex image observed as a reflection on a transparent parallel planar plate, has also been proposed [7],[8]. The complex image consists of the reflection at the plate surface (the surface reflection image Is) and the reflection at the rear surface of the plate, which is then transmitted again through the surface (the rear-surface reflection image Ir). The object range is obtainable from the displacement in the complex image, that is, the displacement between Is and Ir, which has a unique constraint that is the equivalent of the epipolar constraint in stereo vision. The displacement in the complex image depends not only on the object range, but also on the normal vectors of the plate surfaces, the plate thickness, the relative refraction index, and the plate position. This paper proposes a method to estimate these parameters of the transparent plate through calibration of the reflection stereo system, using multiple planar
targets with random texture at known distances. These estimated parameters are available for range estimation; simultaneously, they express the shape and orientation of the plate. The proposed method estimates a microscopic shape of a plate using a differential measurement of Is and Ir , whereas conventional methods have used only light transmitted through a transparent object. The method might be used for an industrial inspection of a planar glass plate with a common camera. This paper is organized as follows. Section 2 describes range estimation by reflection stereo with a perfectly parallel planar transparent plate. Section 3 extends to the case with a non-parallel planar plate, and shows the parameters to be estimated. Section 4 presents a parameter estimation method through the calibration of the reflection stereo using multiple targets. Experimental results are described in Section 5. This paper concludes with remarks in Section 6.
2 Range Estimation from a Single Complex Image
This section describes reflection stereo with a perfectly parallel planar plate. Figure 1 shows the two light paths from an object to the camera center. A transparent plate reflects and transmits the incident light on its surface. The transmitted light is then reflected on the rear surface and is transmitted again to the air through the surface. These two light paths have an angle disparity θs, which depends on the relative refractive index n of the plate, the plate thickness d, the incident angle θi, and the object distance Do. The fundamental relation between the angle disparity θs and the distance Do is explainable as the reflection and refraction of light in a plane including the object, the optical center, and the normal vector of the plate. A two-dimensional (2D) ξ-υ coordinate system is set with its origin at the reflecting point on the surface. The following equation can be derived by projecting the object position (−Do sin θi, Do cos θi) and the optical center position (Dc sin θi, Dc cos θi) onto the ξ-axis:

\[ D_o + D_c = d\,\frac{\sin\bigl(2(\theta_i - \theta_s)\bigr)}{\sin\theta_s \sqrt{n^2 - \sin^2(\theta_i - \theta_s)}} \tag{1} \]

The angle disparity θs is obtainable by finding the displacement in the complex image. Then the object distance Do is derived from Eq. (1). The displacement has a constraint which describes a correspondent position in the rear-surface image
Fig. 1. Fundamental light paths in the reflection stereo
moving along a constraint line according to the image position [7]. The constraint reduces the search to 1D, just as stereo vision with the epipolar constraint. The angle disparity takes the minimum θs = 0 when Do = ∞ if the plate is manufactured perfectly as a parallel planar plate.
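As a purely numerical illustration of Eq. (1) as reconstructed above, the object distance can be evaluated directly once the angle disparity has been measured; the values below (45° incidence, a 10 mm acrylic plate) are assumed example inputs, not data from the paper.

```python
import math

def object_distance(theta_i, theta_s, n, d, D_c):
    """Evaluate D_o from Eq. (1):
    D_o + D_c = d*sin(2(theta_i - theta_s)) /
                (sin(theta_s)*sqrt(n^2 - sin^2(theta_i - theta_s))).
    Angles in radians, distances in mm."""
    num = math.sin(2.0 * (theta_i - theta_s))
    den = math.sin(theta_s) * math.sqrt(n**2 - math.sin(theta_i - theta_s)**2)
    return d * num / den - D_c

# assumed example: 45 deg incidence, small measured angle disparity
print(object_distance(math.radians(45.0), math.radians(0.05),
                      n=1.49, d=10.0, D_c=66.2))
```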
3 Reflection Stereo with Non-parallel Planar Plate
The transparent acrylic plate has limited parallelism and a specific wedge angle, even if it has sufficient surface flatness. In this section, we investigate range estimation by reflection stereo using a non-parallel planar plate.
3.1 Range Estimation Using Ray Tracing
Figure 2 depicts the camera coordinate system with its origin at the optical center. Here f/δ denotes the lens focal length in CCD pixel units. The transparent plate is set at an angle with respect to the Z-axis. We use ray tracing [9] to describe the light paths and estimate the object distance. Figure 3(a) shows that an object observed in a surface reflection image at an image position A(a) = [u_a, v_a, f/δ] lies on the following line L_S with parameter t_S:

\[ L_S(t_S) = q_S + v_S t_S \tag{2} \]
\[ q_S = q(o, a, m_S, \hat n_S), \qquad v_S = r(a, \hat n_S), \]

where M_S(m_S) and \hat n_S respectively denote a position and the unit normal vector of the plate surface. Point O(o) represents the origin.
Fig. 2. Transparent plate and camera
Fig. 3. Surface and rear-surface reflections: (a) surface reflection image; (b) rear-surface reflection image.
Similarly, a detected displacement d = d*_p(a) + e*(a)Δ of a rear-surface reflection image from the image position a shows that the object in the rear-surface image lies on the following line L_R with parameter t_R:

\[ L_R(t_R) = q_R + v_R t_R \tag{3} \]
\[ q_R = q(q_1, v_1, m_S, \hat n_S), \qquad v_R = t(v_1, -\hat n_S, 1/n), \]
\[ q_1 = q(q_0, v_0, m_R, \hat n_R), \qquad v_1 = r(v_0, \hat n_R), \]
\[ q_0 = q(o, a + d, m_S, \hat n_S), \qquad v_0 = t(d, \hat n_S, n), \]

where M_R(m_R) and \hat n_R respectively denote a position and the unit normal vector of the rear surface of the plate, as shown in Fig. 3(b). In addition, d*_p(a) and e*(a) denote the displacement of the rear-surface reflection image for an object at infinite distance and a normal vector of the constraint line at the image position a. The intersection of the lines L_S and L_R represents the 3D position P_O(\hat p_O) of the object. In real situations, the object position P_O(\hat p_O) is determined using the following equation, which minimizes the distance between the two lines:

\[ \hat p_O = \frac{1}{2}\bigl[(q_S + v_S t_S) + (q_R + v_R t_R)\bigr] \tag{4} \]
\[ \begin{bmatrix} t_S \\ t_R \end{bmatrix} = \begin{bmatrix} q_S \cdot v_S & -q_S \cdot v_R \\ q_R \cdot v_S & -q_R \cdot v_R \end{bmatrix}^{-1} \begin{bmatrix} q_S \cdot q_R - \|q_S\|^2 \\ \|q_R\|^2 - q_S \cdot q_R \end{bmatrix} \]

For the above, we use the following relations, presented in Fig. 4:

\[ q(p, v, s, \hat n) = p + v\,\frac{(s - p)\cdot\hat n}{v\cdot\hat n} \tag{5} \]
\[ r(v, \hat n) = v - 2\hat n\,(v\cdot\hat n) \tag{6} \]
\[ t(v, \hat n, n) = \frac{1}{\sqrt{\,n^2\left\|\frac{v}{v\cdot\hat n}\right\|^2 - \left\|\frac{v}{v\cdot\hat n} - \hat n\right\|^2}}\left(\frac{v}{v\cdot\hat n} - \hat n\right) + \hat n \tag{7} \]

3.2 Parameters for Reflection Stereo
Reflection stereo with a non-parallel planar plate requires not only the lens focal length f/δ as a parameter, but also the following parameters, which are equivalent to extrinsic parameters in stereo vision:
– surface and rear-surface normal vectors n_S and n_R of the plate,
– a position m_S on the surface,
– plate thickness d, and
– relative refraction index n.
Fig. 4. A ray and a normal vector of the plate
With these extrinsic parameters, the object range is obtainable using Eq. (4) with a displacement in a complex image.
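For illustration, the ray-tracing relations can be written down directly in NumPy. The three helpers follow Eqs. (5)-(7) as reconstructed above, while the two-line intersection uses the standard closest-point normal equations rather than reproducing the exact matrix form of Eq. (4); this is a sketch, not the authors' code.

```python
import numpy as np

def q(p, v, s, n_hat):
    """Eq. (5): intersection of the ray p + t*v with the plane
    through s with unit normal n_hat."""
    return p + v * (np.dot(s - p, n_hat) / np.dot(v, n_hat))

def r(v, n_hat):
    """Eq. (6): mirror reflection of direction v about n_hat."""
    return v - 2.0 * n_hat * np.dot(v, n_hat)

def t(v, n_hat, n):
    """Eq. (7): refracted direction for relative refraction index n."""
    u = v / np.dot(v, n_hat)
    denom = np.sqrt(n**2 * np.dot(u, u) - np.dot(u - n_hat, u - n_hat))
    return (u - n_hat) / denom + n_hat

def line_midpoint(qS, vS, qR, vR):
    """Midpoint of the closest points of two lines (cf. Eq. (4)),
    computed via the standard least-squares normal equations."""
    A = np.array([[np.dot(vS, vS), -np.dot(vS, vR)],
                  [np.dot(vS, vR), -np.dot(vR, vR)]])
    b = np.array([np.dot(vS, qR - qS), np.dot(vR, qR - qS)])
    tS, tR = np.linalg.solve(A, b)
    return 0.5 * ((qS + vS * tS) + (qR + vR * tR))
```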
4 Surface Shape Estimation by Reflection Stereo
4.1 Piecewise Non-parallel and Planar Model of the Plate
Real surfaces of acrylic or glass plates are not perfectly flat at some level. Consequently, the extrinsic parameters are position-variant in the complex image. The relative refraction index can differ among production lots and manufacturers; it might also be position-variant. The proposed method assumes that the plate is piecewise non-parallel and planar, as illustrated in Fig. 5, for reflection stereo with a real transparent plate. In other words, the method assumes that the extrinsic parameters are constant in the small region that forms the complex image around a position of interest. The object range p̂_Oy(a, d | p_e(a)) along the Y-axis is obtainable using Eq. (4) for the small region that includes the image position a, where p_e(a) = {n_S(a), n_R(a), m_S(a), d(a), n(a)} denotes the extrinsic parameter vector at position a.
4.2 Parameter Estimation Method
The extrinsic parameters can be estimated from M sets of the object distance D_y and its corresponding displacement d at position a in a complex image, analogously to calibration for stereo vision [10]:

\[ \hat p_e(a) = \arg\min_{p_e(a)} E(a) \tag{8} \]
\[ E(a) = \sum_{i=1}^{M} \bigl\| D_{yi} - \hat p_{Oy}(a, d_i \mid p_e(a)) \bigr\|^2 + \alpha \bigl\| \tilde p_e - p_e(a) \bigr\|^2 \]
Fig. 5. Extrinsic parameters for a piecewise non-parallel and planar model
The second term in the objective function E(a) is a stabilization term that prevents the estimated parameters from deviating markedly from the design (catalogue) values p̃_e. The weight α is set empirically to the minimum value. We have used the conjugate gradient method to minimize the nonlinear objective function, with the design values as initial parameters. The displacement d in a complex image can be detected as the second peak location of the autocorrelation function of the complex image, without any knowledge about a constraint, if the image contains a rich and dense texture. The seven components of the extrinsic parameters can be estimated from M ≥ 4 sets of observations¹.
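A sketch of this estimation step, using SciPy's general-purpose minimizer in place of the conjugate gradient routine the authors used; the ray-traced range function `p_hat_Oy` and the packing of the seven-component parameter vector are assumed.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_extrinsics(D_y, disps, a, p_hat_Oy, p_design, alpha=1e-3):
    """Minimize Eq. (8): a data term over M (distance, displacement)
    pairs plus a stabilization term pulling p_e toward design values."""
    def E(p_e):
        data = sum((D_y[i] - p_hat_Oy(a, disps[i], p_e)) ** 2
                   for i in range(len(D_y)))
        return data + alpha * np.sum((p_design - p_e) ** 2)
    res = minimize(E, x0=np.asarray(p_design, dtype=float), method="CG")
    return res.x
```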
4.3 Parametric Expression for Range Estimation
The estimated parameters are considered continuous and their changes are small with respect to the image position, as is clear from observations of real acrylic plates. The following parametric expression with a two-dimensional cubic function with respect to the image position u = (u, v) is used for range estimation:

\[ \bar p_e(u) = \Phi_1 + \Phi_2 u + \Phi_3 v + \Phi_4 uv + \Phi_5 u^2 + \Phi_6 v^2 + \Phi_7 u^2 v + \Phi_8 u v^2 + \Phi_9 u^3 + \Phi_{10} v^3, \tag{9} \]

where Φ_j (j = 1, 2, .., 10) denotes a coefficient vector corresponding to the seven components of the extrinsic parameters. These vectors are obtainable using least-squares estimation, with the estimated extrinsic parameters p̂_e(a_l) at image positions a_l, as follows:

\[ \{\Phi_j\}_{j=1}^{10} = \arg\min_{\Phi} \sum_l \bigl\| \bar p_e(a_l) - \hat p_e(a_l) \bigr\|^2 \tag{10} \]
The constraint line direction and the infinite object position with respect to the image position are also obtained in advance, and parameterized similarly as the extrinsic parameters for range estimation.
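The fit of Eqs. (9)-(10) is an ordinary linear least-squares problem once the ten cubic monomials are evaluated at each image position; the sketch below (with placeholder position and parameter arrays) shows one way to set it up.

```python
import numpy as np

def cubic_basis(u, v):
    """The ten monomials of Eq. (9) at image position (u, v)."""
    return np.array([1.0, u, v, u * v, u * u, v * v,
                     u * u * v, u * v * v, u ** 3, v ** 3])

def fit_parametric_expression(positions, p_hat):
    """positions: (L, 2) image positions a_l; p_hat: (L, 7) estimated
    extrinsic parameters.  Returns Phi of shape (10, 7), solving Eq. (10)."""
    A = np.stack([cubic_basis(u, v) for u, v in positions])   # (L, 10)
    Phi, *_ = np.linalg.lstsq(A, p_hat, rcond=None)
    return Phi

def evaluate(Phi, u, v):
    """Evaluate the fitted expression at a new image position."""
    return cubic_basis(u, v) @ Phi
```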
5 Experimental Results
5.1 System Setup
The experimental system was constructed as shown in Fig. 6. The plate is set at an angle of π/4 to the optical axis. The system includes a black and white camera (60 FPS, IEEE1394, Flea; Point Grey Research Inc.). The design values of the plate are 10.0 [mm] thickness and a relative refraction index of 1.49. The measured distance from the lens optical center to the plate along the optical axis is Dco = 66.2 [mm]. We used a well-known camera calibration tool [2] for
¹ The normal vector has two degrees of freedom (2-DOF); the position on the surface is 1-DOF, determined by a distance along the Z-axis if the surface normal is given. In total, the extrinsic parameters have 2 + 2 + 1 + 1 + 1 = 7 DOF.
Fig. 6. Configuration of the experimental system
Fig. 7. Setup of the calibration target
intrinsic parameter estimation. The calibrated parameters are the focal length f/δ = 1255.7 and the image center (cu, cv) = (−31.845, −11.272); the image size is 640 × 480 [pixel]. Image distortion was alleviated using the estimated lens distortion parameters.
5.2 Calibration Targets
A random textured planar target at 13 different positions is used for extrinsic parameter estimation. As depicted in Fig. 7, the experimental system is mounted on a motorized long-travel translation stage. Figure 8 shows a complex image for a plane distance of 500 [mm]. As described in Section 4.2, the displacement (shift value) along the constraint line in the complex image is obtainable as the second peak location of the autocorrelation function.
Fig. 8. Complex image example (Do = 500 [mm])
The target positions are at Dyi = 350 + 50i + De [mm] (i = 1, 2, .., 13) from the optical center along the Y-axis, where
Fig. 9. Estimated plate thickness and refraction index w.r.t. the image positions: (a) estimated thickness [mm]; (b) estimated refraction index.
De is an unknown value. To estimate De, the following sum of residuals is first evaluated for three predetermined values (De = 0, ±25); the three sums of residuals are then used to estimate the De giving the minimum sum by parabola fitting over the three sums:

\[ \sum_l \sum_{i=1}^{M} \bigl\| (350 + 50i + D_e) - \hat p_{Oy}(a_l, d_i \mid p_e(a_l)) \bigr\|^2 \tag{11} \]
Figure 9 displays the estimated plate thickness d and relative refraction index n of transparent plate “A” (described in the next subsection), at 24 × 18 = 432 image positions. The plate area corresponding to the image positions is about 30 × 30 [mm]. The mean and standard deviation of the plate thickness are, respectively, 10.005712 and 5.351757 × 10⁻³ [mm]; those of the refraction index are 1.489652 and 4.427796 × 10⁻⁴, whereas the catalogue values are 10.0 [mm] and 1.49.
5.3 Microscopic Surface Shape Estimation of a Transparent Plate
Two types of commercially produced transparent acrylic plate are used to estimate the extrinsic parameters. Both plates have the same thickness of 10 [mm]; plate “A” was cut from a cast acrylic plate to 100 × 100 [mm], whereas plate “B” was a cast acrylic plate marketed at that size without after-sales cutting. Figure 10 displays the estimated surface and rear-surface positions of plate “A”, corresponding to 24 × 18 = 432 image positions. In this figure resolution, the microscopic surface shape difference between plate “A” and “B” is invisible. Figures 11 and 12 respectively display the estimated surface (a) and rearsurface (b) shapes of plates “A” and “B”. The surface shapes are displayed with respect to least-squares approximation surfaces for each plate. The two plates are measured using the same conditions. The vertical axes are extremely magnified. The standard deviation of the plate “A” surface is 0.013077, whereas that of plate “B” is 0.010649. More than the numerical difference, the shape difference between the two plates is readily apparent from the figures.
Fig. 10. Estimated surface and rear-surface positions of plate “A”
Fig. 11. Estimated surface and rear-surface shapes of plate “A”: (a) estimated surface shape; (b) estimated rear-surface shape.
Fig. 12. Estimated surface and rear-surface shapes of plate “B”: (a) estimated surface shape; (b) estimated rear-surface shape.
6 Conclusions
In this paper, we have proposed a method to estimate the surface shape of a transparent plate using a complex image. The displacement in the complex image depends not only on the object range but also on the normal vectors of the plate surfaces, plate thickness, relative refraction index, and the plate position. These parameters can be estimated using multiple planar targets with a random texture at known distances. Experimental results show that the proposed method can detect microscopic surface shape differences between two different commercially available transparent acrylic plates. Future works will include an investigation of the shape estimation accuracy and resolution using an optical standard, and a study of the precision limits of the proposed method.
References
1. Ben-Ezra, M., Nayar, S.K.: What Does Motion Reveal about Transparency? Proc. on ICCV 2, 1025–1032 (2003)
2. Bouguet, J.-Y.: Camera Calibration Toolbox for Matlab (2007), http://www.vision.caltech.edu/bouguetj/calib_doc/index.html
3. Hata, S., Saitoh, Y., Kumamura, S., Kaida, K.: Shape Extraction of Transparent Object using Genetic Algorithm. Proc. on ICPR 4, 684–688 (1996)
4. Manabe, Y., Tsujita, M., Chihara, K.: Measurement of Shape and Refractive Index of Transparent Object. Proc. on ICPR 2, 23–26 (2004)
5. Murase, H.: Surface Shape Reconstruction of a Nonrigid Transparent Object using Refraction and Motion. IEEE Trans. on PAMI 14(10), 1045–1052 (1992)
6. Saito, M., Sato, Y., Ikeuchi, K., Kashiwagi, H.: Measurement of Surface Orientations of Transparent Objects using Polarization in Highlight. Proc. on ICPR 1, 381–386 (1999)
7. Shimizu, M., Okutomi, M.: Reflection Stereo – Novel Monocular Stereo using a Transparent Plate. In: Proc. on Third Canadian Conference on Computer and Robot Vision, pp. 14–14 (2006) (CD-ROM)
8. Shimizu, M., Okutomi, M.: Monocular Range Estimation through a Double-Sided Half-Mirror Plate. In: Proc. on Fourth Canadian Conference on Computer and Robot Vision, pp. 347–354 (2007)
9. Whitted, T.: An Improved Illumination Model for Shaded Display. Communications of the ACM 23(6), 343–349 (1980)
10. Zhang, Z.: Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. Proc. on ICCV 1, 666–673 (1999)
Shape Recovery from Turntable Image Sequence H. Zhong, W.S. Lau, W.F. Sze, and Y.S. Hung Department of Electrical and Electronic Engineering, the University of Hong Kong Pokfulam Road, Hong Kong, China {hzhong, sunny, wfsze, yshung}@eee.hku.hk
Abstract. This paper makes use of both feature points and silhouettes to deliver fast 3D shape recovery from a turntable image sequence. The algorithm exploits object silhouettes in two views to establish a 3D rim curve, which is defined with respect to the two frontier points arising from two views. The images of this 3D rim curve in the two views are matched using cross correlation technique with silhouette constraint incorporated. A 3D planar rim curve is then reconstructed using point-based reconstruction method. A set of rims enclosing the object can be obtained from an image sequence captured under circular motion. The proposed method solves the problem of reconstruction of concave object surface, which is usually left unresolved in general silhouette-based reconstruction methods. In addition, the property of the organized reconstructed rim curves allows fast surface extraction. Experimental results with real data are presented. Keywords: silhouette; rim reconstruction; surface extraction; circular motion.
1 Introduction The 3D modelling of real world objects is an important problem in computer vision and has many practical applications, such as in virtual reality, video games, motion tracking systems, etc. The point-based approach is the oldest technique for 3D reconstruction. Once feature points across different views are matched, a cloud of feature points lying on the surface of the object can be recovered by triangulation methods. However, these points are not ordered, which means that the topological relationship among them is unknown. As a result, they are not readily usable for reconstructing the surface of an object. Besides feature points, silhouettes are prominent features of an image that can be extracted reliably. They provide useful information about both the shape and motion of an object, and indeed are the only features that can be extracted from a textureless object. But they do not provide point correspondences between images, due to the viewpoint dependency of silhouettes. There is a need for the development of methods combining both point-based and silhouette approaches for modelling a 3D object. The contribution of this paper is that it takes advantage of both point-based and silhouette approaches to provide fast shape recovery. The method reconstructs a set of 3D rim curves on the object surface from a calibrated circular motion image sequence, with concavities on the object surface retained. These 3D rim curves are reconstructed in an organized manner consistent with the circular motion image order, leading to low computation complexity in the subsequent surface triangulation.
This paper is organized as follows. Section 2 briefly reviews the literature on model reconstruction. Section 3 introduces the theoretical principles which are used in this paper. Section 4 describes how the rim curves are computed. The extraction of surface from the reconstructed rim curves is given in section 5. Experimental results are given in section 6 and a summary follows in section 7.
2 Previous Works There are two main streams of methods for model reconstruction, namely point-based and silhouette-based methods. In a point-based method, a limited number of 3D points are initially reconstructed from matched feature points. A 3D model is then obtained by computing a dense depth map of the 3D object from the calibrated images using stereo matching techniques in order to recover the depths of all the object points, (see, for example, [1, 2]). It is further necessary to fuse all the dense depth map into a common 3D model [3] and extract coherently connected surface by interpolation [4]. Although a good and realistic 3D model may be obtained by carrying out these processes, both the estimation of dense depth maps and the fusion of them are of high computation complexity and reported to be formidably time consuming [5]. Another model reconstruction approach utilizes object silhouettes instead of feature points, see seminal works [6, 7]. There are two groups of methods in this approach, namely surface and volumetric methods. The surface method aims to determine points on the surface of the visual hull by either updating depth intervals along viewing lines [8] or using a dual-space approach via epipolar parameterization [9, 10]. On the other hand, volume segment representation, introduced in [11], was for constructing the bounding volume which approximated the actual three-dimensional structure of the rigid object generating the contours in multiple views. In this technique, octree representation is adopted to provide a volumetric description of an object in terms of regular grids or voxels [12-14]. Given the volumetric data in form of an octree, the marching cubes algorithm [15] can be used to extract a triangulated surface mesh for visualization. Since only silhouettes are used, no feature points are needed. However, the aforementioned methods produce only the visual hull of the object with respect to the set of viewpoints from which the image sequence is captured. The main drawback is hence that concave surfaces of the object cannot be reconstructed. More recently, many methods are proposed to combine the silhouette and the photo-consistency constraint to obtain higher quality results based on different formulations, cf. [16-20]. Broadly speaking, these methods all employ a two-step approach to surface reconstruction. In the first step, a surface mesh is initialized by computing a visual hull satisfying the silhouette consistency constraint. In the second step, the surface mesh is refined using the photo-consistency constraint. In this paper, we also consider combining the properties of silhouettes and feature points in shape recovery from a calibrated circular motion image sequence. However, unlike the methods mentioned above which aimed to reconstruct the whole photoconsistent surface based on a visual hull, the proposed method computes a set of discrete photo-consistent surface curves which are circumnavigating the object to represent the shape. The advantages of using discrete surface curves to represent
object shape and to form surface are two fold. First, no visual hull and time-consuming surface optimization are needed, and secondly, surface concavities can be recovered. This is particularly suitable for fast modelling with more accurate shape description than visual hull. Experimental results show that through combining the photo-consistency and silhouette constraints, the surface curves can be computed accurately.
3 Theoretical Principles 3.1 Geometry of the Surface Consider an object with a smooth surface viewed by a pin-hole camera. The contour generator on the object surface is the set of points at which the rays cast from the camera center are orthogonal to the surface normal [21]. The projection of a contour generator on the image plane forms an apparent contour. A silhouette is a subset of the apparent contour where the viewing rays of the contour generator touch the object [22]. Due to the viewpoint dependency of the contour generators, silhouettes from two distinct viewpoints will be the projections of two different contour generators. As a result, there will be no point correspondence between the two silhouettes except for the frontier point(s) [21]. A frontier point is the intersection of the two contour generators in space and is visible in both silhouettes. For two distinct views, there are in general two frontier points at which the two outer epipolar planes are tangent to the object surface. Hence, their projections in one view are two outer epipolar tangent points on the object silhouette. The outer epipolar tangent points in the two associated views are corresponding points.
Fig. 1. A 3D rim curve associated with frontier points for two views. The two cameras centers C1 and C2 define two epipolar planes tangent to the objects at two frontier points X1 and X2 whose images are outer epipolar tangent points. The plane containing C1, X1 and X2 cuts the object at a 3D planar rim curve. The curve segment visible to cameras C1 and C2 projects onto them as a straight line and a 2D rim curve respectively.
3.2 Definition of a 3D Rim Curve A 3D rim curve on an object surface is defined with respect to two distinct views as follows. The two frontier points associated with the two views, together with one of the two camera centers, define a plane. This plane intersects the object surface in a 3D planar curve on the object surface. The two frontier points are on this curve and they cut the curve into segments. The one closer to the chosen camera center is defined as the 3D rim curve. The projection of this rim curve in the view associated with the chosen camera is a straight line, and its image in the other view is generally a 2D curve. Fig. 1 illustrates the rim curve so defined. 3.3 Cross-Correlation Cross-correlation is a general technique for finding correspondences between points in two images. The matching technique is based on the similarity between two image patches, and cross-correlation gives a measure of similarity. Let Ik(uk, vk) be the intensity of image k at pixel xk = (uk, vk) and the correlation mask size be (2n+1)×(2m+1). When n = m, n is called the half mask size. The normalized correlation score between a pair of points x1 and x2 is given by:
\[ r(x_1, x_2) = \frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m} \bigl[I_1(u_1+i, v_1+j) - \bar I_1(u_1, v_1)\bigr]\cdot\bigl[I_2(u_2+i, v_2+j) - \bar I_2(u_2, v_2)\bigr]}{\bigl[(2n+1)\times(2m+1)\bigr]\cdot\delta(I_1)\cdot\delta(I_2)} \tag{1} \]

where

\[ \bar I_k(u_k, v_k) = \frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m} I_k(u_k+i, v_k+j)}{(2n+1)\times(2m+1)} \tag{2} \]

is the average intensity about point x_k, and

\[ \delta(I_k) = \sqrt{\frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m} \bigl[I_k(u_k+i, v_k+j) - \bar I_k(u_k, v_k)\bigr]^2}{(2n+1)\times(2m+1)}} \tag{3} \]
is the standard deviation of intensity over the mask. The feature point x2 which maximizes the response of the normalized correlation score is deemed to be the best match to x1, and it is usual to define a threshold T to remove dissimilar matches (i.e. the correlation score between corresponding points has to be larger than T, otherwise it is not considered a match).
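The correlation of Eqs. (1)-(3) translates directly into a few lines of NumPy; the sketch below scores one candidate pair of points and picks the best match among a list of candidates (images, points, and the threshold are placeholders).

```python
import numpy as np

def ncc(I1, I2, x1, x2, n=10, m=10):
    """Normalized cross-correlation of Eqs. (1)-(3) between the
    (2n+1)x(2m+1) patches centred at x1=(u1,v1) in I1 and x2 in I2."""
    (u1, v1), (u2, v2) = x1, x2
    P1 = I1[v1 - m:v1 + m + 1, u1 - n:u1 + n + 1].astype(float)
    P2 = I2[v2 - m:v2 + m + 1, u2 - n:u2 + n + 1].astype(float)
    D1, D2 = P1 - P1.mean(), P2 - P2.mean()
    denom = P1.size * D1.std() * D2.std()           # (2n+1)(2m+1)*delta1*delta2
    return float((D1 * D2).sum() / denom) if denom > 0 else -1.0

def best_match(I1, I2, x1, candidates, T=0.8):
    """Candidate in I2 with the highest score; reject scores below T."""
    scores = [ncc(I1, I2, x1, c) for c in candidates]
    k = int(np.argmax(scores))
    return candidates[k] if scores[k] >= T else None
```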
4 Fast Modeling by Rim Reconstruction Given a calibrated circular motion image sequence, silhouettes are first extracted reliably from the images. For two adjacent views, the images of the two frontier
points are estimated using the silhouettes. A 2D rim (the image of a 3D rim curve in one view) is then defined as a line joining the two projected frontier points in an image. The correspondences of this 2D rim in the second image are found using cross-correlation methods. A 3D rim curve is then reconstructed by the linear triangulation method [23]. Repeating this process for the image sequence on a sequential two-view basis, a set of rim curves can be reconstructed. Since the object undergoes circular motion, the rims are constructed in a known order. 4.1 Silhouette Extraction
In order to extract silhouettes, cubic B-spline snakes [24] are adopted to achieve sub-pixel accuracy. They provide a concise representation of silhouettes. In addition, using a B-spline representation allows explicit computation of the tangent at each point on the silhouette, which simplifies the subsequent identification of outer epipolar tangents (to be described in Section 4.2). We will assume that each silhouette consists of one closed curve.
4.2 Estimation of Outer Epipolar Tangents
An outer epipolar tangent is an epipolar line tangent to the silhouette passing through the epipole. Given the B-spline and the known epipolar geometry between two views, the outer epipolar tangent points can be calculated as follows. Let Ai (i = 1, 2) be the set of sample points on the B-spline and ei be the epipole in image i from camera center j (j = 1, 2, j ≠ i); then the two outer epipolar tangent points in image i are determined by

\[ \max_{p_i \in A_i} \frac{p_i(y) - e_i(y)}{p_i(x) - e_i(x)} \quad \text{and} \quad \min_{p_i \in A_i} \frac{p_i(y) - e_i(y)}{p_i(x) - e_i(x)}. \tag{4} \]

Fig. 2 illustrates the two outer epipolar tangents determined in view one based on the epipolar geometry between view one and view two.
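Equation (4) amounts to taking the extrema of the slope of the line joining each sampled silhouette point to the epipole; a small sketch (with placeholder sample points and epipole):

```python
import numpy as np

def outer_epipolar_tangent_points(A, e, eps=1e-9):
    """A: (N, 2) sampled silhouette points p_i; e: (2,) epipole in the
    same image.  Returns the two points attaining the max and min of
    (p_i(y) - e(y)) / (p_i(x) - e(x)) as in Eq. (4)."""
    dx = A[:, 0] - e[0]
    dy = A[:, 1] - e[1]
    slope = dy / np.where(np.abs(dx) < eps, eps, dx)  # guard near-vertical lines
    return A[np.argmax(slope)], A[np.argmin(slope)]
```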
Fig. 2. Estimation of outer epipolar tangent
4.3 Matching of 2D Rims
Given two views, the outer epipolar tangents in two views can be computed based on the estimated epipoles and the silhouettes as described in the last subsection. Let us name the two views as the source view and the target view respectively. In the source view, the rim is a straight line connecting the two epipolar tangent points. We need to
find the counterpart of this rim in the target view. First, points on the rim in the source view are sampled. As the silhouettes of all images are available, for each sample point sp in the rim in source view, we can determine the associated visual hull surface points corresponding to the ray back-projected from sp [8]. Suppose the pair of visual hull surface points on the ray associated with sp is V1 and V2. We can limit the search region for sp in the target view by making use of V1 and V2, i.e. the epipolar line of sp in the target view to be searched for matched point is bounded by the projections of V1 and V2. Furthermore, we can use a parameter λ to reduce the searchable depth range from V1 to (1-λ)V1+λV2 based on prior knowledge about the shape of the object. For example, taking a λ=0.5 assumes that the maximum concavity of the object at sp is not larger than half of the depth from V1 to V2. Thus the search region in the target view can be accordingly reduced. After the search region along the epipolar line is defined, each corresponding point of the rim in the target view is determined as the pixel within the delimited epipolar line which has the highest cross correlation score with the sampled point in the source view. An example of matched rims in two consecutive views is shown in Fig. 3, from which we see that the reduction of depth search range greatly increases the matching accuracy.
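For one sample point, the delimited search can be sketched as below; the projection function, the visual hull endpoints V1 and V2, and the sampling density are placeholders, and `best_match` refers to the correlation sketch given earlier.

```python
import numpy as np

def match_rim_point(I_src, I_tgt, sp, V1, V2, project, lam=0.5,
                    n_samples=50, T=0.8):
    """Search for the match of sample point sp along the epipolar
    segment bounded by the projections of V1 and (1-lam)*V1 + lam*V2."""
    p_near = project(V1)                       # image of the near hull point
    p_far = project((1.0 - lam) * V1 + lam * V2)
    candidates = [tuple(np.round(p_near + t * (p_far - p_near)).astype(int))
                  for t in np.linspace(0.0, 1.0, n_samples)]
    return best_match(I_src, I_tgt, sp, candidates, T=T)
```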
Fig. 3. Matching of a rim in two views. The rim in the source view (a) is a straight line (green). The search region in the target view (b) is bounded by the projections of visual hull surface points determined based on silhouette information. The rim in the target view (c) is found by searching matched points along the delimited scan lines (blue) joining the two bounding points.
4.4 Reconstruction and Insertion of 3D Rims
Once the 2D rim points are matched, the 3D rim structure can easily be computed using the optimal triangulation method. Repeating this process for every two successive views in the image sequence, a set of 3D rims on the object surface can be obtained. After this reconstruction process, the rims of the model are generated, as shown in Fig. 4(a). However, there may be big gaps between pairs of rims where the surface is concave, see Fig. 4(a). This is because frontier points never occur on the concave part of a surface, which the outer epipolar tangents cannot reach. To fill in a gap between two rims, new rims can be inserted. We first compute a 2D interpolated line between two rims in an image for which a new rim is required. Then, the correspondences in the target image are determined by the feature point matching process described in section 4.2. As a result, a 3D interpolated rim can be generated on the concave surface. This insertion of new rims is carried out until no more rims need to be inserted. Fig. 4(b) shows the newly added rims (red) and the original rim set (blue).
Fig. 4. Insertion of new 3D rims on concave surface. In (a) there may be a gap between two rims; in (b) more rims (red color) are inserted to fill in gaps.
5 Surface Formation 5.1 Slicing the 3D Rims
The proposed surface extraction method is built on a slice based re-sampling method, and it produces fairly evenly distributed mesh grids. The rims are first re-sampled to give cross-sectional contours by parallel slicing planes. Each slicing plane contains the intersections of the rims with the plane. For a better visual effect, the normal of the slicing planes is chosen to be the direction parallel to the rotation axis of circular motion for imaging the object. Since the 3D rims are generated in the order in accordance with the circular motion image sequence, in general the sampled points on each slice can be re-grouped by linking these points according to the spatial order of the corresponding rims. However, if the frontier points (the top most and lowest points) in the rims are very close, the sample points on the slicing plane near those frontier points may not exactly be in the desired order, for example, points on the slicing plane near the top of the object. So we reorder the sample points on the slicing planes before forming the polygonal cross section. Nonetheless, due to the sequential reconstruction of the rims, only the points on the top two slicing planes are affected. In the matching process, through enforcing the silhouette constraint, the search region on the epipolar line for matched point is substantially reduced, giving a very high accuracy of matching and thus a good reconstruction of surface points. However, there may be still few points reconstructed to be inside the object surface because of wrong matching. This type of error cannot be detected by the silhouette constraint, but can be reduced by smoothing the polygon on the slicing plane and by spatial smoothing on the extracted surface. The result of slice-based model after smoothing is shown in Fig. 6(a). 5.2 Surface Triangulation
In order to allow the reconstructed 3D model to be displayed using conventional rendering algorithms (e.g., in the VRML standard adopted here), a triangulated mesh is extracted from the polygons on each slicing plane using a surface triangulation method, and triangle patches are produced that best approximate the surface. Since we have the same number of sample points on each slicing plane and the sample points lie on the organized rims, triangle patches can be formed by a simple
strategy: starting from the first sample point, say, p_i^1 on layer i, the closest point p_{i+1}^k on layer i+1 to p_i^1 is located. The third vertex can then be either p_i^2 or p_{i+1}^{k+1}, and is determined by considering two factors: the angles formed at p_i^1 and at p_{i+1}^k must not exceed a threshold, and the sum of the distances to p_i^1 and p_{i+1}^k should be small. If both candidate points meet the angle constraint, or conversely both violate it, the one of p_i^2 and p_{i+1}^{k+1} giving the shorter distance sum is chosen; otherwise, the one satisfying the angle constraint is selected.
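The following is a simplified sketch of this stitching strategy between two adjacent slice contours. It keeps only the shorter-diagonal rule and omits the angle-threshold test described above; the names stitch_layers, layer_a and layer_b are ours.

```python
import numpy as np

def stitch_layers(layer_a, layer_b):
    """Zigzag triangulation between two closed slice contours.

    layer_a, layer_b : (N, 3) arrays of ordered sample points on two
    adjacent slicing planes (same number of points, as in the text).
    Returns a list of triangles as ((side, index), ...) tuples.
    """
    n = len(layer_a)
    # Align the rings: start layer_b at the point closest to layer_a[0].
    k0 = int(np.argmin(np.linalg.norm(layer_b - layer_a[0], axis=1)))
    tris, i, j = [], 0, 0            # i walks on layer_a, j on layer_b
    while i < n or j < n:
        a, b = i % n, (k0 + j) % n
        a_next, b_next = (i + 1) % n, (k0 + j + 1) % n
        # Compare the two candidate diagonals and keep the shorter one.
        adv_a = np.linalg.norm(layer_a[a_next] - layer_b[b])
        adv_b = np.linalg.norm(layer_b[b_next] - layer_a[a])
        if i < n and (adv_a <= adv_b or j >= n):
            tris.append((("a", a), ("b", b), ("a", a_next)))
            i += 1
        else:
            tris.append((("a", a), ("b", b), ("b", b_next)))
            j += 1
    return tris
```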
6 Experimental Results

Our approach has been tested on real data. Due to space limits, an experimental result on one circular motion image sequence is shown in this section. The experiment is performed over a turntable sequence of 36 images of a “Clay man” toy with a moderately complex surface structure (see Fig. 3). All the cameras have been calibrated, which means that the projection matrices of the cameras are already known. Since the performance of the proposed method depends on the accuracy of matching, we evaluate the proposed method by comparing our matching results with those generated by the graph-cut based stereo vision algorithm [25].
Fig. 5. Comparison of matching results between the graph-cut based method (GCM) and the proposed method: (a) 2D rim points in the source view (white dots); (b) matched 2D rim points in the target view computed by GCM (green triangles) and by the proposed method (white dots); (c) and (d) show a zoomed-in region of (a) and (b), respectively.
We calculate the 2D rim curves on a successive-view-pair basis on gray scale images. A half-window size of 10 pixels is used in calculating the normalized cross correlation, and the parameter λ is chosen to be 0.5 in our method. For the graph-cut based method (GCM), the default parameter values are used except for the disparity range. The disparity search range is from −1 pixel to +40 pixels. As the images are large, they are first reduced to a region containing the object with small margins before being submitted to GCM. Despite this, the running time of GCM for matching one pair of size-reduced images was more than 40 minutes. In contrast, our
method took less than half a minute to match the 2D rim curves in the original images. Fig. 5 shows the matching results of two views by our method and GCM. It can be seen from Fig. 5(b) and (d) that our method finds all correct matches but GCM does not. In the experiment, the percentage of correctly determined correspondences, judged visually, is about 99% for our method, whereas that of GCM is less than 25%. Similar results are obtained for other view pairs, which demonstrates the effectiveness and accuracy of the proposed method.
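The matching score referred to above can be sketched as follows. This is only an illustration of window-based normalized cross correlation evaluated at candidate pixels on the delimited scan line; the λ-weighted cost actually used by the authors is not reproduced here, the function names are ours, and image-border handling is omitted.

```python
import numpy as np

def ncc(img_s, img_t, ps, pt, half=10):
    """Normalized cross-correlation between two square windows.

    img_s, img_t : grayscale images as 2-D float arrays.
    ps, pt       : (row, col) centres of the windows in source/target.
    half         : half-window size (10 pixels in the experiments).
    """
    ws = img_s[ps[0]-half:ps[0]+half+1, ps[1]-half:ps[1]+half+1].astype(float)
    wt = img_t[pt[0]-half:pt[0]+half+1, pt[1]-half:pt[1]+half+1].astype(float)
    ws, wt = ws - ws.mean(), wt - wt.mean()
    denom = np.sqrt((ws**2).sum() * (wt**2).sum())
    return 0.0 if denom == 0 else float((ws * wt).sum() / denom)

def match_on_scanline(img_s, img_t, p_src, candidates, half=10):
    """Among the candidate target pixels on the delimited scan line,
    pick the one maximizing the NCC with the source rim point."""
    scores = [ncc(img_s, img_t, p_src, c, half) for c in candidates]
    return candidates[int(np.argmax(scores))]
```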
Fig. 6. (a) The model after slicing the reconstructed rims using parallel planes; (b) the triangulated mesh model; (c) the textured model in VRML from a new viewpoint
After reconstructing all 3D rim curves, the surface of the object can be formed as discussed in Section 5.2. The final result of our method is shown in Fig. 6, where Fig. 6(b) presents the model in terms of the computed triangle patches and Fig. 6(c) shows the final result after texture mapping from the original image in VRML format.
7 Discussion

This paper presents a new model reconstruction method from a circular motion image sequence. It combines the merits of both silhouette-based and point-based methods. By making use of the outer epipolar tangent points, a set of structured 3D rims can be rapidly reconstructed. The proposed rim structure greatly reduces the computational complexity of surface extraction and provides a flexible reconstruction. The flexibility lies in the different levels of quality of the reconstruction, depending on the number of rims and slicing planes used to approximate the surface.
References

1. Sun, J., Zheng, N.N., Shum, H.Y.: Stereo Matching Using Belief Propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(7), 787–800 (2003) 2. Sun, C.: Fast Stereo Matching Using Rectangular Subregioning and 3D Maximum-Surface Techniques. International Journal of Computer Vision 47(1/2/3), 99–117 (2002) 3. Koch, R., Pollefeys, M., Van Gool, L.: Multi viewpoint stereo from uncalibrated video sequences. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 55–71. Springer, Heidelberg (1998)
4. Pollefeys, M., Van Gool, L.: From Images to 3D Models, Ascona, pp. 403–410 (2001) 5. Tang, W.K.: A Factorization-Based Approach to 3D Reconstruction from Multiple Uncalibrated Images. In: Department of Electrical and Electronic Engineering, p. 245. The University of Hong Kong, Hong Kong (2004) 6. Baumgart, B.G.: Geometric Modelling for Computer Vision. Standford University (1974) 7. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150–162 (1994) 8. Boyer, E., Franco, J.S.: A hybrid approach for computing visual hulls of complex objects. Computer Vision and Pattern Recognition, 695–701 (2003) 9. Brand, M., Kang, K., Cooper, B.: An algebraic solution to visual hull. Computer Vision and Pattern Recognition, 30–35 (2004) 10. Liang, C., Wong, K.Y.K.: Complex 3D Shape Recovery Using a Dual-Space Approach. In: IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, pp. 878–884. IEEE Computer Society Press, Los Alamitos (2005) 11. Martin, W.N., Aggarwal, J.K.: Volumetric descriptions of objects from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence 5(2), 150–158 (1983) 12. Chien, C.H., Aggarwal, J.K.: Volume/surface octrees for the representation of threedimensional objects. Computer Vision, Graphics and Image Processing 36(1), 100–113 (1986) 13. Hong, T.H., Shneier, M.O.: Describing a robot’s workspace using a sequence of views from a moving camera. IEEE Transactions on Pattern Analysis and Machine Intelligence 7(6), 721–726 (1985) 14. Szeliski, R.: Rapid octree construction from image sequences. Computer Vision, Graphics and Image Processing 58(1), 23–32 (1993) 15. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface construction algorithm. ACM Computer Graphics 21(4), 163–169 (1987) 16. Esteban, C.H., Schmitt, F.: Silhouette and Stereo fusion for 3D object modeling. Computer Vision and Image Understanding (96), 367–392 (2004) 17. Sinha, S., Pollefeys, M.: Multi-view reconstruction using Photo-consistency and Exact silhouette constraints: A Maximum-Flow Formulation. In: IEEE International Conference on Computer Vision, pp. 349–356. IEEE, Los Alamitos (2005) 18. Furukawa, Y., Ponce, J.: Carved Visual Hulls for Image-Based Modeling. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006) 19. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. In: International Conference on Computer Vision, Kerkyra, Greece, pp. 307–314 (1999) 20. Isidoro, J., Sclaroff, S.: Stochastic Refinement of the Visual Hull to Satisfy Photometric and Silhouette Consistency Constraints. In: IEEE International Conference on Computer Vision, pp. 1335–1342. IEEE Computer Society Press, Los Alamitos (2003) 21. Cipolla, R., Giblin, P.J.: Visual Motion of Curves and Surfaces. Cambridge Univ. Press, Cambridge, U.K. (1999) 22. Wong, K.Y.K.: Structure and Motion from Silhouettes. In: Department of Engineering, p. 196. University of Cambridge, Cambridge (2001) 23. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 24. Cipolla, R., Blake, A.: The dynamic analysis of apparent contours. In: International Conference on Computer Vision, Osaka, Japan, pp. 616–623 (1990) 25. Kolmogorov, V., Zabih, R.: Multi-camera Scene Reconstruction via Graph Cuts. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. 
LNCS, vol. 2350, pp. 82–96. Springer, Heidelberg (2002) 26. Franco, J.S., Boyer, E.: Exact Polyhedral Visual Hulls. In: British Conference on Computer Vision, pp. 329–338 (2003)
Shape from Contour for the Digitization of Curved Documents

Frédéric Courteille, Jean-Denis Durou, and Pierre Gurdjos

IRIT, Toulouse, France {courteil,durou,gurdjos}@irit.fr

Abstract. We aim at extending the basic digital camera functionalities to the ability to simulate the flattening of a document, by virtually acting like a flatbed scanner. Typically, the document is the warped page of an opened book. The problem is stated as a computer vision problem, whose resolution involves, in particular, a 3D reconstruction technique, namely shape from contour. Assuming that a photograph is taken by a camera in an arbitrary position and orientation, and that the model of the document surface is a generalized cylinder, we show how the corrections of its geometric distortions, including perspective distortion, can be achieved from a single view of the document. The performance of the proposed technique is assessed and illustrated through experiments on real images.
1 Introduction
The digitization of documents is currently enjoying increasing popularity, because of the expansion of Internet browsing. The traditional process, which uses a flatbed scanner, is satisfactory for flat documents, but is unsuitable for curved documents such as, for example, a thick book, since some defects will appear in the digitized image. Several specific systems have been designed, but such systems are sometimes intrusive with regard to the documents and, above all, they cannot be regarded as consumer equipment. An alternative consists in simulating the flattening of curved documents, i.e., in correcting the defects of images provided by a flatbed scanner or a digital camera. In this paper, we describe a new method for simulating document flattening which uses one image taken from an arbitrary angle of view, and not only a frontal view (as is often the case in the literature). The obtained results are very encouraging. In Section 2, different techniques for simulating document flattening are reviewed. In Section 3, a new 3D-reconstruction method based on the so-called shape-from-contour technique is discussed. In Section 4, this method is applied to the flattening simulation of curved documents. Finally, Section 5 concludes our study and states several perspectives.
2 Techniques of Simulation of Document Flattening
One can address a purely 2D-deformation of the image, in order to correct its defects according to an a priori modelling of the flattened document [1,2]. In [3,4,5],
the orientation of the characters is estimated, so as to straighten them out. In all these papers, the results are of poor quality because, although the lines of text are rather well uncurved, the narrowing of the characters near the binding is not well corrected. In [6], a judicious 2D-deformation is introduced, which considers that the contour of each page becomes rectangular after flattening: the results are nice, but a “paper checkerboard pattern” must be placed behind the document to force both of them to have the same 3D-shape, and this makes the process rather complicated.

In order to successfully simulate the document flattening, it is necessary to compute its surface shape. Stereoscopy aims at reconstructing the shape of a document from several photographs taken from different angles of view. In [7], the CPU time is very high when dealing with two images of size 2048 × 1360. On the other hand, this technique works well only if the stereo rig has been intrinsically and extrinsically calibrated. Stereophotometry requires several photographs taken from the same angle of view, but under different lightings. A model linking the image greylevel to the orientation of the surface is then used. It has been implemented by Cho et al. [8]. The results are of mean quality since, for photographs taken at close range, perspective should be taken into account. Structured-lighting systems also make use of two photographs taken under two different lightings, knowing that, for one of the photographs, a pattern is projected onto the document [9,10]. The deformation of the pattern in this image gives some information on the surface shape. A second photograph is required, in order to avoid possible artefacts of the pattern in the flattening simulation. The best results using this technique are presented in [11], but they make use of a dedicated imaging system.

Shape-from-texture has also been used. An a priori knowledge of the document assumes that the text is printed along parallel lines, which is the case for most documents. Hence, the shape of the document surface may be deduced from the deformation of the lines of text in the image. This technique, which has been implemented by Cao et al. [12] on photographs of cylindrical books taken in frontal view, works well and quickly (a few seconds on an image of size 1200 × 1600). Its crucial step consists in extracting the lines of text. In [13], it is generalized to any angle of view. Nevertheless, the latter work assumes that the lines of text are also equally spaced. The oldest contribution to the simulation of document flattening uses the shape-from-shading technique. Wada et al. [14] take advantage of the greylevel gradation in the non-inked areas of a scanned image, in order to estimate the slope of the document surface. This idea has been taken up and improved by Tan et al. [15], whose results are of good quality, and also by Courteille et al. [16]. The latter paper provides two noticeable improvements: a digital camera replaces the flatbed scanner, so as to accelerate the digitization process, and a new model of shape-from-shading is stated, which takes perspective into account. Finally, the shape-from-contour technique may be used, i.e., the deformation of the contours in the image provides information on the surface shape. This technique has been implemented in [17,6,18] on photographs of cylindrical books taken in frontal view. In [19], it is generalized to any applicable surfaces: the results are of mean quality, but this
last contribution is worth of mention, since it reformulates the problem elegantly, as a set of differential equations. The method of simulation of document flattening that we discuss in this paper uses the shape-from-contour technique to compute the shape of the document from one photograph taken from an arbitrary angle of view.
3 3D-Reconstruction Using Shape-from-Contour
In the most general situation, shape-from-contour (SFC) is an ill-posed problem: the same contours may characterize many different scenes. To make the problem well-posed, it is necessary to make some assumptions. In [20], the scene is supposed to be cylindrically symmetrical. In the present work, we suppose that its surface is a generalized cylinder having straight meridians.
3.1 Notations and Choice of the Coordinate Systems
The photographic bench is represented in Fig. 1: f refers to the focal length and C to the optical center; the axis Cz coincides with the optical axis, so that the equation of the image plane Πi is z = f. The digital camera is supposed to lie in an arbitrary position with regard to the book, apart from the fact that the optical axis must be non-parallel to the binding. Hence, the vanishing point F of the binding direction can be located at infinity, but it is distinct from the principal point O. Thus, we can define the axis Ox by the straight line FO, in such a way that FO > 0. Furthermore, we complete Ox by an axis Oy such that Oxy is an orthonormal coordinate system of the image plane Πi (cf. Fig. 1). It is convenient to define two 3D orthonormal coordinate systems: Ro = Cxyz and Rp = Cuvw, where Cu coincides with the straight line FC, which is parallel to the binding, and where Cv coincides with Cy. Since Cu intersects Πi at F ≠ O, it follows that Cw intersects Πi at a point Ω which also lies on the axis Ox. We introduce the orthonormal coordinate system Ri = Ωxy of Πi. The angle between Cz and Cw is denoted α (cf. Fig. 1). The case α = 0 corresponds to the frontal orientation of the camera. Denoting c = cos α, s = sin α and t = tan α, it can easily be stated that OΩ = t f, FO = f/t and FΩ = f/(c s). Finally, we denote Πr the plane orthogonal to Cw and containing the binding, whose equation is w = δ.
3.2 Relations Between Object and Image
Let P be an object point, whose coordinates are (X, Y, Z) w.r.t. Ro and (U, V, W) w.r.t. Rp. The transformation rules between these two sets of coordinates are:

X = c U + s W,    (1a)
Y = V,            (1b)
Z = −s U + c W.   (1c)
Fig. 1. Representation of the photographic bench
Let Q be the image point conjugated to P, whose coordinates are (u, v) w.r.t. Ri. Using the perspective projection rules, we obtain:

u = (f/Z) X − t f,    (2a)
v = (f/Z) Y.          (2b)

Denoting f′ = f/c, the equations (1a), (1b), (1c), (2a) and (2b) give:

u = f′ U / (−s U + c W),    (3a)
v = f V / (−s U + c W).     (3b)

Let us define the “pseudo-image” Q̄ of P as the image of the orthogonal projection P̄ of P on Πr. As the coordinates of P̄ are (U, V, δ) w.r.t. Rp, the coordinates (ū, v̄) of Q̄ w.r.t. Ri are:

ū = f′ U / (−s U + c δ),    (4a)
v̄ = f V / (−s U + c δ).     (4b)

Dividing (4a) by (3a) and (4b) by (3b), we find:

ū/u = v̄/v.    (5)
This equality means that the image points Q, Q̄ and Ω are aligned, i.e., Ω is the vanishing point of the direction orthogonal to Πr. In general, the knowledge of an image point Q does not allow us to compute its conjugated object point P. But, if we also know the pseudo-image Q̄ associated to P, then we can compute the coordinates of P. Actually, it can be deduced from (3a), (3b) and (4a):

U = δ c ū / (f′ + s ū),                 (6a)
V = δ v ū / (u (f′ + s ū)),             (6b)
W = δ ū (f′ + s u) / (u (f′ + s ū)).    (6c)

Note that (6c) gives W = δ when u = ū, i.e., for image points such that Q = Q̄. For a given image point Q, if the location of the associated pseudo-image Q̄ is known, the coordinates (U, V, W) of the conjugated object point P can be computed using (6a), (6b) and (6c). Nevertheless, in general, the location of the pseudo-image Q̄ on the straight line ΩQ is unknown.
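As a small numerical illustration of equations (6a)–(6c), the following sketch recovers (U, V, W) from an image point and its pseudo-image, assuming the focal length f, the viewing angle α and the scale parameter δ are known. The function and variable names are ours, not the authors'.

```python
import math

def object_point_from_pseudo_image(u, v, u_bar, f, alpha, delta=1.0):
    """Recover (U, V, W) from an image point and its pseudo-image.

    u, v  : coordinates of the image point Q w.r.t. Ri = Omega-xy.
    u_bar : abscissa of the pseudo-image Q-bar on the line Omega-Q.
    f     : focal length; alpha : viewing angle in radians;
    delta : arbitrary scale parameter (w = delta defines Pi_r).
    Implements equations (6a)-(6c); note that u == u_bar yields W == delta.
    """
    c, s = math.cos(alpha), math.sin(alpha)
    fp = f / c                       # f' = f / cos(alpha)
    U = delta * c * u_bar / (fp + s * u_bar)
    V = delta * v * u_bar / (u * (fp + s * u_bar))
    W = delta * u_bar * (fp + s * u) / (u * (fp + s * u_bar))
    return U, V, W
```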
3.3 Additional Assumptions
Within the framework of our application, the scene is a book. We make two additional assumptions:
• A1 – The flattened pages are rectangular.
• A2 – The pages of the book are curved in such a way that they form a generalized cylinder.
As the surface of the book is a generalized cylinder, the lower and upper contours are located in two planes which are orthogonal to the binding. Thanks to this property, the SFC problem becomes well-posed. As the binding belongs to the plane Πr, its image and its pseudo-image coincide. Under the assumptions A1 and A2, it is easy to predict that the pseudo-images of the upper and lower contours of the book (whose images are called Cu and Cl) are the straight lines Lu and Ll, parallel to Ωy and passing through the ends Bu and Bl of the image B of the binding (cf. Fig. 2). Let Q be an image point of coordinates (u, v) w.r.t. Ri = Ωxy. We call θ the polar angle in the coordinate system Fxy, and Qu and Ql the two image points located on Cu and Cl which have the same polar angle θ as Q (cf. Fig. 2). Considering the assumptions A1 and A2, and knowing that F is the vanishing point of the binding direction, the object point P conjugated to Q has the same coordinate W as the two object points conjugated to Qu and Ql. We denote uu(θ) and ul(θ) the abscissas of Qu and Ql w.r.t. Ri. Finally, we denote θB the polar angle of B.
Fig. 2. Geometric construction of the pseudo-image Q̄ associated to an image point Q
According to these notations, uu(θB) and ul(θB) are the abscissas of the pseudo-images Q̄u and Q̄l associated to Qu and Ql, w.r.t. Ri. Hence, if we denote f″ = f′/s, the equation (6c) gives, when applied to Qu and to Ql:

W = δ (uu(θB)/uu(θ)) (f″ + uu(θ))/(f″ + uu(θB)),    (7a)
W = δ (ul(θB)/ul(θ)) (f″ + ul(θ))/(f″ + ul(θB)).    (7b)
From either of these two expressions of W, we can deduce the other coordinates U and V of P by solving the system of the two equations (3a) and (3b). Considering the expressions (7a) and (7b) of W and the equations (3a) and (3b), it appears that the computation of the shape of the document requires the knowledge of some parameters: the focal length f, the viewing angle α and the location of the principal point O. On the other hand, the parameter δ can be chosen arbitrarily, because the shape of the document can be computed only up to a scale factor.
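The following sketch puts equations (7a) and (3a)–(3b) together: it computes W from the contour abscissas and then back-substitutes to obtain U and V. It assumes f and α are known and uses our own function names; it is an illustration of the formulas above under these assumptions, not the authors' implementation.

```python
import math

def depth_from_contour(u_theta, u_theta_B, f, alpha, delta=1.0):
    """Equation (7a)/(7b): W from the contour abscissa u(theta) at the
    point's polar angle and the abscissa u(theta_B) of the corresponding
    binding end, with f'' = f' / sin(alpha)."""
    c, s = math.cos(alpha), math.sin(alpha)
    fpp = (f / c) / s
    return delta * (u_theta_B / u_theta) * (fpp + u_theta) / (fpp + u_theta_B)

def lift_image_point(u, v, W, f, alpha):
    """Back-substitute W into (3a)-(3b) to obtain U and V."""
    c, s = math.cos(alpha), math.sin(alpha)
    fp = f / c
    U = u * c * W / (fp + s * u)        # from (3a)
    V = v * (-s * U + c * W) / f        # from (3b)
    return U, V
```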
4 Application to the Simulation of Document Flattening
Due to lack of space, we do not show any results on synthetic images, but only on real images. The left column of Fig. 3 shows three photographs of the same
Fig. 3. Three photographs of the same book taken from three different angles of view, and the flattening simulations obtained from each of them: (a-b) α = 1.5°; (c-d) α = 20.38°; (e-f) α = 40.54°
book taken from three different angles of view. For each of them, the shape of the document is computed using the method described in Section 3. The flattening simulations are shown in the right column of Fig. 3, knowing that a generalized cylinder is particularly easy to flatten. The same six images are zoomed in on the area near the binding, which contains a picture representing a samurai (cf. Fig. 4). It appears that when the angle of view increases, the quality of the flattening
Fig. 4. Zooms on a common area of the six images of Fig. 3
simulation decreases but, even for a large angle of view (cf. Fig. 3-e), the result remains acceptable (cf. Fig. 3-f). A second result shows that our method works well even in the most general case of camera pose, i.e., when the optical axis is also tilted in the direction perpendicular to the axis of the cylinder (cf. Fig. 5).
5 Conclusion and Perspectives
In this paper, we generalize the 3D-shape reconstruction of a document from its contours, as it had previously been stated in frontal view in [17,6,18], to the case of an arbitrary view. We validate this result by simulating the flattening of curved documents taken from different angles of view. Even when the angle of view noticeably increases, the quality of the result remains rather good, in
Fig. 5. Most general case of camera pose: (a) original and (b) flattened images
comparison with other results in the literature that are obtained under similar conditions. In the present state of our knowledge, the focal length and the location of the principal point have to be known. As a first perspective, we aim at generalizing the 3D-shape reconstruction to the case of an uncalibrated camera. In addition, when the angle of view is too large, focusing blur occurs, which inevitably restricts the quality of the flattening simulation. Rather than enduring this defect, it could be interesting to correct it, knowing that the 3D-shape of the document could allow us to predict the focusing blur magnitude.
References 1. Tang, Y.Y., Suen, C.Y.: Image Transformation Approach to Nonlinear Shape Restoration. IEEE Trans. Syst. Man and Cybern. 23(1), 155–172 (1993) 2. Weng, Y., Zhu, Q.: Nonlinear Shape Restoration for Document Images. In: Proc. IEEE Conf. Comp. Vis. and Patt. Recog., San Francisco, California, USA, pp. 568–573. IEEE Computer Society Press, Los Alamitos (1996) 3. Zhang, Z., Tan, C.L.: Recovery of Distorted Document Images from Bound Volumes. In: Proc. 6th Int. Conf. Doc. Anal. and Recog., Seattle, Washington, USA, pp. 429–433 (2001) 4. Lavialle, O., Molines, X., Angella, F., Baylou, P.: Active Contours Network to Straighten Distorted Text Lines. In: Proc. IEEE Int. Conf. Im. Proc., Thessaloniki, Greece, vol. III, pp. 748–751. IEEE Computer Society Press, Los Alamitos (2001) 5. Wu, C.H., Agam, G.: Document Image De-Warping for Text/Graphics Recognition. In: Caelli, T.M., Amin, A., Duin, R.P.W., Kamel, M.S., de Ridder, D. (eds.) SPR 2002 and SSPR 2002. LNCS, vol. 2396, pp. 348–357. Springer, Heidelberg (2002) 6. Tsoi, Y.C., Brown, M.S.: Geometric and Shading Correction for Images of Printed Materials: A Unified Approach Using Boundary. In: Proc IEEE Conf. Comp. Vis. and Patt. Recog., vol. I, pp. 240–246. IEEE Computer Society Press, Washington, D.C., USA (2004)
7. Yamashita, A., Kawarago, A., Kaneko, T., Miura, K.T.: Shape Reconstr. and Image Restoration for Non-Flat Surfaces of Documents with a Stereo Vision System. In: Proc. 17th Int. Conf. Patt. Recog., Cambridge, UK, vol. I, pp. 482–485 (2004) 8. Cho, S.I., Saito, H., Ozawa, S.: A Divide-and-conquer Strategy in Shape from Shading Problem. In: Proc.IEEE Conf. Comp. Vis. and Patt. Recog., San Juan, Puerto Rico, pp. 413–419. IEEE Computer Society Press, Los Alamitos (1997) 9. Doncescu, A., Bouju, A., Quillet, V.: Former books digital processing: image warping. In: Proc. IEEE Worksh. Doc. Im. Anal., San Juan, Puerto Rico, pp. 5–9. IEEE Computer Society Press, Los Alamitos (1997) 10. Brown, M.S., Seales, W.B.: Document Restoration Using 3D Shape: A General Deskewing Algorithm for Arbitrarily Warped Documents. In: Proc. 8th IEEE Int. Conf. Comp. Vis., Vancouver, Canada, vol. I, pp. 367–375. IEEE Computer Society Press, Los Alamitos (2001) 11. Sun, M., Yang, R., Yun, L., Landon, G., Seales, W.B., Brown, M.S.: Geometric and Photometric Restoration of Distorted Documents. In: Proc. 10th IEEE Int. Conf. Comp. Vis., Beijing, China, vol. II, pp. 1117–1123. IEEE Computer Society Press, Los Alamitos (2005) 12. Cao, H., Ding, X., Liu, C.: Rectifying the Bound Document Image Captured by the Camera: A model Based Approach. In: Proc. 7th Int. Conf. Doc. Anal. and Recog., Edinburgh, UK, pp. 71–75 (2003) 13. Liang, J., DeMenthon, D., Doermann, D.: Flattening Curved Documents in Images. In: Proc. IEEE Conf. Comp. Vis. and Patt. Recog., San Diego, California, USA, vol. II, pp. 338–345. IEEE Computer Society Press, Los Alamitos (2005) 14. Wada, T., Ukida, H., Matsuyama, T.: Shape from Shading with Interreflections Under a Proximal Light Source: Distortion-Free Copying of an Unfolded Book. Int. J. Comp. Vis. 24(2), 125–135 (1997) 15. Tan, C.L., Zhang, L., Zhang, Z., Xia, T.: Restoring Warped Document Images through 3D Shape Modeling. IEEE Trans. PAMI 28(2), 195–208 (2006) 16. Courteille, F., Crouzil, A., Durou, J.D., Gurdjos, P.: Shape from Shading for the Digitization of Curved Documents. Mach. Vis. and Appl. (to appear) 17. Kashimura, M., Nakajima, T., Onda, N., Saito, H., Ozawa, S.: Practical Introduction of Image Processing Technology to Digital Archiving of Rare Books. In: Proc. Int. Conf. Sign. Proc. Appl. and Techn., Toronto, Canada, pp. 1025–1029 (1998) 18. Courteille, F., Durou, J.D., Gurdjos, P.: Transform your Digital Camera into a Flatbed Scanner. In: Proc. 9th Eur. Conf. Comp. Vis., 2nd Works. Appl. Comp. Vis., Graz, Austria, pp. 40–48 (2006) 19. Gumerov, N., Zandifar, A., Duraiswami, R., Davis, L.S.: Structure of Applicable Surfaces from Single Views. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 482–496. Springer, Heidelberg (2004) 20. Colombo, C., Del Bimbo, A., Pernici, F.: Metric 3D Reconstruction and Texture Acquisition of Surfaces of Revolution from a Single Uncalibrated View. IEEE Trans. PAMI 27(1), 99–114 (2005)
Improved Space Carving Method for Merging and Interpolating Multiple Range Images Using Information of Light Sources of Active Stereo

Ryo Furukawa1, Tomoya Itano1, Akihiko Morisaka1, and Hiroshi Kawasaki2

1 Faculty of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima, Japan
[email protected], {t itano,a morisaka}@toc.cs.hiroshima-cu.ac.jp
2 Faculty of Engineering, Saitama University, 255, Shimo-okubo, Sakura-ku, Saitama, Japan
[email protected]

Abstract. To merge multiple range data obtained by range scanners, filling holes caused by unmeasured regions, the space carving method is a simple and effective method. However, this method often fails if the number of input range images is small, because unseen voxels that are not carved out remain in the volume area. In this paper, we propose an improved space carving algorithm that produces stable results. In the proposed method, a discriminant function defined on the volume space is used to estimate whether each voxel is inside or outside the objects. Also, in the particular case that the range images are obtained by an active stereo method, the information about the positions of the light sources can be used to improve the accuracy of the results.
1 Introduction

Shape data obtained by 3D measurement systems are often represented as range images. In order to obtain the entire shape model from multiple scanning results, it is necessary to merge these range images. For this task, several types of approaches have been proposed. Among them, the methods using signed distance fields have been widely researched because they are capable of processing a large volume of input data efficiently. Signed distance fields have also been used as a shape representation in order to interpolate holes appearing in unmeasured parts of object surfaces. Curless and Levoy filled holes of a measured shape by classifying each voxel in volume space as either Unseen (unseen regions), NearSurface (near the surface), or Empty (outside the object), and generating a mesh between Unseen and Empty regions [1]. This process is known as the space carving (SC) method. The SC method is capable of effectively interpolating missing parts when there is plenty of observed data. However, when only a few range images are captured, the SC method often fails. This problem occurs because the target object and the “remains of carving” in volume space become connected. One of the approaches to solve this problem would be classifying the Unseen voxels as either inside or outside of the object. In this paper, an object merging and interpolation method based on this approach is proposed. Since Unseen voxels include both
unobserved voxels inside an object (due to occlusion or low reflection) and voxels outside the object, it is necessary to discriminate between these cases. To classify voxels, we take the following two approaches: (1) defining a discriminant function for classifying Unseen voxels, and (2) using the positions of light sources if the range images are obtained using active stereo methods. Under the proposed method, the large “remains of carving” that often occur in the SC method are not generated. Also, since all voxels are classified as inside, outside or near the surface, closed shapes are always obtained. A unique property of the proposed algorithm is that it can be implemented on a GPU. Recently, many methods have been proposed for utilizing the computational performance of GPUs for general calculations besides graphics. Our algorithm, too, can be executed efficiently on a GPU.
2 Related Work

To merge multiple range data into a single shape, several types of approaches have been proposed, such as, for example, generating a mesh from unorganized points [2], using a deformable model represented as a level-set [3], stitching meshes at the overlapped surfaces (zippering) [4], and methods using signed distance fields [1,5,6]. In particular, methods using signed distance fields have been widely researched since they are capable of processing a large volume of input data. In order to express the distance from a voxel to the object’s surface, the Euclidean distance (the shortest distance between the center of the voxel and the object’s surface) is theoretically preferable [5,6]. However, since the computational cost of calculating the Euclidean distance is large, simplified measures such as the line-of-sight distance (the distance between the voxel center and the object’s surface measured along the line of sight from the viewpoint of the measurement) are sometimes used instead [1].

Regarding filling holes in the surface of a shape defined as the isosurface of a signed distance field, a number of methods have already been proposed. Davis et al. [7] presented a method in which the signed distance field volume data is diffused using a smoothing filter. As shown in the experiments later, this method sometimes propagates the isosurface in incorrect directions and yields undesirable results (Figure 4(b)). In the SC method proposed by Curless et al. [1], all voxels are first initialized as Unseen. Then, all voxels between the viewpoints and the object surfaces are set to Empty (this method carves the voxels in front of the observed surfaces, in contrast to the SC method used for 3D reconstruction from 2D silhouettes, in which the voxels in the surrounding unobserved regions are carved). This method only carves the volume space in front of the observed surfaces, so, in practice, the unobserved voxels outside of the object remain Unseen, and excess meshes are generated on the borders between the Unseen and the Empty regions. When range images from a sufficient number of observation points are merged, such excess meshes are not connected to the target object mesh, and can simply be pruned away. However, when the number of observation points is small, or when the object is not observed from certain directions, the excess meshes often become connected to the object (Figure 4(a)). Sagawa et al. succeeded in interpolating scenes with complex objects by taking the consensus between each voxel in a signed distance field and its surrounding voxels
Fig. 1. (Left) Shape interpolation using space carving, and (right) classes of voxels used in the proposed method
[8]. Masuda proposed a method for filling unmeasured regions by fitting quadrics to the gaps in the signed distance field [9]. They use a Euclidean distance for their calculation to achieve high quality interpolation at the cost of high computational complexity.
3 Shape Merging and Interpolation Using Class Estimation of Unseen Voxels

3.1 Outline

A signed distance field is a scalar field defined for each voxel in 3D space such that the absolute value of the scalar is the distance between the voxel and the object surface and the sign of the scalar indicates whether the voxel is inside or outside of the object (in this paper, voxels inside the object are negative, and those outside are positive). By describing a signed distance field as D(x), it is possible to define an object’s surface as the isosurface satisfying D(x) = 0. In order to express the distance from a voxel to the object’s surface, although there exist several problems for the accuracy of the hole-filling process, we adopt the line-of-sight distance (the distance from the voxel center to the object’s surface measured along the line of sight from the viewpoint of the measurement) instead of the Euclidean distance, since its computational cost is relatively small. The signed distance D(v) for a voxel v is calculated by accumulating the signed distances from each of the range images, d1(v), d2(v), ..., dn(v), each multiplied by a weight wi(v). It is obtained with the following formula:

D(v) = Σi di(v) wi(v).    (1)
The weights represent the accuracy of each distance value, and are often decided according to the angles between the directions of the line-of-sight from the camera and the directions of the normal vectors of the surface. In the constructed signed distance field D(x), the isosurface satisfying D(x) = 0 is the merged shape. In order to generate the mesh model of the merged result, the marching cubes method [10] is used.
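As a minimal sketch of this accumulation step, the following illustrates equation (1) together with one common choice of weight based on the viewing angle. The interface of the range-image objects (signed_distance, weight) is illustrative only and not part of the paper.

```python
import numpy as np

def merged_signed_distance(voxel, range_images):
    """Equation (1): accumulate the weighted line-of-sight signed
    distances of one voxel over all range images."""
    return sum(ri.signed_distance(voxel) * ri.weight(voxel)
               for ri in range_images)

def view_angle_weight(line_of_sight, normal):
    """One common weighting choice: the cosine of the angle between the
    line of sight and the surface normal, clamped at zero."""
    cos_a = np.dot(line_of_sight, normal) / (
        np.linalg.norm(line_of_sight) * np.linalg.norm(normal))
    return max(0.0, float(cos_a))
```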
The SC method of Curless et al. [1] divides all the voxels in volume space into one of three types: Unseen (not observed), Empty (outside the object) and NearSurface (near the surface of the object). Shape interpolation is then performed by generating a mesh over the border between Unseen and Empty voxels (Figure 1(left)). A large problem with this method is that all the voxels in the following three cases are classified as Unseen: (1) voxels outside the object that do not lie on any line of sight to an observed region in the range images, (2) voxels outside the object that lie behind a surface of a range image (occluded voxels outside the object), and (3) voxels inside the object. In the proposed method, each Unseen voxel is classified as either inside or outside of the object using a discriminant function to solve this problem.

We now describe the classification of voxels according to the proposed method. For a given voxel, the information obtained from a range image takes one of the following four types (Figure 1(right)).
– NearSurface (near the surface): The absolute value of the signed distance is below the assigned threshold; the voxel in question is classified as “near the surface”, and the signed distance is retained (case of v1 in Figure 1(right)).
– Outside (outside the object): The absolute value of the signed distance is larger than the threshold and the sign is positive. The voxel in question exists between the object and the camera, so the voxel can be classified as “outside the object” (v2).
– Occluded (occluded region): The absolute value of the signed distance is larger than the threshold and the sign is negative. It is not possible to assert unconditionally whether the voxel is inside or outside the object. In this case, the classification of the voxel in question is temporarily set to Occluded. Whether the voxel is inside or outside is estimated afterwards (v3).
– NoData (deficient data): The signed distance value cannot be obtained due to missing data in the range images. It cannot be judged whether the voxel is inside or outside. In this case, the classification of the voxel is temporarily set to NoData. Whether the voxel is inside or outside is estimated afterwards (v5).
The case of v4 in Figure 1(right), when the voxel in question is outside the angle of view of the range image, may be handled as either Outside or NoData according to the application. For many applications, the voxels outside of the view can be treated as Outside, but in cases such as zooming and measuring a large object, they should be treated as NoData.

When merging multiple range images, the information obtained from each range image regarding a voxel may differ. In such cases, priority is given to NearSurface over the other classes. The second priority is given to the Outside class. When the information from the range images is only Occluded or NoData, whether the voxel is inside or outside the object is estimated according to the discriminant function defined in Section 3.2, and the classified voxels are tagged as Inside or Outside. By performing the above process, all the voxels are classified into three types: Inside (inside the object), NearSurface (near the surface), and Outside (outside the object). The usual signed distances are assigned to the NearSurface voxels, and fixed negative and positive values are assigned to Inside and Outside voxels, respectively. By generating the isosurface of the constructed signed distance field, a merged polygon mesh can be obtained.
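A compact sketch of this per-image classification and the priority rule might look as follows; the label constants, threshold handling and function names are ours, chosen only to mirror the description above.

```python
NEAR_SURFACE, OUTSIDE, OCCLUDED, NO_DATA = range(4)

def classify_voxel_for_image(signed_dist, threshold):
    """Classify one voxel against one range image.

    signed_dist : line-of-sight signed distance of the voxel, or None
                  when the corresponding range-image pixel is missing.
    threshold   : width of the near-surface band.
    """
    if signed_dist is None:
        return NO_DATA                 # deficient data (v5)
    if abs(signed_dist) <= threshold:
        return NEAR_SURFACE            # near the surface (v1)
    if signed_dist > 0:
        return OUTSIDE                 # between object and camera (v2)
    return OCCLUDED                    # behind an observed surface (v3)

def combine_labels(labels):
    """Priority rule when merging several range images:
    NearSurface > Outside > (Occluded / NoData, decided later by C(v))."""
    if NEAR_SURFACE in labels:
        return NEAR_SURFACE
    if OUTSIDE in labels:
        return OUTSIDE
    return None   # left to the discriminant function of Section 3.2
```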
3.2 Voxel Class Estimation

If the information about a voxel obtained from all the range images is only Occluded or NoData, whether the voxel is inside or outside the object should be estimated. For this purpose, discriminant functions based on Bayes estimation are defined. We consider scalar values that are positive outside the object and negative inside the object, and we estimate the probability distributions of these values based on Bayes estimation. The subjective probability distribution when there is no data is a uniform distribution. Based on the posterior probabilities obtained from the data calculated from each range image, the subjective probability distributions are updated according to Bayes theory. Using normal distributions N(μ, σ) for the posterior distributions, a heuristic for assigning the voxel to Inside or Outside is represented as a mean value μ, and the degree of confidence in the heuristic is represented as the standard deviation σ (the higher σ, the lower the degree of confidence).

When the voxel in question is Occluded in a given range image, the voxel is behind the surface from the viewpoint. When the absolute value of the distance from the surface to the voxel (expressed as Dist) is relatively small, the confidence that the voxel is inside the object is high. On the other hand, when it is far from the surface (Dist is large), the confidence that it is inside the object is low. In this case, the degree of confidence that the voxel is inside is 1/Dist, so the corresponding posterior distribution is N(−1, Dist). When the voxel in question is NoData for a given range image, it may be either an outside voxel or an unobserved voxel inside the object. In actual cases, pixels of NoData in the range images often indicate outside regions. From this heuristic, a constant value is assigned to the degree of confidence that the voxel is outside, so the posterior distribution corresponding to NoData is represented as N(1, Const), where Const is a user-defined parameter. According to experiments, reasonable results can be obtained by setting Const to a value near the smallest thickness of the object. In Bayes estimation using a normal distribution, the prior distribution N(μ, σ) is updated using the posterior distribution N(μ′, σ′) as follows:

μ ← (σ′ μ + σ μ′) / (σ + σ′),   1/σ ← 1/σ + 1/σ′.    (2)

After performing this for all range images, a voxel is classified as Outside if the sign of the mean value μ of the final probability distribution is positive, and Inside if it is negative. By defining the discriminant function C(v) for voxel v as

ci(v) = −1/|di(v)|  if Occ(i, v),
ci(v) = 1/Const     if Nod(i, v),    (3)

C(v) = Σi ci(v),    (4)

the above judgment is equivalent to determining the inside or outside of the object according to the sign of C(v), where Occ(i, v) and Nod(i, v) mean that the information for the voxel v obtained from the range image i is Occluded or NoData, respectively, and di(v) is the signed line-of-sight distance between the voxel v and the surface of the range image i.
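To make the decision rule concrete, here is a small sketch of equations (3)–(4). It assumes the per-image labels and line-of-sight distances have already been computed; the string labels and function names are ours.

```python
def discriminant(labels_and_distances, const):
    """C(v) of equations (3)-(4) for a voxel that is only Occluded or
    NoData in every range image.

    labels_and_distances : list of (label, signed_dist) pairs, one per
    range image, with label in {"occluded", "nodata"}.
    const : the user-defined Const parameter (about the smallest
    thickness of the object).
    Positive C(v) -> Outside, negative C(v) -> Inside.
    """
    total = 0.0
    for label, d in labels_and_distances:
        if label == "occluded":
            total += -1.0 / abs(d)
        else:                          # "nodata"
            total += 1.0 / const
    return total
```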
3.3 Utilizing the Position of Light Sources in Active Stereo Method

Active stereo methods, in which light sources are used as well as the camera, are common techniques for obtaining dense 3D shapes. In these methods, the occlusion of light sources causes missing data, but when data is measured at a point, it proves that the voxels between the point and the light source are outside the object. In this research, we utilize the position of light sources in an active stereo system by using additional range images from virtual cameras located at the light source positions. For each measurement, we consider using two range images, from both the camera and the light source position (described below as the camera range image and the light source range image). The light source range image can be generated by projecting the 3D positions of the pixels of the camera range image onto the virtual camera at the light source position. In the case shown in Figure 2(a), the data ends up missing in the camera range image since the light from the light source is occluded; therefore, at the voxel shown in Figure 2(b), the range data from the camera range image is missing. However, by referring to the light source range image, the line-of-sight distance from the light source position to the voxel may be obtained.

There are two advantages in using this information. First, by using light source range images, the number of voxels that can be classified as NearSurface or Outside increases. In addition, even if a voxel is not classified into these classes, we can define an improved discriminant function that utilizes more information than the one described in Section 3.2. The voxel types are mainly the same as when using the camera range images alone, but the order of priority for the voxel classifications is set to NearSurface in the camera range image, NearSurface in the light source range image, Outside in the camera range image, followed by Outside in the light source range image. The discriminant function C(v) for voxel classification that utilizes the positions of the light sources is as follows:

ci(v) = −1/|di^c(v)| − 1/|di^l(v)|  if Occ^c(i, v) ∧ Occ^l(i, v),
ci(v) = −1/|di^c(v)|                if Occ^c(i, v) ∧ Nod^l(i, v),
ci(v) = −1/|di^l(v)|                if Nod^c(i, v) ∧ Occ^l(i, v),
ci(v) = 1/Const                     if Nod^c(i, v) ∧ Nod^l(i, v),    (5)

C(v) = Σi ci(v).    (6)
The superscripts c and l of Occ, Nod and di(v) refer to the camera range images and the light source range images, respectively.
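A sketch of this extended discriminant, parallel to the earlier one, might be written as follows (again with our own labels and data layout):

```python
def discriminant_with_light_source(per_image_info, const):
    """C(v) of equations (5)-(6), combining camera and light-source
    range images.

    per_image_info : list of tuples (label_c, d_c, label_l, d_l), one
    per measurement, with labels in {"occluded", "nodata"} and the
    corresponding signed line-of-sight distances.
    """
    total = 0.0
    for label_c, d_c, label_l, d_l in per_image_info:
        if label_c == "occluded" and label_l == "occluded":
            total += -1.0 / abs(d_c) - 1.0 / abs(d_l)
        elif label_c == "occluded":            # light source: NoData
            total += -1.0 / abs(d_c)
        elif label_l == "occluded":            # camera: NoData
            total += -1.0 / abs(d_l)
        else:                                  # both NoData
            total += 1.0 / const
    return total
```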
4 Implementation Using a GPU

The merging and interpolation algorithms in this paper can be executed efficiently by utilizing a GPU. Since GPUs can only output 2D images, the volume of the signed distance field is calculated slice by slice. The signed distance value D(v) and discriminant function value C(v) are calculated by rendering each slice in a frame buffer. Rendering is performed by multi-pass rendering using programmable shaders. Each pass performs the rendering for one range image, and the results are added using blending functions. When performing the rendering for the ith camera range image
Fig. 2. Using light sources of active stereo methods: (a) missing data caused by occlusion, and (b) using light source range images
(range image i), the camera range image and the corresponding light source range image are treated as floating point textures. Then, by issuing an instruction to draw one rectangle covering the whole frame buffer, the pixel-shader process for each pixel is executed. The pixel shaders calculate di(v)wi(v) and ci(v) using the range image and voxel positions as input values, and add them to the frame buffer. This process is repeated for the number of measured shapes, while changing the textures of the camera range images and the light source range images. Finally, the frame buffer holding the weighted sum of signed distances D(v) and the values of C(v) is read back into the CPU. The flags for the voxel classes are checked, the signed distance and the discriminant function are combined, and a slice of the signed distance field is calculated. Since only small parts of the entire volume space are related to the isosurface, the processing cost can be reduced by first computing the signed distance field at a coarse resolution, and performing the same computation at the high resolution only for the voxels determined to be near the isosurfaces at the coarse resolution. In the experiments described in Section 5, the implementation of the proposed method uses this coarse-to-fine method to reduce the computational cost and time.
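The per-slice accumulation can be mocked on the CPU as below. This is only a readability aid for the multi-pass scheme described above (one pass per range image, accumulation by addition), not the authors' shader code; the attribute names on the range-image objects are assumptions.

```python
import numpy as np

def accumulate_slice(z, res, range_images, voxel_center):
    """CPU analogue of one GPU slice: for every voxel of the slice,
    accumulate the weighted signed distance and the discriminant term
    over all range images.

    voxel_center(ix, iy, z) is assumed to return the 3D centre of the
    voxel; range_images expose signed_distance(v), weight(v) and
    discriminant_term(v) (illustrative names only).
    """
    D = np.zeros((res, res))
    C = np.zeros((res, res))
    for ix in range(res):
        for iy in range(res):
            v = voxel_center(ix, iy, z)
            for ri in range_images:       # one "rendering pass" per image
                D[ix, iy] += ri.signed_distance(v) * ri.weight(v)
                C[ix, iy] += ri.discriminant_term(v)
    return D, C
```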
5 Experiments

In order to demonstrate the effectiveness of the proposed method, experiments were conducted using two types of data: synthetic data formed by seven range images generated from an existing shape model (mesh data), and an actual object (an ornament in the shape of a boar) measured from 8 viewpoints with an active stereo method. For the synthetic data, the points on the surface where the light is self-occluded by the object were treated as missing data even if they were visible from the camera, as occurs in measurements by an active stereo method. Each data set was merged and a mesh model was generated using the SC method, the volumetric diffusion method [7] (VD method), the method proposed by Sagawa et al. [8] (Consensus method), and the proposed method (the information regarding the light source position for active stereo was used). In the SC method, the VD method, and the proposed method, the signed distance field is calculated using the line-of-sight distance.
Fig. 3. Results of merging (without interpolation): (a) synthetic data, (b) real data
Fig. 4. Results of merging (synthetic data): (a) SC, (b) VD, (c) Consensus, (d) Proposed (Smoothing)
In the Consensus method, however, the signed distance field is calculated using the Euclidean distance. In the VD method, the diffusion of the volume was repeated 30 times. Also, the size of the volume was 512 × 512 × 512. For efficiency, in the proposed method, the procedure is first executed at a resolution of 62 × 62 × 62, and rendering at the high resolution is only performed for the region where the surface exists. We performed experiments on the proposed method both applying a smoothing filter of size 3 × 3 × 3 to the volume and without such a filter. The filter prevents aliasing on the interpolated surface (the smoothing was not performed in the SC method or the VD method). The SC method, the VD method and the proposed method were executed on a PC with an Intel Xeon (2.8 GHz) CPU and an NVIDIA GeForce 8800GTX GPU installed. The Consensus method was implemented on a PC with 2 Opteron 275 (2.2 GHz) CPUs (a total of 4 CPUs). The results of merging with no interpolation are shown in Figure 3, and the results of the interpolation process with each method (for the proposed method, the case when smoothing was applied) are shown in Figures 4 and 5. Also, Figures 6(a)-(f) show the signed distance fields for each experiment on the synthetic data, sliced at a certain z-coordinate.
Fig. 5. Results of merging (real data): (a) SC, (b) VD, (c) Consensus, (d) Proposed (Smoothing)
Fig. 6. (a)–(f): Slice images of the signed distance field: (a) no interpolation, (b) SC, (c) VD, (d) Proposed, (e) magnified image of (a), (f) magnified image of (c). For “No interpolation”, “SC” and “VD”, green represents Unseen, black represents Empty, and gray levels represent the signed distance in NearSurface regions. For the proposed method, green represents Inside and black represents Outside. (g): The discriminant function; gray intensities mean positive values and red color means regions of negative value.
Figure 4(a) and Figure 5(a) show that excess meshes were produced surrounding the target object using the SC method, since Unseen (not observed) regions remained uncarved. Figure 4(b) shows that, using the VD method, the mesh surrounding the holes beneath the ears and on the chest spread in undesirable directions. These phenomena often occurred where the borders between the NearSurface and Unseen regions (the red line in Figure 6(e)) were not parallel with the normal directions of the surface of the object, as shown in Figure 6(e). In such regions, the filter causes the expansion of the isosurfaces to occur in wrong directions (in Figure 6(f) the isosurface expands downwards and to the right). A similar phenomenon also occurred in the actual data (Figure 5(b)). The Consensus method produced interpolated meshes with the best quality. However, its processing time was long since the Euclidean distance was required. The proposed method produced results whose quality is the best next to the Consensus method. Since our method does not use a signed distance field, but only uses
Table 1. Execution time in seconds. The volume size of the Consensus method for synthesized data was 128 × 128 × 128.

Methods        | SC | VD  | Consensus                                       | Proposed (No smoothing) | Proposed (Smoothing)
Synthetic data | 55 | 168 | 15 min. for merging, 18 sec. for interpolation  | 25                      | 36
Real data      | 38 | 120 | 7 hours for merging, 280 sec. for interpolation | 21                      | 28
a discriminant function for Unseen voxels to fill holes, the smoothness and continuity of the shapes are not considered; thus there remains a good chance to improve the quality. Using both a signed distance field and a discriminant function may be a promising way to improve our algorithm in terms of the quality of the results. Figure 6(g) shows the values of the discriminant function C(v). From the figure, we can see that the regions where C(v) has positive values coarsely approximate the shapes of the target objects. Table 1 shows the execution times of each method. It shows that the proposed method was executed much faster than the other methods.
6 Conclusion

In this paper, the space carving method was improved, and an interpolation algorithm yielding stable results even when there are few range images was proposed. This method was realized by defining a discriminant function based on Bayes estimation in order to determine the inside and outside of an object in a signed distance field, even for unseen voxels. In addition, a method was proposed for improving the accuracy of this discriminant function by using range images obtained using an active stereo method. Furthermore, a technique for implementing the proposed method using a GPU was stated, which realized a reduction in computational time by a wide margin. Finally, experiments were conducted regarding the proposed method and existing methods, and the effectiveness of the proposed method was confirmed.
References 1. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. Computer Graphics(Annual Conference Series) 30, 303–312 (1996) 2. Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface reconstruction from unorganized points. In: ACM SIGGRAPH, pp. 71–78. ACM Press, New York (1992) 3. Whitaker, R.T.: A level-set approach to 3d reconstruction from range data. IJCV 29(3), 203– 231 (1998) 4. Turk, G., Levoy, M.: Zippered polygon meshes from range images. In: SIGGRAPH 1994, pp. 311–318. ACM Press, New York (1994) 5. Masuda, T.: Registration and integration of multiple range images by matching signed distance fields for object shape modeling. CVIU 87(1-3), 51–65 (2002) 6. Sagawa, R., Nishino, K., Ikeuchi, K.: Adaptively merging large-scale range data with reflectance properties. IEEE Trans. on PAMI 27(3), 392–405 (2005)
7. Davis, J., Marschner, S.R., Garr, M., Levoy, M.: Filling holes in complex surfaces using volumetric diffusion. In: 3DPVT 2002. Proc. of International Symposium on the 3D Data Processing, Visualization, and Transmission, pp. 428–438 (2002) 8. Sagawa, R., Ikeuchi, K.: Taking consensus of signed distance field for complementing unobservable surface. In: Proc. 3DIM 2003, pp. 410–417 (2003) 9. Masuda, T.: Filling the signed distance field by fitting local quadrics. In: 3DPVT 2004. Proc. of International Symposium on the 3D Data Processing, Visualization, and Transmission, pp. 1003–1010. IEEE Computer Society Press, Washington, DC, USA (2004) 10. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. In: SIGGRAPH 1987, pp. 163–169. ACM Press, New York, NY, USA (1987)
Shape Representation and Classification Using Boundary Radius Function

Hamidreza Zaboli and Mohammad Rahmati

Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
[email protected], [email protected], [email protected]

Abstract. In this paper, a new method for the problem of shape representation and classification is proposed. In this method, we define a radius function on the contour of the shape which captures, for each point of the boundary, attributes of its related internal part of the shape. We call these attributes the “depth” of the point. The depths of the boundary points generate a descriptor sequence which represents the shape. Matching of the sequences is performed using a dynamic programming method, and a distance measure is acquired. Finally, different classes of shapes are classified using a hierarchical clustering method and the distance measure. The proposed method can analyze the features of each part of the shape locally, which leads to the ability of part analysis and insensitivity to local deformations such as articulation, occlusion and missing parts. We show the high efficiency of the proposed method by evaluating it for shape matching and classification on standard shape datasets.

Keywords: Computer vision, shape matching, shape classification, boundary radius function.
1 Introduction
In recent years, shape recognition and retrieval has become an important research area in computer vision, and various approaches have been proposed to deal with this problem. Despite the state-of-the-art methods in this field, serious difficulties remain in handling the deformations that appear in real applications, such as noise, occlusion, missing parts and articulation. Different families of methods, including contour-based, region-based, skeleton-based and statistical methods, have tried to cope with them in different ways [1]. Contour-based methods generally use features of the contour (boundary) of the shape without considering its internal parts, which reduces the amount of computation to a process on the contour alone. Region-based methods, on the other hand, consider the information carried by all points of the shape. Skeleton-based methods extract a rich descriptor of the shape called the "skeleton", first introduced by Blum [2]. While skeletons are rich descriptors of shapes, they are sensitive to deformations of the shape,
especially for natural objects such as humans and animals [1]. Moreover, the skeleton itself and the structures and features extracted from it are complex, and these complex structures make the matching process hard and sometimes inefficient. Siddiqi et al. and Sebastian et al. [3,4] propose a shock graph method which extracts a directed acyclic graph (DAG) from the skeleton of the shape and uses a sub-graph isomorphism method at the matching stage. While region-based methods, and especially skeleton-based methods, provide a complete description of the shape, extracting a simple and efficient structure from the skeleton remains a challenge. In this work, we look for a simple and efficient structure which completely represents the interior of the shape. Skeleton-based methods use a function called the "radius function" [1,3], which captures the convexities and concavities of different parts of the shape along the skeletal lines and curves. In this paper, we define a new radius function on the boundary of the shape instead of the skeleton. By traversing the boundary, the boundary radius function (BRF) models the behavior of the shape in its different parts and segments. In this way, we obtain a simple sequential structure which describes the convexities and concavities of the shape as well as the skeletal radius function does. Variations of the BRF correspond to variations of the different parts of the shape, which are what create its convexities and concavities. We use the variations of the BRF to represent shapes of different objects and to classify them. Our method performs part analysis of the shape and is not sensitive to articulation. Furthermore, due to the part analysis and local description of the different parts, the method is robust under a high amount of occlusion and missing parts. Experiments confirm this robustness. The rest of the paper is organized as follows. Section 2 presents related work. Section 3 gives the new definition of the boundary radius function (BRF), and Section 4 shows how to extract a shape descriptor from it. Section 5 describes the matching of the descriptors. In Section 6, experimental results are presented, and Section 7 concludes the paper.
2 Related Work
Employing variations of the shape contour has been widely reported in previous work, with the variations measured according to different criteria. Bernier et al. [5] used a polar transformation of the distances between contour points and the geometric center of the shape, and a similar approach was used in [6]: variations of the shape contour are measured by the Euclidean distances between the contour points and a fixed reference point. The drawback is the sensitivity of the reference point, whose location is likely to change in the presence of small deformations, articulation, occlusion and missing parts. In fact, the dependency of the descriptor on a single, unstable reference point makes these methods inefficient. In another work, Belongie et al. [7] proposed a shape descriptor called the "shape context", which captures, for each point on the boundary of the shape, the Euclidean distances and orientations of the other contour points relative to the current point. While the descriptor is rich, the Euclidean distance and orientation between the contour
points may vary under deformations and clutter, especially articulation. Thayananthan et al. [8] used Chamfer matching to improve the robustness of the method. Another shortcoming is that the descriptor of a point depends on the locations of all other points on the boundary of the shape, which makes it sensitive to occlusion and missing parts occurring in segments far from the current point. Therefore, these methods cannot perform part analysis. In contrast, Siddiqi et al. [3] proposed a skeleton-based method with a structural descriptor which can perform part analysis of the shape. It represents the shape by a hierarchical structure called the "shock graph". The method suffers from a difficult matching process and from sensitivity to noise, occlusion and other deformations appearing on the different parts of the shape. Sebastian et al. [4] improved the method by defining a graph edit distance algorithm. In this paper, we introduce a contour descriptor based on the boundary radius function (BRF) which efficiently captures the internal parts of the shape. We define an edit distance algorithm for matching the descriptors and perform shape recognition and classification using the resulting distance.
3 Boundary Radius Function
In skeleton-based methods, the radius function on the skeleton of the shape is defined as follows. Let S denote the skeleton of a shape and consider a point p ∈ S. The radius function R(p) is the radius of the maximal inscribed disc centered at p that touches the boundary of the shape. With respect to this definition, we define the boundary radius function (BRF) on the boundary of the shape and show how to calculate its values.

Definition 1: Let B denote the boundary of a shape and consider a point p ∈ B. The boundary radius function BRF(p) for point p is defined as the radius of the minimal disk centered at p which touches the nearest point q on the boundary of the shape, such that the line segment connecting p to q lies entirely within the shape and is perpendicular to an ε-neighborhood of q on the boundary. Fig. 1 illustrates this definition.
Fig. 1. (a): A horse shape and BRF(p) for different points on the boundary. (b): The line segment pq is perpendicular to the boundary segment at the point q. (c): BRF for a non-smooth shape. (d): The new BRF (Definition 2) for a point p on the rear part of the horse shape.
The BRF for three sample points is shown in Fig. 1(a). In Fig. 1(b), the line segment pq, which is a radius of the circle, is perpendicular to the boundary at point q. This perpendicularity can be proved for any smooth closed curve. Although the boundary of a shape is a closed curve, it can be uneven, with sharp corners; Fig. 1(c) shows an example. As seen in this figure, the boundary has a corner at point q and is therefore not smooth, so the boundary is not differentiable at q and no tangent line exists there. As a result, in such cases Definition 1 may not hold, because there is no point q on the boundary whose tangent is perpendicular to the radius line pq. In fact, Definition 1 holds for smooth closed curves, not for an arbitrary closed shape boundary. Thus, we give a new definition of the BRF which is less formal but more widely applicable.

Definition 2: Let B denote the boundary of a shape and consider a point p ∈ B. Moreover, consider the minimal circle C centered at p which touches the boundary B at two different points b1, b2. Increase the radius of C until it touches a third boundary point q such that the line segment pq lies entirely within the shape; the boundary radius function BRF(p) for point p is then the radius of the circle C.

Note that in the above definition, the radius of circle C is equal to the length of the line segment pq. Fig. 1(c) illustrates an example of Definition 2: by increasing the radius of the circle, it touches the boundary at a third point q, and the line segment pq lies entirely within the shape. Fig. 1(d) shows another example of the BRF under Definition 2. In this figure, BRF(p) is not given by the line segment pa, because pa does not lie entirely within the shape, whereas pq is the correct line segment for BRF(p). Definition 2 is a useful and applicable definition of the BRF for any closed shape, so we use it for calculating the BRF on the boundary of shapes. The BRF captures the internal part of the shape related to any boundary point, which provides enough information for local analysis of parts and is therefore useful in the presence of articulation, occlusion and missing parts. Given a sample shape, it is straightforward to compute the values of the BRF for its boundary points. Fig. 2(a) shows two sample shapes and the values of the BRF for their boundary points. These values are shown as a set of transformed points for each sample shape, such that each point s with coordinates (x, y) on the boundary of the sample shape is transformed into the point s′ with coordinates (x, y + BRF(s)).
Fig. 2. (a),(b): Two sample shapes, their boundaries and the corresponding transformed point sets shown to their right. Note that the X and Y values increase rightwards and downwards, respectively.
Shape Representation and Classification Using Boundary Radius Function
221
As seen in Fig. 2, the parts with high "depth" in front of them have high values of the related BRF, which places these parts in the lower sections of the transformed point set (i.e. high values of y in the coordinates (x, y)). Up to this point, we have, for any given shape, a transformed point set which reflects, for each boundary point, an attribute of the related internal part. This attribute is like the depth of the shape at the point for which the BRF is computed. Although in some cases the BRF may not give the real, accurate depth for each boundary point, it can be considered a feature similar to depth. Moreover, the values of the BRF at different boundary points represent shapes accurately, and a true depth measure is not necessary for describing the shape. The depth-like measure, i.e. the BRF, represents for each segment of the shape a view of the corresponding segment of the boundary in front of it, which includes the convex and concave parts of the shape. Note that it is these convexities and concavities in the different parts that distinguish different shapes.
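The BRF of Definition 2 can be approximated directly on a sampled contour. The sketch below is a minimal illustration under our own simplifications: the boundary is an ordered array of 2D points, a small neighbourhood of p stands in for the two trivially touching points b1, b2, and the "lies entirely within the shape" test is approximated by probing interior points of the segment pq with matplotlib's point-in-polygon test. The function and parameter names (boundary_radius, skip, n_probe) are ours, not the authors'.

```python
import numpy as np
from matplotlib.path import Path

def boundary_radius(boundary, skip=3, n_probe=10):
    """Discrete approximation of the BRF of Definition 2.

    boundary : (n, 2) array of ordered contour points of a closed shape.
    skip     : immediate neighbours of p on either side to ignore, standing
               in for the two trivially touching points b1, b2.
    """
    boundary = np.asarray(boundary, dtype=float)
    n = len(boundary)
    poly = Path(boundary)
    t = np.linspace(0.1, 0.9, n_probe)[:, None]       # interior samples of segment pq
    brf = np.zeros(n)
    for i, p in enumerate(boundary):
        idx = [j for j in range(n) if min((j - i) % n, (i - j) % n) > skip]
        cand = boundary[idx]
        order = np.argsort(np.linalg.norm(cand - p, axis=1))
        brf[i] = np.linalg.norm(cand[order[0]] - p)    # fallback: nearest admissible point
        for j in order:                                # nearest q with pq inside the shape
            q = cand[j]
            probes = p + t * (q - p)
            if poly.contains_points(probes).all():
                brf[i] = np.linalg.norm(q - p)
                break
    return brf
```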
4 Extracting Structure from the BRFs
In this section, we extract from the boundary points a suitable structure that contains the BRF measures. Since the boundary points of a shape generate its contour, and the contour has a sequential structure, we use a vector of BRF values to represent a shape. The order of the boundary points is kept, and the corresponding BRF value is assigned to each one. In this way, we obtain a directed attributed vector representing the shape, which we call the boundary radius vector (BRV). Matching of shapes can then be performed by matching their BRVs.

One issue is the length of the vectors. Bigger shapes clearly have longer boundaries, so their BRVs would be longer than those of small shapes, which can cause problems, especially under transformations such as scale. To overcome this, for matching any two given shapes we consider n points on each of their boundaries, such that the n points divide the boundary of each shape into n equal segments. Thus, each shape has a BRV of length n, where n is fixed within a matching process.

After computing the BRV of a shape, the variations of the BRF values in the BRV should be considered. As seen in Fig. 2(a), what creates the two convex parts at the two ends of the shape is the variation of the BRF values along the boundary. In this figure, from point a to b, constant BRF values create a flat part between these points, while from b to c and from c to d, the increase and decrease of the BRF values, respectively, create a bowed curve. Thus, it is necessary to analyze the variations of the BRF values of a shape rather than the values themselves; although the BRF values are important, their variations are more important. To this end, we define a first-order difference of the BRF values of the BRV, as follows.

Definition 3: Let the BRV of a shape have length n. The variation of the BRF for the ith value in the BRV (the ith point on the contour) is denoted by VBRF(i) and computed as
VBRF(i) = \frac{\sum_{k=i-\alpha+1}^{i+\alpha} \bigl( BRF(k) - BRF(k-1) \bigr)}{2\alpha}, \qquad \alpha + 1 \le i \le n - \alpha - 1,   (1)
where α is a resolution parameter. A larger α gives less accuracy in analyzing the details of parts; at the same time, very small values of α do not necessarily lead to a better description of the shape or to higher efficiency and recognition rate of the final recognition system. Using Definition 3, we can compute the variations of the BRF (VBRF) for any given shape, yielding a sequence of VBRF values. This sequence specifies a vector of VBRF values calculated for the shape, which we call the "descriptor vector". Note that for each value of the descriptor vector (VBRF(i)), its related BRF value (BRF(i)) is taken into account as a weight factor. In this way, large parts of the shape that are deeper than others are considered more important, which raises the recognition rate of our method, especially in the presence of occlusion and missing parts.
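As a sketch of Definition 3, the following computes the VBRF sequence from a BRV and pairs each entry with its BRF weight to form the descriptor vector. The indexing is 0-based, boundary wrap-around is ignored for simplicity, and the function names are ours.

```python
import numpy as np

def vbrf(brv, alpha=3):
    """Variations of the BRF (Definition 3, Eq. 1), 0-based indexing."""
    brv = np.asarray(brv, dtype=float)
    n = len(brv)
    diffs = np.diff(brv)                       # BRF(k) - BRF(k-1)
    out = np.zeros(n)
    for i in range(alpha + 1, n - alpha - 1):
        out[i] = diffs[i - alpha:i + alpha].sum() / (2.0 * alpha)
    return out

def descriptor_vector(brv, alpha=3):
    """Descriptor vector: each VBRF value paired with its BRF weight."""
    return np.column_stack([vbrf(brv, alpha), np.asarray(brv, dtype=float)])
```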
5 Matching Descriptor Vectors
Up to this point, we have introduced a descriptor vector for any given shape. For matching, classifying and recognizing shapes, it is sufficient to match the descriptor vectors and compute the difference between them. The resulting difference is taken as a distance between the shapes and can be used as a distance measure. Due to the sequential structure of the descriptor vectors, we use a dynamic programming method for matching the vectors. This method is widely used for matching attributed sequential data structures, e.g. strings. Moreover, it deals efficiently with deleted and inserted parts of the sequential structures and finds an optimal matching, which is very useful for handling deformations such as occlusion and missing parts. Dynamic programming compares and matches the vectors and generates a cost for the matching. This cost is the cost of transforming one vector into the other and is known as the "edit distance"; the resulting edit distance is taken as the distance between the two shapes. To apply dynamic programming, we define the following cost function:

Cost(i, j) = \min\bigl( Cost(i, j-1),\; Cost(i-1, j),\; Cost(i-1, j-1) \bigr) +
    \begin{cases}
    \lvert BRF(i) - BRF(j) \rvert & \text{if } VBRF(i) = VBRF(j) \\
    \lvert BRF(i) \times VBRF(i) - BRF(j) \times VBRF(j) \rvert & \text{if } VBRF(i) \neq VBRF(j)
    \end{cases}   (2)
In relation (2), Cost(i, j) is the distance between values 1..i of the first vector and values 1..j of the second vector. For a complete matching, i must equal the number of values in the first vector and j the number of values in the second. As seen in this relation, the variations of the boundary radius function (VBRF) are taken as the basis for matching any two values of the descriptor vectors, and the values of the related BRFs are taken into account at the next level.
Using the cost function in relation (2), we can compute the edit distance between any two descriptor vectors, and hence the distance between any two given shapes. Therefore, given two shapes S1, S2, the matching of S1 and S2 is performed as follows:

1. Traverse the boundaries of S1 and S2, compute the BRF for their boundary points, and generate BRV1 and BRV2, respectively.
2. Compute the variations of the BRFs and transform BRV1 and BRV2 into descriptor vectors.
3. Match the descriptor vectors using dynamic programming.
Assuming the BRF is computed for n boundary points, the first stage has time complexity O(nk²), where k² is the complexity of computing the BRF and k is the radius related to a boundary point. In the worst case, k is not greater than n/2; this occurs when the shape is close to an ellipse. Thus, the first stage takes O(n³/4) = O(n³). The time complexity of the second stage is O(n), and in the third stage dynamic programming takes O(nm), where n and m are the lengths of the descriptor vectors. Therefore, assuming n ≥ m, the overall worst-case time complexity of the method is O(n³ + n + nm), which is still O(n³).
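A minimal dynamic-programming sketch of the edit distance of Eq. (2) is given below. The absolute values in the local cost and the zero initialisation of the first row and column (which lets prefixes be skipped cheaply) are our assumptions; the paper does not spell out these boundary conditions. desc1 and desc2 are descriptor vectors as produced by descriptor_vector above.

```python
import numpy as np

def shape_distance(desc1, desc2):
    """Edit distance of Eq. (2) between two descriptor vectors.

    desc1, desc2 : (n, 2) and (m, 2) arrays; column 0 holds VBRF values,
    column 1 the corresponding BRF weights (see descriptor_vector above).
    """
    n, m = len(desc1), len(desc2)
    cost = np.zeros((n + 1, m + 1))            # zero borders: prefixes may be skipped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v1, b1 = desc1[i - 1]
            v2, b2 = desc2[j - 1]
            local = abs(b1 - b2) if v1 == v2 else abs(b1 * v1 - b2 * v2)
            cost[i, j] = min(cost[i, j - 1], cost[i - 1, j], cost[i - 1, j - 1]) + local
    return cost[n, m]
```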
6 Experiments
The distance measure computed in Section 5 can be used for shape matching and retrieval. We evaluated our method on two standard shape datasets, the Kimia [11] and MPEG-7 [12] binary image databases, as well as on many customized sample shapes; in addition, we designed a new set of customized and deformed shapes and performed experiments on it. In our experiments, the number of sampled boundary points n, discussed in Section 4, is 100, and the resolution parameter α is at most 3.

The first experiment is the comparison and matching of shapes from different classes. A set of randomly selected shapes from the datasets is considered, and the distance between each pair of them is computed with the proposed method. The evaluated shapes include various deformed shapes, such as occluded shapes and shapes with articulation and missing parts. Table 1 shows this experiment for 13 sample shapes out of about 100 evaluated shapes. Each value in the table indicates the distance between the shapes at the corresponding row and column. As seen in the table, the distances between shapes belonging to the same class are smaller than the other distances. For example, the horse in column 10 has a distance of 0.581 to the other horse in row 9 and a distance of 0.789 to the dog in row 3. Also, each shape has a distance of 0 to itself (e.g. row 10, column 10). In the next experiment, we classify shapes of different classes from the datasets based on the distances between them, using a hierarchical clustering method (single linkage).
Table 1. 13 of the about 100 evaluated sample shapes and the distances between each pair of them
(The pairwise distance matrix of Table 1 accompanies shape images in the original layout and cannot be reproduced here.)
This method groups similar shapes into a class based on the fact that shapes of one class have small distances among themselves and larger distances to shapes of other classes. Using this method, we classified 340 shapes of different classes and achieved very good results. Fig. 3 shows the outcome of this classification: our method measured the distance between each pair of shapes and generated a distance table, and the clustering method then categorized the shapes based on this table. As seen in Fig. 3, our method has classified the shapes correctly and has linked similar classes first. In this classification, the classes of horse and deer are linked first, the classes of rabbit and mouse are linked next, and then the classes of pliers and scissors and the classes of dummies and planes are linked, respectively. The dog class is linked to the classes of horse and deer, which is a correct classification. The classification continues, linking similar classes, up to the last class, the class of leaves, which is linked to the other classes at the last level due to its large difference from them. In the last experiment, we evaluate our method against the methods proposed in [5,7]. The first method, proposed in [5], is close to ours: both methods capture the shape by traversing its contour and relying on its interior.
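The single-linkage classification step can be reproduced with SciPy's hierarchical clustering. The sketch below is illustrative only; classify_shapes and the maxclust cut criterion are our choices, since the paper does not fix how the dendrogram is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def classify_shapes(distance_table, n_classes):
    """Single-linkage hierarchical clustering of shapes from their distance table."""
    d = np.asarray(distance_table, dtype=float)
    d = 0.5 * (d + d.T)                         # symmetrise the pairwise distances
    np.fill_diagonal(d, 0.0)
    condensed = squareform(d, checks=False)     # condensed form expected by linkage()
    tree = linkage(condensed, method='single')  # single-linkage dendrogram
    return fcluster(tree, t=n_classes, criterion='maxclust')
```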
Fig. 3. Classification of different classes of shapes
The difference between them is that the method proposed in [5] considers all parts of the shape relative to a reference point, whereas our method in most cases considers each part based on a local analysis. The second method, "shape context" [7], is a contour-based method which stores a profile for each boundary point containing the distances and orientations of the other boundary points relative to the current point. As mentioned earlier, the method suffers from the high dependency of its descriptor on deformations and variations of other parts of the boundary. In this experiment, we evaluate our method for retrieving the most similar and closest matches. We perform this retrieval test on 530 randomly selected shapes from the Kimia and MPEG-7 datasets. A query shape is selected from the dataset and the 10 closest matches are retrieved. The results of these retrievals are shown in Table 2 and compared with the results of [5] and [7]. As the table shows, our method performs best and achieves very good results.

Table 2. Retrieval results for 530 shapes by our method and [5,7]: recognition rate (%) of the three methods for the top 10 matches
Method                            1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th
Boundary Radius Function (BRF)    100   100    99    99    98    97    97    97    94     84
Polar signature [5]               100   100    97    96    91    82    78    72    57     39
Shape Context [7]                  97    91    88    85    84    77    75    66    56     37
While polar signature [5] and shape context [7] largely lose retrieval accuracy after the 5th closest match, our method maintains a recognition rate higher than 90% for the top 9 closest matches.
7 Conclusions
We proposed a new descriptor which captures the interior of a shape by walking along its contour. The result of this stage is a vector which represents the shape based on its interior. Matching of the vectors is performed using dynamic programming. Since the proposed method and its descriptor analyze each part of the shape locally, the method is robust under complex and problematic deformations, especially articulation, occlusion, missing parts and clutter. Moreover, the matching process, i.e. dynamic programming, is efficient and simple; alternatively, the matching could be performed by other means, such as computing the area lying between the VBRFs.
References
1. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37, 1–19 (2004)
2. Blum, H.: A transformation for extracting new descriptors of shape. In: Whaten-Dunn, W. (ed.) Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967)
3. Siddiqi, K., Shokoufandeh, A., Dickinson, S.J., Zucker, S.W.: Shock graphs and shape matching. International Journal of Computer Vision 35(1), 13–32 (1999)
4. Sebastian, T.B., Klein, P.N., Kimia, B.B.: Recognition of shapes by editing shock graphs. In: ICCV 2001, pp. 755–762 (2001)
5. Bernier, T., Landry, J.-A.: A new method for representing and matching shapes of natural objects. Pattern Recognition 36, 1711–1723 (2003)
6. Kang, S.K., Ahmad, M.B., Chun, J.H., Kim, P.K., Park, J.A.: Modified radius-vector function for shape contour description. In: Laganà, A., Gavrilova, M., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds.) ICCSA 2004. LNCS, vol. 3046, pp. 940–947. Springer, Heidelberg (2004)
7. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
8. Thayananthan, A., Stenger, B., Torr, P.H.S., Cipolla, R.: Shape context and Chamfer matching in cluttered scenes. In: IEEE CVPR, vol. 1, pp. 127–133 (2003)
9. Gorelick, L., Galun, M., Sharon, E., Basri, R., Brandt, A.: Shape representation and classification using the Poisson equation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12), 1991–2005 (2006)
10. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
11. Kimia image database (May 2007), http://www.lems.brown.edu/dmc/main.html
12. MPEG-7 CE shape database (May 2007), http://www.imageprocessingplace.com/DIP/dip_image_dabases/image_databases.htm
A Convex Programming Approach to the Trace Quotient Problem
Chunhua Shen¹,²,³, Hongdong Li¹,², and Michael J. Brooks³
¹ NICTA  ² Australian National University  ³ University of Adelaide
Abstract. The trace quotient problem arises in many applications in pattern classification and computer vision, e.g., manifold learning, low-dimensional embedding, etc. The task is to solve an optimization problem involving maximizing the ratio of two traces, i.e., max_W Tr(f(W))/Tr(h(W)). This optimization problem is non-convex in general, hence it is hard to solve directly. Conventionally, the trace quotient objective is replaced by a much simpler quotient trace formula, i.e., max_W Tr(h(W)^{-1} f(W)), which admits a much simpler solution. However, the result is no longer optimal for the original problem, and some desirable properties of the original problem are lost. In this paper we propose a new formulation for solving the trace quotient problem directly. We reformulate the original non-convex problem such that it can be solved by efficiently solving a sequence of semidefinite feasibility problems; the solution is therefore globally optimal. Besides global optimality, our algorithm naturally generates an orthonormal projection matrix. Moreover, it relaxes the restriction of linear discriminant analysis that the rank of the projection matrix can be at most c − 1, where c is the number of classes; our approach is more flexible. Experiments show the advantages of the proposed algorithm.
1 Introduction
The problem of dimensionality reduction—extracting low-dimensional structure from high-dimensional data—is extensively studied in pattern recognition, computer vision and machine learning. Many dimensionality reduction methods, such as linear discriminant analysis (LDA) and its kernel version, end up solving a trace quotient problem

W^\star = \operatorname*{argmax}_{W^\top W = I_d} \frac{\operatorname{Tr}(W^\top S_\alpha W)}{\operatorname{Tr}(W^\top S_\beta W)},   (1)

where S_α, S_β are positive semidefinite (p.s.d.) matrices (S_α ⪰ 0, S_β ⪰ 0), I_d is the d × d identity matrix¹ and Tr(·) denotes the matrix trace. W ∈ ℝ^{D×d} is the target projection matrix for dimensionality reduction (typically d ≪ D). In the supervised learning framework, S_α usually represents the distance between different classes while S_β is the distance between data points within the same class. For example, S_α is the "between-class scatter matrix" and S_β is the "within-class scatter matrix" for LDA. By formulating
¹ The dimension of I is omitted when it can be seen from the context.
the problem of dimensionality reduction in this general setting and constructing S_α and S_β in different ways, we can analyze many different types of data within the above mathematical framework.

Despite the importance of the trace quotient problem, it lacks a direct and globally optimal solution. Usually, as an approximation, the quotient trace

\operatorname{Tr}\bigl( (W^\top S_\beta W)^{-1} (W^\top S_\alpha W) \bigr)

(instead of the trace quotient) is used. Such an approximation readily leads to a generalized eigen-decomposition (GEVD) solution, via which a closed-form solution is readily available. It is easy to check that when rank(W) = 1, i.e., W is a vector, Equation (1) is actually a Rayleigh quotient problem, which can be solved by the GEVD: the eigenvector corresponding to the eigenvalue of largest magnitude gives the optimal W. Unfortunately, when rank(W) > 1, the problem becomes much more complicated. Heuristically, the dominant eigenvectors corresponding to the largest eigenvalues are used to form W, since the largest eigenvalues are believed to contain more useful information. However, such a GEVD approach cannot produce the optimal solution to the original optimization problem (1) [1]. Furthermore, the GEVD does not yield an orthogonal projection matrix. Orthogonal LDA (OLDA) has been proposed to compute a set of orthogonal discriminant vectors via simultaneous diagonalisation of the scatter matrices [2].

In this paper, we proffer a novel semidefinite programming (SDP) based method to solve the trace quotient problem directly, which has the following properties:
– It optimises the original problem (Equation (1)) directly;
– The target low dimensionality is selected by the user, and the algorithm guarantees a globally optimal solution since the optimisation is convex; in other words, it is local-optima-free;
– The projection matrix is naturally orthonormal;
– Unlike the GEVD approach to LDA, with our algorithm the data need not be projected to at most c − 1 dimensions, where c is the number of classes.
To our knowledge, this is the first attempt to directly solve the trace quotient problem while deterministically guaranteeing a global optimum.
2 SDP Approach to the Trace Quotient Problem
In this section, we show how the trace quotient is reformulated into an SDP problem.

2.1 SDP Formulation
By introducing an auxiliary variable δ, problem (1) is equivalent to

maximize    δ                                                   (2a)
subject to  Tr(W^⊤ S_α W) ≥ δ · Tr(W^⊤ S_β W)                    (2b)
            W^⊤ W = I_d                                          (2c)
            W ∈ ℝ^{D×d}.                                         (2d)
The variables to be optimised are δ and W, but we are only interested in the W for which δ is maximised. This problem is clearly not convex, because constraint (2b) is not convex and (2d) is actually a non-convex rank constraint. Let us define a new variable Z ∈ ℝ^{D×D}, Z = W W^⊤; constraint (2b) is then converted to Tr((S_α − δ S_β) Z) ≥ 0, using the fact that Tr(W^⊤ S W) = Tr(S W W^⊤) = Tr(S Z). Because Z is the product of the matrix W and its transpose, it must be p.s.d. Overton and Womersley [3] have shown that the set Ω₁ = {W W^⊤ : W^⊤ W = I_d} is the set of extreme points of Ω₂ = {Z : Z = Z^⊤, Tr Z = d, 0 ⪯ Z ⪯ I}.² In other words, as constraints, Ω₁ is stricter than Ω₂. Therefore constraints (2c) and (2d) can be relaxed into Tr Z = d and 0 ⪯ Z ⪯ I, which are both convex. When the cost function is linear and subject to Ω₂, the solution lies at one of the extreme points [4]. Consequently, for linear cost functions, the optimization problems subject to Ω₁ and Ω₂ are exactly equivalent. Moreover, the same nice property follows even when the objective function is a quotient (i.e. fractional programming), which is precisely the case we are dealing with here.

With respect to Z and δ, (2b) is still non-convex: the problem may have locally optimal points. Nevertheless, the global optimum can be efficiently computed via a sequence of convex feasibility problems. Observing that the constraint is linear if δ is known, we can convert the optimization problem into a set of convex feasibility problems, and a bisection search strategy is adopted to find the optimal δ. This technique is widely used in fractional programming. Let δ^⋆ denote the unknown optimal value of the cost function. Given δ∗ ∈ ℝ, if the convex feasibility problem³

find         Z                                                   (3a)
subject to   Tr((S_α − δ∗ S_β) Z) ≥ 0                            (3b)
             Tr Z = d                                            (3c)
             0 ⪯ Z ⪯ I                                           (3d)
is feasible, then δ^⋆ ≥ δ∗. Otherwise, if the above problem is infeasible, we can conclude δ^⋆ < δ∗. In this way we can check whether the optimal value δ^⋆ is smaller or larger than a given value δ∗. This observation motivates a simple algorithm for solving the fractional optimisation problem using bisection search, which solves a convex feasibility problem at each step. Algorithm 1 shows how it works.

Algorithm 1. Bisection search
Require: lower bound δ_L and upper bound δ_U of δ, and the tolerance σ > 0.
while δ_U − δ_L > σ do
    δ = (δ_L + δ_U)/2.
    Solve the convex feasibility problem described in (3a)–(3d).
    if feasible then δ_L = δ else δ_U = δ end if
end while

At this point, a question remains to be answered: are constraints (3c) and (3d) equivalent to constraints (2c) and (2d) for the feasibility problem? Essentially, the feasibility problem is equivalent to

maximize     Tr((S_α − δ∗ S_β) Z)                                (4a)
subject to   Tr Z = d                                            (4b)
             0 ⪯ Z ⪯ I.                                          (4c)

² Our notation is used here.
³ The feasibility problem has no cost function; the objective is to check whether the intersection of the convex constraints is empty.
If the maximum value of the cost function is non-negative, the feasibility problem is feasible; otherwise, it is infeasible. Because this cost function is linear, we know that Ω₁ can be replaced by Ω₂, i.e., constraints (3c) and (3d) are equivalent to (2c) and (2d). Note that constraint (3d) is not in the standard form of SDP. It can be rewritten in standard form as

\begin{bmatrix} Z & 0 \\ 0 & Q \end{bmatrix} \succeq 0,          (5a)
Z + Q = I,                                                        (5b)
where the matrix Q acts as a slack variable. Now the problem can be solved using standard SDP packages such as CSDP [5] and SeDuMi [6]; we use CSDP in all of our experiments.

2.2 Recovering W from Z
From the covariance matrix Z learned by SDP, we can recover the output W by eigen-decomposition. Let V_i denote the ith eigenvector, with eigenvalue λ_i, and let λ_1 ≥ λ_2 ≥ ··· ≥ λ_D be the sorted eigenvalues. It is straightforward to see that W = diag(√λ_1, √λ_2, ···, √λ_D) V, where diag(·) denotes a square matrix with the input as its diagonal elements. To obtain a D × d projection matrix, the smallest D − d eigenvalues are simply truncated. This is the general treatment for recovering a low-dimensional projection from a covariance matrix, e.g., in principal component analysis (PCA). In our case, this procedure is exact, i.e., there is no information loss. This is obvious: λ_i, the eigenvalues of Z = W W^⊤, are the same as the eigenvalues of W^⊤ W = I_d. That means λ_1 = λ_2 = ··· = λ_d = 1 and the remaining D − d eigenvalues are all zero. Hence, in our case we can simply stack the first d leading eigenvectors to obtain W.
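A short NumPy sketch of this recovery step; recover_projection is our name for it, not the authors'.

```python
import numpy as np

def recover_projection(Z, d):
    """Recover the D x d projection W from the covariance Z learned by the SDP.

    For an exact solution Z = W W^T with W^T W = I_d, the d leading
    eigenvalues of Z equal one and the rest are zero, so stacking the d
    leading eigenvectors of Z is sufficient.
    """
    eigvals, eigvecs = np.linalg.eigh(Z)       # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1]          # re-sort, largest first
    return eigvecs[:, order[:d]]               # columns: the d leading eigenvectors
```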
2.3 Estimating Bounds of δ
The bisection search procedure requires a lower bound and an upper bound of δ. The following theorem from [3] is useful for estimating the bounds.

Theorem 1. Let S ∈ ℝ^{D×D} be a symmetric matrix, and let μ_1^S ≥ μ_2^S ≥ ··· ≥ μ_D^S be its eigenvalues sorted from largest to smallest. Then \max_{W^\top W = I_d} \operatorname{Tr}(W^\top S W) = \sum_{i=1}^{d} \mu_i^S.

Refer to [3] for the proof. This theorem can be extended to obtain the following corollary (following the proof of Theorem 1):

Corollary 1. Let S ∈ ℝ^{D×D} be a symmetric matrix, and let ν_1^S ≤ ν_2^S ≤ ··· ≤ ν_D^S be its eigenvalues sorted from smallest to largest. Then \min_{W^\top W = I_d} \operatorname{Tr}(W^\top S W) = \sum_{i=1}^{d} \nu_i^S.
Therefore, we estimate the upper bound of δ as

δ_U = \frac{\sum_{i=1}^{d} \mu_i^{S_\alpha}}{\sum_{i=1}^{d} \nu_i^{S_\beta}}.   (6)
In the trace quotient problem, both S_α and S_β are p.s.d., which is equivalent to saying that all of their eigenvalues are non-negative. Note that the denominator of (6) could be zero, giving δ_U = +∞. This occurs when the d smallest eigenvalues of S_β are all zero, in which case rank(S_β) ≤ D − d. For LDA, rank(S_β) = min(D, N), where N is the number of training data. When N ≤ D − d, which is termed the small sample problem, δ_U is invalid. A PCA pre-processing of the data can always be performed to remove the null space of the covariance matrix of the data, such that δ_U becomes valid. A lower bound of δ is then

δ_L = \frac{\sum_{i=1}^{d} \nu_i^{S_\alpha}}{\sum_{i=1}^{d} \mu_i^{S_\beta}}.   (7)

Clearly δ_L ≥ 0.
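With the bounds of Eqs. (6) and (7) in place, Algorithm 1 can be sketched end-to-end. The version below uses CVXPY for the SDP feasibility test purely for illustration (the paper uses CSDP); it assumes an SDP-capable solver such as SCS is installed, and all function names are ours.

```python
import numpy as np
import cvxpy as cp

def delta_bounds(S_a, S_b, d):
    """Bounds on delta from Theorem 1 and Corollary 1 (Eqs. 6 and 7)."""
    ev_a = np.linalg.eigvalsh(S_a)              # eigenvalues, ascending
    ev_b = np.linalg.eigvalsh(S_b)
    delta_L = ev_a[:d].sum() / ev_b[::-1][:d].sum()
    delta_U = ev_a[::-1][:d].sum() / ev_b[:d].sum()   # assumes a non-zero denominator
    return delta_L, delta_U

def feasible(S_a, S_b, d, delta):
    """Convex feasibility problem (3a)-(3d) for a trial value of delta."""
    D = S_a.shape[0]
    Z = cp.Variable((D, D), symmetric=True)
    constraints = [cp.trace((S_a - delta * S_b) @ Z) >= 0,   # (3b)
                   cp.trace(Z) == d,                         # (3c)
                   Z >> 0, np.eye(D) - Z >> 0]               # (3d): 0 <= Z <= I
    problem = cp.Problem(cp.Minimize(0), constraints)        # pure feasibility check
    problem.solve()
    return problem.status in ("optimal", "optimal_inaccurate")

def bisection(S_a, S_b, d, sigma=0.1):
    """Algorithm 1: bisection search for the optimal delta."""
    lo, hi = delta_bounds(S_a, S_b, d)
    while hi - lo > sigma:
        mid = 0.5 * (lo + hi)
        if feasible(S_a, S_b, d, mid):
            lo = mid
        else:
            hi = mid
    return lo
```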
3 Related Work
The closest work to ours is [1], in the sense that it also proposes a method to solve the trace quotient directly; [1] finds the projection matrix W on the Grassmann manifold. Compared with optimization in Euclidean space, the main advantage of optimization on the Grassmann manifold is that there are fewer variables, so the scale of the problem is smaller. There are major differences between [1] and our method: (i) [1] optimises Tr(W^⊤ S_α W) − δ · Tr(W^⊤ S_β W) and has no principled way to determine the optimal value of δ, whereas we optimize the trace quotient function itself and a deterministic bisection search guarantees the optimal δ; (ii) the optimization in [1] is non-convex (a difference of two quadratic functions) and might therefore become trapped in a local maximum, while our method is globally optimal.
Xing et al. [7] propose a convex programming approach to maximize the distances between classes and simultaneously clip (but not minimise) the distances within classes. Unlike our method, the rank constraint is not considered in [7]; hence [7] is a metric learning method but not necessarily a dimensionality reduction method. Furthermore, although the formulation of Xing et al. is convex, it is not an SDP; it is more computationally expensive and general-purpose SDP solvers are not applicable. SDP (or other convex programming) is also used in [8,9] for learning a distance metric.
4 Experiments
In this work, we consider optimizing the LDA criterion using the proposed SDP approach: S_α is the "between-class scatter matrix" and S_β is the "within-class scatter matrix". Note, however, that there are many different ways of constructing S_α and S_β, e.g., the general methods considered in [7].

UCI data. Firstly, we test whether the optimal value of the cost function δ obtained by our SDP bisection search is indeed larger than the one obtained by GEVD (conventional LDA). In all the experiments, the tolerance is σ = 0.1. In this experiment, two datasets ("iris" and "wine") from the UCI machine learning repository [10] are used. We randomly sample 70% of the data each time and run 100 tests for each of the two datasets. The target low dimension d is set to 2. Figure 1 plots the difference between the δ values obtained by the two approaches; indeed, SDP consistently yields a larger δ than LDA does. To see the difference between the two approaches, we project the "wine" data into 2D space and plot the projected data in Figure 2. As can be seen, the SDP algorithm has successfully brought together the points of the same class while keeping dissimilar ones apart (for the original data and the PCA projection, the different classes are entangled). While both discriminant projection algorithms separate the data successfully, SDP intentionally finds a projection of the data onto a straight line that maintains the separation of the clusters. Xing et al. [7] have reported very similar observations with their convex metric learning algorithm.

To test the influence of the SDP algorithm on classification, we again randomly select 70% of the data for training and the remaining 30% for testing. Both datasets are projected to 2D, and results with a kNN classifier (k = 1) are collected over 100 runs per dataset. For the "iris" data, SDP is slightly better (test error 3.71% (±2.74%)) than LDA (test error 4.16% (±2.56%)), while for the "wine" data LDA is better than SDP (1.53% (±1.62%) against 8.45% (±4.30%)). This means that a larger LDA cost δ does not necessarily produce better classification accuracy because: (i) the LDA criterion is not directly connected with the classifier's performance; it is a somewhat heuristic criterion; (ii) as an example, Figure 2 indicates that SDP might over-fit the training data: when LDA already separates the data well, SDP aligns the data along a line to obtain a larger δ; (iii) with noisy training data, LDA can denoise by truncating the eigenvectors corresponding to the smaller eigenvalues, similarly to PCA, whereas SDP also takes the noise into consideration during learning, which appears to be harmful. We believe that some regularization of the LDA criterion would be beneficial. Also, other criteria might behave differently in terms of over-fitting. These topics remain future research directions.
Fig. 1. The optimal value of δ obtained by our SDP approach (δSDP ) minus the value obtained by the conventional LDA (δLDA ). For all the runs, δSDP is larger than δLDA .
USPS handwritten digits data. Experiments are also conducted on the full USPS data set. The US Postal (USPS) handwritten digit data set is derived from a project on recognizing handwritten digits on envelopes; the digits were down-sampled to 16 × 16 pixels. The training set has 7291 samples and the test set has 2007 samples. The test set is rather difficult: the error rate achieved by humans is 2.5% [11]. In the first experiment, we use only the 7291 training digits; 70% are randomly selected for training and the other 30% for testing. The data are linearly mapped from 256D to 55D using PCA, such that 90.14% of the energy is preserved. Then, for LDA, we map them to 9D (because there are ten classes in total); SDP's target low dimension is 50D. We run the experiments 20 times. The results are somewhat surprising: the 1NN classification (i.e., nearest neighbour) test error for LDA is 6.99% ± 0.49%, while SDP achieves a much better performance, with a 1NN test error of 2.79% ± 0.27%. Note that if we set the target low dimension to 9D for SDP, SDP performs worse than LDA does. In the second experiment, we use the 7291 training data for training and the 2007 USPS test data for testing. Again, the data are first mapped to 55D using PCA. LDA reduces
Fig. 2. (1) Original data (first two dimensions are plotted); (2) Projected to 2D with PCA; (3) Projected to 2D with LDA; (4) Projected to 2D with SDP
the dimensionality to 9D and SDP to 54D. LDA has a 1NN test error of 10.36%, while our SDP achieves a 5.13% test error. Note that in these experiments we have not tuned all the parameters carefully.
5 Conclusion
We have proposed a new formulation for directly solving the trace quotient problem. It is based on SDP, combined with a bisection search for solving the fractional programming problem, which allows us to derive an algorithm with a guaranteed global optimum. Compared with LDA, the algorithm also relaxes the restriction of linear discriminant analysis that the rank of the projection matrix can be at most c − 1. The USPS classification experiment shows that this restriction might significantly affect LDA's performance. Our experiments have validated the advantages of the proposed algorithm.
Acknowledgements National ICT Australia (NICTA) is funded through the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
References 1. Yan, S., Tang, X.: Trace quotient problems revisited. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 232–244. Springer, Heidelberg (2006) 2. Ye, J., Xiong, T.: Null space versus orthogonal linear discriminant analysis. In: Proc. Int. Conf. Mach. Learn., Pittsburgh, Pennsylvania, pp. 1073–1080 (2006) 3. Overton, M.L., Womersley, R.S.: On the sum of the largest eigenvalues of a symmetric matrix. SIAM J. Matrix Anal. Appl. 13(1), 41–45 (1992) 4. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program 62, 321–357 (1993) 5. Borchers, B.: CSDP, a C library for semidefinite programming. Optim. Methods and Software 11, 613–623 (1999) 6. Sturm, J.F.: Using SeDuMi 1.02, a matlab toolbox for optimization over symmetric cones (updated for version 1.05). Optim. Methods and Software 11-12, 625–653 (1999) 7. Xing, E., Ng, A., Jordan, M., Russell, S.: Distance metric learning, with application to clustering with side-information. In: Proc. Adv. Neural Inf. Process. Syst., MIT Press, Cambridge (2002) 8. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. In: Proc. Adv. Neural Inf. Process. Syst. (2005) 9. Globerson, A., Roweis, S.: Metric learning by collapsing classes. In: Proc. Adv. Neural Inf. Process. Syst. (2005) 10. Newman, D., Hettich, S., Blake, C., Merz, C.: UCI repository of machine learning databases (1998) 11. Simard, P., LeCun, Y., Denker, J.S.: Efficient pattern recognition using a new transformation distance. In: Proc. Adv. Neural Inf. Process. Syst., pp. 50–58. MIT Press, Cambridge (1993)
Learning a Fast Emulator of a Binary Decision Process
Jan Šochman and Jiří Matas
Center for Machine Perception, Dept. of Cybernetics, Faculty of Elec. Eng., Czech Technical University in Prague, Karlovo nám. 13, 121 35 Prague, Czech Rep.
{sochmj1,matas}@cmp.felk.cvut.cz

Abstract. Computation time is an important performance characteristic of computer vision algorithms. This paper shows how existing (slow) binary-valued decision algorithms can be approximated by a trained WaldBoost classifier, which minimises the decision time while guaranteeing a predefined approximation precision. The core idea is to take an existing algorithm as a black box performing some useful binary decision task and to train the WaldBoost classifier as its emulator. Two interest point detectors, the Hessian-Laplace and the Kadir-Brady saliency detector, are emulated to demonstrate the approach. The experiments show similar repeatability and matching score of the original and emulated algorithms while achieving a 70-fold speed-up for the Kadir-Brady detector.
1 Introduction
Computation time is an important performance characteristic of computer vision algorithms. We show how existing (slow) binary-valued classifiers (detectors) can be approximated by a trained WaldBoost detector [1], which minimises the decision time while guaranteeing a predefined approximation precision. The main idea is to look at an existing algorithm as a black box performing some useful binary decision task and to train a sequential classifier to emulate its behaviour. We show how two interest point detectors, Hessian-Laplace [2] and the Kadir-Brady [3] saliency detector, can be emulated by a sequential WaldBoost classifier [1]. However, the approach is very general and is applicable in other areas as well (e.g. texture analysis, edge detection). The main advantage of the approach is that, instead of spending man-months on optimising and finding a fast and still sufficiently precise approximation to the original algorithm (which can sometimes be very difficult for humans), the main effort is put into finding a suitable set of features, which are then automatically combined into a WaldBoost ensemble. Another motivation could be an automatic speed-up of a slow implementation of one's own detector. A classical approach to optimisation of time-to-decision is to speed up an already working approach. This includes heuristic code optimisations (e.g. FastSIFT [4] or SURF [5]) but also very profound changes of architecture (e.g. the classifier cascade [6]). A less common way is to formalise the problem and try to solve the error/time trade-off in a single optimisation task.
Fig. 1. The proposed learning scheme
Our contribution is a proposal of a general framework for speeding up existing algorithms by a sequential classifier learned with the WaldBoost algorithm. Two interest point detectors were selected to demonstrate the approach. The experiments show a significant speed-up of the emulated algorithms while achieving comparable detection characteristics. There has been much work on the interest point detection problem [7], but to our knowledge, learning techniques have been applied only to subproblems, not to interest point detection as a whole. Lepetit and Fua [8] treated matching of detected points of interest as a classification problem, learning the descriptor. Rosten and Drummond [9] used learning techniques to find the parameters of a hand-designed tree-based Harris corner classifier. Their motivation was to speed up the detection process, but the approach is limited to Harris corner detection. Martin et al. [10] learned a classifier for edge detection, but without considering the decision time and with significant manual tuning. Nevertheless, they tested a number of classifier types and concluded that a boosted classifier was comparable in performance to the other classifiers and was preferable for its low model complexity and low computational cost. The rest of the paper is structured as follows. The approximation of a black-box binary-valued decision algorithm by a WaldBoost classifier is discussed in §2. Application of the approach to interest point detectors is described in §3. Experiments are given in §4 and the paper is concluded in §5.
2 Emulating a Binary-Valued Black Box Algorithm with WaldBoost
The structure of the approach is shown in Figure 1. The black box algorithm is any binary-valued decision algorithm. Its positive and negative outputs form a labelled training set. The WaldBoost learning algorithm builds a classifier sequentially and when new training samples are needed, it bootstraps the training set by running the black box algorithm on new images. Only the samples not decided yet by the so far trained classifier are used for training. The result of the
process is a WaldBoost sequential classifier which emulates the original black box algorithm. The bootstrapping loop exploits the fact that the black box algorithm can provide a practically unlimited amount of training data, in contrast to the commonly used human-labelled data, which are difficult to obtain. Next, a brief overview of the WaldBoost learning algorithm is presented.

2.1 WaldBoost
WaldBoost [1] is a greedy learning algorithm which finds a quasi-optimal sequential strategy for a given binary-valued decision problem. WaldBoost finds a sequential strategy S* such that

S^* = \operatorname*{argmin}_{S} \bar{T}_S \quad \text{subject to} \quad \beta_S \le \beta, \; \alpha_S \le \alpha   (1)
for specified α and β. Here T̄_S is the average time-to-decision, α_S the false negative rate and β_S the false positive rate of the sequential strategy S. A sequential strategy is any algorithm (in our case a classifier) which evaluates one measurement at a time. Based on the set of measurements obtained up to that time, it either decides for one of the classes or postpones the decision; in the latter case, the decision process continues by taking another measurement. To find the optimal sequential strategy S*, the WaldBoost algorithm combines the AdaBoost algorithm [11] for feature (measurement) selection and Wald's sequential probability ratio test (SPRT) [12] for finding the thresholds which are used for making the decisions. The input of the algorithm is a labelled training set of positive and negative samples, a set of features F (the building blocks of the classifier), and the bounds on the final false negative rate α and the false positive rate β. The output is an ordered set of weak classifiers h^{(t)}, t ∈ {1, ..., T}, each corresponding to one feature, and a set of thresholds θ_A^{(t)}, θ_B^{(t)} on the response of the strong classifier for all lengths t. During the evaluation of the classifier on a new observation x, one weak classifier is evaluated at time t and its response is added to the response function

f_t(x) = \sum_{q=1}^{t} h^{(q)}(x).   (2)
The response function f_t is then compared to the corresponding thresholds and the sample is either classified as positive or negative, or the next weak classifier is evaluated and the process continues:

H_t(x) =
    +1,         if f_t(x) ≥ θ_B^{(t)}
    −1,         if f_t(x) ≤ θ_A^{(t)}
    continue,   if θ_A^{(t)} < f_t(x) < θ_B^{(t)}.   (3)

If a sample x is not classified even after evaluation of the last weak classifier, a threshold γ is imposed on the real-valued response f_T(x).
Early decisions made during classifier evaluation in training also affect the training set. Whenever a part of the training set is removed according to Eq. (3), new training samples are collected (bootstrapped) from yet unseen images. In the experiments we use the same asymmetric version of WaldBoost as in [1]. When the β parameter is set to zero, the strategy becomes

H_t(x) =
    −1,         if f_t(x) ≤ θ_A^{(t)}
    continue,   if θ_A^{(t)} < f_t(x),   (4)

and only decisions for the negative class are made during the sequential evaluation of the classifier. A (rare) positive decision can only be reached after evaluating all T classifiers in the ensemble. In the context of fast black box algorithm emulation, what distinguishes training for different algorithms is the feature set F; a suitable set has to be found for every algorithm. Hence, instead of optimising the algorithm itself, the main burden of development lies in finding a proper set F. The set F can be very large if one is not sure which features are the best; the WaldBoost algorithm selects a suitable subset while optimising the time-to-decision.
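A compact sketch of the sequential evaluation of Eqs. (2)–(4): the weak classifiers are passed in as callables, and theta_B=None selects the asymmetric strategy of Eq. (4). The interface is ours, not the authors' implementation.

```python
def waldboost_classify(x, weak_classifiers, theta_A, theta_B=None, gamma=0.0):
    """Sequential WaldBoost evaluation of Eqs. (2)-(4).

    weak_classifiers : list of callables h_t(x) returning real responses.
    theta_A, theta_B : per-step thresholds; pass theta_B=None for the
                       asymmetric strategy (Eq. 4), where only negative
                       early decisions are made.
    gamma            : final threshold on f_T(x) for undecided samples.
    """
    f = 0.0
    for t, h in enumerate(weak_classifiers):
        f += h(x)                                 # response function f_t(x), Eq. (2)
        if f <= theta_A[t]:
            return -1                             # early negative decision
        if theta_B is not None and f >= theta_B[t]:
            return +1                             # early positive decision
    return +1 if f >= gamma else -1               # decision on the full response f_T(x)
```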
3 Emulated Scale Invariant Interest Point Detectors
In order to demonstrate the approach, two similarity invariant interest point detectors have been chosen for emulation: (i) the Hessian-Laplace [2] detector, which is a state-of-the-art similarity invariant detector, and (ii) the Kadir-Brady [3] saliency detector, which has been found valuable for categorisation but is about 100× slower. Binaries of both detectors are publicly available¹. We follow the standard test protocols for evaluation as described in [7]. Both detectors are similarity invariant (not affine), which is easily implemented via a scanning window over positions and scales plus a sequential test. For both detectors, the set F contains the Haar-like features proposed by Viola and Jones [6], plus a centre-surround feature from [13], which has been shown to be useful for blob-like structure detectors [4]. Haar-like features were chosen for their high evaluation speed (due to the integral image representation) and since they have the potential to emulate the Hessian-Laplace detections [4]. For the Kadir-Brady saliency detector emulation, however, the Haar-like features turned out not to be able to emulate the entropy-based detections. To overcome this, and still keep the efficiency high, "energy" features based on integral images of squared intensities were introduced; they represent the intensity variance in a given rectangle. To collect positive and negative samples for training, the corresponding detector is run on a set of images of various sizes and content. The considered detectors assign a scale to each detected point, and square patches of size twice the scale are used as positive samples.

¹ http://www.robots.ox.ac.uk/~vgg/research/affine/
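The "energy" feature can be evaluated in constant time per rectangle from two integral images, one of the intensities and one of the squared intensities, since the variance in a rectangle equals E[I²] − E[I]². A minimal sketch (function names ours):

```python
import numpy as np

def integral_images(img):
    """Integral images of the intensities and of the squared intensities."""
    img = img.astype(np.float64)
    ii = np.pad(img.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')
    ii2 = np.pad((img ** 2).cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode='constant')
    return ii, ii2

def rect_sum(ii, y, x, h, w):
    """Sum over the rectangle with top-left corner (y, x) and size h x w."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def energy_feature(ii, ii2, y, x, h, w):
    """Intensity variance inside the rectangle, in constant time."""
    n = float(h * w)
    s = rect_sum(ii, y, x, h, w)
    s2 = rect_sum(ii2, y, x, h, w)
    return s2 / n - (s / n) ** 2
```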
Fig. 2. The non-maximum suppression algorithm scheme for two detections
The negative samples representing the "background" class are collected from the same images at positions and scales not covered by positive samples.

Setting α. There is no error-free classification: the positive and negative classes overlap heavily in feature space. As a consequence, the WaldBoost classifier responds at many positions and scales, producing false positives. One way of removing the less reliable detections is to threshold the final response function f_T at some higher value γ. This would lead to fewer false positives, but also to more false negatives and a very slow classifier (the whole classifier is evaluated for most samples). A better option is to set α to a higher value and let the training prune the negative class sequentially. Again, this results in fewer false positives and a controllable amount of false negatives; additionally, the classifier becomes much faster due to the early decisions.

An essential part of a detector is the non-maximum suppression algorithm. Here the output differs from that obtained from the original detectors. Instead of a real-valued map over the whole image, sparse responses are returned by the WaldBoost detector due to the early decisions – the value of f_t, t < T, available for early decisions is not comparable to the f_T of positive detections. Thus the typical cubic interpolation and local maximum search cannot be applied. Instead, the following algorithm is used. Any two detections are grouped together if their overlap is higher than a given threshold (a parameter of the application), and only the detection with maximal f_T in each group is preserved. The overlap computation is schematically shown in Figure 2. Each detection is represented by a circle inscribed in the box (scanning window) reported as a detection (Figure 2, left). For two such circles, let us denote the radius of the smaller circle by r and the radius of the bigger one by R; the distance of the circle centres is denoted by d_c. The following approximation to the actual overlap of the circles is used to avoid computationally demanding goniometric functions. The measure has an easy interpretation in two cases. First, when the circle centres coincide, the overlap is approximated as r/R: it equals one for two circles of the same radius and decreases as the radii become different. Second, when two circles have just one point in common (d_c = r + R), the overlap is zero. These two situations are marked in Figure 2, right, by blue dots. Linear interpolation (blue solid line in Figure 2, right) is used to approximate the overlap between these two states. Given two radii r and R with r ≤ R and circle centre distance d_c, the overlap o is computed as
o = (r/R) · (1 − dc/(r + R)).
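To make the grouping rule concrete, here is a minimal sketch of the overlap measure and the greedy grouping. It is an illustration under our own conventions (detection tuples, the threshold value and the processing order are assumptions), not the authors' implementation.

```python
import numpy as np

def overlap(det_a, det_b):
    """Approximate overlap of two detections, each given as (x, y, radius)
    of the circle inscribed in its scanning window (cf. Fig. 2)."""
    r = min(det_a[2], det_b[2])          # smaller radius
    R = max(det_a[2], det_b[2])          # bigger radius
    dc = np.hypot(det_a[0] - det_b[0], det_a[1] - det_b[1])
    if dc >= r + R:                      # circles touch or are disjoint
        return 0.0
    return (r / R) * (1.0 - dc / (r + R))

def non_max_suppression(detections, scores, thr=0.5):
    """Greedy grouping: keep, within each overlap group, the detection
    with the maximal final response f_T."""
    order = np.argsort(scores)[::-1]     # strongest first
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        keep.append(i)
        for j in order:
            if j != i and j not in suppressed and overlap(detections[i], detections[j]) > thr:
                suppressed.add(j)
    return keep
```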
4 Experiments
This section describes experiments with two WaldBoost-emulated detectors: the Hessian-Laplace [2] detector and the Kadir-Brady [3] saliency detector. The Hessian-Laplace detector is expected to be easy to emulate due to its blob-like detections, which keeps the first experiment more transparent. The Kadir-Brady detector is more complex due to its entropy-based detections. The Kadir-Brady detector shows rather poor results in classical repeatability tests [7] but has been used successfully in several recognition tasks [14]. However, its main weakness for practical applications is its very long computation time (on the order of minutes per image).

4.1 Hessian-Laplace Emulation
The training set for the WaldBoost emulation of Hessian-Laplace is created from 36 images of various sizes and content (nature, urban environment, hand drawn, etc.) as described in §3. The Hessian-Laplace detector is used with threshold 1000 to generate the training set; the same threshold is used throughout all the experiments for both learning and evaluation. Training has been run for T = 20 training steps with α = 0.2 and β = 0. The higher α allows fast pruning of less trustworthy detections during the sequential evaluation of the detector. The detector has been assessed in the standard tests proposed by Mikolajczyk et al. [7]. First, the repeatability of the trained WaldBoost detector has been compared with that of the original Hessian-Laplace detector on several image sequences with variations in scale and rotation. The results on two selected sequences, boat and east south, from [15] are shown in Figure 3 (top row). The WaldBoost detector achieves repeatability similar to that of the original Hessian-Laplace detector. In order to test the trained detectors for their applicability, a matching application scenario is used. To that effect, a slightly different definition of the matching score is used than that of Mikolajczyk [7]. The matching score as defined in [7] is computed as the number of correct matches divided by the smaller number of correspondences in the common part of the two images. However, the matches are computed only pairwise for correspondences determined by the geometry ground truth. Here, the same definition of the matching score is used, but the definition of a correct match differs. First, tentative matches are computed using SIFT and mutually nearest matches are found. These matches are then verified by the geometry ground truth, and only the verified matches are called correct. A comparison of the trainer and the trainee outputs on two sequences is given in Figure 3 (bottom row). The WaldBoost detector achieves a similar matching score on both sequences while producing consistently more detections and matches.
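For illustration, the matching-score evaluation described above can be sketched as follows. The descriptor matching, the pixel tolerance and the use of a ground-truth homography H_gt are assumptions of this sketch (and the restriction to the common part of the two images is omitted); it is not the authors' evaluation code.

```python
import numpy as np

def matching_score(pts1, desc1, pts2, desc2, H_gt, tol=3.0):
    """Mutually nearest descriptor matches, verified by the ground-truth
    homography H_gt. pts*: (N,2) keypoint locations, desc*: (N,D) descriptors."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn12 = d.argmin(axis=1)              # nearest neighbour 1 -> 2
    nn21 = d.argmin(axis=0)              # nearest neighbour 2 -> 1
    mutual = [(i, j) for i, j in enumerate(nn12) if nn21[j] == i]

    # project pts1 into image 2 and count geometrically verified matches
    ones = np.ones((len(pts1), 1))
    proj = (H_gt @ np.hstack([pts1, ones]).T).T
    proj = proj[:, :2] / proj[:, 2:3]
    correct = sum(np.linalg.norm(proj[i] - pts2[j]) < tol for i, j in mutual)

    return correct / min(len(pts1), len(pts2))
```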
Fig. 3. Comparison of Hessian-Laplace detector and its WaldBoost emulation. Top row: Repeatability on boat (a) and east south (c) sequences and corresponding number of detections (b), (d). Bottom row: Matching score (e), (g) and corresponding number of correct matches (f), (h) on the same sequences.
Fig. 4. The first centre-surround and energy feature found in the WaldBoost-emulated Hessian-Laplace (left) and Kadir-Brady (right) detectors. The underlying image is generated as E(|xi − 127.5|) and E(xi), respectively, where E() is the average operator and xi is the i-th positive training example.
The WaldBoost classifier evaluates on average 2.5 features per examined position and scale, far fewer evaluations than reported for face detection [1]. The evaluation times are compared in Table 1. The speed of the WaldBoost emulation is comparable to that of the manually tuned Hessian-Laplace detector. The Hessian-Laplace detector finds blob-like structures, and the structure of the trained WaldBoost emulation should reflect this property. As shown in Figure 4, the first selected feature is of a centre-surround type, which gives high responses to blob-like structures. The outputs of the trained WaldBoost emulation of Hessian-Laplace and the original algorithm are compared in Figure 5. To identify the original Hessian-Laplace detections correctly found by the WaldBoost emulator, correspondences based on Mikolajczyk's overlap criterion [7] have been established between the original and WaldBoost detections. The white circles show repeated correspondences. The black circles show the detections not found by the WaldBoost emulation. Note that most of the missed detections have a correct detection nearby, so the corresponding image structure is actually found. The percentage of repeated detections of the original algorithm is 85 %. To conclude, the WaldBoost emulator of the Hessian-Laplace detector is able to detect points with similar repeatability and matching score, while its speed is comparable to that of the original algorithm. This indicates that the proposed approach is able to bring the decision time down to the speed of a manually tuned algorithm.

Fig. 5. Comparison of the outputs of the original and WaldBoost-emulated (a) Hessian-Laplace and (b) Kadir-Brady saliency detectors. The white circles show repeated detections. The black circles highlight the original detections not found by the WaldBoost detector. Note that for most missed detections there is a nearby detection on the same image structure. The accuracy of the emulation is 85 % for Hessian-Laplace and 96 % for the Kadir-Brady saliency detector. Note that the publicly available Kadir-Brady algorithm does not detect points close to image edges.

4.2 Fast Saliency Detector
The emulation of the Kadir-Brady saliency detector [3] was trained on the same set of images as the WaldBoost Hessian-Laplace emulator. The saliency threshold of the original detector was set to 2 to limit the positive examples to those with higher saliency. Note that, as opposed to the Hessian-Laplace emulation, where a rather low threshold was chosen, it is meaningful to use only the most salient features of the Kadir-Brady detector; this is not true for the Hessian-Laplace detector, since its response does not correspond to the importance of the feature. The Haar-like feature set was extended by the "energy" feature described in §3. The training was run for T = 20 training steps with α = 0.2 and β = 0. The same experiments as for the Hessian-Laplace detector have been performed. The repeatability and the matching score of the Kadir-Brady detector and its WaldBoost emulation on the boat and east south sequences are shown in Figure 6. The trained detector performs slightly better than the original one.
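The "energy" feature used here measures the intensity variance inside a rectangle; with an integral image of intensities and one of squared intensities it can be evaluated in constant time. The sketch below illustrates the idea under the usual integral-image conventions; it is an illustration, not the authors' code.

```python
import numpy as np

def integral_images(img):
    """Integral images of intensities and of squared intensities,
    padded with a leading row/column of zeros for easy indexing."""
    img = img.astype(np.float64)
    ii  = np.pad(img,      ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ii2 = np.pad(img ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return ii, ii2

def rect_sum(ii, y, x, h, w):
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def energy_feature(ii, ii2, y, x, h, w):
    """Variance of the intensities inside the h x w rectangle at (y, x)."""
    n = h * w
    s  = rect_sum(ii,  y, x, h, w)
    s2 = rect_sum(ii2, y, x, h, w)
    return s2 / n - (s / n) ** 2
```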
Fig. 6. Comparison of Kadir-Brady detector and its WaldBoost emulation. Top row: Repeatability on boat (a) and east south (c) sequences and corresponding number of detections (b), (d). Bottom row: Matching score (e), (g) and corresponding number of correct matches (f), (h) on the same sequences.

Table 1. Speed comparison on the first image (850×680) from the boat sequence

Detector          original   WaldBoost
Hessian-Laplace   1.3 s      1.3 s
Kadir-Brady       1 m 44 s   1.4 s
The main advantage of the emulated saliency detector is its speed. The classifier evaluates on average 3.7 features per examined position and scale. Table 1 shows that the emulated detector is about 70× faster than the original detector. Our early experiments showed that the Haar-like features are not suitable for emulating the entropy-based saliency detector. With the energy features, the training was able to converge to a reasonable classifier; in fact, an energy feature is chosen for the first weak classifier in the WaldBoost ensemble (see Figure 4). The outputs of the WaldBoost saliency detector and the original algorithm are compared in Figure 5. The coverage of the original detections is 96 %. To conclude, the Kadir-Brady emulation gives slightly better repeatability and matching score. Most importantly, the decision times of the emulated detector are about 70× shorter than those of the original algorithm, which opens new possibilities for using the Kadir-Brady detector in time-sensitive applications.
5 Conclusions and Future Work
In this paper, a general learning framework for speeding up existing binary-valued decision algorithms by a sequential classifier learned with the WaldBoost algorithm has been proposed. Two interest point detectors, the Hessian-Laplace and the Kadir-Brady saliency detector, have been used as black-box algorithms and emulated by the
WaldBoost algorithm. The experiments show similar repeatability and matching scores for the original and emulated algorithms. The speed of the Hessian-Laplace emulator is comparable to that of the original, manually tuned algorithm, while the Kadir-Brady detector was sped up seventy times. The proposed approach is general and can be applied to other algorithms as well. For future research, an interesting extension would be to train an emulator that not only reproduces the outputs of an existing algorithm but also adds some desired quality, such as higher repeatability or specialisation to a given task.
Acknowledgement. The authors were supported by Czech Science Foundation Project 102/07/1317 (JM) and by EC project FP6-IST-027113 eTRIMS (JŠ).
References
1. Šochman, J., Matas, J.: WaldBoost – learning for time constrained sequential detection. In: CVPR, Los Alamitos, USA, vol. 2, pp. 150–157 (2005)
2. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60(1), 63–86 (2004)
3. Kadir, T., Brady, M.: Saliency, scale and image description. IJCV 45(2) (2001)
4. Grabner, M., Grabner, H., Bischof, H.: Fast approximated SIFT. In: ACCV, pp. I:918–927 (2006)
5. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
6. Viola, P., Jones, M.: Robust real time object detection. In: SCTV, Vancouver, Canada (2001)
7. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. IJCV (2005)
8. Lepetit, V., Lagger, P., Fua, P.: Randomized trees for real-time keypoint recognition. In: CVPR, vol. II, pp. 775–781 (2005)
9. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 430–443. Springer, Heidelberg (2006)
10. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI 26(5), 530–549 (2004)
11. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999)
12. Wald, A.: Sequential Analysis. Dover, New York (1947)
13. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: ICIP (2002)
14. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient learning and exhaustive recognition. In: CVPR, vol. 1, pp. 380–387 (2005)
15. Mikolajczyk, K.: Detection of local features invariant to affine transformations. PhD thesis, INPG, Grenoble (2002)
Multiplexed Illumination for Measuring BRDF Using an Ellipsoidal Mirror and a Projector Yasuhiro Mukaigawa, Kohei Sumino, and Yasushi Yagi The Institute of Scientific and Industrial Research, Osaka University
Abstract. Measuring a bidirectional reflectance distribution function (BRDF) requires a long time because a target object must be illuminated from all incident angles and the reflected light must be measured from all reflected angles. A high-speed method is presented to measure BRDFs using an ellipsoidal mirror and a projector. The method can change incident angles without a mechanical drive. Moreover, it is shown that the dynamic range of the measured BRDF can be significantly increased by multiplexed illumination based on the Hadamard matrix.
1 Introduction
In recent years, the measurement of geometric information (3D shapes) has become easier thanks to commercial range-finders. However, the measurement of photometric information (reflectance properties) is still difficult. Reflection properties depend on the microscopic shape of the surface and can be used for a variety of applications such as computer graphics and the inspection of painted surfaces. However, there is no standard way of measuring reflection properties. The main reason is that dense measurement of BRDFs requires a huge amount of time, because the target object must be illuminated from every incident angle and the reflected light must be measured from every reflected angle. Most existing methods use mechanical drives to rotate a light source, and as a result the measuring time becomes very long. In this paper, we present a new method to measure BRDFs rapidly. Our system substitutes an ellipsoidal mirror for a mechanical drive and a projector for a light source. Since our system completely excludes mechanical drives, high-speed measurement can be realized. Moreover, we present an algorithm that improves the dynamic range of the measured BRDFs. The combination of a projector and an ellipsoidal mirror can produce arbitrary illumination. Hence, the dynamic range is significantly increased by multiplexing the illumination based on the Hadamard matrix, while the capturing time remains the same as for normal illumination.
2 Related Work
If the reflectance is uniform over the surface, the measurement becomes easier by merging the BRDFs at every point. Matusik et al. [1] measured isotropic BRDFs by capturing a sphere.
Table 1. Comparison of major BRDF measuring devices

Device                Camera                Light source            Density of BRDF
Li [5]                mechanical rotation   mechanical rotation     dense
Dana [10]             fixed                 mechanical translation  dense
Müller [6], Han [12]  fixed                 fixed                   sparse
Our system            fixed                 fixed                   dense
Anisotropic BRDFs can also be measured by capturing a plane [2] or a cylinder [3]. Marschner et al. [4] measured the BRDFs of general shapes using a 3D range sensor. However, these methods cannot measure spatially varying BRDFs. The most straightforward way to measure BRDFs is to use a gonioreflectometer, which allows a light source and a sensor to rotate around the target material. Li et al. [5] have proposed a three-axis instrument. However, because the angle needs to be altered mechanically, the measurement of dense BRDFs takes a long time. To speed up the measurement, rotational mechanisms should be excluded. By placing many cameras and light sources around the target object, BRDFs can be measured for some combinations of incident and reflective directions. Müller et al. [6] have constructed a system including 151 cameras with a flash. However, dense measurement is physically difficult. In the optics field, some systems that utilize catadioptric devices have been proposed. Davis and Rawlings [7] have patented a system using an ellipsoidal mirror to collect the reflected light. Mattison et al. [8] have developed a handheld instrument based on the patent. The patent focuses only on gathering reflected light and does not address control of the incident direction. Ward [9] used a hemispherical half-mirror and a fish-eye lens, but his system requires a rotational mechanism for the light source. Dana [10] used a paraboloidal mirror, but a translational mechanism for the light source remains necessary. To avoid mechanical drives, some systems include catadioptric devices: Kuthirummal et al. [11] used a cylindrical mirror, and Han et al. [12] combined a projector and several plane mirrors similar to those used in a kaleidoscope. However, these systems can measure only sparse BRDFs because the measurable incident and reflective angles are quite discrete. We, on the other hand, propose a new system that combines an ellipsoidal mirror and a projector. Since our system completely excludes mechanical devices, high-speed measurement is realized. The system can measure dense BRDFs because both the lighting direction and the viewing direction are densely sampled.
3 BRDF

3.1 Isotropic and Anisotropic BRDFs
To represent reflection properties, a BRDF is used. The BRDF represents the ratio of outgoing radiance in the viewing direction (θr , φr ) to incident irradiance from a lighting direction (θi , φi ), as shown in Fig.2(a).
When a camera and a light source are fixed, the rotation of an object around its surface normal changes the appearance of some materials. Such reflection is called anisotropic reflection; typical materials of this type are brushed metals and cloth fabrics such as velvet and satin. To describe anisotropic reflection completely, the BRDF must be defined by four angle parameters. On the other hand, for many materials the appearance does not change under rotation around the surface normal. Such reflection is called isotropic reflection. If isotropic reflection can be assumed, the BRDF can be described using only three parameters, θi, θr, and φ (φ = φi − φr). If the number of parameters can be reduced from four to three, the measuring time and data size can be significantly reduced.

3.2 Problems with a 4-Parameter Description
There are two major problems associated with a 4-parameter description: data size and measuring time. First, let us consider the data size. If the angles θr, φr, θi, and φi are sampled at one-degree intervals, and the reflected light is recorded as R, G, and B colors for each angle, the required data size becomes 360 × 90 × 360 × 90 × 3 = 3,149,280,000 bytes. A size of 3 GB is not impractical for recent PCs. Moreover, BRDFs can be compressed effectively because they contain much redundancy. Therefore, data size is not a serious problem. The problem of measuring time, however, remains serious. Since the number of combinations of lighting and sensing angles becomes extremely large, a long measuring time is required. If the sampling interval is one degree, the total number of combinations becomes 360 × 90 × 360 × 90 = 1,049,760,000. This means that the measurement would require 33 years if it took one second to measure one reflection color. The time can of course be shortened by using a high-speed camera, but the total time would still remain impractical. While the problem of data size is not serious, the problem of measuring time warrants consideration. In this paper, we tackle the measuring-time problem for 4-parameter BRDFs directly by devising a catadioptric system.
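The storage and acquisition-time figures above follow directly from the angular sampling; the short calculation below simply reproduces them (the one-second-per-sample rate is the paper's own illustrative assumption).

```python
# Angular sampling of the 4-parameter BRDF at one-degree resolution.
theta_r, phi_r, theta_i, phi_i = 90, 360, 90, 360

combinations = theta_r * phi_r * theta_i * phi_i      # 1,049,760,000
data_bytes   = combinations * 3                        # RGB -> ~3 GB
years        = combinations / (60 * 60 * 24 * 365)     # at 1 sample per second

print(f"{combinations:,} angle combinations")
print(f"{data_bytes / 1e9:.2f} GB of raw data")
print(f"{years:.1f} years at one second per sample")   # ~33.3 years
```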
4 BRDF Measuring System

4.1 Principle of Measurement
An ellipsoid has two focal points, and all rays from one focal point reflect off the ellipsoidal mirror and reach the other focal point. This property is used for measuring BRDFs. The target object is placed at one focal point and a camera at the other. Since rays reflected in all directions from the target object converge at a single point, all of them can be captured at once. The most significant characteristic of the system is that the ellipsoidal mirror is combined with a projector, which serves as a substitute for the light source. Projecting a pattern in which only one pixel is white corresponds to illumination by a point light source, and changing the location of the white pixel corresponds to rotating the incident angle.
Fig. 1. The design of the BRDF measuring device
Since changing the projection pattern is faster than mechanical rotation, rapid and dense measurement can be achieved.

4.2 Design of the Measuring System
Based on the principle described in the previous section, we developed two BRDF measuring devices with differently shaped ellipsoidal mirrors. One is a vertical setup, in which the target material is placed perpendicular to the long axis, as shown in Fig. 1(a); the mirror is an ellipsoid cut perpendicular to the long axis, as shown in Fig. 1(c). The other is a horizontal setup, in which the target material is placed parallel to the long axis, as shown in Fig. 1(b); the mirror is an ellipsoid cut both parallel and perpendicular to the long axis, as shown in Fig. 1(d). The major optical devices are a projector, a digital camera, an ellipsoidal mirror, and a beam splitter. The illumination from the projector is reflected by the beam splitter and the ellipsoidal mirror, and finally illuminates a single point on the target object. The light reflected by the target object is again reflected by the ellipsoidal mirror and is recorded as an image. The vertical setup has the merit that the density of the BRDF is uniform along φ, because the long axis of the ellipsoid and the optical axes of the camera and projector coincide. Moreover, this kind of mirror is commercially available because it is often used as part of ordinary illumination devices. However, target materials must be cut into small facets to be placed at the focal point.
Fig. 2. Angular parameters of BRDF. (a) Four angle parameters. (b)(c) Relationship between the angles (θ, φ) and the image location.
On the other hand, the horizontal setup has the merit that the target materials do not have to be cut; hence, the BRDF of cultural heritage objects can be measured. However, the mirror must be specially made by a cutting operation.

4.3 Conversion Between Angle and Image Location
The lighting and viewing directions are specified as angles, while they are expressed as 2-D locations in the projection pattern or the captured image. The conversion between an angle and an image location is easy once geometric calibration has been done for the camera and the projector. Figures 2(b) and (c) illustrate the relationship between the angles (θ, φ) and the image location for the vertical and horizontal setups, respectively.
5 Multiplexed Illumination
In this section, the problem of low dynamic range inherent in the projector-based system is described, and it is shown that this problem can be solved by multiplexed illumination based on the Hadamard matrix.

5.1 Dynamic Range Problem
There are two main reasons for the low dynamic range. One is the difference in intensity between specular and diffuse reflections. If a short shutter speed is used to capture the specular reflection without saturation, the diffuse reflection tends to be extremely dark, as shown in Fig. 3(a). Conversely, a long shutter speed that captures a bright diffuse reflection saturates the specular reflection, as shown in Fig. 3(b). This problem is not peculiar to our system but is common to image measurement systems in general. The other reason is peculiar to our use of a projector for illumination. Generally, the intensity of a black pixel in the projection pattern is not perfectly zero: a projector emits a faint light even where the projection pattern is black. Even if the intensity of each such pixel is small, the sum of the intensities converging on one point cannot be ignored.
Fig. 3. The problem of low dynamic range
For example, let us assume that the contrast ratio of the projector is 1000:1 and that the size of the projection pattern is 1024 × 768. If 10 pixels in the projection pattern are white and the others are black, the intensity ratio of the white pixels to the black pixels is

10 × 1000 : (1024 × 768 − 10) × 1 = 1 : 79.    (1)
Thus the total intensity of the black pixels is larger than that of the white pixels. This means that the measured data include a large amount of unnecessary information, which should be ignored, as shown in Fig. 3(c). By subtracting the image captured when a uniform black pattern is projected, this unnecessary information can be eliminated. However, since only a few bits then remain to express the necessary information, a more radical solution is required.

5.2 Multiplexed Illumination
Optical multiplexing techniques have been investigated in the spectrometry field since the 1970s [13]. If the spectrum of a light beam is measured separately for each wavelength, each spectral signal becomes noisy; with optical multiplexing, multiple spectral components are measured simultaneously to improve the quality. In the computer vision field, Schechner et al. [14] applied the multiplexing technique to capture images under varying illumination. In this method, instead of turning on each light source independently, multiplexed combinations of light sources are turned on simultaneously, and the image corresponding to each single light source is computed from the captured images. Wenger et al. [15] evaluated the noise-reduction effect of multiplexed illumination. We briefly describe the principle of multiplexed illumination. Let us assume that there are n light sources and that s denotes the intensities of a point in the images when each light source is turned on individually. The captured images are multiplexed by the weighting matrix W. The intensities m of the point under the multiplexed illumination are expressed by

m = W s.    (2)

The intensities of the point under single illumination can then be estimated by

s = W⁻¹ m.    (3)
In our BRDF measuring system, a projector is used instead of an array of light sources, so the weighting matrix W can be controlled arbitrarily. It is known that if the components of W are restricted to −1 or 1, the Hadamard matrix is the best multiplexing weight [13]; in this case, the S/N ratio is increased by a factor of √n. The n × n Hadamard matrix satisfies

H_n^T H_n = n I_n,    (4)

where I_n denotes the n × n identity matrix. In practice, negative illumination cannot be produced by a projector, so the Hadamard matrix cannot be used directly. It is also known that if the components of W are restricted to 0 or 1, the S-matrix is the best multiplexing weight [13]; in this case, the S/N ratio is increased by a factor of √n/2. The S-matrix can easily be generated from the Hadamard matrix. Hence, the projection pattern is multiplexed using the S-matrix. Since the illumination can be controlled for each pixel using a projector, n becomes large and a dramatic improvement can be achieved.
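As a rough illustration of Eqs. (2)–(4) (not the authors' implementation), the sketch below builds an S-matrix from a Sylvester-type Hadamard matrix, simulates multiplexed measurements, and recovers the single-source intensities; the matrix size and the noise model are assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def s_matrix(n):
    """(n-1) x (n-1) S-matrix: drop the first row/column of H_n,
    then map -1 -> 1 and +1 -> 0 (0/1 weights a projector can produce)."""
    H = hadamard(n)
    return ((1 - H[1:, 1:]) // 2).astype(np.float64)

# toy demultiplexing: s_true holds the responses to single "light sources"
n = 8
S = s_matrix(n)                                   # 7 x 7 weighting matrix W
s_true = np.random.rand(n - 1)
m = S @ s_true + 0.01 * np.random.randn(n - 1)    # multiplexed, noisy measurements
s_est = np.linalg.solve(S, m)                     # s = W^{-1} m, Eq. (3)
print(np.abs(s_est - s_true).max())
```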
6 Experimental Results

6.1 BRDF Measuring Systems
We constructed BRDF measuring systems named RCGs (Rapid Catadioptric Gonioreflectometers), shown in Figs. 4(c) and (d). The RCG-1 includes a PointGrey Flea camera, an EPSON EMP-760 projector, and a Melles Griot ellipsoidal mirror, as shown in Fig. 4(a). The RCG-2 includes a Lucam Lu-160C camera and a TOSHIBA TDP-FF1A projector. The ellipsoidal mirror of the RCG-2 is designed so that BRDFs can be measured for all angles of θ within 0 ≤ φ ≤ 240, as shown in Fig. 4(b).
Fig. 4. The BRDF measuring systems
Fig. 5. Target materials: (a) velvet, (b) satin, (c) polyurethane, (d) copper
Fig. 6. BRDF of velvet and satin. (a)(b) examples of captured images, (c)(d) rendering result from measured BRDFs.
6.2 Measurement of Velvet and Satin
In this section, the results of BRDFs measured with the RCG-1 are shown. The target objects are velvet and satin, both of which show anisotropic reflection (Figs. 5(a) and (b)). First, the measuring time is evaluated. The sampling interval was set to one degree. The pattern corresponding to the lighting direction θi = 30, φi = 250 was projected, and the reflected images were captured, as shown in Figs. 6(a) and (b) for velvet and satin, respectively. Note that some BRDFs could not be measured because the ellipsoidal mirror of the RCG-1 has a hole at the end of the long axis. 360 × 90 = 32,400 images were captured for each material; the measuring time was about 50 minutes. Figures 6(c) and (d) show generated images of a corrugated plane rendered with the measured BRDFs of velvet and satin. The rendering for this corrugated shape fortunately does not require the missing data. It can be seen that the characteristics of anisotropic reflection are reproduced.

6.3 Measurement of a Polyurethane Sphere
To evaluate the effectiveness of the multiplexed illumination, the isotropic BRDF of a polyurethane sphere was measured (Fig. 5(c)). In this case, the lighting direction is varied by a 1-DOF rotation because of the isotropic reflection; that is, the azimuth angle φi is fixed and the elevation angle is varied over 0 ≤ θi ≤ 180. Figure 7(a) shows an example of multiplexed illumination by a 191 × 191 S-matrix, and (b) shows the captured image after subtracting an image taken while projecting a black pattern. The captured images for the lighting direction θi = 10, φi = 270 were compared under several conditions. Figures 8(a) and (b) show the distribution of the reflected light without multiplexing.
Fig. 7. An example of the multiplexed illumination. (a) Projected pattern multiplexed by a 191 × 191 S-Matrix, and (b) the captured image.
Fig. 8. The reflected light for the lighting direction (θl = 10, φl = 270): (a) single illumination without averaging, (b) single illumination with averaging, (c) multiplexed illumination without averaging, (d) multiplexed illumination with averaging.
Panel (a) is a single captured image, while (b) is the average of ten captured images. The captured images are very noisy even with the averaging process. Figures 8(c) and (d) show the results with multiplexing: a sequence of multiplexed illumination patterns was projected, and the distribution of the reflected light corresponding to the same lighting direction was estimated. As before, (c) is the result without averaging, while (d) is the result of averaging ten images. Obviously, the noise is dramatically decreased by the multiplexed illumination. To examine the spatial distribution of the reflected light, the intensities along y = 60, x = 30–200 are plotted in Fig. 9: (a) shows the intensities without averaging, while (b) shows those with averaging over ten images. In the graphs, blue and red lines represent the distribution with and without multiplexing, respectively. It is interesting that the result of multiplexing without averaging is more accurate than the result of single illumination with averaging. While the capture time is ten times longer for the averaging process, the multiplexed illumination improves accuracy without increasing the capturing time. Figure 10 shows rendered images of a sphere and a corrugated surface using the BRDF measured by multiplexed illumination with averaging. Compared with the real sphere, the distribution of the specular reflection is slightly wide.
Fig. 9. The distribution of the intensities: (a) without averaging, (b) with averaging over ten images.
Fig. 10. Rendering results of the pink polyurethane sphere
One of the reasons is that the light reflected by the ellipsoidal mirror does not converge perfectly on the target material, because the alignment of the optical devices is not perfect. Since the target object is a sphere, the normal direction varies with the measuring point; as a result, overly wide specular reflections may be generated. Unnatural reflections are also observed in the upper area of Figures 8(c) and (d); this problem may be caused by errors in the cutting of the ellipsoidal mirror. Therefore, the BRDF at θ = 65 is substituted for the missing data at θ > 65. Improving the accuracy of the optical devices is one of our future aims.

6.4 Measurement of a Copper Plate
The isotropic BRDF of a copper plate was measured (Fig. 5(d)). Metal is the most difficult material for which to measure BRDFs accurately, because the intensity levels of the specular and diffuse reflections differ vastly. Figure 11 shows rendering results of a corrugated surface using the measured BRDFs: (a) and (b) are obtained with single illumination, while (c) and (d) use multiplexed illumination. Since a fast shutter speed is used during measurement to avoid saturation, the captured images are very dark; the rendering results are therefore brightened in this figure. In the rendered images (a) and (b), incorrect colors such as red or blue are observed; these seem to result from magnifying noise during the brightening process. In contrast, the noise is drastically decreased in the rendered images (c) and (d) obtained with multiplexed illumination.
Fig. 11. Comparison of the rendered results of the copper plate. (a) and (b) Single illumination. (c) and (d) Multiplexed illumination.
Although the dynamic range of the measured BRDFs is widened considerably, some noise is still observed in the rendered images. Ratner et al. [16] pointed out a fundamental limitation of Hadamard-based multiplexing. The dynamic range problem can also be reduced by combining several images captured with varying shutter speeds [17], although this increases the capturing time.
7 Conclusion
In this paper, we proposed a new high-speed BRDF measurement method that combines an ellipsoidal mirror with a projector, and solved the low dynamic range problem by applying multiplexed illumination to the pattern projection. Two BRDF measuring systems with differently shaped ellipsoidal mirrors were developed. The proposed systems can measure complex reflection properties, including anisotropic reflection. Moreover, the BRDF measuring time is significantly shortened by excluding mechanical devices. This paper focuses only on the measuring speed of the developed systems; the accuracy of the measured BRDFs still needs to be evaluated. For this evaluation, we are planning to compare the measured BRDFs with ground truth obtained from reflectance standards whose reflection properties are known. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (A).
References
1. Matusik, W., Pfister, H., Brand, M., McMillan, L.: A Data-Driven Reflectance Model. In: Proc. SIGGRAPH 2003, pp. 759–769 (2003)
2. Karner, K.F., Mayer, H., Gervautz, M.: An image based measurement system for anisotropic reflection. Computer Graphics Forum (Eurographics 1996 Proceedings) 15(3), 119–128 (1996)
3. Lu, R., Koenderink, J.J., Kappers, A.M.L.: Optical Properties (Bidirectional Reflection Distribution Functions) of Velvet. Applied Optics 37(25), 5974–5984 (1998)
4. Marschner, S.R., Westin, S.H., Lafortune, E.P.F., Torrance, K.E., Greenberg, D.P.: Image-Based BRDF Measurement Including Human Skin. In: Proc. 10th Eurographics Workshop on Rendering, pp. 139–152 (1999)
5. Li, H., Foo, S.C., Torrance, K.E., Westin, S.H.: Automated three-axis gonioreflectometer for computer graphics applications. In: Proc. SPIE, vol. 5878, pp. 221–231 (2005)
6. Müller, G., Bendels, G.H., Klein, R.: Rapid Synchronous Acquisition of Geometry and Appearance of Cultural Heritage Artefacts. In: VAST 2005, pp. 13–20 (2005)
7. Davis, K.J., Rawlings, D.C.: Directional reflectometer for measuring optical bidirectional reflectance. United States Patent 5637873 (June 1997)
8. Mattison, P.R., Dombrowski, M.S., Lorenz, J.M., Davis, K.J., Mann, H.C., Johnson, P., Foos, B.: Handheld directional reflectometer: an angular imaging device to measure BRDF and HDR in real time. In: Proc. SPIE, vol. 3426, pp. 240–251 (1998)
9. Ward, G.J.: Measuring and Modeling anisotropic reflection. In: Proc. SIGGRAPH 1992, pp. 255–272 (1992)
10. Dana, K.J., Wang, J.: Device for convenient measurement of spatially varying bidirectional reflectance. J. Opt. Soc. Am. A 21(1), 1–12 (2004)
11. Kuthirummal, S., Nayar, S.K.: Multiview Radial Catadioptric Imaging for Scene Capture. In: Proc. SIGGRAPH 2006, pp. 916–923 (2006)
12. Han, J.Y., Perlin, K.: Measuring Bidirectional Texture Reflectance with a Kaleidoscope. ACM Transactions on Graphics 22(3), 741–748 (2003)
13. Harwit, M., Sloane, N.J.A.: Hadamard Transform Optics. Academic Press, London (1973)
14. Schechner, Y.Y., Nayar, S.K., Belhumeur, P.N.: A Theory of Multiplexed Illumination. In: Proc. ICCV 2003, pp. 808–815 (2003)
15. Wenger, A., Gardner, A., Tchou, C., Unger, J., Hawkins, T., Debevec, P.: Performance Relighting and Reflectance Transformation with Time-Multiplexed Illumination. In: Proc. SIGGRAPH 2005, pp. 756–764 (2005)
16. Ratner, N., Schechner, Y.Y.: Illumination Multiplexing within Fundamental Limits. In: Proc. CVPR 2007 (2007)
17. Debevec, P., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: Proc. SIGGRAPH 1997, pp. 369–378 (1997)
Analyzing the Influences of Camera Warm-Up Effects on Image Acquisition Holger Handel Institute for Computational Medicine (ICM), Univ. of Mannheim, B6, 27-29, 69131 Mannheim, Germany
[email protected]

Abstract. This article presents an investigation of the impact of camera warm-up on the image acquisition process and therefore on the accuracy of segmented image features. Based on an experimental study, we show that the camera image is shifted by some tenths of a pixel after camera start-up. The drift correlates with the temperature of the sensor board and stops when the camera reaches its thermal equilibrium. A further study of the observed image flow shows that it originates from a slight displacement of the image sensor due to thermal expansion of the mechanical components of the camera. This sensor displacement can be modeled using standard methods of projective geometry together with bi-exponential decay terms that model the temporal behaviour. The parameters of the proposed model can be calibrated and then used to compensate for warm-up effects. Further experimental studies show that our method is applicable to different types of cameras and that the warm-up behaviour is characteristic of a specific camera.
1 Introduction

In the last couple of years much work has been done on camera modeling and calibration (see [1], [2], [3], [4] to mention a few). The predominant way to model the mapping from 3D world space to 2D image space is the well-known pinhole camera model. The ideal pinhole camera model has been extended with additional parameters to account for radial and decentering distortion ([5], [6], [7]) and even sensor unflatness [8]. These extensions have led to a more realistic and thus more accurate camera model (see Weng et al. [5] for an accuracy evaluation). Beside these purely geometric aspects of the imaging process, work has also been done on the electrical properties of the camera sensor and their influence on image acquisition. Some relevant variables are dark current, fixed pattern noise and line jitter ([9], [10], [11]). An aspect that has rarely been studied is the effect of camera warm-up on the imaging process. Beyer [12] reports a drift of measured image coordinates of some tenths of a pixel during the first hour after camera start-up. Wong et al. [13] and Robson et al. [14] report similar effects. All of them only report drift distortions due to camera warm-up but give no further explanation of the origin of the observed image drift, nor any way to model and compensate for these distortions. Today, machine vision techniques are widely used in sensitive areas such as industrial production and medical intervention, where errors of some tenths of a pixel in image feature
segmentation caused by sensor warm-up can result in significant reconstruction errors. In [15], measurement drifts of an optical tracking system of up to 1 mm during the first 30 minutes after start-up are reported. In many computer-assisted surgery applications such reconstruction errors are intolerable. Thus, a better understanding of the impact of camera warm-up on the image acquisition process is crucial. In this paper we investigate the influence of camera warm-up on the imaging process. We show that the coordinates of segmented feature points are corrupted by a drift movement the image undergoes during camera warm-up. In our opinion this drift is caused by thermal expansion of the camera's sensor board, which results in a slight displacement of the sensor chip. We develop a model for the image plane movement that can be used to compensate for distortions in image segmentation during a camera's warm-up period. Finally, we provide further experimental results confirming the applicability of our method. The paper is organized as follows. Section 2 describes the experimental setup and the image segmentation methods from which we observed the warm-up drift. Section 3 presents our model of warm-up drift and a way to calibrate the relevant parameters; furthermore, a procedure is described to compensate for the image drift that fits easily into the widely used distortion correction models. Section 4 provides further experiments with different types of cameras.
2 Observing Warm-Up Drift

To analyze the impact of temperature change after camera start-up, a planar test field consisting of 48 white circular targets printed on a black metal plate is mounted in front of a camera (equipped with a 640 × 480 CMOS sensor). The test pattern is arranged to cover the entire field of view of the camera, and the complete setup is rigidly fixed. The center points of the targets are initially segmented using a threshold technique, and the coordinates of the target centers are refined using the method described in [16]: for each target, the gray values along several rays starting at the initial center are sampled until a gray-value edge is detected, the position of the edge is refined to sub-pixel precision using moment preservation [17], and a circle is fitted to the sub-pixel edge points by least squares. The centers of the fitted circles are stored together with the time elapsed since camera start-up. The segmentation process is repeated continuously and stopped after approximately 45 minutes; at the same time, the temperature on the sensor board is measured. The resulting data have a temporal resolution of approximately three seconds. Since the relative position between the test pattern and the camera is fixed, the coordinates of the segmented target centers are not expected to vary systematically over time, except for noise. The results of the experiment are shown in Figures 1 and 2.
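The circle fit used for the target centers can be done with a simple algebraic least-squares formulation. The sketch below uses a Kåsa-style fit, which is our choice for illustration and not necessarily the exact method of [16].

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares circle fit (Kasa method).
    points: (N, 2) array of sub-pixel edge coordinates.
    Returns (cx, cy, radius)."""
    x, y = points[:, 0], points[:, 1]
    # Solve [2x 2y 1] [cx cy c]^T = x^2 + y^2 in the least-squares sense,
    # where c = r^2 - cx^2 - cy^2.
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    radius = np.sqrt(c + cx ** 2 + cy ** 2)
    return cx, cy, radius
```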
3 Modeling Warm-Up Drift

As one can see from Figs. 1 and 2, the temperature increase of the camera sensor board induces an optical flow. Our hypothesis is that this flow field results from a movement of the image plane caused by thermal expansion of the sensor board.
Fig. 1. Warm-up drift. (a): Measured temperature on the sensor board. (b): Total displacement from camera start-up until thermal equilibrium; the lengths of the arrows are scaled by a factor of 150. (c): Gray-value change for a sampled line. The red curve shows the sampled gray values immediately after start-up and the blue curve after thermal stabilization.
Fig. 2. Coordinate displacement. Coordinate changes of the top left target (a) and the bottom right one (b).
The CTE (coefficient of thermal expansion) of FR-4, the standard material used for printed circuit boards, can take values up to 150–200 ppm/K. Assuming a temperature increase of 10 to 20 K in the immediate vicinity of the sensor chip, one would expect a thermal expansion, and thus a displacement of the camera sensor, of some microns. Taking the widely used pinhole camera model to describe the imaging process, we can distinguish two cases:

– The thermal expansion of the sensor board affects only the image plane; the center of projection remains fixed.
– Both the image plane and the center of projection are displaced by the thermal expansion.

Both cases can be found in real cameras. In the first case, the objective is fixed to the camera housing and is thus not affected by the local temperature increase of the sensor board, since its distance to the board is relatively large. This configuration is typical for cameras equipped with C-mount objectives. In the second case, the lens holder of the objective is mounted directly on the circuit board, so an expansion of the board displaces the lens and therefore the center of projection. This configuration can be found in miniature camera modules used, e.g., in mobile phones. Mathematically, the two cases have to be treated separately. In the remaining sections we use the following notation for the mapping from 3D world space to 2D image space:

x = K [R | t] X,    (1)

where x = (x, y, 1)^T denotes the homogeneous image coordinates of the world point X, which is also given in homogeneous coordinates. The camera is described by its internal parameters

K = [ f_x  0    c_x
      0    f_y  c_y
      0    0    1   ].

The exterior orientation of the camera is given by the rotation R and the translation t (see [4] for details).

3.1 Fixed Center of Projection

If the center of projection remains fixed, the observed optical flow results from a movement of the image plane alone. In this case, the coordinate displacement can be described by a homography [4]. Let x(t_0) and x(t_1) denote the coordinates of the same target feature at time t_0, i.e. immediately after camera start-up, and at an arbitrary time t_1. Then

x(t_0) = K(t_0) [I | 0] X,
x(t_1) = K(t_1) [R(t_1) | 0] X = K(t_1) R(t_1) K^{-1}(t_0) (K(t_0) [I | 0] X) = K(t_1) R(t_1) K^{-1}(t_0) x(t_0),
so that x(t_1) = H(t_1) x(t_0) with the time-dependent homography

H(t_1) = K(t_1) R(t_1) K^{-1}(t_0).    (2)

Setting x̃ = K^{-1}(t_0) x(t_0), we get H̃(t_1) = K(t_1) R(t_1). Since H̃(t) is invertible, we can write

H̃^{-1}(t) = (K(t) R(t))^{-1} = R^{-1}(t) K^{-1}(t) = R^T(t) K^{-1}(t).    (3)
(4)
where the matrix W(t) is given by ⎞ 0 −l3 (t) l2 (t) 0 −l1 (t) ⎠ W(t) = ⎝ l3 (t) 0 −l2 (t) l1 (t) ⎛
(5)
The vector l is a unit vector and thus has two degrees of freedom. We can identify the rotation by the three component vector ˜l = ΔΩl. From ˜l we get ΔΩ = ˜l and ˜ ˜ becomes l = ˜ll . The homography H(t) ˜ H(t) = (K(t0 ) + ΔK(t))R(t)
(6)
where ΔK(t) denotes the time dependent offset to the original camera parameters and is given by ⎞ ⎛ 0 Δcx (t) Δfx (t) Δfy (t) Δcy (t) ⎠ ΔK(t) = ⎝ 0 (7) 0 0 1 ˜ is determined by seven time dependent parameters, namely Δfx (t), Δfy (t), Thus, H(t) Δcx (t), Δcy (t), the changes of the internal camera parameters and l˜1 (t), l˜2 (t), l˜3 (t), the external orientation. Motivated by the results of our empirical studies (see section 4 for further details) we choose bi-exponential functions for the time dependent parameters: f (t) = a0 + a1 e−k1 t − (a0 + a1 )e−k2 t
(8)
The parameterization of f (t) is chosen in such a way that f (0) = 0 and thus H(0) = I. Since f (t) is determined by the four parameters a0 , a1 , k1 , k2 and we have seven time ˜ dependent parameters for H(t) the complete warm-up model comprises 28 parameters. In section 4 it is shown that the total number of parameters can be reduced in praxis.
Analyzing the Influences of Camera Warm-Up Effects
263
3.2 Moving Center of Projection In this case we use the simplifying assumption that the center of projection and the image plane are equally translated, i.e. the internal parameters of the imaging device remain constant during the warm-up period. This assumption will later be justified empirically. Then, we get the following relations x(t0 ) = K [I|0] X x(t1 ) = K [R(t1 )|t(t1 )] X Since the observed targets lie on a plane the image coordinate changes can again be described by a homography (see [3] for a strict treatment). x(t1 ) = K [r1 (t1 ) r2 (t1 ) t(t1 )] x(t0 )
(9)
where ri (t) denotes the i-th column of R(t). Thus, we get H(t) = K [R(t)|t(t)]. Given the homography H(t) the external parameters can be computed as follows [3] r1 = λK−1 h1 r2 = λK−1 h2 r3 = r1 × r2 t = λK−1 h3 with λ = 1/ K−1 h1 . Using the axis angle notation for the rotation R(t) we get six temporal dependent parameters, namely the three rotation parameters ˜l1 (t), ˜l2 (t), ˜l3 (t) as well as the three translational parameters t1 (t), t2 (t), t3 (t). Again, we use biexponential terms to describe the temporal behaviour of the parameter values. Thus, we have 24 parameters. 3.3 Warm-Up Model Calibration In the previous sections we have shown how to model the coordinate displacement of segmented image features during camera warm-up. We now outline an algorithm to calibrate the parameters of the models: 1. Determine the internal camera parameters using a method described in [3] or [2] based on a few images taken immediately after camera start-up. The obtained values are used for K(0) or K respectively. 2. Collect image coordinates by continuously segmenting target center points. 3. For each segmented image determine the homography H(t) (see [4], [19]) 4. Use a factorization method described in section 3.1 or 3.2 depending on the type of camera to obtain values for the internal/external parameters. 5. Fit a bi-exponential function to the values of each camera parameter. 6. Perform a non-linear least squares optimization over all 28(24) parameters minimizing the following expression M N
xj (ti ) − H(ti ; β)xj (0) 2
(10)
j=1 i=1
where M denotes the number of feature points and β the current parameter vector.
264
H. Handel Focal length shift [pixel] 0.3
Principal point shift [pixel] 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
delta f
0.25 0.2 0.15 0.1 0.05
delta cx delta cy
0 10
20
30 40 time [min]
50
60
70
10
20
30 40 time [min]
50
60
70
(a) Rotation 0.002
x-axis y-axis z-axis
0.0015 0.001 0.0005 0 -0.0005 -0.001 10
20
30
40
50
60
70
time [min]
(b) Fig. 3. Estimated internal and external parameters of the SonyFCB-EX780BP over time x-coordinate drift
x-coordinate drift
point drift
56.6 56.4 56.2 56 55.8 55.6
52.4
y-coordinate [pixel]
x-coordinate [pixel]
x-coordinate [pixel]
57 56.8
52.2 52 51.8 51.6
0
10
20
30
40
50
60
0
10
20
time [min]
30
40
50
52.4 52.2 52 51.8 51.6 55.6 55.8 56 56.2 56.4 56.6 56.8 57
60
time [min]
x-coordinate [pixel]
(a) x-coordinate drift
y-coordinate drift
point drift
56.6 56.4 56.2 56 55.8 55.6
52.4
y-coordinate [pixel]
y-coordinate [pixel]
x-coordinate [pixel]
57 56.8
52.2 52 51.8 51.6
0
10
20
30
40
50
60
70
0
time [min]
10
20
30
40
time [min]
50
60
70
52.4 52.2 52 51.8 51.6 55.6 55.8 56 56.2 56.4 56.6 56.8 57 x-coordinate [pixel]
(b) Fig. 4. Application of warm-up calibration to the SonyFCB-EX780BP. Fig. 4(a) shows the results of the drift calibration. The red curve depicts the segmented image coordinates and the blue one the ideal trajectory according to the calibrated drift-model. Fig. 4(b) shows the results of the drift compensation.
3.4 Warm-Up Drift Compensation

With a calibrated warm-up model we can account for the influence of sensor warm-up on the imaging process.
Fig. 5. Estimated external camera parameters of the VRmagic-C3 over time.
(b) Fig. 6. Fig. 6(a) and 6(b) show the predicted trajectories of the point coordinates (blue) compared to the observed ones (red)
For cameras whose center of projection remains fixed, the image coordinate correction is straightforward. Given observed image coordinates x_o at time t after camera start-up, the undistorted image coordinates x_u can be computed by multiplying with the inverse of H(t):

x_u = H^{-1}(t) x_o.    (11)
This correction is independent of the structure of the scene, i.e. of the distance of the observed world point from the camera. Fig. 4(b) shows the results of this drift correction. In the second case, where the center of projection is not fixed, a direct correction of the image coordinates is not possible, since the image displacement of an observed feature point depends on its position in the scene. In this case, the drift model can only be applied within reconstruction algorithms, where the position of the camera is corrected accordingly.
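For the fixed-center case, Eq. (11) amounts to applying the inverse of the predicted warm-up homography to every measured point; a small sketch (our illustration, not the author's code) is given below.

```python
import numpy as np

def correct_points(points, H_t):
    """Undo the warm-up drift for observed pixel coordinates (Eq. 11).
    points: (N, 2) observed coordinates x_o at time t,
    H_t:    3 x 3 warm-up homography H(t) predicted by the calibrated model."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])   # to homogeneous
    corrected = (np.linalg.inv(H_t) @ pts_h.T).T
    return corrected[:, :2] / corrected[:, 2:3]               # back to pixels
```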
Table 1. Camera motion parameters and duration until thermal equilibrium for a single camera (VRmagic-C3, CMOS)
#   Translation tx  ty         tz         Rotation lx  ly        lz        ΔΩ        Time T99 [min]  Residuals σ²
1   0.024286        -0.004156  -0.037418  0.412771     0.458647  0.783463  0.011553  19.62           0.000108
2   0.022227        -0.004706  -0.039590  0.445256     0.487992  0.748210  0.012630  18.08           0.000465
3   0.023420        -0.004448  -0.034522  0.377107     0.426833  0.810767  0.010667  19.37           0.000114
4   0.018973        -0.004655  -0.033378  0.328780     0.352122  0.786722  0.009569  17.16           0.000112
5   0.022384        -0.004380  -0.040338  0.460083     0.482951  0.745038  0.013023  18.72           0.000100

Fig. 7. Reconstruction of the position of the image plane for the VRmagic-C3 after 0 s, 100 s, 600 s and 1000 s since camera start-up. The axis units are mm. The depicted range in x- and y-direction corresponds roughly to the dimensions of the active sensor chip area.
4 Experimental Results
This section presents experimental studies which justify the applicability of the proposed warm-up model. The experiment described in Section 2 is conducted for two different types of cameras. The first is a VRmagic-C3, a miniature-sized camera whose lens is directly mounted on the circuit board; it is equipped with a CMOS-based active pixel sensor. The other camera is a SonyFCB-EX780BP, a CCD-based camera whose objective is not directly connected to the sensor circuit board. The initially estimated motion parameters are shown in Figures 3(a)-3(b) and 5, respectively. As the figures show, our choice of a bi-exponential function to describe the temporal dependence of the camera parameters seems reasonable. Furthermore, one can see that for some camera parameters a simple exponential term is sufficient, reducing the total number of parameters. Figures 4(a) and 6(a) show the applicability of the chosen models
to explain the observed image displacement. Figs. 4(b) and 6(b) show the results of the drift correction described in Section 3.4. In a second experiment we examine the repeatability of the calibration. The drift model is repeatedly calibrated for one camera; the data has been collected over several weeks. Table 1 shows the resulting parameters. The table contains the values of the motion parameters when the camera reaches thermal equilibrium. The column T99 denotes the time in minutes until the drift displacement reaches 99% of its final amount. The results show that the warm-up behaviour is characteristic of a specific device. Finally, Fig. 7 shows a reconstruction of the image plane during the camera warm-up period.
5 Conclusion
We have presented a study of the impact of camera warm-up on the coordinates of segmented image features. Based on experimental observations we have developed a model for image drift and a way to compensate for it. Once the warm-up model is calibrated for a specific camera, we can use the parameters for drift compensation. The formulation of our displacement correction fits well into the projective framework widely used in the computer vision community; thus, the standard camera models used in computer vision can easily be extended to account for warm-up effects. Further experimental evaluations have shown that our warm-up model is in principle applicable to all kinds of digital cameras and, additionally, that the warm-up behaviour is characteristic of a specific camera. In the future we plan to use cameras with an on-board temperature sensor to get direct access to the camera's temperature. The formulation of our model presented here is based on the time elapsed since camera start-up, assuming that the temperature always develops similarly. A direct measurement of the temperature instead of time will probably increase accuracy further.
References 1. Brown, D.C.: Close-range camera calibration. Photogrammetric Engineering 37(8), 855–866 (1971) 2. Tsai, R.Y.: A versatile camera calibration technique for 3d machine vision. IEEE Journal for Robotics & Automation RA-3(4), 323–344 (1987) 3. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000) 4. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 5. Weng, J., Cohen, P., Herniou, M.: Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10), 965– 980 (1992) 6. El-Melegy, M., Farag, A.: Nonmetric lens distortion calibration: Closed-form solutions, robust estimation and model selection. In: Proceedings ICCV, pp. 554–559 (2003) 7. Devernay, F., Faugeras, O.: Straight lines have to be straight. MVA 13(1), 14–24 (2001) 8. Fraser, C.S., Shortis, M.R., Ganci, G.: Multi-sensor system self-calibration. In: SPIE Proceedings, vol. 2598, pp. 2–15 (1995)
9. Healey, G., Kondepudy, R.: Radiometric ccd camera calibration and noise estimation. PAMI 16(3), 267–276 (1994) 10. Clarke, T.A.: A frame grabber related error in subpixel target location. The Photogrammetric Record 15(86), 315–322 (1995) 11. Ortiz, A., Oliver, G.: Radiometric calibration of ccd sensors: dark current and fixed pattern noise estimation. IEEE International Conference on Robotics and Automation 5, 4730–4735 (2004) 12. Beyer, H.A.: Geometric and radiometric analysis of a ccd-camera based photogrammetric close-range system. Mitteilungen Nr. 51 (1992) 13. Wong, K.W., Lew, M., Ke, Y.: Experience with two vision systems. Close Range Photogrammetry meets machine vision 1395, 3–7 (1990) 14. Robson, S., Clarke, T.A., Chen, J.: Suitability of the pulnix tm6cn ccd camera for photogrammetric measurement. SPIE Proceedings, Videometrics II 2067, 66–77 (1993) 15. Seto, E., Sela, G., McIlroy, W.E., Black, S.E., Staines, W.R., Bronskill, M.J., McIntosh, A.R., Graham, S.J.: Quantifying head motion associated with motor tasks used in fmri. NeuroImage 14, 284–297 (2001) 16. F¨orstner, W., G¨ulch, E.: A fast operator for detection and precise location of distinct points, corners and centers of circular features. ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data (1987) 17. Tabatabai, A.J., Mitchell, O.R.: Edge location to subpixel values in digital imagery. IEEE transactions on pattern analysis and machine intelligence 6(2), 188–201 (1984) 18. Golub, G.H., VanLoan, C.F.: Matrix computations. Johns Hopkins University Press, Baltimore, MD (1997) 19. Kanatani, K.: Geometric Computation for Machine Vision. Oxford University Press, Oxford, UK (1993)
Simultaneous Plane Extraction and 2D Homography Estimation Using Local Feature Transformations Ouk Choi, Hyeongwoo Kim, and In So Kweon Korea Advanced Institute of Science and Technology
Abstract. In this paper, we use local feature transformations estimated in the matching process as initial seeds for 2D homography estimation. The number of testing hypotheses is equal to the number of matches, naturally enabling a full search over the hypothesis space. Using this property, we develop an iterative algorithm that clusters the matches under the common 2D homography into one group, i.e., features on a common plane. Our clustering algorithm is less affected by the proportion of inliers and as few as two features on the common plane can be clustered together; thus, the algorithm robustly detects multiple dominant scene planes. The knowledge of the dominant planes is used for robust fundamental matrix computation in the presence of quasi-degenerate data.
1 Introduction
Recent advances in local feature detection have achieved affine/scale covariance of the detected region according to the varying viewpoint [11][12][13][8]. In the description phase of the detected region, geometric invariance to affinity or similarity is achieved by explicitly or implicitly transforming the detected region to the standard normalized region. (See Fig. 1.) Statistics robust to varying illumination or small positional perturbation are extracted from the normalized region and are used in the matching phase. After the tentative matching of the normalized regions, not only the matching feature coordinates but also the feature transformations from image to image are given. In this paper, we are interested in further exploiting the local feature transformations for simultaneously extracting scene planes and estimating the induced 2D homographies. Local feature matches have been treated as point-to-point correspondences, not as region matches with feature transformations, in the literature on 2D homography or fundamental matrix computation [1][2][6][5][3][4]. The main reason is that non-covariant local features (e.g., single-scale Harris corners) are less elaborately described (e.g., the template itself) so that the affine/similar transformation is not uniquely determined. Even the approaches that use affine/scale covariant local features [12][7] do not utilize the feature transformations thoroughly. Some approaches propagate local feature matches into the neighborhood regions for simultaneous object recognition and segmentation [9][10]. These approaches use the feature transformation elaborately and show that a few matches Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 269–278, 2007. c Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Affine covariant regions are detected in each image independently and each region is transformed to the normalized region. The normalized regions are matched using statistics robust to varying illumination or small positional perturbation. Cyan-colored rectangles represent the features that are tentatively matched. Each tentative match provides not only the matching feature coordinates but also the feature transformations from image to image.
[9], or even only one [10], can be grown over a large portion of the object. Inspired by these approaches, we propagate the feature transformation of one match to the other matches and update the feature transformation iteratively so that the 2D homography of a plane can be estimated; as well, the local features on the common plane can be grouped together. The main difference between our approach and those mentioned is that we are more concerned about sparse scene geometry rather than the verification of a single or a few matches by gathering more evidence in a dense neighborhood. Our approach, however, can also be interpreted as a match verification process, because we group the local feature matches using 2D homography, as Lowe does in [12]. The difference between our algorithm and Lowe’s is that we use the feature transformation directly rather than the clustering with the Hough transform. The main advantage is that our algorithm does not require the labor of determining the resolution of the Hough bins. In Section 2, we develop a simple algorithm that simultaneously groups the coplanar features and estimates the 2D homography. In Section 3, the knowledge of the detected dominant planes is used to develop an importance sampling procedure for robust fundamental matrix computation in the presence of quasidegenerate data [4]. We show some experimental results for both plane extraction and fundamental matrix computation in Section 4. Finally, some advantages and limitations of our approach and future work are discussed in Section 5.
2 Simultaneous Plane Extraction and 2D Homography Estimation
We characterize a feature match as a set of the center coordinates x_{1i}, x_{2i} of the matching regions, the feature transformations H_i, H_i^{-1}, and a membership variable h_i (see Fig. 1 for details):

m_i = { x_{1i}, x_{2i}, H_i, H_i^{-1}, h_i },   (1)

where

H_i = T_{1i} T_{2i}^{-1},   H_i^{-1} = T_{2i} T_{1i}^{-1},   (2)
T_{1i} and T_{2i} are estimated as described by Mikolajczyk et al. in [8]. In this section, we develop an algorithm that determines h_i and updates H_i, H_i^{-1} for each match m_i. Feature transformations differ in quality: under some transformations, many features are transformed with small residual errors, while under other transformations, few features are. Our algorithm, described in Fig. 2, selects good transformations with high dominance scores defined as (3); therefore, the algorithm works in the presence of a number of erroneous initial feature transformations, on the assumption that there exists at least one good initial seed for each dominant plane. It is important to note that a match already contains a transformation, in other words a homography, so that the algorithm does not require three or four true matches for constructing a hypothesis. This fact enables as few as two matches to be clustered together once one match is dominated by the other. We define ε-dominance to assess the quality of transformations in a systematic way.

Definition 1. m_i dominates m_j if ||x_{2j} − H_i(x_{1j})|| < ε and ||x_{1j} − H_i^{-1}(x_{2j})|| < ε, where H(x) denotes the transformed Euclidean coordinates of x by H.

The predicate in Def. 1 is identical to the inlier test in RANSAC (RANdom SAmple Consensus) approaches [1][5] with the maximum allowable residual error ε. Our algorithm can be considered a deterministic version of the RANSAC algorithm, where random sampling is replaced by a full search over the hypothesis space. Nothing prevents us from modifying our algorithm to be random; however, the small number of hypotheses, equal to the number of matches, attracts us to choose the full search. The dominance score n_i is defined as

n_i = \sum_{j=1}^{N} d_{ij},   (3)
where d_{ij} = 1 when m_i dominates m_j, and 0 otherwise. The dominance score n_i is the number of features that are transformed by H_i with a residual error less than ε. Once the n_i and d_{ij} values are calculated, the membership h_i is determined in such a way that the m_k with the largest dominance score collects its inliers first, and the collected inliers are omitted from the set of matches for the detection of the next dominant plane. We do not cover the model selection issue [6] thoroughly in this paper. The update of H_k, H_k^{-1} in Fig. 2 is simply decided by the number of inliers and the residual error. When the number of inliers is less than three, we do not update the feature transformation and the match is discarded from gathering
Input: M_0 = {m_i | i = 1...N}. Output: updated h_i, H_i, H_i^{-1}, and n_i for each m_i.
Variables:
- it: an iteration number.
- t_i: an auxiliary variable that temporarily stores h_i for each iteration.
- M: a set of matches whose membership is yet to be determined.
- The other variables: explained in the text.
1. Initialize it, h_i: it ← 0; for i = 1...N, {h_i ← i}.
2. Initialize M, t_i, n_i, d_ij: M ← M_0; for i = 1...N, {t_i ← 0}; for i = 1...N, {for j = 1...N, {calculate d_ij}}; for i = 1...N, {calculate n_i}.
3. k = argmax_{i: m_i ∈ M} n_i.
4. If n_k = 0, {go to 10}.
5. For i: m_i ∈ M, {if d_ki = 1 and t_i = 0, {t_i ← h_i ← k}}.
6. Update H_k, H_k^{-1}.
7. M ← M − {m_i | t_i ≠ 0}.
8. Update n_i, d_ij: for i: m_i ∈ M, {for j: m_j ∈ M, {calculate d_ij}}; for i: m_i ∈ M, {calculate n_i}.
9. n_k ← 0. Go to 3.
10. For i = 1...N, {if h_i has changed, {it ← it + 1, go to 2}}.
11. Finalize: for i = 1...N, {H_i = H_{h_i}, H_i^{-1} = H_{h_i}^{-1}, n_i = \sum_{j: h_j = i} 1}.
Fig. 2. Algorithm sketch of the proposed simultaneous plane detection and 2D homography estimation technique. The symbol ‘←’ was used to denote replacement.
more inliers. An affine transformation is calculated when the number of inliers is more than two, and a projective transformation is also calculated when the number of inliers is more than three. Three transformations compete with one another for each m_k: the original transformation, and the newly calculated affine and projective transformations. The transformation with the minimum cost is finally selected at each iteration. The cost is defined as

f(H_k) = \sum_{i: h_i = k} ||x_{2i} − H_k(x_{1i})||^2 + ||x_{1i} − H_k^{-1}(x_{2i})||^2.   (4)
The cost is minimized with the Levenberg–Marquardt algorithm [14][15], using the original H_k as the initial solution. Refer to [16] for more statistically sound homography estimation.
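A minimal sketch (not from the paper) of the ε-dominance test of Definition 1 and the dominance score of Eq. (3); the match container and its key names are hypothetical, and 'H' is taken to map image-1 coordinates to image-2 coordinates as in Definition 1:

```python
import numpy as np

def apply_H(H, pts):
    """Apply a 3x3 homography to (N, 2) Euclidean points."""
    ph = np.c_[pts, np.ones(len(pts))]
    q = (H @ ph.T).T
    return q[:, :2] / q[:, 2:3]

def dominance(matches, eps):
    """d[i, j] = 1 if match i epsilon-dominates match j (Def. 1); n[i] is Eq. (3).
    matches: list of dicts with keys 'x1', 'x2' (2-vectors), 'H', 'Hinv' (3x3)."""
    N = len(matches)
    x1 = np.array([m['x1'] for m in matches])
    x2 = np.array([m['x2'] for m in matches])
    d = np.zeros((N, N), dtype=int)
    for i, m in enumerate(matches):
        e_fwd = np.linalg.norm(x2 - apply_H(m['H'], x1), axis=1)
        e_bwd = np.linalg.norm(x1 - apply_H(m['Hinv'], x2), axis=1)
        d[i] = (e_fwd < eps) & (e_bwd < eps)
    return d, d.sum(axis=1)
```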
Fig. 3. Plane detection results with ε = 3.0: (a) it = 0; (b) it = 2. (a) shows the plane detection result at it = 0. The same colored features were detected to be on the common plane. For visibility, planes with less than three features are not displayed. The dominance score decreases in the order of red, green, blue, yellow, purple. The algorithm converges very quickly at it = 2 (see (b)). The main reason is that the scene is fully affine, so our initial seeds can collect many inliers in the 0th iteration.
Figure 3 shows the clustered features using our algorithm. One obvious plane on top of the ice cream box was not detected for lack of matching features because of severe distortion (see also Fig. 1). The front part of the detergent box (Calgon) could not be clustered because of the slight slant of the top. The total number of dominance calculations is O(N^2 × K × it) and the H_k, H_k^{-1} update is O(K × it), where N is the total number of matches, K is the average number of planes at each iteration and it is the total number of iterations until convergence. The former is the main source of computational complexity, which is equivalent to testing N × it hypotheses for each plane extraction in conventional RANSAC procedures. More experiments using various scenes are described in Section 4.
3 Importance-Driven RANSAC for Robust Fundamental Matrix Computation
The fundamental matrix can be calculated using the normalized eight-point algorithm. This algorithm requires eight perfect matches. This number is larger than the minimum number required for the fundamental matrix computation. However, we use the eight-point algorithm because it produces a unique solution for general configurations of the eight matches. The main degeneracy occurs when there is a dominant plane and more than five points are sampled from it [5]. The knowledge of the detected planes can be used to develop an importance sampling procedure that avoids the obvious degenerate conditions, i.e., more than five points on the detected common plane. Any feature match that is not grouped with other matches can be considered either a mismatch or a valuable true match off the planes, and a plane that includes many feature matches is highly likely to be a true scene plane, for which the included matches are also likely to be true. To balance between avoiding
mismatches and degeneracy, we propose an importance sampling method that first selects a plane according to its importance

I_k = min(n_k, n),   (5)

where I_k is the sampling importance of the k-th plane, n_k is the dominance score, and n is a user-defined threshold value. Once the plane is determined, a match on the plane is sampled with uniform importance. This sampling procedure becomes equivalent to the standard RANSAC algorithm [1][5] when n = ∞, i.e., I_k = n_k. The sampling importance decreases for dominant planes that have more than n inliers; n is typically set to five in the experiments in Section 4. The final sample set with eight matches is discarded if any six or more matches are sampled from a common plane. The following cost is minimized for each eight-match sample set; the F with the largest number of inliers is chosen as the most probable relation:

g(F) = \sum_{m_i ∈ S_8} (\hat{x}_{2i}^T F \hat{x}_{1i})^2.   (6)

The inlier test is based on the following predicate:

m_i ∈ S_in if d_⊥(x_{2i}, \hat{x}^T F \hat{x}_{1i} = 0) < ε and d_⊥(x_{1i}, \hat{x}^T F^T \hat{x}_{2i} = 0) < ε,   (7)

where S_in is the set of inliers, \hat{x} is used to denote homogeneous coordinates, and d_⊥ is the distance from a point to a line in the line-normal direction. Refer to [17] for more statistically sound fundamental matrix estimation.
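As a rough illustration (not the authors' code), one eight-match draw under the importance of Eq. (5) with the six-from-one-plane rejection might look as follows; `plane_ids` and `n_scores` are assumed to come from the plane extraction of Section 2, and duplicate picks are not prevented in this simple sketch:

```python
import numpy as np

def sample_eight(plane_ids, n_scores, n_cap=5, rng=np.random.default_rng()):
    """Draw one 8-match sample: pick a plane with importance I_k = min(n_k, n_cap),
    then a match on it uniformly; reject sets with six or more matches on one plane.
    plane_ids[i]: plane label of match i (0..K-1); n_scores[k]: dominance score of plane k."""
    plane_ids = np.asarray(plane_ids)
    importance = np.minimum(np.asarray(n_scores, dtype=float), n_cap)
    importance /= importance.sum()
    while True:
        sample = []
        for _ in range(8):
            k = rng.choice(len(n_scores), p=importance)       # choose a plane
            members = np.flatnonzero(plane_ids == k)           # matches on that plane
            sample.append(rng.choice(members))                 # uniform within the plane
        _, counts = np.unique(plane_ids[np.array(sample)], return_counts=True)
        if counts.max() <= 5:                                  # discard degenerate draws
            return np.array(sample)
```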
4 Experiments
In this section, we show some experimental results for various images, including repeating patterns and dominant planes. The images in Figures 1, 3 and 4 were adapted from [3]. The castle images that contain repeating patterns (Fig. 5) were downloaded from http://www.cs.unc.edu/˜marc/. We extracted two kinds of features in the feature extraction stage: maximally stable extremal regions (MSER) [3] and generalized robust invariant features (GRIF) [13]. MSERs are described using SIFT neighborhoods [12] and GRIFs are described using both SIFT and a hue histogram. The feature pairs were classified as tentative matches if the distance between the description vectors was smaller than 0.4 and the normalized cross correlation between the normalized regions was larger than 0.6 in all the experiments in this paper. ε is the only free parameter in our plane extraction algorithm; it is the maximum allowable error in model fitting. Large ε tends to produce a smaller number of planes with a large number of inliers. Dominant planes invade nearby or even distant planes when ε is large. The homography is rarely updated with small ε because the initial feature transformation fits the inliers very tightly. ε trades off between the accuracy of the homography and the robustness to the perturbed feature position. Figure 4 shows this trade-off. Good results were obtained in the
Fig. 4. Plane detection results with varying ε: (a) ε = 1; (b) ε = 5; (c) ε = 10; (d) ε = 20. See the text for details.
range of 2 < ε < 10. ε = 5 produced the best results in Fig. 4. We used ε = 5 for all the experiments in this paper, unless otherwise mentioned. Figure 5 shows the plane extraction and epipolar geometry estimation results on the image pairs with varying viewpoint. It is hard to find the true correspondences in these image pairs because many features are detected on the repeating pattern, e.g., the windows on the wall. The number of planes that humans can manually detect is six in this scene and our algorithm finds only one or two planes that can be regarded as correct. However, it is important to note that the most dominant plane was always correctly detected, because true matches on the distinctive pattern have grown their evidence over the matches on the repeating pattern. Moreover, the number of tentative matches is not enough to detect other planes, i.e., our algorithm does not miss a plane once a true feature correspondence lies on the plane.
8 m C8−m 717 C39 = 0.0059 ≈ 0.6%. 8 C756 m=3
(8)
We ran the standard random sampling procedure [1][5] 1000 times to count the number of occasions more than five matches were sampled from the detected dominant plane. The number of occasions was, unsurprisingly, 995, i.e.,
(a) Images with small viewpoint change
(b) Images with moderate viewpoint change
(c) Images with severe viewpoint change. Fig. 5. Experiments on varying the viewpoint and repeating patterns. The number of tentative matches decreases with increasing viewpoint change. Many tentative matches are mismatches because of the repeating patterns (windows). The most dominant two planes were correctly detected in (a) and (b) (red–green and red–yellow, respectively). Only one plane was correctly detected in (c) for lack of feature matches (red). The epipolar geometry was correctly estimated in (a) and (b) with d_⊥ of 0.97 and 0.99 pixels, respectively. The epipolar geometry could not be correctly estimated in (c) for lack of off-plane correct matches.
99.5% of cases were degenerate and only 0.5% were not degenerate cases, on the assumption that our dominant plane is correct, which is very close to the theoretical value. Our algorithm does not suffer from the degeneracy problem on the assumption that the detected planes are correct. Figure 6 shows the two representative solutions that were most frequently achieved during the proposed sampling procedure. It is clear that no more than five matches are sampled from the dominant plane.
(a) Detected dominant planes
(b) 751/756 inliers, d_⊥ = 0.52; (c) 751/756 inliers, d_⊥ = 0.58. Fig. 6. Epipolar geometry estimation in the presence of quasi-degenerate data. The two solutions (b) and (c), which are most frequently achieved, show the effectiveness of our algorithm.
5 Conclusion
We developed a simultaneous plane extraction and homography estimation algorithm using local feature transformations that are already estimated in the matching process of the affine/scale covariant features. Our algorithm is deterministic and parallel in the sense that all the feature transformations compete with one another at each iteration. This property naturally enables the detection of multiple planes without an independent RANSAC procedure for each plane or the labor of determining the resolution of the Hough bins. Our algorithm always produces consistent results, and even two or three matches can be grouped together if one match dominates the others. The knowledge of the detected dominant planes is used to safely discard the degenerate samples, so that the fundamental matrix can be robustly computed in the presence of quasi-degenerate data. Our plane detection algorithm does not depend on the number of inliers, but on the number of supports that enable the original transformation to be updated. In further work, we hope to adapt more features (e.g., single-scale, affine/scale covariant Harris corners [8]) into our framework so that both the number of supports and the number of competing hypotheses can be increased. It is clear that our algorithm can be applied to sparse or dense motion segmentation [6][7]. Dense motion segmentation is another area of future research.
Acknowledgement This work was supported by the IT R&D program of MIC/IITA [2006-S-02801, Development of Cooperative Network-based Humanoids Technology] and the Korean MOST for NRL program [Grant number M1-0302-00-0064].
References 1. Fishler, M.A., Bolles, R.C.: Random sampling Consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981) 2. Torr, P.H., Zisserman, A.: Robust Computation and Parameterization of Multiple View Relations. In: ICCV (1998) 3. Chum, O., Werner, T., Matas, J.: Two-view Geometry Estimation Unaffected by a Dominant Plane. In: CVPR (2005) 4. Frahm, J., Pollefeys, M.: RANSAC for (Quasi-)Degenerate data (QDEGSAC). In: CVPR (2006) 5. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 6. Torr, P.H.S.: Geometric Motion Segmentation and Model Selection. Philosophical Transactions of the Royal Society, pp. 1321–1340 (1998) 7. Bhat, P., Zheng, K.C., Snavely, N., Agarwala, A., Agrawala, M., Cohen, M.F., Curless, B.: Piecewise Image Registration in the Presence of Multiple Large Motions. In: CVPR (2006) 8. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A Comparison of Affine Region Detectors. IJCV 65(1-2), 43–72 (2005) 9. Ferrari, V., Tuytelaars, T., Van Gool, L.: Simultaneous Object Recognition and Segmentation from Single or Multiple Model Views. IJCV 67(2), 159–188 (2006) 10. Vedaldi, A., Soatto, S.: Local Features, All Grown Up. In: CVPR (2006) 11. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In: BMVC, pp. 384–393 (2002) 12. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV 60(2), 91–110 (2004) 13. Kim, S.H., Yoon, K.J., Kweon, I.S.: Object Recognition using Generalized Robust Invariant Feature and Gestalt Law of Proximity and Similarity. In: CVPR 2006. 5th IEEE Workshop on Perceptual Organization in Computer Vision (2006) 14. Levenberg, K.: A Method for the Solution of Certain Problems in Least Squares. Quart. Appl. Math. 2, 164–168 (1944) 15. Marquardt, D.: An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM J. Appl. Math. 11, 431–441 (1963) 16. Kanatani, K., Ohta, N., Kanazawa, Y.: Optimal Homography Computation with a Reliability Measure. IEICE Trans. Information Systems E83-D(7), 1369–1374 (2000) 17. Kanatani, K., Sugaya, Y.: High Accuracy Fundamental Matrix Computation and its Performance Evaluation. In: BMVC (2006)
A Fast Optimal Algorithm for L2 Triangulation Fangfang Lu and Richard Hartley Australian National University
Abstract. This paper presents a practical method for obtaining the global minimum to the least-squares (L2 ) triangulation problem. Although optimal algorithms for the triangulation problem under L∞ -norm have been given, finding an optimal solution to the L2 triangulation problem is difficult. This is because the cost function under L2 -norm is not convex. Since there are no ideal techniques for initialization, traditional iterative methods that are sensitive to initialization may be trapped in local minima. A branch-and-bound algorithm was introduced in [1] for finding the optimal solution and it theoretically guarantees the global optimality within a chosen tolerance. However, this algorithm is complicated and too slow for large-scale use. In this paper, we propose a simpler branch-and-bound algorithm to approach the global estimate. Linear programming algorithms plus iterative techniques are all we need in implementing our method. Experiments on a large data set of 277,887 points show that it only takes on average 0.02s for each triangulation problem.
1 Introduction
The triangulation problem is one of the simplest geometric reconstruction problems. The task is to infer the 3D point, given a set of camera matrices and the corresponding image points. In the presence of noise, the correct procedure is to find the solution that reproduces the image points as closely as possible. In other words, we want to minimize the residual errors between the reprojected and measured image points. Notice that residual errors measured under different norms lead to different optimization problems. It is shown in [2] that a quasi-convex cost function arises, and thus a single local minimum exists, if we choose to calculate the residual errors under the L∞-norm. However, the problem with the residual errors measured under the L2-norm remains of primary interest. Suppose we want to recover a 3D point X = (x, y, z)^T. Let P_i (i = 1, 2, ..., n) denote a set of n camera matrices, u_i the corresponding image coordinates, and \hat{X} = (X; 1) the homogeneous coordinates of X. Under the L2-norm, we are led to solve the following optimization problem:
The second author is also affiliated with NICTA, a research institute funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
minimize C_2(X) = \sum_{i=1}^{n} d(u_i, P_i \hat{X})^2   subject to λ_i(X) > 0,   (1)
where d(·, ·) is the Euclidean metric and λ_i(X) is the depth of the point relative to image i. The optimal solution to (1) is not easy, since the cost function C_2(X) is non-convex; in the worst case, multiple local minima may occur. Iterative methods, so-called bundle adjustment methods [3], usually work well, though they are dependent on initialization. Both the L∞ solution [4] and the linear solution [5] are useful for initialization, but neither initialization can theoretically guarantee global optimality. A branch-and-bound algorithm was given in [1] and it provably finds the global optimum within any tolerance ε. However, the way of bounding the cost-function is complicated, and the computational cost is high, so it may not be usable for large-scale data sets. A simpler way of obtaining the global estimate by a branch-and-bound process is presented in this paper. The main feature is to obtain a lower bound for the cost-function in a bounding box. We use a much simpler convex lower bound for the cost than is used in [1]. Instead of using Second Order Cone Programming as in [1], we need only Linear Programming and simple iterative convex optimization (such as a Newton or Gauss-Newton method). This makes the implementation very easy. Experimental results show that the method can theoretically guarantee the global optimum within a small tolerance, and also that it works fast, taking on average 0.02 s for each triangulation problem. The only other known methods for finding a guaranteed optimal solution to the L2 triangulation problem are those given for the two-view ([6]) and the three-view problem ([7]), which involve finding the roots of a polynomial. For the two-view triangulation problem this involves the solution of a degree-6 polynomial, whereas for the 3-view problem the polynomial is of degree 43. This approach does not generalize in any useful manner to larger numbers of views.
2 Other Triangulation Methods
Various other methods for triangulation have been proposed, based on simple algebraic or geometric considerations. Two of the most successful are briefly discussed here.

2.1 Linear Triangulation Methods
For every image i the measurement u_i = P_i \hat{X} is given, where \hat{X} = (X; 1). Let p_i^1, p_i^2, p_i^3 be the rows of P_i and x_i, y_i the coordinates of u_i; then two linear equations are given as follows:
x_i p_i^3 \hat{X} − p_i^1 \hat{X} = 0,   y_i p_i^3 \hat{X} − p_i^2 \hat{X} = 0.
In all, a set of 2n linear equations is composed for computing X in the n-view triangulation problem. The linear least-squares solution to this set of equations
provides the linear solution to the triangulation problem. The minimum is denoted by X_l here. To be a little more precise, there are two slightly different methods that may be considered here. One may either set the last coordinate of \hat{X} to 1, as suggested above, and solve using a linear least-squares method. Alternatively, the last coordinate of \hat{X} may be treated as a variable; the optimal solution may then be found using Singular Value Decomposition (see [5], DLT method).

2.2 L∞ Framework
The L∞ formulation leads to the following problem:

minimize C_∞(X) = max_i d(u_i, P_i \hat{X})   subject to λ_i(X) > 0.   (2)
This is a quasi-convex optimization problem. A global solution is obtained using a bisection algorithm in which a convex feasibility problem is solved in each step. Please refer to [4] for details. We denote the minimum by X∞ in this paper. Other methods are mentioned in [6]. The most commonly used technique is to use some non-optimal solution such as the algebraic or (as more recently suggested) the optimal L∞ solution to initialize an iterative optimization routine such as Levenberg-Marquardt to find the L2 minimum. This method can not be guaranteed to find the optimal solution, however. The purpose of this paper is to give a method guaranteed to find the optimal solution.
3 Strategy

3.1 Branch and Bound Theory
Branch and Bound algorithms are classic methods for finding the global solution to non-convex optimization problems. By providing a provable lower or upper bound, which is usually a convex function to the objective cost-function, and a dividing scheme, the algorithms tend to achieve the global estimates within an arbitrary small tolerance. 3.2
Strategy
In this paper, our strategy is to find the L2 optimal point Xopt by a process of branch-and-bound on a bounding box B0 . We start with the bounding box and a best point Xmin found so far (X∞ plus local refinement will do in this paper). At each step of the branch and bound process, we consider a bounding box Bi obtained as part of a subdivision of B0 . A lower bound for the minimum of the cost function is estimated on Bi and compared with the best cost found so far, namely C2 (Xmin ). If the minimum cost on Bi exceeds C2 (Xmin ), then we abandon Bi and go on to the next box to be considered. If, on the other hand, the minimum cost on Bi is smaller than C2 (Xmin ), then we do two things: we
282
F. Lu and R. Hartley
evaluate the cost function at some point inside Bi to see if there is a better value for Xmin , and we subdivide the box Bi and repeat the above process with the subdivided boxes.
4
Process
In this paper, the Branch and Bound process is shown considering three aspects. Firstly an initial bounding box is computed. Secondly the bounding method is presented which means the provably lower or upper bound of the objective costfunction is calculated. Then the branching part, that is, a subdivision scheme is given. Since Branch and Bound algorithms can be slow, a branching strategy should be devised to save the computation cost. 4.1
Obtaining the Initial Bounding Box B0
We start by an initial estimate Xinit for the optimum point. If Xopt is the true L2 minimum, it follows that C2 (Xopt ) ≤ C2 (Xinit ). This may be written as: C2 (X) =
n
ˆ opt )2 ≤ C2 (Xinit ) d(ui , Pi X
(3)
i=1
ˆ opt )2 is ˆ opt = (Xopt ; 1). We can see that the sum of the values d(ui , Pi X with X ˆ opt )2 is less than this value. bounded by C2 (Xinit ), and this means each d(ui , Pi X ˆ opt ) ≤ δ for all i. This In particular, defining δ = C2 (Xinit ) we have d(ui , Pi X equation can also be written as
1X 2X ˆ opt 2 ˆ opt 2 p p i i ˆ opt ) = + yi − 3 ≤δ d(ui , Pi X xi − 3 ˆ opt ˆ opt pi X pi X which is satisfied for each i. This means the following two constraints are satisfied for each i. 1 ˆ 2 ˆ yi − pi Xopt ≤ δ xi − pi Xopt ≤ δ , ˆ ˆ 3 3 pi Xopt pi Xopt Notice for n-view triangulation, we have a total number of 4n linear constraints ˆ , formulated by multiplying both sides of the above constraint on the position X ˆ opt . equations with the depth term p3i X We wish to obtain bounds for Xopt = (xopt , yopt , zopt ) . That is to find xmin , xmax , ymin , ymax , zmin , zmax such that xmin ≤ xopt ≤ xmax , ymax ≤ yopt ≤ ymax , zmin ≤ zopt ≤ zmax . For each of them, we can formulate a linear programming (LP) problem by linearizing the constraints in the above equations. For instance, xmin is the smallest value of the x-coordinate of Xopt with respect to the following linear constraints. ˆ opt ≤ 0 , (xi p3i − p1i − δp3i )X 3 2 3 ˆ (yi pi − pi − δpi )Xopt ≤ 0 ,
ˆ opt ≤ 0 (−xi p3i + p1i − δp3i )X 3 2 3 ˆ (−yi pi + pi − δpi )Xopt ≤ 0 .
A Fast Optimal Algorithm for L2 Triangulation
283
The other bound values are found in the similar way. This process then provides an initial bounding box B0 for the optimal point Xopt . Note: In this paper we got the initial optimal point Xinit by local refinement from X∞ , It should be mentioned that any point X may be used instead of this Xinit for initialization. However, we would like to choose a relatively tight initial bounding box since it will reduce the computation complexity. Using the local minimum from X∞ would be a good scheme because it is likely to produce a relatively small value of the cost-function C2 , and hence a reasonably tight bound. 4.2
Bounding
Now we consider the problem of finding a minimum value for C2 (X) on a box B. We rewrite the L2 cost-function as follows: ˆ )2 d(ui , Pi X C2 (X) = i
ˆ 2 ˆ 2 p1i X p2i X = + yi − 3 xi − 3 ˆ ˆ pi X pi X i f 2 (X) + g 2 (X) i i = 2 (X) λ i i Here fi , gi , λi are linear functions in the coordinates of X. And the depths λi (X) can be assumed to be positive, by cheirality(see [5]). Note that for this we must choose the right sign for each Pi , namely that it is of the form Pi = [Mi |mi ] where det Mi > 0. It is observed in [4] that each of the functions fi2 (X) + gi2 (X) such that λi (X) > 0 λi (X) is convex. Now we define wi = maxX∈B λi (X), where B is the current bounding box. The value of each wi can be easily found using Linear Programming(LP). Then we may reason as follows, for any point X ∈ B, wi 1/wi fi2 (X) + gi2 (X) wi λi (X) f 2 (X) + g 2 (X) i i w λ ( X) i i i
≥ λi (X) ≤ 1/λi (X) f 2 (X) + gi2 (X) ≤ i λ2i (X) f 2 (X) + g 2 (X) i i ≤ = C2 (X) 2 (X) λ i i
(4)
However, the left-hand side of this expression is a sum of convex functions, and hence is convex. It is simple to find the minimum of this function on the box B; hence we obtain a lower bound L(X) = \sum_i (f_i^2(X) + g_i^2(X)) / (w_i λ_i(X)) for the cost function C_2(X).
BFGS algorithm: In this paper, we adopted the BFGS algorithm [8] to find the minimum of the convex function L(X) within the bounding box B. The BFGS algorithm is one of the main quasi-Newton methods for convex optimization problems [9]. It inherits good properties of the Newton method, such as a fast convergence rate, while avoiding the complexity of computing the Hessian. This significantly improves the computation speed.
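A minimal sketch of minimizing the lower bound L(X) over a box; the paper uses a plain BFGS iteration, whereas this illustration uses SciPy's L-BFGS-B so that the box constraints are handled directly. The names `P_list`, `uv`, `w`, `box` are assumptions, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def lower_bound_min(P_list, uv, w, box):
    """Minimize L(X) = sum_i (f_i^2 + g_i^2) / (w_i * lambda_i(X)) over the box,
    where w_i = max_{X in box} lambda_i(X) (found beforehand, e.g. by LP).
    box: 3x2 array of (min, max) per axis."""
    def L(X):
        Xh = np.append(X, 1.0)
        val = 0.0
        for P, (x, y), wi in zip(P_list, uv, w):
            lam = P[2] @ Xh                     # depth lambda_i(X)
            f = x * lam - P[0] @ Xh             # f_i(X), linear in X
            g = y * lam - P[1] @ Xh             # g_i(X), linear in X
            val += (f * f + g * g) / (wi * lam)
        return val
    x0 = box.mean(axis=1)                       # start at the box centroid
    res = minimize(L, x0, method='L-BFGS-B', bounds=[tuple(b) for b in box])
    return res.x, res.fun
```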
4.3 Branching
Given a box B_i, we first evaluate the lower bound of the cost. If the lower bound exceeds the best cost C_2(X_min) found so far, we abandon the box and go on to the next box. If, on the other hand, the lower bound is less than C_2(X_min), we evaluate the cost at some point in the current box. If this value is less than C_2(X_min), we update C_2(X_min) to the current minimum value and subdivide the box into two along its largest dimension. We repeat these steps until the dimension of the box approaches zero within a given tolerance. Note: how well the lower bound approximates the cost depends essentially on how closely max_{X∈B} λ_i(X) approximates the value of λ_i(X) for arbitrary points X in B. It is best if λ_i(X) does not vary much in B. Note that λ_i(X) is the depth of the point X with respect to the i-th camera. Thus it seems advantageous to choose the boxes to be shallow with respect to their depth from the cameras. This suggests that a more sophisticated scheme for subdividing boxes may be preferable to the simple scheme we use of subdividing along the largest dimension.
5 Proof of Optimality
The complete algorithm is given in Fig. 1. We have claimed that the method will find the optimal solution; that will be proved in this section. It will be assumed that the bounding box B_0 is finite, as in the description of the algorithm. Because we use a FIFO structure to hold the boxes, as the algorithm progresses, the size of the boxes B decreases towards zero. Note that at any time, the best result found so far gives an upper bound for the cost of the optimal solution. We will also define a lower bound as follows. At time j, just before removing the j-th box from the queue, define
l_j = min_{B ∈ Q} min_{X ∈ B} L(X).
It is clear that lj gives a lower bound for the optimal solution C2 (Xopt ), since L(X) < C2 (X) for all X. We will show two things.
Algorithm. Branch and Bound. Given an initial bounding box B_0, initial optimal point X_0 with value f_0 = C_2(X_0), and tolerance ε > 0.
1. Initialize Q, a FIFO queue of bounding boxes, with Q = {B_0}.
2. If Q is empty, terminate with globally optimal point X_0 and optimal value f_0.
3. Take the next box B from Q.
4. Compute the largest dimension l_max of box B.
5. If l_max < ε:
   – Set f_C = C_2(X_C) where X_C is the centroid of the box B.
   – If f_C < f_0, set f_0 = f_C and X_0 = X_C.
   – Go to 2.
6. Find the minimum of L(X) in B, denoted by X_L. Set f_L = L(X_L).
7. If f_L > f_0, go to 2.
8. Find the local minimum of C_2(X), denoted by X_C, with X_L as the initial point. Set f_C = C_2(X_C). If f_C < f_0, set f_0 = f_C and X_0 = X_C.
9. Subdivide B into two boxes B_1 and B_2 along the largest dimension, and insert B_1 and B_2 into Q.
10. Go to 2.

Fig. 1. Branch and bound algorithm for optimal L2 triangulation. Alternative stopping conditions are possible, as discussed in the text.
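As a rough illustration of the loop in Fig. 1 (not the authors' code), assuming helper callables `C2` (the L2 cost), `lower_bound_min` (Section 4.2, returning the minimizer and value of L on a box) and `refine` (a local minimizer of C2 started at X_L); all names are placeholders:

```python
from collections import deque
import numpy as np

def branch_and_bound(box0, X0, C2, lower_bound_min, refine, tol=1e-6):
    f0 = C2(X0)
    queue = deque([np.asarray(box0, dtype=float)])   # FIFO queue of 3x2 boxes
    while queue:
        box = queue.popleft()
        widths = box[:, 1] - box[:, 0]
        if widths.max() < tol:                        # small box: test its centroid
            Xc = box.mean(axis=1)
            if C2(Xc) < f0:
                f0, X0 = C2(Xc), Xc
            continue
        XL, fL = lower_bound_min(box)                 # convex lower bound on the box
        if fL > f0:                                   # box cannot contain a better optimum
            continue
        Xc = refine(XL)                               # local minimum of C2 started at X_L
        if C2(Xc) < f0:
            f0, X0 = C2(Xc), Xc
        k = int(np.argmax(widths))                    # split along the largest dimension
        mid = box[k].mean()
        b1, b2 = box.copy(), box.copy()
        b1[k, 1] = mid
        b2[k, 0] = mid
        queue.extend([b1, b2])
    return X0, f0
```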
1. The sequence of values for f0 set at line 8 of algorithm 1 is a decreasing sequence converging to C2 (Xopt ). 2. The sequence of values lj is an increasing sequence converging to C2 (Xopt ). Thus, the optimal value C2 (Xopt ) is sandwiched between two bounds, which can always be tested as the algorithm proceeds. The algorithm can be terminated when the difference between the bounds is small enough. We now proceed with the proof. Let δ > 0 be chosen. We will show that some value of f0 will be smaller than C2 (Xopt ) + δ, so the values taken by f0 converge to C2 (Xopt ). First, note that f0 will never be less than C2 (Xopt ) because of the way it is assigned at step 5 or 8 of the algorithm. The cost function C2 (X) is continuous on the box B0 , so the set {X ∈ B0 | C2 (X) < C2 (Xopt ) + δ} is an open set. Thus, there exists a ball S containing Xopt on which C2 takes values within δ of the optimal. Consider the sequence of boxes which would be generated in this algorithm if no boxes were eliminated at step 7. Since these boxes are decreasing in size, one of them B must lie inside the ball S. Thus C2 (X) < C2 (Xopt ) + δ on B . Note that no box that contains the point Xopt can be eliminated during the course of the branch-and-bound algorithm, so the box B must be one of the boxes Bj that will eventually be evaluated. Since box B can not be eliminated at step 7 of the algorithm, step 8 will be executed. The value fC found by optimizing starting in box Bj will result in a value less than C2 (Xopt ) + δ, since all points in Bj satisfy this condition. Thus, f0 will be assigned a value less than C2 (Xopt ) + δ, if it does not already satisfy this condition. This completes the proof that f0 converges to C2 (Xopt ).
Next, we prove that l_j converges to C_2(X_opt). As before, define w̄_i = max_{X∈B} λ_i(X). Also define w_i = min_{X∈B} λ_i(X). As the size of the boxes diminishes towards zero, the ratio w̄_i / w_i decreases towards 1. We denote by B_j the j-th box taken from the queue during the course of Algorithm 1. Then, for any ε > 0 there exists an integer value N such that w̄_i < (1 + ε) w_i for all i and all j > N. Now, using the same reasoning as in (4) with the directions of the inequalities reversed, we see that

\sum_i (f_i^2(X) + g_i^2(X)) / (w̄_i λ_i(X)) ≤ C_2(X) ≤ \sum_i (f_i^2(X) + g_i^2(X)) / (w_i λ_i(X)).

So, if j > N, then for any point X ∈ B_j,

L(X) ≤ C_2(X) ≤ (1 + ε) L(X).   (5)

We deduce from this, and the definition of l_j, that l_j < C_2(X_opt) ≤ (1 + ε) l_j if j > N. Thus, l_j converges to C_2(X_opt).
6 Experiments
We tested our experiments with a set of trials involving all the points in the “Notre Dame” data set([10]). This data set contains 277,887 points and 595 cameras, and involves triangulation problems involving from 2 up to over 100 images. We compared our optimal triangulation method with both L∞ and Linear methods of triangulation, as well as with iterative methods using the BFGS algorithm and Levenberg-Marquardt, initialized by the Linear and L∞ algorithms. The results of these experiments are shown in Figures 2 and 3. Although the experiments were run on all 277,887 points, only 30,000 points (randomly chosen) are shown in the following graphs, because of difficulties plotting 270,000 points. Synthetic Data. Suspecting that previous methods will have difficulty with points near infinity, we devised an experiment that tested this. Results are a little preliminary for this, but they appear to show that the “homogeneous” Linear method (see section 2.1) still works well, but the inhomogeneous SVD method fails a few percent of the time, and iteration will not recover the failure. 6.1
Timing
Our algorithm is coded in C++ and takes about 0.02 seconds per point to run on a standard desktop computer in C++. We were unable to evaluate the algorithm of [1] directly, and its speed is a little uncertain. The authors have claimed 6-seconds (unpublished work) for average timing, but we can not verify this. Their algorithm is in Matlab, but the main numerical processing is done in Sedumi (which is coded in C). The algorithm is substantially more complex, and it is unlikely that the difference between Matlab and C++ would account for the 300-times speed-up of our algorithm compared with theirs.
A Fast Optimal Algorithm for L2 Triangulation
287
Fig. 2. Plot of Linear (left) and L∞ (right) triangulation versus the optimal algorithm. The graph shows the results for 30,000 points randomly chosen from the full Notre Dame set. The top row shows plots of the difference in residual (in pixels) between the two methods, plotted agains the point number. The points sorted according to the difference in residual so as to give a smooth curve, from which one sees the quantitative difference between the optimal and Linear or L∞ algorithms more easily. The plot for the Linear data is truncated on the Y -axis; the maximum difference in residual in reality reaches 12 pixels. The second row presents the same results in a different way. It shows a scatter plot of the optimal versus the Linear or L∞ residuals. Note that the optimal algorithm always gives better results than the Linear or L∞ algorithms. (Specifically, the top row shows that the difference of residuals is always positive.) This is expected and only serves to verify the optimality of the algorithm. In addition, L∞ is seen to outperform the Linear method.
7
Conclusion
The key feature of the proposed method is that it guarantees global optimality with a reasonable computation cost. It can be applied to large-scale triangulation problems. Although the given experiments do show that traditional local methods also work very well on most occasions, the problem of depending on the initialization will always be a disadvantage. Our method may be still a little slow for some large-scale applications, but it does provide an essential benchmark against which other algorithms may be tested, to verify whether they are obtaining optimal results. On real examples where the triangulated points were at a great distance from the cameras, the Linear algorithm gave such poor results that iteration was unable to find the optimal solution in many cases. On the other hand conventional
288
F. Lu and R. Hartley
Fig. 3. Plot of Levenberg-Marquardt (LM) refined results versus the optimal algorithm. The stopping criterion for the LM algorithm was our default (relatively quick) stopping criterion, resulting in about three iterations. The optimal algorithm still does best, but only on a few examples. Note that for visibility, we only show the first 2000 (out of 30,000) sorted points. When the LM algorithm and BFGS algorithms were run for more iterations (from either Linear or L∞ initialization), the results were largely indistinguishable on this data set from the optimal results. Also shown is a snapshot of the dataset that we used.
iterative methods worked well on the Notre Dame data set, because most of the points were relatively close or triangulated from a wide baseline.
References 1. Agarval, S., Chandraker, M., Kahl, F., Belongie, S., Kriegman, D.: Practical global optimization for multiview geometry. In: Proc. European Conference on Computer Vision (2005) 2. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Washington DC, pp. I–504–509 (2004) 3. Triggs, W., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.: Bundle adjustment for structure from motion. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) Vision Algorithms: Theory and Practice. LNCS, vol. 1883, pp. 298–372. Springer, Heidelberg (2000) 4. Kahl, F.: Multiple view geometry and the L∞ -norm. In: Proc. International Conference on Computer Vision, pp. 1002–1009 (2005) 5. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003) 6. Hartley, R.I., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68(2), 146–157 (1997) 7. Stewenius, H., Schaffalitzky, F., Nister, D.: How hard is 3-view triangulation really. In: Proc. International Conference on Computer Vision, pp. 686–693 (2005) 8. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Oxford University Press, Oxford (2006) 9. Boyd, S., Vanderberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 10. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. ACM Trans on Graphics 25(3), 835–846 (2006)
Adaptively Determining Degrees of Implicit Polynomial Curves and Surfaces Bo Zheng, Jun Takamatsu, and Katsushi Ikeuchi Institute of Industrial Science, The University of Tokyo, Komaba 4-6-1, Meguro-ku, Tokyo, 153-8505 Japan
[email protected] Abstract. Fitting an implicit polynomial (IP) to a data set usually suffers from the difficulty of determining a moderate polynomial degree. An over-low degree leads to inaccuracy than one expects, whereas an overhigh degree leads to global instability. We propose a method based on automatically determining the moderate degree in an incremental fitting process through using QR decomposition. This incremental process is computationally efficient, since by reusing the calculation result from the previous step, the burden of calculation is dramatically reduced at the next step. Simultaneously, the fitting instabilities can be easily checked out by judging the eigenvalues of an upper triangular matrix from QR decomposition, since its diagonal elements are equal to the eigenvalues. Based on this beneficial property and combining it with Tasdizen’s ridge regression method, a new technique is also proposed for improving fitting stability.
1
Introduction
Recently representing 2D and 3D data sets with implicit polynomials (IPs) has been attractive for vision applications such as fast shape registration, pose estimation [1,2,3,4], recognition [5], smoothing and denoising, image compression [6], etc. In contrast to other function-based representations such as B-spline, Non-Uniform Rational B-Splines (NURBS), and radial basis function (RBF) [7], IPs are superior in the areas of fast fitting, few parameters, algebraic/geometric invariants, robustness against noise and occlusion, etc. A 3D IP function of degree n is defined as: aijk xi y j z k = (1 . . . z n)(a000 a100 . . . a00n )T , (1) fn (x) = x 0≤i,j,k;i+j+k≤n m(x)T a where x = (x y z) is a data point. fn (x)’s zero set {x|fn (x) = 0} is used to represent the given data set. The estimation of IP coefficients belongs to the conventional fitting problem, and various methods have been proposed [1,2,3,4,8]. These fitting methods cannot adaptively control the IP degrees for different object shapes; it is well known that simple shapes correspond to low-degree IPs whereas complicated shapes correspond to the higher ones. Fig.1 shows that Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 289–300, 2007. c Springer-Verlag Berlin Heidelberg 2007
290
B. Zheng, J. Takamatsu, and K. Ikeuchi
→
(a)
(b)
(c)
→
(c’)
(d)
(d’)
Fig. 1. IP fitting results: (a) Original data set; (b) 4-degree IP surface; (c) 8-degree IP surface; (c’) Stability-improved 8-degree IP surface; (d) 10-degree IP surface; (d’) Stability-improved 10-degree IP surface
when fitting an object like Fig.1(a), an over-low degree leads to the inaccuracy (Fig.1(b)), whereas an over-high degree leads to an unstable result: too many unexpected zero sets appear (see Fig.1(d)). This paper provides a solution to automatically find the moderate degree IP (such as Fig.1(c)). Another issue of IP fitting is that there may be collinearity in the covariant matrix derived from the least squares method, making them prone to instability [4], e.g., the fitting results shown in Fig.1(c) and (d). In order to address this issue we propose a method for automatically checking out this collinearity and improving it. And we also combine the Ridge Regression (RR) technique recently introduced by [4,9]. Fig.1(c’) and (d’) show the improved results of Fig.1(c) and (d) respectively, where the redundant zero sets are retired. Note although Fig.1 (d) is globally improved by our method to Fig.1 (d’), since there are too many redundant zero sets that need to be eliminated, the local accuracy is also affected very much. Therefore, we first aim at adaptively finding a moderate degree, and then applying our stability-improving method to obtain a moderate result (accurate both locally and globally). This paper is organized as follows: Section 2 gives a review of IP fitting methods; Section 3 and 4 provide an incremental method for fitting IP with moderate degrees; Section 5 discusses on how to improve the global stability based on our algorithm; Section 6 presents experimental results followed by discussion and conclusions in section 7 and 8.
2 Implicit Polynomial Fitting
In order to estimate the coefficient vector a in (1), a simple approach is to solve the linear system

M a = b,   (2)
where M is the matrix of monomials, and the ith row of M is m(xi ) (see (1)) and b is a zero vector. But generally M is not a square matrix, and the linear system is usually solved by the least squares method. Solutions to this least squares problem are classified into nonlinear methods [1,2,10] and linear methods [3,4,8,9]. Because the linear methods are
simpler and much faster than the nonlinear ones, they are preferable and can be formulated as:

M^T M a = M^T b.   (3)
Note that this formula is simply a rearrangement of the least squares result a = M† b, where M† = (M^T M)^{-1} M^T is called the pseudo-inverse matrix. Direct solution of the linear system is numerically unstable, since M^T M is nearly singular and b = 0, so a would have to be determined from the kernel basis. Fortunately, methods for improving the classical linear fitting (avoiding the ill-conditioned matrix M^T M in (3)) have already been proposed by adding constraints, such as the 3L method [3], the Gradient-One method [4], and the Min-Max method [8]. The singularity of M is improved and b is also derived as a nonzero vector. In the prior methods, the symmetric linear system (3) was solved by classical solvers such as Cholesky decomposition, the conjugate gradient method, singular value decomposition (SVD), and their variations. However, none of them allows changing the degree during the fitting procedure. This is the main reason why these prior methods require a fixed degree.
3 Incremental Fitting
This section shows the computational efficiency of the proposed incremental fitting method. Although the IP degree is gradually increased until a moderate fitting result is obtained, the computational cost remains low because each step can completely reuse the calculation results of the previous step. In this section, we first describe the method for fitting an IP with QR decomposition. Next, we show the incrementability of Gram-Schmidt QR decomposition. After that, we clarify the amount of calculation needed to increase the IP degree.

3.1 Fitting Using QR Decomposition
Instead of solving the linear system (3) directly, our method first carries out QR decomposition on the matrix M as M = Q_{N×m} R_{m×m}, where Q satisfies Q^T Q = I (I is the identity matrix), and R is an invertible upper triangular matrix. Then, substituting M = QR into (2), we obtain:

Q R a = b  ⟹  Q^T Q R a = Q^T b  ⟹  R a = Q^T b ≡ b̃.   (4)
Since R is an upper triangular matrix, a can be solved quickly by back-substitution (in O(m²)).
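To make the QR-based solve of Eq. (4) concrete, the following is a minimal NumPy sketch, assuming the monomial matrix M and the (nonzero) right-hand side b have already been built, e.g., by the 3L constraints, which are not reproduced here; the function name is ours.

```python
import numpy as np

def fit_ip_qr(M, b):
    """Solve the least-squares system M a = b via QR decomposition (Eq. (4)).

    M : (N, m) monomial matrix, one row m(x_i) per constraint point.
    b : (N,) right-hand side (nonzero, e.g. from the 3L constraints).
    Returns the coefficient vector a of length m.
    """
    Q, R = np.linalg.qr(M)          # M = Q R, Q^T Q = I, R upper triangular
    b_tilde = Q.T @ b               # b~ = Q^T b
    # Back-substitution on the triangular system R a = b~  (O(m^2))
    m = R.shape[1]
    a = np.zeros(m)
    for i in range(m - 1, -1, -1):
        a[i] = (b_tilde[i] - R[i, i + 1:] @ a[i + 1:]) / R[i, i]
    return a
```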
3.2 Gram-Schmidt QR Decomposition
Let us assume that matrix M consisting of columns {c1 , c2 , · · · , cm } is known. We show the method of Gram-Schmidt orthogonalization, that is, how to orthogonalize the columns of M into the orthonormal vectors {q1 , q 2 , · · · , q m } which
Fig. 2. The triangular linear system grows from the ith step to the (i + 1)th, and only the calculation shown in light-gray is required
are the columns of matrix Q, and simultaneously calculate the corresponding upper triangular matrix R consisting of elements r_{i,j}. The algorithm is as follows. Initially let q_1 = c_1/||c_1|| and r_{1,1} = ||c_1||. If {q_1, q_2, ..., q_i} have been computed, then the (i+1)th step is:

r_{j,i+1} = q_j^T c_{i+1}, \quad \text{for } j \le i,
q_{i+1} = c_{i+1} − \sum_{j=1}^{i} r_{j,i+1} q_j,
r_{i+1,i+1} = ||q_{i+1}||, \quad q_{i+1} = q_{i+1}/||q_{i+1}||.   (5)
From this algorithm, we can see that Gram-Schmidt orthogonalization can be carried out in an incremental manner, orthogonalizing the columns of M one by one.

3.3 Additional Calculation for Increasing an IP Degree
The incrementability of QR decomposition with Gram-Schmidt orthogonalization makes our incremental method computationally efficient. Fig. 2 illustrates this efficiency with the calculation from the ith step to the (i+1)th step of our incremental process: it is only necessary to calculate the parts shown in light-gray. To construct the upper triangular linear system of the (i+1)th step from that of the ith step, we only need two types of calculation: 1) to grow the upper triangular matrix from R_i to R_{i+1}, calculate the rightmost column and append it to R_i; and 2) to grow the right-hand vector from b̃_i to b̃_{i+1}, calculate the bottom element and append it to b̃_i. The first calculation is obtained directly from Gram-Schmidt orthogonalization in (5). For the second calculation, letting the scalar b̃_{i+1} denote the bottom element of the vector b̃_{i+1}, it follows the (i+1)th step of Gram-Schmidt orthogonalization in (5) and can be computed as b̃_{i+1} = q_{i+1}^T b. In order to clarify the computational efficiency, let us consider a comparison between our method and an incremental method that iteratively calls a linear
method such as the 3L method at each step. For solving the coefficients a at the ith step, our method needs i inner-product operations for constructing the upper triangular linear system and O(i²) operations for solving it, whereas the latter method needs i × i inner-product operations for constructing the linear system (3) and O(i³) operations for solving (3). Let us define a function G to denote the above calculation. Then, by repeatedly calling G, we obtain the incremental (dimension-growing) upper triangular linear systems, from which the corresponding coefficient vectors of different degrees can be solved.
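The following is a minimal NumPy sketch of this per-column update (one call of the function G), growing R and b̃ by one column and one element respectively as in Fig. 2; the class and method names are ours, and no collinearity check (Section 5.2) is included.

```python
import numpy as np

class IncrementalQRFit:
    """Grow the triangular system R a = b~ one monomial column at a time (Eq. (5))."""

    def __init__(self, b):
        self.b = np.asarray(b, dtype=float)   # right-hand side of M a = b
        self.Q_cols = []                      # orthonormal columns q_1, ..., q_i
        self.R = np.zeros((0, 0))             # upper triangular R_i
        self.b_tilde = np.zeros(0)            # b~_i = Q_i^T b

    def add_column(self, c):
        """One call of the function G: append a column c of M, update R and b~."""
        c = np.asarray(c, dtype=float)
        i = len(self.Q_cols)
        r_new = np.array([q @ c for q in self.Q_cols])    # r_{j,i+1} = q_j^T c
        q = c - sum(r * qj for r, qj in zip(r_new, self.Q_cols))
        r_diag = np.linalg.norm(q)
        q = q / r_diag
        # grow R_i -> R_{i+1}: only the rightmost column is new
        R_new = np.zeros((i + 1, i + 1))
        R_new[:i, :i] = self.R
        R_new[:i, i] = r_new
        R_new[i, i] = r_diag
        self.R = R_new
        self.Q_cols.append(q)
        # grow b~_i -> b~_{i+1}: only the bottom element is new
        self.b_tilde = np.append(self.b_tilde, q @ self.b)

    def solve(self):
        """Solve R a = b~ for the coefficient vector of the current degree."""
        return np.linalg.solve(self.R, self.b_tilde)
```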
4 Finding the Moderate Degree
To construct an algorithm for finding the moderate degree, we face two problems: 1) in what order should the columns c_α be chosen? 2) when should the incremental procedure be stopped? Note: for convenience, hereafter we use the notation α to denote the column index of M instead of i.

4.1 Incremental Order of Monomials
Feeding the columns c_α of M into the function G in different orders may lead to different results, so it is important to decide a suitable order. A reasonable way is to choose c_α in the degree-growing order described in Tab. 1. The reason is as follows: when we fit a two-degree IP to data on a unit circle, the unique solution −1 + x² + y² = 0 is obtained, whereas if we choose a three-degree IP, solutions such as x(−1 + x² + y²) = 0 are obtained, which contain redundant zero-set components such as x = 0.

4.2 Stopping Criterion
For the second problem, we have to define a stopping criterion based on a similarity measure between the IP and the data set. Once this stopping criterion is satisfied, we consider that the desired accuracy has been reached and the procedure is terminated. First, let us introduce a set of similarity functions measuring the similarity between the IP and the data set:

D_{dist} = \frac{1}{N}\sum_{i=1}^{N} e_i, \qquad D_{smooth} = \frac{1}{N}\sum_{i=1}^{N} (N_i \cdot n_i),   (6)

where N is the number of points, e_i = |f(x_i)| / ||∇f(x_i)||, N_i is the normal vector at a point obtained from the relations of the neighboring normals (here we follow Sahin's method [9]), and n_i = ∇f(x_i)/||∇f(x_i)|| is the normalized gradient vector of f at x_i. The quantity e_i has proved useful for approximating the Euclidean distance from x_i to the IP zero set [2].
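A minimal NumPy sketch of the two measures in Eq. (6) and of the stopping test (7) is given below; the IP value f and its gradient are assumed to be available as callables, and the neighbor-based normals N_i are passed in precomputed (Sahin's estimation of N_i [9] is not reproduced). The default thresholds follow the values used in Section 6.1.

```python
import numpy as np

def similarity_measures(f, grad_f, points, neighbor_normals):
    """Compute D_dist and D_smooth of Eq. (6) for a fitted IP.

    f, grad_f        : callables returning f(x) and the gradient of f at x.
    points           : (N, d) array of data points.
    neighbor_normals : (N, d) array of unit normals N_i estimated from neighbors.
    """
    e = []       # approximate distances e_i = |f(x_i)| / ||grad f(x_i)||
    dots = []    # smoothness terms N_i . n_i
    for x, N_i in zip(points, neighbor_normals):
        g = grad_f(x)
        gnorm = np.linalg.norm(g) + 1e-12
        e.append(abs(f(x)) / gnorm)
        dots.append(np.dot(N_i, g / gnorm))
    return np.mean(e), np.mean(dots)

def stop(D_dist, D_smooth, T1=0.01, T2=0.95):
    """Stopping criterion of Eq. (7): (D_dist < T1) and (D_smooth > T2)."""
    return (D_dist < T1) and (D_smooth > T2)
```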
Table 1. Index List: i, j and k are the powers of x, y and z respectively. α is the column index of M. The relation between α and (i, j, k) can be formulated as: α = j + (i + j + 1)(i + j)/2 + 1 (for 2D) and α = k + (j + k + 1)(j + k)/2 + (i + j + k + 2)(i + j + k + 1)(i + j + k)/6 + 1 (for 3D).

(a) Index list for 2D:
 α      [i j]                                        Form
 1      [0 0]                                        L0 (i + j = 0)
 2–3    [1 0], [0 1]                                 L1 (i + j = 1)
 4–6    [2 0], [1 1], [0 2]                          L2 (i + j = 2)
 7–10   [3 0], [2 1], [1 2], [0 3]                   L3 (i + j = 3)
 ...
 m      [0 n]                                        Ln (i + j = n)

(b) Index list for 3D:
 α      [i j k]                                      Form
 1      [0 0 0]                                      L0 (i + j + k = 0)
 2–4    [1 0 0], [0 1 0], [0 0 1]                    L1 (i + j + k = 1)
 5–10   [2 0 0], [1 1 0], [0 2 0], [1 0 1], [0 1 1], [0 0 2]   L2 (i + j + k = 2)
 ...
 m      [0 0 n]                                      Ln (i + j + k = n)
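The following short sketch illustrates the 2D index relation of Table 1 and the degree-growing order in which the columns c_α are fed to G; the function names are ours.

```python
def alpha_2d(i, j):
    """Column index alpha for the 2D monomial x^i y^j (Table 1)."""
    return j + (i + j + 1) * (i + j) // 2 + 1

def monomials_2d_in_order(max_degree):
    """2D monomial exponents (i, j) in the degree-growing order L0, L1, ..., Ln."""
    order = []
    for form in range(max_degree + 1):          # form = i + j
        for j in range(form + 1):
            order.append((form - j, j))
    return order

# Example: the first ten entries reproduce alpha = 1, ..., 10 of Table 1(a).
for i, j in monomials_2d_in_order(3):
    print(alpha_2d(i, j), (i, j))
```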
D_dist and D_smooth can be considered as two measurements of the distance and smoothness between the data set and the IP zero set, and we define our stopping criterion as:

(D_dist < T1) ∧ (D_smooth > T2).   (7)
4.3 Algorithm for Finding the Moderate IPs
Given the above conditions, our algorithm is simply described as follows: 1) call the function G to construct the upper triangular linear system; 2) solve this linear system to obtain the coefficient vector a; 3) measure the similarity for the obtained IP; 4) stop the algorithm if the stopping criterion (7) is satisfied; otherwise go to 1) to grow the dimension.
5 Improving Global Stability
Linear fitting methods in general suffer from a lack of global stability, which is well discussed in [4,9]. Since our fitting method belongs to this class of linear methods, we face the same problem. We propose to address it by controlling the condition number of the matrix M.

5.1 Stability and Condition Number of M
An important reason for global instability is the collinearity of M , which causes the matrix M T M to be nearly singular with some eigenvalues negligible
compared to the others [4]. Such degenerate eigenvalues contribute very little to the overall shape of the fit. Fortunately, since M = QR, we have M^T M = R^T R, and thus the condition number of M^T M can be improved by controlling the eigenvalues of R. Here we take the condition number to be λ_max/λ_min, where λ_max and λ_min are the maximum and minimum eigenvalues respectively. From the property that the eigenvalues of the upper triangular matrix R lie on its main diagonal, we can easily evaluate the singularity of R by observing only the diagonal values. To improve the condition number of M^T M, this paper gives a solution from two aspects: eliminating collinear columns of M and using the Ridge Regression method.

5.2 Eliminating Collinear Columns of M
The first simple idea is to check, at each step, whether the eigenvalue r_{i,i} is too small (nearly zero). If r_{i,i} is too small, the current column c_i of M is nearly collinear with the space spanned by {c_1, c_2, ..., c_{i−1}}. Thus, to keep R well conditioned, this column should be abandoned and the subsequent columns tried instead.

5.3 Ridge Regression Method
Ridge regression (RR) regularization is an effective method that improves the condition number of M^T M by adding a small moderate value to each diagonal element, i.e., adding a term κD to M^T M [4,9]. Accordingly, equation (3) becomes (M^T M + κD)a = M^T b, and equation (4) becomes

(R + κR^{-T} D)a = b̃,   (8)
where κ is a small positive value called the RR parameter and D is a diagonal matrix. D is chosen to maintain Euclidean invariance; the simplest choice is to let D be the identity matrix. A cleverer choice has been proposed by Tasdizen et al. [4] for 2D and Sahin et al. [9] for 3D. In fact, their strategy is to add constraints that keep the leading forms of the polynomial strictly positive, which guarantees that the zero set of a polynomial of even degree is always bounded (see the proof in [4]). We give details of this derivation in the appendix.
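A minimal sketch of solving the ridge-regularized system of Eq. (8) is given below, assuming R, b̃ and the diagonal matrix D are available; since the left-hand side is no longer triangular, a generic solve is used, and R^{-T}D is obtained without forming an explicit inverse. The function name and the default κ are ours.

```python
import numpy as np

def ridge_solve(R, b_tilde, D=None, kappa=1e-6):
    """Solve the ridge-regularized system (R + kappa * R^{-T} D) a = b~ of Eq. (8).

    R       : (m, m) upper triangular matrix from the QR decomposition of M.
    b_tilde : (m,) vector Q^T b.
    D       : (m, m) diagonal matrix (identity if None); kappa is the RR parameter.
    """
    m = R.shape[0]
    if D is None:
        D = np.eye(m)
    # R^{-T} D obtained by solving R^T Z = D instead of forming the inverse.
    Z = np.linalg.solve(R.T, D)
    return np.linalg.solve(R + kappa * Z, b_tilde)
```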
6 Experimental Results
The setting for our experiments involves some pre-conditions: 1) for convenience, we employ the constraints of the 3L method [3]; 2) all data sets are regularized by centering the data-set center of mass at the origin of the coordinate system and scaling by dividing each point by the average distance from the points to the origin; 3) we choose T1 in (7) to be about 20 percent of the layer distance c of the 3L method, as done in [3].
Fig. 3. IP fitting results: (a) Original image. (b) α = 28 (six-degree). (c) α = 54 (ninedegree). (d) α = 92 (thirteen-degree). (e) Convergence of Ddist and Dsmooth . Note “o” symbols represent the boundary points extracted from the image and real lines represent the IP zero set in (b)-(d).
6.1 A Numerical Example
In this experiment, we fit an IP to the boundary of a cell shown in Fig. 3(a). The stopping criterion is set as T1 = 0.01 and T2 = 0.95, and the layer distance of the 3L method is c = 0.05. The moderate IP is found automatically (see Fig. 3(d)). For comparison, we also show some fits before the desired accuracy is reached (see Fig. 3(b) and (c)); these results are improved by the method described in Section 5. We also track the convergence of D_dist and D_smooth, shown in Fig. 3(e). Although there are some small fluctuations in the graph, D_dist and D_smooth converge to 0 and 1 respectively, which also shows that the stopping criterion in (7) can effectively measure the similarity between the IP and the data set.

6.2 2D and 3D Examples
Some 2D and 3D experiments are shown in Fig. 4, where the fitting results are obtained with the same parameters as in the first example. We conclude that objects with different shapes may obtain fitting results of different degrees, since the resulting degrees always reflect the complexity of the shapes.

6.3 Degree-Fixed Fitting Compared with Adaptive Fitting
Fig. 5 shows some comparisons between degree-fixed fitting methods and our adaptive fitting method. Compared to degree-fixed methods such as [3,4,8], the results of our method show neither over-fitting nor insufficient
11-degree   12-degree   12-degree   6-degree   8-degree   12-degree
Fig. 4. IP fitting results. First row: Original objects; Second row: IP fits.
fitting, and we also attain global stability. This shows that our method is more meaningful than the degree-fixed methods, since it fulfills the requirement that the degree be adapted to the complexity of the object shape.
7 Discussion

7.1 QR Decomposition Methods
Other well-known algorithms for QR decomposition are the Householder and Givens methods [11]. In the field of numerical computation, the Householder and Givens methods have proved more stable than the conventional Gram-Schmidt method. In this paper, however, since our discussion is based on a well-conditioned, regularized data set, we ignore the small effect of rounding errors that causes this instability. Here we simply take advantage of the property of QR decomposition that it orthogonalizes the vectors one by one, to demonstrate the possibility of constructing the moderate-degree fitting algorithm described above.

7.2 IP vs. Other Functions
In contrast to other function-based representations such as B-splines, NURBS, and radial basis functions, the IP representation cannot give as accurate a model. However, this representation is more attractive for applications that require fast registration and fast recognition (see the works [4,5,12,13,14]), because of its algebraic/geometric invariants [15]. Sahin also showed some experiments with missing data in [9]. A more accurate representation of a complex object may require segmenting the object shape and representing each segmented patch with an IP. We will consider this possibility in our future work.
Original Objects:
Degree-fixed fitting in 2-degree: 2-degree *, 2-degree †, 2-degree †
Degree-fixed fitting in 4-degree: 4-degree ‡, 4-degree †, 4-degree †
Our method: 2-degree *, 6-degree *, 12-degree *
Fig. 5. Comparison between degree-fixed fitting and adaptive fitting. First row: Original objects. Second and third row: IP fits resulting from degree-fixed fitting with 2-degree and 4-degree fitting respectively. Fourth row: Adaptive fitting. Mark *: moderate fitting. †: insufficient fitting (losing accuracy). ‡: over-fitting.
7.3 Setting the Parameters T1 and T2
Since our stopping criterion can be approximately regarded as a kind of Euclidean metric, it is intuitive to control the fitting accuracy by setting appropriate values for T1 and T2. Basically, these parameters should be decided based on the object scale or on statistics about the data noise; further discussion is beyond the scope of this paper. Fortunately, it is intuitive to decide whether the parameters are appropriate for a given application, since the 2D/3D Euclidean distance can be easily observed. In this paper, we let T1 and T2 be close to zero and one respectively for a smooth model, and use more tolerant values for a coarse one.
8 Conclusions
This paper provided an incremental method for fitting shape-representing IPs. With our stopping criterion, an IP of moderate degree can be found adaptively in a single fitting process, and the global fitting stability is successfully improved. Our results support the argument that adaptively determining IP degrees from the shapes is better than fixing them, because it not only saves much time for users, but is also suited to future applications involving automatic recognition systems.
Acknowledgements Our work was supported by the Ministry of Education, Culture, Sports, Science and Technology under the Leading Project: Development of High Fidelity Digitization Software for Large-Scale and Intangible Cultural Assets.
References
1. Keren, D., Cooper, D.: Describing Complicated Objects by Implicit Polynomials. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 16(1), 38–53 (1994)
2. Taubin, G.: Estimation of Planar Curves, Surfaces and Nonplanar Space Curves Defined by Implicit Equations with Applications to Edge and Range Image Segmentation. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 13(11), 1115–1138 (1991)
3. Blane, M., Lei, Z.B., Cooper, D.B.: The 3L Algorithm for Fitting Implicit Polynomial Curves and Surfaces to Data. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 22(3), 298–313 (2000)
4. Tasdizen, T., Tarel, J.P., Cooper, D.B.: Improving the Stability of Algebraic Curves for Applications. IEEE Trans. on Imag. Proc. 9(3), 405–416 (2000)
5. Tarel, J.-P., Cooper, D.B.: The Complex Representation of Algebraic Curves and Its Simple Exploitation for Pose Estimation and Invariant Recognition. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 22(7), 663–674 (2000)
6. Helzer, A., Bar-Zohar, M., Malah, D.: Using Implicit Polynomials for Image Compression. In: Proc. 21st IEEE Convention of the Electrical and Electronic Eng., pp. 384–388. IEEE Computer Society Press, Los Alamitos (2000)
7. Turk, G., O'Brien, J.F.: Variational Implicit Surfaces. Technical Report GIT-GVU-99-15, Graphics, Visualization, and Usability Center (1999)
8. Helzer, A., Barzohar, M., Malah, D.: Stable Fitting of 2D Curves and 3D Surfaces by Implicit Polynomials. IEEE Trans. on Patt. Anal. Mach. Intell. (PAMI) 26(10), 1283–1294 (2004)
9. Sahin, T., Unel, M.: Fitting Globally Stabilized Algebraic Surfaces to Range Data. In: Proc. 10th IEEE Int. Conf. on Computer Vision (ICCV), vol. 2, pp. 1083–1088 (2005)
10. Kanatani, K.: Renormalization for Computer Vision. The Institute of Elec., Info. and Comm. Eng. (IEICE) Transactions 35(2), 201–209 (1994)
11. Horn, R.A., Johnson, C.R.: Matrix Analysis: Section 2.8. Cambridge University Press, Cambridge (1985)
12. Tarel, J.P., Civi, H., Cooper, D.B.: Pose Estimation of Free-Form 3D Objects without Point Matching Using Algebraic Surface Models. In: Proceedings of IEEE Workshop on Model-Based 3D Image Analysis, pp. 13–21. IEEE Computer Society Press, Los Alamitos (1998)
13. Khan, N.: Silhouette-Based 2D-3D Pose Estimation Using Implicit Algebraic Surfaces. Master Thesis in Computer Science, Saarland University (2007)
14. Unsalan, C.: A Model Based Approach for Pose Estimation and Rotation Invariant Object Matching. Pattern Recogn. Lett. 28(1), 49–57 (2007)
15. Taubin, G., Cooper, D.: Symbolic and Numerical Computation for Artificial Intelligence. In: Computational Mathematics and Applications, ch. 6. Academic Press, London (1992)
Appendix A. Choosing the Diagonal Matrix D for the RR Method

A choice of the diagonal matrix D for the RR method was derived by Tasdizen et al. [4] and Sahin et al. [9] as d_{αα} = γ t̂, where d_{αα} is the αth diagonal element of D and depends on i, j, k. The relationship between the index α and (i, j, k) is shown in Tab. 1. γ is a free parameter for the (i + j)th form (2D) or (i + j + k)th form (3D), decided from the data set as follows:

γ_{i+j} = \sum_{t=1}^{N} (x_t^2 + y_t^2)^{i+j} \ \ (2D), \qquad γ_{i+j+k} = \sum_{t=1}^{N} (x_t^2 + y_t^2 + z_t^2)^{i+j+k} \ \ (3D),   (9)

where (x_t, y_t) and (x_t, y_t, z_t) are the data points. t̂ is a variable depending on i, j, k, and for maintaining Euclidean invariance it can be derived as:

t̂ = \frac{i!\,j!}{(i+j)!} \ \ (2D) \quad \text{and} \quad t̂ = \frac{i!\,j!\,k!}{(i+j+k)!} \ \ (3D).   (10)
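As an illustration of Eqs. (9)–(10), the following sketch builds the diagonal matrix D for the 2D case, with the diagonal entries ordered by the same degree-growing index α as in Table 1; the function name is ours, and weighting every form (rather than only the leading form) follows the appendix text as written.

```python
import numpy as np
from math import factorial

def rr_diagonal_2d(points, max_degree):
    """Diagonal matrix D for the 2D RR method: d_alpha = gamma_{i+j} * i! j! / (i+j)!."""
    pts = np.asarray(points, dtype=float)
    r2 = pts[:, 0] ** 2 + pts[:, 1] ** 2            # x_t^2 + y_t^2
    diag = []
    for form in range(max_degree + 1):              # form = i + j
        gamma = np.sum(r2 ** form)                  # Eq. (9), 2D case
        for j in range(form + 1):
            i = form - j
            t_hat = factorial(i) * factorial(j) / factorial(i + j)   # Eq. (10)
            diag.append(gamma * t_hat)              # d_{alpha,alpha} = gamma * t_hat
    return np.diag(diag)
```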
Determining Relative Geometry of Cameras from Normal Flows

Ding Yuan and Ronald Chung

Department of Mechanical & Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, China
{dyuan,rchung}@mae.cuhk.edu.hk
Abstract. Determining the relative geometry of cameras is important for active binocular heads and multi-camera systems. Most existing works rely upon the establishment of either motion correspondences or binocular correspondences. This paper presents a first solution method that requires neither the recovery of full optical flow in either camera, nor overlap of the cameras' visual fields and in turn the presence of binocular correspondences. The method is based upon observations that are directly available in the respective image streams – the monocular normal flows. Experimental results on synthetic data and real image data are shown to illustrate the potential of the method.

Keywords: Camera calibration, Extrinsic camera parameters, Active Vision.
1 Introduction

Active vision systems allow the position and orientation of each camera to be controlled arbitrarily according to need, and have many advantages. They however require the relative geometry of the cameras to be determined from time to time for fusion of the visual information. Determination of the relative geometry of cameras is a well-studied problem. What makes the problem unique and particularly challenging in active vision is that there is no guarantee on how much overlap there is between the visual fields of the cameras; in the extreme case, there could be no overlap. There have been many methods in the literature proposed for binocular geometry determination. Some methods require certain specific objects appearing in the scene, such as planar surfaces [7] and cubic objects [9]. Such methods constitute simpler solution mechanisms, but they are restricted to certain applications or scenes. Other methods [3][11] require the accessibility of either cross-camera feature correspondences [1][11] or motion correspondences [2]. While cross-camera correspondences require the cameras to have much in common in what they picture, establishing motion correspondences is an ill-posed problem and the result is not always reliable when the scene contains much occlusion. The objective of this work is to develop a solution method that does not assume the presence of calibration objects or specific shapes or features in the imaged scene, nor impose restrictions on the viewing directions of the cameras, thus allowing the visual fields of the cameras to have little or zero overlap.
In an inspiring work, Fermüller and Aloimonos [4][5] described a method (hereafter referred to as the FA-camera motion determination method) of determining the ego-motion of a camera directly from normal flow. Normal flow is the apparent motion of image position in the image stream picked up by the camera, which must be in the direction of the intensity gradient, and can be detected directly by applying some simple derivative filters to the image data. Fermüller and Aloimonos first define, for any arbitrarily chosen 3D axis p that contains the optical center, a vector field over the entire image space. Patterns with “+” and “-” candidates are generated according to the vector field and the normal flows. The camera motion parameters can be determined from those patterns. The FA-camera motion determination method forms the basis of a method we proposed earlier [10] for determining the relative geometry of two cameras. However, a key issue has not been addressed in [10]. Any p-axis allows not all but only a small subset of the data points – those whose normal flow is parallel or anti-parallel to the field vector there with respect to p-axis – to be usable. Different choices of the p-axis would allow different subsets of the data points to be usable, each with a different density of the data points in the space. The choices of the p-axis are thus crucial; they determine the density of the data points and the accuracy in determining the camera motions and in turn the binocular geometry. However, in both the FA-camera motion determination method [4][5] and in our earlier work on binocular geometry determination [10], no particular mechanism was devised in choosing the p-axes. This paper presents how the p-axes can be chosen optimally for best use of the data points. We assume that the intrinsic parameters of the cameras have been or are to be estimated by camera self-calibration methods like [8] [11]. The focus of this work is the estimation of the camera-to-camera geometry.
2 Copoint Vector Field with Respect to Arbitrary 3D Direction

Fermüller and Aloimonos [4][5] proposed a vector field which was then applied in their camera motion determination method. Our binocular geometry determination method also makes use of this vector field, so here we give a brief review of it. Suppose the image space is represented by an image plane positioned perpendicular to the optical axis and at 1 unit away from the optical center. Pick any arbitrary axis p = [A B C] in space that contains the optical center; the axis hits the image plane at the point P = [A/C B/C]. The family of projection planes that contain axis p defines the family of lines that contain point P on the image plane. The copoint vector field for the image space with respect to the p-axis is defined as the field with vectors perpendicular to the above family of lines about point P, as shown in Fig. 1(a). In the figure, each arrow represents the vector assigned to each image point, and it is

[M_x \ \ M_y] = \frac{[\,-y + B/C \quad x - A/C\,]}{\sqrt{(x - A/C)^2 + (-y + B/C)^2}}.   (2.1)
Suppose the camera undergoes a pure rotation, which is represented in the rotation-axis form by a vector ω = [α β γ]. For any particular choice of the p-axis, we have a
p-copoint field direction (Equation 2.1) at each image position. At any image position, the dot product between the p-axis induced field vector and the optical flow there allows the image position to be labeled as either: “+” if the dot product is positive, or “−” if the dot product is negative. By the distribution of the “+” and “−” labels, the image plane is divided into two regions, positive and negative, with the boundary being a 2nd-order curve called the zero-boundary, as illustrated by Fig. 1(b). The zero-boundary is determined only by the ratios α/γ and β/γ. Fig. 1(c) illustrates the positive-negative pattern generated the same way as described above when the camera undergoes a pure translation. Different from the pattern of Fig. 1(b), the zero-boundary is a straight line (Fig. 1(c)), which is a function of the focus of expansion (FOE) of the optical flow and precisely describes the translational direction of the camera. If the motion of the camera includes both rotation and translation, the positive-negative pattern will include a positive region, a negative region, and “Don’t know” regions (which depend on the scene structure), as shown in Fig. 1(d).
Fig. 1. p-copoint vector field and its positive-negative patterns in the image space. (a) p-copoint vector field; positive-negative labeled patterns when (b) camera undergoes pure rotation; (c) camera undergoes pure translation; (d) camera takes arbitrary motion (with both rotation and translation components).
The above labeling mechanism allows constraints on the camera motion to be constructed from optical flow. However, due to the aperture problem, the full flow is generally not directly observable from the image data; only the normal flow, i.e., the component of the full flow projected onto the direction of the local intensity gradient, is. The above labeling mechanism therefore has to be adjusted, and the positive-negative pattern can still be generated [4][5]. The main difference is that, while with full flows all the data points can be labeled, with normal flows only a handful of the data points can be labeled, and the localization of the zero-boundary from the sparsely labeled regions is much more challenging.
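The following is a minimal NumPy sketch of this labeling step, assuming the normal flows have already been extracted; the function name and the parallelism threshold are ours. A point is labeled only when its normal flow is (anti-)parallel to the copoint field vector of Eq. (2.1), as required when only normal flow is available.

```python
import numpy as np

def label_points(points, normal_flows, P, cos_tol=0.99):
    """Label data points '+' / '-' / unusable with respect to a p-axis hitting the image at P.

    points       : (N, 2) image positions (x, y).
    normal_flows : (N, 2) normal flow vectors at those positions.
    P            : (2,) image point [A/C, B/C] of the chosen p-axis.
    Returns an array of +1, -1 (labeled) or 0 (unusable for this p-axis).
    """
    labels = np.zeros(len(points), dtype=int)
    for idx, (x, y) in enumerate(points):
        field = np.array([-(y - P[1]), x - P[0]])        # Eq. (2.1), unnormalized
        n = normal_flows[idx]
        if np.linalg.norm(field) < 1e-9 or np.linalg.norm(n) < 1e-9:
            continue
        c = np.dot(field, n) / (np.linalg.norm(field) * np.linalg.norm(n))
        if abs(c) >= cos_tol:                            # (anti-)parallel -> usable
            labels[idx] = 1 if c > 0 else -1
    return labels
```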
3 Binocular Geometry Determination

Suppose the binocular geometry of two cameras at a particular configuration of the camera system is to be determined. Our procedure is as follows. The binocular configuration is first frozen, and then moved rigidly in space while image streams are collected from the respective cameras to serve as data for the
determination task. If the binocular geometry is expressed by a 4×4 matrix X, and the two camera motions by A and B, then because of the rigidity of the motion of the camera pair we have the relationship AX = XB, which can be decomposed into the following two equations:

R_A R_x = R_x R_B   (3.1)

(or ω_A = R_x ω_B in vector form) and

(R_A − I) t_x = R_x t_B − t_A,   (3.2)

where R_x, R_A, R_B are the 3×3 orthonormal matrices representing the rotational components of X, A, B respectively, t_x, t_A, t_B are the vectors representing the translational components, and ω_A, ω_B are the rotations R_A, R_B expressed in axis-angle form. The plan is that, if the camera motions A and B can be determined from normal flows by the use of, say, the FA-camera motion determination method, Equations (3.1) and (3.2) provide constraints for the determination of the binocular geometry parameters X. However, there are two complications. First, if a general motion of any camera (i.e., motion that involves both translation and rotation) is involved, the image space associated with the camera under the FA-camera motion determination method contains two “Don’t know” regions, as depicted by Fig. 1(d). The presence of the “Don’t know” regions adds much challenge to the localization of the zero-boundaries. Second, with only the normal flows in the respective image streams of the two cameras, only a small percentage of the data points can be made usable in the image space under a random choice of the p-axis. This complication is the most troubling, as the localization of the zero-boundary from very sparsely labeled data points would be an extremely difficult task. For the first complication, we adopt specific rigid motions of the camera pair to avoid as much as possible the presence of general motion of any camera. For the second complication, we propose a scheme that allows the p-axes to be chosen not randomly as in previous work [4][5][10], but according to how many data points they can make useful in the copoint vector field based method. The scheme allows each data point (an image position with detectable normal flow) to propose a family of p-axes that could make that data point useful, and to vote for them in the space of all possible p-axes. The p-axes that receive the highest numbers of votes from all the data points are then the p-axes we use in the process.

3.1 Determination of Rx

We first determine the rotation component R_x of the binocular geometry. We let the camera pair undergo a specific motion – pure translation – so as to reduce the complexity of locating the zero-boundary of the positive-negative labeled patterns in the image space. When the camera pair undergoes a rigid-body pure translation, the motion of each camera is also a pure translation. From Equation (3.2) we have:

t̃_A = R_x t̃_B,   (3.3)
where t̃_A and t̃_B are unit vectors corresponding to the foci of expansion (FOEs) of the two cameras. We previously proved that at least two translational motions in different directions are required to achieve a unique solution of R_x; the solution of R_x is presented in our previous work [10]. To determine t̃_A and t̃_B in each rigid-body translation of the camera pair, we adopt the p-copoint vector field model to generate patterns from the normal flows in the respective image streams. Both cameras will exhibit patterns like that shown in Fig. 1(c), containing only the positive and the negative regions separated by a straight line (the zero-boundary), without the “Don’t know” regions.

3.2 Determination of tx Up to Arbitrary Scale

To determine the baseline t_x of the binocular geometry, we let the camera pair undergo rigid-body pure rotations while the cameras capture the image stream data. However, t_x can only be determined up to an arbitrary scale unless some metric measurement about the 3D world is available. Suppose the camera pair has a pure rotation about an axis passing through the optical center of one camera, say the optical center of camera A. Then camera A only undergoes a pure rotation, while camera B's motion consists of a rotation about an axis containing the optical center of camera A, and a translation orthogonal to the baseline. In this case Equation (3.2) can be rewritten as:

(R_A − I) t_x = R_x t_B.   (3.4)

We then rewrite Equation (3.4) as a homogeneous system:

Â t̃_x = 0,   (3.5)

where t̃_x is the normalized vector representing the direction of the baseline, and Â is a 2×3 matrix calculated from R_x, R_A, t_B with Rank(Â) = 1. At least two rotations are needed to determine t_x uniquely [10]. Camera A, which undergoes only pure rotations in this process, has positive-negative labeled patterns in the image space like the one shown in Fig. 1(b), in which a 2nd-order curve (the zero-boundary) separates the ‘+’ and ‘−’ labeled regions. As for camera B, the positive-negative labeled patterns in the image space take the form of Fig. 1(d), and are more challenging because of the existence of two “Don’t know” regions; there are two zero-boundaries to be determined: one a 2nd-order curve, and the other a straight line. The strategy for calculating t_x (up to arbitrary scale) by analyzing the positive-negative labeled patterns is presented in our earlier work [10].
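To make the use of the constraints (3.3) and (3.5) concrete, the following sketch shows one way the estimates could be computed once the per-motion FOE directions and the constraint matrices are available; it uses an SVD-based orthogonal Procrustes fit for R_x and a null-space extraction for t_x, which is our own illustrative formulation and not necessarily the numerical recipe of [10].

```python
import numpy as np

def estimate_Rx(foe_A, foe_B):
    """Estimate R_x from Eq. (3.3), t~_A = R_x t~_B, given at least two translations.

    foe_A, foe_B : (K, 3) arrays of unit FOE directions of cameras A and B.
    Solves min_R sum ||t_A - R t_B||^2 over rotations (orthogonal Procrustes).
    """
    A = np.asarray(foe_A, dtype=float)
    B = np.asarray(foe_B, dtype=float)
    U, _, Vt = np.linalg.svd(A.T @ B)                 # cross-covariance of the two sets
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])    # enforce det(R) = +1
    return U @ D @ Vt

def estimate_tx_direction(A_blocks):
    """Estimate the baseline direction from stacked constraint matrices of Eq. (3.5).

    A_blocks : list of matrices (one per rotation motion) with A_k t_x = 0.
    Returns the unit vector spanning the (approximate) common null space.
    """
    A = np.vstack(A_blocks)
    _, _, Vt = np.linalg.svd(A)
    t = Vt[-1]
    return t / np.linalg.norm(t)
```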
4 Optimal Selection of p-Axes

Under different choices of the p-axis, different subsets of data points are made usable for generating the positive-negative labeled patterns in the image space. Obviously a higher density of the labeled patterns is desired, as it makes the localization of the zero-boundary easier. In this section we propose a scheme for that. In the following discussion, for simplicity we only describe the scheme for the case where the camera motion is a pure translation; the scheme for the pure rotation case is similar.
4.1 From a Data Point of Normal Flow to a Locus of p-Axes

For any given p-axis, only the data points whose normal flows are exactly parallel or anti-parallel to the p-copoint field vectors there are usable data for participating in the positive-negative patterns in the image space. Viewing the whole process from the opposite angle, a data point (x_i, y_i) with normal flow (u_i^n, v_i^n) is usable only under a p-axis whose equivalent image position P = [p_x, p_y] is located on the line l_i passing through the data point (x_i, y_i) and orthogonal to the normal flow (u_i^n, v_i^n), as illustrated by Fig. 2. We call the line l_i the P-line of the data point (x_i, y_i), and it can be expressed as:

u_i^n p_x + v_i^n p_y − (u_i^n x_i + v_i^n y_i) = 0.   (4.1)

Thus, to find the p-axis which makes the maximum number of data points useful, a simple scheme is to let each data point vote for the members of its P-line in the space of all possible p-axes (which is only a two-dimensional space, as each p-axis has only two degrees of freedom). The p-axes that collect a large number of votes are then the good choices of p-axes we should use in the copoint vector field based method.
Fig. 2. The P-lines of data points (xi, yi) (with normal flow (uin, vin)) and (xj, yj) (with normal flow (ujn, vjn))
4.2 Optimal Determination of p-Axes
Obviously we could obtain a linear system of equations for the optimal p-axis (point P) from, say, n data points using Equation (4.1), and solve for the optimal p-axis. However, the orientations of the normal flows are not extracted without error, so each data point should vote not for a P-line, but for a narrow cone centered at the data point and swung about the P-line. The size of the cone is a threshold that depends upon the estimated error in the extraction of the normal flow orientations. We thus adopt a voting scheme similar to the Hough Transform. We use an accumulator array to represent the entire space of all p-axes and to collect votes from each data point. The accumulator is a two-dimensional array whose axes correspond to the quantized values of p_x and p_y. For each data point (an image point with detectable normal flow), we determine its P-line, look for the bins in the accumulator array that the line falls into, and put one vote in each of those bins. After doing this with all the data points, we identify the bins with the highest count of votes in the accumulator array. An example of an accumulator array is shown in Fig. 3(a).
Fig. 3. (a) Two-dimensional accumulation array that corresponds to various values of px and py. The P-line associated with each data point is determined, the array bins corresponding to the line are identified, and each of such bins has the vote count increased by one. The bin with the highest vote count is identified (and marked as a red circle in this figure), which corresponds to the optimal p-axis. (b) (c) (d): The development of the voting process under the coarse-to-fine strategy.
To increase computational efficiency we use a coarse-to-fine strategy in the voting process, as illustrated by Fig. 3(b-d). Since the copoint vector field based method demands the use of not one but a few p-axes, we use not only the optimal p-axis but a few p-axes with the highest vote counts. While in synthetic data experiments the scene texture (and thus the orientation of the normal flow) is often made random, making all p-axes have a similar density of usable data points, in real image data the scene texture is often oriented in a few directions (and so is the normal flow), and the densities of the usable data points can be drastically different under different choices of the p-axes. Our experience shows that, especially in the real image data cases, the adoption of the optimal p-axes drastically improves the solution quality over random selection of the p-axes. More specifically, our experiments on real image data show that the pattern generated by the best p-axes often has 60% more data points than those under the average p-axes.
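A minimal sketch of the Hough-like voting of this section is given below, assuming the accumulator covers a fixed square region of the p-axis (image) space; the bin size, the range, and the function name are ours, and the coarse-to-fine refinement of Fig. 3(b-d) is omitted.

```python
import numpy as np

def vote_for_p_axes(points, normal_flows, p_range=2.0, bins=101):
    """Accumulate votes of Eq. (4.1): each data point votes for its P-line.

    points       : (N, 2) image positions (x_i, y_i).
    normal_flows : (N, 2) normal flows (u_i^n, v_i^n).
    p_range      : accumulator covers p_x, p_y in [-p_range, p_range].
    Returns the accumulator array and the bin centers along each axis.
    """
    centers = np.linspace(-p_range, p_range, bins)
    bin_width = centers[1] - centers[0]
    acc = np.zeros((bins, bins), dtype=int)
    for (x, y), (u, v) in zip(points, normal_flows):
        norm = np.hypot(u, v)
        if norm < 1e-9:
            continue
        # P-line: u*p_x + v*p_y - (u*x + v*y) = 0; vote in every bin it passes near.
        for ix, px in enumerate(centers):
            for iy, py in enumerate(centers):
                dist = abs(u * px + v * py - (u * x + v * y)) / norm
                if dist <= bin_width:
                    acc[ix, iy] += 1
    return acc, centers

# The best p-axes are the bins with the highest vote counts, e.g.:
# ix, iy = np.unravel_index(np.argmax(acc), acc.shape); P = (centers[ix], centers[iy])
```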
5 Experimental Results

The whole method consists of two steps. First, the binocular cameras undergo pure translations twice as a whole, each time moving in a different direction; the rotational component R_x is computed in this first step. After that, we rotate the camera pair twice around two different axes passing through the optical center of one of the cameras; in this step t_x is determined up to scale.

5.1 Experimental Results on Synthetic Data
The experiments on synthetic data aim at investigating the accuracy and precision of the method. Normal flows are the only input, same as in the case of real image experiments. We used image resolution 101×101 in the synthetic data.
Estimation of Rx. The normal flows were generated by assigning to each image point a random intensity gradient direction; the dot product between the gradient direction and the optical flow incurred by the assumed camera motion then determined the normal flow precisely. We selected the optimal set of p-axes first. With the first optimal p-axis we obtained the first positive-negative labeled pattern in the image space. After determining the pseudo FOEs at an accuracy of 0.25×0.25 pixel, a number of lines, determined from different pseudo FOEs, could well divide the pattern into two regions. Then we applied a second optimal p-axis to examine whether the pseudo FOEs that performed well on the first pattern still performed well on the new pattern, keeping those that did for the next round under a new p-axis. We repeated this process until all possible FOEs were located within a small enough area; the center of these possible FOEs was then taken as the input for computing R_x. We estimated the FOEs by locating the zero-boundaries for both cameras A and B first, and the rotational component ω_x of the binocular geometry was then estimated. The result is shown in Tab. 1. The error was 0.7964° in direction and 1.2621% in length.

Estimation of tx up to Arbitrary Scale. We assumed that the camera pair rotated about an axis passing through the optical center of camera A at two different given velocities. As above, the normal flows were generated as the inputs. We located the zero-boundaries on the positive-negative labeled patterns to estimate the rotations ω_A of camera A, using the algorithm named “detranslation” [4][5]. The FOE t_B of camera B was obtained readily from the patterns. Finally we obtained t_x up to arbitrary scale using Equation (3.5). The result, shown in Tab. 1, is a unit vector describing the direction of the baseline. The angle between the ground truth and the result is 2.0907°.

Table 1. Estimation of ω_x and t_x up to scale
        Ground Truth              Experiment
ω_x     [0.100 0.100 -0.200]^T    [0.097 0.204 -0.203]^T
t_x     [-700 20 80]^T            [-0.988 0.043 0.147]^T
In this experiment, the synthetic normal flows, computed from full optical flows by allocating to each pixel a random gradient direction, are oriented more evenly in all directions than in real image data, because in real image data the scene texture is often oriented, as discussed above. However, the accuracy of our method can be better explored in the synthetic data experiments.

5.2 Experimental Results on Real Image Data

Here we only show results on the recovery of R_x (ω_x) due to the limitation of page space. We moved the camera pair on a translational platform. The image sequences were captured by Dragonfly CCD cameras at a resolution of 640×480. The first experiment investigates the accuracy of the algorithm. We used the algorithm described in [6] to estimate the intrinsic parameters of the two cameras.
Input images were first smoothed using a Gaussian filter with n = 5 and σ = 1.4 to suppress Gaussian noise. We examined pseudo FOEs pixel by pixel in the image frames; 377 p-axes were enough to pinpoint the locations of the possible FOEs. The zero-boundaries determined by the estimated FOEs are shown in Fig. 4.
Fig. 4. The zero-boundaries (blue lines) determined by estimated FOEs. Green dots represent negative candidates; red dots represent positive candidates. (a) Camera A, Motion 1; (b) Camera B, Motion 1; (c) Camera A, Motion 2; (d) Camera B, Motion 2.
We then calibrated the binocular cameras using the traditional stereo calibration method [6], in which the inputs are manually picked corner pairs of an imaged chess-board pattern in the stereo images. We compare in Tab. 2 the results from our method and from the traditional stereo calibration method.

Table 2. Estimation of ω_x. Experiment 1: using our method; Experiment 2: using the traditional stereo calibration method [6]

        Experiment 1                   Experiment 2
ω_x     [0.0129 -0.7896 0.5222]^T      [0.0270 -0.4109 -0.0100]^T
Although there is still some error compared with the result of the traditional calibration method, our result is acceptable given that we neither require any chess-board pattern to appear in the scene, nor need any manual intervention in selecting point-to-point correspondences across the image pairs. The second experiment concerns the case where there is almost no overlap in the two cameras' fields of view, as shown in Fig. 5. Estimating the binocular geometry of cameras viewing such a scene could be a difficult task for correspondence-based methods; our method, however, is still effective.
Fig. 5. The zero-boundaries (blue lines) determined by the estimated FOEs. Green dots represent negative candidates; red dots represent positive candidates. (a) Camera A, Motion 1; (b) Camera B, Motion 1; (c) Camera A, Motion 2; (d) Camera B, Motion 2.
The result for ω_x in this experiment is shown in Table 3.

Table 3. Estimation of the rotational component ω_x of the binocular geometry

ω_x     [0.1634 -0.0801 -2.1097]^T
6 Conclusion and Future Work

We have addressed in this work how the determination of inter-camera geometry from normal flows can be much improved by the use of better-chosen p-axes, and how these better p-axes can be chosen. Our future work is to relax the requirement of the specific rigid-body motions used in the method.

Acknowledgments. The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4195/04E), and is affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.
References
1. Bjorkman, M., Eklundh, J.O.: Real-time epipolar geometry estimation of binocular stereo heads. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(3) (March 2002)
2. Dornaika, F., Chung, R.: Stereo geometry from 3D ego-motion streams. IEEE Trans. on Systems, Man, and Cybernetics: Part B, Cybernetics 33(2) (April 2003)
3. Faugeras, O., Luong, T., Maybank, S.: Camera self-calibration: theory and experiments. In: Proc. 3rd European Conf. Computer Vision, Stockholm, Sweden, pp. 471–478 (1994)
4. Fermüller, C., Aloimonos, Y.: Direct perception of 3D motion from patterns of visual motion. Science 270, 1973–1976 (1995)
5. Fermüller, C., Aloimonos, Y.: Qualitative egomotion. Int'l Journal of Computer Vision 15, 7–29 (1995)
6. Heikkilä, J.: Geometric camera calibration using circular control points. IEEE Trans. Pattern Analysis and Machine Intelligence 22(10), 1066–1077 (2000)
7. Knight, J., Reid, I.: Self-calibration of a stereo rig in a planar scene by data combination. In: Proc. of the International Conference on Pattern Recognition, pp. 1411–1414 (September 2000)
8. Maybank, S.J., Faugeras, O.: A theory of self-calibration of a moving camera. Int'l Journal of Computer Vision 8(2), 123–152 (1992)
9. Takahashi, H., Tomita, F.: Self-calibration of stereo cameras. In: Proc. 2nd Int'l Conference on Computer Vision, pp. 123–128 (1988)
10. Yuan, D., Chung, R.: Direct estimation of the stereo geometry from monocular normal flows. In: International Symposium on Visual Computing (1), pp. 303–312 (2006)
11. Zhang, Z., Luong, Q.-T., Faugeras, O.: Motion of an uncalibrated stereo rig: Self-calibration and metric reconstruction. IEEE Trans. on Robotics and Automation 12(1), 103–113 (1996)
Highest Accuracy Fundamental Matrix Computation

Yasuyuki Sugaya¹ and Kenichi Kanatani²

¹ Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi, Aichi 441-8580, Japan
[email protected]
² Department of Computer Science, Okayama University, Okayama 700-8530, Japan
[email protected]
Abstract. We compare algorithms for fundamental matrix computation, which we classify into “a posteriori correction”, “internal access”, and “external access”. Through experimental comparison, we show that the 7-parameter Levenberg-Marquardt (LM) search and the extended FNS (EFNS) exhibit the best performance, and that additional bundle adjustment does not increase the accuracy to any noticeable degree.
1 Introduction
Computing the fundamental matrix from point correspondences is the first step of many vision applications including camera calibration, image rectification, structure from motion, and new view generation [6]. To compute the fundamental matrix accurately from noisy data, we need to solve an optimization problem subject to the constraint that it has rank 2, for which the typical approaches are:

A posteriori correction. We first compute the fundamental matrix without considering the rank constraint and then modify the solution so that it is satisfied (Fig. 1(a)).

Internal access. We minimally parameterize the fundamental matrix so that the rank constraint is always satisfied and do optimization in the reduced (“internal”) parameter space (Fig. 1(b)).

External access. We do iterations in the redundant (“external”) parameter space in such a way that an optimal solution that satisfies the rank constraint automatically results (Fig. 1(c)).

The aim of this paper is to find the best method by thorough performance comparison.
2 Mathematical Fundamentals
Fundamental matrix. Given two images of the same scene, a point (x, y) in the first image and the corresponding point (x′, y′) in the second satisfy the epipolar equation [6]
\left( \begin{pmatrix} x/f_0 \\ y/f_0 \\ 1 \end{pmatrix},\; \begin{pmatrix} F_{11} & F_{12} & F_{13} \\ F_{21} & F_{22} & F_{23} \\ F_{31} & F_{32} & F_{33} \end{pmatrix} \begin{pmatrix} x'/f_0 \\ y'/f_0 \\ 1 \end{pmatrix} \right) = 0,   (1)
where f_0 is a scaling constant for stabilizing numerical computation [5] (in our experiments, we set f_0 = 600 pixels). Throughout this paper, we denote the inner product of vectors a and b by (a, b). The matrix F = (F_{ij}) in Eq. (1) is of rank 2 and is called the fundamental matrix. If we define

u = (F_{11}, F_{12}, F_{13}, F_{21}, F_{22}, F_{23}, F_{31}, F_{32}, F_{33})^⊤,   (2)
ξ = (xx', xy', xf_0, yx', yy', yf_0, f_0 x', f_0 y', f_0^2)^⊤,   (3)
Equation (1) can be rewritten as

(u, ξ) = 0.   (4)
The magnitude of u is indeterminate, so we normalize it to ||u|| = 1, which is equivalent to scaling F so that ||F|| = 1. With a slight abuse of symbolism, we hereafter denote by det u the determinant of the matrix F defined by u.

Covariance matrices. Given N observed noisy correspondence pairs, we represent them as 9-D vectors {ξ_α} in the form of Eq. (3) and write ξ_α = ξ̄_α + Δξ_α, where ξ̄_α is the true value and Δξ_α the noise term. The covariance matrix of ξ_α is defined by

V[ξ_α] = E[Δξ_α Δξ_α^⊤],   (5)

where E[·] denotes expectation over the noise distribution. If the noise in the x- and y-coordinates is independent and of mean 0 and standard deviation σ, the covariance matrix of ξ_α has the form V[ξ_α] = σ² V_0[ξ_α] up to O(σ⁴), where
V_0[\xi_\alpha] =
\begin{pmatrix}
\bar{x}_\alpha^2 + \bar{x}_\alpha'^2 & \bar{x}_\alpha'\bar{y}_\alpha' & f_0\bar{x}_\alpha' & \bar{x}_\alpha\bar{y}_\alpha & 0 & 0 & f_0\bar{x}_\alpha & 0 & 0 \\
\bar{x}_\alpha'\bar{y}_\alpha' & \bar{x}_\alpha^2 + \bar{y}_\alpha'^2 & f_0\bar{y}_\alpha' & 0 & \bar{x}_\alpha\bar{y}_\alpha & 0 & 0 & f_0\bar{x}_\alpha & 0 \\
f_0\bar{x}_\alpha' & f_0\bar{y}_\alpha' & f_0^2 & 0 & 0 & 0 & 0 & 0 & 0 \\
\bar{x}_\alpha\bar{y}_\alpha & 0 & 0 & \bar{y}_\alpha^2 + \bar{x}_\alpha'^2 & \bar{x}_\alpha'\bar{y}_\alpha' & f_0\bar{x}_\alpha' & f_0\bar{y}_\alpha & 0 & 0 \\
0 & \bar{x}_\alpha\bar{y}_\alpha & 0 & \bar{x}_\alpha'\bar{y}_\alpha' & \bar{y}_\alpha^2 + \bar{y}_\alpha'^2 & f_0\bar{y}_\alpha' & 0 & f_0\bar{y}_\alpha & 0 \\
0 & 0 & 0 & f_0\bar{x}_\alpha' & f_0\bar{y}_\alpha' & f_0^2 & 0 & 0 & 0 \\
f_0\bar{x}_\alpha & 0 & 0 & f_0\bar{y}_\alpha & 0 & 0 & f_0^2 & 0 & 0 \\
0 & f_0\bar{x}_\alpha & 0 & 0 & f_0\bar{y}_\alpha & 0 & 0 & f_0^2 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.   (6)
In actual computations, the true positions (x̄_α, ȳ_α) and (x̄′_α, ȳ′_α) are replaced by their data (x_α, y_α) and (x′_α, y′_α), respectively.

We define the covariance matrix V[û] of the resulting estimate û by

V[û] = E[(P_U û)(P_U û)^⊤],   (7)

where P_U is the linear operator projecting R⁹ onto the domain U of u defined by the constraints ||u|| = 1 and det u = 0; we evaluate the error of û by projecting it onto the tangent space T_u(U) to U at u.
optimal correction
SVD correction det F = 0
(a)
det F = 0
(b)
313
det F = 0
(c)
Fig. 1. (a) A posteriori correction. (b) Internal access. (c) External access.
Geometry of the constraint. The unit normal to the hypersurface defined by det u = 0 is (8) u† = N [∇u det u], where N [ · ] denotes normalization into unit norm. It is easily shown that the constraint det u = 0 is equivalently written as (u† , u) = 0.
(9)
Since the domain U is included in the unit sphere S 8 ⊂ R9 , the vector u is everywhere orthogonal to U. Hence, {u, u† } is an orthonormal basis of the orthogonal complement of the tangent space Tu (U). It follows that the projection operator P U in Eq. (7) has the following matrix representation: P U = I − uu − u† u† .
(10)
KCR lower bound. If the noise in {ξα } is independent and Gaussian with mean 0 and covariance matrix σ 2 V0 [ξ α ], the following inequality holds for an ˆ of u [7]: arbitrary unbiased estimator u V [ˆ u] σ 2
N (P U ξ¯α )(P U ξ¯α ) − . (u, V0 [ξ α ]u) 8 α=1
(11)
Here, means that the left-hand side minus the right is positive semidefinite, and ( · )− r denotes the pseudoinverse of rank r. Chernov and Lesort [2] called the right-hand side of Eq. (11) the KCR (Kanatani-Cramer-Rao) lower bound and ˆ is not unbiased; it is sufficient showed that Eq. (11) holds up to O(σ 4 ) even if u ˆ → u as σ → 0. that u Maximum likelihood. If the noise in {ξα } is independent and Gaussian with mean 0 and covariance matrix σ 2 V0 [ξ α ], maximum likelihood (ML) estimation of u is to minimize the sum of square Mahalanobis distances J=
N
¯ (ξ α − ξ¯α , V0 [ξ α ]− 2 (ξ α − ξ α )),
α=1
(12)
314
Y. Sugaya and K. Kanatani
subject to (u, ξ¯α ) = 0, α = 1, ..., N . Eliminating the constraint by using Lagrange multipliers, we obtain [7] J=
N
(u, ξ α )2 . (u, V0 [ξ α ]u) α=1
(13)
ˆ minimizes this subject to u = 1 and (u† , u) = 0. The ML estimator u
3
A Posteriori Correction
The a posteriori correction approach first minimizes Eq. (13) without considering ˜ so as to satisfy it the rank constraint and then modifies the resulting solution u (Fig. 1(a)). A popular method is to compute the singular value decomposition (SVD) of the computed fundamental matrix and replace the smallest singular value by 0, resulting in a matrix of rank 2 “closest” to the original one in norm [5]. We call this SVD correction. A more sophisticated method is the optimal correction [7,11]. According to the statistical optimization theory [7], the covariance matrix V [˜ u] of the rank ˜ can be evaluated, so u ˜ is moved in the direction of the unconstrained solution u mostly likely fluctuation implied by V [˜ u] until it satisfies the rank constraint (Fig. 1(a)). The procedure goes as follows [7]: 1. Compute the 9 × 9 matrices ˜ = M
N
ξα ξ α , (˜ u , V [ξ u) 0 α ]˜ α=1
(14)
˜− and V0 [˜ u] = M 8. ˜ as follows (˜ ˜ ): 2. Update the solution u u† is defined by Eq. (8) for u ˜ ← N [˜ u u−
˜ † )V0 [˜ 1 (˜ u, u u]˜ u† ]. 3 (˜ u† , V0 [˜ u]˜ u† )
(15)
˜ † ) ≈ 0, return u ˜ and stop. Else, update the matrix V0 [˜ 3. If (˜ u, u u] in the form ˜u ˜ , P u˜ = I − u
V0 [˜ u] ← P u˜ V0 [˜ u]P u˜ ,
(16)
and go back to Step 2. Before doing this, we need to solve unconstrained minimization of Eq. (13), for which many method exist: the FNS (Fundamental Numerical Scheme) of Chojnacki et al. [3], the HEIV (Heteroscedastic Errors-in-Variable) of Leedan and Meer [10], and the projective Gauss-Newton iterations of Kanatani and Sugaya [8]. Their convergence properties were studies in [8].
Highest Accuracy Fundamental Matrix Computation
4
315
Internal Access
The fundamental matrix F has nine elements, on which the normalization F = 1 and the rank constraint det F = 0 are imposed. Hence, it has seven degrees of freedom. The internal access minimizes Eq. (13) by searching the reduced 7-D parameter space (Fig. 1(b)). Many types of 7-degree parameterizations have been proposed in the past [12,14], but the resulting expressions are often complicated, and the geometric meaning of the individual unknowns are not clear. This was overcome by Bartoli and Sturm [1], who regarded the SVD of F as its parameterization. Their expression is compact, and each parameter has its geometric meaning. They did tentative 3-D reconstruction using the assumed F and adjusted the reconstructed shape, the camera positions, and their intrinsic parameters so that the reprojection error is minimized; such an approach is known as bundle adjustment. Sugaya and Kanatani [13] simplified this: adopting the parameterization of Bartoli and Sturm [1], they directly minimized Eq. (13) by the LevenbergMarquardt (LM) method. Their 7-parameter LM search goes as follows: 1. Initialize F in such a way that det F = 0 and F = 1, and express it as F = U diag(cos θ, sin θ, 0)V . 2. Compute J in Eq. (13), and let c = 0.0001. 3. Compute the matrices F U and F V and the vector uθ as follows: 0 0 0
−F FU = −F −F F F
31 32 33
21 22
F23
F31 −F21 F32 −F22 F33 −F23 0 F11 0 F12 0 F13 −F11 0 −F12 0 −F13 0
uθ =
,
FV
=
0
F13 −F12 −F13 0 F11 F12 −F11 0 0 F23 −F22 −F23 0 F21 F22 −F21 0 0 F33 −F32 −F33 0 F31 F32 −F31 0
U12 V12 cos θ − U11 V11 sin θ U12 V22 cos θ − U11 V21 sin θ U12 V32 cos θ − U11 V31 sin θ U22 V12 cos θ − U21 V11 sin θ U22 V22 cos θ − U21 V21 sin θ U22 V32 cos θ − U21 V31 sin θ U32 V12 cos θ − U31 V11 sin θ U32 V22 cos θ − U31 V21 sin θ U32 V32 cos θ − U31 V31 sin θ
.
,
(17)
(18)
4. Compute the following matrix X:

   X = Σ_{α=1}^{N} ξ_α ξ_α^T / (u, V_0[ξ_α]u) − Σ_{α=1}^{N} (u, ξ_α)^2 V_0[ξ_α] / (u, V_0[ξ_α]u)^2.    (19)
5. Compute the first and (Gauss-Newton approximated) second derivatives of J as follows:

   ∇_ω J = F_U^T X u,    ∇_ω' J = F_V^T X u,    ∂J/∂θ = (u_θ, X u),    (20)

   ∇²_ω J = F_U^T M F_U,    ∇²_ω' J = F_V^T M F_V,    ∇_ωω' J = F_U^T M F_V,
   ∂∇_ω J/∂θ = F_U^T M u_θ,    ∂∇_ω' J/∂θ = F_V^T M u_θ,    ∂²J/∂θ² = (u_θ, M u_θ).    (21)

6. Compute the following matrix H:

   H = [ ∇²_ω J           ∇_ωω' J           ∂∇_ω J/∂θ
         (∇_ωω' J)^T      ∇²_ω' J           ∂∇_ω' J/∂θ
         (∂∇_ω J/∂θ)^T    (∂∇_ω' J/∂θ)^T    ∂²J/∂θ²     ].    (22)

7. Solve the simultaneous linear equations

   (H + cD[H]) ( ω; ω'; Δθ ) = −( ∇_ω J; ∇_ω' J; ∂J/∂θ ),    (23)

   for ω, ω', and Δθ, where D[ · ] denotes the diagonal matrix obtained by taking out only the diagonal elements.
8. Update U, V, and θ in the form U' = R(ω)U, V' = R(ω')V, and θ' = θ + Δθ, where R(ω) denotes rotation around N[ω] by angle ‖ω‖.
9. Update F to F' = U' diag(cos θ', sin θ', 0) V'^T.
10. Let J' be the value of Eq. (13) for F'.
11. Unless J' < J or J' ≈ J, let c ← 10c, and go back to Step 7.
12. If F' ≈ F, return F' and stop. Else, let F ← F', U ← U', V ← V', θ ← θ', and c ← c/10, and go back to Step 3.

5  External Access
The external access approach does iterations in the 9-D u-space in such a way that an optimal solution satisfying the rank constraint automatically results (Fig. 1(c)). The concept dates back to such heuristics as introducing penalties for the violation of the constraints or projecting the solution onto the constraint surface in the course of iterations, but it was Chojnacki et al. [4] who first presented a systematic scheme, which they called CFNS (Constrained FNS). Kanatani and Sugaya [9] pointed out, however, that CFNS does not necessarily converge to a correct solution and presented, in a more general framework, a new scheme called EFNS (Extended FNS), which is shown to converge to an optimal value. For fundamental matrix computation, it reduces to the following form:

1. Initialize u.
2. Compute the matrix X in Eq. (19).
3. Compute the projection matrix P_{u†} = I − u†u†^T (u† is defined by Eq. (8)).
4. Compute Y = P_{u†} X P_{u†}.
5. Solve the eigenvalue problem Y v = λv, and compute the two unit eigenvectors v_1 and v_2 for the smallest eigenvalues in absolute value.
6. Compute û = (u, v_1)v_1 + (u, v_2)v_2.
7. Compute u' = N[P_{u†} û].
8. If u' ≈ u, return u' and stop. Else, let u ← N[u + u'] and go back to Step 2.
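For illustration, a minimal NumPy sketch of this EFNS iteration is given below. It is not the authors' implementation: the 9-D data vectors ξ_α and their normalized covariances V_0[ξ_α] are assumed to be supplied by the caller, and so is the map u ↦ u† of Eq. (8), which is not reproduced in this excerpt.

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def compute_X(u, xis, V0s):
    # Matrix X of Eq. (19).
    X = np.zeros((9, 9))
    for xi, V0 in zip(xis, V0s):
        w = u @ V0 @ u
        X += np.outer(xi, xi) / w - (u @ xi) ** 2 * V0 / w ** 2
    return X

def efns(u0, xis, V0s, u_dagger, tol=1e-10, max_iter=100):
    u = normalize(u0)
    for _ in range(max_iter):
        X = compute_X(u, xis, V0s)
        ud = normalize(u_dagger(u))            # u† of Eq. (8), supplied by the caller
        P = np.eye(9) - np.outer(ud, ud)       # Step 3
        Y = P @ X @ P                          # Step 4
        lam, V = np.linalg.eigh(Y)             # Step 5 (Y is symmetric)
        i1, i2 = np.argsort(np.abs(lam))[:2]
        v1, v2 = V[:, i1], V[:, i2]
        u_hat = (u @ v1) * v1 + (u @ v2) * v2  # Step 6
        u_new = normalize(P @ u_hat)           # Step 7
        if min(np.linalg.norm(u_new - u), np.linalg.norm(u_new + u)) < tol:
            return u_new                       # Step 8: converged
        u = normalize(u + u_new)
    return u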
6  Bundle Adjustment
The transition from Eq. (12) to Eq. (13) is exact; no approximation is involved. Strictly speaking, however, the minimization of the (squared) Mahalanobis distance in the ξ-space (Eq. (13)) can be ML only when the noise in the ξ-space is Gaussian, because then and only then is the likelihood proportional to e^{−J/const}. If the noise in the image plane is Gaussian, on the other hand, the transformed noise in the ξ-space is no longer Gaussian, so minimizing Eq. (13) is not strictly ML in the image plane. In order to test how much difference is incurred, we also implemented bundle adjustment, minimizing the reprojection error (we omit the details).
7  Experiments
Figure 2 shows simulated images of two planar grid surfaces viewed from different angles. The image size is 600 × 600 pixels with a 1200 pixel focal length. We added random Gaussian noise of mean 0 and standard deviation σ to the x- and y-coordinates of each grid point independently and from them computed the fundamental matrix by 1) SVD-corrected LS, 2) SVD-corrected ML, 3) CFNS, 4) optimally corrected ML, 5) 7-parameter LM, and 6) EFNS. "LS" means least squares (also called the "8-point algorithm" [5]) minimizing Σ_{α=1}^{N} (u, ξ_α)^2, which reduces to simple eigenvalue computation [8]. For brevity, we use the shorthand "ML" for unconstrained minimization of Eq. (13), for which we used the FNS of Chojnacki et al. [3].
Fig. 2. Simulated images of planar grid surfaces and the RMS error vs. noise level. 1) SVD-corrected LS. 2) SVD-corrected ML. 3) CFNS. 4) Optimally corrected ML. 5) 7-parameter LM. 6) EFNS. The dotted line indicates the KCR lower bound.
Fig. 3. (a) The RMS error relative to the KCR lower bound. (b) Average residual minus (N − 7)σ^2. 1) Optimally corrected ML. 2) 7-parameter LM started from LS. 3) 7-parameter LM started from optimally corrected ML. 4) EFNS. 5) Bundle adjustment.
The 7-parameter LM and CFNS are initialized by LS. All iterations are stopped when the update of F is less than 10^{−6} in norm. On the right of Fig. 2 is plotted, for σ on the horizontal axis, the following root-mean-square (RMS) error D corresponding to Eq. (7) over 10000 independent trials:

D = ( (1/10000) Σ_{a=1}^{10000} ‖P_U û^{(a)}‖^2 )^{1/2}.    (24)

Here, û^{(a)} is the a-th value, and P_U is the projection matrix in Eq. (10). The dotted line is the bound implied by the KCR lower bound (the trace of the right-hand side of Eq. (11)).

Preliminary observations. We can see that SVD-corrected LS (Hartley's 8-point algorithm) performs very poorly. We can also see that SVD-corrected ML is inferior to optimally corrected ML, whose accuracy is close to the KCR lower bound. The accuracy of the 7-parameter LM is nearly the same as optimally corrected ML when the noise is small, but it gradually outperforms it as the noise increases. Best performing is EFNS, exhibiting nearly the same accuracy as the KCR lower bound. In contrast, CFNS performs as poorly as SVD-corrected ML. The reason for this is fully investigated by Kanatani and Sugaya [9]. Doing many experiments (not all shown here), we have observed that i) EFNS stably achieves the highest accuracy over a wide range of the noise level, ii) optimally corrected ML is fairly accurate and very robust to noise but gradually deteriorates as noise grows, and iii) 7-parameter LM achieves very high accuracy when started from a good initial value but is likely to fall into local minima if poorly initialized. The robustness of EFNS and optimally corrected ML is due to the fact that the computation is done in the redundant ("external") u-space, where J has the simple form of Eq. (13). In fact, we have never experienced local minima in our experiments.
Fig. 4. Left: Real images and 100 corresponding points. Right: Residuals and execution times (sec) for 1) SVD-corrected LS, 2) SVD-corrected ML, 3) CFNS, 4) optimally corrected ML, 5) direct search from LS, 6) direct search from optimally corrected ML, 7) EFNS, 8) bundle adjustment.

  method   residual   time (sec)
  1        45.550     0.000524
  2        45.556     0.00652
  3        45.556     0.01300
  4        45.378     0.00764
  5        45.378     0.01136
  6        45.378     0.01748
  7        45.379     0.01916
  8        45.379     0.02580
The deterioration of optimally corrected ML in the presence of large noise is because linear approximation is involved in Eq. (15). The fragility of the 7-parameter LM is attributed to the complexity of the function J when expressed in seven parameters, resulting in many local minima in the reduced ("internal") parameter space, as pointed out in [12]. Thus, the optimal correction of ML and the 7-parameter LM have complementary characteristics, which suggests that the 7-parameter LM initialized by optimally corrected ML may exhibit accuracy comparable to EFNS. We now confirm this.

Detailed observations. Figure 3(a) compares 1) optimally corrected ML, 2) 7-parameter LM started from LS, 3) 7-parameter LM started from optimally corrected ML, 4) EFNS, and 5) bundle adjustment. For visual ease, we plot the ratio D/D_KCR of D in Eq. (24) to the corresponding KCR lower bound. Figure 3(b) plots the corresponding average residual Ĵ (the minimum of Eq. (13)). Since direct plots of Ĵ nearly overlap, we plot its difference from (N − 7)σ^2, where N is the number of corresponding pairs. This is motivated by the fact that to a first approximation Ĵ/σ^2 is subject to a χ^2 distribution with N − 7 degrees of freedom [7], so the expectation of Ĵ is approximately (N − 7)σ^2.

We observe from Fig. 3 that i) the RMS error of optimally corrected ML increases as noise increases, yet the corresponding residual remains low, ii) the 7-parameter LM started from LS appears to have high accuracy for noise levels for which the corresponding residual is high, iii) the accuracy of the 7-parameter LM improves if started from optimally corrected ML, resulting in accuracy comparable to EFNS, and iv) additional bundle adjustment does not increase the accuracy to any noticeable degree. The seeming contradiction that solutions that are closer to the true value (measured in RMS) have higher residuals Ĵ implies that the 7-parameter LM failed to reach the true minimum of the function J, indicating the existence of local minima located close to the true value. When initialized by the optimally corrected ML, the 7-parameter LM successfully reaches the true minimum of J, resulting in the smaller Ĵ but larger RMS errors.

Real image example. We manually selected 100 pairs of corresponding points in the two images in Fig. 4 and computed the fundamental matrix from them.
The final residual J and the execution time (sec) are listed there. We used a Core2Duo E6700 2.66GHz CPU with 4GB main memory and Linux for the OS. We can see that for this example optimally corrected ML, 7-parameter LM started from either LS or optimally corrected ML, EFNS, and bundle adjustment all converged to the same solution, indicating that all are optimal. On the other hand, SVD-corrected LS (Hartley's 8-point method) and SVD-corrected ML have higher residuals than the optimal solution, and CFNS has as high a residual as SVD-corrected ML.
8  Conclusions
We compared algorithms for fundamental matrix computation (the source code is available from the authors' Web page¹), which we classified into "a posteriori correction", "internal access", and "external access". We observed that the popular SVD-corrected LS (Hartley's 8-point algorithm) has poor performance and that the CFNS of Chojnacki et al. [4], a pioneering external access method, does not necessarily converge to a correct solution, while EFNS always yields an optimal value. After many experiments (not all shown here), we concluded that EFNS and the 7-parameter LM started from optimally corrected ML exhibited the best performance. We also observed that additional bundle adjustment does not increase the accuracy to any noticeable degree.

Acknowledgments. This work was done in part in collaboration with Mitsubishi Precision Co., Ltd., Japan. The authors thank Mike Brooks, Wojciech Chojnacki, and Anton van den Hengel of the University of Adelaide, Australia, for providing software and helpful discussions. They also thank Nikolai Chernov of the University of Alabama at Birmingham, U.S.A. for helpful discussions.
References 1. Bartoli, A., Sturm, P.: Nonlinear estimation of fundamental matrix with minimal parameters. IEEE Trans. Patt. Anal. Mach. Intell. 26(3), 426–432 (2004) 2. Chernov, N., Lesort, C.: Statistical efficiency of curve fitting algorithms. Comput. Stat. Data Anal. 47(4), 713–728 (2004) 3. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: On the fitting of surfaces to data with covariances. IEEE Trans. Patt. Anal. Mach. Intell. 22(11), 1294–1303 (2000) 4. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: A new constrained parameter estimator for computer vision applications. Image Vis. Comput. 22(2), 85–91 (2004) 5. Hartley, R.I.: In defense of the eight-point algorithm. IEEE Trans. Patt. Anal. Mach. Intell. 19(6), 580–593 (1997) 1
http://www.iim.ics.tut.ac.jp/~sugaya/public-e.html
6. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK (2000) 7. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science, Amsterdam, The Netherlands 1996, Dover, New York (2005) 8. Kanatani, K., Sugaya, Y.: High accuracy fundamental matrix computation and its performance evaluation. In: Proc. 17th British Machine Vision Conf., Edinburgh, UK, September 2006, vol. 1, pp. 217–226 (2006) 9. Kanatani, K., Sugaya, Y.: Extended FNS for constrained parameter estimation. In: Proc. 10th Meeting Image Recog. Understand, Hiroshima, Japan, July 2007, pp. 219–226 (2007) 10. Leedan, Y., Meer, P.: Heteroscedastic regression in computer vision: Problems with bilinear constraint. Int. J. Comput. Vision 37(2), 127–150 (2000) 11. Matei, J., Meer, P.: Estimation of nonlinear errors-in-variables models for computer vision applications. IEEE Trans. Patt. Anal. Mach. Intell. 28(10), 1537–1552 (2006) 12. Migita, T., Shakunaga, T.: One-dimensional search for reliable epipole estimation. In: Proc. IEEE Pacific Rim Symp. Image and Video Technology, Hsinchu, Taiwan, December 2006, pp. 1215–1224 (2006) 13. Sugaya, Y., Kanatani, K.: High accuracy computation of rank-constrained fundamental matrix. In: Proc. 18th British Machine Vision Conf., Coventry, UK (September 2007) 14. Zhang, Z., Loop, C.: Estimating the fundamental matrix by transforming image points in projective space. Comput. Vis. Image Understand 82(2), 174–180 (2001)
Sequential L∞ Norm Minimization for Triangulation

Yongduek Seo¹ and Richard Hartley²

¹ Department of Media Technology, Sogang University, Korea
² Australian National University and NICTA, Canberra, Australia
Abstract. It has been shown that various geometric vision problems such as triangulation and pose estimation can be solved optimally by minimizing the L∞ error norm. This paper proposes a novel algorithm for sequential estimation. When a measurement is given at each time instance, applying the original batch bi-section algorithm is very inefficient because the number of second order constraints increases as time goes on and hence the computational cost increases accordingly. This paper shows that the upper and lower bounds, which are two input parameters of the bi-section method, can be updated through the time sequence so that the gap between the two bounds is kept as small as possible. Furthermore, we may use only a subset of all the given measurements for the L∞ estimation. This reduces the number of constraints drastically. Finally, we do not have to re-estimate the parameter when the reprojection error of the measurement is smaller than the estimation error. These three provide a very fast L∞ estimation through the sequence; our method is suitable for real-time or on-line sequential processing under L∞ optimality. This paper particularly focuses on the triangulation problem, but the algorithm is general enough to be applied to any L∞ problems.
1  Introduction

Recently, convex programming techniques have been introduced and widely studied in the area of geometric computer vision. By switching from an L2 sum-of-squared error function to an L∞ one, we are now able to find the global optimum of the error function, since the image re-projection error is of quasi-convex type and can be efficiently minimized by the bi-section method [1]. This L∞ norm minimization is advantageous because we do not need to build a linearized formulation to find an initial solution for iterative optimization like Levenberg-Marquardt; moreover, it provides the global optimum of the error function, which is geometrically meaningful, with a well-developed minimization algorithm. Applying an idea of L∞ optimization was presented by Hartley and Schaffalitzky in [2], where it was observed that many geometric vision problems have a single global minimum under the L∞ error norm.
This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korean government (MOST) (No. R01-2006-000-11374-0). This research is accomplished as the result of the research project for culture contents technology development supported by KOCCA. NICTA is a research centre funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council, through Backing Australia's Ability and the ICT Research Centre of Excellence programs.
Kahl, and Ke & Kanade, respectively, showed that the error functions are quasi-convex and can be solved by Second Order Cone Programming (SOCP) [1,3]. Vision problems that can be solved under the L∞ formulation include triangulation [2], homography estimation and camera resectioning [1,3], multiview reconstruction knowing rotations or homographies induced by a plane [1,3], camera motion recovery [4], and outlier removal [5].

Vision problems like building a 3D model or computing the camera motion of a video clip require a batch process. A fast algorithm for such batch computations is presented in [6]. However, some applications need a mechanism of sequential update, such as navigation or augmented reality, e.g., [7,8,9]. This paper is about sequentially minimizing the L∞ error norm, which has not been considered yet in the vision literature. Our research is aimed at on-line or real-time vision applications. The most important constraint in this case is that the optimization should be done within a given computation time. Therefore, we need to develop a bi-section algorithm for this purpose. This paper first introduces the triangulation problem in Section 2, analyzes the bi-section algorithm, and suggests three methods to reduce the computation time without sacrificing any accuracy. Section 3 presents our novel bi-section algorithm suitable for time-sequence applications. Experimental results are given in Section 4 and concluding remarks in Section 5.

We focus on the triangulation problem in this paper. Triangulation alone may look very restricted, but note that motion estimation knowing rotation is equivalent to triangulation, as can be found in [4]. In addition, if a branch-and-bound algorithm is adopted for rotation estimation, then fast triangulation also becomes very important for global optimization for pose estimation or multi-view motion computation.

2  Triangulation with L∞ Norm

Triangulation is to find a 3D space point X when we are given two or more pairs of a camera matrix P_t of dimension 3×4 and its image point u_t = [u_t^1, u_t^2]^T at time t. These quantities are related by the projection equation:
2 Triangulation with L∞ Norm Triangulation is to find a 3D space point X when we are given two or more pairs of camera matrix Pt of dimension 3×4 and its image point ut = [u1t , u2t ] at time t. These quantities are related by the projection equation: uit =
pit X p3t X
for
i = 1, 2,
(1)
where p_t^i denotes the i-th 4D row vector of P_t and X is a 4D vector represented by homogeneous coordinates (that is, the 4-th coordinate is one). The re-projection discrepancy d_t of X for the measurement u_t contaminated by noise is given by

d_t = ( u_t^1 − p_t^1 X / p_t^3 X,  u_t^2 − p_t^2 X / p_t^3 X ),    (2)

and the quality of the error is given by the error function e_t(X) = ‖d_t(X)‖. When we use the L2 norm,

e_t(X) = ‖d_t‖_2 = ( (d_t^1(X))^2 + (d_t^2(X))^2 )^{1/2}.    (3)

Any function f is called quasi-convex if its domain is convex and all its sub-level sets {x ∈ domain f | f(x) ≤ α} for α ∈ R are convex [10].
Algorithm 1. Bisection method: L∞ norm minimization
Input: initial upper (U) / lower (L) bounds, tolerance ε > 0.
1: repeat
2:   γ := (L + U)/2
3:   Solve the feasibility problem (7)
4:   if feasible then U := γ else L := γ
5: until U − L ≤ ε
The error function in Equation (3) is of convex-over-concave form and can be shown to be a quasi-convex function in the convex domain D = {X | p_t^3 X ≥ 0}, which means that the scene is in front of the camera [11,1]. Given a bound γ, the inequality e_t ≤ γ defines a set C_t of X:

C_t = {X | e_t(X) ≤ γ}.    (4)
Note that the feasible set C_t is due to the t-th measurement u_t; C_t is called a second order cone due to Equation (3). The bound γ is called the radius of the cone in this paper. Note that Equation (4) is given by the circular disk ‖d_t‖_2 ≤ γ in the image plane. The set C_t defines the cone whose apex is at the location of the camera center and whose cutting shape with the image plane is the circular disk ‖d_t‖_2 ≤ γ. The vector e = [e_1, e_2, ..., e_T] represents the error vector of T measurements. A feasible set F_γ for a given constant γ is defined as the intersection of all the cones:

F_γ = ∩_{t=1}^{T} {X | e_t ≤ γ}    (5)
    = {X | e_1 ≤ γ} ∩ ... ∩ {X | e_T ≤ γ}.    (6)
We now have the feasibility problem for a given constant γ:

find X  subject to  e_t(X) ≤ γ,  t = 1, ..., T.    (7)
The feasible set F_γ is convex because it is the intersection of (ice-cream-shaped) convex cones. Indeed, the feasibility problem (7) has already been investigated and is well known in the area of convex optimization; the solution X can be obtained by an SOCP solver. The L∞ norm of e is defined to be the maximum of the e_t's, and the L∞ triangulation problem is to find the smallest γ that yields a non-empty feasible set F_γ and its corresponding X. This can also be written as a min-max optimization problem

min_X max {e_1, e_2, ..., e_T},    (8)
and the global optimum can be found by the bi-section method presented in Algorithm 1. It consists of repeatedly solving the feasibility problem, adjusting the bound γ.

Lemma 1. Since we minimize the maximum error, we do not have to use all the measurements. There exist subsets of measurements that result in the same estimation.

A reduced number of measurements will decrease the computation time, which is necessary for sequential applications. What we have to do is to choose some among those sequential measurements. Our approach will be provided in the next section together with our sequential bi-section algorithm.
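To make the procedure concrete, the following sketch expresses the feasibility problem (7) as an SOCP and wraps it in the bi-section loop of Algorithm 1. It is only an illustration under our own assumptions: it uses the cvxpy modelling package (not mentioned in the paper), P_list holds the 3×4 camera matrices and u_list the measured image points, and the paper's own C/C++ implementation is not shown.

import numpy as np
import cvxpy as cp

def feasible(P_list, u_list, gamma):
    # Problem (7): find X with e_t(X) <= gamma for all t.  Each constraint is the
    # cone ||(u_t^1 w - p_t^1 X, u_t^2 w - p_t^2 X)|| <= gamma * w with w = p_t^3 X > 0.
    Xh = cp.Variable(4)                        # homogeneous 3D point
    cons = [Xh[3] == 1]
    for P, u in zip(P_list, u_list):           # P: 3x4 array, u: length-2 array
        w = P[2, :] @ Xh
        cons += [cp.SOC(gamma * w, u * w - P[:2, :] @ Xh), w >= 1e-9]
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    ok = prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)
    return ok, (None if Xh.value is None else Xh.value[:3])

def bisection(P_list, u_list, lower, upper, tol=1e-6):
    # Algorithm 1: shrink [lower, upper] around the optimal L-infinity error.
    X_best = None
    while upper - lower > tol:
        gamma = 0.5 * (lower + upper)
        ok, X = feasible(P_list, u_list, gamma)
        if ok:
            upper, X_best = gamma, X
        else:
            lower = gamma
    return upper, X_best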
3  Bisectioning for Sequential Update

Problem 1. (Original Batch Problem) Given a set of image matches {u_i, i = 1...T}, find their 3D point X that is optimal under the L∞ error norm.

As we mentioned in Section 1, the solution of this problem can be obtained by the bi-section method shown in Algorithm 1. From now on, the optimal solution with T measurements is represented by X_T^∞ and the corresponding minimum error by e_T^∞. Now let us cast our sequential problem.

Problem 2. (Sequential Problem) The L∞ estimate X_T^∞ has been computed given image matches u_i, i = 1, ..., T. Now a new measurement u_{T+1} arrives. Find the optimal estimate X_{T+1}^∞.

Obviously, we might apply Algorithm 1 using all the T + 1 measurements again from scratch. However, we want to do it more efficiently in this paper; our first goal is to reduce the number of SOCP repetitions during the bisection algorithm.

Lemma 2. If the re-projection error e_{T+1}(X_T^∞) for u_{T+1} is smaller than e_T^∞, that is, e_{T+1} ≤ e_T^∞, then no further minimization is necessary, and we can set e_{T+1}^∞ = e_T^∞ and X_{T+1}^∞ = X_T^∞.

If e_{T+1} ≤ e_T^∞, the feasible cone C_{T+1} = {X | e_{T+1}(X) ≤ e_T^∞} for u_{T+1} is already a subset of F_{e_T^∞} (i.e., γ = e_T^∞). Therefore, we do not have to run bisectioning to update the estimate. The only computation necessary is to evaluate the re-projection error e_{T+1}. This is because the bisection method is independent of the order of the input measurements {u_1, ..., u_{T+1}}. The estimate X_T^∞ is already optimal and running the bisection algorithm with T + 1 measurements will result in the same output: X_{T+1}^∞ = X_T^∞. Note that due to Lemma 2 the computational cost for evaluating e_{T+1} is much less than the cost of running the bisection method.
Lemma 3. Otherwise (i.e., e_T^∞ < e_{T+1}), we run the bisection algorithm but with different initial upper and lower bounds: U := e_{T+1}, L := e_T^∞.

In this case, e_T^∞ < e_{T+1}, we have

F_{e_T^∞} ⊂ C_{T+1} = {X | e_{T+1}(X) ≤ γ},  where γ = e_{T+1}(X_T^∞),    (9)
due to the fact that e_T^∞ < e_{T+1}; therefore, the upper bound for the feasibility of X given T + 1 measurements can be set to U := e_{T+1}. In other words, the intersection of the T + 1 cones is non-empty when the cones are of radius e_{T+1}. It is natural that the initial lower bound be set to zero, L_0 := 0, to run the bisection algorithm. However, a lower bound greater than zero may reduce the number of iterations during the bi-sectioning. The feasible set F_γ^{T+1} with bound γ up to time T + 1 can be written as

F_γ^{T+1} = ∩_{t=1}^{T+1} {X | e_t ≤ γ}    (10)
          = [C_1 ∩ ... ∩ C_T] ∩ C_{T+1}    (11)
          = F_γ^T ∩ C_{T+1}.    (12)
Algorithm 2. Sequential bi-section method with measurement selection.
Input: Measurement set M, selected measurements S ⊂ M.
 1: e_{T+1} := ReprojectionError(u_{T+1}, X_T^∞)
 2: if e_{T+1} > e_T^∞ then
 3:   bool flag = FALSE
 4:   M := M ∪ {u_{T+1}}, U := e_{T+1}, L := e_T^∞
 5:   repeat
 6:     (e_{T+1}^∞, X_{T+1}^∞) := BisectionAlgorithm(S, U, L)
 7:     (e_max, t_max) := ReprojectionErrors(M \ S, X_{T+1}^∞)
 8:     if e_max < e_{T+1}^∞ then
 9:       flag = TRUE
10:     else
11:       S := S ∪ {u_{t_max}}
12:       L := e_{T+1}^∞
13:     end if
14:   until flag = TRUE
15: end if
If the bound γ is smaller than e_T^∞, the intersection of the first T cones results in an empty set, because γ = e_T^∞ is the smallest bound for F_γ^T found by the bi-section algorithm using T measurements. Consequently, it will make the total feasible set F_γ^{T+1} empty, or non-feasible. Therefore, we see that L_0 = e_T^∞ is the greatest lower bound up to time T + 1 at which to execute the bi-section algorithm. Due to Lemma 3, we now have a much reduced gap between the initial upper and lower bounds; this decreases the number of iterations during the execution of the bi-section method.
Fig. 1. A synthetic data sequence. Initial image location was at (0, 0), and each point was generated by Brownian camera motion. In total, a hundred data points were generated as an input sequence.
3.1  Measurement Selection

From Lemma 1, we know that there is a possibility to reduce the number of measurements, or constraints, in the SOCP without any accuracy loss. Here we explain our algorithm for selecting measurements, presented in Algorithm 2.
Fig. 2. Evolution of the L∞ estimation error through time. The initial error was from the first two measurements. The red line shows changes of the L∞ error e^∞; the green line (impulse style) at each time t corresponds to the re-projection error e_t. When e^∞ < e_t (when the green line goes above the red line), our bisectioning re-computed the estimate X_t^∞. Otherwise, no more computation was necessary. The blue line denotes the evolution of the RMS error for X_t^∞.
If we need to solve the feasibility problem due to the condition in Lemma 3 (e_T^∞ < e_{T+1}), then we include u_{T+1} in the set S of selected measurements and run the bi-section algorithm to get the estimation results (e_{T+1}^∞, X_{T+1}^∞) (6th line). Using this estimate, we evaluate the reprojection errors of the un-selected measurements (7th line). If the estimation error is greater than the maximum error among the un-selected ones, i.e., e_max < e_{T+1}^∞, then we are done (8th and 9th lines). Otherwise, we include the measurement u_{t_max} in the set S of selected measurements (11th line) and repeat the operation. The lower bound is then set to the new value (12th line).
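A minimal Python sketch of this sequential update (Algorithm 2) follows. The helper names and data layout are ours: M maps a measurement index to its camera/point pair and is assumed to already contain the new measurement, and run_bisection is any routine that minimizes the L∞ error over the selected subset, for example the bi-section sketch of Section 2.

import numpy as np

def reprojection_error(P, u, X):
    # e_t(X) of Eq. (3) for one view.
    x = P @ np.append(X, 1.0)
    return np.linalg.norm(u - x[:2] / x[2])

def sequential_update(M, S, e_inf, X_inf, new_idx, run_bisection):
    P, u = M[new_idx]
    e_new = reprojection_error(P, u, X_inf)
    if e_new <= e_inf:                        # Lemma 2: nothing to recompute
        return S, e_inf, X_inf
    S = S | {new_idx}                         # select the new measurement
    U, L = e_new, e_inf                       # Lemma 3: tightened initial bounds
    while True:
        e_inf, X_inf = run_bisection(S, U, L)
        rest = [(reprojection_error(*M[t], X_inf), t) for t in M if t not in S]
        worst = max(rest) if rest else None
        if worst is None or worst[0] < e_inf: # e_max < e^inf: all other cones satisfied
            return S, e_inf, X_inf
        S = S | {worst[1]}                    # add the worst unselected measurement
        L = e_inf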
4  Experiments

We implemented the algorithm in C/C++ and tested it on synthetic and real data. First, experiments with a synthetic data set were done. A data set S was generated as follows. The center of the first camera was located at C_1 = [0, 0, −1]^T and moved randomly with standard deviation σ_C of a zero-mean Gaussian. That is,

C_t = C_{t−1} + σ_C [N, N, N]^T,    (13)
where N represents a random value from the standard Gaussian distribution. Then the space point X_0 = [0, 0, 0]^T was projected to the image plane, u_t = [u_t^1, u_t^2]^T, with focal length f = 1000. Gaussian noise of level σ in the image space was then injected: u_t ← u_t + σ[N, N]^T. Figure 1 shows the trajectory of a random set S when σ = 1.5. All the camera parameters were assumed to be known in this paper.

Figure 2 shows the sequential evolution of the L∞ estimation error e_t^∞ through time t = 2, ..., 100, using the image data plotted in Figure 1. The initial error was computed from the first two measurements.
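For reference, a small NumPy sketch of this data generation is shown below. The rotation of each camera and the value of σ_C are not specified in the paper, so the identity rotation, K = diag(f, f, 1), and the chosen σ_C are assumptions of ours made purely for illustration.

import numpy as np

def make_sequence(T=100, sigma_C=0.05, sigma=1.5, f=1000.0, seed=0):
    rng = np.random.default_rng(seed)
    K = np.diag([f, f, 1.0])
    X = np.array([0.0, 0.0, 0.0, 1.0])               # the space point X0 = (0, 0, 0)
    C = np.array([0.0, 0.0, -1.0])                   # first camera center C1
    cameras, points = [], []
    for _ in range(T):
        P = K @ np.hstack([np.eye(3), -C[:, None]])  # P_t = K [I | -C_t]
        x = P @ X
        u = x[:2] / x[2] + sigma * rng.standard_normal(2)   # noisy image point
        cameras.append(P)
        points.append(u)
        C = C + sigma_C * rng.standard_normal(3)     # Brownian motion, Eq. (13)
    return cameras, points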
Fig. 3. Evolution of the number of selected measurements. Initially two are necessary for triangulation. Only 16 among 100 measurements were selected. The increments were exactly at the time instances when we had e_t^∞ < e_t.
The red line shows e_t^∞; the green line (impulse style) at each time t corresponds to the re-projection error e_t(X_{t−1}^∞) defined in Equation 3. When e_t^∞ < e_t (that is, when the green line goes above the red line in this graph), our bisectioning was applied to compute the estimate X_t^∞ together with the scheme of measurement selection. Otherwise, no update was necessary. The blue line shows the evolution of the RMS error for comparison:

RMS_t = ( (1/t) Σ_{i=1}^{t} ‖d_i(X_t^∞)‖_2^2 )^{1/2}.    (14)
Fig. 4. Accumulated computation time. Computation was necessary when a new measurement was included into the measurement set M, i.e., when e_t^∞ < e_t. The total computation took 3,414 clocks (time units) when we simply adopted the batch algorithm using every measurement at every time step without any speed-improving method; with adjustment of the upper and lower bounds it took 100 clocks, which was then reduced to 63 clocks with measurement selection.
Fig. 5. The ratio of the accumulated time to the batch computation time for the case of 100 data sets. The average ratio was almost 1.0, which meant that the speed of our sequential update algorithm was almost the same as that of one batch computation on the average.
Figure 3 shows the evolution of the number of selected measurements. In this experiment, only 16 measurements among 100 were selected during the sequential update. Note that the number of selected measurements increases when e_t^∞ < e_t. Figure 4 shows the accumulated computation time. Main computation was done only when the feasibility computation was necessary, as can be seen in the graph. The total computation took 3,414 clocks (time units) when we simply adopted the batch algorithm using every measurement at every time step without any speed-improving method; with adjustment of the upper and lower bounds it took 100 clocks, which was then reduced to 63 clocks with measurement selection. The batch computation took 157 clocks.
Fig. 6. Results of a real experiment. Evolution of the L∞ estimation error through time. The initial error was from the first two measurements. The red line shows changes of the L∞ error e^∞; the green line (impulse style) at each time t corresponds to the re-projection error e_t. When e^∞ < e_t (when the green line goes above the red line), our bisectioning re-computed the estimate X_t^∞. Otherwise, no more computation was necessary. The blue line denotes the evolution of the RMS (L2) error for X_t^∞.
Fig. 7. Evolution of L∞ estimation error through time for the 25th 3D point from Corridor data set
Fig. 8. L∞ estimation error plot for Corridor sequence
Figure 5 shows that such a speed-up was attained on average. We did the same experiments using different data sets. A hundred repetitions showed that the average ratio of computation time was 1.0; this experimentally implies that the speed of our sequential algorithm is almost the same as running the batch algorithm once with all measurements. We then generated 1000 data sets {S_k, k = 1, ..., 1000} and repeated the same experiment for each. The differences of re-projection errors diff_err_k = |e^∞(X_bat) − e^∞(X_seq)| were computed to make sure that our algorithm results in the same estimation, where e^∞ is the L∞ norm (maximum) of all the errors of the 100 data in the set S_k. The average of the differences was approximately 7 × 10^{−9}, which means that the two estimates were numerically the same. Figure 6 shows the same illustration as Figure 2 for a real experiment with a sequence of length 162. It took 305 time units (the batch algorithm took 719 time units). Notice that the error converges early in the sequence and there is almost no time update. This also shows that our algorithm is very suitable when we need to update a large number of triangulation problems at the same time.
Finally, experiments with real data were done with the Corridor sequence¹. Among the tracks of corners, those which had more than three matches were chosen. Figure 7 shows an exemplary L∞ evolution graph, as Figure 2 did. Figure 8 shows the plot of L∞ errors for all the data sequences whose length was longer than three.
5  Conclusion

This paper considered how to apply the bisection method for L∞ norm minimization to a sequential situation. The computation with the bisection method does not have to be executed during the sequence when the re-projection of X_{t−1}^∞ to the t-th camera yields a smaller error than e^∞; otherwise, bisectioning is necessary, but with lower and upper bounds whose gap is narrower, resulting in faster computation. A measurement selection scheme was also provided to reduce the computational cost by decreasing the number of measurement cones (constraints). Our mathematical reasoning was provided, and the performance of the sequential algorithm was shown via synthetic and real experiments; our method is suitable for real-time or on-line applications.
References 1. Kahl, F.: Multiple view geometry and the L∞ -norm. In: Proc. Int. Conf. on Computer Vision, Beijing, China, pp. 1002–1009 (2005) 2. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2004) 3. Ke, Q., Kanade, T.: Quasiconvex optimization for robust geometric reconstruction. In: Proc. Int. Conf. on Computer Vision, Beijing, China (2005) 4. Sim, K., Hartley, R.: Recovering camera motion using L∞ minimization. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2006) 5. Sim, K., Hartley, R.: Removing outliers using the L∞ norm. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2006) 6. Seo, Y., Hartley, R.: A fast method to minimize L∞ error norm for geometric vision problems. In: Proc. Int. Conf. on Computer Vision (2007) 7. Robert, L., Buffa, M., H´ebert, M.: Weakly-calibrated stereo perception for rover navigation. In: Fifth Inter. Conf. Comp. Vision (1995) 8. Lamb, P.: Artoolkit (2007), http://www.hitl.washington.edu/artoolkit 9. Bauer, M., Schlegel, M., Pustka, D., Navab, N., Klinker, G.: Predicting and estimating the accuracy of optical tracking system. In: IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 43–51 (2007) 10. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge Press, Cambridge (2004) 11. Hartley, R.I.: The chirality. Int. Journal of Computer Vision 26, 41–61 (1998)
1 http://www.robots.ox.ac.uk/~vgg/data.html
Initial Pose Estimation for 3D Model Tracking Using Learned Objective Functions

Matthias Wimmer¹ and Bernd Radig²

¹ Faculty of Science and Engineering, Waseda University, Tokyo, Japan
² Institut für Informatik, Technische Universität München, Germany
Abstract. Tracking 3D models in image sequences essentially requires determining their initial position and orientation. Our previous work [14] identifies the objective function as a crucial component for fitting 2D models to images. We state preferable properties of these functions and we propose to learn such a function from annotated example images. This paper extends this approach by making it appropriate for fitting 3D models to images as well. The correctly fitted model represents the initial pose for model tracking. However, this extension induces nontrivial challenges such as out-of-plane rotations and self-occlusion, which cause large variations in the part of the model's surface that is visible in the image. We solve this issue by connecting the input features of the objective function directly to the model. Furthermore, sequentially executing objective functions specifically learned for different displacements from the correct position yields highly accurate objective values.
1  Introduction
Model-based image interpretation is appropriate for extracting high-level information from single images and from image sequences. Models induce a priori knowledge about the object of interest and thereby reduce the large amount of image data to a small number of model parameters. However, the great challenge is to determine the model parameters that best match a given image. For interpreting image sequences, model tracking algorithms fit the model to the individual images of the sequence. Each fitting step benefits from the pose estimate derived from the previous image of the sequence. However, determining the pose estimate for the first image of the sequence has not been sufficiently solved yet. The challenge of this so-called initial pose estimation is identical to the challenge of fitting models to single images. Our previous work identifies the objective function as an essential component for fitting models to single images [14]. This function evaluates how well a particular model fits to an image. Without losing generality, we consider lower values to represent a better model fitness. Therefore, algorithms search for the model parameters that minimize the objective function. Since the described methods are independent of the used fitting algorithm, we do not elaborate on them, but refer to Hanek et al. [5] for a recent overview and categorization of fitting algorithms.
This research is partly funded by a JSPS Postdoctoral Fellowship for North American and European Researchers (FY2007).
As our approach has only been specified for 2D models so far, this paper extends it to be capable of handling 3D models as well, considering a rigid model of a human face. In contrast to artificial objects, such as cars, faces vary highly in shape and texture, and therefore the described face model fitting task represents a particular difficulty. Many researchers engage in fitting 3D models. Lepetit et al. [7] treat this issue as a classification problem and use decision trees for its solution. As an example implementation, the ICP algorithm [2,12] minimizes the square error of the distance between the visible object and the model projected to the image.

Problem Statement. Although the accuracy of model fitting heavily depends on the objective function, it is often designed by hand using the designer's intuition about a reasonable measure of fitness. Afterwards, its appropriateness is subjectively determined by inspecting the objective function on example images and example model parameters. If the result is not satisfactory, the function is tuned or redesigned from scratch [11,4], see Figure 1 (left). Therefore, building the objective function is very time consuming and the function does not guarantee to yield accurate results.
Fig. 1. The procedures for designing (left) and learning (right) objective functions
Solution Outline. Our novel approach focuses on the root problem of model fitting: we improve the objective function rather than the fitting algorithm. As a solution to this challenge, we propose to conduct a five-step methodology that learns robust local objective functions from annotated example images. So far, we have investigated this approach for 2D models [14]. This paper extends our methodology in order to generate objective functions that are capable of handling 3D models as well. The obtained functions consider specific aspects of 3D models, such as out-of-plane rotations and self-occlusion. We compute the features not in the 2D image plane but in the space of the 3D model. This requires connecting the individual features directly to the model.

Contributions. The resulting objective functions work very accurately in real-world scenarios and they are able to solve the challenge of initial pose estimation that is required by model tracking. This easy-to-use approach is applicable to various image interpretation scenarios and requires the designer just to annotate example images with the correct model parameters. Since no further computer vision expertise is necessary, this approach has great potential for commercialization.
The paper proceeds as follows. In Section 2, we sketch the challenge of model-based image interpretation. In Section 3, we propose our methodology to learn accurate local objective functions from annotated training images, with particular focus on 3D models. Section 4 conducts experimental evaluations that verify this approach. Section 5 summarizes our approach and shows future work.
2  Model-Based Image Interpretation
Rigid 3D models represent the geometric properties of real-world objects. A six-dimensional parameter vector p = (t_x, t_y, t_z, α, β, γ)^T describes the position and orientation. The model consists of 1≤n≤N three-dimensional model points specified by c_n(p). Figure 2 depicts our face model with N = 214 model points. Fitting 3D models to images requires two essential components: The fitting algorithm searches for the model parameters that match the content of the image best. For this task, it searches for the minimum of the objective function f(I, p), which determines how well a model p matches an image I. As in Equation 1, this function is often subdivided into N local components f_n(I, x), one for each model point [7,1,9]. These local functions determine how well the n-th model point fits to the image. The advantage of this partitioning is that designing the local functions is more straightforward than designing the global function, because only the image content in the vicinity of one projected model point needs to be taken into consideration. The disadvantage is that dependencies and interactions between local errors cannot be combined.

f(I, p) = Σ_{n=1}^{N} f_n(I, c_n(p))    (1)
2.1  Characteristic Search Directions of Local Objective Functions
Fitting algorithms for 2D contour models usually search the minimum of the objective function along the perpendicular to the contour [3]. The objective function
Fig. 2. Our 3D model of a human face correctly fitted to images
Fig. 3. Due to affine transformations of the face model the characteristic search directions will not be parallel to the image plane, in general. These three images show how the transformations affect one of these directions.
computes its value from image features in the vicinity of the perpendicular. We stick to this procedure, and therefore we create local objective functions that are specific to a search direction. These so-called characteristic directions represent three-dimensional lines, and we connect them tightly to the model's geometric structure, i.e. to the individual model points c_n(p). Transforming the model's pose transforms these directions equivalently, see Figure 3. The objective function computes its value from three-dimensional features; however, their value is calculated by projecting them to the image plane. An image is most descriptive for a characteristic direction if they are parallel. Unfortunately, transforming the model will usually yield characteristic directions that are not parallel to the image plane. Therefore, we consider not only one but 1≤l≤L characteristic directions per model point, which are differently oriented. These directions may be arbitrary, but we prefer them to be pairwise orthogonal. This yields L objective functions f_{n,l}(I, x) for each model point. In order not to increase computation time, we consider only the characteristic direction that is most parallel to the image.

f_n(I, x) = f_{n,g_n(p)}(I, x)    (2)
The model point’s local objective function fn is computed as in Equation 2. The indicator gn (p) computes the index of the characteristic direction that is most significant for the current pose p of the model, i.e. that is most parallel to the image plane.
3  Learning Objective Functions from Image Annotations
Ideally, local objective functions have two specific properties. First, they should have a global minimum that corresponds to the best model fit. Otherwise, we cannot be certain that determining the true minimum of the local objective function indicates the intended result. Second, they should have no other local minima. This implies that any minimum found corresponds to the global minimum, which facilitates search. A concrete example of an ideal local objective
function that has both properties is shown in Equation 3. p_I denotes the model parameters with the best model fit for a certain image I:

f_n(I, x) = |x − c_n(p_I)|    (3)
Unfortunately, f_n cannot be applied to unseen images, for which the best model parameters p_I are not known. Nevertheless, we apply this ideal objective function to annotated training images and obtain ideal training data for learning a local objective function f_{n,l} for the model point n and its characteristic direction l. The key idea behind our approach is that since the training data is generated by an ideal objective function, the learned function will also be approximately ideal. This has already been shown in [14]. Figure 1 (right) illustrates the proposed five-step procedure.

Step 1: Annotating Images with Ideal Model Parameters. As in Figure 2, a database of 1≤k≤K images I_k is manually annotated with the ideal model parameters p_{I_k}, which are necessary to compute the ideal objective functions f_n. This is the only laborious step of the entire procedure.
Fig. 4. Further annotations are generated by moving along the line that is longest when projected in the image. That line is colored white here. The directions illustrated in black are not used. Annotations on one of the directions that are not used are also shown to demonstrate that this direction is too short to be used.
Fig. 5. This comprehensive set of image features is provided for learning local objective functions. In our experiments, we use a total number of A=6·3·5·5=450 features.
Step 2: Generating Further Image Annotations. The ideal objective function returns the minimum f_n(I, x) = 0 for all manual annotations x = c_n(p_{I_k}). These annotations are not sufficient to learn the characteristics of f_n. Therefore, we also generate annotations x ≠ c_n(p_{I_k}), for which f_n(I, x) ≠ 0. In general, any 3D position x may represent one of these annotations; however, we sample −D≤d≤D positions along the characteristic direction with a maximum displacement Δ (learning radius), see Figure 4. This procedure learns the calculation rules of the objective function more accurately. In this paper, we use L = 3 characteristic directions, because the model points vary within the 3D space. Note that g_n(p_{I_k}) selects the most significant direction for the n-th model point.

Step 3: Specifying Image Features. Our approach learns the calculation rules of a mapping f_{n,l} from an image I_k and a location x_{k,n,d,l} to the value of f_n(I_k, x_{k,n,d,l}). Since f_{n,l} has no knowledge of p_I, it must compute its result from the image content. Instead of learning a direct mapping from the pixel values in the vicinity of x to f_n, we compute image features first. Note that x does not denote a position in I but in 3D space. However, the corresponding pixel position is obtained via perspective projection. Our idea is to provide a multitude of 1≤a≤A features, and let the training algorithm choose which of them are relevant to the calculation rules of f_{n,l}. Each feature h_a(I, x) is computed from an image I and a position x and delivers a scalar value. Our approach currently relies on Haar-like features [13,8] of different styles and sizes. Furthermore, the features are not only computed at the location of the model point itself, but also at positions on a grid within its vicinity; see Figure 5.
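To illustrate how each feature h_a(I, x) yields a single scalar, the sketch below evaluates one two-rectangle Haar-like feature with an integral image. It shows only one feature style; the full feature set of [13,8] used in the paper is considerably richer.

import numpy as np

def integral_image(img):
    # Zero-padded integral image: ii[y, x] = sum of img[:y, :x].
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return ii

def box_sum(ii, y, x, h, w):
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    # Difference between the upper and lower halves of a (2h x w) box at (y, x);
    # each such evaluation is one scalar feature value h_a(I, x).
    return box_sum(ii, y, x, h, w) - box_sum(ii, y + h, x, h, w)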
Fig. 6. The grid of image features moves along with the displacement
This variety of styles, sizes, and locations yields a set of A = 450 image features, as used in our experiments in Section 4. This multitude of features enables the learned objective function to exploit the texture of the image at the model point and in its surrounding area. When moving the model point, the image features move along with it, leading their values to change, see Figure 6.

Step 4: Generating Training Data. The result of the manual annotation step (Step 1) and the automated annotation step (Step 2) is a list of correspondences between positions x and the corresponding value of f_n. Since K images, N model points, and 2D+1 displacements are landmarked, these correspondences amount to K·N·(2D+1). Equation 4 illustrates the list of these correspondences.

[ I_k, x_{k,n,d,l}, f_n(I_k, x_{k,n,d,l}) ]    (4)

[ h_1(I_k, x_{k,n,d,l}), ..., h_A(I_k, x_{k,n,d,l}), f_n(I_k, x_{k,n,d,l}) ]    (5)

with 1≤k≤K, 1≤n≤N, −D≤d≤D, l = g_n(p_{I_k})
Applying the list of image features to the list of correspondences yields the training data in Equation 5. This step simplifies matters greatly. Since each feature returns a single value, we hereby reduce the problem of mapping the vast amount of image data and the related pixel locations to the corresponding target value to the problem of mapping a list of feature values to the target value.

Step 5: Learning the Calculation Rules. Given the training data from Equation 5, the goal is now to learn a function f_{n,l}(I, x) that approximates f_n(I, x). The challenge is that f_{n,l} is not provided knowledge of p_I. Therefore, it can be applied to previously unseen images. We obtain this function by training a model tree [10,15] with the comprehensive training data from Equation 5. Note that the N·L objective functions f_{n,l} have to be learned individually. However, learning f_{n,l} only requires the records of the training data (Equation 5) where n and l match. Model trees are a generalization of regression trees and, in turn, of decision trees. Whereas decision trees have nominal values at their leaf nodes, model trees have line segments, allowing them to also map features to a continuous value, such as the value returned by the ideal objective function. One of the reasons for deciding for model trees is that they tend to select only features that are relevant to predict the target value. Therefore, they pick a small number M_n ≪ A of Haar-like features from the provided set of A features.

After executing these five steps, we obtain a local objective function f_{n,l} for each model point n and each direction l. It can now be called with an arbitrary location x of an arbitrary image I. The learned model tree calculates the values of the necessary features from the image content and yields the result value.
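Steps 4 and 5 reduce to fitting a tree on (feature vector, ideal value) pairs. The sketch below uses scikit-learn's regression tree merely as an easily available stand-in for the M5 model tree used in the paper; the feature matrix H and target vector correspond to Equations 4 and 5, and the names are ours.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def learn_local_objective(H, targets):
    # H: array of shape (K*(2D+1), A) with the feature values h_a for one model
    # point n and direction l; targets: the ideal objective values at the same
    # sample positions.  Returns a predictor approximating f_{n,l}.
    tree = DecisionTreeRegressor(min_samples_leaf=5)
    tree.fit(H, targets)
    return tree        # tree.predict(features_at_x) estimates f_{n,l}(I, x)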
4  Experimental Evaluation
In this section, two experiments show the capability of fitting algorithms equipped with a learned objective function to fit a face model to previously unseen images.
Fig. 7. If the initial distance to the ideal model points is smaller than the learning radius this distance is further reduced with every iteration. Otherwise, the result of the fitting step is unpredictable and the model points are spread further.
The experiments are performed on 240 training images and 80 test images with a non-overlapping set of individuals. Furthermore, the images differ in face pose, illumination, and background. Our evaluations randomly displace the face models from the manually specified pose. The fitting process determines the position of every model point by exhaustively searching along the most significant characteristic direction for the global minimum of the local objective function. Afterwards, the model parameters p are approximated. The subsequent figures illustrate the average point-to-point error of the model points between the obtained model p and the manually specified model pI . Our first evaluation investigates the impact of executing the fitting process with a different number of iterations. Figure 7 illustrates that each iteration improves the model parameters. However, there is a lower bound to the quality of the obtained model fit. Obviously, more than 10 iterations do not improve the fraction of well-fitted models significantly. Note that the objective function’s value is arbitrary for high distances from the correct position. These models are further distributed with every iteration. Our second experiment conducts model fitting by subsequently applying the fitting process with two different objective functions f A and f B learned with decreasing learning radii Δ. f A with a large Δ is able to handle large initial displacements in translation and rotation. However, the obtained fitting result gets less accurate, see Figure 8. The opposite holds true for f B . The idea is to apply a local objective function learned with large Δ first and then gradually apply objective functions learned with smaller values for Δ. As opposed to the
Fig. 8. By combining fitting algorithms using objective functions with different learning radii we obtain results that show the strengths of both objective functions. The sequential approach shows the tolerance to errors of f^A and the accuracy of f^B.
previous experiment, where we iteratively executed the same objective function, this iteration scheme executes different objective functions, which compensates for the weakness of one function by the strength of another. The advantage of concatenating algorithms with objective functions with decreasing learning radii, compared to iterating one algorithm several times, is illustrated by Figure 8. Sequentially applying f^A and f^B is significantly better than both of the other algorithms. Note that we execute each experiment with ten iterations, because we do not expect any improvement in quality with a higher number of iterations, see our first experiment. Therefore, the accuracy obtained from the sequential execution is not based on the fact that some additional iterations are applied.
5  Summary and Conclusion
In this paper, we extended the five-step methodology for learning local objective functions that we previously introduced for 2D models [14]. 3D models can now be considered as well. This approach automates many critical decisions and the remaining manual steps require little domain-dependent knowledge. Furthermore, its process does not contain any time-consuming loops. These features enable non-expert users to customize the fitting application to their specific domain. The resulting objective function is not only able to process objects that look similar, such as in [7,6], but also objects that differ significantly in shape and texture, such as human faces. Being trained with a limited number of annotated images as
described in Section 3, the resulting objective function is able to fit faces that are not part of the training data as well. However, the database of annotated faces must be representative enough. If there were no bearded men in the training data, the algorithm would have problems fitting the model to an image of such a man. The disadvantage of our approach is the laborious annotation step. Gathering and annotating hundreds of images requires several weeks.
References 1. Allezard, N., Dhome, M., Jurie, F.: Recognition of 3D textured objects by mixing view-based and model-based representations. ICPR, 960–963 (September 2000) 2. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992) 3. Cootes, T.F., Taylor, C.J., Lanitis, A., Cooper, D.H., Graham, J.: Building and using flexible models incorporating grey-level information. In: ICCV, pp. 242–246 (1993) 4. Cristinacce, D., Cootes, T.F.: Facial feature detection and tracking with automatic template selection. In: 7th IEEE International Conference on Automatic Face and Gesture Recognition, April 2006, pp. 429–434. IEEE Computer Society Press, Los Alamitos (2006) 5. Hanek, R.: Fitting Parametric Curve Models to Images Using Local Self-adapting Seperation Criteria. PhD thesis, Technische Universit¨ at M¨ unchen (2004) 6. Lepetit, V., Lagger, P., Fua, P.: Randomized trees for real-time keypoint recognition. In: CVPR 2005, Switzerland, pp. 775–781 (2005) 7. Lepetit, V., Pilet, J., Fua, P.: Point matching as a classification problem for fast and robust object pose estimation. In: CVPR 2004, June 2004, vol. 2, pp. 244–250 (2004) 8. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object detection. In: IEEE ICIP, pp. 900–903. IEEE Computer Society Press, Los Alamitos (2002) 9. Marchand, E., Bouthemy, P., Chaumette, F., Moreau, V.: Robust real-time visual tracking using a 2D-3D model-based approach. In: ICCV, pp. 262–268 (September 1999) 10. Quinlan, R.: Learning with continuous classes. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pp. 343–348 (1992) 11. Romdhani, S.: Face Image Analysis using a Multiple Feature Fitting Strategy. PhD thesis, University of Basel, Computer Science Department, Basel, CH (January 2005) 12. Simon, D., Hebert, M., Kanade, T.: Real-time 3-D pose estimation using a highspeed range sensor. In: ICRA 1994. Proceedings of IEEE International Conference on Robotics and Automation, vol. 3, pp. 2235–2241 (May 1994) 13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition (CVPR) (2001) 14. Wimmer, M., Pietzsch, S., Stulp, F., Radig, B.: Learning robust objective functions with application to face model fitting. In: Proceedings of the 29th DAGM Symposium, Heidelberg, Germany, September 2007 (to appear) 15. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Multiple View Geometry for Non-rigid Motions Viewed from Translational Cameras
Cheng Wan, Kazuki Kozuka, and Jun Sato
Department of Computer Science and Engineering, Nagoya Institute of Technology, Nagoya 466–8555, Japan
Abstract. This paper introduces multiple view geometry under projective projections from four-dimensional space to two-dimensional space which can represent multiple view geometry under the projection of space with time. We show the multifocal tensors defined under space-time projective projections can be derived from non-rigid object motions viewed from multiple cameras with arbitrary translational motions, and they are practical for generating images of non-rigid object motions viewed from cameras with arbitrary translational motions. The method is tested in real image sequences.
1 Introduction
The multiple view geometry is very important for describing the relationship between images taken from multiple cameras and for recovering 3D geometry from images [1,2,3,4,6,7]. In the traditional multiple view geometry, the projection from the 3D space to 2D images has been assumed [3]. However, the traditional multiple view geometry is limited to the case where enough corresponding points are visible from a static configuration of multiple cameras. Recently, some efforts for extending the multiple view geometry to more general point-camera configurations have been made [5,8,10,11,12]. Wolf et al. [8] studied the multiple view geometry on the projections from N-dimensional space to 2D images and showed that it can be used for describing the relationship of multiple views obtained from moving cameras and points which move on straight lines with constant speed. Thus the motions of objects are limited. Hayakawa et al. [9] proposed the multiple view geometry in space-time, which makes it possible to describe the relationship of multiple views derived from translational cameras and non-rigid arbitrary motions. However, their multiple view geometry assumes affine projection, which is an idealized model and cannot be applied if we have strong perspective distortions in images. In this paper we introduce the multiple view geometry under the projective projection from 4D space to 2D space and show that such a universal model can represent multiple view geometry in the case where non-rigid arbitrary motions are viewed from multiple translational projective cameras. We first analyze multiple view geometry under the projection from 4D to 2D, and show that we have multilinear relationships for up to 5 views, unlike the traditional multilinear
relationships. The three view, four view and five view geometries are studied extensively and new trilinear, quadrilinear and quintilinear relationships under the projective projection from 4D space to 2D space are presented. We next show that the newly defined multiple view geometry can be used for describing the relationship between images taken from non-rigid motions viewed from multiple translational cameras. We also show that it is very useful for generating images of non-rigid object motions viewed from arbitrary translational cameras.
2 Projective Projections from 4D to 2D
We first consider projective projections from 4D space to 2D space. This projection is used to describe the relationship between the real space-time and 2D images, and for analyzing the multiple view geometry under space-time projections. Let X = [X^1, X^2, X^3, X^4, X^5]^T be the homogeneous coordinates of a 4D space point projected to a point in the 2D space, whose homogeneous coordinates are represented by x = [x^1, x^2, x^3]^T. Then, the extended projective projection from X to x can be described as follows:

x ∼ P X    (1)

where (∼) denotes equality up to a scale, and P denotes the following 3 × 5 matrix:

P = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} & m_{15} \\ m_{21} & m_{22} & m_{23} & m_{24} & m_{25} \\ m_{31} & m_{32} & m_{33} & m_{34} & m_{35} \end{bmatrix}    (2)

From (1), we find that the extended projective camera, P, has 14 DOF. In the next section, we consider the multiple view geometry of the extended projective cameras.
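As a concrete illustration of (1)–(2), the following short numpy sketch (with made-up matrix entries and point coordinates, not values from the paper) applies a 3 × 5 extended projective camera to a homogeneous space-time point and dehomogenizes the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x5 extended projective camera (defined up to scale, 14 DOF).
P = rng.standard_normal((3, 5))

# Homogeneous coordinates of a 4D space-time point, e.g. (X, Y, Z, T, 1).
X = np.array([1.0, 2.0, 0.5, 3.0, 1.0])

x = P @ X                  # homogeneous 2D image point, defined up to scale
x_inhom = x[:2] / x[2]     # inhomogeneous image coordinates
print(x_inhom)
```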
3 Projective Multiple View Geometry from 4D to 2D
From (1), we have the following equation for N extended projective cameras:

\begin{bmatrix} P & x & 0 & 0 & \cdots & 0 \\ P' & 0 & x' & 0 & \cdots & 0 \\ P'' & 0 & 0 & x'' & \cdots & 0 \\ \vdots & & & & \ddots & \end{bmatrix} \begin{bmatrix} X \\ \lambda \\ \lambda' \\ \lambda'' \\ \vdots \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}    (3)

where the leftmost matrix, M, in (3) is 3N × (5 + N), and the (5 + N) × (5 + N) minors Q of M constitute multilinear relationships under the extended projective projection as det Q = 0. We can choose any 5 + N rows from M to constitute Q, but we have to take at least 2 rows from each camera for deriving meaningful N-view relationships (note, each camera has 3 rows in M).
Table 1. The number of corresponding points required for computing multifocal tensors in three, four and five views with the nonlinear method and the linear method.

views   nonlinear method   linear method
three          9                13
four           8                10
five           8                 9
Table 2. Trilinear relations between point and line coordinates in three views. The final column denotes the number of linearly independent equations.

correspondence          relation                                                      # of equations
three points            x^i x^j x^k ε_{krv} T_{ij}^r = 0_v                                  2
two points, one line    x^i x^j l_r T_{ij}^r = 0                                            1
one point, two lines    x^i l_q l_r ε^{qjt} T_{ij}^r = 0^t                                  2
three lines             l_p l_q l_r ε^{pis} ε^{qjt} T_{ij}^r = 0^{st}                       4
Thus, 5 + N ≥ 2N must hold for defining multilinear relationships for the N-view geometry in the 4D space. Thus, we find that, unlike the traditional multiple view geometry, the multilinear relationship for 5 views is the maximal linear relationship in the 4D space. We next consider the minimum number of points required for computing the multifocal tensors. The geometric DOF of N extended projective cameras is 14N − 24, since each extended projective camera has 14 DOF and these N cameras are in a single 4D projective space whose DOF is 24. Meanwhile, suppose we are given M points in the 4D space and let them be projected to the N projective cameras defined in (1). Then, we derive 2MN measurements from images, while we have to compute 14N − 24 + 4M components for fixing all the geometry in the 4D space. Thus, the condition 2MN ≥ 14N − 24 + 4M must hold for computing the multifocal tensors from images. We find that a minimum of 9, 8 and 8 points is required to compute the multifocal tensors in three, four and five views, respectively (see Table 1).
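As a quick check of this counting argument, the sketch below solves 2MN ≥ 14N − 24 + 4M for the smallest integer M at each N; the values it prints (9, 8, 8 for N = 3, 4, 5) match the nonlinear column of Table 1.

```python
# Smallest number of points M satisfying 2*M*N >= 14*N - 24 + 4*M.
def min_points(n_views: int) -> int:
    m = 1
    while 2 * m * n_views < 14 * n_views - 24 + 4 * m:
        m += 1
    return m

for n in (3, 4, 5):
    print(n, min_points(n))   # -> 3 9, 4 8, 5 8
```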
3.1 Three View Geometry
We next introduce the multiple view geometry of three extended projective cameras. For three views, the sub square matrix Q is 8 × 8. From det Q = 0, we have the following trilinear relationship under extended projective camera projections:

x^i x^j x^k ε_{krv} T_{ij}^r = 0_v    (4)

where ε_{ijk} denotes the epsilon tensor, which represents a sign based on the permutation from {i, j, k} to {1, 2, 3}. T_{ij}^r is the trifocal tensor for the extended cameras and has the following form:
Multiple View Geometry for Non-rigid Motions
345
Table 3. Quadrilinear relations between point and line coordinates in four views.

correspondence            relation                                                                   # of equations
four points               x^i x^j x^k x^s ε_{jlu} ε_{kmv} ε_{snw} Q_i^{lmn} = 0_{uvw}                      8
three points, one line    x^i x^j x^k l_n ε_{jlu} ε_{kmv} Q_i^{lmn} = 0_{uv}                               4
two points, two lines     x^i x^j l_m l_n ε_{jlu} Q_i^{lmn} = 0_u                                          2
one point, three lines    x^i l_l l_m l_n Q_i^{lmn} = 0                                                    1
four lines                l_k l_l l_m l_n ε^{kiw} Q_i^{lmn} = 0_w                                          2

Table 4. Quintilinear relations between point and line coordinates in five views.

correspondence             relation                                                                                       # of eq.
five points                x^i x^j x^k x^s x^t ε_{ila} ε_{jmb} ε_{knc} ε_{sfd} ε_{tge} R^{lmnfg} = 0_{abcde}                  32
four points, one line      x^i x^j x^k x^s l_g ε_{ila} ε_{jmb} ε_{knc} ε_{sfd} R^{lmnfg} = 0_{abcd}                           16
three points, two lines    x^i x^j x^k l_f l_g ε_{ila} ε_{jmb} ε_{knc} R^{lmnfg} = 0_{abc}                                     8
two points, three lines    x^i x^j l_n l_f l_g ε_{ila} ε_{jmb} R^{lmnfg} = 0_{ab}                                              4
one point, four lines      x^i l_m l_n l_f l_g ε_{ila} R^{lmnfg} = 0_a                                                         2
five lines                 l_l l_m l_n l_f l_g R^{lmnfg} = 0                                                                   1

T_{ij}^r = ε_{ilm} ε_{jqu} \det \begin{bmatrix} a^l \\ a^m \\ b^q \\ b^u \\ c^r \end{bmatrix}    (5)
where a^i denotes the ith row of P, b^i denotes the ith row of P′, and c^i denotes the ith row of P′′, respectively. The trifocal tensor T_{ij}^r is 3 × 3 × 3 and has 27 entries. If the extended cameras are projective as shown in (1), we have only 26 free parameters in T_{ij}^r except for a scale ambiguity. On the other hand, (4) provides us 3 linear equations on T_{ij}^r, but only 2 of them are linearly independent. Thus, at least 13 corresponding points are required to compute T_{ij}^r from images linearly. A complete set of the trilinear equations involving the trifocal tensor is given in Table 2. All of these equations are linear in the entries of the trifocal tensor T_{ij}^r.
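A direct, unoptimized numpy sketch of (5) is given below. It assumes three known 3 × 5 extended camera matrices with rows a^i, b^i, c^i and forms each tensor entry as the sum of signed 5 × 5 determinants implied by the two epsilon symbols; this is purely illustrative and not the paper's implementation.

```python
import numpy as np
from itertools import permutations

def levi_civita3():
    """3x3x3 epsilon tensor: +1/-1 for even/odd permutations, 0 otherwise."""
    eps = np.zeros((3, 3, 3))
    for p in permutations(range(3)):
        sign, perm = 1, list(p)
        for a in range(3):
            for b in range(a + 1, 3):
                if perm[a] > perm[b]:
                    sign = -sign
        eps[p] = sign
    return eps

def trifocal_tensor(P1, P2, P3):
    """T[i, j, r] as in Eq. (5); P1, P2, P3 are 3x5 extended cameras."""
    eps = levi_civita3()
    T = np.zeros((3, 3, 3))
    for i in range(3):
        for j in range(3):
            for r in range(3):
                val = 0.0
                for l in range(3):
                    for m in range(3):
                        if eps[i, l, m] == 0:
                            continue
                        for q in range(3):
                            for u in range(3):
                                if eps[j, q, u] == 0:
                                    continue
                                M = np.vstack([P1[l], P1[m], P2[q], P2[u], P3[r]])
                                val += eps[i, l, m] * eps[j, q, u] * np.linalg.det(M)
                T[i, j, r] = val
    return T
```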
3.2 Four View and Five View Geometry
Similarly, the four view and the five view geometry can also be derived for the extended projective cameras. The quadrilinear relationship under the extended projective projection is:

x^i x^j x^k x^s ε_{jlu} ε_{kmv} ε_{snw} Q_i^{lmn} = 0_{uvw}    (6)
Q_i^{lmn} is the quadrifocal tensor, whose form is described as:

Q_i^{lmn} = ε_{ipq} \det \begin{bmatrix} a^p \\ a^q \\ b^l \\ c^m \\ d^n \end{bmatrix}    (7)
where d^i denotes the ith row of P′′′. The quadrifocal tensor Q_i^{lmn} has 81 entries. Excluding a scale ambiguity, it has 80 free parameters. 27 linear equations are given from (6), but only 8 of them are linearly independent. Therefore, a minimum of 10 corresponding points is required to compute Q_i^{lmn} from images linearly. The quadrilinear relationships involving the quadrifocal tensor are summarized in Table 3. We next introduce the multiple view geometry of five extended projective cameras. The quintilinear constraint is expressed as follows:

x^i x^j x^k x^s x^t ε_{ila} ε_{jmb} ε_{knc} ε_{sfd} ε_{tge} R^{lmnfg} = 0_{abcde}    (8)
where R^{lmnfg} is the quintifocal tensor (five view tensor), whose form is represented as follows:

R^{lmnfg} = \det \begin{bmatrix} a^l \\ b^m \\ c^n \\ d^f \\ e^g \end{bmatrix}    (9)

where e^i denotes the ith row of P′′′′. The quintifocal tensor R^{lmnfg} has 243 entries. If the extended cameras are projective as shown in (1), we have only 242 free parameters in R^{lmnfg} except for a scale. On the other hand, (8) provides us 243 linear equations on R^{lmnfg}, but only 32 of them are linearly independent. If we have N corresponding points, 32N − {}_N C_2 independent constraints can be derived. Thus, at least 9 corresponding points are required to compute R^{lmnfg} from images linearly. The number of corresponding points required for computing the multifocal tensors is summarized in Table 1. The quintilinear relationships are given in Table 4.
4 Multiple View Geometry for Multiple Moving Cameras
Let us consider a single moving point in the 3D space. If the multiple cameras are stationary, we can compute the traditional multifocal tensors [3] from the image motion of this point, and they can be used for constraining image points in arbitrary views and for reconstructing 3D points from images. However, if these cameras are moving independently, the traditional multifocal tensors cannot be computed from the image motion of a single point. Nonetheless, we in this section
Fig. 1. A moving point in 3D space and its projections in three translational projective cameras. The multifocal tensor defined under space-time projections can describe the relationship between these image projections.
show that if the camera motions are translational as shown in Fig. 1, the multiple view geometry under extended projective projections can be computed from the image motion of a single point, and it can be used for, for example, generating image motions viewed from arbitrary translational cameras. We first show that the extended projective cameras shown in (1) can be used for describing non-rigid object motions viewed from stationary multiple cameras. We next show that this camera model can also be used for describing non-rigid object motions viewed from multiple cameras with translational motions of constant speed. The motions of a point, W̃ = [X, Y, Z]^T, in the real space can be considered as a set of points, X̃ = [X, Y, Z, T]^T, in a 4D space-time, where T denotes time and (˜) denotes inhomogeneous coordinates. The motions in the real space are projected to images, and can be observed as a set of points, x̃ = [x, y]^T. Thus, if we assume projective projections in the space axes, the space-time projections can be described by the extended projective cameras shown in (1). We next show that the multiple view geometry described in Section 3 can also be applied for multiple moving cameras. Let us consider a usual projective camera which projects points in 3D to 2D images. If the translational motions of the projective camera are constant, non-rigid motions are projected to images as:

λ \begin{bmatrix} x(T) \\ y(T) \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix} \begin{bmatrix} X(T) - T\Delta X \\ Y(T) - T\Delta Y \\ Z(T) - T\Delta Z \\ 1 \end{bmatrix}    (10)

= \begin{bmatrix} a_{11} & a_{12} & a_{13} & -a_{11}\Delta X - a_{12}\Delta Y - a_{13}\Delta Z & a_{14} \\ a_{21} & a_{22} & a_{23} & -a_{21}\Delta X - a_{22}\Delta Y - a_{23}\Delta Z & a_{24} \\ a_{31} & a_{32} & a_{33} & -a_{31}\Delta X - a_{32}\Delta Y - a_{33}\Delta Z & a_{34} \end{bmatrix} \begin{bmatrix} X(T) \\ Y(T) \\ Z(T) \\ T \\ 1 \end{bmatrix}    (11)
Fig. 2. Single point motion experiment. (a), (b) and (c) show image motions of a single point viewed from camera 1, 2 and 3. The 13 green points in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are translating with different speed and direction.
Fig. 3. The white curve in (a) shows image motions recovered from the extended trifocal tensor, and the black curve shows real image motions observed in camera 3. (b) shows those recovered from the traditional trifocal tensor. The 13 black points in (a) and 7 black points in (b) show points used for computing the trifocal tensors.
where x(T) and y(T) denote image coordinates at time T, X(T), Y(T) and Z(T) denote coordinates of a 3D point at time T, and ΔX, ΔY and ΔZ denote camera motions in the X, Y and Z axes. Since the translational motion is constant in each camera, ΔX, ΔY and ΔZ are fixed in each camera. Then, we find from (11) that the projections of non-rigid motions to multiple cameras with translational motions can also be described by the extended projective cameras shown in (1). Thus the multiple view geometry described in Section 3 can also be applied to multiple projective cameras with constant translational motions. Note that if we have enough moving points in the scene, we can also compute the traditional multiple view geometry on the multiple moving cameras at each instant.
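To make the construction in (10)–(11) concrete, the sketch below (with made-up camera entries and velocity, not values from the paper) builds the 3 × 5 extended camera of (11) from an ordinary 3 × 4 camera A and a constant translational velocity (ΔX, ΔY, ΔZ), and verifies that it reproduces the projection in (10).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))      # hypothetical 3x4 projective camera
dX, dY, dZ = 0.1, -0.2, 0.05         # hypothetical constant camera velocity per time unit

# 3x5 extended camera of Eq. (11): columns [a1, a2, a3, -a1*dX - a2*dY - a3*dZ, a4]
fourth_col = -(A[:, 0] * dX + A[:, 1] * dY + A[:, 2] * dZ)
P_ext = np.column_stack([A[:, 0], A[:, 1], A[:, 2], fourth_col, A[:, 3]])

# Check against Eq. (10) for one space-time point.
X, Y, Z, T = 1.0, 2.0, 3.0, 4.0
lhs = A @ np.array([X - T * dX, Y - T * dY, Z - T * dZ, 1.0])   # Eq. (10)
rhs = P_ext @ np.array([X, Y, Z, T, 1.0])                       # Eq. (11)
assert np.allclose(lhs, rhs)
```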
5 Experiments
We next show the results of experiments. We first show, using real images, that the trifocal tensor for extended projective cameras can be computed from image motions viewed from arbitrary translational cameras, and can be used for generating the third view from the first and the second views of moving cameras. We next evaluate the stability of the extracted trifocal tensors for extended projective cameras.
Fig. 4. Other single point motion experiments. (ai), (bi) and (ci) show three views of the ith motion. The 13 green points in each image and black points in (di) are corresponding points. The white curve in (di) shows image motions recovered from the extended trifocal tensor, and the black curve shows real image motions observed in camera 3.
5.1 Real Image Experiment
In this section, we show the results from single point motion and multiple point motion experiments. In the first experiment, we used 3 cameras which are translating with different constant speed and different direction, and computed trifocal tensors between these 3 cameras by using a single moving point in the 3D space. Since multiple cameras are dynamic, we cannot compute the traditional trifocal tensor of these cameras from a moving point. Nonetheless we can compute the extended trifocal tensor and can generate image motions in one of the 3 views from the others. Fig. 2 (a), (b) and (c) show image motions of a single moving point in translational camera 1, 2 and 3 respectively. The trifocal tensor is computed from 13 points on the image motions in three views. They are shown by green points in (a), (b) and (c). The extracted trifocal tensor is used for generating image motions in camera 3 from image motions in camera 1 and 2. The white curve in Fig. 3 (a) shows image motions in camera 3 generated from the extended trifocal tensor, and the black curve shows the real image motions viewed from camera 3. As shown in Fig. 3 (a), the generated image motions almost recovered the original image motions even if these 3 cameras have unknown translational motions. To show the advantage of the extended trifocal tensor, we also show image motions generated from the traditional trifocal tensor, that is, trifocal tensor defined for projections from 3D space to 2D space. 7 points taken from the former 13 points are used as corresponding points in three views for computing the traditional projective trifocal tensor. The image motion in camera 3 generated from the image motions in camera 1 and 2 by using the extracted traditional trifocal tensor is shown by white curve in Fig. 3 (b). As shown in Fig. 3 (b), the generated image motion is very different from the real image motion shown by black curve as we
Fig. 5. Multiple point motion experiments. (ai), (bi) and (ci) show three views of the ith motion. The green curve and the red curve represent two different image motions. The 7 green points on the green curve and the 6 red points on the red curve in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are translating with different speed and direction. The white curve in (di) shows image motions recovered from the extended trifocal tensor, and the black curve shows real image motions observed in camera 3.
expected, and thus we find that the traditional multiple view geometry cannot describe such general situations, while the proposed multiple view geometry can as shown in Fig. 3 (a). The results from other single point motions are also given. In Fig. 4, (ai), (bi) and (ci) show three views of the ith motion. The 13 green points in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are translating with different speed and different direction. The white curve in (di) shows image motions recovered from the extended trifocal tensor in camera 3, and the black curve shows real image motions observed in camera 3. The 13 black points in (di) show points used for computing the trifocal tensor. As we can see, the trifocal tensor defined under space-time projective projections can be derived from arbitrary single point motions viewed from the 3 cameras with arbitrary translational motions, and they are practical for generating images of single point motions viewed from translational camera. Next we show the results from multiple point motions. In Fig. 5, (ai), (bi) and (ci) show three views of the ith motion. The green curve and the red curve represent two different image motion. The 7 green points on the green curve and the 6 red points on the red curve in each image are corresponding points used for computing the trifocal tensor. Note that these 3 cameras are
Fig. 6. Stability evaluation. (a) shows 3 translating cameras and a moving point in the 3D space. The black points show the viewpoints of the cameras before the translational motions, and the white points show those after the motions. (b) shows the relationship between the number of corresponding points and the reprojection errors.
translating with different speed and different direction. The white curve in (di) shows image motions recovered from the extended trifocal tensor in camera 3, and the black curve shows real image motions observed in camera 3. The 13 black points in (di) show points used for computing the trifocal tensor. According to these experiments, we found that the extended multifocal tensors can be derived from non-rigid object motions viewed from multiple cameras with arbitrary translational motions, and they are useful for generating images of non-rigid object motions viewed from cameras with arbitrary translational motions.
5.2 Stability Evaluation
We next show the stability of the extracted trifocal tensors under space-time projections. Fig. 6 (a) shows a 3D configuration of 3 moving cameras and a moving point. The black points show the viewpoints of the three cameras, C1, C2 and C3, before the translational motions, and the white points show their viewpoints after the translational motions. The translational motions of these three cameras are different and unknown. The black curve shows a locus of a freely moving point. For evaluating the extracted trifocal tensors, we computed reprojection errors derived from the trifocal tensors. The reprojection error is defined as \frac{1}{N}\sum_{i=1}^{N} d(m_i, \hat{m}_i)^2, where d(m_i, \hat{m}_i) denotes the distance between a true point m_i and the point \hat{m}_i recovered from the trifocal tensor. We increased the number of corresponding points used for computing the trifocal tensors in three views from 13 to 25, and evaluated the reprojection errors. Gaussian noise with a standard deviation of 1 pixel was added to every point on the locus. Fig. 6 (b) shows the relationship between the number of corresponding points and the reprojection errors. As we can see, the stability is clearly improved by using a few more points than the minimum number of corresponding points.
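The error measure above takes only a couple of lines to compute; the sketch below assumes arrays of true points m and recovered points m_hat (names are illustrative only).

```python
import numpy as np

def reprojection_error(m, m_hat):
    """Mean squared point-to-point distance, (1/N) * sum ||m_i - m_hat_i||^2."""
    d = np.linalg.norm(np.asarray(m) - np.asarray(m_hat), axis=1)
    return float(np.mean(d ** 2))
```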
6 Conclusion
In this paper, we analyzed multiple view geometry under projective projections from 4D space to 2D space, and showed that it can represent multiple view geometry under space-time projections. In particular, we showed that multifocal tensors defined under space-time projective projections can be computed from non-rigid object motions viewed from multiple cameras with arbitrary translational motions. We also showed that they are very useful for generating images of non-rigid motions viewed from projective cameras with arbitrary translational motions. The method was implemented and tested by using real image sequences. The stability of extracted trifocal tensors was also evaluated.
References
1. Faugeras, O.D., Luong, Q.T.: The Geometry of Multiple Images. MIT Press, Cambridge (2001)
2. Faugeras, O.D., Mourrain, B.: On the geometry and algebra of the point and line correspondences between N images. In: Proc. 5th International Conference on Computer Vision, pp. 951–956 (1995)
3. Hartley, R.I., Zisserman, A.: Multiple View Geometry. Cambridge University Press, Cambridge (2000)
4. Hartley, R.I.: Multilinear relationship between coordinates of corresponding image points and lines. In: Proc. International Workshop on Computer Vision and Applied Geometry (1995)
5. Heyden, A.: A common framework for multiple view tensors. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 3–19. Springer, Heidelberg (1998)
6. Heyden, A.: Tensorial properties of multiple view constraints. Mathematical Methods in the Applied Sciences 23, 169–202 (2000)
7. Shashua, A., Wolf, L.: Homography tensors: On algebraic entities that represent three views of static or moving planar points. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842. Springer, Heidelberg (2000)
8. Wolf, L., Shashua, A.: On projection matrices P^k → P^2, k = 3, ..., 6, and their applications in computer vision. In: Proc. 8th International Conference on Computer Vision, vol. 1, pp. 412–419 (2001)
9. Hayakawa, K., Sato, J.: Multiple View Geometry in the Space-Time. In: Proc. Asian Conference on Computer Vision, pp. 437–446 (2006)
10. Wexler, Y., Shashua, A.: On the synthesis of dynamic scenes from reference views. In: Proc. Conference on Computer Vision and Pattern Recognition, pp. 576–581 (2000)
11. Hartley, R.I., Schaffalitzky, F.: Reconstruction from Projections using Grassmann Tensors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 363–375. Springer, Heidelberg (2004)
12. Sturm, P.: Multi-View Geometry for General Camera Models. In: Proc. Conference on Computer Vision and Pattern Recognition, pp. 206–212 (2005)
Visual Odometry for Non-overlapping Views Using Second-Order Cone Programming
Jae-Hak Kim¹, Richard Hartley¹, Jan-Michael Frahm² and Marc Pollefeys²
¹ Research School of Information Sciences and Engineering, The Australian National University; National ICT Australia, NICTA
² Department of Computer Science, University of North Carolina at Chapel Hill
Abstract. We present a solution for motion estimation for a set of cameras which are firmly mounted on a head unit and do not have overlapping views in each image. This problem relates to ego-motion estimation of multiple cameras, or visual odometry. We reduce motion estimation to solving a triangulation problem, which finds a point in space from multiple views. The optimal solution of the triangulation problem in the L∞ norm is found using SOCP (Second-Order Cone Programming). Consequently, with the help of the optimal solution for the triangulation, we can solve visual odometry by using SOCP as well.
1 Introduction
Motion estimation of cameras, or pose estimation, mostly in the case of having overlapping points or tracks between views, has been studied in computer vision research for many years [1]. However, non-overlapping or slightly overlapping camera systems have not been studied as much, particularly for the motion estimation problem. Non-overlapping views mean that the images captured with the cameras share no, or at most only a few, common points. There are potential applications for this kind of camera system. For instance, we can construct a cluster of multiple cameras which are firmly installed on a base unit such as a vehicle, with the cameras positioned to look in different view directions. A panoramic or omnidirectional image can be obtained from images captured with a set of cameras with small overlap. Another example is a vehicle with cameras mounted on it to provide driving assistance such as side/rear view cameras. An important problem is visual odometry – how can we estimate the tracks of a vehicle and use this data to determine where the vehicle is placed. There has been prior research considering a set of many cameras moving together as one camera. In [2] an algebraic solution to the multiple camera motion problem is presented. Similar research on planetary rover operations has been conducted to estimate the motion of a rover on Mars and to keep track of the rover [3].
NICTA is funded by the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
Other research on visual odometry has been performed to estimate the motion of a stereo rig or single camera [4]. Prior work on non-overlapping cameras includes most notably the paper [5]. That work aligns independently computed tracks of the different cameras, whereas we compute a motion estimate using all the cameras at once. Finally, an earlier solution to the problem was proposed in the unpublished work [6], which may appear elsewhere. In this paper, we propose a solution to estimate the six degrees of freedom (DOFs) of the motion, three rotation parameters and three translation parameters (including scale), for a set of multiple cameras with non-overlapping views, based on L∞ triangulation. A main contribution of this paper is that we provide a well-founded geometric solution to motion estimation for non-overlapping multiple cameras.
2 Problem Formulation
Consider a set of n calibrated cameras with non-overlapping fields of view. Since the cameras are calibrated, we may assume that they are all oriented in the same way, just to simplify the mathematics. This is easily done by multiplying the original image coordinates by the inverse of the rotation matrix. This being the case, we can also assume that they all have camera matrices originally equal to P_i = [I | −c_i]. We assume that all c_i are known. The cameras then undergo a common motion, described by a Euclidean matrix

M = \begin{bmatrix} R & -Rt \\ 0 & 1 \end{bmatrix}    (1)

where R is a rotation, and t is a translation of the set of cameras. Then, the i-th camera matrix changes to

P_i' = P_i M^{-1} = [I | -c_i] \begin{bmatrix} R^T & t \\ 0 & 1 \end{bmatrix} = [R^T | t - c_i]

which is located at R(c_i − t). Suppose that we compute all the essential matrices of the cameras independently, then decompose them into rotation and translation. We observe that the rotations computed from all the essential matrices are the same. This is true only because all the cameras have the same orientation. We can average them to get an overall estimate of rotation. Then, we would like to compute the translation. As we will demonstrate, this is a triangulation problem.
Geometric concept. First, let us look at a geometric idea derived from this problem. An illustration of a motion of a set of cameras is shown in Figure 1. A bundle of cameras is moved by a rotation R and translation t. All cameras at c_i are moved to c_i'. The first camera at position c_1' is a sum of the vectors c_i, c_i' − c_i and c_1' − c_i', where i = 1...3. Observing that the vector v_i in Figure 1 is the same as the vector c_i' − c_i, and that the vector c_1' − c_i' is obtained by rotating
Fig. 1. A set of cameras is moved by a Euclidean motion of rotation R and translation t. The centre of the first camera c1 is moved to c1' by the motion. The centre c1' is a common point where all translation direction vectors meet. The translation direction vectors are indicated as red, green and blue solid arrows, which are v1, v2 and v3, respectively. Consequently, this is a triangulation problem.
the vector c1 − ci, the first camera at position c1' can be rewritten as a sum of three vectors ci, R(c1 − ci) and vi. Therefore, the three vectors vi, the coloured solid arrows in Figure 1, meet in one common point c1', the position of the centre of the first camera after the motion. It means that finding the motion of the set of cameras is the same as solving a triangulation problem for the translation direction vectors derived from each view. Secondly, let us derive detailed equations for this problem from the geometric concept we have described above. Let Ei be the essential matrix for the i-th camera. From E1, we can compute the translation vector of the first camera, P1, in the usual way. This is a vector passing through the original position of the first camera. The final position of this camera must lie along this vector. Next, we use Ei, for i > 1, to estimate a vector along which the final position of the first camera can be found. Thus, for instance, we use E2 to find the final position of P1. This works as follows. The i-th essential matrix Ei decomposes into Ri = R^T and a translation vector vi. In other words, Ei = R^T[vi]×. This means that the i-th camera moves to a point ci + λi vi, the value of λi being unknown. This point is the final position ci' of each camera in Figure 1. We transfer this motion to determine the motion of the first camera. We consider the motion as taking place in two stages, first rotation, then translation. First the camera centre c1 is rotated by R about point ci to the point ci + R(c1 − ci). Then it is translated in the direction vi to the point c1' = ci + R(c1 − ci) + λi vi. Thus, we see that c1' lies on the line with direction vector vi, based at the point ci + R(c1 − ci).
In short, each essential matrix Ei constrains the final position of the first camera to lie along a line. These lines are not all the same; in fact, unless R = I, they are all different. The problem now comes down to finding the values of λi and c1' such that, for all i,

c_1' = c_i + R(c_1 - c_i) + \lambda_i v_i,   for i = 1, ..., n.    (2)
Having found c1', we can get t from the equation c1' = R(c1 − t).
Averaging Rotations. From the several cameras and their essential matrices Ei, we have several estimates Ri = R^T of the rotation of the camera rig. We determine the best estimate of R by averaging these rotations. This is done by representing each rotation Ri as a unit quaternion, computing the average of the quaternions and renormalizing to unit norm. Since a quaternion and its negative both represent the same rotation, it is important to choose consistently signed quaternions to represent the separate rotations Ri.
Algebraic derivations. Alternatively, it is possible to show an algebraic derivation of the equations as follows. Given P_i = [I | −c_i] and P_i' = [R^T | t − c_i] (see (1)), the essential matrix is written as

E_i = R^T [c_i + R(t - c_i)]_\times = [R^T c_i + (t - c_i)]_\times R^T.    (3)
Considering that the decomposition of the essential matrix Ei is Ei = Ri[vi]× = [Ri vi]× Ri, we may get the rotation and translation from (3), namely Ri = R^T and λi Ri vi = R^T ci + (t − ci). As a result, t = λi R^T vi + ci − R^T ci, which is the same equation as derived from the geometric concept.
A Triangulation Problem. Equation (2) gives us independent measurements of the position of the point c1'. Denoting ci + R(c1 − ci) by Ci, the point c1' must lie at the intersection of the lines Ci + λi vi. In the presence of noise, these lines will not meet, so we need to find a good approximation to c1'. Note that the points Ci and vectors vi are known, having been computed from the known calibration of the camera geometry and the computed essential matrices Ei. The problem of estimating the best c1' is identical to the triangulation problem studied (among many places) in [7,8]. We adopt the approach of [7] of solving this under the L∞ norm. The derived solution is the point c1' that minimizes the maximum difference in direction between c1' − Ci and the direction vector vi, over all i. In the presence of noise, the point c1' will lie in the intersection of cones based at the vertices Ci, with axes defined by the direction vectors vi. To formulate the triangulation problem, instead of c1', we write X for the final position of the first camera, where all translations derived from each essential matrix meet together. As we have explained in the previous section, in the presence of noise we have n cones, each one aligned with one of the translation directions. The desired point X lies in the overlap of all these cones, and finding this overlap region gives the solution we need in order to get the motion
of cameras. Then, our original motion estimation problem is formulated as the following minimization problem:

\min_{X} \max_{i} \; \frac{\| (X - C_i) \times v_i \|}{(X - C_i)^{\top} v_i}    (4)
Note that the quotient is equal to tan(θi), where θi is the angle between vi and (X − Ci). This problem can be solved as a Second-Order Cone Program (SOCP) using a bisection algorithm [9].
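A minimal sketch of this bisection strategy on γ = tan θ is given below. It uses the cvxpy modeling package as a stand-in (an assumption for illustration only; the authors use SeDuMi/Yalmip in MATLAB, see Section 4), and at each γ it tests feasibility of the cone intersection as an SOCP.

```python
import numpy as np
import cvxpy as cp

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def linf_triangulate(C, V, iters=30, gamma_hi=10.0):
    """C: list of base points C_i (3,); V: list of unit direction vectors v_i (3,)."""
    def feasible(gamma):
        X = cp.Variable(3)
        cons = []
        for Ci, vi in zip(C, V):
            d = X - Ci
            # ||v_i x (X - C_i)|| <= gamma * v_i.(X - C_i): X inside the cone of half-angle atan(gamma)
            cons.append(cp.norm(skew(vi) @ d) <= gamma * cp.sum(cp.multiply(vi, d)))
        prob = cp.Problem(cp.Minimize(0), cons)
        prob.solve()
        return prob.status in ("optimal", "optimal_inaccurate"), X.value

    lo, hi, best = 0.0, gamma_hi, None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        ok, X = feasible(mid)
        if ok:
            hi, best = mid, X
        else:
            lo = mid
    return best
```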
3 Algorithm
The algorithm to estimate the motion of cameras having non-overlapping views is as follows.
Given:
1. A set of cameras described in their initial position by their known calibrated camera matrices Pi = Ri[I | −ci]. The cameras then move to a second (unknown) position, described by camera matrices Pi'.
2. For each camera pair (Pi, Pi'), several point correspondences xij ↔ xij' (expressed in calibrated coordinates as homogeneous 3-vectors).
Objective: Find the motion matrix M of the form (1) that determines the common motion of the cameras, such that Pi' = Pi M^{-1}.
Algorithm:
1. Normalize the image coordinates to calibrated image coordinates by setting x̂ij = Ri^{-1} xij and x̂ij' = Ri^{-1} xij', then adjust to unit length by setting x̂ij ← x̂ij/||x̂ij|| and x̂ij' ← x̂ij'/||x̂ij'||.
2. Compute each essential matrix Ei in terms of the correspondences x̂ij ↔ x̂ij' for the i-th camera.
3. Decompose each Ei as Ei = Ri[vi]× and find the rotation R as the average of the rotations Ri.
4. Set Ci = ci + R(c1 − ci).
5. Solve the triangulation problem by finding the point X = c1' that (approximately, because of noise) satisfies the condition X = Ci + λi vi for all i.
6. Compute t from t = c1 − R^T c1'.
In our current implementation, we have used the L∞ norm to solve the triangulation problem. Other methods of solving the triangulation problem may be used, for instance the optimal L2 triangulation method given in [8].
Critical Motion. The algorithm has a critical condition when the rotation is zero. If this is so, then in the triangulation problem solved in this algorithm all the base points Ci involved are the same. Thus, we encounter a triangulation problem with a zero baseline. In this case, the magnitude of the translation cannot be accurately determined.
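The sketch below outlines steps 3–6 of the algorithm in Python, assuming the per-camera essential matrices have already been decomposed into rotations Ri and unit translation directions vi and brought into a common convention so that each Ri estimates the rig rotation R; the use of scipy for quaternion conversion and the function names are assumptions, not part of the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def average_rotation(R_list):
    """Average rotations via unit quaternions with consistent signs (step 3)."""
    quats = [Rotation.from_matrix(R).as_quat() for R in R_list]
    ref = quats[0]
    quats = [q if np.dot(q, ref) >= 0 else -q for q in quats]  # fix sign ambiguity
    mean = np.mean(quats, axis=0)
    return Rotation.from_quat(mean / np.linalg.norm(mean)).as_matrix()

def recover_motion(R_list, v_list, centers, triangulate):
    """centers: original camera centres c_i; triangulate: e.g. an L-infinity solver."""
    R = average_rotation(R_list)
    c1 = centers[0]
    C = [ci + R @ (c1 - ci) for ci in centers]     # step 4: base points C_i
    c1_new = triangulate(C, v_list)                # step 5: X = c1'
    t = c1 - R.T @ c1_new                          # step 6: from c1' = R(c1 - t)
    return R, t
```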
4 Experiments
We have used SeDuMi and Yalmip toolbox for optimization of SOCP problems [10,11]. We used a five point solver to estimate the essential matrices [12,13]. We select the best five points from images using RANSAC to obtain an essential matrix, and then we improve the essential matrix by non-linear optimization. An alternative method for computing the essential matrix based on [14] was tried. This method gives the optimal essential matrix in L∞ norm. A comparison of the results for these two methods for computing Ei is given in Fig 6. Real data. We used Point Grey’s LadybugTM camera to generate some test data for our problem . This camera unit consists of six 1024×768 CCD color sensors with small overlap of their field of view. The six cameras, 6 sensors with 2.5 mm lenses, are closely packed on a head unit. Five CCDs are positioned in a horizontal ring around the head unit to capture side-view images, and one is located on the top of the head unit to take top-view images. Calibration information provided by Point Grey [15] is used to get intrinsic and relative extrinsic parameters of all six cameras. A piece of paper is positioned on the ground, and the camera is placed on the paper. Some books and objects are randomly located around the camera. The camera is moved manually while the positions of the camera at some points are marked on the paper as edges of the camera head unit. These marked edges on the paper are used to get the ground truth of relative motion of the camera for this experiment. The experimental setup is shown in Figure 2. A panoramic image stitched in our experimental setup is shown in Figure 3.
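The paper's pipeline uses a five-point solver [12,13] inside RANSAC followed by non-linear refinement of the essential matrix; a rough, hedged equivalent using OpenCV's five-point implementation is sketched below (the point arrays and threshold are placeholders, not the paper's data or code).

```python
import cv2
import numpy as np

# pts1, pts2: Nx2 arrays of corresponding calibrated image points for one camera (placeholders)
pts1 = np.random.rand(100, 2).astype(np.float64)
pts2 = np.random.rand(100, 2).astype(np.float64)
K = np.eye(3)   # identity since the points are already in calibrated coordinates

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1e-3)
n, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
```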
Fig. 2. An experimental setup of the LadybugTM camera on an A3 size paper surrounded by books. The camera is moved on the paper by hand, and the position of the camera at certain key frames is marked on the paper to provide the ground truth for the experiments.
Table 1. Experimental results of rotations at key frames 0, 30, 57 and 80, which correspond to position numbers 0–3, respectively. For instance, the pair (R0, R1) corresponds to the pair of rotations at key frames 0 and 30. Angles of each rotation are represented in the axis-angle rotation representation.

Rotation pair   True rotation (axis, angle)   Estimated rotation (axis, angle)
(R0, R1)        [0 0 -1], 85.5°               [-0.008647 -0.015547 0.999842], 85.15°
(R0, R2)        [0 0 -1], 157.0°              [-0.022212 -0.008558 0.999717], 156.18°
(R0, R3)        [0 0 -1], 134.0°              [0.024939 -0.005637 -0.999673], 134.95°
Fig. 3. A panoramic image is obtained by stitching together all six images from the LadybugTM camera. This image is created by LadybugPro, the software provided by Point Grey Research Inc.
In the experiment, 139 frames of image are captured by each camera. Feature tracking is performed on the image sequence by the KLT (Kanade-Lucas-Tomasi) tracker [16]. Since there is lens distortion in the captured image, we correct the image coordinates of the feature tracks using lens distortion parameters provided by the Ladybug SDK library. The corrected image coordinates are used in all the equations we have derived. After that, we remove outliers from the feature tracks by the RANSAC (Random Sample Consensus) algorithm with a model of epipolar geometry in two view and trifocal tensors in three view [17]. There are key frames where we marked the positions of the camera. They are frames 0, 30, 57, 80, 110 and 138 in this experiment. The estimated path of the cameras over the frames is shown in Figure 4. After frame 80, the essential matrix result was badly estimated and subsequent estimation results were erroneous. A summary of the experimental results is shown in Tables 1 and 2. As can be seen, we have acquired a good estimation of rotations from frame 0 up to frame 80, within about one degree of accuracy. Adequate estimation of translations is reached up to frame 57 within less than 0.5 degrees. We have successfully tracked the motion of the camera through 57 frames. Somewhere between frame 57 and frame 80 an error occurred that invalidated the computation of the position of frame 80. Analysis indicates that this was due to a critical motion (near-zero rotation of the camera fixture) that made the translation estimation
Fig. 4. Estimated path of the LadybugTM camera viewed from (a) top and (b) front. The cameras numbered 0, 1, 2, 3, 4 and 5 are indicated as red, green, blue, cyan, magenta and black paths respectively.
Table 2. Experimental results of the translation between two key frames, shown as the scale ratio of the two translation vectors and as the angle between them at the two key frames. The translation direction vector t0i is a vector from the centre of the camera at the starting position, frame number 0, to the centre of the camera at position number i. For example, t01 is a vector from the centre of the camera at frame 0 to the centre of the camera at frame 30.

Translation pair   Scale ratio (true, estimated)   Angle (true, estimated)
(t01, t02)         0.6757, 0.7424                  28.5°, 28.04°
(t01, t03)         0.4386, 1.3406                  42.5°, 84.01°
inaccurate. Therefore, we have shown the frame-to-frame rotations over the frames in Figure 5-(a). As can be seen, there are frames for which the camera motion was less than 5 degrees. This occurred for frames 57 to 62, 67 to 72 and 72 to 77. In Figure 5-(b), we have shown the difference between the ground truth and the estimated positions of the cameras in this experiment. As can be seen, the position of the cameras is accurately estimated up to frame 57. However, the track went off at frame 80. A beneficial feature of our method is that we can avoid such bad conditions for the estimation by looking at the angles between frames and the residual errors of the SOCP, and then trying to use other frames for the estimation.
Using the L∞ optimal E-matrix. The results so far were achieved using the 5-point algorithm (with iterative refinement) for calculating the essential matrix. We also tried using the method given in [14]. Since this method is quite new, we did not have time to obtain complete results. However, Fig. 6 compares the angular error in the translation direction for the two methods. As may be seen, the L∞-optimal method seems to work substantially better.
Fig. 5. The angles between pairs of frames used to estimate the motion are shown in (a). Note that a zero or near-zero rotation means a critical condition for estimating the motion of the cameras from the given frames. (b) Ground truth of positions (indicated as red lines) of the cameras with orientations at key frames 0, 30, 57 and 80, and estimated positions (indicated as black lines) of the cameras with their orientations at the same key frames. Orientations of the cameras are marked as blue arrows. Green lines are the estimated path through all 80 frames.
[Plots: angular error of the estimated translation direction (degrees) vs. frame number; (a) optimal, (b) 5-point]
Fig. 6. Comparison of the angular error in the translation direction for two different methods of computing the essential matrix
5 Discussion
We have presented a solution to find the motion of cameras that are rigidly mounted and have minimally overlapping fields of view. This method works equally well for any number of cameras, not just two, and will therefore provide more accurate estimates than methods involving only pairs of cameras. The method requires a non-zero frame-to-frame rotation. Probably because of this, the estimation of motion through a long image sequence significantly went off track. The method geometrically showed good estimation results in experiments with real world data. However, the accumulated errors in processing long sequences
of images made the system produce bad estimations over long tracks. In the real experiments, we have found that a robust and accurate essential matrix estimation is a critical requirement to obtain correct motion estimation in this problem.
References
1. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
2. Pless, R.: Using many cameras as one. In: CVPR 2003, vol. II, pp. 587–593 (2003)
3. Cheng, Y., Maimone, M., Matthies, L.: Visual odometry on the Mars exploration rovers. In: Systems, Man and Cybernetics, 2005 IEEE International Conference (2005)
4. Nistér, D., Naroditsky, O., Bergen, J.: Visual odometry. In: Conf. Computer Vision and Pattern Recognition, pp. 652–659 (2004)
5. Caspi, Y., Irani, M.: Aligning non-overlapping sequences. International Journal of Computer Vision 48(1), 39–51 (2002)
6. Clipp, B., Kim, J.H., Frahm, J.M., Pollefeys, M., Hartley, R.: Robust 6DOF motion estimation for non-overlapping, multi-camera systems. Technical Report TR07-006, Department of Computer Science, The University of North Carolina at Chapel Hill
7. Hartley, R., Schaffalitzky, F.: L∞ minimization in geometric reconstruction problems. In: Conf. Computer Vision and Pattern Recognition, Washington, DC, USA, vol. I, pp. 504–509 (2004)
8. Lu, F., Hartley, R.: A fast optimal algorithm for L2 triangulation. In: Asian Conf. Computer Vision (2007)
9. Kahl, F.: Multiple view geometry and the L∞-norm. In: Int. Conf. Computer Vision, Beijing, China, pp. 1002–1009 (2005)
10. Sturm, J.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, Special issue on Interior Point Methods (CD supplement with software) 11–12, 625–653 (1999)
11. Löfberg, J.: Yalmip: A toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference, Taipei, Taiwan (2004)
12. Stewénius, H., Engels, C., Nistér, D.: Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing 60, 284–294 (2006)
13. Li, H., Hartley, R.: Five-point motion estimation made easy. In: ICPR (1), pp. 630–633. IEEE Computer Society Press, Los Alamitos (2006)
14. Hartley, R., Kahl, F.: Global optimization through searching rotation space and optimal estimation of the essential matrix. In: Int. Conf. Computer Vision (2007)
15. Point Grey Research Incorporated: Ladybug™2 camera (2006), http://www.ptgrey.com
16. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI 1981, pp. 674–679 (1981)
17. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Pose Estimation from Circle or Parallel Lines in a Single Image
Guanghui Wang¹,², Q.M. Jonathan Wu¹, and Zhengqiao Ji¹
¹ Department of Electrical and Computer Engineering, The University of Windsor, 401 Sunset, Windsor, Ontario, Canada N9B 3P4
² National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100080, P.R. China
[email protected], [email protected]
Abstract. The paper focuses on the problem of pose estimation from a single view under the minimal conditions that can be obtained from images. Under the assumption of known intrinsic parameters, we propose and prove that the pose of the camera can be recovered uniquely in three situations: (a) the image of one circle with a discriminable center; (b) the image of one circle with a preassigned world frame; (c) the image of any two pairs of parallel lines. Compared with previous techniques, the proposed method does not need any 3D measurement of the circle or lines, thus the required conditions are easily satisfied in many scenarios. Extensive experiments are carried out to validate the proposed method.
1 Introduction
Determining the position and orientation of a camera from a single image with respect to a reference frame is a basic and important problem in the robot vision field. There are many potential applications such as visual navigation, robot localization, object recognition, photogrammetry, visual surveillance and so on. During the past two decades, the problem has been widely studied and many approaches have been proposed. One well-known pose estimation problem is the perspective-n-point (PnP) problem, which was first proposed by Fischler and Bolles [5]. The problem is to find the pose of an object from the image of n points at known locations on it. Following this idea, the problem was further studied by many researchers [6,8,9,15,14]. One of the major concerns of the PnP problem is the multi-solution phenomenon: all PnP problems for n ≤ 5 have multiple solutions. Thus we need further information to determine the correct solution [6]. Another kind of localization algorithm is based on line correspondences. Dhome et al. [4] proposed to compute the attitude of an object from three line correspondences. Liu et al. [12] discussed some methods to recover the camera pose linearly or nonlinearly by using different combinations of line and point features. Ansar and Daniilidis [1] presented a general framework which allows for a novel set of linear solutions to the pose estimation problem for both n points and n lines. Chen [2] proposed a polynomial approach to find a closed-form solution for
pose determination from line-to-plane correspondences. The line based methods also suffer from the problem of multiple solutions. The above methods assume that the camera is calibrated and the positions of the points and lines are known. In practice, it may be hard to obtain the accurate measurements of these features in space. However, some geometrical constraints, such as coplanarity, parallelity and orthogonality, are abundant in many indoor and outdoor structured scenarios. Some researchers proposed to recover the camera pose from the image of a rectangle, two orthogonal parallel lines and some other scene constraints [7,18]. Circle is another very common pattern in man-made objects and scenes, many studies on camera calibration were based on the image of circles [10,11,13]. In this paper, we try to compute the camera’s pose from a single image based on geometrical configurations in the scene. Different from previous methods, we propose to use the image of only one circle, or the image of any two pairs of parallel lines that may not be coplanar or orthogonal. The proposed method is widely applicable since the conditions are easily satisfied in many scenarios.
2 Perspective Geometry and Pose Estimation
2.1 Camera Projection and Pose Estimation
Under perspective projection, a 3D point x ∈ R³ in space is projected to an image point m ∈ R² via a rank-3 projection matrix P ∈ R^{3×4} as

s m̃ = P x̃ = K[R, t] x̃ = K[r1, r2, r3, t] x̃    (1)

where x̃ = [x^T, w]^T and m̃ = [m^T, w]^T are the homogeneous forms of the points x and m respectively, R and t are the rotation matrix and translation vector from the world system to the camera system, s is a non-zero scalar, and K is the camera calibration matrix. In this paper, we assume the camera is calibrated, thus we may set K = I3 = diag(1, 1, 1), which is equivalent to normalizing the image coordinates by applying the transformation K^{-1}. In this case, the projection matrix is simplified to P = [R, t] = [r1, r2, r3, t]. When all space points are coplanar, the mapping between the space points and their images can be modeled by a plane homography H, which is a nonsingular 3 × 3 homogeneous matrix. Without loss of generality, we may assume the coordinates of the space plane to be [0, 0, 1, 0]^T for a specified world frame; then we have H = [r1, r2, t]. Obviously, the rotation matrix R and the translation vector t can be factorized directly from the homography.
Proposition 1. When the camera is calibrated, the pose of the camera can be recovered from two orthogonal vanishing points in a single view.
Proof. Without loss of generality, let us set the X and Y axes of the world system in line with the two orthogonal directions. In the normalized world coordinate system, the directions of the X and Y axes are x̃w = [1, 0, 0, 0]^T and ỹw = [0, 1, 0, 0]^T
Pose Estimation from Circle or Parallel Lines in a Single Image
365
respectively, and the homogeneous vector of the world origin is õw = [0, 0, 0, 1]^T. Under perspective projection, we have:

s_x ṽ_x = P x̃_w = [r1, r2, r3, t][1, 0, 0, 0]^T = r1    (2)
s_y ṽ_y = P ỹ_w = [r1, r2, r3, t][0, 1, 0, 0]^T = r2    (3)
s_o ṽ_o = P õ_w = [r1, r2, r3, t][0, 0, 0, 1]^T = t    (4)

Thus the rotation matrix can be computed from

r1 = ± ṽ_x / ||ṽ_x||,   r2 = ± ṽ_y / ||ṽ_y||,   r3 = r1 × r2    (5)
where the rotation matrix R = [r1, r2, r3] may have four solutions if a right-handed coordinate system is adopted, while only two of them can ensure that the reconstructed objects lie in front of the camera, i.e. may be seen by the camera. In practice, if the world coordinate frame is preassigned, the rotation matrix may be uniquely determined [19]. Since we have no metric information of the given scene, the translation vector can only be defined up to scale as t ≈ ṽ_o. That is to say, we can only recover the direction of the translation vector. In practice, the orthonormal constraint should be enforced during the computation, since r1 and r2 in (5) may not be orthogonal due to image noise. Suppose the SVD decomposition of R12 = [r1, r2] is U Σ V^T, where Σ is a 3 × 2 matrix made of the two singular values of R12. Then we may obtain the best approximation to the rotation matrix in the least-squares sense from R12 = U [1 0; 0 1; 0 0] V^T, since a rotation matrix should have unit singular values.
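A small numpy sketch of Proposition 1 plus the orthonormalization step is given below: it recovers r1 and r2 from the two orthogonal vanishing points, re-orthonormalizes [r1 r2] by replacing its singular values with ones as described above, and completes the rotation with r3 = r1 × r2 (the sign choices of (5) are left to the surrounding application).

```python
import numpy as np

def rotation_from_vanishing_points(vx, vy):
    """vx, vy: homogeneous vanishing points of two orthogonal directions (calibrated image)."""
    r1 = vx / np.linalg.norm(vx)
    r2 = vy / np.linalg.norm(vy)
    # Enforce orthonormality of R12 = [r1 r2] via SVD: keep U and V, set the singular values to 1.
    R12 = np.column_stack([r1, r2])
    U, _, Vt = np.linalg.svd(R12, full_matrices=True)        # U: 3x3, Vt: 2x2
    R12 = U @ np.vstack([np.eye(2), np.zeros((1, 2))]) @ Vt  # best orthonormal approximation
    r1, r2 = R12[:, 0], R12[:, 1]
    r3 = np.cross(r1, r2)
    return np.column_stack([r1, r2, r3])
```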
2.2 The Circular Points and Pose Estimation
The absolute conic (AC) is a conic on the ideal plane, which can be expressed in matrix form as Ω∞ = diag(1, 1, 1). Obviously, Ω∞ is composed of purely imaginary points on the infinite plane. Under perspective projection, we can obtain the image of the absolute conic (IAC) as ω_a = (KK^T)^{-1}, which depends only on the camera calibration matrix K. The IAC is an invisible imaginary point conic in an image. It is easy to verify that the absolute conic intersects the ideal line at two ideal complex conjugate points, which are called the circular points. The circular points can be expressed in canonical form as I = [1, i, 0, 0]^T, J = [1, −i, 0, 0]^T. Under perspective projection, their images can be expressed as:

s_i m̃_i = P I = [r1, r2, r3, t][1, i, 0, 0]^T = r1 + i r2    (6)
s_j m̃_j = P J = [r1, r2, r3, t][1, −i, 0, 0]^T = r1 − i r2    (7)
Thus the imaged circular points (ICPs) are a pair of complex conjugate points, whose real and imaginary parts are defined by the first two columns of the rotation matrix. However, the rotation matrix can not be determined uniquely from the ICPs since (6) and (7) are defined up to scales.
Proposition 2. Suppose m_i and m_j are the ICPs of a space plane and the world frame is set on the plane. Then the pose of the camera can be uniquely determined from m_i and m_j if one direction of the world frame is preassigned. Proof. It is easy to verify that the line passing through the two imaged circular points is real; it is the vanishing line of the plane and can be computed as l_∞ = m_i × m_j. Suppose o_x is the image of one axis of the preassigned world frame; its vanishing point v_x is the intersection of the line o_x with l_∞. If the vanishing point v_y of the Y direction is recovered, the camera pose follows from Proposition 1. Since the vanishing points of two orthogonal directions are conjugate with respect to the IAC, v_y can easily be computed from the system v_x^T ω v_y = 0, l_∞^T v_y = 0. On the other hand, since two orthogonal vanishing points are harmonic with respect to the ICPs, their cross ratio satisfies Cross(v_x, v_y; m_i, m_j) = -1, so v_y can also be computed from the cross ratio.
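A small sketch of the first construction in this proof, assuming the IAC ω is available as a 3 × 3 matrix (the identity for calibrated, normalized coordinates) and the vanishing line l_∞ as a homogeneous 3-vector; the function name is ours.

import numpy as np

def vanishing_point_y(vx, l_inf, omega=np.eye(3)):
    # v_y must satisfy  v_x^T omega v_y = 0  and  l_inf^T v_y = 0.
    # Both constraints are linear, so v_y is the cross product of the
    # two constraint row vectors.
    a = omega.T @ vx
    vy = np.cross(a, l_inf)
    return vy / np.linalg.norm(vy)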
3 Methods for Pose Estimation
3.1 Pose Estimation from the Image of a Circle
Lemma 1. Any circle Ω_c in a space plane π intersects the absolute conic Ω_∞ at exactly two points, which are the circular points of the plane. Without loss of generality, let us set the XOY world frame on the supporting plane. Then any circle on the plane can be modelled in homogeneous form as (x - w x_0)^2 + (y - w y_0)^2 - w^2 r^2 = 0. The plane π intersects the ideal plane π_∞ at the vanishing line L_∞. In the extended plane of the complex domain, L_∞ has at most two intersections with Ω_c, and it is easy to verify that these intersections are the circular points. Lemma 2. The image of the circle Ω_c intersects the IAC at four complex points, which can be divided into two pairs of complex conjugate points. Under perspective projection, any circle Ω_c on a space plane is imaged as a conic ω_c = H^{-T} Ω_c H^{-1}, which is an ellipse in the nondegenerate case. The absolute conic is projected to the IAC. Both the IAC and ω_c are conics of second order that can be written in homogeneous form as x^T ω x = 0. According to Bézout's theorem, the two conics have four imaginary intersection points, since the absolute conic and the circle have no real intersections in space. If the complex point a + bi is one intersection, it is easy to verify that its conjugate a - bi is also a solution; thus the four intersections can be divided into two complex conjugate pairs. One of these pairs is the ICPs, but the ambiguity cannot be resolved from the image of a single circle alone. If there are two or more circles on the same or parallel space planes, the ICPs can be determined uniquely, since the imaged circular points are the common intersections of each imaged circle with the IAC. However, in many situations only one circle is available; how can the ICPs be determined in this case?
Proposition 3. The imaged circular points can be uniquely determined from the image of one circle if the center of the circle can be detected in the image. Proof. As shown in Fig. 1, the image of the circle ω_c intersects the IAC at two pairs of complex conjugate points m_i, m_j and m'_i, m'_j. Let us define two lines as

l_∞ = m_i × m_j,   l'_∞ = m'_i × m'_j    (8)
then one of the lines must be the vanishing line and its two supporting points must be the ICPs. Suppose o_c is the image of the circle center and l_∞ is the vanishing line; then there is a pole-polar relationship between the imaged center o_c and the vanishing line with respect to the conic:

λ l_∞ = ω_c o_c    (9)
where λ is a scalar. Thus the true vanishing line and the imaged circular points can be determined from (9). Under perspective projection, a circle is transformed into a conic. However, the center of the circle in space usually does not project to the center of the corresponding conic in the image, since the perspective projection (1) is not a linear mapping from space to image. Thus the imaged center of the circle cannot be determined from the contour of the imaged conic alone. There are several possible ways to recover the projected center of the circle using additional geometric information, such as two or more lines passing through the center [13] or two concentric circles [10,11].
(b)
Fig. 1. Determining the ICPs from the image of one circle. (a) a circle and preassigned world frame in space; (b) the imaged conic of the circle.
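The following sketch illustrates Proposition 3 under the stated assumptions: given the imaged conic and the imaged circle center as NumPy arrays, the vanishing line follows from the pole-polar relation (9), and the ICPs are obtained by intersecting it with the IAC (taken as the identity for calibrated, normalized coordinates). The helper names are ours, not the authors'.

import numpy as np

def vanishing_line_from_circle(conic, center_img):
    # Pole-polar relation (9): the vanishing line is the polar of the
    # imaged circle center with respect to the imaged conic.
    l = conic @ center_img
    return l / np.linalg.norm(l)

def imaged_circular_points(l_inf, omega=None):
    # Intersect the vanishing line with the IAC: parameterize the line by
    # two real points spanning its null space and solve the quadratic.
    if omega is None:
        omega = np.eye(3)
    omega = omega.astype(complex)
    _, _, Vt = np.linalg.svd(l_inf.reshape(1, 3))
    p, q = Vt[1].astype(complex), Vt[2].astype(complex)
    A = q @ omega @ q
    B = 2.0 * (p @ omega @ q)
    C = p @ omega @ p
    s1, s2 = np.roots([A, B, C])
    return p + s1 * q, p + s2 * q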
Proposition 4. The imaged circular points can be recovered from the image of one circle with a preassigned world coordinate system. Proof. As shown in Fig. 1, suppose lines x and y are the images of the two axes of the preassigned world frame; the two lines intersect l_∞ and l'_∞ at four points. Since the two ICPs and the two orthogonal vanishing points form a harmonic relation, the true ICPs can be determined by verifying the cross ratio of the
two quadruples of collinear points {m_i, m_j, v_x, v_y} and {m'_i, m'_j, v'_x, v'_y}. Then the camera pose can be computed according to Proposition 2.

3.2 Pose Estimation from Two Pairs of Parallel Lines
Proposition 5. The pose of the camera can be recovered from the image of any two general pairs of parallel lines in space. Proof. As shown in Fig. 2, suppose L11, L12 and L21, L22 are two pairs of parallel lines in space; they need not be coplanar or orthogonal. Their images l11, l12 and l21, l22 intersect at v_1 and v_2 respectively, so v_1 and v_2 must be the vanishing points of the two directions, and the line connecting them must be the vanishing line l_∞. Thus m_i and m_j can be computed as the intersections of l_∞ with the IAC. Suppose v_1 corresponds to one direction of the world frame and o_2 is the image of the world origin. Then the vanishing point v_1⊥ of the direction orthogonal to v_1 can easily be computed either from the system

v_1^T ω v_1⊥ = 0,   l_∞^T v_1⊥ = 0,

or from the cross ratio Cross(v_1, v_1⊥; m_i, m_j) = -1, and the pose of the camera can be recovered from Proposition 1. Specifically, the angle α between the two pairs of parallel lines in space can be recovered from

cos α = (v_1^T ω_a v_2) / ( sqrt(v_1^T ω_a v_1) sqrt(v_2^T ω_a v_2) ).

If the two pairs of lines are orthogonal to each other, then v_1⊥ = v_2.
l 12
Fig. 2. Pose estimation from two pairs of parallel lines. Left: two pairs of parallel lines in the space; Right: the image of the parallel lines.
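A minimal sketch of the construction in Proposition 5, assuming the four imaged lines are available as homogeneous 3-vectors (NumPy arrays); it returns the two vanishing points, the line through them, and the angle α. The names are ours.

import numpy as np

def parallel_line_pairs(l11, l12, l21, l22, omega=np.eye(3)):
    v1 = np.cross(l11, l12)          # vanishing point of the first pair
    v2 = np.cross(l21, l22)          # vanishing point of the second pair
    l_inf = np.cross(v1, v2)         # line through the two vanishing points
    num = v1 @ omega @ v2
    den = np.sqrt(v1 @ omega @ v1) * np.sqrt(v2 @ omega @ v2)
    alpha = np.arccos(np.clip(num / den, -1.0, 1.0))
    return v1, v2, l_inf, alpha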
3.3 Projection Matrix and 3D Reconstruction
After retrieving the pose of the camera, the projection matrix with respect to the world frame can be computed from (1). With the projection matrix, any geometric primitive in the image can be back-projected into space: a point is back-projected to a line, a line to a plane, and a conic to a cone. Based on scene constraints, many geometric entities, such as length ratios, angles, and the 3D information of some planar surfaces, can be recovered via single view metrology [3,17,18]. Therefore the 3D structure of some simple objects and scenes can be reconstructed from a single image.
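For instance, back-projecting an image point with P = [R, t] reduces to the following few lines (a sketch under the calibrated, normalized-coordinate assumption; names are ours):

import numpy as np

def backproject_pixel(x_norm, R, t):
    # The viewing ray passes through the camera centre C = -R^T t and has
    # direction R^T x in the world frame (x in normalized homogeneous coords).
    C = -R.T @ t
    d = R.T @ x_norm
    return C, d / np.linalg.norm(d)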
4 Experiments with Simulated Data
During the simulations, we generated a circle and two orthogonal pairs of parallel lines in space, whose sizes and positions in the world system are shown in Fig. 3. Each line is composed of 50 evenly distributed points, and the circle is composed of 100 evenly distributed points. The camera parameters were set as follows: focal length f_u = f_v = 1800, skew s = 0, principal point u_0 = v_0 = 0, rotation axis r = [0.717, -0.359, -0.598], rotation angle α = 0.84, translation vector t = [2, -2, 100]. The image resolution was set to 600 × 600 and Gaussian noise was added to each imaged point. The generated image with 1-pixel Gaussian noise is shown in Fig. 3.
200
Fig. 3. The synthetic scenario and image for simulation
In the experiments, the image lines and the imaged conic were fitted via least squares. We set L11 and L21 as the X and Y axes of the world frame, and recovered the ICPs and camera pose with the proposed methods. Here we only report the recovered rotation matrix. For ease of comparison, we decomposed the rotation matrix into a rotation axis and a rotation angle; we define the axis error as the angle between the recovered axis and the ground truth, and the angle error as the absolute difference between the recovered angle and the ground truth. We varied the noise level from 0 to 3 pixels in steps of 0.5 and ran 200 independent tests at each noise level to obtain statistically meaningful results.
Noise level
Fig. 4. The mean and standard deviation of the errors of the rotation axis and rotation angle with respect to the noise levels
370
G. Wang, Q.M.J. Wu, and Z. Ji
The mean and standard deviation of the errors for the two methods are shown in Fig. 4. The accuracy of the two methods is comparable at small noise levels (< 1.5 pixels), while the vanishing-point-based method (Alg. 2) is superior to the circle-based one (Alg. 1) at larger noise levels.
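The axis-angle decomposition used for these error measures can be computed, for example, as in the following sketch (standard trace/skew formulas; it does not treat the degenerate angles 0 and π specially, and the function name is ours):

import numpy as np

def rotation_to_axis_angle(R):
    # angle from the trace, axis from the skew-symmetric part of R
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]])
    n = np.linalg.norm(axis)
    axis = axis / n if n > 1e-12 else np.array([1.0, 0.0, 0.0])
    return axis, angle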
5 Tests with Real Images
All images in the tests were captured by a Canon PowerShot G3 at a resolution of 1024 × 768. The camera was pre-calibrated via Zhang's method [20]. Test on the tea box image: For this test, the selected world frame, two pairs of parallel lines, and the two conics detected by the Hough transform are shown in Fig. 5. The line segments were detected and fitted via an orthogonal regression algorithm [16]. We recovered the rotation axis, rotation angle (unit: rad), and translation vector with the two methods, as shown in Table 1, where the translation vector is normalized so that ||t|| = 1. Although we do not have ground truth, the results are consistent with the imaging conditions.
Fig. 5. Test results on the tea box image. Upper: the image with the detected conics, parallel lines, and world frame used for pose estimation; Lower: the reconstructed tea box model from different viewpoints with texture mapping.
To further evaluate the recovered parameters, we reconstructed the 3D structure of the scene from the recovered projection matrix via the method in [17]. The result is shown from different viewpoints in Fig. 5. We manually measured the tea box and the grid in the background and registered the reconstruction to the ground truth. We then computed the relative error E1 of the side length of the grid and the relative errors E2 and E3 of the diameter and height of the circle. As listed in Table 1, the reconstruction errors are very small, which in turn verifies the accuracy of the recovered parameters. Test on the book image: The image with the detected conic, the preassigned world frame, and two pairs of parallel lines is shown in Fig. 6. We recovered the
Table 1. Test results and performance evaluations for real images

Images  Method  Raxis                        Rangle  t                   E1 (%)  E2 (%)  E3 (%)
Box     Alg.1   [-0.9746, 0.1867, -0.1238]   2.4385  [-0.08, 0.13, 0.98]  0.219   0.327   0.286
Box     Alg.2   [-0.9748, 0.1864, -0.1228]   2.4354  [-0.08, 0.13, 0.98]  0.284   0.315   0.329
Book    Alg.1   [-0.9173, 0.3452, -0.1984]   2.2811  [-0.02, 0.09, 0.99]  0.372   0.365   0.633
Book    Alg.2   [-0.9188, 0.3460, -0.1899]   2.3163  [-0.02, 0.09, 0.99]  0.306   0.449   0.547
pose of the camera with the proposed methods, then computed the relative errors E1, E2, and E3 of the three side lengths of the book with respect to ground truth measured manually. The results are shown in Table 1. The reconstructed 3D structure of the book is shown in Fig. 6. The results are realistic and of good accuracy.
Fig. 6. Pose estimation and 3D reconstruction of the book image
6 Conclusion
In this paper, we proposed and proved that the pose of a calibrated camera can be recovered from a single image of one circle or of two general pairs of parallel lines. Compared with previous techniques, fewer conditions are required by the proposed method, so the results may find wide application. Since the method uses minimal information, it is important to adopt robust techniques to fit the conics and lines.
Acknowledgment The work is supported in part by the Canada Research Chair program and the National Natural Science Foundation of China under grant no. 60575015.
References
1. Ansar, A., Daniilidis, K.: Linear pose estimation from points or lines. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 578–589 (2003)
2. Chen, H.H.: Pose determination from line-to-plane correspondences: Existence condition and closed-form solutions. IEEE Trans. Pattern Anal. Mach. Intell. 13(6), 530–541 (1991)
3. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. International Journal of Computer Vision 40(2), 123–148 (2000)
4. Dhome, M., Richetin, M., Lapreste, J.T.: Determination of the attitude of 3D objects from a single perspective view. IEEE Trans. Pattern Anal. Mach. Intell. 11(12), 1265–1278 (1989)
5. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
6. Gao, X.S., Tang, J.: On the probability of the number of solutions for the P4P problem. J. Math. Imaging Vis. 25(1), 79–86 (2006)
7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
8. Horaud, R., Conio, B., Leboulleux, O., Lacolle, B.: An analytic solution for the perspective 4-point problem. CVGIP 47(1), 33–44 (1989)
9. Hu, Z.Y., Wu, F.C.: A note on the number of solutions of the noncoplanar P4P problem. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 550–555 (2002)
10. Jiang, G., Quan, L.: Detection of concentric circles for camera calibration. In: Proc. of ICCV, pp. 333–340 (2005)
11. Kim, J.S., Gurdjos, P., Kweon, I.S.: Geometric and algebraic constraints of projected concentric circles and their applications to camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 27(4), 637–642 (2005)
12. Liu, Y., Huang, T.S., Faugeras, O.D.: Determination of camera location from 2-D to 3-D line and point correspondences. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 28–37 (1990)
13. Meng, X., Li, H., Hu, Z.: A new easy camera calibration technique based on circular points. In: Proc. of BMVC (2000)
14. Nistér, D., Stewénius, H.: A minimal solution to the generalised 3-point pose problem. J. Math. Imaging Vis. 27(1), 67–79 (2007)
15. Quan, L., Lan, Z.: Linear n-point camera pose determination. IEEE Trans. Pattern Anal. Mach. Intell. 21(8), 774–780 (1999)
16. Schmid, C., Zisserman, A.: Automatic line matching across views. In: Proc. of CVPR, pp. 666–671 (1997)
17. Wang, G.H., Hu, Z.Y., Wu, F.C., Tsui, H.T.: Single view metrology from scene constraints. Image Vision Comput. 23(9), 831–840 (2005)
18. Wang, G.H., Tsui, H.T., Hu, Z.Y., Wu, F.C.: Camera calibration and 3D reconstruction from a single view based on scene constraints. Image Vision Comput. 23(3), 311–323 (2005)
19. Wang, G.H., Wang, S., Gao, X., Li, Y.: Three dimensional reconstruction of structured scenes based on vanishing points. In: Proc. of PCM, pp. 935–942 (2006)
20. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
An Occupancy–Depth Generative Model of Multi-view Images
Pau Gargallo, Peter Sturm, and Sergi Pujades
INRIA Rhône-Alpes and Laboratoire Jean Kuntzmann, France
[email protected] Abstract. This paper presents an occupancy based generative model of stereo and multi-view stereo images. In this model, the space is divided into empty and occupied regions. The depth of a pixel is naturally determined from the occupancy as the depth of the first occupied point in its viewing ray. The color of a pixel corresponds to the color of this 3D point. This model has two theoretical advantages. First, unlike other occupancy based models, it explicitly models the deterministic relationship between occupancy and depth and, thus, it correctly handles occlusions. Second, unlike depth based approaches, determining depth from the occupancy automatically ensures the coherence of the resulting depth maps. Experimental results computing the MAP of the model using message passing techniques are presented to show the applicability of the model.
1 Introduction
Extracting 3D information from multiple images is one of the central problems of computer vision. It has applications to photorealistic 3D reconstruction, image based rendering, tracking and robotics, among others. Many successful methods exist for each application, but a satisfactory general formalism is still to be defined. In this paper we present a simple probabilistic generative model of multi-view images that accurately defines the natural relationship between the shape and color of the 3D world and the observed images. The model is constructed with the philosophy that if one is able to describe the image formation process with a model, then Bayesian inference can be used to invert the process and recover information about the model from the images. The approach yields a generic, widely applicable formalism. The price to pay for such generality is that the results for each specific application will be poor compared to those of more specialized techniques. There are mainly two approaches to the stereo and multi-view stereo problems. In small-baseline situations, the 3D world is represented by a depth map on a reference image and computing depth is regarded as a matching problem [1]. In wide-baseline situations, it is often more convenient to represent the shape of the objects by a surface or an occupancy function [2]. Depth and occupancy are
obviously highly related, but most algorithms concentrate on finding only one of the two. The main problem with either approach is occlusion. The fact that a 3D point is not always visible from all the cameras makes the extraction of 3D information from images hard. The two main approaches to this issue are to treat occluded points as outliers [3] or to explicitly model the geometric reason for the occlusion. Making an accurate generative model for multi-view images, as we wish to do, necessarily involves modeling occlusions geometrically, because geometric occlusions really exist in the true image formation process. In fact, geometric occlusions are so important that there are reconstruction techniques, like those based on the visual hull, that use only the occlusion information [4,5]. Geometric occlusions can be modeled effectively in depth based approaches by computing a depth map for every input image [6,7,8]. This requires adding constraints so that the multiple depth maps are coherent and form a single surface. These constraints are not necessary in shape based approaches, which incorporate them implicitly because they compute a single model for all the images. Shape based approaches usually deal with geometric occlusions in an alternating way [9,10,11]: they first compute the visibility given the current estimate of the shape, and then modify the shape according to some criterion. This procedure disregards the fact that the visibility changes while the shape is modified. A voxel carving technique carving inside the object, or a shrinking surface evolution, are consequences of this oversight. Our recent work [12] avoids these problems by explicitly taking into account the visibility changes during the surface evolution; it is the continuous version of the model presented in this paper. The model presented in this paper explicitly characterizes the relationship between depth and shape and profits from the benefits of both worlds. The occupancy automatically gives coherence to the depth maps, and properly deriving the depth maps from the occupancy implicitly encodes the geometric occlusions. Our model is closely related, both in objective and approach, to the recent work of Hernández et al. [13], in which depth cues are probabilistically integrated for inferring occupancy. The principal difference between their model and the one presented here is that they make some independence assumptions that we do not. In particular, they assume the depths of different pixels to be independent, which greatly simplifies the inference, and good results are achieved. In our model, as in the real world, depth is determined by the occupancy and therefore the depths of different pixels are not independent. This creates a huge, very loopy factor graph representation of the joint probability of occupancy and depth. Inference in such a graph is hard, as we will see in the experiments.
2 The Model
This section presents the occupancy-depth model. We first introduce the random variables involved in the generative process. Then we decompose their joint probability distribution into simpler terms and give a form to each of them.
2.1 Occupancy, Depth and Color Variables
Consider a discretization of 3D space into a finite set of sites S ⊂ R^3. A given site x ∈ S can be in free space or inside an object. This defines the occupancy of the site, represented by a binary random variable u_x (1 meaning occupied and 0 meaning free). The occupancy of the whole space is represented by the random process u : S → {0, 1}, which defines the shape of the objects in the space. The shape of the objects is not enough to generate images; their appearance is also needed. In the simplest case, under the constant brightness assumption, the appearance can be represented by a color at each point on the surface of the objects. As we do not yet know the shape of the objects, we need to define the color of all sites in S, even if only the color of the sites lying on the surface is relevant. The color is represented by a random process C : S → R^3. The depth of a pixel is defined as the distance (measured along the camera's z-axis) between the optical center of the camera and the 3D point observed at the pixel. Given the occupancy of the space and the position and calibration of a camera i, the depth D_p^i of a pixel p is determined as the depth of the first occupied point on its viewing ray. The observed color at that pixel is denoted by I_p^i. This color should ideally correspond to the color of the site observed at that pixel, i.e. the point of the viewing ray of p which is at depth D_p^i.
2.2 Decomposition
Having defined all the variables (depths, colors, occupancies and observed colors), we will now define their joint probability distribution. To do so, we first decompose the distribution into terms representing the natural dependence between the variables. One can think of this step as defining the way the data (the images) were generated. The proposed decomposition is

p(u, C, D, I) = p(u) p(C|u) ∏_{i,p} p(D_p^i | u) ∏_{i,p} p(I_p^i | D_p^i, C).    (1)
Fig. 1. Bayesian network representation of the joint probability decomposition
It is represented graphically in Figure 1. Each term of the decomposition corresponds to a variable and, thus, to a node of the network. The arrows represent the statistical dependencies between the variables, in other words, the order one has to follow to generate random samples from the model from scratch. The data generation process is therefore as follows. First one builds the objects of the world by generating an occupancy function. Then one paints them by choosing the space colors. Finally, one takes pictures of the generated world: one first determines which points are visible from the camera by computing the depth of the pixels, and then sets the color of each pixel to the color of the observed 3D point. In the following sections, we define each of the terms of the decomposition (1).
2.3 World Priors
Not all the possible occupancy functions are equally likely a priori. One expects the occupied points to be gathered together forming objects, rather than randomly scattered over the 3D space. To represent such a belief, we choose the occupancy u to follow a Markov Random Field distribution. This gives the following prior,

p(u) ∝ exp( − Σ_{x,y} ψ(u_x, u_y) )    (2)
where the sum extends over all pairs of neighboring points (x, y) in a grid discretization S of the 3D space. The energy potentials are of the form ψ(u_x, u_y) = α|u_x − u_y|, so that they penalize neighboring sites with different occupancies by a cost α. This prior is isotropic in the sense that two neighboring points are equally likely to have the same occupancy regardless of their position and color. From experience, we know that discontinuities or edges in images often correspond to discontinuities in the occupancy (the occluding contours). Therefore, one could be tempted to use the input images to derive a smoothing prior for the occupancy that is weaker at points projecting to image discontinuities. While effective, this would not be correct from a Bayesian point of view, as one would be using the data to derive a prior for the model. We will now see how to obtain this anisotropic smoothing effect in a more theoretically well-founded way. In the proposed decomposition, the prior on the color of the space depends on the occupancy. This makes it possible to express the following idea: two neighboring points that are both occupied or both free are likely to have similar colors, while the colors of two points with different occupancies are not necessarily related. This can be expressed by the MRF distribution

p(C|u) ∝ exp( − Σ_{x,y} φ(C_x, C_y, u_x, u_y) )    (3)
with

φ(C_x, C_y, u_x, u_y) = { ρ(C_x − C_y)  if u_x = u_y;  0  otherwise }    (4)
where ρ(·) is a robust penalty function that penalizes color differences between neighboring points with the same occupancy. Combining the prior on the occupancy with the prior on the color, we have

p(u, C) ∝ exp( − Σ_{x,y} ψ̄(C_x, C_y, u_x, u_y) )    (5)
with

ψ̄(C_x, C_y, u_x, u_y) = { ρ(C_x − C_y)  if u_x = u_y;  α  otherwise }    (6)
If we are given the color of the space, then p(u|C) ∝ p(u, C) is a color-driven smoothing prior on the occupancy: neighboring points with the same color are more likely to have the same occupancy than neighboring points with different colors. As the color will be estimated from the images, color discontinuities will coincide with edges in the images. Thus, this term represents our experience-based knowledge that object borders coincide with image edges.
2.4 Pixel Likelihood
The color I_p^i observed at a pixel should be equal to the color of the 3D point visible at that pixel, up to sensor noise and other unmodeled effects, e.g. specularities. If we denote the color of the observed 3D point by C(D_p^i), we have

p(I_p^i | D_p^i, C) ∝ exp( −ρ( I_p^i − C(D_p^i) ) )    (7)
where ρ is some robust penalty function.
Fig. 2. Bayesian network of the model associated with a single viewing ray. The occupancy variables u_i along the viewing ray determine the depth d. The color of the image I corresponds to the color C at depth d.
Note that unlike traditional stereo algorithms, here there are no occlusions to be taken into account by the function ρ. We are matching the color of a pixel with the color of the observed scene point, not with the color of pixels in the other images. The observed scene point is, by definition, non-occluded, so no occlusion problem appears here.
Depth Marginalization. The likelihood (7) of a pixel depends only on its depth and not on the whole occupancy function. However, the relationship between occupancy and depth is simple and deterministic. Therefore, it is easy to marginalize out the depth and express the likelihood directly in terms of the occupancy of the points on the viewing ray of the pixel. To simplify the notation we do the computations for a single pixel. Figure 2 shows the Bayesian network associated with a single pixel. The points of its viewing ray are denoted by the natural numbers {0, 1, 2, ...}, ordered by increasing distance to the camera. Their occupancy is a vector u such that u_i is the occupancy of the i-th point on the viewing ray. The depth is denoted by d. With this notation, the probability of a depth given the occupancy is

p(d|u) = u_d ∏_{i<d} (1 − u_i)    (8)
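Equation (8) can be evaluated for a whole ray in one pass; the sketch below is our illustration (not the authors' code) and also accepts per-site occupancy probabilities in place of binary values.

import numpy as np

def depth_probabilities(u):
    # p(d|u) = u_d * prod_{i<d} (1 - u_i): the depth is the first occupied
    # site along the ray, sites ordered by increasing distance to the camera.
    u = np.asarray(u, dtype=float)
    free_before = np.concatenate(([1.0], np.cumprod(1.0 - u)[:-1]))
    return u * free_before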
(18)

which can be computed in linear time.

3.3 Color Estimation
In the M-step the color of the space is computed by maximizing the expected log-posterior,

⟨ln p(u, C|I)⟩_q = ⟨ln p(I|u, C)⟩_q + ⟨ln p(u, C)⟩_q + const.    (19)
where ⟨·⟩_q denotes the expectation with respect to u, assuming that it follows the variational distribution q. Again, the expectation is a sum over all possible occupancy configurations. Simplifications similar to the ones done above yield

⟨ln p(I|u, C)⟩_q = − Σ_d q_d(1) ∏_{i<d} q_i(0) ρ( I − C(d) )    (20)
Fig. 4. (a) Aerial view of the academic area; (b) orthographic view of the academic area (courtesy Google). The registration parameters are given in Table 2
5 Image Mosaicing

Images aligned after undergoing geometric corrections require further processing to eliminate distortions and discontinuities. Alignment may be imperfect due to registration errors resulting from incompatible model assumptions, dynamic scenes, etc. Image composition is the technique that modifies the image gray levels in the vicinity of an alignment boundary to obtain a smooth transition between images, removing these seams and creating a blended image by determining how pixels in an overlapping area should be presented. The term image spline refers to digital techniques for making these adjustments [19]. The images to be splined are first decomposed into
Fig. 5. (a) Mosaic with the affine assumption, showing misalignments and distortion in regions of strong perspective; (b) mosaic of the same set as (a) constructed with the perspective assumption; (c) mosaic with the perspective assumption for another, strongly perspective image set
a set of band-pass filtered component images (image components in a spatial frequency band). Next, the component images in each spatial frequency band are assembled into a corresponding band-pass mosaic. In this step, component images are merged using a weighted average within a transition zone that is proportional in size to the wavelengths represented in the band. Finally, by summing these band-pass mosaic images, the desired image mosaic is generated. The spline is thus matched to the scale of features within the images. Figure 3(a) is the mosaic constructed without the blending algorithm and has clear visual artifacts along seam lines; in Figure 3(b) these artifacts are eliminated by blending.
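The band-pass splining described above is essentially Laplacian-pyramid blending. The sketch below is a simplified two-image stand-in for the method of [19] (our code, not the system's implementation), assuming 8-bit images and a 0-1 float mask of the same spatial size.

import cv2
import numpy as np

def blend_multiband(img1, img2, mask, levels=4):
    # Laplacian pyramids of the images, Gaussian pyramid of the mask;
    # merge band by band with mask-weighted averages, then collapse.
    g1, g2, gm = [img1.astype(np.float32)], [img2.astype(np.float32)], [mask.astype(np.float32)]
    for _ in range(levels):
        g1.append(cv2.pyrDown(g1[-1]))
        g2.append(cv2.pyrDown(g2[-1]))
        gm.append(cv2.pyrDown(gm[-1]))
    blended = None
    for i in range(levels, -1, -1):
        if i == levels:
            l1, l2 = g1[i], g2[i]                       # coarsest level keeps the low-pass
        else:
            size = (g1[i].shape[1], g1[i].shape[0])
            l1 = g1[i] - cv2.pyrUp(g1[i + 1], dstsize=size)
            l2 = g2[i] - cv2.pyrUp(g2[i + 1], dstsize=size)
        m = gm[i] if gm[i].ndim == l1.ndim else gm[i][..., None]
        band = l1 * m + l2 * (1.0 - m)
        if blended is None:
            blended = band
        else:
            size = (band.shape[1], band.shape[0])
            blended = cv2.pyrUp(blended, dstsize=size) + band
    return np.clip(blended, 0, 255).astype(np.uint8)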
6 Results

The system has been implemented in Visual C++ 6.0 using the OpenCV library [20] from Intel. We have performed timing and performance analysis of the proposed methodology with different feature detectors and analyzed their significance for accurate registration. The reported values are subjective since the results depend on the image data set. Since the Nielsen method is based on pattern matching, it is nontrivial to establish the logical relevance of the extracted feature points to the information contained in the image [6]. For validation, the presented algorithm was applied to pairs of test images for which ground truthing was performed manually. KLT and Harris features take
Table 1. Performance Evaluation

Dimensions  Features  Nielsen's algorithm (sec)  Self algorithm (sec)
728x388     100       15                         5
728x388     150       27                         8
728x388     175       34                         10
800x600     100       19                         5
800x600     150       30                         8.7

Table 2. Ground truth result of Histogram method

                   Angle      Scale
Ground Truthing    34.062°    1.4512
Histogram Method   33.6752°   1.445
equivalent computation times, but feature extraction based on Itti's model consumes significantly more time. KLT features were observed to be more stable for aerial images than Harris features. Salient points (Itti's model) were also observed to be stable and well distributed over the entire image. A comparison of the Nielsen algorithm with default parameters and the proposed algorithm is given in Table 1. There is an overall decrease in computation time, and we are looking at the possibility of running this framework in real time. We also tested our registration algorithm with aerial images and orthographic images from Google Earth [21]. Manual registration was performed using standard Matlab functions; the results obtained from the algorithm are listed in Table 2. Fig. 5(b) and 5(c) are the resulting mosaics for two sets of aerial images. We can clearly see distortions in the image shown in Fig. 5(a), constructed from the same set of images as Fig. 5(b) but with the affine assumption [22]. The algorithm with the affine assumption fails to compute the correct homography in regions of strong perspective, producing clear misalignments and distortions in the constructed mosaic. The proposed algorithm works even for Fig. 5(c), which uses an image set with significantly stronger perspective geometry, resulting in a mosaic with invisible seam lines.
7 Conclusion

Unmanned Aerial Vehicles (UAVs) equipped with lightweight, inexpensive cameras have grown in popularity, enabling new applications of UAV technology. Beginning with an investigation of previous registration and mosaicing work, this paper discussed the challenges of registering UAV-based video sequences. A novel approach to estimating registration parameters was then presented. Future work on the proposed algorithm includes a frequency domain approach that could provide a rough estimate of the parameters, helping the registration algorithm prune its search. With the availability of coarse or fine metadata, significant further improvements in performance can be achieved.
References
1. Xiong, Y., Quek, F.: Automatic Aerial Image Registration Without Correspondence. In: ICVS 2006. The 4th IEEE International Conference on Computer Vision Systems, St. Johns University, Manhattan, New York City, New York, USA (January 5-7, 2006)
2. Sheikh, Y., Khan, S., Shah, M., Cannata, R.W.: Geodetic Alignment of Aerial Video Frames. VideoRegister03, ch. 7 (2003)
3. Wildes, R., Hirvonen, D., Hsu, S., Kumar, R., Lehman, W., Matei, B., Zhao, W.: Video Registration: Algorithm and quantitative evaluation. In: Proc. International Conference on Computer Vision, vol. 2, pp. 343–350 (2001)
4. Li, H., Manjunath, B.S., Mitra, S.K.: A contour based approach to multisensor image registration. IEEE Trans. Image Processing, 320–334 (March 1995)
5. Cideciyan, A.V.: Registration of high resolution images of the retina. In: Proc. SPIE, Medical Imaging VI: Image Processing, vol. 1652, pp. 310–322 (February 1992)
6. Nielsen, F.: Adaptive Randomized Algorithms for Mosaicing Systems. IEICE Transactions on Information and Systems E83-D(7), 1386–1394 (2000)
7. Canny, J.: A computational approach to edge detection. IEEE PAMI, 679–698 (1986)
8. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf., pp. 189–192 (1988)
9. Tomasi, C., Kanade, T.: Detection and Tracking of Point Features. CMU Technical Report CMU-CS-91-132 (April 1991)
10. Itti, L.: Models of Bottom-Up and Top-Down Visual Attention. PhD thesis, Pasadena, California (2000)
11. Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience 2(3), 194–203 (2001)
12. Ouerhani, N.: Visual Attention: From Bio-Inspired Modelling to Real-Time Implementation. PhD thesis (2003)
13. Backer, G., Mertsching, B.: Two selection stages provide efficient object-based attentional control for dynamic vision. In: International Workshop on Attention and Performance in Computer Vision (2004)
14. Zoghliami, I., Faugeras, O., Deriche, R.: Using geometric corners to build a 2D mosaic from a set of images. In: CVPR Proceedings (1997)
15. Brown, M., Lowe, D.G.: Invariant features from interest point groups. In: BMVC 2002. British Machine Vision Conference, Cardiff, Wales, September 2002, pp. 656–665 (2002)
16. Maji, S., Mukerjee, A.: Motion Conspicuity Detection: A Visual Attention model for Dynamic Scenes. Report on CS497, IIT Kanpur, available at www.cse.iitk.ac.in/report-repository/2005/Y2383 497-report.pdf
17. Brown, M., Lowe, D.G.: Recognising panoramas. In: ICCV 2003. International Conference on Computer Vision, Nice, France, October 2003, pp. 1218–1225 (2003)
18. Kyung, Lacroix, S.: A Robust Interest Point Matching Algorithm. IEEE (2001)
19. Hsu, C., Wu, J.: Multiresolution mosaic. In: Proc. International Conference on Image Processing, September 16-19, 1996, vol. 3, pp. 743–746 (1996)
20. Intel Open Source Computer Vision Library, http://www.intel.com/technology/computing/opencv/index.htm
21. http://www.earth.google.com
22. Gupta, G.: Feature Based Aerial Image Registration and Mosaicing. Bachelors Dissertation, IIT Kanpur (December 2006)
Simultaneous Appearance Modeling and Segmentation for Matching People Under Occlusion Zhe Lin, Larry S. Davis, David Doermann, and Daniel DeMenthon Institute for Advanced Computer Studies University of Maryland, College Park, MD 20742, USA {zhelin,lsd,doermann,daniel}@umiacs.umd.edu
Abstract. We describe an approach to segmenting foreground regions corresponding to a group of people into individual humans. Given background subtraction and ground plane homography, hierarchical part-template matching is employed to determine a reliable set of human detection hypotheses, and progressive greedy optimization is performed to estimate the best configuration of humans under a Bayesian MAP framework. Then, appearance models and segmentations are simultaneously estimated in an iterative sampling-expectation paradigm. Each human appearance is represented by a nonparametric kernel density estimator in a joint spatial-color space and a recursive probability update scheme is employed for soft segmentation at each iteration. Additionally, an automatic occlusion reasoning method is used to determine the layered occlusion status between humans. The approach is evaluated on a number of images and videos, and also applied to human appearance matching using a symmetric distance measure derived from the Kullback-Leibler divergence.
1 Introduction
In video surveillance, people often appear in small groups, which yields occlusion of appearances due to the projection of the 3D world to 2D image space. In order to track people or to recognize them based on their appearances, it would be useful to be able to segment the groups into individuals and build their appearance models. The problem is to segment foreground regions from background subtraction into individual humans. Previous work on segmentation of groups can be classified into two categories: detection-based approaches and appearance-based approaches. Detection-based approaches model humans with 2D or 3D parametric shape models (e.g. rectangles, ellipses) and segment foreground regions into humans by fitting these models. For example, Zhao and Nevatia [1] introduce an MCMC-based optimization approach to human segmentation from foreground blobs. Following this work, Smith et al. [2] propose a similar trans-dimensional MCMC model to track multiple humans using particle filters. Later, an EM-based approach is proposed by Rittscher et al. [3] for foreground blob segmentation. On the other
hand, appearance-based approaches segment foreground regions by representing human appearances with probabilistic densities and classifying foreground pixels into individuals based on these densities. For example, Elgammal and Davis [4] introduce a probabilistic framework for human segmentation assuming a single video camera. In this approach, appearance models must first be acquired and used later in segmenting occluded humans. Mittal and Davis [5] deal with the occlusion problem by a multi-view approach using region-based stereo analysis and Bayesian pixel classification. But this approach needs strong calibration of the cameras for its stereo reconstruction. Other multi-view-based approaches [6][7][8] combine evidence from different views by exploiting ground plane homography information to handle more severe occlusions. Our goal is to develop an approach to segment and build appearance models from a single view even if people are occluded in every frame. In this context, appearance modeling and segmentation are closely related modules. Better appearance modeling can yield better pixel-wise segmentation while better segmentation can be used to generate better appearance models. This can be seen as a chicken-and-egg problem, so we solve it by the EM algorithm. Traditional EM-based segmentation approaches are sensitive to initialization and require appropriate selection of the number of mixture components. It is well known that finding a good initialization and choosing a generally reasonable number of mixtures for the traditional EM algorithm remain difficult problems. In [15], a sample consensus-based method is proposed for segmenting and tracking small groups of people using both color and spatial information. In [13], the KDE-EM approach is introduced by applying the nonparametric kernel density estimation method in EM-based color clustering. Later in [14], KDE-EM is applied to single human appearance modeling and segmentation from a video sequence. We modify KDE-EM and apply it to our problem of foreground human segmentation. First, we represent kernel densities of humans in a joint spatial-color space instead of density estimation in a pure color space. This can yield more discriminative appearance models by enforcing spatial constraints on color models. Second, we update assignment probabilities recursively instead of using a direct update scheme in KDE-EM; this modification of feature space and update equations results in faster convergence and better segmentation accuracy. Finally, we propose a general framework for building appearance models from occluded humans and matching them using full or partial observations.
2 Human Detection
In this section, we briefly introduce our human detection approach (details can be found in [16]). The detection problem is formulated as a Bayesian MAP optimization [1]: c∗ = arg maxc P (c|I), where I denotes the original image, c = {h1 , h2 , ...hn } denotes a human configuration (a set of human hypotheses). {hi = (xi , θi )} denotes an individual hypothesis which consists of foot position
Fig. 1. An example of the human detection process. (a) Adaptive rectangular window, (b) Foot candidate regions Rf oot (lighter regions), (c) Object-level likelihood map by hierarchical part-template matching, (d) The initial set of human hypotheses overlaid on the Canny edge map, (e) Human detection result, (f) Shape segmentation result.
x_i and the corresponding model parameters θ_i (which are defined as the indices of part-templates). Using Bayes' rule, the posterior probability can be decomposed into a joint likelihood and a prior as P(c|I) = P(I|c)P(c)/P(I) ∝ P(I|c)P(c). We assume a uniform prior, hence the MAP problem reduces to maximizing the joint likelihood. The joint likelihood P(I|c) is modeled as a multi-hypothesis, multi-blob observation likelihood; multi-blob observation likelihoods have been previously explored in [9][10]. Hierarchical part-template matching is used to determine an initial set of human hypotheses. Given the (off-line estimated) foot-to-head plane homography [3], we search for human foot candidate pixels by matching a part-template tree to edges and binary foreground regions hierarchically, and generate the object-level likelihood map. Local maxima are chosen adaptively from the likelihood map to determine the initial set of human hypotheses. For efficient implementation, we perform matching only for pixels in the foot candidate region R_foot, defined as R_foot = {x | γ_x ≥ ξ}, where γ_x denotes the proportion of foreground pixels in an adaptive rectangular window W(x, (w_0, h_0)) determined by the human vertical axis v_x (estimated from the homography mapping). The window coverage is efficiently calculated using integral images. Then, a fast and efficient greedy algorithm is employed for optimization. The algorithm works progressively: starting with an empty configuration, we iteratively add a new, locally best hypothesis from the remaining set of possible hypotheses until the termination condition is satisfied. The iteration terminates when the joint likelihood stops increasing or no more hypotheses can be added. Fig. 1 shows an example of the human detection process.
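The progressive greedy optimization can be sketched as follows; joint_likelihood is a placeholder for the multi-hypothesis, multi-blob observation likelihood, which is not defined in this excerpt, and the code is our illustration rather than the authors' implementation.

def greedy_configuration(hypotheses, joint_likelihood):
    # Start from an empty configuration and repeatedly add the hypothesis
    # that most increases the joint likelihood; stop when nothing improves.
    config, best = [], joint_likelihood([])
    remaining = list(hypotheses)
    while remaining:
        gains = [(joint_likelihood(config + [h]), h) for h in remaining]
        score, h_best = max(gains, key=lambda g: g[0])
        if score <= best:
            break
        config.append(h_best)
        remaining.remove(h_best)
        best = score
    return config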
3 Human Segmentation
3.1 Modified KDE-EM Approach
KDE-EM [13] was originally developed for figure-ground segmentation. It uses nonparametric kernel density estimation [11] for representing feature distributions of foreground and background. Given a set of sample pixels {x_i, i = 1, 2, ..., N} (with a distribution P), each represented by a d-dimensional feature vector x_i = (x_i1, x_i2, ..., x_id)^t, we can estimate the probability p̂(y) of a new pixel y with feature vector y = (y_1, y_2, ..., y_d)^t belonging to the same distribution P as:

p̂(y ∈ P) = 1/(N σ_1 ... σ_d) Σ_{i=1}^{N} ∏_{j=1}^{d} k( (y_j − x_ij) / σ_j ),    (1)
where the same kernel function k(·) is used in each dimension (or channel) with a different bandwidth σ_j. It is well known that a kernel density estimator can converge to any complex-shaped density given sufficient samples. Due to its nonparametric property, it is also a natural choice for representing the complex color distributions that arise in real images. We extend the color feature space in KDE-EM to incorporate spatial information. This joint spatial-color feature space has been previously explored for feature space clustering approaches such as [12], [15]. The joint space imposes spatial constraints on pixel colors, hence the resulting density representation is more discriminative and can tolerate small local deformations. Each pixel is represented by a feature vector x = (X^t, C^t)^t in a 5D space R^5, with 2D spatial coordinates X = (x_1, x_2)^t and 3D normalized rgs color coordinates C = (r, g, s)^t. In Equation 1, we assume independence between channels and use a Gaussian kernel for each channel. The kernel bandwidths are estimated as in [11]. In KDE-EM, the foreground and background assignment probabilities f̂^t(y) and ĝ^t(y) are updated directly as weighted kernel densities. We modify this by updating the assignment probabilities recursively, multiplying the previous assignment probabilities by weighted kernel densities (see Equation 2). This modification of the feature space and update equations results in faster convergence and better segmentation accuracy, which is quantitatively verified in [17] in terms of pixel-wise segmentation accuracy and the number of iterations needed for foreground/background segmentation.
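Equation (1) with a per-channel Gaussian kernel can be evaluated as in the following sketch (our illustration, assuming NumPy arrays; in the paper d = 5 with the joint spatial-color features described above):

import numpy as np

def kde_probability(y, samples, bandwidths):
    # y: d-vector, samples: N x d array, bandwidths: length-d vector.
    y, samples, h = map(np.asarray, (y, samples, bandwidths))
    z = (y[None, :] - samples) / h[None, :]
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel per channel
    return np.mean(np.prod(k, axis=1)) / np.prod(h)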
3.2 Foreground Segmentation Approach
Given foreground regions R_f from background subtraction and a set of initial human detection hypotheses (h_k, k = 1, 2, ..., K), the segmentation problem is equivalent to a K-class pixel labeling problem. The label set is denoted F_1, F_2, ..., F_K. Given a pixel y, we represent the probability of y belonging to human-k as f̂_k^t(y), where t = 0, 1, 2, ... is the iteration index. The assignment probabilities f̂_k^t(y) are constrained to satisfy Σ_{k=1}^{K} f̂_k^t(y) = 1.
r = R/(R + G + B), g = G/(R + G + B), s = (R + G + B)/3.
Algorithm 1. Initialization by Layered Occlusion Model
  initialize R_0^0(y) = 1 for all y ∈ R_f
  for k = 1, 2, ..., K − 1
    for all y ∈ R_f
      f̂_k^0(y) = R_{k−1}^0(y) exp( −(1/2)(Y − Y_{0,k})^t V^{−1} (Y − Y_{0,k}) )  and  R_k^0(y) = 1 − Σ_{i=1}^{k} f̂_i^0(y)
  endfor
  set f̂_K^0(y) = R_{K−1}^0(y) for all y ∈ R_f and return f̂_1^0, f̂_2^0, ..., f̂_K^0
where Y denotes the spatial coordinates of y, Y_{0,k} denotes the center coordinates of object k, and V denotes the covariance matrix of the 2D spatial Gaussian distribution.
Layered Occlusion Model. We introduce a layered occlusion model into the initialization step. Given a hypothesized occlusion ordering of the detections, we build a layered occlusion representation iteratively by calculating the foreground probability map f̂_k^0 for the current layer and its residual probability map R_k^0 for each pixel y. Suppose the occlusion order (from front to back) is F_1, F_2, ..., F_K; then the initial probability maps are calculated recursively from the front layer to the back layer by assigning 2D anisotropic Gaussian distributions based on the location and scale of each detection hypothesis. Occlusion Reasoning. The initial occlusion ordering is determined by sorting the detection hypotheses by their vertical coordinates, and the layered occlusion model is used to estimate the initial assignment probabilities. The occlusion status is updated at each iteration (after the E-step) by comparing the evidence of occupancy in the overlap area between different human hypotheses. For two human hypotheses h_i and h_j with overlap area O_{h_i,h_j}, we re-estimate the occlusion ordering between the two as: h_i occludes h_j if Σ_{x∈O_{h_i,h_j}} f̂_i^t(x) > Σ_{x∈O_{h_i,h_j}} f̂_j^t(x)
(i.e. h_i accounts better than h_j for the pixels in the overlap area), and h_j occludes h_i otherwise, where f̂_i^t and f̂_j^t are the foreground assignment probabilities of h_i and h_j. At each iteration, every pair of hypotheses with a non-empty overlap area is compared in this way. The whole occlusion ordering is updated by exchanges if and only if an estimated pairwise occlusion ordering differs from the previous ordering.
4 Partial Human Appearance Matching
Appearance models represented by kernel probability densities can be compared by information-theoretic measures such as the Bhattacharyya distance or the Kullback-Leibler distance for tracking and matching objects in video. Recently, Yu et al. [18] introduced an approach to constructing appearance models from a video sequence by a key frame method and showed robust matching results using a path-length feature and the Kullback-Leibler distance measure. However, this approach only handles unoccluded cases. Suppose two appearance models a and b are represented as kernel densities in a joint spatial-color space. Taking a as the reference model and b as the test
Algorithm 2. Simultaneous Appearance Modeling and Segmentation for Occlusion Handling
Given a set of sample pixels {x_i, i = 1, 2, ..., N} from the foreground regions R_f, we iteratively estimate the assignment probabilities f̂_k^t(y) of a foreground pixel y ∈ R_f belonging to F_k as follows:
Initialization: Initial probabilities are assigned by the layered occlusion model.
M-Step (Random Pixel Sampling): We randomly sample a set of pixels (we use η = 5% of the pixels) from the foreground regions R_f for estimating each foreground appearance, represented by weighted kernel densities.
E-Step (Soft Probability Update): For each k ∈ {1, 2, ..., K}, the assignment probabilities f̂_k^t(y) are recursively updated as follows:

f̂_k^t(y) = c f̂_k^{t−1}(y) Σ_{i=1}^{N} f̂_k^{t−1}(x_i) ∏_{j=1}^{d} k( (y_j − x_ij) / σ_j ),    (2)

where N is the number of samples and c is a normalizing constant such that Σ_{k=1}^{K} f̂_k^t(y) = 1.
Segmentation: The iteration is terminated when the average segmentation difference between two consecutive iterations falls below a threshold ε:

Σ_k Σ_y |f̂_k^t(y) − f̂_k^{t−1}(y)| / (nK) < ε,    (3)
where n is the number of pixels in the foreground regions. Let f̂_k(y) denote the final converged assignment probabilities. The final segmentation is then determined as: pixel y belongs to human-k, i.e. y ∈ F_k, if k = arg max_{k∈{1,...,K}} f̂_k(y).
model, the similarity of the two appearances can be measured by the Kullback-Leibler distance [12][18]:

D_KL(p̂^b || p̂^a) = ∫ p̂^a(y) log( p̂^a(y) / p̂^b(y) ) dy,    (4)

where y denotes a feature vector and p̂^a and p̂^b denote the kernel pdfs. For simplicity, the distance is calculated from samples instead of the whole feature set; we need to compare the two kernel pdfs using the same set of samples in the feature space. Given N_a samples x_i, i = 1, 2, ..., N_a from appearance model a and N_b samples y_k, k = 1, 2, ..., N_b from appearance model b, the above equation can be approximated by the following form [18], given sufficient samples from the two appearances:

D_KL(p̂^b || p̂^a) = (1/N_b) Σ_{k=1}^{N_b} log( p̂^b(y_k) / p̂^a(y_k) ),    (5)

p̂^a(y_k) = (1/N_a) Σ_{i=1}^{N_a} ∏_{j=1}^{d} k( (y_kj − x_ij)/σ_j ),   p̂^b(y_k) = (1/N_b) Σ_{i=1}^{N_b} ∏_{j=1}^{d} k( (y_kj − y_ij)/σ_j ).    (6)
Since we sample test pixels only from the appearance model b, pˆb is evaluated by its own samples and pˆb is guaranteed to be equal or larger than pˆa for any
5
Fig. 2. Examples of the detection and segmentation process with corresponding convergence graphs. The vertical axis of the convergence graph shows the absolute segmentation difference between two consecutive iterations given by Equation 3.
samples y_k. This ensures that D_KL(p̂^b||p̂^a) ≥ 0, with equality if and only if the two density models are identical. The Kullback-Leibler distance is a non-symmetric measure, i.e. D_KL(p̂^b||p̂^a) ≠ D_KL(p̂^a||p̂^b). To obtain a symmetric similarity measure between the two appearance models, we define the distance of the two appearances as Dist(p̂^b, p̂^a) = min( D_KL(p̂^b||p̂^a), D_KL(p̂^a||p̂^b) ). It is reasonable to choose the minimum as the distance measure since it preserves the balance between full-full, full-partial, and partial-partial appearance matching, whereas the symmetrized distance D_KL(p̂^b||p̂^a) + D_KL(p̂^a||p̂^b) would only be effective for full-full appearance matching and does not compensate for occlusion.
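The sample-based distances (5)-(6) and the symmetric measure can be sketched as follows (our code, not the paper's implementation; inputs are N × d NumPy arrays of samples and a length-d bandwidth vector, and a small constant guards the logarithm):

import numpy as np

def _kde(y, samples, h):
    # Product Gaussian kernel density estimate at y, cf. Eqs. (1)/(6).
    z = (y[None, :] - samples) / h[None, :]
    return np.mean(np.prod(np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi), axis=1)) / np.prod(h)

def kl_from_samples(test_samples, ref_samples, h):
    # Mean log-ratio of the two densities at the test model's own samples (Eq. (5)).
    logs = [np.log(_kde(y, test_samples, h) / (_kde(y, ref_samples, h) + 1e-12))
            for y in test_samples]
    return float(np.mean(logs))

def symmetric_distance(samples_a, samples_b, h):
    # Dist = min(KL(b||a), KL(a||b)): the occlusion-tolerant symmetric measure.
    return min(kl_from_samples(samples_b, samples_a, h),
               kl_from_samples(samples_a, samples_b, h))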
5 Experimental Results and Analysis
Fig. 2 shows examples of the human segmentation process for small human groups. The results show that our approach can generate accurate pixel-wise
Fig. 3. Experiments on different degrees of occlusion between two people
segmentation of foreground regions when people are in standing or walking poses. Also, the convergence graphs show that our segmentation algorithm converges to a stable solution in less than 10 iterations and gives accurate segmentation of foreground regions for images with discriminating color structures of different humans. Cases of falling into a local minimum with inaccurate segmentation are mostly due to color ambiguity between different foreground objects or misclassification of shadows as foreground. Some inaccurate segmentation results can be found around human heads and feet in Fig. 2 and Fig. 3, and can be reduced by incorporating human pose models as in [17]. We also evaluated the segmentation performance with respect to the degree of occlusion. Fig. 3 shows the segmentation results for images with varying degrees of occlusion when two people walk across each other in an indoor environment. Note that the degree of occlusion does not significantly affect the segmentation accuracy as long as reasonably accurate detections can be achieved. Finally, we quantitatively evaluate our segmentation and appearance modeling approach on appearance matching under occlusion. We choose three frames from a test video sequence (containing two people in the scene) and perform segmentation for each of them. Then, the generated segmentations are used to estimate partial or full human appearance models as shown in Fig. 4. We evaluate the two-way Kullback-Leibler distances and the symmetric distance for each pair of appearances and represent them as affinity matrices, shown in Fig. 4. The elements of the affinity matrices quantitatively reflect the accuracy of matching. We also conducted matching experiments using different spatial-color space combinations: 3D (r, g, s) color space, 4D (x, r, g, s) space, 4D (y, r, g, s) space, and 5D (x, y, r, g, s) space. The affinity matrices show that the 3D (r, g, s) color space and the 4D (y, r, g, s) space produce much better matching results than the other two. This is because color variation is more sensitive in the horizontal direction than in the vertical direction. The color-only feature space obtains good matching performance for this example because the color distributions are significantly different between appearances 1 and 2. But, in reality, there are often cases in which two different appearances have similar color distributions with completely different spatial layouts. On the other hand, the 4D (y, r, g, s) joint spatial-color feature space (color distribution as a function of the normalized
human height) enforces spatial constraints on color distributions; hence it has much more discriminative power.

Fig. 4. Experiments on appearance matching. Top: appearance models used for matching experiments, Middle: two-way Kullback-Leibler distances, Bottom: symmetric distances.
6 Conclusion
We proposed a two-stage foreground segmentation approach that combines human detection and iterative foreground segmentation. The KDE-EM framework is modified and applied to the segmentation of groups into individuals. The advantage of the proposed approach lies in simultaneously segmenting people and building appearance models. This is useful for matching and recognizing people when only occluded frames are available for training. Our future work includes applying the proposed approach to human tracking and recognition across cameras.
Acknowledgement This research was funded in part by the U.S. Government VACE program.
References
1. Zhao, T., Nevatia, R.: Tracking Multiple Humans in Crowded Environment. In: CVPR (2004)
2. Smith, K., Perez, D.G., Odobez, J.M.: Using Particles to Track Varying Numbers of Interacting People. In: CVPR (2005)
3. Rittscher, J., Tu, P.H., Krahnstoever, N.: Simultaneous Estimation of Segmentation and Shape. In: CVPR (2005)
4. Elgammal, A.M., Davis, L.S.: Probabilistic Framework for Segmenting People Under Occlusion. In: ICCV (2001)
5. Mittal, A., Davis, L.S.: M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene. International Journal of Computer Vision (IJCV) 51(3), 189-203 (2003)
6. Fleuret, F., Lengagne, R., Fua, P.: Fixed Point Probability Field for Complex Occlusion Handling. In: ICCV (2005)
7. Khan, S., Shah, M.: A Multiview Approach to Tracking People in Crowded Scenes using a Planar Homography Constraint. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
8. Kim, K., Davis, L.S.: Multi-Camera Tracking and Segmentation of Occluded People on Ground Plane using Search-Guided Particle Filtering. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, Springer, Heidelberg (2006)
9. Tao, H., Sawhney, H., Kumar, R.: A Sampling Algorithm for Detecting and Tracking Multiple Objects. In: ICCV Workshop on Vision Algorithms (1999)
10. Isard, M., MacCormick, J.: BraMBLe: A Bayesian Multiple-Blob Tracker. In: ICCV (2001)
11. Scott, D.W.: Multivariate Density Estimation. Wiley Interscience, Chichester (1992)
12. Elgammal, A.M., Davis, L.S.: Probabilistic Tracking in Joint Feature-Spatial Spaces. In: CVPR (2003)
13. Zhao, L., Davis, L.S.: Iterative Figure-Ground Discrimination. In: ICPR (2004)
14. Zhao, L., Davis, L.S.: Segmentation and Appearance Model Building from An Image Sequence. In: ICIP (2005)
15. Wang, H., Suter, D.: Tracking and Segmenting People with Occlusions by A Simple Consensus based Method. In: ICIP (2005)
16. Lin, Z., Davis, L.S., Doermann, D., DeMenthon, D.: Hierarchical Part-Template Matching for Human Detection and Segmentation. In: ICCV (2007)
17. Lin, Z., Davis, L.S., Doermann, D., DeMenthon, D.: An Interactive Approach to Pose-Assisted and Appearance-based Segmentation of Humans. In: ICCV Workshop on Interactive Computer Vision (2007)
18. Yu, Y., Harwood, D., Yoon, K., Davis, L.S.: Human Appearance Modeling for Matching across Video Sequences. Special Issue on Machine Vision Applications (2007)
Content-Based Matching of Videos Using Local Spatio-temporal Fingerprints
Gajinder Singh, Manika Puri, Jeffrey Lubin, and Harpreet Sawhney
Sarnoff Corporation
Abstract. Fingerprinting is the process of mapping content, or fragments of it, into unique, discriminative hashes called fingerprints. In this paper, we propose an automated video identification algorithm that employs fingerprinting to store videos in its database. When queried using a degraded short video segment, the objective of the system is to retrieve the original video to which it corresponds, both accurately and in real time. We present an algorithm that first extracts key frames for temporal alignment of the query and its actual database video, and then computes spatio-temporal fingerprints locally within such frames to indicate a content match. All stages of the algorithm have been shown to be highly stable and reproducible even when strong distortions are applied to the query.
1 Introduction
With the growing popularity of free video publishing on the web, innumerable copies of copyrighted videos float over the internet unrestricted. A robust content-identification system that detects perceptually identical video content thus benefits popular video-sharing websites and peer-to-peer (P2P) networks of today, by detecting and removing all copyright-infringing material from their databases and preventing any such future uploads by their users. Fingerprinting offers a solution to query and identify short video segments from a large multimedia repository using a set of discriminative features called fingerprints. The challenges in designing such a system are:
1. Accuracy to identify content even when it is altered, either naturally because of changes in video formats, resolution, illumination settings or color schemes, or maliciously by introducing frame letterbox, camcorder recording or cropping. This is addressed by employing robust fingerprints that are invariant to such common distortions.
2. Speed to allow the fingerprinting system to determine a content match with small turn-around times, which is crucial for real-time applications.
A common denominator of all fingerprinting techniques is their ability to capture and represent high-dimensional, perceptually relevant multimedia content in the form of short robust hashes [1], essential for fast retrieval. In content-based video identification, the literature reports approaches that compute features such as mean luminance [1], centroid of gradient [2], rank-ordered image intensity
distribution [3] and centroid of gradient orientations [4], over fixed-size partitions of video frames. The limitation of such features is that they encode complete frame information and therefore fail to identify videos when presented with queries having partially cropped or scaled data. This motivates the use of a local fingerprinting approach. Sivic and Zisserman [5], in their text-retrieval approach for object recognition, make use of maximally stable extremal regions (MSERs), proposed by Matas et al. [6], as representations of each video frame. Since their method clusters semantically similar content together in its visual vocabulary, it is expected to offer poor discrimination, for example, between different seasons of the same TV programme having similar scene settings, camera capture positions and actors. However, a video fingerprinting system is expected to provide good discrimination between such videos. Similar to [5], Nistér and Stewénius propose an object recognition algorithm that extracts and stores MSERs on a group of images of an object, captured under different viewpoint, orientation, scale and lighting conditions [7]. During retrieval, a database image is scored depending on the number of MSER correspondences it shares with the given query image. Only the top-scoring hits are then scanned further. Hence, fewer MSER pairs decrease the possibility of a database hit figuring among the top-ranked images. Since a fingerprinting system needs to identify videos even when queried with short distorted clips, both [5] and [7] become unsuitable. This is because strong degradations, such as blurring, cropping and frame letterboxing, result in fewer MSERs in a distorted image compared to its original. Such degradations thus have a direct impact on the algorithm's performance because of the change in representation of the frame. Massoudi et al. propose an algorithm that first slices a query video into shots, extracts key-frames and then performs local fingerprinting [8]. A major drawback of this approach is that even the most common forms of video processing, such as blurring and scaling, disturb the key-frames and introduce misalignment between the query and database frames. In the proposed approach, key-frames correspond to local peaks of maximum intensity change across the video sequence. Such frames are hence reproducible under most distortions. For feature extraction, the information of each key-frame is localized as a set of affine covariant 3D regions, each of which is characterized using a spatio-temporal binary fingerprint. The use of such distortion-invariant fingerprints as an index into the database look-up brings appreciable advantages in terms of an easy and efficient database retrieval strategy. During a database query, we employ a voting strategy to collate the set of local database hits in order to make a global decision about the best-matching video in the database.
2 Proposed Video Fingerprinting Approach
Figure 1 shows the various stages of the proposed video fingerprinting framework. The input video is first subjected to a series of video preprocessing steps [4]. These include changing the source frame rate to a predefined resampling rate (say, 10 fps), followed by converting each video frame to grayscale and finally resizing all frames to a fixed width (w) and height (h) (in our case, w = 160 and
Fig. 1. Framework of proposed localized content-based video fingerprinting approach
h = 120 pixels). Such preprocessing achieves two benefits: one is robustness against changes in color formats or resizing, while the other is speed of retrieval for large-size videos (by choosing w and h accordingly). To reduce complexity, the frame sequence is examined to extract key-points that correspond to local peaks of maximum change in intensity. Since maximum change in intensity reduces the stability of the regions detected in key-frames, we store a few neighboring frames on either side of the key-frame while maintaining minimal database redundancy.
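A minimal Python/OpenCV sketch of these preprocessing steps is given below. It is our own illustration; the paper does not state which library it uses, and the function name preprocess and the naive every-k-th-frame resampling are assumptions.

import cv2

def preprocess(video_path, target_fps=10, size=(160, 120)):
    # Resample to roughly target_fps, convert to grayscale, resize each frame.
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, int(round(src_fps / target_fps)))  # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(cv2.resize(gray, size))    # size = (width, height)
        idx += 1
    cap.release()
    return frames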
2.1 Region Detector
Stable regions are detected on key-frames for localizing their information. Our region detection process is inspired by the concept of Maximally Stable Volumes (MSVs), proposed by Donser and Bischof [9] for 3D segmentation. Extending MSER to the third dimension, i.e., time, an MSV is detected by an extremal property of the intensity function in the volume and on its outer boundary. In the present case, however, to extract regions stable under the set of distortions, we extend MSERs to a third dimension that is resolution. The process of extracting MSVs, along with the related terminology adopted in the rest of the paper, is given as follows:

Image Sequence. For a video frame $F$, consider a set of multi-resolution images $F_1, F_2, \dots, F_i, \dots, F_s$, where $F_i$ is obtained when video frame $F$ is subsampled by a factor $2^{i-1}$. The size of each $F_i$ is the same as $F$. Thus, a 3D point $(x, y, z)$ in this space corresponds to pixel $(x, y)$ of frame $F$ at resolution $z$, or equivalently $F_z(x, y)$.

Volume. We define $V_j^i$ as the $j$th volume such that all 3D points belonging to it have intensities less than (or greater than) $i$: $\forall (x, y, z) \in V_j^i$ iff $F_z(x, y) \leq i$ (or $F_z(x, y) \geq i$).

Connectivity. Volume $V_j^i$ is said to be contiguous if, for all points $p, q \in V_j^i$, there exists a sequence $p, a_1, a_2, \dots, a_n, q$ with $pAa_1, a_1Aa_2, \dots, a_iAa_{i+1}, \dots, a_nAq$. Here, $A$ is an adjacency relation defined such that two pixels $p, q \in V_j^i$ are adjacent ($pAq$) iff $\sum_{i=1}^{3} |p_i - q_i| \leq 1$.
Partial Relationship. Any two volumes $V_k^i$ and $V_l^j$ are nested, i.e., $V_k^i \subseteq V_l^j$, iff $i \leq j$ (or $i \geq j$).

Maximally Stable Volumes. Let $V_1, V_2, \dots, V_{i-1}, V_i, \dots$ be a sequence of a partially ordered set of volumes such that $V_i \subseteq V_{i+1}$. An extremal volume $V_i$ is said to be maximally stable iff $v(i) = |V_{i+\Delta} \setminus V_{i-\Delta}| / |V_i|$ has a local minimum at $i$, i.e., for changes in intensity of magnitude less than $\Delta$, the corresponding change in region volume is zero.

Thus, we represent each video frame as a set of distinguished regions that are maximally stable to intensity perturbation over different scales. The reason for the stability of MSVs over MSERs in most cases of image degradation is that the additional volumetric information enables the selection of regions with near-identical characteristics across different image resolutions. The more volatile regions (the ones which split or merge) are hence eliminated from consideration.
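The stability test itself can be illustrated with a short sketch (ours, not the authors' implementation). It assumes a nested sequence of volumes is summarized only by its voxel counts $|V_i|$, and uses the identity $|V_{i+\Delta} \setminus V_{i-\Delta}| = |V_{i+\Delta}| - |V_{i-\Delta}|$, which holds for nested sets.

def maximally_stable_levels(volume_sizes, delta):
    # volume_sizes[i] = |V_i| for a nested sequence V_i, V_{i+1}, ...
    # Returns the levels i where v(i) = (|V_{i+delta}| - |V_{i-delta}|) / |V_i|
    # attains a local minimum.
    n = len(volume_sizes)
    v = [None] * n
    for i in range(delta, n - delta):
        v[i] = (volume_sizes[i + delta] - volume_sizes[i - delta]) / volume_sizes[i]
    stable = []
    for i in range(delta + 1, n - delta - 1):
        if v[i] <= v[i - 1] and v[i] <= v[i + 1]:
            stable.append(i)
    return stable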
2.2 Localized Fingerprint Extraction
Consider a frame sequence $\{F_1, F_2, \dots, F_p, \dots\}$ where $F_p$ denotes the $p$th frame of the video. For invariance to affine transformations, each MSV is represented in the form of an ellipse. In terms of the present notation, the pixels of the $i$th maximally stable volume $V_i^p$ in frame $F_p$ are made to fit an ellipse denoted by $e_i^p$. The affine invariance of local image regions makes our fingerprint robust to geometric distortions. Each ellipse $e_i^p$ is represented by $(x_i^p, y_i^p, s_i^p, lx_i^p, ly_i^p, \alpha_i^p)$, where $(x_i^p, y_i^p)$ are its center coordinates, $lx_i^p$ is the major axis, $ly_i^p$ is the minor axis and $\alpha_i^p$ is the orientation of the ellipse w.r.t. the frame axis. A scale factor $s_i^p$, which depends upon the ratio of the ellipse area w.r.t. the total area of the frame, is used to encode bigger regions around MSVs which are very small. The proposed algorithm expresses each local measurement region of a frame in the form of a spatio-temporal fingerprint that offers a compact representation of videos inside the database. For each ellipse $e_i^p$ extracted from frame $F^p$, the process of fingerprint computation (similar to [1]) is elaborated below:
- Since small regions are more prone to perturbations, each region is blown up by a factor $s_i$. A rectangular region $r_i^p$ that encloses ellipse $e_i^p$ and has an area of $(lx_i^p \times s_i^p, ly_i^p \times s_i^p)$ is detected.
- For a scale-invariant fingerprint, we divide $r_i^p$ into a fixed number of $R \times C$ blocks.
- Let the mean luminance of block $(r, c) \in r_i^p$ be denoted by $L_i^p(r, c)$, where $r = [1, \dots, R]$ and $c = [1, \dots, C]$. We choose a spatial filter $[-1\ 1]$ and a temporal filter $[-\alpha\ 1]$ for storing the spatio-temporal dimension of a video. In order to reduce susceptibility to noise, we compute a fingerprint between $r_i^p$ and $r_i^{p+step}$, which is the same as region $r_i^p$ but shifted by $step$ frames. The $R \times (C-1)$ bit fingerprint $B_i^p$ is equal to bit '1' if $Q_i^p(r, c)$ is positive and bit '0' otherwise, where
$Q_i^p(r, c) = (L_i^{p+step}(r, c+1) - L_i^{p+step}(r, c)) - \alpha (L_i^p(r, c+1) - L_i^p(r, c)) \quad (1)$
Encoding mean luminance makes our fingerprint invariant to photometric distortions. In the current implementation, $\alpha = 0.95$ and $step = 10$.
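Equation (1) amounts to a horizontal [-1 1] difference over the block luminances combined with a temporal [-alpha 1] filter, thresholded at zero. A minimal sketch of this computation (ours; it assumes the block-mean luminances have already been computed as R x C arrays) is:

import numpy as np

def fingerprint_bits(L_p, L_p_step, alpha=0.95):
    # L_p, L_p_step: R x C arrays of block mean luminances of region r_i
    # at frame p and at frame p+step. Implements Eq. (1) and returns
    # the R x (C-1) binary fingerprint.
    spatial_now  = L_p[:, 1:] - L_p[:, :-1]          # [-1 1] along columns, frame p
    spatial_next = L_p_step[:, 1:] - L_p_step[:, :-1]  # same filter, frame p+step
    Q = spatial_next - alpha * spatial_now           # temporal filter [-alpha 1]
    return (Q > 0).astype(np.uint8)                  # bit '1' iff Q is positive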
2.3 Database Organization and Lookup
Localized content of each video frame is stored inside the database look-up table (LUT) [1] using 32-bit signatures, as computed by equation 1. Each LUT entry in turn stores pointers to all video clips with regions having the same fingerprint value. In order to have an affine-invariant representation of the video frame, independent of different query distortions, we store the geometric and shape information of stable region $e_i$ along with its fingerprint inside the database. This is done by transforming the coordinates of the original frame center, denoted by $(cx, cy)$, onto a new reference axis, denoted by $(\hat{X}, \hat{Y})$. The new axis has the property of projecting ellipse $e_i^p$ onto a circle, with the ellipse center being the origin of this axis and the ellipse major and minor axes aligned with $(\hat{X}, \hat{Y})$, respectively. The coordinates of the original frame center w.r.t. this new reference axis are denoted by $(\hat{cx}_i^p, \hat{cy}_i^p)$. The transformation between $(cx, cy)$ and $(\hat{cx}_i^p, \hat{cy}_i^p)$ is given by:
$\hat{cx}_i^p = ((cx - x_i^p)\cos(-\alpha_i^p) - (cy - y_i^p)\sin(-\alpha_i^p)) / (lx_i^p \times s_i^p) \quad (2)$
$\hat{cy}_i^p = ((cx - x_i^p)\sin(-\alpha_i^p) + (cy - y_i^p)\cos(-\alpha_i^p)) / (ly_i^p \times s_i^p) \quad (3)$
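Equations (2) and (3) are a translation to the ellipse center, a rotation by $-\alpha_i^p$ and an anisotropic scaling by the blown-up axes. A small sketch of this transform (ours; the tuple layout of the ellipse parameters is an assumption made for readability) is:

import math

def frame_center_in_ellipse_frame(cx, cy, ellipse):
    # Eqs. (2)-(3): express the frame center (cx, cy) in the reference axis
    # attached to an ellipse (x, y, s, lx, ly, alpha): translate to the ellipse
    # center, rotate by -alpha, and scale by the blown-up axes.
    x, y, s, lx, ly, alpha = ellipse
    dx, dy = cx - x, cy - y
    c, sn = math.cos(-alpha), math.sin(-alpha)
    cx_hat = (dx * c - dy * sn) / (lx * s)
    cy_hat = (dx * sn + dy * c) / (ly * s)
    return cx_hat, cy_hat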
Apart from the frame center, the transformation of three points $c1, c2, c3$, located at the corners of a prefixed square $SQ$ (in our case, of size $100 \times 100$) centered at $(cx, cy)$, is also stored inside the database. Let their coordinates w.r.t. the reference axis $(\hat{X}, \hat{Y})$ be denoted as $\hat{c1}_i^p, \hat{c2}_i^p, \hat{c3}_i^p$. These three points are stored for their role in the verification stage. We define $sift^p$ as the 128-dimensional SIFT descriptor [10] of $SQ$.

The fingerprint database $\bigcup_p (\bigcup_i (B_i^p, \hat{cx}_i^p, \hat{cy}_i^p, e_i^p), sift^p)$ is therefore expressed as a union of fingerprint entries for each $i$th MSV belonging to the $p$th video frame, where each MSV entry inside the database consists of fields such as its binary fingerprint $B_i^p$, the ellipse parameters $e_i$, the frame center's coordinates w.r.t. the reference axis $(\hat{X}, \hat{Y})$, and the SIFT descriptor of $SQ$ given by $sift^p$.

During database retrieval, for a query video frame $E^q$, we first generate its ellipses and their corresponding fingerprints using equation 1. Thus, the query frame can be expressed as $\bigcup_j \{B_j^q, e_j^q\}$. In terms of notation, we use $q$ to denote query parameters and $p$ to denote database parameters. Each of the fingerprints of the MSVs belonging to the query frame is used to probe the database for potential candidates. That is, $\forall j$ we query the database to get the candidate set $\bigcup_p (\bigcup_i (B_i^p, \hat{cx}_i^p, \hat{cy}_i^p, e_i^p), sift^p)$.

Every entry in the candidate set could be the correct database match that we are looking for during database retrieval. Hence, we propose a hypothesis that the query frame $E^q$ is the same as an original frame $F_p$ stored inside the database. This will happen when ellipses $e_j^q$ and $e_i^p$ point to identical regions in their respective frames. For every candidate hit produced from the database, we compute the coordinates of the query frame's center by using:
$\check{cx}_{i,j}^{p,q} = (\hat{cx}_i^p \times s_j^q \times lx_j^q)\cos(\alpha_j^q) - (\hat{cy}_i^p \times s_j^q \times ly_j^q)\sin(\alpha_j^q) + x_j^q \quad (4)$
$\check{cy}_{i,j}^{p,q} = (\hat{cx}_i^p \times s_j^q \times lx_j^q)\sin(\alpha_j^q) + (\hat{cy}_i^p \times s_j^q \times ly_j^q)\cos(\alpha_j^q) + y_j^q \quad (5)$
To speed up query retrieval, we evaluate only a subset of the entire set of candidates by filtering out spurious database hits. For every candidate clip entry, we associate a score $sc_{i,j,p,q}$ which is defined as:
$sc_{i,j,p,q} = fac \times (lx_i^p \times ly_i^p \times s_i^p \times s_i^p \div (w \times h)) + (1 - fac) \times \log(N \div N_j^q) \quad (6)$
where $N$ is the total number of entries present in the database and $N_j^q$ is the number of database hits generated for query fingerprint $B_j^q$. The first term in Equation 6, $lx_i^p \times ly_i^p \times s_i^p \times s_i^p \div (w \times h)$, signifies that the greater the area represented by the fingerprint of the database image, the higher its score. The second term, $\log(N \div N_j^q)$, assigns higher scores to unique fingerprints $B_j^q$ that produce fewer database hits. A factor $fac \in [0, 1]$ is used to assign an appropriate weight to each of the two terms in equation 6. We have chosen $fac = 0.5$ in the current implementation.
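A direct transcription of Equation (6) into code (ours; parameter names are illustrative) is:

import math

def candidate_score(lx, ly, s, w, h, n_total, n_hits, fac=0.5):
    # Eq. (6): weight a database hit by the relative area its blown-up region
    # covers and by the rarity of the query fingerprint that produced the hit.
    area_term   = (lx * ly * s * s) / (w * h)
    rarity_term = math.log(n_total / n_hits)
    return fac * area_term + (1.0 - fac) * rarity_term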
2.4 Scoring and Polling Strategy
As stated earlier, an essential requirement of a video fingerprinting system is speed. For this purpose, we have an additional stage for scoring each database result, followed by a poll to collate all local information and arrive at the final decision. Consider a video's frame sequence as a 3D space, with its third dimension given by the frame number. We divide this space into bins (in our case, of size $2 \times 2 \times 10$), where each bin is described by a three-tuple $b \equiv (b1, b2, b3)$. Thus we merge database hits that (1) have their hypothetical image centers close to each other, and (2) belong to neighboring frames of the same video, considering the fact that the movement of regions across them is appreciably small. The scoring process and the preparation of a candidate clip for the verification stage are as follows:
- For all ellipses $e_j^q$ and $e_i^p$, add the score $sc_{i,j,p,q}$ to the bin in which $(\check{cx}_{i,j}^{p,q}, \check{cy}_{i,j}^{p,q}, q)$ falls. For each such entry, also calculate $\check{c1}_{i,j}^{p,q}, \check{c2}_{i,j}^{p,q}, \check{c3}_{i,j}^{p,q}$ by using $\hat{c1}_i^p, \hat{c2}_i^p, \hat{c3}_i^p$ and the ellipse parameters of $e_j^q$ in equations similar to 4 and 5.
- Pick the entries that fall within the top $n$ scoring bins for the next stage of verification.
- Every database hit which polled into a particular bin and added to its score gives information about the affine transformation by which the database frame can be aligned to the query frame. We compute the average transformation of bin $b$, denoted by $H_b$, by taking the average of all $\check{c1}_{i,j}^{p,q}, \check{c2}_{i,j}^{p,q}, \check{c3}_{i,j}^{p,q}$ that polled to bin $b$.
- The inverse of $H_b$ is applied to $siftnum$ query frames, viz. $\{E^q, E^{q+1}, \dots, E^{q+siftnum}\}$, to produce a series of frames $\{E_b^q, E_b^{q+1}, \dots, E_b^{q+siftnum}\}$. Now, these frames are hypothetically aligned to the ones stored within the database that polled to bin $b$. Carrying out a check on $siftnum$ (in our case, $siftnum = 5$) frames imposes a tougher constraint on deciding the correct database video.
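A simplified sketch of the polling step (ours; it only accumulates scores per bin and returns the top-scoring bins, omitting the averaged transformation $H_b$ and the transformed corner points) is:

from collections import defaultdict

def vote_into_bins(hits, bin_size=(2, 2, 10), top_n=10):
    # hits: iterable of (cx_check, cy_check, query_frame_no, score) tuples,
    # i.e. the hypothesized frame center and score of every database hit.
    # Accumulates scores per 3D bin and returns the top_n bins with their scores.
    bins = defaultdict(float)
    for cx, cy, q, score in hits:
        b = (int(cx // bin_size[0]), int(cy // bin_size[1]), int(q // bin_size[2]))
        bins[b] += score
    return sorted(bins.items(), key=lambda kv: kv[1], reverse=True)[:top_n]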
2.5 Verification
From the top $n$ candidates obtained from the previous stage, the correct database hit is found using a SIFT-based verification. Let $esift_b^q$ be the SIFT descriptor of
the square $\hat{SQ}$ centered at $(cx, cy)$ in the query's frame $E_b^q$, with its sides aligned to the frame's axis. The two-step verification procedure is:
- For all database frames $p$ that voted to bin $b$, calculate the Bhattacharyya distance between the descriptors of the aligned query $\{esift_b^q, esift_b^{q+1}, \dots, esift_b^{q+siftnum}\}$ and its database hit $\{sift^p, sift^{p+1}, \dots, sift^{p+siftnum}\}$.
- If the distance is less than an empirically chosen threshold $T$, declare a match to database frame $p$.
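The verification can be sketched as follows (ours; the paper does not state which Bhattacharyya variant it uses or how the per-frame distances are aggregated, so this sketch normalizes the descriptors, uses the sqrt(1 - BC) form and averages over the aligned frames).

import numpy as np

def bhattacharyya_distance(d1, d2, eps=1e-12):
    # One common variant of the Bhattacharyya distance between two
    # non-negative descriptors, treated as discrete distributions.
    p = np.asarray(d1, dtype=float); p = p / (p.sum() + eps)
    q = np.asarray(d2, dtype=float); q = q / (q.sum() + eps)
    bc = np.sum(np.sqrt(p * q))               # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

def verify(query_sifts, db_sifts, T=0.2):
    # Declare a match if the mean distance over the aligned frame pairs
    # falls below the threshold T.
    dists = [bhattacharyya_distance(a, b) for a, b in zip(query_sifts, db_sifts)]
    return float(np.mean(dists)) < T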
3 Experimental Validation
The results presented in this section have been calculated for a database populated with 1200 videos of 4 minutes each (in total, 80 hours of data). The video material stored in the database contains clips taken from different seasons of TV programmes, music videos, animated cartoons and movies. Analysis of our algorithm (method 1) and its comparison with the approaches described in [8] (method 2) and [1] (method 3) has been carried out using 2400 clips, each of duration 10 seconds and picked from different locations of their respective original videos. The robustness of the proposed algorithm has been evaluated under video degradations such as spatial blur (Gaussian blurring of radius 4), contrast modification (query clips with 25% reduced contrast), scaling (the resolution of each video frame is changed by a factor of 1.5), frame letterbox (black bands, of width 100 pixels, appended around a video frame), DivX compression (alters the bitrate by a factor of 4), temporal cropping (query clips having only 10% of the frames present in the original video), spatial cropping (30% of the spatial information is cropped in each video frame) and camcorder recording (query clips captured using a camcorder). The parameters of our method are the top $n$ candidate bins that are chosen based on scoring and polling, and the threshold $T$ used in the verification stage. Setting the empirical values $n = 10$ and $T = 0.2$ (justified later in the section), the results shown in figure 2 have been computed. Figure 2(a) demonstrates the repeatability of the key-frames detected by our algorithm, as compared to method 2. As reflected by the graph, even under strong distortions such as spatial crop, the percentage of key-frames re-detected in the query does not fall below 80%. The percentage of all query clips that were retrieved correctly from the database is shown in Figure 2(b). The use of local, affine- and scale-invariant fingerprints makes our approach better than method 3, which fails for distortions like spatial cropping. At the same time, the ability to encode reproducible key-frames makes our system more robust than method 2. Variations in the performance of the proposed system as a result of changing the parameters $n$ and $T$ are demonstrated in figure 3. Every row of figure 3 shows three different graphs for each type of distortion. Due to paucity of space, we demonstrate graphs only for the first five types of degradation, following the order mentioned above. The first column of figure 3 shows the efficacy of our proposed scoring scheme. On the y-axis, it plots the percentage of query fingerprints that produce database
Table 1. Query retrieval timings (in seconds) for different distortions

Distortion              Feature Time   Query Time   Verification Time
Spatial Blur            0.4984         1.0103       1.4660
Contrast Modification   0.3173         0.3229       0.9977
Scaling                 0.4308         0.9442       1.4236
Frame Letterbox         0.3886         0.9117       1.1160
DivX Compression        0.4396         1.0622       1.5445
Temporal Cropping       0.3239         0.6657       0.9350
Spatial Cropping        0.4874         1.2064       1.1237
Camcorder Capture       0.3729         0.9975       0.7923
Fig. 2. Comparison of different fingerprinting algorithms: (a) repeatability of key-frames, and (b) percentage of queries that are correctly retrieved from the database
hits of the correct video, hits of an incorrect video, or no hits at all (misses). On the x-axis, this analysis is repeated for different values of the top $n$ scoring bins as $n$ is varied from 10 to 100. It can be noticed that in all cases of distortion, the y-axis distribution remains unchanged even as more bins are sent to the verification stage. In other words, correct database hits always fall within the top 10 scoring bins, even when the query is altered significantly. If the size of the database is increased, the number of false database hits generated by a query fingerprint is expected to go up. However, due to the invariance of the proposed scoring strategy to the number of database hits, correct hits are always scored at the top while false hits fall within lower-scoring bins. This property demonstrates the scalability of our proposed fingerprinting scheme. The second column of figure 3 plots on the y-axis the probability of the query and database hit being the same ($P_{same}$) or different ($P_{diff}$). The x-axis shows the corresponding Bhattacharyya distance between their SIFT descriptors. As reflected by the graphs, the point where $P_{same}$ and $P_{diff}$ intersect (i.e., where the probability of the query and database candidate being identical becomes less than that of the two being different) is the threshold $T$. The third column of figure 3 plots the distribution of queries versus the number of key-frames it takes for them to produce a correct hit. The curve for queries with
Fig. 3. An evaluation of changes in retrieval results as parameters of the system are varied for different query distortions, Row 1: Spatial Blurring, Row 2: Contrast Modification, Row 3: Scaling, Row 4: Frame Letterboxing and Row 5: DivX Compression. First column {(a),(d),(g),(j),(m)} show the percentage of query fingerprints that are correct, incorrect or a miss, for different number n of top scoring bins scanned by the verification stage. Second column {(b),(e),(h),(k),(n)} show the probability of two videos being identical (Psame ) or different (Pdif f ) and Bhattacharyya distance between their SIFT descriptors. Third column {(c),(f),(i),(l),(o)} plot probability of the number of iterations it takes to complete the retrieval process.
higher distortions spans a larger number of iterations. In our case, even though the distortions are stronger than those used in the literature, a large percentage of queries lead to hits within the first 10 iterations. Table 1 tabulates the retrieval timings for the different types of distorted queries. Given a query, the total retrieval time can be split into: the time taken to detect stable regions in a key-frame and compute their fingerprints (Feature Time), the time taken to obtain database hits for each fingerprint and score them accordingly (Query Time), and finally the time taken to process the top 10 candidates for SIFT-based verification (Verification Time). All queries have been executed on a 1.6 GHz Pentium IV machine equipped with 1 GB of RAM and take a maximum of 3 seconds for retrieval under all distortions.
4 Conclusions
In this paper, a new video fingerprinting scheme has been proposed. The primary novelties of this approach are: the extraction of repeatable key-frames, the introduction of affine-invariant MSVs that are stable over different frame resolutions, and a scoring and polling strategy to merge local database hits and bring the ones belonging to the correct video within the top rank order. With all the desirable properties of a video fingerprinting paradigm, our system can be used as a robust perceptual hashing solution to identify pirated copies of copyrighted videos on the internet. We are currently working on new ways of making the algorithm foolproof against distortions such as spatial cropping.
References
1. Oostveen, J., Kalker, T., Haitsma, J.: Feature extraction and a database strategy for video fingerprinting. In: International conference on recent advances in visual information systems, pp. 117-128 (2002)
2. Hampapur, A., Bolle, R.: Videogrep: video copy detection using inverted file indices. Technical report, IBM research (2001)
3. Hua, X.-S., Chen, X., Zhang, H.-J.: Robust video signature based on ordinal measure. ICIP 1, 685-688 (2004)
4. Lee, S., Yoo, C.-D.: Video fingerprinting based on centroids of gradient distortions. In: ICASSP, pp. 401-404 (2006)
5. Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. ICCV 2, 1-8 (2003)
6. Matas, J., Chum, O., Martin, U., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. BMVC 1, 384-393 (2002)
7. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. CVPR 2, 2161-2168 (2006)
8. Massoudi, A., Lefebvre, F., Demarty, C.-H., Oisel, L., Chupeau, B.: A video fingerprint based on visual digest and local fingerprints. ICIP, 2297-2300 (2006)
9. Donser, M., Bischof, H.: 3D segmentation by maximally stable volumes (MSVs). ICPR, 63-66 (2006)
10. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91-110 (2004)
Automatic Range Image Registration Using Mixed Integer Linear Programming
Shizu Sakakubara1, Yuusuke Kounoike2, Yuji Shinano1, and Ikuko Shimizu1
1 Tokyo Univ. of Agri. & Tech.  2 Canon Inc.
Abstract. A coarse registration method using Mixed Integer Linear Programming (MILP) is described that finds globally optimal registration parameter values independent of the values of invariant features. We formulate the range image registration problem using MILP. Our MILP-based algorithm robustly aligns two range images by finding the best balanced optimal registration. It adjusts the error tolerance automatically in accordance with the accuracy of the given range image data. Experimental results show that this method of coarse registration is highly effective.
1 Introduction

Automatic 3D model generation of real-world objects is an important technique in various fields. Range sensors are useful for generating 3D models because they directly measure the 3D shape of objects. The range image measured by a range sensor reflects the partial 3D shape of the object expressed in the coordinate system corresponding to the pose and position of the sensor. Therefore, to obtain the whole shape of the object, many range images measured from different viewpoints must be expressed in a common coordinate system. The estimation of the relative poses and positions of sensors from range images is called range image registration. In general, the registration process is divided into two phases: coarse registration and fine registration. For fine registration, the ICP (Iterative Closest Point) algorithm by Besl and McKay [1] and its extensions [11] are widely used. However, they need sufficiently good initial estimates to achieve fine registration because their error functions have a number of local minima. Therefore, many coarse registration methods have been developed for obtaining good initial estimates automatically. Many of these coarse registration methods [2] match invariant features under rigid transformation. For example, spin images [7] are matched based on their cross-correlation. A splash [12], a point signature [4], and a spherical attribute image [6] are used as indices representing surface structures in a hash table for matching. For the modeling of buildings, planar regions [5] and circular features [3] are used for matching. These methods are effective if the features can be sufficiently discriminated and their values can be accurately calculated. However, they cannot guarantee global optimality. In addition, registration has been formalized as a discrete optimization task of finding the maximum strict sub-kernel in a graph [10]. While the globally optimal solution can be obtained with this method, the solution depends on the quality of the features.
However, invariant features such as curvatures are difficult to calculate stably because they are greatly affected by occlusion and by the discretization of the object surface. In this paper, we propose a novel coarse registration method using Mixed Integer Linear Programming (MILP) that guarantees global optimality. It aligns two range images independently of the values of invariant features. Since MILP problems are NP-hard, they cannot be solved to optimality in a reasonable amount of time when they exceed a certain size. However, progress in algorithms, software, and hardware has dramatically enlarged this size. Twenty years ago, a mainframe computer was needed to solve problems with a hundred integer variables. Now it is possible to solve problems with thousands of integer variables on a personal computer [8]. The advances related to MILP have made our proposed algorithm possible.
2 Definition of Best Balanced Optimal Registration

Consider the registration of two sets of points in $\mathbb{R}^3$. Let $V^1 = \{v_1^1, v_2^1, \dots, v_{n_1}^1\}$ be the source point set and $V^2 = \{v_1^2, v_2^2, \dots, v_{n_2}^2\}$ the target one. Assume that these points are corrupted by additive noise terms that are independent random variables with zero mean and unknown variance. The task of registration is to estimate parameter values of a rigid transformation $T$ such that $T$ aligns the two point sets. We define the rigid transformation as $T(R, t; v) = Rv + t$, where $R$ is a $3 \times 3$ rotation matrix and $t$ is a translation vector. To deal with occlusion and discretization of range images, we introduce a more accurately aligned point set
$\mathrm{MAPS}(T, \phi, \epsilon) = \{ v_i^1 : \| v_{\phi(i)}^2 - T(R, t; v_i^1) \|_\infty \leq \epsilon \}$,
where the function $\phi(i)$ denotes the index of the corresponding target point for each source point and the constant $\epsilon$ is a threshold value to remove outliers. To obtain a robust estimate of the rigid transformation, the number of elements of $\mathrm{MAPS}(T, \phi, \epsilon)$, denoted $|\mathrm{MAPS}(T, \phi, \epsilon)|$, needs to be maximized. Because no a priori correspondence between the source and target points is given, the function $\phi$ must be determined. Moreover, we make no assumption about the variance of the noise; hence the value $\epsilon$ cannot be estimated in advance. However, if a number $N$ is given, we can find the minimal value of $\epsilon$ such that $|\mathrm{MAPS}(T, \phi, \epsilon)| \geq N$. Therefore, we define the $N$-accuracy optimal registration as the $(\hat{T}_N, \hat{\phi}_N)$ that minimizes $\epsilon$:
$(\hat{T}_N, \hat{\phi}_N) = \operatorname*{argmin}_{T, \phi} \{ \epsilon : |\mathrm{MAPS}(T, \phi, \epsilon)| \geq N \}$,
and define the optimal registration error $\hat{\epsilon}_N$ of the $N$-accuracy optimal registration as $\hat{\epsilon}_N = \min \{ \epsilon : |\mathrm{MAPS}(T, \phi, \epsilon)| \geq N \}$.

If $\epsilon_1 \geq \epsilon_2$, then $\max |\mathrm{MAPS}(T, \phi, \epsilon_1)| \geq \max |\mathrm{MAPS}(T, \phi, \epsilon_2)|$. On the other hand, if $N_1 \leq N_2$, then $\min\{\epsilon : |\mathrm{MAPS}(T, \phi, \epsilon)| \geq N_1\} \leq \min\{\epsilon : |\mathrm{MAPS}(T, \phi, \epsilon)| \geq N_2\}$. A well-balanced registration with a large number of elements in $\mathrm{MAPS}(T, \phi, \epsilon)$ and a small $\epsilon$ value is desired. Therefore, we define the best balanced optimal registration as the $(\hat{T}, \hat{\phi})$ that attains $\hat{N}$ such that
$\hat{N} = \operatorname*{argmax}_{3 \leq N \leq \kappa} \{ |\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})| : (\hat{T}_N, \hat{\phi}_N) = \operatorname*{argmin}_{T, \phi} \{ \epsilon : |\mathrm{MAPS}(T, \phi, \epsilon)| \geq N \} \} \quad (1)$
where $\kappa$ is the maximum value for $N$ and $\epsilon_{\max} = \max_N \{\hat{\epsilon}_N\}$. The lower bound for $N$ is three because a rigid transformation can be estimated using no fewer than three point pairs. To avoid $\epsilon_{\max}$ becoming too large, we give an upper bound $\bar{\epsilon}$. In the best balanced optimal registration, the $N$-accuracy optimal registration $(\hat{T}_N, \hat{\phi}_N)$ for each $N$ from three to $\kappa$ is evaluated using $|\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})|$. We call $\hat{N}$ and $\hat{\epsilon}_{\hat{N}}$ the best balanced values. They depend on the accuracy of the original point sets. In general, this accuracy is unknown in advance. Our proposed method finds the best balanced optimal registration automatically.

We assume that a similarity $s_{ij}$ between any pair of points $v_i^1$ and $v_j^2$ is given. Because the feature values of corresponding points are similar, we assume that point pairs with explicitly low similarity do not correspond and remove them from the candidates for corresponding point pairs. Note that precise values of the features are not needed, because the feature values are used only to remove obviously non-corresponding pairs.
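For concreteness, counting $|\mathrm{MAPS}(T, \phi, \epsilon)|$ for a fixed transformation and assignment is straightforward; the following Python sketch (ours, purely illustrative, not the MILP machinery of the paper) does exactly that.

import numpy as np

def maps_size(V1, V2, phi, R, t, eps):
    # |MAPS(T, phi, eps)|: number of source points whose assigned target point
    # lies within eps of the transformed source point in the L-infinity norm.
    # phi[i] is the index of the target assigned to source i, or None.
    count = 0
    for i, v in enumerate(V1):
        j = phi[i]
        if j is None:
            continue
        residual = np.asarray(V2[j]) - (np.asarray(R) @ np.asarray(v) + np.asarray(t))
        if np.max(np.abs(residual)) <= eps:
            count += 1
    return count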
3 Mixed Integer Linear Programming Based Registration

In this section, we first introduce the general form of Mixed Integer Linear Programming (MILP) problems. Next, we give an MILP formulation of $N$-accuracy optimal registration. We also give an ILP formulation to evaluate each $N$-accuracy registration by counting $|\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})|$. Note that this step does not have to be formalized as an ILP, because it can be done in many other ways. Then, we describe our algorithm for obtaining the best balanced optimal registration. Finally, we address the problem size issue and our approach to overcoming it.

3.1 Mixed Integer Linear Programming Problems

An MILP problem is the minimization or maximization of a linear objective function subject to linear constraints and to some of the variables being integers. Here, without loss of generality, we consider a maximization problem. Explicitly, the problem has the form
where A is an m×n matrix, c, x, l, and u are n-vectors, and b is an m-vector. Elements of c, l, u, b, and A are constant, and x is a variable vector. In this MILP problem, we assume 1 ≤ p ≤ n. If p = n, then the problem is called an Integer Linear Programming(ILP) problem. If all of the variables need not be integers, then the problem is called a Linear Programming(LP) problem. The set S = {x : Ax ≤ b, l ≤ x ≤ u, xi : integer, i = 1, . . . , p} is called the feasible region and an x ∈ S is called a feasible solution. A feasible solution x∗ that satisfies c x∗ ≥ c x, ∀x ∈ S is called an optimal solution. The value of c x∗ is called the optimal value.
MILP problems are NP-hard in general. Therefore, heuristic methods for solving them are widely used. However, in this paper, we apply exact methods. Owing to recent progress in MILP techniques, relatively large-scale MILP problems can be solved to optimality with commercial solvers. Throughout this paper, the solutions for MILP problems are optimal ones.

3.2 MILP Formulation of the N-Accuracy Optimal Registration

The MILP formulation of $N$-accuracy optimal registration requires that the constraints on the rigid transformation be written in linear form. Since this cannot be done straightforwardly, we introduce a pseudo rotation matrix $R'$ and a pseudo translation vector $t'$ and give linear constraints so that $R'$ and $t'$ are close to $R$ and $t$. Symbols denoted with a prime ($'$) are obtained using $R'$ and $t'$.

Initially, to simplify the notation, we define an index set for a point set $V$ as $I(V) \stackrel{\mathrm{def}}{=} \{i : v_i \in V\}$, and define $\mathrm{ABS}(x) \stackrel{\mathrm{def}}{=} (|x_1|, |x_2|, |x_3|)^\top$ for a vector $x = (x_1, x_2, x_3)^\top$.

We introduce a corresponding point pair vector $p = (p_{11}, \cdots, p_{1n_2}, p_{21}, \cdots, p_{n_1 1}, \cdots, p_{n_1 n_2})^\top$ as one of the decision variables. Each element $p_{ij}$ is a 0-1 integer variable that is designed to be 1 if source point $v_i^1$ corresponds to target point $v_j^2$, that is, $j = \phi(i)$, and to be zero otherwise. The other decision variables are the elements of $R'$ and $t'$. All elements of these variables are continuous. We denote the $ij$-th element of $R'$ as $r'_{ij}$ $(i, j = 1, 2, 3)$ and the $i$-th element of $t'$ as $t'_i$ $(i = 1, 2, 3)$. We assume that bounds are given: $\underline{r}_{ij} \leq r'_{ij} \leq \bar{r}_{ij}$ and $\underline{t}_i \leq t'_i \leq \bar{t}_i$. There are trivial upper and lower bounds: $-1 \leq r'_{ij} \leq 1$, $(i, j = 1, 2, 3)$.

As mentioned above, the constraints on the elements of the rotation matrix cannot be straightforwardly written in linear form. Therefore, we formulate this condition indirectly. Let $v_i^1, v_k^1 \in V^1$ and $v_j^2, v_l^2 \in V^2$, and let $d(v_i^1, v_k^1)$ be the distance between $v_i^1$ and $v_k^1$ and $d(v_j^2, v_l^2)$ the distance between $v_j^2$ and $v_l^2$. If $|d(v_i^1, v_k^1) - d(v_j^2, v_l^2)| > \epsilon_l$ holds, then either $j \neq \phi(i)$ or $l \neq \phi(k)$ needs to be satisfied. This condition can be formulated as $p_{ij} + p_{kl} \leq 1$, $(i, k \in I(V^1),\ j, l \in I(V^2),\ |d(v_i^1, v_k^1) - d(v_j^2, v_l^2)| > \epsilon_l)$. It can be further rewritten with fewer constraints as
$p_{ij} + \sum_{l \in I(V^2),\, |d(v_i^1, v_k^1) - d(v_j^2, v_l^2)| > \epsilon_l} p_{kl} \leq 1, \quad (i, k \in I(V^1),\ j \in I(V^2)).$

Here, $\epsilon_l$ depends on $\epsilon$. Since a rigid transformation error within $\pm\epsilon$ is allowed, a distance difference of up to $2\sqrt{3}\,\epsilon$ should be allowed. Therefore, $\epsilon_l \geq 2\sqrt{3}\,\epsilon$ needs to be satisfied.

We assume a one-to-one correspondence between subsets of $V^1$ and $V^2$. That is, two conditions need to be satisfied: $\sum_{j \in I(V^2)} p_{ij} \leq 1$, $(i \in I(V^1))$ and $\sum_{i \in I(V^1)} p_{ij} \leq 1$, $(j \in I(V^2))$. The minimal number of elements of $\mathrm{MAPS}(T, \phi, \epsilon)$ is given as a constant $N$. Hence, $\sum_{i \in I(V^1)} \sum_{j \in I(V^2)} p_{ij} \geq N$.

For the points in $\mathrm{MAPS}(T, \phi, \epsilon)$, only when $p_{ij} = 1$, $\mathrm{ABS}(v_j^2 - R'v_i^1 - t') \leq e$ $(i \in I(V^1), j \in I(V^2))$, where $e = (\epsilon, \epsilon, \epsilon)^\top$. When $p_{ij}$ is set to zero, these conditions need to be eliminated. They can be written by introducing a continuous vector
$M = (m_1, m_2, m_3)^\top$ that is determined so as to always satisfy $\mathrm{ABS}(v_j^2 - R'v_i^1 - t') \leq M$, $(i \in I(V^1), j \in I(V^2))$. Using this vector, we can formulate the condition for the points in $\mathrm{MAPS}(T, \phi, \epsilon)$ as $\mathrm{ABS}(v_j^2 - R'v_i^1 - t') \leq M(1 - p_{ij}) + e$, $(i \in I(V^1), j \in I(V^2))$, and further rewrite it as
$\mathrm{ABS}(\sum_{j \in I(V^2)} p_{ij} v_j^2 - R'v_i^1 - t') \leq M(1 - \sum_{j \in I(V^2)} p_{ij}) + e, \quad (i \in I(V^1)).$

If several variables could be fixed in advance, the MILP problem could be solved more quickly. To fix $p_{ij}$ to zero in advance, we use the similarity: $p_{ij} = 0$, $(i \in I(V^1),\ j \in I(V^2),\ s_{ij} < \epsilon_s)$, where $\epsilon_s$ is a given parameter. If the similarity $s_{ij}$ is small enough, the pair $v_i^1$ and $v_j^2$ is removed from the putative corresponding point pairs.

The objective of $N$-accuracy optimal registration is to minimize $\epsilon$. Therefore, we can give an MILP formulation as follows.
$(\mathrm{P}_1)\quad \min\ \epsilon$
subject to
  $\sum_{j \in I(V^2)} p_{ij} \leq 1, \quad (i \in I(V^1)),$
  $\sum_{i \in I(V^1)} p_{ij} \leq 1, \quad (j \in I(V^2)),$
  $\sum_{i \in I(V^1)} \sum_{j \in I(V^2)} p_{ij} \geq N,$
  $\sum_{j \in I(V^2)} p_{ij} v_j^2 - R'v_i^1 - t' \geq -M(1 - \sum_{j \in I(V^2)} p_{ij}) - e, \quad (i \in I(V^1)),$
  $\sum_{j \in I(V^2)} p_{ij} v_j^2 - R'v_i^1 - t' \leq M(1 - \sum_{j \in I(V^2)} p_{ij}) + e, \quad (i \in I(V^1)),$
  $p_{ij} + \sum_{l \in I(V^2),\, |d(v_i^1, v_k^1) - d(v_j^2, v_l^2)| > \epsilon_l} p_{kl} \leq 1, \quad (i, k \in I(V^1),\ j \in I(V^2)),$
  $\underline{r}_{ij} \leq r'_{ij} \leq \bar{r}_{ij}, \quad (i = 1, 2, 3,\ j = 1, 2, 3),$
  $\underline{t}_i \leq t'_i \leq \bar{t}_i, \quad (i = 1, 2, 3),$
  $p_{ij} = 0 \quad (i \in I(V^1),\ j \in I(V^2),\ s_{ij} < \epsilon_s),$
  $p_{ij} \in \{0, 1\} \quad (i \in I(V^1),\ j \in I(V^2)).$
2
Note that l is a given parameter and affects the optimal value of (P1 ). Here, we define Si and ˆi as the feasible region and the optimal value of the (P1 ) with a given parameter li . If l1 ≥ l2 , S1 ⊃ S2 , because the smaller l value forces the more elements of p to be zero, that is, it restricts the feasible region. Therefore, if l1 ≥ l2 , ˆ1 ≤ ˆ 2 . ˆN , max )| 3.3 ILP Formulation to Count |MAPS(TˆN , φ In this section, we describe ILP problem for evaluating (TˆN , φˆN ) by counting |MAPS(TˆN , φˆN , max )|. Its formulation uses almost the same components described in the previous section, except for the following ones. – The rigid transformation is given. In the formulation in this subsection, rotation matrix R and translation vector t with constant elements are used. They are calculated using a corresponding point pair vector introduced by R and t in our algorithm. – max is not a variable but a constant. – N is not given: the number of correspondences is maximized in this formulation.
We can give an ILP formulation for counting |MAPS(TˆN , φˆN , max )| as follows using emax = (max , max , max ) . (P2 ) max
$\sum_{i \in I(V^1)} \sum_{j \in I(V^2)} p_{ij}$
subject to
  $\sum_{j \in I(V^2)} p_{ij} \leq 1, \quad (i \in I(V^1)),$
  $\sum_{i \in I(V^1)} p_{ij} \leq 1, \quad (j \in I(V^2)),$
  $\sum_{j \in I(V^2)} p_{ij} v_j^2 - Rv_i^1 - t \geq -M(1 - \sum_{j \in I(V^2)} p_{ij}) - e_{\max}, \quad (i \in I(V^1)),$
  $\sum_{j \in I(V^2)} p_{ij} v_j^2 - Rv_i^1 - t \leq M(1 - \sum_{j \in I(V^2)} p_{ij}) + e_{\max}, \quad (i \in I(V^1)),$
  $p_{ij} = 0 \quad (i \in I(V^1),\ j \in I(V^2),\ s_{ij} < \epsilon_s),$
  $p_{ij} \in \{0, 1\} \quad (i \in I(V^1),\ j \in I(V^2)).$
3.4 Algorithm for Best Balanced Optimal Registration

Now, we describe our algorithm for obtaining the parameter values for the best balanced optimal registration. Equation (1) is solved in two phases.

Phase 1. For $N$ from 5 to $\kappa$, find $\hat{\epsilon}_N$ and obtain $(\hat{T}_N, \hat{\phi}_N)$ as $\hat{p}_N$. The $\hat{p}_N$ that attains $\hat{\epsilon}_N \leq \bar{\epsilon}$ is stored in list $L$. The minimum $N$ value is set to five to avoid trivial solutions.

Phase 2. For each $(\hat{p}_N, \hat{\epsilon}_N) \in L$, make $(\hat{T}_N, \hat{\phi}_N)$ from $\hat{p}_N$ and count $|\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})|$, where $\epsilon_{\max} = \max\{\hat{\epsilon}_N : (\hat{p}_N, \hat{\epsilon}_N) \in L\}$. Then, select the $\hat{N}$ that attains $\operatorname*{argmax}_N \{|\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})| : N \text{ such that } (\hat{p}_N, \hat{\epsilon}_N) \in L\}$.
In phase 1, to solve problem $(\mathrm{P}_1)$, an $\epsilon_l$ needs to be given. However, the value of $\epsilon_l$ affects the optimal value of the problem. The minimum $\hat{\epsilon}_N$ value is found by solving $(\mathrm{P}_1)$ repeatedly while narrowing its range. We find the corresponding point pair vector that attains the best balanced value $\hat{N}$ by a two-phase algorithm as follows.

[Algorithm] MILP-based Registration($\bar{\epsilon}$)
  $L = \emptyset$;   /* Phase 1 */
  for ($N = 5$; $N \leq \kappa$; $N$++) {
      $\bar{\epsilon}_N = \bar{\epsilon}$;       /* $\bar{\epsilon}_N$: upper bound for $\hat{\epsilon}_N$ */
      $\underline{\epsilon}_N = 0.0$;   /* $\underline{\epsilon}_N$: lower bound for $\hat{\epsilon}_N$ */
      $\epsilon_N = \bar{\epsilon}$;
      while ($\bar{\epsilon}_N > \underline{\epsilon}_N$) {     /* narrowing process to find $\hat{\epsilon}_N$ */
          $\epsilon_l = 2\sqrt{3}\,\epsilon_N$;
          Solve problem $(\mathrm{P}_1)$ with the given parameters $N$ and $\epsilon_l$;
          /* let $\hat{p}_N$ be the optimal solution and $\hat{\epsilon}_N$ the optimal value of $(\mathrm{P}_1)$ */
          if (no feasible solution is found) { break; }
          if ($\hat{\epsilon}_N < \epsilon_N$) {
              if ($\hat{\epsilon}_N > \underline{\epsilon}_N$) { $\underline{\epsilon}_N = \hat{\epsilon}_N$; }
              $\bar{\epsilon}_N = \epsilon_N$;
          } else {
              if ($\hat{\epsilon}_N > \bar{\epsilon}$) { break; }
              if ($\hat{\epsilon}_N < \bar{\epsilon}_N$) { $\bar{\epsilon}_N = \hat{\epsilon}_N$; }
              $\underline{\epsilon}_N = \epsilon_N$;
          }
          $\epsilon_N = (\bar{\epsilon}_N - \underline{\epsilon}_N)/2 + \underline{\epsilon}_N$;
      }
      $L = L \cup \{(\hat{p}_N, \hat{\epsilon}_N)\}$;
  }
  if ($L = \emptyset$) { return "No solution";   /* the $\bar{\epsilon}$ value is too small */ }
  /* Phase 2 */
  $\hat{N} = 0$;   $\epsilon_{\max} = \max\{\hat{\epsilon}_N : (\hat{p}_N, \hat{\epsilon}_N) \in L\}$;
  while ($L \neq \emptyset$) {
      Extract $(\hat{p}_N, \hat{\epsilon}_N)$ from $L$;   /* $L = L \setminus \{(\hat{p}_N, \hat{\epsilon}_N)\}$ */
      Make $(\hat{T}_N, \hat{\phi}_N)$ from $\hat{p}_N$ and count $|\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})|$ by solving problem $(\mathrm{P}_2)$ with the given parameter $\epsilon_{\max}$;   /* let $N_{tp} = |\mathrm{MAPS}(\hat{T}_N, \hat{\phi}_N, \epsilon_{\max})|$ */
      if ($\hat{N} < N_{tp}$) { $\hat{N} = N_{tp}$;  $\hat{p}_{\hat{N}} = \hat{p}_N$; }
  }
  return $(\hat{N}, \hat{p}_{\hat{N}})$;
This algorithm has three particular advantages.
- Although $\bar{\epsilon}$ is initially given, $\epsilon_l$ is adjusted to an appropriate value automatically, in accordance with the accuracy of the point sets.
- If the value of $\bar{\epsilon}$ is too small to find a solution, the algorithm outputs "No solution." If $\bar{\epsilon}$ is too big, it outputs a feasible solution at the cost of time.
- Because the optimal values of $(\mathrm{P}_1)$ are used, the narrowing process completes faster than a binary search.

3.5 Problem Size Issue and How to Overcome It

Unfortunately, the MILP solver and computers still do not have enough power to solve the MILP problem $(\mathrm{P}_1)$ for all point sets obtained from normal-size range images. Therefore, we preselect feature points from $V^1$ and $V^2$. Since our algorithm evaluates only the distances between corresponding point pairs, it is robust against noise and against the preselection method. After the algorithm has been applied to the preselected point sets $\dot{V}^1 \subset V^1$ and $\dot{V}^2 \subset V^2$, the robustness of the registration can be improved by solving the ILP problem $(\mathrm{P}_2)$ with $\dot{V}^1$ and $V^2$. The current MILP solver and computers have enough power to solve a problem $(\mathrm{P}_2)$ of this size within a reasonable time.
4 Experiments

We tested the robustness of our method by applying it to three synthetic datasets and one real dataset. We used the ILOG CPLEX (ver. 10.1) MILP solver installed on a PC
with a Pentium D 950 CPU (3.4 GHz), 2 GB of RAM, and Linux 2.6. We used GNU GLPK to generate files for our formalization. We use the algorithm of Umeyama [13] to estimate the rigid transformation from the point correspondences.

We used 3D models of "Stanford Bunny" [14], "Horse" [16], and "Armadillo" [14] to generate synthetic range images. We generated 18 synthetic range images for each model with a size of 200 x 200 pixels by rotating in 20-degree steps around the Y axis. Then, to generate five noisy datasets for each model, the Z coordinate of each point was perturbed by adding Gaussian noise with zero mean and a standard deviation of sigma = 0.02, 0.04, 0.06, 0.08, or 0.10. We also applied our method to the real range images of "Pooh" [15].

The selected feature points had curvedness [9] values that were maximal subject to a constraint on the distance between feature points. The curvedness $c$ of a point is calculated by $c = \sqrt{(\kappa_1^2 + \kappa_2^2)/2}$, where $\kappa_1$ and $\kappa_2$ are the two main curvatures of the point. The number of feature points for each range image was 50 for "Stanford Bunny" and "Horse", and 70 for "Armadillo" and "Pooh". The similarity $s_{ij}$ between feature points $v_i^1$ and $v_j^2$ was defined using the curvature values of the local surfaces around the points. It is set to $-100$ if the shapes of the surfaces (such as convex or concave) are not identical to each other; otherwise it is calculated from the curvedness values $c_i$ and $c_j$ as $s_{ij} = 1/|c_i - c_j|$.

We applied our method to all adjacent pairs in the range image sequences. For all datasets, $\kappa = 10$, $\epsilon_s = 10$, $\underline{r}_{ij} = -1$, and $\bar{r}_{ij} = 1$ $(i, j = 1, 2, 3)$. For the synthetic images, $\bar{\epsilon} = 0.15$, $M = (100, 100, 100)^\top$, $\underline{t}_i = -10$, and $\bar{t}_i = 10$ $(i = 1, 2, 3)$. We also used $\bar{\epsilon} = 0.25$ for the pairs whose solutions could not be obtained with $\bar{\epsilon} = 0.15$. For the real images, $\bar{\epsilon} = 0.25$, $M = (1000, 1000, 1000)^\top$, $\underline{t}_i = -100$, and $\bar{t}_i = 100$ $(i = 1, 2, 3)$. We also applied $\bar{\epsilon} = 0.50$ for the pairs whose solutions could not be obtained with $\bar{\epsilon} = 0.25$. In order to improve the accuracy of the calculation, each point $v_i = (x_i, y_i, z_i)^\top$ of the "Pooh" dataset is translated near the origin of the coordinates by replacing it with $v_i - ((\min\{x_j : j \in I(V^{0°})\} + \max\{x_j : j \in I(V^{0°})\})/2,\ (\min\{y_j : j \in I(V^{0°})\} + \max\{y_j : j \in I(V^{0°})\})/2,\ (\min\{z_j : j \in I(V^{0°})\} + \max\{z_j : j \in I(V^{0°})\})/2)^\top$, where $V^{0°}$ is the point set of the 0° view.

The experimental results for the synthetic range image datasets are shown in Table 1 ("b", "h", and "a" in the column "Dataset" indicate "Stanford Bunny", "Horse" and "Armadillo", respectively, followed by numbers indicating the standard deviation of the added noise). The parameter $\bar{\epsilon} = 0.25$ was used to calculate the results for pairs 180°-200° of h00, 180°-200° and 280°-300° of h04, 180°-200° of h06, 280°-300° of h08, 20°-40° and 280°-300° of h10, and 60°-80° of all "Armadillo" datasets, and $\bar{\epsilon} = 0.15$ for the other pairs. Table 1 shows that the rigid transformation errors did not always increase with the standard deviation sigma of the added Gaussian noise. There are two main reasons that our method is robust against measurement noise. First, our method does not need accurate values of the invariant features; we use them only to select the feature points and to reduce the putative corresponding point pairs. Second, the rigid transformation is estimated using more than five corresponding pairs of feature
Table 1. Error evaluation of MILP-based registration. "Error of angle" is the absolute angle between the true rotation angle and the estimated one in degrees. "Error of axis" is the deviation in the estimated rotation axis in degrees. "Error of translation" is the norm of the error of the translation vector.

Dataset  sigma  Time [sec.]  Error of angle [deg]  Error of axis [deg]  Error of translation
b00      0.00   4578         0.343                 1.89                 0.328
b02      0.02   4587         0.321                 1.74                 0.311
b04      0.04   4906         0.466                 2.00                 0.390
b06      0.06   5406         0.247                 1.30                 0.298
b08      0.08   4700         0.366                 1.67                 0.364
b10      0.10   4877         0.421                 1.47                 0.291
h00      0.00   5334         0.281                 2.14                 0.587
h02      0.02   5213         0.272                 1.94                 0.552
h04      0.04   4273         0.354                 1.97                 0.507
h06      0.06   4264         0.364                 2.19                 0.485
h08      0.08   3202         0.417                 2.55                 0.625
h10      0.10   3525         0.344                 3.04                 0.592
a00      0.00   119514       0.215                 0.83                 0.244
a02      0.02   118699       0.231                 0.88                 0.227
a04      0.04   124356       0.205                 0.95                 0.224
a06      0.06   127393       0.320                 0.94                 0.282
a08      0.08   118821       0.244                 0.84                 0.219
a10      0.10   109337       0.393                 0.96                 0.262
Table 2. Error evaluation of the dataset "Pooh". "Error of angle" is the absolute angle between the true rotation angle and the estimated one in degrees. "Error of axis" is the angle in degrees between the rotation axis estimated from the given pair and the rotation axis estimated from all pairs. "Error of translation" is the norm of the error between the translation vector estimated from the given pair and the translation vector estimated from all pairs.

Angle    Time [sec.]  Error of angle [deg]  Error of axis [deg]  Error of translation
average  19496        1.950                 0.072                0.719
points. If we can find only five pairs of points that are not very noisy, by chance, we can accurately estimate the rigid transformation. The computational time for "Armadillo" was higher than that of the others because the numbers of MILP variables and constraints are proportional to the number of feature points. Table 2 shows the results for the real range image dataset "Pooh". The parameter $\bar{\epsilon} = 0.50$ was used to calculate the results for pairs 20°-40°, 60°-80°, 100°-120°, 220°-240°, and 320°-140°, and $\bar{\epsilon} = 0.25$ was used for the other pairs. The registration results for "Pooh" are shown in Figure 1. While the errors for some pairs were relatively large, the results are good enough for coarse registration.
Fig. 1. Results for “Pooh”
5 Concluding Remarks

Our proposed coarse registration method using Mixed Integer Linear Programming (MILP) can find a globally optimal registration without using the values of invariant features. In addition, it automatically adjusts the error tolerance depending on the accuracy of the given range image data. Our method finds the best consistent pairs from all possible point pairs using an MILP solver. While such solvers are powerful tools, all of the constraints must be written in linear form. This means that constraints on the rotation matrix cannot be applied directly. Therefore, we selected a relevant number of consistent pairs in the sense of the distances between point pairs, which constrains the rotation matrix indirectly. The number of corresponding point pairs and the distances between them are automatically balanced by our algorithm using two different MILP formulations. Future work will focus on reducing the computational time and improving the selection of the feature points.
References
1. Besl, P.J., McKay, N.D.: A Method for Registration of 3-D Shapes. IEEE Trans. on PAMI 14(2), 239-256 (1992)
2. Campbell, R.J., Flynn, P.J.: A Survey of Free-Form Object Representation and Recognition Techniques. CVIU 81, 166-210 (2001)
3. Chen, C.C., Stamos, I.: Range Image Registration Based on Circular Features. In: Proc. 3DPVT, pp. 447-454 (2006)
4. Chua, C.S., Jarvis, R.: 3D Free-Form Surface Registration and Object Recognition. IJCV 17(1), 77-99 (1996)
5. He, W., Ma, W., Zha, H.: Automatic Registration of Range Images Based on Correspondence of Complete Plane Patches. In: Proc. 3DIM, pp. 470-475 (2005)
6. Higuchi, K., Hebert, M., Ikeuchi, K.: Building 3-D Models from Unregistered Range Images. GMIP 57(4), 315-333 (1995)
7. Johnson, A.E., Hebert, M.: Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE Trans. on PAMI 21(5), 433-449 (1999)
8. Johnson, E.L., Nemhauser, G.L., Savelsbergh, M.W.P.: Progress in Linear Programming-Based Algorithms for Integer Programming: An Exposition. INFORMS Journal on Computing 12(1), 2-23 (2000)
9. Koenderink, J.J.: Solid Shape. MIT Press, Cambridge (1990)
10. Šára, R., Okatahi, I.S., Sugimoto, A.: Globally Convergent Range Image Registration by Graph Kernel Algorithm. In: Proc. 3DIM, pp. 377-384 (2005)
11. Rusinkiewicz, S., Levoy, M.: Efficient Variants of the ICP Algorithm. In: Proc. 3DIM, pp. 145-152 (2001)
12. Stein, F., Medioni, G.: Structural indexing: Efficient 3-D object recognition. IEEE Trans. on PAMI 14(2), 125-145 (1992)
13. Umeyama, S.: Least-Square Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. on PAMI 13(4), 376-380 (1991)
14. Stanford 3D Scanning Repository, http://www-graphics.stanford.edu/data/3Dscanrep/
15. The Ohio State University Range Image Repository, http://sampl.ece.ohio-state.edu/data/3DDB/RID/minolta/
16. Georgia Institute of Technology Large Geometric Models Archive, http://www-static.cc.gatech.edu/projects/large_models/
Accelerating Pattern Matching or How Much Can You Slide?
Ofir Pele and Michael Werman
School of Computer Science and Engineering, The Hebrew University of Jerusalem
{ofirpele,werman}@cs.huji.ac.il
Abstract. This paper describes a method that accelerates pattern matching. The distance between a pattern and a window is usually close to the distance of the pattern to the adjacent windows due to image smoothness. We show how to exploit this fact to reduce the running time of pattern matching by adaptively sliding the window, often by more than one pixel. The decision of how much to slide is based on a novel rank we define for each feature in the pattern. Implemented on a Pentium 4 3GHz processor, detection of a pattern with 7569 pixels in a 640 × 480 pixel image requires only 3.4ms.
1 Introduction
Many applications in image processing and computer vision require finding a particular pattern in an image, pattern matching. To be useful in practice, pattern matching methods must be automatic, generic, fast and robust.
Fig. 1. (a) A non-rectangular pattern of 7569 pixels (631 edge pixel pairs). Pixels not belonging to the mask are in black. (b) A 640 × 480 pixel image in which the pattern was sought. (c) The result image. All similar masked windows are marked in white. (d) The two found occurrences of the pattern in the image. Pixels not belonging to the mask are in black. The method suggested in this paper reduced the Pele and Werman pattern matching method[1] running time from 21ms to only 3.4ms.
Fig. 2. (a) A non-rectangular pattern of 2197 pixels. Pixels not belonging to the mask are in black. (b) Three 640x480 pixel frames out of fourteen in which the pattern was sought. (c) The result. Most similar masked windows are marked in white. (d) Zoom in of the occurrences of the pattern in the frames. Pixels not belonging to the mask are in black. The method suggested in this paper reduced the Pele and Werman pattern matching method[1] running time from 22ms to only 7.2ms. The average number of samples per window reduced from 19.7 to only 10.6.
Pattern matching is typically performed by scanning the entire image, and evaluating a distance measure between the pattern and a local rectangular window. The method proposed in this paper is applicable to any pattern shape, even a non-contiguous one. We use the notion of "window" to cover all possible shapes. There are two main approaches to reducing the computational complexity of pattern matching. The first approach reduces the time spent on each window. The second approach reduces the number of windows visited. In this work we concentrate on the second approach. We suggest sliding more than one pixel at a time. The question that arises is: how much can you slide? The answer depends on the pattern and on the image. For example, if the pattern and the image are black-and-white checkerboards of pixels, the distance of the pattern to the current window and to the next window will be totally different. However, if the pattern is piecewise smooth, the
Fig. 3. (a) A rectangular pattern of 1089 pixels. (b) A noisy version of the original 640 × 480 pixel image. The pattern that was taken from the original image was sought in this image. The noise is Gaussian with a mean of zero and a standard deviation of 25.5. (c) The result image. The single similar masked window is marked in white. (d) The occurrence of the pattern in the zoomed-in image. The method suggested in this paper reduced the Pele and Werman pattern matching method[1] running time from 19ms to only 6ms. The average number of samples per window reduced from 12.07 to only 2. The image is copyright by Ben Schumin and was downloaded from: http://en.wikipedia.org/wiki/Image:July_4_crowd_at_Vienna_Metro_station.jpg.
distances will be similar. We describe a method which examines the pattern and decides how much we can slide in each step. The decision is based on a novel rank we define for each feature in the pattern. We use a two stage method on each window. First, we test all the features with a high rank. Most of the windows will not pass and we will be able to slide more than one pixel. For the windows that passed the test we perform the simple test on all the features. A typical pattern matching task is shown in Fig. 1. A non-rectangular pattern of 7569 pixels (631 edge pixel pairs) was sought in a 640 × 480 pixel image. Using the Pele and Werman method[1] the running time was 21ms. Using our method the running time reduced to only 3.4ms. All runs were done on a Pentium 4 3GHz processor. Decreasing the number of visited windows is usually achieved using an image pyramid[3]. By matching a coarser pattern to a coarser level of the pyramid, fewer windows are visited. Once the strength of each coarser resolution match is calculated, only those that exceed some threshold need to be compared for the next finer resolution. This process proceeds until the finest resolution is reached. There are several problems with the pyramid approach. First, important details of the objects can disappear. Thus, the pattern can be missed. For example, in Fig. 1 if we reduce the resolution to a factor of 0.8, the right occurrence of the pattern is found, but the left one is missed. Using the smaller images the running time decreases from 21ms to 18ms (without taking into account the time spent on decreasing the resolution). Using our approach, both occurrences of the patterns are found in only 3.4ms. Note that smoothness can change
Fig. 4. (a) A non-rectangular pattern of 3732 pixels (3303 edge pixel pairs). Pixels not belonging to the mask are in black. (b) A 2048 × 1536 pixel image in which the pattern was sought. The area where the pattern was found is marked in white. (c) The occurrence of the pattern in the image zoomed in. (d) The occurrence of the pattern in the image zoomed in, with the exact found outline of the pattern painted in white. The method suggested in this paper reduced the Pele and Werman pattern matching method[1] running time from 437ms to only 51ms. Note the large size of the image. The average number of samples per window reduced from 27 to only 3.7.
between different local parts of the pattern. The pyramid approach is global, while our approach is local and thus more distinctive. The second problem of the pyramid approach is its memory overhead. This paper is organized as follows. Section 2 presents the LUp rank for pixels and for pairs of pixels. Section 3 describes a method that uses the LUp rank for accelerating pattern matching. Section 4 presents extensive experimental results. Finally, conclusions are drawn in Section 5.
2 The LUp Rank
In this section we define a novel smoothness rank for features, the LUp rank. The rank is later used as a measure that tells us how much we can slide for each
Fig. 5. (a) A 115 × 160 pattern (2896 edge pixel pairs). (b) A 1000 × 700 pixel image in which the pattern was sought. The most similar window is marked in white. The method suggested in this paper reduced the Pele and Werman pattern matching method[1] running time from 51ms to only 9.2ms. The average number of samples per window reduced from 23 to only 3. The images are from the Mikolajczyk and Schmid paper[2].
pattern. This is first defined for pixels and then for pairs of pixels. Finally, we suggest ways of calculating the LUp rank.
2.1 The LUp Rank for Pixels
In this sub-section we use the Thresholded Absolute Difference Hamming distance that was suggested by Pele and Werman[1]. This distance is the number of different corresponding pixels between a window and a pattern, where corresponding pixels are defined as different if and only if their absolute intensity difference is greater than a predefined pixel similarity threshold, q; i.e. the distance between the set of pixels A, applied to the pattern and the current window, is defined as (δ returns 1 for true and 0 for false):

TAD_A(pattern, window) = \sum_{(x,y) \in A} \delta\big( |pattern(x, y) - window(x, y)| > q \big)    (1)
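As a small illustration (our own sketch, not the authors' code), the Thresholded Absolute Difference Hamming distance of Eq. (1) can be computed for a masked window as follows; the array layout and the boolean mask representing the set A are assumptions.

import numpy as np

def tad_distance(pattern, window, mask, q):
    # Eq. (1): count corresponding pixels whose absolute intensity difference
    # exceeds the pixel similarity threshold q, over the set A given by the mask.
    diff = np.abs(pattern.astype(int) - window.astype(int))
    return int(np.count_nonzero(diff[mask] > q))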
We first define the LU rank for a pattern pixel as:

LU(pattern, (x, y)) = max R  s.t.  ∀ 0 ≤ r_x, r_y ≤ R:  pattern(x, y) = pattern(x − r_x, y − r_y)    (2)
Now, if we assess the similarity between a pixel in the pattern with an LU rank of R and a pixel in the window, we get information about all the windows which are up to R pixels to the right of and below the current window. Using this information we can slide in steps of R + 1 pixels without losing accuracy.
The requirement for equality in Eq. 2 is relaxed in the definition of the LUp rank. In this rank the only requirement is that the absolute difference is not too high:

LU_p(pattern, (x, y)) = max R  s.t.  ∀ 0 ≤ r_x, r_y ≤ R:  |pattern(x, y) − pattern(x − r_x, y − r_y)| ≤ p    (3)
Note that the LU and LU0 ranks for pixels are equivalent.
2.2 The LUp Rank for Pairs of Pixels
In this sub-section we use the Monotonic Relations Hamming distance that was suggested by Pele and Werman[1]. This distance is the number of pairs of pixels in the current window that do not have the same relationship as in the pattern; i.e. the basic features of this distance are pairs of pixels and not single pixels. Pixel relations have been successfully applied in many fields such as pattern matching[1], visual correspondence[4] and keypoint recognition[5]. Each pattern is defined by a set of pairs of pixels which are close, while the intensity difference is high. We assume without loss of generality that in the pattern the first pixel in each pair has a higher intensity value than the second pixel. The distance between the set of pairs A, applied to the pattern and the current window, is defined as (δ returns 1 for true and 0 for false):

MR_A(pattern, window) = \sum_{[(x_1,y_1),(x_2,y_2)] \in A} \delta\big( window(x_1, y_1) ≤ window(x_2, y_2) \big)    (4)
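A hedged sketch of Eq. (4): given the precomputed list of ordered pixel pairs of the pattern, count the pairs whose relation is violated in the window. The (x, y) versus row/column indexing convention is our assumption.

def mr_distance(window, pairs):
    # pairs: list of ((x1, y1), (x2, y2)) with pattern(x1, y1) > pattern(x2, y2).
    # A pair contributes 1 when the relation does not hold in the window.
    violated = 0
    for (x1, y1), (x2, y2) in pairs:
        if window[y1, x1] <= window[y2, x2]:
            violated += 1
    return violated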
Given a pattern and a pair of pixels [(x_1, y_1), (x_2, y_2)] such that the first pixel has a higher intensity value than the second pixel, i.e. pattern(x_1, y_1) > pattern(x_2, y_2), we define the pair's LUp rank as:

LU_p(pattern, [(x_1, y_1), (x_2, y_2)]) = max R  s.t.  ∀ 0 ≤ r_x, r_y ≤ R:  pattern(x_1 − r_x, y_1 − r_y) > pattern(x_2 − r_x, y_2 − r_y) + p    (5)
The requirement that the relation must hold by a margin of at least p is added for stability. Now, if we assess the similarity between a pair of pixels in the pattern with an LUp rank of R and a pair of pixels in the window, we get information about all the windows which are up to R pixels to the right of and below the current window. Figure 6 illustrates this. Using this information we can slide in steps of R + 1 pixels without losing accuracy.
2.3 How to Calculate the LUp Rank
We suggest two methods of calculating the LUp rank of all features (pixels or pairs of pixels) in the pattern. The first is to calculate the rank for each feature. If we denote by R̄ the average LUp rank and by |A| the feature set size, then the average time complexity is O(|A|R̄²).
Fig. 6. The pair of pixels in the pattern (marked with two circles), [(3, 4), (1, 1)], has an LU10 rank of 1 (Pattern(3, 4) > Pattern(1, 1) + 10, Pattern(3, 3) > Pattern(1, 0) + 10, etc.). Thus, when we test whether Image(3, 4) > Image(1, 1), we get an answer to these 4 questions (all coordinates are relative to the window's coordinates): 1. In the window of (a), is Window(3, 4) > Window(1, 1) as in the pattern? 2. In the window of (b), is Window(2, 4) > Window(0, 1) as in the pattern? 3. In the window of (c), is Window(3, 3) > Window(1, 0) as in the pattern? 4. In the window of (d), is Window(2, 2) > Window(0, 0) as in the pattern?
The second method is to test which features have each LUp rank. This can be done quickly by finding the 2D min and max for each value of R. The Gil and Werman[6] method does this with a time complexity of O(1) per pixel. If we denote by Rmax the maximum R value, then the time complexity is O(|A|Rmax). A combined approach can also be used. Note that the computation of the LUp rank is done offline for each given pattern. Moreover, the size of the pattern is usually much smaller than the size of the image; thus the running time of this stage is negligible. In this paper we simply calculate the LUp rank for each feature.
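The per-feature (brute-force) computation of the pixel LUp rank of Eq. (3) can be sketched as below; the function name, the d[y, x]-style indexing, and the r_max bound are our own assumptions.

import numpy as np

def lup_rank_pixel(pattern, x, y, p, r_max):
    # Largest R such that |pattern(x, y) - pattern(x - rx, y - ry)| <= p
    # for all 0 <= rx, ry <= R (Eq. 3), computed by brute force.
    best = 0
    for R in range(1, r_max + 1):
        if x - R < 0 or y - R < 0:
            break
        block = pattern[y - R:y + 1, x - R:x + 1].astype(int)
        if np.abs(block - int(pattern[y, x])).max() <= p:
            best = R
        else:
            break
    return best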
3 The Pattern Matching Method
The problem of pattern matching can be formulated as follows: given a pattern and an image, find all the occurrences of the pattern in the image. We define a window as a match if the Hamming distance (i.e. Eq. 1 or Eq. 4) is smaller than or equal to the image similarity threshold. In order to reduce the running time spent on each window we use the Pele and Werman[1] sequential sampling algorithm. The sequential algorithm randomly samples corresponding features sequentially and without replacement from the
window and pattern and tests them for similarity. After each sample, the algorithm tests whether the accumulated number of non-similar features is equal to a threshold, which increases with the number of samples. We call this vector of thresholds the rejection line. If the algorithm touches the rejection line, it stops and returns non-similar. If the algorithm finishes sampling all the features, it has computed the exact distance between the pattern and the window. Pele and Werman[1] presented a Bayesian framework for sequential hypothesis testing on finite populations. Given an allowable bound on the probability of a false negative, the framework computes the optimal rejection line; i.e. a rejection line such that the sequential algorithm parameterized with it has the minimum expected running time. Pele and Werman[1] also presented a fast near-optimal framework for computing the rejection line. In this paper, we use the near-optimal framework. The full system we use for pattern matching is composed of an offline and an online part. The offline part gets a pattern and returns the characteristic LUp rank, two sets of features and the two corresponding rejection lines. One set contains all the pattern features. The second set contains all the pattern features from the first set that have an LUp rank greater than or equal to the characteristic LUp rank. The online part slides through the image in steps of the characteristic LUp rank plus one. On each window it uses the sequential algorithm to test for similarity on the second set of features. If the sequential algorithm returns non-similar, the algorithm slides right by the characteristic LUp rank plus one pixels, or down by the characteristic LUp rank plus one rows (at the end of each row). If the sequential algorithm returns similar (which we assume is a rare event), the window and all the windows that would otherwise be skipped are tested for similarity. The test is made again using the sequential algorithm, this time on the set that contains all the pattern features.
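The online scanning loop described above can be sketched as follows. This is our own illustrative Python, not the authors' implementation; test_small and test_full stand in for the sequential sampling test on the high-rank feature set and on the full feature set, respectively.

def scan_image(image_h, image_w, pat_h, pat_w, step, test_small, test_full):
    # step = characteristic LUp rank + 1.
    matches = []
    y = 0
    while y + pat_h <= image_h:
        x = 0
        while x + pat_w <= image_w:
            if test_small(x, y):
                # Rare event: re-test this window and all the windows that
                # would otherwise be skipped, using all pattern features.
                for dy in range(step):
                    for dx in range(step):
                        xx, yy = x + dx, y + dy
                        if (xx + pat_w <= image_w and yy + pat_h <= image_h
                                and test_full(xx, yy)):
                            matches.append((xx, yy))
            x += step
        y += step
    return matches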
4 Results
The proposed method was tested on real images and patterns. The results show that the method accelerates pattern matching, with a very small decrease in robustness to rotations. For all other transformations tested (small scale change, image blur, JPEG compression and illumination) there was no decrease in robustness. First we describe results that were obtained using the Thresholded Absolute Difference Hamming distance (see Eq. 1). Second, we describe results that were obtained using the Monotonic Relations Hamming distance (see Eq. 4).
4.1 Results Using the Thresholded Absolute Difference Hamming Distance
We searched for windows with a Thresholded Absolute Difference Hamming distance lower than 0.4×|A|. The sequential algorithm was parameterized using the near-optimal method of Pele and Werman[1] with input of a uniform prior and
a false negative error bound of 0.1%. In all of the experiments, the p threshold for the LUp rank was set to 5. The characteristic LU5 rank for each pattern was set to the maximum LU5 rank found for at least 30 pattern pixels. Note that the online part first tests similarity on the set of pixels with an LU5 rank greater than or equal to the characteristic LU5 rank. The same relative similarity threshold is used; i.e. if the size of this small set is |As| we test whether the Thresholded Absolute Difference Hamming distance is lower than 0.4 × |As|. Results that show the substantial reduction in running time are shown in Figs. 2 and 3.
4.2 Results Using the Monotonic Relations Hamming Distance
The pairs that were used in the set of each pattern were pairs of pixels belonging to edges, i.e. pixels that had a neighbor pixel where the absolute intensity value difference was greater than 80. Two pixels, (x2, y2), (x1, y1), are considered neighbors if their l∞ distance, max(|x1 − x2|, |y1 − y2|), is smaller than or equal to 2. We searched for windows with a Monotonic Relations Hamming distance lower than 0.25 × |A|. The sequential algorithm was parameterized using the near-optimal method of Pele and Werman[1] with input of a uniform prior and a false negative error bound of 0.1%. In all of the experiments, the p threshold for the LUp rank was set to 20. The characteristic LU20 rank for each pattern was set to the maximum LU20 rank found for at least 30 pairs of pixels from the set of all edge pixel pairs. Note that the online part first tests similarity on the set of pairs of edge pixels with an LU20 rank greater than or equal to the characteristic LU20 rank. The same relative similarity threshold is used; i.e. if the size of this small set is |As| we test whether the Monotonic Relations Hamming distance is lower than 0.25 × |As|. Results that show the substantial reduction in running time are shown in Figs. 1, 4 and 5. To illustrate the performance of our method, we ran the tests that were also conducted in the Pele and Werman paper[1]. All the data for the experiments were downloaded from http://www.cs.huji.ac.il/~ofirpele/hs/all_images.zip. Five image transformations were evaluated: small rotation; small scale change; image blur; JPEG compression; and illumination. The names of the datasets used are rotation, scale, blur, jpeg, and light, respectively. The blur, jpeg and light datasets were from the Mikolajczyk and Schmid paper[2]. The scale dataset contains 22 images with an artificial scale change from 0.9 to 1.1 in jumps of 0.01; and the rotation dataset contains 22 images with an artificial in-plane rotation from -10◦ to 10◦ in jumps of 1◦. For each collection, there were ten rectangular patterns that were chosen from the image with no transformation. In each image we considered only the window with the minimum distance as similar, because we knew that the pattern occurred only once in the image. We repeated each search of a pattern in an image 1000 times. There are two notions of error: the miss detection error rate and the false detection error rate. As we know the true homographies between the images, we know where the pattern pixels are in the transformed image. We denote a correct match as one that covers at least 80% of the transformed pattern pixels. A false match is one that covers less than 80% of the transformed pattern pixels. Note
When K > 0, the maximum and the minimum have the same sign; it means all the normal sections are curving in the same direction with respect to the tangent plane. When K < 0, the maximum and the minimum of the curvature have opposite signs; therefore, some normal sections are curving in the opposite direction from others. When K = 0, either the maximum or the minimum of the curvature among the normal sections is zero; all the normal sections are on the same side of the tangent plane, with one lying on it. Surfaces with K = 0 everywhere are said to be developable, meaning they can be unrolled into a flat sheet of paper without stretching. The surface represented by (x, y, d(x, y)) in 3D space has the curvatures:

K = \frac{d_{xx} d_{yy} - d_{xy}^2}{(1 + d_x^2 + d_y^2)^2},    (1)

H = \frac{(1 + d_y^2) d_{xx} - 2 d_x d_y d_{xy} + (1 + d_x^2) d_{yy}}{2 (1 + d_x^2 + d_y^2)^{3/2}}.    (2)
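A hedged sketch of Eqs. (1)–(2) applied to a discrete disparity map d(x, y): the use of numpy finite differences (np.gradient) and the row/column orientation are our assumptions, not part of the paper.

import numpy as np

def surface_curvatures(d):
    # First- and second-order finite differences of d, indexed as d[y, x].
    dy, dx = np.gradient(d.astype(float))
    dyy, _ = np.gradient(dy)
    dxy, dxx = np.gradient(dx)
    denom = 1.0 + dx**2 + dy**2
    K = (dxx * dyy - dxy**2) / denom**2                       # Eq. (1)
    H = ((1 + dy**2) * dxx - 2 * dx * dy * dxy
         + (1 + dx**2) * dyy) / (2 * denom**1.5)              # Eq. (2)
    return K, H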
2.2 Developable vs. Minimal Surfaces

We noted in [12,13] that the surfaces perceived by humans, as shown in Fig. 1(b), are developable, whereas previously-proposed algorithms predict the "soap film" surfaces
such as the one shown in Fig. 1(c). Accordingly, we suggested that minimizing the total sum of the absolute value or the square of the Gaussian curvature, for example, may predict surfaces similar to those that are perceived by humans. Developable surfaces minimize the energy that is the total sum of the Gaussian curvature modulus. Thus, intuitively, it can be thought of as rolling and bending a piece of very thin paper (so thin that you cannot feel any stiffness, but it does not stretch) to fit to the stereo surface. In contrast, conventional priors tend to minimize the mean curvature H, rather than the Gaussian curvature. For instance, the continuous analog of the minimization of the square difference of disparities between neighboring pixels is minimizing the Dirichlet integral \int |\nabla d(x, y)|^2 \, dx\, dy. It is known (Dirichlet's Principle) that the function that minimizes this integral is harmonic. By equation (2), the surface represented by (x, y, d(x, y)) with a harmonic function d(x, y) has small mean curvature H when d_x, d_y are small compared to 1. Thus the surface approximates the minimal surface, which is defined as the surface with H = 0, and is physically illustrated by a soap film spanning a wire frame. Thus the difference between the kinds of surfaces favored by the present prior and the previously-used priors is the difference between minimizing the Gaussian and the mean curvature, and intuitively the difference between a flexible thin paper and a soap film. Since a surface whose Gaussian and mean curvature are both zero is a plane, one might say the two represent two opposite directions of curving the surface. In this sense, the minimization of the Gaussian curvature modulus is the opposite of the extant priors. Note that this notion of oppositeness is about the smooth part of the surface; thus it does not change even when discontinuities are allowed by the prior in disparity or its derivatives. Another difference between the currently popular priors and what the human vision system seems to use is, as we pointed out in [12,13], that the popular priors are convex. The surfaces that humans perceive, including the one shown in Fig. 1(b), cannot be predicted by the minimization of the fronto-parallel prior, nor by the minimization of any convex prior. Although some priors that allow discontinuities are non-convex, they are usually convex at the continuous part of the surface. As far as we could determine, the Gaussian curvature is the only inherently non-convex prior that has been proposed as a stereo prior. For more discussion on the convexity of the prior, see [12]. The total absolute Gaussian curvature has been proposed as a criterion of tightness for re-triangulation of a polyhedral surface mesh [1]. The re-triangulation process to minimize the total absolute Gaussian curvature has subsequently been proved to be NP-hard, at least in the case of terrains [6].

2.3 Is It Good for Stereo?

Let us consider the case of developable surfaces, the limiting case where the Gaussian curvature vanishes everywhere. Note that such a surface can have a sharp bend and still have zero Gaussian curvature, as shown in Fig. 3, allowing a sharp border in the depth surface. It also encourages the straightness of the border: a higher-order effect not seen in first-order priors. Compare this to the limiting case of the fronto-parallel prior, which
Fig. 3. Surfaces with zero Gaussian curvature. A surface can have a sharp bend and still have zero Gaussian curvature.
would be a plane. Thus, the Gaussian prior is more flexible than most conventional prior models. The question is rather whether it is too flexible. It is not immediately clear if this scheme is useful as a prior constraint for a stereo optimization model, especially since the functional would be hard to optimize. This is the reason for the experiments we describe in the next section. But before going to the experiments, we mention one reason that this type of prior might work. It is that there is a close relationship between the Gaussian curvature and the convex hull:

Theorem. Let A be a set in the three-dimensional Euclidean space, B its convex hull, and p a point in ∂B \ A, where ∂B denotes the boundary of B. Assume that a neighborhood of p in ∂B is a smooth surface. Then the Gaussian curvature of ∂B at p is zero.

A proof is in [12]. This means that the Gaussian curvature of the surface of the convex hull of a set, at a point that does not belong to the original set, is zero wherever it is defined. How does this relate to the stereo prior? Imagine for a while that evidence from stereo matching is sparse but strong. That gives a number of scattered points in the matching space that should be on the matching surface. Then the role of the prior model is to interpolate these points. A soap film solution finds a minimal surface that goes through these points. We think taking the convex hull of these points locally might give a good solution, since the convex hull in a sense has the "simplest" 3D shape that is compatible with the data, much in the way the Kanizsa triangle[15] is the simplest 2D shape that explains incomplete contour information; and in the real world, most surfaces are in fact the faces of some 3D body. By the theorem, taking the convex hull of the points and using one of its faces gives a surface with zero Gaussian curvature. To be sure, there are always two sides of the hull, making it ambiguous, and there is the problem of choosing the group of local points to take the convex hull of; it would not do to use all the points on the whole surface. Also, when the evidence is dense, and not so strong, the data is not like points in space; it is a probability distribution. Then it is not clear what "taking the convex hull" means. So we decided to try minimizing the Gaussian curvature and hope that it has a similar effect.
3 Experiments

We experimentally compared the Gaussian curvature minimization prior model with the conventional prior models, keeping other conditions identical. We used the data set and
the code implementation of stereo algorithms, as well as the evaluation module, by D. Scharstein and R. Szeliski, available from http://vision.middlebury.edu/stereo/, which was used in [22]. We made modifications to the code to allow use of different priors.

3.1 MAP Formulation

Each data set consists of a rectified stereo pair IL and IR. The stereo surface is represented as a disparity function d(x, y) on the discretized left image domain. We seek the d(x, y) that minimizes the energy

E(I_L, I_R, d) = E_1(I_L, I_R, d) + E_2(d).    (3)

The first term is the so-called data term, which encodes mainly the image formation model. The second term is the prior term, in which we are interested. For the image formation energy term E_1(I_L, I_R, d), we used the unaggregated absolute differences as the matching cost:

E_1(I_L, I_R, d) = \sum_{(x,y)} |I_L(x, y) − I_R(x + d(x, y), y)|.    (4)
This stays the same throughout the experiments. The purpose of the experiments is to evaluate the relative performance of the different prior energy terms E_2(d), as detailed next.

3.2 Priors

We compared five prior models, one of which is the total absolute Gaussian curvature minimization prior. Each prior energy E_2^i(d) (i = 1, ..., 5) is defined by adding the local function over all the pixels:

E_2^i(d) = \lambda \sum_{(x,y)} f_d^i(x, y)    (5)
The details of each prior model follow.
i) Gaussian curvature minimization. We experimented with several varieties of the prior model by minimization of the Gaussian curvature modulus. First, to minimize the modulus of the Gaussian curvature, we tried both the sum of squares and the sum of absolute values. Second, since the curvature is not meaningful when the local discretized disparity surface is very rough, we used various smoothing schemes where the disparity change is larger than a threshold value; for smoothing we used i) the square difference, ii) the absolute difference, and iii) a constant penalty. By combination, there were six varieties of the local prior energy. Of these, we found that the combination of the absolute Gaussian curvature and the square-difference smoothing worked best:

f_d^1(x, y) = \begin{cases} |K(x, y, d)| & \text{if } d_x^2 + d_y^2 < c \\ d_x^2 + d_y^2 & \text{if } d_x^2 + d_y^2 \ge c, \end{cases}    (6)

where
d_x = d(x+1, y) − d(x, y),    d_y = d(x, y+1) − d(x, y),
d_{xx} = d(x+1, y) + d(x−1, y) − 2d(x, y),
d_{xy} = d(x+1, y+1) + d(x, y) − d(x, y+1) − d(x+1, y),
d_{yy} = d(x, y+1) + d(x, y−1) − 2d(x, y),   and

K(x, y, d) = \frac{d_{xx} d_{yy} - d_{xy}^2}{(1 + d_x^2 + d_y^2)^2}.
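A hedged sketch of the local prior energy f_d^1 of Eq. (6) at a single interior pixel, using the discrete derivatives defined above; the function name and the d[y, x] indexing are our assumptions.

def prior_fd1(d, x, y, c):
    # Discrete derivatives as defined in the text.
    dx  = d[y, x + 1] - d[y, x]
    dy  = d[y + 1, x] - d[y, x]
    dxx = d[y, x + 1] + d[y, x - 1] - 2 * d[y, x]
    dxy = d[y + 1, x + 1] + d[y, x] - d[y + 1, x] - d[y, x + 1]
    dyy = d[y + 1, x] + d[y - 1, x] - 2 * d[y, x]
    g = dx * dx + dy * dy
    if g < c:
        K = (dxx * dyy - dxy * dxy) / (1.0 + g) ** 2
        return abs(K)          # absolute Gaussian curvature term
    return g                   # square-difference smoothing term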
ii) Smoothness part only. In order to evaluate the effect of Gaussian curvature minimization, we compared it to the prior that uses only the smoothing part, i.e., the case c = 0 in f_d^1(x, y). This is the most popular prior, which minimizes the square disparity change:

f_d^2(x, y) = d_x^2 + d_y^2.    (7)

iii) Potts model. This model is widely used in the literature, especially in relation with graph cut algorithms, e.g., [5]. It is defined by
(8)
where T (X) gives 0 if X = 0 and 1 otherwise. iv) Smoothing with a cut-off value. This is used to model a piecewise smooth surface. Where the surface is smooth, it penalizes a disparity change according to the absolute disparity change (fd4 ) or square disparity change (fd5 ). When there is a discontinuity, it costs only a constant value, no matter how large the discontinuity is: |dx | + |dy | (if |dx | + |dy | < c) 4 fd (x, y) = (9) c (if |dx | + |dy | ≥ c), dx 2 + dy 2 (if dx 2 + dy 2 < c) 5 fd (x, y) = (10) c (if dx 2 + dy 2 ≥ c). 3.3 Optimization We used the simulated annealing to optimize the energy functional. The number of iteration was 500 with the full Gibbs update rule, where all possible disparities at a given pixel is evaluated. The annealing schedule was linear. Note that we need to use the same optimization technique for all the prior. Thus we cannot use the graph cut or the belief propagation algorithms, since they cannot be used (at least easily) with the Gaussian curvature minimization because it is of second order. 3.4 Evaluation Measure The evaluation module collects various statistics from the experiments. Here, we give the definition of the statistics we mention later in discussing the results. RO¯ is the RMS
Fig. 4. Statistics from the tsukuba data set, RO¯ , BO¯ , BT¯ , and BD . The total absolute Gaussian curvature minimization prior fd1 used c = 4.0. The prior fd4 (absolute difference with a cut-off) performed best when c = 5.0 and the prior fd5 (square difference with a cut-off) when c = 10.0.
3.4 Evaluation Measure

The evaluation module collects various statistics from the experiments. Here, we give the definitions of the statistics we mention later in discussing the results. RO¯ is the RMS (root-mean-squared) disparity error. BO¯ is the percentage of pixels in non-occluded areas with a disparity error greater than 3. BT¯ is the percentage of pixels in textureless areas with a disparity error greater than 3. BD is the percentage of pixels with a disparity error greater than 3 near discontinuities by more than 5.

3.5 Results

Shown in Fig. 4 are the statistics from one of the data sets (tsukuba) with varying λ. The total absolute Gaussian curvature prior f_d^1 with c = 4.0 performed the best among the variations of minimizing the Gaussian curvature modulus we tried. The prior f_d^4, which uses the absolute difference with a cut-off value, performed best when the cut-off threshold c is 5.0, and f_d^5, the square difference with a cut-off, when c = 10.0. Though the threshold value c for the best results varies from prior to prior, this is nothing to be alarmed about, as the quantity being thresholded differs from prior to prior. In Fig. 5, we show the disparity maps of the experiments for qualitative evaluation. The first two rows show the original image and the ground truth. Each of the rest of the rows shows the best results in terms of RO¯ within the category, except for (d),
Fig. 5. Experiments on the map, sawtooth, tsukuba, and venus data sets (one per column). (a) The left image, (b) the ground truth, and the results by (c) total absolute Gaussian curvature minimization with smoothing, (d) smoothing only, (e) Potts model, and (f) smoothing with a cut-off. Each result is for the best value of λ within the category except for (d), which used the same λ as the one used in (c) for comparison.
which uses the same parameters as the result shown just above in row (c), to compare the effect of Gaussian curvature minimization, i.e., to see if the Gaussian part of (6) actually has any effect. The computation time was about 200 to 700 seconds for the Gaussian curvature minimization on a 3.4GHz processor.
3.6 Discussion

First of all, these results are not the best overall results by the standard of the state of the art, which is only achieved through the combination of various techniques in all aspects of stereo. This is not surprising, as we use the simplest data term, and we do not use any of the various techniques, such as the pixel classification into occluded and non-occluded pixels and the image segmentation before and after the matching; nor do we iterate the process based on such pre- and post-processing. Also, it would have been much faster and the solution much better if we had used the best optimization method available for each prior. The simple square difference without thresholding, f_d^2, can be exactly solved by graph cut[10]. The Potts energy f_d^3, as is now well known, can be efficiently optimized with a known error bound by α-expansion[5]; and recently an extension [27] of the algorithm made it possible to do the same for truncated convex priors like f_d^4 and f_d^5. However, this is expected and it is not the point of these experiments; rather, their purpose is to examine (i) whether or not the minimization of the Gaussian curvature modulus works at all, and (ii) if it does, how it compares to other priors, rather than to whole stereo algorithms. We conclude that (i) has been answered in the affirmative: it does seem to work in principle. We suspected that the excessive flexibility of developable surfaces could be a problem; but it seems that, at least in combination with some smoothing, it can work. As for (ii), quantitatively we say it is comparable to the best of the other priors, with the caveat that the result might be skewed by some interaction between the prior and the chosen optimization, as when some priors are easier to optimize by simulated annealing than others. Thus, it might still be the case that the Gaussian curvature minimization is not good after all. It is a problem common to optimization algorithms that are neither exact nor have known error bounds: we cannot know how close the results are to the optimum. From Fig. 4, it can be seen that the new prior model is comparable to or better than any of the conventional energy functions tested. It also is not as sensitive to the value of λ as the other priors. Qualitatively, the result by the Gaussian curvature minimization seems different, especially compared to the Potts model. It seems to preserve sharp depth boundaries better, as expected. Another consideration is that the relative success of this model may be simply due to the fact that the man-made objects in the test scenes exhibit mostly zero Gaussian curvature. While that remains to be determined experimentally, we point out that it might actually be an advantage if that is in fact the case. After all, developable surfaces like planes and cylinders are ubiquitous in an artificial environment, while minimal surfaces like soap films are not seen so much even in natural scenes.
4 Conclusion

In this paper, we have examined a novel prior model for stereo vision based on minimizing the total absolute Gaussian curvature of the disparity surface. It is motivated
by psychophysical experiments on human stereo vision. Intuitively, it can be thought of as rolling and bending a piece of very thin paper to fit to the stereo surface, whereas the conventional priors are more akin to spanning a soap film over a wire frame. The experiments show that the new prior model is comparable to or better than any of the conventional priors tested, when compared in an equal setting. The main drawback of the absolute Gaussian curvature minimization is that we do not yet have an optimization method as good as graph cuts or belief propagation that can optimize it efficiently. Obviously, the real measure of a prior crucially depends on how well it can actually be optimized. However, the experiments have been a necessary step before attempting to devise a new optimization technique. It may be difficult: as we mentioned in 2.2, re-triangulation of polyhedral surfaces to minimize the total absolute Gaussian curvature is NP-hard; and the prior is of second order. Still, the result in this paper at least gives us a motivation to pursue better optimization algorithms that can be used with this prior. Also, it might be fruitful to try exploiting the relationship between the convex hull and the Gaussian curvature mentioned in 2.3 in order to achieve the same goal. Using that method, we at least know there are definite solutions. The authors of [18] conclude their paper thus: "As can be seen, the global minimum of the energy function does not solve many of the problems in the BP or graph cuts solutions. This suggests that the problem is not in the optimization algorithm but rather in the energy function." At the least, our new prior is something significantly different from the other models that have been used for the past thirty years, as we discussed in some detail in 2.2 and also in [12]. Different, but it still works at least as well as the others, in a qualitatively different way. It may be a good starting point from which to begin rethinking the energy functions in stereo.

Acknowledgement. This work was partially supported by the Suzuki Foundation, the Research Foundation for the Electrotechnology of Chubu, the Inamori Foundation, the Hori Information Science Promotion Foundation, and the Grant-in-Aid for Exploratory Research 19650065 from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
References 1. Alboul, L., van Damme, R.: Polyhedral metrics in surface reconstruction: Tight triangulations. In: The Mathematics of Surfaces VII, pp. 309–336. Clarendon Press, Oxford (1997) 2. Belhumeur, P.N.: A Bayesian Approach to Binocular Stereopsis. Int. J. Comput. Vision 19, 237–262 (1996) 3. Blake, A., Zisserman, A.: Visual reconstruction. MIT Press, Cambridge, MA (1987) 4. Birchfield, S., Tomasi, C.: Multiway cut for stereo and motion with slanted surfaces. In: ICCV 1999, vol. I, pp. 489–495 (1999) 5. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximate Energy Minimization via Graph Cuts. IEEE Trans. on Patt. Anal. Machine Intell. 23, 1222–1239 (2001) 6. Buchin, M., Giesen, J.: Minimizing the Total Absolute Gaussian Curvature in a Terrain is Hard. In: the 17th Canadian Conference on Computational Geometry, pp. 192–195 (2005) 7. Faugeras, O.: Three-Dimensional Computer Vision. MIT Press, Cambridge, MA (1993) 8. Felzenszwalb, P., Huttenlocher, D.: Efficient Belief Propagation for Early Vision. In: CVPR 2004, pp. 261–268 (2004)
9. Grimson, W.E.: From Images to Surfaces. MIT Press, Cambridge, MA (1981) 10. Ishikawa, H.: Exact Optimization for Markov Random Fields with Convex Priors. IEEE Trans. on Patt. Anal. Machine Intell. 25(10), 1333–1336 (2003) 11. Ishikawa, H., Geiger, D.: Occlusions, Discontinuities, and Epipolar Lines in Stereo. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 232–248. Springer, Heidelberg (1998) 12. Ishikawa, H., Geiger, D.: Rethinking the Prior Model for Stereo. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 526–537. Springer, Heidelberg (2006) 13. Ishikawa, H., Geiger, D.: Illusory Volumes in Human Stereo Perception. Vision Research 46(1-2), 171–178 (2006) 14. Jones, J., Malik, J.: Computational Framework for Determining Stereo Correspondence from a set of linear spatial filters. Image Vision Comput. 10, 699–708 (1992) 15. Kanizsa, G.: Organization in Vision. Praeger, New York (1979) 16. Kolmogorov, V., Zabih, R.: Computing Visual Correspondence with Occlusions via Graph Cuts. In: ICCV 2001, pp. 508–515 (2001) 17. Marr, D., Poggio, T.: Cooperative Computation of Stereo Disparity. Science 194, 283–287 (1976) 18. Meltzer, T., Yanover, C., Weiss, Y.: Globally Optimal Solutions for Energy Minimization in Stereo Vision Using Reweighted Belief Propagation. In: ICCV 2005, pp. 428–435 (2005) 19. Ogale, A.S., Aloimonos, Y.: Stereo correspondence with slanted surfaces: critical implications of horizontal slant. In: CVPR 2004, vol. I, pp. 568–573 (2004) 20. Roy, S.: Stereo Without Epipolar Lines: A Maximum-flow Formulation. Int. J. Comput. Vision 34, 147–162 (1999) 21. Roy, S., Cox, I.: A Maximum-flow Formulation of the N-camera Stereo Correspondence Problem. In: ICCV 1998, pp. 492–499 (1998) 22. Scharstein, D., Szeliski, R.: A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Computer Vision 47, 7–42 (2002) 23. Szeliski, R.: A Bayesian Modelling of Uncertainty in Low-level Vision. Kluwer Academic Publishers, Boston, MA (1989) 24. Sun, J., Zhen, N.N, Shum, H.Y.: Stereo Matching Using Belief Propagation. IEEE Trans. on Patt. Anal. Machine Intell. 25, 787–800 (2003) 25. Tappen, M.F., Freeman, W.T.: Comparison of Graph Cuts with Belief Propagation for Stereo, using Identical MRF Parameters. In: ICCV 2003, vol. II, pp. 900–907 (2003) 26. Terzopoulos, D.: Regularization of Inverse Visual Problems Involving Discontinuities. IEEE Trans. on Patt. Anal. Machine Intell. 8, 413–424 (1986) 27. Veksler, O.: Graph Cut Based Optimization for MRFs with Truncated Convex Priors. In: CVPR 2007 (2007)
Fast Optimal Three View Triangulation
Martin Byröd, Klas Josephson, and Kalle Åström
Center for Mathematical Sciences, Lund University, Lund, Sweden
{byrod, klasj, kalle}@maths.lth.se
Abstract. We consider the problem of L2-optimal triangulation from three separate views. Triangulation is an important part of numerous computer vision systems. Under Gaussian noise, minimizing the L2 norm of the reprojection error gives a statistically optimal estimate. This has been solved for two views. However, for three or more views, it is not clear how this should be done. A previously proposed, but computationally impractical, method draws on Gröbner basis techniques to solve for the complete set of stationary points of the cost function. We show how this method can be modified to become significantly more stable and hence be given a fast implementation in standard IEEE double precision. We evaluate the precision and speed of the new method on both synthetic and real data. The algorithm has been implemented in a freely available software package which can be downloaded from the Internet.
1 Introduction
Triangulation, referring to the act of reconstructing the 3D location of a point given its images in two or more known views, is a fundamental part of numerous computer vision systems. Albeit conceptually simple, this problem is not completely solved in the general case of n views and noisy measurements. There exist fast and relatively robust methods based on linear least squares [1]. These methods are, however, sub-optimal. Moreover, the linear least squares formulation does not have a clear geometrical meaning, which means that in unfortunate situations this approach can yield very poor accuracy. The most desirable, but non-linear, approach is instead to minimize the L2 norm of the reprojection error, i.e. the sum of squares of the reprojection errors. The reason for this is that the L2 optimum yields the maximum likelihood estimate for the 3D point under the assumption of independent Gaussian noise on the image measurements [2]. This problem has been given a closed form solution¹ by Hartley and Sturm in the case of two views [2]. However, the approach of Hartley and Sturm is not straightforward to generalize to more than two views.
¹ The solution is actually not entirely in closed form, since it involves the solution of a sixth degree polynomial, which cannot in general be solved in closed form. Therefore one has to go by e.g. the eigenvalues of the companion matrix, which implies an iterative process.
In the case of n views, the standard method when high accuracy is needed is to use a two-phase strategy where an iterative scheme for non-linear least squares such as Levenberg-Marquardt (Bundle Adjustment) is initialised with a linear method [3]. This procedure is reasonably fast and in general yields excellent results. One potential drawback, however, is that the method is inherently local, i.e. it finds local minima with no guarantee of being close to the global optimum. An interesting alternative is to replace the L2 norm with the L∞ norm, cf. [4]. This way it is possible to obtain a provably optimal solution with a geometrically sound cost function in a relatively efficient way. The drawback is that the L∞ norm is suboptimal under Gaussian noise and it is less robust to noise and outliers than the L2 norm. The most practical existing method for L2 optimization with an optimality guarantee is to use a branch and bound approach as introduced in [5], which, however, is a computationally expensive strategy. In this paper, we propose to solve the problem of L2 optimal triangulation from three views using a method introduced by Stewenius et al. in [6], where the optimum was found by explicit computation of the complete set of stationary points of the likelihood function. This approach is similar to that of Hartley and Sturm [2]. However, whereas the stationary points in the two view case can be found by solving a sixth degree polynomial in one variable, the easiest known formulation of the three view case involves solving a system of three sixth degree equations in three unknowns with 47 solutions. Thus, we have to resort to more sophisticated techniques to tackle this problem. Stewenius et al. used algebraic geometry and Gröbner basis techniques to analyse and solve the equation system. However, Gröbner basis calculations are known to be numerically challenging and they were forced to use emulated 128 bit precision arithmetic to get a stable implementation, which rendered their solution too slow to be of any practical value. In this paper we develop the Gröbner basis approach further to improve the numerical stability. We show how computing the zeros of a relaxed ideal, i.e. a smaller ideal (implying a possibly larger solution set/variety), can be used to solve the original problem to a greater accuracy. Using this technique, we are able to give the Gröbner basis method a fast implementation using standard IEEE double precision. By this we also show that global optimization by calculation of stationary points is indeed a feasible approach and that Gröbner bases provide a powerful tool in this pursuit.
Our main contributions are:
– A modified version of the Gröbner basis method for solving polynomial equation systems, here referred to as the relaxed ideal method, which trades some speed for a significant increase in numerical stability.
– An efficient C++ implementation of this method applied to the problem of three view triangulation.
The source code for the methods described in this paper is freely available for download from the Internet [7].
2 Three View Triangulation
The main motivation for triangulation from more than two views is to use the additional information to improve accuracy. In this section we briefly outline the approach we take and derive the equations to be used in the following sections. This part is essentially identical to that used in [6]. We assume a linear pin-hole camera model, i.e. projection in homogeneous coordinates is done according to λ_i x_i = P_i X, where P_i is the 3 × 4 camera matrix for view i, x_i is the image coordinates, λ_i is the depth and X is the 3D coordinates of the world point to be determined. In standard coordinates, this can be written as

x_i = \frac{1}{P_i^3 X} \begin{pmatrix} P_i^1 X \\ P_i^2 X \end{pmatrix},    (1)

where e.g. P_i^3 refers to row 3 of camera i. As mentioned previously, we aim at minimizing the L2 norm of the reprojection errors. Since we are free to choose the coordinate system in the images, we place the three image points at the origin in their respective image coordinate systems. With this choice of coordinates, we obtain the following cost function to minimize over X:

\varphi(X) = \frac{(P_1^1 X)^2 + (P_1^2 X)^2}{(P_1^3 X)^2} + \frac{(P_2^1 X)^2 + (P_2^2 X)^2}{(P_2^3 X)^2} + \frac{(P_3^1 X)^2 + (P_3^2 X)^2}{(P_3^3 X)^2}.    (2)
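A hedged sketch of the cost ϕ(X) of Eq. (2), assuming the image points have already been translated to the origin of each image coordinate system; the function name and matrix layout are our assumptions.

import numpy as np

def reprojection_cost(X, cameras):
    # X: 3-vector of world coordinates; cameras: three 3x4 matrices P_i.
    Xh = np.append(np.asarray(X, dtype=float), 1.0)
    cost = 0.0
    for P in cameras:
        cost += ((P[0] @ Xh) ** 2 + (P[1] @ Xh) ** 2) / (P[2] @ Xh) ** 2
    return cost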
The approach we take is based on calculating the complete set of stationary points of ϕ(X), i.e. solving ∇ϕ(X) = 0. By inspection of (2) we see that ∇ϕ(X) will be a sum of rational functions. The explicit derivatives can easily be calculated, but we refrain from writing them out here. Differentiating and multiplying through with the denominators produces three sixth degree polynomial equations in the three unknowns of X = [X_1, X_2, X_3]. To simplify the equations we also make a change of world coordinates, setting the last rows of the respective cameras to

P_1^3 = [1 0 0 0],   P_2^3 = [0 1 0 0],   P_3^3 = [0 0 1 0].    (3)
Since we multiply with the denominator we introduce new stationary points in our equations corresponding to one of the denominators in (2) being equal to zero. This happens precisely when X coincides with the plane through one of the focal points parallel to the corresponding image plane. Such points have infinite/undefined value of ϕ(X) and can therefore safely be removed. To summarise, we now have three sixth degree equations in three unknowns. The remainder of the theoretical part of the paper will be devoted to the problem of solving these.
3 Using Gröbner Bases to Solve Polynomial Equations
In this section we give an outline of how Gröbner basis techniques can be used for solving systems of multivariate polynomial equations. Gröbner bases are a
concept within algebraic geometry, which is the general theory of multivariate polynomials over any field. Naturally, we are only interested in real solutions, but since algebraic closedness is important to the approach we take, we seek solutions in C and then ignore any complex solutions we obtain. See e.g. [8] for a good introduction to algebraic geometry. Our goal is to find the set of solutions to a system f_1(x) = 0, ..., f_m(x) = 0 of m polynomial equations in n variables. The polynomials f_1, ..., f_m generate an ideal I in C[x], the ring of multivariate polynomials in x = (x_1, ..., x_n) over the field of complex numbers. To find the roots of this system we study the quotient ring C[x]/I of polynomials modulo I. If the system of equations has r roots, then C[x]/I is a linear vector space of dimension r. In this ring, multiplication with x_k is a linear mapping. The matrix m_{x_k} representing this mapping (in some basis) is referred to as the action matrix and is a generalization of the companion matrix for one-variable polynomials. From algebraic geometry it is known that the zeros of the equation system can be obtained from the eigenvectors/eigenvalues of the action matrix, just as the eigenvectors/eigenvalues of the companion matrix yield the zeros of a one-variable polynomial [9]. The solutions can be extracted from the eigenvalue decomposition in a few different ways, but the easiest is perhaps to use the fact that the vector of monomials spanning C[x]/I, evaluated at a zero of I, is an eigenvector of m_{x_k}^T. An alternative is to use the eigenvalues of m_{x_k}^T, which correspond to the values of x_k at the zeros of I. C[x]/I is a set of equivalence classes, and to perform calculations in this space we need to pick representatives for the equivalence classes. A Gröbner basis G for I is a special set of generators for I with the property that it lets us compute a well defined, unique representative for each equivalence class. Our main focus is therefore on how to compute this Gröbner basis in an efficient and reliable way.
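The extraction of solutions from the action matrix can be sketched as follows (our own illustration, not the paper's implementation); which basis monomials correspond to the constant 1 and to the variables x_1, ..., x_n is an assumption passed in as indices.

import numpy as np

def solutions_from_action_matrix(M, var_indices, const_index):
    # Eigenvectors of M^T are (up to scale) the monomial basis vector of
    # C[x]/I evaluated at the zeros of I; normalize by the entry of the
    # constant monomial and read off the variable entries.
    _, V = np.linalg.eig(M.T)
    sols = []
    for i in range(V.shape[1]):
        v = V[:, i] / V[const_index, i]
        sols.append([v[j] for j in var_indices])
    return np.array(sols)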
4 Numerical Gröbner Basis Computation
There is a general method for constructing a Gröbner basis known as Buchberger's algorithm [9]. It is a generalization of the Euclidean algorithm for computing the greatest common divisor and of Gaussian elimination. The general idea is to arrange all monomials according to some ordering and then successively eliminate leading monomials from the equations in a fashion similar to how Gaussian elimination works. This is done by selecting polynomials pair-wise and multiplying them by suitable monomials so as to eliminate the least common multiple of their respective leading monomials. The algorithm stops when any new element from I reduces to zero upon multivariate polynomial division with the elements of G. Buchberger's algorithm works perfectly under exact arithmetic. However, in floating point arithmetic it becomes extremely difficult to use due to accumulating round-off errors. In Buchberger's algorithm, adding equations and eliminating are completely interleaved. We aim for a process where we first add all equations
we will need and then do the full elimination in one go, in the spirit of the f4 algorithm [10]. This allows us to use methods from numerical linear algebra, such as pivoting strategies and QR factorization, to circumvent (some of) the numerical difficulties. This approach is made possible by first studying a particular problem using exact arithmetic² to determine the number of solutions and what total degree we need to go to. Using this information, we hand craft a set of monomials which we multiply our original equations with to generate new equations. We stack the coefficients of our expanded set of equations in a matrix C and write our equations as

C\varphi = 0,    (4)

where ϕ is a vector of monomials. Putting C on reduced row echelon form then gives us the reduced minimal Gröbner basis we need. In the next section we go into the details of constructing a Gröbner basis for the three view triangulation problem.
5 Constructing a Gröbner Basis for the Three View Triangulation Problem
As detailed in Section 2, we optimize the L2 cost function by calculation of the stationary points. This yields three sixth degree polynomial equations in X = [X_1, X_2, X_3]. In addition to this, we add a fourth equation by taking the sum of our three original equations. This cancels out the leading terms, producing a fifth degree equation which will be useful in the subsequent calculations [6]. These equations generate an ideal I in C[X]. The purpose of this section is to give the details of how a Gröbner basis for I can be constructed. First, however, we need to deal with the problem where one or more of X_i = 0. When this happens, we get a parametric solution to our equations. As mentioned in Section 2, this corresponds to the extra stationary points introduced by multiplying up denominators, and these points have infinite/undefined value of the cost function ϕ(X). Hence, we would like to exclude solutions with any X_i = 0 or equivalently X_1 X_2 X_3 = 0. The algebraic geometry way of doing this is to calculate the saturation sat(I, X_1 X_2 X_3) of I w.r.t. X_1 X_2 X_3, consisting of all polynomials f(X) s.t. (X_1 X_2 X_3)^k · f ∈ I for some k. Computationally it is easier to calculate sat(I, X_i) for one variable at a time and then join the results. This removes the same problematic parametric family of solutions, but with the side effect of producing some extra (finite) solutions with X_i = 0. These do not present any serious difficulties though, since they can easily be detected and filtered out. Consider one of the variables, say X_1. The ideal sat(I, X_1) is calculated in three steps. We order the monomials according to X_1, but take the monomial with the highest power of X_1 to be the smallest, e.g. X_1 X_2^2 X_3 ≥ X_1^2 X_2^2 X_3. With the monomials ordered this way, we perform a few steps of the Gröbner basis
² Usually with the aid of some algebraic geometry software such as Macaulay 2 [11].
calculation, yielding a set of generators where the last elements can be divided by powers of X1. We add these new equations, which are "stripped" of powers of X1, to I. More concretely, we multiply the equations by all monomials creating equations up to degree seven. After the elimination step, two equations are divisible by X1 and one is divisible by X1^2. The saturation process is performed analogously for X2 and X3, producing the saturated ideal Isat, from which we extract our solutions. The final step is to calculate a Gröbner basis for Isat, at this point generated by a set of nine fifth and sixth degree equations. To be able to do this we multiply with monomials creating 225 equations in 209 different monomials of total degree up to nine (refer to [6] for more details on the saturation and expansion process outlined above). The last step thus consists of putting the 225 by 209 matrix C on reduced row echelon form. This last part turns out to be a delicate task though, due to generally very poor conditioning. In fact, the conditioning is often so poor that round-off errors on the order of magnitude of machine epsilon (approximately 10^-16 for doubles) yield errors as large as 10^2 or more in the final result. This is the reason one had to resort to emulated 128 bit numerics in [6]. In the next section, we propose a strategy for dealing with this problem which drastically improves numerical precision, allowing us to use standard IEEE double precision.
6 The Relaxed Ideal Method
After the saturation step, we have a set of equations which "tightly" describe the set of solutions and nothing more. It turns out that by relaxing the constraints somewhat, possibly allowing some extra spurious solutions to enter the equations, we get a significantly better conditioned problem. We thus aim at selecting a subset of the 225 equations. This choice is not unique, but a natural subset to use is the 55 equations with all possible 9th degree monomials as leading terms, since this is the smallest set of equations which directly gives us a Gröbner basis. We do this by QR factorization of the submatrix of C consisting of the 55 first columns, followed by multiplying the remaining columns with Q^T. After these steps we pick out the 55 first rows of the resulting matrix. These rows correspond to 55 equations forming the relaxed ideal Irel ⊂ I, which is a subset of the original ideal I. Forming the variety/solution set V of an ideal is an inclusion-reversing operation and hence we have V(I) ⊂ V(Irel), which means that we are guaranteed not to lose any solutions. Moreover, since all monomials of degree nine are represented in exactly one of our generators for Irel, by construction we have a Gröbner basis for Irel. The sets of eigenvalues computed from the action matrices for C[X]/I and C[X]/Irel respectively are shown in Fig. 1. The claim that the number of solutions is equal to the dimension of C[X]/I only holds if I is a radical ideal. Otherwise, the dimension is only an upper bound on the number of solutions [8]. Furthermore, as mentioned in Section 3, a necessary condition for a specific point to be a solution is that the vector of basis
Fig. 1. Eigenvalues of the action matrix using the standard method and the relaxed ideal method respectively, plotted in the complex number plane. The latter are a strict superset of the former.
monomials evaluated at that point is an eigenvector of the transposed action matrix. This condition is however not sufficient and there can be eigenvectors that do not correspond to zeros of the ideal. This will be the case if I is not a radical ideal. This can lead to false solutions, but does not present any serious problems since false solutions can easily be detected, e.g. by evaluation of the original equations. Since we have 55 leading monomials in the Gröbner basis, the 154 remaining monomials (of the 209 monomials in total) form a basis for C[X]/Irel. Since Irel was constructed from our original equations by multiplication with monomials and invertible row operations (by Q^T), we expect there to be no new actual solutions. This has been confirmed empirically. One can therefore say that starting out with a radical ideal I, we relax the radicality property and compute a Gröbner basis for a non-radical ideal but with the same set of solutions. This way we improve the conditioning of the elimination step involved in the Gröbner basis computation considerably. The price we have to pay for this is performing an eigenvalue decomposition on a larger action matrix.
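The QR-based construction of the 55 relaxed generators described above can be sketched in NumPy as follows (an illustration under the stated dimensions, with random data standing in for the actual coefficient matrix C; `relaxed_groebner_rows` is a hypothetical helper name):

```python
# Sketch of the "relaxed ideal" elimination (an illustration under assumed
# dimensions: 225 equations, 209 monomials, 55 leading monomials).
import numpy as np

def relaxed_groebner_rows(C, n_lead=55):
    """Return n_lead rows whose leading parts form an upper-triangular system.

    C      : (m, n) coefficient matrix, leading-monomial columns first.
    n_lead : number of leading monomials (the degree-nine monomials here).
    """
    Q, R = np.linalg.qr(C[:, :n_lead], mode="complete")   # C1 = Q R
    Ct = Q.T @ C                                           # rotate all columns
    G = Ct[:n_lead, :]                                     # 55 generator rows
    # Express each leading monomial in terms of the remaining (basis) monomials:
    # G = [R | B]  =>  leading monomials = -R^{-1} B applied to basis monomials.
    reduction = -np.linalg.solve(G[:, :n_lead], G[:, n_lead:])
    return G, reduction

# toy usage with random data of the stated size
C = np.random.randn(225, 209)
G, red = relaxed_groebner_rows(C)
print(G.shape, red.shape)   # (55, 209) (55, 154)
```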
7 Experimental Validation
The algorithm described in this paper has been implemented in C++ making use of optimized LAPACK and BLAS implementations [12], and the code is available for download from [7]. The purpose of this section is to evaluate the algorithm in terms of speed and numerical precision. We have run the algorithm on both real and synthetically generated data using a 2.0 GHz AMD Athlon X2 64 bit machine. With this setup, triangulation of one point takes approximately 60 milliseconds. This is to be contrasted with the previous implementation by
Stewénius et al. [6], which needs 30 seconds per triangulation with their setup. The branch and bound method of [5] is faster than [6], but exact running times for triangulation are not given in [5]. However, based on the performance of this algorithm on similar problems, the running time for three view triangulation is probably at least a couple of seconds using their method.
7.1 Synthetic Data
To evaluate the intrinsic numerical stability of our solver, the algorithm has been run on 50,000 randomly generated test cases. World points were drawn uniformly from the cube [−500, 500]^3 and cameras were placed randomly at a distance of around 1000 from the origin, with a focal length of around 1000 and pointing inwards. We compare our approach to that of [6] implemented in double precision, here referred to as the standard method since it is based on straightforward Gröbner basis calculation. A histogram over the resulting errors in estimated 3D location is shown in Fig. 2. As can be seen, computing solutions of the smaller ideal yields an end result with vastly improved numerical precision. The error is typically around a factor 10^5 smaller with the new method. Since we consider triangulation by minimization of the L2 norm of the error, ideally behaviour under noise should not be affected by the algorithm used. In the second experiment we assert that our algorithm behaves as expected under noise. We generate data as in the first experiment and apply Gaussian noise to the image measurements in 0.1 pixel intervals from 0 to 5 pixels. We triangulate 1000 points for each noise level. The median error in 3D location is plotted versus noise in Fig. 3. There is a linear relation between noise and error, which confirms that the algorithm is stable also in the presence of noise.
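For reference, a minimal NumPy sketch of this synthetic setup might look as follows; the camera parametrisation is our own approximation of the protocol, not the authors' code.

```python
# Sketch of the synthetic experiment: world points in a cube, cameras about
# 1000 units from the origin looking inwards, focal length ~1000, image noise.
import numpy as np

rng = np.random.default_rng(0)

def random_scene(n_points=1, noise_px=0.0, n_cams=3):
    X = rng.uniform(-500, 500, size=(n_points, 3))            # world points
    cams = []
    for _ in range(n_cams):
        c = rng.normal(size=3); c = 1000 * c / np.linalg.norm(c)  # camera centre
        z = -c / np.linalg.norm(c)                             # optical axis (inwards)
        x = np.cross(z, [0.0, 0.0, 1.0]); x /= np.linalg.norm(x)
        y = np.cross(z, x)
        R = np.stack([x, y, z])
        K = np.diag([1000.0, 1000.0, 1.0])                     # focal length ~1000
        P = K @ np.hstack([R, -R @ c[:, None]])
        cams.append(P)
    obs = []
    for P in cams:
        h = (P @ np.hstack([X, np.ones((n_points, 1))]).T).T
        uv = h[:, :2] / h[:, 2:]
        obs.append(uv + rng.normal(scale=noise_px, size=uv.shape))
    return X, cams, obs

X, cams, obs = random_scene(noise_px=0.5)
print(X.shape, len(cams), obs[0].shape)
```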
[Figure 2 plot: histogram of the log10 error in 3D placement (frequency vs. log10 error), for the relaxed ideal and the complete ideal.]
Fig. 2. Histogram over the error in 3D location of the estimated point X. As is evident from the graph, extracting solutions from the smaller ideal yields a final result with considerably smaller errors.
[Figure 3 plot: median of the error in 3D location vs. noise standard deviation (0–5 pixels).]
Fig. 3. Error in 3D location of the triangulated point X as a function of image-point noise. The behaviour under noise is as expected given the problem formulation.
Fig. 4. The Oxford dinosaur reconstructed from 2683 point triplets using the method described in this paper. The reconstruction was completed in approximately 2.5 minutes.
7.2 A Real Example
Finally, we evaluate the algorithm under real world conditions. The Oxford dinosaur [13] is a familiar image sequence of a toy dinosaur shot on a turntable. The image sequence consists of 36 images and 4983 point tracks. For each point visible in three or more views we select the first, middle and last view and triangulate using these. This yields a total of 2683 point triplets to triangulate from. The image sequence contains some erroneous tracks, which we deal with by removing any points reprojected with an error greater than two pixels in any frame. The whole sequence was processed in approximately 2.5 minutes and the resulting point cloud is shown in Fig. 4. We have also run the same sequence using the previous method implemented in double precision, but the errors were too large to yield usable results. Note that [6] contains a successful triangulation of the dinosaur sequence, but this was done
using extremely slow emulated 128 bit arithmetic yielding an estimated running time of 20h for the whole sequence.
8 Conclusions
In this paper we have shown how a typical problem from computer vision, triangulation, can be solved for the globally optimal L2 estimate using Gröbner basis techniques. With the introduced method of the relaxed ideal, we have taken this approach to a state where it can now have practical value in actual applications. In all fairness though, linear initialisation combined with bundle adjustment will probably remain the method of choice for most applications, since this is still significantly faster and gives excellent accuracy. However, if a guarantee of finding the provably optimal solution is desired, we provide a competitive method. More importantly perhaps, by this example we show that global optimisation by calculation of the stationary points using Gröbner basis techniques is indeed a possible way forward. This is particularly interesting since a large number of computer vision problems ultimately depend on some form of optimisation. Currently the limiting factor in many applications of Gröbner bases is numerical difficulties. Using the technique presented in this paper of computing the Gröbner basis of a smaller/relaxed ideal, we are able to improve the numerical precision by approximately a factor 10^5. We thus show that there is room for improvement on this point and there is certainly more to explore here. For instance, our choice of relaxation is somewhat arbitrary. Would it be possible to select more/other equations and get better results? If more equations can be kept with retained accuracy, this is certainly a gain, since it allows an eigenvalue decomposition of a smaller action matrix, and this operation in most cases has O(n^3) time complexity.
Acknowledgment This work has been funded by the Swedish Research Council through grant no. 2005-3230 ’Geometry of multi-camera systems’, grant no. 2004-4579 ’ImageBased Localisation and Recognition of Scenes’, SSF project VISCOS II and the European Commission’s Sixth Framework Programme under grant no. 011838 as part of the Integrated Project SMErobot.
References 1. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 2. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68, 146–157 (1997) 3. Triggs, W., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle adjustment: A modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) Vision Algorithms: Theory and Practice. LNCS, vol. 1883, Springer, Heidelberg (2000)
4. Kahl, F.: Multiple view geometry and the L∞-norm. In: International Conference on Computer Vision, Beijing, China, pp. 1002–1009 (2005) 5. Agarwal, S., Chandraker, M.K., Kahl, F., Kriegman, D.J., Belongie, S.: Practical global optimization for multiview geometry. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 592–605. Springer, Heidelberg (2006) 6. Stewénius, H., Schaffalitzky, F., Nistér, D.: How hard is three-view triangulation really? In: Proc. Int. Conf. on Computer Vision, Beijing, China, pp. 686–693 (2005) 7. Three view triangulation, http://www.maths.lth.se/~byrod/downloads.html 8. Cox, D., Little, J., O'Shea, D.: Ideals, Varieties, and Algorithms. Springer, Heidelberg (2007) 9. Cox, D., Little, J., O'Shea, D.: Using Algebraic Geometry. Springer, Heidelberg (1998) 10. Faugère, J.C.: A new efficient algorithm for computing Gröbner bases (F4). Journal of Pure and Applied Algebra 139(1-3), 61–88 (1999) 11. Grayson, D., Stillman, M.: Macaulay 2, http://www.math.uiuc.edu/Macaulay2 12. LAPACK – linear algebra package, http://www.netlib.org/lapack 13. Visual Geometry Group, University of Oxford, http://www.robots.ox.ac.uk/~vgg 14. Byröd, M., Josephson, K., Åström, K.: Improving numerical accuracy of Gröbner basis polynomial equation solvers. In: Proc. 11th Int. Conf. on Computer Vision, Rio de Janeiro, Brazil (2007)
Stereo Matching Using Population-Based MCMC

Joonyoung Park¹, Wonsik Kim², and Kyoung Mu Lee²

¹ DM Research Lab., LG Electronics Inc., 16 Woomyeon-Dong, Seocho-Gu, 137-724, Seoul, Korea
[email protected]
² School of EECS, ASRI, Seoul National University, 151-742, Seoul, Korea
{ultra16, kyoungmu}@snu.ac.kr

Abstract. In this paper, we propose a new stereo matching method using the population-based Markov Chain Monte Carlo (Pop-MCMC). Pop-MCMC belongs to the class of sampling-based methods. Since previous MCMC methods produce only one sample at a time, only local moves are available. However, since Pop-MCMC uses multiple chains and produces multiple samples at a time, it enables global moves by exchanging information between samples, which in turn leads to a faster mixing rate. From the viewpoint of optimization, it means that we can reach a state with lower energy. The experimental results on real stereo images demonstrate that the performance of the proposed algorithm is superior to those of previous algorithms.
1 Introduction
Stereo matching is one of the classical problems in computer vision [1]. The goal of stereo matching is to determine disparities, which are the distances between two corresponding pixels. If we get an accurate disparity map, we can recover 3-D scene information. However, it remains a challenging problem because of occluded regions, noise of the camera sensor, textureless regions, etc. Stereo matching algorithms can be classified into two approaches. One is the local approach, and the other is the global approach. In the local approach, disparities are determined by comparing the intensity values in local windows, using measures such as SAD (Sum of Absolute Differences), SSD (Sum of Squared Differences), and the Birchfield-Tomasi measure [2]. Although local approaches are fast, they have difficulties in obtaining an accurate disparity map. In the global approaches, one assumes that the disparity map is smooth in most regions. Usually, an energy function that is composed of local and global constraints is defined and solved by various energy minimization techniques. Typical global approaches include graph cuts [3,4,5], belief propagation [6], and dynamic programming [7,8]. The Monte Carlo method is one of the global approaches. It uses statistical sampling to obtain the solutions of some mathematical problems. Although this method was originally developed to generate samples from a given target distribution or to integrate functions in high dimensional space, it has also been applied to other types of problems such as optimization and learning problems.
However, there are some difficulties in applying Monte Carlo methods to vision problems as an optimizer. In general, we need to solve vision problems in a very high-dimensional solution space. Even if an image is assumed to be 100 pixels in width and height respectively, the dimension of the image space becomes as high as 10^4. Monte Carlo methods would take infinitely long since the acceptance rate would be almost zero in such a high-dimensional case. Moreover, we need to design a proper proposal distribution close to the target distribution. To resolve these problems, Markov Chain Monte Carlo (MCMC) methods have been tried. In MCMC, a new sample is drawn from the previous sample and the local transition probability, based on the Markov chain. Contrary to simple Monte Carlo methods, the acceptance rates of MCMC methods are high enough, and the proposal distributions are designable even in high-dimensional problems. Therefore, MCMC methods are more suitable for vision problems than plain Monte Carlo methods. However, difficulties still remain. Since most MCMC methods allow only local moves in a large solution space, it still takes a very long time to reach the global optimum. To overcome the limitations of MCMC methods as an optimizer, Swendsen-Wang Cuts (SWC) was proposed [9,10]. In SWC, it is shown that a bigger local move is possible than in previous methods while maintaining detailed balance. SWC uses Simulated Annealing (SA) [11] to find optima. Although SWC allows bigger local moves, a very slow annealing process is needed to approach the global optima with probability 1. This is a drawback for real applications. Therefore, usually a faster annealing process is applied for practical use in vision problems. However, fast annealing does not guarantee the global optima, and the sample is often trapped at local optima. In this paper, we propose a new MCMC method called Population-Based MCMC (Pop-MCMC) [12,13] for the stereo matching problem, trying to resolve the above problems of SWC. Our goal is to find more accurate global optima than SWC. In Pop-MCMC, two or more samples are drawn at the same time, and information exchange occurs between the samples. That makes it possible to perform global moves of samples. It means that the mixing rate of the drawn samples becomes faster and, from the viewpoint of optimization, that it takes a shorter time for the samples to approach the global optima than with previous methods. This paper describes how Pop-MCMC is designed for stereo matching, and how its performance compares with other methods such as SWC or Belief Propagation. In Section 2, we present how Pop-MCMC is applied to stereo matching. In Section 3, we show the experimental results on real problems. In the final section, we conclude this paper with discussions.
2 Proposed Algorithm
Segment-Based Stereo Energy Model. In order to improve the accuracy of the disparity map, various energy models have been proposed for the stereo problem. Among them, we choose the segment-based energy model, which is
one of the popular models [15,16,17,18]. In a segment-based energy model, the reference image is over-segmented. The mean-shift algorithm is often used for the segmentation [14]. We then assume that each segment can be approximated by a part of a plane in the real world. Each segment is defined as a node v ∈ V, and neighboring nodes s and t are connected with edges ⟨s, t⟩ ∈ E. Then we construct a graph G = (V, E), and the energy function is defined by

E(X) = \sum_{v \in V} C_{SEG}(f_v) + \sum_{\langle s,t \rangle \in N} \beta_{s,t} \, 1(f_s \neq f_t),    (1)

where X represents the current state of every segment, f_v is an estimated plane for each segment, C_{SEG}(f_v) is a matching cost, and \beta_{s,t} is a penalty for neighboring nodes s and t assigned to different planes, which are defined by

C_{SEG}(f_v) = \sum_{(x,y) \in v} C(x, y, f_v(x, y)),    (2)

\beta_{s,t} = \gamma \cdot (\text{mean color similarity}) \cdot (\text{border length}),    (3)

where the function C(x, y, f_v(x, y)) is the Birchfield-Tomasi cost. By varying \gamma, we can control the relative effect of the matching cost and the smoothness cost.
We first need to make a list of the planes to be assigned to the segments. For each segment, we calculate a new plane and add it to the list. The process to find a new plane is as follows. We represent a plane by the equation

d = c_1 x + c_2 y + c_3,    (4)

where x and y are the location of a pixel and d is the value of its disparity. From every pixel in a segment and the initially assigned disparity values, we can construct the algebraic equation

A [c_1, c_2, c_3]^T = B,    (5)

where the i-th row of the matrix A is [x_i, y_i, 1] and the i-th row of the matrix B is d_i. The values of c_1, c_2, c_3 can be obtained by the least squares method. Once we find the values of the parameters, we can identify outlier pixels based on them. The least squares fit is then repeated with the outliers excluded, to improve the accuracy of c_1, c_2, c_3. After obtaining the list of planes, we group the segments and calculate planes again in order to improve the accuracy of the planes. To this end, each segment is first assigned to the plane in the list with the lowest C_{SEG} value. Then we group the segments which are assigned to the same plane, and for each group the above plane fitting is repeated again. At last, we have the final list of the planes to use.
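A minimal sketch of this robust plane-fitting step is given below; the outlier threshold and the number of refit iterations are illustrative choices, not values from the paper.

```python
# Fit d = c1*x + c2*y + c3 to disparities, re-estimating after dropping outliers.
import numpy as np

def fit_disparity_plane(xy, d, n_iters=3, outlier_thresh=1.0):
    """xy: (N, 2) pixel coordinates, d: (N,) initial disparities."""
    keep = np.ones(len(d), dtype=bool)
    c = np.zeros(3)
    for _ in range(n_iters):
        A = np.column_stack([xy[keep, 0], xy[keep, 1], np.ones(keep.sum())])
        c, *_ = np.linalg.lstsq(A, d[keep], rcond=None)        # least squares
        resid = np.abs(np.column_stack([xy[:, 0], xy[:, 1], np.ones(len(d))]) @ c - d)
        keep = resid < outlier_thresh                           # drop outliers, refit
    return c

# toy usage: noisy samples from the plane d = 0.1x - 0.2y + 5 with a few outliers
rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(200, 2))
d = 0.1 * xy[:, 0] - 0.2 * xy[:, 1] + 5 + rng.normal(scale=0.1, size=200)
d[:10] += 20.0
print(fit_disparity_plane(xy, d))          # approximately [0.1, -0.2, 5.0]
```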
U 0.001 else x_i = 0. Black areas in Fig. 4 are inhomogeneous regions. An example of strokes placed in the background following the above guideline is shown in Fig. 5(a), with which obtained memberships are illustrated in Fig. 5(b). Other examples are shown in Fig. 6, where strokes are illustrated on the left and binarized x_i^{10} are shown on the right. Fig. 6(c) and (e) were experimented with other existing methods [3,4]. Memberships obtained with strokes shown in Fig. 6 are illustrated in Fig. 7, where the left are initial values and the right are converged ones. We can get mattes by using our method with strokes simpler than other methods, where they must be drawn in both objects and backgrounds.
5 Composition with Another Image
Colors of objects must be estimated at each pixel for them to be composited with another image as a new background. The color ci in the input image is calculated
(a) yi1
(b) ui2
(c) trimap
Fig. 9. Initial values of memberships in the second frame
Fig. 10. Four frames in example video
by blending object colors c_{fi} and background colors c_{bi} with proportion x_i: c_i = x_i c_{fi} + (1 − x_i) c_{bi}. In the existing matting methods, both c_{fi} and c_{bi} are estimated by using this relation; however, this is laborious and wasteful because only c_{fi} is used for composition. Here we estimate only c_{fi} by

\min_{c_{fi}} \sum_{j \in W_i} s_{ij} x_j \| c_{fi} - c_j \|^2,    (6)

which is solved explicitly as

c_{fi} = \sum_{j \in W_i} s_{ij} x_j c_j \Big/ \sum_{j \in W_i} s_{ij} x_j,    (7)

with which the composite color is given by

\tilde{c}_i = x_i c_{fi} + (1 - x_i) b_i,    (8)

where b_i is the color of pixel i in a new background image. Examples of composite images are shown in Fig. 8.
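A direct, unoptimised sketch of Eqs. (7)-(8) could look as follows; the window radius and the similarity weights s_ij are placeholders supplied by the caller, not values from the paper.

```python
# Estimate the object colour as a weighted average of neighbouring colours
# (Eq. 7) and blend it with a new background (Eq. 8).
import numpy as np

def composite(image, alpha, similarity, new_bg, radius=2, eps=1e-8):
    """image, new_bg: (H, W, 3); alpha: (H, W) memberships x_i;
       similarity(i0, j0, i1, j1) -> weight s_ij between two pixels."""
    H, W, _ = image.shape
    out = np.empty_like(image, dtype=float)
    for i in range(H):
        for j in range(W):
            num = np.zeros(3); den = eps
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    a, b = i + di, j + dj
                    if 0 <= a < H and 0 <= b < W:
                        w = similarity(i, j, a, b) * alpha[a, b]
                        num += w * image[a, b]
                        den += w
            c_f = num / den                                    # Eq. (7)
            out[i, j] = alpha[i, j] * c_f + (1 - alpha[i, j]) * new_bg[i, j]  # Eq. (8)
    return out

# toy usage with a Gaussian colour-similarity weight (our choice, not the paper's)
rng = np.random.default_rng(0)
img = rng.random((8, 8, 3)); bg = rng.random((8, 8, 3)); a = rng.random((8, 8))
sim = lambda i, j, k, l: np.exp(-np.sum((img[i, j] - img[k, l]) ** 2) / 0.1)
print(composite(img, a, sim, bg).shape)
```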
6 Extension to Videos
This method for still images can be easily extended to videos. We draw strokes only in the first frame, from which memberships are propagated to successive frames. We first compute the projection vector q by using LDA at the first frame and project the colors of every pixel in all frames to f_ik = q^T c_ik, where k denotes the frame number. We then compute the memberships x_i1 at the first frame and propagate them to the second frame memberships x_i2, which are propagated to the third
Fig. 11. Memberships
Fig. 12. Composite video
frame and so on. In order to speed up the propagation, we construct and use trimaps for the second and subsequent frames as follows: (1) Discretize x_i1 into three levels as y_i1 = 1 if x_i1 > 0.99, y_i1 = 0 if x_i1 < 0.01, else y_i1 = 0.5. (2) Construct a binary map as u_i2 = 1 if |f_i1 − f_i2| > 2/3, else u_i2 = 0. (3) Set initial values as x_i2^{(0)} = y_i1 if u_i2 = 0, otherwise x_i2^{(0)} = 0.5. This x_i2^{(0)} gives a trimap for the second frame and is used as an initial value for x_i2. Fig. 9(a) illustrates y_i1 for the video in Fig. 10, which was used in the experiments of [3]. Fig. 9(b) shows u_i2, where white regions depict pixels whose colors vary due to motion. Fig. 9(c) illustrates the trimap constructed from Fig. 9(a) and Fig. 9(b). Memberships are recomputed only in the gray areas of Fig. 9(c). This propagation scheme is simple, without explicit estimation of object and camera motions. Memberships obtained for each frame are shown in Fig. 11, of which the composition with another video is shown in Fig. 12.
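The three-step trimap construction can be sketched as follows (a minimal illustration; `propagate_trimap` is a hypothetical helper operating on the LDA-projected colours f of two frames):

```python
# Three-step trimap construction for frame 2, using the thresholds given above.
import numpy as np

def propagate_trimap(x1, f1, f2):
    """x1: converged memberships of frame 1; f1, f2: projected colours (H, W)."""
    y1 = np.full_like(x1, 0.5)
    y1[x1 > 0.99] = 1.0                     # step (1): three-level discretisation
    y1[x1 < 0.01] = 0.0
    u2 = np.abs(f1 - f2) > 2.0 / 3.0        # step (2): pixels whose colour changed
    x2_init = np.where(u2, 0.5, y1)         # step (3): initial memberships of frame 2
    return x2_init

# toy usage on random data
rng = np.random.default_rng(0)
x1 = rng.random((4, 4)); f1 = rng.random((4, 4)); f2 = f1 + rng.normal(0, 0.5, (4, 4))
print(propagate_trimap(x1, f1, f2))
```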
7 Color Adjustment of Objects
In most matting methods, extracted objects are directly pasted onto another image as in eq. (8). Such direct matting, however, often gives unnatural images because the color or direction of illumination is different in the original image and in the new image. Since their precise estimation is hard in general, we resort here to a conventional technique for adjusting object colors before composition, by using an eigencolor mapping method [12] with an extension for the present task.
7.1 Eigencolor Mapping
Let us consider a case where the color of a reference image is transferred to another target image. Let the colors of the target image and of the reference image be c_{1i} and
(a) ε = 0    (b) ε = 0.2    (c) ε = 0.4
Fig. 13. Composition of object with adjusted colors
Fig. 14. Composite video with color adjustment
c_{2i}, respectively. We compute the covariance matrices S_1 = \sum_i (c_{1i} - \bar{c}_1)(c_{1i} - \bar{c}_1)^T / \sum_i 1 and S_2 = \sum_i (c_{2i} - \bar{c}_2)(c_{2i} - \bar{c}_2)^T / \sum_i 1, where \bar{c}_1 = \sum_i c_{1i} / \sum_i 1 and \bar{c}_2 = \sum_i c_{2i} / \sum_i 1, and eigen-decompose them as S_1 = U_1 D_1 U_1^T, S_2 = U_2 D_2 U_2^T. By using these basis matrices, the color c_{1i} of the target image is changed to

c'_{1i} = \bar{c}_2 + U_2 D_2^{1/2} D_1^{-1/2} U_1^T (c_{1i} - \bar{c}_1).    (9)

7.2 Object Composition with Its Color Adjustment
If we use this recoloring method for object composition, the color of the objects is changed completely to that of the new background and we cannot get the desired composition. Natural composition needs partial blending of the color of the new background into the object color, to reproduce the ambient illumination of the object by the background. Let the color of an object and that of the new background be c_{fi} and c_{bi} as in Section 5. We first compute a weighted covariance matrix of object colors as S_f = \sum_i x_i (c_{fi} - \bar{c}_f)(c_{fi} - \bar{c}_f)^T / \sum_i x_i, where \bar{c}_f = \sum_i x_i c_{fi} / \sum_i x_i is a weighted mean of the object colors. We then eigen-decompose S_f into U_f D_f U_f^T. We next compute the average \bar{c}_b and covariance matrix S_b of the new background colors and shift them to \bar{c}_{bm} = (1 - ε)\bar{c}_f + ε \bar{c}_b and S_{bm} = (1 - ε)S_f + ε S_b, where ε ∈ [0, 1] denotes the strength of the ambient illumination. We eigen-decompose S_{bm} into U_{bm} D_{bm} U_{bm}^T and change the color c_{fi} of the object to

c'_{fi} = \bar{c}_{bm} + U_{bm} D_{bm}^{1/2} D_f^{-1/2} U_f^T (c_{fi} - \bar{c}_f),    (10)

which is finally composited with the new background as x_i c'_{fi} + (1 - x_i) c_{bi}, which is the color of an image with the object of adjusted color. Examples of composite images are shown in Fig. 13.
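A compact sketch of this adjustment (Eq. (10)) is given below, with ε as reconstructed above standing for the ambient-illumination strength; the small regularisation constants are our own safeguards, not part of the method description.

```python
# Weighted eigencolor adjustment of object colours towards the new background.
import numpy as np

def adjust_object_colors(cf, x, cb_new, eps=0.2):
    """cf: (N, 3) object colours, x: (N,) memberships, cb_new: (M, 3) colours of
    the new background, eps: ambient illumination strength in [0, 1]."""
    w = x / (x.sum() + 1e-12)
    mean_f = w @ cf
    Sf = (cf - mean_f).T @ (w[:, None] * (cf - mean_f))      # weighted covariance
    mean_b = cb_new.mean(axis=0)
    Sb = np.cov(cb_new, rowvar=False)
    mean_bm = (1 - eps) * mean_f + eps * mean_b              # shifted statistics
    Sbm = (1 - eps) * Sf + eps * Sb
    Df, Uf = np.linalg.eigh(Sf)
    Dbm, Ubm = np.linalg.eigh(Sbm)
    A = Ubm @ np.diag(np.sqrt(np.maximum(Dbm, 0))) \
            @ np.diag(1.0 / np.sqrt(np.maximum(Df, 1e-12))) @ Uf.T
    return mean_bm + (cf - mean_f) @ A.T                     # Eq. (10)

rng = np.random.default_rng(0)
cf = rng.random((500, 3)); x = rng.random(500); cb = rng.random((300, 3))
print(adjust_object_colors(cf, x, cb).shape)                 # (500, 3)
```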
Adjustment of the object colors is especially necessary for video matting, where the brightness or illumination color often varies with time. In such cases, the adjustment of the object color must be varied in accordance with the temporal variation in the background color. An example is shown in Fig. 14, where the new background is reddish and becomes darker with time.
8 Conclusion
We have presented a guiding scheme for the placement of strokes in a semi-supervised matting method for natural images and videos. We have also developed a composition method that adjusts object colors to incorporate ambient illumination from the new background. Some features of our method are summarized as follows:
(1) Membership propagation over holes or gaps owing to broad windows.
(2) Strokes are sufficient to be drawn in either object areas or backgrounds.
(3) Facilitation of object extraction by projection of colors with LDA.
(4) Effective initial values for membership propagation.
(5) Simple guidance for placement of strokes.
(6) Fast composition of object with new background.
(7) Effective membership propagation from frame to frame in video matting.
(8) Adaptive adjustment of object color for natural composition.
References 1. Ruzon, M., Tomasi, C.: Alpha estimation in natural images. Proc. CVPR, 18–25 (2000) 2. Chuang, Y.Y., Curless, B., Salesin, D., Szeliski, R.: A Bayesian approach to digital matting. CVPR, 264–271 (2001) 3. Wang, J., Cohen, M.F.: An iterative optimization approach for unified image segmentation and matting. In: ICCV, pp. 936–943 (2005) 4. Levin, A., Lischinski, D., Weiss, Y.: A closed form solution to natural image matting. CVPR, 61–68 (2006) 5. Rother, C., Kolmogorov, V., Blake, A.: Grabcut: interactive foreground extraction using iterated graph cuts. In: SIGGRAPH, pp. 309–314 (2004) 6. Grady, L., Schiwietz, T., Aharon, S.: Random walks for interactive alpha-matting. In: VIIP, pp. 423–429 (2005) 7. Vezhnevets, V., Konouchine, V.: GrowCut: Interactive multi-label N-D image segmentation by cellular automata. Graphicon (2005) 8. Perona, P., Freeman, W.T.: A factorization approach to grouping. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 655–670. Springer, Heidelberg (1998) 9. Inoue, K., Urahama, K.: Sequential fuzzy cluster extraction by a graph spectral method. Patt. Recog. Lett. 20, 699–705 (1999) 10. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: NIPS, pp. 849–856 (2001)
600
W. Du and K. Urahama
11. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. PAMI 22, 888–905 (2000) 12. Jing, L., Urahama, K.: Image recoloring by eigencolor mapping. In: IWAIT, pp. 375–380 (2006) 13. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Trans. Comput. Graph. Appl. 21, 34–41 (2001)
Temporal Priors for Novel Video Synthesis

Ali Shahrokni, Oliver Woodford, and Ian Reid

Robotics Research Laboratory, University of Oxford, Oxford, UK
http://www.robots.ox.ac.uk/
Abstract. In this paper we propose a method to construct a virtual sequence for a camera moving through a static environment given an input sequence from a different camera trajectory. Existing image-based rendering techniques can generate photorealistic images given a set of input views, though the output images almost unavoidably contain small regions where the colour has been incorrectly chosen. In a single image these artifacts are often hard to spot, but become more obvious when viewing a real image with its virtual stereo pair, and even more so when a sequence of novel views is generated, since the artifacts are rarely temporally consistent. To address this problem of consistency, we propose a new spatiotemporal approach to novel video synthesis. The pixels in the output video sequence are modelled as nodes of a 3–D graph. We define an MRF on the graph which encodes photoconsistency of pixels as well as texture priors in both space and time. Unlike methods based on scene geometry which yield highly connected graphs, our approach results in a graph whose degree is independent of scene structure. The MRF energy is therefore tractable and we solve it for the whole sequence using a state-of-the-art message passing optimisation algorithm. We demonstrate the effectiveness of our approach in reducing temporal artifacts.
1 Introduction
This paper addresses the problem of reconstruction of a video sequence from an arbitrary sequence of viewpoints given an input video sequence. In particular, we focus on the reconstruction of a stereoscopic pair of a given input sequence captured by a moving camera through a static environment. This has application to the generation of 3-D content from commonly available monocular movies and videos for use with advanced 3-D displays. Existing image-based rendering techniques can generate photorealistic images given a set of input views. Though the best results apparently have remarkable fidelity, closer inspection almost invariably reveals pixels or regions where incorrect colours have been rendered, as illustrated in Fig. 1. These are often, but not always, associated with occlusion boundaries, and while they are often hard to see in a single image, they become very obvious when a sequence of novel views is generated, since the artifacts are rarely spatio-temporally consistent. We propose to solve the problem via a Markov Random Field energy minimisation over
a video sequence with the aim of preserving spatio-temporal consistency and coherence throughout the rendered frames. Two broad approaches to the novel-view synthesis problem are apparent in the literature: (i) multi-view scene reconstruction followed by rendering from the resulting geometric model, and (ii) image-based rendering techniques which seek simply to find the correct colour for a pixel. In both cases a data likelihood term f (C, z) is defined over colour C and depth z which is designed to achieve a maximum at the correct depth and colour. In the multi-view stereo reconstruction problem the aim is generally to find the correct depth, and [1] was the first to suggest that this could be done elegantly for multiple input views by looking for the depth that maximises colour agreement between the input images. Recent approaches such as [2,3] involve quasi-geometric models for 3–D reconstruction where occlusion is modelled as an outlier process. Approximate inference techniques are then used to reconstruct the scene taking account of occlusion. Realistic generative models using quasi-geometric models are capable of rendering high quality images but lead to intractable minimisation problems [3]. More explicit reasoning about depth and occlusions is possible when an explicit volumetric model is reconstructed as in voxel carving such as [4,5]. The direct application of voxel carving or stereo with occlusion methods [6,7,8] to our problem of novel video synthesis would, however, involve simultaneous optimisation of the MRF energy with respect to depth and colour in the space-time domain. The graph corresponding to the output video then becomes highly connected as shown in Fig. 2-a for a row of each frame. Unfortunately however, available optimisation techniques for highly connected graphs with non-submodular potentials are not guaranteed to reach a global solution [9]. In contrast, [10] marginalise the data likelihood over depth and thus have no explicit geometric reasoning about the depth of pixels. This and similar methods rely on photoconsistency regularised by photometric priors [10,7] to generate photorealistic images. The priors are designed to favour output cliques which resemble samples in a texture library built from the set of input images. It has recently been shown [11] that using small 2-pixel patch priors from a local texture library can be as effective as the larger patches used in [10]. [11] converts the problem of optimising over all possible colours, to a discrete labelling problem over modes of the photoconsistency function, referred to as colour modes, which can be enumerated a priori. Since the texture library comprises only pairs of pixels, the maximum clique size is two, and tree-reweighted message passing [12] can be used to solve for a strong minimum in spite of the non-submodular potentials introduced by enumerating the colour modes. We closely follow this latter, image-based rendering approach, but extend it to sequences of images rather than single frames. We propose to define suitable potential functions between spatially and temporally adjacent pixels. This, and our demonstration of the subsequent benefits, form the main contribution of this paper. We define an MRF in space-time for the output video sequence, and optimise an energy function defined over the entire video sequence to obtain a solution for the output sequence which is a strong local minimum of the energy
Individual MRF optimisation for each output frame
Our method: Using temporal priors for video synthesis
Fig. 1. A pair of consecutive frames from a synthesised sequence. Top row: individual MRF optimisation for each output frame fails to ensure temporal consistency yielding artifacts that are particularly evident when the sequence is viewed continuously. Bottom row: Using temporal priors, as proposed in this paper, to optimise an MRF energy over the entire video sequence reduces those effects. An example is circled.
function. Crucially, in contrast to methods based on depth information and 3-D occlusion, our proposed framework has a graph with a depth-independent vertex degree, as shown in Fig. 2-b. This results in a tractable optimisation over the MRF and hence we have an affordable model for the temporal flow of colours in the scene as the camera moves. The remainder of this paper is organised as follows. In Section 2, we introduce the graph and its corresponding energy function that we wish to minimise, in particular the different potential terms. Section 3 gives implementation details, experimental results and a comparison of our method with (i) per-frame optimisation, and (ii) a naïve, constant-colour prior.
2 Novel Video Synthesis Framework
We formulate the MRF energy using binary cliques with pairwise, texture-based priors for temporal and spatial edges in the output video graph. Spatial edges in the graph exist between 8-connected neighbourhood of pixels in each frame. Temporal edges link pixels with the same coordinates in adjacent frames as shown in Fig. 2-b. Therefore, the energy of the MRF can be expressed in terms
Fig. 2. Temporal edges in an MRF graph for video sequence synthesis. a) Using a 3–D occlusion model all pixels on epipolar lines of pixels in adjacent frames must be connected by temporal edges (here only four temporal edges per pixel are shown to avoid clutter). b) Using our proposed temporal texture-based priors we can reduce the degree of the graph to a constant.
Fig. 3. a) Local texture library is built using epipolar lines in sorted input views I for each pixel in the output video sequence. b) Local pairwise temporal texture dictionary for two output pixels p and q connected by a temporal graph edge.
of the unary and binary potential functions for the set of labels (colours) F as follows:

E(F) = \sum_p \phi_p(f_p) + \lambda_1 \sum_p \sum_{q \in N_s(p)} \psi_{pq}(f_p, f_q) + \lambda_2 \sum_p \sum_{q \in N_t(p)} \psi_{pq}(f_p, f_q),    (1)

where f_p and f_q are labels in the label set F, \phi is the unary potential measuring the photoconsistency, and \psi encodes the pairwise priors for the spatial and temporal neighbours of pixel p, denoted by N_s(p) and N_t(p) respectively. \lambda_1 and \lambda_2 are weight coefficients for the different priors. The output sequence is then given by the optimal labelling F^* through minimisation of E:

F^* = \arg\min_F \{E(F)\}.    (2)
Next, we first discuss the texture library for spatial and temporal terms and introduce some notations and then define the unary and binary potentials.
Fig. 4. The temporal transition of colours between pixels in two output frames. A constant colour model between temporally adjacent output pixels p and q is clearly invalid because of motion parallax. On the other hand, there is a good chance that the local texture vocabulary comprising colour pairs obtained from the epipolar lines Tp and Tq (respectively the epipolar lines in the corresponding input view of the stereo pair) captures the correct colour combination, as shown in this case.
2.1 Texture Library and Notations
To calculate the local texture library, we first find and sort subsets of the input frames with respect to their distance to the output frames. We denote these subsets by I. The input frame in I which is closest to the output frame containing pixel p is denoted by I(p). Then, for each pairwise clique of pixels p and q, the local texture library is generated by bilinear interpolation of pixels on the clique epipolar lines in I, as illustrated in Fig. 3. For a pixel p, the colour in input frame k corresponding to the depth disparity z is denoted by C_k(z, p). The vocabulary of the library is composed of the colours of the pixels corresponding to the same depth on each epipolar line and is defined below:

T = \{ (C_i(z, p), C_j(z, q)) \mid z = z_{min}, \dots, z_{max},\ i = I(p),\ j = I(q) \}.    (3)

We also define T_p as the epipolar line of pixel p in I(p),

T_p = \{ C_k(z, p) \mid z = z_{min}, \dots, z_{max},\ k = I(p) \}.    (4)

2.2 Unary Potentials
Unary potential terms express the measure of agreement in the input views for a hypothesised pixel colour. Since optimisation over the full colour space can only be effectively achieved via slow, non-deterministic algorithms, we use instead a technique proposed in [11] that finds a set of photoconsistent colour modes. The optimisation is then over the choice of which mode, i.e. a discrete
labelling problem. These colour modes are denoted by f_p for pixel p, and using their estimated depth z the unary potential is given by the photoconsistency of f_p in a set of close input views V:

\phi_p(f_p) = \sum_{i \in V} \rho(\| f_p - C_i(z, p) \|),    (5)
where ρ(·) is a truncated quadratic robust kernel.

2.3 Binary Potentials
Binary (pair-wise) potentials in graph-based formulations of computer vision problems often use the Potts model (piece-wise constant) to enforce smoothness of the output (e.g. colour in segmentation algorithms, or depth in stereo reconstruction). While the Potts model is useful as a regularisation term, its application to temporal cliques is strictly incorrect. This is due to the relative motion parallax between the frames, as illustrated in Fig. 4. In general, the temporal links marked by dotted lines between two pixels p and q, for example, do not correspond to the same 3–D point, and therefore the colour coherency assumption of the Potts model is invalid. Instead, we propose to use texture-based priors to define pairwise potentials on temporal edges. As shown in Fig. 4, a local texture library given by Eq. 3 for the clique of pixels p and q is generated using the epipolar lines T_p and T_q defined in Eq. 4 in two successive input frames close to the output frames containing p and q. This library contains the correct colour combination for the clique containing p and q corresponding to two distinct 3-D points (marked by the dotted rectangle in Fig. 4). This idea is valid for all temporal cliques in general scenes, provided that there exists a pair of successive input frames throughout the whole sequence which can see the correct 3–D points for p and q. Each pairwise potential term measures how consistent the pair of labels for pixels p and q is with the (spatio-temporal) texture library. The potential is taken to be the minimum over all pairs in the library, viz:

\psi_{pq}(f_p, f_q) = \min_z \{ \rho(\| f_p - T_p(z) \|) + \rho(\| f_q - T_q(z) \|) \}.    (6)
Note that the use of a robust kernel ρ(.) ensures that cases where a valid colour combination does not exist are not overly penalised; rather, if a good match cannot be found a constant penalty is used. As explained above, exploiting texture-based priors enables us to establish a valid model for temporal edges in the graph which is independent of the depth and therefore avoid highly connected temporal nodes. This is an important feature of our approach which implies that the degree of the graph is independent of the 3–D structure of the scene.
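For illustration, the unary and pairwise terms of Eqs. (5)-(6) with a truncated quadratic kernel can be sketched as follows; the truncation threshold is an assumed placeholder, not a value from the paper.

```python
# Unary photoconsistency (Eq. 5) and pairwise texture-dictionary prior (Eq. 6)
# with a truncated quadratic robust kernel.
import numpy as np

def rho(d, trunc=30.0):
    return np.minimum(d ** 2, trunc ** 2)         # truncated quadratic kernel

def unary(f_p, colours_at_depth):
    """colours_at_depth: (|V|, 3) colours C_i(z, p) in the close input views V."""
    return np.sum(rho(np.linalg.norm(colours_at_depth - f_p, axis=1)))

def pairwise(f_p, f_q, Tp, Tq):
    """Tp, Tq: (Z, 3) colours along the two epipolar lines, indexed by depth z."""
    costs = rho(np.linalg.norm(Tp - f_p, axis=1)) + rho(np.linalg.norm(Tq - f_q, axis=1))
    return costs.min()                            # Eq. (6): minimum over depth

rng = np.random.default_rng(0)
Tp, Tq = rng.random((64, 3)) * 255, rng.random((64, 3)) * 255
print(unary(Tp[3], rng.random((8, 3)) * 255), pairwise(Tp[3], Tq[3], Tp, Tq))
```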
3 Implementation and Results
We verified the effectiveness of temporal priors for consistent novel video synthesis in several experiments. We compare the generated views with and without
(a) frame #2    (b) frame #5    (c) magnified part of #2    (d) magnified part of #5
Fig. 5. Top row, results obtained using our proposed texture-based temporal priors. Middle row, using the Potts temporal priors. Bottom row, individual rendering of frames. Columns (c) and (d) show the details of rendering. It can be noted that the Potts model and individual optimisation fail on the sharp edges of the leaves.
temporal priors. In all cases, the spatial terms for all 8-connected neighbours in each frame in the MRF energy were similarly computed from texture-based priors. Therefore the focus of our experiments is on the texture-based temporal priors. We also show results from using the simpler constant-colour prior (the Potts model). The energy function of Eq. 1 is minimised using a recently introduced enhanced version of the tree-reweighted max-product message passing algorithm known as the TRW-S algorithm [12], which can handle non-submodular graph edge costs and has guaranteed convergence properties. For an output video sequence with n frames of size W × H, the spatio-temporal graph has n × W × H vertices and (n − 1) × W × H temporal edges in the case of using texture-based temporal priors or the Potts model. This is the minimum number of temporal edges for a spatio-temporal MRF; any other prior based on depth with a number of disparities z would require at least z × (n − 1) × W × H temporal edges, where z is of the order of 10 to 100. The typical run time to process a space-time volume of 15 × 100 × 100 pixels is 600 seconds on a P4 D 3.00GHz machine. The same volume when treated as individual frames takes 30 × 15 = 450 seconds to process.
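As a quick sanity check of these counts, a minimal snippet (with an assumed number of disparities z, which the paper only bounds to the range 10–100) gives:

```python
# Graph sizes for 15 frames of 100x100 pixels; z = 64 is an assumed disparity count.
n, W, H, z = 15, 100, 100, 64

vertices = n * W * H
temporal_edges_ours = (n - 1) * W * H          # texture-based / Potts priors
temporal_edges_depth = z * (n - 1) * W * H     # depth-based temporal priors

print(vertices, temporal_edges_ours, temporal_edges_depth)
# 150000 140000 8960000
```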
frame #1    frame #3    frame #5    frame #9
Our proposed method: texture-based temporal priors
The Potts temporal priors
Individual optimisation per frame
Fig. 6. Synthesised Edmontosaurus sequence. First row, results obtained using our proposed texture-based temporal priors. Second row, using the Potts temporal priors creates some artifacts (frame #3). Third row, individual rendering of frames introduces artifacts in the holes (the nose and the jaw). Also note that the quality of frame #5 has greatly improved thanks to the texture-based temporal priors.
The input video sequence is first calibrated using commercial camera tracking software¹. The stereoscopic output virtual camera projection matrices are then generated from the input camera matrices by adding a horizontal offset to the input camera centres. The colour modes as well as the unary photoconsistency terms given by Eq. 5 for each pixel in the output video are calculated using the 8 closest views in the input sequence. We also compute 8 subsets I for texture library computation, as explained in Section 2.3, with the lowest distance to the ensemble of the n output camera positions. Finally, in Eq. 1 we set λ1 to 1 and λ2 to 10 in our experiments. Fig. 5 shows two synthesised frames of a video sequence of a tree and the details of rendering around the leaves for different methods. Here, in the case of temporal priors (texture-based and Potts), 5 frames of 300 × 300 pixels are
¹ Boujou, 2d3 Ltd.
Fig. 7. Stereoscopic frames generated using texture-based temporal priors over 15 frames. In each frame, the left image is the input view (corresponding to the left eye) and the right image is the reconstructed right eye view.
rendered by a single energy optimisation. In the detailed view, it can be noted that the quality of the generated views using texture-based temporal priors has improved, especially around the edges of the leaves. As another example, Fig. 6 shows some frames of the novel video synthesis on the Edmontosaurus sequence using different techniques. Here the temporal priors are used to render 11 frames of 200 × 200 pixels by a single energy optimisation. The first row shows the results obtained using our proposed texture-based temporal prior MRF. Using the Potts model for temporal edges generates more artifacts, as shown in the second row of Fig. 6. Finally, the third row shows the results obtained without any temporal priors and with individual optimisation of each frame. It can be noted that the background is consistently seen through the holes in the skull, while flickering artifacts occur in the case of the Potts prior and individual optimisation. Here the output camera matrices are generated by interpolation between the first and the last input camera positions. Finally, Fig. 7 shows the entire set of stereoscopic frames constructed using temporal priors over 15 frames.
4 Conclusion
We have introduced a new method for novel video rendering with optimisation in space-time domain. We define a Markov Random Field energy minimisation for rendering a video sequence which preserves temporal consistency and coherence throughout the rendered frames. Our method uses a finite set of colours for each pixel with their associated likelihood cost to find a global minimum energy solution which satisfies prior temporal consistency constraints in the output sequence. In contrast to methods based on depth information and 3–D occlusion we exploit texture-based priors on pairwise cliques to establish a valid model for temporal edges in the graph. This approach is independent of the depth and therefore results in a graph whose degree is independent of scene structure. As a result and as supported by our experiments, our approach provides a method to
reduce temporal artifacts in novel video synthesis without resorting to approximate generative models and inference techniques to handle multiple depth maps. Moreover, our algorithm can be extended to larger clique texture-based priors while keeping the degree of the graph independent of the depth of the scene. This requires sophisticated optimisation techniques which can handle larger cliques such as [13,14] and will be investigated in our future work. Quantitative analysis of the algorithm using synthetic/real stereo sequences is also envisaged to further study the efficiency of temporal priors for video synthesis. Acknowledgements. This work was supported by EPSRC grant EP/C007220/1 and by a CASE studentship sponsored by Sharp Laboratories Europe. The authors also wish to thank Andrew W. Fitzgibbon for his valuable input.
References 1. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(4), 353–363 (1993) 2. Strecha, C., Fransens, R., Gool, L.V.: Combined depth and outlier estimation in multiview stereo. In: Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 2394–2401. IEEE Computer Society, Los Alamitos (2006) 3. Gargallo, P., Sturm, P.: Bayesian 3d modeling from images using multiple depth maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, California, vol. 2, pp. 885–891 (2005) 4. Goesele, M., Seitz, S.M., Curless, B.: Multi-View Stereo Revisited. In: Conference on Computer Vision and Pattern Recognition, New York, USA (2006) 5. Kutulakos, K., Seitz, S.: A Theory of Shape by Space Carving. International Journal of Computer Vision 38(3), 197–216 (2000) 6. Kolmogorov, V., Zabih, R.: Multi-Camera Scene Reconstruction via Graph Cuts. In: European Conference on Computer Vision, Copenhagen, Denmark (2002) 7. Sun, J., Zheng, N., Shum, H.: Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis 25, 1–14 (2003) 8. Tappen, M., Freeman, W.: Comparison of graph cuts with belief propagation for stereo,using identical MRF parameters. In: International Conference on Computer Vision (2003) 9. Kolmogorov, V., Rother, C.: Comparison of energy minimization algorithms for highly connected graphs. In: European Conference on Computer Vision, Graz, Austria (2006) 10. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-based rendering using imagebased priors. In: Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1176–1183 (2003) 11. Woodford, O.J., Reid, I.D., Fitzgibbon, A.W.: Efficient new view synthesis using pairwise dictionary priors. In: Conference on Computer Vision and Pattern Recognition (2007) 12. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1568–1583 (2006) 13. Kohli, P., Kumar, M.P., Torr, P.H.: P3 & Beyond: Solving Energies with Higher Order Cliques. In: Conference on Computer Vision and Pattern Recognition (2007) 14. Potetz, B.: Efficient Belief Propagation for Vision Using Linear Constraint Nodes. In: Conference on Computer Vision and Pattern Recognition (2007)
Content-Based Image Retrieval by Indexing Random Subwindows with Randomized Trees

Raphaël Marée¹, Pierre Geurts², and Louis Wehenkel²

¹ GIGA Bioinformatics Platform, University of Liège, Belgium
² Systems and Modeling Unit, Montefiore Institute, University of Liège, Belgium
Abstract. We propose a new method for content-based image retrieval which exploits the similarity measure and indexing structure of totally randomized tree ensembles induced from a set of subwindows randomly extracted from a sample of images. We also present the possibility of updating the model as new images come in, and the capability of comparing new images using a model previously constructed from a different set of images. The approach is quantitatively evaluated on various types of images with state-of-the-art results despite its conceptual simplicity and computational efficiency.
1 Introduction
With the improvements in image acquisition technologies, large image collections are available in many domains. In numerous applications, users want to search efficiently for images in such large databases, but semantic labeling of all these images is rarely available, because it is not obvious to describe images exhaustively with words, and because there is no widely used taxonomy standard for images. Thus, one well-known paradigm in computer vision is "content-based image retrieval" (CBIR), i.e. when users want to retrieve images that share some similar visual elements with a query image, without any further text description either for images in the reference database or for the query image. To be practically valuable, a CBIR method should combine computer vision techniques that derive rich image descriptions, and efficient indexing structures [2]. Following these requirements, our starting point is the method of [8], where the goal was to build models able to predict accurately the class of new images, given a set of training images where each image is labeled with one single class among a finite number of classes. Their method was based on random subwindow extraction and ensembles of extremely randomized trees [6]. In addition to good accuracy results obtained on various types of images, this method has attractive computing times. These properties motivated us to extend their method to CBIR, where one has to deal with very large databases of unlabeled images. The paper is organized as follows. The method is presented in Section 2. To assess its performances and usefulness as a foundation for image retrieval, we evaluate it on several datasets representing various types of images in Section 3, where the influence of its major parameters will also be evaluated. Method
parameters and performances are discussed in Section 4. Finally, we conclude with some perspectives.
2 Method Rationale and Description
We now describe the different steps of our algorithm: extraction of random subwindows from images (2.1), construction of a tree-based indexing structure for these subwindows (2.2), derivation of a similarity measure between images from an ensemble of trees (2.3), and its practical use for image retrieval (2.4).

2.1 Extraction of Random Subwindows
Occlusions, cluttered backgrounds, and viewpoint or orientation changes that occur in real-world images motivated the development of object recognition and image retrieval methods that model image appearances locally by using so-called "local features" [15]. Indeed, global aspects of images are considered not sufficient to model the variabilities of objects or scenes, and many local feature detection techniques have been developed over the years. These consider that the neighbourhoods of corners, lines/edges, contours or homogeneous regions capture interesting aspects of images for classifying or comparing them. However, a single detector might not capture enough information to distinguish all images, and recent studies [18] suggest that most detectors are complementary (some being more adapted to structured scenes while others to textures) and that all of them should ideally be used in parallel. One step further, several recent works evaluated dense sampling schemes of local features, e.g. on a uniform grid [4] or even randomly [8,11]. In this work, we use the same subwindow random sampling scheme as [8]: square patches of random sizes are extracted at random locations in images, resized by bilinear interpolation to a fixed size (16 × 16), and described by HSV values (resulting in 768 feature values per subwindow). This provides a rich representation of images corresponding to various overlapping regions, both local and global, whatever the task and content of images. Using raw pixel values as descriptors avoids discarding potentially useful information while being generic and fast.
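A rough sketch of this random subwindow extraction is given below (our own illustration; it uses nearest-neighbour resampling in place of the bilinear interpolation used in the paper, and assumes the image is already in HSV):

```python
# Extract square patches of random size at random locations, resize to 16x16,
# and flatten the three HSV channels into a 768-value descriptor.
import numpy as np

def extract_subwindows(image_hsv, n_windows=100, out_size=16, rng=None):
    """image_hsv: (H, W, 3) array; returns an (n_windows, 768) descriptor matrix."""
    rng = rng or np.random.default_rng()
    H, W, _ = image_hsv.shape
    descriptors = np.empty((n_windows, out_size * out_size * 3))
    for k in range(n_windows):
        s = rng.integers(8, min(H, W) + 1)        # random square size
        r = rng.integers(0, H - s + 1)
        c = rng.integers(0, W - s + 1)
        patch = image_hsv[r:r + s, c:c + s]
        # resize to 16x16 by nearest-neighbour sampling (bilinear in the paper)
        ri = (np.arange(out_size) * s / out_size).astype(int)
        ci = (np.arange(out_size) * s / out_size).astype(int)
        descriptors[k] = patch[np.ix_(ri, ci)].reshape(-1)
    return descriptors

img = np.random.rand(120, 160, 3)                    # stand-in for an HSV image
print(extract_subwindows(img, n_windows=10).shape)   # (10, 768)
```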
2.2 Indexing Subwindows with Totally Randomized Trees
In parallel to these computer vision developments, and due to the slowness of nearest neighbor searches that prevent real-time response times with hundreds of thousands of local feature points described by high-dimensional descriptors, several tree-based data structures and/or approximate nearest neighbors techniques have been proposed [1,5,9,13,14,16] for efficient indexing and retrieval. In this paper, we propose to use ensembles of totally randomized trees [6] for indexing (random) local patches. The method recursively partitions the training sample of subwindows by randomly generated tests. Each test is chosen by selecting a random pixel component (among the 768 subwindows descriptors) and a random cut-point in the range of variation of the pixel component in
the subset of subwindows associated to the node to split. The development of a node is stopped as soon as either all descriptors are constant in the leaf or the number of subwindows in the leaf is smaller than a predefined threshold n_min. A number T of such trees are grown from the training sample. The method thus depends on two parameters: n_min and T. We will discuss below their impact on the similarity measure defined by the tree ensemble. There exist a number of indexing techniques based on recursive partitioning. The two main differences between the present work and these algorithms are the use of an ensemble of trees instead of a single one and the random selection of tests in place of more elaborate splitting strategies (e.g., based on a distance metric computed over the whole descriptors in [9,16], or taken at the median of the pixel component whose distribution exhibits the greatest spread in [5]). Because of the randomization, the computational complexity of our algorithm is essentially independent of the dimensionality of the feature space and, like other tree methods, is O(N log(N)) in the number of subwindows. This makes the creation of the indexing structures extremely fast in practice. Note that totally randomized trees are a special case of the Extra-Trees method exploited in [8] for image classification. In the latter method, K random tests are generated at each tree node and the test that maximizes some information criterion related to the output classification is selected. Totally randomized trees are thus obtained by setting the parameter K of this method to 1, which deactivates test filtering based on the output classification and allows trees to be grown in an unsupervised way. Note however that the image retrieval procedure described below is independent of the way the trees are built. When a semantic classification of the images is available, it could thus be a good idea to exploit it when growing the trees (as this would tend to put subwindows from the same class in the same leaves).
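A rough sketch of growing one such tree, under our own naming and data layout (a nested dictionary; not the original implementation): each internal node draws a random pixel component and a random cut-point within the local range of values, and splitting stops when a node holds fewer than n_min subwindows or its descriptors are constant.

```python
import numpy as np

def build_tree(X, indices, n_min, rng):
    """X: (n, 768) subwindow descriptors; indices: np.array of rows reaching this node."""
    X_node = X[indices]
    lo, hi = X_node.min(axis=0), X_node.max(axis=0)
    if len(indices) < n_min or np.all(lo == hi):
        return {"leaf": True, "indices": indices}             # leaf keeps its training subwindows
    f = rng.choice(np.flatnonzero(hi > lo))                   # random non-constant pixel component
    t = rng.uniform(lo[f], hi[f])                             # random cut-point in its local range
    left, right = indices[X[indices, f] < t], indices[X[indices, f] >= t]
    if len(left) == 0 or len(right) == 0:                     # degenerate split: stop here
        return {"leaf": True, "indices": indices}
    return {"leaf": False, "feature": int(f), "threshold": float(t),
            "left": build_tree(X, left, n_min, rng),
            "right": build_tree(X, right, n_min, rng)}
```

An ensemble is obtained by calling build_tree(X, np.arange(len(X)), n_min, rng) T times with different random generators.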
2.3 Inducing Image Similarities from Tree Ensembles
A tree T defines the following similarity between two subwindows s and s′ [6]:

$$k_T(s, s') = \begin{cases} \frac{1}{N_L} & \text{if } s \text{ and } s' \text{ reach the same leaf } L \text{ containing } N_L \text{ subwindows,} \\ 0 & \text{otherwise.} \end{cases}$$

This expression amounts to considering that two subwindows are very similar if they fall in a same leaf that has a very small subset of training subwindows¹. The similarity induced by an ensemble of T trees is defined by:

$$k_{ens}(s, s') = \frac{1}{T} \sum_{t=1}^{T} k_{T_t}(s, s') \qquad (1)$$

¹ Intuitively, as it is less likely a priori that two subwindows will fall together in a small leaf, it is natural to consider them very similar when they actually do.
This expression amounts to considering that two subwindows are similar if they are considered similar by a large proportion of the trees. The spread of the similarity measure is controlled by the parameter n_min: when n_min increases, subwindows tend to fall more often in the same leaf, which yields a higher similarity according to (1). On the other hand, the number of trees controls the smoothness of the similarity. With only one tree, the similarity (1) is very discrete as it can take only two values when one of the subwindows is fixed. The combination of several trees provides a finer-grained similarity measure, and we expect that this will improve the results as much as in the context of image classification. We will study the influence of these two parameters in our experiments. Given this similarity measure between subwindows, we derive a similarity between two images I and I′ by:

$$k(I, I') = \frac{1}{|S(I)|\,|S(I')|} \sum_{s \in S(I),\; s' \in S(I')} k_{ens}(s, s'), \qquad (2)$$

where S(I) and S(I′) are the sets of all subwindows that can be extracted from I and I′, respectively. The similarity between two images is thus the average similarity between all pairs of their subwindows. Although finite, the number of different subwindows of variable size and location that can be extracted from a given image is in practice very large. Thus we propose to estimate (2) by extracting at random from each image an a priori fixed number of subwindows. Notice also that, although (2) suggests that the complexity of this evaluation is quadratic in the number of subwindows, we show below that it can actually be computed in linear time by exploiting the tree structures. Since (1) actually defines a positive kernel [6] among subwindows, equation (2) defines a positive (convolution) kernel among images [17]. This means that this similarity measure has several nice mathematical properties. For example, it can be used to define a distance metric and it can be directly exploited in the context of kernel methods [17].
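The linear-time evaluation of (2) mentioned above can be sketched as follows (our illustration, reusing the toy tree structure of the previous sketch): per tree, each image is summarised by how many of its subwindows reach each leaf, and the double sum over subwindow pairs collapses to a sum over the leaves that both images touch.

```python
from collections import Counter

def reach_leaf(tree, x):
    node = tree
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node

def image_similarity(trees, X_a, X_b):
    """Estimate k(I_a, I_b) of (2) from the subwindow sets X_a and X_b."""
    total = 0.0
    for tree in trees:
        ca, cb = Counter(), {}
        for x in X_a:
            ca[id(reach_leaf(tree, x))] += 1
        for x in X_b:
            leaf = reach_leaf(tree, x)
            cb.setdefault(id(leaf), [leaf, 0])[1] += 1
        for lid, (leaf, nb) in cb.items():                    # only leaves reached by both images
            if lid in ca:
                total += ca[lid] * nb / len(leaf["indices"])  # N_L = training subwindows in leaf
    return total / (len(trees) * len(X_a) * len(X_b))
```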
2.4 Image Retrieval Algorithms
In image retrieval, we are given a set of, say, N_R reference images and we want to find the images from this set that are most similar to a query image. We propose the following procedure to achieve this goal.

Creation of the indexing structure. To build the indexing structure over the reference set, we randomly extract N_ls subwindows of variable size and location from each reference image, resize them to 16 × 16, and grow an ensemble of totally randomized trees from them. At each leaf of each tree, we record, for each image of the reference set that appears in the leaf, the number of its subwindows that have reached this leaf.

Recall of the reference images most similar to a query image. We compute the similarities between a query image I_Q and all N_R reference images by propagating N_ts subwindows from the query image into each tree and by incrementing, for each subwindow s of I_Q, each tree T, and each reference image I_R, the similarity k(I_Q, I_R) by the proportion of subwindows of I_R in the leaf reached by s in T, and by dividing the resulting score by T N_ls N_ts. This procedure estimates k(I_Q, I_R) as given by (2), using N_ls and N_ts random subwindows
from I_R and I_Q, respectively. From these N_R similarities, one can identify the N most similar reference images in O(N N_R) operations, and the complexity of the whole computation is on average O(T N_ts (log(N_ls) + N_R)). Notice that the fact that information about the most similar reference images is gathered progressively, as the number of subwindows of the query image increases, could be exploited to yield an anytime recall procedure. Note also that once the indexing structure has been built, the database of training subwindows and the original images are not required anymore to compute the similarity.

Computation of the similarity between query images. The above procedure can be extended to compute the similarity of a query image to another image not belonging to the reference set, an extension we name model recycling. To this end, one propagates the subwindows from each image through each tree and maintains counts of the number of these subwindows reaching each leaf. The similarity (2) is then obtained by summing, over the tree leaves, the product of the subwindow counts of the two images divided by the number of training subwindows in the leaf, and by normalizing the resulting sum.

Incremental mode. One can incorporate the subwindows of a new image into an existing indexing structure by propagating them and recording their leaf counts. When, following this operation, a leaf happens to contain more than n_min subwindows, the random splitting procedure is simply used to develop it further. Because of the random nature of the tree growing procedure, this incremental procedure is likely to produce trees similar to those that would be obtained by rebuilding them from scratch. In real-world applications such as World Wide Web image search engines, medical imaging in research or clinical routine, or software to organize user photos, this incremental characteristic is of great interest, as new images are crawled by search engines or generated very frequently.
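As an illustration only (the data structures and names are our assumptions, reusing reach_leaf and the leaf layout of the earlier sketches), the indexing and recall steps could look as follows: every leaf stores per-reference-image subwindow counts, and a query is scored by accumulating those counts weighted by 1/N_L and normalising by T · N_ls · N_ts.

```python
from collections import defaultdict
import numpy as np

def build_index(trees, ref_subwindows):
    """ref_subwindows: list over reference images of (N_ls x 768) descriptor arrays."""
    index = []
    for tree in trees:
        leaf_hist = defaultdict(lambda: defaultdict(int))     # leaf id -> {image id: count}
        for img_id, X in enumerate(ref_subwindows):
            for x in X:
                leaf_hist[id(reach_leaf(tree, x))][img_id] += 1
        index.append(leaf_hist)
    return index

def query_scores(trees, index, n_ref, n_ls, X_query):
    """Estimate k(I_Q, I_R) for all N_R reference images from the query subwindows."""
    scores = np.zeros(n_ref)
    for tree, leaf_hist in zip(trees, index):
        for x in X_query:
            leaf = reach_leaf(tree, x)
            n_leaf = len(leaf["indices"])                     # N_L training subwindows in the leaf
            for img_id, count in leaf_hist[id(leaf)].items():
                scores[img_id] += count / n_leaf
    return scores / (len(trees) * n_ls * len(X_query))
```

The N most similar reference images are then simply np.argsort(-scores)[:N].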
3 Experiments
In this section, we perform a quantitative evaluation of our method in terms of its retrieval accuracy on datasets with ground-truth labels. We study the influence of the number of subwindows extracted from training images for building the tree structure (N_ls), the number of trees built (T), the stop-splitting criterion (n_min), and the number of subwindows extracted from query images (N_ts). Like other authors, we will consider that an image is relevant to a query if it is of the same class as the query image, and irrelevant otherwise. Then, different quantitative measures [3] can be computed. In order to compare our results with the state of the art, for each of the following datasets we use the same protocol and performance measures as other authors. Note that, while class labels are used to assess accuracy, this information is not used during the indexing phase.
3.1 Image Retrieval on UK-Bench
The University of Kentucky recognition benchmark is a dataset introduced in [9] and recently updated that now contains 640 × 480 color images of 2550 classes of
4 images each (10200 images in total), approximately 1.7 GB of JPEG files. These images depict plants, people, CDs, books, magazines, outdoor/indoor scenes, animals, household objects, etc., as illustrated in Figure 1. The full set is used as the reference database to build the model. Then, the measure of performance is an average score that counts, for each of the 10200 images, how many of the 4 images of the same object (including the identical image) are ranked in the top-4 most similar images. The score thus varies from 0 (when getting nothing right) up to 4 (when getting everything right). Average scores of variants of the method presented in [9] range from 3.07 to 3.29 (i.e., recognition rates² from 76.75% to 82.36%; see their updated website³), using one of the best detector and descriptor combinations (the Maximally Stable Extremal Region (MSER) detector and the Scale-Invariant Feature Transform (SIFT) descriptor), a tree structure built by hierarchical k-means clustering, and different scoring schemes. Very recently, [14] improved the results up to a score of 3.45 using the same set of features but with an approximate k-means clustering exploiting randomized k-d trees.
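For clarity, a small hypothetical helper (ours) that computes this score from a precomputed similarity matrix:

```python
import numpy as np

def ukbench_score(similarities, labels):
    """similarities: (n, n) matrix; labels[i]: object id of image i (4 images per object)."""
    n = len(labels)
    score = 0.0
    for q in range(n):
        top4 = np.argsort(-similarities[q])[:4]               # identical image included
        score += np.sum(labels[top4] == labels[q])
    return score / n                                          # between 0 and 4
```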
Fig. 1. Several images of the UK-Bench. One image for various objects (top), the four images of the same object (bottom).
Figure 2 shows the influence of the parameters of our method on the recognition performance. We obtain scores slightly above 3 (i.e., around a 75% recognition rate) with 1000 subwindows extracted per image, 10 trees, and a minimum number of subwindows per node n_min between 4 and 10. Note that the recognition rate still increases when using more subwindows. For example, although not reported in these figures, a score of 3.10 is obtained when 5000 subwindows are extracted per image with only 5 trees (n_min = 10).
3.2 Image Retrieval on ZuBuD
The Zürich Buildings Database⁴ is a database of color images of 201 buildings. Each building in the training set is represented by 5 images acquired at 5 arbitrary viewpoints. The training set thus includes 1005 images, and it is used to build the model, while the test set (acquired by another camera under different conditions), which contains 115 images of a subset of the 201 buildings, is used to evaluate the generalization performance of the model.
² (Number of correct images in the first 4 retrieved images / 40800) × 100%
³ http://www.vis.uky.edu/~stewe/ukbench/
⁴ http://www.vision.ee.ethz.ch/showroom/zubud/index.en.html
Fig. 2. Influence of the parameters on UK-Bench. Influence of stop splitting parameter, number of trees, number of training subwindows, number of test subwindows.
The performance measure used in [13,3,16] is the classification recognition rate of the first retrieved image, for which 93%, 89.6%, and 59.13% are reported, respectively. In [12], a 100% recognition rate was obtained, but with recall times of over 27 seconds per image (with an exhaustive scan of the database of local affine frames). We obtain 95.65% with 1000 subwindows per image, T = 10, and several values of n_min below 10. On this problem, we observed that it is not necessary to use so many trees and subwindows to obtain this state-of-the-art recognition rate. In particular, a single tree, or fewer than 500 subwindows, is sufficient.
3.3 Model Recycling on META and UK-Bench
In our last experiment, we evaluate the model recycling idea, i.e., we want to assess whether, given a large set of unlabeled images, we can build a model on these images and then use this model to compare new images from another set. To do so, we build a new dataset called META that is basically the collection of images from the following publicly available datasets: LabelMe Set1-16, Caltech-256, Aardvark to Zorro, CEA CLIC, Pascal Visual Object Challenge 2007, Natural Scenes A. Oliva, Flowers, WANG, Xerox6, Butterflies, Birds. This
sums up to 205763 color images (about 20 GB of JPEG image files) that we use as training data, from which we extract random subwindows and build the ensemble of trees. Then, we exploit that model to compare the UK-Bench images between themselves. Using the same performance measure as in Section 3.1, we obtain an average score of 2.64, i.e., a recognition rate of 66.1%, with 50 subwindows per training image of META (roughly 10 million subwindows in total), T = 10, n_min = 4, and 1000 subwindows per test image of UK-Bench. For comparison, we obtained a score of 3.01, i.e., a 75.25% recognition rate, using the full UK-Bench set as training data and the same parameter values. Unsurprisingly, the recognition rate is better when the model is built using the UK-Bench set as training data, but we still obtain an interesting recognition rate with the META model. Nistér and Stewénius carried out a similar experiment in [9], using different training sets (images from moving vehicles and/or CD covers) to build a model to compare UK-Bench images. They obtained scores ranging from 2.16 to 3.16 (using between 21 and 53 million local features), which are also inferior to what they obtained when exploiting the UK-Bench set for building the model.
4 Discussion
Some general comments about the influence of the parameters can be drawn from our experiments. First, we observed that the more trees and subwindows, the better the results. We note that on ZuBuD, a small number of trees and a moderate number of subwindows already give state-of-the-art results. We also found that the value of n_min should be neither too small nor too large. It influences the recognition rate, and increasing its value also reduces the memory needed to store the trees (as they are smaller when n_min is larger) and the time required for the indexing phase. It also reduces the prediction time, but with large values of n_min (such as 1000), the image indexes at the terminal nodes of the trees tend to become dense, which then slows down the retrieval phase of our algorithm, which exploits the sparsity of these vectors to speed up the updating procedure. One clear advantage of the method is that the user can more or less control its performance: its parameters can be chosen so as to trade off recognition performance, computational requirements, problem difficulty, and available resources. For example, with our current proof-of-concept implementation in Java, one single tree that has 94.78% accuracy on ZuBuD is built in less than 1m30s on a single 2.4 GHz processor, using a total of 1005000 training subwindows described by 768 values, and n_min = 4. When testing query images, the mean number of subwindow tests in the tree is 42.10. In our experiment of Section 3.3, to find similar images in UK-Bench based on the model built on META, there are on average 43.63 tests per subwindow in one single tree. On average, all 1000 subwindows of one UK-Bench image are propagated through all 10 trees in about 0.16 seconds. Moreover, random subwindow extraction and raw pixel description are straightforward. In Section 3.3 we introduced the META database and model. While this database obviously does not represent the infinite “image space”, it is however
possible to extract a very large set of subwindows from it; hence we expect that the META model could produce scores distinctive enough to compare a wide variety of images. The results we obtained in our last experiment on the 2550-object UK-Bench dataset are promising in that sense. Increasing the number of subwindows extracted from the META database and enriching it using other image sources, such as the Wikipedia image database dump or frames from the Open Video Project, might increase the generality and power of the META model. Our image retrieval approach does not require any prior information about the similarity of training images. Note however that in some applications such information is available, and it could be a good idea to exploit it to design better similarity measures for image retrieval. When this information is available in the form of a semantic labeling of the images, it is easy to incorporate it into our approach, simply by replacing totally randomized trees with extremely randomized trees for the indexing of subwindows. Note however that our result on ZuBuD equals the result obtained by [8] using extremely randomized trees that exploit the image labels during the training stage. This result suggests that for some problems, good image retrieval performance could be obtained with a fast and rather simple method and without prior information about the images. Besides a classification, information could also be provided in the form of a set of similar or dissimilar image pairs. Nowak and Jurie [10] propose a method based on randomized trees for exploiting such pairwise constraints to design a similarity measure between images. When more quantitative information is available about the similarity between training images, one could combine our approach with ideas from [7], where a (kernel-based) similarity is generalized to never-seen objects using ensembles of randomized trees.
5 Conclusions
In this paper, we used totally randomized trees to index randomly extracted subwindows for content-based image retrieval. Due to its conceptual simplicity (randomization is used both in image description and indexing), the method is fast. Good recognition results are obtained on two datasets with illumination, viewpoint, and scale changes. Moreover, an incremental mode and model recycling were presented. In future work, other image descriptors and other stop-splitting and scoring schemes might be evaluated. In terms of other applications, the usefulness of the method for the problem of near-duplicate image detection might be investigated. Finally, totally randomized trees might also be helpful to index high-dimensional databases of other types of content.
Acknowledgements

Raphaël Marée is supported by the GIGA (University of Liège) with the help of the Walloon Region and the European Regional Development Fund. Pierre Geurts is a research associate of the FNRS, Belgium.
References

1. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces - index structures for improving the performance of multimedia databases. ACM Computing Surveys 33(3), 322–373 (2001)
2. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys 39(65) (2007)
3. Deselaers, T., Keysers, D., Ney, H.: Classification error rate for quantitative evaluation of content-based image retrieval systems. In: ICPR 2004. Proc. 17th International Conference on Pattern Recognition, pp. 505–508 (2004)
4. Deselaers, T., Keysers, D., Ney, H.: Discriminative training for object recognition using image patches. In: CVPR 2005. Proc. International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 157–162 (2005)
5. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3(3), 209–226 (1977)
6. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 36(1), 3–42 (2006)
7. Geurts, P., Wehenkel, L., d'Alché-Buc, F.: Kernelizing the output of tree-based methods. In: ICML 2006. Proc. of the 23rd International Conference on Machine Learning, pp. 345–352. ACM, New York (2006)
8. Marée, R., Geurts, P., Piater, J., Wehenkel, L.: Random subwindows for robust image classification. In: Proc. IEEE CVPR, vol. 1, pp. 34–40. IEEE, Los Alamitos (2005)
9. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: Proc. IEEE CVPR, vol. 2, pp. 2161–2168 (2006)
10. Nowak, E., Jurie, F.: Learning visual similarity measures for comparing never seen objects. In: Proc. IEEE CVPR, IEEE Computer Society Press, Los Alamitos (2007)
11. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 490–503. Springer, Heidelberg (2006)
12. Obdržálek, Š., Matas, J.: Image retrieval using local compact DCT-based representation. In: Michaelis, B., Krell, G. (eds.) Pattern Recognition. LNCS, vol. 2781, pp. 490–497. Springer, Heidelberg (2003)
13. Obdržálek, Š., Matas, J.: Sub-linear indexing for large scale object recognition. In: BMVC 2005. Proc. British Machine Vision Conference, pp. 1–10 (2005)
14. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proc. IEEE CVPR, IEEE Computer Society Press, Los Alamitos (2007)
15. Schmid, C., Mohr, R.: Local greyvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(5), 530–534 (1997)
16. Shao, H., Svoboda, T., Ferrari, V., Tuytelaars, T., Van Gool, L.: Fast indexing for image retrieval based on local appearance with re-ranking. In: ICIP 2003. Proc. IEEE International Conference on Image Processing, pp. 737–749. IEEE Computer Society Press, Los Alamitos (2003)
17. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
18. Zhang, J., Marszałek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision 73, 213–238 (2007)
Analyzing Facial Expression by Fusing Manifolds

Wen-Yan Chang1,2, Chu-Song Chen1,3, and Yi-Ping Hung1,2,3
1 Institute of Information Science, Academia Sinica, Taiwan
2 Dept. of Computer Science and Information Engineering, National Taiwan University
3 Graduate Institute of Networking and Multimedia, National Taiwan University
{wychang,song}@iis.sinica.edu.tw,
[email protected]

Abstract. Feature representation and classification are two major issues in facial expression analysis. In the past, most methods used either holistic or local representation for analysis. In essence, local information mainly focuses on the subtle variations of expressions, while holistic representation stresses global diversities. To take advantage of both, a hybrid representation is suggested in this paper, and manifold learning is applied to characterize global and local information discriminatively. Unlike some methods using unsupervised manifold learning approaches, embedded manifolds of the hybrid representation are learned by adopting a supervised manifold learning technique. To integrate these manifolds effectively, a fusion classifier is introduced, which can help to employ suitable combination weights of facial components to identify an expression. Comprehensive comparisons on facial expression recognition are included to demonstrate the effectiveness of our algorithm.
1 Introduction

Realizing human emotions plays an important role in human communication. To study human behavior scientifically and systematically, emotion analysis is an intriguing research issue in many fields. Much attention has been drawn to this topic in computer vision applications such as human-computer interaction, robot cognition, and behavior analysis. Usually, a facial expression analysis system contains three stages: face acquisition, feature extraction, and classification. For feature extraction, many methods have been proposed. In general, most methods represent features in either holistic or local ways. Holistic representation uses the whole face for representation and focuses on the facial variations of global appearance. In contrast, local representation adopts local facial regions or features and gives attention to the subtle diversities on a face. Though most recent studies have been directed towards local representation [17,18], good research results are still obtained by using holistic approaches [1,2]. Hence, it is interesting to exploit both of their benefits to develop a hybrid representation. In addition to feature representation, we also introduce a method for classification. Whether using a Bayesian classifier [4,18], support vector machines (SVM) [1], or neural networks, finding a strong classifier is the core of existing facial expression analysis studies. In approaches that adopt local facial information, weighting these local regions in a single classifier is a common strategy [18]. However, not all local regions
have the same significance in discriminating an expression. Recognition depending only on a fixed set of weights for all expressions cannot make explicit the significance of each local region to a particular expression. To address this issue, we characterize the discrimination ability per expression for each component in a hybrid representation; a fusion algorithm based on binary classification is presented. In this way, the characteristics of components can be addressed individually for expression recognition. In recent years, manifold learning [15,16] has received much attention in machine learning and computer vision research. The main consideration of manifold learning is not only to preserve global properties in data, but also to maintain localities in the embedded space. In addition to addressing the data representation problem, supervised manifold learning (SML) techniques [3,20] were proposed to further consider the data class during learning and provide a good discriminating capability. These techniques have been successfully applied to face recognition under different types of variations. Basically, SML can deliver performance superior not only to traditional subspace analysis techniques, such as PCA and LDA, but also to unsupervised manifold learning methods. By taking advantage of SML, we introduce a facial expression analysis method in which a set of embedded manifolds is constructed for each component. To integrate these embedded manifolds, a fusion algorithm is suggested, and good recognition results can be obtained.
2 Background

2.1 Facial Expression Recognition

To describe facial activity caused by the movement of facial muscles, the facial action coding system (FACS) was developed, and 44 action units are used for modeling facial expressions. Instead of analyzing these complicated facial actions, Ekman et al. [6] also investigated several basic categories for emotion analysis. They claimed that there are six basic universal expressions: surprise, fear, sadness, disgust, anger, and happiness. In this paper, we follow the six-class expression taxonomy and classify each query image into one of the six classes. As mentioned above, feature extraction and classification are two major modules in facial expression analysis. Essa et al. [7] applied optical flow to represent the motions of expressions. To lessen the effects of lighting, Wen and Huang [18] used both geometric shape and ratio-image based features for expression recognition with a MAP formulation. Lyons et al. [11] and Zhang et al. [21] adopted Gabor wavelet features for this topic. Recently, Bartlett et al. [1] suggested using AdaBoost for Gabor feature selection, and a satisfactory expression recognition performance was achieved. Furthermore, appearance is also a popular representation for facial expression analysis, and several subspace analysis techniques were used to improve recognition performance [11]. In [4], Cohen et al. proposed the Tree-Augmented Naïve Bayes classifier for video-based expression analysis. Furthermore, neural networks, hidden Markov models, and SVM [1] were also widely used. Besides image-based expression recognition, Wang et al. [17] used 3D range models for expression recognition and recently proposed a method to extract features from a 3D model. To analyze expressions under different orientations, head pose recovery is also addressed in some papers. In general, model registration or tracking approaches
are used to estimate the pose, and the image is warped into a frontal view [5,18]. Dornaika et al. [5] estimated head pose by using an iterative gradient descent method. Then, they applied particle filtering to track facial actions and recognize expressions simultaneously. Wen and Huang [18] also adopted a registration technique to obtain the geometric deformation parameters and warped images according to these parameters for expression recognition. Zhu and Ji [22] refined the SVD decomposition method by normalizing matrices to estimate the parameters of face pose and recover facial expression simultaneously. In a recent study, Pantic and Patras [13] further paid attention to expression analysis based on the face profile. More detailed surveys about facial expression analysis can be found in [8,12].

2.2 Manifold Learning

In the past decades, subspace learning techniques have been widely used for linear dimensionality reduction. Different from traditional subspace analysis techniques, LLE [15] and Isomap [16] were proposed in recent manifold learning studies by considering the local geometry of data. They assumed that a data set approximately lies on a lower dimensional manifold embedded in the original higher dimensional feature space. Hence, they focused on finding a good embedding approach for training data representation in a lower dimensional space without considering the class label of the data. However, one limitation of nonlinear manifold learning techniques is that manifolds are defined only on the training data, and it is difficult to map new test data to the lower dimensional space. Instead of using nonlinear manifold learning techniques, He et al. [9] proposed a linear approach, namely locality preserving projections (LPP), for vision-based applications. To achieve a better discriminating capability, it was recently suggested that the class label of data be considered during learning, and supervised manifold learning techniques were developed. Chen et al. [3] proposed the local discriminant embedding (LDE) method to learn the embedding of the sub-manifold for each class by utilizing the neighbor and class relations. At the same time, Yan et al. [20] also presented a graph embedding method, called marginal fisher analysis (MFA), which shares a similar concept with LDE. Using Isomap, Chang and Turk [2] introduced a probabilistic method for video-based facial expression analysis.
3 Expression Analysis Using Fusion Manifolds

3.1 Facial Components

Humans usually recognize emotions according to both global facial appearance and variations of facial components, such as eye shape, mouth contour, wrinkle expression, and the like. In our method, we attempt to consider local facial regions and the holistic face simultaneously. Based on facial features, we divide a face into seven components including left eye (LE), right eye (RE), middle of eyebrows (ME), nose (NS), mouth and chin (MC), left cheek (LC), and right cheek (RC). A mask of these components is illustrated in Fig. 1(a). In addition, two components, upper face (UF) and holistic face (HF), are also considered. The appearances of all components are shown in Fig. 1(b).
Fig. 1. Facial components used in our method. (a) shows the facial component mask and the locations of these local components. (b) examples of these components.
3.2 Fusion Algorithm for Embedded Manifolds

After representing a face by nine components, we then perform expression analysis based on them. To deal with this multi-component information, a fusion classification is introduced. Given a face image I, a mapping $M: \mathbb{R}^{d \times c} \to \mathbb{R}^{t}$ is constructed by

$$M(I) = [m_1(I_1), m_2(I_2), \dots, m_c(I_c)], \qquad (1)$$
where c is the number of components, m_i(·) is an embedding function, and I_i is a d-dimensional sub-image of the i-th component. Then, the multi-component information is mapped to a t-dimensional feature vector M(I), where t ≥ c. To construct the embedding function for each component, supervised manifold learning techniques are considered in our method. In this paper, the LDE [3] method is adopted for facial expression analysis. Considering a data set {x_i | i = 1, ..., n} with class labels {y_i} in association with a facial component, where y_i ∈ {Surprise, Fear, Sadness, Disgust, Anger, Happiness}, LDE attempts to simultaneously minimize the distances of neighboring data points in the same class and maximize the distances between neighboring points belonging to different classes in a lower dimensional space. The formulation of LDE is

$$\max_{V} \sum_{i,j} \|V^T x_i - V^T x_j\|^2\, w'_{ij} \quad \text{such that} \quad \sum_{i,j} \|V^T x_i - V^T x_j\|^2\, w_{ij} = 1, \qquad (2)$$
where $w_{ij} = \exp[-\|x_i - x_j\|^2 / r]$ is the weight between x_i and x_j if they are neighbors with the same class label, and $w'_{ij}$ is the corresponding weight between two neighbors x_i and x_j that belong to different classes. In LDE, only the K nearest neighbors are considered during learning. After computing the projection matrix V, the embedding of a data point x′ can be found by projecting it onto the lower dimensional space with z′ = V^T x′. For classification, the nearest neighbor is used in the embedded low-dimensional space.
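As a rough numerical sketch of how such a projection can be computed (our simplification for illustration, not the authors' code): both quadratic forms in (2) reduce to graph Laplacians of the two neighborhood graphs, and V follows from a generalized eigenvalue problem; the small regularization term is our own addition for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lde_projection(X, W_between, W_within, n_components):
    """X: (d, n) data matrix; W_between / W_within: (n, n) affinities over K-nearest
    neighbors of different / same class, as defined in the text."""
    def laplacian(W):
        return np.diag(W.sum(axis=1)) - W
    S_between = X @ laplacian(W_between) @ X.T                # spread over different-class pairs
    S_within = X @ laplacian(W_within) @ X.T                  # spread over same-class pairs
    # maximize between-class spread with the within-class spread fixed:
    # generalized eigenproblem S_between v = lambda * S_within v, keep the largest lambdas
    eigvals, eigvecs = eigh(S_between, S_within + 1e-6 * np.eye(X.shape[0]))
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]                                  # columns form V
```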
Since not all components are discriminative for an expression (e.g., chin features are particularly helpful for surprise and happiness), to take the discrimination ability of each component into account, a probabilistic representation is used to construct M(I) in our approach instead of a hard decision by the nearest neighbor. By calculating the shortest distance from x′ to a data point in each class, a probabilistic representation can be obtained by

$$D(x') = \frac{1}{\sum_{i=1,\dots,e} D_i} \{D_1, D_2, \dots, D_e\}, \qquad (3)$$
where $D_i = \min_k \|V^T x^i_k - z'\|$, $x^i_k$ is a training sample belonging to class i, $z' = V^T x'$, and e = 6 is the number of facial expression classes. For each component j (j = 1, ..., c), the embedding function m_j(·) can be written as m_j(I_j) = D(I_j). Then, the dimension of M(I) is t = 6 × 9 = 54. The relationship among components and expressions can be encoded in M(I) by using this representation. Components that are complementary to each other for identifying an individual expression are thus considered in the fusion stage to boost the recognition performance. To learn the significance of components from the embedded manifolds, a fusion classifier F: R^t → {Surprise, Fear, Sadness, Disgust, Anger, Happiness} is used. With the vectors M(I), we apply a classifier to {(x′_i, y_i) | i = 1, ..., n}, where x′ = M(I). The fusion classifier is helpful to decide the importance of each component to different expressions instead of selecting a fixed set of weights for all expressions. Due to its good generalization ability, SVM is adopted as the fusion classifier F in our method.
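A compact sketch (our notation and helper names) of how the 54-dimensional fused vector M(I) of equation (1) can be assembled from the per-component distances of equation (3):

```python
import numpy as np

def component_descriptor(V, Z_train, y_train, x, n_classes=6):
    """V: LDE projection of one component; Z_train: embedded training data; x: test subimage."""
    z = V.T @ x                                               # embed the component subimage
    dists = np.array([np.min(np.linalg.norm(Z_train[y_train == c] - z, axis=1))
                      for c in range(n_classes)])             # D_i: shortest distance to class i
    return dists / dists.sum()                                # D(x') of equation (3)

def fused_vector(components, x_components):
    """components: list of (V, Z_train, y_train) per facial component; returns M(I) of length 54."""
    return np.concatenate([component_descriptor(V, Z, y, x)
                           for (V, Z, y), x in zip(components, x_components)])
```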
Given a test data point x′, the decision function of the SVM is formulated as

$$f(x') = u^T \phi(x') + b, \qquad (4)$$
where φ is a kernel function, and u and b are parameters of the decision hyperplane. For a multi-class classification problem, pairwise coupling is a popular strategy that combines all pairwise comparisons into a multi-class decision. The class with the most winning two-class decisions is then selected as the prediction. Besides predicting an expression label, we also allow our fusion classifier to provide the probability/degree of each expression. In general, the absolute value of the decision function means the distance from x′ to the hyperplane and also reflects the confidence of the predicted label for a two-class classification problem. To estimate the probability of each class in a multi-class problem, the pairwise probabilities are addressed. Considering a binary classifier of classes i and j, the pairwise class probability t_i ≡ P(y = i | x′) can be estimated from (4), based on x′ and the training data, by Platt's posterior probabilities [14] with t_i + t_j = 1. That is,

$$t_i = \frac{1}{1 + \exp(A f(x') + B)}, \qquad (5)$$
where the parameters A and B are estimated by minimizing the negative log-likelihood function

$$\min_{A,B} \; -\sum_k \left[ \frac{y_k+1}{2} \log(q_k) + \left(1 - \frac{y_k+1}{2}\right) \log(1 - q_k) \right], \qquad (6)$$

in which

$$q_k = \frac{1}{1 + \exp(A f(x_k) + B)}, \qquad (7)$$
and $\{x_k, y_k \mid y_k \in \{1, -1\}\}$ is the set of training data. Then, the class probabilities $p = \{p_1, p_2, \dots, p_e\}$ can be estimated by minimizing the Kullback-Leibler distance between $t_i$ and $p_i/(p_i + p_j)$, i.e.,

$$\min_{p} \sum_{i \neq j} v_{ij}\, t_i \log\!\left(\frac{t_i (p_i + p_j)}{p_i}\right), \qquad (8)$$
where $\sum_{k=1,\dots,e} p_k = 1$, and $v_{ij}$ is the number of training data in classes i and j. Recently, a generalized approach is proposed [19] to tackle this problem. For robust estimation, the relation $t_i/t_j \approx p_i/p_j$ is used and the optimization is re-formulated as

$$\min_{p} \frac{1}{2} \sum_{i=1}^{e} \sum_{j:\, j \neq i} (t_j p_i - t_i p_j)^2, \qquad (9)$$
instead of using the relation ti ≈ pi /(pi + pj ). Then, class probabilities can be stably measured by solving (9).
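With the constraint that the probabilities sum to one, (9) reduces to a small linear system. The following sketch (ours, following the reformulation of [19] as we understand it) shows one way to solve it, where r[i, j] denotes the pairwise probability of class i against class j (so r[i, j] + r[j, i] = 1):

```python
import numpy as np

def multiclass_probabilities(r):
    """r: (e, e) matrix of pairwise probabilities; diagonal entries are ignored."""
    e = r.shape[0]
    Q = np.zeros((e, e))
    for i in range(e):
        for j in range(e):
            if i == j:
                Q[i, i] = sum(r[s, i] ** 2 for s in range(e) if s != i)
            else:
                Q[i, j] = -r[j, i] * r[i, j]
    # KKT system for: minimize p^T Q p subject to sum(p) = 1
    A = np.block([[Q, np.ones((e, 1))], [np.ones((1, e)), np.zeros((1, 1))]])
    rhs = np.zeros(e + 1)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)[:e]                        # class probabilities p_1 .. p_e
```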
4 Experiment Results

4.1 Dataset and Preprocessing

In our experiments, the publicly available CMU Cohn-Kanade expression database [10] is used to evaluate the performance of the proposed method. It consists of 97 subjects with different expressions. However, not all of these subjects have six coded expressions, and some of them have fewer than three expressions. To avoid the class-imbalance problem in classification, we select from the database 43 subjects who have at least 5 expressions. The selection contains various ethnicities and includes different lighting conditions. Person-independent evaluation [18] is adopted in our experiments, so that the data of one person will not appear in the training set when this person is used as a testing subject. Evaluating performance in this way is more challenging, since the variations between subjects are much larger than those within the same subject, and it also examines the generalization ability of the proposed method. To locate the facial components, the eye locations available in the database are used. Then, the facial image is registered according to the locations and orientations of the eyes. The component mask shown in Fig. 1 is applied to the registered facial image to extract facial components. The resolution of the sub-image for each component is 32 × 32 in our implementation.

4.2 Algorithms for Comparison

In this section, we give comparisons for different representations and algorithms. In holistic representation, we recognize expressions only by using the whole face,
i.e., the ninth image in Fig. 1(b), while the first seven components shown in Fig. 1(b) are used for local representation. To demonstrate the performance of the proposed method, several alternatives are also implemented for comparison. In the comparisons, appearance is used as the main feature by representing the intensities of the pixels in a 1D vector. To evaluate the performance, five-fold cross validation is adopted. According to the identity of the subjects, we divide the selected database into five parts, where four parts are treated as training data and the remaining part is treated as validation data in turn. To perform the person-independent evaluation, the training and validation sets do not contain images of the same person. We introduce the algorithms used for comparison as follows.

Supervised Manifold Learning (SML). In this method, only holistic representation is used for recognition. Here, LDE is adopted and the expression label is predicted by using nearest-neighbor classification. We set the number of neighbors K to 19 and the dimension of the reduced space to 150 in LDE. These parameters are also used in all of the other experiments.

SML with Majority Voting. This approach is used for multi-component integration. SML is applied to each component first. Then, the number of votes for each class label is accumulated and the final decision is made by selecting the class with the maximum count.

SVM Classification. This is an approach using SVM on the raw data (either holistic or local) directly, without dimension reduction by SML. In our implementation, a linear kernel is used in consideration of the computational cost. For multi-component integration, we simply concatenate the features of all of the components in order in this experiment.

SVM with Manifold Reduction. This approach is similar to the preceding SVM approach. The main difference is that the dimension of the data is reduced by manifold learning first. Then, the projected data are used for SVM classification.

Our Approach (SML with SVM Fusion). Here, the proposed method described in Section 3.2 is used for evaluation.

4.3 Comparisons and Discussions

We summarize the recognition results of the aforementioned methods in Table 1. One can see that local representation provides better performance than the holistic one in most methods. This agrees with the conclusions of many recent studies. By taking advantage of both holistic and local representations, the hybrid approach generally provides a superior result when an appropriate method is adopted. As shown in Table 1, the best result is obtained by using the proposed method with the hybrid representation. The recognition rates of individual expressions, obtained by using the aforementioned methods with the hybrid representation, are illustrated in Fig. 2. We illustrate the importance/influence of each component on an expression by a 3D visualization, as shown in Fig. 3. The accuracy of each component is evaluated by applying SML. From this figure, the discrimination ability of each component for a particular expression can be seen. The overall accuracy evaluated by considering all expressions is summarized in Table 2. Though the accuracies of some components are not good enough, a high recognition rate of 94.7% can still be achieved by using the proposed fusion algorithm to combine these components. This demonstrates the advantage of our fusion method.
Table 1. Accuracies for different methods using holistic, local, and hybrid representation

Holistic Approaches
  SML                              87.7 %
  SVM Classification               86.1 %
  SVM with Manifold Reduction      87.7 %
Local Approaches
  SML with Majority Voting         78.6 %
  SVM Classification               87.2 %
  SVM with Manifold Reduction      92.5 %
  SML with SVM Fusion              92.0 %
Hybrid Approaches
  SML with Majority Voting         87.2 %
  SVM Classification               87.7 %
  SVM with Manifold Reduction      92.0 %
  SML with SVM Fusion              94.7 %
Fig. 2. Comparison of accuracies for individual expression by using different methods with hybrid representation
Fig. 3. The importance/influence of each component on an expression
Table 2. Overall accuracies of expression recognition by using different facial components

  Component Name             Accuracy
  Left Eye (LE)              79.5 %
  Right Eye (RE)             73.1 %
  Middle of Eyebrows (ME)    54.7 %
  Nose (NS)                  66.3 %
  Mouth & Chin (MC)          65.8 %
  Left Cheek (LC)            50.5 %
  Right Cheek (RC)           47.7 %
  Upper Face (UF)            85.8 %
  Holistic Face (HF)         85.3 %
Fig. 4. Facial expression recognition results: horizontal bars indicate probabilities of expressions. The last column is an example where a surprise expression was wrongly predicted as a fear one.
Finally, some probabilistic facial expression recognition results are shown in Fig. 4, in which a horizontal bar indicates the probability of each expression. One mis-classified example is shown in the last column of this figure. Its ground-truth is surprise, but it was wrongly predicted as fear.
5 Conclusion

In this paper, we propose a fusion framework for facial expression analysis. Instead of using only holistic or local representation, a hybrid representation is used in our framework. Hence, we can take both subtle and global appearance variations into account at the same time. In addition, unlike methods using unsupervised manifold learning for facial expression analysis, we introduce supervised manifold learning (SML) techniques to represent each component. To combine the embedded manifolds in an effective manner, a fusion algorithm is proposed in this paper, which takes into account the support of each component for an individual expression. Both the expression label and the probabilities can be estimated. Compared with several methods using different representations and classification strategies, the experimental results show that our method is superior to the others, and promising recognition results for facial expression analysis are obtained.
Acknowledgments. This work was supported in part under Grants NSC 96-2752-E002-007-PAE. We would like to thank Prof. Jeffrey Cohn for providing the facial expression database.
References

1. Bartlett, M.S., Littlewort, G., Frank, M., Lainscsek, C.: Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior. CVPR 2, 568–573 (2005)
2. Chang, Y., Hu, C., Turk, M.: Probabilistic Expression Analysis on Manifolds. CVPR 2, 520–527 (2004)
3. Chen, H.T., Chang, H.W., Liu, T.L.: Local Discriminant Embedding and Its Variants. CVPR 2, 846–853 (2005)
4. Cohen, I., Sebe, N., Garg, A., Chen, L.S., Huang, T.: Facial Expression Recognition from Video Sequences: Temporal and Static Modeling. CVIU 91, 160–187 (2003)
5. Dornaika, F., Davoine, F.: Simultaneous Facial Action Tracking and Expression Recognition Using a Particle Filter. ICCV 2, 1733–1738 (2005)
6. Ekman, P., Friesen, W.V.: Unmasking the Face. Prentice Hall, Englewood Cliffs (1975)
7. Essa, I.A., Pentland, A.P.: Coding, Analysis, Interpretation, and Recognition of Facial Expressions. IEEE Trans. on PAMI 19(7), 757–763 (1997)
8. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36, 259–275 (2003)
9. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face Recognition Using Laplacianfaces. IEEE Trans. on PAMI 27(3), 328–340 (2005)
10. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive Database for Facial Expression Analysis. AFG, 46–53 (2000)
11. Lyons, M., Budynek, J., Akamatsu, S.: Automatic Classification of Single Facial Images. IEEE Trans. on PAMI 21(12), 1357–1362 (1999)
12. Pantic, M., Rothkrantz, L.J.M.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Trans. on PAMI 22(12), 1424–1445 (2000)
13. Pantic, M., Patras, I.: Dynamics of Facial Expression: Recognition of Facial Actions and Their Temporal Segments From Face Profile Image Sequences. IEEE Trans. on SMCB 32(2), 433–449 (2006)
14. Platt, J.: Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods. Advances in Large Margin Classifiers. MIT Press, Cambridge (2000)
15. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000)
16. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319–2323 (2000)
17. Wang, J., Yin, L., Wei, X., Sun, Y.: 3D Facial Expression Recognition Based on Primitive Surface Feature Distribution. CVPR 2, 1399–1406 (2006)
18. Wen, Z., Huang, T.: Capturing Subtle Facial Motions in 3D Face Tracking. ICCV 2, 1343–1350 (2003)
19. Wu, T.F., Lin, C.J., Weng, R.C.: Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
20. Yan, S., Xu, D., Zhang, B., Zhang, H.J.: Graph Embedding: A General Framework for Dimensionality Reduction. CVPR 2, 830–837 (2005)
21. Zhang, Z., Lyons, M., Schuster, M., Akamatsu, S.: Comparison between Geometry-Based and Gabor Wavelets-Based Facial Expression Recognition Using Multi-Layer Perceptron. AFG, 454–459 (1998)
22. Zhu, Z., Ji, Q.: Robust Real-Time Face Pose and Facial Expression Recovery. CVPR 1, 681–688 (2006)
A Novel Multi-stage Classifier for Face Recognition

Chen-Hui Kuo1,2, Jiann-Der Lee1, and Tung-Jung Chan2

1 Department of Electrical Engineering, Chang Gung University, Tao-Yuan 33302, Taiwan, R.O.C.
2 Department of Electrical Engineering, Chung Chou Institute of Technology, Chang-Hua 51003, Taiwan, R.O.C.
Abstract. A novel face recognition scheme based on a multi-stage classifier, which includes the support vector machine (SVM), Eigenface, and random sample consensus (RANSAC) methods, is proposed in this paper. The whole decision process is conducted in cascaded coarse-to-fine stages. The first stage adopts the one-against-one-SVM (OAO-SVM) method to choose the two possible classes most similar to the testing image. In the second stage, the “Eigenface” method is employed to select, in each of the two chosen classes, the prototype image with the minimum distance to the testing image. Finally, the real class is determined by comparing the geometric similarity, as measured by the “RANSAC” method, between these prototype images and the testing image. This multi-stage face recognition system has been tested on the Olivetti Research Laboratory (ORL) face database, and the experimental results give evidence that the proposed approach outperforms approaches based on either a single classifier or multiple parallel classifiers; it can even obtain nearly 100 percent recognition accuracy. Keywords: Face recognition; SVM; Eigenface; RANSAC.
1 Introduction

In general, research on face recognition systems falls into two categories: one is single-classifier systems and the other is multi-classifier systems. The single-classifier system, including neural network (NN) [1], Eigenface [2], Fisher linear discriminant (FLD) [3], support vector machine (SVM) [4], hidden Markov model (HMM) [5], or AdaBoost [6], has been well developed in theory and experiments. On the other hand, the multi-classifier system, such as local and global face information fusion [7],[8],[9], neural networks committee (NNC) [10], or multi-classifier system (MCS) [11], has been proposed to process different features or classifiers in parallel. The SVM is originally designed for binary classification, and it is based on the structural risk minimization (SRM) principle. Although several methods to effectively extend the SVM for multi-class classification have been reported in the technical literature [12],[13], it is still a widely researched issue. The methods of SVM for multi-class classification are one-against-all (OAA) [12],[14], one-against-one (OAO) [12], directed acyclic graph support vector machine (DAGSVM) [15], or binary tree SVM [4]. If one employs the same feature vector for SVM, NN, and AdaBoost, he
will find that the performance of SVM is better than that of NN and AdaBoost, because the SVM always results in the maximum separating margin to the hyperplane of the two classes. If the feature vector includes noisy data, and the noisy data possess at least one of the following properties: (a) overlapping class probability distributions, (b) outliers, and (c) mislabeled patterns [16], the hyperplane of the SVM tends toward a hard margin and overfitting. Additionally, the SVM allows noise or imperfect separation, provided that a non-negative slack variable is added to the objective function as a penalizing term. To integrate image features of frequency, intensity, and space information, we propose a novel face recognition approach, which combines the SVM, Eigenface, and random sample consensus (RANSAC) [17] methods in a multi-stage classifier system. The whole decision process is developed in stages, i.e., “OAO-SVM” first, “Eigenface” next, and finally “RANSAC”. In the first stage, “OAO-SVM”, we use the DCT features extracted from the entire face image to decide the two possible classes most similar to the testing image. In the second stage, face images of the two classes obtained from the first stage are projected onto a feature space (face space). The face space is defined by the “Eigenfaces”, which are the eigenvectors of the set of faces and are based on intensity information of the face image. “RANSAC” is applied in the last stage, in which the epipolar geometry with space information of the testing image is matched with the two training images obtained from the second stage, and then the prototype image with the greatest number of matching correspondence points is considered as the real face. The face database used for performance evaluation is retrieved from the Olivetti Research Laboratory (ORL) [18]. Three evaluation methods are adopted here to compare the performance of OAO-SVM, Eigenface, and our proposed multi-stage classifier. The remainder of this paper is organized as follows: In Section 2, the proposed OAO-SVM and multi-stage classifier methods are presented in detail. In Section 3, experimental results and a comparison to other approaches on the ORL face database are given. In Section 4, conclusions and directions for further research are summarized and discussed.
2 The Proposed Method

On the basis of a coarse-to-fine strategy, we design a multi-stage recognition system which integrates the OAO-SVM, Eigenface, and RANSAC methods to enhance recognition accuracy. The details of this system are described as follows.

2.1 One-Against-One (OAO) SVM for Multi-class Recognition

In the OAO strategy, several binary SVMs are constructed, each one trained on data from only two different classes. Thus, this method is sometimes called a “pair-wise” approach. For a data set with N different classes, this method constructs C(N,2) = N(N−1)/2 two-class SVM models. We are given m training data (x1, y1), …, (xm, ym), where xk ∈ Rd, k = 1, …, m, and yk ∈ {1, …, N} is the class of xk. For training data from the ith and jth classes, we solve the following binary classification problem:
$$\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \; \frac{1}{2}(w^{ij})^T w^{ij} + C\sum_k \xi_k^{ij}$$
$$\text{subject to}\quad (w^{ij})^T \phi(x_k) + b^{ij} \ge 1 - \xi_k^{ij}, \ \text{if } y_k = i,$$
$$(w^{ij})^T \phi(x_k) + b^{ij} \le -1 + \xi_k^{ij}, \ \text{if } y_k = j,$$
$$\xi_k^{ij} \ge 0, \qquad (1)$$
where w is the weight vector, b is the threshold, $\phi(x_k)$ maps the training data $x_k$ to a higher-dimensional (kernel) space, and C is the penalty parameter. When noisy data would force a hard margin, the penalty term $C\sum_k \xi_k^{ij}$ relaxes the margin and allows the data to be partially mistrusted, which reduces the number of training errors. The simplest decision function is a majority vote or max-win scheme: the votes for each class are counted based on the outputs of the N(N-1)/2 SVMs, and the class with the most votes is selected as the system output. In the majority vote process, each discriminant function $f_{ij}(x)$ of the input data x is computed for the N(N-1)/2 SVM models. The score function $R_i(x)$ sums the votes for class i. The final decision is made by the "winner takes all" rule, which corresponds to the following maximization. The expression for the final decision is given as Eq. (2).
$$f_{ij}(x) = (x \cdot w_n) + b_n, \quad n = 1, \ldots, N,$$
$$R_i(x) = \sum_{j=1,\, j \ne i}^{N} \operatorname{sgn}\{ f_{ij}(x) \}, \qquad (2)$$
$$m(x, R_1, R_2, \ldots, R_N) = \arg\max_{i=1,\ldots,N} \{ R_i(x) \},$$
where $f_{ij}(x)$ is the output of the (i, j)th SVM, x is the input data, and m is the final decision, i.e., the class that receives the largest vote from the decision functions $f_{ij}$.
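The voting rule of Eq. (2) is easy to implement directly. The following is a minimal sketch, not the authors' code: it assumes the N(N-1)/2 pairwise decision functions are available as callables keyed by class pairs (i, j) with i < j, each returning a positive score when the input is assigned to class i and a negative score when it is assigned to class j.

```python
import numpy as np

def oao_vote(x, pairwise_svms, n_classes):
    """One-against-one majority voting in the spirit of Eq. (2).

    pairwise_svms: dict mapping (i, j) with i < j to a decision function
    f_ij(x), positive -> vote for class i, negative -> vote for class j.
    Returns the winning class and the full vote tally.
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), f_ij in pairwise_svms.items():
        if f_ij(x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    winner = int(np.argmax(votes))  # "winner takes all"
    return winner, votes
```

For reference, library implementations such as scikit-learn's SVC train the same one-against-one collection of binary SVMs internally and aggregate them by voting; the first stage described below additionally keeps the two top-voted classes rather than only the winner.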
2.2 Our Multi-stage Classifier System for Face Recognition
Based on the same coarse-to-fine strategy, the proposed scheme for face recognition is performed by a consecutive multi-stage recognition system, in which each stage is devoted to removing as many false classes as possible. The flowcharts of the proposed system, including the training phase and the recognition phase, are shown in Fig. 1(a) and (b), respectively. In the first stage (OAO-SVM), for a testing image, we obtain its DCT features by the feature extraction process and employ the "winner-take-all" voting strategy to select the top two classes, i.e., ci and cj, with the maximum votes for later use. In the second stage, the Euclidian distance of each image in ci and cj is calculated, and for each class only the one prototype image with the minimum distance to the testing image is kept for the last stage. The "RANSAC" method is applied in the last stage, in which the epipolar geometry, carrying spatial information of the testing image, is matched against the two prototype images obtained from the second stage, and the prototype image with the greatest number of matched corresponding feature
points is selected as the correct one; in other words, the prototype image with the most geometric similarity to the testing image is chosen.
Fig. 1. Flowchart of the multi-stage classifier for face recognition. (a) Training phase. (b) Testing phase.
More specifically, $C_2^N = N(N-1)/2$ "pair-wise" SVM models are used in the first stage (OAO), as shown in Fig. 1(b). According to Eq. (2), the two classes ci and cj with the first and second largest voting values Ri(x) and Rj(x) are selected. If the difference between Ri(x) and Rj(x) is less than or equal to e, i.e., Ri(x) − Rj(x) ≤ e, where e is a preset value, the two classes are delivered to the second stage for binary classification. On the other hand, if Ri(x) − Rj(x) > e, class ci is decided to be the only correct answer and the recognition process is finished. In other words, when the difference between the voting values of ci and cj is at most e, the two classes are barely distinguishable and there is a clear need to proceed to the next stage to refine the decision. PCA is a well-known technique commonly exploited in multivariate linear data analysis. The main underlying concept is to reduce the dimensionality of a data set while retaining as much variation as possible. A testing face image $i_x$ is transformed into its eigenface components (projected into "face space") by the simple operation $w_k = u_k^T (i_x - \bar{i})$ for k = 1, ..., M, where the $u_k$ are eigenvectors of the covariance matrix of the mean-subtracted face images, $\bar{i}$ is the average face image, and M is the number of eigenfaces used. The weights form a vector $\psi^T = [w_1\ w_2\ \ldots\ w_M]$ that describes the contribution of each eigenface in representing the input face image,
treating the eigenfaces as a basis set for face images. This vector is used to find which of the pre-defined face classes best describes the face. The simplest method for determining which face class provides the best description of an input face image is to find the face class k that minimizes the Euclidian distance $\varepsilon_k = \lVert \psi - \psi_k \rVert$, where $\psi_k$ is the vector describing the kth face class. In the second stage, as shown in Fig. 2(a), the input image is projected into "face space" and the weight vector $\psi^T = [w_1\ w_2\ \ldots\ w_9]$ is created. For each class, the image with the minimum Euclidian distance to the input image is selected as the prototype image. For example, Fig. 2(b) shows the Euclidian distances between the input image and ten training images belonging to two classes: class 1 consists of images 1, 2, 3, 4, and 5, and class 2 consists of images 6, 7, 8, 9, and 10. The image with the minimum Euclidian distance from each class is then selected for the last stage; for example, image 3 in class 1 and image 7 in class 2 are chosen in the second stage. Here, we denote these two images as c13 and c27.
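The second-stage selection can be summarized in a few lines. This is an illustrative sketch only; the function and variable names (project, closest_prototype, eigenfaces) are ours and assume the eigenfaces are stored as rows of a matrix and the images as flattened NumPy vectors.

```python
import numpy as np

def project(face, mean_face, eigenfaces):
    """w_k = u_k^T (x - mean); eigenfaces is an (M, D) matrix of unit eigenvectors."""
    return eigenfaces @ (face - mean_face)

def closest_prototype(test_face, class_faces, mean_face, eigenfaces):
    """Return the index of the training image of one class whose weight
    vector lies nearest (Euclidean distance) to the test image's weights."""
    w_test = project(test_face, mean_face, eigenfaces)
    dists = [np.linalg.norm(w_test - project(f, mean_face, eigenfaces))
             for f in class_faces]
    return int(np.argmin(dists))
```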
Fig. 2. (a) The weight vector ψT = [w1 w2 … w9] of input face. (b) The Euclidian distance between the input image and ten training images, respectively. The images 1, 2, 3, 4, 5 are of the same class and the images 6, 7, 8, 9, 10 are of another class.
In the last stage, the "RANSAC" method is used to match the testing image with the two training images (c13 and c27) obtained in the second stage, in order to find which prototype image best matches the testing image; the one with the maximum number of corresponding points fits best. The "RANSAC" procedure is described as follows.
• Find Harris Corners [19]. In the testing and training images, shifting a window in any direction should give a large change in intensity, as shown in Fig. 3(a). The change $E_{x,y}$ produced by a shift (x, y) is given by:
$$E_{x,y} = \sum_{u,v} \delta_{u,v}\, [\, I_{x+u,\,y+v} - I_{u,v}\,]^2, \qquad (3)$$
where $\delta_{u,v}$ specifies the image window: for example a rectangular function, which is unity within a specified rectangular region and zero elsewhere, or a Gaussian giving a smooth circular window, $\delta_{u,v} = \exp\!\big(-(u^2+v^2)/2\sigma^2\big)$; $I_{u,v}$ denotes the image intensity.
[Fig. 3 panels: the testing image with training images 1 and 2; putative matches with each training image; RANSAC matches with each training image; match counts (training image 1: 13 matched, 4 unmatched; training image 2: 4 matched, 6 unmatched).]
Fig. 3. The procedures of using RANSAC method to find the matched and unmatched correspondence points. (a) Find Harris corners feature points. (b) Find putative matches. (c) Use RANSAC method to find the correspondence points. (d) Count numbers of matched and unmatched of correspondence points.
• Find Putative Matches. Among the previously detected corner feature points in a given image pair, putative matches are generated by looking for points that are maximally correlated with each other within given windows; only points that correlate strongly with each other in both directions are kept. Even though correlation matching produces many wrong matches (roughly 10 to 50 percent), it is sufficient to compute the fundamental matrix F, as shown in Fig. 3(b). • Estimate the Fundamental Matrix F. The RANSAC method is used to locate the corresponding points between the testing and training images. As shown in Fig. 4, consider the map x → l' between the two images defined by the fundamental matrix F; the most basic property of F is that $x'^{T} F x = 0$ [20] for any pair of corresponding points x ↔ x' in the given image pair. The following steps are repeated by RANSAC to estimate the fundamental matrix F (a code sketch is given after this list): (a) select a random sample of 8 correspondences; (b) compute F; (c) measure the support (the number of inliers within a threshold distance of the epipolar line).
• Choose the Fundamental Matrix. Choose the F with the largest number of inliers and obtain the correspondences $x_i \leftrightarrow x'_i$ (as shown in Fig. 3(c)). • Count the Matched and Unmatched Feature Points. A threshold distance between two corresponding points $x_i \leftrightarrow x'_i$ is set; a pair counts as a match if its distance is smaller than the threshold, and as a non-match otherwise. For the given image pairs, the successful match is the training image with the largest number of matches, as shown in Fig. 3(d).
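The RANSAC loop sketched in the bullets above can be written as follows. This is a simplified illustration rather than the authors' implementation: it uses an unnormalized eight-point fit (in practice Hartley point normalization [20] improves conditioning) and a symmetric epipolar distance as the support measure; the array shapes, iteration count, and threshold are our assumptions.

```python
import numpy as np

def fit_fundamental(x1, x2):
    """Eight-point fit of F from 8+ correspondences; x1, x2 are (n, 2) arrays."""
    A = np.array([[u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)   # enforce the rank-2 constraint
    S[2] = 0.0
    return U @ np.diag(S) @ Vt

def epipolar_distance(F, x1, x2):
    """Symmetric point-to-epipolar-line distance for all correspondences."""
    p1 = np.c_[x1, np.ones(len(x1))]
    p2 = np.c_[x2, np.ones(len(x2))]
    l2 = p1 @ F.T                       # epipolar lines in image 2 (F x)
    l1 = p2 @ F                         # epipolar lines in image 1 (F^T x')
    num = np.abs(np.sum(p2 * l2, axis=1))   # |x'^T F x|
    d2 = num / np.hypot(l2[:, 0], l2[:, 1])
    d1 = num / np.hypot(l1[:, 0], l1[:, 1])
    return np.maximum(d1, d2)

def ransac_fundamental(x1, x2, n_iter=500, thresh=1.5, seed=None):
    rng = np.random.default_rng(seed)
    best_F, best_inliers = None, np.zeros(len(x1), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(x1), 8, replace=False)       # (a) sample 8 points
        F = fit_fundamental(x1[idx], x2[idx])             # (b) compute F
        inliers = epipolar_distance(F, x1, x2) < thresh   # (c) measure support
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return best_F, best_inliers
```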
Fig. 4. The corresponding points in the two images are x and x'. The two cameras are indicated by their centers C and C'. The camera centers, the three-space point X, and its images x and x' lie in a common plane π. An image point x back-projects to a ray in three-space defined by the first camera center C and x. This ray is imaged as a line l' in the second view. The three-space point X which projects to x must lie on this ray, so the image of X in the second view must lie on l'.
3 Experimental Results The proposed scheme for face recognition is evaluated on the ORL face database, which contains abundant variability in expressions, poses and facial details. We conducted experiments to compare our cascaded multi-stage classifier strategy with other well-known single classifiers, e.g., OAO-SVM and Eigenface. The experimental platform is an Intel Celeron 2.93 GHz processor with 1 GB DDR RAM, Windows XP, and Matlab 7.01. 3.1 Face Recognition on the ORL Database
The experiment is performed on the ORL database. There are 400 images of 40 distinct subjects; each subject has ten different images taken under different conditions, i.e., pose, expression, etc. Each image was digitized as a 112 × 92 pixel array whose gray levels range between 0 and 255. There are variations in facial expression such as open/closed eyes, smiling/non-smiling, and with/without glasses. In our experiments, five images of each subject are randomly selected as training samples and the other five serve as testing images. Therefore, for the 40 subjects in the database, a total of 200 images are used for training and another 200 for testing,
and there are no overlaps between the training and testing sets. Here, we verify our system based on the average error rate. This procedure is repeated four times, i.e., four runs, which results in four groups of data. For each group, we calculated the average error rate versus the number of feature dimensions (from 15 to 100). Fig. 5 shows the average results over the four runs and the output of each stage of the multi-stage classifier, which integrates OAO-SVM, Eigenface, and RANSAC. As shown in Fig. 5, the error rate of the output of the final stage is lower than that of the other two single classifiers; that is, our proposed method obtains the lowest error rate. Additionally, the average minimum error rate of our method is 1.37% at 30 features, while that of OAO-SVM is 2.87% and that of Eigenface is 8.50%. If we choose the best results among the four groups of randomly selected data, the lowest error rate of the final stage even reaches 0%.
Fig. 5. Comparison of recognition accuracy using OAO-SVM, Eigenface, and the proposed system on the ORL face database
3.2 Comparison with Previous Reported Results on ORL
Several approaches have been evaluated for face recognition on the ORL database. Single-classifier methods include Eigenface [2],[21],[23],[24], DCT-RBFNN [1], binary tree SVM [4], 2D-HMM [5], LDA [25], and NFS [26]; multi-classifier methods include fuzzy Fisherface [7],[22] and CF2C [9]. Here, we present a comparison under similar conditions between our proposed method and these methods on the ORL database. Approaches are evaluated on error rate and feature vector dimension. Comparative results of the different approaches are shown in Table 1. It is hard to compare the speed of methods run on different computing platforms, so we ignore the training and recognition time of each approach. From Table 1, it is clear that the proposed approach achieves the best recognition rate compared with the other six approaches.
Table 1. Recognition performance comparison of different approaches (ORL)
Methods                  Best error rate (%)   Mean error rate (%)   Feature vector dimension
Eigenface [23]           2                     4                     140
2D-PCA [21]              4                     5                     112×3
Binary tree SVM [4]      N/A                   3                     48
DCT-RBFNN [1]            0                     2.45                  30
CF2C [9]                 3                     4                     30
Fuzzy Fisherface [22]    2.5                   4.5                   60
Our proposed approach    0                     1.375                 30
4 Conclusions This paper presents a multi-stage classifier method for face recognition based on the techniques of SVM, Eigenface, and RANSAC. The proposed multi-stage method is built on a coarse-to-fine strategy, which reduces the computation cost. Facial features are first extracted by the DCT for the first stage, OAO-SVM. Although the last stage (RANSAC) is more accurate than the other two stages, its computation cost is higher because of the estimation of the geometric fundamental matrix F. To shorten the computation time, we reduce the candidates to only two training images and then match them with the testing image in the last stage. The key of this method is to use OAO-SVM to output the two classes with the maximum votes so that the decision on the correct class can be made later by RANSAC in the last stage. The feasibility of the proposed approach has been successfully tested on the ORL face database, which is acquired under varying poses and expressions with a moderate number of samples. Comparative experiments on the face database also show that the proposed approach is superior to single classifiers and parallel multi-classifiers. Our ongoing research is to study the classification performance when the OAO stage outputs more than two classes, and to examine the trade-off between success rate and computation time, in order to find a classification system with superior recognition capability at reasonable computational cost.
References 1. Er, M.J., Chen, W., Wu, S.: High-Speed Face Recognition Based on Discrete Cosine Transform and RBF Neural Networks. IEEE Trans. Neural Networks 16(3), 679–691 (2005) 2. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 71–86 (1991) 3. Xiang, C., Fan, X.A., Lee, T.H: Face Recognition Using Recursive Fisher Linear Discriminant. IEEE Trans. on Image Processing 15(8), 2097–2105 (2006) 4. Guo, G., Li, S.Z., Chan, K.L.: Support vector machines for face recognition. Image and Vision Computing 19, 631–638 (2001)
5. Othman, H., Aboulnasr, T.: A Separable Low Complexity 2D HMM with Application to Face Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(10), 1229–1238 (2003) 6. Lu, J.K., Plataniotis, N.A., Venetsanopoulos, N., Li, S.Z.: Ensemble-Based Discriminant Learning With Boosting for Face Recognition. IEEE Trans. on Neural Networks 17(1), 166–178 (2006) 7. Kwak, K.C., Pedrycz, W.: Face recognition: A study in information fusion using fuzzy integral. Pattern Recognition Letters 26, 719–733 (2005) 8. Rajagopalan, A.N., Rao, K.S., Kumar, Y.A.: Face recognition using multiple facial features. Pattern Recognition Letters 28, 335–341 (2007) 9. Zhou, D., Yang, X., Peng, N., Wang, Y.: Improved-LDA based face recognition using both facial global and local information. Pattern Recognition Letters 27, 536–543 (2006) 10. Zhao, Z.Q., Huang, D.S., Sun, B.Y.: Human face recognition based on multi-features using neural networks committee. Pattern Recognition Letters 25, 1351–1358 (2004) 11. Lemieux, A., Parizeau, M.: Flexible multi-classifier architecture for face recognition systems. In: 16th Int. Conf. on Vision Interface (2003) 12. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Network 13(2), 415–425 (2002) 13. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, Inc., Chichester (1998) 14. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study in handwriting digit recognition. In: International Conference on Pattern Recognition, pp. 77–87. IEEE Computer Society Press, Los Alamitos (1994) 15. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Advances in Neural Information Processing Systems, vol. 12, pp. 547– 553. MIT Press, Cambridge (2000) 16. Ratsch, G., Onoda, T., Muller, K.R.: Soft Margins for AdaBoost. Machine Learning 42, 287–320 (2001) 17. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24(6), 381–395 (1981) 18. ORL face database, http://www.uk.research.att.com/facedatabase.html 19. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: 4th Alvey Vision Conference, pp. 147–151 (1988) 20. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003) 21. Yang, J., Zhang, D., Frangi, A.F., Yang, J.Y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 131–137 (2004) 22. Kwak, K.C., Pedrycz, W.: Face recognition using a fuzzy fisferface classifier. Pattern Recognition 38, 1717–1732 (2005) 23. Li, B., Liu, Y.: When eigenfaces are combined with wavelets. Knowledge-Based Systems 15, 343–347 (2002) 24. Phiasai, T., Arunrungrusmi, S., Chamnongthai, K.: Face recognition system with PCA and moment invariant method. In: Proc. of the IEEE International Symposium on Circuits and Systems, vol. 2, pp. 165–168 (2001) 25. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face recognition using LDA-based algorithms. IEEE Trans. on Neural Networks 14, 195–200 (2003) 26. Chien, J.T., Wu, C.C.: Discriminant waveletfaces and nearest feature classifiers for face recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 1644–1649 (2002)
Discriminant Clustering Embedding for Face Recognition with Image Sets Youdong Zhao, Shuang Xu, and Yunde Jia School of Computer Science and Technology Beijing Institute of Technology, Beijing 100081, PR. China {zyd458,xushuang,jiayunde}@bit.edu.cn
Abstract. In this paper, a novel local discriminant embedding method, Discriminant Clustering Embedding (DCE), is proposed for face recognition with image sets. DCE combines the effectiveness of submanifolds, which are extracted by clustering for each subject’s image set, characterizing the inherent structure of face appearance manifold and the discriminant property of discriminant embedding. The low-dimensional embedding is learned via preserving the neighbor information within each submanifold, and separating the neighbor submanifolds belonging to different subjects from each other. Compared with previous work, the proposed method could not only discover the most powerful discriminative information embedded in the local structure of face appearance manifolds more sufficiently but also preserve it more efficiently. Extensive experiments on real world data demonstrate that DCE is efficient and robust for face recognition with image sets. Keywords: Face recognition, image sets, submanifolds (local linear models), discriminant embedding.
1 Introduction In the past several years, automatic face recognition using image sets has attracted more and more attention due to its wide range of underlying applications [1, 2, 3, 4]. Images in the sets are assumed to be sampled independently from complex high-dimensional nonlinear manifolds; e.g., they may be derived from sparse and unordered observations acquired by multiple still shots of an individual or by long-term monitoring of a scene by surveillance systems. This relaxes the assumption of temporal coherence between consecutive images from video. In this paper, we focus on revealing and extracting the most powerful discriminative information from face appearance manifolds for face recognition over image sets. In the real world, due to various sources of variation, e.g., large pose or illumination changes, the face appearance manifold of an individual in image space is a complex nonlinear distribution, which consists of a set of submanifolds (or local linear models) (see Fig. 1). Those submanifolds can sufficiently characterize the inherent structure of the face appearance manifold. How to extract the submanifolds of each individual and how to utilize them for efficient classification are key issues for face recognition over image sets. Intuitively, when the submanifolds are known, a reasonable solution
to extract the most efficient discriminative information is to find a discriminant function that compresses the points in each submanifold together and, at the same time, separates neighboring submanifolds belonging to different individuals from each other. Yan et al. [7] propose a Marginal Fisher Analysis (MFA) method to extract local discriminative information. In MFA, the intra-class compactness is characterized by preserving the relationships of the k-nearest neighbors of each point in the same class, and the inter-class separability is characterized by maximizing the class margins. However, owing to the asymmetric relationship of k-nearest neighbors, MFA tends to compress the points of an individual together even when they are really far apart in image space. This works against uncovering the significant local structure of appearance manifolds and extracting efficient discriminative information. Motivated by the effectiveness of submanifolds [3, 4, 5, 6], we propose a novel local discriminant embedding method, Discriminant Clustering Embedding (DCE), based on submanifolds, for face recognition over sets. The proposed method combines the effectiveness of submanifolds in characterizing the inherent structure of face appearance manifolds with the discriminant property of discriminant embedding. This is the main contribution of this paper. Specifically, in our framework, the submanifolds, corresponding to local linear subspaces on the entire nonlinear manifold, are first extracted by clustering each subject's image set. Two graphs are then constructed based on each submanifold and its neighbors to locally characterize the intra-class compactness and the inter-class separability, respectively. Finally, the low-dimensional embedding is learned by preserving the neighborhood information within each submanifold and separating the neighboring submanifolds belonging to different subjects from each other. The reason we prefer submanifolds to the k-nearest neighbors of each point used in MFA lies in their appealing property of explicitly characterizing the local structure of nonlinear face manifolds. Extensive experiments on real-world data demonstrate the effectiveness of our method for face recognition with image sets and show that it significantly outperforms the state-of-the-art methods in this area in terms of accuracy. 1.1 Previous Work Most previous work on face recognition over image sets has focused on estimating the densities of mixture models or extracting the submanifolds by clustering, and then utilizing them for the recognition task. These algorithms can be broadly divided into two categories: model-based parametric approaches and clustering-based nonparametric approaches. Among the model-based methods, Frey and Huang [1] use factor analyzers to estimate the submanifolds; the mixture of factor analyzers is estimated by the EM algorithm and the recognition decision is made by Bayes' rule. The manifold density divergence method [2] estimates a Gaussian mixture model for each individual, and the similarity between the estimated models is measured by the Kullback-Leibler divergence. However, when there is not enough training data, or the training data and the new test data do not have strong statistical relationships, it is difficult to estimate the parametric densities properly or to measure the similarity between the estimated densities accurately. Among the clustering-based methods, Hadid et al.
[6] apply k-means clustering to extract the submanifolds in the low-dimensional space learned by locally linear embedding
(LLE). The traditional classification methods are performed on the cluster centers, which are used to represent the local models; obviously, this makes only rough use of the local models. Lee et al. [5] also use the k-means algorithm to approximate the data set; however, their work mainly utilizes the temporal coherence between consecutive images of a video sequence for recognition. Fan and Yeung [3] extract the submanifolds via hierarchical clustering, and the classification task is performed by a dual-subspace scheme based on the neighboring submanifolds belonging to different subjects. Kim et al. [4] propose a discriminant learning method which learns a linear discriminant function that maximizes the canonical correlations of within-class sets and minimizes the canonical correlations of between-class sets. Our work is similar to the methods of [3, 5, 6] in the extraction of submanifolds, but different in the utilization of these submanifolds. By powerful discriminant embedding based on the submanifolds, our method makes full use of the local information embedded in the submanifolds to extract discriminative features. In addition, our algorithm relaxes the constraints on the data distribution assumption and on the total number of available features, which is also important for recognition using image sets. The rest of this paper is organized as follows: Section 2 discusses the local structure of face appearance manifolds in image sets. The discriminant clustering embedding method is presented in Section 3. We show the experimental results on complex image sets in Section 4. Finally, we give our conclusions and future work in Section 5.
(a) First three PCs of subject A embedded by Isomap [14]
(b) First three PCs of subject A and B embedded by Isomap
(c) First three PCs of subject A and B embedded by DCE
Fig. 1. Local structure of face manifolds of subject A (blue asterisks) and B (red plus signs)
2 Local Structure of Face Manifold As usual, we represent a face image as a D-dimensional vector, where D is the number of pixels in each image. Due to the smoothness and regular texture of face surfaces, face images usually lie in or close to a low-dimensional manifold, which is a continuous and smooth distribution embedded in image space. However, due to non-continuous or sparse sampling, it is usually discrete and consists of a number of separate submanifolds (or clusters). Fig. 1 illustrates the distributions of the low-dimensional manifolds corresponding to two subjects' image sets, which are obtained from two video sequences. Fig. 1a shows the first three principal components of image set A, obtained by Isomap [14]. Different submanifolds of image set A correspond to different variations. The images
under similar conditions lie in neighboring locations in image space, so it is possible that different individuals' submanifolds under similar conditions are closer to each other than submanifolds coming from the same individual under completely different conditions. The leading three principal components of two image sets that are sampled under similar conditions from two individuals, A (blue asterisks) and B (red plus signs), are shown in Fig. 1b. Although the two image sets are obtained from two different individuals, there is significant overlap between the two manifolds. The goal of this paper is to utilize these meaningful submanifolds to learn the most powerful discriminant features for the recognition task. Fig. 1c shows the first three principal components of A and B embedded by our DCE; the local discriminant structure of the manifolds is much more obvious. We use two classical clustering methods, k-means and hierarchical clustering [13], to extract the submanifolds of each individual. For k-means, the initial k seeds are selected by a greedy search procedure [5]. For hierarchical clustering, we use the agglomerative procedure, i.e., hierarchical agglomerative clustering (HAC), with the following distance measure:
$$d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \lVert x - x' \rVert, \qquad (1)$$
where $n_i$ and $n_j$ are the numbers of samples in the clusters $D_i$ and $D_j$, respectively.
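For clarity, the average-linkage distance of Eq. (1) can be computed directly as below; this is a small illustrative helper (our own naming), with Di and Dj assumed to be NumPy arrays whose rows are samples. Standard hierarchical clustering routines, e.g. SciPy's average-linkage option, use the same criterion.

```python
import numpy as np

def d_avg(Di, Dj):
    """Eq. (1): mean pairwise Euclidean distance between the samples of
    cluster Di (n_i x D) and cluster Dj (n_j x D)."""
    diff = Di[:, None, :] - Dj[None, :, :]          # all pairwise differences
    return np.sqrt((diff ** 2).sum(-1)).mean()
```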
To evaluate the effectiveness of clustering, we also test a random selection scheme, assigning the samples of each individual to k clusters at random instead of using classical clustering. The performance will be discussed in Section 4.1. The principal angle is used to measure the similarity between two submanifolds that are extracted from different individuals. Principal angles are the angles between two d-dimensional subspaces; recently they have become a popular metric for the similarity of different subspaces [3, 4]. Let L1 and L2 be two d-dimensional subspaces. The cosines of the principal angles $0 \le \theta_1 \le \cdots \le \theta_d \le \pi/2$ between them are uniquely defined as
$$\cos\theta_i = \max_{x_i \in L_1} \; \max_{y_i \in L_2} \; x_i^T y_i, \qquad (2)$$
subject to $\lVert x_i \rVert = \lVert y_i \rVert = 1$ and $x_i^T x_j = y_i^T y_j = 0$ for $i \ne j$. Refer to [3, 4, 12] for the details of the solution to this problem.
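A common way to evaluate Eq. (2), following Bjorck and Golub [12], is to orthonormalize bases of the two subspaces and take the singular values of the product of those bases. The sketch below assumes the subspaces are given as matrices whose columns span them; the names are ours.

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between span(A) and span(B):
    the singular values of Q_A^T Q_B, where Q_A, Q_B are orthonormal
    bases (columns) of the two subspaces."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)   # cos(theta_1) >= ... >= cos(theta_d)
```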
3 Discriminant Clustering Embedding
3.1 Problem Formulation
Given a gallery set
$$X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{D \times N}, \qquad (3)$$
where $x_i$ represents a D-dimensional image vector and N is the number of all the images in the gallery. The label of each image is denoted $y_i \in \{1, 2, \ldots, C\}$. For each class c, containing $n_c$ samples represented as $X_c$, we extract a set of submanifolds:
$$S_c = \{S_{c,i}\}_{i=1}^{sn_c}, \qquad (4)$$
where $sn_c$ is the number of submanifolds of c and typically $n_c \gg sn_c$.
A projection matrix is defined as
$$V = \{v_1, v_2, \ldots, v_d\} \in \mathbb{R}^{D \times d}, \qquad (5)$$
where $D \gg d$, $\lVert v_i \rVert = 1$, and d is the number of features extracted. Our goal is to learn such a projection matrix V by which, after mapping, the high-dimensional data can be properly embedded. The intra-class scatter matrix $S_w$ and the inter-class scatter matrix $S_b$ are defined as
$$S_w = \sum_i \sum_j \lVert v^T x_i - v^T x_j \rVert^2 \, W^w_{i,j} = 2\, v^T X (D^w - W^w) X^T v, \qquad (6)$$
$$S_b = \sum_i \sum_j \lVert v^T x_i - v^T x_j \rVert^2 \, W^b_{i,j} = 2\, v^T X (D^b - W^b) X^T v, \qquad (7)$$
where
$$D^w_{ii} = \sum_{j \ne i} W^w_{ij} \quad \forall i, \qquad (8)$$
$$D^b_{ii} = \sum_{j \ne i} W^b_{ij} \quad \forall i, \qquad (9)$$
and $W^w$ and $W^b \in \mathbb{R}^{N \times N}$ are two affinity matrices (graphs) which encode the neighbor relationships between pairs of intra-class and inter-class points, respectively. How to define them is the critical issue in computing the projection matrix V. The projection matrix V is then obtained as
$$V = \arg\max_V \frac{S_b}{S_w} = \arg\max_V \frac{\lvert V^T X (D^b - W^b) X^T V \rvert}{\lvert V^T X (D^w - W^w) X^T V \rvert}, \qquad (10)$$
which is equivalent to solving the following generalized eigenvalue problem,
$$X (D^b - W^b) X^T V = \lambda\, X (D^w - W^w) X^T V. \qquad (11)$$
$V = \{v_1, v_2, \ldots, v_d\}$ are the generalized eigenvectors associated with the generalized eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d$ of Eq. (11). Given a probe image set $T = \{t_i\}_{i=1}^{p}$ containing p images of an individual whose identity is one of the C subjects in the gallery, the test image set is first mapped onto a low-dimensional space via the projection matrix V, i.e., $T' = V^T T$. Then each test image $t_i$ is classified in the low-dimensional discriminant space by measuring the similarity between $t'_i = V^T t_i$ and each training submanifold $S'_{c,j} = V^T S_{c,j}$. This process is formulated as
$$c^{*} = \arg\max_{c} \; d(t'_i, S'_{c,j}), \qquad (12)$$
where $d(\cdot,\cdot)$ denotes the distance between an image and a linear subspace [11]. Finally, to determine the class of the test image set, we combine the decisions of all test images by a majority scheme.
3.2 Discriminant Clustering Embedding
As discussed in Section 2, images in the sets usually change greatly due to many realistic factors, e.g., large pose or illumination changes. In this case, traditional Fisher Discriminant Analysis (FDA) [9] performs poorly, since it cannot capture the nonlinear variation in face appearance caused by illumination and pose changes. Marginal Fisher Analysis (MFA) [7], which is devised to extract local discriminative information by utilizing marginal information, relaxes the limitations of FDA regarding the data distribution assumption and the number of available discriminative features. However, it still tends to compress points belonging to the same class together even when they are really far apart in image space. This works against uncovering the significant local structure of appearance manifolds and extracting efficient discriminative information. To handle this problem, we propose a novel method, Discriminant Clustering Embedding, which combines the effectiveness of submanifolds in characterizing the inherent structure of face appearance manifolds with the discriminant property of discriminant embedding. Specifically, our algorithm can be summarized in the following four steps (a code sketch of steps 3 and 4 follows this list):
1. Extract a set of submanifolds $S_c$ of each subject as in Eq. (4) by clustering, e.g., k-means or hierarchical agglomerative clustering.
2. Measure the neighbor relationships of each submanifold with all other subjects' submanifolds by some metric, e.g., principal angles, and select its m nearest neighbors.
3. Construct two affinity matrices (graphs) $W^w$ and $W^b$ based on each submanifold and its m nearest neighbors obtained in Step 2, simply written as
$$W^w_{ij} = \begin{cases} 1 & \text{if } x_i, x_j \in S_{c,k}, \\ 0 & \text{otherwise,} \end{cases} \qquad (13)$$
$$W^b_{ij} = \begin{cases} 1 & \text{if } x_i \in S_{c1,k1},\ x_j \in S_{c2,k2},\ c1 \ne c2,\ \text{and } S_{c1,k1} \overset{m}{\sim} S_{c2,k2}, \\ 0 & \text{otherwise,} \end{cases} \qquad (14)$$
where $S_{c1,k1} \overset{m}{\sim} S_{c2,k2}$ means that $S_{c1,k1}$ is among the m nearest neighbors of $S_{c2,k2}$ or $S_{c2,k2}$ is among the m nearest neighbors of $S_{c1,k1}$.
4. Learn the low-dimensional embedding $V_{DCE}$ defined as in Eq. (10) by preserving the neighbor information within each submanifold, and separating the neighboring submanifolds belonging to different subjects from each other.
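The sketch below illustrates steps 3 and 4 under simplifying assumptions that are ours, not the paper's: the samples are the columns of X, each sample carries a subject label and a globally unique submanifold (cluster) id, and the m-nearest-neighbor relation between submanifolds (e.g., from principal angles) is precomputed as a set of id pairs. A small ridge is added so the generalized eigenproblem of Eq. (11) is numerically well posed.

```python
import numpy as np
from scipy.linalg import eigh

def dce_embedding(X, subject, cluster, neighbor_pairs, d):
    """X: D x N data matrix; subject[i], cluster[i]: class and submanifold id
    of sample i; neighbor_pairs: set of (cluster_a, cluster_b) ids judged
    m-nearest.  Returns the D x d projection V of Eq. (10)."""
    N = X.shape[1]
    Ww = np.zeros((N, N))
    Wb = np.zeros((N, N))
    for i in range(N):                      # didactic O(N^2) construction
        for j in range(N):
            if i == j:
                continue
            if cluster[i] == cluster[j]:                      # Eq. (13)
                Ww[i, j] = 1.0
            elif subject[i] != subject[j] and (
                    (cluster[i], cluster[j]) in neighbor_pairs or
                    (cluster[j], cluster[i]) in neighbor_pairs):
                Wb[i, j] = 1.0                                # Eq. (14)
    Lw = np.diag(Ww.sum(1)) - Ww            # D^w - W^w
    Lb = np.diag(Wb.sum(1)) - Wb            # D^b - W^b
    Sw = X @ Lw @ X.T + 1e-6 * np.eye(X.shape[0])   # ridge for stability
    Sb = X @ Lb @ X.T
    evals, evecs = eigh(Sb, Sw)             # generalized eigenproblem, Eq. (11)
    top = np.argsort(evals)[::-1][:d]       # largest generalized eigenvalues
    return evecs[:, top]
```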
With $W^w$ and $W^b$ defined as in Eq. (13) and Eq. (14), $S_w$ and $S_b$ can locally characterize the intra-class compactness and the inter-class separability, respectively. The different
affinity matrices, which determine how efficiently the local discriminative information is extracted, are the essential difference among FDA, MFA and DCE. The mapping function $V_{DCE}$, obtained by solving Eq. (11), compresses the points in each submanifold together and, at the same time, separates each submanifold from its m nearest neighbors that belong to other individuals. It is worth noting that $V_{DCE}$ does not force submanifolds of the same class to be close when they are really far apart in the original image space. After embedding, DCE can therefore preserve the significant inherent local structure of face appearance manifolds and extract the most powerful discriminant features.
4 Experimental Results We use the Honda/UCSD Video Database [5] to evaluate our method. The database contains a set of 52 video sequences of 20 different persons. Each video sequence is recorded indoors at 15 frames per second over duration of at least 20 seconds. Each individual appears in at least two video sequences with the head moving with different combinations of 2-D (in-plane) and 3-D (out-of-plane) rotation, expression changes, and illumination changes (see Fig. 2). A cascaded face detector [10] and a coarse skin color based tracker are used to detect faces automatically. About 90% faces could be detected. The images are resized to a uniform scale of 16×16 pixels. Histogram equalized is the only preprocessing step. For training, we select 20 video sequences, one for each individual, and the rest 32 sequences are for testing. For each individual’s each testing video sequence we randomly select 10 sub-image-sets including 30% faces detected to buildup a set of test image sets. Finally, to determine the class of a test image set, we combine the decisions of all test images in the test image set by a majority scheme. 4.1 Parameters Selection Comparison of Clustering Methods and Number of Clusters. As shown in Fig. 3, random clustering scheme exhibits instability in the relationship between accuracy and the number of clusters, whereas the other two classical clustering schemes both provide near constant accuracies beyond certain points. This is expected because the clusters obtained by random clustering scheme could not characterize the inherent local structure of face appearance manifolds properly. It is also noticeable that the proposed method is less sensitive to the selection between k-means and HAC in the experiment. This may be because both clustering algorithms could find out wellseparated clusters which can properly approximate the distribution of image set. For the two classical clustering procedures, just as expected, the accuracy rises rapidly along with the number of clusters, then increases slowly, and tends to be a constant at last. To characterize the local structure of an appearance manifold properly, it is necessary to obtain a suitable number of clusters. However, with increasing of the number, the accuracy should not drop because more clusters could characterize the local structure more subtly. For convenience, we adopt the HAC procedure and fix the number of clusters at 40 to evaluate our algorithm in the following experiments.
Fig. 2. Examples of image sets used in the experiments
Fig. 3. Comparison of different clustering methods and the effect of the number of clusters on accuracy
Fig. 4. Effect of different parameters on accuracy: (a) the number of selected features and (b) the number of neighborhoods of each submanifold
Number of Features and Number of Neighbors. Fig. 4a shows the accuracy of DCE with respect to the number of features. When the number of features reaches 20, there is a rapid rise and then the recognition rate increases slowly and tends to be a constant. The top accuracy is not achieved at 19 as traditional FDA [9]. In the following comparison experiments, we set the number of features to 50. In Fig. 4b, the recognition rate of DCE with respect to the number of neighbors for each submanifold is shown. Note that the accuracy of DCE is not increasing but asymptotically converging along with the number of neighbors. This is an interesting finding that the larger number of neighbors does not mean the higher accuracy. In extreme case, each submanifold is neighboring to all others that belong to other subjects, which seems to take advantage of all the inter-class separability but not. This shows the effectiveness of neighbor submanifolds characterizing the inter-class separability. The best number of neighbors for DCE is found to be around 5 and fixed at 6 for the following comparison experiments.
4.2 Comparison with Previous Methods We first compare our proposed method with traditional methods, i.e. nearest neighbor (NN) [13], Fisher discriminant analysis (FDA) and marginal Fisher analysis (MFA). All experiments for comparison methods are performed in the original image space. And the final decisions are obtained by majority voting scheme. As shown in Table 1, the local embedding methods (MFA and DCE) achieve better performance than the other two methods. It is obvious that the methods based on local embedding could reveal the significant local structure more efficiently and could extract more powerful discriminative information in the nonlinear face manifolds. Note that FDA achieves the lowest accuracy. This is because the features obtained by FDA could not characterize the various variations of human faces properly. As expected, our method outperforms MFA. The reason is that our DCE compresses only the points in each submanifold that are really neighboring to each other together. However, MFA may impose the faraway points of the same class to be close. This is negative for discovering the proper local structure of face appearance manifolds efficiently and utilizing them to characterize inter-class separability. Table 1. Average accuracy of our DCE and previous methods
Methods                NN      FDA     MFA    DCE     Dual-space   DCC
Recognition rate (%)   95.26   94.37   98.1   99.97   96.34        98.93
We also compare our method with two latest methods, dual-subspace method [3] and discriminant analysis of canonical correlations (DCC) [4], which are devised for recognition over image sets. As shown in Table 1, our discriminant embedding method outperforms the other two methods. While the three methods all take advantage of the submanifolds, the ways of applying them for classification task are completely different. By discriminant embedding, DCE makes full use of the local information embedded in the submanifolds and extracts the most powerful discriminant features for recognition over sets.
5 Conclusions and Future Work We have presented a novel discriminant embedding framework for face recognition with image sets based on submanifolds extracted via clustering. The proposed method has been evaluated on complex face image sets obtained from the Honda/UCSD video database. Extensive experiments show that our method can both sufficiently uncover the local structure of appearance manifolds and efficiently extract local discriminative information by discriminant embedding. It is noticeable that our method is significantly improved by the clustering procedure but is insensitive to the choice of clustering method. The experiments also demonstrate that a larger number of neighbors for each submanifold does not imply higher accuracy. Experimental results show that DCE outperforms the state-of-the-art methods in terms of accuracy.
In the future work, we will further compare our method with the previous relevant methods e.g., dual-subspace method [3] and discriminant analysis of canonical correlations (DCC) [4]. The application of DCE to object recognition is also our interest. In addition, a nonlinear extension of DCE by so-called kernel trick is direct. Acknowledgments. This work was supported by the 973 Program of China (No. 2006CB303105). The authors thank the Honda Research Institute for providing us with Honda/UCSD Video Database.
References 1. Frey, B.J., Huang, T.S.: Mixtures of Local Linear Subspace for Face Recognition. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 32–37. IEEE Computer Society Press, Los Alamitos (1998) 2. Arandjelovic, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.: Face recognition with image sets using manifold density divergence. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 581–588 (2005) 3. Fan, W., Yeung, D.Y.: Locally Linear Models on Face Appearance Manifolds with Application to Dual-Subspace Based Classification. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 1384–1390 (2006) 4. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations. IEEE Trans. PAMI 29(6), 1005–1018 (2007) 5. Lee, K., Ho, J., Yang, M., Kriegman, D.: Visual tracting and recognition using probabilistic appearance manifolds. CVIU 99, 303–331 (2005) 6. Hadid, A., Pietikainen, M.: From still image to videobased face recognition: an experimental analysis. In: 6th IEEE Conf. on Automatic Face and Gesture Recognition, pp. 17–19. IEEE Computer Society Press, Los Alamitos (2004) 7. Yan, S.C., Xu, D., Zhang, B., Zhang, H.J., Yang, Q., Lin, S.: Graph Embedding and Extensions: A General Framework for Dimensionality Reduction. IEEE Trans. PAMI 29(1), 40–51 (2007) 8. Chen, H.T., Chang, H.W., Liu, T.L.: Local Discriminant Embedding and Its Variants. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 846–853 (2005) 9. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. PAMI 19(7), 711–720 (1997) 10. Viola, P., Jones, M.: Robust real-time face detection. IJCV 57(2), 137–154 (2004) 11. Moghaddam, B., Pentland, A.: Probabilistic Visual Learning for Object Representation. IEEE Trans. PAMI 19(7), 696–710 (1997) 12. Bjorck, A., Golub, G.H.: Numerical methods for computing angles between linear subspaces. Mathematics of Computation 27(123), 579–594 (1973) 13. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Chichester (2000) 14. Tenenbaum, J.B., Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(22), 2319–2323 (2000)
Privacy Preserving: Hiding a Face in a Face∗ Xiaoyi Yu and Noboru Babaguchi Graduate School of Engineering, Osaka University, Japan
Abstract. This paper proposes a detailed framework of privacy preserving techniques for real-time video surveillance systems. In the proposed system, the protected video data can be released in such a way that the identity of any individual contained in the video cannot be recognized while the surveillance data remains practically useful, and if the original privacy information is demanded, it can be recovered with a secret key. The proposed system attempts to hide a face (the real face, i.e., the privacy information) in a face (a newly generated face for anonymity). To deal with the huge payload problem of privacy information hiding, an Active Appearance Model (AAM) based privacy information extraction and recovery scheme is proposed. A quantization index modulation based data hiding scheme is used to hide the privacy information. Experimental results show that the proposed system can embed the privacy information into the video without affecting its visual quality or practical usefulness, while allowing the privacy information to be revealed in a secure and reliable way. Keywords: Privacy Preserving, Data Hiding, Active Appearance Model.
1 Introduction Identity privacy is one of the most important civil rights. It is defined as the ability to prevent other parties from learning one’s current identity by recognizing his/her personal characteristics. In recent years, advanced digital camera and computer vision technology, for example, personal digital camera for photography, digital recording of a surgical operation or medical image recording for scientific research, video surveillance etc., have been widely deployed in many circumstances. While these technologies provide many conveniences, they also expose people’s privacy. Although such a concern may not be significant in public spaces such as surveillance systems in metro stations, airports or supermarket, patients who have medical image photographing in hospitals may feel their privacy being violated if their face or other personal information is exposed to the public. In such situations, it is desirable to have a system to balance the disclosure of image/video and privacy such that inferences about identities of people and about privacy information contained in the released image/video cannot reliably be made, while the released data is usable. In the ∗
This work was supported in part by SCOPE from Ministry of Internal Affairs and Communications, Japan and by a Grant-in-Aid for scientific research from the Japan Society for the Promotion of Science.
Y. Yagi et al. (Eds.): ACCV 2007, Part II, LNCS 4844, pp. 651–661, 2007. © Springer-Verlag Berlin Heidelberg 2007
652
X. Yu and N. Babaguchi
literature, many techniques on privacy preserving in image/video system have been proposed [1-9]. These techniques can be divided into several classes as follows: 1) Pixel operations based methods. The operations such as black out, pixelize or blur et al have been used to fade out the sensitive area. In [7], Kitahara et al. proposed an anonymous video capturing system, called “Stealth Vision”, which protects the privacy of objects by fading out their appearance. In [9], Wickramasuriya et al. proposed a similar privacy protecting video surveillance system. Although these systems in some degree fulfil the privacy-protecting goal, it has a potential security flaw because they can not keep a record of the privacy information. In the case of video surveillance when authorized personnel were also involved in some maleficent or even criminal behaviours, the surveillance system should have the ability to provide the original surveillance footage when necessary. The second flaw is that although these systems can keep identity anonymous, the sensitive area is distorted. 2) Cryptography based methods. Dufaux et al. [1, 2] proposed a solution based on transform domain scrambling of regions of interest in a video sequence. In [3], the authors present a cryptographically invertible obscuration method for privacy preserving. Martínez-Ponte et al. [4] propose a method using Motion JPEG 2000 encoding module to encode sensitive data. These methods have the drawbacks that the sensitive area is distorted. 3) Data hiding based methods. Zhang et al. [5] proposed a method of storing privacy information in video using data hiding techniques. Privacy information is not only removed from the surveillance video but also embedded into the video itself. This method solved the problem that the privacy information can not recoverable. The method also has the drawback that the sensitive area is distorted (disappear), and another problem is that it is hard to deal with large privacy information data size for data hiding. 4) Others. Newton et al. [6, 8] addressed the threat associated with face-recognition techniques by ‘de-identifying’ faces in a way that preserves many facial characteristics, however the original privacy information is lost with this method. A desirable privacy protecting system should meet the following requirements: 1. The original privacy information can be recoverable. 2. The privacy preserving image remains practically usable, e.g. one still can see the emotion of processed face images in our proposed system. 3. The identity is anonymous. Almost none of the methods in the literature can fulfill all the 3 requirements. The motivation of our research is to propose a solution to fulfill all of 3 requirements for privacy protection issue. The main contributions of this paper are as follows. z We propose a framework of privacy preserving techniques in real-time video surveillance systems. In our proposed system, the privacy information is not only protected but also embedded into the video itself, which can only be retrieved with a secrete key. z In our system, the identity of any individual contained in video cannot be recognized while the surveillance data remains practically useful. The facial area is distorted in systems such as [1, 2, 5, 7, 9], while it can be seen in our system. z Our proposed method can efficiently solve the huge payload problem of privacy information hiding.
Privacy Preserving: Hiding a Face in a Face
653
In the next section, we discuss the architecture of a privacy-protecting system. In Section 3, the statistical Active Appearance model (AAM) is introduced and AAM based synthesis for anonymity is proposed. In Section 4, privacy information extraction and hiding is proposed. Experimental results on the proposed method are presented in Section 5. Conclusions are made in Section 6.
2 System Description The solution we present is based on face masking and hiding. The architecture is illustrated in Fig. 1 and consists of 3 modules: Training module (shown in the dashed circle in Fig. 1), Encoding module (top in Fig. 1) and Decoding module (bottom of Fig. 1).
Fig. 1. Schematic Diagram of Privacy Preserving System
Before privacy information processing, we need to train a statistical AAM model. With a set of face images, the model is built using the training module. In the encoding procedure, AAM model parameters (privacy information) are obtained by analyzing an unseen input frame with face image using the AAM model built at first. Then, keep a copy of model parameters for the later procedure of hiding. Based on the estimated AAM model parameters, a mask face can be generated, which is different from the original face. An anonymous frame is obtained by the imposing the mask face on the original image. Last, with a secret key, the privacy information is embedded into the anonymous frame using QIM embedding method. The resulting privacy preserving video is then obtained for practically use. For the decoding procedure, the AAM parameters are first extracted using the extraction procedure of QIM data hiding method. With the extracted parameters and AAM model, the original face can be synthesized, and then impose the synthesized face on the privacy preserving frame to get the recovered frame.
654
X. Yu and N. Babaguchi
3 Real-Time Face Masking for Anonymity 3.1 Active Appearance Model and Real-Time Implementation The Active Appearance Model [10, 13] is a successful statistical method for matching a combined model of shape and texture to new unseen faces. First, shapes are acquired through hand placement of fiducial points on a set of training faces; then textures are acquired through piecewise affine image warping to the reference shape and grey level sampling. Both shape and texture are modeled using Principal Component Analysis (PCA). The statistical shape model is given by: (1) s = s + Φ s bs
s is the synthesized shape, Φ s is a truncated matrix and bs is a vector that controls the synthesized shape. Similarly, after computing the mean shape-free texture g and normalizing all textures from the training set relatively to g by scaling and offset of the luminance values, the statistical texture model is given by (2) g = g + Φ t bt
where
where
g i is the synthesized shape-free texture, Φ t is a truncated matrix and bt is a
vector controlling the synthesized shape-free texture. Then combined shape model with texture model by further using PCA. s = s + Qs c g = g + Qt c
(3)
where Qs and Qt are truncated matrices describing the principal modes of combined appearance variations in the training set, and c is a vector of appearance parameters simultaneously controlling the shape and texture. Given a suitably annotated set of example face images, we can construct statistical models ( s , g , Qs and Qt ) of their shape and their patterns of intensity (texture) [10]. Such models can reconstruct synthetic face images using small numbers of parameters c . We implement a real-time AAM fitting algorithm with OpenCV library. Fig. 2 shows the matching procedure on an AAM with only 11 parameters, Fig.2 shows (a) the original face, (b) Initialization of AAM fitting, (c) 2nd iteration (d) the convergence result only after 8 iterations.
(a)
(b)
(c)
(d)
Fig. 2. (a) Original (b) Initialized (c) 2nd iteration (d) Synthesized Faces
Privacy Preserving: Hiding a Face in a Face
655
3.2 AAM Based Face Mask As mentioned in Section 3.1, for a given unseen image X, a shape in the image frame, can be generated by applying a suitable global transformation (such as a similarity transformation) to the points in the model frame. The texture in the image frame is generated by applying a scaling and offset to the intensities generated in the model frame. A full reconstruction is given by generating the texture in a mean shaped patch, then warping it so that the model points lie on the image points. The parameters c in Equation (3) are parameters controlling the appearance of the reconstructed face. If c is obtained, the face features such as eye, nose, mouth can be very easily located. The textile of face patches also can be obtained via Eq. (3). This leads to a method of face masking for identity anonymity based on these AAM parameters: parameter c , shape parameter s and texture parameter g . We have several ways to generate a new face mask for anonymity: 1. Perturbation of AAM parameters. Let ci be elements of c , we apply perturbation to ci as follows:
c~i = ci (1 + vi )
(4)
vi is a manually selected variable. To constrain c~i to plausible values we ~ to: apply limits to each elements c i
where
c~i ≤ 3 λi
where
(5)
λi is the corresponding eigenvalue to c~i .
s and g~ using the following equations: Then reconstruct ~ ~ s = s + Qs c~ g~ = g + Qt c~
(6)
~ is the perturbed vector where s , g , Qs and Qt are the same matrices as in Eq. (3), c ~ .Similar operation can be performed on shape parameter s and with elements of c i
texture parameter g . With ~ s and g~ , we can generate a new face with is quite different from the original one. 2. Replacement of AAM parameters. First the AAM model is applied by analyzing a face to obtain parameter s and g . Then in the reconstruction process, the texture parameter g is directly replaced by other texture parameters of different faces or just by s . 3. Face features based mask face generation. Once parameters c is determined using AAM matching, the shape parameter s , which is just the matched points in the face surface, is determined. Usually these points locate in face features such as eye, nose or mouth corner. Since face features can be accurately located, we can modify, or displace these places with other patches or adding virtual objects on the face. We can use all kinds of masked face for identity anonymity, such as virtual object adding, wearing a mask, adding a Beijing Opera actor's face-painting, and even adding another person’s face. Of course we can also
Of course, we can also use distortion-based methods for identity anonymity, such as a contorted face, blurring, mosaicing, or darkening.
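Below is a small numerical sketch of the parameter-perturbation mask (Equations (4)-(6)); it is an illustration under simplified assumptions, not the authors' implementation. The eigenvalues lam stand in for those of a trained appearance model, clipping is used here as one way to enforce the limit of Equation (5), and the reconstruction step of Equation (6) is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(1)
n_params = 24
c = rng.normal(size=n_params)                          # appearance parameters of the analyzed face
lam = np.sort(rng.uniform(0.5, 5.0, n_params))[::-1]   # stand-in eigenvalues of the appearance model

def perturb(c, v, lam):
    """Eq. (4): scale each element; Eq. (5): keep each element within +/- 3*sqrt(lambda_i)."""
    c_tilde = c * (1.0 + v)                            # v may be a scalar or a per-element vector
    limit = 3.0 * np.sqrt(lam)
    return np.clip(c_tilde, -limit, limit)

c_tilde = perturb(c, v=-2.0, lam=lam)
# Eq. (6): feeding c_tilde through s~ = s_mean + Q_s c~ and g~ = g_mean + Q_t c~
# (with the matrices of Eq. (3)) yields a mask face different from the original.
print(np.round(c_tilde[:5], 3))
```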
4 Privacy Extraction and Hiding

4.1 Privacy Information Extraction

The privacy information that we try to protect and hide in our proposed system is the detected facial area. It is natural to think of cutting out the facial area and using data hiding techniques to hide it in the image, which is the method proposed in [5]. However, this method is impracticable when the facial area is large. For example, in our experiment the facial area size is about 140x140 pixels while the frame size is 320x240. The privacy information in the system is then over 50000 bits even after compression, which is much larger than the privacy information size (3000 bits) in [5]. Most current image data hiding techniques cannot embed such a large amount of information in an image, so imperceptibly hiding this large amount of privacy information in the image is a great challenge. In Section 3, we mentioned that the parameters c in Equation (3) control the appearance of the reconstructed image. In the scenario where the two communicating sides share the same AAM model, the parameters c can be regarded as the privacy information. With the parameters c alone, however, we can only generate a shape-free face; to recover the original frame we also need the face pose parameters, which include the position translation, rotation and scaling. Therefore the parameters c and the pose parameters together constitute the privacy information in our proposed system, and we hide them for privacy protection. The privacy information size decreases greatly compared with the method in [5].

4.2 Privacy Information Hiding and Recovering

Data hiding [11] has been widely used in copyright protection, authentication and secure communication applications, in which data (for example, a watermark or secret information) are embedded in a host image and later retrieved for ownership protection, authentication or secure communication. Most data hiding methods in the literature are based on the human visual system (HVS) or a perceptual model to guarantee minimal perceptual distortion. Due to the popularity of perceptual models based on the Discrete Cosine Transform (DCT), we adopt the DCT perceptual model described in [11]; the same model is used in [5]. With this perceptual model, we can compute a perceptual mask value for each DCT coefficient in the image and sort these values. Then, with a secret key to determine the embedding locations and combined with these perceptual mask values, we use a special case of Quantization Index Modulation (QIM), the odd-even method [11], for data hiding in our implementation. The modified DCT coefficients are finally assembled into a privacy preserving image. Ordinary users can view only this image. An authorized user, who may see the privacy information, needs a decoding procedure to view the privacy-recovered video. The decoding procedure is shown in Fig. 1. With the secret key, the decoder can determine where the privacy information has been embedded. After extracting the privacy information (AAM
parameters), together with the AAM model, the original frame can be recovered with the privacy information.
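To make the odd-even flavour of QIM concrete, here is a toy sketch, not the authors' implementation: bits are hidden in key-selected DCT coefficients by quantizing each coefficient to an even or odd multiple of a fixed step, and the perceptual-mask weighting of [11] is omitted. The step size, the key handling and the coefficient layout are all simplifying assumptions.

```python
import numpy as np

def qim_embed(coeffs, bits, key, step=8.0):
    """Quantize selected coefficients to an even (bit 0) or odd (bit 1) multiple of step."""
    out = coeffs.copy()
    rng = np.random.default_rng(key)
    idx = rng.choice(coeffs.size, size=len(bits), replace=False)   # key-driven locations
    for i, b in zip(idx, bits):
        q = np.round(out[i] / step)
        if int(q) % 2 != b:                     # force the parity to match the bit
            q += 1 if out[i] >= q * step else -1
        out[i] = q * step
    return out

def qim_extract(coeffs, n_bits, key, step=8.0):
    rng = np.random.default_rng(key)
    idx = rng.choice(coeffs.size, size=n_bits, replace=False)
    return [int(np.round(coeffs[i] / step)) % 2 for i in idx]

dct = np.random.default_rng(2).normal(scale=30.0, size=64)   # stand-in DCT block coefficients
bits = [1, 0, 1, 1, 0]
marked = qim_embed(dct, bits, key=42)
assert qim_extract(marked, len(bits), key=42) == bits
```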
5 Experimental Results

To evaluate the proposed system, three experiments were performed. The first experiment tests the performance of selectively perturbing AAM parameters for identity anonymity. The second tests the performance of face-feature-based mask face generation for anonymity. The last evaluates the proposed real-time privacy preserving video surveillance system: hiding a face in a face.

For selective perturbation of AAM parameters, we use a publicly available data set [12]. First, an appearance model is trained on a set of 35 labeled faces. This set [12] contains 40 people; we leave 5 images for the test. Each image was hand-annotated with 58 landmark points on the key features. From these data a combined appearance model (Model-I) is built.

Fig. 3. Original Face and Anonymous Face

The vector c (24 parameters) in Equation (3) is obtained by analyzing a target input face (shown in the top left of Fig. 3) using Model-I. The vector c is perturbed using Equation (4) to obtain the vector c̃. Combining c̃ and Model-I, we generate an anonymous new face. Fig. 3 shows the experimental results for two faces: the top left and bottom left show the original faces, and the top right and bottom right are the generated anonymous faces. In our experiment we set v_i = −2 in Equation (4); readers can set other values to generate an anonymous face by an "adjust and look" method.

For the second experiment, we used a newly generated face set for model training. This set contains 45 face images of a single person with different poses, illuminations and expressions. Each image was manually annotated with 54 landmark points on the key features. From this data set a combined appearance model (Model-II) is built. We use Model-II to evaluate the face-feature-based mask method, generating mask faces using method 3 described in Section 3.2. Fig. 4 shows the original face and the experimental results for all kinds of masked faces for identity anonymity: (a) the original face, (b) virtual beard addition, (c) virtual beard, (d) a contorted face, (e) blur, (f) mosaic, (g) adding another person's face, (h) adding a Beijing Opera actor's face-painting, (i) dark. Fig. 4(b) and (c) show a deformable beard added to the video sequences. From the results, we observe that the beard deforms along with the expression changes, and the added virtual objects are tightly overlaid on the subject. The masked Beijing Opera actor's face-painting and the added faces of other people are also deformable.
Fig. 4. Original Face and Anonymous Faces
From the above two experiments, we can see that the proposed method can generate many kinds of anonymous faces, which fulfill the privacy-protection requirement: useful information is released in such a way that the identity of any individual or entity contained in the image/video cannot be recognized while the image/video remains practically useful.
Fig. 5. Experimental Results of the Real-Time Privacy Preserving System
The third experiment evaluates the proposed real-time privacy preserving video surveillance system. We train a model (Model-III) on a set of 25 × 6 labeled faces. This set contains 30 people. Each person has 6 face images with
Fig. 6. Replaced face
Fig. 7. PSNR vs. Frame No
different poses and expressions. We leave 10 × 6 images for the test. Each image was manually annotated with 50 landmark points. For each input video frame of the video surveillance system, the vector c (30 parameters) in Equation (3) is obtained by analyzing the input frame (shown in the top row of Fig. 5) using Model-III. Combining c̃, s, g and Model-III, we generate mask faces using method 2 described in Section 3.2. Fig. 5 shows the experimental result of replacing the face texture of one person with the texture of another person: the top row of Fig. 5 shows the input frames of a person, and the second row shows the results generated by replacing the texture using Model-III and the shape parameters. The frontal view of the replaced face is shown in Fig. 6. Due to limited space, other results, such as replacing the texture parameter by s, are not shown here. We then come to the embedding procedure. Since MPEG-2 is the most widespread video format, it is used to compress all the video sequences in our experiments. The privacy information is embedded into the generated I frames of the MPEG-2 stream using the method described in Section 4.2, yielding the privacy preserving I frames. For ordinary users, the system provides only these frames. For authorized users, the original face should be recovered; we use the data extraction procedure discussed in Section 4.2 to recover the original faces (shown in the last row of Fig. 5). Fig. 7 plots the Peak Signal-to-Noise Ratio (PSNR) of the recovered frames compared with the original frames. The error would decrease if a smaller convergence threshold were set; however, a small convergence threshold would slow down convergence, so there is a tradeoff between accuracy and efficiency. The above experiments demonstrate the effectiveness, efficiency and reliability of our proposed system. The main advantages of our system are that it retains a record of the privacy information and offers versatile treatment of the sensitive area. We now compare our system with systems in the literature in Table 1. In Table 1, the column "Anonymity" shows whether the compared system can protect the identity's privacy, the column "Distortion" shows whether the privacy area is distorted, and the column "Recoverability" shows whether the original privacy
Table 1. Comparison between our proposed system and methods in the literature

                                          Anonymity  Distortion  Recoverability
Pixel operations based methods [7, 9]     Yes        Yes         No
Coding based methods [1, 2, 3, 4]         Yes        Yes         No
Zhang et al.'s method [5]                 Yes        Yes         Yes
Newton et al.'s method [6, 8]             Yes        No          No
Our proposed system                       Yes        No          Yes
information is recoverable. From the table, we can see that our proposed method outperforms all the methods listed. Among all methods in the literature, only Zhang et al.'s method can recover the original privacy information; however, that method has the drawback that the encoded image/video is probably not practically useful. Another drawback of [5] is that the required data hiding capacity is large. For example, in our experiments the privacy information amounts to only the 30 elements of c plus 3 pose parameters, about 720 bits in our implementation, and it can be further compressed to an even smaller size, whereas hiding the privacy information with the method of [5] takes 50000 bits. The privacy information size of method [5] is thus about 70 times as large as that of our method.
6 Conclusions

In this paper, we have presented a privacy preserving system, hiding a face in a face, in which the identity in the released data is anonymous, the released data remain practically useful, and the privacy information can be hidden in the surveillance data and retrieved later. Effective methods for identity anonymity and for privacy information extraction and hiding have been proposed to hide all the privacy information in the host image with minimal perceptual distortion. The proposed approach does not take advantage of 3D information for AAM matching; in the future, AAM convergence accuracy will be improved by training a 3D AAM with aligned 3D or 2D shapes. Another limitation of the current system is that the generated facial expression cannot vary as the subject's expression changes; future research will therefore focus on identity anonymity with facial expression. Our proposed solution focuses on privacy protection for video surveillance systems, but it can also be used in many other applications, such as privacy protection of news images or newsreels in journalism. Although the proposed method is demonstrated on facial privacy protection, it is a framework that can be applied to virtually any type of privacy protection, for example with the human body as the privacy information.
References 1. Dufaux, F., Ouaret, M., Abdeljaoued, Y., Navarro, A., Vergnenegre, F., Ebrahimi, T.: Privacy Enabling Technology for Video Surveillance. In: Proc. SPIE, vol. 6250 (2006) 2. Dufaux, F., Ebrahimi, T.: Scrambling for Video Surveillance with Privacy. In: Proc. IEEE Workshop on Privacy Research In Vision, IEEE Computer Society Press, Los Alamitos (2006)
3. Boult, T.E.: PICO: Privacy through Invertible Cryptographic Obscuration. In: IEEE/NSF Workshop on Computer Vision for Interactive and Intelligent Environments (2005) 4. Martinez-Ponte, I., Desurmont, X., Meessen, J., Delaigle, J.-F.: Robust Human Face Hiding Ensuring Privacy. In: Proc. Int'l. Workshop on Image Analysis for Multimedia Interactive Services (2005) 5. Zhang, W., Cheung, S.S., Chen, M.: Hiding Privacy Information In Video Surveillance System. In: Proceedings of ICIP 2005, Genova, Italy (September 11-14, 2005) 6. Newton, E., Sweeney, L., Malin, B.: Preserving Privacy by De-identifying Facial Images. IEEE Transactions on Knowledge and Data Engineering 17(2), 232–243 (2005) 7. Kitahara, I., Kogure, K., Hagita, N.: Stealth Vision for Protecting Privacy. In: Proc. of 17th International Conference on Pattern Recognition, vol. 4, pp. 404–407 (2004) 8. Newton, E., Sweeney, L., Malin, B.: Preserving Privacy by De-identifying Facial Images, Technical Report CMU-CS-03-119 (2003) 9. Wickramasuriya, J., Datt, M., Mehrotra, S., Venkatasubramanian, N.: Privacy Protecting Data Collection in Media Spaces. In: ACM International Conference on Multimedia, New York (2004) 10. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001) 11. Cox, I.J., Miller, M.L., Bloom, J.A.: Digital watermarking. Morgan Kaufmann Publishers, San Francisco (2002) 12. http://www.imm.dtu.dk/~aam/datasets/face_data.zip 13. Stegmann, M.B., Ersboll, B.K., Larsen, R.: FAME - A Flexible Appearance Modelling Environment. IEEE Transactions on Medical Imaging 22(10), 1319–1331 (2003)
Face Mosaicing for Pose Robust Video-Based Recognition

Xiaoming Liu1 and Tsuhan Chen2

1 Visualization and Computer Vision Lab, General Electric Global Research, Schenectady, NY, 12309
2 Advanced Multimedia Processing Lab, Carnegie Mellon University, Pittsburgh, PA, 15213
Abstract. This paper proposes a novel face mosaicing approach to modeling human facial appearance and geometry in a unified framework. The human head geometry is approximated with a 3D ellipsoid model. Multi-view face images are back projected onto the surface of the ellipsoid, and the surface texture map is decomposed into an array of local patches, which are allowed to move locally in order to achieve better correspondences among multiple views. Finally, the corresponding patches are used to train a facial appearance model, and a deviation model obtained from the patch movements is used to model the face geometry. Our approach is applied to pose robust face recognition. Using the CMU PIE database, we show experimentally that the proposed algorithm provides better performance than the baseline algorithms. We also extend our approach to video-based face recognition and test it on the Face In Action database.
1 Introduction
Face recognition is an active topic in the vision community. Although many approaches have been proposed for face recognition [1], it is still considered a hard and unsolved research problem. The key to a face recognition system is to handle all kinds of variations through modeling. There are different kinds of variations, such as pose, illumination and expression, among which pose variation is the hardest and contributes more recognition errors than the others [2]. In the past decade, researchers have mainly modeled each variation separately. For example, by assuming constant illumination and a frontal pose, expression-invariant face recognition approaches have been proposed [1]. However, although most of these approaches perform well for a specific variation, their performance degrades quickly when multiple variations are present, which is the case in real-world applications [3]. Thus, a good recognition approach should be able to model different kinds of variations in an efficient way. For human faces, most prior modeling work targets facial appearance using various pattern recognition tools, such as Principal Component Analysis (PCA) [4], Linear Discriminant Analysis [5], and Support Vector Machines [5].
The work presented in this paper was performed in the Advanced Multimedia Processing Lab, Carnegie Mellon University.
Fig. 1. Geometric mapping
Fig. 2. Up to 25 labeled facial features
On the other hand, except for 3D face recognition, the human face geometry/shape is mostly overlooked in face recognition. We believe that, similar to facial appearance, the face geometry is also a unique characteristic of a human being. Face recognition can benefit if we can properly model the face geometry, especially when pose variation is present. This paper proposes a face mosaicing approach to modeling both the facial appearance and geometry, and applies it to face recognition. It extends the idea introduced in [6,7] by approximating the human head with a 3D ellipsoid. As shown in Fig. 1, a face image from an arbitrary view can be back projected onto the surface of the 3D ellipsoid, resulting in a texture map. In modeling based on multi-view facial images, multiple texture maps are combined, and the same facial feature, such as a mouth corner, from multiple maps might not correspond to the same coordinate on the texture map. Hence a blurring effect, which is normally not a good property for modeling, is observed. To reduce such blurring, the texture map is decomposed into a set of local patches, and patches from multi-view images are allowed to move locally to achieve better correspondences. Since the amount of movement indicates how much the actual head geometry deviates from the ellipsoid, a deviation model trained from the patch movements models the face geometry. The corresponding patches are also used to train a facial appearance model. Our mosaic model is composed of both models together with a probabilistic model Pd that learns the statistical distribution of the distance measure between a test patch and the patch model [8]. Our face mosaicing approach makes a number of contributions. First, pose variation, the hardest variation, is handled naturally by mapping images from different view angles to form the mosaic model, whose mean image can be treated as a compact representation of faces under various view angles. Second, all other variations that cannot be modeled by the mean image, for example illumination and expression, are taken care of by a number of eigenvectors. Therefore, instead of modeling only one type of variation, as done in conventional methods, our method models all possible appearance variations under one framework. Third, a simple geometric assumption is problematic because the head geometry is not truly an ellipsoid; this is taken care of by training a geometric deviation model, which results in better correspondences across multiple views. There is much prior work on face modeling [9,10]. Among them, Blanz and Vetter's approach [9] is one of the most sophisticated and has also been applied to face recognition; two subspace models are trained for facial texture and
shape respectively. Given a test image, they fit it with the two models by tuning the models' coefficients, which are eventually used for recognition. Intuitively, better modeling leads to better recognition performance. However, more sophisticated modeling also makes model fitting much more difficult; for example, both training and test images are manually labeled with 6 to 8 feature points in [9]. On the other hand, we believe that, unlike rendering applications in computer graphics, we might not need a very sophisticated geometric model for recognition applications. The benefit of a simpler face model is that model fitting tends to be easier and automatic, which is the goal of our approach.
2 Modeling the Geometric Deviation
To reduce the blurring issue in combining multiple texture maps, we obtain a better facial feature alignment by relying on landmark points. For model training, it is reasonable to manually label such landmark points. Given K multi-view training facial images {f_k}, we first label the positions of the facial feature points. As shown in Fig. 2, 25 facial feature points are labeled; for each training image, only the subset of the 25 points that is visible is labeled. We call these points key points. Second, we generate the texture map s_k from each training image and compute the key points' corresponding coordinates b_k^i (1 ≤ i ≤ 25) in the texture map s_k, as shown in Fig. 3. Furthermore, we would like to find the coordinate on the mosaic model toward which all corresponding key points deviate. Ideally, if the human head were a perfect 3D ellipsoid, the same key point b_k^i (1 ≤ k ≤ K) from multiple training texture maps would correspond exactly to the same coordinate. However, because the human head is not a perfect ellipsoid, these key points deviate from each other, and the amount of deviation is an indication of the difference between the actual head geometry and the ellipsoid. Third, we compute the averaged position b̄^i of all visible key points b_k^i (1 ≤ k ≤ K) that correspond to the same facial feature. We treat this average, shown in the 3rd row of Fig. 3, as the target position in the final mosaic model toward which all corresponding key points should move. Since our resulting mosaic model is composed of an array of local patches, each of the 25 averaged key points falls into one particular patch, namely a key patch. Fourth, for each texture map, we take the difference between the position of the key point b_k^i and that of the averaged key point b̄^i as the key patch's deviation flow (DF), which describes which patch from each texture map should move toward that key patch in the mosaic model. However, there are also non-key patches in the mosaic model. As shown in Fig. 4, we represent the mosaic model as a set of triangles whose vertices are the key patches. Since each non-key patch falls into at least one triangle, its DF is interpolated from the key patches' DFs. For each training texture map, the geometric deviation is a 2D vector map v_k, whose dimension equals the number of patches in the vertical and horizontal directions, and each element is one patch's DF. Note that for any training texture map, some elements of v_k are considered missing. Finally, the deviation model
Fig. 3. Averaging key points: the position of key points in the training texture maps (2nd row), which correspond to the same facial feature are averaged and result in the position in the final model (3rd row)
Fig. 4. Computation of patch’s DF: each non-key patch falls into at least one triangle; the deviation of a non-key patch is interpolated by the key patch deviation of one triangle
θ = {g, u} is learned from the geometric deviations {v_k} of all training texture maps using robust PCA [11], where g and u are the mean and the eigenvectors respectively. Essentially, this linear model describes all possible geometric deviations, for any view angle, of this particular subject's face.
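The sketch below illustrates, under simplified assumptions, how the deviation model could be trained: key-point positions are averaged over the training texture maps, per-map deviation flows are formed, and a linear model is fitted. Ordinary PCA via SVD stands in for the robust PCA of [11], the key-point coordinates are synthetic, and missing points and the triangle-based interpolation of non-key patches are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n_keys = 8, 25                                  # training maps and key points (illustrative)
# b[k, i] is the 2D texture-map coordinate of key point i in training map k.
b = rng.normal(loc=50.0, scale=3.0, size=(K, n_keys, 2))

b_mean = b.mean(axis=0)                            # target position of each key point
v = (b - b_mean).reshape(K, -1)                    # deviation flow per map, flattened to 2*n_keys

# PCA over the K deviation vectors (ordinary PCA standing in for robust PCA [11]).
g = v.mean(axis=0)
U, S, Vt = np.linalg.svd(v - g, full_matrices=False)
u = Vt[:3]                                         # keep the leading eigenvectors

# Any coefficient vector a now describes a plausible deviation field: g + a @ u.
a = np.array([1.0, -0.5, 0.2])
deviation_field = (g + a @ u).reshape(n_keys, 2)
print(deviation_field.shape)                       # (25, 2)
```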
3 Modeling the Appearance
After modeling the geometric deviation, we need to build an appearance model, which describes the facial appearance for all poses. The left-hand side of Fig. 5 shows two pairs of training texture maps s_k and their corresponding geometric deviations v_k. The resulting appearance model Π = {m, V}, with one mean and two eigenvectors, is shown on the right-hand side. This appearance model is composed of an array of eigenspaces, each devoted to modeling the appearance of the local patch indexed by (i, j). In order to train one eigenspace for one particular patch, the key issue is to collect one corresponding patch from each training texture map s_k, where the correspondence is specified by v_k. For example, for the patch at (40, 83), the sum of this coordinate and the geometric deviation v_{40,83}^1 determines the center of the corresponding patch, s_{40,83}^1, in the texture map s_1. Using the same procedure, we find the corresponding patches s_{i,j}^k (2 ≤ k ≤ K) from all other texture maps; note that some of them might be considered missing patches. Finally, the set of corresponding patches is used to train a statistical model Π_{i,j} via PCA. We call this array of PCA models the patch-PCA mosaic. Modeling via PCA is popular when the number of training samples is large. However, when the number of training samples is small, as when training an individual mosaic model with only a few samples, it might not be suitable to train one PCA model for each patch. Instead, we train a universal PCA model based on all corresponding patches of all training texture maps, and also keep the coefficients of these patches in the universal PCA model. This is
Fig. 5. Appearance modeling: the deviation indicates the corresponding patch for each of training texture maps; all corresponding patches are treated as samples for PCA
Fig. 6. The mean images of two mosaic models without geometric deviation (top) and with geometric deviation (bottom)
called the global-PCA mosaic. Note that the patch-PCA mosaic and the global-PCA mosaic differ only in how the corresponding patches across training texture maps are used to form a model, depending on the availability of training data in different application scenarios. Eventually, the statistical mosaic model includes the appearance model Π, the geometric deviation model θ and the probabilistic model Pd. We consider that the geometric deviation model plays a key role in training the mosaic model. For example, Fig. 6 shows the mean images of two mosaic models trained with the same set of images from 10 subjects. It is obvious that the mean image on the bottom is much less blurred and captures more useful information about facial appearance. Note that this mean image covers a much larger facial area compared with the upper-right illustration of Fig. 5, since extrapolation is performed while computing the geometric deviations of non-key patches.
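The following sketch shows, for one patch index, how a patch-PCA eigenspace could be trained: each training texture map contributes the patch whose centre is shifted by that map's geometric deviation, and PCA is run on the collected patches. The texture maps, the deviations, the patch at (40, 83) and all sizes are synthetic stand-ins, and missing-patch handling is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
K, H, W, P = 6, 90, 180, 4                       # maps, texture-map size, patch size (illustrative)
maps = rng.normal(size=(K, H, W))                # training texture maps s_k
center = np.array([40, 83])                      # patch index used as the running example
dev = rng.integers(-3, 4, size=(K, 2))           # per-map geometric deviation for this patch

def crop(img, cy, cx, p):
    return img[cy - p // 2: cy + p // 2, cx - p // 2: cx + p // 2]

# Collect the corresponding (deviated) patch from every training map.
patches = np.stack([
    crop(maps[k], center[0] + dev[k, 0], center[1] + dev[k, 1], P).ravel()
    for k in range(K)
])

# Per-patch PCA model Pi_{i,j}: mean plus leading eigenvectors.
mean = patches.mean(axis=0)
_, _, Vt = np.linalg.svd(patches - mean, full_matrices=False)
eigvecs = Vt[:2]
print(mean.shape, eigvecs.shape)                 # (16,) (2, 16)
```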
4 Face Recognition Using the Statistical Mosaic Model
Given L subjects with K training images per subject, an individual statistical mosaic model is trained for each subject. For simplicity, let us assume we have enough training samples and obtain the patch-PCA mosaic for each subject; we discuss the case of the global-PCA mosaic at the end of this section. We now describe how to utilize this model for pose robust face recognition. As shown in Fig. 7, given one test image, we generate its texture map using the universal mosaic model, which is trained from multi-view images of many subjects. Then we measure the distance between the test texture map and each of the trained individual mosaic models, namely the map-to-model distance. Note that the appearance model is composed of an array of patch models, each of which is called a reference patch. Hence, the map-to-model distance equals the
summation of the map-to-patch distances. That is, for each reference patch, we find its corresponding patch in the test texture map and compute its distance to the reference patch. Since we deviated the corresponding patches during the training stage, we should do the same while looking for the corresponding patch in the test stage. One simple approach is to search for the best corresponding patch for each reference patch within a search window; however, this does not impose any constraint on the deviations of neighboring reference patches. To solve this issue, we make use of the deviation model trained before. As shown in Fig. 7, if we sample one coefficient vector of the deviation model, the linear combination it defines describes the geometric deviation for all reference patches. Hence, the key is to find the coefficients that provide the optimal matching between the test texture map and the model. In this paper, we adopt a simple sequential searching scheme: in a K-dimensional deviation model, we uniformly sample multiple values along the 1st dimension while the coefficients for the other dimensions are zero, and determine the one that results in the maximal similarity between the test texture map and the model. The range of sampling is bounded by the coefficients of the training geometric deviations. We then perform the same search along the 2nd dimension while fixing the optimal value for the 1st dimension and zero for all other dimensions, and so on until the K-th dimension. Basically, our approach enforces the geometric deviations of neighboring patches to follow a constraint described by the bases of the deviation model. For each sampled coefficient, the reconstructed 2D geometric deviation (in the bottom-left of Fig. 7) indicates where to find the corresponding patches in the test texture map. Then the residue between the corresponding patch and the reference patch model is computed and fed into the probabilistic model [8]. The probabilistic measurement tells how likely it is that this corresponding patch belongs to the same subject as the reference patch. By doing the same operation for all other reference patches and averaging all patch-based probabilistic measurements, we obtain the similarity between this test texture map and the model for the current sampled coefficient. Finally, the test image is recognized as the subject who provides the largest similarity. Depending on the type of the mosaic model (patch-PCA or global-PCA), there are different ways of calculating the distance between the corresponding patch and the reference patch model. For the patch-PCA mosaic, the residue with respect to the reference patch model is used as the distance measure. For the global-PCA mosaic, since one reference patch model is represented by a number of coefficients, the distance measure is defined by the nearest neighbor of the corresponding patch among all these coefficients.
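A schematic sketch of the sequential search over the deviation-model coefficients: each dimension is sampled in turn while the remaining dimensions stay at their current values (zero until visited), and the value maximizing a similarity score is kept. The similarity function below is a dummy placeholder; in the paper it would be the averaged patch-based probabilistic measurement, and the bounds would come from the training coefficients.

```python
import numpy as np

def sequential_search(similarity, bounds, n_samples=11):
    """Optimize one deviation-model coefficient at a time (coordinate-wise search)."""
    K = len(bounds)
    coeff = np.zeros(K)
    for k in range(K):
        lo, hi = bounds[k]                              # range bounded by the training coefficients
        candidates = np.linspace(lo, hi, n_samples)
        scores = []
        for value in candidates:
            trial = coeff.copy()
            trial[k] = value
            scores.append(similarity(trial))
        coeff[k] = candidates[int(np.argmax(scores))]   # fix this dimension, move to the next
    return coeff

# Dummy similarity standing in for the map-to-model matching score.
target = np.array([0.7, -1.2, 0.4])
similarity = lambda a: -np.sum((a - target) ** 2)
best = sequential_search(similarity, bounds=[(-2, 2)] * 3)
print(np.round(best, 2))                                # close to the target under this toy score
```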
5 Video-Based Face Recognition
There are two schemes for recognizing faces from video sequences: image-based recognition and video-based recognition. In image-based recognition, usually the
Fig. 7. The map-to-patch distance: the geometric deviation indicates the patch correspondence between the model and the texture map; the distances of corresponding patches are fed into the Bayesian framework to generate a probabilistic measurement
face area is cropped before being fed to a recognition system. Thus image-based face recognition involves two separate tasks: face tracking and face recognition. In our face mosaicing algorithm, given one video frame, the most important task is to generate a texture map and compare it with the mosaic model. Since the mapping parameter x, a 6-dimensional vector describing the 3D head location and orientation [7], contains all the information for generating the texture map, face tracking is equivalent to estimating the x that results in the maximal similarity between the texture map and the mosaic model. We use the condensation method [12] to estimate the mapping parameter x. In image-based recognition, for a face database with L subjects, we build the individualized model for each subject based on one or multiple training images. Given a test sequence and one specific model, a distance measurement is calculated for each frame by face tracking, and averaging the distances over all frames provides the distance between the test sequence and that model. After the distances between the sequence and all models are calculated, comparing these distances provides the recognition result for the sequence. In video-based face recognition, the two tasks of face tracking and recognition are usually performed simultaneously. Zhou et al. [13] propose a framework that combines face tracking and recognition using the condensation method; they propagate a set of samples governed by two parameters: the mapping parameter and the subject ID. We adopt this framework in our experiments.
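A highly simplified condensation-style (particle filter) sketch for estimating the mapping parameter x frame by frame; the likelihood below is a dummy stand-in for the similarity between the texture map generated under x and the mosaic model, and the noise level, particle count and synthetic "true" pose are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def condensation_step(particles, weights, likelihood, noise_std=0.05):
    """One propagate/weight/resample step over the 6-D mapping parameter x."""
    n = len(particles)
    # Resample according to the current weights, then diffuse with Gaussian noise.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx] + rng.normal(scale=noise_std, size=particles.shape)
    # Re-weight with the frame likelihood (texture-map-to-model similarity in the paper).
    weights = np.array([likelihood(x) for x in particles])
    return particles, weights / weights.sum()

true_x = np.array([0.1, -0.2, 0.05, 0.3, 0.0, 0.1])           # synthetic "true" head pose
likelihood = lambda x: np.exp(-20.0 * np.sum((x - true_x) ** 2))

particles = rng.normal(scale=0.5, size=(200, 6))
weights = np.full(200, 1.0 / 200)
for _ in range(10):                                           # ten video frames
    particles, weights = condensation_step(particles, weights, likelihood)
print(np.round(weights @ particles, 2))                       # weighted estimate of x
```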
6 Experimental Results
We evaluate our algorithm on pose robust face recognition using the CMU PIE database [14]. We use half of the subjects (34 subjects) in PIE for training the probabilistic model. The 9 pose images per subject from the remaining 34 subjects are used for the recognition experiments.
Fig. 8. (a) Sample Images of one subject from the PIE database. (b) Mean images of three individual mosaic models. (c) Recognition performances of four algorithms on the CMU PIE database based on three training images.
Sample images and the pose labels for one subject in PIE are shown in Fig. 8(a). Three poses (c27, c14, c02) are used for training, and the remaining 6 poses (c34, c11, c29, c05, c37, c22) are used for testing with four algorithms. The first is the traditional eigenface approach [4]; we perform manual cropping and normalization for both training and test images, test with different numbers of eigenvectors, and plot the one with the best recognition performance. The second is the eigen light-field algorithm [15] (one frontal training image per subject). The third algorithm is our face mosaic method without the modeling of geometric deviation, which essentially sets the mean and all eigenvectors of θ = {g, u} to zero. The fourth algorithm is the face mosaic method with the modeling of geometric deviation. Since the number of training images is small, we train the global-PCA mosaic for each subject; three eigenvectors are used in building the global-PCA subspace, so each reference patch from the training stage is represented as a 3-dimensional vector. For the face mosaic method, the patch size is 4 × 4 pixels and the size of the texture map is 90 × 180 pixels. For illustration purposes, we show the mean images of three subjects in Fig. 8(b). Fig. 8(c) shows the recognition rate of the four algorithms for each specific pose. Comparing these four algorithms, both of our algorithms work better than the baseline algorithms. The mosaic approach obviously provides a better way of registering multi-view images for enhanced modeling than the naive training procedure of the traditional eigenface approach. Between our two algorithms, the one with deviation modeling performs better than the one without. There are at least two benefits to the former: a geometric model can be used in the test stage, and, as a result of deviation modeling, the patch-based appearance model better captures the personal characteristics of the multi-view facial appearance in a non-blurred manner. We perform video-based face recognition experiments on the Face In Action (FIA) database [16], which mimics the "passport checking" scenario. Multiple cameras capture the whole process of a subject walking toward the desk, standing in front of the desk, making simple conversation and head motion, and finally walking away from the desk. Six video sequences are captured from six calibrated cameras simultaneously for 20 seconds at 30 frames per second.
Fig. 9. (a) 9 training images from one subject in the FIA database. (b) The mean images of the individual models in two methods (left: Individual PCA, right: mosaicing).

Table 1. Recognition error rate of different algorithms

                       PCA      Mosaic
image-based method     17.24%   6.90%
video-based method     8.97%    4.14%
We use a subset of the FIA database containing 29 subjects, with 10 sequences per subject as the test sequences. Each sequence has 50 frames, and the first frame is labeled with the ground truth data. We use the individual PCA algorithm [17] with image-based recognition and the individual PCA with video-based recognition as the baseline algorithms. For both algorithms, 9 images per subject are used for training and the best performance is reported after trying different numbers of eigenvectors. Fig. 9(a) shows the 9 training images for one subject in the FIA database. The face location in the training images is labeled manually, while that in the test images is based on the tracking results using our mosaic model. Face images are cropped to 64 × 64 pixels from the video frames. We test two options for our algorithms based on the same training set (9 images per subject). The first is the individual patch-PCA mosaic with image-based recognition, which uses the averaged distance from the frames to the mosaic model as the final distance measure. The second is the individual patch-PCA mosaic with video-based recognition, which uses the 2D condensation method to perform tracking and recognition. Fig. 9(b) illustrates the mean images of the two methods. We can observe a significant blurring effect in the mean image of the individual PCA model. On the other hand, the mean image of our individual patch-PCA mosaic model covers larger pose variation while keeping enough individual facial characteristics. The comparison of recognition performance is shown in Table 1. Two observations can be made. First, given the same model, whether the PCA model or the mosaic model, video-based face recognition is better than image-based recognition. Second, the mosaic model works much better than the PCA model for pose-robust recognition.
7 Conclusions
This paper presents an approach to building a statistical mosaic model by combining multi-view face images, and applies it to face recognition. Multi-view face
images are back projected onto the surface of an ellipsoid, and the surface texture map is decomposed into an array of local patches, which are allowed to move locally in order to achieve better correspondences among multiple views. We show the improved performance for pose robust face recognition by using this new method and extend our approach to video-based face recognition.
References 1. Zhao, W.Y., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Computing Survey 35(4), 399–458 (2003) 2. Phillips, P., Grother, P., Micheals, R., Blackburn, D., Tabassi, E., Bone, J.: Face recognition vendor test (FRVT) 2002: Evaluation report (2003) 3. Sim, T., Kanade, T.: Combining models and exemplars for face recognition: An illuminating example. In: Proc. of the CVPR 2001 Workshop on Models versus Exemplars in Computer Vision (2001) 4. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 5. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley & Sons. Inc., New York (2001) 6. Liu, X., Chen, T.: Geometry-assisted statistical modeling for face mosaicing. In: ICIP 2003. Proc. 2003 International Conference on Image Processing, Barcelona, Catalonia, Spain, vol. 2, pp. 883–886 (2003) 7. Liu, X., Chen, T.: Pose-robust face recognition using geometry assisted probabilistic modeling. In: Proc. IEEE Computer Vision and Pattern Recognition, San Diego, California, vol. 1, pp. 502–509 (2005) 8. Kanade, T., Yamada, A.: Multi-subregion based probabilistic approach toward pose-invariant face recognition. In: IEEE Int. Symp. on Computational Intelligence in Robotics Automation, Kobe, Japan, vol. 2, pp. 954–959 (2003) 9. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(9), 1063–1074 (2003) 10. Dimitrijevic, M., Ilic, S., Fua, P.: Accurate face models from uncalibrated and ill-lit video sequences. Proc. IEEE Computer Vision and Pattern Recognition 2, 1034–1041 (2004) 11. De la Torre, F., Black, M.J.: Robust principal component analysis for computer vision. In: Proc. 8th Int. Conf. on Computer Vision, Vancouver, BC, vol. 1, pp. 362–369 (2001) 12. Isard, M., Blake, A.: Active Contours. Springer, Heidelberg (1998) 13. Zhou, S., Krueger, V., Chellappa, R.: Probabilistic recognition of human faces from video. Computer Vision and Image Understanding 91, 214–245 (2003) 14. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(12), 1615–1618 (2003) 15. Gross, R., Matthews, I., Baker, S.: Appearance-based face recognition and lightfields. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(4), 449–465 (2004) 16. Goh, R., Liu, L., Liu, X., Chen, T.: The CMU Face In Action (FIA) database. In: Proc. of IEEE ICCV 2005 Workshop on Analysis and Modeling of Faces and Gestures, Beijing, China, IEEE Computer Society Press, Los Alamitos (2005) 17. Liu, X., Chen, T., Kumar, B.V.K.V.: Face authentication for multiple subjects using eigenflow. Pattern Recognition 36(2), 313–328 (2003)
Face Recognition by Using Elongated Local Binary Patterns with Average Maximum Distance Gradient Magnitude

Shu Liao and Albert C.S. Chung

Lo Kwee-Seong Medical Image Analysis Laboratory, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong
[email protected],
[email protected]

Abstract. In this paper, we propose a new face recognition approach based on local binary patterns (LBP). The proposed approach has the following novel contributions. (i) As compared with the conventional LBP, anisotropic structures of the facial images can be captured effectively by the proposed approach using an elongated neighborhood distribution, which is called the elongated LBP (ELBP). (ii) A new feature, called the Average Maximum Distance Gradient Magnitude (AMDGM), is proposed. AMDGM embeds the gray level difference information between the reference pixel and neighboring pixels in each ELBP pattern. (iii) It is found that the ELBP and AMDGM features complement each other well. The proposed method is evaluated by performing face recognition experiments on two databases: ORL and FERET. The proposed method is compared with two widely used face recognition approaches. Furthermore, to test the robustness of the proposed method when the resolution of the input images is low, we also conduct additional face recognition experiments on the two databases by reducing the resolution of the input facial images. The experimental results show that the proposed method gives the highest recognition accuracy under both normal conditions and low image resolution conditions.
1 Introduction
Automatic facial recognition (AFR) has been the topic of extensive research in the past several years. It plays an important role in many computer vision applications, including surveillance and biometric image processing. There are still many challenges and difficulties, for example factors such as pose [8], illumination [9] and facial expression [10]. In this paper, we propose a new approach to face recognition from static images, using new features to effectively represent facial images. Developing a face recognition framework involves two crucial aspects. 1. Facial image representation: this is also known as the feature extraction process, in which feature vectors are extracted from the facial images. 2. Classifier design: the feature vectors extracted in the first stage
are fed into a specific classifier to obtain the final classification results. In this paper, we focus on the first stage: feature extraction. Many feature extraction methods for facial image representation have been proposed. Turk and Pentland [4] used Principal Component Analysis (PCA) to construct eigenfaces and represent face images as projection coefficients along these basis directions. Belhumeur et al. proposed the Linear Discriminant Analysis (LDA) method [5]. Wiskott et al. used Gabor wavelet features for face recognition [3]. In recent years, a new feature extraction method known as Local Binary Patterns (LBP) [1] was proposed. LBP was first applied in texture classification [6]. It is a computationally efficient descriptor that captures the micro-structural properties of facial images. However, the conventional LBP has a major limitation: it uses a circularly symmetric neighborhood definition. This definition aims to solve the rotation invariance problem in texture classification, at the cost of eliminating anisotropic structural information. For face recognition, however, such a problem does not exist, and anisotropic structural information is an important feature, as many anisotropic structures exist in the face (e.g., eyes and mouth). To this end, we extend the neighborhood distribution in an elongated manner to capture anisotropic properties of facial images; we call this the elongated LBP (ELBP), and the conventional LBP is a special case of ELBP. Also, the conventional LBP does not take gradient information into consideration. In this paper, we propose a new feature, named the average maximum distance gradient magnitude (AMDGM), to capture the general gradient information for each ELBP pattern. It is experimentally shown that the ELBP and AMDGM features complement each other and achieve the highest recognition accuracy among all the compared methods, under both normal conditions and low input image resolution conditions. The paper is organized as follows. In Section 2, the concepts of ELBP and AMDGM are introduced. Section 3 describes the experimental results for the various approaches under normal conditions and with low resolution images. Section 4 concludes the paper.
2 Face Recognition with ELBP and AMDGM
In this section, the ELBP and AMDGM features are introduced. We will first briefly review the conventional LBP approach and then describe these two features.

2.1 Elongated Local Binary Patterns
In this subsection, the Elongated Local Binary Patterns (ELBP) are introduced. In the definition of the conventional LBP [6], the neighborhood pixels of the reference pixel are defined in a circularly symmetric manner. There are
two parameters, m and R, respectively representing the number of neighboring pixels and the radius (i.e., the distance from the reference pixel to each neighboring pixel). By varying the values of m and R, multiresolution analysis can be achieved. Figure 1 provides examples of different values of m and R. Then,
Fig. 1. Circularly symmetric neighbor sets for different values of m and R
the neighboring pixels are thresholded to 0 if their intensity values are lower than that of the center reference pixel, and to 1 otherwise. If the number of transitions between "0" and "1" is less than or equal to two, the pattern is a uniform pattern. For example, "00110000" is a uniform pattern, but "01011000" is not. It is obvious that there are m + 1 possible types of uniform patterns. The final feature vector extracted by the conventional LBP is the occurrence count of each type of uniform pattern in an input image, as the authors of [6] pointed out that the uniform patterns represent basic image structures such as lighting spots and edges. As we can see, for the conventional LBP the neighborhood pixels are all defined on a circle of radius R around the reference center pixel. The main reason for defining neighboring pixels in this isotropic manner is to solve the rotation invariance problem in texture classification, which was the first application of the conventional LBP. Later, the conventional LBP was applied to face recognition [1]. However, in this application the rotation invariance problem does not exist; instead, anisotropic information is an important feature for face recognition. To the best of our knowledge, this problem has not been mentioned by other researchers, which motivates us to propose the ELBP approach. In ELBP, the distribution of neighborhood pixels forms an ellipse (see Figure 2). There are three parameters in the ELBP approach: 1. the long axis of the ellipse, denoted by A; 2. the short axis of the ellipse, denoted by B; 3. the number of neighboring pixels, denoted by m. Figure 2 shows examples of ELBP patterns with different values of A, B and m. The X and Y coordinates, g_ix and g_iy, of each neighbor pixel g_i (i = 1, 2, ..., m) with respect to the center pixel are defined by Equations 1 and 2 respectively,

R_i = √( A²B² / (A² sin²θ_i + B² cos²θ_i) )    (1)
Fig. 2. Examples of ELBP with different values of A, B and m
g_ix = R_i · cos θ_i,   g_iy = R_i · sin θ_i    (2)

where θ_i = ((360/m) · (i − 1))°. If the coordinates of a neighboring pixel do not fall exactly on the image grid, bilinear interpolation is applied. The final feature vector of ELBP is also the occurrence histogram of each type of uniform pattern. In this paper, three sets of ELBP parameters are used: A1 = 1, B1 = 1, m = 8; A2 = 3, B2 = 1, m = 16; A3 = 3, B3 = 2, m = 16. Similar to [1], before processing the input image for face recognition, the image is divided into six regions in advance: brows, eyes, nose, mouth, left cheek, and right cheek, which are denoted as R1, R2, ..., R6. Each region is assigned a weighting factor according to its importance; the larger the weighting factor, the more important the region. In this paper, the weighting factors for these six regions are set to w1 = 2, w2 = 4, w3 = 2, w4 = 4, w5 = 1, w6 = 1. The ELBP histograms are estimated from each region. Then, the feature vector is normalized to the range [-1, 1]. Finally, the normalized feature vector is multiplied by its corresponding weighting factor to obtain the region feature vector. As such, the region feature vector encodes the textural information in each local region, and by concatenating all the region feature vectors, global information of the entire face image is obtained. The ELBP pattern can also be rotated about the center pixel by a specific angle β to achieve multi-orientation analysis and to characterize elongated structures along different orientations in the facial images. In this paper, four orientations β1 = 0°, β2 = 45°, β3 = 90°, β4 = 135° are used for each ELBP pattern with its own parameters A, B and m. The final ELBP feature vector is an (m + 1)-dimensional vector F, where each element F_i (i = 1, 2, ..., m+1) denotes the occurrence of a specific type of uniform pattern over all four orientations β1, β2, β3 and β4 in an input image. As we can see, the ELBP features are more general than the conventional LBP. More precisely, the conventional LBP can be viewed as a special case of ELBP in which A and B are set equal to each other. The ELBP is able to capture anisotropic information from the facial images, which are important
features, as many important parts of the face, such as the eyes and mouth, are elongated structures. Therefore, it is expected that ELBP has more discriminative power than the conventional LBP, which is further verified in the experimental results section.
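As a concrete illustration of Equations (1) and (2), the sketch below computes the elongated neighbourhood offsets for given A, B and m, samples the neighbours with bilinear interpolation, and thresholds them against the centre pixel to form one binary ELBP pattern. The image is random, and the uniformity test, the orientation angle β, the region division and the weighting are omitted.

```python
import numpy as np

def elbp_neighbors(A, B, m):
    """Offsets (dx, dy) of the m neighbours on an ellipse with axes A and B (Eqs. 1-2)."""
    theta = np.deg2rad(360.0 / m * np.arange(m))
    R = np.sqrt(A**2 * B**2 / (A**2 * np.sin(theta)**2 + B**2 * np.cos(theta)**2))
    return R * np.cos(theta), R * np.sin(theta)

def bilinear(img, x, y):
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x0 + 1]
            + (1 - fx) * fy * img[y0 + 1, x0] + fx * fy * img[y0 + 1, x0 + 1])

def elbp_code(img, cx, cy, A=3, B=1, m=16):
    """Binary ELBP pattern at (cx, cy): neighbours thresholded against the centre pixel."""
    dx, dy = elbp_neighbors(A, B, m)
    vals = np.array([bilinear(img, cx + ddx, cy + ddy) for ddx, ddy in zip(dx, dy)])
    return (vals >= img[cy, cx]).astype(int)

img = np.random.default_rng(6).integers(0, 256, size=(64, 64)).astype(float)
print(elbp_code(img, cx=30, cy=30))
```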
2.2 Average Maximum Distance Gradient Magnitude
As mentioned in Section 2.1, ELBP is more general than the conventional LBP, and anisotropic information can be effectively captured. However, both the conventional LBP and the proposed ELBP still do not take the gradient information of each pattern into consideration. Since both are constructed by thresholding the neighboring pixels to 0 and 1 with respect to the reference center pixel, gradient magnitude information is not included. In this paper, a new measure, called the average maximum distance gradient magnitude (AMDGM), is proposed to effectively capture such information. To define AMDGM, we first introduce the concept of distance gradient magnitude (DGM). For each ELBP pattern, there are three parameters A, B and m denoting the long axis, the short axis and the number of neighboring pixels. The distance gradient magnitude for each neighboring pixel g_i, given the center pixel g_c, is defined by Equation 3,

|∇_d I(g_i, g_c)| = |I_{g_i} − I_{g_c}| / |v_i − v_c|²    (3)

where v = (x, y) denotes the pixel position, and I_{g_i} and I_{g_c} are the intensities of the neighbor pixel and the reference pixel respectively. Based on the definition of DGM, the maximum distance gradient magnitude G(v) is defined by Equation 4,

G(v) = max_{g_i} |∇_d I(g_i, g_c)|,  i = 1, 2, ..., m.    (4)

Suppose that, in an input image, the occurrence of each type of uniform ELBP pattern P_i (i = 1, 2, ..., m) is N_i. Then the average maximum distance gradient magnitude (AMDGM) A(P_i) for each type of uniform pattern is defined by Equation 5,

A(P_i) = ( Σ_{k=1}^{N_i} G(v_k) ) / N_i    (5)

where v_k ∈ P_i. The AMDGM feature has an advantage over the conventional gradient magnitude as it takes spatial information (i.e., the distance from the neighbor pixel to the reference center pixel) into consideration. This is essential because the neighborhood distribution is no longer isotropic: unlike the conventional LBP, the distance from each neighborhood pixel to the reference pixel can differ. The AMDGM feature complements ELBP well because ELBP provides the pattern type distribution while AMDGM captures the general gradient information, with spatial information, for each type of uniform pattern.
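A toy sketch of Equations (3)-(5): for each pixel the distance gradient magnitude to every elongated neighbour is computed, the maximum is taken (Eq. 4), and the maxima are averaged over all pixels carrying the same pattern label (Eq. 5). The image and the pattern labels are random stand-ins (a real implementation would label each pixel with its uniform ELBP type), and nearest-pixel sampling replaces bilinear interpolation for brevity.

```python
import numpy as np

rng = np.random.default_rng(7)
img = rng.integers(0, 256, size=(32, 32)).astype(float)

def neighbor_offsets(A, B, m):
    theta = np.deg2rad(360.0 / m * np.arange(m))
    R = A * B / np.sqrt(A**2 * np.sin(theta)**2 + B**2 * np.cos(theta)**2)
    return np.stack([R * np.cos(theta), R * np.sin(theta)], axis=1)

def max_dgm(img, cx, cy, offsets):
    """Equations (3)-(4): max over neighbours of |I_gi - I_gc| / |v_i - v_c|^2."""
    center = img[cy, cx]
    best = 0.0
    for dx, dy in offsets:
        x, y = int(round(cx + dx)), int(round(cy + dy))   # nearest pixel instead of bilinear
        dgm = abs(img[y, x] - center) / (dx**2 + dy**2)
        best = max(best, dgm)
    return best

offsets = neighbor_offsets(A=3, B=1, m=16)
# Dummy pattern labels: pretend each pixel was already assigned a uniform-pattern type P_i.
labels = rng.integers(0, 17, size=img.shape)
amdgm = np.zeros(17)
for i in range(17):
    ys, xs = np.where(labels[4:-4, 4:-4] == i)            # stay away from the border
    G = [max_dgm(img, x + 4, y + 4, offsets) for y, x in zip(ys, xs)]
    amdgm[i] = np.mean(G) if G else 0.0                   # Equation (5)
print(np.round(amdgm, 3))
```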
3 Experimental Results
The proposed approach has been evaluated by performing face recognition experiments on two databases: ORL and FERET [7]. The ORL database contains 40 classes with 10 samples for each class; each sample has a resolution of 92 × 112 pixels. For the FERET database, we have selected 70 subjects with six upright, frontal-view images per subject. For each facial image, the centers of the two eyes have already been manually detected, and each facial image has been translated, rotated and scaled properly to fit a grid of 64 × 64 pixels. The proposed method has been compared with three widely used methods, similar to [1]: 1. Principal Component Analysis (PCA), also known as the eigenface approach [4]; 2. Linear Discriminant Analysis (LDA) [5]; 3. the conventional Local Binary Patterns (LBP) [1]. A support vector machine (SVM) [2] with the Gaussian Radial Basis Function (RBF) kernel was used as the classifier in this work. To test the robustness of the proposed method under low input image resolution, which is a practical problem in real-world applications, we have also performed face recognition experiments on the ORL and FERET databases with downsampled input images.
3.1 Experiment on ORL and FERET Databases with Original Resolution
We have tested all the approaches under normal conditions in both the ORL and FERET databases (i.e., all the input images were at their original resolution). Figure 3 and Figure 4 show some sample images from the ORL and FERET databases. The purpose of this experiment is to test the basic discriminative power of the different approaches. In this experiment, half of the facial images for each class were randomly selected as the training set, and the remaining images were used as the testing set. The experiment was repeated for all possible combinations of training and testing sets. The average recognition rates for the different approaches on both the ORL and FERET databases are listed in Table 1.
Fig. 3. Some sample images from the ORL Database
Fig. 4. Some sample images from the FERET Database

Table 1. Performance of different approaches under the normal condition on the ORL and FERET databases. Results of the proposed methods are listed in rows 4–5.

                       Recognition Rate (%)
Methods                ORL       FERET
1. PCA [4]             85.00     72.52
2. LDA [5]             87.50     76.83
3. LBP [1]             94.00     82.62
4. ELBP                97.00     86.73
5. ELBP + AMDGM        98.50     93.16
As shown in Table 1, the proposed method achieves the highest recognition rate among the compared methods on both databases. Furthermore, the ELBP and AMDGM features complement each other well, which clearly demonstrates the discriminative power of the proposed method.
3.2 Experiment on ORL and FERET Databases with Low Resolution Images
In real-world applications, the quality of the input facial images is not always good, owing to factors such as the imaging equipment and outdoor environments (e.g., low-resolution facial images). The robustness of a face recognition approach under this condition is therefore essential. In this experiment, the input images of both the ORL and FERET databases were downsampled to 32 × 32 pixels before processing; the rest of the settings were the same as in Section 3.1. The recognition rates of the different approaches are listed in Table 2. Table 2 reflects the robustness of the different methods to low input image resolution: the proposed method maintains the highest recognition rate among all compared methods, which clearly illustrates its robustness under such conditions.
Table 2. Performance of different approaches under the low-resolution condition on the ORL and FERET databases. Results of the proposed methods are listed in rows 4–5.

                       Recognition Rate (%)
Methods                ORL       FERET
1. PCA [4]             66.00     58.41
2. LDA [5]             69.50     64.85
3. LBP [1]             78.50     73.63
4. ELBP                83.00     80.22
5. ELBP + AMDGM        88.50     85.47

4 Conclusion
This paper has proposed a novel approach to automatic face recognition. Motivated by the importance of capturing the anisotropic features of facial images, we proposed the ELBP feature, which is more general and powerful than the conventional LBP. To embed gradient information based on the definition of ELBP, a new feature, AMDGM, was also proposed. The AMDGM feature encodes the spatial information of the neighboring pixels with respect to the reference center pixel, which is essential for ELBP. Experimental results on the ORL and FERET databases demonstrate the effectiveness and robustness of the proposed method in comparison with three widely used methods.
References

1. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
3. Wiskott, L., Fellous, J.-M., Kuiger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE PAMI 19(7), 775–779 (1997)
4. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86 (1991)
5. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE PAMI 19(7), 711–720 (1997)
6. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE PAMI 24(7), 971–987 (2002)
7. Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face recognition algorithms. IVC 16(5), 295–306 (1998)
8. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE PAMI 25(9), 1063–1074 (2003)
9. Shashua, A., Riklin-Raviv, T.: The quotient image: Class based re-rendering and recognition with varying illuminations. IEEE PAMI 23(2), 129–139 (2001)
10. Tian, Y.-L., Kanade, T., Cohn, J.F.: Recognizing action units for facial expression analysis. IEEE PAMI 23(2), 97–115 (2001)
An Adaptive Nonparametric Discriminant Analysis Method and Its Application to Face Recognition Liang Huang1,∗, Yong Ma2, Yoshihisa Ijiri2, Shihong Lao2, Masato Kawade2, and Yuming Zhao1 1 Institute of Image Processing and Pattern Recognition Shanghai Jiao Tong University, Shanghai, China, 200240 {asonder,arola_zym}@sjtu.edu.cn 2 Sensing & Control Technology Lab., Omron Corporation, Kyoto, Japan, 619-0283 {ma,joyport,lao,kawade}@ari.ncl.omron.co.jp
Abstract. Linear Discriminant Analysis (LDA) is frequently used for dimension reduction and has been successfully utilized in many applications, especially face recognition. In classical LDA, however, the definition of the between-class scatter matrix can cause large overlaps between neighboring classes, because LDA assumes that all classes obey a Gaussian distribution with the same covariance. We therefore propose an adaptive nonparametric discriminant analysis (ANDA) algorithm that maximizes the distance between neighboring samples belonging to different classes, thus improving the discriminating power of the samples near the classification borders. To evaluate its performance thoroughly, we have compared our ANDA algorithm with traditional PCA+LDA, Orthogonal LDA (OLDA) and nonparametric discriminant analysis (NDA) on the FERET and ORL face databases. Experimental results show that the proposed algorithm outperforms the others. Keywords: Linear Discriminant Analysis, Nearest Neighbors, Nonparametric Discriminant Analysis, Face Recognition.
1 Introduction

LDA is a well-known dimension reduction method that reduces the dimensionality while keeping the greatest between-class variation, relative to the within-class variation, in the data. An attractive feature of LDA is the quick and easy method of determining this optimal linear transformation, which requires only simple matrix arithmetic. It has been used successfully in reducing the dimension of the face feature space. LDA, however, has some limitations in certain situations. For example, in face recognition, when the number of class samples is smaller than the feature dimension (the well-known SSS (Small Sample Size) problem), LDA suffers from the singularity problem. Several extensions to classical LDA have been proposed to
* This work was done when the first author was an intern student at Omron Corporation.
overcome this problem. These include Direct LDA [2], null space LDA (NLDA) [8][9], orthogonal LDA (OLDA) [10], uncorrelated LDA (ULDA) [10][11], regularized LDA [12], pseudo-inverse LDA [13][14] and so on. The SSS problem can also be overcome by applying the PCA algorithm before LDA. Another limitation arises from the assumption made by the LDA algorithm that all classes of a training set have a Gaussian distribution with a single shared covariance. LDA cannot attain the minimal error rate if the classes do not satisfy this assumption, which restricts the application of LDA. When using LDA as a classifier for face recognition, this assumption is rarely true. To overcome this limitation, some extensions have been proposed that drop the assumption, such as nonparametric discriminant analysis (NDA) [1][17], stepwise nearest neighbor discriminant analysis (SNNDA) [6], and heteroscedastic LDA (HLDA) [15]. Both NDA and HLDA, however, require a free parameter to be specified by the user. In NDA, this is the number of K nearest neighbors; in HLDA, it is the number of dimensions of the reduced space. SNNDA further modifies NDA by redefining the between-class scatter matrix, and therefore has the same free parameter as NDA. Thus each of these algorithms needs to be tuned to a specific situation, so as not to cause large overlaps between neighboring classes. In this paper, we propose an adaptive nonparametric discriminant analysis (ANDA) approach for face recognition, based on previously proposed methods, especially NDA. First, we apply the PCA method before the discriminant analysis to avoid the SSS problem. Then we redefine the between-class scatter matrix: we calculate the distances between each sample and its nearest neighbors, and these distances are used to define the between-class scatter matrix. Our goal is to find an adaptive method which can deal with different types of data distribution without parameter tuning. Finally, we compare our approach with PCA+LDA, OLDA, and NDA on the FERET and ORL face datasets. The results show that our algorithm outperforms the other approaches. The rest of this paper is organized as follows. Section 2 reviews classical LDA and some extensions. Our ANDA algorithm is presented in Section 3. In Section 4, the experiments and their results are described. Finally, we conclude in Section 5.
2 Overview of Discriminant Analysis

2.1 Classical LDA

LDA is a statistical approach for classifying samples of unknown classes based on training samples from known classes. The algorithm aims to maximize the between-class variance and minimize the within-class variance. Classical LDA defines the between-class and within-class scatter matrices as follows:

$$S_b = \sum_{i=1}^{c} p_i (m_i - m)(m_i - m)^T. \qquad (1)$$

$$S_w = \sum_{i=1}^{c} p_i S_i. \qquad (2)$$

where $m_i$, $S_i$ and $p_i$ are the mean, covariance matrix and prior probability of each class (i.e., each individual), respectively, and $m$ and $c$ are the global mean and the number of classes. The trace of $S_w$ measures the within-class cohesion, and the trace of $S_b$ measures the between-class separation of the training set. Classical LDA seeks a linear transformation matrix $G$ that reduces the feature dimension while maximizing trace($S_b$) and minimizing trace($S_w$). Classical LDA therefore computes the optimal $G$ by solving the following optimization:

$$G = \arg\max_{G^T S_w G = I_l} \operatorname{trace}\big((G^T S_w G)^{-1} G^T S_b G\big). \qquad (3)$$

The solution to this optimization problem is given by the eigenvectors of $S_w^{-1} S_b$ corresponding to nonzero eigenvalues [3]. Thus, if $S_w$ is singular, $S_w^{-1}$ does not exist and the optimization problem fails. This is known as the SSS problem. Another drawback is that the definition of $S_b$ in classical LDA may cause large overlaps between neighboring classes, because it only measures the distance between the global mean and the mean of each class, without considering the distances between neighboring samples belonging to different classes. In this paper, we call this the dissatisfactory assumption problem. Fig. 2 illustrates this problem, and we explain it in Section 3.
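For illustration, a minimal sketch of Eqs. (1)–(3) follows (hypothetical code, not the authors' implementation; it assumes $S_w$ is nonsingular, e.g. after a preliminary PCA step).

```python
import numpy as np
from scipy.linalg import eigh

def classical_lda(X, y, n_components):
    """X: (N, d) samples, y: (N,) class labels. Returns G with n_components columns."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    m = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for cls, p in zip(classes, priors):
        Xc = X[y == cls]
        mc = Xc.mean(axis=0)
        Sb += p * np.outer(mc - m, mc - m)                # Eq. (1)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)     # Eq. (2)
    # Eq. (3): solve the generalized eigenproblem Sb g = lambda Sw g
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]]
```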
2.2 Extensions to LDA

2.2.1 Extensions for the SSS Problem
PCA+LDA is commonly used to overcome the SSS problem. The PCA method first projects the high-dimensional features into a low-dimensional subspace; LDA is then applied to the low-dimensional feature space. OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. It can be implemented by solving the optimization problem below:

$$G = \arg\max_{G^T G = I_l} \operatorname{trace}\big((G^T S_t G)^{+} G^T S_b G\big), \qquad (4)$$
where St is the total scatter matrix. NLDA aims to maximize the between-class distance in the null space of the within-class scatter matrix. Ye et al. [4] present a computational and theoretical analysis of NLDA and OLDA. They conclude that both NLDA and OLDA result in orthogonal transformations. Although these extensions can solve the SSS problem in LDA, they cannot overcome the dissatisfactory assumption problem.
2.2.2 Extensions for the Dissatisfactory Assumption Problem
NDA was first proposed by Fukunaga et al. [3]. Bressan et al. [5] then explored the nexus between NDA and the nearest neighbour (NN) classifier, and also introduced a modification to NDA which extends the two-class NDA to the multi-class case. NDA provides a new definition for the between-class scatter matrix $S_b$, by first defining the extra-class nearest neighbor and the intra-class nearest neighbor of a sample $x \in \omega_i$ as:

$$x^E = \{\, x' \notin \omega_i \mid \|x' - x\| \le \|z - x\|,\ \forall z \notin \omega_i \,\}. \qquad (5)$$

$$x^I = \{\, x' \in \omega_i \mid \|x' - x\| \le \|z - x\|,\ \forall z \in \omega_i \,\}. \qquad (6)$$

Next, the extra-class distance and within-class distance are defined as:

$$\Delta^E = x - x^E. \qquad (7)$$

$$\Delta^I = x - x^I. \qquad (8)$$

The between-class scatter matrix is then defined as:

$$\tilde{S}_b = \sum_{n=1}^{N} \omega_n (\Delta^E)(\Delta^E)^T, \qquad (9)$$

where $\omega_n$ is a weight used to emphasize or de-emphasize different samples. Qiu et al. [6] further modified NDA by redefining $S_w$ as follows:

$$\tilde{S}_w = \sum_{n=1}^{N} \omega_n (\Delta^I)(\Delta^I)^T. \qquad (10)$$
They also proposed a stepwise dimensionality reduction method to overcome the SSS problem. However, the definition of $S_b$ in SNNDA is the same as in NDA. The definition of $S_b$ in (9) is concerned with the nearest-neighbor distance of each sample. The previous definition can be extended to the K-NN case by defining $x^E$ as the mean of the K nearest extra-class neighbors [5]. This is an improvement over classical LDA, but both NDA and SNNDA require a free parameter to be specified by the user, namely the number K of nearest neighbors. Once this has been specified, it remains the same for every sample. As this is overly restrictive, we need an adaptive and comprehensive method for selecting neighboring samples. In the next section we propose an approach that overcomes these limitations.
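The following sketch illustrates how the nonparametric scatter matrices of Eqs. (5)–(10) can be assembled, assuming unit weights $\omega_n = 1$ and single nearest neighbors (a simplification of the weighted, K-NN form; hypothetical code, not the authors' implementation).

```python
import numpy as np

def nda_scatter_matrices(X, y):
    """Nonparametric scatter matrices (Eqs. 5-10) with unit weights."""
    N, d = X.shape
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for n in range(N):
        dists = np.linalg.norm(X - X[n], axis=1)
        dists[n] = np.inf                      # exclude the sample itself
        same = (y == y[n])
        x_extra = X[np.where(~same)[0][np.argmin(dists[~same])]]  # Eq. (5)
        x_intra = X[np.where(same)[0][np.argmin(dists[same])]]    # Eq. (6)
        dE = X[n] - x_extra                    # Eq. (7)
        dI = X[n] - x_intra                    # Eq. (8)
        Sb += np.outer(dE, dE)                 # Eq. (9), omega_n = 1
        Sw += np.outer(dI, dI)                 # Eq. (10), omega_n = 1
    return Sb, Sw
```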
3 Adaptive Nonparametric Discriminant Analysis

3.1 Introduction for ANDA

In this section we introduce our ANDA algorithm. This approach has been proposed to overcome the drawbacks mentioned in Section 2. First, PCA is applied to reduce the
feature dimension. This step makes $S_w$ non-singular and helps overcome the SSS problem. Thereafter, our ANDA approach is applied in the low-dimensional subspace. As mentioned before, the definition of $S_b$ in classical LDA is designed for the ideal case in which the training data obey a Gaussian distribution. To remove this classical LDA assumption, we redefine the between-class scatter matrix $S_b$. Our goal is to find an optimal discriminant function that simultaneously makes samples of the same class as close as possible and the distance between samples belonging to different classes as great as possible. In a face recognition application, we assume there are $c$ classes (individuals) $\omega_l$ ($l = 1, \ldots, c$) in the training set. We then define the within-class distances $d_i^w$ of a sample $x_i \in \omega_l$ as:

$$d_i^w = \|x_n - x_i\|, \quad x_n \in \omega_l. \qquad (11)$$

So the maximal within-class distance of $x_i$ can be represented as $\max d_i^w$. Similarly, the between-class distances $d_i^b$ are defined as:

$$d_i^b = \|x_m - x_i\|, \quad x_m \notin \omega_l. \qquad (12)$$

Now we can define the near neighbors of sample $x_i$ as follows:

$$X_i = \{\, x_m \mid d_i^b < \partial \cdot \max d_i^w \,\}, \qquad (13)$$
where ∂ is an adjustable parameter that increases or decreases the number of near neighbors. As mentioned before, NDA also has a parameter to adjust the number of near neighbors, but it specifies the same number of nearest neighbors for every sample. Because this parameter can greatly influence the performance of NDA, it needs to be tuned for different applications, which is too restrictive. The ideal method should specify a different number of nearest neighbors for each sample depending on the conditions. ANDA does exactly this using the parameter ∂. We next show that our ANDA algorithm can do this adaptively. We evaluate the performance of ANDA with different values of ∂ and different feature dimensions on the FERET databases. As can be seen in Fig. 1, the performance is stable when ∂ is greater than 1. This means that the parameter does not need to be tuned for different situations. As stated before, ∂ is the only parameter in our algorithm, and this shows that our ANDA algorithm is stable with respect to it. In this paper, ∂ is set to 1.14.
Fig. 1. Rank-1 recognition rates with different choices of ∂ on FERET datasets. Feature dimension is 100 in (a) and 200 in (b).
Now the redefined between-class scatter matrix is obtained from:

$$\hat{S}_b = \frac{1}{n}\frac{1}{s}\sum_{i=1}^{n}\sum_{j=1}^{s} (x_i - x_i^j)(x_i - x_i^j)^T, \qquad x_i^j \in X_i, \qquad (14)$$
where $x_i^j$ is a near neighbor of $x_i$, and $n$ and $s$ are the total number of samples in the training set and the number of near neighbors of each sample, respectively.
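A rough sketch of the adaptive neighbor selection of Eq. (13) and the between-class scatter of Eq. (14) follows (hypothetical code, not the authors' implementation; the data are assumed to be already projected by PCA, and, since the neighbor count varies per sample, the sketch averages over each sample's own neighbor count).

```python
import numpy as np

def anda_between_class_scatter(X, y, alpha=1.14):
    """Adaptive nonparametric between-class scatter (Eqs. 11-14); alpha plays the role of the parameter written as ∂ in the text."""
    n, d = X.shape
    Sb_hat = np.zeros((d, d))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        same = (y == y[i])
        max_dw = dists[same].max()                          # max within-class distance, Eq. (11)
        neighbors = X[(~same) & (dists < alpha * max_dw)]   # near neighbors X_i, Eq. (13)
        for xj in neighbors:                                # accumulate Eq. (14)
            diff = X[i] - xj
            Sb_hat += np.outer(diff, diff) / max(len(neighbors), 1)
    return Sb_hat / n
```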
3.2 Discussions

We compared the performance of our ANDA with NDA and traditional LDA on artificial datasets, as shown in Fig. 2. Fig. 2(a) shows that when the data distribution is Gaussian, all three algorithms are effective. However, if this precondition is not satisfied, as shown in Fig. 2(b), traditional LDA fails, because it calculates the between-class matrix using the mean of each class and the global mean, as given in Eq. (1). In Fig. 2(b), however, the means of each class and the global mean are all close to zero, so $S_b$ cannot represent the separation of the different classes, whereas $\tilde{S}_b$ and $\hat{S}_b$ can overcome this problem.
From Eqs. (9) and (14), we see that $\tilde{S}_b$ and $\hat{S}_b$ each have one parameter: the number of nearest neighbors in NDA and the ratio ∂ in ANDA. $\tilde{S}_b$ selects a fixed number of nearest neighbors for every sample, whereas $\hat{S}_b$ determines the number of neighbors by using the ratio of intra-class distance to extra-class distance. If a sample is closer to an extra-class neighbor than to its intra-class neighbors, it will be hard to discriminate correctly using classical LDA, so we must pay more attention to the samples near the border than to those in the center of a cluster. Thus $\hat{S}_b$ is more flexible than $\tilde{S}_b$ in different situations. To demonstrate the validity of this statement, we apply the same value to the two parameters in Fig. 2(b), (c) and (d).
Fig. 2. First direction of classical LDA (dot-dash line), NDA (dash line) and ANDA (solid line) projections for different artificial datasets
The results show that NDA cannot classify the different datasets correctly, whereas ANDA can.

3.3 Extensions to ANDA
This adaptive method for selecting neighbors makes our ANDA algorithm calculate the between-class matrix using only the samples near the borders of the different classes. To compensate for this, we redefine the between-class scatter matrix as follows:
$$\bar{S}_b = \lambda \hat{S}_b + (1 - \lambda) S_b, \qquad \lambda \in (0, 1). \qquad (15)$$
We use λ = 0.5 in the experiments in this paper. This definition can be seen as a weighted combination of classical LDA and our ANDA, and we call it weighted ANDA (wANDA) in this paper.
4 Experiments

In this section, we compare the performance of ANDA, weighted ANDA (wANDA), OLDA, NDA and traditional PCA+LDA on the same database. Before conducting the
Table 1. Rank-1 recognition rates (%) with different feature dimensions on four different FERET probe sets

Method    | Fa/Fb            | Fa/Fc            | Dup1             | Dup2
          | 100   150   200  | 100   150   200  | 100   150   200  | 100   150   200
PCA+LDA   | 98.5  99    99.5 | 50    66    70   | 62    68    69   | 40    50    48
OLDA      | 97    98    98.5 | 49    60    63   | 55    60    63   | 31    36    41
NDA       | 98.5  99    99.5 | 45    61    65   | 61    65    67   | 39    46    47
ANDA      | 98.5  99    99.5 | 51    70    76   | 66    69    70   | 45    51    51
wANDA     | 98.5  99    99.5 | 50    70    75   | 66    68    70   | 45    50    52
experiments, we extract a wavelet feature from the face images in the database. Next, we apply PCA to reduce the feature dimension, thus making $S_w$ nonsingular. OLDA can overcome the SSS problem, so there is no need to apply PCA first for it. We then use the different discriminant analysis methods to learn the projection matrix from the training set. Thereafter, we select different numbers of the first d vectors of the projection matrix to build the discriminant functions, thus obtaining the relationship between recognition rate and feature dimension. Finally, we evaluate the different methods with the gallery and probe sets to obtain their recognition rates. The nearest neighbour classifier is applied in this experiment.

4.1 Experimental Data
In our experiments, we used the ORL and FERET 1996 face databases to evaluate the different algorithms. The descriptions of databases are now given. The September 1996 FERET evaluation protocol was designed to measure performance on different galleries and probe sets for identification and verification tasks [16]. In our experiments we use the FERET training set to learn discriminant functions for the different algorithms, which are then evaluated with the FERET probe sets and gallery set. The FERET training set consists of 1002 images, while the gallery consists of 1196 distinct individuals with one image per individual. Probe sets are divided into four categories. The Fa/Fb set consists of 1195 frontal images. The Fa/Fc set consists of 194 images that were taken with a different camera and different lighting on the same day as the corresponding gallery image. Duplicate 1 contains 722 images that were taken on different days within one year from the acquisition of the probe image and corresponding gallery image. Duplicate 2 contains 234 images that were taken on different days at least one year apart. The ORL database contains 400 images of 40 distinct individuals, with ten images per individual. The images were taken at different times, with varying lighting conditions, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All individuals are in the upright, frontal position, with tolerance for some side movement. We randomly selected 1 image per individual for the gallery set and the other images became the probe set. Then we trained the
Table 2. Rank-1 recognition rates (%) with different feature dimensions on the ORL dataset

Method    | 100    150    200
PCA+LDA   | 80     82     82.5
OLDA      | 72.5   79.5   80.5
NDA       | 78.5   81     82.5
ANDA      | 81.5   84.5   85
wANDA     | 81     82     85.5
different methods using the FERET training set and evaluated them with the ORL probe set and ORL gallery set. This evaluation procedure was repeated 10 times independently, and the average performance is shown in Table 2.

4.2 Experimental Results
Table 1 shows the rank-1 recognition rates with different feature dimensions on the FERET database. ANDA does not give the best performance on the Fa/Fb probe set. The Fa/Fb set is the simplest, and the recognition rates of the different approaches, with the exception of OLDA, are very similar; the difference between these rates is less than 0.5%, making it hard to distinguish between these approaches. On the other three probe sets, ANDA performs best. As we can see from Table 1, ANDA clearly outperforms NDA and PCA+LDA. These results show the advantage of the ANDA algorithm, especially its adaptive method for selecting near neighbors. Table 2 gives the performance results on the ORL database. We apply the FERET training set to learn the discriminant functions and evaluate them on the ORL database. This is a difficult task because the images in these two databases are quite different. The results show that ANDA is still the best algorithm. The performance of the wANDA algorithm is similar to that of ANDA, but it does not always perform better than ANDA, so this extension to ANDA does not always achieve the desired effect. The experimental results therefore show that the ANDA approach is an effective algorithm for face recognition, especially for the more difficult tasks.
5 Conclusion

In this paper, we have presented an adaptive nonparametric discriminant analysis approach for face recognition, which can overcome the drawbacks of classical LDA. In the ANDA approach we use a novel definition of the between-class scatter matrix, instead of the one used in classical LDA, NDA and SNNDA, which can cause overlaps between neighboring classes. This new definition selects nearest neighbors for each sample adaptively. It aims to improve the discriminant performance by improving the discriminating power of samples near the classification borders. Experimental results show that the adaptive method for selecting near neighbors is highly effective, and that the ANDA approach outperforms other methods, especially for the difficult tasks.
References

1. Fukunaga, K., Mantock, J.M.: Nonparametric discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 671–678 (1983)
2. Yu, H., Yang, J.: A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition 34, 2067–2070 (2001)
3. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, Boston (1990)
4. Ye, J., Xiong, T.: Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. Journal of Machine Learning Research 7, 1183–1204 (2006)
5. Bressan, M., Vitria, J.: Nonparametric discriminant analysis and nearest neighbor classification. Pattern Recognition Letters 24, 2743–2749 (2003)
6. Qiu, X., Wu, L.: Face recognition by stepwise nonparametric margin maximum criterion. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1567–1572 (2005)
7. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley, Chichester (2001)
8. Chen, L.F., Liao, H.Y.M., Ko, M.T., Lin, J.C., Yu, G.J.: A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33, 1713–1726 (2000)
9. Huang, R., Liu, Q., Lu, H., Ma, S.: Solving the small sample size problem of LDA. In: Proc. International Conference on Pattern Recognition, pp. 29–32 (2002)
10. Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research 6, 483–502 (2005)
11. Ye, J., Janardan, R., Li, Q., Park, H.: Feature extraction via generalized uncorrelated linear discriminant analysis. In: Proc. International Conference on Machine Learning, pp. 895–902 (2004)
12. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989)
13. Skurichina, M., Duin, R.P.W.: Stabilizing classifiers for very small sample size. In: Proc. International Conference on Pattern Recognition, pp. 891–896 (1996)
14. Raudys, S., Duin, R.P.W.: On expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters 19(5-6), 385–392 (1998)
15. Loog, M., Duin, R.P.W.: Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Trans. Pattern Analysis and Machine Intelligence 26(6), 732–739 (2004)
16. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10), 671–678 (2000)
17. Li, Z., Liu, W., Lin, D., Tang, X.: Nonparametric subspace analysis for face recognition. Computer Vision and Pattern Recognition 2, 961–966 (2005)
Discriminating 3D Faces by Statistics of Depth Differences Yonggang Huang1 , Yunhong Wang2 , and Tieniu Tan1 National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences, Beijing, China {yghuang,tnt}@nlpr.ia.ac.cn 2 School of Computer Science and Engineering Beihang University, Beijing, China
[email protected] 1
Abstract. In this paper, we propose an efficient 3D face recognition method based on statistics of range image differences. Each pixel value of a range image represents the normalized depth of the corresponding point on the facial surface, so the depth differences between pixels at the same position in two range images directly describe the structural differences between the two faces. Here, we propose to use the histogram proportion of depth differences to discriminate intra-personal from inter-personal differences for 3D face recognition. Depth differences are computed within a local neighborhood instead of by direct subtraction, to reduce the impact of imprecise registration. Furthermore, three schemes are proposed to combine the local rigid region (nose) and the holistic face to overcome expression variation for robust recognition. Promising experimental results are achieved on the 3D dataset of FRGC 2.0, which is the most challenging 3D database so far.
1 Introduction
Face recognition has attracted much attention from researchers in recent decades, owing to its broad applications and non-intrusive nature. A large amount of work has been done on this topic based on 2D color or intensity images. A certain level of success has been achieved by many algorithms under restricted conditions, and many techniques have been applied in practical systems. However, severe problems caused by large variations of illumination, pose, and expression remain unsolved. To deal with these problems, many researchers are now shifting their attention from 2D faces to other facial modalities, such as range data, infrared data, etc. The human face is actually a 3D deformable object with texture, and most of its 3D shape information is lost in 2D intensity images. 3D shape information should not be ignored, as it can provide another kind of distinct features to discriminate different faces from a 3D point of view. With recent advances in 3D acquisition technology, 3D facial data capture is becoming easier and faster, and 3D face recognition is attracting more and more attention [1]. In earlier days, many researchers worked on curvatures from range data [2]. Chua et al. [3] divided the
face into rigid and non-rigid regions, and used point signatures to represent the rigid parts. Since recognition was conducted based only on those rigid parts, their method achieved a certain level of robustness to expression variation. Medioni et al. [4] used Iterative Closest Point (ICP) to align and match face surfaces. ICP-based methods were also adopted by many other researchers, such as Lu [5]. Xu et al. [6] used Gaussian-Hermite moments to describe shape variations around important areas of the face, then combined them with the depth feature for recognition. Bronstein et al. [7] achieved expression-invariant 3D face recognition by using a new representation of the facial surface which is invariant to isometric deformations. Subspace methods have also been used for 3D face recognition based on range images [8][14]. Before recognition, the original 3D point set should first be preprocessed; subsequent feature extraction and classification steps are then conducted on the preprocessed results. From previous work, we can see that one large category of subsequent processing is based on parametric models, such as [4][5]. These methods first build a mesh model to fit the facial surface to the original point set, and then extract features from the parameters of the model [6] or conduct further projection based on this model [7]. Another category of previous work is range image based [8][9]. A range image is an image with depth information in every pixel. The depth represents the distance from the sensor to the point on the facial surface, and is usually normalized to [0, 255] as a pixel value. The structural information of the facial surface is directly reflected in the facial range image. Range images have several advantages: they are simple, immune to illumination variation and common facial make-up, and much easier to handle than mesh models. Since range images are represented as 2D images, many 3D face recognition methods based on range images can be borrowed from 2D face recognition. However, range images and intensity images are intrinsically different in the physical meaning of the pixel value, so it is necessary to develop recognition algorithms suited to range images. This paper focuses on 3D face recognition using range images. In this paper, we propose to discriminate intra-personal from inter-personal depth differences for 3D face recognition. The concept of intra- and inter-personal differences has been widely used in 2D face recognition. In [12], it is successfully integrated with a Bayesian method for 2D face recognition. However, for range images, intra- and inter-personal differences carry more meaning than in 2D face recognition. In 2D face recognition, the differences are computed from intensity subtraction, and may be caused by illumination variation rather than inter-personal difference. For 3D faces, however, the depth difference between two range images directly represents the structural difference between the two faces, and it is immune to illumination. We propose to use histogram statistics of the depth differences between two facial range images as the matching criterion. To address the difficulty of obtaining values closest to the depth differences at the same position, we propose to compute the depth difference for each pixel within a neighborhood instead of by range image subtraction. Furthermore, expression variation may cause the intra-class variation to be larger than the inter-class variation, which would
cause false matching. To achieve robust recognition, we propose three schemes that combine the local rigid region (nose) and the holistic face. All of our experiments in this paper are conducted on the 3D face database of FRGC 2.0, which contains 4007 range scans of 466 different persons. The rest of this paper is organized as follows. Section 2 describes the details of our proposed method. Section 3 presents the experimental results and discussions. Finally, we summarize this paper in Section 4.
2 Methods

2.1 Histogram Proportion of Depth Differences
Here, intra-personal difference denotes the difference between face images of the same individual, while inter-personal difference is the difference between face images of different individuals. For 2D intensity faces, intra-personal variation can be caused by expression, make-up and illumination variations. The variation is often so drastic that the intra-personal difference is larger than the inter-personal difference. For facial range images, however, only expression affects the intra-personal difference. If the intra-personal variation is milder than the variation between different identities, depth differences can provide a feasible way to discriminate different individuals. A simple experiment is shown in Fig. 1 and Fig. 2.
Fig. 1. Range images of persons A (A1, A2) and B
Fig. 1 shows three range images of persons A and B. We can see that A1 differs considerably in expression from A2. Fig. 2 shows three histograms of the absolute depth differences between the three range images: |A1 − A2|, |A1 − B|, and |A2 − B|. The first is an intra-personal difference, |A1 − A2|, while the latter two are inter-personal differences. For ease of visual comparison, all three histograms are shown on a common axis with the same scale, and the values of the 0 bin are not shown because they are much larger than the rest. The three values of the 0 bin are 4265, 1043 and 375, and the total number of points in the ROI (Region of Interest) is 6154. From the shapes of the histograms in Fig. 2, we can see that most of the distribution of histogram |A1 − A2| is close to 0, while the opposite holds for the latter two. If we set a threshold α = 20, the proportions of points whose depth differences are smaller than α are 97.4%, 36.2% and 26.9%, respectively (the 0 bin
Fig. 2. An example of histogram statistics of intra and inter personal differences (|A1 − A2|, |A1 − B|, |A2 − B|), with the threshold α = 20 shown as the red line
is included). We can see that the first proportion is much larger than the latter two. This promising result suggests that histogram proportions provide a feasible way to discriminate intra-personal from inter-personal variation. We further demonstrate this on a large 3D database of 4007 images in Section 3. The main idea of our method can be summarized as follows: compute the depth differences (absolute values) between range images, then use the holistic histogram proportion (HP) of the depth differences (DD) as the similarity measure:

$$DD(i, j) = |A(i, j) - B(i, j)| \qquad (1)$$

$$HP = \frac{\text{Number of points}(DD \le \alpha)}{\text{Total of ROI}} \qquad (2)$$

where A and B denote two facial range images and ROI denotes the region of interest. However, to make this idea robust for recognition, two problems, namely correspondence and expression, remain to be considered. We make two modifications to optimize this basic idea in the following two subsections.
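A minimal sketch of the similarity measure of Eqs. (1)–(2) (hypothetical code; the two range images are assumed to be registered arrays of equal size, and the ROI is given as a boolean mask):

```python
import numpy as np

def histogram_proportion(A, B, roi_mask, alpha=20):
    """Eqs. (1)-(2): proportion of ROI pixels whose depth difference is <= alpha."""
    dd = np.abs(A.astype(float) - B.astype(float))        # Eq. (1)
    within = (dd <= alpha) & roi_mask
    return within.sum() / roi_mask.sum()                   # Eq. (2)
```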
2.2 Correspondence
For 3D faces, comparing two pixel values belonging to two range images makes sense only when the two pixels correspond to the same position on the two faces in a uniform coordinate frame. No two range images are ever perfectly corresponded, because whether registration is conducted manually or automatically, errors inevitably exist in the registration process. With coarse registration, the situation becomes even worse. For example, the point in image B corresponding to A(i, j) may be a neighbor of B(i, j) in the image coordinates. We therefore do not compute the depth differences between two range images by direct subtraction (A − B), but use the local window shown in Fig. 3. The depth difference (DD) between images A and B at point (i, j) is computed as:

$$DD(i, j) = \min_{u, v} |A(i, j) - B(u, v)|, \qquad u \in [i-1, i+1],\ v \in [j-1, j+1]. \qquad (3)$$
Fig. 3. Local window for computing the depth difference
In this way, we can alleviate the correspondence problem to some extent. However, it introduces another issue: the DD obtained in this way is the minimum depth difference within the local window, which is probably not the exact corresponding DD. Since this happens equally for every point in each comparison between images, we believe this modification does not weaken the similarity measure much.
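A sketch of the local-window depth difference of Eq. (3) (hypothetical code; image borders are handled naively by wrap-around, which the original method does not specify):

```python
import numpy as np

def local_window_dd(A, B):
    """Eq. (3): per-pixel minimum of |A(i,j) - B(u,v)| over the 3x3 window around (i,j)."""
    A = A.astype(float)
    B = B.astype(float)
    dd = np.full(A.shape, np.inf)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            shifted = np.roll(np.roll(B, dy, axis=0), dx, axis=1)  # neighbor B(i+dy, j+dx)
            dd = np.minimum(dd, np.abs(A - shifted))
    return dd
```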
2.3 Expression
Expression is a difficult problem for both 2D and 3D face recognition. In particular, in our method, expression variation can cause the intra-personal variation to be larger than the inter-personal variation, which will definitely lead to false matching. One common way to deal with this problem in 3D face recognition is to utilize the rigid regions of the face, such as the nose region [10], which is robust to expression changes. The nose is a very distinct region of the face, and its shape is affected very little by facial muscle movement. However, using only the nose region for matching may not be sufficient, since it is so small compared to the whole face that it cannot provide enough information for recognition. Here we propose three schemes to combine the nose and the holistic face for matching, so that we can achieve robust recognition while also utilizing the discriminating ability of other facial regions; see Fig. 4. In Scheme 1, we first conduct similarity measurement on the nose region using the histogram proportion of depth differences; a weight W is then assigned to the first M (e.g., 1000) most similar images. The next step is holistic matching, in which the weight W is multiplied into the matching scores of the images selected by M, strengthening their similarity to give the final matching score. Scheme 2 has the same structure as Scheme 1, but the nose matching and holistic matching steps are exchanged. Both nose matching and holistic matching use the histogram proportion of depth differences for similarity measurement. Scheme 1 and Scheme 2 can be considered hierarchical combination schemes. Scheme 3 uses a weighted Sum rule to fuse the matching scores of the nose and the holistic face from two channels. The three schemes are compared in the experiments in the next section.
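A rough sketch of Scheme 1 (hypothetical code; the nose and holistic scores are assumed to be HP similarities computed as above, and the default values of W and M follow the optimal parameters reported in Section 3.2):

```python
import numpy as np

def scheme1_scores(nose_scores, holistic_scores, W=2.5, M=400):
    """Hierarchical combination: nose matching prescreens, holistic scores are boosted.

    nose_scores, holistic_scores: 1-D arrays of HP similarities between the probe
    and every gallery image (higher means more similar).
    """
    final = holistic_scores.copy()
    top_m = np.argsort(nose_scores)[::-1][:M]   # M most similar images by nose matching
    final[top_m] *= W                           # strengthen their holistic similarity
    return final
```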
Fig. 4. Three schemes for robust recognition
3 Experimental Results and Discussions
We design three experiments on the full set of 3D data in the FRGC 2.0 database to verify our proposed method. Firstly, the performances of the direct depth difference, the local window depth difference, and the final combination are compared; secondly, the three combination schemes are compared; finally, our proposed method is compared with some other 3D face recognition methods. The experimental results are reported as follows.
3.1 Database Description and Preprocessing
FRGC 2.0 [13] is a benchmark database released in recent years. To the best of our knowledge, the 3D face dataset of FRGC 2.0 is the largest 3D face database to date. FRGC 2.0 consists of six experiments; our experiments belong to Experiment 3, which measures the performance of 3D face recognition. This dataset contains 4007 shape scans of 466 different persons. All the 3D data were acquired by a Minolta Vivid 900/910 series sensor over three semesters in two years. The 3D shape data are given as raw point clouds, which contain spikes and holes, and manually labeled coordinates of the eyes, nose, and chin are provided. Registration, hole filling, spike removal, and normalization were carried out on the raw data in succession as preprocessing. After that, the data points in the three-dimensional coordinates were projected and normalized as range images (100 × 80 pixels); the face regions were cropped with the noses placed at the same position, and a mask was used to remove marginal noise. Fig. 5 shows some samples. From Fig. 5, we can see that FRGC 2.0 is a very challenging database, and some "bad" images like the ones in the second row still exist in the database after
Fig. 5. Samples from FRGC 2.0 (after preprocessing). The first row: images from the same person. The second row: some samples of ”bad” images suffered from data loss, big holes, and occlusions caused by the hair.
the preprocessing step, owing to the quality of the original data and to our preprocessing method, which is not powerful enough to handle those problems. However, our main focus in this work is not data preprocessing but the recognition method. Although those bad images greatly challenge our final recognition performance, we still carried out all the experiments on the full dataset. Three ROC (Receiver Operating Characteristic) curves, ROC I, ROC II and ROC III, are required for performance analysis in Experiment 3 of FRGC 2.0. They measure algorithm performance in three cases: target and query images acquired within the same semester, within the same year but in different semesters, and in different years. The difficulty of the three cases increases progressively.
3.2 Parameters: α, M and W
In Equation 2, the threshold α is the key parameter of the similarity measure. Here, we obtain the optimal α from the training set: αh = 16 and αl = 19 for holistic HP matching and local HP matching, respectively. We also obtain the optimal parameters of the combination schemes by training: W1 = 2.5 and M1 = 400 for Scheme 1, W2 = 1.1 and M2 = 400 for Scheme 2, and W3 = 0.25 for Scheme 3. All following experiments are carried out with the above optimal parameters.
3.3 The Improvements of Two Modifications
In Section 2, we proposed the local window DD to replace the direct DD in order to solve the correspondence problem, and then proposed three schemes to address the expression problem. To verify the improvement brought by these two modifications, we compare the performance of direct DD, local window DD, and Scheme 1 in Fig. 6. We can see that local window DD performs slightly better than direct DD, and Scheme 1 performs best among the three, achieving a significant improvement.
Fig. 6. Performance of direct DD, local window DD, and Scheme 1
The results demonstrate that the two modifications proposed in Section 2 work and achieve a clear improvement in recognition performance.
3.4 The Comparison of Three Combination Schemes
The three schemes proposed in Section 2 are intended to solve the problem caused by expression variation. They are different methods of combining holistic face matching and rigid nose matching. Their EER performances for the three ROC curves are shown in Table 1.

Table 1. EERs of the ROC curves for the three combination schemes

           Scheme 1   Scheme 2   Scheme 3
ROC I      9.8%       10.7%      11.1%
ROC II     10.9%      11.7%      12.1%
ROC III    12.4%      12.8%      13.1%
From Table 1, Scheme 1 and Scheme 2 perform better than Scheme 3 (weighted Sum), which is a widely used fusion scheme. We can therefore conclude that a hierarchical scheme is superior to the weighted Sum rule for combining the holistic face and the local rigid region for recognition. Moreover, Scheme 1 achieves the lowest EERs on all three ROC curves, which is the best result among the three proposed schemes. This is a rather good result on such a challenging database with our coarse preprocessing method, and it demonstrates that using the rigid nose region as a prescreener is a feasible way to alleviate the impact of expression in a 3D face recognition system.
3.5 The Comparison with Other Methods
To demonstrate the effectiveness of our proposed method, here we compare Scheme 1 with some other 3D face recognition methods on the same database. Their performance comparison is shown in Table 2.

Table 2. Comparison with other methods

           ROC I    ROC II   ROC III
Scheme 1   9.8%     10.9%    12.4%
3DLBP      11.2%    11.7%    12.4%
LBP        15.4%    15.8%    16.4%
LDA        16.81%   16.82%   16.82%
PCA        19.8%    21.0%    22.2%
PCA and LDA are common methods that have been used in 3D face recognition based on range images [8][9][14], and the performance of PCA is the baseline for comparison in FRGC 2.0. 3DLBP is a recently proposed and effective 3D face recognition algorithm [11]; the result of using LBP (Local Binary Patterns) for 3D face recognition is also reported in [11]. From Table 2, we can see that our proposed method, Scheme 1, performs best among these 3D face recognition methods. Moreover, compared to 3DLBP, our proposed method is much simpler in theory and requires much less computation while achieving better performance. The results in Table 2 demonstrate that our method works well even though it is rather simple and straightforward.
4 Conclusion
In this paper, we have proposed a 3D face recognition method based on the analysis of intra- and inter-personal differences of range images. Our contributions include: 1) we proposed to use the histogram proportion to evaluate the similarity of facial range images, based on their depth differences; 2) we proposed the local window depth difference to replace direct range image subtraction and so avoid the imprecise correspondence problem; 3) to overcome intra-personal expression variation, which can cause false matching, we proposed to combine the rigid nose region and the holistic face for matching. Three combination schemes were proposed and compared, and from the experimental performance analysis we obtained an effective scheme as our final method, which uses the nose region as a prescreener in a hierarchical combination scheme. Extensive experiments have been conducted on the full set of 3D face data in FRGC 2.0. Encouraging results and comparisons with other 3D face recognition methods have demonstrated the effectiveness of our proposed method, and even better performance could be achieved after refining our preprocessing method in the future.
Acknowledgement This work is supported by Program of New Century Excellent Talents in University, National Natural Science Foundation of China (Grant No. 60575003,
60332010, 60335010, 60121302, 60275003, 69825105, 60605008), Joint Project supported by National Science Foundation of China and Royal Society of UK (Grant No. 60710059), Hi-Tech Research and Development Program of China (Grant No. 2006AA01Z133), National Basic Research Program (Grant No. 2004CB318110), and the Chinese Academy of Sciences.
References

1. Bowyer, K., Chang, K., Flynn, P.: A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition. Computer Vision and Image Understanding 101(1), 1–15 (2006)
2. Gordon, G.: Face recognition based on depth maps and surface curvature. Proc. of SPIE, Geometric Methods in Computer Vision 1570, 108–110 (1991)
3. Chua, C.S., Han, F., Ho, Y.K.: 3D human face recognition using point signature. In: Proc. International Conf. on Automatic Face and Gesture Recognition, pp. 233–238 (2000)
4. Medioni, G., Waupotitsch, R.: Face recognition and modeling in 3D. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 232–233 (2003)
5. Lu, X., Colbry, D., Jain, A.K.: Three-dimensional model based face recognition. In: 17th ICPR, vol. 1, pp. 362–366 (2004)
6. Xu, C., Wang, Y., Tan, T., Quan, L.: Automatic 3D face recognition combining global geometric features with local shape variation information. In: Proc. of 6th ICAFGR, pp. 308–313 (2004)
7. Bronstein, A., Bronstein, M., Kimmel, R.: Expression-invariant 3D face recognition. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 62–69. Springer, Heidelberg (2003)
8. Achermann, B., Jiang, X., Bunke, H.: Face recognition using range images. In: Proc. of ICVSM, pp. 129–136 (1997)
9. Chang, K.I., Bowyer, K.W., Flynn, P.J.: An evaluation of multimodal 2D+3D face biometrics. IEEE Transaction on PAMI 27(4), 619–624 (2005)
10. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transaction on PAMI 24(7), 971–987 (2002)
11. Huang, Y., Wang, Y., Tan, T.: Combining statistics of geometrical and correlative features for 3D face recognition. In: Proc. of 17th British Machine Vision Conference, vol. 3, pp. 879–888 (2006)
12. Jebara, T., Pentland, A.: Bayesian face recognition. Pattern Recognition 33(11), 1771–1782 (2000)
13. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: Proc. of CVPR 2005, vol. 1, pp. 947–954 (2005)
14. Heseltine, T., Pears, N., Austin, J.: Three-dimensional face recognition using surface space combinations. In: Proc. of BMVC 2004 (2004)
15. Chang, K., Bowyer, K.W., Flynn, P.J.: ARMS: Adaptive rigid multi-region selection for handling expression variation in 3D face recognition. In: IEEE Workshop on FRGC Experiments, IEEE Computer Society Press, Los Alamitos (2005)
Kernel Discriminant Analysis Based on Canonical Differences for Face Recognition in Image Sets Wen-Sheng Vincnent Chu, Ju-Chin Chen, and Jenn-Jier James Lien Robotics Laboratory, Dept. of Computer Science and Information Engineering National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan {l2ior,joan,jjlien}@csie.ncku.edu.tw http://robotics.csie.ncku.edu.tw
Abstract. A novel kernel discriminant transformation (KDT) algorithm based on the concept of canonical differences is presented for automatic face recognition applications. For each individual, the face recognition system compiles a multi-view facial image set comprising images with different facial expressions, poses and illumination conditions. Since the multi-view facial images are non-linearly distributed, each image set is mapped into a highdimensional feature space using a nonlinear mapping function. The corresponding linear subspace, i.e. the kernel subspace, is then constructed via a process of kernel principal component analysis (KPCA). The similarity of two kernel subspaces is assessed by evaluating the canonical difference between them based on the angle between their respective canonical vectors. Utilizing the kernel Fisher discriminant (KFD), a KDT algorithm is derived to establish the correlation between kernel subspaces based on the ratio of the canonical differences of the between-classes to those of the within-classes. The experimental results demonstrate that the proposed classification system outperforms existing subspace comparison schemes and has a promising potential for use in automatic face recognition applications. Keywords: Face recognition, canonical angles, kernel method, kernel Fisher discriminant (KFD), kernel discriminant transformation (KDT), kernel PCA.
1 Introduction

Although the capabilities of computer vision and automatic pattern recognition systems have improved immeasurably in recent decades, face recognition remains a challenging problem. Conventional face recognition schemes are invariably trained by comparing single-to-single or single-to-many images (or vectors). However, a single training or testing image provides insufficient information to optimize the face recognition performance because the faces viewed in daily life exhibit significant variations in terms of their size and shape, facial expressions, pose, illumination conditions, and so forth. Accordingly, various face recognition methods based on facial image appearance [3], [8], [14] have been proposed. However, these schemes were implemented on facial images captured under carefully controlled environments. To obtain a more practical and stable recognition performance, Yamaguchi et al. [17]
proposed a mutual subspace method (MSM) and showed that the use of image sets consisting of multi-view images significantly improved the performance of automatic face recognition systems. Reviewing the literature, it is found that the methods proposed for comparing two image sets can be broadly categorized as either sample-based or model-based. In sample-based methods, such as that presented in [10], a comparison is made between each pair of samples in the two sets, and thus the computational procedure is time consuming and expensive. Moreover, model-based methods, such as those described in [1] and [12], require the existence of a strong statistical correlation between the training and testing image sets to ensure a satisfactory classification performance [7]. Recently, the use of canonical correlations as a means of comparing image sets has attracted considerable attention. In the schemes presented in [5], [6], [7], [9], [16] and [17], each image set was represented in terms of a number of subspaces generated using such methods as PCA or KPCA, and image matching was performed by evaluating the canonical angles between two subspaces. However, in the mutual subspace method (MSM) presented in [17], the linear subspaces corresponding to the two different image sets are compared without considering any inter-correlations. To improve the classification performance of MSM, the constrained mutual subspace method (CMSM) proposed in [6] constructed a constraint subspace, generated on the basis of the differences between the subspaces, and then compared the subspaces following their projection onto this constraint subspace. The results demonstrated that the differences between two subspaces provided effective components for carrying out their comparison. However, Kim et al. [7] reported that the classification performance of CMSM is largely dependent on an appropriate choice of the constraint subspace dimensionality. Accordingly, the authors presented an alternative scheme based upon a discriminative learning algorithm in which it was unnecessary to specify the dimensionality explicitly. It was shown that the canonical vectors generated in the proposed approach rendered the subspaces more robust to variations in the facial image pose and illumination conditions than the eigenvectors generated using a conventional PCA approach. For non-linearly distributed or highly overlapped data, such as those associated with multi-view facial images, the classification performances of the schemes presented in [6] and [17] are somewhat limited due to their assumption of an essentially linear data structure. To resolve this problem, Schölkopf et al. [11] applied kernel PCA (KPCA) to construct linear subspaces, i.e. kernel subspaces, in a high-dimensional feature space. Yang [18] showed that kernel subspaces provide an efficient representation of non-linearly distributed data for object recognition purposes. Accordingly, Fukui et al. [5] developed a kernel version of CMSM, designated KCMSM, designed to carry out 3D object recognition by matching kernel subspaces. However, although the authors reported that KCMSM provided an efficient means of classifying non-linearly distributed data, the problem of the reliance of the classification performance upon the choice of an appropriate constraint subspace dimensionality was not resolved.
In an attempt to address the problems outlined above, this study proposes a novel scheme for comparing kernel subspaces using a kernel discriminant transformation (KDT) algorithm. The feasibility of the proposed approach is explored in the context of an automatic face recognition system. To increase the volume of information
available for the recognition process, a multi-view facial image set is created for each individual showing the face with a range of facial expressions, poses and illumination conditions. To make the non-linearly distributed facial images more easily separable, each image is mapped into a high-dimensional feature space using a non-linear mapping function. The KPCA process is then applied to each mapped image set to generate the corresponding kernel subspace. To render the kernel subspaces more robust to variations in the facial images, canonical vectors [7] are derived for each pair of kernel subspaces. The subspace spanned by these canonical vectors is defined as the canonical subspace. The difference between the vectors of different canonical subspaces (defined as the canonical difference) is used as a similarity measure to evaluate the relative closeness (i.e. similarity) of different kernel subspaces. Finally, exploiting the proven classification ability of the kernel Fisher discriminant (KFD) [2], [11], a kernel discriminant transformation (KDT) algorithm is developed for establishing the correlation between kernel subspaces. In the training process, the KDT algorithm proceeds to find a kernel transformation matrix by maximizing the ratio of the canonical differences of the between-classes to those of the within-classes. Then, in the testing process, the kernel transformation matrix is applied to establish the inter-correlation between kernel subspaces.
Fig. 1. (a) and (b) show the first five eigenvectors (or principal components) and corresponding canonical vectors, respectively, of a facial image set. Note that the first and second rows correspond to the same individual. Comparing the images in (a) and (b), it is observed that each pair of canonical vectors, i.e. each column in (b), contains more common factors such as poses and illumination conditions than each pair of eigenvectors, i.e. each column in (a). The images in (c) and (d) show the differences between the eigenvectors and the canonical vectors, respectively, of each image pair. It is apparent that the differences in the canonical vectors, i.e. the canonical differences, are less influenced by illumination effects than the differences in the eigenvectors. In (e), u and v are canonical vectors and the canonical difference d is proportional to the angle Θ between them.
2 Canonical Difference Creation

This section commences by reviewing the concept of canonical vectors, which, as described above, span a subspace known as the canonical subspace. Subsequently,
the use of the canonical difference in representing the similarity between pairs of subspaces is discussed.

2.1 Canonical Subspace Creation

Describing the behavior of an image set in the input space as a linear subspace, we denote P_1 and P_2 as two n × d orthonormal basis matrices representing two of these linear subspaces. Applying singular value decomposition (SVD) to the product of the two subspaces yields

P_1^T P_2 = Φ_{12} Λ Φ_{21}^T, s.t. Λ = diag(σ_1, ..., σ_n),   (1)

where Φ_{12} and Φ_{21} are eigenvector matrices and {σ_1, ..., σ_n} are the cosine values of the canonical angles [4] and represent the level of similarity of the two subspaces. The SVD form of the two subspaces can be expressed as

Λ = Φ_{12}^T P_1^T P_2 Φ_{21}.   (2)
Thus, the similarity between any two subspaces can be evaluated simply by computing the trace of Λ. Let the canonical subspaces be defined by the matrices C_1 = P_1 Φ_{12} = [u_1, ..., u_d] and C_2 = P_2 Φ_{21} = [v_1, ..., v_d], and the canonical vectors by {u_i}_{i=1}^{d} and {v_i}_{i=1}^{d}, respectively [7]. Here, Φ_{12} and Φ_{21} denote two projection matrices which regularize P_1 and P_2, respectively, to establish the correlation between them. Comparing Figs. 1(a) and 1(b), it can be seen that the eigenvectors of the orthonormal basis matrices P_1 and P_2 (shown in Fig. 1(a)) are significantly affected by pose and illumination in the individual facial images. By contrast, the projection matrices ensure that the canonical vectors are capable of more faithfully reproducing variations in the pose and illumination of the different facial images.

2.2 Difference Between Canonical Subspaces
As shown in Fig. 1.(e), the difference between the two canonical vectors of different canonical subspaces is proportional to the angle between them. Accordingly, the canonical difference (or canonical distance) can be defined as follows:
CanonicalDiff(i, j) = Σ_{r=1}^{d} ||u_r − v_r||^2 = trace( (C_i − C_j)^T (C_i − C_j) ).   (3)
Clearly, the closer the two subspaces are to one another, the smaller the value given by CanonicalDiff(i, j) in its summation of the diagonal terms. As shown in Figs. 1(c) and (d), the canonical differences between two facial images contain more discriminative information than the eigenvector differences and are less affected by variations in the facial images.
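As a concrete illustration of Eqs. (1)–(3), the sketch below computes the canonical vectors and the canonical difference of two orthonormal basis matrices with NumPy; it is our own minimal example (function names and matrix sizes are assumptions), not the authors' implementation.

    import numpy as np

    def canonical_difference(P1, P2):
        # P1, P2: n x d orthonormal basis matrices of two linear subspaces
        # SVD of the subspace product, Eq. (1): P1^T P2 = Phi12 diag(sigma) Phi21^T
        Phi12, sigma, Phi21T = np.linalg.svd(P1.T @ P2)
        C1 = P1 @ Phi12        # canonical subspace of P1, columns u_1, ..., u_d
        C2 = P2 @ Phi21T.T     # canonical subspace of P2, columns v_1, ..., v_d
        D = C1 - C2
        # Eq. (3): sum of squared differences between paired canonical vectors
        return np.trace(D.T @ D), sigma   # sigma holds the canonical-angle cosines

    # toy usage with random orthonormal bases (n = 100, d = 5)
    rng = np.random.default_rng(0)
    P1, _ = np.linalg.qr(rng.standard_normal((100, 5)))
    P2, _ = np.linalg.qr(rng.standard_normal((100, 5)))
    diff, cosines = canonical_difference(P1, P2)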
3 Kernel Discriminant Transformation (KDT) Using Canonical Differences

This section commences by discussing the use of KPCA to generate kernel subspaces and then applies the canonical difference concept proposed in Section 2.2 to develop a kernel discriminant transformation (KDT) algorithm designed to determine the correlation between two subspaces. Finally, a kernel Fisher discriminant (KFD) scheme is used to provide a solution to the proposed KDT algorithm. Note that the KDT algorithm is generated during the training process and then applied during the testing process.

3.1 Kernel Subspace Generation
To generate kernel subspaces, each image set in the input space is mapped into a high-dimensional feature space F using the following nonlinear mapping function:
φ : {X_1, ..., X_m} → {φ(X_1), ..., φ(X_m)}.   (4)
In practice, the dimensionality of F, which is defined as h, can be huge, or even infinite. Thus, performing calculations in F is highly complex and computationally expensive. This problem can be resolved by applying the "kernel trick", in which the dot products φ(x) · φ(y) are replaced by a kernel function k(x, y), which allows the dot products to be computed without actually mapping the image sets. Generally, k(x, y) is specified as the Gaussian kernel function, i.e.

k(x, y) = exp( −||x − y||^2 / σ^2 ).   (5)
Let m image sets be denoted as {X_1, ..., X_m}, where the i-th set, i.e. X_i = [x_1, ..., x_{n_i}], contains n_i images in its columns. Note that the images of each facial image set belong to the same class. The "kernel trick" can then be applied to compute the kernel matrix K_{ij} of image sets i and j (i, j = 1, ..., m). Matrix K_{ij} is an n_i × n_j matrix, in which each element has the form

(K_{ij})_{sr} = φ^T(x_s^i) φ(x_r^j) = k(x_s^i, x_r^j), s = 1, ..., n_i, r = 1, ..., n_j.   (6)
The particular case of K_{ii}, i.e. j = i, is referred to as the i-th kernel matrix. To generate the kernel subspaces of each facial image set in F, KPCA is performed on each of the mapped image sets. In accordance with the theory of reproducing kernels, the p-th eigenvector of the i-th kernel subspace, e_p^i, can be expressed as the linear combination of all the mapped images, i.e.

e_p^i = Σ_{s=1}^{n_i} a_{sp}^i φ(x_s^i),   (7)
where the coefficient a_{sp}^i is the s-th component of the eigenvector corresponding to the p-th largest eigenvalue of the i-th kernel matrix K_{ii}. Denoting the dimensionality of the kernel subspace P_i of the i-th image set as d, P_i can be represented as the span of the eigenvectors {e_p^i}_{p=1}^{d}.
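For illustration, a minimal sketch of the kernel subspace generation step (Eqs. (5)–(7)) is given below; the function and variable names are our own assumptions, and details such as centering in feature space are omitted for brevity.

    import numpy as np

    def gaussian_kernel_matrix(X, Y, sigma2):
        # X: n_x x D, Y: n_y x D image sets (one vectorized image per row); Eq. (5)
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / sigma2)

    def kernel_subspace_coefficients(Xi, d, sigma2):
        # KPCA on one image set: returns the n_i x d coefficient matrix a^i whose
        # columns define the kernel subspace eigenvectors e_p^i of Eq. (7).
        Kii = gaussian_kernel_matrix(Xi, Xi, sigma2)
        eigvals, eigvecs = np.linalg.eigh(Kii)          # ascending eigenvalues
        idx = np.argsort(eigvals)[::-1][:d]             # keep the d largest
        A = eigvecs[:, idx]
        # scale so that the mapped eigenvectors e_p^i have unit norm in F
        A = A / np.sqrt(np.maximum(eigvals[idx], 1e-12))
        return A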
3.2 Kernel Discriminant Transformation Formulation

The kernel subspace generation process described in Section 3.1 yields P_i ∈ R^{h×d} as a d-dimensional kernel subspace corresponding to the i-th image set s.t. φ(X_i)φ^T(X_i) ≅ P_i Λ P_i^T. In this section, an h × w kernel discriminant transformation (KDT) matrix T is defined to transform the mapped image sets in order to obtain an improved identification ability, where w is defined as the dimensionality of the KDT matrix T. Multiplying both sides of the d-dimensional kernel subspace of the i-th image set by T gives
(T^T φ(X_i))(T^T φ(X_i))^T ≅ (T^T P_i) Λ (T^T P_i)^T.   (8)
It can be seen that the kernel subspace of the transformed mapped image set is equivalent to that obtained by applying T to the original kernel subspace. Since it is necessary to normalize the subspaces to guarantee that the canonical differences are computable, we have to first ensure that T^T P_i is orthonormal. A process of QR-decomposition is then applied to each transformed kernel subspace T^T P_i such that T^T P_i = Q_i R_i, where Q_i is a w × d orthonormal matrix and R_i is a d × d invertible upper triangular matrix. Since R_i is invertible, the normalized kernel subspace following transformation by matrix T can be written as
Q_i = T^T P_i R_i^{-1}.   (9)
To obtain the canonical difference between two subspaces, it is first necessary to calculate the d × d projection matrices Φ_{ij} and Φ_{ji}, i.e.

Q_i^T Q_j = Φ_{ij} Λ Φ_{ji}^T.   (10)
The canonical difference between two transformed kernel subspaces i and j can then be computed from
CanonicalDiff(i, j) = trace( (Q_i Φ_{ij} − Q_j Φ_{ji})^T (Q_i Φ_{ij} − Q_j Φ_{ji}) )
= trace( T^T (P_i Φ'_{ij} − P_j Φ'_{ji})(P_i Φ'_{ij} − P_j Φ'_{ji})^T T ),   (11)
where Φ'_{ij} = R_i^{-1} Φ_{ij} and Φ'_{ji} = R_j^{-1} Φ_{ji}. The transformation matrix T is derived by maximizing the ratio of the canonical differences of the between-classes to those of the within-classes. This problem can be formulated by optimizing the Fisher discriminant, i.e.
T = arg max_T [ Σ_{i=1}^{m} Σ_{l∈B_i} CanonicalDiff(i, l) ] / [ Σ_{i=1}^{m} Σ_{k∈W_i} CanonicalDiff(i, k) ] = arg max_T trace(T^T S_B T) / trace(T^T S_W T),   (12)

where S_B = Σ_{i=1}^{m} Σ_{l∈B_i} (P_i Φ'_{il} − P_l Φ'_{li})(P_i Φ'_{il} − P_l Φ'_{li})^T denotes the between-scatter, given that B_i is the set of class labels of the between classes, and S_W = Σ_{i=1}^{m} Σ_{k∈W_i} (P_i Φ'_{ik} − P_k Φ'_{ki})(P_i Φ'_{ik} − P_k Φ'_{ki})^T is the within-scatter, given that W_i is the set of class labels of the within classes.

3.3 Kernel Discriminant Transformation Optimization
In this section, we describe an optimization process for solving the Fisher discriminant given by Eq. (12). First, the total number of training images is assumed to be M, i.e. M = Σ_{i=1}^{m} n_i. Using the theory of reproducing kernels as shown in Eq. (7), the vectors {t_q}_{q=1}^{w} of T can be represented as the span of all mapped training images in the form

t_q = Σ_{u=1}^{M} α_{uq} φ(x_u),   (13)
where α_{uq} are the elements of an M × w coefficient matrix α. Applying the definition of Φ'_{ij} and Eq. (7), respectively, the projected kernel subspaces P̃_{ij} = P_i Φ'_{ij} = [ẽ_1^{ij}, ..., ẽ_d^{ij}] are given by

ẽ_p^{ij} = Σ_{r=1}^{d} Σ_{s=1}^{n_i} a_{sr}^i Φ'_{ij,rp} φ(x_s^i).   (14)
Applying Eqs. (13) and (14), it can be shown that T^T P̃_{ij} = α^T Z_{ij}, where each element of Z_{ij} has the form

(Z_{ij})_{up} = Σ_{r=1}^{d} Σ_{s=1}^{n_i} a_{sr}^i Φ'_{ij,rp} k(x_u, x_s^i), u = 1, ..., M, p = 1, ..., d.   (15)
From the definition of S_W and Eq. (15), the denominator of Eq. (12) can be rewritten as

T^T S_W T = α^T U α, where U = Σ_{i=1}^{m} Σ_{k∈W_i} (Z_{ki} − Z_{ik})(Z_{ki} − Z_{ik})^T   (16)

is an M × M within-scatter matrix.
Following a similar step to the derivation of Eq. (16), the numerator of Eq. (12) can be rewritten as

T^T S_B T = α^T V α, where V = Σ_{i=1}^{m} Σ_{l∈B_i} (Z_{li} − Z_{il})(Z_{li} − Z_{il})^T   (17)

is an M × M between-scatter matrix.
Combining Eqs. (16) and (17), the original formulation given in Eq. (12) can be transformed into the problem of maximizing the following Jacobian function:
J(α) = (α^T V α) / (α^T U α).   (18)
α can be found by solving for the leading eigenvectors of U^{-1}V. Fig. 2 summarizes the solution procedure involved in computing the KDT matrix T. However, considering the case that U is not invertible, we replace U by U_μ, i.e. we simply add a constant value to the diagonal terms of U, where

U_μ = U + μI.   (19)

Thus U_μ is ensured to be positive-definite and the inverse U_μ^{-1} exists. That is, the leading eigenvectors of U_μ^{-1}V are computed as the solution for α.
Algorithm: Kernel Discriminant Transformation (KDT)
Input: All training image sets {X_1, ..., X_m}
Output: T = [t_1, ..., t_w], where t_q = Σ_{u=1}^{M} α_{uq} φ(x_u), q = 1, ..., w
1. α ← rand(M, w)
2. For all i, do SVD: K_{ii} = a^i Γ a^{iT}
3. Do the following iteration:
4.   For all i, compute (T^T P_i)_{qp} = Σ_{u=1}^{M} Σ_{s=1}^{n_i} α_{uq} a_{sp}^i k(x_u, x_s^i)
5.   For all i, do QR-decomposition: T^T P_i = Q_i R_i
6.   For each pair of i and j, do SVD: Q_i^T Q_j = Φ_{ij} Λ Φ_{ji}^T
7.   For each pair of i and j, compute Φ'_{ij} = R_i^{-1} Φ_{ij}
8.   For each pair of i and j, compute (Z_{ij})_{up} = Σ_{r=1}^{d} Σ_{s=1}^{n_i} a_{sr}^i Φ'_{ij,rp} k(x_u, x_s^i)
9.   Compute U = Σ_{i=1}^{m} Σ_{k∈W_i} (Z_{ki} − Z_{ik})(Z_{ki} − Z_{ik})^T and V = Σ_{i=1}^{m} Σ_{l∈B_i} (Z_{li} − Z_{il})(Z_{li} − Z_{il})^T
10.  Compute the eigenvectors {α_p}_{p=1}^{w} of U^{-1}V, α ← [α_1, ..., α_w]
11. End

Fig. 2. Solution procedure for kernel discriminant transformation matrix T
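The inner steps of Fig. 2 can be written compactly in matrix form. The sketch below is our own illustration (not the authors' code) of steps 4–8 for one pair of image sets, assuming that K_all_i, the M × n_i cross-kernel matrix with entries k(x_u, x_s^i), and the n_i × d KPCA coefficient matrix A_i (the a^i of step 2) have been precomputed.

    import numpy as np

    def pair_step(alpha, K_all_i, A_i, K_all_j, A_j):
        # Step 4: T^T P_i = alpha^T K_all_i A_i (a w x d matrix), likewise for j.
        TPi = alpha.T @ K_all_i @ A_i
        TPj = alpha.T @ K_all_j @ A_j
        # Step 5: QR-decomposition gives orthonormal Q and invertible R.
        Qi, Ri = np.linalg.qr(TPi)
        Qj, Rj = np.linalg.qr(TPj)
        # Step 6: SVD of Q_i^T Q_j yields the projection matrices Phi_ij, Phi_ji.
        Phi_ij, _, Phi_ji_T = np.linalg.svd(Qi.T @ Qj)
        # Step 7: Phi'_ij = R_i^{-1} Phi_ij and Phi'_ji = R_j^{-1} Phi_ji.
        Phi_p_ij = np.linalg.solve(Ri, Phi_ij)
        Phi_p_ji = np.linalg.solve(Rj, Phi_ji_T.T)
        # Step 8: Z_ij = K_all_i A_i Phi'_ij and Z_ji = K_all_j A_j Phi'_ji (M x d).
        Z_ij = K_all_i @ A_i @ Phi_p_ij
        Z_ji = K_all_j @ A_j @ Phi_p_ji
        return Z_ij, Z_ji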
[Fig. 3 diagram: the training image sets {X_1, ..., X_m} and the testing image set X_test are each passed through the non-linear mapping φ and KPCA to give the kernel subspaces P_1, ..., P_m and P_test; the kernel discriminant transformation T maps these to the reference subspaces Ref_i = T^T P_i and to T^T P_test, which are compared with the canonical-difference similarity measure for identification.]
Fig. 3. Workflow of proposed face recognition system
4 Face Recognition System

Fig. 3 illustrates the application of the KDT matrix in the context of a face recognition system. In the training process, the kernel subspaces P_i are generated by performing KPCA on each set of mapped training images. The KDT matrix T is then obtained via an iterative learning process based on maximizing the ratio of the canonical differences of the between-classes to those of the within-classes. Reference subspaces, Ref_i, are then calculated by applying T to P_i, i.e. Ref_i = T^T P_i. In the testing process, a procedure similar to that conducted in the training process is applied to the testing image set X_test to generate the corresponding kernel subspace P_test and the transformed kernel subspace T^T P_test, i.e. Ref_test. By comparing each pair of reference subspaces, i.e. Ref_i and Ref_test, an identification result with index id can be obtained by finding the minimal canonical difference, i.e.
id = arg min_i CanonicalDiff(i, test).   (20)
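The identification step of Eq. (20) can be sketched as follows, reusing the canonical_difference helper sketched after Eq. (3); the reference subspaces Ref_i = T^T P_i and the transformed test subspace are assumed to be available as w × d matrices, and QR normalization follows Eq. (9). The names are our own.

    import numpy as np

    def identify(TP_test, TP_refs):
        # TP_test: w x d transformed test subspace T^T P_test
        # TP_refs: list of w x d reference subspaces Ref_i = T^T P_i
        Q_test, _ = np.linalg.qr(TP_test)                 # normalization, Eq. (9)
        best_id, best_diff = -1, np.inf
        for i, TP_i in enumerate(TP_refs):
            Q_i, _ = np.linalg.qr(TP_i)
            diff, _ = canonical_difference(Q_i, Q_test)   # Eqs. (10)-(11)
            if diff < best_diff:
                best_id, best_diff = i, diff
        return best_id                                    # Eq. (20)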
5 Experimental Results

A facial image database was compiled by recording image sequences under five controlled illumination conditions. The sequences were recorded at a rate of 10 fps
using a 320 × 240 pixel resolution. This database was then combined with the Yale B database [19] to give a total of 19,200 facial images of 32 individuals of varying gender and race. Each facial image was cropped to a 20 × 20-pixel scale using the Viola-Jones face detection algorithm [15] and was then preprocessed using a band-pass filtering scheme to compensate for illumination differences between the different images. The training process was performed using an arbitrarily-chosen image sequence, randomly partitioned into three image sets. The remaining sequences relating to the same individual were then used for testing purposes. A total of eight randomly-chosen sequence combinations were used to verify the performance of the proposed KDT classifier. Furthermore, the performance of the KDT scheme is also verified by performing a series of comparative trials using existing subspace comparison schemes such as KMSM, KCMSM and DCC. In performing the evaluation trials, the dimensionality of the kernel subspace was specified as 30 for the KMSM, KCMSM and KDT schemes, while in DCC the dimensionality was assigned a value of 20 to preserve 98% of the energy in the images. In addition, the variance of the Gaussian kernel was specified as 0.05. Finally, to ensure that the matrix U was computable, μ in Eq. (19) was assigned a value of 10^{-3}. Fig. 4(a) illustrates the convergence of the KDT solution procedure for different experimental initializations. It can be seen that as the number of iterations increases, the Jacobian value given in Eq. (18) converges to the same point irrespective of the initialization conditions. However, it is observed that the KMSM scheme achieves a better performance than the proposed method for random initializations. Figs. 4(b) and (c) demonstrate the improvement obtained in the similarity matrix following 10 iterations. Fig. 5(a) illustrates the relationship between the identification rate and the dimensionality, w, and demonstrates that the identification rate is degraded if w is not assigned a sufficiently large value. From inspection, it is determined that w = 2,200 represents an appropriate value. Adopting this value of w, Fig. 5(b) compares the identification rate of the KDT scheme with that of the MSM, CMSM, KMSM and KCMSM methods, respectively. Note that the data represent the results obtained using eight different training/testing combinations. Overall, the results show that KDT consistently outperforms the other classification methods.
Fig. 4. (a) Convergence of Jacobian value J (α) under different initialization conditions. (b) and (c) similarity matrices following 1st and 10th iterations, respectively.
Fig. 5. (a) Relationship between dimensionality w and identification rate. (b) Comparison of identification rate of KDT and various subspace methods for eight training/testing combinations.
6 Conclusions

A novel kernel subspace comparison method, kernel discriminant transformation (KDT), has been presented based on canonical differences and formulated as a form of Fisher's discriminant with parameter T. Since the original form with parameter T may not be computable directly, the kernel Fisher discriminant is used to rewrite the formulation and to convert the original problem into a solvable one with parameter α. An optimized solution for KDT is then obtained by iteratively learning α. KDT has been evaluated on the proposed face recognition system using various multi-view facial image sets, and has been shown to converge stably under different initializations. Experimental results show promising performance in face recognition. In future studies, the performance of the KDT algorithm will be further evaluated using additional facial image databases, including the Cambridge-Toshiba Face Video Database [7]. The computational complexity of the KDT scheme increases as the number of images used in the training process increases. Consequently, future studies will investigate the feasibility of using an ensemble learning technique to reduce the number of training images required while preserving the quality of the classification results.
References 1. Arandjelović, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.: Face Recognition with Image Sets Using Manifold Density Divergence. In: IEEE Conf. on Computer Vision and Pattern Recognition(CVPR), vol. 1, pp. 581–588 (2005) 2. Baudat, G., Anouar, F.: Generalized Discriminant Analysis Using a Kernel Approach. Neural Computation 12(10), 2385–2404 (2000) 3. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence(PAMI) 19(7), 711–720 (1997) 4. Chatelin, F.: Eigenvalues of matrices. John Wiley & Sons, Chichester (1993) 5. Fukui, K., Stenger, B., Yamaguchi, O.: A Framework for 3D Object Recognition Using the Kernel Constrained Mutual Subspace Method. In: Asian Conf. on Computer Vision, pp. 315–324 (2006)
6. Fukui, K., Yamaguchi, O.: Face Recognition Using Multi-Viewpoint Patterns for Robot Vision. In: International Symposium of Robotics Research, pp. 192–201 (2003) 7. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations. IEEE Trans. on PAMI 29(6), 1005–1018 (2007) 8. Penev, P.S., Atick, J.J.: Local Feature Analysis: A General Statistical Theory for Object Representation. Network: Computation in Neural systems 7(3), 477–500 (1996) 9. Sakano, H., Mukawa, N.: Kernel Mutual Subspace Method for Robust Facial Image Recognition. In: International Conf. on Knowledge-Based Intelligent Engineering System and Allied Technologies, pp. 245–248 (2000) 10. Satoh, S.: Comparative Evaluation of Face Sequence Matching for Content-Based Video Access. In: IEEE Conference on Automatic Face and Gesture Recognition (FG), pp. 163– 168. IEEE Computer Society Press, Los Alamitos (2000) 11. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear Component Analysis as A Kernel Eigenvalue Problem. Neural Computation 10(5), 1299–1319 (1998) 12. Shakhnarovich, G., Fisher, J.W., Darrel, T.: Face Recognition from Long-Term Observations. In: European Conference on Computer Vision, pp. 851–868 (2000) 13. Shakhnarovich, G., Moghaddam, B.: Face Recognition in Subspaces. Handbook of Face Recognition (2004) 14. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. CVPR, 453–458 (1993) 15. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004) 16. Wolf, L., Shashua, A.: Kernel Principal Angles for Classification Machines with Applications to Image Sequence Interpretation. CVPR, 635–642 (2003) 17. Yamaguchi, O., Fukui, K., Maeda, K.: Face Recognition Using Temporal Image Sequence. FG (10), 318–323 (1998) 18. Yang, M.-H.: Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Kernel Methods. FG, 215–220 (2002) 19. http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html
Person-Similarity Weighted Feature for Expression Recognition

Huachun Tan1 and Yu-Jin Zhang2

1 Department of Transportation Engineering, Beijing Institute of Technology, Beijing 100081, China
[email protected]
2 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
[email protected]

Abstract. In this paper, a new method to extract person-independent expression features based on HOSVD (Higher-Order Singular Value Decomposition) is proposed for facial expression recognition. With the assumption that similar persons have similar facial expression appearance and shape, a person-similarity weighted expression feature is used to estimate the expression feature of the test person. As a result, the estimated expression feature can reduce the influence of the individual caused by insufficient training data, becomes less person-dependent, and can be more robust to new persons. The proposed method has been tested on the Cohn-Kanade facial expression database and the Japanese Female Facial Expression (JAFFE) database. Person-independent experimental results show the efficiency of the proposed method.
1 Introduction
Facial expression analysis is an active area in human-computer interaction [1, 2]. Many techniques of facial expression analysis have been proposed that try to make the interaction tighter and more efficient. During the past decade, the development of image analysis, object tracking, pattern recognition, computer vision, and computer hardware has brought facial expressions into human-computer interaction as a new modality. Many systems for automatic facial expression analysis have been developed since the pioneering work of Mase and Pentland [3], and surveys of automatic facial expression analysis [1, 2] have also appeared. Many algorithms have been proposed to improve the robustness towards environmental changes, such as different illuminations or different head poses. Traditionally, they use geometric features that represent the shape and locations of facial components, including the mouth, eyes, brows, nose, etc., for facial expression recognition. Using the geometric features, the methods are more robust to variation in face position, scale, size and head orientation, and are less person-dependent [4, 5]. In order to represent detailed appearance changes such as wrinkles and creases as well, they combine geometric facial features and appearance facial features [6, 7]. There has been little research dealing with the effect of individuals [7-11]. It is still a challenging task for computer vision to recognize facial expression across
different persons [8]. Matsugu et al [9] proposed a rule-based algorithm for subject-independent facial expression recognition, which is reported as the first facial expression recognition system combining subject independence with robustness to variability in facial appearance. However, for a rule-based method, setting effective rules for many expressions is difficult. Wen et al [7] proposed a ratio-image based appearance feature for facial expression recognition, which is independent of a person's face albedo. Abbound et al [10] proposed bilinear appearance factorization based representations for facial expression recognition. But these two methods did not consider the differences in the representation of facial expressions across different persons. Wang et al [8] applied modified Higher-Order Singular Value Decomposition (HOSVD) to expression-independent face recognition and person-independent facial expression recognition simultaneously. In that method, the representation of the test person should be the same as the representation of one of the persons in the training set; otherwise, his/her expression would be wrongly classified. The problem may be solved by training the model on a great deal of face images. However, the feature space of expression is so huge that it is difficult to collect enough face image data and train a model that works robustly for new face images. Tan et al [11] proposed a person-similarity weighted distance based on HOSVD to improve the performance of person-independent expression recognition, but the recognition rate is not satisfactory. In this paper, a new method based on HOSVD to extract person-independent facial expression features is proposed to improve the performance of expression recognition for new persons. The new method is based on two assumptions:
– It is assumed that similar persons have similar facial expression appearance and shape, which is often used for facial expression synthesis [8].
– For simplicity, it is also assumed that the facial expression is affected by only one factor, the individual. Other factors, such as pose and illumination, are not considered.
Based on these two assumptions, the expression feature of new persons is estimated to reduce the influence of the individual caused by insufficient training data. The estimate, named the person-similarity weighted feature, is used for expression recognition. Different from the work in [11], which improves the performance in the distance measure, this paper improves the performance of person-independent expression recognition by estimating an expression feature that can represent the expression of new persons more effectively. The proposed method has been tested on the Cohn-Kanade facial expression database [12] and the Japanese Female Facial Expression (JAFFE) database [13]. Person-independent experimental results show that the proposed method is more robust to new persons than the previous HOSVD-based method [8], the direct use of geometric and appearance features, and the work in [11]. The remainder of this paper is organized as follows. Background on HOSVD is reviewed in Section 2. Then the person-similarity weighted expression feature extraction is described in Section 3. Person-independent experimental results are presented in Section 4. Finally, the conclusion of the paper is given in Section 5.
2 Background of HOSVD

2.1 HOSVD
In order to reduce the influence caused by the individual, a factorization model to disentangle the person and expression factors is explored. Higher-Order Singular Value Decomposition (HOSVD) offers a potent mathematical framework for analyzing the multifactor structure of image ensembles and for addressing the difficult problem of disentangling the constituent factors or modes. Vasilescu et al. [14, 15] proposed using N-mode SVD to analyze face images in a way that can account for each factor inherent to image formation. The resulting "TensorFaces" are used for face recognition and obtained better results than PCA (Principal Component Analysis). Terzopoulos [16] extended TensorFaces for analyzing, synthesizing, and recognizing facial images. Wang et al. [8] used modified HOSVD to decompose facial expression images from different persons into separate subspaces: an expression subspace and a person subspace. Thus, it could lead to expression-independent face recognition and person-independent facial expression recognition simultaneously. In the conventional multi-factor analysis method for facial expression recognition proposed by Wang et al [8], a third-order tensor A ∈ R^{I×J×K} is used to represent the facial expression configuration, where I is the number of persons, J is the number of facial expressions for each person, and K is the dimension of the facial expression feature vector V_e combining geometric features and appearance features. Then the tensor A can be decomposed as

A = S ×_1 U^p ×_2 U^e ×_3 U^f,   (1)

where S is the core tensor representing the interactions of the person, expression and feature subspaces, and U^p, U^e, U^f represent the person, expression and facial expression feature subspaces, respectively. Each row vector in each subspace matrix represents a specific vector in this mode. Each column vector in each subspace matrix represents the contributions of the other modes. For details of HOSVD, we refer readers to the works in [8, 14, 15, 17]. Using a simple transformation, two tensors related to the person and expression modes are defined, which we call the expression tensor T^e and the person tensor T^p, respectively, given by

T^e = S ×_2 U^e ×_3 U^f,   (2)

T^p = S ×_1 U^p ×_3 U^f.   (3)
Each column in T^p or T^e is a basis vector that comprises I or J eigenvectors. T^p or T^e is a tensor whose size is I × J × K. The input test tensor T_test is a 1 × 1 × K tensor using V_e^test in the third mode. Then the expression vector u_i^{e,test} of the i-th person is represented as

u_i^{e,test} = uf(T_test, 2)^T · uf(T^p(i), 2)^{-1}.   (4)

Similarly, the person vector of the j-th facial expression is represented as

u_j^{p,test} = uf(T_test, 1)^T · uf(T^e(j), 1)^{-1},   (5)
where uf(T, n) means unfolding the tensor T in the n-th mode. That is, the coefficient vector obtained by projecting the test vector onto the i-th row eigenvectors in the basis tensor T^e is the expression vector u_i^{e,test} of the i-th person, and the coefficient vector obtained by projecting the test vector onto the j-th column eigenvectors in the basis tensor T^p is the person vector u_j^{p,test} of the j-th expression.

2.2 Conventional Matching Processing

Given the test original expression vector V_e^test, the goal of expression recognition is to find j* that satisfies

j* = arg max_{j=1,...,J} P(u_j^e | V_e^test),   (6)
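Since Eqs. (4)–(6) are written in terms of mode-n unfoldings, a small utility for unfolding a 3-way tensor may help clarify the notation; this is a generic NumPy sketch with 0-indexed modes, not the authors' code.

    import numpy as np

    def unfold(T, mode):
        # Mode-n unfolding of a tensor: the chosen mode becomes the row index
        # and the remaining modes are flattened into the columns.
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    # example: a random I x J x K tensor unfolded along each mode
    A = np.random.default_rng(0).standard_normal((5, 4, 3))
    A1, A2, A3 = unfold(A, 0), unfold(A, 1), unfold(A, 2)   # shapes (5,12), (4,15), (3,20)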
In the previous HOSVD-based method [8], all test expression vectors associated with all persons in the training set, u_i^{e,test}, i = 1, 2, ..., I, are compared to the expression-specific coefficient vectors u_j^e, j = 1, 2, ..., J, where u_j^e is the j-th row of the expression subspace U^e. The one that yields the maximum similarity sim(u_i^{e,test}, u_j^e), i = 1, 2, ..., I, j = 1, 2, ..., J, among all persons and expressions identifies the unknown vector V_e^test as expression index j. That is, the matching processing of conventional methods is to find (i*, j*) that satisfy

(i*, j*) = arg max_{i=1,...,I, j=1,...,J} P(u_j^e | u_i^{e,test}).   (7)

3 Person-Similarity Weighted Expression Features

3.1 Problems of Conventional Matching Processing
In the matching processing, one assumption is made: that the test person and one of the persons in the training set represent their expressions in the same way. That is, the "true" expression feature of the test person u^{e,test} is assumed to equal u_{i*}^{e,test} with probability 1, i.e.

P(u^{e,test} = u_i^{e,test} | V_e^test) = 1 if i = i*, and 0 otherwise.   (8)

Then u^{e,test} is estimated as u_{i*}^{e,test} and used for classification according to Eq. (7). Ideally, the i*-th person used for calculating the expression feature is the most similar person in the sense of face recognition. However, in many cases the assumption that the test person and one of the persons in the training set represent their expressions in the same way does not hold, because of individual differences and insufficient training data. In such cases the test expression cannot be estimated using Eq. (8), and the expressions are apt to be wrongly recognized. Some results of our experiments, reported in Section 4, illustrate this problem. How to estimate the expression feature of the test person remains a challenge for expression recognition when the training data are insufficient.
3.2 Person-Similarity Weighted Expression Features
In order to improve the performance of facial expression recognition across different persons, we propose to estimate the "true" expression feature of the test person by taking the information of all persons in the training set into account. To set up the probability model, the assumption that similar persons have similar facial expression appearance and shape, which has been widely used in facial expression synthesis [8], is used for facial expression recognition in our method. The assumption can be formulated as follows: the expression feature of one person is equal to that of another person with probability P, and the probability P is proportional to the similarity between the two persons. That is,

P(u_i^{e,test} | V_e^test) ∝ s_i, i = 1, 2, ..., I,   (9)
where u_i^{e,test} is the expression feature associated with the i-th person, u^{e,test} is the "true" expression feature of the test person, and s_i denotes the similarity between the test person and the i-th person in the training set. Under this assumption, u^{e,test} is estimated. For simplicity, it can also be assumed that the prior probabilities of all classes of persons are equal. Then, based on Eq. (9), the estimate of the expression feature of the test person is

û^{e,test} = E(u^{e,test} | V_e^test) = Σ_i p_i u_i^{e,test} = (1/Z) Σ_i s_i u_i^{e,test},   (10)
where Z is the normalization constant, which is equal to the sum of the similarities of the test person and all persons in the training set. Then, the expression feature weighted by the similarities of the test person and all persons in the training set can be used as the "true" expression feature of the test person for expression recognition. Through the weighting process, the person-similarity weighted expression feature can reduce the influence of new persons caused by insufficient training data, and is less person-dependent. In order to estimate the expression feature, the similarities of the test person and all persons in the training set need to be determined first. They can be calculated using the person subspace obtained from the HOSVD decomposition proposed by Wang et al [8]. In the process of calculating person similarities, the cosine of the angle between two vectors, a and b, is adopted as the similarity function:
4
tr < aT , b > < a, b > = a • b a • b
(11)
4 Experimental Results
The original expression features, including geometric features and appearance features, are first extracted. The process of geometric feature extraction is similar to that of [11], but the geometric features of the cheek are not used in our experiments, since the definition of the cheek feature points is difficult and tracking them is not robust in our experiments. Then the geometric
features are tracked using the method proposed in [18]. The appearance features based on the expression ratio image and Gabor wavelets, which are used in [11], are also extracted. Finally, the person-similarity weighted expression feature is estimated for expression recognition.

4.1 Experimental Setup
The proposed method is applied to the CMU Cohn-Kanade expression database [12] and the JAFFE database [13] to evaluate the performance of person-independent expression recognition.

Cohn-Kanade Database. In the Cohn-Kanade expression database, the facial expressions are classified into action units (AUs) of the Facial Action Coding System (FACS), instead of a few prototypic expressions. In our experiments, 110 image sequences of 6 AUs for the upper face were selected from 46 subjects of European, African, and Asian ancestry. The data distribution for training and testing is shown in Table 1. No subject appears in both training and testing.

Table 1. Data distribution of training and test data sets

        AU1  AU6  AU1+2  AU1+2+5  AU4+7  AU4+6+7
train    9    19     6      12      14       5
test     5    14     6      11       6       3
In the experiments on the Cohn-Kanade database, since there are not enough training data to fill in the whole tensor for training, the mean feature of the specific expression in the training set is used to substitute for the blanks that have no training data.

JAFFE Database. For the JAFFE database, 50 images of 10 persons, with 5 basic expressions displayed by each person, are selected for training and testing. The 5 basic expressions are happiness, sadness, surprise, angry and dismal. Leave-one-out cross validation is used: for each test, we use the data of 9 persons for training and that of the remaining one as test data. Since only static images are provided in the JAFFE database, the image of the neutral expression is considered as the first frame. The initial expression features are extracted from two images: one is the image of the neutral expression and the other is the image of one of the 5 basic expressions mentioned above.

4.2 Person-Independent Experimental Results
The proposed method is based on the conventional HOSVD method proposed by Wang [8], and the initial expression vectors are similar to those in [19]. In our experiments, the classification performance of the proposed method is compared with the following methods:
– Classifying geometric and appearance features directly with a three-layer standard back-propagation neural network with one hidden layer, which is similar to the method used in [19].
– The conventional HOSVD-based method proposed by Wang [8].
– Classifying expressions by the person-similarity weighted distance proposed by Tan [11].

The results on the Cohn-Kanade database are shown in Table 2. The average accuracy of expression recognition using the proposed method has been improved to 73.3% from 58.6%, 55.6% and 64.4%, respectively, compared with the other methods. From Table 2, it can be observed that the recognition rate of AU6 of the proposed method is the lowest, and the recognition rate of AU1+2+5 using the proposed method is slightly lower than that of using geometric and appearance features directly. When the training data are adequate, i.e. the test person is more similar to the persons in the training set, the first two methods can achieve satisfactory results. When the training data are inadequate, the proposed method outperforms the other methods, because the estimated expression feature is made less person-dependent by the weighting process and the performance of the proposed method is therefore more robust.

Table 2. Comparison of average recognition rate on the Cohn-Kanade database

               Tian[19]  Wang[8]  Tan[11]  Proposed
AU1              40%       80%      80%      100%
AU6              85.7%     92.9%    78.6%    78.6%
AU1+2            16.7%     33%      66.7%    83.3%
AU1+2+5          54.6%     27.3%    27.3%    45.5%
AU4+7            66.7%     33.3%    66.7%    66.7%
AU4+6+7           0%       66.7%    100%     100%
Average Rate     55.6%     58.6%    64.4%    73.3%
The performances of the methods on the JAFFE database are reported in Table 3. We can see that the proposed method is more robust to new persons than the other methods. Though the proposed method does not outperform the others for every expression, its average recognition rate is much higher.

Table 3. Comparison of average recognition rate on the JAFFE database

               Tian[19]  Wang[8]  Tan[11]  Proposed
Happiness        70%       30%      50%      50%
Sadness          60%       90%      60%      70%
Surprise         40%       70%      70%      80%
Angry            50%       30%      80%      90%
Dismal           60%       60%      50%      40%
Average Rate     56%       56%      62%      66%

The average recognition rate of the proposed method has been improved to 66% from
56%, 56% and 62% for using the initial expression features directly, the traditional method, and our previous work, respectively.

4.3 Discussions
From the person-independent experiments, it can be observed that the proposed method is more robust to new persons in expression recognition. The reason is that the estimated expression feature can reduce the influence of the individual and become less person-dependent. However, the proposed method did not outperform for all expressions. The reason is that the similarities of persons obtained by the face recognition algorithm are rough, and thus the "true" expression feature cannot be estimated accurately. Can the assumption that "similar persons have similar expression appearance and shape" be used for facial expression recognition? This is a debatable question. Though the assumption is often used in facial expression synthesis and is intuitively plausible, there is no psychological evidence to support the claim, and the assumption cannot be generalized to all persons. According to Darwin's theory, the representation of expression is influenced by a person's habits [20], not by individual appearance and shape. However, the experimental results show that the average recognition rate of the proposed method is higher than that of the two methods that do not use the assumption. This shows that the assumption can be generalized to a certain extent.
5 Conclusion
We have proposed a method of extracting person-independent facial expression features for expression recognition. After obtaining the person subspace and expression subspace using HOSVD, the expression features associated with all persons in the training set are linearly combined, weighted by the similarities of the persons. The work is based on the assumption that similar persons have similar facial expression representations, which is often used for facial expression synthesis. Through the weighting process, the person-similarity weighted expression feature is less person-dependent and more robust to new persons. The person-independent experimental results show that the proposed method can achieve more accurate expression recognition. Compared with the traditional method based on HOSVD [8], the direct use of geometric and appearance features [19], and the work using the person-similarity weighted distance [11], the average accuracy of expression recognition using the proposed method outperforms the other methods. On the Cohn-Kanade database, it has been improved to 73.3% from 58.6%, 55.6% and 64.4%, respectively. On the JAFFE database, it has been improved to 66% from 56%, 56% and 62%, respectively. In this paper, we simply use the mean feature of the expression to fill in the tensor, and do not consider the factor of person on the Cohn-Kanade database. Using an iterative method to synthesize the missing features in the tensor may be more efficient.
In our method, only the factor of person is considered. However, the method is a general framework that can be easily extended to multifactor analysis. For example, if the illumination of the test expression is more similar to one class of illumination, a larger weight can be set to estimate the illumination-independent expression feature. Because the similarities of persons are only roughly estimated, the "true" expression feature of the test person cannot be estimated accurately. This problem may be alleviated by improving the performance of face recognition. However, recognizing an individual across different expressions is itself a challenging task for computer vision. These are all directions for our future work.
Acknowledgment This work has been supported by Grant RFDP-20020003011.
References 1. Pantic, M., Rothkrantz, L.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transaction on Pattern Analysis and Machine Intelligence 22, 1424–1445 (2000) 2. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36, 259–275 (2003) 3. Mase, K., Pentland, A.: Recognition of Facial Expression from Optical Flow. IEICE Transactions E74(10), 3474–3483 (1991) 4. Bartlett, M., Hager, J., Ekman, P., Sejnowski, T.: Measuring facial expressions by computer image analysis. Psychophysiology, 253–263 (1999) 5. Essa, I., Pentland, A.: Coding Analysis, Interpretation, and Recognition of Facial Expressions. IEEE Transaction on Pattern Analysis and Machine Intelligence, 757– 763 (1997) 6. Tian, Y., Kanade, T., Cohn, J.: Evaluation of gabor wavelet-based facial action unit recognition in image sequences of increasing complexity. In: Proc. of Int’l Conf. on Automated Face and Gesture Recognition, pp. 239–234 (2002) 7. Wen, Z., Huang, T.S.: Capturing subtle facial motions in 3D face tracking. In: ICCV, pp. 1343–1350 (2003) 8. Wang, H., Ahuja, N.: Facial Expression Decomposition. In: ICCV, pp. 958–965 (2003) 9. Matsugu, M., Mori, K., Mitari, Y., Kaneda, Y.: Subject independent facial expression recognition with robust face detection using a convolutional neural network. Neural Networks 16, 555–559 (2003) 10. Abbound, B., Davoine, F.: Appearance Factorization based Facial Expression Recognition and Synthesis. In: Proceedings of the International Conference on Pattern Recognition, vol. 4, pp. 163–166 (2004) 11. Tan, H., Zhang, Y.: Person-Independent Expression Recognition Based on Person Similarity Weighted Distance. Jounal of Electronics and Information Technology 29, 455–459 (2007) 12. Kanade, T., Cohn, J., Tian, Y.: Comprehensive Database for Facial Expression Analysis. In: Proc. of Int’l Conf. Automated Face and Gesture Recognition, pp. 46–53 (2000)
13. Michael, J.L., Shigeru, A., Miyuki, K., Jiro, G.: Coding Facial Expressions with Gabor Wavelets. In: Proc. of Int’l Conf. Automated Face and Gesture Recognition, pp. 200–205 (1998) 14. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Analysis of Image Ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002) 15. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear subspace analysis of image ensembles: Image Analysis for facial recognition. In: Proceedings of the International Conference on Pattern Recognition, pp. 511–514 (2002) 16. Terzopoulos, D., Lee, Y., Vasilescu, M.A.O.: Model-based and image-based methods for facial image synthesis, analysis and recognition. In: Proc. of Int’l Conf. on Automated Face and Gesture Recognition, pp. 3–8 (2004) 17. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A Multilinear Singular value Decomposition. SIAM Journal of Matrix Analysis and Applications 21, 1253–1278 (2000) 18. Tan, H., Zhang, Y.: Detecting Eye Blink States by Tracking Iris and Eyelids. Pattern Recognition Letters 27, 667–675 (2006) 19. Tian, Y., Kanade, T., Cohn, J.: Recognizing Action Units for Facial Expression Analysis. IEEE Trans. On PAMI 23, 97–115 (2001) 20. Darwin, C.: The Expression of Emotions in Man and Animals. Reprinted by the University of Chicago Press (1965)
Converting Thermal Infrared Face Images into Normal Gray-Level Images

Mingsong Dou1, Chao Zhang1, Pengwei Hao1,2, and Jun Li2

1 State Key Laboratory of Machine Perception, Peking University, Beijing, 100871, China
2 Department of Computer Science, Queen Mary, University of London, E1 4NS, UK
[email protected]

Abstract. In this paper, we address the problem of producing visible spectrum facial images as we normally see by using thermal infrared images. We apply Canonical Correlation Analysis (CCA) to extract the features, converting a many-to-many mapping between infrared and visible images into an approximately one-to-one mapping. Then we learn the relationship between the two feature spaces, in which the visible features are inferred from the corresponding infrared features using Locally-Linear Regression (LLR) or, what is called, Sophisticated LLE, and a Locally Linear Embedding (LLE) method is used to recover a visible image from the inferred features, recovering some information lost in the infrared image. Experiments demonstrate that our method maintains the global facial structure and infers many local facial details from the thermal infrared images.
1 Introduction

Human facial images have been widely used in biometrics, law enforcement, surveillance and so on [1], but in most cases only the visible spectrum images of human faces have been used. Recently, literature has begun to emerge on face recognition (FR) based on infrared images or on the fusion of infrared and visible spectrum images [2-4], and some sound results have been published. Rather than FR based on infrared images, this paper focuses on the transformation from thermal IR images to visible spectrum images (see Fig. 2 for examples of both image modalities), i.e. we try to render a visible spectrum image from a given thermal infrared image. Thermal infrared imaging sensors measure the temperature of the shot objects and are invariant to illuminance. There are many surveillance applications in which the light conditions are so poor that we can only acquire thermal infrared images. As we know, we see objects because of the reflectance of light, i.e. the formation of visible-spectrum images needs light sources. For thermal infrared images, the situation is much more favorable: all objects with temperatures above absolute zero emit electromagnetic waves, and the human body temperature is in the range of infrared emission. So even if it is completely dark, we can still obtain thermal infrared images with thermal infrared imaging cameras. Though the formations of visible spectrum and infrared facial images rely on different mechanisms, the images do share some common characteristics if they come from the same face, e.g.
we can recognize some facial features from both image modalities. There indeed exists some correlation between them which can be learned from training sets. The problem of rendering thermal infrared images as normally viewed visible images is actually very challenging. First of all, the correlations between the visible image and the corresponding infrared one are not strong. As mentioned above, the imaging models rely on different mechanisms. Infrared images are invariant under changes of the lighting conditions, so many visible spectrum images taken under different lighting conditions correspond to one infrared image. Therefore, the solution to our problem is not unique. Conversely, thermal infrared images are not constant either: they are subject to the surface temperature of the shot objects. For example, the infrared images taken from a person when he has just come in from the cold outside and from the same person when he has just done lots of sports are quite different. The analysis above shows that there is a many-to-many mapping between visible facial images and thermal infrared images of the same person, which is the biggest barrier to solving the problem. Another problem is that the resolution of visible spectrum facial images is generally much higher than that of thermal infrared images. Thus visible images have more information, and some information of visible spectrum images definitely cannot be recovered from thermal infrared images through the correlation relationship. In this paper, we have developed a method to address these problems. We use Canonical Correlation Analysis (CCA) to extract the features, converting a many-to-many mapping between infrared and visible images into an approximately one-to-one mapping. Then we learn the relationship between the feature spaces, in which the visible features are inferred from the corresponding infrared features using Locally-Linear Regression (LLR) or, what is called, Sophisticated LLE, and a Locally Linear Embedding (LLE) method is applied to recover a visible image from the inferred features, recovering some information lost in the infrared image.
2 Related Works

As presented above, this paper addresses the problem of conversion between different modal images, which shares much in common with the super-resolution problem [5-7], i.e. rendering one high resolution (HR) image from one or several low resolution (LR) images. For example, in both problems the data we try to recover contain information that is lost in the given observation data. Baker et al. [5] developed a super-resolution method called face hallucination to recover the lost information. They first matched the input LR image to those in the training set, found the most similar LR image, and then took the first-derivative information of the corresponding HR image in the training set as the information of the desired HR image. We adopt this idea of finding information in the training set for the data to be recovered. Chang et al. [6] introduced LLE [10] to super-resolution. Their method is based on the assumption that the patches in the low- and high-resolution images form manifolds with the same local geometry in two distinct spaces, i.e. we can reconstruct an HR patch from the neighboring HR patches with the same coefficients as those for reconstructing the corresponding LR patch from the neighboring LR patches. Actually this method is a special case of Locally-Weighted Regression (LWR) [8] when the
weights for all neighbors are equal and the regression function is linear, as we show in Section 4. We develop a Sophisticated LLE method which is an extension of LLE. Freeman et al. [7] modeled images as a Markov Random Field (MRF) with the nodes corresponding to image patches, i.e. the information from the surrounding patches is used to constrain the solution, whereas LLE does not use such information. MRF improves the results when the images are not well aligned, but in this paper we assume all the images are well-registered and we do not use the time-consuming MRF method. Our work is also related to some research on statistical learning. Melzer et al. [11] used Canonical Correlation Analysis (CCA) to infer the pose of an object from gray-level images, and Reiter et al.'s method [12] learns depth information from RGB images, also using CCA. CCA aims to find two sets of projection directions for two training sets, with the property that the correlation between the projected data is maximized. We use CCA for feature extraction. Shakhnarovich et al. [9] used Locally-Weighted Regression for pose estimation. To accelerate searching for the nearest neighbors (NN), they adopted and extended the Locality-Sensitive Hashing (LSH) technique. The problem we address here is much different from theirs: a visible image is not an underlying scene that generates an infrared image, while in their problem the pose is the underlying parameter for the corresponding image. So we use CCA to extract the most correlated features; at the same time, the dimensionality of the data is reduced dramatically, making nearest-neighbor searching easier. In our experiments we use exhaustive search for NN instead of LSH.
3 Feature Extraction Using CCA

As mentioned above, the correspondence between the visible and the infrared images is a many-to-many mapping, so learning a simple linear relationship between the two image spaces is not possible. Instead, extracting features and learning the relationship between the feature spaces can be a solution. We wish to extract features from the original image with the following properties: (1) the relationship between the two feature spaces is stable, i.e. there exists a one-to-one mapping between them, and it is easy to learn from the training set and performs well when generalized to the test set; (2) the features in the two distinct feature spaces should contain enough information to approximately recover the images. Unfortunately, for our problem the two properties conflict with each other. Principal Component Analysis (PCA), which is known as the EigenFace method [13] in face recognition, is a popular method to extract features. For our problem it satisfies the second condition above well, but two sets of principal components, extracted from a visible image and the corresponding infrared image, have weak correlations. Canonical Correlation Analysis (CCA) finds pairs of directions that yield the maximum correlations between two data sets or two random vectors, i.e. the correlations between the projections (features) of the original data onto these directions are maximized. CCA has the desired traits given in property (1) above. But unlike PCA, several CCA projections are not sufficient to recover the original data, for the found directions may not cover the principal variance of the data set. However, we find that regularized CCA is a satisfactory trade-off between the two desired properties.
3.1 Definition of CCA

Given two zero-mean random variables x, a p×1 vector, and y, a q×1 vector, CCA finds the first pair of directions w_1 and v_1 that maximize the correlation between the projections x̃ = w_1^T x and ỹ = v_1^T y,

max ρ(w_1^T x, v_1^T y) ,  s.t.  Var(w_1^T x) = 1  and  Var(v_1^T y) = 1 ,
(1)
where ρ is the correlation coefficient, the variables x̃ and ỹ are called the first canonical variates, and the vectors w_1 and v_1 form the first pair of correlation direction vectors. CCA finds the kth pair of directions w_k and v_k satisfying: (1) w_k^T x and v_k^T y are uncorrelated with the former k−1 canonical variates; (2) the correlation between w_k^T x and v_k^T y is maximized subject to the constraints Var(w_k^T x) = 1 and Var(v_k^T y) = 1. Then w_k^T x and v_k^T y are called the kth canonical variates, and w_k and v_k the kth pair of correlation direction vectors, k ≤ min(p, q). Solving for the correlation directions and correlation coefficients is equivalent to solving the generalized eigenvalue problems below,

(Σ_xy Σ_yy^{-1} Σ_xy^T − ρ^2 Σ_xx) w = 0 ,
(2)
(Σ_xy^T Σ_xx^{-1} Σ_xy − ρ^2 Σ_yy) v = 0 ,
(3)
where Σ_xx and Σ_yy are the self-correlation matrices, and Σ_xy and Σ_yx are the cross-correlation matrices. There are robust methods to solve this problem; interested readers are referred to [15], where an SVD-based method is introduced.

Unlike PCA, which aims to minimize the reconstruction error, CCA puts the correlation of the two data sets first. There is no assurance that the directions found by CCA cover the main variance of the data set, so generally speaking a few projections (canonical variates) are not sufficient to recover the original data well. Besides the recovery problem, we also have to deal with overfitting. CCA is sensitive to noise: even with a small amount of noise in the data, CCA may still maximize the correlations between the extracted features, but the features are then more likely to represent the noise rather than the data. As mentioned in [11], a sound remedy is to add a multiple of the identity matrix, λI, to the covariance matrices Σ_xx and Σ_yy; this method is called regularized CCA. We find that it also affects the reconstruction accuracy, as depicted in Fig. 1. Regularized CCA is therefore a trade-off between the two desired properties mentioned above.

3.2 Feature Extraction and Image Recovery from Features

We extract local features rather than holistic ones, since holistic features seem to fail to capture local facial traits. A training set consisting of pairs of visible and infrared images is at our disposal. We partition all the images into overlapping patches; then at every patch position we have a set of patch pairs for CCA learning, and CCA finds pairs of directions W^(i) = [w_1, w_2, …, w_k] and V^(i) = [v_1, v_2, …, v_k] for visible and infrared patches respectively, where the superscript (i) denotes the patch index
(or the patch position in the image). Every column of W or V is a unit direction vector, but different columns are not mutually orthogonal. Taking a visible patch p (represented as a column vector by raster scan) at position i as an example, we extract the CCA feature of the patch p using

f = W^(i)T p ,
(4)
where f is the feature vector of the patch.
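As a concrete illustration, the following NumPy/SciPy sketch shows one way the regularized CCA directions of a single patch position could be computed from Eqs. (1)-(3) and then used for the feature extraction of Eq. (4). It is not the authors' implementation: the function names and the regularization parameter lam are our own, and the generalized eigenvalue route is only one of several possible solvers (an SVD-based method is described in [15]).

import numpy as np
from scipy.linalg import eigh

def regularized_cca(X, Y, k, lam):
    # X: N x p visible patches, Y: N x q infrared patches (rows are samples,
    # assumed zero-mean). Returns W (p x k), V (q x k) and the correlations rho.
    N = X.shape[0]
    Sxx = X.T @ X / N + lam * np.eye(X.shape[1])   # regularized self-correlation
    Syy = Y.T @ Y / N + lam * np.eye(Y.shape[1])
    Sxy = X.T @ Y / N                              # cross-correlation
    # Eq. (2): (Sxy Syy^-1 Sxy^T) w = rho^2 Sxx w  -- a generalized eigenproblem
    A = Sxy @ np.linalg.solve(Syy, Sxy.T)
    rho2, W = eigh(A, Sxx)
    order = np.argsort(rho2)[::-1][:k]             # keep the k strongest pairs
    W = W[:, order]
    rho = np.sqrt(np.clip(rho2[order], 0.0, 1.0))
    # Eq. (3): the matching y-directions satisfy v ∝ Syy^-1 Sxy^T w
    V = np.linalg.solve(Syy, Sxy.T @ W)
    # normalize columns to unit length, as assumed in Sec. 3.2
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    return W, V, rho

def cca_feature(W, p):
    # Eq. (4): f = W^T p for a raster-scanned patch p at this position
    return W.T @ p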
Fig. 1. The first row shows the first CCA direction for different λ (we rearrange the direction vector as an image; outlined faces are visible in the first several images), and the second row shows the corresponding reconstruction results. CCA is patch-based as introduced in Section 3.2; we reconstruct the image with 20 CCA variates using Eq. (6). If the largest singular value of the covariance matrix is c, we set (a) λ = c/20; (b) λ = c/100; (c) λ = c/200; (d) λ = c/500; (e) λ = c/5000. It is obvious that when λ is small, the CCA direction tends to be noisy, and the reconstructed face tends toward the mean face.
It is somewhat tricky to reconstruct the original patch p from the feature vector f. Since W is not orthogonal, we cannot reconstruct the patch by p = Wf as we do in PCA. However, we can solve the least squares problem below to obtain the original patch,

p = arg min_p ||W^T p − f||_2^2 ,
(5)
or, adding an energy constraint,

p = arg min_p ||W^T p − f||_2^2 + ||p||_2^2 .
(6)
The least squares problems above can be efficiently solved with the scaled conjugate gradient method. This reconstruction method is feasible only when the feature vector f contains enough information about the original patch. When fewer canonical variates (features) are extracted, we can recover the original patch using the LLE method [10]. As in [6], we assume that the manifold of the feature space and that of the patch space have the same local geometry; then the original patch and its features have the same reconstruction coefficients. If p_1, p_2, …, p_k are the patches whose features f_1, f_2, …, f_k are f's k nearest neighbors, and f can be reconstructed from its neighbors with f = Fw, where F = [f_1, f_2, …, f_k] and w = [w_1, w_2, …, w_k]^T, we can reconstruct the original patch by
p = Pw ,
(7)
where P = [p_1, p_2, …, p_k]. The reconstruction results using Eqs. (6) and (7) are shown in Fig. 3(a). When only a few canonical variates are at hand, the method of Eq. (7) performs better than Eq. (6); when more canonical variates are available, the two methods give almost equally satisfying results.
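For illustration, a minimal NumPy sketch of the two recovery options is given below (our own code, not the authors'): recover_patch_ls solves the energy-constrained least squares of Eq. (6) in closed form rather than with the scaled conjugate gradient mentioned above, and recover_patch_lle implements the neighbor-embedding recovery of Eq. (7), with plain Euclidean nearest neighbors in feature space as an assumed choice of distance.

import numpy as np

def recover_patch_ls(W, f, mu=1.0):
    # Eq. (6): argmin_p ||W^T p - f||^2 + mu*||p||^2, solved via normal equations
    d = W.shape[0]
    return np.linalg.solve(W @ W.T + mu * np.eye(d), W @ f)

def recover_patch_lle(f, train_feats, train_patches, k=5):
    # Eq. (7): find the k nearest training features, compute the reconstruction
    # weights of f, and reuse them on the corresponding training patches
    idx = np.argsort(np.linalg.norm(train_feats - f, axis=1))[:k]
    F = train_feats[idx].T                        # features as columns
    w, *_ = np.linalg.lstsq(F, f, rcond=None)     # f ≈ F w
    return train_patches[idx].T @ w               # p = P w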
4 Facial Image Conversion Using CCA Features

From the training database we obtain the CCA projection directions at every patch position, and for all the patches from all the training images we extract features by projecting onto the proper directions; at each patch position i we thus get a visible training set O_v^i = {f_{v,j}^i} and an infrared one O_ir^i = {f_{ir,j}^i}. Given a new infrared image, we partition it into small patches and obtain the feature vector f_ir of every patch. If we can infer the corresponding visible feature vector f_v, the visible patch can be obtained using Eq. (7), and the patches are then combined into a visible facial image. In this section we focus on the prediction of the visible feature vector from the infrared one. Note that the inferences for patches at different positions are based on different training feature sets.

4.1 Reconstruction Through Locally-Linear Regression

Locally-Weighted Regression [8][9] is a method to fit a function of the independent variables locally based on a training set, and it suits our problem well. To simplify the method, we set the weights of the nearest neighbors (NN) equal and use a linear model to fit the function; LWR then degenerates to Locally-Linear Regression (LLR). For an input infrared feature vector f_ir, we find its K-NNs in the training set O_ir, which compose a matrix F_ir = [f_{ir,1}, f_{ir,2}, …, f_{ir,K}], and their corresponding visible feature vectors compose a matrix F_v = [f_{v,1}, f_{v,2}, …, f_{v,K}]. Note that we omit the patch index for convenience. A linear regression then gives the relation matrix,

M = arg min_M Σ_k ||f_{v,k} − M f_{ir,k}||_2^2 = F_v F_{ir}^+ ,
(8)
where F_{ir}^+ is the pseudo-inverse of F_ir. The corresponding visible feature vector f_v can be inferred from the input infrared feature f_ir by

f_v = M f_ir = F_v F_{ir}^+ f_ir .
(9)
To find the nearest neighbors, the distance between two infrared feature vectors f^T and f^I needs to be defined. In this paper, we define the distance as

D = Σ_k ρ_k (f_k^T − f_k^I) ,
(10)
where ρ_k is the kth correlation coefficient, and f_k^T and f_k^I denote the kth elements of the feature vectors f^T and f^I, respectively.
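The following NumPy sketch (ours, with hypothetical function and variable names) puts Eqs. (8)-(10) together for one patch: it ranks the training infrared features by the correlation-weighted distance of Eq. (10), taken here with an absolute difference per component, and applies the pseudo-inverse prediction of Eq. (9).

import numpy as np

def llr_predict(f_ir, train_ir, train_v, rho, K=50):
    # train_ir, train_v: (N, k) CCA features of the training patches at this
    # position; rho: (k,) canonical correlation coefficients.
    # Eq. (10), read with an absolute difference in each component:
    d = np.abs(train_ir - f_ir) @ rho
    idx = np.argsort(d)[:K]                    # K nearest neighbours
    F_ir = train_ir[idx].T                     # k x K
    F_v = train_v[idx].T
    # Eqs. (8)-(9): f_v = F_v F_ir^+ f_ir
    return F_v @ (np.linalg.pinv(F_ir) @ f_ir)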
Actually, the LLE method used in [6] is equivalent to LLR. The reconstruction coefficients w of the infrared feature f_ir from its K-NNs, F_ir, can be obtained by solving the least squares problem

w = arg min_w ||F_ir w − f_ir||
(11)
= F_{ir}^+ f_ir .
Then the reconstructed corresponding visible feature vector is f_v = F_v F_{ir}^+ f_ir, which has the same form as Eq. (9). The difference between the two methods lies in the choice of the number K of NNs. In LLR, to make the regression sensible, we select a large K to ensure that F_ir ∈ R^{m×n} has more columns than rows (m < n); in LLE, K is small, so m > n and F_{ir}^+ = (F_ir^T F_ir)^{-1} F_ir^T. The reconstruction results are shown in Fig. 2(b)(e). The LLR method gives better results, but consumes more resources since it needs to find a larger number of NNs. In the next section we extend LLE to a Sophisticated LLE which achieves results competitive with LLR while using approximately the same resources as LLE.
4.2 Reconstruction Via Sophisticated LLE

The reason for the poor performance of the LLE method may be that the local geometries of the two manifolds of visible and infrared features are not the same. We use an experiment to demonstrate this. For every infrared feature vector f_ir^i in the training set we find its four NNs {f_ir^1, f_ir^2, f_ir^3, f_ir^4} (the neighbors are ordered by decreasing distance to f_ir^i, the same below) whose convex hull (a tetrahedron) contains the infrared patch, but their visible counterparts, the visible feature vector f_v^i and its neighbors {f_v^1, f_v^2, f_v^3, f_v^4}, do not preserve the same geometric relations. Moreover, more than 90 percent of the f_v^i's lie outside the convex hulls of the corresponding neighbors.

It is a natural idea to learn the change between the two local geometries of the two manifolds. Since the local geometry is represented by the reconstruction coefficients, we only need to learn the mapping H(·) between the infrared and the visible reconstruction coefficients, denoted as x and y respectively, with y = H(x). Since we have a training database at hand, we collect the pairs of reconstruction coefficient vectors (x_1, y_1), …, (x_N, y_N), which are used to reconstruct the feature vectors of infrared and visible patches respectively. We could obtain the function H(·) between them using the least squares method, or a simpler algorithm can be used in which the form of H(·) need not be known. For an input feature vector f_ir^i, we compute its reconstruction coefficients x_i using its k-NNs in the infrared feature space. What we try to obtain is the reconstruction coefficient vector y_i which is used to reconstruct the visible feature f_v^i corresponding to f_ir^i. We find the coefficient vector x_i' most similar to x_i in the infrared coefficient dataset, and we regard the corresponding visible coefficient vector y_i' as an estimate of y_i. We call our method Sophisticated LLE.
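A sketch of this simpler variant is given below. It reflects our reading of the procedure rather than the authors' code; coeffs_ir and coeffs_v are assumed to hold the training pairs (x_j, y_j) of reconstruction coefficient vectors collected beforehand, and Euclidean distances are assumed for both nearest-neighbor searches.

import numpy as np

def sophisticated_lle(f_ir, train_ir, train_v, coeffs_ir, coeffs_v, k=4):
    # LLE coefficients x of f_ir w.r.t. its k NNs in the infrared feature space
    idx = np.argsort(np.linalg.norm(train_ir - f_ir, axis=1))[:k]
    F_ir = train_ir[idx].T
    x, *_ = np.linalg.lstsq(F_ir, f_ir, rcond=None)
    # look up the most similar training coefficient vector x' ...
    j = np.argmin(np.linalg.norm(coeffs_ir - x, axis=1))
    y = coeffs_v[j]                            # ... and borrow its visible partner y'
    # reconstruct the visible feature from the visible counterparts of the NNs
    return train_v[idx].T @ y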
5 Experimental Results

We use the publicly available database collected by Equinox Corporation [14] for our experiments. We select 70 subjects from the database, and each subject has 10 pairs of
visible and infrared images with different expressions. The long-wave thermal infrared images are used because the points of the image pairs are well-matched, even though they have lower resolution than the middle-wave images. All the images have been manually aligned to guarantee that the eye centers and the mouth centers are well-registered. Some image pairs of the data set are shown in Fig. 2(a)(g). We test our algorithm on the training set using the leave-one-out scheme, i.e., one pair is taken out of the database as the test images (the infrared image as the input and the visible image as the ground truth), all the pairs of the same subject are removed from the database as well, and the remaining pairs are taken as the training data.
Fig. 2. The results of face image conversion from thermal infrared images. (a) the input infrared image; (b) the result of our method using the LLR prediction method with 5 canonical variates for each patch; (c) the result of our method using the Sophisticated LLE prediction method; (d) the face reconstructed using 5 canonical features for each patch extracted from the ground truth; (e) the result of directly using the LLE method; (f) the result of the holistic method; (g) the ground truth.
There are several parameters to be chosen in our algorithm, such as the size of the patches, the number of canonical variates k (the dimensionality of the feature vector) we take for every patch, and the number of neighbors we use. Generally speaking, the correlation between pairs of infrared and visible patches of a smaller size is weaker, so the inference is less reliable. A larger size makes the correlation stronger, but more canonical variates are needed to represent the patch, which makes the training samples much sparser in the feature space. The size of the images in our database is 110×86, and we choose a patch size of 9×9 with 3-px overlapping. Since the projections (features) onto the first pairs of directions have stronger correlations, choosing fewer features makes the inference more robust, while choosing more features gives a more accurate representation of the original patch. Similarly, when we choose a larger number of neighbors, K, there are more samples, which makes the algorithm more robust but time-consuming. We choose 2–8 features and 30–100 neighbors for LLR, and the results differ only slightly.

We have compared our methods with other existing algorithms such as LLE and the holistic method. The results in Fig. 2 show that our method is capable of preserving the global facial structure and capturing some detailed facial features such as wrinkles, mustache, and the boundary of the nose. Our algorithm is also robust to facial expressions,
Fig. 3. (a) The comparison of reconstruction results using Eq(6) and Eq(7). The first row is the ground truth; the second row is the face reconstructed using Eq(7), and the third row using Eq(6). 5 canonical variates taken from the ground truth are used for each patch. It is clear that the reconstructions of Eq(7) contain more information than those of Eq(6). (b) The face image conversion results with different expressions of the same subject. The first column is the input infrared image; the second column is our conversion result; the third column is the reconstruction result using the canonical variates extracted from the ground truth; the last column is the ground truth.
as shown in Fig. 3(b). The prediction methods proposed in Sections 4.1 and 4.2 give slightly different results, as shown in Fig. 2(b)(c). Although our method is effective, there are still differences between our results and the ground truth. Two key points account for this. First, the correspondence between the visible and the infrared images is a many-to-many mapping, and infrared images contain less information than visible images. Second, our method tries to obtain the optimal result only in the statistical sense.
6 Conclusion and Future Work

In this paper we have developed an algorithm to render visible facial images from thermal infrared images using canonical variates. Given an input thermal infrared image, we partition it into small patches, and for every patch we extract the CCA features. The features of the corresponding visible patch can then be inferred by LLR or by Sophisticated LLE, and the visible patch can be reconstructed by LLE using Eq. (7) from the inferred features. We use CCA to extract features, which makes the correlation in the feature space much stronger than that in the patch space, and using LLE to reconstruct the original patch from the inferred features recovers some information lost in the infrared patch and in the feature-extraction process. The experiments show that our algorithm is effective. Though it cannot recover visible images identical to the ground truth, because infrared images carry less information, it does preserve some features of the ground truth such as the expression. Future work includes: (1) applying the method to infrared face recognition to improve the recognition rate, since it recovers some information lost in infrared images; (2) making the method more robust to ill-registered images.
Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments, which have contributed to a vast improvement of the paper. This work is supported by NSFC research fund No. 60572043 and NKBRPC No. 2004CB318005.
References
1. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P.J.: Face Recognition: A Literature Survey. ACM Computing Surveys 35, 399–459 (2003)
2. Kong, S.G., Heo, J., Abidi, B.R., Paik, J., Abidi, M.A.: Recent Advances in Visual and Infrared Face Recognition—A Review. Computer Vision and Image Understanding 97, 103–135 (2005)
3. Bebis, G., Gyaourova, A., Singh, S., Pavlidis, I.: Face Recognition by Fusing Thermal Infrared and Visible Imagery. Image and Vision Computing 24, 727–742 (2006)
4. Heo, J., Kong, S.G., Abidi, B.R., Abidi, M.A.: Fusion of Visual and Thermal Signatures with Eyeglass Removal for Robust Face Recognition. In: Proc. of CVPRW2004, vol. 8, pp. 122–127 (2004)
5. Baker, S., Kanade, T.: Limits on Super-Resolution and How to Break Them. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 1167–1183 (2002)
6. Chang, H., Yeung, D.Y., Xiong, Y.: Super-Resolution Through Neighbor Embedding. In: Proc. of CVPR2004, vol. 1, pp. 275–282 (2004)
7. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low-Level Vision. International Journal of Computer Vision 40, 25–47 (2000)
8. Cleveland, W.S., Devlin, S.J.: Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association 83(403), 596–610 (1988)
9. Shakhnarovich, G., Viola, P., Darrell, T.: Fast Pose Estimation with Parameter-Sensitive Hashing. In: Proc. of ICCV2003, vol. 2, pp. 750–757 (2003)
10. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000)
11. Melzer, T., Reiter, M., Bischof, H.: Appearance Models Based on Kernel Canonical Correlation Analysis. Pattern Recognition 36, 1961–1971 (2003)
12. Reiter, M., Donner, R., Langs, G., Bischof, H.: 3D and Infrared Face Reconstruction from RGB data using Canonical Correlation Analysis. In: Proc. of ICPR2006, vol. 1, pp. 425–428 (2006)
13. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19, 711–720 (1997)
14. Socolinsky, D.A., Selinger, A.: A Comparative Analysis of Face Recognition Performance with Visible and Thermal Infrared Imagery. In: Proc. of ICPR2002, vol. 4, pp. 217–222 (2002)
15. Weenink, D.: Canonical Correlation Analysis. In: IFA Proceedings, vol. 25, pp. 81–99 (2003)
Recognition of Digital Images of the Human Face at Ultra Low Resolution Via Illumination Spaces

Jen-Mei Chang 1, Michael Kirby 1, Holger Kley 1, Chris Peterson 1, Bruce Draper 2, and J. Ross Beveridge 2

1 Department of Mathematics, Colorado State University, Fort Collins, CO 80523-1874 U.S.A. {chang,kirby,kley,peterson}@math.colostate.edu
2 Department of Computer Science, Colorado State University, Fort Collins, CO 80523-1873 U.S.A. {draper,ross}@math.colostate.edu
Abstract. Recent work has established that digital images of a human face, collected under various illumination conditions, contain discriminatory information that can be used in classification. In this paper we demonstrate that sufficient discriminatory information persists at ultra-low resolution to enable a computer to recognize specific human faces in settings beyond human capabilities. For instance, we utilized the Haar wavelet to modify a collection of images to emulate pictures from a 25-pixel camera. From these modified images, a low-resolution illumination space was constructed for each individual in the CMU-PIE database. Each illumination space was then interpreted as a point on a Grassmann manifold. Classification that exploited the geometry on this manifold yielded error-free classification rates for this data set. This suggests the general utility of a low-resolution illumination camera for set-based image recognition problems.
1 Introduction
The face recognition problem has attracted substantial interest in recent years. As an academic discipline, face recognition has progressed by generating large galleries of images collected with various experimental protocols and by assessing the efficacy of new algorithms in this context. A number of proposed face recognition algorithms have been shown to be effective under controlled conditions. However, in the field, where data acquisition is essentially uncontrolled,
This study was partially supported by the National Science Foundation under award DMS-0434351 and the DOD-USAF-Office of Scientific Research under contract FA9550-04-1-0094. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the DOD-USAF-Office of Scientific Research.
the performance of these algorithms typically degrades. In particular, variations in the illumination of subjects can significantly reduce the accuracy of even the best face recognition algorithms. A traditional approach in the face recognition literature has been to normalize illumination variations out of the problem using techniques such as nonlinear histogram equalization or image quotient methods [1]. While such approaches do indeed improve recognition, as demonstrated on the FERET, Yale and PIE databases, they do not exploit the fact that the response of a given subject to variation in illumination is idiosyncratic [2] and hence can be used for discrimination. This work builds on the observation that in the vector space generated by all possible digital images collected by a digital camera at a fixed resolution, the images of a fixed, Lambertian object under varying illuminations lie in a convex cone [3] which is well approximated by a relatively low dimensional linear subspace [4,5]. In our framework, we associate to a set of images of an individual their linear span, which is in turn represented, or encoded, by a point on a Grassmann manifold. This approach appears to be useful for the general problem of comparing sets of images [6]. In the context of face recognition our objective is to compare a set of images associated with subject 1 to a set of images associated with subject 2 or to a different set of images of subject 1. The comparison of any two sets of images is accomplished by constructing a subspace in the linear span of each that optimizes the ability to discriminate between the sets. As described in [2], a sequence of nested subspaces may be constructed for this purpose using principal vectors computed to reveal the geometric relationship between the linear spans of each subject. This approach provides an immediate pseudo-metric for set-to-set comparison of images. In an application to the images in the CMU-PIE Database [7] and Yale Face Database B [4], we have previously established that the data are Grassmann separable [2], i.e., when distances are computed between sets of images using their encoding as points on an appropriately determined Grassmann manifold the subjects are all correctly identified. The CMU-PIE database consists of images of 67 individuals. While the Grassmann separability of a database of this size is a significant, positive result, it is important to understand the general robustness of this approach. For example, the application of the methodology to a larger data set is of critical interest. In the absence of such data, however, we propose to explore a related question: as we reduce the effective resolution of the images of the 67 individuals which make up the CMU-PIE database, does Grassmann separability persist? The use of multiresolution analysis to artificially reduce resolution introduces another form of nested approximation into the problem that is distinct from that described above. We observe that facial imagery at ultra low resolutions is typically not recognizable or classifiable by human operators. Thus, if Grassmann separability persists at ultra low resolution, we can envision large private databases of facial imagery, stored at a resolution that is sufficiently low to prevent recognition by a human operator yet sufficiently high to enable machine recognition and classification via the Grassmann methods described in Section 2.
Accordingly, the purpose of this paper is to explore the idiosyncratic nature of digital images of a face under variable illumination conditions at extremely low resolutions. In Section 2 we discuss the notion of classification on a Grassmann variety and a natural pseudo-metric that arises in the context of Schubert varieties. In Section 3 we extend these ideas to the context of a sequence of nested subspaces generated by a multiresolution analysis. Results of this approach applied to the CMU-PIE database are presented in Section 4. We contrast our approach with other methods in Section 5 and discuss future research directions in Section 6.
2 Classification on Grassmannians
The general approach to the pattern classification problem is to compare labeled instances of data to new, unlabeled exemplars. Implementation in practice depends on the nature of the data and the method by which features are extracted from the data and used to create a representation optimized for classification. We consider the case that an observation of a pattern produces a set of digital images at some resolution. This consideration is a practical one, since the accuracy of a recognition scheme that uses a single input image is significantly reduced when images are subject to variations, such as occlusion and illumination [8]. Now, the linear span of the images is a vector subspace of the space of all possible images at the given resolution, and thus, corresponds to a point on a Grassmann manifold. More precisely, let k (generally independent) images of a given subject be grouped together to form a data matrix X with each image stored as a column of X. If the column space of X, R(X), has rank k and if n denotes the image resolution, then R(X) is a k-dimensional vector subspace of R^n, which is a point on the Grassmann manifold G(k, n). See Fig. 1 for a graphical illustration of this correspondence.

Specifically, the real Grassmannian (Grassmann manifold), G(k, n), parameterizes k-dimensional vector subspaces of the n-dimensional vector space R^n. Naturally, this parameter space is suitable for subspace-based algorithms. For example, the Grassmann manifold is used in [9] when searching for a geometrically invariant subspace of a matrix under full rank updates. An optimization over the Grassmann manifold is proposed in [10] to solve a general object recognition problem. In the case of face recognition, by realizing sets of images as points on the Grassmann manifold, we can exploit the geometries imposed by individual metrics (drawn from a large class of metrics) in computing distances between these sets of images.

With respect to the natural structure of a Riemannian manifold that the Grassmannian inherits as a quotient space of the orthogonal group, the geodesic distance between two points A, B ∈ G(k, n) (i.e., two k-dimensional subspaces of R^n) is given by d_k(A, B) = ||(θ_1, . . . , θ_k)||_2, where θ_1 ≤ θ_2 ≤ · · · ≤ θ_k are the principal angles between the subspaces A and B. The principal angles are readily computed using an SVD-based algorithm [11].
Fig. 1. Illustration of the Grassmann method, where each set of images may be viewed as a point on the Grassmann manifold by computing an orthonormal basis associated with the set
Principal angles between subspaces are defined regardless of the dimensions of the subspaces. Thus, inspired by the Riemannian geometry of the Grassmannian, we may define, for any vector subspaces A, B of R^n, d_ℓ(A, B) = ||(θ_1, . . . , θ_ℓ)||_2, for any ℓ ≤ min{dim A, dim B}. While d_ℓ is not, strictly speaking, a metric (for example, if dim A ∩ B ≥ ℓ, then d_ℓ(A, B) = 0), it nevertheless provides an efficient and useful tool for analyzing configurations in ∪_{k≥ℓ} G(k, n). For points on a fixed Grassmannian, G(k, n), the geometry driving these distance measures is captured by a type of Schubert variety Ω̄_ℓ(W) ⊆ G(k, n). More specifically, let W be a subspace of R^n; then we define Ω̄_ℓ(W) = {E ∈ G(k, n) | dim(E ∩ W) ≥ ℓ}. With this notation, d_ℓ(A, B) simply measures the geodesic distance between A and Ω̄_ℓ(B), i.e. d(A, Ω̄_ℓ(B)) = min{d_k(A, C) | C ∈ Ω̄_ℓ(B)} (it is worth noting that under this interpretation, d_ℓ(A, B) = d_ℓ(B, A)).
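As an illustration of how these quantities can be computed, the short NumPy sketch below obtains the principal angles from the SVD of the product of orthonormal bases (the standard SVD-based algorithm referred to in [11]) and evaluates the pseudo-metric d_ℓ. It is our own sketch, not the implementation used in the experiments.

import numpy as np

def principal_angles(X, Y):
    # columns of X (n x k) and Y (n x m) span the two subspaces
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)   # cosines, descending
    return np.arccos(np.clip(s, -1.0, 1.0))          # theta_1 <= theta_2 <= ...

def d_ell(X, Y, ell):
    # 2-norm of the ell smallest principal angles (equals d_k when ell = k)
    theta = principal_angles(X, Y)
    return np.linalg.norm(theta[:ell])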
3 Resolution Reduction

3.1 Multiresolution Analysis and the Nested Grassmannians
Multiresolution analysis (MRA) works by projecting data in a space V onto a sequence of nested subspaces · · · ⊂ Vj+1 ⊂ Vj ⊂ Vj−1 ⊂ · · · ⊂ V0 = V. The subspaces Vj represent the data at decreasing resolutions and are called scaling subspaces or approximation subspaces. The orthogonal complements Wj to
V_j in V_{j-1} are the wavelet subspaces and encapsulate the error of approximation at each level of decreased resolution. For each j, we have an isomorphism

φ^j : V_{j-1} → V_j ⊕ W_j .

Let π^j : V_j ⊕ W_j → V_j denote projection onto the first factor and let ψ^j = π^j ◦ φ^j (thus ψ^j : V_{j-1} → V_j). This single level of subspace decomposition is represented by the commutative diagram in Fig. 2(a).

Let G(k, V) denote the Grassmannian of k-dimensional subspaces of a vector space V. Suppose that V, V′ are vector spaces, and that f : V → V′ is a linear map. Let ker(f) denote the kernel of f, let dim(A) denote the dimension of the vector space A, and let G(k, V)° = {A ⊂ V | dim(A) = k and A ∩ ker(f) = 0}. If k + dim ker(f) ≤ dim V, then G(k, V)° is a dense open subset of G(k, V) and almost all points in G(k, V) are in G(k, V)°. Now if A ∩ ker(f) = 0, then dim f(A) = dim A, so f induces a map f_k° : G(k, V)° → G(k, V′). Furthermore, if f is surjective, then so is f_k°. The linear maps of the MRA shown in (a) of Fig. 2 thus induce the maps between Grassmannians shown in (b) of the same figure.

Finally, we observe that if A, B are vector subspaces of V, then dim(A ∩ B) = dim(f(A) ∩ f(B)) if and only if (A + B) ∩ ker(f) = 0. In particular, when (A + B) ∩ ker(f) = 0 and ℓ ≤ min{dim A, dim B}, then d_ℓ(A, B) = 0 if and only if d_ℓ(f(A), f(B)) = 0.

From this vantage point, we consider the space spanned by a linearly independent set of k images in their original space on the one hand, and the space spanned by their reduced-resolution projections on the other hand, as points on corresponding Grassmann manifolds. Distances between pairs of sets of k linearly independent images or their low-resolution emulations can then be computed using the pseudo-metrics d_ℓ on these Grassmann manifolds. The preceding observation suggests the possibility that for resolution-reducing projections, spaces which were separable by d_ℓ remain separable after resolution reduction. Of course, taken to an extreme, this statement can no longer hold true. It is therefore of interest to understand the point at which separability fails.

3.2 Image Resolution Reduction
In a 2-dimensional Discrete Wavelet Transform (DWT), columns and rows of an image I each undergo a 1-dimensional wavelet transform. After a single level of a 2-dimensional DWT on an image I of size m-by-n, one obtains four sub-images of dimension m/2-by-n/2. If we consider each row and column of I as a 1-dimensional signal, then the approximation component of I is obtained by a low-pass filter on the columns, then a low-pass filter on the rows, and sampled
Fig. 2. (a) Projection maps between scaling and wavelet subspaces for a single level of wavelet decomposition. (b) Projection maps between nested Grassmannians for a single level of decomposition.
on a dyadic grid. The other 3 sub-images are obtained in a similar fashion and collectively they are called the detail component of I. The approximation component of an image after a single level of wavelet decomposition with the Haar wavelet is equivalent to averaging the columns, then the rows. See Fig. 3 for an illustration of the sub-images obtained from a single level of Haar wavelet analysis. To use wavelets to compress a signal, we sample the approximation and detail components on a dyadic grid; that is, we keep only one out of two wavelet coefficients at each step of the analysis. The approximation component of the signal, A_j, after j iterations of decomposition and down-sampling, serves as the image at level j with resolution m/2^j-by-n/2^j. In the subsequent discussions, we present results obtained by using the approximation subspaces. However, similar results obtained by using the wavelet subspaces are also observed.
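A minimal sketch of this resolution reduction (ours, not the authors' code): since the Haar approximation at each level amounts to a 2x2 block average, up to an overall scale factor that does not affect the spanned subspaces, repeated block averaging emulates the lower-resolution camera.

import numpy as np

def haar_approximation(img, levels):
    # emulate a lower-resolution camera by repeated 2x2 block averaging;
    # odd trailing rows/columns are simply cropped in this sketch
    a = np.asarray(img, dtype=float)
    for _ in range(levels):
        h, w = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2
        a = a[:h, :w]
        a = 0.25 * (a[0::2, 0::2] + a[0::2, 1::2] + a[1::2, 0::2] + a[1::2, 1::2])
    return a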
(a) Original (b) LL (c) HL (d) LH (e) HH
Fig. 3. An illustration of the sub-images from a single level of Haar wavelet analysis on an image in CMU-PIE. From left to right: original image, approximation, horizontal, vertical, and diagonal detail.
4 Results: A 25-Pixel Camera
The experiment presented here follows the protocols set out in [2], where it was established that CMU-PIE is Grassmann separable. This means that using one of the distances d_ℓ on the Grassmannian, the distance between an estimated illumination space of a subject and another estimated illumination space of the same subject is always less than the distance to an estimated illumination space of
any different subject. In this new experiment we address the question of whether this idiosyncratic nature of the illumination spaces persists at significantly reduced resolutions. As described below, we empirically test this hypothesis by calculating distances between pairs of scaling subspaces.

The PIE database consists of digital imagery of 67 people under different poses, illumination conditions, and expressions. The work presented here concerns only illumination variations, thus only frontal images are used. For each of the 67 subjects in the PIE database, 21 facial images were taken under lighting from distinct point light sources, both with ambient lights on and off. The results of the experiments performed on the ambient-lights-off data are summarized in Fig. 4. The results obtained by running the same experiment on illumination data collected in the presence of ambient lighting were not significantly different.

For each of the 67 subjects, we randomly select two disjoint sets of 10 images to produce two 10-dimensional estimates of the illumination space for the subject. Two estimated spaces for the same subject are called matching subspaces, while estimated subspaces for two distinct subjects are called non-matching subspaces. The process of random selection is repeated 10 times to generate a total of 670 matching subspaces and 44,220 non-matching subspaces. We mathematically reduce the resolution of the images using the Haar wavelet, effectively emulating a camera with a reduced number of pixels at each step. As seen in Fig. 5, variations in illumination appear to be retained at each level of resolution, suggesting that the idiosyncratic nature of the illumination subspaces might be preserved. At the fifth level of the MRA the data correspond to what would have been captured by a camera with 5 × 5 pixels. We observe that at this resolution the human eye can no longer match an image with its subject.

The separability of CMU-PIE at ultra low resolution is verified by comparing the distances of the matching subspaces to those of the non-matching subspaces as points on a Grassmann manifold. When the largest distance between any two matching subspaces is less than the smallest distance between any two non-matching subspaces, the data are called Grassmann separable. This phenomenon can be observed in Fig. 4. The three lines of the box in the box whisker plot shown in Fig. 4 represent the lowest quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to show the extent of the rest of the data, and outliers are data with values beyond the ends of the whiskers. Using d_1, i.e., a distance based on only one principal angle, we observe a significant separation gap between the largest and smallest distance of the matching and non-matching subspaces throughout all levels of MRA. Specifically, the separation gap between matching and non-matching subspaces is approximately 16°, 18°, 17°, 14°, 8°, and 0.17° when subspaces are realized as points in G(10, 22080), G(10, 5520), G(10, 1400), G(10, 360), G(10, 90), and G(10, 25), respectively. Note that the occasionally non-decreasing trend of the separation gap is due to the random selection of the illumination subspaces.
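For concreteness, the following sketch (with hypothetical function names) shows how an estimated illumination subspace and the separability criterion can be set up; the bookkeeping of the 670 matching and 44,220 non-matching comparisons is omitted.

import numpy as np

def illumination_subspace(images):
    # stack 10 flattened images as columns and orthonormalize: a
    # 10-dimensional estimate of the illumination space, i.e. a point on G(10, n)
    X = np.column_stack([np.asarray(im, dtype=float).ravel() for im in images])
    Q, _ = np.linalg.qr(X)
    return Q

def grassmann_separable(matching_dists, nonmatching_dists):
    # perfect separation: every matching distance below every non-matching one
    return max(matching_dists) < min(nonmatching_dists)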
Fig. 4. Box whisker plot of the minimal principal angles of the matching and nonmatching subspaces. Left to right: original (resolution 160×138), level 1 Haar wavelet approximation (80×69), level 2 (40×35), level 3 (20×18), level 4 (10×9), level 5 (5×5). Perfect separation of the matching and non-matching subspaces is observed throughout all levels of MRA.
As expected, the separation gap given by the minimal principal angle between the matching and non-matching subspaces decreases as we reduce resolution, but never to the level where points on the Grassmann manifold are misclassified. In other words, individuals can be recognized at ultra-low resolutions provided they are represented by multiple image sets taken under a variety of illumination conditions. It is curious to see if similar outcomes can be observed when using unstructured projections, e.g., random projections, to embed subject illumination subspaces into spaces of significantly reduced dimensions. To test this, we repeated
Fig. 5. Top to bottom: 4 distinct illumination images of subjects 04006 (a) and 04007 (b) in CMU-PIE; level 1 to level 5 approximation obtained from applying 2D discrete Haar wavelet transform to the top row
the experiments described above in this new setting. Subject illumination subspaces in their original level of resolution were projected onto low dimensional spaces via randomly determined linear transformations. Error statistics were collected by repeating the experiment 100 times. Perfect separation between matching and non-matching subspaces occurred when subject illumination subspaces were projected onto random 35-dimensional subspaces. This validates the use of digital images at ultra low resolution and emphasizes the importance of illumination variations in the problem of face recognition. Furthermore, while unstructured projections perform surprisingly well in the retention of idiosyncratic information, structured projections that exploit similarities of neighboring pixels allow perfect recognition results at even lower resolutions. We remark that the idiosyncratic nature of the illumination subspaces can be found not only in the scaling subspaces, but also in the wavelet subspaces. Indeed, we observed perfect separation using the minimal principal angle in almost all scales of the wavelet subspaces.
5 Related Work
A variety of studies consider the roles of data resolution and face recognition, including [12,13,14,15,16]. A common feature of these studies is the practice of using single to single image comparison in the recognition stage (with the exception of [16]). Among the techniques used to train the algorithms are PCA, LDA, ICA, Neural Network, and Radial Basis Functions. Some of the classifiers used are correlation, similarity score, nearest neighbor, neural network, tangent distance, and multiresolution tangent distance. If variation in illumination is present in the data set, it is removed by either histogram equalization [17] or morphological nonlinear filtering [18]. Except in [16], the variation of illumination was treated as noise and eliminated in the preprocessing stage before the classification takes place. In a more related study, Vasconcelos and Lippman proposed the use of transformation invariant tangent distance embedded in the multiresolution framework [16]. Their method, based on the (2-sided) tangent distance between manifolds, is referred to as the multiresolution tangent distance (MRTD) and is similar to our approach in that it requires a set-to-set image comparison. It is also postulated that the use of a multiresolution framework preserves the global minima that are needed in the minimization problems associated with computing tangent distances. The results of [16], however, are that when the only variation in the data is illumination, the performance of MRTD is inferior to that of the normal tangent distance and Euclidean distance. Hence, it appears that the framework of [16] does not sufficiently detect the idiosyncratic nature of illumination at low resolutions. In summary, we have presented an algorithm for classification of image sets that requires no training and retains its high performance rates even at extremely low resolution. To our knowledge, no other algorithm has claimed to have achieved perfect separability of the CMU-PIE database at ultra low resolution.
6 Discussion
We have shown that a mathematically emulated ultra low-resolution illumination space is sufficient to classify the CMU-PIE database when a data point is a set of images under varying illuminations, represented by a point on a Grassmann manifold. We assert that this is only possible because the idiosyncratic nature of the response of a face to varying illumination, as captured in digital images, persists at ultra low resolutions. This is perhaps not so surprising given that the configuration space of a 25-pixel camera consists of 256^25 different images and we are comparing only 67 subjects using some 20 total instances of illumination. The representation space is very large compared to the amount of data being stored. Furthermore, the reduction of resolution that was utilized takes advantage of similarities of neighboring pixels. The algorithm introduced here is computationally fast and can be implemented efficiently. In fact, on a 2.8GHz AMD Opteron processor, it takes approximately 0.000218 seconds to compute the distance between a pair of 25-pixel 10-dimensional illumination subspaces. The work presented here provides a blueprint for a low-resolution illumination camera to capture images and a framework in which to match them with low-resolution sets in a database. Future work will focus on evaluating this approach on a much larger data set that contains more subjects and more variations. The Grassmann method has shown promising results in a variety of face recognition problems [6,19,2]; we intend to examine the effect of resolution reduction on the accuracy of the algorithm with a range of variations, such as viewpoint and expressions.
References
1. Riklin-Raviv, T., Shashua, A.: The quotient image: Class based re-rendering and recognition with varying illuminations. PAMI 23(2), 129–139 (2001)
2. Chang, J.M., Beveridge, J., Draper, B., Kirby, M., Kley, H., Peterson, C.: Illumination face spaces are idiosyncratic. In: International Conference on Image Processing & Computer Vision, vol. 2, pp. 390–396 (June 2006)
3. Belhumeur, P., Kriegman, D.: What is the set of images of an object under all possible illumination conditions. IJCV 28(3), 245–260 (1998)
4. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. PAMI 23(6), 643–660 (2001)
5. Basri, R., Jacobs, D.: Lambertian reflectance and linear subspaces. PAMI 25(2), 218–233 (2003)
6. Chang, J.M., Kirby, M., Kley, H., Beveridge, J., Peterson, C., Draper, B.: Examples of set-to-set image classification. In: Seventh International Conference on Mathematics in Signal Processing Conference Digest, The Royal Agricultural College, Cirencester, Institute for Mathematics and its Applications, pp. 102–105 (December 2006)
7. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. PAMI 25(12), 1615–1618 (2003)
8. Yamaguchi, O., Fukui, K., Maeda, K.: Face recognition using temporal image sequence. In: AFGR, pp. 318–323 (1998)
9. Smith, S.: Subspace tracking with full rank updates. In: The 31st Asilomar Conference on Signals, Systems & Computers, vol. 1, pp. 793–797 (November 1997)
10. Liu, X., Srivastava, A., Gallivan, K.: Optimal linear representations of images for object recognition. PAMI 26, 662–666 (2004)
11. Golub, G.H., Loan, C.F.V.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
12. Kouzani, A.Z., He, F., Sammut, K.: Wavelet packet face representation and recognition. In: IEEE Int'l Conf. on Systems, Man and Cybernetics, Orlando, vol. 2, pp. 1614–1619. IEEE Computer Society Press, Los Alamitos (1997)
13. Feng, G.C., Yuen, P.C., Dai, D.Q.: Human face recognition using PCA on wavelet subband. SPIE J. Electronic Imaging 9(2), 226–233 (2000)
14. Nastar, C., Moghaddam, B., Pentland, A.: Flexible images: Matching and recognition using learned deformations. Computer Vision and Image Understanding 65(2), 179–191 (1997)
15. Nastar, C.: The image shape spectrum for image retrieval. Technical Report RR3206, INRIA (1997)
16. Vasconcelos, N., Lippman, A.: A multiresolution manifold distance for invariant image similarity. IEEE Trans. Multimedia 7(1), 127–142 (2005)
17. Ekenel, H.K., Sankur, B.: Multiresolution face recognition. Image Vision Computing 23(5), 469–477 (2005)
18. Foltyniewicz, R.: Automatic face recognition via wavelets and mathematical morphology. In: Proc. of the 13th Int'l Conf. on Pattern Recognition, vol. 2, pp. 13–17 (1996)
19. Chang, J.M., Kirby, M., Peterson, C.: Set-to-set face recognition under variations in pose and illumination. In: 2007 Biometrics Symposium at the Biometric Consortium Conference, Baltimore, MD, U.S.A. (September 2007)
Crystal Vision-Applications of Point Groups in Computer Vision

Reiner Lenz

Department of Science and Technology, Linköping University, SE-60174 Norrköping, Sweden
[email protected] Abstract. Methods from the representation theory of finite groups are used to construct efficient processing methods for the special geometries related to the finite subgroups of the rotation group. We motivate the use of these subgroups in computer vision, summarize the necessary facts from the representation theory and develop the basics of Fourier theory for these geometries. We illustrate its usage for data compression in applications where the processes are (on average) symmetrical with respect to these groups. We use the icosahedral group as an example since it is the largest finite subgroup of the 3D rotation group. Other subgroups with fewer group elements can be studied in exactly the same way.
1 Introduction
Measuring properties related to the 3D geometry of objects is a fundamental problem in many image processing applications. Some very different examples are: Light stages, omnidirectional cameras and measurement of scattering properties. In a light stage (see [1] for an early description) an object is illuminated by a series of light sources and simultaneously images of the object are taken with one or several cameras. These images are then used to estimate the optical properties of the object and this information is then in turn used by computer graphics systems to insert the object into computer generated, or real world, environments. A typical application is the generation of special effects in the movie industry. An omnidirectional camera captures images of a scene from different directions simultaneously. Typical arrangements to obtain these images are combinations of a camera and a mirror ball or systems consisting of a number of cameras. The third area where similar techniques are used is the investigation of the optical properties of materials like skin or paper [2, 3]. These materials are characterized by complicated interactions between the light and the material due to sub-surface scattering and closed form descriptions are not available. Applications range from wound monitoring over cosmetics to the paper manufacturing and graphic arts industry. All of these problems have two common characteristics: Their main properties are defined in terms of directions (the directions of the incoming and reflected light) and the space of direction vectors is represented by a few representative
samples (for example the directions where the sensors or light sources are located). If we describe directions with vectors on the unit sphere then we see that the basic component of these models is a finite set of vectors on the unit sphere. Similar models are used in physics to investigate the properties of crystals whose atoms form similar geometric configurations. A standard tool used in these investigations is the theory of point groups which are finite subgroups of the group SO(3) of three-dimensional rotations. In this paper we will use methods from the representation theory of these point groups to construct efficient processing methods for computer vision problems involving quantized direction spaces. We will describe the main idea, summarize the necessary facts from the representation theory and illustrate it by examples such as light stage processing and modeling of the optical properties of materials. The group we use in this paper is the icosahedral group. We select it because it is the largest finite subgroup of the 3D rotation group. Other subgroups with fewer group elements (and thus a coarser quantization of the direction space) can be used in exactly the same way.
2 Geometry
Consider the problem of constructing a device to be used to measure the optical properties of materials or objects. The first decision in the construction of such a device concerns the placements of the light sources and cameras. Since we want to design a general instrument we will use the following scanning mechanism: Start with one light source in a fixed position in space and then move it to other positions with the help of a sequence of 3D rotations. From a mathematical point of view it is natural to require that the rotations used form a group: applying two given rotations in a sequence moves the light source to another possible position and all movements can be reversed. Since we want to have a physically realizable system we also require that only a finite number of positions are possible to visit. We therefore conclude that the positions of the light sources are characterized by a finite subgroup of the group SO(3) of 3D rotations. If the rotations are not all located in a plane then it is known that there are only a finite number of finite subgroups of the rotation group. The largest of these subgroups is the icosahedral group I and we will in the following only consider this group since it provides the densest sampling of the unit sphere constructed in this way. The other groups (related to cubic and tetrahedral sampling schemes) can be treated in a similar way. Here we only use the most important facts from the theory of these groups and the interested reader is referred to the many books in mathematics, physics and chemistry ( [4, 5, 6, 7]) for detailed descriptions. We will now collect the most important properties of I. Among the vast amount of knowledge about these groups we select those facts that are (1) relevant for the application we have in mind and (2) those that can be used in software systems like Maple and Matlab to do the necessary calculations. The group IG consists of 60 elements and these elements can be characterized by three elements Rk , k = 1, 2, 3, the generators of the group I. All group
elements can be obtained by concatenations of these three rotations and all R ∈ I have the form R = R_ν(1) R_ν(2) … R_ν(K), where R_ν(k) is one of the three generators R_k, k = 1, 2, 3, and R_ν(k) R_ν(k+1) is the concatenation of two elements. The generators satisfy the following equations:

R_1^3 = E;  R_2^2 = E;  R_3^2 = E;
(R_1 R_2)^3 = E;  (R_2 R_3)^3 = E;  (R_1 R_3)^2 = E
(1)
and these equations specify the group I. These defining relations can be used in symbolic programs to generate all the elements of the group. The icosahedral group I maps the icosahedron into itself and if we cut off the vertices of the icosahedron we get the truncated icosahedron, also known as the buckyball or a fullerene. The buckyball has 60 vertices and its faces are pentagons and hexagons (see Figure 3(A)). Starting from one vertex and applying all the rotations in I will visit all the vertices of the buckyball (more information on the buckyball can be found in [6]). Now assume that at every vertex of the buckyball you have a controllable light source. We have sixty vertices and so we can describe the light distribution generated by these sources by enumerating them as Lk , k = 1, . . . 60. We can also describe them as functions of their positions using the unit vectors Uk , k = 1, . . . 60 : L(Uk ). The interpretation we will use in the following uses the rotation Rk needed to reach the k-th position from an arbitrary but fixed starting point. We have L(Rk ), k = 1, . . . 60 and we can think of L as a function defined on the group I. This space of all functions on I will be denoted by L2 (I). This space is a 60-dimensional vector space and in the following we will describe how to partition it into subspaces with simple transformation properties.
3 Representation Theory
The following construction is closely related to Fourier analysis, where functions on a circle are described as superpositions of complex exponentials. For a fixed value of n the complex exponential is characterized by the transformation property e^{in(x+Δ)} = e^{inΔ} e^{inx}. The one-dimensional space spanned by all functions of the form c e^{inx}, c ∈ C, is thus invariant under the shift operation x → x + Δ of the unit circle. We will describe similar systems in the following. We construct a 60D space by assigning the k-th basis vector in this space to group element R_k ∈ I. Next select a fixed R ∈ I and form all products R R_k, k = 1, …, 60. The mapping R : R_k → R_l = R R_k defines the linear mapping that moves the k-th basis vector to the l-th basis vector. Doing this for all elements R_k we see that R defines a 60D permutation matrix D_r(R). The map R → D_r(R) has the property that D_r(RQ) = D_r(R) D_r(Q) for all R, Q ∈ I. A mapping with this transformation property is called a representation, and the special representation D_r is known as the regular representation. All elements in I are concatenations of the three elements R_k, k = 1, 2, 3, and every representation D is therefore completely characterized by the three matrices D(R_k), k = 1, 2, 3.
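The construction can also be carried out numerically. The sketch below (ours, assuming the three generator rotations are available as 3x3 matrices) closes the generator set under multiplication and builds the permutation matrices D_r(R) of the regular representation; it is a brute-force illustration, not an optimized group-theory routine.

import numpy as np

def generate_group(generators, tol=1e-6):
    # enumerate all group elements by closing {R1, R2, R3} under products
    elems = [np.eye(3)]
    frontier = [np.eye(3)]
    while frontier:
        new = []
        for A in frontier:
            for G in generators:
                B = G @ A
                if not any(np.allclose(B, C, atol=tol) for C in elems):
                    elems.append(B)
                    new.append(B)
        frontier = new
    return elems                       # 60 rotations for the icosahedral group

def regular_representation(elems, tol=1e-6):
    # D_r(R): permutation matrix describing R R_k = R_l, i.e. basis vector k -> l
    n = len(elems)
    mats = []
    for R in elems:
        D = np.zeros((n, n))
        for k, Rk in enumerate(elems):
            prod = R @ Rk
            l = next(i for i, C in enumerate(elems) if np.allclose(prod, C, atol=tol))
            D[l, k] = 1.0
        mats.append(D)
    return mats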
A given representation describes linear transformations D(R) in the 60D space. Changing the basis in this space with the help of a non-singular matrix T describes the same transformation in the new coordinate system by the matrix T D(R) T^{-1}. Also T D(R) T^{-1} T D(Q) T^{-1} = T D(RQ) T^{-1}, and we see that this gives a new representation T D(R) T^{-1} which is equivalent to the original D. Assume that for the representation D there is a matrix T such that for all R ∈ I we have:

D_T(R) = T D(R) T^{-1} = \begin{pmatrix} D^{(1)}(R) & 0 \\ 0 & D^{(2)}(R) \end{pmatrix}    (2)

Here the D^{(l)}(R) are square matrices of size n_l and 0 are sub-matrices with all-zero entries. We have n_1 + n_2 = 60 and

T D(R_1 R_2) T^{-1} = T D(R_1) T^{-1} T D(R_2) T^{-1}
  = \begin{pmatrix} D^{(1)}(R_1) & 0 \\ 0 & D^{(2)}(R_1) \end{pmatrix} \begin{pmatrix} D^{(1)}(R_2) & 0 \\ 0 & D^{(2)}(R_2) \end{pmatrix}
  = \begin{pmatrix} D^{(1)}(R_1) D^{(1)}(R_2) & 0 \\ 0 & D^{(2)}(R_1) D^{(2)}(R_2) \end{pmatrix}

This shows that D^{(1)}, D^{(2)} are representations of lower dimensions n_1, n_2 < 60. Each of the two new representations describes a subspace of the original space that is closed under all group operations. If we can split a representation D into two lower-dimensional representations, we say that D is reducible; otherwise it is irreducible. Continuing to split reducible representations finally leads to a decomposition of the original 60D space into smallest, irreducible components. One of the main results from the representation theory of finite groups is that the irreducible representations of a group are unique (up to equivalence). For the group I we denote the irreducible representations by M^(1), M^(2), M^(3), M^(4), M^(5). Their dimensions are 1, 3, 3, 4, 5. For the group I we find

1 + 3^2 + 3^2 + 4^2 + 5^2 = 60
(3)
which is an example of the general formula n_1^2 + ... + n_K^2 = n, where n_k is the dimension of the k-th irreducible representation and n is the number of elements in the group. Eq. 3 shows that the 60D space L^2(I) can be subdivided into subspaces of dimensions 1, 9, 9, 16 and 25. These subspaces consist of n_k copies of the n_k-dimensional space defined by the k-th irreducible representation of I. Next we define the character to describe how to compute this subdivision: assume that D is a representation given by the matrices D(R) for the group elements R. Its trace defines the character χ_D:

\chi_D : I \to \mathbb{C}; \quad R \mapsto \chi_D(R) = \mathrm{tr}(D(R)) \qquad (4)
For the representations M^{(l)} we denote their characters by χ_l = χ_{M^{(l)}}. From the properties of the trace it follows that χ_l(R) = χ_l(Q R Q^{-1}) for all R, Q ∈ I. If we
define R_1, R_2 ∈ I as equivalent if there is a Q ∈ I such that R_2 = Q R_1 Q^{-1}, then we see that this defines an equivalence relation and that the characters are constant on equivalence classes. Characters are thus defined by their values on equivalence classes. We now define the matrix P_l as:

P_l = \sum_{k=1}^{60} \chi_l(R_k)\, D_r(R_k) \qquad (5)
where D_r is the representation consisting of the permutation matrices defined at the beginning of this section. It can be shown that the matrix P_l defines a projection from the 60D space L^2(I) into the n_l^2-dimensional space given by the n_l copies of the n_l-dimensional irreducible representation M^{(l)}. P_l is a projection matrix, and if we compute its Singular Value Decomposition (SVD) P_l = U_l D_l V_l we see that the n_l^2 columns of V_l span the range of P_l. We summarize the computations as follows:
– For the generators R_k, k = 1, 2, 3 of I construct the permutation matrices D_r(R_k)
– Apply the group operations to generate all matrices D_r(R), R ∈ I
– For the generators R_k, k = 1, 2, 3 of I construct the matrices M^{(l)}(R_k) of the l-th irreducible representation
– Apply the group operations to generate all matrices M^{(l)}(R), R ∈ I
– Compute the values of the characters of the irreducible representations M^{(l)} on the equivalence classes
– Extend them to all elements R ∈ I to obtain the characters χ_l
– Use Eq. 5 to construct the projection matrices P_l
– Use the SVD of the matrices P_l to construct the new basis
The lengthy theoretical derivation thus results in a very simple method to decompose functions on the buckyball. The new basis can be constructed automatically, and from its construction it can also be seen that the elements of the projection matrices P_l are given by the values of the characters on the equivalence classes. Here we only concentrate on these algorithmically derived bases; using the SVD to construct the basis in the subspaces is only one option, and other choices, more optimized for special applications, can also be used.
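As a minimal illustration of the last two steps (not the author's implementation), the following sketch assumes that the 60 rotation matrices of I and the character values χ_l are already available as inputs; all function and variable names are illustrative.

```python
import numpy as np

def regular_representation(group, atol=1e-6):
    """Permutation matrices D_r(R) of the regular representation.
    `group` is a list of 3x3 rotation matrices closed under multiplication."""
    n = len(group)

    def index_of(M):
        for i, G in enumerate(group):
            if np.allclose(G, M, atol=atol):
                return i
        raise ValueError("product not found in the group list")

    D_r = []
    for R in group:
        P = np.zeros((n, n))
        for k, Rk in enumerate(group):
            P[index_of(R @ Rk), k] = 1.0   # R moves basis vector k to basis vector l
        D_r.append(P)
    return D_r

def symmetry_basis(group, D_r, chi):
    """Columns spanning the range of P_l = sum_k chi_l(R_k) D_r(R_k) (Eq. 5).
    `chi` maps a group-element index k to the character value chi_l(R_k)."""
    P_l = sum(chi(k) * D_r[k] for k in range(len(group)))
    U, s, Vt = np.linalg.svd(P_l)
    rank = int(np.sum(s > 1e-8 * s.max()))
    return U[:, :rank]          # orthonormal basis of the isotypic subspace
```

Stacking the bases returned for the five irreducible representations column-wise gives the 60 × 60 change-of-basis matrix used in the experiments below.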
4 Description and Compression of Scatter Measurements
Radiation transfer models provide a standard toolbox to describe the interaction between light and material and they are therefore useful in such different applications as remote sensing and subsurface scattering models of materials like skin and paper (see [2] for an example). The key component of the theory is the function that describes how incoming radiation is mapped to outgoing radiation. In the following we let U, V denote two unit vectors that describe
directions in 3D space. We denote by p(U, V) the probability that an incoming photon from direction U interacts with the material and is scattered into direction V. Now divide the unit sphere into 60 sections, each described by a vertex of the buckyball. The argument above shows that in that case we can write it as a function p(R_i, R_j) with R_i, R_j ∈ I. Assume further that f: I → ℝ describes the incoming light distribution from the directions given by the elements in I. The expectation of the outgoing radiation g(R_j) in direction R_j is then given by

g(R_j) = \sum_{R_i \in I} p(R_i, R_j)\, f(R_i) \qquad (6)
and we write this in operator notation as g = S_p f, where S_p is an operator defined by the kernel p. A common assumption in applications of radiation transfer is that the function p(U, V) only depends on the angle between the vectors U, V. We therefore consider especially probability functions that are invariant under elements of I, i.e. we assume that:

p(R R_i, R R_j) = p(R_i, R_j) \qquad (7)

for all elements R, R_i, R_j ∈ I. We find that the operator commutes with the operator T_Q describing the application of an arbitrary but fixed rotation Q: T_Q f(R) = f(Q^{-1} R):

S_p(f(Q\,\cdot))(R_j) = \sum_{R_i \in I} p(R_i, R_j) f(Q R_i) = \sum_{R_i \in I} p(Q^{-1} R_i, R_j) f(R_i) = \sum_{R_i \in I} p(R_i, Q R_j) f(R_i) = (S_p f)(Q R_j)
This shows that S_p T_Q = T_Q S_p for all Q ∈ I. We now use the new coordinate system in L^2(I) constructed above. The operator then maps the invariant subspaces defined by the irreducible representations onto themselves. Schur's Lemma [4] states that on these spaces S_p is either the zero operator or a multiple of the identity. In other words, on the k-th such subspace there is a constant λ_k such that S_p f = λ_k f, and we find that the elements in this subspace are eigenvectors of the operator. We illustrate this by an example from radiation transfer theory describing the reflection of light on materials. We consider illumination distributions measured on the 60 vertices of the buckyball and described by 60D vectors f. We measure the reflected light at the same 60 positions, resulting in a new set of 60D vectors g. In what follows we will not consider single distributions f, g but stochastic processes generating a number of light distributions f_ω, where ω is the stochastic variable. This scenario is typical for a number of different applications, as in the following examples:
– The operator can describe the properties of a mirror ball, the incoming vectors f the light flow in the environment, and g the corresponding measurement vector. This is of interest in computer graphics
– The operator represents the optical properties of a material like paper or skin. The illumination/measurement configuration can be used for estimation of the reflectance properties of the material
– The model describes a large number of independent interactions between the light flow and particles. A typical example is the propagation of light through the atmosphere
In the following simulations we generated 500 vectors with uniformly distributed random numbers representing 500 different incoming illumination distributions from the 60 directions of the buckyball. The scattering properties of the material are characterized by the Henyey-Greenstein function [8] defined as

p(\cos\theta) = \frac{1 - \xi^2}{\left(1 + \xi^2 - 2\xi\cos\theta\right)^{3/2}} \qquad (8)
where θ is the angle between the incoming and outgoing direction and ξ is a parameter characterizing the scattering properties. We choose this function simply as an illustration of how the distributions of the scattered light are described in the basis constructed in the previous sections. Note that this coordinate system is constructed from the buckyball geometry alone, independent of the scattering properties. We illustrate the results with two examples: ξ = 0.2 (diffuse scattering) and ξ = 0.8 (specular reflection). We show images of the covariance matrix of the original scatter vectors, the covariance matrix of the scatter vectors in the new coordinate system, and the correlation matrix of the scattered vectors in the new coordinate system where we set the matrix element in the upper left pixel (corresponding to the squared magnitude of the first coefficient) to zero. For all matrices we also plot the values of the diagonal, where most of the contributions are concentrated. In Figure 1 we show the results for ξ = 0.2 and in Figure 2 for ξ = 0.8. In both cases we get similar results, and for ξ = 0.8 we therefore omit the correlation matrix. The results show that in the new basis the contributions are more concentrated. We also see clearly the structure of the different invariant representations, accounting for the block structure of the subspaces of dimensions (1, 9, 9, 16, 25), and the most important components with numbers (1, 2, 11, 20, 36), corresponding to the first dimension in these subspaces. We also see that the concentration in these components is more pronounced in the first example with the diffuse reflection than for the more specular reflection. This is to be expected since in the latter case the energy of the reflected light is concentrated in narrower regions. The shape of these basis functions is illustrated in Figure 3(B), showing basis vector number 36, which gives (after the constant basis vector number one) the highest absolute contribution in the previous plots. In this figure we mark the vertices with positive contributions by spheres and the vertices with negative values by tetrahedra.
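For concreteness, a sketch of how such a simulation might be set up is given below. It is an assumption-laden illustration, not the author's code: only Eq. (8) and the 60 unit buckyball vertex directions (assumed available as `vertices`) are needed, and the row normalization of the kernel is a modeling choice made here for the example.

```python
import numpy as np

def henyey_greenstein(cos_theta, xi):
    """Henyey-Greenstein function of Eq. (8)."""
    return (1.0 - xi**2) / (1.0 + xi**2 - 2.0 * xi * cos_theta) ** 1.5

def scatter_kernel(vertices, xi):
    """Kernel p(R_i, R_j) on the buckyball directions; `vertices` is a
    (60, 3) array of unit vectors. Rows are normalized to sum to one."""
    cos_theta = np.clip(vertices @ vertices.T, -1.0, 1.0)
    P = henyey_greenstein(cos_theta, xi)
    return P / P.sum(axis=1, keepdims=True)

# Illustration (vertices and the symmetry basis B assumed given):
# f = np.random.rand(500, 60)                   # 500 incoming distributions
# g = f @ scatter_kernel(vertices, xi=0.2).T    # scattered distributions
# C = np.cov(g, rowvar=False)                   # covariance, original basis
# C_sym = B.T @ C @ B                           # covariance, symmetry basis
```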
Fig. 1. Results for ξ = 0.2. (A) Covariance Original (B) Thresholded Correlation Matrix in Symmetry Basis (C) Covariance Matrix in Symmetry Basis.
Fig. 2. Results for ξ = 0.8. (A) Covariance Original (B) Covariance Matrix in Symmetry Basis.
Fig. 3. (A) The Buckyball (B) Basis Vector 36
5 Summary and Discussion
In this paper we used tools from the representation theory of the icosahedral group to construct transforms that are adapted to the transformation properties of the group. We showed how to construct the transform algorithmically from the properties of the group, and that under certain conditions these transforms provide approximations to principal component analysis of datasets defined on the vertices of the buckyball. We used a common model for the reflection properties of materials (the Henyey-Greenstein equation) and illustrated the compression properties of the transform with the help of simulated illumination distributions scattered from the surface of objects. The assumption of perfect symmetry under the icosahedral group, on which the results in this paper were derived, is seldom fulfilled in reality: the invariance property in Eq. 7 clearly seldom holds for real objects. This is also the case in physics, where perfect crystals are the exception rather than the rule. In this case we can still use the basis constructed in this paper as a starting point and tune it to the special situation afterwards in a perturbation framework. But even in this simple form it should be useful in computer vision applications. As a typical example we mention the fact that omnidirectional cameras typically produce large amounts of data, and the examples shown above illustrate that the new basis should provide better compression results than the original point-based system. Without going into details we remark also that the new basis has a natural connection to invariants. From the construction we see that the new basis defines a partition of the original space into 1-, 9-, 9-, 16- and 25-dimensional subspaces that are invariant under the action of the icosahedral group. The projection onto the first subspace thus defines an invariant. The vectors obtained by projections onto the other subspaces are not invariants but their lengths are, and
we thus obtain four new invariants. It follows furthermore from the construction that the projected vectors in these subspaces transform according to the transformation rules of the corresponding representations, and they can thus be used to obtain information about the underlying transformation that caused the observed change of the projected vectors. The application to the design of illumination patterns is based on the observation that the projection matrices P_l defined in Eq. (5) only contain integers and two non-integer constants. We can use this simple structure to construct illumination patterns by switching on all the light sources located on the vertices with identical values in the projection vector. Such a system should have favorable properties similar to those obtained by the technique described in [9], [10], [11].
References
1. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: Proc. SIGGRAPH 2000, pp. 145–156. ACM Press/Addison-Wesley, New York, NY, USA (2000)
2. Edström, P.: A fast and stable solution method for the radiative transfer problem. SIAM Review 47(3), 447–468 (2005)
3. Weyrich, T., Matusik, W., Pfister, H., Bickel, B., Donner, C., Tu, C., McAndless, J., Lee, J., Ngan, A., Jensen, H.W., Gross, M.: Analysis of human faces using a measurement-based skin reflectance model. ACM Transactions on Graphics 25(3), 1013–1024 (2006)
4. Serre, J.P.: Linear Representations of Finite Groups. Springer, Heidelberg (1977)
5. Stiefel, E., Fässler, A.: Gruppentheoretische Methoden und ihre Anwendungen. Teubner, Stuttgart (1979)
6. Sternberg, S.: Group Theory and Physics, first paperback edn. Cambridge University Press, Cambridge, England (1995)
7. Kim, S.K.: Group Theoretical Methods and Applications to Molecules and Crystals. Cambridge University Press, Cambridge (1999)
8. Henyey, L., Greenstein, J.: Diffuse radiation in the galaxy. Astrophysical Journal 93, 70–83 (1941)
9. Schechner, Y., Nayar, S., Belhumeur, P.: A theory of multiplexed illumination. In: Proc. Ninth IEEE Int. Conf. on Computer Vision, vol. 2, pp. 808–815. IEEE Computer Society Press, Los Alamitos (2003)
10. Ratner, N., Schechner, Y.Y.: Illumination multiplexing within fundamental limits. In: Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, Los Alamitos (2007)
11. Schechner, Y.Y., Nayar, S.K., Belhumeur, P.N.: Multiplexing for optimal lighting. IEEE Trans. Pattern Analysis and Machine Intelligence 29(8), 1339–1354 (2007)
On the Critical Point of Gradient Vector Flow Snake
Yuanquan Wang1, Jia Liang2, and Yunde Jia2
1 School of Computer Science, Tianjin University of Technology, Tianjin 300191, PRC
2 School of Computer Science, Beijing Institute of Technology, Beijing 100081, PRC
{yqwang,liangjia,jiayunde}@bit.edu.cn
Abstract. In this paper, the so-called critical point problem of the gradient vector flow (GVF) snake is studied in two respects: the influencing factors and the detection of the critical points. One influencing factor that requires particular attention is the iteration number in the diffusion process: too much diffusion floods the object boundaries, while too little preserves excessive noise. Here, the optimal iteration number is chosen by minimizing the correlation between the signal and noise in the filtered vector field. In addition, we single out all the critical points by quantizing the GVF vector field. After the critical points are singled out, the initial contour can be located properly to avoid the nuisance arising from critical points. Several experiments are presented to demonstrate the effectiveness of the proposed strategies. Keywords: snake model, gradient vector flow, critical point, optimal stopping time, image segmentation.
1 Introduction
Object shape segmentation and extraction in visual data is an important goal in computer vision. The parametric active contour models [1] and geometric active contour models [2] have dominated this field over the last two decades. Since its debut in 1988 [1], the parametric active contour model, i.e., the snake model, has become extremely popular in the field of computer vision; it integrates an initial estimate, geometrical properties of the contour, image data and knowledge-based constraints into a single process, and provides a good solution for shape recovery of objects of interest in visual data. Despite its marvelous ability to represent the shapes of objects in visual data, the original algorithm is hampered by several limitations, such as sensitivity to initialization, convergence to boundary concavities, and topology adaptation. These limitations have been extensively studied and many interesting results have been presented. Among them, a new external force called gradient vector flow (GVF), proposed by Xu and Prince [3, 4], outperforms other gradient-based methods in enlarging the capture range and converging to boundary concavities, and has become the focus of much research. Examples include [5-16], among others. The works on GVF by Ray et al. are worth noting. They first presented a shape and size constrained active contour with the GVF, but modified by introducing additional boundary conditions of Dirichlet type using the initial contour
location to the PDEs, as the external force [10]; they then discussed this new formulation of GVF from another point of view [11]. Later, they presented motion gradient vector flow by integrating the object's moving direction into the GVF vector [12]. More recently, they utilized the GVF snake characterized by Dirichlet boundary conditions to segment spatiotemporal images [13]. Although the capture range of the GVF snake is very large, the initial contour unfortunately still suffers from some difficulties: the initial contour should contain some points and exclude some other points, otherwise the final results would be far from expected. We demonstrate this phenomenon in Fig. 1. In the top row, there are some particular points, denoted by white crosses, in the GVF field within the heart ventricle; if the initial contour contains none or only part of these points, the contour would fail. The bottom row illustrates the particular points, denoted by white squares, between the rectangle and the circle; if the initial contour contains any of these points, the snake contour would stabilize on the opposite object boundaries. This is the so-called critical point problem in this study; the critical points that should be included within the initial contour are referred to as inner critical points, and those which should be excluded as outer ones. The existence of inner critical points was first pointed out in [11], and recently a dynamic system method was employed to detect the inner critical points in [15]; later, He et al. also utilized the dynamic system method for this purpose [16]. But this approach is computationally expensive and cannot detect the outer ones; in fact, Ford has proposed a more efficient and effective method based on dynamical systems in the context of fluid flow [17]. In this work, we investigate this critical point problem in two respects: analysis of the influencing factors and detection of the critical points. Understanding the influencing factors is helpful for selecting the parameters for computing the GVF. One particular influencing factor is the iteration number during the diffusion process; since too much diffusion would flood the object boundaries, an optimal stopping
Fig. 1. Top row: demonstration of the inner critical points. The inner critical points are denoted by white crosses; the white circles are initial contours. Bottom row: demonstration of outer critical points. The black dashed circles are the initial contours, the black solid curves are the converged results, and the outer critical points are denoted by white squares.
time, i.e., an optimal iteration number, is chosen by minimizing the correlation between the signal and noise in the filtered vector field. By quantizing the GVF vector field, a simple but effective method is presented to single out the critical points, and the initial contour can then be located around the inner critical points within the object; in this way, the GVF snake can overcome the critical point problem. A preliminary version of this work first appeared in [18]. The remainder of this paper is organized as follows: the GVF snake is briefly reviewed in Section 2, and Section 3 is devoted to the influencing factors of the critical points. In Section 4, we detail the detection of the critical points and the initialization of the GVF snake. Section 5 presents experimental results and Section 6 concludes the paper.
2 Brief Review of the GVF Snake
A snake contour is an elastic curve that moves and changes its shape to minimize the following energy

E_{snake} = \int_0^1 \tfrac{1}{2}\left(\alpha \|c_s\|^2 + \beta \|c_{ss}\|^2\right) + E_{ext}(c(s))\, ds \qquad (1)

where c(s) = [x(s) y(s)], s ∈ [0, 1], is the snake contour parameterized by arc length, c_s(s) and c_{ss}(s) are the first and second derivatives of c(s) with respect to s, positively weighted by α and β respectively, and E_{ext}(c(s)) is the image potential, which may result from various events, e.g., lines and edges. By the calculus of variations, the Euler equation minimizing E_{snake} is

\alpha c_{ss}(s) - \beta c_{ssss}(s) - \nabla E_{ext} = 0 \qquad (2)
This can be considered as a force balance equation

F_{int} + F_{ext} = 0 \qquad (3)

where F_{int} = α c_{ss}(s) − β c_{ssss}(s) and F_{ext} = −∇E_{ext}. The internal force F_{int} makes the snake contour smooth, while the external force F_{ext} attracts the snake to the desired image features. In a departure from this perspective, the gradient vector flow external force is introduced to replace −∇E_{ext} with a new vector field v(x, y) = [u(x, y) v(x, y)], which is derived by minimizing the following functional

\varepsilon = \iint \mu \|\nabla v\|^2 + \|\nabla f\|^2 \,\|v - \nabla f\|^2 \, dx\, dy \qquad (4)

where f is the edge map of the image I, usually f = ‖∇(G_σ ∗ I)‖, and μ is a positive weight. Using the calculus of variations, the Euler equations seeking the minimum of ε read

v_t = \mu \Delta v - \|\nabla f\|^2 (v - \nabla f) \qquad (5)
where Δ is the Laplacian operator. The snake model with v as the external force is called the GVF snake.
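As a concrete illustration, a minimal explicit iteration of Eq. (5) might look as follows. This is only a sketch under stated assumptions: the edge-map construction is simplified, boundary handling is left to the library defaults, and the parameter values follow those quoted later in the paper (μ = 0.15, time step 0.5).

```python
import numpy as np
from scipy import ndimage

def gvf(f, mu=0.15, dt=0.5, n_iter=200):
    """Explicit iteration of Eq. (5): v_t = mu*Laplacian(v) - |grad f|^2 (v - grad f).
    `f` is a scalar edge map; returns the two components (u, v) of the GVF field."""
    fy, fx = np.gradient(f)            # initial values u(0) = f_x, v(0) = f_y
    mag2 = fx**2 + fy**2
    u, v = fx.copy(), fy.copy()
    for _ in range(n_iter):
        u += dt * (mu * ndimage.laplace(u) - mag2 * (u - fx))
        v += dt * (mu * ndimage.laplace(v) - mag2 * (v - fy))
    return u, v
```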
3 Critical Point Analysis: Influencing Factors and Optimal Stopping Time
3.1 Summary of the Influencing Factors
Owing to the critical points, taking the GVF as the external force introduces a new nuisance for contour initialization, and one should pay close attention to these points. It is therefore desirable to have some guidance for choosing the parameters of the GVF calculation such that the GVF is as regular as possible, i.e., better edge-preserving and with fewer critical points. By analyzing Eq. 5 and carrying out practical experiments, we summarize the influencing factors of the critical points, which may serve as qualitative guidance, as follows:
1) Shape of the object: The object shape is characterized by the edge map. Generally speaking, the inner critical points lie on the medial axis of the object; thus, the initial contour should include the medial axis in order to capture the desired object. In order to obtain a good edge map for contaminated images, Gaussian blur with deviation σ is employed first, and therefore a slightly large σ is favored.
2) Regularization parameter μ: This coefficient controls the tradeoff between fidelity to the original data and smoothing. A large μ means smoother results and fewer critical points, but also a large deviation from the original data. Ideally μ would be slightly small in the vicinity of boundaries and large in homogeneous areas, but this is a dilemma for contaminated images.
3) Iteration number in the diffusion process: It was said in [3] that "the steady-state solution of these linear parabolic equations is the desired solution of the Euler equations…". This statement gives rise to the following question: is "the desired solution of the Euler equations" the desired external force for the snake model, i.e., the desired GVF? We answer this question and demonstrate the influence of the iteration number on the critical points by using the example in Fig. 2, with μ = 0.15 and time step 0.5. The heart image in Fig. 1 is smoothed using a Gaussian kernel of σ = 2.5, and the GVF fields at 100, 200 and 2000 iterations of diffusion are given in Fig. 2(a), (b), and (c) respectively. Visibly, there are fewer critical points in Fig. 2(b) than in Fig. 2(a) (see the white dot), and the result in Fig. 2(c) is far from usable in that the GVF flows into the ventricle from the right and out from the bottom left. Surely, the result in Fig. 2(c) approximates the steady-state solution, but it cannot serve as the external force for the snake model. The reason behind this situation is that Eq. 5 is a biased version of the isotropic diffusion v_t = μΔv, biased by the term |∇f|²(∇f − v). As t increases, the isotropic smoothing effect dominates the diffusion of (5) and the field converges to the average of the initial value ∇f. A small enough μ could suppress this over-smoothing effect but would, at the same time, preserve excessive noise; alternatively, an optimal iteration number, say 200 for this example, is an effective solution for this issue. This is the topic of the next subsection.
Fig. 2. Gradient vector flow fields at different iterations: (a) GVF at 100 iterations; (b) GVF at 200 iterations; (c) GVF at 2000 iterations
3.2 Optimal Stopping Time
Following the work on image restoration by Mrázek and Navara [19], the decorrelation criterion, which minimizes the correlation between the signal and noise in the filtered signal, is adopted for the selection of the optimal diffusion stopping time. Starting with the noisy image as its initial condition, I(0) = I_0, I evolves along some trajectory I(t), t > 0. The time T is optimal for stopping the diffusion in the sense that the noise in I(T) is removed significantly and the object structure is preserved to the extent possible. Obviously, this is an ambiguous statement; T can only be estimated by some criterion. Based on the assumption that the 'signal' I(t) and the 'noise' I(0) − I(t) are uncorrelated, the decorrelation criterion selects

T = \arg\min_t \mathrm{corr}(I(0) - I(t), I(t)) \qquad (6)

where

\mathrm{corr}(I(0) - I(t), I(t)) = \frac{\mathrm{cov}(I(0) - I(t), I(t))}{\sqrt{\mathrm{var}(I(0) - I(t)) \cdot \mathrm{var}(I(t))}} \qquad (7)

Although the underlying assumption does not necessarily hold and corr(I(0) − I(t), I(t)) is not necessarily unimodal, this situation is not severe in practice; Mrázek and Navara showed the effectiveness of this criterion [19]. Regarding the vector-valued GVF, by taking its two components into account, we slightly modify this criterion and obtain

T_{GVF} = \arg\min_t \left[ \mathrm{corr}(u(0) - u(t), u(t)) + \mathrm{corr}(v(0) - v(t), v(t)) \right] \qquad (8)

Since the initial values u(0) = f_x, v(0) = f_y are the derivatives of an edge map and may have different magnitudes, the diffusion stops at T_GVF so that the component with the smaller initial value is not over-smoothed and the other one is not under-smoothed too much.
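A sketch of how the criterion of Eqs. (7) and (8) could be evaluated during the diffusion is given below. It is illustrative only and makes simplifying assumptions: the globally best iterate seen so far is kept rather than detecting the first local minimum, and the update step mirrors the explicit scheme sketched in Section 2.

```python
import numpy as np
from scipy import ndimage

def corr(a, b):
    """Correlation of Eq. (7), computed over the flattened fields."""
    a, b = a.ravel(), b.ravel()
    return np.cov(a, b)[0, 1] / np.sqrt(a.var() * b.var() + 1e-12)

def gvf_optimal_stop(fx, fy, mu=0.15, dt=0.5, max_iter=2000):
    """GVF diffusion stopped at the iteration minimizing Eq. (8)."""
    mag2 = fx**2 + fy**2
    u, v = fx.copy(), fy.copy()
    best = (np.inf, u.copy(), v.copy(), 0)
    for t in range(1, max_iter + 1):
        u = u + dt * (mu * ndimage.laplace(u) - mag2 * (u - fx))
        v = v + dt * (mu * ndimage.laplace(v) - mag2 * (v - fy))
        score = corr(fx - u, u) + corr(fy - v, v)   # criterion of Eq. (8)
        if score < best[0]:
            best = (score, u.copy(), v.copy(), t)
    return best[1], best[2], best[3]                # u, v and T_GVF
```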
4 Initialization Via Critical Point Detection
Generally speaking, the inner critical points are located within closed homogeneous image regions, e.g., the object area. When they are located within the object area, the initial contour should contain these points; otherwise, they should be excluded. The outer ones are generally located between objects or between parts of one object; when lying between objects, the initial contour should exclude these points, or the snake contour would be driven to the opposite object. See the examples in Fig. 1. In practice, noise would disturb the location of the critical points. Due to page limitations, we do not elaborate on the distribution of, and a graceful solution to, the critical points; here we only present a practical method to ease contour initialization by using the inner critical points within the object region. The proposed strategy achieves this end by singling out all the critical points and locating a proper initial contour around those within object regions.
4.1 Identifying the Critical Points
Our proposed method follows the basic idea of [14] by quantizing the GVF in the following way. Given a point p in the image domain with associated GVF vector \vec{v}_p, denote the 8-neighborhood of p by Ω_p. For any point q in Ω_p, \vec{pq} is a unit vector from p to q, and \vec{w}_p is derived from \vec{v}_p such that

\vec{v}_p \cdot \vec{w}_p = \max_{q \in \Omega_p} \vec{v}_p \cdot \vec{pq} \qquad (9)

In fact, \vec{w}_p is the one nearest to \vec{v}_p in direction among the eight \vec{pq}'s. In Fig. 3 we demonstrate this transformation of the GVF on the synthetic image used in Fig. 1. Our proposed method identifies the critical points based on this quantized GVF. As aforementioned, there are two types of critical point; here we address the identification algorithm in detail.
Given a point p in the image domain, for any point q in Ω_p with \vec{w}_q, let \vec{qp} be a unit vector from q to p; p is an inner critical point if, for all q ∈ Ω_p,

\vec{w}_q \cdot \vec{qp} \le 0 \qquad (10)

If p is not an inner critical point and, for all q ∈ Ω_p,

\vec{w}_q \cdot \vec{qp} < 1 \qquad (11)

we call p an endpoint. An endpoint p is an outer critical point if p is not isolated or not on an isolated clique of endpoints. It is clear from Eq. 10 that for an inner critical point, none of the quantized GVF vectors in its 8-neighborhood points to it; therefore, if this critical point is outside the initial snake contour, the GVF external force cannot push the snake
contour across it; thus, if this critical point is within the object region, the initial snake contour should enclose it. For an endpoint, the quantized GVF vectors of some points in its 8-neighborhood may point towards it, but none points to it exactly. If the endpoint is not isolated, it is an outer critical point. When this type of critical point lies between objects and the initial contour encircles them, the snake contour would be driven away and stabilize on the opposite object. Here we carry out an experiment on the synthetic image and the corresponding GVF field shown in Fig. 3 to demonstrate the identification of the critical points. See Fig. 4, where the inner critical points are denoted by crosses and the outer ones by dots.
Fig. 3. Quantization of the GVF: (a) GVF field; (b) quantized GVF field
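A minimal sketch of the quantization of Eq. (9) and the inner-critical-point test of Eq. (10) is given below. The array-coordinate convention (row index = y, column index = x) and the border handling are assumptions of this illustration, not part of the paper.

```python
import numpy as np

# The eight neighbour offsets and the corresponding unit vectors pq
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
UNIT = np.array([[dy, dx] for dy, dx in OFFSETS], dtype=float)
UNIT /= np.linalg.norm(UNIT, axis=1, keepdims=True)

def quantize_gvf(u, v):
    """Eq. (9): replace each GVF vector by the neighbour direction pq that
    maximizes the dot product. Returns an index map into OFFSETS."""
    vec = np.stack([v, u], axis=-1)          # (rows, cols, 2) stored as (dy, dx)
    scores = vec @ UNIT.T                    # dot products with the 8 directions
    return scores.argmax(axis=-1)

def inner_critical_points(u, v):
    """Eq. (10): p is an inner critical point if no quantized neighbour vector
    w_q points towards p, i.e. w_q . qp <= 0 for all q in the 8-neighbourhood."""
    w_idx = quantize_gvf(u, v)
    rows, cols = w_idx.shape
    inner = np.zeros((rows, cols), dtype=bool)
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            ok = True
            for (dy, dx), pq in zip(OFFSETS, UNIT):
                w_q = UNIT[w_idx[y + dy, x + dx]]
                if np.dot(w_q, -pq) > 0:     # qp is the reverse of pq
                    ok = False
                    break
            inner[y, x] = ok
    return inner
```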
4.2 Locating the Initial Contour
After identifying the critical points, we can locate an initial contour based on the inner critical points, but the proposed method needs to make use of the prior position of the region of interest (ROI); it includes the following steps:
1) Compute and quantize the GVF vector field;
2) Adopt the prior knowledge about the object position, as done in [11];
3) Single out all inner critical points within the object according to Eq. 10;
4) Locate one initial contour around these critical points; the snake contour will then evolve to the object boundary under the GVF force.
In this way, we can get the expected results. This strategy is somewhat rough and restricted, but it works well in practice.
Fig. 4. Identification of critical points. White dots denote the outer critical points between objects, black dots denote those separating parts of objects, and the inner critical points are denoted by black crosses.
5 Experimental Results
In this section, we first assess the applicability of the decorrelation criterion for selecting the optimal stopping time for the GVF by performing an experiment on the heart image in Fig. 1. The parameters used to compute the GVF are the same as in Fig. 2. Fig. 5(a) shows the evolution of the correlation with the iteration number. The correlation first decreases as the iteration number increases, then reaches a minimum, and increases afterwards. The iteration number where the correlation is minimal is the optimal point to stop the diffusion; it is 188 for this example. The associated GVF at 188 iterations is shown in Fig. 5(b).
Fig. 5. Optimal iteration number and the associated GVF field. (a) Correlation evolving with the iteration number; the iteration where the correlation is minimal is optimal for stopping the diffusion. (b) GVF at the optimal iteration number.
To demonstrate the automatic identification of critical points and the location of the initial snake contour based on inner critical points, we utilize the proposed method to segment several images, including the heart image and the synthetic image in Fig. 1. The results are shown in Fig. 6 with the initial contour, a snapshot and the final contour overlaid. The GVF fields are calculated with μ = 0.15 for all images, and all real images are smoothed by a Gaussian kernel with standard deviation σ = 2.5. The iteration number is chosen automatically based on the decorrelation criterion; it is 188, 64, 269, 182 and 208 for Fig. 6(a), (b), (c), (d) and (e) respectively. The inner critical points are indicated by crosses. When the critical points are regularly distributed, the initial contour is located automatically, see Fig. 6(a) and (b); otherwise it is located by hand, see Fig. 6(c), (d) and (e). Because all the initial contours contain the corresponding inner critical points, they can cope with the critical point issue shown in Fig. 1 and converge to all desired boundaries. These experiments validate the feasibility of the proposed solution to the critical point problem.
Fig. 6. Segmentation examples. All the snake contours are initialized based on inner critical points.
6 Conclusion
In this paper, a theoretical study has been carried out on the GVF snake model. The critical point problem lurking in the GVF snake has been pointed out, and the critical points are identified as inner and outer ones. The influencing factors of the critical points include the object shape, the regularization and the iteration number during diffusion. By minimizing the correlation between the signal and noise in the filtered vector field, we have introduced the decorrelation criterion to choose the optimal iteration number. We have also presented an approach to find all the critical points by quantizing the GVF field. The snake contour initialized to contain the inner critical points within the object region can avoid the nuisance stemming from critical points and converge to the desired boundaries. In forthcoming work, we will elaborate on the detection of the critical points and present a graceful solution to critical points during initialization.
Acknowledgments. This work was supported by the National Natural Science Foundation of China under grants 60543007 and 60602050.
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Int'l J. Computer Vision 1(4), 321–331 (1988)
2. Han, X., Xu, C., Prince, J.: A topology preserving level set method for geometric deformable models. IEEE TPAMI 25(6), 755–768 (2003)
3. Xu, C., Prince, J.: Snakes, shapes and gradient vector flow. IEEE TIP 7(3), 359–369 (1998)
4. Xu, C., Prince, J.: Generalized gradient vector flow external forces for active contours. Signal Processing 71(2), 131–139 (1998)
5. Tang, J., Acton, S.T.: Vessel boundary tracking for intravital microscopy via multiscale gradient vector flow snakes. IEEE TBME 51(2), 316–324 (2004)
6. Paragios, N., Mellina-Gottardo, O., Ramesh, V.: Gradient vector flow fast geometric active contours. IEEE TPAMI 26(3), 402–407 (2004)
7. Chuang, C., Lie, W.: A downstream algorithm based on extended gradient vector flow for object segmentation. IEEE TIP 13(10), 1379–1392 (2004)
8. Cheng, J., Foo, S.W.: Dynamic directional gradient vector flow for snakes. IEEE TIP 15(6), 1653–1671 (2006)
9. Yu, H., Chua, C.-S.: GVF-based anisotropic diffusion models. IEEE TIP 15(6), 1517–1524 (2006)
10. Ray, N., Acton, S.T., Ley, K.: Tracking leukocytes in vivo with shape and size constrained active contours. IEEE TMI 21(10), 1222–1235 (2002)
11. Ray, N., Acton, S.T., Altes, T., et al.: Merging parametric active contours within homogeneous image regions for MRI-based lung segmentation. IEEE TMI 22(2), 189–199 (2003)
12. Ray, N., Acton, S.T.: Motion gradient vector flow: an external force for tracking rolling leukocytes with shape and size constrained active contours. IEEE TMI 23(12), 1466–1478 (2004)
13. Ray, N., Acton, S.T.: Data acceptance for automated leukocyte tracking through segmentation of spatiotemporal images. IEEE TBME 52(10), 1702–1712 (2005)
14. Li, C., Liu, J., Fox, M.D.: Segmentation of external force field for automatic initialization and splitting of snakes. Pattern Recognition 38, 1947–1960 (2005)
15. Chen, D., Farag, A.A.: Detecting critical points of skeletons using triangle decomposition of gradient vector flow field. In: GVIP 2005 Conference, CICC, Cairo, Egypt, December 19-21 (2005)
16. He, Y., Luo, Y., Hu, D.: Semi-automatic initialization of gradient vector flow snakes. Journal of Electronic Imaging 15(4), 43006–43008 (2006)
17. Ford, R.M.: Critical point detection in fluid flow images using dynamical system properties. Pattern Recognition 30(12), 1991–2000 (1997)
18. Wang, Y.: Investigation on deformable models with application to cardiac MR images analysis. PhD dissertation, Nanjing Univ. Sci. Tech., Nanjing, PRC (June 2004)
19. Mrázek, P., Navara, M.: Selection of optimal stopping time for nonlinear diffusion filtering. Int. J. Computer Vision 52(2/3), 189–203 (2003)
A Fast and Noise-Tolerant Method for Positioning Centers of Spiraling and Circulating Vector Fields
Ka Yan Wong and Chi Lap Yip
Dept. of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
{kywong,clyip}@cs.hku.hk
Abstract. Identification of the centers of circulating and spiraling vector fields is important in many applications. Tropical cyclone tracking, rotating object identification, analysis of motion video and the movement of fluids are but some examples. In this paper, we introduce a fast and noise tolerant method for finding the centers of circulating and spiraling vector field patterns. The method can be implemented using integer operations only. It is 1.4 to 4.5 times faster than traditional methods, and the speedup can be further boosted to 96.6 times by the incorporation of search algorithms. We show the soundness of the algorithm using experiments on synthetic vector fields, and demonstrate its practicality using application examples in the fields of multimedia and weather forecasting.
1 Introduction
Spiral, circular, or elliptical 2D vector fields, as well as sources and sinks, are encountered in many applications. Of particular interest to researchers is the detection of the centers of these vector field patterns, which provides useful information about the field structure. For example, in [1], circulating or elliptical vector fields are formed by motion compensated prediction of rotating objects and swirl scene changes in video sequences; locating the centers helps object segmentation and tracking. As another example, in meteorology, vector fields constructed from remote sensing images show the circulating or spiraling structures of tropical cyclones and pressure systems, which help in positioning them [2]. Orientation fields that show circulating and spiraling patterns also draw the attention of computer vision researchers [3][4]. To locate the centers of a circulating or spiraling vector field F, one can use circulation analysis to locate the regions with high magnitude of vorticity ||∇ × F||. To locate sources or sinks, the divergence can be calculated. However, such simplistic methods are ineffective on incomplete or noisy vector fields. Previous works that address the issue mainly solve the problem using three approaches: (1) vector field pattern matching, (2) examination of the dynamical system properties of vector fields using algebraic analysis, and (3) structural analysis. The idea of the vector field pattern matching approach is to take the center as the location of the input vector field that best fits a template under some similarity measure, such as the sine metric [5], correlation [6] and Clifford convolution [7]. Methods employing such an approach are flexible, as different templates can be defined for finding different
The authors are thankful to the Hong Kong Observatory for data provision and expert advice.
flow patterns. However, the size and pattern of the templates have to be similar to the patterns in the vector field, and the computation time increases with the template size. The idea of algebraic analysis is to analyze a vector field decomposed into a sum of solenoidal and irrotational components. This can be done using the discrete Helmholtz-Hodge decomposition [8] or the 2D Fourier transform [9]. The corresponding potential functions are then derived from these components, and the critical points at the local extrema of these functions are found. These points are then characterized by the analysis of the corresponding linear phase portrait matrices. Besides centers, vector field singularities such as swirls, vortices, sources and sinks can also be identified. Other methods for phase portrait estimation include the isotangent-based algorithm [3] and the use of a least square error estimator [4]. These methods are mathematically sound, but are relatively computationally expensive and are sensitive to the noise in vector fields that inevitably arises in practical situations. Besides pattern matching and algebraic analysis, a structural motion field analysis approach is proposed in [10]. The method models vector field patterns as logarithmic spirals, with the polar equation r = a e^{θ cot α}. It works by transforming each vector of the field into a sector whose angular limits are bounded by the two lines formed by rotating the vector counterclockwise by ψ_m and ψ_M. The values of ψ_m and ψ_M are calculated from the aspect ratio ρ and the angle α of the vector field pattern. The rotation center is the point covered by the largest number of sectors from neighbours no more than d pixels away. The method can handle circulating and spiraling vector fields with different inception angles and aspect ratios. It was reported to be up to 1.81 times faster than circulation analysis when used to detect the centers of rotating objects in video sequences [1]. Yet, the method requires the determination of the parameters ρ and α, which can sometimes be difficult. All the methods mentioned above require complex operations such as vector matching and parameter estimation. Floating point operations and a well-structured motion field that fits the assumed mathematical models or templates are required. To handle practical situations, we need a robust and fast method. In this paper, we introduce a fast and noise tolerant method for finding the centers of circulating and spiraling vector fields.
2 Identifying the Centers
Our method is a structural analysis method which requires the user neither to define a template nor to carry out complex mathematical operations. In [10], the optimal rotation angle ω and sector span σ for center identification are determined for perfect vector fields. Yet, the performance of the algorithm in handling noisy vector fields is not addressed. We found that this method can be modified to handle noisy vector fields by increasing the sector span σ from the optimal value (see Fig. 1(a) to (c)). However, the amount of increase σ⁺ depends on the vector field itself, and varies from case to case. Hence, instead of using extra computational resources to determine a suitable expansion value, let us consider the extreme case, where the sector span σ + σ⁺ equals π. In this case, as long as a vector differs by less than π/2 from the actual direction, the sector of the rotated vector, now a half-plane, is large enough to cover the center (see Fig. 1(d)).
Fig. 1. Algorithm illustration: (a) optimal σ on a perfect field; (b) optimal σ, (c) enlarged σ, and (d) σ + σ⁺ = π on a noisy field. The estimated center is the point covered by the largest number of sectors.
Fig. 2. Results on synthetic fields with different values of ρ and α: (a)–(e) ρ = 1 and (f)–(j) ρ > 1, with α = 0⁺, 3π/10, π/2, 7π/10 and π⁻ respectively.
Based on this idea, our proposed center identification algorithm is as follows. For each point on the vector field, every neighbouring vector within a distance of d pixels from it is checked to see whether the point is on its left or right, and the left- and right-counts are recorded. For a clockwise circulating or spiraling vector field, the point having the maximum right-count is the center location, whereas the left-count is considered when the field is a counterclockwise one. The value d defines a distance of influence so that vectors far away, which may not be relevant to the flow of interest, are not considered; it also allows the algorithm to handle vector fields with multiple centers. Since the algorithm only involves left-right checking and counting, the method can be implemented with integer operations only. To further boost the speed of the algorithm, search algorithms that make use of the principle of locality are also incorporated in the design to locate the point with the maximum count.
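A sketch of this counting scheme is given below. It is an illustration rather than the authors' implementation: the left/right test via the sign of the cross product assumes a y-axis pointing upwards (it is mirrored for image coordinates), and floating-point vectors are used here even though the method itself needs only integer operations.

```python
import numpy as np

def find_center(u, v, d=5, clockwise=False):
    """For every vector, mark all pixels within distance d that lie on its
    left (counterclockwise field) or right (clockwise field); the pixel with
    the maximum count is reported as the center."""
    rows, cols = u.shape
    count = np.zeros((rows, cols), dtype=np.int32)
    ys, xs = np.nonzero((u != 0) | (v != 0))       # consider non-zero vectors only
    for y, x in zip(ys, xs):
        for py in range(max(0, y - d), min(rows, y + d + 1)):
            for px in range(max(0, x - d), min(cols, x + d + 1)):
                if (py - y) ** 2 + (px - x) ** 2 > d * d:
                    continue
                # z-component of the cross product of the vector with the offset
                cross = u[y, x] * (py - y) - v[y, x] * (px - x)
                if (cross < 0) if clockwise else (cross > 0):
                    count[py, px] += 1
    cy, cx = np.unravel_index(count.argmax(), count.shape)
    return (cy, cx), count
```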
3 Evaluation
To validate and evaluate the proposed method, a Java-based system prototype is built. We show the soundness of the proposed algorithm and its robustness against noisy vector fields using synthetic vector fields in Sections 3.1 and 3.2, followed by a discussion of efficiency issues and the effect of search algorithms in Sections 3.3 and 3.4 respectively. The practicality of the method is then demonstrated in Section 4. The performance is evaluated by comparing with three center finding approaches: (1) vector field pattern matching with a template "center" and the absolute sine metric [5], (2) critical points classified as "center" by algebraic analysis, based on [9], and (3) structural analysis, as in [10]. The efficiency of the algorithm is evaluated by profiling tests on an iMac G5 computer running Mac OS 10.4.9, with a 2GHz PowerPC G5 processor and 1.5GB of RAM. The
effectiveness of the algorithm is evaluated by finding the average Euclidean distance between the proposed center and the actual center of the vector field.
3.1 Validating the Method
To empirically validate our algorithm, it is applied to synthetic vector fields of size 161 × 161 pixels generated using the polar equation r = a e^{θ cot α}, which covers a family of 2D vector field patterns: spiral flow, circular flow, source and sink [10]. The field is viewed at different angles to produce fields with different aspect ratios ρ in Cartesian coordinates: (x, y) = (r cos θ, rρ sin θ). The score is presented as a grayscale image on which the original vector field is overlaid; this representation is used in subsequent results. The higher the left-count (counterclockwise field) at a location (x, y), the brighter the point is. With d = 5, a single brightest spot is clearly shown at the field center for all tested values of ρ and α, giving a basic validation of our algorithm (Fig. 2).
3.2 Robustness Against Noise
Noise sensitivity studies were carried out using synthetic circulating fields of size 161 × 161 pixels with different types of noise that model three different situations: (1) Gaussian noise on each vector dimension, modeling sensor inaccuracies (e.g., errors in wind direction measurements); (2) random missing values, modeling sensor or object tracking failures; and (3) a partially occluded field, modeling video occlusions (e.g., a train passing in front of a Ferris wheel). The latter two cases are generated by replacing some of the vectors of an ideal vector field by zero or by vectors pointing in a particular direction. In the experiments, the distance of influence d of the proposed method and of structural analysis is set to five pixels. The template used for the pattern matching algorithm is of size 41 × 41 pixels, generated from an ideal rotating field. Pattern matching takes the lowest score (darkest pixel) as the answer; all other methods take the highest score (brightest pixel) as the answer. Fig. 3(a) shows the results. The ideal vector fields without noise are references for this noise sensitivity study. Pattern matching gives the worst result when Gaussian noise is present. This is because the absolute sine function changes rapidly around the zero angle difference point, and the function tends to exaggerate the damage of occasional vectors that are not in the right direction. Yet, the pattern matching approach can handle data discontinuity cases, such as occlusion and random missing values. In contrast, the algebraic analysis method works well on fields with Gaussian noise. The Fourier transform of a Gaussian function in the spatial domain is a Gaussian function in the frequency domain; thus, the addition of the Gaussian noise would not affect the global maximum score unless a frequency component of the noise is greater than the strongest frequency component of the original signal. However, phase portrait analysis cannot handle the occlusion case well. The structural analysis and our proposed method work well in all cases, and their level of noise tolerance can be controlled by adjusting the distance of influence d. In particular, the consideration of only the left or right count of a vector in our method allows
Fig. 3. Performance on noisy vector fields: (a) comparison between the major approaches (proposed method, pattern matching, algebraic analysis, structural analysis) and (b) comparison between the search algorithms (2LHS, TSS, 2DLogS, OrthS), each on ideal, Gaussian-noise, random-missing and occluded fields.
slightly distorted vectors to cover the true rotation center, giving a more distinguished peak (the brightest pixel) than structural analysis when Gaussian noise is present.
3.3 Efficiency
The efficiencies of the algorithms are compared by profiling tests; Table 1(a) shows the result. Algebraic analysis requires vector field decomposition, estimation of phase portrait matrices and classification, and thus takes the longest time. The speeds of pattern matching and structural analysis are comparable; their run times are quadratic in the linear dimension of the template and in the distance of influence d, respectively. Our proposed method is the fastest, and can be implemented using integer operations only. Yet, as with structural analysis, its efficiency is affected by the distance of influence d.
3.4 Boosting Efficiency: Use of Search Algorithms
To further speed up our algorithm, search algorithms that make use of the principle of locality are incorporated in the design. Four popular algorithms [11], namely Two level hierarchical search (2LHS), Three step search (TSS), Two dimensional logarithmic search (2DLogS), and Orthogonal search (OrthS), are implemented for comparison with Exhaustive search (ExhS), in which the score of every point is evaluated. Their properties are summarized in Table 1(b). The algorithm performance on the noise tolerance and efficiency tests is shown in Fig. 3(b) and Table 1(b) respectively. In general, the fewer locations a search algorithm examines in each iteration, the faster the center identification process is, but with a less distinguished result. The use of search algorithms boosts the efficiency by at least 9.33 times but does not affect the noise tolerance much. This is an advantage, as a faster search algorithm can be chosen to speed up the process while preserving the quality.
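For illustration, one possible form of the three step search over the count map is sketched below. This is only a sketch: bounds checking is omitted, and the exact variant and parameters used by the authors are not specified here.

```python
def three_step_search(score, start, initial_step=4):
    """Three Step Search: examine 9 candidates around the current best guess,
    move to the best one, halve the step, and repeat until the step reaches one.
    `score(y, x)` returns the left- or right-count at position (y, x)."""
    y, x = start
    step = initial_step
    while step >= 1:
        best = (score(y, x), y, x)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                s = score(y + dy, x + dx)
                if s > best[0]:
                    best = (s, y + dy, x + dx)
        _, y, x = best
        step //= 2
    return y, x
```

In a video setting the starting position would typically be the center found in the previous frame, which is why an incorrect initial guess can propagate, as discussed in Section 4.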
Table 1. Results of the profiling tests

(a) Comparison between the major approaches
    Method                Parameter   Time (ms)
    Algebraic analysis    –           669769
    Pattern matching      –           208759
    Structural analysis   d = 10      200444
    Proposed method       d = 10      148416

(b) Comparison between the different search algorithms (d = 10)
    Algorithm   Properties                                                    Time (ms)
    ExhS        Examines every possible location                              148416
    2LHS        A hierarchical algorithm, sparse then narrow down             15905
    TSS         Reduces search distance in each iteration                     5756
    2DLogS      Reduces search distance when center is of the highest rank    7973
    OrthS       Explores the search space axis by axis                        6932
Fig. 4. Ferris wheel: results and cumulative percentage of frames against error. (a) Frame 01; (b) Frame 15; (c) key; (d) proposed method and the major approaches; (e) effect of search algorithms (ExhS, 2LHS, TSS, 2DLogS, OrthS, subsampled). In (d) and (e) the cumulative frequency (%) is plotted against the Euclidean distance from the actual center (pixels).
4 Applications
In this section, we demonstrate the use of our method in the fields of multimedia and meteorology. In these practical applications, vectors far away may not be related to the rotating object. Hence, we only consider vectors within d = 100 pixels in calculating the left–right counts on videos of 320 × 240 or 480 × 480 pixels in size. Depending on the level of noise immunity desired, different values of d can be used in practice. To speed up the process in handling real-life applications, the proposed method, pattern matching, algebraic analysis and structural analysis are applied to sampled locations of the vector field to position the circulation center. The effect of search algorithms in handling real-life applications is also studied.
4.1 Rotation Center Identification in Video Sequences
The video sequences used for the experiments include a video of a Ferris wheel taken from the front at a slightly elevated view, and a sequence of presentation slides with
swirl transition effect. Motion fields generated from MPEG-4 videos of 320 × 240 pixels in size are used for the detection of the rotation center. Each frame is segmented into overlapping blocks of 16 × 16 pixels, placed 10 pixels apart both horizontally and vertically. Two level hierarchical search is then applied to every block to find the motion vector from the best matching block in the previous frame, using the mean absolute error as the distortion function. The motion field is then smoothed by a 5 × 5 median filter, and the tested methods are applied to the motion field for rotation center location. Since the density of the motion field affects the accuracy of the rotation center location, and the vector field found may not be perfect, for each algorithm we take the top three highest-scored centers to determine the output; among these top three, the one that is closest to the actual center is taken as the final answer. Here, the actual center is the centroid of the Ferris wheel. The performances of the algorithms are compared by plotting the fraction of frames with error smaller than an error distance, as in Fig. 4(d). A point (x, y) means that a fraction y of the frames gives an error that is no more than x pixels from the actual center. A perfect algorithm would produce a graph that goes straight up from (0, 0) to (0, 1) and then straight to the right; the nearer the line for an algorithm is to this perfect graph, the more accurate it is. From the graph, the percentages of frames giving an average error within one step size with (1) our proposed method, (2) pattern matching, (3) algebraic analysis, and (4) structural analysis are 91%, 1%, 46% and 90% respectively. The low accuracy of the pattern matching method is caused by the sine function exaggerating the damage of occasional vectors that are not in the right direction in the imperfect field, as discussed in Section 3.2. For algebraic analysis, the imperfections in the motion field, especially the noisy vectors at the edges of the frame (Fig. 4(a)), cause the error; this explains why the fraction stays below 80% until the error distance is about 170 pixels. The results of our proposed algorithm and structural analysis are comparable: both methods consider only vectors within a distance d, so noise far away from the actual center has no effect, offsetting the imperfections in the motion fields and increasing the practicality of the methods. Results and the error graph for the swirl transition sequence from presentation slides are shown in Fig. 5. In this presentation sequence, the first slide rotates while zooming out from the center, and the next slide rotates while zooming in, so the areas with motion vectors change continuously in the video sequence. Moreover, as the slides are shown as rectangular boxes, undesired motion vectors are generated at their edges and corners; this edge effect affects the algorithm performance. The percentages of frames giving an average error within two step sizes with (1) our proposed method, (2) pattern matching, (3) algebraic analysis, and (4) structural analysis are 83%, 52%, 20% and 66% respectively. Algebraic analysis classified the vector field center as a node or a saddle instead of a center in some cases, leading to a low 20% coverage within two step sizes; yet, if the points classified as saddles or nodes are also considered as centers, over 80% coverage within two step sizes is obtained.
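The motion fields above come from standard block matching; a simplified sketch is given below. It uses a plain full search with mean absolute error for brevity, whereas the paper uses two level hierarchical search, and the search range and data types are assumptions of this illustration.

```python
import numpy as np

def block_motion_field(prev, curr, block=16, step=10, search=7):
    """Block-matching motion field: for each block of `block`x`block` pixels,
    placed every `step` pixels, find the displacement into `prev` that
    minimizes the mean absolute error."""
    H, W = curr.shape
    us, vs = [], []
    for y in range(0, H - block + 1, step):
        row_u, row_v = [], []
        for x in range(0, W - block + 1, step):
            cur = curr[y:y + block, x:x + block].astype(np.float32)
            best = (np.inf, 0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = y + dy, x + dx
                    if 0 <= py <= H - block and 0 <= px <= W - block:
                        mae = np.abs(cur - prev[py:py + block, px:px + block]).mean()
                        if mae < best[0]:
                            best = (mae, dx, dy)
            # (dx, dy) points from the current block to its match in the previous frame
            row_u.append(best[1]); row_v.append(best[2])
        us.append(row_u); vs.append(row_v)
    return np.array(us), np.array(vs)
```

A 5 × 5 median filter (e.g., scipy.ndimage.median_filter applied to each component) can then be used to smooth the resulting field before center detection.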
The use of the limit d in our algorithm and in structural analysis, or the size of the template in the pattern matching approach, limits the area of analysis. Unless vectors at the edges or corners of the slides are within a distance d from the actual center (or within the template size for the pattern matching approach), the performance of the algorithms remains unaffected by the edge effect. The lower
Fig. 5. Presentation: result and cumulative percentage of frames against error. Panels: (a) Frame 299; (b) Frame 386; (c) key; (d) proposed method and major approaches; (e) effect of search algorithms. In (d) and (e), the cumulative frequency (%) is plotted against the Euclidean distance from the actual center (pixels); (e) compares ExhS, 2LHS, TSS, 2DLogS, OrthS and the subsampled field.
percentage obtained by the pattern matching approach is mainly due to the mismatch between the template and the imperfect vector field.

In studying the effect of search algorithms in both cases (Figs. 4(e) and 5(e)), as expected, the best results are given by ExhS on all vectors, followed by 2LHS on sampled search positions. The use of TSS, 2DLogS and OrthS resulted in relatively larger errors. This is because TSS, 2DLogS and OrthS start the search at an initial search position that depends on the answer for the previous frame. If the initial search position is incorrect, the error may accumulate and affect subsequent results. When an initial estimate of the center position is available, these three algorithms would be good choices since they are faster.

4.2 Tropical Cyclone Eye Fix

Fixing the center of a tropical cyclone (TC) is important in weather forecasting. A typical TC has spiral rainbands with an inflow angle of about 10°, swirling counterclockwise in the Northern Hemisphere. A spiraling vector field is thus generated from a sequence of remote sensing data. Our proposed method is applied to a sequence of radar images (5 hours, 50 frames) compressed as MPEG-4 videos of 480 × 480 pixels in size. The output positions of the system are smoothed using a Kalman filter. To give an objective evaluation, interpolated best tracks¹ issued by the Joint Typhoon Warning Center (JTWC) [12] and the Hong Kong Observatory (HKO) [13] are used for
¹ Best tracks are the hourly TC locations determined after the event by a TC warning center using all available data.
Fig. 6. Comparison of TC tracks and eye fix results: (a) early stage of the TC; (b) later stage of the TC; (c) key; (d) proposed method and major approaches; (e) effect of search algorithms
comparison. Fig. 6 shows the eye fix results and the comparison of the tracks proposed by the different methods. We observe that pattern matching gives the worst result, with most of the frames having an answer far away from the best tracks. Results of algebraic analysis were affected by the vectors formed by radar echoes at the outermost rainbands of the TC (Fig. 6(b)). Yet, when a well-structured vector field is found, the algebraic analysis method gives proposed centers close to the best tracks (Fig. 6(a)). Our proposed track and the one given by structural analysis are close to the best tracks given by HKO and JTWC. Using the HKO best track data as a reference, our proposed method gives an average error of 0.16 degrees on the Mercator-projected map (Table 2(a)), well within the relative error of 0.3 degrees given by different weather centers [14]. The search algorithms 2LHS, 2DLogS, and OrthS are comparable, with average errors ranging from 0.17 to 0.19 degrees, while TSS gives an average error of 0.35 degrees. This is because potentially better results far away from the initial location cannot be examined, as the search distance of TSS halves every iteration. The application of our proposed method in weather forecasting shows its practicality and its ability to find the center of spiraling flow.
5 Summary

We have proposed a fast and noise-tolerant method for identifying centers of a family of 2D vector field patterns: spiral flow, circular flow, source and sink. For each point on the
Table 2. Average error from interpolated HKO best track

(a) Proposed method and major approaches
Algorithm             Error (degrees)
Proposed method       0.16
Pattern matching      1.40
Algebraic analysis    0.39
Structural analysis   0.25

(b) Effect of search algorithms
Search algorithm      Error (degrees)
2LHS                  0.17
TSS                   0.35
2DLogS                0.19
OrthS                 0.18
vector field, every neighbouring vector within a distance of d pixels is checked to see whether the point is on the left or right of the vector. The location with the maximum left or right count is the center of a counterclockwise or clockwise circulating flow, respectively. The method can be implemented using only integer operations. It is found that the proposed method is 1.35 to 4.51 times faster than traditional methods, and can be boosted to run up to 96.62 times faster when a search algorithm is incorporated, with little tradeoff in effectiveness. The algorithm is tolerant to different types of noise, such as Gaussian noise, missing vectors, and partially occluded fields. The practicality of the method is demonstrated using examples in detecting centers of rotating objects in video sequences and identifying the eye positions of tropical cyclones in weather forecasting.
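The left–right counting rule summarized above can be sketched as follows. This is our own minimal illustration, not the authors' code; the candidate grid, the data layout and the sign convention of the cross product (which flips between image and mathematical coordinates) are assumptions.

```python
import numpy as np

def circulation_center(positions, vectors, candidates, d=100):
    """For each candidate point, count how many nearby motion vectors have the
    point on their left or right; the largest left (right) count marks the
    counterclockwise (clockwise) circulation center.  Integer-only arithmetic
    is possible, as noted in the text; floats are used here for brevity."""
    best = None
    for c in candidates:
        left = right = 0
        for p, v in zip(positions, vectors):
            if np.hypot(*(c - p)) > d or not np.any(v):
                continue
            # 2D cross product of the vector with (candidate - base point);
            # sign convention assumes y pointing up.
            cross = v[0] * (c[1] - p[1]) - v[1] * (c[0] - p[0])
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
        score = max(left, right)
        if best is None or score > best[0]:
            best = (score, tuple(c), 'ccw' if left >= right else 'cw')
    return best
```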
References

1. Wong, K.Y., Yip, C.L.: Fast rotation center identification methods for video sequences. In: Proc. ICME, Amsterdam, The Netherlands, pp. 289–292 (July 2005)
2. Li, P.W., Lai, S.T.: Short range quantitative precipitation forecasting in Hong Kong. J. Hydrol. 288, 189–209 (2004)
3. Shu, C.F., Jain, R., Quek, F.: A linear algorithm for computing the phase portraits of oriented textures. In: Proc. CVPR, Maui, Hawaii, USA, pp. 352–357 (June 1991)
4. Shu, C.F., Jain, R.C.: Vector Field Analysis for Oriented Patterns. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-16(9), 946–950 (1994)
5. Rodrigues, P.S.S., de Araújo, A.A., Pinotti, M.: Describing patterns in flow-like images. In: Proc. 10th ICIAP, Venice, Italy, pp. 424–429 (September 1999)
6. Heiberg, E., Ebbers, T., Wigström, L., Karlsson, M.: Three-Dimensional Flow Characterization Using Vector Pattern Matching. IEEE Trans. Vis. Comput. Graphics 9(3), 313–319 (2003)
7. Ebling, J., Scheuermann, G.: Clifford convolution and pattern matching on vector fields. In: Proc. 14th Vis., Seattle, Washington, USA, pp. 26–33 (October 2003)
8. Polthier, K., Preuss, E.: Identifying Vector Field Singularities Using a Discrete Hodge Decomposition. In: Visualization and Mathematics III (2003)
9. Corpetti, T., Mémin, E., Pérez, P.: Extraction of Singular Points from Dense Motion Fields: An Analytic Approach. J. Math. Imag. and Vis. 19, 175–198 (2003)
10. Yip, C.L., Wong, K.Y.: Identifying centers of circulating and spiraling flow patterns. In: Proc. 18th ICPR, Hong Kong, vol. 1, pp. 769–772 (August 2006)
11. Furht, B., Greenberg, J., Westwater, R.: Motion estimation algorithms for video compression. Kluwer Academic Publishers, Boston (1997)
12. Joint Typhoon Warning Center: Web page (2007), http://www.npmoc.navy.mil/jtwc.html
13. Hong Kong Observatory: Web page (2007), http://www.hko.gov.hk/
14. Lam, C.Y.: Operational Tropical Cyclone forecasting from the perspective of a small weather service. In: Proc. ICSU/WMO Sym. Tropical Cyclone Disasters, pp. 530–541 (October 1992)
Interpolation Between Eigenspaces Using Rotation in Multiple Dimensions

Tomokazu Takahashi¹,², Lina¹, Ichiro Ide¹, Yoshito Mekada³, and Hiroshi Murase¹

¹ Graduate School of Information Science, Nagoya University, Japan
[email protected]
² Japan Society for the Promotion of Science
³ Department of Life System Science and Technology, Chukyo University, Japan
Abstract. We propose a method for interpolation between eigenspaces. Techniques that represent observed patterns as a multivariate normal distribution have actively been developed to make recognition robust to observation noise. In the recognition of images that vary with continuous parameters such as camera angle, one cause of degraded performance is that training images are observed discretely while the parameters vary continuously. The proposed method interpolates between eigenspaces by analogy with the rotation of a hyper-ellipsoid in a high-dimensional space. Experiments using face images captured under various illumination conditions demonstrate the validity and effectiveness of the proposed interpolation method.
1 Introduction
Appearance-based pattern recognition techniques that represent observed patterns as a multivariate normal distribution have actively been developed to make them robust to observation noise. The subspace method [1] and related techniques [2,3] enable us to achieve accurate recognition under conditions where such observation noises as pose and illumination variations exist. Performance, however, degrades when the variations are far larger than expected. On the other hand, the parametric eigenspace method [4] deals with variations using manifolds that are parametric curved lines or surfaces. The manifolds are parameterized by parameters corresponding to controlled pose and illumination conditions in the training phase. This enables object recognition and, at the same time, parameter estimation, i.e., estimating pose and illumination parameters when an input image is given. However, this method is not very tolerant of uncontrolled noises that are not parameterized, e.g., translation, rotation, or motion blurring of input images. Accordingly, Lina et al. have developed a method that embeds multivariate normal density information in each point on the manifolds [5]. This method generates density information as a mean vector and a covariance matrix from training images that are degraded by artificial noises such as translation, rotation, or motion blurring. Each noise is controlled by a noise model and its
parameter. To obtain density information between consecutive poses and generate smooth manifolds, the method interpolates training images degraded by the identical noise model and parameter between consecutive poses. When various other observation noises are considered, however, controlling every noise by a model and parameter is difficult; therefore, establishing correspondences between training images is not realistic. The increase in computational cost with a growing number of training images is also a problem. In light of the above background, we propose a method to smoothly interpolate between eigenspaces by analogy with the rotation of a hyper-ellipsoid in a high-dimensional space. Section 2 introduces the mathematical foundation, the interpolation of a rotation matrix using diagonalization, and its geometrical significance, followed by Section 3, where the proposed interpolation method is described. Section 4 demonstrates the validity and effectiveness of interpolation by the proposed method through experimental results using face images captured under various illumination conditions. Section 5 summarizes the paper.
2 Interpolation of Rotation Matrices in an n-Dimensional Space

2.1 Diagonalization of a Rotation Matrix
An $n \times n$ real matrix ${}_{n}R$ is a rotation matrix when it satisfies the following conditions:

$${}_{n}R\,{}_{n}R^{T} = {}_{n}R^{T}\,{}_{n}R = {}_{n}I, \qquad \det({}_{n}R) = 1, \qquad (1)$$

where $A^{T}$ represents the transpose of a matrix $A$ and ${}_{n}I$ represents the $n \times n$ identity matrix. ${}_{n}R$ can be diagonalized with an $n \times n$ unitary matrix ${}_{n}U$ and a diagonal matrix ${}_{n}D$ including complex elements as

$${}_{n}R = {}_{n}U\,{}_{n}D\,{}_{n}U^{\dagger}. \qquad (2)$$

Here, $A^{\dagger}$ represents the complex conjugate transpose of a matrix $A$. The following equation is obtained for a real number $x$:

$${}_{n}R^{x} = {}_{n}U\,{}_{n}D^{x}\,{}_{n}U^{\dagger}. \qquad (3)$$

${}_{n}R^{x}$ represents an interpolated rotation when $0 \le x \le 1$ and an extrapolated rotation in other cases. This means that once ${}_{n}U$ is calculated, the interpolation and extrapolation of ${}_{n}R$ can easily be obtained.
2.2 Geometrical Significance of Diagonalization
A two-dimensional rotation matrix ${}_{2}R(\theta)$, where $\theta$ $(-\pi < \theta \le \pi)$ is its rotation angle, can be diagonalized as

$${}_{2}R(\theta) = {}_{2}U\,{}_{2}D(\theta)\,{}_{2}U^{\dagger}, \qquad (4)$$

where

$${}_{2}R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \qquad (5)$$

$${}_{2}D(\theta) = \begin{pmatrix} e^{i\theta} & 0 \\ 0 & e^{-i\theta} \end{pmatrix}. \qquad (6)$$
Here, since $e^{i\theta} = \cos\theta + i\sin\theta$ (Euler's formula, so $|e^{i\theta}| = |e^{-i\theta}| = 1$), ${}_{2}R(\theta)^{x} = {}_{2}R(x\theta)$ as well as ${}_{2}D(\theta)^{x} = {}_{2}D(x\theta)$ for a real number $x$. The eigen-equation of ${}_{n}R$ has $m$ sets of complex conjugate roots whose absolute value is 1 when $n = 2m$. Meanwhile, when $n = 2m + 1$, ${}_{n}R$ has the same $m$ sets of complex conjugate roots and 1 as roots. Therefore, ${}_{n}D$ in Equation (2) can be described as

$${}_{n}D(\theta) = \begin{cases} \begin{bmatrix} {}_{2}D(\theta_{1}) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & {}_{2}D(\theta_{m}) \end{bmatrix} & (n = 2m) \\[2ex] \begin{bmatrix} 1 & \cdots & 0 \\ & {}_{2}D(\theta_{1}) & \\ \vdots & & \ddots \\ 0 & \cdots & {}_{2}D(\theta_{m}) \end{bmatrix} & (n = 2m + 1) \end{cases} \qquad (7)$$

by an $m$-dimensional vector $\theta = (\theta_{j} \mid -\pi < \theta_{j} \le \pi,\ j = 1, 2, \cdots, m)$ composed of $m$ rotation angles. Thus Equation (2) can be described as

$${}_{n}R(\theta) = {}_{n}U\,{}_{n}D(\theta)\,{}_{n}U^{\dagger}. \qquad (8)$$

This means that ${}_{n}R^{x}$ in Equation (3) is obtained as ${}_{n}R(x\theta)$ by simply linearly interpolating the vector. Additionally,

$${}_{n}R(\theta) = \left({}_{n}U\,{}_{n}U'^{\dagger}\right)\,{}_{n}R'(\theta)\,\left({}_{n}U'\,{}_{n}U^{\dagger}\right). \qquad (9)$$

Here, when $n = 2m + 1$,

$${}_{n}R'(\theta) = \begin{bmatrix} 1 & \cdots & 0 \\ & {}_{2}R(\theta_{1}) & \\ \vdots & & \ddots \\ 0 & \cdots & {}_{2}R(\theta_{m}) \end{bmatrix}, \qquad (10)$$

$${}_{n}U' = \begin{bmatrix} 1 & \cdots & 0 \\ & {}_{2}U & \\ \vdots & & \ddots \\ 0 & \cdots & {}_{2}U \end{bmatrix}. \qquad (11)$$

Meanwhile, when $n = 2m$, ${}_{n}R'(\theta)$ and ${}_{n}U'$ are obtained by removing the first column and the first row from the corresponding matrices, in the same way as in Equation (7). Because
Fig. 1. Pose interpolation for a four-dimensional cube (along the interpolation parameter x)
the set of all $n$-dimensional rotation matrices forms a group under multiplication, called SO($n$) (the special orthogonal group), ${}_{n}U\,{}_{n}U'^{\dagger}$ can be transformed into a rotation matrix ${}_{n}R''$; therefore, using only real rotation matrices, ${}_{n}R(\theta)$ can be described as

$${}_{n}R(\theta) = {}_{n}R''\,{}_{n}R'(\theta)\,{}_{n}R''^{T}. \qquad (12)$$

Using this expression, computational cost and memory are expected to be reduced, in the sense that real matrices are used instead of complex ones. Note that the interpolated results are identical to those obtained from the simple diagonalization shown in Equation (2). ${}_{n}R'(\theta)$ represents rotations on $m$ independent rotational planes, where no rotation affects the other rotational planes. This means that Equation (12) expresses ${}_{n}R(\theta)$ as a sequence of rotation matrices: a rotation ${}_{n}R''$ unique to ${}_{n}R(\theta)$, a rotation on independent planes ${}_{n}R'(\theta)$, and the inverse of the unique rotation matrix.

2.3 Rotation of a Four-Dimensional Cube
We interpolated poses of a four-dimensional cube using the rotation matrix interpolation method described above. First, two rotation matrices $R_{0}$, $R_{1}$ were randomly chosen as key poses of the cube, and then poses between the key poses were interpolated by Equation (13). For every interpolated pose, we visualized the cube by a wireframe model using perspective projection:

$$R_{x} = R_{0 \to 1}(x\theta)\,R_{0}. \qquad (13)$$

Here, $R_{a \to b}(\theta) = R_{b}R_{a}^{T}$. This equation corresponds to the linear interpolation of rotation matrices. Figure 1 shows the interpolated results. To make it easier to see how the four-dimensional cube rotates, the vertex trajectory is plotted with dots.
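As an illustration of Equations (2), (3) and (13), a minimal NumPy sketch of fractional rotation interpolation might look as follows. It assumes the relative rotation has distinct eigenvalues (so that its eigenvector matrix is unitary) and none equal to −1; function and variable names are our own.

```python
import numpy as np

def interpolate_rotation(R0, R1, x):
    """Interpolate between two n-D rotation matrices: R_x = R_{0->1}(x*theta) R0.

    The relative rotation R_{0->1} = R1 R0^T is raised to the fractional power x
    via its complex eigendecomposition (cf. Eqs. 2-3)."""
    R_rel = R1 @ R0.T                        # relative rotation R_{0->1}
    w, U = np.linalg.eig(R_rel)              # complex eigenvalues on the unit circle
    Dx = np.diag(w ** x)                     # fractional power of the diagonal part
    R_rel_x = (U @ Dx @ np.conj(U.T)).real   # back to a real rotation matrix
    return R_rel_x @ R0

# usage sketch: interpolate halfway between the identity and a random 4-D rotation
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
if np.linalg.det(Q) < 0:                     # force det = +1
    Q[:, 0] = -Q[:, 0]
R_half = interpolate_rotation(np.eye(4), Q, 0.5)
```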
3 Interpolation of Eigenspaces Using Rotation of a Hyper-Ellipsoid

3.1 Approach
The proposed method interpolates eigenspaces by considering an eigenspace as a multivariate normal density. The iso-density points of a multivariate normal density are known to form a hyper-ellipsoid surface. Eigenvectors and eigenvalues
Fig. 2. Interpolation of hyper-ellipsoids in the feature space: the densities $N_{0}(\mu_{0}, \Sigma_{0})$ and $N_{1}(\mu_{1}, \Sigma_{1})$ with axes $e_{0j}$ and $e_{1j}$, and the interpolated density $N_{x}(\mu_{x}, \Sigma_{x})$ with axes $e_{xj}$
can be considered the directions of the hyper-ellipsoid's axes and their lengths, respectively. We consider that the eigenspaces between two eigenspaces can be interpolated by rotating a hyper-ellipsoid while expanding or contracting the length of each of its axes (Figure 2). The interpolation of ellipsoids has the following two problems. First, the correspondence of one ellipsoid's axes to another ellipsoid's axes cannot be determined uniquely. Second, the rotation angle cannot be determined uniquely because ellipsoids are symmetrical. Because of these problems, ellipsoids in general cannot be interpolated uniquely from two ellipsoids. The following two conditions are imposed in the proposed method to obtain a unique interpolation.

[Condition 1] Minimize the interpolated ellipsoid's volume variations.
[Condition 2] Minimize the interpolated ellipsoid's rotation angle variations.

3.2 Algorithm
When two multivariate normal densities $N_{0}(\mu_{0}, \Sigma_{0})$ and $N_{1}(\mu_{1}, \Sigma_{1})$ are given, an interpolated or extrapolated density $N_{x}(\mu_{x}, \Sigma_{x})$ for a real number $x$ is calculated by the following procedure. Here, $\mu$ and $\Sigma$ represent an $n$-dimensional mean vector and an $n \times n$ covariance matrix, respectively.

Interpolation of mean vectors: $\mu_{x}$ is obtained by simple linear interpolation using the following equation. This corresponds to interpolation of the ellipsoids' centers:

$$\mu_{x} = (1 - x)\mu_{0} + x\mu_{1}. \qquad (14)$$

Interpolation of covariance matrices: The eigenvectors and eigenvalues of each covariance matrix carry information about the pose of the ellipsoid and the lengths of its axes, respectively. First, $n \times n$ matrices $E_{0}$ and $E_{1}$ are formed by aligning the eigenvectors $e_{0j}$ and $e_{1j}$ $(j = 1, 2, \cdots, n)$ of $\Sigma_{0}$ and $\Sigma_{1}$. At the same time, $n$-dimensional vectors $\lambda_{0}$, $\lambda_{1}$ are formed by aligning the eigenvalues $\lambda_{0j}$, $\lambda_{1j}$ $(j = 1, 2, \cdots, n)$.
[Step 1] To obtain the correspondences of axes between the ellipsoids based on [Condition 1], $E_{0}'$ and $E_{1}'$ are formed by sorting the eigenvectors in $E_{0}$ and $E_{1}$ in the order of their eigenvalues. $\lambda_{0}'$ and $\lambda_{1}'$ are formed from $\lambda_{0}$, $\lambda_{1}$ in the same way.
[Step 2] Based on [Condition 2], $e_{1j}$ $(j = 1, 2, \cdots, n)$ is inverted if $e_{0j}^{T}e_{1j} < 0$, so that the angle between corresponding axes is less than or equal to $\pi/2$.
[Step 3] $e_{0n}$ is inverted if $\det(E_{0}') = -1$, and likewise $e_{1n}$ is inverted if $\det(E_{1}') = -1$, so that $E_{0}'$ and $E_{1}'$ satisfy Equation (1).

The eigenvalues $\lambda_{xj}$ of $\Sigma_{x}$ are calculated by

$$\lambda_{xj} = \left((1 - x)\sqrt{\lambda_{0j}} + x\sqrt{\lambda_{1j}}\right)^{2}, \qquad (15)$$

and its eigenvector matrix $E_{x}$ is calculated by

$$E_{x} = R_{0 \to 1}(x\theta)\,E_{0}'. \qquad (16)$$

Here, $R_{0 \to 1}(\theta) = E_{1}'E_{0}'^{T}$. Therefore, $\Sigma_{x}$ is calculated by

$$\Sigma_{x} = E_{x}\Lambda_{x}E_{x}^{T}, \qquad (17)$$

where $\Lambda_{x}$ represents a diagonal matrix that has $\lambda_{xj}$ $(j = 1, 2, \cdots, n)$ as its diagonal elements.
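A compact NumPy sketch of the whole interpolation procedure (Equations 14–17 and Steps 1–3) is given below. It is our own illustration under the assumption of distinct eigenvalues; the fractional rotation is computed via the complex eigendecomposition of Section 2, and the function name is ours.

```python
import numpy as np

def interpolate_gaussian(mu0, Sigma0, mu1, Sigma1, x):
    """Interpolate two multivariate normal densities as rotating hyper-ellipsoids."""
    mu_x = (1 - x) * mu0 + x * mu1                       # Eq. (14)

    # Eigendecompositions, sorted by decreasing eigenvalue (Step 1)
    l0, E0 = np.linalg.eigh(Sigma0)
    l1, E1 = np.linalg.eigh(Sigma1)
    o0, o1 = np.argsort(l0)[::-1], np.argsort(l1)[::-1]
    l0, E0 = l0[o0], E0[:, o0]
    l1, E1 = l1[o1], E1[:, o1]

    # Step 2: align axis directions so corresponding angles are <= pi/2
    flip = np.sum(E0 * E1, axis=0) < 0
    E1[:, flip] *= -1

    # Step 3: make both eigenvector matrices proper rotations (det = +1)
    if np.linalg.det(E0) < 0: E0[:, -1] *= -1
    if np.linalg.det(E1) < 0: E1[:, -1] *= -1

    # Eq. (15): interpolate axis lengths; Eq. (16): fractional rotation of axes
    lx = ((1 - x) * np.sqrt(l0) + x * np.sqrt(l1)) ** 2
    w, U = np.linalg.eig(E1 @ E0.T)
    R_x = (U @ np.diag(w ** x) @ np.conj(U.T)).real
    Ex = R_x @ E0

    Sigma_x = Ex @ np.diag(lx) @ Ex.T                    # Eq. (17)
    return mu_x, Sigma_x
```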
4 Experiments Using Actual Images
To demonstrate the effectiveness and validity of the proposed interpolation method, we conducted face recognition experiments based on a subspace method. Training images were captured from two different angles under various illumination conditions, whereas input images were captured only from an intermediate angle. In the training phase, a subspace for each camera angle was constructed from images captured under different illumination conditions. We compared the recognition performance of the two original subspaces with that of the interpolated subspaces.

4.1 Conditions
In the experiments, we used the face images of ten persons captured from three different angles (two for training and one for testing) under 51 different illumination conditions. Figures 3 and 4 show examples of the persons' images and of images captured under various illumination conditions. In Figure 5, images from camera angles $c_{0}$ and $c_{1}$ were used for training and $c_{0.5}$ for testing. The images were chosen from the face image dataset "Yale Face Database B" [6]. We represented each image as a low-dimensional vector in a 30-dimensional feature space using a dimension reduction technique based on PCA. In the training phase, for each person $p$, autocorrelation matrices $\Sigma_{0}^{(p)}$ and $\Sigma_{1}^{(p)}$ were calculated from the images obtained from angles $c_{0}$ and $c_{1}$, and then matrices $E_{0}^{(p)}$ and $E_{1}^{(p)}$ consisting of the eigenvectors of the autocorrelation matrices were obtained.
Fig. 3. Sample images of ten persons’ faces used in experiment
Fig. 4. Sample images captured in various illumination conditions used in experiment
In the recognition phase, the similarity between the subspaces and a test image captured from $c_{0.5}$ was measured, and the recognition result $\hat{p}$ giving the maximum similarity was obtained. The similarity between an input vector $z$ and the $K(\le 30)$-dimensional subspace of $E_{x}^{(p)}$ is calculated by

$$S_{x}^{(p)}(z) = \sum_{k=1}^{K} \langle e_{x,k}^{(p)}, z \rangle^{2}, \qquad (18)$$

where $E_{x}^{(p)}$ $(0 \le x \le 1)$ is the interpolated eigenspace between $E_{0}^{(p)}$ and $E_{1}^{(p)}$, and $\langle \cdot, \cdot \rangle$ represents the inner product of two vectors. $E_{x}^{(p)}$ is calculated by Equation (16). The proposed method, which uses the subspaces of the interpolated eigenspaces, obtains $\hat{p}$ by

$$\hat{p} = \arg\max_{p} \left( \max_{0 \le x \le 1} S_{x}^{(p)}(z) \right). \qquad (19)$$

On the other hand, as a comparison method, the recognition method with the subspaces of $E_{0}^{(p)}$ and $E_{1}^{(p)}$ obtains $\hat{p}$ by

$$\hat{p} = \arg\max_{p} \left( \max\left( S_{0}^{(p)}(z),\ S_{1}^{(p)}(z) \right) \right). \qquad (20)$$
We defined K = 5 in Equation 18 empirically through preliminary experiments.
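The recognition rule of Equations (18) and (19) can be sketched as follows, sampling x on a regular grid in [0, 1]. This is our own illustration; the per-person dictionaries, the sampling density and the helper for the fractional rotation are assumptions, and the eigenvector matrices are assumed to be proper rotations sorted by decreasing eigenvalue, as arranged in Section 3.2.

```python
import numpy as np

def fractional_rotation(R_rel, x):
    # Fractional power of a rotation via its complex eigendecomposition (Eqs. 2-3).
    w, U = np.linalg.eig(R_rel)
    return (U @ np.diag(w ** x) @ np.conj(U.T)).real

def classify(z, E0_by_person, E1_by_person, K=5, num_x=11):
    """Return the person p maximizing the similarity of Eq. (18) over sampled x (Eq. 19)."""
    best_p, best_s = None, -np.inf
    for p, E0 in E0_by_person.items():
        E1 = E1_by_person[p]
        for x in np.linspace(0.0, 1.0, num_x):
            Ex = fractional_rotation(E1 @ E0.T, x) @ E0   # Eq. (16)
            s = float(np.sum((Ex[:, :K].T @ z) ** 2))     # Eq. (18)
            if s > best_s:
                best_p, best_s = p, s
    return best_p
```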
Fig. 5. Sample images captured from the three camera angles ($c_{0}$, $c_{0.5}$, $c_{1}$) used in the experiment
Table 1. Comparison of recognition rates

Recognition Method                                    Recognition Rate [%]
Interpolated subspaces by proposed method (Eq. 19)    71.8
Two subspaces (Eq. 20)                                62.6
Fig. 6. Bhattacharyya distances between the actual normal density and the interpolated densities (Bhattacharyya distance on the y-axis, x from 0.0 to 1.0)
4.2 Results and Discussion
Table 1 compares the recognition rates of the two methods described in Section 4.1. From this result, we confirmed the effectiveness of the proposed method for face recognition. To verify the validity of interpolation by the proposed method, Figure 6 shows the Bhattacharyya distances between the normal density obtained from $c_{0.5}$ and the interpolated normal densities from $x = 0$ to $x = 1$ for one person. Since the distance becomes smaller around $x = 0.5$, the validity of interpolation by the proposed method can be observed. In addition, Figure 7 visualizes the interpolated eigenvectors from $x = 0$ to $x = 1$ for the same person. We can see that the direction of each eigenvector changes smoothly under the high-dimensional rotation.
5 Summary
In this paper, we proposed a method for interpolation between eigenspaces. Experiments on face recognition based on the subspace method demonstrated the effectiveness and validity of the proposed method. Future work includes extending the method to higher-order interpolation, such as cubic splines, and recognition experiments using larger datasets.
Fig. 7. Interpolated eigenvectors (each dimension visualized for x from 0.0 to 1.0)
References

1. Watanabe, S., Pakvasa, N.: Subspace Method of Pattern Recognition. In: Proc. 1st Int. J. Conf. on Pattern Recognition, pp. 25–32 (1971)
2. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 586–591. IEEE Computer Society Press, Los Alamitos (1991)
3. Moghaddam, B.: Principal Manifolds and Bayesian Subspaces for Visual Recognition. In: Proc. Int. Conf. on Computer Vision, pp. 1131–1136 (1999)
4. Murase, H., Nayar, S.K.: Illumination Planning for Object Recognition Using Parametric Eigenspaces. IEEE Trans. Pattern Analysis and Machine Intelligence 16(12), 1218–1227 (1994)
5. Lina, Takahashi, T., Ide, I., Murase, H.: Appearance Manifold with Embedded Covariance Matrix for Robust 3-D Object Recognition. In: Proc. 10th IAPR Conf. on Machine Vision Applications, pp. 504–507 (2007)
6. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans. Pattern Analysis and Machine Intelligence 23(6), 643–660 (2001)
Conic Fitting Using the Geometric Distance

Peter Sturm and Pau Gargallo

INRIA Rhône-Alpes and Laboratoire Jean Kuntzmann, France
Abstract. We consider the problem of fitting a conic to a set of 2D points. It is commonly agreed that minimizing geometrical error, i.e. the sum of squared distances between the points and the conic, is better than using an algebraic error measure. However, most existing methods rely on algebraic error measures. This is usually motivated by the fact that point-to-conic distances are difficult to compute and by the belief that non-linear optimization of conics is computationally very expensive. In this paper, we describe a parameterization for the conic fitting problem that allows us to circumvent the difficulty of computing point-to-conic distances, and we show how to perform the non-linear optimization process efficiently.
1 Introduction
Fitting of ellipses, or conics in general, to edge or other data is a basic task in computer vision and image processing. Most existing works concentrate on solving the problem using linear least squares formulations [3,4,16]. Correcting the bias introduced by the linear problem formulation is often attempted by solving iteratively reweighted linear least squares problems [8,9,10,12,16], which is equivalent to non-linear optimization. In this paper, we propose a non-linear optimization approach for fitting a conic to 2D points, based on minimizing the sum of squared geometric distances between the points and the conic. The arguments why most of the algorithms proposed in the literature do not use the sum of squared geometrical distances as an explicit cost function are:
– non-linear optimization is required, thus the algorithms will be much slower.
– computation of a point's distance to a conic requires the solution of a 4th order polynomial [13,18], which is time-consuming and does not allow analytical derivation (for optimization methods requiring derivatives), thus leading to the use of numerical differentiation, which is again time-consuming.
The main goal of this paper is to partly contradict these arguments. This is mainly achieved by parameterizing the problem in a way that allows us to replace point-to-conic distance computations by point-to-point distance computations, thus avoiding the solution of 4th order polynomials. The problem formulation remains non-linear though. However, we show how to solve our non-linear optimization problem efficiently, in a manner routinely used in bundle adjustment.
2 Problem Formulation

2.1 Cost Function

Let $q_p = (x_p, y_p, 1)^T$, $p = 1 \dots n$, be the homogeneous coordinates of $n$ measured 2D points. The aim is to fit a conic to these points. Many methods have been proposed for this task, often based on minimizing the sum of algebraic distances [3,4,12] (here, $C$ is the usual symmetric $3 \times 3$ matrix representing the conic):

$$\sum_{p=1}^{n} \left( C_{11}x_p^2 + C_{22}y_p^2 + 2C_{12}x_p y_p + 2C_{13}x_p + 2C_{23}y_p + C_{33} \right)^2$$
This is a linear least squares problem, requiring some constraint on the unknowns in order to avoid the trivial solution. For example, Bookstein proposes the constraint $C_{11}^2 + 2C_{12}^2 + C_{22}^2 = 1$, which makes the solution invariant to Euclidean transformations of the data [3]. Fitzgibbon, Pilu and Fisher impose $4(C_{11}C_{22} - C_{12}^2) = 1$ in order to guarantee that the fitted conic will be an ellipse [4]. In both cases, the constrained linear least squares problem can be solved via a $3 \times 3$ symmetric generalized eigenvalue problem. The cost function we want to minimize (cf. Section 5) is

$$\sum_{p=1}^{n} \mathrm{dist}(q_p, C)^2 \qquad (1)$$

where $\mathrm{dist}(q, C)$ is the geometric distance between a point $q$ and a conic $C$, i.e. the distance between $q$ and the point on $C$ that is closest to $q$. Determining $\mathrm{dist}(q, C)$ in general requires computing the roots of a 4th order polynomial.

2.2 Transformations and Types of Conics
Let $P$ be a projective transformation acting on 2D points (i.e. $P$ is a $3 \times 3$ matrix). A conic $C$ is transformed by $P$ according to ($\sim$ means equality up to scale):

$$C' \sim P^{-T} C P^{-1} \qquad (2)$$
In this work we are only interested in real conics, i.e. conics that do not contain only imaginary points. These can be characterized using the eigendecomposition of the conic's $3 \times 3$ matrix: imaginary conics are exactly those whose eigenvalues all have the same sign [2]. We are thus only interested in conics with eigenvalues of different signs. This constraint will be explicitly imposed, as shown in the following section. In addition, we are only interested in proper conics, i.e. non-degenerate ones, with only non-zero eigenvalues. Concerning the different types of real conics, we distinguish the projective and affine classes:
– all proper real conics are projectively equivalent, i.e. for any two conics, there exists at least one projective transformation relating them according to (2).
– affine classes: ellipses, hyperbolae, parabolae.
In the following, we formulate the optimization problem for general conics, i.e. the corresponding algorithm may find the correct solution even if the initial guess is of the "wrong type". Specialization of the method to the 3 affine cases of interest is relatively straightforward; details are given in [15].
3 Minimizing the Geometrical Distance
In this section, we describe our method for minimizing the geometrical-distance-based cost function. The key to the method is the parameterization of the problem. In the next paragraph, we first describe the parameterization, before showing that it indeed allows us to minimize geometrical distance. After this, we explain how to initialize the parameters and describe how to solve the non-linear optimization problem in a computationally efficient way.

3.1 Parameterization
The parameterization explained in the following is illustrated in Figure 1. For each of the $n$ measured points $q_p$, we parameterize a point $\hat{q}_p$, such that all $\hat{q}_p$ lie on a conic. The simplest way to do so is to choose the unit circle as support, in which case we may parameterize each $\hat{q}_p$ by an angle $\alpha_p$:

$$\hat{q}_p = \begin{pmatrix} \cos\alpha_p \\ \sin\alpha_p \\ 1 \end{pmatrix}$$

Furthermore, we include in our parameterization a 2D projective transformation, or homography, $P$. We then want to solve the following optimization problem:

$$\min_{P,\ \alpha_1 \cdots \alpha_n} \sum_{p=1}^{n} \mathrm{dist}(q_p, P\hat{q}_p)^2 \qquad (3)$$
In Section 3.2, we show that this parameterization indeed allows us to minimize the desired cost function based on point-to-conic distances. At first sight, this parameterization has the drawback of a much larger number of parameters than necessary: $n + 8$ (the $n$ angles $\alpha_p$ and 8 parameters for $P$) instead of the 5 that would suffice to parameterize a conic. We will show however, in Section 3.4, that the optimization can nevertheless be carried out in a computationally efficient manner, due to the sparsity of the normal equations associated with our least squares cost function. Up to now, we have considered $P$ as a general 2D homography, which is clearly an overparameterization. We actually parameterize $P$ minimally: $P \sim R\,\Sigma = R\,\mathrm{diag}(a, b, c)$, where $R$ is an orthonormal matrix and $a$, $b$ and $c$ are scalars. We show in the following section that this parameterization is sufficient, i.e. it allows us to express all proper real conics.
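To make the parameterization concrete, the following NumPy sketch builds the conic reached by P ∼ R diag(a, b, c) (cf. the expression for C given in Section 3.2) and evaluates the point-to-model-point residuals of cost function (3). Function names are ours; this is an illustration, not the authors' code.

```python
import numpy as np

def conic_matrix(R, a, b, c):
    # Conic reached by P ~ R diag(a, b, c): C ~ R diag(1/a^2, 1/b^2, -1/c^2) R^T.
    return R @ np.diag([1.0 / a**2, 1.0 / b**2, -1.0 / c**2]) @ R.T

def model_points(R, a, b, c, alphas):
    # Points P q_hat(alpha_p) on the conic, with q_hat = (cos a, sin a, 1)^T.
    P = R @ np.diag([a, b, c])
    q_hat = np.vstack([np.cos(alphas), np.sin(alphas), np.ones_like(alphas)])
    pts = P @ q_hat
    return pts[:2] / pts[2]                 # inhomogeneous 2D coordinates

def squared_residuals(points_xy, R, a, b, c, alphas):
    # Squared point-to-model-point distances: the terms of cost function (3).
    diff = points_xy - model_points(R, a, b, c, alphas)
    return np.sum(diff ** 2, axis=0)
```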
Fig. 1. Illustration of our parameterization
We may thus parameterize $P$ using 6 parameters (3 for $R$ and the 3 scalars). Since the scalars are only relevant up to a global scale factor, we may fix one and thus reduce the number of parameters for $P$ to the minimum of 5. More details on the parameterization of the orthonormal matrix $R$ are given in Section 3.4.

3.2 Completeness of the Parameterization
We first show that the above parameterization allows us to "reach" all proper real conics, and then that minimizing the associated cost function (3) is equivalent to minimizing the desired cost function (1). For any choice of $R$, $a$, $b$ and $c$ ($a, b, c \neq 0$), the associated homography will map the points $\hat{q}_p$ to a set of points that lie on a conic $C$. This is obvious since point–conic incidence is invariant to projective transformations and since the $\hat{q}_p$ lie on a conic at the outset (the unit circle). The resulting conic $C$ is given by:

$$C \sim P^{-T} \underbrace{\begin{pmatrix} 1 & & \\ & 1 & \\ & & -1 \end{pmatrix}}_{\text{unit circle}} P^{-1} \sim R \begin{pmatrix} 1/a^2 & & \\ & 1/b^2 & \\ & & -1/c^2 \end{pmatrix} R^T$$

We now show that any proper real conic $C'$ can be "reached" by our parameterization, i.e. that there exist an orthonormal matrix $R$ and scalars $a$, $b$ and $c$ such that $C \sim C'$. To do so, we consider the eigendecomposition of $C'$:

$$C' = R'^T \begin{pmatrix} a' & & \\ & b' & \\ & & c' \end{pmatrix} R'$$

where $R'$ is an orthonormal matrix whose rows are eigenvectors of $C'$, and $a'$, $b'$ and $c'$ are its eigenvalues (any symmetric matrix may be decomposed in this way). The condition for $C'$ being a proper conic is that its three eigenvalues
are non-zero, and the condition that it is a proper and real conic is that one eigenvalue’s sign is opposed to that of the two others.
If, for example, $c'$ is this eigenvalue, then with $R = R'$, $a = 1/\sqrt{|a'|}$, $b = 1/\sqrt{|b'|}$ and $c = 1/\sqrt{|c'|}$, we obviously have $C \sim C'$. If the "individual" eigenvalue is $a'$ instead, the following solution holds:

$$a = \frac{1}{\sqrt{|c'|}}, \quad b = \frac{1}{\sqrt{|b'|}}, \quad c = \frac{1}{\sqrt{|a'|}}, \quad R = \begin{pmatrix} & & 1 \\ & -1 & \\ 1 & & \end{pmatrix} R'$$

and similarly for $b'$ being the "individual" eigenvalue. Hence, our parameterization of a homography via an orthonormal matrix and three scalars is complete. We now show that the associated cost function (3) is equivalent to the desired cost function (1), i.e. that the global minima of both cost functions correspond to the same conic (if a unique global minimum exists, of course). Let $C'$ be the global minimum of cost function (1). For any measured point $q_p$, let $\hat{v}_p$ be the closest point on $C'$. If more than one point on $C'$ is equidistant from $q_p$, pick any one of them. We have shown above that there exist $R$, $a$, $b$ and $c$ such that $P$ maps the unit circle to $C'$. Let $\hat{w}_p = P^{-1}\hat{v}_p$. Since $\hat{v}_p$ lies on $C'$, it follows that $\hat{w}_p$ lies on the unit circle. Hence, there exists an angle $\alpha_p$ such that $\hat{w}_p \sim (\cos\alpha_p, \sin\alpha_p, 1)^T$. Consequently, there exists a set of parameters $R, a, b, c, \alpha_1, \ldots, \alpha_n$ for which the value of cost function (3) is the same as that of the global minimum of (1). Hence, our parameterization and cost function are equivalent to minimizing the desired cost function based on geometrical distance between points and conics.

3.3 Initialization
Minimizing the cost function (3) requires a non-linear, thus iterative, optimization method. Initial values for the parameters may be taken from the result of any other (linear) method. Let $C$ be the initial guess for the conic. The initial values for $R$, $a$, $b$ and $c$ (thus, for $P$) are obtained in the way outlined in the previous section, based on the eigendecomposition of $C$. As for the angles $\alpha_p$, we determine the closest points on $C$ to the measured $q_p$, by solving the 4th order polynomial mentioned in the introduction or an equivalent problem (see [15] for details). We then map these to the unit circle using $P$ and extract the angles $\alpha_p$, as described in the previous section.

3.4 Optimization
We now describe how we optimize the cost function (3). Any non-linear optimization method may be used, but since we deal with a non-linear least squares problem, we use the Levenberg-Marquardt method [7]. In the following, we describe how we deal with the rotational components of our parameterization (the orthonormal matrix $R$ and the angles $\alpha_p$); we then explicitly give the Jacobian of the cost function, and show how the sparsity of the normal equations may be used to solve them efficiently.
Update of Rotational Parameters. To avoid singularities in the parameterization of the orthonormal matrix $R$, we estimate, as is typical practice e.g. in photogrammetry [1], a first order approximation of an orthonormal "update" matrix at each iteration, as follows:

1. Let $R_0$ be the estimate of $R$ after the previous iteration.
2. Let $R_1 = R_0\Delta$ be the estimate to be obtained after the current iteration. Here, we only allow the update matrix $\Delta$ to vary, i.e. $R_0$ is kept fixed. Using the Euler angles $\alpha, \beta, \gamma$, we may parameterize $\Delta$ as follows:
$$\Delta = \begin{pmatrix} \cos\beta\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma \\ \cos\beta\sin\gamma & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma \\ -\sin\beta & \sin\alpha\cos\beta & \cos\alpha\cos\beta \end{pmatrix} \qquad (4)$$
3. The update angles $\alpha$, $\beta$ and $\gamma$ will usually be very small, i.e. we have $\cos\alpha \approx 1$ and $\sin\alpha \approx \alpha$. Instead of optimizing directly over the angles, we thus use the first order approximation $\Delta'$ of $\Delta$:
$$\Delta' = \begin{pmatrix} 1 & -\gamma & \beta \\ \gamma & 1 & -\alpha \\ -\beta & \alpha & 1 \end{pmatrix}$$
4. In the cost function (3), we thus replace $R$ by $R_0\Delta'$, and estimate $\alpha$, $\beta$ and $\gamma$. At the end of the iteration, we update the estimate of $R$. In order to keep $R$ orthonormal, we of course do not update it using the first order approximation, i.e. as $R_1 = R_0\Delta'$. Instead, we compute an exact orthonormal update matrix $\Delta$ using equation (4) and the estimated angles, and update the rotation via $R_1 = R_0\Delta$.
5. It is important to note that at the next iteration, $R_1$ will in turn be kept fixed, and new (small) update angles will be estimated. Thus, the initial values of the update angles at each iteration are always zero, which greatly simplifies the analytical computation of the cost function's Jacobian.

The points $\hat{q}_p$ on the unit circle are updated in a similar manner, using a 1D rotation matrix each for the update:

$$\Psi_p = \begin{pmatrix} \cos\rho_p & -\sin\rho_p & 0 \\ \sin\rho_p & \cos\rho_p & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

and its first order approximation:

$$\Psi_p' = \begin{pmatrix} 1 & -\rho_p & 0 \\ \rho_p & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Points are thus updated as follows: $\hat{q}_p \rightarrow \Psi_p\hat{q}_p$ (where, as for $R$, the update angles are estimated using the first order approximations $\Psi_p'$).
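For illustration, the exact update matrix of Equation (4) and its first order approximation can be written as follows; the function names are ours and this is only a sketch of the update scheme described above.

```python
import numpy as np

def exact_update(alpha, beta, gamma):
    # Exact orthonormal update matrix Delta of Eq. (4), from Euler angles.
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    return np.array([
        [cb * cg, sa * sb * cg - ca * sg, ca * sb * cg + sa * sg],
        [cb * sg, sa * sb * sg + ca * cg, ca * sb * sg - sa * cg],
        [-sb,     sa * cb,                ca * cb],
    ])

def first_order_update(alpha, beta, gamma):
    # First order approximation Delta' used inside each iteration.
    return np.array([
        [1.0,   -gamma,  beta],
        [gamma,  1.0,   -alpha],
        [-beta,  alpha,  1.0],
    ])

# After an iteration has estimated small angles, the rotation is re-orthonormalized:
# R1 = R0 @ exact_update(alpha, beta, gamma)
```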
Cost Function and Jacobian. Let the measured points be given by $q_p = (x_p, y_p, 1)^T$, and the current estimate of $\hat{q}_p$ by $\hat{q}_p = (\hat{x}_p, \hat{y}_p, 1)^T$. At each iteration, we have to solve the problem:

$$\min_{a,b,c,\alpha,\beta,\gamma,\rho_1\cdots\rho_n} \sum_{p=1}^{n} d^2\!\left(q_p,\ R\Delta'\Sigma\Psi_p'\hat{q}_p\right) \qquad (5)$$

This has a least squares form, i.e. we may formulate the cost function using $2n$ residual functions:

$$\sum_{j=1}^{2n} r_j^2 \quad \text{with} \quad r_{2i-1} = x_i - \frac{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_1}{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_3} \quad \text{and} \quad r_{2i} = y_i - \frac{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_2}{(R\Delta'\Sigma\Psi_i'\hat{q}_i)_3}$$

As for the Jacobian of the cost function, it is defined as:

$$J = \begin{pmatrix} \frac{\partial r_1}{\partial a} & \frac{\partial r_1}{\partial b} & \frac{\partial r_1}{\partial c} & \frac{\partial r_1}{\partial \alpha} & \frac{\partial r_1}{\partial \beta} & \frac{\partial r_1}{\partial \gamma} & \frac{\partial r_1}{\partial \rho_1} & \frac{\partial r_1}{\partial \rho_2} & \cdots & \frac{\partial r_1}{\partial \rho_n} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{\partial r_{2n}}{\partial a} & \frac{\partial r_{2n}}{\partial b} & \frac{\partial r_{2n}}{\partial c} & \frac{\partial r_{2n}}{\partial \alpha} & \frac{\partial r_{2n}}{\partial \beta} & \frac{\partial r_{2n}}{\partial \gamma} & \frac{\partial r_{2n}}{\partial \rho_1} & \frac{\partial r_{2n}}{\partial \rho_2} & \cdots & \frac{\partial r_{2n}}{\partial \rho_n} \end{pmatrix}$$

It can be computed analytically, as follows. Due to the fact that before each iteration the update angles $\alpha, \beta, \gamma, \rho_1, \ldots, \rho_n$ are all zero, the entries of the Jacobian, evaluated at each iteration, have the following very simple form:

$$\begin{pmatrix}
\hat{x}_1 u_{11} & \hat{y}_1 u_{12} & u_{13} & (b\hat{y}_1 u_{13} - cu_{12}) & (cu_{11} - a\hat{x}_1 u_{13}) & (a\hat{x}_1 u_{12} - b\hat{y}_1 u_{11}) & u_{14} & \cdots & 0 \\
\hat{x}_1 v_{11} & \hat{y}_1 v_{12} & v_{13} & (b\hat{y}_1 v_{13} - cv_{12}) & (cv_{11} - a\hat{x}_1 v_{13}) & (a\hat{x}_1 v_{12} - b\hat{y}_1 v_{11}) & v_{14} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\hat{x}_n u_{n1} & \hat{y}_n u_{n2} & u_{n3} & (b\hat{y}_n u_{n3} - cu_{n2}) & (cu_{n1} - a\hat{x}_n u_{n3}) & (a\hat{x}_n u_{n2} - b\hat{y}_n u_{n1}) & 0 & \cdots & u_{n4} \\
\hat{x}_n v_{n1} & \hat{y}_n v_{n2} & v_{n3} & (b\hat{y}_n v_{n3} - cv_{n2}) & (cv_{n1} - a\hat{x}_n v_{n3}) & (a\hat{x}_n v_{n2} - b\hat{y}_n v_{n1}) & 0 & \cdots & v_{n4}
\end{pmatrix}$$

with

$$u_i = s_i^2 \begin{pmatrix} bR_{23}\hat{y}_i - cR_{22} \\ cR_{21} - aR_{23}\hat{x}_i \\ aR_{22}\hat{x}_i - bR_{21}\hat{y}_i \\ c(bR_{21}\hat{x}_i + aR_{22}\hat{y}_i) - abR_{23} \end{pmatrix}, \qquad v_i = s_i^2 \begin{pmatrix} cR_{12} - bR_{13}\hat{y}_i \\ aR_{13}\hat{x}_i - cR_{11} \\ bR_{11}\hat{y}_i - aR_{12}\hat{x}_i \\ abR_{13} - c(bR_{11}\hat{x}_i + aR_{12}\hat{y}_i) \end{pmatrix}$$

and $s_i = (aR_{31}\hat{x}_i + bR_{32}\hat{y}_i + cR_{33})^{-1}$. As for the residual functions themselves, with $\alpha, \beta, \gamma, \rho_1, \ldots, \rho_n$ being zero before each iteration, they evaluate to:

$$r_{2i-1} = x_i - s_i(aR_{11}\hat{x}_i + bR_{12}\hat{y}_i + cR_{13}), \qquad r_{2i} = y_i - s_i(aR_{21}\hat{x}_i + bR_{22}\hat{y}_i + cR_{23})$$

With the explicit expressions for the Jacobian and the residual functions, we have given all the ingredients required to optimize the cost function using e.g. the Levenberg-Marquardt or Gauss-Newton methods. In the following paragraph, we show how to benefit from the sparsity of the Jacobian (nearly all derivatives with respect to the $\rho_p$ are zero).
Hessian. The basic approximation to the Hessian matrix used in least squares optimizers such as Gauss-Newton is $H = J^T J$. Each iteration of such a non-linear method comes down to solving a linear equation system of the following form:

$$\begin{pmatrix} A_{6\times 6} & B_{6\times n} \\ B^T & D_{n\times n} \end{pmatrix} \begin{pmatrix} x_6 \\ y_n \end{pmatrix} = \begin{pmatrix} a_6 \\ b_n \end{pmatrix}$$

where $D$ is, due to the sparsity of the Jacobian (see the previous paragraph), a diagonal matrix. The right-hand side is usually the negative gradient of the cost function, which for least squares problems can also be computed as $-J^T r$, $r$ being the vector of the $2n$ residuals defined above. As suggested in [14], we may reduce this $(6+n) \times (6+n)$ problem to a $6 \times 6$ problem, as follows:

1. The lower set of equations gives:
$$y = D^{-1}\left(b - B^T x\right) \qquad (6)$$
2. Replacing this in the upper set of equations, we get:
$$\left(A - BD^{-1}B^T\right) x = a - BD^{-1}b \qquad (7)$$
3. Since $D$ is diagonal, its inversion is trivial, and thus the coefficients of equation system (7) may be computed efficiently (in time and memory).
4. Once $x$ is computed by solving the $6 \times 6$ system (7), $y$ is obtained using (6).

Hence, the most complex individual operation at each iteration is the same as that in iterative methods minimizing algebraic distance: inverting a $6 \times 6$ symmetric matrix or, equivalently, solving a linear equation system of the same size. In practice, we reduce the original problem to $(5+n) \times (5+n)$, respectively $5 \times 5$, by fixing one of the scalars $a, b, c$ (the one with the largest absolute value). However, most of the computation time is actually spent on computing the partial derivatives required to compute the coefficients of the above equation systems. Overall, the computational complexity is linear in the number of data points. A detailed complexity analysis is given in [15]. With a non-optimized implementation, measured computation times for one iteration were about 10 times those required for the standard linear method (Linear in the next section). This may seem a lot, but note that e.g. with 200 data points, one iteration requires less than 2 milliseconds on a 2.8 GHz Pentium 4.
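A minimal NumPy sketch of the reduction in Equations (6)–(7) is given below. The function and variable names are ours; D is passed as a vector of its diagonal entries, since, as noted in the text, its inversion is trivial.

```python
import numpy as np

def solve_sparse_normal_equations(A, B, D_diag, a, b):
    """Schur-complement solve of the block system
       [A   B ] [x]   [a]
       [B^T D ] [y] = [b],  with D diagonal (one entry per point parameter rho_p)."""
    Dinv = 1.0 / D_diag                  # trivial inversion of the diagonal D
    BDinv = B * Dinv                     # B D^{-1}: scales the columns of B
    S = A - BDinv @ B.T                  # Schur complement, left side of Eq. (7)
    rhs = a - BDinv @ b                  # right side of Eq. (7)
    x = np.linalg.solve(S, rhs)          # small (6x6 or 5x5) dense solve
    y = Dinv * (b - B.T @ x)             # back-substitution, Eq. (6)
    return x, y
```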
4 Experimental Results
Points were simulated on a unit circle, equally distributed over an arc of varying length, to simulate occluded data. Each point was subjected to Gaussian noise (in x and y). Six methods were used to fit conics:
– Linear: least-squares solution based on the algebraic distance, using the constraint of unit norm on the conic's coefficients.
– Bookstein: the method of [3].
Fig. 2. Two examples: 50 points were distributed over an arc of 160°, and were subjected to a Gaussian noise of a standard deviation of 5 percent of the radius of the circle. Curves shown: Linear, Bookstein, Fitzgibbon, Linear-opt, Book-opt, Fitz-opt, and the original conic.
Fig. 3. Relative error on estimated minor axis length, as a function of noise (the unit of the y-axis is 100%). Curves: Linear, Bookstein, Fitzgibbon, Linear-opt, Book-opt, Fitz-opt; the graphs for the three non-linear optimization methods are superimposed.
– Fitzgibbon: the method of [4].
– Non-linear optimization using our method, using the results of the above methods as initialization: Linear-opt, Book-opt and Fitz-opt.
We performed experiments for many combinations of noise level, amount of occlusion and number of points on the conic; see [15] for a detailed account. Figure 2 shows two typical examples. With all three initializations, the optimization method converged to the same conic in a few iterations (typically 2 or 3). It is not obvious how to quantitatively compare the methods. Displaying residual geometrical point-to-conic distances, for example, would be unfair, since our method is designed to minimize exactly this. Instead, we compute an error measure on the estimated conic. Figure 3 shows the relative error on the length of the estimated conic's minor axis (one indicator of how well the conic's shape has
Fig. 4. Sample results on real data: fitting conics to catadioptric line images. Colors are as in figure 2; reference conics are shown in white and data points in black, in the common portion of the estimated conics.
been recovered), relative to the amount of noise. Each point in the graph represents the median value of the results of 50 simulation runs. All methods degrade rather gracefully, the non-linear optimization results being by far the best (the three graphs are superimposed). We also tested our approach with the results of a hyperbola-specific version of [4] as initialization. In most cases, the optimization method is capable of switching from a hyperbola to an ellipse and of reaching the same solution as when initialized with an ellipse. Figure 4 shows sample results on real data, fitting conics to edge points of catadioptric line images (same color code as in Figure 2). Reference conics are shown in white; they were fitted using calibration information on the catadioptric camera (restricting the problem to 2 dof) and serve here as "ground truth". The data points are shown by the black portion common to all estimated conics. They cover very small portions of the conics, making the fitting rather ill-posed. The ill-posedness shows e.g. in the fact that in most cases, conics with widely varying shapes have similar residuals. Nevertheless, our approach gives results that are clearly more consistent than those of any of the other methods; also note that in the shown examples, the three non-linear optimizations converged to the same conic each time. More results are given in [15].
5 Discussion on Choice of Cost Function
Let us briefly discuss the cost function used. A usual choice, and the one we adopted here, is the sum of squared geometrical distances of the points to the conic. Minimizing this cost function gives the optimal conic in the maximum likelihood sense, under the assumption that data points are generated from points on the true conic, by displacing them along the normal direction by a random distance that follows a zero mean Gaussian distribution, the same for all points. Another choice [17] is based on the assumption that a data point could be generated from any point on the true conic, by displacing it possibly in other directions than the normal to the conic. There may be other possibilities, taking into account the different densities of data points along the conic in areas with different
curvatures. Which cost function to choose depends on the underlying application but of course also on the complexity of implementation and computation. In this work we use the cost function based on the geometrical distance between data points and the conic; it is analytically and computationally more tractable than e.g. [17]. Further, if data points are obtained by edge detection, i.e. if they form a contour, then it is reasonable to assume that the order of the data points along the contour is the same as that of the points on the true conic that were generating them. Hence, it may not be necessary here to evaluate the probability of all points on the conic generating all data points and it seems reasonable to stick with the geometric distance between data points and the conic, i.e. the distance between data points and the closest points on the conic. A more detailed discussion is beyond the scope of this paper though. A final comment is that it is straightforward to embed our approach in any M-estimator, in order to make it robust to outliers.
6 Conclusions and Perspectives
We have proposed a method for fitting conics to points, minimizing geometrical distance. The method avoids the solution of 4th order polynomials, often considered to be one of the main reasons for using algebraic distances. We have described in as much detail as possible how to perform the non-linear optimization computationally efficiently. A few simulation results are presented that suggest that the optimization of geometrical distance may correct bias present in results of linear methods, as expected. However, the main motivation for this paper was not to measure absolute performance, but to show that conic fitting by minimization of geometrical distance is feasible. Recently, we became aware of the work [5], which describes an ellipse-specific method very similar in spirit and formulation to ours. Our method, as presented, is not specific to any affine conic type. This is an advantage if the type of conic is not known beforehand (e.g. line-based camera calibration of omnidirectional cameras is based on fitting conics of possibly different types [6]), and switching between different types is indeed completely natural for the method. However, we have also implemented ellipse-, hyperbola- and parabola-specific versions of the method [15]. The proposed approach for conic fitting can be adapted to other problems. This is rather direct for e.g. the reconstruction of a conic's shape from multiple calibrated images or the optimization of the pose of a conic with known shape, from a single or multiple images. Equally straightforward is the extension to the fitting of quadrics to 3D point sets. Generally, the approach may be used for fitting various types of surfaces or curves to sets of points or other primitives. Another application is plumb-line calibration, where points would have to be parameterized on lines instead of the unit circle. Besides this, we are currently investigating an extension of our approach to the estimation of the shape and/or pose of quadrics, from silhouettes in multiple images. The added difficulty is that points on quadrics have to be parameterized such as to lie on occluding contours.
This may be useful for estimating articulated motions of objects modelled by quadric-shaped parts, similar to [11], which considered cone-shaped parts. Other current work is to make a Matlab implementation of the proposed approach publicly available on the first author's website, and to study cases where the Gauss-Newton approximation of the Hessian may become singular.

Acknowledgements. We thank Pascal Vasseur for the catadioptric image and the associated calibration data, and the reviewers for very useful comments.
References

1. Atkinson, K.B. (ed.): Close Range Photogrammetry and Machine Vision. Whittles Publishing (1996)
2. Boehm, W., Prautzsch, H.: Geometric Concepts for Geometric Design. A.K. Peters (1994)
3. Bookstein, F.L.: Fitting Conic Sections to Scattered Data. Computer Graphics and Image Processing 9, 56–71 (1979)
4. Fitzgibbon, A., Pilu, M., Fisher, R.B.: Direct Least Square Fitting of Ellipses. IEEE–PAMI 21(5), 476–480 (1999)
5. Gander, W., Golub, G.H., Strebel, R.: Fitting of Circles and Ellipses. BIT 34, 556–577 (1994)
6. Geyer, C., Daniilidis, K.: Catadioptric Camera Calibration. In: ICCV, pp. 398–404 (1999)
7. Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Academic Press, San Diego (1981)
8. Halíř, R.: Robust Bias-Corrected Least Squares Fitting of Ellipses. In: Conf. in Central Europe on Computer Graphics, Visualization and Interactive Digital Media (2000)
9. Kanatani, K.: Statistical Bias of Conic Fitting and Renormalization. IEEE–PAMI 16(3), 320–326 (1994)
10. Kanazawa, Y., Kanatani, K.: Optimal Conic Fitting and Reliability Evaluation. IEICE Transactions on Information and Systems E79-D(9), 1323–1328 (1996)
11. Knossow, D., Ronfard, R., Horaud, R., Devernay, F.: Tracking with the Kinematics of Extremal Contours. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 664–673. Springer, Heidelberg (2006)
12. Rosin, P.L.: Analysing Error of Fit Functions for Ellipses. Pattern Recognition Letters 17, 1461–1470 (1996)
13. Rosin, P.L.: Ellipse Fitting Using Orthogonal Hyperbolae and Stirling's Oval. Graphical Models and Image Processing 60(3), 209–213 (1998)
14. Slama, C.C. (ed.): Manual of Photogrammetry, 4th edn. American Society of Photogrammetry and Remote Sensing (1980)
15. Sturm, P.: Conic Fitting Using the Geometric Distance. Rapport de Recherche, INRIA (2007)
16. Taubin, G.: Estimation of Planar Curves, Surfaces, and Nonplanar Space Curves Defined by Implicit Equations with Applications to Edge and Range Image Segmentation. IEEE–PAMI 13(11), 1115–1138 (1991)
17. Werman, M., Keren, D.: A Bayesian Method for Fitting Parametric and Nonparametric Models to Noisy Data. IEEE–PAMI 23(5), 528–534 (2001)
18. Zhang, Z.: Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting. Rapport de Recherche No. 2676, INRIA (1995)
Efficiently Solving the Fractional Trust Region Problem

Anders P. Eriksson, Carl Olsson, and Fredrik Kahl

Centre for Mathematical Sciences, Lund University, Sweden
Abstract. Normalized Cuts has successfully been applied to a wide range of tasks in computer vision; it is indisputably one of the most popular segmentation algorithms in use today. A number of extensions to this approach have also been proposed, ones that can deal with multiple classes or that can incorporate a priori information in the form of grouping constraints. It was recently shown how a general linearly constrained Normalized Cut problem can be solved. This was done by proving that strong duality holds for the Lagrangian relaxation of such problems. This provides a principled way to perform multi-class partitioning while enforcing any linear constraints exactly. The Lagrangian relaxation requires the maximization of the algebraically smallest eigenvalue over a one-dimensional matrix subspace. This is an unconstrained, piece-wise differentiable and concave problem. In this paper we show how to solve this optimization efficiently even for very large-scale problems. The method has been tested on real data with convincing results.¹
1 Introduction

Image segmentation can be defined as the task of partitioning an image into disjoint sets. This visual grouping process is typically based on low-level cues such as intensity, homogeneity or image contours. Existing approaches include thresholding techniques, edge based methods and region-based methods. Extensions to this process include the incorporation of grouping constraints into the segmentation process. For instance, the class labels for certain pixels might be supplied beforehand, through user interaction or some completely automated process [1,2]. Perhaps the most successful and popular approaches for segmenting images are based on graph cuts. Here the images are converted into undirected graphs with edge weights between the pixels corresponding to some measure of similarity. The ambition is that partitioning such a graph will preserve some of the spatial structure of the image itself. These graph based methods were first made popular through the Normalized Cut formulation of [3] and more recently by the energy minimization method of [4]. This algorithm for optimizing objective functions that are submodular has the property of solving many discrete problems exactly. However, not all segmentation problems can
¹ This work has been supported by the European Commission's Sixth Framework Programme under grant no. 011838 as part of the Integrated Project SMErobot™, the Swedish Foundation for Strategic Research (SSF) through the programmes Vision in Cognitive Systems II (VISCOS II) and Spatial Statistics and Image Analysis for Environment and Medicine.
be formulated with submodular objective functions, nor is it possible to incorporate all types of linear constraints. In [5] it was shown how linear grouping constraints can be included in the former approach, Normalized Cuts. It was demonstrated how Lagrangian relaxation can, in a unified manner, handle such linear constraints and also in what way they influence the resulting segmentation. It did not, however, address the practical issues of finding such solutions. In this paper we develop efficient algorithms for solving the Lagrangian relaxation.
2 Background

2.1 Normalized Cuts

Consider an undirected graph $G$, with nodes $V$ and edges $E$, where the non-negative weight of each edge is represented by an affinity matrix $W$ with only non-negative entries and of full rank. A min-cut is the non-trivial subset $A$ of $V$ such that the sum of the edge weights between nodes in $A$ and $V \setminus A$ is minimized, that is, the minimizer of

$$\mathrm{cut}(A, V) = \sum_{i \in A,\ j \in V \setminus A} w_{ij} \qquad (1)$$
This is perhaps the most commonly used method for splitting graphs and is a well known problem for which very efficient solvers exist. It has however been observed that this criterion has a tendency to produce unbalanced cuts: smaller partitions are preferred to larger ones. In an attempt to remedy this shortcoming, Normalized Cuts was introduced by [3]. It is basically an altered criterion for partitioning graphs, applied to the problem of perceptual grouping in computer vision. By introducing a normalizing term into the cut metric, the bias towards undersized cuts is avoided. The Normalized Cut of a graph is defined as

$$N_{\mathrm{cut}} = \frac{\mathrm{cut}(A, V)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(B, V)}{\mathrm{assoc}(B, V)} \qquad (2)$$

where $A \cup B = V$, $A \cap B = \emptyset$, and the normalizing term is defined as $\mathrm{assoc}(A, V) = \sum_{i \in A,\ j \in V} w_{ij}$. It is then shown in [3] that by relaxing (2) a continuous underestimator of the Normalized Cut can be efficiently computed. To be able to include general linear constraints, we reformulate the problem in the following way (see [5] for details). With $d = W\mathbf{1}$ and $D = \mathrm{diag}(d)$, the Normalized Cut cost can be written as

$$\inf_{z} \frac{z^T(D - W)z}{-z^T d d^T z + (\mathbf{1}^T d)^2}, \quad \text{s.t. } z \in \{-1, 1\}^n,\ Cz = b. \qquad (3)$$
The above problem is a non-convex, NP-hard optimization problem. In [5] the constraint z ∈ {−1, 1}^n was replaced with the norm constraint z^T z = n. This gives us the relaxed problem

$$\inf_{z} \; \frac{z^T (D - W) z}{-z^T d d^T z + (\mathbf{1}^T d)^2}, \quad \text{s.t. } z^T z = n,\; Cz = b. \qquad (4)$$
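For concreteness, the following minimal numpy sketch evaluates the relaxed ratio in (4) for a candidate vector z; the function name is ours and it only illustrates how d, D and the two quadratic forms are assembled from the affinity matrix W.

```python
import numpy as np

def relaxed_ncut_cost(W, z):
    """Value of the relaxed Normalized Cut ratio (4) for a given z.
    W is a symmetric affinity matrix; z should satisfy z^T z = n."""
    d = W.sum(axis=1)                       # d = W 1
    D = np.diag(d)
    numerator = z @ (D - W) @ z
    denominator = -(z @ d) ** 2 + d.sum() ** 2
    return numerator / denominator
```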
Even though this is a non-convex problem it was shown in [5] that it is possible to solve it exactly.

2.2 The Fractional Trust Region Subproblem

Next we briefly review the theory for solving (4). Let ẑ be the extended vector $\hat z = [z^T \; z_{n+1}]^T$. Throughout the paper we will write ẑ when we consider the extended variables and just z when we consider the original ones. With $\hat C = [C \;\; -b]$ the linear constraints Cz = b become $\hat C \hat z = 0$; they now form a linear subspace and can be eliminated in the following way. Let $N_{\hat C}$ be a matrix whose columns form a basis of the nullspace of $\hat C$. Any ẑ fulfilling $\hat C \hat z = 0$ can be written $\hat z = N_{\hat C}\, \hat y$, where $\hat y \in \mathbb{R}^{k+1}$. Assuming that the linear constraints are feasible we may always choose that basis so that $\hat y_{k+1} = \hat z_{n+1}$. Let

$$L_{\hat C} = N_{\hat C}^T \begin{bmatrix} D - W & 0 \\ 0 & 0 \end{bmatrix} N_{\hat C} \quad \text{and} \quad M_{\hat C} = N_{\hat C}^T \begin{bmatrix} (\mathbf{1}^T d) D - d d^T & 0 \\ 0 & 0 \end{bmatrix} N_{\hat C},$$

both positive semidefinite (see [5]). In the new space we get the following formulation

$$\inf_{\hat y} \; \frac{\hat y^T L_{\hat C}\, \hat y}{\hat y^T M_{\hat C}\, \hat y}, \quad \text{s.t. } \hat y_{k+1} = 1,\; \|\hat y\|^2_{N_{\hat C}} = n + 1, \qquad (5)$$

where $\|\hat y\|^2_{N_{\hat C}} = \hat y^T N_{\hat C}^T N_{\hat C}\, \hat y$. We call this problem the fractional trust region subproblem since, if the denominator is removed, it is similar to the standard trust region problem [6]. A common approach to solving problems of this type is to simply drop one of the two constraints. This may however result in very poor solutions. For example, in [7] segmentation with prior data was studied. The objective function considered there contained a linear part (the data part) and a quadratic smoothing term. It was observed that when $y_{k+1} \neq \pm 1$ the balance between the smoothing term and the data term was disrupted, resulting in very poor segmentations. In [5] it was shown that in fact this problem can be solved exactly, without excluding any constraints, by considering the dual problem.

Theorem 1. If a minimum of (5) exists, its dual problem

$$\sup_t \; \inf_{\|\hat y\|^2_{N_{\hat C}} = n+1} \; \frac{\hat y^T (L_{\hat C} + t E_{\hat C})\, \hat y}{\hat y^T M_{\hat C}\, \hat y}, \qquad (6)$$

where

$$E_{\hat C} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} - \frac{N_{\hat C}^T N_{\hat C}}{n+1} = N_{\hat C}^T \begin{bmatrix} -\frac{1}{n+1} I & 0 \\ 0 & 1 \end{bmatrix} N_{\hat C},$$

has no duality gap.

Since we assume that the problem is feasible, and as the objective function of the primal problem is the quotient of two positive semidefinite quadratic forms, a minimum obviously exists. Thus we can apply this theorem directly and solve (5) through its dual formulation. We will use F(t, ŷ) to denote the objective function of (6), the Lagrangian of problem (5). By the dual function θ(t) we mean the solution of $\theta(t) = \inf_{\|\hat y\|^2_{N_{\hat C}} = n+1} F(t, \hat y)$.
The inner minimization of (6) is the well-known generalized Rayleigh quotient, for which the minimum is given by the algebraically smallest generalized eigenvalue² of
² By generalized eigenvalue of two matrices A and B we mean finding a λ = λ_G(A, B) and v, ||v|| = 1, such that Av = λBv has a solution.
$(L_{\hat C} + t E_{\hat C})$ and $M_{\hat C}$. Letting $\lambda_{\min}(\cdot, \cdot)$ denote the smallest generalized eigenvalue of the two entering matrices, we can also write problem (6) as

$$\sup_t \; \lambda_{\min}(L_{\hat C} + t E_{\hat C},\, M_{\hat C}). \qquad (7)$$

These two dual formulations will from here on be used interchangeably; it should be clear from the context which one is being referred to. In this paper we will develop methods for solving the outer maximization efficiently.
3 Efficient Optimization

3.1 Subgradient Optimization

First we present a method, similar to that used in [8] for minimizing binary problems with quadratic objective functions, based on subgradients for solving the dual formulation of our relaxed problem. We start off by noting that, as θ(t) is a pointwise infimum of functions linear in t, it is easy to see that θ is a concave function. Hence the outer optimization of (6) is a concave maximization problem, as is expected from dual problems. Thus a solution to the dual problem can be found by maximizing a concave function in one variable t. Note that the choice of norm does not affect the value of θ; it only affects the minimizer ŷ*. It is widely known that the eigenvalues are analytic (and thereby differentiable) functions as long as they are distinct. Thus, to be able to use a steepest ascent method we need to consider subgradients. Recall the definition of a subgradient [9,8].

Definition 1. If a function $g : \mathbb{R}^{k+1} \to \mathbb{R}$ is concave, then $v \in \mathbb{R}^{k+1}$ is a subgradient to g at $\sigma_0$ if

$$g(\sigma) \le g(\sigma_0) + v^T (\sigma - \sigma_0), \quad \forall \sigma \in \mathbb{R}^{k+1}. \qquad (8)$$

One can show that if a function is differentiable then the derivative is the only vector satisfying (8). We will denote the set of all subgradients of g at a point $t_0$ by $\partial g(t_0)$. It is easy to see that this set is convex and that if $0 \in \partial g(t_0)$ then $t_0$ is a global maximum. Next we show how to calculate the subgradients of our problem.

Lemma 1. If $\hat y_0$ fulfills $F(\hat y_0, t_0) = \theta(t_0)$ and $\|\hat y_0\|^2_{N_{\hat C}} = n + 1$, then

$$v = \frac{\hat y_0^T E_{\hat C}\, \hat y_0}{\hat y_0^T M_{\hat C}\, \hat y_0} \qquad (9)$$

is a subgradient of θ at $t_0$. If θ is differentiable at $t_0$, then v is the derivative of θ at $t_0$.

Proof.

$$\theta(t) = \min_{\|\hat y\|^2_{N_{\hat C}} = n+1} \frac{\hat y^T (L_{\hat C} + t E_{\hat C})\, \hat y}{\hat y^T M_{\hat C}\, \hat y} \le \frac{\hat y_0^T (L_{\hat C} + t E_{\hat C})\, \hat y_0}{\hat y_0^T M_{\hat C}\, \hat y_0} = \frac{\hat y_0^T (L_{\hat C} + t_0 E_{\hat C})\, \hat y_0}{\hat y_0^T M_{\hat C}\, \hat y_0} + \frac{\hat y_0^T E_{\hat C}\, \hat y_0}{\hat y_0^T M_{\hat C}\, \hat y_0} (t - t_0) = \theta(t_0) + v^T (t - t_0) \qquad (10)$$
A Subgradient Algorithm. Next we present an algorithm based on the theory of subgradients. The idea is to find a simple approximation of the objective function. Since the function θ is concave, the first-order Taylor expansion $\theta_i(t)$ around a point $t_i$ always fulfills $\theta(t) \le \theta_i(t)$. If $\hat y_i$ solves $\inf_{\|\hat y\|^2_{N_{\hat C}} = n+1} F(\hat y, t_i)$ and this solution is unique, then the Taylor expansion of θ at $t_i$ is

$$\theta_i(t) = F(\hat y_i, t_i) + v_i^T (t - t_i). \qquad (11)$$

Note that if $\hat y_i$ is not unique, $\theta_i$ is still an overestimating function since $v_i$ is a subgradient. One can assume that the function $\theta_i$ approximates θ well in a neighborhood around $t = t_i$ if the smallest eigenvalue is distinct. If it is not, we can expect that there is some $t_j$ such that $\min(\theta_i(t), \theta_j(t))$ is a good approximation. Thus we will construct a function $\bar\theta$ of the type

$$\bar\theta(t) = \inf_{i \in I} \; F(\hat y_i, t_i) + v_i^T (t - t_i) \qquad (12)$$

that approximates θ well. That is, we approximate θ with the point-wise infimum of several first-order Taylor expansions, computed at a number of different values of t; an illustration can be seen in Fig. 1. We then take the solution to the problem $\sup_t \bar\theta(t)$, given by

$$\sup_{t, \alpha} \; \alpha, \quad \text{s.t. } \alpha \le F(\hat y_i, t_i) + v_i^T (t - t_i), \; \forall i \in I, \quad t_{\min} \le t \le t_{\max}, \qquad (13)$$

as an approximate solution to the original dual problem. Here, the fixed parameters $t_{\min}$, $t_{\max}$ are used to express the interval for which the approximation is believed to be valid. Let $t_{i+1}$ denote the optimizer of (13). It is reasonable to assume that $\bar\theta$ approximates θ better the more Taylor approximations we use in the linear program. Thus, we can improve $\bar\theta$ by computing the first-order Taylor expansion around $t_{i+1}$, adding it to (13) and solving the linear program again. This is repeated until $|t_{N+1} - t_N| < \epsilon$ for some predefined $\epsilon > 0$, and $t_{N+1}$ will be a solution to $\sup_t \theta(t)$.
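The following Python sketch illustrates this cutting-plane scheme under simplifying assumptions: dense matrices, $M_{\hat C}$ positive definite so that SciPy's symmetric generalized eigensolver applies, and the tiny two-variable linear program (13) solved with scipy.optimize.linprog. The function names are ours, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import linprog

def theta_and_subgradient(t, L, E, M):
    """theta(t) = lambda_min(L + t*E, M), cf. (7), and a subgradient, eq. (9).
    The ratio in (9) is scale invariant, so the raw eigenvector can be used."""
    vals, vecs = eigh(L + t * E, M)      # generalized eigenvalues, ascending
    y = vecs[:, 0]
    return vals[0], (y @ E @ y) / (y @ M @ y)

def cutting_plane_dual(L, E, M, t0=0.0, t_min=-10.0, t_max=10.0,
                       tol=1e-6, max_iter=100):
    """First-order maximization of theta(t) via the linear program (13)."""
    ts, thetas, grads = [t0], [], []
    for _ in range(max_iter):
        th, g = theta_and_subgradient(ts[-1], L, E, M)
        thetas.append(th)
        grads.append(g)
        # variables (t, alpha): maximize alpha s.t. alpha <= theta_i + g_i (t - t_i)
        A_ub = [[-gi, 1.0] for gi in grads]
        b_ub = [thi - gi * ti for thi, gi, ti in zip(thetas, grads, ts)]
        res = linprog([0.0, -1.0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(t_min, t_max), (None, None)])
        t_next = float(res.x[0])
        if abs(t_next - ts[-1]) < tol:
            return t_next
        ts.append(t_next)
    return ts[-1]
```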
3.2 A Second Order Method

The algorithm presented in the previous section uses first-order derivatives only. We would however like to employ higher-order methods to increase efficiency. This requires calculating second-order derivatives of (6). Most formulas for calculating the second derivatives of eigenvalues involve all of the eigenvectors and eigenvalues. However, determining the entire eigensystem is not feasible for large-scale systems. We will show that it is possible to determine the second derivative of an eigenvalue function by solving a certain linear system involving only the corresponding eigenvalue and eigenvector. The generalized eigenvalues and eigenvectors fulfill the following equations

$$((L_{\hat C} + t E_{\hat C}) - \lambda(t) M_{\hat C})\, \hat y(t) = 0 \qquad (14)$$
$$\|\hat y(t)\|^2_{N_{\hat C}} = n + 1. \qquad (15)$$
Fig. 1. Approximations of two randomly generated objective functions. Top: Approximation after 1 step of the algorithm. Bottom: Approximation after 2 steps of the algorithm.
To emphasize the dependence on t we write λ(t) for the eigenvalue and ŷ(t) for the eigenvector. By differentiating (14) we obtain

$$(E_{\hat C} - \lambda'(t) M_{\hat C})\, \hat y(t) + ((L_{\hat C} + t E_{\hat C}) - \lambda(t) M_{\hat C})\, \hat y'(t) = 0. \qquad (16)$$

This (k + 1) × (k + 1) linear system in ŷ'(t) will have rank k, assuming λ(t) is a distinct eigenvalue. To determine ŷ'(t) uniquely we differentiate (15), obtaining

$$\hat y^T(t)\, N_{\hat C}^T N_{\hat C}\, \hat y'(t) = 0. \qquad (17)$$

Thus, the derivative of the eigenvector, ŷ'(t), is determined by the solution to the linear system

$$\begin{bmatrix} (L_{\hat C} + t E_{\hat C}) - \lambda(t) M_{\hat C} \\ \hat y^T(t)\, N_{\hat C}^T N_{\hat C} \end{bmatrix} \hat y'(t) = \begin{bmatrix} (-E_{\hat C} + \lambda'(t) M_{\hat C})\, \hat y(t) \\ 0 \end{bmatrix} \qquad (18)$$

If we assume differentiability at t, the second derivative of θ(t) can now be found by computing $\frac{d}{dt}\theta'(t)$, where θ'(t) is equal to the subgradient v given by (9):

$$\theta''(t) = \frac{d}{dt}\theta'(t) = \frac{d}{dt} \frac{\hat y(t)^T E_{\hat C}\, \hat y(t)}{\hat y(t)^T M_{\hat C}\, \hat y(t)} = \frac{2}{\hat y(t)^T M_{\hat C}\, \hat y(t)}\, \hat y^T(t) \left( E_{\hat C} - \theta'(t) M_{\hat C} \right) \hat y'(t). \qquad (19)$$
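A small numpy sketch of this computation is given below, reusing the imports of the previous sketch. It assumes dense matrices and a distinct smallest eigenvalue; the stacked system (18), which is overdetermined but consistent, is solved by least squares. The helper name and the explicit rescaling step are ours.

```python
import numpy as np
from scipy.linalg import eigh

def theta_second_derivative(t, L, E, M, N):
    """theta'(t) and theta''(t) at a differentiable point, via (16)-(19).
    N is the null-space basis N_C_hat, of shape (n+1) x (k+1)."""
    vals, vecs = eigh(L + t * E, M)
    lam, y = vals[0], vecs[:, 0]
    # rescale the eigenvector so that ||y||_N^2 = n + 1, as required by (15)
    y = y * np.sqrt(N.shape[0] / (y @ N.T @ N @ y))
    d_theta = (y @ E @ y) / (y @ M @ y)              # theta'(t), eq. (9)
    A = np.vstack([(L + t * E) - lam * M,            # stacked system (18)
                   (N.T @ N @ y)[None, :]])
    b = np.concatenate([(-E + d_theta * M) @ y, [0.0]])
    y_prime = np.linalg.lstsq(A, b, rcond=None)[0]
    dd_theta = 2.0 / (y @ M @ y) * (y @ (E - d_theta * M) @ y_prime)  # eq. (19)
    return d_theta, dd_theta
```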
A Modified Newton Algorithm. Next we modify the algorithm presented in the previous section to incorporate the second derivatives. Note that the second-order Taylor expansion is not necessarily an over-estimator of θ. Therefore we cannot use the second derivatives as we did in the previous section. Instead, as we know θ to be infinitely differentiable when the smallest eigenvalue λ(t) is distinct, and strictly concave around its optimum t*, Newton's method for unconstrained optimization can be applied. It follows from these properties of θ(t) that Newton's
method [9] should be well behaved on this function and that we can expect quadratic convergence in a neighborhood of t*. All of this holds under the assumption that θ is differentiable in this neighborhood. Since Newton's method does not guarantee convergence we have modified the method slightly, adding some safeguarding measures. At a given iteration of the Newton method we have evaluated θ(t) at a number of points $t_i$. As θ is concave we can easily find upper and lower bounds $(t_{\min}, t_{\max})$ on t* by looking at the derivative of the objective function at these values $t = t_i$:

$$t_{\max} = \min_{i:\, \theta'(t_i) \le 0} t_i, \quad \text{and} \quad t_{\min} = \max_{i:\, \theta'(t_i) \ge 0} t_i. \qquad (20)$$
At each step in the Newton method a new iterate is found by approximating the objective function by its second-order Taylor approximation

$$\theta(t) \approx \theta(t_i) + \theta'(t_i)(t - t_i) + \frac{\theta''(t_i)}{2}(t - t_i)^2, \qquad (21)$$
and finding its maximum. By differentiating (21) it is easily shown that its optimum, which is also the next point in the Newton sequence, is given by

$$t_{i+1} = t_i - \frac{\theta'(t_i)}{\theta''(t_i)}. \qquad (22)$$
If $t_{i+1}$ is not in the interval $[t_{\min}, t_{\max}]$ then the second-order expansion cannot be a good approximation of θ; this is where the safeguarding comes in. In these cases we simply fall back to the first-order method of the previous section. If we successively store the values of θ(t_i), as well as the computed subgradients at these points, this can be carried out with little extra computational effort. Then, the upper and lower bounds $t_{\min}$ and $t_{\max}$ are updated, i is incremented by 1 and the whole procedure is repeated, until convergence. If the smallest eigenvalue λ(t_i) at an iteration is not distinct, then θ''(t) is not defined and a new Newton step cannot be computed. In these cases we also use the subgradient method to determine the subsequent iterate. However, empirical studies indicate that non-distinct smallest eigenvalues are extremely unlikely to occur.
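A possible skeleton of this safeguarded iteration, built on theta_second_derivative from the previous sketch, is shown below. For brevity the fallback step here is a midpoint of the current bracket rather than the paper's fallback to the first-order cutting-plane step; that simplification and all names are ours.

```python
def safeguarded_newton(L, E, M, N, t0=0.0, t_lo=-10.0, t_hi=10.0,
                       tol=1e-8, max_iter=30):
    """Maximize theta(t) with safeguarded Newton steps, cf. (20)-(22)."""
    t = t0
    for _ in range(max_iter):
        d1, d2 = theta_second_derivative(t, L, E, M, N)
        # bracket update from the sign of theta'(t), eq. (20)
        if d1 <= 0:
            t_hi = min(t_hi, t)
        if d1 >= 0:
            t_lo = max(t_lo, t)
        t_new = t - d1 / d2 if d2 < 0 else None       # Newton step, eq. (22)
        if t_new is None or not (t_lo <= t_new <= t_hi):
            t_new = 0.5 * (t_lo + t_hi)               # simple safeguard
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t
```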
4 Experiments A number of experiments were conducted in an attempt to evaluate the suggested approaches. As we are mainly interested in maximizing a concave, piece-wise differentiable function, the underlying problem is actually somewhat irrelevant. However, in order to emphasize the intended practical application of the proposed methods, we ran the subgradient- and modified Newton algorithms on both smaller, synthetic problems as well as on larger, real-world data. For comparison purposes we also include the results of a golden section method [9], used in [5], as a baseline algorithm. First, we evaluated the performance of the proposed methods on a large number of synthetic problems. These were created by randomly choosing symmetric, positive definite, 100×100 matrices. As the computational burden lies in determining the generalized
(Left histogram legend: Subgradient alg. vs. Golden section. Right histogram legend: Mod-Newton alg. vs. Golden section.)
Fig. 2. Histogram of the number of function evaluations required for 1000-synthetically generated experiments using a golden section method (blue) and the subgradient algorithm (red)
eigenvalue of the matrices $L_{\hat C} + tE_{\hat C}$ and $M_{\hat C}$, we wish to reduce the number of such calculations. Figure 2 shows a histogram of the number of eigenvalue evaluations for the subgradient and modified Newton methods as well as the baseline golden section search. The two gradient methods clearly outperform the golden section search. The difference between the subgradient and modified Newton methods is not as discernible. The somewhat surprisingly good performance of the subgradient method can be explained by the fact that far away from t* the function θ(t) is practically linear, and an optimization method using second derivatives would not have much advantage over one that uses only first-order information.
Fig. 3. Top: Resulting segmentation (left) and constraints applied (right). Here an X means that this pixel belongs to the foreground and an O to the background. Bottom: Convergence of the modified Newton (solid), subgradient (dashed) and the golden section (dash-dotted) algorithms. The algorithms converged after 9, 14 and 23 iteration respectively.
Finally, we applied our methods to two real-world examples. The underlying motivation for investigating an optimization problem of this form was to segment images with linear constraints using Normalized Cuts. The first image can be seen in Fig. 3; the linear constraints included were hard constraints, that is, the requirement that certain pixels should belong to the foreground or background. One can imagine that such constraints are supplied either by user interaction in a semi-supervised fashion or by some automatic preprocessing of the image. The image was gray-scale, approximately 100 × 100 pixels in size; the associated graph was constructed based on edge information as described in [10]. The second image was of a traffic intersection where one wishes to segment out the small car in the top corner. We have a probability map of the image, giving the likelihood of a certain pixel belonging to the foreground. Here the graph representation is based on this map instead of the gray-level values in the image. The approximate size and location of the vehicle are known and included as a linear constraint in the segmentation process. The resulting partition can be seen in Fig. 4. In both these real-world cases, the resulting segmentation will always be the same, regardless of approach. What is different is the computational complexity of the different methods. Once again, the two gradient-based approaches are much more efficient than a golden section search, and their respective performances are comparable. As the methods differ in what is required to compute, a direct comparison of them is not a straightforward procedure. Comparing the run time would be pointless as the degree to which the
Fig. 4. Top: Resulting segmentation (left) and constraints applied, in addition to the area requirement used (area = 50 pixels) (right). Here the X in the top right part of the corner means that this pixel belongs to the foreground. Bottom: Convergence of the modified Newton (solid), subgradient (dashed) and the golden section (dash-dotted) algorithms. The algorithms converged after 9, 15 and 23 iteration respectively.
implementations of the individual methods have been optimized for speed differ greatly. However, as it is the eigenvalue computations that are the most demanding, we believe that comparing the number of such eigenvalue calculations will be a good indicator of the computational requirements of the different approaches. It can be seen in Figs. 3 and 4 how the subgradient method converges quickly in the initial iterations, only to slow down as it approaches the optimum. This supports the above discussion regarding the linear appearance of the function θ(t) far away from the optimum. We therefore expect the modified Newton method to be superior when higher accuracy is required. In conclusion, we have proposed two methods for efficiently optimizing a piece-wise differentiable function using both first- and second-order information, applied to the task of partitioning images. Even though it is difficult to provide a completely accurate comparison between the suggested approaches, it is obvious that the Newton-based method is superior.
References
1. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 309–314 (2004)
2. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: International Conference on Computer Vision, Vancouver, Canada, pp. 105–112 (2001)
3. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
4. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001)
5. Eriksson, A., Olsson, C., Kahl, F.: Normalized cuts revisited: A reformulation for segmentation with linear grouping constraints. In: International Conference on Computer Vision, Rio de Janeiro, Brazil (2007)
6. Sorensen, D.: Newton's method with a model trust region modification. SIAM Journal on Numerical Analysis 19(2), 409–426 (1982)
7. Eriksson, A., Olsson, C., Kahl, F.: Image segmentation with context. In: Proc. Scandinavian Conference on Image Analysis, Aalborg, Denmark (2007)
8. Olsson, C., Eriksson, A., Kahl, F.: Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In: Proc. Conf. Computer Vision and Pattern Recognition, Minneapolis, USA (2007)
9. Bazaraa, Sherali, Shetty: Nonlinear Programming, Theory and Algorithms. Wiley, Chichester (2006)
10. Malik, J., Belongie, S., Leung, T.K., Shi, J.: Contour and texture analysis for image segmentation. International Journal of Computer Vision 43(1), 7–27 (2001)
Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing Tomoyuki Nagahashi1, Hironobu Fujiyoshi1 , and Takeo Kanade2 Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan
[email protected],
[email protected] http://www.vision.cs.chubu.ac.jp 2 The Robotics Institute, Carnegie Mellon University. Pittsburgh, Pennsylvania, 15213-3890 USA
[email protected] 1
Abstract. We present a novel approach to image segmentation using iterated Graph Cuts based on multi-scale smoothing. We compute the prior probability obtained by the likelihood from a color histogram and a distance transform using the segmentation results from graph cuts in the previous process, and set the probability as the t-link of the graph for the next process. The proposed method can segment the regions of an object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation. We demonstrate that we can obtain 4.7% better segmentation than that with the conventional approach.
1
Introduction
Image segmentation is a technique of removing objects in an image from their background. The segmentation result is typically composed on a different background to create a new scene. Since the breakthrough of Geman and Geman [1], probabilistic inference has been a powerful tool for image processing. The graph-cuts technique proposed by Boykov [2][3] has been used in recent years for interactive segmentation in 2D and 3D. Rother et al. proposed GrabCut[4], which is an iterative approach to image segmentation based on graph cuts. The inclusion of color information in the graph-cut algorithm and an iterative-learning approach increases its robustness. However, it is difficult to segment images that have a complex edge. This is because it is difficult to achieve segmentation by overlapping local edges that influence the cost of the n-link, which is calculated from neighboring pixels. Therefore, we introduced a coarse-to-fine approach to detecting boundaries using graph cuts. We present a novel method of image segmentation using iterated Graph Cuts based on multi-scale smoothing in this paper. We computed the prior probability obtained by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results from the graph cuts in the previous process. The proposed
method could segment regions of an object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation.
2
Graph Cuts
This section describes the graph-cuts-based segmentation proposed by Boykov and Jolly[2]. 2.1
Graph Cuts for Image Segmentation
An image segmentation problem can be posed as a binary labeling problem. Assume that the image is a graph, G = (V, E), where V is the set of all nodes and E is the set of all arcs connecting adjacent nodes. The nodes are usually pixels, p, on the image, P, and the arcs have adjacency relationships with four or eight connections between neighboring pixels, q ∈ N. The labeling problem is to assign a unique label, Li, to each node, i ∈ V, i.e. Li ∈ {"obj", "bkg"}. The solution, L = {L1, L2, ..., Lp, ..., L|P|}, can be obtained by minimizing the Gibbs energy, E(L):

$$E(L) = \lambda \cdot R(L) + B(L) \qquad (1)$$

where

$$R(L) = \sum_{p \in P} R_p(L_p) \qquad (2)$$

$$B(L) = \sum_{\{p,q\} \in N} B_{\{p,q\}} \cdot \delta(L_p, L_q) \qquad (3)$$

and

$$\delta(L_p, L_q) = \begin{cases} 1 & \text{if } L_p \neq L_q \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$
The coefficient, λ ≥ 0, in (1) specifies the relative importance of the regionproperties term, R(L), versus the boundary-properties term, B(L). The regional term, R(L), assumes that the individual penalties for assigning pixel p to “obj” and “bkg”, corresponding to Rp (“obj”) and Rp (“bkg”), are given. For example, Rp (·) may reflect on how the intensity of pixel p fits into a known intensity model (e.g., histogram) of the object and background. The term, B(L), comprises the “boundary” properties of segmentation L. Coefficient B{p,q} ≥ 0 should be interpreted as a penalty for discontinuity between p and q. B{p,q} is normally large when pixels p and q are similar (e.g., in intensity) and B{p,q} is close to zero when these two differ greatly. The penalty, B{p,q} , can also decrease as a function of distance between p and q. Costs B{p,q} may be based on the local intensity gradient, Laplacian zero-crossing, gradient direction, and other criteria.
Fig. 1. Example of graph from image

Table 1. Edge cost

Edge            Cost            For
n-link {p, q}   B{p,q}          {p, q} ∈ N
t-link {p, S}   λ · Rp("bkg")   p ∈ P, p ∉ O ∪ B
                K               p ∈ O
                0               p ∈ B
t-link {p, T}   λ · Rp("obj")   p ∈ P, p ∉ O ∪ B
                0               p ∈ O
                K               p ∈ B
Figure 1 shows an example of a graph from an input image. Table 1 lists the weights of edges of the graph. The region term and boundary term in Table 1 are calculated by

$$R_p(\text{"obj"}) = -\ln \Pr(I_p|O) \qquad (5)$$
$$R_p(\text{"bkg"}) = -\ln \Pr(I_p|B) \qquad (6)$$
$$B_{\{p,q\}} \propto \exp\!\left( -\frac{(I_p - I_q)^2}{2\sigma^2} \right) \cdot \frac{1}{\mathrm{dist}(p, q)} \qquad (7)$$
$$K = 1 + \max_{p \in P} \sum_{q:\{p,q\} \in N} B_{\{p,q\}}. \qquad (8)$$
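A minimal numpy sketch of these edge weights for a 4-connected grid is given below; it assumes a gray-level image and per-pixel likelihood maps, and the function name and default parameter values are ours.

```python
import numpy as np

def edge_costs(img, pr_obj, pr_bkg, obj_seeds, bkg_seeds, lam=1.0, sigma=10.0):
    """t-link and n-link weights following Table 1 and eqs. (5)-(8).
    img: 2-D gray image; pr_obj/pr_bkg: Pr(Ip|O), Pr(Ip|B) per pixel;
    obj_seeds/bkg_seeds: boolean seed masks O and B."""
    img = img.astype(float)
    eps = 1e-12
    R_obj = -np.log(pr_obj + eps)                     # Rp("obj"), eq. (5)
    R_bkg = -np.log(pr_bkg + eps)                     # Rp("bkg"), eq. (6)
    # n-links between horizontal / vertical neighbours (dist(p,q) = 1), eq. (7)
    B_h = np.exp(-(img[:, 1:] - img[:, :-1]) ** 2 / (2 * sigma ** 2))
    B_v = np.exp(-(img[1:, :] - img[:-1, :]) ** 2 / (2 * sigma ** 2))
    # K exceeds the largest sum of n-links incident to a single pixel, eq. (8)
    deg = np.zeros_like(img)
    deg[:, 1:] += B_h; deg[:, :-1] += B_h
    deg[1:, :] += B_v; deg[:-1, :] += B_v
    K = 1.0 + deg.max()
    # t-links to the source S ("obj" terminal) and sink T ("bkg" terminal)
    t_S = lam * R_bkg
    t_T = lam * R_obj
    t_S[obj_seeds], t_T[obj_seeds] = K, 0.0
    t_S[bkg_seeds], t_T[bkg_seeds] = 0.0, K
    return t_S, t_T, B_h, B_v, K
```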
Let O and B define the “object” and “background” seeds. Seeds are given by the user. The boundary between the object and the background is segmented by finding the minimum cost cut [5] on the graph, G. 2.2
Problems with Graph Cuts
It is difficult to segment images including complex edges in interactive graph cuts [2], [3], as shown in Fig. 2. This is because the cost of the n-link is larger than that of the t-link. If a t-link value is larger than that of an n-link, the number of error
Fig. 2. Example of poor results
pixels will be increased due to the influence of the color. The edge has a strong influence when there is a large n-link. The cost of the n-link between the flower and the leaf is larger than that between the leaf and the shadow as seen in Fig. 2. Therefore, it is difficult to segment an image that has a complex edge. We therefore introduced a coarse-to-fine approach to detect boundaries using graph cuts.
3
Iterated Graph Cuts by Multi-scale Smoothing
We present a novel approach to segmenting images using iterated Graph Cuts based on multi-scale smoothing. We computed the prior probability obtained by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results from graph cuts in the previous process. The proposed method could segment regions of an object using a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation.
3.1 Overview of Proposed Method
Our approach is outlined in Fig. 3. First, the seeds for the "foreground" and "background" are given by the user. The first smoothing parameter, σ, is then
Fig. 3. Overview of proposed method
determined. Graph cuts are done to segment the image into an object or a background. The Gaussian Mixture Model (GMM) is then used to make a color distribution model for object and background classes from the segmentation results obtained by graph cuts. The prior probability is updated from the distance transform by the object and background classes of the GMM. The t-links for the next graph-cuts process are calculated as a posterior probability, which is computed from the prior probability and the GMMs, and σ is updated as σ = α · σ (0 < α < 1). These processes are repeated until σ = 1, or until classification converges if σ < 1. The processes are as follows (a sketch of this loop is given after Fig. 4 below).

Step 1. Input seeds
Step 2. Initialize σ
Step 3. Smooth the input image by Gaussian filtering
Step 4. Do graph cuts
Step 5. Calculate the posterior probability from the segmentation results and set it as the t-link
Step 6. Steps 1-5 are repeated until σ = 1 or classification converges if σ < 1.

The proposed method can be used to segment regions of the object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation, as shown in Fig. 4.
Fig. 4. Example of iterating the graph-cuts process
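The outer loop of Steps 1-6 can be sketched as follows. The two callables `graph_cut` and `update_tlinks` are hypothetical stand-ins for the max-flow segmentation of Sec. 2 and the posterior update of Sec. 3.3; the default values of sigma0 and alpha are purely illustrative, and the image is assumed grayscale for the smoothing call.

```python
from scipy.ndimage import gaussian_filter

def iterated_graph_cuts(image, seeds, graph_cut, update_tlinks,
                        sigma0=8.0, alpha=0.5):
    """Iterated graph cuts with multi-scale Gaussian smoothing (Sec. 3.1)."""
    sigma = sigma0                       # Step 2 (seeds given in Step 1)
    tlinks = None                        # first pass relies on the seeds only
    while True:
        smoothed = gaussian_filter(image, sigma=sigma)     # Step 3
        labels = graph_cut(smoothed, seeds, tlinks)        # Step 4
        tlinks = update_tlinks(image, labels)              # Step 5
        if sigma <= 1.0:                 # Step 6 (convergence test omitted)
            return labels
        sigma = max(1.0, alpha * sigma)  # sigma := alpha * sigma
```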
3.2
Smoothing Image Using Down Sampling
The smoothing image is created with a Gaussian filter. Let I denote an image and G(σ) denote a Gaussian kernel. Smoothing image L(σ) is given by L(σ) = G(σ) ∗ I.
(9)
If the Gaussian parameter σ is large, it is necessary to enlarge the window size of the filter. As it is very difficult to design such a large window for Gaussian filtering, we used down-sampling to obtain a smoothing image that maintains the continuity of σ. Smoothing image L1(σ) is first computed using the input image I1, increasing σ. Image I2 is then down-sampled to half the size of input image I. Smoothing
Fig. 5. Smoothing Image using down-sampling
image L2(σ) is computed using I2. Here, the relationship between L1(σ) and L2(σ) is

$$L_1(2\sigma) = L_2(\sigma). \qquad (10)$$

We obtain the smoothing image, which maintains continuity of σ without changing the window size, using this relationship. Figure 5 shows the smoothing process obtained by down-sampling. The smoothing procedure was repeated until σ = 1 in our implementation.
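A compact sketch of this idea is shown below; the per-octave cap is our choice, and a production version would low-pass the image before sub-sampling.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed(image, sigma, sigma_cap=4.0):
    """Gaussian smoothing with down-sampling, exploiting eq. (10),
    L1(2*sigma) = L2(sigma): while sigma exceeds the cap, halve the image
    and halve sigma instead of enlarging the filter window.  The result
    lives on the coarser grid."""
    img = np.asarray(image, dtype=float)
    while sigma > sigma_cap:
        img = img[::2, ::2]       # down-sample by two
        sigma /= 2.0              # eq. (10): same blur with half the sigma
    return gaussian_filter(img, sigma=sigma)
```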
3.3 Iterated Graph Cuts
We compute the prior probability obtained by the likelihood from a color histogram and a distance transform using the segmentation results from the graph cuts in the previous process, and set the probability as the t-link using

$$R_p(\text{"obj"}) = -\ln \Pr(O|I_p) \qquad (11)$$

Fig. 6. Outline of updating for likelihood and prior probability
$$R_p(\text{"bkg"}) = -\ln \Pr(B|I_p) \qquad (12)$$

where Pr(O|Ip) and Pr(B|Ip) are given by

$$\Pr(O|I_p) = \frac{\Pr(O)\Pr(I_p|O)}{\Pr(I_p)} \qquad (13)$$
$$\Pr(B|I_p) = \frac{\Pr(B)\Pr(I_p|B)}{\Pr(I_p)}. \qquad (14)$$
Pr(Ip|O), Pr(Ip|B) and Pr(O), Pr(B) are computed from the segmentation results using graph cuts in the previous process. Figure 6 outlines t-link updating obtained by the likelihood and prior probability.

Updating likelihood. The likelihoods Pr(Ip|O) and Pr(Ip|B) are computed by GMM [6]. The GMM for RGB color space is obtained by

$$\Pr(I_p|\cdot) = \sum_{i=1}^{K} \alpha_i\, p_i(I_p|\mu_i, \Sigma_i) \qquad (15)$$

$$p(I_p|\mu, \Sigma) = \frac{1}{(2\pi)^{3/2} |\Sigma|^{1/2}} \cdot \exp\!\left( -\frac{1}{2} (I_p - \mu)^T \Sigma^{-1} (I_p - \mu) \right). \qquad (16)$$
We used the EM algorithm to fit the GMM [7].

Updating prior probability. The prior probabilities Pr(O) and Pr(B) are updated by spatial information from graph cuts in the previous process. The next segmentation label is uncertain in the vicinity of the boundary. Therefore, the prior probability is updated by using the results of a distance transform. The distance from the boundary is normalized from 0.5 to 1. Let dobj denote the distance transform of the object, and dbkg denote the distance transform of the background. The prior probability is given by:

$$\Pr(O) = \begin{cases} d_{obj} & \text{if } d_{obj} \ge d_{bkg} \\ 1 - d_{bkg} & \text{if } d_{obj} < d_{bkg} \end{cases} \qquad (17)$$

$$\Pr(B) = 1 - \Pr(O) \qquad (18)$$
Finally, using Pr(Ip|O), Pr(Ip|B) from the GMM, and Pr(O) and Pr(B) from the distance transform, the posterior probability is computed by means of Eqs. (11) and (12). We compute a prior probability obtained by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results obtained by graph cuts in the previous process. Figure 7 shows examples of segmentation results when the n-link is changed. When σ is small, the boundary-properties term, B{p,q}, at the object is small because of the complex texture. Therefore, graph-cuts results do not work well for image segmentation. However, B{p,q} in the smoothing image is small between the object and background. The proposed method can be used to segment regions of the object using a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation.
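The update of the t-links can be sketched as follows, using scikit-learn's EM-fitted GaussianMixture for (15)-(16) and a Euclidean distance transform for (17)-(18). The normalization of the distances to [0.5, 1] and the function name are our interpretation, not the authors' code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from sklearn.mixture import GaussianMixture

def posterior_tlinks(image, labels, n_components=5):
    """t-link update of Sec. 3.3.  image: HxWx3 array; labels: boolean object
    mask from the previous graph cut.  Returns Rp("obj"), Rp("bkg")."""
    pix = image.reshape(-1, 3).astype(float)
    gmm_o = GaussianMixture(n_components).fit(pix[labels.ravel()])
    gmm_b = GaussianMixture(n_components).fit(pix[~labels.ravel()])
    lik_o = np.exp(gmm_o.score_samples(pix)).reshape(labels.shape)  # Pr(Ip|O)
    lik_b = np.exp(gmm_b.score_samples(pix)).reshape(labels.shape)  # Pr(Ip|B)
    # distance transforms from the boundary, normalized to [0.5, 1]
    def norm(d):
        return 0.5 + 0.5 * d / max(d.max(), 1e-12)
    d_obj = norm(distance_transform_edt(labels))
    d_bkg = norm(distance_transform_edt(~labels))
    prior_o = np.where(d_obj >= d_bkg, d_obj, 1.0 - d_bkg)          # eq. (17)
    prior_b = 1.0 - prior_o                                         # eq. (18)
    post_o = prior_o * lik_o / (prior_o * lik_o + prior_b * lik_b + 1e-12)
    R_obj = -np.log(post_o + 1e-12)        # eq. (11)
    R_bkg = -np.log(1.0 - post_o + 1e-12)  # eq. (12)
    return R_obj, R_bkg
```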
Fig. 7. Example of segmentation results when changing n-link
4 Experimental Results

4.1 Dataset
We used 50 images (humans, animals, and landscapes) provided by the GrabCut database [8]. We compared the proposed method, Interactive Graph Cuts [2] and GrabCut [4] using the same seeds. The segmentation error rate is defined as

$$\text{over segmentation} = \frac{\text{miss-detected object pixels}}{\text{image size}} \qquad (19)$$
$$\text{under segmentation} = \frac{\text{miss-detected background pixels}}{\text{image size}}. \qquad (20)$$

4.2 Experimental Results
Table 2 lists the error rate (%) for segmentation results using the proposed method and the conventional methods [2], [4]. The proposed method can obtain

Table 2. Error rate [%]

                     Interactive Graph Cuts[2]   GrabCut[4]   Proposed method
Over segmentation    1.86                        3.33         1.12
Under segmentation   1.89                        1.59         0.49
total                3.75                        4.93         1.61
2.14% better segmentation than Interactive Graph Cuts. To clarify the differences between the methods, successfully segmented images were defined, based on the results of Interactive Graph Cuts, as those with error rates below 2%, and missed images were defined as those with error rates over 2%. Table 3 lists the segmentation results for successfully segmented and missed images. The proposed
Fig. 8. Examples of segmentation results
method and Interactive graph cuts are comparable in the number of successfully segmented images. However, we can see that the proposed method can obtain 4.79% better segmentation than Interactive Graph Cuts in missed images. The proposed method can be used to segment regions of the object using a stepwise process from global to local segmentation. Figure 8 shows examples of segmentation results obtained with the new method. 4.3
Video Segmentation
The proposed method can be applied to segmenting N-D data. A sequence of 40 frames (320x240) was treated as a single 3D volume. A seed is given to the first frame. Figure 9 shows examples of video segmentation obtained with the new method. It is clear that the method we propose can easily be applied to segmenting videos. We can obtain video-segmentation results.
Fig. 9. Example of video segmentation

Table 3. Error rate [%]

                                           Interactive Graph Cuts[2]   GrabCut[4]   Proposed method
Successfully segmented   Over segmented    0.29                        3.54         0.81
(26 images)              Under segmented   0.43                        1.03         0.22
                         total             0.72                        4.58         1.03
Missed images            Over segmented    3.56                        3.10         1.45
(24 images)              Under segmented   3.47                        2.21         0.79
                         total             7.04                        5.31         2.25

5 Conclusion
We presented a novel approach to image segmentation using iterated Graph Cuts based on multi-scale smoothing. We computed the prior probability obtained by the likelihood from a color histogram and a distance transform, and set the probability as the t-link of the graph for the next process using the segmentation results from the graph cuts in the previous process. The proposed method could segment regions of an object with a stepwise process from global to local segmentation by iterating the graph-cuts process with Gaussian smoothing using different values for the standard deviation. We demonstrated that we could obtain 4.7% better segmentation than that with the conventional approach. Future work includes increased speed using super pixels and highly accurate video segmentation.
References
1. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. PAMI-6, 721–741 (1984)
2. Boykov, Y., Jolly, M.-P.: Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images. In: ICCV, vol. I, pp. 105–112 (2001)
3. Boykov, Y., Funka-Lea, G.: Graph Cuts and Efficient N-D Image Segmentation. IJCV 70(2), 109–131 (2006)
4. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": Interactive Foreground Extraction Using Iterated Graph Cuts. ACM Trans. Graphics (SIGGRAPH 2004) 23(3), 309–314 (2004)
5. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. PAMI 26(9), 1124–1137 (2004)
6. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for Real-time Tracking. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 246–252. IEEE Computer Society Press, Los Alamitos (1999)
7. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum-likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)
8. GrabCut Database: http://research.microsoft.com/vision/cambridge/i3l/segmentation/GrabCut.htm
Backward Segmentation and Region Fitting for Geometrical Visibility Range Estimation Erwan Bigorgne and Jean-Philippe Tarel LCPC (ESE), 58 Boulevard Lef`ebvre, F-75732 Paris Cedex 15, France
[email protected] [email protected]
Abstract. We present a new application of computer vision: continuous measurement of the geometrical visibility range on inter-urban roads, solely based on a monocular image acquisition system. To tackle this problem, we propose first a road segmentation scheme based on a Parzen-windowing of a color feature space with an original update that allows us to cope with heterogeneously paved roads, shadows and reflections, observed under various and changing lighting conditions. Second, we address the under-constrained problem of retrieving the depth information along the road based on the flat world assumption. This is performed by a new region-fitting iterative least squares algorithm, derived from half-quadratic theory, able to cope with vanishing-point estimation, and allowing us to estimate the geometrical visibility range.
1
Introduction
Coming with the development of outdoor mobile robot systems, the detection and the recovering of the geometry of paved and /or marked roads has been an active research-field in the late 80’s. Since these pioneering works, the problem is still of great importance for different fields of Intelligent Transportation Systems. A precise, robust segmentation and fitting of the road thus remains a crucial requisite for many applications such as driver assistance or infrastructure management systems. We propose a new infrastructure management system: automatic estimation of the geometrical visibility range along a route, which is strictly related to the shape of the road and the presence of occluding objects in its close surroundings. Circumstantial perturbations such as weather conditions (vehicles, fog, snow, rain ...) are not considered. The challenge is to use a single camera to estimate the geometrical visibility range along the road path, i.e the maximum distance the road is visible. When only one camera is used, the process of recovering the projected depth information is an under-constrained problem which requires the introduction of generic constraints in order to infer a unique solution. The hypothesis which is usually considered is the flat world assumption [1], by which the road is assumed included in a plane. With the flat word assumption, a precise detection
Thanks to the French Department of Transportation for funding, within the SARIPREDIT project (http://www.sari.prd.fr/HomeEN.html).
of the vanishing line is crucial. Most of the past and recent single camera algorithms are based on this assumption but differ by the retained model for the road itself [2,3,4,5,6,7]. One group of algorithms moves aside from the flat world assumption and provides an estimation of the vertical curvature of the road. In [8,9,10] the constraint that the road generally keeps an approximately constant width and does not tilt sideways is used. In a general way, it should be noted that the quoted systems, which often relate to applications of lane-tracking /-following, work primarily on relatively ’not so far’ parts of the road. In our case, the geometrical visibility range should be monitored along an interurban route to check for instance its compatibility with speed limits. We are thus released from the requirements of a strictly realtime application; however, both parts of the system, the detection and the fitting of the road, should manage the far extremity of the perceptible road, a requisite for which a road detection-based approach appears to be more adequate than the detection of markings. This article is composed of two sections. The first section deals with the segmentation of the image. We restrict ourselves to structured road contrary to [11]. The proposed algorithm operates an adaptative supervised classification of each pixel in two classes: Road (R) and Other (O). The proposed algorithm is robust and benefits from the fact that the process is off-line. The second section deals with the region-fitting algorithm of the road working on the probability map provided by the segmentation step. The proposed algorithm follows an alternated iterative scheme which allows both to estimate the position of the vanishing line and to fit the borders of the road. The camera calibration being known, the positions of the vanishing line and of the far extremity of the perceptible road are enough to estimate the geometrical visibility range.
2
An Adaptative Probabilistic Classification
A dense detection of the road has been the object of many works which consider it as a two class pixel classification problem either in a supervised way [1,12,13,14], or not [15,16]. All these works face the same difficulty: the detection should be performed all along a road, when the appearance of the road is likely to strongly vary because of changes in the pavement material or because of local color heterogeneity; the lighting conditions can also drastically modify the appearance of the road, see Fig. 1:
Fig. 1. Examples of road scenes to segment, with shadows and changes in pavement material
– The shadows in outdoors environment modify intensity and chromatic components (blue-wards shifting). – The sun at grazing angles and/or the presence of water on the road causes specular reflections. Several previously proposed systems try to tackle these difficulties. The originality of our approach is that we take advantage of the fact that the segmentation is off-line by performing backward processing which leads to robustness. We use a classification scheme able to cope with classes with possibly complex distributions of the color signal, rather than searching for features that would be invariant to well-identified transformations of the signal. In our tests, and contrary to [11], no feature with spatial or textured content (Gabor energy, local entropy, moments of co-occurence matrix, etc.) appeared to be sufficiently discriminant in the case of paved roads, whatever is the environment. In practice, we have chosen to work in the La∗ b∗ color-space which is quasi-uncorrelated. 2.1
Parzen-Windowing
In order to avoid taking hasty and wrong decisions, the very purpose of the segmentation stage is restricted to providing a probability map of being within the Road class, which will be used for fitting the road. The classification of each pixel is performed using a Bayesian decision: the posterior probability for a pixel with feature vector x = [L, a*, b*] to be part of the road class is:

$$P(R/x) = \frac{p(x/R)\,P(R)}{p(x/R)\,P(R) + p(x/O)\,P(O)} \qquad (1)$$
where R and O denote the two classes. We use Parzen windows to model p(x/R) and p(x/O), the class-conditional probability density functions (pdf). We choose the anisotropic Gaussian function with mean zero and diagonal covariance matrix Σd as the Parzen window. Parzen windows are accumulated during the learning phases in two 3-D matrices, called P^R and P^O. The matrix dimensions depend on the signal dynamics and are chosen in adequacy with the bandwidth of Σd. For a 24-bit color signal, we typically use two 64³ matrices and a diagonal covariance matrix Σd with [2, 1, 1] for bandwidth. This particular choice indeed allows larger variations along the intensity axis, making it possible to cope with color variations caused by sun reflections far ahead on the road. A fast estimation of p(x/R) and p(x/O) is thus obtained by using P^R and P^O as simple look-up tables, the entries of which are the digitized color coordinates of the feature vectors x.
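A minimal numpy/scipy sketch of such a look-up table is shown below: samples are digitized into a 64³ histogram, which is then smoothed with the anisotropic Gaussian window, an equivalent way of accumulating Gaussian Parzen windows on the discrete grid. The digitization step assumes 8-bit colour channels, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def parzen_lut(samples_lab, bins=64, bandwidth=(2.0, 1.0, 1.0)):
    """Parzen estimate of a class-conditional pdf on a bins^3 grid (Sec. 2.1)."""
    idx = np.clip((samples_lab * bins / 256.0).astype(int), 0, bins - 1)
    hist = np.zeros((bins, bins, bins))
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    lut = gaussian_filter(hist, sigma=bandwidth)   # anisotropic Gaussian window
    return lut / max(lut.sum(), 1e-12)             # normalize to a pdf

# usage: P_R = parzen_lut(road_pixels); p_x_given_R = P_R[iL, ia, ib]
```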
2.2 Comparison
We compared our approach with [16], which is based on the use of color saturation only. We found that although saturation usually provides good segmentation results, this heuristic fails in more complex cases where separability is no longer verified; see Fig. 2. Fig. 3 shows the correct pixel classification rate for a variable
Fig. 2. Posterior probability maps based on saturation (middle) and [L, a*, b*] vectors (right)
(Plot: correct classification rate vs. probability threshold for the Saturation, rg and RGB features.)
Fig. 3. Correct classification rate comparison for different types of feature
threshold applied on the class-conditional pdf p(x/R). Three types of feature have been compared on twenty images of different road scenes with a ground-truth segmented by hand: 1) the color saturation x = S = 1 − Min(R,G,B)/Mean(R,G,B), 2) the chromatic coefficients x = [r = R/(R+G+B), b = B/(R+G+B)], and 3) the full color signal x = [L, a*, b*]. The obtained results show the benefit of a characterization based on this last vector, which is made possible by the use of Parzen-windowing.
2.3 Robust Update
The difficulty is to correctly update the class-conditional pdfs along a route despite drastic changes of the road appearance. In case of online processing, thanks to the temporal continuity, new pixel samples are typically selected in areas where either road or non-road pixels are predicted to take place [12,1]. In practice, this approach is not very robust because segmentation prediction is subject to errors and these errors imply damaged class-conditional pdfs that will produce a poor segmentation on the next image. Due to our particular application which is off-line, we greatly benefit from a backward processing of the entire sequence: being given N images taken at regular intervals, the (N − k)-th one is processed at the k-th iteration. For this image, new pixel samples are picked up in the bottom center part of the image to update the ’Road ’ pdf. The advantage is that we know for sure that these
Fig. 4. 20 of the detected road connected-components in an image sequence. This particular sequence is difficult due to shadows and pavement material changes.
pixels are from the ’Road ’ class since the on-board imaging system grabbing the sequence is on the road. Moreover, these new samples belongs to the newly observable portion of the road, and thus no prediction is needed. The update of the ’Other ’ pdf is only made on pixels that have been labeled ’Other ’ at the previous iteration. In order to lower as much as possible the risk of incorrect learning of the ’Road ’ class, and then to prevent any divergence of the learning, the proper labeling of pixels as ’Road ’ is performed by carrying out a logicalAND operation between the fitted model explained in the next section and the connected-component of the threshold probability map which is overlapping the bottom center-part of the image. The ’Other ’ class is then naturally defined as the complementary. This process drastically improves the robustness of the update compared to online approaches. Fig. 4 shows the detected connected-component superimposed on the corresponding original images with a probability threshold set at 0.5. These quite difficult frames show at the same time shadowed and overexposed bi-component pavement materials. The over-detections in the three first frames of the fourth row are due to a partially occluded private gravel road. This quality of results cannot be obtained with online update.
3
Road Fitting
As explained in the introduction, the estimation of the shape of the road is usually achieved by means of edge-fitting algorithms, which are applied after the detection of some lane or road boundaries. Hereafter, we propose an original approach based on region fitting which is more robust to missing data and which is also able to cope with vanishing line estimation.
3.1
Road Models
Following [4], we use two possible curve families to model the borders of the road. First we use polynomial curves. u_r(v) (resp. u_l(v)) models the right (resp. left) border of the road and is given as:

$$u_r = b_0 + b_1 v + b_2 v^2 + \dots + b_d v^d = \sum_{i=0}^{d} b_i v^i \qquad (2)$$
and similarly for the left border. Close to the vehicle, the four first parameters b_0, b_1, b_2, b_3 are proportional respectively to the lateral offset, to the bearing of the vehicle, to the curvature and to the curvature gradient of the lane. Second we use hyperbolic polynomial curves which better fit road edges on long range distances:

$$u_r = a_0 (v - v_h) + a_1 + \dots + \frac{a_d}{(v - v_h)^{d-1}} = \sum_{i=0}^{d} a_i (v - v_h)^{1-i} \qquad (3)$$
and similarly for the left border. The previous equations are rewritten in short in vector notation as u_r = A_r^T X_{v_h}(v) (resp. u_l = A_l^T X_{v_h}(v)).
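For illustration, the design matrix X_{v_h}(v) can be built as below for either curve family; the function name is ours and, for the hyperbolic model, the rows are assumed to lie below the vanishing line (v > v_h).

```python
import numpy as np

def design_matrix(v, v_h, degree, hyperbolic=True):
    """One row per image line v, so that a border is u = X @ A.
    Polynomial model (2): columns v^0..v^d.
    Hyperbolic model (3): columns (v - v_h)^(1-i), i = 0..d."""
    v = np.asarray(v, dtype=float)
    if hyperbolic:
        cols = [(v - v_h) ** (1 - i) for i in range(degree + 1)]
    else:
        cols = [v ** i for i in range(degree + 1)]
    return np.stack(cols, axis=1)

# e.g. X = design_matrix(np.arange(v_h + 1, height), v_h, 6)
# is the matrix through which the region fit of Sec. 3.2 acts on A_l, A_r.
```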
3.2 Half Quadratic Theory
We propose to set the region fitting algorithm as the minimization of the following classical least-squares error:

$$e(A_l, A_r) = \int_{\mathrm{Image}} \left[ P(R/x(u, v)) - \Omega_{A_l, A_r}(u, v) \right]^2 du\, dv \qquad (4)$$
between the image P(R/x(u, v)) of the probability to be within the road class and the function Ω_{A_l,A_r}(u, v) modeling the road. This region is parametrized by A_l and A_r, the left and right border parameters. Ω_{A_l,A_r} must be one inside the region and zero outside. Notice that the function A^T X_{v_h}(v) − u is defined for all pixel coordinates (u, v). Its zero set is the explicit curve parametrized by A, and the function is negative on the left of the curve and positive on its right. We thus can use the previous function to build Ω_{A_l,A_r}. For instance, the function $g\!\left(\frac{A^T X_{v_h}(v) - u}{\sigma}\right) + \frac{1}{2}$ is a smooth model of the region on the right of the curve for any increasing odd function g with g(+∞) = 1/2. The σ parameter is useful to tune the smoothing strength. For a two-border region, we multiply the models for the left and right borders accordingly:

$$\Omega_{A_l, A_r} = \left( g\!\left(\frac{A_l^T X_{v_h}(v) - u}{\sigma}\right) + \frac{1}{2} \right) \left( \frac{1}{2} - g\!\left(\frac{A_r^T X_{v_h}(v) - u}{\sigma}\right) \right) \qquad (5)$$
Backward Segmentation and Region Fitting
823
The previous minimization is non-linear due to g. However, we now show that this minimization can be handled with the half-quadratic theory and allows us to derive the associated iterative algorithm. Indeed, after expansion t of the Al Xi −j square in (6), the function g 2 of the left and right residuals appears: g 2 σ t 2 Ar Xi −j 2 2 and g . Function g (t) being even, it can be rewritten as g (t) = σ h(t2 ). Once the problem is set in these terms, the half-quadratic theory can be applied in a similar way as for instance in [6] by defining the auxiliary variables 2 2 t t t t Al Xi −j Ar Xi −j Al Xi −j Ar Xi −j l r l r ωij , ω , ν = = = and ν = . The ij ij ij σ σ σ σ Lagrangian of the minimization is then obtained as:
1 l r l r l r L = ij h(νij )h(ν ij )g(ωij ij ) + l4 (h(νij ) r+ h(νij )) l + (2Prij − 1)g(ω ) r l +(Pij − 1/4) −g(ω ) + g(ω ) −h(ν )g(ω ) + h(ν )g(ω ) ij ij ij ij ij ij
At X −j At X −j l r (7) + ij λlij ωij + λrij ωij − l σi − r σi 2 2 t t
l Al Xi −j j−Ar Xi −j l r − − + ij μij νij + μrij νij σ σ The derivatives of (7) with respect to : the auxiliary variables, the unknown variables Al and Ar , and the Lagrange coefficients λlij , λrij , μlij μrij are set to zero. The algorithm is derived as an alternate and iterative minimization using the resulting equations.
Fig. 5. 15th degree polynomial region fitting on a difficult synthetic image. Left: ΩAl ,Ar 3D-rendering. Right: Obtained borders in white.
The proposed algorithm can handle a region defined either with polynomial curves (2) or with hyperbolic curves (3). It is only the design matrix (Xi ) that changes. Fig. 5 presents a region fit on a difficult synthetic image with numerous outliers and missing parts. The fit is a 15th order polynomial. On the left side, the 3-D rendering of the obtained region model ΩAl ,Ar is shown. Notice how the proposed region model is able to fit a closed shape even if the region borders are two explicit curves. We want to insist on the fact that contour-based fitting cannot handle correctly such images, with so many edge outliers and closings. 3.3
Geometrical Visibility Range
As explained in the introduction, for road fitting, it is of main importance to be able to estimate the position vh of the vanishing line which parametrizes the
824
E. Bigorgne and J.-P. Tarel
Fig. 6. Road region fitting results with 6th order hyperbolic polynomial borders. The images on the right provide a zoom on the far extremity of the road. The white line figures the estimated vanishing line; the red line figures the maximum distance at which the road is visible.
design matrix. We solve this problem by adding an extra step in the previous iterative minimization scheme, where vh is updated as the ordinate of the point where the asymptotes of the two curves intersect each other. In practice, we observed that the modified algorithm converges towards a local minimum. The minimization is performed with decreasing scales to better converge towards a local minimum not too far from the global one. Moreover, as underlined in [4], the left and right borders of the road are related being approximately parallel. This constraint can be easily enforced in the region fitting algorithm and leads to a minimization problem with a reduced number of parameters. Indeed, parallelism of the road borders implies ari = ali , ∀i ≤ 1, in (3). This constraint brings an improved robustness to the road region fitting algorithm as regards missing parts and outliers. Fig. 6 shows two images taken from one of the sequences we experimented with. It illustrates the accuracy and the robustness of the obtained results when the local-flatness assumption is valid, first row. Notice the limited effect of the violation of this assumption on
Fig. 7. Flat and non-flat road used for distance accuracy experiments
Backward Segmentation and Region Fitting
825
the second row at long distance. The white line shows the estimated vanishing line while the red line shows the maximum image height where the road is visible. The geometric visibility range of the road is directly related to the difference in height between the white and red lines. Table 1. Comparison of the true distance in meters (true) with the distances estimated by camera calibration (calib.), and estimated using the proposed segmentation and fitting algorithms (estim.) for four targets and on two images. On the left is the flat road, on the right the non-flat road of Fig. 7. target 1 2 3 4
true 26.56 52.04 103.52 200.68
calib. 26.95 56.67 98.08 202.02
estim. 27.34 68.41 103.4 225.96
true 33.59 59.49 111.66 208.64
calib. 23.49 60.93 111.58 1176.75
estim. 34.72 61.67 114.08 1530.34
Finally, we ran experiments to evaluate the accuracy of the estimated distances using one camera. On two images, one where the road is really flat and one where it is not the case, see Fig. 7, we compared the estimated and measured distances of white calibration targets set at different distances. The true distances were measured using a theodolite, and two kinds of estimation are provided. The first estimation is obtained using the camera calibration with respect to the road at close range and the second estimation is obtained using road segmentation and fitting. Results are shown on Tab. 1. It appears that errors on distance estimation can be important for large distances when the flat world assumption is not valid; but when it is valid the error is no more than 11%, which is satisfactory. A video format image is processed in a few seconds, but can be optimized further.
4
Conclusion
We tackle the original question of how to estimate the geometrical visibility range of the road from a vehicle with only one camera along inter-urban roads. This application is new and of importance in the field of transportation. It is a difficult inverse problem since 3D distances must be estimated using only one 2D view. However, we propose a solution based first on a fast and robust segmentation of the road region using local color features, and second on parametrized fitting of the segmented region using a priori knowledge we have on road regions. The segmentation is robust to lighting and road color variations thanks to a backward processing. The proposed original fitting algorithm is another new illustration of the power of half-quadratic theory. An extension of this algorithm is also proposed to estimate the position of the vanishing line in each image. We validated the good accuracy of the proposed approach for flat roads. In the future, we will focus on the combination of the proposed approach with stereovision, to handle the case of non-flat roads.
References 1. Turk, M., Morgenthaler, D., Gremban, K., Marra, M.: VITS - a vision system for autonomous land vehicle navigation. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(3), 342–361 (1988) 2. Liou, S., Jain, R.: Road following using vannishing points. Comput. Vision Graph. Image Process 39(1), 116–130 (1987) 3. Crisman, J., Thorpe, C.: Color vision for road following. In: Proc. of SPIE Conference on Mobile Robots, Cambridge, Massachusetts (1988) 4. Guichard, F., Tarel, J.P.: Curve finder combining perceptual grouping and a kalman like fitting. In: ICCV 1999. IEEE International Conference on Computer Vision, Kerkyra, Greece, IEEE Computer Society Press, Los Alamitos (1999) 5. Southall, C., Taylor, C.: Stochastic road shape estimation. In: Proceedings Eighth IEEE International Conference on Computer Vision, vol. 1, pp. 205–212. IEEE Computer Society Press, Los Alamitos (2001) 6. Tarel, J.P., Ieng, S.S., Charbonnier, P.: Using robust estimation algorithms for tracking explicit curves. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 492–507. Springer, Heidelberg (2002) 7. Wang, Y., Shen, D., Teoh, E.: Lane detection using spline model. Pattern Recognition Letters 21, 677–689 (2000) 8. Dementhon, D.: Reconstruction of the road by matching edge points in the road image. Technical Report Tech. Rep. CAT-TR-368, Center for Automation Research, Univ Maryland (1988) 9. Dickmanns, E., Mysliwetz, B.: Recursive 3D road and relative ego-state recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 199–213 (1992) 10. Chapuis, R., Aufrere, R., Chausse, F.: Recovering a 3D shape of road by vision. In: Proc. of the 7th Int. Conf. on Image Processing and its applications, Manchester (1999) 11. Rasmussen, C.: Texture-based vanishing point voting for road shape estimation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 470–477. IEEE Computer Society Press, Los Alamitos (2004) 12. Thorpe, C., Hebert, M., Kanade, T., Shafer, S.: Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(3), 362–373 (1988) 13. Sandt, F., Aubert, D.: Comparaison of color image segmentations for lane following. In: SPIE Mobile Robot VII, Boston (1992) 14. Crisman, J., Thorpe, C.: Scarf: a color vision system that tracks roads and intersections. IEEE Transactions on Robotics and Automation 9(1), 49–58 (1993) 15. Crisman, J., Thorpe, C.: Unscarf, a color vision system for the detection of unstructured roads. In: Proc. Of IEEE International Conference on Robotics And Automation, pp. 2496–2501. IEEE Computer Society Press, Los Alamitos (1991) 16. Charbonnier, P., Nicolle, P., Guillard, Y., Charrier, J.: Road boundaries detection using color saturation. In: European Conference (EUSIPCO). European Conference, Ile de Rhodes, Gr`ece, pp. 2553–2556 (1998)
Image Segmentation Using Co-EM Strategy Zhenglong Li, Jian Cheng, Qingshan Liu, and Hanqing Lu National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences P.O. Box 2728, Beijing (100080), P.R. China {zlli,jcheng,qsliu,luhq}@nlpr.ia.ac.cn
Abstract. Inspired by the idea of multi-view learning, we propose an image segmentation algorithm using a co-EM strategy in this paper. Image data are modeled using a Gaussian Mixture Model (GMM), and two sets of features, i.e. two views, are employed within a co-EM strategy instead of the conventional single-view EM to estimate the parameters of the GMM. Compared with single-view GMM-EM methods, the proposed segmentation method using the co-EM strategy has several advantages. First, the imperfectness of a single view can be compensated by the other view in the co-EM. Second, by employing two views, the co-EM strategy offers more reliable segmentation results. Third, the local-optimality drawback of single-view EM can be overcome to some extent. Fourth, the convergence rate is improved; the average running time is far less than that of single-view methods. We test the proposed method on a large number of images with no specified contents. The experimental results verify the above advantages, and the proposed method outperforms single-view GMM-EM segmentation methods.
1
Introduction
Image segmentation is an important pre-process for many higher-level vision understanding systems. The basic task of image segmentation is dividing an input image, according to some criteria, into foreground objects and background objects. The most commonly used criteria can be categorized into three classes, i.e. global knowledge based, region based (homogeneity) and edge based [1]. The global knowledge based methods usually refer to thresholding using global knowledge of an image histogram. The homogeneity criteria assume that the meaningful foreground and background objects in an image comprise homogeneous regions in the sense of some homogeneity metric. Usually the resulting segments are not exactly the objects: only partial segmentation [1] is fulfilled, in which the segmentation results are regions with homogeneous characteristics. In the higher levels of an image understanding system, the partial segmentation results can be regrouped to correspond to real objects with the aid of specific domain knowledge. This paper focuses on partial image segmentation. The segmentation of natural images or images with rich textural content is a challenging task, because of the influences of texture, albedo, the unknown number
of objects, and amorphous object shapes, etc. Some methods [2, 3, 4] have been proposed to deal with natural image segmentation. Amongst them, EdgeFlow [2] is a successful edge-based method and shows fine performance on generic image segmentation. But due to its edge-based nature, EdgeFlow's performance relies heavily on post-processing, such as spur trimming, edge linking, etc. The work of [4], which uses Gaussian Mixture Models (GMM) to model image content, shows some success in the application of content-based image retrieval. The parameters of the GMM are solved by the Expectation Maximization (EM) method. This GMM-EM based approach is more robust than edge-based methods because of its region-based essence, and there is no need for post-processing to form closed and continuous boundaries, which heavily influences the quality of segmentation results. The GMM-EM based methods estimate maximum a posteriori parameters of a GMM with the EM algorithm, use this GMM as a classifier to assign labels to all points in an N-D feature space, and then re-map the labels to the 2-D image plane to achieve a partition of the image. The first step is feature extraction, in which the 2-D image is transformed into an N-D feature space; the feature vectors are then clustered according to some metric. Usually only a single feature set is used in GMM-EM based methods. Although the GMM-EM method with a single feature set in [3, 4] achieves some success in generic image segmentation on the CorelTM image dataset, several drawbacks remain with this single-feature-set strategy. First, a feature set is usually imperfect at discriminating some details of an image. Second, to get more reliable results, more features need to be incorporated, while a high-dimensional feature space suffers from the problem of over-fitting. Third, EM is in essence a local-optimization procedure; it is prone to getting stuck at a local optimum, so that the algorithm gives improper segmentation results. To solve the above problems, we introduce the co-EM strategy (or multi-view strategy) in this paper. Here we define the term view, which will be used later in this paper: if the feature domain can be divided into disjoint subsets and each subset is sufficient to learn the targets [5], each such subset is called a view. The idea of multi-view learning was proposed [6, 7] for text or Web page classification. It is an extension of the co-training of Blum and Mitchell [7]. The difference between co-training and co-EM is that the former adopts only the most confident labels in each iteration, whereas the latter uses all the trained labels during each iteration (for details, please cf. [6, 5]). In this paper, we propose a method using the co-EM strategy for the segmentation of images with natural content or rich texture. In the proposed method, an image is first modeled by a finite GMM; then the co-EM algorithm employs two views to solve the parameters of the GMM and to label all pixels in the image, which yields the segmentation result. Compared with one-view based methods, the proposed method has the following advantages: Compensated imperfectness of a single view. A single view is usually imperfect in some aspect of discrimination. In image segmentation, this imperfectness brings problems such as imprecise boundaries or even
wrong boundaries. By introducing the co-EM strategy, the two views can augment each other, and to some extent the imperfectness of each view is compensated by the other, giving better segmentation results. More reliability. One option for improving reliability is to include more features in the feature space. While a higher-dimensional feature space improves reliability, single-view based methods are apt to incur the curse of over-fitting, and the higher dimension exacerbates the computational burden. The co-EM strategy employs two views by turns, so more information is provided while the dimension of each feature space is kept relatively small; hence the proposed method is more reliable and has low computational overhead. Improved local optimality. The solution found by EM is not optimal in the global sense, and improper initial parameter estimates are prone to getting stuck at a local optimum. In the co-EM strategy, when the estimates get stuck at a local extremum in one view, the other view will "pull" the evolution away from it, and vice versa. Until a consensus between the two views is achieved, the evolution of co-EM does not stop seeking the optimal solution. Therefore the proposed method with the co-EM strategy can, to some extent, avoid the local optimality of single-view based GMM-EM. Accelerated convergence rate. Because the two views augment each other, the convergence rate of the co-EM strategy is faster than that of classical single-view EM. We notice that [8] recently gave a similar segmentation method using co-EM. However, there exist several serious problems with the work of [8], and we give a short comparison with [8] in Section 2. The rest of the paper is organized as follows. In Section 2, we briefly review related work, with an emphasis on the comparison with [8]. The algorithms using the co-EM strategy are proposed in Section 3. Section 4 gives the experiments and analysis. Finally, we conclude the paper in Section 5.
2
Related Works
In [3,4], GMM-EM algorithms achieve some success in generic image segmentation. The keys to their success can be summarized as: 1) fine feature descriptors of image content; 2) a scheme for selecting proper initial parameter estimates for EM; 3) the Minimum Description Length (MDL) criterion to determine the proper number of Gaussians in the GMM. However, only one view is used in their work, which leads to the issues of view imperfectness and reliability mentioned in the last section. We notice that a similar concept of co-EM was recently proposed for image segmentation in [8]. However, there are several critical problems with [8]. First, no consideration is given to the initial parameter estimate for EM in [8]. The initial parameter estimate for EM is important: improper initial parameters are apt to give wrong, under-segmented results because of local optimality. Second, it is curious that only the RGB channels and the 2-D spatial coordinates are chosen as the two views.
This split of the feature domain breaks the rule that each view should be sufficient to learn the object, i.e. both views must be strong views [9]. Third, only two images are tested in [8]. Compared with the work of [8], our method considers the sensitivity of EM to initial parameter estimation, and the features are split into two strong views. Finally, the proposed method is tested on a large number of images and the experimental results are meaningful and promising.
3
Image Segmentation Using Co-EM Strategy
We use a finite GMM to model the content of an image [3]. The first step is to extract features for the co-EM strategy. In our tests, we choose Carson's features and Gabor features as the two views to describe the content of the image. Then the co-EM strategy is applied to solve the parameters of the GMM and produce the segmentation results.
3.1 Feature Extraction
For the co-EM strategy, the views must possess two properties [9]: 1) uncorrelated: for any labeled instance, the descriptions of it in the two views should not be strongly correlated; 2) compatibility: each view should give the same label for any instance. The uncorrelated property implies that each view should be a "strong" view that can learn the objects by itself. Compatibility means that the views should be consistent in describing the nature of objects: there is no contradiction between the resulting classifications. These two properties should be strictly obeyed by the co-EM strategy (although there exist co- strategies that deal with views violating the two properties, i.e. "weak" views, they need human intervention and are not suitable in our case; for details we refer the reader to [9, 5]). We choose two views, Carson's features [3, 4] and Gabor features [10, 11], in our experiments. Although a rigorous proof of the uncorrelatedness and compatibility of the two views is difficult, the experiments in Section 4 show that the chosen views satisfy the criteria. In the following we describe Carson's features and the Gabor features respectively.
Carson's features. The features are mainly composed of three parts: color, texture and normalized 2-D coordinates. The 3-D color descriptor adopts the CIE 1976 L*a*b* color space for its perceptual uniformity. The texture descriptor used in Carson's features is slightly more complicated, so we give a more detailed elucidation of it. The texture descriptors comprise anisotropy, normalized texture contrast, and polarity. For anisotropy and normalized texture contrast, the second moment matrix

    M_σ(x, y) = G_σ(x, y) ∗ ∇L (∇L)^T,    (1)

should be computed first, where L is the L* component of the L*a*b* color, and G_σ is a smoothing Gaussian filter with standard deviation σ. Then the eigenvalues λ1, λ2 (λ1 > λ2) and corresponding eigenvectors l1, l2 of M_σ are computed.
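As an illustration of how this texture part of Carson's features can be computed, the sketch below builds the smoothed second-moment matrix of Eq. (1) and its eigenvalues, together with the anisotropy and contrast measures derived from them in the next paragraph. It is a sketch in Python/SciPy under our own choice of gradient operator, not the authors' implementation.

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel

    def second_moment_texture(L, sigma):
        """Eigenvalues of M_sigma (Eq. (1)) plus anisotropy a = 1 - l2/l1 and
        normalized contrast c = 2*sqrt(l1 + l2) for every pixel of the L* image."""
        Lx = sobel(L.astype(float), axis=1)          # dL/dx
        Ly = sobel(L.astype(float), axis=0)          # dL/dy
        # entries of M_sigma = G_sigma * (grad L)(grad L)^T
        Mxx = gaussian_filter(Lx * Lx, sigma)
        Mxy = gaussian_filter(Lx * Ly, sigma)
        Myy = gaussian_filter(Ly * Ly, sigma)
        # closed-form eigenvalues of the 2x2 symmetric matrix, l1 >= l2
        tr = Mxx + Myy
        disc = np.sqrt((Mxx - Myy) ** 2 + 4.0 * Mxy ** 2)
        l1, l2 = 0.5 * (tr + disc), 0.5 * (tr - disc)
        a = 1.0 - l2 / np.maximum(l1, 1e-12)         # anisotropy
        c = 2.0 * np.sqrt(np.maximum(l1 + l2, 0.0))  # normalized texture contrast
        return l1, l2, a, c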
The anisotropy is given as a = 1 − λ2/λ1 and the normalized texture contrast is c = 2√(λ1 + λ2). To generate the polarity descriptors at specific scales, the first step is scale selection using the polarity property of the image [12]:

    p_σ = |E+ − E−| / (E+ + E−),    (2)

where E+ = Σ_{x,y} G_σ(x, y) [∇L · n]+ and E− = Σ_{x,y} G_σ(x, y) [∇L · n]−. Here p_σ is the polarity at one point of the image; G_σ is a Gaussian filter with standard deviation σ; n is a unit vector perpendicular to l1; and the operator [·]+ (or [·]−) is the rectified positive (or negative) part of its argument. For each point (x, y) in the image, according to Eq. (2), σ takes the values k = 0, 1/2, 2/2, . . . , 7/2. The resulting polarity images are then convolved with Gaussians of standard deviation 2k = 0, 1, 2, . . . , 7, and for each point (x, y) the scale is selected as k if (p_k − p_{k−1})/p_k < 0.02. The last part of Carson's features is the coordinates (x, y) normalized to the range [0, 1]. This 2-D coordinate describes spatial coherence and prevents over-segmentation. The resulting Carson feature space is 8-D.
Gabor features. Gabor filters can be considered as Fourier basis functions modulated by Gaussian windows, and a Gabor filter bank describes particular spatial frequencies and orientations locally. The Gabor filters used in this paper are

    g(x, y) = (1 / (2π σx σy)) exp( −(1/2)(x²/σx² + y²/σy²) + 2πjW x ),    (3)
    G(u, v) = exp( −(1/2)( (u − W)²/σu² + v²/σv² ) ),    (4)

with σu = 1/(2π σx) and σv = 1/(2π σy). Eq. (3) is a 2-D Gabor filter in the spatial domain, and Eq. (4) is the Fourier transform of Eq. (3). The Gabor filter bank should cover the whole effective frequency domain while reducing redundancy as much as possible. In our tests, the numbers of orientations and scales are set to 6 and 4 respectively (for a detailed derivation of the Gabor filter bank parameters, please cf. [10]). Therefore there are 24 filters in the Gabor filter bank, and the Gabor feature dimension is 24.
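A possible implementation of the 6-orientation, 4-scale bank of Eq. (3) is sketched below; the center frequencies and bandwidths are illustrative placeholders rather than the values derived in [10], and the 24-D feature is taken as the magnitude of each filter response.

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(W, sigma_x, sigma_y, theta, size=31):
        """Complex 2-D Gabor of Eq. (3), rotated to orientation theta."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        xr = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        return (1.0 / (2.0 * np.pi * sigma_x * sigma_y)
                * np.exp(-0.5 * (xr**2 / sigma_x**2 + yr**2 / sigma_y**2)
                         + 2j * np.pi * W * xr))

    def gabor_features(gray, n_orient=6, n_scale=4):
        """H x W x 24 stack of |response| images for the 6 x 4 Gabor bank."""
        feats = []
        for s in range(n_scale):
            W = 0.05 * (2 ** s)                # assumed center frequencies
            sigma = 1.0 / (2.0 * np.pi * W)    # bandwidth tied to W (assumption)
            for o in range(n_orient):
                theta = o * np.pi / n_orient
                resp = fftconvolve(gray.astype(float),
                                   gabor_kernel(W, sigma, sigma, theta),
                                   mode='same')
                feats.append(np.abs(resp))
        return np.stack(feats, axis=-1)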
3.2 Co-EM Strategy
We use the well-known finite GMM to model the content of images. The single-view EM algorithm is composed of two steps: the E-step and the M-step. In the E-step, the expectation is evaluated, and this expectation is maximized in the M-step. For brevity, we do not give the solution equations of the standard GMM-EM here (we refer the interested reader to [13]).
We use the co-EM strategy to estimate the parameters of the GMM. The idea of co-EM is to train two classifiers in the two views respectively, and to let them suggest labels to each other by turns during the EM training process. The two views employed in our experiments are Carson's features and the Gabor features. We give the pseudocode for the co-EM strategy in Fig. 1. Lines 1 to 3 are the initialization stage of co-EM. At line 3, Initial Label1 is fed to TrainingClassifier to get an initial estimate of the GMM parameters; TrainingClassifier is a classical GMM-EM solver. Lines 4-19 are the main loop of the co-EM algorithm. Lines 5-10 train a classifier Classifier1 in view View1 and produce Label1 for all the training points in View1 using Classifier1. Note that at line 8 (line 14), TrainingClassifier uses Label1 (Label2) from View1 (View2) to aid training Classifier2 (Classifier1) in View2 (View1). Lines 12-16 serve the same purpose as lines 5-10, but are carried out in View2 instead of View1. The cooperation of View1 and View2 by means of EM is embodied at lines 8 and 14: the labels learned from one view are suggested to the other view to aid learning the objects by turns.
co-EM
    Input: View1, View2, Initial Label1
    1   Label1 ← 0, Label2 ← 0
    2   counter ← 0, flag1 ← false, flag2 ← false
    3   Classifier1 ← TrainingClassifier(Initial Label1, View1)
    4   while counter < max iteration or flag1 = false or flag2 = false do
    5       Label1 ← LabelingData(Classifier1, View1)
    6       if IsLabelFull(Label1) = false then
    7           goto Step 16
    8       else Classifier2 ← TrainingClassifier(Label1, View2)
    9       end if
    10      Label2 ← LabelingData(Classifier2, View2)
    11      flag1 ← IsConverged(Classifier1, View1)
    12      if IsLabelFull(Label2) = false then
    13          goto Step 18
    14      else Classifier2 ← TrainingClassifier(Label2, View1)
    15      end if
    16      Classifier1 ← TrainingClassifier(Label2, View1)
    17      flag2 ← IsConverged(Classifier2, View2)
    18      counter ← counter + 1
    19  end while

Fig. 1. The pseudocode of the co-EM strategy
The initial parameter estimate plays a key role in the performance of EM-like algorithms: improper initial parameters will produce a locally optimal solution and result in wrong segmentation. In the experiments, the test image is given several fixed initial partitioning templates for co-EM as an attempt to avoid locally optimal solutions.
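To make the alternation of Fig. 1 concrete, the sketch below pairs two scikit-learn Gaussian mixtures, one per view, and lets each pass its predicted labels to the other by seeding the component means. It is a simplified reading of the pseudocode (the IsLabelFull/goto branches are dropped), with the number of components K and the initial labels coming from one of the fixed partitioning templates mentioned above; scikit-learn is our choice here, not a dependency of the paper.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_classifier(labels, view, K=3):
        """Fit a K-component GMM on one view, seeding the means with the
        per-label means suggested by the other view (our reading of
        TrainingClassifier in Fig. 1).  Assumes every label 0..K-1 occurs."""
        means = np.stack([view[labels == k].mean(axis=0) for k in range(K)])
        gmm = GaussianMixture(n_components=K, covariance_type='full',
                              means_init=means, max_iter=20)
        gmm.fit(view)
        return gmm

    def co_em_segment(view1, view2, init_labels, K=3, max_cycles=10):
        """Alternate EM between the two views until the labels agree."""
        labels = init_labels
        for _ in range(max_cycles):
            gmm1 = train_classifier(labels, view1, K)
            labels1 = gmm1.predict(view1)            # labels suggested by view 1
            gmm2 = train_classifier(labels1, view2, K)
            labels2 = gmm2.predict(view2)            # labels suggested by view 2
            if np.array_equal(labels2, labels):      # consensus of the two views
                break
            labels = labels2
        return labels

    # view1: N x 8 Carson features, view2: N x 24 Gabor features (row-aligned),
    # init_labels: N integer labels from a fixed initial partitioning template.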
The determination of the number of Gaussians, K, in the GMM is another critical problem in the co-EM strategy. In the experiments, K is set to 3 for each image. (In [4], the MDL criterion is used to determine the number of Gaussians, but we found that a fixed K of 3 works well in most situations in our experiments; another promising method for estimating K is the Dirichlet Process, but although it works for our cases, we still use the fixed-K scheme in the current experiments for efficiency.)
4
Experiments
Fig. 2 shows comparison between results by co-EM strategy and single-view EM respectively. The second column is the result by the proposed co-EM strategy using two views: Carson’s features and Gabor features. The third column is the result by single-view based EM using Carson’s features, and the rightmost
(Fig. 2 panels (a)-(l); running times: (b) t_coEM = 13.53 sec, (c) t_Carson = 25.61 sec, (d) t_Gabor = 26.42 sec; (f) t_coEM = 12.84 sec, (g) t_Carson = 16.54 sec, (h) t_Gabor = 17.83 sec; (j) t_coEM = 13.38 sec, (k) t_Carson = 31.16 sec, (l) t_Gabor = 23.97 sec)
Fig. 2. Comparison between the results of the co-EM strategy and single-view GMM-EM methods. The second column shows the segmentation results of the co-EM strategy using two views: Carson's features and Gabor features. The third column shows the results of one-view EM using Carson's features, and the rightmost column those using Gabor features. The gray regions in the segmentation results are invalid regions.
(Fig. 3 shows segmentation results for 17 Corel images: 108004, 41069, 42049, 66053, 97033, 291000, 62096, 134052, 45096, 126007, 156065, 41004, 100080, 78004, 189080, 55075, 302008. Below the images, a table lists the per-image running times in seconds for co-EM, EM (Carson) and EM (Gabor); e.g., for image 108004 the times are 7.76, 16.54 and 11.71 sec respectively.)
Fig. 3. Segmentation results using the co-EM strategy, and running-time comparison between the co-EM strategy and single-view GMM-EM methods using Carson's features and Gabor features
column by Gabor features. The gray regions in the segmentation results represent invalid regions, i.e. regions with low confidence of belonging to any object. Observe the leopard in the top row of the figure. Using only Carson's features, the leopard and the tree trunk are wrongly classified as one object, while in the Gabor feature domain the image is over-segmented: the boundaries get distorted. However, the proposed co-EM strategy gives a finer result than either one-view strategy: the leopard and the tree trunk are correctly classified as two parts, and the boundaries fit the real object boundaries well (see the top-left subfigure in Fig. 2). The remaining results all show that the co-EM strategy gives results superior to those of one-view EM. We list the running time under each segmentation result; the proposed co-EM method shows the fastest convergence rate while giving the finer segmentation results. An interesting phenomenon can be observed in Fig. 2: the cooperation of two views, e.g. the Carson and Gabor views in our experiment, can produce finer results than either single view. This can be explained as follows. A single view usually has some imperfectness that impairs classification performance. In our case, the ability of Carson's features to discriminate texture is relatively weak, whereas the Gabor features place too much weight on texture discrimination. Therefore, in the Carson feature space textures are not finely discriminated, while Gabor-feature based algorithms are prone to over-segmenting textured regions. With the co-EM strategy, the drawbacks of the two views are compensated to some extent, and hence the proposed co-EM algorithm outperforms single-view based methods. In Fig. 3, we show some segmentation results of the proposed co-EM strategy. The number marked under each segmentation result is the image name in the CorelTM image database. In the lower part, a table shows the running times of the proposed co-EM strategy with 8-D Carson's features and 24-D Gabor features, of single-view GMM-EM using Carson's features, and of single-view GMM-EM using Gabor features. The average times for these three methods are 15.52 sec, 21.49 sec and 26.43 sec respectively. The proposed method is faster than single-view EM using the Carson view and the Gabor view by 27.8% and 41.3% respectively.
5
Conclusion
In this paper, we proposed a co-EM strategy for generic image segmentation. Two views are employed in the co-EM strategy: Carson's features and Gabor features. The proposed method shows advantages over classical single-view based methods, and we tested it on a large number of images. The experimental results are promising and verify the proposed method.
Acknowledgment We would like to acknowledge support from Natural Sciences Foundation of China under grant No. 60475010, 60121302, 60605004 and No. 60675003.
References 1. Sonka, M., Hilavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision, 2nd edn. Brooks/Cole (1998) 2. Ma, W., Manjunath, B.S.: EdgeFlow: A technique for boundary detection and image segmentation. IEEE Trans. Image Process 9(8), 1375–1388 (2000) 3. Belongie, S., Carson, C., Greenspan, H., Malik, J.: Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In: IEEE Proc. Int. Conf. Computer Vision, pp. 675–682. IEEE Computer Society Press, Los Alamitos (1998) 4. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using Expectation-Maximization and its application to image querying. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1026–1038 (2002) 5. Muslea, I., Mintion, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proc. Int. Conf. Machine Learning, pp. 435–442 (2002) 6. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proc. Intl Conf. of Information and Knowledge Management, pp. 86–93 (2000) 7. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Annual Workshop on Computational Learning Theory, pp. 92–100 (1998) 8. Yi, X., Zhang, C., Wang, J.: Multi-view EM algorithm and its application to color image segmentation. In: IEEE Proc. Int. Conf. Multimedia and Expo, pp. 351–354. IEEE Computer Society Press, Los Alamitos (2004) 9. Muslea, I., Minton, S.N., Knoblock, C.A.: Active learning with strong and weak views: A case study on wrapper induction. In: Proc. Int. Joint Conf. on Artificial Intelligence, pp. 415–420 (2003) 10. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996) 11. Daugman, J.G.: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of Optical Society of America A 2(7) (1985) 12. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 13(9), 891–906 (1991) 13. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
Co-segmentation of Image Pairs with Quadratic Global Constraint in MRFs Yadong Mu and Bingfeng Zhou Institute of Computer Science and Technology Peking University, Beijing, 100871 {muyadong,zhoubingfeng}@icst.pku.edu.cn
Abstract. This paper provides a novel method for co-segmentation, namely simultaneously segmenting multiple images with the same foreground and distinct backgrounds. Our contribution is primarily four-fold. First, image pairs are typically captured under different imaging conditions, which makes the color distribution of the desired object shift greatly and brings challenges to color-based co-segmentation; here we propose a robust regression method to minimize color variances between corresponding image regions. Secondly, although it has been intensively discussed, the exact meaning of the term "co-segmentation" is rather vague and the importance of the image background has previously been neglected; this motivates us to provide a novel, clear and comprehensive definition of co-segmentation. Thirdly, specific regions tend to be categorized as foreground, so we introduce a "risk term" to differentiate colors, which, to the best of our knowledge, has not been discussed before in the literature. Lastly and most importantly, unlike the conventional linear global terms in MRFs, we propose a sum-of-squared-difference (SSD) based global constraint and deduce its equivalent quadratic form, which takes into account pairwise relations in feature space. Under reasonable assumptions, the global optimum can be efficiently obtained via alternating Graph Cuts.
1
Introduction
Segmentation is a fundamental and challenging problem in computer vision. Automatic segmentation [1] is possible yet prone to error. After the well-known Graph Cuts algorithm was utilized in [2], there was a burst of interactive segmentation methods ([3], [4] and [5]). It has also been shown that fusing information from multiple modalities ([6], [7]) can improve segmentation quality. However, as argued in [8], segmentation from one single image is too difficult, and recently there has been much research interest in multiple-image based approaches. In this paper we focus on co-segmentation, namely simultaneously segmenting an image pair containing identical objects and distinct backgrounds. The term "co-segmentation" was first introduced into the computer vision community by Carsten Rother [8] in 2006. The areas where co-segmentation is potentially useful are broad: automatic image/video object extraction, image partial distance,
Fig. 1. Experimental results for our proposed co-segmentation approach
video summarization and tracking. Due to space considerations, we focus on the technique of co-segmentation itself and discuss little about its applications. We try to solve several key issues in co-segmentation. Traditional global terms in MRFs are typically linear functions and can be optimized in polynomial time [9]. Unfortunately, such linear terms are too limited. Highly non-linear, challenging global terms [8] have been proposed for the goal of co-segmentation, but their optimization is NP-hard. Moreover, although it has been intensively discussed, the exact meaning of the term "co-segmentation" is rather vague and the importance of the image background has previously been neglected. In this paper, we present a more comprehensive definition and a novel probabilistic model for co-segmentation, introduce a quadratic global constraint that can be efficiently optimized, and propose a Risk Term which proves effective in boosting segmentation quality.
2 Generative Model for Co-segmentation
2.1 Notations
The inputs for co-segmentation are image pairs, and it is usually required that each pair contains image regions corresponding to identical objects or scenes. Let K = {1, 2} and Ic = {1, . . . , N} be two index sets, ranging over images and pixels respectively; k and i are elements of them. Zk and Xk are random vectors of image measurements and pixel categories (foreground/background in the current task), and zki or xki represents the i-th element of the k-th image. We assume images are generated according to some unknown distribution, and that each pixel is sampled independently. The parameters of the image generation model can be divided into two parts, related to the foreground and background regions respectively. Let θkf and θkb denote the object/background parameters of the k-th image.
2.2 Graphical Models for Co-segmentation
Choosing appropriate image generation models is the most crucial step in cosegmentation. However, such models are not obvious. As in the previous work in
Fig. 2. Generative models for co-segmentation. (a) Rother’s model (refer to [8] for details) based on hypothesis evaluation. J = 1 and J = 0 correspond to the hypothesis that image pairs are generated with/without common foreground model respectively. (b) Generative model proposed in this paper for co-segmenting.
[8], Rother et al. selected 1D-histogram based image probabilistic models, whose graphical model is drawn in Figure 2(a). As can be seen, Rother's approach relies on hypothesis evaluation, namely choosing the parameters maximizing the desired hypothesis that the two images are generated in the manner of sharing non-trivial common parts. It can also be equivalently viewed as maximizing the joint probability of the observed image pair and the hidden random vectors (specifically, θkf, θkb and Xk in Figure 2(a), where k ranges over {1, 2}). However, the above-mentioned generative models, although flexible, are not practical. The drawbacks lie in several aspects. Firstly, Rother's model makes too many assumptions for the purpose of feasibility, which complicates parameter estimation and makes the model sensitive to noise. Some model parameters cannot even be unbiasedly estimated due to the lack of sufficient training samples; for some parameters only one sample can be found. An example of this is that the image likelihoods under hypothesis J = 0 are almost always equal to 1, which is certainly not the true case. Secondly, the final deduced global term in [8] is highly non-linear. In fact it can be regarded as the classical 1-norm if we treat each histogram as a single vector, which complicates the optimization for optimal pixel labeling. Lastly and most importantly, the authors did not seriously take into account the relation between the background models of the image pair. Let hkf and hkb denote the image measurement histograms (typically color or texture) of the foreground/background of the k-th image. The final energy function to be minimized in [8] only contains an image generation term proportional to Σ_z |h1f(z) − h2f(z)|, while the background parameters disappear. This greedy strategy sometimes leads to mistakes. Here we argue that the effect of the background cannot be neglected. An example to illuminate our idea is given in Figure 3, where two segmentation results are shown for comparison. In case 1, the extracted foregrounds match each other perfectly if one just compares their color histograms. However, the segmentation in case 2 seems preferable, although the purple regions in
Fig. 3. An example to illustrate the relation between ”optimality” and ”maximality”. The purple region in the bottom image is slightly larger than the top image’s. If we only consider foreground models as in [8], case 1 is optimal. However, it is not maximal, since the purple regions are supposed to be labeled as foreground as in case 2.
the two images differ greatly in size. In other words, we should consider both "optimality" and "maximality". Case 1 is an extreme example, which is optimal according to the aforementioned criteria, yet not maximal. We argue that the task of co-segmentation can be regarded as finding the maximal common parts between two feature sets together with spatial consistency. Unlike [8], we obtain maximality by introducing large penalties if the backgrounds contain similar contents. A novel energy term concerning the image backgrounds is proposed and detailed in Section 4. Our proposed graphical model is shown in Figure 2(b). At each phase, we optimize over X on one image by assuming the parameters of the other image are known (note that θ̂ in Figure 2(b) is colored in gray since its value is known). We solve this optimization using alternating Graph Cuts, which is illustrated in Figure 4. The joint probability to be maximized can be written as:

    X* = arg max_X P(X) P(Z|X, θ̂)    (1)

Solving this optimization problem is equivalent to finding the minimum of its negative logarithm. Denote E1 = −log P(X) and E2 = −log P(Z|X, θ̂); for convenience we use the latter log form.
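The sketch below shows the structure of this alternating optimization: color histograms built from the current segmentation of one image supply the data term for a graph cut on the other image, and the two images take turns. It is a structural sketch only; PyMaxflow is our choice of min-cut solver (any s-t min-cut library would do), the unaries are generic -log histogram likelihoods rather than the exact terms derived in Section 4, and which side of the cut is called "foreground" depends on the solver's terminal convention.

    import numpy as np
    import maxflow   # PyMaxflow -- an assumption; any s-t min-cut solver works

    def ab_hist(ab, mask, bins=32):
        """Normalized 2-D histogram of the a/b channels over a binary mask."""
        h, _, _ = np.histogram2d(ab[..., 0][mask], ab[..., 1][mask],
                                 bins=bins, range=[[0, 256], [0, 256]])
        return h / max(h.sum(), 1.0)

    def cut(ab, h_fg, h_bg, lam=50.0, bins=32):
        """One graph cut: -log histogram likelihoods as unaries, Potts smoothness."""
        ia = np.clip((ab[..., 0] / 256.0 * bins).astype(int), 0, bins - 1)
        ib = np.clip((ab[..., 1] / 256.0 * bins).astype(int), 0, bins - 1)
        cost_fg = -np.log(h_fg[ia, ib] + 1e-6)
        cost_bg = -np.log(h_bg[ia, ib] + 1e-6)
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(ia.shape)
        g.add_grid_edges(nodes, weights=lam)         # spatial smoothness
        g.add_grid_tedges(nodes, cost_fg, cost_bg)   # data term
        g.maxflow()
        return g.get_grid_segments(nodes)            # boolean label map

    def alternating_graph_cuts(ab1, ab2, seg1, seg2, cycles=4):
        """ab1, ab2: H x W x 2 a/b channels; seg1, seg2: boolean initial masks."""
        for _ in range(cycles):
            seg2 = cut(ab2, ab_hist(ab1, seg1), ab_hist(ab1, ~seg1))
            seg1 = cut(ab1, ab_hist(ab2, seg2), ab_hist(ab2, ~seg2))
        return seg1, seg2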
3
Preprocessing by Color Rectification
It is well known that the RGB color space is not perceptually uniform, and the three channels are not independent. It was previously argued in [10] that proper color coordinate transformations are able to partition RGB-space differently. Similar
Fig. 4. Illustration of alternating Graph Cuts. The optimization is performed in an alternating style: the vertical arrows denote optimization with graph cuts, while the horizontal arrows indicate building color histograms from the segmentation X and the pixel measurements Z.
to the ideas used in intrinsic images [11], we abandon the intensity channel and keep solely the color information. In practice, we first transform the images from RGB space to CIE-LAB space, where the L channel represents the lightness and the other two channels carry the color. After that, we perform color rectification in two steps:
– Step One: Extract local feature points from each image, and find their correspondences in the other image if they exist.
– Step Two: Sample colors from a small neighborhood of the matching points, and use linear regression to minimize the color variance.
In step one, we adopt SIFT [12] to detect feature points. SIFT points are invariant to rotation, translation and scaling, and partly robust to affine distortion. They also show high repeatability and distinctiveness in various applications and work well for our task. Typically we can extract hundreds of SIFT points from each image, while the number of matching point pairs varies with the input. An example of the SIFT matching procedure can be found in Figure 8; in its middle column, matching pixels are connected with red lines. The matching points are further used to perform linear regression [13] within each color channel. Colors are scaled and translated to match their correspondences, so that the color variances between the image pair are minimized in the least-squared-error (LSE) sense. An example can be found in Figure 5. Robust methods such as RANSAC can be exploited to remove outliers.
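The two-step procedure above can be sketched with OpenCV's SIFT and a per-channel least-squares fit; this is an illustration under our own simplifications (colors are sampled at the keypoints themselves rather than averaged over a neighborhood, the standard ratio test replaces manual match pruning, and np.polyfit stands in for a RANSAC-robustified regression).

    import cv2
    import numpy as np

    def rectify_colors(img_a, img_b):
        """Map img_b's a/b channels onto img_a's with a per-channel affine fit
        (scale + offset) estimated from colors at matched SIFT keypoints."""
        sift = cv2.SIFT_create()
        ka, da = sift.detectAndCompute(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), None)
        kb, db = sift.detectAndCompute(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), None)
        knn = cv2.BFMatcher().knnMatch(db, da, k=2)
        good = [m[0] for m in knn
                if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]
        if len(good) < 4:
            return img_b                               # too few matches to fit
        pa = np.float32([ka[m.trainIdx].pt for m in good])   # points in img_a
        pb = np.float32([kb[m.queryIdx].pt for m in good])   # matches in img_b
        lab_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2LAB)
        lab_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2LAB)
        out = lab_b.astype(np.float32)
        for ch in (1, 2):                              # a and b channels only
            ya = lab_a[pa[:, 1].astype(int), pa[:, 0].astype(int), ch].astype(float)
            yb = lab_b[pb[:, 1].astype(int), pb[:, 0].astype(int), ch].astype(float)
            scale, offset = np.polyfit(yb, ya, 1)      # least-squared-error fit
            out[..., ch] = np.clip(scale * out[..., ch] + offset, 0, 255)
        return cv2.cvtColor(out.astype(np.uint8), cv2.COLOR_LAB2BGR)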
4 Incorporating Global Constraint into MRFs
4.1 Notations
In this section we provide definitions for E1 and E2, which are the negative logs of the image prior and likelihood respectively. Since we focus on only one image at a time, we will drop the k subscript and reuse k to index histogram bins. We adopt the following notations for convenience:
– xi ∈ {1, −1}, where xi = 1 implies "object", otherwise background.
– Ih = {1, . . . , M} and Ic = {1, . . . , N} are index sets for histogram bins and image pixels.
Fig. 5. Illustration for color rectification. Variances of foreground colors affect final segmentation results notably (see the top images in the third column, compare it with the bottom segmentations). We operate in CIE-LAB color space. After color rectification, 1-norm of distribution difference in A-channel is reduced to 0.2245, compared with original 0.2552. And the results in B-channel are more promising, from 0.6531 to 0.3115. We plot color distribution curves in the middle column. Color curve for image A remains unchanged as groundtruth and plotted in black, while color curves for image B before/after rectification are plotted in red and blue respectively. Note that the peaks in B-channel approach groundtruth perfectly after transformation. The two experiments in rightmost column share same parameters.
– S(k) is the set of pixels that lie in histogram bin k.
– F(k) and B(k) denote the numbers of pixels belonging to the foreground/background in bin k. Specifically, F(k) = (1/2)(|S(k)| + Σ_{i∈S(k)} xi) and B(k) = (1/2)(|S(k)| − Σ_{i∈S(k)} xi), where |·| means the cardinality of a set.
– Nf and Nb denote the pixel counts labeled as foreground/background across the whole image: Nf = (1/2)(N + Σ_{i∈Ic} xi), Nb = (1/2)(N − Σ_{i∈Ic} xi).
– DIST(h1, h2) is a metric defined on histograms. We adopt a sum-of-squared-difference (SSD) form, namely DIST(h1, h2) = Σ_k (h1(k) − h2(k))².
4.2 Ising Prior for P(X)
We adopt the well-known Ising prior for P(X). Similar to [8], a preference term is added to encourage larger foreground regions, whose strength is controlled by a positive constant α. A second term is defined over neighboring pixels. This energy term can be summarized as follows:

    E1 = −α Σ_i xi + λ Σ_{i,j} cij xi xj,    (2)

where cij = exp(−||zi − zj||²/σ²) are coefficients accounting for the similarity between pixel pairs.
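For reference, Eq. (2) can be evaluated directly on a labeling; the sketch below uses a 4-connected neighborhood, which is our assumption since the neighborhood system is not stated explicitly here.

    import numpy as np

    def ising_prior_energy(x, z, alpha=0.3, lam=50.0, sigma=10.0):
        """E1 of Eq. (2).  x: H x W array of +/-1 labels; z: H x W x C pixel
        measurements; c_ij = exp(-||z_i - z_j||^2 / sigma^2) over 4-neighbors."""
        z = z.astype(float)
        e = -alpha * x.sum()
        for axis in (0, 1):
            dz = np.sum((z - np.roll(z, -1, axis=axis)) ** 2, axis=-1)
            prod = x * np.roll(x, -1, axis=axis)
            c = np.exp(-dz / sigma ** 2)
            if axis == 0:                     # drop the wrap-around pairs
                e += lam * np.sum((c * prod)[:-1, :])
            else:
                e += lam * np.sum((c * prod)[:, :-1])
        return e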
4.3 Global Term for P(Z|X, θ̂)
As argued before, the global constraint should take into account the effects of both the foreground and the background. We adopt a simple linear combination of the two, that is:

    E2 = wf DIST(ĥf, hf) − wb DIST(ĥb, hb),    (3)

where ĥ denotes the known histograms of the reference image, while h represents the histograms to be estimated. In practice we build a 2D histogram from the two color channels of LAB space. It is obvious that this global term favors maximal common parts: similar foregrounds, and backgrounds that are as different from each other as possible. For the purpose of tractability we assume wf = γ1 Nf² and wb = γ2 Nb²; then E2 can be written as:

    E2 = wf DIST(hf, ĥf) − wb DIST(hb, ĥb)
       = γ1 Nf² Σ_k (F(k)/Nf − ĥf(k))² − γ2 Nb² Σ_k (B(k)/Nb − ĥb(k))²
       = γ1 Σ_k F²(k) − γ2 Σ_k B²(k) + γ1 Nf² Σ_k ĥf²(k) − γ2 Nb² Σ_k ĥb²(k)
         − 2γ1 Nf Σ_k F(k) ĥf(k) + 2γ2 Nb Σ_k B(k) ĥb(k).    (4)

Now we will prove that Equation 4 is actually a quadratic function of X. Denote the first two terms in Equation 4 as T1, the middle two as T2, and the last two as T3, so that E2 = T1 + T2 + T3. Recall that in Equation 2, the parameter α indicates the user's preference for the ratio (foreground size)/(image size) (typically set to 0.3 in our experiments), and thus we can deduce that Σ_{i∈Ic} xi = (2α − 1)N. Based on this observation, it is easy to prove that:
– T1 = (1/2)(γ1 − γ2) Σ_{∃k: i,j∈S(k)} xi xj + Σ_{i∈Ic} pi xi + const, where pi is a coefficient unrelated to X.
– T2 is unrelated to X.
– T3 = Σ_{i∈Ic} qi xi + const, where qi is a coefficient concerning the i-th pixel.
As a result, we can represent the global term E2 in the following form:

    E2 = (1/2)(γ1 − γ2) Σ_{∃k: i,j∈S(k)} xi xj + Σ_{i∈Ic} (pi + qi) xi + const.    (5)
This novel quadratic energy term consists of both unary and binary constraints, and is thus fundamentally different from the conventional terms used in [2], [3] and [4], where only linear constraints are utilized. Moreover, it also differs from the pairwise Ising term defined in Equation 2, since the latter operates on a neighborhood system in the spatial domain while the pairwise term in Equation 5 works in feature space. From a graph point of view, each adjacent pixel pair in feature space (that is, a pair falling into the same histogram bin) is connected by an edge, even if the two pixels are far away from each other in the spatial domain.
4.4 Computation
Optimizing the above-defined energy function is challenging due to the existence of the quadratic global constraint. Although optimization methods like graph cuts [14] or normalized cuts [1] could find its optimum, the required memory space is too large for current computer hardware: for image pairs with a typical size of 800*600, the global term usually gives rise to more than 1G extra edges, which is intolerable. General inference algorithms like MCMC [15], hierarchical methods or iterative procedures [8] are more suitable for such an optimization task. However, the common drawback of these methods is that they are too time-consuming, and thus not suitable for real-time applications. To strike a balance between efficiency and accuracy, we let γ1 be equal to γ2 in Equation 5, reducing the global term to a classical linear form. Experiments prove the effectiveness of this approximation.
4.5 Risk Term
Another important issue is seldom considered in previous work. For an input image pair, small regions with unique colors tend to be categorized as "foreground" (see Figure 6 for a concrete example). This is mainly because they affect E2 much less than the preference term in E1. To mitigate this problem, we propose a novel constraint named the Risk Term, which reflects the risk of assigning a pixel to the foreground according to its color. Let h1, h2 denote the 2D histograms of the image pair. For histogram bin k, its risk value is defined as follows:

    R(k) = |h1(k) − h2(k)| / |h1(k) + h2(k)|.    (6)
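A direct implementation of Eq. (6), and of mapping the per-bin risk back to pixels, might look as follows; the flattened bin-index representation is our assumption about how pixels are associated with the 2-D histogram bins.

    import numpy as np

    def risk_term(h1, h2):
        """Per-bin risk R(k) of Eq. (6); bins present in only one image get a
        risk close to 1, empty bins get 0."""
        num = np.abs(h1 - h2)
        den = np.abs(h1 + h2)
        return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)

    def pixel_risk(bin_index, h1, h2):
        """Map the per-bin risk back to pixels; bin_index is an H x W array of
        flattened 2-D histogram bin indices."""
        return risk_term(h1, h2).ravel()[bin_index]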
Fig. 6. Illustration of the Risk Term. For the right image in (a), several small regions are labeled as foreground objects (see the left image in (b)); after introducing the risk term they are removed. We also draw the coefficients pi + qi of Equation 5 (normalized to [0, 255]) in (c) and (d); lower brightness implies a stronger tendency to be foreground. The benefit of the risk term is obvious.
Fig. 7. Comparison with Rother’s method. Parameters are identical in both experiments: α = 0.3, λ = 50. Note that α corresponds to user’s prior knowledge about the percentage of foreground in the whole image. It is shown that the way to choose α in our method is more consistent with user’s intuition.
5
Experiments and Comparison
We apply the proposed method to a variety of image pairs from public image sets or captured by ourselves. Experiments show our method is superior to previous ones in aspects including accuracy, computing time and ease of use. Lacking color rectification, previous methods such as [8] cannot handle input images captured under very different illumination conditions or with cluttered backgrounds (Figures 1, 5 and 6). Experiments also show that the way parameters are chosen in our method is more consistent with the user's intuition (Figure 7). For typical 640*480
Fig. 8. A failure example due to confusion of foreground/background colors
image pairs, the algorithm usually converges in fewer than 4 cycles, and each iteration takes about 0.94 seconds on a Pentium-4 2.8G/512M RAM computer.
6
Conclusions and Future Work
We have presented a novel co-segmentation method. Various experiments demonstrated its superiority over the state-of-the-art work. Our result (Figure 8) also showed certain limitation of the algorithm due to only utilizing color information; and our future work will focus on how to effectively utilize more types of information such as shapes, textures and high-level semantics.
References 1. Yu, S.X., Shi, J.: Multiclass spectral clustering. In: ICCV, pp. 313–319 (2003) 2. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In: ICCV, pp. 105–112 (2001) 3. Rother, C., Kolmogorov, V., Blake, A.: ”grabcut”: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004) 4. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. Graph. 23(3), 303–308 (2004) 5. Wang, J., Cohen, M.F.: An iterative optimization approach for unified image segmentation and matting. In: ICCV, pp. 936–943 (2005) 6. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Bi-layer segmentation of binocular stereo video. CVPR (2), 407–414 (2005) 7. Sun, J., Kang, S.-B., Xu, Z., Tang, X., Shum, H.Y.: Flash cut: Foreground extraction with flash/no-falsh image pairs. In: CVPR (2007) 8. Rother, C., Minka, T.P., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram matching - incorporating a global constraint into mrfs. CVPR (1), 993–1000 (2006) 9. Narasimhan, M., Bilmes, J.: A submodular-supermodular procedure with applications to discriminative structure learning. In: UAI, pp. 404–441. AUAI Press (2005) 10. van de Weijer, J., Gevers, T.: Boosting saliency in color image features. CVPR (1), 365–372 (2005) 11. Weiss, Y.: Deriving intrinsic images from image sequences. In: ICCV, pp. 68–75 (2001) 12. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999) 13. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer, Heidelberg (2001) 14. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? In: ECCV (3), pp. 65–81 (2002) 15. Barbu, A., Zhu, S.C.: Generalizing swendsen-wang to sampling arbitrary posterior probabilities. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1239–1253 (2005)
Shape Reconstruction from Cast Shadows Using Coplanarities and Metric Constraints
Hiroshi Kawasaki1 and Ryo Furukawa2
1 Faculty of Engineering, Saitama University, 255, Shimo-okubo, Sakura-ku, Saitama, Japan
[email protected]
2 Faculty of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima, Japan
[email protected]
Abstract. To date, various techniques of shape reconstruction using cast shadows have been proposed. These techniques have the advantage that they can be applied to various scenes, including outdoor scenes, without using special devices. Previously proposed techniques usually require calibration of the camera parameters and the light source positions, and such calibration processes limit their range of application. If a shape can be reconstructed even when these values are unknown, the technique can be applied far more widely. In this paper, we propose a method to realize such a technique by constructing simultaneous equations from coplanarities and metric constraints, which are observed from the cast shadows of straight edges and from visible planes in the scene, and solving them. We conducted experiments using simulated and real images to verify the technique.
1 Introduction
To date, various techniques of scene shape reconstruction using shadows have been proposed. One of the advantages of using shadows is that the information for 3D reconstruction can be acquired without using special devices, since shadows exist wherever light is present. For example, these techniques are applicable to outdoor poles on a sunny day or indoor objects under a room light. Another advantage of shape reconstruction using shadows is that only a single camera is required. So far, most previously proposed methods have assumed known light source positions because, if they are unknown, there are ambiguities in the solution and Euclidean reconstruction cannot be achieved [1]. If a shape can be reconstructed with unknown light source positions, the technique can be used for a wider range of applications; for example, a scene captured by a remote web camera under an unknown lighting environment could be reconstructed. Since the intrinsic parameters of a remote camera are usually unknown, the application becomes even more useful if the focal length of the camera can be estimated at the same time. In this paper, we propose a method to achieve this. Our technique is actually more general, i.e. both the object that casts the shadows and the light source can be freely moved while scanning, because their positions are not required to be known or static. This is a great advantage for actual scanning, since the unmeasured area caused by self-shadows can be drastically reduced by moving the light source.
To actually realize the technique, we propose a novel formulation of simultaneous linear equations derived from the planes created by shadows of straight edges (shadow planes) and from the real planes in the scene, which is an extension of previous studies on shape from planes [2,3] and on the interpretation of line drawings of polyhedra [4]. Since shadow planes and real planes are treated equally in our formulation, various geometrical constraints among the planes can be utilized efficiently for the Euclidean upgrade and camera calibration. In this paper, we assume two typical situations for reconstructing the scene. The first one, which we call "shadows of the static object," assumes a fixed camera position, a static scene, and a static object with a straight edge which casts a moving shadow as the light source (e.g. the sun or a point light) moves. The second one, which we call "active scan by cast shadow," assumes a fixed camera and arbitrary motion of both a light source and an object with a straight edge that generates shadows, so as to conduct an active scan.
2 Related Work
3D reconstruction using shadow information has a long history. Shafer et al. presented a mathematical formulation of shadow geometries and derived constraints on surface orientation from shadow boundaries [5]. Hambrick et al. proposed a method for classifying the boundaries of shadow regions [6]. Several methods for recovering 3D shapes up to Euclidean reconstruction based on geometrical constraints of cast shadows have been proposed [7,8,9,10]. All of these methods assume that the objects that cast shadows are static and that the light directions or positions are known. On the other hand, Bouguet et al. proposed a method which allows users to move a straight-edged object freely so that the shadow generated by a fixed light source sweeps the object [11,12]. However, the technique requires calibration of the camera parameters, the light source position, and a reference plane. If a Euclidean shape can be reconstructed with unknown light source positions, it may broaden the applications of "shape from cast shadow" techniques. However, it was proved that scene reconstructions based on binary shadow regions have ambiguities of four degrees of freedom (DOFs) if the light positions are unknown [1]. In the case of a perspective camera, these ambiguities correspond to the family of transformations called generalized projective bas-relief (GPBR) transformations. To deal with unknown light source positions, Caspi et al. proposed a method using two straight, parallel and fixed objects to cast shadows and a reference plane (e.g. the ground) [13]. To resolve the ambiguities caused by unknown light sources, they used the parallelism of the shadows of straight edges, detected via vanishing points. Compared to their work, our method is more general: the camera can be partially calibrated, the straight object and the light source can be moved, the light source can be a parallel or point light source, and wider types of constraints than the parallelism of shadows can be used to resolve the ambiguities.
3 Shape Reconstruction from Cast Shadow
If a set of points exists on the same plane, the points are coplanar, as shown in figure 1(a). All the points on a plane are coplanar even if the plane does not have textures or feature
Fig. 1. Coplanarities in a scene: (a) Explicit coplanarities. Regions of each color except white are sets of coplanar points. Note that points on a region of a curved surface are not coplanar. (b) Implicit coplanarities. Segmented lines of each color are sets of coplanar points. (c) Examples of metric constraints: π0⊥π1 and π0⊥π2 if λ⊥π0; π3⊥π4, π4⊥π5, π3⊥π5, and π3 parallel to π0 if box B is rectangular and on π0. (d) Intersections between explicit coplanar curves and implicit coplanar curves in a scene. Lines of each color correspond to a plane in the scene.
points. A scene composed of planar structures has many coplanarities. In this paper, a coplanarity that is actually observed as a real plane in the scene is called an explicit coplanarity. As opposed to this, in 3D space there exist an infinite number of coplanarities that are not explicitly observed in ordinary situations, but that could be observed under specific conditions. For example, the boundary of a cast shadow of a straight edge is a set of coplanar points, as shown in figure 1(b). This kind of coplanarity is not visible until the shadow is cast on the scene. In this paper, we call these coplanarities implicit coplanarities. Implicit coplanarities can be observed in various situations, such as when buildings with straight edges are under the sun and cast shadows onto the scene. Although explicit coplanarities are observed only for limited parts of the scene, implicit coplanarities can be observed on arbitrarily shaped surfaces, including free-form curves. In this study, we create linear equations from the implicit coplanarities of the shadows and the explicit coplanarities of the planes. By solving the acquired simultaneous equations, the scene can be reconstructed, except for the four (or more) DOFs of the simultaneous equations and the DOFs corresponding to unknown camera parameters. For a Euclidean reconstruction from this solution, the remaining DOFs should be solved for (called metric reconstruction in this paper). To achieve this, constraints other than coplanarities should be used. For many scenes, especially those that include artificial objects, we can find geometrical constraints among explicit and implicit planes. Examples of such information are as follows. (1) In figure 1(c), the ground is plane π0, and the linear object λ stands vertically on the ground. If the planes corresponding to shadows of λ are π1 and π2, then π0⊥π1 and π0⊥π2 can be derived from λ⊥π0. (2) In the same figure, the sides of box B are π3, π4, and π5. If box B is rectangular, π3, π4, and π5 are orthogonal to each other; if box B is on the ground, π3 is parallel to π0. From constraints available from the scene, such as in the above examples, we can determine the variables for the remaining DOFs and achieve metric reconstruction. With enough constraints, the camera parameters can be estimated at the same time. We call these constraints the metric constraints.
Based on this, the actual flow of the algorithm is as follows.
Step 1: Extraction of coplanarities. From a series of images of a scene with shadows captured by a fixed camera, shadow boundaries are extracted as implicit-coplanar curves. If the scene has planar areas, explicit-coplanar points are sampled from those areas. For the efficient processing of steps 2 and 3 below, only selected frames are processed.
Step 2: Cast-shadow reconstruction by shape from coplanarities. From the dataset of coplanarities, constraints are acquired as linear equations. By numerically solving the simultaneous equations, a space of solutions with four (or more) DOFs is acquired.
Step 3: Metric reconstruction by metric constraints. To achieve metric reconstruction, the solution of step 2 must be upgraded. The solution can be upgraded by solving the metric constraints.
Step 4: Dense shape reconstruction. The processes in steps 2 and 3 are performed on selected frames. To realize dense shape reconstruction of the scene, implicit-coplanar curves from all the images are used to reconstruct 3D shapes using the results of the preceding processes.
4 Algorithm Details for Each Step

4.1 Data Acquisition

To detect coplanarities in a scene, the boundaries of cast shadows are required. Automatic extraction of a shadow area from a scene is not easy. However, since shadow extraction has been studied for a long time [14,15], many techniques have already been proposed, and we adopt a spatio-temporal method as follows:
1. Images are captured by a fixed camera at fixed intervals, and a spatio-temporal image is created by stacking the images after background subtraction.
2. The spatio-temporal image is divided by 3D segmentation. The 3D segmentation is achieved by applying a region-growing method to the spatio-temporal space. To deal with noise in real images, we merge small regions into the surrounding regions and split a large region connected by a small region into two.
3. From the segmented regions, shadow regions are selected interactively by hand. Also, if wrong regions are produced by the automatic process, those regions are modified manually in this step.
4. The segmented regions are again divided into frames, and coplanar shadow curves are extracted from each frame as the boundaries of the divided regions.
By drawing all the detected boundaries on a single image, we can acquire many intersections. Since each intersection is shared by at least two planes, we can construct simultaneous equations. The numerical solution of these equations is explained in the following section.
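The following sketch illustrates one way the spatio-temporal extraction above could be implemented. It is not the authors' implementation: the OpenCV/SciPy calls, the median-based background model, the thresholds, and the use of 3-D connected components in place of the region-growing and interactive clean-up steps are assumptions made for illustration.

```python
# Minimal sketch of Sec. 4.1: spatio-temporal extraction of coplanar shadow curves.
# Frame paths, thresholds, and the minimum-region size are illustrative choices.
import glob
import cv2
import numpy as np
from scipy import ndimage

def extract_shadow_boundaries(frame_glob="frames/*.png", thresh=25, min_voxels=500):
    frames = [cv2.imread(p, cv2.IMREAD_GRAYSCALE).astype(np.float32)
              for p in sorted(glob.glob(frame_glob))]
    volume = np.stack(frames, axis=0)                  # (T, H, W) spatio-temporal image

    background = np.median(volume, axis=0)             # static background estimate
    shadow_mask = (background - volume) > thresh       # shadows are darker than background

    # 3-D segmentation of the spatio-temporal shadow volume.
    labels, n = ndimage.label(shadow_mask)
    sizes = ndimage.sum(shadow_mask, labels, range(1, n + 1))
    for lab, size in enumerate(sizes, start=1):        # drop small, noisy regions
        if size < min_voxels:
            labels[labels == lab] = 0

    # Slice back into frames and keep region boundaries as coplanar shadow curves.
    boundaries = []
    for t in range(labels.shape[0]):
        frame_labels = labels[t]
        edges = frame_labels != ndimage.grey_erosion(frame_labels, size=3)
        boundaries.append(np.argwhere(edges & (frame_labels > 0)))  # (row, col) samples
    return boundaries
```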
4.2 Projective Reconstruction

Suppose a set of N planes including both implicit and explicit planes. Let the j-th plane of the set be πj. We express the plane πj by the form

aj x + bj y + cj z + 1 = 0    (1)

in the camera coordinate system. Suppose a set of points such that each point of the set exists on intersections of multiple planes. Let the i-th point of the set be represented as ξi and exist on the intersection of πj and πk. Let the coordinates (ui, vi) be the location of the projection of ξi onto the image plane. We represent the camera intrinsic parameter by α = p/f, where f is the focal length and p is the size of a pixel. We define a∗j = αaj and b∗j = αbj. The direction vector of the line of sight from the camera to the point ξi is (αui, αvi, −1). Thus,

aj(−αui zi) + bj(−αvi zi) + cj zi + 1 = 0,    (2)

where zi is the z-coordinate of ξi. By dividing the form by zi and using the substitutions ti = 1/zi, a∗j = αaj, and b∗j = αbj, we get

−ui a∗j − vi b∗j + cj + ti = 0.    (3)

Since ξi is also on πk,

−ui a∗k − vi b∗k + ck + ti = 0.    (4)

From equations (3) and (4), the following simultaneous equations with variables a∗j, b∗j and cj can be obtained:

(a∗j − a∗k)ui + (b∗j − b∗k)vi + (cj − ck) = 0.    (5)

We define L as the coefficient matrix of the above simultaneous equations, and x as the solution vector. Then, the equations can be described in matrix form as

Lx = 0.    (6)

The simultaneous equations of form (5) have trivial solutions that satisfy

a∗j = a∗k, b∗j = b∗k, cj = ck (j ≠ k).    (7)

Let x1 be the solution with a∗i = 1, b∗i = 0, ci = 0 (i = 1, 2, . . .), x2 be the solution with a∗i = 0, b∗i = 1, ci = 0, and x3 be the solution with a∗i = 0, b∗i = 0, ci = 1. Then, the above trivial solutions form a linear space spanned by the bases x1, x2, x3, which we represent as T. We now describe a numerical solution of the simultaneous equations, assuming the observed coordinates (ui, vi) on the image plane include errors. Since equation (6) is over-constrained, it generally cannot be fulfilled completely. Therefore, we consider the n-dimensional linear space Sn spanned by the n eigenvectors of L⊤L associated with the n minimum eigenvalues. Then, Sn becomes the solution space of x
such that max_{x∈Sn} |Lx|/|x| is minimum with respect to all possible n-dimensional linear spaces. Even if the coordinates ui, vi are perturbed by additive errors, x1, x2, x3 remain trivial solutions that completely satisfy equations (5) within the precision of floating-point calculations. Thus, normally, the 3D space S3 becomes equivalent to the space of trivial solutions T. For a non-trivial solution, we can define xs = argmin_{x∈T⊥} (|Lx|/|x|)², where T⊥ is the orthogonal complement of T. xs is the solution that minimizes |Lx|/|x| and is orthogonal to x1, x2 and x3. Since T and S3 are normally equal, xs can be calculated as the eigenvector of L⊤L associated with the 4th smallest eigenvalue. Thus, the general form of the non-trivial solutions is represented as

x = f1 x1 + f2 x2 + f3 x3 + f4 xs = Mf,    (8)
where f1, f2, f3, f4 are free variables, f is the vector (f1 f2 f3 f4)⊤, and M is the matrix (x1 x2 x3 xs). The four DOFs of the general solution basically correspond to the DOFs of the generalized projective bas-relief (GPBR) transformations described in the work of Kriegman et al. [1]. As far as we know, there are no previous studies that reconstruct 3D scenes by using the linear equations from the 3-DOF implicit and explicit planes. Advantages of this formulation are that the solution can be obtained stably, and that a wide range of geometrical constraints can be used as metric constraints.

4.3 Metric Reconstruction

The solution obtained in the previous section has four DOFs from f. In addition, if the camera parameters are unknown, additional DOFs should be resolved to achieve metric reconstruction. Since these DOFs cannot be solved using coplanarities, they should be solved using metric constraints derived from the geometrical constraints in the scene. For example, suppose that orthogonality between the planes πs and πt is assumed. We denote the unit normal vector of plane πs as a vector function ns(f, α) = N((as(f, α) bs(f, α) cs(f, α))⊤), whose parameters are f and the camera parameter α, where N(·) denotes normalization. Then, the orthogonality between πs and πt can be expressed as

ns(f, α)⊤ nt(f, α) = 0.    (9)
Other types of geometrical constraints, such as parallelism, can easily be formulated in a similar way. To solve the equations described above, non-linear optimization with respect to f and α can be used. We implemented the numerical solver using the Levenberg-Marquardt method. The determination of the initial value of f may be a problem. In the experiments described in this study, we construct a solution vector xI from the given plane parameters and use fI = M⊤xI as the initial value of f. In this method, fI is the projection of xI, in the space of the plane parameters whose dimension is 3N, onto the solution space of the projective reconstruction (8) such that the metric distance between MfI and xI is minimum. Using this process, we can obtain a set of plane parameters which fulfills the coplanarity conditions for an arbitrary set of plane parameters.
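To make these two steps concrete, the following sketch builds the coefficient matrix L of eq. (5) from a list of intersection points, takes the four eigenvectors of L⊤L with the smallest eigenvalues as a basis of the solution space of eq. (8), and refines f and α against orthogonality constraints of the form of eq. (9). It is only a sketch under stated assumptions: the data layout, the constraint list, and the use of SciPy's general least-squares solver in place of the Levenberg-Marquardt implementation described above are illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def solution_basis(intersections, n_planes):
    """intersections: list of (u, v, j, k); point (u, v) lies on planes j and k."""
    rows = []
    for u, v, j, k in intersections:
        row = np.zeros(3 * n_planes)
        row[3 * j: 3 * j + 3] = (u, v, 1.0)      # coefficients of (a*_j, b*_j, c_j), eq. (5)
        row[3 * k: 3 * k + 3] = (-u, -v, -1.0)   # minus those of plane k
        rows.append(row)
    L = np.asarray(rows)
    _, vecs = np.linalg.eigh(L.T @ L)            # eigenvalues in ascending order
    return vecs[:, :4]                            # basis of the 4-DOF solution space, eq. (8)

def plane_normals(M, f, alpha):
    """Unit normals n_j(f, alpha) of all planes for a candidate (f, alpha)."""
    planes = (M @ f).reshape(-1, 3)              # rows: (a*_j, b*_j, c_j)
    normals = np.column_stack([planes[:, 0] / alpha, planes[:, 1] / alpha, planes[:, 2]])
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)

def metric_upgrade(M, ortho_pairs, f0, alpha0):
    """ortho_pairs: index pairs (s, t) of planes constrained to be orthogonal, eq. (9)."""
    def residuals(params):
        normals = plane_normals(M, params[:4], params[4])
        return [normals[s] @ normals[t] for s, t in ortho_pairs]
    res = least_squares(residuals, np.append(f0, alpha0))   # non-linear refinement
    return res.x[:4], res.x[4]
```

Note that the returned basis spans the same four-DOF space as M = (x1 x2 x3 xs); the individual trivial vectors may differ from x1, x2, x3 by a rotation within T, which does not affect the general solution.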
Fig. 2. Reconstruction of simulation data: (a)(b) input images with shadows and (c)(d) reconstruction results. In the results, the shaded surfaces are ground truth and the red points are reconstructed points.
4.4 Dense Reconstruction

To obtain a dense 3D shape, we also conduct a dense 3D reconstruction by using all the captured frames. The actual process is as follows.
1. Detect the intersections between an implicit-coplanar curve on an arbitrary frame and the curves of the already estimated planes.
2. Estimate the parameters of the plane of the implicit-coplanar curve by fitting it to the known 3D points, which are retrieved from the intersections, using principal component analysis (PCA).
3. Recover the 3D positions of all the points on the implicit-coplanar curve by using the estimated plane parameters and triangulation.
4. Iterate steps 1 to 3 for all frames.
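As an illustration of steps 2 and 3 above, the sketch below fits a plane to known 3D points by PCA and intersects the viewing rays of a coplanar curve with it, using the camera model of Section 4.2 (a pixel (u, v) corresponds to the point (−αu z, −αv z, z)). The function names and the SVD-based fit are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of Sec. 4.4: PCA plane fit and ray-plane triangulation.
import numpy as np

def fit_plane_pca(points3d):
    """Return (a, b, c) with a*x + b*y + c*z + 1 = 0 fitted to Nx3 points."""
    centroid = points3d.mean(axis=0)
    _, _, vt = np.linalg.svd(points3d - centroid)   # PCA: last row = plane normal direction
    normal = vt[-1]
    d = normal @ centroid                           # plane: n.x - d = 0
    return -normal / d                              # scale so the constant term is 1
                                                    # (assumes the plane misses the camera centre)

def triangulate_curve(curve_uv, plane, alpha):
    """Intersect the viewing rays of curve pixels (u, v) with the estimated plane."""
    a, b, c = plane
    u, v = curve_uv[:, 0], curve_uv[:, 1]
    z = 1.0 / (a * alpha * u + b * alpha * v - c)   # solve eq. (2) for z
    return np.column_stack([-alpha * u * z, -alpha * v * z, z])
```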
5 Experiments

5.1 Simulation Data (Shadow of Static Object)

Figures 2 (a),(b) show data synthesized by CG, including a square floor, a rabbit, and a perpendicular wall. 160 images were generated while moving the light source so that the edge of the shadow scanned the rabbit. Simultaneous equations were created from the intersection points between the implicit-coplanar shadow curves and lines that were drawn on an explicit plane (the floor). The initial value for the nonlinear optimization was given to indicate whether the light source was located on the right or the left. By using the coplanarity information, the reconstruction could be done only up to scale, so three DOFs remained to be resolved. Since we also estimated the focal length, we needed four metric constraints. To obtain a Euclidean solution, we used two metric constraints from the orthogonalities of the shadow planes and the floor, and two other constraints from the orthogonalities of the two corners of the floor. Figures 2 (c) and (d) show the result (red points) and the ground truth (shaded surface). We can observe that the reconstruction result almost coincides with the correct shape. The RMS error (root of
Fig. 3. Reconstruction of simulation data (active scanning): (a) an input image, (b) explicit and implicit coplanarities, and (c)(d) reconstruction results. In the results, the shaded surfaces are ground truth and the red points are reconstructed points.
mean squared error) of the z-coordinates of all the reconstructed points was 2.6 × 10⁻³, where the average distance from the camera to the bunny was scaled to 1.0. Thus, the high accuracy of the reconstruction was confirmed.

5.2 Simulation Data (Active Scan by Cast Shadow)

Next, we attempted to reconstruct 3D shapes by sweeping the cast shadows over the objects, moving both a light source and a straight object. We synthesized a sequence of images of the model of a bunny that includes 20 implicit coplanarities and three visible planes (i.e. explicit planes). There are three metric constraints of orthogonality and parallelism between the visible planes. Figure 3(a) shows an example of the synthesized images, and figure 3(b) shows all the implicit-coplanar curves as the borders of the grid patterns. Figures 3(c) and (d) show the result. The RMS error of the z-coordinates of all the reconstructed points (normalized by the average of the z-coordinates as in the previous section) was 4.6 × 10⁻³. We can confirm the high accuracy of the result.

5.3 Real Outdoor Scene (Shadow of Static Object)

We conducted a shape reconstruction from images acquired by a fixed, uncalibrated outdoor camera. Images from the camera were captured periodically, and the shape and the focal length of the camera were reconstructed by the proposed technique from shadows in the scene. Since the scene also contained many shadows generated by non-straight edges, the automatic extraction of complete shadows was difficult. In this experiment, these noises were eliminated by human interaction, which took about 10 minutes of actual working time. Figure 4(a) shows the input frame, (b) shows the detected coplanar shadow curves, (c) shows all the coplanar curves and their intersections, and (d) to (f) show the reconstruction result. The proposed technique could correctly reconstruct the scene by using images from a fixed remote camera.

5.4 Real Indoor Scene (Active Scan by Cast Shadow)

We conducted an indoor experiment on an actual scene by using a point light source. A video camera was directed toward a target object and multiple boxes, and the scene
Fig. 4. Reconstruction of outdoor scene: (a) input image, (b) an example frame of the 3D segmentation result, (c) implicit (green) and explicit (red) coplanar curves, (d) reconstructed result of coplanar curves (red) and dense 3D points (shaded), and (e)(f) the textured reconstructed scene
was captured while the light source and the bar for shadowing were being moved freely. From the captured image sequence, several images were selected and the shadow curves of the bar were detected from the images. By using the detected coplanar shadow curves, we performed the 3D reconstruction up to 4 DOFs. For the metric
Fig. 5. Reconstruction of an indoor real scene: (a)(b) the captured frames, (c)(d) the reconstructed coplanar shadow curves (red) with the dense reconstructed model (shaded), and (e)(f) the textured reconstructed model
Fig. 6. Reconstruction and evaluation of an indoor real scene: (a)(b) the captured frames and (c)(d) the reconstructed model displayed with the ground truth data (shaded model: reconstructed model, red points: ground truth)
reconstruction, orthogonalities of the faces of the boxes were used. Figure 5 shows the captured frames and the reconstruction result. In this case, since only small amounts of noise were extracted owing to the indoor environment, shadow detection was stable and no human interaction was required. These results show that the dense shape is correctly reconstructed. We also reconstructed a scene of a box (size: 0.4m × 0.3m × 0.3m) and a cylinder (height: 0.2m, diameter: 0.2m) to evaluate the accuracy of the proposed method. The process of reconstruction was conducted in the same way as in the previous experiment, except that we also measured the 3D scene by an active measurement method using coded structured light [16] as the ground truth. The reconstruction result was scaled to match the ground truth using the average distance to the points. Figures 6 (a) and (b) show the captured scene, and (c) and (d) show both the scaled reconstruction (polygon mesh) and the ground truth (red points). Although there were small differences between the reconstruction and the ground truth, the shape was correctly recovered. The RMS error of the reconstruction from the ground truth, normalized by the average distance, was 1.80 × 10⁻².
6 Conclusion

This paper proposed a technique capable of reconstructing a shape when only multiple shadows of straight linear objects or straight edges are available from a scene, even when the light source positions are unknown and the camera is not calibrated. The technique is achieved by extending the conventional method, which is used to reconstruct polyhedra from coplanar planes and their intersections, to general curved surfaces. Since reconstruction from coplanarities can be solved only up to four DOFs, we proposed a technique for upgrading it to a metric solution by adding metric constraints. For the stable extraction of shadow areas from a scene, we developed a spatio-temporal image processing technique. By implementing the technique and conducting experiments using simulated and real images, accurate and dense shape reconstruction was verified.
References

1. Kriegman, D.J., Belhumeur, P.N.: What shadows reveal about object structure. Journal of the Optical Society of America 18(8), 1804–1813 (2001)
2. Bartoli, A., Sturm, P.: Constrained structure and motion from multiple uncalibrated views of a piecewise planar scene. International Journal of Computer Vision 52(1), 45–64 (2003)
3. Kawasaki, H., Furukawa, R.: Dense 3d reconstruction method using coplanarities and metric constraints for line laser scanning. In: 3DIM 2007. Proceedings of the 5th International Conference on 3-D Digital Imaging and Modeling (2007)
4. Sugihara, K.: Machine Interpretation of Line Drawings. MIT Press, Cambridge, MA, USA (1986)
5. Shafer, S.A., Kanade, T.: Using shadows in finding surface orientations. Computer Vision, Graphics, and Image Processing 22(1), 145–176 (1983)
6. Hambrick, L.N., Loew, M.H., Carroll, J.R.L.: The entry-exit method of shadow boundary segmentation. PAMI 9(5), 597–607 (1987)
7. Hatzitheodorou, M., Kender, J.: An optimal algorithm for the derivation of shape from shadows. CVPR, 486–491 (1988)
8. Raviv, D., Pao, Y., Loparo, K.A.: Reconstruction of three-dimensional surfaces from two-dimensional binary images. IEEE Trans. on Robotics and Automation 5(5), 701–710 (1989)
9. Daum, M., Dudek, G.: On 3-d surface reconstruction using shape from shadows. CVPR, 461–468 (1998)
10. Savarese, S., Andreetto, M., Rushmeier, H., Bernardini, F., Perona, P.: 3d reconstruction by shadow carving: Theory and practical evaluation. IJCV 71(3), 305–336 (2007)
11. Bouguet, J.Y., Perona, P.: 3D photography on your desk. In: ICCV, pp. 129–149 (1998)
12. Bouguet, J.Y., Weber, M., Perona, P.: What do planar shadows tell about scene geometry? CVPR 01, 514–520 (1999)
13. Caspi, Y., Werman, M.: Vertical parallax from moving shadows. In: CVPR, pp. 2309–2315. IEEE Computer Society, Washington, DC, USA (2006)
14. Jiang, C., Ward, M.O.: Shadow segmentation and classification in a constrained environment. CVGIP: Image Underst. 59(2), 213–225 (1994)
15. Salvador, E., Cavallaro, A., Ebrahimi, T.: Cast shadow segmentation using invariant color features. Comput. Vis. Image Underst. 95(2), 238–259 (2004)
16. Sato, K., Inokuchi, S.: Range-imaging system utilizing nematic liquid crystal mask. In: Proc. of the First ICCV, pp. 657–661 (1987)
Evolving Measurement Regions for Depth from Defocus

Scott McCloskey, Michael Langer, and Kaleem Siddiqi
Centre for Intelligent Machines, McGill University
{scott,langer,siddiqi}@cim.mcgill.ca
Abstract. Depth from defocus (DFD) is a 3D recovery method based on estimating the amount of defocus induced by finite lens apertures. Given two images with different camera settings, the problem is to measure the resulting differences in defocus across the image, and to estimate a depth based on these blur differences. Most methods assume that the scene depth map is locally smooth, and this leads to inaccurate depth estimates near discontinuities. In this paper, we propose a novel DFD method that avoids smoothing over discontinuities by iteratively modifying an elliptical image region over which defocus is estimated. Our method can be used to complement any depth from defocus method based on spatial domain measurements. In particular, this method improves the DFD accuracy near discontinuities in depth or surface orientation.
1 Introduction
The recovery of the 3D structure of a scene from 2D images has long been a fundamental goal of computer vision. A plethora of methods, based on many different depth cues, have been presented in the literature. Depth from defocus methods belong to a class of depth estimation schemes that use optical blur as a cue to recover the 3D scene structure. Given a small number of images taken with different camera settings, depth can be found by measuring the resulting change in blur. In light of this well-known relationship, we use the terms ’depth’ and ’change in blur’ interchangeably in this paper. Ideally, in order to recover the 3D structure of complicated scenes, the depth at each pixel location would be computed independently of neighboring pixels. This can be achieved through measurements of focus/defocus, though such approaches require a large number of images [9] or video with active illumination [17]. Given only a small number of observations (typically two), however, the change in blur must be measured over some region in the images. The shape of the region over which these measurements are made has, to date, been ignored in the literature. Measurements for a given pixel have typically been made over square regions centered on its location, leading to artificially smoothed depth estimates near discontinuities. As a motivating example, consider the image in Fig. 1 of a scene with two fronto-parallel planes at different distances, separated by a depth discontinuity. Now consider an attempt to recover the depth of the point near the discontinuity
Fig. 1. Two fronto-parallel surfaces separated by a step edge in depth. When estimating depth at a point near the step edge, inaccuracies arise when measurements are made over a square window.
(such as the point marked with the X) by estimating blur over a square region centered there (as outlined in the figure). We would like to recover the depth of the point, which is on one plane, but this region consists of points that lie on both surfaces. Inevitably, when trying to recover a single depth value from a region that contains points at two different depths, we will get an estimate that is between the distances of the two planes. Generally, the recovered depth is an average of the two depths weighted by the number of points in the measurement region taken from each plane. As a result, the depth map that is recovered by measuring blur over square windows will show a smooth change in depth across the edge where there should be a discontinuity (see the red curve in Fig. 3). The extent of the unwanted smoothing depends on the size of the measurement region. In order to produce accurate estimates, DFD methods have traditionally used relatively large measurement regions. The method of [5] uses 64-by-64 pixel regions, and the method of [10] uses 51-by-51 regions. For the scene in Fig. 1, this results in a gradual transition of depth estimates over a band of more than 50 pixels where there should be a step change. This example illustrates a failure of the standard equifocal assumption, which requires that all points within the measurement region be at the same depth in the scene. In order for the equifocal assumption to hold over square regions, the scene must have constant depth over that region. In the case of a depth discontinuity, as illustrated above, the violation of the equifocal assumption results in smoothed depth estimates. In this paper, we propose a method for measuring blur over elliptical regions centered at each point, rather than over square regions at each point as in standard DFD [5,10,14,15]. Each elliptical region is evolved to minimize the depth variation within it (Sec. 3.1). Depths estimated over these regions (Sec. 3.2) are more accurate, particularly near discontinuities in depth and surface orientation.
Given the semantic importance of discontinuities, this improvement is important for subsequent 3D scene recovery and modeling. We demonstrate the improved accuracy of our method in several experiments with real images (Sec. 4).
2 Previous Work in Depth from Defocus
Broadly speaking, DFD methods fall into one of two categories: deconvolution methods based on a linear systems model, and energy minimization methods.

Deconvolution Methods. Pentland [12] proposed the first DFD method based on frequency domain measurements. Given one image taken through a pinhole aperture, the blur in a second (finite aperture) image is found by comparing the images’ power spectra. Subbarao [15] generalized this result, allowing for changes in other camera settings, and removing the requirement that one image be taken through a pinhole aperture. Gökstorp [8] uses a set of Gabor filters to generate local frequency representations of the images. Other methods based on the frequency domain model, such as that of Nayar and Watanabe [11], develop filter banks which are used to measure blur (see also [16]). Ens and Lawrence point out a number of problems with frequency domain measurements and propose a matrix-based spatial domain method that avoids them [5]. Subbarao and Surya [14] model image intensities as a polynomial and introduce the S-transform to measure the blur in the spatial domain. More recently, McCloskey et al. use a reverse projection model, and measure blur using correlation in the spatial domain [10]. Whether they employ spatial or frequency domain schemes, square measurement regions are the de facto standard for DFD methods based on deconvolution. Frequency domain methods use the pixels within a square region to compute the Fast Fourier Transform and observe changes in the local power spectrum. Spatial domain methods must measure blur over some area, and take square regions as the set of nearby points over which depth is assumed to be constant. In either case, estimates made over square windows will smooth over discontinuities, either in depth or surface orientation. Despite this shortcoming, deconvolution has been a popular approach to DFD because of its elegance and accessibility.

Energy Minimization Methods. Several iterative DFD methods have been developed which recover both a surface (depth) and its radiance from pairs of defocused images. The methods presented by Favaro and Soatto [6,7] define an energy functional which is jointly minimized with respect to both shape and radiance. In [6], four blurred images are used to recover the depth and radiance, including partially occluded background regions. In [7] a regularized solution is sought and developed using Hilbert space techniques and SVD. Chaudhuri and Rajagopalan model depth and radiance as Markov Random Fields (MRF) and find a maximum a posteriori estimate [4]. These methods have the advantage that they don’t explicitly assume the scene to have locally constant depth, though regularization and MRF models implicitly assume that depth changes slowly.
Building upon the MRF scheme with line fields, Bhasin and Chaudhuri [2] explicitly model image formation in the neighborhood of depth discontinuities in order to more accurately recover depth. However, their results are generated on synthetic data that is rendered with an incorrect model: they assert that (regardless of shadowing) a dark band arises near depth discontinuities as a result of partial occlusion. For a correct model of partial occlusion and image formation near depth discontinuities, see Asada et al. [1].
3 DFD with Evolving Regions
The method we propose in this paper combines elements of both of the above categories of DFD methods. Our approach iteratively refines a depth estimate like the energy minimization methods, but it does not involve a large error function of unknown topography. Our method is conceptually and computationally straightforward like the deconvolution methods, but it will be shown to have better accuracy near discontinuities in depth and in surface orientation. To achieve this, we evolve a measurement region at each pixel location toward an equifocal configuration, using intermediate depth estimates to guide the evolution. In order to vary the shape of the measurement region in a controlled fashion despite erroneous depth estimates, we impose an elliptical shape on it. The measurement region at a given location starts as a circular region centered there, and evolves with increasing eccentricity while maintaining a fixed area M (maintaining a measurement region with at least M pixels is necessary to ensure a consistent level of performance in the DFD estimation). In addition to providing a controlled evolution, elliptical regions are more general than square ones in that they can provide better approximations to the scene’s equifocal contours (in regions of locally constant depth, equifocal (iso-depth) points form a 2D region; more generally, the surface slopes and equifocal points fall along a 1D contour). Furthermore, ellipses can be represented with a small number of parameters. This compactness of representation is important in light of our desire to produce dense depth maps, which requires us to maintain a separate measurement region for each pixel location in an image. Given only two images of the scene under different focus settings as input, we have no information about depth or the location of equifocal regions. Initially, we make the equifocal assumption that the scene has locally constant depth and estimate depth over hypothesized equifocal regions (circles) based on this assumption. The key to our approach is that this initial depth estimate is then used to update the measurement region at each pixel location. Instead of estimating defocus over square regions that we assume to have constant depth, we evolve elliptical regions over which we have evidence that depth is nearly constant. In order to give the reader an intuition for the process, we refer to Fig. 2, which shows the evolution of the measurement regions at two locations. In regions of the scene that have constant depth, initial measurement regions are found to contain no depth variation, and are maintained (blue circle). In regions of changing depth, as near the occluding contour, the initial region (red circle) is found to contain significant depth variation and is evolved through an intermediate state (yellow ellipse) to an equifocal configuration (green ellipse).

Fig. 2. Evolving measurement regions. Near the depth discontinuity, regions grow from circles (red) through a middle stage (yellow) to an equifocal configuration (green). In areas of constant depth, initial regions (blue) do not evolve.

Our method can be thought of as a loop containing two high-level operations: evolution of elliptical regions and depth measurement. The method of evolving elliptical regions toward an equifocal configuration is presented in Sec. 3.1. The blur estimation method is presented in Sec. 3.2. We use the Ens and Lawrence blur estimation algorithm of [5], and expect that any DFD method based on spatial domain measurements could be used in its place.

3.1 Evolving Elliptical Regions
Given a circle (an ellipse with eccentricity 0) as our initial measurement region’s perimeter, we wish to evolve that region in accordance with depth estimates. Generally, we would like the extent of the region to grow in the direction of constant scene depth while shrinking in the direction of depth variation. Using an elliptical model for the region, we would like the major axis to follow depth contours and the minor axis to be oriented in the direction of depth variation. We iteratively increase the ellipse’s eccentricity in order to effect the expansion and contraction along these dimensions, provided there are depth changes within it. Each pixel location p has its own measurement region Rp: an ellipse represented by a 2-vector fp, which expresses the location of one of the foci relative to p, and a scalar rp, which is the length of the semi-major axis. Initially fp = (0, 0) and rp = rc for all p. The value rc comes from the region area parameter M = πrc². Once we have measured depth over circular regions, we establish the orientation of the ellipse by finding the direction of depth variation in the smoothed depth map. In many cases, though not all, the direction of depth variation
is the same as the direction of the depth gradient. Interesting exceptions happen along contours of local minima or maxima of depth (as in the image in Fig. 4 (left)). In order to account for such configurations, we take the angle θv to be the direction of depth variation if the variance of depth values along diameters of the circular region is maximal in that direction. The set of depth values D(θ) along the diameter in direction θ is interpolated from the depth map d at equally-spaced points about p,

D(θ) = {d(p + n(cos θ, sin θ)) | n = −rc, −rc + 1, . . . , rc}.    (1)
We calculate D(θ) over a set of orientations θ ∈ [0, π), and take θv to be the direction that maximizes the depth variance, i.e.

θv = argmaxθ var(D(θ)).    (2)
The variance of D(θv) is compared to a threshold chosen to determine if the variance is significant. In the event that the scene has locally constant depth within the circle, the variance should be below threshold, and the measurement region does not evolve. In the event that the variance is above the threshold, we orient the minor axis in the direction θv by setting f = (− sin θv, cos θv). Having established the orientation of the ellipses, we estimate depth over the current measurement regions and increase their eccentricity if the depth variation along the minor axis is significant (i.e. above threshold) and the variation along the major axis is insignificant. These checks halt the evolution if the elliptical region begins to expand into a region of changing depth or if it is found to be an equifocal configuration. If these checks are passed, the ellipse at iteration n + 1 is evolved from the ellipse at iteration n by increasing f according to

f_{n+1} = (‖f_n‖ + k) f_n / ‖f_n‖,    (3)
where the scalar k represents the speed of the deformation. As necessary, the value of r is suitably adjusted in order to maintain a constant area despite changes in f. Though the accuracy of depth estimation may increase by measuring blur over regions of area greater than M, we keep the area constant for our experiments, allowing us to demonstrate the improvement in depth estimates due only to changes in the shape of the measurement region.
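A sketch of the orientation and evolution computations of eqs. (1)-(3) is given below. The values k = 10 and eight sampled angles match the experimental settings reported later (Sec. 4); the nearest-pixel depth sampling and the variance threshold value are illustrative assumptions.

```python
# Minimal sketch of Sec. 3.1, assuming a dense depth map d and per-pixel ellipse state (f, r).
import numpy as np

def depth_along_diameter(d, p, theta, rc):
    """D(theta) of eq. (1): depths sampled along the diameter through p."""
    n = np.arange(-rc, rc + 1)
    rows = np.clip(np.round(p[0] + n * np.sin(theta)).astype(int), 0, d.shape[0] - 1)
    cols = np.clip(np.round(p[1] + n * np.cos(theta)).astype(int), 0, d.shape[1] - 1)
    return d[rows, cols]

def orient_ellipse(d, p, rc, n_angles=8, var_thresh=1e-4):
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    variances = [np.var(depth_along_diameter(d, p, t, rc)) for t in thetas]
    theta_v = thetas[int(np.argmax(variances))]        # direction of depth variation, eq. (2)
    if max(variances) < var_thresh:
        return np.zeros(2)                              # locally constant depth: stay a circle
    return np.array([-np.sin(theta_v), np.cos(theta_v)])  # minor axis oriented along theta_v

def evolve_focus(f, k=10.0):
    """One eccentricity-increasing step of eq. (3)."""
    norm = np.linalg.norm(f)
    if norm == 0.0:
        return f                                        # circle that was never oriented
    return (norm + k) * f / norm
```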
3.2 Measuring Depth Over Evolving Regions
In order to be utilized in our algorithm, a depth estimation method must be able to measure blur over regions of arbitrary shape. This rules out frequency domain methods that require square regions over which a Fast Fourier Transform can be calculated. Instead, we use a spatial domain method, namely the DFD method of Ens and Lawrence [5], adapted to our elliptical regions. The choice of this particular algorithm is not integral to the method; we expect that any spatial domain method for blur measurement could be used.
The Ens and Lawrence method takes two images as input: i1, taken through a relatively small aperture, and i2, taken through a larger aperture. The integration time of the image taken through the larger aperture is reduced to keep the level of exposure constant. As is common in the DFD literature, we model each of these images to be the convolution of a theoretical pinhole aperture image i0 (which is sharply focused everywhere) with a point spread function (PSF) h,

i1 = i0 ∗ h1 and i2 = i0 ∗ h2.    (4)

The PSFs h1 and h2 belong to a family of functions parameterized by the spread σ. This family of PSFs is generally taken to be either the pillbox, which is consistent with a geometric model of optics, or the Gaussian, which accounts for diffraction in the optical system. We take the PSFs to be Gaussians parameterized by their standard deviations; that is, hn = G(σn). Since it was taken through a larger aperture, regions of the image i2 cannot be sharper than corresponding regions in i1. Quantitatively, σ1 ≤ σ2. As described in [5], depth recovery can be achieved by finding a third PSF h3 such that

i1 ∗ h3 = i2.    (5)

As we have chosen a Gaussian model for h1 and h2, this unknown PSF is also a Gaussian; h3 = G(σ3). Computationally, we take σ3 at pixel location p to be the value that minimizes, in the sum of squared errors sense, the difference between i2 and i1 ∗ G(σ) over p’s measurement region Rp. That is,

σ3(p) = argminσ Σq∈Rp (i2(q) − (i1 ∗ G(σ))(q))².    (6)
The value of σ3 can be converted to the depth when the camera parameters are known. We omit the details for brevity.
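A minimal sketch of this minimisation, assuming a boolean mask selecting the elliptical region Rp and a fixed grid of candidate σ values (the grid and its range are illustrative choices, not values from the paper):

```python
# Sweep a Gaussian blur over i1 and minimise the squared error against i2 inside R_p, eq. (6).
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_sigma3(i1, i2, region_mask, sigmas=np.linspace(0.1, 5.0, 50)):
    """region_mask: boolean image selecting the pixels of R_p."""
    best_sigma, best_err = None, np.inf
    for sigma in sigmas:
        blurred = gaussian_filter(i1.astype(np.float64), sigma)
        err = np.sum((i2[region_mask] - blurred[region_mask]) ** 2)
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma
```

In practice, the blurred images i1 ∗ G(σ) can be computed once per candidate σ and reused for the measurement regions of all pixels.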
4 Experiments
We have conducted a number of experiments on real scenes. Our images were all acquired with a Nikon D70s digital camera and a 50mm Nikkor lens. The camera’s tone mapping function was characterized and inverted by the method described in [3], effectively making the camera a linear sensor. For each scene, the lens was focused at a distance nearer than the closest object in the scene (defocus increases in both directions away from the focused depth, and so depth from defocus suffers a sign ambiguity relative to the plane of focus [12]; this is typically avoided by having the entire scene be in front of or behind the plane of focus).

Depth Discontinuities. Our first experiment involves the scene considered in Sec. 1, which consists of two fronto-parallel planes separated by a depth discontinuity. We have shown the sharply focused, small aperture (f/13) input image in Figs. 1 and 2; the second, blurrier input image (not shown) was taken with a larger aperture (f/4.8). We have worked on a relatively small (400-by-300 pixel) window of 1000-by-1500 pixel images in order to illustrate the improved accuracy in the area of the discontinuity in sufficient detail.
[Fig. 3 (right) plot axes: scene depth (distance from the plane of focus, m) vs. image column.]
Fig. 3. (Left) Surface plot of estimates from circular regions (dark points are closer to the camera). (Center) Estimates from computed elliptical regions. (Right) Profile of depth discontinuity as measured over disks (red), and evolved elliptical regions (blue).
Fig. 4. (Left) Image of two planes meeting at a discontinuity in surface orientation, annotated with an evolved measurement region. (Center) Surface plot of depth recovered over circular regions (dark points are closer to the camera). (Right) Surface plot of depth recovered over elliptical regions.
Fig. 3 shows surface renderings of the recovered scene from the initial circular measurement regions (left), and the elliptical regions recovered by our method (middle). These surface renderings demonstrate the improved accuracy of the depth estimates as a result of the elliptical evolution. Fig. 3 (right) shows a plot of the depth estimates for a single row, recovered from the circular regions (red), and elliptical regions (green). As this plot shows, our method recovers an edge profile that is significantly more accurate than the initial result. This result was obtained by evolving the ellipses 25 times with speed constant k = 10. The orientation θv was determined by measuring the depth variance over 8 different angles uniformly spaced in the interval [0, π). The value rc = 45 pixels, making the measurement regions comparable in size to those used in [5]. 4
Defocus increases in both direction away from the focused depth, and so depth from defocus suffers a sign ambiguity relative to the plane of focus [12]. This is typically avoided by having the entire scene be in front of or behind the plane of focus.
Fig. 5. (Left) Image of a tennis ball on a ground plane annotated with measurement regions. (Center) Surface plot of depth measured over circular regions. (Right) Surface plot of depth measured over ellipses evolved to approximate equifocal regions.
Discontinuities in Surface Orientation. As in the case of discontinuous depth changes, square (or circular) windows for the measurement of defocus produce inaccurate depth estimates near discontinuities in surface orientation. Consider, for example, the scene shown in Fig. 4 (left), which consists of two slanted planes that intersect at the middle column of the image. The scene is convex, so the middle column of the image depicts the closest points on the surface. Within a square window centered on one of the locations on this ridge line of minimum depth, most of the scene points will be at greater depths. As a result, the distance to that point will be overestimated. Similarly, were the ridge line at a maximum distance in the scene, the depth would be underestimated. Fig. 4 (center and right) show surface plots of the scene recovered from circular and evolved elliptical regions, respectively. An example of an evolved measurement region is shown in Fig. 4 (left). The inaccuracies of the circular regions are apparent near the discontinuity in orientation, where the width of the measurement region results in a smoothed corner. The surface plot of estimates made over evolved elliptical regions shows the discontinuity more accurately, and has more accurate depth estimates near the perimeter of the image. Non-linear Discontinuities. Because our ellipses evolve to a line segment in the limit, one may expect that our method will fail on objects with occluding contours that are non-linear. While an ellipse cannot describe a curved equifocal contour exactly, it can still provide an improvement over square or circular regions when the radius of the equifocal contour’s curvature is small compared to the scale of the ellipse. Fig. 5 (left) shows an example of a spherical object, whose equifocal contours are concentric circles. The curvature of these equifocal contours is relatively low at the sphere’s occluding contour, allowing our elliptical regions to increase in eccentricity while reducing the internal depth variation (blue ellipse). Near the center of the ball, the equifocal contours have higher curvature, but the surface is nearly fronto-parallel in these regions. As a result, the initial circles do not become as elongated in this region (red circle). This results in depth estimates that show a clearer distinction between the sphere and the background (see Fig 5 (center) and (right)).
The parameters for this experiment were the same as in the previous examples, except that θv was found over 28 angles in the interval [0, π). The algorithm was run for 25 iterations, though most of the ellipses stopped evolving much earlier, leaving the maximum value of ‖f‖ = 100. For the experiments shown in this paper, a substantial majority of the running time was spent in the depth estimation step. The amount of time spent orienting and evolving the elliptical regions depends primarily on the scene structure, and totals about 2 minutes in the worst case.
5 Conclusions and Future Work
We have demonstrated that the accuracy of DFD estimates can depend on the shape of the region over which that estimate is computed. The standard assumption in the DFD literature - that square regions are equifocal - is shown to be problematic around discontinuities in depth and surface orientation. Moreover, through our experiments, we have demonstrated that an elliptical model can be used to evolve measurement regions that produce more accurate depth estimates near such features. We have, for the first time, presented an algorithm that iteratively tailors the measurement region to the structure of the scene. Future research could address both the size and shape of the measurement region. In order to illustrate the benefits of changes in its shape, we have kept the size M of the measurement region constant. In actual usage, however, we may choose to increase the size of the measurement area in regions of the scene that are found to be fronto-parallel in order to attain improved DFD accuracy. Though we have shown that elliptical regions are more general than squares, and that this additional generality improves DFD performance, there are scene structures for which ellipses are not sufficiently general. Non-smooth equifocal contours, found near corners, will be poorly approximated by ellipses. Such structures demand a more general model for the measurement region, and would require a different evolution algorithm, which is an interesting direction for future work.
References

1. Asada, N., Fujiwara, H., Matsuyama, T.: Seeing Behind the Scene: Analysis of Photometric Properties of Occluding Edges by the Reversed Projection Blurring Model. IEEE Trans. on Patt. Anal. and Mach. Intell. 20, 155–167 (1998)
2. Bhasin, S., Chaudhuri, S.: Depth from Defocus in Presence of Partial Self Occlusion. In: Proc. Intl. Conf. on Comp. Vis., pp. 488–493 (2001)
3. Debevec, P., Malik, J.: Recovering High Dynamic Range Radiance Maps from Photographs. In: Proc. SIGGRAPH, pp. 369–378 (1997)
4. Chaudhuri, S., Rajagopalan, A.: Depth from Defocus: A Real Aperture Imaging Approach. Springer, Heidelberg (1999)
5. Ens, J., Lawrence, P.: Investigation of Methods for Determining Depth from Focus. IEEE Trans. on Patt. Anal. and Mach. Intell. 15(2), 97–108 (1993)
6. Favaro, P., Soatto, S.: Seeing beyond occlusions (and other marvels of a finite lens aperture). In: Proc. CVPR 2003, vol. 2, pp. 579–586 (June 2003)
7. Favaro, P., Soatto, S.: A Geometric Approach to Shape from Defocus. IEEE Trans. on Patt. Anal. and Mach. Intell. 27(3), 406–417 (2005)
8. Gökstorp, M.: Computing Depth from Out-of-Focus Blur Using a Local Frequency Representation. In: Proc. of the IAPR Conf. on Patt. Recog., pp. 153–158 (1994)
9. Hasinoff, S.W., Kutulakos, K.N.: Confocal Stereo. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 620–634. Springer, Heidelberg (2006)
10. McCloskey, S., Langer, M., Siddiqi, K.: The Reverse Projection Correlation Principle for Depth from Defocus. In: Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission (2006)
11. Nayar, S.K., Watanabe, M.: Minimal Operator Set for Passive Depth from Defocus. In: Proc. CVPR 1996, pp. 431–438 (June 1996)
12. Pentland, A.: A New Sense for Depth of Field. IEEE Trans. on Patt. Anal. and Mach. Intell. 9(4), 523–531 (1987)
13. Pentland, A., Scherock, S., Darrell, T., Girod, B.: Simple Range Cameras Based on Focal Error. J. of the Optical Soc. Am. 11(11), 2925–2935 (1994)
14. Subbarao, M., Surya, G.: Depth from Defocus: A Spatial Domain Approach. Intl. J. of Comp. Vision 13, 271–294 (1994)
15. Subbarao, M.: Parallel Depth Recovery by Changing Camera Parameters. In: Proc. Intl. Conf. on Comp. Vis., pp. 149–155 (1998)
16. Xiong, Y., Shafer, S.A.: Moment Filters for High Precision Computation of Focus and Stereo. In: Proc. Intl. Conf. on Robotics and Automation, pp. 108–113 (1995)
17. Zhang, L., Nayar, S.K.: Projection Defocus Analysis for Scene Capture and Image Display. In: Proc. SIGGRAPH, pp. 907–915 (2006)
A New Framework for Grayscale and Colour Non-lambertian Shape-from-Shading

William A.P. Smith and Edwin R. Hancock
Department of Computer Science, The University of York
{wsmith,erh}@cs.york.ac.uk
Abstract. In this paper we show how arbitrary surface reflectance properties can be incorporated into a shape-from-shading scheme, by using a Riemannian minimisation scheme to minimise the brightness error. We show that for face images an additional regularising constraint on the surface height function is all that is required to recover accurate face shape from single images, the only assumption being of a single light source of known direction. The method extends naturally to colour images, which add additional constraints to the problem. For our experimental evaluation we incorporate the Torrance and Sparrow surface reflectance model into our scheme and show how to solve for its parameters in conjunction with recovering a face shape estimate. We demonstrate that the method provides a realistic route to non-Lambertian shape-from-shading for both grayscale and colour face images.
1 Introduction
Shape-from-shading is a classical problem in computer vision which has attracted over four decades of research. The problem is underconstrained and proposed solutions have, in general, made strong assumptions in order to make the problem tractable. The most common assumption is that the surface reflectance is perfectly diffuse and is explained using Lambert’s law. For many types of surface, this turns out to be a poor approximation to the true reflectance properties. For example, in face images specularities caused by perspiration constitute a significant proportion of the total surface reflectance. More complex models of reflectance have rarely been considered in single image shape-from-shading. Likewise, the use of colour images for shape recovery has received little attention. In this paper we present a general framework for solving the shape-from-shading problem which can make use of any parametric model of reflectance. We use techniques from differential geometry in order to show how the problem of minimising the brightness error can be solved using gradient descent on a spherical manifold. We experiment with a number of regularisation constraints and show that from single face images, given only the illumination direction, we can make accurate estimates of the shape and reflectance parameters. We show how the method extends naturally to colour images and that both the shape and diffuse albedo maps allow convincing view synthesis under a wide range of illumination directions, viewpoints and illuminant colours.
1.1 Related Work
Ahmed and Farag [1] incorporate a non-Lambertian reflectance model into a perspective shape-from-shading algorithm. By accounting for light source attenuation they are able to resolve convex/concave ambiguities. Georghiades [2] uses a similar approach to the work presented here in an uncalibrated photometric stereo framework. From multiple grayscale images he estimates reflectance parameters and surface shape using an integrability constraint. Work that considers the use of colour images for shape recovery includes Christensen and Shapiro [3]. They use a numerical approach to transform a pixel brightness colour vector into the set of normals that corresponds to that colour. By using multiple images they are able to constrain the choice of normal directions to a small set, from which one is chosen using additional constraints. Ononye and Smith [4] consider the problem of shape-from-colour, i.e. estimating surface normal directions from single, colour intensity measurements. They present a variational approach that is able to recover coarse surface shape estimates from single colour images.
2 Solving Shape-from-Shading
The aim of computational shape-from-shading is to make estimates of surface shape from the intensity measurements in a single image. Since the amount of light reflected by a point on a surface is related to the surface orientation at that point, in general the shape is estimated in the form of a field of surface normals (a needle-map). If the surface height function is required, this can be estimated from the surface normals using integration. In order to recover surface orientation from image intensity measurements, the reflectance properties of the surface under study must be modelled.

2.1 Radiance Functions
The complex process of light reflecting from a surface can be captured by the bidirectional reflectance distribution function (BRDF). The BRDF describes the ratio of the emitted surface radiance to the incident irradiance over all possible incident and exitant directions. A number of parametric reflectance models have been proposed which capture the reflectance properties of different surface types with varying degrees of accuracy. Assuming a normalised and linear camera response, the image intensity predicted by a particular parametric reflectance model is given by its radiance function: I = g(N, L, V, P). This equality is the image irradiance equation. The arguments N, L and V are unit vectors in the direction of the local surface normal, the light source and viewer respectively. We use the set P to denote the set of additional parameters specific to the particular reflectance model in use. The best known and simplest reflectance model follows Lambert’s law which predicts that light is scattered equally in all directions, moderated by a diffuse albedo term, ρd, which describes the intrinsic reflectivity of the surface:

gLambertian(N, L, V, {ρd}) = ρd N · L.    (1)
Note that the image intensity is independent of viewing direction. Other models capture more complex effects. For example, the Torrance and Sparrow [5] model has previously been shown to be suitable for approximating skin reflectance properties in a vision context [2] since it captures effects such as off-specular forward scattering at large incident angles. If the effects of Fresnel reflectance and bistatic shadowing are discounted, the Torrance and Sparrow model has the radiance function:

gT&S(N, L, V, {ρd, ρs, ν, L}) = Lρd N · L + Lρs e^(−ν² arccos²(N · (L+V)/‖L+V‖)) / (N · V),    (2)

where ρs is the specular coefficient, ν the surface roughness, L the light source intensity and, once again, ρd is the diffuse albedo.
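As a concrete reference, the sketch below evaluates the radiance functions of eqs. (1) and (2) for unit vectors N, L and V. It is a direct transcription of the equations rather than library code; the clipping before arccos is a numerical safeguard added here, and L_intensity stands for the light source intensity L of eq. (2).

```python
# Minimal sketch of the radiance functions of Sec. 2.1; N, L, V are unit numpy 3-vectors.
import numpy as np

def lambertian(N, L, rho_d):
    return rho_d * (N @ L)                            # eq. (1)

def torrance_sparrow(N, L, V, rho_d, rho_s, nu, L_intensity):
    H = (L + V) / np.linalg.norm(L + V)               # bisector of light and view directions
    alpha = np.arccos(np.clip(N @ H, -1.0, 1.0))      # angle between N and the half-vector
    diffuse = L_intensity * rho_d * (N @ L)
    specular = L_intensity * rho_s * np.exp(-(nu ** 2) * alpha ** 2) / (N @ V)
    return diffuse + specular                          # eq. (2)
```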
2.2 Brightness Error
The radiance function provides a succinct mapping between the reflectance geometry in the scene and the observed intensity. For an image in which the viewer and light source directions are fixed, the radiance function reduces to a function of one variable: the surface normal. This image-specific function is known as the reflectance map in the shape-from-shading literature. The squared error between the observed image intensities, I(x, y), and those predicted by the estimated surface normals, n(x, y), according to the chosen reflectance model is known as the brightness error. In terms of the radiance function this is expressed as follows:

EBright(n) = Σx,y (I(x, y) − g(n(x, y), L, V, P(x, y)))².    (3)
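A minimal sketch of evaluating eq. (3) over an image is given below. Since L, V and the parameters are fixed for a given image, the radiance function is passed in already reduced to the reflectance map, mirroring the text; the names are illustrative.

```python
# Brightness error of eq. (3), accumulated over all pixels.
import numpy as np

def brightness_error(I, normals, reflectance_map):
    """I: (H, W) observed intensities; normals: (H, W, 3) unit normals."""
    predicted = np.apply_along_axis(reflectance_map, 2, normals)   # reflectance map applied to n(x, y)
    return float(np.sum((I - predicted) ** 2))
```

For the Lambertian case, for example, reflectance_map could be `lambda n: lambertian(n, L, rho_d)` using the sketch from Sec. 2.1.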
For typical reflectance models, this function does not have a unique minimum. In fact, there is likely to be an infinite set of normal directions all of which minimise the brightness error. For example, in the case of a Lambertian surface the minimum of the brightness error will be a circle on the unit sphere.

2.3 The Variational Approach
In order to make the shape-from-shading problem tractable, the most common approach has been to augment the brightness error with a regularization term, EReg(n), which penalises departures from a constraint based on the surface normals. A wide range of such constraints have been considered, but obvious choices are surface smoothness and integrability. In some of the earliest work, Horn and Brooks [6] used a simple smoothness constraint in a regularization framework. They used variational calculus to solve the minimisation n∗ = argminn EBright(n) + λEReg(n), where λ is a Lagrange multiplier which effectively weights the influence of the two terms. The resulting iterative solution is [7]:

nt+1 = fReg(nt) + (C(I − nt · L)/λ) L,    (4)
where fReg(nt) is a function which enforces the regularising constraint (in this case a simple neighbourhood averaging which effectively smooths the field of surface normals). The second term provides a step in the light source direction of a size proportional to the deviation of nt from the image irradiance equation and seeks to reduce the brightness error. The weakness of the Horn and Brooks approach is that for reasons of numerical stability, a large value of λ is typically required. The result is that the smoothing term dominates and image brightness constraints are only weakly satisfied. The recovered surface normals therefore lose much of the fine surface detail and do not accurately recreate the image.

2.4 The Geometric Approach
An approach which overcomes these deficiencies was proposed by Worthington and Hancock [7]. Their idea was to choose a solution which strictly satisfies the brightness constraint at every pixel but uses the regularisation constraint to help choose a solution from within this reduced solution space. This is possible because in the case of Lambertian reflectance, it is straightforward to select a surface normal direction whose brightness error is 0. From (1), it is clear that the angle θi = ∠NL can be recovered using θi = arccos(I), assuming unit albedo. By ensuring a normal direction is chosen that meets this condition, the brightness error is strictly minimised. In essence, the method applies the regularization constraint within the subspace of solutions which have a brightness error of 0: n∗ = argmin_{EBright(n)=0} EReg(n).
To solve this minimisation, Worthington and Hancock use a two-step iterative procedure which decouples application of the regularization constraint and projection onto the closest solution with zero brightness error:
1. nt = fReg(nt)
2. nt+1 = argmin_{EBright(n)=0} d(n, nt),
where d(., .) is the arc distance between two unit vectors and fReg (nt ) enforces a robust regularizing constraint. The second step of this process is implemented using nt+1 = Θnt , where Θ is a rotation matrix which rotates a unit vector to the closest direction that satisfies θi = arccos(I).
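As a sketch of this projection step (our own reading of it, assuming unit albedo and a unit light source vector L, and not the authors' implementation), the updated normal keeps the azimuth of the current estimate around L but is placed on the cone of half-angle arccos(I):

import numpy as np

def project_to_lambertian_cone(n, L, I):
    # Closest unit vector to n whose Lambertian brightness equals I,
    # i.e. whose angle to the unit light direction L is arccos(I).
    theta = np.arccos(np.clip(I, 0.0, 1.0))
    perp = n - np.dot(n, L) * L              # component of n orthogonal to L
    if np.linalg.norm(perp) < 1e-12:         # n parallel to L: pick any azimuth
        perp = np.cross(L, np.array([1.0, 0.0, 0.0]))
        if np.linalg.norm(perp) < 1e-12:
            perp = np.cross(L, np.array([0.0, 1.0, 0.0]))
    u = perp / np.linalg.norm(perp)
    return np.cos(theta) * L + np.sin(theta) * u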
3 Minimising the Brightness Error
Worthington and Hancock exploited the special case of Lambertian reflectance in which the brightness error can be made zero using a single rotation whose axis and magnitude can be found analytically. Our approach in this paper effectively generalises their idea to reflectance models that do not have such easily invertible forms. For Lambertian reflectance, our approach would take the same single step in order to minimise the brightness error. Neither their approach, nor that of Horn and Brooks, can incorporate arbitrary reflectance models.
In general, finding a surface normal that minimises the brightness error requires minimising a function of a unit vector. It is convenient to consider a unit vector as a point lying on a spherical manifold. Accordingly, the surface normal n corresponds to the point on the unit 2-sphere, n ∈ S^2, where Φ(n) = n and Φ : S^2 → R^3 is an embedding. We can therefore define the brightness error for the local surface normal as a function of a point on the manifold S^2: f(n) = (g(Φ(n), L, V, P) − I)^2. The need to solve minimisation problems on Riemannian manifolds arises when performing diffusion on tensor-valued data. This problem has recently been addressed by Zhang and Hancock [8]. We propose a similar approach here for finding minima in the brightness error on the unit 2-sphere for arbitrary reflectance functions. This provides an elegant formulation for the shape-from-shading problem stated in very general terms. In order to do so we make use of the exponential map, which transforms a point on the tangent plane to a point on the sphere. If v ∈ T_n S^2 is a vector on the tangent plane to S^2 at n ∈ S^2 and v ≠ 0, the exponential map, denoted Exp_n, of v is the point on S^2 along the geodesic in the direction of v at distance ‖v‖ from n. Geometrically, this is equivalent to marking out a length equal to ‖v‖ along the geodesic that passes through n in the direction of v. The point on S^2 thus obtained is denoted Exp_n(v). This is illustrated in Figure 1. The inverse of the exponential map is the log map, which transforms a point from the sphere to the tangent plane.

Fig. 1. The exponential map from the tangent plane to the sphere.

We can approximate the local gradient of the error function f in terms of a vector on the tangent plane T_n S^2 using finite differences:

∇f(n) ≈ ( ( f(Exp_n((ε, 0)^T)) − f(n) ) / ε , ( f(Exp_n((0, ε)^T)) − f(n) ) / ε )^T .   (5)
We can therefore use gradient descent to find the local minimum of the brightness error function:

n^{t+1} = Exp_{n^t}( −γ ∇f(n^t) ) ,   (6)

where γ is the step size. At each iteration we perform a line search using Newton's method to find the optimum value of γ. In the case of Lambertian reflectance, the radiance function strictly monotonically decreases as θ_i increases. Hence, the gradient of the error function will be a vector that lies on the geodesic curve that passes through the current normal estimate and the point corresponding to the light source direction vector. The optimal value of γ will reduce the brightness error to zero and the update is equivalent to the one-step approach of Worthington and Hancock.
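The following Python sketch illustrates (5) and (6) for an arbitrary scalar error function f on the unit sphere. It is not the authors' implementation: a fixed step size γ is used in place of the Newton line search, and the choice of tangent basis is our own.

import numpy as np

def exp_map(n, v):
    # Exponential map on the unit 2-sphere: v is a tangent vector at n (v ⊥ n).
    t = np.linalg.norm(v)
    if t < 1e-12:
        return n
    return np.cos(t) * n + np.sin(t) * (v / t)

def tangent_basis(n):
    # Any orthonormal basis (e1, e2) of the tangent plane at n.
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(n, a)
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(n, e1)
    return e1, e2

def sphere_gradient_descent(f, n, gamma=0.1, eps=1e-4, iters=50):
    # Minimise f on the sphere from a unit starting vector n, using the
    # finite-difference gradient of (5) and the update (6).
    for _ in range(iters):
        e1, e2 = tangent_basis(n)
        g1 = (f(exp_map(n, eps * e1)) - f(n)) / eps
        g2 = (f(exp_map(n, eps * e2)) - f(n)) / eps
        n = exp_map(n, -gamma * (g1 * e1 + g2 * e2))
    return n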
For more complex reflectance models, the minimisation will require more than one iteration. We solve the minimisation on the unit sphere in a two-step iterative process:

1. n^t = f_Reg(n^t)
2. n^{t+1} = arg min_n E_Bright(n),
where step 2 is solved using the gradient descent method given in (6) with n^t as the initialisation. Our approach extends naturally to colour images. The error functional to be minimised on the unit sphere simply comprises the sum of the squared errors for each colour channel:

f(n) = Σ_{c ∈ {R,G,B}} ( g(Φ(n), L, V, P_c) − I_c )^2 .   (7)
Note that for colour images the problem is more highly constrained, since the ratio of knowns to unknowns improves. This is because the surface shape is fixed across the three colour channels.
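A minimal sketch of the colour functional (7); here g, P_rgb and I_rgb are placeholders for the chosen radiance function, its per-channel parameters and the observed channel intensities, and are not names used by the authors.

def colour_brightness_error(n, g, L, V, P_rgb, I_rgb):
    # Sum of squared per-channel errors, as in (7).
    return sum((g(n, L, V, P_rgb[c]) - I_rgb[c]) ** 2 for c in ('R', 'G', 'B'))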
4 Regularisation Constraints
In this paper we use a statistical regularisation constraint, closely related to integrability [9]. Suppose that a facial surface F ∈ R^3 is projected orthographically onto the image plane and parameterised by the function z(x, y). We can express the surface F in terms of a linear combination of K surface functions Ψ_i (or modes of variation):

z_b(x, y) = Σ_{i=1}^{K} b_i Ψ_i(x, y),   (8)

where the coefficients b = (b_1, ..., b_K)^T are the surface parameters. In this paper we use a surface height basis set learnt from a set of exemplar face surfaces. Here, the modes of variation are found by applying PCA to a representative sample of face surfaces and Ψ_i is the eigenvector of the covariance matrix of the training samples corresponding to the ith largest eigenvalue. We may express the surface normals in terms of the parameter vector b:

n_b(x, y) = ( Σ_{i=1}^{K} b_i ∂_x Ψ_i(x, y) , Σ_{i=1}^{K} b_i ∂_y Ψ_i(x, y) , −1 )^T .   (9)
When we wish to refer to the corresponding vectors of unit length we use n̂_b(x, y) = n_b(x, y) / ‖n_b(x, y)‖. A field of normals expressed in this manner satisfies a stricter constraint than standard integrability. The field of normals will be integrable since they correspond exactly to the surface given by (8). But in addition, the surface corresponding to the field of normals is also constrained to lie within the span of the surface height model. We term this constraint model-based integrability.
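A possible NumPy realisation of (9) and of the unit-length normalisation is sketched below; the (K, H, W) array layout for the basis gradients is our own assumption, not something prescribed by the paper.

import numpy as np

def normals_from_parameters(b, Px, Py):
    # Px, Py: (K, H, W) arrays holding the x- and y-derivatives of the K basis
    # surfaces Psi_i; b: (K,) parameter vector.  Returns unit normals (H, W, 3).
    p = np.tensordot(b, Px, axes=1)          # sum_i b_i * dPsi_i/dx
    q = np.tensordot(b, Py, axes=1)          # sum_i b_i * dPsi_i/dy
    n = np.stack([p, q, -np.ones_like(p)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)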
In order to apply this constraint to the (possibly non-integrable) field of surface normals n(x, y), we seek the parameter vector b*, whose field of surface normals given by (9), minimises the distance to n(x, y). We pose this as minimising the squared error between the surface gradients of n(x, y) and those given by (9). The surface gradients of n(x, y) are p(x, y) = n_x(x, y)/n_z(x, y) and q(x, y) = n_y(x, y)/n_z(x, y). The optimal solution is therefore given by:

b* = arg min_b Σ_{x,y} [ ( Σ_{i=1}^{K} b_i p_i(x, y) − p(x, y) )^2 + ( Σ_{i=1}^{K} b_i q_i(x, y) − q(x, y) )^2 ] ,   (10)
where p_i(x, y) = ∂_x Ψ_i(x, y) and q_i(x, y) = ∂_y Ψ_i(x, y). The solution to this minimisation is linear in b and is solved using linear least squares as follows. If the input image is of dimension M × N, we form a vector of length 2MN of the surface gradients of n(x, y):

G = ( p(1, 1), q(1, 1), ..., p(M, N), q(M, N) )^T .   (11)
We then form the 2MN × K matrix of the surface gradients of the eigenvectors, Ψ, whose ith column is Ψ_i = [ p_i(1, 1), q_i(1, 1), ..., p_i(M, N), q_i(M, N) ]^T. We may now state our least squares problem in terms of matrix operations: b* = arg min_b ‖Ψb − G‖^2. The least squares solution is given by:

b* = ( Ψ^T Ψ )^{−1} Ψ^T G .   (12)
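The fit of (10)-(12) is an ordinary linear least-squares problem. The sketch below (again assuming the hypothetical (K, H, W) layout for the basis gradients) builds G and Ψ as in (11) and solves for b with a standard least-squares routine rather than forming the normal equations of (12) explicitly, which is numerically preferable but equivalent.

import numpy as np

def fit_model_parameters(p, q, Px, Py):
    # p, q: (H, W) measured surface gradients of the current normal field;
    # Px, Py: (K, H, W) gradients of the basis surfaces.
    K = Px.shape[0]
    G = np.stack([p, q], axis=-1).reshape(-1)            # interleaved as in (11)
    Psi = np.stack([np.stack([Px[i], Py[i]], axis=-1).reshape(-1)
                    for i in range(K)], axis=1)          # (2*H*W, K)
    b, *_ = np.linalg.lstsq(Psi, G, rcond=None)
    return b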
With the optimal parameter vector to hand, the field of surface normals satisfying the model-based integrability constraint is given by (9). Furthermore, we have also implicitly recovered the surface height, which is given by (8).

4.1 Implementation
For our implementation, we choose to employ the Torrance and Sparrow [5] reflectance model given in (2). We make a number of assumptions to simplify its fitting. The first follows [2] and assumes that the specular coefficient, ρ_s, and roughness parameter, ν, are constant over the surface, i.e. we estimate only a single value for each from one image. We allow the diffuse albedo, ρ_d, to vary arbitrarily across the face, though we do not allow albedo values greater than one. For colour images we also allow the diffuse albedo to vary between colour channels. However, we fix the specular coefficient and roughness terms to remain the same. In doing so we are making the assumption that specular reflection is independent of the colour of the surface. For both colour and grayscale images, we introduce a regularisation constraint on the albedo by applying a light anisotropic diffusion to the estimated albedo maps at each iteration [10]. In performing this step we are assuming the albedo is piecewise smooth. We initialise the normal field to the surface normals of the average face surface from the samples used to construct the surface height model.
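One possible reading of the "light anisotropic diffusion" applied to the albedo maps is a single Perona-Malik diffusion step [10]; the sketch below is illustrative only (the parameter values are not those used in the paper, and periodic boundary handling is used for brevity).

import numpy as np

def perona_malik_step(rho, kappa=0.1, lam=0.2):
    # One Perona-Malik anisotropic diffusion step on an albedo map rho (H, W).
    dN = np.roll(rho, -1, axis=0) - rho
    dS = np.roll(rho,  1, axis=0) - rho
    dE = np.roll(rho, -1, axis=1) - rho
    dW = np.roll(rho,  1, axis=1) - rho
    g = lambda d: np.exp(-(d / kappa) ** 2)   # edge-stopping function
    return rho + lam * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)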
Algorithm 1. Non-Lambertian shape-from-shading algorithm

Input: Light source direction L, image intensities I(x, y) and gradients of surface functions Ψ
Output: Estimated surface normal map n(x, y), surface height z_b(x, y) and T&S parameters: ρ_s, ρ_d(x, y), ν and L.

1  Initialise ρ_s^(0) = 0.2, ρ_d^(0)(x, y) = 0.8, ν^(0) = 2 and b = (0, ..., 0)^T;
2  Set iteration t = 1;
3  repeat
4    Find L^(t) by solving for L in (2) (linear least squares) using surface normals n̂_{b^(t−1)}(x, y), fix all other parameters;
5    Find ρ_s^(t) and ν^(t) by solving the nonlinear minimisation of (2) using Newton's method, keeping all other parameters fixed and using ρ_s^(t−1) and ν^(t−1) as an initialisation and normals n̂_{b^(t−1)}(x, y);
6    Find n^(t)(x, y) by minimising the brightness error by solving (4) for every pixel (x, y) using n̂_{b^(t−1)}(x, y) as initialisation, fix all other parameters;
7    Enforce model-based integrability. Form the matrix of surface gradients G^(t) using Equation (11) from n^(t)(x, y) and find b^(t) by solving: b^(t) = (Ψ^T Ψ)^{−1} Ψ^T G^(t);
8    Calculate the diffuse albedo for every pixel: ρ_d^(t)(x, y) = min( 1, ( I(N·V) − L ρ_s^(t) e^{−ν^2 arccos^2(N·(L+V)/‖L+V‖)} ) / ( L(N·V)(N·L) ) ), where N = n̂_{b^(t)}(x, y);
9    Set iteration t = t + 1;
10 until Σ_{x,y} arccos^2( n^(t)(x, y) · n^(t−1)(x, y) ) < ε;
The algorithm for a grayscale image is given in Algorithm 1. For a colour image we simply replace the minimisation term in step 7 with (7) and calculate the diffuse albedo and light source intensity independently for each colour channel.
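The stopping test on the last line of Algorithm 1 measures the total squared angular change between consecutive normal fields. Assuming the normals are stored as an (H, W, 3) array, it can be computed as in the small helper below (a sketch, not the authors' code).

import numpy as np

def normal_field_change(n_new, n_old):
    # Sum over pixels of arccos^2 of the dot product between consecutive
    # normal estimates, as used in the convergence test of Algorithm 1.
    cosines = np.clip(np.sum(n_new * n_old, axis=-1), -1.0, 1.0)
    return np.sum(np.arccos(cosines) ** 2)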
5 Experiments
We now demonstrate the results of applying our non-Lambertian shape-from-shading algorithm to both grayscale (drawn from the Yale B database [11]) and colour (drawn from the CMU PIE database [12]) face images. The model-based integrability constraint is constructed by applying PCA to Cartesian height maps extracted from the 3DFS database [13] range data.

Fig. 2. Results on grayscale images (columns: input, estimated Lambertian, rotated Lambertian, novel illumination, real view, novel pose).

In Figure 2 we show results of applying our technique to three grayscale input images (shown in the first column). In the second and third columns we show the estimated shape rendered with Lambertian reflectance and frontal illumination in both a frontal and a rotated view. The shape estimates are qualitatively good and successfully remove the effects of non-Lambertian reflectance. In the fourth and fifth columns we compare a synthesised image under novel illumination with a real image under the same illumination (the light source is 22.3° from frontal). Finally, in the sixth column
we show a synthesised novel pose, rotated 30° from frontal. We keep the light source and viewing direction coincident and hence the specularities are in a different position than in the input images. The result is quite convincing.

In Figure 3 we provide an example of applying our technique to an input image in which the illumination is non-frontal. In this case the light source is positioned 25° along the negative horizontal axis. We show the input image on the left and on the right we show an image in which we have rendered the recovered shape with frontal illumination using the estimated reflectance parameters. This allows us to normalise images to frontal illumination.

Fig. 3. Correcting for non-frontal illumination.

In Figure 4 we show the results of applying our method to colour images taken from the CMU PIE database. As is clear in the input images, the illuminant is strongest in the blue channel, resulting in the faces appearing unnaturally blue. The subject shown in the third row is particularly challenging because of the lack of shading information due to facial hair.

Fig. 4. Results on colour images (columns: input, diffuse albedo, white illuminant, estimated Lambertian, rotated Lambertian, novel pose).

Fig. 5. Histogram of estimated light source hue for 67 CMU PIE subjects (axes: light source hue in degrees vs. frequency).

For each subject we apply our algorithm to a frontally illuminated image. This provides an estimate of the
colour of the light source in terms of an RGB vector: (L_R, L_G, L_B)^T. To demonstrate that our estimate of the colour of the light source is stable across all 67 subjects, we convert the estimated light source colour into HSV space and plot a histogram of the estimated hue values in Figure 5. It can be seen that this estimate is quite stable and that all samples lie within the 'blue' range of hue values. The mean estimated hue was 215.6°.

Returning to Figure 4, in the second column we show the estimated diffuse albedo in the three colour channels. These appear to have accurately recovered the colour of features such as lips, skin and facial hair, despite the use of a coloured illuminant. Note that residual shading effects in the albedo maps are minimal. In the third column we show a synthesised image in which we have rendered the estimated shape using a white light source and the estimated reflectance parameters. These are qualitatively convincing and appear to have removed the effect of the coloured light source. This effectively provides a route to facial colour constancy. In the fourth and fifth columns we show the estimated shape rendered with Lambertian reflectance in both a frontal and rotated view. Finally, in the sixth column we show a synthesised image in a novel pose rendered with a white light source coincident with the viewing direction. Note that the specularities are in a different position compared to the input images.

Fig. 6. Synthesising colour images under novel lighting.
Finally, in Figure 6 we provide some additional examples of the quality of image that can be synthesised under novel lighting conditions. From the input image in the top row of Figure 4, we synthesise images using white light from a variety of directions which subtend an angle of approximately 35° with the viewing direction.
6 Conclusions
In this paper we have presented a new framework for solving shape-from-shading problems which can incorporate arbitrary surface reflectance models. We used techniques from Riemannian geometry to minimise the brightness error in a manner that extends naturally to colour images. We experimented with the Torrance and Sparrow reflectance model on both grayscale and colour images. We showed that the shape and reflectance information we recover from one image is sufficient for realistic view synthesis. An obvious target for future work is to exploit the recovered information for the purposes of face recognition. The work also raises a number of questions that we do not answer here. The first is whether the iterative solution of two minimisation steps always converges in a stable manner (experimental results would suggest this is the case). The second is whether these two steps could be combined into a single minimisation in a more elegant manner. Finally, the generalisation power of the statistical model impacts upon the precision of the recovered face surfaces; it would be interesting to test this experimentally.
References

1. Ahmed, A.H., Farag, A.A.: A new formulation for shape from shading for non-lambertian surfaces. In: Proc. CVPR, vol. 2, pp. 1817–1824 (2006)
2. Georghiades, A.: Recovering 3-d shape and reflectance from a small number of photographs. In: Eurographics Symposium on Rendering, pp. 230–240 (2003)
3. Christensen, P.H., Shapiro, L.G.: Three-dimensional shape from color photometric stereo. Int. J. Comput. Vision 13, 213–227 (1994)
4. Ononye, A.E., Smith, P.W.: Estimating the shape of a surface with non-constant reflectance from a single color image. In: Proc. BMVC, pp. 163–172 (2002)
5. Torrance, K., Sparrow, E.: Theory for off-specular reflection from roughened surfaces. J. Opt. Soc. Am. 57, 1105–1114 (1967)
6. Horn, B.K.P., Brooks, M.J.: The variational approach to shape from shading. Comput. Vis. Graph. Image Process. 33, 174–208 (1986)
7. Worthington, P.L., Hancock, E.R.: New constraints on data-closeness and needle map consistency for shape-from-shading. IEEE Trans. Pattern Anal. Mach. Intell. 21, 1250–1267 (1999)
8. Zhang, F., Hancock, E.R.: A riemannian weighted filter for edge-sensitive image smoothing. In: Proc. ICPR, pp. 594–598 (2006)
9. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 10, 439–451 (1988)
10. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12, 629–639 (1990)
11. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 23, 643–660 (2001)
12. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1615–1618 (2003)
13. USF HumanID 3D Face Database, courtesy of Sudeep Sarkar, University of South Florida, Tampa, FL
A Regularized Approach to Feature Selection for Face Detection

Augusto Destrero^1, Christine De Mol^2, Francesca Odone^1, and Alessandro Verri^1

^1 DISI, Università di Genova, Via Dodecaneso 35, I-16146 Genova, Italy
{destrero,odone,verri}@disi.unige.it
^2 Université Libre de Bruxelles, boulevard du Triomphe, 1050 Bruxelles, Belgium
[email protected]

Abstract. In this paper we present a trainable method for selecting features from an overcomplete dictionary of measurements. The starting point is a thresholded version of the Landweber algorithm for providing a sparse solution to a linear system of equations. We consider the problem of face detection and adopt rectangular features as an initial representation to allow straightforward comparisons with existing techniques. For computational efficiency and memory requirements, instead of implementing the full optimization scheme on tens of thousands of features, we propose to first solve a number of smaller-size optimization problems obtained by randomly sub-sampling the feature vector, and then to recombine the selected features. The obtained set is still highly redundant, so we further apply feature selection. The final feature selection system is an efficient two-stage architecture. Experimental results of an optimized version of the method on face images and image sequences indicate that this method is a serious competitor of other feature selection schemes recently popularized in computer vision for dealing with problems of real-time object detection.
1 Introduction

Overcomplete, general-purpose sets of features combined with learning techniques provide effective solutions to object detection [1,2]. Since only a small fraction of the features is usually relevant to a given problem, these methods must face a difficult problem of feature selection. Even if a coherent theory is still missing, a number of trainable methods are emerging empirically for their effectiveness [3,4,5,1,6]. An interesting way to cope with feature selection in the learning-by-examples framework is to resort to regularization techniques based on a penalty term of L1 type [4]. In the case of linear problems, a theoretical support for this strategy can be derived from [7], where it is shown that for most under-determined linear systems the minimal-L1 solution equals the sparsest solution. In this paper we explore the Lagrangian formulation of the so-called Lasso scheme [4] for selecting features in the computer vision domain. This choice is driven by three major considerations: first, a simple algorithm for obtaining the optimal solution in this setting has been recently proposed [8]. Second, this feature selection mechanism has not to date been evaluated in the vision context, a context in which many spatially highly correlated features are available. Finally, since this approach to feature selection seems to
be appropriate for mathematical analysis [8], the gathering of empirical evidence on its merits and limitations can be very useful. For the purpose of obtaining a direct comparison with state-of-the-art methods we decided to focus our research on the well-studied case of face detection. In recent years face detection has been boosted by the contribution of the learning-from-examples paradigm, which helps to solve many well-known problems related to face variability in images [9,10,11]. Component-based approaches highlighted the fact that local areas are often more meaningful and more appropriate for dealing with occlusions and deformations [12,13]. We use rectangle features [1] (widely regarded as a very good starting point for many computer vision applications) as a starting representation. We perform feature selection through a sequence of stages, each of which we motivate empirically through extensive experiments and comparisons. Our investigation shows that the proposed regularized approach to feature selection appears to be very appropriate for computer vision applications like face detection. As a by-product we obtain a deeper understanding of the properties of the adopted feature selection framework: one-step feature selection does not allow us to obtain a small number of significant features, due to the presence of strong spatial correlation between features computed on overlapping supports. As a simple way to overcome this problem we propose to repeat the feature selection procedure on a reduced but still quite large set of features. The obtained results are quite interesting. The final set of features not only leads to the construction of state-of-the-art classifiers but can also be further optimized to build a hierarchical real-time system for face detection which makes use of a very small number of loosely correlated features.
2 Iterative Algorithm with a Sparsity Constraint

In this section we describe the basic algorithm on which our feature selection method is built. We restrict ourselves to the case of a linear dependence between input and output data, which means that the problem can be reformulated as the solution of the following linear system of equations:

g = Af ,   (1)

where g = (g_1, ..., g_n) is the n × 1 vector containing the output labels, A = {A_ij}, i = 1, ..., n; j = 1, ..., p is the n × p matrix containing the collection of features j for each image i, and f = (f_1, ..., f_p) is the vector of the unknown weights to be estimated. Typically the number of features p is much larger than the dimension n of the training set, so that the system is hugely under-determined. Because of the redundancy of the feature set, we also have to deal with the collinearities responsible for severe ill-conditioning. Both difficulties call for some form of regularization and can be obviated by turning problem (1) into a penalized least-squares problem. Instead of classical regularization such as Tikhonov regularization (based on an L2 penalty term) we are looking for a penalty term that automatically enforces the presence of (many) zero weights in the vector f. Among such zero-enforcing penalties, the L1 norm of f is the only convex one, hence providing feasible algorithms for high-dimensional data. Thus we consider
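Concretely, assuming the feature responses have already been computed, the system can be assembled as sketched below. The rescaling of A anticipates the norm condition required by the thresholded Landweber scheme discussed further on and is our own addition, not a step prescribed at this point of the paper.

import numpy as np

def build_linear_system(feature_matrix, labels):
    # feature_matrix: n x p array of feature responses, one row per image;
    # labels: n-vector with entries in {-1, +1}.
    A = np.asarray(feature_matrix, dtype=float)
    g = np.asarray(labels, dtype=float)
    A = A / (np.linalg.norm(A, 2) * 1.01)   # spectral norm of A now below 1
    return A, g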
the following penalized least-squares problem, usually referred to as "lasso regression" [4]:

f_L = arg min_f { |g − Af|_2^2 + 2τ |f|_1 } ,   (2)

where |f|_1 = Σ_j |f_j| is the L1-norm of f and τ is a regularization parameter regulating the balance between the data misfit and the penalty. In feature selection problems, this parameter also allows one to vary the degree of sparsity (number of true zero weights) of the vector f. Notice that the L1-norm penalty makes the dependence of lasso solutions on g nonlinear. Hence the computation of L1-norm penalized solutions is more difficult than with L2-norm penalties. To solve (2), in this paper we adopt a simple iterative strategy:

f_L^(t+1) = S_τ[ f_L^(t) + A^T ( g − A f_L^(t) ) ] ,   t = 0, 1, ...   (3)

with arbitrary initial vector f_L^(0), where S_τ is the following "soft-thresholder":

(S_τ h)_j = h_j − τ sign(h_j)   if |h_j| ≥ τ,   and 0 otherwise.

In the absence of soft-thresholding (τ = 0) this scheme is known as the Landweber iteration, which converges to the generalized solution (minimum-norm least-squares solution) of (1). The soft-thresholded Landweber scheme (3) has been proven in [8] to converge to a minimizer of (2), provided the norm of the matrix A is renormalized to a value strictly smaller than 1.
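A direct transcription of (3) in Python/NumPy is given below as a sketch (not the authors' code). It assumes A has already been rescaled so that its norm is below 1, and uses a fixed iteration count in place of the stability-based stopping rule adopted by the authors. The indices of the non-zero entries of the returned vector correspond to the selected features.

import numpy as np

def soft_threshold(h, tau):
    # (S_tau h)_j = h_j - tau * sign(h_j) if |h_j| >= tau, and 0 otherwise.
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)

def thresholded_landweber(A, g, tau, n_iter=500):
    # Iteration (3), started from f = 0: f <- S_tau[ f + A^T (g - A f) ].
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        f = soft_threshold(f + A.T @ (g - A @ f), tau)
    return f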
3 Setting the Scene

The application motivating our work is a face detector to be integrated in a monitoring system installed in our department. For this reason most of our experiments are carried out on a dataset of images collected by the system (Fig. 1). The system monitors a busy corridor, acquiring video shots when motion is detected. We used the acquired video frames to extract examples of faces and non-faces (non-faces are motion areas containing everything but a face). We crop and rescale all images to the size 19 × 19. Our scenario has few difficult negative examples, but face examples are quite challenging, since faces rarely appear in a frontal position. In addition, the quality of the signal is low due to the fact that the acquisition device is a common video-surveillance camera and the detected objects are often affected by motion blur. The dataset we consider for our empirical evaluations is made of 4000 training data, evenly distributed between positive and negative data, 2000 validation data and a test set of 3400 images. We compute rectangle features [1] over different locations, sizes, and aspect ratios of each 19 × 19 image patch, obtaining an overcomplete set of image descriptors of about 64000 features. Given the size of the image description obtained, computing the whole set of rectangle features for each analyzed image patch would make video analysis impossible: some kind of dimensionality reduction has to be performed.
Fig. 1. Examples of positive (left) and negative (right) data gathered with our monitoring system
4 Sampled Version of the Thresholded Landweber Algorithm

In this section we present and discuss the method we propose for feature selection. We start by applying the iterative algorithm of Sec. 2 on the original set of features. We consider a problem of the form (1) in which A is the matrix of processed image data, where each entry A_ij is obtained from i = 1, ..., n images, each of which is represented by j = 1, ..., p rectangle features. Since we are in a binary classification setting we associate to each datum (image) a label g_i ∈ {−1, 1}. Each entry of the unknown vector f is associated to one feature: we perform feature selection by looking for a sparse solution f = (f_1, ..., f_p): features corresponding to non-zero weights f_i are relevant to model the diversity of the two classes. Experimental evidence showed that the choice of the initialization vector is not crucial, therefore we always initialize the weight vector f with zeros: f^(0) = 0. The stopping rule of the iterative process is related to the stability of the solution reached: at the t-th iteration we evaluate |f^(t) − f^(t−1)|: if it is smaller than a threshold T (that we choose as a proportion of f^(t), T = f^(t)/100) for 50 consecutive iterations we conclude that the obtained solution is stable and stop the iterative process.

4.1 Implementation and Design Issues

Let us now analyze in detail how we build the linear system: all images of our training set (the size of which is 4000) are represented by 64000 measurements of the rectangle features. Since these measures take real values and matrix A is 4000 × 64000, the matrix size in bytes is about 1 GB (if each entry is represented in single precision). For this reason applying the iterative algorithm described in Eq. 3 directly to the whole matrix may not be feasible on all systems: the matrix multiplication needs to be implemented carefully so that we do not keep the entire matrix A in primary memory. One possibility is to compute intermediate solutions with multiple accesses to secondary memory. We implemented a different approach, based on resampling the feature set and obtaining many smaller problems, that can be briefly described as follows: we build S feature subsets, each time extracting m features from the original set of size p (m

testing 1 > testing 2, where all the testings have the same meaning as explained in Table 1. Figure 6 shows how noisy training images could affect the recognition rate. It is obvious that when the training set is not aligned very well, all the testing
cases fail, including using probe bags and gallery bags. So it is very important to prevent noisy training images from corrupting the training subspace. Figures 7, 8 and 9 show recognition error rates on three different testing combinations. The testings have the same meaning as explained in Table 3. Optimal 1 means training with aligned bags, and optimal 2 means training with aligned single images. Iter1 and Iter3 mean the first iteration and the 3rd iteration of the base selection procedure. We can see that in all cases, the 3rd-iteration results are better than the 1st-iteration results. This supports our claim that extremely poorly registered images will not benefit the learning algorithm. We use our multiple-instance learning algorithm to exclude those bad training images from corrupting the training base. Also interestingly, in all tests, optimal 1 always performs worst, which indicates that by adding perturbations to the training base, even very noisy images, we can improve the robustness of learning algorithms. Note that in all cases, when the number of dimensions increases, the error rate will first decrease and then increase. Normally we get the best recognition rate using around the first 50 dimensions (accounting for 70% of the total energy).
Iterative Base Selection Testing2 Results
Iter 3 Optimal1
10
Optimal2
8 6 4
0
50
100 150 200 Number of Dimensions
250
6.5 Iter 1
Rank−1 recognition error rate
12
2
Iterative Base Selection Testing3 Results
25 Iter 1
Rank−1 recognition error rate
Rank−1 recognition error rate
14
Iter 3
20
Optimal1 Optimal2
15 10 5 0
0
50
100 150 200 Number of Dimensions
250
6 5.5 5 4.5 4
Iter 1 Iter 3 Optimal1 Optimal2
3.5 3 2.5 2 1.5 20 40 60 80 100 120 140 160 180 200 220 Number of Dimensions
Fig. 7. Single aligned Fig. 8. Single aligned Fig. 9. Aligned bag gallery, gallery, single aligned probe gallery, single noisy probe noisy bag probe
4 Conclusions
In this paper, we systematically studied the influence of image mis-alignment on face recognition performance, including mis-alignment in training sets, probe sets and gallery sets. We then formulated the image alignment problem in the multiple-instance learning framework. We proposed a novel supervised-clustering-based multiple-instance learning scheme for subspace training. The algorithm proceeds by iteratively updating the training set. A simple subspace method, such as Fisherface, when augmented with the proposed multiple-instance learning scheme, achieved a very high recognition rate. Experimental results show that even with noisy training and testing sets, the Fisherface learned by our multiple-instance learning scheme achieves a much higher recognition rate than the baseline algorithm in which the training and testing images are aligned accurately. Our algorithm is a meta-algorithm which can easily be used with other methods. The same framework could also be deployed to deal with illumination and occlusion problems, with different definitions of training bags and training instances.
Acknowledgments. The research in this paper was partially supported by NSF CNS-0428231.
Author Index
Abe, Shinji I-292 Agrawal, Amit I-945 Ai, Haizhou I-210 Akama, Ryo I-779 Andreopoulos, Alexander I-385 Aoki, Nobuya I-116 Aptoula, Erchan I-935 Arita, Daisaku I-159 Arth, Clemens II-447 Ashraf, Nazim II-63 Åström, Kalle II-549
Babaguchi, Noboru II-651 Banerjee, Subhashis II-85 Ben Ayed, Ismail I-925 Beveridge, J. Ross II-733 Bigorgne, Erwan II-817 Bischof, Horst I-657, II-447 Bouakaz, Sa¨ıda I-678, I-738 Boyer, Edmond II-166, II-580 Brice˜ no, Hector M. I-678, I-738 Brooks, Michael J. I-853, II-227 Byr¨ od, Martin II-549 Cai, Kangying I-779 Cai, Yinghao I-843 Cannons, Kevin I-532 Cha, Seungwook I-200 Chan, Tung-Jung II-631 Chang, Jen-Mei II-733 Chang, Wen-Yan II-621 Chaudhuri, Subhasis I-240 Chebrolu, Hima I-956 Chen, Chu-Song I-905, II-621 Chen, Ju-Chin II-700 Chen, Qian I-565, I-688 Chen, Tsuhan I-220, II-487, II-662 Chen, Wei I-843 Chen, Wenbin II-53 Chen, Ying I-832 Chen, Yu-Ting I-905 Cheng, Jian II-827 Choi, Inho I-698 Choi, Ouk II-269
Chu, Rufeng II-22 Chu, Wen-Sheng Vincnent II-700 Chun, Seong Soo I-200 Chung, Albert C.S. II-672 Chung, Ronald II-301 Cichowski, Alex I-375 Cipolla, Roberto I-335 Courteille, Fr´ed´eric II-196 Cui, Jinshi I-544 Dailey, Matthew N. I-85 Danafar, Somayeh II-457 Davis, Larry S. I-397, II-404 DeMenthon, Daniel II-404 De Mol, Christine II-881 Destrero, Augusto II-881 Detmold, Henry I-375 Di Stefano, Luigi II-517 Dick, Anthony I-375, I-853 Ding, Yuanyuan I-95 Dinh, Viet Cuong I-200 Doermann, David II-404 Donoser, Michael II-447 Dou, Mingsong II-722 Draper, Bruce II-733 Du, Wei I-365 Du, Weiwei II-590 Durou, Jean-Denis II-196 Ejiri, Masakazu I-35 Eriksson, Anders P. II-796 Fan, Kuo-Chin I-169 Farin, Dirk I-789 Foroosh, Hassan II-63 Frahm, Jan-Michael II-353 Fu, Li-Chen II-124 Fu, Zhouyu I-482, II-134 Fujimura, Kikuo I-408, II-32 Fujiwara, Takayuki II-891 Fujiyoshi, Hironobu I-915, II-806 Fukui, Kazuhiro II-467 Funahashi, Takuma II-891 Furukawa, Ryo II-206, II-847
Gao, Jizhou I-127 Gargallo, Pau II-373, II-784 Geurts, Pierre II-611 Gheissari, Niloofar II-457 Girdziuˇsas, Ram¯ unas I-811 Goel, Dhiraj I-220 Goel, Lakshya II-85 Grabner, Helmut I-657 Grabner, Michael I-657 Guillou, Erwan I-678 Gupta, Ankit II-85 Gupta, Gaurav II-394 Gupta, Sumana II-394 Gurdjos, Pierre II-196 Han, Yufei II-1, II-22 Hancock, Edwin R. II-869 Handel, Holger II-258 Hao, Pengwei II-722 Hao, Ying II-12 Hartley, Richard I-13, I-800, II-279, II-322, II-353 Hasegawa, Tsutomu I-628 Hayes-Gill, Barrie I-945 He, Ran I-54, I-728, II-22 H´eas, Patrick I-864 Hill, Rhys I-375 Hiura, Shinsaku I-149 Honda, Kiyoshi I-85 Hong, Ki-Sang II-497 Horaud, Radu II-166 Horiuchi, Takahiko I-708 Hou, Cong I-210 Hsiao, Pei-Yung II-124 Hsieh, Jun-Wei I-169 Hu, Wei I-832 Hu, Weiming I-821, I-832 Hu, Zhanyi I-472 Hua, Chunsheng I-565 huang, Feiyue II-477 Huang, Guochang I-462 Huang, Kaiqi I-667, I-843 Huang, Liang II-680 Huang, Po-Hao I-106 Huang, Shih-Shinh II-124 Huang, Weimin I-875 Huang, Xinyu I-127 Huang, Yonggang II-690 Hung, Y.S. II-186 Hung, Yi-Ping II-621
Ide, Ichiro II-774 Ijiri, Yoshihisa II-680 Ikeda, Sei II-73 Iketani, Akihiko II-73 Ikeuchi, Katsushi II-289 Imai, Akihiro I-596 Ishikawa, Hiroshi II-537 Itano, Tomoya II-206 Iwata, Sho II-570 Jaeggli, Tobias I-608 Jawahar, C.V. I-586 Je, Changsoo II-507 Ji, Zhengqiao II-363 Jia, Yunde I-512, II-641, II-754 Jiao, Jianbin I-896 Jin, Huidong I-482 Jin, Yuxin I-748 Josephson, Klas II-549 Junejo, Imran N. II-63 Kahl, Fredrik I-13, II-796 Kalra, Prem II-85 Kanade, Takeo I-915, II-806 Kanatani, Kenichi II-311 Kanbara, Masayuki II-73 Katayama, Noriaki I-292 Kato, Takekazu I-688 Kawabata, Satoshi I-149 Kawade, Masato II-680 Kawamoto, Kazuhiko I-555 Kawasaki, Hiroshi II-206, II-847 Khan, Sohaib I-647 Kim, Daijin I-698 Kim, Hansung I-758 Kim, Hyeongwoo II-269 Kim, Jae-Hak II-353 Kim, Jong-Sung II-497 Kim, Tae-Kyun I-335 Kim, Wonsik II-560 Kirby, Michael II-733 Kitagawa, Yosuke I-688 Kitahara, Itaru I-758 Klein Gunnewiek, Rene I-789 Kley, Holger II-733 Kogure, Kiyoshi I-758 Koh, Tze K. I-945 Koller-Meier, Esther I-608 Kondo, Kazuaki I-544 Korica-Pehserl, Petra I-657
Author Index Koshimizu, Hiroyasu II-891 Kounoike, Yuusuke II-424 Kozuka, Kazuki II-342 Kuijper, Arjan I-230 Kumano, Shiro I-324 Kumar, Anand I-586 Kumar, Pankaj I-853 Kuo, Chen-Hui II-631 Kurazume, Ryo I-628 Kushal, Avanish II-85 Kweon, In So II-269 Laaksonen, Jorma I-811 Lai, Shang-Hong I-106, I-638 Lambert, Peter I-251 Langer, Michael I-271, II-858 Lao, Shihong I-210, II-680 Lau, W.S. II-186 Lee, Jiann-Der II-631 Lee, Kwang Hee II-507 Lee, Kyoung Mu II-560 Lee, Sang Wook II-507 Lee, Wonwoo II-580 Lef`evre, S´ebastien I-935 Lei, Zhen I-54, II-22 Lenz, Reiner II-744 Li, Baoxin II-155 Li, Heping I-472 Li, Hongdong I-800, II-227 Li, Jiun-Jie I-169 Li, Jun II-722 Li, Ping I-789 Li, Stan Z. I-54, I-728, II-22 Li, Zhenglong II-827 Li, Zhiguo II-901 Liang, Jia I-512, II-754 Liao, ShengCai I-54 Liao, Shu II-672 Lien, Jenn-Jier James I-261, I-314, I-885, II-96, II-700 Lim, Ser-Nam I-397 Lin, Shouxun II-106 Lin, Zhe II-404 Lina II-774 Liu, Chunxiao I-282 Liu, Fuqiang I-355 Liu, Jundong I-956 Liu, Nianjun I-482 Liu, Qingshan II-827, II-901 Liu, Wenyu I-282
Liu, Xiaoming II-662 Liu, Yuncai I-419 Loke, Eng Hui I-430 Lu, Fangfang II-134, II-279 Lu, Hanqing II-827 Lubin, Jeffrey II-414 Lui, Shu-Fan II-96 Luo, Guan I-821 Ma, Yong II-680 Maeda, Eisaku I-324 Mahmood, Arif I-647 Makhanov, Stanislav I-85 Makihara, Yasushi I-452 Manmatha, R. I-586 Mao, Hsi-Shu II-96 Mar´ee, Rapha¨el II-611 Marikhu, Ramesh I-85 Martens, Ga¨etan I-251 Matas, Jiˇr´ı II-236 Mattoccia, Stefano II-517 Maybank, Steve I-821 McCloskey, Scott I-271, II-858 Mekada, Yoshito II-774 Mekuz, Nathan I-492 ´ M´emin, Etienne I-864 Metaxas, Dimitris II-901 Meyer, Alexandre I-738 Michoud, Brice I-678 Miˇcuˇs´ık, Branislav I-65 Miles, Nicholas I-945 Mitiche, Amar I-925 Mittal, Anurag I-397 Mogi, Kenji II-528 Morgan, Steve I-945 Mori, Akihiro I-628 Morisaka, Akihiko II-206 Mu, Yadong II-837 Mudenagudi, Uma II-85 Mukaigawa, Yasuhiro I-544, II-246 Mukerjee, Amitabha II-394 Murai, Yasuhiro I-915 Murase, Hiroshi II-774 Nagahashi, Tomoyuki II-806 Nakajima, Noboru II-73 Nakasone, Yoshiki II-528 Nakazawa, Atsushi I-618 Nalin Pradeep, S. I-522, II-116 Niranjan, Shobhit II-394 Nomiya, Hiroki I-502
Odone, Francesca II-881 Ohara, Masatoshi I-292 Ohta, Naoya II-528 Ohtera, Ryo I-708 Okutomi, Masatoshi II-176 Okutomoi, Masatoshi II-384 Olsson, Carl II-796 Ong, S.H. I-875 Otsuka, Kazuhiro I-324 Pagani, Alain I-769 Paluri, Balamanohar I-522, II-116 Papadakis, Nicolas I-864 Parikh, Devi II-487 Park, Joonyoung II-560 Pehserl, Joachim I-657 Pele, Ofir II-435 Peng, Yuxin I-748 Peterson, Chris II-733 Pham, Nam Trung I-875 Piater, Justus I-365 Pollefeys, Marc II-353 Poppe, Chris I-251 Prakash, C. I-522, II-116 Pujades, Sergi II-373 Puri, Manika II-414 Radig, Bernd II-332 Rahmati, Mohammad II-217 Raskar, Ramesh I-1, I-945 Raskin, Leonid I-442 Raxle Wang, Chi-Chen I-885 Reid, Ian II-601 Ren, Chunjian II-53 Rivlin, Ehud I-442 Robles-Kelly, Antonio II-134 Rudzsky, Michael I-442 Ryu, Hanjin I-200 Sagawa, Ryusuke I-116 Sakakubara, Shizu II-424 Sakamoto, Ryuuki I-758 Sato, Jun II-342 Sato, Kosuke I-149 Sato, Tomokazu II-73 Sato, Yoichi I-324 Sawhney, Harpreet II-414 Seo, Yongduek II-322 Shah, Hitesh I-240, I-522, II-116 Shahrokni, Ali II-601
Shen, Chunhua II-227 Shen, I-fan I-189, II-53 Shi, Jianbo I-189 Shi, Min II-42 Shi, Yu I-718 Shi, Zhenwei I-180 Shimada, Atsushi I-159 Shimada, Nobutaka I-596 Shimizu, Ikuko II-424 Shimizu, Masao II-176 Shinano, Yuji II-424 Shirai, Yoshiaki I-596 Siddiqi, Kaleem I-271, II-858 Singh, Gajinder II-414 Slobodan, Ili´c I-75 Smith, Charles I-956 Smith, William A.P. II-869 ˇ Sochman, Jan II-236 Song, Gang I-189 Song, Yangqiu I-180 Stricker, Didier I-769 Sturm, Peter II-373, II-784 Sugaya, Yasuyuki II-311 Sugimoto, Shigeki II-384 Sugiura, Kazushige I-452 Sull, Sanghoon I-200 Sumino, Kohei II-246 Sun, Zhenan II-1, II-12 Sung, Ming-Chian I-261 Sze, W.F. II-186 Takahashi, Hidekazu II-384 Takahashi, Tomokazu II-774 Takamatsu, Jun II-289 Takeda, Yuki I-779 Takemura, Haruo I-618 Tan, Huachun II-712 Tan, Tieniu I-667, I-843, II-1, II-12, II-690 Tanaka, Hidenori I-618 Tanaka, Hiromi T. I-779 Tanaka, Tatsuya I-159 Tang, Sheng II-106 Taniguchi, Rin-ichiro I-159, I-628 Tao, Hai I-345 Tao, Linmi I-748 Tarel, Jean-Philippe II-817 Tian, Min I-355 Tombari, Federico II-517 Tominaga, Shoji I-708
Xu, Lijie II-32 Xu, Shuang II-641 Xu, Xinyu II-155 Yagi, Yasushi I-116, I-452, I-544, I-576, II-246 Yamaguchi, Osamu II-467 Yamamoto, Masanobu I-430 Yamato, Junji I-324 Yamazaki, Masaki II-570 Yamazoe, Hirotake I-292 Yang, Ruigang I-127 Yang, Ying II-106 Ye, Qixiang I-896 Yin, Xin I-779 Ying, Xianghua I-138 Yip, Chi Lap II-764 Yokoya, Naokazu II-73 Yu, Hua I-896 Yu, Jingyi I-95 Yu, Xiaoyi II-651 Yuan, Ding II-301 Yuan, Xiaotong I-728 Zaboli, Hamidreza II-217 Zaharescu, Andrei II-166 Zha, Hongbin I-138, I-544 Zhang, Changshui I-180 Zhang, Chao II-722 Zhang, Dan I-180 Zhang, Fan I-282 Zhang, Ke I-482 Zhang, Weiwei I-355 Zhang, Xiaoqin I-821 Zhang, Yongdong II-106 Zhang, Yu-Jin II-712 Zhang, Yuhang I-800 Zhao, Qi I-345 Zhao, Xu I-419 Zhao, Youdong II-641 Zhao, Yuming II-680 Zheng, Bo II-289 Zheng, Jiang Yu I-303, II-42 Zhong, H. II-186 Zhou, Bingfeng II-837 Zhou, Xue I-832 Zhu, Youding I-408