Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6493
Ron Kimmel Reinhard Klette Akihiro Sugimoto (Eds.)
Computer Vision – ACCV 2010 10th Asian Conference on Computer Vision Queenstown, New Zealand, November 8-12, 2010 Revised Selected Papers, Part II
Volume Editors Ron Kimmel Department of Computer Science Technion – Israel Institute of Technology Haifa 32000, Israel E-mail:
[email protected] Reinhard Klette The University of Auckland Private Bag 92019, Auckland 1142, New Zealand E-mail:
[email protected] Akihiro Sugimoto National Institute of Informatics Chiyoda, Tokyo 1018430, Japan E-mail:
[email protected] ISSN 0302-9743 ISBN 978-3-642-19308-8 DOI 10.1007/978-3-642-19309-5
e-ISSN 1611-3349 e-ISBN 978-3-642-19309-5
Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011921594 CR Subject Classification (1998): I.4, I.5, I.2.10, I.2.6, I.3.5, F.2.2 LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Cover picture: Lake Wakatipu and The Remarkables, from ‘Skyline Queenstown’, where the conference dinner took place. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The 2010 Asian Conference on Computer Vision took place in the southern hemisphere, in “The Land of the Long White Cloud”, as New Zealand is known in the Maori language, in the beautiful town of Queenstown. If we try to segment the world, we realize that New Zealand does not officially belong to any continent. Similarly, in computer vision we often try to identify outliers while attempting to segment images and separate them into well-defined “continents” that we refer to as objects. Thus, the ACCV Steering Committee consciously chose this remote and beautiful island as the perfect location for ACCV2010, hosting the computer vision conference of the largest and most populous continent, Asia. Here, on the South Island, we studied and exchanged ideas about the most recent advances in image understanding and processing. Scientists from all well-defined continents (as well as ill-defined ones) submitted high-quality papers on subjects ranging from algorithms that attempt to automatically understand the content of images, to optical methods coupled with computational techniques that enhance and improve images, to capturing and analyzing the world’s geometry in preparation for higher-level image and shape understanding. Novel geometric techniques, statistical learning methods, and modern algebraic procedures are rapidly propagating into this fascinating field, as witnessed in many of the papers in this collection. For this 2010 issue of ACCV, we had to select a relatively small part of all the submissions and did our best to solve the impossible ranking problem in the process. We had three keynote speakers (Sing Bing Kang lecturing on modeling of plants and trees, Sebastian Sylwan talking about computer vision in the production of visual effects, and Tim Cootes lecturing about modelling deformable objects), eight workshops (Computational Photography and Esthetics, Computer Vision in Vehicle Technology, e-Heritage, Gaze Sensing and Interactions, Subspace, Video Event Categorization, Tagging and Retrieval, Visual Surveillance, and Application of Computer Vision for Mixed and Augmented Reality), and four tutorials. Three Program Chairs and 38 Area Chairs finalized the selection of 35 oral presentations and 171 posters out of 739 submissions, so far the highest number for ACCV. During the reviewing process we made sure that each paper was reviewed by at least three reviewers, we added a rebuttal phase for the first time in ACCV, and we held a three-day AC meeting in Tokyo to finalize the non-trivial acceptance decision-making process. Our sponsors were the Asian Federation of Computer Vision Societies (AFCV), NextWindow–Touch-Screen Technology, NICTA–Australia’s Information and Communications Technology (ICT), Microsoft Research Asia, Areograph–Interactive Computer Graphics, Adept Electronic Solutions, and 4D View Solutions.
Finally, the International Journal of Computer Vision (IJCV) sponsored the Best Student Paper Award. We wish to acknowledge a number of people for their invaluable help in putting this conference together. Many thanks to the Organizing Committee for their excellent logistical management, the Area Chairs for their rigorous evaluation of papers, the Program Committee members as well as the external reviewers for their considerable time and effort, and the authors for their outstanding contributions. We also wish to acknowledge the following individuals for their tremendous service: Yoshihiko Mochizuki for support in Tokyo (especially for the Area Chair meeting); Gisela Klette, Konstantin Schauwecker, and Simon Hermann for processing the 200+ LaTeX submissions for these proceedings; Kaye Saunders for running the conference office at Otago University; and the volunteer students from Otago University and the .enpeda.. group at The University of Auckland for their help during the conference. We also thank all the colleagues listed on the following pages who contributed to this conference in their specified roles, led by Brendan McCane, who took on the main responsibilities. ACCV2010 was a very enjoyable conference. We hope that the next ACCV meetings will attract even more high-quality submissions.

November 2010
Ron Kimmel Reinhard Klette Akihiro Sugimoto
Organization
Steering Committee Katsushi Ikeuchi Tieniu Tan Chil-Woo Lee Yasushi Yagi
University of Tokyo, Japan Institute of Automation, Chinese Academy of Science, China Chonnam National University, Korea Osaka University, Japan
Honorary Chairs P. Anandan Richard Hartley
Microsoft Research India Australian National University, NICTA
General Chairs Brendan McCane Hongbin Zha
University of Otago, New Zealand Peking University, China
Program Chairs Ron Kimmel Reinhard Klette Akihiro Sugimoto
Israel Institute of Technology University of Auckland, New Zealand National Institute of Informatics, Japan
Local Organization Chairs Brendan McCane John Morris
University of Otago, New Zealand University of Auckland, New Zealand
Workshop Chairs Fay Huang Reinhard Koch
Ilan University, Yi-Lan, Taiwan University of Kiel, Germany
Tutorial Chair Terrence Sim
National University of Singapore
Demo Chairs Kenji Irie Alan McKinnon
Lincoln Ventures, New Zealand Lincoln University, New Zealand
Publication Chairs Michael Cree Keith Unsworth
University of Waikato, New Zealand Lincoln University, New Zealand
Publicity Chairs John Barron Domingo Mery Ioannis Pitas
University of Western Ontario, Canada Pontificia Universidad Católica de Chile Aristotle University of Thessaloniki, Greece
Area Chairs Donald G. Bailey Horst Bischof Alex Bronstein Michael S. Brown Chu-Song Chen Hui Chen Laurent Cohen Daniel Cremers Eduardo Destefanis Hamid Krim Chil-Woo Lee Facundo Memoli Kyoung Mu Lee Stephen Lin Kai-Kuang Ma Niloy J. Mitra P.J. Narayanan Nassir Navab Takayuki Okatani Tomas Pajdla Nikos Paragios Robert Pless Marc Pollefeys Mariano Rivera Antonio Robles-Kelly Hideo Saito
Massey University, Palmerston North, New Zealand TU Graz, Austria Technion, Haifa, Israel National University of Singapore Academia Sinica, Taipei, Taiwan Shandong University, Jinan, China University Paris Dauphine, France Bonn University, Germany Technical University Cordoba, Argentina North Carolina State University, Raleigh, USA Chonnam National University, Gwangju, Korea Stanford University, USA Seoul National University, Korea Microsoft Research Asia, Beijing, China Nanyang Technological University, Singapore Indian Institute of Technology, New Delhi, India International Institute of Information Technology, Hyderabad, India TU Munich, Germany Tohoku University, Sendai City, Japan Czech Technical University, Prague, Czech Republic Ecole Centrale de Paris, France Washington University, St. Louis, USA ETH Zürich, Switzerland CIMAT Guanajuato, Mexico National ICT, Canberra, Australia Keio University, Yokohama, Japan
Yoichi Sato Nicu Sebe Stefano Soatto Nir Sochen Peter Sturm David Suter Robby T. Tan Toshikazu Wada Yaser Yacoob Ming-Hsuan Yang Hong Zhang Mengjie Zhang
The University of Tokyo, Japan University of Trento, Italy University of California, Los Angeles, USA Tel Aviv University, Israel INRIA Grenoble, France University of Adelaide, Australia University of Utrecht, The Netherlands Wakayama University, Japan University of Maryland, College Park, USA University of California, Merced, USA University of Alberta, Edmonton, Canada Victoria University of Wellington, New Zealand
Program Committee Members Abdenour, Hadid Achard, Catherine Ai, Haizhou Aiger, Dror Alahari, Karteek Araguas, Gaston Arica, Nafiz Ariki, Yasuo Arslan, Abdullah Astroem, Kalle August, Jonas Aura Vese, Luminita Azevedo-Marques, Paulo Bagdanov, Andy Bagon, Shai Bai, Xiang Baloch, Sajjad Baltes, Jacky Bao, Yufang Bar, Leah Barbu, Adrian Barnes, Nick Barron, John Bartoli, Adrien Baust, Maximilian Ben Hamza, Abdessamad BenAbdelkader, Chiraz Ben-ari, Rami Beng-Jin, AndrewTeoh
Benosman, Ryad Berkels, Benjamin Berthier, Michel Bhattacharya, Bhargab Biswas, Prabir Bo, Liefeng Boerdgen, Markus Bors, Adrian Boshra, Michael Bouguila, Nizar Boyer, Edmond Bronstein, Michael Bruhn, Andres Buckley, Michael Cai, Jinhai Cai, Zhenjiang Calder´ on, Jes´ us Camastra, Francesco Canavesio, Luisa Cao, Xun Carlo, Colombo Carlsson, Stefan Caspi, Yaron Castellani, Umberto Celik, Turgay Cham, Tat-Jen Chan, Antoni Chandran, Sharat Charvillat, Vincent
Chellappa, Rama Chen, Bing-Yu Chen, Chia-Yen Chen, Chi-Fa Chen, Haifeng Chen, Hwann-Tzong Chen, Jie Chen, Jiun-Hung Chen, Ling Chen, Xiaowu Chen, Xilin Chen, Yong-Sheng Cheng, Shyi-Chyi Chia, Liang-Tien Chien, Shao-Yi Chin, Tat-Jun Chuang, Yung-Yu Chung, Albert Chunhong, Pan Civera, Javier Coleman, Sonya Cootes, Tim Costeira, JoaoPaulo Cristani, Marco Csaba, Beleznai Cui, Jinshi Daniilidis, Kostas Daras, Petros Davis, Larry De Campos, Teofilo Demirci, Fatih Deng, D. Jeremiah Deng, Hongli Denzler, Joachim Derrode, Stephane Diana, Mateus Didas, Stephan Dong, Qiulei Donoser, Michael Doretto, Gianfranco Dorst, Leo Duan, Fuqing Dueck, Delbert Duric, Zoran Dutta Roy, Sumantra
Ebner, Marc Einhauser, Wolfgang Engels, Christopher Eroglu-Erdem, Cigdem Escolano, Francisco Esteves, Claudia Evans, Adrian Fang, Wen-Pinn Feigin, Micha Feng, Jianjiang Ferri, Francesc Fite Georgel, Pierre Flitti, Farid Frahm, Jan-Michael Francisco Giro Mart´ın, Juan Fraundorfer, Friedrich Frosini, Patrizio Fu, Chi-Wing Fuh, Chiou-Shann Fujiyoshi, Hironobu Fukui, Kazuhiro Fumera, Giorgio Furst, Jacob Fusiello, Andrea Gall, Juergen Gallup, David Gang, Li Gasparini, Simone Geiger, Andreas Gertych, Arkadiusz Gevers, Theo Glocker, Ben Godin, Guy Goecke, Roland Goldluecke, Bastian Goras, Bogdan Gross, Ralph Gu, I Guerrero, Josechu Guest, Richard Guo, Guodong Gupta, Abhinav Gur, Yaniv Hajebi, Kiana Hall, Peter
Hamsici, Onur Han, Bohyung Hanbury, Allan Harit, Gaurav Hartley, Richard HassabElgawi, Osman Havlena, Michal Hayes, Michael Hayet, Jean-Bernard He, Junfeng Hee Han, Joon Hiura, Shinsaku Ho, Jeffrey Ho, Yo-Sung Ho Seo, Yung Hollitt, Christopher Hong, Hyunki Hotta, Kazuhiro Hotta, Seiji Hou, Zujun Hsu, Pai-Hui Hua, Gang Hua, Xian-Sheng Huang, Chun-Rong Huang, Fay Huang, Kaiqi Huang, Peter Huang, Xiangsheng Huang, Xiaolei Hudelot, Celine Hugo Sauchelli, V´ıctor Hung, Yi-Ping Hussein, Mohamed Huynh, Cong Phuoc Hyung Kim, Soo Ichimura, Naoyuki Ik Cho, Nam Ikizler-Cinbis, Nazli Il Park, Jong Ilic, Slobodan Imiya, Atsushi Ishikawa, Hiroshi Ishiyama, Rui Iwai, Yoshio Iwashita, Yumi
Jacobs, Nathan Jafari-Khouzani, Kourosh Jain, Arpit Jannin, Pierre Jawahar, C.V. Jenkin, Michael Jia, Jiaya Jia, JinYuan Jia, Yunde Jiang, Shuqiang Jiang, Xiaoyi Jin Chung, Myung Jo, Kang-Hyun Johnson, Taylor Joshi, Manjunath Jurie, Frederic Kagami, Shingo Kakadiaris, Ioannis Kale, Amit Kamberov, George Kanatani, Kenichi Kankanhalli, Mohan Kato, Zoltan Katti, Harish Kawakami, Rei Kawasaki, Hiroshi Keun Lee, Sang Khan, Saad-Masood Kim, Hansung Kim, Kyungnam Kim, Seon Joo Kim, TaeHoon Kita, Yasuyo Kitahara, Itaru Koepfler, Georges Koeppen, Mario Koeser, Kevin Kokiopoulou, Effrosyni Kokkinos, Iasonas Kolesnikov, Alexander Koschan, Andreas Kotsiantis, Sotiris Kown, Junghyun Kruger, Norbert Kuijper, Arjan
Kukenys, Ignas Kuno, Yoshinori Kuthirummal, Sujit Kwolek, Bogdan Kwon, Junseok Kybic, Jan Kyu Park, In Ladikos, Alexander Lai, Po-Hsiang Lai, Shang-Hong Lane, Richard Langs, Georg Lao, Shihong Lao, Zhiqiang Lauze, Francois Le, Duy-Dinh Le, Triet Lee, Jae-Ho Lee, Soochahn Leistner, Christian Leonardo, Bocchi Leow, Wee-Kheng Lepri, Bruno Lerasle, Frederic Li, Chunming Li, Hao Li, Hongdong Li, Stan Li, Yongmin Liao, T.Warren Lie, Wen-Nung Lien, Jenn-Jier Lim, Jongwoo Lim, Joo-Hwee Lin, Huei-Yung Lin, Weisi Lin, Wen-Chieh(Steve) Ling, Haibin Lipman, Yaron Liu, Cheng-Lin Liu, Jingen Liu, Ligang Liu, Qingshan Liu, Qingzhong Liu, Tianming
Liu, Tyng-Luh Liu, Xiaoming Liu, Yuncai Loog, Marco Lu, Huchuan Lu, Juwei Lu, Le Lucey, Simon Luo, Jiebo Macaire, Ludovic Maccormick, John Madabhushi, Anant Makris, Dimitrios Manabe, Yoshitsugu Marsland, Stephen Martinec, Daniel Martinet, Jean Martinez, Aleix Masuda, Takeshi Matsushita, Yasuyuki Mauthner, Thomas Maybank, Stephen McHenry, Kenton McNeill, Stephen Medioni, Gerard Mery, Domingo Mio, Washington Mittal, Anurag Miyazaki, Daisuke Mobahi, Hossein Moeslund, Thomas Mordohai, Philippos Moreno, Francesc Mori, Greg Mori, Kensaku Morris, John Mueller, Henning Mukaigawa, Yasuhiro Mukhopadhyay, Jayanta Muse, Pablo Nagahara, Hajime Nakajima, Shin-ichi Nanni, Loris Neshatian, Kourosh Newsam, Shawn
Niethammer, Marc Nieuwenhuis, Claudia Nikos, Komodakis Nobuhara, Shohei Norimichi, Ukita Nozick, Vincent Ofek, Eyal Ohnishi, Naoya Oishi, Takeshi Okabe, Takahiro Okuma, Kenji Olague, Gustavo Omachi, Shinichiro Ovsjanikov, Maks Pankanti, Sharath Paquet, Thierry Paternak, Ofer Patras, Ioannis Pauly, Olivier Pavlovic, Vladimir Peers, Pieter Peng, Yigang Penman, David Pernici, Federico Petrou, Maria Ping, Wong Ya Prasad Mukherjee, Dipti Prati, Andrea Qian, Zhen Qin, Xueyin Raducanu, Bogdan Rafael Canali, Luis Rajashekar, Umesh Ramalingam, Srikumar Ray, Nilanjan Real, Pedro Remondino, Fabio Reulke, Ralf Reyes, EdelGarcia Ribeiro, Eraldo Riklin Raviv, Tammy Roberto, Tron Rosenhahn, Bodo Rosman, Guy Roth, Peter
Roy Chowdhury, Amit Rugis, John Ruiz Shulcloper, Jose Ruiz-Correa, Salvador Rusinkiewicz, Szymon Rustamov, Raif Sadri, Javad Saffari, Amir Saga, Satoshi Sagawa, Ryusuke Salzmann, Mathieu Sanchez, Jorge Sang, Nong Sang Hong, Ki Sang Lee, Guee Sappa, Angel Sarkis, Michel Sato, Imari Sato, Jun Sato, Tomokazu Schiele, Bernt Schikora, Marek Schoenemann, Thomas Scotney, Bryan Shan, Shiguang Sheikh, Yaser Shen, Chunhua Shi, Qinfeng Shih, Sheng-Wen Shimizu, Ikuko Shimshoni, Ilan Shin Park, You Sigal, Leonid Sinha, Sudipta So Kweon, In Sommerlade, Eric Song, Andy Souvenir, Richard Srivastava, Anuj Staiano, Jacopo Stein, Gideon Stottinge, Julian Strecha, Christoph Strekalovskiy, Evgeny Subramanian, Ramanathan
Sugaya, Noriyuki Sumi, Yasushi Sun, Weidong Swaminathan, Rahul Tai, Yu-Wing Takamatsu, Jun Talbot, Hugues Tamaki, Toru Tan, Ping Tanaka, Masayuki Tang, Chi-Keung Tang, Jinshan Tang, Ming Taniguchi, Rinichiro Tao, Dacheng Tavares, Jo˜ao Manuel R.S. Teboul, Olivier Terauchi, Mutsuhiro Tian, Jing Tian, Taipeng Tobias, Reichl Toews, Matt Tominaga, Shoji Torii, Akihiko Tsin, Yanghai Turaga, Pavan Uchida, Seiichi Ueshiba, Toshio Unger, Markus Urtasun, Raquel van de Weijer, Joost Van Horebeek, Johan Vassallo, Raquel Vasseur, Pascal Vaswani, Namrata Wachinger, Christian Wang, Chen Wang, Cheng Wang, Hongcheng Wang, Jue Wang, Yu-Chiang Wang, Yunhong Wang, Zhi-Heng
Wang, Zhijie Wolf, Christian Wolf, Lior Wong, Kwan-Yee Woo, Young Wook Lee, Byung Wu, Jianxin Xue, Jianru Yagi, Yasushi Yan, Pingkun Yan, Shuicheng Yanai, Keiji Yang, Herbert Yang, Jie Yang, Yongliang Yi, June-Ho Yilmaz, Alper You, Suya Yu, Jin Yu, Tianli Yuan, Junsong Yun, Il Dong Zach, Christopher Zelek, John Zha, Zheng-Jun Zhang, Cha Zhang, Changshui Zhang, Guofeng Zhang, Hongbin Zhang, Li Zhang, Liqing Zhang, Xiaoqin Zheng, Lu Zheng, Wenming Zhong, Baojiang Zhou, Cathy Zhou, Changyin Zhou, Feng Zhou, Jun Zhou, S. Zhu, Feng Zou, Danping Zucker, Steve
Additional Reviewers Bai, Xiang Collins, Toby Compte, Benot Cong, Yang Das, Samarjit Duan, Lixing Fihl, Preben Garro, Valeria Geng, Bo Gherardi, Riccardo Giusti, Alessandro Guo, Jing-Ming Gupta, Vipin Han, Long Korchev, Dmitriy Kulkarni, Kaustubh Lewandowski, Michal Li, Xin Li, Zhu Lin, Guo-Shiang Lin, Wei-Yang
Liu, Damon Shing-Min Liu, Dong Luo, Ye Magerand, Ludovic Molineros, Jose Rao, Shankar Samir, Chafik Sanchez-Riera, Jordy Suryanarayana, Venkata Tang, Sheng Thota, Rahul Toldo, Roberto Tran, Du Wang, Jingdong Wu, Jun Yang, Jianchao Yang, Linjun Yang, Kuiyuan Yuan, Fei Zhang, Guofeng Zhuang, Jinfeng
ACCV2010 Best Paper Award Committee Alfred M. Bruckstein Larry S. Davis Richard Hartley Long Quan
Technion, Israel Institute of Technology, Israel University of Maryland, USA Australian National University, Australia The Hong Kong University of Science and Technology, Hong Kong
Sponsors of ACCV2010 Main Sponsor
The Asian Federation of Computer Vision Societies (AFCV)
Gold Sponsor
NextWindow – Touch-Screen Technology
Silver Sponsors
Areograph – Interactive Computer Graphics Microsoft Research Asia Australia’s Information and Communications Technology (NICTA) Adept Electronic Solutions
Bronze Sponsor
4D View Solutions
Best Student Paper Sponsor
The International Journal of Computer Vision (IJCV)
Best Paper Prize ACCV 2010 Context-Based Support Vector Machines for Interconnected Image Annotation Hichem Sahbi, Xi Li.
Best Student Paper ACCV 2010 Fast Spectral Reflectance Recovery Using DLP Projector Shuai Han, Imari Sato, Takahiro Okabe, Yoichi Sato
Best Application Paper ACCV 2010 Network Connectivity via Inference Over Curvature-Regularizing Line Graphs Maxwell Collins, Vikas Singh, Andrew Alexander
Honorable Mention ACCV 2010 Image-Based 3D Modeling via Cheeger Sets Eno Toeppe, Martin Oswald, Daniel Cremers, Carsten Rother
Outstanding Reviewers ACCV 2010 Philippos Mordohai Peter Roth Matt Toews Andres Bruhn Sudipta Sinha Benjamin Berkels Mathieu Salzmann
Table of Contents – Part II
Posters on Day 1 of ACCV 2010 Generic Object Class Detection Using Boosted Configurations of Oriented Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Oscar Danielsson and Stefan Carlsson
1
Unsupervised Feature Selection for Salient Object Detection . . . . . . . . . . . Viswanath Gopalakrishnan, Yiqun Hu, and Deepu Rajan
15
MRF Labeling for Multi-view Range Image Integration . . . . . . . . . . . . . . . Ran Song, Yonghuai Liu, Ralph R. Martin, and Paul L. Rosin
27
Wave Interference for Pattern Description . . . . . . . . . . . . . . . . . . . . . . . . . . . Selen Atasoy, Diana Mateus, Andreas Georgiou, Nassir Navab, and Guang-Zhong Yang
41
Colour Dynamic Photometric Stereo for Textured Surfaces . . . . . . . . . . . . Zsolt Jankó, Amaël Delaunoy, and Emmanuel Prados
55
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Li, Wei Chen, Kaiqi Huang, and Tieniu Tan
67
Modeling Complex Scenes for Accurate Moving Objects Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianwei Ding, Min Li, Kaiqi Huang, and Tieniu Tan
82
Online Learning for PLSA-Based Visual Recognition . . . . . . . . . . . . . . . . . Jie Xu, Getian Ye, Yang Wang, Wei Wang, and Jun Yang Emphasizing 3D Structure Visually Using Coded Projection from Multiple Projectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Nakamura, Fumihiko Sakaue, and Jun Sato Object Class Segmentation Using Reliable Regions . . . . . . . . . . . . . . . . . . . Vida Vakili and Olga Veksler Specular Surface Recovery from Reflections of a Planar Pattern Undergoing an Unknown Pure Translation . . . . . . . . . . . . . . . . . . . . . . . . . . Miaomiao Liu, Kwan-Yee K. Wong, Zhenwen Dai, and Zhihu Chen Medical Image Segmentation Based on Novel Local Order Energy . . . . . . LingFeng Wang, Zeyun Yu, and ChunHong Pan
95
109 123
137 148
XVIII
Table of Contents – Part II
Geometries on Spaces of Treelike Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . Aasa Feragen, Francois Lauze, Pechin Lo, Marleen de Bruijne, and Mads Nielsen
160
Human Pose Estimation Using Exemplars and Part Based Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanchao Su, Haizhou Ai, Takayoshi Yamashita, and Shihong Lao
174
Full-Resolution Depth Map Estimation from an Aliased Plenoptic Light Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tom E. Bishop and Paolo Favaro
186
Indoor Scene Classification Using Combined 3D and Gist Features . . . . . Agnes Swadzba and Sven Wachsmuth
201
Closed-Form Solutions to Minimal Absolute Pose Problems with Known Vertical Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla
216
Level Set with Embedded Conditional Random Fields and Shape Priors for Segmentation of Overlapping Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuqing Wu and Shishir K. Shah
230
Optimal Two-View Planar Scene Triangulation . . . . . . . . . . . . . . . . . . . . . . Kenichi Kanatani and Hirotaka Niitsuma
242
Pursuing Atomic Video Words by Information Projection . . . . . . . . . . . . . Youdong Zhao, Haifeng Gong, and Yunde Jia
254
A Direct Method for Estimating Planar Projective Transform . . . . . . . . . Yu-Tseh Chi, Jeffrey Ho, and Ming-Hsuan Yang
268
Spatial-Temporal Motion Compensation Based Video Super Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yaozu An, Yao Lu, and Ziye Yan Learning Rare Behaviours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jian Li, Timothy M. Hospedales, Shaogang Gong, and Tao Xiang Character Energy and Link Energy-Based Text Extraction in Scene Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jing Zhang and Rangachar Kasturi
282
293
308
A Novel Representation of Palm-Print for Recognition . . . . . . . . . . . . . . . . G.S. Badrinath and Phalguni Gupta
321
Real-Time Robust Image Feature Description and Matching . . . . . . . . . . . Stephen J. Thomas, Bruce A. MacDonald, and Karl A. Stol
334
Table of Contents – Part II
XIX
A Biologically-Inspired Theory for Non-axiomatic Parametric Curve Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guy Ben-Yosef and Ohad Ben-Shahar
346
Geotagged Image Recognition by Combining Three Different Kinds of Geolocation Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keita Yaegashi and Keiji Yanai
360
Modeling Urban Scenes in the Spatial-Temporal Space . . . . . . . . . . . . . . . . Jiong Xu, Qing Wang, and Jie Yang
374
Family Facial Patch Resemblance Extraction . . . . . . . . . . . . . . . . . . . . . . . . M. Ghahramani, W.Y. Yau, and E.K. Teoh
388
3D Line Segment Detection for Unorganized Point Clouds from Multi-view Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tingwang Chen and Qing Wang
400
Multi-View Stereo Reconstruction with High Dynamic Range Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Feng Lu, Xiangyang Ji, Qionghai Dai, and Guihua Er
412
Feature Quarrels: The Dempster-Shafer Evidence Theory for Image Segmentation Using a Variational Framework . . . . . . . . . . . . . . . . . . . . . . . . Björn Scheuermann and Bodo Rosenhahn
426
Gait Analysis of Gender and Age Using a Large-Scale Multi-view Gait Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasushi Makihara, Hidetoshi Mannami, and Yasushi Yagi
440
A System for Colorectal Tumor Classification in Magnifying Endoscopic NBI Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Toru Tamaki, Junki Yoshimuta, Takahishi Takeda, Bisser Raytchev, Kazufumi Kaneda, Shigeto Yoshida, Yoshito Takemura, and Shinji Tanaka A Linear Solution to 1-Dimensional Subspace Fitting under Incomplete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hanno Ackermann and Bodo Rosenhahn
452
464
Efficient Clustering Earth Mover’s Distance . . . . . . . . . . . . . . . . . . . . . . . Jenny Wagner and Björn Ommer
477
One-Class Classification with Gaussian Processes . . . . . . . . . . . . . . . . . . . . Michael Kemmler, Erik Rodner, and Joachim Denzler
489
A Fast Semi-inverse Approach to Detect and Remove the Haze from a Single Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Codruta O. Ancuti, Cosmin Ancuti, Chris Hermans, and Philippe Bekaert
501
XX
Table of Contents – Part II
Salient Region Detection by Jointly Modeling Distinctness and Redundancy of Image Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yiqun Hu, Zhixiang Ren, Deepu Rajan, and Liang-Tien Chia
515
Unsupervised Selective Transfer Learning for Object Recognition . . . . . . . Wei-Shi Zheng, Shaogang Gong, and Tao Xiang
527
A Heuristic Deformable Pedestrian Detection Method . . . . . . . . . . . . . . . . Yongzhen Huang, Kaiqi Huang, and Tieniu Tan
542
Gradual Sampling and Mutual Information Maximisation for Markerless Motion Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yifan Lu, Lei Wang, Richard Hartley, Hongdong Li, and Dan Xu Temporal Feature Weighting for Prototype-Based Action Recognition . . . Thomas Mauthner, Peter M. Roth, and Horst Bischof PTZ Camera Modeling and Panoramic View Generation via Focal Plane Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Karthik Sankaranarayanan and James W. Davis Horror Image Recognition Based on Emotional Attention . . . . . . . . . . . . . Bing Li, Weiming Hu, Weihua Xiong, Ou Wu, and Wei Li Spatial-Temporal Affinity Propagation for Feature Clustering with Application to Traffic Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Yang, Yang Wang, Arcot Sowmya, Jie Xu, Zhidong Li, and Bang Zhang Minimal Representations for Uncertainty and Estimation in Projective Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang F¨ orstner Personalized 3D-Aided 2D Facial Landmark Localization . . . . . . . . . . . . . . Zhihong Zeng, Tianhong Fang, Shishir K. Shah, and Ioannis A. Kakadiaris
554 566
580 594
606
619 633
A Theoretical and Numerical Study of a Phase Field Higher-Order Active Contour Model of Directed Networks . . . . . . . . . . . . . . . . . . . . . . . . . Aymen El Ghoul, Ian H. Jermyn, and Josiane Zerubia
647
Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Zhu, Xu Zhao, Yun Fu, and Yuncai Liu
660
Multi-illumination Face Recognition from a Single Training Image per Person with Sparse Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Die Hu, Li Song, and Cheng Zhi
672
Table of Contents – Part II
Human Detection in Video over Large Viewpoint Changes . . . . . . . . . . . . Genquan Duan, Haizhou Ai, and Shihong Lao Adaptive Parameter Selection for Image Segmentation Based on Similarity Estimation of Multiple Segmenters . . . . . . . . . . . . . . . . . . . . . . . . Lucas Franek and Xiaoyi Jiang
XXI
683
697
Cosine Similarity Metric Learning for Face Verification . . . . . . . . . . . . . . . Hieu V. Nguyen and Li Bai
709
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
721
Generic Object Class Detection Using Boosted Configurations of Oriented Edges Oscar Danielsson and Stefan Carlsson School of Computer Science and Communications Royal Inst. of Technology, Stockholm, Sweden {osda02,stefanc}@csc.kth.se
Abstract. In this paper we introduce a new representation for shape-based object class detection. This representation is based on very sparse and slightly flexible configurations of oriented edges. An ensemble of such configurations is learnt in a boosting framework. Each edge configuration can capture some local or global shape property of the target class and the representation is thus not limited to representing and detecting visual classes that have distinctive local structures. The representation is also able to handle significant intra-class variation. The representation allows for very efficient detection and can be learnt automatically from weakly labelled training images of the target class. The main drawback of the method is that, since its inductive bias is rather weak, it needs a comparatively large training set. We evaluate on a standard database [1] and when using a slightly extended training set, our method outperforms state of the art [2] on four out of five classes.
1 Introduction
Generic shape-based detection of object classes is a difficult problem because there is often significant intra-class variation. This means that the class cannot be well described by a simple representation, such as a template or some sort of mean shape. If templates are used, typically a very large number of them is required [3]. Another way of handling intra-class variation is to use more meaningful distance measures than the Euclidean or Chamfer distances typically used to match templates to images. Then, the category can potentially be represented by a substantially smaller number of templates/prototypes [4]. A meaningful distance measure, for example based on thin-plate spline deformation, generally requires correspondences between the model and the image [4–6]. Although many good papers have been devoted to computing shape-based correspondences [4,7], this is still a very difficult problem under realistic conditions. Existing methods are known to be sensitive to clutter and are typically computed using an iterative process of high computational complexity. Another common way of achieving a compact description of an object category is to decompose the class into a set of parts and then model each part independently. For shape classes, the parts are typically contour segments [5,8,9], making
such methods practical mainly for object classes with locally discriminative contours. However, many object classes lack such contour features. These classes might be better described by global contour properties or by image gradient structures that are not on the contour at all, as illustrated in figure 1.
Fig. 1. Intra-class shape variation too large for efficient template-based representation. No discriminative local features are shared among the exemplars of the class. However, exemplars share discriminative edge configurations.
In this paper we present an object representation that achieves a compact description of the target class without assuming the presence of discriminative local shape or appearance features. The main components of this representation are sparse and slightly flexible configurations of oriented edge features, built incrementally in a greedy error minimization scheme. We will refer to these configurations as weak classifiers. The final strong classifier is a linear combination of weak classifiers learnt using AdaBoost [10]. We also stress that the method can be trivially extended to include other types of features (for example corners or appearance features) when applicable. We use the attentional cascade of Viola and Jones for selection of relevant negative training examples and for fast rejection of negative test examples [11]. For detection we evaluate the sliding-window approach and we also point out the possibility of a hierarchical search strategy. Feature values are in both cases computed efficiently using the distance transform [12]. We achieve detection performance comparable to state-of-the-art methods. Our main contributions are: (i) a generic, shape-based object class representation that can be learnt automatically in a discriminative framework from weakly labelled training examples, (ii) an efficient detection algorithm that combines an attentional cascade with efficient feature value computation and (iii) an evaluation showing very promising results on a standard dataset. In the following section we review other methods that use similar object representations. The rest of the paper is organized as follows. In section 3 we describe how our method represents the object category. In section 4 we describe the edge features and their computation. In sections 5 and 6 we describe the algorithms for learning and detection in detail. In sections 7 and 8 we present an experimental evaluation. Finally, in sections 9, 10 and 11, we analyze the results, suggest directions for future study and draw conclusions.
2 Related Work
In the proposed approach the target object category is represented by sparse, flexible configurations of oriented edges. Several other works have relied on similar representations. For example, Fleuret and Geman learn a disjunction of oriented edge configurations, where all oriented edges of at least one configuration are required to be present within a user-specified tolerance in order for detection to occur [13]. Wu et al. learn a configuration of Gabor basis functions, where each basis function is allowed to move within a user-specified distance from its preferred position in order to maximize its response [14]. The most similar representation is probably from Danielsson et al. [15], who use a combination of sparse configurations of oriented edges in a voting scheme for object detection. In their work, however, the number of configurations used to represent a class, the number of edges in each configuration and the flexibility tolerance of edges are all defined by the user and not learned automatically. They also use different methods for learning and detection. We argue that our representation essentially generalizes these representations. For example, by constraining the weak classifiers to use only a single edge feature we get a representation similar to that of Wu et al. [14]. The proposed approach also extends the mentioned methods because the tolerance in edge location is learnt rather than defined by the user. Furthermore, being discriminative rather than generative, it can naturally learn tolerance thresholds with negative as well as positive parity, i.e. it can also encode that an edge of a certain orientation is unlikely to be within a certain region. The use of a cascade of increasingly complex classifiers to quickly reject obvious negatives was pioneered by Viola and Jones [11,16]. The cascade also provides a means of focusing the negative training set on relevant examples during training. Each stage in the cascade contains an AdaBoost classifier, which is a linear combination of weak classifiers. Viola and Jones constructed weak classifiers by thresholding single Haar features, whereas our weak classifiers are flexible configurations of oriented edge features.
3 Object Class Representation
The key component of the object representation is the weak classifiers. A weak classifier can be regarded as a conjunction of a set of single feature classifiers, where a single feature classifier is defined by an edge feature (a location and orientation) along with a tolerance threshold and its parity. A single feature classifier returns true if the distance from the specified location to the closest edge with the specified orientation is within tolerance (i.e. it should be sufficiently small if the parity is positive and sufficiently large if the parity is negative). A weak classifier returns true if all its constituent single feature classifiers return true. A strong classifier is then formed by boosting the weak classifiers. Figure 2 illustrates how the output of the strong classifier is computed for two different examples of the target class. The output of the strong classifier is thresholded
Fig. 2. An object class is represented by a strong classifier, which is a linear combination of weak classifiers. Each weak classifier is a conjunction of single feature classifiers. Each single feature classifier is described by a feature (bars), a distance threshold (circles) and a parity of the threshold (green = pos. parity, red = neg. parity). The output of the strong classifier is the sum of the weights corresponding to the “active” weak classifiers, as illustrated for two different examples in the figure.
to determine class membership. The edge features will be described in the next section and the weak classifiers in section 5.1.
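To make the structure of the representation concrete, the following sketch spells it out in code. It is only an illustrative reading of the description above, not the authors' implementation: the class names, the feature-lookup callback and the decision threshold are our own assumptions, and feature values are assumed to be supplied by the distance-transform lookup described in section 4.

```python
# Illustrative data structures for the representation described above (assumed names).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SingleFeatureClassifier:
    x: float        # feature location in the normalized coordinate system
    y: float
    theta: float    # edge orientation of the feature
    t: float        # distance tolerance threshold
    parity: int     # +1: an edge must be close enough, -1: it must be far enough

    def fires(self, feature_value: float) -> bool:
        # g(I, t, s) = [p * f_k(I, t, s) <= p * t]
        return self.parity * feature_value <= self.parity * self.t

@dataclass
class WeakClassifier:
    weight: float                          # boosting weight of this edge configuration
    parts: List[SingleFeatureClassifier]   # the conjunction members

    def fires(self, lookup: Callable[[float, float, float], float]) -> bool:
        # A weak classifier is "on" only if every single feature classifier is satisfied.
        return all(p.fires(lookup(p.x, p.y, p.theta)) for p in self.parts)

def strong_classifier_score(weak_classifiers: List[WeakClassifier],
                            lookup: Callable[[float, float, float], float]) -> float:
    # Sum of the weights of the weak classifiers that fire on the current window.
    return sum(w.weight for w in weak_classifiers if w.fires(lookup))

def classify(weak_classifiers, lookup, threshold) -> bool:
    # The strong classifier output is thresholded to decide class membership.
    return strong_classifier_score(weak_classifiers, lookup) >= threshold
```

Here lookup(x, y, theta) is assumed to return the feature value f_k(I, t, s) for the window currently being classified; one possible implementation is sketched in section 4.1.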
4 Edge Features
An edge feature defines a single feature classifier together with a threshold and its parity. In this section we define the edge feature and the feature value. We also describe how to compute feature values efficiently. An edge feature F_k = (x_k, θ_k) is defined by a location x_k and an orientation θ_k in a normalized coordinate system. The feature value, f_k, is the (normalized) distance to the closest image edge with similar orientation (two orientations are defined as similar if they differ by less than a threshold, t_θ). In order to emphasize that feature values are computed by aligning the normalized feature coordinate system with an image, I, using translation t and scaling s, we write f_k(I, t, s). See figure 3 for an illustration.
Fig. 3. Feature values f_k(I, t, s) are computed by translating (t) and scaling (s) features F_k = (x_k, θ_k) and taking the (normalized) distances to the closest edges with similar orientations in the image
Fig. 4. Images are preprocessed as follows: (1) extract a set of oriented edges E(I), (2) split E(I) into (overlapping) subsets E_θ(I) and (3) compute the distance transforms d_θ(I, p).
In order to define f_k(I, t, s), let E(I) = {…, (p', θ'), …} be the set of oriented edge elements (or edgels) in image I (E(I) is simply computed by taking the output of any edge detector and appending the edge orientation at each edge point) and let E_θ(I) = {(p', θ') ∈ E(I) : |cos(θ' − θ)| ≥ cos(t_θ)} be the set of edgels with orientation similar to θ. We can then define f_k(I, t, s) as:

f_k(I, t, s) = (1/s) · min_{(p', θ') ∈ E_{θ_k}(I)} ||s · x_k + t − p'||    (1)
Typically we define a set of features F by constructing a uniformly spaced grid in the normalized coordinate system, i.e. F = X × Y × Θ, where X, Y and Θ are uniformly spaced points.
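A minimal sketch of such a feature dictionary follows; the grid resolutions are illustrative choices of ours, not values from the paper.

```python
import numpy as np
from itertools import product

def make_feature_grid(nx=16, ny=16, n_orientations=6):
    # F = X x Y x Theta: uniformly spaced positions in the normalized window
    # [0,1] x [0,1] and uniformly spaced (unsigned) orientations in [0, pi).
    xs = np.linspace(0.0, 1.0, nx)
    ys = np.linspace(0.0, 1.0, ny)
    thetas = np.linspace(0.0, np.pi, n_orientations, endpoint=False)
    return [(x, y, th) for x, y, th in product(xs, ys, thetas)]
```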
4.1 Efficient Computation of Feature Values
The feature value f_k(I, t, s) defined in the previous section can be computed using only a single array reference and a few elementary operations. The required preprocessing is as follows. Start by extracting edges E(I) from the input image I. This involves running an edge detector on the image and then computing the orientation of each edge point. The orientation is orthogonal to the image gradient and unsigned, i.e. defined modulo π. For edge points belonging to an edge segment, we compute the orientation by fitting a line to nearby points on the segment. We call the elements of E(I) edgels (edge elements). The second step of preprocessing involves splitting the edgels into several overlapping subsets E_θ(I), one subset for each orientation θ ∈ Θ (as defined in the previous section). Then compute the distance transform d_θ(I, p) on each subset [12]:

d_θ(I, p) = min_{(p', θ') ∈ E_θ(I)} ||p − p'||    (2)
This step is illustrated in figure 4; in the first column of the figure the subsets corresponding to horizontal and vertical orientations are displayed as feature maps and in the last column the corresponding distance transforms d_θ(I, p) are shown. Feature values can be efficiently computed as f_k(I, t, s) = d_{θ_k}(I, s · x_k + t)/s. This requires only one array reference and a few elementary operations.
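The sketch below mirrors this preprocessing and lookup under stated assumptions: OpenCV's Canny detector and SciPy's exact Euclidean distance transform stand in for the unspecified edge detector and the linear-time distance transform of [12], and edge orientations are taken from image gradients instead of the line-fitting step described above.

```python
import numpy as np
import cv2
from scipy.ndimage import distance_transform_edt

def preprocess(image_gray, thetas, t_theta=np.pi / 6):
    """Return one distance transform d_theta(I, p) per orientation bin.
    image_gray is assumed to be a uint8 grayscale image."""
    edges = cv2.Canny(image_gray, 50, 150) > 0            # edge map E(I)
    gx = cv2.Sobel(image_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(image_gray, cv2.CV_32F, 0, 1)
    # Edge orientation is orthogonal to the gradient and unsigned (modulo pi).
    orientation = (np.arctan2(gy, gx) + np.pi / 2.0) % np.pi

    dts = {}
    for theta in thetas:
        # E_theta(I): edgels whose orientation differs from theta by less than t_theta.
        similar = np.abs(np.cos(orientation - theta)) >= np.cos(t_theta)
        subset = edges & similar
        # distance_transform_edt measures distance to the nearest zero pixel,
        # so edgels are encoded as zeros (assumes each bin contains some edgel).
        dts[theta] = distance_transform_edt(~subset)
    return dts

def feature_value(dts, x, y, theta, t, s):
    """f_k(I, t, s) = d_theta_k(I, s * x_k + t) / s, with t = (tx, ty)."""
    tx, ty = t
    col = int(round(s * x + tx))
    row = int(round(s * y + ty))
    d = dts[theta]
    row = min(max(row, 0), d.shape[0] - 1)   # clamp to the image domain
    col = min(max(col, 0), d.shape[1] - 1)
    return d[row, col] / s
```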
5 Learning
The basic component of our object representation is the weak classifier (section 5.1). It is boosted to form a strong classifier (section 5.2). Finally, a cascade of strong classifiers is built (section 5.3).
5.1 Learning the Weak Classifiers
In this section we describe how to learn a weak classifier given a weighted set of training examples. The goal of the learner is to find a weak classifier with minimal (weighted) classification error on the training set. The input to the training algorithm is a set of training examples {I_j | j ∈ J} with target class c_j ∈ {0, 1} and a weight distribution {d_j | j ∈ J}, Σ_{j∈J} d_j = 1, over these examples. Each training image is annotated with the centroid t_j and scale s_j of the object in the image. A weak classifier is the conjunction of a set of single feature classifiers. A single feature classifier is defined by a feature, F_k, a threshold, t, and a parity, p ∈ {−1, 1}. The single feature classifier has the classification function g(I, t, s) = (p · f_k(I, t, s) ≤ p · t) and the weak classifier thus has the classification function h(I, t, s) = ∧_{i=1}^{N} g_i(I, t, s) = ∧_{i=1}^{N} (p_i · f_{k_i}(I, t, s) ≤ p_i · t_i). The number of single feature classifiers, N, is learnt automatically. We want to find the weak classifier that minimizes the classification error, e, w.r.t. the current importance distribution:

e = P_{j∼d}(h(I_j, t_j, s_j) ≠ c_j) = Σ_{j ∈ J : h(I_j, t_j, s_j) ≠ c_j} d_j    (3)

We use a greedy algorithm to incrementally construct a weak classifier. In order to define this algorithm, let h_n(I, t, s) = ∧_{i=1}^{n} g_i(I, t, s) be the conjunction of the first n single feature classifiers. Then let J_n = {j ∈ J | h_n(I_j, t_j, s_j) = 1} be the indices of all training examples classified as positive by h_n and let e_n be the classification error of h_n:

e_n = Σ_{j ∈ J : h_n(I_j, t_j, s_j) ≠ c_j} d_j    (4)
We can write e_n in terms of e_{n−1} as:

e_n = e_{n−1} + Σ_{j ∈ J_{n−1} : g_n(I_j, t_j, s_j) ≠ c_j} d_j − Σ_{j ∈ J_{n−1} : c_j = 0} d_j    (5)

This gives us a recipe for learning the n-th single feature classifier:

g_n = arg min_g Σ_{j ∈ J_{n−1} : g(I_j, t_j, s_j) ≠ c_j} d_j    (6)
We keep g_n if e_n < e_{n−1}, which can be evaluated easily using equation 5, otherwise we let N = n − 1 and stop. The process is illustrated in figure 5: in a) the weak classifier is not sufficiently specific to delineate the class manifold well, in b) the weak classifier is too specific and thus inefficient and prone to overfitting, whereas in c) the specificity of the weak classifier is tuned to the task at hand.
(a) Too generic (b) Too specific (c) Optimal
Fig. 5. The number of single feature classifiers controls the specificity of a weak classifier. a) Using too few yields weak classifiers that are too generic to delineate the class manifold (gray). b) Using too many yields weak classifiers that are prone to overfitting and that are inefficient at representing the class. c) We learn the optimal number of single feature classifiers for each weak classifier (the notation is defined in the text).
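The greedy construction just described (equations 4–6) can be sketched as follows. The sketch operates on a precomputed matrix of feature values for the annotated training windows; the routine best_single_feature is assumed to implement the selection rule described in the next paragraph, and the matrix-based interface and all names are our own.

```python
import numpy as np

def learn_weak_classifier(F, c, d, best_single_feature, max_parts=None):
    """F: (n_examples, n_features) matrix of feature values f_k(I_j, t_j, s_j);
    c: 0/1 target labels; d: importance weights summing to one.
    best_single_feature(F, c, d, active) -> (k, t, p) minimizing the weighted
    error over the currently active examples (equation 6).
    Returns the learnt conjunction as (feature index, threshold, parity) triples."""
    n = len(c)
    active = np.ones(n, dtype=bool)      # J_0: h_0 classifies every example as positive
    e_prev = d[c == 0].sum()             # e_0: the weight of all negative examples
    parts = []
    while max_parts is None or len(parts) < max_parts:
        k, t, p = best_single_feature(F, c, d, active)
        g = p * F[:, k] <= p * t         # g_n evaluated on all examples
        # Equation 5: error obtained by appending g_n to the conjunction.
        e_new = (e_prev
                 + d[active & (g.astype(int) != c)].sum()
                 - d[active & (c == 0)].sum())
        if e_new >= e_prev:              # no improvement: stop and keep N = n - 1 parts
            break
        parts.append((k, t, p))
        active &= g                      # J_n = {j in J_{n-1} : g_n(I_j, t_j, s_j) = 1}
        e_prev = e_new
    return parts
```

The optional max_parts argument corresponds to the cap on the number of features per weak classifier discussed in section 9.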
Learning a Single Feature Classifier. Learning a single feature classifier involves selecting a feature F_k from a “dictionary” F, a threshold t from R+ and a parity p from {−1, 1}. An optimal single feature classifier, that minimizes equation (6), can be found by evaluating a finite set of candidate classifiers [17]. However, typically a large number of candidate classifiers have to be evaluated, leading to a time consuming learning process. Selecting the unconstrained optimum might also lead to overfitting. We have observed empirically that our feature values (being nonnegative) tend to be exponentially distributed. This suggests selecting the threshold t for a particular feature as the intersection of two exponential pdfs, where μ+ is the (weighted) average of the feature values from the positive examples and μ− is the (weighted) average from the negative examples:

t = ln(μ− / μ+) · (μ+ μ−) / (μ− − μ+)    (7)

The parity is 1 if μ+ ≤ μ− and −1 otherwise. We thus have only one possible classifier for each feature and can simply loop over all features and select the classifier yielding the smallest value for equation (6). This method for single feature classifier selection yielded good results in evaluation and we used it in our experiments.
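Under the same assumed interface, this selection rule could be implemented as below; it plays the role of the best_single_feature routine in the previous sketch.

```python
import numpy as np

def best_single_feature(F, c, d, active):
    """For each feature, place the threshold at the intersection of the two
    exponential pdfs fitted to the positive and negative feature values
    (equation 7) and keep the feature with the smallest weighted error."""
    pos = active & (c == 1)
    neg = active & (c == 0)
    w_pos, w_neg = d[pos].sum(), d[neg].sum()
    best = None
    for k in range(F.shape[1]):
        f = F[:, k]
        mu_pos = (d[pos] * f[pos]).sum() / w_pos      # weighted mean over positives
        mu_neg = (d[neg] * f[neg]).sum() / w_neg      # weighted mean over negatives
        if np.isclose(mu_pos, mu_neg):
            continue
        # Equation 7: intersection point of the two exponential densities.
        t = np.log(mu_neg / mu_pos) * (mu_pos * mu_neg) / (mu_neg - mu_pos)
        p = 1 if mu_pos <= mu_neg else -1
        g = p * f <= p * t
        err = d[active & (g.astype(int) != c)].sum()  # equation 6 on the active set
        if best is None or err < best[0]:
            best = (err, k, t, p)
    assert best is not None, "no usable feature"
    _, k, t, p = best
    return k, t, p
```

Guarding against degenerate cases (for instance an active set containing only one class) is omitted for brevity.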
5.2 Learning the Strong Classifier
We use asymmetric AdaBoost for learning the strong classifiers at each stage of the cascade [16]. This requires setting a parameter k specifying that false negatives cost k times more than false positives. We empirically found k = 3n−/n+ to be a reasonable choice. Finally the detection threshold of the strong classifier is adjusted to yield a true positive rate (tpr) close to one.
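A schematic of the boosting stage is given below. For brevity it runs plain discrete AdaBoost and approximates the cost asymmetry by scaling the initial weights of the positive examples by k; the actual asymmetric AdaBoost of Viola and Jones [16] distributes the asymmetry over the boosting rounds, so this should be read as an illustration rather than a faithful re-implementation. The interfaces are our own; learn_weak is assumed to wrap the weak classifier learner of section 5.1 and to return a 0/1 prediction function.

```python
import numpy as np

def boost_strong_classifier(F, c, n_rounds, learn_weak, k_asym):
    """learn_weak(F, c, d) -> (predict, description), where predict(F) gives
    0/1 predictions for every row of the feature matrix F."""
    d = np.where(c == 1, float(k_asym), 1.0)   # asymmetric initial weights (simplified)
    d /= d.sum()
    ensemble = []
    for _ in range(n_rounds):
        predict, desc = learn_weak(F, c, d)
        h = predict(F)                          # 0/1 predictions of the new weak classifier
        err = np.clip(d[h != c].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of the weak classifier
        ensemble.append((alpha, predict, desc))
        # Increase the weights of misclassified examples, decrease the others.
        y, hy = 2 * c - 1, 2 * h - 1            # map labels and predictions to {-1, +1}
        d *= np.exp(-alpha * y * hy)
        d /= d.sum()
    return ensemble

def stage_score(ensemble, F_rows):
    """Strong classifier output: weighted vote of the firing weak classifiers.
    The stage threshold on this score is then lowered until the true positive
    rate on the training positives is close to one, as described above."""
    return sum(alpha * predict(F_rows) for alpha, predict, _ in ensemble)
```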
5.3 Learning the Cascade
The cascade is learnt according to Viola and Jones [11]. After the construction of each stage we run the current cascade on a database of randomly selected negative images to acquire a negative training set for the next stage. It is important to use a large set of negative images and in our experiments we used about 300.
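One way to organize this bootstrapping of negative examples is sketched below. The callbacks and the fixed per-stage negative budget are our own assumptions; the paper only states that each new stage is trained against windows that the current cascade still accepts, collected from a large set (about 300) of background images.

```python
def train_cascade(positives, background_images, n_stages, train_stage,
                  sample_windows, n_neg_per_stage):
    """train_stage(pos, neg) -> stage object with a predict(window) -> bool method.
    sample_windows(image) yields candidate windows from a background image."""

    def collect_negatives(accept):
        negatives = []
        for img in background_images:
            for w in sample_windows(img):
                if accept(w):
                    negatives.append(w)
                    if len(negatives) >= n_neg_per_stage:
                        return negatives
        return negatives

    cascade = []
    # Initial negative set: arbitrary background windows.
    negatives = collect_negatives(lambda w: True)
    for _ in range(n_stages):
        if not negatives:
            break          # the current detector already rejects all background windows
        stage = train_stage(positives, negatives)
        cascade.append(stage)
        # Keep only windows that still pass the whole cascade, i.e. the current
        # false positives; these become the negative training set of the next stage.
        negatives = collect_negatives(lambda w: all(s.predict(w) for s in cascade))
    return cascade
```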
6 Detection

6.1 Sliding Window
The simplest and most direct search strategy is the sliding window approach in scale-space. Since the feature values are computed quickly and we are using a cascaded classifier, this approach yields an efficient detector. Typically we use a multiplicative step-size of 1.25 in scale, so that s_{i+1} = 1.25 · s_i, and an additive step-size of 0.01 · s in position, so that x_{i+1} = x_i + 0.01 · s and y_{i+1} = y_i + 0.01 · s. Non-maximum suppression is used to remove overlapping detections. There are some drawbacks with the sliding window approach: (i) all possible windows have to be evaluated, (ii) the user is required to select a step-size for each search space dimension, (iii) the detector has to be made invariant to translations and scalings of at least half a step-size (typically by translating and scaling all training exemplars) and (iv) the output detections have an error of about half a step-size in each dimension of the search space (if the step-sizes are sufficiently large to yield an efficient detector these errors typically translate to significant errors in the estimated bounding box). Since the feature values f_k represent distances, we can easily compute bounds f_k^(l) and f_k^(u), given a region S in search space, such that f_k^(l) ≤ f_k(I, t, s) ≤ f_k^(u) ∀ (t, s) ∈ S. This can be used for efficient search space culling and should enable detectors with better precision at the same computational cost. We will explore the issue of hierarchical search in future work.
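A schematic scan loop with the step sizes quoted above might look as follows. The scoring callback (cascade evaluation on top of the distance-transform lookups) is assumed, a square canonical window is used for simplicity, and non-maximum suppression is only indicated by a comment.

```python
def sliding_window_detect(image_shape, score_window, s_min, s_max, threshold):
    """score_window(tx, ty, s) -> strong classifier (or cascade) score for the
    window translated by (tx, ty) and scaled by s."""
    H, W = image_shape
    detections = []
    s = s_min
    while s <= s_max:                      # multiplicative step of 1.25 in scale
        step = max(1.0, 0.01 * s)          # additive step of 0.01 * s in position
        ty = 0.0
        while ty + s <= H:
            tx = 0.0
            while tx + s <= W:
                score = score_window(tx, ty, s)
                if score >= threshold:
                    detections.append((score, tx, ty, s))
                tx += step
            ty += step
        s *= 1.25
    # Non-maximum suppression over overlapping detections would follow here.
    return detections
```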
6.2 Estimating the Aspect Ratio
The detector scans the image over position and scale, but in order to produce a good estimate of the bounding box of a detected object we also need the aspect ratio (which typically varies significantly within an object class). We
estimate the aspect ratio of a detected object by finding one or a few similar training exemplars and taking the average aspect ratio of these exemplars as the estimate. We retrieve similar training exemplars using the weak classifiers that are turned “on” by the detected object. Each weak classifier, h_k, is “on” for a subset of the positive training exemplars, T_k = {x | h_k(x) = 1}, and by requiring several weak classifiers to be “on” we get the intersection of the corresponding subsets: T_{k1} ∩ T_{k2} = {x | h_{k1}(x) = 1 ∧ h_{k2}(x) = 1}. Thus we focus on a successively smaller subset of training exemplars by selecting weak classifiers that fire on the detected object, as illustrated in figure 6. We are searching for a non-empty subset of minimal cardinality and the greedy approach at each iteration is to specify the weak classifier that gives maximal reduction in the cardinality of the “active” subset (under the constraint that the new “active” subset is non-empty).
Fig. 6. Retrieving similar training exemplars for a detected object (shown as a star). a) Weak classifiers h2 and h3 fire on the detection. b) Weak classifier h3 is selected, since it gives a smaller “active” subset than h2 . c) Weak classifier h2 is selected. Since no more weak classifiers can be selected, training exemplars “g” and “h” are returned.
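In code, the greedy retrieval could be organized as below. The set-based interface is an assumption of ours; the paper only describes the procedure.

```python
def estimate_aspect_ratio(firing_sets, exemplar_aspect_ratios):
    """firing_sets: for every weak classifier that fires on the detected object,
    the set T_k of positive training exemplars on which it is 'on'.
    exemplar_aspect_ratios: mapping from exemplar index to aspect ratio."""
    active = set(exemplar_aspect_ratios)          # start from all training exemplars
    remaining = list(firing_sets)
    while remaining:
        # Greedily pick the weak classifier giving the largest reduction of the
        # active subset while keeping it non-empty.
        best = None
        for i, T in enumerate(remaining):
            new = active & T
            if new and (best is None or len(new) < len(best[1])):
                best = (i, new)
        if best is None or len(best[1]) == len(active):
            break                                 # no further reduction possible
        i, active = best
        remaining.pop(i)
    ratios = [exemplar_aspect_ratios[j] for j in active]
    return sum(ratios) / len(ratios)              # average aspect ratio of the subset
```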
7 Experiments and Dataset
We have evaluated object detection performance on the ETHZ Shape Classes dataset [1]. This dataset is challenging due to large intra-class variation, clutter and varying scales. Several other authors have evaluated their methods on this dataset: in [5] a deformable shape model learnt from images is evaluated, while hand drawn models are used in [1,18–20]. More recent results are [2,21,22], where [2] reports the best performance. Experiments were performed on all classes in the dataset and results are produced as in [5], using 5-fold cross-validation. We build 5 different detectors for each class by randomly sampling 5 subsets of half the images of that class. All other images in the dataset are used for testing. The dataset contains a total of 255 images and the number of class images varies from 32 to 87. Thus, the number of training images will vary from 16 to 43 and the test set will consist of about 200 background images and 16 to 44 images containing occurrences of the target object.
(a) Applelogos (b) Bottles (c) Giraffes (d) Mugs (e) Swans
Fig. 7. Detection rate (DR) plotted versus false positives per image (FPPI). The results from the initial experiment are shown by the blue (lower) curves and the results from the experiment using extended training sets are shown by the red (upper) curves.
Quantitative results are presented as the detection rate (the number of true positives divided by the number of occurrences of the target object in the test set) versus the number of false positives per image (FPPI). We prefer using FPPI, instead of precision, as a measure of error rate, since it is not biased by the number of positive and negative examples in the test set. Precision would for example be unsuitable for comparison to methods evaluated using a different cross-validation scheme, since this might affect the number of positive test examples. As in [5], a detection is counted as correct if the detected bounding box overlaps more than 20 % with the ground truth bounding box. Bounding box overlap is defined as the area of intersection divided by the area of union of the bounding boxes.
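For reference, the evaluation criterion can be restated in code as follows; this is an illustrative restatement of the 20% bounding-box-overlap rule and the DR/FPPI measures, not the evaluation code used for the experiments.

```python
def bbox_overlap(a, b):
    """Area of intersection divided by area of union; boxes are (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dr_and_fppi(detections_per_image, ground_truth_per_image, min_overlap=0.2):
    """Detection rate and false positives per image under the overlap criterion."""
    n_gt = sum(len(g) for g in ground_truth_per_image)
    tp = fp = 0
    for dets, gts in zip(detections_per_image, ground_truth_per_image):
        matched = [False] * len(gts)
        for det in dets:
            hit = next((i for i, g in enumerate(gts)
                        if not matched[i] and bbox_overlap(det, g) > min_overlap), None)
            if hit is None:
                fp += 1
            else:
                matched[hit] = True
                tp += 1
    dr = tp / n_gt if n_gt else 0.0
    fppi = fp / len(detections_per_image) if detections_per_image else 0.0
    return dr, fppi
```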
8 Results
Quantitative results are plotted in figure 7. The results of the initial experiment are shown by the blue (lower) curves. The problem is that there are very few training examples (from 16 to 43). Since the presented method is statistical in nature, it is not expected to perform well when trained on too small a training set. Therefore we downloaded a set of extra training images from Google Images. These images contained a total of 40 applelogos, 65 bottles, 160 giraffes, 67 mugs and 103 swans, which were used to extend the training sets. The red
Fig. 8. Example detections (true and false positives) for each class.
(upper) curves are the results using the extended training set and the detection performance is now almost perfect. In table 1 we compare the detection performance of our system to previously presented methods. Figure 8 shows some sample detections (true and false positives).
Table 1. Comparison of detection performance. We state the average detection rate at 0.3 and 0.4 FPPI and its standard deviation in parentheses. We compare to the systems of [2,5].
Method             A. logos    Bottles     Giraffes     Mugs        Swans
Ours @ 0.3 FPPI:   95.5 (3.2)  91.9 (4.1)  92.9 (1.9)   96.4 (1.4)  98.8 (2.8)
Ours @ 0.4 FPPI:   95.5 (3.2)  92.6 (3.7)  93.3 (1.6)   97.0 (2.1)  100 (0)
[5] @ 0.4 FPPI:    83.2 (1.7)  83.2 (7.5)  58.6 (14.6)  83.6 (8.6)  75.4 (13.4)
[2] @ 0.3 FPPI:    95.0        92.9        89.6         93.6        88.2
[2] @ 0.4 FPPI:    95.0        96.4        89.6         96.7        88.2
Our currently unoptimized MATLAB/mex implementation of the sliding window detector, running (a single thread) on a 2.8 GHz Pentium D desktop computer, requires on the order of 10 seconds for scanning a 640 x 480 image (excluding the edge detection step). We believe that this could be greatly improved by making a more efficient implementation. This algorithm is also very simple to parallelize, since the computations done for one sliding window position are independent of the computations for other positions.
9 Discussion
Viola and Jones constructed their weak classifiers by thresholding single Haar features in order to minimize the computational cost of weak classifier evaluation [16]. In our case, it is also a good idea to put an upper limit on the number of features used by the weak classifiers in the first few stages of the cascade for two reasons: (1) it reduces the risk of overfitting and (2) it improves the speed of the learnt detector. However, a strong classifier built using only single-feature weak classifiers should only be able to achieve low false positive rates for object classes with small intra-class variability. Therefore later stages of the cascade should not limit the number of features used by each weak classifier (rather, the learner should be unconstrained).
10 Future Work
As mentioned, search space culling and hierarchical search strategies will be explored, ideally with improvements in precision and speed as a result. Another straightforward extension of the current work is to use other features in addition to oriented edges. We only need to be able to produce a binary feature map indicating the occurrences of the feature in an image. For example, we could use corners, blobs or other interest points. We could also use the visual words in a visual code book (bag-of-words) as features. This would allow modeling of object appearance and shape in the same framework, but would only be practical for small code books.
We will investigate thoroughly how the performance of the algorithm is affected by constraining the number of features per weak classifier and we will also investigate the parameter estimation method presented in section 6.2 and the possibility of using it for estimating other parameters of a detected object, like the pose or viewpoint.
11 Conclusions
In this paper we have presented a novel shape-based object class representation. We have shown experimentally that this representation yields very good detection performance on a commonly used database. The advantages of the presented method compared to its competitors are that (1) it is generic and does not make any strong assumptions about, for example, the existence of discriminative parts or the detection of connected edge segments, (2) it is fast and (3) it is able to represent categories with significant intra-class variation. The method can also be extended to use other types of features in addition to the oriented edge features. The main drawback of the method is that, since its inductive bias is rather weak, it needs a comparatively large training set (at least about 100 training exemplars for the investigated dataset).
Acknowledgement. This work was supported by the Swedish Foundation for Strategic Research (SSF) project VINST.
References
1. Ferrari, V., Tuytelaars, T., Van Gool, L.: Object detection by contour segment networks. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 14–28. Springer, Heidelberg (2006)
2. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: Proc. of the IEEE Computer Vision and Pattern Recognition (2009)
3. Gavrila, D.M.: A Bayesian, exemplar-based approach to hierarchical shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007)
4. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509–522 (2002)
5. Ferrari, V., Jurie, F., Schmid, C.: Accurate object detection with deformable shape models learnt from images. In: Proc. of the IEEE Computer Vision and Pattern Recognition (2007)
6. Thuresson, J., Carlsson, S.: Finding object categories in cluttered images using minimal shape prototypes. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 1122–1129. Springer, Heidelberg (2003)
7. Carlsson, S.: Order structure, correspondence, and shape based categories. In: Forsyth, D., Mundy, J.L., Di Gesù, V., Cipolla, R. (eds.) Shape, Contour, and Grouping 1999. LNCS, vol. 1681, p. 58. Springer, Heidelberg (1999)
8. Shotton, J., Blake, A., Cipolla, R.: Contour-based learning for object detection. In: Proc. of the International Conference on Computer Vision (2005)
9. Opelt, A., Pinz, A., Zisserman, A.: A boundary-fragment-model for object detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 575–588. Springer, Heidelberg (2006)
10. Freund, Y., Schapire, R.E.: A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14, 771–780 (1999)
11. Viola, P.A., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57, 137–154 (2004)
12. Breu, H., Gil, J., Kirkpatrick, D., Werman, M.: Linear time Euclidean distance transform algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 17, 529–533 (1995)
13. Fleuret, F., Geman, D.: Coarse-to-fine face detection. International Journal of Computer Vision 41, 85–107 (2001)
14. Wu, Y.N., Si, Z., Fleming, C., Zhu, S.C.: Deformable template as active basis. In: Proc. of the International Conference on Computer Vision (2007)
15. Danielsson, O., Carlsson, S., Sullivan, J.: Automatic learning and extraction of multi-local features. In: Proc. of the International Conference on Computer Vision (2009)
16. Viola, P.A., Jones, M.J.: Fast and robust classification using asymmetric adaboost and a detector cascade. In: Proc. of Neural Information Processing Systems, pp. 1311–1318 (2001)
17. Fayyad, U.M.: On the Induction of Decision Trees for Multiple Concept Learning. PhD thesis, The University of Michigan (1991)
18. Ravishankar, S., Jain, A., Mittal, A.: Multi-stage contour based detection of deformable objects. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 483–496. Springer, Heidelberg (2008)
19. Zhu, Q., Wang, L., Wu, Y., Shi, J.: Contour context selection for object detection: A set-to-set contour matching approach. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 774–787. Springer, Heidelberg (2008)
20. Schindler, K., Suter, D.: Object detection by global contour shape. Pattern Recognition 41, 3736–3748 (2008)
21. Stark, M., Goesele, M., Schiele, B.: A shape-based object class model for knowledge transfer. In: Proc. of International Conference on Computer Vision (2009)
22. Ommer, B., Malik, J.: Multi-scale object detection by clustering lines. In: Proc. of International Conference on Computer Vision (2009)
Unsupervised Feature Selection for Salient Object Detection

Viswanath Gopalakrishnan, Yiqun Hu, and Deepu Rajan

School of Computer Engineering, Nanyang Technological University, Singapore
Abstract. Feature selection plays a crucial role in deciding the salient regions of an image, as in any other pattern recognition problem. However, the problem of identifying the relevant features that play a fundamental role in the saliency of an image has not received much attention so far. We introduce an unsupervised feature selection method to improve the accuracy of salient object detection. The noisy, irrelevant features in the image are identified by maximizing the mixing rate of a Markov process running on a linear combination of various graphs, each representing a feature. The global optimum of this convex problem is achieved by maximizing the second smallest eigenvalue of the graph Laplacian via semi-definite programming. The enhanced image graph model, after the removal of irrelevant features, is shown to improve the salient object detection performance on a large image database with annotated 'ground truth'.
1 Introduction
The detection of salient objects in images has been a challenging problem in computer vision for more than a decade. The task of identification of salient parts of an image is essentially that of choosing a subset of the input data which is considered as most relevant by the human visual system (HVS). The 'relevance' of such a subset is closely linked to two different task processing mechanisms of the HVS - a stimulus driven 'bottom-up' processing and a context dependent 'top-down' processing. The information regarding the salient parts in an image finds application in many computer vision tasks like content-based image retrieval, intelligent image cropping to adapt to small screen devices, and content-aware image and video resizing [1, 5, 20, 21]. Salient region detection techniques include local contrast based approaches modeling the 'bottom-up' processing of HVS [15, 25], information theory based saliency computation [4], salient feature identification using discriminative object class features [7], saliency map computation by modeling color and orientation distributions [9] and using the Fourier spectrum of images [11, 14]. Graph based approaches for salient object detection have not been explored much although they have the advantage of capturing the geometric structure of the image when compared to feature-based clustering techniques. The limited major works in this regard include the random walk based methods in [12] and
[8]. In [12] edge strengths are used to represent the dissimilarity between two nodes, and the most frequently visited node during the random walk is labeled as the most dissimilar node in a local context. In [8], random walks are performed on a complete and a sparse k-regular graph to capture the global and local properties of the most salient node. In [17] Liu et al. proposed a supervised learning approach in which 1000 annotated images were used to train a salient region detector which was tested on another 4000 images. We use the same database in this work to demonstrate the effectiveness of the proposed method.

Feature selection, which is essentially selecting a subset of features from the original feature set (consequently representing an image in a lower dimension), is quite important in many classification problems, like face detection, that involve high dimensional data. There has not been much work reported on unsupervised feature selection in contrast to supervised feature selection, as the former is a far more challenging problem due to the lack of any label information. Feature selection methods can be generally classified as filter, embedded and wrapper methods depending on how the feature selection strategy is related to the classification algorithm that follows. He et al. [13] and Mitra et al. [18] discuss unsupervised 'filter' methods of feature selection in which the feature selection strategy is independent of the classification algorithm. The problem of simultaneously finding a classification and a feature subset in an unsupervised manner is discussed in the 'embedded' feature selection methods in [16] and [26]. Law et al. [16] use a clustering algorithm based on Gaussian Mixture Models along with the feature selection, while Zhao et al. [26] optimize an objective function to find a maximum margin classifier on the given data along with a set of feature weights that favor the features which most 'respect' the data manifold. In [24], a 'wrapper' strategy for feature selection is discussed in which the features are selected by optimizing the discriminative power of the classification algorithm used.

Feature selection plays a vital role in salient object detection as well, as in many other pattern recognition problems. However, choosing the right mixture of features for identifying the salient parts of an image accurately is one problem that has not received much attention so far. For example, graph based methods like [8] give equal weight to all features while the saliency factor may be attributed only to a smaller subset of features. The relevancy of different features in different images is illustrated in Figure 1. The saliency of the image in Figure 1(a) is mainly derived from the color feature, in Figure 1(b) from texture features, while the salient parts in Figure 1(c) are due to some combination of color and texture features. Hence it can be concluded that some of the features used in [8] are irrelevant and only add noise to the graph structure of the image, thereby reducing the accuracy of the salient object detection. In this context, we propose an unsupervised filter method for identifying irrelevant features to improve the performance of salient region detection algorithms.
Fig. 1. Example images illustrating the relevancy of different features to saliency. The relevant salient features are (a) color, (b) texture, (c) some combination of color and texture features.
The major contributions of this paper can be summarized as follows:
– We relate the problem of identifying irrelevant features for salient object detection to maximizing the mixing rate of a Markov process running on a weighted linear combination of 'feature graphs'.
– A semi-definite programming formulation is used to maximize the second smallest eigenvalue of the graph Laplacian, thereby maximizing the mixing rate of the Markov process.
– The effectiveness of the proposed unsupervised filter method of feature selection (elimination) is demonstrated by using it in conjunction with a graph based salient object detection method [8], resulting in improved performance on a large annotated dataset.

The paper is organized as follows. Section 2 reviews some of the theory related to graph models, Markov process equilibrium and the mixing rate of Markov processes. In Section 3 we discuss the relation between irrelevant feature identification and maximization of the mixing rate of the Markov process. The detailed optimization algorithm is also provided in Section 3. Section 4 details the experiments done and the comparative results showing the performance improvement. Finally, conclusions are given in Section 5.
2 Graph Laplacian and Mixing Rate of Markov Process
In this section we review some of the theory related to graph Laplacians and the mixing rate of a Markov process running on the graph. Consider an undirected graph G(V, E), where V is the set of n vertices and E is the set of edges in the graph. The graph G is characterized by the affinity matrix A, where
\[
A_{ij} = \begin{cases} w_{ij}, & i \neq j \\ 0, & i = j, \end{cases} \tag{1}
\]
and w_{ij} is the weight between node i and node j based on the affinity of the features on the respective nodes.
The Laplacian matrix L of the graph G is defined as
\[
L = D - A, \tag{2}
\]
where
\[
D = \operatorname{diag}\{d_1, d_2, \ldots, d_n\}, \qquad d_i = \sum_j w_{ij}, \tag{3}
\]
and d_i is the degree of node i, i.e. the total weight connected to node i in the graph G.

Now, consider a symmetric Markov process running on the graph G(V, E). The state space of the Markov process is the vertex (node) set V, and the edge set E represents the transition strengths. The edge e_{ij} connecting nodes i and j is associated with a transition rate w_{ij}, which is the respective weight entry in the affinity matrix A of the graph G(V, E). This implies that the transition from node i to node j is faster when the affinities between the respective nodes are larger. Consequently, a Markov process entering node i at any specific time t will take more time to move into the nodes that possess lesser affinities with node i. The mixing time of a Markov process is the time taken to reach the equilibrium distribution π starting from some initial distribution π(0). Let π(t) ∈ R^n be the n-dimensional vector of probabilities over the n states at time t for a symmetric Markov process starting from any given distribution on the graph. The evolution of π(t) is given by [19]
\[
\frac{d\pi(t)}{dt} = Q\,\pi(t), \tag{4}
\]
where Q is the transition rate matrix of the Markov process, defined as
\[
Q_{ij} = \begin{cases} q_{ij}, & i \neq j \\ -\sum_{j,\, j \neq i} q_{ij}, & i = j. \end{cases} \tag{5}
\]
q_{ij} is the transition rate, or rate of going from state i to state j. As explained previously, we equate this rate to the affinity between nodes i and j, w_{ij}. Hence the Laplacian matrix L and the transition rate matrix Q are related as Q = −L. Consequently, the solution for the distribution π(t) at time t starting from the initial condition π(0) is given by
\[
\pi(t) = e^{-tL}\,\pi(0). \tag{6}
\]
The Laplacian L is a symmetric and positive semi-definite matrix. The eigenvalues of L are such that λ_1 = 0 ≤ λ_2 ≤ ... ≤ λ_n. λ_2 is also known as the algebraic connectivity of the graph G. If the graph G is connected, i.e. any node of the graph is accessible from any other node through the edge set E, then λ_2 > 0 [6]. The probability distribution π(t) will tend to the equilibrium distribution π as t → ∞. The total variation distance (the maximum difference in probabilities assigned by two different probability distributions to the same event) between the distributions π(t) and π is upper bounded in terms of λ_2 as [22]
\[
\sup \|\pi(t) - \pi\|_{tv} \le \tfrac{1}{2}\, n^{0.5}\, e^{-\lambda_2 t}, \tag{7}
\]
where ‖·‖_{tv} is the total variation distance. Equation (7) implies that a larger value of λ_2 will result in a tighter bound on the total variation distance between π(t) and π and consequently result in a faster mixing of the Markov process. The second smallest eigenvalue λ_2 can be expressed as a function of the Laplacian matrix L as [3]
\[
\lambda_2(L) = \inf\{\, x^T L x \;|\; \|x\| = 1,\ \mathbf{1}^T x = 0 \,\}, \tag{8}
\]
where 1 is the vector with all ones. As L1 = 0, 1 is the eigenvector of the smallest eigenvalue, λ_1 = 0, of L. Equation (8) implies that λ_2 is a concave function of L, as it is the pointwise infimum of a family of linear functions of L. Hence the maximization of λ_2 as a function of the Laplacian matrix L is a convex optimization problem for which a global maximum is guaranteed.
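To make these quantities concrete, the sketch below (our own illustration, not the authors' code) builds the Laplacian of a small affinity matrix with NumPy/SciPy, reads off the algebraic connectivity λ2, and evaluates π(t) = e^{−tL} π(0); the toy affinity values are arbitrary.

```python
import numpy as np
from scipy.linalg import expm

def laplacian(A):
    """L = D - A for a symmetric affinity matrix A (Eqs. (2)-(3))."""
    return np.diag(A.sum(axis=1)) - A

def algebraic_connectivity(L):
    """Second smallest eigenvalue lambda_2 of a symmetric Laplacian."""
    return np.sort(np.linalg.eigvalsh(L))[1]

# Toy affinity matrix over 4 nodes: two tightly connected pairs,
# weakly linked to each other (a 'bottleneck').
A = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.9],
              [0.1, 0.1, 0.9, 0.0]])
L = laplacian(A)
lam2 = algebraic_connectivity(L)

# pi(t) = exp(-t L) pi(0): the distribution relaxes towards equilibrium,
# and the relaxation is faster when lambda_2 is larger (Eq. (7)).
pi0 = np.array([1.0, 0.0, 0.0, 0.0])
pi_t = expm(-2.0 * L) @ pi0
```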
3 Feature Graphs and Irrelevant Features
We now relate the maximization of the Markov process mixing rate to the identification of irrelevant features for salient object detection. In salient region detection methods like [8, 12] that use multiple features, all features are given equal importance. Our method for unsupervised feature selection aims to identify the important features by representing the image with several 'feature graphs', which implies that each feature is encoded into a separate image graph model. The nodes in the 'feature graphs' represent 8 × 8 patches in the image, similarly to [8]. The affinity between nodes in a feature graph depends on the respective feature. If there are M features available for every node in the image, we have M different feature graphs in which the affinity between node i and node j for the mth feature is defined as in equation (1), with
\[
w_{ij}^{m} = e^{-\|f_i^{m} - f_j^{m}\|^2}, \tag{9}
\]
where f_i^m and f_j^m are the values of the mth feature on nodes i and j respectively. It is evident that the connectivity in the M feature graphs will be different. The relevant features, or the features that contribute to the salient parts of the image, will have feature graphs with 'bottlenecks'. The term 'bottleneck' implies a large group of edges which show very low affinity between two groups (clusters) of nodes. Such bottlenecks arise from the fact that the feature values for a certain feature on the salient part of an image will be different from the feature values on a non-salient part like the background. For example, in Figure 2(a), if the feature graph is attributed to the color feature, then the nodes connected by thick lines (indicating strong affinity) fall into two groups, each having a different color value. One group might belong to the salient object and the other to the
background, and both of them are connected by thin edges (indicating weak affinity). This group of thin edges signifies the presence of a bottleneck. This phenomenon of clustering of nodes does not occur in feature graphs attributed to irrelevant features, and hence such graphs will not have bottlenecks. This is illustrated in Figure 2(b) for the same set of nodes as in Figure 2(a), but for a different feature. Here the affinities for the irrelevant feature vary smoothly (for example, consider the shading variations of color in Figure 1(b)) and hence do not result in a bottleneck in its respective feature graph. Therefore, a Markov process running on an irrelevant feature graph mixes faster compared to a Markov process on a relevant feature graph, since in the relevant feature graph the transition time between two groups of nodes connected by very weak edge strengths will always be high. Now consider all possible graph realizations resulting as a linear combination of the feature graphs. When we maximize the Markov mixing rate of a linear combination of these feature graphs, what we are essentially doing is identifying the irrelevant features that do not form any kind of bottleneck.
Fig. 2. Feature graphs with the same set of nodes showing (a) a bottleneck between two groups of nodes and (b) no bottleneck. The thicker edges indicate larger affinity between nodes.
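A minimal sketch of how the M feature graphs of Eq. (9) could be built from per-node feature values; the array shapes and helper names are our own assumptions, not part of the paper.

```python
import numpy as np

def feature_affinity(f):
    """Affinity matrix for one feature: w_ij = exp(-||f_i - f_j||^2), Eq. (9).

    f is an (n,) or (n, d) array with the feature value of each of the n nodes."""
    f = f.reshape(len(f), -1)                     # (n, d)
    diff = f[:, None, :] - f[None, :, :]          # all pairwise differences
    W = np.exp(-np.sum(diff ** 2, axis=2))        # exp(-squared distance)
    np.fill_diagonal(W, 0.0)                      # no self-affinity, as in Eq. (1)
    return W

def feature_graphs(features):
    """features: list of M per-node feature arrays -> list of M affinity matrices."""
    return [feature_affinity(f) for f in features]
```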
3.1 Irrelevant Feature Identification
We adopt a semi-definite programming method similar to [22] for maximizing the second smallest eigenvalue of the Laplacian. However, in [22] a convex optimization method is proposed to find the fastest mixing Markov process among all possible edge weights of the graph realization G(V, E), giving a generic solution. In our proposed method, searching the entire family of graph realizations that gives the fastest mixing rate will not give us the required specific solution. Hence we restrict our search to all possible linear combinations of the feature graphs. The maximization problem of the second smallest eigenvalue is formulated as
\[
\begin{aligned}
\text{maximize} \quad & \lambda_2(L) \\
\text{subject to} \quad & L = \sum_k a_k L_k, \quad \sum_k a_k = 1, \quad a_k \ge 0 \;\; \forall k,
\end{aligned} \tag{10}
\]
where L_k is the Laplacian matrix of the kth feature graph and a_k is the weight associated with the respective feature graph. Now we establish that the constraints on the matrix L in Equation (10) make it a valid candidate for a Laplacian matrix.

Lemma 1. If L_1, L_2, ..., L_M are the Laplacian matrices of M feature graphs and if L = \sum_k a_k L_k under the conditions \sum_k a_k = 1 and a_k \ge 0, then L is also a Laplacian matrix.

Proof. For the row sums of L,
\[
\sum_j L_{ij} = \sum_j \sum_k a_k L^{k}_{ij} = \sum_k a_k \sum_j L^{k}_{ij} = 0.
\]
Also, L_{ij} = \sum_k a_k L^{k}_{ij} is non-positive for i ≠ j, as a_k ≥ 0 and L^{k}_{ij} ≤ 0 for all k and i ≠ j. Similarly, all diagonal elements of L are non-negative, proving that L represents the Laplacian matrix of some graph resulting from the linear combination of the feature graphs.

The optimization problem in Equation (10) can be equivalently expressed as
\[
\begin{aligned}
\text{maximize} \quad & s \\
\text{subject to} \quad & sI \preceq L + \beta \mathbf{1}\mathbf{1}^T, \quad L = \sum_k a_k L_k, \quad \sum_k a_k = 1, \quad a_k \ge 0 \;\; \forall k, \quad \beta > 0,
\end{aligned} \tag{11}
\]
where the symbol ⪯ represents a matrix inequality. The equivalent optimization problem maximizes a variable s subject to a matrix inequality constraint and linear equality constraints. The term β11^T removes the space of matrices with eigenvector 1 and eigenvalue 0 from the space of matrices L [2]. Hence maximizing the second smallest eigenvalue of L is equivalent to maximizing the smallest eigenvalue of L + β11^T.

Lemma 2. The maximum s that satisfies the matrix inequality constraint in Equation (11) is the minimum eigenvalue of the matrix variable L + β11^T.

Proof. Let P = L + β11^T. The smallest eigenvalue of P can be expressed as
\[
\lambda_{\min} = \inf_x \frac{x^T P x}{x^T x}.
\]
Assume P ≺ λ_min I. This implies that P − λ_min I is negative definite, so x^T (P − λ_min I) x < 0 for any x, i.e. x^T P x < λ_min x^T x, and hence x^T P x / (x^T x) < λ_min, which is false since the minimum value of x^T P x / (x^T x) is λ_min. Hence P ⪰ λ_min I, and the maximum s that satisfies the constraint P ⪰ sI is λ_min.

The inequality in Equation (11) can also be expressed as
\[
sI - (L + \beta \mathbf{1}\mathbf{1}^T) \preceq 0. \tag{12}
\]
Since L = \sum_k a_k L_k, the matrix inequality in Equation (12) can be expressed as
\[
sI + a_1(-L_1) + a_2(-L_2) + \cdots + a_M(-L_M) + \beta(-\mathbf{1}\mathbf{1}^T) \preceq 0, \tag{13}
\]
which is a standard Linear Matrix Inequality (LMI) constraint in an SDP problem formulation [3], as all the matrices in Equation (13) are symmetric. Equation (11) can thus be seen as the minimization of a linear function subject to an LMI and linear equality constraints, and can be effectively solved by standard SDP solvers such as CVX or SeDuMi. Our proposed algorithm for finding the weights of the different features can be summarized as follows:
1) Calculate the feature graph of every feature, with the n × n affinity matrix A_m for the mth feature computed according to Equations (9) and (1).
2) Calculate the Laplacian matrices L_1, L_2, ..., L_M of the M feature graphs according to Equations (2) and (3).
3) Formulate the SDP problem in Equation (11) with input data L_1, L_2, ..., L_M and variables s, β, L, a_1, a_2, ..., a_M, and solve for a_1, a_2, ..., a_M using a standard SDP solver.
4) Find the feature weights as r = 1 − a and normalize the weights.
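The paper solves the SDP with CVX/SeDuMi in MATLAB; purely as an illustration of steps 1-4, the concave objective λ2(Σ_k a_k L_k) can also be maximised over the simplex with a generic solver, as in the Python sketch below. This is our own prototype under those assumptions, not the authors' implementation, and a proper SDP solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def lambda2(L):
    """Second smallest eigenvalue (algebraic connectivity) of a symmetric Laplacian."""
    return np.sort(np.linalg.eigvalsh(L))[1]

def feature_weights(affinities):
    """Steps 1-4 above. `affinities` is a list of M per-feature affinity matrices."""
    laplacians = [np.diag(A.sum(axis=1)) - A for A in affinities]       # step 2
    M = len(laplacians)

    def neg_lambda2(a):                                                 # step 3 objective
        L = sum(ak * Lk for ak, Lk in zip(a, laplacians))
        return -lambda2(L)

    res = minimize(neg_lambda2, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, 1.0)] * M,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    r = 1.0 - res.x                                                     # step 4
    return r / r.sum()

# The resulting weights r are then used to form the combined affinity
# matrix A = sum_i r_i A_i of Eq. (14) for the salient object detector.
```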
4 Experiments
The effectiveness of the proposed algorithm for unsupervised feature selection is experimentally verified by using it in conjunction with the random walk based method for salient object detection described in [8], where a random walk is performed on a simple graph model of the image with equal weight for all features. The proposed method provides feature weights according to feature relevancy, resulting in a much more accurate graph model in the context of salient object detection. The final image graph model is determined by the graph affinity matrix A, which is evaluated as
\[
A = \sum_i r_i A_i, \tag{14}
\]
where r_1, r_2, ..., r_M are the final feature weights evaluated in Section 3.1 and A_1, A_2, ..., A_M are the feature affinity matrices. The proposed feature selection method falls in the class of unsupervised filter methods of feature selection, where the feature selection is independent of the classification algorithm that follows. Hence we compare the proposed method with the two unsupervised filter methods of feature selection explained in [13] and [18]. The Laplacian score based feature selection method [13] assigns a score to every feature denoting the extent to which each feature 'respects' the data manifold. In [18], an unsupervised filter method is proposed to remove redundant features by evaluating a feature correlation measure for every pair of features. The features are clustered according to their similarity and the representative feature of each set is selected, starting from the most dissimilar feature set. However, as our proposed method weighs different features according to their importance, we use the feature correlation measure in [18] to weigh the features while comparing with the proposed method. We used the same dataset of 5000 images used in [8], which was originally provided by [17]. The dataset contains user annotations as rectangular boxes
around the most salient region in the image. The performance of the algorithm is objectively evaluated with the ground truth available in the data set. We use the same seven dimensional feature set used in [8]. The features are color (Cb and Cr) and texture (orientation complexity values at five different scales). According to our algorithm, there will be seven different feature graphs (transition matrices), one for each feature. The optimal weights of features are obtained as explained in Section 3.1 with semi-definite programming in Matlab using the cvx package [10]. Using the optimal feature weights, a new graph model for the image is designed according to Equation (14), which is used with [8] to detect the salient regions.
Fig. 3. Comparison of saliency maps for [8] with the proposed feature selection. (a), (d) Original images. (b), (e) Saliency maps using [8]. (c), (f) Saliency maps of [8] with the proposed feature selection.
Figure 3 shows saliency maps for some example images on which the original algorithm in [8] resulted in wrong or partially correct saliency maps. The original images are shown in Figures 3(a) and 3(d), and Figures 3(b) and 3(e) show the saliency maps generated by [8], in which the image graph model is designed with equal importance given to all features, whether relevant or irrelevant. Figures 3(c) and 3(f) show the saliency maps resulting from the proposed feature selection method used with [8]. It can be seen that the saliency maps are more accurate in Figures 3(c) and 3(f), where the features are weighted according to their relevance. For example, consider the first image in Figure 3(a). The color feature plays a large role in saliency for this image, and the proposed algorithm
Table 1. F-measure comparison for different methods

Method                                                     F-measure
Random walk method [8]                                     0.63
[8] with Laplacian score based feature selection [13]      0.68
[8] with correlation based feature selection [18]          0.68
[8] with the proposed unsupervised feature selection       0.70
detects the salient region more accurately by giving more weight to the color feature. On the contrary, for the eagle image in the second row, Figure 3(d), the color features do not have much of a role in deciding the salient region, and the proposed algorithm obtains a more accurate saliency map by giving more priority to the texture features. In most real-life images it is some specific mixture of various features (e.g. a combination of color and the orientation entropy feature at some scale) that proves to be relevant in the context of salient object detection. The proposed algorithm intends to model this rightly weighted combination of features, as can be seen from the results for the other images in Figure 3. It should also be noted that, though our proposed method assigns soft weights to different features according to their relevance, a hard decision on the features, as in the selection of a feature subset, can be made by applying some heuristics to the soft weights (for example, selecting the minimum number of features that accounts for 90% of the total soft weight assigned).

The ground truth available for the dataset allows us to objectively compare various algorithms using precision, recall and F-measure. Precision is calculated as the ratio of the total saliency in the saliency map captured inside the user annotated rectangle (sum of intensities inside the user box) to the total saliency computed for the image (sum of intensities for the full image). Recall is calculated as the ratio of the total saliency captured inside the user annotated window to the area of the user annotated window. F-measure, which is the harmonic mean of precision and recall, is the final metric used for comparison. Table 1 lists the F-measure computed for the various feature selection methods on the entire dataset. We compare the proposed unsupervised feature selection algorithm with the Laplacian score based feature selection algorithm [13] and the correlation based feature selection algorithm [18]. It can be seen from the table that the proposed method significantly improves the performance of the random walk method and is better than the methods in [13] and [18]. It has to be noted that the methods proposed in [17] and [23] claim a better F-measure than the feature selection combined graph based approach on the same dataset. However, [17] is a supervised method in which 1000 annotated images were used to train a salient region detector which was tested on the remaining 4000 images. In [23], after obtaining the saliency map, a graph cut based segmentation algorithm is used to segment the image. Some segments with higher saliency are combined to form the final saliency map. The accuracy of [23] depends heavily on the accuracy of the segmentation algorithm and on the techniques used to combine different segments to obtain the final salient object. Hence, as an unsupervised
and independent method (like [25] and [11]) that provides a good f-measure on a large dataset, the graph based approach is still attractive and demands attention for future enhancements.
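As a small illustration of the evaluation protocol described above, precision, recall and the F-measure could be computed from a saliency map and a ground-truth rectangle as follows (our own sketch; variable names and the box format are assumptions).

```python
import numpy as np

def precision_recall_f(saliency, box):
    """saliency: 2D array of saliency values; box: (x1, y1, x2, y2) user-annotated rectangle."""
    x1, y1, x2, y2 = box
    inside = saliency[y1:y2, x1:x2].sum()               # saliency captured inside the user box
    precision = inside / saliency.sum()                  # inside / total saliency of the image
    recall = inside / float((x2 - x1) * (y2 - y1))       # inside / area of the user box
    f = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
    return precision, recall, f
```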
5 Conclusion
We have proposed an unsupervised feature selection method for salient object detection, based on identifying the irrelevant features. The problem formulation is convex and is guaranteed to achieve the global optimum. The proposed feature selection method can be used as a filter method to improve the performance of salient object detection algorithms. Its power is demonstrated by using it in conjunction with a graph based algorithm for salient object detection, resulting in improved performance.
References
1. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Trans. Graph. 26(3) (2007) ISSN: 0301–0730
2. Boyd, S.: Convex Optimization of Graph Laplacian Eigenvalues. In: Proceedings International Congress of Mathematicians, vol. 3, pp. 1311–1319 (2006)
3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
4. Bruce, N.D., Tsotsos, J.K.: Saliency Based on Information Maximization. In: NIPS, pp. 155–162 (2005)
5. Chen, L.Q., Xie, X., Fan, X., Ma, W.Y., Zhang, H.J., Zhou, H.Q.: A visual attention model for adapting images on small displays. Multimedia Syst. 9, 353–364 (2003)
6. Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society, Providence (1997)
7. Gao, D., Vasconcelos, N.: Discriminant Saliency for Visual Recognition from Cluttered Scenes. In: NIPS, pp. 481–488 (2004)
8. Gopalakrishnan, V., Hu, Y., Rajan, D.: Random walks on graphs to model saliency in images. In: CVPR, pp. 1698–1705 (2009)
9. Gopalakrishnan, V., Hu, Y., Rajan, D.: Salient Region Detection by Modeling Distributions of Color and Orientation. IEEE Transactions on Multimedia 11(5), 892–905 (2009)
10. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming (web page and software) (2009), http://stanford.edu/~boyd/cvx
11. Guo, C., Ma, Q., Zhang, L.: Spatio-temporal Saliency Detection Using Phase Spectrum of Quaternion Fourier Transform. In: CVPR (2008)
12. Harel, J., Koch, C., Perona, P.: Graph-Based Visual Saliency. In: NIPS, pp. 545–552 (2006)
13. He, X., Cai, D., Niyogi, P.: Laplacian Score for Feature Selection. In: NIPS (2005)
14. Hou, X., Zhang, L.: Saliency Detection: A Spectral Residual Approach. In: CVPR (2007)
15. Itti, L., Koch, C., Niebur, E.: A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
16. Law, M.H.C., Figueiredo, M.A.T., Jain, A.K.: Simultaneous Feature Selection and Clustering Using Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell. (2004)
17. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to Detect A Salient Object. In: CVPR (2007)
18. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised Feature Selection Using Feature Similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 301–312 (2002)
19. Norris, J.: Markov Chains. Cambridge University Press, Cambridge (1997)
20. Santella, A., Agrawala, M., DeCarlo, D., Salesin, D., Cohen, M.F.: Gaze-based interaction for semi-automatic photo cropping. In: CHI, pp. 771–780 (2006)
21. Stentiford, F.: Attention based Auto Image Cropping. In: The 5th International Conference on Computer Vision Systems (2007)
22. Sun, J., Boyd, S., Xiao, L., Diaconis, P.: The Fastest Mixing Markov Process on a Graph and a Connection to a Maximum Variance Unfolding Problem. SIAM Review 48, 681–699 (2004)
23. Valenti, R., Sebe, N., Gevers, T.: Image Saliency by Isocentric Curvedness and Color. In: ICCV (2009)
24. Volker, R., Tilman, L.: Feature selection in clustering problems. In: NIPS (2004)
25. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Network 19, 1395–1407 (2006)
26. Zhao, B., Kwok, J., Wang, F., Zhang, C.: Unsupervised Maximum Margin Feature Selection with Manifold Regularization. In: CVPR (June 2009)
MRF Labeling for Multi-view Range Image Integration

Ran Song 1, Yonghuai Liu 1, Ralph R. Martin 2, and Paul L. Rosin 2

1 Department of Computer Science, Aberystwyth University, UK
  {res,yyl}@aber.ac.uk
2 School of Computer Science & Informatics, Cardiff University, UK
  {Ralph.Martin,Paul.Rosin}@cs.cardiff.ac.uk
Abstract. Multi-view range image integration focuses on producing a single reasonable 3D point cloud from multiple 2.5D range images for the reconstruction of a watertight manifold surface. However, registration errors and scanning noise usually lead to a poor integration and, as a result, the reconstructed surface cannot have topology and geometry consistent with the data source. This paper proposes a novel method cast in the framework of Markov random fields (MRF) to address the problem. We define a probabilistic description of a MRF labeling based on all input range images and then employ loopy belief propagation to solve this MRF, leading to a globally optimised integration with accurate local details. Experiments show the advantages and superiority of our MRF-based approach over existing methods.
1 Introduction
3D surface model reconstruction from multi-view 2.5D range images has a wide range of applications in many fields such as reverse engineering, CAD, medical imagery and the film industry. Its goal is to estimate a manifold surface that approximates an unknown object surface using multi-view range images, each of which essentially represents a sample of points in 3D Euclidean space, combined with knowledge about the scanning resolution and the measurement confidence. These samples of points are usually described in local, system centred, coordinate systems and cannot offer a full coverage of the object surface. Therefore, to build up a complete 3D surface model, we usually need to register a set of overlapped range images into a common coordinate frame and then integrate them to fuse the redundant data contained in overlapping regions while retaining enough data to sufficiently represent the correct surface details. However, to achieve both is challenging due to the ad hoc nature of the problem. On the one hand, registered range images are actually 3D unstructured point clouds downgraded from the original 2.5D images. On the other hand, scanning noise such as unwanted outliers and data loss typically caused by self-occlusion, large registration errors and the loss of connectivity relationships among sampled points in the acquired data often lead to a poor reconstruction. As a result, the reconstructed surface may include holes, false connections, thick and non-smooth or over-smooth patches, and artefacts.
Hence, a good integration should be robust to registration errors and scanning noise introduced in the stages of registration and data acquisition. Once multiple registered range images have been fused into a single reasonable point cloud, many techniques [1–4] can be employed to reconstruct a watertight surface.
2 Related Work
Existing integration methods can be classified into four categories: volumetric methods, mesh-based methods, point-based methods and clustering-based methods. The volumetric method [5–8] first divides the space around objects into voxels and then fuses the data in each voxel. However, the comparative studies [9, 10] show that such methods are time-consuming, memory-hungry and not robust to registration errors and scanning noise. The mesh-based method [11–14] first employs a step discontinuity constrained triangulation and then detects the overlapping regions between the triangular meshes derived from successive range images. Finally, it retains the most accurate triangles in the overlapping regions and reconnects all remaining triangles subject to a certain objective function, such as maximising the product of the interior angles of the triangles. However, since the number of triangles is usually much larger than that of the sampled points, the mesh-based methods are computationally expensive. Thus some of the mesh-based methods just employ a 2D triangulation in the image plane to estimate the local surface connectivity in the first step, as computation in a 2D sub-space is more efficient. But projection from 3D to 2D may lead to ambiguities if the projection is not injective. The mesh-based methods are thus highly likely to fail in non-flat regions where no unique projection plane exists. This speeded-up strategy cannot deal with 3D unstructured point clouds either. The point-based method [15, 16] produces a set of new points with optimised locations. Due to the neglect of local surface topology, its integration result is often over-smooth and cannot retain enough surface details. The latest clustering-based method [9, 10] employs classical clustering methods to minimise objective dissimilarity functions. It surpasses previous methods since it is more robust to scanning noise and registration errors. Nonetheless, the clustering, which measures Euclidean distances to find the closest centroids, does not consider local surface details, leading to severe errors in non-flat areas. For instance, in Fig. 1, although point A is closer to point B and thus the clustering-based method will wrongly group them together, we would rather group A with C or D to maintain the surface topology.
Fig. 1. Local topology has a significant effect on the point clustering in non-flat areas
To overcome the drawbacks of the existing integration methods, we propose a novel MRF-based method. We first construct a network by a point shifting strategy in Section 3. In Section 4, the MRF labeling is then configured on this network, where each node is formally labeled with an image number. In Section 5, we employ loopy belief propagation (LBP) to solve the MRF and find the optimal label assignment. Each node then simply selects its closest point from the image with its assigned image number. These selected points, taken directly from the original data, are used for surface reconstruction. The integration achieves a global optimisation as the MRF labeling considers all image numbers for each node. Also, according to the experimental results shown in Section 6, it effectively preserves local surface details since, in the output point cloud, the points representing a local patch of the reconstructed surface are usually from the same input image due to the neighbourhood consistency of the MRF. More importantly, our method can also cope with the more general input of multiple 3D unstructured point clouds.
3 MRF Network Construction for Integration
MRF describes a system based on a network or a graphical model and allows many features of the system of interest to be captured by simply adding appropriate terms representing contextual dependencies into it. Once the network has been constructed, contextual dependencies can be cast in the manipulations within a neighbourhood. In a MRF network, the nodes are usually pixels. For example, in [17–20], the network is the 2D image lattice and 4 neighbouring pixels are selected to form a neighbourhood for each pixel. But it is difficult to define a network for multiple registered range images since the concept 'pixel' does not exist. The simplest method is to produce a point set by k-means clustering, and the neighbours of a point can then be found by a Nearest Neighbours (NNs) algorithm. Since both k-means clustering and NNs are based only on Euclidean distance, in such a network neither the delivery of surface topological information nor the compensation for pairwise registration errors can be achieved. In this paper, we propose a novel scheme to produce a MRF network Inet. Given a set of consecutive range images I1, I2, ..., Im, we first employ the pairwise registration method proposed in [21] to obtain a transform H12 mapping I1 into the coordinate system of I2. The transformed image I1 and the reference image I2 supply not only the redundant surface information (overlapping area) but also the new surface information (non-overlapping area) from their own viewpoints for the fused surface. To integrate I1 and I2, the overlapping area and non-overlapping area have to be accurately and efficiently detected. In the new scheme, we declare that a point from one viewpoint belongs to the overlapping area if its distance to its corresponding point from the other viewpoint is within a threshold; otherwise it belongs to the non-overlapping area. The corresponding point is found by a closest point search with k-D tree speedup. The threshold is set as 3R, where R is the scanning resolution of the input range images. After the overlapping area detection, we initialise Inet as
\[
I_{net} = S_{non\text{-}overlap}, \qquad S_{non\text{-}overlap} = S_1 + S_2, \tag{1}
\]
where S_1 from I_1 and S_2 from I_2 are the points in the non-overlapping area. Then, each point in the overlapping area is shifted by half of its distance to its corresponding point, along its normal \vec{N}, towards its corresponding point:
\[
P' = P + 0.5\, d \cdot \vec{N}, \qquad d = \overrightarrow{\Delta P} \cdot \vec{N}, \qquad \overrightarrow{\Delta P} = P_{cor} - P, \tag{2}
\]
where P_{cor} is the corresponding point of P. After such a shift, the corresponding points are closer to each other. For each point shifted from a point in the overlapping region of the reference image I_2, a sphere with a radius r = M × R (M is a parameter controlling the density of the output point cloud) is defined. If some points fall into this sphere, then their original points, without shifting, are retrieved. Their averages S_{overlap} are then computed and I_{net} is updated as
\[
I_{net} = S_{non\text{-}overlap} + S_{overlap}. \tag{3}
\]
This point shifting (1) compensates for the pairwise registration errors, (2) does not change the size of the overlapping area, since points are only moved along their normals, and (3) still keeps the surface topology, again because the shift is along the normal. Now, we consider the third input image I3. We map the current Inet into the coordinate system of I3 through the transformation H23. Similarly, we then perform the overlapping area detection between the transformed Inet and the current reference image I3. We still update Inet by Eq. (3). In this update, Snon-overlap contains the points from the transformed Inet and I3 in non-overlapping regions, and Soverlap is produced by the aforementioned point shifting strategy. We keep running this updating scheme over all input range images. The final Inet will be consistent with the coordinate system of the last input image Im. Although it is not usual, if some points in Inet only appear in Snon-overlap, we delete them from Inet. Fig. 2 shows 18 range images of a bird mapped into a single global coordinate system and the MRF network Inet produced by our new scheme. In this paper, for clarity, we use surfaces instead of point clouds when we show range data. The triangulation method employed here for surface reconstruction is the ball-pivoting algorithm [1]. In Fig. 2, it can be seen that the registered images contain a massive amount of noise, such as thick patches, false connections and fragments caused by poor scanning and registration errors. The point shifting scheme compensates for the errors to some extent, whereas the resultant point set Inet is still noisy, leading to an over-smoothed surface without enough local details. But it offers us a roughly correct manifold for estimating the ground truth surface. We also need to define a neighbourhood for each node in Inet, consistent with the MRF theory. Instead of performing a NNs algorithm to search for the neighbours of a point i in such a network, we find the neighbouring triangles of i in the triangular mesh. The vertices of these triangles, excluding i, are defined as the neighbours of i. The collection of all neighbours of i is defined as the neighbourhood of i, written as N(i). In this work, to strengthen the neighbourhood consistency, we define a larger neighbourhood N(i) consisting of neighbours and neighbours' neighbours, as shown in Fig. 3. Doing so is allowed since the condition i ∈ N(j) ⇔ j ∈ N(i) is still satisfied. The superiority of this definition over the NNs method is that it reflects the surface topology and is parameter-insensitive.
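A rough NumPy sketch of the overlap test (closest point within 3R, with k-D tree speedup) and of the point shifting of Eq. (2); the array layouts and function names are our own assumptions, not the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree

def split_overlap(points_a, points_b, R):
    """A point of A is in the overlapping area if its closest point in B is within 3R."""
    dist, idx = cKDTree(points_b).query(points_a)
    overlap = dist < 3.0 * R
    return overlap, idx

def shift_overlapping_points(P, P_cor, N):
    """Shift each point halfway towards its correspondence along its own normal (Eq. (2)).

    P, P_cor : (k, 3) arrays of overlapping points and their corresponding points.
    N        : (k, 3) array of unit normals at P.
    """
    d = np.sum((P_cor - P) * N, axis=1, keepdims=True)  # d = (P_cor - P) . N
    return P + 0.5 * d * N                               # P' = P + 0.5 d N
```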
Fig. 2. Left: 18 range images are registered (Please note that a patch from a certain range image is visible if and only if it is the outermost surface); Right: The surface representing point set Inet produced by point shifting is usually noisy and over-smooth
Fig. 3. Left: A neighbourhood and the normalised normal vectors attached to the neighbours. Right: the expanded neighbourhood used in this work. Please note that the number of neighbours of each point in such a network can vary.
4 MAP-MRF Labeling for Integration
We denote by s = {1, ..., n} the points in Inet and define the label assignment x = {x_1, ..., x_n} to all points as a realisation of a family of random variables defined on s. We also define the feature space that describes the points and understand it as a random observation field with its realisation y = {y_1, ..., y_n}. Finally, we denote a label set L = {1, ..., m}, where each label corresponds to an image number. Thus, we have {x_i, y_j} ∈ L, {i, j} ∈ s. An optimal labeling should satisfy the maximum a posteriori probability (MAP) criterion [22], which requires the maximisation of the posterior probability P(x|y). Considering the Bayes rule, this maximisation can be written as
\[
x^* = \arg\max_x P(x|y) = \arg\max_x P(x)\, p(y|x). \tag{4}
\]
In a MRF, the prior probability of a certain value at any given point depends only upon neighbouring points. This Markov property leads to the Hammersley–Clifford theorem [22]: the prior probability P(x) satisfies a Gibbs distribution with respect to a neighbourhood N, expressed as P(x) = Z^{-1} e^{-U(x)/Q}, where Z is a normalising constant and U(x) is the prior energy. Q, a global control parameter, is set to 1 in most computer vision and graphics problems [22]. Let p(y|x) be expressed in the exponential form p(y|x) = Z_y^{-1} e^{-U(y|x)}, where U(y|x) is the likelihood energy. Then P(x|y) is a Gibbs distribution
\[
P(x|y) = Z_U^{-1}\, e^{-(U(x) + U(y|x))}. \tag{5}
\]
The MAP-MRF problem is then converted into a minimisation problem for the posterior energy U(x|y): x^* = \arg\min_x U(x|y) = \arg\min_x \{U(x) + U(y|x)\}. Under the MRF assumption that the observations are mutually independent, U(y|x) can be computed as the sum of the likelihood potentials V(y_i|x_i) at all points. In this work, we estimate V(y_i|x_i) by a truncation function:
\[
U(y|x) = \sum_{i \in s} V(y_i|x_i) = \sum_{i \in s} \sum_{y_i \in L \setminus x_i} \min\big(D_i(y_i, x_i),\, F\big), \tag{6}
\]
where F is a constant, L \setminus x_i denotes the labels in L other than x_i, and
\[
D_i(y_i, x_i) = \|C_i(y_i) - C_i(x_i)\| \tag{7}
\]
is a distance function, where C_i(l), l ∈ L, denotes i's closest point in the lth input image. In this way, we convert the estimation of the likelihood potential of different labels at point i (or the single label cost) into the measurement of the distances between i's closest points from the input images with different labels. The truncation parameter F plays an important role here. It eliminates the effect of the input images which do not cover the area around i.

The prior energy U(x) can be expressed as the sum of clique energies:
\[
U(x) = \sum_i V_1(x_i) + \sum_i \sum_{j \in N(i)} V_2(x_i, x_j) + \sum_i \sum_{j \in N(i)} \sum_{k \in N(i)} V_3(x_i, x_j, x_k) + \cdots \tag{8}
\]
We only consider the unary clique energy and the binary one. Here, the single-point clique potentials V_1(x_i) are set to 0, as we have no preference for which label should be better. We hope that in the output point cloud neighbouring points are from the same input image, so that the local surface details can be well preserved. Therefore, we utilize the neighbourhood consistency of the MRF to 'discourage' the routine from assigning different labels to two neighbouring points:
\[
U(x) = \sum_i \sum_{j \in N(i)} V_2(x_i, x_j) = \sum_i \sum_{j \in N(i)} \lambda A_{ij}(x_i, x_j), \tag{9}
\]
where λ is a weighting parameter and we use the Potts model [23] to measure the clique energy (or the pairwise label cost):
\[
A_{ij}(x_i, x_j) = \begin{cases} 1, & x_i \neq x_j \\ 0, & \text{otherwise.} \end{cases} \tag{10}
\]
Once we find the optimal x^*, which assigns a label x_i to point i by minimising the posterior energy, i will be replaced with its closest point from the input image with that number (the x_i-th image). Please note that in this step we do not need to run a closest point search again, as this point C_i(x_i) has already been searched for (see Eq. (7)). The final output of the integration is thus a single point cloud composed of points directly from the different registered input range images.
Fig. 4. Left: The idea of point shifting to generate Inet . Right: The illustration of a 2-image integration based on the MAP-MRF labeling.
The labeling is actually a segmentation of the point set s. Due to Eqs. (9) and (10), neighbouring points in s tend to be labeled with the same image number. The surface reconstructed from the output point cloud will consist of patches directly from different registered input images. Therefore, one advantage of the new integration method is the preservation of surface details, which avoids the error illustrated in Fig. 1. Another significant advantage is its high robustness to possible moving objects. The image patches containing moving objects are less likely to be selected. On the one hand, the single label cost for labeling the points in the area corresponding to a moving object with the image containing the moving object must be high in terms of Eq. (6). On the other hand, the points on the boundaries among patches always have high pairwise label costs according to Eqs. (9) and (10). As a result of minimising the posterior energy, the boundaries usually bypass the areas where the moving objects appear.

Here we also add a robustness parameter β in order to make the algorithm more robust to noise. When we compute the single label cost V(y_i|x_i), we always carry out the following robustness judgment:
• if V(y_i|x_i) < β, go through the routine;
• if V(y_i|x_i) > β, terminate the calculation of V(y_i|x_i) and delete the point i from the list s.
Here,
\[
\beta = (m - q) \times F, \tag{11}
\]
where m is the number of input images (m = max(L)) and q is the number of input images that contain roughly the same noise, if known. Generally, we set q to 2 so that random noise can be effectively reduced. We can also adjust q experimentally. If the reconstructed surface still has some undesirable noise, q should be increased. If there are some holes in the surface, we use a smaller q. In summary, Fig. 4 illustrates the idea of point shifting, which is the key step in constructing the MRF network in Section 3, and a 2-image integration based on the MAP-MRF labeling described in detail in Section 4.
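The cost terms of Eqs. (6)-(7) and (9)-(10), together with the β test of Eq. (11), could be prototyped as below. This is our own sketch: the `closest` array (holding C_i(l) for every node and label) is assumed to be precomputed, and the deletion rule is interpreted here as dropping a node whose best single label cost exceeds β.

```python
import numpy as np

def data_costs(closest, F, q=2):
    """Single label costs V(y_i | x_i) with truncation (Eq. (6)) and the beta test (Eq. (11)).

    closest : (n, m, 3) array, closest[i, l] = C_i(l), node i's closest point in image l.
    Returns the (n, m) cost table and a boolean mask of nodes kept after the beta test.
    """
    n, m, _ = closest.shape
    beta = (m - q) * F
    cost = np.zeros((n, m))
    for xi in range(m):
        # ||C_i(y_i) - C_i(x_i)|| for all y_i, truncated at F (Eqs. (6)-(7))
        d = np.linalg.norm(closest - closest[:, xi:xi + 1, :], axis=2)
        d = np.minimum(d, F)
        d[:, xi] = 0.0                     # the sum in Eq. (6) excludes y_i = x_i
        cost[:, xi] = d.sum(axis=1)
    keep = cost.min(axis=1) <= beta        # robustness judgment (one interpretation)
    return cost, keep

def pairwise_cost(xi, xj, lam):
    """Potts pairwise label cost lambda * A_ij (Eqs. (9)-(10))."""
    return lam if xi != xj else 0.0
```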
5 Energy Minimisation Using Loopy Belief Propagation
There are several methods which can minimise the posterior energy U(x|y). The comparative study in [24] shows that two methods, Graph Cuts (GC) [23, 25] and LBP [19, 26], have proven to be efficient and powerful. GC is usually regarded as fast
and has some desirable theoretical guarantee on the optimality of the solution it can find. And since we use the 'metric' Potts model to measure the difference between labels, the energy function could indeed be minimised by GC. However, in future research we plan to extend our technique by exploring different types of functions for the measurement of the pairwise label cost, such as the linear model and the quadratic model [19]. Thereby, LBP becomes a natural choice for us.

Max-product and sum-product are the two variants of LBP. The algorithm proposed in [19] can efficiently reduce their computational complexity from O(nk^2 T) to O(nkT) and O(nk log k T) respectively, where n is the number of points or pixels in the image, k is the number of possible labels and T is the number of iterations. Obviously, max-product is suitable here for the minimisation. It takes O(k) time to compute each message and there are O(n) messages to be computed in each iteration. In our work, each input range image has about 10^4 points and a typical set of range images contains 18 images. To reflect the true surface details of the object, the number of points in the output point cloud must be large enough. Thus, if we used a point set as the label set, the number of labels would be tens of thousands. In that case LBP would be extremely time-consuming and too memory-hungry to be tractable. A MATLAB implementation would immediately go out of memory because of the huge n × k matrix needed to save the messages passing through the network. But the final output of integration must be a single point cloud. So, our MRF labeling using image numbers as labels makes it feasible to employ LBP for the minimisation. The LBP works by passing messages in a MRF network, briefed as follows:

1. For all point pairs (i, j) ∈ N, initialise the messages m^0_{ij} to zero, where m^t_{ij} is a vector of dimension given by the number of possible labels k and denotes the message that point i sends to a neighbouring point j at iteration t.
2. For t = 1, 2, ..., T, update the messages as
\[
m^t_{ij}(x_j) = \min_{x_i}\Big( \lambda A_{ij}(x_i, x_j) + \sum_{y_i \in L \setminus x_i} \min\big(D_i(y_i, x_i), F\big) + \sum_{h \in N(i) \setminus j} m^{t-1}_{hi}(x_i) \Big). \tag{12}
\]
3. After T iterations, compute a belief vector for each point,
\[
b_j(x_j) = \sum_{y_j \in L \setminus x_j} \min\big(D_j(y_j, x_j), F\big) + \sum_{i \in N(j)} m^T_{ij}(x_j), \tag{13}
\]
and then determine the labels as
\[
x^*_j = \arg\min_{x_j} b_j(x_j). \tag{14}
\]
According to Eq. (12), the computational complexity of the calculation of a message vector is O(k^2), because we need to minimise over x_i for each choice of x_j. By a simple analysis, we can reduce it to O(k). The details are given below:

1. Rewrite Eq. (12) as
\[
m^t_{ij}(x_j) = \min_{x_i}\big( \lambda A_{ij}(x_i, x_j) + g(x_i) \big), \tag{15}
\]
where
\[
g(x_i) = \sum_{y_i \in L \setminus x_i} \min\big(D_i(y_i, x_i), F\big) + \sum_{h \in N(i) \setminus j} m^{t-1}_{hi}(x_i).
\]
2. Consider two cases: x_i = x_j and x_i ≠ x_j. (1) If x_i = x_j, then λA_{ij}(x_i, x_j) = 0, thus m^t_{ij}(x_j) = g(x_j). (2) If x_i ≠ x_j, then λA_{ij}(x_i, x_j) = λ, thus m^t_{ij}(x_j) = \min_{x_i} g(x_i) + λ.
3. Synthesizing the two cases:
\[
m^t_{ij}(x_j) = \min\Big( g(x_j),\ \min_{x_i} g(x_i) + \lambda \Big). \tag{16}
\]
According to Eq. (16), the minimisation over xi can be performed only once, independent of the value of xj . In other words, Eq. (12) needs two nested FOR loops to compute the messages but Eq. (16) just needs two independent FOR loops. Therefore, the computational complexity is reduced from O(k 2 ) to O(k) and the memory demand in a typical MATLAB implementaion is reduced from a k × k matrix to a 1 × k vector. As for implementation, we use a multi-scale version to further speed up the LBP algorithm, as suggested in [19], which significantly reduces the total number of iterations needed to reach convergence or an acceptable solution. Briefly speaking, the message updating process is coarse-to-fine, i.e. we start LBP from the coarsest network, after a few iterations transfer the messages to a finer scale and continue LBP there, so on and so forth.
6
Experimental Results and Performance Analysis
The input range images are all downloaded from the OSU Range Image Database (http://sampl.ece.ohio-state.edu/data/3DDB/RID/index.htm). On average, each ‘Bird’ image has 9022 points and each ‘Frog’ image has 9997 points. We employed the algorithm proposed in [21] for pairwise registration. Inevitably, the images are not accurately registered, since we found the average registration error (0.30mm for the ‘Bird’ images and 0.29mm for the ‘Frog’ images) is as high as 1/2 scanning resolution of the input data. Fig. 5 shows that the input range data are registered into the same coordinate system, rendering a noisy 3D surface model with many thick patches, false connections and fragments due to local and global registration errors. A large registration error causes corresponding points in the overlapping region to move away from each other and the overlapping area detection will thus be more difficult. The final integration is likely to be inaccurate accordingly. In our tests, the truncation parameter F is set to 4 and the weighting parameter λ is set to 10. Fig. 6 and 7 show the integration results produced by existing methods and our MRF-based integration, proving that our method is more robust to registration errors and thus performs best. Now we analyse some details of the new algorithm. Fig. 8 shows the labeling result. On the integrated surface, we use different colours to mark the points assigned with
36
R. Song et al.
Fig. 5. Range images (coloured differently) are registered into the same coordinate system as the input of the integration. A patch from some range image is visible if and only if it is the outermost surface. Left: 18 ‘Bird’ images; Right: 18 ‘Frog’ images.
Fig. 6. Integration results of 18 ‘Bird’ images. Top left: volumetric method [6]. Top middle: mesh-based method [14]. Top right: fuzzy-c means clustering [10]. Bottom left: k-means clustering [9]. Bottom right: the new method.
Fig. 7. Integration results of 18 ‘Frog’ images. Top left: volumetric method [6]. Top middle: mesh-based method [14]. Top right: fuzzy-c means clustering [10]. Bottom left: k-means clustering [9]. Bottom right: the new method.
MRF Labeling for Multi-view Range Image Integration
37
Fig. 8. Left: labeling result of ‘Bird’ images. Right: labeling result of ‘Frog’ images.
Fig. 9. Convergence performance of the new method using different input images. Left: The ‘Bird’ image set; Right: The ‘Frog’ image set.
Fig. 10. Integration results of 18 ‘Bird’ images produced by different clustering algorithms. From left to right (including the top row and the bottom row): k-means clustering [9], fuzzy-c means clustering [10], the new algorithm.
Fig. 11. The triangular meshes of integrated surfaces using different methods. From left to right: volumetric [6], mesh-based method [14],point-based method [16], our method.
38
R. Song et al.
different labels, corresponding to different input range images. The boundary areas coloured with red actually represent the triangles where the three vertices are not assigned with the same label (Please note that the surface is essentially a triangular mesh). If a group of points are all assigned with the same label, the surface they define will be marked with the same colour. We find that not all input images have contribution to the final integrated surface. For the ‘Bird’ and the ‘Frog’ images, the reconstructed surfaces are composed of the points from 10 and 11 source images respectively. We plot in Fig. 9 how the LBP algorithm converges. To observe the convergence performance, in the LBP algorithm, we calculate Equ. (13) to determine the label assignment in each iteration of message passing. And then we count the number of points assigned with different labels from what they are assigned in last iteration. When the number is less than 2% of the total number of the initialisation points (Inet ), the iteration is terminated. Fig. 9 shows that the new method achieves a high computational efficiency in terms of iteration number required for convergence. Due to the different objective functions, it is very difficult to define a uniform metric such as the integration error [9, 10] for a comparative study because to maintain local surface topology, we reject the idea that the smaller the average of the distances from the fused points to the closest points on the overlapping surfaces, the more stable and accurate the final fused surface. Thus we cannot use such a metric in this paper. Nonetheless, Fig. 10 highlights the visual difference of the integration results produced by classical clustering methods (according to Fig. 6 and 7, they are superior to other existing methods) and the new method. It can be seen that our algorithm performs better, particularly in the non-flat regions such as the neck and the tail of the bird. Even so, for a fair comparison, we adopt some measurement parameters widely used but not relevant to the objective function: 1) The distribution of interior angles of triangles. The angle distribution shows the global optimal degree of triangles. The closer the interior angles are to 60◦ , the more similar the triangles are to equilateral ones; 2) Average distortion metric [27]: the distortion metric of a triangle is defined as its area divided by the √sum of the squares of the lengths of its edges and then normalised by a factor 2 3. The value of distortion metric is in [0,1]. The higher the average distortion metric value, the higher the quality of
Fig. 12. Different performance measures of integration algorithms. Left: distortion metric. Right: computational time.
MRF Labeling for Multi-view Range Image Integration
39
a surface; 3) The computational time; Fig. 11 and 12 show that the new method performs better in the sense of the distribution of interior angles of triangles, the distortion metric and the computational time. All experiments were done on a Pentium IV 2.40 GHz computer. Additionally, because there is no specific segmentation scheme involved, our method based on MRF labeling saves computational time compared with the techniques using some segmentation algorithms as a preprocessing before the integration [10].
7
Conclusion and Future Work
Clustering-based methods proved superior to other existing methods for integrating multi-view range images. It has, however, been shown that classical clustering methods lead to significant misclassification in non-flat areas as the local surface topology are neglected. In this paper, we develop a MRF model to describe the integration problem and solve it by LBP, producing better results as local surface details are well preserved. Since the novel MRF-based method considers all registered input images for each point belonging to the network Inet , the integration is globally optimised and robust to noise and registration errors. The reconstructed surface is thus geometrically realistic. Also, it is applicable to many other data sources such as 3D unstructured point clouds. However, a couple of works can still be done to improve the integration in the future. As we mentioned above, the Potts model used in this paper is merely the simplest one. So we plan to explore different types of MRF models. We also hope to reduce the accumulated registration errors. One pairwise registration can produce one equation subject to the transform matrix. If we have more locally registered pairs than the total number of images in the sequence, the system of equations will be overdetermined. Solving this linear system of equations in a least squares sense will produce a set of global registrations with minimal deviation from the set of calculated pairwise registrations. In the future work, we will develop a scheme that can make a good balance between the extra cost caused by more pairwise registrations and the reduction of error accumulation. Acknowledgments. Ran Song is supported by HEFCW/WAG on the RIVIC project. This support is gratefully acknowledged.
References 1. Bernardini, F., Mittleman, J., Rushmeier, H., Silva, C., Taubin, G.: The ballpivoting algorithm for surface reconstruction. IEEE Trans. Visual. Comput. Graph. 5, 349–359 (1999) 2. Katulakos, K., Seitz, S.: A theory of shape by space carving. Int. J. Comput. Vision 38, 199–218 (2000) 3. Dey, T., Goswami, S.: Tight cocone: a watertight surface reconstructor. J. Comput. Inf. Sci. Eng. 13, 302–307 (2003) 4. Mederos, B., Velho, L., de Figueiredo, L.: Smooth surface reconstruction from noisy clouds. J. Braz. Comput. Soc. 9, 52–66 (2004) 5. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proc. SIGGRAPH, pp. 303–312 (1996)
40
R. Song et al.
6. Dorai, C., Wang, G.: Registration and integration of multiple object views for 3d model construction. IEEE Trans. Pattern Anal. Mach. Intell. 20, 83–89 (1998) 7. Rusinkiewicz, S., Hall-Holt, O., Levoy, M.: Real-time 3d model acquisition. In: Proc. SIGGRAPH, pp. 438–446 (2002) 8. Sagawa, R., Nishino, K., Ikeuchi, K.: Adaptively merging large-scale range data with reflectance properties. IEEE Trans. Pattern Anal. Mach. Intell. 27, 392–405 (2005) 9. Zhou, H., Liu, Y.: Accurate integration of multi-view range images using k-means clustering. Pattern Recognition 41, 152–175 (2008) 10. Zhou, H., Liu, Y., Li, L., Wei, B.: A clustering approach to free form surface reconstruction from multi-view range images. Image and Vision Computing 27, 725–747 (2009) 11. Hilton, A.: On reliable surface reconstruction from multiple range images. Technical report (1995) 12. Turk, G., Levoy, M.: Zippered polygon meshes from range images. In: Proc. SIGGRAPH, pp. 311–318 (1994) 13. Rutishauser, M., Stricker, M., Trobina, M.: Merging range images of arbitrarily shaped objects. In: Proc. CVPR, pp. 573–580 (1994) 14. Sun, Y., Paik, J., Koschan, A., Abidi, M.: Surface modeling using multi-view range and color images. Int. J. Comput. Aided Eng. 10, 137–150 (2003) 15. Li, X., Wee, W.: Range image fusion for object reconstruction and modelling. In: Proc. First Canadian Conference on Computer and Robot Vision, pp. 306–314 (2004) 16. Zhou, H., Liu, Y.: Incremental point-based integration of registered multiple range images. In: Proc. IECON, pp. 468–473 (2005) 17. Wang, X., Wang, H.: Markov random field modelled range image segmentation. Pattern Recognition Letters 25, 367–375 (2004) 18. Suliga, M., Deklerck, R., Nyssen, E.: Markov random field-based clustering applied to the segmentation of masses in digital mammograms. Computerized Medical Imaging and Graphics 32, 502–512 (2008) 19. Felzenszwalb, P., Huttenlocher, D.: Efficient belief propagation for early vision. International Journal of Computer Vision 70, 41–54 (2006) 20. Diebel, J., Thrun, S.: An application of markov random fields to range sensing. In: Proc. NIPS (2005) 21. Liu, Y.: Automatic 3d free form shape matching using the graduated assignment algorithm. Pattern Recognition 38, 1615–1631 (2005) 22. Li, S.: Markov random field modelling in computer vision, 2nd edn. Springer, Heidelberg (1995) 23. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 1222–1239 (2001) 24. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for markov random fields with smoothness-based priors. Pattern Analysis and Machine Intelligence. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1068–1080 (2008) 25. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26, 147–159 (2004) 26. Yedidia, J., Freeman, W., Weiss, Y.: Understanding belief propagation and its generalizations. Techincal Report TR-2001-22, MERL (2002) 27. Lee, C., Lo, S.: A new scheme for the generation of a graded quadrilateral mesh. Computers and Structures 52, 847–857 (1994)
Wave Interference for Pattern Description Selen Atasoy1,2 , Diana Mateus2 , Andreas Georgiou1 , Nassir Navab2 , and Guang-Zhong Yang1 1
Visual Information Processing Group, Imperial College London, UK {catasoy,a.georgiou,g.z.yang}@imperial.ac.uk 2 Computer Aided Medical Procedures and Augmented Reality, Technical University of Munich, Germany {atasoy,mateus,navab}@in.tum.de
Abstract. This paper presents a novel compact description of a pattern based on the interference of circular waves. The proposed approach, called “interference description”, leads to a representation of the pattern, where the spatial relations of its constituent parts are intrinsically taken into account. Due to the intrinsic characteristics of the interference phenomenon, this description includes more information than a simple sum of individual parts. Therefore it is suitable for representing the interrelations of different pattern components. We illustrate that the proposed description satisfies some of the key Gestalt properties of human perception such as invariance, emergence and reification, which are also desirable for efficient pattern description. We further present a method for matching the proposed interference descriptions of different patterns. In a series of experiments, we demonstrate the effectiveness of our description for several computer vision tasks such as pattern recognition, shape matching and retrieval.
1
Introduction
Many tasks in computer vision require describing a structure by the contextual relations of its consitituents. Recognition of patterns where the dominant structure is due to global layout rather than its individual texture elements such as the triangle images in Figure 1a-c and peace symbols in Figure 10a, patterns with missing contours such as the Kanizsa triangle in Figure 1d and Figure 5b and patterns with large homogeneous regions such as the shape images in Figure 9, necessitate the description of contextual relations within a structure in an efficient yet distinct manner. The captured information has to be discriminative enough to distinguish between structures with small but contextually important differences and be robust enough to missing information. In this paper, we introduce a method based on wave interference that efficiently describes the contextual relations between the constituent parts of a structure. Interference of several waves that originate from different parts of a structure leads to a new wave profile, namely the interference pattern (Figure 2). This interference profile is computed as the superposition of all constituent circular waves and has two relevant properties for describing a structure. Firstly, the R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 41–54, 2011. Springer-Verlag Berlin Heidelberg 2011
42
S. Atasoy et al.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(j)
Fig. 1. Interference patterns created from input images (shown in the first row) with several geometric transformations (a-c) and missing contours (d) leads to the emergence of similar patterns (shown in the second row), whereas shapes with only very small variations yield very distinct interference patterns (e-j)
interference profile depends on the relative spatial arrangement of the sources that create the constituent waves. Secondly, it intrinsically captures all relationships of the constituent waves. Thus, interference phenomenon provides a natural mechanism to describe the structure of a pattern by representing the interrelations of its components. The resulting interference patterns are discriminative enough to separate different structures with only very small variation such as the examples shown in Figure 1e-j but not sensitive to a loss of information due to missing contours or to geometric transformations as illustrated in Figure 1a-d. Recognition of patterns with missing contours is an important property of human perception which has been extensively studied by Gestalt psychology [1]. This constructive property, known as “reification”, is demonstrated in Figures 1d and 5b, where a triangle is perceived although it is not explicitly delineated. This is due to the spatial configuration of the constituent contours. Gestalt theory also points out two other relevant properties of human vision, i.e., invariance and emergence. “Invariance” analyses how objects are recognized invariant of their location, orientation and illumination. It is also an active area of research in computer vision. “Emergence” property states that a pattern is not recognized by first detecting its parts and combining them to a meaningful whole, but rather emerges as a complete object ; i.e. “the whole is more than the sum of its parts”. In fact, the interference phenomenon intrinsically exhibits the emergence property. The interference pattern is produced by the superposition of two or more waves. This leads to the occurrence of new structures which are not present in any of the constituent waves nor in their simple sum but are created by the interrelations of the parts as shown in Figure 2. The contribution of our work presented in this paper is twofold: firstly, we introduce the notion of wave interference for pattern description. To this end, we create interference patterns from an input image. This leads to an emergent description of the structure in the input pattern by representing the interrelations of its constituent parts. We demonstrate that our method, called ”interference description” (ID), exhibits some of the key Gestalt properties including emergence, reification and invariance. Secondly, we use these interference patterns to
Wave Interference for Pattern Description
43
create compact descriptor vectors and show in a series of experiments how the interference patterns can be applied to several computer vision tasks such as pattern recognition, shape matching and retrieval.
2
Related Work
In general, pattern description techniques can be analyzed in three main categories. Local methods extract and describe discriminative local information such as features and patches. The descriptors are then used for object matching or pattern recognition without using their contextual relations (See [2] and references inside). Global methods, such as descriptions based on the spectrum of Laplacian [3, 4] or finite element modal [5], capture distinctive information using the whole content of the pattern. In a third category, contextual methods, the local information is analyzed within the global context. To this end, relevant local information of the parts is combined with the contextual relationships between them. For instance, Sechtman and Irani create a descriptor using patches with a certain degree of self-similarities [6], Belongie et al. compute the shape context descriptors using distance histograms for each point. These descriptors are then usually combined with a suitable matching method such as star-graph [6] or hyper-graph matching [7], or a nearest neighbour matching after aligning two shapes [8]. One of the major challenges for modelling the contextual relations in the image domain is the computational complexity which increases exponentially with the order of relationships considered. Therefore, most methods include relations up to second order (pair-wise) [6, 8] or third order [7]. Furthermore, for applications such as retrieval, shape comparison or pattern recognition, a more compact description of the whole structure is desired as the explicit computation of feature correspondences is actually not required. Another form of global description methods, the frequency analysis techniques, on the other hand, allows the consideration of more complex relations of a pattern by encoding all global information. Fourier descriptors [9], for example, project the shape contour onto a set of orthogonal basis functions. Wavelet based methods such as [10] analyze the response to wave functions that are more localized in space and frequency. Spectral analysis using spherical harmonics [11] and eigenfunctions of the Laplace-Beltrami operator (vibration modes) has been successfully applied also for representation and matching of images [4] and 3D shapes [3]. In this paper, we introduce the notion of wave interference in order to create a compact pattern description. The proposed method intrinsically captures norder relations1 without increasing the computational complexity. The novelty of our method lies in the fact that it analyses the interference, and thus the interaction, of circular waves over a range of frequencies, whereas other frequency techniques ([3–5, 9–11]) are based on the reponses at different frequencies that do not interact with each other. 1
n is the number of all individual parts modelled as sources in our method.
44
3
S. Atasoy et al.
Properties of Interference Description
In this section we introduce and discuss some of the key properties of ID which are relevant for pattern description. Modelling Contextual Relations: In ID, contextual relations emerge automatically due to the relation of individual circular waves. This has two important benefits: Firstly, it eliminates the need for explicit modelling of any order relations or for defining a set of rules. Secondly, all contextual relations of a pattern are considered without increasing the computational cost. The interference is computed by a simple addition of complex wave functions, where the computation of individual wave functions are independent. Therefore, including a new part to the pattern is simply done by adding its wave function to the sum, whereas in other contextual methods, all descriptors need to be re-computed in order to consider their relations to the newly added part. Furthermore, as the contextual relations are already included in the ID, comparison of descriptors for retrieval or recognition can be performed by simple nearest neighbor matching. This avoids the requirement of more sophisticated methods such as graph matching or non-linear optimization. Specificity and Precision: In pattern description there exists a natural tradeoff between the specificity and precision of the method. ID allows to regulate this trade-off by the choice of the number of frequencies. A structure is described by a set of interference patterns computed over a range of frequencies (explained in Section 4.3), where the higher the frequency, the more specific the description becomes. Therefore the number of frequencies, which is the only parameter of the proposed method, allows adjusting the specificity of the description depending on the application. Verification of Gestalt Properties: Gestalt psychology has played a crucial role in understanding human visual perception. It also has been an inspiring model for several computer vision applications such as analyzing shapes [12], learning object boundaries [13], defining mid-level image features [14] and modeling illusionary contour formation [15]. An extensive study of the Gestalt principles and their application in computer vision is presented in [1, 16]. In this work, we point out the analogy between the interference phenomenon and three important Gestalt principles, namely emergence, reification and invariance. In Section 5.1, we demonstrate that due to the particular nature of interference, our method intrinsically satisfies the mentioned Gestalt properties and eliminates the need for a set of defined rules or learning using a training dataset.
4
Methods
In this Section, we first introduce the background for waves and interference phenomenon. Then we explain our approach for pattern description, where an input pattern is first transformed into a field of sources; each creating a circular
Wave Interference for Pattern Description
(a)
(b)
45
(c)
Fig. 2. a) Demonstration of the emergent nature of interference. Each source in the stimulus field creates a (attenuating) circular wave on the medium resulting in the final interference pattern. b) Sum of the individual wave patterns. c) Interference of individual wave patterns. The interference phenomena intrinsically exhibits the emergence property, as the whole (the interference pattern) shown in c) is different than the sum of its constituent wave profiles shown in b).
wave; and resulting jointly in an interference pattern. Finally, we propose a method to perform comparison of the created interference patterns. 4.1
Background
In its most basic form, a wave can be defined as a pattern (or distribution) of disturbance that propagates through a medium as time evolves. Harmonic waves are waves with the simplest wave profile, namely sine or cosine functions. Due to their special waveform they create wave patterns which are not only periodic in time but also in space. Thus, on a 2D medium, harmonic waves give rise to circular wave patterns such as the ones observed on a liquid surface when a particle in dropped into the liquid. Using complex representation, a circular wave is described by the following wave function of any point x on the medium and a time instance t: ψφ (A0 , ω; x, t) = A0 · ei(ω||x−φ||) · eiωt ,
(1)
where i is the imaginary unit, ω is the frequency of the wave and φ is the location of the source (stimulus). A0 denotes the intensity of the source and ||x− φ|| gives the distance of each point x on the medium to the location of the source. As the time dependent term eiωt is not a function of space (x), it does not affect the interference of the waves. Therefore, in the rest of the paper it will be taken as t = 0. So, the instantaneous amplitude of the disturbance at each point on the medium is given by the real part of the wave function Real [ψφ (A0 , ω; x)]. For simplicity, we will denote waves with a fixed frequency ω and a fixed amplitude A0 as ψφ (x) := ψφ (A0 , ω; x). As waves propagate outwards from the source, the amplitude of the wave will gradually decrease with the increased distance from the source due to friction. This effect, called attenuation, can be described by multiplying the amplitude of the wave with the attenuation profile σ. When two or more waves ψφi are created on the same medium at the same time from several source locations φi , the resultant disturbance Ψ at any point x on the medium is the algebraic sum of the amplitudes of all constituents. Using
46
S. Atasoy et al.
complex represetation, the interference pattern Ψ (x) is given by the amplitude of the sum of individual complex wave functions ψφi (x): n (2) Ψ (x) = ψφi (x) , i=1
where | · | denotes the absolute value of the complex number. Note that the absolute value of the sum of a set of complex numbers is different than the sum of their real parts (See Figure 2). This is known as the superposition principle of the waves. The superposition of waves with fixed frequencies yields to a special phenomenon called interference. In this special case, if the waves reaching a point on the medium are in-phase (aligned in their ascent and descents), they will amplify each other’s amplitudes (constructive interference); conversely if they are out-of-phase, they will diminish each other at that point (destructive interference). This results in a new wave profile called interference pattern. Figure 2a illustrates the interference pattern created by the superposition of 3 waves with the same frequency. The interference pattern is computed by the absolute value of the sum of complex wave functions. The complex addition results in a waveform which is not present in any of the constituting waves nor in the simple sum of the contituent wave amplitudes (Figure 2b) but occurs due to their combination (Figure 2c). Therefore, interference property of waves provides an emergent mechanism for describing the relations between the sources and for propagating local information. 4.2
Interference Description
Using the introduced definitions in Section 4.1, our method formulates the contextual relations of the parts composing the image in an emergent manner. Let Ω ⊂ R2 be the medium on which the waves are created, where x ∈ Ω denotes a point on Ω. In practice the medium is described with a grid discretizing the space. The function I : Ω → R, called stimulus field, assigns a source intensity A0 = I(φ) to the positions φ on the medium Ω. This function contains the local information of the pattern which is propagated over the whole medium. Depending on the application, it can be chosen to be a particular cue defined on the image domain such as gradient magnitude, extracted edges or features. Each value I(φ) of the stimulus field I induces a circular wave ψφ (I(φ), ω; x) on the medium. We use the frequency of the wave to describe the relation between the distribution of sources and the scale at which one wants to describe the pattern. This is discussed in detail in Section 4.3. The profile of the created circular wave is described as: ψφ (I(φ), ω; x) = I(φ) · ei(ω(||x−φ||) · σ ,
(3)
where σ is the attenuation profile. We define the attenuation profile of the wave induced by the source φ with frequency ω as: σ(φ, ω; x) = ω · e−ω·||x−φ|| ,
(4)
Wave Interference for Pattern Description
(a)
47
(b)
Fig. 3. Wave patterns for different source locations with the 3rd frequency (k = 3) shown in a) and 8th frequency (k = 8) shown in b)
where ||x − φ|| denotes the Euclidean distance between a point x and the source location φ. This attenuation profile allows us to conserve the total energy of the waves while changing their frequency. Note that the attenuation profile is a function of the distance to the source (||x − φ||) as well as the frequency ω. The higher the frequency, the sharper is the decay of the attenuation profile. This yields to a localized effect for high frequencies and a more spread effect for low frequencies. Once the location and intensities of the sources on the stimulus field are determined, the interference pattern Ψ (I, ω; x) on the medium for a fixed frequency ω is computed simply by the superposition of all wave patterns: (5) ψφi (I(φi ), ω; x) . Ψ (I, ω; x) = φi ∈Ω 4.3
Multi-frequency Analysis
Computing the interference patterns for several frequencies allows for the analysis of the stimulus field at different scales. As discussed in Section 4.2, high frequency waves are more localized and propagate the content in a smaller neighbourhood, whereas low frequency waves are more spread over a larger area. Figure 3 illustrates the circular wave patterns for several frequencies and source locations. Formally, we define the interference description (ID) Θ(I) of an input I as the set of interference patterns Ψ (I, ωi ) for a range of frequencies {ω1 , ω2 , ..., ωn }: Θ(I) = {Ψ (I, ω1 ), Ψ (I, ω2 ), ..., Ψ (I, ωn )} ,
(6)
i with ωi = 2π·k P , where ki ∈ N is the wave-number, i.e. the number of periods the wave has on the medium and P is the length of the grid.
4.4
Descriptor Comparison
In this section, we present a method to compare two IDs, Θ(I1 ) and Θ(I2 ) in a rotation invariant manner. To this end, we first define a coherency measure c of an interference pattern Ψ (x):
48
S. Atasoy et al.
|Ψ (x)| φi ∈Ω |ψφi (x)|
c(Ψ (x)) =
.
(7)
The coherency measures the power of the interference pattern compared to the sum of the powers of the constituent waves. As φi ∈Ω |ψφi (x)| provides the upper limit to the power of the interference amplitude |Ψ (x)| for each location x on the medium, the coherency measure c normalizes the values of the interference profile to the interval [0, 1] (Figure 4c-d). This enables the comparison of different IDs independent of their absolute values and therefore makes the comparison independent of the number of sources. We subdivide this coherency image c(Ψ (x)) into m + 1 levelsets {γ0 , γ1 , · · · , γj , · · · , γm }
γj = {x|c(Ψ (x)) = j/m}
(8)
and take the sum of the coherency values for each interval between the two T consecutive levelsets. This yields an m-dimensional measure h = [h1, · · · , hm ] , hj = γj−1 ≤x Th H(xf , yf )
(2)
where Th is a threshold (usually set to 0.3). An example of the learned geometrical model is shown in Fig. 2 (c), which is nearly a plane. The surface shape of the geometrical model depends on the the number and locations of vanishing points of scene structures, and a plane surface usually means one-point perspective. One advantage of the proposed geometrical model is that it can incrementally learn scene geometrical information without camera calibration.
Fig. 2. An example of online learned geometrical model for excluding detection outliers that violate perspective projection. (a) Detection results. Object 2 (in red) is an outlier. (b) Definitions of foot point and object height. (c) Learned surface of object height vs object’s foot point. Blue points are online collected training samples.
Data Association. After object detection and filtering the outliers, detection results should be associated with trajectories of old targets. First, a similarity matrix between trajectory-detection pairs is computed. The similarity score Simda between a trajectory tr and a detection d is computed as a product of appearance similarity and distance similarity:
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues Simda (tr, d) = Simc (tr, d) · pN (|d − tr|)
71 (3)
2 N (|d − tr|; 0, σdet )
where pN (|d − tr|) ∼ is a Gaussian function used to measure the distance similarity between tr and d and Simc (tr, d) is the output (scaled to (0, 1) by a sigmoid function) of an online Boosting classifier used to measure the appearance similarity. Then, a simple matching algorithm is conducted: the pair (tr∗ , d∗ ) with the maximum score is iteratively selected, and the rows and columns belonging to trajectory tr and detection d in the similarity matrix are deleted. Note that only the association with a matching score above a threshold is considered as a valid match.
Fig. 3. Comparison of appearance similarity matrices. (a) trajectories and detections. (b) obtained by feature template matching. (c) obtained by Boosting classifiers. In (b) and (c), yellow block means the maximum of a column and red means the second maximum.
Online Boosting Classifiers. Instance-specific Boosting classifiers are used in data association, as well as the observation model for tracking. Our Boosting classifier is similar to [20] and is trained online on one target against all others. Patches used as positive training examples are sampled from the bounding box of the optimal state given by the tracking process. The negative training set is sampled from all other targets, augmented by background patches. Fig. 3 shows an comparison of appearance similarity matrices obtained by feature template matching (see Section 4.2) and Boosting classifiers. As shown in Fig. 3 (b) and (c), the discriminability of online Boosting is much more powerful than feature template matching, because the average ratio of column maximum to second maximum of online Boosting is 14.5, while that of feature template matching is only 1.3.
4
Tracking by Particle Filtering
To decompose the full joint-state optimization problem in multi-target tracking into a set of “easy-to-handle” low-dimensional optimization problems, we adopt a “divide-and-conquer” strategy. Targets are clustered into two categories: separated-target and target-group. Separated-target is tracked by a single-state particle filter and target-group is tracked by a joint-state particle filter. A set of targets G = {tri }ni=1 with n ≥ 2 is considered as a target-group if: Clustering Rule: ∀ tri G, ∃ trj G(j = i), s.t. |Si Sj |/min(|Si |, |Sj |) > Tc
72
M. Li et al.
Fig. 4. An example of training and using of the full-body pose estimator. (a) the training process. (b) pose-tuning process. S0 (red) is a random particle, S3 (yellow) is the tuned particle with the maximum similarity, others are intermediate states (black) (c) the curve of similarity vs iteration# for (b).
where Si is the optimal state of target tri at frame #t-1 or its prediction state at frame #t using only previous velocity, | · | is the area of a state, and Tc is a threshold (usually set to 0.4). This rule makes target pairs that have or are going to have significant overlapping areas be in the same group. Targets that do not belong to any group are separated targets. 4.1
Global Pose Estimators
Sampling-based stochastic optimization methods usually need to sample a lot of particles, the number of which may grow exponentially with the increase of state-dimension. However, most particles have almost no contribution to state estimation because of their extremely small weights, while they consume most of the computational resources. To tremendously decrease the unnecessary computational cost, ISPF (incremental Self-tuning Particle Filtering)[16] uses an online learned pose estimator to guide random particles to move towards the correct directions, thus every random particle becomes intelligent and dense sampling is unnecessary. Inspired by [16], we also use pose estimators in multi-target tracking. Different from one pose estimator per target in [16], we train two global pose estimators online for all targets. One is ff b trained with full-body human samples, the other is fhs trained with only the head-shoulder parts of human samples. fhs is used when most parts of a target is occluded by other targets except the head-should part. Fig. 4 shows an example of how to train and use a pose estimator. As shown in Fig. 4 (a), we do state perturbation dS around the optimal state S of each target to get training samples, then HOG (Histogram of Oriented Gradients) [21] like features F are extracted and LWPR is used to learn the regression function dS = f (F). By using the regression function, as shown in
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues
73
Fig. 4 (b) and (c), given an initial random state around a target, we can guide it move to its best state with the maximum similarity in several iterations (each iteration includes two main steps: ΔSi = f (F(Si )) and Si+1 = Si − ΔSi , see [16] for details). 4.2
Observation Model
To make appearance model robust to partial occlusions, we use a part-based representation for each target, shown in Fig. 5 (a) p1 ∼ p7 . As shown in Fig. 5 (a), each part is represented by a L2-normalized HOG histogram and a L2normalized IRG (Intensity/Red/Green) color histogram: Fj ={Fjhog , Fjirg }. The final representation is the concatenation of all features F = {Fj }m j=1 , where m is the number of all parts. Note that feature F is also the input of online boosting classifiers, and it is not the same F used in pose estimator.
Fig. 5. Part based object representation and calculation of the visibility vector. (a) Target is divided into m parts (denoted by {pj }m j=1 , here m=7), and each part is represented by a HOG histogram and an IRG (Intensity/Red/Green) histogram denoted as Fj ={Fjhog , Fjirg }. (b) An example of calculation of visibility vector. Elliptical regions are the valid areas of targets. Knowing the configuration of states {Si } and their occlusion relations {π ij }, the visibility vector of S1 can be easily obtained: (1,1,1,1,0,1,1).
Part based Feature Template Matching. Only using the online trained Boosting classifiers for appearance model may suffer from occlusions, thus a partbased template matching method is introduced to our appearance model. Given the state configuration X = {S, π } of a target-group (subscript t is dropped), where S={Si }ni=1 is the joint state and π = {π ij }1≤i<j≤n is the occlusion matrix, an m-dimensional visibility vector V =(V1 , ..., Vm ) for each target can be calculated (1: visible, 0: invisible). In our work, a part is invisible if the ratio of its area covered by other targets exceeds some threshold (usually set to 0.5). Fig. 5 (b) shows an example of how to calculate the visibility vector. For separated target, all parts are visible. The template similarity score of a state S (subscript is dropped) to a target tr is defined as: m SimT (tr, S) = 1 −
Fj − Fj · Vj 2· m j=1 Vj
j=1
(4)
74
M. Li et al.
m where F = {Fj }m j=1 is the feature vector specified by S, F = {Fj }j=1 is the reference template of target tr, and V =(V1 , ..., Vm ) is the visibility vector. If m j=1 Vj =0, which means target tr is totally occluded by other targets, SimT (tr, S) is set to the average template matching score of background patches. Multi-Source Similarity Fusion Given a random state S, its appearance similarity Sima to a target tr is defined as a weighted combination of multiple sources: Sima (tr, S) = α1 · pN (|d∗ − S|) + α2 · SimT (tr, S) + α3 · Simc (tr, S)
(5)
where d∗ is the associated detection of target tr and {αi }3i=1 are the weights of different sources with α1 + α2 + α3 =1 (α1 is set to 0 if there is no associated detection). The first term depicts how close S is to the associated detection d∗ , the second term is the similarity obtained by template matching, and the third term is the similarity obtained by the Boosting classifier of target tr. Usually we set the weights α1 , α2 , α3 to 0.2, 0.4, 0.4 respectively. The final observation model of a target group G = {tri }ni=1 with state configuration X (subscript t is dropped) can be defined as: P (o|X) =
n
P (oj |{Si }n i=1 , π ) =
j=1
4.3
n
Sima (trj , Sj )
(6)
j=1
Dynamical Model
In particle filtering framework, besides observation model P (ot |Xt ), another important part is the dynamical model P (Xt |Xt−1 ). As in [9], we also introduce interaction potentials to the dynamical model, which aims to penalize the case that two targets occupy the same position. Suppose G = {tri }ni=1 is a target group with joint state {Si } and occlusion matrix π , the dynamical model is defined as follows: P (Xt |Xt−1 ) =
n 1 πt) U(π ψi (Sit |Si(t−1) ) ψij (Sit , Sjt ) Zc i=1 1≤i<j≤n
(7)
π t ) is a uniform distribution with only where Zc is the normalizing factor, U(π two values: 0 and 1, ψi is a local motion prior modeled by a Gaussian N (Sit − Si(t−1) ; 0 , Σ s ), and ψij is the interaction potential between target-pair defined as: ψij = exp(−
i(t−1) − S j(t−1) 2 βS ) Sit − Sjt 2
(8)
i(t−1) and S j(t−1) are the optimal states of target tri and trj at frame where S #t-1, and β is a constant factor (usually set to 0.1). 4.4
Tracking Separated-Target
Separated target is tracked by the ISPF (Incremental Self-tuning Particle Filtering) [16] tracker. Its observation model and dynamical model can be easily
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues
75
derived, because they are special cases of those of target-group. The main steps of ISPF include: 1) particles are incrementally drawn; 2) each random particle is tuned to its neighborhood best state according some similarity feedback; 3) state optimization is terminated if the maximum similarity score of all tuned particle satisfies a TSD (target similarity distribution) modeled as an online Gaussian, or the maximum number of particles Ns is reached. However, there are some differences between the ISPF tracker in this paper and that in [16]. First, we train two global pose estimators for all target, not one for each. Second, the observation models are very different. We introduce online Boosting classifier to appearance model, meanwhile we do not use IPCA (incremental principle component analysis), but a time-varying feature template updated every frame with a constant learning rate to represent target appearance. This is because it is computationally expensive to maintain and update one subspace for each part of each target. Details of how ISPF work can be seen in [16]. 4.5
Tracking Target-Group
With the defined observation model (6) and dynamical model (7), the Bayesian posterior density (1) can be approximated by particle based density: π t )P (ot |Xt ) P (Xt |Ot ) ≈ c · U(π
ψij (Sit , Sjt )
Ng n k=1 i=1
1≤i<j≤n
(k)
ψi (Sit |Si(t−1) )
(9)
where c is the normalizing factor, Ng is the number of particles, and n is the number of targets in a target-group. The next problem is how to efficiently sample particles for this posterior which lies in a high dimensional state space. Markov chain Monte Carlo (MCMC) sampling [11] is an effective solution to this problem. MCMC methods work by defining a Markov Chain over the space of configurations X, such that the stationary distribution of the chain is equal to a target distribution P (X|O). The Metropolis-Hastings (MH) algorithm [22] is one way to simulate from such a chain. It works as follows: Algorithm 1. Metropolis Hastings Algorithm Start with a random particle X(1) , then iterate for k = 1, ..., N 1. Sample a new particle X from the proposal density Q(X ; X(k) ); 2. Calculate the acceptance ratio: a = min{
P (X |O)Q(X(k) ; X ) , 1} P (X(k) |O)Q(X ; X(k) )
(10)
3. Accept X and set X(k+1) =X with probability a, or set X(k+1) =X(k) . The proposal density used here is a product of a uniform distribution for occlusion varibles and the joint local motion prior centered on previous optimal states: (k)
π t ) Q(Xt ; Xt ) = Q(Xt ) ∼ U(π
ˆ i(t−1) ). ψi (Sit |S
(11)
i
Besides using MCMC-sampling to efficiently sample particles for posterior density, we also utilize global pose estimators to slightly refine the random sampled
76
M. Li et al.
particles for each target if possible (at least the head-should part is visible), so as to make the joint-state particle “smart”. The main steps of our data-driven MCMC-based particle filtering are given in Algorithm 2 . Algorithm 2. Data-Driven MCMC-based Particle Filtering ˆ t−1 , propagate it with the 1. Start with previous optimal state configuration X local joint motion prior density, then iterate for k = 1, ..., Ng – (a) Sample a new particle X from the proposal density (11) – (b) Calculate the visibility vector V i for each target tri according to the joint states {Si } and current occlusion matrix π . : for each target tri , tune its pose {S } to {S } by a global – (c) Refine X to X i i pose estimator (the full-body pose estimator takes priority) if at least the head-should part is visible. – (d) Recalculate the visibility vector for each target and compute particle ) according to the observation model (6); weight w = P (o|X – (e) Calculate posterior density (9) and the acceptance ratio (10) with tuned state X and set {X(k+1) , w(k+1) }={X , w } with probability a, or reject – (f) Accept X t t (k+1) (k+1) (k) (k) , wt }={Xt , wt }. it and set {Xt ∗
∗
t = X(k ) with w(k ) = max{w(k) }. 2. state estimation: X t t t (k) (k) 3. Resample Ng particles from {Xt , wt } and set all weights to 1/Ng
5
Experimental Results
Three sequences are used to test the robustness and efficiency of our tracking algorithm. They are the NLPR-Campus (Fig. 6, our own dataset), PETS09 [23] (Fig. 9) and PETS04 [24] (Fig. 11) sequence. Some important parameters are set as follows: the maximum number of particles Ns for separated-target is set to 100, and the number of particle Ng for target-group is set to 200. Gaussian noise covariance matrix Σ S =diag{72, 72 , 0.022}. Without taking the detection time into consideration, the MATLAB implementation of our method can run at about 1 ∼ 0.4 frames/sec on a PC with a P4 2.4 GHz CPU. Robustness and Efficiency Evaluation. Two sequences (the NLPR-Campus and PETS09 sequence) are used to test the robustness of our algorithm. The PSOF detector [19] is used to detect objects in the two sequences with the detection rate of about 0.5 ∼ 0.7, and the online learned geometrical models decrease false alarms by about 10% in each sequence. Targets are only initialized on the border areas of an image except in the first frame, and their trajectories are terminated if they move out of image borders. In the NLPR-Campus sequence, as shown in Fig 6, six objects with three of them having interactions appear in the scene. At about frame #118, obj1 (means the object with ID 1) is occluded by obj2 slightly, but occlusion reasoning is not conducted because their valid overlapping area (the intersection area of two
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues
77
Fig. 6. Tracking results of our method in the NLPR-Campus Sequence
Fig. 7. Occlusion reasoning between object 1 and 2 in the NLPR-Campus Sequence. (a) Occlusion relationship. (b) Appearance similarity scores of object 1 and 2.
Fig. 8. Particles used in the NLPR-Campus seqence. (a) and (b): random particles (blue rectangles) and estimated optimal states (yellow rectangles). (c) The curve of average number of particles per object vs frame #.
ellipses) has not exceeded a threshold (0.4). At about frame #131, obj1 and obj2 form a target-group, and thus the MCMC-based particle filtering process with occlusion reasoning is activated. As shown in Fig. 7 (b), after occlusion reasoning, the appearance similarity score of obj1 increases, this is because our part-based template matching score becomes more accurate when occlusion relationship is correctly recovered (shown in Fig. 7 (a)). At about frame #138, obj1 is totally occluded by obj2 and this occlusion last for 50 frames. At about frame #189, obj1 passes by obj2 and obj3 and reappears, our algorithm successfully tracks
78
M. Li et al.
Fig. 9. Tracking results of our method in the PETS09 sequence
Fig. 10. Particles used in the PETS09 seqence. (a) and (b): random particles (blue rectangles) and estimated optimal states (yellow rectangles). (c) The curve of average number of particles per object vs frame #.
him. Although obj1 and obj3 wear similar clothes (white T-Shirt and black trousers), ID-switching does not happen, because the online-Boosting classifiers can choose discriminative features to distinguish them. Our algorithm can not only robustly track every object, but also work efficiently. As shown in Fig. 8 (c), except in the frames in which significant inter-occlusion happens (#131 ∼ #189), our algorithm uses an average number of less than 10 particles per object. The average number of particles per object per frame throughout the sequence is 21.4. In Fig. 8 (a) and (b), some random particles are shown. It can be seen that separated-targets (obj1/obj2/obj3 at #90; obj3/obj4/obj5/obj6 at #164) often consume a very small number of particles (due to the usage of global pose estimators), while large number of particles are used by target-group (obj1-obj2 at #164) in which joint-state optimization is conducted. The PETS09 sequence is more challenging, as shown in Fig. 9. There are not only frequent inter-object occlusions (obj3-obj1 at #25; obj2-obj1 at #104; obj8-obj2 at #104; obj8-obj3 and obj6-obj4 at #163; obj6-obj1 and obj11-obj10 at #347 ), but also frequent background-object occlusions caused by a signboard (obj3 at #104/#144, and obj10 at #270). Besides, obj1, obj2 and obj3 wear
Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues
79
Fig. 11. Tracking results of our method in the PETS04 sequence
very similar clothes and have frequent interactions. In this complex scene, our algorithm consistently robustly tracks all objects and maintains their IDs correctly. However, the average computational cost is very low. As shown in Fig. 10 (a) and (b), only those objects in a target-group (obj2-obj7 at #131) or occluded by the signboard (obj3 at #131) consume large number of particles, while others consume several to ten particles per object. The curve of average number of particles per object vs frame No. is shown in Fig. 10 (c), and the average number of particles per object per frame is 26.4. The robustness of our algorithm shown on this sequence relies on two main factors. The first factor is the online learned Boosting classifiers. They make the detection association process robust, as well as provide part of the similarity score in the observation model. The second is the MCMC-based joint-state optimization strategy utilized when object-pairs have significant overlapping areas. Because of using part-based appearance model and occlusion reasoning, the MCMC-based particle filtering process is very robust to partial occlusions. The efficiency of our method obviously benefits from the target clustering strategy and the online trained global pose estimators. Accuracy Evaluation. To evaluate the accuracy of our tracking algorithm, as shown in Fig. 11 the P ET S04 sequence is used, in which three objects are labeled and their groundtruth trajectory can be downloaded from [24]. We use the groundtruth object bounding boxes in the first frame to initialize targets, and no detector is used during tracking. The RMSE (root mean square error) between estimated position and groundtruth of our method is shown in Table. 1. It is about 3.9 pixels per object per frame. As comparison, the tracking accuracies of three other state-of-the-art multi-target tracking algorithms: Yang’s work [12], Qu’s work [10] and Zhang’s work [13] are also given in Table 1 (their results are reported in [13]). It can be seen that the tracking accuracy of our algorithm outperforms [12] (RMSE: 11.2 pixels) and [10] (RMSE: 6.3 pixels), and a little inferior to [13] (RMSE: 3.2 pixels). This is because in Zhang’s work [13], object’s appearance is modeled by an online learned subspace which is a little better than our template based representation. Table 1. Tracking accuracy comparison on the PETS04 sequence Algorithm Our method Yang’s work Qu’s work Zhang’s work RMSE per object per frame (pixels) 3.9 11.2 6.3 3.2
80
M. Li et al.
Discussion. From the above experimental results, we can obtain some useful findings: 1) Online learned Boosting classifiers plays an important role to distinguish targets with similar appearance. 2) Without camera calibration, the online learned geometrical model can effectively reduce false detections that violate perspective projection. 3) The online learned global pose estimators tremendously increase the tracking efficiency of our algorithm. 4) The MCMC-based joint-state optimization process can effectively deal with partial occlusions.
6
Conclusions
This paper proposes a Data-Driven Particle Filtering (DDPF) framework for multi-target tracking, in which online learned class-specific and instance-specific cues are used to make the tracking process robust as well as efficient. The learned cues include an online learned geometrical model for excluding detection outliers that violate geometrical constraints, global pose estimators shared by all targets for particle refinement, and online boosting based appearance models which select discriminative features to distinguish different individuals. Targets are clustered into two categories: separated-target and target group. Separatedtarget is tracked by an ISPF (Incremental Self-tuning Particle Filtering) method in low-dimensional state space and target-group is tracked by an MCMC-based particle filtering in joint-state space with occlusion reasoning. Experimental results demonstrate that our algorithm can automatically detect and track objects, effectively deal with partial occlusions, as well as possess high efficiency and stat e-of-the-art tracking accuracy. Acknowledgement. This work is supported by National Natural Science Foundation of China (Grant No. 60736018, 60723005), National Hi-Tech Research and Development Program of China (2009AA01Z318), National Science Founding (60605014, 60875021).
References 1. Reid, D.: An algorithm for tracking multiple targets. IEEE Trans. Automatic Control 24, 84–90 (1979) 2. Bar-Shalom, Y., Fortmann, T., Scheffe, M.: Joint probabilistic data association for multiple targets in clutter. In: Proc. Conf. Information Sciences and Systems (1980) 3. Cox, I., Leonard, J.: Modeling a dynamic environment using a bayesian multiple hypothesis approach. Artificial Intelligence 66, 311–344 (1994) 4. Rasmussen, C., Hager, G.: Probabilistic data association methods for tracking complex visual objects. IEEE Trans. on PAMI 23, 560–576 (2001) 5. Khan, Z., Balch, T., Dellaert, F.: Multitarget tracking with split and merged measurements. In: Proc. of CVPR 2005 (2005) 6. Isard, M., MacCormick, J.: Bramble: A bayesian multipleblob tracker. In: Proc. of ICCV 2001 (2001) 7. Okuma, K., Taleghani, A., de Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: Multitarget detection and tracking. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 28–39. Springer, Heidelberg (2004)
8. Vermaak, J., Doucet, A., Perez, P.: Maintaining multi-modality through mixture tracking. In: Proc. of ICCV 2003 (2003)
9. Yu, T., Wu, Y.: Decentralized multiple target tracking using netted collaborative autonomous trackers. In: Proc. of CVPR 2005 (2005)
10. Qu, W., Schonfeld, D., Mohamed, M.: Real-time interactively distributed multi-object tracking using a magnetic-inertia potential model. In: Proc. of ICCV 2005 (2005)
11. Khan, Z., Balch, T., Dellaert, F.: MCMC-based particle filtering for tracking a variable number of interacting targets. PAMI 27, 1805–1819 (2005)
12. Yang, M., Yu, T., Wu, Y.: Game-theoretic multiple target tracking. In: Proc. of ICCV 2007 (2007)
13. Zhang, X., Hu, W., Li, W., Qu, W., Maybank, S.: Multi-object tracking via species based particle swarm optimization. In: ICCV 2009 WS on Visual Surveillance (2009)
14. Hess, R., Fern, A.: Discriminatively trained particle filters for complex multi-object tracking. In: Proc. of CVPR 2009 (2009)
15. Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier, E., Gool, L.: Robust tracking-by-detection using a detector confidence particle filter. In: ICCV 2009 (2009)
16. Li, M., Chen, W., Huang, K., Tan, T.: Visual tracking via incremental self-tuning particle filtering on the affine group. In: Proc. of CVPR 2010 (2010)
17. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. on Signal Processing 50, 174–188 (2002)
18. Vijayakumar, S., Souza, A.D., Schaal, S.: Incremental online learning in high dimensions. Neural Computation 17, 2602–2634 (2005)
19. Li, M., Zhang, Z., Huang, K., Tan, T.: Pyramidal statistics of oriented filtering for robust pedestrian detection. In: Proc. of ICCV 2009 WS on Visual Surveillance (2009)
20. Grabner, H., Bischof, H.: On-line boosting and vision. In: Proc. of CVPR 2006 (2006)
21. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
22. Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
23. PETS09 dataset, http://www.cvg.rdg.ac.uk/PETS2009/a.html
24. PETS04 dataset, http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
Modeling Complex Scenes for Accurate Moving Objects Segmentation

Jianwei Ding, Min Li, Kaiqi Huang, and Tieniu Tan

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China
{jwding,mli,kqhuang,tnt}@nlpr.ia.ac.cn

Abstract. In video surveillance, it remains a difficult task to segment moving objects accurately in complex scenes, since the most widely used algorithms are based on background subtraction. We propose an online and unsupervised technique to find the optimal segmentation in a Markov Random Field (MRF) framework. To improve the accuracy, color, locality, temporal coherence and spatial consistency are fused together in the framework. The models of color, locality and temporal coherence are learned online from complex scenes. A novel mixture of a nonparametric regional model and a parametric pixel-wise model is proposed to approximate the background color distribution. The foreground color distribution for every pixel is learned from neighboring pixels of the previous frame. The locality distributions of background and foreground are approximated with the nonparametric model. The temporal coherence is modeled with a Markov chain. Experiments on challenging videos demonstrate the effectiveness of our algorithm.
1 Introduction
Moving object segmentation is fundamental and important for intelligent video surveillance. The accuracy of segmentation is vital to object tracking, recognition, activity analysis, etc. Traditional video surveillance systems use stationary cameras to monitor scenes of interest, and the usual approach for this kind of system is background subtraction. In many video surveillance applications, however, the background of the scene is not stationary: it may contain waving trees, fountains, ocean ripples, illumination changes, etc. It is a difficult task to design a robust background subtraction algorithm that can extract moving objects accurately from such complex scenes. Many background subtraction methods have been proposed to address this problem. Stauffer and Grimson [1] used the Gaussian mixture model (GMM) to account for the multimodality of the color distribution of every pixel. Because of its high efficiency and good accuracy under most conditions, the GMM has been widely used, and many researchers have proposed improvements and extensions of the basic GMM method, such as [2–4]. An implicit assumption of GMM-based methods, however, is that the temporal variation of a pixel can be modeled by one or multiple Gaussian distributions. Elgammal et al. proposed kernel density estimation (KDE) [5] to model the color distribution, which does not assume any
Fig. 1. Moving object segmentation in complex scenes using background subtraction. The first column shows the original images, the second column gives the corresponding segmentation using background subtraction.
underlying distribution. However, the kernel bandwidth is difficult to choose and the computation cost is high. Sheikh and Shah [6] incorporated the spatial cue together with temporal information into a single model and obtained accurate segmentation results in a maximum a posteriori-Markov Random Field (MAP-MRF) decision framework. Some other sophisticated techniques can be found in [7–11]. However, background subtraction has a drawback: thresholding is often used to separate foreground from background. Fig. 1 gives an example of moving object segmentation in complex scenes using background subtraction. Finding an optimal segmentation is an energy minimization problem in the MAP-MRF framework. Graph cuts [12] is one of the most popular energy minimization algorithms. Boykov et al. [13] proposed an interactive image and video segmentation technique using graph cuts and obtained good results. This technique was then extended to solve other foreground/background segmentation problems. In [14], layered graph cuts (LGC) were proposed to extract foreground from background layers in stereo video sequences. In [15], Liu et al. proposed a method which fuses color, locality and motion information; it can segment moving objects with limited motion, but works off-line. Sun et al. [16] proposed background cut, which combines background subtraction, color and contrast cues and can work in real time, but a static background is assumed in its initialization phase. In [17], multiple cues were probabilistically fused for bilayer segmentation of live video. In [18], the complex background was learned by a multi-scale discriminative model. However, training samples are needed in [17, 18]. We propose an online and unsupervised technique which learns complex scenes to find the optimal foreground/background segmentation in an MRF framework. Our method absorbs the advantages of foreground/background modeling and the graph cuts algorithm, but it works online and does not need prior information or training data. To improve the segmentation accuracy, color, locality, temporal coherence and spatial consistency are fused together in the framework. All these features are easy to obtain and calculate. We do not choose optical flow as in [15, 17] because reliable computation of optical flow is expensive and it suffers from the aperture problem. The models of color, locality and temporal coherence are learned online from complex scenes. Color is the dominant feature for discriminating foreground from background in our algorithm. To model the color distribution of the background image precisely and adaptively, a novel mixture
of a nonparametric regional model and a parametric pixel-wise model is presented to improve the robustness of the algorithm under different background changes. The foreground color model for every pixel in the current frame is learned from the neighboring foreground pixels in the previous frame. The locality distributions of background and foreground are also considered to reduce errors; they are approximated using a nonparametric model, and an additional spatially uniform distribution is also included. The label of each pixel at the current moment is not independent of that at the previous time. This temporal coherence of labels is modeled with a first-order Markov chain to increase accuracy, and a simple strategy to calculate the transition probabilities is also provided. The remainder of this paper is organized as follows. Section 2 describes the general model for moving object segmentation. Section 3 gives the modeling of complex scenes. Section 4 presents the experiments and results. Section 5 concludes our work.
2 General Model for Moving Object Segmentation
In this section, we will describe the general model for moving object segmentation in a MAP-MRF framework.
Fig. 2. A simple flowchart of the proposed algorithm
Moving object segmentation can be considered as a binary labeling problem. Let I be a new input image and P be the set of all pixels in I. The labeling problem is to assign a unique label l_p ∈ {foreground (= 1), background (= 0)} to each pixel p ∈ P. The maximum a posteriori estimation of L = {l_1, l_2, ..., l_{|P|}} for I can be obtained by minimizing the Gibbs energy E(L) [13]:

L^* = \arg\min_L E(L)    (1)
The basic energy model E(L) used in [13] only considered color and spatial continuity cues, which are not sufficient [14]. The locality cue of a pixel is an
important spatial constraint that can help to judge the label value. Besides, the label of a pixel at the current time is not independent of that at the previous time, so the cue of temporal coherence is also needed to increase the accuracy of the label decision. We extend the basic energy model and define E(L) as:

E(L) = \lambda_c \sum_{p \in P} E_c(l_p) + \lambda_s \sum_{p \in P} E_s(l_p) + \lambda_t \sum_{p \in P} E_t(l_p) + \sum_{p \in P} \sum_{q \in N_p} E_N(l_p, l_q)    (2)

where \lambda_c, \lambda_s and \lambda_t are factors that weight the color term, locality term and temporal coherence term respectively, and N_p is the 4-connected or 8-connected neighborhood of pixel p. E_c(l_p), E_s(l_p), E_t(l_p) and E_N(l_p, l_q) are described below.

Color term. E_c(l_p) is the color term of pixel p, denoting the cost of color vector c_p when the label is l_p. It is defined as the negative logarithm of the color likelihood:

E_c(l_p) = -\log Pr(c_p | l_p)    (3)

Locality term. E_s(l_p) is the locality term of pixel p, denoting the cost of locality vector s_p when the label is l_p. It is defined as the negative logarithm of the locality likelihood:

E_s(l_p) = -\log Pr(s_p | l_p)    (4)

Temporal coherence term. E_t(l_p) is the temporal coherence term of pixel p, denoting the cost of label l_p at the current time T when the label was l_p^{T-1} at the previous time T-1. We use a first-order Markov chain to model the temporal coherence and define E_t(l_p) as:

E_t(l_p) = -\log Pr(l_p^T | l_p^{T-1})    (5)

Spatial consistency term. Neighboring pixels are more likely to have similar colors and the same labels. E_N(l_p, l_q) conveys the spatial consistency between neighboring pixels p and q when their labels are l_p and l_q respectively. Similar to [13, 17], E_N(l_p, l_q) is expressed as an Ising term and defined as:

E_N(l_p, l_q) = \delta(l_p \neq l_q) \cdot \frac{\epsilon + \exp(-\beta d_{pq}^2)}{1 + \epsilon}    (6)

where \delta(l_p \neq l_q) is a Kronecker delta that is 1 when the labels differ and 0 otherwise, and the constant \epsilon is a positive parameter. \beta is also a parameter and is set to \beta = (2 \langle \|c_p - c_q\|^2 \rangle)^{-1}, where \langle \cdot \rangle denotes the expectation over color samples of neighboring pixels in an image and d_{pq} denotes the color difference between pixels p and q. To reduce the computation cost, \epsilon can be set to zero and \beta taken as a constant, as in [13]. A simpler version of the Ising term, which ignores the color difference, can be seen in [6]. Among the four energy terms, only the spatial consistency term can be calculated directly from the new input image. The color likelihood, the locality likelihood and the transition probabilities must be known first to obtain the other three terms; they are calculated by constructing the color model, the locality model and the temporal coherence model. The three models are learned from the complex scenes and updated online.
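The following is a minimal sketch of how the pairwise contrast term of Eq. (6), as reconstructed above, might be evaluated for horizontally neighboring pixels; the value of epsilon and the restriction to horizontal neighbors are assumptions made only for illustration.

```python
import numpy as np

def pairwise_contrast_cost(image, eps=1.0):
    """Penalty for assigning different labels to horizontal neighbours (Eq. 6).

    image: float array of shape (H, W, 3) in RGB.
    Returns an (H, W-1) array: the cost paid when pixel (y, x) and its
    right neighbour (y, x+1) receive different labels.
    """
    diff = image[:, 1:, :] - image[:, :-1, :]
    d2 = np.sum(diff ** 2, axis=-1)             # squared colour difference d_pq^2
    beta = 1.0 / (2.0 * np.mean(d2) + 1e-12)    # beta = (2 <||c_p - c_q||^2>)^-1
    return (eps + np.exp(-beta * d2)) / (1.0 + eps)
```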
At the beginning of the algorithm, an initial background model is constructed using the color information of the input images. The background model is updated quickly with high learning parameters until it is stable. After initialization, the flowchart of the algorithm is as shown in Fig. 2. At each time T, the first step is to model the scene and calculate the energy E(L). The second step is to minimize the energy E(L) and output the segmentation; to minimize E(L), we use the implementation of [19]. The third step is to update the scene model with slow learning parameters.
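A minimal skeleton of this per-frame loop is sketched below. The `scene_model` interface and the `minimize_energy` function are assumptions standing in for the scene models of Section 3 and the min-cut/max-flow solver of [19]; they are not real library calls or the authors' implementation.

```python
import numpy as np

def segment_stream(frames, scene_model, minimize_energy,
                   fast_rate=0.1, slow_rate=0.01, warmup=30):
    """Illustrative main loop: model the scene, cut the graph, update the model.

    scene_model is assumed to expose unary_energies(frame) -> (E_bg, E_fg)
    and update(frame, labels, rate). minimize_energy(frame, E_bg, E_fg)
    stands in for the graph-cut energy minimization of [19].
    """
    segmentations = []
    for t, frame in enumerate(frames):
        if t < warmup:
            # Initialization phase: adapt quickly, treat everything as background
            labels = np.zeros(frame.shape[:2], dtype=np.uint8)
            scene_model.update(frame, labels, rate=fast_rate)
            continue
        # Step 1: model the scene and compute the energy terms
        e_bg, e_fg = scene_model.unary_energies(frame)
        # Step 2: minimize E(L) to obtain the foreground/background labels
        labels = minimize_energy(frame, e_bg, e_fg)
        segmentations.append(labels)
        # Step 3: update the scene model with slow learning parameters
        scene_model.update(frame, labels, rate=slow_rate)
    return segmentations
```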
3 Modeling the Complex Scenes
The first and most important step is modeling the scene. As complex scenes may contain dynamic texture or various illumination changes, color, locality, temporal coherence and spatial consistency are fused together to increase the accuracy of segmentation. The spatial consistency model defined by Eq. (6) can be calculated directly on the input image, so we only have to construct the color model, the locality model and the temporal coherence model. They are described below. For simplicity, the update of the scene model is also included in this section.

3.1 Color Model
The color term is the dominant one among the four terms in Eq. (2). Different color features can be used in our algorithm, such as RGB, HSV, or Lab. To discriminate foreground from background effectively, we choose RGB color features without any color transformation. Let c_p = (R_p, G_p, B_p)^T be the color vector of each pixel p ∈ P. The modeling of the color distribution includes three parts: modeling the background color distribution, modeling the foreground color distribution, and updating the color model.

Modeling the background color distribution. In [6], background modeling methods are classified into two categories: methods that use a pixel-wise model and methods that use a regional model. The pixel-wise model is more precise than the regional model for measuring the probability of pixel p belonging to the background, but it is sensitive to spatial noise, illumination change and small movements of the background. To model the color likelihood of each pixel p belonging to the background, we define it as a mixture of a regional model and a pixel-wise model:

Pr(c_p | l_p = 0) = \alpha_c Pr_{region}(c_p | l_p = 0) + (1 - \alpha_c) Pr_{pixel}(c_p | l_p = 0)    (7)
where \alpha_c ∈ [0, 1] is a mixing factor, Pr_{region}(c_p | l_p = 0) is the probability of observing c_p in the neighboring region of pixel p, and Pr_{pixel}(c_p | l_p = 0) is the probability of observing c_p at pixel p. The local spatial and temporal color information is used coherently in Eq. (7). Fig. 3 gives a comparison of the three models. Our model can approximate the background color distribution precisely and is also robust to various kinds of spatial noise.
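A tiny sketch of how the mixture of Eq. (7) can be turned into the color energy of Eq. (3); the value of alpha_c and the array-based interface are illustrative assumptions.

```python
import numpy as np

def color_energy_background(pr_region, pr_pixel, alpha_c=0.5, eps=1e-12):
    """Colour term E_c(l_p = 0) built from the mixture in Eq. (7).

    pr_region, pr_pixel: per-pixel likelihood maps from the regional
    (nonparametric) and pixel-wise (GMM) background models.
    alpha_c is the mixing factor; 0.5 is only an example value.
    """
    pr_bg = alpha_c * pr_region + (1.0 - alpha_c) * pr_pixel   # Eq. (7)
    return -np.log(pr_bg + eps)                                # Eq. (3)
```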
Fig. 3. Comparison of the three models. The three columns are the color distributions of the pixel marked in red in the original image using the pixel-wise model, the regional model and the proposed mixture model, respectively.
The regional background color model can be learned from the local neighboring spatial-temporal volume V_p centered at pixel p with a nonparametric method:

Pr_{region}(c_p | l_p = 0) = \frac{1}{|V_p|} \sum_{c_m \in V_p} K_H(c_p - c_m)    (8)

where K_H(x) = |H|^{-1/2} K(H^{-1/2} x), K(x) is a d-variate kernel function and H is the bandwidth matrix. We select the commonly used Gaussian kernel, so

K_H(x) = |H|^{-1/2} (2\pi)^{-d/2} \exp(-\frac{1}{2} x^T H^{-1} x)    (9)

To reduce the computation cost, the bandwidth matrix H can be assumed to have the form H = h^2 I. The volume V_p is defined as V_p = {c_t(u, v) : |u - i| ≤ l, |v - j| ≤ l, T - T_d < t < T}, where c_t(u, v) is the color vector of the pixel at the u-th row and v-th column of the image taken at time t, and T_d is the period. Pixel p is located at the i-th row and j-th column of the current input image at time T, and l is a constant that determines the size of the local spatial region. Calculating Eq. (8) directly is computationally expensive because of the large sample size. Besides, it is hard to choose a proper kernel bandwidth H in high-dimensional spaces even after this simplification. We therefore use the binned kernel density estimator to approximate Eq. (8). The accuracy and effectiveness of the binned kernel density estimator have been proved in [20], and a similar strategy has also been used in [6, 7].
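For clarity, the sketch below evaluates the regional likelihood of Eq. (8) directly (unbinned) for a single pixel, with an isotropic bandwidth H = h^2 I; the bandwidth value and array layout are assumptions, and the paper's actual implementation uses the binned kernel density estimator [20] rather than this direct form.

```python
import numpy as np

def regional_background_likelihood(c_p, volume_samples, h=10.0):
    """Direct (unbinned) evaluation of Eq. (8) for one pixel.

    c_p: (3,) RGB colour of pixel p in the current frame.
    volume_samples: (M, 3) colours collected from the spatio-temporal
    volume V_p around p in previous frames.
    h: scalar kernel bandwidth, i.e. H = h^2 * I (value is illustrative).
    """
    d = c_p.shape[0]
    diff = volume_samples - c_p
    sq = np.sum(diff ** 2, axis=1) / (h ** 2)        # x^T H^-1 x
    kernel = (2.0 * np.pi) ** (-d / 2.0) / (h ** d) * np.exp(-0.5 * sq)
    return float(np.mean(kernel))                    # (1/|V_p|) sum of K_H
```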
We use a Gaussian mixture model as the pixel-wise model for every pixel p:

Pr_{pixel}(c_p | l_p = 0) = \sum_{k=1}^{K} \omega_k N(c_p | \mu_k, \Sigma_k)    (10)

where \omega_k, \mu_k and \Sigma_k are the weight, the mean color vector and the covariance matrix of the k-th component respectively, and K is the number of components. To simplify computation, the covariance matrix is assumed to be of the form \Sigma_k = \sigma_k^2 I.

Modeling the foreground color distribution. The color likelihood Pr(c_p | l_p = 1) of each pixel p belonging to the foreground at time T can be learned adaptively from the neighboring foreground pixels in the previous frame at time T-1. There is an implicit assumption here that a foreground pixel will remain in the neighboring area. Besides, at any time the probability of observing a foreground pixel of any color at pixel p is uniform. Thus, the foreground color likelihood can be expressed as a mixture of a uniform function and the kernel density function:

Pr(c_p | l_p = 1) = \beta_c \eta_c + (1 - \beta_c) \frac{1}{|U_p|} \sum_{c_m \in U_p} K_H(c_p - c_m)    (11)

where \beta_c ∈ [0, 1] is the mixing factor and \eta_c is a uniform density with \eta_c(c_p) = 1/256^3. The set of neighboring foreground pixels in the previous frame is defined as U_p = {c_t(u, v) : |u - i| ≤ w, |v - j| ≤ w, t = T - 1}, where w is a constant that determines the size of the neighboring region of pixel p, and pixel p is at the i-th row and j-th column of the current image. The kernel density function is the same as in Eq. (9).

Updating the color model. The background color model is updated adaptively and online with the new input image after segmentation. For the nonparametric regional background color model, a sliding window of length l_b is maintained. For the parametric pixel-wise model, a maintenance strategy similar to [1] is used. If the new observation c_p belongs to the k-th distribution, the parameters of the k-th distribution at time T are updated as follows:

\mu_{k,T} = (1 - \rho)\mu_{k,T-1} + \rho c_p    (12)

\sigma_{k,T}^2 = (1 - \rho)\sigma_{k,T-1}^2 + \rho (c_p - \mu_{k,T})^T (c_p - \mu_{k,T})    (13)

\omega_{k,T} = (1 - \rho)\omega_{k,T-1} + \rho M_{k,T}    (14)

where \rho is the learning rate, and M_{k,T} is 1 when c_p matches the k-th distribution and 0 otherwise.
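The online update of Eqs. (12)-(14) can be sketched as follows; the data layout, the in-place update and the learning-rate value are assumptions for illustration only.

```python
import numpy as np

def update_matched_component(mu, sigma2, weights, k, c_p, rho=0.01):
    """Online update of the matched GMM component (Eqs. 12-14).

    mu: (K, 3) component means, sigma2: (K,) isotropic variances,
    weights: (K,) mixing weights, k: index of the matched component,
    c_p: (3,) new colour observation, rho: learning rate (example value).
    Arrays are updated in place and also returned.
    """
    mu[k] = (1.0 - rho) * mu[k] + rho * c_p                         # Eq. (12)
    diff = c_p - mu[k]
    sigma2[k] = (1.0 - rho) * sigma2[k] + rho * float(diff @ diff)  # Eq. (13)
    match = np.zeros_like(weights)
    match[k] = 1.0                                                  # M_{k,T}
    weights[:] = (1.0 - rho) * weights + rho * match                # Eq. (14)
    return mu, sigma2, weights
```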
3.2 Locality Model
It is not sufficient to use color alone if the color of the foreground is similar to that of the background; taking the location of moving objects into account can reduce errors. The spatial location of pixel p is denoted as s_p = (x_p, y_p)^T. As the shape of objects can be arbitrary, we use a nonparametric model similar to
[15] to approximate the spatial distributions of background and foreground. The difference is that we also take into account a spatially uniform distribution for any pixel at any location. The locality distributions of background and foreground are defined as:

Pr(s_p | l_p = 0) = \alpha_s \eta_s + (1 - \alpha_s) \max_{m \in V_s} \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(s_p - s_m)^T (s_p - s_m)}{2\sigma^2}\right)    (15)

Pr(s_p | l_p = 1) = \alpha_s \eta_s + (1 - \alpha_s) \max_{m \in U_s} \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(s_p - s_m)^T (s_p - s_m)}{2\sigma^2}\right)    (16)

where \alpha_s ∈ [0, 1] is the mixing factor and \eta_s is a uniform density with \eta_s(s_p) = 1/(M × N), where M × N is the image size. V_s and U_s are the sets of all background pixels and all foreground pixels in the previous frame respectively, and \sigma is the kernel bandwidth. An example of the locality distributions of background and foreground is shown in Fig. 4. From this figure we can see that pixels near moving objects have a higher probability of belonging to the foreground, while pixels far away from moving objects have a higher probability of belonging to the background.
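Because the maximum over the Gaussian kernel in Eqs. (15)-(16) is attained at the nearest pixel of the corresponding set, the locality likelihoods can be computed from a Euclidean distance transform. The sketch below uses SciPy's distance transform; the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def locality_likelihoods(prev_fg_mask, alpha_s=0.1, sigma=5.0):
    """Locality likelihoods of Eqs. (15)-(16) from the previous segmentation.

    prev_fg_mask: boolean (H, W) foreground mask at time T-1.
    alpha_s and sigma are example values only.
    """
    h, w = prev_fg_mask.shape
    eta_s = 1.0 / (h * w)                      # uniform spatial component
    norm = 1.0 / (2.0 * np.pi * sigma ** 2)
    # distance of every pixel to the nearest background / foreground pixel
    d_to_bg = distance_transform_edt(prev_fg_mask)    # zeros = background set V_s
    d_to_fg = distance_transform_edt(~prev_fg_mask)   # zeros = foreground set U_s
    pr_bg = alpha_s * eta_s + (1 - alpha_s) * norm * np.exp(-d_to_bg ** 2 / (2 * sigma ** 2))
    pr_fg = alpha_s * eta_s + (1 - alpha_s) * norm * np.exp(-d_to_fg ** 2 / (2 * sigma ** 2))
    return pr_bg, pr_fg
```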
3.3 Temporal Coherence Model
For any pixel p ∈ P, if the label is foreground at time T-1, it is more likely for p to remain foreground than background at time T. So the label of pixel p at time T is not independent of labels at previous time. This is called the label temporal
Fig. 4. Locality distributions of background and foreground
Fig. 5. A simple example to show the temporal coherence of labels. The first image depicts the scene, the second image shows the foreground mask at time T-1, and the last image shows the foreground mask at time T. A small portion of pixels whose labels have changed are marked as red in the last image.
coherency. Fig. 5 shows an example of this coherency. It can be modeled using a Markov chain; here, a first-order Markov chain is adopted to convey the nature of the temporal coherence. There are four (2^2) possible transitions for the first-order Markov chain model, since there are only two candidate labels for pixel p, but only two degrees of freedom remain because Pr(l_p^T = 1 | l_p^{T-1}) = 1 - Pr(l_p^T = 0 | l_p^{T-1}). We denote the two free quantities as Pr_{FF} and Pr_{BB}, where Pr_{FF} is the probability of remaining foreground and Pr_{BB} is the probability of remaining background. A simple strategy to calculate Pr_{FF} and Pr_{BB} is described here. Both are initialized to 0.5. At time T-2, the numbers of foreground pixels and background pixels are denoted as n_F^{T-2} and n_B^{T-2} respectively; when there are detected moving objects, n_F^{T-2} > 0 and n_B^{T-2} > 0. At time T-1, the number of pixels which remain foreground is n_F^{T-1} and the number of pixels which remain background is n_B^{T-1}. We approximate Pr_{FF} and Pr_{BB} at time T by:

Pr_{FF}^T = \alpha_T \frac{1}{2} + (1 - \alpha_T) \frac{n_F^{T-1}}{n_F^{T-2}}    (17)

Pr_{BB}^T = \beta_T \frac{1}{2} + (1 - \beta_T) \frac{n_B^{T-1}}{n_B^{T-2}}    (18)

where \alpha_T ∈ [0, 1] and \beta_T ∈ [0, 1] are mixing factors. If n_F^{T-2} = 0 or n_B^{T-2} = 0, Pr_{FF} and Pr_{BB} keep their previous values.
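A small sketch of Eqs. (17)-(18) is given below. It assumes that "pixels which remain foreground" means pixels labeled foreground at both T-2 and T-1 (and analogously for background); that interpretation, the mask-based interface and the mixing-factor values are assumptions made for illustration.

```python
import numpy as np

def update_transition_probs(mask_t2, mask_t1, pr_ff, pr_bb,
                            alpha_t=0.5, beta_t=0.5):
    """Update Pr_FF and Pr_BB with Eqs. (17)-(18).

    mask_t2, mask_t1: boolean foreground masks at times T-2 and T-1.
    pr_ff, pr_bb: previous transition probabilities (kept if a count is zero).
    """
    n_f2 = np.count_nonzero(mask_t2)
    n_b2 = mask_t2.size - n_f2
    if n_f2 == 0 or n_b2 == 0:
        return pr_ff, pr_bb                          # keep the previous values
    n_f1 = np.count_nonzero(mask_t1 & mask_t2)       # stayed foreground
    n_b1 = np.count_nonzero(~mask_t1 & ~mask_t2)     # stayed background
    pr_ff = alpha_t * 0.5 + (1 - alpha_t) * n_f1 / n_f2    # Eq. (17)
    pr_bb = beta_t * 0.5 + (1 - beta_t) * n_b1 / n_b2      # Eq. (18)
    return pr_ff, pr_bb
```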
4 Experiments and Results
We tested our algorithm on a variety of videos, which include fountains, swaying trees, camera motion, etc. The algorithm was implemented in C++ on a computer with a 2.83 GHz Intel Core2 CPU and 2 GB RAM, and achieved a processing speed of about 6 fps at a resolution of 240 × 360. Since GMM [1] and KDE [5] are two widely used background subtraction algorithms, we compare the performance of our method with them. To evaluate the effectiveness of the three algorithms, no morphological operations such as erosion and dilation are used in the experiments. Both qualitative and quantitative analyses are used to demonstrate the effectiveness of our approach. Fig. 6 shows qualitative comparison results on outdoor scenes with a fountain. The first column shows frames 470, 580, 680 and 850 of the video; there are moving objects in the first, third and fourth images. The second column is the manual segmentation result, the third column shows the result of our method, and the fourth and fifth columns are the results of GMM and KDE respectively. The difficulty of moving object extraction in this video is that the background changes continuously and violently, which makes it hard for GMM and KDE to adapt their background models to the dynamic texture. Our method works well in the presence of the fountain and gives accurate segmentation.
Fig. 6. Segmentation results in the presence of fountain. The first column shows the original images, the second column gives the manual segmentation, the third column presents the result of our method, the last two columns show the results of GMM and KDE respectively.
Fig. 7. Segmentation results in the presence of waving trees. The first column shows the original images, the second column gives the manual segmentation, the third column presents the result of our method, the last two columns show the results of GMM and KDE respectively.
Fig. 7 gives qualitative comparison results on outdoor scenes with waving trees. The video comes from [8]. Our method detects most of the foreground pixels accurately and produces few false detections. Fig. 8 gives the segmentation of another two challenging videos by our method. The videos come from [10]; the two original sequences contain ocean ripples and waving trees respectively.
Fig. 8. Segmentation results in complex scenes. The first column shows the original images which contain ocean ripples, the second column presents the corresponding result of our method. The third column shows the original images which contain waving trees, the fourth column presents the corresponding result of our method.
Fig. 9. Segmentation results in the presence of camera motion. The first column shows the original images, the second column gives the manual segmentation, the third column presents the result of our method, the last two columns show the results of GMM and KDE respectively.
In order to better demonstrate the performance of our algorithm, a quantitative analysis is provided here. The test video and corresponding ground-truth data come from [6]. This video contains average nominal motion of about 14 pixels.
Fig. 10. Precision (left) and recall (right) vs. frame number for the three algorithms. The GMM is tested with different learning rates: 0.005, 0.05, 0.1.
Fig. 9 shows a qualitative illustration of the results as compared to the ground truth, GMM and KDE. From this figure, we can see that our method can segment moving objects accurately and has almost no false detections in the presence of camera motion. Fig. 10 shows the corresponding quantitative comparison in terms of precision and recall. The precision and recall are defined as:

Precision = \frac{\text{Number of true positives detected}}{\text{Total number of positives detected}}

Recall = \frac{\text{Number of true positives detected}}{\text{Total number of true positives}}
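These two measures can be computed per frame from a predicted mask and a ground-truth mask, for instance as in the short sketch below (mask-based interface assumed for illustration).

```python
import numpy as np

def precision_recall(pred_mask, gt_mask):
    """Per-frame precision and recall of the foreground detection."""
    tp = np.count_nonzero(pred_mask & gt_mask)   # true positives
    detected = np.count_nonzero(pred_mask)       # all detected positives
    actual = np.count_nonzero(gt_mask)           # all true positives
    precision = tp / detected if detected else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```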
From Fig. 10 we can see that the detection accuracy of our method, both in terms of precision and recall, is consistently higher than that of GMM and KDE; several learning rates were tested for GMM. Although the precision and recall of our method are both very high, there are still some false positives and false negatives, mainly at the edges or contours of the true objects. That is because spatial continuity information is used in the proposed method.
5 Conclusions
In this paper, we have proposed an online and unsupervised technique which learns multiple cues to find the optimal foreground/background segmentation in an MRF framework. Our method adopts the advantages of foreground/background modeling and the graph cuts algorithm. To increase the accuracy of foreground segmentation, color, locality, temporal coherence and spatial continuity are fused together in the framework. The distributions of the different features for background and foreground are modeled and learned separately. Our method can segment moving objects accurately in complex scenes, as demonstrated by experiments on several challenging videos. Acknowledgement. This work is supported by the National Natural Science Foundation of China (Grant No. 60736018, 60723005), the National Hi-Tech Research and Development Program of China (2009AA01Z318), and the National Science Foundation of China (60605014, 60875021).
References

1. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22, 747–757 (2000)
2. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model for real-time tracking with shadow detection. In: Proc. European Workshop on Advanced Video Based Surveillance Systems (2001)
3. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proc. Int. Conf. Pattern Recognition, pp. 28–31 (2004)
4. Lee, D.S.: Effective Gaussian mixture learning for video background subtraction. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 827–832 (2005)
5. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000)
6. Sheikh, Y., Shah, M.: Bayesian modeling of dynamic scenes for object detection. IEEE Trans. Pattern Analysis and Machine Intelligence 27, 1778–1792 (2005)
7. Ko, T., Soatto, S., Estrin, D.: Background subtraction on distributions. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 276–289. Springer, Heidelberg (2008)
8. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. In: Proc. ICCV, pp. 255–261 (1999)
9. Monnet, A., Mittal, A., Paragios, N., Ramesh, V.: Background modeling and subtraction of dynamic scenes. In: Proc. ICCV, pp. 1305–1312 (2003)
10. Li, L., Huang, W., Gu, I.Y.H., Tian, Q.: Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing 13, 1459–1472 (2004)
11. Pless, R., Larson, J., Siebers, S., Westover, B.: Evaluation of local models of dynamic backgrounds. In: Proc. CVPR, pp. 73–78 (2003)
12. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 16–29. Springer, Heidelberg (2006)
13. Boykov, Y., Jolly, M.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. ECCV, pp. 105–112 (2006)
14. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Bi-layer segmentation of binocular stereo video. In: Proc. CVPR, pp. 1186–1193 (2005)
15. Liu, F., Gleicher, M.: Learning color and locality cues for moving object detection and segmentation. In: Proc. CVPR, pp. 320–327 (2009)
16. Sun, J., Zhang, W., Tang, X., Shum, H.-Y.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006)
17. Criminisi, A., Cross, G., Blake, A., Kolmogorov, V.: Bilayer segmentation of live video. In: Proc. CVPR, vol. 1, pp. 53–60 (2006)
18. Zha, Y., Bi, D., Yang, Y.: Learning complex background by multi-scale discriminative model. Pattern Recognition Letters 30, 1003–1014 (2009)
19. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
20. Hall, P., Wand, M.: On the accuracy of binned kernel estimators. J. Multivariate Analysis (1995)
Online Learning for PLSA-Based Visual Recognition

Jie Xu 1,2, Getian Ye 3, Yang Wang 1,2, Wei Wang 1,2, and Jun Yang 1,2

1 National ICT Australia
2 University of New South Wales
3 Canon Information Systems Research Australia
{jie.xu,yang.wang,jun.yang}@nicta.com.au
Abstract. Probabilistic Latent Semantic Analysis (PLSA) is one of the latent topic models and it has been successfully applied to visual recognition tasks. However, PLSA models have been learned mainly in batch learning, which can not handle data that arrives sequentially. In this paper, we propose a novel on-line learning algorithm for learning the parameters of PLSA. Our contributions are two-fold: (i) an on-line learning algorithm that learns the parameters of a PLSA model from incoming data; (ii) a codebook adaptation algorithm that can capture the full characteristics of all the features during the learning. Experimental results demonstrate that the proposed algorithm can handle sequentially arriving data that batch PLSA learning cannot cope with, and its performance is comparable with that of the batch PLSA learning on visual recognition.
1 Introduction
Visual recognition involves the task of automatically assigning visual data to semantic categories. Recently, probabilistic latent topic models such as probabilistic Latent Semantic Analysis (PLSA) [1], Latent Dirichlet Allocation (LDA) [2] and their extensions have been applied to visual recognition tasks. PLSA has been successfully employed for object categorization [3], scene recognition [4], and action recognition [5], [6]. However, PLSA has the following disadvantages. Firstly, the conventional learning for PLSA is batch learning, and hence it cannot handle data that arrives sequentially. Secondly, PLSA relies on a codebook of words for the “bag-of-words” representation. For visual recognition, the k-means clustering algorithm is usually employed to produce a codebook of visual words [3], [5], [4]. An optimal codebook would be constructed by taking into account all the training features; however, this is infeasible for large datasets with excessive numbers of features. In practice, only a subset of features is chosen for the clustering, which results in a suboptimal codebook. In order to address the aforementioned problems, we propose an on-line learning algorithm for learning the parameters of the PLSA model from the incoming data. The contributions of this paper are two-fold: (i)
an on-line learning algorithm that learns the parameters of the PLSA model from incoming data; (ii) a codebook adaptation algorithm that can capture the full characteristics of all the features during the on-line learning. Related Work. Using the proposed on-line algorithm, the trained model can adapt to new data without re-using the previous training data. Incremental EM is employed to speed up the learning of PLSA in [7]; however, that algorithm is not an on-line one, since the previous training data has to be re-used, together with the new data, for model adaptation. Previous work on on-line learning of PLSA and other latent models can be found in the text mining area [8], [9], [10]. However, all these algorithms are proposed for text mining only, and no visual codebook or codebook adaptation is involved. The most related work to our algorithm is the QB-PLSA proposed by Chien et al. in [8]. QB-PLSA employs the QB estimation [11] to enable on-line learning of PLSA, representing the priors of the PLSA parameters using Dirichlet densities. Our algorithm, on the other hand, employs on-line EM for stochastic approximation of PLSA and makes no assumptions about the parameter distributions. To the best of our knowledge, our work proposes the first on-line PLSA algorithm for vision problems. We apply the proposed algorithm to two visual recognition tasks, i.e., scene recognition and action recognition. Scene recognition refers to the classification of the semantic categories of places using visual information collected from scene images. The work in the literature can be classified into one of the two following approaches. The first approach considers global or local feature-based image representations [12], [13], [14], whereas the second relies on intermediate concept-based representations [15], [16], [4], [17], [18]. Among the work in the second category, some models learn the concepts in a supervised manner [15], [17], while the others learn without supervision [16], [4], [18]. PLSA-based scene recognition belongs to the second category, and it has been shown by Bosch et al. [4] to outperform previous models [15], [16]. Action recognition is an important task in video surveillance, and numerous techniques have been proposed for it in the literature. A popular approach is based on the analysis of motion cues in human actions [19], [20]. Despite its popularity, the robustness of this type of approach depends heavily on the tracking system, especially in complex situations. To handle complex situations, another type of approach attempts to build dynamic human action models [21], [22], [23]; however, the large number of model parameters results in high model complexity. To avoid this complexity, local feature-based video representations are employed: salient features are extracted from the videos, and the videos are converted into intermediate representations based on these features before being fed into different classifiers [24], [25], [26], [27]. PLSA has been employed for action recognition based on the “bag-of-words” representation using spatial-temporal interest points [24], and it has been shown to perform well on this task [5]. Our experimental results show that the proposed algorithm can handle large datasets that batch PLSA learning cannot cope with, and that its performance is comparable with that of batch PLSA learning on visual recognition.
The rest of the paper is organized as follows: Section 2 reviews the batch learning algorithm for PLSA-based visual recognition. Our proposed algorithm is described in Section 3, followed by the experimental results in Section 4. Our final conclusions are summarized in Section 5.
2 PLSA-Based Visual Recognition
The PLSA was originally proposed for text modeling, in which words are considered as the elementary building units of documents. Rather than modeling documents as sets of words, PLSA models each document as a mixture of latent topics, and each topic is characterized by a distribution over words. In the context of visual recognition, the visual features are considered as words. Denote by z_k ∈ Z = {z_1, z_2, ..., z_K} the latent topic variable that associates the words and documents, and by X the training co-occurrence matrix consisting of co-occurrence counts of (d_i, w_j), collected from N visual documents D = {d_1, d_2, ..., d_N} based on a codebook of V visual words W = {w_1, w_2, ..., w_V}. The joint probability of the visual document d_i, the hidden topic z_k and the visual word w_j is assumed to have the following form [1]:

p(d_i, z_k, w_j) = p(w_j | z_k) p(z_k | d_i) p(d_i)    (1)

where p(w_j | z_k) represents the probability of the word w_j occurring in the hidden topic z_k, p(z_k | d_i) is the probability of the topic z_k occurring in the document d_i, and p(d_i) can be considered as the prior probability of d_i. The joint probability of the observation pair (d_i, w_j) is obtained by marginalizing over all the topic variables z_k:

p(d_i, w_j) = p(d_i) \sum_k p(z_k | d_i) p(w_j | z_k)    (2)

Let n(d_i, w_j) be the number of occurrences of the word w_j in the document d_i; all of these counts constitute the co-occurrence matrix X. The prior probability p(d_i) can be computed as p(d_i) ∝ \sum_j n(d_i, w_j). The parameters of PLSA contain multinomial distributions over the K latent variables, such that \sum_{j=1}^{V} p(w_j | z_k) = 1 and \sum_{k=1}^{K} p(z_k | d_i) = 1. We represent these KV + KN probabilities using \theta = {{p(w_j | z_k)}_{j,k}, {p(z_k | d_i)}_{k,i}}. The log-likelihood of X given \theta can be written as

L(X | \theta) = \sum_{i=1}^{N} \sum_{j=1}^{V} n(d_i, w_j) \log p(d_i, w_j)    (3)

A maximum likelihood estimate of \theta is obtained by maximizing the log-likelihood with an Expectation Maximization (EM) algorithm [28]. Starting from an initial estimate of \theta, the expectation (E) step of the algorithm computes the following expectation function:

Q(\hat{\theta} | \theta) = E_Z[\log P(X, Z | \hat{\theta}) | X, \theta] ∝ \sum_{i=1}^{N} \sum_{j=1}^{V} n(d_i, w_j) \sum_{k=1}^{K} p(z_k | d_i, w_j) \log[p(z_k | d_i) p(w_j | z_k)]
The maximization (M) step then calculates the new estimate \hat{\theta} by maximizing the expectation function Q(\hat{\theta} | \theta) computed in the E-step. The EM algorithm alternates between the two steps until the log-likelihood converges. It is shown in [1] that the E-step is equivalent to computing the posterior probability p(z_k | d_i, w_j), given the estimates p(w_j | z_k) and p(z_k | d_i):

p(z_k | d_i, w_j) = \frac{p(w_j | z_k) p(z_k | d_i)}{\sum_{l=1}^{K} p(w_j | z_l) p(z_l | d_i)}    (4)

while the M-step corresponds to the following updates, given the p(z_k | d_i, w_j) computed in the E-step:

p(w_j | z_k)^{new} = \frac{\sum_{i=1}^{N} n(d_i, w_j) p(z_k | d_i, w_j)}{\sum_{m=1}^{V} \sum_{i=1}^{N} n(d_i, w_m) p(z_k | d_i, w_m)}    (5)

p(z_k | d_i)^{new} = \frac{\sum_{j=1}^{V} n(d_i, w_j) p(z_k | d_i, w_j)}{\sum_{l=1}^{K} \sum_{j=1}^{V} n(d_i, w_j) p(z_l | d_i, w_j)}    (6)
In (6), the denominator can also be written as n(d_i), i.e., the total number of words occurring in d_i. During the inference stage, given a testing visual document d_test, the topic-based intermediate representation p(z_k | d_test) is computed by using the “fold-in” heuristic described in [1]. The heuristic employs the EM algorithm in the same way as in the learning stage, while fixing the values of the coefficients p(w_j | z_k) obtained from the training stage.
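A compact sketch of the batch EM iteration of Eqs. (4)-(6) is shown below. It keeps the full posterior tensor in memory and runs a fixed number of iterations, so it is only an illustration of the update equations, not the authors' implementation; the initialization and the numerical safeguards are assumptions.

```python
import numpy as np

def plsa_em(X, K, n_iter=100, seed=0):
    """Batch EM for PLSA on a document-word count matrix X of shape (N, V).

    Returns p(w|z) of shape (K, V) and p(z|d) of shape (N, K).
    The log-likelihood convergence test is omitted for brevity.
    """
    rng = np.random.default_rng(seed)
    N, V = X.shape
    p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step, Eq. (4): posterior p(z|d,w) for every (d, w) pair
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (N, K, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eq. (5): p(w|z)
        nw = X[:, None, :] * post                            # n(d,w) p(z|d,w)
        p_w_z = nw.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eq. (6): p(z|d)
        p_z_d = nw.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```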
3 Proposed Algorithm
We now present our algorithm for PLSA in the on-line learning setting, where training data is assumed to arrive at successive time slices. Denote by D^(t) = {d_1^(t), d_2^(t), ..., d_{N_t}^(t)} the data stream arriving at time t, by F^(t) = {f_1^(t), f_2^(t), ..., f_{N_{f_t}}^(t)} the feature descriptors from D^(t), by W^(t) = {w_1, w_2, ..., w_{V_t}} the codebook of visual words at time slice t, and by X^(t) the co-occurrence matrix at time t. The goal of our algorithm is to learn the parameters of the PLSA model from data streams.

3.1 On-line Codebook Adaptation for PLSA
To train a PLSA model for visual recognition, a codebook of visual words is required. The k-means clustering algorithm is usually employed to produce the codebook using all the features in the training set. Niebles et al. show that a bigger sized codebook often results in a higher recognition accuracy [5]. The size of the codebook is usually proportional to the number of features. However in practice, due to memory limitations, only a limited number of features are considered for the codebook construction when dealing with a large dataset.
Algorithm 1. On-line Codebook Adaptation
1: INPUTS:
2:   F^(t) = {f_1^(t), f_2^(t), ..., f_{N_{f_t}}^(t)} - feature vectors in the data stream at time slice t
3:   W^(t-1) = {w_1, w_2, ..., w_{V_{t-1}}} - the codebook from time slice (t-1)
4:   Σ^(t-1) = {(μ_1, σ_1), (μ_2, σ_2), ..., (μ_{V_{t-1}}, σ_{V_{t-1}})} - the distance statistics for W^(t-1)
5:   \bar{Σ} = {(\bar{μ}, \bar{σ}^2)} - the mean distance statistics computed from the initial set
6:   β - the learning factor
7: OUTPUTS:
8:   W^(t) = {w_1, w_2, ..., w_{V_t}} - updated codebook for time slice t
9:   Σ^(t) = {(μ_1, σ_1), (μ_2, σ_2), ..., (μ_{V_t}, σ_{V_t})} - the distance statistics for W^(t)
10: for each time slice t ≥ 1 do
11:   V_t ← V_{t-1}
12:   for each feature f_n^(t) do
13:     choose a codeword w_k, where
14:       k = arg min_i ( ||f_n^(t) - w_i||_2^2 / σ_i^2 )
15:     d ← ||f_n^(t) - w_k||_2^2
16:     if d ≤ μ_k + 2.5σ_k then
17:       μ_k ← μ_k + β(d - μ_k)
18:       σ_k^2 ← σ_k^2 + β(d - μ_k)(d - μ_k)
19:       w_k ← w_k + β(f_n^(t) - w_k)
20:     else
21:       V_t ← V_t + 1, w_{V_t} ← f_n^(t), μ_{V_t} ← \bar{μ}, σ_{V_t}^2 ← \bar{σ}^2
22:     end if
23:   end for
24: end for
distance between f_n^(t) and w_k. If the difference between d and μ_k is within 2.5σ_k, both w_k and (μ_k, σ_k) are updated using f_n^(t). The updating factor β is set to a flat rate (e.g., β = 0.05). If no match is found between f_n^(t) and any of the codewords, a new codeword is initiated using f_n^(t), and the related distance statistics are initiated using \bar{μ} and \bar{σ}^2, the corresponding mean values computed from the initial training set.

3.2 On-Line EM for PLSA Learning
The learning of the PLSA model in Section 2 employs batch EM for parameter estimation. However, batch EM is not applicable to data that arrives sequentially. To enable the learning of PLSA in this situation, we propose an on-line learning algorithm for PLSA. The idea is to update the model parameter \theta using D^(t) received at each time slice. For a better understanding of the algorithm, we begin the on-line learning using the notation for batch learning in Section 2, and first define two statistics n(d, w_j)_{z_k} and n(d, w)_{z_k} as follows:

n(d, w_j)_{z_k} = \sum_{i=1}^{N} n(d_i, w_j) p(z_k | d_i, w_j),  j ∈ {1, 2, ..., V}, k ∈ {1, 2, ..., K}    (7)

n(d, w)_{z_k} = \sum_{j=1}^{V} \sum_{i=1}^{N} n(d_i, w_j) p(z_k | d_i, w_j),  k ∈ {1, 2, ..., K}    (8)

Based on the above definitions, (5) in the M-step can be re-formulated as:

p(w_j | z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) p(z_k | d_i, w_j)}{\sum_{m=1}^{V} \sum_{i=1}^{N} n(d_i, w_m) p(z_k | d_i, w_m)} = \frac{n(d, w_j)_{z_k}}{n(d, w)_{z_k}}    (9)

The main idea of our proposed on-line EM is to replace the values of n(d, w_j)_{z_k} and n(d, w)_{z_k} with weighted means. We define the weighted mean of n(d, w)_{z_k} at time t as follows:

n(d, w)^{(t)}_{z_k} = \eta(t) \sum_{\tau=1}^{t} \Big( \big( \prod_{s=\tau+1}^{t} \lambda(s) \big) \sum_{i=1}^{N_\tau} \sum_{j=1}^{V} n(d_i^{(\tau)}, w_j) p^{(\tau)}(z_k | d_i^{(\tau)}, w_j) \Big)    (10)

where \eta(t) = (\sum_{\tau=1}^{t} \prod_{s=\tau+1}^{t} \lambda(s))^{-1}, and p^{(t)}(z_k | d_i^{(t)}, w_j) represents the posterior probability computed using the visual document stream D^(t) and W^(t). Here, the parameter \lambda(s) (0 ≤ \lambda(s) ≤ 1, s = 1, 2, 3, ...) is a time-dependent decaying factor that determines the contribution of the data stream at each time slice. Equation (10) can be simplified as (please refer to the appendix for the details of the derivation):

n(d, w)^{(t)}_{z_k} = (1 - \eta(t)) n(d, w)^{(t-1)}_{z_k} + \eta(t) \sum_{i=1}^{N_t} \sum_{j=1}^{V} n(d_i^{(t)}, w_j) p^{(t)}(z_k | d_i^{(t)}, w_j)    (11)
Algorithm 2. On-line EM Algorithm for PLSA Learning 1: INITIAL STAGE (0) (0) 2: estimate n(d, w)zk , n(d, wj )zk from the initial training set. 3: construct the initial codebook W (0) using F (0) from the initial training set 4: ON-LINE LEARNING STAGE 5: for each image stream at time slice t do 6: Codebook Adaptation: 7: compute W (t) using F (t) and W (t−1) based on Algorithm 1 (t) 8: compute X (t) = {n(di , wj )} with W (t) and F (t) 9: On-line EM Parameter Estimation: (t) 10: initialize p(t) (wj |zk ), p(t) (zk |di ) based on W (t) 11: for each EM iteration n do 12: E-step: (t)
(t)
13: 14:
compute p(t) (zk |di , wj ) using p(t) (wj |zk ) and p(t) (zk |di ) with (4) M-step:
15: 16:
computen(d,w)zk using X (t) , n(d,w)zk , and p(t)(zk |di , wj ) with (11) (t) (t−1) (t) compute n(d, wj )zk using X (t) , n(d, wj )zk , and p(t) (zk |di , wj ) with (12) (t) (t) compute p(t) (wj |zk ) using n(d, w)zk and n(d, wj )zk with (13) (t) (t) compute p(t) (zk |di ) using X (t) and p(t) (zk |di , wj ) with (6) Check convergence:
17: 18: 19: 20: 21: 22: 23: 24: 25:
(t)
(t−1)
(t)
(t)
compute L(n) (X (t) |θ) using p(t) (wj |zk ) and p(t) (zk |di ) with (3) (n) (n−1) | 3, the matrix composed of normal vectors will be a rectangle matrix, and the symbol of inverse in (5) denotes the Moore-Penrose pseudo-inverse. The pseudo-inverse gives the solution in a least-squared sense. However, when the specular object is a plane or a surface of revolution, there exist some degenerate cases which are described below in Proposition 1 and Proposition 2. Proposition 1. If the specular object is a plane, the translation vector T cannot be determined by using the proposed theory. Proof. Let S be a specular plane. Without loss of generality, let P be an arbitrary point on S and NP be the surface normal at P . Consider now the visual ray OP . By the law of reflection, the visual ray OP , the surface normal NP and the reflection ray must be all lie on the same plane. Let ΓP denotes such a reflection plane. Note all points on S have the same surface normal NP . It follows that all the reflection planes will intersect along a line passing through the camera center O and with an orientation same as NP , and hence their normal vectors are all orthogonal to NP . This makes the matrix composed of the normal vectors in (5) to have a rank of at most 2, and therefore T cannot be recovered using (5). Proposition 2. If the specular object is a surface of revolution and its revolution axis passes through the camera center, the translation vector T cannot be determined by using the proposed theory. Proof. Let S be a specular surface of revolution with its revolution axis l passing through the camera center O. Without loss of generality, let P be an arbitrary point on S and NP be the surface normal at P . Consider now the visual ray OP . By the law of reflection, the visual ray OP , the surface normal NP and the reflection ray must all lie on the same plane. Let ΓP denotes such a reflection plane. By the properties of a surface of revolution, NP lies on the plane defined
by P and l. Since l passes through O, it is easy to see that l also lies on ΓP . It follows that all the reflection planes will intersect at l and hence their normal vectors are all orthogonal to l. This makes the matrix composed of normal vectors in (5) to have a rank of at most 2, and therefore T cannot be recovered using (5). Since a sphere is a special kind of surface of revolution with any line passing through its center being a revolution axis, Corollary 1 follows immediately. Corollary 1. A single sphere cannot be recovered by using the proposed theory.
4
Specular Surface Recovery
With the estimated translation vector T, the position of the reference planar pattern after translation can be determined. If accurate correspondences between points on the reference planar pattern and pixels on the image can be obtained, the surface of the specular object can be recovered by triangulating the corresponding visual ray and reflection ray at each pixel (see Section 2). However, most of the encoding strategies cannot achieve a real point-to-point correspondences. Usually, one pixel corresponds to one encoded area (e.g., a square on the reference planar pattern). In [16], a dense matching of the pixels on the image and positions on the reference plane was implemented by minimizing an energy function which penalized the mis-encoding problem, and bilinear interpolation was used to achieve a sub-pixel accuracy in the matching. Bilinear interpolation, however, is indeed not a good approximation as it is well-known that ratio of lengths is not preserved under perspective projection. Besides, the reflection of the pattern by a non-planar specular surface would also introduce distortion in the image of the pattern, and such a distortion depends on both the shape and distances of the object relative to the camera and the planar pattern. In order to obtain good shape recovery results, the resolution of the encoding pattern and the positions of the plane need to be chosen carefully. In this paper, the specular surface is assumed to be smooth and locally planar. After associating a gray code (and hence an encoded area on the reference plane) to each pixel, the algebraic centers ci for pixels with the same gray code is computed and associated with the center Ci of the corresponding encoded area on the reference plane. This is the approximation to the point-to-point correspondences in the underlying encoding strategy. This approximation has been verified with experiments on synthetic data, and experimental results show that the error distances introduced by approximating the ground truth reflection points on the plane with the centers of the encoded areas are negligible. Nevertheless, this approximation can only be applied to obtain a matching between the image pixels and points on one reference plane, as the algebraic centers ci computed for one reference plane position will not in general coincide with those computed for a different reference plane position. Instead of trying to locate a point match on the translated reference plane Ω for each ci , the visual ray for ci is projected onto the translated reference plane Ω using its corresponding point Ci on the initial reference plane Π as the projection center (see Fig. 2). The projection of
O
Fig. 2. Shape recovery of specular object. The visual ray at ci is projected onto the translated reference plane Ω using its corresponding point Ci on the initial reference plane Π as the projection center. The projection of the visual ray on Ω will result in a line which will intersect the corresponding encoded area as observed at ci . This intersection line segment can be back projected onto the visual ray and this gives a depth range in which the surface point should exist.
the visual ray on Ω will result in a line which will intersect the corresponding encoded area (which is a square) as observed at ci . This intersection line segment can be back projected onto the visual ray and this gives a depth range in which the surface point should exist. If Ω is far from Π, the depth range will be small and the midpoint can be taken as the estimated surface depth. Alternatively, the depth range can be made smaller by increasing the resolution of the encoding pattern (i.e., decreasing the size of the encoded area). This, however, is not a practical strategy since the encoding pattern will be blurred in the imaging process and may result in an aliasing effect.
5
Experimental Analysis
The proposed approach was tested using both synthetic and real image data. The synthetic images were generated using the pov-ray software. For the real experiments, images were taken by using a simple setting composed of a monitor, a camera and a drawer. The monitor was used for displaying gray coded patterns. The camera used was a Cannon 450D with a focal length of 50mm. The drawer was used for carrying the monitor and making it move in pure translation motions. 5.1
Synthetic Experiments
In the synthetic experiments, the scene was composed of two spheres or two planes with different positions and orientations. Note that the cases with multiple
144
M. Liu et al.
Fig. 3. Shape recovery result for a sphere. Left column: Ground truth observed in two different views. Right column: Recovered shape in two different views. 1.4 1.2 1 0.8 0.6 0.4 0.2 0
0
50
100
150
Fig. 4. Relationship between depth recovery error and translation scale. The scene is composed of two spheres which have the same radius of 19mm, and the translation scale ranges from 1mm to 150mm. 12880 pixels are used for shape recovery. Y-axis shows the average depth recovery error for all pixels, and X-axis represents the translation scale.
spheres or multiple planes are not degenerate, and the shape of the objects can be recovered using the proposed method. In order to analyze the effect of the translation scale on the result, the reference plane was translated with multiple scales. The reconstruction for a part of a sphere is shown in Fig. 3, which also gives a comparison with the ground truth data. Fig. 4 shows the relation between the translation scale of the reference plane and the shape recovery error. It can
Fig. 5. Shape recovery result for a plane. Left column: Ground truth observed in two different views. Right column: Recovered shape in two different views.
Fig. 6. Setting for the real experiment. Left Column: Setting used in the real experiment. Right Column: sample pattern images used for encoding. The proposed method is used to recover the area reflecting the reference plane on the specular object.
be seen that, within the visible reflection range of the specular object, a larger translation scale results in a smaller reconstruction error. After experimenting with a range of translation scales, 80 mm is chosen as the translation scale for the second position of the reference plane, and the reconstruction error is smaller than 0.5%. The reconstruction for the plane is shown in Fig. 5, which again includes a comparison with the ground truth data. Here, 70 mm is chosen as the translation scale and the average reconstruction error is less than 0.45%.
Fig. 7. Shape recovery result for a spoon. The recovered shape is the area with reflections on the spoon.
5.2 Real Experiments
The real experiment was conducted to recover the shape of a small spoon. The experimental setup was quite simple: the monitor was carried by a drawer and translated on the table (see Fig. 6). The monitor displayed gray-coded pattern images, and a set of images was taken as the monitor was translated by different amounts. A traditional structured-light method [18] was used to obtain the correspondences between coded areas on the reference plane and pixels on the image. The first reference planar pattern is calibrated initially. The reference plane can then be translated along a certain direction by different amounts, and the reconstruction can be obtained either by choosing the best motion scale or by combining the information from multiple motions to get a better result. Although the spoon is small and reflects only a small area of the monitor, its shape can be recovered well (see Fig. 7).
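As a rough illustration of the structured-light step, the sketch below thresholds a stack of Gray-code pattern images into bit planes and converts each pixel's Gray code into a stripe index on the monitor. It is a generic Gray-code decoder under assumed inputs (reference white/black images, most-significant bit first), not the exact pipeline of [18].

```python
import numpy as np

def decode_gray_stack(pattern_imgs, white_img, black_img):
    """Decode a stack of Gray-code pattern images (one bit plane each, most
    significant bit first) into a per-pixel stripe index on the display."""
    threshold = 0.5 * (white_img.astype(float) + black_img.astype(float))
    bits = [(img.astype(float) > threshold).astype(np.uint32) for img in pattern_imgs]

    # Gray-to-binary conversion: b_0 = g_0, b_i = b_{i-1} XOR g_i.
    binary = bits[0].copy()
    index = binary.copy()
    for g in bits[1:]:
        binary = np.bitwise_xor(binary, g)
        index = (index << 1) | binary
    return index  # integer stripe index per pixel

if __name__ == "__main__":
    # Toy usage with random 8-bit patterns on a 4x4 "image".
    rng = np.random.default_rng(0)
    white = np.full((4, 4), 255, np.uint8)
    black = np.zeros((4, 4), np.uint8)
    patterns = [rng.integers(0, 2, (4, 4)).astype(np.uint8) * 255 for _ in range(8)]
    print(decode_gray_stack(patterns, white, black))
```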
6 Conclusions
In this paper, we have proposed a novel solution for specular surface recovery based on observing the reflections of a translating planar pattern. Unlike previous methods which assume a fully calibrated environment, only one reference planar pattern is assumed to have been calibrated against a fixed camera observing the specular surface, and this pattern is allowed to undergo an unknown pure translation. The motion of the pattern as well as the shape of the specular object can then be estimated from the reflections of the translating pattern. The main contributions of this paper are: 1) a closed-form solution for recovering the motion of the translating pattern. This allows an easy calibration of the translating pattern, and the data redundancy resulting from the translating pattern (i.e., reflections of multiple planar patterns at each pixel location) can improve both the robustness and accuracy of the shape estimation. 2) A novel method for shape recovery based on computing the projections of the visual rays on the translating pattern. This produces a depth range for each
pixel in which the true surface point exists. This in turn provides a measure of the accuracy and confidence of the estimation. Experimental results on both synthetic and real data are presented, which demonstrate the effectiveness of the proposed approach. As for future work, we are now trying to solve the more challenging problems in which 1) the initial reference planar pattern is also uncalibrated and 2) the motion of the reference planar pattern is not limited to pure translation.
References
1. Tomasi, C.: Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision 9, 137–154 (1992)
2. Wong, K.Y.K., Cipolla, R.: Structure and motion from silhouettes. In: ICCV, pp. 217–222 (2001)
3. Woodham, R.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19, 139–144 (1980)
4. Zisserman, A., Giblin, P., Blake, A.: The information available to a moving observer from specularities. Image and Vision Computing 7, 38–42 (1989)
5. Fleming, R.W., Torralba, A., Adelson, E.H.: Specular reflections and the perception of shape. Journal of Vision 4, 798–820 (2004)
6. Oren, M., Nayar, S.K.: A theory of specular surface geometry. International Journal of Computer Vision 24, 105–124 (1997)
7. Roth, S., Black, M.J.: Specular flow and the recovery of surface structure. In: CVPR, pp. 1869–1876 (2006)
8. Adato, Y., Vasilyev, Y., Ben-Shahar, O., Zickler, T.: Toward a theory of shape from specular flow. In: ICCV, pp. 1–8 (2007)
9. Canas, G.D., Vasilyev, Y., Adato, Y., Zickler, T., Gortler, S., Ben-Shahar, O.: A linear formulation of shape from specular flow. In: ICCV, pp. 1–8 (2009)
10. Savarese, S., Perona, P.: Local analysis for 3d reconstruction of specular surfaces. In: CVPR, pp. 738–745 (2001)
11. Savarese, S., Perona, P.: Local analysis for 3d reconstruction of specular surfaces - part ii. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 759–774. Springer, Heidelberg (2002)
12. Bonfort, T., Sturm, P.: Voxel carving for specular surfaces. In: ICCV, pp. 691–696 (2003)
13. Seitz, S., Matsushita, Y., Kutulakos, K.: A theory of inverse light transport. In: ICCV, pp. 1440–1447 (2005)
14. Yamazaki, M., Iwata, S., Xu, G.: Dense 3D reconstruction of specular and transparent objects using stereo cameras and phase-shift method. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 570–579. Springer, Heidelberg (2007)
15. Nehab, D., Weyrich, T., Rusinkiewicz, S.: Dense 3d reconstruction from specularity consistency. In: CVPR, pp. 1–8 (2008)
16. Bonfort, T., Sturm, P., Gargallo, P.: General specular surface triangulation. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 872–881. Springer, Heidelberg (2006)
17. Rozenfeld, S., Shimshoni, I., Lindenbaum, M.: Dense mirroring surface recovery from 1d homographies and sparse correspondences. In: CVPR, pp. 1–8 (2007)
18. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: CVPR, pp. 195–202 (2003)
Medical Image Segmentation Based on Novel Local Order Energy

LingFeng Wang1, Zeyun Yu2, and ChunHong Pan1

1 NLPR, Institute of Automation, Chinese Academy of Sciences
{lfWang,chpan}@nlpr.ia.ac.cn
2 Department of Computer Science, University of Wisconsin-Milwaukee
[email protected]

Abstract. Image segmentation plays an important role in many medical imaging systems, yet in complex circumstances it remains a challenging problem. Among the many difficulties, intensity inhomogeneity of the image is a key one. In this work, we develop a novel local-homogeneous region-based level set segmentation method to tackle this problem. First, we propose a novel local order energy, which encodes a local intensity constraint. We then integrate this energy into the objective energy function and minimize the energy function via a level set evolution process. Extensive experiments are performed to evaluate the proposed approach, showing significant improvements in both accuracy and efficiency compared to the state of the art.
1 Introduction
The level set model has been extensively applied to medical image segmentation, because it holds several desirable advantages over traditional image segmentation methods. First, it can achieve sub-pixel segmentation accuracy of object boundaries. Second, it can change the topology of the contour automatically. Existing level set methods can mainly be classified into two groups: the edge-based methods [1–3] and the region-based methods [4–14]. Edge-based methods utilize local edge information to attract the active contour toward the object boundaries. These methods have two inevitable limitations [4]: they depend on the initial contour and are sensitive to noise. Region-based methods, on the other hand, were introduced to overcome these two limitations. These methods identify each region of interest by a certain region descriptor, such as intensity, which guides the evolution of the active contour. Region-based methods tend to rely on the assumption that the region descriptor is homogeneous. In this work, we improve the traditional region-based method to tackle segmentation difficulties such as inhomogeneity of the image intensity. There are mainly two types of region-based segmentation methods: the global-homogeneous [4–6] and the local-homogeneous [7–14] ones. One of the most important global-homogeneous methods was proposed by Chan et al. [4]. This method
first assumes that image intensities are statistically homogeneous in each segmented region. It then utilizes a global fitting energy to represent this homogeneity. Improved performance of global-homogeneous methods has been reported in [4–6]. However, the details of the object often cannot be segmented out by these methods; as shown in Fig. 1, parts of the brain tissue are missed. Local-homogeneous methods were proposed to handle intensity inhomogeneity. These methods assume that the intensities in a relatively small local region are separable. Li et al. [7, 10] introduced a local fitting energy based on a kernel function, which is used to extract local image information. Motivated by the local fitting energy proposed in [10], a local Gaussian distribution fitting energy was presented in [14]. An et al. [8] offered a variational region-based algorithm which combines the Γ-convergence approximation with a new multi-scale piecewise smooth model. Unfortunately, local-homogeneous methods have two inevitable drawbacks. First, they are sensitive to the initial contour. Second, the segmentation results contain obvious errors. As shown in Fig. 1, although the two initial contours are very similar, the segmentation results of [10] are quite different, and both results have obvious segmentation errors (indicated by the blue ellipses). Motivated by previous works [4, 7, 8, 10, 15], a local intensity constraint is adopted to realize our local-homogeneous region-based segmentation method. The purposes of using the local intensity constraint are twofold. First, the local characteristic softens the global constraint of global-homogeneous methods. Second, the additional constraint counteracts the instability of local-homogeneous methods. In this work, we first develop a local order energy as the local intensity constraint. The proposed local order energy assumes that the intensities in a relatively small local region have an order; for example, the intensities in the target are larger than those in the non-target. Generally, the local order assumption is easily satisfied in medical images. We then integrate the local order energy into the objective energy function, and minimize the energy function via a level set evolution process. Experimental results show that the proposed method has several advantages and improvements. First, compared to global-homogeneous methods, detailed information is well segmented. Second, the two main drawbacks of traditional local-homogeneous methods are well addressed. Third, the proposed method is also insensitive to noise. The remainder of the paper is organized as follows: Section 2 describes the proposed segmentation method in detail; Section 3 presents experiments and results; concluding remarks are given in Section 4.
2 The Proposed Method
In this section, we first briefly introduce the two traditional families of segmentation methods, i.e. the global-homogeneous methods and the local-homogeneous methods. Then, we describe the proposed local order energy, which encodes the local intensity constraint, in detail. Finally, we use a level set evolution process to minimize the energy function, and present our algorithm in Alg. 1.
Fig. 1. A comparison of the proposed method to the traditional methods, i.e. Chan et al. [4] (a global-homogeneous method) and Li et al. [10] (a local-homogeneous method). Note that two different yet very similar initial contours are used to test the initialization sensitivity of the three algorithms.
2.1 Global-Homogeneous and Local-Homogeneous Methods
Let Ω ⊂ R^2 be the image domain, and I : Ω → R be the given gray-level image. Global-homogeneous methods formulate the image segmentation problem as finding a contour C. The contour segments the image into two non-overlapping regions, i.e. the region inside the contour, C_in (the target region), and the region outside the contour, C_out (the non-target region). By specifying two constants c_in and c_out, the fitting energy (global fitting energy) is defined as

E^global(C, c_in, c_out) = λ_in ∫_{C_in} |I(x) − c_in|² dx + λ_out ∫_{C_out} |I(x) − c_out|² dx + μ|C|     (1)
where |C| is the length of the contour C, and λ_in, λ_out and μ are positive constants. As expressed in Eqn. 1, the two specified constants c_in and c_out globally govern all the pixel intensities in C_in and C_out, respectively. Thus, Chan's method is referred to as a global-homogeneous method. The two constants c_in and c_out can further be interpreted as two cluster centers, i.e. the target cluster c_in and the non-target cluster c_out. Motivated by Chan's work, Li et al. [10] proposed a local fitting energy based on a kernel function, given by

E^local(C, c_in(x), c_out(x)) = ∫ E^fit(C, c_in(x), c_out(x)) dx + μ|C|     (2)

yielding,
E^fit(C, c_in(x), c_out(x)) = λ_in ∫_{C_in} K_σ(x − y) |I(y) − c_in(x)|² dy + λ_out ∫_{C_out} K_σ(x − y) |I(y) − c_out(x)|² dy     (3)
where K_σ(·) is the Gaussian kernel function K_σ(x) = (1 / ((2π)^{n/2} σ^n)) e^{−|x|²/2σ²}, parameterized with the scale parameter σ. More comprehensive interpretations of the local fitting energy are given in [10]. The fitting values c_in(x) and c_out(x) locally govern the pixel intensities centered at position x, within a neighborhood whose size is controlled by σ. Thus, Eqn. 2 is named the local fitting energy, and Li's method is referred to as a local-homogeneous method. Similarly, c_in(x) and c_out(x) can be interpreted as two local cluster centers at position x. The main difference between the above two methods is how the cluster centers are chosen. The global-homogeneous method simply utilizes two constants c_in and c_out as the cluster centers, and formulates them in the global fitting energy. In contrast, the local-homogeneous method adopts local cluster centers, and formulates them in the local fitting energy.
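To make the distinction concrete, here is a minimal sketch of the global fitting energy of Eqn. 1 for a binary segmentation mask. The function name and the crude contour-length proxy (counting label changes) are our own illustrative choices, not part of the original method.

```python
import numpy as np

def global_fitting_energy(image, mask, lam_in=1.0, lam_out=1.0, mu=0.0):
    """Evaluate a Chan-Vese-style global fitting energy (Eqn. 1) for a binary
    mask: c_in/c_out are the region means, and |C| is roughly approximated by
    the number of horizontal/vertical label changes."""
    image = image.astype(float)
    inside, outside = mask.astype(bool), ~mask.astype(bool)
    c_in = image[inside].mean() if inside.any() else 0.0
    c_out = image[outside].mean() if outside.any() else 0.0

    data_in = lam_in * np.sum((image[inside] - c_in) ** 2)
    data_out = lam_out * np.sum((image[outside] - c_out) ** 2)

    m = mask.astype(int)
    length = np.abs(np.diff(m, axis=0)).sum() + np.abs(np.diff(m, axis=1)).sum()
    return data_in + data_out + mu * length, c_in, c_out

if __name__ == "__main__":
    img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
    good = np.zeros((32, 32), bool); good[8:24, 8:24] = True
    bad = np.zeros((32, 32), bool); bad[0:16, :] = True
    print("good mask energy:", global_fitting_energy(img, good)[0])
    print("bad mask energy: ", global_fitting_energy(img, bad)[0])
```

The correctly placed mask yields the lower energy, which is what the global fitting term rewards.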
2.2 Local Order Energy
Generally, medical images suffer from intensity inhomogeneity. Thus, the local fitting energy models the segmentation problem better than the global fitting energy. However, the local fitting energy also has problems. The most serious one is that the local fitting energy function has many local optima, so it is hard to reach the global optimum through an optimization method such as the level set evolution process. This problem brings two inevitable drawbacks: first, the solution is sensitive to the initialization; second, the segmentation result has obvious errors. Actually, the global fitting energy has fewer local optima than the local fitting energy; that is, the global-homogeneous method has a more stable solution than the local-homogeneous method. The main reason is that the global fitting energy utilizes a strong intensity constraint, whereas the local fitting energy weakens this intensity constraint through its local characteristic. Thus, it is reasonable to add other local intensity constraints. Here, we adopt the local order energy as the local intensity constraint. The purposes of using this energy are twofold. First, using local constraints handles the inhomogeneity problem better than using global constraints, so detailed information can be well segmented. Second, adding an extra constraint markedly decreases the number of local optima, so the segmentation result becomes stable and has small errors. In the following, we first give a straightforward example to illustrate the local-optimum problem. Then, we present the local order energy, which is an important local intensity constraint.
Fig. 2 presents a straightforward example of the local-optimum problem. We use a simple synthetic image with half bright pixels and half dark pixels as the test image. As shown in Fig. 2, the curves C_right (in red) and C_wrong (in blue) represent the right and wrong segmentation results, respectively. When the local cluster centers c_in(x) and c_out(x) are specified appropriately, both local fitting energies E^local(C_right, c_in(x), c_out(x)) and E^local(C_wrong, c_in(x), c_out(x)) reach local minima; that is, both are locally optimal solutions. As shown in the second row of Fig. 2, we assign the three boundary
pixels (inside the green dashed curve) c_in(x) = 1, and the other pixels c_in(x) = 1 and c_out(x) = 0. Here, the local fitting energy E^local(C_right, c_in(x), c_out(x)) reaches a local minimum. As shown in the third row of Fig. 2, we instead assign the three boundary pixels c_out(x) = 1, and the other pixels c_in(x) = 1 and c_out(x) = 0. Here, the local fitting energy E^local(C_wrong, c_in(x), c_out(x)) reaches a local minimum. Note that the curve C_wrong is a wrong segmentation result, and other wrong results similar to C_wrong can be enumerated in the same way.
Fig. 2. Overview of the comparison of two segmentation results. 1 and 0 are the intensity values of the image. The first row gives two segmentation curves Cright (in red) and Cwrong (in blue). The second row presents the cin (x) and cout (x) of each pixel when curve is Cright , while the third row presents them when curve is Cwrong . Note that, the symbol x means the random value.
There are two types of such values: the values of the non-target cluster centers c_out(x) inside the curve C, and the values of the target cluster centers c_in(x) outside the curve C. From Eqn. 2 and Eqn. 3, we find that these two types of values have no effect on the local fitting energy; that is, we can set them to random values. As shown in Fig. 2, we use the symbol x to represent a random value. However, these values have useful meanings in
practice: c_in(x) is the local target cluster center, and c_out(x) is the local non-target cluster center. Moreover, these values are used recursively when performing the level set evolution process, which is described in Subsection 2.3 (a similar process is described in detail in [10]). In this work, these values are used to formulate the local order energy, which is a local intensity prior constraint. In a small local region of a medical image, the intensity values usually have an order; for example, the target intensity values are larger than the non-target ones. Thus, the target cluster center value c_in(x) is larger than the non-target cluster center value c_out(x) at position x. This local order information is represented by the local order energy, given by

E^order(C, c_in, c_out) = ∫_Ω sgn(c_out(x) − c_in(x)) dx     (4)
where sgn(·) is the sign function,

sgn(x) = 1 if x > 0, and 0 otherwise.     (5)
As shown in Eqn. 4, the local order energy uses all cluster centers, including the above two types of values. Furthermore, the local order energy is an important local intensity constraint, because it requires the local target intensity to be statistically larger than the non-target intensity. Combining it with the local fitting energy defined in Eqn. 2, we obtain the new local energy E,

E = E^local + E^order     (6)
where E^local and E^order are defined by Eqn. 2 and Eqn. 4, respectively. Since the local fitting energy and the local order energy are both defined in terms of the contour C, the final energy E is a function of the contour C as well.

2.3 Level Set Formulation and Energy Minimization
A level set formulation is used to minimize the above energy E. In level set methods, a contour C is represented by the zero level set of a Lipschitz function φ : Ω → R, called a level set function. Accordingly, the energy of our method can be defined as

F(φ, c_in, c_out) = λ_in ∫∫ K_σ(x − y) |I(y) − c_in(x)|² H(φ(y)) dy dx + λ_out ∫∫ K_σ(x − y) |I(y) − c_out(x)|² (1 − H(φ(y))) dy dx + γ ∫_Ω sgn(c_out(x) − c_in(x)) dx + μL(φ) + νP(φ)     (7)
where λ_in, λ_out, γ, μ, and ν are weighting constants, H(·) is the Heaviside function, L(φ) = ∫ δ(φ(x)) |∇φ(x)| dx is the length term, δ(·) is the derivative of the Heaviside function, and P(φ) = ½ ∫ (|∇φ(x)| − 1)² dx. The term P(φ) is the regularization term [3], which serves to maintain the regularity of the level set function. In the implementation, the Heaviside function is approximated by the smooth function H_ε(x) = ½ (1 + (2/π) arctan(x/ε)), and the corresponding derivative is δ_ε(x) = H'_ε(x) = ε / (π(ε² + x²)). The coefficient is set to ε = 1.0. The standard gradient descent method is used to minimize the energy F(φ, c_in, c_out) in two sub-steps. In the first step, we fix the level set function φ and calculate the two cluster centers c_in(x) and c_out(x). In the second step, we fix the two cluster centers c_in(x) and c_out(x) and update the level set function φ. A variational method is first applied to calculate c_in and c_out for a fixed level set function φ, subject to the order constraint

c_in = max(c_in, c_out),   c_out = min(c_in, c_out)     (8)

yielding,
c_in = [K_σ(x) ⊗ (H(φ(x)) I(x))] / [K_σ(x) ⊗ H(φ(x))],   c_out = [K_σ(x) ⊗ ((1 − H(φ(x))) I(x))] / [K_σ(x) ⊗ (1 − H(φ(x)))]     (9)

where ⊗ is the convolution operation. Then, keeping c_in and c_out fixed, the energy functional is minimized with respect to the level set function φ using standard gradient descent,

∂φ/∂t = − δ_ε(φ) (λ_in T_in − λ_out T_out) + μ δ_ε(φ) div(∇φ/|∇φ|) + ν (∇²φ − div(∇φ/|∇φ|))     (10)

where T_in and T_out are defined as

T_in(x) = ∫ K_σ(y − x) |I(y) − c_in(x)|² dy,   T_out(x) = ∫ K_σ(y − x) |I(y) − c_out(x)|² dy     (11)
More details on the implementation of the proposed local intensity prior level set method are given in Alg. 1.
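A minimal NumPy/SciPy sketch of the two sub-steps (Eqns. 8–10) is given below. The smoothed Heaviside, the finite-difference curvature, the parameter values, and our reading of Eqn. 8 as a pointwise swap enforcing c_in ≥ c_out are simplified assumptions, not the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def heaviside(phi, eps=1.0):
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def dirac(phi, eps=1.0):
    return eps / (np.pi * (eps ** 2 + phi ** 2))

def curvature(phi):
    # div(grad(phi)/|grad(phi)|) with central differences.
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    nyy, _ = np.gradient(gy / norm)
    _, nxx = np.gradient(gx / norm)
    return nxx + nyy

def evolve_step(I, phi, sigma=3.0, lam_in=1.0, lam_out=1.0,
                mu=1.0, nu=1.0, dt=0.1):
    H = heaviside(phi)
    conv = lambda f: gaussian_filter(f, sigma)

    # Eqn. 9: locally weighted cluster centres via Gaussian convolutions.
    c_in = conv(H * I) / (conv(H) + 1e-8)
    c_out = conv((1 - H) * I) / (conv(1 - H) + 1e-8)
    # Eqn. 8 (our reading): enforce the order c_in >= c_out by swapping.
    c_in, c_out = np.maximum(c_in, c_out), np.minimum(c_in, c_out)

    # Eqn. 11: T_in/T_out expanded so they can be evaluated with convolutions.
    kI, kI2, k1 = conv(I), conv(I ** 2), conv(np.ones_like(I))
    T_in = kI2 - 2 * c_in * kI + (c_in ** 2) * k1
    T_out = kI2 - 2 * c_out * kI + (c_out ** 2) * k1

    # Eqn. 10: one explicit gradient-descent step on phi.
    kappa = curvature(phi)
    dphi = (-dirac(phi) * (lam_in * T_in - lam_out * T_out)
            + mu * dirac(phi) * kappa
            + nu * (laplace(phi) - kappa))
    return phi + dt * dphi
```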
3 Experimental Results
A number of real medical images are used to evaluate the performance of the proposed scheme in comparison with two traditional methods, i.e. Chan et al. [4] and Li et al. [10], which are representative of the global-homogeneous and local-homogeneous methods, respectively. In the implementation, the coefficients are
Algorithm 1. Local Intensity Prior Level Set Segmentation
Data: initial level set φ_initial, input image I, max iteration maxiter, and weighting constants
Result: level set φ and segmentation result
1: Initialize the output level set function φ^1 = φ_initial;
2: for i ← 1 to maxiter do
3:   Calculate c_in and c_out based on φ^i by Eqn. 8 and Eqn. 9;
4:   Calculate the level set φ^{i+1} based on c_in and c_out by Eqn. 10 and Eqn. 11;
5:   if φ^i equals φ^{i+1} then
6:     Break the iteration;
7:   end
8: end
9: Save the level set φ, and segment the input image based on φ;
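A hypothetical driver loop in the spirit of Alg. 1 could look as follows; the function name, the convergence test on the maximum change of φ, and the stand-in `evolve_step` argument (e.g. the sketch shown after Eqn. 11) are our own choices.

```python
import numpy as np

def segment(I, phi_init, evolve_step, max_iter=500, tol=1e-4):
    """Driver loop in the spirit of Alg. 1: evolve phi until it stops changing
    and return the level set together with the binary segmentation (phi > 0)."""
    phi = phi_init.copy()
    for _ in range(max_iter):
        phi_next = evolve_step(I, phi)
        if np.max(np.abs(phi_next - phi)) < tol:   # "phi_i equals phi_{i+1}"
            phi = phi_next
            break
        phi = phi_next
    return phi, phi > 0
```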
Fig. 3. The comparison of our method to Chan [4] and Li [10] on CT image
empirically set as λ_in = 1, λ_out = 1, ν = 1, μ = 0.004 · h · w, where h and w are the height and width of the test image. All the medical images were downloaded from two publicly available websites (listed below). A first visual comparison is shown in Fig. 1, and more comparisons are illustrated in Fig. 3, Fig. 4, and Fig. 5. We use three different image acquisition techniques, i.e. CT, MR, and ultrasound, to evaluate our approach. As illustrated in these figures, although the images are rather noisy and parts of the boundaries are weak, the detailed information is still segmented very well compared to Chan's [4] and Li's [10] methods. Furthermore, the proposed approach is not sensitive to the initialization, in contrast to Li's [10] method.
1. http://www.bic.mni.mcgill.ca/brainweb/
2. http://www.ece.ncsu.edu/imaging/Archives/ImageDataBase/Medical/index.html
Fig. 4. The comparison of our method to Chan [4] and Li [10] on Ultrasound image
Fig. 5. The comparison of our method to Chan [4] and Li [10] on MR image
The numeric comparisons are presented in Fig. 6 and Fig. 7. The segmentation results are compared to the ground truth using the following two error measures: the false negative (FN) rate, which measures the number of target pixels that are missed, and the false positive (FP) rate, which measures the number of non-target pixels marked as target. Let R^our_target denote the target region produced by our method; similarly, R^chan_target, R^li_target and R^gt_target denote the target regions of Chan's method, Li's method, and the ground truth. The FN and FP are then defined as

FP = (|R^*_target| − |R^*_target ∩ R^gt_target|) / |R^gt_target|,   FN = (|R^gt_target| − |R^*_target ∩ R^gt_target|) / |R^gt_target|     (12)
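As a small illustration of Eqn. 12, the helper below (our own, assuming the regions are given as boolean masks) computes the two rates from a segmentation result and a ground-truth mask.

```python
import numpy as np

def fn_fp(result_mask, gt_mask):
    """False negative and false positive rates of Eqn. 12, both normalised by
    the size of the ground-truth target region."""
    result, gt = result_mask.astype(bool), gt_mask.astype(bool)
    overlap = np.logical_and(result, gt).sum()
    gt_size = gt.sum()
    fp = (result.sum() - overlap) / gt_size
    fn = (gt_size - overlap) / gt_size
    return fn, fp

if __name__ == "__main__":
    gt = np.zeros((10, 10), bool); gt[2:8, 2:8] = True
    seg = np.zeros((10, 10), bool); seg[3:9, 3:9] = True
    print(fn_fp(seg, gt))   # (FN, FP) for a slightly shifted segmentation
```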
          Chan                                     Li                                       Our
          FN            FP                         FN            FP                         FN            FP
          Mean   Cov    Mean   Cov    Time(s)      Mean   Cov    Mean   Cov    Time(s)      Mean   Cov    Mean   Cov    Time(s)
Scene 1   0.131  0.001  0.018  0.000  3.293        0.142  0.125  0.163  0.151  12.49        0.026  0.000  0.019  0.000  12.91
Scene 2   0.144  0.001  0.021  0.001  3.395        0.152  0.153  0.102  0.173  13.16        0.017  0.000  0.015  0.000  13.46
Scene 3   0.123  0.000  0.012  0.000  3.351        0.178  0.203  0.128  0.145  13.04        0.020  0.000  0.017  0.000  13.12
Scene 4   0.126  0.000  0.014  0.001  3.362        0.101  0.146  0.137  0.126  13.07        0.018  0.000  0.016  0.000  13.22
Scene 5   0.132  0.001  0.016  0.000  3.325        0.122  0.116  0.124  0.136  12.86        0.011  0.000  0.014  0.000  13.10
Scene 6   0.126  0.000  0.014  0.001  3.338        0.186  0.107  0.149  0.127  12.81        0.019  0.000  0.021  0.000  13.09
Fig. 6. An overview of the comparisons to Chan's [4] and Li's [10] methods using the two error measures, i.e. false negative (FN) and false positive (FP)
Fig. 7. Comparisons to Chan's [4] and Li's [10] methods when adding noise to the images. The horizontal axis shows the power of the noise, while the vertical axes show the two error measures, i.e. the false negative (FN) and false positive (FP) rates.
where ∗ ∈ {chan, li, our}. In Fig. 6, six different scenes are chosen, and for each scene, 100 experiments are performed with different initial contours. As shown in the figure, both the means and the variances of the above two error measures are smaller than those of the traditional methods. That is to say, the proposed method has better accuracy (smaller error mean) and stability (smaller error variance) than the traditional ones. Furthermore, the computation cost of our approach is similar to that of Li's approach. Fig. 7 shows the comparisons when noise is added to the images. As illustrated in the figure, the error of our method increases more slowly than that of the traditional methods as the power of the noise grows; that is, the proposed method is less sensitive to noise. We also apply Chan's, Li's, and our methods to medical image reconstruction. The visual result of our method is shown in Fig. 8. As shown in this figure, the brain tissues are preserved well, mainly because our method gives an excellent segmentation result. The numerical comparison of reconstruction
Fig. 8. Medical image reconstruction result by our method

Table 1. The comparison of reconstruction errors

Algorithm   Chan's [4]   Li's [10]   Ours
FN          0.161        0.143       0.019
FP          0.016        0.126       0.017
error is shown in Tab. 1. The error is defined as the mean of the FN and FP over all slices,

\overline{FN} = (1/N) Σ_{i=1}^{N} FN_i,   \overline{FP} = (1/N) Σ_{i=1}^{N} FP_i     (13)

where N is the number of slices, and FN_i, FP_i are the FN and FP of the i-th slice (defined in Eqn. 12). As shown in this table, our two errors are clearly smaller than Chan's and Li's.
4 Conclusions and Future Works
In this work, we propose a novel region-based level set segmentation method. The main contribution is the local order energy, which represents a local intensity prior constraint. The proposed algorithm has been shown in a number of experiments to address the medical image segmentation problem both efficiently and effectively. Compared to the traditional methods, the proposed method has several advantages and improvements: 1. detailed information is well segmented; 2. the main drawbacks of the traditional local-homogeneous methods are well addressed; 3. the proposed method is insensitive to noise. Future work will add other local intensity constraints to improve the segmentation algorithm, or add shape information
to enhance the segmentation results. Moreover, the proposed algorithm can be used not only for medical image segmentation, but also for other types of images. Acknowledgement. This work was supported by the National Natural Science Foundation of China (Grant No. 60873161 and Grant No. 60975037).
References
1. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. IJCV 22, 61–79 (1997)
2. Yezzi, A., Kichenassamy, S., Kumar, A., Olver, P., Tannenbaum, A.: A geometric snake model for segmentation of medical imagery. IEEE Trans. on MI 16, 199–209 (1997)
3. Li, C., Xu, C., Gui, C., Fox, M.D.: Level set evolution without re-initialization: A new variational formulation. In: CVPR, pp. 430–436 (2005)
4. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on IP 10, 266–277 (2001)
5. Tsai, A., Yezzi, A., Jr., Willsky, A.S.: Curve evolution implementation of the Mumford–Shah functional for image segmentation, denoising, interpolation, and magnification. IEEE Trans. on IP 10, 1169–1186 (2001)
6. Vese, L.A., Chan, T.F.: A multiphase level set framework for image segmentation using the Mumford and Shah model. IJCV 50, 271–293 (2002)
7. Li, C., Kao, C.Y., Gore, J.C., Ding, Z.: Implicit active contours driven by local binary fitting energy. In: CVPR, pp. 1–7 (2007)
8. An, J., Rousson, M., Xu, C.: Γ-convergence approximation to piecewise smooth medical image segmentation. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 495–502. Springer, Heidelberg (2007)
9. Piovano, J., Rousson, M., Papadopoulo, T.: Efficient segmentation of piecewise smooth images. In: SSVMCV, pp. 709–720 (2007)
10. Li, C., Kao, C.Y., Gore, J.C., Ding, Z.: Minimization of region-scalable fitting energy for image segmentation. IEEE Trans. on IP 17, 1940–1949 (2008)
11. Li, C., Huang, R., Ding, Z., Gatenby, C., Metaxas, D., Gore, J.: A variational level set approach to segmentation and bias correction of images with intensity inhomogeneity. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (eds.) MICCAI 2008, Part II. LNCS, vol. 5242, pp. 1083–1091. Springer, Heidelberg (2008)
12. Li, C., Gatenby, C., Wang, L., Gore, J.C.: A robust parametric method for bias field estimation and segmentation of MR images. In: CVPR (2009)
13. Li, C., Li, F., Kao, C.Y., Xu, C.: Image segmentation with simultaneous illumination and reflectance estimation: An energy minimization approach. In: ICCV (2009)
14. Wang, L., Macione, J., Sun, Q., Xia, D., Li, C.: Level set segmentation based on local Gaussian distribution fitting. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5994, pp. 293–302. Springer, Heidelberg (2010)
15. Xiang, S., Nie, F., Zhang, C., Zhang, C.: Interactive natural image segmentation via spline regression. IEEE Trans. on IP 18, 1623–1632 (2009)
Geometries on Spaces of Treelike Shapes

Aasa Feragen1, Francois Lauze1, Pechin Lo1, Marleen de Bruijne1,2, and Mads Nielsen1

1 eScience Center, Dept. of Computer Science, University of Copenhagen, Denmark
2 Biomedical Imaging Group Rotterdam, Depts. of Radiology & Medical Informatics, Erasmus MC – University Medical Center Rotterdam, The Netherlands
{aasa,francois,pechin,marleen,madsn}@diku.dk
Abstract. In order to develop statistical methods for shapes with a tree-structure, we construct a shape space framework for treelike shapes and study metrics on the shape space. The shape space has singularities, which correspond to topological transitions in the represented trees. We study two closely related metrics, TED and QED. The QED is a quotient euclidean distance arising from the new shape space formulation, while TED is essentially the classical tree edit distance. Using Gromov’s metric geometry we gain new insight into the geometries defined by TED and QED. In particular, we show that the new metric QED has nice geometric properties which facilitate statistical analysis, such as existence and local uniqueness of geodesics and averages. TED, on the other hand, has algorithmic advantages, while it does not share the geometric strongpoints of QED. We provide a theoretical framework as well as computational results such as matching of airway trees from pulmonary CT scans and geodesics between synthetic data trees illustrating the dynamic and geometric properties of the QED metric.
1 Introduction
Trees are fundamental structures in nature, where they describe delivery systems for water, blood and other fluids, or different types of skeletal structures. Trees and graphs also track evolution, for instance in genetics [3]. In medical image analysis, treelike shapes appear when studying structures such as vessels [4, 5] or airways [6–8], see Fig. 1. In more general imaging problems, they appear in the form of shock graphs [9] describing general shapes.
Fig. 1. Treelike structures found in airways and vessels in lungs [1], in breast cancer vascularization, and in retinal blood vessels [2]
An efficient model for tree-shape would have applications such as shape recognition and classification. In medical imaging, this could give new insight into anatomical structure and biomarkers for diseases that affect shape – e.g. COPD affects the edge shapes and topological structures of airway trees [10]. We want to do statistical analysis on sets of treelike shapes, and in order to meaningfully define averages and modes of variation in analogy with the classical PCA analysis, we need at least local existence and uniqueness of geodesics and averages. Immediate questions are then: How do we model treelike structures and their variation? Can we encode global, tree-topological structure as well as local edgewise geometry in the geometry of a single shape space? Do geodesics exist, and are they unique? Do the dynamics of geodesics reflect natural shape variation? Here we provide the theoretical foundations needed to answer these questions, accompanied by a first implementation. We define shape spaces of ordered and unordered treelike shapes, where transitions in internal tree-topological structure are found at shape space singularities in the form of self-intersections. This is a general idea, which can be adapted to other situations where shapes with varying topology are studied – e.g. graphs or 3D shapes described by medial structures. The paper is organized as follows: In Section 1.1 we give a brief overview of related work. In Section 2 we define the treespace. Using Gromov's approach to metric geometry [11] we gain insight into the geometric properties of two different metrics; one which is essentially tree edit distance (TED) and one which is a quotient euclidean distance (QED). We pay particular attention to the properties of geodesics and averages, which are essential for statistical shape analysis. In Section 4 we discuss how to overcome the computational complexity of both metrics. In particular, the computational cost can be drastically reduced by interpreting order as an edge semi-labeling, which can often be obtained through geometric or anatomic restrictions. In Section 5 we discuss a simple QED implementation, and in Section 6 we illustrate the properties of QED through computed geodesics and matchings for synthetic planar trees and 3D pulmonary airway trees. The paper contains mathematical results, but proofs are omitted due to length constraints.

1.1 Related Work
Metrics on sets of trees have been studied by different research communities over the past 20 years. The best-known approach in the computer vision community is perhaps TED, as used for instance by Sebastian et al. [9] for comparing shapes via their shock graph representations. TED performs well for tasks such as matching and registration [9]. Our goal is, however, to adapt classical shape statistics to treelike shapes. The TED metric will nearly always have infinitely many geodesics between two given trees, and thus it is no longer sufficient, since it becomes hard to meaningfully define and compute an average shape or find modes of variation. Another approach to defining a metric on treespace is that of Wang and Marron et al. [12]. They define a metric on attributed trees as well as a "median-mean"
tree, which is not unique, and a version of PCA which finds modes of variation in terms of so-called treelines, encoding the maximum amount of structural and attributal variation. Wang and Marron analyze datasets consisting of brain blood vessels, which are trees with few, long branches. The metric is not suitable for studying large trees with a lot of topological variation and noise, such as airways, as it punishes structural changes much more severely than shape variation with constant topological structure. A rather different approach is that of Jain and Obermayer [13], who define metrics on graphs. Here, graphs are modeled as incidence matrices, and the space of graphs, which is a quotient by the group of vertex relabelings, is given a metric inherited from Euclidean space. Means are computed using Lipschitz analysis, giving fast computations; however, the model is rigid when it comes to modeling topological changes, which is essential for our purposes. We have previously [14] studied geodesics between small planar embedded trees in the same type of singular shape space as studied here. These geodesics are fundamental in the sense that they represent the possible structural changes found locally even in larger trees.
2 Geometry of Treespace
Before defining a treespace and giving it a geometric structure, let us discuss which properties are desirable. In order to compute average trees and analyze variation in datasets, we require existence and uniqueness properties for geodesics. When geodesics exist, we want the topological structure of the intermediate trees to reflect the resemblance in structure of the trees being compared – in particular, a geodesic passing through the trivial one-vertex tree should indicate that the trees being compared are maximally different. Perhaps more important, we would like to compare trees where edge matching is inconsistent with tree topology as in Fig. 2a; specifically, we would like to find geodesic deformations in which the tree topology changes when we have such edge matchings, for instance as in Fig. 2b.
Fig. 2. A good metric must handle edge matchings which are inconsistent with tree topology. (a) Edge matching. (b) Geodesic candidate.
2.1 Representation of Trees
We represent any treelike (pre-)shape as a pair (T, x) consisting of a rooted, planar binary tree T = (V, E, r) with edge attributes describing edge shape. Here, T describes the tree topology, and the attributes describe the edge geometry. The attributes are represented by a map x : E → A or, equivalently, by a point
x ∈ A^E, where A is the attribute space. Tree-shapes which are not binary are represented by the maximal binary tree in a very natural way by allowing constant edges, represented by the zero scalar or vector attribute. Endowing an internal edge with a zero attribute corresponds to collapsing that edge, see Fig. 4a, and hence an arbitrary attributed tree can be represented as an attributed binary tree. The intuitive idea is illustrated in Fig. 3.
Fig. 3. Treelike shapes are encoded by an ordered binary tree and a set of attributes describing edge shape
The attributes could be edge length, landmark points or edge parametrizations, for instance. In this work, we use trees whose attributes are open curves translated to start at the origin, described by a fixed number of landmark points. Thus, throughout the paper, the attribute space is (R^d)^n, where d = 2 or 3 and n is the number of landmark points per edge. Collapsed edges are represented by a sequence of origin points. We need to represent trees of different sizes and structures in a unified way in order to compare them. As a first step, we describe all shapes using the same binary tree T to encode the topology of all the tree-shapes in the treespace – by choosing a sufficiently large binary tree we can represent all the trees in our dataset by filling out with collapsed edges. We call T the maximal binary tree. When given a left-right order on every pair of children in a binary tree, there is an implicitly defined embedding of the tree in the plane. Hence we shall use the terms "planar tree" and "ordered tree" interchangeably, depending on whether we are interested in the planar properties or the order of the tree. We initially define our metric on the set of ordered binary trees; later we use it to compute distances between unordered trees by considering all possible orders. In Section 4 we discuss how to handle the computational challenges of the unordered case. Given an ordered maximal binary tree T_n of depth n with edges E, any attributed tree T is represented by a point in X = ∏_{e∈E} (R^d)^n. We call X the tree preshape space, since it parametrizes all possible shapes.

2.2 The Space of Trees as a Quotient – A Singular Treespace
We fix the maximal depth n binary tree Tn which encodes the connectivity of the trees. In the preshape space defined above, there will be several points corresponding to the same tree, as illustrated in Fig. 4a. We go from preshapes to shapes by identifying those preshapes which define the same shape. Consider two ordered tree-shapes where internal edges are collapsed. The orders of the original trees induce orders on the collapsed trees. We consider two ordered tree-shapes to be the same when their collapsed ordered topological structures are identical, and the edge attributes on corresponding non-collapsed edges are identical as well, as in Fig. 4a. Thus, tree identifications come with
Fig. 4. (a) Higher-degree vertices are represented by collapsing internal edges. The dotted line indicates zero attribute, which gives a collapsed edge. We identify those tree preshapes whose collapsed structures are identical – here x1 and x2 represent the same tree T . (b) The TED moves: Remove an edge a, add an edge b, deform an edge c.
an inherent bijection of subsets of E: If we identify x, y ∈ X = (R^N)^E, denote E_1 = {e ∈ E | x_e ≠ 0} and E_2 = {e ∈ E | y_e ≠ 0}; the identification comes with an order-preserving bijection ϕ : E_1 → E_2 identifying those edges that correspond to the same edge in the collapsed tree-shape. Note that ϕ will also correspond to similar equivalences of pairs of trees with the same topology, but other attributes. The bijection ϕ next induces a bijection Φ : V_1 → V_2 given by Φ : (x_e) → (x_{ϕ(e)}). Here, V_1 = {x ∈ X | x_e = 0 if e ∉ E_1} and V_2 = {x ∈ X | x_e = 0 if e ∉ E_2} are subspaces of X where, except for at the axes, the topological tree structure is constant, and for x ∈ V_1, Φ(x) ∈ V_2 describes the same shape as x. We define a map Φ for each pair of identified tree-structures, and form an equivalence on X by setting x ∼ Φ(x) for all x and Φ. For each x ∈ X we denote by x̄ the equivalence class {x' ∈ X | x' ∼ x}. The quotient space X̄ = (X/∼) = {x̄ | x ∈ X} of equivalence classes x̄ is the space of treelike shapes. The geometric interpretation of the identification made in the quotient is that we are folding and gluing our space along the identified subspaces; i.e. when x_1 ∼ x_2, we glue the two points x_1 and x_2 together. See the toy quotient space in Fig. 5 for an intuitive lower-dimensional illustration.

2.3 Metrics on Treespace
Given a metric d on the Euclidean space X = ∏_{e∈E} R^N, we define the standard quotient pseudometric [15] d̄ on the quotient space X̄ = X/∼ by setting

d̄(x̄, ȳ) = inf { Σ_{i=1}^{k} d(x_i, y_i) | x_1 ∈ x̄, y_i ∼ x_{i+1}, y_k ∈ ȳ }.     (1)
This amounts to finding the optimal path from x̄ to ȳ passing through the identified subspaces, as shown in Fig. 5. We define two metrics on X, which come from two different ways of combining the individual edge distances: the metrics d_1 and d_2 on X = ∏_{e∈E} R^N are induced by the norms ‖x − y‖_1 = Σ_{e∈E} ‖x_e − y_e‖ and ‖x − y‖_2 = ( Σ_{e∈E} ‖x_e − y_e‖² )^{1/2}. From now on, d and d̄ will denote either the metrics d_1 and d̄_1, or d_2 and d̄_2. We have the following:

Theorem 1. The distance function d̄ restricts to a metric on X̄, which is a contractible, complete, proper geodesic space.
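As a concrete reading of the two norms, the sketch below evaluates d_1 and d_2 on the preshape space for two trees given as dictionaries mapping edges of the maximal binary tree to landmark-point arrays (zero arrays for collapsed edges). Note that these are the preshape distances only; the quotient infimum of Eqn. 1 is not computed here, and the data layout is our own illustrative choice.

```python
import numpy as np

def d1(x, y):
    """TED-like preshape distance: sum over edges of the edgewise norms."""
    return sum(np.linalg.norm(x[e] - y[e]) for e in x)

def d2(x, y):
    """QED preshape distance: Euclidean norm over all edge attributes."""
    return np.sqrt(sum(np.linalg.norm(x[e] - y[e]) ** 2 for e in x))

if __name__ == "__main__":
    # Two toy trees on a maximal tree with edges "a", "b", "c"; each attribute
    # is 5 landmark points in R^2, and an all-zero array means a collapsed edge.
    x = {"a": np.ones((5, 2)), "b": np.zeros((5, 2)), "c": 2 * np.ones((5, 2))}
    y = {"a": np.ones((5, 2)), "b": np.ones((5, 2)), "c": np.zeros((5, 2))}
    print("d1:", d1(x, y), " d2:", d2(x, y))
```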
This means, in particular, that given any two trees, we can always find a geodesic between them in both metrics d̄_1 and d̄_2. Note that d̄_1 is essentially the classical tree edit distance (TED) metric, where the distance between two trees is the minimal sum of costs of edge edits needed to transform one tree into another, see Fig. 4b. The metric d̄_2 is a descent of the Euclidean metric, and geodesics in this metric are concatenations of straight lines in flat regions. We call it the QED metric, for quotient euclidean distance. In Section 3 we describe the properties of these metrics with example geodesic deformations between simple trees. Using methods from metric geometry [11] we obtain results on curvature and average trees. We consider two different types of average tree, namely the centroid, as defined and computed in [3], and the circumcenter, defined as follows: Given a set S ⊂ X̄ with diameter D, a circumcenter of S is defined as a point x̄ ∈ X̄ such that S ⊂ B̄(x̄, D/2) = {z̄ ∈ X̄ | d̄_2(x̄, z̄) ≤ D/2}.

Theorem 2. i) Endow X̄ with the QED metric d̄_2. A generic point x̄ ∈ X̄ has a neighborhood U ⊂ X̄ in which the curvature is non-positive. At non-generic points, the curvature of (X̄, d̄_2) is unbounded.
ii) Endow X̄ with the QED metric d̄_2. Given a generic point x̄ ∈ X̄, there exists a radius r_x̄ such that sets contained in the ball B(x̄, r_x̄) have unique centroids and circumcenters.
iii) Endow X̄ with the TED metric d̄_1. The curvature of (X̄, d̄_1) is everywhere unbounded, and (X̄, d̄_1) does not have locally unique geodesics anywhere.

The practical meaning of Theorem 2 is that i) we can use techniques from metric geometry to look for QED averages, ii) for datasets whose scattering is not too large, there exist unique centroids and circumcenters for the QED metric, and iii) we cannot use the same techniques in order to prove existence or uniqueness of centerpoints for the TED metric; in fact, any geometric method which requires bounded curvature is going to fail for the TED metric. This result motivates our study of the QED metric.

2.4 From Planar Trees to Spatial Trees
The world is not flat, and for most applications it is necessary to study embedded trees in R^3. As far as attributes in terms of edge embeddings and landmarks are concerned, this is not very different from the planar embedded trees – the main difference from the R^2 case comes from the fact that trees in R^3 have no canonical edge order. The left-right order on children of planar trees gives an implicit preference for edge matchings, and hence reduces the number of possible matches. When we no longer have this preference, we generally need to consider all orderings of the same tree and choose the one which minimizes the distance. We define the space of spatial treelike shapes as the quotient X̄/G, where G is the group of reorderings of the standard tree; G is a finite group. The metric d̄ on X̄ induces a quotient pseudometric on X̄/G. We can prove:

Theorem 3. For the quotient pseudometric induced by either d̄_1 or d̄_2, this function is a metric, and X̄/G equipped with it is a contractible, complete, proper geodesic space.
Fig. 5. (a) A 2-dimensional toy version of the folding of Euclidean space along two linear subspaces, along with geodesic paths from x̄ to ȳ and from x̄′ to ȳ′. (b) Local-to-global properties of TED: d̄_1(T_1, T_2) = d̄_1(T_{1,1}, T_{2,1}) + d̄_1(T_{1,2}, T_{2,2}).
While considering all different possible orderings of the tree makes perfect sense from the geometric point of view, in reality this becomes an impossible task as the size of the trees grows beyond a few generations. In real applications we can, however, efficiently reduce complexity by taking tree- and treespace geometry into account. This is discussed in Section 4.
3 Comparison of the QED and TED Metrics
In this section we list the main differences between the TED and QED metrics and compare their performance on the small trees studied in [14].

Geometry. As shown in Theorem 1 and Theorem 3 above, both X̄ and X̄/G are complete geodesic spaces. However, by Theorem 2, the QED metric gives locally non-positive curvature at generic points, while the TED metric gives unbounded curvature everywhere on X̄. This means that we cannot imitate the classical statistical procedures on shape spaces using the TED metric. Note also that the QED metric is the quotient metric induced from the Euclidean metric on the preshape space X, making it the obvious choice for a metric seen from the shape space point of view.

Computation. The TED metric has nice local-to-global properties, as illustrated in Fig. 5b. If the trees T_1 and T_2 are decomposed into subtrees T_{1,1}, T_{1,2} and T_{2,1}, T_{2,2} as in Fig. 5b, such that the geodesic from T_1 to T_2 restricts to geodesics between T_{1,1} and T_{2,1} as well as between T_{1,2} and T_{2,2}, then d(T_1, T_2) = d(T_{1,1}, T_{2,1}) + d(T_{1,2}, T_{2,2}). This property is used in many TED algorithms, and the same property does not hold for the QED metric.

Performance. We have previously [14] studied fundamental geodesic deformations for the QED metric; that is, deformations between depth 3 trees which encode the local topological transformations found in generic tree-geodesics. These deformations are important for determining whether an internal tree-topological transition is needed to transform one (sub)tree into another – i.e. whether the correct edge registration is one which is inconsistent with the tree topology.
Fig. 6. (a) The fundamental geodesic deformations: geodesic 1 goes through an internal structural change, while geodesic 2 does not, as the disappearing branches approach the zero attribute. (b) Two options for structural change.
To compare the TED and QED metrics on small, simple trees, consider the two tree-paths in Fig. 6b, where the edges are endowed with non-negative scalar attributes a, b, c, d, e describing edge length. Path 1 indicates a matching of the identically attributed edges c and d, while Path 2 does not make the match. Now, the cost of Path 1 is 2e in both metrics, while the cost of Path 2 is 2√(c² + d²) in the QED metric and 2(c + d) in the TED metric. In particular, TED will choose to identify the c and d edges whenever e² ≤ c² + 2cd + d², while QED makes the match whenever e² ≤ c² + d². That is, TED is more prone to internal structural changes than QED. This is also seen empirically in the comparison of TED and QED matching in Fig. 7. Note that although TED is more prone to matching trees with different tree-topological structures, the matching is similar.
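The decision rule in this example is easy to check numerically; the helper below is our own small illustration of the two path costs and of the regime in which the metrics disagree.

```python
import math

def prefers_structural_match(c, d, e):
    """Path 1 (identify the c and d edges) costs 2e in both metrics; Path 2
    costs 2*sqrt(c^2 + d^2) under QED and 2*(c + d) under TED.  Report which
    metric prefers the structural match (Path 1)."""
    return {
        "TED": 2 * e <= 2 * (c + d),                   # e^2 <= c^2 + 2cd + d^2
        "QED": 2 * e <= 2 * math.sqrt(c * c + d * d),  # e^2 <= c^2 + d^2
    }

if __name__ == "__main__":
    # With c = d = 1 and e = 1.7, TED identifies c and d but QED does not.
    print(prefers_structural_match(1.0, 1.0, 1.7))  # {'TED': True, 'QED': False}
```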
Fig. 7. Given a set of five data trees, we match each tree to the four others in both metrics. (a) Matching in the QED metric. (b) Matching in the TED metric.
4 Computation and Complexity
Complexity is a problem with computing both TED and QED distances, in particular for 3D trees, which do not have a canonical planar order. Here we discuss how to use geometry and anatomy to find approximations of the metric whose complexity is significantly reduced.
Ordered trees: Reducing complexity using geometry. The definition of d̄ in (1) allows for infinitely many possible paths. However, we can significantly limit the search for a geodesic by taking the geometry of treespace into account. For instance, it is enough to consider paths going through each structural transition at most once:

Lemma 1. For the metrics d̄_1 and d̄_2 on X, the shortest path between two points in X̄ passes through each identified subspace at most once.
Geometries on Spaces of Treelike Shapes
169
Unordered trees: Reducing complexity using geometry- or anatomybased semi-labeling schemes. It is well known that the general problem of computing TED-distances between unordered trees is NP-complete [17], and the QED metric is probably generally not less expensive to compute, as indicated also by Theorem 3. Here we discuss how to use geometry and anatomy to find approximations of the metric whose complexity is significantly reduced. In particular, trees appearing in applications are usually not completely unordered, but are often semi-labeled. Semi-labelings can come from geometric or anatomical properties as in the pulmonary airway trees studied in Example 1 below, or may be obtained by a coarser registration method. (Semi-)labelings can also come from a TED distance computation or approximation, which is a reasonable way to detect approximate structural changes since the TED and QED give similar matchings, as seen in Fig. 7. Example 1 (Semi-labeling of the upper airway tree). In Fig. 9a, we see a “standard” airway tree with branch labels (or at least the first generations of it). Most airway trees, as the one shown to the left in Fig. 1, have similar, but not necessarily identical, topological structure to this one, and several branches, especially of low generation, have names and can be identified by experts. The top three generations of the airway tree serve very clear purposes in terms of anatomy. The root edge is the trachea; the second generation edges are the left and right bronchi; and the third generation edges lead to the top and towards the middle and lower lung lobes on the left, and to the top and bottom lobes on the right. Thus we find a canonical semi-labeling of the airway tree, which we use to simplify the problem of computing airway distances in Section 6.2.
5
QED Implementation
We explain a simple implementation of the QED metric. As shown in the next sections, our results are promising – the geodesics for synthetic data show the expected correspondences between edges, and in our experiments on 6 airway trees extracted from CT scans of 3 patients made at two different times, we recognize the correct patient in 5 out of 6 cases. Alignment of trees. We translate all edges to start at 0. We do not factor out scale in the treespace, because in general, we consider scale an important feature of shape. Edge scale in particular is a critical property, as the dynamics of appearing and disappearing edges is directly tied to edge size. Edge shape comparison. We represent each edge in a tree by a fixed number of landmark points – in our case 6 – evenly distributed along the edge, the first one at the origin (and hence neglected). The distance between two edge attributes v1 , v2 ∈ (Rd )5 is defined as the Euclidean distance between them. Although simple, this distance measure does take the scale of edges into account to some degree since they are aligned.
170
A. Feragen et al.
Fig. 8. Different options for structural changes in the left hand side of the depth 3 tree. Topological illustration on the left, corresponding tree-shape illustration on the right.
Implementation. In ordered depth 3 trees we make an implementation using Algorithm 1, where the number of structural changes taking place in the whole tree is limited to either 1 or 2, and in the final case we also rule out the case where the two changes happen along the same “stem”. This leaves us with the options for structural changes illustrated in Fig. 8, applied to the left and right half tree. The complexity of Algorithm 1 is O(nk · C(n, k)), where n is the number of internal vertices, k is the number of structural changes and C(n, k) is the maximal complexity of the optimization in line 8. Algorithm 1. Computing geodesics between ordered, rooted trees with up to k structural changes 1: x, y planar rooted depth n binary trees 2: S = {S} set of ordered identified pairs S = {S1 , S2 } of subspaces of X corresponding to internal topological changes, corresponding to a subspace S ¯ s.t. if {S1 , S2 } ∈ S, then also {S2 , S1 } ∈ S. of X, ˜ ≤ k do 3: for S˜ = {S 1 , . . . , S s } ⊂ S with |S| i i 4: for p ∈ S with representatives pi1 ∈i S1 and pi2 ∈ S2i do 5: p = (p1 , p2 , . . . , ps ) 2 i 6: f (p) = min{d2 (x, p1 ) + s−1 j=1 ds (pi , pi+1 ) + d2 (ps , y)} 7: end for 8: dS˜ = min f (p)|p = (p1 , . . . , ps ), pi ∈ S i , S˜ = {S 1 , . . . , S s } 9: 10: 11: 12: 13: 14:
6
pS˜ = {pi1 , pi2 }si=1 = argminf (p) end for ˜ ≤ k} d = min{dS |S˜ ⊂ S, |S| p = {p1 , p2 }si=1 = {pS˜ |dS˜ = d} geodesic = g = {x → p11 ∼ p12 → p21 ∼ p22 → . . . → ps1 ∼ ps2 → y} return d, g
Experimental Results
The QED metric is new, whereas the matching properties of the TED metric are well known [9]. In this section we present experimental results on real and synthetic data which illustrate the geometric properties of the QED metric. The experiments on airway trees in Section 6.2 show, in particular, that it is feasible to compute the metric distances on real 3D data trees.
Geometries on Spaces of Treelike Shapes
6.1
171
Synthetic Planar Trees of Depth 3
We have uploaded movies illustrating geodesics between planar depth 3 trees, as well as a matching table for a set of planar depth 3 trees, to the webpage http://image.diku.dk/aasa/ACCVsupplementary/geodesicdata.html. Note the geometrically inutitive behaviour of the geodesic deformations. 6.2
Results in 3D: Pulmonary Airway Trees
We also compute geodesics between subtrees of pulmonary airway trees. The airway trees were first segmented from low dose screening computed tomography (CT) scans of the chest using a voxel classification based airway tree segmentation algorithm by Lo et al. [18]. The centerlines were then extracted from the segmented airway trees using a modified fast marching algorithm based on [19], which was originally used in [18] for measuring the length of individual branches. Since the method also gives connectivity of parent and children branches, a tree structure is obtained directly. Leaves with a volume less than 11 mm3 were assumed to be noise and pruned away, and the centerlines were sampled with 5 landmark points on each edge. Six airways extracted from CT scans taken from three different patients at two different times were analyzed, restricting to the first six generations of the airway tree. The first three generations were identified and labeled as in Ex. 1, leaving us with four depth 3 subtrees representing the 4th to 6th generations of a set of 3D pulmonary airway trees. Algorithm 1 was run with both one and two structural changes, and no subtree geodesics (out of 24) had more than one structural change. Thus, we find it perfectly acceptable to restrict our search to paths with few structural changes.
(a)
(b)
Fig. 9. (a) Standard airway tree with branch labels. Drawing from [6] based on [20]. (b) Table of sum of QED distances for the five airway subtrees for 6 airway trees retrieved from CT scans of three patients taken at two different times, denoted P(a, i) where a denotes patient and i denotes time.
In the depth 3 subtrees we expect the deformation from one lung to the other to include topological changes, and we compare them using the unordered QED metric implementation on the subtrees which are cut off after the first 3 generations. As a result we obtain a deformation from the first depth 6 airway tree to
172
A. Feragen et al.
the second, whose restriction to the subtrees is a geodesic. In this experiment we study four airways originating from CT scans of two different patients acquired at two different times, and we obtain the geodesic distances and corresponding matching shown in Fig. 9b. The five subtree distances (top three generations and four subtrees) were added up to give the distance measure in Fig. 9b, and we see that for 5 out of 6 images, the metric recognizes the second image from the same patient.
7
Conclusions and Future Work
Starting from a purely geometric point of view, we define a shape space for treelike shapes. The intuitive geometric framework allows us to simultaneously take both global tree-topological structure and local edgewise geometry into account. We define two metrics on this shape space, QED and TED, which give the shape space a geodesic structure (Theorems 1 and 3). QED is the geometrically natural metric, which turns out to have excellent geometric properties which are essential for statistical shape analysis. In particular, the QED metric has local uniqueness of geodesics and local existence and uniqueness for two versions of average shape, namely the circumcentre and the centroid (Theorem 2). TED does not share these properties, but has better computational properties. Both metrics are generally NP hard to compute for 3D trees. We explain how semi-labeling schemes and geometry can be used to overcome the complexity problems, and illustrate this by computing QED distances between trees extracted from pulmonary airway trees as well as synthetic planar data trees. Our future research will be centered around two points: Development of nonlinear statistical methods for the singular tree-shape spaces, and finding fast algorithms for an approximate QED metric. The latter will allow us to actually compute averages and modes of variation for large, real 3D data trees. Acknowledgements. This work is partly funded by the Lundbeck foundation, the Danish Council for Strategic Research (NABIIT projects 09−065145 and 09− 061346) and the Netherlands Organisation for Scientific Research (NWO). We thank Dr. J.H. Pedersen from the Danish Lung Cancer Screening Trial (DLCST) for providing the CT scans. We thank Marco Loog, Søren Hauberg and Melanie Ganz for helpful discussions during the preparation of the article.
References 1. Pedersen, J., Ashraf, H., Dirksen, A., Bach, K., Hansen, H., Toennesen, P., Thorsen, H., Brodersen, J., Skov, B., Døssing, M., Mortensen, J., Richter, K., Clementsen, P., Seersholm, N.: The Danish randomized lung cancer CT screening trial - overall design and results of the prevalence round. J. Thorac. Oncol. 4, 608–614 (2009)
Geometries on Spaces of Treelike Shapes
173
2. Staal, J., Abramoff, M., Niemeijer, M., Viergever, M., van Ginneken, B.: Ridge based vessel segmentation in color images of the retina. TMI 23, 501–509 (2004) 3. Billera, L.J., Holmes, S.P., Vogtmann, K.: Geometry of the space of phylogenetic trees. Adv. in Appl. Math. 27, 733–767 (2001) 4. Charnoz, A., Agnus, V., Malandain, G., Soler, L., Tajine, M.: Tree matching applied to vascular system. In: Brun, L., Vento, M. (eds.) GbRPR 2005. LNCS, vol. 3434, pp. 183–192. Springer, Heidelberg (2005) 5. Chalopin, C., Finet, G., Magnin, I.E.: Modeling the 3D coronary tree for labeling purposes. MIA 5, 301–315 (2001) 6. Tschirren, J., McLennan, G., Pal´ agyi, K., Hoffman, E.A., Sonka, M.: Matching and anatomical labeling of human airway tree. TMI 24, 1540–1547 (2005) 7. Kaftan, J., Kiraly, A., Naidich, D., Novak, C.: A novel multi-purpose tree and path matching algorithm with application to airway trees. In: SPIE, vol. 6143 I (2006) 8. Metzen, J.H., Kr¨ oger, T., Schenk, A., Zidowitz, S., Peitgen, H.-O., Jiang, X.: Matching of tree structures for registration of medical images. In: Escolano, F., Vento, M. (eds.) GbRPR. LNCS, vol. 4538, pp. 13–24. Springer, Heidelberg (2007) 9. Sebastian, T.B., Klein, P., Kimia, B.: Recognition of shapes by editing their shock graphs. TPAMI 26, 550–571 (2004) 10. Nakano, Y., Muro, S., Sakai, H., Hirai, T., Chin, K., Tsukino, M., Nishimura, K., Itoh, H., Par´e, P.D., Hogg, J.C., Mishima, M.: Computed tomographic measurements of airway dimensions and emphysema in smokers. Correlation with lung function. Am. J. Resp. Crit. Care Med. 162, 1102–1108 (2000) 11. Gromov, M.: Hyperbolic groups. In: Essays in group theory. Math. Sci. Res. Inst. Publ., vol. 8, pp. 75–263 (1987) 12. Wang, H., Marron, J.S.: Object oriented data analysis: sets of trees. Ann. Statist. 35, 1849–1873 (2007) 13. Jain, B.J., Obermayer, K.: Structure spaces. J. Mach. Learn. Res. 10, 2667–2714 (2009) 14. Feragen, A., Lauze, F., Nielsen, M.: Fundamental geodesic deformations in spaces of treelike shapes. In: Proc ICPR (2010) 15. Bridson, M.R., Haefliger, A.: Metric spaces of non-positive curvature. Springer, Heidelberg (1999) 16. Giblin, P.J., Kimia, B.B.: On the local form and transitions of symmetry sets, medial axes, and shocks. IJCV 54, 143–156 (2003) 17. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Inf. Process. Lett. 42, 133–139 (1992) 18. Lo, P., Sporring, J., Ashraf, H., Pedersen, J.J., de Bruijne, M.: Vessel-guided airway tree segmentation: A voxel classification approach. Medical Image Analysis 14, 527–538 (2010) 19. Schlath¨ olter, T., Lorenz, C., Carlsen, I.C., Renisch, S., Deschamps, T.: Simultaneous segmentation and tree reconstruction of the airways for virtual bronchoscopy. In: SPIE, vol. 4684, pp. 103–113 (2002) 20. Boyden, E.A.: Segmental anatomy of the lungs. McGraw-Hill, New York (1955)
Human Pose Estimation Using Exemplars and Part Based Refinement Yanchao Su1 , Haizhou Ai1 , Takayoshi Yamashita2 , and Shihong Lao2 1
Computer Science and Technology Department, Tsinghua, Beijing 100084, China 2 Core Technology Center, Omron Corporation, Kyoto 619-0283, Japan
Abstract. In this paper, we proposed a fast and accurate human pose estimation framework that combines top-down and bottom-up methods. The framework consists of an initialization stage and an iterative searching stage. In the initialization stage, example based method is used to find several initial poses which are used as searching seeds of the next stage. In the iterative searching stage, a larger number of body parts candidates are generated by adding random disturbance to searching seeds. Belief Propagation (BP) algorithm is applied to these candidates to find the best n poses using the information of global graph model and part image likelihood. Then these poses are further used as searching seeds for the next iteration. To model image likelihoods of parts we designed rotation invariant EdgeField features based on which we learnt boosted classifiers to calculate the image likelihoods. Experiment result shows that our framework is both fast and accurate.
1
Introduction
In recent year human pose estimation from a single image has become an interesting and essential problem in computer vision domain. Promising applications such as human computer interfaces, human activity analysis and visual surveillance always rely on robust and accurate human pose estimation results. But human pose estimation still remains a challenging problem due to the dramatic change of shape and appearance of articulated pose. There are mainly two categories of human pose estimation methods: Topdown methods and bottom-up methods. Top-down methods, including regression based methods and example based method, concern of the transformation between human pose and image appearance. Regression based methods [1, 2] learns directly the mapping from image features to human pose. Example based methods [3–7] find a finite number of pose examplars that sparsely cover the pose space and store image features corresponding to each examplar, the resultant pose is obtained through interpolation of the n closest examplars. Bottom-up methods, which are mainly part based methods [8–12], divide human body into several parts and use graph models to characterize the whole body. First several candidates of each part are found using learned part appearance model and then the global graph models are used to assemble each part into a whole body. R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 174–185, 2011. c Springer-Verlag Berlin Heidelberg 2011
Human Pose Estimation Using Exemplars and Part Based Refinement
175
Among all these works, top-down methods are usually quick in speed but their performances rely on the training data. Bottom up methods are more accurate but the exhaustive search of body parts is too time consuming for practical applications. Inspired by the works of Active Shape Model (ASM) [13], we find that iterative local search could give accurate and robust result of shape alignment with acceptable computation cost given a proper initialization. We design a novel iterative inferring schema for bottom-up part based pose estimation. And to model the image likelihood of each part, we designed rotation invariant EdgeField features based on which we learn boosted classifiers for each body part. To provide initial poses to the part based method, we utilize a sparse example based method that gives several best matches with little time cost based on the cropped human body image patch given by a human detector. The combination of top-down and bottom-up methods enables us to achieve accurate and robust pose estimation, and the iterative inferring schema boosts the speed of pose inference with no degeneracy on accuracy. The rest of this paper is organized as follows: section 2 presents an overview of our pose estimation framework, section 3 describes the top-down example based stage, and section 4 is the bottom-up iterative searching stage. Experiments and conclusions are then given in section 5 and section 6.
2
Overview of Our Framework
A brief flowchart of our framework is given in Fig. 1.
6HDUFKLQJ &URSSHG 6HDUFKLQJ 6HDUFKLQJ 6HHGV ,PDJH &DQGLGDWHV 6HHGV ([DPSOHEDVHG *HQHUDWH %3LQIHUHQFH &RYHUDJH" LQLWLDOL]DWLRQ FDQGLGDWHV
5HVXOW
Fig. 1. Examples in dataset 2
The pose estimation framework uses the cropped image patch given as input and consists of two main stages: the initialization stage and the refining stage. In the initialization stage, a sparse top-down example based method is used to find the first n best matches of the examplars, the poses of these examplars are used as the searching seeds in the searching stage. In the refining stage, we utilize a set of boosted body part classifier to measure the image likelihood and a global tree model to restrict the body configuration. Inspired by the works of Active Shape Model, we design an iterative searching
176
Y. Su et al.
algorithm to further refine the pose estimation result. Unlike ASM which finds the best shape in each iteration, we find the first n best poses using BP inference on the tree model and use them as searching seeds for the next iteration. The refined pose estimation result is given when the iteration converges.
3
Example Based Initialization
The objective of example based initialization stage is to provide several searching seeds for the part based refinement stage, so this stage has to be fast and robust. In some previous example-based methods[3, 4, 7], a dense cover of pose space which uses all the training samples as examplars is adopted to achieve better performance, but in our case, a sparse examplar set is enough to get initial poses for the refining stage. We use the Gaussian Process Latent Variable Model (GPLVM) [14] to obtain a low dimensional manifold and use the active set of GPLVM as examplars. An illustration of the GPLVM model and the examplars are given in Fig. 2.
Fig. 2. Illustration of GPLVM and examplars
The adopted image feature is an important aspect of example-based methods. In many previous works, silhouettes of human body are usually used as the templates of examplars. But in practice, it is impossible to get accurate silhouettes from a single image with clutter background. In [3] the author adopts histograms of oriented gradients (HOG) [15] as the image descriptor to characterize the appearance of human body. However, there are two main limitations of using HOG descriptor in a tight bounding box: firstly, the bounding box of human varies dramatically when the pose changes, secondly, a single HOG descriptor isnt expressive enough considering the furious change of the appearance of human body in different pose. So we first extend the tight bounding rectangle to a square, and then divide the bounding square equally into 9 sub-regions,
Human Pose Estimation Using Exemplars and Part Based Refinement
177
HOG descriptors in the sub-regions are concatenated into the final descriptor. And for each examplar, we assigned a weight to each sub-region which represents the percentage of foreground area in the sub-region. Given an image and the bounding box, we first cut out a square image patch based on the bounding box, and then HOGs of all the sub-regions are computed. The distance between the image and an examplar is given by the weighted sum of the Euclid distances of HOGs in each sub-region. We find the first n examplars with shortest distance as the initial poses.
4 4.1
Part Based Refinement Part Based Model
It is natural that human body can be decomposed into several connected parts. Similar with previous works, we model the whole body as a Markov network as shown in Fig. 3(left). The whole body is represented as a tree structure in which each tree node corresponds to a body part and is related to its adjacent parts and its image observation. Human pose can be represented as the configuration of the tree model: L = {l1 , l2 , ..., ln }, and the state of each part is determined by the parameter of its enclosing rectangle: li = {x, y, θ, w, h} (Fig. 3(right)). Each part is associated with an image likelihood, and each edge of the tree between connected part i and part j has an associated, that encodes the restriction of configuration of part i and potential function part j.
xi
pj
pi
ߠ
ߠ
( x, y) h
xj
w
VWDWH REVHUYDWLRQ
Fig. 3. Left: Markov network of human body. Middle:the constraint between two connected parts. Right: the state of a body part.
As shown in Fig. 3(middle), the potential function has two components corresponding to the angle and the position: p θ (li , lj ) + ψij (li , lj ) ψij (li , lj ) = ψij
(1)
The constraint of angle is defined by the Mises distribution: θ (li , lj ) = ek cos(θ−μ) ψij
(2)
178
Y. Su et al.
And the potential function on position is a piece-wise function: p (li , lj ) = { ψij
2
αe−β||pi −pj || , ||pi − pj ||2 < t 0 , ||pi − pj ||2 ≥ t
(3)
The parameter α, β can be learnt from the ground truth data. The piece-wise definition of the potential function not only guarantees the connection between adjacent parts, but also can speed up the inference procedure. The body pose can be obtained by optimizing: p(L|Y ) ∝ p(Y |L)p(Y ) = ψi (li ) ψij (li , lj ) (4) i
4.2
i∈Γ (p)
Image Likelihood
As in previous works, body silhouettes are the most direct cue to measure the image likelihood. However, fast and accurate silhouettes extraction is another open problem. As proven in previous works, boosting classifiers with gradient based features such as HOG feature [15] is an effective way of measuring image likelihood in pedestrian detection and human pose estimation. However, in human pose estimation, the orientation of body parts change dramatically when the pose changes, so when measuring image likelihood of body parts we need to rotate the images to compute the HOG feature values. This time consuming image rotation can be eliminated when using a kind of rotation invariant features. Based on this point, we designed an EdgeField feature. Given an image, we can take each pixel as a charged particle and then the image as a 2D field. Following the definition of electric field force, we can define the force between two pixels p and q: kdp · dq r (5) ||r||3 Where dp , dq are the image gradient on pixel p, q. and r is the vector pointing from p to q. the amplitude of the force is in proportion to the gradient amplitude of p and q, and the similarity between the gradient of p and q, and is in inverse proportion to the square of the distant between p and q. Given a template pixel p and an image I, we can compute the force between the template pixel and the image by summing up the force between the template pixel and each image pixel: f (p, q) =
f (p, I) =
kdp · dq q∈I
||r||3
r
(6)
By applying orthogonal decomposition on f(p,I), we can get: f (p, I) =
f (p, I) · ex f (p, I) · ey
=k
dp · q∈Γ (p) dp · q∈Γ (p)
rq ·ex ||rq ||3 rq ·ey ||rq ||3
=
dp · Fx (p) dp · Fy (p)
= dp F (p) (7)
Human Pose Estimation Using Exemplars and Part Based Refinement
179
Where ex and ey are the unit vector of x and y directions. And F (p) is a 2 by 2 constant matrix in each position p in image I. we call F (p) the EdgeField. Our EdgeField feature is define as a short template segment of curve composed of several points with oriented gradient. The feature value is defined as the projection of the force between the template segment and the image to a normal vector l: f (S, I) = F (p, I) · l = dTp F (p)l (8) p∈S
p∈S
Obviously, the calculation of the EdgeField feature is rotation invariant. To compute the feature value of an oriented EdgeField feature, we only need to rotate the template. The EdgeField can be precomputed from the image by convolution. An example of EdgeField and its contour line are shown in Fig. 4 left and middle. Fig. 4 right shows an example of EdgeField feature.
dpi pi f l
f(S,I)
S I
Fig. 4. Left: EdgeField; Middle: Contour line of EdgeField; Right: EdgeField feature
Based on the EdgeField feature, we can learn a boosted classifier for each body part. We use the improved Adaboost algorithm [14] that gives real-valued confidence to its predictions which can be used as the image likelihood of each part. The weak classifier based on feature S is defined as follow: h(Ii ) = lutj , vj < f (S, Ii ) ≤ vj+1
(9)
Where the range of feature value is divided into several sub ranges{(vj , vj+1 )}, and the weak classifier output a certain constant confidence luti in each sub range. A crucial problem in learning boosted classifier with EdgeField feature is that the number of different EdgeField features is enormous and exhaustive search in conventional boosting learning becomes impractical. So we design a heuristic searching procedure as follow:
180
Y. Su et al.
Algorithm 1. Weak classifier selection for the ith part Input: samples {Iij }, labels of the sample{yij } and the corresponding weight{Dij } 1. Initialize Initialize the open feature list: OL = {S1 , S2 , . . . , Sn } and the close feature list: CL = φ 2. Grow and search: for i = 1 to T lost from OL: Select the feature S ∗ with the lowest classification S ∗ = argmins∈OL (Z(S)) where Z(S) = i Di exp(−yi h(Ii )) OL = OL − {S ∗ }, CL = CL ∪ {S ∗ } Generate new features based on S∗ under the constraint in Fig. 5, then learn parameters and calculate the classification lost and put them in OL. end for 3. Output: Select the feature with the lowest classification loss for OL and CL
This algorithm initializes with EdgeField features with only one template pixel, and generates new features based on current best feature under the growing constraint which keep the template segment of the feature simple and smooth.
Fig. 5. Growing constraint of features: Left: the second pixel (gray square) can only be placed at 5 certain directions from the first pixel (black square); Middle: The following pixels can only be placed in the proceeding direction of current segment and in the adjacent directions. Right: the segment cannot cross the horizontal line crossing the first pixel.
When the position of each pixel p in the feature is given, we need to learn the parameters of the feature: the projection vector l and the gradient of each pixel dp. The objective is to best separate the positive and negative samples. Similar with common object detection task, the feature values of positive samples are surrounded by those of negative samples, so we take the Maximal-RejectionClassifier criterion [18] which minimizes the positive scatter while maximizing the negative scatter: ({dpi , l}) = argminRpos /Rneg
(10)
Where Rpos and Rneg are the covariance of the feature values of positive sample and negative samples respectively.
Human Pose Estimation Using Exemplars and Part Based Refinement
181
Since it is hard to learning l and dp simultaneously, we first discretize l into 12 directions, and learn {dp } for each l. Given l, the feature value can be computed as follow:
dTpi F (pi )l = dTp1 , dTp2 , . . . , dTpn (F (p1 )l, F (p2 )l, . . . , F (pn )l) (11) f (S, I) = pi ∈S
Then the learning of {dp } becomes a typical MRC problem where dTp1 , . . . , dTpn is taken as the projection vector. Following [16], we adopt weighted MRC to take the sample weights into account. Given the boosted classifier of each part, the image likelihood can be calculated as follow: hk , i(Ii )))−1 (12) ψi (li ) = (1 + exp(−Hi (Ii )))−1 = (1 + exp(− k
4.3
Iterative Pose Inference
Belief propagation (BP)[17] is a popular and effective inference algorithm of tree structured model inference. BP calculates the desired marginal distributions p(li—Y) by local message passing between connected nodes. At iteration n, each node t ∈ V calculates a message mnts (lt ) that will be passed to a node s ∈ Γ (t): n ψst (ls , lt )ψt (lt ) × mn−1 (13) mts (ls ) = α ut (ls )dlt lt
u∈Γ (s)
Where α = 1/ mnts (ls )is the normalization coefficient. The approximation of the marginal distribution p(ls |Y ) can be calculated in each iteration: mnut (ls ) (14) pˆ(ls |Y ) = αψt (lt ) u∈Γ (s)
In tree structured models, pˆ(ls |Y ) will converge to p(ls |Y ) when the messages are passed to every nodes in the graph. In our case the image likelihood is discontinuous so that the integral in equation (13) becomes product over all feasible states lt. Although the present of the BP inference reduces the complexity for pose inference, the feasible state space of each part remains unconquerable by exhaustive search. Inspired by the work of Active Shape Model (ASM), we designed an iterative inference procedure which limits each part in the neighborhood of several initial states and inference locally in each iteration. The whole iterative inference algorithm is shown in algorithm 2. In each iteration, the searching seeds are the initial guesses of the states of each part, and we sample uniformly from the neighborhood of each searching seeds to generate searching candidates. And then BP inference is applied over the searching candidates to calculate the marginal distribution over the candidates. And the searching seeds for next iteration are selected from the candidates using MAP according to the marginal distribution.
182
Y. Su et al.
Unlike ASM which uses the reconstructed shape as the initial shape for the local search stage, we keep the first n best candidates as the searching seeds for the next iteration and searching candidates are generated based on the searching seeds. Preserving multiple searching seeds reduce the risk of improper initial pose.
Algorithm 2. Iterative pose inference Input: searching seeds for each part Si = {li1 , li2 , . . . , lin } Generating candidates: Ci = {li |li ∈ N eighborhood(lki ), lki ∈ Si } For each part i: Calculate the image likelihood ψi (li )using boosted classifiers BP inference: Loop until convergence: Calculate messages: For each part s and the messages
t, t ∈ Γ (s) caculate n mts = α lt ∈Ct ψst (ls , lt )ψt (lt ) × u∈Γ (s) mn−1 ut (ls) n Calculate belief: pˆ(ls |Y ) = αψt (lt ) u∈Γ (s) mut (ls ) Find searching seeds for each part i: Si =candidates with the first n largest belief values Go to generating candidates if Si do not converge The estimate pose L = {Si }
5 5.1
Experiments Experiment Data
Our experiments are taken on following dataset: 1. Synthesized dataset: A dataset of 1000 images synthesized using Poser 8. Each image contains a human and the pose is labeled manually. And the backgrounds of these images are the images selected randomly selected from other datasets. 2. The Buffy dataset from [11]: This dataset contains 748 video frames over 5 episodes of the fifth season of Buffy: the vampire slayer. The upper body pose (including head, torso, upper arms and lower arms) are labeled manually in each image. This is a challenging dataset due to the changing of poses, different clothes and strongly varying illumination. As in [11, 12], we use 276 images in 3 episodes as the testing set, and the rest 472 as training set. 3. The playground dataset: This dataset contains 350 images of people play football and basketball in the playground. The full body pose is labeled manually. 150 images from this data set are used as the training set, and the rest 200 is the testing set. 4. The iterative image parsing dataset from[11]: this dataset contains 305 images of people engaged in various activities such as standing, performing exercises, dancing, etc. the full body pose in each image is annotated manually. This dataset is more challenging because there is no constraint on human poses, clothes, illumination and backgrounds. 100 images in this dataset are used as the training set and the rest 205 is the testing set.
Human Pose Estimation Using Exemplars and Part Based Refinement
0.6 0.55
Our descriptor HOG descriptor
0.85
Mean Error
Mean Error
0.7 0.65
1
0.9
Our descriptor HOG descriptor
Mean Error
0.8 0.75
0.8 0.75 0.7
0
5 10 15 number of selected examplars
20
0.65
183
0
5 10 15 number of selected examplars
20
Our descriptor HOG descriptor
0.9 0.8 0.7
0
5 10 15 number of selected examplars
20
Fig. 6. Mean minimum errors of selected examplars in dataset 2,3,4 (from left to right)
5.2
Experiments of Examplar Based Initialization
To learn the examplars for full body pose estimation, we used the whole dataset 1 to learn examplars. The coordinates of 14 joint point of each image are connected as a pose vector; GPLVM is used to learn a low dimensional manifold of the pose space. The active set of GPLVM is taken as the examplars and the corresponding descriptors are stored. The upper body examplars are learnt similarly by using the bounding box of upper body. Given an image and a bounding box of human, we generate 100 candidate bounding box by adding random disturbance to the bounding box. And for each examplar, we find the best candidate bounding box with the shortest distance of the descriptor, and the first 20 examplars with the shortest distance are used as the initial pose in the next stage. We measures the accuracy of this stage on the datasets in terms of the mean of minimum distance of corresponding joint points divided by the part length between ground truth and the top n selected examplars, as shown in Fig. 6. 5.3
Experiments of Pose Estimation
For the part classifier, in the training sets, we cut out image patches according to the bounding box of each part and normalize them into the same orientation and scale and use them as positive samples. Negative samples are obtained by adding large random disturbance to the ground truth states of each part. In the pose estimation procedure, the top 20 matched examplars are used as the searching seeds. 100 searching candidates of each part are generated based on each searching seeds. During each iteration, the candidates with top 20 posterior marginal possibilities are preserved as the searching seeds for the next iteration. To compare with previous methods, we used the code of [12]. The images are cut out based on the bounding box (with some loose), and then are processed by the code. The correctness of each part in the testing sets of dataset 2,3,4 are listed in table 1. A part is correct if the distance between estimated joint points and ground truth is below 50% of the part length. From the result we can see that our method can achieve better correctness with much less time cost. Fig. 7 give some examples in dataset 2,3,4.
184
Y. Su et al. Table 1. Correctness of parts in dataset 2,3,4
part
dataset 2 Method Our in [12] method
dataset 3 Method Our in [12] method
Torso 91% 93% 81% Upper Arm 82% 83% 83% 85% 45% 51% Lower Arm 57% 59% 61% 64% 32% 31% Upper Leg 54% 61% Lower Leg 43% 45% Head 96% 95% 81% Total 78% 80% 52% Time cost 65.1s 5.3s 81.2s
84% 57% 59% 41% 37% 67% 63% 51% 50% 89% 59% 9.1s
dataset 3 Method Our in [12] method 84% 56% 59% 35% 38% 68% 59% 66% 51% 78% 59% 79.1s
87% 64% 62% 41% 43% 67% 68% 69% 60% 80% 64% 8.9s
Fig. 7. Examples in dataset 2,3,4
6
Conclusion
In this paper, we propose a fast and robust human estimation framework that combines the top-down example based method and bottom-up part based method. Example based method is first used to get initial guesses of the body pose and then a novel iterative BP inference schema is utilized to refined the estimated pose using a tree-structured kinematic model and discriminative part classifiers based on a rotation invariant EdgeField feature. The combination for top-down and bottom-up methods and the iterative inference schema enable us to achieve accurate pose estimation in acceptable time cost. Acknowledgement. This work is supported by National Science Foundation of China under grant No.61075026, and it is also supported by a grant from Omron Corporation.
Human Pose Estimation Using Exemplars and Part Based Refinement
185
References 1. Agarwal, A., Triggs, B.: Recovering 3d human pose from monocular images. PAMI 28(1), 44–58 (2006) 2. Bissacco, A., Yang, M., Soatto, S.: Fast human pose estimation using appearance and motion via multi-dimensional boosting regression. In: CVPR, vol. 13, pp. 234– 778 (2007) 3. Poppe, R.: Evaluating example-based pose estimation: Experiments on the humaneva sets. In: CVPR 2nd Workshop on EHuM2, vol. 14, pp. 234–778 (2007) 4. Poppe, R., Poel, M.: Comparison of silhouette shape descriptors for example-based human pose recovery. AFG (2006) 5. Mori, G., Malik, J.: Recovering 3d human body configurations using shape contexts. IEEE PAMI, 1052–1062 (2006) 6. Rogez, G., Rihan, J., Ramalingam, S., Orrite, C., Torr, P.H.S.: Randomized Trees for Human Pose Detection. In: CVPR (2008) 7. Shakhnarovich, G., Viola, P., Darrell, R.: Fast pose estimation with parametersensitive hashing. In: ICCV (2003) 8. Jiang, H., Martin, D.R.: Global pose estimation using non-tree models. In: CVPR (2008) 9. Sigal, L., Black, M.J.: Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In: CVPR (2006) 10. Sigal, L., Bhatia, S., Roth, S., et al.: Tracking Loose-limbed People. In: CVPR (2004) 11. Ramanan, D.: Learning to parse images of articulated bodies. In: NIPS (2006) 12. Andriluka, M., Roth, S., Schiele, B.: Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In: CVPR (2009) 13. Hill, A., Cootes, T.F., Taylor, C.J.: Active shape models and the shape approximation problem. In: 6th British Machine Vison Conference, pp. 157–166 (1995) 14. Lawrence, N.D.: Gaussian process latent variable models for visualization of high dimensional data. In: NIPS, vol. 16, pp. 329–336 (2004) 15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 16. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3), 297–336 (1999) 17. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding Belief Propagation and its Generalizations. IJCAI (2001) 18. Xu, X., Huang, T.S.: Face Recognition with MRC-Boosting. In: ICCV (2005)
Full-Resolution Depth Map Estimation from an Aliased Plenoptic Light Field Tom E. Bishop and Paolo Favaro Department of Engineering and Physical Sciences Heriot-Watt University, Edinburgh, UK {t.e.bishop,p.favaro}@hw.ac.uk
Abstract. In this paper we show how to obtain full-resolution depth maps from a single image obtained from a plenoptic camera. Previous work showed that the estimation of a low-resolution depth map with a plenoptic camera differs substantially from that of a camera array and, in particular, requires appropriate depth-varying antialiasing filtering. In this paper we show a quite striking result: One can instead recover a depth map at the same full-resolution of the input data. We propose a novel algorithm which exploits a photoconsistency constraint specific to light fields captured with plenoptic cameras. Key to our approach is handling missing data in the photoconsistency constraint and the introduction of novel boundary conditions that impose texture consistency in the reconstructed full-resolution images. These ideas are combined with an efficient regularization scheme to give depth maps at a higher resolution than in any previous method. We provide results on both synthetic and real data.
1
Introduction
Recent work [1, 2] demonstrates that the depth of field of Plenoptic cameras can be extended beyond that of conventional cameras. One of the fundamental steps in achieving such result is to recover an accurate depth map of the scene. [3] introduces a method to estimate a low-resolution depth map by using a sophisticated antialiasing filtering scheme in the multiview stereo framework. In this paper, we show that the depth reconstruction can actually be obtained at the full-resolution of the sensor, i.e., we show that one can obtain a depth value at each pixel in the captured light field image. We will show that the fullresolution depth map contains details not achievable with simple interpolation of a low-resolution depth map. A fundamental ingredient in our method is the formulation of a novel photoconsistency constraint, specifically designed for a Plenoptic camera (we will also use the term light field (LF) camera as in [4]) imaging a Lambertian scene (see Fig. 1). We show that a point on a Lambertian object is typically imaged under several microlenses on a regular lattice of correspondences. The geometry of such lattice can be easily obtained by using a Gaussian optics model of the Plenoptic camera. We will see that this leads to the reconstruction of full-resolution views R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 186–200, 2011. c Springer-Verlag Berlin Heidelberg 2011
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
187
consisting of mosaiced tiles from the sampled LF. Notice that in the data that we have used in the experiments, the aperture of the main lens is a disc. This causes the square images under each microlens to be split into a disc containing valid measurements and 4 corners with missing data (see one microlens image in the right image in Fig. 1). We show that images with missing data can also be reconstructed quite well via interpolation and inpainting (see e.g. [5]). Such interpolation however is not necessary if one uses a square main lens aperture and sets the F-numbers of the microlenses and the main lens appropriately. More importantly, we will show that the reconstructed samples form tiled mosaics, and one can enforce consistency at tile edges by introducing additional boundary conditions constraints. Such constraints are fundamental in obtaining the full-resolution depth estimates. Surprisingly, no similar gain in depth resolution is possible with a conventional camera array. In a camera array the views are not aliased (as the CCD sensors have contiguous pixels), and applying any existing method on such data (including the one proposed here) only results in upsampling of the low-resolution depth map (which is just the resolution of one view, and not the total number of captured pixels). This highlights another novel advantage of LF cameras. 1.1
Prior Work and Contributions
The space of light rays in a scene that intersect a plane from different directions may be interpreted as a 4D light field (LF) [6], a concept that has been useful in refocusing, novel view synthesis, dealing with occlusions and non-Lambertian objects among other applications. Various novel camera designs that sample the LF have been proposed, including multiplexed coded aperture [7], Heterodyne mask-based cameras [8], and external arrangements of lenses and prisms [9]. However these systems have disadvantages including the inability to capture dynamic scenes [7], loss of light [8], or poor optical performance [9]. Using an array of cameras [10] overcomes these problems, at the expense of a bulky setup. Light Field, or Plenoptic, cameras [4, 11, 12], provide similar advantages, in more compact portable form, and without problems of synchronisation and calibration. LF cameras have been used for digital refocusing [13] capturing extended depth-of-field images, although only at a (low) resolution equal to the number of microlenses in the device. This has been addressed by super-resolution methods [1, 2, 14], making use of priors that exploit redundancies in the LF, along with models of the sampling process to formulate high resolution refocused images. The same sampling properties that enable substantial image resolution improvements also lead to problems in depth estimation from LF cameras. In [3], a method for estimating a depth map from a LF camera was presented. Differences in sampling patterns of light fields captured by camera arrays and LF cameras were analysed, revealing how spatial aliasing of the samples in the latter case cause problems with the use of traditional multi-view depth estimation methods. Simply put, the LF views (taking one sample per microlens) are highly undersampled for many depths, and thus in areas of high frequency texture, an error term based on matching views will fail completely. The solution proposed in [3]
188
T.E. Bishop and P. Favaro
Fig. 1. Example of LF photoconsistency. Left: A LF image (courtesy of [15]) with an enlarged region (marked in red) shown to the right. Right: The letter E on the book cover is imaged (rotated by 180◦ ) under nine microlenses. Due to Lambertianity each repetition has the same intensity. Notice that as depth changes, the photoconsistency constraint varies by increasing or decreasing the number of repetitions.
was to filter out the aliased signal component, and perform matching using the remaining valid low-pass part. The optimal filter is actually depth dependent, so an iterative scheme was used to update the filtering and the estimated depth map. Whilst this scheme yields an improvement in areas where strong aliasing corrupts the matching term, it actually throws away useful detail that could be used to improve matches if interpreted correctly. Concurrent with this work, [16] proposed estimation of depth maps from LF data using cross-correlation, however this also only gives low resolution results. In this paper we propose to perform matching directly on the sampled lightfield data rather than restricting the interpretation to matching views in the traditional multi-view stereo sense. In this way, the concept of view aliasing is not directly relevant and we bypass the need to antialias the data. A key benefit of the new formulation is that we compute depth maps at a higher resolution than the sampled views. In fact the amount of improvement attainable depends on the depth in the scene, similar to superresolution restoration of the image texture as in [2]. Others have considered super-resolution of surface texture along with 3D reconstruction [17] and depth map super-resolution [18], however in the context of multi-view stereo or image sequences, where the sampling issues of a LF camera do not apply. Specifically, our contributions are: 1. A novel method to estimate a depth value at each pixel of a single (LF) image; 2. Analysis of the subimage correspondences, and reformulation in terms of virtual full-resolution views, i.e., mosaicing, leading to a novel photoconsistency term for LF cameras; 3. A novel penalty term that enforces gradient consistency across tile boundaries of mosaiced full-resolution views.
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
2
189
Light Field Representation and Methodology Overview
We consider a plenoptic camera, obtained by adding a microlens array in front of a conventional camera’s sensor [1, 2, 4]. The geometry of this camera is almost identical to that of a camera array where each camera’s aperture corresponds to a microlens. However, the LF camera has an additional main lens in front of the array, and this dramatically affects the sampling of the LF. We will see in sec. 3 that the precise study of how the spatial and angular coordinates sample the light field is critical to the formulation of matching terms for depth estimation. Let us consider a version of the imaging system where the microlens apertures are small and behave as pinholes.1 In the LF camera, each microlens forms its own subimage Sc (θ) on the sensor, where c ∈ R2 identifies the center of a microlens and θ ∈ R2 identifies the local (angular) coordinates under such a microlens. In a real system the microlens centers are located at discrete positions ck , but to begin the analysis we generalize the definition to virtual microlenses that are free to be centered in the continuous coordinates c. The main lens maps an object in space to a conjugate object inside the camera. Then, pixels in each subimage see light from the conjugate object from different vantage points, and have a one-to-one map with points on the main lens (see Fig. 2). The collection of pixels, one per microlens, that map to the same point on the main lens (i.e., with the same local angle θ) form an image we call a view, denoted by Vθ (c) ≡ Sc (θ). The views and the subimages are two equivalent representations of the same quantity, the sampled light field. We assume the scene consists of Lambertian objects, so that we can represent the continuous light field with a function r(u) that is independent of viewing angle. In our model, r(u) is called the radiance image or scene texture, and relates to the LF via a reprojection of the coordinates u ∈ R2 at the microlens plane, through the main lens center into space. That is, r(u) is the full-resolution all-focused image that would be captured by a standard pinhole camera with the sensor placed at the microlens plane. This provides a common reference frame, independent of depth. We also define the depth map z(u) on this plane. 2.1
Virtual View Reconstructions
Due to undersampling, the plenoptic views are not suitable for interpolation. Instead we may interpolate within the angular coordinates of the subimages. In our proposed depth estimation method, we make use of the fact that in the Lambertian light field there are samples, indexed by n, in different subimages that can be matched to the same point in space. In fact, we will explain how several estimates rˆn (u) of the radiance can be obtained, one per set of samples with common n. We term these full resolution virtual views, because they make use of more samples than the regular views Vθ (c), and are formed from interpolation 1
Depending on the LF camera settings, this holds true for a large range of depth values. Moreover, we argue in sec. 4 that corresponding defocused points are still matched correctly and so the analysis holds in general for larger apertures.
190
T.E. Bishop and P. Favaro
r’
O
microlenses
r
z’
c1
θ
- v’θq v
c2
u
θq μ
d
θq θ1 θ2 θq
v’
cK
Main Lens
v
conjugate image v’ v
θ1
v’
- v’ vθ
λ(u)(c-u) u
z’ (u-c) u’ v’ O
Main Lens
z‘
θ0
c v’-z‘
θ0
v
θ Sensor
Fig. 2. Coordinates and sampling in a plenoptic camera. Left: In our model we begin with an image r at v , and scale it down to the conjugate image r at z , where O is the main lens optical center. Center: This image at r is then imaged behind each microlens; for instance, the blue rays show the samples obtained in the subimage under the central microlens. The central view is formed from the sensor pixels hit by the solid bold rays. Aliasing of the views is present: the spatial frequency in r is higher than the microlens pitch d, and clearly the adjacent view contains samples unrelated to the central view via interpolation. Right: Relation between points in r and corresponding points under two microlenses. A point u in r has a conjugate point (the triangle) at u , which is imaged by a microlens centered at an arbitrary center c, giving a projected point at angle θ. If we then consider a second microlens positioned at u, the image of the same point is located at the central view, θ = 0, on the line through O.
in the non-aliased angular coordinates. Similarly to [1], these virtual views may be reconstructed in a process equivalent to building a mosaic from parts of the subimages (see Fig. 3). Portions of each subimage are scaled by a magnification factor λ that is a function of the depth map, and these tiles are placed on a new grid to give a mosaiced virtual view for each n. As λ changes, the size of the portions used and the corresponding number of valid n are also affected. We will describe in detail how these quantities are related and how sampling and aliasing occurs in the LF camera in sec. 3, but firstly we describe the overall form of our depth estimation method, and how it uses the virtual views. 2.2
Depth Estimation via Energy Minimization
We use an energy minimization formulation to estimate the optimal depth map z = {z(u)} in the full-resolution coordinates u. The optimal z is found as the minimum of the energy E defined as: E1 (u, z(u)) + γ2 E2 (u, z(u)) + γ3 E3 (z(u)) (1) E(z) = u
where γ2 , γ3 > 0. This energy consists of three terms. Firstly, E1 is a data term, penalizing depth hypotheses where the corresponding points in the light field do not have matching intensities. For the plenoptic camera, the number of virtual views rˆn (u) depends on the depth hypothesis. We describe this term in sec. 4.
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
191
Fig. 3. Mosaiced full-resolution views generated from subimages. Each microlens subimage in a portion of the light field image shown on the left is partitioned into a grid of tiles. Rearranging the tiles located at the same offset from the microlens center into a single image produces mosaiced full-resolution views as shown in the middle. As we used a Plenoptic camera with a circular main lens aperture, some pixels are not usable. In the top view, the tiles are taken from the edge of each microlens and so there is missing data which is inpainted and shown on the image to the right. We use these full-resolution views to perform accurate data matching for depth reconstruction.
E2 is a new matching term we introduce in sec. 4.1, enforcing the restored views to have proper boundary conditions. It effectively penalizes mismatches at the mosaiced tile boundaries, by imposing smoothness of the reconstructed texture (that also depends on the depth map z). The third term, E3 , is a regularization term that enforces piecewise smoothness of z. Because the terms E1 and E2 only depend on z in a pointwise manner, they may be precomputed for a set of depth hypotheses. This enables an efficient numerical algorithm to minimize eq. (1), which we describe in sec. 4.3. Before detailing the implementation of these energy terms in sec. 4, we show in sec. 3 how such views are obtained from the sampled light field.
3
Obtaining Correspondences in Subimages and Views
We summarise here the image formation process in the plenoptic camera, and the relations between the radiance, subimages and views. We consider first a continuous model and then discretise, explaining why aliasing occurs. In our model, we work entirely inside the camera, beginning with the full-resolution radiance image r(u), and describe the mapping to the views Vθ (c) or microlens subimages Sc (θ). This mapping is done in two steps: from r(u) to the conjugate object (or image) at z , which we denote r , and then from r to the sensor. This process is shown in Fig. 2. There is a set of coordinates {θ, c} that corresponds to the same u, and we will now examine this relation. We see in Fig. 2 that the projection of r(u) to r (u ) through the main lens u , where v is the main lens to microlens array center O results in u = z v(u) distance. This is then imaged onto the sensor at θ(u, c) through the microlens
192
T.E. Bishop and P. Favaro
at c with a scaling of − v −zv (u) . Notice that θ(u, c) is defined as the projection of u through c onto the sensor relative to the projection of O through c onto the sensor; hence, we have z (u) v (c − u) v − z (u) v = λ(u)(c − u),
θ(u, c) =
(2)
where we defined the magnification factor λ(u) = v −zv (u) z v(u) , representing the signed scaling between the regular camera image r(u) that would form at the microlens array, and the actual subimage that forms under a microlens. We can then use (2) to obtain the following relations: r(u) = Sc (θ(u, c)) = Sc λ(u)(c − u) , v D (3) ∀c s.t. |θ(u, c)| < v 2 = Vλ(u)(c−u) (c), θ v D r(u) = Vθ (c(θ, u)) = Vθ u + (4) ∀ |θ| < . λ(u) v 2
The constraints on c or θ mean only microlenses whose projection − vv θ is within the main lens aperture, of radius D 2 , will actually image the point u. Equations (3) and (4) show respectively how to find values corresponding to r(u) in the light field for a particular choice of either c or θ, by selecting the other variable appropriately. This does not pose a problem if both c and θ are continuous, however in the LF camera this is not the case and we must consider the implications of interpolating sampled subimages at θ = λ(u)(c − u) or θ . sampled views at c = u + λ(u) 3.1
Brief Discussion on Sampling and Aliasing in Plenoptic Cameras
In a LF camera with microlens spacing d, only a discrete set of samples in each . view is available, corresponding to the microlens centers at positions c = ck = dk, where k indexes a microlens. Also, the pixels (of spacing μ) in each subimage . sample the possible views at angles θ = θq = μq, where q is the pixel index; these coordinates are local to each microlens, with θ0 the projection of the main lens center through each microlens onto the sensor. Therefore, we define the discrete . observed view Vˆq (k) = Vθq (ck ) at angle θq as the image given by the samples . for each k. We can also denote the sampled subimages as Sˆk (q) = Vθq (ck ). For a camera array, the spatial samples are not aliased, and a virtual view Vθ (c) may be reconstructed accurately, which allows for the sub-pixel matching used in regular multi-view stereo. However, in the LF camera, there is clearly a large jump between neighboring samples from the same view in the conjugate image (see Fig. 2). Even when the microlenses have apertures sized equal to their spacing, as described in [3], the resulting microlens blur size changes with
Full-Res. Depth Map Estimation from an Aliased Plenoptic Light Field
193
depth. Equivalently there is a depth range where the integration region size in the conjugate image is less than the sample spacing. This means that the sampled views Vˆq (k) may be severely aliased, depending on depth. As a result, simple interpolation of Vˆq (k) in the spatial coordinates k is not enough to reconstruct an arbitrary virtual view Vθ (c) at c = ck . 3.2
Matching Sampled Views and Mosaicing
In an LF camera we can reconstruct Vθ (c) and avoid aliasing in the sampled views by interpolating instead along the angular coordinates of the subimages, and using the spatial samples only at known k. Using eq. (3), we can find the estimates rˆ(u) of the texture by interpolating as follows: r(u) = Vλ(u)(ck −u) (ck ) ⇒ rˆ(u) = Vˆ λ(u) (k), μ
(ck −u)
∀k s.t. |λ(u)(ck − u)|
θ1 ) that δθ1 + |δθ2 | = θ2 − θ1
(7)
This gives us an equation from which f can be solved. We repeat the above technique for other such pairs of images obtained by moving the camera to different overlapping pan-tilt locations. However, since the SIFT keypoint finding and matching may not be perfect, we use the well known RANSAC algorithm [14] to perform a robust estimation of the focal length and eliminate outliers using multiple matched keypoints in the images. This process allows us to automatically calculate a robust estimate of the focal length.
PTZ Camera Modeling and Panoramic View Generation
(a)
585
(b)
Fig. 3. (a) SIFT keypoints matched across overlapping regions of an adjacent pair of images. (b) Variation of focal length with zoom for Sony and Pelco cameras.
3.3
Variation with Zoom
A key component of the mapping model is the PTZ camera’s focal length and in order to generalize the model to all zoom levels, we need to know the interrelationship between focal length (f ) and zoom (z). Therefore, we next model the effect of zoom and analyze the variation of f with z. We perform the automatic focal length calculation described in Sect. 3.2 at different zoom levels of the cameras. Here, we choose the range of zoom=1 (wide field-of-view) to zoom=12 as the upper limit of zoom=12 has been observed to be more than sufficient for standard surveillance applications. Once the focal length values for all the zoom levels are calculated, we compute a piecewise cubic fit to this data. Figure 3(b) shows the variation of focal length with zoom for two different brands of PTZ cameras. This function permits the use of an arbitrary zoom level in our model by mapping any zoom value to its corresponding focal length so that it can then be utilized in the (x, y) to (pan,tilt) mapping. To determine the smoothness and accuracy of the learned f -z function, we employ an object size preservation metric. A uniform colored rectangular object is placed in the center of the camera’s view and the object’s motion is restricted to be along the line-of-sight of the camera. Keeping the camera’s pan-tilt (θ,φ) fixed, the camera’s zoom is automatically adjusted so that the size of the object at any instant during its motion remains the same as in the first frame (w × h pixels). To do this, the object’s upper-right corner point (x,y) is detected at every frame (using for example, Canny edge detection) and the point’s difference in pan (δθ) from the home pan (θ) is calculated using Eqn.(1). Since the size of the object is to remain the same and the object is always centered, the locations of all of its corners should also remain constant (±w/2, ±h/2). Therefore, in order to scale the location of the upper-right corner point from (x,y) back to (w/2, h/2) by adjusting the zoom, the new focal length needed for the camera is calculated using the values δθ, φ, w/2, and h/2 in Eqn.(1) and then reformulated to solve for f . This focal length value is now mapped to a corresponding zoom level using the f -z function, thereby adjusting the zoom to this new value. To demonstrate this concept, Fig. 4 shows a red block remaining fairly constant in size as the block is moved forward and the zoom level is adjusted
586
K. Sankaranarayanan and J.W. Davis
(a) zoom=1 n = 235 (0.96%)
(b) zoom=2.72 n = 421 (1.73%)
(c) zoom=3.64 n = 156 (0.64%)
(d) zoom=4.95 n = 393 (1.61%)
Fig. 4. Size of the red block in the image (24, 400 pixels) remains almost constant due to the automatic zoom update. Note the increase in size of the background object.
automatically for the PTZ camera. At each of these levels the number of error pixels (n ) is obtained by finding the number of red pixels outside the target box and the number of non-red pixels inside the target box using color segmentation. Figure 4 also presents the low number of error pixels (and percentage error) at different zoom levels. Note that since the block is moved manually, there may be some minimal translation involved. The values obtained for the error pixels can be used to evaluate the precision of the mapping with variations in zoom for particular application needs.
4
Experiments
In order to evaluate the complete PTZ mapping model proposed in this work, we performed multiple experiments using synthetic data and real data captured from commercial off-the-shelf PTZ cameras. All the experiments covered in this section are important contributions of this work (beyond [6]) because they comprehensively tested the model both qualitatively and quantitatively in the following way. The first set of experiments examined the accuracy of the model by mapping individual pixels from images across the entire scene to their corresponding pan-tilt locations to generate spherical panoramas. We then performed experiments using synthetic data to quantify the accuracy of the mapping model. We next studied the pan-tilt deviations of the cameras and their effects on the mapping model. We then demonstrate the validity of the model across various zoom levels. The final experiment demonstrates the complete pan-tilt-zoom capabilities of the model within a real-time active tracking system. All the experiments involving real data were performed using two models of commercial security PTZ cameras: 1) Pelco Spectra III SE surveillance camera (analog) and 2) Sony IPELA SNC-RX550N camera (digital). The accuracy of pan-tilt recall for these cameras were evaluated by moving the camera from arbitrary locations to a fixed home pan-tilt location and then matching SIFT feature points across the collected home location images. The mean (μ) and standard deviation (σ) of the pan-tilt recall error (in pixels) measured for the Sony and Pelco cameras were (2.231, 0.325) and (2.615, 0.458) respectively. Similarly, we also calculated the mean and variance of the zoom recall error for the Pelco and
PTZ Camera Modeling and Panoramic View Generation
587
Sony cameras as (0.0954, 0.0143) and (0.0711, 0.0082) respectively, by changing the zoom from arbitrary zoom levels to a fixed home zoom level and then polling the camera’s zoom value (repeated for multiple home zoom levels). The radial distortion parameter (κ) values obtained for the Sony and Pelco cameras were 0.017 and 0.043 respectively, determined using the technique from [15] [16]. These values can be used by the reader to evaluate the quality of the cameras used in the experiments. 4.1
Panorama Generation
PTZ cameras generally have a large pan-tilt range (e.g., 360 × 90 degrees). By orienting the camera to an automatically calculated set of overlapping pan-tilt locations, we collected a set of images such that they cover the complete view space. The number of images required for complete coverage vary with the zoom level employed, a study of which follows later. We then took images which are pairwise adjacent from this set and used them to automatically calculate a robust estimate of the focal length f (using the technique described in Sect. 3.2). Next, we mapped each pixel of every image to its corresponding pan-tilt orientation in the 360 × 90 degree space (at a resolution of 0.1 degree). The pan-tilt RGB data were plotted on a polar coordinate system, where the radius varies linearly with tilt and the sweep angle represents the pan. Bilinear interpolation was used to fill any missing pixels in the final panorama. The result is a spherical (linear fisheye) panorama which displays the entire pan-tilt coverage of the camera in a single image.
(a) Ground truth
(b) Constructed panorama
Fig. 5. SIFT keypoint matching quantifying the accuracy of the mapping model
To quantitatively evaluate the accuracy of the above technique, we performed an experiment using a virtual environment synthetically generated in Maya. The ground truth data (Fig. 5(a)) was obtained by setting up a camera with a wide field-of-view1 directly above the environment so that it captures the entire field-of-coverage. We then generated a panorama (Fig. 5(b)) using the synthetic 1
Maya camera parameters: Film Gate-Imax, Film aspect ratio-1.33, Lens squeeze ratio-1, camera scale-1, focal length-2.5.
588
K. Sankaranarayanan and J.W. Davis
(a)
(b)
Fig. 6. Panoramas for different shifts show illumination differences before intrinsic image generation
data captured from different pan-tilt locations using our technique. By using SIFT to match keypoints between the ground truth image and the generated panorama, we calculated the distance between the matched locations for the top 100 keypoints. The mean and standard deviation values were obtained as 7.573 pixels and 0.862 pixels respectively. For images of size 900×900, we believe these values are reasonably small and within the error range of SIFT matching since the interpolation does not reproduce the fine textures very accurately in the scale space. These low values for the matching distance statistics quantitatively demonstrate the accuracy of the panorama generation technique. While performing the above experiments using real cameras, the automatic gain control (AGC) on these cameras could not be fully disabled as the camera moved to new locations. Consequently there were illumination gain differences among the different component images captured across the scene and these differences were reflected in the panorama. Therefore, we used a technique of deriving intrinsic images [17] and computed a reflectance-only panoramic image to overcome this problem. We generated multiple panoramas of the same scene (see Fig. 6) by shifting the pre-configured pan and tilt coverage locations in multiples of 5 and 1 degree intervals respectively. These small shifts in pan and tilt displace the location of the illumination seams in the panoramas across multiple passes, thus simulating the appearance of varying shadows. We then separated the luminance and chrominance channels from the RGB panoramas by changing the color space from RGB to YIQ. After running the intrinsic image algorithm on the Y (luminance) channel alone, we combined the result with the mean I and Q channels of the panorama set, and converted back from YIQ to the RGB color space for display. As shown in Fig. 7(d), the result provides a more uniform panorama as compared to Fig. 6. The above process was tested on multiple Pelco cameras mounted at varying heights (on 2, 4, and 8 story buildings) and a Sony camera. The final results after intrinsic image generation are shown in Fig. 7, demonstrating the applicability of the model to real surveillance cameras at different heights. See high resolution version of Fig. 7(a) in supplementary material.
PTZ Camera Modeling and Panoramic View Generation
(a)
(b)
(c)
589
(d)
Fig. 7. Panoramas from Pelco (a-c) and Sony (d) cameras mounted on buildings at different heights
(a)
(b)
Fig. 8. Panoramas with simulated pan-tilt error values of (a) 1.5 and (b) 2.5. Note the alignment problems in the panorama with =2.5.
4.2
Pan-Tilt Recall
In order to study the effect of inaccuracies in the pan-tilt motor on the proposed model, we simulated varying degrees of error in the data capturing process and then compared the resulting panoramas. This was performed by introducing either a positive or negative (randomly determined) fixed shift/error in the pan and tilt values at each of the pre-configured pan-tilt locations and then capturing the images. This data was then used to generate a spherical panorama using the above mentioned technique. We repeated this process to generate panoramas for different values of simulated pan-tilt shift/error in the range 0.1 to 5 in steps of 0.1 degrees. Figure 8 shows the results of this experiment for values of 1.5 and 2.5 (before intrinsic analysis). The alignment errors for panoramas with values less than 1.5 degrees were visually negligible, but were exacerbated for greater than this value. 4.3
Validity Across Zoom Levels
The next set of experiments were aimed towards verifying that the model works at varying zoom levels. We performed the panorama generation process at different zoom levels in the range of zoom=1 to zoom=12. At each zoom level,
590
K. Sankaranarayanan and J.W. Davis
(a) 76 images
(b) 482 images
(c) 1681 images
(d) 2528 images
Fig. 9. (a)-(d) Panoramas (pre-intrinsic) generated at zoom levels 1, 5, 10, and 12 respectively with their corresponding number of component images
the appropriate focal length was chosen by employing the f -z mapping function learned using the technique described in Sect. 3.3. As the zoom level increases, the field-of-view of each component image reduces and consequently the number of component images required to cover the complete pan-tilt space increases. Figure 9 shows representative panoramas (before intrinsic analysis) generated at zoom levels 1, 5, 10, and 12 for the Sony camera (and presents the number of component images needed to construct the panoramas). In each of the panoramas, it was observed that the small component images registered correctly and no noticeable misalignments were seen with any change in zoom. This demonstrates that the (x, y) to (pan, tilt) mapping model works irrespective of the zoom level employed. 4.4
Use in Tracking with Active PTZ Cameras
Combining the pan, tilt, and zoom mapping capabilities of the proposed model, we developed a real-time active camera tracking application with a wide-area PTZ surveillance camera. The active camera system continually follows the target as it moves throughout the scene by controlling the pan-tilt of the camera to keep the target centered in its view. In addition, the camera’s zoom is continually adjusted so that the target being tracked is always of constant size irrespective of target’s distance from the camera. We used the appearance-based tracking algorithm from [18] to demonstrate the applicability of our PTZ model. Note that our model is indifferent to the tracker itself and various other frame-to-frame tracking algorithms could be employed to handle strong clutter and occlusion situations if needed. In the system, we model the target using a covariance of features and matched it in successive frames to track the target. To build the appearance-based model of the target, we selected a vector of position, color, and gradient features fk = [x y R G B Ix Iy ]. The target window was represented as a covariance matrix of these features. Since in our application, we desire to track targets over long distances across the scene, the appearance of targets undergo considerable change. To adapt to this and to overcome noise, a model update method from [18] was used.
PTZ Camera Modeling and Panoramic View Generation
591
zoom=2
zoom=2.81
zoom=3.64
zoom=4.82
zoom=5.12
zoom=3
zoom=3.72
zoom=4.94
zoom=5.81
zoom=6.77
Fig. 10. Active camera tracking results from two sequences
The tracker is initialized manually by placing a box on the torso of the target. As the target moves, the best match (xmatch , ymatch ) is found by the tracker in successive frames and its distance dmatch is stored. In this system, we restricted our search for the best match to a local search window of 40 × 40 (in frames of size 320 × 240). Using the proposed camera model, the (xmatch , ymatch ) coordinates of the best matching patch are converted to the change in pan and tilt angles required to re-center the camera to the target. The tracker then checks to determine if an update in zoom is necessary. To do so, it checks the distances of a larger patch and a smaller patch at (xmatch , ymatch ). If either of these distances are found to be better than dmatch , the zoom is adjusted to the new size using the same technique as described in the red block experiment in Sect. 4.4. While moving the camera, the system polls the motor to see if it has reached the desired pan-tilt location. Once it has reached this position, the next frame is grabbed and the best match is again found. This process is repeated to continually track the target. Figure 10 shows a few frames from two sequences with a target being automatically tracked using the proposed technique. In the first sequence, the target walks a total distance of approximately 400 feet away from the camera and the zoom level varies from 2 (in the first frame) to 5.12 (in the last frame). In the second sequence, the target covers a distance of around 650 feet along a curved path and the camera’s zoom varies in the range from 3 to 6.77 during this period. See supplemental tracking movie. The ground truth locations of the target were obtained by manually marking the left, right, top, and bottom extents of the target’s torso and calculating its center in each frame. By comparing these with the center of the tracking box (of size 25 × 50), the tracking error statistics (mean, standard deviation) in pixels were obtained as (2.77, 1.50) and (3.23, 1.63) respectively for the two sequences. In addition, the accuracy of the zoom update function was evaluated by comparing the ground truth height of the target’s torso with the height of the tracking box. The error in this case was obtained to be 6.23% and 8.29% for the two sequences. This application demonstrates a
592
K. Sankaranarayanan and J.W. Davis
comprehensive exploitation of the pan-tilt-zoom capabilities of the camera using the proposed model. Again, other trackers could also be employed with our PTZ model for different tracking scenarios/requirements.
5
Key Contributions and Advantages
To summarize the key contributions of this work, we presented 1) a theoretical derivation of a complete pan-tilt-zoom mapping model for real-time mapping of image x-y to pan-tilt orientations, 2) a SIFT-based technique to automatically learn the focal length for commercial-grade PTZ cameras, 3) an analysis of the variations between focal length and zoom in the model along with its evaluation using an object size preservation metric, 4) quantitative experiments and discussion on the PTZ mapping model with panorama generation, 5) experiments demonstrating robustness with different brands of cameras, and 6) qualitative and quantitative experiments within an example PTZ active tracking application. The proposed camera model is a straight-forward and closed-form technique. Since previous work in this area have been based on more complex methods (explained in Sect. 2), by showing our technique with thorough and robust results, we demonstrate that any additional modeling and computation (as shown in previous work) is in fact not necessary for the demonstrated capabilities. We believe our simplification compared to other over-complicated methods is a strength, as this makes our technique more practically applicable to a wide variety of real-time problems.
6
Conclusion and Future Work
We proposed a novel camera model to map the x-y focal plane of a PTZ camera to its pan-tilt space. The model is based on the observation that the locus of a point on a fixed image plane, as the camera pans, is an ellipse using which we solve for the desired change in pan and tilt angles needed to center the point. We proposed techniques to automatically calculate the focal length, analyzed the variation of focal length with zoom, tested the model by generating accurate panoramas, and validated its usability at varying zoom levels. In future work, we plan to utilize the model to map the camera’s pan-tilt-zoom space to groundplane information for registration of multiple cameras. Acknowledgement. This research was supported in part by the National Science Foundation under grant No. 0236653.
References 1. Davis, J., Chen, X.: Calibrating pan-tilt cameras in wide-area surveillance networks. In: Proc. ICCV (2003) 2. Jain, A., Kopell, D., Wang, Y.: Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention. In: Proc. IEEE CVPR (2006)
PTZ Camera Modeling and Panoramic View Generation
593
3. Bernardin, K., van de Camp, F., Stiefelhagen, R.: Automatic person detection and tracking using fuzzy controlled active cameras. In: Proc. IEEE CVPR (2007) 4. Jethwa, M., Zisserman, A., Fitzgibbon, A.: Real-time panoramic mosaics and augmented reality. In: Proc. BMVC (1998) 5. Tordoff, B., Murray, D.: Reactive control of zoom while fixating using perspective and affine cameras. IEEE TPAMI 26, 98–112 (2004) 6. Sankaranarayanan, K., Davis, J.: An efficient active camera model for video surveillance. In: Proc. WACV (2008) 7. Basu, A., Ravi, K.: Active camera calibration using pan, tilt and roll. IEEE Trans. Sys., Man and Cyber. 27, 559–566 (1997) 8. Collins, R., Tsin, Y.: Calibration of an outdoor active camera system. In: Proc. IEEE CVPR, vol. I, pp. 528–534 (1999) 9. Fry, S., Bichsel, M., Muller, P., Robert, D.: Tracking of flying insects using pan-tilt cameras. Journal of Neuroscience Methods 101, 59–67 (2000) 10. Woo, D., Capson, D.: 3D visual tracking using a network of low-cost pan/tilt cameras. In: Canadian Conf. on Elec. and Comp. Engineering, vol. 2, pp. 884–889 (2000) 11. Barreto, J., Araujo, H.: A general framework for the selection of world coordinate systems in perspective and catadioptric imaging applications. IJCV 57 (2004) 12. Sinha, S., Pollefeys, M.: Pan-tilt-zoom camera calibration and high-resolution mosaic generation. Comp. Vis. and Image Understanding 103, 170–183 (2006) 13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004) 14. Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting. Communications of the ACM Archive 24, 381–395 (1981) 15. Tordoff, B., Murray, D.: The impact of radial distortion on the self-calibration of rotating cameras (Comp. Vis. and Image Understanding) 16. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) ISBN: 0521623049 17. Weiss, Y.: Deriving intrinsic images from image sequences. In: Proc. ICCV (2001) 18. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on means on Riemannian manifolds. In: Proc. IEEE CVPR (2006)
Horror Image Recognition Based on Emotional Attention Bing Li1 , Weiming Hu1 , Weihua Xiong2 , Ou Wu1 , and Wei Li1 1
National Lab of Pattern Recognition, Institute of Automation, CAS, Beijing, China 2 OmniVision Technologies, Sunnyvale, CA, USA
[email protected] Abstract. Along with the ever-growing Web, people benefit more and more from sharing information. Meanwhile, the harmful and illegal content, such as pornography, violence, horror etc., permeates the Web. Horror images, whose threat to children’s health is no less than that from pornographic content, are nowadays neglected by existing Web filtering tools. This paper focuses on horror image recognition, which may further be applied to Web horror content filtering. The contributions of this paper are two-fold. First, the emotional attention mechanism is introduced into our work to detect emotional salient region in an image. And a topdown emotional saliency computation model is initially proposed based on color emotion and color harmony theories. Second, we present an Attention based Bag-of-Words (ABoW) framework for image’s emotion representation by combining the emotional saliency computation model and the Bag-of-Words model. Based on ABoW, a horror image recognition algorithm is given out. The experimental results on diverse real images collected from internet show that the proposed emotional saliency model and horror image recognition algorithm are effective.
1
Introduction
In the past decades, the implosive growth of on-line information and resources has given us the privilege to share pictures, images, and videos from geographically disparate locations very conveniently. Meanwhile, more and more harmful and illegal content, such as pornography, violence, horror, terrorism etc., permeate in the Web across the world. Therefore, it is necessary to make use of a filtering system to prevent computer users, especially children, from accessing inappropriate information. Recently, a variety of horror materials are becoming a big issue that enters into children’s daily life easily. There are a number of psychological and physiological researches indicating that too much horror images or words can seriously affect children’s health. Rachman’s research [1] shows that horror information is one of the most important factors for phobias. To solve this problem, many governments also have taken a series of measures to ban horror films from children. Over the last years, a number of high-quality algorithms have been developed for Web content filtering and the state of art is improving rapidly. Unfortunately, R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 594–605, 2011. Springer-Verlag Berlin Heidelberg 2011
Horror Image Recognition Based on Emotional Attention
595
in contrast to either the pornographic content filtering [2][3][4] or Violent content filtering[5], there has been comparatively little work done on horror information filtering. In this paper, we aim to rectify this imbalance by proposing a horror image recognition algorithm which is the key ingredient of horror content filtering. Horror image recognition can benefit from image emotion analysis which lots of researchers have been paying more attention to recently. For example, Yanulevskaya et al [6] classify images into 10 emotional categories based on holistic texture features. Solli et al [7] propose a Color based Bag-of-Emotions model for image retrieval from emotional viewpoints. Without loss of generality, we also introduce emotional aspect into horror image recognition. In particular, we present a novel emotional saliency map based on emotional attention analysis which is then fed into a modified Bag-of-Words model, Attention based Bag-of-Words (ABoW) model, to determine images’ emotional content. The experiments on a real image database show that our method can efficiently recognize those horror images.
2
Image and Emotion Introduction
Images not only provide visual information but also evoke some emotional perceptions, such as excitement, disgust, fear etc. Horror unexceptionally comes out when one experiences an awful realization or deeply unpleasant occurrence[8]. Published research in experimental psychology [9] shows that emotions can be measured and quantified, and that common emotional response patterns exist among people with widely varying cultural and educational backgrounds. Therefore, it is possible to establish a relationship between an image and the emotions. Generally, a horror image includes two parts: the region with highly emotional stimuli, named emotional salient region (ESR) and a certain background region. The feeling of horror depends on the interaction between them. Therefore, accurate detection and separation of emotional salient region is important for horror image recognition. Intuitionally, visual saliency scheme may be a proper tool for this task. Unfortunately, existing visual saliency algorithms are incompetent to this task. To this end, we introduce emotional attention mechanism into our work and an emotional saliency map method is proposed.
3
Emotional Attention
It is well known that the primate visual system employs an attention mechanism to limit processing to important information which is currently relevant to behaviors or visual tasks. Many psychophysical and neurophysiologic experiments show that there exist two ways to direct visual attention: One uses bottom-up visual information including color, orientation, and other low level features, the other one uses those information relevant to current visual behaviors or tasks [10]. Recent psychophysical research further reveals that visual attention can be modulated and improved by the affective significance of stimuli [11]. When simultaneously facing emotionally positive, emotionally negative, and nonemotional
596
B. Li et al.
pictures, human automatic attention is always captured by the following order: negative, both negative and positive, and nonemotional stimuli. Therefore, modulatory effects implement specialized mechanisms of ”emotional attention” that might not only supplement but also compete with other sources of top-down control on attention. 3.1
Emotional Saliency Map Framework
Motivated by emotion attention mechanism, we propose a new saliency map of image, called ’emotional saliency map’, which is different from the traditional visual saliency map in three major aspects: (1) it is based on some high-level emotional features, such as color emotion and color harmony which contain more emotional semantics [12][13]; (2) it takes both contrast information and isolated emotional saliency property into account which is inspired by lots of psychological experiments indicating that isolated pixels with various colors have different saliency to human visual system; and (3) it uses both contrast and color harmony value to represent the relationship between pixels. The emotional saliency map computation framework is given out in Fig.1, and we will discuss its implementation more details in the next section. 3.2
Emotional Saliency Map Implementation
Color Emotion based Saliency Map. As a high-level perception, emotion is evoked by many low-level visual features, such as color, texture, shape, etc. However, how these features affect human emotional perception is a difficult issue in current research. Color emotion refers to the emotion caused by a single color and has long been of interest to both artists and scientists. Recent research of Ou et al gives a 3D computational color emotion model, each dimension representing color activity (CA), color weight (CW ), and color heat (CH) respectively.
Fig. 1. Emotional saliency map computation Framework
Horror Image Recognition Based on Emotional Attention
597
The transformation equations between color space and color emotion space are defined as follows[12]: ∗ 2 1/2 −17 CA = −2.1 + 0.06 (L∗ − 50)2 + (a∗ − 3)2 + b 1.4 CW = −1.8 + 0.04(100 − L∗ ) + 0.45 cos(h − 100o) CH = −0.5 + 0.02(C ∗ )1.07 cos(h − 50o )
(1)
where (L∗ , a∗ , b∗ ) and (L∗ , C ∗ , h∗ ) are the color values in CIELAB and CIELCH color spaces respectively. Given a pixel I(x, y) at coordinates (x, y) of image I, we transform its RGB color to CIELAB and CIELCH color spaces, followed by another computation according to Eq (1), its color emotion value is denoted as CE(x, y) = [CA(x, y), CW (x, y), CH(x, y)] . We define single pixel’s emotional saliency EI as: EI(x, y) = [CA(x, y) − minCA]2 + [CH(x, y) − minCH]2 (2) where minCA and minCH are the minimal color activity value and minimal color heat value in the image respectively. This definition is based on the observation that pixels with high color activity (CA) and color heat (CH) are more salient, especially for those horror images; while color weight (CW ) has less effects on emotional saliency. Center-surround contrast is also very important operator in saliency models. Considering that the contrast in the uniformed image regions is close to 0, we propose a multi-scale center-surround operation for color emotion contrast computation: EC(x, y) =
L l=1
Θl (CE(x, y)),
Θl (CE(x, y)) =
x+d
y+d
i=x−d, j=y−d
(3) |CEl (x, y) − CEl (i, j)|
where Θ is center-surround difference operator that computes the color emotion difference between the central pixel (x, y) and its neighbor pixels inside a d × d window in lth scale image from L different levels in a Gaussian image pyramid. EC(x, y) is the multi-scale center-surround color emotion difference which combines the output of Θ operations. In this paper, we always set d = 3 and L = 5. Another factor is emotional difference between the local and global EL(x, y), as defined in Eq(4), where CE is the average color emotion of the whole image. (4) EL(x, y) = CE(x, y) − CE Finally, we define the color emotion based emotional saliency map CS(x, y) as Eq.(5), where N orm() is the Min-Max Normalization operation in an image. CS(x, y) =
1 [N orm(EI(x, y)) + N orm(EC(x, y)) + N orm(EL(x, y))] 3
(5)
598
B. Li et al.
Color Harmony based Saliency Map. Judd et al[13] pointed out: ”When two or more colors seen in neighboring areas produce a pleasing effect, they are said to produce a color harmony.” Simply speaking, color harmony refers to the emotion caused by color combination. A computational model HA(P 1 , P 2 ) to quantify harmony between two colors P 1 , P 2 is proposed by Ou et al [13]. HA(P 1 , P 2 ) = HC (P 1 , P 2 ) + HL (P 1 , P 2 ) + HH (P 1 , P 2 )
(6)
where HC (), HL (), and HH () are the chromatic harmony component, the lightness harmony component, and the hue harmony component, respectively. Since the equations for three components are very complex, we will skip their details due to space limitation, the reader can refer to [13]. HA(P 1 , P 2 )is harmony score of two colors. The higher the score of HA(P 1 , P 2 ) is, the more harmonious the two-color combination is. Similarly, we define a multi-scale center-surround color harmony as well: L
HC(x, y) = Ψl (x, y) =
l=1 x+d
Ψl (x, y), y+d
i=x−d, j=y−d
(7) HA(Pl (x, y), Pl (i, j))
where Ψ is a center-surround color harmony operator that computes the color harmony between the central pixel’s color P (x, y) and its neighbor pixel’s color P (i, j) in a d × d window in lth scale image from L different levels in a Gaussian image pyramid. Next, we define the color harmony score between a pixel and the global image as: (8) HL(x, y) = HA(P (x, y), P ) where P is average color of whole image. The high color harmony score indicates the calm or tranquility emotion [14]. In other words, the higher the color harmony score is, the weaker the emotional response is. Therefore we can transform color harmony score to an emotional saliency value using a monotonic decreasing function. Because the range of color harmony function HA(P 1 , P 2 ) is [−1.24, 1.39][13], a linear transformation function T (v) is given as: T (v) = (1.39 − v)/2.63
(9)
Finally, the color harmony based emotional saliency is computed as: HS(x, y) =
1 [T (HC(x, y) + T (HL(x, y))] 2
(10)
Emotional Saliency Map. Color emotion based emotional saliency map and color harmony based emotional saliency maps are linearly fused into emotional saliency map: ES(x, y) = λ × HS(x, y) + (1 − λ) × CS(x, y)
(11)
where λ is fusing weight and ES(x, y) indicates the emotional saliency of the pixel at location (x, y). The higher the emotional saliency is, the more likely this pixel captures the emotional attention.
Horror Image Recognition Based on Emotional Attention
4
599
Horror Image Recognition Based on Emotional Attention
Bag-of-Words (BoW) is widely used for visual categorization. In the BoW model, all the patches in an image are represented using visual words with same weights. The limitation of this scheme is that a large number of background image patches seriously confuse image’s semantic representation. To address this problem, we extend BoW to integrate emotional saliency map and call it Attention based Bag-of-Words (ABoW). The framework is shown in Fig.2.
Fig. 2. Framework for Attention based Bag-of-Emotions
4.1
Attention based Bag-of-Words
In the ABoW model, two emotional vocabularies, Horror Vocabulary (HV) and Universal Vocabulary (UV), and their histograms are defined to emphasize horror emotion character. The former one is for image patches in ESR while the latter one is for background. The combination of these two histograms is used to represent the image’s emotion. The histogram for each vocabulary is defined as:
M |V OC| M p(Ik ∈ wm ) p(Ik ∈ wm ) (12) Hist(wm , V OC) = k=1
k=1 m=1
where V OC ∈ {HV, U V }, wm is the mth emotional word in V OC. p(Ik ∈ wm ) is the probability that the image patch Ik belongs to the word wm . Because the words in the two vocabularies are independent. Consequently, p(Ik ∈ wm ) can be defined as p(Ik ∈ wm ) = p(Ik ∈ V OC)p(Ik ∈ wm |V OC)
(13)
600
B. Li et al.
The p(Ik ∈ V OC) is the probability of selection V OC to represent Ik . Because the image patches in ESR is expected to be representd by the HV histogram, we assume that p(Ik ∈ HV ) = p(Ik ∈ ESR) ∝ ESk ;
p(Ik ∈ U V ) = p(Ik ∈ / HV )
(14)
where ESk is the average emotional saliency of image patch Ik and computed by the emotional saliency computation model. The higher ESk is, the more possibility Ik belongs to ESR. Consequently, Eq (13) can be rewritten as: p(Ik ∈ hwm ) = ESk × p(Ik ∈ hwm |HV ) (15) p(Ik ∈ wm ) = p(Ik ∈ uwm ) = (1 − ESk ) × p(Ik ∈ uwm |U V ) where hwm and uwm are the emotional words in HV and UV respectively. The conditional probability p(Ik ∈ wm |V OC) is defined by the multivariate Gaussian function. p(Ik ∈ wm |V OC) =
1 ((2π)D/2
1/2
|Σ|
1 exp{− (Fk − cm )T Σ −1 (Fk − cm )} (16) 2 )
where Fk is a D dimensional feature vector of Ik ; Cm is a feature vector of emotional word wm , Σ is the covariance matrix, and | • | is the determinate. The nearer the distance between Fk and Cm is, the higher the corresponding probability p(Ik ∈ wm |V OC) is. 4.2
Feature Extraction
In the ABoW framework, feature extraction for each image patch is another key step. To fill the gap between low level image features and high level image emotions, we select three types of features for each image patch, Color ,Color Emotion and Weibull Texture. Color Feature: We consider the image color information in HSV space. k k k k k k Two feature sets are defined as: [f1 , f2 , f3 ] = [Hk , Sk , Vk ] and [f4 , f5 , f6 ] = [Hk − H , Sk − S , Vk − V ]. The former one is the average of all the pixels in the patch, and the latter one is defined as the difference between average value of the patch and that of the whole image. Texture Feature: Geusebroek et al [15] report a six-stimulus basis for stochastic texture perception. Fragmentation of a scene by a chaotic process causes the spatial scene statistics to conform to a Weibull-distribution. γ wb(x) = β
γ−1 x γ x e−( β ) β
(17)
The parameters of the distribution can completely characterize the spatial structure of the texture. The contrast of an image can be represented by the width
Horror Image Recognition Based on Emotional Attention
601
of the distribution β, and the grain size is given by γ, which is the peakedness of the distribution. So [f7k , f8k ] = [γk , βk ] represents the local texture feature of the patch Ik and the texture difference between Ik and whole image is k ] = [|γk − γ¯ | , βk − β¯]. [f9k , f10 Emotion Feature: The emotion feature is also extracted based on color emotion and color harmony theories. The average color emotion of image patch Ik , and the k k k difference between Ik and whole image are [f11 , f12 , f13 ] = [CA k , CWk , CHk ] and k k k [f14 , f15 , f16 ] = [ CAk − CA , CWk − CW , CHk − CH ]. We use the color harmony score between average color Pk of Ik and average color P of the whole k image to indicate the harmony relationship between them,[f17 ] = [HA(Pk , P )]. 4.3
Emotional Histogram Construction
The feature vectors are quantized into visual word for two vocabularies, HV and UV, by k-means clustering technique independently. For HV, horror words are carried out from those interesting regions with manual annotation in each horror training image. All image patches in both horror images and non-horror images in the training set are used to construct words of UV. Then each image is represented by combining the horror word histogram Hist(wm , HV ) and the universal emotional word histogram Hist(wm , U V ), as [Hist(wm , HV ), Hist(wm , U V )]. And Support Vector Machine (SVM) is applied as the classifier to determine whether any given image is horror one or not.
5
Experiments
To evaluate the performance of our proposed scheme, we conducted two separate experiments, one is about emotional attention, and the other one is about Horror image recognition. 5.1
Data Set Preparation and Error Measure
Because of lacking open large data set for horror image recognition, we collected and created one. A large number of candidate horror images are collected from internet. Then 7 Ph.D students in our Lab are asked to label them from one of the three categories: Non-horror, A little horror, and Horror. They are also required to draw a bounding box around the most emotional salient region (according to their understanding of saliency) of each image. The provided annotations are used to create a saliency map S = {sX |sX ∈ [0, 1]} as follow: sX
N 1 n = a N n=1 X
(18)
where N is the number of labeling users and anX ∈ {0, 1} is a binary label give by nth user to indicate whether or not the pixel X belongs to salient region. We
602
B. Li et al.
select out 500 horror images from these candidates to the create horror image set, each labeled as ’Horror’ by at least 4 users. On the other hand, we also collect 500 non-horror images with different scenes, objects or emotions. Specially, the non-horror images includes 50 indoor images, 50 outdoor images, 50 human images, 50 animal images, 50 plant images, and 250 images with different emotions (adorable, amusing, boring, exciting, irritating, pleasing, scary, and surprising) which are downloaded from an image retrieval system ALIPR(http://alipr.com/). Finally, all these 500 horror images, along with their emotional saliency S, and 500 non-horror images are put together to construct a horror image recognition image set (HRIS) used in the following experiments. Given the ground truth annotation sX and the obtained emotional saliency value eX with different alogrithms, the precision, recall, and Fα measure defined in Eq.(19) are used to evaluate the performance. As many previous work [16], α is set as 0.5 here. P recsion = X sX eX/ eX ; Recall = X sX eX/ sX X X (19) recsion×Recall Fα = (1+α)×P α×P recsion+Recall 5.2
Experiments for Emotional Saliency Computation
To test the performance of emotional saliency computation, we need to select optimal fusing weight λ in Eq(11) firstly. The F0.5 value change curve with λ on the 500 horror image set is shown in Fig.3(A). It tells us that the best F0.5 value, 0.723, is achieved when we set λ = 0.2 . So λ is always set to be 0.2 in the followings. From the optimal λ selection, we can find that the emotional saliency map with only color emotion based saliency map (λ = 0) is better than that with only color harmony based saliency map (λ = 1). But the color harmony based saliency map also can improve the color emotion based saliency map. Both of them are important for the final emotional saliency map. We also compare our emotional saliency computation method with other existing visual saliency algorithms, Itti’s method (ITTI)[17], Hou’s method (HOU)[18], Graph-based visual saliency algorithm (GBVS)[19], and Frequencytuned Salient Region Detection algorithm (FS)[20]. The precision (P re), recall
Fig. 3. (A)F0.5 change curve as a function of λ. (B) Comparison with other methods.
Horror Image Recognition Based on Emotional Attention
603
(Rec) and F0.5 values for each method are calculated and shown in Fig.3 (B). In addition, in order to validate the effectiveness of color emotion, we also apply our saliency framework on the RGB space, denote as ’Our(RGB)’ in Fig.3. The output of our methods are P re = 0.732, Rec = 0.748, and F0.5 = 0.737 respectively, showing that it outperforms all other visual saliency computation methods. This is due to the fact that visual saliency computation are solely based on local feature contrast without considering any pixel’s salient property itself and some emotional factors, so neither of them can detect emotional saliency accurately. In contrast, emotional factors that can integrate the single pixel’s salient value are added in our solution, so it can improve performances. In addition, the proposed method also outperforms Our(RGB), which indicates the color emotion and color harmony theories are effective for emotional saliency detection. 5.3
Horror Image Recognition
In this experiment, an image is divided into 16 × 16 image patches with overlapping 8 pixels. We have discussed many features in section 4.2. Now we need to find the feature or feature combination that best fits horror image recognition at hand. We divide these features into 3 subsets: Color Feature ([f1 − f6 ]), Texture Feature ([f7 − f10 ]) and Emotion Feature ([f11 − f17 ]). Seven different selections (or combinations) are tried based on the ABoW model. In order to simplify the parameter selection, we set the word numbers in HV and UV equal, NH = NU = 200. We equally divide the image set into two subsets A and B. Each subset includes 250 horror images and 250 non-horror images. We then use A for training and B for test and vice versa. The combined results of the two experiments are used as the final performance. The Precision, Recall and F1 values, which are widely used for visual categorization evaluation, are adopted to evaluate the performances of each feature combination. Different from the computation of Precision and Recall in emotional saliency evaluation, the annotation label here for horror image recognition is binary, ’1’ for horror image and ’0’ for non-horror images. The experimental results are shown in Table 1. Table 1 shows that the Emotion feature outperforms the other single features, meaning that the color emotion and color harmony features are more useful for Table 1. Comparison with different feature combinations based on ABoW Feature
P rec
Rec
F1
Color 0.700 0.742 0.720 Texture 0.706 0.666 0.685 Emotion 0.755 0.742 0.748 Color + Texture 0.735 0.772 0.753 Color + Emotion 0.746 0.780 0.763 Texture + Emotion 0.769 0.744 0.756 Color + Texture + Emotion 0.768 0.770 0.769
604
B. Li et al. Table 2. Comparison with other emotional categorization methods Method
P re
Rec
F1
EVC 0.760 0.622 0.684 BoE 0.741 0.600 0.663 Proposed (Without Emotional Saliency) 0.706 0.698 0.702 Proposed (With Emotional Saliency) 0.768 0.770 0.769
horror image recognition. Regarding combinational features, combination of the color, texture and emotion features performs best, with F1 = 0.769. In order to check the effectiveness of the emotional saliency map, we also construct the combinational histogram without emotional saliency for horror image recognition, in which the emotional saliency value for each image patch is set as 0.5. That is to say, the weights for the horror vocabulary and for the universal vocabulary are equal. The same feature combination (Color + Texture + Emotion) is used for horror image recognition both with and without emotional saliency. In addition, we also compare our method with Emotional Valence Categorization (EVC) algorithm [12] proposed by Yanulevskaya and color based Bagof-Emotions (BoE) model[13]. The experimental results are shown in Table 2. The results show that the proposed horror image recognition algorithm based emotional attention outperforms all other methods. The comparison between with and without emotional saliency gives out that the emotional saliency is effective for horror image recognition. The potential reason lies in that the EVC and BoE describe image’s emotion only from global perspective, but the horror emotional is usually generated by image’s local region. This proves again that the emotional saliency is useful to find the horror region from images.
6
Conclusion
This paper has proposed a novel horror image recognition algorithm based on emotional attention mechanism. Reviewing this paper, we have the following conclusions: (1) This paper has presented a new and promising research topic, Web horror image filtering, which is supplement to existing Web content filtering. (2) Emotional attention mechanism, an emotion driven attention, has been introduced and a initial emotional saliency computation model has been proposed. (3)We have given out an Attention based Bag-of-Words (ABoW) framework, in which the combinational histogram of horror vocabulary and universal vocabulary has been used for horror image recognition. Experiments on real images have shown that our methods can recognize those horror images effectively and can be used as a new basis for horror content filtering in the future. Acknowledgement. This work is partly supported by National Nature Science Foundation of China (No. 61005030, 60825204, and 60935002), China Postdoctoral Science Foundation and K. C. Wong Education Foundation, Hong Kong.
Horror Image Recognition Based on Emotional Attention
605
References 1. Rachman, S.: The conditioning theory of fear acquisition. Behaviour Research and Therapy 15, 375–378 (1997) 2. Forsyth, D., Fleck, M.: Automatic detection of human nudes. IJCV 32, 63–77 (1999) 3. Hammami, M.: Webguard: A web filtering engine combining textual, structural, and visual content-based analysis. IEEE T KDE 18, 272–284 (2006) 4. Hu, W., Wu, O., Chen, Z.: Recognition of pornographic web pages by classifying texts and images. IEEE T PAMI 29, 1019–1034 (2007) 5. Guermazi, R., Hammami, M., Ben Hamadou, A.: WebAngels filter: A violent web filtering engine using textual and structural content-based analysis. In: Perner, P. (ed.) ICDM 2008. LNCS (LNAI), vol. 5077, pp. 268–282. Springer, Heidelberg (2008) 6. Yanulevskaya, V., van Gemert, J.C., Roth, K.: Emotional valence categorization using holistic image features. In: Proc. of ICIP, pp. 101–104 (2008) 7. Solli, M., Lenz, R.: Color based bags-of-emotions. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702, pp. 573–580. Springer, Heidelberg (2009) 8. http://en.wikipedia.org/wiki/Horror_and_terror 9. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International affective picture system (iaps): Technical manual and affective ratings, Tech. Rep., Gainesville, Centre for Research in Psychophysiology (1999) 10. Sun, Y., Fisher, R.: Object-based visual attention for computer vision. AI 146, 77–123 (2003) 11. Vuilleumier, P.: How brains beware: neural mechanisms of emotional attention. TRENDS in Cognitive Sciences 9, 585–594 (2005) 12. Ou, L., Luo, M., Woodcock, A., Wright, A.: A study of colour emotion and colour preference. part i: Colour emotions for single colours. Color Res. & App. 29, 232– 240 (2004) 13. Ou, L., Luo, M.: A colour harmony model for two-colour combinations. Color Res. & App. 31, 191–204 (2006) 14. Fedorovskaya, E., Neustaedter, C., Hao, W.: Image harmony for consumer images. In: Proc. of ICIP, pp. 121–124 (2008) 15. Geusebroek, J., Smeulders, A.: A six-stimulus theory for stochastic texture. IJCV (62) 16. Liu, T., Sun, J., Zheng, N.N., Tang, X., Shum, H.Y.: Learning to detect a salient object. In: Proc. of CVPR, pp. 1–8 (2007) 17. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE T PAMI 20, 1254–1259 (1998) 18. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach. In: Proc. of CVPR, pp. 1–8 (2007) 19. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Proc. of NIPS, pp. 545–552 (2006) 20. Achanta, R., Hemami, S., Estrada, F., Ssstrunk, S.: Frequency-tuned salient region detection. In: Proc. of CVPR, pp. 1597–1604 (2009)
Spatial-Temporal Affinity Propagation for Feature Clustering with Application to Traffic Video Analysis Jun Yang1,2 , Yang Wang1,2 , Arcot Sowmya1 , Jie Xu1,2 , Zhidong Li1,2 , and Bang Zhang1,2 1
School of Computer Science and Engineering, The University of New South Wales 2 National ICT Australia {jun.yang,yang.wang,jie.xu,zhidong.li,bang.zhang}@nicta.com.au,
[email protected] Abstract. In this paper, we propose STAP (Spatial-Temporal Affinity Propagation), an extension of the Affinity Propagation algorithm for feature points clustering, by incorporating temporal consistency of the clustering configurations between consecutive frames. By extending AP to the temporal domain, STAP successfully models the smooth-motion assumption in object detection and tracking. Our experiments on applications in traffic video analysis demonstrate the effectiveness and efficiency of the proposed method and its advantages over existing approaches.
1
Introduction
Feature points clustering has been exploited extensively in object detection and tracking, and some promising results have been achieved [1–10]. There are roughly two classes of approaches: one is from a top-down graph-partitioning perspective [1, 2, 6], the other features a bottom-up multi-level hierarchical clustering scheme [4, 5, 7, 8]. Du and Piater [3] instantiate and initialize clusters using the EM algorithm, followed by merging overlapping clusters and splitting spatially disjoint ones. Kanhere and Birchfield [9] find stable features heuristically and then group unstable features by a region growing algorithm. Yang et al [10] pose the feature clustering problem as a general MAP problem and find the optimal solution using MCMC. Very few existing methods consider spatial and temporal constraints simultaneously in an unified framework. Affinity Propagation [11] is a recently proposed data clustering method which simultaneously considers all data points as potential exemplars (cluster centers), and recursively exchanges real-valued messages between data points until a good set of exemplars and corresponding clusters gradually emerges. In this paper, we extend the original Affinity Propagation algorithm for feature points clustering by incorporating temporal clustering consistency constraints into the messagepassing procedure, and apply the proposed method to the analysis of real-world traffic videos. The paper is organized as follows. First we revisit the Affinity Propagation algorithm in section 2, and then propose the Spatial-Temporal Affinity R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part II, LNCS 6493, pp. 606–618, 2011. Springer-Verlag Berlin Heidelberg 2011
Spatial-Temporal Affinity Propagation for Feature Clustering
607
Propagation algorithm for feature points clustering in section 3. We give experimental results on traffic video analysis in section 4 and conclude the paper in section 5.
2
Affinity Propagation
Affinity Propagation [11] is essentially a special case of Loopy Belief Propagation algorithm specifically designed clustering. Given n data points and a for data set of real-valued similarities s(i, k) between data points, Affinity Propagation simultaneously considers all data points as potential exemplars (cluster centers), and works by recursively exchanging real-valued messages between data points until a good set of exemplars and corresponding clusters gradually emerges. The similarity s(i, k), i = k indicates how well data point k is suited to be the exemplar for data point i, while s(k, k) is referred to as “preference” and data points with higher preference values are more likely to be chosen as exemplars. The preference thus serves as a parameter for controlling the number of identified clusters. If all data points are equally likely to be exemplars a priori, the Algorithm 1. Affinity Propagation Input: similarity values s(i, k), maxits, convits, λ Output: exemplar index vector c 1 iter ← 0, conv ← 0, a ← 0, r ← 0, c ← −∞ 2 repeat 3 iter ← iter + 1, oldc ← c, oldr ← r, olda ← a 4 forall i, k do r(i, k) ← s(i, k) − max a(i, k ) + s(i, k ) 5
6 7
k : k =k
r ← (1 − λ) ∗ r + λ ∗ oldr for i = k do
a(i, k) ← min 0, r(k, k) +
10 11 12 13 14 15 16 17 18 19 20
max 0, r(i , k)
i : i ∈{i,k} /
8 9
forall k do a(k, k) ← max 0, r(i , k) i : i =k
a ← (1 − λ) ∗ a + λ ∗ olda foreach k do if r(k, k) + a(k, k) > 0 then ck ← k foreach k do if ck = k then ck ← arg maxi: ci =i s(k, i) if c = oldc then conv ← conv + 1 else conv ← 1 until iter > maxits or conv > convits
608
J. Yang et al.
preferences should be set to a common value, usually being the median of input similarities. Two types of messages are exchanged between data points, each representing a different kind of competition. Point i sends point k the “responsibility” r(i, k) which reflects the accumulated evidence for the suitability of point i choosing point k as its exemplar, considering other potential exemplars other than k. Point k sends point i the “availability” a(i, k) which reflects the accumulated evidence for the suitability of point k serving as the exemplar for point i, considering the support from other points who choose point k as their exemplar. To avoid numerical oscillations, messages are usually damped based on their values from the current and previous iterations. At any iteration of the message-passing procedure, the positivity of the sum r(k, k)+a(k, k) can be used to identify exemplars, and the corresponding clusters can be calculated. The algorithm terminates after a predefined number of iterations, or if the decided clusters have stayed constant for some number of iterations. A summary of the Affinity Propagation algorithm is given in Algorithm 1.
3 Spatial-Temporal Affinity Propagation
3.1 Motivation
Expressed in Affinity Propagation terminology, the feature points clustering problem can be described as follows: given a set of tracked feature points {f_1^{τ_1:t}, f_2^{τ_2:t}, ..., f_n^{τ_n:t}} and a set of similarity values s(i, k) derived from spatial and temporal cues, find the optimal configuration of hidden labels c = (c_1, ..., c_n), where c_i = k if data point i has chosen k as its exemplar. For general object detection and tracking, the temporal constraint models the smooth-motion assumption. For feature clustering specifically, it reduces to the requirement that the clustering configurations of two consecutive frames should be consistent and stable. To incorporate this temporal constraint into the Affinity Propagation framework, we introduce a temporal penalty function for the labeling configuration with regard to k being chosen as an exemplar:

δ_k^T(c) = −∞, if c_k = k but ∃ i ∈ G^{t−1}(k) : c_i ≠ k;   0, otherwise     (1)

where G^{t−1}(k) denotes the cluster at time t−1 that feature k belongs to. G^{t−1}(k) can be calculated because each feature point is tracked individually. The temporal penalty poses the constraint that if k is chosen as an exemplar at time t, then the feature points which were in the same cluster as k at time t−1 should all choose k as their exemplar at time t. In other words, δ_k^T(c) implements a relaxed version of the strong constraint that feature points that were clustered together at time t−1 should still be clustered together at time t. Because of the dynamic nature of feature points (being tracked, appearing or disappearing), satisfying δ_k^T(c) does not guarantee the satisfaction of the strong constraint, for example when newly-born features are selected as exemplars. Nevertheless, the benefits
of defining δ_k^T(c) by (1) are twofold: on the one hand it is more flexible than the strong constraint, allowing new feature points to contribute to the re-clustering of old feature points; on the other hand it keeps the optimization problem simple and tractable, and can be seamlessly integrated into the original Affinity Propagation framework, as we will show shortly.
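As a minimal illustration of (1) (names hypothetical, not part of the original text), the penalty can be evaluated directly from a candidate labeling c and the previous-frame cluster of k:

```python
import numpy as np

def temporal_penalty(c, k, G_prev):
    """Temporal penalty of Eq. (1) for exemplar candidate k.
    c      : array of labels, c[i] = exemplar chosen by point i
    G_prev : iterable with the indices that formed k's cluster at time t-1
    Returns -inf if k declares itself an exemplar but some previous cluster
    member of k does not follow it, and 0 otherwise."""
    if c[k] == k and any(c[i] != k for i in G_prev):
        return -np.inf
    return 0.0
```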
3.2 Derivation
Following the same derivation method as the original Affinity Propagation algorithm [11], Spatial-Temporal Affinity Propagation can be viewed as searching over valid configurations of c so as to maximize the net similarity S:

S(c) = Σ_{i=1}^{n} s(i, c_i) + Σ_{k=1}^{n} δ_k(c)     (2)

δ_k(c) = δ_k^S(c) + δ_k^T(c)     (3)

where

δ_k^S(c) = −∞, if c_k ≠ k but ∃ i : c_i = k;   0, otherwise     (4)
δ_k^S(c) is the spatial penalty that enforces the validity of the configuration c with regard to k being chosen as an exemplar. δ_k^T(c) is the temporal penalty defined in (1) that enforces the temporal constraint.
Fig. 1. Left: factor graph for Affinity Propagation. Middle: the calculation of responsibilities. Right: the calculation of availabilities.
Function (2) can be represented as a factor graph as shown in Fig. 1, and optimization can be done using the max-sum algorithm (the log-domain version of the sum-product algorithm). There are two kinds of messages: the message sent from c_i to δ_k(c), denoted as ρ_{i→k}(j), and the message sent from δ_k(c) to c_i, denoted as α_{i←k}(j). Both messages are vectors of n real numbers, one for each possible value j of c_i, and they are calculated according to (5) and (6) as follows:

ρ_{i→k}(c_i) = s(i, c_i) + Σ_{k': k'≠k} α_{i←k'}(c_i)     (5)
with α_{i←k}(c_i) being the best possible configuration satisfying δ_k given c_i:

α_{i←k}(c_i) = max_{j_1,...,j_{i−1},j_{i+1},...,j_n} [ δ_k(j_1, ..., j_{i−1}, c_i, j_{i+1}, ..., j_n) + Σ_{i': i'≠i} ρ_{i'→k}(j_{i'}) ]

  = Σ_{i': i'∉G^{t−1}(k)} max_{j'} ρ_{i'→k}(j') + Σ_{i': i'∈G^{t−1}(k)\{k}} ρ_{i'→k}(k),   for c_i = k = i   (k is an exemplar)

  = Σ_{i': i'≠k} max_{j': j'≠k} ρ_{i'→k}(j'),   for c_i ≠ k = i   (best configuration with no cluster k)

  = ρ_{k→k}(k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max_{j'} ρ_{i'→k}(j') + Σ_{i': i'∈G^{t−1}(k)\{k,i}} ρ_{i'→k}(k),   for c_i = k ≠ i

  = max{ max_{j': j'≠k} ρ_{k→k}(j') + Σ_{i': i'∉{i,k}} max_{j': j'≠k} ρ_{i'→k}(j')   (no cluster k),
         ρ_{k→k}(k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max_{j'} ρ_{i'→k}(j') + Σ_{i': i'∈G^{t−1}(k)\{k,i}} ρ_{i'→k}(k)   (with a cluster k) },   for c_i ≠ k ≠ i
                                                                                     (6)
If we assume that each message consists of a constant part and a variable part, i.e. ρ_{i→k}(c_i) = ρ̃_{i→k}(c_i) + ρ̄_{i→k} and α_{i←k}(c_i) = α̃_{i←k}(c_i) + ᾱ_{i←k}, and define ρ̄_{i→k} = max_{j: j≠k} ρ_{i→k}(j) and ᾱ_{i←k} = α_{i←k}(c_i : c_i ≠ k), then we have

max_{j: j≠k} ρ̃_{i→k}(j) = 0     (7)
max_j ρ̃_{i→k}(j) = max( 0, ρ̃_{i→k}(k) )     (8)
α̃_{i←k}(c_i) = 0,   for c_i ≠ k     (9)
Σ_{k': k'≠k} α̃_{i←k'}(c_i) = α̃_{i←c_i}(c_i) for c_i ≠ k,   and 0 for c_i = k     (10)
By substituting the above equations into (5) and (6), we get simplified message calculations:

ρ_{i→k}(c_i) = s(i, k) + Σ_{k': k'≠k} ᾱ_{i←k'},   for c_i = k
             = s(i, c_i) + α̃_{i←c_i}(c_i) + Σ_{k': k'≠k} ᾱ_{i←k'},   for c_i ≠ k     (11)
α_{i←k}(c_i)
  = Σ_{i': i'∉G^{t−1}(k)} max(0, ρ̃_{i'→k}(k)) + Σ_{i': i'∈G^{t−1}(k)\{k}} ρ̃_{i'→k}(k) + Σ_{i': i'≠k} ρ̄_{i'→k},   for c_i = k = i
  = Σ_{i': i'≠k} ρ̄_{i'→k},   for c_i ≠ k = i
  = ρ̃_{k→k}(k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max(0, ρ̃_{i'→k}(k)) + Σ_{i': i'∈G^{t−1}(k)\{k,i}} ρ̃_{i'→k}(k) + Σ_{i': i'≠i} ρ̄_{i'→k},   for c_i = k ≠ i
  = max( 0, ρ̃_{k→k}(k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max(0, ρ̃_{i'→k}(k)) + Σ_{i': i'∈G^{t−1}(k)\{k,i}} ρ̃_{i'→k}(k) ) + Σ_{i': i'≠i} ρ̄_{i'→k},   for c_i ≠ k ≠ i
                                                                                     (12)

If we solve for ρ̃_{i→k}(c_i = k) = ρ_{i→k}(c_i = k) − ρ̄_{i→k} and α̃_{i←k}(c_i = k) = α_{i←k}(c_i = k) − ᾱ_{i←k}, we obtain simple update equations where the ρ̄ and ᾱ terms cancel:
ρ̃_{i→k}(k) = ρ_{i→k}(k) − max_{j: j≠k} ρ_{i→k}(j) = s(i, k) − max_{j: j≠k} [ s(i, j) + α̃_{i←j}(j) ]     (13)

α̃_{i←k}(k) = α_{i←k}(k) − α_{i←k}(j : j ≠ k)
  = Σ_{i': i'∉G^{t−1}(k)} max(0, ρ̃_{i'→k}(k)) + Σ_{i': i'∈G^{t−1}(k)\{k}} ρ̃_{i'→k}(k),   for k = i
  = min( 0, ρ̃_{k→k}(k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max(0, ρ̃_{i'→k}(k)) + Σ_{i': i'∈G^{t−1}(k)\{k,i}} ρ̃_{i'→k}(k) ),   for k ≠ i     (14)

The second equation of (14) follows from x − max(0, x) = min(0, x). ρ̃_{i→k}(c_i) and α̃_{i←k}(c_i) for c_i ≠ k are not used in (13) and (14) because α̃_{i←k}(c_i ≠ k) = 0. Now the original vector messages are reduced to scalar ones, with responsibilities defined as r(i, k) = ρ̃_{i→k}(k) and availabilities as a(i, k) = α̃_{i←k}(k), and the message update rules finally become:
r(i, k) = s(i, k) − max_{j: j≠k} [ s(i, j) + a(i, j) ]     (15)

a(i, k) = Σ_{i': i'∉G^{t−1}(k)} max(0, r(i', k)) + Σ_{i': i'∈G^{t−1}(k)\{k}} r(i', k),   for k = i
        = min( 0, r(k, k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max(0, r(i', k)) + Σ_{i': i'∈G^{t−1}(k)\{k,i}} r(i', k) ),   for k ≠ i     (16)
Compared to the original message update rules described in Algorithm 1, we can see that the update equation for responsibilities is left unchanged. For the update equation of availabilities, the temporal constraint translates into having the responsibility messages coming from members of the cluster containing k at time t−1 fixed rather than feeding them into the max(0, ·) function. For values of
k corresponding to newly-born features, the fixed terms Σ_{i'∈G^{t−1}(k)} r(i', k) in (16) vanish and the availability update rule reduces to the original version. Everything else, including the label estimation equation and the algorithm termination criteria, remains unchanged from the original algorithm. The STAP algorithm is summarized in Algorithm 2.
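To make the difference to plain Affinity Propagation explicit, one message-passing iteration of (15) and (16) can be sketched as below. The sketch assumes that G_prev[k] holds the indices of the features clustered with k at time t−1 (empty for newly-born features); it is an illustration only, not the authors' C implementation, and all names are chosen freely.

```python
import numpy as np

def stap_messages(S, R, A, G_prev, lam=0.7):
    """One damped STAP message-passing iteration, Eqs. (15) and (16).
    S, R, A : n x n similarity, responsibility and availability matrices
    G_prev  : dict k -> set of indices clustered with k at time t-1"""
    n = S.shape[0]
    old_R, old_A = R.copy(), A.copy()
    # (15): r(i,k) = s(i,k) - max_{j != k} [ s(i,j) + a(i,j) ]  (unchanged w.r.t. AP)
    AS = A + S
    for i in range(n):
        for k in range(n):
            R[i, k] = S[i, k] - np.max(np.delete(AS[i], k))
    R = (1 - lam) * R + lam * old_R
    # (16): responsibilities from previous cluster members of k enter without max(0, .)
    for k in range(n):
        members = set(G_prev.get(k, set())) - {k}          # G^{t-1}(k) \ {k}
        for i in range(n):
            others = [ip for ip in range(n) if ip not in (i, k)]
            fixed = sum(R[ip, k] for ip in others if ip in members)
            free = sum(max(0.0, R[ip, k]) for ip in others if ip not in members)
            if i == k:
                A[k, k] = fixed + free
            else:
                A[i, k] = min(0.0, R[k, k] + fixed + free)
    A = (1 - lam) * A + lam * old_A
    return R, A
```

Exemplar decisions and the termination test are identical to the sketch of Algorithm 1.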
4 Experiments
To validate the effectiveness and efficiency of the proposed STAP algorithm, we test it with two applications in traffic video analysis. For vehicle detection and tracking, we compare it with the Normalized Cut approach [2], as well as with the original Affinity Propagation. For the virtual loop detector, we compare it with a background subtraction based approach. All the methods are implemented in unoptimized C code and run on a Pentium IV 3.0 GHz machine.
4.1 Vehicle Detection and Tracking
Implementation Details. We adopt a similarity function which incorporates spatial and temporal cues at the feature point level:

s(i, k) = −γ_s d_s(i, k) − γ_m d_m(i, k) − γ_b d_b(i, k)     (17)

d_s(i, k) = ||F_i^t − F_k^t|| measures the spatial proximity between point i and point k, where F_i^t is the real-world position of f_i^t calculated from camera calibration. d_m(i, k) = max_{j = max{τ_i, τ_k}, ..., t} ||(F_i^t − F_i^j) − (F_k^t − F_k^j)|| models the motion similarity between the two trajectories. d_b(i, k) = Σ_{f = f_i^t}^{f_k^t} 1[B^t(f) = 1] calculates the number of background pixels between the two points, where B^t is the background mask at time t.

For temporal association, we use a novel model based approach. First we define for a cluster G_i^t the model fitness function fit(G_i^t), based on the overlap relationship between the convex hull mask spanned by G_i^t and a typical vehicle shape mask generated by projecting a 3D vehicle model at the corresponding position of the cluster. A cluster G_i^t is associated with the cluster G_k^{t−1} from the previous frame if k = arg max_j |G_j^{t−1}| among the clusters satisfying fit(G_i^t ∪ G_j^{t−1}) > β_a. For each unassociated G_i^t, we assign a new vehicle label to it if it satisfies fit(G_i^t) > β_n and |G_i^t| > β_s. We employ a few preprocessing and postprocessing steps: (1) calibrate the camera offline by assuming a planar road surface; (2) build a reference background image using a simple temporal averaging method; (3) detect and track KLT [12] feature points, and prune the ones falling in the background region or outside the ROI; (4) drop small clusters whose sizes are below a predefined threshold β_d.

For the Normalized Cut implementation: (1) we use the same spatial and temporal cues, and the pairwise affinity between point i and point k is defined as e(i, k) = exp s(i, k); (2) for the recursive top-down bipartition process, we adopt a model based termination criterion rather than the original one, which
is based on the normalized cut value. The partitioning of a graph G terminates if fit(G) = 1. (3) The same temporal association scheme, preprocessing and postprocessing steps are applied. For STAP and AP, we set maxits to 1000, convits to 10, λ to 0.7, and s(k, k) to the median of the similarities s(i, k), i ≠ k. All the other parameters of the three methods are set by trial-and-error to achieve the best performance.

Experiment Setup. We use three diversified traffic video sequences for evaluation purposes. The three sequences are from different locations, different camera installations and different weather conditions. city1 and city2 are two challenging sequences of a busy urban intersection, with city1 recorded with a low-height off-axis camera on a cloudy day and city2 with a low-height on-axis camera on a rainy day. They both contain many complicated traffic events during traffic light cycles, such as the typical process of vehicles decelerating, merging, queuing together, then accelerating and separating. The low height of the cameras brings extra difficulties to the detection task. city3 is recorded near a busy urban bridge with a medium-height off-axis camera in sunny weather. For each sequence, the
Algorithm 2. Spatial-Temporal Affinity Propagation
Input: similarity values s(i, k), maxits, convits, λ, G^{t−1}(k)
Output: exemplar index vector c
  iter ← 0, conv ← 0, a ← 0, r ← 0, c ← −∞
  repeat
    iter ← iter + 1, oldc ← c, oldr ← r, olda ← a
    forall i, k do r(i, k) ← s(i, k) − max_{k': k'≠k} [ a(i, k') + s(i, k') ]
    r ← (1 − λ) · r + λ · oldr
    for i ≠ k do a(i, k) ← min( 0, r(k, k) + Σ_{i': i'∉{i}∪G^{t−1}(k)} max(0, r(i', k)) + Σ_{i': i'∈G^{t−1}(k)\{k,i}} r(i', k) )
    forall k do a(k, k) ← Σ_{i': i'∉G^{t−1}(k)} max(0, r(i', k)) + Σ_{i': i'∈G^{t−1}(k)\{k}} r(i', k)
    a ← (1 − λ) · a + λ · olda
    foreach k do if r(k, k) + a(k, k) > 0 then c_k ← k
    foreach k do if c_k ≠ k then c_k ← arg max_{i: c_i = i} s(k, i)
    if c = oldc then conv ← conv + 1 else conv ← 1
  until iter > maxits or conv > convits
Table 1. Sequence information

sequence  camera    camera  weather    sequence  labeled  labeled
name      position  height  condition  length    frames   vehicles
city1     off-axis  low     cloudy     2016      200      2035
city2     on-axis   low     rainy      2016      200      2592
city3     off-axis  medium  sunny      1366      136      1509
Table 2. Performance information. TPR stands for true positive rate, and FPR stands for false positive rate.

                    TPR at 10% FPR                      run time (ms/frame)
sequence   NCut        AP          STAP         NCut       AP         STAP
name       v1    v2    v1    v2    v1    v2     v1   v2    v1   v2    v1   v2
city1      0.65  0.61  0.66  0.68  0.67  0.70   85   136   82   112   83   117
city2      0.67  0.66  0.73  0.73  0.78  0.76   108  244   101  182   104  189
city3      0.42  0.40  0.45  0.44  0.48  0.49   64   80    63   76    64   77
groundtruth is manually labeled at approximately ten frames apart. Detailed information for each sequence is given in Table 1. To facilitate the analysis, we designed two versions for each of the three methods. We label the individual vehicle lanes for each sequence beforehand. The first version works on each lane separately, denoted as v1. The second version works on all the lanes combined, denoted as v2. The performance calculation is based on the comparison with hand-labeled groundtruth on a strict frame-by-frame basis. Experimental Result. The ROC curves of the three methods, each with two versions on the three sequences, are shown in Fig. 2, and detailed performance comparison can be found in Table 2. We can see that both STAP and AP achieve higher performance and run faster than Normalized Cut, and STAP performs better than AP at the cost of a little extra run time. For all the methods, the performance on city3 is the lowest among the three sequences because of the much lower video quality of city3 and specific difficulties in sunny weather video such as shadow cast by vehicles, trees, etc, while the performance on city1 is worse than that on city2 because of the more frequent inter-lane occlusions for off-axis cameras. For STAP and AP, v1 on city1 does not offer an improvement over v2 in performance probably because in off-axis cameras, feature points from one vehicle usually spill over two or even more lanes, thus making lane separation a disadvantage rather than an advantage. For each method, the two versions give us an idea of how the algorithm scales with problem size. We see from Table 2 that the run time of Normalized Cut increases much faster with problem size than STAP and AP do. Normalized Cut performs reasonably well in our experiments because of two improved components in our implementation which are different from the original approach: one is the model based temporal association scheme which can
Fig. 2. ROCs on the three sequences. From left to right: city1, city2 and city3.
Fig. 3. AP vs. STAP. From left to right: clustering results from the previous frame, current frame clustering by AP, current frame clustering by STAP, and the convergence of net similarity for STAP. Identified exemplars are represented by triangles.
avoid undesirable associations that could lead to cluster drift, dilation and other meaningless results; the other is the model based termination criterion for recursive bipartitioning, which introduces object-level semantics into the process.
The advantage of STAP over the original AP algorithm in feature clustering is illustrated in Fig. 3. Without the temporal constraint, AP can end up with fragmented results caused by factors such as unstable image evidence, even though the clustering configuration from the previous frame is good. STAP yields a much more desirable solution than AP by enforcing temporal consistency. From the figure we can also see that, despite the good clustering result, the identified exemplars usually do not have physical meaning. The net similarity value for the STAP example is also plotted in Fig. 3, which shows the fast convergence of STAP. Examples of detection results of STAP on the three sequences are shown in Fig. 4 and examples of generated vehicle trajectories are given in Fig. 5.
4.2 Virtual Loop Detector
Accurate object level information is not always available for traffic state estimation because of the inherent difficulties in the task of vehicle detection and tracking. In such situations, traffic parameters estimated from low-level features are still very useful for facilitating traffic control. A loop detector is a device buried underneath the road surface which can output simple pulse signals when vehicles are passing from above. Loop detectors have long been used in traffic monitoring and control, specifically for estimating traffic density, traffic flow, etc. Because of the high installation and maintenance costs, many computer
Fig. 4. STAP example detections on the three sequences. From left to right: city1, city2 and city3.
Fig. 5. STAP example vehicle trajectories on the three sequences. From left to right: city1, city2 and city3.
vision programs, called virtual loop detectors, have been proposed to simulate the functions of a physical loop device.

Implementation Details. We use the occupancy ratio of a virtual loop mask M_v to estimate the "on" and "off" status of loop v. For the STAP virtual loop detector, we generate an object mask M_o^t, which consists of the convex hulls of the feature clusters at time t. The status of loop v is estimated as

l_v^s = on, if |M_v ∩ M_o^t| / |M_v| > β_l^s;   off, otherwise     (18)

For the background subtraction virtual loop detector, the status of loop v is estimated as

l_v^b = on, if |M_v ∩ F^t| / |M_v| > β_l^b;   off, otherwise     (19)

where F^t is the foreground mask at time t. The parameters of STAP are set in the same way as in the previous application. For background subtraction, we use a simple image averaging method and the parameters are fine-tuned manually.

Experiment Setup. We use the two sequences city1 and city2 from the previous application for the purpose of evaluation. For each sequence, the groundtruth
Fig. 6. From left to right: ROCs and example STAP virtual loop detectors on city1 and city2 respectively.
status of each virtual loop is manually recorded for each frame by a human observer. The performance calculation is done by frame-by-frame comparison between detection and groundtruth.

Experimental Result. The ROC curves of the two methods on the two sequences are shown in Fig. 6, indicating the superiority of STAP over background subtraction in simulating loop detectors. Background subtraction works at the pixel level and is known to be sensitive to difficulties such as illumination change, camera shake, shadows, etc., while STAP outputs feature clusters at the object or semi-object level, thus avoiding some of the problems that background subtraction is subject to. Some example detections of the STAP virtual loop detector are also given in Fig. 6.
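The loop decisions (18) and (19) amount to thresholding an occupancy ratio over the loop mask; a minimal sketch with boolean masks (names hypothetical) is:

```python
import numpy as np

def loop_status(loop_mask, object_mask, beta):
    """Eqs. (18)/(19): 'on' if the fraction of the virtual loop covered by the
    object (or foreground) mask exceeds the threshold beta."""
    overlap = np.logical_and(loop_mask, object_mask).sum()
    return "on" if overlap / loop_mask.sum() > beta else "off"
```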
5 Conclusion
In this paper we discuss an extension of the Affinity Propagation algorithm to feature points clustering by incorporating a temporal consistency constraint on the clustering configurations of consecutive frames. Applications in traffic video analysis demonstrate the advantages of the proposed method over the original algorithm and over state-of-the-art traffic video analysis approaches. In future work, we will explore more applications in traffic video analysis using STAP feature clustering, and investigate the possibility of fusing feature points and other informative features in the STAP framework.

Acknowledgement. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References 1. McLauchlan, P.F., Beymer, D., Coifman, B., Malik, J.: A real-time computer vision system for measuring traffic parameters. In: CVPR, pp. 495–501 (1997) 2. Kanhere, N.K., Pundlik, S.J., Birchfield, S.: Vehicle segmentation and tracking from a low-angle off-axis camera. In: CVPR, pp. 1152–1157 (2005) 3. Du, W., Piater, J.: Tracking by cluster analysis of feature points using a mixture particle filter. In: AVSS, pp. 165–170 (2005)
4. Brostow, G.J., Cipolla, R.: Unsupervised bayesian detection of independent motion in crowds. In: CVPR, pp. 594–601 (2006) 5. Antonini, G., Thiran, J.P.: Counting pedestrians in video sequences using trajectory clustering. TCSVT 16, 1008–1020 (2006) 6. Rabaud, V., Belongie, S.: Counting crowded moving objects. In: CVPR, pp. 705– 711 (2006) 7. Li, Y., Ai, H.: Fast detection of independent motion in crowds guided by supervised learning. In: ICIP, pp. 341–344 (2007) 8. Kim, Z.: Real time object tracking based on dynamic feature grouping with background subtraction. In: CVPR (2008) 9. Kanhere, N., Birchfield, S.: Real-time incremental segmentation and tracking of vehicles at low camera angles using stable features. TITS 9, 148–160 (2008) 10. Yang, J., Wang, Y., Ye, G., Sowmya, A., Zhang, B., Xu, J.: Feature clustering for vehicle detection and tracking in road traffic surveillance. In: ICIP, pp. 1152–1157 (2009) 11. Frey, B.J., Dueck, D.: Clustering by Passing Messages Between Data Points. Science 315, 972–976 (2007) 12. Tomasi, C., Shi, J.: Good features to track. In: CVPR, pp. 593–600 (1994)
Minimal Representations for Uncertainty and Estimation in Projective Spaces
Wolfgang Förstner
University of Bonn, Department of Photogrammetry, Nussallee 15, D-53115 Bonn
[email protected] Abstract. Estimation using homogeneous entities has to cope with obstacles such as singularities of covariance matrices and redundant parametrizations which do not allow an immediate definition of maximum likelihood estimation and lead to estimation problems with more parameters than necessary. The paper proposes a representation of the uncertainty of all types of geometric entities and estimation procedures for geometric entities and transformations which (1) only require the minimum number of parameters, (2) are free of singularities, (3) allow for a consistent update within an iterative procedure, (4) enable to exploit the simplicity of homogeneous coordinates to represent geometric constraints and (5) allow to handle geometric entities which are at infinity or at least very far, avoiding the usage of concepts like the inverse depth. Such representations are already available for transformations such as rotations, motions (Rosenhahn 2002), homographies (Begelfor 2005), or the projective correlation with fundamental matrix (Bartoli 2004) all being elements of some Lie group. The uncertainty is represented in the tangent space of the manifold, namely the corresponding Lie algebra. However, to our knowledge no such representations are developed for the basic geometric entities such as points, lines and planes, as in addition to use the tangent space of the manifolds we need transformation of the entities such that they stay on their specific manifold during the estimation process. We develop the concept, discuss its usefulness for bundle adjustment and demonstrate (a) its superiority compared to more simple methods for vanishing point estimation, (b) its rigour when estimating 3D lines from 3D points and (c) its applicability for determining 3D lines from observed image line segments in a multi view setup.
Motivation. Estimation of entities in projective spaces, such as points or transformations, has to cope with the scale ambiguity of these entities, resulting from the redundancy of the projective representations, and with the definition of proper metrics which on the one hand reflect the uncertainty of the entities and on the other hand lead to estimates which are invariant to possible gauge transformations, thus to changes of the reference system. The paper shows how to consistently perform maximum likelihood estimation for an arbitrary number of geometric entities in projective spaces, including elements at infinity, under realistic assumptions and without the need to impose constraints resulting from the redundant representations.
The scale ambiguity of homogeneous entities results from the redundant representation, where two elements, say the 2D points x(x) and y(y), are identical in case their representations with homogeneous coordinates, here x and y, are proportional. This ambiguity regularly is avoided by proper normalization of the homogeneous entities. Mostly one applies either Euclidean normalization, say x_e = x/x_3 (cf. [13]), then accepting that no elements at infinity can be represented, or spherical normalization, say x^s = x/|x|, then accepting that the parameters to be estimated sit on a non-linear manifold, here the unit sphere S^2, cf. [7, 11]. The sign ambiguity usually does not cause difficulty, as the homogeneous constraints used for reasoning are independent of the chosen sign. The singularity of constrained observations has also been pointed out in [5]. The uncertainty of an observed geometric entity can, in many practical cases, be represented sufficiently well by a Gaussian distribution, say N(µ_x, Σ_xx). The distribution of derived entities, y = f(x), resulting from a non-linear transformation can also be approximated by a Gaussian distribution, using a Taylor expansion at the mean µ_x and omitting higher order terms. The degree of approximation depends on the relative accuracy and has been shown to be negligible in most cases, cf. [8, p. 55]. The invariance of estimates w.r.t. the choice of the coordinate system of the estimated entities usually is achieved by minimizing a function in the Euclidean space of observations, in the context of bundle adjustment being the reprojection error, leading to the optimization function Ω = Σ_i (x_i − x̂_i)^T Σ_{x_i x_i}^{−1} (x_i − x̂_i). This at the same time is the Mahalanobis distance between the observed and estimated entities and can be used to evaluate whether the model fits the data. The situation becomes difficult in case one wants to handle elements at infinity and therefore wants to use spherically normalized homogeneous vectors, or at least normalized direction vectors when using omnidirectional cameras, as their covariance matrices are singular. The rank deficiency is at least one, due to the homogeneity. In case further constraints need to be taken into account, such as the Plücker constraint for 3D lines or the singularity constraint for fundamental matrices, the rank deficiency increases with the number of these constraints. Therefore, in case we want to use these normalized vectors or matrices as observed quantities, already the formulation of the optimization function based on homogeneous entities is not possible and requires a careful discussion about estimable quantities, cf. [17]. Also, the redundant representation requires additional constraints, which lead to Lagrangian parameters in the estimation process. As an example, one would need four parameters to estimate a 2D point, three for the homogeneous coordinates and one Lagrangian for the constraint, two parameters more than the degrees of freedom. It remains open how to arrive at a minimal representation for the uncertainty and at an estimation of all types of geometric entities in projective spaces which are free of singularities and allow to handle entities at infinity.

Related work. This problem has been addressed successfully for geometric transformations. Common to these approaches is the observation that all types of
transformations form differentiable groups called Lie groups. Starting from an approximate transformation, the estimation can exploit the corresponding Lie algebra, being the tangent space at the unit element of the manifold of the group. Take as an example the group SO(3) of rotations: starting from an approximate rotation R^a, a close-by rotation can be represented by R = R(ΔR) R^a, with a small rotation with rotation matrix R(ΔR) depending on the small rotation vector ΔR. This small rotation vector can be estimated in the approximate model R ≈ (I_3 + S(ΔR)) R^a, where the components of the rotation vector ΔR appear linearly, building the three-dimensional space of the Lie algebra so(3) corresponding to the Lie group SO(3). Here S(·) is the skew symmetric matrix inducing the cross product. The main difference to a Taylor approximation, R ≈ R^a + ΔR, which is additive, is the multiplicative correction in the update above, which guarantees that the corrected matrix is a proper rotation matrix, based on the exponential representation R(ΔR) = exp(S(ΔR)) of a rotation using skew symmetric matrices.
This concept can be found in all proposals for a minimal representation of transformations: based on the work of Bregler and Malik [6], Rosenhahn et al. [19] used the exponential map for modelling spatial Euclidean motions, the special Euclidean group SE(3) being composed of rotations SO(3) and translations in IR^3. Bartoli and Sturm [2] used the idea to estimate the fundamental matrix with the minimal representation F = R_1 Diag(exp(λ), exp(−λ), 0) R_2, twice using the rotation group and once the multiplicative group IR^+. Begelfor and Werman [4] showed how to estimate a general 2D homography with a minimal representation in a statistically rigorous manner, namely using the special linear group SL(3) of 3 × 3 matrices with determinant 1 and its Lie algebra sl(3), consisting of all matrices with trace zero, building an eight-dimensional vector space and thus reflecting the correct number of degrees of freedom.
To our knowledge the only attempts to use minimal representations for geometric entities other than transformations have been given by Sturm [20] and Åström [1]. Sturm suggested a minimal representation of conics, namely C = R D R^T with D = Diag(a, b, c). Using a corresponding class of homographies H = R Diag(d, e, f), R being a rotation matrix, which map any conic into the unit circle Diag(1, 1, −1), he determines updates for the conic in this reduced representation and at the end undoes the mapping. Another way to achieve a minimal representation for homogeneous entities is given by Åström [1] in the context of structure from motion. He proposes to use the Cholesky decomposition of the pseudo inverse of the covariance matrix of the spherically normalized homogeneous 2D point coordinates x to arrive at a whitened and reduced observation. For 3D lines he uses a special double-point representation with minimal parameters. He provides no method to update the approximate values x^a guaranteeing the estimate x̂ ∈ S^2. His 3D line representation is also not linked to the projective Plücker representation, and he cannot estimate elements at infinity.
Our proposal is similar in flavour to the idea of Sturm and the method of Åström to represent uncertain 2D points. However, it is simpler to handle, as it directly works on the manifold of the homogeneous entities.
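The multiplicative rotation update described above is easy to state in code. The following sketch is only an illustration (it is not part of the paper and uses a hand-rolled Rodrigues exponential map with freely chosen names); it contrasts the multiplicative correction, which stays in SO(3), with the additive Taylor-style correction, which does not.

```python
import numpy as np

def skew(v):
    """S(v): skew-symmetric matrix inducing the cross product."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def rotation_from_vector(dr):
    """R(dr) = exp(S(dr)) via the Rodrigues formula."""
    theta = np.linalg.norm(dr)
    if theta < 1e-12:
        return np.eye(3) + skew(dr)
    K = skew(dr / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

R_a = np.eye(3)                          # approximate rotation
dr = np.array([0.01, -0.02, 0.005])      # small estimated rotation vector
R_mult = rotation_from_vector(dr) @ R_a  # multiplicative update, stays in SO(3)
R_add = R_a + skew(dr) @ R_a             # additive correction, leaves SO(3)
print(np.linalg.det(R_mult), np.linalg.det(R_add))  # ~1.0 vs. slightly off
```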
Notation. We name objects with calligraphic letters, say a point x, in order to be able to change the representation, e.g. using Euclidean coordinates denoted with a slanted letter x or homogeneous coordinates with an upright letter x. Matrices are denoted with sans-serif capital letters, homogeneous matrices in upright type. This transfers to the indices of covariance matrices, the covariance matrix Σ_xx for a Euclidean vector and Σ_xx for a homogeneous vector. The operator N(·) normalizes a vector to unit length. We adopt the Matlab syntax to denote the stack of two vectors or matrices, e.g. z = [x; y] = [x^T, y^T]^T. Stochastic variables are underscored, e.g. x.
1 Minimal Representation of Uncertainty
The natural space of homogeneous entities are the unit spheres S^n, possibly constrained to a subspace. Spherically normalized homogeneous coordinates of 2D points (x^s) and lines (l^s), and of 3D points (X^s) and planes (A^s), live on S^2 and S^3 respectively, while 3D lines, represented by Plücker coordinates (L^s), live on the Klein quadric Q. The points on S^3 build the Lie group of unit quaternions, which could be used for an incremental update q = Δq q^a of spherically normalized 3D point or plane vectors. However, there exist no Lie groups for points on the 2-sphere or on the Klein quadric, see [18]. Therefore we need to develop an update scheme which guarantees the updates to lie on the manifold, without relying on some group concept. We will develop the concept for unit vectors on S^2, representing 2D points and lines, and generalize it to the other geometric entities.
1.1 Minimal Representation for Uncertain Points in 2D and 3D
Let an uncertain 2D point x be represented by its mean, the 2-vector µ_x, and its 2 × 2 covariance matrix Σ_xx. It can be visualized by the standard ellipse (x − µ_x)^T Σ_xx^{−1} (x − µ_x) = 1. Spherically normalizing the homogeneous vector x = [x; 1] = [u, v, w]^T yields

x^s = x / |x|,   Σ_{x^s x^s} = J Σ_xx J^T   with   Σ_xx = [ Σ_xx  0 ; 0^T  0 ],   J = (1/|x|) (I_3 − x^s x^{sT})     (1)

with rank(Σ_{x^s x^s}) = 2 and null(Σ_{x^s x^s}) = x^s. Taking the smallest eigenvalue to be an infinitely small positive number, one sees that the standard error ellipsoid is flat and lies in the tangent space of x at S^2. In the following we assume all point vectors x to be spherically normalized and omit the superscript s for simplicity of notation.
We now want to choose a coordinate system J_x(x) = [s, t] in the tangent space ⊥ x and represent the uncertainty by a 2 × 2 matrix in that coordinate system. This is easily achieved by using

J_x(x) = null(x^T),     (2)
Fig. 1. Minimal representation for an uncertain point x(x) on the unit sphere S^2, representing the projective plane IP^2, by a flat ellipsoid in the tangent plane at x. The uncertainty has only two degrees of freedom in the tangent space spanned by the two basis vectors s and t of the tangent space, being the null space of x^T. The uncertainty should not be too large, such that the distributions on the sphere and on the tangent plane do not differ too much, as they would at point y.
assuming it to be an orthonormal matrix fulfilling J_x^T(x) J_x(x) = I_2, see Fig. 1. We define a stochastic 2-vector x_r ∼ M(0, Σ_{x_r x_r}) with mean 0 and covariance Σ_{x_r x_r} in the tangent space at x. In order to arrive at a spherically normalized random vector x with mean µ_x, we need to spherically normalize the vector x^t = µ_x + J_x(µ_x) x_r in the tangent space and obtain

x(µ_x, x_r) = N( µ_x + J_x(µ_x) x_r ),     (3)

J_x(µ_x) = ∂x / ∂x_r |_{x = µ_x}.     (4)

We thus can identify J_x(µ_x) with the Jacobian of this transformation. The inverse transformation can be achieved using the pseudo inverse of J_x(x), which due to the construction is the transpose, J_x^+(x) = J_x^T(x). This leads to the reduction of the homogeneous vector to its reduced counterpart

x_r = J_x^T(µ_x) x.     (5)

As J_x^T(µ_x) µ_x = 0, the mean of x_r is the zero vector, µ_{x_r} = 0. This allows to establish the one-to-one correspondence between the reduced covariance matrix Σ_{x_r x_r} and the covariance matrix Σ_xx of x:

Σ_xx = J_x(µ_x) Σ_{x_r x_r} J_x^T(µ_x),     Σ_{x_r x_r} = J_x^T(µ_x) Σ_xx J_x(µ_x).     (6)
We use (5) to derive reduced observations and parameters, and after estimating corrections Δx̂_r we apply (4) to find corrected estimates x̂ = x(x^a, Δx̂_r). A similar reasoning leads to the representation of 3D points. Again, the Jacobian J_X is the null space of X^T and spans the 3-dimensional tangent space of S^3 at X. The relations between the singular 4 × 4 covariance matrix of the spherically normalized vector X and the reduced 3 × 3 covariance matrix Σ_{X_r X_r} are equivalent to (6). Homogeneous 3-vectors l representing 2D lines and homogeneous 4-vectors A representing planes can be handled in the same way.
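As a small numerical illustration of (1)–(6) for a 2D point (not part of the paper; the null-space basis is computed here via an SVD, and names are chosen freely):

```python
import numpy as np

def null_basis(x):
    """Orthonormal basis of the tangent space, null(x^T), via SVD (Eq. (2))."""
    x = x / np.linalg.norm(x)
    _, _, Vt = np.linalg.svd(x.reshape(1, -1))
    return Vt[1:].T                       # columns span the null space of x^T

def reduce_cov(x, Sigma_hom):
    """Eq. (6): reduced 2 x 2 covariance in the tangent space at x."""
    J = null_basis(x)
    return J.T @ Sigma_hom @ J

def update_point(x_a, dx_r):
    """Eqs. (3)-(5): corrected, spherically normalized estimate."""
    J = null_basis(x_a)
    x_t = x_a / np.linalg.norm(x_a) + J @ dx_r
    return x_t / np.linalg.norm(x_t)

# example: Euclidean 2D point, its homogeneous embedding and Eq. (1)
x_e = np.array([2.0, 1.0])
x = np.append(x_e, 1.0)
x_s = x / np.linalg.norm(x)
Sigma_e = np.diag([1e-4, 4e-4])          # Euclidean 2 x 2 covariance
Sigma_h = np.zeros((3, 3)); Sigma_h[:2, :2] = Sigma_e
Jn = (np.eye(3) - np.outer(x_s, x_s)) / np.linalg.norm(x)
Sigma = Jn @ Sigma_h @ Jn.T              # singular 3 x 3 covariance of x^s
print(reduce_cov(x_s, Sigma))            # regular 2 x 2 matrix
print(update_point(x_s, np.array([1e-3, -2e-3])))
```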
1.2 Minimal Representation for 3D Lines
We now generalize the concept to 3D lines. Lines L in 3D are represented by their Plücker coordinates L = [L_h; L_0] = [Y − X; X × Y] in case they are
derived by joining two points X(X) and Y(Y). Line vectors need to fulfill the quadratic Plücker constraint L_h^T L_0 = 0 and span the Klein quadric Q, consisting of all homogeneous 6-vectors fulfilling the Plücker constraint. The dual line L̄ has Plücker coordinates L̄ = [L_0; L_h], exchanging the first and second 3-vector. We also will use the Plücker matrix Γ(L) = X Y^T − Y X^T of a line and will assume 3D line vectors L to be spherically normalized. As in addition a 6-vector needs to fulfill the Plücker constraint in order to represent a 3D line, the space of 3D lines is four-dimensional.
The transfer of the minimal representation of points to 3D lines requires some care. The four-dimensional tangent space is perpendicular to L, as L^T L − 1 = 0 holds, and perpendicular to L̄, as L̄^T L = 0 holds. Therefore the tangent space is given by the four columns of the 6 × 4 matrix

J_L(L) = null( [L, L̄]^T ),     (7)

again assuming this matrix to be orthonormal. Therefore for random perturbations L_r we have the general 6-vector

L^t(µ_L, L_r) = µ_L + J_L(µ_L) L_r     (8)

in the tangent space. In order to arrive at a random 6-vector which is both spherically normalized and fulfills the Plücker constraint also for finite random perturbations, we need to normalize L^t = [L_h^t; L_0^t] accordingly. The two 3-vectors L_h^t and L_0^t in general are not orthogonal. Following the idea of Bartoli [3] we therefore rotate these vectors in their common plane such that they become orthogonal. We use a simplified modification, as the normalization within an iteration sequence will have decreasing effect. We use linear interpolation of the directions D_h = N(L_h^t) and D_0 = N(L_0^t). With the distance d = |D_h − D_0| and the shortest distance r = √(1 − d²/4) of the origin to the line joining D_h and D_0 we have

M_{h,0} = ( (d ± 2r) D_h + (d ∓ 2r) D_0 ) |L_h^t| / (2d).     (9)

The 6-vector M = [M_h; M_0] now fulfills the Plücker constraint but needs to be spherically normalized. This finally leads to the normalized stochastic 3D line coordinates

L = N( L^t(µ_L, L_r) ) = M / |M|,     (10)

which guarantees L to sit on the Klein quadric, thus to fulfill the Plücker constraint. The inverse relation to (10) is

L_r = J_L^T(µ_L) L     (11)
as J_L(µ_L) is an orthonormal matrix. The relations between the covariances of L and L_r therefore are

Σ_LL = J_L(µ_L) Σ_{L_r L_r} J_L^T(µ_L),     Σ_{L_r L_r} = J_L^+(µ_L) Σ_LL J_L^{+T}(µ_L).     (12)
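The update of an approximate 3D line by a reduced 4-vector, Eqs. (7)–(10), can be sketched as follows. This is an illustration only: the tangent basis is obtained via an SVD-based null space, the scaling in the interpolation step follows (9) as stated above, and all names are chosen freely.

```python
import numpy as np

def null_basis(M):
    """Orthonormal basis of the null space of M (rows of M span the constraints)."""
    _, s, Vt = np.linalg.svd(M)
    rank = int(np.sum(s > 1e-12))
    return Vt[rank:].T

def dual(L):
    """Dual line: exchange the first and second 3-vector."""
    return np.concatenate([L[3:], L[:3]])

def update_line(L_a, dL_r):
    """Eqs. (7)-(10): move L_a by the reduced correction dL_r and
    re-establish the Pluecker constraint and unit norm."""
    L_a = L_a / np.linalg.norm(L_a)
    J = null_basis(np.vstack([L_a, dual(L_a)]))   # 6 x 4 tangent basis, Eq. (7)
    Lt = L_a + J @ dL_r                           # point in the tangent space, Eq. (8)
    Lh, L0 = Lt[:3], Lt[3:]
    Dh, D0 = Lh / np.linalg.norm(Lh), L0 / np.linalg.norm(L0)
    d = np.linalg.norm(Dh - D0)
    r = np.sqrt(1.0 - d**2 / 4.0)
    factor = np.linalg.norm(Lh) / (2 * d)
    # Eq. (9): interpolate the two directions until they become orthogonal
    Mh = ((d + 2 * r) * Dh + (d - 2 * r) * D0) * factor
    M0 = ((d - 2 * r) * Dh + (d + 2 * r) * D0) * factor
    M = np.concatenate([Mh, M0])
    return M / np.linalg.norm(M)                  # Eq. (10)

# example: line through X = (0,1,0) and Y = (1,1,0), slightly perturbed
L = np.array([1.0, 0, 0, 0, 0, -1.0])
L_new = update_line(L, np.array([1e-3, -2e-3, 5e-4, 1e-3]))
print(np.dot(L_new[:3], L_new[3:]))               # Pluecker constraint ~ 0
```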
Conics and quadrics can be handled in the same manner. A conic C = [a, b, c; b, d, e; c, e, f] can be represented by the vector c = [a, b, c, d, e, f]^T, normalized to 1, and – due to the homogeneity of the representation – is a point in IP^5 with the required Jacobian being the null space of c^T. In contrast to the approach of Sturm [20], this setup would allow for singular conics.
Using the minimal representations introduced in the last section, we are able to perform ML-estimation for all entities. We restrict the model to contain constraints between observed and unknown entities only, not between the parameters only; a generalization is easily possible. We start with the model where the observations have regular covariance matrices and then reduce the model, such that also observations with singular covariance matrices can be handled. Finally, we show how to arrive at a Mahalanobis distance for uncertain geometric entities, where we need the inverse of the covariance matrix.

The optimization problem. We want to solve the following optimization problem:

minimize   Ω(v) = v^T Σ_ll^{−1} v   subject to   g(l + v, x) = 0     (13)
where the N observations l, their N × N covariance matrix Σ_ll and the G constraint functions g are given, and the N residuals v and the U parameters x are unknown. The number G of constraints needs to be larger than the number of parameters U. Also it is assumed that the constraints are functionally independent. The solution yields the maximum likelihood estimates, namely the fitted observations l̂ and parameters x̂, under the assumption that the observations are normally distributed with covariance matrix Σ_ll = D(l) = D(v), and the true observations l̃ fulfill the constraints given the true parameters x̃.

Example: Bundle adjustment. Bundle adjustment is based on the projection relation x_ij = λ_ij P_j X_i between the scene points X_i, the projection matrices P_j and the image points x_ij of point X_i observed in camera j. The classical approach eliminates the individual scale factors λ_ij by using Euclidean coordinates for the image points. Also the scene points are represented by their Euclidean coordinates. This does not allow for scene or image points at infinity. This may occur when using omnidirectional cameras, where a representation of the image points in a projection plane is not possible for all points, or in case scene points are very far compared to the length of the motion path of a camera, e.g. at the horizon. Rewriting the model as x_ij × P_j X_i = 0 eliminates the scale factor without constraining the image points to be not at infinity. Taking the homogeneous coordinates of all image points as observations and the parameters of all cameras and the homogeneous coordinates of all scene points as unknown parameters shows this model to have the structure of (13). The singularity of the covariance matrix of the spherically normalized image points and the necessity to represent the scene points also with spherically normalized homogeneous vectors require to use the corresponding reduced coordinates.

For solving the generally non-linear problem, we assume approximate values x̂^a and l̂^a for the fitted parameters and observations to be available. We thus search for corrections Δl̂ and Δx̂ for the fitted observations and parameters using l̂ = l̂^a + Δl̂ and x̂ = x̂^a + Δx̂, hence v̂ = l̂^a + Δl̂ − l. With these assumptions we can rephrase the optimization problem: minimize Ω(Δl̂) = (l̂^a − l + Δl̂)^T Σ_ll^{−1} (l̂^a − l + Δl̂) subject to g(l̂^a + Δl̂, x̂^a + Δx̂) = 0. The approximate values are iteratively improved by finding best estimates for Δl̂ and Δx̂ using the linearized constraints

g(l̂^a, x̂^a) + A Δx̂ + B^T Δl̂ = 0     (14)

with the corresponding Jacobians A and B^T of g to be evaluated at the approximate values.
Reducing the model. We now want to transform the model in order to allow for observations with singular covariances. For simplicity we assume the vectors l and x of all observations and unknown parameters can be partitioned into I and J individual and mutually uncorrelated observational vectors li , i = 1, ..., I and parameter vectors xj , j = 1, ..., J, referring to points, lines, planes or transformations. We first introduce the reduced observations lri , the reduced corrections of the
observations Δl̂_ri and the reduced corrections of the parameters Δx̂_rj:

l_ri = J_{l_i}^T(l̂^a, x̂^a) l_i,    Δl̂_ri = J_{l_i}^T(l̂^a, x̂^a) Δl̂_i,    Δx̂_rj = J_{x_j}^T(l̂^a, x̂^a) Δx̂_j     (15)

where each Jacobian refers to the type of the entity it is applied to. The reduced approximate values are zero, as they are used to define the reduction; e.g. from (5) we conclude x_r^a = J_x^T(x^a) x^a = 0. We collect the Jacobians in two block diagonal matrices J_l^T = {J_{l_i}^T(l̂^a, x̂^a)} and J_x^T = {J_{x_j}^T(l̂^a, x̂^a)} in order to arrive at the reduced observations l_r = J_l^T l, the corrections for the reduced observations Δl̂_r = J_l^T Δl̂ and parameters Δx̂_r = J_x^T Δx̂.
Second, we need to reduce the covariance matrices Σ_{l_i l_i}. This requires some care: as a covariance matrix is the mean squared deviation from the mean, we need to refer to the best estimate of the mean when using it. In our context the best estimate for the mean at the current iteration is the approximate value l̂_i^a. Therefore we need to apply two steps: (1) transfer the given covariance matrix, referring to l_i, such that it refers to l̂_i^a, and (2) reduce the covariance matrix to the minimal representation l_ri. As an example, let the observations be 2D lines with spherically normalized homogeneous vectors l_i. Then the reduction is achieved by Σ^a_{l_ri l_ri} = J_i^a Σ_{l_i l_i} J_i^{aT} with J_i^a = J_x^T(l̂_i^a) R(l_i, l̂_i^a), namely by first applying the minimal rotation R(l_i, l̂_i^a) from l_i to l̂_i^a (see [16, eq. (2.183)]) and second reducing the covariance matrix following (6). Observe, we use the same Jacobian J_x as for points, exploiting the duality of 2D points and 2D lines. The superscript a in Σ^a_{l_ri l_ri} indicates that the covariance depends on the approximate values. The reduced constraints now read as

g(l̂^a, x̂^a) + A_r Δx̂_r + B_r^T Δl̂_r = 0   with   A_r = A J_x,   B_r^T = B^T J_l.     (16)

Now we need to minimize the weighted sum of the squared reduced residuals v̂_r = l̂_r − l_r = −l_r + Δl̂_r, thus we need to minimize Ω(Δl̂_r) = (−l_r + Δl̂_r)^T (Σ^a_{l_r l_r})^{−1} (−l_r + Δl̂_r) subject to the reduced constraints in (16).
The parameters of the linearized model are obtained from (cf. [16, Tab. 2.3])

Σ_{x̂_r x̂_r} = ( A_r^T ( B_r^T Σ^a_{l_r l_r} B_r )^{−1} A_r )^{−1}     (17)
Δx̂_r = Σ_{x̂_r x̂_r} A_r^T ( B_r^T Σ^a_{l_r l_r} B_r )^{−1} w_g     (18)
Δl̂_r = Σ^a_{l_r l_r} B_r ( B_r^T Σ^a_{l_r l_r} B_r )^{−1} ( w_g − A_r Δx̂_r ) − v̂_r^a     (19)

using w_g = −g(l̂^a, x̂^a) + B_r^T v̂_r^a. It contains the theoretical covariance matrix Σ_{x̂_r x̂_r} in (17), at the same time being the Cramér-Rao bound. The weighted sum of residuals Ω is χ²-distributed with G − U degrees of freedom in case the observations fulfill the constraints and are normally distributed with covariance matrix Σ_ll, and can be used to test the validity of the model.
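Written out in linear algebra, one reduced estimation step (17)–(19) is only a few lines. The sketch below is illustrative: it assumes the reduced Jacobians, the reduced covariance, the constraint values and the current reduced residuals are already assembled, and all names are hypothetical.

```python
import numpy as np

def reduced_estimation_step(A_r, B_r, Sigma_lr, g_a, v_r_a):
    """One step of the reduced estimation, Eqs. (17)-(19).
    A_r      : G x U  reduced Jacobian w.r.t. the parameters
    B_r      : N x G  reduced Jacobian w.r.t. the observations (B_r^T multiplies dl_r)
    Sigma_lr : N x N  regular reduced covariance of the observations
    g_a      : G-vector of constraint values g(l^a, x^a)
    v_r_a    : N-vector of current reduced residuals"""
    W = np.linalg.inv(B_r.T @ Sigma_lr @ B_r)        # (B_r^T Sigma B_r)^{-1}
    Sigma_xr = np.linalg.inv(A_r.T @ W @ A_r)        # (17): covariance of dx_r
    w_g = -g_a + B_r.T @ v_r_a                       # misclosure
    dx_r = Sigma_xr @ A_r.T @ W @ w_g                # (18)
    dl_r = Sigma_lr @ B_r @ W @ (w_g - A_r @ dx_r) - v_r_a   # (19)
    return dx_r, dl_r, Sigma_xr
```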
The corrections Δx̂_r and Δl̂_r are used to update the approximate values for the parameters and the fitted observations using the corresponding non-linear transformations, e.g. for an unknown 3D line one uses (10).

Example: Mahalanobis distance of two 3D lines. As an example for such an estimation we want to derive the mean of two 3D lines L_i(L_i, Σ_{L_i L_i}), i = 1, 2, and derive the Mahalanobis distance d(L_1, L_2) of the 3D lines. The model reads g(L̂_1, L̂_2) = L̂_2 − L̂_1 = 0, which exploits the fact that the line coordinates are normalized. First we need to choose an appropriate approximate line L^a, which, in case the two lines are not too different, can be one of the two. Then we reduce the two lines using the Jacobian J_L(L^a), being the same for both lines L_i, and obtain L_ri = J_L^T(L^a) L_i and the reduced covariance matrices Σ^a_{L_ri L_ri} = J_i^a Σ_{L_i L_i} J_i^{aT} with J_i^a = J_L^T(L^a) R(L_i, L^a), using the minimal rotation from L_i to L^a. As there are no parameters to be estimated, the solution becomes simple. With the Jacobian B^T = [−I_6, I_6] and using Δx̂_r = 0 in (19), the reduced residuals are v̂_ri = ±Σ^a_{L_ri L_ri} (Σ^a_{L_r1 L_r1} + Σ^a_{L_r2 L_r2})^{−1} (−(L_r2 − L_r1)). The weighted sum Ω of the residuals is the Mahalanobis distance and therefore can be determined from

d²(L_1, L_2) = (L_r2 − L_r1)^T ( Σ^a_{L_r1 L_r1} + Σ^a_{L_r2 L_r2} )^{−1} (L_r2 − L_r1),     (20)

as the reduced covariance matrices in general have full rank. The squared distance d² is χ²_4 distributed, which can be used for testing. In case one does not have the covariance matrices of the 3D lines L_i, but only two points X_i(X_i) and Y_i(Y_i) of a 3D line segment, and a good guess for their uncertainty, say σ² I_3, one easily can derive the covariance matrix of L_i = [Y_i − X_i; X_i × Y_i] by variance propagation. The equations directly transfer to the Mahalanobis distance of two 2D line segments. Thus, no heuristic is required to determine the distance of two line segments.
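A sketch of the test statistic (20), reusing the tangent basis from the previous subsection, may be helpful. It is an illustration only: the minimal rotation of the covariance matrices towards L^a is assumed to have been applied already, i.e. both reduced covariance matrices are taken to refer to L^a, and names are chosen freely.

```python
import numpy as np

def line_mahalanobis(L1, L2, Sigma1_r, Sigma2_r, J_a):
    """Squared Mahalanobis distance of Eq. (20).
    L1, L2   : spherically normalized 6-vectors of the two lines
    Sigma*_r : 4 x 4 reduced covariance matrices, already referring to L^a
    J_a      : 6 x 4 tangent basis at the approximate line L^a (Eq. (7))"""
    Lr1 = J_a.T @ L1                      # reduced coordinates, Eq. (11)
    Lr2 = J_a.T @ L2
    diff = Lr2 - Lr1
    return diff @ np.linalg.inv(Sigma1_r + Sigma2_r) @ diff   # chi^2_4 under H0
```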
2 Examples
We want to demonstrate the setup with three interrelated problems: (1) demonstrating the superiority of the rigorous estimation compared to a classical one when estimating a vanishing point, (2) fitting a straight line through a set of 3D
points, and (3) inferring a 3D line from image line segments observed in several images, as these tasks regularly appear in 3D reconstruction of man-made scenes and are often solved suboptimally, see e.g. [12].

Estimating a vanishing point. Examples for vanishing point estimation from line segments using this methodology are found in [9]. Here we want to demonstrate that using a simple optimization criterion can lead to less interpretable results. We compare two models for the random perturbations of line segments: α) a model which determines the segment by fitting a line through the edge pixels, which are assumed to be determinable with a standard deviation of σ_p; β) the model of Liebowitz [15, Fig. 3.16], who assumes the endpoints to have a zero-mean isotropic Gaussian distribution, and that the endpoint measurements are independent and have the same variance σ_e². Fig. 2 demonstrates the difference of the two models. The model α, using line fitting, results in a significant decrease of the uncertainty for longer line segments, the difference in standard deviation going with the square root of the length of the line segment. For a simulated case
Fig. 2. Error bounds (1 σ) for line segments. Dashed: following the model of Liebowitz [15], standard deviation at the end points σ = 0.15 [pel]. Solid: from fitting a line through the edge pixels with standard deviation σ_p = 0.3 [pel], for segments with 10, 40 and 160 pixels. Deviations 20 times overemphasized. The uncertainties differ by a factor of 2 on average in this case.
with 10 lines between 14 and 85 pixels, the uncertainty models on average differ by (85/14)^{1/4} ≈ 1.6 in standard deviation. Based on 200 repetitions, the empirical scatter of the vanishing point of method β of Liebowitz is approximately 20% larger in standard deviation than that of method α, which uses the line fitting accuracy as error model. This is a small gain. However, when statistically testing whether the line segments belong to the vanishing point, the decision depends on the assumed stochastic model for the line segments: compared to short segments, the test will be much more sensitive for long segments in case model α is used than when model β is used, which appears to be reasonable.
(21)
Minimal Representations for Uncertainty and Estimation
629
with the Jacobians 1 (L) and 2 (Xi ) depending on the homogeneous coordinates of the points and the line. The incidence constraint of a point and a line can be expressed with the Plcker matrix of the dual line: I (L)X = 0. As from these four incidence constraints only two are linearly independent, we select those two, where the entries in I (L) have maximal absolute value, leading to the two constraints g i (Xi , L) = 1 (L)Xi for each point with the 2 × 4 matrix 1 (L). The Jacobian 2 (Xi ) = ∂g i /∂L then is a 2 × 6-matrix, cf. [16, sect. 2.3.5.3]. Reduction of these constraints leads to g i (Xi , L) = 1r (L)Xi = r2 (Xi )L = 0 with the reduced Jacobians 1r (L) and r2 (Xi ) having size 2×3 and 2×4 leading to a 4 × 4 normal equation matrix, the inverse of (17). Observe, the estimation does not need to incorporate the Plcker constraint, as this is taken into account after the estimation of Lr by the back projection (10). We want to perform two tests of the setup: (1) whether the estimated variance factor actually is F -distributed, and (2) whether the theoretical covariance matrix ΣL, L corresponds to the empirical one. For this we define true line pa˜ generate I true points X ˜ i on this line, choose individual covariance rameters L, matrices ΣXi Xi and perturb the points according to a normal distribution with zero mean and these covariance matrices. We generate M samples, by repeating the perturbation M times. We determine initial estimates for the lines using an algebraic minimization, based on the constraints (21), which are linear in L and perform the iterative ML-estimation. The iteration is terminated in case the corrections to the parameters are smaller than 0.1 their standard deviation. We first perform a test where the line passes through the unit cube and contains I = 100 points with, thus G = 200 and U = 4. The standard deviations of the points vary between 0.0002 and 0.03, thus range to up to 3% relative accuracy referring to the distance to their centroid, which is comparably large. As 2 each weighted sum of squared residuals Ω is χ -distributed with G−U = R = 196 degrees of freedom, the sum Ω = samples Ωm is m ωm of the independent 2 2 2 χMR distributed, thus the average variance factor σ 0 =
0 m is FMR,∞ mσ distributed. The histogram of M = 1000 samples σ
0 2m is shown in fig. 3, left. with Σ , the CramerSecond, we compare the sample covariance matrix Σ LL L,L Rao-bound and a lower bound for the true covariance matrix, using the test for = Σ from [14, eq. (287.5)]. As both covariance mathe hypothesis H0 : Σ LL L,L trices are singular we on one side reduce the theoretical covariance matrix to obtain ΣL r ,L r and on the other side reduce the estimated line parameters which rm } from the set {L we use to determine the empirical covariance matrix Σ Lr Lr of the M estimated 3D lines. The test statistic indicates that the hypothesis of the two covariance matrices being was not rejected. Estimating 3D lines from image lines segments. The following example demonstrates the practical use of the proposed method: namely determining 3D lines from image line segments. Fig. 3 shows three images taken with a CANON 450D. The focal length was determined using vanishing points, the principle point was assumed to be the image centre, the images were not corrected for lens distortion. The images then have been mutually oriented using a bundle adjustment
630
W. F¨ orstner
1
5 2 9
3
1
5
1
6
6 11
10
10 9 3
7 8
4 12
4
5 6
2
2
10
11
9 3
7 8 12
11
7 8
4 12
Fig. 3. Left: Histogram of estimated variance factors σ̂²_{0m} determined from M = 1000 samples of a 3D line fitted through 100 points. The 99% confidence interval for the Fisher-distributed random variable σ̂₀² is [0.80, 1.25]. Right: three images with 12 corresponding straight line segments used for the reconstruction of the 3D lines, forming three groups [1..4], [5..8], [9..12] for the three main directions.
program. Straight line segments were automatically detected, and a small subset of 12, visible in all three images and pointing in the three principal directions, were manually brought into correspondence. From the straight lines l_ij, i = 1, ..., 12, j = 1, 2, 3, and the projection matrices P_j we determine the projection planes A_ij = P_j^T l_ij. For determining the ML-estimates of the 12 lines L_i we need the covariance matrices of the projection planes. They are determined by variance propagation based on the covariance matrices of the image lines l_ij and the covariance matrices of the projection matrices. As we do not have the cross-covariance matrices between any two of the projection matrices, we only use the uncertainty Σ_{Z_j Z_j} of the three projection centres Z_j. The covariance matrices of the straight line segments are derived from the uncertainty given by the feature extraction, as in the first example. The covariance matrix of the projection planes then is determined from Σ_{A_ij A_ij} = P_j^T Σ_{l_ij l_ij} P_j + (I_4 ⊗ l_ij) Σ_{p_j p_j} (I_4 ⊗ l_ij)^T. Now we observe that determining the intersecting line of three planes is dual to determining the best fitting line through three 3D points. Thus the procedure for fitting a 3D line through a set of 3D points can be used directly to determine the ML-estimate of the best fitting line through three planes, except for the fact that the result of the estimation yields the dual line.
First, the square roots of the estimated variance factors σ̂₀² = Ω/(G − U) range between 0.03 and 3.2. As the number of degrees of freedom, G − U = 2I − 4 = 2·3 − 4 = 2, in this case is very low, such a spread is to be expected. The mean value for the variance factor is 1.1, which confirms that the model fits the data. As a second result we analyse the angles between the directions of the 12 lines. As they are clustered into three groups corresponding to the main directions of the building, we should find values close to 0° within a group and values close to 90° between lines of different groups. The results are collected in the following table. The angles between lines in the same group scatter between 0° and 14.5°, the angles between lines of different orientation deviate from 90° by between 0° and 15°. The standard deviations of the angles scatter between 0.4° and 8.3°, which is why none of the deviations from 0° or 90° are significant. The statistical analysis obviously makes the visual impression objective.
 #  lmin [pel]   1      2      3      4      5      6      7      8      9      10     11     12
 1     173       −     2.6°   2.7°   3.0°   88.6°  89.0°  88.7°  76.5°  86.7°  87.0°  86.6°  85.2°
 2     155      0.7     −     0.7°   1.6°   89.9°  89.7°  87.3°  75.1°  89.0°  89.2°  88.9°  87.4°
 3      72      0.7    0.0     −     0.9°   89.4°  89.8°  87.9°  75.6°  89.4°  89.6°  89.3°  87.8°
 4      62      1.2    0.3    0.1     −     88.5°  88.9°  88.7°  76.4°  89.8°  90.0°  89.7°  88.2°
 5     232      0.6    0.0    0.1    0.2     −     0.4°   2.8°   15.3°  89.6°  89.3°  89.7°  89.1°
 6     153      0.3    0.1    0.0    0.1    0.1     −     2.4°   14.9°  89.5°  89.2°  89.6°  89.1°
 7      91      0.5    0.8    0.4    0.2    0.8    0.6     −     12.5°  89.0°  88.7°  89.1°  88.6°
 8     113      1.0    1.1    1.1    0.9    1.1    1.1    0.8     −     87.0°  86.7°  87.2°  87.0°
 9     190      1.6    0.4    0.3    0.1    0.2    0.3    1.0    1.3     −     0.4°   0.2°   1.6°
10      82      1.4    0.3    0.2    0.0    0.4    0.4    1.2    1.4    0.5     −     0.5°   1.8°
11     103      1.6    0.5    0.4    0.2    0.2    0.2    0.9    1.2    0.3    0.6     −     1.6°
12     225      2.4    1.1    1.1    1.0    0.5    0.5    1.3    1.5    3.2    3.6    4.0     −
Fig. 4. Result of determining 12 lines from Fig. 3. Left column: minimal length l_min of the three line segments involved; upper right triangle: angles between the lines; lower left triangle: test statistic for the deviation from 0° or 90°.
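For readers who want to reproduce the variance propagation used in this example, the following sketch shows one way to compute the covariance of a projection plane A = Pᵀl with numpy. It is an illustration under stated assumptions (column-wise stacking of p = vec(P), and no cross-covariances between images), not the author's implementation.

```python
import numpy as np

def plane_covariance(P, l, Sigma_l, Sigma_p):
    """Covariance of the projection plane A = P^T l.

    P       : 3x4 projection matrix
    l       : 3-vector, homogeneous image line
    Sigma_l : 3x3 covariance of l
    Sigma_p : 12x12 covariance of p = vec(P), columns of P stacked (Fortran order)
    """
    # Jacobian w.r.t. the image line: dA/dl = P^T
    J_l = P.T                                    # 4x3
    # A = (I_4 kron l^T) p, so the Jacobian w.r.t. p is I_4 kron l^T
    J_p = np.kron(np.eye(4), l.reshape(1, 3))    # 4x12
    return J_l @ Sigma_l @ J_l.T + J_p @ Sigma_p @ J_p.T
```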
3 Conclusions and Outlook
We developed a rigorous estimation scheme for all types of entities in projective spaces, especially points, lines and planes in 2D and 3D, together with the corresponding transformations. The estimation requires only the minimum number of parameters for each entity and thus avoids the redundancy of the homogeneous representations. Therefore no additional constraints are required to enforce the normalization of the entities, or to enforce additional constraints such as the Plücker constraints for 3D lines. In addition we not only obtain a minimal representation for the uncertainty of the geometric elements, but also simple means to determine the Mahalanobis distance between two elements, which may be used for testing or for grouping. In both cases the estimation requires the covariance matrices of the observed entities to be invertible, which is made possible by the proposed reduced representation. We demonstrated the superiority of the rigorous setup compared to a suboptimal classical method of determining vanishing points, and proved the rigour of the method with the estimation of 3D lines from points or planes. The convergence properties when using the proposed reduced representation do not change, as the solution steps are algebraically equivalent. The main advantages of the proposed concept are the ability to handle elements at or close to infinity without losing numerical stability, and that it is not necessary to introduce additional constraints to force the geometric entities to lie on their manifold. The concept can certainly be extended to higher-level algebras, such as the geometric or the conformal algebra (see [10]), where the motivation to use minimal representations is even higher than in our context. Acknowledgement. I acknowledge the valuable recommendations of the reviewers.
References
1. Åström, K.: Using Combinations of Points, Lines and Conics to Estimate Structure and Motion. In: Proc. of Scandinavian Conference on Image Analysis (1998)
2. Bartoli, A., Sturm, P.: Nonlinear estimation of the fundamental matrix with minimal parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 426–432 (2004)
3. Bartoli, A., Sturm, P.: Structure-from-motion using lines: Representation, triangulation and bundle adjustment. Computer Vision and Image Understanding 100 (2005)
4. Begelfor, E., Werman, M.: How to put probabilities on homographies. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1666–1670 (2005)
5. Bohling, G.C., Davis, J.C., Olea, R.A., Harff, J.: Singularity and Nonnormality in the Classification of Compositional Data. Int. Assoc. for Math. Geology 30(1), 5–20 (1998)
6. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: Proc. Computer Vision and Pattern Recognition, pp. 8–15 (1998)
7. Collins, R.: Model Acquisition Using Stochastic Projective Geometry. PhD thesis, Department of Computer Science, University of Massachusetts (1993); Also published as UMass Computer Science Technical Report TR95-70
8. Criminisi, A.: Accurate Visual Metrology from Single and Multiple Uncalibrated Images. In: Distinguished Dissertations. Springer, Heidelberg (2001)
9. Förstner, W.: Optimal Vanishing Point Detection and Rotation Estimation of Single Images of a Legoland Scene. In: Int. Archives for Photogrammetry and Remote Sensing, vol. XXXVIII/3A, Paris (2010)
10. Gebken, C.: Conformal Geometric Algebra in Stochastic Optimization. PhD Thesis, University of Kiel, Institute of Computer Science (2009)
11. Heuel, S.: Uncertain Projective Geometry: Statistical Reasoning For Polyhedral Object Reconstruction. LNCS. Springer, New York (2004)
12. Jain, A., Kurz, C., Thormählen, T., Seidel, H.P.: Exploiting Global Connectivity Constraints for Reconstruction of 3D Line Segments from Images. In: Conference on Computer Vision and Pattern Recognition (2010)
13. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science, Amsterdam (1996)
14. Koch, K.R.: Parameter Estimation and Hypothesis Testing in Linear Models. Springer, Heidelberg (1999)
15. Liebowitz, D.: Camera Calibration and Reconstruction of Geometry from Images. PhD thesis, University of Oxford, Dept. Engineering Science (2001)
16. McGlone, C.J., Mikhail, E.M., Bethel, J.S.: Manual of Photogrammetry. Am. Soc. of Photogrammetry and Remote Sensing (2004)
17. Meidow, J., Beder, C., Förstner, W.: Reasoning with uncertain points, straight lines, and straight line segments in 2D. International Journal of Photogrammetry and Remote Sensing 64, 125–139 (2009)
18. Pottmann, H., Wallner, J.: Computational Line Geometry. Springer, Heidelberg (2010)
19. Rosenhahn, B., Granert, O., Sommer, G.: Monocular pose estimation of kinematic chains. In: Dorst, L., Doran, C., Lasenby, J. (eds.) Applications of Geometric Algebra in Computer Science and Engineering, Proc. AGACSE 2001, Cambridge, UK, pp. 373–375. Birkhäuser, Boston (2002)
20. Sturm, P., Gargallo, P.: Conic fitting using the geometric distance. In: Proceedings of the Asian Conference on Computer Vision, Tokyo, Japan, vol. 2, pp. 784–795. Springer, Heidelberg (2007)
Personalized 3D-Aided 2D Facial Landmark Localization Zhihong Zeng, Tianhong Fang, Shishir K. Shah, and Ioannis A. Kakadiaris Computational Biomedicine Lab Department of Computer Science University of Houston 4800 Calhoun Rd, Houston, TX 77004
Abstract. Facial landmark detection in images obtained under varying acquisition conditions is a challenging problem. In this paper, we present a personalized landmark localization method that leverages information available from 2D/3D gallery data. To realize a robust correspondence between gallery and probe key points, we present several innovative solutions, including: (i) a hierarchical DAISY descriptor that encodes larger contextual information, (ii) a Data-Driven Sample Consensus (DDSAC) algorithm that leverages the image information to reduce the number of required iterations for robust transform estimation, and (iii) a 2D/3D gallery pre-processing step to build personalized landmark metadata (i.e., local descriptors and a 3D landmark model). We validate our approach on the Multi-PIE and UHDB14 databases, and by comparing our results with those obtained using two existing methods.
1 Introduction
Landmark detection and localization are active research topics in the field of computer vision due to a variety of potential applications (e.g., face recognition, facial expression analysis, and human computer interaction). Numerous researchers have proposed a variety of approaches to build general facial landmark detectors. Most methods follow a learning-based approach where the variations in each landmark’s appearance as well as the geometric relationship between landmarks are characterized based on a large amount of training data. These methods are typically generalized to solve the problem of landmark detection for any input image and do not necessarily use any information from the gallery images. In contrast to the existing methods, we propose a personalized landmark localization framework. Our approach provides a framework wherein the problem of training across large data sets is alleviated, and instead focuses on the efficient encoding of landmarks in the gallery data and their counterparts in the probe image. Our framework includes two stages. The first stage allows for off-line processing of the gallery data to generate an appearance model for each landmark and a 3D landmark model that encodes the relationship between the landmarks. The second stage is performed on-line on the input probe image to identify potential
Fig. 1. Diagram of our personalized landmark localization pipeline. The upper part represents the gallery data processing and the bottom part depicts the landmark localization step in a probe image.
landmarks and establish a match between the models derived at the first stage using a constrained optimization scheme. We propose several computational methods to improve the accuracy and efficiency of landmark detection and localization. Specifically, we extend the existing DAISY descriptor [1–3] and propose a hierarchical formulation that incorporates contextual information at multiple scales. Motivated by robust sampling approaches, we propose a Data-Driven SAmple Consensus (DDSAC) method for estimating the optimal landmark configuration. Finally, we introduce a hierarchical landmark searching scheme for efficient convergence of the localization process. We test our method on the CMU Multi-PIE [4] and the UHDB14 [5] databases. Our results obtained are compared to the latest version of a method based on the Active Shape Model [6, 7] and a state-of-the-art classifier-based facial feature point detector [8, 9]. The experimental results demonstrate the advantages of our proposed solution.
2 Related Work
The problem of facial landmark detection and localization has attracted much attention from researchers in the field of computer vision. Most of the existing efforts toward this task follow the spirit of the Point Distribution Model (PDM) [10]. The implementation of a PDM framework includes two main subtasks: local search for PDM landmarks in the neighborhood of their current estimates, and optimization of the PDM configuration such that the global geometric compatibility is jointly maximized. For the first subtask, most of the existing efforts are based on an exhaustive independent local search for each landmark using feature detectors. A common theme is to build a probability distribution for the landmark locations in order to reduce sensitivity to local minima. Both parametric [11, 12] and non-parametric representations [13] of the landmark distributions have been proposed. For the
second subtask, most of the studies are based on a 2D geometric model with similarity or affine transformation constraints (e.g., [6, 11, 14–16]). Few studies (e.g., [17, 18]) have investigated 3D generative models and weak-perspective projection constraints. For the optimization process, most existing approaches use gradient descent methods (e.g., [6, 10]). A few researchers have also used MCMC [19, 20] and RANSAC [17] that theoretically are able to find the global optimal configuration, but in practice these methods are hampered by the curse of dimensionality. Furthermore, classifier-based (e.g., [8, 21]) and regression-based (e.g., [15, 22]) approaches have also been used. Across all methods, a large training database has been the key to enhancing the power of detecting landmarks and modeling their variations. For example, Liu et al. [19] use 280 distributions modeled by GMM to learn the landmark variations in both position and appearance. Similarly, Liang et al. [21] use 4,002 images for training classifiers and 120,000 image patches to learn directional classifiers for each facial component. Our objective is to explore the possibility of leveraging the gallery data to reduce the cost of training. The study closest to our work is the recent investigation by Li et al. [23], where they use both the training data and a gallery image to build a person-specific appearance landmark model to improve the accuracy of landmark localization. However, there are several differences between our work and their work: (i) our method employs only one gallery image whereas Li et al. [23] also use additional training data and (ii) our method uses the 3D geometric relationship of landmarks to constrain the search.
3 Method
3.1 Framework
Our personalized landmark localization framework (URxD-L) includes two stages (Fig. 1). In the gallery processing step, personalized landmark metadata (i.e., landmark descriptor templates and a 3D landmark model) are computed. When only 2D images are available in the gallery, we use a statistical landmark model and a 2D-3D fitting method to obtain a 3D personalized landmark model. Next, the landmark descriptor templates are computed on the projected landmark positions. If 3D data is also available in the gallery, we can obtain a personalized 3D landmark model directly from the annotated landmarks on the 3D data. We also use an Annotated Face Model (AFM) [24] to align the 3D face data and generate multiple images from a variety of viewpoints, which are then used to compute multi-view landmark descriptors. Landmark localization in a probe image is achieved through correspondence between the landmark descriptors in the gallery image and those in the probe image, constrained by the geometric relationship derived for the specific individual. This process includes efficient descriptor matching and robust estimation.
3.2 Matching Problem Formulation
In our framework, landmark localization is posed as an energy minimization problem. The energy function E(X) depends on X = {xj , j = 1, · · · , n}, which
denotes the configuration to be estimated, whereas x^j denotes the coordinates of the j-th landmark. Specifically, E(X) is defined as a weighted sum of two energy terms:

E(X) = λ_a E^a(X) + λ_g E^g(X),   (1)

where λ_a and λ_g are non-negative scalar weights (λ_a + λ_g = 1), and the energy terms E^a(X) and E^g(X) denote the associated appearance and geometric terms, respectively. The energy term E^a(X) is a sum of unary terms that favors correspondence between similar appearance descriptors across two images:

E^a(X) = Σ_{k∈K} ϕ_k,   (2)

where ϕ_k is the L2 distance of the descriptor pair between two images, and K is the set of all correspondences between the gallery and probe images. The energy term E^g(X) is a measure of geometric compatibility of correspondences between the gallery image and the probe image, and it favors the correspondences that are compatible with the geometric relationship of facial landmarks (the shape of the landmarks) under a predefined transformation. In this paper, we use a weak-perspective projection of the 3D landmark model to form the geometric constraints. Thus, E^g(X) penalizes any deviation from these constraints and is given by:

E^g(X) = ||X − U||,   (3)

where U is the target configuration and X is the estimated configuration. We define E^g as the L2 distance between the projected landmarks from the 3D landmark template and their estimated positions in the probe image.
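As an illustration of Eqs. (1)–(3), a minimal sketch of the energy computation might look as follows; the array shapes and default weights are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def energy(X, U, probe_desc, gallery_desc, lam_a=0.5, lam_g=0.5):
    """X, U         : (n, 2) estimated / projected landmark positions
       probe_desc   : (n, d) descriptors extracted at X in the probe image
       gallery_desc : (n, d) descriptor templates from the gallery
    """
    E_a = np.linalg.norm(probe_desc - gallery_desc, axis=1).sum()  # Eq. (2)
    E_g = np.linalg.norm(X - U)                                    # Eq. (3)
    return lam_a * E_a + lam_g * E_g                               # Eq. (1)
```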
3.3 Personalized 3D Facial Shape Model
The objective of the gallery processing is to obtain a personalized 3D landmark model representing the geometry of the facial landmarks (of the gallery data), which can be detected automatically or through manual annotations. If the gallery includes 3D data, then this geometric relationship is constructed directly from the 3D landmark positions, obtained either manually or automatically [25]. In the case that the gallery contains only 2D images, the geometry is recovered through the use of a statistical 3D landmark model and a 2D-3D fitting process. 3D Statistical Landmark Model: A generic 3D statistical shape model is built based on the BU-3DFE database [26]. The BU-3DFE is a 3D facial expression database collected at State University of New York at Binghamton. We select only the datasets with neutral expressions, one per each of the 100 subjects. Considering the ambiguity of the face contour landmarks, we discard those and keep the 45 landmarks from different facial regions. By aligning the selected 3D landmarks and applying Principal Component Analysis, the statistical shape model is formulated as follows:
S = S₀ + Σ_{j=1}^{L} α_j S^j,   (4)
where S₀ is the 3D mean landmark model, S^j is the j-th principal component, and L denotes the number of principal components retained in the model.

2D-3D landmark fitting process: A regular 3D-to-2D transformation can be approximated with a translation, a rotation and a weak-perspective projection:

P = [ f 0 0 t_x ; 0 f 0 t_y ] R_γ R_ζ R_φ,   (5)

where f is the focal length, t_x, t_y are the 2D translations, and R_γ, R_ζ, R_φ denote image plane rotation, elevation rotation, and azimuth rotation, respectively. The configuration X is given by:

X = P (S₀ + Σ_{j=1}^{L} α_j S^j).   (6)

Based on the correspondences between the 3D landmark model and the 2D gallery image with known landmark positions U, we can estimate the parameters of the transformation matrix P and the shape coefficients α = (α_j, j = 1, · · · , L) according to the maximum-likelihood solution as follows:

{P̂, α̂} = argmin_{P,α} ||P (S₀ + Σ_{j=1}^{L} α_j S^j) − U||,   P̂ = U (S₀)⁺,   α̂ = A⁺ (U − P̂ S₀),   (7)
where A is the matrix constructed from the P S^j. The details are presented by Romdhani et al. [17]. When detecting landmarks in a probe image, the shape coefficients α are considered to be person-specific and are fixed, while only the transformation P needs to be estimated. The deformation of the personalized 3D face model due to expression variations is not considered in this work.
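A minimal sketch of the fitting step in Eq. (7) is given below. The homogeneous 4×n coordinate convention and the column-stacking used to build A are illustrative assumptions of this sketch; see Romdhani et al. [17] for the derivation the authors follow.

```python
import numpy as np

def fit_weak_perspective(U, S0, S_pc):
    """U    : (2, n) 2D landmark positions in the gallery image
       S0   : (4, n) mean 3D landmark model in homogeneous coordinates
       S_pc : (L, 4, n) principal components, zero-padded in the 4th row
    """
    # P_hat = U * pinv(S0)
    P_hat = U @ np.linalg.pinv(S0)                                  # (2, 4)
    # Build A from the projected principal components P_hat S_j
    A = np.stack([(P_hat @ Sj).ravel() for Sj in S_pc], axis=1)     # (2n, L)
    alpha_hat = np.linalg.pinv(A) @ (U - P_hat @ S0).ravel()        # (L,)
    return P_hat, alpha_hat
```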
3.4 Hierarchical DAISY Representation of a Facial Landmark
A local feature descriptor is used to establish a match between landmarks in the gallery and probe images. Inspired by the DAISY descriptor [1–3] and ancestry context [27], we propose a hierarchical DAISY descriptor that augments the discriminability of DAISY by including larger context information via a multi-resolution approach (Fig. 2). The contextual cues of facial landmarks are naturally encoded in the face image with this hierarchical representation that relates different components of a face (i.e., nose, eyes and mouth). We construct the hierarchical DAISY descriptor of a facial landmark by progressively enlarging the window of analysis, which is effectively achieved by decreasing image resolution. Given a face image and a landmark location, we
Fig. 2. The hierarchical DAISY descriptor for the left inner eye corner. (a-c) Images of different resolutions with the corresponding DAISY descriptors.
define the ancestral support regions of a landmark as the set of regions that are obtained by sequentially decreasing the image resolution, covering not only the local context information but also global context information. Keeping the patch size (pixel units) of the DAISY descriptor constant, we compute the DAISY descriptors in these ancestral support regions. Then, these descriptor vectors are concatenated to form a hierarchical DAISY descriptor.
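The hierarchical construction can be sketched as follows: the descriptor footprint is kept fixed in pixels while the image is repeatedly downsampled, so each level covers a larger ancestral support region. The raw-patch descriptor below is only a stand-in for DAISY, and the level count is an assumption of this sketch.

```python
import numpy as np
from skimage.transform import rescale

def patch_descriptor(img, pt, half=15):
    # Stand-in for a real DAISY descriptor: a fixed-size raw patch around pt.
    r, c = np.clip(np.round(pt).astype(int), half, np.array(img.shape) - half - 1)
    return img[r - half:r + half + 1, c - half:c + half + 1].ravel()

def hierarchical_descriptor(image, point, levels=3, descriptor=patch_descriptor):
    """image: 2D grayscale array; point: (row, col) landmark location."""
    descs, img, pt = [], image, np.asarray(point, dtype=float)
    for _ in range(levels):
        descs.append(descriptor(img, pt))
        img = rescale(img, 0.5, anti_aliasing=True)  # halve the resolution
        pt = pt / 2.0                                # track the landmark
    return np.concatenate(descs)
```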
3.5 Data-Driven SAmple Consensus (DDSAC)
We propose a Data-Driven SAmple Consensus (DDSAC) method that is inspired by the principle of RANSAC (RANdom SAmple Consensus) and importance sampling. Although RANSAC has been successfully used for many computer vision problems, several issues still remain open [28]. Examples include the convergence to the correct estimated transformation with fewer iterations, and tuning the parameter that distinguishes inliers from outliers (this parameter is usually set empirically). With these open issues in mind, we propose a Data-Driven SAmple Consensus (DDSAC) approach, which follows the same hypothesize-and-verify framework but improves the traditional RANSAC by using (i) data-driven sampling and (ii) a least median error criterion for selecting the best estimate. The image observations are used as a proposal sampling probability to choose a subset of landmarks, in order to use fewer iterations to obtain the correct landmark configuration. The proposal probability is designed so that it is more likely to choose a landmark subset that is more compatible with the geometric shape of the landmarks as well as their appearance. Given a probe image I, we first search for the best descriptor matching candidates of the landmarks {y^j = (u^j, v^j), j = 1, · · · , n}, where u^j and v^j are the position and the descriptor of the candidate of the j-th landmark, respectively. If we construct a template tuple B^j = (x^j, Γ^j), where Γ^j is the descriptor template and x^j is the initial position (or estimated position from the previous iteration), the proposal probability of choosing the j-th landmark, P(B^j; y^j), can be formulated as:

P(B^j; y^j) = P(x^j, Γ^j; u^j, v^j) = β p(x^j|u^j) + (1 − β) q(Γ^j|v^j),   (8)
where β is the weight of the probability derived from the geometry-driven process p(x^j|u^j), and (1 − β) is the weight for the probability based on appearance, q(Γ^j|v^j). The term p(x^j|u^j) in Eq. 8 is given by:

p(x^j|u^j) = N(u^j; x^j, σ) / Σ_{r=1,··· ,n} N(u^r; x^r, σ),   (9)
where N(u^j; x^j, σ) is the probability that the j-th landmark point is located at u^j given the landmark initial position x^j. In this work, we model N(u^j; x^j, σ) using a Gaussian distribution with mean equal to the initial position x^j. In order to reduce the sensitivity to initialization, we use robust estimation to compute x^j as follows. First, a large search area is defined to obtain the best descriptor match of the landmarks. Second, we compute the distances between these match locations and the initial landmark positions. Finally, the initial configuration X is translated based on the match that yields the median of these distances.

Algorithm 1. Data-Driven SAmple Consensus (DDSAC)
Input: A tentative set of corresponding points from descriptor template matching and a geometric constraint (i.e., personalized 3D landmark model and a weak-perspective transformation)
1. Select elements with the proposal probability defined by Eqn. 8
2. Combine the selected points and delete the duplicates to obtain a subset of samples
3. Estimate the transformation parameters (i.e., generate a hypothesis)
4. Project the template shape and compute the median of errors between the projected locations and locations of landmark candidates
5. Repeat Steps 1-4 for m iterations
6. Select the estimated configuration with the least median error, and sort the landmark candidates by the distances between their positions and this configuration
7. Re-estimate the configuration based on the first k landmark candidates in the list sorted in ascending order
Output: The estimated transformation and matching results
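A hedged sketch of the DDSAC loop in Alg. 1 is shown below. The callables estimate_transform and project stand in for the weak-perspective estimation and projection of Section 3.3, and the default keep-ratio is only an illustrative choice (the paper's Fig. 5(b) studies this ratio).

```python
import numpy as np

def ddsac(candidates, model_3d, proposal_prob, estimate_transform, project,
          m=50, subset_size=6, k=None, rng=None):
    """candidates    : (n, 2) best descriptor-match positions u^j
       model_3d      : (n, 3) personalized 3D landmark model
       proposal_prob : (n,) probabilities P(B^j; y^j) of Eq. (8), summing to 1
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(candidates)
    best = (np.inf, None)
    for _ in range(m):
        # Steps 1-2: data-driven sampling, duplicates removed
        idx = np.unique(rng.choice(n, size=subset_size, p=proposal_prob))
        # Step 3: hypothesis
        T = estimate_transform(model_3d[idx], candidates[idx])
        # Step 4: least-median-of-errors score
        err = np.linalg.norm(project(T, model_3d) - candidates, axis=1)
        med = np.median(err)
        if med < best[0]:
            best = (med, T)
    # Steps 6-7: keep the k candidates closest to the best hypothesis, re-estimate
    err = np.linalg.norm(project(best[1], model_3d) - candidates, axis=1)
    keep = np.argsort(err)[: (k or int(0.7 * n))]
    return estimate_transform(model_3d[keep], candidates[keep]), keep
```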
The second term in Eq. 8, q(Γ^j|v^j), defines the uncertainty of the best descriptor match position for the j-th landmark based on the descriptor similarity. It can be formulated as follows:

q(Γ^j|v^j) = P(v^j; Γ^j, Ω^j) / Σ_{r=1,··· ,n} P(v^r; Γ^r, Ω^r),   (10)
where P(v^j; Γ^j, Ω^j) is the probability of the j-th landmark point being at position u^j given the landmark descriptor template Γ^j. A Gaussian distribution is also used here to define P(v^j; Γ^j, Ω^j), with the mean being the descriptor template from the gallery image. The pipeline for estimating the landmark matches and the shape transformation is presented in Alg. 1.
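The proposal probability of Eqs. (8)–(10) can be sketched with unnormalized Gaussian scores as follows; the isotropic-Gaussian form of the descriptor term and the default parameter values are assumptions of this sketch.

```python
import numpy as np

def proposal_probability(x, u, templates, descs, beta=0.8, sigma=10.0, omega=10.0):
    """x, u      : (n, 2) initial positions x^j and best-match positions u^j
       templates : (n, d) gallery descriptor templates Gamma^j
       descs     : (n, d) descriptors v^j at the matched positions
       sigma     : the paper uses sigma = 0.1 W (W = face-image width); omega = Omega^j
    """
    g = np.exp(-np.sum((u - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    a = np.exp(-np.sum((descs - templates) ** 2, axis=1) / (2.0 * omega ** 2))
    p = g / g.sum()                      # Eq. (9)
    q = a / a.sum()                      # Eq. (10)
    return beta * p + (1.0 - beta) * q   # Eq. (8)
```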
Algorithm 2. 2D gallery processing
Input: A 2D face image associated with a subject ID and the annotated landmarks
1. Build a personalized 3D landmark model (geometric relationship of landmarks) by using landmark correspondences between the 2D images and the 3D statistical landmark model (Eq. 7)
2. Compute the personalized descriptors of the annotated landmarks from the 2D image
Output: Personalized metadata (landmark descriptors and a 3D landmark model).
Algorithm 3. 3D gallery processing
Input: A 3D textured data associated with a subject ID and the annotated landmarks
1. Align the 3D face data with the Annotated Face Model (AFM) [24]
2. Orthographically project the 3D texture surface and landmark positions to predefined viewing angles
3. Compute the descriptors at the projected locations of the landmarks on the projections
Output: Personalized metadata (multi-view landmark descriptors and a 3D landmark model).
In our experiments, we empirically set m = 50, β = 0.8, σ = 0.1W (W being the width of the face image) and Ω^j = 10. Figure 5(b) depicts the effect of selecting k or the ratio k/n on the results. Based on the sampled subset of the tentative correspondences, we are able to estimate the parameters in the weak-perspective matrix P with the maximum-likelihood solution described in Section 3.3.
3.6 Gallery/Probe Processing
The algorithms for 2D gallery, 3D gallery, and probe processing are presented in Alg. 2, Alg. 3, and Alg. 4, respectively. For 2D gallery image processing (frontal face image without 3D data), we run the ASM algorithm [6, 7] to automatically obtain 68 landmarks in order to reduce the annotation burden. We selected among them 45 landmarks from eyebrows, eyes, nose and mouth that have correspondences in our 3D statistical landmark model. Then, we manually correct the erroneous annotations from ASM to obtain accurate ground truth. For 3D gallery face data, we manually annotate each of the landmarks on the 3D surface. Given a probe image, we use the hierarchical optimization method to localize the landmarks from coarse level to fine level as presented in Alg. 4.
Algorithm 4. Hierarchical landmark localization on a probe image
Input: A 2D probe face image and the corresponding gallery metadata
1. Scale the gallery landmarks to match the size of the face region
2. Align the geometric center of the landmarks with that of the face image
3. Optimize from coarse to fine resolution:
   a. Generate candidates by the descriptor template matching
   b. Estimate the translation by the median of distances between candidates and the initialization
   c. Run DDSAC to obtain matching results and estimate the transformation
Output: Estimated landmark locations
4 Experiments
We evaluated our method on the Multi-PIE [4] and the UHDB14 [5] databases. Both databases include facial images with pose and illumination variations. The Multi-PIE dataset includes only 2D images while the UHDB14 includes both 3D data and the associated texture images for all subjects. We compare the results of our method to those obtained by the latest ASM-based approach [6, 7] and a classifier-based facial landmark detector [8, 9].
4.1 Multi-PIE
The Multi-PIE database collected at CMU contains facial images from 337 subjects in four acquisition sessions over a span of six months, and under a large variety of poses, illumination and expressions. In our experiments, we chose the data from 129 subjects who attended all four sessions. We used the frontal neutral face images in Session 1 under Lighting 16 (top frontal flash) condition as the gallery, and the neutral face images of the same subjects with different poses and lighting conditions in Session 2 as the probes. The probe data were divided into cohorts defined in Table 1, each with 129 images from 129 subjects. We obtained 45 landmarks in gallery images by using the method described in Section 3.6. In addition, eight landmarks on the probe images were manually annotated as the ground truth for evaluation. The original images were cropped to 300 × 300 pixels, ensuring that all landmarks are enclosed in the bounding box. Examples of gallery and probe images are depicted in Fig. 3.

Table 1. Probe subset definitions of the Multi-PIE database

Lighting                 Pose 051 (0°)   Pose 130 (30°)
16 (top frontal flash)   C1              C2
14 (top right flash)     C3              C4
00 (no flash)            C5              C6
Fig. 3. Examples of gallery and probe images from Multi-PIE. (a) Gallery image from Session 1 with Light 16 and Pose 051 (b-g) Probe images from Session 2 with same ID as the gallery. Example images with Light 16 and Pose 051 (b), Light 16 and Pose 130 (c), Light 14 and Pose 051 (d), Light 14 and Pose 130 (e), Light 00 and Pose 051 (f), Light 00 and Pose 130 (g).
Fig. 4. Gallery image processing and landmark localization on a probe image. (a) Gallery image with annotations (yellow circles) and fitting results (green crosses). (b) Personalized 3D landmark model. (c-d) Matching results [initialization (black crosses), best descriptor matches (red circles), geometrical fitting (green crosses)] on two levels of resolutions, respectively.
The matching process is depicted in Fig. 4. First, the landmark descriptors are computed on the gallery image and the 3D personalized landmark model is constructed from fitting the 3D statistical landmark model, as illustrated in Figs. 4a and 4b. Then, from coarse to fine resolution, the landmarks are localized by descriptor matching and geometrical fitting, as illustrated in Figs. 4c and 4d. The experimental results are shown in Figs. 5 and 6. All of the results are evaluated in terms of the mean of the normalized errors, which are computed as the ratio of the landmark errors (i.e., distance between the estimated landmark position and the ground truth) and the intra-pupillary distance. The average intra-pupillary distance is about 72 pixels among the frontal face images (cohorts: C1, C3 and C5), and about 62 pixels among images with varying poses (cohorts: C2, C4 and C6). Figure 5(a) depicts the results of our method (URxD-L) using different levels of the DAISY descriptor. It can be observed that the two-level descriptor with larger context region results in an improvement over the one-level descriptor in all probe subsets. Figure 5(b) depicts the effect of ratio (k/n) of landmarks used in DDSAC on the estimation results, where k and n are the number of selected landmarks and the total number of landmarks, respectively. This finding suggests that robust estimation (ratio 0.9-0.4) is better than non-robust estimation (ratio 1.0) in dealing with outliers. Note that the results on probe subsets with frontal
Fig. 5. (a) Normalized error of URxD-L using different levels of DAISY descriptor (b) Effect of ratio (k/n) of landmarks on the estimation results (c) Means of normalized errors of different landmarks under different poses and lighting conditions.
Fig. 6. Comparison among ASM-based [6, 7], Vukadinovic & Pantic detector [8, 9] and URxD-L on Multi-PIE
pose (same as the gallery) have "flatter" curves with lower error rates than the other probe subsets. This can be attributed to the fact that frontal face images have fewer outliers than the non-frontal face images. Figure 5(c) illustrates the normalized localization errors of 8 landmarks (right/left eye outer/inner corners, right/left nostril corners, and right/left mouth corners). We can observe that the eye inner corners are more reliable landmarks in frontal face images. As expected, for face images that are non-frontal (e.g., 30° yaw) the landmarks on the left face region are more reliable than those on the right side, except for the left eye outer corner, which is often occluded by glasses. Figure 6 presents a comparison of landmark localization using three different methods, including the latest ASM-based approach [6, 7], a classifier-based landmark detector (Vukadinovic & Pantic) [8, 9], and URxD-L. The results show that our method can achieve similar or better results than the ASM-based and Vukadinovic & Pantic detectors. They also suggest that a gallery image and robust descriptors are very helpful in reducing the training burden and achieving good results.
4.2 UHDB14 Database
To evaluate the performance of URxD-L for the case where 3D data are available in the gallery, we collected a set of 3D data with associated 2D texture images using two calibrated 3dMD cameras [29] and a set of 2D images of the same subjects using a Canon XTi DSLR camera. This database, which we name
Fig. 7. (a-d) Examples of gallery and probe cohorts from UHDB14. Gallery images generated from the 3D data (a) -30◦ yaw and (b) 30◦ yaw. Probe images (c) -30◦ yaw and (d) 30◦ yaw. (e) Comparison among ASM-based [6, 7], Vukadinovic & Pantic detector [8, 9] and URxD-L on UHDB14.
UHDB14 [5], includes 11 subjects (3 females and 8 males) with varying yaw angles (-30°, 0°, 30°). We use the textured 3D data as the gallery, and the 2D images with yaw angles -30° and 30° as our probe cohorts. Since the data were obtained with different sensors, the illuminations of the gallery and the probe images are different. Some sample images from the gallery and probe cohorts are shown in Figs. 7(a-d). We manually annotated 45 landmarks on the gallery 3D data around each facial component (eyebrows, eyes, nose and mouth), and 6 landmarks on the probe images as the ground truth for evaluation. The average intra-pupillary distances of the two probe cohorts with different yaw angles are 72 and 54 pixels, respectively. We evaluated our method on the 6 annotated landmarks, and compared the results against those from the ASM-based [6, 7] and Vukadinovic & Pantic [8, 9] detectors in terms of average normalized error. Figure 7(e) presents the comparison in terms of the mean of normalized landmark errors. Note that our method outperforms both methods.
5 Conclusions
In this paper, we have proposed a 2D facial landmark localization method, URxD-L, that leverages a 2D/3D gallery image to carry out a person-specific search. Specifically, a framework is proposed in which descriptors of facial landmarks as well as a 3D landmark model that encodes the geometric relationship between landmarks are computed from processing a given 2D or 3D gallery dataset. On-line processing of the 2D probe, targeted towards the face authentication problem [30], focuses on establishing correspondences between key points on the probe image and the landmarks on the corresponding gallery image. To realize this framework, we have presented several methods, including a hierarchical DAISY descriptor and a Data-Driven SAmple Consensus algorithm. Our approach is compared with two of the existing methods and exhibits better performance.
Acknowledgement. This work was supported in part by the US Army Research Laboratory award DWAM80750 and UH Eckhard Pfeiffer Endowment. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of our sponsors.
References 1. Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 815–830 (2010) 2. Winder, S., Brown, M.: Learning local image descriptors. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, pp. 1–8 (2007) 3. Winder, S., Hua, G., Brown, M.: Picking the best DAISY. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, pp. 178–185 (2009) 4. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-PIE. In: Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands (2008) 5. UHDB14, http://cbl.uh.edu/URxD/datasets/ 6. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 504–513. Springer, Heidelberg (2008) 7. STASM, http://www.milbo.users.sonic.net/stasm/ 8. Vukadinovic, D., Pantic, M.: Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In: Proc. IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, Hawaii, USA, pp. 1692–1698 (2005) 9. FFPD, http://www.doc.ic.ac.uk/~ maja/ 10. Cootes, T., Taylor, C.: Active shape models: Smart snakes. In: Proc. British Machine Vision Conference, Leeds, UK, pp. 266–275 (1992) 11. Gu, L., Kanade, T.: A generative shape regularization model for robust face alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 413–426. Springer, Heidelberg (2008) 12. Wang, Y., Lucey, S., Cohn, J.: Enforcing convexity for improved alignment with constrained local models. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, AK (2008) 13. Saragih, J., Lucey, S., Cohn, J.: Face alignment through subspace constrained mean-shifts. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, pp. 1034–1041 (2009) 14. Cootes, T., Walker, K., Taylor, C.: View-based active appearance models. In: Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 227–232 (2002) 15. Cristinacce, D., Cootes, T.: Boosted regression active shape models. In: Proc. British Machine Vision Conference, University of Warwick, United Kingdom, pp. 880–889 (2007) 16. Liang, L., Wen, F., Xu, Y., Tang, X., Shum, H.: Accurate face alignment using shape constrained Markov network. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, pp. 1313–1319 (2006)
17. Romdhani, S., Vetter, T.: 3D probabilistic feature point model for object detection and recognition. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN (2007) 18. Gu, L., Kanade, T.: 3D alignment of face in a single image. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, pp. 1305–1312 (2006) 19. Liu, C., Shum, H.Y., Zhang, C.: Hierarchical shape modeling for automatic face localization. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 687–703. Springer, Heidelberg (2002) 20. Tu, J., Zhang, Z., Zeng, Z., Huang, T.S.: Face localization via hierarchical condensation with fisher boosting feature selection. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 719–724 (2004) 21. Liang, L., Xiao, R., Wen, F., Sun, J.: Face alignment via component-based discriminative search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 72–85. Springer, Heidelberg (2008) 22. Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using boosted regression and graph models. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, USA, pp. 2729–2736 (2010) 23. Li, P., Prince, S.: Joint and implicit registration for face recognition. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, pp. 1510–1517 (2009) 24. Kakadiaris, I., Passalis, G., Toderici, G., Murtuza, M., Lu, Y., Karampatziakis, N., Theoharis, T.: Three-dimensional face recognition in the presence of facial expressions: An annotated deformable model approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 640–649 (2007) 25. Perakis, P., Passalis, G., Theoharis, T., Toderici, G., Kakadiaris, I.: Partial matching of interpose 3D facial data for face recognition. In: Proc. 3rd IEEE International Conference on Biometrics: Theory, Applications and Systems, Arlington, VA, pp. 439–446 (2009) 26. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.: A 3D facial expression database for facial behavior research. In: Proc. 7th International Conference on Automatic Face and Gesture Recognition, Southampton, UK, pp. 211–216 (2006) 27. Lim, J., Arbelaez, P., Gu, C., Malik, J.: Context by region ancestry. In: Proc. 12th IEEE International Conference on Computer Vision, Kyoto, Japan, pp. 1978–1985 (2009) 28. Proc. 25 Years of RANSAC (Workshop in conjunction with CVPR), New York, NY (2006) 29. 3dMD, http://www.3dmd.com/ 30. Toderici, G., Passalis, G., Zafeiriou, S., Tzimiropoulos, G., Petrou, M., Theoharis, T., Kakadiaris, I.: Bidirectional relighting for 3d-aided 2d face recognition. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 2721–2728 (2010)
A Theoretical and Numerical Study of a Phase Field Higher-Order Active Contour Model of Directed Networks Aymen El Ghoul, Ian H. Jermyn, and Josiane Zerubia ARIANA - Joint Team-Project INRIA/CNRS/UNSA 2004 route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France
Abstract. We address the problem of quasi-automatic extraction of directed networks, which have characteristic geometric features, from images. To include the necessary prior knowledge about these geometric features, we use a phase field higher-order active contour model of directed networks. The model has a large number of unphysical parameters (weights of energy terms), and can favour different geometric structures for different parameter values. To overcome this problem, we perform a stability analysis of a long, straight bar in order to find parameter ranges that favour networks. The resulting constraints necessary to produce stable networks eliminate some parameters, replace others by physical parameters such as network branch width, and place lower and upper bounds on the values of the rest. We validate the theoretical analysis via numerical experiments, and then apply the model to the problem of hydrographic network extraction from multi-spectral VHR satellite images.
1 Introduction
The automatic identification of the region in the image domain corresponding to a specified entity in the scene (‘extraction’) is a difficult problem because, in general, local image properties are insufficient to distinguish the entity from the rest of the image. In order to extract the entity successfully, significant prior knowledge K about the geometry of the region R corresponding to the entity of interest must be provided. This is often in the form of explicit guidance by a user, but if the problem is to be solved automatically, this knowledge must be incorporated into mathematical models. In this paper, we focus on the extraction of ‘directed networks’ (e.g. vascular networks in medical imagery and hydrographic networks in remote sensing imagery). These are entities that occupy network-like regions in the image domain, i.e. regions consisting of a set of branches that meet at junctions. In addition, however, unlike undirected networks, directed networks carry a unidirectional flow. The existence of this flow, which is usually conserved, encourages the network region to have certain characteristic geometric properties: branches tend not to end; different branches may have very different widths; width changes slowly along each branch; at junctions,
total incoming width and total outgoing width tend to be similar, but otherwise width can change dramatically. Creating models that incorporate these geometric properties is non-trivial. First, the topology of a network region is essentially arbitrary: it may have any number of connected components each of which may have any number of loops. Shape modelling methods that rely on reference regions or otherwise bounded zones in region space, e.g. [1–7], cannot capture this topological freedom. In [8], Rochery et al. introduced the ‘higher-order active contour’ (HOAC) framework, which was specifically designed to incorporate non-trivial geometric information about the region of interest without constraining the topology. In [9], a phase field formulation of HOACs was introduced. This formulation possesses several advantages over the contour formulation. Both formulations were used to model undirected networks in which all branches have approximately the same width. While these models were applied successfully to the extraction of road networks from remote sensing images, they fail on directed networks in which branches have widely differing widths, and where information about the geometry of junctions can be crucial for successful extraction. Naive attempts to allow a broader range of widths simply results in branches whose width changes rapidly: branch rigidity is lost. In [10], El Ghoul et al. introduced a phase field HOAC model of directed networks. This extends the phase field model of [8] by introducing a vector field which describes the ‘flow’ through network branches. The main idea of the model is that the vector field should be nearly divergence-free, and of nearly constant magnitude within the network. Flow conservation then favours network regions in which the width changes slowly along branches, in which branches do not end, and in which total incoming width equals total outgoing width at junctions. Because flow conservation preserves width in each branch, the constraint on branch width can be lifted without sacrificing rigidity, thereby allowing different branches to have very different widths. The model was tested on a synthetic image. The work in [10] suffers from a serious drawback, however. The model is complex and possesses a number of parameters that as yet must be fixed by trial and error. Worse, depending on the parameter values, geometric structures other than networks may be favoured. There is thus no way to know whether the model is actually favouring network structures for a given set of parameter values. The model was applied to network extraction from remote sensing images in an application paper complementary to this one [11], but that paper focused on testing various likelihood terms and on the concrete application. The model was not analysed theoretically and no solution to the parameter problem was described. In this paper, then, we focus on a theoretical analysis of the model. In particular, we perform a stability analysis of a long, straight bar, which is supposed to be an approximation to a network branch. We compute the energy of the bar and constrain the first order variation of the bar energy to be zero and the second order variation to be positive definite, so that the bar be an energy
minimum. This results in constraints on the parameters of the model that allow us to express some parameters in terms of others, to replace some unphysical parameters by physical parameters e.g. the bar width, and to place upper and lower bounds on the remaining free parameter values. The results of the theoretical study are then tested in numerical experiments, some using the network model on its own to validate the stability analysis, and others using the network model in conjunction with a data likelihood to extract networks from very high resolution (VHR) multi-spectral satellite images. In Section 2, we recall the phase field formulation of the undirected and directed network models. In Section 3, we introduce the core contribution of this paper: the theoretical stability analysis of a bar under the phase field HOAC model for directed networks. In Section 4, we show the results of geometric and extraction experiments. We conclude in Section 5.
2 Network Models
In this section, we recall the undirected and directed network phase field HOAC models introduced in [9] and [10] respectively.
2.1 Undirected Networks
A phase field is a real-valued function on the image domain, φ : Ω ⊂ R² → R. It describes a region by the map ζ_z(φ) = {x ∈ Ω : φ(x) > z}, where z is a given threshold. The basic phase field energy is [9]

E₀ˢ(φ) = ∫_Ω d²x { (D/2) ∂φ·∂φ + λ(φ⁴/4 − φ²/2) + α(φ − φ³/3) }.   (1)

For a given region R, Eq. (1) is minimized subject to ζ_z(φ) = R. As a result, the minimizing function φ_R assumes the value 1 inside, and −1 outside R, thanks to the ultralocal terms. To guarantee two stable phases at −1 and 1, the inequality λ > |α| must be satisfied. We choose α > 0 so that the energy at −1 is less than that at 1. The first term guarantees that φ_R be smooth, producing a narrow interface around the boundary ∂R interpolating between −1 and +1. In order to incorporate geometric properties, a nonlocal term is added to give a total energy E_Pˢ = E₀ˢ + E_NL, where [9]

E_NL(φ) = −(β/2) ∫∫_{Ω²} d²x d²x′ ∂φ(x) · ∂φ(x′) Ψ(|x − x′|/d),   (2)

where d is the interaction range. This term creates long-range interactions between points of ∂R (because ∂φ_R is zero elsewhere) using an interaction function, Ψ, which decreases as a function of the distance between the points. In [9], it was shown that the phase field model is approximately equivalent to the HOAC model, the parameters of one model being expressed as a function of those of the other. For the undirected network model, the parameter ranges can be found via a stability analysis of the HOAC model [12] and subsequently converted to phase field model parameters [9].
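As a concrete illustration, the local part of Eq. (1) can be evaluated on a pixel grid as follows. Unit grid spacing, the finite-difference scheme and the default parameter values (which satisfy λ > |α|) are choices of this sketch; the nonlocal term E_NL additionally needs the interaction kernel Ψ and is omitted here.

```python
import numpy as np

def E0_s(phi, D=1.0, lam=1.0, alpha=0.5):
    """Discrete version of the basic phase field energy of Eq. (1)."""
    gy, gx = np.gradient(phi)                      # ∂φ by central differences
    grad_term = 0.5 * D * (gx ** 2 + gy ** 2)
    local = lam * (phi ** 4 / 4 - phi ** 2 / 2) + alpha * (phi - phi ** 3 / 3)
    return np.sum(grad_term + local)               # sum approximates ∫_Ω d²x
```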
2.2 Directed Networks
In [10], a directed network phase field HOAC model was introduced by extending the undirected network model described in Section 2.1. The key idea is to incorporate geometric information by the use, in addition to the scalar phase field φ, of a vector phase field, v : Ω → R², which describes the flow through network branches. The total prior energy E_P(φ, v) is the sum of a local term E₀(φ, v) and the nonlocal term E_NL(φ) given by Eq. (2). E₀ is [10]

E₀(φ, v) = ∫_Ω d²x { (D/2) ∂φ·∂φ + (D_v/2)(∂·v)² + (L_v/2) ∂v : ∂v + W(φ, v) },   (3)

where W is a fourth order polynomial in φ and |v|, constrained to be differentiable:

W(φ, v) = |v|⁴/4 + (λ₂₂ φ²/2 + λ₂₁ φ + λ₂₀) |v|²/2 + λ₀₄ φ⁴/4 + λ₀₃ φ³/3 + λ₀₂ φ²/2 + λ₀₁ φ.   (4)
Similarly to the undirected network model described in Section 2.1, where −1 and 1 are the two stable phases, the form of the ultralocal term W(φ, v) means that the directed network model has two stable configurations, (φ, |v|) = (−1, 0) and (1, 1), which describe the exterior and interior of R respectively. The first term in Eq. (3) guarantees the smoothness of φ. The second term penalizes the divergence of v. This represents a soft version of flow conservation, but the parameter multiplying this term will be large so that in general the divergence will be small. The divergence term is not sufficient to ensure smoothness of v: a small smoothing term (∂v : ∂v = Σ_{m,n} (∂_m v^n)², where m, n ∈ {1, 2} label the two Euclidean coordinates) is added to encourage this. The divergence term, coupled with the change of magnitude of v from 0 to 1 across ∂R, strongly encourages the vector field v to be parallel to ∂R. The divergence and smoothness terms then propagate this parallelism to the interior of the network, meaning that the flow runs along the network branches. Flow conservation and the fact that |v| ≅ 1 in the network interior then tend to prolong branches, to favour slow changes of width along branches, and to favour junctions for which total incoming width is approximately equal to total outgoing width. Because the vector field encourages width stability along branches, the interaction function used in [9], which severely constrains the range of stable branch widths, can be replaced by one that allows more freedom. We choose the modified Bessel function of the second kind of order 0, K₀ [10]. Different branches can then have very different widths.
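The local term of the directed model, Eq. (3) with the ultralocal polynomial W of Eq. (4), can be sketched on a grid as follows; unit spacing, simple finite differences, and the parameter defaults are illustrative assumptions of this sketch.

```python
import numpy as np

def W(phi, v, lams):
    """Ultralocal polynomial of Eq. (4); lams holds its seven coefficients."""
    s = np.sum(v ** 2, axis=-1)          # |v|^2, with v of shape (H, W, 2)
    return (s ** 2 / 4
            + (lams['l22'] * phi ** 2 / 2 + lams['l21'] * phi + lams['l20']) * s / 2
            + lams['l04'] * phi ** 4 / 4 + lams['l03'] * phi ** 3 / 3
            + lams['l02'] * phi ** 2 / 2 + lams['l01'] * phi)

def E0(phi, v, lams, D=1.0, Dv=1.0, Lv=0.2):
    """Discrete version of the local energy of Eq. (3)."""
    gy, gx = np.gradient(phi)
    div = np.gradient(v[..., 0], axis=1) + np.gradient(v[..., 1], axis=0)
    smooth = sum(np.gradient(v[..., n], axis=m) ** 2 for m in (0, 1) for n in (0, 1))
    return np.sum(0.5 * D * (gx ** 2 + gy ** 2) + 0.5 * Dv * div ** 2
                  + 0.5 * Lv * smooth + W(phi, v, lams))
```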
3 Theoretical Stability Analysis
There are various types of stability that need to be considered. First, we require that the two phases of the system, (φ, |v|) = (−1, 0) and (1, 1), be stable to
infinitesimal perturbations, to prevent the exterior or interior of R from ‘decaying’. For homogeneous perturbations, this means they should be local minima of W, which produces constraints reducing the number of free parameters of W from 7 to 4 (these are chosen to be (λ₀₄, λ₀₃, λ₂₂, λ₂₁)). Stability to non-zero frequency perturbations then places lower and upper bounds on the values of the remaining parameters. Second, we require that the exterior (φ, |v|) = (−1, 0), as well as being locally stable, also be the global minimum of the energy. This is ensured if the quadratic form defined by the derivative terms is positive definite. This generates a further inequality on one of the parameters. We do not describe these two calculations here due to lack of space, but the constraints are used in selecting parameters for the experiments. Third, we require that network configurations be stable. A network region consists of a set of branches, meeting at junctions. To analyse a general network configuration is very difficult, but if we assume that the branches are long and straight enough, then their stability can be analysed by considering the idealized limit of a long, straight bar. The validity of this idealization will be tested using numerical experiments. Ideally, the analysis should proceed by first finding the energy minimizing φ_RBar and v_RBar for the bar region, and then expanding around these values. In practice, there are two obstacles. First, φ_RBar and v_RBar cannot be found exactly. Second, it is not possible to diagonalize exactly the second derivative operator, and thus it is hard to impose positive definiteness for stability. Instead, we take simple ansatzes for φ_RBar and v_RBar and study their stability in a low-dimensional subspace of function space. We define a four-parameter family of ansatzes for φ_RBar and v_RBar, and analyse the stability of this family. Again, the validity of these ansatzes will be tested using numerical experiments. The first thing we need to do, then, is to calculate the energy of this idealized bar configuration.
3.1 Energy of the Bar
Two phase field variables are involved: the scalar field φ and the vector field v. The ansatz is as follows: the scalar field φ varies linearly from −1 to φ_m across a region interface of width w, otherwise being −1 outside and φ_m inside the bar, which has width w₀; the vector field v is parallel to the bar boundary, with magnitude that varies linearly from 0 to v_m across the region interface, otherwise being 0 outside and v_m inside the bar. The bar is thus described by the four physical parameters w₀, w, φ_m and v_m. To compute the total energy per unit length of the bar, we substitute the bar ansatz into Eq. (2) and (3). The result is

e_P(ŵ₀, ŵ, φ_m, v_m) = ŵ₀ ν(φ_m, v_m) + ŵ μ(φ_m, v_m) − β(φ_m + 1)² G₀₀(ŵ₀, ŵ) + [D̂(φ_m + 1)² + L̂_v v_m²] / ŵ,   (5)

where ŵ = w/d, ŵ₀ = w₀/d, D̂ = D/d² and L̂_v = L_v/d² are dimensionless parameters. The functions μ, ν and G₀₀ are given in the Appendix. The parameters μ and ν play the same roles as λ and α in the undirected network model given
by Eq. (1). The energy gap between the foreground and the background is equal to 2α in the undirected network model and ν in the directed network model, and ν must be strictly positive to favour pixels belonging to the background. The parameter μ controls the contribution of the region interface of width w: it has an effect similar to the parameter λ.
3.2 Stability of the Bar
The energy per unit length of a network branch, e_P, is given by Eq. (5). A network branch is stable in the four-parameter family of ansatzes if it minimizes e_P with respect to variations of ŵ₀, ŵ, φ_m and v_m. This is equivalent to setting the first partial derivatives of e_P equal to zero and requiring its Hessian matrix to be positive definite. The desired value of (φ_m, v_m) is (1, 1), to describe the interior of the region R. Setting the first partial derivatives of e_P, evaluated at (ŵ₀, ŵ, 1, 1), equal to zero, and after some mathematical manipulation, one finds the following constraints:

ρ* − G(ŵ₀, ŵ) = 0,   (6)
β = ν* / (4 G₁₀(ŵ₀, ŵ)),   (7)
D̂ = (ŵ/2) [ ν* G₀₀(ŵ₀, ŵ) / (2 G₁₀(ŵ₀, ŵ)) − ŵ μ*_φ ],   (8)
L̂_v = −ŵ² μ*_v / 2,   (9)
where G(ŵ₀, ŵ) = [G₀₀(ŵ₀, ŵ)/ŵ + G₁₁(ŵ₀, ŵ)] / G₁₀(ŵ₀, ŵ), ρ* = [μ* + μ*_φ + μ*_v/2]/ν*, ν* = ν(1, 1), μ* = μ(1, 1), μ*_φ = μ_φ(1, 1) and μ*_v = μ_v(1, 1); μ_φ = ∂μ/∂φ_m, μ_v = ∂μ/∂v_m, G₁₀ = ∂G₀₀/∂ŵ₀ and G₁₁ = ∂G₀₀/∂ŵ. The starred parameters depend only on the 4 parameters of W: π_λ = (λ₂₂, λ₂₁, λ₀₄, λ₀₃). The conditions D̂ > 0 and L̂_v > 0 generate lower and upper bounds on ŵ₀, ŵ and π_λ.
Equation (6) shows that, for fixed π_λ, i.e. when ρ* is determined, the set of solutions in the plane (ŵ₀, ŵ) is the intersection of the surface representing the function G and the plane located at ρ*. The result is then a set of curves in the plane (ŵ₀, ŵ), where each corresponds to a value of ρ*. The top-left of Fig. 1 shows examples of solutions of Eq. (6) for some values of ρ*. The solutions must satisfy the condition ŵ < ŵ₀, otherwise the bar ansatz fails.
Equation (7) shows that bar stability depends mainly on the scaled parameter β̂ = β/ν*. The top-right of Fig. 1 shows a plot of β̂ against the scaled bar width ŵ₀ and the scaled interface width ŵ. We have plotted the surface as two half-surfaces: one is lighter (left-hand half-surface, smaller ŵ₀) than the other (right-hand half-surface, larger ŵ₀). The valley between the half-surfaces corresponds to the minimum value of β̂ for each value of ŵ. The graph shows that: for each value β̂ < β̂_min = inf(1/4G₁₀(ŵ₀, ŵ)) = 0.0879, there are no possible values of (ŵ₀, ŵ) which satisfy the constraints and so the bar energy does not have
Fig. 1. Top-left: examples of solutions of Eq. (6) for some values of ρ^⋆; curves are labelled by the values of ρ^⋆. Top-right: behaviour of β̂; the light and dark surfaces show the locations of maxima and minima respectively. Bottom: bar energy e_P against the bar parameters ŵ, ŵ_0, φ_m and v_m. Parameter values were chosen so that there is an energy minimum at π_B^⋆ = (1.36, 0.67, 1, 1).
extrema; for each value β̂ > β̂_min and for some chosen value ŵ, there are two possible values of ŵ_0 which satisfy the constraints: the smaller width (left-hand half-surface) corresponds to an energy maximum and the larger width (right-hand half-surface) corresponds to an energy minimum. The second-order variation of the bar energy is described by the 4 × 4 Hessian matrix H, which we do not detail due to lack of space. In order for the bar to be an energy minimum, H must be positive definite at (ŵ_0, ŵ, φ_m, v_m) = (ŵ_0, ŵ, 1, 1). This generates upper and lower bounds on the parameter values. This positivity condition is tested numerically because explicit expressions for the eigenvalues cannot be found. The lower part of Fig. 1 shows the bar energy, e_P, plotted against the physical parameters of the bar π_B = (ŵ_0, ŵ, φ_m, v_m). The desired energy minimum was chosen at π_B^⋆ = (1.36, 0.67, 1, 1).¹ We choose a parameter setting, π_λ, and then
The parameter values were: (λ04 , λ03 , λ22 , λ21 , D, β, d, Lv , Dv ) = (0.05, 0.025, 0.013, −0.6, 0.0007, 0.003, 1, 0.208, 0).
compute D, β and L_v using the parameter constraints given above. The bar energy per unit length e_P indeed has a minimum at the desired π_B.

3.3 Parameter Settings in Practice
There are nine free parameters in the model: (λ_04, λ_03, λ_22, λ_21, D, β, d, L_v, D_v). Based on the stability analysis in Section 3, two of these parameters are eliminated, while another two can be replaced by the 'physical' parameters ŵ_0 and ŵ: (λ_04, λ_03, λ_22, λ_21, D_v, ŵ_0, ŵ). The interface width w should be taken as small as possible compatible with the discretization; in practice, we take 2 < w < 4 [9]. The desired bar width w_0 is an application-determined physical parameter. We then fix the parameter values as follows: choose the 4 free parameter values (λ_04, λ_03, λ_22, λ_21) of the ultralocal term W; compute ν^⋆, μ^⋆, μ_φ^⋆, μ_v^⋆ and then ρ^⋆, and solve Eq. (6) to give the values of the scaled widths ŵ_0 and ŵ; compute β, D̂ and L̂_v using the parameter constraints given by Eq. (7), (8) and (9), and then compute D = d²D̂ and L_v = d²L̂_v, where d = w_0/ŵ_0; choose the free parameter D_v; check numerically the stability of the phases (−1, 0) and (1, 1) as described briefly at the beginning of Section 3; check numerically the positive definiteness of the Hessian matrix H.
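To make this procedure concrete, a minimal sketch is given below. It is our own illustration, not the authors' code: the helper `starred` (returning ν^⋆, μ^⋆, μ_φ^⋆, μ_v^⋆ for given (λ_04, λ_03, λ_22, λ_21)) and the geometric functions `G`, `G00`, `G10` are assumed to be supplied by the user from the Appendix formulas, and the constraint expressions follow Eqs. (6)–(9) as reconstructed above.

```python
from scipy.optimize import brentq

def set_model_parameters(lams, w0, w, starred, G, G00, G10,
                         w0_hat_bracket=(0.1, 10.0)):
    """Sketch of the Section 3.3 procedure (assumptions noted in the text).
    lams = (lam04, lam03, lam22, lam21); starred(lams) -> (nu*, mu*, mu_phi*, mu_v*);
    G, G00, G10 are functions of (w0_hat, w_hat).  Returns (D, beta, d, Lv).
    The remaining free parameter D_v is simply chosen by the user."""
    nu, mu, mu_phi, mu_v = starred(lams)
    rho = (mu + mu_phi + 0.5 * mu_v) / nu
    ratio = w / w0                       # w_hat / w0_hat is fixed by the physical widths
    # Eq. (6): find w0_hat such that G(w0_hat, ratio * w0_hat) = rho
    w0_hat = brentq(lambda x: G(x, ratio * x) - rho, *w0_hat_bracket)
    w_hat = ratio * w0_hat
    d = w0 / w0_hat                      # pixel scale
    beta = nu / (4.0 * G10(w0_hat, w_hat))                                   # Eq. (7)
    D_hat = (w_hat * nu * G00(w0_hat, w_hat) / (2.0 * G10(w0_hat, w_hat))
             - 0.5 * w_hat * mu_phi)                                         # Eq. (8)
    Lv_hat = -0.5 * w_hat ** 2 * mu_v                                        # Eq. (9)
    return d ** 2 * D_hat, beta, d, d ** 2 * Lv_hat
```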
4 Experiments
In Section 4.1, we present numerical experiments using the new model E_P that confirm the results of the theoretical analysis. In Section 4.2, we add a data likelihood and use the overall model to extract hydrographic networks from real images.

4.1 Geometric Evolutions
To confirm the results of the stability analysis, we first show that bars of different widths evolve under gradient descent towards bars of the stable width predicted by theory. Fig. 2 shows three such evolutions.² In all evolutions, we fixed D_v = 0, because the divergence term cannot destabilize the bar when the vector field is initialized as a constant everywhere in the image domain. The width of the initial straight bar is 10. The first row shows that the bar evolves until it disappears (ŵ_0 = 0), because β̂ < β̂_min, so that the bar energy does not have a minimum. On the other hand, the second and third rows show that straight bars evolve toward straight bars with the desired stable widths w_0 = 3 and 6, respectively, when β̂ > β̂_min. As a second test of the theoretical analysis, we present experiments that show that, starting from a random configuration of φ and v, the region evolves under
From top to bottom, parameter values were: (λ04 , λ03 , λ22 , λ21 , D, β, d, Lv , Dv ) = (0.1, −0.064, 0.232, −0.6, 0.065, 0.0027, 3.66, 0.5524, 0), (0.112, 0.019, −0.051, −0.6, 0.055, 0.011, 2.78, 0.115, 0) and (0.1, −0.064, 0.232, −0.6, 0.065, 0.0077, 3.66, 0.5524, 0).
Fig. 2. Geometric evolutions of bars using the directed network model E_P with the interaction function K_0. Time runs from left to right. First column: initial configurations, which consist of a straight bar of width 10. The initial φ is −1 in the background and φ_s in the foreground, and the initial vector field is (1, 0). First row: when β̂ < β̂_min, the initial bar vanishes; second and third rows: when β̂ > β̂_min, the bars evolve toward bars which have the desired stable widths 3 and 6 respectively. The regions are obtained by thresholding the function φ at 0.
Fig. 3. Geometric evolutions starting from random configurations. Time runs from left to right. Parameter values were chosen as a function of the desired stable width, which was 5 for the first row and 8 for the second row. Both initial configurations evolve towards network regions in which the branches have widths equal to the predicted stable width, except near junctions, where branch width changes significantly to accommodate flow conservation. The regions are obtained by thresholding the function φ at 0.
gradient descent to a network of the predicted width. Fig. 3 shows two such evolutions.³ At each point, the pair (φ, |v|) was initialized randomly to be either (−1, 0) or (1, 1). The orientation of v was chosen uniformly on the circle. The evolution in the first row has a predicted stable width w_0 = 5, while that in the second row has a predicted stable width w_0 = 8. In all cases, the initial configuration evolves to a stable network region with branch widths equal to the
From top to bottom, the parameter values were: (λ04 , λ03 , λ22 , λ21 , D, β, d, Lv , Dv ) = (0.25, 0.0625, 0.0932, −0.8, 0.111, 0.0061, 3.65, 0.0412, 50) and (0.4, −0.1217, 0.7246, −1, 0.4566, 0.0083, 10.8, 0.4334, 100).
predicted value, except near junctions, where branch width changes significantly to accommodate flow conservation.

4.2 Results on Real Images
In this section, we use the new model to extract hydrographic networks from multispectral VHR Quickbird images. The multi-spectral channels are red, green, blue and infra-red. Fig. 4 shows three such images in which parts of the hydrographic network to be extracted are obscured in some way. To apply the model, a likelihood energy linking the region R to the data I is needed in addition to the prior term E_P. The total energy to minimize is then E(φ, v) = θE_P(φ, v) + E_I(φ), where the parameter θ > 0 balances the two energy terms. The likelihood energy E_I(φ) is taken to be a multivariate mixture of two Gaussians (MMG) to take into account the heterogeneity in the appearance of the network produced by occlusions. It takes the form

$$E_I(\phi) = -\frac{1}{2}\int_{\Omega} dx\,\Big\{ \ln \sum_{i=1}^{2} p_i\,|2\pi\Sigma_i|^{-1/2}\, e^{-\frac{1}{2}(I(x)-\mu_i)^{t}\Sigma_i^{-1}(I(x)-\mu_i)} \;-\; \ln \sum_{i=1}^{2} \bar{p}_i\,|2\pi\bar{\Sigma}_i|^{-1/2}\, e^{-\frac{1}{2}(I(x)-\bar{\mu}_i)^{t}\bar{\Sigma}_i^{-1}(I(x)-\bar{\mu}_i)} \Big\}\,\phi(x) \qquad (10)$$
where the parameters (p_1, p_2, p̄_1, p̄_2, μ_1, μ_2, μ̄_1, μ̄_2, Σ_1, Σ_2, Σ̄_1, Σ̄_2) are learnt from the original images using the Expectation-Maximization algorithm; labels 1 and 2 refer to the two Gaussian components, and the unbarred and barred parameters refer to the interior and exterior of the region R respectively; t indicates transpose. Fig. 4 shows the results. The second row shows the maximum likelihood estimate of the network (i.e. using the MMG without E_P). The third and fourth rows show the MAP estimates obtained using the undirected network model,⁴ E_P^s, and the directed network model,⁵ E_P, respectively. The undirected network model favours network structures in which all branches have the same width. Consequently, network branches which have widths significantly different from the average are not extracted. The result on the third image clearly shows the false negative in the central part of the network branch, where the width is about half the average width. Similarly, the result on the second image shows a false positive at the bottom of the central loop of the network, where the true branch width is small. The result on the first image again shows a small false negative piece in the network junction at the bottom.
⁴ The parameter values were, from left to right: (λ, α, D, β, d) = (18.74, 0.0775, 11.25, 0.0134, 34.1), (4.88, 0.3327, 3, 0.0575, 4.54) and (19.88, 0.6654, 12, 0.1151, 9.1).
⁵ The parameter values were, from left to right: (λ_04, λ_03, λ_22, λ_21, D, β, d, L_v, D_v, θ) = (0.412, −0.0008, 0.0022, −0.6, 0.257, 0.0083, 8.33, 0.275, 50, 25), (0.2650, −0.1659, 0.5023, −0.8, 0.1926, 0.0387, 2.027, 0.2023, 100, 16.66) and (0.4525, −0.2629, 0.6611, −0.8, 0.3585, 0.0289, 5.12, 0.4224, 10, 22.22).
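For illustration, the sketch below shows one way to build this likelihood term in practice; it is our own, using scikit-learn's EM implementation rather than the authors' code, and it assumes rough initial interior/exterior masks from which the two two-component mixtures are fitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mmg_log_ratio(image, init_mask):
    """image: (H, W, C) multispectral image; init_mask: (H, W) boolean rough
    estimate of the network interior.  Returns the (H, W) map
    L(x) = ln p_in(I(x)) - ln p_out(I(x)), so that E_I(phi) = -0.5 * sum L(x) * phi(x)."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    inside = GaussianMixture(n_components=2, covariance_type='full')
    outside = GaussianMixture(n_components=2, covariance_type='full')
    inside.fit(pixels[init_mask.ravel()])        # EM on interior samples
    outside.fit(pixels[~init_mask.ravel()])      # EM on exterior samples
    log_ratio = inside.score_samples(pixels) - outside.score_samples(pixels)
    return log_ratio.reshape(image.shape[:2])
```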
Fig. 4. From top to bottom: multispectral Quickbird images; ML segmentation; MAP segmentation using the undirected model EPs ; MAP segmentation using the directed model EP . Regions are obtained by thresholding φ at 0. (Original images DigitalGlobe, CNES processing, images acquired via ORFEO Accompaniment Program).
Table 1. Quantitative evaluations of experiments of the three images given in Fig. 4. T, F, P, N, UNM and DNM correspond to true, false, positive, negative, undirected network model and directed network model respectively.

Image  Model  Completeness = TP/(TP+FN)  Correctness = TP/(TP+FP)  Quality = TP/(TP+FP+FN)
1      UNM    0.8439                     0.9168                    0.7739
1      DNM    0.8202                     0.9489                    0.7924
2      UNM    0.9094                     0.6999                    0.6043
2      DNM    0.8484                     0.7856                    0.6889
3      UNM    0.5421                     0.6411                    0.4158
3      DNM    0.7702                     0.6251                    0.6513
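For reference, the three measures in Table 1 can be computed from binary extraction and ground-truth masks as in the short sketch below (our own illustration of the definitions in the table header; pixel-wise matching is assumed).

```python
import numpy as np

def network_scores(pred, truth):
    """pred, truth: boolean masks of the extracted and reference networks.
    Returns the completeness, correctness and quality measures of Table 1."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return tp / (tp + fn), tp / (tp + fp), tp / (tp + fp + fn)
```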
The directed network model remedies these problems, as shown by the results in the fourth row. The vector field was initialized to be vertical for the first and third images, and horizontal for the second image. φ and |v| are initialized at the saddle point of the ultralocal term W , except for the third image where |v| = 1. The third image shows many gaps in the hydrographic network due mainly to the presence of trees. These gaps cannot be closed using the undirected network model. The directed network model can close these gaps because flow conservation tends to prolong network branches. See Table 1 for quantitative evaluations.
5 Conclusion
In this paper, we have conducted a theoretical study of a phase field HOAC model of directed networks in order to ascertain parameter ranges for which stable networks exist. This was done via a stability analysis of a long, straight bar that enabled some model parameters to be fixed in terms of the rest, others to be replaced by physically meaningful parameters, and lower and upper bounds to be placed on the remainder. We validated the theoretical analysis via numerical experiments. We then added a likelihood energy and tested the model on the problem of hydrographic network extraction from multi-spectral VHR satellite images, showing that the directed network model outperforms the undirected network model. Acknowledgement. The authors thank the French Space Agency (CNES) for the satellite images, and CNES and the PACA Region for partial financial support.
References
1. Chen, Y., Tagare, H., Thiruvenkadam, S., Huang, F., Wilson, D., Gopinath, K., Briggs, R., Geiser, E.: Using prior shapes in geometric active contours in a variational framework. IJCV 50, 315–328 (2002)
2. Cremers, D., Kohlberger, T., Schnörr, C.: Shape statistics in kernel space for variational image segmentation. Pattern Recognition 36, 1929–1943 (2003)
3. Paragios, N., Rousson, M.: Shape priors for level set representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 78–92. Springer, Heidelberg (2002)
4. Srivastava, A., Joshi, S., Mio, W., Liu, X.: Statistical shape analysis: Clustering, learning, and testing. IEEE Trans. PAMI 27, 590–602 (2003)
5. Riklin Raviv, T., Kiryati, N., Sochen, N.: Prior-based segmentation and shape registration in the presence of perspective distortion. IJCV 72, 309–328 (2007)
6. Taron, M., Paragios, N., Jolly, M.P.: Registration with uncertainties and statistical modeling of shapes with variable metric kernels. IEEE Trans. PAMI 31, 99–113 (2009)
7. Vaillant, M., Miller, M.I., Younes, A.L., Trouvé, A.: Statistics on diffeomorphisms via tangent space representations. NeuroImage 23, 161–169 (2004)
8. Rochery, M., Jermyn, I., Zerubia, J.: Higher order active contours. IJCV 69, 27–42 (2006)
9. Rochery, M., Jermyn, I.H., Zerubia, J.: Phase field models and higher-order active contours. In: ICCV, Beijing, China (2005)
10. El Ghoul, A., Jermyn, I.H., Zerubia, J.: A phase field higher-order active contour model of directed networks. In: NORDIA, in conjunction with ICCV, Kyoto, Japan (2009)
11. El Ghoul, A., Jermyn, I.H., Zerubia, J.: Segmentation of networks from VHR remote sensing images using a directed phase field HOAC model. In: ISPRS / PCV, Paris, France (2010)
12. El Ghoul, A., Jermyn, I.H., Zerubia, J.: Phase diagram of a long bar under a higher-order active contour energy: application to hydrographic network extraction from VHR satellite images. In: ICPR, Tampa, Florida (2008)
Appendix

The functions μ, ν and G_00 are

$$\mu = -\frac{v_m^4}{20} + \frac{v_m^2}{2}\left[\frac{\lambda_{22}}{10}(\phi_m+1)(-3\phi_m+2) + \frac{\lambda_{21}}{6}(-3\phi_m+1) + \frac{1}{3}\right] + \frac{\lambda_{04}}{60}(\phi_m+1)^2(-9\phi_m^2+12\phi_m+1) + \frac{\lambda_{03}}{6}(\phi_m+1)^2(-\phi_m+1) + \frac{1}{24}(\lambda_{22}+\lambda_{21})(\phi_m+1)^2 \qquad (11)$$

$$\nu = \frac{v_m^4}{4} + \frac{v_m^2}{2}\left[\frac{\lambda_{22}}{2}(\phi_m^2-1) + \lambda_{21}(\phi_m-1) - 1\right] + \frac{\lambda_{04}}{4}(\phi_m^2-1)^2 + \frac{\lambda_{03}}{3}(\phi_m+1)^2(\phi_m-2) - \frac{1}{8}(\lambda_{22}+\lambda_{21})(\phi_m+1)^2 \qquad (12)$$

$$G_{00}(\hat{w}_0,\hat{w}) = \frac{2}{\hat{w}^2}\int_{0}^{+\infty}\! dz \int_{0}^{\hat{w}}\! dx_2 \int_{-x_2}^{\hat{w}-x_2}\! dt\, \Big[\Psi\big(\sqrt{z^2+t^2}\big) - \Psi\big(\sqrt{z^2+(\hat{w}_0+t)^2}\big)\Big] \qquad (13)$$
Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition

Yan Zhu¹, Xu Zhao¹, Yun Fu², and Yuncai Liu¹

¹ Shanghai Jiao Tong University, Shanghai 200240, China
² Department of CSE, SUNY at Buffalo, NY 14260, USA
Abstract. By extracting local spatial-temporal features from videos, many recently proposed approaches for action recognition achieve promising performance. The Bag-of-Words (BoW) model is commonly used in these approaches to obtain video-level representations. However, the BoW model roughly assigns each feature vector to its closest visual word, inevitably causing nontrivial quantization errors and impairing further improvements in classification rates. To obtain a more accurate and discriminative representation, in this paper we propose an approach for action recognition that encodes local 3D spatial-temporal gradient features within the sparse coding framework. In so doing, each local spatial-temporal feature is transformed into a linear combination of a few "atoms" of a trained dictionary. In addition, we investigate the construction of the dictionary under the guidance of transfer learning. We collect a large set of diverse video clips from sport games and movies, from which a set of universal atoms composing the dictionary is learned by an online learning strategy. We test our approach on the KTH dataset and the UCF sports dataset. Experimental results demonstrate that our approach outperforms state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF sports dataset.
1 Introduction
Recognizing human actions is of great importance in various applications such as human-computer interaction, intelligent surveillance and automatic video annotation. However, the diverse intra-class variations of human poses, occlusions, viewpoints and other exterior conditions in realistic scenarios still make accurate action classification a challenging problem. Recently, approaches based on local spatial-temporal descriptors [1–3] have achieved promising performance. These approaches generally first detect or densely sample a set of spatial-temporal interest points from videos, and then describe their spatial-temporal properties or local statistical characteristics within small cuboids centered at these interest points. To obtain global representations from sets of local features, the popular bag-of-words model [1, 4–6] is widely used in combination with various local spatial-temporal descriptors. In the BoW model, an input video is viewed as an unordered collection of spatial-temporal words, each of which is quantized to its closest visual word in a trained
Fig. 1. Framework of our action recognition system. First the input video is transformed to a group of local spatial temporal features through dense sampling. Then the local features are encoded into sparse codes using the pre-trained dictionary. Finally, max pooling operation is applied over the whole sparse code set to obtain the final video representation.
dictionary by measuring a distance metric within the feature space. The video is finally represented as the histogram of visual word occurrences. However, there is a drawback in the quantization strategy of the BoW model. Simply assigning each feature to its closest visual word can lead to relatively high reconstruction error; the obtained approximation of the original feature is therefore too coarse to be sufficiently discriminative for the subsequent classification task. To address this problem, we propose an action recognition approach within the sparse coding framework, which is illustrated in Fig. 1. First, we densely extract a set of local spatial-temporal descriptors with varying space and time scales from the input video. The descriptor we adopt is HOG3D [2]. We use sparse coding to encode each local descriptor into its corresponding sparse code according to the pre-trained dictionary. Then a maximum pooling procedure is applied over the whole sparse code set of the video. The obtained feature vector, namely the final representation of the video, is the input to the classifier for action recognition. In the classification stage, we use a multi-class linear support vector machine. Another issue addressed in this paper is how to effectively model the distribution of the local spatial-temporal volumes over the feature space. To this end, constructing a well-defined dictionary is a critical step in the sparse coding framework. Generally, a dictionary is built from a collection of labeled training video clips similar to the test videos. However, a large set of labeled video data is required to obtain a well-generalized dictionary, annotating such a large video database is a difficult task, and the generalization capability of the dictionary is degraded when homologous training and test data are used. In this work, motivated by previous works introducing transfer
learning into image classification [7, 8], we construct our dictionary by utilizing large volumes of unlabeled video clips, from which the dictionary can learn universal prototypes that facilitate the classification task. These video clips are widely collected from movies and sport games. Although these unlabeled video sequences may not be closely relevant to the actions to be classified on the semantic level, they can be used to learn a more generalized latent structure of the feature space. Their efficacy is demonstrated by extensive comparative experiments. The remainder of the paper is organized as follows. Section 2 reviews previous related work in action recognition and sparse coding. Section 3 introduces our proposed action recognition approach and the implementation details. Section 4 describes the experimental setup, results, and discussion. Section 5 briefly concludes this paper and addresses future work.
2 Related Work
Many existing approaches to action recognition extract video representations from a set of detected interest points [9–11]. Diverse local spatial-temporal descriptors have been introduced to describe the properties of these points. Although compact and computationally efficient, interest-point-based approaches discard much potentially useful information in the original data, and thus weaken the discriminative power of the representation. Two recent evaluations confirm that dense sampling can achieve better performance in action recognition tasks [12, 13]. In our work, we adopt dense sampling, but we use sparse coding instead of the BoW model to obtain a novel representation of videos. Sparse representation has been widely discussed recently and has achieved exciting progress in various fields including audio classification [14], image inpainting [15], segmentation [16] and object recognition [8, 17, 18]. These successes demonstrate that sparse representation can flexibly adapt to diverse low-level natural signals with desirable properties. Besides, research on the visual cortex has justified the biological plausibility of sparse coding [19]. For human action classification, the information of interest is inherently sparsely distributed in the spatial-temporal domain. Inspired by the above insights, we introduce sparse representations into the field of action recognition from video. To improve classification performance, transfer learning has been incorporated into the sparse coding framework to obtain robust representations and learn knowledge from unlabeled data [8, 17, 20]. Raina et al. [8] proposed self-taught learning to construct the sparse coding dictionary using unlabeled irrelevant data. Yang et al. [17] integrated sparse coding with the traditional spatial pyramid matching method for object classification. Liu et al. [20] proposed a topographic subspace model for image classification and retrieval employing inductive transfer learning. Motivated by the above successes, our work explores the application of sparse coding with transfer learning in the video domain. Experimental results will validate that dictionary training with transferable knowledge from unlabeled data can significantly improve the performance of action recognition.
3 Approach

3.1 Local Descriptor and Sampling Strategy
Each video can be viewed as a 3D spatial-temporal volume. In order to capture sufficient discriminative information, we choose to densely sample a set of local features across the space-time domain throughout the input video volume. Each descriptor is computed within a small 3D cuboid centered at a space-time point. Multiple sizes of 3D cuboid are adopted to increase scale and speed invariance. We follow the dense sampling parameter settings described in [13]. To capture local motion and appearance characteristics, we use the HOG3D descriptor proposed in [2]. The descriptor computation can be summarized in the following steps:
1. Smooth the input video volume to obtain the integral video representation.
2. Divide the sampled 3D cuboid centered at p = (x_p, y_p, t_p) into n_x × n_y × n_t cells and further divide each cell into S_b × S_b × S_b subblocks.
3. For each subblock b_i, compute the mean gradient ḡ_{b_i} = (ḡ_{b_i,∂x}, ḡ_{b_i,∂y}, ḡ_{b_i,∂t}) using the integral video.
4. Quantize each mean gradient ḡ_{b_i} as q_{b_i} using a regular icosahedron.
5. For each cell c_j, compute the histogram h_{c_j} of the quantized mean gradients over the S_b × S_b × S_b subblocks.
6. Concatenate the n_x × n_y × n_t histograms into one feature vector. After ℓ2 normalization, the obtained vector x_p is the HOG3D descriptor for p.
In Section 4, the parameter settings for sampling and descriptor computation will be discussed in detail.
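As an illustration of these steps, the sketch below computes a simplified HOG3D-like descriptor for one cuboid. It is our own approximation, not the implementation of [2]: it hard-assigns each subblock's mean gradient to the nearest of 20 icosahedron face-normal directions (taken as dodecahedron vertices) instead of the soft projection used there, and it works on raw gradients rather than an integral video.

```python
import numpy as np

def dodecahedron_directions():
    """20 unit vectors: vertices of a regular dodecahedron, i.e. (up to scale)
    the face normals of a regular icosahedron, used to quantize 3D gradients."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0
    dirs = [(sx, sy, sz) for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    for s1 in (-1, 1):
        for s2 in (-1, 1):
            a, b = s1 / phi, s2 * phi
            dirs += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    d = np.asarray(dirs, dtype=float)
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def hog3d_like(cuboid, n_cells=(4, 4, 3), s_b=3, eps=1e-8):
    """Simplified HOG3D-style descriptor for one video cuboid of shape (t, y, x).
    With n_cells = (n_x, n_y, n_t) = (4, 4, 3) and 20 directions, the output is
    960-dimensional, as in Section 4.1."""
    dirs = dodecahedron_directions()                 # (20, 3) quantization axes
    gt, gy, gx = np.gradient(cuboid.astype(float))   # temporal and spatial gradients
    hist = []
    t_cells = np.array_split(np.arange(cuboid.shape[0]), n_cells[2])
    y_cells = np.array_split(np.arange(cuboid.shape[1]), n_cells[1])
    x_cells = np.array_split(np.arange(cuboid.shape[2]), n_cells[0])
    for ti in t_cells:
        for yi in y_cells:
            for xi in x_cells:
                h = np.zeros(len(dirs))
                for tb in np.array_split(ti, s_b):          # S_b^3 subblocks per cell
                    for yb in np.array_split(yi, s_b):
                        for xb in np.array_split(xi, s_b):
                            sub = np.ix_(tb, yb, xb)
                            g = np.array([gx[sub].mean(), gy[sub].mean(), gt[sub].mean()])
                            mag = np.linalg.norm(g)
                            if mag > eps:                    # vote for nearest direction
                                h[np.argmax(dirs @ (g / mag))] += mag
                hist.append(h)
    desc = np.concatenate(hist)
    return desc / (np.linalg.norm(desc) + eps)               # l2 normalization
```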
3.2 Sparse Coding Scheme for Action Recognition
In our action recognition framework, sparse coding is used to obtain a more discriminative intermediate representation for human actions. Suppose we have obtained a set of local spatial-temporal features X = [x_1, ..., x_N] ∈ R^{d×N} to represent a video, where each feature is a d-dimensional column vector, and we have a well-trained dictionary D = [d_1, ..., d_S] ∈ R^{d×S}. The sparse coding method [8, 17, 18] sparsely encodes each feature vector in X into a linear combination of a few atoms of the dictionary D by optimizing

$$\hat{Z} = \arg\min_{Z \in \mathbb{R}^{S \times N}} \frac{1}{2}\|X - DZ\|_2^2 + \lambda\|Z\|_1, \qquad (1)$$
where λ is a regularization parameter which determines the sparsity of the representation of each local spatial-temporal feature. The dictionary D is pre-trained to be an overcomplete basis set composed of S atoms, each of which is a d-dimensional column vector. Note that S is typically greater than 2d. To avoid numerical instability, each column of D is subject to the constraint ||d_k||_2 ≤ 1. Once the dictionary D is fixed, the optimization over Z alone is convex, and can thus be viewed as an ℓ1-regularized linear least-squares problem. The twofold optimization goal ensures the least reconstruction error and the
sparsity of the coefficient set Ẑ simultaneously. We use the LARS-lasso implementation provided by [15] to find the optimal solution. After optimization, we get a set of sparse codes Ẑ = [ẑ_1, ..., ẑ_N], where each column vector ẑ_i has only a few nonzero elements. This can also be interpreted as each descriptor responding only to a small subset of the dictionary D. To capture the global statistics of the whole video, we use a maximum pooling function [17, 18] to pool over the sparse code set Ẑ, defined as

$$\beta = \xi_{\max}(\hat{Z}), \qquad (2)$$

where ξ_max returns a vector β ∈ R^S with the k-th element defined as

$$\beta_k = \max\{|\hat{Z}_{k1}|, |\hat{Z}_{k2}|, ..., |\hat{Z}_{kN}|\}. \qquad (3)$$
Through maximum pooling, the obtained vector β is viewed as the final video-level feature. The maximum pooling operation has been successfully used in several image classification frameworks to increase spatial translation invariance [17, 18, 21]. Such a mechanism has been shown to be consistent with the properties of cells in the visual cortex [21]. Motivated by the above insights, we adopt a similar procedure to increase both spatial and temporal translation invariance. Within the sparse code set, only the strongest response for each particular atom is preserved, without considering its spatial and temporal location. Experimental results demonstrate that the maximum pooling procedure leads to a compact and discriminative final representation of the videos. Note that our choice of a dense sampling strategy provides sufficient low-level local features, which guarantees that the maximum pooling procedure is statistically reliable.
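A compact sketch of the encoding and pooling stage is given below, using scikit-learn's `sparse_encode` as a stand-in for the LARS-lasso solver of [15]; the function and parameter names are our assumptions, and the scaling of `alpha` relative to λ in Eq. (1) differs slightly between implementations.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def video_representation(X, D, lam=0.15):
    """X: (N, d) local HOG3D descriptors of one video; D: (S, d) dictionary
    with unit-norm atoms.  Returns the pooled feature beta in R^S."""
    # Eq. (1), solved row-wise by an l1 (LARS-lasso) solver
    Z = sparse_encode(X, D, algorithm='lasso_lars', alpha=lam)   # (N, S)
    # Eqs. (2)-(3): max pooling of absolute coefficients over all descriptors
    return np.abs(Z).max(axis=0)
```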
3.3 Dictionary Construction Based on Transfer Learning
Under the sparse coding framework, constructing a dictionary for a specific classification task is essentially to learn a set of overcomplete bases to represent the basic pattern of the specific data distribution within the feature space. Given a large collection of local descriptors Y = [y_1, ..., y_M], the dictionary learning process can be interpreted as jointly optimizing with respect to the dictionary D and the coefficient set Z = [z_1, ..., z_M]:

$$\arg\min_{Z \in \mathbb{R}^{S \times M},\, D \in \mathcal{C}} \; \frac{1}{M}\sum_{i=1}^{M}\left[\frac{1}{2}\|y_i - Dz_i\|_2^2 + \lambda\|z_i\|_1\right], \qquad (4)$$
where C is defined as the convex set C ≜ {D ∈ R^{d×S} s.t. ||d_k||_2 ≤ 1, ∀k ∈ {1, ..., S}}. In the dictionary training stage, the optimization is not convex when both the dictionary D and the coefficient set Z vary. How to solve the above optimization, especially for large training sets, is still an open problem. Recently, Mairal et al. [15] presented an online dictionary learning algorithm which is much faster than previous methods and has proven to be more suitable for the large training sets arising in action recognition. Due to these desirable properties, we use this technique to train our dictionary.
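As an illustration, the dictionary can be trained with an online mini-batch learner such as the one below (a sketch using scikit-learn's implementation of the algorithm of [15]; the hyper-parameter values shown are assumptions, not the authors' settings).

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

def train_dictionary(Y, n_atoms=4000, lam=0.15):
    """Y: (M, d) HOG3D descriptors sampled from the (unlabeled) training clips.
    Returns a dictionary of shape (n_atoms, d) with unit-norm rows."""
    learner = MiniBatchDictionaryLearning(
        n_components=n_atoms,   # dictionary size S
        alpha=lam,              # l1 penalty, the role of lambda in Eq. (4)
        batch_size=256,         # online mini-batches, as in [15]
        fit_algorithm='lars',
        random_state=0,
    )
    learner.fit(Y)
    return learner.components_
```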
To discover the latent structure of the feature space for actions, we use a large set of unlabeled video clips collected from movies and sport games as the "learning material", which is different from [22]. In recent years, transfer learning from unlabeled data to facilitate supervised classification tasks has led to many successful applications in machine learning [8, 20]. Although these unlabeled video clips do not necessarily belong to the same classes as the test data, they are similar in that they all contain human motions sharing universal patterns. This transferable knowledge is helpful for our supervised action classification. In our experiments, we construct two dictionaries in different ways. The first one is trained with patches from the target classification videos and the second one is trained with unlabeled video clips. Experimental results will show that the second dictionary yields a higher recognition rate. This can be explained by the fact that the unlabeled data contain more diverse patterns of human actions, which help the dictionary to thoroughly discover the nature of human action. In contrast, a dictionary constructed merely from the training data can hardly grasp the universal basis because of the insufficient information provided by the training data, which makes the dictionary unable to encode the test data sparsely and accurately. It is interesting to observe that the dictionary constructed from unlabeled data can potentially be reused in other relevant action classification tasks. The transferable knowledge makes the dictionary more universal and reusable.
3.4 Multi-class Linear SVM
In the classification stage, we use a multi-class support vector machine with linear kernels as described in [17]. Given a training video set {(β_i, y_i)}_{i=1}^{n} for an L-class classification task, where β_i denotes the feature vector of the i-th video and y_i ∈ Y = {1, ..., L} denotes the class label of β_i, we adopt the one-against-all method to train L linear SVMs, which jointly learn L linear functions {w_c^T β | c ∈ Y}. The trained SVMs predict the class label y_j for a test video feature β_j by solving

$$y_j = \arg\max_{c \in \mathcal{Y}} \; w_c^{T} \beta_j. \qquad (5)$$
Note that traditional histogram-based action recognition models usually need specially designed nonlinear kernels, which lead to time-consuming computations for classifier training. In contrast, a linear-kernel SVM operating on the sparse coding statistics can achieve satisfying accuracy at much higher speed.
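A minimal sketch of the classification stage is shown below, assuming hypothetical arrays `betas`/`labels` hold the pooled video features and class labels from Section 3.2; scikit-learn's `LinearSVC` trains one-vs-rest linear SVMs, matching the prediction rule of Eq. (5).

```python
from sklearn.svm import LinearSVC

# betas: (n_videos, S) pooled sparse-code features; labels: (n_videos,) in {1..L}
clf = LinearSVC(C=1.0)        # one-vs-rest linear SVMs; prediction follows Eq. (5)
clf.fit(betas, labels)
predicted = clf.predict(betas_test)
```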
4 Experiments
We evaluate our approach on two benchmark human action datasets: the KTH action dataset [5] and the UCF Sports dataset [23]. Some sample frames from the two databases are shown in Fig. 2. In addition, we evaluate the different effects of two dictionary construction manners.
Fig. 2. Sample frames from KTH dataset (top row) and UCF sports dataset (middle and bottom rows)
4.1 Parameter Settings
Sampling and descriptor parameters. In the sampling stage, we extract 3D patches of varying sizes from the test video. The minimum size of a 3D patch is 18 pixels × 18 pixels × 10 frames. We employ the sampling settings suggested in [13], specifically, using eight spatial scales (18, 18√2, 36, 36√2, 72, 72√2, 144, 144√2) and two temporal scales (10, 10√2). Patches with all possible combinations of temporal and spatial scales are densely extracted with 50% overlap. Note that we discard those patches whose spatial scales exceed the video resolution, e.g. 144 pixels × 144 pixels for KTH frames of 160 × 120 resolution. We calculate HOG3D features using the executable provided by the authors of [2] with default parameters, namely, number of supporting subblocks S_b³ = 27 and number of histogram cells n_x = n_y = 4, n_t = 3; thus each patch corresponds to a 960-dimensional feature vector.
Dictionary training parameters. For dictionary construction, we set the dictionary size to 4000 empirically. In the dictionary learning stage, we extract 400,000 HOG3D features from 500 video clips collected from movies and sport games. In selecting dictionary training video clips, we do not impose any semantic constraints on the contents of the video clips except one criterion: all video clips must contain at least one moving subject. The ℓ1 regularization parameter λ is set to 1.2/√m, as suggested in [15], where m = 960 denotes the dimension of the original signal.
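The sketch below enumerates such a dense multi-scale sampling grid (our own illustration; the exact rounding and boundary conventions of [13] may differ).

```python
import numpy as np

def dense_patches(video_shape, min_spatial=18, min_temporal=10,
                  n_spatial=8, n_temporal=2, overlap=0.5):
    """Enumerate (t0, y0, x0, temporal_size, spatial_size) for dense multi-scale
    3D patches with 50% overlap; patches whose spatial size exceeds the frame
    resolution are discarded (Section 4.1)."""
    T, H, W = video_shape
    spatial = [min_spatial * np.sqrt(2) ** i for i in range(n_spatial)]
    temporal = [min_temporal * np.sqrt(2) ** i for i in range(n_temporal)]
    patches = []
    for ss in spatial:
        s = int(round(ss))
        if s > min(H, W):
            continue                        # e.g. 144 > 120 for KTH frames
        for ts in temporal:
            t = int(round(ts))
            if t > T:
                continue
            step_s = max(1, int(s * (1 - overlap)))
            step_t = max(1, int(t * (1 - overlap)))
            for t0 in range(0, T - t + 1, step_t):
                for y0 in range(0, H - s + 1, step_s):
                    for x0 in range(0, W - s + 1, step_s):
                        patches.append((t0, y0, x0, t, s))
    return patches
```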
4.2 Performance on KTH Dataset
The KTH dataset is a benchmark dataset to evaluate various human action recognition algorithms. It consists of six types of human actions including
walking, jogging, running, boxing, hand waving and hand clapping. Each action is performed several times by 25 subjects under four different environment settings: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. Currently the KTH dataset contains 599 video clips in total. We follow the common experimental setup of [5], randomly dividing all the sequences into a training set (16 subjects) and a test set (9 subjects). We train a multi-class support vector machine using the one-against-all method. The experiment is repeated 100 times and the average accuracy over all classes is reported.

Table 1. Comparisons to previously published results on the KTH dataset

Methods          Average Precision   Experimental Setting
Niebles [4]      81.50%              leave-one-out
Jhuang [24]      91.70%              split
Fathi [25]       90.50%              split
Laptev [1]       91.80%              split
Bregonzio [26]   93.17%              leave-one-out
Kovashka [6]     94.53%              split
Our Method       94.92%              split
Table 1 shows that our proposed method achieves an average accuracy of 94.92%, which outperforms previously published results. Note that some of the above results [4, 26] were obtained under leave-one-out settings. The confusion matrix shown in Fig. 3 demonstrates that all classes are predicted with satisfactory precision except the pair of jogging and running. This is understandable since these two sorts of actions look ambiguous even to human beings.
Fig. 3. Confusion matrix for the KTH dataset, the rows are the real labels while the columns are predicted ones. All the reported results are the averages of 100 rounds.
4.3 Performance on the UCF Sports Dataset
The UCF sports dataset consists of 150 video clips belonging to 10 action classes: diving, golf swinging, kicking, lifting, horse riding, running, skating, swinging (around high bars), swinging (on the floor or on the pommel) and walking. All the video clips are collected from realistic sports broadcasts. Following [13, 23], we enlarge the dataset by horizontally flipping all the original clips. Similar to the setting on the KTH dataset, we train a multi-class SVM in the one-against-all setting. To compare fairly with previous results, we also adopt the leave-one-out protocol, specifically, testing each original video clip in turn while training on all the remaining clips. The flipped version of the test video clip is excluded from the training set. We report the average accuracy over all classes. Experimental results show that our method achieves accuracy comparable to state-of-the-art techniques, as shown in Table 2. Fig. 4 shows the confusion matrix for the UCF dataset. Note that the UCF Sports dataset used by the authors listed in Table 2 differs slightly, since some of the videos have been removed from the original version due to copyright issues.
4.4 Dictionary Construction Analysis
We also explore different methods for dictionary construction on the two datasets and evaluate the corresponding effects on performance.

Table 2. Comparisons to previously published results on the UCF sports dataset

Methods          Average Precision   Experimental Setting
Rodriguez [23]   69.20%              leave-one-out
Yeffet [27]      79.30%              leave-one-out
Wang [13]        85.60%              leave-one-out
Kovashka [6]     87.27%              leave-one-out
Our Method       84.33%              leave-one-out
Fig. 4. Confusion matrix of UCF dataset
Table 3. Comparison of different dictionary training sources on the KTH dataset

Dictionary Source   Average Sparsity   Standard Deviation   Classification Accuracy
Training Set        0.20%              0.65%                91.94%
Diverse Source      0.09%              0.54%                94.92%

Table 4. Comparison of different dictionary training sources on the UCF Sports dataset

Dictionary Source   Average Sparsity   Standard Deviation   Classification Accuracy
Training Set        0.09%              0.61%                82.09%
Diverse Source      0.06%              0.50%                84.33%
We train two dictionaries from different sources for each dataset. The first one is trained on patches collected from the training set of the KTH dataset or the UCF Sports dataset, while the other one is trained with patches extracted from diverse sources including movies and sports videos. All the parameter settings for training are the same. Results show that the dictionary trained with diverse sources yields higher accuracy than the one trained merely on the training sets. To further investigate the effect of dictionary sources, we calculate the sparsity of the sparse codes transformed from the original features. For each sparse code vector z ∈ R^S, we define the sparsity of z as

$$\mathrm{sparsity}(z) = \frac{\|z\|_0}{S}, \qquad (6)$$
where the zero-norm ||z||_0 denotes the number of nonzero elements in z. We calculate the average sparsity and the corresponding standard deviation over the whole sparse code set, as shown in Table 3 and Table 4. We use these two statistics to measure how suitable a dictionary is for the given dataset. We find that the diverse source dictionary achieves lower average sparsity, and its corresponding standard deviation is also lower than that obtained using the training set. Low sparsity ensures more compact representations, and low standard deviation over the whole dataset indicates that the diverse source dictionary is more universal across different local motion patterns. In contrast, the dictionary obtained from the training set is less sparse and more prone to fluctuation. Although it can encode certain local features in an extremely sparse form in some cases (probably due to overfitting to certain patterns), the overall sparsity of the whole sparse code set is still not comparable. This can be explained by the distribution disparity between training videos and test videos, which prevents the dictionary learned merely from the training set from precisely modelling the target subspace, and further impairs the discriminative power of the final video representation. In contrast, the diverse source dictionary captures common patterns from diversely distributed data. It generates a set of bases with higher robustness, and thus better models a subspace that correlates well with the target data distribution. Table 3 and Table 4 help demonstrate this point.
5 Conclusion
In this paper, we have proposed an action recognition approach using sparse coding and local spatial-temporal descriptors. To obtain high-level video representations from local features, we also propose using transfer learning and a maximum pooling procedure rather than the traditional histogram representation of the BoW model. Experimental results demonstrate that sparse coding can provide a more accurate representation with desirable sparsity, which strengthens the discriminative power and eventually helps improve the recognition accuracy. In future work, we would like to investigate the inner structure of the dictionary and the mutual relationships of different atoms, which will be helpful for exploring semantic representations. Supervised dictionary learning algorithms will also be a future research interest.
Acknowledgements. This work is supported by the 973 Key Basic Research Program of China (2011CB302203), the NSFC Program (60833009, 60975012) and the SUNY Buffalo Faculty Startup Funding.
References
1. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE CVPR (2008)
2. Kläser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: British Machine Vision Conference (2008)
3. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: ACM Multimedia, pp. 357–360 (2007)
4. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79, 299–318 (2008)
5. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR, pp. 32–36 (2004)
6. Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: IEEE CVPR (2010)
7. Ando, R.K., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6, 1817–1853 (2005)
8. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.: Self-taught learning: transfer learning from unlabeled data. In: International Conference on Machine Learning, pp. 759–766. ACM, New York (2007)
9. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
10. Laptev, I.: On space-time interest points. IJCV 64, 107–123 (2005)
11. Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: IEEE ICCV (2007)
12. Dikmen, M., Lin, D., Del Pozo, A., Cao, L., Fu, Y., Huang, T.S.: A study on sampling strategies in space-time domain for recognition applications. In: Advances in Multimedia Modeling, pp. 465–476 (2010)
13. Wang, H., Ullah, M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: British Machine Vision Conference (2009)
14. Grosse, R., Raina, R., Kwong, H., Ng, A.Y.: Shift-invariant sparse coding for audio classification. In: UAI (2007)
15. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: International Conference on Machine Learning, pp. 689–696. ACM, New York (2009)
16. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Discriminative learned dictionaries for local image analysis. In: IEEE CVPR (2008)
17. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE CVPR (2009)
18. Yang, J., Yu, K., Huang, T.S.: Supervised translation-invariant sparse coding. In: IEEE CVPR (2010)
19. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 3311–3325 (1997)
20. Liu, Y., Cheng, J., Xu, C., Lu, H.: Building topographic subspace model with transfer learning for sparse representation. Neurocomputing 73, 1662–1668 (2010)
21. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE T-PAMI 29, 411–426 (2007)
22. Taylor, G., Bregler, C.: Learning local spatio-temporal features for activity recognition. In: Snowbird Learning Workshop (2010)
23. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE CVPR (2008)
24. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: IEEE ICCV (2007)
25. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: IEEE CVPR (2008)
26. Bregonzio, M., Gong, S., Xiang, T.: Recognising action as clouds of space-time interest points. In: IEEE CVPR (2009)
27. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: IEEE ICCV (2009)
Multi-illumination Face Recognition from a Single Training Image per Person with Sparse Representation

Die Hu, Li Song, and Cheng Zhi

Institute of Image Communication and Information Processing, Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
{butterflybubble,song li,zhicheng}@sjtu.edu.cn
Abstract. In real-world face recognition systems, traditional face recognition algorithms often fail in the case of insufficient training samples. Recently, face recognition algorithms based on sparse representation have achieved promising results even in the presence of corruption or occlusion. However, a large, over-complete and elaborately designed discriminant training set is still required to form the sparse representation, which seems impractical in the single training image per person problem. In this paper, we extend Sparse Representation Classification (SRC) to the one sample per person problem. We address this problem under varying lighting conditions by introducing relighting methods to generate virtual faces. A diverse and complete training set can thus be composed, which makes SRC more general. Moreover, we verify recognition under different lighting environments by a cross-database comparison.
1 Introduction
In the past decades, the one sample per person problem has attracted significant attention in Face Recognition Technology (FRT) due to wide applications such as law enforcement scenarios and passport identification. It is defined as follows: given a stored database of faces with only one image per person, the goal is to identify a person from the database later in time, from an individual image, under different and unpredictable poses, lighting, etc. [1]. However, when there is only one sample per person available, many difficulties, e.g. unreliable parameter estimation, may arise when directly applying traditional recognition methods. Recently, many research efforts have focused on applying sparse representation to computer vision tasks, since sparse signal representation has proven to be an extremely powerful tool for acquiring, representing, and compressing high-dimensional signals [2]. In FRT, Wright et al. [3] cast face recognition as the problem of finding the sparsest representation of the test image over the training set, which has demonstrated promising results even in the presence of corruption and occlusion. However, they have to provide a large, over-complete and elaborately designed discriminant training set to form the sparse
representation. It is beyond doubt that this constraint poses a great challenge in the one sample per person problem. Details will be discussed in Section 3. In this paper, we propose an improved SRC method for the one sample per person recognition problem under different lighting environments. Different from traditional face recognition methods focused on illumination compensation, we make full use of the diverse and representative illumination information to generate novel realistic faces. Given a single input image, we can synthesize its image space, so that the test image forms an efficient sparse representation when projected onto the constructed space. In doing so, we not only satisfy the over-completeness constraint in the one sample per person problem, but also guarantee the discrimination capability of the training set by selecting varying illumination conditions. We conduct extensive experiments on public databases and, in particular, verify recognition under different lighting environments by a cross-database comparison, in which the illumination record does not come from the same database as the images to be recognized.
2 Preliminaries

2.1 SRC
As mentioned above, in FRT, [3] has cast the face recognition problem as finding the sparsest representation of the test image over the whole training set. The well-aligned n_i training images of individual i taken under varying illumination are stacked as the columns of a matrix A_i = [a_{i,1}, a_{i,2}, ..., a_{i,n_i}] ∈ R^{m×n_i}, each normalized to unit ℓ2 norm. We can then define a dictionary as the concatenation of all N object classes, A = [A_1, A_2, ..., A_N] ∈ R^{m×n}. Here, n = Σ_{i=1}^{N} n_i is the total number of training images. Since images of the same face under varying illumination lie near a special low-dimensional subspace [4], a given test image y ∈ R^m can be represented by a sparse linear combination of the training set, y = Ax, where x is a coefficient vector whose entries are all zero except for those associated with the i-th object. However, in consideration of the gross error brought by partial corruption or occlusion, the formulation becomes y = Ax + e. Since the desired solution (x, e) is not only sparse but the sparsest solution of the system, it can be recovered by solving

$$(\hat{x}, \hat{e}) = \arg\min \|x\|_0 + \|e\|_0 \quad \text{s.t.} \quad y = Ax + e. \qquad (1)$$
Here, the ℓ0 norm counts the number of nonzero entries in a vector. However, solving the ℓ0 norm problem above is NP-hard. Based on theoretical results on the equivalence of ℓ0 and ℓ1 minimization, if the solution (x, e) sought is sparse enough, the authors instead seek the convex relaxation

$$\min \|x\|_1 + \|e\|_1 \quad \text{s.t.} \quad y = Ax + e. \qquad (2)$$
The above problem can be solved efficiently by linear programming methods in polynomial time. Subsequently, by analysing to what extent the coefficients concentrate on one subject, we can judge which class the test image belongs to and whether it belongs to any subject in the training database at all.
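As a rough illustration of the relaxation (2), the sketch below represents a test image over the extended dictionary [A, I] with an ℓ1 penalty, so that gross errors are absorbed by the identity columns; the use of scikit-learn's `Lasso` (an unconstrained surrogate of the equality-constrained problem) and the value of `alpha` are our assumptions, not the solver used in [3].

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_with_error(y, A, alpha=1e-3):
    """Sparse representation of y over [A, I]: the part of the coefficient
    vector on the identity block plays the role of the error e in Eq. (2)."""
    B = np.hstack([A, np.eye(A.shape[0])])
    w = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(B, y).coef_
    x, e = w[:A.shape[1]], w[A.shape[1]:]
    return x, e
```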
2.2 Quotient Image
The Quotient Image (QI) [5] was first introduced by Riklin-Raviv and Shashua; it uses the colour ratio to define an illumination-invariant signature image and enables a rendering of the image space with varying illumination. The Quotient Image algorithm is formulated on the Lambertian model

$$I(x, y) = \rho(x, y)\, n(x, y)^{T} s, \qquad (3)$$
where ρ is the albedo (surface texture) of the face, n(x, y)^T is the surface normal of the object, and s is the point light source direction, which can be arbitrary. The quotient image Q_y of an object y against an object a is then defined by

$$Q_y(u, v) = \frac{\rho_y(u, v)}{\rho_a(u, v)} = \frac{\rho_y(u, v)\, n(u, v)^{T} s_y}{\rho_a(u, v)\, n(u, v)^{T} s_y}. \qquad (4)$$
Let s_1, s_2, s_3 be three linearly independent basis vectors; any point light source direction can then be written as s_y = Σ_j x_j s_j. Hence

$$Q_y(u, v) = \frac{\rho_y(u, v)\, n(u, v)^{T} s_y}{\rho_a(u, v)\, n(u, v)^{T} \sum_j x_j s_j} = \frac{I_y}{\sum_{j=1}^{3} I_j x_j}. \qquad (5)$$
Here, I_y is the image under illumination source direction s_y. If the illumination set {I_j}_{j=1}^{3} in (5) is devised in advance, then given a reference image of object y, the quotient image Q_y can be computed. Then, by taking the product of Q_y with different choices of the I_j, the image space of the reference object y can be synthesized from the set of three linearly independent illumination images of other objects. Through the above method, given a single input image of an object, and a database of images with varying illumination of other objects of the same general class, we can re-render the input image to simulate new illumination conditions using QI.
3 SRC for One Training Image per Person

3.1 Motivation for Synthesizing Virtual Samples
Multi-illumination Face Recognition from a Single Training Image with SRC
675
(1)
(2)
(3)
(4)
(5)
(6) (a)
(b)
Fig. 1. Left: Six groups of training set in our experiment in rows. In the top Row (1), the first 10 consecutive images are selected. In Row (2) we choose every other image, which is the 1st, 3rd, 5th images and so on. Then in Row (6), we select one in every six consecutive images. Right: The corresponding recognition results to the Left.
the training samples. Since all the images in this training set are collected under 64 different illuminations, the lighting condition varies to the consecutive one gradually. To discuss how to design the training set for robustness, we conduct experiments as follows. For each individual, we select 10 images for training, with the others left for testing. We divide our training set into six groups. For the first group, the top 10 consecutive images are selected as the training dictionary. For the second set, we select every other images. And for the third, we select one in every three consecutive pictures and so on. As shown in Figure 1(a), there are six groups of images in rows. And the recognition results are in Figure 1(b) respectively. It can be deduced that, the more diversely illumination varies from each other, the better recognition results it gives. In our method, we make full use of illumination environment information of an illumination record instead of traditional compensation ways. We can generate complete and diversity training atoms from a single input image of an individual by devising a small representative illumination record. Therefore, synthesizing virtual samples could not only satisfy the sparse constraint but also form a discriminant dictionary for the one sample per person problem. 3.2
Our Approach of Recognition with One Training Image per Person
Suppose that there is only one image ysi (the reference image) available for each individual i, firstly we should generate the sufficient and robust training set for each object. Based on the technology of QI, given a database of varying illumination, we could generate new images of another subject of the same condition. Let Bj be a matrix whose columns are the pictures of object j with albedo function ρj . The illumination set {Bj }K j=1 , which contains only K (a small number) other objects, can be elaborately designed manually to ensure the discrimination
676
D. Hu, L. Song, and C. Zhi
capability. From (5), if we know the correct coefficient xj , we can get the quotient image Qyi . In [5], it proves that the energy function K
f (ˆ x) =
1 |Bj x ˆ − αj ysi |2 2 j=1
(6)
has a global minimum x ˆ = x, if the albedo ρy of an object y is rationally spanned by the illumination set. To recover x in (6), it is only a least-squares problem. x) is: The global minima x0 of the energy function f (ˆ x0 =
K
αj vj
(7)
j=1
where K vj = ( Br BrT )−1 Bj ysi
(8)
r=1
and the coefficients αj are determined up to a uniform scale as the solution of the following equation: K αj ysTi ysi − ( αr vr )T Bj ysi = 0 s.t. r=1
αj = K
(9)
j
for j = 1, . . . , K. Then the quotient image Qyi (defined in (4)) is computed by Qyi =
ysi Bx
(10)
where B=
K 1 Bi K i=1
(11)
is the average of illumination set. Then we can approximate which class the given test sample yt belongs to through SRC. For z∈Rn , δi (z)∈Rn is a new vector of z whose entries are zero except for those associated with the ith class. One can approximate the given test sample by yˆi = Aδi (ˆ z ). We then classify yt based on these approximations by assigning it to the object class that minimizes the residual between yt and yˆi : ri (yt ) = yt − Aδi (ˆ z )2
(12)
The whole procedures of our method are depicted in the flowchart of Figure 2. And Algorithm 1 summarizes the complete recognition details in one sample per person problem.
Multi-illumination Face Recognition from a Single Training Image with SRC
677
Ill. Set design
Reference image
Test image
Feature extraction
QI generated
Sparse representation
Result
Feature extraction
Fig. 2. The whole recognition procedures Algorithm 1. Single Face Recognition via Sparse Representation 1: Input: {Bj }K j=1 , the illumination set, where each matrix contains ni images (as its columns). ysi , the reference image of each individual i (a vector of size m). yt , the test image. Align all the images in the Rm space. 2: Step: Solve the equation (9). Compute x in (7) and the quotient image Qyi is generated by (10) and (11). Synthesize the training space: For l = 1, . . . , ni and i = 1, . . . , N , Ai,l = Qyi ⊗ Bl , where Ai,l stands for the lth image of the ith person, Bl is the lth column of matrix B, and ⊗ is the Cartesian product (pixel by pixel multiplication). 3: Step: Stack all the generated matrices in A = [A1 , A2 , . . . , AN ]∈Rm×n , and normalize the columns of A to have the unit l2 norm. Solve the l1 minimization problem (13) zˆ = arg min z1 s.t. Az − yt 2 ε z
where ε is an optional error tolerance. Compute the residuals (12) for i = 1, . . . , N . 4: Output: identity(yt ) = arg mini ri (yt ).
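The classification part of Algorithm 1 can be sketched as below; we substitute scikit-learn's Lasso (an unconstrained l_1 formulation) for the constrained problem (13), so the solver choice and the regularization weight lam are our own assumptions, not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso  # stand-in l1 solver for (13)

def src_classify(A, labels, y_test, lam=0.01):
    # A: m x n dictionary with l2-normalized columns; labels[i] gives the class of column i;
    # y_test: m-vector test image; lam is a hypothetical regularization weight.
    z_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A, y_test).coef_
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta = np.where(labels == c, z_hat, 0.0)            # keep only class-c coefficients
        residuals.append(np.linalg.norm(y_test - A @ delta)) # Eq. (12)
    return classes[int(np.argmin(residuals))]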
4
Experiments
In this section, we conduct a wide range of experiments on publicly available face recognition databases. First, we verify our algorithm on the CMU-PIE database [6] and discuss its robustness in real-world applications. We then extend our method to color images. Finally, we present a cross-database synthesis comparison.
4.1
Gray Scale Images Verification
First we test our algorithm on the CMU-PIE database. It consists of 2,924 frontal-face images of 68 individuals. The cropped and normalized 112 × 92 face images were captured under various laboratory-controlled lighting conditions. There are two folders with different illuminations: we call the folder with the ambient lights off the PIE-illumination package, while the folder with the ambient lights on is named the PIE-light package. Each person is captured under 43 different illumination conditions, of which the PIE-illumination package contains 21 lighting environments. We randomly choose images of 10 individuals as the illumination set in PIE-light. Since there are 22 distinct illuminations, we synthesize images of the same lighting conditions given a reference image of each of the remaining persons. As shown in Figure 3, 22 different images are generated. Here we select the first image of each face as the unique reference image to obtain the quotient image and re-render the others, while the remaining images are all used for testing. In the recognition stage, the feature space dimension was set to 154. Since the precise choice of feature space is no longer critical [3], we simply down-sample the images at a ratio of 1/8. Our method achieves a recognition rate of 95.07%, which is a good result for the single training image problem.
Fig. 3. Virtual samples
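As a side note, the 154-dimensional feature mentioned above corresponds to down-sampling the 112 × 92 faces at roughly a 1/8 ratio per axis (14 × 11 = 154). The following helper is merely one plausible way to realize this and is not taken from the paper.

import numpy as np

def downsample_features(face_112x92):
    # Keep every 8th pixel per axis and crop to 14 x 11 = 154 values (an illustrative assumption).
    return face_112x92[::8, ::8][:14, :11].reshape(-1).astype(np.float64)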
One may argue that the reference image was deliberately well-chosen, since the first image of each object was taken under a favorable illumination condition. However, in real-world tasks, especially in law enforcement scenarios, we often have a record of each suspect with a legible photograph. Therefore this choice of a deliberately designed input image is reasonable. We then verify the algorithm with different input images, as depicted in Figure 4. Here the leave-one-out scheme [7] is used, i.e. each image acts as the template in turn and the others are used for testing. As shown in the figure, the recognition result is influenced by the template we choose, and the recognition rate fluctuates between 82.35% and 95.07%. The templates which give the most and least satisfying performance are presented below. In Figure 5, we can see that the images in columns (b) and (c), which perform worst at a rate around 82.35%, are images with shadows. This is due to the limitation of the Lambertian model formulation, which does not take shadows into account. It can be expected that other techniques, such as 3D morphable models, would improve the performance.
4.2
Recognition with Color Images
The next experiment concerns color images. Given a color image represented by RGB channels, we could directly convert it to a gray scale image.
Fig. 4. Recognition Rate on CMU-PIE (recognition rate plotted against the 22 different test sets)
Fig. 5. Templates. Column (a) has the most favorable illumination, which achieved a recognition rate of 95.07%. Columns (b) and (c) perform worst, at a rate of 82.34%.
The gray scale images conserve all the lightness information, because the lightness of the gray image is directly proportional to the brightness levels of the primaries. We can assume that the varying illumination only affects the gray-value distribution without affecting the hue and saturation components. At the same time, the value channel of the HSV representation, which describes the luminance (black-and-white) information, can be obtained from the RGB color space. As shown in Figure 6, row (a) contains four color images with different illumination directions, row (b) shows the images transformed from RGB into HSV space, and row (c) presents the V channel. This reveals that the value channel in HSV space preserves the luminance information.
Fig. 6. The V channel in HSV color space. There are four illumination conditions in the columns. Row (a) contains four color images in RGB space. Row (b) shows the images transformed from RGB into HSV space. Row (c) presents the V channel.
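Because the V channel of HSV is simply the per-pixel maximum of the R, G and B values, the luminance part used by the method can be extracted directly; the helper below is an illustrative sketch, not code from the paper.

import numpy as np

def value_channel(rgb):
    # rgb: H x W x 3 array; the HSV value channel is the per-pixel maximum over R, G, B.
    return rgb.max(axis=2)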
To illustrate how this works in recognition, we use the AR database. It is made up of over 4,000 frontal images of 126 individuals. For each person, 26 pictures are taken in two separate sessions, named Session 1 and Session 2, of which 8 images are faces with only illumination change [8]. In our trial, we use the 4 illumination conditions of 10 arbitrary individuals in Session 1 as the illumination set. Then we randomly select one of the 8 images of each of the other individuals for training, and process them as described in Figure 7. Subsequently, we recognize the remaining faces among the 8 images of each individual after down-sampling all images to 154 features. The method achieves a recognition rate between 78.75% and 81.48%, even though we have only 4 training lighting conditions for each person.
4.3
A Cross-Database Comparison
So far we have experimented with objects and their images from the same database. Even though the objects we choose for training and testing are outside the illumination set, we still have the advantage that the images are taken by the same camera, in the same laboratory-controlled environment. Here we explicitly test our algorithm under cross-dictionary conditions. Specifically, in our
Fig. 7. Color image synthesis procedure

Table 1. Cross database performance

Ill. set          No. of Ills.   Reference image for training      Rec. rate
Extended Yale B   30             20th face of PIE-illumination     76.68%
Extended Yale B   20             20th face of PIE-illumination     72.41%
Extended Yale B   30             1st face of PIE-light             98.67%
Extended Yale B   20             1st face of PIE-light             98.18%
experiment, a cross-database test means that the illumination record in the set does not come from the same database as the images to be recognized. In real-world scenarios, we often have possession of the illumination set in advance, but the images for training and identification usually come from a totally different collection environment. As shown in Table 1, in an attempt to generate an incoherent dictionary [9] in the language of signal representation, we select every other image of 10 random persons in Extended Yale B as the illumination set. Then all the faces in PIE-illumination are used for testing. The 20th face of each person is the reference image used to generate training atoms, and the others are test cases. First, we pre-align all the images in both databases to 112 × 92 pixels; the down-sampling rate is again 1/8, i.e., the feature space dimension is 154. As a result, the recognition rate is 76.68%. We then decrease the number of illumination conditions to 20, and the rate is around 72.41%. Finally, images in PIE-light are used for testing instead of PIE-illumination with the same procedure, and the corresponding performance is recorded in the table. We can draw several observations from the table. First of all, our method can be used for the one sample per person problem in real-world scenarios where the images in the illumination set are not collected in the same environment as the images to be classified. In addition, illumination diversity of the training set improves the performance. Last but not least, the results give some hints for choosing the reference image, since images in PIE-light clearly perform much better than those in PIE-illumination. This may be related to the ambient light, which makes the illumination distribution more uniform. In practice, we can control the environment conditions of the reference image and apply some illumination compensation to it.
5
Conclusions
In this paper, we present a new method for one sample per person recognition based on sparse representation, which makes use of the illumination diversity information instead of compensation pre-processing. This method could not only satisfy the over-complete constraint in the one training image per person problem, but also make the training set discriminant. Our experiments on public databases achieved good performances and a cross-database comparison was put forward. In the future, we will try to improve the virtual sample generation method with other techniques by taking pose variation into consideration. At the same time, details of dictionary design will be explored further as well. Acknowledgement. This work was supported in part by NSFC (No.60702044, No.60632040, No.60625103), MIIT of China (No.2010ZX03004-003) and 973 Program of China (No.2010CB731401, No.2010CB731406).
References
1. Tan, X.: Face recognition from a single image per person: A survey. Pattern Recognition 39, 1725–1745 (2006)
2. Wright, J.: Sparse representations for computer vision and pattern recognition. Proceedings of the IEEE (2009)
3. Wright, J.: Robust face recognition via sparse representation. IEEE Trans. on Pattern Analysis and Machine Intelligence (2008)
4. Georghiades, A.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. on Pattern Analysis and Machine Intelligence 23, 643–660 (2001)
5. Shashua, A.: The quotient image: Class-based re-rendering and recognition with varying illuminations. IEEE Trans. on Pattern Analysis and Machine Intelligence 23, 129–139 (2001)
6. Sim, T.: The CMU Pose, Illumination, and Expression (PIE) Database. In: The 5th International Conference on Automatic Face and Gesture Recognition (2002)
7. Wang, H.: Face Recognition under Varying Lighting Condition Using Self Quotient Image. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, vol. 54, pp. 819–824 (2004)
8. Martinez, A.: The AR face database. CVC Tech. Report (1998)
9. Donoho, D.: For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics 59, 797–829 (2006)
Human Detection in Video over Large Viewpoint Changes
Genquan Duan¹, Haizhou Ai¹, and Shihong Lao²
¹ Computer Science & Technology Department, Tsinghua University, Beijing, China
[email protected]
² Core Technology Center, Omron Corporation, Kyoto, Japan
[email protected]
Abstract. In this paper, we aim to detect humans in video over large viewpoint changes, which is very challenging due to the diversity of human appearance and motion across a wide range of viewpoints compared with a common frontal viewpoint. We propose 1) a new feature called the Intra-frame and Inter-frame Comparison Feature to combine both appearance and motion information, 2) an Enhanced Multiple Clusters Boost algorithm to co-cluster the samples of various viewpoints and discriminative features automatically, and 3) a Multiple Video Sampling strategy to make the approach robust to human motion and frame rate changes. Due to the large number of samples and features, we propose a two-stage tree structure detector, using only appearance in the 1st stage and both appearance and motion in the 2nd stage. Our approach is evaluated on some challenging real-world scenes, the PETS2007 dataset, the ETHZ dataset and our own collected videos, which demonstrates the effectiveness and efficiency of our approach.
1
Introduction
Human detection and tracking have been intensively researched in computer vision in recent years due to their wide range of potential applications in real-world tasks such as driver-assistance systems and visual surveillance systems, in which real-time and high-accuracy performance is required. In this paper, our problem is to detect humans in video over large viewpoint changes, as shown in Fig. 1. It is very challenging due to the following issues: 1) a large variation in human appearance over wide viewpoint changes is caused by different poses, positions and views; 2) lighting changes and background clutter as well as occlusions make human detection much harder; 3) human motion is diverse, since any direction of movement is possible and the shape and size of a human in video may change as he moves; 4) video frame rates are often different; 5) the camera is sometimes moving. The basic idea for human detection in video is to combine both appearance and motion information. There are mainly three difficult problems to be explored in this detection task: 1) How to combine both appearance and motion information to generate discriminative features? 2) How to train a detector to cover such
Fig. 1. Human detection in video over large viewpoint changes. Samples of three typical viewpoints and corresponding scenes are given.
a large variation in human appearance and motion over wide viewpoint changes? 3) How to deal with the changes in video frame rate or abrupt motion if using motion features on several consecutive frames? Viola and Jones [1] first made use of appearance and motion information in object detection, where they trained AdaBoosted classifiers with Harr features on two consecutive frames. Later Jones and Snow [2] extended this work by proposing appearance filter, difference filter and shifted difference filter on 10 consecutive frames and using predefined several categories of samples. The approaches in [1][2] can solve the 1st problem, but still face the challenge of the 3rd problem. The approach in [2] can handle the 2nd problem to some extent but as even human himself sometimes cannot tell which predefined category a moving object belongs to and thus its application will be limited, while the approach in [1] trains detectors by mixing all positives together. Dalal et al. [3] combined HOG descriptors and some motionbased descriptors together to detect humans with possibly moving cameras and backgrounds. Wojek et al. [4] proposed to combine multiple and complementary feature types and incorporate motion information for human detection, which coped with moving camera and clustered background well and achieved promising results on humans with a common frontal viewpoint. In this paper, our aim is to design a novel feature to take advantages of both appearance and motion information, and to propose an efficient learning algorithm to learn a practical detector of rational structure even when the samples are tremendously diverse for handling the difficulties mentioned above in one framework. The rest of this paper is organized as follows. Related work is introduced in Sec. 2. The proposed feature (I 2 CF ), the co-cluster algorithm (EMC-Boost) and the sampling strategy (MVS) are given in Sec. 3, Sec. 4 and Sec. 5 respectively and they are integrated to handle human detection in video in Sec. 6. Some experiments and conclusions are given in Sec. 7 and the last section respectively.
2
Related Work
In literature, human detection in video can be divided roughly into four categories. 1) Detection in static images as [5][6][7]. APCF [5], HOG [6] and Edgelet [7] are defined on appearance only. APCF compares colors or gradient orientations of two squares in images that can describe the invariance of color and gradient of an object to some extent. HOG computes oriented gradient distribution in a
rectangle image window. An edgelet is a short segment of line or curve, which is predefined based on prior knowledge. 2) Detection over videos as [1][2]. Both of them are already mentioned in the previous section. 3) Object tracking as [8][9]. Some need manual initializations as in [8], and some are with the aid of detection as in [9]. 4) Detecting events or human behaviors. 3D volumetric features [10] are designed for event detection, which can be 3D harr like features. ST-patch [11] is used for detecting behaviors. Inspired by those works, we propose Intra-frame and Inter-frame Comparison Features (I 2 CF s) to combine appearance and motion information. Due to the large variation in human appearance over wide viewpoint changes, it is impossible to train a usable detector by taking the sample space as a whole. The solution is, divide and conquer, to cluster the sample space into some subspaces during training. A subspace can be dealt with as one class and the difficulty exists mainly in clustering the sample space. An efficient way is to cluster sample space automatically like in [12][13][14]. Clustered Boosting Tree (CBT) [14] splits sample space automatically by the already learned discriminative features during training process for pedestrian detection. Mixture of Experts [12] (MoE) jointly learns multiple classifiers and data partitions. It emphasizes much on local experts and is suitable when input data can be naturally divided into homogeneous subsets, which is even impossible for a fixed viewpoint of human as shown in Fig. 1. MC-Boost [13] co-clusters images and visual features by simultaneously learning image clusters and boosting classifiers. Risk map, defined on pixel level distance between samples, is also used to reduce the searching space of the weak classifiers in [13]. To solve our problem, we propose an Enhanced Multiple Clusters Boost (EMC-Boost) algorithm to co-cluster the sample space and discriminative features automatically which combines the benefits of Cascade [15], CBT [14] and MC-Boost [13]. The selection of EMC-Boost instead of MC-Boost is discussed in Sec. 7. Our contributions are summarized in four folds: 1) Intra-frame and Interframe Comparison Features (I 2 CF s) are proposed to combine appearance and motion information for human detection in video over large viewpoint changes; 2) An Enhanced Multiple Clusters Boost (EMC-Boost) algorithm is proposed to co-cluster the sample space and discriminative features automatically; 3) A Multiple Video Sampling (MVS) strategy is used to make our approach robust to human motion and video frame rate changes; 4) A two-stage tree structure detector is presented to fully mine the discriminative features of the appearance and motion information. The experiments in challenging real-world scenes show that our approach is robust to human motion and frame rate changes.
3
Intra-frame and Inter-frame Comparison Features
3.1
Granular Space
Our proposed discriminative feature is defined in the granular space [16]. A granule is a square window patch defined on gray images, represented as a triplet g(x, y, s), where (x, y) is the position and s is the scale. For instance, g(x, y, s) indicates that the size of this granule is 2^s × 2^s and its top-left corner is at position (x, y) of an image. In an image I, it is calculated as

g(x, y, s) = \frac{1}{2^s \times 2^s} \sum_{j=0}^{2^s - 1} \sum_{k=0}^{2^s - 1} I(x + k, y + j).    (1)
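Since a granule is just the mean gray value of a 2^s × 2^s square, Eq. (1) can be evaluated for all positions at once with an integral image; the sketch below is our own illustration, not the authors' implementation.

import numpy as np

def granule_table(I, s):
    # Average gray value of every 2^s x 2^s granule, using an integral image so that
    # each granule costs only four lookups. I is an H x W gray image.
    k = 2 ** s
    ii = np.zeros((I.shape[0] + 1, I.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(I, axis=0), axis=1)
    sums = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]  # sums over all k x k windows
    return sums / (k * k)                                         # g(x, y, s) for every top-left (x, y)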
s is set to 0, 1, 2 or 3 in this paper, and the four typical granule scales are shown in Fig. 2 (a). In order to calculate the distance between two granules, the granular space G is mapped into a 3D space: each element g ∈ G is mapped to a point γ with g(x, y, s) → γ(x + 2^{s-1}, y + 2^{s-1}, 2^s). The distance between two granules in G is defined to be the Euclidean distance between the two corresponding points, d(g_1, g_2) = d(γ_1, γ_2), where g_1, g_2 ∈ G and γ_1, γ_2 are the corresponding points. For γ_1(x_1, y_1, z_1) and γ_2(x_2, y_2, z_2),

d(γ_1, γ_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}.    (2)

3.2
Intra-frame and Inter-frame Comparison Features (I²CFs)
Similar to the approach in [1], we consider two frames at a time, the previous one and the latter one, from which two pairs of granules are extracted to fully capture the appearance and motion features of an object. An I²CF can be represented as a five-tuple c = (mode, g_1^i, g_1^j, g_2^i, g_2^j), which is also called a cell according to [5]. The mode is Appearance mode, Difference mode or Consistent mode, and g_1^i, g_1^j, g_2^i and g_2^j are four granules. The first pair of granules, g_1^i and g_1^j, comes from the previous frame to describe the appearance of an object. The second pair of granules, g_2^i and g_2^j, comes from either the previous or the latter frame to describe either appearance or motion information. When the second pair comes from the previous frame, which means that both pairs are from the previous frame, this kind of feature is an Intra-frame Comparison Feature (Intra-frame CF); when the second pair comes from the latter frame, the feature becomes an Inter-frame Comparison Feature (Inter-frame CF). Both kinds of comparison features are combined into the Intra-frame and Inter-frame Comparison Feature (I²CF).

Appearance mode (A-mode). Pairing Comparison of Color features (PCC) have been shown to be simple, fast and efficient in [5]. As PCC can describe the invariance of color to some extent, we extend this idea to 3D space. A-mode compares two pairs of granules simultaneously:

f_A(g_1^i, g_1^j, g_2^i, g_2^j) = (g_1^i \ge g_1^j) && (g_2^i \ge g_2^j).    (3)

A PCC feature is a special case of A-mode: f_A(g_1^i, g_1^j, g_2^i, g_2^j) = g_1^i \ge g_1^j when g_1^i = g_2^i and g_1^j = g_2^j.

Difference mode (D-mode). D-mode compares the absolute differences of the two pairs of granules, defined as:

f_D(g_1^i, g_1^j, g_2^i, g_2^j) = |g_1^i - g_2^i| \ge |g_1^j - g_2^j|.    (4)
Fig. 2. Our proposed I²CF. (a) Granular space with four scales (s = 0, 1, 2, 3) of granules, following [5]. (b) Two granules g1 and g2 connected by a solid line form one pair of granules as used in APCF [5]. (c) Two pairs of granules are used in each cell of an I²CF. The solid line between g1 and g2 (or g3 and g4) means that g1 and g2 (or g3 and g4) come from the same frame. The dashed line connecting g1 and g3 (or g2 and g4) means that the locations of g1 and g3 (or g2 and g4) are related. This relation of locations is shown in (d); for example, g3 is in the neighborhood of g1. This reduces the feature pool considerably but still preserves the discriminative weak features.
The motion filters in [1][2] calculate the difference between one region and a shifted version of it, obtained by moving it up, left, right or down by 1 or 2 pixels in the second frame. There are three main differences between D-mode and those methods: 1) the restriction on the locations of the regions is defined spatially and is much looser; 2) D-mode considers two pairs of regions each time; 3) the only operation of D-mode is a comparison after the subtractions.

Consistent mode (C-mode). C-mode compares the sums of the two pairs of granules to take advantage of consistent information in the appearance of one frame or of successive frames, defined as:

f_C(g_1^i, g_1^j, g_2^i, g_2^j) = (g_1^i + g_2^i) \ge (g_1^j + g_2^j).    (5)

C-mode is much simpler and can be computed quickly compared with 3D volumetric features [10] and spatio-temporal patches [11]. An I²CF of length n is represented as {c_0, c_1, ..., c_{n-1}} and its feature value is defined as the binary concatenation of the corresponding cell functions in reverse order, f_{I²CF} = [b_{n-1} b_{n-2} ... b_1 b_0], where b_k = f(mode, g_1^i, g_1^j, g_2^i, g_2^j) for 0 \le k < n and

f(mode, g_1^i, g_1^j, g_2^i, g_2^j) =
  f_A(g_1^i, g_1^j, g_2^i, g_2^j)  if mode = A,
  f_D(g_1^i, g_1^j, g_2^i, g_2^j)  if mode = D,
  f_C(g_1^i, g_1^j, g_2^i, g_2^j)  if mode = C.    (6)
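The three cell modes (3)-(5) and the binary concatenation (6) can be sketched as follows; the way granule averages are looked up (dictionaries keyed by granule position/scale, plus a frame tag for the second pair) is an assumption made only for illustration.

def cell_value(mode, g1i, g1j, g2i, g2j):
    # One cell of an I2CF (Eqs. 3-5); inputs are granule averages (scalars).
    if mode == 'A':   # appearance: compare both pairs
        return int(g1i >= g1j and g2i >= g2j)
    if mode == 'D':   # difference: compare absolute inter-frame differences
        return int(abs(g1i - g2i) >= abs(g1j - g2j))
    return int((g1i + g2i) >= (g1j + g2j))  # 'C': consistent mode

def i2cf_value(cells, granules_prev, granules_curr):
    # Binary concatenation of cell responses (Eq. 6). Each cell is
    # (mode, p1i, p1j, p2i, p2j, frame2), where the p's index granule averages.
    value = 0
    for k, (mode, p1i, p1j, p2i, p2j, frame2) in enumerate(cells):
        g1i, g1j = granules_prev[p1i], granules_prev[p1j]
        src2 = granules_prev if frame2 == 'prev' else granules_curr
        g2i, g2j = src2[p2i], src2[p2j]
        value |= cell_value(mode, g1i, g1j, g2i, g2j) << k
    return value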
3.3
Heuristic Learning of I²CFs
Feature reduction. For 58 × 58 samples, there are \sum_{s=0}^{3} (58 - 2^s + 1) \times (58 - 2^s + 1) = 12239 granules in total, and the feature pool contains 3 \times 12239^4 \approx 6.7 \times 10^{16} weak features without any restrictions, which makes the training time and memory requirements impractical. With the distance of two granules defined in Sec. 3.1, two effective constraints are introduced into I²CF: 1) motivated by [5], the first pair of granules in an I²CF is constrained by d(g_1^i, g_1^j) \le T_1; 2) considering the consistency within one frame or between two nearby video frames, we constrain the second pair of granules to lie in the neighborhood of the first pair, as shown in Fig. 2 (d):

d(g_1^i, g_2^i) \le T_2,   d(g_1^j, g_2^j) \le T_2.    (7)

We set T_1 = 8, T_2 = 4 in our experiments.

Table 1. Learning algorithm of I²CF
Input: Sample set S = {(x_i, y_i) | 1 \le i \le m} where y_i = ±1.
Initialize: Cell space (CS) with all possible cells and an empty I²CF.
Output: The learned I²CF.
Loop:
– Learn the first pair of granules as in [5]. Denote the best f pairs as a set F.
– Construct a new set CS': in each cell of CS', the first pair of granules is from F, the second pair of granules is generated by Eq. 7, and its mode is A-mode, D-mode or C-mode. Calculate the Z value of the I²CF extended by each cell in CS'.
– Select the cell with the lowest Z value, denoted as c*. Add c* to the I²CF.
– Refine the I²CF by replacing one or two granules in it without changing the mode.
Heuristic learning. The learning starts with an empty I²CF. Each time we select the most discriminative cell and add it to the I²CF. The discriminability of a weak feature is measured by its Z value, which reflects the classification power of the weak classifier as in [17]:

Z = 2 \sum_{j} \sqrt{W_+^j W_-^j},    (8)

where W_+^j is the weight of the positive samples that fall into the j-th bin and W_-^j is that of the negatives. The smaller the Z value, the more discriminative the weak feature. The learning algorithm of I²CF is summarized in Table 1 (see [5][16] for more details).
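A small sketch of the Z-value computation in (8), assuming each sample has already been assigned to a feature bin; the binning interface is hypothetical.

import numpy as np

def z_value(feature_bins, weights, labels, n_bins):
    # Z = 2 * sum_j sqrt(W+_j * W-_j), where samples are histogrammed by feature value.
    w_pos = np.bincount(feature_bins[labels == 1], weights=weights[labels == 1], minlength=n_bins)
    w_neg = np.bincount(feature_bins[labels == -1], weights=weights[labels == -1], minlength=n_bins)
    return 2.0 * np.sqrt(w_pos * w_neg).sum()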
4
EMC-Boost
We propose the EMC-Boost to co-cluster the sample space and discriminative features automatically. A perceptual clustering problem is shown in Fig. 3 (a)(c). EMC-Boost consists of three components, Cascade Component (CC), Mixed Component (MC) and Separated Component (SC). The three components are combined to become EMC-Boost. In fact, SC is similar to MC-Boost [13], which is the reason that our boosting algorithm is named as EMC-Boost. In the following descriptions, we formulate the three components explicitly first, and then demonstrate the learning algorithms, and summarize EMC-Boost at the end of this section.
Fig. 3. A perceptual clustering problem in (a)-(c) and a general EMC-Boost in (d), where CC, MC and SC are the three components of EMC-Boost
4.1
Three Components CC/MC/SC
CC deals with a standard 2-class classification problem that can be solved by any boosting algorithm. MC and SC deal with K clusters. We formulate the detectors of MC or SC as K strong classifiers, each of which is a linear combination of weak learners, H_k(x) = \sum_t \alpha_{kt} h_{kt}(x), k = 1, ..., K, with a threshold θ_k (default 0). Note that the K classifiers H_k(x), k = 1, ..., K, are the same in MC but with K different thresholds θ_k, i.e. H_1(x) = H_2(x) = ... = H_K(x), whereas they are totally different in SC. We present MC and SC uniformly below. The score y_{ik} of the i-th sample belonging to the k-th cluster is computed as y_{ik} = H_k(x_i) - θ_k. Therefore, the probability of x_i belonging to the k-th cluster is P_{ik}(x) = \frac{1}{1 + e^{-y_{ik}}}. To aggregate all scores of one sample over the K classifiers, we use a Noisy-OR formulation like [18][13]:

P_i(x) = 1 - \prod_{k=1}^{K} (1 - P_{ik}(x_i)).    (9)

The cost function is defined as J = \prod_i P_i^{t_i} (1 - P_i)^{1 - t_i}, where t_i ∈ {0, 1} is the label of the i-th sample; maximizing it is equivalent to maximizing the log-likelihood

\log J = \sum_i t_i \log P_i + (1 - t_i) \log(1 - P_i).    (10)
4.2
Learning Algorithms of CC/MC/SC
The learning algorithm of CC is directly Real AdaBoost [17]. The learning algorithms of MC and SC differ from that of CC. At the t-th round of boosting, MC learns weak classifiers that maximize \sum_{k} \sum_{i} w_{ki} h_{kt}(x_i), while SC learns weak classifiers that maximize \sum_{i} w_{ki} h_{kt}(x_i). Initially, the sample weights are: 1) for positives, w_{ki} = 1 if x_i ∈ k and w_{ki} = 0 otherwise, where i denotes the i-th sample and k denotes the k-th cluster or classifier; 2) for all negatives we set w_{ki} = 1/K. Following the AnyBoost method [19], we set the sample weights as the derivative of the cost function with respect to the classifier score. The weight of the k-th classifier over the i-th sample is updated by

w_{ki} = \frac{\partial \log J}{\partial y_{ki}} = \frac{t_i - P_i}{P_i} P_{ki}(x_i).    (11)
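The Noisy-OR aggregation (9) and the AnyBoost-style weight update (11) can be sketched as below, assuming the scores H_k(x_i) − θ_k are stored in a K × m array; this layout is our own assumption.

import numpy as np

def update_weights(scores, t):
    # scores[k, i] = H_k(x_i) - theta_k; t[i] in {0, 1} is the sample label.
    P_ki = 1.0 / (1.0 + np.exp(-scores))               # per-cluster probabilities
    P_i = 1.0 - np.prod(1.0 - P_ki, axis=0)            # Noisy-OR over the K clusters (Eq. 9)
    w = (t - P_i) / np.clip(P_i, 1e-12, None) * P_ki   # Eq. 11, broadcast over the K clusters
    return w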
We sum up the training algorithms of MC and SC in Table 2.

Table 2. Learning algorithms of MC and SC
Input: Sample set S = {(x_i, y_i) | 1 \le i \le m} where y_i = ±1; detection rate r in each layer.
Output: H_k(x) = \sum_t \alpha_{kt} h_{kt}(x), k = 1, ..., K.
Loop: For t = 1, ..., T
(MC)
– Find weak classifiers h_t (h_{kt} = h_t, k = 1, ..., K) that maximize \sum_{k=1}^{K} \sum_i w_{ki} h_{kt}(x_i).
– Find the weak-learner weights α_{kt} (k = 1, ..., K) that maximize Γ(H + α_{kt} h_{kt}).
– Update the weights by Eq. 11.
(SC) For k = 1, ..., K
– Find weak classifiers h_{kt} that maximize \sum_i w_{ki} h_{kt}(x_i).
– Find the weak-learner weights α_{kt} that maximize Γ(H + α_{kt} h_{kt}).
– Update the weights by Eq. 11.
Update the thresholds θ_k (k = 1, ..., K) to satisfy the detection rate r.
4.3
A General EMC-Boost
The three components of EMC-Boost have different properties. CC takes all samples as one cluster, while MC or SC considers the whole sample space to consist of multiple clusters. CC tends to distinguish positives from negatives, while SC tends to cluster the sample space; MC can do both at the same time, but not as accurately as CC in classifying positives from negatives or as SC in clustering. Compared with SC, one particular advantage of MC is sharing weak features among all clusters. We combine these three components into a general EMC-Boost as shown in Fig. 3 (d), which contains five steps (note that Step 2 is similar to CBT [14]):
Step 1. CC learns a classifier for all samples considered as one category.
Step 2. The K-means algorithm clusters the sample space with the learned weak features.
Step 3. MC clusters the sample space coarsely.
Step 4. SC clusters the sample space further.
Step 5. CC learns a classifier for each cluster center.
5
Multiple Video Sampling
In order to deal with the change in video frame rate or abrupt motion problem, we introduce a Multiple Video Sampling (MVS) strategy as illustrated in Fig. 4 (a) and (b). Considering five consecutive frames in (a), a positive sample is made up of two frames, where one is the first frame and the other is from the next four frames as shown in (b). In other words, one annotation corresponds to five consecutive frames and generates 4 positives. Some more positives are shown in Fig. 4 (c). Suppose that the original frame rate is R and the used positives consist of the 1st and the rth frames (r > 1), then the possible frame rate covered by MVS strategy is R/(r − 1). If these positives are extracted from 30 fps videos, the trained detector is able to deal with 30fps(30/1),15fps(30/2),10fps(30/3) and 7.5fps(30/4) videos where r is 2, 3, 4 and 5 respectively.
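A minimal sketch of the MVS pairing, assuming frames is a list of decoded video frames and start is the index of an annotated frame; this is illustrative only.

def mvs_pairs(frames, start):
    # Pair the annotated frame with each of the next four frames,
    # yielding up to four positives per annotation (r = 2..5).
    return [(frames[start], frames[start + r - 1])
            for r in range(2, 6) if start + r - 1 < len(frames)]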
Fig. 4. The MVS strategy and some positive samples

Fig. 5. Two-Stage Tree Structure in (a) and an example in (b). The number in the box gives the percentage of samples belonging to that branch.
6
Overview of Our Approach
We adopt EMC-Boost selecting I 2 CF as weak features to learn a strong classifier for multiple viewpoint human detection, in which positive samples are achieved through MVS strategy. Due to the large amount of samples and features, it is difficult to learn a detector directly by a general EMC-Boost. We modify the detector structure slightly and propose a new structure containing two stages as shown in Fig. 5 (a) with an example in (b), which is called two-stage tree structure: In the 1st stage, it only uses appearance information for learning and clustering; In the 2nd stage, it uses both appearance and motion information for clustering first, and then for learning classifiers for all clusters.
7
Experiments
We carry out some experiments to evaluate our approach by False Positives Per Image (FPPI) on several challenging real-world datasets: ETHZ, PETS2007 and our own collected dataset. When the intersection between a detection response and a ground-truth box is larger than 50% of their union, we consider it to be a successful detection. Only one detection per annotation is counted as correct. For simplicity, the three typical viewpoints mentioned in Fig. 1 are represented as Horizontal Viewpoint (HV), Slant Viewpoint (SV) and Vertical Viewpoint (VV), in turn from left to right.
Datasets. The datasets used in the experiments are the ETHZ dataset [20], the PETS2007 dataset [21] and our own collected dataset. The ETHZ dataset provides four video sequences, Seq.#0∼Seq.#3 (640×480 pixels at 15 fps). This dataset,
whose viewpoint is near HV, is recorded using a pair of cameras, and we only use the images provided by the left camera. The PETS2007 dataset contains 9 sequences S00∼S08 (720×576 pixels at 30 fps); each sequence has 4 fixed cameras and we choose the 3rd camera, whose viewpoint is near SV. There are 3 scenarios in PETS2007 with increasing scene complexity: loitering (S01 and S02), attended luggage removal (S03, S04, S05 and S06) and unattended luggage (S07 and S08). In the experiments, we use S01, S03, S05 and S06. In addition, we have collected several sequences by hand-held DV cameras: 2 sequences near HV (853×480 pixels at 30 fps), 2 sequences near SV (1280×720 pixels at 30 fps) and 8 sequences near VV (2 sequences are 1280×720 pixels at 30 fps and the others are 704×576 pixels at 30 fps).
Training and testing datasets. S01, S03, S05 and S06 of PETS2007 and our own dataset are labeled manually every five frames for training and testing, while the ETHZ dataset already provides the ground truth. The training datasets contain Seq.#0 of ETHZ, S01, S03 and S06 of PETS2007, and 2 sequences (near HV), 2 sequences (near SV) and 6 sequences (near VV) of ours. The testing datasets contain Seq.#1, Seq.#2 and Seq.#3 of ETHZ, S05 of PETS2007, and 2 sequences of ours (near VV). Note that the ground truths of the internal unlabeled frames in the testing datasets are obtained through interpolation. The properties of the testing datasets may have an impact on all detectors, such as camera motion state (fixed or moving), illumination conditions (slight or significant light changes), etc. Details about the testing datasets are summarized in Table 3.
Training detectors. We have labeled 11768 different humans in total and obtain 47072 positives after MVS, where the numbers of positives near HV, SV and VV are 18976, 19248 and 8848 respectively. The size of positives is normalized to 58 × 58. Some positives are shown in Fig. 4. We train a detector based on EMC-Boost selecting I²CF as features.
Implementation details. We cluster the sample space into 2 clusters in the 1st stage and cluster the two sub-spaces into 2 and 3 clusters separately in the 2nd stage, as illustrated in Fig. 5 (b). When do we start and stop MC/SC? When the false positive rate is less than 10^{-2} (10^{-4}) in the 1st (2nd) stage, we start MC, and then start SC after learning by MC. Before describing when to stop MC or SC, we first define transferred samples.
Table 3. Some details about the testing datasets
Description     Seq. 1     Seq. 2     S05        Seq.#1     Seq.#2     Seq.#3
Source          ours       ours       PETS2007   ETHZ       ETHZ       ETHZ
Camera          Fixed      Fixed      Fixed      Moving     Moving     Moving
Light changes   Slightly   Slightly   Slightly   Slightly   Slightly   Significantly
Frame rate      30fps      30fps      30fps      15fps      15fps      15fps
Size            704×576    704×576    720×576    640×480    640×480    640×480
Frames          2420       1781       4500       999        450        354
Annotations     591        1927       17067      5193       2359       1828
Fig. 6. Evaluation of our approach and some results. ROC curves (Recall vs. FPPI) are shown for Test Sequence 1 (ours, 2420 frames, 591 annotations), Test Sequence 2 (ours, 1718 frames, 1927 annotations), S05 (PETS2007, 4500 frames, 17067 annotations), Seq. #1 (ETHZ, 999 frames, 5193 annotations), Seq. #2 (ETHZ, 450 frames, 2359 annotations) and Seq. #3 (ETHZ, 354 frames, 1828 annotations), comparing Intra-frame CF + Boost, Intra-frame CF + EMC-Boost and I²CF + EMC-Boost (and, on ETHZ, Ess et al., Schwartz et al. and Wojek et al. (HOG, IMHwd and HIKSVM)).
A sample is called transferred if it belongs to another cluster after the current round of boosting. We stop MC (SC) when the number of transferred samples is less than 10% (2%) of the total number of samples.
Evaluation. To compare with our approach (denoted as I²CF + EMC-Boost), two other detectors are trained: one adopts Intra-frame CF learned by a general Boost algorithm like [5][15] (denoted as Intra-frame CF + Boost) and the other adopts Intra-frame CF learned by EMC-Boost (denoted as Intra-frame CF + EMC-Boost). Note that due to the large number of Inter-frame CFs, the large number of positives and limited memory, it is impractical to learn a detector of Inter-frame CF by Boost or by EMC-Boost alone. We compare our approach with the Intra-frame CF + Boost and Intra-frame CF + EMC-Boost approaches on the PETS2007 dataset and our own collected videos, and also with [4][20][22] on the ETHZ dataset. We give the ROC curves and some results in Fig. 6. In general, our proposed approach, which integrates appearance and motion information, is superior to the Intra-frame CF + Boost and Intra-frame CF + EMC-Boost approaches, which only use appearance information. From another viewpoint, this experiment also indicates that incorporating motion information improves detection significantly, as in [4]. Compared to [20][22], our approach is better. Furthermore, we have not used any additional cues like depth maps, ground-plane estimation and occlusion reasoning, which are used in [20]. "HOG, IMHwd and HIKSVM", proposed in [4], combines the HOG feature [18] and the Internal Motion Histogram wavelet difference (IMHwd) descriptor [3] using a histogram intersection kernel SVM (HIKSVM) and achieves better results than the other approaches evaluated in [4]. Our approach is not as good as, but comparable to, "HOG, IMHwd and HIKSVM", and we argue that our approach is much simpler and faster. Currently, our approach takes on average about 0.29 s per frame on ETHZ, 0.61 s on PETS2007 and 0.55 s on our dataset.
Fig. 7. The robustness evaluation of our approach to video frame rate changes. ROC curves (Recall vs. FPPI) are shown for Test Sequence 1 and Test Sequence 2 of ours, S05 of PETS2007, and Seq.#1, Seq.#2 and Seq.#3 of ETHZ. 1 represents the original frame rate; 1/2 represents that the testing dataset is downsampled to 1/2 of the original frame rate, and similarly for 1/3, 1/4 and 1/5.
In order to evaluate the robustness of our approach to video frame rate changes, we downsample the videos to their 1/2, 1/3, 1/4 and 1/5 original frame rate to compare with the original frame rate. Our approach is evaluated on the testing datasets with frame rate changes, and the ROC curves are shown in Fig. 7. The results on the three sequences of ETHZ dataset and S05 of PETS2007 are similar, but the results on our collected two sequences differ a lot. The main reason is that: 1) as frame rates get lower, human motion changes more abruptly; 2) note that human motion in near VV videos changes more fiercely than that in near HV or SV ones. Our collected two sequences are near VV, so relatively human motion changes even more drastically than the other testing datasets. Generally speaking, our approach is robust to the video frame rate change to a certain extent. Discussions. We then discuss about the selection of EMC-Boost instead of MC-Boost and a general Boost algorithm. MC-Boost runs several classifiers together at any time, which may be good at classification problems, but not at detection problems. Considering the sharing feature at the beginning of clustering and the good clusters after clustering, we propose more suitable learning algorithms, MC and SC, for detection problems. Furthermore, risk map is an essential part of MC-Boost, in which the risk of one sample is related to predefined neighbors in the same class and in the opposite. But a proper neighborhood definition itself might be a tough question. For preserving the merits of MC-Boost and avoiding its shortcomings, we argue that EMC-Boost is more suitable for a detection problem. A general Boost algorithm can work well when the sample space has fewer variations, while EMC-Boost is designed to cluster the space into several sub ones which can make the learning process faster. But after clustering, a sample is considered as correct if it belongs to any sub space and thus it may also bring with more negatives. This may be the reason that Intra-frame CF+EMC-Boost
is inferior to Intra-frame CF+Boost in Fig. 6. Mainly considering the cluster ability, we choose EMC-Boost other than a general Boost algorithm. In fact, it is impractical to learn a detector of I 2 CF +Boost without clustering because of the large amount of weak features and positives.
8
Conclusion
In this paper, we propose Intra-frame and Inter-frame Comparison Features (I 2 CF s), Enhanced Multiple Clusters Boost algorithm (EMC-Boost), Multiple Video Sampling (MVS) strategy and a two-stage tree structure detector to detect human in video over large viewpoint change. I 2 CF s combine appearance information and motion information automatically. EMC-Boost can cluster a sample space quickly and efficiently. MVS strategy makes our approach robust to frame rate and human motion. The final detector is organized as a two-stage tree structure to fully mine the discriminative features of the appearance and motion information. The evaluations on challenging datasets show the efficiency of our approach. There are some future works to be done. The large feature pool causes lots of difficulties during training. One work is to design a more efficient learning algorithm. MVS strategy can make the approach more robust to frame rate, but not handle arbitrary frame rates once a detector is learned. To achieve better results, another work may integrate object detection in video with object detection in static images and object tracking. An interesting question to EMCBoost is what kind of clusters EMC-Boost can obtain. Take human for example. Different poses, views, viewpoints or illumination will make human different and perceptual co-clusters of these samples differ with different criterion. The further relation between the discriminative features and samples is critical to the results. Another work may study the relations among features, objects and EMC-Boost. Our approach can also be applied to other object detection, multiple objects detection or object category as well. Acknowledgement. This work is supported by National Science Foundation of China under grant No.61075026, and it is also supported by a grant from Omron Corporation.
References 1. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: IEEE International Conference on Computer Vision, ICCV (2003) 2. Jones, M., Snow, D.: Pedestrian detection using boosted features over many frames. In: International Conference on Pattern Recognition (ICPR), Motion, Tracking, Video Analysis (2008) 3. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
4. Wojek, C., Walk, S., Schiele, B.: Multi-cue onboard pedestrian detection. In: CVPR (2009) 5. Duan, G., Huang, C., Ai, H., Lao, S.: Boosting associated pairing comparison features for pedestrian detection. In: 9th Workshop on Visual Surveillance (2009) 6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) 7. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In: ICCV (2005) 8. Yang, M., Yuan, J., Wu, Y.: Spatial selection for attentional visual tracking. In: CVPR (2007) 9. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and peopledetection-by-tracking. In: CVPR (2008) 10. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV (2005) 11. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: CVPR (2005) 12. Jordan, M., Jacobs, R.: Hierarchical mixture of experts and the em algorithm. Neural Computation 6, 181–214 (1994) 13. Kim, T.K., Cipolla, R.: Mcboost: Multiple classifier boosting for perceptual coclustering of images and visual features. In: Advances in Neural Information Processing Systems, NIPS (2008) 14. Wu, B., Nevatia, R.: Cluster boosted tree classifier for multi-view, multi-pose object detection. In: ICCV (2007) 15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001) 16. Huang, C., Ai, H., Li, Y., Lao, S.: Learning sparse features in granular space for multi-view face detection. In: IEEE International Conference, Automatic Face and Gesture Recognition (2006) 17. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 297–336 (1999) 18. Viola, P., Platt, J., Zhang, C.: Multiple instance boosting for object detection. In: NIPS (2005) 19. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent. In: Proc. Advances in Neural Information Processing Systems (2000) 20. Ess, A., Leibe, B., Gool, L.V.: Depth and appearance for mobile scene analysis. In: ICCV (2007) 21. PETS2007, http://www.cvg.rdg.ac.uk/PETS2007/ 22. Schwartz, W.R., Kembhavi, A., Harwood, D., Davis, L.S.: Human detection using partial least squares analysis. In: ICCV (2009)
Adaptive Parameter Selection for Image Segmentation Based on Similarity Estimation of Multiple Segmenters
Lucas Franek and Xiaoyi Jiang
Department of Mathematics and Computer Science, University of Münster, Germany
{lucas.franek,xjiang}@uni-muenster.de
Abstract. This paper addresses the parameter selection problem in image segmentation. Mostly, segmentation algorithms have parameters which are usually fixed beforehand by the user. Typically, however, each image has its own optimal set of parameters and in general a fixed parameter setting may result in unsatisfactory segmentations. In this paper we present a novel unsupervised framework for automatically choosing parameters based on a comparison with the results from some reference segmentation algorithm(s). The experimental results show that our framework is even superior to supervised selection method based on ground truth. The proposed framework is not bounded to image segmentation and can be potentially applied to solve the adaptive parameter selection problem in other contexts.
1
Introduction
The parameter selection problem is a fundamental and unsolved problem in image segmentation. Mostly, segmentation algorithms have parameters, which for example control the scale of grouping and are typically fixed by the user. Although some appropriate range of parameter values can be well shared by a large number of images, the optimal parameter setting within that range varies substantially for each individual image. In the literature the parameter selection problem is mostly either not considered at all or solved by learning a fixed optimal parameter setting based on manual ground truth [1]. However, it is a fact that a fixed parameter setting may achieve good performance on average, but cannot reach the best possible segmentation for each individual image. It is thus desirable to have a means of adaptively selecting the optimal parameter setting on a per image basis. Unsupervised parameter selection depends on some kind of self-judgement of segmentation quality, which is not an easy task [2, 3]. In this paper a novel framework of unsupervised automatic parameter selection is proposed. Given the fundamental difficulty of the self-judgement of segmentation quality, we argue that the situation becomes easier when we consider some reference segmentations. A comparison of their results may give us a hint of good segmentations.
The paper is organized as follows. In Section 2 we review shortly different existing approaches to the parameter selection problem. Our novel framework is detailed in Section 3. Experimental results including the comparison with an existing supervised method are shown in Section 4. We conclude in Section 5.
2
Parameter Selection Methods
Parameter selection methods can be classified into supervised and unsupervised approaches. Supervised methods use ground truth for the evaluation of segmentation quality. If ground truth is available, the parameter space can be sampled into a finite number of parameter settings and the segmentation performance can be evaluated for each parameter setting over a set of training images. The authors of [1] propose a multi-locus hill climbing scheme on a coarsely sampled parameter space. As an alternative, genetic search has been chosen in [4] to find the optimal parameter setting. Supervised approaches are appropriate to find the best fixed parameter setting. However, they are applicable only in case of availability of ground truth, which causes considerable amount of manual work. In addition, the learned fixed parameter set allows good performance in average only, but cannot reach the best possible segmentation on a per image basis. Unsupervised parameter selection methods do not use ground truth, but instead are based on heuristics to measure segmentation quality. The authors of [5] design an application-specific segmentation quality function and a search algorithm is applied to find the optimal parameter setting. A broad class of unsupervised algorithms use characteristics such as intra-region uniformity or inter-region contrast; see [2, 3] for a review. They evaluate a segmented image based on how well a set of characteristics is optimized. In [3] it is shown that segmentations resulting from optimizing such characteristics do not necessarily correspond to segmentations perceived by humans. Another unsupervised approach to tackling the parameter selection problem is proposed in [6] by an ensemble technique. A segmentation ensemble is produced by sampling a reasonable parameter subspace of a segmentation algorithm and a combination solution is then computed from the ensemble, which is supposed to be better than the most input segmentations. The method proposed in this work is related to the stability-based approach proposed in [7]. There the authors consider the notion of stability of clusterings/segmentations and observe that the number of stable segmentations is much smaller than the total number of possible segmentations. Several parameter settings leading to stable results are returned by the method and it is assumed that stable segmentations are good ones. In order to check for stability the data has to be perturbed many times by adding noise and re-clustered in a further step. Similar to [7] we argue in our work that stable segmentations are good ones but the realisation of this observation is quite different. In our work a reference segmenter is introduced to detect stable segmentations and there is no need for perturbing and re-clustering the data. In this work we propose a novel framework for automatic parameter selection, which is fully unsupervised and does not need any ground truth for training. Our
Fig. 1. Left: Image from Berkeley dataset [8]. Right: its ground truth segmentation.
framework is based on comparison of segmentations generated by different segmenters. To estimate the parameters for one segmenter a reference segmenter is introduced which also has an unknown optimal parameter setting. The fundamental idea behind our work (to be detailed in the next section) is the observation that different segmentation algorithms generate different bad segmentations, but similar good segmentations. It should be emphasized that in contrast to existing unsupervised methods no segmentation quality function and no segmentation characteristics are used to estimate the optimal parameter setting.
3
Framework for Unsupervised Parameter Selection
In this section we explain the fundamental idea of our framework. Then, details are given to show how the framework can be further concretized.
3.1
Proposed Framework
Our framework is based on the comparison of segmentations produced by different segmentation algorithms. Our basic observation is that different segmenters tend to produce similar good segmentations, but dissimilar bad segmentations. Our observation is reasonable because every segmenter is based on different methodologies, therefore it may be assumed that over- and under-segmentations of different segmenters look differently and potential misplacement of segment contours is different. Another interpretation is related to the space of all segmentations of an image. Generally, the subspace of bad segmentations is substantially larger than the subspace of good segmentations. Therefore, the segmentation diversity can be expected to appear in the subspace of bad segmentations. Based on our observation we propose to compare segmentation results of different segmenters and to figure out good segmentations by means of similarity tests. Furthermore, the notion of stability [7] is adjusted: A segmentation is denoted as stable if a reference segmenter yields a similar segmentation. Illustrative example. Consider the image from Berkeley dataset [8] and its ground truth segmentation in Fig. 1. We use two segmentation algorithms for this illustrative example: the graph-based image segmentation algorithm FH [9]
(a) Segmentations generated by FH (NMI = 0.54, 0.72, 0.67, 0.70, 0.74, 0.70, 0.76, 0.74, 0.76, 0.76 for parameter settings 1-10). (b) Segmentations generated by UCM (NMI = 0.52, 0.71, 0.78, 0.81, 0.82, 0.80, 0.80, 0.74, 0.72, 0.72).
Fig. 2. Segmentations for each algorithm were obtained by varying parameters
and the segmenter UCM [10] based on the ultrametric contour map. For each segmenter 10 segmentations were generated by varying parameters (Fig. 2a,b). The segmentation algorithms and parameter settings will be detailed in Section 3.2. For performance evaluation purpose the normalized mutual information (NMI) (to be defined in Section 3.3) is used. The performance curves are shown in Fig. 3a. To obtain good parameters for FH we use UCM as reference segmenter. Every segmentation result of FH is compared with every segmentation result of UCM by using NMI as similarity measure, resulting in the similarity matrix shown in Fig. 3b. This similarity matrix indicates which pair of FH and UCM segmentations is similar or dissimilar. The maximal similarity is detected in column 9 and row 5, which corresponds to the ninth FH-segmentation and the fifth UCM-segmentation. And in fact, this segmentation for FH has the highest NMI value when compared with ground truth; see Fig. 3a. Note that the columns and rows in the similarity matrix deliver information about the first and second segmenter in the comparison, respectively. In our
Fig. 3. (a) Performance evaluation of segmentations compared with ground truth (performance measure vs. parameter setting, for FH and UCM). (b) Similarity matrix for "FH – UCM" (parameter setting for FH vs. parameter setting for UCM). Similarity matrices are generated by comparing pairs of segmentations of different segmentation algorithms; NMI was used for similarity estimation.
In our example in Fig. 3b, the columns therefore represent the corresponding FH segmentations. Columns 1 to 6 are darker than columns 7 to 10. This means that the FH segmentations 1 to 6 do not concur well with the UCM segmentations, which indicates that they tend to be bad segmentations. Conversely, FH can be used as reference segmenter for obtaining the optimal parameter setting for UCM: the fifth UCM-segmentation also has the highest NMI value (see Fig. 3a). Because of the highest values in the similarity matrix, the ninth FH-segmentation and the fifth UCM-segmentation are denoted as most stable.

Outline of the framework. The example above can be generalized to outline our framework as follows:
1. Compute N segmentations for each segmentation algorithm.
2. Compute an N × N similarity matrix by comparing each segmentation of the first algorithm with each segmentation of the second algorithm.
3. Determine the best parameter setting from the similarity matrix.
The first two steps are relatively straightforward; we only need to specify a reasonable parameter subspace for each involved segmentation algorithm and a similarity measure. The much more demanding last step requires careful design and will be discussed in detail later. We have implemented a prototype to validate our framework, which will be detailed in the following three subsections.
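As a rough illustration only (not the authors' implementation), the three steps can be sketched in Python as follows, assuming the caller supplies two segmentation routines, their parameter grids and a similarity function:

import numpy as np

def select_parameters(segment_a, segment_b, params_a, params_b, similarity):
    """Sketch of the outlined framework for two segmenters."""
    # Step 1: compute N segmentations per algorithm
    segs_a = [segment_a(p) for p in params_a]
    segs_b = [segment_b(p) for p in params_b]
    # Step 2: N x N similarity matrix (rows: second segmenter, columns: first)
    S = np.array([[similarity(sa, sb) for sa in segs_a] for sb in segs_b])
    # Step 3: simplest strategy -- take the global maximum of the matrix
    row, col = np.unravel_index(np.argmax(S), S.shape)
    return params_a[col], params_b[row], S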
3.2 Segmentation Algorithms Used in Prototypical Implementation
We decided to use three different segmentation algorithms in our current prototypical implementation: the graph-based approach FH [9], the algorithm TBES [11] based on the MDL-principle, and UCM [10]. The reasons for our choice are: 1) algorithms like FH, TBES and UCM are state-of-the-art and frequently referenced in the literature; 2) their code is available; 3) they are sensitive to parameter selection. In contrast we found that the segmenter EDISON [12] is
not as sensitive to parameter changes as the selected algorithms, and a fixed parameter set leads in the majority of cases to the optimal solution (see [13]). For each segmentation algorithm every image of the Berkeley dataset is segmented 10 times by varying one free parameter, i.e. a reasonable one-dimensional parameter space is sampled into 10 equidistant parameter settings. FH has three parameters: a smoothing parameter (σ), a threshold function (k) and a minimum component size (min size). We have chosen to vary the smoothing parameter σ = 0.2, 0.4, ..., 2.0 because it appeared to be the most sensitive parameter with respect to the segmentation results. The other parameters are fixed to the optimal setting (k = 350, min size = 1500), as is done in the supervised parameter learning approach. (The case of several free parameters will be discussed in Section 3.5.) The segmenter TBES has only one free parameter, the quantization error ε, which is set to ε = 40, 70, ..., 310. The method UCM consists of an oriented watershed transform to form initial regions from contours, followed by the construction of an ultrametric contour map (UCM). Its only parameter is the threshold l, which controls the level of detail; we choose l = 0.03, 0.11, 0.19, 0.27, 0.35, 0.43, 0.50, 0.58, 0.66, 0.74.
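For concreteness, the sampled parameter grids described above can be written down directly; the variable names below are ours and simply restate the settings from the text.

import numpy as np

# FH: vary the smoothing parameter sigma; k and min_size stay fixed
fh_sigma = np.arange(0.2, 2.01, 0.2)        # 0.2, 0.4, ..., 2.0
fh_fixed = {"k": 350, "min_size": 1500}

# TBES: vary the quantization error
tbes_eps = np.arange(40, 311, 30)           # 40, 70, ..., 310

# UCM: vary the threshold l controlling the level of detail
ucm_l = [0.03, 0.11, 0.19, 0.27, 0.35, 0.43, 0.50, 0.58, 0.66, 0.74]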
3.3 Similarity Measure Used in Prototypical Implementation
In our framework a segmentation evaluation measure is needed for similarity estimation (and also for performance evaluation purposes). We use the same measure for both tasks in our experiments. The question of which measure is best for comparing segmentations remains difficult, and researchers try to construct measures which account for over- and under-segmentation, inaccurate boundary localization and different numbers of segments [13, 14]. For our work we decided to use two different measures, NMI and the F-measure, to demonstrate the performance.

NMI is an information-theoretic measure and was used in the context of image segmentation in [6, 15]. Let S_a and S_b denote two labellings, each representing one segmentation. Furthermore, |S_a| and |S_b| denote the number of groups within S_a and S_b, respectively. Then NMI is formally defined by

\mathrm{NMI}(S_a, S_b) = \frac{\sum_{h=1}^{|S_a|} \sum_{l=1}^{|S_b|} |R_{h,l}| \log \frac{n\,|R_{h,l}|}{|R_h|\,|R_l|}}{\sqrt{\left( \sum_{h=1}^{|S_a|} |R_h| \log \frac{|R_h|}{n} \right) \left( \sum_{l=1}^{|S_b|} |R_l| \log \frac{|R_l|}{n} \right)}}    (1)

where R_h and R_l are regions from S_a and S_b, respectively, R_{h,l} denotes the common part of R_h and R_l, and n is the image size. Its value domain is [0, 1], where a high (low) NMI value indicates high (low) similarity between two labellings. Therefore, a high (low) NMI value indicates a good (bad) segmentation.

The boundary-based F-measure [16] is the weighted harmonic mean of precision and recall. Its value domain is also [0, 1], and a high (low) F-measure value indicates a good (bad) segmentation.
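A plain NumPy rendering of Eq. (1) might look like the following sketch (our own code, not the authors'); the two inputs are integer label images of equal size.

import numpy as np

def nmi(seg_a, seg_b):
    """Normalized mutual information between two labellings, Eq. (1)."""
    a, b = seg_a.ravel(), seg_b.ravel()
    n = a.size
    # contingency table of region overlaps |R_{h,l}|
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    C = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(C, (ai, bi), 1)
    r_h = C.sum(axis=1, keepdims=True)      # region sizes |R_h|
    r_l = C.sum(axis=0, keepdims=True)      # region sizes |R_l|
    nz = C > 0
    num = np.sum(C[nz] * np.log(n * C[nz] / (r_h * r_l)[nz]))
    den = np.sqrt(np.sum(r_h * np.log(r_h / n)) * np.sum(r_l * np.log(r_l / n)))
    return 0.0 if den == 0 else float(num / den)   # den = 0 only for single-region labellings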
3.4 Dealing with Similarity Matrix
The essential component of our framework is the similarity matrix, which has a decisive influence on the overall performance. Below we discuss various design strategies to concretize the framework towards a real implementation.

Similarity measure with penalty term. In our discussion so far the similarity matrix is assumed to be built by computing a pairwise similarity measure such as NMI or the F-measure. In practice, however, we need a small extension with a penalty term for handling the over-segmentation problem. If a pair of segmentations tend to be over-segmented, they naturally look similar. In the extreme case where each pixel is partitioned as a segment, two segmentations will be equal, leading to a maximal similarity value. Indeed, we observed that in some cases pairs of strongly over-segmented segmentations are most similar and dominate the similarity matrix. Thus, we need a means of correcting such spurious similarity values. There are several possibilities to handle this fundamental problem. We found that simply counting the regions of each segmentation and adding a small penalty (0.1) to the performance measure if a threshold is reached is sufficient for our purpose. Note that this penalty term does not affect the rest of the similarity matrix computation.

Nonlocal maximum selection. In the illustrative example the best parameter setting was selected by detecting the maximal value in the similarity matrix. This strategy is suitable for introductory purposes, but may be too local to be really useful in practice. A first strategy performs a weighted smoothing of the similarity matrix and then determines the maximum. Alternatively, as a second strategy, when searching for the best result for the first (second) segmenter we can compute for each column (row) one representative by averaging over the k maximal values in that column (row) and then search for the maximum over the column (row) representatives. In our study k was set to 5. In Fig. 4a,b the row/column representatives for UCM and FH are shown. Indeed, as explained in Section 3.1, columns deliver information about the first segmenter and rows deliver information about the second segmenter in a comparison. Therefore, in general a low-valued column (row) may only eliminate bad segmentations from the first (second) segmenter. Extensive experiments showed that this second strategy delivers better results.

Using several reference segmenters. One can also consider combining different similarity matrices gained from multiple reference segmenters. For instance, we can compute a combined result for TBES based on the similarity matrices from the comparisons "TBES – FH" and "TBES – UCM". In this case representatives of each column are computed and added (Fig. 4c). From the resulting column the maximal value is selected and the corresponding parameter is assumed to be the best one.
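The penalty term and the nonlocal maximum selection could be prototyped roughly as below. This is our own sketch: the region-count threshold of 100 is an arbitrary placeholder, and applying the penalty once per over-segmented result is merely our reading of the description above.

import numpy as np

def penalized_similarity(S, regions_first, regions_second, max_regions=100, penalty=0.1):
    """Reduce similarity entries of strongly over-segmented results."""
    S = S.copy()
    over_cols = np.asarray(regions_first) > max_regions    # columns: first segmenter
    over_rows = np.asarray(regions_second) > max_regions   # rows: second segmenter
    S[:, over_cols] -= penalty
    S[over_rows, :] -= penalty
    return S

def column_representatives(S, k=5):
    """Nonlocal maximum selection: average the k largest values per column."""
    return np.sort(S, axis=0)[-k:, :].mean(axis=0)

def combined_representatives(S1, S2, k=5):
    """Two reference segmenters: add the column representatives and pick the maximum."""
    reps = column_representatives(S1, k) + column_representatives(S2, k)
    return int(np.argmax(reps)), reps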
Fig. 4. (a) Row representatives for UCM, obtained from the comparison "FH – UCM" by nonlocal maximum selection. (b) Column representatives for FH, obtained from the comparison "FH – TBES" by nonlocal maximum selection. (c) Column representatives for TBES, obtained by combining two reference segmenters (FH and UCM). In (a) and (b) the representatives are received from the similarity matrices by averaging over the five best values in each row/column.
3.5 Handling Several Free Parameters
Our framework is able to work with an arbitrary number of free parameters, but for simplicity and better visualization we decided to use only one free parameter for each segmenter in our prototypical implementation, as a first validation of our framework. However, we also want to briefly point out two strategies for handling several free parameters. A straightforward use of our approach for multiple parameters increases the computation time. To solve this problem efficiently, an evolutionary algorithm could be used: given one pair of segmentations, the evolutionary algorithm could start in the corresponding point of the similarity matrix and converge to a local optimum; several starting points could be considered to avoid getting trapped in a poor local optimum. In this way, similarity matrices of higher dimension encoding several free parameters could be handled. Another solution could be a multi-scale approach, in which all sensitive parameters are considered but only roughly sampled at the beginning. In subsequent steps, locally optimal areas are further partitioned and explored.
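One possible reading of the multi-scale idea, sketched here for a single parameter only; the refinement factor and the number of rounds are our own choices.

import numpy as np

def coarse_to_fine(evaluate, lo, hi, n=5, rounds=3):
    """Sample a parameter range coarsely, then repeatedly zoom into the best area.
    evaluate(p) should return the (representative) similarity achieved with p."""
    for _ in range(rounds):
        grid = np.linspace(lo, hi, n)
        scores = [evaluate(p) for p in grid]
        best = int(np.argmax(scores))
        step = (hi - lo) / (n - 1)
        lo, hi = grid[best] - step, grid[best] + step   # refine around the best sample
    return grid[int(np.argmax(scores))]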
4 Experimental Results
In this section we report some results of the current prototypical implementation. In particular, we compare our framework of adaptive parameter selection with supervised automated parameter training. For our experiments we use the Berkeley dataset. It provides several different ground truth segmentations for each image. Therefore, we decided to select the ground truth image with the maximum sum of distances to the other ground truth images, where NMI resp. the F-measure is used as distance measure.

Supervised automated parameter training. We compute the average performance measure over all 300 images of the Berkeley dataset for each parameter setting candidate, and the one with the largest value is selected as the optimal fixed parameter setting. For this training the same parameter values as in Section 3.2 are used. Table 1 summarizes the average performance measure for each algorithm based on these fixed parameters and represents the performance of the supervised parameter setting.
Table 1. Average performance achieved by the individual best and the fixed parameter setting

               NMI                    F-measure
         individual   fixed     individual   fixed
FH          0.65       0.60        0.53       0.47
TBES        0.69       0.64        0.59       0.54
UCM         0.70       0.66        0.64       0.60
For comparison purposes the best individual performance is also computed for each algorithm and each image; the averaged values over all images are displayed in Table 1. Note that this performance measure is in some sense the limit the respective algorithm can achieve.

Adaptive parameter selection. As said before, for simplicity and better visualization we decided to adaptively select only one free parameter for each segmenter in our prototypical implementation, as a first validation of our framework. The other parameters are taken from the supervised learning. To compare our framework with the supervised method, both approaches are applied to each image of the Berkeley dataset. NMI is used for both similarity and performance evaluation. The resulting histograms are plotted in Fig. 5. For better comparison the performance measure values are centred for each image by the best individual performance, i.e. the differences NMI − NMI_best are histogrammed. For instance, the histogram in Fig. 5a displays the performance behavior of FH: 1) compared to the results obtained by the fixed learned parameter setting (the left bar in each bin); 2) supported by a single reference segmenter (TBES or UCM; the two middle bars in each bin); 3) supported by two reference segmenters (the right bar in each bin). The fixed parameter setting leads in 107 of 300 cases (blue bar in the rightmost bin) to the optimal solution. In 114 of all images the performance is about 0.05 below the optimal solution, and so on. The reference segmenter TBES alone lifts the number of optimal solutions from 107 to 190. A similar picture can be seen when using UCM as the reference segmenter. For FH, using both reference segmenters leads to the best result: the optimal parameter setting is achieved in 197 of 300 images. Note that an ideal parameter selection method would lead to a single bar located at zero in the histogram. The finding that the support of TBES, UCM, or both together substantially increases the performance of FH is perhaps not surprising, since TBES and UCM are better performing segmenters, see Table 1. Important, however, is the fact that our framework can realize this intrinsic support technically. Results for TBES and UCM are shown in Fig. 5b and Fig. 5c, respectively. For TBES the adaptive method (with both FH and UCM) achieves the optimal parameter setting in 192 of all images, in contrast to only 124 for the fixed parameter. Interestingly, the weaker segmenter FH enables a jump from 124 optimal parameter settings to 183. UCM seems to be more stable with respect to the parameter from the supervised parameter learning. Nevertheless, also in this case a remarkable improvement can be observed.
Fig. 5. Performance histograms for FH (a), TBES (b) and UCM (c): number of images vs. performance difference to the individual best result. NMI was used for evaluation. Each panel compares, per bin, the fixed parameter setting, the two single reference segmenters, and both reference segmenters combined.

Table 2. Average performance using NMI for similarity estimation and performance evaluation. "Ref.segm." specifies the reference segmenter.

                  ref.segm.:
         avg      FH       TBES     UCM      TBES & UCM   FH & UCM   FH & TBES
FH      -0.053    *        -0.036   -0.034   -0.027       *          *
TBES    -0.054    -0.035   *        -0.038   *            -0.033     *
UCM     -0.044    -0.041   -0.044   *        *            *          -0.033
To obtain a quantitative and comparable value, the average over the performance values in the histogram is computed (Table 2). It can be seen from the values that using two reference segmenters always yields the best results. Experiments are also executed using the F-measure for similarity estimation and performance evaluation (Table 3). In this case, too, using both reference segmenters yields the best results. Only for UCM does the supervised learning approach outperform our framework. This is due to the fact that for the F-measure the difference between UCM and the other segmenters is considerable (Table 1), and it becomes harder for the reference segmenters to optimize UCM. Nevertheless, it must be emphasized that by using both reference segmenters results close to the supervised learning approach are obtained without knowing ground truth. To further improve the results for UCM, another comparable reference segmenter would be necessary.

4.1 Discussion
Our experiments confirmed the observation stated in Section 2: for the majority of images, different segmentation algorithms tend to produce similar good segmentations and dissimilar bad segmentations. This fundamental fact allowed us to show the superiority of our adaptive parameter selection framework over the supervised method based on parameter learning. From a practical point of view it is even more important to emphasize that ground truth is often not available, so that the supervised learning approach cannot be applied at all. Our framework may provide a useful tool for dealing with the omnipresent parameter selection problem in such situations.

Table 3. Average performance using F-measure for similarity estimation and performance evaluation. "Ref.segm." specifies the reference segmenter.

                  ref.segm.:
         avg      FH       TBES     UCM      TBES & UCM   FH & UCM   FH & TBES
FH      -0.063    *        -0.036   -0.037   -0.032       *          *
TBES    -0.050    -0.048   *        -0.043   *            -0.041     *
UCM     -0.043    -0.065   -0.052   *        *            *          -0.047

Our approach depends on the assumption that the reference segmenter generates a good segmentation for at least one parameter setting. If it totally fails, there may not be much chance to recover a good segmentation from the similarity matrix. However, this is a weak requirement on the reference segmenter and does not pose a real problem in practice. We also want to emphasize that our approach depends substantially on how well the similarity between segmentations can be measured quantitatively. For nearly optimal segmentations close to ground truth this is not a big challenge, since all similarity measures can be expected to work well. In more practical cases the similarity measure should be chosen with care. The experimental results indicate that NMI and the F-measure are appropriate for this task.

The overall runtime of our current implementation mainly depends on the runtime of each segmenter. The complex segmenters TBES and UCM need a few minutes to segment an image, which is the most critical part (for FH the segmentation takes only a few seconds). The estimation of similarity takes a few seconds for NMI and a few minutes for the F-measure. Our framework is inherently parallel, and an optimized implementation would lead to a significant speed-up.
5 Conclusion
In this paper we have presented a novel unsupervised framework for adaptive parameter selection based on the comparison of segmentations generated by different segmentation algorithms. Fundamental to our method is the observation that bad segmentations generated by different segmenters are dissimilar. Motivated by this fact, we proposed to adjust the notion of stability in the context of segmentation. Stable segmentations are found by using a reference segmenter. Experiments justified our observation and showed the superiority of our approach compared to supervised parameter training. In particular, such an unsupervised approach is helpful in situations without ground truth. In contrast to most existing unsupervised methods, our approach does not use any heuristic self-judgement characteristics for segmentation evaluation. Although the framework has been presented and validated in the context of image segmentation, the underlying idea is general and can be adapted to solve the adaptive parameter selection problem in other tasks. This is indeed currently under study. In addition, we will further improve the implementation details of our framework, for instance the combination of multiple similarity measures and enhanced strategies for dealing with the similarity matrix. Using more than one reference segmenter has not been fully explored in this work and deserves further investigation as well.
References
1. Min, J., Powell, M.W., Bowyer, K.W.: Automated performance evaluation of range image segmentation algorithms. IEEE Trans. on Systems, Man, and Cybernetics, Part B 34, 263–271 (2004)
2. Chabrier, S., Emile, B., Rosenberger, C., Laurent, H.: Unsupervised performance evaluation of image segmentation. EURASIP J. Appl. Signal Process. (2006)
3. Zhang, H., Fritts, J.E., Goldman, S.A.: Image segmentation evaluation: A survey of unsupervised methods. Computer Vision and Image Understanding 110, 260–280 (2008)
4. Pignalberi, G., Cucchiara, R., Cinque, L., Levialdi, S.: Tuning range image segmentation by genetic algorithm. EURASIP J. Appl. Signal Process., 780–790 (2003)
5. Abdul-Karim, M.A., Roysam, B., Dowell-Mesfin, N., Jeromin, A., Yuksel, M., Kalyanaraman, S.: Automatic selection of parameters for vessel/neurite segmentation algorithms. IEEE Trans. on Image Processing 14, 1338–1350 (2005)
6. Wattuya, P., Jiang, X.: Ensemble combination for solving the parameter selection problem in image segmentation. In: Proc. of Int. Conf. on Pattern Recognition, pp. 392–401 (2008)
7. Rabinovich, A., Lange, T., Buhmann, J., Belongie, S.: Model order selection and cue combination for image segmentation. In: CVPR, pp. 1130–1137 (2006)
8. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. ICCV, vol. 2, pp. 416–423 (2001)
9. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Computer Vision 59, 167–181 (2004)
10. Arbelaez, P., Maire, M., Fowlkes, C.C., Malik, J.: From contours to regions: An empirical evaluation. In: CVPR, pp. 2294–2301. IEEE, Los Alamitos (2009)
11. Rao, S., Mobahi, H., Yang, A.Y., Sastry, S., Ma, Y.: Natural image segmentation with adaptive texture and boundary encoding. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5994, pp. 135–146. Springer, Heidelberg (2010)
12. Christoudias, C.M., Georgescu, B., Meer, P., Georgescu, C.M.: Synergism in low level vision. In: Int. Conf. on Pattern Recognition, pp. 150–155 (2002)
13. Unnikrishnan, R., Pantofaru, C., Hebert, M.: Toward objective evaluation of image segmentation algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 29, 929–944 (2007)
14. Monteiro, F.C., Campilho, A.C.: Performance evaluation of image segmentation. In: Int. Conf. on Image Analysis and Recognition, vol. (1), pp. 248–259 (2006)
15. Fowlkes, C., Martin, D., Malik, J.: Learning affinity functions for image segmentation: combining patch-based and gradient-based approaches. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 54–61 (2003)
16. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 530–539 (2004)
Cosine Similarity Metric Learning for Face Verification

Hieu V. Nguyen and Li Bai

School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
{vhn,bai}@cs.nott.ac.uk
http://www.nottingham.ac.uk/cs/
Abstract. Face verification is the task of deciding by analyzing face images, whether a person is who he/she claims to be. This is very challenging due to image variations in lighting, pose, facial expression, and age. The task boils down to computing the distance between two face vectors. As such, appropriate distance metrics are essential for face verification accuracy. In this paper we propose a new method, named the Cosine Similarity Metric Learning (CSML) for learning a distance metric for facial verification. The use of cosine similarity in our method leads to an effective learning algorithm which can improve the generalization ability of any given metric. Our method is tested on the state-of-the-art dataset, the Labeled Faces in the Wild (LFW), and has achieved the highest accuracy in the literature.
Face verification has been extensively researched for decades. The reason for its popularity is its non-intrusiveness and wide range of practical applications, such as access control, video surveillance, and telecommunication. The biggest challenge in face verification comes from the numerous variations of a face image, due to changes in lighting, pose, facial expression, and age. It is a very difficult problem, especially with images captured in a totally uncontrolled environment, for instance images from surveillance cameras or from the Web.

Over the years, many public face datasets have been created for researchers to advance the state of the art and make their methods comparable. This practice has proved to be extremely useful. FERET [1] is the first popular face dataset freely available to researchers. It was created in 1993 and since then research in face recognition has advanced considerably. Researchers have come very close to fully recognizing all the frontal images in FERET [2–6]. However, these methods are not robust enough to deal with non-frontal face images.

Recently a new face dataset named the Labeled Faces in the Wild (LFW) [7] was created. LFW is a full protocol for evaluating face verification algorithms. Unlike FERET, LFW is designed for unconstrained face verification. Faces in LFW can vary in all possible ways due to pose, lighting, expression, age, scale, and misalignment (Figure 1).

Fig. 1. From FERET to LFW

Methods for frontal images cannot cope with these variations, and as such many researchers have turned to machine learning to
develop learning-based face verification methods [8, 9]. One of these approaches is to learn a transformation matrix from the data so that the Euclidean distance can perform better in the new subspace. Learning such a transformation matrix is equivalent to learning a Mahalanobis metric in the original space [10]. Xing et al. [11] used semidefinite programming to learn a Mahalanobis distance metric for clustering. Their algorithm aims to minimize the sum of squared distances between similarly labeled inputs, while maintaining a lower bound on the sum of distances between differently labeled inputs. Goldberger et al. [10] proposed Neighbourhood Component Analysis (NCA), a distance metric learning algorithm especially designed to improve kNN classification. The algorithm learns a Mahalanobis distance by minimizing the leave-one-out cross validation error of the kNN classifier on a training set. Because it uses a softmax activation function to convert distance to probability, the gradient computation step is expensive. Weinberger et al. [12] proposed a method that learns a matrix designed to improve the performance of kNN classification. The objective function is composed of two terms. The first term minimizes the distance between target neighbours. The second term is a hinge-loss that encourages target neighbours to be at least one distance unit closer than points from other classes. It requires information about the class of each sample. As a result, their method is not applicable for the restricted setting in LFW (see Section 2.1). Recently, Davis et al. [13] have taken an information-theoretic approach to learn a Mahalanobis metric under a wide range of possible constraints and prior knowledge on the Mahalanobis distance. Their method regularizes the learned matrix to make it as close as possible to a known prior matrix. The closeness is measured as a Kullback-Leibler divergence between two Gaussian distributions corresponding to the two matrices.

In this paper, we propose a new method named Cosine Similarity Metric Learning (CSML). There are two main contributions. The first contribution is
that we have shown cosine similarity to be an effective alternative to Euclidean distance in the metric learning problem. The second contribution is that CSML can improve the generalization ability of an existing metric significantly in most cases. Our method differs from all the above methods in terms of the distance measure: all of the other methods use Euclidean distance to measure the dissimilarities between samples in the transformed space, whilst our method uses cosine similarity, which leads to a simple and effective metric learning method. The rest of this paper is structured as follows. Section 1 presents the CSML method in detail. Section 2 presents how CSML can be applied to face verification. Experimental results are presented in Section 3. Finally, the conclusion is given in Section 4.
1 Cosine Similarity Metric Learning
The general idea is to learn a transformation matrix from training data so that cosine similarity performs well in the transformed subspace. The performance is measured by the cross validation error (cve).

1.1 Cosine Similarity
Cosine similarity (CS) between two vectors x and y is defined as:

CS(x, y) = \frac{x^T y}{\|x\| \, \|y\|}
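In code this is a one-liner; the trivial sketch below assumes NumPy vectors.

import numpy as np

def cosine_similarity(x, y):
    """CS(x, y) = x^T y / (||x|| ||y||), always within [-1, 1]."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))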
Cosine similarity has a special property that makes it suitable for metric learning: the resulting similarity measure is always within the range of −1 and +1. As shown in Section 1.3, this property allows the objective function to be simple and effective.

1.2 Metric Learning Formulation
Let {x_i, y_i, l_i}_{i=1}^{s} denote a training set of s labeled samples with pairs of input vectors x_i, y_i ∈ R^m and binary class labels l_i ∈ {1, 0}, which indicate whether x_i and y_i match or not. The goal is to learn a linear transformation A : R^m → R^d (d ≤ m), which we will use to compute cosine similarities in the transformed subspace as:

CS(x, y, A) = \frac{(Ax)^T (Ay)}{\|Ax\| \, \|Ay\|} = \frac{x^T A^T A y}{\sqrt{x^T A^T A x} \, \sqrt{y^T A^T A y}}
Specifically, we want to learn the linear transformation that minimizes the cross validation error when similarities are measured in this way. We begin by defining the objective function.
1.3 Objective Function
First, we define the positive and negative sample index sets Pos and Neg as:

Pos = \{ i \mid l_i = 1 \}, \qquad Neg = \{ i \mid l_i = 0 \}

Also, let |Pos| and |Neg| denote the numbers of positive and negative samples. We have |Pos| + |Neg| = s, the total number of samples. Now the objective function f(A) can be defined as:

f(A) = \sum_{i \in Pos} CS(x_i, y_i, A) - \alpha \sum_{i \in Neg} CS(x_i, y_i, A) - \beta \, \|A - A_0\|^2

We want to maximize f(A) with regard to matrix A given two parameters α and β, where α, β ≥ 0. The objective function can be split into two terms, g(A) and h(A), where

g(A) = \sum_{i \in Pos} CS(x_i, y_i, A) - \alpha \sum_{i \in Neg} CS(x_i, y_i, A)

h(A) = \beta \, \|A - A_0\|^2
The role of g(A) is to encourage the margin between positive and negative samples to be large. A large margin can help to reduce the training error. g(A) can be seen as a simple voting scheme over the samples. The reason we can treat votes from samples equally is that the cosine similarity function is bounded by 1. Additionally, because of this simple form of g(A), we can optimize f(A) very fast (details in Section 1.4). The parameter α in g(A) balances the contributions of positive samples and negative samples to the margin. In practice, α can be estimated using cross validation or simply be set to |Pos|/|Neg|. In the case of LFW, because the numbers of positive and negative samples are equal, we simply set α = 1. The role of h(A) is to regularize matrix A to be as close as possible to a predefined matrix A_0, which can be any matrix. The idea is both to inherit good properties from matrix A_0 and to reduce the training error as much as possible. If A_0 is carefully chosen, the learned matrix A can achieve a small training error and good generalization ability at the same time. The parameter β plays an important role here: it controls the tradeoff between maximizing the margin (g(A)) and minimizing the distance from A to A_0 (h(A)). With the objective function set up, the algorithm can be presented in detail in the next section.
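A direct sketch of the objective (our own code, not the authors' released implementation): X and Y are assumed to hold the pair vectors row-wise and labels is a 0/1 NumPy array.

import numpy as np

def cs_transformed(A, x, y):
    """CS(x, y, A): cosine similarity after the linear map A."""
    ax, ay = A @ x, A @ y
    return float(ax @ ay) / (np.linalg.norm(ax) * np.linalg.norm(ay))

def objective(A, X, Y, labels, A0, alpha, beta):
    """f(A) = sum_Pos CS - alpha * sum_Neg CS - beta * ||A - A0||^2."""
    cs = np.array([cs_transformed(A, x, y) for x, y in zip(X, Y)])
    g = cs[labels == 1].sum() - alpha * cs[labels == 0].sum()   # margin term g(A)
    h = beta * np.sum((A - A0) ** 2)                            # regularization term h(A)
    return g - h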
1.4 The Algorithm and Its Complexity
The idea is to use cross validation to estimate the optimal values of (α, β). In this paper, α can simply be set to 1 and a suitable β can be found using a coarse-to-fine search strategy. Coarse-to-fine means that the range of the search area decreases over time. Algorithm 1 presents the proposed CSML method.

Algorithm 1. Cosine Similarity Metric Learning
INPUT
  – S = {x_i, y_i, l_i}_{i=1}^{s}: a set of training samples (x_i, y_i ∈ R^m, l_i ∈ {0, 1})
  – T = {x_i, y_i, l_i}_{i=1}^{t}: a set of validation samples (x_i, y_i ∈ R^m, l_i ∈ {0, 1})
  – d: dimension of the transformed subspace (d ≤ m)
  – A_p: a predefined matrix (A_p ∈ R^{d×m})
  – K: K-fold cross validation
OUTPUT
  – A_CSML: output transformation matrix (A_CSML ∈ R^{d×m})
1. A_0 ← A_p
2. α ← |Pos| / |Neg|
3. Repeat
   (a) min_cve ← ∞   // store minimum cross validation error
   (b) For each value of β   // coarse-to-fine strategy
       i.  A* ← the matrix maximizing f(A) given (A_0, α, β), evaluated on S
       ii. if cve(T, A*, K) < min_cve then   // Algorithm 2
           A. min_cve ← cve(T, A*, K)
           B. A_next ← A*
   (c) A_0 ← A_next
4. Until convergence
5. A_CSML ← A_0
6. Return A_CSML

Algorithm 2. Cross validation error computation
INPUT
  – T = {x_i, y_i, l_i}_{i=1}^{t}: a set of validation samples (x_i, y_i ∈ R^m, l_i ∈ {0, 1})
  – A: a linear transformation matrix (A ∈ R^{d×m})
  – K: K-fold cross validation
OUTPUT
  – cross validation error
1. Transform all samples in T using matrix A
2. Partition T into K equal-sized subsamples
3. total_error ← 0
4. For k = 1 → K   // using subsample k as testing data, the other K−1 subsamples as training data
   (a) θ ← the optimal threshold on the training data
   (b) test_error ← error on the testing data
   (c) total_error ← total_error + test_error
5. Return total_error / K

It is easy to prove that when β goes to ∞, the optimized matrix A* approaches the prior A_0. In other words, the performance of the learned matrix A_CSML is guaranteed to be as good as that of matrix A_0. In practice, however, the performance of matrix A_CSML is significantly better in most cases (see Section 3). f(A) is differentiable with regard to matrix A, so we can optimize it using a gradient-based optimizer such as delta-bar-delta or conjugate gradients. We used the Conjugate Gradient method. The gradient of f(A) can be computed as follows:

\frac{\partial f(A)}{\partial A} = \sum_{i \in Pos} \frac{\partial CS(x_i, y_i, A)}{\partial A} - \alpha \sum_{i \in Neg} \frac{\partial CS(x_i, y_i, A)}{\partial A} - 2\beta (A - A_0)    (1)

\frac{\partial CS(x_i, y_i, A)}{\partial A} = \frac{\partial}{\partial A} \left( \frac{x_i^T A^T A y_i}{\sqrt{x_i^T A^T A x_i} \, \sqrt{y_i^T A^T A y_i}} \right) = \frac{\partial}{\partial A} \left( \frac{u(A)}{v(A)} \right) = \frac{1}{v(A)} \frac{\partial u(A)}{\partial A} - \frac{u(A)}{v(A)^2} \frac{\partial v(A)}{\partial A}    (2)

where

\frac{\partial u(A)}{\partial A} = A (x_i y_i^T + y_i x_i^T)    (3)

\frac{\partial v(A)}{\partial A} = \sqrt{\frac{y_i^T A^T A y_i}{x_i^T A^T A x_i}} \, A x_i x_i^T + \sqrt{\frac{x_i^T A^T A x_i}{y_i^T A^T A y_i}} \, A y_i y_i^T    (4)
From Eqs. (1)–(4), the complexity of computing the gradient of f(A) is O(s × d × m). As a result, the complexity of the CSML algorithm is O(r × b × g × s × d × m), where r is the number of iterations used to optimize A_0 repeatedly (line 3 in Algorithm 1), b is the number of values of β tested in the cross validation process (line 3b in Algorithm 1), and g is the number of steps in the Conjugate Gradient method.
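Assuming SciPy is available, the inner maximization of f(A) can be sketched with the analytic gradient of Eqs. (1)–(4); this is our own illustration, not the authors' code, and it re-uses the objective sketch from Section 1.3.

import numpy as np
from scipy.optimize import minimize

def cs_grad(A, x, y):
    """Gradient of CS(x, y, A) with respect to A, Eqs. (2)-(4)."""
    ax, ay = A @ x, A @ y
    nx, ny = np.linalg.norm(ax), np.linalg.norm(ay)
    u, v = float(ax @ ay), nx * ny
    du = A @ (np.outer(x, y) + np.outer(y, x))                                # Eq. (3)
    dv = (ny / nx) * (A @ np.outer(x, x)) + (nx / ny) * (A @ np.outer(y, y))  # Eq. (4)
    return du / v - (u / v ** 2) * dv                                         # Eq. (2)

def grad_f(A, X, Y, labels, A0, alpha, beta):
    """Gradient of f(A), Eq. (1)."""
    G = -2.0 * beta * (A - A0)
    for x, y, l in zip(X, Y, labels):
        g = cs_grad(A, x, y)
        G += g if l == 1 else -alpha * g
    return G

def maximize_f(A_init, X, Y, labels, A0, alpha, beta):
    """Maximize f(A) with conjugate gradients (we minimize -f)."""
    shape = A_init.shape
    fun = lambda a: -objective(a.reshape(shape), X, Y, labels, A0, alpha, beta)
    jac = lambda a: -grad_f(a.reshape(shape), X, Y, labels, A0, alpha, beta).ravel()
    return minimize(fun, A_init.ravel(), jac=jac, method="CG").x.reshape(shape)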
2 Application to Face Verification
In this section, we show how CSML can be applied to face verification on the LFW dataset in detail.

2.1 LFW Dataset
The dataset contains more than 13,000 images of faces collected from the web. These images have a very large degree of variability in face pose, age, expression,
race and illumination. There are two evaluation settings defined by the authors of LFW: the restricted and the unrestricted setting. This paper considers the restricted setting. Under this setting no identity information of the faces is given. The only information available to a face verification algorithm is a pair of input images, and the algorithm is expected to determine whether the pair of images comes from the same person. The performance of an algorithm is measured by a 10-fold cross validation procedure. See [7] for details. There are three versions of the LFW available: original, funneled and aligned. In [14], Wolf et al. showed that the aligned version is better than the funneled version at dealing with misalignment. Therefore, we use the aligned version in all of our experiments.

2.2 Face Verification Pipeline
The overview of our method is presented in Figure 2. First, the two original images are cropped to a smaller size. Next, a feature extraction method is used to form feature vectors (X, Y) from the cropped images. These vectors are passed to PCA to get two dimension-reduced vectors (X2, Y2). Then CSML is used to transform (X2, Y2) to (X3, Y3) in the final subspace. The cosine similarity between X3 and Y3 is the similarity score between the two faces. Finally, this score is thresholded to determine whether the two faces are the same or not. The optimal threshold θ is estimated from the training set; specifically, θ is set so that the False Acceptance Rate equals the False Rejection Rate. Each step is discussed in detail below.

Preprocessing. The original size of each image is 250 × 250 pixels. At the preprocessing step, we simply crop the image to remove the background, leaving an 80 × 150 face image. The next step after preprocessing is to extract features from the image.

Feature Extraction. To test the robustness of our method to different types of features, we carry out experiments on three facial descriptors: Intensity, Local Binary Patterns and Gabor Wavelets.
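At test time, the decision for one image pair could be sketched as follows; extract_features, the PCA projection P with its mean, the learned matrix A_csml and the threshold theta are all assumed to be given (our own sketch, not the authors' code).

import numpy as np

def verify_pair(img1, img2, extract_features, P, mean, A_csml, theta):
    """Return True if the two face images are judged to show the same person."""
    X, Y = extract_features(img1), extract_features(img2)   # cropped 80 x 150 faces
    X2, Y2 = P @ (X - mean), P @ (Y - mean)                  # PCA reduction to m dims
    X3, Y3 = A_csml @ X2, A_csml @ Y2                        # CSML transform to d dims
    score = float(X3 @ Y3) / (np.linalg.norm(X3) * np.linalg.norm(Y3))
    return score >= theta                                    # threshold where FAR = FRR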
Fig. 2. Overview of the face verification process
Intensity is the simplest feature extraction method. The feature vector is formed by concatenating all the pixels. The length of the feature vector is 12,000 (= 80 × 150).

Local Binary Patterns (LBP) were first applied to face recognition in [15] with very promising results. In our experiments, the face is divided into non-overlapping 10 × 10 blocks and LBP histograms are extracted in all blocks to form the feature vector, whose length is 7,080 (= 8 × 15 × 59).

Gabor Wavelets [16, 17] with 5 scales and 8 orientations are convolved at pixels selected uniformly with a downsampling rate of 10 × 10. The length of the feature vector is 4,800 (= 5 × 8 × 8 × 15).

Dimension Reduction. Before applying any learning method, we use PCA to reduce the dimension of the original feature vector to a more tractable number. A thousand faces from the training data (different for each fold) are used to create the covariance matrix in PCA. We notice in our experiments that the specific value of the reduced dimension after applying PCA does not affect the accuracy very much as long as it is not too small.

Feature Combination. We can further improve the accuracy by combining different types of features. Features can be combined at the feature extraction step [18, 19] or at the verification step. Here we combine features at the verification step using an SVM [20]. Applying CSML to each type of feature produces a similarity score. These scores form a vector which is passed to the SVM for verification.

2.3 How to Choose A0 in CSML?
Because CSML improves the accuracy of A_0, it is a good idea to choose a matrix A_0 which performs well by itself. There are published papers concluding that Whitened PCA (WPCA) with cosine similarity can achieve very good performance [3, 21]. Therefore, we propose to use the whitening matrix as A_0. Since we reduce the dimension from m to d, the whitening matrix has the rectangular form:

A_{WPCA} = \begin{bmatrix} \lambda_1^{-1/2} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^{-1/2} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_d^{-1/2} & 0 & \cdots & 0 \end{bmatrix} \in R^{d \times m}

where λ_1, λ_2, ..., λ_d are the first d largest eigenvalues of the covariance matrix computed in the PCA step. For comparison, we tried two different matrices: non-whitened PCA and Random Projection:

A_{PCA} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 & \cdots & 0 \end{bmatrix} \in R^{d \times m}

A_{RP} = \text{random matrix} \in R^{d \times m}
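Given the eigenvalues from the PCA step, the three candidate priors can be written down directly (a sketch; a Gaussian random projection is our own choice, as the text only says "random matrix").

import numpy as np

def prior_matrices(eigenvalues, d, m, rng=None):
    """Build A_WPCA, A_PCA and A_RP in R^{d x m}, acting on the m-dimensional PCA vectors."""
    lam = np.asarray(eigenvalues, dtype=float)[:d]
    A_wpca = np.zeros((d, m))
    A_wpca[:, :d] = np.diag(lam ** -0.5)     # lambda_i^(-1/2) on the diagonal
    A_pca = np.zeros((d, m))
    A_pca[:, :d] = np.eye(d)                 # plain, non-whitened PCA
    rng = np.random.default_rng() if rng is None else rng
    A_rp = rng.standard_normal((d, m))       # random projection
    return A_wpca, A_pca, A_rp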
3 Experimental Results
To evaluate performance on View 2 of the LFW dataset, we used five of the nine training splits as training samples and the remaining four as validation samples in the CSML algorithm (more about the LFW protocol in [7]). These validation samples are also used for training the SVM. All results presented here are produced using the parameters m = 500 and d = 200, where m is the dimension of the data after applying PCA and d is the dimension of the data after applying CSML. In this section, we present the results of two experiments.

In the first experiment, we show how much CSML improves over three choices of A_0: Random Projection, PCA, and Whitened PCA. We call the corresponding transformation matrices A_RP, A_PCA, and A_WPCA, respectively. Here we used the original intensity as the feature extraction method. As shown in Table 1, A_CSML consistently performs better than A_0 by about 5–10%.

Table 1. A_CSML and A_0 performance comparison

           A_RP               A_PCA              A_WPCA
A_0        0.5752 ± 0.0057    0.6762 ± 0.0075    0.7322 ± 0.0037
A_CSML     0.673  ± 0.0095    0.7112 ± 0.0083    0.7865 ± 0.0039
In the second experiment, we show how much CSML improves over cosine similarity in the original space and over Whitened PCA with three types of features: Intensity (IN), Local Binary Patterns (LBP), and Gabor Wavelets (GABOR). Each type of feature is tested with the original feature vector or the square root of the feature vector [8, 14, 20].

Table 2. The improvements of CSML over cosine similarity and WPCA

                        Cosine             WPCA               CSML
IN      original        0.6567 ± 0.0071    0.7322 ± 0.0037    0.7865 ± 0.0039
        sqrt            0.6485 ± 0.0088    0.7243 ± 0.0038    0.7887 ± 0.0052
LBP     original        0.7027 ± 0.0036    0.7712 ± 0.0044    0.8295 ± 0.0052
        sqrt            0.6977 ± 0.0047    0.7937 ± 0.0034    0.8557 ± 0.0052
GABOR   original        0.672  ± 0.0053    0.7558 ± 0.0052    0.8238 ± 0.0021
        sqrt            0.6942 ± 0.0072    0.7698 ± 0.0056    0.8358 ± 0.0058
Feature Combination                                           0.88   ± 0.0037
As shown in Table 2, CSML improves by about 5% over WPCA and by about 10–15% over cosine similarity. LBP seems to perform better than Intensity and Gabor Wavelets. Using the square root of the feature vector improves the accuracy by about 2–3% in most cases. The highest accuracy we can get from a single type of feature is 0.8557 ± 0.0052, using CSML with the square root of the LBP feature. The accuracy we can get by combining the 6 scores corresponding to the 6 different features (the rightmost column in Table 2) is 0.88 ± 0.0038. This is better than the current state-of-the-art result reported in [14]. For comparison purposes, the ROC curves of our method and others are depicted in Figure 3. Complete benchmark results can be found on the LFW website [22].

Fig. 3. ROC curves averaged over 10 folds of View 2
4 Conclusion
We have introduced a novel method for learning a distance metric based on cosine similarity. The use of cosine similarity allows us to form a simple but effective objective function, which leads to a fast gradient-based optimization algorithm. Another important property of our method is that, in theory, the learned matrix cannot perform worse than the regularization matrix A_0. In practice, it performs considerably better in most cases. We tested our method on the LFW dataset and achieved the highest accuracy in the literature. Although CSML was initially designed for face verification, it has a wide range of applications, which we plan to explore in future work.
References
1. Phillips, P., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face-recognition algorithms. Image and Vision Computing 16, 295–306 (1998)
2. Shan, S., Zhang, W., Su, Y., Chen, X., Gao, W., Frjdl, I., Cas, B.: Ensemble of Piecewise FDA Based on Spatial Histograms of Local (Gabor) Binary Patterns for Face Recognition. In: Proceedings of the 18th International Conference on Pattern Recognition, pp. 606–609 (2006)
3. Hieu, N., Bai, L., Shen, L.: Local Gabor binary pattern whitened PCA: A novel approach for face recognition from single image per person. In: Proceedings of the 3rd IAPR/IEEE International Conference on Biometrics (2009)
4. Shen, L., Bai, L.: MutualBoost learning for selecting Gabor features for face recognition. Pattern Recognition Letters 27, 1758–1767 (2006)
5. Shen, L., Bai, L., Fairhurst, M.: Gabor wavelets and general discriminant analysis for face identification and verification. Image and Vision Computing 27, 1758–1767 (2006)
6. Nguyen, H.V., Bai, L.: Compact binary patterns (CBP) with multiple patch classifiers for fast and accurate face recognition. In: CompIMAGE, pp. 187–198 (2010)
7. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (2007)
8. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: International Conference on Computer Vision, pp. 498–505 (2009)
9. Taigman, Y., Wolf, L., Hassner, T.: Multiple one-shots for utilizing class label information. In: The British Machine Vision Conference, BMVC (2009)
10. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighborhood component analysis. In: NIPS
11. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. In: Advances in Neural Information Processing Systems 15, vol. 15, pp. 505–512 (2003)
12. Weinberger, K., Blitzer, J., Saul, L.: Distance metric learning for large margin nearest neighbor classification. In: Advances in Neural Information Processing Systems 18, pp. 1473–1480 (2006)
13. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: ICML 2007: Proceedings of the 24th International Conference on Machine Learning, pp. 209–216. ACM, New York (2007)
14. Wolf, L., Hassner, T., Taigman, Y.: Similarity scores based on background samples. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009. LNCS, vol. 5995, pp. 88–97. Springer, Heidelberg (2010)
15. Ahonen, T., Hadid, A., Pietikainen, M.: Face Recognition with Local Binary Patterns. In: Guessarian, I. (ed.) LITP 1990. LNCS, vol. 469, pp. 469–481. Springer, Heidelberg (1990)
16. Daugman, J.: Complete Discrete 2D Gabor Transforms by Neural Networks for Image Analysis and Compression. IEEE Trans. Acoust. Speech Signal Process. 36 (1988)
17. Shan, S., Gao, W., Chang, Y., Cao, B., Yang, P.: Review the strength of Gabor features for face recognition from the angle of its robustness to mis-alignment. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 1 (2004)
18. Tan, X., Triggs, B.: Fusing Gabor and LBP Feature Sets for Kernel-Based Face Recognition. In: Zhou, S.K., Zhao, W., Tang, X., Gong, S. (eds.) AMFG 2007. LNCS, vol. 4778, pp. 235–249. Springer, Heidelberg (2007)
19. Zhang, W., Shan, S., Gao, W., Chen, X., Zhang, H.: Local Gabor Binary Pattern Histogram Sequence (LGBPHS): A Novel Non-Statistical Model for Face Representation and Recognition. In: Proc. ICCV, pp. 786–791 (2005)
20. Wolf, L., Hassner, T., Taigman, Y.: Descriptor based methods in the wild. In: Real-Life Images Workshop at the European Conference on Computer Vision, ECCV (2008)
21. Deng, W., Hu, J., Guo, J.: Gabor-Eigen-Whiten-Cosine: A Robust Scheme for Face Recognition. In: Zhao, W., Gong, S., Tang, X. (eds.) AMFG 2005. LNCS, vol. 3723, pp. 336–349. Springer, Heidelberg (2005)
22. LFW benchmark results, http://vis-www.cs.umass.edu/lfw/results.html
Author Index
Abdala, Daniel Duarte IV-373 Abe, Daisuke IV-565 Achtenberg, Albert IV-141 Ackermann, Hanno II-464 Agapito, Lourdes IV-460 Ahn, Jae Hyun IV-513 Ahuja, Narendra IV-501 Ai, Haizhou II-174, II-683, III-171 Akbas, Emre IV-501 Alexander, Andrew L. I-65 An, Yaozu II-282 Ancuti, Codruta O. I-79, II-501 Ancuti, Cosmin I-79, II-501 Argyros, Antonis A. III-744 Arnaud, Elise IV-361 ˚ Astr¨ om, Kalle IV-255 Atasoy, Selen II-41 Azuma, Takeo III-641 Babari, Raouf IV-243 Badino, Hern´ an III-679 Badrinath, G.S. II-321 Bai, Li II-709 Ballan, Luca III-613 Barlaud, Michel III-67 Barnes, Nick I-176, IV-269, IV-410 Bartoli, Adrien III-52 Bekaert, Philippe I-79, II-501 Belhumeur, Peter N. I-39 Bennamoun, Mohammed III-199, IV-115 Ben-Shahar, Ohad II-346 Ben-Yosef, Guy II-346 Binder, Alexander III-95 Bischof, Horst I-397, II-566 Bishop, Tom E. II-186 Biswas, Sujoy Kumar I-244 Bonde, Ujwal D. IV-228 Bowden, Richard I-256, IV-525 Boyer, Edmond IV-592 Br´emond, Roland IV-243 Briassouli, Alexia I-149 Brocklehurst, Kyle III-329, IV-422 Brunet, Florent III-52
Bu, Jiajun III-436 Bujnak, Martin I-11, II-216 Burschka, Darius I-135, IV-474 Byr¨ od, Martin IV-255 Carlsson, Stefan II-1 Cha, Joonhyuk IV-486 Chan, Kwok-Ping IV-51 Chen, Chia-Ping I-355 Chen, Chun III-436 Chen, Chu-Song I-355 Chen, Duowen I-283 Chen, Kai III-121 Chen, Tingwang II-400 Chen, Wei II-67 Chen, Yan Qiu IV-435 Chen, Yen-Wei III-511, IV-39, IV-165 Chen, Yi-Ling III-535 Chen, Zhihu II-137 Chi, Yu-Tseh II-268 Chia, Liang-Tien II-515 Chin, Tat-Jun IV-553 Cho, Nam Ik IV-513 Chu, Wen-Sheng I-355 Chum, Ondˇrej IV-347 Chung, Ronald H-Y. IV-690 Chung, Sheng-Luen IV-90 Collins, Maxwell D. I-65 Collins, Robert T. III-329, IV-422 Cootes, Tim F. I-1 Cosker, Darren IV-189 Cowan, Brett R. IV-385 Cree, Michael J. IV-397 Cremers, Daniel I-53 Dai, Qionghai II-412 Dai, Zhenwen II-137 Danielsson, Oscar II-1 Davis, James W. II-580 de Bruijne, Marleen II-160 Declercq, Arnaud III-422 Deguchi, Koichiro IV-565 De la Torre, Fernando III-679 Delaunoy, Ama¨el I-39, II-55 Denzler, Joachim II-489
722
Author Index
De Smet, Micha¨el III-276 Detry, Renaud III-572 Dickinson, Sven I-369, IV-539 Dikmen, Mert IV-501 Ding, Jianwei II-82 Di Stefano, Luigi III-653 Dorothy, Monekosso I-439 Dorrington, Adrian A. IV-397 Duan, Genquan II-683 ´ Dumont, Eric IV-243 El Ghoul, Aymen II-647 Ellis, Liam IV-525 Eng, How-Lung I-439 Er, Guihua II-412 Fan, Yong IV-606 Fang, Tianhong II-633 Favaro, Paolo I-425, II-186 Felsberg, Michael IV-525 Feng, Jufu III-213, III-343 Feng, Yaokai I-296 Feragen, Aasa II-160 Feuerstein, Marco III-409 Fieguth, Paul I-383 F¨ orstner, Wolfgang II-619 Franco, Jean-S´ebastien III-599 Franek, Lucas II-697, IV-373 Fu, Yun II-660 Fujimura, Ikko I-296 Fujiyoshi, Hironobu IV-25 Fukuda, Hisato IV-127 Fukui, Kazuhiro IV-580 Furukawa, Ryo IV-127 Ganesh, Arvind III-314, III-703 Gao, Changxin III-133 Gao, Yan IV-153 Garg, Ravi IV-460 Geiger, Andreas I-25 Georgiou, Andreas II-41 Ghahramani, M. II-388 Gilbert, Andrew I-256 Godbaz, John P. IV-397 Gong, Haifeng II-254 Gong, Shaogang I-161, II-293, II-527 Gopalakrishnan, Viswanath II-15, III-732 Grabner, Helmut I-200 Gu, Congcong III-121
Gu, Steve I-271 Guan, Haibing III-121 Guo, Yimo III-185 Gupta, Phalguni II-321 Hall, Peter IV-189 Han, Shuai I-323 Hao, Zhihui IV-269 Hartley, Richard II-554, III-52, IV-177, IV-281 Hauberg, Søren III-758 Hauti´ere, Nicolas IV-243 He, Hangen III-27 He, Yonggang III-133 Helmer, Scott I-464 Hendel, Avishai III-448 Heo, Yong Seok IV-486 Hermans, Chris I-79, II-501 Ho, Jeffrey II-268 Horaud, Radu IV-592 Hospedales, Timothy M. II-293 Hou, Xiaodi III-225 Hsu, Gee-Sern IV-90 Hu, Die II-672 Hu, Tingbo III-27 Hu, Weiming II-594, III-691, IV-630 Hu, Yiqun II-15, II-515, III-732 Huang, Jia-Bin III-497 Huang, Kaiqi II-67, II-82, II-542 Huang, Thomas S. IV-501 Huang, Xinsheng IV-281 Huang, Yongzhen II-542 Hung, Dao Huu IV-90 Igarashi, Yosuke IV-580 Ikemura, Sho IV-25 Iketani, Akihiko III-109 Imagawa, Taro III-641 Iwama, Haruyuki IV-702 Jank´ o, Zsolt II-55 Jeon, Moongu III-718 Jermyn, Ian H. II-647 Ji, Xiangyang II-412 Jia, Ke III-586 Jia, Yunde II-254 Jiang, Hao I-228 Jiang, Mingyang III-213, III-343 Jiang, Xiaoyi II-697, IV-373 Jung, Soon Ki I-478
Author Index Kakadiaris, Ioannis A. II-633 Kale, Amit IV-592 Kambhamettu, Chandra III-82, III-483, III-627 Kanatani, Kenichi II-242 Kaneda, Kazufumi II-452, III-250 Kang, Sing Bing I-350 Kasturi, Rangachar II-308 Kawabata, Satoshi III-523 Kawai, Yoshihiro III-523 Kawanabe, Motoaki III-95 Kawano, Hiroki I-296 Kawasaki, Hiroshi IV-127 Kemmler, Michael II-489 Khan, R. Nazim III-199 Kikutsugi, Yuta III-250 Kim, Du Yong III-718 Kim, Hee-Dong IV-1 Kim, Hyunwoo IV-333 Kim, Jaewon I-336 Kim, Seong-Dae IV-1 Kim, Sujung IV-1 Kim, Tae-Kyun IV-228 Kim, Wook-Joong IV-1 Kise, Koichi IV-64 Kitasaka, Takayuki III-409 Klinkigt, Martin IV-64 Kompatsiaris, Ioannis I-149 Kopp, Lars IV-255 Kuang, Gangyao I-383 Kuang, Yubin IV-255 Kuk, Jung Gap IV-513 Kukelova, Zuzana I-11, II-216 Kulkarni, Kaustubh IV-592 Kwon, Dongjin I-121 Kyriazis, Nikolaos III-744 Lai, Shang-Hong III-535 Lam, Antony III-157 Lao, Shihong II-174, II-683, III-171 Lauze, Francois II-160 Lee, Kyong Joon I-121 Lee, Kyoung Mu IV-486 Lee, Sang Uk I-121, IV-486 Lee, Sukhan IV-333 Lei, Yinjie IV-115 Levinshtein, Alex I-369 Li, Bing II-594, III-691, IV-630 Li, Bo IV-385 Li, Chuan IV-189
Li, Chunxiao III-213, III-343 Li, Fajie IV-641 Li, Hongdong II-554, IV-177 Li, Hongming IV-606 Li, Jian II-293 Li, Li III-691 Li, Min II-67, II-82 Li, Sikun III-471 Li, Wei II-594, IV-630 Li, Xi I-214 Li, Yiqun IV-153 Li, Zhidong II-606, III-145 Liang, Xiao III-314 Little, James J. I-464 Liu, Jing III-239 Liu, Jingchen IV-102 Liu, Li I-383 Liu, Miaomiao II-137 Liu, Nianjun III-586 Liu, Wei IV-115 Liu, Wenyu III-382 Liu, Yanxi III-329, IV-102, IV-422 Liu, Yong III-679 Liu, Yonghuai II-27 Liu, Yuncai II-660 Llad´ o, X. III-15 Lo, Pechin II-160 Lovell, Brian C. III-547 Lowe, David G. I-464 Loy, Chen Change I-161 Lu, Feng II-412 Lu, Guojun IV-449 Lu, Hanqing III-239 Lu, Huchuan III-511, IV-39, IV-165 Lu, Shipeng IV-165 Lu, Yao II-282 Lu, Yifan II-554, IV-177 Lu, Zhaojin IV-333 Luo, Guan IV-630 Lu´ o, Xi´ ongbi¯ ao III-409 Luo, Ye III-396 Ma, Songde III-239 Ma, Yi III-314, III-703 MacDonald, Bruce A. II-334 MacNish, Cara III-199 Macrini, Diego IV-539 Mahalingam, Gayathri III-82 Makihara, Yasushi I-107, II-440, III-667, IV-202, IV-702
723
724
Author Index
Makris, Dimitrios III-262 Malgouyres, Remy III-52 Mannami, Hidetoshi II-440 Martin, Ralph R. II-27 Matas, Jiˇr´ı IV-347 Matas, Jiri III-770 Mateus, Diana II-41 Matsushita, Yasuyuki I-336, III-703 Maturana, Daniel IV-618 Mauthner, Thomas II-566 McCarthy, Chris IV-410 Meger, David I-464 Mehdizadeh, Maryam III-199 Mery, Domingo IV-618 Middleton, Lee I-200 Moon, Youngsu IV-486 Mori, Atsushi I-107 Mori, Kensaku III-409 Muja, Marius I-464 Mukaigawa, Yasuhiro I-336, III-667 Mukherjee, Dipti Prasad I-244 Mukherjee, Snehasis I-244 M¨ uller, Christina III-95 Nagahara, Hajime III-667, IV-216 Nakamura, Ryo II-109 Nakamura, Takayuki IV-653 Navab, Nassir II-41, III-52 Neumann, Lukas III-770 Nguyen, Hieu V. II-709 Nguyen, Tan Dat IV-665 Nielsen, Frank III-67 Nielsen, Mads II-160 Niitsuma, Hirotaka II-242 Nock, Richard III-67 Oikonomidis, Iasonas III-744 Okabe, Takahiro I-93, I-323 Okada, Yusuke III-641 Okatani, Takayuki IV-565 Okutomi, Masatoshi III-290, IV-76 Okwechime, Dumebi I-256 Ommer, Bj¨ orn II-477 Ong, Eng-Jon I-256 Ortner, Mathias IV-361 Orwell, James III-262 Oskarsson, Magnus IV-255 Oswald, Martin R. I-53 Ota, Takahiro IV-653
Paisitkriangkrai, Sakrapee III-460 Pajdla, Tomas I-11, II-216 Pan, ChunHong II-148, III-560 Pan, Xiuxia IV-641 Paparoditis, Nicolas IV-243 Papazov, Chavdar I-135 Park, Minwoo III-329, IV-422 Park, Youngjin III-355 Pedersen, Kim Steenstrup III-758 Peleg, Shmuel III-448 Peng, Xi I-283 Perrier, R´egis IV-361 Piater, Justus III-422, III-572 Pickup, David IV-189 Pietik¨ ainen, Matti III-185 Piro, Paolo III-67 Pirri, Fiora III-369 Pizarro, Luis IV-460 Pizzoli, Matia III-369 Pock, Thomas I-397 Pollefeys, Marc III-613 Prados, Emmanuel I-39, II-55 Provenzi, E. III-15 Qi, Baojun
III-27
Rajan, Deepu II-15, II-515, III-732 Ramakrishnan, Kalpatti R. IV-228 Ranganath, Surendra IV-665 Raskar, Ramesh I-336 Ravichandran, Avinash I-425 Ray, Nilanjan III-39 Raytchev, Bisser II-452, III-250 Reddy, Vikas III-547 Reichl, Tobias III-409 Remagnino, Paolo I-439 Ren, Zhang I-176 Ren, Zhixiang II-515 Rodner, Erik II-489 Rohith, MV III-627 Rosenhahn, Bodo II-426, II-464 Roser, Martin I-25 Rosin, Paul L. II-27 Roth, Peter M. II-566 Rother, Carsten I-53 Roy-Chowdhury, Amit K. III-157 Rudi, Alessandro III-369 Rudoy, Dmitry IV-307 Rueckert, Daniel IV-460 Ruepp, Oliver IV-474
Author Index Sagawa, Ryusuke III-667 Saha, Baidya Nath III-39 Sahbi, Hichem I-214 Sakaue, Fumihiko II-109 Sala, Pablo IV-539 Salti, Samuele III-653 Salvi, J. III-15 Sanderson, Conrad III-547 Sang, Nong III-133 Sanin, Andres III-547 Sankaranarayananan, Karthik II-580 Santner, Jakob I-397 ˇ ara, Radim I-450 S´ Sato, Imari I-93, I-323 Sato, Jun II-109 Sato, Yoichi I-93, I-323 Savoye, Yann III-599 Scheuermann, Bj¨ orn II-426 Semenovich, Dimitri I-490 Senda, Shuji III-109 Shah, Shishir K. II-230, II-633 Shahiduzzaman, Mohammad IV-449 Shang, Lifeng IV-51 Shelton, Christian R. III-157 Shen, Chunhua I-176, III-460, IV-269, IV-281 Shi, Boxin III-703 Shibata, Takashi III-109 Shimada, Atsushi IV-216 Shimano, Mihoko I-93 Shin, Min-Gil IV-293 Shin, Vladimir III-718 Sigal, Leonid III-679 Singh, Vikas I-65 Sminchisescu, Cristian I-369 Somanath, Gowri III-483 Song, Li II-672 Song, Ming IV-606 Song, Mingli III-436 Song, Ran II-27 ´ Soto, Alvaro IV-618 Sowmya, Arcot I-490, II-606 Stol, Karl A. II-334 Sturm, Peter IV-127, IV-361 Su, Hang III-302 Su, Te-Feng III-535 Su, Yanchao II-174 Sugimoto, Shigeki IV-76 Sung, Eric IV-11 Suter, David IV-553
Swadzba, Agnes II-201 Sylwan, Sebastian I-189 Szir´ anyi, Tam´ as IV-321 Szolgay, D´ aniel IV-321 Tagawa, Seiichi I-336 Takeda, Takahishi II-452 Takemura, Yoshito II-452 Tamaki, Toru II-452, III-250 Tan, Tieniu II-67, II-82, II-542 Tanaka, Masayuki III-290 Tanaka, Shinji II-452 Taneja, Aparna III-613 Tang, Ming I-283 Taniguchi, Rin-ichiro IV-216 Tao, Dacheng III-436 Teoh, E.K. II-388 Thida, Myo I-439 Thomas, Stephen J. II-334 Tian, Qi III-239, III-396 Tian, Yan III-679 Timofte, Radu I-411 Tomasi, Carlo I-271 Tombari, Federico III-653 T¨ oppe, Eno I-53 Tossavainen, Timo III-1 Trung, Ngo Thanh III-667 Tyleˇcek, Radim I-450 Uchida, Seiichi I-296 Ugawa, Sanzo III-641 Urtasun, Raquel I-25 Vakili, Vida II-123 Van Gool, Luc I-200, I-411, III-276 Vega-Pons, Sandro IV-373 Veksler, Olga II-123 Velastin, Sergio A. III-262 Veres, Galina I-200 Vidal, Ren´e I-425 Wachsmuth, Sven II-201 Wada, Toshikazu IV-653 Wagner, Jenny II-477 Wang, Aiping III-471 Wang, Bo IV-269 Wang, Hanzi IV-630 Wang, Jian-Gang IV-11 Wang, Jinqiao III-239 Wang, Lei II-554, III-586, IV-177
725
726
Author Index
Wang, LingFeng II-148, III-560 Wang, Liwei III-213, III-343 Wang, Nan III-171 Wang, Peng I-176 Wang, Qing II-374, II-400 Wang, Shao-Chuan I-310 Wang, Wei II-95, III-145 Wang, Yang II-95, II-606, III-145 Wang, Yongtian III-703 Wang, Yu-Chiang Frank I-310 Weinshall, Daphna III-448 Willis, Phil IV-189 Wojcikiewicz, Wojciech III-95 Won, Kwang Hee I-478 Wong, Hoi Sim IV-553 Wong, Kwan-Yee K. II-137, IV-690 Wong, Wilson IV-115 Wu, HuaiYu III-560 Wu, Lun III-703 Wu, Ou II-594 Wu, Tao III-27 Wu, Xuqing II-230 Xiang, Tao I-161, II-293, II-527 Xiong, Weihua II-594 Xu, Changsheng III-239 Xu, Dan II-554, IV-177 Xu, Jie II-95, II-606, III-145 Xu, Jiong II-374 Xu, Zhengguang III-185 Xue, Ping III-396 Yaegashi, Keita II-360 Yagi, Yasushi I-107, I-336, II-440, III-667, IV-202, IV-702 Yamaguchi, Takuma IV-127 Yamashita, Takayoshi II-174 Yan, Ziye II-282 Yanai, Keiji II-360 Yang, Chih-Yuan III-497 Yang, Ehwa III-718 Yang, Fan IV-39 Yang, Guang-Zhong II-41 Yang, Hua III-302 Yang, Jie II-374 Yang, Jun II-95, II-606, III-145 Yang, Ming-Hsuan II-268, III-497 Yau, Wei-Yun IV-11 Yau, W.Y. II-388 Ye, Getian II-95
Yin, Fei III-262 Yoo, Suk I. III-355 Yoon, Kuk-Jin IV-293 Yoshida, Shigeto II-452 Yoshimuta, Junki II-452 Yoshinaga, Satoshi IV-216 Young, Alistair A. IV-385 Yu, Jin IV-553 Yu, Zeyun II-148 Yuan, Chunfeng III-691 Yuan, Junsong III-396 Yuk, Jacky S-C. IV-690 Yun, Il Dong I-121 Zappella, L. III-15 Zeevi, Yehoshua Y. IV-141 Zelnik-Manor, Lihi IV-307 Zeng, Liang III-471 Zeng, Zhihong II-633 Zerubia, Josiane II-647 Zhang, Bang II-606 Zhang, Chunjie III-239 Zhang, Dengsheng IV-449 Zhang, Hong III-39 Zhang, Jian III-460 Zhang, Jing II-308 Zhang, Liqing III-225 Zhang, Luming III-436 Zhang, Wenling III-511 Zhang, Xiaoqin IV-630 Zhang, Zhengdong III-314 Zhao, Guoying III-185 Zhao, Xu II-660 Zhao, Youdong II-254 Zheng, Hong I-176 Zheng, Huicheng IV-677 Zheng, Qi III-121 Zheng, Shibao III-302 Zheng, Wei-Shi II-527 Zheng, Ying I-271 Zheng, Yinqiang IV-76 Zheng, Yongbin IV-281 Zhi, Cheng II-672 Zhou, Bolei III-225 Zhou, Quan III-382 Zhou, Yi III-121 Zhou, Yihao IV-435 Zhou, Zhuoli III-436 Zhu, Yan II-660