Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5748
Joachim Denzler Gunther Notni Herbert Süße (Eds.)
Pattern Recognition 31st DAGM Symposium Jena, Germany, September 9-11, 2009 Proceedings
Volume Editors Joachim Denzler Herbert Süße Friedrich-Schiller Universität Jena, Lehrstuhl Digitale Bildverarbeitung Ernst-Abbe-Platz 2, 07743 Jena, Germany E-mail: {joachim.denzler, herbert.suesse}@uni-jena.de Gunther Notni Fraunhofer-Institut für Angewandte Optik und Feinmechanik Albert-Einstein-Str. 7, 07745 Jena, Germany E-mail:
[email protected]
Library of Congress Control Number: 2009933619
CR Subject Classification (1998): I.5, I.4, I.3, I.2.10, F.2.2, I.4.8, I.4.1
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-642-03797-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03797-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12743339 06/3180 543210
Preface
In 2009, for the second time in a row, Jena hosted an extraordinary event. In 2008, Jena celebrated the 450th birthday of the Friedrich Schiller University of Jena with the motto “Lichtgedanken” – “flashes of brilliance.” This year, for almost one week, Jena became the center for the pattern recognition research community of the German-speaking countries in Europe by hosting the 31st Annual Symposium of the Deutsche Arbeitsgemeinschaft f¨ ur Mustererkennung (DAGM). Jena is a special place for this event for several reasons. Firstly, it is the first time that the university of Jena has been selected to host this conference, and it is an opportunity to present the city of Jena as offering a fascinating combination of historic sites, an intellectual past, a delightful countryside, and innovative, international research and industry within Thuringia. Second, the conference takes place in an environment that has been heavily influenced by optics research and industry for more than 150 years. Third, in several schools and departments at the University of Jena, research institutions and companies in the fields of pattern recognition, 3D computer vision, and machine learning play an important role. The university’s involvement includes such diverse activities as industrial inspection, medical image processing and analysis, remote sensing, biomedical analysis, and cutting-edge developments in the field of physics, such as the recent development of the new terahertz imaging technique. Thus, DAGM 2009 was an important event to transfer basic research results to different applications in such areas. Finally, the fact that the conference was jointly organized by the Chair for Computer Vision of the Friedrich Schiller University of Jena and the Fraunhofer Institute IOF reflects the strong cooperation between these two institutions during the past and, more generally, between research, applied research, and industry in this field. The establishment of a Graduate School of Computer Vision and Image Interpretation, which is a joint facility of the Technical University of Ilmenau and the Friedrich Schiller University of Jena, is a recent achievement that will focus and strengthen the computer vision and pattern recognition activities in Thuringia. The technical program covered all aspects of pattern recognition and consisted of oral presentations and poster contributions, which were treated equally and given the same number of pages in the proceedings. Each section is devoted to one specific topic and contains all oral and poster papers for this topic sorted alphabetically by first authors. A very strict paper selection process was used, resulting in an acceptance rate of less than 45%. Therefore, the proceedings meet the strict requirements for publication in the Springer Lecture Notes in Computer Science series. Although not reflected in these proceedings, one additional point that also made this year’s DAGM special is the Young Researchers’ Forum, a special session for promoting scientific interactions between excellent
VI
Preface
young researchers. The impressive scientific program of the conference is due to the enormous efforts of the reviewers of the Program Committee. We thank all of those whose dedication and timely reporting helped to ensure that the highly selective reviewing process was completed on schedule. We are also proud to have had three renowned invited speakers at the conference: – Josef Kittler (University of Surrey, UK) – Reinhard Klette (University of Auckland, New Zealand) – Kyros Kutulakos (University of Toronto, Canada) We extend our sincere thanks to everyone involved in the organization of this event, especially the members of the Chair for Computer Vision and the Fraunhofer Institute IOF. In particular, we are indebted to Erik Rodner for organizing everything related to the conference proceedings, to Wolfgang Ortmann for installation and support in the context of the Web presentation and the reviewing and submission system, to Kathrin M¨ ausezahl for managing the conference office and arranging the conference dinner, and to Marcel Br¨ uckner, Michael Kemmler, and Marco K¨orner for the local organization. Finally, we would like to thank our sponsors, OLYMPUS Europe Foundation Science for Life, STIFT Thuringia, MVTec Software GmbH, Telekom Laboratories, Allied Vision Technologies, Desko GmbH, Jenoptik AG, and Optonet e.V. for their donations and helpful support, which contributed to several awards at the conference and made reasonable registration fees possible. We especially appreciate support from industry because it indicates faithfulness to our community and recognizes the importance of pattern recognition and related areas to business and industry. We were happy to host the 31st Annual Symposium of DAGM in Jena and look forward to DAGM 2010 in Darmstadt. September 2009
Joachim Denzler Gunther Notni Herbert Süße
Organization
Program Committee T. Aach H. Bischof J. Buhmann H. Burkhardt D. Cremers J. Denzler G. Fink B. Flach W. F¨orstner U. Franke M. Franz D. Gavrila M. Goesele F.A. Hamprecht J. Hornegger B. J¨ahne X. Jiang R. Koch U. K¨ othe W.G. Kropatsch G. Linß H. Mayer R. Mester B. Michaelis K.-R. M¨ uller H. Ney G. Notni K. Obermayer G. R¨atsch G. Rigoll K. Rohr B. Rosenhahn S. Roth B. Schiele C. Schn¨ orr B. Sch¨olkopf G. Sommer T. Vetter F.M. Wahl J. Weickert
RWTH Aachen TU Graz ETH Z¨ urich University of Freiburg University of Bonn University of Jena TU Dortmund TU Dresden University of Bonn Daimler AG HTWG Konstanz Daimler AG TU Darmstadt University of Heidelberg University of Erlangen University of Heidelberg University of M¨ unster University of Kiel University of Heidelberg TU Wien TU Ilmenau BW-Universit¨ at M¨ unchen University of Frankfurt University of Magdeburg TU Berlin RWTH Aachen Fraunhofer IOF Jena TU Berlin MPI T¨ ubingen TU M¨ unchen University of Heidelberg University of Hannover TU Darmstadt University of Darmstadt University of Heidelberg MPI T¨ ubingen University of Kiel University of Basel University of Braunschweig Saarland University
Prizes 2007
Olympus Prize The Olympus Prize 2007 was awarded to Bodo Rosenhahn and Gunnar R¨ atsch for their outstanding contributions to the area of computer vision and machine learning.
DAGM Prizes The main prize for 2007 was awarded to: J¨ urgen Gall, Bodo Rosenhahn, Hans-Peter Seidel: Clustered Stochastic Optimization for Object Recognition and Pose Estimation Christopher Zach, Thomas Pock, Horst Bischof: A Duality-Based Approach for Realtime TV-L1 Optical Flow Further DAGM prizes for 2007 were awarded to: Kevin K¨ oser, Bogumil Bartczak, Reinhard Koch: An Analysis-by-Synthesis Camera Tracking Approach Based on Free-Form Surfaces Volker Roth, Bernd Fischer: The kernelHMM : Learning Kernel Combinations in Structured Output Domains
Prizes 2008
Olympus Prize The Olympus Prize 2008 was awarded to Bastian Leibe for his outstanding contributions to the area of closely coupled object categorization, segmentation, and tracking.
DAGM Prizes The main prize for 2008 was awarded to: Christoph H. Lampert, Matthew B. Blaschko: A Multiple Kernel Learning Approach to Joint Multi-class Object Detection Further DAGM prizes for 2008 were awarded to: Bj¨ orn Andres, Ullrich K¨ othe, Moritz Helmst¨ adter, Winfried Denk, Fred A. Hamprecht : Segmentation of SBFSEM Volume Data of Neural Tissue by Hierarchical Classification Kersten Petersen, Janis Fehr, Hans Burkhardt : Fast Generalized Belief Propagation for MAP Estimation on 2D and 3D Grid-Like Markov Random Fields Kai Krajsek, Rudolf Mester, Hanno Scharr : Statistically Optimal Averaging for Image Restoration and Optical Flow Estimation
Table of Contents
Motion and Tracking A 3-Component Inverse Depth Parameterization for Particle Filter SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˙ Evren Imre and Marie-Odile Berger
1
An Efficient Linear Method for the Estimation of Ego-Motion from Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Florian Raudies and Heiko Neumann
11
Localised Mixture Models in Region-Based Tracking . . . . . . . . . . . . . . . . . . Christian Schmaltz, Bodo Rosenhahn, Thomas Brox, and Joachim Weickert A Closed-Form Solution for Image Sequence Segmentation with Dynamical Shape Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frank R. Schmidt and Daniel Cremers Markerless 3D Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Walder, Martin Breidt, Heinrich B¨ ulthoff, Bernhard Sch¨ olkopf, and Crist´ obal Curio
21
31 41
Pedestrian Recognition and Automotive Applications The Stixel World - A Compact Medium Level Representation of the 3D-World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hern´ an Badino, Uwe Franke, and David Pfeiffer
51
Global Localization of Vehicles Using Local Pole Patterns . . . . . . . . . . . . . Claus Brenner
61
Single-Frame 3D Human Pose Recovery from Multiple Views . . . . . . . . . . Michael Hofmann and Dariu M. Gavrila
71
Dense Stereo-Based ROI Generation for Pedestrian Detection . . . . . . . . . . Christoph Gustav Keller, David Fern´ andez Llorca, and Dariu M. Gavrila
81
Pedestrian Detection by Probabilistic Component Assembly . . . . . . . . . . . Martin Rapus, Stefan Munder, Gregory Baratoff, and Joachim Denzler
91
High-Level Fusion of Depth and Intensity for Pedestrian Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcus Rohrbach, Markus Enzweiler, and Dariu M. Gavrila
101
XII
Table of Contents
Features Fast and Accurate 3D Edge Detection for Surface Reconstruction . . . . . . Christian B¨ ahnisch, Peer Stelldinger, and Ullrich K¨ othe
111
Boosting Shift-Invariant Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas H¨ ornlein and Bernd J¨ ahne
121
Harmonic Filters for Generic Feature Detection in 3D . . . . . . . . . . . . . . . . Marco Reisert and Hans Burkhardt
131
Increasing the Dimension of Creativity in Rotation Invariant Feature Design Using 3D Tensorial Harmonics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Henrik Skibbe, Marco Reisert, Olaf Ronneberger, and Hans Burkhardt Training for Task Specific Keypoint Detection . . . . . . . . . . . . . . . . . . . . . . . Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua Combined GKLT Feature Tracking and Reconstruction for Next Best View Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Trummer, Christoph Munkelt, and Joachim Denzler
141 151
161
Single-View and 3D Reconstruction Non-parametric Single View Reconstruction of Curved Objects Using Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin R. Oswald, Eno T¨ oppe, Kalin Kolev, and Daniel Cremers
171
Discontinuity-Adaptive Shape from Focus Using a Non-convex Prior . . . . Krishnamurthy Ramnath and Ambasamudram N. Rajagopalan
181
Making Shape from Shading Work for Real-World Images . . . . . . . . . . . . . Oliver Vogel, Levi Valgaerts, Michael Breuß, and Joachim Weickert
191
Learning and Classification Deformation-Aware Log-Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobias Gass, Thomas Deselaers, and Hermann Ney Multi-view Object Detection Based on Spatial Consistency in a Low Dimensional Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gurman Gill and Martin Levine
201
211
Active Structured Learning for High-Speed Object Detection . . . . . . . . . . Christoph H. Lampert and Jan Peters
221
Face Reconstruction from Skull Shapes and Physical Attributes . . . . . . . . Pascal Paysan, Marcel L¨ uthi, Thomas Albrecht, Anita Lerch, Brian Amberg, Francesco Santini, and Thomas Vetter
232
Table of Contents
Sparse Bayesian Regression for Grouped Variables in Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sudhir Raman and Volker Roth Learning with Few Examples by Transferring Feature Relevance . . . . . . . Erik Rodner and Joachim Denzler
XIII
242 252
Pattern Recognition and Estimation Simultaneous Estimation of Pose and Motion at Highly Dynamic Turn Maneuvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Barth, Jan Siegemund, Uwe Franke, and Wolfgang F¨ orstner
262
Making Archetypal Analysis Practical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Bauckhage and Christian Thurau
272
Fast Multiscale Operator Development for Hexagonal Images . . . . . . . . . . Bryan Gardiner, Sonya Coleman, and Bryan Scotney
282
Optimal Parameter Estimation with Homogeneous Entities and Arbitrary Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jochen Meidow, Wolfgang F¨ orstner, and Christian Beder
292
Detecting Hubs in Music Audio Based on Network Analysis . . . . . . . . . . . Alexandros Nanopoulos
302
A Gradient Descent Approximation for Graph Cuts . . . . . . . . . . . . . . . . . . Alparslan Yildiz and Yusuf Sinan Akgul
312
Stereo and Multi-view Reconstruction A Stereo Depth Recovery Method Using Layered Representation of the Scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tarkan Aydin and Yusuf Sinan Akgul
322
Reconstruction of Sewer Shaft Profiles from Fisheye-Lens Camera Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sandro Esquivel, Reinhard Koch, and Heino Rehse
332
A Superresolution Framework for High-Accuracy Multiview Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bastian Goldl¨ ucke and Daniel Cremers
342
View Planning for 3D Reconstruction Using Time-of-Flight Camera Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christoph Munkelt, Michael Trummer, Peter K¨ uhmstedt, Gunther Notni, and Joachim Denzler
352
XIV
Table of Contents
Real Aperture Axial Stereo: Solving for Correspondences in Blur . . . . . . . Rajiv Ranjan Sahay and Ambasamudram N. Rajagopalan Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Schick and Rainer Stiefelhagen Image-Based Lunar Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . Stephan Wenger, Anita Sellent, Ole Sch¨ utt, and Marcus Magnor
362
372 382
Image Analysis and Applications Use of Coloured Tracers in Gas Flow Experiments for a Lagrangian Flow Analysis with Increased Tracer Density . . . . . . . . . . . . . . . . . . . . . . . . Christian Bendicks, Dominique Tarlet, Bernd Michaelis, Dominique Th´evenin, and Bernd Wunderlich Reading from Scratch – A Vision-System for Reading Data on Micro-structured Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ralf Dragon, Christian Becker, Bodo Rosenhahn, and J¨ orn Ostermann
392
402
Diffusion MRI Tractography of Crossing Fibers by Cone-Beam ODF Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hans-Heino Ehricke, Kay M. Otto, Vinoid Kumar, and Uwe Klose
412
Feature Extraction Algorithm for Banknote Textures Based on Incomplete Shift Invariant Wavelet Packet Transform . . . . . . . . . . . . . . . . . Stefan Glock, Eugen Gillich, Johannes Schaede, and Volker Lohweg
422
Video Super Resolution Using Duality Based TV-L1 Optical Flow . . . . . . Dennis Mitzel, Thomas Pock, Thomas Schoenemann, and Daniel Cremers HMM-Based Defect Localization in Wire Ropes — A New Approach to Unusual Subsequence Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Esther-Sabrina Platzer, Josef N¨ agele, Karl-Heinz Wehking, and Joachim Denzler Beating the Quality of JPEG 2000 with Anisotropic Diffusion . . . . . . . . . Christian Schmaltz, Joachim Weickert, and Andr´es Bruhn Decoding Color Structured Light Patterns with a Region Adjacency Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christoph Schmalz Residual Images Remove Illumination Artifacts! . . . . . . . . . . . . . . . . . . . . . Tobi Vaudrey and Reinhard Klette
432
442
452
462 472
Table of Contents
XV
Superresolution and Denoising of 3D Fluid Flow Estimates . . . . . . . . . . . . Andrey Vlasenko and Christoph Schn¨ orr
482
Spatial Statistics for Tumor Cell Counting and Classification . . . . . . . . . . Oliver Wirjadi, Yoo-Jin Kim, and Thomas Breuel
492
Segmentation Quantitative Assessment of Image Segmentation Quality by Random Walk Relaxation Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bj¨ orn Andres, Ullrich K¨ othe, Andreea Bonea, Boaz Nadler, and Fred A. Hamprecht Applying Recursive EM to Scene Segmentation . . . . . . . . . . . . . . . . . . . . . . Alexander Bachmann Adaptive Foreground/Background Segmentation Using Multiview Silhouette Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobias Feldmann, Lars Dießelberg, and Annika W¨ orner
502
512
522
Evaluation of Structure Recognition Using Labelled Facade Images . . . . . Nora Ripperda and Claus Brenner
532
Using Lateral Coupled Snakes for Modeling the Contours of Worms . . . . Qing Wang, Olaf Ronneberger, Ekkehard Schulze, Ralf Baumeister, and Hans Burkhardt
542
Globally Optimal Finsler Active Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher Zach, Liang Shan, and Marc Niethammer
552
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
563
A 3-Component Inverse Depth Parameterization for Particle Filter SLAM
Evren İmre and Marie-Odile Berger
INRIA Grand Est - Nancy, France
Abstract. The non-Gaussianity of the depth estimate uncertainty degrades the performance of monocular extended Kalman filter SLAM (EKF-SLAM) systems employing a 3-component Cartesian landmark parameterization, especially in low-parallax configurations. Even particle filter SLAM (PF-SLAM) approaches are affected, as they utilize EKF for estimating the map. The inverse depth parameterization (IDP) alleviates this problem through a redundant representation, but at the price of increased computational complexity. The authors show that such a redundancy does not exist in PF-SLAM, hence the performance advantage of the IDP comes almost without an increase in the computational cost.
1 Introduction
The monocular simultaneous localization and mapping (SLAM) problem involves the causal estimation of the location of a set of 3D landmarks in an unknown environment (mapping), in order to compute the pose of a sensor platform within this environment (localization), via the photometric measurements acquired by a camera, i.e. the 2D images [2]. Since the computational complexity of structure-from-motion techniques, such as [6], is deemed prohibitively high, the literature is dominated by extended Kalman filter (EKF) [2],[3] and particle filter (PF) [4] based approaches. The former utilizes an EKF to estimate the current state, defined as the pose and the map, using all past measurements [5]. The latter exploits the independence of the landmarks, given the trajectory, to decompose the SLAM problem into the estimation of the trajectory via PF, and of the individual landmarks via EKF [5]. Since both approaches use EKF, they share a common problem: EKF assumes that the state distribution is Gaussian. The validity of this assumption, and hence the success of EKF in a particular application, critically depends on the linearity of the measurement function. However, the measurement function in monocular SLAM, the pinhole camera model [2], is known to be highly nonlinear for landmarks represented with the Cartesian parameterization (CP) [9], i.e., with their components along the 3 orthonormal axes corresponding to the 3 spatial dimensions. This is especially true for low-parallax configurations, which typically occur in the case of distant or newly initialized landmarks [9]. A well-known solution to this problem is to employ an initialization stage, by using a particle filter [10] or a simplified linear measurement model [4], and then to
switch to the CP. The IDP [7] further refines the initialization approach: it uses the actual measurement model, hence is more accurate than [4]; it is computationally less expensive than [10]; and it needs no special procedure to constrain the pose with the landmarks in the initialization stage, hence is simpler than both [4] and [10]. Since it is more linear than the CP [7], it also offers a performance gain in both low- and high-parallax configurations. However, since EKF is an O(N^2) algorithm, the redundancy of the IDP limits its use to the low-parallax case. The main contribution of this work is to show that in PF-SLAM the performance gain from the use of the IDP is almost free: PF-SLAM operates under the assumption that for each particle the trajectory is known. Therefore the pose-related components of the IDP should be removed from the state of the landmark EKF, leaving exactly the same number of components as the CP. Since this parameterization has no redundancy, and has better performance than the CP [7], its benefits can be enjoyed throughout the entire estimation procedure, not just during the landmark initialization. The organization of the paper is as follows: in the next section, the PF-SLAM system used in this study is presented. In Sect. 3, the application of the IDP to PF-SLAM is discussed and compared with [4]. The experimental results are presented in Sect. 4, and Sect. 5 concludes the paper.
1.1 Notation
Throughout the rest of the paper, matrices and vectors are represented by uppercase and lowercase bold letters, respectively. A standalone lowercase italic letter denotes a scalar, and one with parentheses stands for a function. Finally, an uppercase italic letter corresponds to a set.
2 A Monocular PF-SLAM System
PF-SLAM maximizes the SLAM posterior over the entire trajectory of the camera and the map, i.e., the objective function is [5]

p_PF = p(X, M | Z),    (1)
where X and M denote the camera trajectory and the map estimate at the kth time instant, respectively (in (1) the subscript k is suppressed for brevity). Z is the collection of measurements acquired until k. Equation 1 can be decomposed as [1]

p_PF = p(X | Z) p(M | X, Z).    (2)
In PF-SLAM, the first term is evaluated by a PF, which generates a set of trajectory hypotheses. Then, for a given trajectory X^i, the second term can be expanded as [1]

p(M^i | X^i, Z) = prod_{j=1..γ} p(m^i_j | X^i, Z),    (3)
where γ is the total number of landmarks, and M^i is the map estimate of the particle i, computed from X^i via EKF. Therefore, for a τ-particle system, (2) is maximized by a particle filter and γτ independent EKFs [1]. When a κ-parameter landmark representation is employed, the computational complexity is O(γτκ^2). In the system utilized in this work, X is the history of the pose and the rate-of-displacement estimates. Its kth member is

x_k = [s_k  t_k  w_k],  with  s_k = [c_k  q_k],    (4)
where c_k and q_k denote the position of the camera center in 3D world coordinates and its orientation as a quaternion, respectively. Together, they form the pose s_k. t_k and w_k are the translational and rotational displacement terms, in terms of distance and angle covered in a single time unit. M is defined as a collection of 3D point landmarks, i.e.,

M = {m_j}_{j=1..γ}.    (5)
The state evolves with respect to the constant velocity model, defined as

c_{k+1} = c_k + t_k
q_{k+1} = q_k ⊗ q(w_k)
t_{k+1} = t_k + v_t
w_{k+1} = w_k + v_w,    (6)
where q is an operator that maps an Euler angle vector to a quaternion, and ⊗ is the quaternion product operation. v_t and v_w are two independent Gaussian noise processes with covariance matrices P_t and P_w, respectively. The measurement function projects a 3D point landmark to a 2D point feature on the image plane via the perspective projection equation [9], i.e.,

[h_x  h_y  h_z]^T = r(q_k^{-1}) [m_j − c_k]^T
z_j = [ν_x − α_x h_x/h_z    ν_y − α_y h_y/h_z],    (7)
where r(q) is an operator that yields the rotation matrix corresponding to a quaternion q, and zj is the projection of the jth landmark to the image plane. (νx ; νy ) denotes the principal point of the camera, and (αx ; αy ) represents the focal length-related scale factors. The implementation follows the FASTSLAM2.0 [1] adaptation, described in [4]. In a cycle, first the particle poses are updated via (6). Then, the measurement predictions and the associated search regions are constructed. After matching with normalized cross correlation, the pose and the displacement estimates of all particles are updated with the measurements, zk . The quality of each particle is assessed by the observation likelihood function p(zk |X i , M i ), evaluated at zk . The resampling stage utilizes this quality score to form the new particle set. Finally, for each particle X i , the corresponding map M i is updated
with the measurements. The algorithm tries to maintain a certain number of active landmarks (i.e. landmarks that are in front of the camera, and have their measurement predictions within the image), and uses FAST [8] to detect new landmarks to replace the lost ones. The addition and the removal operations are global, i.e., if a landmark is deleted, it is removed from the maps of all particles.
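To make the cycle above concrete, the following minimal sketch (Python/NumPy; function and variable names are illustrative assumptions, not the authors' implementation) shows the two model equations that drive it: the constant-velocity propagation of a particle state according to (6), and the pinhole projection (7) used to predict a landmark measurement.

import numpy as np

def quat_mult(q1, q2):
    # Hamilton product of two quaternions [w, x, y, z]
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def angles_to_quat(w):
    # maps a small angular displacement w to a quaternion (axis-angle form), cf. q(w) in (6)
    angle = np.linalg.norm(w)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = w / angle
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def quat_to_rot(q):
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def propagate_particle(c, q, t, w, sigma_t, sigma_w, rng):
    # constant-velocity model (6): update position and orientation, perturb the displacements
    c_new = c + t
    q_new = quat_mult(q, angles_to_quat(w))
    t_new = t + rng.normal(0.0, sigma_t, 3)
    w_new = w + rng.normal(0.0, sigma_w, 3)
    return c_new, q_new / np.linalg.norm(q_new), t_new, w_new

def project_landmark(m, c, q, nu_x, nu_y, alpha_x, alpha_y):
    # measurement function (7): rotate the landmark into the camera frame and project
    h = quat_to_rot(q).T @ (m - c)
    return np.array([nu_x - alpha_x * h[0] / h[2],
                     nu_y - alpha_y * h[1] / h[2]])

Within the filter loop, the propagation would be applied to every particle and the projection to every active landmark of that particle in order to obtain the measurement predictions and search regions described above.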
3 Inverse-Depth Parameterization and PF-SLAM
The original IDP represents a 3D landmark, m_3D, as a point on the ray that joins the landmark and the camera center of the first camera in which the landmark is observed [9], i.e.,

m_3D = c + (1/λ) n,    (8)

where c is the camera center, n is the direction vector of the ray and λ is the inverse of the distance from c. n is parameterized by the azimuth and the elevation angles of the ray, θ and φ, as

n = [cos φ sin θ    −sin φ    cos φ cos θ],    (9)
θ(u, q)
φ(u, q)
λ].
(10)
This formulation, demonstrated to be superior to the CP [7], has two shortcomings. Firstly, it is a 6-parameter representation, hence its use in the EKF is computationally more expensive. Secondly, u and q are not directly represented, and their nonlinear relation to θ and φ [9] inevitably introduces an error. The latter issue can be remedied by a representation which deals with these hidden variables explicitly, i.e., a 10-component parameterization, mIDP10 = [c
q u λ].
(11)
νy − u 2 1]T αy n= l . l
(12)
In this case, n is redefined as l = r(q)[ νx α−x u1
With these definitions, the likelihood of a landmark in a particle, i.e., the operand in (3), is p(mij |X i , Z) = p(sij , uij , λij |X i , Z). (13) Consider a landmark mj is initiated at the time instant k − a, with a > 0. By definition, sij is the pose hypothesis of the particle i at k − a, i.e., sik−a (see (4) ). Since, for a particle, the trajectory is given, this entity has no associated uncertainty, hence, is not updated by the landmark EKF. Therefore, sik−a
sik = sik−a } ⇒ p(mij |X i , Z) = p(uij , λij |X i , Z). ∈ xik−a ∈ X i
(14)
A 3-Component Inverse Depth Parameterization for Particle Filter SLAM
5
In other words, the pose component of a landmark in a particle is a part of the trajectory hypothesis, and is fixed for a given particle. Therefore, it can be removed from the state vector of the landmark EKF. The resulting parameterization, IDP3, is mIDP3 = [u λ]. (15) Since the linearity analysis of [7] involves only the derivatives of the inverse depth parameter, it applies to all parameterizations of the form (8). Therefore, IDP3 retains the performance advantage of IDP6 over CP. As for the complexity, IDP3 and CP differ only in the measurement functions and their Jacobians. Equation (7) can be evaluated in 58 floating point operations (FLOP), whereas when (12) and (8) is substituted into (7), considering that some of the terms are fixed at the instantiation, the increase is 13 FLOPs, plus a square root. Similar figures apply to the Jaocbians. To put the above into perspective, the rest of the state update equations of the EKF can be evaluated roughly in 160 FLOPs. Therefore, in PF-SLAM, the performance vs. computational cost trade-off that limits the application of IDP6 is effectively eliminated, there is no need for a dedicated initialization stage, and IDP3 can be utilized throughout the entire process. Besides, IDP3 involves no approximations over CP, it only exploits a property of particle filters. A similar, 3-component parameterization is proposed in [4]. However, the authors employ it in an initialization stage, in which a simplified measurement function that assumes no rotation, and translation only along the directions orthogonal to the principal axis vector is utilized. This approximation yields a linear measurement function, and makes it possible to use a linear Kalman filter, a computationally less expensive scheme than EKF. The approach proposed in this work, employing IDP3 exclusively, has the following advantages: 1. IDP is utilized throughout the entire process, not only in the initialization. 2. No separate landmark initialization stage is required, therefore the system architecture is simpler. 3. The measurement function employed in [4] is valid only within a small neighborhood of the original pose [4]. The approximation not only adversely affects the performance, but also limits the duration in which a landmark may complete its initialization. However, the proposed approach uses the actual measurement equation, whose validity is not likewise limited.
4
Experimental Results
The performance of the proposed parameterization is assessed via a pose estimation task. For this purpose, a bed, which can translate and rotate a camera in two axes with a positional and angular precision of 0.48 mm and 0.001o, respectively, is used to acquire the images of two indoor scenes, with the dimensions 4x2x3 meters, at a resolution of 640x480. In the sequence Line, the camera moves on an 63.5-cm long straight path, with a constant translational and angular displacement of 1.58 mm/frame and 0.0325o/frame, respectively.
6
˙ E. Imre and M.-O. Berger
Fig. 1. Left: The bed used in the experiment to produce the ground truth trajectory. Right top: The first and the last images of Line and Hardline Right bottom: Two images from the circle the camera traced in Circle.
Hardline is derived from Line by discarding 2/3 of the images randomly, in order to obtain a nonconstant-velocity motion. The sequence Circle is acquired by a camera tracing a circle with a diameter of 73 cm (i.e. a circumference of 229 cm), and moving at a displacement of 3.17 mm/frame. It is the most challenging sequence of the entire set, as, unlike Hardline, not only the horizontal and forward components of the displacement, but also the direction changes. Figure 1 depicts the setup, and two images from each of the sequences. The pose estimation task involves recovering the pose and the orientation of the camera from the image sequences by using the PF-SLAM algorithm described in Sect. 2. Two map representations are compared: the exclusive use of the IDP3 and a hybrid CP-IDP3 scheme. The hybrid scheme involves converting an IDP3 landmark to the CP representation as soon as a measure of the linearity of the measurement function, the linearity index, proposed in [7], goes below 0.1 [7]. At a given time, the system may have both CP and IDP3 landmarks in the maps of the particles, hence the name hybrid CP-IDP3. It is related to [4] in the sense that both use the same landmark representation for initialization, however the hybrid CP-IDP3 employs the actual measurement model, hence is expected to perform better than [4]. Therefore, it is safe to state that the experiments compare IDP3 to an improved version of [4]. In the experiments, the number of particles is set to 2500, and both algorithms try to maintain 30 landmarks. Although this may seem low, given the capabilities of the contemporary monocular SLAM systems, since the main argument of this work is totally independent of the number of landmarks, the authors’ believe that denser maps would not enhance the discussion. Two criteria are used for the evaluation of the results: 1. Position error: Square root of the mean square error between the ground truth and the estimated trajectory, in milimeters. 2. Orientation error: The angle between the the estimated and the actual normals to the image plane (i.e., the principal axis vectors), in degrees. The results are presented in Table 1 and Figs. 2-4.
A 3-Component Inverse Depth Parameterization for Particle Filter SLAM
7
Table 1. Mean trajectory and principal axis errors Criterion
Line Hardline Circle IDP3 CP-IDP3 IDP3 CP-IDP3 IDP3 CP-IDP3 Mean trajectory error (mm) 7.58 11.25 8.15 12.46 22.66 39.87 Principal axis error (degrees) 0.31 0.57 0.24 0.54 0.36 0.48
The experiment results indicate that both schemes perform satisfactorily in Line and Hardline. The IDP3 performs slightly, but consistently, better in both position and orientation estimates, with an average position error below 1 cm. As for the orientation error, in both cases, the IDP3 yields an error oscillating around 0.3o , whereas, in the CP-IDP3, it grows towards 1o , as the camera moves. However, in Circle, the performance difference is much more pronounced: the IDP3 can follow the circle, the true trajectory, much closely than the CPIDP3. The average and peak differences are approximately 1.7 and 4 cm, respectively. The final error in both algorithms are less than 2% of the total path length. The superiority of the IDP3 can be primarily attributed to two factors: the nonlinearity of (8) and the relatively high nonlinearity of (6), when mj is represented via the CP, instead of the IDP [9]. The first issue affects the conversion from the CP to the IDP3. Since the transformation is nonlinear, the conversion of the uncertainty of an IDP landmark to the the corresponding CP landmark is not error-free. The second problem, the relative nonlinearity, implies that the accumulation of the linearization errors occurs at a higher rate in a CP landmark than an IDP landmark. Since the quality of the landmark estimates are reflected in the accuracy of the estimated pose [7], IDP3 performs better. The performance difference is not significant in Line (Fig. 4), a relatively easy sequence in which the constant translational and angular displacement assumptions are satisfied, as seen in Table 1. Although Hardline (Figs. 2, 3 and 4) is a more difficult sequence, the uncertainty in the translation component is still constrained to a line, and PF can cope with the variations in the total displacement magnitude. Besides, it is probably somewhat short to illustrate the effects of the drift: the diverging orientation error observed in Figs. 2, 3 and 4 is likely to cause problems in a longer sequence. However, in Circle (Figs. 2, 3 and 4), there is a considerable performance gap. It is a sequence in which neither the direction, nor the components of the displacement vector are constant. Therefore the violation of the constant displacement assumption is the strongest among all sequences. Moreover, at certain parts of the sequence, the camera motion has a substantial component along the principal axis vector of the camera, a case in which the nonlinear nature of (6) is accentuated. A closer study of Fig. 3 reveals that it is these parts of the sequence, especially in the second half of the trajectory, in which the IDP3 performs better than the CP-IDP3 scheme, due to its superior linearization.
8
˙ E. Imre and M.-O. Berger
Fig. 2. Top view of the trajectory and the structure estimates. Left:Hardline Right: Circle. G denotes the ground truth. Blue circles indicate the estimated landmarks.
Fig. 3. Trajectory and orientation estimates for Hardline (top) and Circle (bottom). Left: Trajectory. Right: Orientation, i.e, the principal axis. In order to prevent cluttering, the orientation estimates are downsampled by 4.
A 3-Component Inverse Depth Parameterization for Particle Filter SLAM
9
Fig. 4. Performance comparison of the IDP3 and the CP-IDP3 schemes. Top: Line. Middle: Hardline. Bottom: Circle. Left column is the Euclidean distance between the actual and the estimated trajectories. Right column is the angle between the actual and the estimated principal axis vectors.
10
5
˙ E. Imre and M.-O. Berger
Conclusion
The advantage the IDP offers over the CP, the relative amenability to linearization, is a prize that comes at the price of reduced representation efficiency, as the CP describes a landmark with the minimum number of components, whereas the IDP has redundant components. In this paper, the authors show that, this is not the case in PF-SLAM, i.e., the IDP is effectively as efficient as the CP, by exploiting the fact that in a PF-SLAM system, for each particle, the trajectory is given, i.e., has no uncertainty, therefore, any pose-related parameters can be removed from the landmark EKFs. This allows the use of the IDP throughout the entire estimation procedure. In addition to reducing the linearization errors, this parameterization strategy removes the need for a separate feature initialization procedure, hence also reduces the system complexity, and eliminates the errors introduced in transferring the uncertainty from one parameterization to another. The experimental results demonstrate the superiority of the proposed approach to a hybrid CP-IDP scheme.
References 1. Montemerlo, M.: FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping, Ph. D. dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2003) 2. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Analysis and Machine Intelligence 29(6), 1052–1067 (2007) 3. Jin, H., Favaro, P., Soatto, S.: A Semi-Direct Approach to Structure from Motion. The Visual Computer 19(6), 377–394 (2003) 4. Eade, E., Drummond, T.: Scalable Monocular SLAM. In: CVPR 2006, pp. 469–476 (2006) 5. Durrant-Whyte, H., Bailey, T.: Simultaneous Localization and Mapping: Part I. IEEE Robotics and Automation Mag. 13(2), 9–110 (2006) 6. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual Modeling with a Hand-Held Camera. Intl. J. Computer Vision 59(3), 207–232 (2004) 7. Civera, J., Davison, A.J., Montiel, J.M.M.: Inverse Depth to Depth Conversion for Monocular SLAM. In: ICRA 2007, pp. 2778–2783 (2007) 8. Rosten, E., Drummond, T.: Fusing Points and Lines for High Performance Tracking. In: ICCV 2005, pp. 1508–1515 (2005) 9. Civera, J., Davison, A.J., Montiel, J.M.M.: Unified Inverse Depth Parameterization for Monocular SLAM. In: RSS 2006 (2006) 10. Davison, A.J.: Real-Time Simultaneous Localization and Mapping with a Single Camera. In: ICCV 2003, vol. 2, pp. 1403–1410 (2003)
An Efficient Linear Method for the Estimation of Ego-Motion from Optical Flow
Florian Raudies and Heiko Neumann
Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany
Abstract. Approaches to visual navigation, e.g. used in robotics, require computationally efficient, numerically stable, and robust methods for the estimation of ego-motion. One of the main problems for ego-motion estimation is the segregation of the translational and rotational components of ego-motion in order to utilize the translation component, e.g. for computing the spatial navigation direction. Most of the existing methods solve this segregation task by formulating a nonlinear optimization problem. One exception is the subspace method, a well-known linear method, which applies a computationally high-cost singular value decomposition (SVD). In order to be computationally efficient, a novel linear method for the segregation of translation and rotation is introduced. For robust estimation of ego-motion the new method is integrated into the Random Sample Consensus (RANSAC) algorithm. Different scenarios show perspectives of the new method compared to existing approaches.
1 Motivation
For many applications visual navigation and ego-motion estimation are of prime importance. Here, processing starts with the estimation of optical flow, using a monocular spatio-temporal image sequence as input, followed by the estimation of ego-motion. Optical flow fields generated by ego-motion of the observer become more complex if one or multiple objects move independently of the ego-motion. A challenging task is to segregate such independently moving objects (IMOs); MacLean et al. proposed a combination of ego-motion estimation and the Expectation Maximization (EM) algorithm for this purpose [15]. With this algorithm a single motion model is estimated for the ego-motion and each IMO using the subspace method [9]. A key functionality of the subspace method is the possibility to cluster ego-motion and the motion of IMOs. More robust approaches assume noisy flow estimates besides IMOs when estimating ego-motion with the EM algorithm [16,5]. Generally, the EM algorithm uses an iterative computational scheme, and in each iteration the evaluation of the method estimating ego-motion is required. This necessitates a computationally highly efficient algorithm for the estimation of ego-motion in real-time applications. So far, many of the ego-motion algorithms introduced in the past lack this property of computational efficiency.
Bruss and Horn derived a bilinear constraint to estimate ego-motion by utilizing a quadratic Euclidean metric to calculate errors between input flow and model flow [3]. The method is linear w.r.t. either translation or rotation and independent of depth. This bilinear constraint was used throughout the last two decades for ego-motion estimation: (i) Heeger and Jepson built their subspace method upon this bilinear constraint [9]. (ii) Chiuso et al. used a fix-point iteration to optimize between rotation (based on the bilinear constraint), depth, and translation [4], and Pauwels and Van Hulle used the same iteration mechanism optimizing for rotation and translation (both based on the bilinear constraint) [16]. (iii) Zhang and Tomasi as well as Pauwels and Van Hulle used a Gauss-Newton iteration between rotation, depth, and translation [20,17]. In detail, the method of (i) needs a singular value decomposition, and the methods of (ii) and (iii) iterative optimization techniques. Here, a novel linear approach for the estimation of ego-motion is presented. Our approach utilizes the bilinear constraint, the basis of many nonlinear methods. Unlike these previous methods, here a linear formulation is achieved by introducing auxiliary variables. In turn, with this linear formulation a computationally efficient method is defined. Section 2 gives a formal description of the instantaneous optical flow model. This model serves as the basis to derive our method, outlined in Section 3. An evaluation of the new method in different scenarios and in comparison to existing approaches is given in Section 4. Finally, Section 5 discusses our method in the context of existing approaches [3,9,11,20,16,18] and Section 6 gives a conclusion.
2 Model of Instantaneous Ego-Motion
Von Helmholtz and Gibson introduced the definition of optical flow as moving patterns of light falling upon the retina [10,8]. Following this definition, Longuet-Higgins and Prazdny gave a formal description of optical flow which is based on a model of instantaneous ego-motion [13]. In their description they used a pinhole camera with the focal length f which projects 3-d points (X, Y, Z) onto the 2-d image plane, formally (x, y) = f/Z · (X, Y). Ego-motion composed of the translation T = (t_x, t_y, t_z)^t and rotation R = (r_x, r_y, r_z)^t causes the 3-d instantaneous displacement

(Ẋ, Ẏ, Ż)^t = −(t_x, t_y, t_z)^t − (r_x, r_y, r_z)^t × (X, Y, Z)^t,

where dots denote the first temporal derivative and t the transpose operator. Using this model, movements of projected points on the 2-d image plane have the velocity (matrix rows separated by semicolons)

V := (u, v)^t = (1/Z) [ −f  0  x ; 0  −f  y ] T + (1/f) [ xy  −(f² + x²)  f y ; f² + y²  −xy  −f x ] R.    (1)
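As a reading aid for (1), the following small function (Python/NumPy; an illustrative sketch, not the authors' implementation) evaluates the model flow at an image location (x, y) for a given depth Z, translation T and rotation R:

import numpy as np

def model_flow(x, y, Z, T, R, f):
    # translational and rotational parts of Equation (1)
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])
    B = np.array([[x * y, -(f**2 + x**2), f * y],
                  [f**2 + y**2, -x * y, -f * x]])
    return (A @ np.asarray(T)) / Z + (B @ np.asarray(R)) / f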
3 Linear Method for Ego-Motion Estimation
Input flow, e.g. estimated from a spatio-temporal image sequence, is denoted by Vˆ , while the model flow is defined as in Equation 1. Now, the problem is to find
parameters of the model flow which describe the given flow V̂ best. Namely, these parameters are the scenic depth Z, the translation T and the rotation R. Based on Equation 1, many researchers studied non-linear optimization problems to estimate ego-motion [3,20,4,18]. Moreover, most of these methods have a statistical bias, which means that the methods produce systematic errors for isotropic noisy input [14,16]. Unlike these approaches, we suggest a new linearized form based on Equation 1 and show how to solve this form computationally efficiently with a new method. Further, this method can be unbiased. The new method is derived in three consecutive steps: (i) the algebraic transformation of Equation 1 which is independent of depth Z, (ii) a formulation of an optimization problem for translation and auxiliary variables, and (iii) the removal of a statistical bias. The calculation of the rotation R with the translation T known is then a simple problem.
Depth independent constraint equation. Bruss and Horn formulated an optimization problem with respect to depth Z which optimizes the squared Euclidean distance of the residual vector between the input flow vector V̂ = (û, v̂)^t and the model flow vector V defined in Equation 1. Inserting the optimized depth into Equation 1, they derived the so-called bilinear optimization constraint. An algebraic transformation of this constraint is

0 = (t_x, t_y, t_z) ( (f v̂, −f û, y û − x v̂)^t − [ −(f² + y²)  xy  f x ; xy  −(f² + x²)  f y ; f x  f y  −(x² + y²) ] (r_x, r_y, r_z)^t ),    (2)

where the 3-vector (f v̂, −f û, y û − x v̂)^t is denoted by M and the symmetric 3×3 matrix by H (rows separated by semicolons),
which Heeger and Jepson describe during the course of their subspace construction. In detail, they use a subspace which is orthogonal to the base polynomial defined by entries of the matrix H(x_i, y_i), i = 1..m, where m denotes the finite number of constraints employed [9].
Optimization of translation. Only a linearly independent part of the base polynomial H is used for optimization. We chose the upper triangular matrix together with the diagonal of matrix H. These entries are summarized in the vector E := (−(f² + y²), xy, f x, −(f² + x²), f y, −(x² + y²))^t. To achieve a linear form of Equation 2, auxiliary variables (t_x·r_x, t_x·r_y, t_x·r_z, t_y·r_y, t_y·r_z, t_z·r_z)^t := K are introduced. With respect to E and K the linear optimization problem

F(V̂; T, K(T)) := ∫_{Ω_x} [T^t M + K^t E]² dx  →  min over T, K(T),    (3)
is defined, integrating constraints over all locations x = (x, y) ∈ Ω_x ⊂ ℝ² of the image plane. This image plane is assumed to be continuous and finite. Calculating partial derivatives of F(V̂; T, K(T)) and equating them to zero leads to the linear system of equations
0 = ∫_{Ω_x} [T^t M + K^t E] · E^t dx,    (4)

0 = ∫_{Ω_x} [T^t M + K^t E] · [M + ∂(K^t E)/∂T]^t dx,    (5)
consisting of nine equations and nine variables in K and T. Solving Equation 4 with respect to K and inserting the result, as well as the partial derivative for the argument T of expression K, into Equation 5 results in the homogeneous linear system of equations

0 = T^t ∫_{Ω_x} L_i L_j dx =: T^t C,   i, j = 1..3,   with
L_i := M_i − (D E)^t ∫_{Ω_x} E M_i dx,   i = 1..3,   and   D := [ ∫_{Ω_x} E E^t dx ]^{-1} ∈ ℝ^{6×6}.    (6)
A robust (non-trivial) solution for such a system is given by the eigenvector which corresponds to the smallest eigenvalue of the 3 × 3 scatter matrix C [3].
Removal of statistical bias. All methods which are based on the bilinear constraint given in Equation 2 are statistically biased [9,11,14,18]. To calculate this bias we define an isotropic noisy input by the vector Ṽ := (û, v̂) + (n_u, n_v), with components n_u and n_v ∈ N(μ = 0, σ) normally distributed. A statistical bias is inferred by studying the expectation value <·> of the scatter matrix C̃. This scatter matrix is defined by inserting the noisy input flow Ṽ into Equation 6. This gives <C̃> = <C> + σ² N with

N = [ f  0  −f<x> ; 0  f  −f<y> ; −f<x>  −f<y>  <x² + y²> ],    (7)
using the properties <n_u> = <n_v> = 0 and <n_u²> = <n_v²> = σ². Several procedures to remove the bias term σ²N have been proposed. For example, Kanatani suggested a method of renormalization, subtracting the bias term on the basis of an estimate of σ² [11]. Heeger and Jepson used dithered constraint vectors and defined a roughly isotropic covariance matrix with these vectors. MacLean used a transformation of constraints into a space where the influence of noise is isotropic [14]. Here, the last approach is used, due to its computational efficiency. In a nutshell, to solve Equation 6 considering noisy input we calculate the eigenvector which corresponds to the smallest eigenvalue of matrix C̃. Prewhitening of the scatter matrix C̃ gives Č := N^{-1/2} C̃ N^{-1/2}. Then the influence of noise is isotropic, namely σ²I, where I denotes a 3 × 3 unity matrix. The newly defined eigenvalue problem Č x = (λ + σ²) x preserves the ordering of λ and eigenvectors N^{-1/2} x compared to the former eigenvalue problem C x = λ x. Then the solution is constructed with the eigenvector of matrix Č which corresponds to the smallest eigenvalue. Finally, this eigenvector has to be multiplied by N^{-1/2}.
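Assuming the flow is given on a discrete pixel grid, so that the integrals become sums, the complete linear estimator of this section reduces to a few lines of linear algebra. The following sketch (Python/NumPy; function name and array layout are assumptions for illustration) builds the constraint vectors M and E per pixel, accumulates the scatter matrix C as in (6), applies the prewhitening with N as in (7), and returns the translation direction as the eigenvector belonging to the smallest eigenvalue:

import numpy as np

def estimate_translation(xs, ys, us, vs, f):
    # xs, ys: pixel coordinates (1-D arrays, centred on the principal point)
    # us, vs: measured optical flow at those locations; f: focal length
    # per-pixel constraint vectors, cf. (2): M (3 entries) and E (6 entries, upper triangle of H)
    M = np.stack([f * vs, -f * us, ys * us - xs * vs], axis=1)
    E = np.stack([-(f**2 + ys**2), xs * ys, f * xs,
                  -(f**2 + xs**2), f * ys, -(xs**2 + ys**2)], axis=1)
    D = np.linalg.inv(E.T @ E)
    # L_i = M_i - (D E)^t sum(E M_i), cf. (6)
    S = E.T @ M
    L = M - (E @ D) @ S
    C = L.T @ L
    # bias removal: prewhiten C with N, cf. (7); assumes N is positive definite
    mx, my, mr2 = xs.mean(), ys.mean(), (xs**2 + ys**2).mean()
    N = np.array([[f, 0.0, -f * mx],
                  [0.0, f, -f * my],
                  [-f * mx, -f * my, mr2]])
    w, V = np.linalg.eigh(N)
    N_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    C_check = N_inv_sqrt @ C @ N_inv_sqrt
    w2, V2 = np.linalg.eigh(C_check)
    T = N_inv_sqrt @ V2[:, 0]  # eigenvector of the smallest eigenvalue, mapped back
    return T / np.linalg.norm(T)

The rotation can subsequently be recovered by solving the constraint (2) linearly for R with the estimated T held fixed, which is the simple remaining problem mentioned above.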
4 Results
To test the proposed method for ego-motion estimation in different configurations we use two sequences, the Yosemite sequence1 and the Fountain sequence2 . In the Yosemite sequence a flight through a valley is simulated, specified by T = (0, 0.17, 0.98)·34.8 px and R = (1.33, 9.31, 1.62)·10−2 deg/frame [9]. In the Fountain sequence the curvilinear motion with T = (−0.6446, 0.2179, 2.4056) and R = (−0.125, 0.20, −0.125) deg/frame is performed. The (virtual) camera employed to gather images has a vertical field of view of 40 deg and a resolution of 316 × 252 for the Yosemite sequence and 320 × 240 for the Fountain sequence. All methods included in our investigation have a statistical bias which is removed with the technique of MacLean [14]. The iterative method of Pauwels and Van Hulle [18] employs a fix-point iteration mechanism using a maximal number of 500 iterations and 15 initial values for the translation direction, randomly distributed on the positive hemisphere [18]. Numerical stability. To show numerical stability we use the scenic depth of the Fountain sequence (5th frame) with a quarter of the full resolution to test different ego-motions. These ego-motions are uniformly distributed in the range of ±40 deg azimuth and elevation in the positive hemisphere. Rotational components for pitch and yaw are calculated by fixating the central point and compensating translation by rotation. An additional roll component of 1 deg/frame is superimposed. With scenic depth values and ego-motion given, optical flow is calculated by Equation 1. This optical flow is systematically manipulated by applying two different noise models: a Gaussian and an outlier noise model. The Gaussian noise model was former specified in Section 3. In the outlier noise model a percentage, denoted by ρ, of all flow vectors are replaced by a randomly constructed vector. Each component of this vector is drawn from a uniformly distributed random variable. The interval of this distribution is defined by the negative and positive of the mean length of all flow vectors. Outlier noise models sparsely distributed gross errors, e.g. caused by correspondences that were incorrectly estimated. Applying a noise model to the input flow, the estimation of ego-motion becomes erroneous. These errors are reported by: (i) the angular difference between two translational 3-d vectors, whereas one is the estimated vector and the other the ground-truth vector, and (ii) the absolute value of the difference for each rotational component. Again, differences are calculated between estimate and ground-truth. Figure 1 shows errors of ego-motion estimation applying the Gaussian and the outlier noise model. All methods show numerical stability, whereas the mean translational error is lower than approximately 6 deg for both noise models. The method of Pauwels and Van Hulle performs best compared to the other methods. Better performance is assumed to be achieved by employing numerical fix-point iteration with different initial values randomly chosen within the search space. 1 2
1 Available via anonymous ftp from ftp.csd.uwo.ca in the directory pub/vision.
2 Provided at http://www.informatik.uni-ulm.de/ni/mitarbeiter/FRaudies.
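For reference, the two noise manipulations of the input flow and the translational error measure described above can be written compactly as follows (Python/NumPy sketch; array layout and names are illustrative assumptions):

import numpy as np

def add_gaussian_noise(flow, sigma, rng):
    # isotropic Gaussian noise on each flow component
    return flow + rng.normal(0.0, sigma, flow.shape)

def add_outlier_noise(flow, rho, rng):
    # replace a fraction rho of the vectors by uniform vectors bounded by the mean flow length
    noisy = flow.copy()
    idx = rng.choice(flow.shape[0], size=int(rho * flow.shape[0]), replace=False)
    m = np.linalg.norm(flow, axis=1).mean()
    noisy[idx] = rng.uniform(-m, m, (idx.size, flow.shape[1]))
    return noisy

def translation_angular_error_deg(T_est, T_gt):
    # angular difference between estimated and ground-truth 3-d translation vectors
    c = np.dot(T_est, T_gt) / (np.linalg.norm(T_est) * np.linalg.norm(T_gt))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))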
Fig. 1. All methods employed show numerical stability in the presence of noise due to small translational and rotational errors (not shown). In detail, a) shows the mean angular error for Gaussian noise and c) for outlier noise. Graphs in b) and d) show the corresponding standard deviation, respectively. The parameter σ is specified with respect to the image height. Mean and standard deviation are calculated for a number of 50 trials. Table 1. Errors for estimated optical and ground-truth input flow of the proposed method. In case of the Yosemite sequence which contains the independently moving cloudy sky the RANSAC paradigm is employed which improves ego-motion estimates (50 trials, mean and ± standard deviation shown). (x) denotes the angle calculated between estimated and ground-truth 3-d translational vectors.
                     translation              rotation
sequence             ∠(T_est, T_gt) [deg]     |Δr_x| [deg]          |Δr_y| [deg]           |Δr_z| [deg]
estimated optical flow; Brox et al. [2]; 100% density
Fountain             4.395                    0.001645              0.0286                 0.02101
Yosemite             4.893                    0.02012               0.1187                 0.1153
estimated optical flow; Farnebaeck [6]; 100% density
Fountain             6.841                    0.01521               0.05089                0.025
Yosemite             4.834                    0.03922               0.00393                0.07636
estimated optical flow; Farnebaeck [6]; 25% density
Fountain             1.542                    0.0008952             0.01349                0.003637
Yosemite             1.208                    0.007888              0.01178                0.02633
Yosemite (RANSAC)    1.134 ± 0.2618           0.01261 ± 0.002088    0.008485 ± 0.002389    0.02849 ± 0.003714
ground-truth optical flow; 25% of full resolution
Fountain             0.0676                   0.000259              8.624e-006             0.0007189
Yosemite             5.625                    0.02613               0.1092                 0.06062
Yosemite (RANSAC)    1.116 ± 1.119            0.01075 ± 0.01021     0.004865 ± 0.006396    0.02256 ± 0.009565
Estimated optical flow as input. We test our method on the basis of optical input flow estimated by two different methods. First, we utilize the tensor-based method of Farnebaeck together with an affine motion model [6] to estimate optical flow. The spatio-temporal tensor is constructed by projecting the input signal to a set of base polynomials of finite Gaussian support (σ = 1.6 px and length l = 11 px, γ = 1/256). Spatial
averaging of the resulting tensor components is performed with a Gaussian filter (σ = 6.5 px and l = 41 px). Second, optical flow is estimated with the affine warping technique of Brox et al. [2]. Here, we implemented the 2-d version of the algorithm and used the following parameter values: α = 200, γ = 100, ε = 0.001, σ = 0.5, η = 0.95, 77 outer fixed-point iterations, and 10 inner fixed-point iterations. The partial differential equations are solved numerically by Successive Over-Relaxation (SOR) with parameter ω = 1.8 and 5 iterations. Errors of optical flow estimation are reported by the 3-d mean angular error defined by Barron and Fleet [1]. Measured by this angular error, optical flow for frame pair 8–9 (counting from index 0) of the Yosemite sequence is estimated with 5.41 deg accuracy by the method of Farnebaeck and with 3.54 deg by the method of Brox. For frame pair 5–6 of the Fountain sequence the mean angular error is 2.49 deg when estimating flow with Farnebaeck's method and 2.54 deg with the method of Brox. All errors refer to a density of 100% of the optical flow data. Table 1 lists errors of ego-motion estimation for different scenarios. Comparing the first two parts of the table, we conclude that a high accuracy of the optical flow estimates does not necessarily yield a high accuracy in the estimation of ego-motion. In detail, the error of ego-motion estimation depends on the error characteristic (spatial distribution and magnitude of errors) within the estimated optical flow field; this characteristic is not expressed by the mean angular error. One way to reduce the dependency on the error characteristic is to reduce the data set, leaving out the most erroneous data points. Generally, this requires (i) an appropriate confidence measure to evaluate the validity or reliability of data points, and (ii) a strategy to avoid linear dependency of the resulting data w.r.t. ego-motion estimation. Farnebaeck describes how to calculate a confidence value in his thesis [6]. Here, this confidence is used to thin out flow estimates; we retain 25% of all estimates, enough to avoid linear dependency for our configurations. The ego-motion estimation errors are then reduced, as can be observed in the third part of Table 1. In the case of the Yosemite sequence, sparsification has a helpful side effect: the cloud motion is estimated by the method of Farnebaeck with low accuracy and confidence, so no estimates from the cloudy sky are contained in the data set for the estimation of ego-motion. In the last part of Table 1, ground-truth optical flow is utilized to estimate ego-motion. In this case the cloudy sky is present in the data set and thus deflects the ego-motion estimates, e.g. the translational angular error amounts to 5.6 deg. To handle IMOs we use the RANSAC algorithm [7]. In a nutshell, the idea of the algorithm is to obtain an estimate that is based on non-erroneous data points only. Therefore, initial estimates are computed on different randomly selected subsets of all data points, which are then enlarged with further consistent data points. The algorithm stops once an estimate is found that is supported by a data set of a certain cardinality. For the ground-truth flow of the Yosemite sequence, this method succeeds in estimating ego-motion; the translational angular error is then 1.116 deg (mean value).
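The RANSAC loop sketched above can be outlined as follows. This is a generic sketch with our own naming; the estimator, the inlier test, and the thresholds are placeholders and not the exact choices used in the paper.

```python
import numpy as np

def ransac_egomotion(flow_points, estimate_egomotion, residual,
                     sample_size=10, n_trials=100, inlier_thresh=0.05,
                     min_support=0.5, rng=None):
    """Generic RANSAC loop: fit ego-motion on random subsets, grow the subset with
    consistent points, and stop once enough of the data supports one estimate."""
    rng = rng or np.random.default_rng()
    n = len(flow_points)
    best_model, best_support = None, 0
    for _ in range(n_trials):
        subset = rng.choice(n, size=sample_size, replace=False)
        model = estimate_egomotion(flow_points[subset])      # initial estimate
        errs = residual(model, flow_points)                  # per-point residuals
        inliers = np.flatnonzero(errs < inlier_thresh)       # consistent points
        if len(inliers) > best_support:
            # re-estimate on the enlarged, outlier-free set
            best_model = estimate_egomotion(flow_points[inliers])
            best_support = len(inliers)
        if best_support >= min_support * n:                   # sufficient cardinality
            break
    return best_model
```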
5 Discussion
A novel linear optimization method was derived to solve the segregation of the translational and the rotational component, one of the main problems in computational ego-motion estimation [3,9,13].
Related work. A well-known linear method for ego-motion estimation is the subspace method [9]. Unlike our method, Heeger and Jepson estimate the translation in a subspace that is independent of the rotational part, using only m − 6 of m constraints. In the method proposed here, all constraints are used, which leads to more robust estimates. Zhuang et al. formulated a linear method for the segregation of translation and rotation employing the instantaneous motion model together with the epipolar constraint [21]. They introduced auxiliary variables as a superposition of translation and rotation, then optimized w.r.t. these variables and the translation, and in a last step reconstructed the rotation from the auxiliary variables. Unlike their method, we used the bilinear constraint for optimization, defined the auxiliary variables differently, split up the optimization for rotation and translation, and finally had to solve only a 3 × 3 eigenvalue problem for the translation estimation, instead of the 9 × 9 eigenvalue problem of Zhuang's approach. Moreover, this different optimization strategy allowed us to incorporate the method of MacLean to remove a statistical bias, which is not possible for the method of Zhuang.
Complexity. To achieve real-time capability in applications, a low computational complexity is vital. Existing methods for the estimation of ego-motion have a higher complexity than our method (compare Table 2). For example, [9] employs a singular value decomposition of an m × 6 matrix, and [4,18,20] employ iterative methods to solve nonlinear optimization problems. Comparable to our method in terms of computational complexity is the method of Kanatani [11]. Unlike our approach, this method is based on the epipolar constraint.
Numerical stability. We showed that the optimization method is robust against noise, compared to other ego-motion algorithms [11,18].

Table 2. Average (1000 trials) computing times [msec] of methods estimating ego-motion, tested with a C++ implementation on a Windows XP platform, Intel Core 2 Duo T9300. (∗) This algorithm employs a maximal number of 500 iterations and 15 initial values.

                                           number of vectors
method                                     25      225     2,025    20,164   80,089
new proposed method (unbiased)             0.05    0.06    0.34     4.56     22.16
Kanatani (unbiased)                        0.03    0.11    0.78     7.56     29.20
Heeger & Jepson (unbiased)                 0.08    2.44    399.20   n.a.     n.a.
Pauwels & Van Hulle, 2006 (unbiased)(∗)    0.16    0.81    6.90     66.87    272.95
Furthermore, the technique of pre-whitening is applied to our method to remove a statistical bias as well. This technique was proposed by MacLean [14] for bias removal in the subspace algorithm of Heeger and Jepson [9] and is also used in the method of Pauwels and Van Hulle for the fixed-point iteration, which alternates between coupled estimates of the translation and the rotation of ego-motion [17]. Unlike other unbiasing techniques, MacLean's technique requires neither an estimate of the noise characteristic nor an iterative mechanism. With the statistical bias removed, the methods are consistent in the sense of Zhang and Tomasi's definition of consistency [20].
Outlier detection. To detect outliers in ego-motion estimation, in particular IMOs, several methods have been suggested, namely frameworks employing the EM algorithm [15,5], the Collinear Point Constraint [12], and the RANSAC algorithm [19]. In accordance with the conclusion of Torr's thesis, which found that the RANSAC algorithm performs best in motion segmentation and outlier detection, we chose RANSAC to achieve robust ego-motion estimation.
6 Conclusion
In summary, we have introduced a novel method for the separation of translation and rotation in the computation of ego-motion. Due to its simplicity the method has a very low computational complexity and is thus faster than existing estimation techniques (Table 2). First, we tested our method with a computed optical flow field, for which ego-motion can be estimated exactly. Under noisy conditions the results show the numerical stability of the optimization method and its comparability with existing methods for the estimation of ego-motion. In more realistic scenarios utilizing estimated optical flow, ego-motion can be estimated with high accuracy. Future work will employ temporal integration of ego-motion estimates within the processing of an image sequence. This should stabilize ego-motion and optical flow estimation by exploiting the spatio-temporal coherence of the visually observable world.
Acknowledgements. Stefan Ringbauer kindly provided a computer graphics ray-tracer utilized to generate images and ground-truth flow for the Fountain sequence. This research has been supported by a scholarship given to F.R. from the Graduate School of Mathematical Analysis of Evolution, Information and Complexity at Ulm University.
References 1. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. Int. J. of Comp. Vis. 12(1), 43–77 (1994) 2. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
3. Bruss, A.R., Horn, B.K.P.: Passive navigation. Comp. Vis., Graph., and Im. Proc. 21, 3–20 (1983) 4. Chiuso, A., Brockett, R., Soatto, S.: Optimal structure from motion: Local ambiguities and global estimates. Int. J. of Comp. Vis. 39(3), 195–228 (2000) 5. Clauss, M., Bayerl, P., Neumann, H.: Segmentation of independently moving objects using a maximum-likelihood principle. In: Lafrenz, R., Avrutin, V., Levi, P., Schanz, M. (eds.) Autonome Mobile Systeme 2005, Informatik Aktuell, pp. 81–87. Springer, Berlin (2005) 6. Farnebaeck, G.: Polynomial expansion for orientation and motion estimation. PhD thesis, Dept. of Electrical Engineering, Linkoepings universitet (2002) 7. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24(6), 381–395 (1981) 8. Gibson, J.J.: The Perception of the Visual World. Houghton Mifflin, Boston (1950) 9. Heeger, D.J., Jepson, A.D.: Subspace methods for recovering rigid motion i: Algorithm and implementation. Int. J. of Comp. Vis. 7(2), 95–117 (1992) 10. Helmholtz, H.: Treatise on physiological optics. In: Southhall, J.P, (ed.) (1925) 11. Kanatani, K.: 3-d interpretation of optical-flow by renormalization. Int. J. of Comp. Vis. 11(3), 267–282 (1993) 12. Lobo, N.V., Tsotsos, J.K.: Computing ego-motion and detecting independent motion from image motion using collinear points. Comp. Vis. and Img. Underst. 64(1), 21–52 (1996) 13. Longuet-Higgins, H.C., Prazdny, K.: The interpretation of a moving retinal image. Proc. of the Royal Soc. of London. Series B, Biol. Sci. 208(1173), 385–397 (1980) 14. MacLean, W.J.: Removal of translation bias when using subspace methods. IEEE Int. Conf. on Comp. Vis. 2, 753–758 (1999) 15. MacLean, W.J., Jepson, A.D., Frecker, R.C.: Recovery of egomotion and segmentation of independent object motion using the EM algorithm. Brit. Mach. Vis. Conf. 1, 175–184 (1994) 16. Pauwels, K., Van Hulle, M.M.: Segmenting independently moving objects from egomotion flow fields. In: Proc. of the Early Cognitive Vision Workshop (ECOVISION 2004), Isle of Skye, Scotland (2004) 17. Pauwels, K., Van Hulle, M.M.: Robust instantaneous rigid motion estimation. Proc. of Comp. Vis. and Pat. Rec. 2, 980–985 (2005) 18. Pauwels, K., Van Hulle, M.M.: Optimal instantaneous rigid motion estimation insensitive to local minima. Comp. Vis. and Im. Underst. 104(1), 77–86 (2006) 19. Torr, P.H.S.: Outlier Detection and Motion Segmentation. PhD thesis, Engineering Dept., University of Oxford (1995) 20. Zhang, T., Tomasi, C.: Fast, robust, and consistent camera motion estimation. Proc. of Comp. Vis. and Pat. Rec. 1, 164–170 (1999) 21. Zhuang, X., Huang, T.S., Ahuja, N., Haralick, R.M.: A simplified linear optic flowmotion algorithm. Comp. Graph. and Img. Proc. 42, 334–344 (1988)
Localised Mixture Models in Region-Based Tracking Christian Schmaltz1 , Bodo Rosenhahn2 , Thomas Brox3 , and Joachim Weickert1 1
Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Building E1 1, Saarland University, 66041 Saarbrücken, Germany {schmaltz,weickert}@mia.uni-saarland.de 2 Leibniz Universität Hannover, 30167 Hannover, Germany
[email protected] 3 University of California, Berkeley, CA, 94720, USA
[email protected] Abstract. An important problem in many computer vision tasks is the separation of an object from its background. One common strategy is to estimate appearance models of the object and background region. However, if the appearance is spatially varying, simple homogeneous models are often inaccurate. Gaussian mixture models can take multimodal distributions into account, yet they still neglect the positional information. In this paper, we propose localised mixture models (LMMs) and evaluate this idea in the scope of model-based tracking by automatically partitioning the fore- and background into several subregions. In contrast to background subtraction methods, this approach also allows for moving backgrounds. Experiments with a rigid object and the HumanEva-II benchmark show that tracking is remarkably stabilised by the new model.
1 Introduction
In many image processing tasks such as object segmentation or tracking, it is necessary to distinguish between the region of interest (foreground) and its background. Common approaches, such as MRFs or active contours, build appearance models of both regions with their parameters being learnt either from a-priori data or from the images [1,2,3]. Various types of features can be used to build the appearance model. Most common are brightness and colour, but any dense feature set such as texture descriptors [4] or motion [5] can be part of the model. Apart from the considered features, the statistical model of the region is of great interest. In simple cases, one assumes a Gaussian distribution in each region. However, since object regions usually change their appearance locally, such a Gaussian model is too inaccurate. A typical example is the black and white stripes of a zebra, which lead to a Gaussian distribution with a grayish mean
We gratefully acknowledge funding by the German Research Foundation (DFG) under the project We 2602/5-1.
Fig. 1. Left: Illustrative examples of situations where object (to be further specified by a shape prior) and background region are not well modelled by identically distributed pixels. In (a), red points are more likely in the background. Thus, the hooves of the giraffe will not be classified correctly. In (b), the dark hair and parts of the body are more likely to belong to the background. Localised distributions can model these cases more accurately. Right: Object model used by the tracker in one of our experiments (c) and decomposition of the object model into three different components (d), as proposed by the automatic splitting algorithm from [6]. There are 22 joint angles in the model, resulting in a total of 28 parameters that must be estimated.
that describes neither the black nor the white part very well. In order to deal with such cases, Gaussian mixture models or kernel density models have been proposed. These models are much more general, yet they still impose the assumption of identically distributed pixels in each region, i.e., they ignore positional information. The left part of Fig. 1 shows two examples where this is insufficient. In contrast, a model which is sensitive to the location in the image was proposed in [7]. The region statistics are estimated for each point separately, thereby considering only information from the local neighbourhood. Consequently, the distribution varies smoothly within a region. A similar local statistical model was used in [8]. A drawback of this model is that it blurs across discontinuities inside the region. As the support of the neighbourhood needs to be sufficiently large to reliably estimate the parameters of the local distributions, this blurring can be quite significant. This is especially true when using local kernel density models, which require more data than a local Gaussian model. The basic idea in the present paper is to segment the regions into subregions inside which a statistical model can be estimated. Similar to the above local region statistics, the distribution model integrates positional information. The support for estimating the distribution parameters is usually much larger, though, as it considers all pixels from the subregion. Splitting the background into subregions and employing a kernel density estimator in each of them allows for a very precise region model relying on enough data for parameter estimation. Related to this concept are Gaussian mixture models in the context of background subtraction. Here, the mixture parameters are not estimated in a spatial neighbourhood but from data along the temporal axis. This leads to models which include very accurate positional information [9]. In [10], an overview of several possible background models ranging from very simple to complex models
is given. The learned statistics from such models can also be combined with a conventional spatially global model as proposed in [11]. For background subtraction, however, the parameters are learned in advance, i.e., a background image or images with little motion and without the object must be available. Such limitations are not present in our approach. In fact, our experiments show that background subtraction and the proposed localised mixture model (LMM) are in some sense complementary and can be combined to improve results in tracking. Also note that, in contrast to image labelling approaches that also split the background into different regions, such as [12], no learning step is necessary. A general problem that arises when making statistical models more and more precise is the increasing number of local optima in the corresponding cost functions. In Fig. 1 there is actually no reason to assign the red hooves to the giraffe region or the black hair to the person. A shape prior and/or a close initialisation of the contour is required to properly define the object segmentation problem. For this reason we focus in this paper on the field of model-based tracking, where both a shape model and a good initial separation into foreground and background can be derived from the previous frame. In particular, we evaluated the model in silhouette-based 3-D pose tracking, where pose and deformation parameters of a 3-D object model are estimated such that the image is optimally split into object and background [13,6]. The model is generally applicable to any other contour-based tracking method as well. Another possible field of application is semi-supervised segmentation, where the user can incrementally improve the segmentation by manually specifying some parts of the image as foreground or background [1]. This can resolve the above ambiguities as well. Our paper is organised as follows: We first review the pose tracking approach used for evaluation. We then explain the localised mixture model (LMM) in Section 3. While the basic approach only works with static background images, we remove this restriction later in a more general approach. After presentation of our experimental data in Section 4, the paper is concluded in Section 5.
2 Foreground-Background Separation in Region-Based Pose Tracking
In this paper, we focus on tracking an articulated free-form surface consisting of rigid parts interconnected by predefined joints. The state vector χ consists of the global pose parameters (3-D shift and rotation) as well as n joint angles, similar to [14]. The surface model is divided into l different (not necessarily connected) components Mi , i = 1, . . . , l, as illustrated in Fig. 1. The components are chosen such that each component has a uniform appearance that differs from other components, as proposed in [6]. There are many more tracking approaches than the one presented here. We refer to the surveys [15,16] for an overview. Given an initial pose, the primary goal is to adapt the state vector such that the projections of the object parts lead to maximally homogeneous regions in the image. This is stated by the following cost function which is sought to be minimised in each frame:
Fig. 2. Example of a background segmentation. From left to right: (a) Background image. (b,c) K-means clustering with three and six clusters. (d,e) Level set segmentation with two different parameter settings.
E(χ) = − Σ_{i=0}^{l} ∫_Ω v_i(χ, x) P_{i,χ}(x) log p_{i,χ}(x) dx,   (1)
where Ω denotes the image domain. The appearance of each component i and of the background (i = 0) is modelled by a probability density function (PDF) p_i, i ∈ {0, . . . , l}. The PDFs of the object parts are modelled as kernel densities, whereas we will use the LMM for modelling the background, as explained later. P_{i,χ} is the indicator function for the projection of the i-th component M_i, i.e. P_{i,χ}(x) is 1 if and only if a part of the object with pose χ is projected to the image point x. In order to take occlusion into account, v_i(χ, x) : R^{6+n} × Ω → {0, 1} is a visibility function that is 1 if and only if the i-th object part is not occluded by another part of the object in the given pose. Visibility can be computed efficiently using OpenGL. The cost function can be minimised locally by a modified gradient descent. The PDFs are evaluated at silhouette points x_i of each projected model component. These points x_i are then moved along the normal direction of the projected object, either towards or away from the component, depending on which region's PDF fits better at that particular point. The point motion is transferred to the corresponding change of the state vector by using a point-based pose estimation algorithm as described, e.g., in [7].
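A minimal sketch of how such a region term can be evaluated on a discrete pixel grid is given below. The array layout and function names are our own illustration, not the authors' implementation, and we assume the projections, visibilities, and per-region densities have already been computed; the log-ratio criterion in the second function is one possible way to decide in which direction a silhouette point should move.

```python
import numpy as np

def region_cost(P, v, p, eps=1e-12):
    """Evaluate E(chi) from Eq. (1) on a pixel grid.

    P: (l+1, H, W) indicator maps of the projected components (index 0 = background)
    v: (l+1, H, W) visibility maps (1 where the component is not self-occluded)
    p: (l+1, H, W) appearance densities p_i evaluated at every pixel
    """
    return -np.sum(v * P * np.log(p + eps))

def silhouette_vote(p_obj, p_bg, eps=1e-12):
    """Positive values vote for 'object' at a silhouette point (contour should move
    outwards there), negative values vote for 'background'."""
    return np.log(p_obj + eps) - np.log(p_bg + eps)
```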
3 Localised Mixture Models
In the above approach, the object region is very accurately described by the object model, which is split into various parts that are similar in their appearance. Hence, the local change of appearance within the object region is taken well into account. The background region, however, is covered by a single appearance model, and positional changes of this appearance have so far been neglected. Consider a red-haired person facing the camera while standing on a red carpet. Then, only a very small part of the person is red, compared to a large part of the background. As a larger percentage of the pixels lying outside the person are red, red pixels will be classified as belonging to the outside region. Thus, the hair will be considered as not being part of the object, which deteriorates tracking. This happens despite the fact that the carpet is far away from the hair.
The idea to circumvent this problem is to separate the background into multiple subregions, each of which is modelled by its own PDF. This can be regarded as a mixture of PDFs, yet the mixture components exploit the positional information telling where the separate mixture components are to be applied.

3.1 Case I: Static Background Image Available
If a static background image is available, segmenting the background is quite simple. In contrast to the top-level task of object-background separation, the regions need not necessarily correspond to objects in the scene. Hence, virtually any multi-region segmentation technique can be applied for this. We tested a very simple one, the K-means algorithm [17,18], and a more sophisticated level set based segmentation, which considers multiple scales and includes a smoothness prior on the contour [19]. In the K-means algorithm the number of clusters is fixed, whereas the level set approach optimises the number of regions by a homogeneity criterion, which is steered by a tuning parameter. Thus, the number of subregions can vary. Fig. 2 compares the segmentation output of these two methods for two different parameter settings. The results with the level set method are much smoother due to the boundary length constraint. In contrast, the regions computed with K-means have more fuzzy boundaries. This can be disadvantageous, particularly when the localisation of the model is not precise due to a moving background, as considered in the next section. After splitting the background image into subregions, a localised PDF can be assembled from the PDFs estimated in each subregion j. Let L(x, y) denote the labelling obtained by the segmentation; we then obtain the density

p(x, y, s) = p_{L(x,y)}(s),   (2)

where s is any feature used for tracking. It makes most sense to use the same density model for the subregions as used in the segmentation method. In the case of K-means this means that we have a Gaussian distribution with fixed variance:

p_j^{kmeans}(s) ∝ exp( −(s − μ_j)² / 2 ),   (3)

where μ_j is the cluster centre of cluster j. The level set based segmentation method is built upon a kernel density estimator

p_j^{levelset}(s) = K_σ ∗ ( Σ_{(x,y)∈Ω_j} δ(s, I(x, y)) / |Ω_j| ),   (4)

where δ is the Dirac delta distribution and K_σ is a Gaussian kernel with standard deviation σ. Here, we use σ = √30. The PDF in (2) can simply be plugged into the energy in (1). Note that this PDF needs to be estimated only once for the background image and then stays fixed, whereas the PDFs of the object parts are re-estimated in each frame to account for the changing appearance.
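Constructing such a localised mixture model is easy to prototype. The sketch below uses K-means with Gaussian per-cluster densities as in Eq. (3); the class layout and the use of scikit-learn's KMeans are our own choices, not the authors' code. For the level set variant one would replace the per-cluster Gaussian by the kernel density estimate of Eq. (4).

```python
import numpy as np
from sklearn.cluster import KMeans

class LocalisedMixtureModel:
    """Localised mixture model for a static background image (sketch of Sec. 3.1).

    The background is partitioned with K-means; each subregion j keeps a Gaussian
    density with fixed variance around its cluster centre, cf. Eq. (3)."""

    def __init__(self, n_regions=6):
        self.n_regions = n_regions

    def fit(self, background):                       # background: (H, W, C) feature image
        h, w, c = background.shape
        feats = background.reshape(-1, c).astype(float)
        km = KMeans(n_clusters=self.n_regions, n_init=5).fit(feats)
        self.labels = km.labels_.reshape(h, w)       # the labelling L(x, y)
        self.centres = km.cluster_centers_           # the cluster centres mu_j
        return self

    def log_density(self, x, y, s):
        """log p(x, y, s): Gaussian with fixed (unit) variance around the centre of
        the subregion that contains pixel (x, y)."""
        mu = self.centres[self.labels[y, x]]
        return -0.5 * np.sum((np.asarray(s, dtype=float) - mu) ** 2)
```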
3.2 Case II: Potentially Varying Background
For some scenarios, generating a static background image is not possible. In outdoor scenarios, for example, the background usually changes due to moving plants or people passing by. Even inside buildings, the lighting conditions – and thus the background – typically vary. Furthermore, the background could vary due to camera motion. In fact, varying backgrounds can appear in many applications and render background subtraction methods inapplicable. In general, the background changes only slowly between two consecutive frames. This can be exploited to extend the described approach to non-static images or to images where the object is already present. Excluding the current object region from the image domain, the remainder of the image can be segmented as before. This is shown in Fig. 5. To further deal with slow changes in the background, the segmentation can also be recomputed in each new frame. This takes changes in the localisation or in the statistics into account. A subtle difficulty appearing in this case is that there may be parts of the background not available in the density model because these areas were occluded by the object in the previous frame. When re-estimating the pose parameters of the object model, the previously occluded part can appear and needs some treatment. In such a case we choose the nearest available neighbour and use the probability density of the corresponding subregion. That is, if Ω_j is the j-th subregion as computed by the segmentation step, the local mixture density is

p(x, y, s) = p_{j*(x,y)}(s)   with   j*(x, y) = argmin_j ( dist((x, y), Ω_j) ).   (5)

4 Experiments
We evaluated the described region statistics on sequence S4 of the HumanEva-II benchmark [20]. For this sequence, a total of four views as well as static background images are available. Thus, this sequence allows us to compare the variant that uses a static background image to the version without the need for such an image. The sequence shows a man walking in a circle for approximately 370 frames, followed by a jogging part from frame 370 to 780, and finally a "balancing" part until frame 1200. Ground truth marker data is available for this sequence and tracking errors can be evaluated via an online interface provided by Brown University. Note that the ground truth data between frame 299 and 334 is not available, thus this part is ignored in the evaluation. In the figures, we plotted a linear interpolation between frame 298 and 335. Table 1 shows some statistics over tracking results with different models. The first line in the table shows an experiment in which background subtraction was used to find approximate silhouettes of the person to be tracked. These silhouette images are used as additional features, i.e. in addition to the three channels of the CIELAB colour space, for computing the PDFs of the different regions. This approach corresponds to the one in [6]. Results are improved when using the LMM based on level set segmentation. This can be seen by comparing the first
Fig. 3. PDFs estimated for the CIELAB colour channels of the subregions shown in Fig. 5. Each colour corresponds to one region. From left to right: Lightness channel, A channel and B channel. Top: Estimated PDFs when using the level-set-based segmentation. Bottom: Estimated PDFs when computing the subregions using K-means. Due to the smoothness term, the region boundaries are smoother, resulting in PDFs that are separated less clearly when using the level-set-based method than with the K-means algorithm. Nevertheless, the level set approach performed better in the quantitative evaluation.
and third line of the table. The best results are achieved when using both the silhouette images as well as the LMM (fifth line). The level set based LMM yields slightly better results than K-means clustering. See Fig. 4 for a tracking curve illustrating the error per frame for the best of these experiments. Fig. 5 shows segmentation results without using the background image, hence dropping the assumption of a static background. Fig. 3 visualises the estimated PDFs for each channel in each subregion. Aside from some misclassified pixels close to the occluded area (due to tracking inaccuracies, and due to the fact that a human cannot be perfectly modelled by a kinematic chain), the background is split into reasonable subparts and yields a good LMM. Tracking is almost as good as the combination with background subtraction, as indicated by the lower part of Table 1, without requiring a strictly static background any more. The same setting with a global Parzen model fails completely, as depicted in Fig. 4, since fore- and background are too similar for tracking at some places. In order to verify the true applicability of the LMM in the presence of non-static backgrounds, we tracked a tea box in a monocular sequence with a partially moving background. Neither ground truth nor background images are available for this sequence, making background subtraction impossible. As expected, the LMM can handle the moving background very well. When using only the Parzen model for the background, a 90° rotation of the tea box is missed by the tracker as shown in the left part of the lower row in Fig. 6. If we add Gaussian noise with standard deviation 10, the Parzen model completely fails (right part in lower row of Fig. 6) while tracking still works when using the LMM.
Table 1. Comparison of different tracking versions for sequence S4 of the HumanEva-II benchmark as reported by the automatic evaluation script. Each line shows the model used for the background region, whether images of the background were used, the average tracking error in millimetres, its variance and its maximum, as well as the total time used for tracking all 1200 frames.

Model                             BG image   Avg. error   Variance    Max.    Time
Parzen model + BG subtraction     yes        46.16        276.81      104.0   4h 31m
LMM (K-means)                     yes        49.63        473.90      114.2   4h 34m
LMM (level set segmentation)      yes        42.18        157.31      93.6    4h 22m
BG subtraction + LMM (K-means)    yes        42.96        178.19      92.6    4h 27m
BG subtraction + LMM (LS segm.)   yes        41.64        153.94      83.8    4h 29m
Parzen model                      no         451.11       24059.41    728.4   5h 12m
LMM (K-means)                     no         52.64        588.66      162.7   9h 19m
LMM (level set segmentation)      no         49.94        168.61      111.2   19h 9m
Fig. 4. Tracking error per frame of some tracking results of sequence S4 from the HumanEva-II dataset, automatically evaluated. Left: LMM where background subtraction information is supplemented as an additional feature channel. This plot corresponds to the fifth line in Table 1. Right: Global kernel density estimator (red) and LMM (blue). Here, we did not use the background images or any information derived from them. These plots correspond to the last (blue) and third last (red) line of Table 1.
Fig. 5. Segmentation results for frame 42 as seen from camera 3. Leftmost: Input image of frame 42 of the HumanEva-II sequence S4. Left: Object model projected into the image. The different colours indicate the different model components. Right: Segmentation with level-set-based segmentation and using K-means with 3 regions. The white part is the area occluded by the tracked object, i.e. the area removed from the segmentation process. Every other colour denotes a region. Although no information from the background image was used, the segmentation results still look good.
Fig. 6. Experiment with varying background. Upper row: Model of the tea box to be tracked, input image with initialisation in first frame, and tracking results for frame 50, 150 and 180. Lower row: Input image (frame 90), result when using LMM, result with Parzen model, and results with Gaussian noise with LMM and the Parzen model.
5 Summary
We have presented a localised mixture model that splits the region whose appearance should be estimated into distinct subregions. The appearance of the region is then modelled by a mixture of densities, each applied in its local vicinity. For the partitioning step, we tested a fast K-means clustering as well as a multi-region segmentation algorithm based on level sets. We demonstrated the relevance of such a localised mixture model by quantitative experiments in model based tracking using the HumanEva-II benchmark. Results clearly improved when using this new model. Moreover, the approach is also applicable when a static background image is missing. In such cases tracking is only successful with the localised mixture model. We believe that such localised models can also be very beneficial in other object segmentation tasks, where low-level cues are combined with a-priori information, such as semi-supervised segmentation, or combined object recognition and segmentation.
References 1. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23(3), 309–314 (2004) 2. Criminisi, A., Cross, G., Blake, A., Kolmogorov, V.: Bilayer segmentation of live video. In: Proc. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 53–60. IEEE Computer Society, Los Alamitos (2006) 3. Paragios, N., Deriche, R.: Geodesic active regions: A new paradigm to deal with frame partition problems in computer vision. Journal of Visual Communication and Image Representation 13(1/2), 249–268 (2002) 4. Sifakis, E., Garcia, C., Tziritas, G.: Bayesian level sets for image segmentation. Journal of Visual Communication and Image Representation 13(1/2), 44–64 (2002)
5. Cremers, D., Soatto, S.: Motion competition: A variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision 62(3), 249–265 (2005) 6. Schmaltz, C., Rosenhahn, B., Brox, T., Weickert, J., Wietzke, L., Sommer, G.: Dealing with self-occlusion in region based motion capture by means of internal regions. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2008. LNCS, vol. 5098, pp. 102–111. Springer, Heidelberg (2008) 7. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision 73(3), 243–262 (2007) 8. Morya, B., Ardon, R., Thiran, J.P.: Variational segmentation using fuzzy region competition and local non-parametric probability density functions. In: Proc. Eleventh International Conference on Computer Vision. IEEE Computer Society Press, Los Alamitos (2007) 9. Grimson, W., Stauffer, C., Romano, R., Lee, L.: Using adaptive tracking to classify and monitor activities in a site. In: Proc. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 22–29. IEEE Computer Society Press, Los Alamitos (1998) 10. Pless, R., Larson, J., Siebers, S., Westover, B.: Evaluation of local models of dynamic backgrounds. In: Proc. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 73–78 (2003) 11. Sun, J., Zhang, W., Tang, X., Shum, H.Y.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006) 12. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proc. Twelfth International Conference on Computer Vision. IEEE Computer Society Press, Los Alamitos (2008) 13. Dambreville, S., Sandhu, R., Yezzi, A., Tannenbaum, A.: Robust 3D pose estimation and efficient 2D region-based segmentation from a 3D shape prior. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 169– 182. Springer, Heidelberg (2008) 14. Bregler, C., Malik, J., Pullen, K.: Twist based acquisition and tracking of animal and human kinematics. International Journal of Computer Vision 56(3), 179–194 (2004) 15. Gavrila, D.M.: The visual analysis of human movement: a survey. Computer Vision and Image Understanding 73(1), 82–98 (1999) 16. Poppe, R.: Vision-based human motion analysis: An overview. Computer Vision and Image Understanding 108(1-2), 4–18 (2007) 17. Elkan, C.: Using the triangle inequality to accelerate k-Means. In: Proceedings of the Twentieth International Conference on Machine Learning, pp. 147–153. AAAI Press, Menlo Park (2003) 18. Gehler, P.: Mpikmeans (2007), http://mloss.org/software/view/48/ 19. Brox, T., Weickert, J.: Level set segmentation with multiple regions. IEEE Transactions on Image Processing 15(10), 3213–3218 (2006) 20. Sigal, L., Black, M.J.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated motion. Technical Report CS-06-08, Department of Computer Science, Brown University (September 2006)
A Closed-Form Solution for Image Sequence Segmentation with Dynamical Shape Priors Frank R. Schmidt and Daniel Cremers Computer Science Department University of Bonn, Germany
Abstract. In this paper, we address the problem of image sequence segmentation with dynamical shape priors. While existing formulations are typically based on hard decisions, we propose a formalism which allows one to reconsider all segmentations of past images. Firstly, we prove that the marginalization over all (exponentially many) reinterpretations of past measurements can be carried out in closed form. Secondly, we prove that computing the optimal segmentation at time t given all images up to t and a dynamical shape prior amounts to the optimization of a convex energy and can therefore be optimized globally. Experimental results confirm that for large amounts of noise, the proposed reconsideration of past measurements improves the performance of the tracking method.
1 Introduction
A classical challenge in Computer Vision is the segmentation and tracking of a deformable object. Numerous researchers have addressed this problem by introducing statistical shape priors into segmentation and tracking [1,2,3,4,5,6,7]. While in earlier approaches every image of a sequence was handled independently, Cremers [8] suggested to consider the correlations which characterize many deforming objects. The introduction of such dynamical shape priors allows one to substantially improve the performance of tracking algorithms: The dynamics are learned via an auto-regressive model, and segmentations of the preceding images guide the segmentation of the current image. Upon a closer look, this approach suffers from two drawbacks:
– The optimization in [8] was done in a level set framework which only allows for locally optimal solutions. As a consequence, depending on the initialization, the resulting solutions may be suboptimal.
– At any given time the algorithm in [8] computed the currently optimal segmentation and only retained the segmentations of the two preceding frames. Past measurements were never reinterpreted in the light of new measurements. As a consequence, any incorrect decision would not be corrected at later stages of processing. While dynamical shape priors were called priors with memory in [8], what is memorized are only the decisions the algorithm took on previous frames – the measurements are instantly lost from memory; a reinterpretation is not considered in [8].
The reinterpretation of past measurements in the light of new measurements is a difficult computational challenge due to the exponential growth of the solution space: Even if a tracking system only had k discrete states representing the system at any time t, then after T time steps there are k^T possible system configurations explaining all measurements. In this work silhouettes are represented by k continuous real-valued parameters: While determining the silhouette for time t amounts to an optimization in ℝ^k, the optimization over all silhouettes up to time T amounts to an optimization over ℝ^{k·T}. Recent works tried to address the above shortcomings. Papadakis and Memin suggested in [9] a control framework for segmentation which aimed at a consistent sequence segmentation by forward and backward propagation of the current solution according to a dynamical system. Yet this approach is entirely based on level set methods and local optimization as well. Moreover, extrapolations into the past and the future rely on a sophisticated partial differential equation. In [10] the sequence segmentation was addressed in a convex framework. While this allowed the computation of globally optimal solutions independent of initialization, it does not allow a reinterpretation of past measurements. Hence incorrect segmentations will negatively affect future segmentations. The contribution of this paper is to introduce a novel framework for image sequence segmentation which overcomes both of the above drawbacks. While [8,10] compute the best segmentation given the current image and past segmentations, here we propose to compute the best segmentation given the current image and all previous images. In particular we propose a statistical inference framework which gives rise to a marginalization over all possible segmentations of all previous images. The theoretical contribution of this work is therefore two-fold. Firstly, we prove that the marginalization over all segmentations of the preceding images can be solved in closed form, which allows us to handle the combinatorial explosion analytically. Secondly, we prove that the resulting functional is convex, such that the maximum a posteriori inference of the currently best segmentation can be solved globally. Experimental results confirm that this marginalization over preceding segmentations improves the accuracy of the tracking scheme in the presence of large amounts of noise.

2 An Implicit Dynamic Shape Model
In the following, we will briefly review the dynamical shape model introduced in [10]. It is based on the notion of a probabilistic shape u defined as a mapping

u : Ω → [0, 1]   (1)

that assigns to every pixel x of the shape domain Ω ⊂ ℝ^d the probability that this pixel is inside the given shape. While our algorithm will compute such a relaxed shape, for visualization of a silhouette we will simply threshold u at 1/2. We present a general model for shapes in arbitrary dimension. However, the approach is tested for planar shapes (d = 2).
The space of all probabilistic shapes forms a convex set, and the space spanned by a few training shapes {u_1, . . . , u_N} forms a convex subset. Any shape u can be approximated by a linear combination of the first n principal components Ψ_i of the training set:

u(x) ≈ u_0(x) + Σ_{i=1}^{n} α_i · Ψ_i(x)   (2)

with an average shape u_0. Also, the set

Q := { α ∈ ℝ^n | ∀x ∈ Ω : 0 ≤ u_0(x) + Σ_{i=1}^{n} α_i · Ψ_i(x) ≤ 1 }

of feasible α-parameters is convex [10]. Any given sequence of shapes u_1, . . . , u_N can be reduced to a sequence of low-dimensional coefficient vectors α_1, . . . , α_N ∈ Q ⊂ ℝ^n. The evolution of these coefficient vectors can be modeled as an autoregressive system

α_i = Σ_{j=1}^{k} A_j α_{i−j} + η_{Σ^{-1}}   (3)

of order k ∈ ℕ, where the transition matrices A_j ∈ ℝ^{n×n} describe the linear dependency of the current observation on the previous k observations. Here η_{Σ^{-1}} denotes Gaussian noise with covariance matrix Σ^{-1}.
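The shape representation and the autoregressive dynamics can be sketched in a few lines. The following is a minimal illustration with our own function names, assuming the eigenmodes Ψ_i and the matrices A_j have already been learned from training silhouettes.

```python
import numpy as np

def reconstruct_shape(alpha, u0, Psi):
    """Eq. (2): probabilistic shape from coefficients.
    u0: (H, W) mean shape, Psi: (n, H, W) eigenmodes, alpha: (n,) coefficients."""
    u = u0 + np.tensordot(alpha, Psi, axes=1)
    return np.clip(u, 0.0, 1.0)          # stay inside the feasible set Q

def silhouette(alpha, u0, Psi):
    """Binary silhouette obtained by thresholding the relaxed shape at 1/2."""
    return reconstruct_shape(alpha, u0, Psi) > 0.5

def ar_predict(alphas_past, A):
    """Eq. (3) without the noise term: predict the next coefficient vector from the
    last k vectors. alphas_past: [alpha_{i-1}, ..., alpha_{i-k}], A: list of A_j."""
    return sum(A_j @ a for A_j, a in zip(A, alphas_past))
```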
3 A Statistical Formulation of Sequence Segmentation
In the following, we will develop a statistical framework for image sequence segmentation which for any time t determines the most likely segmentation u_t given all images I_{1:t} up to time t and given the dynamical model in (3). The goal is to maximize the conditional probability P(α_t | I_{1:t}), where α_t ∈ ℝ^n represents the segmentation u_t := u_0 + Ψ · α_t. For the derivation we will make use of four concepts from probabilistic reasoning:
– Firstly, the conditional probability is defined as
P(A|B) := P(A, B) / P(B)   (4)
– Secondly, the application of this definition leads to the Bayesian formula
P(A|B) = P(B|A) · P(A) / P(B)   (5)
Fig. 1. Model for image sequence segmentation. We assume that all information about the observed images I_τ (top row) is encoded in the segmentation variables α_τ (bottom row) and that the dynamics of α_τ follow the autoregressive model (3) learned beforehand. If the state space were discrete with N possible states per time instance, then one would need to consider N^t different states to find the optimal segmentation of the t-th image. In Theorem 1, we provide a closed-form solution for the integration over all preceding segmentations. In Theorem 2, we prove that the final expression is convex in α_t and can therefore be optimized globally.
– Thirdly, we have the concept of marginalization:
P(A) = ∫ P(A|B) · P(B) dB   (6)
which represents the probability P(A) as a weighted integration of P(A|B) over all conceivable states B. In the context of time-series analysis this marginalization is often referred to as the Chapman-Kolmogorov equation [11]. In particle physics it is popular in the formalism of path integral computations.
– Fourthly, besides these stochastic properties we make the assumption that for any time τ the probability of measuring image I_τ is completely characterized by its segmentation α_τ, as shown in Figure 1:
The segmentation α_τ contains all information about the system in state τ. The rest of the state τ is independent noise. Hence, I_τ contains no further hidden information; its probability is uniquely determined by α_τ.   (7)
With these four properties, we can now derive an expression for the probability P(α_t | I_{1:t}) that we want to maximize. Using Bayes' rule with all expressions in (5) conditioned on I_{1:t−1}, we obtain
P(α_t | I_{1:t}) ∝ P(I_t | α_t, I_{1:t−1}) · P(α_t | I_{1:t−1})   (8)
Due to property (7), we can drop the dependency on the previous images in the first factor. Moreover, we can expand the second factor using Bayes' rule again:
P(α_t | I_{1:t}) ∝ P(I_t | α_t) · P(I_{1:t−1} | α_t) · P(α_t)   (9)
Applying the Chapman-Kolmogorov equation (6) to (9), we obtain
P(α_t | I_{1:t}) ∝ P(I_t | α_t) ∫ P(I_{1:t−1} | α_{1:t}) · P(α | α_t) · P(α_t) dα_{1:t−1}   (10)
where the product P(α | α_t) · P(α_t) is the joint probability P(α_{1:t}).
This expression shows that the optimal solution for αt requires an integration over all conceivable segmentations α1:t−1 of the preceding images. To evaluate the right hand side of (10), we will model the probabilities P(It |αt ), P(I1:t−1 |α1:t ) and P(α1:t ). Assuming a spatially independent prelearned color distribution Pob of the object and Pbg of the background, we can define p(x) := − log(Pob (x)/Pbg (x)) which is negative for every pixel that is more likely to be an object pixel than a background pixel. By introducing an exponential weighting parameter γ for the color distributions, P(It |αt ) becomes
P(I_t | α_t) = ∏_{x∈Ω} Pob(x)^{γ u_t(x)} · Pbg(x)^{γ(1−u_t(x))}
∝ exp( Σ_{x∈Ω} γ u_t(x) log( Pob(x) / Pbg(x) ) )
∝ exp( −γ Σ_{i=1}^{n} (α_t)_i · Σ_{x∈Ω} Ψ_i(x) · p(x) ) = exp( −γ ⟨α_t, f_t⟩ ),
where f_{t,i} := Σ_{x∈Ω} Ψ_i(x) · p(x).
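Computing the data-term coefficients f_t is a simple pixel-wise sum. The following sketch uses our own naming and assumes the colour likelihood maps and the eigenmodes are given as arrays.

```python
import numpy as np

def log_likelihood_ratio(P_ob, P_bg, eps=1e-12):
    """p(x) := -log(P_ob(x) / P_bg(x)); negative where 'object' is more likely."""
    return -(np.log(P_ob + eps) - np.log(P_bg + eps))

def data_term_coefficients(Psi, p):
    """f_{t,i} = sum_x Psi_i(x) * p(x).  Psi: (n, H, W) eigenmodes, p: (H, W)."""
    return np.tensordot(Psi, p, axes=([1, 2], [0, 1]))   # shape (n,)

def data_energy(alpha_t, f_t, gamma=1.0):
    """gamma * <alpha_t, f_t>, i.e. -log P(I_t | alpha_t) up to a constant."""
    return gamma * float(np.dot(alpha_t, f_t))
```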
To compute P(I_{1:t−1} | α_{1:t}), we use assumption (7). Besides the information encoded in α_{1:t}, the images I_τ contain no further information and are therefore pairwise independent:
P(I_{1:t−1} | α_{1:t}) = ∏_{τ=1}^{t−1} P(I_τ | α_{1:t}) = ∏_{τ=1}^{t−1} P(I_τ | α_τ) = ∏_{τ=1}^{t−1} exp( −γ ⟨α_τ, f_τ⟩ )
The second equation holds again due to (7): Since the probability of I_τ is uniquely determined by α_τ, the dependency on the other states can be dropped. Now we have to address the probability P(α_{1:t}), which can recursively be simplified via (4):
P(α_{1:t}) = P(α_t | α_{1:t−1}) · P(α_{1:t−1}) = · · · = ∏_{τ=1}^{t−1} P(α_τ | α_{1:τ−1})
Using the dynamic shape prior (3), this expression becomes
P(α_{1:t}) ∝ exp( − Σ_{τ=1}^{t−1} || α_τ − Σ_{i=1}^{k} A_i α_{τ−i} ||²_{Σ^{-1}} )   (11)
To make this formula more accessible, we introduced k additional segmentation parameters α_{1−k}, . . . , α_0. These parameters represent the segmentation of the past prior to the first observation I_1 (cf. Figure 1). To simplify the notation, we will introduce α := α_{1−k:t−1}. These are the parameters that represent all segmentations prior to the current segmentation α_t. Combining all derived probabilities, we can formulate the image segmentation as the following minimization task:
arg min_{α_t}  − ∫ exp( − Σ_{τ=1}^{t} γ · ⟨f_τ, α_τ⟩ − Σ_{τ=1}^{t} || α_τ − Σ_{j=1}^{k} A_j α_{τ−j} ||²_{Σ^{-1}} ) dα   (12)
Numerically computing this n · (t + k − 1)-dimensional integral of (12) leads to a combinatorial explosion. Even for a simple example of t = 25 frames, n = 5 eigenmodes and an autoregressive model of order k = 1, a 100-dimensional integral has to be computed. In [8], this computational challenge was circumvented by the crude assumption of a Dirac distribution centered at precomputed segmentation results – i.e., rather than considering all possible trajectories, the algorithm only retained for each previous time the one segmentation which was then most likely. In this paper, we compute this integral explicitly and obtain a closed-form expression for (12), described in Theorem 1. This closed-form formulation has the important advantage that for any given time it allows an optimal reconsideration of all conceivable previous segmentations. To simplify (12), we write the integral as ∫ exp(−Q(α, α_t)) dα. Note that Q is a quadratic expression that can be written as
Q(α, α_t) = γ · ⟨f_t, α_t⟩ + ||α_t||²_{Σ^{-1}}  [I]  + ⟨α, M α⟩  [II]  − ⟨b, α⟩  [III]   (13)
with the block vector b and the block matrix M:
b_i = −γ · f_i · 𝟙_{i≥1} + 2 A^T_{t−i} Σ^{-1} α_t · 𝟙_{i≥t−k}
M_{i,j} = A^T_{t−i} Σ^{-1} A_{t−j} · 𝟙_{i,j≥t−k} + 𝟙_{i=j≥1} − 2 A_{i−j} · 𝟙_{i≥1, k≥i−j≥1} + Σ_{1≤l≤k, 1≤i+l≤t−1, 1≤i−j+l≤k} A^T_l Σ^{-1} A_{i−j+l}
Despite their complicated nature, the three terms in (13) have the following intuitive interpretations:
– I assures that the current segmentation encoded by α_t optimally segments the current image.
– II assures that the segmentation path (α_{−1}, . . . , α_t) is consistent with the learned autoregressive model encoded by (A_i, Σ^{-1}).
– III assures that the current segmentation α_t also consistently segments all previous images when propagated back in time according to the dynamical model. In dynamical systems such backpropagation is modeled by the adjoints A^T of the transition matrices.
In the next theorem we provide a closed-form expression for (12) that is freed of any integration process and can therefore be computed more efficiently. Additionally, we obtain a convex energy functional. Therefore, computing the global optimum of the image sequence problem becomes an easy task.
Theorem 1. The integration over all conceivable interpretations of past measurements can be solved in the following closed form:
P(α_t | I_{1:t}) = exp( −γ ⟨α_t, f_t⟩ − ||α_t||²_{Σ^{-1}} + ¼ ⟨M_s^{-1} b, b⟩ + const )   (14)
Proof.
P(α_t | I_{1:t}) ∝ ∫ e^{ −γ⟨α_t, f_t⟩ − ||α_t||²_{Σ^{-1}} − ⟨α, M_s α⟩ + ⟨b, α⟩ } dα
= ∫ e^{ −γ⟨α_t, f_t⟩ − ||α_t||²_{Σ^{-1}} − ||α − ½ M_s^{-1} b||²_{M_s} + ¼ ⟨M_s^{-1} b, b⟩ } dα
∝ exp( −γ ⟨α_t, f_t⟩ − ||α_t||²_{Σ^{-1}} + ¼ ⟨M_s^{-1} b, b⟩ )
Theorem 2. The resulting energy E(α_t) = −log(P(α_t | I_{1:t})) is convex and can therefore be minimized globally.
Proof. The density function P(α_t | I_{1:t}) is the integral of a log-concave function, i.e., of a function whose logarithm is concave. It was shown in [12] that integrals of log-concave functions are log-concave. Hence, E is convex. Therefore, the global optimum can be computed using, for example, a gradient descent approach.
In [10], discarding all preceding images and merely retaining the segmentations of the last frames gave rise to the simple objective function
E_1(α_t) = γ · ⟨α_t, f_t⟩ + ||α_t − v||²_{Σ^{-1}}   (15)
where v is the prediction obtained using the AR model (3) on the basis of the last segmentations. The proposed optimal path integration gives rise to the new objective function
E_2(α_t) = γ · ⟨α_t, f_t⟩ + ||α_t||²_{Σ^{-1}} − ¼ ⟨M_s^{-1} b, b⟩   (16)
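For the simpler objective (15) the minimizer can even be written down directly, which makes it a convenient baseline. The sketch below is our own code; Sigma denotes the covariance matrix from (3), and the generic gradient step applies equally to the convex energy E_2 of (16), cf. Theorem 2.

```python
import numpy as np

def E1(alpha, f_t, v, Sigma_inv, gamma=1.0):
    """Objective of Eq. (15): data term plus squared Mahalanobis distance to the
    AR prediction v."""
    d = alpha - v
    return gamma * float(f_t @ alpha) + float(d @ Sigma_inv @ d)

def minimize_E1(f_t, v, Sigma, gamma=1.0):
    """Closed-form minimizer of Eq. (15): setting
    gamma*f_t + 2*Sigma^{-1}*(alpha - v) = 0 gives alpha = v - (gamma/2)*Sigma*f_t."""
    return v - 0.5 * gamma * (Sigma @ f_t)

def gradient_step(alpha, grad_fn, step=1e-2):
    """One gradient-descent step; since the energy is convex, iterating this
    converges to the global optimum."""
    return alpha - step * grad_fn(alpha)
```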
In the next section, we will experimentally quantify the difference in performance brought about by the proposed marginalization over preceding segmentations.
Fig. 2. Optimal Parameter Estimation. The tracking error averaged over all frames (plotted as a function of γ) shows that γ = 1 produces the best results for both methods at various noise levels (shown here are σ = 16 and σ = 256).
4 Experimental Results
In the following experiments, the goal is to track a walking person in spite of noise and missing data. To measure the tracking accuracy, we hand-segmented the sequence (before adding noise) and measured the relative error with respect to this ground truth. Let T : Ω → {0, 1} be the true segmentation and S : Ω → {0, 1} be the estimated one. Then we define the scaled relative error as
ε := ∫_Ω |S(x) − T(x)| dx / ( 2 · ∫_Ω T(x) dx ).
It measures the area difference relative to twice the area of the ground truth. Thus we have ε = 0 for a perfect segmentation and ε = 1 for a completely wrong segmentation (of the same size).
Optimal parameter estimation. In order to estimate the optimal parameter γ for both approaches, we added Gaussian noise of standard deviation σ to the training images. As we can see in Figure 2, the lowest tracking error (averaged over all frames) is obtained at γ = 1 for both approaches. Therefore, we will fix γ = 1 for the test series in the next section.
Robust tracking through prominent noise. The proposed framework allows tracking a deformable silhouette despite large amounts of noise. Figure 3 shows segmentation results obtained with the proposed method for various levels of Gaussian noise. The segmentations are quite accurate even for high levels of noise.
Quantitative comparison to the method in [10]. For a quantitative comparison of the proposed approach with the method of [10], we compute the average error over the learned input sequence I_{1:151} for different levels of Gaussian noise. Figure 4 shows two different aspects. While the method in [10] exhibits slightly lower errors for small noise levels, the proposed method shows less dependency on noise and exhibits substantially better performance at larger noise levels. While the differences in the segmentation results for low noise levels are barely recognizable (middle row), for high noise levels, the method in [10] clearly estimates incorrect poses (bottom row).
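The scaled relative error is straightforward to compute on binary masks; a small sketch (our own helper, assuming the masks are boolean NumPy arrays):

```python
import numpy as np

def scaled_relative_error(S, T):
    """Area difference between estimate S and ground truth T, relative to twice
    the ground-truth area: 0 = perfect, 1 = completely wrong (same size)."""
    S = S.astype(bool)
    T = T.astype(bool)
    return np.logical_xor(S, T).sum() / (2.0 * T.sum())
```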
Fig. 3. Close-ups of segmentation results. The proposed method yields correct segmentations even in the presence of high Gaussian noise (σ ∈ {64, 512}).
(Fig. 4 panels: average tracking error as a function of the noise level; segmentations for σ = 128 and σ = 2048, each comparing the method in [10] with the proposed method.)
Fig. 4. Robustness with respect to noise. Tracking experiments demonstrate that in contrast to the approach in [10], the performance of the proposed algorithm is less sensitive to noise and outperforms the former in the regime of large noise. While for low noise, the resulting segmentations are qualitatively similar (middle row), for high noise level, the method in [10] provides an obviously wrong pose estimate (bottom row).
5 Conclusion
In this paper we presented the first approach to variational object tracking with dynamical shape priors that marginalizes over all previous segmentations. Firstly, we proved that this marginalization over an exponentially growing space of solutions can be carried out analytically. Secondly, we proved that the resulting functional is convex. As a consequence, one can efficiently compute the globally optimal segmentation at time t given all images up to time t. In experiments, we confirmed that the resulting algorithm reliably tracks walking people despite prominent noise. In particular for very large amounts of noise, it outperforms an alternative algorithm [10] that does not include a marginalization over the preceding segmentations.
References 1. Leventon, M., Grimson, W., Faugeras, O.: Statistical shape influence in geodesic active contours. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 316–323 (2000) 2. Tsai, A., Yezzi, A., Wells, W., Tempany, C., Tucker, D., Fan, A., Grimson, E., Willsky, A.: Model–based curve evolution technique for image segmentation. In: Comp. Vision Patt. Recog., pp. 463–468 (2001) 3. Cremers, D., Kohlberger, T., Schn¨ orr, C.: Nonlinear shape statistics in Mumford– Shah based segmentation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 93–108. Springer, Heidelberg (2002) 4. Riklin-Raviv, T., Kiryati, N., Sochen, N.: Unlevel sets: Geometry and prior-based segmentation. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 50–61. Springer, Heidelberg (2004) 5. Rousson, M., Paragios, N., Deriche, R.: Implicit active shape models for 3d segmentation in MRI imaging. In: Barillot, C., Haynor, D.R., Hellier, P. (eds.) MICCAI 2004. LNCS, vol. 3216, pp. 209–216. Springer, Heidelberg (2004) 6. Kohlberger, T., Cremers, D., Rousson, M., Ramaraj, R.: 4d shape priors for level set segmentation of the left myocardium in SPECT sequences. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4190, pp. 92–100. Springer, Heidelberg (2006) 7. Charpiat, G., Faugeras, O., Keriven, R.: Shape statistics for image segmentation with prior. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2007) 8. Cremers, D.: Dynamical statistical shape priors for level set based tracking. IEEE PAMI 28(8), 1262–1273 (2006) 9. Papadakis, N., M´emin, E.: Variational optimal control technique for the tracking of deformable objects. In: IEEE Int. Conf. on Comp. Vis. (2007) 10. Cremers, D., Schmidt, F.R., Barthel, F.: Shape priors in variational image segmentation: Convexity, Lipschitz continuity and globally optimal solutions. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2008) 11. Papoulis, A.: Probability, Random Variables, and Stochastic Processes. McGrawHill, New York (1984) 12. Pr´ekopa, A.: Logarithmic concave measures with application to stochastic programming. Acta Scientiarum Mathematicarum 34, 301–316 (1971)
Markerless 3D Face Tracking
Christian Walder1,2, Martin Breidt1, Heinrich Bülthoff1, Bernhard Schölkopf1, and Cristóbal Curio1
1 Max Planck Institute for Biological Cybernetics, Tübingen, Germany
2 Informatics and Mathematical Modelling, Technical University of Denmark
Abstract. We present a novel algorithm for the markerless tracking of deforming surfaces such as faces. We acquire a sequence of 3D scans along with color images at 40Hz. The data is then represented by implicit surface and color functions, using a novel partition-of-unity type method of efficiently combining local regressors using nearest neighbor searches. Both these functions act on the 4D space of 3D plus time, and use temporal information to handle the noise in individual scans. After interactive registration of a template mesh to the first frame, it is then automatically deformed to track the scanned surface, using the variation of both shape and color as features in a dynamic energy minimization problem. Our prototype system yields high-quality animated 3D models in correspondence, at a rate of approximately twenty seconds per timestep. Tracking results for faces and other objects are presented.
1 Introduction
Creating animated 3D models of faces is an important and difficult task in computer graphics due to the sensitivity of human perception to face motion. People can detect slight peculiarities in an artificially animated face model, which makes the animator's job rather difficult and has led to data-driven animation techniques, which aim to capture live performance. Data-driven face animation has enjoyed increasing success in the movie industry, mainly using marker-based methods. Although steady progress has been made, there are certain limitations involved in placing physical markers on a subject's face. Summarizing the face by a sparse set of locations loses information and necessitates motion re-targeting to map the marker motion onto that of a model suitable for animation. Markers also occlude the face, obscuring expression wrinkles and color changes. Practically, significant time and effort is required to accurately place markers, especially for brief scans of numerous subjects — a scenario common in the computer game industry. Tracking without markers is more difficult. To date, most attempts have made extensive use of optical flow calculations between adjacent time-steps of the sequence. Since local flow calculations are noisy and inconsistent, spatial coherency constraints must be added. Although significant progress has been made [1], the
This work was supported by Perceptual Graphics (DFG), EU-Project BACS FP6IST-027140, and the Max-Planck-Society.
Fig. 1. Setup of the dynamic 3D scanner. Two 640 by 480 pixel photon focus MV-D752160 gray-scale cameras (red) compute depth images at 40 Hz from coded light projected by the synchronized minirot H1 projector (blue). Two strobes (far left and right) are triggered by the 656 by 490 pixel Basler A601fc color camera (green), capturing color images at a rate of 40 Hz.
sequential use of between-frame flow vectors can lead to continual accumulation of errors, which may eventually necessitate labor intensive manual corrections [2]. It is also noteworthy that facial cosmetics designed to remove skin blemishes strike directly at the key assumptions of optical flow-based methods. Non flowbased methods include [3]. There, local geometrical patches are modelled and stitched together. [4] introduced a multiresolution approach which iteratively solves between-frame correspondence problems using feature points and 3D implicit surface models. Neither of these works use color information. For face tracking purposes, there is significant redundancy between the geometry and color information. Our goal is to exploit this multitude of information sources, in order to obtain high quality tracking results in spite of possible ambiguities in any of the individual sources. In contrast to classical motion capture we aim to capture the surface densely rather than at a sparse set of locations. We present a novel surface tracking algorithm which addresses these issues. The input is an unorganized set of four-dimensional (3D plus time) surface points, with a corresponding set of surface normals and surface colors. From this we construct a 4D implicit surface model, and a regressed function which models the color at any given point in space and time. Our 4D implicit surface model is a partition of unity method like [5], but uses a local weighting scheme which is particularly easy to implement efficiently using a nearest neighbor library. By requiring only an unorganized point cloud, we are not restricted to scanners which produce a sequence of 3D frames, and can handle samples at arbitrary points in time and space as produced by a laser scanner, for example.
2 Surface Tracking
In this section we present our novel method of deforming the initial template mesh to move in correspondence with the scanned surface. The dynamic 3D scanner we use is a commercial prototype (see Figure 1) developed by ABW GmbH (http://www.abw-3d.de) and uses a modified coded light approach with phase unwrapping. A typical frame of output consists of around 40K points with texture coordinates that index into the corresponding color texture image.
Input. The data produced by our scanner consists of a sequence of 3D meshes with texture images, sampled at a constant rate. As a first step we transform each mesh into a set of points and normals, where the points are the mesh vertices and the corresponding normals are computed by a weighted average of the adjacent face normals, using the method described in [6]. Furthermore, we append to each 3D point the time at which it was sampled, yielding a 4D spatio-temporal point cloud. To simplify the subsequent notation, we also append to each 3D surface normal a fourth temporal component of value zero. To represent the color information, we assign to each surface point a 3D color vector representing the RGB color, which we obtain by projecting the mesh produced by the scanner into the texture image. Hence we summarize the data from the scanner as the set of m (point, normal, color) triplets {(x_i, n_i, c_i)}_{i=1}^m.

… k, where x is our evaluation point and d_i = ‖x − x_i‖. In practice, we obtain such an ordering by way of a k nearest neighbor search using the TSTOOL software library [10]. By now letting r_i ≡ d_i / d_k and choosing w_i = (1 − r_i)_+, it is easy to see that the corresponding φ_i are continuous, differentiable almost everywhere, and that we only need to examine the k nearest neighbors of x in order to compute them. Note that the nearest neighbor search costs are easily amortized between the evaluation of f_imp. and f_col. Larger values of k average over more local estimates and hence lead to smoother functions — for our experiments we fixed k = 50. Note that the nearest neighbor search requires Euclidean distances in 4D, so we must decide, say, what spatial distance is equivalent to the temporal distance between frames. Too small a spatial distance will treat each frame separately, too large will smear the frames temporally. The heuristic we used was to adjust the time scale such that on average approximately half of the k nearest neighbors of each data point come from the same time (that is, the same 3D frame from the scanner) as that data point, so that the other half come from the surrounding frames. In this way we obtain functions which vary smoothly through space and time. Note that it is easy to visually verify the effect of this choice by rendering the implicit surface and color models, as demonstrated in the accompanying video. This method is particularly efficient when we optimize on a moving window as discussed in Section 2.2. In this case, reasonable assumptions imply that the implicit surface and color models enjoy setup and evaluation costs of O(q log q) and O(k log q) respectively, where q is the number of vertices in a single 3D frame.
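The following sketch illustrates the nearest-neighbor weighting scheme described above, using scipy's cKDTree in place of the TSTOOL library and blending arbitrary per-sample local estimates; all function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree   # stands in for the TSTOOL k-NN search

def blend_local_values(points_4d, local_values, query_4d, k=50):
    """Partition-of-unity style blending of local estimates: weight the k
    nearest spatio-temporal samples by w_i = (1 - d_i/d_k)_+ and normalize.

    points_4d    : (m, 4) samples (x, y, z, t), time pre-scaled as in the text
    local_values : (m,)   per-sample local estimates
    query_4d     : (4,)   evaluation point
    """
    tree = cKDTree(points_4d)
    d, idx = tree.query(query_4d, k=k)       # distances sorted ascending
    r = d / d[-1]                            # r_i = d_i / d_k
    w = np.maximum(0.0, 1.0 - r)             # w_i = (1 - r_i)_+
    phi = w / w.sum()                        # partition-of-unity weights
    return float(np.dot(phi, local_values[idx]))

# Toy usage with 1000 random spatio-temporal samples.
rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 4))
vals = pts[:, 0]
print(blend_local_values(pts, vals, np.zeros(4)))
```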
The Stixel World - A Compact Medium Level Representation of the 3D-World
Hernán Badino1, Uwe Franke2, and David Pfeiffer2
1 Goethe University Frankfurt
[email protected]
2 Daimler AG
{uwe.franke,david.pfeiffer}@daimler.com
Abstract. Ambitious driver assistance for complex urban scenarios demands a complete awareness of the situation, including all moving and stationary objects that limit the free space. Recent progress in real-time dense stereo vision provides precise depth information for nearly every pixel of an image. This raises new questions: How can one efficiently analyze half a million disparity values of next generation imagers? And how can one find all relevant obstacles in this huge amount of data in real-time? In this paper we build a medium-level representation named “stixel-world”. It takes into account that the free space in front of vehicles is limited by objects with almost vertical surfaces. These surfaces are approximated by adjacent rectangular sticks of a certain width and height. The stixel-world turns out to be a compact but flexible representation of the three-dimensional traffic situation that can be used as the common basis for the scene understanding tasks of driver assistance and autonomous systems.
1 Introduction
Stereo vision will play an essential role for scene understanding in cars of the near future. Recently, the dense stereo algorithm “Semi-Global Matching” (SGM) has been proposed [1], which offers accurate object boundaries and smooth surfaces. According to the Middlebury data base, three out of the ten most powerful stereo algorithms are currently SGM variants. Due to the computational burden, in particular the required memory bandwidth, the original SGM algorithm is still too complex for a general purpose CPU. Fortunately, we were able to implement an SGM variant on an FPGA (Field Programmable Gate Array). The task at hand is to extract and track every object of interest captured within the stereo stream. The research of the last decades was focused on the detection of cars and pedestrians from mobile platforms. It is common to recognize different object classes independently. Therefore the image is evaluated repetitively. This common approach results in complex software structures, which
Hernán Badino is now with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
(a) Dense disparity image (SGM result)
(b) Stixel representation
Fig. 1. (a) Dense stereo results overlaid on the image of an urban traffic situation. The colors encode the distance, red means close, green represents far. Note that SGM delivers measurements even for most pixels on the road. (b) Stixel representation for this situation. The free space (not explicitly shown) in front of the car is limited by the stixels. The colors encode the lateral distance to the expected driving corridor shown in blue.
remain incomplete in detection, since only objects of interest are observed. Aiming at a generic vision system architecture for driver assistance, we suggest the use of a medium-level representation that bridges the gap between the pixel and the object level. To serve the multifaceted requirements of automotive environment perception and modeling, such a representation should be:
– compact: offering a significant reduction of the data volume,
– complete: information of interest is preserved,
– stable: small changes of the underlying data must not cause rapid changes within the representation,
– robust: outliers must have minimal or no impact on the resulting representation.
We propose to represent the 3D situation by a set of rectangular sticks named “stixels”, as shown in Fig. 1(b). Each stixel is defined by its 3D position relative to the camera and stands vertically on the ground, having a certain height. Each stixel limits the free space and approximates the object boundaries. If, for example, the width of the stixels is set to 5 pixels, a scene from a VGA image can be represented by only 640/5 = 128 stixels. Observe that a similar stick scheme was already formulated in [2] to represent and render 3D volumetric data at high compression rates. Although our stixels differ from those presented in [2], the properties of compression, compactness and exploitation of spatial coherence are common to both representations. The literature provides several object descriptors such as particles [3], quadtrees, octrees and quadrics [4] [5], patchlets [6] or surfels [7]. Even though these structures partly satisfy our requirements, we refrain from using them here since they do not achieve the degree of compactness we strive for.
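As an illustration of how compact this representation is, a minimal record for one stixel might look as follows; the field names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Stixel:
    """One vertical stick of the stixel-world (illustrative field names)."""
    column: int        # image column of the stixel (left edge)
    width_px: int      # width in pixels, e.g. 5 -> 640/5 = 128 stixels for VGA
    v_base: int        # image row of the base point (on the ground)
    v_top: int         # image row of the top point
    disparity: float   # fused disparity, i.e. the depth of the stixel
    lateral_x: float   # lateral position relative to the camera [m]

# A VGA image with 5-pixel wide stixels is described by at most 128 records.
scene = [Stixel(column=5 * i, width_px=5, v_base=400, v_top=300,
                disparity=20.0, lateral_x=0.0) for i in range(640 // 5)]
assert len(scene) == 128
```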
Section 2 describes the steps required to build the stixel-world from raw stereo data. Section 3 presents results and properties of the proposed representation. Future work is discussed in Section 4 and Section 5 concludes the paper.
2 Building the Stixel-World
Traffic scenes typically consist of a relatively planar free space limited by 3D obstacles that have a nearly vertical pose. Fig. 1 displays a typical disparity input and the resulting stixel-world. The different steps applied to construct this representation are depicted in Fig. 2 and Fig. 3. An occupancy grid is computed from the stereo data (see Fig. 2(a)) and used for an initial free space computation. We formulate the problem in such a way that we are able to use dynamic programming, which yields a global optimum. The result of this step is shown in Fig. 2(c) and 3(a). By definition, the free space ends at the base-points of vertical obstacles. Stereo disparities vote for their membership to the vertical obstacles, generating a membership cost image (Fig. 3(c)). A second dynamic programming pass optimally estimates the height of the obstacles. An appropriate formulation of this problem allows us to reuse the same dynamic programming algorithm that was applied for the free space computation. The result of the height estimation is depicted in Fig. 3(d). Finally, a robust averaging of the disparities of each stixel yields a precise model of the scene.

2.1 Dense Stereo
Most real-time stereo algorithms based on local optimization techniques deliver sparse disparity data. Hirschmüller [1] proposed a dense stereo scheme named "Semi-Global Matching" that runs within a few seconds on a PC. For road scenes, the "Gravitational Constraint" has been introduced in [8], which improves the results by taking into account that the disparities tend to increase monotonously from top to bottom. The implementation of this stereo algorithm on an FPGA allows us to run this method in real-time. Fig. 1(a) shows that SGM is able to model object boundaries precisely. In addition, the smoothness constraint used in the algorithm leads to smooth estimates in low contrast regions, exemplarily seen on the street and the untextured parts of the vehicles and buildings.

2.2 Occupancy Grid
The stereo disparities are used to build a stochastic occupancy grid. An occupancy grid is a two-dimensional array or grid which models occupancy evidence of the environment. Occupancy grids were first introduced in [9]. A review is given in [10]. Occupancy grids are computed in real-time using the method presented in [11], which allows us to propagate the uncertainty of the stereo disparities onto the grid. We use a polar occupancy grid in which the image column is used to represent
(a) Polar occ. grid.
(b) Background subtraction.
(c) Obtained free space.
Fig. 2. Occupancy grids: Fig.(a) shows the polar occupancy grid obtained from the disparity image shown in Fig. 1(a) (brightness encode the likelihood of occupancy). Fig. (b) shows the resulting image when background subtraction is applied to Fig. (a). The free space obtained from dynamic programming is shown in Fig. (c) in green, overlaid on a Cartesian representation of the occupancy grid.
the angular coordinate and the stereo disparity is used to represent the range. Figure 2(a) shows an example of the polar occupancy grid obtained from the stereo result shown in Fig. 1(a). Only those 3D measurements lying above the road are registered as obstacles in the occupancy grid. Instead of assuming a planar road, we estimate the road pose by fitting a B-Spline surface to the 3D data as proposed in [12].
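A much simplified sketch of the polar grid construction is given below: disparity measurements above the road are accumulated per image column. The actual system estimates the road as a B-spline surface [12] and propagates the disparity uncertainty onto the grid [11]; both are omitted in this illustrative sketch.

```python
import numpy as np

def polar_occupancy_grid(disparity, road_row, num_disp=128):
    """Accumulate occupancy evidence on a polar (disparity x column) grid.

    disparity : (H, W) dense disparity image (invalid pixels <= 0)
    road_row  : (W,)   image row of the road surface per column; only
                measurements above the road count as obstacle evidence
    """
    H, W = disparity.shape
    grid = np.zeros((num_disp, W))
    for u in range(W):
        for v in range(int(road_row[u])):        # rows above the road surface
            d = disparity[v, u]
            if d > 0:
                grid[min(int(d), num_disp - 1), u] += 1.0
    return grid

# Toy usage: one synthetic obstacle at disparity 40 in a 480x640 image.
disp = np.zeros((480, 640)); disp[100:400, 200:260] = 40.0
grid = polar_occupancy_grid(disp, np.full(640, 400))
```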
2.3 Free Space Computation
The task in free space analysis is to find the first visible relevant obstacle in the positive direction of depth. By observing Fig. 2(a) this means that the search must start from the bottom of the image in vertical direction until an occupied cell is found. The space found in front of this cell is considered free space. Instead of using a thresholding operation for every column independently, we use dynamic programming (DP) to find the optimal path cutting the polar grid from left to right. As proposed in [11], spatial smoothness is imposed by using a cost that penalizes jumps in depth, while temporal smoothness is imposed by a cost that penalizes the deviation of the current solution from a prediction. The prediction is obtained from the segmentation result of the previous cycle. In real world scenes, an image column may contain more than one object. In the example considered here, the guardrail at the right and the building at the background in Fig. 1, both have a corresponding occupancy likelihood in the occupancy grid of Fig. 2(a). Nevertheless, per definition, the free space is given only up to the guardrail. Applying dynamic programming directly on the grid of
Fig. 2(a) might lead to a solution where the optimal boundary is found on the background object (i.e., the building) and not on the foreground object (i.e., the guardrail). To cope with the above problem, a background subtraction is carried out before applying DP. All occupied cells behind the first maximum which is above a given threshold are marked as free. The threshold must be selected clearly larger than the noise expected in the occupancy grid. An example of the resulting background subtraction is shown in Fig. 2(b). The output of the DP is a set of vector coordinates (u, d̂_u), where u is a column of the image and d̂_u the disparity corresponding to the distance up to which free space is available. For every pair (u, d̂_u) a corresponding triangulated pair (x_u, z_u) is computed, which defines the 2D world point corresponding to (u, d̂_u). The sorted collection of points (x_u, z_u) plus the origin (0, 0) forms a polygon which defines the free space area from the camera point of view (see Fig. 2(c)). Fig. 3(a) shows the free space overlaid on the left image when dynamic programming is applied to Fig. 2(b). Observe that each free space point of the polygon in Fig. 3(a) indicates not only the interruption of the free space but also the base-point of a potential obstacle located at that position (a similar idea was successfully applied in [13]). The next section describes how to apply a second pass of dynamic programming in order to obtain the upper boundary of the obstacle.
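The dynamic programming pass just described can be sketched as follows, restricted to the spatial smoothness term; the temporal prediction term and the background subtraction are omitted, and the jump cost value is an illustrative placeholder.

```python
import numpy as np

def free_space_dp(grid, jump_cost=1.0):
    """Optimal left-to-right path through a polar (disparity x column)
    occupancy grid: per column one disparity is chosen, high occupancy
    evidence is rewarded and disparity jumps between neighboring columns
    are penalized (spatial smoothness only)."""
    D, W = grid.shape
    data = -grid                                   # reward occupied cells
    trans = jump_cost * np.abs(np.subtract.outer(np.arange(D), np.arange(D)))
    cost = data[:, 0].copy()
    back = np.zeros((D, W), dtype=int)
    for u in range(1, W):
        total = cost[:, None] + trans              # shape (d_prev, d)
        back[:, u] = np.argmin(total, axis=0)
        cost = total[back[:, u], np.arange(D)] + data[:, u]
    path = np.empty(W, dtype=int)                  # backtrack the optimum
    path[-1] = int(np.argmin(cost))
    for u in range(W - 1, 0, -1):
        path[u - 1] = back[path[u], u]
    return path

# Toy usage on the grid from the occupancy-grid sketch above:
# free_space = free_space_dp(grid)
```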
2.4 Height Segmentation
The height of the obstacles is obtained by finding the optimal segmentation between foreground and background disparities. This is achieved by first computing a cost image and then applying dynamic programming to find the upper boundary of the objects. Given the set of points (u, d̂_u) and their corresponding triangulated coordinate vectors (x_u, z_u) obtained from the free space analysis, the task is to find the optimal row position v_t where the upper boundary of the object at (x_u, z_u) is located. In our approach every disparity d(u, v) (i.e., the disparity at column u and row v) of the disparity image votes for its membership to the foreground object. In the simplest case a disparity votes positively for its membership as belonging to the foreground object if it does not deviate more than a maximal distance from the expected disparity of the object. The disparity votes negatively otherwise. The Boolean assignments make the threshold for the distance very sensitive: if it is too large, all disparities vote for the foreground membership, if it is too small, all points vote for the background. A better alternative is to approximate the Boolean membership by a continuous exponential function of the form

M_{u,v}(d) = 2^{\,1 - \left(\frac{d - \hat{d}_u}{\Delta D_u}\right)^2} - 1    (1)
where ΔDu is a computed parameter and dˆu is the disparity obtained from the free space vector (Sec. 2.3), i.e. the initially expected disparity of the
foreground object in the column u. The variable ΔD_u is derived for every column independently as

\Delta D_u = \hat{d}_u - f_d(z_u + \Delta Z_u), \quad \text{where } f_d(z) = \frac{b \cdot f_x}{z}    (2)
and f_d(z) is the disparity corresponding to depth z, b is the stereo baseline, f_x is the focal length and ΔZ_u is a parameter. This strategy defines the membership as a function of metric distance instead of pixels in order to correct for perspective effects. For the results shown in this paper we use ΔZ_u = 2 m. Fig. 3(b) shows an example of the membership values. Our experiments show that the explicit choice of the functional is not crucial as long as it is continuous. From the membership values the cost image is computed as

C(u, v) = \sum_{i=0}^{v-1} M_{u,v}(d(u, i)) - \sum_{i=v}^{v_f} M_{u,v}(d(u, i))    (3)
where v_f is the row position such that the triangulated 3D position of disparity d̂_u at image position (u, v_f) lies on the road, i.e., the row corresponding to the base-point of the object. Fig. 3(c) shows an exemplary cost image. For the computation of the optimal path, a graph G_hs(V_hs, E_hs) is generated. V_hs is the set of vertices and contains one vertex for every pixel in the image. E_hs is the set of edges which connect every vertex of one column with every vertex of the following column. The cost minimized by dynamic programming is composed of a data and a smoothness term, i.e.,

c_{u,v_0,v_1} = C(u, v_0) + S(u, v_0, v_1)    (4)

is the cost of the edge connecting the vertices V_{u,v_0} and V_{u+1,v_1}, where C(u, v) is the data term as defined in Eq. 3. S(u, v_0, v_1) applies smoothness, penalizing jumps in the vertical direction, and is defined as

S(u, v_0, v_1) = C_s \, |v_0 - v_1| \cdot \max\left(0,\; 1 - \frac{|z_u - z_{u+1}|}{N_Z}\right)    (5)

where C_s is the cost of a jump. The cost of a jump is proportional to the difference between the rows v_0 and v_1. The last term has the effect of relaxing the smoothness constraint at depth discontinuities: the spatial smoothness cost of a jump becomes zero if the difference in depth between the columns is equal to or larger than N_Z, and it reaches its maximum C_s when the free space depth difference between consecutive columns is 0. For our experiments we use N_Z = 5 m and C_s = 8. An exemplary result of the height segmentation for the free space computed in Fig. 3(a) is shown in Fig. 3(d).
2.5 Stixel Extraction
(a) Free space   (b) Membership values   (c) Membership cost image   (d) Height segmentation
Fig. 3. Stixels computation: Fig. (a) shows the result obtained from free space computation with dynamic programming. The assigned membership values for the height segmentation are shown in Fig. (b), while the cost image is shown in Fig. (c) (the grey values are negatively scaled). Fig. (d) shows the resulting height segmentation.
Once the free space and the height for every column have been computed, the extraction of the stixels is straightforward. If the predefined width of the stixel
is more than one column, the heights obtained in the previous step are fused, resulting in the height of the stixel. The base and top points v_B and v_T as well as the width of the stixel span a frame in which the stixel is located. Due to discretization effects of the free space computation, which are caused by the finite resolution of the occupancy grid, the free space vector has only limited accuracy in depth. Further spatial integration over the disparities within this frame grants an additional gain in depth accuracy. The disparities found within the stixel area are registered in a histogram while regarding the depth uncertainty known from SGM. A parabolic fit around the maximum delivers the new depth information. This approach offers outlier rejection and noise suppression, which is illustrated by Fig. 4, where the SGM stereo data of the rear of a truck are displayed. Assuming a disparity noise of 0.2 px, a stereo baseline of 0.35 m and a focal length of 830 px, as in our experiments, the expected standard deviation for the truck at 28 meters is approx. 0.54 m. Since an average stixel covers hundreds of disparity values, the integration significantly improves the depth of the stixel. As expected, the uncertainty falls below 0.1 m for each stixel.
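A minimal sketch of this depth refinement follows: the disparities inside a stixel frame are collected in a histogram and a parabola is fitted through the maximum bin and its two neighbors. Bin size and range are illustrative, and the weighting by the SGM depth uncertainty is omitted.

```python
import numpy as np

def refine_stixel_disparity(disparities, num_bins=128, d_max=128.0):
    """Fuse all disparities inside a stixel frame into one refined value via
    a histogram maximum with parabolic sub-bin interpolation."""
    hist, edges = np.histogram(disparities, bins=num_bins, range=(0.0, d_max))
    i = int(np.argmax(hist))
    offset = 0.0
    if 0 < i < num_bins - 1:
        y0, y1, y2 = hist[i - 1], hist[i], hist[i + 1]
        denom = y0 - 2 * y1 + y2
        if denom != 0:
            offset = 0.5 * (y0 - y2) / denom       # peak offset in [-0.5, 0.5]
    bin_width = d_max / num_bins
    return edges[i] + (0.5 + offset) * bin_width

# Toy usage: noisy disparities around 10.3 px.
rng = np.random.default_rng(2)
print(refine_stixel_disparity(10.3 + 0.2 * rng.standard_normal(500)))
```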
Fig. 4. 3D visualization of the raw stereo data showing a truck driving 28 meters ahead. Each red line represents 1 meter in depth. One can clearly observe the high scattering of the raw stereo data while the stixels remain as a compound and approximate the planar rear of the truck.
3 Experimental Results
Figure 5 displays the results of the described algorithm when applied to images taken from different road scenarios such as highways, construction sites, rural roads and urban environments. The stereo baseline is 0.35 m, the focal length 830 pixels, and the images have VGA resolution (640 × 480 pixels). The color of the stixels encodes the lateral distance to the expected driving corridor. Even filigree structures like beacons or reflector posts are clearly captured in their position and extension. For clarity reasons we do not explicitly show the obtained free space. The complete computation of stixels on an Intel Quad Core 3.00 GHz processor takes less than 25 milliseconds. The examples shown in this paper must be taken as representative results of the proposed approach. In fact, the method has successfully passed days of real-time testing in our demonstrator vehicle in urban, highway and rural environments.
4 Future Work
In the future we intend to apply a tracking for stixels based upon the principles of 6D-Vision [14], where 3D points are tracked over time and integrated with Kalman filters. The integration of stixels over time will lead to further improvement of the position and height. At the same time it will be possible to estimate the velocity and acceleration, which will ease subsequent object clustering steps. Almost all objects of interest within the dynamic vehicle environment touch the ground. Nevertheless, hovering or flying objects such as traffic signs, traffic lights and side mirrors (an example is given in Fig. 5(b) at the traffic sign)
(a) Highway
(b) Construction site
(c) Rural road
(d) Urban traffic
Fig. 5. Evaluation of stixels in different real world road scenarios showing a highway, a construction site, a rural road and an urban environment. The color encodes the lateral distance to the driving corridor. Base points (i.e. distance) and height estimates are in very good accordance to the expectation.
violate this constraint. Our future work therefore also includes providing a dynamic height of the base-point.
5 Conclusion
A new primitive called stixel was proposed for modeling 3D scenes. The resulting stixel-world turns out to be a robust and very compact representation (not only) of the traffic environment, including the free space as well as static and moving objects. Stochastic occupancy grids are computed from dense stereo information. Free space is computed from a polar representation of the occupancy grid in order to obtain the base-points of the obstacles. The height of the stixels is obtained by segmenting the disparity image into foreground and background disparities, applying the same dynamic programming scheme as used for the free space
computation. Given height and base point the depth of the stixel is obtained with high accuracy. The proposed stixel scheme serves as a well formulated medium-level representation for traffic scenes. Obviously, the presented approach is also promising for other applications that obey the same assumptions of the underlying scene structure.
Acknowledgment The authors would like to thank Jan Siegemund for his contribution to the literature review and Stefan Gehrig and Andreas Wedel for fruitful discussions.
References 1. Hirschm¨ uller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: CVPR (2005) 2. Montani, C., Scopigno, R.: Rendering volumetric data using the sticks representation scheme. In: Workshop on Volume Visualization, San Diego, California (1990) 3. Fua, P.: Reconstructing complex surfaces from multiple stereo views. In: ICCV (June 1996) 4. Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface reconstruction from unorganized points. In: Conference on Computer Graphics and Interactive Techniques, pp. 71–78 (1992) 5. Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., Seidel, H.P.: Multi-level partition of unity implicits. ACM SIGGRAPH 2003 22(3), 463–470 (2003) 6. Murray, D., Little, J.J.: Segmenting correlation stereo range images using surface elements. In: 3D Data Processing, Visualization and Transmission, September 2004, pp. 656–663 (2004) 7. Pfister, H., Zwicker, M., van Baar, J., Gross, M.: Surfels: Surface elements as rendering primitives. In: ACM SIGGRAPH (2000) 8. Gehrig, S., Franke, U.: Improving sub-pixel accuracy for long range stereo. In: VRML Workshop, ICCV (2007) 9. Elfes, A.: Sonar-based real-world mapping and navigation. Journal of Robotics and Automation 3(3), 249–265 (1987) 10. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. Intelligent Robotics and Autonomous Agents. The MIT Press, Cambridge (2005) 11. Badino, H., Franke, U., Mester, R.: Free space computation using stochastic occupancy grids and dynamic programming. In: Workshop on Dynamical Vision, ICCV, Rio de Janeiro, Brazil (October 2007) 12. Wedel, A., Franke, U., Badino, H., Cremers, D.: B-spline modeling of road surfaces for freespace estimation. In: Intelligent Vehicle Symposium (2008) 13. Kubota, S., Nakano, T., Okamoto, Y.: A global optimization algorithm for realtime on-board stereo obstacle detection systems. In: Intelligent Vehicle Symposium (2007) 14. Franke, U., et al.: 6D-vision: Fusion of stereo and motion for robust environment perception. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005, vol. 3663, pp. 216–223. Springer, Heidelberg (2005)
Global Localization of Vehicles Using Local Pole Patterns
Claus Brenner
Institute of Cartography and Geoinformatics, Leibniz Universität Hannover, Appelstraße 9a, 30167 Hannover, Germany
[email protected]
Abstract. Accurate and reliable localization is an important requirement for autonomous driving. This paper investigates an asymmetric model for global mapping and localization in large outdoor scenes. In the first stage, a mobile mapping van scans the street environment in full 3D, using high accuracy and high resolution sensors. From this raw data, local descriptors are extracted in an offline process and stored in a global map. In the second stage, vehicles, equipped with simple, inaccurate sensors are assumed to be able to recover part of these descriptors which allows them to determine their global position. The focus of this paper is on the investigation of local pole patterns. A descriptor is proposed which is tolerant with regard to missing data, and performance and scalability are considered. For the experiments, a large, dense outdoor LiDAR scan with a total length of 21.7 km is used.
1 Introduction
For future driver assistance systems and autonomous driving, reliable positioning is a prerequisite. While global navigation satellite systems like GPS are in widespread use, they lack the required reliability, especially in densely built-up urban areas. Relative positioning, using video or LiDAR sensors, is an important alternative, which has been explored in many disciplines, like photogrammetry, computer vision and robotics. In order to determine one's position uniquely, a global map is required, which is usually based on features (or landmarks), rather than on the originally captured raw data. In robotics, features like line segments or corners have been extracted from horizontal scans [1]. However, such features exhibit a low degree of information and, especially for indoor sites, are not very discriminative. Recently, there is much research in computer vision using high-dimensional descriptors (such as SIFT), extracted from images, which can be used for object recognition [2] or localization [3]. Thus, it is interesting whether high-dimensional descriptors can be found which are strictly based on geometry and work in large outdoor environments. In this paper, a large 3D scanned outdoor scene is used from which a map of features is derived. Upright poles are extracted, which can be done quite reliably using simple geometric reasoning. The pole centers then form a 2D pattern. 2D point pattern matching is a research topic of its own, with applications for star
pattern and fingerprint matching. Van Wamelen et al. [4] present an expected linear time algorithm, to find a pattern in O(n (log k)^{3/2}), where n is the number of points in the scene and k < n the number of points in the subset which shall be searched in the scene. Bishnu et al. [5] describe an algorithm which is O(k n^{4/3} log n) in the worst case, however is reported to perform much better for practical cases. It is based on indexing the distance of point pairs. This paper generalizes this approach, by encoding the local relations of two or more points.
2 Mobile Mapping Setup and Feature Extraction
We obtained a dense LiDAR scan of a part of Hannover, Germany, acquired by the Steetmapper mobile mapping system, jointly developed by 3D laser mapping Ltd., UK, and IGI mbH, Germany [6]. The scan was acquired with a configuration of four scanners. Postprocessing yields the trajectory and, through the calibrated relative orientations of the scanners and the GNSS/IMU system, a georeferenced point cloud. Absolute point accuracy varies depending on GPS outages, however a relative accuracy of a few centimeters can be expected due to the high accuracy fiber optic IMU employed. Since we want to deal with local point patterns, relative accuracy is actually much more important than absolute accuracy. Fig. 1(a) shows an overview. Note that the scanned area contains streets in densely built up regions as well as highway like roads. The total length of the scanned roads is 21.7 kilometers, captured in 48 minutes, which is an average speed of 27 km/h. During that time, 70.7 million points were captured, corresponding to an effective measurement rate of 24,500 points per second. On average, each road meter is covered by more than 3,200 points. After obtaining the point cloud, the first step is to extract features. The basic motivation of this is that many of the scanned points do not convey much information and by reducing the huge cloud to only a few important features, transmission, storage, and computation requirements are substantially reduced. Our first choice were poles, which are usually abundant in inner city scenes as well as geometrically stable over time. Pole extraction uses a simple geometric model, namely that the basic characteristic of a pole is that it is upright, there is a kernel region of radius r1 where laser scan points are required to be present, and a hollow cylinder, between r1 and r2 (r1 < r2 ) where no points are allowed to be present (Fig. 1(b)). The structure is analyzed in stacks of hollow cylinders. A pole is confirmed when a certain minimum number of stacked cylinders is found. After a pole is identified, the points in the kernel are used for a least squares estimation of the pole center. Note that this method owes its reliability to the availability of full 3D data, i.e., the processing of the 3D stack of cylinders. Using just one scan plane parallel to the ground (as is often the case in robotics) or projecting the 3D cloud down to the ground would not yield the same detection reliability. The method also extracts some tree trunks of diameter ≤ r1 , which we do not attempt to discard since they are useful for positioning purposes as well.
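The cylinder-stack test described above can be sketched as follows; the radii, layer height and minimum counts are placeholder values (the paper does not state them here), and the least-squares center estimate is reduced to a simple mean of the kernel points.

```python
import numpy as np

def is_pole(points, center_xy, r1=0.25, r2=0.75, z_step=0.5,
            n_layers=6, min_confirmed=4, min_kernel_pts=3):
    """Upright-pole test at a candidate 2D position: per height layer,
    require scan points inside the kernel radius r1 and none inside the
    hollow ring between r1 and r2 (all thresholds are placeholders)."""
    d = np.linalg.norm(points[:, :2] - center_xy, axis=1)
    z0 = points[:, 2].min()              # assumes the local ground height
    confirmed = 0
    for i in range(n_layers):
        in_layer = (points[:, 2] >= z0 + i * z_step) & (points[:, 2] < z0 + (i + 1) * z_step)
        kernel = in_layer & (d <= r1)
        ring = in_layer & (d > r1) & (d <= r2)
        if kernel.sum() >= min_kernel_pts and ring.sum() == 0:
            confirmed += 1
    return confirmed >= min_confirmed

def pole_center(points, center_xy, r1=0.25):
    """Center estimate from the kernel points (mean as a stand-in for the
    least-squares fit mentioned in the text)."""
    d = np.linalg.norm(points[:, :2] - center_xy, axis=1)
    return points[d <= r1, :2].mean(axis=0)
```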
Global Localization of Vehicles Using Local Pole Patterns
(a)
(b)
63
(c)
Fig. 1. (a) Scan path in Hannover (color encodes the height of the scanned points, using a temperature scale). (b) Illustration of the pole extraction algorithm, using cylindrical stacks. In this example, some parts of the pole were not verified due to missing scan points (middle), others due to an additional structure, such as a sign mounted to the pole (top). (c) Extracted poles for the intersection ‘Friederikenplatz’ in Hannover. (a) and (c) are overlaid with a cadastral map (which takes no part in any computation).
For the entire 22 km scene, a total of 2,658 poles was found, which is on average one pole every 8 meters (Fig. 1(c)). In terms of data reduction, this is one extracted pole (or 2D point) per 27,000 original scan points. Although the current implementation is not optimized, processing time is uncritical and yields several poles per second on a standard PC. There are also some false detections in the scene, e.g. where a pole-like point pattern is induced by low point density or occlusion. Nevertheless, the pole patterns obtained are considered to be representative.
3 Characteristics of Local Pole Patterns
The idea for global localization of vehicles is to use the local pole pattern and match this to the global set of poles. As opposed to the general case of point pattern matching, one can assume that the scale is fixed. Also, there are additional constraints which arise from this special application, which is a combination of ‘horizon diameter’ and measurement accuracy. Since vehicles are not equipped with high accuracy positioning sensors, one can achieve a good measurement accuracy only for scenes of small extent. In order to keep the treatment general and map-centric, i.e., without the necessity to rely on a specific vehicle sensor, standard parameters are used in this paper. It is assumed that the ‘horizon’ of a vehicle is a 50 meter radius disk. Within this radius, the vehicle is deemed to be able to measure poles with an accuracy of ε = 0.1 m (or alternatively, 0.2 m). A pole accuracy of ε means that during matching, a pole from the reference map and one from the vehicle's horizon are considered to form a matching pair if they are within a distance of ε. Note that these assumptions do not imply that an actual vehicle must be equipped with a 360°, 50 m range sensor. Rather,
it is just an assumption regarding the scene extent and accuracy which seems feasible, either by direct measurement or by merging of multiple scans along the path of the vehicle. Using the assumed parameters, one can derive the first pole pattern characteristics. Placing the center of the ‘horizon’ on every pole in turn, and counting the number of other poles within a radius of r = 50 m, one finds that the average number of neighbor poles is around 18, with a maximum of 41. The number of neighbors is not uniformly distributed; rather, there is a peak around 17, as can be seen from the histogram in Fig. 2(a). Around 50% of all poles have between 12 and 22 neighbors. To determine uniqueness, the following experiment was performed. Each of the poles p_i, 1 ≤ i ≤ N = 2,658, is taken as center and its local set of neighbor poles P_i = {p_j | j ≠ i, ‖p_i − p_j‖_2 ≤ r} (with r = 50 m) is matched to every pole neighborhood P_j, j ≠ i, in the scene. Taking into account all possible rotations, it is counted how many poles match within a tolerance of 2ε, and the maximum over all these counts is taken, i.e., n_{i,max} = max_{j≠i}(matchcount(P_i, P_j)). The result is shown as a histogram in Fig. 2(b), where the area of the bubbles reflects the count and the pole neighborhoods are sorted according to the number of neighbors |P_i|. For example, the bubble at (17, 2) represents the number of poles with 17 other poles in the neighborhood for which n_{i,max} = 2 (in this case, the bubble area represents a count of 101, which means that 101 out of the 155 poles with 17 neighbors (peak in Fig. 2(a)) fulfill this criterion). An important observation is that for poles with up to around 20 neighbors, there is a strong peak at two, which means that in the majority of those cases, if we take the center and three more points in the neighborhood, this is a pattern which is unique in the entire scene. This property is used in the next section for the design of the local descriptor. It can also be seen that n_{i,max} may be as large as 20, for poles with |P_i| ≥ 36. As it turns out, this is due to alleys in the scene, where trees are planted with regular spacing. Along those alleys, there is a huge number of neighbors, many of which fit to other places along the alley. Nevertheless, one can see that even in this case, n_{i,max} is only around 50% of |P_i|.
4 A Local Pole Pattern Descriptor

4.1 The Curse of Dimensionality
Similar to the case of image retrieval from large databases, we are looking for a local ‘pole descriptor’ which can be retrieved quickly from a huge scene, containing millions of poles. One of the key properties of local descriptors used in vision (such as the SIFT descriptor [2]) is that through their high dimensionality, they are quite unique, even in large databases. Similarly, in our case, the patterns of local pole neighbors are high-dimensional and unique. For example, if |Pi | = 17, we can describe the center point and its 17 neighbors in a unique way using their local (x, y) or polar coordinates, which will yield 33 = 2 · 18 − 3 parameters (in general, k points in 2D will require 2k − 3 parameters when rotation and translation are not fixed). Different from the case of the SIFT descriptor,
Fig. 2. (a) Histogram of the number of poles in a local neighborhood (r = 50 m disk). (b) Histogram of the maximum number of poles in the neighborhood which match to another pole neighborhood in the scene. x-axis is number of neighbors |Pi |, y-axis is maximum matches ni,max , and the area of the bubbles represent the number of cases.
however, the number of dimensions would not be fixed but rather depend on the local scene content. Moreover, since one has to take into account missing or extra poles resulting from scene interpretation errors, descriptors of different dimensions would have to be compared. As is well-known, common efficient indexing methods are not successful in databases of high dimensions. For example, the popular kd-tree has a time complexity for retrieval of O(n^{1−1/d} + l), where n is the number of elements in the database, d is the dimension, and l is the number of retrieved neighbors. This means that for high dimensions, it is as efficient as a brute force O(n) search over all elements. Therefore, one has to rely on approximations, for example searching for a near neighbor only (as done by Lowe [2]) or by using quantization (as done by Nistér and Stewénius [7], where quantization into cells is given by a hierarchic clustering tree). In our case, quantization is straightforward, since the parameters required to express the pole pattern are geometric in nature, with given maximum range and measurement accuracy. For example, using a quantization of 2ε = 0.2 m within a total range of ±50 m would yield 500 discrete values, which requires approximately α ≈ 9 bits to encode. Thus, let us assume for the moment that the number of poles in all neighborhoods of the database (and the query) is the same, say k, in which case the dimension is d = 2k − 3, and each local neighborhood can be encoded into a single integer number using αd bits. Since each descriptor is just an integer with bounded range, one can search for an exact match instead of a nearest neighbor, using perfect hashing on a grid, with the remarkable time complexity of only O(1), however this would require a (perhaps unrealistic) O(αd · n^3) preprocessing time [8]. Alternatively, using a search tree or sorting all database entries would still yield a time complexity of O(log n) and would require only O(n log n) preprocessing time.
However, there are two caveats. First, since there is noise in the measured query vector, quantization may lead to an error of ±1 (if the cell size is chosen as outlined above). That is, one needs to search the neighboring cells as well. Using a tree, this can be done in O(1), however only for one dimension, which will not help, since we ‘folded’ d dimensions into one integer. Overmars [9] proposed an algorithm for range search on a grid which has retrieval time O(l + log^{d−2} n · √α), where l is again the number of returned results, requiring O(n log^{d−1} n) storage. However, since we are not performing a general range search, but rather are interested in the two direct neighbors only (i.e., if the value in one dimension is i, we have to look at i − 1, i, i + 1), we can simply search several times, which requires 3 searches for each dimension, i.e., a total of 3^d searches. This will grow by a factor of 3 for each added dimension instead of log n and allows us to use O(n) storage. Still, for large d (remember that 17 neighbors yield d = 33) this is not practical. The second caveat is that we cannot assume a fixed dimension and have to allow for missing and additional poles in the query.

4.2 Design of the Local Pattern Descriptor
While we can’t defeat the curse of dimensionality, the following observation is the key to a practical solution. As we have seen in Section 3 (Fig. 2(b)), the maximum overlap of a pole with any other pole, n_{i,max}, is usually quite small. Therefore, it suffices to take a subset of k points of a local neighborhood in order to perform a query. This suggests the following approach:
– Database construction. For every pole p_i with neighborhood P_i, select all possible combinations of k − 1 poles from P_i (for a total of k). Compute a unique descriptor D for those points, which has a (fixed) dimension of d = 2k − 3. Store the value i under the key D in the database.
– Query. For a given scene, draw a random selection of k points and retrieve the set of possible solutions. Repeat this until there is only one solution remaining (see Algorithm 1). (This could also be replaced by a voting scheme.)
The unique descriptor D of a point set with k ≥ 2 points is formed as follows. First, the diameter of the points is determined (the largest distance between any two points in the set), which can be done in O(k log k). The diameter yields the first value of the descriptor. Then, the x-axis is defined along the diameter and the y-axis perpendicular to it. The orientation is selected in such a way that when the remaining k − 2 points are expressed in local coordinates, the extension in +y is larger than in −y. Then, all the (x_i, y_i) values are sorted in lexicographic order and are added one after the other to the descriptor, yielding a total of 1 + 2(k − 2) = 2k − 3 values. In order to prevent a structural change of the descriptor being induced by a small error in coordinates, building the descriptor fails if during its construction any decision is closer than ε. Using a fixed k solves the problem that varying pole neighbor counts lead to different dimensions d. Also, selecting k as small as possible leads to a small d and thus to a small number of queries, 3^d (see the next section for scalability).
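A sketch of the descriptor construction for k 2D points is given below. It follows the steps described above (diameter, local frame, lexicographic sort, failure on near-ambiguous decisions), but the exact canonicalization details, e.g. which diameter endpoint serves as the origin, are assumptions.

```python
import numpy as np
from itertools import combinations

def pole_descriptor(pts, eps=0.1):
    """2k-3 dimensional descriptor of k 2D points: diameter first, then the
    remaining points in a canonical local frame, sorted lexicographically.
    Returns None if a decision during construction is closer than eps."""
    pts = np.asarray(pts, dtype=float)
    pairs = list(combinations(range(len(pts)), 2))
    dists = np.array([np.linalg.norm(pts[i] - pts[j]) for i, j in pairs])
    order = np.argsort(dists)
    if len(dists) > 1 and dists[order[-1]] - dists[order[-2]] < eps:
        return None                                  # ambiguous diameter
    i, j = pairs[order[-1]]
    diam = dists[order[-1]]
    rest = np.delete(pts, [i, j], axis=0)
    if len(rest) == 0:                               # k = 2: only the distance
        return np.array([diam])
    ex = (pts[j] - pts[i]) / diam                    # x-axis along the diameter
    ey = np.array([-ex[1], ex[0]])                   # perpendicular y-axis
    local = np.stack([(rest - pts[i]) @ ex, (rest - pts[i]) @ ey], axis=1)
    if abs(local[:, 1].max() + local[:, 1].min()) < eps:
        return None                                  # ambiguous orientation
    if local[:, 1].max() < -local[:, 1].min():       # extension in +y must dominate
        local[:, 1] = -local[:, 1]
    local = local[np.lexsort((local[:, 1], local[:, 0]))]
    return np.concatenate(([diam], local.ravel()))
```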
Algorithm 1. Database query.
1: Q is a local set of poles to be searched in the database
2: S ← {1, . . . , N} [the set of all indices in the database]
3: Select a point q from Q, which acts as the ‘center pole’
4: while |S| > 1 do
5:   Randomly select k − 1 points from Q \ {q}: {q_1, q_2, . . . , q_{k−1}}
6:   Compute the unique descriptor, D, for the k points {q, q_1, . . . , q_{k−1}}
7:   Query the database for the key D, which yields a set of indices S_1
8:   S ← S ∩ S_1 [narrow the set of possible solutions]
9: return the single element in S
The random draws in the query part of the algorithm also solve the problem of erroneous extra poles in the scene. For example, considering the average number of 18 poles in a pole neighborhood, if only 50% of them (9) are captured by a vehicle, with an additional 3 captured in error, and k = 4, then the probability of a good draw is still \binom{9}{4} / \binom{12}{4} ≈ 25%, and the expected number of draws required to get at least one correct draw is 4.

4.3 Scalability
It remains to be determined how k should be selected. If it is small, this keeps the database size and the number of required queries (3^d) small, and gives better chances to pick a correct pole subset when erroneous extra poles are present. However, if it is too small, the returned set of keys S_1 in Algorithm 1, line 7, gets large. In fact, one would like to select k in such a way that |S_1| = O(1). Otherwise, if d is too small, |S_1| will be linear in n. For a concrete example, consider k = 2; then pairs of points are in the database. If the average number of neighbors is 18, and N = 2,658, then n = 18 · 2,658 = 47,844. If ε = 0.1 m, the error in distance (which is a difference) is 0.2 m. If the 47,844 entries are distributed uniformly in the [0, 50 m] range, 383 will be in any interval of length 0.4 m (±0.2 m). Thus, in the uniformly distributed case, we would expect that a random draw of a pair yields about 400 entries in the database, and it will need several draws to reduce this to a single solution, according to Algorithm 1. We will have a closer look at the case k = 3. In this case, we would expect about N · 18 · 17/2 = 406,674 different descriptors (indeed there are 503,024). How are those descriptors distributed? Since for k = 3 it follows that d = 3, we can plot them in 3D space. From Fig. 3, one sees that the distribution is quite uniform. There is a certain point pattern evident, especially on the ground plane, which occurs with a spacing of 6 m and can indeed be traced back to a row of alley trees in the scene, planted at 6 m spacing. Fig. 3 supports the assumption that, in contrast to indoor scenes (typically occurring in robotics), the descriptors expose only little regularity.
Fig. 3. All descriptors D of the scene, plotted in 3D space (N = 2,658 poles, k = 3). The coordinates (x, y, z) correspond to the descriptor’s (d, x1 , y1 ).
If the distribution is about uniform, the question is how large the space spanned by all possible descriptors D is. The volume of the (oddly shaped) descriptor space for k = 3 is (2√3/3 − 2π/9) r³ ≈ 0.46 r³. To give an estimation, it is computed how many voxels of edge length 0.4 m (ε = 0.1 m, cf. the reasoning for the case k = 2 above) fit into this space, which is 891,736. Therefore, for k = 3, if the 503,024 descriptors are ‘almost’ uniformly placed in the ‘891,736 cell’ descriptor space, one can expect that a query will lead to a single result, as desired. In order to verify this, the following experiment was carried out. After filling the database, 10 queries are performed for every pole in the database, according to Algorithm 1. A descriptor from the database was considered to match the query descriptor if all elements were within a distance of 2ε. The number of iterations (draws) required to narrow down the resulting set to a single element (while loop in line 4 of Algorithm 1) was recorded into a histogram. At most 20 iterations were allowed, i.e., the histogram entries at ‘20’ mark failures. Again, the histogram entries are sorted according to the number of neighbors. Fig. 4(a) shows the case for k = 2 and ε = 0.1 m. It can be seen that in most of the cases, 5 iterations were required, with up to 10 or even more for poles with a large number of neighbors. For poles with only a few neighbors, there is a substantial number of failures. For ε = 0.2 m, the situation gets worse (Fig. 4(b)). There are more failures, and also more iterations required in general. Of course, this is the result of a too small descriptor space. Moving on to k = 3, we see that most poles are found within 1 or 2 iterations (Fig. 4(c)). (Note that point triplets will vote for all three of their endpoints if all sides are ≤ r, for which reason often 3 solutions are returned for the first query and another iteration is required.) When trying ε = 0.2 m (Fig. 4(d)), there is almost no change, which means that the descriptor space is large in relation to the number of descriptors.
Fig. 4. Histograms of the number of draws required to retrieve a pole uniquely from the database. x-axis is the number of poles in the neighborhood, y-axis is the number of draws required (with 20 being failures). The area of the bubbles represents the number of cases. All experiments for r = 50 m and (a) k = 2, ε = 0.1 m, (b) k = 2, ε = 0.2 m, (c) k = 3, ε = 0.1 m, (d) k = 3, ε = 0.2 m.
Finally, to give an estimate of the order of N for different k, we use the above reasoning (for r = 50 m, ε = 0.1 m, 18 poles in the neighborhood). For k = 2, there are 18N descriptors and 50/0.4 = 125 cells, so that N = 7. Similarly, for k = 3, there are 18·17/2·N descriptors and 891,736 cells, so that N = 5,828. For k = 4 and k = 5 it follows N = 1.3·10^7 (10^10 cells) and N = 4.8·10^10 (10^14 cells). Note that although 10^10 cells (for a database size of thirteen million poles) sounds large, this is in the order of the main memory of a modern desktop computer.
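The same reasoning can be packaged as a small capacity estimate: for each k, the number of descriptors per pole is the number of (k−1)-subsets of its neighborhood, and the maximal N follows from equating descriptors and cells. The sketch below (plain Python; the cell counts for k ≥ 4 are the orders of magnitude quoted above, so the resulting N are order-of-magnitude estimates only) reproduces the figures in the text.

from math import comb

# Cells in descriptor space for each k (values as quoted in the text).
cells = {2: 125, 3: 891_736, 4: 1e10, 5: 1e14}
avg_neighbors = 18

# Require (#descriptors) ~ (#cells):  N * C(18, k-1) ~ cells[k]  =>  solve for N.
for k, c in cells.items():
    descriptors_per_pole = comb(avg_neighbors, k - 1)   # subsets of k-1 neighbors
    N_max = c / descriptors_per_pole
    print(f"k={k}: {descriptors_per_pole} descriptors/pole, N_max ~ {N_max:.2g}")
# Reproduces the orders of magnitude in the text: ~7, ~5,800, ~1e7, ~3e10.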
5 Conclusions
In this paper, the use of local pole patterns for global localization was investigated. First, the characteristics of local pole patterns were determined, using a large scene captured by LiDAR and assumptions on the measurement range
and accuracy. Second, a local descriptor was proposed which has a constant dimension and allows for efficient retrieval. Third, the structure and size of the descriptor space, the retrieval performance and the scalability were analyzed. There are numerous enhancements possible. When constructing the database, not all descriptors need to be stored and, especially, clusters in descriptor space can probably be removed (similar to stop lists). Also, additional features such as planar patches or dihedral edges can (and should) be used. Finally, experiments with real vehicle sensors are required to verify the assumptions regarding range and accuracy, and larger scenes would be needed to verify scalability.
Acknowledgements This work has been supported by the VolkswagenStiftung, Germany.
Single-Frame 3D Human Pose Recovery from Multiple Views

Michael Hofmann 1 and Dariu M. Gavrila 2

1 TNO Defence, Security and Safety, The Netherlands, [email protected]
2 Intelligent Systems Laboratory, Faculty of Science, University of Amsterdam (NL), [email protected]

Abstract. We present a system for the estimation of unconstrained 3D human upper body pose from multi-camera single-frame views. Pose recovery starts with a shape detection stage where candidate poses are generated based on hierarchical exemplar matching in the individual camera views. The hierarchy used in this stage is created using a hybrid clustering approach in order to efficiently deal with the large number of represented poses. In the following multi-view verification stage, poses are re-projected to the other camera views and ranked according to a multi-view matching score. A subsequent gradient-based local pose optimization stage bridges the gap between the used discrete pose exemplars and the underlying continuous parameter space. We demonstrate that the proposed clustering approach greatly outperforms state-of-the-art bottom-up clustering in parameter space and present a detailed experimental evaluation of the complete system on a large data set.
1 Introduction The recovery of 3D human pose is an important problem in computer vision with many potential applications in animation, motion analysis and surveillance, and also provides view-invariant features for a subsequent activity recognition step. Despite the considerable advances that have been made over the past years (see next section), the problem of 3D human pose recovery remains essentially unsolved. This paper presents a multi-camera system for the estimation of 3D human upper body pose in single frames of cluttered scenes with non-stationary backgrounds. See Figure 1. Using input from three calibrated cameras we are able to infer the most likely poses in a multi-view approach, starting with shape detection for each camera followed by fusing information between cameras at the pose parameter level. The computational burden is shifted as much as possible to an off-line stage – as a result of a hierarchical representation and matching scheme, algorithmic complexity is sub-linear in the number of body poses considered. The proposed system also has some limitations: Like previous 3D pose recovery systems, it currently cannot handle a sizable amount of external occlusion. It furthermore assumes the existence of a 3D human model that roughly fits the person in the scene.
Fig. 1. System overview. For details, please refer to the text, Section 3.1.
2 Previous Work As one of the most active fields in computer vision, 3D human pose estimation has meanwhile an extensive literature. Due to space limitations we have to make a selection of what we consider most relevant. In particular, work that deals with 3D model-based tracking, as opposed to pose initialization, falls outside the scope of this paper; see recent surveys [1,2] for an overview of the topic. Work regarding 3D pose initialization can be distinguished by the number of cameras used. Multi-camera systems have so far been applied in controlled indoor environments. The near-perfect foreground segmentation resulting from the “blue-screen” type background, together with the many cameras used (> 5), makes it possible to recover pose by Shape-from-Silhouette techniques [3,4,5]. Single-camera approaches for 3D pose initialization can be sub-divided into generative and learning-based techniques. Learning-based approaches [6,7,8,9] are fast and conceptually appealing, but questions still remain regarding their scalability to arbitrary poses, given the ill-conditioning and high dimensionality of the problem (most experimental results involve restricted movements, e.g. walking). On the other hand, pose initialization using 3D generative models [10,11] involves finding the best match between model projections and image, and retrieving the associated 3D pose. Pose initialization using 2D generative models [12,13] involves a 2D pose recovery step followed by a 3D inference step with respect to the joint locations. In order to reduce the combinatorial complexity, previous generative approaches apply part-based decomposition techniques [14]. This typically involves searching first for the torso, then arms and legs [12,15,13]. This decomposition approach is error prone in the sense that estimation mistakes made early on, based on partial model knowledge, cannot be corrected at a later stage. In this paper we demonstrate the feasibility of a hierarchical exemplar-based approach to single-frame 3D human pose recovery in an unconstrained setting (i.e. not restricted to specific motions, such as walking). Unlike [16], we do not cluster our exemplars directly in parameter space but use a shape similarity measure for both clustering and matching. Because bottom-up clustering does not scale with the number of poses represented in our system, we propose a hybrid approach that judiciously combines bottom-up and top-down clustering. We add a gradient-based local pose optimization
step to our framework in order to overcome the limitations of having generated the candidate poses from a discrete set. An experimental performance evaluation is presented on a large number of frames. Overall, we demonstrate that the daunting combinatorics of matching whole upper-body exemplars can be overcome effectively by hierarchical representations, pruning strategies and use of clustering techniques. While in this paper we focus on single-frame pose recovery in detail, its integration with tracking and appearance model adaptation is discussed in [17].
3 Single-Frame 3D Pose Estimation

3.1 Overview

Figure 1 presents an overview of the proposed system. Pre-processing determines a region of interest based on foreground segmentation (Section 3.3). Pose hypotheses are generated based on hierarchical shape matching of exemplars in the individual camera views (Section 3.4) and then verified by reprojecting the shape model into all camera views (Section 3.5). This is implemented in two stages for efficiency reasons: In the 2D-based verification stage, the reprojection is done by mapping the discrete exemplars to the other camera views, while in the subsequent 3D-based verification stage, the poses are rendered on-line and therefore modeled with higher precision. As a last step, a gradient-based local pose optimization is applied to part of the pose hypotheses (Section 3.6). The final output is a list of pose hypotheses for each single frame, ranked according to their multi-view likelihood.

3.2 Shape Model

Our 3D upper body model uses superquadrics as body part primitives, yielding a good trade-off between desired accuracy and model complexity [18]. Joint articulation is represented using homogeneous coordinate transformations x′ = Hx, H = (R(φ, θ, ψ), T), where R is a 3 × 3 rotation matrix determined by the Euler angles φ, θ, ψ, and T a constant 3 × 1 translation vector. We represent a pose as a 13-dimensional vector

π = (πtorso(φ, θ, ψ), πhead(φ, ψ), πl.shoulder(φ, θ, ψ), πl.elbow(θ), πr.sh.(φ, θ, ψ), πr.elb.(θ))    (1)

3.3 Pre-processing

The aim of pre-processing is to obtain a rough region of interest, both in terms of each individual 2D camera view and in terms of the 3D space. For this, we apply background subtraction [19] to each camera view and fuse the computed masks by means of volume carving [20]. In the considered environment with dynamic background and a limited number of cameras (3) we do not expect to obtain well segmented human silhouettes in a quality suitable for solving pose recovery by SfS techniques [3,4,5]. However, approximate 3D positions of people in the scene can be estimated by extracting voxel blobs of a minimum size; this also yields information about the image scales to be used in the forthcoming single-view detection step (Section 3.4). Edge segmentation in the foreground regions then provides the features being used in the subsequent steps.
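To make the shape-model parameterization of Section 3.2 concrete, the following sketch (Python/NumPy; the Euler-angle convention, joint offsets and function names are illustrative assumptions, not the authors' implementation) builds the homogeneous transformation H = (R(φ, θ, ψ), T) for a single joint and allocates the 13-dimensional pose vector of Eq. (1).

import numpy as np

def euler_rotation(phi, theta, psi):
    """3x3 rotation from Euler angles (x-, y-, z-axis order assumed here)."""
    cx, sx = np.cos(phi), np.sin(phi)
    cy, sy = np.cos(theta), np.sin(theta)
    cz, sz = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def joint_transform(phi, theta, psi, t):
    """Homogeneous 4x4 transform H = (R(phi, theta, psi), T)."""
    H = np.eye(4)
    H[:3, :3] = euler_rotation(phi, theta, psi)
    H[:3, 3] = t                       # constant translation to the joint
    return H

# 13-dimensional pose, Eq. (1): torso(3) + head(2) + 2 x (shoulder(3) + elbow(1))
pose = np.zeros(13)
x = np.array([0.1, 0.0, 0.3, 1.0])    # a point in a body-part frame (homogeneous)
H_torso = joint_transform(*pose[:3], t=np.array([0.0, 0.9, 0.0]))  # offset assumed
print(H_torso @ x)                    # the point expressed in the parent frame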
3.4 Single-View Shape Detection

Shape hierarchy construction. We follow an exemplar-based approach to 3D pose recovery, matching a scene image with a pre-generated silhouette library with known 3D articulation. To obtain the silhouette library, we first define the set of upper body poses by specifying lower and upper bounds for each joint angle and discretizing each angle into a number of states with an average delta of about 22°. The Cartesian product contains anatomically impossible poses; these are filtered out by collision detection on the model primitives and through rule-based heuristics, more specifically a set of linear inequalities on the four arm angles. The remaining set P of about 15 × 10^6 “allowable” poses serves as input for the silhouette library, for which the exemplars are rendered using the 3D shape model (Section 3.2) assuming orthographic projection and, following [21,22,16], organized in a (4-level) template tree hierarchy, see Figure 2. We use a shape similarity measure (see sub-section below) for clustering as well as for matching, as opposed to clustering directly in angle space [16]. This has the advantage that similar projections (e.g. front/back views) can be compactly grouped together even if they are distant in angle space. However, bottom-up clustering does not scale with the number of allowable poses used here: On-line evaluation of our similarity measure would be prohibitively expensive; furthermore, computing the full dissimilarity matrix (approx. 2.3 × 10^14 entries) off-line is not possible either due to memory constraints. We therefore propose a hybrid clustering approach; see Figure 2 for an illustration of this process. We first set the exemplars of our third tree level by discretizing the allowable poses more coarsely, such that bottom-up clustering similar to [21] for creating the second and first hierarchy level is still feasible. Then, we compute a mapping for each pose π ∈ P to the 3rd-level exemplar with the best shape similarity. Each 3rd-level exemplar will thus be associated with a subset P3i of P, where i is the exemplar index and such that ∪i P3i ≡ P. The 4th level is then created by clustering the elements of each assigned subset P3i and selecting prototypes in a number proportional to the number of elements in the subset. Each pose in P is thus mapped to a 4th-level exemplar, i.e. each 4th-level exemplar is associated with a subset P4i of P such that ∪i P4i ≡ P. The need for a 4th tree level for an increase in matching accuracy was indicated by preliminary experiments. In the hierarchy used in our experiments we have approximately 200, 2,000, 20,000 and 150,000 exemplars at the respective levels.

Hierarchical shape matching. On-line matching is implemented by a hierarchy traversal for each camera; search is discontinued below nodes where the match is below a certain (level-specific) threshold. Instead of using silhouette exemplars of different scales, we rescale the scene image using information from the preprocessing step (Section 3.3). After matching, the exemplars s ∈ S that pass the leaf-level threshold are ranked according to a single-view likelihood

p(Oc|s) ∝ p(Dc(s, ec))    (2)
where Oc is the observation for camera c and Dc (s, e) the undirected Chamfer distance [23] between the exemplar s and the scene edge image ec of camera c. We select the Kc best ranked matches for view c (Kc = 150 in our experiments, for all c) and expand the previously grouped poses from each silhouette exemplar as input for the next step. (On average, about 15,800 poses are expanded per camera in our experiments.)
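The coarse-to-fine matching over the exemplar tree can be read as a depth-first traversal with level-specific pruning thresholds, followed by ranking with the single-view likelihood of Eq. (2). The sketch below (Python; the node structure, the chamfer_distance callable and the thresholds are placeholders rather than the original system) illustrates this logic.

def match_hierarchy(root, edge_image, thresholds, chamfer_distance):
    """Traverse the exemplar tree; prune subtrees whose match is too poor.

    root             -- tree node with .template, .level, .children (leaf: no children)
    thresholds       -- dict: level -> maximum allowed Chamfer distance
    chamfer_distance -- callable(template, edge_image) -> float
    Returns a list of (distance, leaf_node) for the surviving leaf exemplars.
    """
    matches = []
    stack = [root]
    while stack:
        node = stack.pop()
        d = chamfer_distance(node.template, edge_image)
        if d > thresholds[node.level]:
            continue                     # search is discontinued below this node
        if not node.children:            # leaf level: keep as candidate exemplar
            matches.append((d, node))
        else:
            stack.extend(node.children)
    matches.sort(key=lambda m: m[0])     # rank by single-view (Chamfer) score
    return matches[:150]                 # keep the K_c best matches per view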
Fig. 2. Schematized structure of the 4-level shape exemplar hierarchy (Section 3.4)
Fig. 3. Correction angle ϕ when transferring poses from orthographic to perspective projection (Section 3.5)
3.5 Multi-view Pose Verification

Given a set of expanded poses from the single-view shape detection step (Section 3.4), we verify all poses by reprojecting them into the other cameras and computing a multi-view likelihood. For efficiency reasons, this is implemented in a two-step approach. In a first step (“2D-based pose verification”), we map a pose extracted from one camera to the corresponding exemplars of the other cameras and match these exemplars onto their respective images. Due to the used orthographic projection, the mapping from a pose as observed in camera ci to the corresponding pose in camera cj is done by modifying the torso rotation angle ψtorso relative to the projected angle between cameras ci and cj on the ground plane. To account for the error made by the orthographic projection assumption, we add a correction angle ϕ as illustrated in Figure 3. The mapping from a (re-discretized) pose to an exemplar is then easily retrieved from a look-up table. The corresponding multi-view likelihood given a pose π is modeled as

p(O|π) ∝ p(Σc∈C Dc(sc, ec))    (3)
where O is the set of observations over all cameras C, sc the exemplar corresponding to the pose π, and ec the scene edge image of camera c. For each pose, we also need to obtain a 3D position in the world coordinate system from the 2D location of the match on the image plane. We therefore backproject this location at various depths corresponding to the epipolar line in the other cameras in regions with foreground support and match the corresponding exemplars at these locations. For each pose π, the 2D location with the highest likelihood per camera is kept; triangulation then yields a 3D position x in the world coordinate system, with inconsistent triangulations being discarded. We obtain a ranked list of candidate 3D poses {π, x} of which the best L (L = 2000 in our experiments) are evaluated further.
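The 2D-based verification step can be summarized as a loop over candidate poses that re-maps the torso orientation into each other camera, accumulates Chamfer distances as in Eq. (3), and triangulates a 3D position from the best per-camera locations. The following sketch (Python; all helper callables are assumed interfaces named only for illustration) condenses this step.

def verify_2d(candidates, cameras, lookup_exemplar, chamfer, triangulate, L=2000):
    """Rank candidate poses by the multi-view likelihood of Eq. (3).

    candidates      -- list of (pose, source_camera, image_location)
    lookup_exemplar -- maps a (re-discretized) pose to the exemplar seen by a camera
    chamfer         -- chamfer(exemplar, camera) -> (distance, best image location)
    triangulate     -- best per-camera 2D locations -> 3D position, or None
    """
    ranked = []
    for pose, cam_i, loc in candidates:
        total = 0.0
        locations = {cam_i: loc}
        for cam_j in cameras:
            if cam_j is cam_i:
                continue
            # torso angle corrected by the camera-to-camera ground-plane angle
            mapped = lookup_exemplar(pose, cam_i, cam_j)
            d, best_loc = chamfer(mapped, cam_j)
            total += d
            locations[cam_j] = best_loc
        x3d = triangulate(locations)
        if x3d is not None:              # inconsistent triangulations are discarded
            ranked.append((total, pose, x3d))
    ranked.sort(key=lambda r: r[0])      # smaller summed distance = higher likelihood
    return ranked[:L]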
In the second step (“3D-based pose verification”), the candidate 3D poses are rendered on-line, assuming perspective projection, and ranked according to a respective multi-view likelihood

p(O|π, x) ∝ p(Σc∈C Dc(rc, ec))    (4)
where rc is the image of the shape model silhouette in camera c. This is a very costly step in the evaluation cascade due to the rendering across multiple camera views, but provides the most accurate likelihood evaluation because poses are not approximated by a subset of shapes anymore, and due to the assumption of perspective projection. As a result, we obtain a ranked list of pose hypotheses, of which the best M (M = 30 in our experiments) enter the following processing step and the others remain unchanged.

3.6 Gradient-Based Local Pose Optimization

So far we have evaluated likelihoods given only poses π from a discrete set of poses P (Section 3.4). We can overcome this limitation by assuming that the likelihood described in Equation 4 is a locally smooth function on a neighborhood of π and x in state space and performing a local optimization of the parameters of each pose using the gradient ∇p(O|π, x). For a reasonable trade-off between optimization performance and evaluation efficiency, we decompose the parameter space during this step and optimize first over the world coordinate position x, followed by optimizations over πtorso, πhead, πl.shoulder, πr.shoulder, πl.elbow and πr.elbow respectively, evaluating the gradient once for each sub-step and moving in its direction until the likelihood value reaches a local maximum. Because the objective function used relies on rendering and therefore produces output on a fixed pixel grid, the gradients are approximated by suitable central differences, e.g. (p(O|π + ½ε) − p(O|π − ½ε))/ε, with ε chosen according to the input image resolution.
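One sub-step of this local optimization, i.e. a single central-difference gradient evaluation on one parameter block followed by a move along that direction until the likelihood stops improving, can be sketched as follows (Python/NumPy; the likelihood callable and the block partition are placeholders under the assumptions stated above).

import numpy as np

def optimize_block(params, block, likelihood, eps, step=1.0, max_steps=20):
    """Evaluate the gradient once on one parameter block and hill-climb along it.

    params     -- full parameter vector (pose angles and world position)
    block      -- indices of the sub-space optimized in this sub-step
    likelihood -- callable(params) -> scalar multi-view likelihood (Eq. 4)
    eps        -- finite-difference step, chosen from the image resolution
    """
    p = np.asarray(params, dtype=float).copy()
    grad = np.zeros(len(block))
    for j, idx in enumerate(block):
        plus, minus = p.copy(), p.copy()
        plus[idx] += 0.5 * eps
        minus[idx] -= 0.5 * eps
        grad[j] = (likelihood(plus) - likelihood(minus)) / eps
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return p, likelihood(p)
    direction = grad / norm
    best = likelihood(p)
    for _ in range(max_steps):            # move until a local maximum is reached
        trial = p.copy()
        trial[block] = trial[block] + step * direction
        val = likelihood(trial)
        if val <= best:
            break
        p, best = trial, val
    return p, best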
4 Experiments

Our experimental data consists of 2038 frames from recordings of three overlapping and synchronized color CCD cameras looking over a train station platform with various actors performing unscripted movements, such as walking, gesticulation and waving. The same generic shape model (Section 3.2) is used for all actors in the scene. We model the likelihood distributions (Equations 2, 3, 4) as exponential distributions, computed using maximum likelihood. Cameras were calibrated [24]; this enabled the recovery of the ground plane. Ground truth poses were manually labeled for all frames of the data set; we estimate their accuracy to be within 3 cm, considering the quality of calibration and labeling. We define the average pose error between two poses as

dx(π1, π2) = (1/|B|) Σi∈B de(v1^i, v2^i)    (5)

where B is a set of locations on the human upper body, |B| the number of locations, v^i is the 3D position of the respective location in a fixed Euclidean coordinate system,
Fig. 4. (a) Cumulative number of correct pose hypotheses wrt. the number of selected shape exemplars (avg. over all frames and cameras). (b) Ratio of the number of correct pose hypotheses between both hierarchies (avg. over all frames and cameras).
and de(·) is the Euclidean distance. For the set of locations, we choose torso and head center as well as shoulder, elbow and wrist joint location for each arm. We regard a pose hypothesis as “correct”, if the average pose error to the ground truth is less than 10cm. We first compare our hybrid hierarchy clustering approach as described in Section 3.4 with a state-of-the-art clustering approach proposed in [16] in the context of hand tracking. There, clustering is performed directly in parameter space using a hierarchical k-means algorithm. We constructed an equivalent alternative hierarchy (“angle-clustered hierarchy”) with the same number of exemplars on each level. To ensure a fair comparison, we evaluate the single-view shape detection step (Section 3.4) with the same tree-level specific thresholds for both shape hierarchies. Figure 4(a) shows the cumulative number of correct pose hypotheses in relation to the number of selected shape exemplars after single-view shape detection. Using our proposed hierarchy, we obtain about one order of magnitude more correct poses compared to the hierarchy clustered in parameter space. Figure 4(b) shows that the ratio of the number of correct poses between both hierarchies saturates at about 12. We additionally plot the ratio of the number of correct solutions, normalized by the number of extracted poses to take out the influence of a variable number of shape exemplars matched. Still, the proposed hierarchy generates about 9.5 times more correct hypotheses; we therefore continue all following experiments using this hierarchy. The considerably worse performance of the angle-clustered hierarchy is explained by the fact that equal distance in angle space does not imply equal shape (dis)similarity. In particular, the represented joint angles are part of an articulated model – for example, small changes of the torso rotation angle ψtorso will have a large effect on the projected silhouette if one or both arms are extended.
Fig. 5. Example result images (all three camera views shown each). Top row: Top-ranked pose hypothesis. Bottom row: Best pose hypothesis out of 20 best-ranked.
Fig. 6. (a) Average pose error of the best solution among the K best-ranked (K on x-axis), average over all frames of the data set. (b) Average pose error of the best solution among the K best-ranked (K ∈ {1, 10, 100}) for 150 frames of a sequence.
verification/optimization step. To obtain an average error of 10cm we need to disambiguate among the best 20 ranked pose hypotheses on average, while for 50 hypotheses, the average error decreases to approximately 8cm. Figure 6(b) provides a closer look at the average pose error for each frame in a sequence of 150 images. Between frames 10 and 80 the top-ranked pose hypothesis gives an acceptable estimate of the actual (ground truth) pose; considering more pose hypotheses for disambiguation can provide yet better accuracy. However, the spikes between frames 1-10 and 80-125 also show that our purely shape-based single-frame pose recovery does not succeed in all cases – our system still has some difficulties with more “ambiguous” poses, e.g. with hands close to the torso (see e.g. Figure 5, 2nd column), or when the silhouette does not convey sufficient information about front/back orientation of the person (see e.g. Figure 5, 3rd column). Many of these cases can be disambiguated by incorporating additional knowledge such as temporal information or enriching the likelihood function by learning an appearance model in addition to shape. Both approaches lead toward tracking and are thus out of scope for this paper, but are discussed e.g. in [17]. Figure 7 shows a plot of the average pose error before and after local pose optimization (Section 3.6), evaluated on 10 images from our data set. 1280 test input poses have been created by random perturbations π GT + N (0, Σ) of the ground truth pose πGT , with varying covariances Σ. Being a local optimization step, we expect the convergence
Fig. 7. Left: Plot of the average pose error in cm (Equation 5) before and after gradient-based local pose optimization (Section 3.6). Right: Example of local pose optimization, before (top row, avg. error 8.7cm) and after (bottom row, avg. error 5.9cm).
area to be close to the true solution; indeed, we can see that it is quite effective up to an input pose error of about 10cm. In addition to improving our overall experimental results (see Figure 6(a)), we expect that this transitioning from a discrete to a continuous pose space can also prove useful when evaluating motion likelihoods between poses in a temporal context that have been trained on real, i.e. undiscretized movement data. Our current system requires about 45-60 s per frame (image triplet) to recover the list of pose hypotheses, running with unoptimized C++ code on a 2.6 GHz Intel PC. Currently the steps involving on-line rendering (Sections 3.5 and 3.6) and, to a lesser degree, single-view shape detection (Section 3.4) are our performance bottleneck. These components can be easily parallelized, allowing a near-linear reduction of processing time with the number of available processing cores.
5 Conclusion and Further Work We proposed a system for 3D human upper body pose estimation from multiple cameras. The system combines single-view hierarchical shape detection with a cascaded multi-view verification stage and gradient-based local pose optimization. The exemplar hierarchy is created using a novel hybrid clustering approach based on shape similarity and we demonstrated that it significantly outperforms a parameter-space clustered hierarchy in pose retrieval experiments. Future work involves extension to whole-body pose recovery, which would be rather memory intensive if implemented directly. A more suitable solution, better able to deal with partial occlusion, is to recover upper and lower body pose separately and integrate results. Another area of future work involves extending the estimation to the shape model in addition to the pose.
References 1. Forsyth, D., et al.: Computational studies of human motion. Found. Trends. Comput. Graph. Vis. 1(2-3), 77–254 (2005) 2. Moeslund, T.B., et al.: A survey of advances in vision-based human motion capture and analysis. CVIU 103(2-3), 90–126 (2006) 3. Cheung, K.M.G., et al.: Shape-from-silhouette across time - parts I and II. IJCV 62 and 63(3), 221–247 and 225–245 (2005) 4. Mikic, I., et al.: Human body model acquisition and tracking using voxel data. IJCV 53(3), 199–223 (2003) 5. Starck, J., Hilton, A.: Model-based multiple view reconstruction of people. In: ICCV, pp. 915–922 (2003) 6. Agarwal, A., Triggs, B.: Recovering 3D human pose from monoc. images. TPAMI 28(1), 44–58 (2006) 7. Bissacco, A., et al.: Fast human pose estimation using appearance and motion via multidimensional boosting regression. In: CVPR (2007) 8. Kanaujia, A., et al.: Semi-supervised hierarchical models for 3d human pose reconstruction. In: CVPR (2007) 9. Shakhnarovich, G., et al.: Fast pose estimation with parameter-sensitive hashing. In: ICCV, pp. 750–757 (2003) 10. Kohli, P., et al.: Simultaneous segmentation and pose estimation of humans using dynamic graph cuts. IJCV 79, 285–298 (2008) 11. Lee, M.W., Cohen, I.: A model-based approach for estimating human 3D poses in static images. TPAMI 28(6), 905–916 (2006) 12. Mori, G., Malik, J.: Recovering 3D human body configurations using shape contexts. TPAMI 28(7), 1052–1062 (2006) 13. Ramanan, D., et al.: Tracking people by learning their appearance. TPAMI 29(1), 65–81 (2007) 14. Sigal, L., et al.: Tracking loose-limbed people. In: CVPR (2004) 15. Navaratnam, R., et al.: Hierarchical part-based human body pose estimation. In: BMVC (2005) 16. Stenger, B., et al.: Model-based hand tracking using a hierarchical Bayesian filter. TPAMI 28(9), 1372–1384 (2006) 17. Hofmann, M., Gavrila, D.M.: Multi-view 3d human pose estimation combining single-frame recovery, temporal integration and model adaptation. In: CVPR (2009) 18. Gavrila, D.M., Davis, L.: 3-D model-based tracking of humans in action: a multi-view approach. In: CVPR (1996) 19. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: ICPR (2), pp. 28–31 (2004) 20. Laurentini, A.: The visual hull concept for silhouette-based image understanding. TPAMI 16(2), 150–162 (1994) 21. Gavrila, D.M., Philomin, V.: Real-time object detection for “smart” vehicles. In: ICCV, pp. 87–93 (1999) 22. Rogez, G., et al.: Randomized trees for human pose detection. In: CVPR (2008) 23. Athitsos, V., Sclaroff, S.: Estimating 3D hand pose from a cluttered image. In: CVPR, pp. II.432–II.439 (2003) 24. Bouguet, J.Y.: Camera calib. toolbox for matlab (2003)
Dense Stereo-Based ROI Generation for Pedestrian Detection

C.G. Keller 1, D.F. Llorca 2, and D.M. Gavrila 3,4

1 Image & Pattern Analysis Group, Department of Math. and Computer Science, Univ. of Heidelberg, Germany
2 Department of Electronics, Univ. of Alcalá, Alcalá de Henares (Madrid), Spain
3 Environment Perception, Group Research, Daimler AG, Ulm, Germany
4 Intelligent Systems Lab, Fac. of Science, Univ. of Amsterdam, The Netherlands
{uni-heidelberg.keller,dariu.gavrila}@daimler.com, [email protected]

Abstract. This paper investigates the benefit of dense stereo for the ROI generation stage of a pedestrian detection system. Dense disparity maps allow an accurate estimation of the camera height, pitch angle and vertical road profile, which in turn enables a more precise specification of the areas on the ground where pedestrians are to be expected. An experimental comparison between sparse and dense stereo approaches is carried out on image data captured in complex urban environments (i.e. undulating roads, speed bumps). The ROI generation stage, based on dense stereo and specific camera and road parameter estimation, results in a detection performance improvement of factor five over the state-of-the-art based on ROI generation by sparse stereo. Interestingly, the added processing cost of computing dense disparity maps is at least partially amortized by the fewer ROIs that need to be processed at the system level.
1 Introduction
Vision-based pedestrian detection is a key problem in the domain of intelligent vehicles (IV). Large variations in human pose and clothing, as well as varying backgrounds and environmental conditions make this problem particularly challenging. The first stage in most systems consists of identifying generic obstacles as regions of interest (ROIs) using a computationally efficient method. Subsequently, a more expensive pattern classification step is applied. Previous IV applications have typically used sparse, feature-based stereo approaches (e.g. [1,9]) because of lower processing cost. However, with recent hardware advances, real-time dense stereo has become feasible [12] (here we use a hardware implementation of the semi-global matching (SGM) algorithm [7]). Both sparse and dense stereo approaches have proved suitable to dynamically estimate camera height and pitch angle, in order to deal with road imperfections, speed bumps, car accelerations, etc. Dense stereo, furthermore, holds the potential to also reliably estimate the vertical road profile (which feature-based stereo, due to its sparseness, does not). The more accurate estimation of the ground location of pedestrians can be expected to improve system performance, especially when considering undulating, hilly roads. The aim of this paper thus is to investigate the advantages of dense vs. sparse disparity maps when detecting generic obstacles in the early stage of a pedestrian
Fig. 1. Overview of the dense stereo-based ROI generation system comprising dense stereo computation, pitch estimation, corridor computation, B-Spline road profile modeling and multiplexed depth maps scanning with windows related to minimum and maximum extents of pedestrians
detection system [9]. We are interested both in the ROC performance (trade-off between correct and false detections) and in the processing cost.
2 Related Work
Many interesting approaches for pedestrian detection have been proposed. See [4] for a recent survey and a novel publicly available benchmark set. Most work has proceeded with a learning-based approach by-passing a pose recovery step and describing human appearance directly in terms of low-level features from a region of interest (ROI). In this paper, we concentrate on the stereo-based ROI generation stage. The simplest technique to obtain object location hypotheses is the sliding window technique, where detector windows at various scales and locations are shifted over the image. This approach in combination with powerful classifiers (e.g. [3,13,16]) is currently computationally too expensive for real-time applications. Significant speed-ups can be obtained by including application-specific constraints such as flat-world assumption, ground-plane based objects and common geometry of pedestrians, e.g. object height or aspect ratio [9,17]. Besides monocular techniques (e.g. [5]), which are out of scope in this work, stereo vision is an effective approach for obtaining ROIs. In [20] a foreground region is obtained by clustering in the disparity space. In [2,10] ROIs are selected considering the x- and y-projections of the disparity space following the v-disparity representation [11]. In [1] object hypotheses are obtained by using a subtractive clustering in the 3D space in world coordinates. Either monocular or stereo, most approaches are carried out under the assumption of a planar road and no camera height and camera pitch angle variations. In recent literature on intelligent vehicles many interesting approaches have been proposed to perform road modeling and to estimate camera pitch angle and camera height. Linear fitting in the v-disparity [14], in world coordinates [6] and in the so-called virtual-disparity image [18] has been proposed to estimate the camera pitch angle and the camera height. In [11] the road surface is modeled by the fitting of the envelope of piecewise linear functions in the v-disparity space. Other approaches are performed by fitting of a quadratic polynomial [15] or a clothoid function [14] in the v-disparity space as well. Building upon this work, we propose the use of dense stereo vision for ROI generation in the context of pedestrian detection. Dense disparity maps are provided in real-time [7]. Firstly, camera pitch angle is estimated by determining the slope with highest probability in the v-disparity map, for a reduced distance
range. Secondly, a corridor of a predefined width is computed using the vehicle velocity and the yaw rate. Only points that belong to that corridor will be used for subsequent road surface modeling. Then, the ground surface is represented as a parametric B-Spline surface and tracked by using a Kalman filter [19]. Reliability of the road profile estimation is an important issue which has to be considered for real implementations. ROIs are finally obtained by analyzing the multiplexed depth maps as in [9] (see Figure 1).
3 Dense Stereo-Based ROI Generation
3.1 Modeling of Non-planar Road Surface
Feature-based stereo vision systems typically provide depth measurements at points with sufficient image structure, whereas dense stereo algorithms estimate disparities at all pixels, including untextured regions, by interpolation. Before computing the road profile, the camera pitch angle is estimated by using the v-disparity space. We assume that the camera is installed such that the roll angle is insignificant. Then, the disparity of a planar road surface (this assumption can be accepted in the vehicle vicinity) can be calculated by:

d(v) = a · v + b    (1)
where v is the image row and a, b are the slope and the offset which depend on camera height and tilt angle respectively. Both parameters can be estimated using a robust estimator. However, if we assume a fixed camera height we can compute a slopes histogram and determine the slope with the highest probability, obtaining a first estimation of the camera pitch angle. In order to put only good candidates into the histogram, a disparity range is calculated for each image row, depending on the tolerance of the camera height and tilt angle. The next step consists of computing a corridor of a pre-defined width using the vehicle velocity, the yaw rate, the camera height and the camera tilt angle. If the vehicle is stopped, a fixed corridor is used. In this way, a considerable amount of object points is not taken into account when modeling the road surface. This is particularly important when the vehicle is taking a curve, since most of the points in front of the vehicle correspond to object points. The road profile is represented as a parametric B-Spline surface as in [19]. B-Splines are a basis for the vector space of piecewise polynomials with degree d. The basis functions are defined on a knot vector c using equidistant knots within the observed distance interval. A simple B-Spline least squares fit tries to approximate the 3D measurements optimally. However, a more robust estimation over time is achieved by integrating the B-Spline parameter vector c, the camera
Fig. 2. Road surface modeling. Distances grid and their corresponding height values along with camera height and tilt angle.
Fig. 3. Wrong road profile estimation when a vertical object appears in the corridor for a consecutive number of frames. The cumulative variance for the bin in which the vertical object is located increases and the object points are eventually passed to the Kalman filter.
pitch angle α and the camera height H into a Kalman filter. Finally, the filter state vector is converted into a grid of distances and their corresponding road height values as depicted in Figure 2. The number of bins of the grid will be as accurate as the B-Spline sampling.
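As an illustration of the first stage of this subsection, the slope-histogram pitch estimate related to Eq. (1) can be sketched as follows (Python/NumPy; the pair-sampling strategy and bin count are illustrative choices, and the conversion from the dominant slope to a pitch angle, which requires the fixed camera height and the calibration, is omitted).

import numpy as np

def pitch_slope_estimate(v, d, n_bins=200):
    """Estimate the dominant road-line slope in v-disparity space.

    v, d -- image rows and disparities of candidate road pixels
            (already restricted to the expected disparity range per row).
    Returns the slope a of d(v) = a*v + b with the highest vote.
    """
    v = np.asarray(v, dtype=float)
    d = np.asarray(d, dtype=float)
    # slopes from random point pairs (a cheap robust alternative to line fitting)
    rng = np.random.default_rng(0)
    i = rng.integers(0, len(v), size=5000)
    j = rng.integers(0, len(v), size=5000)
    keep = np.abs(v[i] - v[j]) > 5             # avoid near-identical rows
    slopes = (d[i][keep] - d[j][keep]) / (v[i][keep] - v[j][keep])
    hist, edges = np.histogram(slopes, bins=n_bins)
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])     # slope with the highest probability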
3.2 Outlier Removal
In general, the method of [19] works well if the measurements provided to the Kalman filter correspond to actual road points. The computation of the corridor removes a considerable amount of object points. However, there are a few cases in which the B-Spline road modeling still leads to bad results. These cases are mainly caused by vertical objects (cars, motorbikes, pedestrians, cyclists, etc.) in the vicinity of the vehicle. Reflections in the windshield can cause additional correlation errors in the stereo image. If we include these points, the B-Spline fitting achieves a solution which climbs or wraps over the vertical objects. In order to avoid this problem, the variance of the road profile for each bin, σi², is computed. Thus, if the measurements for a specific bin are out of the bounds defined by the predicted height and the cumulative variance, they are not added to the filter. Although this alternative can deal with spurious errors, if the situation remains for a consecutive number of iterations (e.g., when there is a vehicle stopped in front of the host vehicle), the variance increases due to the unavailability of measurements, and the points pertaining to the vertical object are eventually passed to the filter as measurements. This situation is depicted in Figure 3. Accordingly, a mechanism is needed in order to ensure that points corresponding to vertical objects are never passed to the filter. We compute the variance of all measurements for a specific bin and compare it with the expected variance at the given distance. The latter can be computed by using the associated standard deviations σm via error propagation from stereo triangulation [15,19]. If the computed
Fig. 4. Rejected measurements for bin i at distance Zi, since the measurement variance σi² is greater than the expected variance σei² in that bin
Fig. 5. Accepted measurements for bins i and i+1 at distances Zi and Zi+1, since the measurement variances σi² and σi+1² are lower than the expected variances σei² and σei+1² in these bins

variance σi² is greater than the expected one σei², we do not rely on the measurements but on the prediction for that bin. This is useful for cases in which there is a vertical object like the one in the example depicted in Figure 4. However, in cases in which the rear part of the vertical object produces 3D information for two consecutive bins, this approach may fail depending on the distance to the vertical object. For example, in Figure 5 the rear part of the vehicle yields 3D measurements in two consecutive bins Zi and Zi+1 whose variance is lower than the expected one for those bins. In this case, measurements will be added to the filter, which will yield unpredictable results. We therefore define a fixed region of interest in which we restrict measurements to lie. To that effect, we quantify the maximum road height changes at different distances and we fit a second order polynomial, see Figure 6. The fixed region can be seen as a compromise between filter stability and response to sharp road profile changes (undulating roads). Apart from this region of interest, we maintain the aforementioned test on the variance, to see if measurements corresponding to a particular bin of the grid are added or not to the filter.
Fig. 6. Second order polynomial function used to accept/reject measurements at all distances
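Taken together, the acceptance test for one distance bin combines the prediction bounds, the comparison of the measured variance with the variance expected from stereo triangulation, and the fixed polynomial region of interest; a schematic version is given below (Python/NumPy; thresholds and the expected-variance model are placeholders, not the actual filter implementation).

import numpy as np

def accept_bin(heights, predicted_h, cum_var, expected_var, max_rise):
    """Decide whether the road-height measurements of one distance bin are
    passed to the Kalman filter.

    heights      -- measured road heights falling into the bin
    predicted_h  -- height predicted by the current B-spline profile
    cum_var      -- cumulative (predicted) variance of the bin
    expected_var -- variance expected from stereo triangulation error at this distance
    max_rise     -- second-order polynomial bound on road-height change at this distance
    """
    heights = np.asarray(heights, dtype=float)
    if heights.size == 0:
        return False
    # 1) reject measurements outside prediction +/- cumulative-variance bounds
    inside = np.abs(heights - predicted_h) <= np.sqrt(cum_var)
    if not np.any(inside):
        return False
    # 2) reject bins whose measured variance exceeds the expected stereo variance
    if np.var(heights[inside]) > expected_var:
        return False
    # 3) fixed region of interest: plausible road-height change at this distance
    if np.abs(np.mean(heights[inside]) - predicted_h) > max_rise:
        return False
    return True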
3.3 System Integration
Initial ROIs Ri are generated using a sliding window technique where detector windows at various scales and locations are shifted over the depth map. In previous work [9] a flat-world assumption along with known camera geometry was used, so that the search space was drastically restricted. Pitch variations were handled by relaxing the scene constraints [9], e.g., including camera pitch and camera height tolerances. However, thanks to the use of dense stereo a reliable estimation of the vertical profile of the road is computed along with the camera height and pitch angle. In order to easily adapt the subsequent detection modules, we compute new camera heights Hi and pitch angles αi for all bins of the road profile grid. After that, standard equations for projecting 3D points into the image plane can be used. First of all, dense depth maps are filtered as follows: points Pr = (Xr, Yr, Zr) under the actual road profile, i.e., Zi < Zr < Zi+1 and Yr < hi, and over the actual road profile plus the maximum pedestrian size, i.e., Zi < Zr < Zi+1 and Yr > hi + Hmax, are removed since they do not correspond to obstacles (possible pedestrians). The resulting filtered depth map is multiplexed into N discrete depth ranges, which are subsequently scanned with windows related to the minimum and maximum extent of pedestrians. Possible window locations (ROIs) are defined according to the road profile grid (we assume the pedestrian stands on the ground). Each pedestrian candidate region Ri is represented in terms of the number of depth features DFi. A threshold θR governs the amount of ROIs which are committed to the subsequent module. Only ROIs with DFi > θR trigger the evaluation of the next cascade module. Others are rejected immediately. Pedestrian recognition proceeds with shape-based detection, involving coarse-to-fine matching of an exemplar-based shape hierarchy to the image data at hand [9]. Positional initialization is given by the output ROIs of the dense stereo-based ROI generation stage. The shape hierarchy is constructed off-line in an automatic fashion from manually annotated shape labels. On-line matching involves traversing the shape hierarchy with the Chamfer distance between a shape template and an image sub-window as a smooth and robust similarity measure. Image locations, where the similarity between shape and image is above a user-specified threshold, are considered detections. A single distance threshold applies for each level of the hierarchy. Additional parameters govern the edge density on which the underlying distance map is based. Detections of the shape matching step are verified by a texture-based pattern classifier. We employ a multi-layer feed-forward neural network operating on local adaptive receptive field features [9]. Finally, temporal integration of detection results is employed to overcome gaps in detection and suppress spurious false positives. A 2D bounding box tracker is utilized, with an object state model involving bounding box position and extent [9]. State parameters are estimated using an α−β tracker.
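The depth-map scanning that produces the initial ROIs can be condensed into a few lines; the sketch below (Python/NumPy; window geometry, depth ranges and θR are illustrative values rather than the system parameters) keeps only points between the local road height and the road height plus the maximum pedestrian height, multiplexes them into depth ranges, and commits windows with more than θR depth features.

import numpy as np

def generate_rois(points, road_height, h_max=2.0, depth_edges=(10, 15, 20, 27),
                  theta_R=50):
    """Return candidate ROIs (depth-range index, lateral offset, support) from stereo points.

    points      -- (N, 3) array of (X, Y, Z) world points, Y = height above datum
    road_height -- callable(Z) -> local road height from the B-spline grid
    h_max       -- maximum pedestrian height [m]
    theta_R     -- minimum number of depth features per ROI
    """
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    h = np.array([road_height(z) for z in Z])
    keep = (Y > h) & (Y < h + h_max)              # remove non-obstacle points
    X, Z = X[keep], Z[keep]

    rois = []
    for i in range(len(depth_edges) - 1):          # multiplexed depth ranges
        in_range = (Z >= depth_edges[i]) & (Z < depth_edges[i + 1])
        # slide pedestrian-width windows laterally (0.8 m steps as an example)
        for x0 in np.arange(-4.0, 4.0, 0.8):
            df = np.count_nonzero(in_range & (X >= x0) & (X < x0 + 0.8))
            if df > theta_R:                        # commit ROI to the next module
                rois.append((i, x0, df))
    return rois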
4 Experiments
We tested our dense stereo-based ROI generation scheme on a 5 min (3942 image) sequence recorded from a vehicle driving through the canal area of the city of
Amsterdam. Because of the many bridges and speed bumps, the sequence is quite challenging for the road profiling component. Pedestrians were manually labeled; their 3D position was obtained by triangulation in the two camera views. Only pedestrians located in front of the vehicle in the area 12-27m in longitudinal and ±4m in lateral direction were considered required. Pedestrians beyond this detection area were regarded as optional. Localization tolerance is selected as in [9] to be X = 10% and Z = 30% as percentage of distance for the lateral (X) and longitudinal (Z) direction. In all, this resulted in 1684 required pedestrian single-frame instances in 66 distinct trajectories, to be detected by our pedestrian system. See Figure 7 for an illustration of the results. We first examined the performance of the ROI generation module in isolation, see Figure 8. Shown are the ROCs (correctly vs. falsely passed ROIs) for various configurations (dense vs. sparse stereo, with/without pitch angle and road profile estimation). No significant performance difference can be observed between dense- or sparse-stereo-based ROI generation when neither pitch angle nor road profile is estimated. Estimating the pitch angle leads however to a clear performance
Fig. 7. System example with estimated road profile and pedestrian detection. (a) Final output with detected pedestrian marked red. The magenta area illustrates the system detection area. (b) Dense stereo image. (c) Corridor used for spline computation after outlier removal. (d) Spline (blue) fitted to the measurements (red) in profile view.
Fig. 8. ROC performance of the stereo-based ROI generation module for different variations
Table 1. Comparison of the number of false positives and the total number of generated ROIs per frame for an exemplary threshold θR resulting in a detection rate of 92%
                              FPs/Frame   # ROIs/Frame
Dense - Road Profiling             1036           1549
Sparse - Pitch Estimation          1662           2345
Sparse - Fixed Pitch               3367           4388
Dense - Fixed Pitch                3395           4355
improvement. Incorporating the estimated road profile yields an additional performance gain. The total number of generated ROIs and false positives for an exemplary detection rate of 92% are summarized in Table 1. The number of ROIs that need to be generated can be reduced by a factor of 2.8 when utilizing road profile information compared to a system with static camera position. Using camera pose information leads to a reduction of generated ROIs by a factor of 1.87. A reduced number of generated ROIs implies fewer computations in later stages of our detection system, and thus faster processing speed (approx. linear in the number of ROIs). We now turn to the evaluation on the overall system level, i.e. with the various ROI generation schemes integrated in the pedestrian classification and tracking system of [9]. Relevant module parameters (in particular the density threshold θR for stereo-based ROI generation) were optimized for each system configuration following the ROC convex hull technique described in [9]. See Figure 9. One observes that the relative ranking of the various ROI generation schemes is maintained cf. Figure 8 (the dense stereo, fixed pitch and flat world case is not plotted additionally, as it has similar performance as the equivalent sparse-stereo case). That is, there is a significant benefit of estimating pitch angle, camera height and road profile, i.e. a performance improvement of factor 5.
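The reduction factors quoted above follow directly from the ROI counts in Table 1; as a quick arithmetic check (plain Python):

# ROIs per frame from Table 1
rois = {"dense_profile": 1549, "sparse_pitch": 2345,
        "sparse_fixed": 4388, "dense_fixed": 4355}

print(rois["dense_fixed"] / rois["dense_profile"])   # ~2.8: road profile vs. static camera
print(rois["sparse_fixed"] / rois["sparse_pitch"])   # ~1.87: pitch estimation vs. fixed pitch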
Fig. 9. Overall performance of system configurations with different ROI generation stages
5 Conclusions
We investigated the benefit of dense stereo for the ROI generation stage of a pedestrian detection system. In challenging real-world sequences (i.e. undulating roads, bridges and speed bumps), we compared various versions of dense and sparse stereo-based ROI generation. For the case of a flat-world assumption and fixed camera parameters, sparse and dense stereo provided equal ROI generation performance (baseline configuration). The specific estimation of camera height and pitch angle resulted in a performance improvement of about factor three (reduction of false positives at the same correct detection rate). When estimating the road surface as well, the benefit increased to a factor of five vs. the baseline configuration. Interestingly, the added processing cost of computing dense, rather than sparse, disparity maps is at least partially amortized by the fewer ROIs that need to be processed at the system level.
References 1. Alonso, I.P., Llorca, D.F., Sotelo, M.A., Bergasa, L.M., de Toro, P.R., Nuevo, J., Ocana, M., Garrido, M.A.: Combination of Feature Extraction Methods for SVM Pedestrian Detection. IEEE Transactions on Intelligent Transportation Systems 8(2), 292–307 (2007) 2. Broggi, A., Fascioli, A., Fedriga, I., Tibaldi, A., Rose, M.D.: Stereo-based preprocessing for human shape localization in unstructured environments. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2003) 3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, CVPR (2005) 4. Enzweiler, M., Gavrila, D.M.: Monocular Pedestrian Detection: Survey and Experiments. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). IEEE Computer Society Digital Library (2009), http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.260
5. Enzweiler, M., Kanter, P., Gavrila, D.M.: Monocular pedestrian recognition using motion parallax. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2008) ´ 6. Fern´ andez, D., Parra, I., Sotelo, M.A., Revenga, P., Alvarez, S.: 3D candidate selection method for pedestrian detection on non-planar roads. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2007) 7. Franke, U., Gehrig, S., Badino, H., Rabe, C.: Towards Optimal Stereo Analysis of Image Sequences. In: Sommer, G., Klette, R. (eds.) RobVis 2008. LNCS, vol. 4931, pp. 43–58. Springer, Heidelberg (2008) 8. Gandhi, T., Trivedi, M.M.: Pedestrian protection systems: Issues, survey and challenges. IEEE Transactions on Intelligent Transportation Systems 8(3), 413–430 (2007) 9. Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. International Journal of Computer Vision 73(1), 41–59 (2007) 10. Grubb, G., Zelinsky, A., Nilsson, L., Ribbe, M.: 3D vision sensing for improved pedestrian safety. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2004) 11. Labayrade, R., Aubert, D., Tarel, J.P.: Real time obstacle detection on non flat road geometry through ’v-disparity’ representation. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2002) 12. van der Mark, W., Gavrila, D.M.: Real-Time Dense Stereo for Intelligent Vehicles. IEEE Transactions on Intelligent Transportation Systems 7(1), 38–50 (2006) 13. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(4), 349–361 (2001) 14. Nedevschi, S., Danescu, R., Frentiu, D., Marita, T., Oniga, F., Pocol, C., Graf, T., Schmidt, R.: High accuracy stereovision approach for obstacle detection on non-planar roads. In: Proc. of the IEEE Intelligent Engineering Systems, INES (2004) 15. Oniga, F., Nedevschi, S., Meinecke, M., Binh, T.: Road surface and obstacle detection based on elevation maps from dense stereo. In: Proc. of the IEEE Intelligent Transportation Systems, ITSC (2007) 16. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, CVPR (2007) 17. Shashua, A., Gdalyahu, Y., Hayun, G.: Pedestrian detection for driving assistance systems: single-frame classification and system level performance. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2004) 18. Suganuma, N., Fujiwara, N.: An obstacle extraction method using virtual disparity image. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2007) 19. Wedel, A., Franke, U., Badino, H., Cremers, D.: B-Spline modeling of road surfaces for freespace estimation. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2008) 20. Zhao, L., Thorpe, C.: Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems (ITS) 1(3)
Pedestrian Detection by Probabilistic Component Assembly

Martin Rapus 1,2, Stefan Munder 1, Gregory Baratoff 1, and Joachim Denzler 2

1 Continental AG, ADC Automotive Distance Control Systems GmbH, Kemptener Str. 99, 88131 Lindau, Germany
{martin.rapus,stefan.munder,gregory.baratoff}@continental-corporation.com
2 Chair for Computer Vision, Friedrich Schiller University of Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
[email protected]

Abstract. We present a novel pedestrian detection system based on probabilistic component assembly. A part-based model is proposed which uses three parts consisting of head-shoulder, torso and legs of a pedestrian. Components are detected using histograms of oriented gradients and Support Vector Machines (SVM). Optimal features are selected from a large feature pool by boosting techniques, in order to calculate a compact representation suitable for SVM. A Bayesian approach is used for the component grouping, consisting of an appearance model and a spatial model. The probabilistic grouping integrates the results, scale and position of the components. To distinguish both classes, pedestrian and non-pedestrian, a spatial model is trained for each class. Below miss rates of 8% our approach outperforms state of the art detectors. Above, performance is similar.
1 Introduction
Pedestrian recognition is one of the main research topics in computer vision, with applications ranging from security, where e.g. humans are observed or counted, to automotive safety, in particular vulnerable road user protection. The challenges arise from the varying appearance of pedestrians, due to clothing and posture, and from occlusions, for example when pedestrians walk in groups or are occluded by car hoods. For automotive safety applications, real-time performance needs to be combined with high accuracy and a low false positive rate. Earlier approaches employed full-body classification. Among the most popular, Papageorgiou et al. [12] apply Haar wavelets with an SVM [15]. Instead of an SVM, a cascade based on AdaBoost [6] is used by Viola and Jones [16] to achieve real-time performance. An extensive experimental evaluation of histograms of oriented gradients (HOG) for pedestrian recognition was carried out by Dalal and Triggs [2]. In place of the constant histogram selection [2], Zhu et al. [19] use a variable selection made by an AdaBoost cascade, which achieves better results. Gavrila and Munder [7] recognize pedestrians with local receptive fields and several neural networks.
The performance achieved with full-body classification is still not good enough to handle the large variability in human posture. To achieve better performance, part-based approaches are used; these are more robust against partial occlusions. Part-based approaches often consist of two steps: the first detects components, mostly by classification approaches, while the second groups them into pedestrians. One possible way to group components is to use classification techniques. Mohan et al. [11] use the approach proposed in [12] for the component detection. The best results per component are classified by an SVM into pedestrian and non-pedestrian. In Dalal's thesis [3], the HOG approach [2] is used for the component detectors. A spatial histogram for each component, weighted by the detection results, is classified by an SVM. Felzenszwalb et al. [5] determine the component model parameters (size and position) in the training process. For the pedestrian classification the HOG component feature vectors and geometric parameters (scale and position) are used as input for a linear SVM. The fixed ROI configuration used in these approaches limits the variability of part configurations they can handle. To overcome this limitation, spatial models that explicitly describe the arrangement of components were introduced. In general, these approaches incorporate an appearance model and a spatial model. One of the first approaches is by Mikolajczyk et al. [10]. The components are detected by SIFT-like features and AdaBoost. An iterative process with thresholding is used to generate the global result via a probabilistic assembly of the components, using the geometric relations distance vector and scale ratio between two parts, modeled by a Gaussian. Wu and Nevatia [18] use a component hierarchy with 12 parts and the full body as root component. The component detection is done by edgelet features [17] and boosting [14]. For the probabilistic grouping the position, scale and a visibility value are incorporated. Only the inter-occlusion of pedestrians is considered. The maximum-a-posteriori (MAP) configuration is computed by the Hungarian algorithm. All results above a threshold are regarded as pedestrians. Bergtholdt et al. [1] use all possible relations between 13 components. For the component detection, SIFT and color features are classified by randomized classification trees. The MAP configuration is computed with A*-search. A large number of parts is used by the last two approaches for robustness against partial occlusions. The computation time for the probabilistic grouping grows non-linearly with the number of components used and with the number of component detection results. As a consequence, these probabilistic methods do not achieve real-time performance on a current desktop PC. Our approach is part-based. For real-time operation our pedestrian detector is divided into the three parts head-shoulder, torso and legs, and for better classification performance we distinguish between frontal/rear and side views. HOGs [2] are used as component features. We make use of a variable histogram selection by AdaBoost. The selected histograms are classified by a linear SVM. Since weighted Fisher discriminant analysis (wFDA) [9] selects histograms similar to those chosen by a linear SVM, but requires less training time, we apply wFDA as the weak classifier. A Bayesian-based approach is used for component
grouping. To reduce the number of component detections thresholding is applied, keeping 99% true positive component detection rate. Our probabilistic grouping approach consists of an image matching and a spatial matching of the components. To use the component results for the image matching they are converted into probabilistic values. Invariance against scale and translation is achieved by using the distance vector, normalized through scale, and the scale ratio between two components. In comparison to existing approaches the spatial distributions are not approximated, instead the distribution histograms are used directly. We also differentiate component arrangements by class. Below miss rates of 8% our approach outperforms state of the art detectors. Above, performance is similar. The paper is organized as follows. Sect. 2 describes the component detection step, followed by the component grouping through a probabilistic model in Sect. 3. The results for the component detection and grouping step are discussed in Sect. 4. The conclusion forms Sect. 5 and the paper ends with an outlook in Sect. 6.
2 Component Detection
HOG features were shown to perform best in [2] and are thus adopted here for the component detection. The averaged gradient magnitude and HOG images for our components, derived from the INRIA Person dataset [2], are visualized in Fig. 1 and Fig. 2. Instead of the histograms, the corresponding edges with weighted edge lengths are shown. Pedestrian contours are well preserved in the average edge images, while irrelevant edges are suppressed. A slight difference can be seen in the head component: in the frontal view the whole contour is preserved, whereas in the side view only the head contour is preserved while the shoulder contour is blurred. Two different methods for the histogram selection are examined. One is a constant selection [2]: the image is divided into non-overlapping histograms, followed by an extraction of normalized blocks of neighboring histograms. The other approach is similar to [19] and uses a variable selection: the best histogram blocks (of varying size and position) are selected using AdaBoost, with weighted Fisher discriminant analysis [9] as the weak classifier. The classification of the generated feature vector is done by a linear SVM.
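As an illustration of this selection step, the following sketch (not the authors' code) builds a pool of variable-size HOG blocks, ranks them by a simple Fisher-style separability score as a stand-in for the AdaBoost/wFDA selection, and trains a linear SVM on the concatenated selected blocks. The library calls (scikit-image, scikit-learn), the block pool layout and the scoring rule are illustrative assumptions.

```python
# Hedged sketch: variable HOG-block selection for one component detector.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def block_hog(img, top, left, size):
    """9-bin HOG of one square block (size x size pixels) at (top, left)."""
    patch = img[top:top + size, left:left + size]
    return hog(patch, orientations=9,
               pixels_per_cell=(size // 2, size // 2),
               cells_per_block=(2, 2), feature_vector=True)

def candidate_blocks(shape, min_size=8, stride=4):
    """Pool of variable-size, variable-position blocks inside the window."""
    h, w = shape
    size = min_size
    while size <= min(h, w):
        for top in range(0, h - size + 1, stride):
            for left in range(0, w - size + 1, stride):
                yield (top, left, size)
        size *= 2

def fisher_score(pos_feats, neg_feats):
    """1D Fisher criterion on the block's mean HOG energy (stand-in for wFDA)."""
    p, n = pos_feats.mean(axis=1), neg_feats.mean(axis=1)
    return (p.mean() - n.mean()) ** 2 / (p.var() + n.var() + 1e-9)

def train_component_detector(pos_imgs, neg_imgs, n_blocks=20):
    blocks = list(candidate_blocks(pos_imgs[0].shape))
    scores = []
    for b in blocks:
        pos = np.array([block_hog(im, *b) for im in pos_imgs])
        neg = np.array([block_hog(im, *b) for im in neg_imgs])
        scores.append(fisher_score(pos, neg))
    selected = [blocks[i] for i in np.argsort(scores)[-n_blocks:]]
    X = np.array([np.concatenate([block_hog(im, *b) for b in selected])
                  for im in list(pos_imgs) + list(neg_imgs)])
    y = np.array([1] * len(pos_imgs) + [-1] * len(neg_imgs))
    return LinearSVC().fit(X, y), selected   # final classification by a linear SVM
```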
3 Probabilistic Component Assembly
This step builds the global pedestrian detections out of the detected components V = {v^HS, v^T, v^L}, where the superscripts HS, T and L stand for head-shoulder, torso and legs, respectively, by applying the appearance and the spatial relationship. The probability P(L|I) of finding a pedestrian, consisting of the mentioned components, with configuration L = {l^HS, l^T, l^L} in the current image I, with l_i denoting position and scale of the i-th component, is given by Bayes' rule:
P(L|I) ∝ P(I|L) · P(L)   (1)
Fig. 1. Average gradient magnitudes and average HOGs for the frontal/rear view components (head, torso and legs) - INRIA Person dataset
Fig. 2. Average gradient magnitudes and average HOGs for the side view components (head, torso and legs) - INRIA Person dataset
The first factor P(I|L) is the detection probability of the components at the position and scale given by L. The second factor P(L) represents the prior probability of a positive pedestrian component arrangement. Every head-shoulder detection is used as a starting point to find the corresponding MAP configuration by greedy search. In the following sections we describe both factors in more detail.
3.1 Probabilistic Appearance Model
To compute P(I|L) the component results of the detection step are used. For this purpose the SVM outputs f(x) are converted into probabilistic values. Among the many choices available, we approximate the a-posteriori curve P(y = 1|f(x)), i.e. the probability that a specific SVM output f(x) corresponds to a pedestrian component (y = 1), because this model achieved the best fit. Using Bayes' rule with the priors P(y = −1) and P(y = 1), and the class-conditional densities p(f(x)|y = −1) and p(f(x)|y = 1), we get:
P(y = 1|f(x)) = p(f(x)|y = 1) P(y = 1) / Σ_{i∈{−1,1}} p(f(x)|y = i) P(y = i)   (2)
The resulting a-posteriori values for the frontal legs training set are shown in Fig. 3(b), derived from the class-conditional densities shown in Fig. 3(a). A sigmoid function s(z = f(x)) = 1/(1 + exp(Az + B)) is used to approximate
Fig. 3. (a) Distribution histograms ("SVM result histograms – Legs Front": positive and negative histogram over the SVM result) and (b) the approximated a-posteriori curve ("Posterior Approximation – Legs Front": sigmoid approximation of the posterior probability) for the frontal legs
the posterior. The parameters of s(z) are determined by the maximum-likelihood method proposed by Platt [13], using the Levenberg-Marquardt method. To compute the sigmoid parameters, training sets for each component and view are used. Fig. 3(b) shows the approximated curve for the frontal legs. By assuming independence between the detectors for each component v_i, P(I|L) is given by:
P(I|L) = ∏_{v_i∈V} P(y = 1|f_i(x_i))   (3)
with x_i as the extracted feature vector and f_i as the result of the i-th component.
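A minimal sketch of this score-to-probability conversion, assuming NumPy/SciPy: the sigmoid parameters A and B are fitted by maximizing the training-set likelihood (a generic Nelder-Mead optimizer stands in for the Levenberg-Marquardt method used here), and the per-component posteriors are multiplied as in Eq. (3). Function names are hypothetical.

```python
# Hedged sketch: Platt-style sigmoid fit and the appearance term of Eq. (3).
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(svm_scores, labels):
    """labels in {-1,+1}; returns (A, B) of s(f) = 1/(1+exp(A*f+B))."""
    t = (labels + 1) / 2.0                       # targets in {0,1}

    def neg_log_lik(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * svm_scores + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    return minimize(neg_log_lik, x0=[-1.0, 0.0], method="Nelder-Mead").x

def appearance_likelihood(component_scores, sigmoid_params):
    """P(I|L) as the product of per-component posteriors."""
    p = 1.0
    for f, (A, B) in zip(component_scores, sigmoid_params):
        p *= 1.0 / (1.0 + np.exp(A * f + B))
    return p
```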
3.2 Probabilistic Geometric Model
Besides the appearance likelihood P(I|L) for a component configuration L, the probability of the spatial arrangement P(L) has to be computed. Invariance against scale and translation is achieved by using the relative distance vector d_ij and the scale ratio Δs_ij = s_j/s_i between two components i and j:
d_ij = (dx_ij, dy_ij)^T = (1/s_i) · (x_j − x_i, y_j − y_i)^T   (4)
As is common in the literature [4], the model is expressed as a graph G = (V, E), with the components v_i as vertices and the possible relations as edges e_ij between components i and j. Our model regards all possible component relations, except those between the same component in different views. Every edge e_ij gets a weight w_ij ∈ [0, 1] to account for the fact that component pairs of the same view appear
more likely than component pairs of different views. The weights are generated from the component training sets. With the priors P(l_i, l_j) = P(d_ij, Δs_ij), the probability of the component arrangement L is given as:
P(L) = ∏_{e_ij∈E} w_ij P(l_i, l_j) = ∏_{e_ij∈E} w_ij P(d_ij, Δs_ij)   (5)
The generated distribution histograms of the geometric parameters d_ij and Δs_ij are used as the priors P(l_i, l_j). To distinguish between pedestrian-like and non-pedestrian-like component arrangements, two spatial distributions are generated, one for the positive class, P_p(L), and one for the negative class, P_n(L). Distribution histograms are also used for the negative class. The distributions are computed as follows: first, the positive spatial distribution histograms are computed from training data. Afterwards, the spatial distributions for the negative class are generated, using only the hard negatives, i.e. those lying within the histogram range of the positive class. The final spatial result is the difference between the positive and the negative spatial result.
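The spatial term can be sketched as follows (illustrative, not the authors' implementation): the scale-normalized offset and the scale ratio of each modeled component pair are looked up in histograms learned for the positive and negative class, and the two products are subtracted. For brevity the joint histogram P(d_ij, Δs_ij) is factorized into one histogram per coordinate, which is a simplification of the model described above.

```python
# Hedged sketch: class-specific spatial score of a component configuration.
import numpy as np

def pair_geometry(comp_i, comp_j):
    """comp = (x, y, scale); scale-normalized offset and scale ratio, Eq. (4)."""
    xi, yi, si = comp_i
    xj, yj, sj = comp_j
    return (xj - xi) / si, (yj - yi) / si, sj / si

def hist_prob(hist, bin_edges, value):
    """Probability mass of the bin containing `value`; 0 outside the range."""
    idx = np.searchsorted(bin_edges, value) - 1
    return hist[idx] if 0 <= idx < len(hist) else 0.0

def pair_prob(model, key, dx, dy, ds):
    """model[key] maps 'dx'/'dy'/'ds' to (normalized histogram, bin edges)."""
    p = 1.0
    for name, val in (("dx", dx), ("dy", dy), ("ds", ds)):
        hist, edges = model[key][name]
        p *= hist_prob(hist, edges, val)
    return p

def spatial_score(config, pairs, pos_model, neg_model, weights):
    """P_p(L) - P_n(L): product over the modeled component relations (graph edges)."""
    p_pos = p_neg = 1.0
    for key in pairs:                      # key = (i, j), e.g. ('head', 'torso')
        i, j = key
        dx, dy, ds = pair_geometry(config[i], config[j])
        p_pos *= weights[key] * pair_prob(pos_model, key, dx, dy, ds)
        p_neg *= weights[key] * pair_prob(neg_model, key, dx, dy, ds)
    return p_pos - p_neg
```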
4 Experiments
The INRIA Person dataset [2] is used for our experiments. This dataset contains a training set with 2416 pedestrian labels and 1218 images without any pedestrians, and a test set with 1132 pedestrian images and 453 images not containing any pedestrians. Both sets have only global labels. For the component evaluation, part labels are needed, so in a first step we applied our component labels: head-shoulder, torso and legs, in front/rear and side view. In a second step the average label sizes were determined, see Table 1. Smaller labels were resized to the size given in Table 1. The numbers of positive training and test samples for every component and view are listed in Table 1. Some images have no component training labels because of occlusions. In a first experiment the component detection was evaluated, followed by testing the proposed probabilistic model from Sect. 3. Finally, the probabilistic method is compared to state-of-the-art detectors. Receiver Operating Characteristic (ROC) curves in log-log scale are used for the experimental evaluation of the miss rate, FalseNeg/(TruePos + FalseNeg), against the false-positive rate. The matching criterion is 75% overlap between a detection and the corresponding label.
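A small sketch of this evaluation criterion, assuming axis-aligned boxes; the exact overlap measure (here: intersection area over label area) is an assumption, since only "75% overlap" is stated.

```python
# Hedged sketch: 75%-overlap matching and the miss rate used for the ROC curves.
def overlap(det, gt):
    """Boxes as (x0, y0, x1, y1); fraction of the label box covered by the detection."""
    ix = max(0, min(det[2], gt[2]) - max(det[0], gt[0]))
    iy = max(0, min(det[3], gt[3]) - max(det[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return (ix * iy) / gt_area if gt_area > 0 else 0.0

def is_match(det, gt, threshold=0.75):
    return overlap(det, gt) >= threshold

def miss_rate(n_true_pos, n_false_neg):
    return n_false_neg / float(n_true_pos + n_false_neg)
```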
4.1 Component Detection
The proposed component detection from Sect. 2 is evaluated. Unsigned gradients, 9 orientation bins and a block size of 2x2 histogram cells are used as parameters for the HOG features. In this test the constant histogram selection is compared against the variable selection, as described in Sect. 2. The block sizes for the constant selection are 16x16 pixels for the frontal torso and 12x12 pixels for
Table 1. Component sizes and the number of positive training/test samples

Part    View    Width   Height   # pos. Training-Samples   # pos. Test-Samples
head    front    32      32       1726                      870
head    side     32      32        678                      262
torso   front    40      45       1668                      846
torso   side     32      45        646                      286
legs    front    34      55       1400                      756
legs    side     34      55        668                      376
the remaining components/views. For the variable selection, the block size ranges from 8x8 pixels up to the maximum, not limited to a specific scale. The negative training set was created by using the bootstrapping method given in [2]. The generation of regions of interest (ROI) is done by a sliding-window approach. ROIs are generated at different scales. A factor of 1.2 is used between two scales. In all scales the step size is 4 pixels in both directions. For the SVM classifier training we use SVMlight [8]. The ROC curves for the component detection are shown in Fig. 4 and Fig. 5, divided into frontal/rear and side views. They confirm that variable selection (solid lines) yields better results than constant selection (dotted lines), except for the frontal head component. The results for the frontal/rear head with constant selection are slightly better than those with variable selection. An interesting observation is the obvious difference between the head and leg results, which is stronger in the frontal/rear view than in the side view. The leg component produces, at 10% miss rate, three times fewer false positives than the head. In the frontal view, similar results are achieved by head and torso. The ROC curves of the side torso and side legs intersect at 10% miss rate. Below 10% miss rate, fewer false positives are produced by the torso, and above 10% miss rate the legs generate fewer false positives. The computation time per component ROI is on average 0.025 ms, on a 1.8 GHz dual-core PC, using only one core. At a resolution of 320x240 pixels, 20000 search windows are generated on average per component and view. The component detection at this resolution with full search takes about 3.1 seconds.
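The ROI generation described above can be sketched as follows; the base window size per component and the exact rounding are assumptions.

```python
# Hedged sketch: sliding-window ROI generation (stride 4 px, scale factor 1.2).
def generate_rois(image_w, image_h, win_w, win_h, scale_step=1.2, stride=4):
    rois, scale = [], 1.0
    while win_w * scale <= image_w and win_h * scale <= image_h:
        w, h = int(round(win_w * scale)), int(round(win_h * scale))
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                rois.append((x, y, w, h, scale))
        scale *= scale_step
    return rois

# e.g. generate_rois(320, 240, 40, 45) yields many thousands of windows over all scales
```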
4.2 Probabilistic Component Assembly
The proposed Bayesian approach to component assembly from Sect. 3 is evaluated here. In a first step the probabilistic approach is tested with and without the use of spatial distribution histograms for the negative class, and afterwards compared against state of the art detectors. These detectors are the one from Dalal [2] and the cascade from Viola and Jones [16]. Again the INRIA Person dataset is used as test set. First the probabilistic approach is evaluated. The results are given in Fig. 6. By using spatial distribution histograms from both classes we achieve better results. The difference between both curves is greater at higher false positive
Fig. 4. Front/rear component results (ROC curves: miss rate vs. false positives, log-log scale; constant vs. variable selection for head, torso and legs)
Fig. 5. Side component results (ROC curves: miss rate vs. false positives, log-log scale; constant vs. variable selection for head, torso and legs)
Fig. 6. Probabilistic grouping results (with and without negative spatial distribution)
Fig. 7. State-of-the-art detectors (global Dalal classifier, global Viola-Jones cascade, variable legs front/rear) in comparison to our approach (blue line)
rates. At low miss rates the additional use of spatial distributions for the negative class reduces the number of false positives compared to the common approach. In the following experiment the probabilistic approach is compared against state-of-the-art detectors. Fig. 7 shows the best probabilistic detector in comparison to the mentioned standard detectors and the best component result (frontal/rear legs). The results of our part-based approach are slightly better than the best state-of-the-art detector. Below 8% miss rate our probabilistic method outperforms the state-of-the-art detectors. Note that Dalal's detector takes a larger margin around a person, so in comparison to our approach more contextual information is incorporated. Fig. 8 shows some typical results of our approach. At a resolution of 320x240 pixels, after applying thresholding to the component detection results, we get on average about 400 detections per component and view. For this resolution, our probabilistic grouping approach takes 190 milliseconds on average on a 1.8 GHz PC.
Fig. 8. Some detection results (white - full body, black - head, green - torso, cyan - legs). No post-processing was applied to the images.
5 Conclusion
In this paper a Bayesian component-based approach for pedestrian recognition in single frames was proposed. Our pedestrian detector is composed of the head-shoulder, torso and legs, divided into front/rear and side view for better recognition. For the component detection a variable selection of histograms of oriented gradients and SVM classification is applied. In the next step, the components are grouped by a Bayes-based approach. To shrink the number of candidates for the probabilistic grouping, thresholding is applied to all component results such that a 99% true positive component detection rate remains. Invariance against scale and translation is achieved by using the relative distance vector and scale ratio between the components. To achieve a better separation into positive and negative spatial component arrangements, distributions for both classes are generated. Instead of approximating these distributions, for example by a Gaussian, the computed distribution histograms are used directly. The results confirm the benefit of using distributions for both classes and not only for one. Below miss rates of 8% our approach outperforms state-of-the-art detectors. Above, performance is similar.
6 Future Work
One main drawback of our approach is computation time, mainly of the component detection. Using a cascaded classifier would make the component detection faster. To improve the performance of our approach the narrow field of a pedestrian can be included as contextual information. First experiments show promising results. The performance of the front/rear views is much better than for the side views. To overcome this, left and right side views could be separated.
References 1. Bergtholdt, M., Kappes, J., Schmidt, S., Schnörr, C.: A Study of Parts-Based Object Class Detection Using Complete Graphs. In: IJCV (in press, 2009) 2. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR, vol. 1, pp. 886–893 (2005)
3. Dalal, N.: Finding People in Images and Videos, PhD thesis, Institut National Polytechnique de Grenoble (July 2006) 4. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recognition. IJCV 61(1), 55–79 (2005) 5. Felzenszwalb, P., Mcallester, D., Ramanan, D.: A Discriminatively Trained, Multiscale, Deformable Part Model. In: CVPR, Anchorage, Alaska, June 2008, pp. 1–8 (2008) 6. Freund, Y., Schapire, R.E.: Experiments with a New Boosting Algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996) 7. Gavrila, D.M., Munder, S.: Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle. IJCV 73, 41–59 (2007) 8. Joachims, T.: Making large-Scale SVM Learning Practical. In: Sch¨ olkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1999) 9. Laptev, I.: Improvements of Object Detection Using Boosted Histograms. In: British Machine Vision Conference, September 2006, vol. 3, pp. 949–958 (2006) 10. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004) 11. Mohan, A., Papageorgiou, C., Poggio, T.: Example-Based Object Detection in Images by Components. PAMI 23(4), 349–361 (2001) 12. Papageorgiou, C., Evgeniou, T., Poggio, T.: A Trainable Pedestrian Detection System. In: IVS, pp. 241–246 (1998) 13. Platt, J.: Probabilities for SV Machines. In: Press, M. (ed.) Advances in Large Margin Classifiers, pp. 61–74 (1999) 14. Schapire, R.E.: The Strength of Weak Learnability. Machine Learning 5(2), 197– 227 (1990) 15. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc, New York (1995) 16. Viola, P., Jones, M.: Robust Real-time Object Detection. IJCV 57(2), 137–154 (2004) 17. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: ICCV, vol. 1, pp. 90–97 (2005) 18. Wu, B., Nevatia, R.: Detection and Segmentation of Multiple, Partially Occluded Objects by Grouping, Merging, Assigning Part Detection Responses. IJCV 82(2), 185–204 (2009) 19. Zhu, Q., Yeh, M.-C., Cheng, K.-T., Avidan, S.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: CVPR, pp. 1491–1498 (2006)
High-Level Fusion of Depth and Intensity for Pedestrian Classification
Marcus Rohrbach1,3, Markus Enzweiler2, and Dariu M. Gavrila1,4
1 Environment Perception, Group Research, Daimler AG, Ulm, Germany
2 Image & Pattern Analysis Group, Dept. of Math. and Computer Science, Univ. of Heidelberg, Germany
3 Dept. of Computer Science, TU Darmstadt, Germany
4 Intelligent Systems Lab, Fac. of Science, Univ. of Amsterdam, The Netherlands
[email protected], {uni-heidelberg.enzweiler,dariu.gavrila}@daimler.com
Abstract. This paper presents a novel approach to pedestrian classification which involves a high-level fusion of depth and intensity cues. Instead of utilizing depth information only in a pre-processing step, we propose to extract discriminative spatial features (gradient orientation histograms and local receptive fields) directly from (dense) depth and intensity images. Both modalities are represented in terms of individual feature spaces, in each of which a discriminative model is learned to distinguish between pedestrians and non-pedestrians. We refrain from the construction of a joint feature space, but instead employ a high-level fusion of depth and intensity at classifier-level. Our experiments on a large real-world dataset demonstrate a significant performance improvement of the combined intensity-depth representation over depth-only and intensity-only models (factor four reduction in false positives at comparable detection rates). Moreover, high-level fusion outperforms low-level fusion using a joint feature space approach.
1 Introduction
Pedestrian recognition is an important problem in domains such as intelligent vehicles or surveillance. It is particularly difficult, as pedestrians tend to occupy only a small part of the image (low resolution), vary in pose (shape) and clothing (appearance), appear against varying backgrounds, and might be partially occluded. Most state-of-the-art systems derive feature sets from intensity images, i.e. grayscale (or colour) images, and apply learning-based approaches to detect people [1,3,9,22,23]. Besides image intensity, depth information can provide additional cues for pedestrian recognition. Up to now, the use of depth information has been limited to recovering high-level scene geometry [5,11] and focus-of-attention mechanisms [8]. Given the availability of real-time high-resolution dense stereo algorithms [6,20],
Marcus Rohrbach and Markus Enzweiler acknowledge the support of the Studienstiftung des deutschen Volkes (German National Academic Foundation).
Fig. 1. Framework overview. Individual classifiers are trained offline on intensity and corresponding depth images. Online, both classifiers are fused to a combined decision. For depth images, warmer colors represent closer distances from the camera.
we propose to enrich an intensity-based feature space for pedestrian classification with features operating on dense depth images (Sect. 3). Depth information is computed from a calibrated stereo camera rig using semi-global matching [6]. Individual classifiers are trained offline on features derived from intensity and depth images depicting pedestrian and non-pedestrian samples. Online, the outputs of both classifiers are fused to a combined decision (Sect. 4). See Fig. 1.
2 Related Work
A large amount of literature covers image-based classification of pedestrians. See [3] for a recent survey and a challenging benchmark dataset. Classification typically involves a combination of feature extraction and a discriminative model (classifier), which learns to separate object classes by estimating discriminative functions within an underlying feature space. Most proposed feature sets are based on image intensity. Such features can be categorized into texture-based and gradient-based. Non-adaptive Haar wavelet features have been popularized by [15] and adapted by many others [14,22], with manual [14,15] and automatic feature selection [22]. Adaptive feature sets were proposed, e.g. local receptive fields [23], where the spatial structure is able to adapt to the data. Another class of texture-based features involves codebook patches which are extracted around salient points in the image [11,18]. Gradient-based features have focused on discontinuities in image brightness. Local gradient orientation histograms were applied in both sparse (SIFT) [12] and dense representations (HOG) [1,7,25,26]. Covariance descriptors involving a model of spatial variation and correlation of local gradients were also used [19]. Yet others proposed local shape filters exploiting characteristic patterns in the spatial configuration of salient edges [13,24].
Fig. 2. Intensity and depth images for pedestrian (a) and non-pedestrian samples (b). From left to right: intensity image, gradient magnitude of intensity, depth image, gradient magnitude of depth
In terms of discriminative models, support vector machines (SVM) [21] are widely used in both linear [1,25,26] and non-linear variants [14,15]. Other popular classifiers include neural networks [9,10,23] and AdaBoost cascades [13,19,22,24,25,26]. Some approaches additionally applied a component-based representation of pedestrians as an ensemble of body parts [13,14,24]. Others combined features from different modalities, e.g. intensity, motion, depth, etc. Multi-cue combination can be performed at different levels: On module-level, depth [5,9,11] or motion [4] can be used in a pre-processing step to provide knowledge of the scene geometry and focus-of-attention for a subsequent (intensity-based) classification module. Other approaches have fused information from different modalities on feature-level by establishing a joint feature space (low-level fusion): [1,22] combined gray-level intensity with motion. In [17], intensity and depth features derived from a 3D camera with very low resolution (pedestrian heights between 4 and 8 pixels) were utilized. Finally, fusion can occur on classifier-level [1,2]. Here, individual classifiers are trained within each feature space and their outputs are combined (high-level fusion). We consider the main contribution of our paper to be the use of spatial depth features based on dense stereo images for pedestrian classification at medium resolution (pedestrian heights up to 80 pixels). A secondary contribution concerns fusion techniques of depth and intensity. We follow a high-level fusion strategy which allows to tune features specifically to each modality and base the final decision on a combined vote of the individual classifiers. As opposed to lowlevel fusion approaches [17,22], this strategy does not suffer from the increased dimensionality of a joint feature space.
3 Spatial Depth and Intensity Features
Dense stereo provides information for most image areas, apart from regions which are visible to only one camera (stereo shadow). See the dark red areas to the left of the pedestrian torso in Fig. 2(a). Spatial features can be based on either depth Z (in meters) or disparity d (in pixels). Both are inversely proportional, given the camera geometry with focal length f and the distance B between the two cameras:
Z(x, y) = fB / d(x, y)   at pixel (x, y)   (1)
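A small numeric illustration of Eq. (1); the focal length and baseline values are made up for the example and are not taken from the paper.

```python
# Hedged sketch: disparity-to-depth conversion, Z = f*B/d.
import numpy as np

def disparity_to_depth(disparity_px, focal_px=800.0, baseline_m=0.30):
    """Disparities <= 0 (no stereo match) are mapped to infinity."""
    d = np.asarray(disparity_px, dtype=float)
    with np.errstate(divide="ignore"):
        return np.where(d > 0, focal_px * baseline_m / d, np.inf)

print(disparity_to_depth(20.0))   # -> 12.0 metres for a 20 px disparity
```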
Fig. 3. Visualization of gradient magnitude (related to HOG) and LRF features on (a) intensity and (b) depth images. From left to right: Average gradient magnitude of pedestrian training samples, two exemplary 5×5-pixel local receptive field features and their activation maps, highlighting spatial regions of the training samples where the corresponding LRFs are most discriminative with regard to the pedestrian and non-pedestrian classes.
Objects in the scene have similar foreground/background gradients in depth space, irrespective of their location relative to the camera. In disparity space however, such gradients are larger, the closer the object is to the camera. To remove this variability, we derive spatial features from depth instead of disparity. We refer to an image with depth values Z(x, y) at each pixel (x, y) as depth image. A visual inspection of the depth image vs. the intensity image in Fig. 2 reveals that pedestrians have a distinct depth contour and texture which is different from the intensity domain. In intensity images, lower body features (shape and appearance of legs) are the most significant feature of a pedestrian (see results of part-based approaches, e.g. [14]). In contrast, the upper body area has dominant foreground/background gradients and is particularly characteristic for a pedestrian in the depth image. Additionally, the stereo shadow is clearly visible in this area (to the left of the pedestrian torso) and represents a significant local depth discontinuity. This might not be a disadvantage but rather a distinctive feature. The various salient regions in depth and intensity images motivate our use of fusion approaches between both modalities to benefit from the individual strengths, see Sect. 4. To instantiate feature spaces involving depth and intensity, we utilize wellknown state-of-the-art features, which focus on local discontinuities: Non-adaptive histogram of oriented gradients with a linear SVM (HOG/linSVM) [1] and a neural network using adaptive local receptive fields (NN/LRF) [23]. For classifier training, the feature vectors are normalized to [−1; +1] per dimension. To get an insight into HOG and LRF features, Fig. 3 depicts the average gradient magnitude of all pedestrian training samples (related to HOG), as well as exemplary local receptive field features and their activation maps (LRF), for both intensity and depth. We observe that gradient magnitude is particularly high around the upper body contour for the depth image, while being more evenly distributed for the intensity image. Further, almost no depth gradients are present on areas corresponding to the pedestrian body. During training, the local receptive field features have developed to detect very fine grained structures in the image intensity domain. The two features depicted in Fig. 3(a) can be regarded as specialized “head-shoulder” and “leg” detectors and are especially activated in the corresponding areas. For depth images, LRF features respond to larger structures in the image, see Fig. 3(b). Here, characteristic features
High-Level Fusion of Depth and Intensity for Pedestrian Classification
105
focus on the coarse depth contrast between the upper-body head/torso area. The mostly uniform depth texture on the pedestrian body is a prominent feature as well.
4 Fusion on Classifier-Level
A popular strategy to improve classification is to split up a classification problem into more manageable sub-parts on data-level, e.g. using mixture-of-experts or component-based approaches [3]. A similar strategy can be pursued on classifier-level. Here, multiple classifiers are learned on the full dataset and their outputs combined to a single decision. Particularly when the classifiers involve uncorrelated features, benefits can be expected. We follow a Parallel Combination strategy [2], where multiple feature sets (i.e. based on depth and intensity, see Sect. 3) are extracted from the same underlying data. Each feature set is then used as input to a single classifier and their outputs are combined (high-level fusion). For classifier fusion, we utilize a set of fusion rules which are explained below. An important prerequisite is that the individual classifier outputs are normalized, so that they can be combined homogeneously. The outputs of many state-of-the-art classifiers can be converted to an estimate of posterior probabilities [10,16]. We use this sigmoidal mapping in our experiments. Let x_k, k = 1, . . . , n, denote a (vectorized) sample. The posterior for the k-th sample with respect to the j-th object class (e.g. pedestrian, non-pedestrian), estimated by the i-th classifier, i = 1, . . . , m, is denoted by p_ij(x_k). Posterior probabilities are normalized across object classes for each sample, so that:
Σ_j p_ij(x_k) = 1   (2)
Classifier-level fusion involves the derivation of a new set of class-specific confidence values for each data point, qj (xk ), out of the posteriors of the individual classifiers, pij (xk ). The final classification decision ω(xk ) results from selecting the object class with the highest confidence: ω(xk ) = arg max (qj (xk )) j
(3)
We consider the following fusion rules to determine the confidence q_j(x_k) of the k-th sample with respect to the j-th object class: Maximum Rule. The maximum rule bases the final confidence value on the classifier with the highest estimated posterior probability:
q_j(x_k) = max_i p_ij(x_k)   (4)
Product Rule. Individual posterior probabilities are multiplied to derive the combined confidence:
q_j(x_k) = ∏_i p_ij(x_k)   (5)
Sum Rule. The combined confidence is computed as the average of the individual posteriors, with m denoting the number of individual classifiers:
q_j(x_k) = (1/m) Σ_i p_ij(x_k)   (6)
SVM Rule. A support vector machine is trained as a fusion classifier to discriminate between object classes in the space of posterior probabilities of the individual classifiers: let p_jk = (p_1j(x_k), . . . , p_mj(x_k)) denote the m-dimensional vector of individual posteriors for sample x_k with respect to the j-th object class. The corresponding hyperplane is defined by:
f_j(p_jk) = Σ_l y_l α_l · K(p_jk, p_jl) + b   (7)
Here, p_jl denotes the set of support vectors with labels y_l and Lagrange multipliers α_l, and K(·, ·) represents the SVM kernel function. We use a non-linear RBF kernel in our experiments. The SVM decision value f_j(p_jk) (distance to the hyperplane) is used as the confidence value:
q_j(x_k) = f_j(p_jk)   (8)
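The maximum, product and sum rules can be sketched in a few lines (illustrative, assuming NumPy); the SVM rule would additionally train an RBF-kernel SVM on the stacked posterior vectors and is omitted here.

```python
# Hedged sketch: classifier-level fusion of per-classifier posteriors, Eqs. (2)-(6).
import numpy as np

def fuse(posteriors, rule="sum"):
    """posteriors: array of shape (n_classifiers, n_classes) for one sample x_k."""
    p = np.asarray(posteriors, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)         # normalize across classes, Eq. (2)
    if rule == "max":
        q = p.max(axis=0)                         # Eq. (4)
    elif rule == "prod":
        q = p.prod(axis=0)                        # Eq. (5)
    elif rule == "sum":
        q = p.mean(axis=0)                        # Eq. (6)
    else:
        raise ValueError(rule)
    return int(np.argmax(q)), q                   # decision, Eq. (3)

# intensity classifier 0.9/0.1 and depth classifier 0.6/0.4 for (pedestrian, non-pedestrian)
decision, conf = fuse([[0.9, 0.1], [0.6, 0.4]], rule="sum")   # -> class 0 (pedestrian)
```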
5 Experiments
5.1 Experimental Setup
The presented feature/classifier combinations and fusion strategies, see Sects. 3 and 4, were evaluated in experiments on pedestrian classification. Training and test samples comprise non-occluded pedestrian and non-pedestrian cut-outs from intensity and corresponding depth images, captured from a moving vehicle in an urban environment. See Table 1 and Fig. 4 for an overview of the dataset. All samples are scaled to 48 × 96 pixels (HOG/linSVM) and 18 × 36 pixels (NN/LRF) with an eight-pixel (HOG/linSVM) and two-pixel border (NN/LRF) to retain contour information. For each manually labelled pedestrian bounding box we randomly created four samples by mirroring and geometric jittering.
Fig. 4. Overview of (a) pedestrian and (b) non-pedestrian samples (intensity and corresponding depth images)
Table 1. Dataset statistics. The same numbers apply to samples from depth and intensity images.
                         Pedestrians (labelled)   Pedestrians (jittered)   Non-Pedestrians
Training Set (2 parts)          10998                    43992                 43046
Test Set (1 part)                5499                    21996                 21523
Total                           16497                    65988                 64569
Non-pedestrian samples resulted from a pedestrian shape detection step with a relaxed threshold setting, i.e. containing a bias towards more difficult patterns. HOG features were extracted from those samples using 8 × 8 pixel cells, accumulated to 16 × 16 pixel blocks, with 8 gradient orientation bins, see [1]. LRF features (in 24 branches, see [23]) were extracted at a 5 × 5 pixel scale. Identical feature/classifier parameters are used for intensity and depth. The dimension of the resulting feature spaces is 1760 for HOG/linSVM and 3312 for NN/LRF. We apply a three-fold cross-validation to our dataset: the dataset is split up into three parts of the same size, see Table 1. In each cross-validation run, two parts are used for training and the remaining part for testing. Results are visualized in terms of mean ROC curves across the three cross-validation runs.
5.2 Experimental Results
In our first experiment, we evaluate the performance of classifiers for depth and intensity separately, as well as using different fusion strategies. Results are given in Fig. 5(a-b) for the HOG/linSVM and NN/LRF classifier, respectively. The performance of features derived from intensity images (black ◦) is better than for depth features (red +), irrespective of the actual feature/classifier approach. Furthermore, all fusion strategies between depth and intensity clearly improve performance (Fig. 5(a-b), solid lines). For both HOG/linSVM and NN/LRF, the sum rule performs better than the product rule, which in turn outperforms the maximum rule. However, performance differences among fusion rules are rather small. Only for NN/LRF, the maximum rule performs significantly worse. By design, maximum selection is more susceptible to noise and outliers. Using a non-linear RBF SVM as a fusion classifier does not improve performance over fusion by the sum rule, but is far more computationally expensive. Hence, we only employ the sum rule for fusion in our further experiments. Comparing absolute performances, our experiments show that fusion of depth and intensity can reduce false positives over intensity-only features at a constant detection rate by approx. a factor of two for HOG/linSVM and a factor of four for NN/LRF: at a detection rate of 90%, the false positive rates for HOG/linSVM (NN/LRF) amount to 1.44% (2.01%) for intensity, 8.92% (5.60%) for depth and 0.77% (0.43%) for sum-based fusion of depth and intensity. This clearly shows that the different strengths of depth and intensity can indeed be exploited, see Sect. 3. An analysis of correlation between the classifier outputs for depth and intensity confirms this: for HOG/linSVM (NN/LRF), the correlation coefficient
Fig. 5. Pedestrian classification performance (detection rate vs. false positive rate) using spatial depth and intensity features. (a) HOG/linSVM classifier (depth, intensity, and fusion by sum, max, SVM and product rules), (b) NN/LRF classifier (same variants), (c) best performing classifiers and joint feature space with 1-σ error bars.
between depth and intensity is 0.1068 (0.1072). For comparison, the correlation coefficient between HOG/linSVM and NN/LRF on intensity images is 0.3096. In our third experiment, we fuse the best performing feature/classifier for each modality, i.e. HOG/linSVM for intensity images (black ◦) and NN/LRF for depth images (red +). See Fig. 5(c). The result of fusion using the sum rule (blue *) outperforms all previously considered variants. More specifically, we achieve a false positive rate of 0.35% (at 90% detection rate), which is a reduction by a factor of four compared to the state-of-the-art HOG/linSVM classifier on intensity images (black ◦; 1.44% false positive rate). We additionally visualize 1-σ error bars computed from the different cross-validation runs. The non-overlapping error bars of the various system variants underline the statistical significance of our results. We further compare the proposed high-level fusion (Fig. 5(c), blue *) with low-level fusion (Fig. 5(c), magenta Δ). For this, we construct a joint feature space combining HOG features for intensity and LRF features for depth (normalized to [−1; +1] per dimension). A linear SVM is trained in the joint space to discriminate between pedestrians and non-pedestrians. A non-linear SVM was computationally not feasible, given the increased dimension of the joint feature space (5072) and our large datasets. Results show that low-level fusion using a joint feature space is outperformed by the proposed high-level classifier fusion, presumably because of the higher dimensionality of the joint space.
6 Conclusion
This paper presented a novel framework for pedestrian classification which involves a high-level fusion of spatial features derived from dense stereo and intensity images. Our combined depth/intensity approach outperforms the state-of-the-art intensity-only HOG/linSVM classifier by a factor of four in terms of false positive reduction. The proposed classifier-level fusion of depth and intensity also outperforms a low-level fusion approach using a joint feature space.
References 1. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006) 2. Duin, R.P.W., Tax, D.M.J.: Experiments with classifier combining rules. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 16–29. Springer, Heidelberg (2000) 3. Enzweiler, M., Gavrila, D.M.: Monocular pedestrian detection: Survey and experiments. In: IEEE PAMI, October 17, 2008. IEEE Computer Society Digital Library (2008), http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.260 4. Enzweiler, M., Kanter, P., Gavrila, D.M.: Monocular pedestrian recognition using motion parallax. In: IEEE IV Symp., pp. 792–797 (2008) 5. Ess, A., Leibe, B., van Gool, L.: Depth and appearance for mobile scene analysis. In: Proc. ICCV (2007)
6. Franke, U., Gehrig, S.K., Badino, H., Rabe, C.: Towards optimal stereo analysis of image sequences. In: Sommer, G., Klette, R. (eds.) RobVis 2008. LNCS, vol. 4931, pp. 43–58. Springer, Heidelberg (2008) 7. Gandhi, T., Trivedi, M.M.: Image based estimation of pedestrian orientation for improving path prediction. In: IEEE IV Symp., pp. 506–511 (2008) 8. Gavrila, D.M.: A Bayesian, exemplar-based approach to hierarchical shape matching. IEEE PAMI 29(8), 1408–1421 (2007) 9. Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV 73(1), 41–59 (2007) 10. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE PAMI 22(1), 4–37 (2000) 11. Leibe, B., et al.: Dynamic 3d scene analysis from a moving vehicle. In: Proc. CVPR (2007) 12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004) 13. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004) 14. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE PAMI 23(4), 349–361 (2001) 15. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38, 15–33 (2000) 16. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advance. In: Advances in Large Margin Classifiers, pp. 61–74 (1999) 17. Rapus, M., et al.: Pedestrian recognition using combined low-resolution depth and intensity images. In: IEEE IV Symp., pp. 632–636 (2008) 18. Seemann, E., Fritz, M., Schiele, B.: Towards robust pedestrian detection in crowded image sequences. In: Proc. CVPR (2007) 19. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: Proc. CVPR (2007) 20. Van der Mark, W., Gavrila, D.M.: Real-time dense stereo for intelligent vehicles. IEEE PAMI 7(1), 38–50 (2006) 21. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995) 22. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. IJCV 63(2), 153–161 (2005) 23. W¨ ohler, C., Anlauf, J.K.: A time delay neural network algorithm for estimating image-pattern shape and motion. IVC 17, 281–294 (1999) 24. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. IJCV 75(2), 247 (2007) 25. Zhang, L., Wu, B., Nevatia, R.: Detection and tracking of multiple humans with extensive pose articulation. In: Proc. ICCV (2007) 26. Zhu, Q., et al.: Fast human detection using a cascade of histograms of oriented gradients. In: CVPR, pp. 1491–1498 (2006)
Fast and Accurate 3D Edge Detection for Surface Reconstruction
Christian Bähnisch, Peer Stelldinger, and Ullrich Köthe
University of Hamburg, 22527 Hamburg, Germany
University of Heidelberg, 69115 Heidelberg, Germany
Abstract. Although edge detection is a well-investigated topic, 3D edge detectors mostly lack either accuracy or speed. We show how to build a highly accurate subvoxel edge detector which is fast enough for practical applications. In contrast to other approaches we use a spline interpolation in order to obtain an efficient approximation of the theoretically ideal sinc interpolator. We give theoretical bounds for the accuracy and show experimentally that our approach reaches these bounds, while the often-used subpixel-accurate parabola fit leads to much higher edge displacements.
1 Introduction
Edge detection is generally seen as an important part of image analysis and computer vision. As a fundamental step in early vision it provides the basis for subsequent high-level processing such as object recognition and image segmentation. Depending on the concrete analysis task, the accurate detection of edges can be very important. For example, the estimation of geometric and differential properties of reconstructed object boundaries, such as perimeter, volume, curvature or even higher-order properties, requires particularly accurate edge localization algorithms. However, especially in the 3D image domain, performance and storage considerations become very prominent. While edge detection in 3D is essentially the same as in 2D, the trade-off between computational efficiency and geometric accuracy makes the design of usable 3D edge detectors very difficult. In this work we propose a 3D edge detection algorithm which provides accurate edges with subvoxel precision while being computationally efficient. The paper is organized as follows: First, we give an overview of previous work. Then we describe our new approach for edge detection, followed by a theoretical analysis of the edge location errors of an ideal subvoxel edge detector under the influence of noise. This analysis is based on the same optimality constraints as used in the Canny edge detector. Finally, we show experimentally that (in contrast to a parabola fit method) our algorithm is a good approximation of an ideal edge detector, and that the computational costs are negligible.
2
C. B¨ ahnisch, P. Stelldinger, and U. K¨ othe
Previous Work
Since edge detection is a fundamental operation in image analysis, there exists an uncountable number of different approaches. Nevertheless, any new edge detection algorithm must compete with the groundbreaking algorithm proposed by Canny in his MS thesis in 1986 [2]. According to his definition, an edge is detected as a local maximum of the gradient magnitude along the gradient direction. This idea proved to be advantageous over other approaches, mostly because of its theoretical justification and the restriction to first derivatives, which makes it more robust against noise. The Canny edge detection algorithm has been extended to 3D in [10] by using recursive filters. However, both methods return only edge points with pixel/voxel accuracy, i.e. certain pixels (respectively voxels in 3D) are marked as edge points. Since the discrete image is generally a sampled version of a continuous image of the real world, attempts have been made to locate the edge points with higher accuracy. The edge points are then called edgels (i.e. edge elements, in analogy to pixels as picture elements and voxels as volume elements). Since a 2D edge separates two regions from each other, the analogon in 3D is a surface. Thus 3D edge points are also called surfels. One often-cited example of a subpixel-precise edge detection algorithm is given in [3], where 2D edgels are detected as maxima of a local parabola fit to the neighboring pixels. The disadvantage of this parabola fit approach is that the different local fits do not stitch together to a continuous image. The same is true for the approaches presented in [12]. A different method for subvoxel-precise edge detection based on local moment analysis is given in [8], but it simply oversamples the moment functions and thus there is still a discretization error, only on a finer grid. An interpolation approach with higher accuracy has been proposed for 2D in [7,13,14]. Here, the continuous image is defined by a computationally efficient spline interpolation based on the discrete samples. With increasing order of the spline, this approximates the signal-theoretically optimal sinc interpolator; thus, in the case of sufficiently band-limited images, the approximation error converges to zero. An efficient implementation of the spline interpolation can be found in the VIGRA image processing library [6].
3 The 3D Edge Detection Algorithm
In this section we first introduce the concepts and mathematical notions needed to give the term “3D edge” an exact meaning. A discussion of our algorithm to actually detect them follows.
3.1 Volume Function and 3D Edge Model
In the following our mathematical model for a 3D image is the scalar valued volume function with shape (w, h, d)T ∈ N3 : f : w × h × d → D with
n = {0, . . . , n − 1} and an appropriately selected domain D, e.g. D = 255. The gradient of f at position p is defined as ∇f := ∇gσ ∗ f, with ∇gσ denoting the vector of spatial derivatives of the Gaussian gσ(p) := 1/(√(2π)σ) · exp(−‖p‖²/(2σ²)) at scale σ. Note that the gradient can be efficiently computed using the separability property of the Gaussian. The gradient of the volume function is the basis for our 3D edge model. The boundary indicator b := ‖∇f‖ expresses strong evidence for a boundary at a certain volume position in terms of a high scalar value. Adapting Canny's edge model [2] to 3D images, we define surface elements (surfels) as maxima of the boundary indicator b along the gradient direction ∇f of the 3D image function. Our detection algorithm for these maxima can be divided into two phases: a voxel-precise surfel detection phase and a refinement phase which improves the localization of the surfels to sub-voxel precision.
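A minimal sketch of the gradient and boundary indicator computation, assuming SciPy's separable Gaussian-derivative filters are available; array layout and parameter names are illustrative and not taken from the paper.

```python
# Hedged sketch: Gaussian-derivative gradient of the volume and its magnitude b.
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_and_boundary(volume, sigma=1.0):
    """Returns grad of shape (3, d, h, w) = ∇(g_sigma * f) and b = ||∇f|| per voxel."""
    f = volume.astype(float)
    grad = np.stack([
        gaussian_filter(f, sigma, order=tuple(int(a == ax) for a in range(3)))
        for ax in range(3)
    ])
    return grad, np.sqrt((grad ** 2).sum(axis=0))
```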
3.2 Phase 1: Voxel-Based Edge-Detection
The detection of the voxel-precise surfels is basically an adaptation of the classical Canny edge detection algorithm [2]. However, we do not perform the commonly used edge tracing step with hysteresis thresholding, as this step becomes especially complicated in the 3D image domain. A second reason is that this is the most time-consuming part of the Canny edge detection algorithm. Additionally, although non-maxima suppression is performed, the classical Canny edge detection algorithm can lead to edges several pixels wide, which is only alleviated by hysteresis. Therefore, we propose to use only one threshold (corresponding to the lower of the hysteresis thresholds) in combination with a fast morphological thinning with priorities, which ensures one-voxel-thick surfaces while preserving the topology of the detected voxel set. This also reduces the number of initial edges for subvoxel refinement, resulting in a significant speedup of the further algorithm. Thus, the steps of the first phase are: 1. Compute the gradient ∇f and the boundary indicator function b. 2. Compute a binary volume function a : w × h × d → {0, 1}, marking surfels with “1” and background voxels with “0”, by thresholding b with t and by using non-maximum suppression along the gradient direction: a(p) := 1 if b(p) > t ∧ b(p) > b(p ± d), else 0, with d being a vector such that p + d is the grid point of the nearest neighbouring voxel in the direction of ∇f, i.e., d = (⌊u + 1/2⌋, ⌊v + 1/2⌋, ⌊w + 1/2⌋)^T and (u, v, w)^T = 1/(2 sin(π/8)) · ∇f(p)/‖∇f(p)‖, where ⌊·⌋ is the floor operation. 3. Do topology-preserving thinning with priorities on a. For the last step of our algorithm we use a modified version of the 3D morphological thinning algorithm described in [5]: thinning is not performed in scan-line order of the volume; instead, surfels in a with a small boundary indicator value are preferred for removal. The outcome of the thinning step is then a one voxel
thin set of surfels such that its number of connected components is unchanged. A detailed description of thinning with priorities can be found in [7].
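Step 2 can be sketched as follows (illustrative, not the authors' implementation); the thresholding and the quantized gradient direction follow the definitions above, while the topology-preserving thinning of step 3 is not reproduced.

```python
# Hedged sketch: non-maximum suppression along the quantized gradient direction.
import numpy as np

def nonmax_suppression(b, grad, t):
    """b: boundary indicator (d,h,w); grad: (3,d,h,w); returns the binary volume a."""
    a = np.zeros_like(b, dtype=np.uint8)
    norm = np.linalg.norm(grad, axis=0) + 1e-12
    # nearest-neighbour offset along the gradient, as in the definition of d above
    d_off = np.floor(grad / (2 * np.sin(np.pi / 8) * norm) + 0.5).astype(int)
    D, H, W = b.shape
    for z in range(1, D - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                if b[z, y, x] <= t:
                    continue
                dz, dy, dx = d_off[:, z, y, x]
                if (b[z, y, x] > b[z + dz, y + dy, x + dx] and
                        b[z, y, x] > b[z - dz, y - dy, x - dx]):
                    a[z, y, x] = 1
    return a
```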
3.3 Phase 2: Subpixel Refinement
The first phase of our algorithm yields surface points located at voxel grid positions. In the second phase the localization accuracy of these points is improved to sub-voxel precision by means of iterative optimization methods based on line searches. Therefore a continuous version of the boundary indicator is needed, i.e. it has to be evaluable at arbitrary sub-voxel positions. Here we use B-spline based interpolators for this purpose. They are especially suited as they provide an optimal trade-off between interpolation quality and computational cost: with increasing order n, spline interpolators converge to the ideal sinc interpolator and already have a very good approximation quality for small values of n. While the computational burden also grows with n, they are still efficiently implementable for the order n = 5 used in our experiments. The continuous boundary indicator can now be defined via a discrete convolution with the recursively defined B-spline basis function βn of order n:
b(p) := Σ_{i,j,k} c_ijk βn(i − x) βn(j − y) βn(k − z)
with βn(x) := (1/n) [((n+1)/2 + x) βn−1(x + 1/2) + ((n+1)/2 − x) βn−1(x − 1/2)] and β0(x) := 1 for |x| < 1/2, else 0. The coefficients c_{ijk} can be efficiently computed from the discrete version of the boundary indicator by recursive linear filters. Note that there is one coefficient for each voxel and that they have to be computed only once for each volume. The overall algorithmic complexity for this is linear in the number of voxels with a small constant factor. More details on the corresponding theory and the actual implementation can be found in [7, 13, 14]. A B-spline interpolated boundary indicator also has the advantage of being (n − 1)-times continuously differentiable. Its derivatives can also be efficiently computed at arbitrary sub-voxel positions, which is very important for optimization methods that rely on gradient information. We can now work on the continuous boundary indicator to get sub-voxel accurate surfels. As we are adapting Canny's edge model to 3D images, we shift the already detected surfels along the gradient direction of the 3D image function such that they are located at maxima of the boundary indicator. This can be formulated as a constrained line search optimization problem, i.e. we search for the maximizing parameter α of the one-dimensional function φ(α) := b(p + α · d) with the constraint α ∈ (αmin, αmax) and with d being a unit-length vector at position p collinear with ∇f such that b increases in its direction, i.e. d := sgn(∇fᵀ∇b) · ∇f / ‖∇f‖. The interval constraint on α can be fixed for every surfel or computed dynamically, e.g. with bracketing (see e.g. [15]), which we use here. Maximizing φ can then be done via standard line search algorithms like the algorithm of Brent [1]
or the algorithm of Moré and Thuente [11]. Here we use the modification of Brent's algorithm presented in [15], which takes advantage of the available gradient information. In order to get even higher accuracy, the line searches defined by φ can be iterated several times. Any line-search-based optimization algorithm should be suitable for this. Here, we choose the common conjugate gradient method (see e.g. [15]) and compare its accuracy improvement to the single line search approach in Sect. 5.
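As an illustration of the refinement step, the sketch below (again only our assumption of one possible realization, using SciPy's order-5 spline interpolation and a bounded scalar search in place of the gradient-based Brent variant from [15]) shifts a voxel-precise surfel along the gradient direction to a maximum of the interpolated boundary indicator.

import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import minimize_scalar

def refine_surfel(b, grad, p, alpha_max=1.0, order=5):
    g = grad[(slice(None),) + tuple(p)]
    d = g / np.linalg.norm(g)                        # unit vector collinear with grad f
    def phi(alpha):                                  # phi(alpha) = b(p + alpha * d)
        q = np.asarray(p, dtype=float) + alpha * d
        return map_coordinates(b, q[:, None], order=order, mode='nearest')[0]
    # Fixed interval instead of dynamic bracketing; maximize phi by minimizing -phi.
    res = minimize_scalar(lambda a: -phi(a), bounds=(-alpha_max, alpha_max), method='bounded')
    return np.asarray(p, dtype=float) + res.x * d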
4 Theoretical Analysis: Accuracy
In order to justify the localization accuracy of our algorithm we perform experiments on synthetic test volumes based on simple 3D edge models for which the true surfel positions are known from theory. For this it is necessary to carefully model and implement the corresponding image acquisition process, which we define as a convolution of a continuous volume with a 3D isotropic Gaussian point spread function (PSF) with scale σPSF followed by sampling and quantization with possible addition of white Gaussian noise. The volume is modeled via a binary volume function f0 : R³ → {0, S} with S ∈ R such that the support f0⁻¹(S) is either an open half space, a ball or a cylinder, i.e. its surface corresponds to a planar, spherical or cylindrical shell. We investigate these three types of functions, since they allow us to estimate the localization accuracy for every possible case of a 3D surface. For example, if a surface is hyperbolic in some point p with principal curvatures κ1 > 0, κ2 < 0, then the localization errors should be bounded by the errors of two opposing cylinders having curvatures κ1 and κ2, respectively. The function f0 is blurred by convolution with a Gaussian before sampling. The resulting function f = f0 ∗ gσPSF defines the ground truth for edge detection. More precisely, for a planar surface with normal unit vector n and distance s ∈ R from the origin the corresponding volume function reads
  f_plane(p) := S · Φ_σPSF(pᵀn + s)   with   Φ_σ(x) := (1/2) (1 + erf(x / (√2 σ))).

Maxima of the gradient magnitude of f_plane then occur exactly at positions p with pᵀn + s = 0. For a ball B_R with radius R, a closed form solution of the convolution integral with a Gaussian can be derived by exploiting the rotational symmetry of both functions and the separability of the Gaussian:

  f_sphere(p) := (S / (√(2π) σ)³) ∫_{B_R} e^{−‖x−p‖²/(2σ²)} dx
              = (S / (√(2π) σ)) ∫_{−R}^{R} (1 − e^{−(R²−r'²)/(2σ²)}) e^{−(r'−r)²/(2σ²)} dr'
              = (S/2) [erf((R−r)/(√2 σ)) + erf((R+r)/(√2 σ))] − (S σ / (√(2π) r)) e^{−(R+r)²/(2σ²)} (e^{2rR/σ²} − 1)

with r = ‖p‖. The gradient magnitude of a blurred sphere is then the derivative of f_sphere with respect to r:

  ‖∇f(p)‖ = (1 / (√(2π) r² σ)) e^{−(R+r)²/(2σ²)} (σ² + rR + e^{2rR/σ²} (rR − σ²))    (1)
Fig. 1. Normalized bias of a blurred sphere and cylinder with radius R and scaling σ of the PSF. Approximating functions given by (2) and (3) are indicated with dotted lines.
As there is no closed form solution giving the maxima r0 of (1), Fig. 1 shows numeric results. It plots the normalized displacement (r0 − R)/σ against the ratio σ/R, in order to apply to arbitrary scales and radii. In practice, the most interesting part is in the interval 0 ≤ σ/R ≤ 0.5, since otherwise the image content is too small to be reliably detectable after blurring. In this interval an approximation with error below 3 · 10⁻⁴ is given by

  (r0 − R)/σ = (0.04 − 145.5 (σ/R) + 345.8 (σ/R)² − 234.8 (σ/R)³) / (142.6 − 308.5 (σ/R) + 59.11 (σ/R)² + 327.7 (σ/R)³).    (2)
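For illustration, the maxima r0 of (1) can also be located numerically and compared with the rational approximation (2); the small Python sketch below does this with SciPy (our own illustration; the parameter values are arbitrary examples).

import numpy as np
from scipy.optimize import minimize_scalar

def grad_mag_sphere(r, R, sigma):
    # Gradient magnitude of a blurred ball, Eq. (1).
    return (np.exp(-(R + r) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * r ** 2 * sigma)
            * (sigma ** 2 + r * R + np.exp(2 * r * R / sigma ** 2) * (r * R - sigma ** 2)))

def normalized_bias_numeric(R, sigma):
    res = minimize_scalar(lambda r: -grad_mag_sphere(r, R, sigma),
                          bounds=(1e-3, 2.0 * R), method='bounded')
    return (res.x - R) / sigma

def normalized_bias_approx(R, sigma):
    # Rational approximation, Eq. (2).
    q = sigma / R
    return ((0.04 - 145.5 * q + 345.8 * q ** 2 - 234.8 * q ** 3) /
            (142.6 - 308.5 * q + 59.11 * q ** 2 + 327.7 * q ** 3))

print(normalized_bias_numeric(5.0, 1.5), normalized_bias_approx(5.0, 1.5))  # sigma/R = 0.3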
In the case of a cylinder of radius R, a closed form solution exists neither for the convolution integral nor for the position of the maxima of its gradient magnitude (but for the gradient magnitude itself). This case is mathematically identical to the 2D case of a disc blurred by a Gaussian, which has been analyzed in detail in [9]. An approximating formula for the relative displacement with error below 3 · 10⁻⁴ is given in [7]:

  (r0 − R)/σ = 0.52 √(0.122 + (σ/R − 0.476)²) − 0.255    (3)
4.1 Noisy 3D Images
Canny's noise model is based on the assumption that the surface is a step which has been convolved with both the PSF and the edge detection filters. Therefore, the total scale of the smoothed surface is σ_edge² = σ_PSF² + σ_filter². The second directional derivative in the surface's normal direction equals the first derivative of a Gaussian (and it is a constant in the tangential plane of the surface). Near the true surface position, this derivative can be approximated by its first order Taylor expansion:

  s_xx(x) ≈ s_xxx(x = 0) · x = − S · x / (√(2π) σ_edge³),
where S is the step height. The observed surface profile equals the true profile plus noise. The noise is only filtered with the edge detection filter, not with the PSF. The observed second derivative is the sum of the above Taylor formula and the second derivative of the smoothed noise, f_xx(x) ≈ s_xxx|_{x=0} · x + n_xx(x). Solving for the standard deviation of x at the zero crossing f_xx(x) = 0 gives

  StdDev[x] = √(Var[n_xx]) / |s_xxx(x = 0)|.

According to Parseval's theorem, the variance of the second directional derivative of the noise can be computed in the Fourier domain as

  Var[n_xx] = N² ∫_{−∞}^{∞} (4π² u² G(u) G(v) G(w))² du dv dw = N² · 3 / (32 π^{3/2} σ⁷),

where G(·) is the Fourier transform of a Gaussian at scale σ_filter, and N² is the variance of the noise before filtering. Inserting, we get the expected localization error as

  StdDev[x] = (N/S) · √3 (σ_PSF² + σ_filter²)^{3/2} / (4 π^{1/4} σ_filter^{7/2}).    (4)
Here, N/S is the inverse signal-to-noise ratio. In contrast to 2D edge detection, this error goes to zero as the filter size approaches infinity: for σ_filter → ∞,

  StdDev[x] → (N/S) · √3 / (4 π^{1/4} √σ_filter).

However, this limit only applies to perfectly planar surfaces. In case of curved surfaces, enlarging the edge detection filter leads to a bias, as shown above, and there is a trade-off between noise reduction and bias.
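A small helper for evaluating the prediction (4), purely illustrative and with example values matching the noisy setting of Fig. 2b, may look as follows.

import numpy as np

def predicted_stddev(inv_snr, sigma_psf, sigma_filter):
    # Expected localization error according to Eq. (4); inv_snr = N / S.
    return (inv_snr * np.sqrt(3.0) * (sigma_psf ** 2 + sigma_filter ** 2) ** 1.5
            / (4.0 * np.pi ** 0.25 * sigma_filter ** 3.5))

print(predicted_stddev(1.0 / 20.0, 0.9, 2.0))   # sigma_PSF = 0.9, sigma_filter = 2.0, SNR = 20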
5 Experiments: Accuracy and Speed
In this section we present the results of experiments which confirm our claims about the accuracy and speed characteristics of our algorithm. We start with artificial volume data generated by sampling the simple continuous volume functions given above. For the cylindrical cases we used 8-fold oversampled binary discrete volumes with subsequent numeric convolution and downsampling. In the following we always use spline interpolation of order five for the line search and conjugate gradient based algorithms. Accuracy results for test volumes generated from f_plane are shown in Fig. 2. As model parameters we used unit step height, σPSF = 0.9 and σfilter = 1 and various values for the sub-voxel shift s and the plane normal n. For the directions of n we evenly distributed fifty points on the hemisphere located at the origin. In Fig. 2a results for noise-free volumes are shown. As one can see, both the line search and the conjugate gradient based methods possess very high accuracy and are several orders of magnitude better than the parabolic fit. In the presence of noise the accuracy of our method is still almost one order of magnitude better than the parabolic fit for a rather bad signal-to-noise ratio of SNR = 20, see
[Fig. 2 plots the localization error [voxel] against the ground truth angle ϕ(t) = (α, β) [rad] for the parabolic fit, line search, and conjugate gradient methods. Panels: (a) no noise; (b) SNR = 20, σfilter = 2.0; (c) measured mean standard deviation vs. predicted standard deviation according to Eq. (4), computed for ten evenly distributed signal-to-noise ratios SNR ∈ [10, 100] (note that the scaling of the y-axis is logarithmic).]
Fig. 2. Comparison of sub-voxel accuracy of the three algorithms on sampled instances of fplane with σPSF = 0.9, σfilter = 1.0, s ∈ {0.1, 0.25, 0.5, 0.75} and using fifty evenly distributed points on the hemisphere for n
Fig. 3. Comparison of predicted and measured localization bias for spherical (left) and cylindrical (right) surfaces using R = 5, σPSF = 0.9 with SNR = 10 for six evenly distributed filter scales σfilter ∈ [0.2, 0.4]. Values have been averaged over 10 instances with different sub-voxel shift.
fig. 2b. Finally, fig. 2c shows that the estimated standard deviation matches the prediction from theory very well. In fig. 3 we compare the predicted localization bias for spherical and cylindrical surfaces according to eq. 2 and eq. 3, respectively. Test volumes have been generated from eq. 1 for spheres and using oversampling as described above for
cylinders. As model parameters we used R = 5, σPSF = 0.9 and various values for σfilter with addition of Gaussian noise such that SNR = 10. For each set of model parameters, radii have then been estimated from 10 instances with the same model parameters but with different sub-voxel shift. From these figures we conclude that our algorithm correctly reproduces the localization bias, prevailing over the parabolic fit, which possesses a systematic error. For performance comparison, we measured execution time on a Linux PC with a Pentium D 3.4 GHz processor and 2 GB of RAM for test volumes with shape (200, 200, 200)ᵀ and two real CT volumes. Results are given in Table 1. As one can see, the line search based method is only ≈ 35% slower than the parabolic fit and the conjugate gradient based method only ≈ 50% to ≈ 90% slower.

Table 1. Performance results for various test volumes and real CT volumes. Columns in the middle give run-times in seconds.

  volume     shape               p. fit   l. search   cg      n. surfels
  plane      (200, 200, 200)ᵀ     9.24    11.63       14.13   ≈ 39500
  sphere     (200, 200, 200)ᵀ     9.76    14.03       18.80   ≈ 48000
  cylinder   (200, 200, 200)ᵀ    10.91    15.51       20.74   ≈ 75200
  lobster    (301, 324, 56)ᵀ      6.22     8.24       10.19   21571
  foot       (256, 256, 207)ᵀ    20.01    26.74       34.00   74411
Fig. 4. Surface reconstructions for test-volumes and real CT-volumes using α-shapes [4] (α = 1) with SNR = 10 for the test-volumes
6 Conclusions
Based on the well-known Canny edge detector, we presented a new algorithm for subvoxel-precise 3D edge detection. The accuracy of our method is much better than the accuracy of the subvoxel refinement based on a parabola fit. Due to an efficient implementation of the spline interpolation and due to the use of fast voxel-accurate computations wherever possible, our algorithm is still computationally efficient. In order to justify the accuracy, we theoretically analyzed the measurement errors of an ideal Canny-like edge detector at infinite sampling resolution in the case of 3D planar, spherical and cylindrical surfaces. Our analysis showed that all experimental results are in full agreement with the theory, while this is not the case for the parabola fit method.
References
1. Brent, R.P.: Algorithms for Minimisation Without Derivatives. Prentice-Hall, Englewood Cliffs (1973)
2. Canny, J.: A computational approach to edge detection. TPAMI 8(6), 679–698 (1986)
3. Devernay, F.: A non-maxima suppression method for edge detection with sub-pixel accuracy. Technical Report 2724, INRIA Sophia Antipolis (1995)
4. Edelsbrunner, H., Mücke, E.P.: Three-dimensional alpha shapes. ACM Trans. Graph. 13(1), 43–72 (1994)
5. Jonker, P.P.: Skeletons in n dimensions using shape primitives. Pattern Recognition Letters 23, 677–686 (2002)
6. Köthe, U.: Vigra. Web resource, http://hci.iwr.uni-heidelberg.de/vigra/ (visited March 1, 2009)
7. Köthe, U.: Reliable Low-Level Image Analysis. Habilitation thesis, University of Hamburg, Germany (2008)
8. Luo, L., Hamitouche, C., Dillenseger, J., Coatrieux, J.: A moment-based three-dimensional edge operator. IEEE Trans. Biomed. 40(7), 693–703 (1993)
9. Mendonça, P.R.S., Padfield, D.R., Miller, J., Turek, M.: Bias in the localization of curved edges. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 554–565. Springer, Heidelberg (2004)
10. Monga, O., Deriche, R., Rocchisani, J.: 3D edge detection using recursive filtering: application to scanner images. CVGIP: Image Underst. 53(1), 76–87 (1991)
11. Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Software 20, 286–307 (1994)
12. Udupa, J.K., Hung, H.M., Chuang, K.S.: Surface and volume rendering in three dimensional imaging: A comparison. J. Digital Imaging 4, 159–169 (1991)
13. Unser, M., Aldroubi, A., Eden, M.: B-Spline signal processing: Part I - Theory. IEEE Trans. Signal Process. 41(2), 821–833 (1993)
14. Unser, M., Aldroubi, A., Eden, M.: B-Spline signal processing: Part II - Efficient design and applications. IEEE Trans. Signal Process. 41(2), 834–848 (1993)
15. Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, Cambridge (2002)
Boosting Shift-Invariant Features
Thomas Hörnlein and Bernd Jähne
Heidelberg Collaboratory for Image Processing, University of Heidelberg, 69115 Heidelberg, Germany
Abstract. This work presents a novel method for training shift-invariant features using a Boosting framework. Features performing local convolutions followed by subsampling are used to achieve shift-invariance. Other systems using this type of features, e.g. Convolutional Neural Networks, use complex feed-forward networks with multiple layers. In contrast, the proposed system adds features one at a time using smoothing spline base classifiers. Feature training optimizes base classifier costs. Boosting sample-reweighting ensures features to be both descriptive and independent. Our system has a lower number of design parameters than comparable systems, so adapting the system to new problems is simple. Also, the stage-wise training makes it very scalable. Experimental results show the competitiveness of our approach.
1 Introduction
This work deals with shift-invariant features performing convolutions followed by subsampling. Most systems using this type of features (e.g. Convolutional Neural Networks [1] or biologically motivated hierarchical networks [2]) are very complex. We propose to use a Boosting framework to build a linear ensemble of shift-invariant features. Boosting ensures that trained features are both descriptive and independent. The simple structure of the presented approach leads to a significant reduction of design parameters in comparison to other systems using convolutional shift-invariant features. At the same time, the presented system achieves state-of-the-art performance for classification of handwritten digits and car side-views. The presented system builds a classification rule for an image classification problem, given in form of a collection of N training samples {xi , yi }, i = 1, . . . , N - x is a vector of pixel values in an image region and y is the class label of the respective sample1 . The depicted objects are assumed to be fairly well aligned with respect to position and scale. However, in most cases the depicted objects of one class will exhibit some degree of variability due to imperfect localization or intra-class variability. In order to achieve good classification performance, this variability needs to be taken into account. One way to approach the problem is by using shift-invariant features (Sect. 2), namely features performing local 1
For simplicity we assume binary classification tasks (y ∈ {−1, 1}) throughout the paper. Extension to multi-class problems is straightforward using a scheme similar to AdaBoost.MH [3].
convolution and subsampling. To avoid the complexity of hierarchical networks commonly used with this type of features, a Boosting scheme is used for feature generation (Sect. 3). In order to illustrate the effectiveness of our approach, a set of experiments on USPS database (handwritten digit recognition, Sect. 4.1) and UIUC car sideview database (Sect. 4.2) is conducted. The achieved performance compares well to state-of-the-art algorithms.
2 Shift-Invariant Features for Image Classification
Distribution of samples in feature-space is influenced by discriminative and nondiscriminative variability. While discriminative variability is essential for classification, non-discriminative variability should not influence results. It is, however, hard for training systems to learn to distinguish the two cases and usually high numbers of training samples are needed to do so. Therefore prior knowledge is commonly used to design features suppressing non-discriminative variability while preserving discriminative information. Using these features can significantly simplify the training problem. While the global appearance of objects in one class is subject to strong variations, discriminative and stable local image structures - for example the appearance of wheels for classification of vehicles - exist. The relative image-positions of these features may change due to changes of point-of-view or deformations of the objects but their appearance is relatively stable. Therefore the images of objects can be represented as a collection of local image features, where the exact location of each feature is unknown. Different approaches to handling location uncertainty exist, ranging from completely ignoring position information (e.g. bag of features) to construction of complex hierarchies of object parts (e.g. [4]). In this work a model is used that searches for features in a part of the image described by p = [c0 , r0 , w, h], where c0 , r0 describes the position and w, h the width and height of the region respectively. We define the operator P(x, p) extracting patches of geometry p from feature vector x. To extract discriminative information local-convolution features are used: f (x) = sub(P(x, p) ∗ K) , (1) where K is the convolution kernel2 . The subsampling operation sub(.) makes the result invariant to small shifts. For the experiments reported in Sec. 4 the subsampling operation sub(.) returns the maximum absolute value of the filter response3 . Local convolutional features are mainly used in multi-layer feed-forward networks. The kernel matrices may be either fixed or are tuned in training. Examples 2 3
The convolution is only performed for the range, in which the kernel has full overlap with the patch. Note that this subsampling operator is non-differentiable. For the backpropagation training used in Sect. 3.2 a differentiable approximation needs to be used.
for the use of fixed kernels are the biologically motivated systems in [2] and [5]. An advantage of using fixed weights is the lower number of parameters to be adjusted in training. On the other hand, prior knowledge is necessary to select good kernels for a given classification problem4 . Examples of systems using trained kernels are the unsupervised system in [1] and the supervised in [6]. The advantage of training kernels is the ability of the system to adjust to the problem at hand and thus find compact representations. The hierarchical networks used with local convolution features are able to construct complex features by combining basic local convolution features. The cost for this flexibility is the high number of design parameters to be set. In order to provide a simple scheme for using local convolution features, a single layer system is proposed in the next section.
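To make the feature of Eq. (1) concrete, a minimal Python sketch (our own illustration, assuming the input is given as a 2D array and using max-abs pooling as in the experiments of Sect. 4) could look as follows.

import numpy as np
from scipy.signal import correlate2d

def local_conv_feature(x, p, K):
    # f(x) = sub(P(x, p) * K): extract region, filter with full overlap, max-abs pooling.
    c0, r0, w, h = p
    patch = x[r0:r0 + h, c0:c0 + w]                  # P(x, p)
    resp = correlate2d(patch, K, mode='valid')       # convolution restricted to full overlap
    return np.abs(resp).max()                        # sub(.)

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
print(local_conv_feature(image, (3, 3, 9, 9), rng.normal(size=(5, 5))))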
3 Boosting Shift-Invariant Features
This section describes a boosting-based approach to train and use local convolution features. This simplifies the complicated architecture and therefore the unpleasant design of classifiers. Using Boosting we add features in a greedy stage-wise manner instead of starting with a predefined number of features which need to be trained in parallel. This makes the approach very scalable. Since one feature is trained at a time, only a small number of parameters is tuned simultaneously, simplifying training of kernel weights. In order to train features, differentiable base classifiers have to be used. Using gradient descent to train features bears a strong resemblance with the training of artificial neural networks (e.g. [7]). ANNs, however, use fixed transfer functions while the approach presented here uses smooth base classifiers adapting to class distributions. The use of adaptive transfer functions enables the ensemble to be very flexible even in the absence of hidden layers.
3.1 Boosting Smoothing Splines
Boosting is a technique to combine weak base classifiers to form a strong classification ensemble. The additive model has the form:

  ŷ = sign(H(x))   with   H(x) = Σ_{t=1}^{T} α_t h_t(x).    (2)
Boosting training is based on minimizing a continuous cost function J on the given training samples {xi , yi }. The minimization is performed using functional gradient descent. In stage t the update ht+1 is calculated by performing a gradient descent step on J. The step width depends on the specific Boosting algorithm
4. Though biologically motivated kernels like Gabor wavelets seem to give good performance on a wide range of image processing applications.
in use. GentleBoost (GB) [3] (used in the experiments of Sect. 4) uses Gauss-Newton updates, leading to the GentleBoost update rule:

  h_{t+1}(x) = E_c[c y | x] / E_c[c | x]   with   c_i = e^{−y_i Σ_{s=1}^{t} α_s h_s(x_i)},   J = Σ_{i=1}^{N} c_i   and   α_{t+1} = 1,    (3)

where E_c is the weighted expectation. The presented approach is not restricted to being used with GentleBoost but might be used with arbitrary Boosting schemes. The task of the base classifier is to select the rule h giving the lowest costs5. Typical choices of base classifiers are Decision Stumps, Decision Trees and Histograms. We are, however, interested in base classifiers which are differentiable. Due to their cheap evaluation and simple structure we use univariate smoothing splines. A smoothing spline base classifier is represented as:

  h(z) = aᵀ b(z),    (4)
where z is a scalar input and a represents the weights of the spline basis functions; b returns the values of the spline basis functions evaluated at z. To construct a fit from scalar inputs z_i to outputs y_i, the weights a need to be calculated by solving a linear system of equations. In order to prevent overfitting, a tradeoff between approximation error and complexity has to be found. We use P-Splines [8] for fitting penalized splines: a fixed high number of equidistant support points is used and a parameter λ is tuned to adjust the amount of smoothing. P-Splines use finite differences of the weights a of the spline functions to approximate roughness. The weights a can then be calculated using

  a = (B Δ_c Bᵀ + λ D Dᵀ)⁻¹ B Δ_c y,    (5)

where B = [b(z_1) . . . b(z_N)]ᵀ denotes the matrix of values of the spline basis functions evaluated at z_1, . . . , z_N, y = [y_1 . . . y_N]ᵀ contains the sample class labels and Δ_c ∈ R^{N×N} is a diagonal matrix containing the sample weights c_1, . . . , c_N. The expression aᵀD calculates finite differences of a given degree6 on a. The roughness penalty can be chosen using cross validation.
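The following sketch of the weighted P-spline fit (5) is our own illustration: it assumes SciPy's BSpline.design_matrix for the basis values, treats B as holding the basis vectors b(z_i) column-wise so that the dimensions of (5) work out, and uses a first-degree difference penalty; the authors' actual implementation details (knot placement, penalty degree) are not specified here.

import numpy as np
from scipy.interpolate import BSpline

def pspline_fit(z, y, c, n_knots=100, degree=3, lam=1.0):
    # Solve a = (B Dc B^T + lam D D^T)^{-1} B Dc y, cf. Eq. (5).
    lo, hi = z.min() - 1e-9, z.max() + 1e-9
    t = np.r_[[lo] * degree, np.linspace(lo, hi, n_knots), [hi] * degree]
    B = BSpline.design_matrix(z, t, degree).toarray().T      # (n_basis, N)
    Dc = np.diag(c)
    D = np.diff(np.eye(B.shape[0]), axis=0).T                # first-degree finite differences
    a = np.linalg.solve(B @ Dc @ B.T + lam * D @ D.T, B @ Dc @ y)
    return a, t

def pspline_eval(a, t, z, degree=3):
    return BSpline(t, a, degree, extrapolate=False)(z)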
3.2 Training Features Using a Boosting Framework
A large group of base classifiers used with Boosting operate on one input feature at a time: h(x) = g(x(j) ) (component-wise base classifiers). The advantage of this approach is the simple nature and cheap evaluation of the resulting classification rules. Boosting of component-wise base classifiers can be used as a feature selection scheme, adding features to the final hypothesis one at a time. 5 6
5. The cost function depends on the Boosting algorithm used. GentleBoost uses the weighted squared error ε = Σ_{i=1}^{N} c_i (y_i − h(x_i))². 6. For classification, penalizing first degree finite differences is a natural choice, leading to h(x) = cᵀy = const for λ → ∞.
In order to use a feature selection scheme, one needs a set of meaningful features first. However, providing such a feature set for arbitrary image classification problems is a difficult task, especially if the properties of good features are unknown. In general it would be more convenient to provide as little prior knowledge as possible and train features automatically. For Boosting feature generation - similar to boosting feature selection - a mapping z = f(x) from R^F to R minimizing the weighted costs of the spline fit is sought:

  f(x) ← min_{f(x)} ε^{GB}(h(f(x))) = min_{f(x)} Σ_{i=1}^{N} c_i (h(f(x_i)) − y_i)²,    (6)
where h(f(x)) is a weighted least squares fit to y. When using local convolution features, the kernel weights w can be tuned using error backpropagation. This is similar to the training techniques used with Convolutional Neural Networks – a particularly simple scheme can be found in [9]. The complete scheme for building a classifier with local convolution features is shown in Alg. 1. Training time may be reduced, without deteriorating classification performance, by visiting only a limited number of random positions (line 5).

Algorithm 1. Boosting of local convolution features
Input: Training samples {x, y}_i, i = 1, . . . , N
Input: Number of boosting rounds T
Input: Smoothing parameter λ
Input: Feature geometry
 1: h_0(x) = y
 2: for t = 1, . . . , T do
 3:   c_i ← e^{−y_i H(x_i)},  c_i ← c_i / (Σ_{i=1}^{N} c_i)
 4:   ε_min ← ∞
 5:   for all positions p do
 6:     Initialize convolution kernel K ← N(0, 1)
 7:     repeat
 8:       z_i = sub(P(x_i, p) ∗ K)
 9:       Fit base classifier h(z) to {z_i, y_i, c_i}
10:       Calculate kernel gradient ΔK using back-propagation
11:       Update kernel K (e.g. using Levenberg-Marquardt)
12:     until convergence or maximum number of rounds reached
13:     ε ← Σ_{i=1}^{N} c_i (y_i − h(sub(P(x_i, p) ∗ K)))²
14:     if ε < ε_min then
15:       ε_min ← ε, p_t ← p, K_t ← K
16:     end
17:   end
18:   Fit base classifier h_t(z) to {z_i, y_i, c_i}, z_i = sub(P(x_i, p_t) ∗ K_t)
19:   Add h_t to ensemble
20: end
Output: Classifier H(x) = Σ_{t=0}^{T} h_t(sub(P(x, p_t) ∗ K_t))
Fig. 1. Pooling feature (for handwritten digits 5 vs 8)
Combining Features. In higher layers of hierarchical networks, basic features are combined to build more complex features [1,2]. This type of feature interaction cannot be modeled by using Alg. 1 directly. We propose, rather than using hierarchical networks, to build complex features as linear combinations of local convolution features: z̃_i = vᵀz_i, where z_i = [f_1(x_i), f_2(x_i), . . .]ᵀ represents the values of all convolutional features learned so far and v are their respective weights. While this approach may not be as powerful as using hierarchical networks, it comes at almost no extra cost. Algorithm 1 is adapted by feeding linear combinations of features into the base classifier in line 18. The weights v of the local convolution features are trained to optimize class separation. Typically, only a small number of features, say two or three, need to be combined - depending on the problem at hand. In cases where the maximum number of convolutional features to be used is limited (e.g. due to computational resources), performance may be improved by adding Boosting stages using combinations of the already learned features. Calculation of local convolutional features is much more expensive than evaluation of base classifiers, so the extra costs are negligible.
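One straightforward way to obtain the combination weights v (our assumption, since the precise separation criterion is not spelled out above) is a weighted least squares fit of the combined response to the labels, reusing the boosting sample weights:

import numpy as np

def combine_features(Z, y, c):
    # Z: (N, F) responses of the already trained convolutional features,
    # y: (N,) labels in {-1, +1}, c: (N,) boosting sample weights.
    sw = np.sqrt(np.asarray(c))
    v, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return Z @ v, v          # combined feature values and the weights v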
4 Experiments
In order to show the competitiveness of our approach, experiments on two well-known image classification databases are conducted. The data sets are selected to have very different properties to illustrate the flexibility of our approach.
4.1 USPS Handwritten Digit Recognition
The first set of experiments is performed on the USPS handwritten digit recognition corpus. The database contains grayscale images of handwritten digits, normalized to have dimensions 16 × 16 leading to an input feature vector with 256 values. The training set includes 7, 291 samples, the test set 2, 007. Human error rate on this data set is approximately 2.5% ([10]). Penalized cubic smoothing spline base-classifiers with 100 support points are used to approximate class distributions. Spline roughness penalty, as well as the size of the convolution kernel were determined using cross validation. Kernels of size 5 × 5 with a subsampling area of 5 × 5 gave best results - this means each pooling feature operates on a 9 × 9 patch. Pairs of convolutional features are combined to model feature-interactions. An ensemble of 1000 base classifiers
Table 1. Test error rates on USPS database

  method                       error [%]   error ext. [%]
  human performance ([12])     2.5         –
  neural net (LeNet1 [13])     4.2         –
  tangent distance ([12])      3.3         2.5
  kernel densities ([11])      –           2.4
  this work                    3.1         2.6
Fig. 2. Classification error on USPS depending on the number of boosting rounds (black: original set, red: extended set). Note that features were trained until round 500. The remaining Boosting rounds add base classifiers combining already calculated features.
was built. Features were added in rounds 1 to 500. The remaining boosting rounds combined already trained local convolutional features. Experiments using an extended set of training patterns [11] suggest the original training set is too small to achieve optimal performance. In the literature different techniques are used to extend the training set. We built an extended training set by adding distorted versions of training patterns (see [9]), increasing the number of training samples by a factor of five. Note that we did not extend the test set in any way. Figure 2 shows the test error with respect to the number of features used. Experiments using the original training set yielded an error rate of 3.1%. On the extended training set an error rate of 2.6% was achieved. Note that the error rate of the extended set drops from 3.0% to 2.6% between round 500 and 1000 without adding new convolutional features. Table 1 compares our performance to other published results. The results of the presented scheme are competitive with other state-of-the-art algorithms.
4.2 UIUC Car Classification
A second set of experiments was conducted using the UIUC car side view database [14]. The training set contains 550 images of cars and 500 images of background, each image of size 100 × 40. Again, cross validation was used to find good parameters. The best performance was achieved using convolution kernels of size 5 × 5 and a subsampling area of size 5 × 5.
Table 2. Test error rates on UIUC cars (this work: min, mean, max over ten runs)

  method               error (single-scale set) [%]   error (multi-scale set) [%]
  Lampert et al [15]         1.5                             1.5
  Agarwal et al [14]        23.5                            60.4
  Leibe et al [16]           2.5                             5.0
  Fritz et al [17]          11.4                            12.2
  Mutch et al [5]            0.04                            9.4
  this work            (1.25) 1.55 (1.78)              (2.9) 3.6 (4.0)
Fig. 3. Examples of classification on single-scale test set (ground truth: blue, true positives green, false positives red)
The UIUC car database contains two test sets, both consist of natural images containing cars. The first set (single-scale) consists of 170 images containing 200 cars. The cars in this set have the same scale as the cars in the training set. The second test (multi-scale) set consists of 107 images showing 139 cars. The dimensions of the cars range between 89 × 36 and 212 × 85. A sliding window approach was used to generate candidates for the classifier. For multi-scale test images the sliding window classifier was applied to scaled versions of the images. We used the same scales as in [14] (s = 1.2−4,−3,...,1 ). Figure 3 shows some classification results on the single scale test set. Performance evaluation was done in the same fashion as in the original paper [14]. Table 2 compares our results to state-of-the-art7 . Results for single and multi-scale test set are among the best reported. In particular, our results on the multi-scale test set are the best reported results using a sliding window approach. The error rate with respect to the number of features on the single-scale test set is shown in Fig. 4. Errors drop to a competitive level quickly. For an average error of below 2% approximately 30 multiplications per pixel are used, giving a very efficient classifier. 7
To show the effect of the randomness of our approach the results are given for multiple runs of the system.
Fig. 4. Left: recall-precision curve for UIUC cars (black: single scale, red: multi scale). Right: f-score on single scale test set (min, mean, max over 10 runs).
5 Conclusion and Outlook
In this work a novel approach for generating shift-invariant features was presented. By using Boosting to find meaningful features, the scheme is very simple and scalable. Performance, evaluated on the USPS handwritten digit recognition database and the UIUC car side views database, is competitive with state-of-the-art systems. The advantage of our method, when compared to other systems using similar features, is the low number of design parameters and its modularity. The complexity of the trained classifier adapts to the problem at hand. Boosting techniques, like the use of cascades, can easily be incorporated. Future extensions of the presented method will include the use of multiple scales. Right now features are generated on one fixed scale. While this is sufficient for classification of handwritten digits and related problems, for real world objects descriptive features will likely appear on multiple scales.
Acknowledgments We gratefully acknowledge financial support by the Robert Bosch GmbH corporate PhD program and the Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences at IWR, Heidelberg.
References 1. Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 2. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 411–426 (2007)
3. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. The Annals of Statistics 38(2) (2000) 4. Bouchard, G., Triggs, B.: Hierarchical part-based visual object categorization. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 710–715 (2005) 5. Mutch, J., Lowe, D.G.: Multiclass object recognition with sparse, localized features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 11–18 (2006) 6. Huang, F.J., LeCun, Y.: Large-scale learning with svm and convolutional nets for generic object categorization. In: Proc. Computer Vision and Pattern Recognition Conference (CVPR 2006). IEEE Press, Los Alamitos (2006) 7. Schwenk, H., Bengio, Y.: Boosting neural networks. Neural Comput. 12(8), 1869– 1887 (2000) 8. Eilers, P.H.C., Marx, B.D.: Flexible smoothing with b-splines and penalties. Statistical Science 11(2), 89–121 (1996) 9. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR 2003: Proceedings of the Seventh International Conference on Document Analysis and Recognition, Washington, DC, USA, Microsoft Research, p. 958. IEEE Computer Soc, Los Alamitos (2003) 10. Simard, P.Y., LeCun, Y.A., Denker, J.S., Victorri, B.: Transformation invariance in pattern recognition - tangent distance and tangent propagation. In: Orr, G.B., M¨ uller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 239–274. Springer, Heidelberg (1998) 11. Keysers, D., Macherey, W., Ney, H., Dahmen, J.: Adaptation in statistical pattern recognition using tangent vectors. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 269–274 (2004) 12. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study in handwritten digit recognition. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, 1994. Conference B: Computer Vision & Image Processing, vol. 2, pp. 77–82 (1994) 13. Cun, Y.L., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Howard, W., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems II (Denver 1989), pp. 396–404. Morgan Kaufmann, San Mateo (1990) 14. Agarwal, S., Awan, A., Roth, D.: Learning to detect objects in images via a sparse, part-based representation. In: IEEE Transactions on Pattern Analysis and Matchine Intelligence, vol. 26 (2004) 15. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8 (June 2008) 16. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. Int. J. Comput. Vision 77(1-3), 259–289 (2008) 17. Fritz, M., Leibe, B., Caputo, B., Schiele, B.: Integrating representative and discriminative models for object category detection. In: ICCV 2005: Proceedings of the Tenth IEEE International Conference on Computer Vision, Washington, DC, USA, pp. 1363–1370. IEEE Computer Society Press, Los Alamitos (2005)
Harmonic Filters for Generic Feature Detection in 3D Marco Reisert1 and Hans Burkhardt2,3 1
Dept. of Diagnostic Radiology, Medical Physics, University Medical Center 2 Computer Science Department, University of Freiburg 3 Centre for Biological Signaling Studies (bioss), University of Freiburg
[email protected] Abstract. This paper proposes a concept for SE(3)-equivariant non-linear filters for multiple purposes, especially in the context of feature and object detection. The idea of the approach is to compute local descriptors as projections onto a local harmonic basis. These descriptors are mapped in a non-linear way onto new local harmonic representations, which then contribute to the filter output in a linear way. This approach may be interpreted as a kind of voting procedure in the spirit of the generalized Hough transform, where the local harmonic representations are interpreted as a voting function. On the other hand, the filter has similarities with classical low-level feature detectors (like corner/blob/line detectors), just extended to the generic feature/object detection problem. The proposed approach fills the gap between low-level feature detectors and high-level object detection systems based on the generalized Hough transform. We will apply the proposed filter to a feature detection task on confocal microscopical images of airborne pollen and compare the results to a 3D-extension of a popular GHT-based approach and to a classification per voxel solution.
1 Introduction

The theory of non-linear filters is well developed for image translations. It is known as Volterra theory. Volterra theory states that any non-linear translation-invariant system can be modelled as an infinite sum of multidimensional convolution integrals. More precisely, a filter H is said to be equivariant with respect to some group G, if gH{f} = H{gf} holds for all images f and all g ∈ G, where gf denotes the action of the group to the image f. For the group of translations (or the group of time-shifts) such filters are called Volterra series. In this paper we want to develop non-linear filters that are invariant with respect to Euclidean motion SE(3), therefore, we need a generalization of Volterra's principle to SE(3). In [1] a 2D non-linear filter was proposed that is SE(2)-equivariant. The filter was derived from the general concept of group integration which replaced Volterra's principle. In this paper we want to generalize this filter to SE(3). The generalization is not straightforward because the two-dimensional rotation group SO(2) essentially differs from its three-dimensional counterpart SO(3). As already mentioned the derivation of the filter in [1] was based on the principle of group integration. In this paper we want to follow a more pragmatic way and directly propose the 3D filter guided by its 2D analogon. Let us recapitulate the workflow of the holomorphic filter and give a sketch of its 3D counterpart. In a first step the holomorphic filter computes several convolutions with functions of the form z^j e^{−|z|²}
where z = x + iy is the pixel coordinate in complex notation. Note that the monomial z^j = r^j e^{ijφ} is holomorphic. The results of these convolutions show a special rotation behavior, e.g. for j = 1 it behaves like a gradient field or for j = 2 it behaves like a 2nd rank tensor field. Several products of these convolution results are computed. These products show again a special rotation behavior. For example, if we multiply a gradient field (j1 = 1) and a 2-tensor-field (j2 = 2) we obtain a third-order field with j = j1 + j2 = 3. According to the transformation behavior of the products they are again convolved with functions of the form z^j e^{−|z|²} such that the result of the convolution transforms like a scalar (j = 0). This is the principle of the holomorphic filter which we want to generalize to 3D. The first question is, what are the functions corresponding to z^j e^{−|z|²} in 3D? We know that the real and imaginary part of a holomorphic polynomial are harmonic polynomials. Harmonic polynomials solve the Laplace equation. As z^j e^{−|z|²} is a Gaussian-windowed holomorphic monomial we will use instead a Gaussian-windowed harmonic polynomial for the 3D filter. The second question is, how can we form products of convolutions with harmonic polynomials that entail their transformation behavior? We will find out that the Clebsch-Gordan coefficients that are known from quantum mechanics provide such products. Given two tensor fields of a certain degree we are able to form a new tensor field of another degree by a certain multiplication and weighted summation of the input fields. The weights in the summations are the Clebsch-Gordan coefficients. In [1] and [2] it was shown that the convolutions with the Gaussian-windowed holomorphic basis can be computed efficiently with complex derivatives. In fact, there is a very similar approach in 3D by so called spherical derivatives [3]. The paper is organized as follows: in the following section we give a small overview about related work. In Section 2 we introduce the basics in spherical tensor analysis. We introduce the spherical product which couples spherical tensor fields and introduce basics about spherical harmonics. We also introduce so called spherical derivatives that are the counterpart to the usual complex derivatives in 2D. They will help us to compute the occurring convolutions in an efficient manner. In Section 3 we introduce the Harmonic filter and show how the parameters can be adapted to a specific problem. Section 4 shows how the filter can be implemented efficiently and how it can be applied for feature detection in confocal microscopical images. In Section 5 we conclude and give an outlook for future work.

1.1 Related Work

Volterra filters are the canonical generalization of the linear convolution to a nonlinear mapping. They are widely used in the signal processing community and also find applications in image processing tasks [4,5]. The filter proposed in this work might be interpreted as a kind of 'joint' Volterra filter for translation and rotation. Steerable filters, introduced in [6], are a common tool in early vision and image analysis. A generalization for non-group like deformations was proposed in [7] using an approximative scheme. The harmonic filter computes a certain subset of Gaussian-windowed spherical moments in a first step, which is actually a steerable filter. The generalized Hough transform (GHT) [8] is a major tool for the detection of arbitrary shapes.
Many modern approaches [9,10] for object detection and recognition are
based on the idea that local parts of the object cast votes for the putative center of the object. If the proposed algorithm is used in the context of object detection, it may be interpreted as some kind of voting procedure for the object center. This voting interpretation also relates our approach to the Tensor Voting [11] framework (TV). However, in TV the voting function does not depend on the local context. Contrarily, the proposed filter is able to cast context-dependent votes.
2 Spherical Tensor Analysis

In the following we shortly repeat the basic notions in 3D harmonic analysis as they were introduced in [3]. For introductory reading we recommend literature [12] concerning the quantum theory of the angular momentum, while our representation tries to avoid terms from quantum theory to also give the non-physicists a chance to follow. See e.g. [13,14] for an introduction from an image processing/engineering viewpoint.

2.1 Preliminaries

Let D^j_g be the unitary irreducible representation of a g ∈ SO(3) of order j with j ∈ N. They are also known as the Wigner D-matrices (see e.g. [12]). The representation D^j_g acts on a vector space V_j which is represented by C^{2j+1}. The standard basis of C^{2j+1} is written as e^j_m. We write the elements of V_j in bold face, e.g. u ∈ V_j, and write the 2j+1 components in unbold face u_m ∈ C where m = −j, . . . , j. For the transposition of a vector/matrix we write uᵀ; the joint complex conjugation and transposition is denoted by u† = ūᵀ. Note that we treat the space V_j as a real vector space of dimension 2j + 1, although the components of u might be complex. This means that the space V_j is only closed under weighted superpositions with real numbers. As a consequence we observe that the components are interrelated by u_m = (−1)^m ū_{−m}. From a computational point of view this is an important issue. Although the vectors are elements of C^{2j+1} we just have to store 2j + 1 real numbers. So, the standard coordinate vector r = (x, y, z)ᵀ ∈ R³ has a natural relation to elements u ∈ V_1 in the form of

  u = (w, z, −w̄)ᵀ = ((1/√2)(x − iy), z, −(1/√2)(x + iy))ᵀ = S r ∈ V_1.

Note that S is a unitary coordinate transformation. Actually, the representation D^1_g is directly related to the real valued rotation matrix U_g ∈ R^{3×3} by D^1_g = S U_g S†.

Definition 1. A function f : R³ → V_j is called a spherical tensor field of rank j if it transforms with respect to rotations as follows: (gf)(r) := D^j_g f(Uᵀ_g r) for all g ∈ SO(3). The space of all spherical tensor fields of rank j is denoted by T_j.
2.2 Spherical Tensor Coupling

We define a family of symmetric bilinear forms that connect tensors of different ranks.

Definition 2. For every j ≥ 0 we define a family of symmetric bilinear forms of type •_j : V_{j1} × V_{j2} → V_j, where j1, j2 ∈ N have to be chosen according to the triangle inequality |j1 − j2| ≤ j ≤ j1 + j2 and j1 + j2 + j has to be even. It is defined by

  (e^j_m)† (v •_j w) := Σ_{m = m1 + m2} ( ⟨jm | j1 m1, j2 m2⟩ / ⟨j0 | j1 0, j2 0⟩ ) v_{m1} w_{m2}
where ⟨jm | j1 m1, j2 m2⟩ are the Clebsch-Gordan coefficients (see e.g. [12]). Up to the factor ⟨j0 | j1 0, j2 0⟩ this definition is just the usual spherical tensor coupling equation which is very well known in quantum mechanics of the angular momentum. The additional factor is for convenience. It normalizes the product such that it shows a more gentle behavior with respect to the spherical harmonics, as we will see later. The characterizing property of these products is that they respect the rotations of the arguments, i.e. if v ∈ V_{j1} and w ∈ V_{j2}, then for any g ∈ SO(3)

  (D^{j1}_g v) •_j (D^{j2}_g w) = D^j_g (v •_j w)

holds. For the special case j = 0 the arguments have to be of the same rank due to the triangle inequality. Actually, in this case the new product coincides with the standard inner product v •_0 w = w†v. Further note that if one of the arguments of • is a scalar, then • reduces to the standard scalar multiplication, i.e. v •_j w = vw, where v ∈ V_0 and w ∈ V_j. Another remark is that • is not associative. The introduced product can also be used to combine tensor fields of different rank by point-wise multiplication as f(r) = v(r) •_j w(r). If v ∈ T_{j1} and w ∈ T_{j2} and j is chosen such that |j1 − j2| ≤ j ≤ j1 + j2, then f is in T_j, i.e. a tensor field of rank j.

2.3 Spherical and Solid Harmonics

We denote the well-known spherical harmonics by Y^j : S² → V_j. We write Y^j(r), where r may be an element of R³, but Y^j(r) is independent of the magnitude of r. We know that the Y^j provide an orthogonal basis of scalar functions on the 2-sphere S². Thus, any real scalar field f ∈ T_0 can be expanded in terms of spherical harmonics in a unique manner. In the following, we use Racah's normalization (also known as semi-Schmidt normalization), i.e. ⟨Y^j_m, Y^{j'}_{m'}⟩_{S²} = (1/(2j+1)) δ_{jj'} δ_{mm'}. One important and useful property is that Y^j = Y^{j1} •_j Y^{j2}. We can use this formula to iteratively compute higher order Y^j from given lower order ones. Note that Y^0 = 1 and Y^1 = Sr, where r ∈ S². The spherical harmonics have a variety of nice properties. One of the most important ones is that each Y^j, interpreted as a tensor field of rank j, is a fix-point with respect to rotations, i.e. (gY^j)(r) = Y^j(r) or, in other words, Y^j(U_g r) = D^j_g Y^j(r). The spherical harmonics naturally arise from the solutions of the Laplace equation as the so called solid harmonics R^j(r) := r^j Y^j(r).
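For concreteness, the coupling of Definition 2 can be evaluated with SymPy's Clebsch-Gordan coefficients; the sketch below is our own illustration and assumes the components of a rank-j tensor are stored in the order m = −j, . . . , j.

import numpy as np
from sympy.physics.quantum.cg import CG

def spherical_product(v, w, j1, j2, j):
    # v has 2*j1+1 components, w has 2*j2+1; returns the rank-j coupling v •_j w.
    assert abs(j1 - j2) <= j <= j1 + j2 and (j1 + j2 + j) % 2 == 0
    norm = float(CG(j1, 0, j2, 0, j, 0).doit())
    out = np.zeros(2 * j + 1, dtype=complex)
    for m in range(-j, j + 1):
        acc = 0j
        for m1 in range(-j1, j1 + 1):
            m2 = m - m1
            if -j2 <= m2 <= j2:
                acc += float(CG(j1, m1, j2, m2, j, m).doit()) * v[m1 + j1] * w[m2 + j2]
        out[m + j] = acc / norm
    return out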
2.4 Spherical Derivatives

This section proposes the basic tools for dealing with derivatives in the context of spherical tensor analysis. In [3] the spherical derivatives are introduced. They connect spherical tensor fields of different ranks by differentiation.

Proposition 1 (Spherical Derivatives). Let f ∈ T_j be a tensor field. The spherical up-derivative ∇¹ : T_j → T_{j+1} and the down-derivative ∇₁ : T_j → T_{j−1} are defined as

  ∇¹ f := ∇ •_{j+1} f    (1)
  ∇₁ f := ∇ •_{j−1} f,   (2)

where

  ∇ = ((1/√2)(∂x − i∂y), ∂z, −(1/√2)(∂x + i∂y))

is the spherical gradient and ∂x, ∂y, ∂z the standard partial derivatives. Note that for a scalar function the spherical up-derivative is just the spherical gradient, i.e. ∇f = ∇¹f. As a prerequisite to the Harmonic filter it is necessary to mention that the spherical derivative ∇^j of a Gaussian is just a Gaussian-windowed solid harmonic:

  ∇^j e^{−r²/(2σ²)} = (√(2π) σ)³ G^j_σ(r) = (−1/σ²)^j R^j(r) e^{−r²/(2σ²)}.    (3)

An implication is that convolutions with the G^j_σ are derivatives of Gaussian-smoothed functions, namely G^j_σ ∗ f = ∇^j (G_σ ∗ f), where f ∈ T_0. Note that we use the convention G^0_σ = G_σ = (1/(√(2π) σ)³) e^{−r²/(2σ²)}.
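As a small illustration (ours, with the axis order x, y, z as an assumption), the spherical gradient of a scalar volume can be assembled from Cartesian partial derivatives:

import numpy as np

def spherical_gradient(f, spacing=1.0):
    # Rank-1 spherical tensor field nabla^1 f from a scalar volume f.
    fx, fy, fz = np.gradient(f, spacing)
    s = 1.0 / np.sqrt(2.0)
    return np.stack([s * (fx - 1j * fy),        # m = +1 (ordering is a convention)
                     fz.astype(complex),        # m =  0
                     -s * (fx + 1j * fy)])      # m = -1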
3 Harmonic Filters Our goal is to build non-linear image filters that are equivariant to Euclidean motion. An SE(3)-equivariant image filter is given by the following Definition 3 (SE(3)-Equivariant Image Filter). An scalar image filter F is a mapping from T0 onto T0 . We call such a mapping SE(3)-equivariant if F{gf } = gF{f } for all g ∈ SE(3) and f ∈ T0 . Our approach may be interpreted as a kind of context-dependend voting scheme. The intuitive idea is as follows: Compute for each position in the 3D space the projection onto the Gaussian windowed harmonic basis Gjσ for j = 0, . . . , n. You can do this by a simple convolution of the image f with the basis, i.e. pj := Gjσ ∗ f . Imagine this set of projections pj as some local descriptor images, where the set [p0 (r), p1 (r), . . . , pn (r)] of coefficients describe the harmonic part of the neighborhood of the voxel r. Then, for each voxel map these projections on some new harmonic descriptors Vj (r) = Vj [p0 (r), p1 (r), . . . , pn (r)] which can be interpreted as a local expansion of a kind
of voting function that contributes into the neighborhood of the voxel. The contribution stemming from the voter at voxel r′ at position r is

  V_{r′}(r) = G_η(r − r′) Σ_{j=0}^{∞} V^j(r′) •_0 R^j(r − r′),    (4)

i.e. the voting function is just a Gaussian-windowed harmonic function. The final step is to render the contributions from all voxels r′ together in an additive way by integration to arrive at

  H{f}(r) := ∫_{R³} V_{r′}(r) dr′ = Σ_{j=0}^{n} G^j_η •_0 V^j.
To ensure rotation-equivariance the Vj[·] have to obey the following equivariance constraint: Vj[D0g p0, . . . , Dng pn] = Djg Vj[p0, . . . , pn]. We will use the spherical product • as the basic building block for the equivariant nonlinearities Vj. There are many possibilities to combine several spherical tensors by the products • in an equivariant way. Later we will discuss this in detail.

3.1 Differential Formulation

A computationally expensive part of the filter are the convolutions: on the one hand, the projection onto the harmonic basis of the input and, secondly, the rendering of the output, also by convolution. Equation (3) shows that there is another way to compute such projections: by the use of the spherical derivative. So, we can reformulate the filter as follows:

  H{f} := G_η ∗ Σ_{j=0}^{n} ∇_j V^j[∇^0 f_s, . . . , ∇^n f_s]    (5)

with f_s = G_σ ∗ f. In Algorithm 1 we depict the computation of the filter. Note that we just have to compute n spherical derivatives ∇¹ if we implement them by repeated applications. And actually the same holds for the down-derivative ∇₁ if we follow Algorithm 1.

3.2 The Voting Function

Probably the most simple nonlinear voting function Vj is a sum of second order products of the descriptor images pj, namely

  V^j[p^0, . . . , p^n] = Σ_{|j1−j2| ≤ j ≤ j1+j2, j1+j2+j even, j1,j2 ≤ n} α^j_{j1,j2} p^{j1} •_j p^{j2}    (6)
Algorithm 1. Filter algorithm y = H{f}
Input: scalar volume image f
Output: scalar volume image y
 1: Initialize y_n := 0 ∈ T_n
 2: Convolve p_0 := G_σ ∗ f
 3: for j = 1 : n do
 4:   p_j = ∇¹ p_{j−1}
 5: end for
 6: for j = n : −1 : 1 do
 7:   y_{j−1} := ∇₁ (y_j + V^j[p_0, . . . , p_n])
 8: end for
 9: Let y := y_0 + V^0[p_0, . . . , p_n]
10: Convolve y := G_η ∗ y
where α^j_{j1,j2} ∈ R are expansion coefficients. We call the order of the products that are involved in Vj the order of the filter and denote it by N. Depending on the application they may or may not depend on the absolute intensity values of the input image. To become invariant against additive intensity changes one leaves out the zero order descriptor p0. For robustness against illumination/contrast changes we introduce a soft normalization of the first order ('gradient') descriptor p1. This means that in the for-loop in Alg. 1, lines 3-5, we introduce a special case for j = 1, namely

  p1(r) = (1 / (γ + s_dev(r))) ∇¹f(r),
where γ ∈ R is a fixed regularization parameter and sdev(r) denotes the standard deviation computed in a local window around r. The normalization makes the filter robust against multiplicative changes of the gray values and, in addition, emphasizes 'structural' and 'textural' properties rather than pure intensities. Besides γ, the filter has three other parameters: the expansion degree n, the width σ of the input Gaussian and the width η of the output Gaussian. In the spirit of the GHT, the parameter σ determines the size of the local features that vote for the center of the object of interest. To ensure that every voxel of the object can contribute, the extent of the voting function should be at least half the diameter of the object.
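As an illustration, the soft normalization can be realized with a running local standard deviation. The following is a minimal NumPy/SciPy sketch, assuming a cubic window; the window size and the value of γ are illustrative and not fixed above:

import numpy as np
from scipy.ndimage import uniform_filter

def soft_normalize(p1, f_s, gamma=0.1, window=5):
    # p1:  first-order ('gradient') descriptor image; its trailing axes match f_s
    # f_s: Gaussian-smoothed input volume used to estimate the local standard deviation
    mean = uniform_filter(f_s, size=window)
    mean_sq = uniform_filter(f_s * f_s, size=window)
    sdev = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))  # local standard deviation
    return p1 / (gamma + sdev)                              # soft normalization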
4 Pollen Porate Detection in Confocal Data

Analysis techniques for data acquired by microscopy typically demand a rotation- and translation-invariant treatment. In this experiment we use the harmonic filter for the analysis of pollen grains acquired with confocal laser scanning microscopy (see [15]). Palynology, the study and analysis of pollen, is an interesting topic with very diverse applications, for example in paleoclimatology or forensics. An important feature of certain types of pollen grain are the so-called porates, small pores on the surface of the grain. Their relative configuration is crucial for the determination of the species. We want to show that our filter is able to detect these structures in a reliable way. The dataset consists of 45 samples.
The images have varying sizes of about 80³ voxels. We labeled the porates by hand. The experimental setup is quite simple: we apply the trained harmonic filter to each pollen image and then select local maxima above a certain threshold as detection hypotheses.

4.1 Reference Approaches

We use the ideas of Ballard [8], Lowe [9] and Leibe et al. [10] and extend them to 3D. The approach is based on the generalized Hough transform (GHT). Based on a selection of interest points, local features are extracted and assigned to a codebook entry. Each codebook entry is endowed with a set of votes for the object center, which are cast for each interest point. This approach closely resembles the idea of the implicit shape model by Leibe et al. [10], where we used a 3D extension of Lowe's SIFT features [9] as local features (for details see [16]). As a second approach we apply a simple per-voxel classification scheme (VC). For each voxel we compute a set of expressive rotation-invariant features and train a classifier to discriminate the objects of interest from the background. This idea was used, for example, by Staal et al. [17] for blood vessel detection in 2D retinal images and by Fehr et al. [18] for cell detection in 3D. For details about the features and implementation see [16].

4.2 Training

For the training of the harmonic filter (and for both reference approaches) we selected one(!) good pollen example, i.e. three porate samples. To train the harmonic filter we built an indicator image with pixels set to 1 at the centers of the three porates. The indicator image is just the target image y which should satisfy H{f} = y. As mentioned before, the linearity of the filter in its parameters makes it easy to adapt them. We use an unregularized least-squares approach. Due to the high dynamic differences between the filter responses corresponding to the individual parameters, it is necessary to normalize the equations to avoid numerical problems. We used the standard deviation of the individual filter responses taken over all samples in the training image. The parameter σ, determining the size of the local features, was chosen to be 2.5 pixels. The output width η, determining the range of the voting function, was chosen to be 8 pixels, which is about half the diameter of the porates. For the training of the reference approaches see again [16].

4.3 Evaluation

In Figure 1 we show two examples. The filter detects the porates but also shows some small responses within the pollen; however, the results are still acceptable. For quantitative results we computed Precision/Recall graphs. A detection was counted as successful if it is at most 8 (respectively 4) pixels away from the true label. In Figure 2 (left) we show a PR graph for a varying expansion degree n with a low detection precision of 8 pixels. As one expects, the filter improves its performance with growing n. For n = 8 no further performance gain is observed. The runtime of the filter heavily depends on the number of spherical products to be computed. For example, for n = 6 we have to compute 46 products.
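A minimal sketch of the detection step just described (local maxima of the filter response above a threshold), using SciPy; the neighborhood size and threshold are illustrative choices not fixed in the text:

import numpy as np
from scipy.ndimage import maximum_filter

def detection_hypotheses(response, threshold, neighborhood=3):
    # a voxel is kept if it is the maximum of its neighborhood and exceeds the threshold
    is_max = response == maximum_filter(response, size=neighborhood)
    return np.argwhere(is_max & (response > threshold))  # (x, y, z) voxel coordinates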
Fig. 1. Mugwort pollen (green) with overlaid filter response (red) for two examples. The filter detects the three porates, but there are also some spurious responses within the pollen, because the pollen also has strong inner structures.

[Figure 2: three Precision/Recall plots (Recall vs. Precision); legends: n = 3, . . . , 8 (left) and GHT Harris, GHT DOG, GHT DHES, VC KNN, VC SVM, Harmonic Filter (middle and right).]
Fig. 2. Precision/Recall graphs of the porate detection problem. Left: Comparison of the Harmonic filter for different expansion degrees (precision 8 pixels). Middle: Comparison with reference approaches (precision 8 pixels). Right: Comparison with reference approaches (4 pixels).
The computation of these products takes about 6 seconds on a P4 2.8 GHz. In Figure 2 (middle) we compare the result of the Harmonic filter with n = 7 to the reference approaches. The results of the GHT based on DoG interest points are comparable with the Harmonic filter. The voxel classification approach (VC) does not perform as well; in particular, the SVM-based classification performs quite poorly. Finally, we evaluated the PR graph with a higher detection precision of 4 pixels. As already observed in [1], the GHT-based approach has problems in this case, which is probably due to the inaccurate and unstable determination of the interest points. Now both VC approaches outperform the GHT approaches, while the Harmonic Filter is clearly superior to all the others.
5 Conclusion

In this paper we presented a general-purpose non-linear filter that is equivariant with respect to 3D Euclidean motion. The filter may be seen as a joint Volterra filter for rotation and translation. The filter locally senses a harmonic projection of the image function and maps this projection onto a kind of voting function which is also harmonic. The mapping is modeled by rotation-equivariant polynomials in the describing coefficients. The harmonic projections are computed in an efficient manner by the use of spherical derivatives of Gaussian-smoothed images. We applied the filter to a 3D detection problem. For low detection precision the performance is comparable to state-of-the-art approaches, while for high detection precision the approach clearly outperforms existing approaches.
Acknowledgements. This study was supported by the Excellence Initiative of the German Federal and State Governments (EXC 294).
References
1. Reisert, M., Burkhardt, H.: Equivariant holomorphic filters for contour denoising and rapid object detection. IEEE Trans. on Image Processing 17(2) (2008)
2. Reisert, M., Burkhardt, H.: Complex derivative filters. IEEE Trans. Image Processing 17(12), 2265–2274 (2008)
3. Reisert, M., Burkhardt, H.: Spherical tensor calculus for local adaptive filtering. In: Tensors in Image Processing and Computer Vision (2009)
4. Thurnhofer, S., Mitra, S.: A general framework for quadratic Volterra filters for edge enhancement. IEEE Trans. Image Processing, 950–963 (1996)
5. Mathews, V.J., Sicuranza, G.: Polynomial Signal Processing. J. Wiley, New York (2000)
6. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Pattern Anal. Machine Intell. 13(9), 891–906 (1991)
7. Perona, P.: Deformable kernels for early vision. IEEE Trans. Pattern Anal. Machine Intell. 17(5), 488–499 (1995)
8. Ballard, D.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981)
9. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
10. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS. Springer, Heidelberg (2004)
11. Mordohai, P.: Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning. Morgan and Claypool, San Francisco (2006)
12. Rose, M.: Elementary Theory of Angular Momentum. Dover Publications (1995)
13. Miller, W., Blahut, R., Wilcox, C.: Topics in harmonic analysis with applications to radar and sonar. In: IMA Volumes in Mathematics and its Applications. Springer, New York (1991)
14. Lenz, R.: Group Theoretical Methods in Image Processing. Lecture Notes. Springer, Heidelberg (1990)
15. Ronneberger, O., Burkhardt, H., Schultz, E.: General-purpose object recognition in 3D volume data sets using gray-scale invariants. In: Proceedings of the International Conference on Pattern Recognition, Quebec, Canada. IEEE Computer Society Press, Los Alamitos (2002)
16. Reisert, M.: Harmonic filters in 3D - theory and applications. Technical Report 1/09, IIFLMB, Computer Science Department, University of Freiburg (2009)
17. Staal, J., Ginneken, B., Niemeijer, M., Viergever, M.A., Abramoff, M.: Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 23(4), 501–509 (2004)
18. Fehr, J., Ronneberger, O., Kurz, H., Burkhardt, H.: Self-learning segmentation and classification of cell-nuclei in 3D volumetric data using voxel-wise gray scale invariants. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, pp. 377–384. Springer, Heidelberg (2005)
Increasing the Dimension of Creativity in Rotation Invariant Feature Design Using 3D Tensorial Harmonics

Henrik Skibbe^{1,3}, Marco Reisert^{2}, Olaf Ronneberger^{1,3}, and Hans Burkhardt^{1,3}
1 Department of Computer Science, Albert-Ludwigs-Universität Freiburg, Germany
2 Dept. of Diagnostic Radiology, Medical Physics, University Medical Center, Freiburg
3 Center for Biological Signalling Studies (bioss), Albert-Ludwigs-Universität Freiburg
{skibbe,ronneber,Hans.Burkhardt}@informatik.uni-freiburg.de,
[email protected]

Abstract. Spherical harmonics are widely used in 3D image processing due to their compactness and rotation properties. For example, it is quite easy to obtain rotation invariance by taking the magnitudes of the representation, similar to the power spectrum known from Fourier analysis. We propose a novel approach that extends the spherical harmonic representation to tensors of higher order in a very efficient manner. Our approach utilises the so-called tensorial harmonics [1] to overcome the restriction to scalar fields. In this way it is possible to represent vector and tensor fields with all the convenient properties known from spherical harmonic theory. In our experiments we have tested our system using the most commonly used tensors in three-dimensional image analysis, namely the gradient vector, the Hessian matrix and the structure tensor. For comparable results we have used the Princeton Shape Benchmark [2] and a database of airborne pollen, leading to very promising results.
1 Introduction

In modern image processing and classification tasks we are facing an increasing amount of three-dimensional data. Since objects in different orientations are usually considered to be the same, descriptors that are rotation invariant are needed. One possible solution is features which rely on the idea of group integration, where certain features are averaged over the whole group to become invariant [3]. Here we face the problem of deriving such features in an efficient manner. In the case of 3D rotations, one of the most efficient and effective approaches utilises the theory of spherical harmonics [4]. This representation allows the group integration to be accomplished analytically. In practice, the magnitudes of certain subbands of the spherical harmonic representation are taken to become invariant. But there is one bottleneck that limits the creativity of designing features based on spherical harmonics: they represent scalar functions. This means that,
for example, vector-valued functions like the gradient field cannot be put into the spherical harmonics framework without losing the nice rotation properties (which are of particular importance for the design of invariant features). We are restricted to features with scalar components that are not interrelated by a global rotation. Only then does a component-wise spherical harmonic transformation lead to rotation invariant features. This is where our new approach comes in. Imagine that all the fantastic features which have already been proposed on the basis of the spherical harmonic approach could be generalised to vector-valued or even tensor-valued fields. What we propose is exactly this: the natural extension of the spherical harmonic framework to tensor fields of arbitrary rank, in particular including vector fields (e.g. gradient fields or gradient vector flow) and rank-2 tensor fields (e.g. the Hessian or the structure tensor). This is achieved by utilising the theory of spherical tensor analysis [1]. Doing so gives us the possibility to transform tensor fields of any rank into representations that share the same nice properties as ordinary spherical harmonic transformations. Additionally, we show how to compute these tensor field transformations efficiently by using existing tools for the fast computation of spherical harmonic representations [5,6]. The remainder of this paper is organised as follows. In Section 2 we introduce the fundamental mathematical definitions needed in the later sections. Section 3 introduces the tensorial harmonic expansion as a natural extension of the spherical harmonic expansion. We further show how rotation invariant features can be obtained in a manner similar to [4]. Section 4 addresses the problem of efficient tensor expansion and offers a solution by utilising spherical harmonics. In Section 5 we give all the details necessary to transform commonly used real cartesian tensors up to rank 2 within our framework. Finally, we present our experiments in Section 6. We successfully applied our approach to commonly used tensors, namely vectors and matrices. The promising results of the examples aim to encourage the reader to consider the use of the approach proposed here. The conclusion points out some ideas that were not investigated here and might be considered in future research.
2 Preliminaries
We assume that the reader has basic knowledge of cartesian tensor calculus. We further assume that the reader is familiar with the basic theory and notation of the harmonic analysis of SO(3), i.e. with spherical harmonics, Wigner D-matrices and their natural relation to Clebsch-Gordan coefficients, and knows how and why rotation invariant features can be obtained from spherical harmonic coefficients [4], because we will adapt this approach directly to tensorial harmonics. A good start for readers who are completely unfamiliar with the harmonic analysis of SO(3) might be [7], where a basic understanding of spherical harmonics is given from a practical point of view. The design of rotation invariant spherical harmonic features was first addressed in [4]. Deeper views into the theory are given in [8,1,9]. However, we first want to recapitulate the mathematical constructs and definitions which we will use in the following sections.
We denote by {e^j_m}_{m=−j...j} the standard basis of C^{2j+1}. The standard coordinate vector r = (x, y, z)^T ∈ R³ is related to an element u ∈ C³ by the unitary coordinate transformation S:

    S = (1/√2) ( −1  −i   0
                  0   0  √2
                  1  −i   0 ),        (1)

with u = Sr. Let D^j_g be the unitary irreducible representation of a g ∈ SO(3) of order j ∈ N_0, acting on the vector space C^{2j+1}. They are widely known as Wigner D-matrices [8]. The representation D^1_g is directly related by S to the real-valued rotation matrix U_g ∈ R^{3×3}, namely D^1_g = S U_g S^∗, where S^∗ is the conjugate transpose of S. Depending on the context we will also express the coordinate vector r ∈ R³ in spherical coordinates (r, θ, φ), which is closer to the commonly used notation of spherical harmonics, where

    r = √(x² + y² + z²),   θ = arccos( z / √(x² + y² + z²) ),   φ = atan2(y, x),        (2)

e.g. we sometimes write f(r, θ, φ) instead of f(r).

Definition 1. A function f : R³ → C^{2j+1} is called a spherical tensor field of rank j if it transforms with respect to rotation as

    (gf)(r) := D^j_g f(U^T_g r)   for all g ∈ SO(3).        (3)
The space of all spherical tensor fields of rank j is denoted by T_j. We further need to define the family of bilinear forms which we use to couple spherical tensors of different ranks.

Definition 2. For every j ≥ 0 we define the family of bilinear forms ∘_j : C^{2j1+1} × C^{2j2+1} → C^{2j+1}, which only exist for those triples j1, j2, j ∈ N_0 that fulfil the triangle inequality |j1 − j2| ≤ j ≤ j1 + j2, by

    (e^j_m)^T (v ∘_j w) := ∑_{m1=−j1}^{j1} ∑_{m2=−j2}^{j2} ⟨j1 m1, j2 m2 | j m⟩ v_{m1} w_{m2}
                         = ∑_{m=m1+m2} ⟨j1 m1, j2 m2 | j m⟩ v_{m1} w_{m2},        (4)
where ⟨j1 m1, j2 m2 | j m⟩ are the Clebsch-Gordan coefficients. (The Clebsch-Gordan coefficients are zero if m1 + m2 ≠ m.) One of the orthogonality properties of the Clebsch-Gordan coefficients that will be used later is

    ∑_{m1, m} ⟨j1 m1, j2 m2 | j m⟩ ⟨j1 m1, j2′ m2′ | j m⟩ = (2j + 1)/(2j2′ + 1) δ_{j2,j2′} δ_{m2,m2′},        (5)

where δ is the Kronecker symbol.
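For illustration, the bilinear form of Definition 2 can be evaluated directly from eq. (4) with SymPy's Clebsch-Gordan coefficients. The following sketch stores a rank-j tensor as a complex vector whose index m runs from −j to j; it is meant for clarity, not speed:

import numpy as np
from sympy.physics.quantum.cg import CG

def spherical_product(v, w, j1, j2, j):
    # eq. (4): (e^j_m)^T (v o_j w) = sum over m = m1 + m2 of <j1 m1, j2 m2 | j m> v_m1 w_m2
    # v has 2*j1+1 entries (m1 = -j1..j1), w has 2*j2+1 entries (m2 = -j2..j2)
    assert abs(j1 - j2) <= j <= j1 + j2, "triangle inequality violated"
    out = np.zeros(2 * j + 1, dtype=complex)
    for m1 in range(-j1, j1 + 1):
        for m2 in range(-j2, j2 + 1):
            m = m1 + m2
            if -j <= m <= j:
                out[m + j] += float(CG(j1, m1, j2, m2, j, m).doit()) * v[m1 + j1] * w[m2 + j2]
    return out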
3 Rotation Invariant Features from Tensorial Harmonics
Combining all the previously defined pieces we can now formalise the expansion of a spherical tensor field f ∈ T_ℓ using the notation proposed in [1]:

    f(r, θ, φ) = ∑_{j=0}^{∞} ∑_{k=−ℓ}^{ℓ} a^j_k(r) ∘_ℓ Y^j(θ, φ),        (6)
with expansion coefficients a^j_k(r) ∈ C^{2(j+k)+1} and the well-known spherical harmonics Y^j ∈ C^{2j+1}. Note that we always use the semi-Schmidt normalised spherical harmonics. In the special case ℓ = 0 the expansion coincides with the ordinary scalar spherical harmonic expansion. The important property of the tensorial harmonic expansion is

    (gf)(r) = D^ℓ_g f(U^T_g r) = ∑_{j=0}^{∞} ∑_{k=−ℓ}^{ℓ} (D^{j+k}_g a^j_k(r)) ∘_ℓ Y^j(θ, φ).        (7)
This means that a rotation of the tensor field by D^ℓ_g causes the expansion coefficients a^j_k(r) to be transformed by D^{j+k}_g. This is an important fact which we will use when we aim to get rotation invariant features from tensorial harmonic coefficients.

3.1 Designing Features
For the problem of designing features describing three-dimensional image data, the spherical harmonic based method proposed in [4] is widely known and used to transform non-rotation-invariant features into rotation invariant representations, as seen e.g. in [10,11]. Considering eq. (7) it can easily be seen that for each coefficient a^j_k(r) a feature c^j_k(r) ∈ R can be computed that is invariant to arbitrary rotations D^ℓ_g acting on a tensor field f ∈ T_ℓ:

    c^j_k(r) = ‖D^{j+k}_g a^j_k(r)‖ = ⟨D^{j+k}_g a^j_k(r), D^{j+k}_g a^j_k(r)⟩^{1/2}
             = ⟨(D^{j+k}_g)^∗ D^{j+k}_g a^j_k(r), a^j_k(r)⟩^{1/2} = ⟨a^j_k(r), a^j_k(r)⟩^{1/2} = ‖a^j_k(r)‖.        (8)
The generation of features so far is just the natural extension of the features proposed in [4], adapted to tensor fields of arbitrary order. In addition, we can also consider the interrelation of different coefficients of equal rank. For a tensor field of order ℓ we can combine 2ℓ + 1 coefficients. For two different coefficients a^j_k(r) and a^{j′}_{k′}(r) with j + k = j′ + k′ we can easily extend the feature defined above such that the following feature is also unaffected by arbitrary rotations:

    c^{jj′}_{kk′}(r) = | ⟨D^{j+k}_g a^j_k(r), D^{j′+k′}_g a^{j′}_{k′}(r)⟩ | = | ⟨a^j_k(r), a^{j′}_{k′}(r)⟩ |.        (9)
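A small sketch of how the invariants of eqs. (8) and (9) would be computed once the expansion coefficients a^j_k(r) at a position r are available as complex NumPy vectors; the function names are illustrative:

import numpy as np

def invariant_norm(a):
    # eq. (8): c^j_k = ||a^j_k|| -- the norm is unchanged by the unitary matrices D^{j+k}_g
    return np.linalg.norm(a)

def invariant_coupling(a, b):
    # eq. (9): |<a^j_k, a^{j'}_{k'}>| for two coefficients of equal rank j + k = j' + k'
    assert a.shape == b.shape, "both coefficients must have rank j + k = j' + k'"
    return np.abs(np.vdot(a, b))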
4 Fast Computation of Tensorial Harmonic Coefficients
In the current section we derive a computation rule for the tensorial harmonic coefficients based on the ordinary spherical harmonic expansion. This is very important, since spherical harmonic expansions can be realized in a very efficient manner [6]. It is obvious that each of the 2ℓ + 1 components (e^ℓ_M)^T f(r), M = −ℓ, . . . , ℓ, of a spherical tensor field f ∈ T_ℓ can be separately expanded by an ordinary spherical harmonic expansion:

    (e^ℓ_M)^T f(r, θ, φ) = ∑_{j=0}^{∞} b^j_M(r)^T Y^j(θ, φ),        (10)

where the b^j_M(r) ∈ T_j are the spherical harmonic coefficients. Combining eq. (10) and eq. (6) we obtain a system of equations which allows us to determine the relation between the tensorial harmonic coefficients a^j_k(r) and the spherical harmonic coefficients b^j_M(r):
    (e^ℓ_M)^T f(r, θ, φ)
      = ∑_{j=0}^{∞} ∑_{k=−ℓ}^{ℓ} ( a^j_k(r) ∘_ℓ Y^j(θ, φ) )_M
      = ∑_{j=0}^{∞} ∑_{k=−ℓ}^{ℓ} ∑_{M=m+n} a^j_{km}(r) ⟨(j+k)m, jn | ℓM⟩ Y^j_n(θ, φ)
      = ∑_{j=0}^{∞} ∑_{k=−ℓ}^{ℓ} ∑_{m=−(j+k)}^{j+k} ∑_{n=−j}^{j} a^j_{km}(r) ⟨(j+k)m, jn | ℓM⟩ Y^j_n(θ, φ)
      = ∑_{j=0}^{∞} ∑_{n=−j}^{j} Y^j_n(θ, φ) ∑_{k=−ℓ}^{ℓ} ∑_{m=−(j+k)}^{j+k} a^j_{km}(r) ⟨(j+k)m, jn | ℓM⟩
      = ∑_{j=0}^{∞} ∑_{n=−j}^{j} b^j_{M,n}(r) Y^j_n(θ, φ)
      = ∑_{j=0}^{∞} b^j_M(r)^T Y^j(θ, φ),        (11)

where the inner double sum in the fourth line is identified with b^j_{M,n}(r).
With use of eq. (11) we can directly observe that

    b^j_{M,n}(r) = ∑_{k=−ℓ}^{ℓ} ∑_{m=−(j+k)}^{j+k} a^j_{km}(r) ⟨(j+k)m, jn | ℓM⟩.        (12)
Multiplying both sides with ⟨(j+k′)m′, jn | ℓM⟩ results in

    b^j_{M,n}(r) ⟨(j+k′)m′, jn | ℓM⟩ = ∑_{k=−ℓ}^{ℓ} ∑_{m=−(j+k)}^{j+k} a^j_{km}(r) ⟨(j+k)m, jn | ℓM⟩ ⟨(j+k′)m′, jn | ℓM⟩.        (13)
Summing over all n and M leads to

    ∑_{M,n} b^j_{M,n}(r) ⟨(j+k′)m′, jn | ℓM⟩
      = ∑_{k=−ℓ}^{ℓ} ∑_{m=−(j+k)}^{j+k} a^j_{km}(r) ∑_{M,n} ⟨(j+k)m, jn | ℓM⟩ ⟨(j+k′)m′, jn | ℓM⟩,        (14)

where the inner sum over M and n equals (2ℓ+1)/(2(j+k′)+1) δ_{k,k′} δ_{m,m′}.
Due to the orthogonality of the Clebsch-Gordan coefficients (5), all addends with m ≠ m′ or k ≠ k′ vanish:

    ∑_{M,n} b^j_{M,n}(r) ⟨(j+k′)m′, jn | ℓM⟩ = (2ℓ+1)/(2(j+k′)+1) a^j_{k′m′}(r).        (15)
Finally, we obtain our computation rule, which allows us to easily and efficiently compute the tensorial harmonic coefficients a^j_k ∈ T_{j+k} based on the spherical harmonic expansions of the individual components of a given tensor field f:

    a^j_{k′m′}(r) = (2(j+k′)+1)/(2ℓ+1) ∑_{M=−ℓ}^{ℓ} ∑_{n=−j}^{j} b^j_{M,n}(r) ⟨(j+k′)m′, jn | ℓM⟩.        (16)
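As an illustration of eq. (16), the following sketch assembles one tensorial harmonic coefficient a^j_k from the per-component spherical harmonic coefficients b^j_{M,n}, again using SymPy for the Clebsch-Gordan coefficients; the array layout (index M + ℓ, n + j) is an assumption made for the example:

import numpy as np
from sympy.physics.quantum.cg import CG

def tensorial_coefficient(b, ell, j, k):
    # b[M + ell, n + j] holds b^j_{M,n}(r); returns a^j_k(r) with 2(j+k)+1 components (eq. (16))
    jk = j + k
    a = np.zeros(2 * jk + 1, dtype=complex)
    for m in range(-jk, jk + 1):
        acc = 0j
        for M in range(-ell, ell + 1):
            for n in range(-j, j + 1):
                acc += b[M + ell, n + j] * float(CG(jk, m, j, n, ell, M).doit())
        a[m + jk] = (2 * jk + 1) / (2 * ell + 1) * acc
    return a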
5 Transforming Cartesian Tensors into Spherical Tensors
The question that has not been answered yet is how these spherical tensor fields are related to cartesian tensor fields like scalars, vectors and matrices. In the following we show how cartesian tensors up to rank two can easily be transformed into a spherical tensor representation, which can then be used to obtain rotation invariant features. For scalars the answer is trivial. For rank 1 it is the unitary transformation S that directly maps the real-valued cartesian vector r ∈ R³ to its spherical counterpart. More complicated is the case of real-valued tensors T ∈ R^{3×3} of rank 2. Nevertheless, we will see that the vector space of real cartesian tensors of rank 2 also covers tensors of rank 1 and 0. Due to this fact we can build up our system covering all three cases by just considering the rank-2 case. There exists a unique cartesian tensor decomposition for tensors T ∈ R^{3×3}:

    T = α I_{3×3} + T_anti + T_sym,        (17)

where T_anti is an antisymmetric matrix, T_sym a traceless symmetric matrix and α ∈ R. The corresponding spherical decomposition is then given by

    v^j_m = ∑_{m=m1+m2} (−1)^{m1} ⟨1 m1, 1 m2 | j m⟩ T^s_{1−m1, 1+m2},        (18)
where T^s = S T S^∗ and v^j ∈ C^{2j+1}, j = 0, 1, 2. Note that the spherical tensor v^0 corresponds to α, i.e. a scalar. The real-valued cartesian representation of v^1 is the antisymmetric matrix T_anti, or equivalently a vector in R³, and v^2 has its cartesian representation in R^{3×3} as the traceless symmetric matrix T_sym.

Proposition 1. The spherical tensors v^0, v^1, v^2 are the result of the spherical decomposition of the real-valued cartesian tensor T = ( t00 t01 t02 ; t10 t11 t12 ; t20 t21 t22 ) of rank 2, with

    v^0 = −(t00 + t11 + t22) / √3,

    v^1 = ( ½ (t02 − t20 + i(t21 − t12)),
            (i/√2) (t01 − t10),
            ½ (t02 − t20 − i(t21 − t12)) ),

    v^2 = ( ½ (t00 − t11 − i(t01 + t10)),
            ½ (−(t02 + t20) + i(t12 + t21)),
            −(1/√6) (t00 + t11 − 2 t22),
            ½ ((t02 + t20) + i(t12 + t21)),
            ½ (t00 − t11 + i(t01 + t10)) ),

where v^0 ∈ C^1, v^1 ∈ C^3 and v^2 ∈ C^5.
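The decomposition of Proposition 1 translates directly into code; the following NumPy sketch simply transcribes the formulas above for a real 3×3 tensor:

import numpy as np

def spherical_decomposition(T):
    # T: real 3x3 cartesian tensor, T[a, b] = t_ab; returns (v0, v1, v2) as in Proposition 1
    t = np.asarray(T, dtype=float)
    v0 = -(t[0, 0] + t[1, 1] + t[2, 2]) / np.sqrt(3.0)
    v1 = np.array([
        0.5 * (t[0, 2] - t[2, 0] + 1j * (t[2, 1] - t[1, 2])),
        (1j / np.sqrt(2.0)) * (t[0, 1] - t[1, 0]),
        0.5 * (t[0, 2] - t[2, 0] - 1j * (t[2, 1] - t[1, 2])),
    ])
    v2 = np.array([
        0.5 * (t[0, 0] - t[1, 1] - 1j * (t[0, 1] + t[1, 0])),
        0.5 * (-(t[0, 2] + t[2, 0]) + 1j * (t[1, 2] + t[2, 1])),
        -(t[0, 0] + t[1, 1] - 2.0 * t[2, 2]) / np.sqrt(6.0),
        0.5 * ((t[0, 2] + t[2, 0]) + 1j * (t[1, 2] + t[2, 1])),
        0.5 * (t[0, 0] - t[1, 1] + 1j * (t[0, 1] + t[1, 0])),
    ])
    return v0, v1, v2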
6 Experiments
We perform experiments comparing tensorial harmonic descriptors derived from different tensors. For testing we use the Princeton Shape Benchmark (PSB) [2], consisting of 1814 triangulated objects divided into 161 classes. We represent the models in a 150³ voxel grid. The objects are translationally normalised with respect to their centre of gravity. We further perform experiments based on an airborne pollen database containing 389 files equally divided into 26 classes [12,11]. All pollen are normalised to a spherical representation with a radius of 85 voxels (Figure 1). In both experiments we compute the first- and second-order derivatives for each object and perform a discrete coordinate transform according to eq. (2) for the intensity values and the derivatives. For each radius (in voxel step size) the longitude θ and the colatitude φ are sampled in 64 steps for models of the PSB. In the case of the pollen database we use a spherical resolution of 128 steps for the longitude θ and 128 steps for the colatitude φ. In addition to the ordinary spherical harmonic expansion (denoted as SH) of the scalar-valued intensity fields, we perform the tensorial harmonic expansion of the following cartesian tensor fields according to Proposition 1 and eq. (16):
Fig. 1. The 26 classes of the spherically normalised airborne pollen dataset
Fig. 2. PSB containing 1814 models divided into 161 classes
Vectorial Harmonic Expansion (VH). Similar to spherical harmonics, the vectorial harmonics were first used in a physics context [13]. For convenience we prefer the representation of the gradient as a 2nd-order tensor using the axiator, despite the fact that gradient vectors only have rank 1 (eq. (18)). Using Proposition 1 we transform the cartesian gradient vector field into its spherical counterpart and perform the tensorial harmonic expansion:

    ∇I× = (   0   −I_z   I_y
             I_z    0   −I_x
            −I_y   I_x    0  ),        (19)

where ∇ is the nabla operator, × denotes the axiator and I_x := ∂I/∂x.

Hessian Harmonic Expansion (HH). The Hessian tensor field can be transformed in a manner similar to the vectorial harmonics. In contrast, however, we obtain two harmonic expansions according to Proposition 1.

Structural Harmonic Expansion (StrH). The structure tensor is widely used in 2D and 3D image analysis. It is derived from the outer product of the gradient vector with itself, followed by a componentwise convolution with an isotropic Gaussian kernel g_σ:

    g_σ ∗ ( I_x²     I_x I_y  I_x I_z
            I_x I_y  I_y²     I_y I_z
            I_x I_z  I_y I_z  I_z²  ).        (20)

In our experiments we use a standard deviation σ of 3.5 (in voxel diameter). In the experiments related to the PSB we found it best to cut off the expansions at band width 25. We compute rotation invariant features according to Section 3.1. All features are normalised with respect to the L1 norm. In the case of the HH and the StrH expansion we obtain two separate features for each expansion, which we concatenate. In order to keep the results comparable to those given in [2], we perform our experiments on the test and training sets of the PSB at the finest granularity. For a description of the performance measures used (Nearest-Neighbour, 1st-Tier, 2nd-Tier, E-Measure, Discounted-Cumulative-Gain) see [2]. Table 1 depicts our results. Results based on features considering the interrelation of different coefficients (eq. (9)) are marked with a subscripted 2, e.g. VH2. The results of further experiments conducting a LOOCV¹ considering all 1814 objects are depicted in the left-hand graph of Figure 3.
¹ Leave-one-out cross-validation.
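A sketch of the smoothed structure tensor field of eq. (20); here the gradient components are obtained with Gaussian derivatives, which is one possible choice (the text does not prescribe how I_x, I_y, I_z are computed), and grad_sigma is an illustrative parameter:

import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor_3d(volume, sigma=3.5, grad_sigma=1.0):
    # gradient components I_x, I_y, I_z via Gaussian derivatives
    Ix = gaussian_filter(volume, grad_sigma, order=(1, 0, 0))
    Iy = gaussian_filter(volume, grad_sigma, order=(0, 1, 0))
    Iz = gaussian_filter(volume, grad_sigma, order=(0, 0, 1))
    grads = (Ix, Iy, Iz)
    # componentwise Gaussian smoothing of the outer product, as in eq. (20)
    S = np.empty((3, 3) + volume.shape)
    for a in range(3):
        for b in range(3):
            S[a, b] = gaussian_filter(grads[a] * grads[b], sigma)
    return S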
Table 1. PSB: results on the test set (left) and training set (right). The subscripted 2 denotes features based on eq. (9); otherwise features are based on eq. (8). To show the superiority of tensorial harmonics over spherical harmonics we also give the results for the best corresponding SH feature (SH∗) from [2].

Test set:
Method   NN     1stT   2ndT   EM     DCG
StrH2    61.6%  34.3%  44.2%  26.1%  60.9%
StrH     61.0%  33.5%  43.6%  25.4%  60.2%
HH2      58.5%  31.5%  40.5%  24.5%  58.5%
VH2      58.0%  31.6%  40.7%  24.5%  58.5%
VH       57.7%  30.8%  39.9%  23.7%  57.6%
HH       56.9%  30.5%  39.7%  23.8%  57.5%
SH       52.5%  27.2%  36.2%  21.6%  54.5%
SH∗      55.6%  30.9%  41.1%  24.1%  58.4%

Training set:
Method   NN     1stT   2ndT   EM     DCG
StrH2    61.7%  34.6%  44.5%  25.1%  61.9%
StrH     61.4%  33.8%  43.5%  24.4%  61.3%
HH2      59.3%  31.8%  42.2%  23.7%  60.2%
VH2      58.9%  31.6%  42.0%  23.6%  59.7%
VH       56.6%  30.4%  40.0%  22.5%  58.4%
HH       57.6%  30.7%  40.3%  22.6%  58.9%
SH       55.8%  26.8%  36.2%  20.2%  55.9%

[Figure 3: two plots of 'correctly classified in %' vs. the minimum number of correct nearest neighbours (1-4 for the PSB, 1-10 for the pollen dataset); see the caption below.]
Fig. 3. (left): LOOCV of the whole PSB dataset, demanding 1, 2, 3 and 4 correct NN. (right): LOOCV results of the pollen dataset, showing the performance when demanding up to 10 correct nearest neighbours.
Second, we perform experiments on the airborne pollen database. The expansions are done up to the 40th band. We compute features based on eq. (8) in the same manner as for the PSB experiment. The results of a LOOCV showing the performance of the features are depicted in the right graph of Figure 3.
7 Conclusion
We presented a new method with which tensor fields of higher order can be described in a rotation invariant manner. We further showed how to compute tensor field transformations efficiently using a componentwise spherical harmonic transformation. The experiments concerning higher-order tensors led to the best results and confirmed our assumption that the consideration of higher-order tensors for feature design is very promising. Taking advantage of the presence of different expansion coefficients of equal rank for higher-order tensors additionally improved our results. However, we also observed that we cannot give a fixed ranking of the performance of the investigated tensors. Considering
the results on the PSB, the structural harmonic features performed best. In contrast, they showed the worst performance in the pollen classification task. For future work we want to apply our method to tensors based on biological multi-channel data. We further aim to examine features based on the gradient vector flow.

Acknowledgement. This study was supported by the Excellence Initiative of the German Federal and State Governments (EXC 294).
References
1. Reisert, M., Burkhardt, H.: Efficient tensor voting with 3D tensorial harmonics. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2008), pp. 1–7 (2008)
2. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The Princeton Shape Benchmark. In: Shape Modeling and Applications, pp. 167–178 (2004)
3. Reisert, M.: Group Integration Techniques in Pattern Analysis - A Kernel View. PhD thesis, Albert-Ludwigs-Universität Freiburg (2008)
4. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Symposium on Geometry Processing (June 2003)
5. Kostelec, P.J., Rockmore, D.N.: S2kit: A lite version of SpharmonicKit. Department of Mathematics, Dartmouth College (2004)
6. Healy, D.M., Rockmore, D.N., Moore, S.S.B.: FFTs for the 2-sphere - improvements and variations. Technical report, Hanover, NH, USA (1996)
7. Green, R.: Spherical harmonic lighting: The gritty details. In: Archives of the Game Developers Conference (March 2003)
8. Rose, M.: Elementary Theory of Angular Momentum. Dover Publications (1995)
9. Brink, D.M., Satchler, G.R.: Angular Momentum. Oxford Science Publications (1993)
10. Reisert, M., Burkhardt, H.: Second order 3D shape features: An exhaustive study. C&G, Special Issue on Shape Reasoning and Understanding 30(2) (2006)
11. Ronneberger, O., Wang, Q., Burkhardt, H.: 3D invariants with high robustness to local deformations for automated pollen recognition. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 425–435. Springer, Heidelberg (2007)
12. Ronneberger, O., Burkhardt, H., Schultz, E.: General-purpose object recognition in 3D volume data sets using gray-scale invariants - classification of airborne pollen-grains recorded with a confocal laser scanning microscope. In: Proceedings of the International Conference on Pattern Recognition, Quebec, Canada (2002)
13. Morse, P.M., Feshbach, H.: Methods of Theoretical Physics, Part II. McGraw-Hill, New York (1953)
Training for Task Specific Keypoint Detection

Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua
CVLab, EPFL, Lausanne, Switzerland
Abstract. In this paper, we show that better performance can be achieved by training a keypoint detector to only find those points that are suitable to the needs of the given task. We demonstrate our approach in an urban environment, where the keypoint detector should focus on stable man-made structures and ignore objects that undergo natural changes such as vegetation and clouds. We use WaldBoost learning with task-specific training samples in order to train a keypoint detector with this capability. We show that our approach generalizes to a broad class of problems where the task is known beforehand.
1 Introduction

State-of-the-art keypoint descriptors such as SIFT [1] or SURF [2] are designed to be insensitive to both perspective distortion and illumination changes, which allows images obtained from different viewpoints and under different lighting conditions to be successfully matched. This capability is hindered by the fact that general-purpose keypoint detectors exhibit a performance which deteriorates with seasonal changes and variations in lighting. A standard approach to coping with this difficulty is to set the parameters of the detectors so that a far greater number of keypoints than necessary are identified, in the hope that enough will be found consistently across multiple images. This method, however, entails performing unnecessary computations and increases the chances of mismatches. In this paper, we show that when training data is available for a specific task, we can do better by training a keypoint detector to only identify those points that are relevant to the needs of the given task. We demonstrate our approach in an urban environment where the detector should focus on stable man-made structures and ignore the surrounding vegetation, the sky and the various shadows, all of which display features that do not persist with seasonal and lighting changes. We rely on WaldBoost learning [3], similar in essence to the recent work [4] by the same authors, to learn a classifier that responds more frequently on stable structures. Task-specific keypoint detection is known to play an important role in human perception. Among the early seminal studies is that of Yarbus [5], where it was demonstrated that a subject's gaze is drawn to relevant aspects of a scene and that eye movements are highly influenced by the assigned task, for instance memorization. To the best of our knowledge, these ideas have not yet made their mark for image-matching purposes. Our main contribution is to show that image matching algorithms benefit from incorporating task-specific keypoint detection.
We begin this paper with a brief review of related approaches. Next, we discuss in more detail what constitutes a stable keypoint that an optimized detector should identify and introduce our approach to training such a detector. Experimental results are then presented for the structure-and-motion problem, where our goal is to build a keypoint detector, called TaSK (Task Specific Keypoint), that focuses on stable man-made structures. We also show a result for a keypoint detector that was trained to focus on face features. Finally, we conclude with a discussion.
2 Related Work

State-of-the-art keypoint detectors fall into two broad categories: those that are designed to detect corners on the one hand, and those that detect blob-like image structures on the other. An extensive overview can be found in Tuytelaars et al. [6]. Corner-like detectors such as Harris, FAST [7] and Förstner [8] [9,10] are often used for pose and image localization problems. These detectors have a high spatial precision in the image plane but are not scale invariant and are therefore used for small-baseline matching or tracking. The other category of keypoint detectors aims at detecting blob structures (SIFT [1], MSER [11] or SURF [2]). They provide a scale estimate, which renders them suited for wide-baseline matching [12,13] or for the purpose of object detection and categorization. Both detector types can be seen as general-purpose hand-crafted detectors, which for many applications run at a very high false positive rate to prevent failures from missed keypoints. Our approach is most related to the work of Šochman and Matas [4]. These authors emulate the behavior of a keypoint detector using the boosting learning method. They show that the emulated detector achieves equivalent performance with a substantial speed improvement. Rosten and Drummond [7,14] applied a similar idea to make fast decisions about the presence of a keypoint in an image patch. There, learning techniques are used to enhance the detection speed for general-purpose keypoint detection. Note that their work does not focus on task-specific keypoint detection, which is the aim of this paper. Similar in spirit is also the work of Kienzle et al. [15], in which human eye movement data is used to train a saliency detector.
3 Task Specific Keypoints

Training data can be used in various ways to improve keypoint detection. We describe two approaches in the following sections.

3.1 Detector Verification

Suppose we are given a keypoint detector K and a specific task for which training data is available. The most natural way to enhance keypoint detection is based on a post-filtering process: among all detections which are output by the detector K, we are interested only in the keypoints that are relevant given the training data. Our enhanced keypoint detector would then output all low-level keypoints, and an additional classification stage is added which rejects unreliable keypoints based on the learned appearance.
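A minimal sketch of this verification stage, assuming a generic detector output, a scikit-learn-style classifier and a hypothetical feature extraction helper; it merely filters the low-level detections by the learned appearance:

import numpy as np

def verify_keypoints(keypoints, image, classifier, extract_features):
    # keypoints:        detections of a generic low-level detector K
    # classifier:       learned appearance model (scikit-learn style, assumed)
    # extract_features: hypothetical helper returning a feature vector for a keypoint
    kept = []
    for kp in keypoints:
        feats = np.asarray(extract_features(image, kp), dtype=float)
        if classifier.predict(feats.reshape(1, -1))[0] == 1:  # keep task-relevant points only
            kept.append(kp)
    return kept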
Fig. 1. Keypoint detections by DoG (top) and our proposed detector TaSK (bottom). Note that TaSK is specialized to focus more on stable man-made structures and ignores vegetation and sky features.
3.2 Detector Learning

In order to learn the appearance of good keypoints we need to specify how they are characterized. In particular, we need to specify the conditions under which a pixel can be regarded as a good keypoint. We will use the following two criteria:
1. A good keypoint can be reliably matched over many images.
2. A good keypoint is well localized, meaning its descriptor is sufficiently different from the descriptors of its neighboring pixels.

All pixels that obey these criteria constitute the positive input class for our learning, while the negative training examples are random samples of the training images. Our method is based on WaldBoost learning [3], similar in spirit to the work of Šochman and Matas [4]. Using our aforementioned training examples, we learn a classifier that responds more frequently on stable structures such as buildings and ignores unstable ones such as vegetation or shadows. Our eventual goal is to only detect keypoints that can be reliably matched. The advantage is not only a better registration, but also a speed-up of the calibration. For the WaldBoost training we used images taken by a panorama camera. These images have been taken from the same viewpoint every 10 minutes for the past four years. This massive training set captures lighting and seasonal changes but does not cover appearance variations which are due to changes in viewpoint.

3.3 Training Samples

The generation of the training samples is an important preliminary step for the detector learning, since the boosting algorithm optimizes for the provided training samples. In [3], the set of training samples fed into the boosting algorithm is the set of all keypoints identified by a specific detector. In so doing, the learned detector is naturally no more than an emulation of that detector for the training samples. Our research aims at generating a narrower set of training samples which obey the criteria proposed in Section 3.2. In a first step, we use the Förstner [8] operator to find keypoint candidates which are well localized in the images. In a second step, keypoints which are estimated to have poor reliability for reconstruction purposes are pruned. The automated selection of keypoints is based on two features: the number of occurrences of a keypoint and the stability of a descriptor at a specific position over several images of the sequences. The number of occurrences is simply a count of how many times a fixed pixel position has been detected as a keypoint in several images of the same scene. To illustrate our measure of stability, let p^j_i denote the position of the i-th keypoint in the j-th image, i = 1 . . . N_j, j = 1 . . . N_images. The union P = ∪_{i,j} {p^j_i} contains all the positions which have been detected in at least one image. In all the images a SIFT descriptor s^j_k is calculated for every single position p_k ∈ P. For the stability of the descriptor, Euclidean distances d^{j1,j2}_k = dist(s^{j1}_k, s^{j2}_k) are calculated and their median d_k = median(d^{j1,j2}_k), j1 ≠ j2, is determined. The more stable a keypoint is over time, the smaller its median will be. A pixel position is then classified as a good keypoint if its occurrence count is high and its descriptor median is low: two thresholds were thus set so that a reasonable number of keypoints is obtained for our training set (a couple of thousand per image). These keypoints form the positive training set. The negative training examples are randomly sampled from the same images such that they are no closer than 5 pixels to any positive keypoint. Given these training examples we apply WaldBoost learning, as described in the next section.
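The selection rule just described can be sketched as follows; the arrays, thresholds and the descriptor length are illustrative, and the pairwise median is computed naively for clarity:

import numpy as np
from itertools import combinations

def select_stable_keypoints(counts, descriptors, min_count, max_median):
    # counts[k]:      number of images in which position p_k was detected as a keypoint
    # descriptors[k]: (n_images, 128) array with the SIFT descriptor of p_k in every image
    good = []
    for k, desc in enumerate(descriptors):
        dists = [np.linalg.norm(desc[j1] - desc[j2])
                 for j1, j2 in combinations(range(len(desc)), 2)]
        if counts[k] >= min_count and np.median(dists) <= max_median:
            good.append(k)  # high occurrence count and low descriptor median
    return good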
4 Keypoint Boosting

Boosting works by sequentially applying a (usually weak) classification algorithm to a re-weighted set of training examples [16,17]. Given N training examples x_1 . . . x_N together with their corresponding labels y_1 . . . y_N, it is a greedy algorithm which leads to a classifier H(x) of the form

    H(x) = ∑_{t=1}^{T} h_t(x),        (1)
where h_t(x) ∈ H is a weak classifier from a pool H chosen to be simple and efficient to compute. H(x) is obtained sequentially by finding at each iteration t the weak classifier which minimizes the D_t(x_i)-weighted training error

    Z_t = ∑_{i=1}^{N} D_t(x_i) exp(−y_i h_t(x_i)).        (2)
The weights of each training sample, D_t(x_i), are initialized uniformly and updated according to the classification performance. One possibility to minimize eq. (2) uses domain partitioning [17], as explained next.

4.1 Fuzzy Weak Learning by Domain Partitioning

The minimization of eq. (2) includes the optimization over possible features with response function r(x) and over the partitioning of the feature response into k = 1 . . . K non-uniformly distributed bins. If a sample point x falls into the k-th bin, its corresponding weak classification result is approximated by c_k. This corresponds to the real version of AdaBoost.¹ With this partitioning model, eq. (2) can be written as (for the current state of training t)

    Z = ∑_{k=1}^{K} ∑_{r(x_i) ∈ k} D(x_i) exp(−y_i c_k).        (3)
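A sketch of the domain-partitioning step for one candidate feature: the responses are binned, positive and negative weights are accumulated per bin, and the per-bin outputs use the standard real-AdaBoost choice c_k = ½ ln(W_k⁺/W_k⁻), which minimizes the exponential score of eq. (3); the small ε guarding empty bins is an implementation detail not mentioned in the text:

import numpy as np

def fit_domain_partition(responses, labels, weights, bin_edges, eps=1e-9):
    # responses: r(x_i); labels: y_i in {-1, +1}; weights: current D(x_i)
    bins = np.digitize(responses, bin_edges)       # bin index per sample
    K = len(bin_edges) + 1
    c = np.zeros(K)
    Z = 0.0
    for k in range(K):
        in_bin = bins == k
        w_pos = np.sum(weights[in_bin & (labels > 0)])
        w_neg = np.sum(weights[in_bin & (labels < 0)])
        c[k] = 0.5 * np.log((w_pos + eps) / (w_neg + eps))   # per-bin output c_k
        Z += w_pos * np.exp(-c[k]) + w_neg * np.exp(c[k])    # contribution to eq. (3)
    return c, Z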
To compute the optimal weak classifier for a given distribution D(x_i), many features r are sampled and the best one, i.e. the one with minimal Z, is kept. The optimal partitioning is obtained by rewriting eq. (3) for positive (y_i = 1) and negative (y_i = −1) training data:

    Z = ∑_{k=1}^{K} ( W_k⁺ exp(−c_k) + W_k⁻ exp(c_k) ),        (4)

where W_k^{+/−} = ∑_{r(x_i) ∈ k} D^{+/−}(x_i) is the sum of the positive and negative weights that fall into a certain bin k.
¹ For the discrete AdaBoost algorithm, a weak classifier estimates one threshold t_0 and outputs α ∈ {−1, 1} depending on whether a data point is below or above this threshold.
ALGORITHM: WaldBoost Keypoint learning
Input: h ∈ H, (x_1, y_1) . . . (x_N, y_N), θ⁺, θ⁻
initialize weights D(x_i) = 1/N; mark all training examples as undecided {y_i* = 0}
For t = 1 . . . T (number of weak learners in the cascade)
    sample training examples x_i from the undecided examples {y_i* = 0}
    compute weights D(x_i) w.r.t. H_{t−1} for all {y_i* = 0}
    For s = 1 . . . S (number of weak learner trials)
        - sample weak learner h_t ∈ H
        - compute response r(x_i)
        - compute domain partitioning and score Z [17]
    End
    - among the S weak learners keep the best and add h_t to the strong classifier H_T = ∑_t h_t
    - sequential probability ratio test [3]: classify all current training examples into y_i* ∈ {+1, −1, 0}
End

Fig. 2. WaldBoost Keypoint learning
After finding the optimal weak learner, Wald's decision criterion is used to classify the training samples into {+1, −1, 0}, while the next weak learner is obtained by using only the undecided, zero-labelled training examples. The entire algorithm is shown in Fig. 2. For more information we refer to the work of Schapire et al. [17].

4.2 Weak Classifier

The image features which are used for the weak classifiers are computed using integral images and include color as well as gradient features. For the minimization of eq. (4), we first randomly sample a specific kind of weak classifier and then its parameters. The weak classifiers include:
– ratio of the mean colors of two rectangles: compares two color components of two rectangles at two different positions (2+4+4 parameters).
– mean color of a rectangle: measures the mean color components of a rectangle (1+2 parameters).
– roundness and intensity: integral images are computed from the components of the structure tensor; roundness and intensity as defined by Förstner and Gülch [8] are then computed on a randomly sampled rectangle size (2 parameters).
5 Detector Evaluation

Repeatability is a main criterion for evaluating the performance of keypoint detectors. In contrast to the studies by Mikolajczyk et al. [18], where good feature detection was defined according to the percentage of overlap between the keypoint ellipses, we evaluate repeatability more specifically for the task of image calibration. The Mikolajczyk criterion is in fact not well suited to evaluating multi-view image calibration, where a successful calibration should result in a sub-pixel reprojection error. We are more interested in a keypoint location which only deviates by a few pixels from the ideal
[Figure legend: DoG, TaSK, Harris, MSER, SURF]