Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6819
Yuri Boykov Fredrik Kahl Victor Lempitsky Frank R. Schmidt (Eds.)
Energy Minimization Methods in Computer Vision and Pattern Recognition 8th International Conference, EMMCVPR 2011 St. Petersburg, Russia, July 25-27, 2011 Proceedings
Volume Editors Yuri Boykov Frank R. Schmidt University of Western Ontario Computer Science Department London, N6A 5A5, ON, Canada E-mail:
[email protected] E-mail:
[email protected] Fredrik Kahl Lund University Centre for Mathematical Sciences 22100 Lund, Sweden E-mail:
[email protected] Victor Lempitsky University of Oxford Department of Engineering Science Oxford, OX1 3PJ, UK E-mail:
[email protected] ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-23093-6 e-ISBN 978-3-642-23094-3 DOI 10.1007/978-3-642-23094-3 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011933713 CR Subject Classification (1998): F.2, E.1, G.2, I.3.5, G.1, C.2 LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Over the last few decades, energy minimization methods have become an established paradigm to resolve a variety of challenges in the fields of computer vision and pattern recognition. While traditional approaches to computer vision were often based on a heuristic sequence of processing steps and merely allowed a very limited theoretical understanding of the respective methods, most state-of-the-art methods are nowadays based on the concept of computing solutions to a given problem by minimizing the respective energies.
This volume contains the papers presented at the 8th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 2011), held at the Radisson Royal Hotel in Saint Petersburg, July 25–27, 2011. These papers demonstrate that energy minimization methods have become a mature field of research spanning a broad range of areas from discrete graph theoretic approaches and Markov random fields to variational methods and partial differential equations. Application areas include image segmentation and tracking, shape optimization and registration, inpainting and image denoising, color and texture modeling, statistics and learning.
Overall, we received 52 high-quality submissions. Based on the reviewer recommendations, after a double-blind review process 30 papers were selected for publication, 16 as oral and 14 as poster presentations. Both oral and poster papers were allotted the same number of pages in the conference proceedings. Furthermore, we were delighted that four leading experts from the fields of computer vision and energy minimization, namely, Andrew Blake (Microsoft Research), Emmanuel Candes (Stanford University), Alan Yuille (UCLA), and Vladimir Kolmogorov (IST Austria), agreed to further enrich the conference with inspiring keynote lectures.
We would like to express our gratitude to those who made this event possible and contributed to its success.
In particular, our Program Committee of top international experts in the field provided excellent reviews. A major donation from Microsoft Research and a financial contribution from Yandex covered a significant part of the conference expenses. We are grateful to Andrew Delong, Lena Gorelick, and a grant from the University of Western Ontario for covering the conference’s printing needs. Anna Medvedeva provided very helpful local administrative support. It is our belief that this conference will help to advance the field of energy minimization methods and to further establish the mathematical foundations of computer vision. July 2011
Yuri Boykov Fredrik Kahl Victor Lempitsky Frank R. Schmidt
Table of Contents

Discrete Optimization

A Distributed Mincut/Maxflow Algorithm Combining Path Augmentation and Push-Relabel
    Alexander Shekhovtsov and Václav Hlaváč
Minimizing Count-Based High Order Terms in Markov Random Fields
    Thomas Schoenemann
Globally Optimal Image Partitioning by Multicuts
    Jörg Hendrik Kappes, Markus Speth, Björn Andres, Gerhard Reinelt, and Christoph Schnörr
A Fast Solver for Truncated-Convex Priors: Quantized-Convex Split Moves
    Anna Jezierska, Hugues Talbot, Olga Veksler, and Daniel Wesierski

Continuous Optimization

Temporally Consistent Gradient Domain Video Editing
    Gabriele Facciolo, Rida Sadek, Aurélie Bugeau, and Vicent Caselles
Texture Segmentation via Non-local Non-parametric Active Contours
    Miyoun Jung, Gabriel Peyré, and Laurent D. Cohen
Evaluation of a First-Order Primal-Dual Algorithm for MRF Energy Minimization
    Stefan Schmidt, Bogdan Savchynskyy, Jörg Hendrik Kappes, and Christoph Schnörr
Global Relabeling for Continuous Optimization in Binary Image Segmentation
    Markus Unger, Thomas Pock, and Horst Bischof
Stop Condition for Subgradient Minimization in Dual Relaxed (max,+) Problem
    Michail Schlesinger, Evgeniy Vodolazskiy, and Nikolai Lopatka

Segmentation

Optimality Bounds for a Variational Relaxation of the Image Partitioning Problem
    Jan Lellmann, Frank Lenzen, and Christoph Schnörr
Interactive Segmentation with Super-Labels
    Andrew Delong, Lena Gorelick, Frank R. Schmidt, Olga Veksler, and Yuri Boykov
Curvature Regularity for Multi-label Problems - Standard and Customized Linear Programming
    Thomas Schoenemann, Yubin Kuang, and Fredrik Kahl
Space-Varying Color Distributions for Interactive Multiregion Segmentation: Discrete versus Continuous Approaches
    Claudia Nieuwenhuis, Eno Töppe, and Daniel Cremers
Detachable Object Detection with Efficient Model Selection
    Alper Ayvaci and Stefano Soatto
Curvature Regularization for Curves and Surfaces in a Global Optimization Framework
    Petter Strandmark and Fredrik Kahl
SlimCuts: GraphCuts for High Resolution Images Using Graph Reduction
    Björn Scheuermann and Bodo Rosenhahn
Discrete Optimization of the Multiphase Piecewise Constant Mumford-Shah Functional
    Noha El-Zehiry and Leo Grady
Image Segmentation with a Shape Prior Based on Simplified Skeleton
    Boris Yangel and Dmitry Vetrov
High Resolution Segmentation of Neuronal Tissues from Low Depth-Resolution EM Imagery
    Daniel Glasner, Tao Hu, Juan Nunez-Iglesias, Lou Scheffer, Shan Xu, Harald Hess, Richard Fetter, Dmitri Chklovskii, and Ronen Basri

Motion and Video

Optical Flow Guided TV-L1 Video Interpolation and Restoration
    Manuel Werlberger, Thomas Pock, Markus Unger, and Horst Bischof
Data-Driven Importance Distributions for Articulated Tracking
    Søren Hauberg and Kim Steenstrup Pedersen
Robust Trajectory-Space TV-L1 Optical Flow for Non-rigid Sequences
    Ravi Garg, Anastasios Roussos, and Lourdes Agapito
Intermediate Flow Field Filtering in Energy Based Optic Flow Computations
    Laurent Hoeltgen, Simon Setzer, and Michael Breuß
TV-L1 Optical Flow for Vector Valued Images
    Lars Lau Rakêt, Lars Roholm, Mads Nielsen, and François Lauze
Using the Higher Order Singular Value Decomposition for Video Denoising
    Ajit Rajwade, Anand Rangarajan, and Arunava Banerjee

Learning

Optimization of Robust Loss Functions for Weakly-Labeled Image Taxonomies: An ImageNet Case Study
    Julian J. McAuley, Arnau Ramisa, and Tibério S. Caetano
Multiple-Instance Learning with Structured Bag Models
    Jonathan Warrell and Philip H.S. Torr
Branch and Bound Strategies for Non-maximal Suppression in Object Detection
    Matthew B. Blaschko

Shape Analysis

Metrics, Connections, and Correspondence: The Setting for Groupwise Shape Analysis
    Carole Twining, Stephen Marsland, and Chris Taylor
The Complex Wave Representation of Distance Transforms
    Karthik S. Gurumoorthy, Anand Rangarajan, and Arunava Banerjee

Author Index
A Distributed Mincut/Maxflow Algorithm Combining Path Augmentation and Push-Relabel

Alexander Shekhovtsov and Václav Hlaváč
Czech Technical University in Prague
{shekhole,hlavac}@fel.cvut.cz

Abstract. We present a novel distributed algorithm for the minimum s-t cut problem, suitable for solving large sparse instances. Assuming vertices of the graph are partitioned into several regions, the algorithm performs path augmentations inside the regions and updates of the push-relabel style between the regions. The interaction between regions is considered expensive (regions are loaded into the memory one-by-one or located on separate machines in a network). The algorithm works in sweeps, which are passes over all regions. Let $B$ be the set of vertices incident to inter-region edges of the graph. We present sequential and parallel versions of the algorithm which terminate in at most $2|B|^2 + 1$ sweeps. The competing algorithm by Delong and Boykov uses push-relabel updates inside regions. In the case of a fixed partition we prove that this algorithm has a tight $O(n^2)$ bound on the number of sweeps, where $n$ is the number of vertices. We tested sequential versions of the algorithms on instances of maxflow problems in computer vision. Experimentally, the number of sweeps required by the new algorithm is much lower than for the variant of Delong and Boykov. Large problems (up to $10^8$ vertices and $6 \cdot 10^8$ edges) are solved using under 1GB of memory in about 10 sweeps.

Keywords: mincut, maxflow, distributed, parallel, large-scale, streaming, augmented path, push-relabel, region.
1 Introduction
Minimum s-t cut (mincut) is a classical combinatorial problem with applications in many areas of science and engineering. This research¹ was motivated by the wide use of mincut/maxflow in computer vision, where large sparse instances need to be solved. To deal efficiently with the large scale we consider distributed algorithms, dividing the computation and the data between computation units and assuming that passing information from one unit to another is expensive. We consider the following two practical usage modes:
• Sequential (or streaming) mode, which uses a single computer with limited memory and disk storage, reading, processing, and writing back one part of the data at a time. Since it is easier for analysis and implementation, this mode will be the main focus of this work.
• Parallel mode, in which the units are e.g. computers in a network. We show that the algorithm we propose admits full parallelization. The theoretical analysis is derived from the sequential variant. Details and preliminary experiments on a single computer with several CPUs are presented in the technical report [1].
To represent the cost of information exchange between the units, we use a special related measure of complexity. We call a sweep the event when all units of a distributed algorithm recalculate their data once. The number of sweeps is roughly proportional to the amount of communication in the parallel mode or disk operations in the streaming mode.

¹ A. Shekhovtsov was supported by the EU project FP7-ICT-247870 NIFTi and V. Hlaváč by the EU project FP7-ICT-247525 HUMAVIPS.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 1–16, 2011. © Springer-Verlag Berlin Heidelberg 2011

Previous Work. A variant of the path augmentation algorithm was shown in [2] to have the best performance on computer vision problems among sequential solvers. Several proposals have been made to parallelize it. The partially distributed implementation [3] augments paths within disjoint regions first and then merges regions hierarchically. In the end, it still requires finding augmenting paths in the whole problem. A distributed algorithm was obtained in [4] using the dual decomposition approach. The subproblems are mincut instances on the parts of the graph (regions) and the master problem is solved using the subgradient method. This approach requires solving mincut subproblems with real-valued capacities (rather than integer ones) and does not have a polynomial iteration bound. The push-relabel algorithm [5] performs many local atomic operations, which makes it a good choice for a parallel or distributed implementation. A distributed version [6] runs in $O(n^2)$ time using $O(n)$ processors and $O(n^2\sqrt{m})$ messages.
Delong and Boykov [7] proposed a coarser granulation, associating a subset of vertices (a region) with each processor. Push and relabel operations inside a region are decoupled from the rest of the graph. This allows processing several non-interacting regions in parallel, or running in limited memory by processing one region at a time. For the case of a fixed partition we prove that the sequential and our novel parallel versions of their algorithm have a tight $O(n^2)$ bound on the number of sweeps. We then construct a new algorithm, which works with the same partition of the data but is guided by a different distance function than push-relabel.

The New Algorithm. Given a fixed partition into regions, we introduce a distance function which counts the number of region boundaries crossed by a path to the sink. Intuitively, it corresponds to the amount of costly operations: network communications or loads-unloads of the regions in the streaming mode. The algorithm maintains a labeling, which is a lower bound on the distance function. Within a region, we first augment paths to the sink and then paths to the boundary nodes of the region in the order of their increasing labels. Thus the flow is pushed out of the region in the direction given by the distance estimate. We present sequential and parallel versions of the algorithm which terminate in at most $2|B|^2 + 1$ sweeps, where $B$ is the set of all boundary nodes (incident to inter-region edges).

Other Related Work. The following works do not consider a distributed implementation but are relevant to our design. The Partial Augment-Relabel algorithm (PAR) [8] augments a path of length $k$ in each step. It may be viewed as a lazy variant of push-relabel, where actual pushes are delayed until it is known that a sequence of $k$ pushes can be executed. The algorithm of [9] incorporates the notion of a length function and a valid labeling w.r.t. this length. It can be seen that the labeling maintained by our algorithm corresponds to the length function assigning 1 to boundary edges and 0 to intra-region edges. In [9] this generalized labeling is used in the context of the blocking flow algorithm but not within push-relabel.
2 Mincut and Push-Relabel
We will solve the mincut problem by finding a maximum preflow². In this section, we give basic definitions and introduce the push-relabel framework [5]. By a network we call the tuple $G = (V, E, s, t, c, e)$, where $V$ is a set of vertices; $E \subset V \times V$, thus $(V, E)$ is a directed graph; $s, t \in V$, $s \neq t$, are source and sink, respectively; $c : E \to \mathbb{N}_0$ is a capacity function; and $e : V\setminus\{s, t\} \to \mathbb{N}_0$ is an excess function. Excess can be equivalently represented as additional edges from the source, but we prefer this explicit form. For convenience we let $e(s) = \infty$ and $e(t) = 0$. We also denote $n = |V|$ and $m = |E|$. For $X, Y \subset V$ we will denote $(X, Y) = E \cap (X \times Y)$. For $C \subset V$ such that $s \in C$, $t \notin C$, the set of edges $(C, \bar{C})$, with $\bar{C} = V\setminus C$, is called an s-t cut. The mincut problem is
$$\min\Big\{ \sum_{(u,v)\in(C,\bar{C})} c(u,v) + \sum_{v\in\bar{C}} e(v) \;\Big|\; C \subset V,\ s \in C,\ t \in \bar{C} \Big\}. \tag{1}$$
The objective is called the cost of the cut. Without loss of generality, we assume that $E$ is symmetric: if not, the missing edges are added and assigned zero capacity. A preflow in $G$ is a function $f : E \to \mathbb{Z}$ satisfying the following constraints:
$$f(u,v) \le c(u,v) \quad \forall (u,v) \in E \quad \text{(capacity constraint)}, \tag{2a}$$
$$f(u,v) = -f(v,u) \quad \forall (u,v) \in E \quad \text{(antisymmetry)}, \tag{2b}$$
$$e(v) + \sum_{u \,|\, (u,v)\in E} f(u,v) \ge 0 \quad \forall v \in V \quad \text{(preflow constraint)}. \tag{2c}$$

² A maximum preflow can be completed to a maximum flow using flow decomposition, in $O(m \log m)$ time. Because we are primarily interested in the minimum cut, we do not consider this step or whether it can be distributed.
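The cut cost (1) and the preflow constraints (2a)-(2c) can be checked mechanically on a small instance. The following is a minimal sketch, assuming capacities and flows are stored as dictionaries keyed by directed edges and that the excess is given only for vertices other than $s$ and $t$; the function names and the toy network are illustrative and not part of the paper.

```python
def cut_cost(C, cap, excess):
    """Cost of the cut (C, V \\ C) from Eq. (1)."""
    edge_part = sum(c for (u, v), c in cap.items() if u in C and v not in C)
    excess_part = sum(x for v, x in excess.items() if v not in C)
    return edge_part + excess_part


def is_preflow(f, cap, excess, s):
    """Check constraints (2a)-(2c); e(s) is infinite, so s is skipped in (2c)."""
    for (u, v), c in cap.items():
        if f[(u, v)] > c:              # (2a) capacity constraint
            return False
        if f[(u, v)] != -f[(v, u)]:    # (2b) antisymmetry
            return False
    inflow = {}
    for (u, v), x in f.items():
        inflow[v] = inflow.get(v, 0) + x
    # (2c): excess plus net inflow stays non-negative (e(t) = 0 by convention)
    return all(excess.get(v, 0) + x >= 0 for v, x in inflow.items() if v != s)


# Toy network (assumed for illustration): excess 3 at a,
# edges a->b (2), b->t (3), a->t (1), reverse edges with zero capacity.
cap = {("a", "b"): 2, ("b", "a"): 0, ("b", "t"): 3,
       ("t", "b"): 0, ("a", "t"): 1, ("t", "a"): 0}
excess = {"a": 3, "b": 0}
f = {("a", "b"): 2, ("b", "a"): -2, ("b", "t"): 2,
     ("t", "b"): -2, ("a", "t"): 1, ("t", "a"): -1}
print(cut_cost({"s", "a"}, cap, excess), is_preflow(f, cap, excess, "s"))
```

In this toy instance the cheapest cuts have cost 3, and the preflow `f` delivers 3 units to the sink, which matches the duality discussed in the text.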
A residual network w.r.t. preflow $f$ is a network $G_f = (V, E, s, t, c_f, e_f)$ with the capacity and excess functions given by
$$c_f = c - f, \tag{3a}$$
$$e_f(v) = e(v) + \sum_{u \,|\, (u,v)\in E} f(u,v), \quad \forall v \in V\setminus\{t\}. \tag{3b}$$
By constraints (2) it is $c_f \ge 0$ and $e_f \ge 0$. The costs of all s-t cuts differ in $G$ and $G_f$ by a constant called the flow value, $|f| = \sum_{u \,|\, (u,t)\in E} f(u,t)$. Network $G_f$ is thus up to a constant equivalent to network $G$ and $|f|$ is a trivial lower bound on the cost of a cut. Dual to mincut is the problem of maximizing this lower bound, i.e. finding a maximum preflow:
$$\max_f |f| \quad \text{s.t. constraints (2)}. \tag{4}$$
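The claim that cut costs in $G$ and $G_f$ differ by the constant $|f|$ can be verified numerically. The sketch below assumes the same dictionary representation as a plain illustration (edge-keyed capacities and flows, excess stored for vertices other than $s$ and $t$); all names and the toy data are hypothetical.

```python
def residual(f, cap, excess, t):
    """Residual capacities (3a), residual excess (3b) and the flow value |f|."""
    c_f = {edge: cap[edge] - f[edge] for edge in cap}
    e_f = dict(excess)                      # vertices other than s and t
    for (u, v), x in f.items():
        if v in e_f:
            e_f[v] += x
    value = sum(x for (u, v), x in f.items() if v == t)
    return c_f, e_f, value


def cut_cost(C, cap, excess):
    """Cost of the cut (C, V \\ C) as in Eq. (1)."""
    return (sum(c for (u, v), c in cap.items() if u in C and v not in C)
            + sum(x for v, x in excess.items() if v not in C))


# Toy data (assumed): excess 3 at a, edges a->b (2), b->t (3), a->t (1).
cap = {("a", "b"): 2, ("b", "a"): 0, ("b", "t"): 3,
       ("t", "b"): 0, ("a", "t"): 1, ("t", "a"): 0}
excess = {"a": 3, "b": 0}
f = {("a", "b"): 2, ("b", "a"): -2, ("b", "t"): 2,
     ("t", "b"): -2, ("a", "t"): 1, ("t", "a"): -1}

c_f, e_f, value = residual(f, cap, excess, "t")
for C in ({"s"}, {"s", "a"}, {"s", "b"}, {"s", "a", "b"}):
    # every cut loses exactly |f| when moving from G to G_f
    print(sorted(C), cut_cost(C, cap, excess), cut_cost(C, c_f, e_f) + value)
```

Each printed row shows the same number twice, once computed in $G$ and once as the residual cost plus $|f|$.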
We say that $w \in V$ is reachable from $v \in V$ in network $G$ if there is a path (possibly of length 0) from $v$ to $w$ composed of edges with strictly positive capacities. This relation is denoted by $v \to w$. If $w$ is not reachable from $v$ we write $v \nrightarrow w$. For any $X, Y \subset V$, we write $X \to Y$ if there exist $x \in X$, $y \in Y$ such that $x \to y$. Otherwise we write $X \nrightarrow Y$. A preflow $f$ is maximum iff $\{v \,|\, e(v) > 0\} \nrightarrow t$ in $G_f$. In that case the cut $(\bar{T}, T)$ with $T = \{v \in V \,|\, v \to t \text{ in } G_f\}$ has value 0 in $G_f$. Because all cuts are non-negative, it is a minimum cut.
A distance function $d^* : V \to \mathbb{N}_0$ in $G$ assigns to $v \in V$ the length of the shortest path from $v$ to $t$, or $n$ if no such path exists. A shortest path cannot have loops, thus its length is not greater than $n - 1$. Let us denote $d^\infty = n$. A labeling $d : V \to \{0, \dots, d^\infty\}$ is valid in $G$ if $d(t) = 0$ and $d(u) \le d(v) + 1$ for all $(u, v) \in E$ such that $c(u, v) > 0$. Any valid labeling is a lower bound on the distance $d^*$ in $G$. Not every lower bound is a valid labeling. A vertex $v$ is called active w.r.t. $(f, d)$ if $e_f(v) > 0$ and $d(v) < d^\infty$. All algorithms in this paper will use the following common initialization.

Procedure Init
    $f$ := preflow saturating all $(\{s\}, V)$ edges; $G := G_f$; $f := 0$;
    $d := 0$, $d(s) := d^\infty$;
The generic push-relabel algorithm [5] starts with Init and applies the following Push and Relabel operations while possible:
• Push$(u, v)$ is applicable if $u$ is active and $c_f(u, v) > 0$ and $d(u) = d(v) + 1$. The operation increases $f(u, v)$ by $\Delta$ and decreases $f(v, u)$ by $\Delta$, where $\Delta = \min(e_f(u), c_f(u, v))$.
• Relabel$(u)$ is applicable if $u$ is active and $d(u) \le d(v)$ for all $v$ such that $(u, v) \in E$, $c_f(u, v) > 0$. It sets $d(u) := \min\big(d^\infty, \min\{d(v) + 1 \,|\, (u, v) \in E,\ c_f(u, v) > 0\}\big)$.
If $u$ is active then either a Push or a Relabel operation is applicable to $u$. The algorithm preserves validity of the labeling and stops when there are no active nodes. Then for any $u$ such that $e_f(u) > 0$, we have $d(u) = d^\infty$ and therefore $d^*(u) = d^\infty$ and $u \nrightarrow t$ in $G_f$, so $f$ is a maximum preflow.
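The Push and Relabel operations described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the network is already in the form produced by Init (source edges converted into excess, so the source does not appear in `vertices`), that the edge set is symmetric, and that active vertices are processed in an arbitrary stack order. The flow value is accumulated as excess at the sink.

```python
def push_relabel(vertices, cap, excess, t):
    """Generic push-relabel computing a maximum preflow (illustrative sketch)."""
    d_inf = len(vertices) + 1          # n = |V|, counting the omitted source
    c_f = dict(cap)                    # residual capacities
    e_f = {v: excess.get(v, 0) for v in vertices}
    d = {v: 0 for v in vertices}
    out = {v: [] for v in vertices}
    for (u, v) in cap:
        out[u].append(v)

    def active(v):
        return v != t and e_f[v] > 0 and d[v] < d_inf

    work = [v for v in vertices if active(v)]
    while work:
        u = work.pop()
        while active(u):
            pushed = False
            for v in out[u]:
                if e_f[u] == 0:
                    break
                if c_f[(u, v)] > 0 and d[u] == d[v] + 1:   # Push(u, v)
                    delta = min(e_f[u], c_f[(u, v)])
                    c_f[(u, v)] -= delta
                    c_f[(v, u)] += delta
                    e_f[u] -= delta
                    e_f[v] += delta
                    work.append(v)
                    pushed = True
            if not pushed:                                  # Relabel(u)
                d[u] = min([d_inf] + [d[v] + 1 for v in out[u] if c_f[(u, v)] > 0])
    return c_f, e_f, d


# Toy network (assumed): excess 5 at a, edges a->b (2), b->t (3), a->t (1).
cap = {("a", "b"): 2, ("b", "a"): 0, ("b", "t"): 3,
       ("t", "b"): 0, ("a", "t"): 1, ("t", "a"): 0}
c_f, e_f, d = push_relabel(["a", "b", "t"], cap, {"a": 5}, "t")
print(e_f, d)
```

In this example only 3 of the 5 units of excess can reach the sink; the remaining 2 units stay at `a`, whose label is raised to $d^\infty$, which is the situation described in the termination argument above.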
3 Region Discharge Revisited
We now review the approach of Delong and Boykov [7] and reformulate it for the case of a fixed graph partition. We then describe generic sequential and parallel algorithms which can be applied with both push-relabel and augmenting path approaches. Delong and Boykov [7] introduce the following operation. The discharge of a region $R \subset V\setminus\{s, t\}$ applies Push and Relabel to $v \in R$ until there are no active vertices left in $R$. This localizes computations to $R$ and its boundary, defined as
$$B^R = \{w \,|\, \exists u \in R\ (u, w) \in E,\ w \notin R,\ w \neq s, t\}. \tag{5}$$
When a Push is applied to an edge $(v, w) \in (R, B^R)$, the flow is sent out of the region. We say that two regions $R_1, R_2 \subset V\setminus\{s, t\}$ interact if $(R_1, R_2) \neq \emptyset$. Discharges of non-interacting regions can be performed in parallel since the computations in them do not share data. The algorithm proposed in [7] repeats the following steps until there are no active vertices in $V$:
1. Select several non-interacting regions containing active vertices.
2. Discharge the selected regions in parallel, applying region-gap and region-relabel heuristics³.
3. Apply the global gap heuristic.

³ All heuristics (global-gap, region-gap, region-relabel) serve to improve the distance estimate. Details are given in [10,7,1]. They are very important in practice, but do not affect theoretical properties.

While the regions in [7] are selected dynamically in each iteration, trying to divide the work evenly between CPUs and to cover most of the active nodes, we restrict ourselves to a fixed collection of regions $(R_k)_{k=1}^K$ forming a partition of $V\setminus\{s, t\}$ and let each region discharge work on its own separate subnetwork. We define a region network $G^R = (V^R, E^R, s, t, c^R, e^R)$, where $V^R = R \cup B^R \cup \{s, t\}$; $E^R = (R \cup \{s, t\}, R \cup \{s, t\}) \cup (R, B^R) \cup (B^R, R)$; $c^R(u, v) = c(u, v)$ if $(u, v) \in E^R\setminus(B^R, R)$ and 0 otherwise; $e^R = e|_{R\cup\{s,t\}}$ (the restriction of function $e$ to its subdomain $R \cup \{s, t\}$). This network is illustrated in Fig. 1(a). Note that the capacities of edges coming from the boundary, $(B^R, R)$, are set to zero. Indeed, these edges belong to a neighboring region network. The region discharge operation of [7], which we refer to as Push-relabel Region Discharge (PRD), can now be defined as follows.

Procedure $(f, d)$ = PRD($G^R$, $d$)
    /* assume $d : V^R \to \{0, \dots, d^\infty\}$ valid in $G^R$ */
    while $\exists v \in R$ active do
        apply Push or Relabel to $v$; /* changes $f$ and $d$ */
        apply region gap heuristic (see [7], [1, sec. 5]); /* optional */
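The boundary (5) and the region network $G^R$ defined above can be built directly from the edge-keyed capacity map. The sketch below is an illustration under assumptions: the function name, the dictionary layout, and the toy chain network are hypothetical, and the edge set is taken to be symmetric as in Section 2.

```python
def region_network(R, cap, excess, s, t):
    """Build the boundary B^R and the capacities/excess of the region network G^R."""
    R = set(R)
    # Eq. (5): vertices outside R (other than s, t) adjacent to R
    boundary = {w for (u, w) in cap if u in R and w not in R and w not in (s, t)}
    inner = R | {s, t}
    cap_R = {}
    for (u, v), c in cap.items():
        if u in inner and v in inner:
            cap_R[(u, v)] = c          # edges inside R and to/from the terminals
        elif u in R and v in boundary:
            cap_R[(u, v)] = c          # outgoing boundary edges keep their capacity
        elif u in boundary and v in R:
            cap_R[(u, v)] = 0          # incoming boundary edges are set to zero
    excess_R = {v: excess[v] for v in R if v in excess}
    return boundary, cap_R, excess_R


# Toy chain a -> b -> c -> t with regions {a, b} and {c} (assumed for illustration).
cap = {("a", "b"): 2, ("b", "a"): 0, ("b", "c"): 2,
       ("c", "b"): 0, ("c", "t"): 2, ("t", "c"): 0}
excess = {"a": 3, "b": 0, "c": 0}
print(region_network({"a", "b"}, cap, excess, "s", "t"))
print(region_network({"c"}, cap, excess, "s", "t"))
```

Note that in the network of region `{c}` the edge `(b, c)` appears with capacity 0 although its capacity in `G` is 2: that capacity belongs to the network of the neighboring region.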
Generic Region Discharge Algorithms. We give a sequential and a parallel algorithm in Alg. 1 and Alg. 2, respectively. The latter allows discharging interacting regions in parallel, resolving conflicts in the flow similarly to the asynchronous parallel push-relabel [5]. These two algorithms are generic, taking a black-box Discharge function. In the case Discharge is PRD, the sequential and parallel algorithms implement the push-relabel approach and will be referred to as S-PRD and P-PRD, respectively. S-PRD is a sequential variant of [7] and P-PRD is a novel variant, based on results of [5] and [7].

Algorithm 1. Sequential Region Discharge
    Init;
    while there are active vertices do /* a sweep */
        for $k = 1, \dots, K$ do
            Construct $G^{R_k}$ from $G$;
            $(f', d')$ := Discharge($G^{R_k}$, $d|_{V^{R_k}}$);
            $G := G_{f'}$; /* apply $f'$ to $G$ */
            $d|_{R_k} := d'|_{R_k}$; /* update labels */
        apply global gap heuristic (see [10], [1, sec. 5]); /* optional */
Algorithm 2. Parallel Region Discharge
    Init;
    while there are active vertices do /* a sweep */
        $(f'_k, d'_k)$ := Discharge($G^{R_k}$, $d|_{V^{R_k}}$) $\forall k$; /* in parallel */
        $d'|_{R_k} := d'_k|_{R_k}$ $\forall k$; /* fuse labels */
        $\alpha(u, v) := [\![ d'(u) \le d'(v) + 1 ]\!]$ $\forall (u, v) \in (B, B)$; /* valid pairs */
        /* fuse flow */
        $f'(u, v) := \alpha(v, u) f'_k(u, v) + \alpha(u, v) f'_j(u, v)$ if $(u, v) \in (R_k, R_j)$,
        $f'(u, v) := f'_k(u, v)$ if $(u, v) \in (R_k, R_k)$;
        $G := G_{f'}$; /* apply $f'$ to $G$ */
        $d := d'$; /* update labels */
        global gap heuristic; /* optional */
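The flow fusion step of Algorithm 2 can be illustrated in isolation. The sketch below is one reading of the rule, stated as an assumption: a flow pushed by region $k$ along an inter-region edge $(u, v)$ is kept only if the reverse residual edge $(v, u)$ it creates is compatible with the fused labels, which is what the factor $\alpha(v, u)$ expresses. The function name, the data layout, and the example values are hypothetical, and edges to the terminals are left out.

```python
def fuse_flows(edges, region_of, flows, d_new):
    """Fuse per-region flows on edges between non-terminal vertices."""
    def alpha(u, v):
        # 1 if the pair (u, v) satisfies d'(u) <= d'(v) + 1, else 0
        return 1 if d_new[u] <= d_new[v] + 1 else 0

    fused = {}
    for (u, v) in edges:
        k, j = region_of[u], region_of[v]
        if k == j:
            fused[(u, v)] = flows[k][(u, v)]            # intra-region edge
        else:
            fused[(u, v)] = (alpha(v, u) * flows[k][(u, v)]
                             + alpha(u, v) * flows[j][(u, v)])
    return fused


# Region 0 pushed 2 units u -> v, region 1 pushed 1 unit v -> u (assumed example).
edges = [("u", "v"), ("v", "u")]
region_of = {"u": 0, "v": 1}
flows = {0: {("u", "v"): 2, ("v", "u"): -2},
         1: {("u", "v"): -1, ("v", "u"): 1}}
print(fuse_flows(edges, region_of, flows, {"u": 3, "v": 1}))
print(fuse_flows(edges, region_of, flows, {"u": 1, "v": 1}))
```

With labels 3 and 1, the push from `v` to `u` is discarded because it goes from the lower label to the much higher one; with equal labels both pushes are kept and partially cancel. In both cases the fused flow remains antisymmetric.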
We prove in [1] that both S-PRD and P-PRD terminate with a valid labeling in at most $2n^2$ sweeps. Parallel variants of push-relabel [11] have the same bound on the number of sweeps, so the mentioned result is not very surprising. On the other hand, the analysis in [1] allows for more general Discharge functions. We also show in [1] an example which takes $O(n^2)$ sweeps to terminate for a partition into two regions interacting over two edges. Hence the bound is tight.
4 Augmented Path Region Discharge
We will now use the same setup of the problem distribution, but replace the discharge operation and the labeling function. Because this is our main contribution, it is presented in full detail.

4.1 New Distance Function
Let the boundary w.r.t. partition $(R_k)_{k=1}^K$ be the set $B = \bigcup_k B^{R_k}$. The region distance $d^*_B(u)$ in $G$ is the minimal number of inter-region edges contained in a path from $u$ to $t$, or $|B|$ if no such path exists:
$$d^*_B(u) = \begin{cases} \min\limits_{P = ((u,u_1),\dots,(u_r,t))} |P \cap (B, B)| & \text{if } u \to t, \\ |B| & \text{if } u \nrightarrow t. \end{cases} \tag{6}$$
This distance corresponds well to the number of region discharge operations required to transfer the excess to the sink.

Statement 1. If $u \to t$ then $d^*_B(u) < |B|$.
Proof. Let $P$ be a path from $u$ to $t$ given as a sequence of edges. If $P$ contains a loop then it can be removed from $P$ and $|P \cap (B, B)|$ will not increase. A path without loops goes through each vertex at most once. For $B \subset V$ there are at most $|B| - 1$ edges in the path which have both endpoints in $B$.

We now let $d^\infty = |B|$ and redefine a valid labeling w.r.t. the new distance. A labeling $d : V \to \{0, \dots, d^\infty\}$ is valid in $G$ if $d(t) = 0$ and for all $(u, v) \in E$ such that $c(u, v) > 0$:
$$d(u) \le d(v) + 1 \quad \text{if } (u, v) \in (B, B), \tag{7}$$
$$d(u) \le d(v) \quad \text{if } (u, v) \notin (B, B). \tag{8}$$
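The region distance (6) is a shortest-path distance with edge lengths 0 and 1, so it can be computed by a 0-1 breadth-first search backwards from the sink. The sketch below is illustrative: it follows Eq. (6) literally by giving length 1 to edges with both endpoints in $B$, it assumes `region_of` maps every non-terminal vertex to its region index, and the function names and toy network are hypothetical. The second function checks conditions (7) and (8).

```python
from collections import deque


def region_distance(vertices, cap, region_of, t):
    """Return the boundary set B and the region distance d*_B of Eq. (6)."""
    B = set()
    for (u, v) in cap:
        if u in region_of and v in region_of and region_of[u] != region_of[v]:
            B.update((u, v))                  # endpoints of inter-region edges
    rev = {v: [] for v in vertices}
    for (u, v), c in cap.items():
        if c > 0:
            rev[v].append((u, 1 if (u in B and v in B) else 0))
    dist = {v: len(B) for v in vertices}      # |B| marks "t is not reachable"
    dist[t] = 0
    queue = deque([t])
    while queue:                              # 0-1 BFS on reversed edges
        v = queue.popleft()
        for u, w in rev[v]:
            if dist[v] + w < dist[u]:
                dist[u] = dist[v] + w
                if w == 0:
                    queue.appendleft(u)
                else:
                    queue.append(u)
    return B, dist


def is_valid_labeling(d, cap, B, t):
    """Check d(t) = 0 and conditions (7), (8) on edges with positive capacity."""
    if d[t] != 0:
        return False
    return all(d[u] <= d[v] + (1 if (u in B and v in B) else 0)
               for (u, v), c in cap.items() if c > 0)


# Toy chain a -> b -> c -> t with regions {a, b} and {c, x}; x cannot reach t.
cap = {("a", "b"): 2, ("b", "a"): 0, ("b", "c"): 2, ("c", "b"): 0,
       ("c", "t"): 2, ("t", "c"): 0, ("c", "x"): 1, ("x", "c"): 0}
region_of = {"a": 0, "b": 0, "c": 1, "x": 1}
B, dist = region_distance(["a", "b", "c", "x", "t"], cap, region_of, "t")
print(sorted(B), dist, is_valid_labeling(dist, cap, B, "t"))
```

In this example the distance itself passes the validity check, and so does the all-zero labeling, which is the trivial lower bound used at initialization.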
Statement 2. A valid labeling $d$ is a lower bound on $d^*_B$.
Proof. If $u \nrightarrow t$ then $d(u) \le d^\infty = |B| = d^*_B(u)$. Otherwise, let $P = ((u, v_1), \dots, (v_l, t))$ be a shortest path w.r.t. $d^*_B$, i.e. $d^*_B(u) = |P \cap (B, B)|$. Applying the validity property to each edge in this path, we have $d(u) \le d(t) + |P \cap (B, B)| = d^*_B(u)$.

4.2 New Region Discharge
In this subsection, reachability relations "$\to$", "$\nrightarrow$", residual paths, and labeling validity will be understood in the region network $G^R$ or its residual $G^R_f$.
The new Discharge operation, called Augmented Path Region Discharge (ARD), works as follows. It first pushes excess to the sink along augmenting paths inside the network $G^R$. When this is no longer possible, it continues to augment paths to nodes in the region boundary, $B^R$, in the order of their increasing labels. This is represented by the sequence of nested sets $T_0 = \{t\}$, $T_1 = \{t\} \cup \{v \in B^R \,|\, d(v) = 0\}$, ..., $T_{d^\infty} = \{t\} \cup \{v \in B^R \,|\, d(v) < d^\infty\}$. Set $T_k$ is the destination of augmentations in stage $k$. As we prove below, in stage $k > 0$ residual paths may exist only to the set $T_k\setminus T_{k-1} = \{v \,|\, d(v) = k - 1\}$. Algorithms 1 and 2 with this new discharge operation will be referred to as S-ARD and P-ARD, respectively.

Procedure $(f, d)$ = ARD($G^R$, $d$)
    /* assume $d : V^R \to \{0, \dots, d^\infty\}$ valid in $G^R$ */
1   for $k = 0, 1, \dots, d^\infty$ do /* stage $k$ */
2       $T_k = \{t\} \cup \{v \in B^R \,|\, d(v) < k\}$
        /* Augment($R$, $T_k$) */
3       while $\exists$ a residual path $(v_0 \in R, \dots, v_l \in T_k)$, $e_f(v_0) > 0$ do
4           augment $\Delta = \min(e_f(v_0), \min_i c_f(v_{i-1}, v_i))$ along the path.
    /* Region-relabel */
5   $d(u) := \min\{k \,|\, u \to T_k\}$ if $u \in R$, $u \to T_{d^\infty}$;
    $d(u) := d^\infty$ if $u \in R$, $u \nrightarrow T_{d^\infty}$;
    $d(u) := d(u)$ if $u \in B^R$.
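The staged augmentation and the final region-relabel can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the region network is given in excess form with no source edges, uses plain breadth-first search for path finding, does not search past a destination vertex, and recomputes reachability from scratch for the relabel step. All names and the toy data are hypothetical.

```python
from collections import deque


def ard(R, boundary, cap_R, excess_R, d, t, d_inf):
    """Sketch of ARD on one region network given in excess form (no source edges)."""
    c_f = dict(cap_R)                              # residual capacities
    e_f = {v: excess_R.get(v, 0) for v in R}       # residual excess inside R
    sent = {w: 0 for w in boundary}                # flow delivered to each destination
    sent[t] = 0
    adj = {}
    for (u, v) in cap_R:
        adj.setdefault(u, []).append(v)

    def bfs(src, stop=frozenset()):
        """Residual BFS tree from src; vertices in `stop` are reached but not expanded."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if u in stop:
                continue
            for v in adj.get(u, ()):
                if v not in parent and c_f[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    for k in range(d_inf + 1):                     # stage k
        T_k = {t} | {w for w in boundary if d[w] < k}
        for v0 in R:                               # Augment(R, T_k)
            while e_f[v0] > 0:
                parent = bfs(v0, T_k)
                hits = [v for v in parent if v in T_k]
                if not hits:
                    break
                path, v = [], hits[0]
                while parent[v] is not None:       # walk back to v0
                    path.append((parent[v], v))
                    v = parent[v]
                delta = min(e_f[v0], min(c_f[edge] for edge in path))
                for (u, v) in path:
                    c_f[(u, v)] -= delta
                    c_f[(v, u)] += delta
                e_f[v0] -= delta
                sent[hits[0]] += delta

    d_new = dict(d)                                # boundary labels stay fixed
    for u in R:                                    # region-relabel (line 5)
        seen = bfs(u)
        if t in seen:
            d_new[u] = 0
        else:
            labels = [d[w] + 1 for w in boundary if w in seen and d[w] < d_inf]
            d_new[u] = min(labels, default=d_inf)
    return c_f, e_f, d_new, sent


# Toy region {a, b} with boundary vertex c (assumed): excess 4 at a.
cap_R = {("a", "b"): 2, ("b", "a"): 0, ("b", "c"): 2,
         ("c", "b"): 0, ("a", "t"): 1, ("t", "a"): 0}
result = ard(["a", "b"], {"c"}, cap_R, {"a": 4}, {"a": 0, "b": 0, "c": 0}, "t", 2)
print(result[1], result[2], result[3])
```

In the toy run, one unit goes to the sink in stage 0, two units go to the boundary vertex `c` in stage 1, and the remaining unit stays at `a`, which is relabeled to $d^\infty$ and therefore is no longer active.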
The labels on the boundary, $d|_{B^R}$, remain fixed during the algorithm, and the labels $d|_R$ inside the region do not participate in augmentations and therefore are updated only in the end. We claim that ARD terminates with no active nodes inside the region, preserves validity and monotonicity of the labeling, and pushes flow from higher labels to lower labels w.r.t. the new labeling. These properties will be required to prove finite termination and correctness of S-ARD. Before we prove them (Statement 6) we need the following intermediate results:
• Properties of the network $G^R_f$ maintained by the algorithm (Statement 3, Corollaries 1 and 2).
• Properties of valid labelings in the network $G^R_f$ (Statement 4).
• Properties of the labeling constructed by region-relabel (line 5 of ARD) in the network $G^R_f$ (Statement 5).

Lemma 1. Let $X, Y \subset V^R$, $X \cap Y = \emptyset$, $X \nrightarrow Y$. Then $X \nrightarrow Y$ is preserved after i) augmenting a path $(x, \dots, v)$ with $x \in X$ and $v \in V^R$; ii) augmenting a path $(v, \dots, y)$ with $y \in Y$ and $v \in V^R$.
Proof. Let $X'$ be the set of vertices reachable from $X$. Let $Y'$ be the set of vertices from which $Y$ is reachable. Clearly $X' \cap Y' = \emptyset$, otherwise $X \to Y$. We have that $(X', \bar{X}')$ is a cut separating $X$ and $Y$ and having all edge capacities zero. Any residual path starting in $X$ or ending in $Y$ cannot cross the cut, and its augmentation does not change the edges of the cut. Hence, $X$ and $Y$ will stay separated.

Statement 3. Let $v \in V^R$ and $v \nrightarrow T_a$ in $G_f$ in the beginning of stage $k_0$, where $a, k_0 \in \{0, 1, \dots, d^\infty\}$. Then $v \nrightarrow T_a$ holds until the end of the algorithm.
Proof. We need to show that $v \nrightarrow T_a$ is not affected by augmentations performed by the algorithm. If $k_0 \le a$, we first prove that $v \nrightarrow T_a$ holds during stages $k_0 \le k \le a$. Consider augmentation of a path $(u_0, u_1, \dots, u_l)$, $u_0 \in R$, $u_l \in T_k \subset T_a$, $e_f(u_0) > 0$. Assume $v \nrightarrow T_a$ before augmentation. By Lemma 1 with $X = \{v\}$, $Y = T_a$ (noting that $X \nrightarrow Y$ and the augmenting path ends in $Y$), after the augmentation $v \nrightarrow T_a$. By induction, it holds till the end of stage $a$ and hence in the beginning of stage $a + 1$. We can assume now that $k_0 > a$. Let $A = \{u \in R \,|\, e_f(u) > 0\}$. At the end of stage $k_0 - 1$ we have $A \nrightarrow T_{k_0 - 1} \supset T_a$ by construction. Consider augmentation in stage $k_0$ on a path $(u_0, u_1, \dots, u_l)$, $u_0 \in R$, $u_l \in T_{k_0}$, $e_f(u_0) > 0$. By construction, $u_0 \in A$. Assume $\{v\} \cup A \nrightarrow T_a$ before augmentation. Apply Lemma 1 with $X = \{v\} \cup A$, $Y = T_a$ (we have $X \nrightarrow Y$ and $u_0 \in A \subset X$). After augmentation it is $X \nrightarrow T_a$. By induction, $X \nrightarrow T_a$ till the end of stage $k_0$. By induction on stages, $v \nrightarrow T_a$ until the end of the algorithm.

Corollary 1. If $w \in B^R$ then $w \nrightarrow T_{d(w)}$ throughout the algorithm.
Proof. At initialization, it is fulfilled by construction of $G^R$ due to $c^R(B^R, R) = 0$. It holds then during the algorithm by Statement 3. In particular, we have $B^R \nrightarrow t$ during the algorithm.

Corollary 2. Let $(u, v_1, \dots, v_l, w)$ be a residual path in $G^R_f$ from $u \in R$ to $w \in B^R$ and let $v_r \in B^R$ for some $r$. Then $d(v_r) \le d(w)$.
Proof. We have vr ↛ T_{d(vr)} by Corollary 1. Suppose d(w) < d(vr); then w ∈ T_{d(vr)}, and because vr → w it is vr → T_{d(vr)}, which is a contradiction.

Statement 4. Let d be a valid labeling, d(u) ≥ 1, u ∈ R. Then u ↛ T_{d(u)−1}.

Proof. Suppose u → T_0. Then there exists a residual path (u, v1 . . . vl, t), vi ∈ R (by Corollary 1 it cannot happen that vi ∈ B^R). By validity of d we have d(u) ≤ d(v1) ≤ · · · ≤ d(vl) ≤ d(t) = 0, which is a contradiction. Suppose d(u) > 1 and u → T_{d(u)−1}. Because u ↛ T_0, it must be that u → w, w ∈ B^R and d(w) < d(u) − 1. Let (v0 . . . vl) be a residual path with v0 = u and vl = w. Let r be the minimal number such that vr ∈ B^R. By validity of d we have d(u) ≤ d(v1) ≤ · · · ≤ d(v_{r−1}) ≤ d(vr) + 1. By Corollary 2 we have d(vr) ≤ d(w), hence d(u) ≤ d(w) + 1, which is a contradiction.

Statement 5. For d computed on line 5 and any u ∈ R it holds:
1. d is valid;
2. u ↛ T_a ⇔ d(u) ≥ a + 1.
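The labeling in Statement 5 can be illustrated by a reverse breadth-first search. The sketch below is not the authors' implementation. It assumes that T_k = {t} ∪ {w ∈ B^R | d(w) < k}, a definition this section uses but does not restate, and that region-relabel assigns d(u) = min{k | u → T_k}, or d^∞ if no target is reachable. The function name `region_relabel` and its arguments are hypothetical.

```python
from collections import deque

def region_relabel(inner, boundary_label, residual_arcs, sink, d_inf):
    """Label every inner vertex u with min{k | u -> T_k}, or d_inf if none exists.

    Assumption (not restated in this section): T_k = {sink} plus all boundary
    vertices w with d(w) < k, so reaching the sink gives label 0 and reaching
    a boundary vertex w gives label d(w) + 1.
    """
    # Reverse adjacency of the residual network: preds[v] = vertices with an arc into v.
    preds = {}
    for u, v in residual_arcs:
        preds.setdefault(v, []).append(u)

    d = {u: d_inf for u in inner}
    visited = set()
    # Targets ordered by the label they induce on vertices that can reach them.
    targets = [(0, sink)] + sorted(
        ((lab + 1, w) for w, lab in boundary_label.items() if lab + 1 <= d_inf),
        key=lambda item: item[0],
    )
    for level, target in targets:
        if target in visited:
            continue  # everything reaching this target already got a lower label
        visited.add(target)
        queue = deque([target])
        while queue:
            v = queue.popleft()
            for u in preds.get(v, []):
                if u not in visited:
                    visited.add(u)
                    if u in d:  # only inner vertices are relabeled
                        d[u] = level
                    queue.append(u)
    return d
```

Because targets are processed in increasing order of the induced label and the visited set is shared, each vertex receives the smallest label among the targets it can reach, and every arc is scanned at most once.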
Proof. 1. Let (u, v) ∈ E^R and c(u, v) > 0. Clearly u → v. Consider four cases:
• case u ∈ R, v ∈ B^R: Then u → T_{d(v)+1}, hence d(u) ≤ d(v) + 1.
• case u ∈ R, v ∈ R: If v ↛ T_{d^∞} then d(v) = d^∞ and d(u) ≤ d(v). If v → T_{d^∞}, then d(v) = min{k | v → T_k}. Let k = d(v), then v → T_k and u → T_k, therefore d(u) ≤ k = d(v).
• case u ∈ B^R, v ∈ R: By Corollary 1, u ↛ T_{d(u)}. Because u → v, it is v ↛ T_{d(u)}, therefore d(v) ≥ d(u) + 1 and d(u) ≤ d(v) − 1 ≤ d(v) + 1.
• case when u = t or v = t is trivial.
2. The “⇐” direction follows by Statement 4 applied to d, which is a valid labeling. The “⇒” direction: we have u ↛ T_a and d(u) ≥ min{k | u → T_k} = min{k > a | u → T_k} ≥ a + 1.

Statement 6 (Properties of ARD). Let d be a valid labeling in G^R. The output (f′, d′) of ARD satisfies:
1. There are no active vertices in R w.r.t. (f′, d′); (optimality)
2. d′ ≥ d, d′|_{B^R} = d|_{B^R}; (labeling monotonicity)
3. d′ is valid in G^R_{f′}; (labeling validity)
4. f′ is a sum of path flows, where each path is from a vertex u ∈ R to a vertex v ∈ {t} ∪ B^R and it is d′(u) > d(v) if v ∈ B^R. (flow direction)

Proof. 1. In the last stage, the algorithm augments all paths to T_{d^∞}. After this augmentation a vertex u ∈ R either has excess 0 or there is no residual path to T_{d^∞} and hence d′(u) = d^∞ by construction.
2. For d(u) = 0, we trivially have d′(u) ≥ d(u). Let d(u) = a + 1 > 0. By Statement 4, u ↛ T_a in G^R and it holds also in G^R_{f′} by Statement 3. From Statement 5.2, we conclude that d′(u) ≥ a + 1 = d(u). The equality d′|_{B^R} = d|_{B^R} is by construction.
3. Proven by Statement 5.1.
4. Consider a path from u to v ∈ B^R, augmented in stage k > 0. It follows that k = d(v) + 1. At the beginning of stage k it is u ↛ T_{k−1}. By Statement 3, this is preserved till the end of the algorithm. By Statement 5.2, d′(u) ≥ k = d(v) + 1 > d(v).

4.3 Complexity of the Sequential ARD
Let us first verify that the labeling in S-ARD is globally valid.

Statement 7. For a labeling d valid in G and (f′, d′) = ARD(G^R, d), the extension of d′ to V defined by d′|_{R̄} = d|_{R̄} is valid in G_{f′}.

Proof. Statement 5 established validity of d′ in G^R_{f′}. For edges (u, v) ∈ (V\R, V\R) labeling d′ coincides with d and f′(u, v) = 0. It remains to verify validity on edges (v, u) ∈ (B^R, R) in the case c^R_{f′}(v, u) = 0 and c_{f′}(v, u) > 0. Because 0 = c^R_{f′}(v, u) = c^R(v, u) − f′(v, u) = −f′(v, u), we have c_{f′}(v, u) = c(v, u). Since d was valid in G, d(v) ≤ d(u) + 1. The new labeling d′ satisfies d′(u) ≥ d(u) and
d′(v) = d(v). It follows that d′(v) = d(v) ≤ d(u) + 1 ≤ d′(u) + 1. Hence d′ is valid in G_{f′}.

Theorem 1. The sequential ARD terminates in at most 2|B|^2 + 1 sweeps.

Proof. The value of d(v) does not exceed |B| and d is non-decreasing. The total increase of d|_B during the algorithm is at most |B|^2. After the first sweep, active vertices are only in B. Indeed, discharging region R_k makes all vertices v ∈ R_k inactive and only vertices in B may become active. So by the end of the sweep, all vertices V\B are inactive. Let us introduce the quantity

Φ = max{d(v) | v ∈ B, v is active in G}. (9)
We will prove the following two cases for each sweep after the first one:
1. If d|_B is increased then the increase in Φ is no more than the total increase in d|_B. Consider the discharge of R_k. Let Φ be the value before ARD on R_k and Φ′ the value after. Let Φ′ = d′(v). It must be that v is active in G′. If v ∉ V^R, then d(v) = d′(v) and e(v) = e_{f′}(v), so Φ ≥ d(v) = Φ′. Let v ∈ V^R. After the discharge, vertices in R_k are inactive, so v ∈ B_k and it is d′(v) = d(v). If v was active in G then Φ ≥ d(v) and we have Φ′ − Φ ≤ d′(v) − d(v) = 0. If v was not active in G, there must exist an augmenting path from a vertex v0 to v such that v0 ∈ R_k ∩ B was active in G. For this path, the flow direction property implies d′(v0) ≥ d(v). We now have Φ′ − Φ ≤ d′(v) − d(v0) = d(v) − d(v0) ≤ d′(v0) − d(v0) ≤ Σ_{v ∈ R_k ∩ B} [d′(v) − d(v)]. Summing over all regions, we get the result.
2. If d|_B is not increased then Φ is decreased at least by 1. We have d′ = d. Let us consider the set of vertices having the highest active label or above, H = {v | d(v) ≥ Φ}. These vertices do not receive flow during all discharge operations due to the flow direction property. After the discharge of R_k there are no active vertices left in R_k ∩ H (property 6.1). After the full sweep, there are no active vertices in H.

In the worst case, starting from sweep 2, Φ can increase by one |B|^2 times and decrease by one |B|^2 times. In at most 2|B|^2 + 1 sweeps, there are no active vertices left. On termination we have that the labeling is valid and there are no active vertices in G. The proof that P-ARD terminates is similar and is given in [1].
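To make the quantities in the proof concrete, the following minimal sketch computes Φ from (9) and the sweep bound of Theorem 1. It assumes that a vertex counts as active when it has positive excess and a label below d^∞, which is the usual push-relabel convention and is not restated in this section. The function names are hypothetical.

```python
def highest_active_boundary_label(d, excess, boundary, d_inf):
    """Phi from (9): the largest label among active boundary vertices.

    Assumption: a vertex is active if it has positive excess and a label
    below d_inf. Returns None when no boundary vertex is active.
    """
    active_labels = [d[v] for v in boundary if excess[v] > 0 and d[v] < d_inf]
    return max(active_labels) if active_labels else None

def sweep_bound(num_boundary):
    """Upper bound on the number of sweeps from Theorem 1: 2|B|^2 + 1."""
    return 2 * num_boundary ** 2 + 1
```

With three boundary vertices, for example, the bound evaluates to 2·9 + 1 = 19 sweeps.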
5 Experiments
We tested the algorithms on synthetic and real problems. The machine had Intel Core 2 Quad
[email protected], 4GB memory, Windows XP 32bit and Microsoft VC compiler. All tested algorithms are sequential, 32bit and use only one core of the CPU. The memory limit for the algorithms is 2GB. As a baseline we used augmenting path implementation [2] v3.0 (BK) and the highest level push-relabel implementation [10] v3.6 (HIPRα, where α is a parameter denoting frequency of global relabels, 0.5 is the default value).
[Figure 1: panels (a), (b), (c). Plot axes: number of sweeps and CPU time (sec.) versus strength; curves for BK, HIPR0, HIPR0.5, HPR, S-ARD and S-PRD.]
Fig. 1. (a) Region Network. (b) Example of a synthetic problem: a network of size 6×6, connectivity 8, partition into 4 regions. The source and sink are not shown. (c) Dependence on the interaction strength for size 1000×1000, connectivity 8 and 4 regions. Plots show the mean values over 100 random samples and intervals containing 70% of the samples.
ARD was implemented (see footnote 4) using BK as a core solver. PRD is based on our reimplementation of the highest level push-relabel for the case of a given labeling on the boundary. This reimplementation (denoted HPR) uses a linked list of buckets (rather than an array) to achieve time and space complexity independent of n and is otherwise similar to HIPR. For each region, the sequential Alg. 1 loads and saves all the internal data of the core solver, so that discharge is always warm-started. Please see [1] for details of the implementation and more experimental results.

5.1 Synthetic Instances
We generated simple synthetic 2D grid problems with a regular connectivity structure. Fig. 1(b) shows an example of such a network. Nodes are arranged into a 2D grid and edges are added at the following relative displacements: (0, 1), (1, 0), (1, 2), (2, 1), so the number of edges incident to a node far enough from the boundary (connectivity) is equal to 8. Each node is given an integer excess/deficit distributed uniformly in the interval [−500, 500]. A positive number means a source link and a negative number a sink link. All edges in the graph, except for terminal links, are assigned a constant capacity, called strength. The network is partitioned into regions by slicing it into s equal parts in both dimensions. Let us first look at the dependence on the strength, shown in Fig. 1(c).

Footnote 4: Implementations are available at cmp.felk.cvut.cz/~shekhovt/d_maxflow.
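The construction of the synthetic instances can be sketched as follows. This is an illustrative generator under stated assumptions, not the code used for the experiments: the function name `make_synthetic_instance`, the seeded random generator and the treatment of every edge as an undirected pair with the same capacity in both directions are assumptions.

```python
import random

def make_synthetic_instance(height, width, strength, s, seed=0):
    """Build a synthetic grid problem: terminal excesses, edges and a region map."""
    rng = random.Random(seed)
    displacements = [(0, 1), (1, 0), (1, 2), (2, 1)]

    # Integer excess/deficit per node: positive = source link, negative = sink link.
    excess = {(i, j): rng.randint(-500, 500)
              for i in range(height) for j in range(width)}

    # Each displacement adds one edge per node (when the endpoint is inside the grid).
    # Assumption: an edge carries capacity `strength` in both directions.
    edges = []
    for i in range(height):
        for j in range(width):
            for di, dj in displacements:
                ni, nj = i + di, j + dj
                if ni < height and nj < width:
                    edges.append(((i, j), (ni, nj), strength))

    # Slice the grid into s parts in each dimension, giving s * s regions.
    region = {(i, j): (i * s // height) * s + (j * s // width)
              for i in range(height) for j in range(width)}
    return excess, edges, region
```

For a 6×6 grid with s = 2 this produces 4 regions, and a node far enough from the boundary is incident to 8 edges, matching the connectivity described above.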
[Figure 2: panels (a), (b). Plot axes: number of sweeps and CPU time (sec.) versus the number of regions (a) and versus vertices^{1/2} (b); curves for BK, HIPR0, HIPR0.5, HPR, S-ARD and S-PRD.]
Fig. 2. (a) Dependence on the number of regions, for size 1000×1000, strength 150. (b) Dependence on the problem size, for strength 150, 4 regions.
Problems with small strength are easy because they are very local – long augmentation paths do not occur. For problems with large strength long paths need to be augmented. However, finding them is easy because bottlenecks are unlikely. Therefore BK and S-ARD have a maximum in the computation time somewhere in the middle. It is more difficult to transfer the flow over long distances for push-relabel algorithms. This is where the global relabel heuristic becomes efficient and HIPR0.5 outperforms HIPR0. The region-relabel heuristic of S-PRD allows it to outperform other push-relabel variants. As a function of the number of regions (Fig. 2(a)), both the number of sweeps and the computation time grow slowly. As a function of the problem size (Fig. 2(b)), the computation efforts of all algorithms grow proportionally. However, the number of sweeps shows different asymptotes. It is almost constant for S-ARD but grows significantly for S-PRD.

5.2 Real Instances
We tested our algorithms on the maxflow problem instances published by the Computer Vision Research Group at the University of Western Ontario (http://vision.csd.uwo.ca/maxflow-data/ ). The data consists of typical maxflow problems in computer vision, graphics, and biomedical image analysis, including 2D, 3D and 4D grids of various connectivity. The results are presented in Table 1. We select the regions by slicing the problems in 4 parts in each dimension: into 16 regions for 2D BVZ grids and into 64 regions for 3D segmentation instances.
Table 1. Real instances. CPU – the time spent purely for computation, excluding the time for parsing, construction and disk I/O. The total time to solve the problem is not shown. K – number of regions. RAM – memory taken by the solver; for BK in the case it exceeds 2GB limit, the expected required memory; for streaming solvers the sum of shared and region memory. I/O – total bytes read or written to disk. Baseline entries follow the column order BK, HIPR0, HIPR0.5, HPR; solvers without an entry are omitted. Empty cells are shown as -.

problem | n (10^6) | m/n | baseline CPU | baseline RAM | S-ARD CPU | S-ARD sweeps | S-ARD RAM | S-ARD I/O | S-PRD CPU | S-PRD sweeps | S-PRD RAM | S-PRD I/O | K
stereo
BVZ-sawtooth(20) | 0.2 | 4.0 | 0.68s, 3.0s, 7.7s, 3.8s | 14MB, 17MB | 0.68s | 6 | 0.3+0.9MB | 91MB | 2.7s | 26 | 0.7+1.1MB | 0.6GB | 16
BVZ-tsukuba(16) | 0.1 | 4.0 | 0.36s, 1.9s, 4.9s, 2.6s | 9.7MB, 11MB | 0.40s | 5 | 0.2+0.6MB | 55MB | 1.7s | 23 | 0.5+0.7MB | 349MB | 16
BVZ-venus(22) | 0.2 | 4.0 | 1.2s, 5.7s, 15s, 6.2s | 15MB, 17MB | 1.6s | 6 | 0.3+0.9MB | 94MB | 5.8s | 29 | 0.7+1.1MB | 0.8GB | 16
KZ2-sawtooth(20) | 0.3 | 5.8 | 1.8s, 7.1s, 22s, 6.1s | 33MB, 36MB | 2.2s | 6 | 1.2+2.0MB | 212MB | 6.0s | 21 | 1.5+2.3MB | 1.1GB | 16
KZ2-tsukuba(16) | 0.2 | 5.9 | 1.1s, 5.3s, 20s, 4.4s | 23MB, 25MB | 1.8s | 6 | 1.1+1.4MB | 148MB | 5.4s | 15 | 1.1+1.6MB | 0.5GB | 16
KZ2-venus(22) | 0.3 | 5.8 | 2.8s, 13s, 39s, 10s | 34MB, 37MB | 4.0s | 7 | 1.2+2.1MB | 255MB | 12s | 29 | 1.5+2.4MB | 1.5GB | 16
multiview
BL06-camel-lrg | 18.9 | 4.0 | 81s | 1.6GB | 116s | 11 | 19+116MB | 25GB | 308s | 418 | 86+122MB | 0.6TB | 16
BL06-camel-med | 9.7 | 4.0 | 25s, 29s, 77s, 59s | 0.8GB, 1.0GB | 36s | 12 | 13+60MB | 13GB | 118s | 227 | 46+63MB | 225GB | 16
BL06-gargoyle-lrg | 17.2 | 4.0 | 245s, 91s | 1.5GB, 1.7GB | 191s | 20 | 23+106MB | 33GB | 318s | 354 | 82+112MB | 0.8TB | 16
BL06-gargoyle-med | 8.8 | 4.0 | 115s, 17s, 58s, 37s | 0.8GB, 0.9GB | 91s | 14 | 15+55MB | 12GB | 143s | 340 | 44+58MB | 235GB | 16
surface
LB07-bunny-lrg | 49.5 | 6.0 | - | 5.7GB | 16min | 6 | 49+101MB | 34GB | 416s | 43 | 226+99MB | 276GB | 64
LB07-bunny-med | 6.3 | 6.0 | 1.6s, 20s, 41s, 26s | 0.7GB, 0.8GB | 20s | 8 | 14+14MB | 4.1GB | 16s | 25 | 34+13MB | 24GB | 64
segm
liver.n26c100 | 4.2 | 11.1 | 12s, 26s, 28s, 39s | 0.8GB, 0.7GB | 26s | 15 | 18+15MB | 11GB | 35s | 98 | 30+14MB | 66GB | 64
liver.n6c100 | 4.2 | 10.5 | 15s, 30s, 34s, 44s | 0.8GB, 0.7GB | 25s | 17 | 16+14MB | 11GB | 32s | 94 | 29+13MB | 70GB | 64
babyface.n26c100 | 5.1 | 49.0 | - | 3.8GB | 264s | 36 | 165+72MB | 95GB | 262s | 116 | 180+57MB | 0.6TB | 64
babyface.n6c100 | 5.1 | 11.5 | 13s, 71s, 65s, 87s | 1.0GB, 0.9GB | 32s | 17 | 22+19MB | 17GB | 74s | 191 | 37+17MB | 189GB | 64
adhead.n26c100 | 12.6 | 31.6 | - | 6.3GB | 185s | 16 | 154+106MB | 70GB | 269s | 129 | 196+86MB | 0.8TB | 64
adhead.n6c100 | 12.6 | 11.7 | - | 2.5GB | 48s | 13 | 35+44MB | 29GB | 121s | 165 | 77+39MB | 354GB | 64
bone.n26c100 | 7.8 | 32.4 | - | 4.0GB | 32s | 15 | 122+79MB | 31GB | 68s | 124 | 147+63MB | 321GB | 64
bone.n6c10 | 7.8 | 11.5 | 7.7s, 5.7s, 17s, 12s | 1.5GB, 1.4GB | 7.8s | 10 | 27+28MB | 10GB | 37s | 195 | 52+25MB | 188GB | 64
bone.n6c100 | 7.8 | 11.6 | 9.1s, 9.1s, 22s, 14s | 1.6GB, 1.5GB | 9.8s | 11 | 27+28MB | 12GB | 23s | 65 | 52+25MB | 104GB | 64
abdomen_long.n6c10 | 144.4 | 11.8 | - | 29GB | 179s | 11 | 170+497MB | 196GB | - | > 35 | - | >1TB | 64
abdomen_short.n6c10 | 144.4 | 11.8 | - | 29GB | 82s | 11 | 170+497MB | 138GB | - | - | - | - | 64
Problems KZ2 are not regular grids, so we sliced them into 16 regions just by the node number. We did the same for the multiview BL06 instances, for which we do not know the grid layout. In 3D segmentation instances the arcs which are reverse of each other are spread in the file. Because we did not match them, we had to create parallel arcs in the graph (multigraph). This is seen, e.g., in babyface.n26c100, which is 26-connected, but we construct a multigraph with an average node degree of 49. For some other instances, however, this is not visible because there are many zero arcs.
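The two ways of selecting regions described in this section can be sketched as below. Both helper names are hypothetical, and the contiguous, equally sized ranges used for slicing by node number are an assumption about what "just by the node number" means.

```python
def grid_region(coord, shape, parts):
    """Region index of a grid node when every dimension is cut into `parts` slices."""
    index = 0
    for c, size in zip(coord, shape):
        index = index * parts + (c * parts) // size
    return index

def node_number_region(node, num_nodes, num_regions):
    """Region index when nodes are split into contiguous ranges of node numbers.

    Assumption: ranges are (nearly) equal in size and follow the node order.
    """
    return (node * num_regions) // num_nodes
```

With 4 parts per dimension, `grid_region` yields 16 regions on a 2D grid and 64 regions on a 3D grid, as used for the BVZ and 3D segmentation instances.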
6 Conclusion
We have developed a new distributed algorithm for the mincut problem on sparse graphs and proved an O(|B|^2) bound on the number of sweeps. Both in theory and practice (randomized tests) the required number of sweeps is asymptotically better than for a variant of parallel push-relabel. Experiments on real instances showed that S-ARD, while sometimes doing more computations than S-PRD or BK, uses significantly fewer disk operations. We proposed a sequential and a parallel version of the algorithm. The best practical solution could be a combination of the two, depending on the usage mode and hardware (several CPUs, several network computers, sequential with storage on Solid State Drive, using GPU for region discharge, etc.). Region overlaps can be allowed in our framework in the following simple way. A sequential algorithm can keep 2 regions in memory at a time and alternate between them until both are discharged. With PRD this is efficiently equivalent to discharging twice larger regions with a 1/2 overlap and may significantly decrease the number of sweeps required.
References
1. Shekhovtsov, A., Hlavac, V.: A distributed mincut/maxflow algorithm combining path augmentation and push-relabel. Research Report K333–43/11, CTU–CMP–2011–03, Czech Technical University in Prague (2011)
2. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26 (2004)
3. Liu, J., Sun, J.: Parallel graph-cuts by adaptive bottom-up merging. In: CVPR (2010)
4. Strandmark, P., Kahl, F.: Parallel and distributed graph cuts by dual decomposition. In: CVPR (2010)
5. Goldberg, A.V., Tarjan, R.E.: A new approach to the maximum flow problem. Journal of the ACM 35 (1988)
6. Goldberg, A.V.: Processor-efficient implementation of a maximum flow algorithm. Inf. Process. Lett. 38 (1991)
7. Delong, A., Boykov, Y.: A scalable graph-cut algorithm for N-D grids. In: CVPR (2008)
8. Goldberg, A.V.: The partial augment–relabel algorithm for the maximum flow problem. In: Proceedings of the 16th Annual European Symposium on Algorithms (2008)
9. Goldberg, A.V., Rao, S.: Beyond the flow decomposition barrier. J. ACM (1998)
10. Cherkassky, B.V., Goldberg, A.V.: On implementing push-relabel method for the maximum flow problem. Technical report (1994)
11. Goldberg, A.: Efficient graph algorithms for sequential and parallel computers. PhD thesis, Massachusetts Institute of Technology (1987)
Minimizing Count-Based High Order Terms in Markov Random Fields
Thomas Schoenemann
Center for Mathematical Sciences, Lund University, Sweden
Abstract. We present a technique to handle computer vision problems inducing models with very high order terms - in fact terms of maximal order. Here we consider terms where the cost function depends only on the number of variables that are assigned a certain label, but where the dependence is arbitrary. Applications include image segmentation with a histogram-based data term [28] and the recently introduced marginal probability fields [31]. The presented technique makes use of linear and integer linear programming. We include a set of customized cuts to strengthen the formulations.
1 Introduction
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 17–30, 2011. © Springer-Verlag Berlin Heidelberg 2011

When global methods for certain kinds of optimization problems with binary terms became known [13], research in computer vision focused on such problems for several years, and they are very well understood today. Over the past few years the trend has moved more and more towards higher order terms, which are generally more difficult to solve but also provide better models [22,30,20,24,17].

Research on optimizing such models is split into two fields: methods designed for terms of fairly low order, e.g. up to five, and methods that address very high order terms up to the maximal order of the number of variables. The first class includes the works [21,29,33,30,20] and their complexity grows exponentially with the order of the term. Some of them [30,20] do however admit efficient solutions for specific high order problems. There are no restrictions on the form of the terms, but performance can of course differ greatly - already problems with binary terms are in general NP-hard.

The methods in this first group are split into two subclasses: some of them [29,30,20], usually message passing algorithms, are directly based on the higher order terms. Alternatively, there are methods that first convert higher order terms into unary and binary terms [5,32] and then make use of appropriate inference techniques [32,14,2]. In special cases even the global minimum can be computed [3,11].

The second class of methods handles very high order terms, and they need to make assumptions on the form of the terms. The method [18] handles any
concave dependence on the number of variables that are assigned a certain label. For binary labeling problems such terms are optimized globally (when all binary terms are submodular); for multi-label problems move-based algorithms are used. An alternative method for a largely overlapping class of models was recently given in [10]. Furthermore, general submodular functions (possibly with high order terms) can be optimized in polynomial time [26,15], but to our knowledge these algorithms have never been tested in a computer vision context. Moreover, for a broad class of problems (including all those discussed in this paper) message passing approaches can be implemented with a polynomial dependence on the maximal order of the high order terms [23,27]. Another specialized solution for a large part of this class was given in [17], but tested only on small synthetic problems. Finally, there are some problem-specific solutions based on dual decomposition [28,31]. As the respective problems will be addressed in this paper, we review them more closely below. First, however, we give a brief summary of this paper.

This Work: We address the class of very high order terms where the cost depends only on the number of variables that are assigned a label, but without any restrictions on the form of this dependence. Our method is based on (suboptimally) solving integer linear programs (ILPs), where we make use of standard packages. We first solve the linear programming relaxations, then apply a branch-and-cut scheme where we employ the set of cuts we derived in [25]. Note that our ILPs are different from those considered in the above cited works, in particular from the standard max-product message passing techniques: for terms of maximal order these latter produce exponentially large formulations that in some special cases can be handled implicitly. In contrast, our formulation is very compact, so we can make use of standard integer linear programming solvers.
Likewise, our strategy after solving the linear programming relaxation differs greatly from the cutting planes scheme described in [30] which introduces extra variables. We only introduce extra constraints. We demonstrate applications for the following problems:

Histogram Image Segmentation. A popular model for image segmentation (e.g. [4,28]) combines a length-based regularity term with appearance terms based on the log-probabilities of fairly general distributions (e.g. histograms or Gaussian Mixtures). These distributions are themselves unknown and to be adapted to the given image. Traditionally, these models have been addressed via alternating minimization (AM). For histograms, recently [28] cast the problem as finding the global optimum of a labeling problem with terms of maximal order. Their solution is based on dual decomposition. However, we found that this approach only works if seed nodes are given. We give a solution that works in the fully unsupervised setting and also extends to multi-label problems (however, here it does not beat AM).

Marginal Probability Fields. It was recently proposed [31] to extend the traditional Markov Random Field framework by high order terms measuring how
well a labeling reflects a-priori known marginal statistics of certain features. The authors give a dual-decomposition optimization scheme. However, in contrast to our work they neglect consistency: the method assigns labels independently to single nodes and pairs of nodes, although the latter are already defined by the former. Moreover, while they only explored unary and pairwise terms we will also include triple constellations.
2 Count-Based Terms as Integer Linear Programs
We start with a description of the general class of labeling problems we consider and simultaneously indicate applications for computer vision. In this paper we consider the problem of assigning each of the nodes p in a certain finite set V ⊆ ℝ² – typically the pixels in a given image – a label y_p ∈ L = {1, . . . , K}, with K a given constant. We are interested in finding the best labeling, where “best” is defined as the minimum of an energy function with unary, pairwise and a certain kind of higher order terms. These last terms depend on all nodes simultaneously and their cost is a function of the number of times a certain labeling constellation is observed. In the simplest case these constellations are the labels of single pixels and the problem is of the form

\[ \min_{y} \; \sum_{p \in \mathcal{V}} D_p(y_p) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(y_p, y_q) + \sum_{l=1}^{K} f_l\Big( \sum_{p \in \mathcal{V}} \delta(y_p, l) \Big), \tag{1} \]
where N is a neighborhood system and δ(·, ·) the Kronecker-δ, i.e. 1 if both its arguments are equal, otherwise 0. The real-valued functions D_p(·), V_{p,q}(·, ·) and f_l(·) can be chosen freely - there are no restrictions on them. In a slightly more general form, we can have several higher order functions per label (see footnote 1). Instead of a single function f_l : {0, . . . , |V|} → ℝ we now allow N_l functions f_l^i, where each of them can collect counts over its own subset S_l^i ⊆ V of nodes. The model then reads

\[ \min_{y} \; \sum_{p \in \mathcal{V}} D_p(y_p) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(y_p, y_q) + \sum_{l=1}^{K} \sum_{i=1}^{N_l} f_l^i\Big( \sum_{p \in S_l^i} \delta(y_p, l) \Big). \tag{2} \]
Example. The class (2) contains the problem of histogram-based image segmentation when stated as a purely combinatorial problem [28]. Given an image I : V → {0, . . . , 255}, the problem is to minimize

\[ \sum_{p \in \mathcal{V}} -\log\big[P_{y_p}(I(p))\big] + \sum_{(p,q) \in \mathcal{N}} \frac{\lambda}{\|p - q\|} \big(1 - \delta(y_p, y_q)\big) \tag{3} \]

Footnote 1: One can also collect counts across different labels, but we will not explore this here.
with respect to both the probability distributions P_1(·), . . . , P_K(·) and the labels y_p for p ∈ V. For a given labeling y the minimizing distributions are given by

\[ P_l(k) = \frac{\sum_{p : I(p) = k} \delta(y_p, l)}{\sum_{p \in \mathcal{V}} \delta(y_p, l)}, \quad k \in \{0, \dots, 255\} \]
(where 0/0 is defined as 0). The key observation of [28] is that when inserting the negative logarithm of this term into the above functional one obtains a purely combinatorial problem of the form (2). The unary terms in (3) can now be grouped together into higher order terms, and defining h(n) = n log(n) (with h(0) = 0) the problem is written as

\[ \sum_{l} h\Big( \sum_{p \in \mathcal{V}} \delta(y_p, l) \Big) + \sum_{l} \sum_{k=0}^{255} -h\Big( \sum_{p : I(p) = k} \delta(y_p, l) \Big) + \sum_{(p,q) \in \mathcal{N}} \frac{\lambda}{\|p - q\|} \big(1 - \delta(y_p, y_q)\big). \]
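The regrouping of the unary terms can be checked numerically. The sketch below, with hypothetical function names, evaluates the data term of (3) once directly with the minimizing distributions P_l and once through the count-based form with h(n) = n log(n). The pairwise term is left out because it is the same in both expressions.

```python
import math
from collections import Counter

def h(n):
    """h(n) = n log(n), with h(0) = 0."""
    return n * math.log(n) if n > 0 else 0.0

def data_term_direct(intensities, labels):
    """Sum of -log P_{y_p}(I(p)) using the minimizing histograms P_l."""
    n_l = Counter(labels)                    # nodes per label
    n_lk = Counter(zip(labels, intensities)) # nodes per (label, intensity)
    return sum(-math.log(n_lk[(l, k)] / n_l[l])
               for l, k in zip(labels, intensities))

def data_term_counts(intensities, labels):
    """The same quantity written with count-based terms h(.) and -h(.)."""
    n_l = Counter(labels)
    n_lk = Counter(zip(labels, intensities))
    return sum(h(n) for n in n_l.values()) - sum(h(n) for n in n_lk.values())
```

For the toy input with intensities [0, 0, 1, 2, 2, 2] and labels [1, 1, 1, 2, 2, 1], both functions evaluate to 6 log 2.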
Very similar derivations can be given for color images and histograms with arbitrarily defined bins.

2.1 Integer Programming Formulation
We now show how the above energy minimization problems can be cast as integer linear programs, i.e. as minimizing a linear cost function subject to linear constraints and integrality conditions on the variables. We use the common concept of binary indicator variables x_p^l ∈ {0, 1}, where x_p^l = 1 indicates that y_p = l. With these variables, the unary terms are readily expressed as a linear cost function. To write the binary terms in a linear way, we consider variables x_{p,q}^{l1,l2} ∈ {0, 1} that we want to be 1 if and only if y_p = l1 and y_q = l2. All these binary variables are grouped into a vector x and the associated linear cost is denoted c_x^T x, where c_x has elements c_p^l = D_p(l) and c_{p,q}^{l1,l2} = V_{p,q}(l1, l2).

To express the count-based terms we introduce variables z_{l,i}^n ∈ {0, 1} (for n ∈ {0, . . . , |S_l^i|}) that we want to be 1 if and only if the count for the function f_l^i is equal to n. All these variables are grouped into a vector z and the associated cost c_z^T z has elements c_{l,i}^n = f_l^i(n). In addition, there are three sets of constraints to be satisfied. The first states that every node must have exactly one label and that for all l and i exactly one of the count variables z_{l,i}^n must be 1. The second states that the variables for the binary terms must be consistent with the values induced by the unary ones. As these are quite standard constraints that arise in many message passing approaches, we defer the equations to (5) below. In contrast, the last set of constraints is not at all common for computer vision. It states that the count variables need to be consistent with the associated node variables:
\[ \sum_{p \in S_l^i} x_p^l = \sum_{n=0}^{|S_l^i|} n \cdot z_{l,i}^n \tag{4} \]
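Together this results in the following integer linear program:

\[
\begin{aligned}
\min_{x, z} \quad & c_x^T x + c_z^T z \\
\text{s.t.} \quad & \sum_{l \in \mathcal{L}} x_p^l = 1 && \forall p \in \mathcal{V} \\
& \sum_{n=0}^{|S_l^i|} z_{l,i}^n = 1 && \forall l \in \mathcal{L},\; i = 1, \dots, N_l \\
& x_p^l = \sum_{l' \in \mathcal{L}} x_{p,q}^{l,l'} && \forall p \in \mathcal{V},\; l \in \mathcal{L},\; q \in \mathcal{N}(p) \\
& \sum_{p \in S_l^i} x_p^l = \sum_{n=0}^{|S_l^i|} n \cdot z_{l,i}^n && \forall l \in \mathcal{L},\; i = 1, \dots, N_l \\
& x_p^l \in \{0, 1\},\; x_{p,q}^{l_1,l_2} \in \{0, 1\},\; z_{l,i}^n \in \{0, 1\}
\end{aligned}
\tag{5}
\]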
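As a small sanity check of the encoding, the sketch below maps a labeling to the indicator variables x and z, tests the constraints of (5) that involve them, and compares the linear cost with the energy. It is a simplified illustration under stated assumptions: pairwise terms are omitted, there is one count function per label over all nodes as in (1), labels are numbered from 0, and all function names are hypothetical.

```python
from itertools import product

def encode(y, K):
    """Indicator variables for a labeling y (labels 0..K-1, pairwise terms omitted)."""
    n_nodes = len(y)
    counts = [sum(1 for label in y if label == l) for l in range(K)]
    x = [[int(y[p] == l) for l in range(K)] for p in range(n_nodes)]
    z = [[int(counts[l] == n) for n in range(n_nodes + 1)] for l in range(K)]
    return x, z

def is_feasible(x, z):
    """Check the one-label, one-count and count-consistency constraints."""
    K = len(z)
    one_label = all(sum(row) == 1 for row in x)
    one_count = all(sum(z_l) == 1 for z_l in z)
    consistent = all(
        sum(x[p][l] for p in range(len(x)))
        == sum(n * z_ln for n, z_ln in enumerate(z[l]))
        for l in range(K)
    )
    return one_label and one_count and consistent

def linear_cost(x, z, D, f):
    """c_x^T x + c_z^T z with c_p^l = D[p][l] and c_l^n = f[l][n]."""
    unary = sum(D[p][l] * x[p][l] for p in range(len(x)) for l in range(len(z)))
    count = sum(f[l][n] * z[l][n] for l in range(len(z)) for n in range(len(z[l])))
    return unary + count

def energy(y, D, f):
    """Energy (1) without pairwise terms."""
    K = len(f)
    counts = [sum(1 for label in y if label == l) for l in range(K)]
    return sum(D[p][y[p]] for p in range(len(y))) + sum(f[l][counts[l]] for l in range(K))
```

Enumerating all labelings of a tiny instance confirms that every labeling yields a feasible point whose linear cost equals its energy, which is what the formulation relies on.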
Solving this kind of problem is in general NP-hard [28].

2.2 Special Cases
For a number of problems the ILP (5) can be written more compactly. In particular, for binary labeling problems one can replace all occurrences of x_p^{l=2} by 1 − x_p^{l=1} and drop the constraints in the second line of (5). Furthermore, for many regularity terms V_{p,q}(l1, l2) one can reduce the number of pairwise variables and constraints. This includes the Potts model, where it is well-known [16,34] that

\[ V_{p,q}(l_1, l_2) = \lambda \big(1 - \delta(l_1, l_2)\big) = \frac{\lambda}{2} \sum_{l \in \mathcal{L}} |x_p^l - x_q^l|, \]

where λ > 0 is a smoothness weight. It is also known (e.g. [9]) that the absolute values in this expression can be written as linear programs:

\[ \min_{x \ge 0,\; a^{\pm} \ge 0} \; \frac{\lambda}{2} \sum_{(p,q) \in \mathcal{N}} \sum_{l \in \mathcal{L}} \big[ a^+_{p,q,l} + a^-_{p,q,l} \big] \quad \text{s.t.} \quad x_p^l - x_q^l = a^+_{p,q,l} - a^-_{p,q,l} \]

Such constructions significantly reduce the number of required variables. Finally, if a function f_l^i : {0, . . . , |S_l^i|} → ℝ is convex, we can bypass the consistency constraints (4): such a function can be implemented as a set of inequalities, see section 4.2 and [17]. In this work we always make use of such reductions in the problem size.
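The Potts identity and the linearization of the absolute values can be verified on integral indicator vectors. In this sketch (hypothetical function names, labels numbered from 0), a⁺ and a⁻ are set to the positive and negative part of the difference, which is the minimizer of a⁺ + a⁻ under the constraint a⁺ − a⁻ = x_p^l − x_q^l with a⁺, a⁻ ≥ 0.

```python
def one_hot(label, K):
    """Indicator vector (x^0, ..., x^{K-1}) of a single label."""
    return [int(label == l) for l in range(K)]

def potts_direct(l1, l2, lam):
    """lambda * (1 - delta(l1, l2))."""
    return lam * (0 if l1 == l2 else 1)

def potts_linear(l1, l2, lam, K):
    """lambda/2 * sum_l (a_plus + a_minus) with a_plus - a_minus = x_p^l - x_q^l."""
    xp, xq = one_hot(l1, K), one_hot(l2, K)
    total = 0
    for a, b in zip(xp, xq):
        diff = a - b
        a_plus, a_minus = max(diff, 0), max(-diff, 0)  # optimal split of diff
        total += a_plus + a_minus
    return lam / 2 * total
```

For two different labels exactly two entries of the indicator vectors differ, so the sum is 2 and the cost is λ. For equal labels the sum is 0.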
2.3 Composite Features
So far the higher order terms depended on the number of nodes that were assigned a certain label. In a more general setting one can count any kind of features, e.g. by looking at certain pairs of nodes or certain triplet constellations. Handling pairs of nodes is particularly easy since (5) already contains variables that explicitly reflect pairwise constellations. We only have to make sure that all relevant pairs are contained in the neighborhood system N and slightly modify (4) so that the sum over node variables is replaced by a sum over the pairwise variables. In the case where the dependence is on the number of pairwise constellations (p, q) in the neighborhood system with the labels y_p = l1, y_q = l2, this reads

\[ \sum_{(p,q) \in \mathcal{N}} x_{p,q}^{l_1,l_2} = \sum_{n=0}^{|\mathcal{N}|} n \cdot z_{l_1,l_2}^n \]

Similarly, one can introduce variables x_{p,q,r}^{l1,l2,l3} that express the constellations of triplets of nodes (see e.g. [30]). One then needs to introduce the corresponding consistency constraints between the node variables and the triplet variables, and modify the above count constraints so that they sum over the new variables.
3 Optimization Strategies
A number of useful integer linear programs were presented, and we now turn to the question of how to solve them, at least approximately. Here we make use of a combination of standard integer linear programming solvers (both open source and commercial) and specialized computer vision code. The latter is integrated as plug-ins into the solvers.

3.1 Linear Programming
Standard approaches to integer linear programming start with solving the associated linear programming relaxation – the linear program that arises when dropping the integrality constraints on the variables. We adopt this scheme, relying on the standard packages. There are currently two classes of algorithms to solve linear programs. The first class is the class of simplex algorithms that find so-called basic feasible solutions. This is a prerequisite for most standard implementations of the so-called cutting planes method for integer programming. These algorithms do not have a polynomial time guarantee, but in practice they are often very efficient and memory saving. Moreover, there are very good open source implementations, e.g. Clp (http://www.coin-or.org/projects/Clp.xml). On the other hand, there are interior point methods that employ a Newton optimization scheme and come with a polynomial time guarantee. To solve the
arising linear equation systems the sparse Cholesky decomposition is used. As this involves a lot of expertise it is usually best to rely on commercial products here. These products are often faster than the simplex method, but in our experience they require more memory, up to a factor of two. Also, they generally do not give basic feasible solutions, so one has to run a procedure called crossover afterwards. We found that both methods can be useful, depending on what problem is handled.

3.2 Customized Cutting Planes
Linear programming relaxations often provide reasonably good lower bounds on the integer problem. However, it is hard to convert them into good integral solutions; simple thresholding often performs very poorly. Hence, linear programming is only the starting point of our method. Subsequently we apply two techniques. The first one is called cut generation, where a cut is nothing else than a linear inequality that is valid for all feasible integral solutions. One is interested in finding cuts that improve the current relaxation, i.e. its fractional optimum becomes infeasible when the cuts are added to the system. One says that such cuts are violated by the current relaxation. Many approaches for cut generation are known and implemented in the standard solvers, where the most generally applicable method is probably the Gomory cuts [12]. In our setting we use a specialized class of cuts that we presented in [25] and that allows violated cuts to be found very efficiently. These cuts address consistency constraints of the form
$$\sum_{i=1}^{I} x_i = \sum_{n=0}^{I} n \cdot z_n$$
together with the constraints
$$\sum_{n=0}^{I} z_n = 1 , \qquad z \ge 0 .$$
To motivate the cuts, we give a fractional solution for the case $I = 10$ where we know that $\sum_{i=1}^{10} x_i = 3$. Then $z_0 = 7/10$, $z_{10} = 3/10$ and $z_i = 0$ for all other $i$ is a feasible solution. This will indeed be the optimal solution if the represented count-cost $f(\cdot)$ is concave in $n$. Note that this includes the function $-h(n)$ that was introduced for histogram-based image segmentation in Section 2. The cuts are based on the following reasoning: if we know that all variables of a subset $C \subseteq \{x_1, \ldots, x_I\}$ of size $|C| = N$ are 1, we can conclude that the count variables $z_n$ for $n < N$ must all be 0. This can be expressed as the inequality
$$\sum_{i: x_i \in C} x_i + \sum_{n=0}^{|C|-1} z_n \le |C| .$$
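The search for a violated cut of this form can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the routine used in the experiments: the function name `find_violated_count_cut` and the tolerance `tol` are hypothetical. For a fixed size N the left-hand side is largest when C contains the N largest x-values, so one sort followed by a scan over prefix sizes suffices.

```python
def find_violated_count_cut(x, z, tol=1e-9):
    """Look for the most violated cut  sum_{i in C} x_i + sum_{n<|C|} z_n <= |C|.

    x holds the fractional values x_1..x_I (0-based list), z holds z_0..z_I.
    Returns (sorted 0-based indices of C, violation), or None if no cut of
    this class is violated by more than `tol` (hypothetical tolerance).
    """
    # Indices of x sorted by decreasing value: the best C of size N is a prefix.
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    best = None
    x_sum = 0.0  # sum of the N largest x-values
    z_sum = 0.0  # z_0 + ... + z_{N-1}
    for size in range(1, len(x) + 1):
        x_sum += x[order[size - 1]]
        z_sum += z[size - 1]
        violation = x_sum + z_sum - size
        if violation > tol and (best is None or violation > best[1]):
            best = (sorted(order[:size]), violation)
    return best


# The fractional count solution from the text: z_0 = 7/10, z_10 = 3/10.
z = [0.7] + [0.0] * 9 + [0.3]
# With three variables fixed to 1 a cut is violated ...
print(find_violated_count_cut([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], z))
# ... whereas with all x_i = 0.3 (same sum) none is.
print(find_violated_count_cut([0.3] * 10, z))
```

The second call illustrates why such cuts tend to become useful only after some variables have been fixed, e.g. by branching.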
Violated cuts are efficiently found by sorting the variables. There are exponentially many sets $C$, but in practice sufficiently few corresponding cuts are violated. There is a closely related set of cuts, derived from the fact that whenever all variables in $C \subseteq \{x_1, \ldots, x_I\}$ are 0, then the count variables $z_n$ for $n > N$ must be 0:
$$-\sum_{i: x_i \in C} x_i + \sum_{n=|C|+1}^{I} z_n \le 0 .$$
The derived classes of cuts are not sufficient to arrive at an integral solution. In fact, they are only useful for cost functions fli that have regions of concavity. Even then, we found that for the original linear programming relaxation usually none of these cuts is violated – the respective fractional solutions usually set the non-count variables to values near 0.5. Here, even the standard cut generation methods produce either no cuts or quite dense ones (with 300 or more non-zero coefficients), which soon exhausts the available memory. As a consequence, we combine the derivation of cuts with the second technique, branch-and-bound, into a branch-and-cut scheme.

3.3 Branch-and-Cut
Branch-and-cut is based on the method of branch-and-bound (e.g. [1]): the problem is hierarchically partitioned into smaller sub-problems, called nodes. At each node, the arising linear programming relaxation is solved, which gives a lower bound on the sub-problem. If the obtained solution is integral or the lower bound exceeds some known upper bound on the original problem, the node can be closed. This process will eventually find the optimum, but this may take exponential time. All used solvers allow two kinds of interaction in this scheme (usually via so-called callback-functions): firstly, once the relaxation of a node has been solved, the user can provide his/her own routine to generate cuts - we use the cuts stated above. Secondly, we can provide a routine which generates an integral solution from the fractional solution of the node. Here, for histogram-based image segmentation we include the well-known alternating minimization scheme with the help of graph cuts [6] and expansion moves [7], where the fractional solution serves as an initialization of the probabilities. We found this to produce much better solutions than the standard heuristics included in the solvers.
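To make the node-closing logic of branch-and-bound concrete, here is a minimal Python sketch on a deliberately swapped-in toy problem, the 0/1 knapsack problem, whose LP relaxation can be solved greedily without an LP solver. It is not the solver setup used in this paper: the function name is hypothetical, no cuts or primal heuristics are included, and because knapsack is a maximization problem the roles of lower and upper bounds are mirrored with respect to the minimization setting described above.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximise total value subject to the capacity (toy problem, positive weights)."""
    # Items by decreasing value density: greedy filling in this order solves
    # the LP relaxation of the knapsack problem exactly.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    best_value, best_items = 0, []

    def solve_relaxation(fixed):
        """LP relaxation of a node; `fixed` maps item -> 0/1.

        Returns (bound, fractional item or None, items set to 1),
        or None if the node is infeasible.
        """
        chosen = [i for i, bit in fixed.items() if bit == 1]
        remaining = capacity - sum(weights[i] for i in chosen)
        if remaining < 0:
            return None
        bound = sum(values[i] for i in chosen)
        for i in order:
            if i in fixed:
                continue
            if weights[i] <= remaining:
                remaining -= weights[i]
                bound += values[i]
                chosen.append(i)
            elif remaining > 0:
                # Item i is taken fractionally; all later items stay at 0.
                return bound + values[i] * remaining / weights[i], i, chosen
        return bound, None, chosen

    open_nodes = [{}]  # root node: no variable fixed
    while open_nodes:
        fixed = open_nodes.pop()
        result = solve_relaxation(fixed)
        if result is None:
            continue  # infeasible node
        bound, fractional, chosen = result
        if bound <= best_value:
            continue  # bound cannot beat the incumbent: close the node
        if fractional is None:
            best_value, best_items = bound, sorted(chosen)  # integral solution
            continue
        # Branch on the fractional variable: two child nodes.
        open_nodes.append({**fixed, fractional: 0})
        open_nodes.append({**fixed, fractional: 1})
    return best_value, best_items


print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))
```

In a branch-and-cut scheme, the cut generation and the heuristic for integral solutions would be invoked right after the relaxation of a node has been solved.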
4 Experiments

The addressed class of models allows a great variety of applications, and here we consider three of them. We experimented with the commercial solver Gurobi and the open source solver CBC (http://www.coin-or.org/projects/Cbc.xml); the results below were produced with Gurobi. All experiments were run on a 2.4 GHz Core2 Duo with 3 GB memory.

4.1 Histogram-Based Image Segmentation

We start with histogram-based image segmentation as described in example 1 above. In this case, some of the high order terms are convex, the others are concave. Since the convex functions are strictly convex we cannot reduce the size of the ILP by including inequalities. For the concave functions we tried the cut generation plug-in we described in section 3.2, but found it only mildly helpful. Hence, we include it for binary problems but (since it slows down the solver) we do not use it for multi-label problems. Further, since the problem is symmetric we strengthen the relaxation by assigning one of the pixels to region 0. In addition, we provide a plug-in to generate integral solutions from fractional ones, where we use an alternating minimization (AM) scheme, updating first the probability distributions, then the segmentation via graph cuts or expansion moves. We found that the produced integral solutions are of much higher quality than those produced by the standard methods of Gurobi (or other toolkits). Further, we run the AM scheme a priori with two different initial segmentations, given by horizontally and vertically dividing the image into K parts, where K is the number of regions. The two resulting energies usually differ significantly and we take the lower one. This solution is then passed to the solver and serves as an initial upper bound. Note that we deal with a fully unsupervised scenario, i.e. there are no seed nodes. We found that the recent work of [28] is not applicable here: parametric maxflow gives only two trivial solutions. This approach is closely related to a linear relaxation, and we found that our LP-relaxation alone is equally useless: its energy forms a reasonable lower bound, but most of the segmentation variables are set to (roughly) 0.5. This is useless for thresholding schemes.
Results. Finding the global optimum is illusory in practice, so we set a time limit of 2.5 hours. We ran our method on all 100 images of the test set of the Berkeley image database, downscaled to a resolution of 120 × 80.
In 75 cases our method was able to find a lower energy solution than the starting point (the better of two runs of AM). Figure 1 shows the cases with the most significant differences. Clearly, AM tends to find solutions with a short boundary and often stays close to its initialization. Our solutions are of lower energy and frequently more interesting as they often locate animals. Figure 2 shows images where our approach does not improve the initialization. For some of them this may well be the global optimum. We experimented with 3-region segmentation, but since no better solution than the initial one was found we omit the images.

Fig. 1. Joint histogram segmentation on images of size 120 × 80. Left image of the pair: Alternating Minimization. Right: Our method.
Fig. 2. Images where both approaches find the same result

4.2 Binary Texture Denoising

Our next application is binary texture denoising as addressed in [8], [19]. Here one first runs a training stage on a given binary texture image and extracts certain relevant probability distributions. For example, one can collect the statistics of certain pairwise constellations, i.e. look at all pairs of pixels where the second pixel is obtained by shifting the first by a fixed displacement $t$. One obtains distributions of the form
$$p_t(l_1, l_2) = p\big(I(x) = l_1,\; I(x + t) = l_2\big) .$$
Now one wants to select a set of translation vectors $t$ that characterize the given texture rather than describe a random distribution. It is well-known that this is reflected in minimal entropies
$$-\sum_{l_1 \in \{0,1\},\, l_2 \in \{0,1\}} p_t(l_1, l_2) \log p_t(l_1, l_2) ,$$
so we take the 15 translations with minimal entropies. We also tried adding the 7 most informative triple constellations, selected in the same way. We are now given a noisy gray-value image, and we look for a binarized denoised image with data terms like in [8] and V-kernels to penalize the marginal statistics [31]. The (convex) V-kernels are easily expressed in terms of two inequalities, which reduces the size of the system a little. Since the linear programming relaxation is quite strong, we subsequently solve only 10 nodes in the branch-and-cut.
Results. Figure 3 confirms that marginal probability fields [31] indeed improve on the widely used Markov Random Fields. Moreover, the Gurobi solver found the global optimum in no more than 8 minutes, suggesting that the problem may be rather easy to solve in many situations. We trained on the right part of the well-known Brodatz texture D101 and then processed a noisy version of a crop of the left part.
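The training stage described in this section (empirical pair distributions for a displacement and the selection of the displacements with minimal entropy) can be sketched as follows. This is a minimal Python illustration under our own assumptions: all function names are hypothetical, images are plain lists of rows with entries 0/1, a displacement is given as (dy, dx), and pairs that leave the image are simply skipped.

```python
import math


def pair_distribution(image, t):
    """Empirical p_t(l1, l2) of a binary image for the displacement t = (dy, dx)."""
    dy, dx = t
    height, width = len(image), len(image[0])
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    total = 0
    for y in range(height):
        for x in range(width):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < height and 0 <= x2 < width:  # skip pairs leaving the image
                counts[(image[y][x], image[y2][x2])] += 1
                total += 1
    if total == 0:
        raise ValueError("displacement is larger than the image")
    return {pair: c / total for pair, c in counts.items()}


def entropy(p):
    """Shannon entropy of a distribution given as dict outcome -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


def select_translations(image, candidates, num):
    """Return the `num` candidate displacements with the smallest entropy."""
    ranked = sorted(candidates, key=lambda t: entropy(pair_distribution(image, t)))
    return ranked[:num]


# Toy texture with constant rows: horizontal neighbours always agree,
# vertical neighbours follow an irregular pattern.
rows = [0, 0, 1, 0, 1, 1, 0, 1]
texture = [[value] * 6 for value in rows]
print(select_translations(texture, [(1, 0), (0, 1)], 1))
```

On this toy texture the horizontal displacement is preferred, since its pair distribution is concentrated on two outcomes only.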
Fig. 3. Image denoising with MRFs and MPFs (panels: noisy input, Markov Random Field, Marginal Probability Field). We manually selected the best weighting parameter for each method. Both results are globally optimal.
4.3 Completion of Binary Textures
In a related setting we are given a partial texture and want to inpaint the missing part. Figure 4 demonstrates that now MRFs and MPFs perform very differently: the Markov Random Fields choose a constant fill value, whereas the MPF tries to respect the marginal statistics.
The linear programming relaxations are still quite strong, and we follow them with 10 nodes of branch-and-cut. This time we also test ternary terms (without branch-and-cut), but it can be seen that this does not result in performance gains (and it takes much longer). The running times are 1.5 hours for binary terms and roughly 10 hours for ternary terms.
Fig. 4. Completion of partial textures (gray values indicate unknown regions) via MPFs and MRFs (via graph cuts). Panels: partial texture; completed with MRFs; completed with binary MPF; binary + ternary MPF.
5 Conclusion

We have proposed an integer linear programming framework to solve minimization problems with very high order terms depending on counts. For histogram-based image segmentation it was shown that this improves over existing alternating minimization schemes, which again are very good plug-ins for standard solvers. We furthermore showed that the recently introduced marginal probability fields can be handled and examined ternary terms. It was clearly demonstrated that in some cases this is a much more sensible approach than standard Markov Random Fields. In future work we want to explore customized strategies to solve the arising linear programs.

Acknowledgements. We thank Fredrik Kahl for helpful discussions. This work was funded by the European Research Council (GlobalVision grant no. 209480).
References

1. Achterberg, T.: Constraint Integer Programming. PhD thesis, Zuse Institut, TU Berlin, Germany (July 2007)
2. Ali, A., Farag, A., Gimel'farb, G.: Optimizing binary MRFs with higher order cliques. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 98–111. Springer, Heidelberg (2008)
3. Billionnet, A., Minoux, M.: Maximizing a supermodular pseudo-boolean function: A polynomial algorithm for supermodular cubic functions. Discrete Applied Mathematics 12(1), 1–11 (1985)
4. Blake, A., Rother, C., Brown, M., Perez, P., Torr, P.: Interactive image segmentation using an adaptive GMMRF model. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 428–441. Springer, Heidelberg (2004)
5. Boros, E., Hammer, P.: Pseudo-Boolean optimization. Discrete Applied Mathematics 123(1-3), 155–225 (2002)
6. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 26(9), 1124–1137 (2004)
7. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 23(11), 1222–1239 (2001)
8. Cremers, D., Grady, L.: Learning statistical priors for efficient combinatorial optimization via graph cuts. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 263–274. Springer, Heidelberg (2006)
9. Dantzig, G., Thapa, M.: Linear Programming 1: Introduction. Springer Series in Operations Research. Springer, Heidelberg (1997)
10. Delong, A., Osokin, A., Isack, H., Boykov, Y.: Fast approximate energy minimization with label cost. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, California (June 2010)
11. Freedman, D., Drineas, P.: Energy minimization via graph cuts: Settling what is possible. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, California (June 2005)
12. Gomory, R.: Outline of an algorithm for integer solutions to linear programs. Bulletin of the American Mathematical Society 64, 275–278 (1958)
13. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B 51(2), 271–279 (1989)
14. Ishikawa, H.: Transformation of general binary MRF minimization to the first order case. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2010 (to appear)
15. Iwata, S.: A fully combinatorial algorithm for submodular function minimization. Journal of Combinatorial Theory Series B 84(2), 203–212 (2002)
16. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: metric labeling and Markov Random Fields. In: Symposium on Foundations of Computer Science (1999)
17. Kohli, P., Kumar, M.P.: Energy minimization for linear envelope MRFs. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, California (June 2010)
18. Kohli, P., Ladický, L., Torr, P.: Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision (IJCV) 82(3), 302–324 (2009)
19. Kolmogorov, V., Rother, C.: Minimizing non-submodular functions with graph cuts – a review. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 29(7), 1274–1279 (2007)
20. Komodakis, N., Paragios, N.: Beyond pairwise energies: Efficient optimization for higher-order MRFs. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida (June 2009)
21. Kschischang, F., Frey, B., Loeliger, H.-A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47(2), 498–519 (2001)
22. Lan, X., Roth, S., Huttenlocher, D., Black, M.: Efficient belief propagation with learned higher-order Markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 269–282. Springer, Heidelberg (2006)
23. Potetz, B., Lee, T.: Efficient belief propagation for higher-order cliques using linear constraint nodes. Computer Vision and Image Understanding 112(1), 39–54 (2008)
24. Rother, C., Kohli, P., Feng, W., Jia, J.: Minimizing sparse higher order energy functions of discrete variables. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida (June 2009)
25. Schoenemann, T.: Probabilistic word alignment under the L0-norm. In: Conference on Computational Natural Language Learning (CoNLL), Portland, Oregon (June 2011)
26. Schrijver, A.: A combinatorial algorithm minimizing submodular functions in strongly polynomial time. Journal of Combinatorial Theory Series B 80(2), 346–355 (2000)
27. Tarlow, D., Givoni, I., Zemel, R.: HOP-MAP: efficient message passing with higher order potentials. In: International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy (2010)
28. Vicente, S., Kolmogorov, V., Rother, C.: Joint optimization of segmentation and appearance models. In: IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan (September 2009)
29. Wainwright, M., Jaakkola, T., Willsky, A.: MAP estimation via agreement on (hyper-)trees: Message-passing and linear programming approaches. IEEE Transactions on Information Theory 51(11), 3697–3717 (2005)
30. Werner, T.: High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska (June 2008)
31. Woodford, O., Rother, C., Kolmogorov, V.: A global perspective on MAP inference for low-level vision. In: IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan (September 2009)
32. Yedidia, J., Freeman, W., Weiss, Y.: Understanding belief propagation and its generalizations. Technical report, Mitsubishi Electric Research Laboratories (January 2002)
33. Yedidia, J., Freeman, W., Weiss, Y.: Constructing free energy approximations and generalized belief propagation. IEEE Transactions on Information Theory 51(7), 2282–2312 (2005)
34. Zach, C., Gallup, D., Frahm, J.-M., Niethammer, M.: Fast global labeling for realtime stereo using multiple plane sweeps. In: Vision, Modeling and Visualization Workshop (VMV), Konstanz, Germany (October 2008)
Globally Optimal Image Partitioning by Multicuts
Jörg Hendrik Kappes, Markus Speth, Björn Andres, Gerhard Reinelt, and Christoph Schnörr
Image & Pattern Analysis Group, Discrete and Combinatorial Optimization Group, Multidimensional Image Processing Group – University of Heidelberg, Germany

Abstract. We introduce an approach to both image labeling and unsupervised image partitioning as different instances of the multicut problem, together with an algorithm returning globally optimal solutions. For image labeling, the approach provides a valid alternative. For unsupervised image partitioning, the approach outperforms state-of-the-art labeling methods with respect to both optimality and runtime, and additionally returns competitive performance measures for the Berkeley Segmentation Dataset as reported in the literature.
Keywords: image segmentation, partitioning, labeling, multicuts, multiway cuts, multicut polytope, cutting planes, integer programming.
1 Introduction

Partitioning an image into a number of segments is a key problem in computer vision. We distinguish (i) image labeling where segments are associated with a finite number of classes (e.g., street, sky, person, etc.) based on pre-defined features, and (ii) unsupervised partitioning where no prototypical features as class representatives are available, but only pairwise distances between features. Concerning (i), partitions are determined by inference with respect to variables that take values in a finite set of labels and are assigned to the nodes of the underlying graph [1]. Accordingly, the marginal polytope has become a focal point of research on relaxations and approximate inference for image labeling [2–4]. In this paper, we focus on the image partitioning problem as a multicut problem which appears natural for unsupervised partitioning (ii) and includes image labeling (i) as a special case. Here, random variables are assigned to the edges of the underlying graph. This is appealing because in order to form a partition, edges have to adjoin in order to separate nodes properly and thus explicitly represent local shape, which can only indirectly be achieved through labelled nodes by taking differences. Clearly, edge indicator vectors have to be constrained in order to form valid partitions [5–7], and the resulting combinatorial problem is NP-hard. We demonstrate below, however, that especially for unsupervised scenarios (ii), our multicut approach enables us to efficiently compute globally optimal image partitions – see Fig. 1.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 31–44, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. (a) Image and (b) superpixels: an image preprocessed into a set of superpixels [11]. (c) λ = 0.3 and (d) λ = 1: globally optimal partition minimizing the objective function (11) for two different values of λ in an unsupervised setting, computed in less than 0.1 seconds excluding the time for computing superpixels.
A subclass of the multicut problem, the multiway cut problem, has been introduced to computer vision in [8] as a generalization of the basic min-cut/max-flow approach. For specific problems in this subclass, efficient approximative methods exist [9], and for a few special cases, e.g., for planar graphs, exact polynomial time algorithms are known [10]. Our approach applies to both scenarios (i) and (ii) sketched above and can deal with generalized Potts terms that might have negative signs. Our contributions are the following:
1. We reformulate image labeling in terms of generalized Potts models as a multicut problem (Sec. 2).
2. We provide an economical multicut formulation of the unsupervised partitioning problem. It can be based on arbitrary features and pairwise distances, and generates a hierarchy of partitions by varying a single parameter (Sec. 2).
3. We devise a three-phase iterative procedure for computing globally optimal partitions in both scenarios (i) and (ii) based on LP-relaxation, integer programming, and the cutting plane method (Sec. 3). For our LP-relaxation, a deterministic rounding procedure suggested by [12] returns a $\left(\frac{3}{2} - \frac{1}{k}\right)$-approximate integer solution of the k-multiway cut problem and improves the performance bound $2 - \frac{2}{k}$ known in computer vision so far [13].
4. We compare our approach for image labeling problems (Sec. 4). State-of-the-art methods [4, 13] are competitive concerning both optimality (though they do not provide any guarantee) and runtime. Thus, our approach only provides a competitive alternative using a different problem formulation.
5. We compare our approach for unsupervised partitioning (Sec. 4). Our optimization method clearly outperforms competing methods concerning both optimality and runtime. Concerning the Berkeley Segmentation Dataset and Benchmark (BSD), our approach is even on par with approaches that rely on edge detection (privileging them), rather than on image partitioning.
In contrast to [14], the present paper discusses a 3-phase algorithm together with a hierarchy of inequality constraints (Sec. 3) and examines experimentally optimization methods for the multicut/image partitioning problem.
2 Problem Description

2.1 Image Labeling
Given a graph $G = (V, E)$, we assign labels from a label set $L = \{1, \ldots, k\}$ to all $|V|$ nodes $v \in V$ by using node variables $x_v \in L$. A labeling $x = (x_v)_{v \in V} \in L^{|V|}$ defines a partition of $V$ into subsets of nodes $S_l$ assigned to class $l$, i.e., $\bigcup_{l \in L} S_l = V$. Furthermore, for a logical expression $\varphi$, we define an indicator function $I(\varphi)$ which is 1 if the expression is true and 0 otherwise. The cost of a labeling is the sum of the label assignment costs for each node, plus the sum of weighted edges connecting different classes which may be considered as an approximation of the weighted length of the separating boundaries:
$$J(x) = \sum_{v \in V} f_v(x_v) + \sum_{uv \in E} \beta_{uv}\, I(x_u \neq x_v). \tag{1}$$
Note that the weight of the boundary $\beta_{uv} \in \mathbb{R}$ depends only on the edge $uv$ and not on the labels assigned to $u$ and $v$. In the simplest case, all edges are treated equally, i.e., $\beta_{uv} = \hat{\beta}$ for all $uv \in E$. The function $f_v(l)$ encodes the similarity of data observed at location $v$ to class $l$. Since the number of labels is finite, we can represent the function by a vector $\theta_v \in \mathbb{R}^k$:
$$\theta_{v,l} = f_v(l), \qquad f_v(x_v) = \sum_{l \in L} \theta_{v,l}\, I(x_v = l). \tag{2}$$
A common approach to determine a labeling is to consider the combinatorial optimization problem
$$\min_{x \in L^{|V|}} \; \sum_{v \in V} f_v(x_v) + \sum_{uv \in E} \beta_{uv}\, I(x_u \neq x_v). \tag{P1}$$
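For concreteness, the energy (1) and problem (P1) can be written down directly in Python. The following is a minimal sketch under our own assumptions: the function names and the dictionary-based data layout are hypothetical, and the exhaustive search is only meant for tiny instances, since it enumerates all $k^{|V|}$ labelings.

```python
from itertools import product


def labeling_energy(labels, unary, edges):
    """J(x) = sum_v f_v(x_v) + sum_{uv in E} beta_uv * I(x_u != x_v).

    labels: dict node -> label
    unary:  dict node -> dict label -> cost f_v(l)
    edges:  dict (u, v) -> beta_uv (any sign)
    """
    data_term = sum(unary[v][labels[v]] for v in labels)
    boundary_term = sum(beta for (u, v), beta in edges.items()
                        if labels[u] != labels[v])
    return data_term + boundary_term


def brute_force_labeling(nodes, label_set, unary, edges):
    """Solve (P1) by enumeration; returns (minimal energy, one minimizer)."""
    best = None
    for assignment in product(label_set, repeat=len(nodes)):
        labels = dict(zip(nodes, assignment))
        energy = labeling_energy(labels, unary, edges)
        if best is None or energy < best[0]:
            best = (energy, labels)
    return best


# Chain a - b - c with two labels.
unary = {"a": {1: 0, 2: 2}, "b": {1: 1, 2: 1}, "c": {1: 2, 2: 0}}
edges = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(brute_force_labeling(["a", "b", "c"], [1, 2], unary, edges))
```

In this toy instance the data terms of a and c disagree, so any optimal labeling has to pay for exactly one boundary edge.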
Instead of optimizing over the set of all node labelings (node domain), we optimize over the set of all separating boundaries related to valid partitions (edge domain), which is known as the multicut problem, see Sec. 2.3.

2.2 Unsupervised Pairwise Image Partitioning
We will also study the following important variant of problem (P1):
$$\min_{x \in L^{|V|}} \; \sum_{uv \in E} \beta_{uv}\, I(x_u \neq x_v), \qquad L = \{1, \ldots, |V|\}. \tag{P2}$$
Here, in comparison to (P1), we have $f_v \equiv 0$ for all $v \in V$. Coefficients $\beta_{uv}$ may depend on data but are assumed not to depend on prototypical prior information about a fixed number of classes $L$, so that the maximum number of labels is $|V|$. Rather, only pairwise distances between data (or features) are used. To obtain a well-posed problem, the sign of $\beta_{uv}$ is not restricted. As for the image labeling problem in the previous section, we will also study solving problem (P2) by multicuts which turns out to offer an economical representation – cf. Fig. 1.
2.3 The Multicut Problem

Let $G = (V, E)$ and $\bigcup_{i=1}^{k} S_i = V$ be a partition of $V$. Then we call the edge set
$$\delta(S_1, \ldots, S_k) := \{uv \in E \mid \exists\, i \neq j : u \in S_i \text{ and } v \in S_j\} \tag{3}$$
a multicut and the sets $S_1, \ldots, S_k$ the shores of the multicut. To obtain a polyhedral representation of multicuts, we define incidence vectors $\chi(F) \in \mathbb{R}^{|E|}$ for each subset $F \subseteq E$:
$$\chi_e(F) = \begin{cases} 1, & \text{if } e \in F, \\ 0, & \text{if } e \in E \setminus F. \end{cases}$$
The multicut polytope is given by
$$\mathrm{MC}(G) := \operatorname{conv}\{\chi(\delta(S_1, \ldots, S_k)) \mid \delta(S_1, \ldots, S_k) \text{ is a multicut of } G\}. \tag{4}$$
For an overview and further details on the geometry of this and related polytopes, we refer to [5]. For given edge weights $w(e) \in \mathbb{R}$, $e \in E$, the multicut problem is to find a multicut for which the sum of the weights of cut edges is minimal. Since all vertices of the multicut polytope correspond to multicuts, this amounts to solving the linear program
$$\min_{y \in \mathrm{MC}(G)} \; \sum_{e \in E} w(e)\, y_e. \tag{P3}$$
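The definitions (3) and (4) and the objective of (P3) translate directly into code. The following Python sketch is only an illustration; the function names and the representation of edges as an ordered list of node pairs are our own assumptions.

```python
def multicut_of_partition(edges, shores):
    """delta(S_1, ..., S_k): edges whose endpoints lie in different shores."""
    shore_of = {v: i for i, shore in enumerate(shores) for v in shore}
    return {(u, v) for (u, v) in edges if shore_of[u] != shore_of[v]}


def incidence_vector(edges, cut):
    """chi(F) for an edge subset F, ordered like the list `edges`."""
    return [1 if e in cut else 0 for e in edges]


def multicut_cost(edges, weights, cut):
    """Objective of (P3): sum of w(e) * y_e with y = chi(cut)."""
    return sum(w * y for w, y in zip(weights, incidence_vector(edges, cut)))


# A 4-cycle split into the two shores {0, 1} and {2, 3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cut = multicut_of_partition(edges, [{0, 1}, {2, 3}])
print(incidence_vector(edges, cut))
print(multicut_cost(edges, [1, -2, 1, 3], cut))
```

Note that every cycle of the example contains either zero or at least two cut edges, as it must for any multicut.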
In order to apply linear programming techniques, we have to represent MC(G) as intersection of half-spaces given by a system of affine inequalities. Since the multicut problem is NP-hard [15], we cannot expect to find a system of polynomial size. But, as we will see later, partial systems may be very helpful to solve the multicut problem. Before discussing how problem (P3) can be solved efficiently, we will show how the problems (P1) and (P2) can be transformed into problem (P3).

2.4 Image Labeling as Multicut Problem
To write problem (P1) as a multicut problem, we use its defining graph $G$ and introduce $k$ additional terminal nodes $T = \{t_1, \ldots, t_k\}$. Then we define the graph $G' = (V', E')$ by
$$V' = V \cup T, \qquad E' = E \cup \{(t, v) \mid t \in T, v \in V\} \cup \{(t_i, t_j) \mid 1 \le i < j \le k\}.$$
Each node $v \in V$ is connected to all terminal nodes $t \in T$. The terminal nodes represent the $k$ labels, and label $l$ is assigned to variable $x_v$, $v \in V$, if edge $t_l v$ is not part of the multicut. Since we want to assign only a single label to each variable, $k - 1$ edges joining node $v$ and the terminal nodes have to be part of the multicut. Let $E$ be the matrix of all ones and $I$ be the identity matrix, both of size $k \times k$. Then the weights $w(t_l v)$, $l \in L$, $v \in V$, are given by
$$\begin{pmatrix} f_v(1) \\ \vdots \\ f_v(k) \end{pmatrix} = (E - I) \begin{pmatrix} w(t_1 v) \\ \vdots \\ w(t_k v) \end{pmatrix}, \qquad (E - I)^{-1} = \frac{1}{k-1} E - I, \tag{5}$$
so as to represent problem (P1). For edges $uv \in E$, we use $w(uv) = \beta_{uv}$. Edges between terminal nodes have the weight $-\infty$ to enforce that all terminal nodes belong to different shores. Note that this can also be considered as a multiway cut problem [8].

Fig. 2. Construction of $G' = (V', E')$ for a 4 × 4 grid for the supervised case with $L = \{1, 2, 3\}$ and terminals $t_1, t_2, t_3$ (a: multicut graph for (P1)) and the unsupervised case (b: multicut graph for (P2)). Red edges are part of the multicut, i.e., they separate shores. Blue edges join nodes of the same shore of the partition.

For the unsupervised partitioning problem (P2), we would have to add $|V|$ terminal nodes and $|V|^2 + \frac{|V| \cdot (|V| - 1)}{2}$ edges. As shown in [6], we can remove the terminal nodes from our graph without changing the optimal partition if the maximal number of possible shores is the number of nodes. This observation is crucial since it reduces the number of variables in (P3) to $|E|$. Thus, to represent (P2) as a multicut problem (P3), we just use the graph $G$ defining (P2), i.e., $T = \emptyset$. As before, we use $w(uv) = \beta_{uv}$ for $uv \in E$.
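The terminal edge weights of Eq. (5) can be checked numerically. The sketch below is a minimal Python illustration with hypothetical function names; it uses the closed form $(E - I)^{-1} = \frac{1}{k-1}E - I$, i.e. $w(t_l v) = \frac{1}{k-1}\sum_j f_v(j) - f_v(l)$, and exact rational arithmetic so that the round trip can be compared without rounding errors.

```python
from fractions import Fraction


def terminal_weights(f):
    """Weights w(t_l v) for one node v from its label costs f = [f_v(1), ..., f_v(k)].

    Implements w = (E/(k-1) - I) f, which requires k >= 2 labels.
    """
    k = len(f)
    mean_term = Fraction(sum(f)) / (k - 1)
    return [mean_term - Fraction(cost) for cost in f]


def cost_of_label(w, label):
    """Total weight of the k-1 terminal edges that are cut when `label`
    (0-based) is assigned, i.e. all terminal edges except the one to t_label."""
    return sum(weight for j, weight in enumerate(w) if j != label)


f = [3, 1, 4]
w = terminal_weights(f)
print(w)
print([cost_of_label(w, l) for l in range(len(f))])
```

Cutting all terminal edges of a node except the one to $t_l$ thus reproduces exactly the label cost $f_v(l)$, which is what the construction requires.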
3 Finding an Optimal Multicut

3.1 Linear Programming Formulations
Finding a minimal cost multicut is NP-hard in general [15]. However, since images induce a certain structure, there is some hope that the problems are easier to solve in practice than problems without any structure. We use a cutting plane approach to iteratively tighten an outer relaxation of the multicut polytope. In each step we solve a problem relaxation in terms of a linear program, detect violated constraints from a pre-specified finite list, and augment the constraint system accordingly. This procedure is repeated until no more violated constraints are found. After each iteration we obtain a lower bound as the solution of the LP and an upper bound by mapping the obtained solution to the multicut polytope (rounding, see Sec. 3.2). Finally, if the relaxed solution is not integral, we use integer linear programming and again add violated constraints after each round of optimization. Overall, we optimize the following integer linear program:
$$\begin{aligned}
\min_{y \in [0,1]^{|E'|}} \quad & \sum_{e \in E'} w(e)\, y_e && && \text{(6a)} \\
\text{s.t.} \quad & \sum_{t \in T} y_{(t,v)} = (k - 1) \cdot I(T \neq \emptyset) && \forall v \in V && \text{(6b)} \\
& y_{tu} + y_{tv} \ge y_{uv} && \forall uv \in E,\ t \in T && \text{(6c)} \\
& y_{tu} + y_{uv} \ge y_{tv} && \forall uv \in E,\ t \in T && \text{(6d)} \\
& y_{tv} + y_{uv} \ge y_{tu} && \forall uv \in E,\ t \in T && \text{(6e)} \\
& y_{uv} \ge \sum_{t \in S} (y_{tu} - y_{tv}) && \forall uv \in E,\ S \subseteq T && \text{(6f)} \\
& \sum_{e \in C \setminus \{e'\}} y_e \ge y_{e'} && \forall \text{ cycles } C \subseteq E,\ e' \in C && \text{(6g)} \\
& y_e \in \{0, 1\} && \forall e \in E' && \text{(6h)}
\end{aligned}$$
Note that not every $y \in \{0, 1\}^{|E'|}$ lies inside the multicut polytope. As shown in [7], Lemma 2.2, $y$ is a vertex of the multicut polytope if and only if
$$\sum_{e \in C} y_e \neq 1 \qquad \forall \text{ cycles } C \subseteq E', \tag{7}$$
i.e., there exist no active edges inside a shore. If $T$ is not empty, (6b)–(6e) implies (7) [7] and if $T$ is empty, (7) is equivalent to (6g). Therefore, any $y$ that satisfies (6b)–(6h) is a vertex of the multicut polytope. Later, we will also consider the linear programming relaxation (6a)–(6f) introduced in [12].

3.2 Rounding Fractional Solutions
As pointed out by Călinescu et al. [12], the integrality ratio of the relaxed LP (6a)–(6f) is $\frac{3}{2} - \frac{1}{k}$. This is superior to the α-expansion algorithm and the work of Dahlhaus et al. [10], which guarantee only a ratio of $2 - \frac{2}{k}$. However, while derandomized rounding procedures as suggested in [12] provide optimality bounds, they may perform worse than simple heuristics. We therefore proceed as follows. For problem (P1), we obtain a partition by assigning each node to the terminal with minimal edge costs:
$$x_a = \arg\min_{l \in L} y_{t_l a}, \qquad \forall a \in V. \tag{8}$$
For problems of type (P2), we determine the connected components by a disjoint-set (union-find) structure in $O(|V| + |E|)$ and assign a single label to each connected component. In short, both mappings transform a vector $y \in [0, 1]^{|E'|}$ into a partition that in turn implies a valid multicut vector $y' \in \mathrm{MC}(G)$.
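Both rounding maps can be sketched compactly in Python. The code below is an illustration under our own assumptions rather than the implementation used in the experiments: the function names are hypothetical, and for (P2) the text does not state which edges of a fractional solution are treated as joining their endpoints, so the threshold parameter (edges with $y_e$ below 0.5 are merged) is an assumption of this sketch.

```python
def round_labeling(y_terminal):
    """Eq. (8): assign each node to the terminal with minimal edge value.

    y_terminal: dict node -> [y_{t_1 v}, ..., y_{t_k v}]; labels are 1-based.
    """
    return {v: min(range(len(ys)), key=lambda l: ys[l]) + 1
            for v, ys in y_terminal.items()}


def round_partition(nodes, y, threshold=0.5):
    """Connected components via union-find; returns (labels, rounded y).

    Edges with y_e < threshold are treated as uncut (assumed rule).
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v), value in y.items():
        if value < threshold:
            parent[find(u)] = find(v)

    component_label = {}
    labels = {}
    for v in nodes:
        root = find(v)
        if root not in component_label:
            component_label[root] = len(component_label) + 1
        labels[v] = component_label[root]
    # The partition induces a valid multicut vector.
    y_rounded = {(u, v): int(labels[u] != labels[v]) for (u, v) in y}
    return labels, y_rounded


print(round_labeling({"a": [0.2, 0.9, 0.9], "b": [1.0, 0.0, 1.0]}))
# A 0/1 vector that is not a multicut: one active edge inside a shore.
print(round_partition([0, 1, 2], {(0, 1): 0.0, (1, 2): 0.0, (0, 2): 1.0}))
```

In the second example the edge (0, 2) is active although its endpoints are connected via uncut edges; the rounding merges all three nodes and returns the all-zero multicut vector.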
3.3 Finding Violated Constraints
Starting with the linear program (6a)–(6b), we add violated constraints and re-optimize the new LP. If its solution still violates constraints, we repeat the current phase; otherwise we continue with the next one.

Phase 1: Given the optimal solution for the current LP, we check (6c)–(6e) for violated constraints to be added. This requires $3 \cdot |E| \cdot |T|$ checks per iteration. If no violated constraint is found, we have the optimal solution of the LP-relaxation (6a)–(6e)¹ and continue with phase 2.

Phase 2: We search for violated constraints of the form (6f). The number of these constraints is exponential in $|T|$, but they can be represented in polynomial size using slack variables [12]. To avoid additional slack variables, we include in each round, for each $uv \in E$, only the subset $S$ corresponding to the most violated constraint. If no violated constraint is found, we have determined the optimal solution of the LP-relaxation (6a)–(6f)¹ and continue with phase 3.

Phase 3: We switch to the integer program by including (6h) and check (6g) and (6c)–(6e) for violated constraints. If none exist, we have found the integer solution of (6). Otherwise, the current solution is outside the multicut polytope. In this case, we calculate a mapping to a vertex of the multicut polytope as described in Sec. 3.2 to obtain a partition of $V$. When checking for (6g), we consider without loss of generality only edges $uv \in E$ for which $y_{uv} = 1$ and check whether this edge is consistent with the partition. If not, it is an active edge inside a shore. We then compute the shortest path from $u$ to $v$ in the shore by breadth-first search and add the corresponding constraint to our ILP. It is well known that if the cycle is chordless, the constraint is facet-defining. If there is a chord, the constraint is not facet-defining with respect to the multicut polytope, but it is still valid and may be useful and facet-defining with respect to the polytope relaxation.

Consequently, the constraints (6c)–(6e) are facet-defining by construction and the constraints (6g) can be facet-defining. In cases where no terminal nodes are included (cf. Sec. 2.2), the constraint set (6b)–(6f) is empty and we can start directly with phase 3. Of course, it is also possible to start with a relaxation and add constraints of the form (6g) to the relaxed problem, but then (i) shortest paths have to be computed for all edges $e$ with $y_e > 0$, i.e., usually for all edges, and (ii) the shortest path search can no longer be performed by breadth-first search, so that more time-consuming methods have to be used.
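The separation step of phase 3 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: we assume that the partition is given by the connected components of the uncut edges ($y_e = 0$), that the breadth-first search runs over those uncut edges, and that (6g) has the form of the usual cycle inequality $y_{uv} \le \sum_{e \in P} y_e$ for a path $P$ from $u$ to $v$. The function names are hypothetical.

```python
from collections import deque

def components(n, edges, y):
    """Label connected components using only uncut edges (y == 0).
    Assumption: this plays the role of the partition from Sec. 3.2."""
    adj = {v: [] for v in range(n)}
    for (u, v) in edges:
        if y[(u, v)] == 0:
            adj[u].append(v)
            adj[v].append(u)
    label = [-1] * n
    for s in range(n):
        if label[s] != -1:
            continue
        label[s] = s
        queue = deque([s])
        while queue:
            a = queue.popleft()
            for b in adj[a]:
                if label[b] == -1:
                    label[b] = s
                    queue.append(b)
    return label, adj

def shortest_path(adj, src, dst):
    """Breadth-first search; returns the node sequence src..dst or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        a = queue.popleft()
        if a == dst:
            path = []
            while a is not None:
                path.append(a)
                a = parent[a]
            return path[::-1]
        for b in adj[a]:
            if b not in parent:
                parent[b] = a
                queue.append(b)
    return None

def violated_cycle_constraints(n, edges, y):
    """For every active edge (y == 1) inside a shore, return the edge and
    the path edges of the cycle inequality that it violates."""
    label, adj = components(n, edges, y)
    violated = []
    for (u, v) in edges:
        if y[(u, v)] == 1 and label[u] == label[v]:
            path = shortest_path(adj, u, v)
            violated.append(((u, v), list(zip(path[:-1], path[1:]))))
    return violated
```

On a 4-cycle in which only one edge is active, the active edge lies inside a single shore, and the sketch returns the three remaining edges as the path of the violated inequality.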
¹ If the solution is integral, this is the solution of the complete ILP (6).

4 Experiments

We propose two algorithms: the multicut algorithm MCA, which optimizes (6), and MCA-LP (MCA until phase 2), which solves the LP-relaxation (6a)–(6f). We compare them with three state-of-the-art algorithms:

ILP-N: The commercial integer linear program solver CPLEX 12.1 is used to solve the integer problem in the node domain, i.e., the LP-relaxation over the local polytope [2] with integer constraints. This method guarantees global optimality and, owing to the progress of ILP solvers in recent years, it is applicable to small problems, but it does not scale. We refer to this method in the following as ILP-N (ILP in node domain).

TRW-S: For models with grid structure, we use the tree-reweighted message passing code from the Middlebury benchmark [16]; for other structures, we use an alternative code provided by the author of TRW-S [4]. Since this code provides no stopping criterion, we run the algorithm for a sufficiently large but fixed number of iterations. When we measure the runtime of TRW-S, we consider the iteration in which the best lower bound was first obtained.

α-expansion: The α-expansion algorithm [13] is used if the model includes no negative Potts terms, i.e., if $\beta_{uv} \ge 0$. Again, we use the implementation available in the Middlebury benchmark [16] provided by the corresponding authors.

All code is written in C/C++ and compiled with the same compiler and flags; experiments are performed on a standard desktop computer with a Pentium Dual processor (2.00 GHz) without multi-threading. The subproblems in each iteration of MCA and MCA-LP are solved by the commercial solver CPLEX 12.1 using warm-start.

Synthetic Problems: For an evaluation of the influence of different parameters, we generate synthetic $N \times N$ grid models and vary the width of the grid ($N$), the number of labels ($k$), and the coupling strength ($\lambda$). The corresponding energy function has the form
$$J(x) = (1 - \lambda) \sum_{v \in V} \sum_{l \in L} \theta_{v,l}\, I(x_v = l) + \lambda \sum_{uv \in E} \beta_{uv}\, I(x_u \neq x_v) \qquad (9)$$
where $\theta_{v,l}$ for all $v \in V$ and $l \in L$, and $\beta_{uv}$ for all $uv \in E$, are sampled uniformly from $[-1, 1]$. The coupling strength $\lambda$ adjusts the influence of the pairwise terms relative to the unary ones and is selected from $[0, 1]$. Note that since $\beta_{uv}$ can be negative, common approximations for the multiway cut problem [13] cannot be applied. Fig. 3 shows the influence of changing the parameters on the mean relative optimality gap of the rounded integer solution and bound of TRW-S and MCA-LP, as well as the mean runtimes for the compared methods. Reported numbers are averaged over 10 sampled models per setting. The maximal number of iterations for TRW-S was set to 5000. Our method is faster and requires less memory than ILP-N. The objective of ILP-N has $|V| \cdot k + |E| \cdot k^2$ variables, while MCA has only $|V| \cdot k + |E| + \frac{k(k-1)}{2}$. Furthermore, we keep the number of required constraints low by using the cutting plane scheme. In contrast to TRW-S, our method is able to compute the global optimum in all cases. However, with an increasing number of variables the runtime of MCA increases faster than that of TRW-S.

Fig. 3. Synthetic Problems (P1) We generate synthetic data according to (9). We vary the image width $N$, the number of labels $k$, and the coupling strength $\lambda$, and set the other two values to the default parameters $N = 8$, $k = 4$, and $\lambda = 0.5$. The top row shows the relative optimality gap $(J(x) - J(x_{opt}))/J(x_{opt})$ for integer solutions (solid) and lower bounds (dashed) for MCA-LP and TRW-S. The bottom row shows runtimes in seconds for MCA, MCA-LP, ILP-N, and TRW-S. MCA and ILP-N always return a globally optimal solution; TRW-S and MCA-LP return a rounded integer solution and a lower bound. While the runtime for ILP-N increases in all cases, MCA scales quite well. However, with an increasing number of variables the runtime of MCA grows faster than that of TRW-S. MCA-LP gives results similar to those of TRW-S but produces slightly worse integer solutions, since TRW-S uses more advanced rounding methods.

Image Labeling: We use the four-color images that were introduced in [17]. They contain segment boundaries in all directions and points in which three classes meet. We add Gaussian noise with variance 1 to each of the three color channels independently and use the $\ell_1$-norm of the difference between pixel color ($I_v$) and class color ($C_l$) as unary data term. As regularizer, we use a Potts prior, which for indicator functions provides an anisotropic approximation of the total variation (TV) measure:
$$J(x) = \sum_{v \in V} \sum_{l \in L} \| I_v - C_l \|_1 \cdot I(x_v = l) + \lambda \sum_{uv \in E} I(x_u \neq x_v). \qquad (10)$$
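To make the labeling energy (10) concrete, the following sketch evaluates it for a given labeling. It is an illustration only: we assume a 4-connected pixel grid, images stored as nested lists of RGB tuples, and hypothetical function names; none of this is taken from the authors' code.

```python
import random

def add_gaussian_noise(image, rng):
    """Add independent Gaussian noise with variance 1 to every channel."""
    return [[tuple(ch + rng.gauss(0.0, 1.0) for ch in px) for px in row]
            for row in image]

def potts_energy(image, labels, class_colors, lam):
    """Evaluate J(x) of Eq. (10) on a 4-connected grid (assumed neighborhood).
    image[r][c] is an RGB tuple, labels[r][c] an index into class_colors."""
    rows, cols = len(labels), len(labels[0])
    unary = 0.0
    cut = 0
    for r in range(rows):
        for c in range(cols):
            color = class_colors[labels[r][c]]
            # l1-norm between pixel color and the color of the assigned class
            unary += sum(abs(p - q) for p, q in zip(image[r][c], color))
            # Potts term: count label changes to the right and lower neighbor
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                cut += 1
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                cut += 1
    return unary + lam * cut
```

For a noise-free 2 × 2 image with two vertical stripes, the true labeling pays only for the two cut edges, whereas a constant labeling pays the full unary cost of the mislabeled column.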
We generate 50 noisy images, as illustrated in Fig. 4, for the different image sizes shown in Tab. 1. For a reconstruction, we minimize the energy function (10) with $\lambda = 0.5$ by MCA, ILP-N, MCA-LP, TRW-S, and α-expansion. The number of globally optimal integer solutions, the mean integrality gap², and the runtime for these algorithms are reported in Tab. 1. While ILP-N and MCA always find the globally optimal solution, MCA does this much faster. TRW-S finds better solutions than MCA-LP and α-expansion, but all three fail to find the optimal integer solution for larger problems. However, from the practical point of view, the quality is similar, see Fig. 4. In Fig. 5, we illustrate the 3-phase optimization of MCA. For the particular image of width 128, it requires 4.76 seconds for phase one, 0.47 seconds for phase two, and 5.42 seconds for phase three.

² The integrality gap is the gap between the calculated and optimal integer solution.

Fig. 4. Image Labeling (P1) Panels: (a) ground truth, (b) noisy image, (c) MCA, (d) α-expansion. Denoising of the noisy data is done by minimizing an energy function. Here we show the optimal result found by our multicut method and the result of α-expansion. While both look similarly good, α-expansion has not found the global optimum. The reconstructions differ in 110 pixels, e.g., at the border between the blue and the red rectangle.

Table 1. Image Labeling (P1) Results for the labeling problem with synthetic four-color images of size $N \times N$. # denotes the number of optimal solutions found and i-gap is the average integrality gap of 50 runs. MCA and ILP-N guarantee to find the optimal integer solution. MCA makes use of the problem structure, is significantly faster than ILP-N, and requires less memory. MCA-LP, TRW-S, and α-expansion are approximate methods and do not guarantee optimal solutions. However, they are much faster and, after rounding if needed, return integer solutions close to optimality.
      | MCA           | ILP-N         | MCA-LP                  | TRW-S                   | α-expansion
N     | #   time (s)  | #   time (s)  | #   i-gap   time (s)    | #   i-gap   time (s)    | #   i-gap   time (s)
16    | 50    0.05    | 50    0.25    | 46   0.03     0.03      | 47   0.01     0.01      | 13   0.26     0.01
32    | 50    0.25    | 50    0.83    | 38   0.12     0.16      | 38   0.05     0.05      |  0   1.01     0.01
64    | 50    1.25    | 50    3.59    | 40   0.07     0.82      | 44   0.02     0.16      |  0   2.28     0.05
128   | 50   13.50    | 50   27.59    | 10   0.80     5.28      | 22   0.14     1.51      |  0   6.15     0.27
192   | 50   35.72    | 50   89.39    |  5   1.20    15.33      |  9   0.27     4.44      |  0  12.11     0.67
256   | 50   86.20    | 50  209.04    |  2   1.67    33.45      |  1   0.40     9.76      |  0  19.80     1.23
320   | 50  156.59    | 50  587.18    |  0   2.14    61.92      |  1   0.53    15.76      |  0  28.95     1.92
Fig. 5. Image Labeling (P1) As an example of the behavior of our multicut algorithm, we show the bounds for a four-color image of width 128. MCA-LP already leads to useful results. If an integrality gap remains after simple rounding, we enforce a Boolean solution by integer constraints (MCA). This leads to better results than TRW-S and α-expansion but requires more runtime (dotted lines show objectives after termination).

Unsupervised Image Partitioning: Finally, we consider the case when the number of parts into which the image should be segmented is unknown and no data term for a single pixel label is given. In this case, distances between local features codetermine the edge weights, and large distances vote for including the corresponding edges in the multicut. As a counterpart to this term, we force the total length of the boundary between segments to be small by adding a total variation term. Instead of working at the pixel level, we suggest working on superpixels. This has several advantages: first, it makes the model robust to pixel noise, and second, it prunes the search space. On the other hand, using superpixels is somewhat critical, since decisions made by a superpixel segmentation are irreversible. Therefore it is important to avoid an under-segmentation, i.e., each edge between segments should be an edge between superpixels. For our simple model, we use the publicly available code of Mori [11] to generate a superpixel segmentation of the image. We apply the default parameter values and omit any further data-specific tuning. As similarity measure between superpixels, we use the $\ell_2$-distance of the mean RGB colors. We denote the mean color of the superpixel $v$ by $I_v$ and the length of the boundary between two superpixels $u$ and $v$ by $l_{uv}$. Our objective function is

$$J(x) = -\sum_{uv \in E} \big( \| I_u - I_v \|_2 \cdot l_{uv} \cdot I(x_u \neq x_v) \big) + \lambda \cdot \sum_{uv \in E} \big( l_{uv} \cdot I(x_u \neq x_v) \big). \qquad (11)$$
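Both sums in (11) run over the same edges, so the objective can be read as a multicut objective with one weight $l_{uv}(\lambda - \|I_u - I_v\|_2)$ per edge, which becomes negative for dissimilar superpixels. The sketch below evaluates this form; the dictionary-based data layout and the function names are assumptions made for illustration.

```python
import math

def edge_weight(color_u, color_v, boundary_len, lam):
    """Combined weight of edge uv in Eq. (11): l_uv * (lam - ||I_u - I_v||_2)."""
    return boundary_len * (lam - math.dist(color_u, color_v))

def partition_energy(mean_color, boundary, labels, lam):
    """J(x) of Eq. (11).
    mean_color: {superpixel: RGB tuple}, boundary: {(u, v): l_uv},
    labels: {superpixel: segment id}. Only cut edges contribute."""
    return sum(edge_weight(mean_color[u], mean_color[v], l_uv, lam)
               for (u, v), l_uv in boundary.items()
               if labels[u] != labels[v])
```

In a three-superpixel example where superpixel 1 differs strongly in color from superpixels 0 and 2, separating it from the other two gives a lower energy than either merging everything or splitting all three.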
We illustrate this approach in Fig. 6. Starting from the superpixel representation, we show segmentations for three different values of $\lambda$. While MCA can deal with the maximal number of labels $k = |V|$, we set the number of labels for TRW-S sufficiently high, here $k = 100$. For the plots in the bottom row, we set $k$ to the optimal number of segments calculated by MCA and run TRW-S. Even in this case, where the number of segments is already given, TRW-S does not find the optimal solution. Note that α-expansion cannot be used since some edge weights are negative.

Fig. 6. Unsupervised Partitioning (P2) Panels: (a), (c) image; (b), (d) superpixels; (e), (g) MCA; (f), (h) TRW-S. To deal with pixel noise and scale to large images, we use a superpixel representation (b) and (d) of the images and apply a simple model to join superpixels based on their similarity and border length, cf. (11). We compare our method MCA with TRW-S and restrict the number of labels for TRW-S to a sufficiently large number, here $k = 100$. MCA can deal with the maximal number of superpixels $k = |V|$ and selects the optimal number implicitly in the optimization process. The images show the resulting segmentation from low (top) to high (bottom) values of $\lambda$. TRW-S never finds the global optimum and tends to include additional segments. If we set the number of labels for TRW-S for a fixed $\lambda$ (corresponding to the middle segmentation) to the optimal number of segments found by MCA, TRW-S is still not able to solve this problem and converges to a non-optimal fixed point, as shown in (i) and (j). Note that if we increase the number of labels, TRW-S becomes significantly slower.

Segmentation Results on the Berkeley Segmentation Dataset: We apply the proposed method to the Berkeley Segmentation Dataset (BSD) [18]. Instead of the simple model above, more complex features are used and edge weights are calculated by a random forest, see [14] for details. From the optimization point of view, this does not make any difference. At the time of writing, in terms of the quality of the partitioning as measured by the F-score [18], in the setting where the same (optimal) parameterization of algorithms is used for all images, our method [14] (F = 0.67, Pre = 0.64, Rec = 0.74) is on a par with [19] (F = 0.67, Pre = 0.66, Rec = 0.69), with a higher recall but lower precision. Note that no other algorithm that produces closed contours has a better F-score. Pure boundary detectors that need not produce closed contours [20, 21] still have a higher F-score. From the viewpoint of polyhedral theory, these solutions lie outside the multicut polytope and do not correspond to any partition.

Fig. 7. Unsupervised Partitioning (P2) Exemplary results on the BSD
5 Conclusions
We present an image partitioning framework for supervised and unsupervised scenarios together with a novel optimization algorithm (MCA) that solves these problems to optimality. We show that this framework is appealing and that MCA outperforms state-of-the-art optimization methods in the unsupervised case, i.e., when no unary data term is included. In general, it provides a more compact linear program than methods working in the node domain, i.e., it has fewer variables. MCA calculates an optimal solution by using cutting plane and integer programming techniques and, in its variant MCA-LP, an approximate solution by solving a polynomial-size LP. Even without any post-processing, our results on the Berkeley Segmentation Dataset and Benchmark (BSD) are on par with the best-performing methods that ensure closed contours.

Acknowledgement. This work has been supported by the German Research Foundation (DFG) within the programme “Spatio-/Temporal Graphical Models and Applications in Image Analysis”, grant GRK 1653.
References

1. Kleinberg, J., Tardos, É.: Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In: FOCS (1999)
2. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. FTML 1, 1–305 (2008)
3. Sontag, D., Jaakkola, T.: New outer bounds on the marginal polytope. In: NIPS (2007)
4. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. TPAMI 28, 1568–1583 (2006)
5. Deza, M., Grötschel, M., Laurent, M.: Complete descriptions of small multicut polytopes. In: Applied Geometry and Discrete Mathematics: The Victor Klee Festschrift. American Mathematical Society, Providence (1991)
6. Chopra, S., Rao, M.R.: On the multiway cut polyhedron. Networks 21, 51–89 (1991)
7. Chopra, S., Rao, M.R.: The partition problem. Mathematical Programming 59, 87–115 (1993)
8. Boykov, Y., Veksler, O., Zabih, R.: Markov random fields with efficient approximations. In: CVPR (1998)
9. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? TPAMI 26, 147–159 (2004)
10. Dahlhaus, E., Johnson, D.S., Papadimitriou, C.H., Seymour, P.D., Yannakakis, M.: The complexity of multiway cuts (extended abstract). In: STOC (1992)
11. Mori, G.: http://www.cs.sfu.ca/~mori/research/superpixels/
12. Călinescu, G., Karloff, H., Rabani, Y.: An improved approximation algorithm for multiway cut. JCSS 60, 564–574 (2000)
13. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. TPAMI 23, 1222–1239 (2001)
14. Andres, B., Kappes, J.H., Beier, T., Köthe, U., Hamprecht, F.: Probabilistic image segmentation with closedness constraints (submitted to ICCV 2011)
15. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York (1979)
16. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. TPAMI 30, 1068–1080 (2008)
17. Lellmann, J., Kappes, J., Yuan, J., Becker, F., Schnörr, C.: Convex multi-class image labeling by simplex-constrained total variation. In: Tai, X.-C., Mørken, K., Lysaker, M., Lie, K.-A. (eds.) SSVM 2009. LNCS, vol. 5567, pp. 150–162. Springer, Heidelberg (2009)
18. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV (2001)
19. Arbeláez, P.: Boundary extraction in natural images using ultrametric contour maps. In: CVPRW (2006)
20. Maire, M., Arbeláez, P., Fowlkes, C., Malik, J.: Using contours to detect and localize junctions in natural images. In: CVPR (2008)
21. Ren, X.: Multi-scale improves boundary detection in natural images. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 533–545. Springer, Heidelberg (2008)
A Fast Solver for Truncated-Convex Priors: Quantized-Convex Split Moves

Anna Jezierska¹, Hugues Talbot¹, Olga Veksler², and Daniel Wesierski³

¹ Université Paris-Est, Laboratoire d’Informatique Gaspard-Monge, France, {first.last}@univ-paris-est.fr
² University of Western Ontario, Canada, [email protected]
³ Telecom SudParis, France, [email protected]

Abstract. This paper addresses the problem of minimizing multilabel energies with truncated convex priors. Such priors are known to be useful but difficult and slow to optimize because they are not convex. We propose two novel classes of binary Graph-Cuts (GC) moves, namely the convex move and the quantized move. The moves are complementary. To significantly improve efficiency, the label range is divided into even intervals. The quantized move tends to efficiently put pixel labels into the correct intervals for the energy with truncated convex prior. Then the convex move assigns the labels more precisely within these intervals for the same energy. The quantized move is a modified α-expansion move, adapted to handle a generalized Potts prior, which assigns a constant penalty to arguments above some threshold. Our convex move is a GC representation of Murota’s efficient algorithm. We assume that the data terms are convex, since this is a requirement for Murota’s algorithm. We introduce the Quantized-Convex Split Moves algorithm, which minimizes energies with truncated priors by alternating both moves. This algorithm is a fast solver for labeling problems with a high number of labels and convex data terms. We illustrate its performance on image restoration.

Keywords: Graph cuts, image restoration, non-convex prior, Potts model, Murota’s algorithm.
This work was supported by the Agence Nationale de la Recherche under grant ANR-09-EMER-004-03.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 45–58, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

We consider the well-known combinatorial optimization problem defined as follows. Let $G(\mathcal{V}, \mathcal{E})$ be an undirected graph with a set of edges $\mathcal{E}$ and a set of vertices $\mathcal{V}$. The goal of our optimization problem is to restore an unknown $x^*$ based on observations $\bar{x}$, under the condition that $x$ takes values over a finite set of labels $\mathcal{L}$, representing e.g. grey-level values in an image. Here we define $\mathcal{L}$ as an ordered discrete set of labels $\{0, 1, \dots, L\}$ and $x_u$ as the label assigned to node $u \in \mathcal{V}$. The unknown $x^*$ is a minimum argument of the energy function:

$$E(x) = \sum_{u \in \mathcal{V}} D(x_u) + \lambda \sum_{(u,v) \in \mathcal{E}} R(x_u, x_v), \qquad (1)$$

where $\lambda$ is a positive real value. $D(x_u)$ is often called the data fidelity term and $R(x_u, x_v)$ the regularization or smoothness term. A common choice of data term $D$ is a pixelwise distance $D = |x_u - \bar{x}_u|^p$ between the desired labeling $x$ and a reference $\bar{x}$, representing noisy acquired data, where $p$ is a small positive integer, e.g. 1 or 2. Many choices of $R$ lead to useful algorithms and results. A common model is the so-called Potts model, where $R(x_u, x_v) = w_{uv} \min(1, |x_u - x_v|)$, and $w_{uv}$ are spatially variant positive pairwise weights. This model corresponds to a piecewise constant prior. Other choices for $R$ include $R = w_{uv} |x_u - x_v|^q$, where $q$ is typically 1 or 2 for linear and quadratic priors, respectively. The latter represents an “everywhere smooth” prior with good denoising properties and no staircase effect in the result, but with blurred boundaries. Better preservation of boundaries can be achieved with the regularization term $R = w_{uv} \min(T^q, |x_u - x_v|^q)$, which for $q = 1$ or $q = 2$ is called truncated linear or truncated quadratic, respectively [1]. More generally, a pairwise truncated convex prior can be formulated as:

$$R(x_v, x_u) = \begin{cases} f(x_u - x_v) & \text{if } |x_u - x_v| < T \\ f(T) & \text{if } |x_u - x_v| \ge T \end{cases} \qquad (2)$$

where $f$ is a convex function with $f(0) = 0$. Discrete random field models characterized by such a prior are well known and extensively discussed in the literature. Their popularity in low-level vision is due to their ability to capture natural image statistics [2]. Indeed, Nikolova [3] shows that the robustness of regularization terms depends on their characteristics at $\pm\infty$ and their differentiability at zero. Terms that are non-differentiable at zero reconstruct sharp edges well but lead to undesirable staircase effects. As a result, for the case of image restoration problems in the pixel domain, truncated regularization terms are more robust. In this way, truncated models may combine noise suppression with edge preservation.
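The effect of truncation in (2) can be seen in a small sketch that evaluates the energy (1) on a one-dimensional chain. The data term $|x_u - \bar{x}_u|^p$, unit weights $w_{uv} = 1$, and the function names are choices made here for illustration; they are not prescribed by the text.

```python
def truncated_prior(xu, xv, T, q=1, w=1.0):
    """R = w * min(T**q, |xu - xv|**q): truncated linear (q=1) or quadratic (q=2)."""
    return w * min(T ** q, abs(xu - xv) ** q)

def energy(x, observed, edges, lam, T, p=2, q=1):
    """E(x) of Eq. (1) with D = |x_u - observed_u|**p and the truncated prior.
    `observed` plays the role of the reference labeling (noisy data)."""
    data = sum(abs(a - b) ** p for a, b in zip(x, observed))
    smooth = sum(truncated_prior(x[u], x[v], T, q) for u, v in edges)
    return data + lam * smooth
```

For the signal `[0, 1, 0, 10, 9, 10]` with `T = 3`, the jump of height 10 is charged only 3 by the truncated linear prior, so keeping the edge is far cheaper than it would be under the untruncated linear prior.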
In general, depending on the application, a sharp (e.g. truncated linear) or smooth (e.g. truncated quadratic) term might be desirable. In the following, we introduce a new GC algorithm for solving the optimization problem characterized by energy (1) and prior (2).

In recent years, energy-based optimization methods using GC have become very popular in computer vision applications [4–6]. GC optimization has been applied to e.g. stereo-vision [7], multiview reconstruction [8], motion analysis [9], segmentation [10] and image restoration [11]. GC methods tend to provide optimal or near-optimal solutions to classical Markov Random Fields (MRF) problems, with some guarantees and in reasonable time, unlike earlier methods like Simulated Annealing (SA) [12] or Iterated Conditional Modes (ICM) [13]. From the algorithmic point of view, GC problems can be solved exactly when the energy is submodular, which was shown for the binary case (binary $\mathcal{L}$) in [14, 15] and for the multilabel case in [16]. When the energy $E$ is not submodular, some GC methods can still be used, for instance the move algorithms [6, 17–19]. GC move algorithms typically have good theoretical quality guarantees for certain sets of regularization terms, including the truncated convex functions considered in this paper. Classical move algorithms include expansion and swap moves [6]. More recently, improved moves have been proposed, e.g. range moves and fusion moves [17, 18, 20, 19]. All are geared towards improving the quality of the solution and the speed of the algorithm. The time complexity of move algorithms usually increases steeply with the number of labels. For example, the worst-case complexity of swap moves is quadratic in the number of labels, while range moves perform even worse. However, for problems where the number of labels is relatively low, these methods can be fast enough. Hence, move algorithms scale well with connectivity and are flexible with respect to data fidelity terms, but do not scale well with the number of labels. It is worth noting that when $R$ is convex, e.g. in the non-truncated linear or quadratic cases, the energy $E$ of (1) may be optimized exactly and efficiently [5, 11]. Moreover, Szeliski et al. [21] have shown that expansion and swap moves work well in practice for the Potts model. Conversely, in the truncated linear or quadratic cases, due to non-convexity and non-differentiability (at the truncation and also at zero for the truncated linear regularization term), such optimization problems remain challenging. In the multilabel case, i.e. when the set of labels $\mathcal{L}$ is not binary, the minimization problem of (1) is NP-hard. The GC algorithms dedicated to energies with truncated convex priors, e.g. [17, 18, 22], have been developed to meet this challenge. We discuss them in detail in section 2.
This group of algorithms can be extended with our Quantized-Convex Split Moves. This two-step approach produces results comparable to the current state-of-the-art move-based algorithms and yet outperforms them by a large factor in terms of time efficiency, especially when the number of labels is large. As convex priors and the Potts model can be optimized efficiently with move methods, we split the label set into two parts: a regular quantized one that we optimize using a modified Potts model, and a remainder part, which we optimize using a convex framework. We propose two types of moves, which are complementary, namely the quantized move and the convex move. Our quantized move is a modified α-expansion move, adapted to cope with a generalized Potts prior taking zero value for arguments in the range $(-T, +T)$. Thus, it tends to efficiently put pixel labels into the right intervals. These approximate results are corrected by the convex move, which performs finer changes with respect to the previously chosen label. A new, more precise label is found within the previously chosen interval. The convex move is a GC representation of Murota’s efficient gradient descent algorithm [14, 23].

The rest of the paper is organized as follows. The description of our method in the context of the most closely related work is given in section 2. We present our quantized and convex moves in section 3, and the Quantized-Convex Split Moves algorithm in section 4. Then we provide an experimental comparison of the different energy minimization methods in section 5, and conclude with section 6.
2 Related Work
In recent years, many algorithms utilizing truncated regularization terms have been proposed. Apart from GC move algorithms, sequential tree-reweighted message passing (TRW-S) [24] currently gives the most accurate results and provides a Lagrangian approximation of the dual energy, e.g. it estimates the gap between the current and the globally optimal energies. However, it is relatively slow [25] and is not well suited to highly connected graphs [26]. Belief propagation (BP) [27] methods, though fast, are not guaranteed to converge. GC methods were shown to outperform BP in several cases examined in [21]. Energies with truncated linear priors (truncated $\ell_1$) may be optimized e.g. using α-expansions [6] or the algorithm of Gupta and Tardos [28]. The latter offers good theoretical properties, but it is not practical.

Veksler proposed in [17] to minimize energies with truncated convex priors by splitting the problem into several subproblems that are all convex with respect to the prior. Each subproblem is defined for subsets of pixels $u, v \in \mathcal{V}$ with labels $x_u \in \mathcal{T}$ such that $\mathcal{T} \subset \mathcal{L}$ and $|x_u - x_v| \le T$. Note that there exist many $\mathcal{T} \subset \mathcal{L}$ satisfying the condition $|x_u - x_v| \le T$. Moreover, assuming that the labels in $\mathcal{T} = \{\dots, t_{i-1}, t_i, t_{i+1}, \dots\}$ form a convex cone defined as $t_i + 1 = t_{i+1}$, one can assign $T$ different $\mathcal{T} \subset \mathcal{L}$ to each $x_u$. According to the theorem presented in [17], the original energy with labeling $\mathcal{L}$ is minimized with each subenergy having sublabeling $\mathcal{T}$. An algorithm that takes advantage of this property is the range move. The range move solves different subproblems for different choices of $\mathcal{T}$ iteratively using an Ishikawa-like approach [5]. In this article, we show that by using what we call a convex move instead of the Ishikawa approach, it is possible to consider all possible choices of $\mathcal{T} \subset \mathcal{L}$ such that $d_{\mathcal{T}} = T - 1$ simultaneously, where $d_{\mathcal{T}} = \max\{|x_u - x_v| : x_u, x_v \in \mathcal{T}\}$. This allows us to improve the time efficiency of the overall algorithm considerably.

Additionally, we propose a quantized move, allowing for changes of $x_u$ between $\mathcal{T}_1 \subset \mathcal{L}$ and $\mathcal{T}_2 \subset \mathcal{L}$ such that $\mathcal{T}_1 \cap \mathcal{T}_2 = \emptyset$. This further improves the time efficiency of our algorithm over the range move. The proposed algorithm alternates iteratively between quantized and convex moves. Note that if $\mathcal{T}_1 \cap \mathcal{T}_2 = \emptyset$, the energy (1) is no longer convex with respect to the prior term. The advantage of the Ishikawa approach is that it guarantees a global minimum even with a non-convex data fidelity term, provided the prior is convex. This is particularly important for stereo-vision. For the convex move introduced in this paper, the energy is guaranteed to decrease, but the optimal solution is not guaranteed.

More recent work by Kumar and Torr [18] is better grounded theoretically than Veksler’s range move. The quality of the solution is guaranteed by bounds on the converged energy for truncated $\ell_1$ and $\ell_2$, which are calculated with respect to $d_{\mathcal{T}}$ and equal $2 + \sqrt{2}$ if $d_{\mathcal{T}} = \sqrt{2}\,T$ and $O(\sqrt{T})$ if $d_{\mathcal{T}} = \sqrt{T}$, for truncated $\ell_1$ and $\ell_2$, respectively. However, according to the results presented in [18], the practical performance of both algorithms is similar for the truncated $\ell_2$ prior, although the greatest improvement is achieved for the truncated $\ell_1$ prior. In terms of time efficiency, the range move outperforms the approach proposed by Kumar and Torr, but not significantly. Similarly to the range move, the authors use the graph construction proposed by Ishikawa, but they introduce small modifications. Namely, they adapt the Ishikawa approach to deal with non-convex priors at the cost of not representing the energy exactly. (Here we will not analyze our algorithm as a function of $d_{\mathcal{T}}$.) The convex move in our quantized-convex split moves algorithm is associated with two sets: (1) the set of all possible $\mathcal{T} \subset \mathcal{L}$ with $d_{\mathcal{T}} = T - 1$ and (2) the set of all possible $\mathcal{T} \subset \mathcal{L}$ with $d_{\mathcal{T}} = T$.

In [22], the authors proposed a hierarchical approach. The original problem was replaced by a series of r-HST metric labeling subproblems, and the obtained solutions were combined with the α-expansion algorithm. The previously presented approximation bounds were improved. They are equal to $O(\ln L)$ and $O((\gamma \ln L)^2)$, $\gamma \ge 1$, for truncated $\ell_1$ and $\ell_2$, respectively. However, this approach is computationally expensive.
3 Move Algorithms
Move algorithms have been developed to solve multilabeling problems. According to the definition given in [17], a move algorithm is an iterative algorithm where $x^{n+1} \in M(x^n)$ and $M(x)$ is the space of “moves” of $x$. A local minimum with respect to a set of moves is at $x$ if $E(x') > E(x)$ for any $x' \in M(x)$. Each move algorithm is characterized by its space of “moves” $M(x)$. In this section we describe the two moves that we develop. The quantized move is closely related to α-expansion, and the convex move to Murota’s gradient descent algorithm. In section 4, we explain why linking these moves together improves efficiency in the context of minimizing energy functions with a truncated convex prior.

3.1 Quantized Move
The main idea behind the quantized move is to divide the label range into equal subintervals of length $T$ and, ideally, put pixel labels into the correct intervals, thus reducing the number of categories from the original range $L$ to $L/T$. This greatly reduces the execution time of the algorithm. The proposed move algorithm minimizes the energy $E_p$ with an arbitrary data fidelity term $D_p$ and a pairwise term defined as:

$$R_p(x_v, x_u) = \begin{cases} 0 & \text{if } |x_u - x_v| < T \\ f(T) & \text{if } |x_u - x_v| \ge T, \end{cases} \qquad (3)$$

where $T$ is a positive integer value. This prior is potentially interesting for other applications, but here we will use it as an intermediate step for minimizing truncated convex priors.
A quantized move is a new labeling where $x_u$ is either left as $x_u$ or moved to a new value according to the following transformation:

$$\alpha(x_u, k) = \begin{cases} t_1^k & \text{if } x_u \le t_1^k \\ t_T^k & \text{if } x_u \ge t_T^k, \end{cases} \qquad (4)$$

where $k$ is an integer belonging to a regular quantization of the label set $\mathcal{L}$, i.e., $k \in \mathcal{K} = \{k_0, k_1, \dots, k_K\}$ such that $k_0 = 0$, $k_i = iT$, $i \in \mathbb{N}^+$, $KT \ge L$ and $(K - 1)T < L$. Recall that $L$ is the maximum label in $\mathcal{L}$. $\mathcal{T}^k = \{t_1^k, \dots, t_T^k\}$ is an ordered label set such that $t_{i+1}^k = t_i^k + 1$. The values in $\mathcal{T}^k$ range from $k - \frac{T}{2} + 1$ to $k + \frac{T}{2}$ for even $T$ and from $k - \frac{T}{2} + \frac{1}{2}$ to $k + \frac{T}{2} - \frac{1}{2}$ for odd $T$, so that both bounds are integers. $t_1^k$ and $t_T^k$ are the first and last elements of the set $\mathcal{T}^k$, respectively. The acceptable moves for a label depending on its current position are illustrated in Fig. 1.
Fig. 1. (a,b,c) illustrate the label moves when the current value of a label is below, above, and inside the considered interval $\mathcal{T}^k$ (denoted by square brackets), respectively
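The quantization and the transformation (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, labels strictly inside the interval are assumed to stay unchanged (consistent with Fig. 1(c)), and $f(d) = d^2$ is a placeholder for the convex function $f$.

```python
def quantization_centres(L, T):
    # k_i = i * T for i = 0..K, with K the smallest integer such that K * T >= L.
    K = -(-L // T)  # ceiling division
    return [i * T for i in range(K + 1)]


def interval(k, T):
    # First and last label (t_1, t_T) of the T consecutive labels around k.
    lo = k - T // 2 + 1 if T % 2 == 0 else k - (T - 1) // 2
    return lo, lo + T - 1


def alpha(x_u, k, T):
    # Transformation (4); assumption: labels strictly inside the interval are kept.
    lo, hi = interval(k, T)
    return min(max(x_u, lo), hi)


def quantized_prior(x_u, x_v, T, f=lambda d: d * d):
    # Prior (3): zero below the threshold, f(T) otherwise (f is a placeholder).
    return 0 if abs(x_u - x_v) < T else f(T)
```

With $T = 4$ and $k = 8$ the interval is $\{7, 8, 9, 10\}$, so `alpha(3, 8, 4)` moves the label to 7 and `alpha(20, 8, 4)` moves it to 10. Because the interval holds only $T$ consecutive labels, two moved labels always differ by less than $T$, so prior (3) vanishes between them. For $T = 1$ the interval collapses to $\{k\}$ and every moved label becomes $k$.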
The set of quantized moves $M_Q(x)$ is then defined as the collection of moves for all $k \in \mathcal{K}$. Quantized moves act much like expansion moves in the case of a Potts model on a quantized subset of labels. We now prove that quantized moves are graph-representable and can be optimized by GC.

Proposition 1. For the energy in (1) with a regularization term given by (3), the optimal quantized move (i.e., the one giving the maximum decrease in energy) can be computed with a graph cut.

Proof: We show that the quantized move satisfies all conditions specified in [15]. Let $b = \{b_u, \forall u \in \mathcal{V}\}$ be a binary vector coding a quantized move. Then the move can be described by a transformation function $B(x^{(n)}, b)$ returning a new labeling $x^{(n+1)}$, based on $b$ and $x^{(n)}$. Here $(n)$ is the iteration number. The transformation function $B_q(x^{(n)}, b)$ for a quantized move is given by:

$$x_u^{(n+1)} = B_q(x_u^{(n)}, b_u) = \begin{cases} \alpha(x_u^{(n)}, k) & \text{if } b_u = 1 \\ x_u^{(n)} & \text{if } b_u = 0 \end{cases} \tag{5}$$

The considered move finds $b^* = \operatorname{Argmin}_b E(B_q(x^{(n)}, b))$, where $E(B_q(x^{(n)}, b))$ is a pseudo-boolean energy, defined as

$$\sum_{u \in \mathcal{V}} D(B_q(x_u^{(n)}, b_u)) + \sum_{(u,v) \in \mathcal{E}} R(B_q(x_u^{(n)}, b_u), B_q(x_v^{(n)}, b_v)).$$

Let us denote the pairwise term of the binary quantized move energy by $B(b_u, b_v)$, omitting $x^{(n)}$ from the notation for simplicity. Then:

$$B(b_u, b_v) = \begin{cases} R_p(x_u^{(n)}, x_v^{(n)}) & \text{if } b_u = 0, b_v = 0 \\ R_p(x_u^{(n)}, \alpha(x_v^{(n)}, k)) & \text{if } b_u = 0, b_v = 1 \\ R_p(\alpha(x_u^{(n)}, k), x_v^{(n)}) & \text{if } b_u = 1, b_v = 0 \\ R_p(\alpha(x_u^{(n)}, k), \alpha(x_v^{(n)}, k)) & \text{if } b_u = 1, b_v = 1. \end{cases} \tag{6}$$
The pairwise term $B$ needs to be submodular, i.e., $B(0,0) + B(1,1) \le B(1,0) + B(0,1)$. Since for all $n$ and $k$ we have $R_p(\alpha(x_u^{(n)}, k), \alpha(x_v^{(n)}, k)) = 0$, the submodularity inequality takes the form:

$$R_p(x_u^{(n)}, \alpha(x_v^{(n)}, k)) + R_p(\alpha(x_u^{(n)}, k), x_v^{(n)}) \ge R_p(x_u^{(n)}, x_v^{(n)}), \tag{7}$$

or equivalently:

$$B(0,1) + B(1,0) \ge B(0,0). \tag{8}$$

The only case when $B(0,0)$ is not 0 is when neighbors $x_u$ and $x_v$ are at least $T$ apart, i.e., $|x_u - x_v| \ge T$, in which case we have $B(0,0) = f(T)$. However, in this case either $B(0,1)$ or $B(1,0)$ or both are equal to $f(T)$, so the inequality is verified. The problem of minimizing the energy $E(B_q(x^{(n)}, b))$ can be solved globally with respect to $b$ using discrete maxflow-mincut methods [29]. Note that when $T = 1$ our quantized move reduces to the α-expansion move.

3.2 Convex Moves
In the previous section, we showed how to assign the pixel values to the correct intervals, and now we propose a convex algorithm to optimize these values within these intervals. To achieve this, we view the steepest descent algorithm of Murota [14, 23] as a special case of a GC move. The primal and primal-dual algorithms proposed in [30] are also related to Murota's approach. Their convergence properties in the case of $L^\natural$-convex functions have been proved. However, the case of non-convex data fidelity was not examined. This limitation can be viewed as a disadvantage compared to Ishikawa's approach [5], which guarantees a global minimum even for non-convex data fidelity. In contrast, both the primal and primal-dual algorithms are more memory and time efficient than Ishikawa's non-iterative method. The convex move is conceptually similar to the jump move [1]. However, the jump move processes pixels with odd and even values differently. As a consequence, Potts functions can be represented on jump-move graphs, whereas convex functions generally cannot. As in the previous case (Section 3.1), a convex move is described by a binary vector $b$ and the transformation function $B_c(x^{(n)}, b)$ defined as:

$$x_u^{(n+1)} = B_c(x_u^{(n)}, b_u) = \begin{cases} x_u^{(n)} + s & \text{if } b_u = 1 \\ x_u^{(n)} & \text{if } b_u = 0, \end{cases} \tag{9}$$

where $s \in \mathcal{S}$ and $\mathcal{S}$ is a set of discrete values from $\mathbb{Z}$. The convex move space $M_C(x)$ is then defined as the collection of convex moves for all $s \in \mathcal{S}$. We call the algorithm finding $b^* = \operatorname{Argmin}_b E(B_c(x^{(n)}, b))$ the convex move algorithm. The pseudo-boolean prior term representation is given by:

$$R(B_c(x_u^{(n)}, b_u), B_c(x_v^{(n)}, b_v)) = \begin{cases} R_c(x_u^{(n)}, x_v^{(n)}) & \text{if } b_u = 0, b_v = 0 \\ R_c(x_u^{(n)}, x_v^{(n)} + s) & \text{if } b_u = 0, b_v = 1 \\ R_c(x_u^{(n)} + s, x_v^{(n)}) & \text{if } b_u = 1, b_v = 0 \\ R_c(x_u^{(n)} + s, x_v^{(n)} + s) & \text{if } b_u = 1, b_v = 1 \end{cases} \tag{10}$$

The term (10) is submodular because $R_c(x_u, x_v)$ is an $L^\natural$-convex function (since $f$ is convex, its submodularity inequality $f(|x_u + s - x_v|) + f(|x_u - x_v - s|) \ge 2 f(|x_u - x_v|)$ is always satisfied). The optimal convex move can be found with Murota's gradient descent algorithm [23]. It is worth noting that the GC formulation does not impose any requirement on the data fidelity term, so each move is guaranteed not to increase the energy. Hence, in this case the energy (1) decreases monotonically, but a globally optimal solution of the multilabel problem is not guaranteed.
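The submodularity inequality quoted above is easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, $f(d) = d^2$ is just one convex choice, and the truncation $\min(f(d), f(T))$ is an assumed form of the truncated prior, included to show where the inequality can fail.

```python
def slack(f, d, s):
    # f(|d + s|) + f(|d - s|) - 2 f(|d|), with d = x_u - x_v.
    # A non-negative value means the pairwise term is submodular for this pair.
    return f(abs(d + s)) + f(abs(d - s)) - 2 * f(abs(d))


def quadratic(d):
    # One possible convex f (an assumption made for the example).
    return d * d


def truncated(f, T):
    # Assumed truncation of a convex f at threshold T.
    return lambda d: min(f(d), f(T))
```

For the quadratic function the slack is non-negative for every label difference and $s = \pm 1$. For the truncated version it becomes negative at $|x_u - x_v| = T$, for example `slack(truncated(quadratic, 5), 5, 1)` is negative. Pairs at exactly that distance are the ones that a convex-move graph cannot represent exactly.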
4 Truncated Convex Prior Algorithm
In this section, we present an effective method combining both moves introduced in Section 3 for minimizing energies with truncated convex prior functionals (2). The convex move submodularity inequality is a function of $(x_u, x_v)$ such that $(u, v) \in \mathcal{N}$ and $s \in \mathcal{S}$. The choice of $\mathcal{S}$ influences the number of pairs of neighboring pixels which satisfy the convex move submodularity inequality. We examine the case where $\mathcal{S} = \{-1, +1\}$ and $f(x_u, x_v)$ is defined as in (2). To specify the sets of pixels the convex move applies to, we define $\mathcal{T}_i$ for $0 \le i \le L$ to be the collection of all subsets $S_i^{\mathcal{V}}$ of $\mathcal{V}$ such that $\forall u, v \in S_i^{\mathcal{V}}$, $|x_u - x_v| \le i$. We note that every $x_u$ belongs to at least one $S_i^{\mathcal{V}}$ irrespective of $i$, and so the entire image is covered by $\mathcal{T}_i$. A convex move characterized by $\mathcal{S} = \{-1, +1\}$ is a function which maps $\mathcal{T}_{T-1}$ onto $\mathcal{T}_T$, guaranteeing that the energy defined as in (2) decreases with each move. This comes from the fact that the energy for $\mathcal{T}_{T-1}$ is represented exactly using our convex graph and, as $s$ is equal to either 1 or −1, the solution belongs to $\mathcal{T}_T$. Following [15], we define the edge capacities of the graph $G(\mathcal{V}, \mathcal{E})$. The cost $c(u, v)$ between $(u, v) \in \mathcal{N}$ is set to $f(|x_u + s - x_v|) + f(|x_u - x_v - s|) - 2 f(|x_u - x_v|)$ if $|x_u - x_v| < T$ and 0 otherwise. Because of the many such null connections, the final MRF is sparser, which improves the time efficiency of the algorithm. The energy is guaranteed to go down, but the resulting labeling and corresponding energy are not as good as those obtained by other minimizers. To improve our results, we combine this convex move with our proposed quantized move. An arbitrary new labeling produced by the quantized move is not guaranteed to improve the truncated convex prior energy (only a Potts-like energy is guaranteed to be minimized).
However, we can easily impose this extra condition: the new labeling is accepted only if the proposed energy is better with respect to the truncated convex prior energy, and rejected otherwise, which yields the desired effect. Since the quantized move regularizes distant outliers, it is a powerful complement to convex moves, which with $\mathcal{S} = \{-1, +1\}$ regularize close outliers. We now present our two-step algorithm alternating convex and quantized moves. Here, $Q(x, k)$ denotes the quantized move of image $x$ and interval $k$. We also denote the convex move by $C(x, s)$, where $s$ is the considered step and $x$ the input image. Note that the loops indexed by $n$ and $m$ are repeated until convergence.

Algorithm 1. (Quantized-convex split moves algorithm)
Fix $x^{(0)}$, $\mathcal{S} = \{-1, 1\}$
For $j = 0, 1, \dots$
    $x^{(0)} = x^{(j)}$
    For $n = 0, 1, \dots$
        Assign to $\mathcal{K}'$ a set of randomly ordered elements from $\mathcal{K}$
        For $i = 0, 1, \dots, K$
            Set $k_i$ to be the $i$-th element of $\mathcal{K}'$
            $x' = Q(x^{(n)}, k_i)$
            if $E(x') \le E(x^{(n)})$ then $x^{(n+1)} = x'$
    $x^{(0)} = x^{(n)}$
    For $m = 0, 1, \dots$
        Assign to $\mathcal{S}'$ a set of randomly ordered elements from $\mathcal{S}$
        For $i = 0, 1$
            Set $s_i$ to be the $i$-th element of $\mathcal{S}'$
            $x^{(m+1)} = C(x^{(m)}, s_i)$
    $x^{(j+1)} = x^{(m)}$

We now have our main result:

Proposition 2. Algorithm 1 iteratively decreases the energy (1), with $R$ defined as a truncated convex function.

Proof: This result follows directly from the previous discussion, where it was shown that all steps reduce the energy $E(x)$.

As this algorithm combines quantized and convex moves, it is important to understand what happens at the boundary between them. A difficulty is that neighboring pairs $(u, v) \in \mathcal{N}$ with labels $|x_u - x_v| = T$ cannot be represented exactly on the convex graph. This comes from the fact that the convex move cannot map $\mathcal{T}_T$ to $\mathcal{T}_{T-1}$. We cope with this problem in a similar way as in [31], where α-expansions were shown to be able to minimize energies involving a truncated prior, as long as the number of pairs $x_u, x_v$ not satisfying the submodularity inequality is relatively small. This is the reason why we limited the convex moves to $\mathcal{S} = \{-1, +1\}$. We represent truncated priors on the convex move graph in a similar spirit.
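The control flow of Algorithm 1 can be illustrated on a tiny one-dimensional signal. The sketch below is not the authors' implementation: the graph cut is replaced by an exhaustive search over all binary vectors $b$ (only feasible for a handful of pixels), the convex move is searched directly on the truncated energy instead of on the convex graph, the acceptance test is strict to guarantee termination, and the quadratic $f$, the chain neighbourhood, the clipping to $[0, L]$ and all function names are assumptions made for the example.

```python
import random
from itertools import product


def f(d):
    # Convex function f; the quadratic choice is an assumption.
    return d * d


def truncated_prior(a, b, T):
    # Assumed form of the truncated convex prior: min(f(|a - b|), f(T)).
    return min(f(abs(a - b)), f(T))


def quantized_prior(a, b, T):
    # Potts-like prior (3): zero below the threshold, f(T) otherwise.
    return 0 if abs(a - b) < T else f(T)


def energy(x, y, lam, T, prior):
    # 1-D chain: quadratic data term plus lam times the pairwise prior.
    data = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return data + lam * sum(prior(a, b, T) for a, b in zip(x, x[1:]))


def alpha(x_u, k, T):
    # Clamp x_u into the T consecutive labels around k (inside labels are kept).
    lo = k - T // 2 + 1 if T % 2 == 0 else k - (T - 1) // 2
    return min(max(x_u, lo), lo + T - 1)


def best_binary_move(x, proposal, objective):
    # Exhaustive stand-in for the graph cut: try every binary vector b.
    best, best_val = list(x), objective(x)
    for b in product((0, 1), repeat=len(x)):
        cand = [p if bit else xi for xi, p, bit in zip(x, proposal, b)]
        val = objective(cand)
        if val < best_val:
            best, best_val = cand, val
    return best


def split_moves(y, L, T, lam, max_outer=10, seed=0):
    rng = random.Random(seed)
    E = lambda x: energy(x, y, lam, T, truncated_prior)
    Eq = lambda x: energy(x, y, lam, T, quantized_prior)
    clip = lambda v: min(max(v, 0), L)
    x = [0] * len(y)                      # all-zero initialization
    centres = list(range(0, L + T, T))    # k_i = i * T, up to the first K * T >= L
    steps = [-1, 1]
    trace = [E(x)]
    for _ in range(max_outer):
        changed = True
        while changed:                    # quantized moves until none is accepted
            changed = False
            rng.shuffle(centres)
            for k in centres:
                proposal = [clip(alpha(v, k, T)) for v in x]
                cand = best_binary_move(x, proposal, Eq)
                if E(cand) < E(x):        # accept only if the true energy improves
                    x, changed = cand, True
        changed = True
        while changed:                    # convex moves with s in {-1, +1}
            changed = False
            rng.shuffle(steps)
            for s in steps:
                cand = best_binary_move(x, [clip(v + s) for v in x], E)
                if E(cand) < E(x):
                    x, changed = cand, True
        trace.append(E(x))
        if trace[-1] == trace[-2]:        # no improvement in a full outer pass
            break
    return x, trace
```

Calling `split_moves([3, 4, 3, 12, 11, 12, 4, 3], L=15, T=4, lam=1)` returns the final labeling together with the energy after each outer iteration. Because a labeling is replaced only when the truncated energy strictly decreases, the recorded energies never increase, which mirrors the statement of Proposition 2.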
5 Results
We implemented our proposed algorithm in the framework of the Middlebury MRF vision code (http://vision.middlebury.edu/MRF/code/), based on [21], so we could compare our approach with the following methods: ICM [13], α-expansion and swap moves [32, 6], MaxProdBP, BP-S (using software provided by Marshall Tappen [33]), and TRW-S [25, 24]. We also endeavoured to compare it with the range move, but the range move did not work for our tests because the value of $T$ was too large. The tests were performed single-threaded on an Intel Xeon 2.5GHz with 32GB of RAM running RedHat Enterprise Linux 5.5. All algorithms were run either until full convergence (GC algorithms, ICM, and ours) or until the first oscillation (the other algorithms). We evaluated our proposed algorithm only in the context of image restoration for different prior functions, namely truncated $\ell_2$ and truncated $\ell_1$-$\ell_2$, defined as $\sqrt{\epsilon + x^2}$. In each case, we also examined the influence of the parameter $T$. The grey scale images ($L = 255$) of size 512 × 512 (for $\ell_2$) and 256 × 256 (for $\ell_1$-$\ell_2$) were corrupted with additive zero-mean Gaussian noise with standard deviation 25.3, corresponding to initial SNR values of 13.75 dB, 15.09 dB, and 14.26 dB for the images "gold rec", "elaine", and "barbara", respectively. Consequently, all experiments were performed with an $\ell_2$ data fidelity term, which is most appropriate for this noise distribution. All the algorithms were initialized with an all-zero image. The algorithm accuracy is evaluated in terms of the relative error defined as $\mathrm{err} = (E(x^*) - E(x_{\text{TRW-S}_l}))/E(x_{\text{TRW-S}_l})$, where $E(x_{\text{TRW-S}_l})$ is the lower bound value reported by TRW-S and $E(x^*)$ is the energy corresponding to the solution obtained by the algorithm. The restoration quality is evaluated in terms of SNR. The mean time, energy, SNR, and error presented in Table 1 and Table 2 are computed from 3 different realizations of the noise added to the 3 considered images.
The performance of our algorithm is also illustrated by energy vs. time plots (Fig. 2).

Fig. 2. Energy versus log time characteristics of convergence for algorithm comparison: ICM (solid line with crosses), BP-S (solid line with diamonds), BP (dashed line), TRW-S (dotted line), αβ swap (dash-dot line), α-exp (solid line with squares), ours (solid line). (a,b) illustrate the case of the $\ell_2$ and $\ell_1$-$\ell_2$ prior, respectively.
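The error measure and the two prior shapes used in these experiments can be written down in a few lines. The sketch below is illustrative only: the function names, the quadratic form assumed for the $\ell_2$ prior, and the truncation by $\min(f(|d|), f(T))$ are assumptions, since the exact definition (2) is given earlier in the paper.

```python
import math


def l2(d):
    # Quadratic function, assumed form of the l2 prior before truncation.
    return d * d


def l1_l2(d, eps=10.0):
    # sqrt(eps + d^2); eps = 10 is the value quoted for Table 2.
    return math.sqrt(eps + d * d)


def truncate(f, T):
    # Assumed truncation: the prior stops growing once |d| reaches T.
    return lambda d: min(f(abs(d)), f(T))


def relative_error(energy, lower_bound):
    # err = (E(x*) - E_lower) / E_lower, with E_lower the TRW-S lower bound.
    return (energy - lower_bound) / lower_bound
```

For instance, `truncate(l2, 25)` returns a function that equals $d^2$ for $|d| \le 25$ and stays at 625 beyond that, and `relative_error` reproduces the err column of the tables once an energy and the TRW-S lower bound are available.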
Fig. 3. The image restoration results for the truncated $\ell_1$-$\ell_2$ prior with threshold $T = 50$, $\ell_2$ data fidelity term, and $\lambda = 2$. Panels: (a) Original, (b) Noisy G(0,25.3), (c) ICM 20.11 dB, (d) BP-S 20.16 dB, (e) BP 20.15 dB, (f) TRW-S 20.16 dB, (g) α-exp 20.14 dB, (h) αβ swap 20.15 dB, (i) Our 20.35 dB
In Table 1, our quantized-convex split moves algorithm outperforms all other GC based algorithms in terms of minimum energy and time efficiency for the truncated $\ell_2$ prior. However, the best final energy is obtained by the BP (contrary to what was found in [21]) and TRW-S algorithms, the latter converging faster than the former. One can observe in Fig. 2 (a) that in the case of truncated $\ell_2$, TRW-S offers a speed/energy compromise comparable with our quantized-convex split moves algorithm when it is stopped early, for instance after two iterations. However, for the truncated $\ell_1$-$\ell_2$ prior, our algorithm is significantly faster (Fig. 2 (b)), while still achieving energies comparable with other algorithms (Table 2). The quality of the results is also verified by inspecting the mean SNR value, which is not further improved by other algorithms in comparison to ours. Indeed, our algorithm appears to perform better at removing isolated noisy pixels compared with other algorithms (see Fig. 3(i)). Since our quantized-convex split moves algorithm leads to very good results (Fig. 3), and is fast and less memory expensive than other algorithms, it appears to be well suited for image restoration applications.

Table 1. Truncated $\ell_2$ prior results on 512 × 512 images. The SNR is given in dB, and the time in seconds. TRW-S and BP were stopped after 15 iterations (after this, the energy did not improve significantly).

Method      T = 25, λ = 2                      T = 35, λ = 2                      T = 50, λ = 1
            time     err            SNR        time     err            SNR        time     err            SNR
ICM         39.8     3.02 × 10^-2   20.11      39.06    1.71 × 10^-2   21.70      25.4     5.25 × 10^-3   21.53
BP-S        1807.8   8.29 × 10^-4   20.82      1658.7   5.10 × 10^-4   21.74      1641.8   7.21 × 10^-5   21.52
BP          153.0    1.05 × 10^-3   20.80      154.5    7.21 × 10^-4   21.76      153.3    1.17 × 10^-4   21.52
TRW-S       154.9    1.36 × 10^-3   20.85      154.6    7.55 × 10^-4   21.72      172.2    8.53 × 10^-3   21.54
α-exp       307.6    1.98 × 10^-2   20.53      294.6    2.01 × 10^-2   21.61      240.3    1.98 × 10^-2   21.28
αβ swap     360.1    2.57 × 10^-2   20.33      362.3    1.48 × 10^-2   21.75      359.5    4.26 × 10^-3   21.53
Proposed    27.1     1.57 × 10^-2   21.53      28.6     6.30 × 10^-3   21.71      29.2     3.28 × 10^-3   21.51

Table 2. Truncated $\ell_1$-$\ell_2$ prior results with $\epsilon = 10$ on 256 × 256 images. The SNR is given in dB, and the time in seconds.

Method      T = 35, λ = 55                     T = 50, λ = 45                     T = 60, λ = 30
            time     err            SNR        time      err            SNR       time     err            SNR
ICM         104.4    2.69 × 10^-2   19.51      84.4      8.67 × 10^-3   20.57     46.8     2.39 × 10^-3   20.88
BP-S        4871.9   7.28 × 10^-4   20.08      5069.9    1.34 × 10^-4   20.68     3866.7   1.42 × 10^-5   20.97
BP          13950.0  9.40 × 10^-4   20.12      16048.7   2.10 × 10^-4   20.69     14902.3  4.31 × 10^-5   20.97
TRW-S       2508.9   2.54 × 10^-4   20.06      2259.0.4  4.30 × 10^-5   20.66     2852.2   5.08 × 10^-6   20.97
α-exp       61.8     7.96 × 10^-3   20.10      50.3      7.36 × 10^-3   20.72     50.9     7.37 × 10^-3   20.80
αβ swap     200.4    1.12 × 10^-2   19.94      178.9     4.31 × 10^-3   20.63     112.7    1.19 × 10^-3   20.95
Proposed    9.4      1.16 × 10^-2   20.37      9.3       4.00 × 10^-3   20.86     7.9      1.51 × 10^-3   21.19

6 Conclusion and Future Work
In this paper, we have presented a novel move-based algorithm to solve GC problems with truncated convex priors in the context of image denoising. Our move is split in two parts: a first Potts-like move that denoises a quantized version of the image, and a second move that processes the result of the first move according to a fully convex prior. We have shown that combining these moves corresponds to denoising with a truncated convex prior. For a convex prior truncated at threshold $T$ and for an image with $L$ labels, the Potts-like denoising operates on $L/T$ labels and the convex part on $T$ labels only. This results in two optimizations over a much reduced set of labels for most useful values of $T$, and therefore it translates into large savings in computing time. Because only submodular moves are performed, the algorithm is guaranteed to converge in finite time. The result of these moves appears better in terms of energy than that of the other move algorithms, and depending on the problem, our algorithm is at least 5 times and up to several orders of magnitude faster than current state-of-the-art algorithms. We believe this constitutes an interesting compromise between efficiency and precision.
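As a rough illustration of the label-count argument, take the setting $L = 255$, $T = 25$ from Table 1 together with the quantization of Section 3.1. This is a back-of-the-envelope count under those definitions, not a complexity analysis:

```latex
\[
K = \left\lceil \frac{L}{T} \right\rceil = \left\lceil \frac{255}{25} \right\rceil = 11,
\qquad
|\mathcal{K}| = K + 1 = 12,
\qquad
|\mathcal{T}^k| = T = 25 .
\]
```

One sweep of quantized moves therefore visits 12 intervals, and the convex part refines labels within intervals of 25 labels, compared with the 256 labels of the original range.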
Since we use barely modified versions of Potts optimization and discrete convex optimization methods, future progress in this area will also translate into improvements for the proposed method. In particular, future work will include analyzing primal-dual methods for convex optimization. Precision can also be improved by using more sophisticated Potts-like moves. We will also explore the behaviour of our algorithm with non-convex data terms and consider other applications, such as stereo-vision.
References
1. Veksler, O.: Efficient graph-based energy minimization methods in computer vision. PhD thesis, Cornell University, Ithaca, NY, USA (1999)
2. Huang, J., Mumford, D.: Statistics of natural images and models. In: IEEE Computer Society Conference on Computer Vision, Computer Vision and Pattern Recognition, Fort Collins, CO, USA (1999)
3. Nikolova, M.: Minimizers of cost-functions involving non-smooth data-fidelity terms. Application to the processing of outliers. SIAM J. on Numerical Analysis 40, 965–994 (2002)
4. Boykov, Y., Veksler, O., Zabih, R.: Markov random fields with efficient approximations. In: CVPR, pp. 648–655 (1998)
5. Ishikawa, H.: Exact optimization for Markov random fields with convex priors. IEEE Transaction on Pattern Analysis and Machine Intelligence 25, 1333–1336 (2003)
6. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transaction on Pattern Analysis and Machine Intelligence 23, 1222–1239 (2001)
7. Woodford, O.J., Torr, P.H.S., Reid, I.D., Fitzgibbon, A.W.: Global stereo reconstruction under second order smoothness priors. In: IEEE Conference on Computer Vision and Pattern Recognition (2008)
8. Sinha, S.N., Mordohai, P., Pollefeys, M.: Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, pp. 1–8 (2007)
9. Xiao, J., Shah, M.: Motion layer extraction in the presence of occlusion using graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1644–1659 (2007)
10. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in n-d images. In: Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, pp. 105–112 (2001)
11. Darbon, J., Sigelle, M.: Image restoration with discrete constrained total variation part II: Levelable functions, convex priors and non-convex cases. JMIV 26, 277–291 (2006)
12. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. TPAMI 6, 721–741 (1984)
13. Besag, J.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society. Series B (Methodological) 48, 259–302 (1986)
14. Murota, K.: Algorithms in discrete convex analysis (2000)
15. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Transaction on Pattern Analysis and Machine Intelligence 26, 147–159 (2004)
16. Schlesinger, D., Flach, B.: Transforming an arbitrary minsum problem into a binary one. Technical report, Dresden University of Technology (2008)
17. Veksler, O.: Graph cut based optimization for MRFs with truncated convex priors. In: IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, pp. 1–8 (2007)
18. Kumar, M.P., Torr, P.H.S.: Improved moves for truncated convex models. In: Proceedings of Advances in Neural Information Processing Systems (2008)
19. Lempitsky, V., Rother, C., Roth, S., Blake, A.: Fusion moves for Markov random field optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1392–1405 (2010)
20. Veksler, O.: Multi-label moves for MRFs with truncated convex priors. In: Cremers, D., Boykov, Y., Blake, A., Schmidt, F.R. (eds.) EMMCVPR 2009. LNCS, vol. 5681, pp. 1–8. Springer, Heidelberg (2009)
21. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1068–1080 (2008)
22. Kumar, M.P., Koller, D.: MAP estimation of semi-metric MRFs via hierarchical graph cuts. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2009, pp. 313–320. AUAI Press, Arlington (2009)
23. Murota, K.: On steepest descent algorithms for discrete convex functions. SIAM Journal on Optimization 14, 699–707 (2004)
24. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1568–1583 (2006)
25. Wainwright, M.J., Jaakkola, T.S., Willsky, A.S.: MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory 51, 3697–3717 (2005)
26. Kolmogorov, V., Rother, C.: Comparison of energy minimization algorithms for highly connected graphs. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 1–15. Springer, Heidelberg (2006)
27. Felzenszwalb, P.F., Huttenlocher, D.R.: Efficient belief propagation for early vision. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, pp. 261–268 (2004)
28. Gupta, A., Tardos, E.: A constant factor approximation algorithm for a class of classification problems. In: Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pp. 333–342 (2000)
29. Ford, J.L.R., Fulkerson, D.R.: Flows in Networks. Princeton University Press, Princeton (1962)
30. Kolmogorov, V., Shioura, A.: New algorithms for convex cost tension problem with application to computer vision. Discrete Optimization 6, 378–393 (2009)
31. Rother, C., Kumar, S., Kolmogorov, V., Blake, A.: Digital tapestry [automatic image synthesis]. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, pp. 589–596 (2005)
32. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transaction on Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
33. Tappen, M.F., Freeman, W.T.: Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In: Proceedings of Ninth IEEE International Conference on Computer Vision, Nice, France, pp. 900–906 (2008)
Temporally Consistent Gradient Domain Video Editing

Gabriele Facciolo¹, Rida Sadek¹, Aurélie Bugeau², and Vicent Caselles¹

¹ DTIC, Universitat Pompeu Fabra, 08023 Barcelona, Spain {gabriele.facciolo,rida.sadek,vicent.caselles}@upf.edu
² LaBRI, Université Bordeaux 1, 33405 Talence, France [email protected]

Abstract. In the context of video editing, enforcing spatio-temporal consistency is an important issue. With that purpose, the current variational models for gradient domain video editing include space and time regularization terms. The spatial terms are based on the usual space derivatives, the temporal ones are based on the convective derivative, and both are balanced by a parameter β. However, the usual discretizations of the convective derivative limit the value of β to a certain range, thus preventing these models from achieving their full potential. In this paper, we propose a new numerical scheme to compute the convective derivative, the deblurring convective derivative, which allows us to lift this constraint. Moreover, the proposed scheme introduces fewer errors than other discretization schemes without adding computational complexity. We use this scheme in the implementation of two gradient domain models for temporally consistent video editing, based on Poisson and total variation type formulations, respectively. We apply these models to three video editing tasks: inpainting correction, object insertion and object removal.

Keywords: Video editing, Poisson editing, temporal consistency, total variation, convective derivative, numerical methods.
1 Introduction
In the context of static image editing, the insertion/removal of content in an image is generally performed using gradient domain methods. These methods allow editing an image without introducing artifacts at the boundaries of the edited regions. Consequently, gradient domain methods are widely used in image processing for: seamless cloning and compositing [1, 2], shadow removal [3], HDR compression [4], image inpainting [5–7], and matting [8] among others. Essentially, gradient domain image editing is based on the manipulation of the gradients of an image instead of its graylevels. The modified gradients are then integrated to recover the resulting image. This procedure prevents the appearance of seams at the boundaries of the edited region. For a more detailed introduction to gradient-domain methods, the reader is referred to [9]. In particular, Poisson image editing [1] is one such technique; it formulates the problem variationally as

$$\min_u \int_{O \subset \Omega} \|\nabla u - g\|^2 \, dx, \quad \text{with } u|_{\partial O} = u_0,$$

where $\Omega \subset \mathbb{R}^2$ is the image domain, $O \subset \Omega$ is the region to be edited, $g : O \to \mathbb{R}^2$ is the guidance vector field (e.g. gradient of the image to be composed), $u : O \to \mathbb{R}$ is the solution which best approximates the field $g$, and $u_0 : \Omega \to \mathbb{R}$ is the original image which provides the boundary conditions needed for reconstructing the solution $u$. The solution of this problem is computed by solving the Poisson equation with Dirichlet boundary conditions $u|_{\partial O} = u_0$.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 59–73, 2011. © Springer-Verlag Berlin Heidelberg 2011

Since a video is nothing but a stack of images captured at evenly spaced times, it is natural to apply image editing techniques for video editing tasks. However, video editing poses some challenges that were not present in still image editing, such as preserving the temporal consistency of the video. To cope with these challenges, video editing tasks are generally broken down into three steps: a) tracking the region where the editing is being performed [10, 11], a task that may also require detecting occlusions of the region; b) computing the new content of the region; and c) blending the new content with the original video so that the modification is not noticeable. The first two steps are usually handled by tracking and video inpainting techniques. The focus of this work lies in computing the missing content in the editing region and enforcing the temporal consistency during the blending step. We consider three application scenarios: the correction of artifacts due to illumination changes in inpainted video sequences; the insertion of objects in a video; and the removal of objects. We restrict our study to the case where the inserted or removed object is affixed to a surface (for instance a sign on a wall). However, we do not restrict the motions of the camera nor of the surface in the scene. Our method also copes with occlusions and disocclusions of the inserted/removed object.

Temporal inconsistencies are sometimes more conspicuous than spatial ones.
They tend to manifest in the edited video by introducing an annoying flickering. A simple solution to avoid the flickering is to consider the video as a three-dimensional volume and manipulate the spatio-temporal gradient to perform the editing operations. In [12], it is shown that setting the spatio-temporal gradient as the guidance field eliminates the flickering artifact. However, this approach is only valid when the edited region is not moving. That is, if the target region is moving, then the temporal consistency must be enforced by following its movement in the scene. This may be the case when inserting new content in a video, like the advertisement in Figure 1. In [13] the temporal consistency is imposed by a Kalman smoothing process applied along the trajectories of the pixels, which are computed by integrating the optical flow. Other variational approaches, such as [14, 15], enforce temporal consistency by adding a spatial regularization term to a temporal one, based on a derivative along the direction of the movement (the convective derivative). Both terms are then balanced by a parameter β > 0. However, due to numerical problems that arise in the discretization of the convective derivative, β is restricted to a certain range of values. This restriction manifests itself in practice by requiring the guidance field to be present in all frames.

In this paper, we propose a novel numerical scheme for computing the convective derivative that allows β to take any desired value, thereby allowing the model to achieve its full potential. In particular, in some applications, we may avoid computing the guidance field in all frames. Furthermore, the proposed scheme introduces fewer errors than other discretization schemes. We use this scheme in the definition of two gradient domain video editing formulations that use $L^2$- and $L^1$-norms, which lead to Poisson [1] and total variation type models, respectively. While the $L^2$-norm forces smooth solutions, the $L^1$-norm allows sharp transitions, which may be desirable to better preserve textures.

Let us describe the organization of the paper. Section 2 presents our formulation for gradient domain video editing enforcing temporal consistency. In Section 3 we discuss the discretization of the convective derivative. In Section 4 we present the proposed new discretization scheme. Section 5 shows applications of the presented models to inpainting correction, object insertion, and object removal. The experiments show the ability of our model to handle occlusions automatically. Finally, in Section 6 we include some concluding remarks and future work.
Fig. 1. Temporal consistency in gradient domain video editing. From left to right, a video frame with motion vectors superimposed, the result of video editing imposing the temporal consistency disregarding the motion of the scene, and the result of video editing considering the motion of the scene.
2 Video Editing Model (Continuous Setting)
We propose a functional to perform gradient domain editing in a video while enforcing the temporal consistency. For that we minimize an energy defined on the spatio-temporal volume $\Pi := \Omega \times \mathcal{T}$:

$$E_p(u) = \int_{O \subset \Pi} |\partial_v u(x, t)|^p + \beta \|\nabla_x u(x, t) - g(x, t)\|^p \, dx \, dt, \tag{1}$$

where $\Omega \subset \mathbb{R}^2$ is the rectangular image domain, $\mathcal{T} = [0, T]$ is the temporal domain, $O \subset \Omega \times \mathcal{T}$ is the spatio-temporal domain where the editing is performed (Figure 2 illustrates these domains), $u : \Omega \times \mathcal{T} \to \mathbb{R}$ is a scalar function representing the video, and $v : \Omega \times \mathcal{T} \to \mathbb{R}^2$ is a known velocity field obtained from the original video $u_0$. We consider $p \in \{1, 2\}$ and $\beta \ge 0$. Equation (1) is solved with Dirichlet boundary conditions $u|_{\partial O \setminus \partial \Pi} = u_0$ at $\partial O \setminus \partial \Pi$ and homogeneous Neumann boundary conditions at $O \cap \partial \Pi$. The second term in (1) is similar to other gradient domain image editing models [1, 2] and it is responsible for image editing at each frame. The spatial gradient with respect to the two spatial dimensions is denoted by $\nabla_x$, and the field $g : O \to \mathbb{R}^2$ is the guidance field which determines the new content of the frames. The first term in (1) imposes the temporal consistency on the video. We denote by $\partial_v u(x, t)$ the convective derivative of $u$ along $v$ (a derivative of the video $u$ along the optical flow field $v$), which we describe next. Please note that throughout the text, we will refer to the temporal boundary as the set of points $(x, t) \in \partial O$ such that the scalar product $\langle [v_1(x, t), v_2(x, t), 1], n(x, t) \rangle \ne 0$, where $n(x, t)$ is the normal to $\partial O$, and $v_i(\cdot)$ denotes the $i$-th component of $v$.
Fig. 2. Illustration of the spatio-temporal domain
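For $p = 2$ and a single frame, the spatial term of (1) is a Poisson editing problem. The following one-dimensional sketch illustrates that special case only. It is not the scheme of the paper: forward differences, the function name `poisson_edit_1d`, and the tridiagonal (Thomas) solver are assumptions chosen to keep the example self-contained.

```python
def poisson_edit_1d(g, left, right):
    """Minimize sum_i (u[i+1] - u[i] - g[i])^2 with u fixed to `left` and
    `right` at the two ends; returns the interior values of u.

    g holds the guidance differences, so len(g) - 1 unknowns are solved for.
    """
    n = len(g) - 1
    # Normal equations: -u[j-1] + 2 u[j] - u[j+1] = g[j-1] - g[j].
    rhs = [g[j - 1] - g[j] for j in range(1, n + 1)]
    rhs[0] += left        # Dirichlet value moved to the right-hand side
    rhs[-1] += right
    # Thomas algorithm for the tridiagonal system (-1, 2, -1).
    c = [0.0] * n
    d = [0.0] * n
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for j in range(1, n):
        m = 2.0 + c[j - 1]
        c[j] = -1.0 / m
        d[j] = (rhs[j] + d[j - 1]) / m
    u = [0.0] * n
    u[-1] = d[-1]
    for j in range(n - 2, -1, -1):
        u[j] = d[j] - c[j] * u[j + 1]
    return u
```

If the guidance differences come from a source signal and both boundary values are offset by the same constant, the solution is the source shifted by that constant. This is the seamless behaviour expected from gradient domain editing.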
Temporal Consistency and the Convective Derivative. A particle moving in the real world describes a trajectory s : T → Ω when seen in a video. The velocity field v(s(t), t) = d/dt s(t) characterizes the motion of all the particles in the video. Then, given any particle and the associated trajectory s, the temporal consistency means that along s the gray level of u(s(t), t) is constant, which can be stated as

$$\frac{d}{dt}\, u(s(t), t) = 0.$$

Applying the chain rule to this equation we obtain

$$\partial_v u(x,t) := \nabla_x u(x,t) \cdot v(x,t) + \partial_t u(x,t) = 0, \quad \forall (x,t) \in \Omega \times T. \tag{2}$$

Here, ∂t denotes the derivative with respect to time (∂/∂t). The left hand side of (2) corresponds to the definition of the convective derivative of u, which is nothing else than the directional derivative in the direction of the flow v. Therefore the temporal consistency can be stated in terms of the convective derivative: ∂v u(x, t) = 0, ∀(x, t) ∈ Ω × T.

Analysis of the Energy. In order to highlight the roles of the terms of (1), let us rewrite it as

$$\int_T \int_{O_t} |\partial_v u(x,t)|^p \, dx\, dt + \beta \int_T \int_{O_t} \|\nabla_x u(x,t) - g(x,t)\|^p \, dx\, dt, \tag{3}$$
where Ot denotes the slice of O at time t. Observe that the rightmost term is a stack of independent gradient domain image editing problems: each spatial term attaches to its guidance field g(·, t) and to the corresponding spatial boundary condition. The leftmost term enforces the temporal consistency by introducing a dependence between consecutive instants. In particular, it penalizes the changes of intensity along the trajectories specified by v, while attaching to the temporal boundary conditions.

The parameter β ≥ 0 determines the mixing between the temporal consistency and the spatial term. When β → ∞, the temporal consistency is irrelevant and the solution is equivalent to solving many independent image editing problems. On the other hand, when β = 0 the functional only enforces the temporal consistency. In this case, the information at the temporal boundary is transported along the trajectories specified by v; however, we are not able to deal with the illumination changes that are incorporated by the spatial boundary conditions. Moreover, in the case β = 0, if we consider only the boundary condition at time t = 0, then solving (1) amounts to solving the advection equation (2). A common choice of β for enforcing the temporal consistency is β ∼ 1/10 [15]. However, in some cases it is interesting to consider even lower values of β. This could be the case in the absence of the guidance field, where the information needs to be transported from the temporal boundaries. As we will see in Section 3.1, standard discretizations of (1) do not always behave in a satisfactory way when β → 0.

The choice of p ∈ {1, 2} leads to two models with different characteristics of the solution. Let us discuss these two cases.

Case p = 2: Here, the energy (1) is quadratic and its solution is computed by solving the linear system

$$(\partial_v^* \partial_v \cdot + \beta\, \mathrm{div}_x \nabla_x \cdot)\, u = \beta\, \mathrm{div}_x\, g, \tag{4}$$
where divx denotes the spatial divergence and $\partial_v^*$ is given by $\partial_v^* f = -\partial_t f + \mathrm{div}_x(v f)$. Equation (4) is of Poisson type, and we solve it by using the conjugate gradient method. The solution of (4) smoothly adapts to the boundary conditions of O. Moreover, any error due to inconsistencies between the boundary conditions and the potential field g is smoothly spread across the whole domain O.

Case p = 1: Here, the energy (1) takes the form of a total variation minimization problem. To solve it, we perform an implicit gradient descent:

$$u^{j+1} = \arg\min_u \; E_1(u) + \frac{1}{2\lambda}\, \|u - u^j\|^2, \tag{5}$$
where λ is a positive number. Each iteration of the gradient descent entails the resolution of a convex problem similar to the total variation model for denoising [16, 17]. The solution of (5) is computed by solving its dual problem [17]. Defining
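the dual variables ψ : O → R and ξ : O → R^2, we perform the fixed point iteration with time step τ ≤ 1/8:

$$\psi^{k+1} = \frac{\psi^k + \tau\, \partial_v[\partial_v^* \psi^k + \beta\, \mathrm{div}_x \xi^k + u^j/\lambda]}{1 + \tau\, |\partial_v[\partial_v^* \psi^k + \beta\, \mathrm{div}_x \xi^k + u^j/\lambda]|},$$

$$\xi^{k+1} = \frac{\xi^k + \tau\, (\nabla_x[\partial_v^* \psi^k + \beta\, \mathrm{div}_x \xi^k + u^j/\lambda] + g/\lambda)}{1 + \tau\, \|\nabla_x[\partial_v^* \psi^k + \beta\, \mathrm{div}_x \xi^k + u^j/\lambda] + g/\lambda\|},$$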
and at convergence the solution is recovered as u = uj + λ(∂v∗ ψ + βdivx ξ). This method allows discontinuities of u in O and at its boundary. Its solution attaches to the boundary conditions reducing the effects of the illumination changes, but as opposed to the case p = 2 the transitions may not be smooth. While setting p = 2 favors smooth transitions, the model with p = 1 produces sharp transitions which may be desirable in some circumstances. It also allows a better preservation of textures. Figure 7e shows the distribution of the error in both cases and it also highlights the fact that texture is better preserved (we refer to Section 5 for details). Remark 1. The formulation of (1) can also easily accommodate spatial weights for controlling the blending as in [18], and the temporal consistency term can be modified as in [15] to keep the reflectance properties of the objects.
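For orientation, the fixed point iteration used for the case p = 1 has the same shape as Chambolle's projection algorithm [17] for plain total variation denoising. The following is a minimal sketch of that simpler algorithm on a 1-D signal, not the coupled ψ, ξ iteration of this paper. The function names, the 1-D setting, the forward differences with a zero last entry and the fixed iteration count are all assumptions made for illustration.

```python
def grad(u):
    """Forward differences, with a zero at the last sample (Neumann boundary)."""
    return [u[i + 1] - u[i] for i in range(len(u) - 1)] + [0.0]


def div(p):
    """Discrete divergence, defined as minus the adjoint of grad."""
    n = len(p)
    out = []
    for i in range(n):
        left = p[i - 1] if i > 0 else 0.0
        right = p[i] if i < n - 1 else 0.0
        out.append(right - left)
    return out


def tv_denoise_1d(f, lam, tau=0.125, iterations=300):
    """Approximately minimize TV(u) + ||u - f||^2 / (2 * lam) via the dual variable p."""
    n = len(f)
    p = [0.0] * n
    for _ in range(iterations):
        d = div(p)
        a = grad([d[i] - f[i] / lam for i in range(n)])
        # Semi-implicit projected step: keeps |p[i]| <= 1 at every iteration.
        p = [(p[i] + tau * a[i]) / (1.0 + tau * abs(a[i])) for i in range(n)]
    d = div(p)
    # The primal solution is recovered from the dual variable.
    return [f[i] - lam * d[i] for i in range(n)]
```

On the two-sample signal [0, 1] with lam = 0.25 the minimizer can be worked out by hand as [0.25, 0.75], which gives a quick sanity check of the sketch.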
3
Discretization of the Model
This section deals with the discretization of (1). The convective derivative is defined using the velocity field v, so we start by commenting on the use of the optical flow as an approximation for v. We argue that the optical flow is suited for imposing the temporal consistency. Then, we study some numerical schemes for computing the convective derivative and their limitations. In the next section, we introduce the deblurring convective derivative, a new numerical scheme for computing it.

In what follows we work in a discrete setting. The spatial domain is now a square lattice in Z^2, Ω = {0, 1, . . . , N}^2, and the temporal domain is T := {0, 1, . . . , T}; therefore a video is represented as a stack of T + 1 digital images (frames). The spatial gradient is computed using forward differences and is denoted by $\nabla_x^+$.

About the Optical Flow. As a discrete approximation of the velocity field, we use the optical flow computed on the original video u0 : Π → R. We define the forward optical flow v+ : Ω × T → R^2 between two frames u(·, t) and u(·, t + 1) as a vector field such that u(x, t) and u(x + v+(x, t), t + 1) correspond to the same point in the scene. Similarly, the backward optical flow v− relates the frame at t with the one at time t − 1. We discretize the temporal consistency constraint (2) using the forward optical flow as

$$\partial_v^+ u(x,t) := \hat u(x + v^+(x,t), t+1) - u(x,t) = 0, \tag{6}$$
Fig. 3. Results with and without pre-processing of the optical flow. From left to right: a diagram describing the pre-processing, the removed flows, the result obtained using the pre-processed flow and without using it.
where û(x + v+(x, t), t + 1) is the bilinear interpolation of u(·, t + 1) at x + v+(x, t). In a similar way, we can also discretize the temporal consistency constraint using the backward optical flow. The optical flow used in our experiments is obtained with the algorithm described in [19], but any optical flow algorithm with subpixel precision and regularization (or better, with edge preserving regularization) could be used.

As a last remark, the optical flow is not defined at the occluded/disoccluded areas of a frame. In these cases there may be no correspondence for a pixel in the next frame. While some optical flow algorithms produce occlusion maps, many others do not, so we pre-process the flow to identify the occlusion/disocclusion areas and remove them from the energy. A simple and direct way to achieve that is to compute |û0(x + v+(x, t), t + 1) − u0(x, t)| > tol, where u0 is the original video, and tol is a tolerance for the change in the gray levels. Flow vectors that do not satisfy the tolerance criterion are removed from the energy. The discrete energy becomes:

$$E_p(u) = \sum_{t \in T} \sum_{x \in \Omega} Occ(x,t)\, |\bar\partial_v u(x,t)|^p + \beta \sum_{t \in T} \sum_{x \in \Omega} \|\nabla_x^+ u(x,t) - g(x,t)\|^p, \tag{7}$$
where $\bar\partial_v$ denotes a discretization of the convective derivative, for instance $\partial_v^+ u(x,t)$, and Occ : Ω × T → {0, 1} indicates if a vector v+(x, t) satisfies the tolerance criterion or not. The inclusion of Occ(·, ·) implies that occluded/disoccluded pixels are only influenced by the spatial regularization term. Figure 3 illustrates with an experiment the benefits of pre-processing the optical flow.

3.1
Discretization of the Convective Derivative
The discretization of the convective derivative given in (6) corresponds to an implicit upwind scheme with an adaptive stencil [20]. That is, for computing the bilinear interpolation û(x + v+(x, t), t + 1) the convective derivative considers stencil points surrounding (x + vI(x, t), t + 1), where vI(x, t) is the integer part of the vector v+(x, t); this results in a potentially different stencil for each x. When used for simulating the advection equation, this adaptive scheme makes it possible to achieve stability with time steps beyond the one prescribed by the Courant–Friedrichs–Lewy condition [20].
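The forward scheme (6), the tolerance test on the flow and the discrete energy (7) can be sketched on a single image row, where bilinear interpolation reduces to linear interpolation. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, the video is a list of rows, and positions falling outside the row are clamped to the border rather than having their derivative set to zero.

```python
def sample_linear(row, x):
    """Linearly interpolate `row` at the real position x, clamped to the row."""
    x = min(max(x, 0.0), len(row) - 1.0)
    i = min(int(x), len(row) - 2)  # left stencil point (integer part of x)
    a = x - i
    return (1.0 - a) * row[i] + a * row[i + 1]


def forward_derivative(u_t, u_next, flow):
    """v+ scheme on one row: u_hat(x + v+(x), t + 1) - u(x, t)."""
    return [sample_linear(u_next, x + flow[x]) - u_t[x] for x in range(len(u_t))]


def occlusion_mask(u0_t, u0_next, flow, tol):
    """Occ(x) = 1 where the flow vector passes the tolerance test on the original video."""
    return [1 if abs(d) <= tol else 0
            for d in forward_derivative(u0_t, u0_next, flow)]


def forward_gradient(row):
    """Spatial forward differences with a zero at the last sample."""
    return [row[i + 1] - row[i] for i in range(len(row) - 1)] + [0.0]


def discrete_energy(video, flows, occ, guidance, beta, p):
    """One-row version of (7); the temporal term is taken as zero at the last frame."""
    temporal = 0.0
    for t in range(len(video) - 1):
        d = forward_derivative(video[t], video[t + 1], flows[t])
        temporal += sum(o * abs(v) ** p for o, v in zip(occ[t], d))
    spatial = 0.0
    for row, g in zip(video, guidance):
        spatial += sum(abs(a - b) ** p for a, b in zip(forward_gradient(row), g))
    return temporal + beta * spatial
```

For a linear ramp whose content moves by 0.3 pixels per frame, the derivative vanishes in the interior of the row, and the last pixel, whose correspondence leaves the row, is the only one rejected by the mask.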
[Figure 4 panels. Rows: frames t = 16, t = 31, t = 93. Columns: Original, v−-scheme, v+-scheme, DCD.]
Fig. 4. Temporal propagation from frame t = 0, by solving (7) using β = 0 and p = 2. Columns from left to right show: the original frame, and the results obtained with the v−-scheme (explicit), the v+-scheme (implicit), and the DCD scheme for discretizing the convective derivative (see Section 4).
Using the forward optical flow v+ we define the v+-scheme for computing the convective derivative as

$$\partial_v^+ u(x,t) := \begin{cases} \hat u(x + v^+(x,t), t+1) - u(x,t) & \text{if } t < T, \\ 0 & \text{if } t = T \text{ or } (x + v^+(x,t), t+1) \notin O, \end{cases}$$

where û : Z^2 × T → R is a bilinear interpolation of u(·, t + 1) at x + v+(x, t), modified in order to account for the Neumann boundary conditions at ∂(Ω × T). Similarly, with the backward optical flow v− we define the discrete v−-scheme for computing the convective derivative as

$$\partial_v^- u(x,t) := \begin{cases} u(x,t) - \hat u(x + v^-(x,t), t-1) & \text{if } t > 0, \\ 0 & \text{if } t = 0 \text{ or } (x + v^-(x,t), t-1) \notin O. \end{cases}$$

We implement these operators as sparse matrices, which allows us to easily compute their adjoints.

Evaluation. Let us consider the following experiment. We solve (7) with β = 0 (spatial term disabled) and with a Dirichlet boundary condition only at t = 0. Note that the guidance field g is undefined in this experiment. In the solution, we expect the content of the first frame to be transported according to the known optical flow v from time t = 0 to all subsequent frames. This experiment is related to simulating the advection equation ∂v u = 0 with initial condition at t = 0. When applied in this setting, the v−- and v+-schemes for computing the convective derivative correspond to applying explicit and implicit upwind schemes for solving the advection equation, respectively. Using the
[Figure 5 panels. Left: results of the v−-scheme, the v+-scheme and DCD at frames t = 11, t = 20 and t = 40. Right: plot of MSE (vertical axis, 0 to 20) against frame (horizontal axis, 0 to 40) for the v−-scheme, the v+-scheme and DCD.]
Fig. 5. Temporal propagation with β = 0.01, horizontal motion 0.175 px/frame, g known and p = 2. After 40 frames the results of the v−-scheme (explicit scheme) or the v+-scheme (implicit scheme) are more distorted than with DCD. The plots of the evolution of the error w.r.t. the ground truth (right) confirm it.
v−-scheme can be seen as applying an explicit scheme, where a filter is applied to u(·, t − 1) to obtain u(·, t), while the v+-scheme behaves as an implicit one where an inverse filter relates u(·, t) with u(·, t + 1). Figure 4 shows the results of this transport experiment. It is not surprising to see that, for both the explicit and the implicit discretizations, the solutions are completely distorted only a few frames away from t = 0. The reason is that these schemes introduce a numerical diffusion (or oscillations) that accumulates over time. These artifacts are due to approximation errors in the discretization of the differential operators. In particular, in the case of the above upwind schemes the approximation is only first-order accurate.

Artifacts such as those seen in Figure 4 are in general not observed, since models like (7) are typically applied with β > 0 and with a known guidance field g for all the frames [14]. However, in some circumstances, such as partial occlusions, the guidance field may not be easily computed. Moreover, as we see in Figure 5, even with β > 0 and with a known g, these discretizations induce noticeable errors in the result. We have seen that, due to discretization issues, we are unable to cope with a small value of β and, therefore, unable to use (7) to its full potential.
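The first-order accuracy noted above can be illustrated with a Taylor expansion. The sketch below assumes a smooth 1-D signal u, unit pixel spacing and a constant sub-pixel displacement a ∈ (0, 1); the symbol a is introduced here for illustration only.

```latex
\begin{aligned}
u(x+a) &= u(x) + a\,u'(x) + \tfrac{a^2}{2}\,u''(x) + \dots
  && \text{(exact transport)}\\
(1-a)\,u(x) + a\,u(x+1) &= u(x) + a\,u'(x) + \tfrac{a}{2}\,u''(x) + \dots
  && \text{(linear interpolation)}\\
(1-a)\,u(x) + a\,u(x+1) - u(x+a) &= \tfrac{a(1-a)}{2}\,u''(x) + \dots
\end{aligned}
```

Under these assumptions each explicit step behaves like the exact shift plus a small diffusion with coefficient a(1 − a)/2, for example 0.105 for a = 0.3 and about 0.072 for the 0.175 px/frame motion of Figure 5, which is consistent with a blur that accumulates from frame to frame.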
4
The Deblurring Convective Derivative (DCD)
It is interesting to note that, in the experiment shown in Figure 4, the effects of the explicit and implicit schemes for discretizing the convective derivative are somewhat opposite. The explicit scheme introduces blurring in the solution, while the implicit scheme sharpens the solution but also introduces oscillations. This motivates the search for a new scheme which allows the convective derivative to preserve the transported information for longer periods of time. The idea of the deblurring convective derivative (DCD for short) is to attain this objective as the balance of two opposing processes: implicit and explicit schemes. That is, applying both the v+- and v−-schemes to moderate each other's effects.
Let us analyze in detail the effect of using the explicit upwind scheme (v−-scheme) while solving (7) by considering an illustrative example. We consider a problem similar to the previous one, where only the initial frame (t = 0) is known, and we have a constant optical flow v− = [0.3, 0]^T, v+ = [−0.3, 0]^T (with this flow the problem reduces to 1D). Taking β = 0 and p = 2 in (7) and analyzing the first two frames of the sequence we get: $\min_u \sum_{x \in \Omega} |\partial_v^- u(x,1)|^2$. It is easy to see that the values of u(·, 1) which minimize this energy are explicitly determined by applying the filter [0.7, 0.3] (coefficients specific to this flow) to the rows of frame u(·, 0). Similarly, frame u(·, 2) is obtained by filtering u(·, 1), and so on. Denoting the filtering operator by Mv−, we can describe this relation as u(·, t + 1) = Mv− u(·, t). In this case, the minimum of the complete energy (for all the frames) is 0, and the solution is increasingly blurry with t since u(·, t) = (Mv−)^t u(·, 0), which confirms what we observed in Figure 4.

A similar analysis for the implicit upwind scheme (v+-scheme) reveals that the solution of the problem restricted to the first frame, $\min_u \sum_{x \in \Omega} |\partial_v^+ u(x,0)|^2$, satisfies the relation Mv+ u(·, 1) = u(·, 0), where Mv+ applies the filter [0.3, 0.7] to the rows of u(·, 1). Therefore, u(·, 1) is given by the pseudo-inverse of Mv+ applied to u(·, 0). The repeated application of the pseudo-inverse acts as an inverse smoothing. This sharpening enhances the high frequencies in the solution but also introduces numerical artifacts, as seen in Figure 4. Let us mention that, by solving the linear system using the conjugate gradient method, this problem is mitigated by its regularization effect. The DCD takes advantage of the fact that these two schemes have opposite effects on the data (one blurs while the other sharpens) and alternates between them.
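The opposite behaviour of the two filters can be checked in the frequency domain. The following sketch assumes periodic (or infinitely long) rows and the constant flow of the example; U_t(ω) stands for the discrete Fourier transform of a row of u(·, t) and m(ω) for the transfer function of the filter [0.7, 0.3]. Both symbols are introduced here for illustration.

```latex
\begin{aligned}
\text{explicit:}\quad & U_{t+1}(\omega) = m(\omega)\,U_t(\omega),
  \qquad m(\omega) = 0.7 + 0.3\,e^{i\omega},
  \qquad |m(\omega)|^2 = 0.58 + 0.42\cos\omega \le 1,\\
\text{implicit:}\quad & U_{t+1}(\omega) = \frac{U_t(\omega)}{\overline{m(\omega)}},
  \qquad \frac{1}{|m(\omega)|^2} \ge 1,\\
\text{one step of each:}\quad & \left|\frac{m(\omega)}{\overline{m(\omega)}}\right| = 1 .
\end{aligned}
```

In this idealized setting the explicit step attenuates every non-zero frequency (down to a factor 0.4 at ω = π), the implicit step amplifies it by the reciprocal (up to 2.5), and a pair of alternating steps leaves all amplitudes unchanged, so that only a phase error remains.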
In short, if between t = 0 and t = 1 we apply the v+-scheme (implicit), then from frame t = 1 to t = 2 we apply the v−-scheme (explicit), and so on. The diagrams in Figure 6 show how the temporal derivatives are taken. Note that there is no computational overhead in the use of DCD with respect to implicit or explicit schemes. It can also be applied sequentially on pairs of frames [14] without the need to solve the complete system associated with (7). In the DCD, a step with the v+-scheme makes it possible to recover the frequencies smoothed by the previous step (v−-scheme). However, it may also introduce other high frequencies, which in the long term will build up as high frequency
Fig. 6. On the left, we show a diagram depicting the v + -scheme for discretizing the convective derivative. On the right, we show a diagram corresponding to the DCD: it alternates between v− and v + -schemes.
artifacts. These artifacts can then be removed using a low-pass filter as a postprocessing step. In general the DCD preserves the transported information for much longer periods of time without noticeable decay. The performance of the proposed method is shown in Figures 4 and 5, where it is compared with the implicit and explicit upwind schemes. Observe that the texture is preserved for more than 70 frames (which is considered as a very long time).
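The alternation can be tried on a toy problem. The sketch below applies the schemes sequentially, frame by frame, to a single periodic row with a constant displacement a = 0.3, and compares the signal norm after 40 frames. It is a simplified illustration, not the implementation behind Figures 4 and 5: the periodic boundary, the function names and the small hand-written DFT used to solve the implicit step are all assumptions.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Plain O(n^2) discrete Fourier transform (adequate for short rows)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(signal[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def explicit_step(u, a):
    """v- scheme: u_new(x) = (1 - a) u(x) + a u(x + 1), periodic row."""
    n = len(u)
    return [(1 - a) * u[i] + a * u[(i + 1) % n] for i in range(n)]


def implicit_step(u, a):
    """v+ scheme: solve a u_new(x - 1) + (1 - a) u_new(x) = u(x), periodic row."""
    n = len(u)
    spec = dft(u)
    solved = [spec[k] / ((1 - a) + a * cmath.exp(-2j * math.pi * k / n))
              for k in range(n)]
    return [z.real for z in dft(solved, inverse=True)]


def propagate(u0, a, frames, scheme):
    """Transport u0 over `frames` steps with 'explicit', 'implicit' or 'dcd'."""
    u = list(u0)
    for t in range(frames):
        if scheme == "explicit":
            u = explicit_step(u, a)
        elif scheme == "implicit":
            u = implicit_step(u, a)
        else:  # DCD: implicit step first, then explicit, and so on
            u = implicit_step(u, a) if t % 2 == 0 else explicit_step(u, a)
    return u


def norm(u):
    return math.sqrt(sum(x * x for x in u))
```

With a row made of two sinusoids, the explicit scheme loses a large part of the signal norm, the implicit scheme inflates it, and the alternating scheme keeps it unchanged up to rounding in this periodic setting.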
5
Applications
What we have presented so far can be used for a variety of applications. Here we present some experiments that illustrate the usefulness of these techniques in three applications:

1. Correcting an inpainted domain by making it temporally consistent.
2. Inserting an object on a surface while handling situations where the object is partially occluded/disoccluded.
3. Removing an object from a surface while handling situations where the object is partially occluded/disoccluded.

In what follows we discuss the experimental results for each application. The full set of results can be viewed at http://gpi.upf.edu/static/emmcvpr11.

Inpainting Correction. In this scenario, we are given an inpainted video sequence. Thus, we can compute the guidance field in all frames. We would like to correct this inpainted sequence so that it becomes spatio-temporally consistent. In order to achieve that, we use the inpainted video as our input and solve (7) for p ∈ {1, 2} and β = 1. Figure 7a shows the original sequence. Figure 7b shows the resulting inpainted sequence. The sequence has been inpainted with an algorithm that generates a temporally consistent inpainting, yet not consistent with respect to the illumination changes at each frame. Figures 7c and 7d show the result obtained by solving (7) with p = 1 and p = 2, respectively. Notice that the inconsistency which is present in the inpainted sequence has been corrected and the result is temporally and spatially consistent. We also observe that in this case, the results of p = 1 and p = 2 are very similar.

Object Insertion. As we have pointed out in Section 3, we pre-process the optical flow for detecting occlusions/disocclusions; the tolerance used in the experiments is tol = 50. In this scenario, we wish to insert a two-dimensional object into a video sequence and affix it to a surface.
Thanks to the DCD, we are able to coherently transport information present in one frame into subsequent (or previous) frames in the video sequence. Basically, we start inserting the object into a chosen frame that indicates the first appearance of the object in the video. Let us call this the first frame. Then, by solving Equation (7) with a small value of β (usually 0.01 for p = 1, and 0.0001 for p = 2) and setting the first frame as a Dirichlet temporal boundary condition, we are able to transport the first frame to the others. Since we are affixing an object on a surface, the inserted object inherits the optical flow from this surface. The
(a) Original sequence.
(b) Inpainted sequence.
(c) Corrected inpainting with p = 1
(d) Corrected inpainting with p = 2
(e) See caption.
Fig. 7. Correcting an inpainted domain to impose spatio-temporal consistency using Equation (7) with p ∈ {1, 2} and β = 1. Figure (e), left (resp. right), shows the modulus of the difference of the magnitude of the gradients corresponding to the first column of images 7b and 7c (resp. 7b and 7d). These images have been jointly scaled to take values in [0, 255] (max value = 16). Note that in the case p = 2 the differences are spread across the domain.
occlusions are handled automatically by the temporal consistency using the pre-processed optical flow, and it suffices to insert the object into the first frame. If we want to handle disocclusions, then we also need to insert the object into a later frame in the sequence and set it as another Dirichlet boundary condition. Let us call this the last frame. We note that in this setting we only need the information in the first and the last frames. For the intermediate frames we just have a hole that we fill in with the new information. Figure 8 shows an experiment where we insert a poster on a door, replacing an existing one. The area where we insert the poster is occluded and then disoccluded by a moving man. The total number of frames of the sequence is 31. The results shown in Figure 8 are obtained by solving Equation (7) with p ∈ {1, 2}. As we can see, our method handles both occlusions and disocclusions. Notice that sometimes, on the boundary of the occluding object, we can see some small inconsistencies due to the inaccuracy of the optical flow. We will address this issue in future work.
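The role of the two Dirichlet temporal boundary conditions can be illustrated on a single trajectory. Assuming a static pixel (zero flow), β = 0 and p = 2, the energy restricted to that trajectory is the sum of squared differences between consecutive frames with both end values fixed, and its minimizer interpolates linearly between the first and last frames. The function below is a hypothetical sketch of this special case only; it solves the resulting tridiagonal system directly.

```python
def fill_between_boundaries(first_value, last_value, num_frames):
    """Minimize sum_t (u[t + 1] - u[t])**2 with u[0] and u[-1] fixed."""
    n = num_frames - 2  # number of unknown intermediate frames
    if n < 1:
        raise ValueError("need at least one intermediate frame")
    # Normal equations: 2 u[t] - u[t - 1] - u[t + 1] = 0 for the unknown frames.
    rhs = [0.0] * n
    rhs[0] += first_value
    rhs[-1] += last_value
    # Thomas algorithm for the tridiagonal system (diagonal 2, off-diagonals -1).
    c = [0.0] * n
    d = [0.0] * n
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return [first_value] + x + [last_value]
```

In this simplified case an illumination change between the first and last frames is spread evenly over the intermediate frames; with a guidance field, β > 0 or p = 1 the behaviour differs.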
(a) Original sequence. Frames t = 1..6 and t = 24..30.
(b) Inserted object using Equation (7) with p = 1
(c) Inserted object using Equation (7) with p = 2
Fig. 8. Object insertion experiment with handling of occlusions and disocclusions. Figures (a) show some frames of the original sequence, with the editing region superimposed. Figures (b) and (c) show the results obtained after solving (7) with p = 1 and p = 2, respectively. In this experiment the new object is only provided in the first and last frames of the sequence, and the editing region in-between is unknown. A closer inspection of the results corresponding to p = 1 and p = 2 reveals that the model with p = 1 better preserves the texture.
Fig. 9. Example of object removal. We remove the object only in the first frame (by inpainting) and then transport the information to subsequent frames by solving Equation (7) with p = 1. The result for p = 2 is similar. The complete sequence is available at the webpage mentioned at the beginning of Section 5.
Object Removal. This object insertion case can be easily adapted to object removal. We can remove an object from the first and last frames and then fill-in the hole in these frames either by inpainting or by copying a zero gradient field using Poisson editing, for instance. Then we can just transport the information from the temporal boundaries to all the domain in-between. Figure 9 shows an example.
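Filling a hole by copying a zero gradient field with Poisson editing [1] amounts to solving the discrete Laplace equation inside the hole, with the surrounding pixels as Dirichlet data. The following is a minimal sketch for a single frame stored as a list of lists; the function name, the Gauss-Seidel solver, the fixed number of sweeps and the requirement that the hole does not touch the image border are assumptions made for illustration.

```python
def fill_hole_zero_gradient(image, hole, sweeps=500):
    """Gauss-Seidel sweeps for the discrete Laplace equation inside `hole`.

    `hole` is a set of (row, col) positions that must not touch the image border.
    """
    out = [row[:] for row in image]
    order = sorted(hole)
    for _ in range(sweeps):
        for r, c in order:
            # Each hole pixel becomes the mean of its four neighbours.
            out[r][c] = 0.25 * (out[r - 1][c] + out[r + 1][c]
                                + out[r][c - 1] + out[r][c + 1])
    return out
```

If the surrounding image is a linear ramp, the filled hole reproduces the ramp, since a linear function has zero discrete Laplacian.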
6
Conclusions and Future Work
In this work we have presented the deblurring convective derivative, a novel scheme for computing the convective derivative that allows us to maintain information for a long period of time without decay. We have also integrated this derivative in a functional for video editing that imposes both spatial and temporal consistency. We applied this formulation to three video editing applications: object insertion, object removal and inpainting correction. Our solution relies extensively on the quality of the optical flow and we have seen that inconsistencies in the optical flow considerably affect the quality of the results. Our future work will mainly focus on how to detect and handle inconsistent flows in order to improve the results.

Acknowledgements. We acknowledge support by MICINN project, reference MTM2009-08171, and by GRC reference 2009 SGR 773. VC also acknowledges partial support by IP project “2020 3D Media: Spatial Sound and Vision”, financed by EC, and by the “ICREA Acadèmia” prize for excellence in research, funded by the Generalitat de Catalunya. We acknowledge P. Arias and E. Meinhardt for their assistance and useful comments.
References
1. Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. ACM Trans. Graph. 22, 313–318 (2003)
2. Georgiev, T.: Image reconstruction invariant to relighting. In: Eurographics 2005, pp. 61–64 (2005)
3. Finlayson, G.D., Hordley, S.D., Lu, C., Drew, M.S.: On the removal of shadows from images. IEEE Trans. on PAMI 28(1), 59–68 (2006)
4. Fattal, R., Lischinski, D., Werman, M.: Gradient domain high dynamic range compression. ACM Trans. Graph. 21, 249–256 (2002)
5. Arias, P., Facciolo, G., Caselles, V., Sapiro, G.: A variational framework for exemplar-based image inpainting. Int. J. Comput. Vision 93, 319–347 (2011)
6. Komodakis, N., Tziritas, G.: Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans. on IP 16, 2649–2661 (2007)
7. Kwatra, V., Essa, I., Bobick, A., Kwatra, N.: Texture optimization for example-based synthesis. ACM Trans. Graph. 24, 795–802 (2005)
8. Sun, J., Jia, J., Tang, C.K., Shum, H.Y.: Poisson matting. ACM Trans. Graph. 23, 315–321 (2004)
9. Agrawal, A., Raskar, R.: Gradient domain manipulation techniques in vision and graphics. In: ICCV 2007 Short Course (2007)
10. Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: IEEE ICCV 2007, pp. 1–8 (2007)
11. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: IEEE CVPR 2010, pp. 2141–2148 (2010)
12. Wang, H., Xu, X., Raskar, R., Ahuja, N.: Videoshop: A new framework for spatiotemporal video editing in gradient domain. In: IEEE CVPR 2005, p. 1201 (2005)
13. Bugeau, A., Gargallo, P., D’Hondt, O., Hervieu, A., Papadakis, N., Caselles, V.: Coherent Background Video Inpainting through Kalman Smoothing along Trajectories. In: Modeling, and Visualization Workshop, p. 8 (2010)
14. Bhat, P., Zitnick, C.L., Cohen, M., Curless, B.: Gradientshop: A gradient-domain optimization framework for image and video filtering. ACM Trans. Graph. 29, 10:1–14 (2010)
15. Bhat, P., Zitnick, C.L., Snavely, N., Agarwala, A., Agrawala, M., Curless, B., Cohen, M., Kang, S.B.: Using photographs to enhance videos of a static scene. In: Eurographics Symposium on Rendering, pp. 327–338 (2007)
16. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D, 259–268 (1992)
17. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20, 89–97 (2004)
18. Tao, M.W., Johnson, M.K., Paris, S.: Error-tolerant image compositing. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 31–44. Springer, Heidelberg (2010)
19. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
20. Zhou, M.H., Mascagni, M., Qiao, A.Y.: Explicit Finite Difference Schemes for the Advection Equation. Relation 10, 7098 (1998)
Texture Segmentation via Non-local Non-parametric Active Contours
Miyoun Jung, Gabriel Peyré, and Laurent D. Cohen
Ceremade, Université Paris-Dauphine, 75775 Paris Cedex 16, France
{jung,peyre,cohen}@ceremade.dauphine.fr
Abstract. This article introduces a novel active contour model that makes use of non-parametric estimators over patches for the segmentation of textured images. It is based on an energy that enforces the homogeneity of these statistics. This smoothness is measured using Wasserstein distances among discretized probability distributions that can handle features in arbitrary dimension. It is thus usable for the segmentation of color images or other high dimensional features. The Wasserstein distance is more robust than traditional pointwise statistical metrics (such as the Kullback-Leibler divergence) because it takes into account the relative distances between modes in the distributions. This makes the corresponding energy robust and does not require any smoothing of the statistical estimators. To reduce the computation time, we propose an alternative metric that retains the main qualities of the Wasserstein distance, while being faster to compute. It aggregates 1-D Wasserstein distances over a set of directions, and thus benefits from the simplicity of 1-D statistical metrics while being able to discriminate high dimensional features. We show numerical results that instantiate this novel framework using grayscale and color value distributions. This allows us to segment regions with smoothly varying intensities or colors as well as complicated textures.
1
Introduction
This article considers a variational minimization problem aiming at detecting objects in textured images. It makes use of a comparison principle between pairs of neighboring patches. We thus refer to it as a “non-local” approach, following the terminology initiated in [1]. The resulting method is a general framework to implement a piecewise smooth segmentation model. To illustrate the generality of this concept, we make use of statistical distances between distributions. The resulting method considers that the objects to be segmented are sampled from a non-stationary distribution with parameters that vary smoothly from pixel to pixel. Our approach is general when it comes to the use of distributions, in the sense that it can handle the distributions of general features such as pixel values, small chunks of pixel values to capture the joint dependencies between neighboring pixels, multiscale coefficients (e.g. wavelets, Gabor filter banks, etc.) or groups of these. In this article, we restrict our attention to the distributions of pixel values (intensities or color values).

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 74–88, 2011. © Springer-Verlag Berlin Heidelberg 2011
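For two sets of samples of equal size, the 1-D Wasserstein distance can be computed by sorting, without building histograms, and distances between higher dimensional features can then be aggregated over a set of directions, as outlined in the abstract. The sketch below is only an illustration of that idea: the function names, the equal-size restriction and the choice of averaging the p-th powers over the directions are assumptions, not details taken from this article.

```python
def wasserstein_1d(xs, ys, p=2):
    """W_p between two equal-size 1-D samples; sorting gives the optimal assignment."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    xs, ys = sorted(xs), sorted(ys)
    total = sum(abs(a - b) ** p for a, b in zip(xs, ys))
    return (total / len(xs)) ** (1.0 / p)


def projected_wasserstein(points_a, points_b, directions, p=2):
    """Aggregate 1-D distances between projections onto the given unit directions."""
    total = 0.0
    for d in directions:
        proj_a = [sum(c * dc for c, dc in zip(pt, d)) for pt in points_a]
        proj_b = [sum(c * dc for c, dc in zip(pt, d)) for pt in points_b]
        total += wasserstein_1d(proj_a, proj_b, p) ** p
    return (total / len(directions)) ** (1.0 / p)
```

For color values, `points_a` and `points_b` would be lists of 3-D tuples taken from two patches, and `directions` a small set of unit vectors.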
1.1
Previous Works
Region-Based Active Contours. This paper is focused on the class of region-based approaches to image segmentation. These methods make use of information extracted from inside and outside the region to be segmented. Following the seminal work of Mumford and Shah on optimal partitions [2], several influential works have proposed to perform image segmentation using a variational minimization with a region-based energy that seeks constant features in each region to be segmented, see for instance [3, 4]. More recently, several models have been introduced to handle a local homogeneity of the features. They require the estimation of a piecewise smooth parameter field, see for instance [5–9]. The non-local active contour method, recently introduced in [10, 11], also makes use of such a local homogeneity principle, but implements the variational minimization using only a pairwise comparison of features, which might for instance depend on patches extracted around each pixel. A chief advantage of this approach is that it does not require the estimation of a piecewise smooth set of parameters, and only requires the design of a metric to compare patches. In this paper, we extend this non-local active contour framework in several directions, to capture locally homogeneous statistical distributions.

Non-local Segmentation. Non-local methods have been applied in many image processing problems such as denoising [1], general inverse problems [12] and segmentation [13–15], by regularizing the image using comparison of patches in the image. In this article we define an attraction term pulling the contour towards the object boundaries, in contrast to the existing non-local segmentation methods, which use non-local energy terms only as regularization terms.

Statistical Segmentation. Instead of considering a small set of parameters (such as the mean value of the features), more general models make use of statistical distributions to drive the segmentation.
These approaches are quite effective for many natural images or textures that contain complicated random fluctuations. The resulting statistical region-based active contours make use of pointwise similarity measures among distributions (such as the Kullback-Leibler divergence) to compare the distributions, in a parametric or non-parametric (using Parzen windows) fashion, see for instance [16–19]. In this paper, we also consider the setting of statistical segmentation, and we propose to use a fully non-parametric estimator that does not require computing histograms.

Wasserstein Metric. Traditional pointwise statistical distances are simple to compute, but suffer from several drawbacks, in particular the difficulty of handling localized distributions. This is because these statistical distances do not take into account the relative positions between the modes of the distributions. In practice, a smoothing of the histogram using Parzen windows is required to
make these methods usable on localized distributions. To address these issues, Ni et al. [20] propose to make use of the L1 Wasserstein distance in order to extend the segmentation model of Chan and Vese [3]. This work is extended to color image segmentation in [21] using a Wasserstein metric only on the brightness channel, thus resulting in a 1-D optimal transport metric. The Wasserstein distance is now routinely used in computer vision, for instance as a metric for retrieval, where it improves over more classical pointwise distances between histograms [22]. It is also used for other image processing applications such as warping [23] and texture synthesis [24]. This metric is related to the assignment problem [25]. We use this connection in our work to avoid computing histograms.
1.2 Comparison with Previous Works
Our work draws connections between recent works in statistical image segmentation, and extends these works in several directions. We make use of the non-local active contour variational minimization problem proposed recently in [10, 11]. We extend it by introducing a pixel-by-pixel normalization of the energy. This is crucial to obtain an unbiased segmentation result in the case where the size of patches is large. This is typically required for statistical segmentation when the distributions have a large variance (such as for noisy images or for complicated textures). Furthermore, we instantiate this energy using a Wasserstein metric. This defines a notion of locally homogeneous statistical distributions, which might be of independent interest. We follow the work of [20], which is the first to clearly acknowledge the importance of optimal transport for image segmentation. We develop a different variational model that does not require a global consistency of the local histograms. This is important to segment images with a slowly varying background, or complicated non-stationary textures. Furthermore, we extend this method to distributions in arbitrary dimension, which leads for instance to a 3-D distance for color distributions. We also make use of the Lp Wasserstein distance using an assignment formulation of the metric, while previous works restrict their attention to p = 1. To speed up the evaluation of the Wasserstein distance, we approximate it using a series of 1-D projections, which is a method introduced in [24] for histogram equalization.
1.3 Contributions
Our first contribution is a novel region-based segmentation method that extends the non-local active contour method [10, 11] using a pixel-by-pixel normalization. Our second contribution is the definition of a novel segmentation criterion that requires only a local homogeneity of the statistical distribution of features. This criterion is handled using Wasserstein metrics, which extends the global homogeneity criterion introduced in [20, 21].
Texture Segmentation via Non-local Non-parametric Active Contours
2 Non-local Active Contours
Section 2.1 recalls the non-local active contours energy introduced originally in [10, 11]. We then describe our contribution, which is a normalization of this model that significantly reduces the bias introduced by the use of patches.
2.1 Un-normalized Non-local Active Contours [10, 11]
Our goal is to segment an image $f : [0,1]^2 \to \mathbb{R}^d$, where $d$ is the dimensionality of the feature space. We make use of a local patch extraction process to design variational energies.
Pairwise Patch Interaction. A patch in some image $f$ around a pixel $x \in [0,1]^2$ is defined as
$$p_x(t) = f(x+t), \quad \forall\, t \in [-\tau/2, \tau/2]^2.$$
The non-local interaction between two patches is measured using a metric $d(\cdot,\cdot) \geq 0$ that accounts for the similarity between patches. The simplest choice, considered in [10], is a weighted $L^2$ distance
$$d(p_x, p_y) = \int_t G_a(t)\, \|p_x(t) - p_y(t)\|^2 \, dt, \tag{1}$$
where a Gaussian weight $G_a(t) = e^{-\|t\|^2/(2a^2)}$ can be used to give more influence to the central pixel. More complicated similarity measures can be used, and Section 3 explains how to use statistical distances.
Level Set Formulation. The segmentation problem corresponds to the computation of some region $\Omega \subset [0,1]^2$ that should capture the objects of interest. We represent the segmented region $\Omega$ using a level set function $\varphi : [0,1]^2 \to \mathbb{R}$ so that $\Omega = \{x \,\backslash\, \varphi(x) > 0\}$. To simplify the exposition, we make use of a smoothed Heaviside function $H(\varphi) = \frac{1}{2} + \frac{1}{\pi}\,\mathrm{atan}(\varphi/\varepsilon)$ to introduce variational energies and compute their derivatives. The parameter $\varepsilon$ should be chosen small enough to obtain a sharp region boundary, but not too small to avoid numerical instabilities. In the numerical examples, we use $\varepsilon = 1/n$ for a discretized image of $n \times n$ pixels. A mathematically more rigorous way to derive the corresponding PDE is to make use of the shape derivative machinery, which is formally equivalent to letting $\varepsilon$ tend to 0, see for instance [18, 19]. Using such a shape gradient would make the evolution PDE well defined only on the boundary of $\Omega$, and this evolution is then extended to the whole domain by preserving some distance function property on $\varphi$.
Non-local Segmentation Energy. Following [10, 11], we introduce an energy functional $E_0(\varphi)$ enforcing the similarity of features located either inside or outside $\Omega$,
$$E_0(\varphi) = \iint \rho(H(\varphi(x)), H(\varphi(y)))\, G_\sigma(x-y)\, d(p_x, p_y)\, dx\, dy. \tag{2}$$
The function $\rho$ restricts the comparison to pairs of patches that are in the same region (inside or outside). Since $H(\varphi(x))$ is close to being a binary function, we use $\rho(u, v) = 1 - |u - v|$ for the numerical experiments (but other similar functionals could be used as well). Note that the parameter $\sigma > 0$ is important since it controls the scale of the local homogeneity one requires for the segmented object. To enforce the regularity of the boundary of the extracted region, following previous works in active contours, we penalize its length, which is computed as
$$L(\varphi) = \int \|\nabla H(\varphi)(x)\|\, dx = \int H'(\varphi(x))\, \|\nabla\varphi(x)\|\, dx,$$
where $\nabla H(\varphi)(x)$ is the gradient at point $x$ of the function $H(\varphi)$. The non-local active contour model computes the segmentation as a stationary point of the energy
$$\min_{\varphi}\; E_0(\varphi) + \gamma L(\varphi), \tag{3}$$
where γ > 0 is a parameter that should be adapted to the expected regularity of the boundary of the region.
Limitation and Motivation. The non-local active contours model works well when the size of patches is small. Figure 1 shows examples of segmentation of piecewise smooth images using the L2 patch distance (1) with patches of width τ = 3/n for an image of size n × n and a = 0.5/n. The local homogeneity property of the energy (2) enables the model (3) to correctly detect objects which are only locally homogeneous, and can deal with separated objects with different intensities. This model however suffers from a segmentation bias. The segmented region is shifted away from the object boundary by an amount proportional to the patch width τ. This becomes problematic when used with large patches, because of the lack of precision of the resulting segmentation. Large patches (and large values of a) are however desirable as the noise level increases, since robustness requires more pixels to evaluate the local homogeneity. Figures 2 and 3 show that increasing the size of patches (the value of a) leads to segmented regions with a smoother boundary, but also reveal that the resulting curve is not located on the exact boundary of the object. Figure 3, first column, shows how the amount of bias increases as the size of patch (or the value a) increases.
2.2 Normalized Non-local Active Contour Model
Normalized Energy. To reduce the segmentation bias introduced by the non-local active contour energy (2), we define a novel normalized non-local energy
$$E(\varphi) = \int \frac{1}{C(\varphi, x)} \int \rho(H(\varphi(x)), H(\varphi(y)))\, G_\sigma(x-y)\, d(p_x, p_y)\, dy\, dx, \tag{4}$$
where the local normalization factor is
$$C(\varphi, x) = \int \rho(H(\varphi(x)), H(\varphi(y)))\, G_\sigma(x-y)\, dy > 0.$$
Note that the un-normalized energy $E_0$ defined in (2) is recovered by setting $C(\varphi, x) = 1$.
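To make the effect of the normalization concrete, the following NumPy sketch discretizes the energies (2) and (4) on a small image. It is only an illustration under stated assumptions: pixel-unit coordinates, an unweighted L2 patch distance (the case a = ∞ of (1)), a Gaussian form for the window $G_\sigma$, edge-replicated image borders, and the choice ρ(u, v) = 1 − |u − v| from Section 2.1. All function names are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def heaviside(phi, eps):
    """Smoothed Heaviside H(phi) = 1/2 + atan(phi / eps) / pi (Section 2.1)."""
    return 0.5 + np.arctan(phi / eps) / np.pi

def patch_distance_l2(f, tau):
    """Pairwise unweighted L2 patch distance, i.e. (1) with a = infinity.

    f is an (n, n) grayscale image and tau an odd patch width in pixels.
    Returns an (n*n, n*n) matrix. Edge replication is an assumption.
    """
    n = f.shape[0]
    r = tau // 2
    padded = np.pad(f, r, mode="edge")
    # Row k of `patches` holds the tau*tau values around pixel k (row-major).
    patches = np.stack([padded[i:i + n, j:j + n].ravel()
                        for i in range(tau) for j in range(tau)], axis=1)
    diff = patches[:, None, :] - patches[None, :, :]
    return (diff ** 2).sum(axis=2)

def nonlocal_energies(f, phi, tau=3, sigma=2.0, eps=1.0):
    """Return (E0, E, C): discrete versions of (2), (4) and of C(phi, x)."""
    n = f.shape[0]
    yy, xx = np.mgrid[0:n, 0:n]
    pos = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
    sq_dist = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-sq_dist / (2.0 * sigma ** 2))    # assumed Gaussian window
    H = heaviside(phi.ravel(), eps)
    rho = 1.0 - np.abs(H[:, None] - H[None, :])  # rho(u, v) = 1 - |u - v|
    w = rho * G
    d = patch_distance_l2(f, tau)
    C = w.sum(axis=1)                            # normalization per pixel x
    E0 = (w * d).sum()                           # un-normalized energy (2)
    E = ((w * d).sum(axis=1) / C).sum()          # normalized energy (4)
    return E0, E, C

if __name__ == "__main__":
    f = np.zeros((8, 8))
    f[:, 4:] = 1.0                               # two constant halves
    phi = np.where(np.arange(8)[None, :] < 4, 1.0, -1.0) * np.ones((8, 8))
    E0, E, C = nonlocal_energies(f, phi, eps=1e-3)
    print("E0 =", E0, "E =", E, "C range =", C.min(), C.max())
```

In this toy setting the factor C varies from pixel to pixel (it shrinks where fewer same-region neighbours fall inside the window), which is the disparity that the normalization in (4) compensates.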
In practice, the correction factor $1/C(\varphi, x)$ is far from being constant, in particular when the size of patches is large. This normalization is thus crucial to reduce the disparities that increase as a pixel approaches the boundary of the segmented region.
Gradient Flow. Our normalized non-local active contour model computes the segmentation as a stationary point of the energy
$$\min_{\varphi}\; E(\varphi) + \gamma L(\varphi), \tag{5}$$
where $L$ is the length term defined in Section 2.1 and $\gamma > 0$ is a regularization parameter. Introducing an artificial time $t \geq 0$, the gradient flow of (5) reads
$$\frac{\partial \varphi}{\partial t} = -\left(\nabla E(\varphi) + \gamma \nabla L(\varphi)\right), \tag{6}$$
for $\varphi(x, t)$ parameterized by space and time. The gradients are computed as
$$\nabla E(\varphi)(x) = \frac{\partial f(\varphi(x))\, g(\varphi(x)) - f(\varphi(x))\, \partial g(\varphi(x))}{(g(\varphi(x)))^2}, \tag{7}$$
$$\nabla L(\varphi)(x) = -\mathrm{div}\left(\frac{\nabla\varphi(x)}{\|\nabla\varphi(x)\|}\right) H'(\varphi(x)), \tag{8}$$
with the notations
$$f(u) := \int \rho(H(u), H(\varphi(y)))\, G_\sigma(x-y)\, d(p_x, p_y)\, dy,$$
$$g(u) := \int \rho(H(u), H(\varphi(y)))\, G_\sigma(x-y)\, dy,$$
$$\partial f(u) := \int (\partial_1 \rho)(H(u), H(\varphi(y)))\, G_\sigma(x-y)\, d(p_x, p_y)\, dy \; H'(u),$$
$$\partial g(u) := \int (\partial_1 \rho)(H(u), H(\varphi(y)))\, G_\sigma(x-y)\, dy \; H'(u),$$
where $\partial_1$ is the gradient with respect to the first variable.
Numerical Implementation. The segmentation is applied to a discretized image $f$ of $n \times n$ pixels. The length energy of Section 2.1 is computed using a finite difference approximation of the gradient. In a preprocessing step, the distance between neighboring patches $d(p_x, p_y)$ is computed. Depending on the numerical application, one might want to use either the weighted L2 norm (1) or the sliced Wasserstein distance defined in (11). The gradient flow (6) is then discretized using a gradient descent
$$\varphi^{(\ell+1)} = \varphi^{(\ell)} - \mu \left(\nabla E(\varphi^{(\ell)}) + \gamma \nabla L(\varphi^{(\ell)})\right),$$
where $\mu > 0$ is a suitable time step size.
To make all the level sets evolve simultaneously, $H'(\varphi(x))$ appearing in (7) and (8) is replaced by $\|\nabla\varphi(x)\|$. To ensure the stability of the level set evolution (6), one needs to re-initialize it from time to time. This corresponds to replacing $\varphi$ by the signed distance function to the level set $\{x \,\backslash\, \varphi(x) = 0\}$. The width $\sigma$ of the windowing function $G_\sigma(x - y)$ typically depends on the initial curve at time $t = 0$. If the initial curve is far away from the object boundary, a large windowing function might be required.
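As an illustration of the numerical implementation, the length gradient (8) and one explicit descent step can be sketched with finite differences. The sketch below assumes unit grid spacing, central differences via `np.gradient`, and the replacement of $H'(\varphi)$ by $\|\nabla\varphi\|$ described above. The data term `grad_E` is passed in by the caller, and the re-initialization step is not shown. The function names are hypothetical.

```python
import numpy as np

def length_gradient(phi, tiny=1e-8):
    """Finite-difference version of (8), with H'(phi) replaced by ||grad phi||."""
    gy, gx = np.gradient(phi)                  # derivatives along rows, columns
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny   # tiny avoids division by zero
    div = np.gradient(gy / norm, axis=0) + np.gradient(gx / norm, axis=1)
    return -div * norm

def descent_step(phi, grad_E, gamma, mu):
    """One explicit step phi <- phi - mu * (grad_E + gamma * grad_L)."""
    return phi - mu * (grad_E + gamma * length_gradient(phi))

def disk_level_set(n, radius):
    """Signed distance to a centered circle, positive inside (Omega = {phi > 0})."""
    yy, xx = np.mgrid[0:n, 0:n]
    c = n // 2
    return radius - np.sqrt((yy - c) ** 2 + (xx - c) ** 2)

if __name__ == "__main__":
    phi = disk_level_set(64, 10.5)
    area_before = int((phi > 0).sum())
    # With a zero data term, only the length penalty acts on the contour.
    phi = descent_step(phi, np.zeros_like(phi), gamma=1.0, mu=1.0)
    print(area_before, int((phi > 0).sum()))
```

For a signed distance function to a circle, the value returned by `length_gradient` at distance r from the center is close to the curvature 1/r, so the length term alone shrinks the region, as expected of a length penalty.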
3 Wasserstein Local Homogeneity
Classical deterministic similarity measures such as the L2 norm (1) are suitable to segment piecewise smooth images. They can also be used over a transformed domain (such as the output of a Gabor filter bank [10]) to characterize some simple geometric textures. To handle complicated images containing textural contents, it is useful to consider similarity measures between statistical distributions.
3.1 Wasserstein Distance
In this article we consider a Lagrangian discretization of distributions, which corresponds to treating a distribution $X$ as a point cloud $X = \{X_i\}_{i=0}^{N-1} \subset \mathbb{R}^d$. In our numerical applications, $d = 1$ for grayscale images and $d = 3$ for color images. This is different from the more traditional Eulerian discretization, which makes use of a fixed set of points (usually a rectangular grid) but where the points are equipped with a weight to reflect the local density of the distribution. These histogram-based (Eulerian) discretizations are at the heart of previous statistical region-based active contours such as [18, 19, 20]. In this Lagrangian setting, for $p \geq 1$, the $L^p$ Wasserstein distance between two distributions $X, Y \subset \mathbb{R}^d$ is defined as
$$W^p(X, Y) = \min_{\sigma \in \Sigma_N} \sum_{i=0}^{N-1} \|X_i - Y_{\sigma(i)}\|^p, \tag{9}$$
where $\Sigma_N$ is the set of all the permutations of $N$ elements. For simplicity we have restricted our attention to distributions having the same number of points, which is the case for our application to segmentation. Note that $W$ should really be understood as being a distance between point clouds, since it is invariant under a permutation of the indexes of the distributions. The permutation $\sigma$ minimizing (9) is the optimal assignment between the two point clouds. This optimal assignment problem can be solved using combinatorial optimization schemes in $O(N^{5/2} \log(N))$ operations, see [25].
3.2 Sliced Wasserstein Distance
1-D Wasserstein Distance. In the 1-D case, the optimal assignment $\sigma$ that solves (9) can be computed in $O(N \log(N))$ operations by ordering the point clouds $X$ and $Y$,
$$X_{\sigma_X(i)} \leq X_{\sigma_X(i+1)} \quad \text{and} \quad Y_{\sigma_Y(i)} \leq Y_{\sigma_Y(i+1)},$$
with two permutations $\sigma_X, \sigma_Y \in \Sigma_N$. The optimal permutation is then $\sigma = \sigma_Y \circ \sigma_X^{-1}$. Equivalently, the Wasserstein distance is the $L^p$ norm of the sorted vectors
$$W^p(X, Y) = \sum_{i=0}^{N-1} |X_{\sigma_X(i)} - Y_{\sigma_Y(i)}|^p. \tag{10}$$
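The equivalence between the assignment formulation (9) and the sorting formula (10) in one dimension can be checked on a tiny example. The following sketch is illustrative only: the function names are hypothetical, and the brute-force search enumerates all N! permutations, so it is usable only for very small N.

```python
from itertools import permutations

def wasserstein_1d(x, y, p=2):
    """W^p(X, Y) for 1-D point clouds of equal size, via sorting as in (10)."""
    if len(x) != len(y):
        raise ValueError("point clouds must have the same number of points")
    return sum(abs(a - b) ** p for a, b in zip(sorted(x), sorted(y)))

def wasserstein_bruteforce(x, y, p=2):
    """W^p(X, Y) from definition (9): minimum over all permutations."""
    n = len(x)
    return min(sum(abs(x[i] - y[s[i]]) ** p for i in range(n))
               for s in permutations(range(n)))

if __name__ == "__main__":
    x = [0.9, 0.1, 0.5, 0.3]
    y = [0.2, 0.8, 0.4, 0.6]
    print(wasserstein_1d(x, y), wasserstein_bruteforce(x, y))
```

Both functions return the same value on this example, and the result does not change when the entries of `x` or `y` are reordered, which reflects the permutation invariance noted in Section 3.1.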
Note the major computational difference between the assignment problem (9) in dimension $d = 1$ and in higher dimensions $d > 1$, where no $O(N \log(N))$ algorithm is available.
Sliced Approximation. The numerical complexity of solving (9) in dimension $d > 1$ is prohibitive for imaging applications such as our segmentation problem. To obtain a fast numerical scheme, we follow the work of Rabin et al. [24] that introduces a sliced Wasserstein distance. It is defined as an aggregation of 1-D Wasserstein distances of projected distributions
$$SW^p(X, Y) = \sum_{\theta \in \Theta} W^p(X_\theta, Y_\theta) \quad \text{where} \quad X_\theta = \{\langle X_i, \theta \rangle\}_{i=0}^{N-1}. \tag{11}$$
Here $X_\theta, Y_\theta \subset \mathbb{R}$ are projected 1-D distributions and $\Theta \subset \mathbb{R}^d$ is a discrete set of directions, sampled on the unit sphere. Evaluating this sliced distance (11) has a complexity of $O(|\Theta| N \log(N))$ operations, which is advantageous over the original Wasserstein distance (9) if $\Theta$ is not too large. Although there is no mathematical proof of the quality of the approximation of $W$ using $SW$, numerical observations suggest that $SW$ is a good approximation to solve minimization problems involving the Wasserstein metric, see [24].
3.3 Wasserstein Non-local Active Contours
The sliced approximation (11) is used to measure the similarity between patches to perform statistical region-based segmentation. We thus propose to replace the L2 norm in (1) by
$$d(p_x, p_y) = W^p([p_x], [p_y]),$$
where the operator $[\cdot]$ maps a vector $v = (v_i)_{i=0}^{N-1} \in \mathbb{R}^N$ to a point cloud $[v] = \{v_i\}_{i=0}^{N-1} \subset \mathbb{R}^d$.
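A minimal sketch of this patch distance, assuming color patches stored as nested lists of RGB tuples and using the sliced distance (11), is given below. The helper names are hypothetical. The set of directions is left to the caller; the example uses the three coordinate axes, which is the choice reported for the color experiments in Section 4.

```python
def patch_to_cloud(patch):
    """The operator [.]: forget pixel positions, keep the list of feature values."""
    return [pixel for row in patch for pixel in row]

def sliced_wasserstein(X, Y, directions, p=2):
    """SW^p(X, Y) as in (11): sum of 1-D distances (10) of projected clouds."""
    if len(X) != len(Y):
        raise ValueError("point clouds must have the same number of points")
    total = 0.0
    for theta in directions:
        x_proj = sorted(sum(a * t for a, t in zip(pt, theta)) for pt in X)
        y_proj = sorted(sum(a * t for a, t in zip(pt, theta)) for pt in Y)
        total += sum(abs(a - b) ** p for a, b in zip(x_proj, y_proj))
    return total

AXES = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

if __name__ == "__main__":
    patch_a = [[(1, 0, 0), (0, 1, 0)], [(0, 0, 1), (1, 1, 1)]]
    patch_b = [[(1, 1, 1), (0, 0, 1)], [(0, 1, 0), (1, 0, 0)]]  # same pixels, moved
    print(sliced_wasserstein(patch_to_cloud(patch_a), patch_to_cloud(patch_b), AXES))
```

Two patches that contain the same pixel values in a different spatial arrangement are at distance zero, which is the intended statistical behavior. Note also that with only the coordinate axes as directions, some distinct clouds are not separated, for example {(0,0,0), (1,1,1)} and {(1,0,0), (0,1,1)}; this is a property of the sketch with this particular Θ, and richer direction sets reduce such collisions.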
4 Experimental Results and Comparisons
This section presents experimental results with synthetic and real images. Chan-Vese and LBF Model Incorporated with Patches. For grayscale images, we compare our model (5) with extensions of the Chan-Vese (CV) [3] and locally
Fig. 1. Results of the un-normalized model (3) with the weighted L2 distance function (1), and comparison with Chan-Vese model [3]. Patches of width τ = 3/n (3 × 3 pixels) and a = 0.5/n are used. 100 × 100 image and 31 × 31 windowing function are used.
binary fitting (LBF) models [6]. These extensions make use of patches to enable a fair comparison with our method. They are obtained by replacing intensity features by patches in the original CV or LBF models
$$\min_{p^1, p^2, \varphi}\; E(p^1, p^2, \varphi) + \gamma L(\varphi), \tag{12}$$
where $E$ is either $E_{CV}$ or $E_{LBF}$ defined as
$$E_{CV} = \lambda_1 \int d(p_x, p^1)\, H(\varphi(x))\, dx + \lambda_2 \int d(p_x, p^2)\, (1 - H(\varphi(x)))\, dx,$$
$$E_{LBF} = \lambda_1 \iint G_\sigma(x-y)\, d(p_y, p^1_x)\, H(\varphi(y))\, dy\, dx + \lambda_2 \iint G_\sigma(x-y)\, d(p_y, p^2_x)\, (1 - H(\varphi(y)))\, dy\, dx,$$
where λ1, λ2 > 0 are parameters, and p1 and p2 are updated by iteratively cycling through ϕ and then (p1, p2). Note that the energy ECV is introduced in [20] in the special case of the L1-Wasserstein distance. In the numerical examples, we let λ1 = λ2 = 1, and we tried to choose the best smoothness parameters γ for each model. For color images, we compare our model (5) with the vector-valued Chan-Vese model [26], and the work [21] that uses a Wasserstein metric only on the brightness channel and the vector-valued Chan-Vese model for the chromaticity components.
L2 Distance Between Patches. Figure 1 shows the results of the un-normalized model (3) tested on synthetic images with spatially varying background and/or object, or with several separated objects with different intensities. Due to the
[Fig. 2 panels: the un-normalized model (3); our normalized model (5)]
Fig. 2. Comparisons of our new normalized model (5) with the un-normalized model (3) with the L2 distance (1). 1st-3rd: the un-normalized model with patches of width τ = 1/n (1st), τ = 3/n and a = ∞ (2nd-3rd) with two different parameters γ = 0.3 (2nd) and 0.5 (3rd). 4th-5th: our new normalized model (5) with patches of width τ = 3/n, τ = 5/n and a = ∞.
[Fig. 3 panels: the un-normalized model (3); our normalized model (5), CV, LBF]
Fig. 3. Comparisons of models with the new distance function (10). 1st-2nd columns: the un-normalized model (3) with two different parameters γ = 0.1, 0.5 (1st row), γ = 0.2, 0.3 (2nd row). 3rd-5th columns: our new normalized model (5), CV and LBF models (12) with patches of width τ = 15/n (top row), τ = 11/n (bottom).
local homogeneity property, the model correctly detects objects, while the two-phase Chan-Vese model, which requires a global homogeneity in each region, fails to segment them correctly. Figure 2 presents an example where the un-normalized model (3) does not provide satisfactory results with any kind of patch: patches of width τ = 1/n (1 pixel) or τ = 3/n (3 × 3 pixels) with a = ∞. With patches of width τ = 1/n, the un-normalized model (3) produces noisy final curves, and with patches of width τ = 3/n, it yields smoother final curves that are however not located on the object boundaries, regardless of the smoothness parameter γ. On the other hand, our new normalized model (5) with patches of width τ = 5/n provides a smooth final curve, located exactly on the boundary. This example also illustrates a case where a large patch size is required, because our model with a smaller patch size (τ = 3/n) also gives noisy final curves.
Fig. 4. Results of our new normalized model (5) with the Wasserstein distance (10). 1st-5th columns: initial curve (1st), curve evolution (2nd-3rd), final curve (4th), plot of energy $E(\varphi^{(\ell)})$ vs iteration $\ell$ (5th). Patches of width τ = 3/n are used.
Fig. 5. Texture segmentation using our new normalized model (5) with the Wasserstein distance (10), and comparison with CV model (12). 1st-3rd rows: initial curves (1st), results of our model (2nd) and CV model (3rd). 1st, 3rd, 4th columns: patch with width τ = 11/n, 2nd column: patch with width τ = 7/n. In our model, 31 × 31 windowing function is used.
[Fig. 6 panels: initial curve; our model (5); vector-valued CV [26]; extended work [21]]
Fig. 6. Texture segmentation using our new normalized model (5) with the sliced Wasserstein distance (11) with |Θ| = 3 fixed directions, and comparison with the vector-valued Chan-Vese model [26] and the extended work [21] of [20]. Patches of width τ = 11/n (1st), τ = 5/n (2nd), τ = 9/n (3rd), τ = 7/n (4th-5th) are used, and 31 × 31 windowing functions are used.
Wasserstein Distance Between Patches. Figure 3 presents examples where the L2 patch distance (1) cannot be applied because the black and white stripe pattern is a texture that is not homogeneous in the pixel domain. Furthermore, these examples require a large patch size to capture the texture statistics. We present comparisons of the un-normalized model (3), our new normalized model (5), and the CV and LBF models (12), using the 1-D Wasserstein distance (10). The 1st-2nd columns present the results of the un-normalized model with different smoothness parameters γ, with patches of width τ = 15/n (1st row) and 11/n (2nd row). The un-normalized model produces biased final curves near the boundary, and large values of γ seem to reduce the bias to some extent. However, the bias cannot be removed completely by tuning γ, and large values of γ also result in over-smoothed curves and slow convergence. In contrast, our model locates the curves exactly on the boundary. Lastly, the 4th-5th columns show that the CV and LBF models also do not locate the final curves on the exact boundary: the curves lie a few pixels away from the boundaries. Although our model and the CV/LBF models have similar behaviors on globally homogeneous textures, this example highlights the importance of our normalization. Figure 4 shows that our new normalized model (5) with the Wasserstein distance (10) detects objects with smoothly varying distributions of intensities and separated multiple objects with different distributions of intensities. It also shows the curve evolution of our model starting from given initial curves, and displays the convergence of the energy $E(\varphi^{(\ell)})$ as a function of the iteration index $\ell$. Figure 5 presents texture segmentation results of our model (5) and a comparison with the CV model (12). Again due to the local homogeneity, our model discriminates different textures having different distributions of intensities, while CV fails to discriminate them correctly.
In Figure 6, we present color texture segmentation results of our model using a sliced Wasserstein distance (11). We considered only |Θ| = 3 projection directions, i.e. Θ = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}, which was enough to obtain satisfactory segmentation in all the given examples. We also compare our model with the vector-valued Chan-Vese model [26], and with the color extension [21] of the original method proposed in [20]. In all the examples, our model correctly detects the boundary of objects and segments separated multiple objects with different distributions of color values, in contrast to the other models, which do not locate the curve on the exact boundaries, detect only part of the objects, or fail to detect objects. Note that the second example was degraded by random-valued impulsive noise of density 0.3.
5 Conclusion
In this article, we have proposed a novel non-local energy for the segmentation of textured images, making use of non-parametric estimators over patches. The Wasserstein distance and its sliced approximation can be used as a similarity measure which allows one to segment complicated textural features in arbitrary dimension. Due to the local homogeneity property of the energy, our active
contour model is able to detect regions with smoothly spatially varying features. It can also segment several separated regions with different features. All these properties are significant extensions of existing region-based models, and they are crucial for solving difficult texture segmentation problems.
Acknowledgments. This work is supported by ANR grant NatImages, ANR08-EMER-009.
References
1. Buades, A., Coll, B., Morel, J.M.: A review of image denoising algorithms, with a new one. SIAM Mul. Model. and Simul. 4, 490–530 (2005)
2. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics XLII (1989)
3. Chan, T., Vese, L.: Active contours without edges. IEEE Trans. Image Proc. 10, 266–277 (2001)
4. Paragios, N., Deriche, R.: Geodesic active regions: A new framework to deal with frame partition problems in computer vision. Journal of Visual Communication and Image Representation 13, 249–268 (2002)
5. Tsai, A., Yezzi, A., Willsky, A.S.: Curve evolution implementation of the Mumford-Shah functional for image segmentation, denoising, interpolation, and magnification. IEEE Trans. Image Proc. 10, 1169–1186 (2001)
6. Li, C., Kao, C., Gore, J., Ding, Z.: Implicit active contours driven by local binary fitting energy. In: Proceedings of the CVPR 2007, pp. 1–7 (2007)
7. Wang, L., Li, C., Sun, Q., Xia, D., Kao, C.Y.: Active contours driven by local and global intensity fitting energy with application to brain MR image segmentation. Computerized Medical Imaging and Graphics 33, 520–531 (2009)
8. Wang, L., He, L., Mishra, A., Li, C.: Active contours driven by local Gaussian distribution fitting energy. Signal Processing 89, 2435–2447 (2009)
9. Wang, X., Huang, D., Xu, H.: An efficient local Chan–Vese model for image segmentation. Pattern Recognition 43, 603–618 (2010)
10. Jung, M., Peyré, G., Cohen, L.D.: Nonlocal active contours. In: Proc. SSVM 2011 (2011)
11. Jung, M., Peyré, G., Cohen, L.D.: Nonlocal segmentation and inpainting. In: Proc. ICIP 2011 (2011)
12. Peyré, G., Bougleux, S., Cohen, L.: Non-local regularization of inverse problems. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 57–68. Springer, Heidelberg (2008)
13. Gilboa, G., Osher, S.: Nonlocal linear image regularization and supervised segmentation. SIAM Mul. Model. and Simul. 6, 595–630 (2007)
14. Elmoataz, A., Lezoray, O., Bougleux, S.: Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing. IEEE Trans. Image Process. 17, 1047–1060 (2008)
15. Bresson, X., Chan, T.: Non-local unsupervised variational image segmentation models. UCLA CAM Report 08-67 (2008)
16. Kim, J., Fisher, J., Yezzi, A., Cetin, M., Willsky, A.: A nonparametric statistical method for image segmentation using information theory and curve evolution. IEEE Trans. Image Proc. 14, 1486–1502 (2005)
17. Zhu, S., Yuille, A.: Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. Patt. Anal. and Mach. Intell. 18, 884–900 (1996)
18. Jehan-Besson, S., Herbulot, A., Barlaud, M., Aubert, G.: Segmentation of Vectorial Image Features Using Shape Gradients and Information Measures. Springer, Heidelberg (2005)
19. Herbulot, A., Besson, S.J., Duffner, S., Barlaud, M., Aubert, G.: Segmentation of vectorial image features using shape gradients and information measures. Journal of Mathematical Imaging and Vision 25, 365–386 (2006)
20. Ni, K., Bresson, X., Chan, T., Esedoglu, S.: Local histogram based segmentation using the Wasserstein distance. Int. J. Comput. Vis. 84, 97–111 (2009)
21. Bao, Z., Liu, Y., Peng, Y., Zhang, G.: Variational color image segmentation via chromaticity-brightness decomposition. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, Y.-P.P. (eds.) MMM 2010. LNCS, vol. 5916, pp. 295–302. Springer, Heidelberg (2010)
22. Rubner, Y., Tomasi, C., Guibas, L.: The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vis. 40, 99–121 (2000)
23. Haker, S., Zhu, L., Tannenbaum, A.: Optimal mass transport for registration and warping. Int. J. Comput. Vis. 60, 225–240 (2004)
24. Rabin, J., Peyré, G., Delon, J., Bernot, M.: Wasserstein barycenter and its application to texture mixing. In: Proc. of SSVM 2011 (2011)
25. Burkard, R., Dell'Amico, M., Martello, S.: Assignment problems. SIAM (2009)
26. Chan, T., Sandberg, B., Vese, L.: Active contours without edges for vector-valued images. J. Vis. Comm. Image Repr. 11, 130–141 (2000)
Evaluation of a First-Order Primal-Dual Algorithm for MRF Energy Minimization
Stefan Schmidt, Bogdan Savchynskyy, Jörg Hendrik Kappes, and Christoph Schnörr
Heidelberg University, IWR / HCI, Speyerer Str. 6, 69115 Heidelberg, Germany
{schmidt,kappes,schnoerr}@math.uni-heidelberg.de, [email protected]
http://ipa.iwr.uni-heidelberg.de http://hci.iwr.uni-heidelberg.de
Abstract. We investigate the First-Order Primal-Dual (FPD) algorithm of Chambolle and Pock [1] in connection with MAP inference for general discrete graphical models. We provide a tight analytical upper bound of the stepsize parameter as a function of the underlying graphical structure (number of states, graph connectivity) and thus insight into the dependency of the convergence rate on the problem structure. Furthermore, we provide a method to efficiently compute primal and dual feasible solutions as part of the FPD iteration, which yields a sound termination criterion based on the primal-dual gap. An experimental comparison with Nesterov's first-order method in connection with dual decomposition shows the superiority of the latter in optimizing the dual problem. However, due to the direct optimization of the primal bound, for small-sized problems (e.g. 20x20 grid graphs) with a large number of states, FPD iterations lead to faster improvement of the primal bound and a resulting faster overall convergence.
Keywords: graphical model, MAP inference, LP relaxation, image labeling, sparse convex programming.
1 Introduction
1.1 Overview
Our goal is to compute maximum-a-posteriori (MAP) solutions for discrete Markov random fields (MRF), specified by a hypergraph $G = (V, F)$, where the set of hyperedges $F$ is a subset of the power-set $2^V$ of $V$, which we will call the set of factors.¹ The states of random variables of the MRF belong to finite sets $X_v$, $v \in V$. The notation $X_a$, $a \in F$, is used for the Cartesian product $\otimes_{v \in a} X_v$ of state sets for variables belonging to the factor $a \in F$. The associated distribution is $p_G(x; \theta) \propto \exp(-E_G(x; \theta))$, with the energy function
$$E_G(x; \theta) = \sum_{a \in F} \theta_a(x_a), \tag{1}$$
where $\theta$ denotes a collection of potential functions $\theta_a : X_a \to \mathbb{R}$ associated with each factor $a \in F$.
¹ Note that we represent factors directly as hyperedges here. In the representation of [2] as a bipartite graph $\tilde{G} = (V, F; E)$, this implies $E = \{(v, a) \in V \times F \mid v \in a\}$.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 89–103, 2011. © Springer-Verlag Berlin Heidelberg 2011
1.2 Related Work and Motivation
Computing a MAP solution is equivalent to minimization of the energy (1) and is known to be NP-complete in general. Thus we will concentrate on its linear programming (LP) relaxation over the local polytope [3]. The special case in which the problem contains only first- and second-order factors ($\forall a \in F$, $|a| \leq 2$) is referred to below as the pairwise or second-order case. Its LP relaxation was originally studied in [4] (see also a modern overview [5]). There are a number of algorithms for solving this LP relaxation. The first group of such algorithms contains the DAG and diffusion algorithms by Schlesinger (cf. [5]) and the closely related TRW-S algorithm by Kolmogorov [6]. These algorithms decrease the value of the dual LP monotonically but may not attain its optimum in general, since they can be interpreted as (block-)coordinate descent and thus can get stuck, due to the non-smoothness of the dual objective. Even so, TRW-S is considered one of the fastest approximate solvers for the problem [7]. As alternatives, sub-gradient schemes for maximizing the dual objective were proposed in [8] and [9]. They are theoretically guaranteed to reach the optimum. However, these schemes are rather slow not only in theory, but also in practice. A recent paper [10] proposed to combine a dual decomposition [11] and Nesterov's first-order optimization scheme [12], which can be considered as a compromise between the speed of TRW-S and a guarantee of convergence. Another recent paper [1] proposes a first-order primal-dual iteration scheme and a range of successful applications to variational optimization problems in image processing. Since this method is suited for large-scale non-smooth convex problems, and since MRF based image labeling covers a broad range of applications in computer vision, a competitive assessment of the method is of particular interest.
1.3 Contribution
Our contribution is three-fold:
– We propose a way of applying the first-order primal-dual method [1] to the LP relaxation of the general (not necessarily pairwise) MRF energy minimization problem (1) and numerically compare it to Nesterov's optimization scheme [10] and TRW-S [6].
– For the pairwise case we provide a tight bound showing how the step size parameter of the method depends on the model structure and on the number of variable states.
– We generalize a method for computing an approximate primal solution [10] to models of arbitrary order (it was proposed originally only for the pairwise case) and propose a similar approach for constructing an approximate solution of the dual problem. These two approximations result in a sound stopping criterion based on the duality gap.
2 Methods
2.1 LP Relaxation of the MAP Problem
We introduce the notation $F^1 = \{a \in F : |a| = 1\}$ for the set of all unary factors. Without loss of generality, we suppose that the model includes a unary factor for each variable, i.e. $\{\{v\} : v \in V\} \subseteq F$, and that all non-first-order potentials are absorbed into those of the highest order, i.e. if $b \in F \setminus F^1$, then for all $a \in F$, $a \subset b$ implies $|a| = 1$. We start by representing the energy (1) in overcomplete form [3] as
$$E_G(x; \theta) = \sum_{i \in I(G)} \theta_i \cdot \phi_i(x) = \langle \theta, \phi(x) \rangle, \tag{2}$$
where $x \in \otimes_{v \in V} X_v$ is a model configuration, and the potentials $\theta = (\theta_a(x_a),\ a \in F,\ x_a \in X_a)$ as well as the indicator vectors $\phi(x) \in \{0, 1\}^{I(G)}$ are indexed by $I(G) = \{(a; x_a) \mid a \in F,\ x_a \in X_a\}$. The notation $\langle \cdot, \cdot \rangle$ is used for the standard scalar product. Relaxing the binary vector $\phi$ to a vector $\mu = (\mu_a(x_a),\ a \in F,\ x_a \in X_a)$ with components from the interval $[0, 1]$ and imposing consistency constraints between the components leads to the well-known linear programming relaxation of the problem of minimizing (2):
$$\min_{\mu}\ \langle \theta, \mu \rangle \quad \text{s.t.} \quad L\mu = c, \quad \mu \ge 0. \tag{3}$$
Here L is the matrix of a linear operator and c a vector of corresponding dimension, which we will define next. Let $\mathbb{1}_a = (1, \dots, 1)^\top$ denote an $|X_a|$-dimensional vector of ones. A specific feature of our problem (3) is that the constraint matrix L has a block form. Namely, for $\mu_a \in \mathbb{R}^{X_a}$, $\mu_b \in \mathbb{R}^{X_b}$ the problem (3) can be written as
$$\min_{\mu}\ \langle \theta, \mu \rangle \tag{4}$$
$$\text{s.t.} \quad L_{ab}\, \mu_b = \mu_a, \quad b \in F \setminus F^1,\ a \subset b, \tag{5}$$
$$\mathbb{1}_a^\top \mu_a = 1, \quad a \in F^1, \tag{6}$$
$$\mu \ge 0. \tag{7}$$
S. Schmidt et al.
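For the second-order case, problem (4)-(7) reads:
$$\min_{\mu}\ \langle \theta, \mu \rangle \tag{8}$$
$$\text{s.t.} \quad \sum_{x_{a'} \in X_{a'}} \mu_b(x_a, x_{a'}) = \mu_a(x_a), \quad b \in F \setminus F^1,\ \forall a \in b,\ a' \in b \setminus \{a\},\ x_a \in X_a, \tag{9}$$
$$\sum_{x_a \in X_a} \mu_a(x_a) = 1, \quad a \in F^1, \tag{10}$$
$$\mu \ge 0. \tag{11}$$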
The dual to (3),
$$\max_{\nu}\ \langle c, \nu \rangle \quad \text{s.t.} \quad L^\top \nu \le \theta, \tag{12}$$
plays a significant role in many optimization schemes. We will analyze its structure for the general (non-pairwise) case in Section 2.3. We cast the pair (3), (12) of optimization problems into a saddle point form via their Lagrangian,
$$\max_{\mu \ge 0}\ \min_{\nu}\ \{ \langle -c, \nu \rangle + \langle \mu, L^\top \nu \rangle - \langle \theta, \mu \rangle \}, \tag{13}$$
which has the general form required to apply the first order primal-dual iteration scheme, Algorithm 1 in [1]. This algorithm is referred to below as FPD.
2.2 Primal-Dual Iteration Scheme
Starting from any $\mu^{(1)} \ge 0$, $\nu^{(1)}$, $\zeta^{(1)}$, the FPD algorithm iterates, for $t = 1, 2, \dots$ and step-size $\tau > 0$, the updates:
$$\mu^{t+1} \leftarrow \Pi_{\mathbb{R}_+}\big(\mu^{t} + \tau (L^\top \zeta^{t} - \theta)\big),$$
$$\nu^{t+1} \leftarrow \nu^{t} - \tau (L \mu^{t+1} - c), \tag{14}$$
$$\zeta^{t+1} \leftarrow 2\nu^{t+1} - \nu^{t},$$
where $\Pi_{\mathbb{R}_+}$ denotes the projection onto the positive orthant $\mathbb{R}_+$.² As shown in [1, Th. 1], the algorithm achieves an O(1/t) convergence rate, where t is the number of iterations. Note that this algorithm requires only two sparse matrix multiplications (with $L$ and $L^\top$) and a projection $\Pi_{\mathbb{R}_+}$, all of which are simple to implement and easily parallelizable. Computing the maximal step length τ requires an estimate of the spectral norm of the matrix L. We have the sufficient convergence condition [1, Th. 1]
$$\tau \le \lambda_{\max}^{-1/2}(L L^\top), \tag{15}$$
where $\lambda_{\max}$ gives the largest eigenvalue of its argument, which depends on the graph structure only and may be computed a priori using power iterations [13]. However, $\lambda_{\max}$ can also be estimated analytically, as stated by the following theorem.
² We consider the case where the step-sizes for primal and dual iterations are set to the same value τ.
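The updates (14) together with the step-size rule (15) can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the dense list-of-rows matrix format, the helper names `matvec`, `rmatvec`, `lambda_max` and `fpd`, the safety factor 0.9 and the toy instance (two variables with unary factors only, so that L contains just the normalization constraints (6)) are all assumptions made for this example.

```python
import math

def matvec(A, x):
    """Return A x for a dense matrix given as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def lambda_max(L, iters=200):
    """Estimate the largest eigenvalue of L L^T by power iterations."""
    y = [1.0] * len(L)
    lam = 0.0
    for _ in range(iters):
        z = matvec(L, rmatvec(L, y))
        lam = math.sqrt(sum(v * v for v in z))
        y = [v / lam for v in z]
    return lam

def fpd(L, c, theta, iters=2000, safety=0.9):
    """Iterate the updates (14) for min <theta, mu> s.t. L mu = c, mu >= 0."""
    # Step size from (15), shrunk by an assumed safety factor.
    tau = safety / math.sqrt(lambda_max(L))
    mu = [0.0] * len(theta)
    nu = [0.0] * len(c)
    zeta = list(nu)
    for _ in range(iters):
        g = rmatvec(L, zeta)
        # Primal step followed by projection onto the positive orthant.
        mu = [max(0.0, m + tau * (gi - th)) for m, gi, th in zip(mu, g, theta)]
        r = matvec(L, mu)
        nu_new = [n - tau * (ri - ci) for n, ri, ci in zip(nu, r, c)]
        # Extrapolation step zeta = 2 nu^{t+1} - nu^t.
        zeta = [2.0 * a - b for a, b in zip(nu_new, nu)]
        nu = nu_new
    return mu, nu

# Toy instance (assumption): one variable with 3 states, one with 2 states.
L = [[1.0, 1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0, 1.0]]
c = [1.0, 1.0]
theta = [3.0, 1.0, 2.0, 0.5, 0.2]
mu, nu = fpd(L, c, theta)
```

On this instance the primal iterate approaches the indicator vector of the cheapest state of each variable, and the dual iterate approaches the corresponding minimal potentials. For real models L is sparse and would be stored accordingly.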
Theorem 1. Let G be a second-order factor graph in which all variables have an equal number K of possible states. Then, with $d_{\max}$ denoting the maximal degree (number of adjacent pairwise factors) of any node of G,
$$\lambda_{\max} \le \frac{1}{2}\left(3K + d_{\max} + \sqrt{K^2 + 6 d_{\max} K + d_{\max}^2}\right). \tag{16}$$
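The bound (16) and the resulting step size from (15) are easy to evaluate numerically. The following sketch does so for a 4-connected grid with 5 states; the function names are hypothetical and chosen only for this illustration.

```python
import math

def lambda_max_bound(K, d_max):
    """Right-hand side of (16) for K states and maximal node degree d_max."""
    return 0.5 * (3 * K + d_max + math.sqrt(K * K + 6 * d_max * K + d_max * d_max))

def max_step_size(K, d_max):
    """Step size tau permitted by (15) when lambda_max is replaced by its bound."""
    return 1.0 / math.sqrt(lambda_max_bound(K, d_max))

bound = lambda_max_bound(5, 4)  # 4-connected grid, 5 states
tau = max_step_size(5, 4)
```

Since the bound is an upper estimate of the largest eigenvalue, the step size derived from it is never larger than the one permitted by (15).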
We prove this theorem in the appendix.
Remark 1. Note that the bound (16) does not depend on the graph size for grids. This bound is exact for regular graphs, i.e. graphs in which all nodes have equal degree. Typical examples are fully connected graphs ($d_{\max} = |V| - 1$) and infinite grids ($d_{\max} = 4$). For finite grids, numerical computations show that this bound is quite sharp: for a particular 100 × 100 grid graph with 5 states, the value $\lambda_{\max} = 15.8418$ was computed numerically using power iterations [13], while (16) gives 15.8443.
Remark 2. We also considered a variant of (13) that explicitly enforces the constraint that the unary primal variables $\mu_a$, $a \in F^1$, lie in the unit simplex $\Delta(X_a)$ defined by constraints (6) and (7):
$$\max_{\mu \ge 0,\ \mu_a \in \Delta(X_a)\ \forall a \in F^1}\ \min_{\nu}\ \{ \langle -c, \nu \rangle + \langle \mu, \tilde{L}^\top \nu \rangle - \langle \theta, \mu \rangle \}, \tag{17}$$
where $\tilde{L}$ is the matrix obtained by removing constraints (6) from L. The necessary simplex projections may be computed using e.g. [14]; however, this requires an inner loop within the first FPD step. It can be shown that for the pairwise case
$$\|\tilde{L} \tilde{L}^\top\| \le d_{\max} + 2K \tag{18}$$
holds. Thus the larger maximal step-size allowed by the reduced norm of $\tilde{L}$, which increased only marginally for our problems (compare (18) to (16)), does not necessarily outweigh the incurred extra computational cost.
2.3 Estimating Primal and Dual Bounds
The primal $\mu^t$ and dual $\nu^t$ iterates are not necessarily feasible for the respective primal (3) and dual (12) problems during the course of the algorithm; therefore, obtaining primal and dual bounds on which to base a duality-gap stopping criterion is not trivial. We devise a method for computing sequences of primal and dual feasible points such that primal and dual bounds and the duality gap, respectively, can be estimated as part of the overall iteration (14). Our method relies on strong duality of the primal (3) and dual (12) pair and (13), which ensures a vanishing duality gap after convergence. A method to compute feasible points in the local polytope, and hence an upper bound for the energy of the relaxed problem, was recently proposed in [10]. We generalize this method to problems of arbitrary order, and we show that the
same idea can be applied to obtain feasible points for both primal and dual problems. To simplify understanding of the main idea we will provide explicit formulations for the more common second-order problems. The estimation of feasible primal and dual points is based on the following simple proposition.
Proposition 1. Let $f : \mathbb{R}^N \times \mathbb{R}^M \to \mathbb{R}$ be a proper lower semi-continuous convex function of two vector variables and $(x^*, y^*) = \arg\min_{(x,y) \in \mathbb{R}^N \times \mathbb{R}^M} f(x, y)$ be its minimizer. Let $x^t$, $t = 1, 2, \dots$, be a sequence of points in $\mathbb{R}^N$ converging to $x^*$. If additionally the function $\varphi(x) = \min_{y \in \mathbb{R}^M} f(x, y)$ is continuous, then
$$\varphi(x^t) \xrightarrow{t \to \infty} f(x^*, y^*).$$
The proof of the proposition is straightforward: since $x^t \xrightarrow{t \to \infty} x^*$, continuity gives $\varphi(x^t) \xrightarrow{t \to \infty} \varphi(x^*) = \min_{y \in \mathbb{R}^M} f(x^*, y) = f(x^*, y^*)$. To make use of Proposition 1 for the calculation of primal and dual feasible points, we split our set of variables into two parts (x and y in the notation of Proposition 1). The subsets of variables should be selected such that the function $\varphi$ is continuous and easy to compute. Indeed, as observed in [10] for the second-order case, with fixed $\mu_a$, $a \in F^1$, satisfying the last two constraints in (8), the primal problem (8) splits into a set of independent small subproblems: one subproblem for each second-order factor. This has a straightforward generalization to problem (4) of arbitrary order, as stated by the following theorem:
Theorem 2. Let $\mu^*$ be any solution of (4)-(7), and let $\mu^t$ be a sequence such that $\mu_a^t \xrightarrow{t \to \infty} \mu_a^*$, $\mu_a^t \ge 0$, $a \in F^1$. Let $\mu'^{\,t}$ be constructed as follows:
$$\mu'^{\,t}_a = \Pi_{\Delta(X_a)}(\mu_a^t), \quad \forall a \in F^1, \tag{19}$$
where $\Pi_{\Delta(X_a)} : \mathbb{R}^{X_a} \to \Delta(X_a)$ denotes a projection operator onto the $|X_a|$-dimensional simplex $\Delta(X_a)$, and, for all $b \in F \setminus F^1$,
$$\mu'^{\,t}_b = \arg\min_{\mu_b \in \mathbb{R}^{X_b}} \langle \theta_b, \mu_b \rangle \quad \text{s.t.} \quad L_{ab}\, \mu_b = \mu'^{\,t}_a,\ a \subset b, \quad \mu_b \ge 0. \tag{20}$$
Then
$$\langle \theta, \mu'^{\,t} \rangle \xrightarrow{t \to \infty} \langle \theta, \mu^* \rangle.$$
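Step (19) of Theorem 2 needs a projection onto the probability simplex. The text cites Michelot's algorithm [14] for this; the sketch below instead uses the common sort-based Euclidean projection, which is an assumption of this example rather than the method used by the authors. The function name `project_simplex` is hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    css = 0.0    # running sum of the largest entries
    shift = 0.0  # threshold subtracted from every entry
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        # The condition holds for a prefix of the sorted entries;
        # the last index where it holds determines the shift.
        if uj - t > 0:
            shift = t
    return [max(x - shift, 0.0) for x in v]

# A non-normalized, non-negative unary block as it may occur during iteration.
mu_a = [0.2, 0.2, 0.2]
mu_a_feasible = project_simplex(mu_a)
```

The projected unary blocks are feasible for (6) and (7); the remaining blocks then follow from the small linear programs (20), which are not shown here.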
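We prove the theorem in the appendix. The dual to (4)-(7) reads (see its derivation in the appendix):
$$\max_{\nu}\ \sum_{a \in F^1} \nu_a + \sum_{b \in F \setminus F^1} \nu_b$$
$$\text{s.t.} \quad \theta_a - \sum_{b \supset a,\ b \in F \setminus F^1} \nu_{ab} \ge \nu_a \cdot \mathbb{1}_a, \quad a \in F^1, \tag{21}$$
$$\theta_b + \sum_{a \subset b,\ a \in F^1} L_{ab}^\top \nu_{ab} \ge \nu_b \cdot \mathbb{1}_b, \quad b \in F \setminus F^1.$$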
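In the second-order case this formulation has the following well-known (cf. [5]) form:
$$\max_{\nu}\ \sum_{a \in F^1} \nu_a + \sum_{b \in F \setminus F^1} \nu_b \tag{22}$$
$$\text{s.t.} \quad \theta_a(x_a) - \sum_{b \supset a,\ b \in F \setminus F^1} \nu_{ab}(x_a) \ge \nu_a, \quad a \in F^1,\ x_a \in X_a,$$
$$\theta_b(x_a, x_{a'}) + \nu_{ab}(x_a) + \nu_{a'b}(x_{a'}) \ge \nu_b, \quad b \in F \setminus F^1,\ b = a \cup a',\ (x_a, x_{a'}) \in X_b.$$
Since for each $b \in F \setminus F^1$ the variable $x_b$ is in fact a collection of $x_a$, $a \subset b$, we will use the notation $(x_b)_a$ for such $x_a$. The dual problem (21) becomes easily solvable with respect to $\nu_a$, $a \in F^1$, and $\nu_b$, $b \in F \setminus F^1$, when the $\nu_{ab}$ are fixed. This is stated by the following theorem.
Theorem 3. Let $\nu^*$ denote any solution of (21), and let $\nu^t$ be a sequence such that $\nu_{ab}^t \xrightarrow{t \to \infty} \nu_{ab}^*$, $b \in F \setminus F^1$, $a \subset b$. Let $\nu'^{\,t}$ be constructed as follows:
$$\nu'^{\,t}_{ab} = \nu^t_{ab}, \quad b \in F \setminus F^1,\ a \subset b, \tag{23}$$
$$\nu'^{\,t}_a = \min_{x_a \in X_a} \Big( \theta_a(x_a) - \sum_{b \supset a,\ b \in F \setminus F^1} \nu^t_{ab}(x_a) \Big), \quad a \in F^1, \tag{24}$$
$$\nu'^{\,t}_b = \min_{x_b \in X_b} \Big( \theta_b(x_b) + \sum_{a \subset b,\ a \in F^1} \nu^t_{ab}((x_b)_a) \Big), \quad b \in F \setminus F^1. \tag{25}$$
Then
$$\sum_{a \in F^1} \nu'^{\,t}_a + \sum_{b \in F \setminus F^1} \nu'^{\,t}_b \xrightarrow{t \to \infty} \sum_{a \in F^1} \nu_a^* + \sum_{b \in F \setminus F^1} \nu_b^*.$$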
We prove the theorem in the appendix.
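For the second-order case, the construction (23)-(25) amounts to taking minima of reparametrized potentials. The sketch below evaluates the resulting lower bound for a model with two binary variables and one pairwise factor, and compares it with the exact minimum found by enumeration. The dictionary-based data layout, the function names and the numerical values are assumptions made for this illustration only.

```python
from itertools import product

def dual_bound(theta_unary, theta_pair, nu):
    """Lower bound sum_a nu_a + sum_b nu_b obtained from (24) and (25).

    theta_unary: {a: [theta_a(x_a)]}
    theta_pair:  {(a, a2): [[theta_b(x_a, x_a2)]]}
    nu:          {(a, b): [nu_ab(x_a)]} with b = (a, a2) a pairwise factor
    """
    total = 0.0
    for a, th in theta_unary.items():
        reparam = list(th)
        for (v, _b), msg in nu.items():
            if v == a:
                reparam = [r - m for r, m in zip(reparam, msg)]
        total += min(reparam)  # eq. (24)
    for b, th in theta_pair.items():
        a, a2 = b
        total += min(th[i][j] + nu[(a, b)][i] + nu[(a2, b)][j]
                     for i in range(len(th)) for j in range(len(th[0])))  # eq. (25)
    return total

def min_energy(theta_unary, theta_pair):
    """Exact minimum of the energy by enumerating all configurations."""
    variables = sorted(theta_unary)
    best = float("inf")
    for x in product(*(range(len(theta_unary[a])) for a in variables)):
        lab = dict(zip(variables, x))
        e = sum(theta_unary[a][lab[a]] for a in variables)
        e += sum(th[lab[a]][lab[a2]] for (a, a2), th in theta_pair.items())
        best = min(best, e)
    return best

theta_unary = {0: [0.0, 1.0], 1: [0.5, 0.0]}
theta_pair = {(0, 1): [[0.0, 1.0], [1.0, 0.0]]}
b = (0, 1)
nu_zero = {(0, b): [0.0, 0.0], (1, b): [0.0, 0.0]}
nu_good = {(0, b): [-0.5, 0.0], (1, b): [0.5, 0.0]}
bounds = (dual_bound(theta_unary, theta_pair, nu_zero),
          dual_bound(theta_unary, theta_pair, nu_good))
```

Any choice of the variables $\nu_{ab}$ yields a valid lower bound; on this tree-structured example a suitable choice closes the gap to the exact minimum.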
3 Experimental Results
Test Cases. We compared the FPD approach to other established methods using standard grid-structured models from the Middlebury MRF-Benchmark [7], in particular the well-known Tsukuba stereo problem. We additionally used various synthetic models with a varying number of variable states (range 2, ..., 20) and grid size (range 2², ..., 40²). The potential functions θ for nodes and edges were sampled from a uniform distribution. Furthermore, we tested with a set of specific grids leading to LP-tight relaxations, where the LP problem (3) always has an integer minimizer. These graphical models were constructed as follows: First, starting from graphs with uniformly sampled potentials, for each unary factor (a ∈ F¹) we chose one state to have the minimum local potential, and also modified the connected pairwise
[Figure 1: two plots of energy versus oracle calls (left: full run; right: close-up); curves: FPD, NEST, TRW-S, Sub-Grad.]
Fig. 1. FPD method for a 20 × 20 synthetic grid model with 5 states, in comparison to (a) NEST with ε = 1, (b) TRW-S and (c) sub-gradient methods. The plots show LP lower and upper bounds (unavailable for TRW-S and subgradients) as well as integer bounds obtained by rounding (dashed). TRW-S is the fastest on this data, but gets stuck in a non-optimal fixed point. FPD gives much better upper bounds than NEST and achieves a low primal-dual gap much earlier, which is attributed to its direct optimization of both primal and dual variables. The right plot displays a close-up, highlighting the superiority w.r.t. subgradients and TRW-S in this case, but also shows that NEST achieves a better lower bound for a given number of iterations.
factors to assign the minimum energy to each pair of selected states, such that the problem becomes trivial³. Second, we randomly sampled $\nu_{ab}$, $b \in F \setminus F^1$, $a \in b$ (see (22)), associated with the pairwise factors and applied the reparametrization:
$$\theta_a(x_a) \leftarrow \theta_a(x_a) - \sum_{b \supset a,\ b \in F \setminus F^1} \nu_{ab}(x_a), \quad a \in F^1,\ x_a \in X_a, \tag{26}$$
$$\theta_b(x_a, x_{a'}) \leftarrow \theta_b(x_a, x_{a'}) + \nu_{ab}(x_a) + \nu_{a'b}(x_{a'}), \quad b \in F \setminus F^1,\ b = a \cup a',\ (x_a, x_{a'}) \in X_b. \tag{27}$$
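The effect of the reparametrization (26)-(27) can be checked numerically on a small model. The sketch below applies it with randomly drawn $\nu_{ab}$ to a 2 × 2 grid with three states per node and compares the energy of every configuration before and after. The data layout, the function names, the random seed and the sampling ranges are assumptions of this example.

```python
import random
from itertools import product

def energy(x, theta_unary, theta_pair):
    """Energy of configuration x (a list of state indices, one per variable)."""
    e = sum(theta_unary[a][x[a]] for a in theta_unary)
    e += sum(th[x[a]][x[a2]] for (a, a2), th in theta_pair.items())
    return e

def reparametrize(theta_unary, theta_pair, nu):
    """Apply (26) to the unary and (27) to the pairwise potentials."""
    new_unary = {a: list(th) for a, th in theta_unary.items()}
    new_pair = {}
    for b, th in theta_pair.items():
        a, a2 = b
        # (26): subtract nu_ab from the unary potentials of both end nodes.
        new_unary[a] = [t - m for t, m in zip(new_unary[a], nu[(a, b)])]
        new_unary[a2] = [t - m for t, m in zip(new_unary[a2], nu[(a2, b)])]
        # (27): add the same quantities to the pairwise potential.
        new_pair[b] = [[th[i][j] + nu[(a, b)][i] + nu[(a2, b)][j]
                        for j in range(len(th[0]))] for i in range(len(th))]
    return new_unary, new_pair

rng = random.Random(0)
K = 3
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]  # 2 x 2 grid, nodes numbered row-wise
theta_unary = {a: [rng.random() for _ in range(K)] for a in range(4)}
theta_pair = {b: [[rng.random() for _ in range(K)] for _ in range(K)] for b in edges}
nu = {(a, b): [rng.uniform(-1.0, 1.0) for _ in range(K)] for b in edges for a in b}

new_unary, new_pair = reparametrize(theta_unary, theta_pair, nu)
max_diff = max(abs(energy(list(x), theta_unary, theta_pair)
                   - energy(list(x), new_unary, new_pair))
               for x in product(range(K), repeat=4))
```

Up to floating-point rounding, `max_diff` is zero: each $\nu_{ab}(x_a)$ is subtracted once from a unary term and added once to a pairwise term of the same configuration.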
It is known [4,5] that this reparametrization does not change the energy of any configuration.
Compared Algorithms. Our comparison includes Nesterov's method (NEST) of [10], a sub-gradient method [8], as well as TRW-S [6], all based on the same dual decomposition into acyclic subgraphs corresponding to the rows and columns of the input grid. For TRW-S and NEST, the authors kindly provided the original implementations of their algorithms. We show the lower (LP-dual due to Th. 3) and upper (LP-primal due to Th. 2 and integer-rounded values of $\mu_a^t$, $a \in F^1$) bounds on the energy which the algorithms achieve. To ensure comparability, the algorithm progress is not plotted against time, but instead as a function
³ For a definition of such problems, please see [5, Sect. III.D] or [9, part I, eq. (7)].
[Figure 2: upper plots show the number of oracle calls versus the number of states (left; curves: FPD 20x20, FPD 40x40, NEST 20x20, NEST 40x40) and versus log_2 grid size (right; curves: FPD, NEST); lower bar plots show the number of converged runs ("conv.").]
Fig. 2. Convergence behaviour of FPD compared with NEST in terms of the number of iterations (oracle calls) required to reach a 0.1% duality gap, for synthetic grid graph problems. Left: varying the number of states; Right: varying the grid size. The lower bar-plots show the number of runs (out of 5) which converged within a maximum of 15000 iterations and were therefore included in the upper plot. The increase with the number of states is more pronounced for NEST, probably because the primal problem becomes more complicated, which only FPD optimizes directly along with the dual. With increasing graph size, however, the converse is true: FPD quickly required more than the maximum number of iterations permitted in the experiment.
of the number of oracle calls, i.e. the required number of objective function or gradient evaluations. For FPD, one oracle call corresponds to a single iteration.
Synthetic Grid Problem. In the first experiment, all four methods were compared using a synthetic grid graph problem (Fig. 1). We first note that TRW-S converges to a suboptimal fixed point. Furthermore, the subgradient scheme is not competitive. The primal bound obtained from our FPD method drops faster than that of the Nesterov-based method, which is attributed to the fact that the latter does not optimize the primal problem directly, while FPD does. Note that NEST employs smoothing, which depends on the required precision ε and also influences the convergence. In the second group of experiments, we study how the number of oracle calls required to reach a given precision depends on the number of variables and the number of states per variable. Here, we compare only FPD against NEST, because none of the other methods provides a primal bound of the relaxed problem (3).
Dependence on the Number of States. According to Figs. 2, 3, both FPD and NEST require increasing numbers of iterations with increasing number of states. This increase is more pronounced for NEST, which again is probably due to NEST being a dual optimization method rather than a primal-dual one.
Dependence on the Number of Variables. When the number of variables in the model increases, the number of iterations increases as Figs. 2, 3 show. As
[Figure 3: upper plots show the number of oracle calls versus the number of states (left; curves: FPD 20x20, FPD 40x40, NEST 20x20, NEST 40x40) and versus log_2 grid size (right; curves: FPD, NEST); lower bar plots show the number of converged runs ("conv.").]
Fig. 3. Convergence behaviour of FPD compared with NEST as in Fig. 2, but for a target precision of a 1% duality gap and a maximum number of 30000 iterations permitted. Left: varying the number of states; Right: varying the grid size. Again NEST requires more oracle calls than FPD for larger numbers of states, but FPD quickly becomes inferior with increasing numbers of variables.
opposed to the dependency on the number of states, here the growth is much more pronounced for the FPD method, which quickly exceeds the maximal number of iterations we imposed for the experiment, while for NEST the increase is moderate. This can be explained by the dual decomposition into large subgraphs used in NEST, which leads to faster propagation of information across the graph.
LP-tight Problems. One reason for the differences observed above may be the different optimization strategies of FPD and NEST: while the former optimizes primal and dual simultaneously, the latter only optimizes the dual. To test this conjecture, we considered LP-tight graphical models; Fig. 4 shows the corresponding result. The required number of oracle calls for both methods depends linearly on the number of states and also grows for larger graphs, with NEST achieving the required precision several times faster. The overall number of oracle calls is much lower than for non-LP-tight problems, as the complexity of the primal optimization is absent and the algorithms mainly optimize the dual, which is apparently much easier.
Tsukuba Stereo Problem. Fig. 5 finally presents results of a comparison of the energy minimization algorithms for the Tsukuba dataset. Among all methods, FPD shows the slowest convergence. This is to be expected, since the problem is quite large for FPD (110592 variables) and contains a relatively large number of states (16 depth states). Other methods (even the typically slow subgradient method) are more efficient, presumably due to their use of the dual decomposition into large subgraphs.
Parallelization Properties. Besides the moderately-sized dual vector, the algorithm has to handle the set of primal variables, whose storage requirement
[Figure 4: number of oracle calls versus the number of states (left; curves: FPD 20x20, FPD 40x40, NEST 20x20, NEST 40x40) and versus log_2 grid size (right; curves: FPD, NEST).]
Fig. 4. Convergence behaviour of FPD compared with NEST for LP-tight synthetic grid graph problems, in terms of the number of iterations (oracle calls) required to reach a 0.1% duality gap. Left: varying the number of states; Right: varying the grid size. The overall number of oracle calls is much less than for non-LP-tight problems, as the complexity of the primal optimization is absent and the algorithms mainly optimize the dual, which is apparently much easier. In that case, NEST is clearly superior to FPD in almost all instances.
[Figure 5: plot of energy versus oracle calls; curves: FPD, NEST, TRW-S, Sub-Grad.]
Fig. 5. FPD method for Tsukuba data in comparison to (a) NEST with ε = 10, (b) TRW-S and (c) sub-gradient methods. The plot shows LP lower and upper bounds (unavailable for TRW-S and subgradients) as well as integer bounds obtained by rounding (dashed). FPD shows the slowest convergence among all methods, as the problem is relatively large in terms of both the grid size and the number of states (16 depth states). Note however that all other methods use dual decomposition into large subgraphs, and therefore may be able to propagate information faster across the graph.
in the pairwise case is $O(|F^1| K + |F \setminus F^1| K^2)$, where K is the number of states (assuming an equal number of states per node). Due to the quadratic growth in K, this is a major drawback of the method if the problem has many states.
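To give a feeling for this storage requirement, the following sketch counts the primal variables of a 4-connected grid model. The text states only the number of variables of the Tsukuba problem (110592) and the number of states (16); the grid dimensions 288 × 384 and the 4-connected neighborhood used below are assumptions that are consistent with that count. The function name is hypothetical.

```python
def primal_size_grid(rows, cols, K):
    """Number of primal variables |F^1| K + |F \\ F^1| K^2 for a 4-connected grid."""
    n_unary = rows * cols
    n_pair = rows * (cols - 1) + (rows - 1) * cols
    return n_unary * K + n_pair * K * K

# Assumed Tsukuba layout: 288 x 384 pixels, 16 depth states.
n_primal = primal_size_grid(288, 384, 16)
```

Under these assumptions the pairwise blocks account for more than 95% of the primal vector, which illustrates the quadratic growth in K.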
However, the method (including the bounds computation) is easily parallelizable, which we exploited to provide a CUDA variant running on GPU hardware⁴. This code allowed practical speedups of up to a factor of 160 compared to a sequential variant running on a CPU⁵ (both not explicitly optimized for grid graphs).
4 Conclusion
We presented a study of the first order primal-dual algorithmic scheme [1] applied to MAP inference for general discrete graphical models, via the LP relaxation of this problem formulated in a saddle-point form. We supplemented the original scheme by a method for computing upper and lower bounds, which results in a sound stopping condition and thus ensures comparability and reproducibility of results. Our study shows that the performance of the algorithm drops rapidly as the model size increases. Competing methods, which use a dual decomposition technique, appear to propagate information across the graph much faster. However, due to an explicit optimization of the primal objective, FPD improves the primal bound faster than Nesterov's scheme [10] applied to the dual objective, which does not optimize the primal directly. This effect is clearly visible for small-sized graphical models. Future work will focus on a combination of efficient optimization and decomposition schemes, both for primal and dual objectives. Good parallelization properties, as given for the presented FPD method, will also play a key role in further improving the efficiency of convergent inference methods.
Acknowledgements. The authors gratefully acknowledge support by the German Science Foundation (DFG) within the Excellence Initiative (HCI) and by the joint research project "Spatio/Temporal Graphical Models and Applications in Image Analysis", grant GRK 1653. The authors thank B. Andres, D. Breitenreicher and J. Lellmann for many helpful discussions and sharing software.
⁴ NVidia GTX 480.
⁵ Intel Core-i7 860 (using one of 4 cores) at 2.8 GHz.
References
1. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging and Vision, 1–26 (2010)
2. Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47, 498–519 (2001)
3. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1–305 (2008)
4. Schlesinger, M.: Syntactic analysis of two-dimensional visual signals in the presence of noise. Kibernetika, 113–130 (1976)
5. Werner, T.: A linear programming approach to max-sum problem: A review. IEEE PAMI 29 (2007)
6. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE PAMI 28, 1568–1583 (2006)
7. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE PAMI 30, 1068–1080 (2008)
8. Komodakis, N., Paragios, N., Tziritas, G.: MRF optimization via dual decomposition: Message-passing revisited. In: ICCV (2007)
9. Schlesinger, M., Giginyak, V.: Solution to structural recognition (MAX,+)-problems by their equivalent transformations. In 2 parts. Control Systems and Computers (2007)
10. Savchynskyy, B., Kappes, J., Schmidt, S., Schnörr, C.: A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In: CVPR 2011 (2011)
11. Korte, B., Vygen, J.: Combinatorial Optimization, 4th edn. Springer, Heidelberg (2008)
12. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. Ser. A, 127–152 (2004)
13. Golub, G., Van Loan, C.: Matrix Computations, 3rd edn. The Johns Hopkins Univ. Press, Baltimore (1996)
14. Michelot, C.: A finite algorithm for finding the projection of a point onto the canonical simplex of Rⁿ. J. Optim. Theory Appl. 50, 195–200 (1986)
15. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)
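Appendix
Proof of Theorem 1. The matrix L has the block form
$$L = \begin{pmatrix} L_V & 0 \\ L_{VE} & L_E \end{pmatrix},$$
where $L_V$ is determined by (10), $L_E$ and $L_{VE}$ respectively correspond to the left-hand and right-hand sides of (9), and
$$\begin{pmatrix} L_1 & L_2^\top \\ L_2 & L_3 \end{pmatrix} := L L^\top, \qquad L_1 = K I_{|V|},$$
where K is the number of possible states. Let $x := \binom{x_V}{x_E}$ denote the eigenvector of $L L^\top$ corresponding to $\lambda_{\max}$. From the upper part of the eigenvalue equation $L L^\top x = \lambda_{\max} x$, i.e. $L_1 x_V + L_2^\top x_E = \lambda_{\max} x_V$, and $L_1 = K I_{|V|}$, we conclude $x_V = \frac{1}{\lambda_{\max} - K} L_2^\top x_E$. Insertion into the lower part of the eigenvalue equation yields
$$\left( \frac{1}{\lambda_{\max} - K} L_2 L_2^\top + L_3 \right) x_E = \lambda_{\max} x_E. \tag{28}$$
In terms of the components of L, this equation reads
$$\left( \frac{\lambda_{\max}}{\lambda_{\max} - K} L_{VE} L_{VE}^\top + L_E L_E^\top \right) x_E = \lambda_{\max} x_E.$$
The maximum eigenvalue of the matrix on the left-hand side equals $\lambda_{\max}$ and itself depends on $\lambda_{\max}$. We therefore solve the equation
$$\frac{\lambda}{\lambda - K}\, d_{\max} + 2K = \lambda \quad \Rightarrow \quad \lambda = \frac{1}{2}\left(3K + d_{\max} + \sqrt{K^2 + 6 d_{\max} K + d_{\max}^2}\right),$$
where $d_{\max}$ denotes the maximal degree of any node of G. The expression on the left is an upper bound of the $\ell_1$-norms of the row vectors of the matrix, which is an upper bound of $\lambda_{\max}$ by Gerschgorin's theorem [13]. Because this function decreases with λ, it follows that $\lambda \ge \lambda_{\max}$.
Derivation of the Block Form of the Dual Objective (21). The full Lagrangian corresponding to (4) reads:
$$\mathcal{L}(\mu, \tilde\nu, \gamma) = \langle \theta, \mu \rangle + \sum_{b \in F \setminus F^1} \sum_{a \subset b} \tilde\nu_{ab}^\top (L_{ab}\, \mu_b - \mu_a) + \sum_{a \in F^1} \tilde\nu_a (1 - \mathbb{1}_a^\top \mu_a) - \langle \gamma, \mu \rangle. \tag{29}$$
From this follows the dual problem (a detailed presentation of (12)):
$$\max_{\tilde\nu}\ \sum_{a \in F^1} \tilde\nu_a \tag{30}$$
$$\text{s.t.} \quad \theta_a - \sum_{b \supset a,\ b \in F \setminus F^1} \tilde\nu_{ab} \ge \tilde\nu_a \cdot \mathbb{1}_a,\ a \in F^1, \qquad \theta_b + \sum_{a \subset b,\ a \in F^1} L_{ab}^\top \tilde\nu_{ab} \ge 0 \cdot \mathbb{1}_b,\ b \in F \setminus F^1. \tag{31}$$
We introduce additional variables $\nu_b \in \mathbb{R}$, $b \in F \setminus F^1$, and apply the following change of variables:
$$\nu_{ab} := \tilde\nu_{ab} + \frac{\nu_b}{|b|} \cdot \mathbb{1}_a, \qquad \nu_a := \tilde\nu_a - \sum_{b \supset a,\ b \in F \setminus F^1} \frac{\nu_b}{|b|}. \tag{32}$$
The matrix $L_{ab}$ in (4) is of size $|X_a| \times |X_b|$, and it possesses the important property
$$L_{ab}^\top \mathbb{1}_a = \mathbb{1}_b, \quad a \subset b, \tag{33}$$
following from the fact that each column of $L_{ab}$ contains exactly one non-zero entry (equal to 1). Taking into account (33) and $\sum_{a \in F^1} \sum_{b \supset a} \frac{\nu_b}{|b|} = \sum_{b \in F \setminus F^1} \nu_b$ leads to the equivalent dual problem formulation (21).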
Proof of Theorem 2. Due to Proposition 1 and continuity of the projection in (19), it suffices to prove that the objective value of (20) changes continuously with $\mu'^{\,t}_a$. Our proof is a straightforward generalization of the one given in [10]. Problem (20) satisfies Slater's condition [15] due to affinity of its constraints, and it always has at least one feasible point when $\mathbb{1}_a^\top \mu'^{\,t}_a = 1$, $a \subset b$. This condition holds due to (19). Since $1 \ge \mu_b \ge 0$, its optimal value is always finite. Thus its Lagrange dual has the same finite optimal value. The Lagrange dual for (20) reads:
$$\max_{\xi_{ab} \in \mathbb{R}^{X_a}}\ \sum_{a \subset b} \xi_{ab}^\top \mu'^{\,t}_a \quad \text{s.t.} \quad \theta_b - \sum_{a \subset b} L_{ab}^\top \xi_{ab} \ge 0. \tag{34}$$
It depends on $\mu'^{\,t}_a$ only through its objective, which depends continuously on $\mu'^{\,t}_a$. Since the optimal value of (34) is finite, it is attained in one of the vertices of its constraint set, which implies that it changes continuously with $\mu'^{\,t}_a$.
Proof of Theorem 3. The proof follows from Proposition 1 and continuity of the min-operation in (24)-(25).
Global Relabeling for Continuous Optimization in Binary Image Segmentation
Markus Unger, Thomas Pock, and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{unger,pock,bischof}@icg.tugraz.at
Abstract. Recently, continuous optimization methods have become quite popular since they can deal with a variety of non-smooth convex problems. They are inherently parallel and therefore well suited for GPU implementations. Most of the continuous optimization approaches have in common that they are very fast in the beginning, but tend to get very slow as the solution gets close to the optimum. We therefore propose to apply global relabeling steps to speed up the convergence close to the optimum. The resulting primal-dual algorithm with global relabeling is applied to graph cut problems as well as to Total Variation (TV) based image segmentation. Numerical results show that the global relabeling steps significantly speed up convergence of the segmentation algorithm. Keywords: Segmentation, Graph Cut, Continuous Optimization, Convex Energy Minimization, Total Variation.
1 Introduction
Binary image segmentation is one of the fundamental problems in computer vision and image processing. Graph cuts [1], also referred to as the discrete max-flow/min-cut problem, are one of the standard methods for image segmentation. The max-flow/min-cut optimization already has a long history in computer vision, as it can be applied to a wide range of optimization problems [1]. A more recent study in the general context of Markov random fields was presented by Szeliski et al. [2]. Ignited by the work of Boykov and Kolmogorov [3], [4], graph cuts have become very popular in computer vision (e.g. 'GrabCut' proposed by Rother et al. [5]). Recently, graph cuts for image segmentation have also been implemented on parallel hardware [6], [7]. The discrete graph cut problem can also be written as an ℓ1 norm minimization problem, as shown by Bhusnurmath and Taylor [8], and it is thus possible to solve graph cuts with off-the-shelf LP solvers. The max-flow/min-cut problem is not restricted to the discrete setting. In the continuous setting, the max-flow problem was first studied by Strang [9] in 1983, but remains challenging [10] up to now (e.g. the continuous equivalent of directed graphs). Klodt et al. [11] presented an experimental comparison of continuous and discrete formulations. They showed that the well-known metrication
This work was supported by the BRIDGE project HD-VIP (no. 827544).
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 104–117, 2011. © Springer-Verlag Berlin Heidelberg 2011
Global Relabeling for Continuous Optimization
[Figure 1 panels: (a) Input + Scribbles; (b) Convergence Criterion (logarithmic scale, iterations 0–8000; curves: standard, with global relabeling); (c) Segmentation; (d)–(h) and (i)–(m): results for n = 40, 200, 360, 1000 and 6400.]
Fig. 1. Standard continuous optimization methods are usually very fast in the beginning, but slow down later on. Note that the segmentation looks already good after 200 iterations (e, j). Nevertheless there are some small regions that are changing their value very slowly. It takes the standard algorithm 6400 iterations to fully converge (h,m). We propose to use global relabeling steps to speed up convergence. As can be seen in (b) the proposed algorithm converges significantly faster.
errors of discrete graph cuts, caused by the definition of the grid, can be easily overcome in a continuous formulation. Sinop and Grady [12] presented a common segmentation model that evaluated different norms in the regularization term, namely the ℓ1 norm that corresponds to the discrete graph cut model, the squared ℓ2 norm and the ℓ∞ norm. In [13], Couprie et al. even extended these connections to the watershed segmentation. The random walker framework of Grady et al. [14] is a special case of this segmentation model based on the squared ℓ2 norm, and thus is closely related to the Dirichlet problem. In the work of Olsson et al. [15], continuous graph cuts are extended to anisotropic metrics and continuous α-expansion is presented. Appleton and Talbot [16] implemented the continuous max-flow equations using a finite-differences scheme to find a globally optimal solution of the continuous max-flow problem. In [17], Chan et al. investigated convex minimization problems in computer vision, and established relations between image segmentation and denoising. All these methods have in common that they use the Total Variation (the ℓ2 norm of the image gradient) as a smoothness term. Note that the Total Variation of a binary function corresponds to the contour length. In the continuous setting there is a lot of work that was also influenced by Geodesic Active Contours [18]. This includes
M. Unger, T. Pock, and H. Bischof
e.g. the work of Leung and Osher [19] and Unger et al. [20]. These methods are all based on the weighted Total Variation first introduced by Bresson et al. [21]. The weighted Total Variation corresponds to the weighting of the binary terms in discrete graph cut optimization. Most algorithms that solve the discrete graph cut problem (e.g. [4]) are highly specialized algorithms. Instead, we will use the recently proposed general primal-dual algorithm of Chambolle and Pock [22]. This algorithm can easily deal with a wide range of non-smooth convex problems. Thus we can apply the algorithm both to a relaxed, continuous formulation of the discrete graph cut problem and to a non-linear Total Variation based formulation of the continuous min-cut/max-flow problem. Primal-dual formulations in continuous optimization have the advantage that the gap between primal energy and dual energy provides a meaningful convergence measure. We observed that continuous methods are very fast in the beginning, but slow down as they get closer to the globally optimal solution. We illustrate this problem in Fig. 1. The key observation when watching the algorithm during optimization is as follows: while the segmentation variable gets very close to the final segmentation within a few iterations, usually some small areas change their value very slowly over time. The main contribution of this paper is a set of global relabeling steps that speed up the convergence process by evaluating thresholded versions of the current segmentation variable. Instead of changing the value of an area only slowly, the global relabeling step results in a discrete labeling with lower energy in just a single iteration.
2 Preliminaries
Images are defined on a two-dimensional regular Cartesian grid of size $I \times J$:

\[ \{(i, j) : 1 \le i \le I,\ 1 \le j \le J\}, \tag{1} \]

where discrete locations are denoted by the indices $(i, j)$. Note that we assume square pixels of size 1. We define a finite dimensional vector space $X = \mathbb{R}^{IJ}$ with a scalar product $\langle v, w \rangle_X = \sum_{i,j} v_{i,j} w_{i,j}$ for two vectors $v, w \in X$. Additionally, we define a gradient operator as a linear mapping $\nabla : X \to Y$ with

\[ (\nabla v)_{i,j} = \big((\delta_1 v)_{i,j}, \dots, (\delta_K v)_{i,j}\big)^T. \tag{2} \]

Here we also used the dual vector space $Y = \underbrace{\mathbb{R}^{IJ} \times \dots \times \mathbb{R}^{IJ}}_{K\ \text{times}}$. Given two vectors $p = (p_1, \dots, p_K)^T \in Y$ and $q = (q_1, \dots, q_K)^T \in Y$, we define the corresponding scalar product as $\langle p, q \rangle_Y = \sum_{i,j,k} p_{i,j,k}\, q_{i,j,k}$, with $1 \le k \le K$. In this paper we limit ourselves to 4-connected ($K = 2$) and 8-connected ($K = 4$) graphs. The neighboring relations for the Total Variation formulation ($K = 2$) are the same as for 4-connected graphs. Note that we will point out differences between graph cuts and TV only where necessary. We can thus define the following finite differences with Neumann boundary conditions:
Global Relabeling for Continuous Optimization
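\[
\begin{aligned}
(\delta_1 v)_{i,j} &= \begin{cases} v_{i+1,j} - v_{i,j} & \text{if } i < I, \\ 0 & \text{if } i = I, \end{cases} \\
(\delta_2 v)_{i,j} &= \begin{cases} v_{i,j+1} - v_{i,j} & \text{if } j < J, \\ 0 & \text{if } j = J, \end{cases} \\
(\delta_3 v)_{i,j} &= \begin{cases} v_{i+1,j+1} - v_{i,j} & \text{if } i < I,\ j < J, \\ 0 & \text{else}, \end{cases} \\
(\delta_4 v)_{i,j} &= \begin{cases} v_{i+1,j-1} - v_{i,j} & \text{if } i < I,\ j > 1, \\ 0 & \text{else}. \end{cases}
\end{aligned}
\tag{3}
\]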
$\nabla$ can simply be represented as an $IJK \times IJ$ matrix.
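The matrix view of the gradient operator can be illustrated with a small sketch. The following is not the authors' implementation; it only assumes $K = 2$, a row-major flattening of the image, 0-based indices, and a plain dictionary as a sparse matrix. The names `build_gradient`, `matvec` and `rmatvec` are hypothetical helpers introduced for this example.

```python
def build_gradient(I, J):
    """Forward-difference gradient for K = 2 as a sparse matrix.

    Returns a dict {(row, col): value} describing a (2*I*J) x (I*J) matrix.
    Pixel (i, j), 0-based here, is flattened to column i*J + j. Rows
    0 .. I*J-1 hold delta_1, rows I*J .. 2*I*J-1 hold delta_2.
    Rows at the image border stay empty (Neumann boundary condition).
    """
    n = I * J
    A = {}
    for i in range(I):
        for j in range(J):
            c = i * J + j
            if i < I - 1:                 # delta_1: v[i+1, j] - v[i, j]
                A[(c, c)] = -1.0
                A[(c, c + J)] = 1.0
            if j < J - 1:                 # delta_2: v[i, j+1] - v[i, j]
                A[(n + c, c)] = -1.0
                A[(n + c, c + 1)] = 1.0
    return A


def matvec(A, x, n_rows):
    """Compute A x for the sparse dict representation."""
    y = [0.0] * n_rows
    for (r, c), a in A.items():
        y[r] += a * x[c]
    return y


def rmatvec(A, y, n_cols):
    """Compute A^T y, i.e. the adjoint operator."""
    x = [0.0] * n_cols
    for (r, c), a in A.items():
        x[c] += a * y[r]
    return x


# Usage on a 2 x 3 image stored row by row.
I, J = 2, 3
A = build_gradient(I, J)
v = [1.0, 2.0, 4.0,
     3.0, 5.0, 9.0]
grad_v = matvec(A, v, 2 * I * J)
```

The same dictionary serves for both $\nabla$ and $\nabla^T$, which is all the primal-dual updates later in the paper require.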
3 Binary Segmentation Models

In this section we introduce the graph cut (GC) and Total Variation (TV) based image segmentation models. Additionally, we derive the primal-dual gap as a convergence criterion.

3.1 Graph Cuts
A graph $G$ is a pair $(V, E)$ with a vertex set $V$ and an edge set $E \subseteq V \times V$. In image processing the vertices usually correspond to discrete pixel locations plus two special terminal vertices, the source $s$ and the sink $t$. The edge set $E$ consists of different types of edges. First, there are the spatial edges $e_b = (r, q)$, $r, q \in V \setminus \{s, t\}$, that define the pixel neighborhood (e.g. 4-connectedness). Second, there are edges connecting the source with all pixels, $e_s = (s, r)$, and edges connecting the pixels with the sink, $e_t = (r, t)$. All edges have assigned costs $C(e) \ge 0$. In the case of $e_b$ these correspond to image gradient information (edges), and for $e_s$ and $e_t$ the costs are used to model the foreground and background affinity. The min-cut partitions the set of vertices $V$ into two disjoint regions $V_s$ and $V_t$ ($V_s \cap V_t = \emptyset$), assigned to the source $s$ and the sink $t$ respectively. We can now define a cut $E_c$ as the set of all edges $e_c \in E_c$ whose end points belong to two different regions. The cut has an assigned energy given by the sum of all corresponding costs $C(e_c)$. The min-cut problem can therefore be written as

\[ \min_{E_c \subset E} \sum_{e_c \in E_c} C(e_c). \tag{4} \]
We now reformulate the min-cut problem using the characteristic function $u \in X$. Additionally, we construct the vectors $w_s, w_t \in X$ using the costs $C(e)$ at the corresponding positions of the edges linked with the source $s$ or the sink $t$. We refer to these terms as the unary terms. The same is done for the spatial edge costs by constructing the vector $w_b \in Y$ (the binary terms). Thus we can rewrite (4) as the following minimization problem:

\[ \min_u \big\{ \|W_b \nabla u\|_1 + \langle 1 - u, w_s \rangle + \langle u, w_t \rangle \big\}, \quad \text{s.t. } u \in \{0, 1\}, \tag{5} \]
with $W_b = \operatorname{diag}(w_b)$. To solve the above energy, we relax the variable so that $u \in [0, 1]$ varies continuously between 0 and 1. It is well known [23] that the resulting convex relaxation provides the globally optimal solution for the original problem in (5). We can rewrite the terms $\langle 1 - u, w_s \rangle + \langle u, w_t \rangle$ as $\langle u, w_t - w_s \rangle + \|w_t\|_1$. The last term is the sum over all sink costs $w_t$ (note that $w_t \ge 0$). As this term is constant, we can neglect it during optimization. To further simplify the notation, we define the unary terms as $w_u = w_t - w_s$. We can therefore rewrite the relaxed version of (5) as

\[ \min_u \big\{ \|W_b \nabla u\|_1 + \langle u, w_u \rangle \big\}, \quad \text{s.t. } u \in [0, 1]. \tag{6} \]

We use Legendre-Fenchel duality to obtain the convex conjugate of the $\ell_1$ norm in (6). We then arrive at the following primal-dual saddle point formulation of the graph cut energy:

\[ \min_u \max_p \big\{ \langle \nabla u, p \rangle + \langle u, w_u \rangle \big\}, \quad \text{s.t. } u \in [0, 1],\ p \in [-w_b, w_b], \tag{7} \]

with the dual variable $p \in Y$. We will show in Section 3.3 that the saddle point formulation gives us a meaningful convergence criterion.

3.2 Total Variation Formulation
The continuous equivalent to the weighted graph in the previous section is a Riemannian space $R$ that consists of a domain $\Omega$ and an associated metric $c_b : \Omega \to \mathbb{R}^+$. If we assume that $u, c_s, c_t, c_b : \Omega \to \mathbb{R}$ are now continuous functions, we can write the continuous min-cut/max-flow problem as

\[ \min_u \left\{ \int_\Omega c_b\, |Du|_2 + \int_\Omega c_t\, u \, dx + \int_\Omega c_s (1 - u) \, dx \right\}, \quad \text{s.t. } u(x) \in \{0, 1\}. \tag{8} \]

Note that the Total Variation $|Du|_2$, with $D$ the continuous gradient, is here taken in the distributional sense. We can again simplify the above model with $c_u = c_t - c_s$ and $\int_\Omega c_t u + c_s (1 - u) \, dx = \int_\Omega c_u u \, dx + \int_\Omega c_s \, dx$, neglecting the constant term. To make the above optimization problem convex, we relax the binary constraint to a continuous one, $u(x) \in [0, 1]$. As a result, the optimum is no longer guaranteed to be binary. The well-known thresholding theorem [9], [11] states that all upper level sets $\{x \in \Omega \mid u^*(x) > \theta\}$, $\theta \in [0, 1)$, of the optimal solution $u^*$ of the relaxed problem provide a globally optimal solution to the binary labeling problem in (8). Although we will see later that this property does not hold in the discrete setting, it is also a motivation to apply global relabeling steps to the TV based segmentation model.

As we have a discretized input image, we have to optimize a discrete version of (8). In a discrete setting we can again use the vectors $w_u$ for the unary and $w_b$ for the binary terms, exactly as used for the graph cut model in the previous section. The discrete minimization problem of (8) thus becomes

\[ \min_u \big\{ \|W_b \nabla u\|_{2,1} + \langle u, w_u \rangle \big\}, \quad \text{s.t. } u \in [0, 1], \tag{9} \]
with the norm $\|\nabla u\|_{2,1} = \sum_{i,j} \sqrt{(\delta_1 u)_{i,j}^2 + (\delta_2 u)_{i,j}^2}$. Note that the only difference to the graph cut energy in (6) is the point-wise $\ell_2$ norm for the regularization term. The primal-dual formulation of (9) is then given as

\[ \min_u \max_p \big\{ \langle \nabla u, p \rangle + \langle u, w_u \rangle \big\}, \quad \text{s.t. } u \in [0, 1],\ \|p\|_\infty \le w_b. \tag{10} \]
An advantage of the TV based formulation is that, in contrast to the graph cut energy, it does not suffer from metrication errors [11]. This will become obvious in the experimental results in Section 5. Unfortunately, the discrete functional in (9) no longer guarantees a binary solution, since the thresholding theorem holds only in the continuous formulation. We will see in the experimental section that, as the solution of (9) is almost binary, the proposed global relabeling schema still results in significant speedups.

3.3 Primal-Dual Gap
In the following, we derive the normalized primal-dual gap, which we will use as a convergence criterion. We start with the primal energy for the graph cut model and the TV model:

\[
\begin{aligned}
E_p^{GC}(u) &= \|W_b \nabla u\|_1 + \langle u, w_u \rangle + \|w_t\|_1, \\
E_p^{TV}(u) &= \|W_b \nabla u\|_{2,1} + \langle u, w_u \rangle + \|w_t\|_1.
\end{aligned}
\tag{11}
\]

Note that we have to include the term $\|w_t\|_1$ (neglected during optimization) in the energy calculation to ensure that $E_p > 0$. Otherwise the normalization in (15) would not make sense. The computation of the dual energy is the same for the GC and the TV model. To find a dual-only formulation of (7) and (10), we have to obtain the optimal $u$ for a given $p$ as

\[ \hat{u} = \arg\min_u \langle u, \nabla^T p + w_u \rangle, \quad \text{s.t. } u \in [0, 1]. \tag{12} \]

It is trivial to obtain the optimal $\hat{u}$ for a given $p$ as

\[ \hat{u}_{i,j} = \begin{cases} 1 & \text{if } (\nabla^T p)_{i,j} + (w_u)_{i,j} < 0, \\ 0 & \text{else}. \end{cases} \tag{13} \]

Therefore, the dual energy $E_d$ can be written as

\[ E_d(p) = \langle \hat{u}, \nabla^T p + w_u \rangle + \|w_t\|_1. \tag{14} \]

The normalized primal-dual gap is then given as

\[ G(u, p) = \frac{E_p(u) - E_d(p)}{E_p(u)}. \tag{15} \]
The gap will become 0 if u reaches the globally optimal solution. Thus it provides an excellent optimality measure [22] and will therefore be used later on for convergence analysis as well as a measure of optimality in the global relabeling step.
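As a worked illustration of (11) to (15) for the graph cut model, the following sketch evaluates the normalized gap on a tiny image. It is only a sketch under stated assumptions: $K = 2$, images as lists of rows, and the constant offset from (11) and (14) passed in as a hypothetical parameter `const`. The function names are introduced here and are not taken from the authors' code. Feasibility of $p$ (the box constraint in (7)) is assumed and not checked.

```python
def grad(u):
    """Forward differences delta_1, delta_2 with Neumann boundary."""
    I, J = len(u), len(u[0])
    d1 = [[u[i + 1][j] - u[i][j] if i < I - 1 else 0.0 for j in range(J)]
          for i in range(I)]
    d2 = [[u[i][j + 1] - u[i][j] if j < J - 1 else 0.0 for j in range(J)]
          for i in range(I)]
    return d1, d2


def grad_T(p):
    """Adjoint of grad, applied to p = (p1, p2)."""
    p1, p2 = p
    I, J = len(p1), len(p1[0])
    out = [[0.0] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            if i < I - 1:
                out[i][j] -= p1[i][j]
                out[i + 1][j] += p1[i][j]
            if j < J - 1:
                out[i][j] -= p2[i][j]
                out[i][j + 1] += p2[i][j]
    return out


def primal_energy_gc(u, w_u, w_b, const):
    """Graph cut primal energy as in (11); w_b = (w_b1, w_b2)."""
    d = grad(u)
    I, J = len(u), len(u[0])
    reg = sum(w_b[k][i][j] * abs(d[k][i][j])
              for k in range(2) for i in range(I) for j in range(J))
    data = sum(u[i][j] * w_u[i][j] for i in range(I) for j in range(J))
    return reg + data + const


def dual_energy(p, w_u, const):
    """Dual energy (14): u_hat from (13) picks exactly the negative entries."""
    g = grad_T(p)
    return sum(min(0.0, g[i][j] + w_u[i][j])
               for i in range(len(w_u)) for j in range(len(w_u[0]))) + const


def normalized_gap(u, p, w_u, w_b, const):
    """Normalized primal-dual gap (15)."""
    e_p = primal_energy_gc(u, w_u, w_b, const)
    return (e_p - dual_energy(p, w_u, const)) / e_p


# Usage: a 1 x 2 image with one horizontal edge of weight 1.
u = [[1.0, 0.0]]
w_u = [[-2.0, 3.0]]
w_b = ([[0.0, 0.0]], [[1.0, 0.0]])
p_opt = ([[0.0, 0.0]], [[-1.0, 0.0]])   # saturates the edge constraint
p_zero = ([[0.0, 0.0]], [[0.0, 0.0]])
gap_opt = normalized_gap(u, p_opt, w_u, w_b, const=2.0)
gap_zero = normalized_gap(u, p_zero, w_u, w_b, const=2.0)
```

In this toy case the labeling `u` is optimal, so the gap vanishes only once the dual variable is optimal as well. This is why the gap is evaluated on the pair $(u, p)$ rather than on $u$ alone.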
4 Algorithm
We will first review the primal-dual algorithm [22] that we use and apply it to the graph cut problem (7) and the TV segmentation model (10) as defined in the previous section. Then we will describe the global relabeling steps and summarize the overall algorithm.

4.1 Primal Dual Optimization

One of the fastest algorithms to date (especially on the GPU) is the first-order primal-dual algorithm of Chambolle and Pock [22]. Applying the algorithm to the saddle point problems in (7) and (10), we obtain the following iterative update rules:

\[
\begin{aligned}
p^{n+1} &= \Pi_{w_b}\big(p^n + \sigma \nabla (2u^n - u^{n-1})\big), \\
u^{n+1} &= \Pi_{[0,1]}\big(u^n - \tau (\nabla^T p^{n+1} + w_u)\big),
\end{aligned}
\tag{16}
\]

where the reprojection $\Pi_{[0,1]}(u) = \max(0, \min(1, u))$ is a simple clamping to the interval $[0, 1]$. For the graph cut energy (7), the reprojection $\Pi^{GC}_{w_b}(p)$ again results in a simple clamping to the interval $[-w_b, w_b]$. In the case of the total variation formulation (10), the reprojection $\Pi^{TV}_{w_b}(p) = p / \max\{1, |p| / w_b\}$ is an orthogonal projection onto an $\ell_2$ ball of radius $w_b$. We will refer to the primal and dual updates in (16) as the pd step. The timesteps $\tau$ and $\sigma$ have to fulfill the condition $\tau \sigma L^2 < 1$ with the Lipschitz constant $L^2 = \|\nabla\|^2$.

4.2 Global Relabeling
As the example in Fig. 1 shows, the pd steps described in the previous section are very fast in the beginning, but often slow down as the result gets closer to the optimal solution. The main problem is small areas that change their value very slowly. With the global relabeling (grl) step we want to assign these regions either the value 1 or 0. To keep this grl step fast and efficient on parallel hardware, we simply compute the upper level sets of $u$ by thresholding $u$ several times in the range $(0, 1)$. We can then compute the best thresholded version $\tilde{u}$ and the corresponding $\tilde{p}$ as

\[ (\tilde{u}, \tilde{p}) = \arg\min_{\theta \in (0,1)} \{ G(u_\theta, p_\theta) \}, \tag{17} \]

where

\[ (u_\theta)_{i,j} = \begin{cases} 1 & \text{if } u_{i,j} > \theta, \\ 0 & \text{else}, \end{cases} \qquad p_\theta = \Pi_{w_b}(p + c \nabla u_\theta). \tag{18} \]

Hence, $u_\theta$ represents thresholded versions of $u$. To obtain $p_\theta$, the dual update equation is evaluated with a very large timestep $c \gg 1$.
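The thresholding in (17) and (18) can be sketched as follows. This is a simplified illustration, not the authors' code: the selection criterion is passed in as a generic `score` callable (in the paper it is the normalized gap $G(u_\theta, p_\theta)$, whereas the usage below substitutes a toy one-dimensional energy), and the dual update uses the graph cut clamping on a flat list of edges. All function names are hypothetical.

```python
def threshold(u, theta):
    """Upper level set of u at level theta, as in (18)."""
    return [[1.0 if x > theta else 0.0 for x in row] for row in u]


def relabel_dual(p, grad_u_theta, w_b, c=1e6):
    """p_theta from (18) for the graph cut model: one dual update with a
    very large step c, clamped to [-w_b, w_b]. All arguments are flat
    lists with one entry per edge."""
    return [max(-w, min(w, pk + c * gk))
            for pk, gk, w in zip(p, grad_u_theta, w_b)]


def best_threshold(u, thetas, score):
    """Return (score, theta, u_theta) with the smallest score, as in (17)."""
    best = None
    for theta in thetas:
        cand = threshold(u, theta)
        s = score(cand)
        if best is None or s < best[0]:
            best = (s, theta, cand)
    return best


# Toy stand-in for the gap: a 1-D chain energy with edge weight 0.5.
w_u = [[-1.0, 0.25, 1.0]]

def toy_energy(lab):
    row = lab[0]
    smooth = sum(0.5 * abs(row[j + 1] - row[j]) for j in range(len(row) - 1))
    data = sum(x * w for x, w in zip(row, w_u[0]))
    return smooth + data


u = [[0.97, 0.6, 0.02]]
thetas = (0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99)
best_score, best_theta, u_tilde = best_threshold(u, thetas, toy_energy)
p_theta = relabel_dual([0.2, -0.1], [-1.0, 0.0], [0.5, 0.5])
```

Because every threshold is evaluated independently, the candidates can be scored in parallel, which matches the motivation given above for keeping the grl step simple.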
If the solution $\tilde{u}$ is significantly closer to the global optimum than the current solution $u$, we accept the grl step; otherwise we continue optimization from the current solution $u$. As we need a non-binary $u$ for the thresholding to work, we compute the grl step only every $M = \max(I, J)$ iterations.

[Figure 2: schematic with panels (a) First grl step and (b) Second grl step. Labels in the drawing: Φ, Σ, Ψ, pd, grl, G_min, ωG_min.]

Fig. 2. Illustration of the optimization schema. While the pd steps can move freely through the space of solutions, the grl step only allows jumps to binary solutions that are closer to the global optimum than the best solution so far.
In Fig. 2, we illustrate the overall primal-dual algorithm with global relabeling (pdgrl). We denote by $\Phi = [0, 1]^{IJ}$ the feasible set of the relaxed labeling vector and by $\Psi \subseteq \Phi$ the set of solutions (note that in general, graph cuts can have multiple solutions). Furthermore, let $\Sigma = \{0, 1\}^{IJ}$ be the set of binary labeling vectors, and hence $\Psi \cap \Sigma$ the set of binary solutions. Starting from an arbitrary initialization, the pd steps change the labeling vector according to (primal and dual) gradient information. This can also result in a temporary increase of the gap. The grl step is only considered if the resulting gap is smaller than $\omega G_{min}$. The parameter $0 < \omega < 1$ allows a global relabeling only if $G$ is significantly reduced. We used $\omega = 0.5$ throughout the paper. Algorithm 1 summarizes the proposed pdgrl algorithm. Details on the convergence criterion tol will be given in Section 5.

Convergence of the algorithm: Convergence follows from the fact that both steps, the primal-dual optimization pd and the global relabeling grl, are guaranteed to decrease the gap. In fact, we allow a grl step only if the new gap $G$ after grl is smaller than the minimal gap $G_{min}$ obtained by pd so far. In [22], it is shown that the pd algorithm decreases the primal-dual gap with a sublinear rate of $O(1/N)$, where $N$ is the total number of iterations. Although the proposed global relabeling does not change this estimate, it empirically gives a super-linear convergence close to the optimal solution.
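The interplay of pd and grl steps can be sketched for the graph cut model as follows. This is a minimal illustration under several assumptions that are not taken from the paper: $K = 2$, row-major flattened images, step sizes $\tau = \sigma = 0.35$ chosen so that $\tau \sigma \cdot 8 < 1$ (assuming the commonly used bound $\|\nabla\|^2 \le 8$ for forward differences), a hypothetical `const` parameter for the constant energy offset, and resetting $u^{n-1}$ to the accepted labeling after a grl step. It is a CPU toy, not the authors' Matlab or CUDA implementation.

```python
def grad(u, I, J):
    """Forward differences (K = 2) of a row-major flattened I x J image."""
    n = I * J
    g = [0.0] * (2 * n)
    for i in range(I):
        for j in range(J):
            c = i * J + j
            if i < I - 1:
                g[c] = u[c + J] - u[c]          # delta_1
            if j < J - 1:
                g[n + c] = u[c + 1] - u[c]      # delta_2
    return g


def grad_T(p, I, J):
    """Adjoint of grad."""
    n = I * J
    out = [0.0] * n
    for i in range(I):
        for j in range(J):
            c = i * J + j
            if i < I - 1:
                out[c] -= p[c]
                out[c + J] += p[c]
            if j < J - 1:
                out[c] -= p[n + c]
                out[c + 1] += p[n + c]
    return out


def clamp_dual(p, w_b):
    """Graph cut reprojection: clamp each entry to [-w_b, w_b]."""
    return [max(-w, min(w, x)) for x, w in zip(p, w_b)]


def gap(u, p, w_u, w_b, const, I, J):
    """Normalized primal-dual gap for the graph cut model."""
    e_p = sum(w * abs(d) for w, d in zip(w_b, grad(u, I, J)))
    e_p += sum(a * b for a, b in zip(u, w_u)) + const
    e_d = sum(min(0.0, s + w) for s, w in zip(grad_T(p, I, J), w_u)) + const
    return (e_p - e_d) / e_p


def pdgrl(w_u, w_b, const, I, J, tol=1e-9, omega=0.5, max_outer=200,
          thetas=(0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99), c=1e6):
    n, M = I * J, max(I, J)
    tau = sigma = 0.35                     # assumption: tau * sigma * 8 < 1
    u, u_old, p = [0.5] * n, [0.5] * n, [0.0] * (2 * n)
    g_min = float("inf")
    for _ in range(max_outer):
        for _ in range(M):                 # pd steps
            u_bar = [2.0 * a - b for a, b in zip(u, u_old)]
            p = clamp_dual([x + sigma * d
                            for x, d in zip(p, grad(u_bar, I, J))], w_b)
            step = grad_T(p, I, J)
            u_old = u
            u = [max(0.0, min(1.0, a - tau * (s + w)))
                 for a, s, w in zip(u, step, w_u)]
        g_min = min(g_min, gap(u, p, w_u, w_b, const, I, J))
        cands = []                         # grl step: score every threshold
        for th in thetas:
            u_t = [1.0 if a > th else 0.0 for a in u]
            p_t = clamp_dual([x + c * d
                              for x, d in zip(p, grad(u_t, I, J))], w_b)
            cands.append((gap(u_t, p_t, w_u, w_b, const, I, J), u_t, p_t))
        g_t, u_t, p_t = min(cands, key=lambda t: t[0])
        if g_t <= omega * g_min:           # accept only a significant gain
            u, p, u_old = u_t, p_t, u_t[:]
        if gap(u, p, w_u, w_b, const, I, J) <= tol:
            break
    return u, p, gap(u, p, w_u, w_b, const, I, J)


# Usage: 2 x 2 image whose top row prefers label 1 and bottom row label 0.
I, J = 2, 2
w_u = [-2.0, -2.0, 2.0, 2.0]
w_b = [0.1] * (2 * I * J)
u, p, g = pdgrl(w_u, w_b, const=4.0, I=I, J=J)
```

The value of `const` must be large enough to keep the primal energy positive, otherwise the normalization in the gap is meaningless. The sketch does not check this.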
Algorithm 1. Primal dual algorithm with global relabeling (pdgrl)
  repeat
    for 1, ..., M do
      $p^{n+1} \leftarrow \Pi_{w_b}\big(p^n + \sigma \nabla (2u^n - u^{n-1})\big)$    // Dual update
      $u^{n+1} \leftarrow \Pi_{[0,1]}\big(u^n - \tau (\nabla^T p^{n+1} + w_u)\big)$    // Primal update
      $n \leftarrow n + 1$
    end for
    $G_{min} \leftarrow \min\{G_{min}, G(u^n, p^n)\}$
    $(\tilde{u}, \tilde{p}) \leftarrow \arg\min_{\theta \in (0,1)} \{G(u_\theta, p_\theta)\}$    // Thresholding
    if $G(\tilde{u}, \tilde{p}) \le \omega G_{min}$ then
      $(u^n, p^n) \leftarrow (\tilde{u}, \tilde{p})$    // Global relabeling
    end if
  until $G(u^n, p^n) \le$ tol

5 Experimental Results

Implementation: Experiments were conducted on an Intel Core i7 960 with 12 GB available memory and an NVidia GeForce GTX 480 with 1.5 GB available memory. The segmentation framework was implemented in Matlab. The actual algorithms were additionally implemented on the GPU using the CUDA framework.¹ Measured times do not include any transfer times between CPU and GPU (the approximate overhead ranges from 10 ms for images of size 256 × 256 to 700 ms for images of size 3200 × 3200). For the grl step we need to compute the normalized primal-dual gap $G$ for each threshold $\theta$. As this computation is costly, the number of thresholds should be kept low. Experiments showed that $\theta \in \{0, 0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99\}$ provides enough different thresholds and is reasonably fast. The convergence criterion tol was chosen as follows: when using 64-bit double precision (Matlab), numerical accuracy is reached at a normalized primal-dual gap of $G = 10^{-14}$. For 32-bit float precision (CUDA), numerical accuracy is already reached at $G = 10^{-7}$. In the case of the TV segmentation model we set tol $= 5 \cdot 10^{-4}$, as the segmentation did not show any further changes beyond that point.

¹ Matlab code and CUDA libraries are available at http://gpu4vision.org

Unary and Binary Terms: To calculate the unary terms $w_u$, we use two kinds of scribbles. First, we can directly draw sparse source and sink seeds. We then set the unary terms to very large values (or infinity) for these seeds. Second, we can draw scribbles that are used to build color histograms. The normalized probabilities are directly used to obtain $w_s$ and $w_t$. We skip the details here, as they are of no relevance to this paper. For the binary terms $w_b$, we calculate an edge image $g = (g_1, g_2, g_3, g_4)$ as the color gradient of the image. For 4-connected graphs we set $(w_b)_{i,j} = \big(e^{-\alpha |(g_1)_{i,j}|}, e^{-\alpha |(g_2)_{i,j}|}\big)$ with $\alpha > 0$, and for 8-connected graphs the binary terms become $(w_b)_{i,j} = \big(e^{-\alpha |(g_1)_{i,j}|}, \dots, e^{-\alpha |(g_4)_{i,j}|}\big)$. In the case of the TV model, we set $(w_b)_{i,j} = (\bar{g}_{i,j}, \bar{g}_{i,j})$ with $\bar{g}_{i,j} = e^{-\alpha \sqrt{(g_1)_{i,j}^2 + (g_2)_{i,j}^2}}$.

Evaluation: In Fig. 3, we show experiments on typical segmentation problems using the Matlab implementation. For (a,b) we used color information, while for (c,d) the algorithm relies solely on seed regions (only edge information is used).
[Figure 3: segmentation results and convergence plots for four examples (a)-(d). Plot legends: 4n-pd vs. 4n-pdgrl, 8n-pd vs. 8n-pdgrl, tv-pd vs. tv-pdgrl. The vertical axes are logarithmic.]

Fig. 3. Comparison of the pd algorithm and the primal-dual algorithm with global relabeling (pdgrl) for different segmentation problems. The segmentation results in the top row correspond to tv, 8n and 4n in clockwise order.
The results show that the algorithm with global relabeling steps (pdgrl, depicted in blue) converges significantly faster than the pd algorithm alone. This is true for the graph cut models as well as the TV model. Note that for the pd algorithm, some experiments did not converge within the 10000 iterations allowed for this experiment. To motivate the choice of M (the iteration interval after which global relabeling is performed), we conducted experiments with varying M in Fig. 4 on an
[Figure 4: panels (a) Input, (b) Segmentation, (c) Convergence depending on M. Panel (c) plots the normalized primal-dual gap against iterations on logarithmic axes. Legend: 8n-pd, 8n-pdgrl M=max(I,J), 8n-pdgrl M=100, 8n-pdgrl M=10, O(1/N).]

Fig. 4. Demonstration of the effect of different global relabeling intervals M. While the chosen intervals (blue) are faster than the standard pd algorithm, M = 100 (cyan) would speed up convergence even more. If global relabeling is done too often, e.g. M = 10 (orange), the algorithm might become slower.
image of size 519 × 324. In Section 4.2, we chose M = max(I, J). As one can see from Fig. 4, this choice is rather conservative, as e.g. M = 100 would result in much faster convergence. On the other hand, M = 10 would result in much slower convergence, as the primal-dual steps need some time to provide a meaningful direction. Setting M = max(I, J) gives the primal-dual steps the chance to propagate information through the whole image before the next global relabeling is performed. Note that when using M = max(I, J), the global relabeling steps never resulted in slower convergence during our large number of experiments. On the other hand, Fig. 4 shows that there might be better strategies for deciding when to perform global relabeling.

In Fig. 5, we compare the proposed pdgrl algorithm on the GPU with a CPU implementation of Boykov and Kolmogorov [4] (denoted as boykov), which is one of the most widely used graph cut implementations to date. Additionally, we compare to the graph cut implementation of the NPP library [24] (npp), which is to our knowledge currently the fastest graph cut implementation on a GPU.² We conducted the experiment for 4 different square images that were scaled to edge lengths of 256, 512, 1024, 2048 and 3200, thus ranging from approximately $6 \cdot 10^4$ to $10^7$ pixels. All algorithms show an approximately linear runtime behaviour. The proposed algorithm is most of the time a bit slower than the npp implementation for 4n. The slowest algorithm is always the boykov CPU implementation with 8n. Note that for the runtime of pdgrl there is not much difference between the graph cut with 4n and 8n and the tv model.

² Note that the npp implementation only works for 4n.
[Figure 5: input images and results in panels (a)-(h), runtime plots in panels (i)-(l): (i) Image (a) with seeds only, (j) Image (c) with seeds and color, (k) Image (e) using color only, (l) Image (g) with seeds only. Each plot shows runtime in seconds against the number of pixels on logarithmic axes. Legend: 4n-pdgrl, 4n-boykov, 4n-npp, 8n-pdgrl, 8n-boykov, tv-pdgrl.]

Fig. 5. Evaluation of the influence of the image size on the runtime of different algorithms
6 Conclusion
We presented a continuous optimization schema with global relabeling steps. Although we focused on extending a single primal-dual algorithm [22], global relabeling steps should work for any optimization algorithm that provides a measure of the distance between the current solution and the globally optimal one. Empirical studies showed that the global relabeling step significantly speeds up convergence. Continuous optimization schemes have the advantage that they can also be used for non-smooth optimization, and therefore also work for the continuous max-flow/min-cut problem. We demonstrated that speedups can be achieved not only for the graph cut model, which has a binary solution, but also for the TV based image segmentation model, where the optimal solution is not guaranteed to be binary. The proposed algorithm can be parallelized efficiently and is therefore well suited for future hardware. Although the current global relabeling strategy performs very well, the experimental results suggest that there might be better strategies for deciding when to perform a global relabeling step. Future work will therefore investigate better global relabeling strategies.
References

1. Greig, D.M., Porteous, B.T., Seheult, A.H.: Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society Series B 51, 271–279 (1989)
2. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1068–1080 (2008)
3. Boykov, Y., Kolmogorov, V.: Computing geodesics and minimal surfaces via graph cuts. In: Ninth IEEE International Conference on Computer Vision, vol. 1, pp. 26–33 (2003)
4. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
5. Rother, C., Kolmogorov, V., Blake, A.: GrabCut - Interactive Foreground Extraction using Iterated Graph Cuts. ACM Transactions on Graphics, SIGGRAPH (2004)
6. Dixit, N., Keriven, R., Paragios, N.: GPU-Cuts: Combinatorial Optimisation, Graphic Processing Units and Adaptive Object Extraction. Technical Report March, Laboratoire Centre Enseignement Recherche Traitement Information Systemes (CERTIS), Ecole Nationale des Ponts et Chaussees, ENPC (2005)
7. Vineet, V., Narayanan, P.J.: CUDA cuts: Fast graph cuts on the GPU. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2008)
8. Bhusnurmath, A., Taylor, C.J.: Graph cuts via l1 norm minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1866–1871 (2008)
9. Strang, G.: Maximal flow through a domain. Mathematical Programming 26, 123–143 (1983)
10. Strang, G.: Maximum flows and minimum cuts in the plane. Journal of Global Optimization 47, 527–535 (2009)
11. Klodt, M., Schoenemann, T., Kolev, K., Schikora, M., Cremers, D.: An experimental comparison of discrete and continuous shape optimization methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 332–345. Springer, Heidelberg (2008)
12. Sinop, A.K., Grady, L.: A Seeded Image Segmentation Framework Unifying Graph Cuts And Random Walker Which Yields A New Algorithm. In: IEEE 11th International Conference on Computer Vision (2007)
13. Couprie, C., Grady, L., Najman, L., Talbot, H.: Power Watershed: A Unifying Graph-Based Optimization Framework. IEEE Transactions on Pattern Analysis and Machine Intelligence (2011)
14. Grady, L.: Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 83, 1768–1783 (2006)
15. Olsson, C., Byrod, M., Overgaard, N.C., Kahl, F.: Extending continuous cuts: Anisotropic metrics and expansion moves. In: IEEE 12th International Conference on Computer Vision, pp. 405–412 (2009)
16. Appleton, B., Talbot, H.: Globally minimal surfaces by continuous maximal flows. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 106–118 (2006)
17. Chan, T.F., Esedoglu, S., Nikolova, M.: Algorithms for Finding Global Minimizers of Image Segmentation and Denoising Models. SIAM Journal on Applied Mathematics 66, 16–32 (2006)
18. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. International Journal of Computer Vision 1, 61–79 (1997)
19. Leung, S., Osher, S.: Global Minimization of the Active Contour Model with TV-Inpainting and Two-phase Denoising. In: 3rd IEEE Workshop on Variational, Geometric and Level Set Methods in Computer Vision, pp. 149–160 (2005)
20. Unger, M., Pock, T., Bischof, H.: Continuous Globally Optimal Image Segmentation with Local Constraints. In: Computer Vision Winter Workshop, Moravske Toplice, Slovenija (2008)
21. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J.P., Osher, S.: Fast Global Minimization of the Active Contour/Snake Model. Journal of Mathematical Imaging and Vision 28, 151–167 (2007)
22. Chambolle, A., Pock, T.: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. Journal of Mathematical Imaging and Vision, 1–26 (2010)
23. Chambolle, A.: Total Variation Minimization and a Class of Binary MRF Models. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, vol. 1, pp. 136–152 (2005)
24. NVidia: NVIDIA Performance Primitives (NPP) Version 3.2.16. Technical report (2010)
Stop Condition for Subgradient Minimization in Dual Relaxed (max,+) Problem

Michail Schlesinger, Evgeniy Vodolazskiy, and Nikolai Lopatka

International Research & Training Center of Information Technologies and Systems of National Academy of Sciences of Ukraine

Abstract. Subgradient descent methods for minimization of the dual linear relaxed labeling problem are analysed. They are guaranteed to converge to the quality of the optimal relaxed labeling, but do not obtain an optimal relaxed labeling itself. Moreover, no stop condition has been known for these methods up to now. A stop condition is defined and experimentally compared with the commonly used stop conditions. The stop condition is defined in such a way that, when it is fulfilled, a relaxed labeling is simultaneously obtained with an arbitrary non-zero difference from the optimal labeling.

Keywords: Labeling problem, relaxation, equivalent transformation, reparametrisation, subgradient descent, energy minimization.
1 Introduction: Definition of Main Concepts

Let $T$ be a finite set of pixels and $K$ be a finite set of labels. Let a function $k : T \to K$ be called a labeling that for each pixel $t \in T$ defines a label $k(t) \in K$. The labeling $k$ will also be referred to as a strict labeling, as opposed to the relaxed labeling defined below. Let $\tau$ be a set of unordered pairs $(t, t') \in \tau$ of pixels that are called neighbours. The set $\tau$ is symmetric and irreflexive in the sense that $(t, t') \in \tau \Leftrightarrow (t', t) \in \tau$ for any two pixels and $(t, t) \notin \tau$. We will often use the notation $tt' \in \tau$ instead of $(t, t') \in \tau$ for simplicity. The subset $N(t) = \{t' \mid tt' \in \tau\} \subset T$ is the set of neighbours of the pixel $t$. So the following three expressions are equivalent:

\[ t' \in N(t) \Leftrightarrow t \in N(t') \Leftrightarrow tt' \in \tau. \tag{1} \]
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 118–131, 2011. © Springer-Verlag Berlin Heidelberg 2011

An ordered pair $(t, k)$, $t \in T$, $k \in K$, will be called a vertex. For each neighbouring pair $tt' \in \tau$ of pixels, an unordered pair of vertices $((t, k), (t', k'))$ will be called an edge. We will say that a vertex $(t^*, k^*)$ belongs to a labeling $k$ if $k(t^*) = k^*$. An edge $((t^*, k^*), (t^{**}, k^{**}))$ belongs to a labeling $k$ if both vertices $(t^*, k^*)$ and $(t^{**}, k^{**})$ belong to $k$. Let $q(t, k)$ be a real number defined for each vertex $(t, k)$ and called a vertex quality. Let $g((t, k), (t', k'))$ be a real number defined for each edge $((t, k), (t', k'))$
Stop Condition for Subgradient Descent
and called an edge quality. The quality $G(k)$ of a labeling $k$ is defined as the sum of the qualities of all edges and vertices that belong to the labeling:

\[ G(k) = \sum_{tt' \in \tau} g\big((t, k(t)), (t', k(t'))\big) + \sum_{t \in T} q\big(t, k(t)\big). \tag{2} \]

The strict labeling problem consists in finding the labeling with the best quality:

\[ k^* = \arg\max_{k \in K^T} G(k). \tag{3} \]
The problem is known to be NP-complete. The so-called relaxed modification of this problem is much simpler. It is based on the following concepts. Let $\alpha(t, k)$, $t \in T$, $k \in K$, be a real number called the weight of a vertex $(t, k)$, and let $\beta((t, k), (t', k'))$, $tt' \in \tau$, $k \in K$, $k' \in K$, be a real number called the weight of an edge $((t, k), (t', k'))$. Let $\alpha$ be the array $\big(\alpha(t, k) \mid t \in T, k \in K\big)$ of vertex weights and $\beta$ be the array $\big(\beta((t, k), (t', k')) \mid tt' \in \tau, k \in K, k' \in K\big)$ of edge weights. The pair $(\alpha, \beta)$ is called a weight function. A weight function $(\alpha, \beta)$ is called a relaxed labeling if it satisfies the following conditions:

\[
\begin{cases}
\alpha(t, k) = \sum_{k' \in K} \beta\big((t, k), (t', k')\big), & t \in T,\ k \in K,\ t' \in N(t); \\
\sum_{k \in K} \alpha(t, k) = 1, & t \in T; \\
\beta\big((t, k), (t', k')\big) \ge 0, & tt' \in \tau,\ k \in K,\ k' \in K.
\end{cases}
\tag{4}
\]

The quality of a relaxed labeling is defined as follows:

\[ G(\alpha, \beta) = \sum_{t \in T} \sum_{k \in K} \alpha(t, k) \cdot q(t, k) + \sum_{tt' \in \tau} \sum_{k \in K} \sum_{k' \in K} \beta\big((t, k), (t', k')\big) \cdot g\big((t, k), (t', k')\big). \tag{5} \]

The relaxed labeling problem consists in finding the relaxed labeling $(\alpha^*, \beta^*)$ with the best quality:

\[ (\alpha^*, \beta^*) = \arg\max_{(\alpha, \beta)} G(\alpha, \beta). \tag{6} \]

One can see that by restricting $\alpha(t, k)$ and $\beta((t, k), (t', k'))$ to integer values, the problem (6) becomes equivalent to the strict problem (3). In this case $\alpha(t, k') = 1$ in the relaxed problem means that $k(t) = k'$ in the strict problem.
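The correspondence between strict and integer-valued relaxed labelings can be checked on a toy instance. The sketch below is an illustration only: the three-pixel chain, the qualities `q` and `g`, and all function names are made up for this example. It solves (3) by brute force, converts the result to a weight function, and verifies conditions (4) and the equality of (2) and (5).

```python
from itertools import product

T = [0, 1, 2]                      # pixels of a small chain
K = [0, 1]                         # labels
tau = [(0, 1), (1, 2)]             # neighbouring pairs tt'
q = {(0, 0): 1.5, (0, 1): 0.0,     # vertex qualities q(t, k)
     (1, 0): 0.0, (1, 1): 0.5,
     (2, 0): 0.0, (2, 1): 2.0}


def g(t, k, t2, k2):
    """Edge quality: reward 1 when both pixels carry the same label."""
    return 1.0 if k == k2 else 0.0


def quality(lab):
    """G(k) from (2); lab[t] is the label of pixel t."""
    return (sum(g(t, lab[t], t2, lab[t2]) for t, t2 in tau)
            + sum(q[(t, lab[t])] for t in T))


def to_relaxed(lab):
    """Integer-valued weight function (alpha, beta) of a strict labeling."""
    alpha = {(t, k): 1.0 if lab[t] == k else 0.0 for t in T for k in K}
    beta = {(t, k, t2, k2): alpha[(t, k)] * alpha[(t2, k2)]
            for t, t2 in tau for k in K for k2 in K}
    return alpha, beta


def is_relaxed_labeling(alpha, beta):
    """Check the three conditions in (4)."""
    for t, t2 in tau:
        for k in K:
            if sum(beta[(t, k, t2, k2)] for k2 in K) != alpha[(t, k)]:
                return False
            if sum(beta[(t, k2, t2, k)] for k2 in K) != alpha[(t2, k)]:
                return False
    sums_ok = all(sum(alpha[(t, k)] for k in K) == 1.0 for t in T)
    return sums_ok and all(b >= 0.0 for b in beta.values())


def relaxed_quality(alpha, beta):
    """G(alpha, beta) from (5)."""
    return (sum(alpha[(t, k)] * q[(t, k)] for t in T for k in K)
            + sum(b * g(t, k, t2, k2) for (t, k, t2, k2), b in beta.items()))


# Brute-force solution of the strict problem (3) over all |K|^|T| labelings.
best = max(product(K, repeat=len(T)), key=quality)
alpha, beta = to_relaxed(best)
```

Brute force is only feasible here because the instance has eight labelings. The point of the sketch is the equality of the two quality values, not the search.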
2 Equivalent Transformations of a Relaxed Labeling Problem into Trivial One
There exists an evident sufficient condition of the relaxed labeling optimality. If the relaxed labeling (α, β) fulfils the conditions ⎧ ⎨
q(t, k) < max q(t, l)
⎩ g((t, k), (t , k ))
ε2 then continue subgradient descent. Exit 8. Go to 3 Theorem 5. The algorithm stops in finite time.
7 Experiments

The main idea of the proposed method is that the subgradient descent performs power minimization for any task, while most known methods do not. There are known examples where the known methods do not ensure power minimization; one of them is presented in [3]. The subgradient descent method with the proposed stop condition was tested on this example and its positive properties were validated. The goal of the subsequent experiments was to determine how often this occurs in examples that are close to image processing. In the following experiments the proposed stop condition was tested along with the well-known stop condition used in several algorithms (for example [1], [2]). The known stop condition also selects the set of edges and vertices that are close to the maximum ones, but instead of trying to build a relaxed labeling on it, checks it for consistency (see for example [2]). The same set of edges and vertices is picked for both stop conditions. It is obvious that with ε = 0 the proposed stop condition is stricter than the known one. With positive ε this may not be the case. The set of pixels T forms a square field of size n × n. Each pixel (except for the ones on the borders of the field) has 4 neighbours: on the left, right, up and down. To visualize the results, the strict labeling is estimated from the obtained relaxed labeling by picking the label with the maximum weight in each pixel.

7.1 Vertical and Horizontal Lines

An image consists of black vertical and horizontal lines on a white background. It gets distorted by altering some pixel colors. The task is to restore the original image from the distorted one. More precisely, it is necessary to find a set of vertical and horizontal lines that produce an image such that the number of pixels that differ from the presented image is minimal. The set of labels for this problem has 4 elements $VH$, $\bar{V}H$, $V\bar{H}$, $\bar{V}\bar{H}$ with the following meaning: $VH$ - both a vertical and a horizontal line pass through the pixel; $\bar{V}H$ - only a horizontal line passes through the pixel; $V\bar{H}$ - only a vertical line passes through the pixel; $\bar{V}\bar{H}$ - no lines pass through the pixel. For two horizontal neighbours the edge qualities are picked such that the presence of a horizontal line must be the same in both pixels. For two vertical neighbours the edge qualities are picked such that the presence of a vertical line must be the same. These qualities of two horizontal neighbouring pixels are
Stop Condition for Subgradient Descent
Fig. 1. Drawn edges have weight 0, others have −∞
shown in Fig. 1. The vertex qualities are produced from the distorted image. A black pixel has weight (−1) for the label $\bar{V}\bar{H}$ and 0 for the rest. A white pixel has weight 0 for the label $\bar{V}\bar{H}$ and (−1) for the rest. Each row of Table 1 describes one experiment: the field size, the parameters δ and ε, and the power levels obtained by the known stop condition and by the proposed one. In each experiment subgradient descent with the proposed stop condition reached the power minimum. The last experiment from Table 1 is presented in Fig. 2.

Table 1. Vertical and horizontal lines tests

Size      δ      ε      Consistency stop condition   Proposed stop condition
15 × 15   0.01   0.01   −35.911                      −36
15 × 15   0.01   0.01   −28.973                      −29
20 × 20   0.01   0.01   −49.986                      −50
20 × 20   0.01   0.01   −60.971                      −61
20 × 20   0.01   0.01   −55.975                      −56
The proposed stop condition is slightly stricter than the known one in these examples.

7.2 Segmentation
The neighbouring structure is the same as in the first experiment. There are 3 labels, each corresponding to one color. For any neighbouring pixel pair $(t, t')$ the weight function $g$ is defined as
\[
g_{tt'}(k, k') = \begin{cases} C, & k = k', \\ 0, & k \neq k'. \end{cases} \tag{27}
\]
In the following experiments C = 3.
M. Schlesinger, E. Vodolazskiy, and N. Lopatka
Fig. 2. From the left: original image, distorted image, restored image by the proposed stop condition
Using these $g$ and setting $q_t(k) = 0$, the images were generated with a Gibbs sampler. Then each pixel independently changed its color with some fixed probability $p$, producing the distorted image. The new values $q$ were obtained from the distorted image:
\[
q_t(k) = \begin{cases} \ln\left(1 - p + \dfrac{p}{n}\right), & k \text{ is the same as on the image,} \\[4pt] \ln \dfrac{p}{n}, & \text{otherwise.} \end{cases} \tag{28}
\]
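The qualities (27) and (28) can be written down in a few lines. The following is a minimal sketch, not the authors' code: it assumes that n in (28) is the number of labels and that a changed pixel draws its new color uniformly from all labels (under that noise model the two cases of (28) are exactly the log-probabilities of the observed color). The function names are hypothetical.

```python
import numpy as np

def potts_edge_weights(num_labels, C=3.0):
    """Edge qualities g_{tt'}(k, k') as in (27): C if k == k', 0 otherwise."""
    return C * np.eye(num_labels)

def distort(image, p, num_labels, rng):
    """Each pixel independently changes its color with probability p.
    Assumption: a changed pixel is redrawn uniformly from all labels."""
    flip = rng.random(image.shape) < p
    noise = rng.integers(0, num_labels, size=image.shape)
    return np.where(flip, noise, image)

def vertex_qualities(distorted, p, num_labels):
    """Vertex qualities q_t(k) as in (28); result has shape image.shape + (n,)."""
    n = num_labels  # assumption: n in (28) is the number of labels
    q = np.full(distorted.shape + (n,), np.log(p / n))
    same = np.arange(n) == distorted[..., None]   # k equals the observed color
    q[same] = np.log(1.0 - p + p / n)
    return q
```

With this reading, exponentiating the qualities of one pixel gives values that sum to 1 over the labels, which is a quick sanity check on the reconstruction of (28).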
Table 2. Segmentation

Original    Distorted    Consistency   ε = 400    ε = 40
100 × 100   10% noise    53311.1       53331.3    53311.7
100 × 100   20% noise    50794.9       50818.1    50804.9
The results are presented in Table 2. The five columns show, from left to right: the generated image with its size; the distorted image with the probability of a pixel changing its color; the image produced once the known consistency-based stop condition was fulfilled, with the power level obtained; and the images produced with the proposed stop condition for ε = 400 and ε = 40, with their respective power levels. As one can see, in these examples the known stop condition turned out to be stricter than the proposed one.
8 Concluding Remarks

The experiments showed that the proposed stop condition can be used to determine the power optimum as well as to estimate the optimal relaxed labeling. With some ε values the proposed stop condition is stricter than the known condition, with others it is not. In our experiments both stop conditions were approximately the same in terms of the power level obtained. The known stop condition has, of course, the advantage of being easier to compute. Moreover, there exist simpler and in some cases faster algorithms that achieve the known stop condition. On the other hand, the advantage of the proposed method is that once it stops it simultaneously produces a relaxed labeling. Thus the proposed and the known stop conditions supplement each other, and it is reasonable to use them together in the following way.

1. The known stop condition is used together with algorithms that ensure the achievement of this condition, e.g., diffusion.
2. After the known stop condition is satisfied, the proposed condition is checked.
3. If the proposed stop condition is satisfied, the relaxed labeling is produced with the required precision.
4. If the proposed stop condition is not satisfied, more complex algorithms have to be used that guarantee the achievement of the proposed condition, e.g., subgradient descent.
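The four steps above can be sketched as a small driver function. This is only an illustration of the control flow: the three callables are hypothetical placeholders for a diffusion-type algorithm, a check of the proposed stop condition, and subgradient descent, none of which are specified here.

```python
def combined_procedure(problem, run_until_known_condition,
                       check_proposed_condition, subgradient_descent):
    """Combine the known and the proposed stop conditions (steps 1-4).

    All three callables are hypothetical placeholders:
      run_until_known_condition(problem) -> state
      check_proposed_condition(state)    -> (satisfied, relaxed_labeling)
      subgradient_descent(state)         -> relaxed_labeling
    """
    # Step 1: cheap algorithm (e.g. diffusion) until the known condition holds.
    state = run_until_known_condition(problem)
    # Step 2: check the proposed condition on the result.
    satisfied, relaxed_labeling = check_proposed_condition(state)
    # Step 3: if it holds, the relaxed labeling is already available.
    if satisfied:
        return relaxed_labeling
    # Step 4: otherwise fall back to an algorithm that guarantees it.
    return subgradient_descent(state)
```

The point of this ordering is that the more expensive fallback is only invoked when the cheaper algorithm stops at a point where the proposed condition fails.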
References

1. Kolmogorov, V.: Convergent Tree-reweighted Message Passing for Energy Minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28(10), 1568–1583 (2006)
2. Schlesinger, M.I., Antoniuk, K.V.: Diffusion algorithms and structural recognition optimization problems. Cybernetics and Systems Analysis, vol. 48(2), pp. 3–12 (2011) (in Russian); English translation of the paper will be published at Cybernetics and Systems Analysis, vol. (2). Springer, Heidelberg (2011)
3. Schlesinger, M.I., Giginjak, V.V.: Solving (max,+) problems of structural pattern recognition using equivalent transformations. In: Upravlyayushchie Sistemy i Mashiny (Control Systems and Machines), Kiev, Naukova Dumka, vol. (1 and 2) (2007) (in Russian); English version: http://www.irtc.org.ua/image/publications/SchlGig_1_ru (Part 1), http://www.irtc.org.ua/image/publications/SchlGig_2_ru (Part 2)
4. Schlesinger, M.I., Antoniuk, K.V., Vodolazskii, E.V.: Optimal labeling problems, their relaxation and equivalent transformations
5. Shor, N.: Nondifferentiable optimization and polynomial problems, p. 394. Kluwer Academic Publisher, Dordrecht (1998)
6. Wainwright, M.J., Jaakkola, T.S., Willsky, A.S.: MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory 51(11), 3697–3717 (2005)
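A Appendix

Proof (Theorem 1).
\[
\begin{aligned}
P^* \ge G(\alpha, \beta)
&= \sum_{t \in T} \sum_{k \in K} \alpha(t, k) \cdot q(t, k)
 + \sum_{tt' \in \tau} \sum_{k \in K} \sum_{k' \in K} \beta((t, k), (t', k')) \cdot g((t, k), (t', k')) \\
&\ge \sum_{t \in T} \sum_{k \in K} \alpha(t, k) \cdot \Big( \max_{l \in K} q(t, l) - \delta \Big)
 + \sum_{tt' \in \tau} \sum_{k \in K} \sum_{k' \in K} \beta((t, k), (t', k')) \cdot \Big( \max_{l \in K,\, l' \in K} g((t, l), (t', l')) - \delta \Big) \\
&= \sum_{t \in T} \max_{l \in K} q(t, l)
 + \sum_{tt' \in \tau} \max_{l \in K,\, l' \in K} g((t, l), (t', l'))
 - \delta \cdot |T| - \delta \cdot |\tau| \\
&= P - \delta \cdot (|T| + |\tau|) \ge G(\alpha^*, \beta^*) - \delta \cdot (|T| + |\tau|).
\end{aligned} \tag{29}
\]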
So, $P^* \ge G(\alpha, \beta) \ge P - \delta \cdot (|T| + |\tau|) \ge G(\alpha^*, \beta^*) - \delta \cdot (|T| + |\tau|)$.

Proof (Theorem 2). Let $P^*$ be the power minimum over all equivalent transformations. Let $\Theta$ be the set of all equivalent problems where the power function $P$ reaches its minimum $P^*$. Let $D((q', g'), (q'', g''))$ be the Euclidean distance between two qualities and $D((q^i, g^i), \Theta)$ be the distance between a quality $(q^i, g^i)$ and the set $\Theta$. From subgradient descent theory it is known that
\[
\lim_{i \to \infty} P(q^i, g^i) = P^*, \qquad \lim_{i \to \infty} D((q^i, g^i), \Theta) = 0. \tag{30}
\]
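This means that for any positive δ there exist a $k$ and qualities $(q^*, g^*) \in \Theta$ such that $D((q^k, g^k), (q^*, g^*)) < \delta/2$. Since the distance between the two qualities is less than $\delta/2$, the qualities of each edge and each vertex also differ by no more than $\delta/2$. This means that if two edges had the same quality in the problem $(q^*, g^*)$, their qualities would not differ by more than δ in the problem $(q^k, g^k)$. Since $(q^*, g^*)$ is the optimum, there exists a relaxed labeling $(\alpha, \beta)$ that satisfies (7). This very same $(\alpha, \beta)$ together with the qualities $(q^k, g^k)$ also satisfies (12).

Proof (Theorem 3). The first inequality in (21) is trivial since $S = \min_{\alpha \in A, \beta \in B} up(\alpha, \beta)$. As for the second inequality the following is true:
\[
up(\alpha, \beta) \cdot S = up(\alpha, \beta) \cdot \min_{\alpha' \in A,\, \beta' \in B} up(\alpha', \beta') = \tag{31}
\]
\[
= up(\alpha, \beta) \cdot \min_{\alpha' \in A,\, \beta' \in B} \sqrt{ \sum_{t \in T} \sum_{t' \in N(t)} \sum_{k \in K} \Big( \alpha'(t, k) - \sum_{k' \in K} \beta'((t, k), (t', k')) \Big)^2 } \ge \tag{32}
\]
\[
\ge \min_{\alpha' \in A,\, \beta' \in B} \sum_{t \in T} \sum_{t' \in N(t)} \sum_{k \in K} \Big( \alpha'(t, k) - \sum_{k' \in K} \beta'((t, k), (t', k')) \Big) \times \Big( \alpha(t, k) - \sum_{k' \in K} \beta((t, k), (t', k')) \Big) = \tag{33}
\]
\[
= \min_{\alpha' \in A} \sum_{t \in T} \sum_{t' \in N(t)} \sum_{k \in K} \alpha'(t, k) \times \Big( \alpha(t, k) - \sum_{k' \in K} \beta((t, k), (t', k')) \Big)
 - \max_{\beta' \in B} \sum_{t \in T} \sum_{t' \in N(t)} \sum_{k \in K} \sum_{k' \in K} \beta'((t, k), (t', k')) \times \Big( \alpha(t, k) - \sum_{k'' \in K} \beta((t, k), (t', k'')) \Big) = \tag{34}
\]
\[
= \sum_{t \in T} \min_{k \in K} \sum_{t' \in N(t)} \Big( \alpha(t, k) - \sum_{k' \in K} \beta((t, k), (t', k')) \Big)
 - \sum_{t \in T} \sum_{t' \in N(t)} \sum_{k \in K} \sum_{k' \in K} \beta((t, k), (t', k')) \cdot \Big( \alpha(t, k) - \sum_{k'' \in K} \beta((t, k), (t', k'')) \Big). \tag{35}
\]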
In (32), up is simply substituted with its definition. The inequality between (32) and (33) comes from the fact that the length of a vector is never less than the length of its projection, and (33) can be seen as projecting the vector $(\alpha', \beta')$ onto the vector $(\alpha, \beta)$. The equality of (33) and (34) simply separates the independent variables in the minimization. The first part of the last equality comes from the fact that the weights α in one pixel sum to 1 and are positive numbers. The second part comes from β being the exact minimum of the upper bound for the currently fixed α.

Proof (Theorem 4). By substituting $up(\alpha, \beta)$ and $\Delta_{tt'}(k')(\alpha, \beta)$ in (20) according to their definitions it can be shown that if $up(\alpha^*, \beta^*) = S$ for some $(\alpha^*, \beta^*)$, then $low(\alpha^*, \beta^*) = S$. Since both functions low and up are continuous in $(\alpha, \beta)$, $\lim_{i \to \infty} low(\alpha^i, \beta^i) = S$.

Proof (Theorem 5). The upper bound function $up(\alpha, \beta)$ is convex and continuously differentiable. The restrictions on the variable groups α and β are orthogonal. Therefore sequential minimization converges to the absolute minimum, and either clause 6 or clause 7 of the algorithm will succeed.
Optimality Bounds for a Variational Relaxation of the Image Partitioning Problem

Jan Lellmann, Frank Lenzen, and Christoph Schnörr

Image and Pattern Analysis Group & HCI, Dept. of Mathematics and Computer Science, University of Heidelberg

Abstract. Variational relaxations can be used to compute approximate minimizers of optimal partitioning and multiclass labeling problems on continuous domains. While the resulting relaxed convex problem can be solved to global optimality, a rounding step is required in order to obtain a discrete solution, which may increase the objective and lead to suboptimal solutions. We analyze a probabilistic rounding method and prove that it allows one to obtain discrete solutions with an a priori upper bound on the objective, ensuring the quality of the result from the viewpoint of optimization. We show that the approach can be interpreted as an approximate, multiclass variant of the coarea formula.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 132–146, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

In a series of papers [1–3], several authors have recently proposed convex relaxations of multiclass labeling problems in a variational framework. We consider the problem formulation
\[
\inf_{u \in \mathcal{C}_E} f(u), \qquad f(u) := \int_\Omega \langle u(x), s(x) \rangle \, dx + \int_\Omega \Psi(Du), \tag{1}
\]
\[
\mathcal{C}_E := \{ u \in \mathrm{BV}(\Omega)^l,\ u(x) \in E := \{e^1, \dots, e^l\} \text{ a.e.} \}, \qquad \Omega = (0, 1)^d,
\]
for finding an optimal labeling function $u$ that is of bounded variation [4]. Here $e^i$ denotes the $i$-th unit vector representing the $i$-th label, $s \in L^1(\Omega)^l$, $s \ge 0$ are the local costs representing the data term, and $\Psi : \mathbb{R}^{d \times l} \to \mathbb{R}_{\ge 0}$ is positively homogeneous, convex and continuous, and defines the regularizer. This formulation covers problems such as color segmentation, denoising, inpainting, depth from stereo and many more; see [5] for the definition of a class of regularizers Ψ relevant to various applications. Problem (1) can also be seen as the problem of finding an optimal partition of Ω into $l$ (not necessarily connected) sets $P_i := u^{-1}(e^i)$. It constitutes a hard problem due to the discrete decision at each point. However, formulation (1) permits a convenient relaxation to a convex problem:
\[
\inf_{u \in \mathcal{C}} f(u), \qquad \mathcal{C} := \{ u \in \mathrm{BV}(\Omega)^l,\ u(x) \in \Delta_l \text{ a.e.} \}, \tag{2}
\]
where $\Delta_l := \{ x \in \mathbb{R}^l \mid x \ge 0,\ \sum_i x_i = 1 \}$ is the convex hull of $E := \{e^1, \dots, e^l\}$, i.e. the $l$-dimensional unit simplex. Problem (2) is convex and can thus be solved to global optimality. However, the minimizer $u^*$ of the relaxed problem may not lie in $\mathcal{C}_E$, i.e. it is not necessarily discrete. In order to obtain a true partition of Ω, some rounding process is thus required to generate a discrete labeling $\bar{u}^*$. This may increase the objective, and lead to a suboptimal solution of the original problem (1). While plausible deterministic methods exist [5], we are not aware of a method that allows one to bound the objective of the obtained discrete solution $\bar{u}^*$ with respect to the objective of the (unknown) optimal discrete solution $u^*_E$ in the spatially continuous setting for general regularizers.

Contribution. In this work, we consider a probabilistic rounding approach and derive a probabilistic bound of the form (see Thm. 2 below)
\[
\mathbb{E} f(\bar{u}^*) \le (1 + \varepsilon) f(u^*_E), \tag{3}
\]
where $\bar{u}^*$ is the solution obtained by applying a custom probabilistic rounding method to the solution $u^*$ of the convex relaxed problem (2), and $u^*_E$ is the solution of the original partitioning problem (1). The approach is based on the work of Kleinberg and Tardos [6], who derive similar bounds in an LP relaxation framework. However, their results are restricted in that they assume a grid discretization of the image domain and make extensive use of the fact that the number of grid points is finite. The bounds derived in Thm. 2 are compatible with their bounds, as well as the ones derived for the graph cut-based α-expansion in [7]. However, our results hold in the spatially continuous setting without assuming a particular problem discretization.

In the continuous setting, a similar bound was announced in [8] for the special case of the uniform metric. That approach is based on a continuous extension of the α-expansion method, which requires solving a sequence of problems. In contrast, our approach only requires solving a single convex problem, and provides valid bounds for a broad class of regularizers [5]. For an overview of generic approaches for solving integer problems using relaxation techniques we also refer to [9]. As these known approaches only apply in finite-dimensional spaces, deriving similar results for functions on continuous domains requires considerable additional mathematical work. Due to space restrictions we will only provide an outline of the proofs, and refer to an upcoming report for the technical details.

Notation. Superscripts $v^i$ usually denote a collection of vectors or elements of a sequence, while subscripts $v_k$ denote vector components. We denote $\mathbb{N} = \{1, 2, \dots\}$, $e = (1, \dots, 1)$. $\|\cdot\|_2$ is the usual Euclidean resp. the Frobenius norm, and $B_r(x)$ denotes the ball of radius $r$ centered at $x$. For a set $S$, we define $1_S(x) = 1$ iff $x \in S$ and $1_S(x) = 0$ otherwise. Regarding measure-theoretic notations and functions of bounded variation we refer to [4]. In particular, we will use the $d$-dimensional Lebesgue measure $\mathcal{L}^d$, the $k$-dimensional Hausdorff measure $\mathcal{H}^k$, the distributional gradient $Du$ and the total variation $\mathrm{TV}(u) = |Du|(\Omega)$. For some $\mathcal{L}^d$-measurable set $E \subseteq \Omega$, we denote its volume $|E| = \mathcal{L}^d(E)$, the measure-theoretic interior $(E)^1$ and exterior $(E)^0$, the reduced boundary $\mathcal{F}E$ with generalized inner normal $\nu_E$, and the perimeter $\mathrm{Per}(E) = \mathrm{TV}(1_E)$. $Du \llcorner E$ is the restriction of $Du$ to $E$, and $\Psi(Du)$ denotes the measure $\Psi(Du/|Du|)\,|Du|$, i.e. Ψ transforms the density of the measure $Du$ with respect to its total variation measure $|Du|$. For $u \in \mathrm{BV}(\Omega)^l$ we denote by $\tilde{u}$ its approximate limit, and by $u^+_{\mathcal{F}E}$ and $u^-_{\mathcal{F}E}$ its one-sided limits [4, Thm. 3.77] on the reduced boundary of a set of finite perimeter $E$, i.e. $\mathrm{Per}(E) < \infty$.
2 Probabilistic Rounding and the Coarea Formula

As a motivation for the following sections, we first provide a probabilistic interpretation of a tool often used in geometric measure theory, the coarea formula (cf. [4]). Assume that $u' \in \mathrm{BV}(\Omega)$ and $u'(x) \in [0, 1]$ for a.e. $x \in \Omega$, then the coarea formula states that its total variation can be represented by summing the boundary lengths of its superlevel sets:
\[
\mathrm{TV}(u') = \int_0^1 \mathrm{TV}(1_{\{u' > \alpha\}}) \, d\alpha. \tag{4}
\]
The coarea formula provides a connection between problem (1) and the relaxation (2) in the two-class case, where $E = \{e^1, e^2\}$, $u \in \mathcal{C}_E$ and therefore $u_1 = 1 - u_2$: As noted in [5], $\mathrm{TV}(u) = \|e^1 - e^2\|_2 \, \mathrm{TV}(u_1) = \sqrt{2}\, \mathrm{TV}(u_1)$, therefore the coarea formula (4) can be rewritten as
\[
\mathrm{TV}(u) = \sqrt{2} \int_0^1 \mathrm{TV}(1_{\{u_1 > \alpha\}}) \, d\alpha = \int_0^1 \mathrm{TV}\big( e^1 1_{\{u_1 > \alpha\}} + e^2 1_{\{u_1 \le \alpha\}} \big) \, d\alpha \tag{5}
\]
\[
= \int_0^1 \mathrm{TV}(\bar{u}_\alpha) \, d\alpha, \qquad \bar{u}_\alpha := e^1 1_{\{u_1 > \alpha\}} + e^2 1_{\{u_1 \le \alpha\}}. \tag{6}
\]
Consequently, the total variation of $u$ can be computed by taking the mean over the total variations of a set of discrete labelings $\{\bar{u}_\alpha \in \mathcal{C}_E \mid \alpha \in [0, 1]\}$, obtained by rounding $u$ at different thresholds α. We now adopt a probabilistic view of (6): We regard the mapping
\[
(u, \alpha) \in \mathcal{C} \times [0, 1] \mapsto \bar{u}_\alpha \in \mathcal{C}_E \quad (\text{for a.e. } \alpha \in [0, 1]) \tag{7}
\]
as a parameterized, deterministic rounding algorithm that depends on $u$ and on an additional parameter α. From this, we obtain a probabilistic (randomized) rounding algorithm by assuming α to be a uniformly distributed random variable. Under these assumptions, the coarea formula (6) can be written as
\[
\mathrm{TV}(u) = \mathbb{E}_\alpha \mathrm{TV}(\bar{u}_\alpha). \tag{8}
\]
This has the probabilistic interpretation that applying the probabilistic rounding to an (arbitrary, but fixed) $u$ does not change the objective in a probabilistic sense, i.e. in the mean. It can be shown that this property extends to the full functional $f$ in (2). A well-known implication is that if $u = u^*$, i.e. $u$ minimizes (2), then almost every $\bar{u}_\alpha = \bar{u}^*_\alpha$ is a minimizer of (1) [10].
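The threshold-averaging identity can be checked numerically on a pixel grid. The sketch below is an illustration under a simplifying assumption: it replaces the continuous total variation by an anisotropic discrete one (sum of absolute forward differences), for which the coarea formula holds exactly. The function names are hypothetical.

```python
import numpy as np

def tv_aniso(u):
    """Anisotropic discrete TV: sum of absolute forward differences on a 2-D grid."""
    u = np.asarray(u, dtype=float)
    return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

def coarea_integral(u):
    """Integrate TV(1_{u > alpha}) over alpha in [0, 1].

    The integrand is piecewise constant between consecutive values of u,
    so evaluating it at interval midpoints gives the exact integral."""
    breaks = np.unique(np.concatenate(([0.0, 1.0], np.ravel(u))))
    total = 0.0
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        alpha = 0.5 * (lo + hi)
        total += (hi - lo) * tv_aniso(u > alpha)
    return total

rng = np.random.default_rng(0)
u = rng.random((5, 6))            # values in [0, 1)
gap = abs(tv_aniso(u) - coarea_integral(u))
```

Here `gap` is zero up to floating-point rounding: averaging the total variation of the thresholded images over a uniformly distributed threshold reproduces the total variation of the original image, which is the discrete counterpart of (8).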
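Algorithm 1. Continuous Probabilistic Rounding
Require: $u \in \mathcal{C}$
1: $u^0 \leftarrow u$, $U^0 \leftarrow \Omega$, $c^0 \leftarrow (1, \dots, 1) \in \mathbb{R}^l$.
2: for $k = 1, 2, \dots$ do
3:   Randomly choose $\gamma^k := (i^k, \alpha^k)$ uniformly from $\{1, \dots, l\} \times [0, 1]$
4:   $M^k \leftarrow U^{k-1} \cap \{x \in \Omega \mid u^{k-1}_{i^k}(x) > \alpha^k\}$
5:   $u^k \leftarrow e^{i^k} 1_{M^k} + u^{k-1} 1_{\Omega \setminus M^k}$
6:   $U^k \leftarrow U^{k-1} \setminus M^k$
7:   $c^k_j \leftarrow \min\{c^{k-1}_j, \alpha^k\}$ if $j = i^k$, and $c^k_j \leftarrow c^{k-1}_j$ otherwise.
8: end for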
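For intuition, Alg. 1 can be transcribed to a finite set of sample points. The following sketch is an assumption-laden discretization, not the continuous algorithm itself: the domain Ω is replaced by the rows of an array, labels are stored as indices instead of unit vectors, and the loop stops as soon as every point has been assigned (a `max_iter` guard is added for safety).

```python
import numpy as np

def probabilistic_rounding(u, rng, max_iter=100_000):
    """Discrete sketch of Alg. 1.

    u   : array of shape (num_points, l), each row on the unit simplex
    rng : numpy.random.Generator
    Returns (labels, k): label index per point and the number of iterations used.
    """
    u = np.asarray(u, dtype=float)
    num_points, l = u.shape
    labels = np.full(num_points, -1)              # -1 marks points still in U^k
    unassigned = np.ones(num_points, dtype=bool)  # indicator of U^k
    c = np.ones(l)                                # lowest threshold seen per label
    for k in range(1, max_iter + 1):
        i = rng.integers(l)                       # step 3: label i^k
        alpha = rng.random()                      #         threshold alpha^k
        m = unassigned & (u[:, i] > alpha)        # step 4: M^k
        labels[m] = i                             # step 5: assign e^{i^k} on M^k
        unassigned &= ~m                          # step 6: U^k = U^{k-1} \ M^k
        c[i] = min(c[i], alpha)                   # step 7
        if not unassigned.any():                  # U^k is empty: sequence is stationary
            return labels, k
    raise RuntimeError("not all points were assigned within max_iter iterations")
```

On an input that is already discrete (one-hot rows), a point can only ever receive its own label, so the output reproduces the input labeling regardless of the random draws.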
Unfortunately, property (8) is intrinsically restricted to the two-class case with TV regularizer. In the general case, one would hope to obtain a relation
\[
f(u) = \int_\Gamma f(\bar{u}_\gamma) \, d\mu(\gamma) = \mathbb{E}_\gamma f(\bar{u}_\gamma) \tag{9}
\]
for some probability space $(\Gamma, \mu)$. For $l = 2$ and $\Psi = \|\cdot\|_2$, (8) shows that (9) holds with $\gamma = \alpha$, $\Gamma = [0, 1]$, μ the Lebesgue measure, and $\bar{u}_\gamma : \mathcal{C} \times \Gamma \to \mathcal{C}_E$ as defined in (7). In the multiclass case, the difficulty lies in providing a suitable probability space $(\Gamma, \mu)$ and parameterized rounding step $(u, \gamma) \mapsto \bar{u}_\gamma$. Unfortunately, obtaining a relation such as (8) for the full functional (1) is unlikely, as it would mean that solutions to the (after discretization) NP-hard problem (1) could be obtained by solving the convex relaxation (2) and subsequent rounding. In this work we will derive a bound of the form
\[
(1 + \varepsilon) f(u) \ge \int_\Gamma f(\bar{u}_\gamma) \, d\mu(\gamma) = \mathbb{E}_\gamma f(\bar{u}_\gamma). \tag{10}
\]
This can be seen as an approximate variant of the coarea formula. While (10) is not sufficient to provide a bound on $f(\bar{u}_\gamma)$ for particular γ, it permits a probabilistic bound in the sense of (3): For any minimizer $u^*$ of the relaxed problem (2),
\[
\mathbb{E}_\gamma f(\bar{u}^*_\gamma) \le (1 + \varepsilon) f(u^*) \le (1 + \varepsilon) f(u^*_E) \tag{11}
\]
holds, i.e. the ratio between the objective of the rounded relaxed solution and the optimal discrete solution is bounded, in a probabilistic sense, by $(1 + \varepsilon)$. In the following sections, we will construct a suitable parameterized rounding method and probability space in order to obtain an approximate coarea formula of the form (10).
3 Probabilistic Rounding for Multiclass Image Partitions

We consider the probabilistic rounding approach based on [6] as defined in Alg. 1. The algorithm proceeds in a number of phases. At each iteration, a label and a threshold $(i^k, \alpha^k) \in \Gamma' := \{1, \dots, l\} \times [0, 1]$ are randomly chosen (step 3), and label $i^k$ is assigned to all yet unassigned points where $u^{k-1}_{i^k} > \alpha^k$ holds (step 5). In contrast to the two-class case considered above, the randomness is provided by a sequence $(\gamma^k)$ of uniformly distributed random variables, i.e. $\Gamma = (\Gamma')^{\mathbb{N}}$. After iteration $k$, all points in the set $U^k \subseteq \Omega$ have not yet been assigned a label, while all points in $\Omega \setminus U^k$ have been assigned a discrete label in iteration $k$ or in a previous iteration. Iteration $k + 1$ potentially modifies points only in the set $U^k$. The variable $c^k_j$ stores the lowest threshold α chosen for label $j$ up to and including iteration $k$.

For fixed input $u$, the algorithm can be seen as mapping a sequence of parameters (or instances of random variables) $\gamma = (\gamma^k) \in \Gamma$ into a sequence of states $(u^k_\gamma)_{k=1}^\infty$, $(U^k_\gamma)_{k=1}^\infty$ and $(c^k_\gamma)_{k=1}^\infty$. We drop the subscript γ if it does not create ambiguities. In order to define the parameterized rounding step $(u, \gamma) \mapsto \bar{u}_\gamma$, we observe that, once $U^{k'}_\gamma = \emptyset$ occurs for some $k'$, the sequence $(u^k_\gamma)$ becomes stationary at $u^{k'}_\gamma$. In this case the algorithm may be terminated, with output $\bar{u}_\gamma := u^{k'}_\gamma$:

Definition 1. Let $u \in \mathrm{BV}(\Omega)^l$, and denote $\Gamma := (\Gamma')^{\mathbb{N}}$. For some $\gamma \in \Gamma$, if $U^{k'}_\gamma = \emptyset$ for some $k' \in \mathbb{N}$, we denote $\bar{u}_\gamma := u^{k'}_\gamma$. For a functional $f : \mathrm{BV}(\Omega)^l \to \mathbb{R}$, define $f(\bar{u}_{(\cdot)}) : \Gamma \to \mathbb{R} \cup \{+\infty\}$,
\[
\gamma \in \Gamma \mapsto f(\bar{u}_\gamma) := \begin{cases} f(u^{k'}_\gamma), & U^{k'}_\gamma = \emptyset \text{ and } u^{k'}_\gamma \in \mathrm{BV}(\Omega)^l, \\ +\infty, & \text{otherwise.} \end{cases} \tag{12}
\]
Denote by $f(\bar{u})$ the corresponding random variable induced by assuming γ to be uniformly distributed on Γ.

Note that $f(\bar{u}_\gamma)$ is well-defined: if $U^{k'}_\gamma = \emptyset$ for some $(\gamma, k')$ then $u^{k''}_\gamma = u^{k'}_\gamma$ for all $k'' \ge k'$. In the remainder of this work, we will show that the expectation of $f(\bar{u}_\gamma)$ over all sequences γ can be bounded according to
\[
\mathbb{E} f(\bar{u}) = \mathbb{E}_\gamma f(\bar{u}_\gamma) \le (1 + \varepsilon) f(u) \tag{13}
\]
for some $\varepsilon \ge 0$, cf. (10). Consequently, the rounding process may only increase the average objective in a controlled way. We first show that almost surely Alg. 1 generates (in a finite number of iterations) a discrete labeling function $\bar{u}_\gamma \in \mathcal{C}_E$.

Theorem 1. Let $u \in \mathrm{BV}(\Omega)^l$ and $f(\bar{u})$ as in Def. 1. Then
\[
\mathbb{P}(f(\bar{u}) < \infty) = 1. \tag{14}
\]
Proof. Due to space restrictions we can only provide a sketch of the proof. The first part is to show that $(u^k)$ becomes stationary almost surely, i.e.
\[
\mathbb{P}(\exists k \in \mathbb{N} : U^k = \emptyset) = 1. \tag{15}
\]
Define $n^k_j \in \mathbb{N}_0$ as the number of $k' \in \{1, \dots, k\}$ s.t. $i^{k'} = j$. Then the vector $n^k$ is multinomially distributed, $n^k \sim \mathrm{Multinomial}(k; 1/l, \dots, 1/l)$. Accordingly, the probability that all $c^k_j$, $j = 1, \dots, l$ are smaller than $1/l$ is
\[
\mathbb{P}(c^k < l^{-1} e) = \sum_{n^k_1 + \dots + n^k_l = k} \frac{k!}{n^k_1! \cdot \ldots \cdot n^k_l!} \left( \frac{1}{l} \right)^k \prod_{j=1}^{l} \left( 1 - \left( 1 - \frac{1}{l} \right)^{n^k_j} \right), \tag{16}
\]
which can be shown to converge to 1 for $k \to \infty$. Since $u(x) \in \Delta_l$, the condition $c^k < l^{-1} e$ implies $U^k = \emptyset$. Therefore (15) follows from
\[
1 \ge \mathbb{P}(\exists k' \in \mathbb{N} : e^\top c^{k'} < 1) \ge \mathbb{P}(c^k < l^{-1} e) \xrightarrow{k \to \infty} 1. \tag{17}
\]
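The convergence of the probability in (16) can be observed by evaluating the sum directly for small l and k. The sketch below enumerates all count vectors; the function name is hypothetical, and the enumeration is only practical for small inputs.

```python
from math import factorial

def prob_all_thresholds_small(l, k):
    """Evaluate the sum in (16): probability that every c_j^k is below 1/l."""
    def compositions(total, parts):
        # All tuples of `parts` non-negative integers summing to `total`.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    p = 0.0
    for counts in compositions(k, l):
        weight = factorial(k) / l**k                  # multinomial probability ...
        hit = 1.0
        for n_j in counts:
            weight /= factorial(n_j)                  # ... of this count vector
            hit *= 1.0 - (1.0 - 1.0 / l) ** n_j       # P(min of n_j uniforms < 1/l)
        p += weight * hit
    return p
```

For l = 2 the sum collapses to the closed form 1 − 2(3/4)^k + 2^{−k}, which gives an independent check of the enumeration and makes the convergence to 1 explicit in that case.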
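The second part of the proof consists in showing that almost surely all iterates $u^k$ are contained in $\mathrm{BV}(\Omega)^l$, for which it suffices to show that
\[
\mathbb{P}\big( u^k \in \mathrm{BV}(\Omega)^l \ \forall k \in \mathbb{N} \big) = 1. \tag{18}
\]
This can be seen using induction to show that $u^k \in \mathrm{BV}(\Omega)^l$ and $\mathrm{Per}(U^k) < \infty$ almost surely for all $k \in \mathbb{N}_0$. For $u^{k-1} \in \mathrm{BV}(\Omega)^l$, by [4, Thm. 3.40] it holds that $\mathrm{Per}(\{x \in \Omega \mid u^{k-1}_{i^k}(x) \le \alpha^k\}) < \infty$ for $\mathcal{L}^1$-a.e. $\alpha^k \in [0, 1]$ (and all $i^k$), therefore
\[
\mathbb{P}\big( \mathrm{Per}(U^k) < \infty \mid \mathrm{Per}(U^{k-1}) < \infty \big) = 1. \tag{19}
\]
The statement for $u^k$ follows, since for the same reason $\mathrm{Per}(M^k) < \infty$ almost surely (cf. Alg. 1 for the definition of $M^k$), and [4, Thm. 3.84] ensures
\[
\mathrm{Per}(M^k) < \infty,\ u^{k-1} \in \mathrm{BV}(\Omega)^l \ \Rightarrow\ u^k = e^{i^k} 1_{M^k} + u^{k-1} 1_{\Omega \setminus M^k} \in \mathrm{BV}(\Omega)^l. \tag{20}
\]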
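4 A Probabilistic A Priori Optimality Bound

In the previous sections we have shown that the rounding process induced by Alg. 1 is well-defined in the sense that it returns a discrete solution $\bar{u}_\gamma \in \mathrm{BV}(\Omega)^l$ almost surely. We now return to proving an upper bound for the expectation of $f(\bar{u})$ as in the approximate coarea formula (3). We first show that the expectation of the linear part (data term) of $f$ is invariant under the rounding process.

Proposition 1. The sequence $(u^k)$ generated by Alg. 1 satisfies
\[
\mathbb{E}(\langle u^k, s \rangle) = \langle u, s \rangle \quad \forall k \in \mathbb{N}. \tag{21}
\]
Proof. In Alg. 1, instead of step 5 we consider the update
\[
u^k \leftarrow e^{i^k} 1_{\{u^{k-1}_{i^k} > \alpha^k\}} + u^{k-1} 1_{\{u^{k-1}_{i^k} \le \alpha^k\}}, \tag{22}
\]
which yields exactly the same iterates. Denote $\gamma' := (\gamma^1, \dots, \gamma^{k-1})$ and $u^{\gamma'} := u^{k-1}_{\gamma'}$. We use an induction argument on $k$: For $k \ge 1$,
\[
\begin{aligned}
\mathbb{E}_\gamma \langle u^k_\gamma, s \rangle
&= \mathbb{E}_{\gamma'} \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \sum_{j=1}^{l} \Big\langle s_j, \big( e^i 1_{\{u^{\gamma'}_i > \alpha\}} + u^{\gamma'} 1_{\{u^{\gamma'}_i \le \alpha\}} \big)_j \Big\rangle \, d\alpha \\
&= \mathbb{E}_{\gamma'} \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \Big( \big\langle s_i, 1_{\{u^{\gamma'}_i > \alpha\}} \big\rangle + \big\langle \big( 1 - 1_{\{u^{\gamma'}_i > \alpha\}} \big) u^{\gamma'}, s \big\rangle \Big) \, d\alpha.
\end{aligned} \tag{23}
\]
Now we take into account the relation [4, Prop. 1.78],
\[
\int_0^1 \int_\Omega s_i(x) \cdot 1_{\{u_i > \alpha\}}(x) \, dx \, d\alpha = \int_\Omega s_i(x) u_i(x) \, dx = \langle u_i, s_i \rangle. \tag{24}
\]
This leads to
\[
\mathbb{E}_\gamma \langle u^k_\gamma, s \rangle
= \mathbb{E}_{\gamma'} \frac{1}{l} \sum_{i=1}^{l} \Big( \langle s_i, u^{\gamma'}_i \rangle + \langle u^{\gamma'}, s \rangle - \langle u^{\gamma'}_i u^{\gamma'}, s \rangle \Big)
\overset{u^{\gamma'}(x) \in \Delta_l}{=} \mathbb{E}_{\gamma'} \langle u^{\gamma'}, s \rangle
= \mathbb{E}_\gamma \langle u^{k-1}_\gamma, s \rangle. \tag{25}
\]
Since $\langle u^0, s \rangle = \langle u, s \rangle$, the assertion follows by induction.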
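The cancellation in (25) already happens pointwise and can be checked for a single simplex vector. The sketch below computes the exact expectation of one application of update (22) at one point: label i is drawn with probability 1/l, and the point is overwritten by the i-th unit vector with probability u_i. The function name is hypothetical.

```python
import numpy as np

def expected_after_one_step(u):
    """Exact expectation of update (22) at one point with simplex vector u."""
    u = np.asarray(u, dtype=float)
    l = u.size
    unit = np.eye(l)
    # With probability u[i] the point becomes e^i, otherwise it keeps u.
    return sum(u[i] * unit[i] + (1.0 - u[i]) * u for i in range(l)) / l

u = np.array([0.2, 0.5, 0.3])
expected = expected_after_one_step(u)
```

The result equals `u` up to rounding, because the terms u_i e^i sum to u and the weights (1 − u_i) sum to l − 1. This only works because the components of u sum to one, which is where the simplex constraint enters (25).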
Bounding the regularizer is more involved. For $\gamma^k = (i^k, \alpha^k)$, define
\[
U_{\gamma^k} := \{x \in \Omega \mid u_{i^k}(x) \le \alpha^k\}, \qquad V_{\gamma^k} := (U_{\gamma^k})^1, \qquad V^k := (U^k)^1. \tag{26}
\]
As the measure-theoretic interior is invariant under $\mathcal{L}^d$-negligible modifications, given some fixed sequence γ the sequence $(V^k)$ is invariant under $\mathcal{L}^d$-negligible modifications of $u = u^0$, i.e. it is uniquely defined when viewing $u$ as an element of $L^1(\Omega)^l$. We use (without proof) the fact that the measure-theoretic interior satisfies $(E \cap F)^1 = (E)^1 \cap (F)^1$ for any $\mathcal{L}^d$-measurable sets $E, F$. Some calculations yield
\[
U^k = U_{\gamma^1} \cap \dots \cap U_{\gamma^k}, \qquad V^k = V_{\gamma^1} \cap \dots \cap V_{\gamma^k} \quad (k \ge 1), \tag{27}
\]
\[
\begin{aligned}
U^{k-1} \setminus U^k &= U_{\gamma^1} \cap \big( (U_{\gamma^2} \cap \dots \cap U_{\gamma^{k-1}}) \setminus (U_{\gamma^2} \cap \dots \cap U_{\gamma^k}) \big) \quad (k \ge 2), \\
V^{k-1} \setminus V^k &= V_{\gamma^1} \cap \big( (V_{\gamma^2} \cap \dots \cap V_{\gamma^{k-1}}) \setminus (V_{\gamma^2} \cap \dots \cap V_{\gamma^k}) \big) \quad (k \ge 2),
\end{aligned} \tag{28}
\]
\[
\Omega \setminus V^k = \bigcup_{k'=1}^{k} \big( V^{k'-1} \setminus V^{k'} \big) \quad (k \ge 1). \tag{29}
\]
Moreover (again without proof), since $V^k$ is the measure-theoretic interior of $U^k$, both sets are equal up to an $\mathcal{L}^d$-negligible set. We now prepare for an induction argument on the expectation of the regularizing term when restricted to the sets $V^{k-1} \setminus V^k$. We first state an intermediate result required for the proofs.
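Proposition 2. Let $u, v \in \mathcal{C}$, $\Psi \le \rho_u \|\cdot\|_2$, and $E \subseteq \Omega$ s.t. $\mathrm{Per}(E) < \infty$. Then $w := u 1_E + v 1_{\Omega \setminus E} \in \mathrm{BV}(\Omega)^l$,
\[
Dw = Du \llcorner (E)^1 + Dv \llcorner (E)^0 + \nu_E \big( u^+_{\mathcal{F}E} - v^-_{\mathcal{F}E} \big)^\top \mathcal{H}^{d-1} \llcorner (\mathcal{F}E \cap \Omega), \tag{30}
\]
and, for some Borel set $A \subseteq \Omega$,
\[
\int_A \Psi(Dw) \le \sqrt{2} \rho_u \mathrm{Per}(E) + \int_{A \cap (E)^1} \Psi(Du) + \int_{A \cap (E)^0} \Psi(Dv). \tag{31}
\]
Proof. We omit the details of the proof due to space restrictions. It relies on [4, Thm. 3.84], [4, Prop. 2.37] and the fact that
\[
\begin{aligned}
\int_{A \cap \mathcal{F}E \cap \Omega} \Psi(Dw)
&= \int_{A \cap \mathcal{F}E \cap \Omega} \Psi\big( \nu_E ( w^+_{\mathcal{F}E}(x) - w^-_{\mathcal{F}E}(x) )^\top \big) \, d\mathcal{H}^{d-1} \\
&\le \rho_u \int_{A \cap \mathcal{F}E \cap \Omega} \big\| \nu_E ( w^+_{\mathcal{F}E}(x) - w^-_{\mathcal{F}E}(x) )^\top \big\|_2 \, d\mathcal{H}^{d-1} \\
&\le \sqrt{2} \rho_u \mathrm{Per}(E).
\end{aligned} \tag{32}
\]
The following proposition provides the initial step for $k = 1$.

Proposition 3. Let $\rho_l \|\cdot\|_2 \le \Psi \le \rho_u \|\cdot\|_2$. Then
\[
\mathbb{E} \int_{V^0 \setminus V^1} \Psi(D\bar{u}) \le \frac{2}{l} \frac{\rho_u}{\rho_l} \int_\Omega \Psi(Du). \tag{33}
\]
Proof. Denote $(i, \alpha) = \gamma^1$. The first iteration assigns the label $e^i$ exactly on $\Omega \setminus U_{(i,\alpha)}$. Since $1_{U_{(i,\alpha)}} = 1_{V_{(i,\alpha)}}$ $\mathcal{L}^d$-a.e., we have
\[
\bar{u}_\gamma = 1_{\Omega \setminus V_{(i,\alpha)}} e^i + 1_{V_{(i,\alpha)}} \bar{u}_\gamma \quad \mathcal{L}^d\text{-a.e.} \tag{34}
\]
Therefore, since $V^0 = (U^0)^1 = (\Omega)^1 = \Omega$,
\[
\int_{V^0 \setminus V^1} \Psi(D\bar{u}_\gamma) = \int_{\Omega \setminus V_{(i,\alpha)}} \Psi(D\bar{u}_\gamma) = \int_{\Omega \setminus V_{(i,\alpha)}} \Psi\big( D( 1_{\Omega \setminus V_{(i,\alpha)}} e^i + 1_{V_{(i,\alpha)}} \bar{u}_\gamma ) \big).
\]
Since $u \in \mathrm{BV}(\Omega)^l$, we know that $\mathrm{Per}(V_{(i,\alpha)}) < \infty$ holds for $\mathcal{L}^1$-a.e. α and any $i$ [4, Thm. 3.40]. Therefore we conclude from Prop. 2 that (for $\mathcal{L}^1$-a.e. α),
\[
\int_{\Omega \setminus V_{(i,\alpha)}} \Psi(D\bar{u}_\gamma) \le \rho_u \sqrt{2}\, \mathrm{Per}(V_{(i,\alpha)}) + \int_{(\Omega \setminus V_{(i,\alpha)}) \cap (\Omega \setminus V_{(i,\alpha)})^1} \Psi(De^i) + \int_{(\Omega \setminus V_{(i,\alpha)}) \cap (\Omega \setminus V_{(i,\alpha)})^0} \Psi(D\bar{u}_\gamma). \tag{35}
\]
Both of the integrals are zero, since $De^i = 0$ and $(\Omega \setminus V_{(i,\alpha)})^0 = (V_{(i,\alpha)})^1 = V_{(i,\alpha)}$, therefore $\int_{\Omega \setminus V_{(i,\alpha)}} \Psi(D\bar{u}_\gamma) \le \rho_u \sqrt{2}\, \mathrm{Per}(V_{(i,\alpha)})$. This implies
\[
\mathbb{E}_\gamma \int_{\Omega \setminus V_{(i,\alpha)}} \Psi(D\bar{u}_\gamma) \le \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \rho_u \sqrt{2}\, \mathrm{Per}(V_{(i,\alpha)}) \, d\alpha. \tag{36}
\]
Also, $\mathrm{Per}(V_{(i,\alpha)}) = \mathrm{Per}(U_{(i,\alpha)})$ since the perimeter is invariant under $\mathcal{L}^d$-negligible modifications. The assertion then follows using the coarea formula [4, Thm. 3.40]:
\[
\begin{aligned}
\mathbb{E}_\gamma \int_{V^0 \setminus V^1} \Psi(D\bar{u}_\gamma)
&\le \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \rho_u \sqrt{2}\, \mathrm{Per}(U_{(i,\alpha)}) \, d\alpha
\overset{\text{coarea}}{=} \frac{\sqrt{2}}{l} \rho_u \sum_{i=1}^{l} \mathrm{TV}(u_i) \\
&\le \frac{2}{l} \rho_u \int_\Omega \|Du\|_2
\le \frac{2}{l} \frac{\rho_u}{\rho_l} \int_\Omega \Psi(Du).
\end{aligned} \tag{37}
\]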
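We now take care of the induction step for the regularizer.

Proposition 4. Let $\Psi \le \rho_u \|\cdot\|_2$. Then, for any $k \ge 2$,
\[
F := \mathbb{E} \int_{V^{k-1} \setminus V^k} \Psi(D\bar{u}) \le \frac{l - 1}{l}\, \mathbb{E} \int_{V^{k-2} \setminus V^{k-1}} \Psi(D\bar{u}). \tag{38}
\]
Proof. Define the shifted sequence $\gamma' = (\gamma'^k)_{k=1}^\infty$ by $\gamma'^k := \gamma^{k+1}$, and let
\[
W_{\gamma'} := V^{k-2}_{\gamma'} \setminus V^{k-1}_{\gamma'} = (V_{\gamma^2} \cap \dots \cap V_{\gamma^{k-1}}) \setminus (V_{\gamma^2} \cap \dots \cap V_{\gamma^k}). \tag{39}
\]
By Prop. 1 we may assume that, under the expectation, $\bar{u}_\gamma$ exists and is an element of $\mathrm{BV}(\Omega)^l$. We denote $\gamma^1 = (i, \alpha)$, then $V^{k-1} \setminus V^k = V_{(i,\alpha)} \cap W_{\gamma'}$ due to (28), and
\[
F = \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \mathbb{E}_{\gamma'} \int_{V_{(i,\alpha)} \cap W_{\gamma'}} \Psi(D\bar{u}_{((i,\alpha),\gamma')}) \, d\alpha. \tag{40}
\]
We now use (without proof) the fact that if two functions $v, w$ in $\mathrm{BV}(\Omega)$ coincide (in $L^1$, i.e. $\mathcal{L}^d$-a.e.) on a set $E$ with $\mathrm{Per}(E) < \infty$, then the measures $\Psi(Dv) \llcorner (E)^1$ and $\Psi(Dw) \llcorner (E)^1$ coincide. In particular, since in the first iteration of the algorithm no points in $U_{(i,\alpha)}$ are assigned a label, $\bar{u}_{((i,\alpha),\gamma')} = \bar{u}_{\gamma'}$ holds on $U_{(i,\alpha)}$, and therefore $\mathcal{L}^d$-a.e. on $V_{(i,\alpha)}$. Therefore we may substitute $D\bar{u}_{((i,\alpha),\gamma')}$ by $D\bar{u}_{\gamma'}$ in (40):
\[
F = \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \mathbb{E}_{\gamma'} \int_{W_{\gamma'}} 1_{V_{(i,\alpha)}} \Psi(D\bar{u}_{\gamma'}) \, d\alpha. \tag{41}
\]
By definition of the measure-theoretic interior, $1_{V_{(i,\alpha)}}$ is bounded from above by the density function of $U_{(i,\alpha)}$, $\Theta_{U_{(i,\alpha)}}(x) := \lim_{\delta \searrow 0} |B_\delta(x) \cap U_{(i,\alpha)}| / |B_\delta(x)|$ [4, Def. 2.55], which exists $\mathcal{H}^{d-1}$-a.e. on Ω by [4, Thm. 3.61]. Therefore, denoting by $B_\delta(\cdot)$ the mapping $x \in \Omega \mapsto B_\delta(x)$,
\[
F \le \frac{1}{l} \sum_{i=1}^{l} \int_0^1 \mathbb{E}_{\gamma'} \int_{W_{\gamma'}} \Big( \lim_{\delta \searrow 0} \frac{|B_\delta(\cdot) \cap U_{(i,\alpha)}|}{|B_\delta(\cdot)|} \Big) \Psi(D\bar{u}_{\gamma'}) \, d\alpha. \tag{42}
\]
Rearranging the integrals and the limit, which can be justified by dominated convergence using $\Psi \le \rho_u \|\cdot\|_2$ and $\mathrm{TV}(\bar{u}_{\gamma'}) < \infty$ almost surely, we get
\[
\begin{aligned}
F &\le \frac{1}{l}\, \mathbb{E}_{\gamma'} \lim_{\delta \searrow 0} \int_{W_{\gamma'}} \Big( \sum_{i=1}^{l} \int_0^1 \frac{|B_\delta(\cdot) \cap U_{(i,\alpha)}|}{|B_\delta(\cdot)|} \, d\alpha \Big) \Psi(D\bar{u}_{\gamma'}) \\
&= \frac{1}{l}\, \mathbb{E}_{\gamma'} \lim_{\delta \searrow 0} \int_{W_{\gamma'}} \frac{1}{|B_\delta(\cdot)|} \Big( \sum_{i=1}^{l} \int_0^1 \int_{B_\delta(\cdot)} 1_{\{u_i(y) \le \alpha\}} \, dy \, d\alpha \Big) \Psi(D\bar{u}_{\gamma'}).
\end{aligned} \tag{43}
\]
We again apply [4, Prop. 1.78] to the two innermost integrals, which leads to
\[
F \le \frac{1}{l}\, \mathbb{E}_{\gamma'} \lim_{\delta \searrow 0} \int_{W_{\gamma'}} \frac{1}{|B_\delta(\cdot)|} \Big( \sum_{i=1}^{l} \int_{B_\delta(\cdot)} (1 - u_i(y)) \, dy \Big) \Psi(D\bar{u}_{\gamma'}). \tag{44}
\]
Using the fact that $u(y) \in \Delta_l$, this collapses to
\[
F \le \frac{l - 1}{l}\, \mathbb{E}_{\gamma'} \int_{W_{\gamma'}} \Psi(D\bar{u}_{\gamma'}) = \frac{l - 1}{l}\, \mathbb{E}_{\gamma'} \int_{V^{k-2}_{\gamma'} \setminus V^{k-1}_{\gamma'}} \Psi(D\bar{u}_{\gamma'}). \tag{45}
\]
Reversing the index shift, i.e. renaming $\gamma'$ to γ under the expectation, concludes the proof:
\[
F \le \frac{l - 1}{l}\, \mathbb{E}_\gamma \int_{V^{k-2}_\gamma \setminus V^{k-1}_\gamma} \Psi(D\bar{u}_\gamma). \tag{46}
\]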
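The following theorem is the main result of this work, and provides an approximate coarea formula as in (10).

Theorem 2. Let $s : \Omega \to [0, \infty)^l$ s.t. $s \in L^1(\Omega)^l$, $\Psi : \mathbb{R}^{d \times l} \to \mathbb{R}_{\ge 0}$ positively homogeneous, convex and continuous with $\rho_l \|z\|_2 \le \Psi(z) \le \rho_u \|z\|_2$ $\forall z \in \mathbb{R}^{d \times l}$, and $u \in \mathcal{C}$. Then Alg. 1 generates a discrete labeling $\bar{u} \in \mathcal{C}_E$ almost surely, and
\[
\mathbb{E} f(\bar{u}) \le 2 \frac{\rho_u}{\rho_l} f(u). \tag{47}
\]
Proof. The first part follows from Thm. 1. Therefore there almost surely exists $k' := k'(\gamma) \ge 1$ s.t. $U^{k'} = \emptyset$ and $\bar{u}_\gamma = u^{k'}_\gamma$. The stationarity implies
\[
\langle \bar{u}_\gamma, s \rangle = \langle u^{k'}_\gamma, s \rangle = \lim_{k \to \infty} \langle u^k_\gamma, s \rangle \quad \text{and} \quad \Omega = \bigcup_{k=1}^{\infty} \big( V^{k-1} \setminus V^k \big) \tag{48}
\]
almost surely (cf. (29)). Thus
\[
\mathbb{E}_\gamma f(\bar{u}_\gamma) = \mathbb{E}_\gamma \Big( \lim_{k \to \infty} \langle u^k_\gamma, s \rangle \Big) + \mathbb{E}_\gamma \sum_{k=1}^{\infty} \int_{V^{k-1} \setminus V^k} \Psi(D\bar{u}_\gamma) \tag{49}
\]
\[
= \lim_{k \to \infty} \mathbb{E}_\gamma \langle u^k_\gamma, s \rangle + \sum_{k=1}^{\infty} \mathbb{E}_\gamma \int_{V^{k-1} \setminus V^k} \Psi(D\bar{u}_\gamma). \tag{50}
\]
The first term is equal to $\langle u, s \rangle$ due to Prop. 1. An induction argument using Prop. 3 and Prop. 4 shows
\[
\sum_{k=1}^{\infty} \mathbb{E}_\gamma \int_{V^{k-1} \setminus V^k} \Psi(D\bar{u}_\gamma) \le \sum_{k=1}^{\infty} \left( \frac{l - 1}{l} \right)^{k-1} \frac{2}{l} \frac{\rho_u}{\rho_l} \int_\Omega \Psi(Du) = 2 \frac{\rho_u}{\rho_l} \int_\Omega \Psi(Du), \tag{51}
\]
therefore
\[
\mathbb{E}_\gamma f(\bar{u}_\gamma) \le \langle u, s \rangle + 2 \frac{\rho_u}{\rho_l} \int_\Omega \Psi(Du). \tag{52}
\]
Since $s \ge 0$, $\rho_u \ge \rho_l$ and therefore $\langle u, s \rangle \le 2 (\rho_u / \rho_l) \langle u, s \rangle$, this proves the assertion (47). Swapping the integral and limits in (50) can be justified retrospectively by the dominated convergence theorem, using $0 \le \langle u, s \rangle < \infty$ and $\int_\Omega \Psi(Du) \le \rho_u \mathrm{TV}(u) < \infty$.

Corollary 1. Under the conditions of Thm. 2, if $u^*$ minimizes $f$ over $\mathcal{C}$, $u^*_E$ minimizes $f$ over $\mathcal{C}_E$, and $\bar{u}^*$ denotes the output of Alg. 1 applied to $u^*$, then
\[
\mathbb{E} f(\bar{u}^*) \le 2 \frac{\rho_u}{\rho_l} f(u^*_E). \tag{53}
\]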
Proof. This follows immediately from Thm. 2 using $f(u^*) \le f(u^*_E)$, cf. (11).

We have demonstrated that the proposed approach allows one to recover, from the solution $u^*$ of the convex relaxed problem (2), an approximate discrete solution $\bar{u}^*$ of the nonconvex original problem (1), with an upper bound on the objective. The bound in (53) is of the same order as the known bounds for finite-dimensional metric labeling [6] and α-expansion [7]; however, it extends these results to problems on continuous domains for a broad class of regularizers [5].
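As a short worked step, the constant 2 in (47) and (53) can be traced back to a geometric series: Prop. 3 contributes the factor $2/l$ for the first iteration, and Prop. 4 contributes a factor $(l-1)/l$ for each further iteration, so that the dependence on the number of labels cancels. Written out (this only restates the computation behind (51)):

```latex
\[
\sum_{k=1}^{\infty} \left( \frac{l-1}{l} \right)^{k-1}
  = \frac{1}{1 - \frac{l-1}{l}} = l,
\qquad\text{hence}\qquad
\frac{2}{l}\,\frac{\rho_u}{\rho_l} \cdot l = 2\,\frac{\rho_u}{\rho_l}.
\]
```

For the regularizer $\Psi = \|\cdot\|_2$ one may take $\rho_u = \rho_l = 1$, so the factor is 2 and the corresponding a priori bound is $\varepsilon = 2\rho_u/\rho_l - 1 = 1$, independent of $l$.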
5 Experiments

Although the main purpose of Alg. 1 is to provide a basis for deriving the bound in Thm. 2, we will briefly point out some of its empirical characteristics.

Expected Number of Iterations. In practice, choosing $\alpha^k \in [0, 1]$ leads to an unnecessarily large number of iterations, as no point is assigned a label in
[Figure: left panel, mean number of iterations k (vertical axis) against label count l (horizontal axis) for the "basic" and "accelerated" variants; right panel, histogram of the number of iterations.]

Fig. 1. Left: Label count l vs. mean number of iterations k of the probabilistic rounding algorithm. The improved sampling of $\alpha^k$ greatly accelerates the method. Empirically, k ≈ 2l ln(l) for the accelerated method. As a result, runtime is comparable to the deterministic rounding methods. Right: Histogram (probability density scale) of the number of iterations k over 5000 runs for 2–12 labels.
iteration $k$ unless $\alpha^k < c^{k-1}_{i^k}$. The method can be accelerated without affecting the derived bounds by choosing $\alpha^k \in [0, c^{k-1}_{i^k}]$ instead, thereby skipping the redundant iterations. Fig. 1 shows the mean number of iterations $k$ until $e^\top c^k < 1$, over 5000 runs per label count. From the proof of Thm. 1 it can be seen that this provides a worst-case upper bound for the expected number of iterations until $\bar{u}_\gamma$ is obtained. For the accelerated method, $k$ is almost perfectly proportional to $l \ln(l)$; we conjecture that asymptotically $k = 2 l \ln(l)$.

Optimality. In order to evaluate the tightness of the bound (53) in Thm. 2 in practice, we selected 12 prototypical multiclass labeling problems with 3–64 labels each. For each we computed the relaxed solution $u^*$ and the mean as well as the best objective of the rounded solution $\bar{u}^*$ during 10000 iterations of Alg. 1. The employed primal-dual optimization approach provides a lower bound $f_D(v^*) \le f(u^*)$ via the dual objective $f_D$ and a dual feasible point $v^*$. This allows one to compute the relative gap $\varepsilon' := (f(\bar{u}^*) - f_D(v^*)) / f_D(v^*)$, which provides an a posteriori upper bound for the optimality w.r.t. the discrete solution $u^*_E$,
\[
\frac{f(\bar{u}^*) - f(u^*_E)}{f(u^*_E)} \le \varepsilon', \tag{54}
\]
in contrast to the theoretical, a priori upper bound $\varepsilon = 2\rho_u/\rho_l - 1$ derived from Cor. 1. In practice, the a posteriori bound stayed well below the theoretical bound (Table 1), which is consistent with the good practical performance of the α-expansion method that has a similar a priori bound.

Relative Performance. We compared the probabilistic approach to two deterministic rounding methods: While the “first-max” method assigns to each
Table 1. Number of pixels N, number of labels l, mean number of iterations k, predicted a priori bound $\varepsilon = 2\rho_u/\rho_l - 1$, and mean relative gap (a posteriori bound) $\varepsilon'$. The a posteriori bound is well below the bound predicted by Thm. 2. Problems 1–10 are color segmentation/inpainting problems with $\Psi = \|\cdot\|_2$. The depth-from-stereo resp. inpainting problems 11 and 12 use an approximated cut-linear metric as in [5].

problem   N        l    k       bound     rel. gap
1         76800    3    7.1     1.        0.0014
2         14400    3    6.9     1.        0.0186
3         14400    3    5.0     1.        0.0102
4         129240   4    11.0    1.        0.0106
5         76800    8    27.2    1.        0.0510
6         86400    12   47.5    1.        0.0591
7         86400    12   47.0    1.        0.0722
8         76800    12   43.6    1.        0.2140
9         86400    12   46.5    1.        0.1382
10        76800    12   46.0    1.        0.1172
11        110592   16   70.7    253.275   1.4072
12        21838    64   335.0   375.836   0.2772

[Figure 2 plot: relative gap (vertical axes, one with ticks 0.01 to 0.06 and one with ticks 0.2 to 1.4) for problems 1 to 12 (horizontal axis); legend: first-max, modified, probabilistic best, probabilistic mean.]
Fig. 2. Relative gap (a posteriori bound) $\varepsilon'$ of the rounded solution for the test problems using deterministic “first-max” and “modified” rounding [5], and best and mean gap obtained using the proposed probabilistic method. While the energy increase through probabilistic rounding is usually slightly larger than for the deterministic methods, it is well below the a priori bound of $\varepsilon = 2\rho_u/\rho_l - 1$ derived in Cor. 1 (Table 1).
point the first label $i$ s.t. $u_i(x) = \max_j u_j(x)$, the “modified” method [5] chooses the unit vector $e^i$ that is closest to $u(x)$ with respect to a norm defined by $\Psi$. Compared to these methods, Alg. 1 usually leads to a slightly larger energy increase (Fig. 2). For problems 11 and 12, where $\rho_u/\rho_l$ is large, the solution is clearly inferior to the one obtained using the “modified” rounding. This can be attributed to the fact that the latter takes into account the detailed structure of $\Psi$, which is neither required nor used in order to obtain the bounds in Thm. 2. However, for problems that are inherently difficult for convex relaxation approaches, we found that the probabilistic approach often generated better solutions. An example is the “inverse triple junction” inpainting problem (second row in Fig. 3), which has at least 3 distinct discrete solutions. A variant of this problem, formulated on graphs, was used as a worst-case example to show the tightness of the LP relaxation bound in [6].

We would like to emphasize that the purpose of these experiments is not to demonstrate a practical superiority of the proposed method compared to other techniques, but rather to illustrate what bounds can be expected in practice compared to the a priori bounds in Thm. 2.
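As an illustration only, the “first-max” rounding and the two bounds compared in this section can be sketched in a few lines of Python. The function names are hypothetical, the relaxed solution is assumed to be given as one label vector per point, and the objective values passed to `relative_gap` are assumed to have been computed elsewhere; this is not the code used for the experiments.

```python
def first_max_rounding(u):
    """Assign to each point the first label attaining the maximum of u(x).

    u is a list of per-point label vectors; the result holds one label
    index per point.
    """
    labels = []
    for ux in u:
        best = max(ux)
        labels.append(ux.index(best))  # list.index returns the first maximiser
    return labels


def relative_gap(f_rounded, f_dual):
    """A posteriori gap (f(u_bar) - f_D(v)) / f_D(v); needs f_D(v) > 0."""
    if f_dual <= 0:
        raise ValueError("dual lower bound must be positive")
    return (f_rounded - f_dual) / f_dual


def a_priori_bound(rho_u, rho_l):
    """A priori bound 2 * rho_u / rho_l - 1 as stated for Cor. 1."""
    return 2.0 * rho_u / rho_l - 1.0


if __name__ == "__main__":
    # Made-up relaxed solution on three points with three labels.
    u = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
    print(first_max_rounding(u))
    print(relative_gap(105.0, 100.0), a_priori_bound(1.0, 1.0))
```

Ties are broken towards the smaller label index, which is what “first” is taken to mean here.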
Fig. 3. Top to bottom: Problems 2,3,5,8,11 of the test set. Left to right: Input, relaxed solution, discrete solutions obtained by deterministic “first-max” and “modified” rounding [5], result of the probabilistic rounding. In specially crafted situations, the probabilistic method may perform slightly worse (first row) or better (second row). On real-world data, results are very similar (rows 3–5). In contrast to the deterministic approaches, the proposed method provides true a priori optimality bounds.
6 Conclusion
We presented a probabilistic rounding method for recovering approximate solutions of multiclass labeling or image partitioning problems from solutions of convex relaxations. To our knowledge, this is the first fully convex approach that is both formulated in the spatially continuous setting and provides an a priori bound on the optimality of the generated discrete solution. We showed that the approach can also be interpreted as an approximate variant of the coarea formula. Numerical experiments confirm the theoretical bounds. Future work may include extending the results to non-homogeneous regularizers, and improving the tightness of the bound. Also, the connection to recent convex relaxation techniques [11, 12] for solving nonconvex variational problems should be further explored.
References

1. Zach, C., Gallup, D., Frahm, J.M., Niethammer, M.: Fast global labeling for real-time stereo using multiple plane sweeps. Vis. Mod. Vis. (2008)
2. Lellmann, J., Kappes, J., Yuan, J., Becker, F., Schnörr, C.: Convex multi-class image labeling by simplex-constrained total variation. In: Tai, X.-C., Mørken, K., Lysaker, M., Lie, K.-A. (eds.) SSVM 2009. LNCS, vol. 5567, pp. 150–162. Springer, Heidelberg (2009)
3. Pock, T., Chambolle, A., Cremers, D., Bischof, H.: A convex relaxation approach for computing minimal partitions. Comp. Vis. Patt. Recogn. (2009)
4. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity Problems. Clarendon Press, Oxford (2000)
5. Lellmann, J., Becker, F., Schnörr, C.: Convex optimization for multi-class image labeling with a novel family of total variation based regularizers. In: Int. Conf. Comp. Vis. (2009)
6. Kleinberg, J.M., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. Found. Comp. Sci., 14–23 (1999)
7. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. Patt. Anal. Mach. Intell. 23, 1222–1239 (2001)
8. Olsson, C., Byröd, M., Overgaard, N.C., Kahl, F.: Extending continuous cuts: Anisotropic metrics and expansion moves. In: Int. Conf. Comp. Vis. (2009)
9. Bertsimas, D., Weismantel, R.: Optimization over Integers. Dynamic Ideas (2005)
10. Chan, T.F., Esedoğlu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. J. Appl. Math. 66, 1632–1648 (2006)
11. Alberti, G., Bouchitté, G., Dal Maso, G.: The calibration method for the Mumford-Shah functional and free-discontinuity problems. Calc. Var. Part. Diff. Eq. 16, 299–333 (2003)
12. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: Global solutions of variational models with convex regularization. J. Imaging Sci. 3, 1122–1145 (2010)
Interactive Segmentation with Super-Labels

Andrew Delong, Lena Gorelick, Frank R. Schmidt, Olga Veksler, and Yuri Boykov

University of Western Ontario, Canada
[email protected]

Abstract. In interactive segmentation, the most common way to model object appearance is by GMM or histogram, while MRFs are used to encourage spatial coherence among the object labels. This makes the strong assumption that pixels within each object are i.i.d. when in fact most objects have multiple distinct appearances and exhibit strong spatial correlation among their pixels. At the very least, this calls for an MRF-based appearance model within each object itself and yet, to the best of our knowledge, such a “two-level MRF” has never been proposed. We propose a novel segmentation energy that can model complex appearance. We represent the appearance of each object by a set of distinct spatially coherent models. This results in a two-level MRF with “super-labels” at the top level that are partitioned into “sub-labels” at the bottom. We introduce the hierarchical Potts (hPotts) prior to govern spatial coherence within each level. Finally, we introduce a novel algorithm with EM-style alternation of proposal, α-expansion and re-estimation steps. Our experiments demonstrate the conceptual and qualitative improvement that a two-level MRF can provide. We show applications in binary segmentation, multi-class segmentation, and interactive co-segmentation. Finally, our energy and algorithm have interesting interpretations in terms of semi-supervised learning.
[Figure 1 panels: user scribbles; Boykov-Jolly: label 1 and its appearance models; Ours: super-label 1, its sub-labeling, and the appearance models.]
Fig. 1. Given user scribbles, typical MRF segmentation (Boykov-Jolly) uses a GMM to model the appearance of each object label. This makes the strong assumption that pixels inside each object are i.i.d. In contrast, we define a two-level MRF to encourage inter-object coherence among super-labels and intra-object coherence among sub-labels.
Authors contributed equally. Corresponding Author.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 147–162, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
The vast majority of segmentation methods model object appearance by GMM or histogram and rely on some form of spatial regularization of the object labels. This includes interactive [1–3], unsupervised [4–9], binary [1–3, 8, 9] and multiclass [4–7, 10] techniques. The interactive methods make the strong assumption that all pixels within an entire object are i.i.d. when in fact many objects are composed of multiple regions with distinct appearances. Unsupervised methods try to break the image into small regions that actually are i.i.d., but these formulations do not involve any high-level segmentation of objects.

We propose a novel energy that unifies these two approaches by incorporating unsupervised learning into interactive segmentation. We show that this more descriptive object model leads to better high-level segmentations. In our formulation, each object (super-label) is automatically decomposed into spatially coherent regions where each region is described by a distinct appearance model (sub-label). This results in a two-level MRF with super-labels at the top level that are partitioned into sub-labels at the bottom. Figure 1 illustrates the main idea.

We introduce the hierarchical Potts (hPotts) prior to govern spatial coherence at both levels of our MRF. The hierarchical Potts prior regularizes boundaries between objects (super-label transitions) differently from boundaries within each object (sub-label transitions). The unsupervised aspect of our MRF allows appearance models of arbitrary complexity and would severely over-fit the image data if left unregularized. We address this by incorporating a global sparsity prior into our MRF via the energetic concept of “label costs” [7]. Since our framework is based on multi-label MRFs, a natural choice of optimization machinery is α-expansion [11, 7].
Furthermore, the number, class, and parameters of each object’s appearance models are not known a priori — in order to use powerful combinatorial techniques we must propose a finite set of possibilities for α-expansion to select from. We therefore resort to an iterative graph-cut process that involves random sampling to propose new models, α-expansion to update the segmentation, and re-estimation to improve the current appearance models. Figure 2 illustrates our algorithm.
[Figure 2 panels: user scribbles; (1) propose models from current super-labeling; (2) solve 2-level MRF via α-expansion, E=503005; (3) re-estimate all sub-models; converged to final sub-labeling, E=452288.]
Fig. 2. We iteratively propose new models by randomly sampling pixels from super-labels, optimize the resulting two-level MRF, and re-estimate model parameters.
The remainder of the paper is structured as follows. Section 2 discusses other methods for modeling complex appearance, MDL-based segmentation, and related iterative graph-cut algorithms. Section 3 describes our energy-based formulation and algorithm in formal detail. Section 4 shows applications in interactive binary/multi-class segmentation and interactive co-segmentation; furthermore it describes how our framework easily allows appearance models to come from a mixture of classes (GMM, plane, etc.). Section 5 draws an interesting parallel between our formulation and multi-class semi-supervised learning in general.
2 Related Work
Complex Appearance Models. The DDMCMC method [6] was the first to emphasize the importance of representing object appearance with complex models (e.g. splines and texture based models in addition to GMMs) in the context of unsupervised segmentation. However, being unsupervised, DDMCMC does not delineate objects but rather provides low-level segments along with their appearance models. Ours is the first multi-label graph-cut based framework that can learn a mixture of such models for segmentation. There is an interactive method [10] that decomposes objects into spatially coherent sub-regions with distinct appearance models. However, the number of sub-regions, their geometric interactions, and their corresponding appearance models must be carefully designed for each object of interest. In contrast, we automatically learn the number of sub-regions and their model parameters. MDL-Based Segmentation. A number of works have shown that minimum description length (MDL) is a useful regularizer for unsupervised segmentation, e.g. [5–7]. Our work stands out here in two main respects: our formulation is designed for semi-supervised settings and explicitly weighs the benefit of each appearance model against the ‘cost’ of its inherent complexity (e.g. number of parameters). To the best of our knowledge, only the unsupervised DDMCMC [6] method allows arbitrary complexity while explicitly penalizing it in a meaningful way. However, they use a completely different optimization framework and, being unsupervised, they do not delineate object boundaries. Iterative Graph-Cuts. Several energy-based methods have employed EM-style alternation between a graph-cut/α-expansion phase and a model re-estimation phase, e.g. [12, 2, 13, 14, 7]. Like our work, Grab-Cut [2] is about interactive segmentation, though their focus is binary segmentation with a bounding-box interaction rather than scribbles. 
The bounding box is intuitive and effective for many kinds of objects but often requires subsequent scribble-based interaction for more precise control. Throughout this paper, we compare our method to an iterative multi-label variant of Boykov-Jolly [1] that we call iBJ. Given user scribbles, this baseline method maintains one GMM per object label and iterates between α-expansion and re-estimating each model.
On an algorithmic level, the approach most closely related to ours is the unsupervised method [14, 7] because it also involves random sampling, α-expansion, and label costs. Our framework is designed to learn complex appearance models from partially-labeled data and differs from [14, 7] in the following respects: (1) we make use of hard constraints and the current super-labeling to guide random sampling, (2) our hierarchical Potts potentials regularize sub- and super-labels differently, and (3), again, our label costs penalize models based on their individual complexity rather than using uniform label costs.
3 Modeling Complex Appearance via Super-Labels
We begin by describing a novel multi-label energy that corresponds to our two-level MRF. Unlike typical MRF-based segmentation methods, our actual set of discrete labels (appearance models) is not precisely known beforehand and we need to estimate both the number of unique models and their parameters. Section 3.1 explains this energy formulation in detail, and Section 3.2 describes our iterative algorithm for minimizing this energy.

3.1 Problem Formulation
Let S denote the set of super-labels (scribble colors) available to the user and let P denote the indexes of pixels in the input image I. By “scribbling” on the image, the user interactively defines a partial labeling $g : P \to S \cup \{\text{none}\}$ that assigns to each pixel p a super-label index $g_p \in S$ or leaves p unlabeled ($g_p = \text{none}$). Our objective in terms of optimization is to find the following:

1. an unknown set L of distinct appearance models (sub-labels) generated from the image, along with model parameters $\theta_\ell$ for each $\ell \in L$,
2. a complete sub-labeling $f : P \to L$ that assigns one model to each pixel, and
3. a map $\pi : L \to S$ where $\pi(\ell) = i$ associates sub-label $\ell$ with super-label i, i.e. the sub-labels are grouped into disjoint subsets, one for each super-label; any π defines a parent-child relation in what we call a two-level MRF.

Our output is therefore a tuple $(L, \theta, \pi, f)$ with set of sub-labels L, model parameters $\theta = \{\theta_\ell\}$, super-label association π, and complete pixel labeling f. The final segmentation presented to the user is simply $(\pi \circ f) : P \to S$, which assigns a scribble color (super-label index) to each pixel in P.

In a good segmentation we expect the tuple $(L, \theta, \pi, f)$ to satisfy the following three properties. First, the super-labeling $\pi \circ f$ must respect the constraints imposed by user scribbles, i.e. if pixel p was scribbled then we require $\pi(f_p) = g_p$. Second, the labeling f should exhibit spatial coherence both among sub-labels and between super-labels. Finally, the set of sub-labels L should contain as many appearance models as is justified by the image data, but no more.
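To make the roles of f, π and g concrete, the following minimal sketch represents a toy instance with plain Python containers. The container choices (a list for f and g, a dictionary for π, `None` for an unscribbled pixel) and the function names are assumptions made for illustration and are not prescribed by the formulation.

```python
NONE = None  # marks an unscribbled pixel (g_p = none)


def super_labeling(f, pi):
    """Compose pi with f: map each pixel's sub-label to its super-label."""
    return [pi[label] for label in f]


def respects_scribbles(f, pi, g):
    """Check that pi(f_p) == g_p holds for every scribbled pixel p."""
    return all(gp is NONE or pi[fp] == gp for fp, gp in zip(f, g))


if __name__ == "__main__":
    # Hypothetical toy instance: six pixels, super-labels "F" and "B",
    # sub-labels 0 and 1 grouped under "F", sub-label 2 under "B".
    pi = {0: "F", 1: "F", 2: "B"}
    f = [0, 0, 1, 1, 2, 2]
    g = ["F", NONE, NONE, "F", NONE, "B"]
    print(super_labeling(f, pi))
    print(respects_scribbles(f, pi, g))
```

Because π is a function on sub-labels, the grouping of L into disjoint subsets (one per super-label) holds automatically in this representation.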
We propose an energy for our two-level MRFs¹ that satisfies these three criteria and can be expressed in the following form²:

$$E(L, \theta, \pi, f) = \sum_{p \in P} D_p(f_p) + \sum_{pq \in N} w_{pq} V(f_p, f_q) + \sum_{\ell \in L} h_\ell \, \delta_\ell(f) \qquad (1)$$

The unary terms D of our energy express negative log-likelihoods of appearance models and enforce the hard constraints imposed by the user. A pixel p that has been scribbled ($g_p \in S$) is only allowed to be assigned a sub-label $\ell$ such that $\pi(\ell) = g_p$. Un-scribbled pixels are permitted to take any sub-label.

$$D_p(\ell) = \begin{cases} -\ln \Pr(I_p \mid \theta_\ell) & \text{if } g_p = \text{none} \lor g_p = \pi(\ell) \\ \infty & \text{otherwise} \end{cases} \qquad (2)$$

The pairwise terms V are defined with respect to the current super-label map π as follows:

$$V(\ell, \ell') = \begin{cases} 0 & \text{if } \ell = \ell' \\ c_1 & \text{if } \ell \neq \ell' \text{ and } \pi(\ell) = \pi(\ell') \\ c_2 & \text{if } \pi(\ell) \neq \pi(\ell') \end{cases} \qquad (3)$$

We call (3) a two-level Potts potential because it governs coherence on two levels: $c_1$ encourages sub-labels within each super-label to be spatially coherent, and $c_2$ encourages smoothness among super-labels. This potential is a special case of our more general class of hierarchical Potts potentials introduced in Appendix 6, but two-level Potts is sufficient for our interactive segmentation applications. For image segmentation we assume $c_1 \le c_2$, though in general any V with $c_1 \le 2c_2$ is still a metric [11] and can be optimized by α-expansion. Appendix 6 gives general conditions for hPotts to be metric. It is commonly known that smoothness costs directly affect the expected length of the boundary and should be scaled proportionally to the size of the image. Intuitively $c_2$ should be larger as it operates on the entire image as opposed to smaller regions that correspond to objects. The weight $w_{pq} \ge 0$ of each pairwise term in (1) is computed from local image gradients in the standard way (e.g. [1, 2]).

Finally, we incorporate model-dependent “label costs” [7] to regularize the number of unique models in L and their individual complexity. A label cost $h_\ell$ is a global potential that penalizes the use of $\ell$ in labeling f through the indicator function $\delta_\ell(f) = 1 \Leftrightarrow \exists p : f_p = \ell$. There are many possible ways to define the weight $h_\ell$ of a label cost, such as the Akaike information criterion (AIC) [16] or the Bayesian information criterion (BIC) [17]. We use a heuristic described in Section 4.2.
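The three terms of the energy can be evaluated directly for a given labeling, which is a convenient sanity check when experimenting with the formulation. The sketch below is an assumption-laden illustration: the unary costs are taken as precomputed negative log-likelihoods, the neighborhood is passed as an explicit edge list, and all names are hypothetical. It only evaluates the energy and does not minimize it.

```python
import math


def potts_two_level(a, b, pi, c1, c2):
    """Two-level Potts potential: 0, c1 or c2 depending on the label pair."""
    if a == b:
        return 0.0
    return c1 if pi[a] == pi[b] else c2


def energy(f, pi, g, unary, edges, c1, c2, h):
    """Evaluate the two-level MRF energy for a complete sub-labeling f.

    unary[p][l] : precomputed -ln Pr(I_p | theta_l)
    edges       : list of (p, q, w_pq) neighbor pairs
    h           : dict mapping each sub-label to its label cost
    """
    total = 0.0
    for p, label in enumerate(f):
        if g[p] is not None and g[p] != pi[label]:
            return math.inf  # hard scribble constraint violated
        total += unary[p][label]
    for p, q, w in edges:
        total += w * potts_two_level(f[p], f[q], pi, c1, c2)
    total += sum(h[label] for label in set(f))  # each used label pays once
    return total


if __name__ == "__main__":
    pi = {0: "F", 1: "F", 2: "B"}
    unary = [{0: 1.0, 1: 2.0, 2: 3.0} for _ in range(3)]
    edges = [(0, 1, 1.0), (1, 2, 1.0)]
    h = {0: 1.0, 1: 1.0, 2: 1.0}
    print(energy([0, 1, 2], pi, [None, "F", "B"], unary, edges, 5.0, 10.0, h))
```

In the toy call, the first edge joins two sub-labels of the same super-label and pays $c_1$, while the second crosses a super-label boundary and pays $c_2$.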
3.2 Our SuperLabelSeg Algorithm
We propose a novel segmentation algorithm based on the iterative Pearl framework [7]. Each iteration of Pearl has three main steps: propose candidate models by random sampling, segment via α-expansion with label costs, and re-estimate the model parameters for the current segmentation. Our algorithm differs from [7] as follows: (1) we make use of hard constraints g and the current super-labeling $\pi \circ f$ to guide random sampling, (2) our two-level Potts potentials regularize sub- and super-labels differently, and (3) our label costs penalize models based on their individual complexity rather than a uniform penalty.

The proposal step repeatedly generates a new candidate model $\ell$ with parameters $\theta_\ell$ fitted to a random subsample of pixels. Each model is proposed in the context of a particular super-label $i \in S$, and so the random sample is selected from the set of pixels $P_i = \{ p \mid \pi(f_p) = i \}$ currently associated with i. Each candidate $\ell$ is then added to the current label set L with super-label assignment set to $\pi(\ell) = i$. A heuristic is used to determine a sufficient number of proposals to cover the set of pixels $P_i$ at each iteration.

Once we have candidate sub-labels for every object, a naïve approach would be to directly optimize our two-level MRF. However, being random, not all of an object’s proposals are equally good for representing its appearance. For example, a proposal from a small sample of pixels is likely to over-fit or mix statistics (Figure 3, proposal 2). Such models are not characteristic of the object’s overall appearance but are problematic because they may incidentally match some portion of another object and lead to an erroneous super-label segmentation. Before allowing sub-labels to compete over the entire image, we should do our best to ensure that all appearance models within each object are relevant and accurate.

Given the complete set of proposals, we first re-learn the appearance of each object $i \in S$. This is achieved by restricting our energy to pixels that are currently labeled with $\pi(f_p) = i$ and optimizing via α-expansion with label costs [7]; this ensures that each object is represented by an accurate set of appearance models.

Once each object’s appearance has been re-learned, we allow the objects to simultaneously compete for all image pixels while continuing to re-estimate their parameters. Segmentation is performed on a two-level MRF defined by the current $(L, \theta, \pi)$. Again, we use α-expansion with label costs to select a good subset of appearance models and to partition the image. The pseudo-code below describes our SuperLabelSeg algorithm.

¹ In practice we use non-uniform $w_{pq}$ and so, strictly speaking, (1) is a conditional random field (CRF) [15] rather than an MRF.
² The dependence of D on (π, θ) and of V on π is omitted in (1) for clarity.
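[Figure 3 panels: random patches 1 to 3 and the GMM fitted to each (GMM for 1, GMM for 2, GMM for 3), each plotted over an axis running from gray to white; annotation: “not an accurate appearance model for red scribble!”]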
Fig. 3. The object marked with red has two spatially-coherent appearances: pure gray, and pure white. We can generate proposals for the red object from random patches 1–3. However, if we allow proposal 2 to remain associated with the red object, it may incorrectly claim pixels from the blue object which actually does look like proposal 2.
SuperLabelSeg(g) where g : P → S ∪ {none} is a partial labeling
1  L = {}                 // empty label set with global f, π, θ undefined
2  Propose(g)             // initialize L, π, θ from user scribbles
3  repeat
4      Segment(P, L)      // segment entire image using all available labels L
5      Propose(π∘f)       // update L, π, θ from current super-labeling
6  until converged

Propose(z) where z : P → S ∪ {none}
1  for each i ∈ S
2      P_i = { p | z_p = i }      // set of pixels currently labeled with super-label i
3      repeat sufficiently
4          generate model ℓ with parameters θ_ℓ fitted to random sample from P_i
5          π(ℓ) = i
6          L = L ∪ {ℓ}
7      end
8      L_i = { ℓ | π(ℓ) = i }
9      Segment(P_i, L_i)          // optimize models and segmentation within super-label i

Segment(P̂, L̂) where P̂ ⊆ P and L̂ ⊆ L
1  let f|P̂ denote current global labeling f restricted to P̂
2  repeat
3      f|P̂ = argmin_f̂ E(L̂, θ, π, f̂)     // segment by α-expansion with label costs [7]
4                                        // where we optimize only on f̂ : P̂ → L̂
5      L = L \ { ℓ ∈ L̂ | δ_ℓ(f̂) = 0 }    // discard unused models
6      θ = argmin_θ E(L̂, θ, π, f̂)       // re-estimate each sub-model params
7  until converged
4 Applications and Experiments
Our experiments demonstrate the conceptual and qualitative improvement that a two-level MRF can provide. We are only concerned with scribble-based MRF segmentation, an important class of interactive methods. We use an iterative variant of Boykov-Jolly [1] (iBJ) as a representative baseline because it is simple, popular, and exhibits a problem characteristic to a wide class of standard methods. By using one appearance model per object, such methods implicitly assume that pixels within each object are i.i.d. with respect to its model. However, this is rarely the case, as objects often have multiple distinct appearances and exhibit strong spatial correlation among their pixels. The main message of all the experiments is to show that by using multiple distinctive appearance models per object we are able to reduce uncertainty near the boundaries of objects and thereby improve segmentation in difficult cases. We show applications in interactive binary/multi-class segmentation and interactive co-segmentation.
[Figure 4 columns: user scribbles; sub-labeling; our appearance models; our segmentation; iBJ segmentation.]
Fig. 4. Binary segmentation examples. The second column shows our final sub-label segmentation f where blues indicate foreground sub-labels and reds indicate background sub-labels. The third column is generated by sampling each Ip from model θfp . The last two columns compare our super-label segmentation π ◦ f and iBJ.
Implementation Details. In all our experiments we used publicly available α-expansion code [11, 7, 4, 18]. Our non-optimized MATLAB implementation takes on the order of one to three minutes depending on the size of the image, with the majority of time spent on re-estimating the sub-model parameters. We used the same within-object and between-object smoothness costs ($c_1 = 5$, $c_2 = 10$) in all binary, multi-class and co-segmentation experiments. Our proposal step uses distance-based sampling within each super-label whereby patches of diameter 3 to 5 are randomly selected. For re-estimating model parameters we use the MATLAB implementation of the EM algorithm for GMMs and we use PCA for planes. We regularize GMM covariance matrices to avoid overfitting in (L, a, b) color space by adding a constant value of 2.0 to the diagonal.
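The covariance regularization mentioned above amounts to a one-line operation. The following sketch shows it on a nested-list matrix; the function name and the pure-Python representation are assumptions chosen to keep the example dependency-free, and do not reflect the original implementation.

```python
def regularize_covariance(cov, eps=2.0):
    """Return a copy of a square covariance matrix with eps added to its diagonal."""
    n = len(cov)
    if any(len(row) != n for row in cov):
        raise ValueError("covariance matrix must be square")
    return [[cov[r][c] + (eps if r == c else 0.0) for c in range(n)]
            for r in range(n)]


if __name__ == "__main__":
    # A degenerate component with zero variance along the second axis.
    print(regularize_covariance([[1.0, 0.5], [0.5, 0.0]]))
```

Adding a positive constant to the diagonal keeps every component's covariance well conditioned, so a component fitted to a handful of nearly identical pixels cannot collapse to an arbitrarily sharp peak.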
4.1 Binary Segmentation
For binary segmentation we assume that the user wishes to segment an object from the background where the set of super-labels (scribble indexes) is defined by S = {F, B}. In this specific case we found that most of the user interaction is spent on removing disconnected false-positive object regions by scribbling over them with background super-label. We therefore employ a simple heuristic: after convergence we find foreground connected components that are not supported by a scribble and modify their data-terms to prohibit those pixels from taking the super-label F . We then perform one extra segmentation step to account for the new constraints. We apply this heuristic in all our binary segmentation results for both SuperLabelSeg and iBJ (Figures 4, 5, 6). Other heuristics could be easily incorporated in our energy to encourage connectivity, e.g. star-convexity [19, 20]. In Figure 4, top-right, notice that iBJ does not incorporate the floor as part of the background. This is because there is only a small proportion of floor pixels in the red scribbles, but a large proportion of a similar color (roof) in the blue scribbles. By relying directly on the color proportions in the scribbles, the learned GMMs do not represent the actual appearance of the objects in the full image. Therefore the ground is considered a priori more likely to be explained by the
[Figure 5 columns: user scribbles; ours; iBJ.]
Fig. 5. More binary segmentation results showing scribbles, sub-labelings, synthesized images, and final cut-outs
[Figure 6 panel labels: iBJ; ours, GMMs; ours, plane.]
(wrong) roof color than the precise floor color, giving an erroneous segmentation despite the hard constraints. Our method relies on spatial coherence of the distinct appearances within each object and therefore has a sub-label that fits the floor color tightly. This same phenomenon is even more evident in the bottom row of Figure 5. In the iBJ case, the appearance model for the foreground mixes the statistics from all scribbled pixels and is biased towards the most dominant color. Our decomposition allows each appearance with spatial support (textured fabric, face, hair) to have good representation in the composite foreground model.
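The connected-component heuristic used for the binary results in this section (finding foreground components that are not supported by a scribble) can be sketched as a breadth-first search over the foreground mask. The function name, the boolean-grid representation, and the use of 4-connectivity are assumptions of this sketch; the text does not state which connectivity was used.

```python
from collections import deque


def unsupported_foreground(fg, scribble_fg):
    """Return pixels of foreground components that contain no foreground scribble.

    fg          : 2-D list of booleans, True where the super-label is F
    scribble_fg : 2-D list of booleans, True where the user scribbled F
    """
    rows, cols = len(fg), len(fg[0])
    seen = [[False] * cols for _ in range(rows)]
    unsupported = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not fg[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search over one 4-connected foreground component.
            component, supported = [], False
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                supported = supported or scribble_fg[r][c]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and fg[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if not supported:
                unsupported.extend(component)
    return unsupported
```

The returned pixels are those whose data terms would be modified to prohibit the super-label F before the one extra segmentation step.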
4.2 Complex Appearance Models
In natural images objects often exhibit gradual change in hue, tone or shades. Modeling an object with a single GMM in color space [1, 2] makes the implicit assumption that appearance is piece-wise constant. In contrast, our framework allows us to decompose an object into regions with distinct appearance models,
[Figure 7 panels: iBJ; ours, GMM only; ours, GMM + planes (2 planes detected). Panels are annotated with the detected GMM and plane models; plot labels: x, y, black, white.]
Fig. 7. Our algorithm detects complex mixtures of models, for example GMMs and planes. The appearance of the above object cannot be captured by GMMs alone.
each from an arbitrary class (e.g. GMM, plane, quadratic, spline). Our algorithm will automatically choose the most suitable class for each sub-label within an object. Figure 6 (right) shows such an example where the background is decomposed into several grass regions, each modeled by a GMM in (L, a, b) color space, and a sky region that is modeled by a plane³ in a 5-dimensional (x, y, L, a, b) space. Note the gradual change in the sky color and how the clouds are segmented as a separate ‘white’ GMM. Figure 7 shows a synthetic image in which the foreground object breaks into two sub-regions, each exhibiting a different type of gradual change in color. This kind of object appearance cannot be captured by a mixture of GMM models.

In general our framework can incorporate a wide range of appearance models as long as there exists a black-box algorithm for estimating the parameters $\theta_\ell$, which can be used at line 6 of Segment. The importance of more complex appearance models was proposed by DDMCMC [6] for unsupervised segmentation in a completely different algorithmic framework. Ours is the first multi-label graph-cut based framework that can incorporate such models.

Because our appearance models may be arbitrarily complex, we must incorporate individual model complexity in our energy. Each label cost $h_\ell$ is computed based on the number of parameters $\theta_\ell$ and the number $\nu$ of pixels that are to be labeled, namely $h_\ell = \tfrac{1}{2} \sqrt{\nu} \, |\theta_\ell|$. We set $\nu = \#\{ p \mid g_p \neq \text{none} \}$ for line 2 of the SuperLabelSeg algorithm and $\nu = |P|$ for lines 4 and 5. Penalizing complexity is a crucial part of our framework because it helps our MRF-based models to avoid over-fitting. Label costs balance the number of parameters required to describe the models against the number of data points to which the models are fit. When re-estimating the parameters of a GMM we allow the number of components to increase or decrease if favored by the overall energy.
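As a small worked example of the label-cost heuristic, the sketch below assumes the cost reads $h_\ell = \tfrac{1}{2}\sqrt{\nu}\,|\theta_\ell|$, i.e. half the square root of the number of pixels to be labeled times the number of model parameters. The parameter count used for a full-covariance GMM is a common counting convention and an assumption here; the text does not specify how $|\theta_\ell|$ is counted for each model class.

```python
import math


def label_cost(num_params, nu):
    """h = 0.5 * sqrt(nu) * |theta| for a model with num_params parameters."""
    return 0.5 * math.sqrt(nu) * num_params


def gmm_num_params(components, dim):
    """Free parameters of a full-covariance GMM (assumed counting convention)."""
    per_component = dim + dim * (dim + 1) // 2   # mean + symmetric covariance
    return components * per_component + (components - 1)  # plus mixing weights


if __name__ == "__main__":
    nu = 10000  # hypothetical number of pixels to be labeled
    for k in (1, 2):
        n = gmm_num_params(k, 3)  # 3-D (L, a, b) color space
        print(k, n, label_cost(n, nu))
```

Under these assumptions a second GMM component roughly doubles the cost of a sub-label, so it is only retained when the data term improves by more than that amount.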
4.3 Multi-Class Segmentation
Interactive multi-class segmentation is a straight-forward application of our energy (1) where the set of super-labels S contains an index for each scribble color. Figure 8 shows examples of images with multiple scribbles corresponding to multiple objects. The resulting sub-labelings show how objects are decomposed into regions with distinct appearances. For example, in the top row, the basket is decomposed into a highly-textured colorful region (4-component GMM) and a more homogeneous region adjacent to it (2-component GMM). In the bottom row, notice that the hair of children marked with blue was so weak in the iBJ appearance model that it was absorbed into the background. The synthesized images suggest the quality of the learned appearance models. Unlike the binary case, here we do not apply the post-processing step enforcing connectivity.

³ In color images a ‘plane’ is a 2-D linear subspace (a 2-flat) of a 5-D image space.

[Figure 8 columns: user scribbles; our sub-labeling; our models; iBJ labeling; iBJ models.]
Fig. 8. Multi-class segmentation examples. Again, we show the color-coded sub-labelings and the learned appearance models. Our super-labels are decomposed into spatially coherent regions with distinct appearance.

4.4 Interactive Co-segmentation
Our two-level MRFs can be directly used for interactive co-segmentation [3, 21]. Specifically, we apply our method to co-segmentation of a collection of similar images as in [3] because it is a natural scenario for many users. This differs from ‘unsupervised’ binary co-segmentation [8, 9] that assumes dissimilar backgrounds and similar-sized foreground objects. Figure 9 shows a collection of four images with similar content. Just by scribbling on one of the images our method is able to correctly segment the objects. Note that the unmarked images contain background colors not present in the scribbled image, yet our method was able to detect these novel appearances and correctly segment the background into sub-labels.
5 Discussion: Super-Labels as Semi-supervised Learning
There are evident parallels between interactive segmentation and semi-supervised learning, particularly among graph cut methods ([1] versus [22]) and random walk methods ([23] versus [24]). An insightful paper by Duchenne et al. [25] explicitly discusses this observation. Looking back at our energy and algorithm from this perspective, it is clear that we actually do semi-supervised learning applied to image segmentation. For example, the grayscale image in Figure 7 can be visualized as points in a 3D feature space where small subsets of points have been labeled either blue or red. In addition to making ‘transductive’ inferences, our algorithm automatically learned that the blue label is best decomposed into two linear subspaces (green & purple planes in Figure 7, right) whereas the red label is best described by a single bi-modal GMM. The number, class, and parameters of these models were not known a priori but were discovered by SuperLabelSeg.

Our two-level framework allows each object to be modeled with arbitrary complexity but, crucially, we use spatial coherence (smooth costs) and label costs to regularize the energy and thereby avoid over-fitting. Setting c1 < c2 in our smooth costs V corresponds to a “two-level clustering assumption,” i.e. that class clusters are better separated than the sub-clusters within each class. To the best of our knowledge, we are the first to suggest iterated random sampling and α-expansion with label costs (SuperLabelSeg) as an algorithm for multiclass semi-supervised learning. These observations are interesting and potentially useful in the context of more general semi-supervised learning.

Fig. 9. Interactive co-segmentation examples (panels: image collection, our co-segmentation, iBJ co-segmentation). Note that our method detected submodels for grass, water, and sand in the 1st and 3rd bear images; these appearances were not present in the scribbled image.
6 Conclusion
In this paper we raised the question of whether GMM/histograms are an appropriate choice for modeling object appearance. If GMMs and histograms are not satisfying generative models for a natural image, they are equally unsatisfying for modeling appearance of complex objects within the image. To address this question we introduced a novel energy that models complex appearance as a two-level MRF. Our energy incorporates both elements of interactive segmentation and unsupervised learning. Interactions are used to provide high-level knowledge about objects in the image, whereas the unsupervised component tries to learn the number, class and parameters of appearance models within each object. We introduced the hierarchical Potts prior to regularize smoothness within and between the objects in our two-level MRF, and we use label costs to account for the individual complexity of appearance models. Our experiments demonstrate the conceptual and qualitative improvement that a two-level MRF can provide. Finally, our energy and algorithm have interesting interpretations in terms of semi-supervised learning. In particular, our energy-based framework can be extended in a straight-forward manner to handle general semi-supervised learning with ambiguously-labeled data [26]. We leave this as future work.
References 1. Boykov, Y., Jolly, M.P.: Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images. In: Int’l Conf. on Computer Vision (ICCV), vol. 1, pp. 105–112 (2001) 2. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts. In: ACM SIGGRAPH (2004) 3. Batra, D., Kowdle, A., Parikh, D., Luo, J., Chen, T.: iCoseg: Interactive Cosegmentation with Intelligent Scribble Guidance. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2010) 4. Kolmogorov, V., Zabih, R.: What Energy Functions Can Be Optimized via Graph Cuts. IEEE Trans. on Patt. Analysis and Machine Intelligence 26, 147–159 (2004) 5. Zhu, S.C., Yuille, A.L.: Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 18, 884–900 (1996) 6. Tu, Z., Zhu, S.C.: Image Segmentation by Data-Driven Markov Chain Monte Carlo. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 657–673 (2002) 7. Delong, A., Osokin, A., Isack, H., Boykov, Y.: Fast Approximate Energy Minimization with Label Costs. Int’l Journal of Computer Vision ( in press, 2011) 8. Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of Image Pairs by Histogram Matching - Incorporating a Global Constraint into MRFs. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2006) 9. Vicente, S., Kolmogorov, V., Rother, C.: Cosegmentation revisited: Models and optimization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 465–479. Springer, Heidelberg (2010) 10. Delong, A., Boykov, Y.: Globally Optimal Segmentation of Multi-Region Objects. In: Int’l Conf. on Computer Vision, ICCV (2009)
11. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximate Energy Minimization via Graph Cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence (2001) 12. Birchfield, S., Tomasi, C.: Multiway cut for stereo and motion with slanted surfaces. In: Int’l Conf. on Computer Vision (1999) 13. Zabih, R., Kolmogorov, V.: Spatially Coherent Clustering with Graph Cuts. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2004) 14. Isack, H.N., Boykov, Y.: Energy-based Geometric Multi-Model Fitting. Int’l Journal of Computer Vision (accepted 2011) 15. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Int’l Conf. on Machine Learning, ICML (2001) 16. Akaike, H.: A new look at statistical model identification. IEEE Trans. on Automatic Control 19, 716–723 (1974) 17. Schwarz, G.: Estimating the Dimension of a Model. Annals of Statistics 6, 461–646 (1978) 18. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. IEEE Trans. on Pattern Analysis and Machine Intelligence 29, 1124–1137 (2004) 19. Veksler, O.: Star shape prior for graph-cut image segmentation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 454–467. Springer, Heidelberg (2008) 20. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic Star Convexity for Interactive Image Segmentation. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2010) 21. Schnitman, Y., Caspi, Y., Cohen-Or, D., Lischinski, D.: Inducing semantic segmentation from an example. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 373–384. Springer, Heidelberg (2006) 22. Blum, A., Chawla, S.: Learning from labeled and unlabeled data using graph mincuts. In: Int’l Conf. on Machine Learning (2001) 23. Grady, L.: Random Walks for Image Segmentation. IEEE Trans. 
on Pattern Analysis and Machine Intelligence 28, 1768–1783 (2006) 24. Szummer, M., Jaakkola, T.: Partially labeled classification with markov random walks. In: Advances in Neural Information Processing Systems, NIPS (2001) 25. Duchenne, O., Audibert, J.Y., Keriven, R., Ponce, J., Segonne, F.: Segmentation by transduction. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2008) 26. Cour, T., Sapp, B., Jordan, C., Taskar, B.: Learning from Ambiguously Labeled Images. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2009) 27. Choi, M.J., Lim, J., Torralba, A., Willsky, A.S.: Exploiting Hierarchical Context on a Large Database of Object Categories. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR (2010)
Appendix: Hierarchical Potts

In this paper we use two-level Potts potentials where the smoothness is governed by two coefficients, $c_1$ and $c_2$. This concept can be generalized to a hierarchical Potts (hPotts) potential that is useful whenever there is a natural hierarchical grouping of labels. For example, the recent work on hierarchical context [27] learns a tree-structured grouping of the class labels for object detection; with hPotts potentials it is also possible to learn pairwise interactions for segmentation with hierarchical context. We leave this as future work.

We now characterize our class of hPotts potentials and prove necessary and sufficient conditions for them to be optimized by the α-expansion algorithm [11]. Let $N = L \cup S$ denote the combined set of sub-labels and super-labels. A hierarchical Potts prior V is defined with respect to an irreducible (see footnote 4) tree over node set N. The parent-child relationship in the tree is determined by $\pi : N \to S$ where $\pi(\ell)$ gives the parent (see footnote 5) of $\ell$. The leaves of the tree are the sub-labels L and the interior nodes are the super-labels S. Each node $i \in S$ has an associated Potts coefficient $c_i$ for penalizing sub-label transitions that cross from one sub-tree of i to another. An hPotts potential is a special case of general pairwise potentials over L and can be written as an $|L| \times |L|$ “smooth cost matrix” with entries $V(\ell, \ell')$. The coefficients of this matrix are block-structured in a way that corresponds to some irreducible tree. The example below shows an hPotts potential V and its corresponding tree.

[Figure: a block-structured smooth cost matrix $V(\ell, \ell')$ with entries 0 and $c_1, \ldots, c_5$, shown as equivalent (⇔) to a tree whose interior nodes are the super-labels S = {1, ..., 5} and whose leaves are the sub-labels L, among them α, β, γ.]
Let $\pi^n(\ell)$ denote n applications of the parent function, as in $\pi(\cdots\pi(\ell))$. Let $\mathrm{lca}(\ell, \ell')$ denote the lowest common ancestor of $\ell$ and $\ell'$, i.e. $\mathrm{lca}(\ell, \ell') = i$ where $i = \pi^n(\ell) = \pi^m(\ell')$ for minimal n, m. We can now define an hPotts potential as

$$V(\ell, \ell') = c_{\mathrm{lca}(\ell, \ell')} \qquad (4)$$

where we assume $V(\ell, \ell) = c_\ell = 0$ for each leaf $\ell \in L$. For example, in the tree illustrated above lca(α, β) is super-label 4 and so the smooth cost $V(\alpha, \beta) = c_4$.

Theorem 1. Let V be an hPotts potential with corresponding irreducible tree π. Then

$$V \text{ is metric on } L \iff c_i \le 2c_j \ \text{ for all } j = \pi^n(i). \qquad (5)$$

Proof. The metric constraint $V(\beta, \gamma) \le V(\alpha, \gamma) + V(\beta, \alpha)$ is equivalent to

$$c_{\mathrm{lca}(\beta,\gamma)} \le c_{\mathrm{lca}(\alpha,\gamma)} + c_{\mathrm{lca}(\beta,\alpha)} \qquad (6)$$

for all α, β, γ ∈ L. Because π defines a tree structure, for every α, β, γ there exist i, j ∈ S such that, without loss of generality,

$$j = \mathrm{lca}(\alpha, \gamma) = \mathrm{lca}(\beta, \alpha), \quad i = \mathrm{lca}(\beta, \gamma), \quad \text{such that } j = \pi^k(i) \text{ for some } k \ge 0. \qquad (7)$$

In other words there can be up to two unique lowest common ancestors among (α, β, γ) and we assume ancestor i is in the sub-tree rooted at ancestor j, possibly equal to j. For any particular (α, β, γ) and corresponding (i, j), inequality (6) is equivalent to $c_i \le 2c_j$. Since π defines an irreducible tree, for each (i, j) there must exist corresponding sub-labels (α, β, γ) for which (7) holds. It follows that $c_i \le 2c_j$ holds for all pairs $j = \pi^k(i)$, which completes the proof of (5).

Footnote 4: A tree is irreducible if all its internal nodes have at least two children.
Footnote 5: The root of the tree r ∈ S is assigned π(r) = r.
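The definition (4) and the condition of Theorem 1 can be checked numerically on a small tree. The following is a minimal sketch, not the authors' implementation: the function names, the dictionary encoding of π, and the example coefficients are all assumptions made for illustration. It only tests the triangle inequality (6) by brute force and compares the outcome with the condition in (5).

```python
from itertools import product

def ancestors(node, parent):
    """Return [node, pi(node), pi(pi(node)), ..., root]; the root r has pi(r) = r."""
    chain = [node]
    while parent[chain[-1]] != chain[-1]:
        chain.append(parent[chain[-1]])
    return chain

def lca(a, b, parent):
    """Lowest common ancestor of nodes a and b (equal to a when a == b)."""
    anc_b = set(ancestors(b, parent))
    return next(n for n in ancestors(a, parent) if n in anc_b)

def hpotts(a, b, parent, c):
    """Smooth cost V(a, b) = c_lca(a, b); leaves carry coefficient 0."""
    return c.get(lca(a, b, parent), 0.0)

def triangle_holds(leaves, parent, c):
    """Brute-force check of inequality (6) over all triples of sub-labels."""
    return all(
        hpotts(b, g, parent, c) <= hpotts(a, g, parent, c) + hpotts(b, a, parent, c)
        for a, b, g in product(leaves, repeat=3)
    )

def theorem_condition(parent, c):
    """Condition (5): c_i <= 2 c_j for every super-label i and each ancestor j."""
    return all(c[i] <= 2 * c[j] for i in c for j in ancestors(i, parent))

# Hypothetical tree: root 5 has children 1 and 4; node 4 has children 2 and 3;
# super-labels 1, 2, 3 each have two sub-labels (leaves).
parent = {"a1": 1, "a2": 1, "b1": 2, "b2": 2, "g1": 3, "g2": 3,
          1: 5, 4: 5, 2: 4, 3: 4, 5: 5}
leaves = ["a1", "a2", "b1", "b2", "g1", "g2"]

c_ok = {1: 1.0, 2: 2.0, 3: 2.0, 4: 1.5, 5: 1.0}   # satisfies c_i <= 2 c_j
c_bad = {1: 1.0, 2: 3.0, 3: 1.0, 4: 1.0, 5: 1.0}  # c_2 = 3 > 2 c_4 = 2
```

With these example coefficients, `triangle_holds` and `theorem_condition` agree on both trees: the first satisfies both checks and the second fails both, because two leaves under super-label 2 are more expensive to separate than a detour through a leaf under super-label 3.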
Curvature Regularity for Multi-label Problems - Standard and Customized Linear Programming
Thomas Schoenemann, Yubin Kuang, and Fredrik Kahl
Centre for Mathematical Sciences, Lund University, Sweden
Abstract. We follow recent work by Schoenemann et al. [25] for expressing curvature regularity as a linear program. While the original formulation focused on binary segmentation, we address several multi-label problems, including segmentation, denoising and inpainting, all cast as a single linear program. Our multi-label segmentation introduces a “curvature Potts model” and combines a well-known Potts model relaxation [14] with the above work. For inpainting, we improve on [25] by grouping intensities into bins. Finally, we address the problem of denoising with absolute differences in the data term. Furthermore, we explore alternative solving strategies, including higher order Markov Random Fields, min-sum diffusion and a combination of augmented Lagrangians and an accelerated first order scheme to solve the linear programs.
1 Introduction
Labeling problems (see footnote 1) with length-based regularity terms for computer vision have received an enormous amount of attention over the past two decades. Many of the associated optimization problems are solved globally today, e.g. [12,18] using the length of segmentation boundaries as a regularizer. If, more generally, one considers the length of level lines, the result is the well-known total variation, which is widespread in image denoising [23,7] and again can be optimized globally.

In many problems, one also wants to include second order regularity terms based on curvature. This has turned out to be much more challenging. Here, in a region-based context only the inpainting approach of Masnou and Morel [15] finds global optima. For a long time, region-based problems were addressed by local curve evolution, e.g. for segmentation in [19,11], for inpainting in [8,29,5] and very recently for denoising in [33,6]. In particular for segmentation problems this requires a good initialization and, more generally, it can lead to numerical instabilities.

For purely boundary based problems (without any regional terms), solutions were available already in 1990: Amini et al. [1] gave a solution that computes global optima as long as self-intersecting curves are allowed. More recently, this has been generalized to ratio functionals [24], which in practice is even more efficient. Curvature has also been explored for the problem of trace inference [20].

In a recent work, Schoenemann et al. [25,26] proposed to solve region-based problems with curvature regularity as linear programs. This approach is independent of initialization and has the advantage over the boundary-based methods that region terms can be included, self-intersections can be excluded and the computed regions need not be (singly-)connected. The authors also show applications to inpainting. Subsequently El-Zehiry and Grady [10] proposed to solve binary segmentation problems with curvature regularity via the QPBO method [13] with a subsequent probing stage. This works correctly for a 4-connectivity only, but an approximation for an 8-connectivity was included. Recently, solving region-based curvature problems via non-convex constrained optimization has also been attempted [28]. While this is different from curve evolution, it is still dependent on initialization.

In this work we consider the multi-label problems of multi-region segmentation, inpainting and denoising. For segmentation we derive a “curvature Potts model”, i.e. we show that the linear program in [25] can be generalized to more than two regions. Here, we make use of prior work [14,32] on writing the (length-based) Potts model as a very compact integer linear program. Approaches derived from a continuous formulation of the Potts model include [21,31]. For inpainting, we show that the parameterization of [25] may degenerate (and does so quite often in practice) since “level lines” are allowed to cross between different levels. We discuss what is needed to prevent this and then introduce a new multi-label parameterization that remedies the situation and is still practicable (in terms of memory and run-time). Subsequently we address the problem of image denoising with absolute differences in the data term and a level-line based curvature regularizer.

Finally, we discuss alternative techniques to model curvature as a linear program. Starting from [10] we derive a linear program that corresponds to a higher order MRF of a form considered in [30]. We then compare the two formulations and try suboptimal solving strategies such as min-sum diffusion (see [30]) and a (prematurely terminated) convex augmented Lagrangian solver. Previously, customized interior point methods have been investigated [4] for inequality constrained linear programs. However, in our setting we have equality constraints and found that the method does not work well on such problems.

Footnote 1: This work was funded by the European Research Council (GlobalVision grant no. 209480).

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 163–176, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Curvature and Linear Programming
In this section, we review the formulation in [25] for region-based binary segmentation problems with curvature regularity (but differ slightly in notation). Given an image $I : \Omega \to \mathbb{R}$, the problem is to find a division of $\Omega \subset \mathbb{R}^2$ into two regions, foreground and background, where two given functions $g_F(x)$ and $g_B(x)$ (derived from the image I) define the cost for a point x belonging to foreground and background respectively. It will be convenient to define $g(x) = g_F(x) - g_B(x)$. Together with length and curvature regularity, up to a constant the resulting problem is to minimize

$$\min_{R \subseteq \Omega} \ \int_R g(\hat{x})\, d\hat{x} + \nu |C| + \lambda \int_C |\kappa_C(\hat{x})|^2 \, d\mathcal{H}^1(\hat{x}), \qquad (1)$$
where $C = \partial R$ is the segmentation boundary of length $|C|$, weighted with $\nu > 0$, $\kappa_C$ denotes the curvature of the boundary and $\lambda > 0$ is a weighting factor for curvature regularity. The one-dimensional Hausdorff measure $d\mathcal{H}^1$ signifies an integration over a set of closed lines and that the integral is independent of the parameterization of the lines.

One can reformulate (1) as a linear program (LP) by first subdividing Ω into a cell complex, i.e. a subdivision into N non-overlapping basic regions whose union gives Ω. Where two regions meet there is an edge, and for every edge the approach considers two line segments, one for every possible direction of traversal of the edge (see the Figure on the left). We denote the set of basic regions $F$, the set of edges $E$ and the set of line segments $E^o$.

In the arising discrete setting, the region integral in (1) is expressed as $g^T x$, where the vector $x \in \{0, 1\}^N$ contains an entry $x_f$ for every basic region $f \in F$, with 1 indicating foreground and 0 background. The entries of the vector g reflect the integral of g(·) over each basic region.

In principle the region variables x already define a boundary and hence the cost of the regularity terms. However, since this mapping is not linear, [25] introduces an additional variable vector y with an entry $y_{jj'} \in \{0, 1\}$ for every pair $j, j' \in E^o$ of consistently oriented adjacent line segments. The variable $y_{jj'}$ is meant to be 1 if and only if both line segments are part of the boundary. The regularity term is now expressed as $w^T y$, where w is a vector defining the length and curvature weights for every pair $j, j'$ of consistent line segments. Here curvature is computed in terms of the angle θ between the line segments:

$$w_{jj'} = \frac{1}{2}\,\nu\,(l_j + l_{j'}) + \lambda\,\theta^2,$$

where $l_j$ and $l_{j'}$ are the lengths of the two line segments.

To ensure the consistency between boundary and region variables, constraints are needed, and in [26] there are three sets. The first ensures that wherever the foreground region ends there is an appropriate boundary variable. Hence there is one constraint for every edge. To formalize these constraints, all basic regions are assigned a consistent orientation (e.g. clockwise traversal everywhere). The oriented boundary line to be fitted must have the opposite orientation. Finally one needs to assign each edge an (arbitrary but fixed) orientation. The constraints are now based on the notion of incidence of basic regions to edges and of line pairs to edges. The incidence of region f and edge e is denoted $m^e_f$ (for “match”) and defined to be 0 unless the basic region f contains the edge e in its boundary. Otherwise, $m^e_f$ is 1 if the orientation of edge e agrees with the orientation of the region f and −1 if the orientations disagree.
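The pair weights and the region-to-edge incidences described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not code from [25]: the function names, the representation of segments by their end points, and the representation of a basic region as a clockwise list of vertices are all hypothetical choices. The angle θ is taken here as the absolute turning angle between the two consecutive segments.

```python
import math

def pair_weight(p0, p1, p2, nu, lam):
    """Weight for the consecutive line segments p0->p1 and p1->p2:
    0.5 * nu * (l_j + l_j') + lam * theta**2, with theta the turning angle."""
    l_first = math.dist(p0, p1)
    l_second = math.dist(p1, p2)
    d1 = (p1[0] - p0[0], p1[1] - p0[1])
    d2 = (p2[0] - p1[0], p2[1] - p1[1])
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    dot = d1[0] * d2[0] + d1[1] * d2[1]
    theta = abs(math.atan2(cross, dot))
    return 0.5 * nu * (l_first + l_second) + lam * theta ** 2

def incidence(region, edge):
    """Incidence of a region (list of vertices in traversal order) and an
    oriented edge (u, v): +1 if the region traverses u->v, -1 if it traverses
    v->u, and 0 if the edge is not part of the region boundary."""
    u, v = edge
    n = len(region)
    for k in range(n):
        a, b = region[k], region[(k + 1) % n]
        if (a, b) == (u, v):
            return 1
        if (a, b) == (v, u):
            return -1
    return 0

# Two unit squares sharing one edge, both traversed clockwise (y axis up).
region_a = [(0, 0), (0, 1), (1, 1), (1, 0)]
region_b = [(1, 0), (1, 1), (2, 1), (2, 0)]
shared_edge = ((1, 0), (1, 1))  # arbitrary but fixed orientation
```

For the shared edge the two incidences have opposite signs, so if both squares are foreground the region terms cancel and no boundary variable needs to be active on that edge; if only one square is foreground, a boundary pair has to compensate.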
For edge pairs $jj'$, the incidence operator $m^e_{jj'}$ is 0 unless line segment j is an orientation of the edge e. In this case it is 1 if e and j agree in orientation, −1 if they disagree. The first set of constraints, called surface continuation, is now:

$$\sum_{f} m^e_f\, x_f + \sum_{j,j'} m^e_{jj'}\, y_{jj'} = 0 \quad \forall e \in E. \qquad (2)$$

This set alone is not sufficient to ensure consistency of region and boundary variables. As a consequence, [25] introduces the boundary continuation constraints, with the aim to ensure that whenever $y_{jj'} = 1$ both affected edges are actually part of the region boundary induced by x:

$$\sum_{j'} y_{j'j} = \sum_{j'} y_{jj'} \quad \forall j \in E^o, \qquad (3)$$

i.e. whenever a line segment forms the first part of a pair in the boundary, it must also be the second part for some such pair. Finally, to prevent touching lines one adds the constraint set

$$\sum_{j,j'} |m^e_{jj'}|\, y_{jj'} \le 1 \quad \forall e \in E, \qquad (4)$$

which is called boundary consistency. These constraints express that for every edge e only one line pair in the boundary may contain an orientation of e as its first element. Concurrently with our work, [27] proposes an alternative constraint system to prevent touching lines. We leave this for future work. The resulting integer linear program is

$$\min_{x,y} \ g^T x + w^T y \qquad (5)$$
$$\text{s.t.} \quad (2), (3), (4), \quad x_f \in \{0, 1\} \ \forall f, \quad y_{jj'} \in \{0, 1\} \ \forall j, j'.$$

As in [25], we will solve the linear programming relaxation, i.e. the non-convex variable domains {0, 1} are relaxed to the convex sets [0, 1] for all variables.
3 Formulations for Multi-label Problems

Above we have described how to handle curvature in binary segmentation via linear programs, and later on we will review how this can be extended to inpainting. In all cases there is one region variable for each basic region in the mesh. We now introduce formulations for multi-label problems where we have several variables associated to each region.

3.1 Multi-Region Segmentation
The first problem we address is multi-region segmentation, i.e. a generalization of (1) to the region set $\mathcal{I} = \{1, \ldots, K\}$ where K is a pre-defined constant. In the continuous formulation, we are now looking for a segmentation function $s : \Omega \to \mathcal{I}$ and want to minimize the model

$$\int_\Omega g(\hat{x}, s(\hat{x}))\, d\hat{x} + \frac{1}{2} \sum_{i=1}^{K} \left( \nu |C_i| + \lambda \int_{C_i} |\kappa_{C_i}(\hat{x})|^2 \, d\mathcal{H}^1(\hat{x}) \right) \qquad (6)$$

where $g(\hat{x}, i)$ defines a data term for point $\hat{x}$ belonging to segment i, $C_i = \partial\{\hat{x} \mid s(\hat{x}) = i\}$ is the set of points where s(·) is discontinuous and on an adjacent side takes the value i, and $|C_i|$ denotes the one-dimensional measure of this set of points. Note that (6) exploits the known observation [14,32] that every point in the discontinuity set of s(·) belongs to two region boundaries, hence the factor 1/2 for the regularity term. Here we are assuming that the region boundary of every region is a smooth curve, i.e. we do not consider cusps that may arise when three regions meet in a point. In the discrete setting we consider below this is not an issue. Furthermore, we mention that the data term $g(\hat{x}, i)$ can be chosen freely and that more general regularity terms can be included, e.g. weighted curvature.

Again, we consider a discretized version of the continuous problem, where we note that for λ = 0, i.e. without curvature term, this is exactly the well-known Potts model [22]. This is why we call our discrete formulation “curvature Potts model”, the extension of the Potts model to a discrete curvature term.

Curvature Potts Model. We now show how to extend the integer linear program (5) to the curvature Potts model for multi-region segmentation. Like [14,32] we rely on indicator variables $x^i_f \in \{0, 1\}$ for each basic region, where $x^i_f = 1$ signifies that the basic region f belongs to the segmentation region i (i.e. $s_f = i$). It must be assigned to exactly one such region, so

$$\sum_{i \in \mathcal{I}} x^i_f = 1 \quad \forall f \in F. \qquad (7)$$

From the region variables $x^i_f$, the boundary of any region can be computed exactly as for the binary case: there are now separate boundary variables $y^i_{jj'}$ for each $i \in \{1, \ldots, K\}$. Each of these “layers” has separate surface continuation, boundary continuation and boundary consistency constraints. This amounts to the integer linear program

$$\min_{x,y} \ g^T x + w^T y \qquad (8)$$
$$\text{s.t.} \quad \sum_{i} x^i_f = 1 \quad \forall f \in F$$
$$\sum_{f} m^e_f\, x^i_f + \sum_{j,j'} m^e_{jj'}\, y^i_{jj'} = 0 \quad \forall e \in E,\ i \in \mathcal{I}$$
$$\sum_{j'} y^i_{j'j} = \sum_{j'} y^i_{jj'} \quad \forall j \in E^o,\ i \in \mathcal{I}$$
$$\sum_{j,j'} |m^e_{jj'}|\, y^i_{jj'} \le 1 \quad \forall e \in E,\ i \in \mathcal{I}$$
$$x^i_f \in \{0, 1\} \ \forall f, i, \quad y^i_{jj'} \in \{0, 1\} \ \forall i, j, j'$$

where the only coupling constraint is now (7). The associated linear programming relaxation is again obtained by relaxing the sets {0, 1} to [0, 1] everywhere.
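The factor 1/2 in (6) and the assignment constraint (7) can be illustrated in the purely length-based case (λ = 0). The sketch below is an assumption-laden toy, not part of the paper's formulation: it uses a pixel grid with 4-connectivity and unit edge lengths instead of a general cell complex, ignores the image border, and all function names are hypothetical.

```python
def one_hot(labels, regions):
    """Indicator variables x[i][r][c] = 1 iff pixel (r, c) has label i."""
    return {i: [[1 if v == i else 0 for v in row] for row in labels]
            for i in regions}

def potts_length(labels):
    """Number of 4-neighbour pixel pairs that carry different labels."""
    rows, cols = len(labels), len(labels[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                count += 1
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                count += 1
    return count

def region_boundary_length(indicator):
    """Boundary length of one region: interior edges where exactly one of
    the two adjacent pixels belongs to the region."""
    rows, cols = len(indicator), len(indicator[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and indicator[r][c] != indicator[r][c + 1]:
                count += 1
            if r + 1 < rows and indicator[r][c] != indicator[r + 1][c]:
                count += 1
    return count

labels = [[1, 1, 2],
          [1, 3, 2],
          [3, 3, 2]]
regions = [1, 2, 3]
x = one_hot(labels, regions)

# Constraint (7): every pixel is assigned to exactly one region.
assignment_ok = all(sum(x[i][r][c] for i in regions) == 1
                    for r in range(3) for c in range(3))

# Every label discontinuity lies on exactly two region boundaries.
half_sum = 0.5 * sum(region_boundary_length(x[i]) for i in regions)
```

On this example the Potts length is 6 and the summed per-region boundary lengths give 12, so halving the per-layer regularity cost recovers the Potts cost, in line with the observation from [14,32] used in (6).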
Fig. 1. Best viewed in color. Panels: region variables only; with level set variables. When fixing the intensities, the effect of including level set variables becomes apparent: lines with same colors indicate an active pair (not all are shown). Without level sets one of the pairs connects a jump from black to dark-grey and one from dark-grey to light-grey. This is not a level line. When introducing level sets the correct level lines are found.

3.2 Inpainting

Given is now an image $I : \Omega \to \mathbb{R}$ together with a damaged region $\Omega_d \subset \Omega$ which is to be filled with proper values. We first review the continuous formulation for inpainting in [25]. The model minimizes over the functions $u : \Omega \to \mathbb{R}$,

$$\min_{u : \Omega \to \mathbb{R}} \ \nu \int_\Omega |\nabla u(\hat{x})| \, d\hat{x} + \lambda \int_{I_l}^{I_u} \int_{C_{u,t}} |\kappa_{C_{u,t}}(\hat{x})|^2 \, d\mathcal{H}^1(\hat{x}) \, dt \qquad (9)$$
$$\text{s.t.} \quad u(\hat{x}) = I(\hat{x}) \quad \forall \hat{x} \in \Omega \setminus \Omega_d,$$
where $C_{u,t} = \{\hat{x} \mid u(\hat{x}) = t\}$ is the set of level lines of u(·) for level t. $I_u$ and $I_l$ are the maximal and minimal intensity at the boundary of the damaged region respectively. Note that the first term here is the well-known total variation that is closely related to the length of level lines. In principle there are infinitely many levels to consider, but one commonly [15] deals only with those levels that actually occur on the image boundary, i.e. a subset of $\mathcal{L} = \{0, \ldots, 255\}$.

In [25] this problem was addressed by an integer program similar to (5), with two differences: firstly, the variables are no longer binary; they can now take any value in $[I_l, I_u]$. Likewise, the boundary variables can take any value in $[0, I_u]$ where the maximal range is needed at the image border. In practice it is sensible to first subtract the constant $I_l$ from the image so that the range is $[0, I_u - I_l]$ for all variables. Secondly, all region variables for the known region $\Omega \setminus \Omega_d$ are fixed to the corresponding values of I(·). In practice one only considers region variables that are inside $\Omega_d$ or very close to its border.

As noted in [25] and visualized in Figure 1, the resulting boundary variables usually do not reflect level lines: in this formulation lines are allowed to switch between different intensity levels. One way to circumvent this for the discrete intensity range $\mathcal{L}$ is to introduce variables that reflect the actual level sets of the intensity profile. Then, there is no longer one variable $x_f$ for every basic region, there are now 255 binary variables $x^i_f \in \{0, 1\}$, $\forall i \in \mathcal{L} \setminus \{0\}$, with the intention that $x^i_f = 1$ if and only if $I_f \ge i$. This naturally induces the constraints $x^i_f \ge x^{i+1}_f \ \forall f \in F$.

Level lines are exactly the boundary lines of level sets, so using the procedure for binary segmentation to fit boundary lines to the binary variables $x^i_f$ indeed gives level lines. There are then $|\mathcal{L}| - 1$ layers of variables $x^i_f$ and $y^i_{jj'}$ that are coupled by constraints on the level variables and further constrained by the boundary conditions, i.e. by setting the variables $x^i_f$ for fixed regions so that they represent the level sets correctly.

A Compromise. Introducing 255 layers is not practicable for any but the smallest domains, since the memory demands are too high. Hence, we consider an intermediate strategy: we group the intensity interval $[0, I_u - I_l]$ into B equally large bins, where B is set by the user and the bin sizes are all integral (this implies that the uppermost bin may only partially be used). That is, we now have variables $x^b_f \in [0, I_B]$, $\forall b \in \{1, \ldots, B\}$, where $I_B = (I_u - I_l)/B$ is the size of each bin. We want $x^b_f = I_B$ if $I_f \ge b \cdot I_B$, else $x^b_f = \max\{0, I_f - (b - 1) \cdot I_B\}$. The result is the integer linear program

$$\min_{x,y} \ w^T y \qquad (10)$$
$$\text{s.t.} \quad x^b_f \ge x^{b+1}_f \quad \forall f \in F,\ b < B$$
$$\sum_{f} m^e_f\, x^b_f + \sum_{j,j'} m^e_{jj'}\, y^b_{jj'} = 0 \quad \forall e \in E,\ b \le B$$
$$\sum_{j'} y^b_{j'j} = \sum_{j'} y^b_{jj'} \quad \forall j \in E^o,\ 1 \le b \le B$$
$$\sum_{j,j'} |m^e_{jj'}|\, y^b_{jj'} \le I_B \quad \forall e \in E,\ 1 \le b \le B$$
$$x^b_f \in \{0, \ldots, I_B\} \ \forall f, b, \quad y^b_{jj'} \in \{0, \ldots, I_B\} \ \forall b, j, j'$$

plus some additional constraints for variables outside $\Omega_d$.

3.3 Denoising
The formulation (10) for inpainting can be easily generalized to the problem of image denoising with absolute differences in the data term and a curvature prior on level lines. That is, the continuous model takes the form

$$\min_{u : \Omega \to \mathbb{R}} \ \int_\Omega |I(\hat{x}) - u(\hat{x})| \, d\hat{x} + \nu \int_\Omega |\nabla u(\hat{x})| \, d\hat{x} + \lambda \int_{\mathbb{R}} \int_{C_{u,t}} |\kappa_{C_{u,t}}(\hat{x})|^2 \, d\mathcal{H}^1(\hat{x}) \, dt,$$

where compared to (9) there is now a data term and u is to be optimized over all of Ω. The function $u : \Omega \to \mathbb{R}$ is an inherently continuous-valued function, so the region variables are no longer integral. However, to estimate the level lines one again needs to include variables reflecting level sets in the formulation. These variables, as well as the associated boundary variables, are still integral.

Since there is no longer an a priori known set of levels to consider, we can only approximate the desired behavior. To do this, we use the same trick as above: levels are grouped into bins and bins have to be filled from the bottom up. The associated integer program is very similar to (10), so we only explain how to integrate the data term. It is well-known, e.g. [9], that an expression of the form

$$\min_{x \ge 0} \ c^T x + |a^T x - b|,$$

where c and a are arbitrary vectors and b is some scalar, can be transformed to the auxiliary linear program

$$\min_{x \ge 0,\ z^+ \ge 0,\ z^- \ge 0} \ c^T x + z^+ + z^- \quad \text{s.t.} \quad a^T x - b = z^+ - z^-.$$

This construction is easily generalized to multiple absolutes, where each will introduce two auxiliary variables.
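The transformation to auxiliary variables can be written out for several absolute terms at once. The following sketch only assembles the augmented cost vector and equality constraints as plain Python lists; it does not call any LP solver. The function names and the variable ordering (original variables first, then one pair of auxiliaries per absolute term) are assumptions made for illustration.

```python
def add_abs_terms(c, A, b):
    """Rewrite  min c^T x + sum_k |a_k^T x - b_k|  (x >= 0)  as a linear program
    in (x, z_1^+, z_1^-, ..., z_m^+, z_m^-) >= 0 with equality constraints
    a_k^T x - z_k^+ + z_k^- = b_k.  Returns (cost, A_eq, b_eq)."""
    m = len(A)
    cost = list(c) + [1.0] * (2 * m)
    A_eq = []
    for k, row in enumerate(A):
        aux = [0.0] * (2 * m)
        aux[2 * k] = -1.0      # coefficient of z_k^+
        aux[2 * k + 1] = 1.0   # coefficient of z_k^-
        A_eq.append(list(row) + aux)
    return cost, A_eq, list(b)

def lift(x, A, b):
    """Extend x by the cheapest feasible auxiliaries: z^+ takes the positive
    part of the residual a_k^T x - b_k and z^- the negative part."""
    z = []
    for row, b_k in zip(A, b):
        residual = sum(a * x_i for a, x_i in zip(row, x)) - b_k
        z += [max(residual, 0.0), max(-residual, 0.0)]
    return list(x) + z

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Small hypothetical instance with two absolute terms.
c = [1.0, 2.0]
A = [[1.0, -1.0], [2.0, 1.0]]
b = [3.0, 1.0]
x = [0.5, 2.0]

cost, A_eq, b_eq = add_abs_terms(c, A, b)
x_aug = lift(x, A, b)
original_value = dot(c, x) + sum(abs(dot(row, x) - b_k) for row, b_k in zip(A, b))
augmented_value = dot(cost, x_aug)
```

For any fixed x the lifted point is feasible for the equality constraints and attains the same objective value as the original expression, which is why minimizing the augmented program over all variables reproduces the minimum of the original one.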
4 Alternative Strategies

In this section, we explore alternative strategies, both in the formulation of the integer linear program and in the way the linear programming relaxations are solved.

4.1 Linear Programming and Higher Order MRFs
El-Zehiry and Grady [10] proposed to model curvature on cell complexes as a higher order Markov Random Field with ternary cliques. This only allows a 4connectivity, but as we will see below a related formulation with fourth-order cliques can handle all connectivities. In this paper we explore the associated integer linear program and its linear programming relaxation. The nodes in the MRF correspond to the basic regions in the mesh, so region cost are unary terms. The ternary cliques are induced by the edge pairs: for a 4-connectivity, whenever a pair of adjacent edges represents a direction change it will always have a common adjacent region, so in total there are only three regions to consider. When using QPBO the associated length and curvature cost can be decomposed into two-cliques. For our integer program they are simply precomputed into a lookup-table for all possible constellations of the three regions. When considering arbitrary meshes each edge pair can have up to four adjacent regions, so fourth order cliques are necessary. The (relevant) cliques in the arising graph can now be of order two (at the image border), three or four. The MRF can be associated a “standard” integer linear program, growing exponentially with the order of the maximal clique. Problems of this form were considered in [30]. Here we give a formulation for the binary segmentation problem. As above, there is a variable xf for every basic region f . For any clique C = {f1 , . . . , fkC }
Curvature Regularity for Multi-label Problems
of order $k_C$ the integer program contains $2^{k_C}$ auxiliary variables representing all possible value assignments, i.e. for any $v \in \{0, 1\}^{k_C}$ there is a variable $y_{C,v}$ with cost defined by the above-mentioned lookup tables. The constraint system expresses consistency of the region variables and the clique variables, i.e. the ILP reads
$$\begin{aligned} \min_{x,y} \quad & c_x^T x + c_y^T y \\ \text{s.t.} \quad & x_f = \sum_{v : v(f) = 1} y_{C,v} \qquad \forall f, C : f \in C \\ & 1 - x_f = \sum_{v : v(f) = 0} y_{C,v} \qquad \forall f, C : f \in C , \end{aligned}$$
where $v(f)$ denotes the entry of $v$ that corresponds to the basic region $f$ for the currently considered (ordering of the) clique $C$. Note that this integer program is not completely equivalent to the one given in Section 2: wherever two regions meet in a point the cost is overestimated, as all possible boundary configurations are penalized. In contrast, the formulation (5) will choose the best configuration here.
4.2 Do We Need Standard Solvers?
Standard linear programming solvers have only rarely been used in Computer Vision, due to their high memory demands and long run-times. Such solvers have proven to work fine for curvature problems [25], but the image sizes that can be handled in practice are rather small: processing a 128 × 128 image with a 16-connectivity and the simplex method requires 4 GB. Clearly, less memory-intensive techniques are desirable, and one may even want to accept a loss in numerical precision of the solution. We examine two strategies.
Min-sum Diffusion. The first technique is called min-sum diffusion and is applicable only to the linear program derived from the higher order MRF. It was used for the higher order terms in [30], where it is described in detail. Here we only mention that the method performs a reparameterization of the cost function, i.e. it shifts cost from the higher order terms to the unary terms without affecting the represented energy. Finding the optimal reparameterization is equivalent to solving the dual linear program. However, not all reparameterizations are considered, so in general the method is suboptimal (see the experiments below). The main advantages of the method are its simplicity and the savings in memory consumption it offers.
Augmented Lagrangians. The augmented Lagrangian method [3, chap. 4.2] is a very general technique to handle a large class of constrained optimization problems. Here we detail it on the example of the linear program
$$\min_{x} \; c^T x \quad \text{s.t.} \quad Ax = b, \; x \ge 0 , \qquad (11)$$
T. Schoenemann, Y. Kuang, and F. Kahl
assumed to be bounded and feasible. The augmented Lagrangian method introduces the convex relaxation
$$\min_{x \ge 0} \; \Psi_{\gamma,\lambda}(x) = c^T x + \lambda^T (Ax - b) + \gamma \|Ax - b\|^2 ,$$
and since $Ax = b$ for any feasible solution, the cost of such solutions remains untouched for any $\gamma > 0$ and $\lambda \in \mathbb{R}^n$. The choice of $\gamma$ and $\lambda$ significantly influences the quality of the relaxation, but also the computational difficulty of minimizing $\Psi_{\gamma,\lambda}(x)$. In practice one proceeds in iterations, starting with some $\gamma_0$ and $\lambda = 0$. Ideally, each iteration would minimize $\Psi_{\gamma,\lambda}(x)$ with a guaranteed precision of some specified $\epsilon$. In our case this is too computationally costly, so we only run a fixed number of (inner) iterations. After each outer iteration, the multipliers are updated according to [3, chap. 4.2]
$$\lambda \leftarrow \lambda + \gamma (Ax - b)$$
and $\gamma$ is multiplied by a constant factor (we take 1.1). To minimize $\Psi_{\gamma,\lambda}(x)$ for given $\gamma$ and $\lambda$, we exploit that the gradient of the function is Lipschitz-continuous and apply an accelerated first order scheme [17] with a projection onto the set $x \ge 0$. In the case of multi-region segmentation some constraints can be included directly into this projection [16]. This scheme is parallelizable, so we consider a GPU implementation. The computed solutions are very nearly feasible but, due to the limit on the inner iterations, their costs are frequently above the optimal ones.
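The outer/inner structure described above can be sketched on a toy problem as follows. This is a minimal illustration under stated assumptions, not the authors' GPU implementation: the inner solver is plain projected gradient descent rather than the accelerated scheme of [17], the step size uses the squared Frobenius norm of A as a crude Lipschitz bound, and the function name, iteration counts and test LP are hypothetical. The multiplier update and the growth factor 1.1 follow the text.

```python
def matvec(A, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def augmented_lagrangian_lp(c, A, b, gamma=1.0, growth=1.1, outer=40, inner=100):
    """Approximately solve min c^T x s.t. Ax = b, x >= 0."""
    x = [0.0] * len(c)
    lam = [0.0] * len(b)
    # Squared Frobenius norm: an upper bound on the largest eigenvalue of A^T A.
    frob2 = sum(a * a for row in A for a in row)
    for _ in range(outer):
        step = 1.0 / (2.0 * gamma * frob2)      # 1 / Lipschitz bound of grad Psi
        for _ in range(inner):
            r = [ax - bi for ax, bi in zip(matvec(A, x), b)]
            w = [li + 2.0 * gamma * ri for li, ri in zip(lam, r)]
            grad = [ci + gi for ci, gi in zip(c, rmatvec(A, w))]
            # Gradient step followed by projection onto x >= 0.
            x = [max(0.0, xi - step * gi) for xi, gi in zip(x, grad)]
        r = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        lam = [li + gamma * ri for li, ri in zip(lam, r)]   # multiplier update
        gamma *= growth                                     # increase the penalty
    return x, lam


# Toy LP: min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0  (optimum x = (1, 0)).
x_opt, lam_opt = augmented_lagrangian_lp([1.0, 2.0], [[1.0, 1.0]], [1.0])
```

On this toy problem the inner loop converges quickly, so the iterate becomes feasible to high accuracy; with a capped number of inner iterations on large problems, the text notes that solutions are only nearly feasible and may cost more than the optimum.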
5 Experiments
We now give results for all discussed applications. The experiments were run on a 3 GHz Core2 Duo with 8 GB memory and an NVidia Tesla 2050 card. We found that our problems are best solved via interior point methods, which frequently require the entire 8 GB. We use the commercial solvers FICO Xpress and Gurobi.
5.1 Image Segmentation
For image segmentation we choose an 8-connectivity and the data term $g(\hat{x}, i) = (I(\hat{x}) - \mu_i)^2$, where we select the mean values $\mu_i$ as the 16%, 50% and 83% percentiles of the image. Two results for multi-region segmentation on image sizes of 128×116 are given in Figure 2. Note in particular that without curvature, even a low length weight splits the tail of the lion into several parts, whereas the proposed curvature-based method closes the gap. For the length-based results we re-ran our method with a curvature weight of 0. Despite the NP-hardness of the Potts model they are globally optimal. The computed curvature-based solutions (roughly 5 hours) are 15% and 37% away from the respective proven lower bounds. Running the Augmented Lagrangian solver (Sec. 4.2) for the lions with 30 × 2500 iterations gave a nearly feasible solution 0.01% higher than the optimum. The thresholded solution looks visually very similar. With 1.5 hours (18x on the CPU) this is faster than the exact solver and needs less memory.
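The data term above can be sketched as follows. This is a hedged illustration: the text does not say how the percentiles are computed, so the nearest-rank definition used here is an assumption, and the function names and the tiny test image are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def data_term(image, percents=(16, 50, 83)):
    """Return the region means mu_i and the costs g(x, i) = (I(x) - mu_i)^2."""
    pixels = [v for row in image for v in row]
    means = [percentile(pixels, p) for p in percents]
    cost = [[[(v - mu) ** 2 for mu in means] for v in row] for row in image]
    return means, cost


# Hypothetical 2 x 5 gray-value image.
image = [[0, 10, 20, 30, 40],
         [50, 60, 70, 80, 90]]
means, cost = data_term(image)
```

Here `cost[r][c][i]` is the unary cost of assigning pixel (r, c) to region i, which is the quantity that would enter the region variables of the linear program.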
Fig. 2. Comparison of purely curvature-based and purely length-based segmentation (all 8-connected). Panels: input, curvature segmentation, length segmentation.
Fig. 3. Curvature-inpaintings with a 16-connectivity: already two bins have a significant effect on the results. Panels: damaged, inpainted (our sol.), without bins, with 2 bins.
5.2 Inpainting
For inpainting, Figure 3 demonstrates that the introduction of bins indeed has a significant influence on the results. We prefer a 16-connectivity over a large number of bins. At the same time, it can be seen that the method can handle non-singly connected domains. To estimate the direction of the level lines at the border of the inpainting domains we use a structure tensor based method similar to [5]. The inpainted image is obtained by summing all bin variables associated with each face.
5.3 Denoising
With a 16-connectivity we can handle denoising tasks for resolutions of 96 × 96, using the entire 8 GB and roughly 3 hours computing time. As shown in Figure 4 the results are less blocky than the state of the art. However, they contain new kinds of artifacts. When adding a length penalty these are partially removed and the results look more natural than with a TV term alone. Further, we tested different numbers of bins for an 8-connectivity, but at least 6 bins were needed to affect the solutions and this already takes 3 GB.
Fig. 4. Image denoising with first and second order methods (our implementations) on a resolution of 96×96. With curvature regularity (16-connected) there are fewer block artifacts. Panels (for each of the two examples): noisy input, TV-L2 [23], TV-L1 [2], curvature, curvature + TV.
Fig. 5. Comparison of the different strategies for an 8-connectivity. Next to the input images: Higher Order ILP. Left to that: ILP from [26]. In all cases the lower bounds (rel.) and the energies of the thresholded solutions (thr.) are given. Values as printed in the panels, first input: rel. 22790622, rel. 22863011, thr. 23511655, thr. 23139673; second input: rel. 13583936, rel. 13114465, thr. 14167819, thr. 14173800.
5.4 Comparison of ILPs
When using standard solvers, we found that the clique-lp requires roughly five times more memory. As a consequence, for the comparison of the two presented ILPs we use an image resolution of 64×64 and only consider binary segmentation. As shown in Figure 5 there is no clear winner: either method can produce tighter lower bounds and better integral solutions than the other. The problem with the higher order ILP is its huge memory consumption, though, and the run-times are also clearly inferior to the other ILP. Hence, we ran the min-sum diffusion of Section 4.2 (until near convergence), but with LP energies of 22770853 (for the cameraman) and 13558816 (for the fingerprints) it was far away from the actual optima (see Fig. 5). Still, the thresholded solutions are useful and the memory consumption is now a factor of two below the standard solving of [25] (plus boundary consistency).
6 Conclusion
This paper has shown that a discrete curvature regularizer can be applied to multi-label region-based problems. It has addressed three different tasks: segmentation, denoising and inpainting. All are modeled as linear programs with several layers of variables coupled by relatively few constraints. Hence, in future work we intend to consider dual decomposition strategies. At the same time, we have shown in this work that linear programs need not necessarily be solved by standard solvers if one is willing to sacrifice a bit of precision. Here, we have considered two different linear programs and two strategies of solving them. To facilitate further research in this area, the source code associated with this paper will be made publicly available.
References
1. Amini, A., Weymouth, T., Jain, R.: Using dynamic programming for solving variational problems in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 12(9), 855–867 (1990)
2. Aujol, J.-F., Gilboa, G., Chan, T., Osher, S.: Structure-texture image decomposition modeling, algorithms, and parameter selection. International Journal on Computer Vision (IJCV) 67(1), 111–136 (2006)
3. Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
4. Bhusnurmath, A., Taylor, C.: Graph cuts via l1 norm minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 30(10), 1866–1871 (2008)
5. Bornemann, F., März, T.: Fast image inpainting based on coherence transport. Journal on Mathematical Imaging and Vision 28(3), 259–278 (2007)
6. Brito-Loeza, C., Chen, K.: Multigrid algorithm for higher order denoising. SIAM Journal for Imaging Science 3(3), 363–389 (2010)
7. Chan, T., Esedoglu, S.: Aspects of total variation regularized l1 function approximation. SIAM Journal of Applied Mathematics 65(5), 1817–1837 (2004)
8. Chan, T., Kang, S., Shen, J.: Euler’s elastica and curvature based inpaintings. SIAM Journal of Applied Mathematics 2, 564–592 (2002)
9. Dantzig, G., Thapa, M.: Linear Programming 1: Introduction. Series in Operations Research. Springer, Heidelberg (1997)
10. El-Zehiry, N., Grady, L.: Fast global optimization of curvature. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, California (June 2010)
11. Esedoglu, S., March, R.: Segmentation with depth but without detecting junctions. Journal on Mathematical Imaging and Vision 18, 7–15 (2003)
12. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B 51(2), 271–279 (1989)
13. Hammer, P., Hansen, P., Simeone, B.: Roof duality, complementation and persistency in quadratic 0-1 optimization. Mathematical Programming 28(2), 121–155 (1984)
14. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems with pairwise relationships: metric labeling and Markov Random Fields. In: Symposium on Foundations of Computer Science (1999)
15. Masnou, S.: Disocclusion: A variational approach using level lines. IEEE Transactions on Image Processing (TIP) 11, 68–76 (2002)
16. Michelot, C.: A finite algorithm for finding the projection of a point onto the canonical simplex of Rn. Journal on Optimization Theory and Applications 50(1) (July 1986)
17. Nesterov, Y.: Introductory lectures on convex optimization. Applied Optimization. Kluwer Academic Publishers, Dordrecht (2004)
18. Nikolova, M., Esedoglu, S., Chan, T.: Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal of Applied Mathematics 66(5), 1632–1648 (2006)
19. Nitzberg, M., Mumford, D., Shiota, T.: Filtering, segmentation and depth. In: LNCS, vol. 662. Springer, Heidelberg (1993)
20. Parent, P., Zucker, S.: Trace inference, curvature consistency, and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 11(8), 823–839 (1989)
21. Pock, T., Chambolle, A., Bischof, H., Cremers, D.: A convex relaxation approach for computing minimal partitions. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, Florida (June 2009)
22. Potts, R.: Some generalized order-disorder transformation. Proceedings of the Cambridge Philosophical Society 48, 106–109 (1952)
23. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
24. Schoenemann, T., Cremers, D.: Introducing curvature into globally optimal image segmentation: Minimum ratio cycles on product graphs. In: IEEE International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil (October 2007)
25. Schoenemann, T., Kahl, F., Cremers, D.: Curvature regularity for region-based image segmentation and inpainting: A linear programming relaxation. In: IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan (September 2009)
26. Schoenemann, T., Kahl, F., Masnou, S., Cremers, D.: A linear framework for region-based image segmentation and inpainting involving curvature regularity. Technical report, ArXiv report (February 2011)
27. Strandmark, P., Kahl, F.: Curvature regularization for curves and surfaces in a global optimization framework. In: International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, St. Petersburg, Russia (July 2011)
28. Tai, X.-C., Hahn, J., Chung, G.: A fast algorithm for Euler’s elastica model using augmented Lagrangian method. Technical report, UCLA CAM report (July 2010)
29. Tschumperlé, D.: Fast anisotropic smoothing of multi-valued images using curvature-preserving PDE’s. International Journal on Computer Vision (IJCV) 68(1), 65–82 (2006)
30. Werner, T.: High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska (June 2008)
31. Yuan, J., Bae, E., Tai, X.-C., Boykov, Y.: A continuous max-flow approach to Potts model. In: European Conference on Computer Vision (ECCV), Iraklion, Greece (September 2010)
32. Zach, C., Gallup, D., Frahm, J.-M., Niethammer, M.: Fast global labeling for realtime stereo using multiple plane sweeps. In: Vision, Modeling and Visualization Workshop (VMV), Konstanz, Germany (October 2008)
33. Zhu, W., Chan, T.: Image denoising and mean curvature. Technical report, Courant Institute, New York, Preprint (November 2006), http://www.cims.nyu.edu/~wzhu/meancurvature_06Nov30.pdf
Space-Varying Color Distributions for Interactive Multiregion Segmentation: Discrete versus Continuous Approaches
Claudia Nieuwenhuis, Eno Töppe, and Daniel Cremers
Department of Computer Science, TU München, Germany
Abstract. State-of-the-art approaches in interactive image segmentation often fail for objects exhibiting complex color variability, similar colors or difficult lighting conditions. The reason is that they treat the given user information as independent and identically distributed in the input space yielding a single color distribution per region. Due to their strong overlap segmentation often fails. By statistically taking into account the local distribution of the scribbles we obtain spatially varying color distributions, which are locally separable and allow for weaker regularization assumptions. Starting from a Bayesian formulation for image segmentation, we derive a variational framework for multi-region segmentation, which incorporates spatially adaptive probability density functions. Minimization is done by three different optimization methods from the MRF and PDE community. We discuss advantages and drawbacks of respective algorithms and compare them experimentally in terms of segmentation accuracy, quantitative performance on the Graz benchmark and speed.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 177–190, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
Segmentation denotes the task of dividing an image into meaningful, non-overlapping regions. What is meaningful, especially in complex images, depends on what the user intends to extract from the image. This makes the problem highly ill-posed, so user interaction is indispensable. Typically bounding boxes, contours or scribbles are used to indicate the regions of interest. Such interactive segmentation algorithms are widely used in image editing software packages, e.g. for the identification of specific structures in medical images, for tracking or to compose scenes from different images.
Previous interactive approaches use various methods to estimate color models for each object to be segmented. In [21] the authors compute color histograms and threshold histogram distances, in [17] Mixtures of Gaussians are employed to estimate the color distributions, in [3] kernel density estimators are used to estimate geodesic distances between any two pixels, and in [19] methods from machine learning have been applied to the task. All of these approaches derive a single, constant color model for each region. This policy does not adequately represent the given user information, since scribbles contain information on color and space. By disregarding the spatial information in kernel density estimation we assume that the color information is, in a statistical sense, spatially independent and identically distributed (iid) over the whole image. This is not correct and only leads to good results in case of a strong, well-chosen prior and non-overlapping color distributions, which is clearly not the case in natural images with changing light conditions, multicolored objects and difficult textures. In fact, close to a scribble we have precise information on the color of the object; at scribble pixels we even know the color exactly. Only far away from the scribbles does the spatial location become less important, since there we have no precise knowledge of the object color. This is not reflected by previous data fidelity terms. In [13] we segmented optical flow fields by means of spatially dependent flow vector distributions. The first contribution of this paper is to show how non-iid assumptions can be handled in color density estimation, leading to strongly improved results.
The optimization of energies with respect to a set of variables which can take one of multiple labels is among the central algorithmic challenges in computer vision and image analysis. In the discrete MRF community the segmentation problem is related to the classical Potts model [15]; in the spatially continuous setting it is typically referred to as the minimal partition problem. Over the years a number of algorithms for multilabel optimization have been developed, both in the MRF community and in the PDE community. As a result one may ask how the respective algorithms compare in practice. In [9] an experimental comparison of discrete and continuous optimization approaches was presented for the specific case of binary labeling problems. In recent years, the focus has shifted from binary labeling problems to the more general multilabel problem.
The second contribution of this paper is to perform an experimental comparison of the respective algorithms for the minimal partition problem with more than two regions or labels. Such comparisons are invariably limited as they can only reflect the state of the art at a given time, whereas algorithms are continuously being improved. In this paper, we will therefore focus on an experimental comparison of the following algorithms, which minimize the same energy in a discrete and in a continuous formulation:
– The discrete alpha expansion approach with 4 and 8 connectivity [5].
– The continuous approach by Chambolle et al. [6].
– The continuous approach by Lellmann et al. [12].
Yet, we should point out that even at this moment, there exist further accelerations of algorithms, such as primal-dual algorithms [10] and dynamic hybrid algorithms [2] for speeding up MRF optimization, and Douglas-Rachford splitting, which reportedly speeds up the Chambolle relaxation by a factor of 4-20 [11]. We qualitatively and quantitatively compare results on the specific problem of multilabel image segmentation based on a known segmentation benchmark.
2 A Statistical Framework for Segmentation
Let $I : \Omega \to \mathbb{R}^d$ denote the input image defined on the domain $\Omega \subset \mathbb{R}^2$ (in the discrete case $\Omega = \{1..N\} \times \{1..M\}$). The task of segmenting the image plane into a set of $n$ pairwise disjoint regions $\Omega_i$,
$$\Omega = \bigcup_{i=1}^{n} \Omega_i, \qquad \Omega_i \cap \Omega_j = \emptyset \;\; \forall i \neq j, \qquad (1)$$
can be solved by computing a labeling $l : \Omega \to \{1, .., n\}$ indicating which of the $n$ regions each pixel belongs to: $\Omega_i = \{x \mid l(x) = i\}$. For multilabel segmentation the energies to be minimized usually contain two terms: a dataterm $\rho_i : \Omega \to \mathbb{R}$ indicating how well the observed data fits to region $i \in \{1, ..., n\}$, and a length regularization term.
2.1 Segmentation as Bayesian Inference
In the framework of Bayesian inference, one can compute a segmentation by maximizing the conditional probability
$$\arg\max_{l} P(l \mid I) = \arg\max_{l} P(I \mid l)\, P(l). \qquad (2)$$
Assuming that the colors of all pixels are independent of each other, but (in contrast to previous interactive segmentation approaches) not independent of space, we obtain
$$P(I \mid l) = \prod_{x \in \Omega} \big( P(I(x) \mid x, l) \big)^{dx}, \qquad (3)$$
where the exponent $dx$ denotes an infinitesimal volume in $\mathbb{R}^2$ and assures the correct continuum limit. Note how the space-dependency of color likelihoods arises naturally in this formulation. It has commonly been neglected, yet we shall show in this paper that taking into account this spatial variation of color distributions based on scribble locations leads to drastic improvements of the resulting interactive segmentation process. Assuming furthermore that the color probability at location $x$ does not depend on the labeling of other pixels $y \neq x$, the product in (3) can be written as
$$P(I \mid l) = \prod_{i=1}^{n} \prod_{x \in \Omega_i} \big( P(I(x) \mid x, l(x) = i) \big)^{dx}. \qquad (4)$$
2.2 Inferring Space-Variant Color Distributions
The expression $P(I(x) \mid x, l(x) = i)$ in (4) denotes the conditional probability for observing a color value $I$ at location $x$ given that $x$ is part of region $\Omega_i$. It can be estimated from the user scribbles as follows. Let
$$S_i := \left\{ \begin{pmatrix} x_{ij} \\ I_{ij} \end{pmatrix}, \; j = 1, .., m_i \right\} \qquad (5)$$
denote the set of user scribbles for region $i$, consisting of pixel locations $x_{ij}$ with corresponding color values $I_{ij}$. The probability of a given color value $I$ at location $x$ within region $i$ is computed by a Parzen density estimator [1,16] of the form
$$\hat{P}(I, x \mid l(x) = i) = \frac{1}{m_i} \sum_{j=1}^{m_i} k \begin{pmatrix} x - x_{ij} \\ I - I_{ij} \end{pmatrix}, \qquad (6)$$
i.e. a sum of normalized kernel functions $k$ centered at each sample point in the product space of location and color. Figure 1 shows a schematic drawing of this distribution in the joint space of color and spatial coordinate estimated from a set of sample points.
Fig. 1. Schematic plot of a kernel density estimate in the joint space of location x and color I computed for a set of sample points. If space is disregarded we obtain the same density function P(I) at each pixel showing several peaks for colors, which are predominant in different object parts. Normalizing the joint distribution gives the conditional probability P(I|x) for observing a color I at a location x. At each location x we obtain a separate color distribution showing only a single color peak for the locally predominant color. This distribution is used as a data term for image segmentation.
Commonly in interactive segmentation the location of the scribbles is not taken into account and the space-independent color distribution, the marginal
$$\hat{P}(I \mid l(x) = i) = \int \hat{P}(I, x \mid l(x) = i)\, dx, \qquad (7)$$
is used to estimate color likelihoods. The marginal is plotted on the left and contains four peaks, each one for a different predominant color of the foreground object. At each location in the image, the likelihood for each of these colors follows the same marginal distribution, no matter if we are very close to a scribble or far away. In this paper, instead of the marginal the joint distribution of color and space is estimated to account for color changes in segmentation regions. Typically, the color is not independent of space. In this way, one obtains a separate color distribution at each location in the image.
Unfortunately, in the considered scenario the Parzen density estimator (6) is not guaranteed to converge to the true density since the samples are not independent and identically distributed (iid), but rather user-placed in locations which are by no means independent. To account for this dependency we employ spatially adaptive kernels. In practice, we choose isotropic Gaussian kernels with variance $\sigma$ and $\rho$ in the color and space dimension, respectively. Due to the separability of the Gaussian kernel we can write
$$k \begin{pmatrix} x \\ I \end{pmatrix} = k_{\rho(x)}(x)\, k_{\sigma}(I). \qquad (8)$$
To account for the non-uniform spatial distribution, the spatial kernel variance $\rho(x)$ depends on the image location $x$ and is chosen proportional to the distance from the $k$-th nearest scribble point $x_{v_k} \in S_i$: $\rho(x) = \alpha |x - x_{v_k}|^2$.
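To make the space-variant estimate concrete, the sketch below evaluates the conditional color likelihood at a pixel from a handful of scribble samples, using Gaussian kernels as in (8) and the adaptive spatial variance of the last formula. It is an illustration under assumptions, not the authors' implementation: colors are scalar gray values, the joint estimate (6) is normalized over color to obtain the conditional (so the spatial normalization constant cancels), a small floor guards against a zero variance at scribble pixels, and all names and numbers are hypothetical.

```python
import math


def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def spatial_variance(x, points, alpha, k_nearest, floor=1e-6):
    """rho(x) = alpha * |x - x_vk|^2, with x_vk the k-th nearest scribble point."""
    d2 = sorted(sq_dist(x, p) for p in points)
    return max(alpha * d2[min(k_nearest, len(d2)) - 1], floor)


def color_likelihood(color, x, scribbles, sigma, alpha=1.0, k_nearest=1):
    """Conditional likelihood of a scalar color at location x for one region.

    scribbles is a list of ((px, py), color) samples of that region.
    """
    rho = spatial_variance(x, [p for p, _ in scribbles], alpha, k_nearest)
    num = den = 0.0
    for p, c in scribbles:
        w = math.exp(-sq_dist(x, p) / (2.0 * rho))      # spatial kernel weight
        kc = math.exp(-(color - c) ** 2 / (2.0 * sigma)) / math.sqrt(2.0 * math.pi * sigma)
        num += w * kc
        den += w
    return num / den


# One region with a dark part on the left and a bright part on the right.
scribbles = [((0, 0), 0.2), ((1, 0), 0.2), ((10, 0), 0.9), ((11, 0), 0.9)]
left, right = (0.5, 0.0), (10.5, 0.0)
dark_left = color_likelihood(0.2, left, scribbles, sigma=0.01)
bright_left = color_likelihood(0.9, left, scribbles, sigma=0.01)
dark_right = color_likelihood(0.2, right, scribbles, sigma=0.01)
bright_right = color_likelihood(0.9, right, scribbles, sigma=0.01)
```

Near the left scribbles the dark color is far more likely than the bright one, and vice versa on the right, whereas the marginal (7) would assign both colors the same likelihood everywhere.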
3 Discrete versus Continuous Energies
There are two basic approaches for the optimization of multilabel problems: MRF approaches and PDE approaches. In the following we will review both paradigms and point out practical advantages and drawbacks.
3.1 MRF Approaches
For solving multilabel segmentation in the case of a discrete image domain we seek to minimize the following energy
$$E(l) = \sum_{i \in \Omega} D_i(l(i)) + \sum_{(i,j) \in N} S_{ij}(l(i), l(j)), \qquad (9)$$
where $D_i : \{1, .., n\} \to \mathbb{R}$ measures the data fidelity (see Section 2) and $S_{ij} : \{1, .., n\} \times \{1, .., n\} \to \mathbb{R}$ the smoothness between pixels $i$ and $j$. The smoothness is summed over all pixel pairs in the neighborhood $N$. In the simplest case, $N$ consists of the upper, lower, left and right neighbor of each pixel. This is called 4-connectivity. If the diagonal neighbors belong to $N$ as well, this is called 8-connectivity. In discrete approaches, the image is represented by a graph containing one node for each pixel and links between neighboring nodes, whose weight is associated with the smoothness term. This constitutes a Markov Random Field interpretation of the problem (9). In [20] different algorithms for the minimization of Markov Random Field approaches are compared: iterated conditional modes, graph cuts (alpha expansion and alpha-beta-swaps) and message passing algorithms such as belief propagation. Of those algorithms, alpha expansion yields the best performance results for the Potts model, which is a suitable model for the multiregion segmentation problem. Hence, in this paper we use the graph cut alpha-expansion approach [5] for optimization in the discrete case. The basic idea of graph cuts is to compute a binary partition ('cut') of the nodes in a graph that separates a source from a sink node. The sum of the edge weights which are crossed by the partition interface should be minimal among all possible separations. One can show that this problem is equivalent to computing a maximum flow, which can be done in polynomial time.
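For concreteness, the energy (9) with 4-connectivity can be evaluated as in the following sketch. It assumes a Potts smoothness term, i.e. a constant penalty whenever two neighboring labels differ, and uses 0-based labels; the function name, the weight and the tiny example are hypothetical.

```python
def potts_energy(labels, data_cost, smooth_weight=1.0):
    """E(l) = sum_i D_i(l(i)) + sum over 4-neighbors of smooth_weight * [l(i) != l(j)].

    labels[r][c] is the label of pixel (r, c); data_cost[r][c][k] is D at that
    pixel for label k. Each neighboring pair is counted once.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            energy += data_cost[r][c][labels[r][c]]
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                energy += smooth_weight          # right neighbor
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                energy += smooth_weight          # lower neighbor
    return energy


# Hypothetical 2 x 2 image with two labels.
data_cost = [[[0, 5], [0, 5]],
             [[5, 0], [4, 1]]]
energy_a = potts_energy([[0, 0], [1, 1]], data_cost)
energy_b = potts_energy([[0, 0], [1, 0]], data_cost)
```

For 8-connectivity one would add the two diagonal neighbors in the same way, typically with a different weight.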
In computer vision, graph cuts were applied for the first time by Greig, Porteous and Seheult [8]. They showed that for the problem of denoising a binary image a global optimum of the corresponding MAP formulation can be found by computing the maximum flow through a certain graph construction. Later Boykov et al. revisited the idea, exploring to a greater extent which energies can be minimized by graph cuts [5]. If the smoothness penalty constitutes a metric on the label space, an iterative method called alpha expansion can be applied. It was shown in [5] that the computed result is energetically within a factor of two of the global optimum. In the alpha expansion algorithm the multilabel problem is reduced to a sequence of binary labeling problems. The basic idea is that in each step, for a selected label α, pixels can either keep their current label or take the label α. To this end, a special graph is constructed and the min cut is computed. The steps are executed for each label α in the label set until convergence. The binary labeling problem of each step is solved to global optimality; the solution to the multilabel segmentation problem, however, is only locally optimal. In this paper we refer to the alpha expansion algorithm based on the interactive segmentation dataterm with 4-connectivity as Gc4, with 8-connectivity as Gc8. It is based on the max-flow implementation by Boykov and Kolmogorov [4].
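The move-making structure of alpha expansion can be illustrated on a tiny problem as follows. Note the deliberate simplification: the optimal binary expansion move is found here by exhaustive enumeration, which only works for a handful of pixels, whereas the actual algorithm solves it with a min cut on a specially constructed graph. The Potts smoothness term, the function names and the example data are assumptions made for illustration.

```python
from itertools import product


def energy(labels, data_cost, edges, weight):
    """Data costs plus a Potts penalty for every edge with differing labels."""
    e = sum(data_cost[i][l] for i, l in enumerate(labels))
    return e + sum(weight for i, j in edges if labels[i] != labels[j])


def expansion_move(labels, alpha, data_cost, edges, weight):
    """Best labeling in which every pixel keeps its label or switches to alpha.

    Exhaustive enumeration stands in for the min-cut computation.
    """
    best, best_e = list(labels), energy(labels, data_cost, edges, weight)
    for switch in product((False, True), repeat=len(labels)):
        cand = [alpha if s else l for s, l in zip(switch, labels)]
        e = energy(cand, data_cost, edges, weight)
        if e < best_e:
            best, best_e = cand, e
    return best, best_e


def alpha_expansion(data_cost, edges, weight, n_labels):
    labels = [0] * len(data_cost)
    current = energy(labels, data_cost, edges, weight)
    improved = True
    while improved:                      # cycle over labels until no move helps
        improved = False
        for alpha in range(n_labels):
            labels, e = expansion_move(labels, alpha, data_cost, edges, weight)
            if e < current:
                current, improved = e, True
    return labels, current


# Chain of five pixels, three labels; the middle pixel weakly prefers label 2.
data_cost = [[0, 4, 4], [0, 4, 4], [2, 3, 1], [4, 0, 4], [4, 0, 4]]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
result, result_energy = alpha_expansion(data_cost, edges, weight=1.5, n_labels=3)
```

In this example the smoothness term outweighs the weak preference of the middle pixel, so the isolated label is suppressed and the chain is split into two segments.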
3.2 PDE Approaches
In the continuous setting the optimization problem is formulated as a set of partial differential equations (PDE) based on region indicator functions. The energy in the continuous setting also consists of the dataterm and a length regularization term. In case of convex energies or relaxations the resulting problem can be solved by fast primal-dual optimization schemes. For the multilabel minimal partition problem (the Potts model) three main PDE based approaches have been proposed in parallel in recent years: the approaches by Chambolle et al. [6], Zach et al. [22] and Lellmann et al. [12]. These approaches differ in their complexity and in the tightness of the energy relaxation. None of these approaches is convex, but convex relaxations can be formulated to tackle the optimization. However, as in the discrete case these approaches only lead to locally optimal solutions.
Lellmann et al. [12] represent the $n$ regions $\Omega_i$ by the indicator function $u \in BV(\Omega, \{0, 1\})^n$,
$$u_i(x) = \begin{cases} 1, & \text{if } x \in \Omega_i \\ 0, & \text{else} \end{cases} \qquad \forall i = 1, \dots, n. \qquad (10)$$
Here BV denotes the function space of bounded variation. Based on this definition they formulated the following energy given the dataterm $\rho_i$,
$$E(u) = \sum_{i=1}^{n} \int_{\Omega} u_i \rho_i \, dx + \lambda \sum_{i=1}^{n} \int_{\Omega} g(x) |Du_i| \, dx \quad \text{s.t.} \quad \sum_{i=1}^{n} u_i(x) = 1, \; x \in \Omega, \qquad (11)$$
which is minimized. The first term punishes incoherence with the observed data, whereas the second term measures the length of the region boundaries. Here $Du_i$ denotes the distributional gradient of $u_i$, $\lambda$ balances the dataterm and the length regularization, and $g(x) = \exp(-\gamma |\nabla I(x)|)$ favors the coincidence of object and image edges. To handle the non-differentiable indicator functions, the boundary of the set indicated by $u_i$, the perimeter, can be written as the total variation
$$\mathrm{Per}(\{x \mid u_i(x) = 1\}) = \int_{\Omega} g(x) |Du_i| = \sup_{\xi_i : |\xi_i(x)| \le g(x)} \; - \int_{\Omega} u_i \, \mathrm{div}\, \xi_i \, dx, \qquad (12)$$
where $\xi_i \in C_c^1(\Omega, \mathbb{R}^2)$ denote the dual variables and $C_c^1$ the space of smooth functions with compact support. With this notation, the original energy minimization problem (11) can be relaxed to
$$\min_{u \in B} \; \sup_{\xi \in K_1} \; \sum_{i=1}^{n} \int_{\Omega} u_i \rho_i \, dx - \lambda \sum_{i=1}^{n} \int_{\Omega} u_i \, \mathrm{div}\, \xi_i \, dx, \qquad (13)$$
$$B = \Big\{ u \in BV(\Omega, [0, 1])^n \;\Big|\; \sum_{i=1}^{n} u_i = 1 \Big\}, \qquad K_1 = \Big\{ \xi \in C_c^1(\Omega, \mathbb{R}^{2 \times n}) \;\Big|\; \|\xi(x)\|_F \le g(x), \; x \in \Omega \Big\}, \qquad (14)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. For multilabel segmentation we compute the interactive segmentation dataterm described in the previous section. We will refer to the resulting approach based on the optimization scheme by Lellmann et al. as PdeL.

Chambolle et al. [6] define the following region indicator function

\[
u_i(x) = \begin{cases} 1, & \text{if } l(x) \ge i \\ 0, & \text{else} \end{cases} \qquad \forall i = 1, \dots, n. \tag{15}
\]
with 1 = u0 ≥ u1 (x) ≥ ... ≥ un (x) ≥ un+1 = 0, where u0 = 1 and un+1 = 0 are added to simplify notation. Based on this indicator function the following energy is proposed:
\[
E(u) = \sum_{i=1}^{n} \int_\Omega (u_i - u_{i+1}) \rho_i \, dx + \lambda \sum_{i=1}^{n} \int_\Omega g(x) |Du_i| \, dx \tag{16}
\]

\[
1 = u_0 \ge u_1(x) \ge \dots \ge u_n(x) \ge u_{n+1} = 0. \tag{17}
\]
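The layered encoding in (15) and the difference $u_i - u_{i+1}$ that appears in (16) can be illustrated per pixel. The following is a minimal sketch, assuming a label $l(x) \in \{1, \dots, n\}$; the function names are hypothetical and not taken from [6].

```python
def layered_indicators(label, n):
    """Return [u_0, u_1, ..., u_n, u_{n+1}] for one pixel, following Eq. (15).

    u_0 = 1 and u_{n+1} = 0 are the padding values added to simplify notation.
    """
    return [1] + [1 if label >= i else 0 for i in range(1, n + 1)] + [0]


def one_hot_from_layers(u):
    """Differences u_i - u_{i+1} for i = 1..n, as used in the dataterm of Eq. (16)."""
    return [u[i] - u[i + 1] for i in range(1, len(u) - 1)]


u = layered_indicators(3, 4)
print(u)                       # monotonically decreasing sequence
print(one_hot_from_layers(u))  # selects the dataterm of the assigned label
```

For a binary, monotonically decreasing $u$ the differences form a one-hot vector, so the first sum in (16) picks exactly the dataterm $\rho_i$ of the label assigned to the pixel.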
Since this energy is not convex due to the binary constraints on u Chambolle et al. propose the following relaxation
C. Nieuwenhuis, E. Töppe, and D. Cremers
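\[
\min_{u \in B} \, \sup_{\xi \in K_2, \, \eta \in W} \left\{ \sum_{i=1}^{n} \int_\Omega (u_i - u_{i+1}) \eta_i \, dx - \lambda \sum_{i=1}^{n} \int_\Omega u_i \operatorname{div} \xi_i \, dx \right\} \tag{18}
\]

\[
\begin{aligned}
K_2 &= \Big\{ \xi : \Omega \to \mathbb{R}^{2 \times n} \;\Big|\; \Big| \sum_{i_1 \le i \le i_2} \xi_i(x) \Big| \le g(x), \; 1 \le i_1 \le i_2 \le n \Big\}, \\
W &= \Big\{ \eta : \Omega \to \mathbb{R}^{n} \;\Big|\; |\eta_i(x)| \le \rho_i(x), \; x \in \Omega, \; 1 \le i \le n \Big\}.
\end{aligned} \tag{19}
\]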
We will refer to this optimization method applied to the problem of multilabel segmentation based on the interactive segmentation dataterm as PdeC. Both energy formulations are relaxed to obtain convex approaches, which can be globally minimized by primal-dual optimization schemes. In case of two regions the thresholding theorem applies, which states that thresholding the globally optimal solution of the relaxed problem yields a globally optimal solution of the original binary problem, independent of the chosen threshold. In the multilabel case the thresholding theorem does not hold. However, the binarized solution of the relaxed approach has been experimentally shown to lie within very tight bounds of the optimal binary solution. Hence, the proposed relaxations lead to near-optimal solutions of the original problem. The approach by Lellmann et al. is simpler than the one by Chambolle et al. since its relaxation leads to simpler constraints on the dual variables ξ. It is also faster because the number of constraints on the dual variables increases only linearly with the number of labels. In contrast, the approach by Chambolle et al. demands quadratically many constraints on the dual variables, yielding higher computation times. However, in [14] the relaxation proposed by Chambolle et al. has been shown to yield tighter bounds, meaning that the solution of the relaxed problem is closer to the global optimum of the binary problem than the relaxed result of the approach by Lellmann et al.

3.3 Advantages and Disadvantages
In applications of MRF approaches and PDEs we encountered several advantages and shortcomings, many of which are known to researchers working on these topics. In particular: Graph cuts do not require numerical tuning parameters, so they can easily be applied as a black-box algorithm without further need to understand the underlying optimization process. In contrast, PDE approaches typically require the choice of appropriate step size parameters and a termination criterion. While appropriate parameters can be determined automatically by convergence analysis and the primal-dual gap, the latter is often not straightforward to compute. The commonly used alpha expansion for solving general multilabel MRFs is based on iteratively solving binary problems. As we show in Figure 3, results may therefore depend on the order of the labels. Furthermore, graph cut approaches exhibit metrication errors due to a rather crude approximation of the Euclidean boundary length (see Section 4.4). In contrast, the PDE approaches for multilabel optimization are based on minimizing a single convex functional. They provide smooth object boundaries that do not exhibit a prominent grid bias. Furthermore, graph cuts cannot be parallelized in a straightforward manner, since the max-flow problem is P-complete, a class of problems that are probably not efficiently parallelizable [7]. In contrast, the considered PDE approaches are easily parallelizable, yielding drastic speedups over CPU algorithms.

Table 1. Appearance frequency for different label numbers in Santner's database.

Labels      2    3    4    5    6    7    8    9    13
Frequency   66   104  58   18   11   2    2    1    1
4 Comparison of Segmentation Accuracy
In this section we compare results of the proposed multilabel segmentation approaches PdeL, PdeC and Gc4. To compute the interactive dataterm from user scribbles the following parameters were used: σ = 1.2, α = 0.7, γ = 5. For the PDE approaches we set λ = 67 and the step size τ = 0.28. We used a GTX 580 for the parallel computations. For the alpha expansion we used λ = 6. Computations were carried out on an Intel Core i7 CPU 860, 2.80GHz.

4.1 Results on the Graz Benchmark
For automatic segmentation several benchmarks are available, e.g. the Berkeley database, the GrabCut database or the Pascal VOC Database. As extensively discussed in [18], these benchmarks are not suited for testing interactive segmentation. Hence, Santner et al. recently published the first benchmark for interactive scribble based multilabel segmentation containing 262 seed-groundtruth pairs from 158 natural images containing between 2 and 13 user labeled segments. The label frequencies are not uniformly distributed. Instead, small numbers such as 2, 3 or 4 labels appear frequently, whereas large numbers are rare (see Table 1). This influences the average quality and runtime values we compute later. To evaluate the segmentation accuracy Santner et al. compute the arithmetic mean of the Dice-score over all segments. It relates the overlap area of the groundtruth $\bar{\Omega}_i$ and the computed segment $\Omega_i$ to the sum of their areas

\[
\mathrm{dice}(\bar{\Omega}_i, \Omega_i) = \frac{2 \, |\bar{\Omega}_i \cap \Omega_i|}{|\bar{\Omega}_i| + |\Omega_i|}. \tag{20}
\]
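As an illustration of (20), the following minimal sketch computes the Dice-score per segment and its arithmetic mean for two flat label lists. The function names and the toy labelings are assumptions made for this example; they are not part of the benchmark code.

```python
def dice(gt_pixels, seg_pixels):
    """Dice-score of Eq. (20) for two sets of pixel indices."""
    if not gt_pixels and not seg_pixels:
        return 1.0  # convention chosen here for two empty segments
    overlap = len(gt_pixels & seg_pixels)
    return 2.0 * overlap / (len(gt_pixels) + len(seg_pixels))


def mean_dice(gt_labels, seg_labels):
    """Arithmetic mean of the Dice-score over all ground-truth segments."""
    scores = []
    for label in sorted(set(gt_labels)):
        gt = {p for p, l in enumerate(gt_labels) if l == label}
        seg = {p for p, l in enumerate(seg_labels) if l == label}
        scores.append(dice(gt, seg))
    return sum(scores) / len(scores)


# Toy example: four pixels, two segments, one mislabeled pixel.
print(mean_dice([1, 1, 2, 2], [1, 2, 2, 2]))
```

In the toy example segment 1 scores 2·1/(2+1) and segment 2 scores 2·2/(2+3), so a single wrong pixel lowers the score of both affected segments.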
The closer the Dice-score is to 1, the more accurate the segmentation. To evaluate the proposed spatially varying dataterm and the quality of the different optimization schemes we compute the average Dice-score on the benchmark, which is given in Table 2, and compare the results to Santner's approach. Santner et al. [19] show impressive results for different combinations of color and texture features in a random forest approach: RGB, HSV and CIELab colors combined with image patches, Haralick features and Local Binary Patterns (LBP). However, they neglect the locality of the scribbles by estimating a single, invariant color model for each region. In our experiments we tested the proposed approach with spatially constant and spatially varying color models on their benchmark. If we use the spatially constant model the results are comparable to those obtained by Santner et al. (RGB color information without texture). They obtain their best results combining CIELab and LBP features in a 21-dimensional vector based on a scribble brush of radius 13. We obtain better results on the benchmark by allowing for spatially varying color models. To obtain the spatially constant approach we set α to a very large value; for the space-variant approach we use α = 0.7 to obtain locally adaptive color distributions. The results summarized in Table 2 indicate that merely taking the spatial location of scribbles into account provides stronger performance improvements than a multitude of sophisticated features. The proposed approach outperforms Santner's spatially invariant dataterm with all four optimization schemes. The graph cut approaches (Gc4 and Gc8) yield better results than Santner's approach, especially in case of 8-connectivity. The PDE-based approaches yield results comparable to the graph cuts or slightly better. The relaxation proposed by Lellmann et al. (PdeL) is less tight than that by Chambolle et al. (PdeC), leading to results slightly inferior to those of PdeC.

Fig. 2. Results for the 4 compared algorithms on selected images from the Graz benchmark. a) Santner's result with indicated user scribbles, b) PdeC, c) PdeL, d) Gc4.

Table 2. Comparison of the proposed spatially varying color model with four different optimization schemes based on random forest (RF), graph cuts (Gc4, Gc8) and PDEs (PdeL, PdeC) to spatially constant color models on the Graz Benchmark. For each approach the dimension of the scribble brush and the average Dice-score is indicated.

Method                              Dim   Brush   Optim.   Dice-Score
[19], RGB                           3             RF       0.877
our approach, spatially constant    3     3       PdeL     0.872
[19], CIELab + LBP                  21    5       RF       0.917
our approach, space-variant         3     5       PdeL     0.923
[19], CIELab + LBP                  21    13      RF       0.927
our approach, space-variant         3     13      Gc4      0.929
our approach, space-variant         3     13      Gc8      0.931
our approach, space-variant         3     13      PdeL     0.931
our approach, space-variant         3     13      PdeC     0.934

4.2 Visual Results
To visually assess the quality of the three compared optimization schemes we show results on a few benchmark images in Figure 2. When inspecting the results of the three algorithms we notice only slight quality differences, e.g. in the foot of the elephant.

4.3 Ambiguities
Neither the alpha expansion algorithm nor the PDE-based approaches lead to globally optimal solutions of the multilabel problem. Hence, ambiguities can appear in the results. The alpha expansion algorithm iterates over all possible labels α and each time solves a binary graph cut problem. This process is repeated until convergence. The order of traversal influences both the quality and the runtime of the algorithm. For four regions we registered a runtime difference of up to five seconds depending on whether we iterated over labels in order 1..N or N..1 in each step of the algorithm. Results can differ locally depending on the iteration order as well. An example is shown in Figure 3 b). In contrast, when solving PDEs each indicator function $u_i$ is updated separately and the variable constraints are only enforced at the end of each iteration, so the iteration order has no impact on the result. Furthermore, since the relaxed problems (13) and (18) are convex, we will always reach their global minimum independent of the label order. However, ambiguities also occur for PDE-based approaches in the multilabel case: to obtain a segmentation, the globally optimal solution of the relaxed problem must be transformed into a binary result. In case of more than two regions this transformation is not uniquely defined and can be carried out e.g. by thresholding or maximization over the indicator functions. This leads to ambiguous results depending on the chosen method (see Figure 3 c).

Fig. 3. Metrication errors and ambiguities in optimization results on a given input image. a) Metrication error for Gc4 (top), Gc8 (center) in contrast to a smooth boundary for the PDE result (bottom), b) Ambiguous graph cut results depending on the traversed label order, c) Ambiguous PdeC results due to thresholding.

4.4 Metrication Errors
In discrete optimization metrication errors can occur (Figure 3 a). Region boundaries tend to run either horizontally, vertically or diagonally, since the boundary length is not equally penalized in different directions, e.g. in the sea gull image in Figure 2. This is especially true for regions with uncertain data fidelity. The rotationally invariant L2 norm which is optimized in the continuous case is only approximated in the discrete case. With 4-connectivity the boundary length is penalized with respect to the L1 norm. The larger the neighborhood, the closer the penalization comes to the L2 norm, at the cost of very large graphs, long computation times and high memory consumption.
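The direction dependence of the L1 penalty can be made concrete with a short calculation. The sketch below assumes a straight boundary segment of unit Euclidean length on a unit-weight 4-connected grid, so that the cut cost equals the L1 length of the segment; the function names are hypothetical.

```python
import math


def l1_length(dx, dy):
    """Boundary length as penalized with 4-connectivity (L1 norm)."""
    return abs(dx) + abs(dy)


def l2_length(dx, dy):
    """Rotationally invariant Euclidean boundary length (L2 norm)."""
    return math.hypot(dx, dy)


def overestimation(angle_deg):
    """Ratio of L1 to L2 length for a unit segment at the given angle."""
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    return l1_length(dx, dy) / l2_length(dx, dy)


for angle in (0, 15, 30, 45):
    print(angle, round(overestimation(angle), 3))
```

Axis-aligned boundaries are measured exactly, while a boundary at 45 degrees is overestimated by a factor of √2, which is why the discrete optimum prefers boundaries along the grid directions.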
5 Runtimes
Especially for interactive image segmentation, runtimes close to real-time are indispensable. One of the great advantages of PDE-based methods is that they can be parallelized and run on graphics hardware. In contrast, parallelizing the computation of the maximum flow on a graph with augmenting path algorithms is difficult and only leads to limited speedups. For the PDE approaches the runtime increases with the number of constraints on the dual variables: linearly in case of the PdeL approach (14), quadratically in case of the PdeC approach (19). Figure 4 shows the average runtime in seconds for each number of labels computed over the whole database. The average runtime for the PdeL method is 0.43 seconds in contrast to 1.27 seconds for the PdeC method, which is a factor of three higher. For alpha expansion the runtimes seem to increase linearly in the number of labels. However, the average runtime on the whole database is 1.24 seconds for 4-connectivity and 2.54 seconds for 8-connectivity, which exceeds the runtime of PdeL by factors of 2.9 and 5.9, respectively. These computation times vary with the smoothness λ, as shown for Gc4 and PdeL in Table 3.
Table 3. Average runtimes in seconds of the proposed multilabel segmentation algorithms with respect to differently scaled smoothness values. λopt is the optimal smoothness parameter with respect to the benchmark.

Method   λopt/100   λopt/10   λopt   λopt·10   λopt·100
Gc4      0.47       0.84      1.24   4.06      17.13
PdeL     0.08       0.16      0.43   1.86      5.41
Fig. 4. Comparison of runtimes for different numbers of labels for PDE optimization schemes (PdeL and PdeC) compared to the alpha expansion optimization (Gc4 and Gc8) for the problem of interactive image segmentation.
6 Conclusion
In this paper we proposed an algorithm for interactive multi-region segmentation which takes into account the spatial dimension of the user scribbles. In this way, overlapping color distributions become locally separable, allowing for weaker regularization assumptions and correct segmentations in difficult images. We provide an experimental comparison of discrete and continuous optimization approaches. While all algorithms provide similarly good qualitative results on a recently proposed benchmark, the PDE-based methods provide slightly more accuracy, partly due to an absence of metrication errors and partly due to the fact that the multilabel problem is solved by minimizing a single convex energy rather than an iterative sequence of binary problems. In particular, the approach by Lellmann et al. is close to real-time on average and runs substantially faster than the other methods.
References

1. Akaike, H.: An approximation to the density function. Ann. Inst. Statist. Math. 6, 127–132 (1954)
2. Alahari, K., Kohli, P., Torr, P.: Dynamic hybrid algorithms for map inference in discrete mrfs. Trans. Pattern Anal. Mach. Intell. 32(10), 1846–1857 (2010)
3. Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: Proc. of ICCV (2007)
4. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in computer vision. In: Figueiredo, M., Zerubia, J., Jain, A.K. (eds.) EMMCVPR 2001. LNCS, vol. 2134, pp. 359–374. Springer, Heidelberg (2001)
5. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Patt. Anal. and Mach. Intell. 23(11), 1222–1239 (2001)
6. Chambolle, A., Cremers, D., Pock, T.: A convex approach for computing minimal partitions. Technical report TR-2008-05, University of Bonn (2008)
7. Goldschlager, L., Shaw, R., Staples, J.: The maximum flow problem is log space complete for p. Theoretical Computer Science 21, 105–111 (1982)
8. Greig, D.M., Porteous, B.T., Seheult, A.H.: Exact maximum a posteriori estimation for binary images. J. Roy. Statist. Soc., Ser. B 51(2), 271–279 (1989)
9. Klodt, M., Schoenemann, T., Kolev, K., Schikora, M., Cremers, D.: An experimental comparison of discrete and continuous shape optimization methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 332–345. Springer, Heidelberg (2008)
10. Komodakis, N., Tziritas, G.: A new framework for approximate labeling via graph cuts. In: Proc. of ICCV, pp. 1018–1025 (2005)
11. Lellmann, J., Breitenreicher, D., Schnörr, C.: Fast and exact primal-dual iterations for variational problems in computer vision. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 494–505. Springer, Heidelberg (2010)
12. Lellmann, J., Kappes, J., Yuan, J., Becker, F., Schnörr, C.: Convex multiclass image labeling by simplex-constrained total variation. In: Technical Report, HCI, IWR, University of Heidelberg (2008)
13. Nieuwenhuis, C., Berkels, B., Rumpf, M., Cremers, D.: Interactive motion segmentation. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) Pattern Recognition. LNCS, vol. 6376, pp. 483–492. Springer, Heidelberg (2010)
14. Pock, T., Chambolle, A., Cremers, D., Bischof, H.: A convex relaxation approach for computing minimal partitions. In: Proc. of CVPR (2009)
15. Potts, R.B.: Some generalized order-disorder transformations. Proc. Camb. Phil. Soc. 48, 106–109 (1952)
16. Rosenblatt, F.: Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27, 832–837 (1956)
17. Rother, C., Kolmogorov, V., Blake, A.: Grab-cut: interactive foreground segmentation using iterated graph cuts. ACM Transactions on Graphics 23(3), 309–314 (2004)
18. Santner, J.: Interactive Multi-label segmentation. PhD thesis, University of Graz (2010)
19. Santner, J., Pock, T., Bischof, H.: Interactive multi-label segmentation. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part I. LNCS, vol. 6492, pp. 397–410. Springer, Heidelberg (2011)
20. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., Rother, C.: A comparative study of energy minimization methods for markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 16–29. Springer, Heidelberg (2006)
21. Unger, M., Pock, T., Trobin, W., Cremers, D., Bischof, H.: Tvseg - interactive total variation based image segmentation. In: Proc. of BMVC (2008)
22. Zach, C., Gallup, D., Frahm, J.-M., Niethammer, M.: Fast global labeling for real-time stereo using multiple plane sweeps. In: Vision, Modeling and Visualization Workshop, VMV (2008)
Detachable Object Detection with Efficient Model Selection

Alper Ayvaci and Stefano Soatto

Computer Science Department, University of California, Los Angeles

Abstract. We describe a computationally efficient scheme to perform model selection while simultaneously segmenting a short video stream into an unknown number of detachable objects. Detachable objects are regions of space bounded by surfaces that are surrounded by the medium other than for their region of support, and the region of support changes over time. These include humans walking, vehicles moving, etc. We exploit recent work on occlusion detection to bootstrap an energy minimization approach that is solved with linear programming. The energy integrates both appearance and motion statistics, and can be used to seed layer segmentation approaches that integrate temporal information on long timescales.

Keywords: Object Detection, Video Segmentation, Occlusion, Layers, Ordering Constraints, Graph Cuts, Motion Segmentation.
1 Introduction
A detached object [1] is “a layout of surfaces completely surrounded by the medium.” Being detached is a functionally important property of objects as it gives them “typical affordances like graspability”. In practice, most objects in a scene are attached to something, rather than floating in mid-air. So, rather than assuming that they are completely surrounded by the medium, one may require that their point of contact with other objects (e.g., the ground plane) change over time. The pedestrian in Fig. 3 illustrates this concept. Such objects have been called “detachable” in the sense that they could be detached or moved if sufficient force were applied. Of course, to ascertain whether an object is detachable, sufficiently informative data has to be provided. For instance, a mug resting on a table is qualitatively different than a tree resting on the ground plane, but there is no way to ascertain whether the former is “detachable” (and not glued to the table instead) until someone actually moves it to a different location. In this paper, therefore, we focus more simply on detecting objects that generate occlusions when either they or the viewer moves. This would include both a mug and a tree. For simplicity, however, we refer to both of these objects as “detachable.”
Research supported by ARO 56765, ONR N000140810414, AFOSR FA95500910427.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 191–204, 2011. © Springer-Verlag Berlin Heidelberg 2011
Detecting such objects can be thought of as a segmentation problem, in the sense that our goal is to, eventually, classify the domain of the image into a number of regions, each representing the projection of a different object in the scene. However, it is not the segmentation of a single image, since it is not possible to determine in a single image whether the object is surrounded by the medium or simply “painted” on a surface (see, e.g., the girl painted on the road surface in Fig. 4, which is clearly not “detachable”). So, multiple views are necessary. Once multiple views are available, the presence of detachable objects is primed by the presence of an occlusion boundary. But in order to determine the presence of an occlusion boundary, we must first determine the motion field that maps the domain of an image onto (a portion of) the domain of a temporally adjacent one. Occluded regions are precisely those where the motion field is not defined, for there exists no motion that carries the domain of one image onto another. Therefore, detachable object detection is not just a motion segmentation task, whereby one seeks discontinuities of the motion field, or of the optical flow that approximates it. Instead, given a sequence of images, detachable object detection entails simultaneously solving for the (unknown) regions of the image where the motion field is defined, so that the regions are co-visible in adjacent images, for an (unknown) number of objects in the scene. So, our work relates most closely to “layer segmentation” approaches to video coding [2], and in particular variational formulations, the first of which was [3,4]. Here we show that, by using local occlusion detection, we can turn a computationally complex problem into one that can be solved by means of linear programming. Our work is motivated by recent results on occlusion detection [5].

1.1 Related Work and Key Idea
As we have already indicated, the outcome of our procedure is a “segmentation” in the sense that it provides pixel-level classification and class-membership information. As such, this paper relates to a vast literature on object segmentation, for instance [6,7,8,9] and references therein. We find that such segmentation approaches work well provided that there are few independently moving objects and a reasonable initialization. This is increasingly difficult to obtain as the scene becomes more cluttered. Furthermore, those approaches are computationally complex, and typically assume that the number of objects (or layers) is known a priori. Our work builds on recent work on occlusion detection [10,5]. Using the occluded regions as “seeds,” it has been shown in [11] that the problem of detachable object detection can be cast as a supervised segmentation scheme [12,8,13], and solved with linear programming for a known number of layers in the scene. The main contribution of this manuscript is the introduction of a novel scheme for automatic model selection that respects the computational infrastructure of supervised segmentation. However, it does not require any user input, other than the setting of one free parameter that trades off complexity and fidelity as customary in any model selection scheme. Our approach is consistent with the general methodology of minimum description length [14] and instantiates a particularly simple rendition that has also been explored in the field of robust control [15]. The use of minimum description length (MDL) has a long history in vision, specifically in image segmentation, going back to [16]. However, until recently, description length minimization problems had not been addressed by combinatorial algorithms that can guarantee performance bounds [17,18] or by convex optimization techniques [19] for solving their relaxations. We focus on short-baseline video, indeed as short as three temporally adjacent frames. The results can be used to seed more global (but also more computationally expensive) batch schemes [3,4,20,21,22,23]. Other methods for layered motion segmentation [24,25,26] also take occlusions into account, but use more restrictive parametric motion models, and typically assume a fixed number of layers, or do not scale well as the number of objects increases beyond a few. Similarly, [27,28] use occlusion boundaries inferred using appearance, motion and depth cues [29,30] or T-junctions [31] to segment image sequences. However, due to their use of graph cuts [32] and normalized cuts [33], they require the number of segments to be known a priori. There is also some interest in inferring a depth map from a single image by analysis of T-junctions: [34] proposes a nonlinear iterative filter to achieve depth synthesis, while [35] infers depth by solving a hinge-loss regularized quadratic minimization problem given the T-junctions and segmentation of the image.
2 Background: From Local Occlusion Ordering to Global Consistency
Let $I : D \subset \mathbb{R}^2 \times \mathbb{R}^+ \to \mathbb{R}^+; \; (x, t) \mapsto I_t(x)$ be a grayscale time-varying image sequence defined on a domain $D$. Under the assumption of Lambertian reflection, constant illumination and co-visibility, $I_t(x)$ is related to its (forward and backward) neighbors $I_{t+dt}(x)$, $I_{t-dt}(x)$ by the usual brightness-constancy equation

\[
I_t(x) = I_{t \pm dt}(x + v_{\pm t}(x)) + n_{\pm}(x), \quad x \in D \setminus \Omega_{\pm}(t) \tag{1}
\]
where $v_{+t}$ and $v_{-t}$ are the forward and backward motion fields and the additive residual $n$ lumps together all unmodeled phenomena. Schemes such as [10,5] provide an estimate of the motion field $v$, in the co-visible region $D \setminus \Omega$, as well as of the occluded region $\Omega$. From this point on, therefore, we will assume that we are given, at each time instant $t$, both the forward (occlusion) and backward (un-occlusion) time-varying regions $\Omega_+(t)$, $\Omega_-(t)$, possibly up to some errors. From now on we drop the subscript ± for simplicity. The local complement of $\Omega$, i.e. a subset of $D \setminus \Omega$ in a neighborhood of $\Omega$, is indicated by $\Omega^c$ and can easily be obtained by inflation, Fig. 1a. It is important to note that these regions are in general multiply-connected, so $\Omega = \cup_{k=1}^{K} \Omega_k$, and each connected component $\Omega_k$ may correspond to a different occluded region. However, occlusion detection is a binary classification problem because each region of an image is either co-visible (visible in an adjacent image) or not, regardless of how many detachable objects populate the scene. In order to arrive at detachable object detection we must aggregate local depth-ordering information (from knowledge of occlusion and optical flow) into a global depth-ordering model. To this end, one can define a label field $c : D \times \mathbb{R}^+ \to \mathbb{Z}^+; \; (x, t) \mapsto c(x, t)$ that maps each pixel $x$ at time $t$ to an integer indicating the depth order. For each connected component $k$ of an occluded region $\Omega$, we have that if $x \in \Omega_k$ and $y \in \Omega_k^c$, then $c(x, t) < c(y, t)$. If $x$ and $y$ belong to the same object, then $c(x, t) = c(y, t)$. To enforce label consistency one can minimize $|c(x, t) - c(y, t)|$, by integrating it against a data-dependent measure that allows it to be violated across object boundaries, for instance $d\mu(x, y) = K(x, y) \, dx \, dy$ where

\[
K(x, y) = \begin{cases} \alpha e^{-(I_t(x) - I_t(y))^2} + \beta e^{-\|v_t(x) - v_t(y)\|_2^2}, & \|x - y\|_2 < \epsilon, \\ 0, & \text{otherwise;} \end{cases} \tag{2}
\]
where $\epsilon$ identifies the neighborhood, and α and β are the coefficients that weight the intensity and motion components of the measure. We then have

\[
\begin{aligned}
\hat{c} = \arg\min_{c : D \to \mathbb{Z}^+} \; & \int_D |c(x, t) - c(y, t)| \, d\mu(x, y) \\
\text{s.t.} \; & c(x, t) < c(y, t) \quad \forall x \in \Omega_k(t), \; y \in \Omega_k^c(t), \; k = 1, \dots, K,
\end{aligned} \tag{3}
\]

and $\|x - y\|_2 < \epsilon$. It has been shown in [11] that for a known number of layers, this problem can be translated into integer programming, by quantizing $D$ into an $M \times N$ grid-graph $G = (V, E)$ with the vertex (node) set $V$ (pixels), and the edge set $E \subseteq V \times V$ encoding adjacency of two nodes $i$ and $j \in V$ via $i \sim j$. Then the depth ordering is $c_i = c(x_i, t)$, $c_j = c(x_j, t)$, and the measure $d\mu(x_i, x_j)$ becomes a symmetric positive-definite matrix $w_{ij} = K(x_i, x_j)$ that measures the affinity between two nodes $i$, $j$. The problem (3) then becomes the search for the discrete-valued function $c : V \to \mathbb{Z}^+$,

\[
\begin{aligned}
\{\hat{c}_i\}_{i=1}^{MN} = \arg\min_{c} \; & \sum_{i \sim j} w_{ij} |c_i - c_j| \\
\text{s.t.} \; & c_j > c_i, \; i \sim j, \; x_i \in \Omega_k(t), \; x_j \in \Omega_k^c(t), \; \text{with } 1 \le c_i \le L.
\end{aligned} \tag{4}
\]
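The affinity (2) and the discrete objective in (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: pixels are given by position, intensity and a 2D flow vector, and the function names, the toy chain graph and the parameter values are hypothetical rather than taken from [11].

```python
import math


def affinity(I_x, I_y, v_x, v_y, p_x, p_y, alpha, beta, eps):
    """Kernel K(x, y) of Eq. (2) for two pixels at positions p_x and p_y."""
    if math.hypot(p_x[0] - p_y[0], p_x[1] - p_y[1]) >= eps:
        return 0.0  # pixels are not neighbors
    flow_diff_sq = (v_x[0] - v_y[0]) ** 2 + (v_x[1] - v_y[1]) ** 2
    return alpha * math.exp(-(I_x - I_y) ** 2) + beta * math.exp(-flow_diff_sq)


def labeling_cost(c, weighted_edges):
    """Objective of Eq. (4): sum of w_ij * |c_i - c_j| over neighboring nodes."""
    return sum(w * abs(c[i] - c[j]) for i, j, w in weighted_edges)


def satisfies_ordering(c, ordering):
    """Check c_j > c_i for every pair (i, j) with x_i occluded and x_j in its complement."""
    return all(c[j] > c[i] for i, j in ordering)


# Toy chain 0 - 1 - 2 with one ordering constraint between nodes 1 and 2.
edges = [(0, 1, 1.0), (1, 2, 0.2)]
print(labeling_cost([1, 1, 2], edges), satisfies_ordering([1, 1, 2], [(1, 2)]))
```

Pixels with similar intensity and similar flow receive a large affinity, so a labeling pays a high price for separating them and a low price for a depth change across an intensity or motion discontinuity.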
3 Automatic Model Selection: Formulation
The problem above, for the case of $L = 2$ objects (e.g., foreground/background), can be interpreted as binary graph cut [32,36]. Unfortunately, for $L > 2$ this is an NP-hard problem. As customary, one can relax it by dropping the integer constraint and instead allowing $c : V \to \mathbb{R}^+$, thereby turning (4) into a convex minimization problem. Therefore, we can reach the optimal solution of the relaxed problem efficiently as long as the number of layers $L$ is known. However, guessing the number of layers can have undesired consequences, as Fig. 2 illustrates. Therefore, here we introduce a novel approach to perform model selection that preserves the desirable computational properties of the relaxed version of (4). A natural criterion for model selection is to introduce a complexity cost, and then trade off complexity of the model and fidelity to the data in a minimum description length (MDL) setting [14]. In our case, because we have introduced a label field $c$, the obvious complexity cost is the largest value of $c$ in the domain $D$, that is the infinity-norm of $c$: $\|c\|_\infty \doteq \max\{|c_i|\}_{i=1}^{MN}$. This leads to the straightforward modification of the problem (4) into

\[
\begin{aligned}
\{\hat{c}_i\}_{i=1}^{MN} = \arg\min_{c} \; & \sum_{i \sim j} w_{ij} |c_i - c_j| + \gamma \|c\|_\infty \\
\text{s.t.} \; & c_j > c_i, \; i \sim j, \; x_i \in \Omega_k(t), \; x_j \in \Omega_k^c(t), \; \text{with } 1 \le c_i
\end{aligned} \tag{5}
\]

where γ is the cost for addition of each new layer. While this problem preserves the convexity properties of the original model, it is not, in this form, amenable to being solved using linear programming (LP). Therefore, we propose a modification of the problem above, obtained by introducing auxiliary variables $\{u_{ij} \mid i \sim j\}$ and σ, so that (5) can be written as

\[
\begin{aligned}
\min_{u_{ij}, c_i, \sigma} \; & \sum_{i \sim j} w_{ij} u_{ij} + \gamma \sigma \\
\text{s.t.} \; & 1 \preceq c \preceq \sigma, \\
& c_j - c_i \ge 1, \; i \sim j, \; x_i \in \Omega_k(t), \; x_j \in \Omega_k^c(t), \\
& -u_{ij} \le c_i - c_j \le u_{ij}.
\end{aligned} \tag{6}
\]

The addition of the auxiliary variables changes the structure of the original problem, and makes it amenable to deployment of a vast arsenal of efficient numerical methods.
4 Automatic Model Selection: Implementation
So far we have taken the result of whatever occlusion detection method as “correct”. Clearly, this is not realistic. So, to allow for the possibility of errors in the occlusion detection stage, we introduce slack variables $\{\xi_k\}_{k=1}^{K}$ to relax the hard constraints; this yields

\[
\begin{aligned}
\min_{u_{ij}, c_i, \sigma} \; & \sum_{i \sim j} w_{ij} u_{ij} + \lambda \sum_{k=1}^{K} \xi_k + \gamma \sigma \\
\text{s.t.} \; & 1 \preceq c \preceq \sigma, \\
& c_j - c_i \ge 1 - \xi_k, \; i \sim j, \; x_i \in \Omega_k(t), \; x_j \in \Omega_k^c(t), \\
& 0 \le \xi_k \le 1 \; \forall k, \\
& -u_{ij} \le c_i - c_j \le u_{ij},
\end{aligned} \tag{7}
\]

where λ is the penalty for violating the ordering constraints.
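The trade-off encoded in (7) can be explored on a toy instance. The sketch below is not the linear program itself: it assumes integer labels and binary slacks and minimizes the same cost by brute-force enumeration, which is only feasible for a handful of nodes. The chain graph, the weights and the values of λ and γ are hypothetical choices made for illustration.

```python
from itertools import product

# Toy chain 0 - 1 - 2: (i, j, w_ij) smoothness edges and
# (i, j, k) ordering constraints c_j - c_i >= 1 - xi_k for occluded component k.
EDGES = [(0, 1, 1.0), (1, 2, 0.2)]
CONSTRAINTS = [(0, 1, 0), (1, 2, 1)]


def solve_layers(n_nodes, edges, constraints, lam, gamma, max_label=3):
    """Brute-force minimizer of the cost in (7), restricted to integer labels."""
    n_components = 1 + max(k for _, _, k in constraints)
    best = None
    for c in product(range(1, max_label + 1), repeat=n_nodes):
        xi = [0] * n_components
        feasible = True
        for i, j, k in constraints:
            gap = c[j] - c[i]
            if gap < 0:      # would require xi_k > 1, which (7) forbids
                feasible = False
                break
            if gap == 0:     # ordering violated, pay the slack penalty
                xi[k] = 1
        if not feasible:
            continue
        # At the optimum u_ij = |c_i - c_j| and sigma = max_i c_i.
        smoothness = sum(w * abs(c[i] - c[j]) for i, j, w in edges)
        cost = smoothness + lam * sum(xi) + gamma * max(c)
        if best is None or cost < best[0]:
            best = (cost, c)
    return best


for gamma in (0.5, 2.5, 5.0):
    cost, labels = solve_layers(3, EDGES, CONSTRAINTS, lam=3.0, gamma=gamma)
    print(gamma, labels, max(labels))
```

On this instance, raising the layer cost γ makes the minimizer give up ordering constraints one at a time, starting with the one whose smoothness edge is cheapest to keep flat, so the number of layers σ drops step by step.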
Fig. 1. Effect of the increasing layer cost γ on the final outcome given (a) the occlusion cues $\Omega$ (red) and $\Omega^c$ (yellow). The number of regions $\hat{\sigma}$ is estimated as (b) 4 (background, hand-arm, body and pivot leg, swinging leg), (c) 3 (background, body, and swinging arm) and (d) 2 (whole body) respectively. Note that the pivot foot is attached to the ground, and is therefore classified as such. For a longer sequence, where the pivot foot eventually changes, both feet would be lumped with the rest of the body and classified as detachable.
Clearly, the choice of γ affects the final outcome of our algorithm. Like any model selection approach, our scheme has a free parameter that trades off complexity and fidelity. The effect of different choices of γ is illustrated in Fig. 1. Once we have the (forward-backward) occluded regions, $\Omega^{\pm}(t)$, from [5], we can bootstrap the process and solve this linear program using [37]. Note that the layers obtained may consist of multiple objects. In our implementation, we have separated each layer into distinct objects by finding the connected components at each depth level on c.

One may object to our "separation" of the problem of detachable object detection into a sequence of steps: first occlusion detection, then aggregation into layers. Also, one could object that the short baseline does not enable enforcing long-term temporal consistency, so the pivot foot of the hiker in Fig. 1 is attributed to the ground. Indeed, one could, in principle, just write a global cost functional to go from the raw data (the images) straight to the pixel-wise classification of layer depths. This, however, would not be amenable to being solved using efficient computational schemes. We are aware that such a divide-et-impera approach comes at the cost of overall optimality, but we feel this is a suitable price to pay to reduce the problem to a linear program.

As already shown in [11], simple sequential optimization using the estimate from two adjacent images to initialize the third already enables agglomerating all components of a detached object as in Fig. 3. To achieve that, we simply use the results of (7) at each instant as initialization to the optimization at the subsequent time, using the field $v^{-}(t)$. To incorporate the previous layer estimate, similar to [38], we redefine the measure such that

\[ \tilde{K}(x, y) = K(x, y) + \kappa D\big(c(x + v^{-}_{t}(x),\, t - 1),\; c(y + v^{-}_{t}(y),\, t - 1)\big), \tag{8} \]

where $D : \mathbb{R} \times \mathbb{R} \to \{0, 1\}$ is defined by

\[ D(a, b) = \begin{cases} 1, & a = b,\ a > 1,\ b > 1, \\ 0, & \text{otherwise}, \end{cases} \tag{9} \]

and κ is a forgetting factor.
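As an illustration of (8) and (9), the update can be sketched in a few lines of Python. The function and argument names are hypothetical, and the warped previous labels $c(x + v^{-}_{t}(x), t - 1)$ and $c(y + v^{-}_{t}(y), t - 1)$ are assumed to have been looked up already; this is a sketch of the formulas, not the implementation used in the experiments.

```python
def same_layer_indicator(a, b):
    """D(a, b) from (9): 1 when both labels agree and both exceed 1, else 0."""
    return 1 if (a == b and a > 1 and b > 1) else 0


def updated_affinity(k_xy, c_prev_x, c_prev_y, kappa):
    """K~(x, y) from (8).

    k_xy      -- current affinity K(x, y)
    c_prev_x  -- previous layer label at x warped by the backward field
    c_prev_y  -- previous layer label at y warped by the backward field
    kappa     -- forgetting factor (value chosen by the user)
    """
    return k_xy + kappa * same_layer_indicator(c_prev_x, c_prev_y)


# Two locations that shared layer 3 in the previous frame get a bonus of kappa.
boosted = updated_affinity(1.0, 3, 3, kappa=0.5)
# Locations with different previous labels, or with label 1, get no bonus.
unchanged = updated_affinity(1.0, 1, 1, kappa=0.5)
```

With these hypothetical inputs, `boosted` exceeds `unchanged` by exactly `kappa`, which is how the previous estimate biases the current grouping.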
5
Experiments
In our experiments, rather than solving (7) on the pixel grid, we over-segment the domain into N non-overlapping superpixels $\{s_i\}_{i=1}^{N}$ obtained using watershed such that $\bigcup_{i=1}^{N} s_i = D$, $s_i \cap s_j = \emptyset$, $\forall i \neq j$, following [29]. This is not strictly necessary, but enables simple low-level integration of color and texture cues. As in [11], the edge weight $w_{ij}$ between two neighbors is given by

\[ w_{ij} = |\partial s_i \cap \partial s_j| \left[ \alpha e^{-(\bar{I}(s_i) - \bar{I}(s_j))^2} + \beta e^{-\|\bar{v}(s_i) - \bar{v}(s_j)\|_2^2} + \tau \big(1 - Pb(s_i, s_j)\big) \right], \tag{10} \]

where $\bar{I}(s) = \frac{1}{|s|} \int_s I_t(x)\,dx$, $\bar{v}(s) = \frac{1}{|s|} \int_s v_t(x)\,dx$ and

\[ Pb(s, s') = \frac{1}{|\partial s \cap \partial s'|} \int_{\partial s \cap \partial s'} Pb(x)\,dx, \tag{11} \]
where $Pb : D \to [0, 1]$ is the probability of a location being on an edge and ∂s is the boundary of the region s ⊂ D. The edge map is acquired using a multi-cue edge detector [39]. Note that the edge features are incorporated into the computation of the weights since the superpixels are constructed based on Pb. In our experiments, we have set the parameters α, β and τ to 0.25, 0.5 and 0.25, respectively. We have used the CMU Occlusion/Object Boundary Dataset¹ [29] to evaluate our approach qualitatively and quantitatively. It includes 16 test sequences with a variety of indoor and outdoor scenes and some noise and compression artifacts. It provides ground truth object segmentation for a single reference frame in each sequence.
5.1
Qualitative Performance
Representative examples of successful detection are shown in Fig. 2 and Fig. 5. In particular, the experiments show that our approach achieves qualitatively the same performance as [11] when the latter is given the correct number of layers. However, when the number of layers fed to [11] is patently wrong, our approach outperforms it in qualitatively significant ways (Fig. 2). The sequence in Fig. 2, from [29], is too short to capture an entire walking cycle, so the person cannot be positively identified as detachable, unlike the squirrel. Using longer video sequences, however (Fig. 3), shows that we can successfully aggregate the entire person into one segment, and therefore positively detect him as a detachable object. The sequence in Fig. 4 is taken in West Vancouver, where the figure of a child is painted on the road. Unlike a real pedestrian or a car, this is not a detachable object, and is therefore not detected as such by our algorithm. Instead, the nearby car is correctly identified as one.
¹ http://www.cs.cmu.edu/~stein/occlusion_data/
Fig. 2. Sample frames from Walking Legs and Squirrel4 (first column). The second column shows the result of [27] (Figures taken from [27]. Copyright © 2008 Andrew Stein. Reprinted with permission.); the third column shows the results of [11] when the number of objects is given as L = 2 for Walking Legs and L = 5 for Squirrel4. Observe that when the wrong number of layers is provided to [11], the detected regions have errors, e.g. misclassified regions around the hand of the hiker and the tail of the squirrel. However, our approach (fourth column) does not require knowledge of the number of layers, and automatically selects the best tradeoff between complexity and fidelity, modulated by the parameter γ.
5.2
Failure Modes
The failure modes of our algorithm are attributed to four classes of phenomena. The first is associated with failure of the occlusion detection module. If an occlusion region is present but is not detected, our "pseudo-supervised" approach fails, just like [11]. This is the price to pay for having the problem decomposed into a sequence of steps. The second failure mode is common to all model selection work. Unless there is a "true" model, and the true model belongs to the class chosen for inference, there is no guarantee that the solution is unique and independent of the regularization parameter γ. Therefore, it is to be expected that our algorithm will behave in a way that depends on the value of γ chosen, although the hope is that for a sufficiently exciting data sequence the scheme will be relatively insensitive to the choice of γ. The third failure mode is precisely connected to the absence of sufficiently exciting conditions. Like all model identification approaches, in order to achieve a sensible outcome one has to ensure that the data stream is "sufficiently exciting" in the sense of eliciting all the modes of the system. In our context this means that (a) there is sufficient motion (either of the camera or of the object) for sufficient occlusion to occur. Clearly, if we have a detached object but we make an infinitesimal motion, so that two adjacent images are essentially identical, we cannot determine that the object is detached. However, at the more global level, to distinguish between truly detachable objects (e.g. a mug) and those that are planted (e.g. a tree) we would have to (b) have sufficiently exciting data that include moving the point of contact. An example of (a) is visible in Fig. 5, where the closest box is not detected as detachable. Examples of (b) include the pivot foot of the hiker in Fig. 2.
Fig. 3. Improvement in the segmentation from considering extended temporal observations. The first segmentation based on short-baseline motion (left) fails to detect the leg, since it is not moving and attached to the ground. However, integrating on longer temporal frames, during which the pivot leg is tilting forward, results in an enlargement of the detected region (center) until it covers the entire object (a person in this case) the moment the pivot is transferred and the right foot is detached from the ground. Even after the pivot is transferred and the other leg becomes grounded (right), accurate region detection is maintained under the extended temporal observations.
Fig. 4. A child figure is painted on the road in West Vancouver (a). Unlike a real pedestrian or a car, this drawing does not cause any occlusion (b). Therefore, given the image and motion features (c), our algorithm does not detect it as a detachable object while it segments the nearby car (d). The original sequence can be seen at http://reviews.cnet.com/8301-13746 7-20016169-48.html.
The last class of failures is due to violation of the model underlying the motion field estimation, namely Lambertian reflection and constant illumination. An example, the shiny bowl behind the chair, is visible in Fig. 5.
Fig. 5. Additional samples from the CMU dataset (first column), ground truth objects on these sequences (second column), results of [11] given the correct number of layers (third column), and objects detected by our algorithm, which does not require such supervision (fourth column). Observe that the two approaches yield comparable results. Note that the color coding does not represent the layers but rather the distinct components on the layer map. Failures are related to small motion and missed detection of occluded regions.
5.3
Quantitative Assessment
Our quantitative evaluation follows the lines of [40]. The covering score of a set of ground truth segments S' by a set of segments S can be defined as

\[ \mathrm{Score}(S', S) = \frac{1}{\sum_{s' \in S'} |s'|} \sum_{s' \in S'} \max_{s \in S} \frac{|s \cap s'|}{|s| + |s'|}. \tag{12} \]

Table 1. Performance of our approach on the CMU dataset computed based on the covering score (12), compared to [41], to [11] when the correct number of layers is provided, and to [11] when L is set to 2.

Sequence        Ours    [41]    [11]; L = correct    [11]; L = 2
Bench           0.89    0.67    0.89                 0.89
Car2            0.52    0.52    0.53                 0.52
Chair1          0.78    0.66    0.78                 0.67
Coffee Stuff    0.43    0.40    0.40                 0.41
Couch Color     0.63    0.40    0.72                 0.32
Couch Corner    0.95    0.93    0.96                 0.76
Fencepost       0.42    0.42    0.41                 0.38
Hand3           0.74    0.65    0.73                 0.73
Intrepid        0.85    0.55    0.66                 0.66
Post            0.98    0.98    0.98                 0.98
Rocking Horse   0.78    0.70    0.77                 0.77
Squirrel4       0.90    0.75    0.91                 0.91
Trash Can       0.75    0.73    0.75                 0.75
Tree            0.69    0.89    0.74                 0.74
Walking Legs    0.92    0.64    0.91                 0.80
Zoe1            0.72    0.71    0.72                 0.72
Note that comparing our approach to [27] is not straightforward, and possibly unfair, since the latter is an over-segmentation method where the number of segments is predetermined; our algorithm, on the other hand, performs automatic model selection. To be fair to [27], we have selected the cases where their algorithm yields a single segment, discarding all others that would negatively bias their outcome. [27] reports segmentation covering scores of 0.72 for the pedestrian, 0.84 for the tissue box and 0.71 for the squirrel, which are depicted in red in Fig. 2. By comparison, our algorithm achieves superior scores of 0.90, 0.95 and 0.90, respectively. We have also compared our method to normalized cut [33], as the superpixel graphs depicted in Fig. 4 can be partitioned using this technique. However, normalized cut also requires the number of segments to be known a priori; therefore, in our experiments, we have used the self-tuning spectral clustering proposed by [41], which addresses this limitation. Our performance on the whole dataset considering all the ground truth objects is shown in Table 1, which shows that our algorithm outperforms [41] in most of the sequences. As seen in Table 1, comparison with [11] yields comparable results when the correct number of layers L is given as input to their algorithm. However, when an incorrect number of layers is used, our algorithm performs substantially better, at a comparable computational cost. In terms of running time, once occluded regions are detected, it takes 6.3 seconds for CVX [37] to solve the linear program (7) with 310 depth ordering constraints on a frame over-segmented into 4012 superpixels.
6
Discussion
We have presented a method for performing automatic model selection in detachable object detection. It builds on prior work [11] that aggregates local occlusion information into a global ordering akin to a layer decomposition.
We have shown that automatic model selection can be performed by imposing complexity constraints in an energy minimization framework, by minimizing the maximum number of depth layers. While this problem is convex, it is not amenable to solution via linear programming. We have shown that the introduction of suitable auxiliary variables can turn this problem into a linear one that can be solved using computationally efficient schemes. We have shown the qualitative properties of our scheme, and compared it against competing schemes that, however, assume the number of layers to be known. Our scheme compares favorably in that it achieves comparable performance, at comparable computational cost, when the competing approaches are given the correct model complexity. However, it outperforms them when the given number of layers is wrong. Thus our scheme is significantly more flexible at a modest increase in computational complexity. Our approach shares the limitations of any model selection scheme, in that in general there is no "right" tradeoff between complexity and fidelity, and one can expect the algorithm to behave differently depending on the choice of layer cost. It also shares the limitation of all schemes that break down the original problem (detached object detection, in our case) into a number of sequential steps, whereby failure of the early stages of processing causes failure of the entire pipeline. The benefit that comes with this predicament is the ability to solve an otherwise very complex computational problem using efficient numerical schemes from linear programming.
References
1. Gibson, J.J.: The ecological approach to visual perception. LEA (1984)
2. Wang, J., Adelson, E.: Representing moving images with layers. IEEE Transactions on Image Processing 3, 625–638 (1994)
3. Jackson, J.D., Yezzi, A.J., Soatto, S.: Dynamic shape and appearance modeling via moving and deforming layers. In: Rangarajan, A., Vemuri, B.C., Yuille, A.L. (eds.) EMMCVPR 2005. LNCS, vol. 3757, pp. 427–438. Springer, Heidelberg (2005)
4. Jackson, J., Yezzi, A.J., Soatto, S.: Dynamic shape and appearance modeling via moving and deforming layers. Intl. J. of Comp. Vision 79(1), 71–84 (2008)
5. Ayvaci, A., Raptis, M., Soatto, S.: Occlusion detection and motion estimation with convex optimization. In: Advances in Neural Information Processing Systems (2010)
6. Cremers, D., Soatto, S.: Motion competition: a variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision 62, 249–265 (2005)
7. Huang, Y., Liu, Q., Metaxas, D.: Video object segmentation by hypergraph cut. In: Proc. of the Conference on Computer Vision and Pattern Recognition, pp. 1738–1745 (2009)
8. Bai, X., Wang, J., Simons, D., Sapiro, G.: Video SnapCut: robust video object cutout using localized classifiers. In: ACM SIGGRAPH (2009)
9. Unger, M., Mauthner, T., Pock, T., Bischof, H.: Tracking as segmentation of spatial-temporal volumes by anisotropic weighted TV. In: Proc. of the Energy Minimization Methods in Computer Vision and Pattern Recognition (2009)
10. Ince, S., Konrad, J.: Occlusion-aware optical flow estimation. IEEE Transactions on Image Processing 17, 1443–1451 (2008)
11. Ayvaci, A., Soatto, S.: Detachable object detection. Technical Report CSD100036, UCLA Computer Science Department (November 19, 2010)
12. Wang, J., Xu, Y., Shum, H., Cohen, M.: Video tooning. In: ACM SIGGRAPH (2004)
13. Grady, L.: Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1768–1783 (2006)
14. Grunwald, P., Rissanen, J.: The Minimum Description Length Principle. The MIT Press, Cambridge (2007)
15. Dahleh, M.A., Diaz-Bobillo, I.J.: Control of uncertain systems: a linear programming approach. Prentice-Hall, Englewood Cliffs (1994)
16. Leclerc, Y.: Constructing simple stable descriptions for image partitioning. International Journal of Computer Vision 3, 73–102 (1989)
17. Delong, A., Osokin, A., Isack, H., Boykov, Y.: Fast approximate energy minimization with label costs. In: Proc. of the Conference on Computer Vision and Pattern Recognition (2010)
18. Lim, Y., Jung, K., Kohli, P.: Energy minimization under constraints on label counts. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 535–551. Springer, Heidelberg (2010)
19. Yuan, J., Boykov, Y.: TV-based image segmentation with label cost prior. In: Proc. of the British Machine Vision Conference (2010)
20. Schoenemann, T., Cremers, D.: High resolution motion layer decomposition using dual-space graph cuts. In: Proc. of the Conference on Computer Vision and Pattern Recognition (2008)
21. Sun, D., Sudderth, E., Black, M.: Layered Image Motion with Explicit Occlusions, Temporal Consistency, and Depth Ordering. In: Advances in Neural Information Processing Systems (2010)
22. Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 282–295. Springer, Heidelberg (2010)
23. Pawan Kumar, M., Torr, P., Zisserman, A.: Learning layered motion segmentations of video. International Journal of Computer Vision 76, 301–319 (2008)
24. Irani, M., Peleg, S.: Motion analysis for image enhancement: Resolution, occlusion, and transparency. Journal of Visual Communication and Image Representation 4, 324–324 (1993)
25. Jepson, A.D., Fleet, D.J., Black, M.J.: A layered motion representation with occlusion and compact spatial support. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 692–706. Springer, Heidelberg (2002)
26. Ogale, A., Ferm, C., Aloimonos, Y.: Motion segmentation using occlusions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 988–992 (2005)
27. Stein, A., Stepleton, T., Hebert, M.: Towards unsupervised whole-object segmentation: Combining automated matting with boundary detection. In: Proc. of the Conference on Computer Vision and Pattern Recognition (2008)
28. Apostoloff, N., Fitzgibbon, A.: Automatic video segmentation using spatiotemporal T-junctions. In: Proc. of the British Machine Vision Conference (2006)
29. Stein, A., Hebert, M.: Occlusion boundaries from motion: low-level detection and mid-level reasoning. International Journal of Computer Vision 82, 325–357 (2009)
30. He, X., Yuille, A.: Occlusion boundary detection using pseudo-depth. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 539–552. Springer, Heidelberg (2010)
31. Apostoloff, N., Fitzgibbon, A.: Learning Spatiotemporal T-Junctions for Occlusion Detection. In: Proc. of the Conference on Computer Vision and Pattern Recognition (2005)
32. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 1222–1239 (2002)
33. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888–905 (2002)
34. Morel, J., Salembier, P.: Monocular Depth by Nonlinear Diffusion. In: Proc. of the Indian Conference on Computer Vision, Graphics & Image Processing (2008)
35. Amer, M., Raich, R., Todorovic, S.: Monocular Extraction of 2.1D Sketch. In: Proc. of the International Conference on Image Processing (2010)
36. Sinop, A.K., Grady, L.: A seeded image segmentation framework unifying graph cuts and random walker which yields a new algorithm. In: Proc. of the International Conference on Computer Vision (2007)
37. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 1.21 (2010), http://cvxr.com/cvx
38. Freedman, D., Zhang, T.: Interactive graph cut based segmentation with shape priors. In: Proc. of the Conference on Computer Vision and Pattern Recognition (2005)
39. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 530–549 (2004)
40. Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: From contours to regions: An empirical evaluation. In: Proc. of the Conference on Computer Vision and Pattern Recognition (2009)
41. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in Neural Information Processing Systems (2004)
Curvature Regularization for Curves and Surfaces in a Global Optimization Framework
Petter Strandmark and Fredrik Kahl
Centre for Mathematical Sciences, Lund University, Sweden
{petter,fredrik}@maths.lth.se
Abstract. Length and area regularization are commonplace for inverse problems today. It has however turned out to be much more difficult to incorporate a curvature prior. In this paper we propose several improvements to a recently proposed framework based on global optimization. We identify and solve an issue with extraneous arcs in the original formulation by introducing region consistency constraints. The mesh geometry is analyzed both from a theoretical and experimental viewpoint and hexagonal meshes are shown to be superior. We demonstrate that adaptively generated meshes significantly improve the performance. Our final contribution is that we generalize the framework to handle mean curvature regularization for 3D surface completion and segmentation.
1
Curvature in Vision
The problem we are interested in solving amounts to minimizing the following energy functional:

\[ E(R) = \int_{R} g(x)\,dx + \int_{\partial R} \big(\lambda + \gamma \kappa(x)^2\big)\,dA(x), \tag{1} \]
where R is the 2D (or 3D) foreground region with boundary ∂R. Here g(x) is the data term, which may take many forms depending on the application, λ is a positive weighting factor for length (or area) regularization, and γ controls the amount of curvature regularization, denoted κ. Note that the domain may be a 2D image region or a 3D region. In the former case, the boundary is a curve and the notion of curvature is the usual one, while in the latter, the boundary is a surface and κ refers to the mean curvature. Second order priors like curvature are important for many vision applications, such as stereo [1]. In image segmentation, experiments have shown that curvature regularization is able to capture thin, elongated structures [2,3] where standard length-based regularizers would fail. Curvature has also been identified as a key factor in human perception based on psychophysical experiments on contour completion [4] and there is evidence that cells in the visual cortex detect curvature [5]. Still, most segmentation-based approaches in computer vision do not use curvature information. This is in contrast to length or area regularity, which do play an important role. One of the reasons for this is that curvature regularity is harder to incorporate in a global optimization framework.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 205–218, 2011. © Springer-Verlag Berlin Heidelberg 2011
Note that curvature regularity is fundamentally different from length or area regularity. While, for example, length regularization prefers shorter boundaries, there is no such bias in curvature regularization. In fact, due to a famous theorem of Werner Fenchel, we know that the integral of the absolute curvature for any closed convex plane curve is equal to 2π. In differential geometry, energy functionals of the type in (1) have been studied for a long time. In the surface case, the functional is known as the Willmore energy [6]. It gives a quantitative measure of how much a given surface deviates from a round sphere. Local descent techniques have been derived for minimizing (1), cf. [7], but they are very dependent on a good initialization. We propose several improvements to the current state of the art of curvature regularization. This gives both faster running times and smaller memory requirements, and hence important steps are taken to make curvature regularization more practical. More specifically, we (i) solve an identified issue with extraneous arcs while keeping the relaxation tight. We also show how to obtain (ii) better suited tessellations of the domain, and (iii) higher resolution by using adaptive meshes. In turn, this opens up the possibility of applying curvature regularization to surfaces in $\mathbb{R}^3$. To our knowledge, this is the first published work that has been able to globally optimize functionals of squared mean curvature. There are a number of application problems that can be modeled by the energy functional in (1). In this paper, we concentrate on improving the methodology for optimizing the functional and we compare with current state-of-the-art methods, cf. [2,3,19]. Experimental results are mainly given for segmentation, but other applications include surface completion [8] and inpainting [3,9].
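The statement about total absolute curvature has a simple discrete analogue that can be checked numerically: for a closed polygon, the absolute turning angles at the vertices play the role of the integrated absolute curvature. The sketch below is only an illustration under that assumption (the function name and the two test polygons are hypothetical). It shows that a convex polygon attains exactly 2π regardless of its size, while a non-convex one exceeds it.

```python
import math


def total_absolute_turning(polygon):
    """Sum of absolute turning angles at the vertices of a closed polygon.

    polygon -- list of (x, y) vertices in traversal order, without repeating
               the first vertex at the end.
    """
    n = len(polygon)
    total = 0.0
    for i in range(n):
        x0, y0 = polygon[i - 1]
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        ux, uy = x1 - x0, y1 - y0  # incoming edge direction
        vx, vy = x2 - x1, y2 - y1  # outgoing edge direction
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        total += abs(math.atan2(cross, dot))  # signed turn, made absolute
    return total


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
big_square = [(0, 0), (10, 0), (10, 10), (0, 10)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]

convex_total = total_absolute_turning(square)         # 4 turns of pi/2
scaled_total = total_absolute_turning(big_square)     # unchanged by scaling
nonconvex_total = total_absolute_turning(l_shape)     # 6 turns of pi/2
```

The two squares give the same value although their perimeters differ by a factor of ten, which is the sense in which this quantity carries no bias toward shorter boundaries.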
2
Length and Area Regularization
The basis for our work is the discrete differential geometry framework developed by Sullivan in [10] and Grady in [11] for computing minimal surfaces and shortest paths. The goal is to compute a discrete approximation of the continuous functional in (1). We will recast the problem as an integer linear program and solve it via LP relaxation. In this section we limit the exposition to the standard case without the curvature term (corresponding to γ = 0). Interestingly, this integer linear program can be shown to be totally unimodular and hence the LP relaxation will be tight. The method is based on tessellating the domain of interest into a so-called cell complex, a collection of non-overlapping basic regions whose union gives the original domain. Several types of tessellations are possible. Some examples are given in Fig. 3 for 2D and Fig. 5 for 3D meshes. Typical choices are square meshes (2D), resulting in 4-connectivity and cube meshes (3D) giving 6-connectivity. To mimic 8-connectivity, pixels are subdivided into four triangular regions each. We will elaborate more on this issue in Section 4. The boundaries of 2D regions are called edges, and the boundaries of 3D regions are called facets. To make this approach work, it is necessary to consider both possible orientations of each edge and facet.
Fig. 1. The line pair variable yij and its four incident region variables xa , xb , xc and xd . The four region variables may coincide for some edge pairs
In the integer linear program, there are two sets of binary variables, one reflecting regions and the other related to boundaries. For each basic region, a binary variable reflects whether the region belongs to the foreground or the background. Let $x_i$, $i = 1, \dots, m$ denote these binary variables, where m is the number of basic regions. The region integral in (1) is now easily approximated by a linear objective function of the form $\sum_{i=1}^{m} g_i x_i$. In this paper we let $x_i$ denote region variables and $y_i$ boundary variables (lines in 2D and facets in 3D). The length/area term in (1) is then represented by $\lambda \sum_i \ell_i y_i$, where $\ell_i$ denotes the length of edge i, and similarly in 3D with areas $a_i$. To enforce consistency between the region and boundary variables, surface continuation constraints [3] are used in the 2D case. We use completely analogous constraints (8) in three dimensions. When considering surface completion, region variables are not needed and linear constraints of the type (5) may be used to enforce consistency of the surface [11].
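To make the discretized objective concrete, the following sketch evaluates $\sum_i g_i x_i + \lambda \sum_i \ell_i y_i$ for a given labeling. It rests on several simplifying assumptions that are not part of the formulation above: the mesh is a regular square grid with unit edge length, the boundary variables are derived directly from the region labels instead of being optimized, and edges on the outer border of the grid are ignored. All names are hypothetical.

```python
def discrete_energy(labels, g, lam, edge_len=1.0):
    """Data term plus length term for a binary labeling of a square mesh.

    labels   -- 2D list of region variables x_i in {0, 1}
    g        -- 2D list of data costs g_i, same shape as labels
    lam      -- weight of the length regularization
    edge_len -- length of every mesh edge (uniform by assumption)
    """
    rows, cols = len(labels), len(labels[0])
    data = sum(g[r][c] * labels[r][c] for r in range(rows) for c in range(cols))

    boundary = 0.0
    for r in range(rows):
        for c in range(cols):
            # An edge is active when it separates foreground from background.
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                boundary += edge_len
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                boundary += edge_len
    return data + lam * boundary


# A single foreground cell in the middle of a 3x3 grid.
labels = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
g = [[-1.0] * 3 for _ in range(3)]
energy = discrete_energy(labels, g, lam=0.5)  # data -1.0, four active edges
```

In the actual integer program both $x$ and $y$ are unknowns tied together by the consistency constraints; this helper only shows what the linear objective measures once they agree.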
3
Curvature Regularization
To be able to handle curvature regularization, pairs of boundary variables are introduced. We denote these pairs by $y_{ij}$. Schoenemann et al. [3] described how to introduce boundary continuation constraints to ensure that an actual boundary curve is formed. Without the latter constraint, only straight line pairs would be used. Having introduced the line pair variables, the last term in (1) may also be represented as a linear function: $\gamma \sum_{i,j} b_{ij} y_{ij}$. The coefficients $b_{ij}$ used by [12,3] were

\[ b_{ij} = \min\{\ell_i, \ell_j\} \left( \frac{\alpha}{\min\{\ell_i, \ell_j\}} \right)^{p}, \tag{2} \]

where α is the angle difference between the two lines. We use p = 2 exclusively in this paper.
3.1
Avoiding Extraneous Arcs
The constraints introduced by [3] admit too many feasible solutions, which is discussed in the recent work [19]. The problem is illustrated in Fig. 2a, where
(a) Using the original constraints from [3]. The small black regions to the right have their boundary costs reduced by the large arcs of extra boundaries.
(b) Simply requiring that yij + yji ≤ 1 would work if the variables were integers, but causes the LP relaxation to output a fractional solution.
(c) Result using the additional constraints (4). The LP-relaxation output an integral solution.
Fig. 2. Segmentation with and without region consistency constraints. A very crude mesh was used to make the visualization clearer. Gray scale polygons indicate region variables and red lines indicate edge pair variables. Gray lines show the mesh used. Parameters: λ = 0, γ = 300000. The time to solve the problem decreased by adding the extra constraints: 0.251s vs. 3.402s for the original problem.
sharp corners are avoided by introducing large, extra curves which, due to the nonexistent length penalty, have low cost. This solution is integral and optimal in the original formulation, because along the spurious large arcs both $y_{ij}$ and $y_{ji}$ are active. The result is an underestimation of the boundary curvature, in turn resulting in a non-optimal segmentation. Fig. 2a has several small regions which do not appear in the correct Fig. 2c. The solution seems simple: add the constraints $y_{ij} + y_{ji} \le 1$. This would indeed solve the problem if the variables could be restricted to be integral, but we have found that in practice these constraints give a fractional solution with even more spurious arcs (Fig. 2b). Instead, we propose new linear constraints in addition to the constraints in [3]:

Region consistency constraints. Consider a line pair variable $y_{ij}$ and call its four incident regions $x_a$, $x_b$, $x_c$ and $x_d$, located as shown in Fig. 1a. If $x_a = x_b = 1$ or $x_a = x_b = 0$, the line pair should not be active. Similarly for $x_c$ and $x_d$. This can be linearly encoded as

\[ x_a + x_b + y_{ij} + y_{ji} \le 2, \qquad -x_a - x_b + y_{ij} + y_{ji} \le 0. \tag{3} \]
Similar constraints hold for $x_c$ and $x_d$. All in all, four new constraints are introduced for each pair of edges. It might be the case that $x_a$ and $x_c$ or $x_b$ and $x_d$ coincide, cf. Fig. 1b. The constraint (3) still looks the same.
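The effect of the two inequalities in (3) on integral solutions can be verified by brute force. The short enumeration below is an illustrative sketch (the function name is hypothetical): it lists every binary assignment of $(x_a, x_b, y_{ij}, y_{ji})$ that satisfies both inequalities.

```python
from itertools import product


def satisfies_region_consistency(xa, xb, yij, yji):
    """Check the two inequalities of (3) for one assignment."""
    return (xa + xb + yij + yji <= 2) and (-xa - xb + yij + yji <= 0)


# All binary assignments (xa, xb, yij, yji) allowed by (3).
feasible = [v for v in product((0, 1), repeat=4)
            if satisfies_region_consistency(*v)]

for xa, xb, yij, yji in feasible:
    print(f"xa={xa} xb={xb} yij={yij} yji={yji}")
```

Of the sixteen assignments, eight remain. Whenever the two regions carry the same label, both orientations of the line pair are forced to zero, and when the labels differ at most one orientation can be active, so for integral variables (3) also implies $y_{ij} + y_{ji} \le 1$.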
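To reduce the total number of constraints, we can combine the constraints into four constraints per edge k:

\[
\begin{aligned}
x_{k;1} + x_{k;2} + \sum_{kj \text{ a pair}} y_{kj} &\le 2, \\
-x_{k;1} - x_{k;2} + \sum_{kj \text{ a pair}} y_{kj} &\le 0, \\
x_{k;1} + x_{k;2} + \sum_{jk \text{ a pair}} y_{jk} &\le 2, \\
-x_{k;1} - x_{k;2} + \sum_{jk \text{ a pair}} y_{jk} &\le 0.
\end{aligned}
\tag{4}
\]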
Here $x_{k;1}$ and $x_{k;2}$ denote the two regions adjacent to edge k. The first two constraints sum over all line pairs starting with edge k and the last two sum over all pairs ending with edge k. Fig. 2c shows the result with these additional constraints, where the boundary is now consistent with the region variables. As a bonus, the new constraints reduced the time required to solve the problem to about 7% of the original. We can also see from Fig. 2 that both before and after the additional constraints, the optimal solution has its region variables equal or very close to 0 or 1.
3.2
Curvature of Surfaces
Each facet in our 3D mesh is associated with a variable in $y = (y_1, \dots, y_{2n})$, with areas $a = (a_1, \dots, a_{2n})$. There are twice as many variables as facets, because each facet is associated with two variables, one for each orientation. The two are distinguished by (arbitrarily) assigning a normal to each face in the mesh. With the matrix B as defined in [11], the optimization problem for surface completion with area regularization is

\[ \min_{y}\ \lambda a^{T} y \quad \text{subject to} \quad By = 0; \quad y \in \{0, 1\}^{2n}; \quad y_k = 1,\ k \in K, \tag{5} \]

where

\[ B_{e, y_i} = \begin{cases} +1, & \text{if edge } e \text{ borders } y_i \text{ with coherent orientation}, \\ -1, & \text{if edge } e \text{ borders } y_i \text{ with anti-coherent orientation}, \\ 0, & \text{otherwise}. \end{cases} \]
K is the set of facets that are supposed to be part of the minimal surface a priori. We now extend this formulation to support curvature by introducing face pairs. Each pair of facets in the mesh with an edge in common are associated with two variables {yij } (one for each orientation). Enforcing consistency between the face variables and the variables corresponding to the pairs of faces can be done with linear constraints:
Surface continuation constraints. For each oriented facet k and each one of its edges e we add the following constraint:

\[ y_k = \sum_{ij \text{ with edge } e} d_{k,ij}\, y_{ij}. \tag{6} \]

The sum is over all pairs ij with edge e in common. The indicator $d_{k,ij}$ is 1 if facet k is part of the pair ij. Having introduced the facet pairs, we follow [13] and associate them with a cost $b_{i,j}$, approximating the mean curvature (compare with (2)):

\[ b_{i,j} = \frac{3 \|e_{i,j}\|^2}{2(a_i + a_j)} \left( 2 \cos \frac{\theta_{i,j}}{2} \right)^{2}, \tag{7} \]

where $\theta_{i,j}$ is the dihedral angle between the two facets in the pair and $\|e_{i,j}\|$ is the length of their common edge. The objective function we are minimizing is then $\rho \sum_i a_i y_i + \sigma \sum_{i,j} b_{i,j} y_{i,j}$, subject to the constraints in (5) and (6). This approximation is not perfect; for example, it will not give the correct approximation for saddle points. However, it measures how much the surface bends and fulfills a couple of requirements listed by [13]. Segmentation, as opposed to surface completion, requires variables for each volume element in order to incorporate the data term. Additional consistency constraints are then required:

Volume continuation constraints. For each facet k,

\[ \sum_i b_{k,i}\, y_i + \sum_i g_{k,i}\, x_i = 0, \tag{8} \]

where $b_k$ indicates whether the facet $y_k$ is positively or negatively incident w.r.t. the chosen face normal. $g_{k,i}$ is 1 if the volume element $x_i$ is positively incident (the face normal points towards its center), −1 if it is negatively incident and 0 otherwise. Both sums have two non-zero terms.
3.3
Pseudo-Boolean Optimization
Solving the discrete optimization problem does not have to be done using a linear program. It is also possible to use discrete optimization methods such as roof duality [14]. Each edge pair is represented as a 3- or 4-clique in the energy minimization. This formulation has the advantage that it readily carries over to three dimensions. El-Zehiry and Grady [2] used 3-cliques for minimizing curvature functionals and their formulation is equivalent to [3] for 4-connected grids. This is because in a 4-connected grid, only configurations of the type in Fig. 1b are present. For all connectivities higher than 4, many edge pairs are adjacent to 4 regions, see Fig. 1a. The cost of an edge pair can be written as:
$$ b_{i,j} \big( x_a x_c (1 - x_b)(1 - x_d) + (1 - x_a)(1 - x_c)\, x_b x_d \big). \tag{9} $$

Representing this, however, requires extra nodes to be added, since the degree-4 term does not disappear.
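As a quick check of (9), the following Python sketch evaluates the edge-pair cost for all 16 binary assignments and recovers the coefficient of the degree-4 monomial. The function names `edge_pair_cost` and `quartic_coefficient` are illustrative and do not come from the paper; the roles of x_a, ..., x_d are taken from (9) as printed.

```python
from itertools import product


def edge_pair_cost(b_ij, xa, xb, xc, xd):
    """Cost (9): b_ij is paid only when the four labels alternate."""
    return b_ij * (xa * xc * (1 - xb) * (1 - xd) + (1 - xa) * (1 - xc) * xb * xd)


def quartic_coefficient(b_ij):
    """Coefficient of xa*xb*xc*xd in the multilinear expansion of (9).

    For a pseudo-Boolean function f, the coefficient of the full monomial is
    the alternating sum of f over all assignments (Moebius inversion).
    """
    return sum(
        (-1) ** (4 - sum(x)) * edge_pair_cost(b_ij, *x)
        for x in product((0, 1), repeat=4)
    )


# Assignments (xa, xb, xc, xd) for which the edge pair is charged.
nonzero = [x for x in product((0, 1), repeat=4) if edge_pair_cost(1.0, *x) != 0]
```

Only the two alternating assignments carry a cost, and the degree-4 coefficient evaluates to twice b_ij, so it cannot vanish for a pair with non-zero cost. This is consistent with the remark that extra nodes are needed.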
4 Tessellations
The mesh used for the segmentation can be created in a number of ways. The quality of the approximation depends on how many different straight lines can be represented by the mesh, since a larger choice of line slopes allows the mesh to approximate a continuous curve more closely. Fig. 3 shows some possible meshes and the straight lines they admit. If a mesh allows n possible straight line directions, it is referred to as n-connected.

4.1 Hexagonal Meshes
Hexagonal meshes have long been studied for image processing [15]. One characterizing fact of hexagons is that they are the optimal way of subdividing a surface into regions of equal area while minimizing the sum of the boundary lengths [16]. The fact that is more important to us is the neighborhood structure. In a hexagonal lattice every region has 6 equidistant neighbors. When approximating curvature we would like to represent as many different straight lines as possible and we would like the maximum angle between them to be small, as that gives us a better approximation of a smooth curve [12]. The neighborhood structure of the hexagonal mesh allows for similar performance (number of lines and angle between them) while using fewer regions. This is illustrated in Fig. 3, where three crude meshes and three finer meshes are shown. The meshes in Figs. 3d and 3f have similar maximal angle between the possible straight lines, but the hexagonal mesh achieves this with fewer regions due to the favorable intersection pattern of the lines. This suggests that hexagonal meshes can achieve the same accuracy as the meshes (c) and (d) used in [3] with a significantly smaller linear program.

Fig. 3. Different types of grids: (a) 1 region and 2 lines, (b) 6 regions and 9 lines, (c) 4 regions and 6 lines, (d) 31 regions and 52 lines, (e) 21 regions and 44 lines, (f) 12 regions and 18 lines. The maximum angle between the possible straight lines is 90° in (a), 60° in (b) and 45° in (c). Meshes (d), (e) and (f) have about 27°, 37° and 30° as their maximum angle, respectively.

Calculating the data term. With the introduction of the hexagonal grid, every region is no longer contained within a single pixel. Some regions will partly overlap more than one pixel. The data term for the region $R_k$ is the integral of g over that region:

$$ g_k = \int_{R_k} g(x)\, dx. \tag{10} $$
The data term is a function on a continuous domain. However, because it arises from a measured image, it will be piecewise constant on the n × m square image pixels. If each pixel is subdivided into regions, the data term for each region is simply the area of the region multiplied with the data term for the pixel. In the general case, the data term for $R_k$ is computed as:

$$ g_k = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} g\!\left(i + \tfrac{1}{2},\, j + \tfrac{1}{2}\right) \cdot \operatorname{area}(R_k \cap p_{ij}), \tag{11} $$

where $p_{ij}$ is the square representing pixel (i, j):

$$ p_{ij} = \{ (x, y) \mid i \le x \le i + 1,\ j \le y \le j + 1 \}. \tag{12} $$
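The sum (11) can be sketched in a few lines of Python. This is not the GPC-based implementation used for the experiments: as a stand-in it clips the region against each pixel square with Sutherland-Hodgman steps, which is sufficient for convex regions such as triangles, rectangles and hexagons. The function names and the convention that `g[i][j]` stores the value of pixel (i, j), with i along x, are assumptions made for this example.

```python
import math


def polygon_area(poly):
    """Shoelace formula for a simple polygon given as a list of (x, y)."""
    return 0.5 * abs(sum(
        x0 * y1 - x1 * y0
        for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1])
    ))


def clip(poly, axis, bound, keep_upper):
    """One Sutherland-Hodgman step: keep coord >= bound (or <= bound)."""
    def inside(p):
        return p[axis] >= bound if keep_upper else p[axis] <= bound

    out = []
    for p, q in zip(poly, poly[1:] + poly[:1]):
        if inside(p):
            out.append(p)
        if inside(p) != inside(q):
            # The edge crosses the clipping line; add the crossing point.
            t = (bound - p[axis]) / (q[axis] - p[axis])
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def data_term(region, g):
    """Evaluate (11): pixel value times overlap area, summed over pixels."""
    xs, ys = [p[0] for p in region], [p[1] for p in region]
    total = 0.0
    # Only pixels inside the bounding box of the region can overlap it.
    for i in range(max(0, math.floor(min(xs))), min(len(g), math.ceil(max(xs)))):
        for j in range(max(0, math.floor(min(ys))), min(len(g[0]), math.ceil(max(ys)))):
            piece = list(region)
            for axis, bound, keep_upper in ((0, i, True), (0, i + 1, False),
                                            (1, j, True), (1, j + 1, False)):
                piece = clip(piece, axis, bound, keep_upper)
            total += g[i][j] * polygon_area(piece)
    return total
```

For example, the triangle with corners (0, 0), (2, 0), (0, 2) covers one pixel completely and two pixels by half, so `data_term` weights the three pixel values by 1, 0.5 and 0.5.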
Calculating this sum requires a large number of polygon intersections to be computed. For this we used the General Polygon Clipper library (GPC) from the University of Manchester.

4.2 Adaptive Meshes
The memory requirements for solving the linear programs arising from the discretizations are very large. Each pair of connected edges introduces two variables. Linear programs are typically solved using the simplex method or interior point methods, both of which require a substantial amount of memory for our problems. As one example, a problem with 131,072 regions and 1,173,136 edge pairs required about 2.5 GB of memory to solve using the Clp solver. For this reason, it is desirable to keep the size of the mesh small. However, a fine mesh is needed to be able to approximate every possible curve. The solution to this conflict of interest is to generate the mesh adaptively, giving it high resolution only where the segmentation boundary is likely to pass through. Adaptive meshes have previously been considered for image segmentation in the level-set framework [17] and in combinatorial optimization of continuous functionals [18]. The mesh is refined using an iterative process. First, a single region is put into a priority queue. Then regions are removed from the priority queue and subdivided into smaller regions which are put back into the queue. The region which most urgently needs to be split is removed first from the priority queue.
Fig. 4. Adaptive meshes can be constructed by recursively subdividing basic shapes into several similar shapes and finally adding the extra connectivity. (a) 16-connectivity by subdividing rectangles. (b) 12-connectivity by subdividing triangles.

    Start with q an empty priority queue
    R ← (0, 0, w, h)
    Add R to q with priority = score(R)
    while size(q) < L do
        Remove R from q
        Split R into R1 . . . Rk
        Add R1 . . . Rk to q with score(R1) . . . score(Rk)
    end while

Both square and triangular basic shapes can be split up into four identical shapes similar to the original one. Thus, for all adaptive meshes k = 4. The score function can be chosen in many different ways. One way is to use the squared deviation from the mean of each region, i.e.:

$$ \operatorname{score}(R) = \int_R \big( I(x) - \mu(R) \big)^2 \, dx, \tag{13} $$

where $\mu(R) = \frac{1}{|R|} \int_R I(x)\, dx$. This way, regions where the data term varies a lot will be split before regions which have a uniform data term. The score is not normalized, because otherwise many very small regions would tend to have a big score. The integrals may be computed in the same manner as (11) for images with square pixels, resulting in the computation of polygon intersections.
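The refinement loop and the score (13) can be sketched in Python as follows. This is a minimal illustration rather than the code used for the experiments, under these assumptions: regions are pixel-aligned squares `(x, y, w, h)`, the image is a square list of lists whose side is a power of two, and refinement stops early once the most urgent region is uniform. That last safeguard and the names `score` and `adaptive_mesh` are not part of the pseudocode above.

```python
import heapq


def score(img, region):
    """Unnormalized squared deviation from the region mean, as in (13)."""
    x, y, w, h = region
    vals = [img[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    mu = sum(vals) / len(vals)
    return sum((v - mu) ** 2 for v in vals)


def adaptive_mesh(img, max_regions):
    """Split the highest-scoring region into four until max_regions is reached."""
    root = (0, 0, len(img[0]), len(img))
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-score(img, root), root)]
    while len(heap) < max_regions:
        neg, (x, y, w, h) = heapq.heappop(heap)
        if neg == 0 or w < 2 or h < 2:
            # Nothing left that is worth (or capable of) splitting.
            heapq.heappush(heap, (neg, (x, y, w, h)))
            break
        hw, hh = w // 2, h // 2
        for child in ((x, y, hw, hh), (x + hw, y, w - hw, hh),
                      (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh)):
            heapq.heappush(heap, (-score(img, child), child))
    return [region for _, region in heap]
```

As in the pseudocode, each iteration removes one region and adds four, so the final count can exceed the limit by up to two regions. On an image that is constant except for one pixel, only the quadrant containing that pixel is refined further.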
4.3 Tetrahedrons

There are many choices of tessellations in three dimensions. We have taken a quite simple approach and divided each unit cube into five tetrahedrons, as shown in Fig. 5. This gave us enough different planes to demonstrate the global optimization of mean curvature.
(a) One unit cube.
(b) Eight unit cubes in a 2 × 2 × 2 mesh.
Fig. 5. Interactive figure. Each unit cube is split into 5 tetrahedrons. This is the type of mesh used for our experiments in 3D. When stacking several, every other cube has to be mirrored in order to fit.
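A five-tetrahedron split of a unit cube and the mirroring of every other cube can be sketched as below. The construction is an assumption chosen to match the description in Fig. 5, not code from the paper: the central tetrahedron uses the four cube corners with even coordinate sum, and each remaining corner forms a tetrahedron with its three neighbors. Deciding parity in global coordinates mirrors neighboring cubes automatically. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product


def cube_tetrahedra(i=0, j=0, k=0):
    """Five tetrahedra filling the unit cube with lowest corner (i, j, k)."""
    corners = [(i + dx, j + dy, k + dz) for dx, dy, dz in product((0, 1), repeat=3)]
    even = [c for c in corners if sum(c) % 2 == 0]
    odd = [c for c in corners if sum(c) % 2 == 1]
    tets = [tuple(even)]  # central tetrahedron
    for c in odd:
        # Each odd corner is adjacent to exactly three even corners.
        nbrs = [e for e in even if sum(abs(a - b) for a, b in zip(c, e)) == 1]
        tets.append((c, *nbrs))
    return tets


def volume(tet):
    """Exact volume |det(p1 - p0, p2 - p0, p3 - p0)| / 6."""
    p0 = tet[0]
    a, b, c = ([p[n] - p0[n] for n in range(3)] for p in tet[1:])
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
           - a[1] * (b[0] * c[2] - b[2] * c[0])
           + a[2] * (b[0] * c[1] - b[1] * c[0]))
    return Fraction(abs(det), 6)


def faces_on_plane(tets, x):
    """Triangular faces of the tetrahedra that lie in the plane with the given x."""
    return {frozenset(f) for t in tets for f in combinations(t, 3)
            if all(p[0] == x for p in f)}
```

The four corner tetrahedra have volume 1/6 each and the central one 1/3, which adds up to the unit cube. Two cubes sharing a face produce the same two triangles on that face, which is the fitting condition mentioned in the caption of Fig. 5.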
5 Experimental Results
This paper does not focus on how to model the data term and we will use a simple, two-phase version throughout all our experiments:

$$ g(x) = (I(x) - \mu_1)^2 - (I(x) - \mu_0)^2, \tag{14} $$

where $\mu_0$ and $\mu_1$ are two fixed mean values and I is the image.

5.1 Hexagonal Meshes
In our first experiment we evaluate hexagonal vs. square meshes. We compare three types of meshes: the 8- and 16-connected square meshes and the 12-connected hexagonal mesh, shown in Fig. 3 (c), (d) and (f). We fixed the data term of a 256 × 256 image (cameraman), laid meshes of various types and sizes on top of it and calculated the optimal energy. The result is shown in Fig. 6b, where the optimal energy is plotted as a function of the number of regions used. This choice of axis is reasonable, since the number of regions is a good indicator of the total size of the linear program. The analogous plots using the number of line pairs or edges look the same. We see that the 8-connected grid converges quickly, but to a suboptimal energy. The hexagonal mesh consistently outperforms the 16-connected grid. If we were to let the number of regions grow very large, the 16-connected grid would probably achieve a lower energy than the hexagonal, due to it having 2 more possible straight lines. We have not been able to observe this in practice, though, due to the memory requirements.

5.2 Adaptive Meshes
To evaluate the effect of adaptive meshes, we performed a number of experiments. Firstly, we evaluated the visual quality of the segmentation for regular and adaptive 16-connected meshes with the same number of regions. The result can be seen in Fig. 7. There is a significant visual difference. The fact that the adaptive mesh achieved a smoother curve is also reflected in the optimal energy, which is lower. The results for 8-connected meshes are shown in Fig. 7d and yield the same conclusion. To evaluate the performance more quantitatively, we solved the same segmentation problem a large number of times for different numbers of regions. The adaptive mesh converged to what probably is the optimal energy for that connectivity, while the regular mesh did not. The regular mesh would have required more than 20 times more regions to achieve the same energy. Fig. 6a shows the optimal energies for the different numbers of regions and the two types of meshes.

Fig. 6. Experiments evaluating adaptive and hexagonal meshes. Both plots show the optimal energy (vertical axis, ×10^8) against the number of regions (horizontal axis, ×10^4). (a) Optimal energy vs. the total number of regions for a square 16-connected mesh (curves: regular mesh, adaptive mesh). To get the same accuracy as the finest parts of the adaptive mesh, the regular mesh would need 210 · 10^4 regions. In contrast, the adaptive mesh converged using about 10 · 10^4 regions. (b) Optimal energy vs. the total number of regions (curves: hexagonal mesh with 12-connectivity, square mesh with 16-connectivity, square mesh with 8-connectivity). The best accuracy obtained by the square mesh was achieved by the hexagonal mesh with about half the number of regions. These experiments used λ = γ = 10000. The energy difference might seem small, but differences of these magnitudes often correspond to significant changes in segmentation, cf. Fig. 7.

5.3 The Willmore Functional
For our experiments in three dimensions we generated a mesh where each unit cube was split into 5 tetrahedrons, see Fig. 5. We then created the set K as two circular surfaces at z = 0 and z = zmax, with nothing in between. The analytic solution with area penalty is the catenoid, one of the first minimal surfaces found. Fig. 8a shows the discrete version obtained with λ = 1 and γ = 0. If the mean curvature is chosen as the regularizer instead, the optimal surface bends outwards. The solution to this problem is shown in Fig. 8b and is the global optimum, since all variables ended up integral in the LP relaxation of (5). We used Clp as our LP solver.
Fig. 7. Results with (a) regular and (b) adaptive grids. The number of regions used were 32,768 in both cases and the number of edge pairs were 291,664 and 285,056, respectively. The adaptive mesh gives a smoother curve and correctly includes the hand of the camera man. The optimal energy for the regular mesh was 2.470 · 10^8 and 2.458 · 10^8 for the adaptive. This experiment used λ = 30000 and γ = 1000. (a) and (b) used 16-connectivity (Fig. 3d), whereas (c) and (d) show the same experiment with 8-connectivity (Fig. 3c).
(a) Area regularization (40 × 40 × 15 mesh, 447 seconds)
(b) Curvature regularization (25×25× 7 mesh, 178 seconds)
Fig. 8. Interactive figure. Surface completion with area and curvature regularization. Two flat, circular surfaces at the top and bottom were fixed. The surface in (a) bends inwards to approximate a catenoid and in (b) it correctly bends outwards to minimize the squared mean curvature.
(a) Area regularization
(b) Curvature regularization
Fig. 9. Interactive figure. Surface completion on a 16 × 16 × 16 mesh with area and curvature regularization and volume element variables. The data term and the optimal surface using area regularization coincide. The radius of the volume in (b) is constrained by the mesh size. Otherwise, a minimal surface would not exist for the continuous problem.
In another experiment we also used variables for the volume elements. The data term was a 3D ‘cross’ where the volume elements were forced to be equal to 1, whereas the volume elements at the boundary were forced to be 0. The optimal segmentation when the area was minimized coincided with the data term and is shown in Fig. 9a. When instead minimizing the curvature the optimal segmentation should resemble a sphere, which is observed in Fig. 9b.

5.4 Pseudo-Boolean Optimization
Finally, we have compared linear programming to the pseudo-Boolean formulation from Section 3.3. It seems that the formulation with higher-order cliques is weaker than the previously discussed linear programming formulations. In all cases except with a negligible value of γ we obtained 75%–100% unlabeled nodes, with no hope of recovering a good solution with e.g. probing [14].
6 Conclusions
The purpose of this paper has been to discuss the problem of segmentation with curvature regularization and to enhance the methodology in numerous ways. First of all, we have introduced new region consistency constraints (4) which are essential for the method in [3] to work well (Fig. 2c). These new constraints also reduced the computation time to less than one tenth of the original time. We have argued, by regarding the angles between straight lines and the number of regions in different meshes, that hexagonal meshes are more suitable than square meshes with the same number of regions. Our experiments have confirmed this conclusion as well (Fig. 6b). Another way of reducing the memory requirements is to allow for an adaptive mesh. We have shown that generating the mesh adaptively by examining the changes of the data term results in far better segmentations, both quantitatively (Fig. 6a) and qualitatively (Figs. 7 and 7d). Lastly, we have introduced constraints for 3D surface completion and segmentation. Experiments are encouraging (Figs. 8 and 9) with exclusively globally optimal solutions. To our knowledge, this is the first time the mean curvature of surfaces has been optimized globally. The next step would be to apply this method to e.g. the partial surfaces obtained by stereo estimation algorithms. Another line of further research is how to be able to cope with a finer discretization of the 3D volume.

Source code. To facilitate further research, the source code used for the experiments in this paper will be made publicly available and may be downloaded from our web page.

Acknowledgments. We thank Thomas Schoenemann for helpful discussions and sharing his source code. This work has been funded by the Swedish Foundation for Strategic Research (SSF) through the programmes Future Research Leaders and Wearable Visual Information Systems and by the European Research Council (GlobalVision grant no. 209480).
References

1. Woodford, O., Torr, P.H.S., Reid, I., Fitzgibbon, A.W.: Global stereo reconstruction under second order smoothness priors. IEEE Trans. Pattern Analysis and Machine Intelligence 31, 2115–2128 (2009)
2. El-Zehiry, N., Grady, L.: Fast global optimization of curvature. In: Conf. Computer Vision and Pattern Recognition (2010)
3. Schoenemann, T., Kahl, F., Cremers, D.: Curvature regularity for region-based image segmentation and inpainting: A linear programming relaxation. In: Int. Conf. Computer Vision (2009)
4. Kanizsa, G.: Contours without gradients or cognitive contours. Italian Jour. Psych. 1, 93–112 (1971)
5. Dobbins, A., Zucker, S.W., Cynader, M.S.: Endstopped neurons in the visual cortex as a substrate for calculating curvature. Nature 329, 438–441 (1987)
6. Willmore, T.J.: Note on embedded surfaces. An. Sti. Univ. "Al. I. Cuza" Iasi Sect. I a Mat (N.S.), 493–496 (1965)
7. Hsu, L., Kusner, R., Sullivan, J.: Minimizing the squared mean curvature integral for surfaces in space forms. Experimental Mathematics 1, 191–207 (1992)
8. Kawai, N., Sato, T., Yokoya, N.: Efficient surface completion using principal curvature and its evaluation. In: Int. Conf. Image Processing, pp. 521–524 (2009)
9. Masnou, S.: Disocclusion: A variational approach using level lines. IEEE Transactions on Image Processing 11, 68–76 (2002)
10. Sullivan, J.: Crystalline Approximation Theorem for Hypersurfaces. PhD thesis, Princeton Univ. (1990)
11. Grady, L.: Minimal surfaces extend shortest path segmentation methods to 3D. IEEE Trans. on Pattern Analysis and Machine Intelligence 32(2), 321–334 (2010)
12. Bruckstein, A.M., Netravali, A.N., Richardson, T.J.: Epi-convergence of discrete elastica. Applicable Analysis, Bob Caroll Special Issue 79, 137–171 (2001)
13. Wardetzky, M., Bergou, M., Harmon, D., Zorin, D., Grinspun, E.: Discrete quadratic curvature energies. Comput. Aided Geom. Des. 24(8-9), 499–518 (2007)
14. Rother, C., Kolmogorov, V., Lempitsky, V., Szummer, M.: Optimizing binary MRFs via extended roof duality. In: Conf. Computer Vision and Pattern Recognition (2007)
15. Middleton, L., Sivaswamy, J.: Hexagonal Image Processing: A Practical Approach. Springer, New York (2005)
16. Hales, T.C.: The honeycomb conjecture. Discrete & Computational Geometry 25, 1–22 (2001)
17. Xu, M., Thompson, P.M., Toga, A.W.: An adaptive level set segmentation on a triangulated mesh. IEEE Trans. on Medical Imaging 23, 191–201 (2004)
18. Kirsanov, D., Gortler, S.J.: A discrete global minimization algorithm for continuous variational problems. Technical Report TR-14-04, Harvard (2004)
19. Schoenemann, T., Kuang, Y., Kahl, F.: Curvature regularity for multi-label problems - standard and customized linear programming. In: Boykov, Y., et al. (eds.) EMMCVPR 2011. LNCS, vol. 6819, pp. 205–218. Springer, Heidelberg (2011)
SlimCuts: GraphCuts for High Resolution Images Using Graph Reduction

Björn Scheuermann and Bodo Rosenhahn

Leibniz Universität Hannover, Germany
{scheuermann,rosenhahn}@tnt.uni-hannover.de
Abstract. This paper proposes an algorithm for image segmentation using GraphCuts which can be used to efficiently solve labeling problems on high resolution images or resource-limited systems. The basic idea of the proposed algorithm is to simplify the original graph while maintaining the maximum flow properties. The resulting Slim Graph can be solved with standard maximum flow/minimum cut-algorithms. We prove that the maximum flow/minimum cut of the Slim Graph corresponds to the maximum flow/minimum cut of the original graph. Experiments on image segmentation show that using our graph simplification leads to significant speedup and memory reduction of the labeling problem. Thus large-scale labeling problems can be solved in an efficient manner even on resource-limited systems.
1 Introduction
Discrete optimization of energy minimization problems using maximum flow algorithms has become very popular in the field of computer vision [1]. This has been driven by their ability to efficiently compute a global minimum of the given optimization problem. Examples of such computer vision problems include segmentation, image restoration, dense stereo estimation and shape matching [2,3,4]. We introduce a novel algorithm that improves the performance of maximum flow based graph cut algorithms. Parallel to the improvement of energy minimization algorithms [5,6,7], the size of images and 3D volumes has increased significantly in recent years. Standard benchmark images still have an average size of approximately 120,000 pixels and graph cut algorithms solve these problems very fast. In contrast, current commercial cameras are able to capture high quality images with up to 20 million pixels. Solving segmentation problems using graph cuts on such data leads to large scale optimization problems. These problems are computationally expensive and require large amounts of memory. If the data of these problems do not fit into the physical memory the given algorithms are not applicable [8].
This work is partially funded by the German Research Foundation (RO 2497/6-1).
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 219–232, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Example segmentation using Apple’s iPhone. (a) image with user scribbles; (b) label map defined by the proposed Slim Graph. White and grey pixels denote fore- and background, black regions are unlabeled pixels; (c) final segmentation.
1.1 Prior Work
Solving the maximum flow/minimum cut problem for applications in computer vision can be divided into four types of approaches:

Augmenting paths: Due to the works of Boykov and Kolmogorov the BK augmenting paths algorithm [9] is widely used for computer vision problems. This is because of its computational efficiency for moderately sized 2D and 3D problems with low connectivity. A parallel implementation of the BK-algorithm has been proposed in [10]. They iteratively solve subproblems on multiple cores or multiple machines.

Push-relabel: Most parallelized maximum flow/minimum cut algorithms have focused on push-relabel algorithms [8]. These methods outperform the traditional BK-algorithm for huge and highly connected grids [1]. An algorithm that involves GPU processing is CUDA cuts, presented by Vineet and Narayanan [11]. In contrast to these algorithms our novel method does not use special hardware (multiple cores, GPU) to reduce the computational time.

Convex optimization: Formulating the maximum flow/minimum cut problem as a linear program is another promising approach to parallelize GraphCuts. Assuming only bidirectional edges the maximum flow problem can be reformulated as an l1 minimization problem [12]. In [13] Klodt et al. used GPU based convex optimization to solve continuous versions of GraphCuts. However, the speedup using a GPU is low compared to the BK-algorithm and the advantage of continuous cuts is to reduce metrication errors due to discretization.

Multi-Scale: Besides the approaches to parallelize the maximum flow/minimum cut problem to outperform existing algorithms, multi-scale processing is an approach to reduce memory and computational requirements of optimization algorithms. The idea is to first solve the problem at low resolution using standard techniques [14]. The resulting low resolution labeling is refined on the high resolution problem in a following optimization step. Boundary banded algorithms [15,16] are examples of multi-scale image segmentation of high resolution images. Kohli et al. [17] proposed an uncertainty driven multi-scale approach for energy minimization that computes solutions close to the global optimum. However, both approaches suffer from the problem that they are not able to efficiently recover from large scale errors present in the low resolution result.

Graph sparsification: In the field of applied mathematics graph sparsification and graph simplification is an important matter. Karger and Stein proposed the Recursive Contraction Algorithm in [18]. The algorithm relies on the idea that the minimum cut is a small set of edges and a randomly chosen edge is unlikely to be in this set. They randomly contract edges and showed that the minimum cut is found with high probability. However, it is not guaranteed that the cut is optimal. Similarly Benczúr and Karger [19] and Spielman and Teng [20] proposed algorithms based on random sampling to approximate the minimum cut of a given graph. Since we are looking for the global minimum we are not interested in an approximation. In [21], Chekuri, Goldberg et al. developed heuristics to improve the practical performance of minimum cut algorithms. They propose to use the Padberg-Rinaldi heuristic [22] to contract edges that are not in the minimum cut. To identify those edges they apply several so-called PR tests. In practice the PR tests are computationally too expensive for large graphs. In [23] Hogstedt et al. proposed a number of heuristic graph algorithms to simplify partitioning of distributed object applications. For the special case of an s-t minimum cut (two machine nodes) their condition for graph simplification preserves one minimum cut. In contrast our graph simplification guarantees that all minimum cuts are preserved.

To our knowledge the proposed condition has not been used to simplify a graph in the computer vision community.

1.2 Contribution
To solve large scale labeling problems efficiently we propose to build a so-called Slim Graph by simplifying the original graph. To this end we search for nodes that are connected by an edge (a simple edge) which can be removed from the graph without changing the value of the maximum flow and the corresponding minimum cut. The nodes connected by a simple edge will have the same label in the final segmentation and can be merged into a single node. Thus we simplify the original graph to a Slim Graph without changing the energy-minimization problem and the value of the global minimum. The proposed simplification can be applied to each of the aforementioned algorithms. Thus our algorithm provides a general speedup and memory reduction without suffering from the problem of large scale errors or the use of special hardware, e.g. GPU or multiple processors. Besides speedup and memory reduction, the merged nodes help the user to set the parameter included in the minimization problem. Additionally, the simplified graph reveals which areas of the image / graph cannot be assigned to foreground or background. Even for large images, this provides fast feedback on where further user strokes are necessary. To summarize, our contributions are:
– An algorithm based on edge contraction is used to generate a Slim Graph from the original graph without changing the value of the minimum cut.
– A proof is given that the value of the maximum flow on the Slim Graph is equal to the maximum flow of the original graph.
– Several experiments on small and large scale images, as well as on resource limited systems demonstrate the general applicability of our method.
– For further evaluation we will provide C-Sources for generating Slim Graphs to the scientific community¹.

Our contribution is neither a parallelization of an existing algorithm nor a multiscale approach to speed up graph cuts and reduce their memory consumption. Hence our algorithm does not suffer from the problems of these methods. In contrast to the works in the field of applied mathematics on graph sparsification and graph simplification, our method preserves the value of the minimum cut instead of approximating it, and the given condition to test whether an edge is simple or not is computationally cheap and much faster.

Paper Organization. In Section 2 we continue with a short review of discrete energy minimization, which is the basis for our segmentation framework. Section 3 introduces the proposed graph simplification to build Slim Graphs and proves the equality of the maximum flow value. The simplified user interaction by visualizing joined nodes is described in Section 4. Experimental results in Section 5 demonstrate the advantages of the proposed method. The paper finishes with a short conclusion.
2 Segmentation by Discrete Energy Minimization
The discrete energy $E : L^n \to \mathbb{R}$ for the problem of binary image labeling can be written as the sum of unary functions $\varphi_i$ and pairwise functions $\varphi_{i,j}$:

$$ E(x) = \sum_{i \in \mathcal{V}} \varphi_i(x_i) + \sum_{(i,j) \in \mathcal{E}} \varphi_{i,j}(x_i, x_j), \tag{1} $$

where $\mathcal{V}$ corresponds to the set of all image pixels and $\mathcal{E}$ is the set of all edges between pixels in a defined neighborhood N, e.g. a 4- or 8-neighborhood. For the problem of binary image segmentation, which is addressed in this paper, the label set L consists of a foreground (fg) and a background (bg) label. The unary function $\varphi_i$ is given as the negative log-likelihood using a standard GMM model [7]. It is defined as

$$ \varphi_i(x_i) = -\log \Pr(I_i \mid x_i = S), \tag{2} $$

where S is either fg or bg. The pairwise function $\varphi_{i,j}$ takes the form of a contrast sensitive Ising model and is defined as

$$ \varphi_{i,j}(x_i, x_j) = \gamma \cdot \operatorname{dist}(i,j)^{-1} \cdot [x_i \neq x_j] \cdot \exp\!\big(-\beta \|I_i - I_j\|^2\big). \tag{3} $$

¹ http://www.tnt.uni-hannover.de/project/Segmentation
Here $I_i$ and $I_j$ describe the feature vectors of pixels i and j. The parameter γ specifies the impact of the pairwise function. A small γ leads to a strong unary term whereas a large γ leads to a weak unary term. Using the defined unary and pairwise functions, the energy (1) is submodular and can hence be represented by a graph [9]. Represented as a graph, the global minimum of the energy can be computed with standard maximum flow/minimum cut algorithms [1]. To solve the labeling problem with a maximum flow algorithm, the energy function needs to be represented by a graph. This can be done analogously to [9] by defining the graph $G = (V_G, E_G)$ as follows: The set of vertices is the set of pixels together with two special vertices, $V_G = \mathcal{V} \cup \{S, T\}$, where S denotes the source and T the sink. The set of edges consists of the edges between all neighboring pixels plus an edge between each pixel and the source and sink: $E_G = \mathcal{E} \cup \{(p, S), (p, T) \mid p \in \mathcal{V}\}$. The capacities c(e) of each edge are defined analogously to Boykov et al. [9]. As noted earlier, to speed up the algorithms solving the maximum flow/minimum cut problem, one can parallelize existing algorithms or reduce the number of variables by a multi-scale approach. The multi-scale approaches, as an example of reducing the number of variables, suffer from the problem that the minimum of the large scale problem (1) is only approximated, while the parallelizing approaches rely on special hardware components. Instead, we propose an algorithm that computes the true global minimum of (1) in a fraction of the time and memory required by the original algorithm without using special hardware. The proposed edge contraction guarantees that the minimum cut is preserved, while the given condition is computationally cheap and applicable to large scale problems.
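To make (1) and (3) concrete, the following Python sketch evaluates the energy of a given binary labeling on a 4-connected grid. It is only an illustration under several assumptions: the unary costs are passed in as precomputed numbers (for instance negative log-likelihoods as in (2)), the features are scalar intensities, every neighboring pixel pair is counted once, and dist(i, j) = 1 for 4-neighbors. The function name and argument layout are not from the paper.

```python
import math


def energy(labels, unary, image, gamma, beta):
    """Evaluate E(x) of (1) with the pairwise term (3) on a 4-connected grid.

    labels[r][c] is 0 or 1, unary[r][c][l] is the unary cost of label l,
    and image[r][c] is a scalar feature of the pixel.
    """
    rows, cols = len(labels), len(labels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            total += unary[r][c][labels[r][c]]
            # Look only right and down so that each edge is visited once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols and labels[r][c] != labels[rr][cc]:
                    diff = image[r][c] - image[rr][cc]
                    total += gamma * math.exp(-beta * diff * diff)
    return total
```

A label change across a strong intensity edge is cheap because the exponential is small there, whereas a label change inside a homogeneous area costs almost the full γ.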
3 Constructing Slim Graphs
In this section we explain how to construct a so-called Slim Graph by merging nodes that are connected by simple edges. We start by defining these special edges and prove that they are not part of any minimum cut. We can then simplify the original graph by merging nodes according to the given rules; this operation is also called edge contraction. Finally we prove that the minimum cut of a Slim Graph corresponds to the minimum cut of the original graph and can be used to minimize the large scale optimization problem.

Lemma 1. Let $G = (V_G, E_G)$ be a graph, $A, B \in V_G$, $e_{A,B} \in E_G$ the edge from node A to node B and C a minimum s-t-cut. If

$$ c(e_{A,B}) > \sum_{\substack{i:\, e_{i,A} \in E_G \\ i \neq B}} c(e_{i,A}) \quad \text{or} \quad c(e_{A,B}) > \sum_{\substack{i:\, e_{B,i} \in E_G \\ i \neq A}} c(e_{B,i}), \tag{4} $$

then

$$ e_{A,B} \notin C. \tag{5} $$

Simply put: if the weight of one edge e of node A is bigger than the sum of the weights of all other edges adjacent to A, then the minimum cut does not contain the edge e.
Fig. 2. Example of building a Slim Graph (ii) out of the original graph (i) because of a simple edge between nodes A and B. The given rules join nodes A and B in a single node {A, B}, replace edges connected to one of the nodes and merge edges that are connected to both nodes.
Proof: Following Shannon, the value of the maximum flow is equal to the value of a minimum cut. The value of the maximum flow in $G$ can be computed with the augmenting path algorithm of Ford and Fulkerson [24]. In this algorithm, paths from $s$ to $t$ are searched and augmented as long as there exists a path from $s$ to $t$. Because of equation (4), the edge $e_{A,B}$ will never become saturated, which means that the edge is not part of the minimum cut $C$, i.e. $e_{A,B} \notin C$.

Those edges $e_{A,B} \in E_G$ fulfilling equation (4) are called simple edges. A similar lemma was also given in [23], where a so-called dominant edge is defined with a stronger condition. For a dominant edge $e$, there exists a minimum cut not containing this edge. In contrast, our condition results in a simple edge $e$ that is not contained in any minimum cut, that is, all minimum cuts are preserved. With the following rules we simplify the graph and reduce the number of variables of the maximum flow/minimum cut problem.

Simplifying a graph: Let $G = (V_G, E_G)$ be a graph with a simple edge $e_{A,B} \in E_G$ connecting nodes $A, B \in V_G$. W.l.o.g. let $e_{A,B}$ fulfill the left condition of equation (4). Then we can create the Slim Graph $\tilde{G} = (\tilde{V}_G, \tilde{E}_G)$ as follows:

Nodes: $\tilde{V}_G = (V_G \setminus \{A, B\}) \cup \{AB\}$. That is, we replace nodes $A$ and $B$ by the merged node $AB$ and reduce the number of nodes in the graph by one.

Edges: For the edges we distinguish the following two cases:
(i) For all nodes $i \in V_G$ connected to exactly one of the nodes $A$ or $B$: W.l.o.g. let $e_{i,A}$ be the edge connecting node $i$ with node $A$ ($e_{i,B} \notin E_G$). Then we replace the edge $e_{i,A}$ by a new edge $e_{i,AB}$ with $c(e_{i,AB}) = c(e_{i,A})$. This operation does not change the number of edges.
(ii) For all nodes $i \in V_G$ connected to both of the nodes $A$ and $B$ with edges $e_{i,A}$ and $e_{i,B}$, or $e_{A,i}$ and $e_{B,i}$: We merge the two edges into one new edge $e_{i,AB}$ or $e_{AB,i}$ with the capacity $c(e_{i,AB}) = c(e_{i,A}) + c(e_{i,B})$ or $c(e_{AB,i}) = c(e_{A,i}) + c(e_{B,i})$. This operation reduces the number of edges by one.
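The two edge rules can be sketched as a single pass over the edge list. The function name `contract`, the dictionary representation of the graph, and naming the merged node by the tuple `(a, b)` are assumptions made for illustration only; the example capacities are hypothetical.

```python
def contract(cap, a, b):
    """Merge nodes a and b into one node following the two edge rules.

    cap maps directed edges (u, v) to capacities. The merged node is
    named by the tuple (a, b), a naming choice made for this sketch.
    """
    merged = (a, b)
    new_cap = {}
    for (u, v), c in cap.items():
        u2 = merged if u in (a, b) else u
        v2 = merged if v in (a, b) else v
        if u2 == v2:
            # Edges between a and b disappear with the contraction.
            continue
        # Rule (i): an edge touching only one of a, b is re-attached.
        # Rule (ii): parallel edges that arise are summed into one edge.
        new_cap[(u2, v2)] = new_cap.get((u2, v2), 0) + c
    return new_cap


# Hypothetical graph in which A -> B is simple (10 > 3, the inflow of A).
cap = {
    ("S", "A"): 3,
    ("S", "B"): 2,
    ("A", "B"): 10,
    ("A", "T"): 1,
    ("B", "T"): 6,
}
slim = contract(cap, "A", "B")
print(slim)
```

In this example both `S` and `T` are connected to `A` and to `B`, so rule (ii) applies twice and the five original edges shrink to two.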
Fig. 3. Example of simplifying a graph. (a) original graph; (b) nodes S and 1 and nodes T and 9 are each connected by a simple edge and joined into one node; (c) nodes S and 4, nodes 2 and 3, and nodes T, 8 and 6 can each be joined into one node; (d) the final Slim Graph. At each step the value of the maximum flow remains the same, and the final segmentation also stays the same.
Figure 2 shows the construction of a Slim Graph. Assuming a simple edge $e_{A,B}$, we can merge nodes $A$ and $B$ into a single node $\{A, B\}$. Edges connected to exactly one of the nodes are replaced by new edges. In the given example, the affected nodes are $1, \dots, i-1, j+1, \dots, N$. The edges $e_{A,1}, \dots, e_{A,i-1}, e_{B,j+1}, \dots, e_{B,N}$ connecting these nodes to $A$ or $B$ are replaced by edges $e_{AB,1}, \dots, e_{AB,i-1}, e_{AB,j+1}, \dots, e_{AB,N}$ without changing the capacities $a_i$, $b_j$ of these edges. The nodes connected to both $A$ and $B$ in this example are $i, \dots, j$. For these nodes we merge the two edges $e_{A,h}$ and $e_{B,h}$ into one edge $e_{AB,h}$ with capacity $a_h + b_h$. The resulting Slim Graph has 1 node and $j - i$ edges less than the original graph.

Theorem 1. Let $G = (V_G, E_G)$ be a graph, $A, B \in V_G$, $e_{A,B} \in E_G$ a simple edge connecting nodes $A$ and $B$, and $f$ the maximum flow in graph $G$. Since $e_{A,B}$ is a simple edge, we can build the Slim Graph $\tilde{G} = (\tilde{V}_G, \tilde{E}_G)$. The value of the maximum flow $\tilde{f}$ of graph $\tilde{G}$ is equal to the value of the maximum flow in the original graph:

$$|f| = |\tilde{f}|. \tag{6}$$

Proof: From Lemma 1 we know that $e_{A,B} \notin C$, where $C$ is the minimum cut of $G$. This implies that the simple edge never becomes saturated. Therefore its capacity can be set to infinity without affecting the minimum cut or the maximum flow. It follows that both nodes $A$ and $B$ are in the same partition of $G \setminus C$. W.l.o.g. let both nodes be in the partition connected to $S$. Hence the constructed node $AB \in \tilde{V}_G$ in the graph $\tilde{G} \setminus C$ is connected to $S$. To prove the theorem we will now show that:
(i) the minimum cut $C$ of graph $G$ implies a cut $\tilde{C}$ in $\tilde{G}$ with $|C| = |\tilde{C}|$;
(ii) the maximum flow $f$ leads to a flow $\tilde{f}'$ in $\tilde{G}$ with the same value $|f| = |\tilde{f}'|$.
The first condition provides an upper bound for the value of the minimum cut in $\tilde{G}$. On the other hand, the value of the flow $|\tilde{f}'|$ in graph $\tilde{G}$ provides a lower bound for the minimum cut. Since they are equal, the value of the maximum flow/minimum cut does not change in the Slim Graph.
Fig. 4. Example of utilizing the Slim Graph to simplify user interaction. (a) the original image with user scribbles; (b) the label map defined by the Slim Graph. White and black pixels denote fore- and background, grey pixels are unlabeled; (c) resulting segmentation using GraphCuts and possible additional user strokes to refine the segmentation (green and red); (d) label map with one additional user stroke (red); (e) final segmentation.
Proof of (i): Let $i \in V$ be the nodes with a path to the terminal node $T$ in $G \setminus C$. Since $A$ and $B$ are connected to $S$, all edges $e_{A,i}$ and $e_{B,i}$ are part of the minimum cut $C$. Defining $\tilde{C}$ as follows implies a cut in the Slim Graph $\tilde{G}$:

$$\tilde{C} = \{e_{i,j} \mid e_{i,j} \in \tilde{E}_G \text{ and } e_{i,j} \in C\} \cup \{e_{i,AB} \mid e_{i,A} \in C \text{ or } e_{i,B} \in C\} \tag{7}$$

Due to the construction of the Slim Graph, this definition leaves the value of the cut unchanged. Hence $|C| = |\tilde{C}|$ holds.

Proof of (ii): Let $i \in V$ be a node and $p = (S, \dots, A, i, \dots, T)$ a path from $S$ to $T$ in $G$ with flow $f(p)$. By the construction of the Slim Graph, the flow $f(p)$ is preserved by the path $\tilde{p} = (S, \dots, AB, i, \dots, T)$ in $\tilde{G}$. Hence the maximum flow of $G$ implies a lower bound for the maximum flow in $\tilde{G}$.

Figure 3 shows how a Slim Graph can be constructed. By merging nodes that are connected by a simple edge, the original graph (a) is simplified to the Slim Graph (d). The value of the maximum flow and the minimum cut remains identical and can be computed more efficiently on the new graph. The labeling of the original graph is implicitly included in the labeling of the Slim Graph.
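The statement of Theorem 1 can be checked numerically on a small example. The sketch below is not the BK-algorithm used in the experiments: it substitutes a plain Edmonds-Karp maximum flow for brevity, and the helper names `max_flow` and `contract`, the dictionary representation, and the example capacities are all assumptions made for illustration.

```python
from collections import deque


def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow on a dict {(u, v): capacity}."""
    residual = dict(cap)
    adj = {}
    for u, v in cap:
        residual.setdefault((v, u), 0)  # reverse residual edge
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total
        # Collect the path, find its bottleneck, then augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        total += bottleneck


def contract(cap, a, b):
    """Merge nodes a and b; parallel edges are summed, loops dropped."""
    merged = (a, b)
    new_cap = {}
    for (u, v), c in cap.items():
        u2 = merged if u in (a, b) else u
        v2 = merged if v in (a, b) else v
        if u2 != v2:
            new_cap[(u2, v2)] = new_cap.get((u2, v2), 0) + c
    return new_cap


original = {
    ("S", 1): 20, (1, 2): 5, (1, 3): 6,
    (2, 3): 1, (2, "T"): 8, (3, "T"): 4,
}
# S -> 1 is simple: 20 exceeds the capacity 5 + 6 leaving node 1.
slim1 = contract(original, "S", 1)
# 2 -> T is simple: 8 exceeds the capacity 5 entering node 2.
slim2 = contract(slim1, 2, "T")

flow_original = max_flow(original, "S", "T")
flow_slim1 = max_flow(slim1, ("S", 1), "T")
flow_slim2 = max_flow(slim2, ("S", 1), (2, "T"))
print(flow_original, flow_slim1, flow_slim2)
```

After the two contractions the graph has three nodes instead of five, while the three flow values agree, as the theorem predicts for this example.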
4
Slim Graphs for Simplified User Interaction
In this section we show how the Slim Graph can be integrated into the segmentation process to simplify user interaction. The visualization of the Slim Graph is used to guide the user where to place additional strokes. Analyzing the original graph shows that simple edges most likely exist between pixel nodes and terminal nodes. Every pixel $i$ that has been marked by the user or fulfills

$$-\log \Pr(I_i \mid x_i = S) > \gamma \cdot \sum_{j \in \mathcal{N}} \mathrm{dist}(i,j)^{-1} \cdot [x_i \neq x_j] \cdot \exp\left(-\beta \|I_i - I_j\|^2\right), \tag{8}$$

where $S$ is either fg or bg, is connected to the corresponding terminal node by a simple edge. Visualizing these pixels in a label map results in a partial labeling: some pixels are labeled as foreground or background due to user marks or regional properties, and the remaining pixels are unlabeled.

Figure 4 shows an example image with user strokes and the label map coming from the Slim Graph. Many pixels are assigned to foreground or background due to their regional properties. Based on the given user input, the final segmentation has two regions that are assigned a wrong label. To correct the segmentation, the user has to mark these two regions, or even only one of them, as background. That means the user has three options to affect the segmentation, shown in Figure 4c. In the label map coming from the Slim Graph there is exactly one region assigned a wrong label, which implies that the user has to mark this region as background to obtain a correct label map. In an optimal situation this user mark would also correct the labeling of the other region, leading to a good segmentation result with only one additional user mark. This situation is shown in Figure 4d. Marking the wrongly labeled region in the label map guides the user to the desired segmentation in Figure 4e.

Using the proposed label map as additional information hints the user to place strokes in regions with high regional support. On the one hand, this can lead to fewer user interactions for the problem of image segmentation; on the other hand, the label map can be computed very efficiently. That means that it is much faster to start refining the label map of a large scale image than to refine the segmentation.
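The per-pixel test in (8) can be illustrated with a short sketch. The function names `pairwise_bound` and `dominated_by_unary`, the feature values, and the parameters are hypothetical. The indicator $[x_i \neq x_j]$ is set to 1 for every neighbor, which is a worst-case assumption of this sketch and not something the text specifies.

```python
import math


def pairwise_bound(features, i, neighbors, gamma, beta):
    """Right-hand side of (8) for pixel i, with the indicator set to 1.

    features maps a pixel id to its feature vector (a tuple of floats);
    neighbors is a list of (j, dist) pairs for the neighbors of i.
    """
    total = 0.0
    for j, dist in neighbors:
        sq = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        total += math.exp(-beta * sq) / dist
    return gamma * total


def dominated_by_unary(neg_log_prob, bound):
    """True if the unary cost exceeds the pairwise bound, as in (8)."""
    return neg_log_prob > bound


# Hypothetical 1-D example: pixel 1 with neighbors 0 and 2 at distance 1.
features = {0: (0.0,), 1: (0.0,), 2: (1.0,)}
bound = pairwise_bound(features, 1, [(0, 1.0), (2, 1.0)], gamma=2.0, beta=1.0)
print(round(bound, 4))
print(dominated_by_unary(3.0, bound), dominated_by_unary(1.0, bound))
```

Pixels for which the test succeeds are the ones that would appear as already labeled in the label map; all others stay grey.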
5
Experiments
We present an evaluation of the proposed method on small scale images from the database used by Blake et al. [2] as well as on large scale images with up to 26 million pixels found on the web. The images, trimaps and ground truth data are available online (see footnotes 2 and 3). In the experiments we used the energy function proposed by Blake et al. [2] with the same set of parameters. Since our method does not change the segmentation result, we do not show any segmentation results. Instead we evaluate our contribution by comparing the computation time of the BK-algorithm with and without the proposed Slim Graphs. We ran all experiments on a MacBook Pro with a 2.4 GHz Intel Core i5 processor and 4 GB RAM.

2 http://research.microsoft.com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm
3 http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
Fig. 5. Small scale images: Running time over 46 small scale images with image sizes between 481x321 and 640x480 pixels. The average speedup of the proposed method compared to BK-algorithm [1] on these small scale problems is 40%. The maximum and minimum speedup is 70% and 14% respectively. The running time of the proposed method includes the simplifying of the original graph.
Fig. 6. Weak vs. Strong Unary Terms: Running time over the flower image (a) with different trimaps and varying γ; (b) Lasso trimap around the flowers; (c) Rectangular trimap; (d) user strokes provided in (a). Using good initializations (b) and (c) the proposed algorithm performed significantly faster. Nevertheless, even with a poor initialization and a weak unary term we achieved a speedup.
Fig. 7. Large scale images: Running time with one image and image sizes up to 25.84 MP. Up to an image size of 8.54 MP, the proposed method was approximately two times faster. Larger images exceeded the physical memory and the proposed method was approximately 877 times faster. On the original image size of 25.84 MP the computation time of the BK-algorithm was 38 minutes. The proposed method required 2.6 seconds. This time already includes the graph simplification step.
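As a quick consistency check, the factor quoted in the caption follows directly from the two running times reported for the 25.84 MP image:

```latex
\[
  \frac{38\,\text{min}}{2.6\,\text{s}}
  = \frac{38 \cdot 60\,\text{s}}{2.6\,\text{s}}
  = \frac{2280}{2.6}
  \approx 877 .
\]
```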
5.1
Experiments on Small Scale Images
Figure 5 shows the running times of Boykov's algorithm on the original graph and on the Slim Graph, as well as the running time of simplifying the graph. In the running time on the original graph we included the time for creating the graph and computing the maximum flow. We excluded the time for computing the capacities and histograms. The running time on the Slim Graph further includes the time for building the Slim Graph. The experiments on the small scale images show that using Slim Graphs never affects the running time negatively and can significantly speed up the segmentation process.

As mentioned earlier, most simple edges exist between pixel nodes and terminal nodes. Since the weight of these edges is defined by the unary term, we performed a second experiment on small scale images, comparing the effectiveness of the Slim Graph under weak vs. strong unary terms. For this we computed the maximum flow for one image with three different trimaps (lasso, rectangle and user strokes) and varied the parameter $\gamma$ from 0 (strong unary term) to 100 (weak unary term). Figure 6 shows the running times of these experiments. It turns out that the speedup using Slim Graphs is highest with strong unary terms and with trimaps separating fore- and background with small errors. Nevertheless, even with weak unary terms and poor initializations (e.g. the rectangular trimap), the algorithm using the proposed Slim Graph performed faster.
5.2
Experiments on Large Scale Images
To evaluate the speed up of the proposed method we used large scale images with up to 26 MP and down sampled these images to several image-sizes. As shown in Figure 7, solving the maximum flow problem on our Slim Graphs significantly speeds up the algorithm. This speed up is achieved by a large decrease
Fig. 8. Resource-limited systems: (a): Running time in seconds over 7 different sized images. The average speedup using Slim Graphs was approximately 36%. Running BK-algorithm without using Slim Graphs was not possible on images bigger than 1.6 MP. The running time of the proposed method includes the graph simplification. (b): Example segmentation using Apple’s iPhone. Left image shows an image with user scribbles and the right image the final segmentation.
of variables/nodes due to many simple edges. As already shown by Delong and Boykov [8], the problem of the BK-algorithm is that it becomes inefficient and unusable if the graph does not fit into the physical memory. Because the Slim Graph is much smaller, the proposed method greatly extends the range of image sizes for which the algorithm is applicable.
5.3
Experiments on Resource-Limited Systems
We also compared the running time of the BK-algorithm with and without Slim Graphs on Apple's iPhone 4 with 512 MB RAM. For this we used 7 images of different sizes, from 0.15 MP up to 2.52 MP. The average speedup of using Slim Graphs was approximately 32%. The limitations of the physical memory prohibited a comparison on larger images. The results of the experiments are shown in Figure 8a. Using Slim Graphs we are able to segment images with up to 2.5 MP in 6 seconds on an iPhone 4, while using the original graph we can only compute segmentations for images up to 1.6 MP in 8 seconds. The biggest speedup of approximately 45% was reached on the image with 1.61 MP, because the number of unlabeled nodes could be reduced from 1.61 million to 446,951.
6
Conclusion
An efficient method for graph simplification for maximum flow/minimum cut algorithms is presented. It constructs a Slim Graph by merging nodes that are connected by simple edges. A proof is given that the maximum flow of the much smaller graph remains identical. Hence the method can be combined with any maximum flow algorithm. We demonstrated that the speedup is between 14 and 70 percent on small scale problems compared to the BK-algorithm. On high quality images with
up to 26 MP, the proposed method was up to 877 times faster. It was shown that the proposed method required much less memory allowing segmentation of images of reasonable sizes even on mobile devices. A further reduction of computation time can be achieved by using parallel hardware architecture. In addition the visualization of our Slim Graph can be utilized to guide the user during the segmentation process resulting in less user interaction.
References
1. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 26(9), 1124–1137 (2004)
2. Blake, A., Rother, C., Brown, M., Perez, P., Torr, P.: Interactive image segmentation using an adaptive GMMRF model. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 428–441. Springer, Heidelberg (2004)
3. Boykov, Y., Kolmogorov, V.: Computing geodesics and minimal surfaces via graph cuts. In: Ninth IEEE Int. Conf. on Computer Vision (ICCV), pp. 26–33 (2003)
4. Lempitsky, V., Boykov, Y.: Global optimization for shape fitting. In: Comp. Vision and Pattern Recognition (CVPR), pp. 1–8 (2007)
5. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 23(11), 1222–1239 (2002)
6. Kohli, P., Torr, P.H.S.: Efficiently solving dynamic Markov random fields using graph cuts. In: Tenth IEEE Int. Conf. on Computer Vision (ICCV), vol. 2, pp. 922–929 (2005)
7. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive foreground extraction using iterated graph cuts. SIGGRAPH 23(3), 309–314 (2004)
8. Delong, A., Boykov, Y.: A scalable graph-cut algorithm for N-D grids. In: Comp. Vision and Pattern Recognition (CVPR) (2008)
9. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Eighth IEEE Int. Conf. on Computer Vision (ICCV), vol. 1, pp. 105–112 (2001)
10. Strandmark, P., Kahl, F.: Parallel and distributed graph cuts by dual decomposition. In: Comp. Vision and Pattern Recognition (CVPR) (2010)
11. Vineet, V., Narayanan, P.: CUDA cuts: Fast graph cuts on the GPU. In: Comp. Vision and Pattern Recognition Workshops (CVPRW) (2008)
12. Bhusnurmath, A., Taylor, C.J.: Graph cuts via l1 norm minimization. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 30(10), 1866–1871 (2008)
13. Klodt, M., Schoenemann, T., Kolev, K., Schikora, M., Cremers, D.: An experimental comparison of discrete and continuous shape optimization methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 332–345. Springer, Heidelberg (2008)
14. Puzicha, J., Buhmann, J.M.: Multiscale annealing for grouping and unsupervised texture segmentation. In: Comp. Vision and Image Understanding, vol. 76(3), pp. 213–230 (1999)
15. Lombaert, H., Sun, Y., Grady, L., Xu, C.: A multilevel banded graph cuts method for fast image segmentation. In: Tenth IEEE Int. Conf. on Computer Vision (ICCV), vol. 1, pp. 259–265 (2005)
16. Sinop, A.K., Grady, L.: Accurate banded graph cut segmentation of thin structures using Laplacian pyramids. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 896–903. Springer, Heidelberg (2006)
17. Kohli, P., Lempitsky, V., Rother, C.: Uncertainty driven multi-scale optimization. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) Pattern Recognition. LNCS, vol. 6376, pp. 242–251. Springer, Heidelberg (2010)
18. Karger, D.R., Stein, C.: A new approach to the minimum cut problem. Journal of the ACM (JACM) 43(4), 601–640 (1996)
19. Benczúr, A.A., Karger, D.R.: Approximating s-t minimum cuts in Õ(n²) time. In: Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, STOC 1996, pp. 47–55 (1996)
20. Spielman, D.A., Teng, S.H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC 2004, pp. 81–90. ACM, New York (2004)
21. Chekuri, C.S., Goldberg, A.V., Karger, D.R., Levine, M.S., Stein, C.: Experimental study of minimum cut algorithms. In: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 1997, pp. 324–333. SIAM, Philadelphia (1997)
22. Padberg, M., Rinaldi, G.: An efficient algorithm for the minimum capacity cut problem. Mathematical Programming 47(1), 19–36 (1990)
23. Hogstedt, K., Kimelman, D., Rajan, V.T., Roth, T., Wegman, M.: Graph cutting algorithms for distributed applications partitioning. ACM SIGMETRICS Performance Evaluation Review 28(4), 27–29 (2001)
24. Ford, L., Fulkerson, D.: Maximum flow through a network. Canadian J. of Math. 8, 399–404 (1956)
Discrete Optimization of the Multiphase Piecewise Constant Mumford-Shah Functional
Noha El-Zehiry and Leo Grady
Siemens Corporate Research, Image Analytics and Informatics, Princeton, NJ
Abstract. The Mumford-Shah model has been one of the most powerful models in image segmentation and denoising. The optimization of the multiphase Mumford-Shah energy functional has been performed using level set methods that optimize the Mumford-Shah energy by evolving the level sets via gradient descent. These methods are very slow and prone to getting stuck in local optima due to the use of gradient descent. After the reformulation of the bimodal Mumford-Shah functional on a graph, several groups investigated the hierarchical extension of the graph representation to multiple classes. These approaches, though more effective than level sets, provide approximate solutions and can diverge away from the optimal solution. In this paper, we present a discrete optimization for the multiphase Mumford-Shah functional that directly minimizes the multiphase functional without recursive bisection on the labels. Our approach handles the nonsubmodularity of the multiphase energy function and provides a global optimum if prior information is provided.
1
Introduction
The formulation of the image segmentation problem as an energy minimization problem has been one of the most powerful and commonly used techniques in the past couple of decades. For example, the piecewise constant Mumford-Shah segmentation aims at minimizing the integral of the intensity deviations around the mean of each class. Minimizing such an integral can be done in two ways [1]. The first is to find the associated partial differential equation and use numerical methods to solve the PDE, in which case a discretization of the solution domain is used to optimize the continuous problem. The alternative is to find the corresponding discrete problem and employ combinatorial optimization tools to solve it. The former method has been extensively used in image segmentation. However, it suffers from several drawbacks: 1) It requires tuning of many parameters. When different parameters are used to discretize the solution domain, the results may change vastly. 2) Local optimization tools such as the gradient descent method are used for the optimization of the continuous problem, but these methods easily get stuck in local optima and jeopardize the quality of the output.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 233–246, 2011. © Springer-Verlag Berlin Heidelberg 2011
N. El-Zehiry and L. Grady
Recent studies such as [2,3,4] are in favor of the second method. They unify two of the most important frameworks in image segmentation: graph cuts and deformable models. The ultimate goal of the new hybrid framework is to provide curve evolution models that benefit from the arsenal of variational formulations of deformable models and, at the same time, extend the robustness of these algorithms by employing graph cut optimization with the associated dynamic labeling, which tends to capture global optima in a relatively short time. Inspired by [2,3], in this paper we extend this work by investigating the combinatorial optimization of the multiphase Mumford-Shah energy functional and its application to image segmentation and brain MRI tissue classification. We present a discrete formulation of the multiphase Mumford-Shah segmentation functional and provide a graph construction that represents the discrete energy function. Then, we discuss the nonsubmodularity of the resulting function and employ Quadratic Pseudo Boolean Optimization (QPBO) [5] [6] and Quadratic Pseudo Boolean Optimization with Probing (QPBOP) [7] to optimize the function.
1.1
Previous Work
The optimization of the multiphase Mumford-Shah model has been previously investigated by several research groups, either in a continuous framework [8,9,10] or in a discrete framework [11,12,13]. Most of the continuous approaches are slow and do not provide an optimal solution. Even the most recent recursive approach in [9], which uses an efficient global optimization for the 2-phase Mumford-Shah model [14], does not guarantee the global optimality of the multiphase function because of the recursive solution. Despite the recent advancement of combinatorial optimization schemes that yield global optima and exhibit low computational complexity, several problem formulations still suffer from local solutions associated with gradient descent or recursive bisection schemes. In addition to local optimality, the implementation of such schemes often comes at a very high computational cost that limits the applicability of such algorithms. Recent examples of such models include, but are not limited to, [9,15,16]. This striking fact urged us to investigate the fast and global optimization of the piecewise constant Mumford-Shah multiphase energy function, which can be achieved by optimizing the corresponding discrete functional on a graph using QPBO/QPBOP. On the other hand, the extension of the discrete optimization of the 2-phase Mumford-Shah model to the multiphase case is not straightforward, because the resulting discrete energy function is nonsubmodular. This problem has been investigated in [11,12]. These contributions presented a graph cut optimization of the multiphase Mumford-Shah functional through a sequence of binary segmentation steps. Such recursive approaches are approximate and, in certain scenarios, may produce solutions that are very far from the optimal solution [17]. Unlike Bae et al. [12] and El-Zehiry et al. [11], in this paper we directly minimize the multiphase energy function itself.
Discrete Optimization of the Multiphase Piecewise Constant
Recently, Delong and Boykov [18] presented an approach for multi-region image segmentation. They build a graph of $n$ layers to segment $n$ classes and assign inter-layer links whose weights depend on the geometric interactions among the distinct classes. The authors dissected the geometric interactions into three main interactions: containment and attraction, which are submodular, and exclusion, which is supermodular. However, combinations of these interactions sometimes lead to nonsubmodularity, in which case the authors either truncate the nonsubmodular terms or provide approximate solutions. Similar to [18], we construct a multi-layer graph; however, we use only $\log_2 n$ layers to segment $n$ classes and our construction is independent of the geometric interaction among the classes. Our experimental results include an example of a nonsubmodular geometric interaction that could not be globally optimized using [18] but has been successfully segmented using our construction (see Figure 4, first image).

Binary encoding of multi-label problems has been discussed in [19,13]. In [19], two or more Boolean variables are used to encode the states of a single multi-label variable. For example, to encode a four class labeling problem, they used the battleship transformation [20], which utilizes three Boolean variables. The resulting binary interaction of the battleship transformation is submodular. In our paper, we use only two variables instead of three to encode a four class labeling problem. The price of using fewer variables is that the resulting function is nonsubmodular. While our work also discusses the problem of multiphase labeling, unlike [18] and [19] we present a reformulation of the multiphase curve evolution problem that is commonly used in different imaging applications. Our work will improve the speed and quality of any application that uses the multiphase Vese-Chan curve evolution framework.

Bae and Tai [13] used fewer Boolean variables than [19]. However, they proposed a constraint on the relationship between the mean intensity values of the distinct classes to guarantee the submodularity of the multiphase functional, which restricts the input image significantly. If this constraint is not satisfied, they handle the optimization by a multi-step case analysis of the constructed graph. In our approach, we do not restrict the input domain but instead deal with the nonsubmodularity of the multiphase function using QPBO and QPBOP [21]. Moreover, although Bae and Tai did not discuss 3-phase segmentation in their work, we handle this scenario in our paper.

Until recently, graph cuts provided complete global solutions only if the function to be optimized was submodular. The recent significant contributions [18,13] towards the solution of the multiphase image segmentation problem aim at avoiding nonsubmodularity, sometimes at a very high cost. However, we believe that with the recent advancement in combinatorial optimization and the introduction of QPBO and QPBOP, researchers are encouraged to look beyond the submodularity constraint, deal with nonsubmodular functions directly, and use these powerful tools for the optimization. Although QPBO and QPBOP are
not guaranteed, theoretically, to provide a complete solution, we found them very efficient in practice, as we will discuss in the results section. The rest of the paper is organized as follows: Section 2 reviews the Vese-Chan level set formulation for the multiphase Mumford-Shah functional [8], presents the discrete formulation of the energy function, and provides the graph construction and the optimization details of the discrete energy function. Section 3 presents a sample of our segmentation results on synthetic and real images and shows MRI tissue classification as a potential application of the proposed approach. Finally, Section 4 concludes the paper and presents insights for future improvements.
2
Methods
We start this section by reviewing the Vese-Chan level set formulation for the multiphase image segmentation problem and then we proceed to the discrete formulation and optimization.
2.1
Review of Level Set Formulation of the Multiphase Mumford Shah model
This section reviews the multiphase image segmentation approach presented in [8]. Vese and Chan proved, using the four color theorem, that four classes are sufficient to provide a labeling for any number of classes. Therefore, we focus only on four and three class segmentation. Vese and Chan presented a multiphase image segmentation approach using level sets in [8]. It extends their model in [22] by using several level set functions instead of one. One level set function can only separate two classes, hence $n_1 = \log_2 n$ level sets are necessary to segment $n$ distinct classes. Figure 1 illustrates how two level set functions separate four classes. Two level set functions $\phi_1$ and $\phi_2$ span the input domain $\Omega$ such that $\Omega = \bigcup_{i=1}^{4} \omega_i$, where

$$\begin{aligned}
\omega_1 &= \{p \mid \phi_1(p) > 0 \text{ and } \phi_2(p) > 0\}, \\
\omega_2 &= \{p \mid \phi_1(p) > 0 \text{ and } \phi_2(p) < 0\}, \\
\omega_3 &= \{p \mid \phi_1(p) < 0 \text{ and } \phi_2(p) > 0\}, \\
\omega_4 &= \{p \mid \phi_1(p) < 0 \text{ and } \phi_2(p) < 0\}.
\end{aligned}$$

In [8], the authors perform the segmentation task by minimizing the intensity deviation around the class means and minimizing the length of every level set as a regularization. The mean intensities inside each of the previous regions are represented by the vector $C = [c_{11}\ c_{10}\ c_{01}\ c_{00}]$. Hence the segmentation can be obtained by minimizing the energy functional in [8], given by the following formulation:
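$$\begin{aligned}
F(C, \phi_1, \phi_2) = {} & \int_\Omega (u - c_{11})^2 H(\phi_1) H(\phi_2)\, dx\, dy \\
& + \int_\Omega (u - c_{01})^2 (1 - H(\phi_1)) H(\phi_2)\, dx\, dy \\
& + \int_\Omega (u - c_{10})^2 H(\phi_1) (1 - H(\phi_2))\, dx\, dy \\
& + \int_\Omega (u - c_{00})^2 (1 - H(\phi_1)) (1 - H(\phi_2))\, dx\, dy \\
& + \nu_1 \int_\Omega |\nabla H(\phi_1)|\, dx\, dy + \nu_2 \int_\Omega |\nabla H(\phi_2)|\, dx\, dy.
\end{aligned} \tag{1}$$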
ν1 and ν2 are weighting factors that control the length contribution in every level set and H(φ) is the Heaviside step function of the level set φ.
Fig. 1. Two evolving contours represented by two level set functions φ1 = 0 and φ2 = 0. The two level set functions subdivide the domain into 4 regions: {φ1 > 0, φ2 > 0}, {φ1 > 0, φ2 < 0}, {φ1 < 0, φ2 > 0} and {φ1 < 0, φ2 < 0}. Courtesy of Vese and Chan [8].
2.2 Discrete Formulation
To provide a discrete formulation for the multiphase segmentation energy in (1), each pixel $p \in \Omega$ is associated with a vector of binary variables $X_p = [x_p\ y_p]$. The vector components are defined as follows:
\[
x_p = \begin{cases} 1, & \phi_1(p) > 0; \\ 0, & \text{otherwise.} \end{cases} \tag{2}
\]
\[
y_p = \begin{cases} 1, & \phi_2(p) > 0; \\ 0, & \text{otherwise.} \end{cases} \tag{3}
\]
Hence $X_p \in \{[1\ 1], [0\ 1], [1\ 0], [0\ 0]\}$, and with the region definitions above the labels of the classes $\omega_1$, $\omega_2$, $\omega_3$ and $\omega_4$ are 11, 10, 01 and 00, respectively. The discrete formulation of the segmentation energy is given by the discrete function $F_D$, expressed as
\[
F_D = F_{Data} + \nu_1 F_{L_1} + \nu_2 F_{L_2}. \tag{4}
\]
In general, the data and its fidelity term can vary widely depending on the problem of interest. The same formulation can be used to solve problems beyond image segmentation, such as data clustering.
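As a minimal sketch of definitions (2) and (3), the following Python fragment maps the two level set values at a pixel to the binary vector X_p. The function names are illustrative and not part of the original work; rounding log2 n up for class counts that are not powers of two is an assumption, consistent with the use of two level sets for three classes.

```python
def num_level_sets(n_classes):
    """Level set functions needed for n classes: log2(n), rounded up (assumption)."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without float error.
    return max(1, (n_classes - 1).bit_length())


def binary_labels(phi1, phi2):
    """Map level set values at one pixel to X_p = (x_p, y_p), following (2)-(3)."""
    x_p = 1 if phi1 > 0 else 0
    y_p = 1 if phi2 > 0 else 0
    return (x_p, y_p)


# A pixel with phi1 > 0 and phi2 < 0 lies in region omega_2 and gets label (1, 0).
label = binary_labels(0.5, -1.0)
```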
Fig. 2. Graph construction. (a) G1, the graph that represents FL1, (b) G2, the graph that represents FL2, (c) G3, the graph that represents FData and (d) G, the final graph that represents FD.
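For simplicity, we will use the piecewise constant Mumford-Shah data fidelity with a target application of multiphase 2D image segmentation, in which case $F_{Data}$, $F_{L_1}$ and $F_{L_2}$ are defined as follows:
\[
\begin{aligned}
F_{Data} ={}& \sum_{p \in \Omega} (u(p) - c_{11})^2 x_p y_p + \sum_{p \in \Omega} (u(p) - c_{01})^2 (1 - x_p) y_p \\
&+ \sum_{p \in \Omega} (u(p) - c_{10})^2 x_p (1 - y_p) + \sum_{p \in \Omega} (u(p) - c_{00})^2 (1 - x_p)(1 - y_p),
\end{aligned}
\tag{5}
\]
\[
F_{L_1} = \sum_{p \in \Omega} \sum_{q \in N(p)} |x_p - x_q|\, w_{pq}, \tag{6}
\]
\[
F_{L_2} = \sum_{p \in \Omega} \sum_{q \in N(p)} |y_p - y_q|\, w_{pq}, \tag{7}
\]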
where $p = (i, j)$ represents an image pixel and $c_{11}$, $c_{10}$, $c_{01}$ and $c_{00}$ are the mean intensity values of the classes $\omega_1$, $\omega_2$, $\omega_3$ and $\omega_4$, respectively. The pixel neighborhood $N(p)$ is the set of the eight neighbors of $p$ (in the horizontal, vertical and diagonal directions), i.e. $q \in \{(i, j+1), (i, j-1), (i-1, j), (i+1, j), (i+1, j-1), (i+1, j+1), (i-1, j-1), (i-1, j+1)\}$, and $w_{pq}$ is the weight of the edge $v_p v_q$, which gives a discrete representation of the boundary length; details can be found in [2,23]. The mean intensities are expressed as follows:
\[
c_{11} = \frac{\sum_{p \in \Omega} u(p)\, x_p y_p}{\sum_{p \in \Omega} x_p y_p} \tag{8}
\]
\[
c_{01} = \frac{\sum_{p \in \Omega} u(p)\, (1 - x_p) y_p}{\sum_{p \in \Omega} (1 - x_p) y_p} \tag{9}
\]
\[
c_{10} = \frac{\sum_{p \in \Omega} u(p)\, x_p (1 - y_p)}{\sum_{p \in \Omega} x_p (1 - y_p)} \tag{10}
\]
\[
c_{00} = \frac{\sum_{p \in \Omega} u(p)\, (1 - x_p)(1 - y_p)}{\sum_{p \in \Omega} (1 - x_p)(1 - y_p)} \tag{11}
\]
To optimize the previous function, every binary variable is associated with a vertex in the graph. The vertices are arranged in two layers, one layer with the vertices corresponding to $x_p$ and another with the vertices representing $y_p$, as shown in Figure 2. The intra-layer links are determined by $F_{L_1}$ and $F_{L_2}$, and the inter-layer links are determined by $F_{Data}$ (more details in the next section).

Several studies, such as [11] and [12], have investigated the graph cut optimization of the multiphase Vese-Chan model. However, the discrete model in (5) violates the submodularity constraint in [24], which made the graph representation of the energy function very challenging. In these previous contributions, the authors presented a hierarchical approach that solves the multiphase segmentation problem by applying a sequence of binary segmentation steps. With the advances in the discrete optimization field and the contributions in [6] and [7], the multiphase segmentation problem can be solved in a non-sequential manner that reduces the time complexity of the segmentation and improves the robustness of the model.
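The class means (8)-(11) and the data term (5) can be sketched in a few lines of Python. This is an illustration only: the flat-list image representation, the function names and the convention that an empty class receives mean 0.0 are assumptions, not details taken from the original implementation.

```python
def class_means(u, x, y):
    """Means c[(a, b)] of u over pixels with (x_p, y_p) == (a, b), as in (8)-(11)."""
    sums = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    counts = dict.fromkeys(sums, 0)
    for u_p, x_p, y_p in zip(u, x, y):
        sums[(x_p, y_p)] += u_p
        counts[(x_p, y_p)] += 1
    # Assumption: an empty class gets mean 0.0 to avoid a division by zero.
    return {k: (sums[k] / counts[k] if counts[k] else 0.0) for k in sums}


def data_energy(u, x, y, c):
    """F_Data of (5): for each pixel only the term of its own class is nonzero."""
    return sum((u_p - c[(x_p, y_p)]) ** 2 for u_p, x_p, y_p in zip(u, x, y))


u = [1.0, 3.0, 10.0, 20.0, 30.0]
x = [1, 1, 0, 1, 0]
y = [1, 1, 1, 0, 0]
c = class_means(u, x, y)
f_data = data_energy(u, x, y, c)
```

Here the two pixels labeled 11 have mean 2.0 and contribute (1 - 2)^2 + (3 - 2)^2 to the data term, while the single-pixel classes contribute nothing.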
The next subsection presents the discrete optimization of the proposed energy function using Quadratic Pseudo Boolean Optimization.

2.3 Discrete Optimization
Kolmogorov and Rother [6] have reviewed the minimization of nonsubmodular functions using Quadratic Pseudo Boolean Optimization (QPBO), an approach that was first introduced in the optimization literature by Hammer et al. in [5]. Similar to graph cut methods, QPBO works by reducing the problem to the computation of a min S-T cut, but with two fundamental differences. First, the constructed graph contains double the number of vertices, as for every variable x two vertices
are added to the graph, providing the vertex set $V = \{v_x, v_{\bar{x}} \mid x \in X\}$. Second, QPBO in general provides only a partial labeling of the variables, such that
\[
x = \begin{cases} 0, & v_x \in S,\ v_{\bar{x}} \in T, \\ 1, & v_x \in T,\ v_{\bar{x}} \in S, \\ \emptyset, & \text{otherwise,} \end{cases} \tag{12}
\]
where $\emptyset$ means that QPBO fails to label the given vertex. The utility of QPBO is determined by the number of vertices that the approach fails to label. Experimentally, it has been verified [21] that if the number of nonsubmodular terms in the energy function is large, the output of QPBO contains many unlabeled vertices (Rother et al. [21] reported that up to 99.9% of the variables could be undetermined). A promising approach to resolve this problem is the extended roof duality presented in [25]. Rother et al. [21] also reviewed the extended roof duality approach, first introduced by Boros et al. [25], and presented a more efficient implementation than that of Boros et al. Their work extends QPBO by a probing operation that aims at calculating the global minimum for the vertices that have not been assigned a label by QPBO. The approach is referred to as QPBOP (Quadratic Pseudo Boolean Optimization with Probing). Rother et al. reported that applying QPBOP can reduce the number of unlabeled pixels from 99.9% to 0%.

Graph construction for the energy function in (4)

We construct three graphs $G_1$, $G_2$ and $G_3$ for $F_{L_1}$, $F_{L_2}$ and $F_{Data}$, respectively. Then we use the additivity theorem in [26] to obtain the graph $G$ that represents the discrete energy $F_D$.

Step 1: Construct a graph $G_1 = (V_1, E_1)$ with $|V_1| = N$ such that every pixel $p$ has a corresponding vertex $v_p$, and add the weights as follows: for all $p \in \Omega$ and $q \in N(p)$ satisfying $|x_p - x_q| > 0$, add an edge $e_{v_p v_q}$ with weight $w_{v_p v_q}$ (a detailed description of the weight values is found in [2] and [23]).

Step 2: Similarly, construct a graph $G_2 = (V_2, E_2)$ with $|V_2| = N$ such that every pixel $p$ has a corresponding vertex $z_p$, and add the weights as follows: for all $p \in \Omega$ and $q \in N(p)$ satisfying $|y_p - y_q| > 0$, add an edge $e_{z_p z_q}$ with weight $w_{z_p z_q}$.
Step 3: Construct a 4-partite graph $G_3 = (V_1, V_2, S, T, E_3)$ with an edge $e_{v_p z_p}$ with weight
\[
w_{v_p z_p} = (u(p) - c_{11})^2 - (u(p) - c_{01})^2 - (u(p) - c_{10})^2 + (u(p) - c_{00})^2.
\]
Add an edge $e_{S v_p}$ with weight $w_{S v_p} = (u(p) - c_{11})^2 - (u(p) - c_{10})^2$ and an edge $e_{z_p T}$ with weight $w_{z_p T} = (u(p) - c_{01})^2 - (u(p) - c_{11})^2$.

Graphs $G_1$, $G_2$ and $G_3$ are depicted in Figure 2 (a), (b) and (c), respectively. By the additivity theorem in [26], the graph $G = \{V, E\}$ is the graph in which $V = V_1 \cup V_2 \cup \{S, T\}$ and $E = \{e_{v_p z_p} \mid w_{v_p z_p}|_G = w_{v_p z_p}|_{G_1} + w_{v_p z_p}|_{G_2} + w_{v_p z_p}|_{G_3}\}$, i.e. the weight of an edge in $G$ is the sum of its weights in $G_1$, $G_2$ and $G_3$. Figure 2 (d) illustrates the construction of the graph $G$ from $G_1$, $G_2$ and $G_3$. After constructing the graph as discussed, we apply QPBO to obtain the labeling. Experimentally, QPBO gives a complete labeling in most cases. If QPBO does not provide a complete labeling, we apply QPBOP, which found an
Fig. 3. Three and four class segmentation via binary classifications on a two layer graph (top row: 3 class input image; bottom row: 4 class input image). The first column contains the input images. The second column shows the labeling of the vertices in layer 1 of the graph (xp). The third column shows the labeling of the vertices in layer 2 of the graph (yp). The last column shows the final segmentation results obtained by combining the labels from the two layers.
optimal solution in all of the images that we have segmented. For example, for the sample of results in this paper, QPBO provided a complete labeling for 11 out of 12 images, and for the one image for which QPBO gave a partial labeling, probing succeeded in labeling all the vertices that QPBO missed. For segmenting three classes, we simply assign the same data term to two classes and different data terms to the remaining two. Figure 3 illustrates the segmentation of a three class image and a four class image. The figure depicts the binary labeling in each layer of the graph and the final segmentation obtained by combining the binary labelings of the two layers.

2.4 Algorithm
Having introduced the mathematical formulation of our proposed solution, we summarize the implementation details as follows. The function is minimized on a graph G = {V, E}. The vertex set is V = {v1, v2, ..., vN, z1, z2, ..., zN} ∪ {S, T}, where N is the number of pixels in the input data set; i.e., each pixel p in the input data set has two corresponding vertices, vp in the first layer of the graph and zp in the second layer, and S and T are auxiliary vertices representing the source and target of the graph, respectively. The edge set E is subdivided into intra-layer edges that correspond to the regularization and inter-layer links that represent the data fidelity. Pseudocode of the algorithm is given in Algorithm 1.
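The vertex and edge layout just described can be sketched as follows. The integer indexing scheme and the function name are assumptions made for illustration; edge weights are omitted, and each undirected intra-layer pair is listed once, whereas the double sum in (6) and (7) visits every pair twice.

```python
def build_two_layer_graph(height, width):
    """Index vertices v_p (layer 1), z_p (layer 2), S and T, and list the edges."""
    n = height * width
    source, target = 2 * n, 2 * n + 1
    # Four offsets cover the eight-point neighborhood once per undirected pair.
    offsets = [(0, 1), (1, -1), (1, 0), (1, 1)]
    intra_x, intra_y, inter = [], [], []
    for i in range(height):
        for j in range(width):
            p = i * width + j              # index of v_p
            inter.append((p, n + p))       # inter-layer edge (v_p, z_p)
            for di, dj in offsets:
                a, b = i + di, j + dj
                if 0 <= a < height and 0 <= b < width:
                    q = a * width + b
                    intra_x.append((p, q))           # regularization edge in layer 1
                    intra_y.append((n + p, n + q))   # regularization edge in layer 2
    return {"n": n, "S": source, "T": target,
            "intra_x": intra_x, "intra_y": intra_y, "inter": inter}


graph = build_two_layer_graph(3, 3)
```

For a 3 x 3 image this yields 9 inter-layer edges and 20 intra-layer edges per layer (6 horizontal, 6 vertical and 8 diagonal).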
QPBO for the multiphase image classification
Input: A set of points p ∈ Ω with associated feature u(p)
Output: Labeling l(p) ∈ {l1, l2, l3, l4} for all p ∈ Ω
begin
  1. INITIALIZE the variables xp and yp for all p ∈ Ω
  while |F^(i+1) − F^(i)| > ε do
    Calculate c11, c01, c10 and c00 using (8) to (11)
    2. ADD data fidelity edges
       • Add an edge e_{vp zp} with weight w_{vp zp} = (u(p) − c11)^2 − (u(p) − c01)^2 − (u(p) − c10)^2 + (u(p) − c00)^2
       • Add an edge e_{S vp} with weight w_{S vp} = (u(p) − c11)^2 − (u(p) − c10)^2
       • Add an edge e_{zp T} with weight w_{zp T} = (u(p) − c01)^2 − (u(p) − c11)^2
    3. ADD boundary regularization, for all p, q ∈ Ω
       if xp ≠ xq then
         Add an edge e_{vp vq} with weight w_{vp vq}
       end
       if yp ≠ yq then
         Add an edge e_{zp zq} with weight w_{zp zq}
       end
    4. OPTIMIZE: Apply QPBO/QPBOP [21] to find the min cut C of the graph G.
    FIND LABELS: QPBO/QPBOP provides the labels for xp and yp. The binary combinations of xp and yp give the labels l1, l2, l3 and l4.
  end
end
Algorithm 1. Algorithm for the discrete optimization of the multiphase Mumford-Shah model.
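To make step 2 of Algorithm 1 concrete, the sketch below computes the three data fidelity weights for one pixel and checks whether the inter-layer term is submodular. By the condition of [24], a pairwise term θ(x, y) is submodular when θ(0,0) + θ(1,1) ≤ θ(0,1) + θ(1,0); here θ(x, y) = (u(p) − c_xy)^2, so the term is submodular exactly when the inter-layer weight is non-positive. The function names and the example means are hypothetical.

```python
def fidelity_weights(u_p, c):
    """Weights of the edges (v_p, z_p), (S, v_p) and (z_p, T) for one pixel."""
    d11 = (u_p - c[(1, 1)]) ** 2
    d01 = (u_p - c[(0, 1)]) ** 2
    d10 = (u_p - c[(1, 0)]) ** 2
    d00 = (u_p - c[(0, 0)]) ** 2
    w_vz = d11 - d01 - d10 + d00
    w_sv = d11 - d10
    w_zt = d01 - d11
    return w_vz, w_sv, w_zt


def is_submodular(u_p, c):
    """True when theta(0,0) + theta(1,1) <= theta(0,1) + theta(1,0) for this pixel."""
    return fidelity_weights(u_p, c)[0] <= 0


# Hypothetical class means, keyed by (x_p, y_p).
means = {(1, 1): 0.0, (0, 1): 1.0, (1, 0): 2.0, (0, 0): 4.0}
dark = fidelity_weights(0.0, means)
```

With these means the inter-layer weight is 11 for u(p) = 0 and -9 for u(p) = 10, so submodular and nonsubmodular terms can coexist in the same image, which is why a QPBO-type solver is needed.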
3 Experimental Results
Figure 4 shows a sample of our segmentation results for input images with three and four classes. The first column shows the input images; the second and third columns display the final segmentation results and the piecewise constant approximation, respectively. The first image in Figure 4 illustrates an example that results in frustrated cycles in the construction presented in [18] (the sky class contains the moon and the trees; the moon class excludes the trees class), but in our construction (given c00, c01, c10, c11) QPBO provided a complete labeling, which guarantees an optimal solution. Tissue classification in brain magnetic resonance images is a potential application that can benefit greatly from this approach. The classification algorithm should subdivide the image into four classes: background, gray matter, white matter and cerebrospinal fluid. We have applied our optimization of the multiphase Mumford-Shah model to the BrainWeb simulated data set [27]. Figure 5 depicts a sample of our results for MRI tissue classification. The first row shows three
Fig. 4. First column: the input images. Second column: the segmentation results. Third column: the piecewise constant approximation. The first two images contain only three classes and the last two images contain four classes.
T1-weighted MRI slices in axial, coronal and sagittal views, respectively. The second row displays the corresponding segmentations. The third row shows three slices in the different views for T2-weighted MRI, and their segmentation results are depicted in the fourth row. To validate our approach, we ran our segmentation algorithm on 100 images from the simulated BrainWeb data set. We repeated the experiments 8 times with different regularization strengths, varying ν1 and ν2 from 0.001 to 10000. In all 800 trials we obtained a complete labeling, which verifies the efficiency of QPBOP even when a high regularization parameter was used.
Fig. 5. Sample of the segmentation results of T1 and T2 weighted MRI brain slices in coronal, sagittal and axial views.
4 Conclusion
Vese and Chan [8] presented a level set numerical implementation of the multiphase Mumford-Shah model and promoted its applicability to image segmentation. Later research papers investigated more efficient implementations of the model. However, these studies either provide a continuous optimization, which is less robust and slower than discrete optimization, or apply recursive bisection with the bimodal Mumford-Shah model as the core of the segmentation algorithm, which affects the robustness of the segmentation whether the optimization is continuous or discrete. This paper provided a discrete optimization of the multiphase Mumford-Shah functional that does not require recursive bisection and provides a globally optimal solution in practice. Most of the previous work avoids nonsubmodularity and directs its effort toward formulating the multiphase segmentation problem in a submodular domain. Our construction instead deals with the nonsubmodularity using QPBO/QPBOP, efficient tools that, in practice, provided a globally optimal solution in all 800 segmentation trials.
References
1. Bruckstein, A.M., Netravali, A.N., Richardson: Epi-convergence of discrete elastica. Applicable Analysis 79, 137–171 (1997)
2. El-Zehiry, N.Y., Sahoo, P., Xu, S., Elmaghraby, A.: Graph cut optimization for the Mumford Shah model. In: Proceedings of VIIP (August 2007)
3. Grady, L., Alvino, C.: The piecewise smooth Mumford-Shah functional on an arbitrary graph. IEEE Trans. on Image Proc. 18, 2547–2561 (2009)
4. Darbon, J., Sigelle, M.: A Fast and Exact Algorithm for Total Variation Minimization. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3522, pp. 351–359. Springer, Heidelberg (2005)
5. Hammer, P.L., Hansen, P., Simeone, B.: Roof duality, complementation and persistency in quadratic 0-1 optimization. Mathematical Programming 28, 121–155 (1984)
6. Kolmogorov, V., Rother, C.: Minimizing nonsubmodular functions with graph cuts - a review. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1274–1279 (2007)
7. Rother, C., Kolmogorov, V., Lempitsky, V., Szummer, M.: Optimizing binary MRFs via extended roof duality. In: Computer Vision and Pattern Recognition - CVPR, vol. 5302, pp. 248–261 (2007)
8. Vese, L.A., Chan, T.F.: A multiphase level set framework for image segmentation using the Mumford and Shah model. International Journal of Computer Vision 50, 271–293 (2002)
9. Ni, K., Hong, B.-W., Soatto, S., Chan, T.: Unsupervised multiphase segmentation: A recursive approach. Computer Vision and Image Understanding 113, 502–510 (2009)
10. Jeon, M., Alexander, M., Pedrycz, W., Pizzi, N.: Unsupervised hierarchical image segmentation with level set and additive operator splitting. Pattern Recognition Letters 26
11. El-Zehiry, N.Y., Elmaghraby, A.: Brain MRI tissue classification using graph cut optimization of the Mumford-Shah functional. In: IVCNZ, New Zealand, pp. 321–326 (2007)
12. Bae, E., Tai, X.-C.: Graph cut optimization for the piecewise constant level set method applied to multiphase image segmentation. In: Tai, X.-C., Mørken, K., Lysaker, M., Lie, K.-A. (eds.) SSVM 2009. LNCS, vol. 5567, pp. 1–13. Springer, Heidelberg (2009)
13. Bae, E., Tai, X.-C.: Efficient Global Minimization for the Multiphase Chan-Vese Model of Image Segmentation. In: Cremers, D., Boykov, Y., Blake, A., Schmidt, F.R. (eds.) EMMCVPR 2009. LNCS, vol. 5681, pp. 28–41. Springer, Heidelberg (2009)
14. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J.P., Osher, S.: Fast global minimization of the active contour/snake model. Journal of Math. Imag. and Vis. 2, 151–167 (2007)
15. Badshah, N., Chen, K.: On two multigrid algorithms for modeling variational multiphase image segmentation. Trans. Img. Proc. 18, 1097–1106 (2009)
16. Vazquez-Reina, A., Miller, E., Pfister, H.: Multiphase geometric couplings for the segmentation of neural processes. In: CVPR, pp. 2020–2027 (2009)
17. Simon, H., Teng, S.-H.: How good is recursive bisection? SIAM Journal on Scientific Computing 18, 1436–1445 (2001)
18. Delong, A., Boykov, Y.: Global optimal segmentation of multi-region objects. In: ICCV, vol. 1, pp. 26–33 (2009)
19. Ramalingam, S., Kohli, P., Alahari, K., Torr, P.H.S.: Exact inference in multi-label CRFs with higher order cliques. In: CVPR (2008)
20. Ishikawa, H.: Exact optimization for Markov random fields with convex priors. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1333–1336 (2003)
21. Rother, C., Kolmogorov, V., Lempitsky, V.S., Szummer, M.: Optimizing binary MRFs via extended roof duality. In: CVPR (2007)
22. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Transactions on Image Processing 10, 266–277 (2001)
23. Kolmogorov, V., Boykov, Y.: What metrics can be approximated by geo-cuts, or global optimization of length/area and flux. In: International Conference on Computer Vision, ICCV 2005, vol. 1, pp. 564–571 (2005)
24. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. on Patt. Anal. and Mach. Int. 26, 147–159 (2004)
25. Boros, E., Hammer, P.L., Tavares, G.: Preprocessing of unconstrained quadratic binary optimization. Technical Report RRR 10-2006, RUTCOR (2006)
26. Kolmogorov, V.: Graph based algorithms for scene reconstruction from two or more views. PhD thesis, Cornell University (2003)
27. Collins, D.L., Zijdenbos, A.P., Kollokian, V., Sled, J.G., Kabani, N.J., Holmes, C.J., Evans, A.C.: Design and construction of a realistic digital brain phantom. IEEE Trans. Med. Imaging 17, 463–468 (1998)
Image Segmentation with a Shape Prior Based on Simplified Skeleton
Boris Yangel and Dmitry Vetrov
Lomonosov Moscow State University
Abstract. In this paper we propose a new deformable shape model that is based on a simplified skeleton graph. The model can account for different shape variations and can incorporate global constraints such as a known orientation or scale of the object. We combine the model with low-level image segmentation techniques based on Markov random fields and derive an approximate algorithm for the minimization of the energy function by stochastic coordinate descent. Experiments on two different sets of images confirm that using the proposed shape model as a prior leads to improved segmentation quality.
Keywords: Image segmentation, shape prior, MRF, skeleton.
1 Introduction
Image segmentation is an important but inherently ambiguous problem. Practical segmentation systems rely on some user input and provide a way to combine that input with low-level cues, such as color distributions and contrast edges observed in the image. Sometimes high-level information about the image, such as the shape of the segmented object, is available. Intuitively, the more prior information about the segmented object is involved in the model, the better its final segmentation is. The Bayesian framework provides an efficient way of combining low- and high-level information within a unified model. Low-level cues are usually taken into account by using the well-studied theory of Markov random fields (MRF) [1]. However, the straightforward addition of high-level information to the MRF framework makes it intractable, hence some extensions of MRFs are needed.

1.1 Related Work
Prior work on image segmentation with shape models can be divided into several classes. Some methods use a shape prior in the form of a hard object mask, which is represented either via level sets [2] or via a distance function [3]. To cope with missing information about the object location in the image, these methods use an iterative location re-estimation scheme combined with repeated segmentation of the image. While approaches based on a hard object mask can be quite robust when the object shape is similar to the mask, they are not applicable to classes of objects with high shape variability.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 247–260, 2011. © Springer-Verlag Berlin Heidelberg 2011
Similar approaches to shape modeling are used in [4] and [5]. In [4] the shape prior is represented by a set of hard object masks, and branch-and-bound is used to choose the right prior from that set. Paper [5] proposes a probability mask as a shape prior; it combines variational segmentation under this prior with a continuous analogue of the branch-and-bound technique for object location estimation.

Another class includes approaches that limit the set of possible shapes of the object. Examples include the star-shape prior [6] and the tightness prior [7]. These broad restrictions can be of great utility in particular situations, while being completely useless in others. For example, mislabeled pixels can sometimes make an object even more tight or star-shaped.

One more class of techniques, close to our work in spirit, includes approaches that represent shape as a set of rigid parts that may have various positions with respect to one another. In [8] the shape model is represented by a layered pictorial structure. A Monte Carlo EM algorithm is used to obtain the image labeling, and sampling from the posterior distribution on the parameters of the pictorial structure is performed via loopy belief propagation. A similar approach is presented in [9], where a stick figure of a human is used as a prior. In [10] a shape model based on a skeleton is used for part-based object detection.

1.2 Contribution
In this paper we propose a new shape prior that represents object shape via a simplified skeleton graph. Edges of the graph correspond to meaningful parts of an object. Radii assigned to the vertices of the graph specify the width of the object parts. The proposed shape model can describe object shape variation and global shape constraints such as known orientation or scale, and it can also account for non-uniform scale of object parts via soft constraints on vertex radii. We also propose a framework for combining MRF segmentation with a shape prior. Our framework leads to an iterative segmentation process with two-step iterations. In the first step the shape model is re-estimated via stochastic optimization. In the second step, MRF segmentation is combined with the current shape model to produce a new pixel labeling. We also show that the proposed approach can be seen as a specific kind of EM algorithm.
2 An Iterative Approach to Segmentation with a Shape Prior
In this section we propose an iterative approach to MRF segmentation with a shape prior based on a generative model. Then we discuss its relation to the EM algorithm. Our approach is based on the graphical model presented in figure 1. In this model the random variable S represents the shape model that produces pixel labels Li (one per pixel). Labels Li take values 0 (pixel belongs to background) and 1 (pixel belongs to object). The shape model does not produce pixel labels independently, but in a
Fig. 1. Generative model for the segmentation with a shape prior (nodes: shape S, labels Li, ..., Li+3 and image pixels Ii, ..., Ii+3)
way that neighboring pixels are likely to have the same label. This soft constraint is represented by edges between the labels of neighboring pixels. Finally, each image pixel Ii is generated independently from its label Li using a class-specific color model.

2.1 Iterative Segmentation as a Coordinate Descent
The joint probability in the discussed graphical model can be expressed as
\[
P(S, L, I) = P(I \mid L)\, P(L \mid S)\, P(S) = \frac{1}{Z} f(S) \prod_i h_i(I_i, L_i) \prod_{(i,j) \in N} \tilde{\phi}_{ij}(L_i, L_j, S), \tag{1}
\]
where $N$ is a neighborhood model and $\tilde{\phi}_{ij}$ are potentials corresponding to 3-cliques of the model graph. We further assume that each of the potentials $\tilde{\phi}_{ij}$ can be expressed as a product of pairwise terms. Then we can finally rewrite the joint probability as
\[
P(S, L, I) = \frac{1}{Z} f(S) \prod_i h_i(I_i, L_i) \prod_{(i,j) \in N} \phi_{ij}(L_i, L_j) \prod_i \phi_i(L_i, S). \tag{2}
\]
Let us now state the problem of image segmentation as the problem of finding
\[
\begin{aligned}
S^*, L^* &= \arg\max_{S, L} P(S, L \mid I) = \arg\max_{S, L} P(S, L, I) \\
&= \arg\min_{S, L} \Bigl[ -\log f(S) - \sum_i \bigl( \log h_i(I_i, L_i) + \log \phi_i(L_i, S) \bigr) - \sum_{(i,j) \in N} \log \phi_{ij}(L_i, L_j) \Bigr].
\end{aligned}
\tag{3}
\]
Minimization of this expression can be performed by coordinate descent with respect to two groups of variables, $L$ and $S$. In this case the update expressions for each group can be written as
\[
S^{new} = \arg\min_S \Bigl[ -\log f(S) - \sum_i \log \phi_i(L_i^{old}, S) \Bigr], \tag{4}
\]
\[
L^{new} = \arg\min_L \Bigl[ -\sum_i \bigl( \log h_i(I_i, L_i) + \log \phi_i(L_i, S^{new}) \bigr) - \sum_{(i,j) \in N} \log \phi_{ij}(L_i, L_j) \Bigr]. \tag{5}
\]
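The alternation of the updates (4) and (5) can be sketched on a one-dimensional toy problem. Everything specific below is an assumption made for illustration: the shape S is a single interval center with fixed half-width, f(S) is uniform, the potentials take the values 0.9 and 0.1, the unary term is a squared distance to the class mean, and the pairwise terms are omitted so that the label update decouples per pixel (with pairwise terms a graph cut would be used instead).

```python
import math


def neg_log_phi(label, i, s, r=2, inside=0.9):
    """-log phi_i(L_i, S) for a toy shape: interval [s - r, s + r] (assumed form)."""
    w = inside if abs(i - s) <= r else 1.0 - inside
    return -math.log(w if label == 1 else 1.0 - w)


def neg_log_h(intensity, label, mu=(0.0, 1.0)):
    """-log h_i(I_i, L_i) as squared distance to an assumed class mean."""
    return (intensity - mu[label]) ** 2


def update_shape(labels):
    """Step (4) with uniform f(S): exhaustive search over interval centers."""
    return min(range(len(labels)),
               key=lambda s: sum(neg_log_phi(l, i, s) for i, l in enumerate(labels)))


def update_labels(image, s):
    """Step (5) without pairwise terms: each pixel is labeled independently."""
    return [min((0, 1), key=lambda l: neg_log_h(v, l) + neg_log_phi(l, i, s))
            for i, v in enumerate(image)]


def coordinate_descent(image, max_iters=20):
    # Initial labeling ignores the shape prior.
    labels = [min((0, 1), key=lambda l: neg_log_h(v, l)) for v in image]
    s = update_shape(labels)
    for _ in range(max_iters):
        new_labels = update_labels(image, s)
        if new_labels == labels:   # no label changed: converged
            break
        labels = new_labels
        s = update_shape(labels)
    return labels, s


image = [0.1, 0.0, 0.2, 0.9, 0.45, 1.0, 0.4, 0.8, 0.1, 0.0, 0.1]
labels, center = coordinate_descent(image)
```

In this example the initial labeling misses the two weak pixels inside the object; after the shape is fitted, the modified unary terms recover them.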
Note that the update step for L is in fact a regular binary image segmentation problem with unary potentials modified by the shape prior. Thus, it can be solved efficiently using graph cuts when the pairwise terms φij are submodular. The update step for S, on the other hand, is a more challenging optimization problem. The optimization algorithm for it should be selected according to the particular form of f(S).

2.2 Relation to EM Algorithm
The iterative segmentation approach discussed in section 2.1 can be seen as a particular form of the expectation-maximization algorithm. If we consider $S$ a group of latent variables and use the EM approach to maximize the posterior probability $P(L \mid I)$, the M-step takes the form
\[
\begin{aligned}
L^{new} &= \arg\max_L \; \mathbb{E}_{S \sim P^*(S)} \log P(I, S \mid L) + \log P(L) \\
&= \arg\max_L \; \mathbb{E}_{S \sim P^*(S)} \log P(I, L, S) \\
&= \arg\max_L \; \mathbb{E}_{S \sim P^*(S)} \Bigl[ \sum_i \log \phi_i(L_i, S) \Bigr] + \sum_i \log h_i(I_i, L_i) + \sum_{(i,j) \in N} \log \phi_{ij}(L_i, L_j).
\end{aligned}
\tag{6}
\]
The distribution $P^*(S)$ comes from the E-step and has the form
\[
P^*(S) = P(S \mid I, L^{old}) = \frac{P(S, I, L^{old})}{P(I, L^{old})} = \frac{1}{Z} \prod_i \phi_i(L_i^{old}, S)\, f(S). \tag{7}
\]
It can be easily shown that the algorithm presented in section 2.1 is equivalent to approximating P ∗ (S) with delta function centered at the distribution mode on the E-step. In this case E-step corresponds to an update of S, while M-step is equivalent to updating L by solving regular segmentation problem.
Interpreting the proposed segmentation algorithm as a particular case of EM can give rise to alternative ways of solving the problem. For example, instead of approximating P∗(S) with a delta function, one could use Monte Carlo EM to approximate the expectation itself.
3 Simplified Figure Skeleton as a Shape Prior
In this section we present a new shape prior that allows for controllable shape variation. We also discuss a way to build the prior into the segmentation approach presented in section 2.

3.1 Graph-Based Shape Model
In order to handle significant shape variation, we propose a graph-based shape model where each edge encodes some meaningful part of the object. Radii assigned to each vertex of the graph allow us to encode variable width of each object part. This representation can be seen as a simplified version of object skeleton. One example of such a representation is shown in figure 2.
Fig. 2. Giraffe image with graph-based shape model
Soft constraints can be introduced into this shape model via an MRF-like energy function
\[
E(S) = -\log f(S) = \sum_i U_i(e_i) + \sum_{(i,j) \in N_S} B_{ij}(e_i, e_j), \tag{8}
\]
where $e_i$ denotes the $i$-th edge of the shape graph and $N_S$ is the set of all pairs of neighboring edge indices. The unary terms $U_i$ represent global constraints on the edge itself, such as its scale or location. The binary terms $B_{ij}$ can constrain relative sizes and angles between the connected object parts. This model can easily be made invariant to rigid transformations by removing all the global constraints and considering only relative ones. Parameters of the unary and binary terms can be learned from a set of labeled images using techniques such as ML estimation. The particular forms of shape energy that we used in our experiments are covered in section 4.

3.2 Unary Potentials
In order to complete the description of the proposed shape prior, we should specify the potential functions $\phi_i(L_i, S)$ from (3). It is natural to assume that pixels located near the edges of the shape graph will almost certainly belong to the object, while pixels that are far from any edge will most likely belong to the background. This observation yields the following expression for $\phi_i$:
\[
\phi_i(L_i, S) = L_i \max_j W(e_j, i) + (1 - L_i)\bigl(1 - \max_j W(e_j, i)\bigr). \tag{9}
\]
Fig. 3. Edge width and distance for points A and B
Fig. 4. log φi(1, S) for the shape model of a giraffe
In this expression $W(e, i)$ denotes a function that decreases from 1 to 0 as the distance from the edge $e$ to the pixel $i$ increases. In our experiments we have used a function $W$ of the form
\[
W(e, i) = \exp\left( -w \left[ \max\left( 0, \frac{\mathrm{dist}(e, i) - \alpha\, \mathrm{width}(e, i)}{(1 - \alpha)\, \mathrm{width}(e, i)} \right) \right]^p \right), \tag{10}
\]
where $\mathrm{dist}(e, i)$ is the distance from edge $e$ to pixel $i$ and $\mathrm{width}(e, i)$ is the edge width for that pixel (see figure 3). This function equals 1 while the distance from the edge goes from 0 to $\alpha\, \mathrm{width}(e, i)$; then it decreases in such a way that for $\mathrm{dist}(e, i) = \mathrm{width}(e, i)$ it has the value $\exp(-w)$. An example of unary potentials calculated by the proposed function can be seen in figure 4.

3.3 Shape Fitting via Simulated Annealing
As mentioned earlier, some segmentation methods use discrete [4] or continuous [5] versions of the branch-and-bound method for shape fitting. We found that the local minima achieved with simulated annealing (SA) were good enough for the whole approach to work, so we decided to use it for the S update step (4). On each SA iteration we slightly perturbed the positions and radii of the graph vertices to update the current solution. The perturbation variance was proportional to the temperature T = 1 / log k at iteration k. The optimization process usually converged in 2000-4000 iterations in our experiments. Annealing initialization is explained in section 4.2.
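The ingredients of this fitting step can be sketched as follows: the falloff function W of (10), written in terms of a precomputed distance and width, and an annealing loop with temperature T = 1 / log k. The text specifies only the temperature schedule and that the perturbation variance is proportional to T; the Metropolis acceptance rule, the step scale, the seeding and the function names below are assumptions of this sketch.

```python
import math
import random


def W(dist, width, w=math.log(2), alpha=0.7, p=2):
    """Eq. (10): 1 up to alpha * width, then decaying; exp(-w) at dist == width."""
    excess = max(0.0, (dist - alpha * width) / ((1.0 - alpha) * width))
    return math.exp(-w * excess ** p)


def temperature(k):
    """T = 1 / log k, defined for iteration numbers k >= 2."""
    return 1.0 / math.log(k)


def anneal(objective, start, iters=2000, step=1.0, seed=0):
    """Minimize objective over a parameter list by simulated annealing (sketch)."""
    rng = random.Random(seed)
    current, cur_val = list(start), objective(start)
    best, best_val = list(current), cur_val
    for k in range(2, iters + 2):
        t = temperature(k)
        # Perturbation variance proportional to T, hence std proportional to sqrt(T).
        cand = [c + rng.gauss(0.0, step * math.sqrt(t)) for c in current]
        val = objective(cand)
        # Assumed Metropolis rule: always accept improvements, sometimes accept worse.
        if val <= cur_val or rng.random() < math.exp((cur_val - val) / t):
            current, cur_val = cand, val
            if val < best_val:
                best, best_val = list(cand), val
    return best, best_val


# Toy objective standing in for -log f(S) - sum_i log phi_i(L_i, S).
best, best_val = anneal(lambda q: (q[0] - 3.0) ** 2, [0.0], iters=200)
```

In an actual S update the parameter list would hold the vertex positions and radii, and the objective would evaluate (4) using W for every pixel and edge.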
4 Experiments
The segmentation method presented in this paper was tested on two sets of images. One set was obtained by filtering the giraffe photos used in [11], leaving only photos with giraffes in lateral view. The other set consisted of various images with a capital “E” letter; it was built manually from various sources. Images from both sets lack reliable edge and color models for object and background, and therefore need some additional information, such as a shape model, to improve segmentation quality. Shape models for both giraffes and letters were set manually. The models are explained in more detail in sections 4.3 and 4.4. A bounding box containing the object of interest was specified for every image. This form of initialization is a simpler alternative to providing seeds for object and background.

4.1 Unary and Pairwise Terms
In our experiments the color models for object and background were represented by mixtures of Gaussians. We used the approach proposed in [7] to learn color models
using a bounding box of the object. The number of components in the mixture was set to 3. For the pairwise terms we used a 4-connected neighborhood model $N$. The terms were calculated as
\[
\phi_{ij}(L_i, L_j) = \exp\left( -\lambda\, I[L_i \neq L_j] \left( e^{-c \frac{(B_i - B_j)^2}{\sigma^2}} + d \right) \right), \tag{11}
\]
where $B_k$ represents the color intensity of pixel $k$. The constant $c$ was set to 1.2, $d$ was set to 0.1, $\lambda$ was set to 10 and $\sigma$ was set to the average difference between the intensities of neighboring pixels. We used $\phi_i(L_i, S)$ of the form (9) with $W(e, i)$ as in (10). The constants were set as $w = \ln 2$, $\alpha = 0.7$, $p = 2$.

4.2 Coordinate Descent
As mentioned in section 3.3, the $S$ update step (4) was performed by simulated annealing (SA). On the first update of $S$ we initialized the SA solution by automatically fitting the most probable shape (the one that minimizes $E(S)$) into the provided bounding box. The shape found on the previous iteration was used as the SA initialization for all subsequent $S$ update steps. The label update step (5) was performed via graph cuts. The function optimized during this step had the form

\[ F(L, S) = -\sum_{i} \log h_i(I_i, L_i) - w_s \sum_{i} \log \phi_i(L_i, S) - \sum_{(i,j) \in N} \log \phi_{ij}(L_i, L_j). \tag{12} \]
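The alternation between the two update steps, together with the ramp of $w_s$ and the stopping rule described in this section, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `update_shape` and `update_labels` are hypothetical callables standing in for the simulated annealing step (4) and the graph cut step (5), and delaying the convergence test until the ramp has finished is an assumption.

```python
def shape_weight(iteration, ramp_iterations=10):
    """Linearly increase w_s from 0 to 1 over the first `ramp_iterations` iterations."""
    return min(1.0, iteration / ramp_iterations)


def changed_fraction(old_labels, new_labels):
    """Fraction of pixels whose label differs between two labelings."""
    changed = sum(1 for a, b in zip(old_labels, new_labels) if a != b)
    return changed / len(new_labels)


def coordinate_descent(labels, shape, update_shape, update_labels,
                       tol=0.0002, ramp_iterations=10, max_iterations=100):
    """Alternate shape and label updates until few labels change.

    update_shape(labels, shape) -> shape   (hypothetical stand-in for step (4))
    update_labels(shape, w_s)   -> labels  (hypothetical stand-in for step (5))
    """
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        w_s = shape_weight(iteration, ramp_iterations)
        shape = update_shape(labels, shape)
        new_labels = update_labels(shape, w_s)
        fraction = changed_fraction(labels, new_labels)
        labels = new_labels
        # Assumption: only test for convergence once w_s has reached 1.
        if w_s >= 1.0 and fraction < tol:
            break
    return labels, shape, iteration
```

The initial `labels` argument corresponds to the labeling computed with $w_s = 0$, and `shape` to the shape fitted into the bounding box.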
We found that segmentation can be made more robust by smoothly increasing the parameter $w_s$ from 0 to 1 during the first several iterations of coordinate descent. This can be viewed as a heuristic for avoiding local minima. We used the labeling computed without the shape prior ($w_s = 0$) as the initial value for $L$. Coordinate descent stopped when the fraction of pixels whose labels changed after the $L$ update step was less than 0.0002. The shape influence $w_s$ was increased linearly during the first 10 iterations. In our experiments the optimization usually converged in 12 to 15 iterations and took about 2 to 3 minutes on a modern computer for a 320 × 240 image.

4.3 Giraffe Segmentation
We applied our algorithm to the set of giraffe photos described above. The results were compared to segmentation without a shape prior (the initial labeling for our approach) and to the segmentation obtained by the method from [7], which enforces a tightness constraint on the segmented object.
Fig. 5. Graph-based model of giraffe shape with vertices P1 to P8 and edges e12, e13, e15, e24, e37, e48, e56. Here $P_i = (P_i^x, P_i^y, P_i^r)$.
The graph-based model of giraffe shape is presented in figure 5. We used the following expression for the shape energy:

\[ E(S) = \sum_{i=1}^{8} R_i\left(P_i^r, \sqrt{(P_1^x - P_2^x)^2 + (P_1^y - P_2^y)^2}\right) + E_{13}(e_{13}, e_{12}) + E_{24}(e_{24}, e_{12}) + E_{37}(e_{37}, e_{13}) + E_{48}(e_{48}, e_{24}) + E_{15}(e_{15}, e_{12}) + E_{56}(e_{56}, e_{15}). \tag{13} \]

In this expression the term $R_i$ constrains the radius of the $i$-th vertex according to the length of the giraffe body:

\[ R_i(r, l; \rho_i, \sigma_i^r) = \frac{1}{\sigma_i^r} (r - \rho_i l)^2. \tag{14} \]

Here $\rho_i$ specifies how the radius of a particular vertex relates to the length of the giraffe body, and $\sigma_i^r$ controls the softness of the constraint. The pairwise terms $E_{ij}(e_1, e_2)$ establish constraints on the relative lengths of, and the angles between, neighboring edges:

\[ E_{ij}(e_1, e_2; \alpha_{ij}, \sigma_{ij}^\alpha, \rho_{ij}, \sigma_{ij}^l) = \frac{1}{\sigma_{ij}^\alpha} \left(\angle(e_1, e_2) - \alpha_{ij}\right)^2 + \frac{1}{\sigma_{ij}^l} \left(\|e_1\| - \rho_{ij} \|e_2\|\right)^2. \tag{15} \]

The parameter $\alpha_{ij}$ specifies the mean angle between the edges, $\rho_{ij}$ relates the length of one edge to the length of the other, and the parameters $\sigma_{ij}^\alpha$ and $\sigma_{ij}^l$ control the softness of the corresponding constraints. The energy function includes only relative constraints, and thus it is invariant to rotation and uniform scale of the shape. Values for all the parameters used in $R_i$ and $E_{ij}$ were selected manually.
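The two building blocks of the shape energy, the radius term (14) and the pairwise edge term (15), can be written down directly. The following Python sketch is illustrative only: edges are represented as 2D vectors, and measuring $\angle(e_1, e_2)$ as a signed angle with the deviation wrapped to $[-\pi, \pi)$ is an assumption, since the text does not specify the angle convention.

```python
import math


def radius_term(r, l, rho, sigma_r):
    """R_i from (14): squared deviation of radius r from rho * l, scaled by 1 / sigma_r."""
    return (r - rho * l) ** 2 / sigma_r


def edge_length(e):
    """Euclidean length of a 2D edge vector."""
    return math.hypot(e[0], e[1])


def angle_between(e1, e2):
    """Signed angle from e2 to e1 in (-pi, pi] (assumed convention)."""
    cross = e2[0] * e1[1] - e2[1] * e1[0]
    dot = e1[0] * e2[0] + e1[1] * e2[1]
    return math.atan2(cross, dot)


def pairwise_term(e1, e2, alpha, sigma_alpha, rho, sigma_l):
    """E_ij from (15): soft constraints on the angle and relative length of two edges."""
    angle_dev = angle_between(e1, e2) - alpha
    # Wrap the angular deviation to [-pi, pi) so that equivalent angles agree.
    angle_dev = (angle_dev + math.pi) % (2.0 * math.pi) - math.pi
    length_dev = edge_length(e1) - rho * edge_length(e2)
    return angle_dev ** 2 / sigma_alpha + length_dev ** 2 / sigma_l
```

Because both functions depend only on edge lengths and on the angle between edges, rotating all edges by the same angle leaves their values unchanged.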
Fig. 6. Experimental results for giraffe images. Left: initial segmentation. Middle: segmentation with tightness prior. Right: segmentation with graph-based shape prior.
Fig. 7. Examples of bad segmentation with shape prior. Odd images: initial segmentation. Even images: segmentation with shape prior.

Fig. 8. Graph-based model of capital “E” letter with vertices P1 to P6 and edges e12, e14, e23, e25, e36.
Segmentations obtained by the proposed method for several giraffe photos are shown in figure 6. In most cases the initial segmentation includes many pixels with wrong labels. The graph-based shape prior appears to improve segmentation quality significantly. The tightness prior, on the other hand, is almost useless for pictures of this kind: many segmentation errors occur near the boundaries of the bounding box and therefore make the object appear even tighter. Some typical situations in which the proposed method performs poorly are shown in figure 7. The left pair of images shows how a bad initial segmentation can lead to a coordinate descent solution that is far from the desired optimum. Nevertheless, the resulting segmentation is much closer to the ground truth than the initial one. The other images show situations in which our hand-made shape model fails to handle all the shape variations, leading to a segmentation with some of the object pixels labeled as background. We think that a more flexible shape model trained on labeled data could help to deal with such errors.

4.4 Letter Segmentation
We also tested our algorithm on a number of images containing a capital letter “E”. The shape model we used is shown in figure 8. As with the giraffe shape model, we used a shape energy function of the form

\[ E(S) = \sum_{i=1}^{5} R_i\left(P_i^r, \sqrt{(P_1^x - P_3^x)^2 + (P_1^y - P_3^y)^2}\right) + E_{12}(e_{12}, e_{25}) + E_{23}(e_{23}, e_{25}) + E_{14}(e_{14}, e_{12}) + E_{36}(e_{36}, e_{23}), \tag{16} \]

with $R_i$ defined in (14) and $E_{ij}$ defined in (15). Results of segmenting letter images with and without the shape prior are shown in figure 9. As with the giraffe photos, the shape prior improved segmentation quality significantly.

Fig. 9. Algorithm results for letter images. Left: segmentation without shape prior. Middle: segmentation with shape prior. Right: shape skeleton obtained during segmentation.
5 Conclusion
In this paper we present an iterative approach to image segmentation with a shape prior. The approach is based on posterior probability maximization via coordinate descent and can be seen as a degenerate kind of EM algorithm. Each iteration of coordinate descent consists of two stages: shape fitting via simulated annealing, and image segmentation with re-estimated unary terms. We also propose a shape prior that is applicable to objects with well-defined structure. The presented shape model is a graph with a variable width specified for every edge. Such a representation makes it possible to control shape variation and to specify global constraints such as a known orientation or scale of an object. We show how such a prior can be built into the proposed segmentation scheme. Experiments confirm that the proposed shape prior can make segmentation less sensitive to the lack of reliable information about object edges and color.

Finding a more efficient shape matching technique for the $S$ update step to replace simulated annealing is one direction of future research. Possible options include a DP-based approach similar to the one used in [12] and the branch-and-bound technique [4]. It is, however, questionable whether branch-and-bound can improve processing speed significantly; this question requires further investigation. It would also be useful to eliminate the manual stage of shape model creation. One can try to learn the structure of the shape graph, together with the parameters controlling its flexibility, from a set of manually segmented images.
References

1. Boykov, Y.Y., Jolly, M.-P.: Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. In: 2001 IEEE 8th International Conference on Computer Vision, vol. 1, pp. 105–112. IEEE, Los Alamitos (2001)
2. Vu, N., Manjunath, B.S.: Shape prior segmentation of multiple objects with graph cuts. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
3. Freedman, D., Zhang, T.: Interactive Graph Cut Based Segmentation with Shape Priors. In: 2005 IEEE Conference on Computer Vision and Pattern Recognition, pp. 755–762 (2005)
4. Lempitsky, V., Blake, A., Rother, C.: Image segmentation by branch-and-mincut. In: Proceedings of the 10th European Conference on Computer Vision, pp. 15–29 (2008)
5. Cremers, D., Schmidt, F.R., Barthel, F.: Shape priors in variational image segmentation: Convexity, Lipschitz continuity and globally optimal solutions. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition (2008)
6. Veksler, O.: Star Shape Prior for Graph-Cut Image Segmentation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 454–467. Springer, Heidelberg (2008)
7. Lempitsky, V., Kohli, P., Rother, C., Sharp, T.: Image segmentation with a bounding box prior. In: IEEE 12th International Conference on Computer Vision, pp. 277–284. IEEE, Los Alamitos (2009)
8. Kumar, M.P., Torr, P.H.S., Zisserman, A.: Obj Cut. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1. IEEE Computer Society Press, Los Alamitos (2005)
9. Bray, M., Kohli, P., Torr, P.H.S.: Posecut: Simultaneous segmentation and 3d pose estimation of humans using dynamic graph-cuts. In: Proceedings of the 8th European Conference on Computer Vision, vol. 01, pp. 642–655 (2006)
10. Latecki, L.J.: Active skeleton for non-rigid object detection. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 575–582. IEEE Computer Society Press, Los Alamitos (2009)
11. Quack, T., Ferrari, V., Leibe, B., Van Gool, L.: Efficient Mining of Frequent and Distinctive Feature Configurations. In: 2007 IEEE 11th International Conference on Computer Vision (2007)
12. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recognition. International Journal of Computer Vision 61, 55–79 (2005)
High Resolution Segmentation of Neuronal Tissues from Low Depth-Resolution EM Imagery

Daniel Glasner (1,2,∗), Tao Hu (1), Juan Nunez-Iglesias (1), Lou Scheffer (1), Shan Xu (1), Harald Hess (1), Richard Fetter (1), Dmitri Chklovskii (1), and Ronen Basri (1,2,∗)

(1) Janelia Farm Research Campus, Howard Hughes Medical Institute
(2) Department of Computer Science and Applied Mathematics, Weizmann Institute of Science
Abstract. The challenge of recovering the topology of massive neuronal circuits can potentially be met by high throughput Electron Microscopy (EM) imagery. Segmenting a 3-dimensional stack of EM images into the individual neurons is difficult, due to the low depth-resolution in existing high-throughput EM technology, such as serial section Transmission EM (ssTEM). In this paper we propose methods for detecting the high resolution locations of membranes from low depth-resolution images. We approach this problem using both a method that learns a discriminative, over-complete dictionary and a kernel SVM. We test this approach on tomographic sections produced in simulations from high resolution Focused Ion Beam (FIB) images and on low depth-resolution images acquired with ssTEM, and evaluate our results by comparing them to manual labeling of this data.

Keywords: Segmentation of neuronal tissues, Task-driven dictionary learning, Sparse over-complete representation, Connectomics.
1 Introduction
Recent years have seen several large-scale efforts to recover the structure of the neuronal networks of various animals’ brains [1,2]. Detecting every single neuron and its synaptic connections to other neurons in dense neuronal tissues requires both high-resolution and high-throughput imaging techniques. Currently, the only technology that can potentially meet this challenge is high-throughput Electron Microscopy (EM) followed by automated image analysis and, finally, manual proofreading [3]. Effective image analysis techniques can greatly speed up this process by reducing the need for manual labour.

High-throughput Electron Microscopy imagery of neuronal tissues can be obtained using serial section Transmission EM (ssTEM) technology. In ssTEM, a fixed and embedded neuronal tissue is sliced into sections of about 50 nm in thickness. Each section is then observed using an electron microscope, producing a 2D projection image of the section with a pixel size of about 10 × 10 nm². Although the images obtained with this method are of high quality, due to the thick sectioning, membranes crossing the section in oblique directions appear blurry (see examples in Figures 1 and 2). Moreover, the same membrane can appear displaced in consecutive sections, making it difficult to link the regions belonging to the same neuron from one section to the next. High depth-resolution can be obtained by Focused Ion Beam (FIB) [4] and serial section electron tomography [5]. However, because of their low throughput, these techniques are currently limited to small tissue volumes and therefore cannot be used to reconstruct complete neuronal networks.

∗ D.G. and R.B. acknowledge the support and the hospitality of the Janelia visitors program under which this work was carried out.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 261–272, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Part of a tomographic section produced in simulation from high resolution FIB data (left, slice thickness 50 nm) and the corresponding middle section in the original high resolution image (right, slice thickness 10 nm). Notice the blurry membranes in the left image, which appear much sharper in the right image. Such membranes are difficult to detect in the low depth-resolution data. An example of a blurry membrane is marked with a green (vertical) arrow; a non-membrane region with similar appearance is marked with a red (horizontal) arrow. A mitochondrion, surrounded by a blue circle, can be seen in the lower part of the image.
Once the EM imagery is collected, a crucial step in reconstructing the underlying neuronal circuits is to segment each individual neuron in the 3-dimensional images [3,6]. Segmentation of neurons can be difficult since different neurons usually share similar intensity and texture distributions, requiring one to accurately locate their bounding membranes. As neurons are usually very long, each segmentation mistake can lead to significant mistakes in the topology of the recovered network. Moreover, as current, high throughput EM techniques (such as ssTEM) are limited in their depth resolution, relating the different 2D segments across different sections can be challenging.
Fig. 2. The figure shows parts of a low depth-resolution serial section Tomographic EM image (left images) and the corresponding middle section in an image reconstructed with super-resolution (right images). Again, notice the blurry membranes in the left images (marked by green arrows), which are somewhat rectified by the super-resolution reconstruction.
A recent approach proposed to improve the depth resolution of ssTEM by using “limited angle tomography” [7]. In this technique, each section is imaged at only a few angles to maintain the high throughput, and computational methods are used to reconstruct the volume structure with high depth resolution. In particular, Veeraraghavan et al. [8] used a sparse representation of the volume over a manually chosen over-complete dictionary [9,10], whereas Hu et al. [7] used a dictionary learned on high-resolution FIB data. After the volume is reconstructed at higher resolution it can be segmented in 3D.

In this paper we follow up on the work presented in [8] and in [7], and propose to segment the neuronal cells directly from low depth-resolution EM images, bypassing the reconstruction step. We employ two existing methods for detecting the location of membranes in high resolution. The first method learns a discriminative, over-complete dictionary to relate the input tomographic projections to the high resolution class labels. We use the algorithm of [11] (for other methods see, e.g., [12,13]). The second method approaches this classification task using a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel.

We test these approaches on two sets of images. The first set includes low depth-resolution images constructed in simulations from high resolution FIB data of fly larva. The second set includes low depth-resolution images of fly larva obtained with ssTEM technology. We evaluate our results by comparing classification results to manual labeling by proofreaders on both the high resolution FIB images and the super-resolved ssTEM images. We further compare our results to results obtained with other classification methodologies.
2 Approach

2.1 Learning a Discriminative, Over-Complete Dictionary
We cast the problem of segmenting the EM imagery as a classification problem. Our objective is to classify the high resolution voxels as either membrane or non-membrane given the low depth-resolution input images. Inspired by the success of super-resolution methods [8], which demonstrated that neuronal tissues can be effectively reconstructed using an over-complete dictionary, we first approach this problem using sparse representations over a learned dictionary. Our approach is based on the formulation of Mairal et al. [11], adapting their method to the resolution available at test time and the desired resolution of the output. We train a dictionary on the low resolution training data, so that the learned dictionary optimizes a classification loss function over the high resolution labels. At test time we apply the learned dictionary to the input low resolution data. Below we describe our approach in more detail.

We begin by defining our training objective. The training data includes a collection of labels for 3D high resolution patches. Let $x \in \mathbb{R}^p$ denote a vector layout of a high resolution patch, and let $y \in \{-1, 0, 1\}^p$ contain the label for each of the voxels in $x$, where the label 1 denotes a membrane, −1 a non-membrane, and 0 an unknown. Let $z = Px \in \mathbb{R}^q$ be a vector layout of the tomographic projections of $x$, where the (possibly unknown) $q \times p$ matrix $P$ denotes the projection operator. Our task is to learn an association between the low resolution patches and the high resolution class labels.

We train a classifier by learning an over-complete dictionary; below we review the method of [11] as applied to our problem. Let $D$ denote the sought dictionary. $D$ is a matrix of size $q \times k$, where $k$ is the number of dictionary elements (typically $k = 2q$). Given $D$ we decompose a training patch $z$ over $D$ by optimizing a functional of the following form:

\[ \alpha^*(z, D) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|z - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1 + \frac{\lambda_2}{2} \|\alpha\|_2^2. \tag{1} \]
The first term in this equation seeks a vector of coefficients $\alpha \in \mathbb{R}^k$ that encodes the low resolution patch $z$ in terms of the dictionary $D$. The remaining terms use the Elastic Net formulation [14] to regularize $\alpha$. In particular, the second term is the $\ell_1$ norm of $\alpha$, which encourages a sparse encoding over the dictionary. The final term is the squared $\ell_2$ norm of $\alpha$, which provides stability by convexifying the problem. $\lambda_1, \lambda_2 \ge 0$ are constants. We further constrain the encoding coefficients $\alpha$ to be non-negative. Given an optimal encoding vector $\alpha^*$ we define our loss function using the logistic loss. Let

\[ L(\alpha^*, y, W) = \sum_{i=1}^{p} \log\left(1 + e^{-y_i w_i^T \alpha^*}\right), \tag{2} \]

where $y = (y_1, \ldots, y_p)^T$ is a vector of the provided labels, and the $k \times p$ matrix $W = (w_1, \ldots, w_p)$ is a set of learned classification weights¹. Our training procedure optimizes this loss function:

\[ f(D, W) = E_{y,x}\left[L(\alpha^*, y, W)\right] + \frac{\nu}{2} \|W\|_F^2, \tag{3} \]

where $E$ denotes expectation taken over the distribution of $(y, x)$. The rightmost term is a regularization term; $\|\cdot\|_F$ denotes the Frobenius matrix norm, and $\nu$ is a predetermined constant. We seek to optimize $f(D, W)$ over all choices of dictionaries $D \in \mathcal{D}$ and weight matrices $W \in \mathbb{R}^{k \times p}$, where

\[ \mathcal{D} = \{ D \in \mathbb{R}^{q \times k} \mid \|d_i\|_2 = 1 \;\; \forall i \in 1, \ldots, k \}. \tag{4} \]
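To make the encoding (1) and the loss (2) concrete, the following Python sketch computes a non-negative elastic-net code for a single patch and evaluates the logistic loss on it. It is a minimal illustration under stated assumptions and not the authors' implementation: it uses projected proximal gradient descent in place of the LARS-type solver used in the paper, the step size rule and iteration count are arbitrary choices, and `D` is stored as a list of rows while `W` is stored as a list of its columns $w_i$.

```python
import math


def sparse_code(z, D, lam1, lam2, iterations=500):
    """Approximately solve (1) subject to alpha >= 0 by projected proximal gradient.

    z: list of q floats; D: list of q rows, each a list of k floats.
    """
    q, k = len(D), len(D[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of D^T D,
    # so this step size is safe for the smooth part of the objective.
    step = 1.0 / (sum(d * d for row in D for d in row) + lam2)
    alpha = [0.0] * k
    for _ in range(iterations):
        residual = [sum(D[r][c] * alpha[c] for c in range(k)) - z[r] for r in range(q)]
        # Gradient of the quadratic terms; for alpha >= 0 the l1 term adds lam1.
        grad = [sum(D[r][c] * residual[r] for r in range(q)) + lam2 * alpha[c] + lam1
                for c in range(k)]
        alpha = [max(0.0, a - step * g) for a, g in zip(alpha, grad)]
    return alpha


def logistic_loss(alpha, y, W):
    """Loss (2); W is given as a list of the weight vectors w_i, one per label."""
    total = 0.0
    for y_i, w_i in zip(y, W):
        margin = y_i * sum(w * a for w, a in zip(w_i, alpha))
        # Numerically stable evaluation of log(1 + exp(-margin)).
        if margin > 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total
```

With an identity dictionary the code reduces to a shifted, clipped copy of the input, which gives a quick sanity check: `sparse_code([1.0, 0.01], [[1.0, 0.0], [0.0, 1.0]], lam1=0.1, lam2=0.0)` keeps the first coefficient, reduced by `lam1`, and sets the second to zero.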
In practice we only learn a subset of $\tilde{p}$ of the labels, $1 \le \tilde{p} \le p$, so $W$ is $k \times \tilde{p}$. Our optimization jointly constructs a dictionary $D$ and a weight matrix $W$ that achieve optimal classification over the training data. Further details of the optimization are provided below.

At test time, given a low resolution patch $z = Px$ for some unknown high resolution patch $x$, our objective is to recover the labels $y$ that correspond to $x$. To this end we use (1) to find an optimal encoding $\alpha^*$ of $z$ over the trained dictionary $D$, and then use the obtained coefficients to recover the sought label values by setting $y = \operatorname{sign}(W^T \alpha^*)$. In general, we set $\tilde{p}$ so as to obtain overlapping predictions of $y$ from neighboring patches. We average those predictions for each high resolution voxel to obtain our final classification.

Optimization: The construction of the dictionary $D$ and the training weights $W$ is done using stochastic gradient descent. The gradient of our functional $f(D, W)$ in (3) can be written as in [11]:

\[ \nabla_W f(D, W) = E_{y,x}\left[\nabla_W L(\alpha^*, y, W)\right] + \nu W, \tag{5} \]

\[ \nabla_D f(D, W) = E_{y,x}\left[-D \beta \alpha^{*T} + (z - D\alpha^*) \beta^T\right], \tag{6} \]

where $\beta \in \mathbb{R}^k$ is a vector defined as follows. Let $\Lambda$ denote the set of indices of the non-zero coefficients in $\alpha^*$; then the $\Lambda$ entries of $\beta$ are set to

\[ \beta_\Lambda = (D_\Lambda^T D_\Lambda + \lambda_2 I)^{-1} \nabla_{\alpha_\Lambda} L(\alpha^*, y, W), \tag{7} \]

and the rest of the entries are set to zero. Stochastic gradient descent proceeds iteratively as follows:

1. Select an i.i.d. patch sample $z \in \mathbb{R}^q$ from the training set, along with its corresponding labeling $y \in \mathbb{R}^{\tilde{p}}$.
2. Compute its sparse coding by solving (1) (e.g. using a modified LARS [15]).
3. Compute the active set $\Lambda$.
4. Compute $\beta$ using (7).
5. Update $W$ and $D$ by subtracting their respective gradients, scaled by a learning rate $\rho_t$.

We initialize our dictionary by training an unsupervised representation dictionary $D_{init}$.

¹ In our implementation we extend $\alpha^*$ by appending the entry 1 to allow an affine shift in $W^T \alpha^*$.

2.2 SVM Classifier
As an alternative we train a Support Vector Machine (SVM) classifier with either a linear or a Radial Basis Function (RBF) kernel. For the SVM classifier, let $z_1, \ldots, z_N$ denote the training patches, and let $y_1, \ldots, y_N$ denote the $k$-th label ($1 \le k \le p$) of each $x_i$ ($1 \le i \le N$). We train a classifier that optimizes the standard hinge loss, written in dual form as

\[ \min_a \; \frac{1}{2} \sum_{i,j} a_i a_j K(z_i, z_j) + c \sum_i \max\Big(0,\; 1 - y_i \sum_j a_j K(z_i, z_j)\Big), \tag{8} \]

where $a \in \mathbb{R}^N$ are the support weights, $c$ is a constant, and $K(\cdot, \cdot)$ is a kernel function. For the linear SVM, $K(z_i, z_j) = z_i^T z_j$. The RBF kernel is $K(z_i, z_j) = e^{-\gamma \|z_i - z_j\|}$, where $\gamma$ is a scale factor. At test time, given a low resolution patch $z$, we assign the labels $y^k = \sum_i a_i^k K(z_i, z)$.
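The objective (8) and the test-time score can be evaluated with a few lines of Python. The sketch below is only an illustration of the formulas, not a trainer and not the authors' code: it follows the kernel as printed above, and the hypothetical `squared` flag switches to the more common Gaussian form that squares the distance.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = u^T v."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma, squared=False):
    """RBF kernel as printed in the text; squared=True uses the squared distance."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    dist = dist_sq if squared else math.sqrt(dist_sq)
    return math.exp(-gamma * dist)


def svm_objective(a, y, Z, kernel, c):
    """Value of objective (8) for support weights a, labels y, training patches Z."""
    n = len(Z)
    K = [[kernel(Z[i], Z[j]) for j in range(n)] for i in range(n)]
    scores = [sum(a[j] * K[i][j] for j in range(n)) for i in range(n)]
    quadratic = 0.5 * sum(a[i] * scores[i] for i in range(n))
    hinge = sum(max(0.0, 1.0 - y[i] * scores[i]) for i in range(n))
    return quadratic + c * hinge


def decision_value(z, a, Z, kernel):
    """Test-time score sum_i a_i K(z_i, z) for one label."""
    return sum(a_i * kernel(z_i, z) for a_i, z_i in zip(a, Z))
```

A kernel with a fixed scale can be passed as a closure, for example `lambda u, v: rbf_kernel(u, v, gamma=2 ** -4)`.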
3 Experiments
To test our approach we have evaluated our method on simulated tomographic projection images constructed from high resolution FIB data. In addition we show results on low depth-resolution ssTEM data.

3.1 Parameter Selection
For the dictionary-based method we achieved similar performance using twice and thrice over-complete dictionaries, and chose to use twice over-complete dictionaries in all our experiments. The value of $\lambda_1$ was set to 0.03 for the FIB experiment and to 0.05 for the ssTEM experiment. These values were chosen from the set {0.15, 0.125, 0.1, 0.075, 0.05, 0.04, 0.03, 0.02} using cross validation. The value of $\lambda_2$ was fixed at 0.01 in all the experiments. In all the experiments we trained using mini-batches of size $\eta = 200$ and ran three epochs of $T = 20{,}000$ iterations each. The base learning rate $\rho$ was decreased from epoch to epoch, taking the values 0.5, 0.1, and 0.01, and $\rho_t$ was then set to $\min(\rho, \rho\, t_0 / t)$, where $t_0 = T/100$. We use the SPAMS toolbox [16,17] to train the unsupervised dictionary and to find an initial $W$ given the unsupervised dictionary. We also use it to solve the lasso during training and testing.

For the SVM with the RBF kernel we used $c = 1$ and $\gamma = 2^{-4}$. These values were chosen using cross validation on a grid of different $(c, \gamma)$ values. For the linear SVM we used $c = 0.5$ after running cross validation on an extensive set of possible $c$ values.

3.2 Simulations with FIB Data
High resolution 3D images of fly larva were acquired with an electron microscope using the Focused Ion Beam (FIB) protocol. The volume included $500^3$ voxels of size 10 × 10 × 10 nm³ each. Proofreaders labeled the thin (1 voxel wide) skeletons of the membranes by correcting the results of watershed segmentation. Additional labels were assigned to mitochondria (an example mitochondrion is marked in Figure 1). In the FIB data experiments we ignored the voxels marked as mitochondria. In addition, we ignored voxels of Euclidean distance greater than $\sqrt{2}$ from marked membranes, as those voxels often are membrane voxels, but they are not marked as such by the proofreaders. We used half of the data for training and a disjoint $200^3$ part of the volume for testing.

We produced tomographic sections of the volume by averaging each 5 Z-sections in one of 5 directions: parallel to the Z-axis and at ±45° toward the X and Y directions (see Figure 3). We then selected block patches of size 9 × 9 × 15 (obtaining $p = 1{,}215$) from the original high resolution volume and used the 5 tomographic projections to produce 2D patches. The parallel tomography sections produced patches of size 9 × 9 × 3. The oblique tomography sections were further intersected with this patch area, producing patches of sizes 5 × 9 × 3 and 9 × 5 × 3. Concatenating these patches we obtained feature vectors of size $q = 783$ (243 entries from the parallel section and 4 × 135 from the four oblique sections). With each vector we associate $\tilde{p} = 45$ labels, marking the center 3 × 3 × 5 voxels in the high resolution patches with the proofreaders’ labels. Overall the test volume included 662,675 (8.28%) membrane voxels, 5,142,613 (64.28%) non-membrane voxels, and 2,194,712 (27.43%) unknowns.

Fig. 3. 2D tomographic sections provide voxel averages in a single direction. This figure illustrates how 2D tomographic sections are produced from a 3D volume in directions −45° (left), 0° (middle), and +45° (right) from vertical in the X-Z plane. In our experiments we additionally used tomographic sections produced in directions −45° and +45° from vertical in the Y-Z plane.

As a preprocessing step we linearly stretch the values of the input volume after cropping to the range between the 0.001 and 0.999 quantiles of the observed values. We further center each patch by subtracting its mean and scale it to have unit $\ell_2$ norm. To reduce the dimension of the learning problem we applied Principal Component Analysis (PCA) to the feature vectors. We chose the number of components to account for 95% of the energy in the feature vectors, reducing the dimensionality to 173.

Figure 4 shows a recall-precision plot of our results. These results are also summarized in Table 1, which shows the maximal F-measure (harmonic mean of the recall and precision values) obtained with each method. Our proposed dictionary-based and kernel SVM methods (denoted DIC-LR and SVM-RBF-LR) achieve F-measures of 90.07% and 88.26% respectively. These values are very close to the classification result on the original high resolution data (91.28%, denoted DIC-HR), which can be thought of as a ceiling for our method. We further compare these results to running the dictionary method on super-resolved data (denoted DIC-SR), which achieves an F-measure of 88.68%. This indicates that we can achieve similar or even better classification values if we skip the step of super-resolution reconstruction and classify the low depth-resolution data directly. Finally, as a baseline we show the results of classifying the membranes using a linear SVM (SVM-LIN-LR) and Linear Discriminant Analysis (LDA-LR).

Table 1. Best F-measure achieved by each method on the FIB data

Method        Score
DIC-HR        91.28%
DIC-LR        90.07%
DIC-SR        88.68%
SVM-RBF-LR    88.26%
SVM-LIN-LR    78.16%
LDA-LR        76.76%

3.3 ssTEM Data
Low depth-resolution 3D images of fly larva were acquired using the serial section Transmission EM technique. Each section, of thickness 50 nm, was photographed 5 times from roughly the same directions that were simulated with the FIB data (Section 3.2). Each of the 5 volumes obtained included 558 × 558 × 16 voxels of size 10 × 10 × 50 nm³. To label this data we applied a super-resolution reconstruction using an over-complete dictionary. Proofreaders then labeled the membrane voxels by marking their skeletons (again, by correcting the results of watershed segmentation). No labeling of mitochondria was available for this data. We used half of the data for training and a disjoint block of size 200 × 200 × 65 for testing. As before, from the 5 tomographic sections we extracted patches of sizes 9 × 9 × 3 (for the parallel tomographic section) and 5 × 9 × 3 and 9 × 5 × 3 (for the other sections), obtaining feature vectors of size $q = 783$. With each vector we associate $\tilde{p} = 45$ labels, marking the center 3 × 3 × 5 voxels in the high resolution patches as either membranes, non-membranes, or unknowns. Overall the test data included 340,104 (13.08%) membrane voxels, 1,646,992 (63.34%) non-membrane voxels, and 612,974 (23.58%) unknowns.

Fig. 4. Results obtained on the FIB data. The figure shows a recall-precision plot of our methods, compared to membrane classification on the high resolution and super-resolved data. (Horizontal axis: recall; vertical axis: precision; both range from 0 to 1. Curves: DIC-HR, DIC-SR, DIC-LR, SVM-RBF-LR, SVM-LIN-LR, LDA-LR.)

Fig. 5. Results obtained on the ssTEM data. The figure shows a recall-precision plot of our methods compared to baseline methods and membrane classification on super-resolved data. (Horizontal axis: recall; vertical axis: precision; both range from 0 to 1. Curves: SVM-RBF-SR, SVM-RBF-LR, DIC-LR, LDA-LR, SVM-LIN-LR.)

Fig. 6. A classification example. The figure shows part of a section of an ssTEM image (top left), ground truth labeling by a proofreader (top right), and labeling scores obtained with the dictionary-based method (bottom left) and the SVM with RBF kernel (bottom right).

We applied the same preprocessing as for the FIB data (described in Section 3.2). Applying PCA, we reduce the dimension of the feature vectors to 170. Figure 5 shows a recall-precision plot of our results. The results are also summarized in Table 2, which shows the maximal F-measure obtained with each method. Both our proposed methods (denoted DIC-LR and SVM-RBF-LR) achieve similar F-measures, 87.38% and 85.75% respectively. These values are slightly lower than the score obtained by running the SVM on the super-resolved data (denoted SVM-RBF-SR), which was 88.23%. Note, however, that the labeling in this experiment is done on the super-resolved data, so it may be biased toward this approach. Finally, as a baseline we show the results of classifying the membranes using a linear SVM and LDA. Figure 6 shows an example of the classification scores obtained with the dictionary-based method and the SVM with RBF kernel.

Table 2. Best F-measure achieved by each method on the ssTEM data

Method        Score
SVM-RBF-SR    88.23%
DIC-LR        87.38%
SVM-RBF-LR    85.75%
SVM-LIN-LR    64.28%
LDA-LR        64.5%
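The maximal F-measure reported in Tables 1 and 2 is the largest harmonic mean of precision and recall along the recall-precision curve. A minimal Python sketch of this computation is given below. It is not the authors' evaluation code: sweeping the decision threshold over the observed scores and skipping voxels labeled 0 (unknown) are assumptions made for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule `score >= threshold` for membrane (+1) voxels.

    Labels are in {-1, 0, 1}; voxels labeled 0 (unknown) are skipped (assumption).
    """
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        if label == 0:
            continue
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == -1:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_f_measure(scores, labels):
    """Maximal F-measure over all thresholds taken from the observed scores."""
    return max(f_measure(*precision_recall(scores, labels, t))
               for t in sorted(set(scores)))
```

Each threshold yields one point on the recall-precision plot, so the same sweep can also be used to draw curves like those in Figures 4 and 5.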
4 Conclusion
We presented a system for membrane classification for the segmentation of neuronal tissues in low depth-resolution EM imagery. We showed that both a classification method that learns a discriminative, over-complete dictionary and an SVM with an RBF kernel, trained on the low depth-resolution EM data with high resolution labeling, can achieve accurate classification of membranes, bypassing the need for an additional step of super-resolution reconstruction. These techniques can therefore potentially reduce the amount of manual labor required for reconstructing the topology of the observed cells.
References

1. Varshney, L.R., Chen, B.L., Paniagua, E., Hall, D.H., Chklovskii, D.B.: Structural properties of the Caenorhabditis Elegans neuronal network. PLoS Comput. Biol. 7 (2011)
2. Helmstaedter, M., Briggman, K.L., Denk, W.: 3d structural imaging of the brain with photons and electrons. Curr. Opin. Neurobiol. 18, 633–641 (2008)
3. Chklovskii, D.B., Vitaladevuni, S., Scheffer, L.K.: Semi-automated reconstruction of neural circuits using electron microscopy. Curr. Opin. Neurobiol. 20, 667–675 (2010)
4. Knott, G., Marchman, H., Wall, D., Lich, B.: Serial section scanning electron microscopy of adult brain tissue using focused ion beam milling. J. Neurosci. 28, 2959–2964 (2008)
5. Mcewen, B.F., Marko, M.: The emergence of electron tomography as an important tool for investigating cellular ultrastructure. J. Histochem. Cytochem. 49, 553–563 (2001)
6. Jain, V., Turaga, S.C., Seung, H.S.: Machines that learn to segment images: a crucial technology of connectomics. Curr. Opin. Neurobiol. 20, 653–666 (2010)
7. Hu, T., Nunez-Iglesias, J., Scheffer, L., Xu, S., Hess, H., Fetter, R., Chklovskii, D.B.: Super-resolution reconstruction of brain structure using sparse representation over learned dictionary (2011) (submitted)
8. Veeraraghavan, A., Genkin, A.V., Vitaladevuni, S., Scheffer, L., Xu, S., Hess, H., Fetter, R., Cantoni, M., Knott, G., Chklovskii, D.B.: Increasing depth resolution of electron microscopy of neural circuits using sparse tomographic reconstruction. In: CVPR (2010)
9. Adler, A., Hel-Or, Y., Elad, M.: A shrinkage learning approach for single image super-resolution with overcomplete representations. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 622–635. Springer, Heidelberg (2010)
10. Yang, J., Wright, J., Huang, T., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. on Image Proc. (2011)
11. Mairal, J., Bach, F., Ponce, J.: Task-driven dictionary learning. In: arXiv:1009.5358v1 (2010)
12. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Discriminative learned dictionaries for local image analysis. In: CVPR (2008)
13. Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. In: CVPR (2010)
14. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. JRSSB 67, 301–320 (2005)
15. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Annals of Statistics 32, 407–499 (2004)
16. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 689–696. ACM, New York (2009)
17. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research 11, 19–60 (2010)
Optical Flow Guided TV-L1 Video Interpolation and Restoration Manuel Werlberger, Thomas Pock, Markus Unger, and Horst Bischof Institute for Computer Graphics and Vision, Graz University of Technology, Austria {werlberger,pock,unger,bischof}@icg.tugraz.at
Abstract. The ability to generate intermediate frames between two given images in a video sequence is an essential task for video restoration and video post-processing. In addition, restoration requires robust denoising algorithms, must handle corrupted frames and recover from impaired frames accordingly. In this paper we present a unified framework for all these tasks. In our approach we use a variant of the TV-L1 denoising algorithm that operates on image sequences in a space-time volume. The temporal derivative is modified to take the pixels’ movement into account. In order to steer the temporal gradient in the desired direction we utilize optical flow to estimate the velocity vectors between consecutive frames. We demonstrate our approach on impaired movie sequences as well as on benchmark datasets where the ground-truth is known.
1 Introduction
With the rise of digitalization in the film industry, the restoration of historic videos has gained in importance. The degradation of the original material is caused by chemical decomposition or by abrasion from repeated playback. Another typical artifact of aged films is noise, be it due to dirt, dust, scratches, or simply long-term storage. For an overview of potential artifacts and their occurrence we refer to the survey of Kokaram [1]. In areas where film reels get bonded together when played back, several consecutive frames are often entirely destroyed. On the one hand, the recreation of frames and parts of sequences can be used to restore corrupted film segments; on the other hand, the generation of time-interpolated viewpoints is essential for the post-production industry. A prevalent application is the generation of slow motion or the simulation of tracking shots (a film sequence where the camera is mounted on a wheeled platform). Many methods deal with the problem of frame interpolation; we describe some approaches in this field. Image-based rendering often relies on additional geometric constraints of the scene, such as the 3D scene geometry, scene depth or epipolar constraints [2,3,4,5,6]. With the known set of 2D images and the additional geometric constraints a 3D model is built and used to render novel views. To achieve a higher
This work was supported by the BRIDGE project HD-VIP (no. 827544).
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 273–286, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Image sequence interpolation. The first and the last frames are given and the frames in between are generated by our method. The cutout in the second row shows that motion is interpolated naturally in the generated frames (the head's movement, the closing of the mouth, etc.). Note that the occluded regions are also handled correctly despite the noisy input material. This is nicely illustrated by the continuation of the white pole in the background when the head disoccludes this region.
accuracy when relying on the reconstructed 3D geometry it is common practice to use calibrated and synchronized camera arrays. Besides that, the movie industry often uses simpler approaches to create visual effects that need interpolated frames. View morphing [7,8,9,10,11] uses correspondences between images to transform both into an intermediate position and afterwards superimposes and blends both warped images to generate a single interpolated result. Correspondences can be introduced by a user who defines corresponding points manually, or by a feature matching algorithm. The general problem of image morphing techniques is that the warping and blending might introduce errors when dealing with complex motions between the known images. Especially in the presence of (dis-)occlusions caused by moving objects, the approach might exhibit artifacts in these regions. In [12], a path-based approach is introduced that estimates a path for every pixel. To estimate intermediate pixels, the paths are linearly blended between the known positions. Because a path is defined at every pixel, no holes are produced when generating interpolated frames. Although the approach is closely related to optical flow [13], the authors emphasize that there is no need to calculate optical flow to create pixel correspondences. They call their approach an inverse optical flow algorithm. Recently, a method for computing optical flow in an optimal control framework [14] has been used for image sequence interpolation [15]. Optical flow in the control framework searches for the flow field such that the interpolated image matches the given input image and is therefore well suited to the sequence interpolation task.
In this work we propose a method that is capable of denoising image sequences, restoring impaired frames, reconstructing completely lost frames and interpolating images between frames. Note that all of this is handled within a rather simple model. Nevertheless, the results can compete with the current state of the art and even outperform some existing methods. As a motivation we show two frames of a historical video sequence and two in-between images generated by our method in Fig. 1. The cutout region shows that existing motion (head movement) and deformation (closing of the mouth) are handled correctly and produce reasonable intermediate images. As our model is closely related to variational image denoising we first give some insights into this topic. In the field of image restoration, total variation based methods have been studied intensely in the last decade, starting with the introduction of the so-called ROF denoising model in the seminal work of Rudin, Osher and Fatemi [16]. Minimizing the energy functional

$$\min_u \int_\Omega |\nabla u| + \lambda\, (u - f)^2 \, dx \tag{1}$$

computes the solution $u$ that most likely fits the observed input image $f$, both defined on the image domain $\Omega$. The term $|\nabla u|$ denotes the well-known total variation, which is used as a smoothness prior and preserves the edges of the reconstruction $u$. This model is appropriate for removing additive Gaussian noise. Similar to the ROF model, the TV-L1 model [17,18]

$$\min_u \int_\Omega |\nabla u| + \lambda\, |u - f| \, dx \tag{2}$$

is a popular approach when it comes to removing impulse noise. Besides the ability to remove strong outliers, replacing the $\ell_2$ norm of the ROF model with the $\ell_1$ norm makes the TV-L1 model contrast invariant. In addition to image denoising applications, (2) can be used for shape denoising and feature selection tasks [19]. In this paper we extend the idea of the TV-L1 denoising approach to the tasks of film restoration and frame interpolation. We introduce the proposed model in Section 2. Then, in Section 3 we discuss the optimization procedure. Next, the applications and results are presented in Section 4 and finally, Section 5 concludes the work and gives an outlook.
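To make the difference between the data terms of (1) and (2) concrete, the following sketch evaluates discrete versions of both energies with NumPy. The forward-difference discretization of $|\nabla u|$ and all function names are assumptions made for this illustration only; they are not taken from the authors' implementation. The last lines check the scaling behaviour that underlies the contrast invariance of (2): scaling $u$ and $f$ by $c > 0$ scales the TV-L1 energy by $c$, whereas the quadratic data term of (1) scales with $c^2$.

```python
import numpy as np

def total_variation(u):
    """Isotropic discrete TV of a 2D image (forward differences,
    zero difference at the last row and column). Illustrative choice."""
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[:-1, :] = u[1:, :] - u[:-1, :]
    dy[:, :-1] = u[:, 1:] - u[:, :-1]
    return np.sum(np.sqrt(dx**2 + dy**2))

def rof_energy(u, f, lam):
    """Discrete counterpart of (1): TV plus squared data term."""
    return total_variation(u) + lam * np.sum((u - f) ** 2)

def tv_l1_energy(u, f, lam):
    """Discrete counterpart of (2): TV plus absolute data term."""
    return total_variation(u) + lam * np.sum(np.abs(u - f))

rng = np.random.default_rng(0)
f = rng.random((8, 8))   # hypothetical input image
u = rng.random((8, 8))   # hypothetical candidate solution
c, lam = 3.0, 2.0

# TV-L1 is positively homogeneous of degree one in (u, f); ROF is not.
l1_is_homogeneous = np.isclose(tv_l1_energy(c * u, c * f, lam),
                               c * tv_l1_energy(u, f, lam))
rof_is_homogeneous = np.isclose(rof_energy(c * u, c * f, lam),
                                c * rof_energy(u, f, lam))
```

Because the TV-L1 energy is homogeneous in this sense, a minimizer for $f$ scaled by $c$ is a minimizer for $c f$, which is one way to read the contrast invariance mentioned above.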
2 The Optical Flow Guided TV-L1 Model
In this section we define a variant of the TV-L1 model that operates on a spatio-temporal volume. For tracking and segmentation it has been shown in [20,21] that the representation of image sequences in a space-time volume is beneficial. In the following we will show that this is also true if a spatio-temporal TV-L1 model is utilized for restoration and video processing tasks. The drawback of the methods proposed up to now is the definition of the temporal gradient. To handle
moving objects and especially the associated (dis-)occlusions, we incorporate optical flow to guide the temporal derivative. Hence, the time derivatives are computed along the objects' movement trajectories. In order to be robust against occlusions and disocclusions we compute the optical flow in both directions, which we denote as forward flow $v^+ = (v_1^+, v_2^+)^T : \Omega \times T \to \mathbb{R}^2$ and backward flow $v^- = (v_1^-, v_2^-)^T : \Omega \times T \to \mathbb{R}^2$. To estimate the optical flow for the existing frames, we use the implementation of [22], as it is publicly available, reasonably fast to compute and achieves good results, especially on the frame interpolation part of the Middlebury benchmark [23]. The approach combines the classical optical flow constraint [13] with a weighted regularization that considers the strength and the direction of the underlying image edges. For an overview of optical flow methods we refer to [23,24,25,26,27] and the references therein. We propose to minimize the following optical flow driven TV-L1 energy functional:

$$\min_u \int_{\Omega \times T} |\nabla_v u| + \lambda(x,t)\, |u - f| \, dx \, dt \tag{3}$$

The input sequence $f : \Omega \times T \to \mathbb{R}$ and the sought solution $u : \Omega \times T \to \mathbb{R}$ are defined in a space-time volume, and $\lambda(x,t)$ defines the trade-off between regularization and data term. The spatio-temporal gradient operator is defined as

$$\nabla_v = (\partial_{x_1}, \partial_{x_2}, \partial_{t^+}, \partial_{t^-})^T , \tag{4}$$

where the components $\partial_{x_1}$ and $\partial_{x_2}$ denote the spatial derivatives. The temporal derivatives are directed by the corresponding optical flow vectors $(v^+, v^-)$ and are denoted as $\partial_{t^+}$ and $\partial_{t^-}$. We may assume that $u$ is sufficiently smooth such that $\nabla_v u$ exists:

$$\begin{cases}
\partial_{x_1} u = \lim\limits_{h \to 0^+} \dfrac{u(x_1 + h, x_2, t) - u(x,t)}{h} \\[2ex]
\partial_{x_2} u = \lim\limits_{h \to 0^+} \dfrac{u(x_1, x_2 + h, t) - u(x,t)}{h} \\[2ex]
\partial_{t^+} u = \lim\limits_{h_t \to 0^+} \dfrac{u(x + h_t v^+(x,t), t + h_t) - u(x,t)}{h_t \sqrt{1 + \|v^+(x,t)\|_2^2}} \\[2ex]
\partial_{t^-} u = \lim\limits_{h_t \to 0^+} \dfrac{u(x + h_t v^-(x,t), t + h_t) - u(x,t)}{h_t \sqrt{1 + \|v^-(x,t)\|_2^2}}
\end{cases} \tag{5}$$

3 Minimizing the Optical Flow Guided TV-L1 Model

3.1 Discretization
First, we define the image sequence on a three-dimensional, regular Cartesian grid of size $M \times N \times K$: $\{(m, n, k) : 1 \le m \le M,\ 1 \le n \le N,\ 1 \le k \le K\}$. The grid size is 1 and the discrete pixel positions within the defined volume are given by $(m, n, k)$. We use finite dimensional vector spaces $X = \mathbb{R}^{MNK}$
Fig. 2. Layout of the spatio-temporal volume and the defined gradient operator (4), whose directions are marked as red vectors. The forward optical flow v+ and the backward optical flow v− guide the temporal components of this gradient operator.
and $Y = \mathbb{R}^{4MNK}$ equipped with standard scalar products denoted by $\langle \cdot, \cdot \rangle_X$ and $\langle \cdot, \cdot \rangle_Y$. Next, we define the discretized version of the anisotropic spatio-temporal gradient operator $\nabla_v$ (see (4) and (5)), with $h$ and $h_t$ controlling the spatial and temporal influence of the regularization:

$$(D_v u)_{m,n,k} = \big( \underbrace{(D_{x_1} u)_{m,n,k},\ (D_{x_2} u)_{m,n,k}}_{\text{spatial}},\ \underbrace{(D_{t^+} u)_{m,n,k},\ (D_{t^-} u)_{m,n,k}}_{\text{temporal}} \big)^T . \tag{6}$$
The spatial gradients are discretized using simple finite differences yielding

$$(D_{x_1} u)_{m,n,k} = \begin{cases} \dfrac{u_{m+1,n,k} - u_{m,n,k}}{h} & \text{if } m < M \\ 0 & \text{if } m = M \end{cases} , \tag{7}$$

$$(D_{x_2} u)_{m,n,k} = \begin{cases} \dfrac{u_{m,n+1,k} - u_{m,n,k}}{h} & \text{if } n < N \\ 0 & \text{if } n = N \end{cases} . \tag{8}$$

For the flow-guided temporal gradients we use linear interpolation within the plane if the flow vector points in between discrete locations on the pixel grid. In the following we describe the interpolation for the component guided by the forward flow. The same is of course valid for the component where the gradient is steered by the backward optical flow. The coordinates used for linearly interpolating the position are

$$(m^+, n^+, k) = \big( m + h_t v^{1+}_{m,n,k},\ n + h_t v^{2+}_{m,n,k},\ k \big) , \tag{9}$$

with the neighboring coordinates used for the linear interpolation

$$m_1^+ = \lfloor m^+ \rfloor , \quad m_2^+ = m_1^+ + 1 , \quad n_1^+ = \lfloor n^+ \rfloor , \quad n_2^+ = n_1^+ + 1 , \tag{10}$$

and the corresponding weighting factors are computed as the distances

$$d_m^+ = m_2^+ - m^+ \quad \text{and} \quad d_n^+ = n_2^+ - n^+ . \tag{11}$$

The temporal gradient operators then read

$$(D_{t^+} u)_{m,n,k} = \begin{cases} \dfrac{a^+ + b^+ + c^+ + d^+ - e^+}{h_t \sqrt{1 + \|v^+_{m,n,k}\|_2^2}} & \text{if } a^+, b^+, c^+, d^+, e^+ \in \Omega \times T \\ 0 & \text{else} \end{cases} , \tag{12}$$

$$(D_{t^-} u)_{m,n,k} = \begin{cases} \dfrac{a^- + b^- + c^- + d^- - e^-}{h_t \sqrt{1 + \|v^-_{m,n,k}\|_2^2}} & \text{if } a^-, b^-, c^-, d^-, e^- \in \Omega \times T \\ 0 & \text{else} \end{cases} , \tag{13}$$

with

$$\begin{array}{ll}
a^+ = d_m^+ d_n^+ \, u_{m_1^+, n_1^+, k+1} & a^- = d_m^- d_n^- \, u_{m_1^-, n_1^-, k} \\
b^+ = (1 - d_m^+) d_n^+ \, u_{m_2^+, n_1^+, k+1} & b^- = (1 - d_m^-) d_n^- \, u_{m_2^-, n_1^-, k} \\
c^+ = d_m^+ (1 - d_n^+) \, u_{m_1^+, n_2^+, k+1} & c^- = d_m^- (1 - d_n^-) \, u_{m_1^-, n_2^-, k} \\
d^+ = (1 - d_m^+)(1 - d_n^+) \, u_{m_2^+, n_2^+, k+1} & d^- = (1 - d_m^-)(1 - d_n^-) \, u_{m_2^-, n_2^-, k} \\
e^+ = u_{m,n,k} & e^- = u_{m,n,k+1}
\end{array} \tag{14}$$
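The discretization (7) to (14) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses 0-based indices, dense arrays instead of a sparse matrix, and covers only the forward temporal component; the function names are hypothetical. Samples whose interpolation neighbours fall outside the volume are set to zero, following the case distinction in (12) literally.

```python
import numpy as np

def spatial_gradients(u, h=1.0):
    """Forward differences (7) and (8) on a volume u of shape (M, N, K)."""
    dx1 = np.zeros_like(u)
    dx2 = np.zeros_like(u)
    dx1[:-1, :, :] = (u[1:, :, :] - u[:-1, :, :]) / h   # zero in the last row
    dx2[:, :-1, :] = (u[:, 1:, :] - u[:, :-1, :]) / h   # zero in the last column
    return dx1, dx2

def forward_temporal_gradient(u, v1, v2, ht=1.0):
    """Flow-guided forward difference (12); v1 and v2 have the shape of u."""
    M, N, K = u.shape
    m, n, k = np.meshgrid(np.arange(M), np.arange(N), np.arange(K), indexing="ij")
    m_pos = m + ht * v1                      # displaced position, cf. (9)
    n_pos = n + ht * v2
    m1 = np.floor(m_pos).astype(int)         # neighbouring grid points, cf. (10)
    n1 = np.floor(n_pos).astype(int)
    m2, n2 = m1 + 1, n1 + 1
    dm, dn = m2 - m_pos, n2 - n_pos          # interpolation weights, cf. (11)

    # All samples a, b, c, d, e must lie inside the volume, otherwise the result is 0.
    valid = (m1 >= 0) & (m2 < M) & (n1 >= 0) & (n2 < N) & (k + 1 < K)
    # Clip indices so the lookup is safe; invalid entries are masked afterwards.
    m1c, m2c = np.clip(m1, 0, M - 1), np.clip(m2, 0, M - 1)
    n1c, n2c = np.clip(n1, 0, N - 1), np.clip(n2, 0, N - 1)
    kc = np.clip(k + 1, 0, K - 1)

    a = dm * dn * u[m1c, n1c, kc]            # terms of (14)
    b = (1 - dm) * dn * u[m2c, n1c, kc]
    c = dm * (1 - dn) * u[m1c, n2c, kc]
    d = (1 - dm) * (1 - dn) * u[m2c, n2c, kc]
    grad = (a + b + c + d - u) / (ht * np.sqrt(1.0 + v1**2 + v2**2))
    return np.where(valid, grad, 0.0)

# Toy volume: frame 1 equals frame 0 shifted by one pixel along the first axis.
M, N, K = 6, 5, 2
mm = np.arange(M, dtype=float)[:, None] * np.ones((1, N))
u = np.stack([mm**2, (mm - 1.0) ** 2], axis=2)
v1, v2 = np.ones((M, N, K)), np.zeros((M, N, K))

guided = forward_temporal_gradient(u, v1, v2)        # follows the motion
plain = forward_temporal_gradient(u, 0.0 * v1, v2)   # ignores the motion
```

In this toy example the guided derivative vanishes everywhere because it compares each pixel with its displaced counterpart, while the plain temporal difference reacts to the motion itself.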
To implement the operator $D_v$ we use a sparse matrix representation. This is of particular interest because we will also need to implement the adjoint operator $D_v^*$, which is defined through the identity $\langle D_v u, p \rangle_Y = \langle u, D_v^* p \rangle_X$. Note that here, the adjoint operator is simply the matrix transpose. Now, (3) can be rewritten in the discrete setting as the following minimization problem

$$\min_u \|D_v u\|_1 + \|\Lambda (u - f)\|_1 , \quad \text{with } \Lambda = \operatorname{diag}(\lambda) . \tag{15}$$

Dualizing both $\ell_1$ norms in (15) yields the saddle-point problem

$$\min_u \max_{p,q}\ \langle D_v u, p \rangle_Y + \langle \Lambda (u - f), q \rangle_X \quad \text{s.t. } 0 \le u \le 1 ,\ \|p\|_\infty \le 1 \text{ and } -1 \le q \le 1 , \tag{16}$$

where $p = (p^1_{m,n,k}, p^2_{m,n,k}, p^3_{m,n,k}, p^4_{m,n,k}) \in Y$ and $q = (q_{m,n,k}) \in X$ denote the dual variables, and the discrete maximum norm $\|p\|_\infty$ is defined as

$$\|p\|_\infty = \max_{m,n,k} |p_{m,n,k}| , \quad |p_{m,n,k}| = \sqrt{(p^1_{m,n,k})^2 + (p^2_{m,n,k})^2 + (p^3_{m,n,k})^2 + (p^4_{m,n,k})^2} . \tag{17}$$
3.2 Primal-Dual Algorithm

In order to optimize (16) we use the first-order primal-dual algorithm proposed by Chambolle and Pock [28], which yields the following updates of the primal and dual variables:

$$\begin{cases}
p^{n+1} = \Pi_{B_1}\big( p^n + \sigma D_v (2u^n - u^{n-1}) \big) \\
q^{n+1} = \Pi_{[-1,1]}\big( q^n + \sigma \Lambda ((2u^n - u^{n-1}) - f) \big) \\
u^{n+1} = u^n - \tau \big( D_v^* p^{n+1} + \Lambda q^{n+1} \big)
\end{cases} \tag{18}$$

The projection $\Pi_{B_\zeta}$ is a simple point-wise projection onto the ball with radius $\zeta$, and $\Pi_{[a,b]}$ denotes the point-wise truncation to the interval $[a, b]$. $\tau$ and $\sigma$ are the primal and dual update step sizes satisfying $\tau \sigma L^2 \le 1$, where $L^2 = \|(D_v^*, \Lambda^*)\|^2$.

3.3 The Optical Flow Guided TV-L1 Model for Frame Interpolation
Frame interpolation requires a more specific setting to yield the desired result. The main demand on intermediate frames between two images is the interpolation of occurring movements. The aim is to generate frames that interpolate the moving objects along their movement trajectories in a natural and appropriate way. Hence, the presented approach is suitable because it incorporates optical flow information when computing the temporal derivatives. To accomplish the task of frame interpolation, the spatio-temporal volume is scaled to the desired size. The new, unknown frames are left blank in the input volume f. In these areas no data term can be computed and therefore the values of λ are set to zero for all unknown pixels. In terms of optical flow, the forward and backward flow are first computed between all known neighboring frames. To get estimates of the optical flow at the currently unknown positions, and hence describe the motion trajectories of the pixels at the intermediate positions, we assume a linear movement between the available frames. To fill in optical flow vectors at unknown positions, the known flow fields are rescaled with a ‘stretch factor’ so that the resulting vectors describe the optical flow between the known frames and the subsequent (unknown) frame. The initial points of the optical flow vectors within unknown frames are located by propagating the known vectors through the unknown parts of the volume. To this end, the vector coordinates at available frames are taken and propagated to the point the vector aims at. This is most likely an unknown position due to the previous rescaling of the vector field. As the vectors generally do not end at a discrete pixel location, the vector field is then filled with linearly interpolated values calculated with the factors from (11). A visual description of this process is given in Figure 3.
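The flow propagation described above can be sketched as follows. This is only one possible reading of the procedure, written under several assumptions that are not stated in the text: the function name `propagate_flow` is hypothetical, contributions landing on the same pixel are averaged by their accumulated interpolation weights, and pixels that receive no vector are left at zero and reported in a mask.

```python
import numpy as np

def propagate_flow(v1, v2, stretch):
    """Rescale the flow of a known frame and propagate it to the next
    (unknown) frame of the stretched volume.

    v1, v2  : flow components of shape (M, N) between two known frames
    stretch : e.g. 0.5 if one frame is inserted between the known frames
    """
    M, N = v1.shape
    s1, s2 = stretch * v1, stretch * v2            # rescaled vectors
    m, n = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    m_pos, n_pos = m + s1, n + s2                  # points the vectors aim at
    m1 = np.floor(m_pos).astype(int)
    n1 = np.floor(n_pos).astype(int)
    dm, dn = (m1 + 1) - m_pos, (n1 + 1) - n_pos    # weights as in (11)

    acc1, acc2, wsum = np.zeros((M, N)), np.zeros((M, N)), np.zeros((M, N))
    corners = [(m1, n1, dm * dn),
               (m1 + 1, n1, (1 - dm) * dn),
               (m1, n1 + 1, dm * (1 - dn)),
               (m1 + 1, n1 + 1, (1 - dm) * (1 - dn))]
    for mi, ni, w in corners:
        ok = (mi >= 0) & (mi < M) & (ni >= 0) & (ni < N) & (w > 0)
        # Unbuffered accumulation, since several vectors may hit one pixel.
        np.add.at(acc1, (mi[ok], ni[ok]), w[ok] * s1[ok])
        np.add.at(acc2, (mi[ok], ni[ok]), w[ok] * s2[ok])
        np.add.at(wsum, (mi[ok], ni[ok]), w[ok])

    filled = wsum > 0
    safe = np.where(filled, wsum, 1.0)             # avoid division by zero
    out1 = np.where(filled, acc1 / safe, 0.0)
    out2 = np.where(filled, acc2 / safe, 0.0)
    return out1, out2, filled

# Uniform motion of one pixel per frame along the first axis, one inserted frame.
v1, v2 = np.ones((5, 4)), np.zeros((5, 4))
p1, p2, filled = propagate_flow(v1, v2, stretch=0.5)
```

For the uniform motion in this example, every pixel of the intermediate frame receives the rescaled vector (0.5, 0), which is the linear-movement assumption made in the text.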
When solving (3) with this setting, the regularization propagates the desired pixel values along the motion trajectories and results in adequate intermediate frames. Due to the robustness of the L1 norm against outliers, this approach is also suitable for restoring or denoising corrupted frames.
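The interplay of the iteration (18) with data weights that are zero at unknown positions can be illustrated on a tiny one-dimensional signal. The sketch below is an assumption-laden toy, not the authors' solver: it uses a dense NumPy matrix for the gradient operator instead of a sparse flow-guided one, omits the box constraint on $u$, picks step sizes slightly below the bound $\tau \sigma L^2 \le 1$, and all names are hypothetical.

```python
import numpy as np

def tv_l1_primal_dual(D, lam, f, n_components=1, iters=5000):
    """Iterate (18) for min_u ||D u||_1 + ||diag(lam) (u - f)||_1.

    D   : array of shape (n_components * P, P); the rows are assumed to be
          ordered component by component
    lam : array of shape (P,), data weights (0 marks unknown samples)
    f   : array of shape (P,), input data
    """
    P = f.size
    L = np.linalg.norm(np.vstack([D, np.diag(lam)]), 2)   # operator norm
    tau = sigma = 0.99 / L                                 # tau * sigma * L^2 < 1
    u, u_prev = f.copy(), f.copy()
    p, q = np.zeros(D.shape[0]), np.zeros(P)
    for _ in range(iters):
        u_bar = 2.0 * u - u_prev
        # Dual update for p: ascent step, then point-wise projection onto the unit ball.
        p = (p + sigma * (D @ u_bar)).reshape(n_components, P)
        norms = np.sqrt(np.sum(p**2, axis=0))
        p = (p / np.maximum(1.0, norms)).ravel()
        # Dual update for q: ascent step, then truncation to [-1, 1].
        q = np.clip(q + sigma * lam * (u_bar - f), -1.0, 1.0)
        # Primal update; the adjoint of D is its transpose.
        u_prev = u
        u = u - tau * (D.T @ p + lam * q)
    return u

# 1D forward-difference operator with a zero last row, in the spirit of (7).
P = 9
D = np.zeros((P, P))
for i in range(P - 1):
    D[i, i], D[i, i + 1] = -1.0, 1.0

f = np.full(P, 0.5)
f[4] = 1.0            # impulse outlier
f[6] = 0.0            # unknown sample, its value is ignored
lam = np.ones(P)
lam[6] = 0.0          # no data term where nothing is known
u = tv_l1_primal_dual(D, lam, f)
```

In this toy problem the regularizer removes the single-sample outlier and fills the unknown sample from its neighbours, so the result is expected to be close to the constant signal 0.5.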
Fig. 3. Flow propagation: The left visualization shows the basic setting of an unknown frame (t, t + 1: known; t + 1/2: unknown) and an exemplary forward flow vector. The right image shows the relevant pixels that are used when the optical flow is propagated using the linear interpolation factors as in (11).
Fig. 4. Single frame denoising of Dimetrodon dataset. The first row shows the noisy input image, and the denoised images with the TV-L1 model and the TV-L1 model (left to right). Zoomed regions show results of the TV-L1 model in first row and TV-L1 model in the last row.
4 Application and Numerical Evaluations

4.1 Denoising
In this section, we highlight the benefit of the directed temporal gradients in the proposed method compared to classical TV-L1 denoising extended to a spatio-temporal volume but with standard temporal gradients. In Fig. 4 a sequence of three frames is taken and random Gaussian noise is added to the middle frame. To obtain reasonable optical flow we compute the flow vectors between the intact frames and use the setting described in Section 3.3 to obtain flow vectors for the noisy image. The weighting parameter is set to λ(x, t) = 2 for all pixels within the space-time volume. To increase the temporal influence of the regularization we choose h = 10 and ht = 1.

4.2 Inpainting
Historic video material is often impaired and for the next application we will deal with partly damaged areas within frames. Small outliers and artifacts might be handled by denoising methods as shown in Section 4.1. Nevertheless, the impairments are often bigger than such algorithms can handle. To use our method for restoration, we compute the optical flow between the neighboring image pairs. As the impairments will cause outliers in the optical flow field, we cannot use
Fig. 5. Inpainting corrupted regions. The first row shows the input images, the second row the restored frames, and the third row two pairs of images, where the left image of each pair is the impaired region and the right image shows the result of our approach.
(a) Dimetrodon dataset: $e_{RMSE} = 7.478 \cdot 10^{-3}$
(b) MiniCooper dataset: $e_{RMSE} = 3.167 \cdot 10^{-4}$
(c) Walking dataset: $e_{RMSE} = 1.326 \cdot 10^{-4}$
Fig. 6. Frame interpolation on different datasets. Left column: ground truth $u^*$; middle column: our interpolation result $u$; right column: difference image $u_{diff}$.
Fig. 7. Comparison to [12] (first row). Our method (second row) produces results almost identical to those of the current state of the art.
Fig. 8. Interpolating the Basketball-sequence
those vectors as guidance when solving (3). There are of course several methods to obtain reasonable flow fields. In the following example we simply inpaint the flow fields with the help of our model but neglect the direction of the temporal gradient in this case. These interpolated flow fields are then used to solve (3). Of course, more sophisticated methods such as vector inpainting [29] can be used here. Still, our approach is robust enough to handle gross outliers like those in Figure 5. In this experiment the values of λ are set to λ(x, t) = 2 if the pixels are considered good and to λ(x, t) = 0 otherwise. The weights of the regularization are chosen as h = 2.5 and ht = 1.

4.3 Image Sequence Interpolation
For image interpolation we use the ability to regularize along the temporal trajectory in the space-time volume as described in Section 3.3. When solving (3) in such a setting, the intermediate frames are generated and the pixels' movements are reasonably interpolated with the propagated optical flow vectors. For a numerical evaluation of interpolating intermediate frames within an image sequence we use sequences of the Middlebury benchmark [23], where interpolated frames are available as ground truth. The ground-truth datasets, as well as the input data, are available at the Middlebury website (see footnote 1). For comparison we show the interpolated image $u$, the ground truth $u^*$ and the difference image $u_{diff} = |u^* - u|$ in Fig. 6. As a quantitative measurement we give the root mean squared error (RMSE)

$$e_{RMSE} = \sqrt{\frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( u^*_{m,n} - u_{m,n,k} \right)^2} .$$

Fig. 9. Interpolating the Backyard-sequence

Next, we compare our approach to an interpolation result of [12] in Fig. 7, where our method generates almost identical intermediate frames. In Fig. 8 and Fig. 9 two examples of the Middlebury database are shown. Here larger displacements, faster movements, small-scale structures and complex occlusions are the major difficulties within the two presented datasets. The zoomed regions show that the interpolated movement is again reasonable and the (dis-)occluded regions are handled robustly.
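The error measure and the difference image used in Fig. 6 are straightforward to compute. The following snippet is a minimal sketch with made-up 2×2 arrays standing in for a ground-truth frame and an interpolated frame; the function name `rmse` is hypothetical.

```python
import numpy as np

def rmse(u_true, u_interp):
    """Root mean squared error between two frames of shape (M, N)."""
    diff = np.asarray(u_true, dtype=float) - np.asarray(u_interp, dtype=float)
    return np.sqrt(np.mean(diff**2))

# Made-up data for illustration only.
u_true = np.zeros((2, 2))
u_interp = np.array([[0.0, 0.0], [0.0, 2.0]])

error = rmse(u_true, u_interp)           # sqrt(4 / 4)
u_diff = np.abs(u_true - u_interp)       # difference image
```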
5 Conclusion
In this paper we presented a method that can denoise image sequences, restore partly corrupted frames, recover completely lost frames and interpolate intermediate frames between two given images. We proposed to use an optical flow directed temporal gradient to preserve and, in the case of frame interpolation, generate natural object movements. Since the method depends strongly on the quality of the optical flow used, the robustness of generating the unknown flow vectors could be investigated in the future. Especially for the restoration task in Section 4.2, the optical flow cannot be computed in impaired regions and must be interpolated in these areas. Although we have shown that our method is robust enough to use very simple completion techniques, further investigation of such interpolation techniques would increase the robustness and applicability of our approach even further.
1 http://vision.middlebury.edu

References
1. Kokaram, A.C.: On Missing Data Treatment for Degraded Video and Film Archives: A Survey and a New Bayesian Approach. IEEE Trans. on IP (2004)
2. Chen, S.E., Williams, L.: View Interpolation for Image Synthesis. In: SIGGRAPH (1993)
3. McMillan, L., Bishop, G.: Plenoptic Modeling: An Image-Based Rendering System. In: ACM SIGGRAPH (1995)
4. Faugeras, O., Robert, L.: What Can Two Images Tell Us About a Third One? International Journal of Computer Vision (1996)
5. Wexler, Y., Shashua, A.: On the Synthesis of Dynamic Scenes from Reference Views. In: IEEE Conference on Computer Vision and Pattern Recognition (2000)
6. Vedula, S., Baker, S., Kanade, T.: Image-based spatio-temporal modeling and view interpolation of dynamic events. ACM Transactions on Graphics, TOG (2005)
7. Beier, T., Neely, S.: Feature-Based Image Metamorphosis. In: ACM SIGGRAPH (1992)
8. Seitz, S.M., Dyer, C.R.: View Morphing. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (1996)
9. Wolberg, G.: Image morphing: a survey. The Visual Computer (1998)
10. Schaefer, S., McPhail, T., Warren, J.: Image deformation using moving least squares. ACM Transactions on Graphics, TOG (2006)
11. Stich, T., Linz, C., Albuquerque, G., Magnor, M.: View and Time Interpolation in Image Space. Computer Graphics Forum (2008)
12. Mahajan, D., Huang, F.C., Matusik, W., Ramamoorthi, R., Belhumeur, P.: Moving Gradients: A Path-Based Method for Plausible Image Interpolation. In: ACM SIGGRAPH (2009)
13. Horn, B.K., Schunck, B.G.: Determining Optical Flow. Artificial Intelligence (1981)
14. Borzì, A., Ito, K., Kunisch, K.: Optimal Control Formulation for Determining Optical Flow. SIAM Journal on Scientific Computing (2006)
15. Chen, K., Lorenz, D.A.: Image Sequence Interpolation Using Optimal Control. Journal of Mathematical Imaging and Vision (2011)
16. Rudin, L.I., Osher, S.J., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena (1992)
17. Nikolova, M.: A Variational Approach to Remove Outliers and Impulse Noise. Journal of Mathematical Imaging and Vision (2004)
18. Chan, T.F., Esedoglu, S.: Aspects of Total Variation Regularized L1 Function Approximation. SIAM Journal of Applied Mathematics (2005)
19. Pock, T.: Fast Total Variation for Computer Vision. PhD thesis, Graz University of Technology (2008)
20. Mansouri, A.R., Mitiche, A., Aron, M.: PDE-based region tracking without motion computation by joint space-time segmentation. In: ICIP (2003)
21. Unger, M., Mauthner, T., Pock, T., Bischof, H.: Tracking as Segmentation of Spatial-Temporal Volumes by Anisotropic Weighted TV. In: Cremers, D., Boykov, Y., Blake, A., Schmidt, F.R. (eds.) EMMCVPR 2009. LNCS, vol. 5681, pp. 193–206. Springer, Heidelberg (2009)
22. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic Huber-L1 Optical Flow. In: BMVC (2009)
23. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A Database and Evaluation Methodology for Optical Flow. IJCV (2011)
24. Barron, J.L., Fleet, D., Beauchemin, S.S.: Performance of optical flow techniques. International Journal of Computer Vision (1994)
25. Beauchemin, S.S., Barron, J.L.: The Computation of Optical Flow. ACM Computing Surveys (1995)
26. Aubert, G., Deriche, R., Kornprobst, P.: Computing Optical Flow via Variational Techniques. SIAM J. Appl. Math. (2000)
27. Fleet, D.J., Weiss, Y.: Optical Flow Estimation (2006)
28. Chambolle, A., Pock, T.: A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. J. of Math. Imaging and Vision (2010)
29. Berkels, B., Kondermann, C., Garbe, C., Rumpf, M.: Reconstructing Optical Flow Fields by Motion Inpainting. In: Cremers, D., Boykov, Y., Blake, A., Schmidt, F.R. (eds.) EMMCVPR 2009. LNCS, vol. 5681, pp. 388–400. Springer, Heidelberg (2009)
Data-Driven Importance Distributions for Articulated Tracking Søren Hauberg and Kim Steenstrup Pedersen The eScience Centre, Dept. of Computer Science, University of Copenhagen {hauberg,kimstp}@diku.dk
Abstract. We present two data-driven importance distributions for particle filter-based articulated tracking; one based on background subtraction, another on depth information. In order to keep the algorithms efficient, we represent human poses in terms of spatial joint positions. To ensure constant bone lengths, the joint positions are confined to a non-linear representation manifold embedded in a high-dimensional Euclidean space. We define the importance distributions in the embedding space and project them onto the representation manifold. The resulting importance distributions are used in a particle filter, where they improve both accuracy and efficiency of the tracker. In fact, they triple the effective number of samples compared to the most commonly used importance distribution at little extra computational cost.

Keywords: Articulated tracking, Importance Distributions, Particle Filtering, Spatial Human Motion Models.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 287–299, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Motivation

Articulated tracking is the process of estimating the pose of a person in each frame in an image sequence [1]. Often this is expressed in a Bayesian framework and subsequently the poses are inferred using a particle filter [1,2,3,4,5,6,7,8,9,10,11]. Such filters generate a set of sample hypotheses and assign them weights according to the likelihood of the observed data given that the hypothesis is correct. Usually, the hypotheses are sampled directly from the motion prior as this vastly simplifies development. However, as the motion prior is inherently independent of the observed data, samples are generated completely oblivious to the current observation. This has the practical consequence that many sampled pose hypotheses are far away from the modes of the likelihood. This means that many samples are needed for accurate results. As the likelihood has to be evaluated for each of these samples, the resulting filter becomes computationally demanding. One solution is to sample hypotheses from a distribution that is not “blind” to the current observation. The particle filter allows for such importance distributions. While the design of good importance distributions can be the deciding point of a filter, not much attention has been given to their development in articulated tracking. The root of the problem is that the pose parameters are related to the observation in a highly non-linear fashion, which makes good importance distributions hard to design. In this paper, we change the pose parametrisation and then suggest a simple approximation
that allows us to design highly efficient importance distributions that account for the current observation.

1.1 Articulated Tracking Using Particle Filters

Estimating the pose of a person using a single view point or a small baseline stereo camera is an inherently difficult problem due to self-occlusions and visual ambiguities. This manifests itself in that the distribution of the human pose is multi-modal with an unknown number of modes. Currently, the best method for coping with such distributions is the particle filter [12]. This relies on a prior motion model $p(\theta_t \mid \theta_{t-1})$ and a data likelihood model $p(Z_t \mid \theta_t)$. Here $\theta_t$ denotes the human pose at time $t$ and $Z_t$ the observation at the same time. The particle filter approximates the posterior $p(\theta_t \mid Z_{1:t})$ as a set of weighted samples. These samples are drawn from an importance distribution $q(\theta_t \mid Z_t, \theta_{t-1})$ and the weights are computed recursively as

$$w_t^{(n)} \propto w_{t-1}^{(n)} \, p\big(Z_t \mid \theta_t^{(n)}\big) \, r_t^{(n)} \quad \text{s.t.} \quad \sum_{n=1}^{N} w_t^{(n)} = 1 , \tag{1}$$

where the superscript $(n)$ denotes the sample index and the correction factor $r_t^{(n)}$ is given by

$$r_t^{(n)} = \frac{p\big(\theta_t^{(n)} \mid \theta_{t-1}^{(n)}\big)}{q\big(\theta_t^{(n)} \mid Z_t, \theta_{t-1}^{(n)}\big)} . \tag{2}$$
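The recursion (1) with the correction factor (2) amounts to a few array operations. The sketch below assumes that the likelihood, prior density and importance density have already been evaluated for each sample; the function names and the numbers are illustrative only. The effective sample size shown at the end is the commonly used estimate $1 / \sum_n (w_t^{(n)})^2$, which is not defined in the text above and is included here as an assumption about how sample efficiency is typically measured.

```python
import numpy as np

def update_weights(w_prev, likelihood, prior_pdf, proposal_pdf):
    """Weight recursion (1) with the correction factor (2), one entry per sample."""
    r = prior_pdf / proposal_pdf            # correction factor (2)
    w = w_prev * likelihood * r             # unnormalised weights
    return w / np.sum(w)                    # enforce that the weights sum to one

def effective_sample_size(w):
    """Common estimate of the effective number of samples (assumed definition)."""
    return 1.0 / np.sum(w**2)

# Two hypothetical samples drawn from the motion prior, so that r = 1.
w_prev = np.array([0.5, 0.5])
likelihood = np.array([1.0, 3.0])
prior_pdf = np.array([0.2, 0.2])
proposal_pdf = prior_pdf.copy()

w = update_weights(w_prev, likelihood, prior_pdf, proposal_pdf)
ess = effective_sample_size(w)
```

With the prior as importance distribution the weights are driven by the likelihood alone, here giving weights of 0.25 and 0.75.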
In practice, it is common to use the motion prior as the importance distribution, i.e. to let $q(\theta_t \mid Z_t, \theta_{t-1}) = p(\theta_t \mid \theta_{t-1})$, as then $r_t^{(n)} = 1$, which simplifies development. This does, however, have the unwanted side-effect that the importance distribution is “blind” to the current observation, such that the samples can easily be placed far away from the modes of the likelihood (and hence the modes of the posterior). In practice, this increases the number of samples needed for successful tracking. As the likelihood has to be evaluated for each sample, this quickly becomes a costly affair; in general the likelihood is expensive to evaluate as it has to traverse the data.

To use the particle filter for articulated tracking, we need a human pose representation. As is common [1], we shall use the kinematic skeleton (see fig. 1). This representation is a collection of connected rigid bones organised in a tree structure. Each bone can be rotated at the point of connection between the bone and its parent. We model the bones as having known constant length, so the angles between connected bones constitute the only degrees of freedom in the kinematic skeleton. We collect these into one large vector $\theta_t$ representing all joint angles in the model at time $t$. To represent constraints on the joint angles, they are confined to a subset $\Theta$ of $\mathbb{R}^N$. From known bone lengths and a joint angle vector $\theta_t$, the joint positions can be computed recursively using forward kinematics [13]. We will let $F(\theta_t)$ denote the joint positions corresponding to the joint angles $\theta_t$. In this paper, we will make a distinction between joint angles and joint positions as this has profound impact when designing data-driven importance distributions.
Fig. 1. An illustration of the kinematic skeleton. Bones are connected in a tree structure where branches have constant length. Angles between connected bones constitute the only degrees of freedom in the model.
1.2 Related Work

In articulated tracking, much work has gone into improving either the likelihood model or the motion prior. Likelihood models usually depend on cues such as edges [2, 3, 4], optical flow [11, 4] or background subtraction [14, 3, 15, 5, 16, 17, 18]. Motion priors are usually crafted by learning activity-specific priors, such as for walking [19, 20, 6, 7]. These approaches work by restricting the tracker to some subspace of the joint angle space, which makes the priors activity specific. When no knowledge of the activity is available it is common [21, 5, 6, 18] to simply let $\theta_t$ follow a normal distribution with a diagonal covariance, i.e.

$$p_{gp}(\theta_t | \theta_{t-1}) \propto \mathcal{N}(\theta_t | \theta_{t-1}, \mathrm{diag}) \, U_\Theta(\theta_t) , \tag{3}$$
where $U_\Theta$ is a uniform distribution on the legal set of angles that encodes the joint constraints. Recently, Hauberg et al. [8] showed that this model causes the spatial variance of the joint positions to increase as the kinematic chains are traversed. In practice this means that with this model the spatial variance of e.g. the hands is always larger than that of the shoulders. To avoid this somewhat arbitrary behaviour it was suggested to build the prior distribution directly in the spatial domain; a solution we will review in sec. 3.

In this paper we design data-driven importance distributions; a sub-field of articulated tracking where little work has been done. One notable exception is the work of Poon and Fleet [9], where a hybrid Monte Carlo filter was suggested. In this filter, the importance distribution uses the gradient of the log-likelihood, which moves the samples closer to the modes of the likelihood function (and, hence, also closer to the modes of the posterior). This approach is reported to improve the overall system performance.

In the more general filtering literature, the optimal particle filter [12] is known to vastly improve the performance of particle filters. This filter incorporates the observation in the importance distribution, such that samples are drawn from $p(\theta_t | \theta_{t-1}, Z_t)$, where $Z_t$ denotes the observation at time t. In practice, the optimal particle filter is quite difficult to implement as non-trivial integrals need to be solved in closed form. Thus, solutions are only available for non-linear extensions to the Kalman filter [12] and for non-linear extensions of left-to-right Hidden Markov models with known expected state durations [22].
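The growth of spatial variance along a kinematic chain under the angular prior of eq. 3, as reported in [8], can be illustrated with a small Monte Carlo experiment. The following is only a sketch under simplifying assumptions that are not part of the paper: a planar chain with unit bone lengths, one angle per joint, no joint limits, and hypothetical function names.

```python
import math
import random

def joint_positions(theta, bone_length=1.0):
    """Planar forward kinematics: positions of all joints along one chain."""
    x = y = angle = 0.0
    joints = []
    for joint_angle in theta:
        angle += joint_angle          # angles accumulate along the chain
        x += bone_length * math.cos(angle)
        y += bone_length * math.sin(angle)
        joints.append((x, y))
    return joints

def spatial_variance(points):
    """Trace of the sample covariance of a list of 2D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    return sum((p[0] - mean_x) ** 2 + (p[1] - mean_y) ** 2 for p in points) / (n - 1)

def variance_along_chain(theta_prev, sigma, n_samples, rng):
    """Sample joint angles from N(theta_prev, sigma^2 I), return variance per joint."""
    samples = [joint_positions([rng.gauss(t, sigma) for t in theta_prev])
               for _ in range(n_samples)]
    return [spatial_variance([s[j] for s in samples]) for j in range(len(theta_prev))]

rng = random.Random(0)
variances = variance_along_chain([0.2, 0.3, 0.1], sigma=0.1, n_samples=2000, rng=rng)
for index, value in enumerate(variances, start=1):
    print(f"joint {index}: spatial variance {value:.4f}")
```

Even though every angle receives the same noise level, joints further down the chain inherit the noise of all their ancestors, so their positions spread out more.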
S. Hauberg and K.S. Pedersen
2 A Failed Experiment

Our approach is motivated by a simple experiment, which proved to be a failure. In an effort to design data-driven importance distributions, we designed a straightforward importance distribution based on silhouette observations. We, thus, assume we have a binary image $B_t$ available, which roughly separates the human from the scene. When sampling new poses, we will ensure that joint positions are within the human segment. We model the motion prior according to eq. 3, i.e. assume that joint angles follow a normal distribution with diagonal covariance. Let $U_{B_t}$ denote the uniform distribution on the binary image $B_t$, such that background pixels have zero probability, and let $\mathrm{proj}_{im}[F(\theta_t)]$ be the projection of joint positions $F(\theta_t)$ onto the image plane. We then define the importance distribution as

$$\tilde{q}(\theta_t | B_t, \theta_{t-1}) \propto \mathcal{N}(\theta_t | \theta_{t-1}, \mathrm{diag}) \, U_\Theta(\theta_t) \, U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)]) . \tag{4}$$
The first two terms correspond to the motion prior and the third term ensures that sampled joint positions are within the human segment in the silhouette image. It is worth noticing that the correction factor $r_t^{(n)}$ (eq. 2) becomes constant for this importance distribution and hence can be ignored. It is straightforward to sample from this importance distribution using rejection sampling [23]: new samples can be drawn from the motion prior until one is found where all joint positions are within the human segment. This simple scheme, which is illustrated in fig. 2, should improve tracking quality. To measure this, we develop one articulated tracker where the motion prior (eq. 3) is used as importance distribution and one where eq. 4 is used. We use a likelihood model and measure of tracking error described later in the paper; for now details are not relevant. Fig. 3a and 3b show the tracking error as well as the running time for the two systems as a function of the number of samples in the filter. As can be seen, the data-driven importance distribution increases the tracking error by approximately one centimetre, while requiring roughly 10 times as many computations. An utter failure!
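The rejection sampling scheme for eq. 4 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a planar skeleton whose joint positions are already in pixel coordinates, a box-shaped legal set $\Theta$, and a silhouette stored as a nested list of booleans. All names, and the choice of falling back to the previous pose once the rejection cap is reached, are hypothetical.

```python
import math
import random

BONE_LENGTHS = [4.0, 4.0, 3.0]            # hypothetical planar arm, in pixels
JOINT_LIMITS = [(-math.pi, math.pi)] * 3  # the legal set Theta as a box

def forward_kinematics(theta, root=(10.0, 10.0)):
    """Joint positions F(theta) of a planar chain attached at `root`."""
    x, y = root
    angle = 0.0
    joints = []
    for length, joint_angle in zip(BONE_LENGTHS, theta):
        angle += joint_angle
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        joints.append((x, y))
    return joints

def pose_is_valid(theta, mask):
    """True if theta respects the joint limits and all joints hit foreground."""
    if any(not lo <= t <= hi for t, (lo, hi) in zip(theta, JOINT_LIMITS)):
        return False
    for x, y in forward_kinematics(theta):
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < len(mask) and 0 <= col < len(mask[0]) and mask[row][col]):
            return False
    return True

def sample_pose(theta_prev, sigmas, mask, rng, max_rejections=10000):
    """Draw from the angular prior until a pose lies inside the silhouette."""
    rejections = 0
    while rejections < max_rejections:
        theta = [rng.gauss(t, s) for t, s in zip(theta_prev, sigmas)]
        if pose_is_valid(theta, mask):
            return theta, rejections
        rejections += 1
    return list(theta_prev), rejections   # assumption: keep the old pose

# A horizontal band of foreground pixels in a 20 x 30 image.
mask = [[8 <= row <= 12 for _col in range(30)] for row in range(20)]
theta, rejections = sample_pose([0.0, 0.0, 0.0], [0.2, 0.2, 0.2], mask, random.Random(0))
print(theta, rejections)
```

Because a single out-of-silhouette joint rejects the whole pose, the number of rejections grows quickly with the number of joints and with the spatial variance of the prior.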
Fig. 2. An illustration of the rejection sampling scheme for simulating the importance distribution in eq. 4. The green skeleton drawn in full lines is accepted, while the two red dashed skeletons are rejected as at least one joint is outside the silhouette.
[Fig. 3 plots. Legend: Angular Motion Prior; Angular Motion Prior + Silhouette. (a) Tracking Error (cm) against Number of Samples. (b) Running Time (sec) against Number of Samples. (c) Average Number of Rejections against Frame Number.]
Fig. 3. Various performance measures for the tracking systems; error bars denote one standard deviation of the attained results over several trials. (a) The tracking error measured in centimetres. (b) The running time per frame. (c) The average number of rejections in each frame.
To get to the root of this failure, we need to look at the motion prior. As previously mentioned, Hauberg et al. [8] have pointed out that the spatial variance of the joint positions increases as the kinematic chains are traversed. This means that e.g. hand positions are always more variant than shoulder positions. In practice, this leads to rather large spatial variances of joint positions. This makes the term $U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)])$ dominant in eq. 4, thereby diminishing the effect of the motion prior. This explains the increased tracking error.

The large running time can also be explained by the large spatial variance of the motion prior. For a sampled pose to be accepted in the rejection sampling scheme, all joint positions need to be inside the human silhouette. Due to the large spatial variance of the motion prior, many samples will be rejected, leading to large computational demands. To keep the running time under control, we allow at most 10000 rejections. Fig. 3c shows the average number of rejections in each frame in a sequence; on average 6162 rejections are required to generate a sample where all joint positions are within the human silhouette.

Thus, the poor performance, both in terms of accuracy and speed, of the importance distribution in eq. 4 is due to the large spatial variance of the motion prior. This indicates that we should be looking for motion priors with more well-behaved spatial variance. We will turn to the framework suggested by Hauberg et al. [8] as it was specifically designed for controlling the spatial variance of joint positions. We shall briefly review this work next.
3 Spatial Predictions

To design motion priors with easily controlled spatial variance, Hauberg et al. [8] first define a spatial pose representation manifold $\mathcal{M} \subset \mathbb{R}^{3L}$, where L denotes the number of joints. A point on this manifold corresponds to all spatial joint positions of a pose parametrised by a set of joint angles. More stringently, $\mathcal{M}$ can be defined as

$$\mathcal{M} = \{ F(\theta) \mid \theta \in \Theta \} , \tag{5}$$

where F denotes the forward kinematics function for the entire skeleton. As this function is injective with a full-rank Jacobian, $\mathcal{M}$ is a compact differentiable manifold embedded in $\mathbb{R}^{3L}$. Alternatively, one can think of $\mathcal{M}$ as a quadratic constraint manifold
arising due to the constant distance between connected joints. It should be noted that while a point on $\mathcal{M}$ corresponds to a point in $\Theta$, the metrics on the two spaces are different, giving rise to different behaviours of seemingly similar distributions. A Gaussian-like predictive distribution on $\mathcal{M}$ can be defined simply by projecting a Gaussian distribution in $\mathbb{R}^{3L}$ onto $\mathcal{M}$, i.e.

$$p_{proj}(\theta_t | \theta_{t-1}) = \mathrm{proj}_{\mathcal{M}}\left[ \mathcal{N}(F(\theta_t) | F(\theta_{t-1}), \Sigma) \right] . \tag{6}$$
When using a particle filter for tracking, one only needs to be able to draw samples from the prior model. This can easily be done by sampling from the normal distribution in $\mathbb{R}^{3L}$ and projecting the result onto $\mathcal{M}$. This projection can be performed in a direct manner by seeking

$$\hat{\theta}_t = \arg\min_{\theta_t} \left\| \hat{x}_t - F(\theta_t) \right\|^2 \quad \text{s.t.} \quad \theta_t \in \Theta , \tag{7}$$

where $\hat{x}_t \sim \mathcal{N}(F(\theta_t) | F(\theta_{t-1}), \Sigma)$. This is an inverse kinematics problem [13], where all joints are assigned a goal. Eq. 7 can efficiently be solved using gradient descent by starting the search in $\theta_{t-1}$.
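A minimal sketch of the projection in eq. 7 is given below. It assumes a two-bone planar arm, a box-shaped legal set $\Theta$ enforced by clamping, and a finite-difference gradient; these choices, the step size, and all function names are illustrative assumptions and not the solver used in [8].

```python
import math
import random

BONE_LENGTHS = [1.0, 1.0]  # hypothetical two-bone planar arm

def forward_kinematics(theta):
    """Joint positions F(theta) of a planar chain rooted at the origin."""
    x = y = angle = 0.0
    joints = []
    for length, joint_angle in zip(BONE_LENGTHS, theta):
        angle += joint_angle
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        joints.append((x, y))
    return joints

def squared_error(theta, targets):
    """Objective of eq. 7: squared distance between F(theta) and the goals."""
    return sum((jx - tx) ** 2 + (jy - ty) ** 2
               for (jx, jy), (tx, ty) in zip(forward_kinematics(theta), targets))

def project_onto_manifold(targets, theta_start, limits, step=0.05, iters=500, eps=1e-6):
    """Projected gradient descent, started in the previous joint angles."""
    theta = list(theta_start)
    for _ in range(iters):
        base = squared_error(theta, targets)
        grad = []
        for i in range(len(theta)):
            bumped = list(theta)
            bumped[i] += eps
            grad.append((squared_error(bumped, targets) - base) / eps)
        # Gradient step followed by clamping to the joint limits.
        theta = [min(max(t - step * g, lo), hi)
                 for t, g, (lo, hi) in zip(theta, grad, limits)]
    return theta

rng = random.Random(0)
theta_prev = [0.3, 0.4]
limits = [(-math.pi, math.pi), (0.0, 2.5)]
# Perturb the previous joint positions in the embedding space, then project.
x_hat = [(x + rng.gauss(0.0, 0.05), y + rng.gauss(0.0, 0.05))
         for x, y in forward_kinematics(theta_prev)]
theta_hat = project_onto_manifold(x_hat, theta_prev, limits)
print(theta_hat, squared_error(theta_hat, x_hat))
```

The perturbed points `x_hat` generally violate the bone-length constraints; the returned angles describe the nearby skeleton pose that fits them best in the least-squares sense.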
4 Data-Driven Importance Distributions

We now have the necessary ingredients for designing data-driven importance distributions. In this paper, we will be designing two such distributions: one based on silhouette data and another on depth data from a stereo camera. Both will follow the same basic strategy.

4.1 An Importance Distribution Based on Silhouettes

Many articulated tracking systems base their likelihood models on simple background subtractions [14, 3, 15, 5, 16, 17, 18]. As such, importance distributions based on silhouette data are good candidates for improving many systems. We, thus, assume that we have a binary image $B_t$ available, which roughly separates the human from the scene. When predicting new joint positions, we will ensure that they are within the human segment. The projected prior (eq. 6) provides us with a motion model where the variance of joint positions can easily be controlled. We can then create an importance distribution similar to eq. 4,

$$q_{bg}(\theta_t | B_t, \theta_{t-1}) \propto p_{proj}(\theta_t | \theta_{t-1}) \, U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)]) . \tag{8}$$
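While the more well-behaved spatial variance of this approach would improve upon the previous experiment, it would still leave us with a high dimensional rejection sampling problem. As this has great impact on performance, we suggest an approximation of the above importance distribution,

$$\begin{aligned}
q_{bg}(\theta_t | B_t, \theta_{t-1}) &\propto p_{proj}(\theta_t | \theta_{t-1}) \, U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)]) && (9) \\
&= \mathrm{proj}_{\mathcal{M}}\big[ \mathcal{N}(F(\theta_t) | F(\theta_{t-1}), \Sigma) \big] \, U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)]) && (10) \\
&\approx \mathrm{proj}_{\mathcal{M}}\big[ \mathcal{N}(F(\theta_t) | F(\theta_{t-1}), \Sigma) \, U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)]) \big] . && (11)
\end{aligned}$$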
In other words, we suggest imposing the data-driven restriction in the embedding space before projecting back onto the manifold. When the covariance $\Sigma$ is block-diagonal, such that the positions of different joints in the embedding space are independent, this importance distribution can be written as

$$q_{bg}(\theta_t | B_t, \theta_{t-1}) \approx \mathrm{proj}_{\mathcal{M}}\left[ \prod_{l=1}^{L} \mathcal{N}(\mu_{l,t} | \mu_{l,t-1}, \Sigma_l) \, U_{B_t}(\mathrm{proj}_{im}[\mu_{l,t}]) \right] , \tag{12}$$
where $\mu_{l,t}$ denotes the position of the lth joint at time t and $\Sigma_l$ denotes the block of $\Sigma$ corresponding to the lth joint. We can sample efficiently from this distribution using rejection sampling by sampling each joint position independently and ensuring that it is within the human silhouette. This amounts to L three-dimensional rejection sampling problems, which can be solved much more efficiently than one 3L-dimensional problem. After the joint positions are sampled, they can be projected onto the representation manifold $\mathcal{M}$, such that the sampled pose respects the skeleton structure. A few samples from this distribution can be seen in fig. 4c, where samples from the angular prior from eq. 3 are shown as well for comparative purposes. As can be seen, the samples from the silhouette-driven importance distribution are much more aligned with the true pose, which is the general trend.
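The per-joint sampling step of eq. 12 can be sketched as below. The sketch makes several assumptions that are not in the paper: an orthographic camera that simply drops the depth coordinate, isotropic covariance blocks $\Sigma_l = \sigma_l^2 I$, a silhouette stored as a nested list of booleans, and a fallback to the previous joint position if the rejection cap is reached. All names are hypothetical, and the subsequent projection onto $\mathcal{M}$ (eq. 7) is omitted.

```python
import random

def project_to_image(point):
    """Hypothetical orthographic camera: drop depth, round to pixel indices."""
    x, y, _z = point
    return int(round(x)), int(round(y))

def inside_silhouette(mask, point):
    """True if the projected point falls on a foreground pixel."""
    col, row = project_to_image(point)
    return 0 <= row < len(mask) and 0 <= col < len(mask[0]) and mask[row][col]

def sample_joint(mu_prev, sigma, mask, rng, max_rejections=10000):
    """Draw one joint from N(mu_prev, sigma^2 I) restricted to the silhouette."""
    rejections = 0
    while rejections < max_rejections:
        candidate = tuple(rng.gauss(m, sigma) for m in mu_prev)
        if inside_silhouette(mask, candidate):
            return candidate, rejections
        rejections += 1
    return tuple(mu_prev), rejections  # assumption: keep the old position

def sample_joint_positions(joints_prev, sigmas, mask, rng):
    """L independent three-dimensional rejection samplers, one per joint."""
    return [sample_joint(mu, s, mask, rng)[0] for mu, s in zip(joints_prev, sigmas)]

# A square foreground region in a 20 x 20 image.
mask = [[5 <= row < 15 and 5 <= col < 15 for col in range(20)] for row in range(20)]
joints_prev = [(10.0, 10.0, 50.0), (12.0, 8.0, 52.0)]
joints_new = sample_joint_positions(joints_prev, [1.0, 2.0], mask, random.Random(0))
print(joints_new)
```

A rejected candidate only forces a redraw of that one joint, which is why this scheme needs far fewer draws than rejecting whole poses.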
Fig. 4. Samples from various importance distributions. Notice how the data-driven distributions generate more “focused” samples. (a) The input data with the segmentation superimposed. (b) Samples from the angular prior (eq. 3). (c) Samples from the importance distribution guided by silhouette data. (d) Samples from the importance distribution guided by depth information.
4.2 An Importance Distribution Based on Depth

Several authors have also used depth information as the basis of their likelihood model. Some have used stereo [8, 10, 24] and others have used time-of-flight cameras [25]. When depth information is available it is often fairly easy to segment the data into background and foreground simply by thresholding the depth. As such, we will extend the previous model with the depth information.

From depth information we can generate a set of points $Z = \{z_1, \ldots, z_K\}$ corresponding to the surface of the observed objects. When sampling a joint position, we will simply ensure that it is not too far away from the nearest point in $Z$. To formalise this idea, we first note that the observed surface corresponds to the skin of the human, whereas we are modelling the skeleton. Hence, the joint positions should not be directly on the surface, but a bit away depending on the joint. For instance, hand joints should be closer to the surface than a joint on the spine. To encode this knowledge, we let $Z_{\oplus r_l}$ denote the set of three dimensional points where the shortest distance to any point in $Z$ is less than $r_l$, i.e.

$$Z_{\oplus r_l} = \left\{ z \;\middle|\; \min_k \| z - z_k \| < r_l \right\} . \tag{13}$$
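Here the $r_l$ threshold is set to be small for hands, large for joints on the spine and so forth. When we sample individual joint positions, we ensure they are within this set, i.e.

$$\begin{aligned}
q_{depth}(\theta_t | Z, \theta_{t-1}) &\propto p_{proj}(\theta_t | \theta_{t-1}) \, U_{B_t}(\mathrm{proj}_{im}[F(\theta_t)]) \, U_{Z_\oplus}(F(\theta_t)) && (14) \\
&\approx \mathrm{proj}_{\mathcal{M}}\left[ \prod_{l=1}^{L} \mathcal{N}(\mu_{l,t} | \mu_{l,t-1}, \Sigma_l) \, U_{B_t}(\mathrm{proj}_{im}[\mu_{l,t}]) \, U_{Z_{\oplus r_l}}(\mu_{l,t}) \right] ,
\end{aligned}$$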
where $U_{Z_{\oplus r_l}}$ is the uniform distribution on $Z_{\oplus r_l}$. Again, we can sample from this distribution using rejection sampling. This requires us to compute the distance from the predicted position to the nearest point in the depth data. We can find this very efficiently using techniques from k-NN classifiers, such as k-d trees [26]. Once all joint positions have been sampled, they are collectively projected onto the manifold $\mathcal{M}$ of possible poses. A few samples from this distribution are shown in fig. 4d. As can be seen, the results are visually comparable to the model based on background subtraction; we shall later, unsurprisingly, see that for out-of-plane motions the depth model does outperform the one based on background subtraction.
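The membership test for $Z_{\oplus r_l}$ in eq. 13 reduces to a nearest-neighbour query. The sketch below uses a small exact k-d tree written in plain Python for illustration; it is not the approximate nearest-neighbour method of [26], and the function names and the dictionary-based node layout are assumptions made for this example.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively build a 3D k-d tree; each node is a small dictionary."""
    if not points:
        return None
    axis = depth % 3
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {"point": ordered[mid], "axis": axis,
            "left": build_kdtree(ordered[:mid], depth + 1),
            "right": build_kdtree(ordered[mid + 1:], depth + 1)}

def nearest_distance(node, query, best=math.inf):
    """Distance from `query` to its nearest neighbour stored in the tree."""
    if node is None:
        return best
    best = min(best, math.dist(node["point"], query))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest_distance(near, query, best)
    if abs(diff) < best:  # the other half-space may still hold a closer point
        best = nearest_distance(far, query, best)
    return best

def in_dilated_set(tree, z, r_l):
    """Membership test for eq. 13: is z closer than r_l to some depth point?"""
    return nearest_distance(tree, z) < r_l

rng = random.Random(1)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
tree = build_kdtree(cloud)
print(in_dilated_set(tree, (0.5, 0.5, 0.5), r_l=0.2))
print(in_dilated_set(tree, (5.0, 5.0, 5.0), r_l=0.2))
```

The tree is built once per frame and then queried for every candidate joint position, so each rejection test avoids a scan over all K depth points.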
5 A Simple Likelihood Model

In order to complete the tracking system, we need a system for computing the likelihood of the observed data. To keep the paper focused on prediction, we use a simple vision system [8] based on a consumer stereo camera (http://www.ptgrey.com/products/bumblebee2/). This camera provides a dense set of three dimensional points $Z = \{z_1, \ldots, z_K\}$ in each frame. The objective of the vision system then becomes to measure how well a pose hypothesis matches the points. We assume that points are independent and that the distance between a point and the skin of the human follows a zero-mean Gaussian distribution, i.e.

$$p(Z | \theta_t) \propto \exp\left( - \sum_{k=1}^{K} \frac{\min\left( D^2(\theta_t, z_k), \tau \right)}{2\sigma^2} \right) , \tag{15}$$
where $D^2(\theta_t, z_k)$ denotes the squared distance between the point $z_k$ and the skin of the pose $\theta_t$ and $\tau$ is a constant threshold. The minimum operation is there to make the system robust with respect to outliers. We also need to define the skin of a pose, such that we can compute distances between this and a data point. Here, the skin of a bone is defined as a capsule with main axis corresponding to the bone itself. Since we only have a single view point, we discard the half of the capsule that is not visible. The skin of the entire pose is then defined as the union of these half-capsules. The distance between a point and this skin can then be computed as the smallest distance from the point to any of the half-capsules.
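The likelihood in eq. 15 can be sketched as follows. As a simplification that departs from the text, the sketch measures distances to full capsules rather than to the visible half-capsules, and it takes the bone end points and capsule radii as given inputs instead of deriving them from $\theta_t$. The function names and the example values are hypothetical.

```python
import math

def point_to_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def distance_to_skin(p, bones, radii):
    """Smallest distance from p to any capsule surface (full capsules)."""
    return min(abs(point_to_segment_distance(p, a, b) - r)
               for (a, b), r in zip(bones, radii))

def log_likelihood(points, bones, radii, sigma, tau):
    """Unnormalised log of eq. 15 with squared distances truncated at tau."""
    total = sum(min(distance_to_skin(z, bones, radii) ** 2, tau) for z in points)
    return -total / (2.0 * sigma ** 2)

bones = [((0.0, 0.0, 0.0), (0.0, 1.0, 0.0))]   # one bone along the y axis
radii = [0.1]
on_skin = [(0.1, 0.5, 0.0)]                    # lies on the capsule surface
outlier = [(5.0, 0.5, 0.0)]                    # far away, cost capped at tau
print(log_likelihood(on_skin, bones, radii, sigma=0.1, tau=0.25))
print(log_likelihood(outlier, bones, radii, sigma=0.1, tau=0.25))
```

Working with the logarithm avoids numerical underflow when K is large; the truncation at `tau` bounds the penalty any single outlier can contribute.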
6 Experimental Results

We now have two efficient data-driven importance distributions and a likelihood model. This gives us two systems for articulated tracking that we now validate by comparison with one using the standard activity independent prior that assumes normally distributed joint angles (eq. 3) as importance distribution. We use this motion prior as reference as it is the most commonly used model. As ground truth we will be using data acquired with an optical marker-based motion capture system.

We first illustrate the different priors on a sequence where a person is standing in place while waving a stick. This motion utilises the shoulders a lot; something that often causes problems for articulated trackers. As the person is standing in place, we only track the upper body motions. In fig. 5 we show attained results for the different importance distributions; a film with the results is available as part of the supplementary material. Visually, we see that the data-driven distributions improve the attained results substantially. Next, we set out to measure this gain.

To evaluate the quality of the attained results we position motion capture markers on the arms of the test subject. We then measure the average distance between the motion capture markers and the capsule skin of the attained results. This measure is then averaged across frames, such that the error measure becomes

$$E = \frac{1}{TM} \sum_{t=1}^{T} \sum_{m=1}^{M} D(\hat{\theta}_t, v_m) , \tag{16}$$
where $D(\hat{\theta}_t, v_m)$ is the orthogonal Euclidean distance between the mth motion capture marker and the skin at time t. The error measure is shown in fig. 6a using between 25 and 500 particles. As can be seen, both data-driven importance distributions perform substantially better than the model not utilising the data. For a small number of samples,
Fig. 5. Results from trackers using 150 particles with the different importance distributions. The general trend is that the data-driven distributions improve the results. (a) The angular prior from eq. 3. (b) The importance distribution guided by background subtraction. (c) The importance distribution guided by depth information.
[Fig. 6 plots. Legend: Not Using Data; Using Background Subtraction; Using Depth. (a) Tracking Error (cm), (b) Effective Number of Samples and (c) Running Time (sec), each against Number of Samples.]
Fig. 6. Various performance measures for the tracking systems using different importance distributions on the first sequence. Error bars denote one standard deviation of the attained results over several trials. (a) The tracking error E. (b) The effective number of samples $N_{eff}$. (c) The running time per frame.
the model based on depth outperforms the one based on background subtraction, but for 150 particles and more, the two models perform similarly. In the particle filtering literature the quality of the Monte Carlo approximation is sometimes measured by computing the effective number of samples [12]. This measure can be approximated by

$$N_{eff} = \left( \sum_{n=1}^{N} \left( w_t^{(n)} \right)^2 \right)^{-1} , \tag{17}$$
Fig. 7. Results from trackers using 150 particles with the different importance distributions. The general trend is that the data-driven distributions improve the results. (a) The angular prior from eq. 3. (b) The importance distribution guided by silhouette data. (c) The importance distribution guided by depth information.
[Fig. 8 plots. Legend: Not Using Data; Using Background Subtraction; Using Depth. (a) Tracking Error (cm), (b) Effective Number of Samples and (c) Running Time (sec), each against Number of Samples.]
Fig. 8. Various performance measures for the tracking systems using different importance distributions on the second sequence. Error bars denote one standard deviation of the attained results over several trials. (a) The tracking error E. (b) The effective number of samples $N_{eff}$. (c) The running time per frame.
where $w_t^{(n)}$ denotes the weight of the nth sample in the particle filter. Most often this measure is used to determine when resampling should be performed; here we will use it to compare the different importance distributions. We compute the effective number of samples in each frame and compute the temporal average. This provides us with a measure of how many of the samples are actually contributing to the filter. In fig. 6b we show this for the different importance distributions as a function of the number of particles. As can be seen, the data-driven importance distributions give rise to more effective samples than the one not using the data. The importance distribution based on background subtraction gives between 1.6 and 2.2 times more effective samples than the model not using data, while the model using depth gives between 2.3 and 3.3 times more effective samples. We have seen that the data-driven importance distributions improve the tracking substantially as they increase the effective number of samples. This benefit, however, comes at the cost of an increased running time. An obvious question is then whether this extra
cost outweighs the gains. To answer this, we plot the running times per frame for the tracker using the different distributions in fig. 6c. As can be seen, the two data-driven models require the same amount of computational resources; both require approximately 10% more resources than the importance distribution not using the data. In other words, we can triple the effective number of samples at 10% extra cost.

We repeat the above experiments for a different sequence, where a person is moving his arms in a quite arbitrary fashion; a type of motion that is hard to predict and as such also hard to track. Example results are shown in fig. 7, with a film again being available as part of the supplementary material. Once more, we see that the data-driven importance distributions improve results. The tracking error is shown in fig. 8a; we see that the importance distribution based on depth consistently outperforms the one based on background subtraction, which, in turn, outperforms the one not using the data. The effective number of samples is shown in fig. 8b. The importance distribution based on background subtraction gives between 1.8 and 2.2 times more effective samples than the model not using data, while the model using depth gives between 2.8 and 3.6 times more effective samples. Again a substantial improvement at little extra cost.
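The effective number of samples of eq. 17, in its standard form from [12], and the temporal average reported in figs. 6b and 8b can be sketched as follows. The function names are hypothetical, and the weights are normalised inside the function so that unnormalised inputs are also handled.

```python
def effective_sample_size(weights):
    """Inverse of the sum of squared normalised weights for one frame."""
    total = sum(weights)
    normalised = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalised)

def mean_effective_sample_size(weights_per_frame):
    """Temporal average of the per-frame effective number of samples."""
    sizes = [effective_sample_size(w) for w in weights_per_frame]
    return sum(sizes) / len(sizes)

uniform = [0.25, 0.25, 0.25, 0.25]     # every particle contributes equally
degenerate = [1.0, 0.0, 0.0, 0.0]      # a single particle carries all weight
print(effective_sample_size(uniform))
print(effective_sample_size(degenerate))
print(mean_effective_sample_size([uniform, degenerate]))
```

The value ranges from 1, when one particle dominates, to N, when all N particles have equal weight, which is why it serves as a measure of how many samples actually contribute to the filter.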
7 Conclusion

We have suggested two efficient importance distributions for use in articulated tracking systems based on particle filters. They gain their efficiency by an approximation that allows us to sample joint positions independently. A valid pose is then constructed by a projection onto the manifold $\mathcal{M}$ of possible joint positions. While this projection might seem complicated, it merely corresponds to a least-squares fit of a kinematic skeleton to the sampled joint positions. As such, the suggested importance distributions are quite simple, which consequently means that the algorithms are efficient and that they actually work. In fact, our importance distributions triple the effective number of samples in the particle filter, at little extra computational cost. The simplicity of the suggested distributions also makes them quite general and easy to implement. Hence, they can be used to improve many existing tracking systems with little effort.
References

1. Poppe, R.: Vision-based human motion analysis: An overview. Computer Vision and Image Understanding 108, 4–18 (2007)
2. Sminchisescu, C., Triggs, B.: Kinematic Jump Processes for Monocular 3D Human Tracking. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 69–76 (2003)
3. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: CVPR, pp. 21–26. IEEE Computer Society, Los Alamitos (2000)
4. Sminchisescu, C., Triggs, B.: Estimating articulated human motion with covariance scaled sampling. The International Journal of Robotics Research 22, 371 (2003)
5. Kjellström, H., Kragić, D., Black, M.J.: Tracking people interacting with objects. In: IEEE CVPR (2010)
6. Sidenbladh, H., Black, M.J., Fleet, D.J.: Stochastic tracking of 3d human figures using 2d image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 702–718. Springer, Heidelberg (2000)
7. Urtasun, R., Fleet, D.J., Fua, P.: 3D People Tracking with Gaussian Process Dynamical Models. In: IEEE CVPR, pp. 238–245 (2006)
8. Hauberg, S., Sommer, S., Pedersen, K.S.: Gaussian-like spatial priors for articulated tracking. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 425–437. Springer, Heidelberg (2010)
9. Poon, E., Fleet, D.J.: Hybrid monte carlo filtering: Edge-based people tracking. In: IEEE Workshop on Motion and Video Computing, p. 151 (2002)
10. Hauberg, S., Pedersen, K.S.: Stick it! articulated tracking using spatial rigid object priors. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 758–769. Springer, Heidelberg (2011)
11. Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Physics-based person tracking using the anthropomorphic walker. International Journal of Computer Vision 87, 140–155 (2010)
12. Doucet, A., Godsill, S., Andrieu, C.: On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing 10, 197–208 (2000)
13. Erleben, K., Sporring, J., Henriksen, K., Dohlmann, H.: Physics Based Animation. Charles River Media (2005)
14. Balan, A.O., Sigal, L., Black, M.J., Davis, J.E., Haussecker, H.W.: Detailed human shape and pose from images. In: IEEE CVPR, pp. 1–8 (2007)
15. Vondrak, M., Sigal, L., Jenkins, O.C.: Physical simulation for probabilistic motion tracking. In: CVPR. IEEE Computer Society Press, Los Alamitos (2008)
16. Gall, J., Rosenhahn, B., Brox, T., Seidel, H.-P.: Optimization and filtering for human motion capture. International Journal of Computer Vision 87, 75–92 (2010)
17. Bandouch, J., Beetz, M.: Tracking humans interacting with the environment using efficient hierarchical sampling and layered observation models. In: Computer Vision Workshops, ICCV Workshops (2009)
18. Balan, A.O., Sigal, L., Black, M.J.: A quantitative evaluation of video-based 3d person tracking. Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 349–356 (2005)
19. Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian Process Dynamical Models for Human Motion. IEEE PAMI 30, 283–298 (2008)
20. Lu, Z., Carreira-Perpinan, M., Sminchisescu, C.: People Tracking with the Laplacian Eigenmaps Latent Variable Model. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information Processing Systems, vol. 20, pp. 1705–1712. MIT Press, Cambridge (2008)
21. Bandouch, J., Engstler, F., Beetz, M.: Accurate human motion capture using an ergonomics-based anthropometric human model. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2008. LNCS, vol. 5098, pp. 248–258. Springer, Heidelberg (2008)
22. Hauberg, S., Sloth, J.: An efficient algorithm for modelling duration in hidden markov models, with a dramatic application. J. Math. Imaging Vis. 31, 165–170 (2008)
23. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
24. Ziegler, J., Nickel, K., Stiefelhagen, R.: Tracking of the articulated upper body on multi-view stereo image sequences. In: IEEE CVPR, pp. 774–781 (2006)
25. Ganapathi, V., Plagemann, C., Koller, D., Thrun, S.: Real time motion capture using a single time-of-flight camera. In: IEEE CVPR, pp. 755–762 (2010)
26. Arya, S., Mount, D.M.: Approximate nearest neighbor queries in fixed dimensions. In: Proc. 4th ACM-SIAM Sympos. Discrete Algorithms, pp. 271–280 (1993)
Robust Trajectory-Space TV-L1 Optical Flow for Non-rigid Sequences
Ravi Garg, Anastasios Roussos, and Lourdes Agapito
Queen Mary University of London, Mile End Road, London E1 4NS, UK
Abstract. This paper deals with the problem of computing optical flow between each of the images in a sequence and a reference frame when the camera is viewing a non-rigid object. We exploit the high correlation between 2D trajectories of different points on the same non-rigid surface by assuming that the displacement sequence of any point can be expressed in a compact way as a linear combination of a low-rank motion basis. This subspace constraint effectively acts as a long term regularization leading to temporally consistent optical flow. We formulate it as a robust soft constraint within a variational framework by penalizing flow fields that lie outside the low-rank manifold. The resulting energy functional includes a quadratic relaxation term that allows us to decouple the optimization of the brightness constancy and spatial regularization terms, leading to an efficient optimization scheme. We provide a new benchmark dataset, based on motion capture data of a flag waving in the wind, with dense ground truth optical flow for evaluation of multi-view optical flow of non-rigid surfaces. Our experiments show that our proposed approach provides comparable or superior results to state-of-the-art optical flow and dense non-rigid registration algorithms.
1 Introduction

Optical flow in the presence of non-rigid deformations is a challenging task and an important problem that continues to attract significant attention from the computer vision community given its wide ranging applications from medical imaging and video augmentation to non-rigid structure from motion. Given a template image of a non-rigid object and an input image of it after deforming, the task can be described as one of finding the displacement field (warp) that relates the input image back to the template. In this paper we are interested in the case where we deal with a long image sequence instead of a single pair of images: each of the images in the sequence must be aligned back to the reference frame. Our work concerns the estimation of the vector field of displacements that maps pixels in the reference frame to each image in the sequence. Two significant difficulties arise. First, the image displacements between the reference frame and subsequent ones are large since we deal with long sequences. Secondly, as a consequence of the non-rigidity of the motion, multiple warps can explain the same pair of images, causing ambiguities to arise. A multi-frame approach offers the advantage of exploiting temporal information to resolve these ambiguities. In this paper we make
This work is supported by the European Research Council under ERC Starting Grant agreement 204871-HUMANIS. We are grateful to D. Pizarro for providing results of their method [1] and tracks for the synthetic sequence.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 300–314, 2011. © Springer-Verlag Berlin Heidelberg 2011
use of the high correlation between 2D trajectories of different points on the same non-rigid surface. These trajectories lie on a lower dimensional subspace and we assume that the displacement field of any point can be expressed compactly as a linear combination of a low-rank motion basis. This leads to a significant reduction in the dimensionality of the problem while implicitly imposing some form of temporal smoothness. The flow field can be represented by the basis and a set of coefficients for each point in the template image. In contrast to previous multi-frame optical flow approaches that incorporate explicit temporal smoothness regularization [2], our subspace constraint implicitly acts as a long term smoothing term leading to temporally consistent optical flow. Subspace constraints have been used before both in the context of sparse point tracking [3,4,5] and optical flow [3,6] in the rigid and non-rigid domains, to allow correspondences to be obtained in low textured areas. While Irani’s original rigid [3] and Torresani et al.’s non-rigid [5] formulations relied on minimizing the linearized brightness constraint in their discrete form, Garg et al. [6] extended the subspace constraints to the continuous domain in the non-rigid case using a variational approach. Nir et al. [7] propose a variational approach to optical flow estimation based on a spatio-temporal model with varying coefficients multiplying a set of basis functions at each pixel. However, the common feature of all the above approaches is that the subspace constraint is imposed as a hard constraint. Hard constraints are vulnerable to noise in the model and can be avoided by substituting them with principled robust constraints. In this paper we extend the use of multi-frame temporal smoothness constraints within a variational framework by providing a more principled energy formulation with a robust soft constraint which leads to improved results.
In practice, we penalize deviations of the optical flow trajectories from the low-rank subspace manifold, which acts as a temporal regularization term over long sequences. We then take advantage of recent developments [8,9] in variational methods and optimize the energy using a variant of the efficient duality-based numerical optimization scheme.
2 Related Work and Contribution

Variational methods formulate the optical flow or image alignment problems as the optimization of an energy functional in the continuous domain. Stemming from Horn and Schunck’s original approach [10], the energy incorporates a data term that penalizes deviations from the brightness constancy constraint and a regularization term that allows flow information to be filled in across low textured areas. Variational methods have seen a huge surge in recent years due to the development of more sophisticated data fidelity terms which are robust to changes in image brightness or occlusions [11,12]; the addition of efficient regularization terms such as Total-Variation [13,14] or temporal smoothing terms [2]; and new optimization strategies that allow computation of highly accurate [15] and real time optical flow [13] even in the presence of large displacements [16,11,17]. One of the most successful recent advances in variational methods has been the development of the duality based efficient numerical optimization scheme to solve the TV-L1 optical flow problem [13,9]. Duplication of the optimization variable via a quadratic relaxation is used to decouple the data and regularization terms, decomposing the optimization problem into two, each of which is a convex energy that can be solved in a
R. Garg, A. Roussos, and L. Agapito
globally optimal manner. The minimization algorithm then alternates between solving for each of the two variables assuming the other one fixed. One of the key advantages of this decoupling scheme is that since the data term is point-wise its optimization can be highly parallelized using graphics hardware [13]. Following its success in optical flow computation, this optimization scheme has since been successfully applied to motion and disparity estimation [18] and real time dense 3D reconstruction [19].

Non-rigid image registration has recently seen important progress in robust estimation under severe deformations and large baselines, from both keypoint-based and learning-based approaches. Successful keypoint-based approaches to deformable image registration include the parametric approach of Pizarro and Bartoli [1], who propose a warp estimation algorithm that can cope with wide baseline and self-occlusions using a piecewise smoothness prior on the deforming surface. A direct approach that uses all the pixels in the image is used as a refinement step. Discriminative approaches, on the other hand, learn the mapping that predicts the deformation parameters given a distorted image but require a large number of training samples. Tian and Narasimhan [20] combine generative and discriminative approaches, reusing training samples far away from the test image and thereby lowering the total number of training samples.

Our contribution. In this paper we adopt a robust approach to non-rigid image alignment where instead of imposing the hard constraint that the optical flow must lie on the low-rank manifold [6], we penalize flow fields that lie outside it. Formulating the manifold constraint as a soft constraint using variational principles leads to an energy with a quadratic relaxation term that allows us to adopt a decoupling scheme, similar to the one described above [13,9], for its efficient optimization.
Since our regularization term is parameterized in terms of the basis coefficients, instead of the full flow field, we achieve an important dimensionality reduction in this term, which is usually the bottleneck of other quadratic relaxation duality based approaches [13,9]. Moreover, the optimization of this regularization step can be parallelized due to the independence of the orthogonal basis coefficients adding further advantages to the, already efficient, optimization scheme of Zach et al. [13]. Our approach can be seen as an extension of this efficient TV-L1 flow estimation algorithm to the case of multi-frame non-rigid optical flow, where the addition of subspace constraints acts as a temporal regularization term. Currently, there are no benchmark datasets for the evaluation of optical flow that include long sequences of non-rigid deformations. In particular, the most popular one [21] (Middlebury) does not incorporate any such sequences. In order to facilitate quantitative evaluation of multi-frame non-rigid registration and optical flow and to promote progress in this area, in this paper we provide a new dataset based on motion capture data of a flag waving in the wind, with dense ground truth optical flow. Our quantitative evaluation on this dataset using three different motion bases (Principal Components Analysis (PCA), Discrete Cosine Transform (DCT) and Uniform Cubic B-Splines (UCBS)) shows that our proposed approach improves or has equivalent performance to state of the art large displacement [11] and duality based [13] optical flow algorithms and a parametric dense non-rigid registration approach [1].
3 Multi-frame Image Alignment

Consider an image sequence of a non-rigid object moving and deforming in 3D. In the classical optical flow problem, one seeks to estimate the vector field of image point displacements independently for each pair of consecutive frames. In this paper, we adopt the following multi-frame reformulation of the problem. Taking one frame as the reference template, usually the first frame, our goal is then to estimate the 2D trajectories of every point visible in the reference frame over the entire sequence, using a multi-frame approach. The use of temporal information in this way allows us to predict the location of points not visible in a particular frame, making us robust to self-occlusions or external occlusions by other objects.

3.1 Subspace Trajectory Model

In order to solve the multi-frame optical flow problem, we make use of the fact that the 2D image trajectories of points on an object are highly correlated, even when the object is deforming. We model this property by assuming that the trajectories are near a low-dimensional subspace. This is induced by the non-rigid low-rank shape model, first proposed by Bregler et al. [22], which states that the time varying 3D shape of a non-rigid object can be expressed as a linear combination of a low-rank shape basis. This assumption has been successfully exploited for 3D reconstruction by Non-Rigid Structure from Motion (NRSfM) algorithms [23] and non-rigid 2D tracking [5]. More precisely, assume that the input image sequence has F frames and the n0-th frame, n0 ∈ {1, . . . , F}, has been chosen as the reference. If Ω ⊂ R^2 denotes the image domain, we define the function u : Ω × {1, . . . , F} → R^2 that represents the point trajectories in the following way. For every visible point x ∈ Ω in the reference image, u(x; ·) : {1, . . . , F} → R^2 is its discrete-time 2D trajectory over all frames of the sequence.
The coordinates of each trajectory u(x; ·) are expressed with respect to the position of the point x at n = n0, which means that u(x; n0) = 0 and that the location of the same point in frame n is x + u(x; n). Mathematically, the linear subspace constraint on the 2D trajectories u(x; n) can be expressed in the following way. For all x ∈ Ω and n ∈ {1, . . . , F}:

$$u(x; n) = \sum_{i=1}^{R} q_i(n)\, L_i(x) + \varepsilon(x; n), \tag{1}$$
which states that the trajectory u(x; ·) of any point x ∈ Ω can be approximated as the linear combination of R basis trajectories q_1(n), . . . , q_R(n) : {1, . . . , F} → R^2 that are independent from the point location. We include a modeling error term ε(x; n) which will allow us to impose the subspace constraint as a penalty term. We refer to the subspace where any such combination lies, i.e. the linear span of the basis trajectories, as a trajectory subspace and we denote it by S_Q. The linear combination is controlled by coefficients L_i(x) that depend on x, therefore we can interpret the collection of all the coefficients for all the points x ∈ Ω as a vector-valued image L(x) ≜ [L_1(x), . . . , L_R(x)]^T : Ω → R^R. Effective choices for the model order (or rank) R usually correspond to values much smaller than 2F, which means that the above
representation is very compact and achieves a dramatic dimensionality reduction on the point trajectories. Normally the values of ε(x; n) are relatively small, yet sufficient to improve the robustness of the multi-frame optical flow estimation. We now re-write equation (1) in matrix notation, which will be useful in the subsequent presentation. Let U(x) and E(x) : Ω → R^{2F} be equivalent representations of the functions u(x; n) and ε(x; n) that are derived by vectorizing the dependence on the discrete time n and let Q be the trajectory basis matrix whose columns contain the basis elements q_1(n), . . . , q_R(n), after vectorizing them in the same way:

$$U(x) \triangleq \begin{bmatrix} u(x;1) \\ \vdots \\ u(x;F) \end{bmatrix} \in \mathbb{R}^{2F \times 1}, \quad E(x) \triangleq \begin{bmatrix} \varepsilon(x;1) \\ \vdots \\ \varepsilon(x;F) \end{bmatrix} \in \mathbb{R}^{2F \times 1}, \quad Q \triangleq \begin{bmatrix} q_1(1) & \cdots & q_R(1) \\ \vdots & \ddots & \vdots \\ q_1(F) & \cdots & q_R(F) \end{bmatrix} \in \mathbb{R}^{2F \times R}$$

The subspace constraint (1) can now be written as follows:

$$U(x) = Q\, L(x) + E(x), \quad \forall x \in \Omega \tag{2}$$
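The matrix form (2) can be illustrated numerically with a minimal sketch. The sizes F, R and P, the random orthonormal basis and the noise level below are all hypothetical choices made only for illustration; they are not taken from the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
F, R, P = 30, 6, 500      # frames, rank, number of points (hypothetical sizes)

# Orthonormal trajectory basis Q (2F x R): here simply the Q-factor of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((2 * F, R)))

# Coefficient "image" L(x), flattened to one column per point, and a small model error E(x).
L = rng.standard_normal((R, P))
E = 0.01 * rng.standard_normal((2 * F, P))

# Eq. (2): every column of U is the vectorized trajectory of one point.
U = Q @ L + E

# Because Q has orthonormal columns, the closest subspace point to U(x) is Q Q^T U(x).
L_fit = Q.T @ U
residual = U - Q @ L_fit

print("values per trajectory:", 2 * F, "-> coefficients per point:", R)
print("relative residual:", np.linalg.norm(residual) / np.linalg.norm(U))
```

Each trajectory of 2F numbers is summarized by R coefficients, and the residual is the part of E(x) that lies outside the trajectory subspace.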
3.2 Choice of Basis

Concerning the choice of 2D trajectory basis {q_1(n), . . . , q_R(n)}, we consider orthonormal bases as it simplifies the analysis and calculations in our method (see Section 4). Of course this assumption is not restrictive, since for any basis an orthonormal one can be found that will span the same subspace. We now describe several effective choices of trajectory basis that we have used in our formulation. Predefined bases for single-valued discrete-time signals with F samples can be used to model separately each coordinate of the 2D trajectories. Assuming that the rank R is an even number, this single-valued basis should have R/2 elements w_1(n), . . . , w_{R/2}(n) and the trajectory basis would be given by:

$$q_i(n) = \begin{cases} [\,w_i(n),\ 0\,]^T, & \text{if } i = 1, \dots, R/2 \\ [\,0,\ w_{i-R/2}(n)\,]^T, & \text{if } i = R/2 + 1, \dots, R \end{cases} \tag{3}$$
Provided that the object moves and deforms smoothly, effective choices for the basis {w_i(n)} are (i) the first R/2 low-frequency basis elements of the 1D Discrete Cosine Transform (DCT) or (ii) a sampling of the basis elements of the Uniform Cubic B-Splines of rank R/2 over the sequence’s time window, followed by orthonormalization of the resulting basis. An alternative is to compute the basis by applying Principal Component Analysis (PCA) to a small subset of reliable point tracks. Reliable tracks are those where the texture of the image is strong in both spatial directions and could be selected using Shi and Tomasi’s criterion [24]. Provided that it is possible to estimate a set of reliable tracks that adequately represent the trajectories of the points over the whole object, the choice of the PCA basis is optimal for the linear model of given rank R, in terms of representational power.
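The DCT and PCA choices above can be sketched as follows. The function names are hypothetical, the DCT is taken to be the orthonormal DCT-II, and the PCA is computed as an uncentred SVD of the track matrix; these conventions are assumptions, since the text does not fix them. The synthetic tracks only stand in for real feature tracks.

```python
import numpy as np

def dct_basis_1d(F, K):
    """First K low-frequency elements of the orthonormal 1D DCT-II basis (F x K)."""
    n = np.arange(F)[:, None]
    k = np.arange(K)[None, :]
    W = np.cos(np.pi * (2 * n + 1) * k / (2 * F))
    W[:, 0] *= np.sqrt(1.0 / F)
    W[:, 1:] *= np.sqrt(2.0 / F)
    return W

def trajectory_basis(W):
    """Eq. (3): use the same 1D basis W (F x R/2) for the x and y coordinates.
    Rows are ordered as [x(1), y(1), ..., x(F), y(F)] to match U(x)."""
    F, K = W.shape
    Q = np.zeros((2 * F, 2 * K))
    Q[0::2, :K] = W   # q_i(n) = [w_i(n), 0]^T        for i = 1, ..., R/2
    Q[1::2, K:] = W   # q_i(n) = [0, w_{i-R/2}(n)]^T  for i = R/2+1, ..., R
    return Q

def pca_basis(tracks, R):
    """Rank-R basis from reliable tracks (2F x N) via an uncentred SVD (assumption)."""
    left, _, _ = np.linalg.svd(tracks, full_matrices=False)
    return left[:, :R]

F, R = 60, 12
Q_dct = trajectory_basis(dct_basis_1d(F, R // 2))

rng = np.random.default_rng(1)
tracks = Q_dct @ rng.standard_normal((R, 200))   # synthetic tracks in a rank-R subspace
Q_pca = pca_basis(tracks, R)
```

Both constructions return matrices with orthonormal columns, which is the property relied upon in Section 5.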
4 Variational Multi-frame Optical Flow Estimation

In this section we aim to combine dense motion estimation with the trajectory subspace constraints described in Section 3.1 following variational principles. If I(x; n) : Ω × {1, . . . , F} → R denotes the input image sequence and n0 is the index of the reference frame, then we propose to minimize the following energy:

$$E\big[u(x;n),\, L(x)\big] = \alpha \int_{\Omega} \sum_{n=1}^{F} \big| I(x + u(x;n);\, n) - I(x; n_0) \big| \, dx + \beta \int_{\Omega} \sum_{n=1}^{F} \Big\| u(x;n) - \sum_{i=1}^{R} q_i(n)\, L_i(x) \Big\|^2 dx + \int_{\Omega} \sum_{i=1}^{R} \big\| \nabla L_i(x) \big\| \, dx \tag{4}$$
jointly with respect to the point trajectories u(x; n) and their components on the trajectory subspace that are determined by the linear model coefficients L(x). The positive constants α and β weigh the balance between the terms of the energy. Note that the functions u(x; n) and L(x) determine two sets of trajectories that are relatively close to each other but not exactly the same since the subspace constraint is imposed as a soft constraint (i.e. the model error ε in equation (1) is not zero). Since we regard the linear trajectory model as an approximation, we consider that the final output of our method is the set of trajectories u(x; n). The first term in the above energy is a data attachment term that uses the robust L1-norm and is a direct multi-frame extension of the brightness constancy term used by most optical flow methods, e.g. [13]. It is based on the assumption that the image brightness I(x; n0) at every pixel x of the reference frame is preserved at its new location, x + u(x; n), in every frame of the sequence. The use of an L1-norm improves the robustness of the method since it accounts for deviations from this assumption, which might occur in real-world scenarios because of occlusions of some points in some frames. The second term of the energy (4) penalizes trajectories u(x; n) that do not lie on the trajectory subspace QL(x). In fact, this term corresponds to the energy of the trajectory model error ε (cf. equation (1)) and serves as a soft constraint that the trajectories u(x; n) should be relatively close to the subspace spanned by the basis Q. Concerning the weight β, the larger its value the more restrictive the subspace constraint becomes. We normally use a relatively high value for this weight. Since the subspace of Q is low-dimensional, this constraint operates also as a temporal regularization that is able to perform temporal filling-in in cases of occlusions or other distortions.
Note that, unlike [13], we do not need to introduce an auxiliary variable since this quadratic term allows us to decouple the data term and the regularizer directly. The third term of (4) corresponds to Total Variation-based spatial regularization of the trajectory model coefficients. This term penalizes spatial oscillations of each coefficient caused by image noise or other distortions but not strong discontinuities that are desirable at the borders of each object. In addition, this term allows textural information to be filled into flat regions from their neighborhoods.

Our approach is related to the recent work of Garg et al. [6] in which dense multi-frame optical flow for non-rigid motion is computed imposing hard subspace constraints. Our approach departs in a number of ways. First, while [6] imposes the
subspace constraint via re-parameterization of the optical flow, we use a soft constraint and do not optimize directly on the low-rank manifold but impose that the flow should lie close to it. Secondly, the use of the L1-norm for the data term and a Total Variation regularizer instead of the non-robust L2-norm and quadratic regularizer used by [6] allows us to deal with occlusions and appearance changes and to preserve object boundaries. Finally, by providing a generalization of the subspace constraint, we have extended the approach to deal with any orthogonal basis and not just the PCA basis [6].
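To make the three terms of the energy (4) concrete, the following is a minimal discrete sketch. The array layouts, the bilinear warping with border clamping, and the use of central differences for the gradient are all assumptions made for illustration; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Sample img (H x W) at real-valued positions, clamping to the image border."""
    H, W = img.shape
    xs = np.clip(xs, 0, W - 1)
    ys = np.clip(ys, 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    ax, ay = xs - x0, ys - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x1]
            + (1 - ax) * ay * img[y1, x0] + ax * ay * img[y1, x1])

def energy(I, u, L, Q, n0, alpha, beta):
    """Discrete version of energy (4).
    I: (F, H, W) images, u: (F, H, W, 2) trajectories stored as (dx, dy),
    L: (R, H, W) coefficients, Q: (2F, R) orthonormal basis, n0: reference index."""
    F, H, W = I.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # Data term: L1 brightness constancy against the reference frame.
    data = sum(np.abs(bilinear(I[n], xs + u[n, ..., 0], ys + u[n, ..., 1]) - I[n0]).sum()
               for n in range(F))
    # Subspace term: squared distance between U(x) and Q L(x), summed over pixels.
    U = u.transpose(0, 3, 1, 2).reshape(2 * F, H * W)
    subspace = np.sum((U - Q @ L.reshape(L.shape[0], -1)) ** 2)
    # Total Variation of each coefficient image (central differences, an assumption).
    gy, gx = np.gradient(L, axis=(1, 2))
    tv = np.sqrt(gx ** 2 + gy ** 2).sum()
    return alpha * data + beta * subspace + tv

F, H, W, R = 4, 8, 8, 2
rng = np.random.default_rng(0)
frame = rng.random((H, W))
I = np.stack([frame] * F)                 # static scene, so zero flow is exact
Q, _ = np.linalg.qr(rng.standard_normal((2 * F, R)))
u = np.zeros((F, H, W, 2))
L = np.zeros((R, H, W))
print(energy(I, u, L, Q, n0=0, alpha=30.0, beta=2.0))
```

For the static toy sequence all three terms vanish at zero flow; moving u away from Q L(x) increases the subspace term in proportion to β.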
5 Optimization of the Proposed Energy

As we described in the previous section, the energy in (4) is related to the TV-L1 formulation of the optical flow problem described in [13], therefore we follow a similar alternating approach to solve the optimization problem. We decouple the data and regularization terms to decompose the optimization problem into two, each of which can be solved in a globally optimal manner. The key difference is that we do not solve for pairwise optical flow but instead we optimize over all the frames of the sequence while imposing the trajectory subspace constraint as a soft constraint. In this section we show how to adapt the method of [13] to our problem, to take advantage of its computational efficiency and apply it to multi-frame subspace-constrained optical flow. Assuming an initialization u0(x; n) is available for u(x; n), we apply an alternating optimization, updating either u(x; n) or L(x) in every iteration, as follows:

– Repeat until convergence:
  Step 1. For u(x; n) fixed, update L(x) by minimizing E[u(x; n), L(x)] with respect to L(x).
  Step 2. For L(x) fixed, update u(x; n) by minimizing E[u(x; n), L(x)] with respect to u(x; n).

Convergence is declared if the relative update of L(x) and u(x; n) is negligible according to some appropriate distance threshold.

5.1 Minimization of Step 1

Since in this step we keep u(x; n) fixed, we can observe that only the last two terms of the energy (4) depend on L(x). Regarding the second term, using the matrix notation defined in (2), we can write this penalty term as:

$$\sum_{n=1}^{F} \Big\| u(x;n) - \sum_{i=1}^{R} q_i(n)\, L_i(x) \Big\|^2 = \big\| E(x) \big\|^2 = \big\| U(x) - Q\, L(x) \big\|^2 \tag{5}$$
Let Q⊥ be a 2F × (2F − R) matrix whose columns form an orthonormal basis of the orthogonal complement of the trajectory subspace S_Q. Then the block matrix [Q Q⊥] is an orthonormal 2F × 2F matrix, which means that its columns form a basis of R^{2F}. Consequently, U(x) can be decomposed into two orthogonal vectors as U(x) = Q U_in(x) + Q⊥ U_out(x), where U_in(x) ≜ Q^T U(x) and U_out(x) ≜ (Q⊥)^T U(x) are the coefficients that define the projections of U(x) onto the trajectory subspace S_Q and its orthogonal complement. Equation (5) can now be further simplified:

$$\big\| E(x) \big\|^2 = \big\| Q^{\perp} U_{\mathrm{out}}(x) + Q\,\big(U_{\mathrm{in}}(x) - L(x)\big) \big\|^2 = \big\| U_{\mathrm{out}}(x) \big\|^2 + \big\| U_{\mathrm{in}}(x) - L(x) \big\|^2,$$
due to the orthonormality of the columns of Q and Q⊥ (which makes the corresponding transforms isometric) and Pythagoras’ theorem. The component U_out(x) is constant with respect to L(x); therefore it can be neglected from the current minimization. In other words, with U being fixed and Q L lying on the linear subspace S_Q, penalizing the distance between Q L and U is equivalent to penalizing the distance between Q L and the projection of U onto S_Q. To conclude, the minimization of step 1 is equivalent to the minimization of:

$$\int_{\Omega} \Big( \beta\, \big\| U_{\mathrm{in}}(x) - L(x) \big\|^2 + \sum_{i=1}^{R} \big\| \nabla L_i(x) \big\| \Big) dx = \sum_{i=1}^{R} \int_{\Omega} \Big( \big\| \nabla L_i(x) \big\| + \beta\, \big( U_{\mathrm{in}}^{(i)}(x) - L_i(x) \big)^2 \Big) dx$$

where U_in^(i)(x) is the i-th coordinate of U_in(x). We have finally obtained a new form of the energy that offers a decoupling between the trajectory model coefficients L_i(x). The minimization of each term in the above sum can be done independently and corresponds to the Total Variation-based denoising model of Rudin, Osher and Fatemi (ROF) [25] applied to each coefficient L_i(x). The optimum L_i(x) is actually a regularized version of U_in^(i)(x) and the extent of this regularization increases as the weight β decreases. The benefits in the computational efficiency of the above procedure are twofold. First, these independent minimizations can be parallelized. Second, there exist several efficient algorithms for implementing the ROF model. We have used the method of [8], which uses a dual formulation of the minimization and proposes a globally convergent scheme (cf. [8] for details). Note that this method has been also used by [13] for the problem of optical flow, but under its classical formulation of finding the frame-by-frame displacements.

5.2 Minimization of Step 2

Keeping L(x) fixed, we observe that only the first two terms of the energy (4) depend on u(x; n) and furthermore these terms can be written in the following way:

$$\int_{\Omega} \sum_{n=1}^{F} \Big( \alpha\, \big| I(x + u(x;n);\, n) - I(x; n_0) \big| + \beta\, \big\| u(x;n) - \hat{u}(x;n) \big\|^2 \Big) dx, \tag{6}$$

where û(x; n) = Σ_{i=1}^{R} q_i(n) L_i(x). This quantity depends only on the value of u at the specific point x and the discrete time n (and not on the derivatives of u). Therefore the variational minimization of Step 2 boils down to the minimization of a bivariate function of the value of u for every spatiotemporal point (x; n) independently. We implement this pointwise minimization by applying the technique proposed in [13] to every frame. More precisely, for every frame n and point x the image I(·; n) is linearized around x + u0(x; n), where u0(x; n) are the initializations of the trajectories u(x; n). The function to be minimized at every point will then have the simple form of a summation of a quadratic term with the absolute value of a linear term. The minimum can be easily found analytically using the thresholding scheme reported in [13].

5.3 Implementation Details

The above image linearizations are effective only if the initialization u0(x; n) is relatively close to the actual solution u(x; n). To ensure the linearisation assumptions hold
Fig. 1. Rendering process for ground truth optical flow sequence of a non-rigid object. (a-c) Dense surfaces S_n (panels S_1, S_30, S_60), constructed using thin plate spline interpolation of sparse MOCAP data [26]. (d-f) Rendered sequence I_n (panels I_1, I_30, I_60) using texture mapping of a graffiti image. Superimposed red disks indicate regions where intensities are replaced by black in the version of input with synthetic occlusions. (g-h) Reference image I_1 for the versions with (g) Gaussian and (h) salt and pepper noise.
Table 1. Measures of endpoint errors for different methods on the benchmark sequences

RMS endpoint error (pixels)
Version of input:     Original   Occlusions   Gaus. noise   S&P noise
Ours, PCA basis       0.98       1.33         2.28          1.84
Ours, DCT basis       1.06       1.72         2.78          2.29
Pizarro et al. [1]    1.24       1.27         1.94          1.79
ITV-L1 [27]           1.43       1.89         2.61          2.34
LDOF [11]             1.71       2.01         4.35          5.05

99th percentile of endpoint error (pixels)
Version of input:     Original   Occlusions   Gaus. noise   S&P noise
Ours, PCA basis       3.08       4.92         8.33          7.09
Ours, DCT basis       6.70       5.18         7.92          8.53
Pizarro et al. [1]    4.88       5.05         8.67          8.54
ITV-L1 [27]           6.28       9.44         9.70          9.98
LDOF [11]             3.72       6.63         18.15         20.35
in the case of large optic flow we use coarse-to-fine techniques with multiple warping iterations. We used a similar numerical optimisation scheme and preprocessing of images to the one proposed in [27] to minimise the energy (4), i.e. we use the structure-texture decomposition to make our input robust to illumination artifacts due to shadows and shading reflections. We also used blended versions of the image gradients and a median filter to reject flow outliers. Concerning the choice of parameters, the default values for the ITV-L1 [27] algorithm were found to give the best results for ITV-L1 and our method (5 warp iterations, 20 alternation iterations and the weights α and β were set to 30 and 2). However, weighing the data term with a lower value of α = 10 was found to give superior results in the noisy sequences using a PCA basis with our approach.
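The two alternating steps of Sections 5.1 and 5.2 can be sketched for a single coefficient image and a single spatiotemporal point. The ROF solver below is a standard dual projection iteration of the Chambolle type; whether it matches the exact scheme of [8] is an assumption. The thresholds in `threshold_step` are re-derived here for the weights α and β of (6), following the idea of the thresholding scheme in [13]. All function names, the step size and the iteration count are hypothetical, and coarse-to-fine warping is omitted.

```python
import numpy as np

def grad(f):
    """Forward differences with zero (Neumann) boundary."""
    gx, gy = np.zeros_like(f), np.zeros_like(f)
    gx[:, :-1] = f[:, 1:] - f[:, :-1]
    gy[:-1, :] = f[1:, :] - f[:-1, :]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad."""
    d = np.zeros_like(px)
    d[:, :-1] += px[:, :-1]
    d[:, 1:] -= px[:, :-1]
    d[:-1, :] += py[:-1, :]
    d[1:, :] -= py[:-1, :]
    return d

def rof_denoise(g, beta, n_iter=200, tau=0.125):
    """Step 1 for one coefficient: minimize sum |grad L| + beta * sum (g - L)^2,
    where g plays the role of U_in^(i)(x)."""
    lam = 1.0 / (2.0 * beta)              # ROF weight in the form |L - g|^2 / (2 lam)
    px, py = np.zeros_like(g), np.zeros_like(g)
    for _ in range(n_iter):
        gx, gy = grad(div(px, py) - g / lam)
        norm = np.sqrt(gx ** 2 + gy ** 2)
        px = (px + tau * gx) / (1.0 + tau * norm)
        py = (py + tau * gy) / (1.0 + tau * norm)
    return g - lam * div(px, py)

def threshold_step(u_hat, rho_hat, g, alpha, beta):
    """Step 2 at each point: minimize alpha*|rho_hat + g.(u - u_hat)| + beta*|u - u_hat|^2.
    u_hat, g: (..., 2); rho_hat: (...) linearized brightness residual evaluated at u_hat."""
    t = alpha / (2.0 * beta)
    g2 = np.sum(g * g, axis=-1)
    step = np.where(rho_hat < -t * g2, t,
                    np.where(rho_hat > t * g2, -t, -rho_hat / np.maximum(g2, 1e-12)))
    return u_hat + step[..., None] * g

rng = np.random.default_rng(0)
noisy = np.kron([[0.0, 1.0]], np.ones((16, 8))) + 0.1 * rng.standard_normal((16, 16))
smooth = rof_denoise(noisy, beta=2.0)     # regularized version of the noisy coefficient
u_new = threshold_step(np.array([0.3, -0.2]), 1.0, np.array([1.0, 2.0]), 30.0, 2.0)
```

Here `rho_hat` stands for I(x + u0; n) + g·(û − u0) − I(x; n0), with g the image gradient at x + u0. Both functions act independently per coefficient or per point, which is what makes the parallelization mentioned above possible.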
6 Experimental Results

In this section we evaluate our method and compare its performance with state of the art optical flow [11,13] and image registration [1] algorithms. We show quantitative comparative results on our new benchmark ground truth optical flow dataset and qualitative results on real-world sequences (videos of the results as well as our benchmark dataset can be found at the following URL: http://www.eecs.qmul.ac.uk/~lourdes/subspace_flow). Furthermore, we analyse the sensitivity of our algorithm to some of its parameters, such as the choice of trajectory basis and regularization weight. Since our algorithm computes multi-frame optical flow and incorporates an implicit temporal regularization term, it would have been natural to compare
its performance with a spatiotemporal optical flow formulation [2]. However, due to the lack of publicly available implementations we chose to compare with LDOF (Large Displacement Optical Flow) [11], one of the best performing current optical flow algorithms, which can deal with large displacements by integrating rich feature descriptors into a variational optic flow approach to compute dense flow. We also compare with the duality based TV-L1 algorithm [13] since our method can be seen as its extension to the case of multi-frame non-rigid optical flow via robust trajectory subspace constraints. To be more exact, we compare with the Improved TV-L1 (ITV-L1) algorithm [27] since we use a similar numerical optimization scheme and preprocessing steps (see Section 5.3). In both cases, we register each frame in the sequence independently with the reference frame. We also compare with Pizarro and Bartoli’s state of the art keypoint-based non-rigid registration algorithm [1]. Additionally, we show comparative results with Garg et al. [6] which support our claim that imposing the subspace constraint as a soft instead of a hard constraint results in improved performance and higher resilience to noise.

6.1 Construction of a Ground Truth Benchmark Dataset

For the purpose of quantitative evaluation of multi-frame non-rigid optical flow and to promote progress in this area we generated a benchmark sequence with ground truth. To the best of our knowledge, this is one of the first attempts to generate a long image sequence of a deformable object with dense ground truth 2D trajectories. We use sparse motion capture (MOCAP) data from [26] to capture the real deformations of a waving flag in 3D. We interpolated this sparse data to obtain a continuous dense 3D surface using the motion capture markers as the control points for smooth spline interpolation. This dense 3D surface is then projected synthetically onto the image plane using an orthographic camera.
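The way dense ground truth trajectories follow from an orthographic projection can be illustrated with a short sketch. It assumes a camera looking down the z-axis with unit scale and ignores visibility and resampling onto the pixel grid; the function names and the random stand-in surface are hypothetical.

```python
import numpy as np

def orthographic_project(points_3d):
    """Orthographic camera looking down the z-axis: keep (x, y), drop depth."""
    return points_3d[..., :2]

def ground_truth_trajectories(surfaces, n0):
    """surfaces: (F, P, 3) positions of the same P surface points in every frame.
    Returns (F, P, 2) image displacements relative to the reference frame n0."""
    proj = orthographic_project(surfaces)
    return proj - proj[n0]

rng = np.random.default_rng(0)
surfaces = rng.random((60, 1000, 3))      # stand-in for the interpolated MOCAP surface
u_gt = ground_truth_trajectories(surfaces, n0=0)
```

By construction the displacement in the reference frame is zero, matching the convention u(x; n0) = 0 of Section 3.1.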
We use texture mapping to associate some texture to the surface while rendering 60 frames of size 500×500 pixels. The advantage of this new sequence is that, since it is based on MOCAP data, it captures the complex natural deformations of a real non-rigid object while allowing us to have access to dense ground truth optical flow. We have also used three degraded versions of the original rendered sequence by adding (a) Gaussian noise, of standard deviation 0.2 relative to the range of image intensities, (b) salt & pepper (S&P) noise of density 10% and (c) synthetic occlusions generated by superimposing some black circles of radius 20 pixels moving in linear orbits. Figure 1 shows the interpolated 3D flag surface and some of the frames of the sequence.

6.2 Quantitative Results on Benchmark Sequence

We tested our algorithm using the three different proposed motion bases: PCA, DCT and Cubic B-Spline. Similarly to Garg et al. [6] we compute the PCA basis from sparse tracked features. For the experiments on the benchmark sequence we used the tracks provided by the feature matching algorithm of Pizarro and Bartoli [1] where a robust method based on local surface smoothness is used to discard outliers from an initial set of SIFT feature matches. Temporal cubic spline interpolation is then used to fill in the missing data in each track independently for the computation of the PCA basis. In Table 1, various methods are quantitatively compared using the different versions of the rendered flag sequence as inputs. For this, we use the root mean square (RMS)
and 99th percentiles of the endpoint error, i.e. the amplitude of the difference of ground truth and estimated flow u(x; n). These measures are computed over all the frames and foreground pixels. Note that the results obtained with the Spline basis were omitted since they were almost equivalent to those obtained with the DCT basis, as Figure 3(a) reveals. We observe that our proposed method yields the best RMS measure in the case of the original sequence and outperforms ITV-L1 and LDOF in all other cases. Also, in the case of data with noise it performs comparably to the best performing method, Pizarro et al. [1]. In the case of external occlusions, the method of Pizarro et al. [1] yields the best RMS error. Furthermore, we can observe from the error maps of Figs. 2 and 3 that in the case of self-occlusions the situation seems reversed and our proposed method yields a more accurate result. Concerning the 99th percentile measures of Table 1, i.e. the maximum error after neglecting 1% outliers, the best measures are in all the cases yielded by the two versions of our proposed method. Figure 2 shows a comparison of the results on the flag sequence of our algorithm using a PCA basis of rank R = 75 and a full rank DCT basis R = 120; ITV-L1 optical flow [27]; LDOF [11] and Pizarro and Bartoli’s registration algorithm [1]. We show a closeup of the reverse warped images of 3 frames in the sequence (20, 30 and 60), which should be identical to the template frame, and the error in flow estimation, expressed in pixels, encoded as a heatmap. Our method, both using PCA and DCT, gives lower errors on these frames than the state of the art methods we compare with. Figure 3(c-g) shows a similar comparison in the presence of synthetic occlusions and it is evident that our method and [1] perform much better than others in occluded regions as they model the flow of a non-rigid surface in the reference template.
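The two error measures reported in Table 1 can be computed as in the following sketch. The array layout, the function name and the default linear interpolation of `np.percentile` are assumptions; the text does not specify how the percentile is interpolated.

```python
import numpy as np

def endpoint_error_stats(u_est, u_gt, foreground):
    """u_est, u_gt: (F, H, W, 2) flows; foreground: boolean (H, W) mask.
    Returns the RMS and the 99th percentile of the endpoint error, in pixels,
    pooled over all frames and foreground pixels."""
    err = np.linalg.norm(u_est - u_gt, axis=-1)[:, foreground]
    return np.sqrt(np.mean(err ** 2)), np.percentile(err, 99)

# Toy check: a constant offset of (3, 4) pixels has endpoint error 5 everywhere.
u_gt = np.zeros((4, 6, 6, 2))
u_est = u_gt + np.array([3.0, 4.0])
mask = np.ones((6, 6), dtype=bool)
rms, p99 = endpoint_error_stats(u_est, u_gt, mask)
```

The 99th percentile is less sensitive than the maximum to a handful of gross outliers, which is why it complements the RMS value.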
Figure 3(a) shows a graph of the RMS error over all the frames of the optical flow estimated using the 3 different bases for different values of the rank and of the weight β associated with the soft constraint. For a reasonably large value of β all the bases can be used with a significant reduction in the rank. The optimization also appears not to overfit when the dimensionality of the subspace is overly high. Figure 3(b) explores the effect of varying the value of the weight β on the accuracy of the optical flow. While low values of β cause numerical instability (data and regularization terms become completely decoupled), high values of β, on the other hand, lead to slow convergence and errors since the point-wise search is not allowed to leave the manifold, simulating a hard constraint. Another interesting observation is that our proposed method with a PCA basis of rank R=50 yields a better performance than with a full rank PCA basis R=120. This reflects the fact that the temporal regularization due to the low dimensional subspace is often beneficial. Note that to analyze the sensitivity of our algorithm to its parameters in Figure 3(a-b) we used ground truth tracks to compute the PCA basis to remove the bias from tracking.

6.3 Experiments on Real Sequences

Figure 4 presents comparative results of optical flow methods in two real sequences of textured paper bending smoothly. The first input sequence is particularly challenging due to its length (100 frames) and the large camera rotation. The basis for our method is derived by applying PCA on KLT tracks [28] keeping the first 10 components. Note that our method achieves similar results with the DCT basis of rank R=14. We run the LDOF
Fig. 2. Inverse warps W^{-1}(I_n) and error maps for frames n = 20, 30 and 60 of the flag sequence. (a) Proposed method: PCA basis, (b) DCT basis. (c) ITV-L1 [27]. (d) LDOF [11]. (e) Pizarro et al. [1].
Fig. 3. (a-b) RMS flow error for the proposed method on the flag sequence when varying basic parameters: (a) RMS error (pixels) against the trajectory basis rank R for the PCA, DCT and UCBS bases with β=2; (b) RMS error (pixels) against the weight β for the PCA basis with R=20, R=50 and R=120. (c-g) Flow error maps for the flag sequence with synthetic occlusions: (c) Proposed method: PCA basis, (d) DCT basis. (e) ITV-L1 [27]. (f) LDOF [11]. (g) Pizarro et al. [1].
and ITV-L1 algorithms using a multi-resolution scaling factor of 0.95, whereas for our algorithm the value 0.75 was sufficient (pointing to faster convergence). Comparing the warped images W −1 (In ), we observe that our method yields a significant improvement on the accuracy of the optical flow, especially after some frames (see e.g. the artifacts
Fig. 4. Multi-frame optical flow for different methods, on 2 paper bending sequences (with 100 and 71 frames respectively). (a) Reference frames and (b) frames from the input sequences. (c-e) Flow-based inverse warps W^{-1}(I_n) in the reference frames and color-coded flow fields u(·, n): (c) Proposed method, PCA basis. (d) ITV-L1 [27]. (e) LDOF [11], direct registration. Panel labels: I_{n0} = I_1, n = 40, n = 80, n = 100; I_{n0} = I_30, n = 1, n = 71.
Fig. 5. Multi-frame optical flow results on a T-shirt and a paper-bending sequence. (a) Reference frames (I_1 and I_30 respectively) and (b) representative frames of the input sequences (n = 40, 60 and n = 1, 71 respectively). (c-d) Inverse warps W^{-1}(I_n) for different methods: (c) Proposed method, PCA basis. (d) Garg et al. [6].
annotated by the red ellipses in the results of LDOF and ITV-L1). The second input sequence in Fig. 4 is widely used in NRSfM and contains 71 frames. Our method used a PCA basis of rank R=6 obtained from KLT tracks, with the 30th frame as the reference. Our method yields an accurate result and suffers from fewer artifacts than the others. In Fig. 5, we show results on 2 input sequences to compare our new approach against Garg et al. [6]. The first sequence, with 60 frames, captures a T-shirt deforming as it is stretched from the bottom two corners. The second sequence is the same as in Fig. 4 (bottom). For
the method of [6], we tested different values for the rank R of the basis and we kept the best for each sequence, which was R=3 for the T-shirt and R=8 for the paper bending sequence. For our method, the choice of rank is less crucial and we selected R=8 for the T-shirt and R=6 for the paper bending sequence. We observe that both methods output a plausible result for the T-shirt sequence. However, [6] cannot reliably estimate the optical flow in the corners that are marked with red circles, whereas our proposed method can. On the paper bending sequence, we observe that our method performs significantly better than [6]. We believe that these improvements can be attributed to our use of robust soft subspace constraints and robust Total Variation and L1 data terms.
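As a rough illustration of the low-rank trajectory bases whose rank R is discussed above, the following sketch builds an orthonormal DCT basis over a number of frames and projects one coordinate of a point trajectory onto its first R vectors. The function names, the per-coordinate treatment and the synthetic trajectory are assumptions made for illustration only; the PCA basis obtained from KLT tracks is not reproduced here.

```python
import math

def dct_basis(num_frames, rank):
    """Return `rank` orthonormal DCT-II vectors, each of length `num_frames`."""
    basis = []
    for r in range(rank):
        scale = math.sqrt((1.0 if r == 0 else 2.0) / num_frames)
        basis.append([scale * math.cos(math.pi * (2 * n + 1) * r / (2 * num_frames))
                      for n in range(num_frames)])
    return basis

def project(trajectory, basis):
    """Least-squares approximation of a 1D trajectory in the span of `basis`."""
    coeffs = [sum(b * t for b, t in zip(vec, trajectory)) for vec in basis]
    return [sum(c * vec[n] for c, vec in zip(coeffs, basis))
            for n in range(len(trajectory))]

# Hypothetical smooth trajectory coordinate over 100 frames, approximated with rank 14.
frames = 100
traj = [10.0 * math.sin(2.0 * math.pi * n / frames) for n in range(frames)]
approx = project(traj, dct_basis(frames, 14))
rms = math.sqrt(sum((a - t) ** 2 for a, t in zip(approx, traj)) / frames)
```

Because the basis vectors are orthonormal, the projection reduces to inner products, which is why no linear system has to be solved in this sketch.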
7 Conclusions We have provided a new formulation for the computation of optical flow of a non-rigid surface exploiting the high correlation in a long sequence between 2D trajectories of points by assuming that these lie close to a low dimensional subspace. Our contribution is to formulate the manifold constraint as a soft constraint which, using variational principles, leads to a robust energy with a quadratic relaxation term that allows its efficient optimization. We also provide a new benchmark dataset, with ground truth optical flow. Our proposed approach improves or has equivalent performance to state of the art optical flow algorithms and a non-rigid registration approach.
References
1. Pizarro, D., Bartoli, A.: Feature-based deformable surface detection with self-occlusion reasoning. In: International Symposium on 3D Data Processing, Visualization and Transmission, 3DPVT 2010 (2010)
2. Weickert, J., Schnörr, C.: Variational optic flow computation with a spatio-temporal smoothness constraint. JMIV 14, 245–255 (2001)
3. Irani, M.: Multi-frame correspondence estimation using subspace constraints. IJCV (2002)
4. Torresani, L., Yang, D., Alexander, E., Bregler, C.: Tracking and modeling non-rigid objects with rank constraints. In: CVPR (2001)
5. Torresani, L., Bregler, C.: Space-time tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 801–812. Springer, Heidelberg (2002)
6. Garg, R., Pizarro, L., Rueckert, D., Agapito, L.: Dense multi-frame optic flow for non-rigid objects using subspace constraints. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part IV. LNCS, vol. 6495, pp. 460–473. Springer, Heidelberg (2011)
7. Nir, T., Bruckstein, A.M., Kimmel, R.: Over-parameterized variational optical flow. IJCV 76, 205–216 (2008)
8. Chambolle, A.: An algorithm for total variation minimization and applications. JMIV 20, 89–97 (2004)
9. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. JMIV (2011)
10. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
11. Brox, T., Malik, J.: Large displacement optical flow: Descriptor matching in variational motion estimation. TPAMI (2010)
12. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
13. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
14. Wedel, A., Pock, T., Braun, J., Franke, U., Cremers, D.: Duality TV-L1 flow with fundamental matrix prior. In: Image and Vision Computing, New Zealand (2008)
15. Wedel, A., Cremers, D., Pock, T., Bischof, H.: Structure- and motion-adaptive regularization for high accuracy optic flow. In: ICCV (2009)
16. Alvarez, L., Weickert, J., Sánchez, J.: Reliable estimation of dense optical flow fields with large displacements. IJCV 39, 41–56 (2000)
17. Steinbruecker, F., Pock, T., Cremers, D.: Large displacement optical flow computation without warping. In: ICCV (2009)
18. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: Global solutions of variational models with convex regularization. SIAM Journal on Imaging Sciences (2010)
19. Stühmer, J., Gumhold, S., Cremers, D.: Real-time dense geometry from a handheld camera. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) Pattern Recognition. LNCS, vol. 6376, pp. 11–20. Springer, Heidelberg (2010)
20. Tian, Y., Narasimhan, S.: A globally optimal data-driven approach for image distortion estimation. In: CVPR (2010)
21. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A database and evaluation methodology for optical flow. IJCV 92, 1–31 (2011)
22. Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3D shape from image streams. In: CVPR (2000)
23. Torresani, L., Hertzmann, A., Bregler, C.: Non-rigid structure-from-motion: Estimating shape and motion with hierarchical priors. PAMI 30 (2008)
24. Shi, J., Tomasi, C.: Good features to track. CVPR (1994)
25. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
26. White, R., Crane, K., Forsyth, D.: Capturing and animating occluded cloth. ACM Trans. on Graphics (2007)
27. Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow. In: Cremers, D., Rosenhahn, B., Yuille, A.L., Schmidt, F.R. (eds.) Statistical and Geometrical Approaches to Visual Motion Analysis. LNCS, vol. 5604, pp. 23–45. Springer, Heidelberg (2009)
28. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI 1981 (1981)
Intermediate Flow Field Filtering in Energy Based Optic Flow Computations
Laurent Hoeltgen, Simon Setzer, and Michael Breuß
Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Saarland University, Campus E1.1, 66041 Saarbrücken, Germany
{hoeltgen,setzer,breuss}@mia.uni-saarland.de
Abstract. The Euler-Lagrange framework and splitting based methods are among the most popular approaches to solve variational optic flow problems. These methods are commonly embedded in a coarse-to-fine strategy to be able to handle large displacements. While the use of a denoising filter in between the warping steps is an important tool for splitting based approaches, such a practice is rather uncommon for the Euler-Lagrange method. The question arises why there is this surprising difference in optic flow methods. In previous works it has also been stated that the use of such a filtering leads to a modification of the underlying energy functional; thus, there seems to be a difference in the energies that are actually minimised depending on the chosen algorithmic approach. The goal of this paper is to address these fundamental issues. By a detailed numerical study we show in which way a filtering affects the evolution of the energy for the above mentioned frameworks. Doing so, we not only give many new insights on the use of filtering steps, we also bridge an important methodical gap between the two commonly used implementation approaches.
Keywords: Optic flow, Warping, Median filter, Euler-Lagrange equation, Splitting based methods.
1 Introduction
The most successful class of methods for optic flow estimation is based on minimising an energy formulation (see http://vision.middlebury.edu/flow/). Such an energy combines a data term expressing the constancy of some property of the input images and a smoothness term penalising fluctuations in the flow field. Based on the seminal work of Horn and Schunck [1], sophisticated models have been developed and a high degree of accuracy and robustness has been achieved, see e.g. [2,3,4,5,6,7,8,9,10,11] for some influential publications. One may distinguish energy based optic flow methods not only by the underlying model, but also by the algorithmic realisation. There are two main approaches for implementations. The first one, we call it DEO (discrete energy optimisation), relies on the direct discretisation of the energy functional, followed by the application of an optimisation scheme for this discrete energy. The second one is to discretise the Euler-Lagrange
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 315–328, 2011. © Springer-Verlag Berlin Heidelberg 2011
(EL) equations that describe a necessary condition for a minimiser of the continuous energy functional. In recent years, both approaches have achieved a high technical level, employing latest optimisation techniques [4,12], multigrid methods [13] and parallel implementations on graphics hardware [14,15]. While the individual paths in modeling and implementation for the optic flow problem have received much attention, there are not many papers that deal with fundamental principles such as the use of the warping strategy, see for example [3]. A quite influential recent work in this respect is the paper of Sun, Roth and Black [16], where important building blocks of modern algorithms are examined. One issue of specific importance discussed there is the application of a denoising step inbetween the warping levels as performed in [17]. It is stressed that the use of the median filter to denoise the computed flow fields leads to better results while it also modifies the energy functional that is minimised, as additional terms representing the filter need to be added. For motivating our present work, let us point out explicitly that the investigations in [16] lead to some important questions. On the one hand, a filtering of flow fields during warping is a very important building block for use within the DEO approach, cf. [4,16,17]. On the other hand, it seems to be completely unusual to apply such a filtering in the context of the EL method; see for instance the works [3,11,18] that are supposed to employ sophisticated techniques in that direction and where a denoising step is not reported. Given that both approaches are supposed to finally serve the same purpose, it seems to be interesting to explore why there is this important (and somewhat surprising) methodical gap. Let us point out clearly, that this issue is not only an algorithmic one, but it leads to important questions concerning the modeling of optic flow. Note that Sun et al. 
[16] have shown that the median filtering of flow fields is equivalent to the modification of the original energy functional. Therefore, the optic flow models actually addressed by the two implementation approaches seem to differ on a fundamental level, as the DEO method solves a model which includes terms corresponding to the median filter. The question arises, if the EL approach should be modified accordingly from the beginning on. In our present paper, we clarify these issues. Thereby, we complement and extend the discussion of Sun et al. [16] in several ways, and we also bridge the gap between the DEO and EL implementation approaches. Let us note that the results in [16] are supposed to hold for general energy-based optic flow models. In order to put our investigation in concrete terms, we make use of an L1 -penaliser in both data and smoothness term, and we employ, for simplicity, just the grey value constancy assumption. This can be considered as a basic modern optic flow model, cf. [3,15,17]. Concerning the numerical realisation, we employ a version of the recent primal-dual hybrid gradient (PDHG) algorithm for optimising the discrete energy in the DEO method, see e.g. [4]. For the EL approach we employ the same strategy as in [3] where we make use of the classic successive over-relaxation (SOR) method [19] for solving the arising systems of linear equations. These are embedded in a loop of fixed-point iterations for updating the nonlinearities. The discretisation of the derivatives is in both cases fourth-order accurate, as often employed in schemes today.
Thus, using one and the same optic flow model, we give a detailed numerical study of the approaches, to optimise the discrete energy and the Euler-Lagrange formalism, with respect to median filtering of flow fields. The results are:
1. Concerning the DEO approach, we show that the filtering effect is beneficial when employing a low number of optimisation steps. While the result with median filtering is better than without, the energy of that solution is in comparison substantially higher, confirming results in [16] that were also obtained by a discrete optimisation strategy.
2. We obtain a corresponding result for the EL approach when we employ a low number of fixed-point iterations to update nonlinearities.
3. When increasing the number of optimisation or fixed-point iterations respectively, the filtering becomes less effective. By employing a large number of iterations, one obtains equivalent results for the algorithms with and without filter steps.
4. The number of iterations needed to reach approximately convergence is in our tests higher for the PDHG algorithm than for the fixed-point iterations in the EL method, while these iterations are numerically roughly of the same cost.
Thus, one may infer that flow field filtering is in general more useful for the DEO approach. For numerical setups as often employed in the EL method, a filtering is not necessary and a modification of the energy is not a suitable model.
As another contribution, we compare the classic median filtering on fixed masks to its data-adaptive application [20]. The latter serves not only as a reference filter, it also evaluates the idea of structure adaptivity while being as close as possible to the standard median filtering. We obtain the same behaviour as above for this reference method. This shows that data-adaptivity alone is not enough in order to construct a better filter, one needs a combination of more sophisticated modeling components such as proposed in [16].
Our paper is organised as follows. In Section 2 we briefly review the optic flow model used here. This is followed by a description of the algorithms in Section 3. In Section 4, we perform extensive numerical experiments illustrating the results mentioned above.
2 The Optic Flow Model
Variational formulations for the computation of optic flow have proven to be highly successful since they offer transparent modeling while yielding, at the same time, excellent results. They have been studied for three decades, beginning with the work of Horn and Schunck [1], and, as a consequence, there exist a multitude of formulations that take into account various model assumptions. In the following, we will consider the TV-L1 energy functional [3,15]. It is given by
\[
E[u, v] = \int_\Omega \left( |f(x+u, y+v, t+1) - f(x, y, t)| + \lambda \left\| \begin{pmatrix} \nabla u \\ \nabla v \end{pmatrix} \right\| \right) dx\,dy \tag{1}
\]
where f is our image sequence, u and v the components of the sought flow field in x- resp. y-direction, Ω is the considered spatial image domain, and λ is a regularisation
weight specifying the relative importance between the two penalty terms. Furthermore, we have employed in eq. (1) the definition
\[
\left\| \begin{pmatrix} \nabla u \\ \nabla v \end{pmatrix} \right\| := \sqrt{(\partial_x u)^2 + (\partial_y u)^2 + (\partial_x v)^2 + (\partial_y v)^2} \tag{2}
\]
This energy exhibits a data term that models brightness constancy as well as a regulariser that enforces piecewise smoothness of u and v while still respecting discontinuities in the flow field. In comparison with quadratic penalty terms (as used in [1]), this energy formulation has the advantage that it is more robust towards strong outliers. However, due to the non-differentiability of the L1-norm, its minimisation is considerably more difficult than for quadratic penalisers. Nevertheless, the good results that can be achieved with non-quadratic terms justify the frequent use of this formulation in various forms, see e.g. [11,15,17].
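To make the two terms of the energy concrete, the following sketch evaluates a discrete analogue of eq. (1) on a pixel grid. It is only an illustration under stated assumptions: the second frame is assumed to be already sampled at the displaced positions (x + u, y + v), the derivatives are approximated by simple forward differences with a zero difference at the far border, and the function name `tv_l1_energy` is hypothetical. The discretisation actually used in the experiments is of higher order.

```python
import math

def tv_l1_energy(f0, f1_warped, u, v, lam):
    """Discrete analogue of the TV-L1 energy on a regular grid.

    f0, f1_warped, u, v are lists of rows of equal size; f1_warped holds the
    second frame already sampled at (x + u, y + v). Forward differences with
    a zero difference at the far border are an assumption of this sketch.
    """
    rows, cols = len(f0), len(f0[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            # Data term: absolute brightness difference.
            data = abs(f1_warped[i][j] - f0[i][j])
            # Smoothness term: joint norm of both flow gradients.
            ux = u[i][j + 1] - u[i][j] if j + 1 < cols else 0.0
            uy = u[i + 1][j] - u[i][j] if i + 1 < rows else 0.0
            vx = v[i][j + 1] - v[i][j] if j + 1 < cols else 0.0
            vy = v[i + 1][j] - v[i][j] if i + 1 < rows else 0.0
            smooth = math.sqrt(ux * ux + uy * uy + vx * vx + vy * vy)
            energy += data + lam * smooth
    return energy
```

Coupling all four derivatives under one square root, rather than summing four absolute values, mirrors the joint norm of definition (2).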
3 Numerical Approaches
3.1 The Euler-Lagrange Framework
The EL framework, as employed in e.g. [3,11], considers the Euler-Lagrange equations of the corresponding variational formulation and tries to solve the resulting system of partial differential equations (PDEs). The EL setting requires the occurring terms to be differentiable. Therefore, we approximate eq. (1) by considering the function
\[
\Psi(s^2) := \sqrt{s^2 + \varepsilon^2} \tag{3}
\]
with a small parameter ε, which leads us to the following energy formulation
\[
E[u, v] = \int_\Omega \Big( \Psi\big( (f(x+u, y+v, t+1) - f(x, y, t))^2 \big) + \lambda\, \Psi\big( |\nabla u|^2 + |\nabla v|^2 \big) \Big)\, dx\,dy \tag{4}
\]
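The regularised penaliser of eq. (3) and its derivative with respect to the argument s², which is the quantity that enters the Euler-Lagrange equations of eq. (4), can be sketched as follows. The function names are hypothetical; the default value of ε is the one reported for the experiments in Section 4.

```python
import math

EPS = 1e-3  # epsilon as reported for the EL experiments in Section 4

def psi(s2, eps=EPS):
    """Regularised L1 penaliser, evaluated at the squared argument s2 = s^2."""
    return math.sqrt(s2 + eps * eps)

def psi_prime(s2, eps=EPS):
    """Derivative of psi with respect to s2, i.e. 1 / (2 * sqrt(s2 + eps^2))."""
    return 0.5 / math.sqrt(s2 + eps * eps)
```

For s² much larger than ε², `psi` behaves like |s|, while near zero it stays differentiable, which is what the EL setting requires.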
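According to the calculus of variations, the minimiser of eq. (4) must necessarily fulfil the following system of nonlinear PDEs
\[
\begin{aligned}
\Psi'(I_z^2) \cdot I_x I_z - \lambda \operatorname{div}\big( \Psi'(|\nabla u|^2 + |\nabla v|^2)\, \nabla u \big) &= 0 \\
\Psi'(I_z^2) \cdot I_y I_z - \lambda \operatorname{div}\big( \Psi'(|\nabla u|^2 + |\nabla v|^2)\, \nabla v \big) &= 0
\end{aligned} \tag{5}
\]
with reflecting boundary conditions and where we used the abbreviations
\[
\begin{aligned}
I_x &:= \partial_x f(x+u, y+v, t+1) \\
I_y &:= \partial_y f(x+u, y+v, t+1) \\
I_z &:= f(x+u, y+v, t+1) - f(x, y, t)
\end{aligned} \tag{6}
\]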
Our means to deal with these equations will essentially be identical to the approach presented in [3]. Let us nevertheless go into some detail as this is important for the understanding of our numerical study. We solve the system in (5) through a fixed-point
iteration which we embed further into a multiscale coarse-to-fine warping strategy. Starting with the zero flow field on the coarsest level, we consider iteratively
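\[
\begin{aligned}
\Psi'\big( (I_z^{k+1})^2 \big) \cdot I_x^k I_z^{k+1} - \lambda \operatorname{div}\big( \Psi'(|\nabla u^{k+1}|^2 + |\nabla v^{k+1}|^2)\, \nabla u^{k+1} \big) &= 0 \\
\Psi'\big( (I_z^{k+1})^2 \big) \cdot I_y^k I_z^{k+1} - \lambda \operatorname{div}\big( \Psi'(|\nabla u^{k+1}|^2 + |\nabla v^{k+1}|^2)\, \nabla v^{k+1} \big) &= 0
\end{aligned} \tag{7}
\]
where $I_x^k$, $I_y^k$, $I_z^k$ are defined according to eq. (6) by using the flow field components u and v from iteration k. Each time we reach a fixed-point, we advance to the next finer level and use the solution from the previous level as an initialisation. Since the above approach still yields a nonlinear system, we apply a first order Taylor expansion and approximate
\[
I_z^{k+1} \approx I_z^k + I_x^k\, du^k + I_y^k\, dv^k \tag{8}
\]
where $u^{k+1} = u^k + du^k$ with a known part $u^k$ from coarse levels and an unknown increment $du^k$. The same holds, of course, for $v^{k+1}$. For better readability let
\[
(\Psi')_D^k := \Psi'\big( (I_z^k + I_x^k\, du^k + I_y^k\, dv^k)^2 \big) \tag{9}
\]
\[
(\Psi')_S^k := \Psi'\big( |\nabla (u^k + du^k)|^2 + |\nabla (v^k + dv^k)|^2 \big) \tag{10}
\]
Then the first equation in (7) can be written as
\[
(\Psi')_D^k \cdot I_x^k \big( I_z^k + I_x^k\, du^k + I_y^k\, dv^k \big) - \lambda \operatorname{div}\big( (\Psi')_S^k\, \nabla (u^k + du^k) \big) = 0 \tag{11}
\]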
and the second equation can be reformulated in a similar way. Finally, by introducing a second fixed point strategy on $du^k$ and $dv^k$ we obtain a linear problem of the form
\[
(\Psi')_D^{k,l} \cdot I_x^k \big( I_z^k + I_x^k\, du^{k,l+1} + I_y^k\, dv^{k,l+1} \big) - \lambda \operatorname{div}\big( (\Psi')_S^{k,l}\, \nabla (u^k + du^{k,l+1}) \big) = 0 \tag{12}
\]
for the first equation. A similar expression can again be derived for the second one. As initialisation, we will always use $du^{k,0} = dv^{k,0} = 0$. In the forthcoming experiments, this linear system will be solved using the SOR scheme. Thus, we basically have to perform two nested iterations. The inner iterations are those of the SOR solver, and the outer iterations refer to the fixed point approach on the index l.

3.2 Discrete Energy Optimisation
As described in the introduction, the strategy of the DEO approach is to first define a discrete version of our energy functional and then compute a minimiser via an optimisation method. More specifically, we discretise all the variables as well as the gradient operator and linearise the data-term in order to obtain a convex approximation. Analogously to the Euler-Lagrange approach of Section 3.1, we embed this in a coarse-to-fine warping framework to be able to deal with large displacements. Let us assume that we have already computed the flow field $(u^k, v^k)$ on the warping level k. We denote the grid on which $u^k$ and $v^k$ are defined by $\Omega_k$. Then, $(u^{k+1}, v^{k+1})$ are computed on the next level as a minimiser of the functional
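\[
\underbrace{\sum_{(i,j) \in \Omega_k} \big| I_z^{i,j} + I_x^{i,j} (u_{i,j} - u_{i,j}^k) + I_y^{i,j} (v_{i,j} - v_{i,j}^k) \big|}_{=:F(w)} + \lambda \underbrace{\sum_{(i,j) \in \Omega_k} \left\| \begin{pmatrix} \nabla u_{i,j} \\ \nabla v_{i,j} \end{pmatrix} \right\|}_{=:G(Dw)} \tag{13}
\]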
with $I_x^{i,j}$ (resp. $I_y^{i,j}$) evaluated at warped positions. As indicated in eq. (13), we use the abbreviated notation $F(w) + \lambda G(Dw)$ with $w := (u, v)$ and where the linear mapping D implements the gradient operators. Note that both F and G are convex functions. To minimise discrete energy functions which consist of the sum of convex terms, so-called splitting based methods have become popular in recent years in image processing and computer vision, cf. [15,21,22]. The main idea behind these methods is to treat the different terms of the energy function separately in each iteration and thus to decompose the problem into subproblems which are easy to solve. We follow [4] and use this algorithm to compute a minimiser of (13):

Algorithm (PDHG)
Initialisation: $w^{k+1,-1} := w^{k+1,0} := 0$, $p^0 := 0$
For $l = 0, \dots, N-1$ repeat
\[
\begin{aligned}
\text{Step 1:}\quad p^{l+1} &= \operatorname*{argmin}_{p \in C} \tfrac{1}{2} \big\| p - \big( p^l + \sigma D (2 w^{k+1,l} - w^{k+1,l-1}) \big) \big\|_2^2 \\
\text{Step 2:}\quad w^{k+1,l+1} &= \operatorname*{argmin}_{w} \Big\{ \tfrac{1}{2} \big\| w - \big( w^{k+1,l} - \tau D^T p^{l+1} \big) \big\|_2^2 + \tau F(w) \Big\}
\end{aligned} \tag{14}
\]
Output: $w^{k+1} := w^{k+1,N}$

The set C is defined as $C := \{ p : \|p_{i,j}\|_2 \le \lambda, \ \forall (i,j) \in \Omega_k \}$. In [23], a similar algorithm which uses $w^{k+1,l}$ instead of the extrapolation $2 w^{k+1,l} - w^{k+1,l-1}$ in the first minimisation problem of the above algorithm was proposed for image processing applications. We also refer to [4] for a detailed analysis. The authors of [23] call their method a primal-dual hybrid gradient (PDHG) algorithm and we also choose this terminology for the slightly different version used here. Observe that the PDHG algorithm was characterised as an inexact Uzawa method in [23,24], see also [25]. Furthermore, it corresponds to Algorithm 1 in [4] (with θ = 1). For the step length parameters σ and τ satisfying $\sigma\tau < 1/\|D\|_2^2$ and any initial values, the sequence $(w^{k+1,l})_l$ generated by the PDHG algorithm is guaranteed to converge to a minimiser of the energy functional in (13), cf. [4,23]. Solving the two minimisation problems in each iteration of the PDHG algorithm can be done explicitly. Clearly, we can compute the orthogonal projection in the first step independently for each pixel via
\[
\operatorname*{argmin}_{p \in C} \tfrac{1}{2} \| p - \tilde{p} \|_2^2 = \left( \frac{\tilde{p}_{i,j}}{\max(1, \|\tilde{p}_{i,j}\|_2 / \lambda)} \right)_{i,j} \tag{15}
\]
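The pixel-wise projection in eq. (15) amounts to rescaling every dual vector whose Euclidean norm exceeds λ. A minimal sketch follows; the representation of the dual variable as a flat list of per-pixel tuples and the function name `project_onto_c` are assumptions made for illustration.

```python
import math

def project_onto_c(p_tilde, lam):
    """Pixel-wise orthogonal projection onto C.

    p_tilde is a list of dual vectors (tuples of floats), one per pixel;
    each is rescaled so that its Euclidean norm does not exceed lam.
    """
    projected = []
    for vec in p_tilde:
        norm = math.sqrt(sum(c * c for c in vec))
        scale = max(1.0, norm / lam)  # equals 1 if the vector already lies in C
        projected.append(tuple(c / scale for c in vec))
    return projected
```

Vectors inside the ball of radius λ are returned unchanged, so the operation costs one norm evaluation per pixel.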
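Interestingly, the minimisation problem in the second step of the PDHG algorithm also decouples, and for any $\bar{w}$ the components of the minimiser
\[
\hat{w} = \operatorname*{argmin}_{w} \Big\{ \tfrac{1}{2} \| w - \bar{w} \|_2^2 + \tau F(w) \Big\} \tag{16}
\]
are given for each $(i,j) \in \Omega_k$ as follows. Let us define $a := (I_x^{i,j}, I_y^{i,j})^T$ and $\xi := I_z^{i,j} + a^T (\bar{w}_{i,j} - w_{i,j}^k)$, then we have
\[
\hat{w}_{i,j} = \bar{w}_{i,j} +
\begin{cases}
\tau a, & \text{if } \xi < -\tau \|a\|_2^2 \\
-\tau a, & \text{if } \xi > \tau \|a\|_2^2 \\
-\xi a / \|a\|_2^2, & \text{if } |\xi| \le \tau \|a\|_2^2
\end{cases} \tag{17}
\]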
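The case distinction of eq. (17) is a thresholding step that can be written per pixel in a few lines. The sketch below assumes that ξ is the linearised residual of the data term in eq. (13) evaluated at the point where the step is taken; the function name `prox_data_term`, the argument layout and the guard for a vanishing image gradient are assumptions for illustration.

```python
def prox_data_term(w_bar, w_k, a, i_z, tau):
    """Pixel-wise minimiser of 0.5 * |w - w_bar|^2 + tau * |i_z + a . (w - w_k)|.

    w_bar : point at which the step is evaluated, as (u, v)
    w_k   : flow from the previous warping level, as (u, v)
    a     : image derivatives (I_x, I_y) at the warped position
    i_z   : temporal difference I_z at the warped position
    """
    a_sq = a[0] * a[0] + a[1] * a[1]
    if a_sq == 0.0:
        return w_bar  # no gradient information: the data term is constant in w
    xi = i_z + a[0] * (w_bar[0] - w_k[0]) + a[1] * (w_bar[1] - w_k[1])
    if xi < -tau * a_sq:
        step = tau
    elif xi > tau * a_sq:
        step = -tau
    else:
        step = -xi / a_sq  # moves exactly onto the zero set of the residual
    return (w_bar[0] + step * a[0], w_bar[1] + step * a[1])
```

In the third case the update lands on the line where the linearised residual vanishes, which is where the absolute value is not differentiable.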
4 Experiments
We present a detailed numerical study making use of the Rubberwhale sequence from the popular Middlebury Computer Vision page (http://vision.middlebury.edu/flow/). The exact ground truth of this sequence is known and publicly available, thus allowing us to analyse the behaviour of the average endpoint error (AEE) as well as the discrete version of the energy of the flow field, cf. (1). In accordance with the results of this detailed exposition, we summarise corresponding experiments for other sequences later on. As for the parameters of our algorithms, we paid special attention to the influence of varying numbers of iterations. The regularisation parameter λ in the energy functional (1) was fixed throughout all the Rubberwhale tests at 5. For the Yosemite experiments it was set to 2 and during the Marble evaluation we used a value of 20. The EL framework used ε = 10^{-3} in (3). We further chose the step length parameters σ = 7.8 and τ = 0.02 in the PDHG algorithm. The warping pyramid is always computed to the maximal possible extent. The scaling parameter is fixed at 0.95 in each image direction although a value of 0.5 also seems to be a common choice, see e.g. [15,16]. Let us emphasize that all the tests that we perform here can also be done for such a smaller scaling parameter. We observed the same behaviour for such setups as for the experiments reported below and include one example for the scaling parameter 0.5 at the end of Section 4.1. The main reason for us to use the scaling parameter 0.95 is that it seems to be frequently used for the EL approach, see [3,11]. Since we will perform comparative tests with this framework as well, it appeared more appropriate to us to choose 0.95. In addition to the experiments performed with standard median filtering we also consider a reference method in order to evaluate the influence of the filter choice. We employ a structure adaptive filter as proposed in [16]. The authors of [16] suggested a filter that relies on the computation of weights over a given mask, whereas our reference filter adapts the mask itself. The idea of such a construction is that the point masks adapt locally to the variation of flow field values while taking into account the Euclidean
distance to the origin pixel for which it is set up. Large deviations in the flow field values are penalised, however, it may grow around corners or along strong structures in the flow. Finally, a median filter is applied on this resulting mask; see [20] for a detailed description and theoretical investigations. The classic median filter we employ has a standard 5 × 5 square shaped mask and is applied twice at each flow field component between the warping steps. The adaptive median filter used a threshold of 0.65 for the tonal difference and a maximal length of 3. This yields a maximal mask size similar to the one for the classic median filter and allows us to give a fair comparison. The adaptive filter is also applied twice inbetween the different levels. Let us emphasize at this point, that none of the filters is applied anymore once we reach the finest resolution level. With the above described setting we will analyse (i) the evolution of the AEE as well as the energy with and without additional filtering and (ii) the convergence behaviour of both algorithms for low and high numbers of iterations. In this context we will also briefly comment on the influence of varying numbers of inner and outer iterations in the EL approach. 4.1 The Impact of Intermediate Flow Field Filtering In a first series of tests, we analyse the influence of an additional median or adaptive median filtering. PDHG Algorithm. Our findings for the PDHG framework are presented in Table 1. For low numbers of iterations, e.g. 10-15, they correspond to the results described in [16]: applying a filter improves the endpoint error, however it also increases the energy. Interestingly, this phenomenon vanishes when we increase the number of iterations. For large numbers of optimisation steps, we observe that the effect of filtering on the error becomes negligible and that there is practically no difference in the energy between filtered and non filtered solutions. 
Also note that during the energy evolution, there are in practice always energy fluctuations that decrease in size with the number of optimisation steps. These cannot be captured by our tables.

Table 1. Rubberwhale experiment: Behaviour of the PDHG algorithm for different filters and numbers of iterations. For low numbers of iterations, additional filtering yields a decrease in the error and an increase in the energy. This effect decreases for higher numbers of iterations. Both the error as well as the energy value are not influenced by the filtering when the algorithm is close to convergence.

Iterations   None                  Median                Struct. adapt.
             Energy      Error     Energy      Error     Energy      Error
10           230705.0    0.2321    231755.0    0.2300    231664.5    0.2296
15           220609.0    0.1952    221089.5    0.1933    221056.5    0.1925
75           206292.5    0.1401    205922.0    0.1397    206109.0    0.1394
750          204376.0    0.1347    204316.5    0.1341    204572.0    0.1341
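The classic filtering step used in these experiments (a 5 × 5 square mask, applied twice to each flow component between the warping steps) could be sketched as follows. The border handling by clamping indices and the function names are assumptions of this sketch; the structure adaptive reference filter is not covered.

```python
from statistics import median

def median_filter(field, radius=2):
    """Median filter with a (2*radius+1)^2 square mask; borders are clamped."""
    rows, cols = len(field), len(field[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [field[min(max(i + di, 0), rows - 1)][min(max(j + dj, 0), cols - 1)]
                      for di in range(-radius, radius + 1)
                      for dj in range(-radius, radius + 1)]
            out[i][j] = median(window)
    return out

def filter_flow(u, v, passes=2):
    """Apply the median filter `passes` times to each flow component."""
    for _ in range(passes):
        u, v = median_filter(u), median_filter(v)
    return u, v
```

In a coarse-to-fine loop such a function would be called on the flow of each level except the finest one, matching the statement that no filter is applied once the finest resolution is reached.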
Fig. 1. Rubberwhale experiment: absolute energy differences for different filtering methods and the PDHG algorithm, plotted against the number of iterations (curves: adaptive vs. no filtering, standard vs. no filtering, adaptive vs. standard filtering). The longer one iterates, the smaller the influence of filtering becomes.
Fig. 2. Rubberwhale experiment: energy evolution of the PDHG algorithm with different filters and a scaling parameter of 0.5 for the coarse-to-fine strategy (energy value plotted against the number of iterations; curves: without filtering, with standard median filtering, with adaptive median filtering). Although a low number of iterations yields a higher energy when applying additional filtering, this effect vanishes after approximately 15 optimisation steps.
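To give a feeling for what the scaling parameter of the coarse-to-fine strategy means in practice, the following sketch lists the image sizes of a warping pyramid that is built down to a smallest admissible size. The stopping size `min_size`, the rounding rule and the frame size of 584 × 388 pixels used in the example are assumptions made for illustration, not values taken from the experiments.

```python
def pyramid_sizes(width, height, scale, min_size=4):
    """Image sizes from the finest to the coarsest level for a scaling parameter."""
    sizes = [(width, height)]
    level = 1
    while True:
        w = int(round(width * scale ** level))
        h = int(round(height * scale ** level))
        if min(w, h) < min_size:
            break  # the next level would be too small to be useful
        sizes.append((w, h))
        level += 1
    return sizes

# Assumed frame size; a scaling parameter close to 1 yields many more levels.
levels_half = len(pyramid_sizes(584, 388, 0.5))
levels_fine = len(pyramid_sizes(584, 388, 0.95))
```

A scaling parameter of 0.95 thus implies far more warping levels, and hence far more places where an intermediate filter can act, than a parameter of 0.5.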
Figure 1 depicts the absolute value of the difference in the energy with and without additional filtering, giving us further detailed information of the numerical energy evolution. For low number of iterations the impact of filtering is strong in the beginning but slowly wears off as the number of iterations increases. Furthermore, the figure also depicts the difference of the energy between the standard and adaptive median filtering. Interestingly, this difference is rather small, suggesting that it might not be that important to consider adaptivity alone for improving filters. We supplement this study with
Table 2. Rubberwhale experiment: behaviour of the EL approach with a fixed number of 40 inner iterations with SOR for different filters and numbers of outer iterations. For low numbers of outer iterations, additional filtering yields a decrease in the error and an increase in the energy. This effect is no longer visible for high numbers of outer iterations (close to convergence). Both the error and the energy value are not influenced by the filtering if more than 200 outer iterations are applied.

Outer        None                  Median                Struct. adapt.
Iterations   Energy      Error     Energy      Error     Energy      Error
5            202828.5    0.1338    203216.0    0.1310    203218.5    0.1307
10           203328.0    0.1341    202768.5    0.1321    202822.5    0.1319
75           204245.5    0.1348    204187.0    0.1342    204427.0    0.1341
200          204342.0    0.1350    204285.5    0.1343    204529.5    0.1342
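The error columns of these tables report the average endpoint error (AEE) with respect to the ground truth flow. A minimal sketch of this measure, assuming that the flow components are given as lists of rows and that every pixel has valid ground truth, is:

```python
import math

def average_endpoint_error(u, v, u_gt, v_gt):
    """Mean Euclidean distance between estimated and ground truth flow vectors."""
    total, count = 0.0, 0
    for row_u, row_v, row_ug, row_vg in zip(u, v, u_gt, v_gt):
        for a, b, c, d in zip(row_u, row_v, row_ug, row_vg):
            total += math.hypot(a - c, b - d)
            count += 1
    return total / count
```

Benchmark ground truth often marks some pixels as unknown; such pixels would have to be skipped before averaging, which this sketch does not do.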
Fig. 3. Rubberwhale experiment: absolute energy difference with different filtering methods for 4 (left) and 40 (right) inner iterations with SOR, plotted against the number of iterations (curves: adaptive vs. no filtering, standard vs. no filtering, adaptive vs. standard filtering). For small numbers of outer iterations the impact of filtering is again significantly larger than for high numbers of outer iterations. Note that the decrease is significantly faster with a high number of inner iterations. The y-axis uses the same scaling in both figures.
an experiment evaluating a warping scaling factor of 0.5; see Fig. 2. We observe the same qualitative behaviour as detailed via Table 1 and Figure 1 for the case 0.95. As the choice of small warping scale factors does not give more insight in the context of our study, we conclude the investigation of different scaling factors with this example.

Euler Lagrange Approach. We conducted two test series: one where we applied 40 inner iterations with the SOR solver and another where we applied only 4. The former guarantees that the linear system inside the fixed-point iteration is always solved very accurately, whereas the latter corresponds to a more commonly used number of iterations.
Intermediate Flow Field Filtering in Energy Based Optic Flow Computations
Table 3. Rubberwhale experiment: comparison of the energy values for different filters and very high numbers of iterations. The number of iterations was chosen such that all methods have practically converged. The first number for the EL framework indicates the outer iterations, the second one corresponds to the iterations for the SOR solver. In the converged state there is no substantial difference between filtered and unfiltered solutions.

                   None                 Median               Struct. adapt.
Algorithm          Energy     Error     Energy     Error     Energy     Error
PDHG (800 it.)     204958.0   0.1357    204302.0   0.1340    204566.0   0.1341
EL (75/04 it.)     204856.3   0.1355    204202.3   0.1341    204432.5   0.1340
EL (75/07 it.)     204916.5   0.1362    204197.0   0.1340    204436.5   0.1339
EL (75/40 it.)     204245.5   0.1348    204187.0   0.1342    204427.0   0.1341
Table 4. The Yosemite test: the comparison of DEO and EL frameworks confirming the results of our detailed investigation with the Rubberwhale sequence. We observe a decreasing influence of the filter in the energy, see especially the EL results. While the filter results show some benefit here, note that the reported iterates have not yet fully converged.

PDHG
             Endpoint error                   Absolute energy difference
Iterations   None     Median   Str. adapt.    None/Median   None/Str. adapt.
10           0.2577   0.2309   0.1958         1250.4        1014.4
15           0.2194   0.1997   0.1729         1035.0        1022.0
75           0.1985   0.1899   0.1735         200.6         205.4
200          0.2022   0.1916   0.1760         278.4         270.8

EL with 4 inner SOR steps
             Endpoint error                   Absolute energy difference
Iterations   None     Median   Str. adapt.    None/Median   None/Str. adapt.
5            0.1738   0.1601   0.1597         938.0         905.4
15           0.1772   0.1734   0.1727         149.4         144.6
75           0.1887   0.1764   0.1775         154.2         152.2
125          0.1842   0.1786   0.1781         79.0          73.4
Fig. 3 depicts the influence of the filtering for the just described setting. Using 40 inner iterations with SOR, we obtain the same behaviour as for the PDHG algorithm concerning the fixed-point iterations. If we use only 4 inner iterations, a decrease of the filtering influence is still there for higher numbers of outer iterations. However, it is much less pronounced. This may be due to the fact that the linear system inside the EL equations is not solved accurately enough, possibly involving a numerical blurring effect. All in all, we observe a similar behaviour in the EL setting as for the PDHG formulation.
Table 5. The Marble test: We observe no effect on the AEE when using a filter. While we observe some energy fluctuations in this example, comparing the energy differences after 10 and 200 steps for the PDHG approach clearly shows the expected decrease.

PDHG
             Endpoint error                   Absolute energy difference
Iterations   None     Median   Str. adapt.    None/Median   None/Str. adapt.
10           0.1787   0.1778   0.1775         4710          4764
15           0.1713   0.1709   0.1709         2372          2432
75           0.1575   0.1574   0.1576         1232          1232
200          0.1582   0.1579   0.1581         1478          1458

EL with 4 inner SOR steps
             Endpoint error                   Absolute energy difference
Iterations   None     Median   Str. adapt.    None/Median   None/Str. adapt.
5            0.1565   0.1565   0.1565         2362          2362
15           0.1575   0.1574   0.1574         818           818
75           0.1589   0.1589   0.1589         1372          1376
125          0.1590   0.1590   0.1590         1466          1466
4.2 Euler-Lagrange and Splitting Methods Close to Convergence

Table 3 depicts the energy value of the considered algorithms for very high numbers of iterations. We applied 800 steps with the PDHG algorithm and 75 outer iterations as well as a varying number of inner SOR steps within the EL approach. At this point all considered methods have practically reached convergence. Two important things become immediately apparent. First, the energy does not differ significantly whether we apply a filter or not. In this example the smoothing effect of the filtering even results in a slightly lower energy than without filtering, with negligible consequences for the AEE. Second, both frameworks yield very similar energy values.

4.3 Results for Further Test Sequences

Because of space restrictions, we have selected just two more image sequences, namely the Marble and the Yosemite sequences (see footnotes 3 and 4). For other sequences the results are similar.
3 Available from http://i21www.ira.uka.de/image_sequences/
4 Available from http://vision.middlebury.edu/flow/data/

5 Conclusions

In our paper, we have clarified the mechanism behind the filtering of flow fields during warping. We think that through our investigations the effect of this technique is now well understood. In this, we have complemented and extended the previous work [16].
Furthermore, we have closed an important methodical gap between the two main algorithmic approaches in modern optic flow computation. Since most papers in the field of optic flow employ just one of these techniques, we also hope that the current work improves the mutual understanding of researchers who mainly follow one of the two paths. In our future work, we strive for further insights into numerical schemes in computer vision.
References

1. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
2. Black, M.J., Anandan, P.: The robust estimation of multiple motions: parametric and piecewise smooth flow fields. Computer Vision and Image Understanding 63, 75–104 (1996)
3. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
4. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 120–145 (2011)
5. Mémin, E., Pérez, P.: Hierarchical estimation and segmentation of dense motion fields. International Journal of Computer Vision 46, 129–155 (2002)
6. Nir, T., Bruckstein, A.M., Kimmel, R.: Over-parametrized variational optical flow. International Journal of Computer Vision 76, 205–216 (2008)
7. Wedel, A., Cremers, D., Pock, T., Bischof, H.: Structure- and motion-adaptive regularization for high accuracy optic flow. In: Proc. 2009 IEEE International Conference on Computer Vision. IEEE Computer Society Press, Kyoto (2009)
8. Werlberger, M., Pock, T., Bischof, H.: Motion estimation with non-local total variation regularization. In: Proc. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, San Francisco (2010)
9. Weickert, J., Schnörr, C.: A theoretical framework for convex regularizers in PDE-based computation of image motion. International Journal of Computer Vision 45, 245–264 (2001)
10. Xu, L., Jia, J., Matsushita, Y.: Motion detail preserving optical flow estimation. In: Proc. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, San Francisco (2010)
11. Zimmer, H., Bruhn, A., Weickert, J., Valgaerts, L., Salgado, A., Rosenhahn, B., Seidel, H.P.: Complementary optic flow. In: Cremers, D., Boykov, Y., Blake, A., Schmidt, F.R. (eds.) EMMCVPR 2009. LNCS, vol. 5681, pp. 207–220. Springer, Heidelberg (2009)
12. Goldluecke, B., Cremers, D.: Convex relaxation for multilabel problems with product label spaces. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 225–238. Springer, Heidelberg (2010)
13. Bruhn, A., Weickert, J., Kohlberger, T., Schnörr, C.: A multigrid platform for real-time motion computation with discontinuity-preserving variational methods. International Journal of Computer Vision 70, 257–277 (2006)
14. Gwosdek, P., Zimmer, H., Grewenig, S., Bruhn, A., Weickert, J.: A highly efficient GPU implementation for variational optic flow based on the Euler-Lagrange framework. In: Proc. 2010 ECCV Workshop on Computer Vision with GPUs, Heraklion, Greece (September 2010) accepted
15. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
16. Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2439. IEEE, Los Alamitos (2010)
17. Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow. In: Cremers, D., Rosenhahn, B., Yuille, A.L., Schmidt, F.R. (eds.) Statistical and Geometrical Approaches to Visual Motion Analysis. LNCS, vol. 5604, pp. 23–45. Springer, Heidelberg (2009)
18. Zimmer, H., Bruhn, A., Weickert, J.: Optic flow in harmony (2011) to appear in International Journal of Computer Vision
19. Saad, Y.: Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia (2000)
20. Welk, M., Breuß, M., Vogel, O.: Morphological amoebas are self-snakes. Journal of Mathematical Imaging and Vision 39, 87–99 (2011)
21. Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences 1, 248–272 (2008)
22. Goldstein, T., Osher, S.: The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences 2, 323–343 (2009)
23. Zhang, X., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing 46, 20–46 (2011)
24. Esser, E., Zhang, X., Chan, T.F.: A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences 3, 1015–1046 (2010)
25. Arrow, K.J., Hurwicz, L., Uzawa, H.: Studies in linear and non-linear programming. Stanford University Press, Stanford (1958)
TV-L1 Optical Flow for Vector Valued Images

Lars Lau Rakêt¹, Lars Roholm², Mads Nielsen¹, and François Lauze¹

¹ Department of Computer Science, University of Copenhagen, Denmark
{larslau,madsn,francoisg}@diku.dk
² IT University of Copenhagen, Denmark
[email protected]

Abstract. The variational TV-L1 framework has become one of the most popular and successful approaches for calculating optical flow. One reason for the popularity is the very appealing properties of the two terms in the energy formulation of the problem: the robust L1-norm of the data fidelity term combined with the total variation (TV) regularization that smooths the flow but preserves strong discontinuities such as edges. Specifically, the approach of Zach et al. [1] has provided a very clean and efficient algorithm for calculating TV-L1 optical flows between grayscale images. In this paper we propose a generalized algorithm that works on vector valued images, by means of a generalized projection step. We give examples of calculations of flows for a number of multidimensional constancy assumptions, e.g. gradient and RGB, and show how the developed methodology expands to any kind of vector valued images. The resulting algorithms have the same degree of parallelism as the case of one-dimensional images, and we have produced an efficient GPU implementation that can take vector valued images with vectors of any dimension. Finally we demonstrate how these algorithms generally produce better flows than the original algorithm.

Keywords: Optical flow, TV, convex nonsmooth analysis, vector valued images, projections on ellipsoids, GPU implementation.
1 Introduction
During the last decade estimation methods for optical flow have improved tremendously. This is in part due to ever-increasing computational power, but in particular to a wide variety of new interesting estimation methods (Xu et al. [2], Sun et al. [3], Zimmer et al. [4]) as well as novel implementation choices that have proven to effectively increase accuracy (Sun et al. [5]). A basic framework for calculating optical flow is based on the variational TV-L1 energy formulation. This method and variations thereof have proved to be very effective (Bruhn et al. [6], Brox et al. [7], Zach et al. [1]). In the pure form the TV-L1 energy consists of a term penalizing the total variation of the estimated flow, and a term encouraging data matching in terms of an L1-norm that is robust to outliers. One type of closely related energies is obtained by replacing the L1-norms with smooth Charbonnier penalties ([6], [7]), and another variation consists in replacing the L1-norms by smooth Huber norms [8]. In the pure form, Zach et al. [1] were the first to solve the TV-L1 optical flow problem using nonsmooth convex analysis.

This paper presents an algorithm for calculating the TV-L1 optical flow between two vector valued images $I_0, I_1 : \mathbb{R}^d \to \mathbb{R}^k$, which is an extension that has not previously been done in the nonsmooth convex analysis setting. The algorithm generalizes the highly influential algorithm by Zach et al. [1], and the extension allows the use of e.g. color or gradient information when calculating the flows. This additional information improves the quality of the flow compared to only using intensity values. We will also show that simple and clean implementations of the presented algorithms in a number of cases even surpass the more sophisticated TV-L1-improved algorithm [9] on training data from the Middlebury optical flow database [10] in terms of average endpoint error.

The focus of this paper is not to produce perfectly engineered algorithms to compete on the Middlebury benchmark, but to develop and explore the generalizations of an elegant optical flow algorithm. It is however the hope that the work presented here will lay the ground for competitive optical flow algorithms in the future.

The paper is organized as follows. In the next section we recall the TV-L1 formulation for calculating optical flow. In Section 3 we introduce the tools used to solve our vectorial extension. A general algorithm and some implementation issues are discussed in Section 4, and examples are given in Section 5. We then present experimental results and comparisons on image sequences from the Middlebury database in Section 6, and finally we summarize and discuss future directions.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 329–343, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 TV-L1 Optical Flow of Vector Valued Images
The recovery of motion patterns is clearly an important feature of human vision, and during the last two decades a large number of computer vision applications have become dependent on motion information. The optical flow is one way of expressing motion information, where we calculate the displacement field $v$ between two images, $I_0$ and $I_1$. This field should minimize the difference $I_1(x + v(x)) - I_0(x)$ while still being sufficiently regular. Here we will concentrate on the variational TV-L1 formulation of the optical flow problem, where the optical flow is recovered as a minimizer of the energy $E$:

\[ E(v) = \lambda \int_T \| I_1(x + v(x)) - I_0(x) \| \, dx + \int_T \| \nabla v(x) \| \, dx. \tag{1} \]
The first term in this energy is the L1-term, i.e. an integral of the Euclidean norm of the difference $I_1(x + v(x)) - I_0(x)$, where the images are vector valued. This term builds on an assumption that image values are conserved over time, and along the motion. For grayscale images this implies that we do not have radical changes in the lighting of the scene, but for vector valued images the exact meaning will depend on the nature of the image format, as we will see in Section 5. The second term simply penalizes the total variation of the flow, so the estimation favors smoother displacement fields.

The first step in minimizing this energy is to linearize the optical flow constraint around the point $x + v_0^x$ for each $x$,

\[ I_1(x + v(x)) - I_0(x) \approx \underbrace{I_1(x + v_0^x) + J_{I_1}(x + v_0^x)\,(v(x) - v_0^x) - I_0(x)}_{=\rho(v)(x)}, \tag{2} \]
where $J_{I_1}$ denotes the Jacobian of $I_1$. The problem is then split in two, introducing an auxiliary variable $u$, and the following energies are then minimized iteratively in a coarse-to-fine pyramid scheme:

\[ E_1(u) = \int_T \| \nabla u(x) \| \, dx + \frac{1}{2\theta} \int_T \| v(x) - u(x) \|^2 \, dx, \tag{3} \]

\[ E_2(v) = \lambda \int_T \| \rho(v)(x) \| \, dx + \frac{1}{2\theta} \int_T \| v(x) - u(x) \|^2 \, dx. \tag{4} \]
Equation (3) is solved by the well known method by Chambolle (Chambolle [11], Bresson and Chan [12]) which is reproduced as Proposition 1 in [1], and this minimization will not be discussed in the present paper. Instead we will solve the minimization of (4). Since no differential of $v$ is involved, the minimization of (4) boils down to a pointwise minimization, with $v_0^x$ and $u(x)$ fixed, of a strictly convex cost function of the form

\[ F(v) = \frac{1}{2} \| v - u_0 \|^2 + \lambda \| Av + b \|. \tag{5} \]
When $I_0$ and $I_1$ are scalar-valued, $b$ is real and $A : \mathbb{R}^d \to \mathbb{R}$ is the differential of $I_1$ computed at $x + v_0^x$, which is a linear form. Because $b$ is always in the range of $A$ when $A \neq 0$, the minimization above boils down to computing the residual of a projection onto a closed and bounded line segment. This is the essence of Propositions 2 and 3 of [1] (when $A = 0$, the above minimization is of course trivial). When, on the other hand, $I_0$ and $I_1$ take their values in $\mathbb{R}^k$, $A$ becomes a linear map $\mathbb{R}^d \to \mathbb{R}^k$, and it may happen that $b \in \mathbb{R}^k$ is not in the image (range) of $A$, even when $A \neq 0$. In this case the cost function $F$ is smooth and can be minimized by usual variational methods. On the other hand, when $b \in \operatorname{Im} A$, one needs to project onto an elliptic ball. This will be discussed in the next section.
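The case distinction between $b \in \operatorname{Im} A$ and $b \notin \operatorname{Im} A$ can be checked numerically by removing from $b$ its orthogonal projection onto the column space of $A$. The following is a minimal sketch in plain Python, not the authors' implementation; the function names, the Gram-Schmidt approach and the tolerances are assumptions made for illustration.

```python
import math

def range_residual(A, b, tol=1e-12):
    """Return b minus its orthogonal projection onto the column space of A.

    A is a k x d matrix given as a list of k rows, b is a list of length k.
    (Hypothetical helper; the tolerance is an illustrative choice.)
    """
    k, d = len(A), len(A[0])
    basis = []  # orthonormal basis of Im A, built by Gram-Schmidt
    for j in range(d):
        col = [A[i][j] for i in range(k)]
        for q in basis:
            dot = sum(ci * qi for ci, qi in zip(col, q))
            col = [ci - dot * qi for ci, qi in zip(col, q)]
        norm = math.sqrt(sum(ci * ci for ci in col))
        if norm > tol:  # skip columns that add no new direction
            basis.append([ci / norm for ci in col])
    res = list(b)
    for q in basis:
        dot = sum(ri * qi for ri, qi in zip(res, q))
        res = [ri - dot * qi for ri, qi in zip(res, q)]
    return res

def in_range(A, b, tol=1e-9):
    """True if b lies (numerically) in the image of A."""
    return math.sqrt(sum(r * r for r in range_residual(A, b))) <= tol

# Made-up Jacobian of a 3-channel image at one pixel (k = 3, d = 2).
A = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(in_range(A, [1.0, 2.0, 0.0]))  # b lies in Im A
print(in_range(A, [1.0, 2.0, 3.0]))  # b has a component orthogonal to Im A
```

For a scalar-valued image, `A` has a single row, so any nonzero `A` spans all of $\mathbb{R}$ and the test always succeeds, in line with the remark above.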
3 A General Minimization Problem
In this section we present the tools used for solving the minimization problem (5). We first recall a few elements of convex analysis; the reader can refer to [13] for a complete introduction to convex analysis in both finite and infinite dimension. Here we will restrict ourselves to finite dimensional problems.
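A function $f : \mathbb{R}^d \to \mathbb{R}$ is one-homogeneous if $f(\lambda x) = \lambda f(x)$ for all $\lambda > 0$. For a one-homogeneous function, it is easily shown that its Legendre-Fenchel transform

\[ f^*(x^*) = \sup_{x \in \mathbb{R}^d} \{ \langle x, x^* \rangle - f(x) \} \tag{6} \]

is the characteristic function of a closed convex set $\mathcal{C}$ of $\mathbb{R}^d$,

\[ d_{\mathcal{C}}(x^*) := f^*(x^*) = \begin{cases} 0 & \text{if } x^* \in \mathcal{C}, \\ +\infty & \text{otherwise.} \end{cases} \tag{7} \]

The one-homogeneous functions that will interest us here are of the form $f(x) = \|Ax\|$ where $A : \mathbb{R}^d \to \mathbb{R}^k$ is linear, and $\|\cdot\|$ is the usual Euclidean norm of $\mathbb{R}^k$. The computation of the associated Fenchel transform involves the Moore-Penrose pseudoinverse $A^\dagger$ of $A$. We recall its construction. The kernel (or null-space) of $A$, denoted $\operatorname{Ker} A$, is the vector subspace of the $v \in \mathbb{R}^d$ for which $Av = 0$. The image (or range) of $A$, denoted $\operatorname{Im} A$, is the subspace of $\mathbb{R}^k$ reached by $A$. The orthogonal complement of $\operatorname{Ker} A$ is denoted $\operatorname{Ker} A^\perp$. Call $\iota$ the inclusion map $\operatorname{Ker} A^\perp \to \mathbb{R}^d$ and $\pi$ the orthogonal projection $\mathbb{R}^k \to \operatorname{Im} A$. It is well known that the composition map $B = \pi \circ A \circ \iota$,

\[ \operatorname{Ker} A^\perp \xrightarrow{\ \iota\ } \mathbb{R}^d \xrightarrow{\ A\ } \mathbb{R}^k \xrightarrow{\ \pi\ } \operatorname{Im} A, \tag{8} \]

is a linear isomorphism between $\operatorname{Ker} A^\perp$ and $\operatorname{Im} A$. The Moore-Penrose pseudoinverse $A^\dagger$ of $A$ is defined as

\[ A^\dagger = \iota \circ B^{-1} \circ \pi. \tag{9} \]

With this, the following lemma provides the Legendre-Fenchel transform of $f(x)$:

Lemma 1. The Legendre-Fenchel transform of $x \mapsto \|Ax\|$ is the characteristic function $d_{\mathcal{C}}$ of the elliptic ball $\mathcal{C}$ given by the set of $x$'s in $\mathbb{R}^d$ that satisfy the following conditions

\[ A^\dagger A x = x, \tag{10} \]

\[ x^\top A^\dagger A^{\dagger\top} x \le 1. \tag{11} \]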
From the properties of pseudoinverses, the equality $x = A^\dagger A x$ means that $x$ belongs to $\operatorname{Ker} A^\perp$. In fact, $A^\dagger A$ is the orthogonal projection on $\operatorname{Ker} A^\perp$. On this subspace, $A^\dagger A^{\dagger\top}$ is positive definite and the inequality thus defines an elliptic ball. We will not prove the lemma, but we indicate how it can be done. In the case where $A$ is the identity $\mathrm{Id}$ of $\mathbb{R}^d$, it is easy to show that $\mathcal{C}$ is the closed unit ball of $\mathbb{R}^d$. The case where $A$ is invertible follows easily, while the general case follows from the latter using the structure of the pseudoinverse (see [14] for instance). We can now state the main result, which allows us to generalize the TV-L1 algorithm from [1] to calculate the optical flow between two vector valued images.
Proposition 1. Minimization of (5).
(i) In the case $b \notin \operatorname{Im} A$, $F(v)$ is smooth. It can be minimized by usual methods.
(ii) In the case where $b \in \operatorname{Im} A$, $F(v)$, which fails to be smooth for $v \in \operatorname{Ker} A - A^\dagger b$, reaches its unique minimum at

\[ v = u_0 - \pi_{\lambda\mathcal{C}}\left( u_0 + A^\dagger b \right) \tag{12} \]

where $\pi_{\lambda\mathcal{C}}$ is the projection onto the convex set $\lambda\mathcal{C} = \{\lambda x,\ x \in \mathcal{C}\}$, with $\mathcal{C}$ as described in Lemma 1.

To see (i), write $b$ as $Ab_0 + b_1$, with $b_0 = A^\dagger b$, $Ab_0$ being then the orthogonal projection of $b$ onto $\operatorname{Im} A$, while $b_1$ is the residual of the projection. The assumption of (i) implies that $b_1 \neq 0$ is orthogonal to the image of $A$. One can then write

\[ \|Av + b\| = \|A(v + b_0) + b_1\| = \sqrt{\|A(v + b_0)\|^2 + \|b_1\|^2} \tag{13} \]

which is always strictly positive as $\|b_1\|^2 > 0$, and smoothness follows. In the situation of (ii), since $b \in \operatorname{Im} A$, we can do the substitution $v \leftarrow v + A^\dagger b$ in function (5), and the resulting function has the same form as a number of functions found in [11] and [15]. We refer the reader to them for the computation of minimizers.

Proposition 1 generalizes Propositions 2 and 3 from [1] since, on one-dimensional spaces, elliptic balls are simply line segments. The next example demonstrates this, and the subsequent example extends to multi-dimensional values.

Example 1. Consider the minimization problem

\[ \arg\min_v \ \frac{1}{2} \|v - u_0\|^2 + \lambda\,|a^\top v + b|, \qquad \lambda > 0, \tag{14} \]
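where $v, u_0 \in \mathbb{R}^d$, $a \in \mathbb{R}^d \setminus \{0\}$. The pseudoinverse of $v \mapsto a^\top v$ is the multiplication by $a/\|a\|^2$. Applying Lemma 1, the set $\mathcal{C}$ is just the line segment $[-a, a]$ and Proposition 1 gives the solution

\[ v = u_0 - \pi_{\lambda[-a,a]}\left( u_0 + \frac{b}{\|a\|^2}\, a \right), \tag{15} \]

where the projection is given by

\[ \pi_{\lambda[-a,a]}\left( u_0 + \frac{b}{\|a\|^2}\, a \right) = \begin{cases} -\lambda a & \text{if } a^\top u_0 + b < -\lambda\|a\|^2, \\ \lambda a & \text{if } a^\top u_0 + b > \lambda\|a\|^2, \\ \dfrac{a^\top u_0 + b}{\|a\|^2}\, a & \text{if } |a^\top u_0 + b| \le \lambda\|a\|^2. \end{cases} \tag{16} \]

This is easily seen to correspond to the projection step for the TV-L1 algorithm in [1], namely Proposition 3 with $a = \nabla I_1$ and $b = I_1(\,\cdot + v_0) - \nabla I_1 \cdot v_0 - I_0$. ◦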
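The closed form of Example 1 amounts to clamping the coefficient $(a^\top u_0 + b)/\|a\|^2$ to $[-\lambda, \lambda]$ and subtracting the resulting multiple of $a$ from $u_0$. A minimal sketch in plain Python follows; the function names are hypothetical, and the input values are made up for illustration.

```python
def scalar_tvl1_update(u0, a, b, lam):
    """Minimize 0.5*||v - u0||^2 + lam*|<a, v> + b| over v, assuming a != 0."""
    rho = sum(ai * ui for ai, ui in zip(a, u0)) + b   # a^T u0 + b
    a2 = sum(ai * ai for ai in a)                      # ||a||^2
    # Projection onto lam*[-a, a]: clamp the coefficient of a to [-lam, lam].
    t = max(-lam, min(lam, rho / a2))
    return [ui - t * ai for ui, ai in zip(u0, a)]

def cost(v, u0, a, b, lam):
    """The objective of (14), used here only to sanity-check the result."""
    quad = 0.5 * sum((vi - ui) ** 2 for vi, ui in zip(v, u0))
    return quad + lam * abs(sum(ai * vi for ai, vi in zip(a, v)) + b)

u0, a, b, lam = [3.0, 5.0], [1.0, 0.0], 0.0, 1.0
v = scalar_tvl1_update(u0, a, b, lam)
print(v, cost(v, u0, a, b, lam))
```

In this toy setting the residual $a^\top u_0 + b = 3$ exceeds $\lambda\|a\|^2 = 1$, so the update moves $u_0$ by $-\lambda a$, which is the thresholding behaviour of the scalar TV-L1 step.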
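Example 2. Now consider the more general minimization problem

\[ \arg\min_v \ \frac{1}{2} \|v - u\|^2 + \lambda \|Av + b\|, \qquad \lambda > 0, \tag{17} \]

where $A \in \mathbb{R}^{k \times 2}$. If $A$ has maximal rank (i.e. 2), then it is well known that the $2 \times 2$ matrix $C = A^\dagger A^{\dagger\top}$ is symmetric and positive definite [14]. The set $\mathcal{C}$ of Lemma 1 is then an elliptic disc determined by the eigenvectors and eigenvalues of $C$. If however the matrix has two linearly dependent columns $a \neq 0$ and $ca$, a series of straightforward calculations gives

\[ \operatorname{Ker} A = \mathbb{R}y, \qquad \operatorname{Ker} A^\perp = \mathbb{R}x, \qquad \operatorname{Im} A = \mathbb{R}a, \tag{18} \]

with $x = \frac{1}{\sqrt{1+c^2}}(1, c)^\top$ and $y = \frac{1}{\sqrt{1+c^2}}(-c, 1)^\top$ an orthonormal basis of $\mathbb{R}^2$, and

\[ A^\dagger A^{\dagger\top} = \frac{1}{(1 + c^2)^2 \|a\|^2} \begin{pmatrix} 1 & c \\ c & c^2 \end{pmatrix}. \tag{19} \]

If $c = 0$, the inequality (11) from Lemma 1 just amounts to

\[ \frac{u_1^2}{\|a\|^2} \le 1 \iff -\|a\| \le u_1 \le \|a\| \tag{20} \]

(with $u = (u_1, u_2)^\top$), a vertical strip, while equality (10) in Lemma 1 simply says that $u_2 = 0$. Thus, setting $\gamma = \|a\|$, $\mathcal{C}$ is the line segment

\[ [-\gamma x, \gamma x] \subset \mathbb{R}^2. \tag{21} \]

The case where $c \neq 0$ is identical, and is obtained for instance by rotating the natural basis of $\mathbb{R}^2$ to the basis $(x, y)$. ◦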
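The rank-deficient case of Example 2 can be sketched as a one-dimensional clamp along the direction $x$. The code below is an illustration under stated assumptions, not the authors' CUDA implementation: the function name is hypothetical, $b$ is assumed to lie in $\operatorname{Im} A$, and for general $c$ the half-length of the segment is taken as $\gamma = \|a\|\sqrt{1+c^2}$, which is our own evaluation of (11) with the matrix (19) and reduces to $\gamma = \|a\|$ for $c = 0$.

```python
import math

def rank_one_update(u, a, c, b, lam):
    """Minimize 0.5*||v - u||^2 + lam*||A v + b|| for A = [a, c*a] (k x 2).

    Assumes a != 0 and b in Im A. All names are illustrative.
    """
    r2 = 1.0 + c * c
    a2 = sum(ai * ai for ai in a)
    x = (1.0 / math.sqrt(r2), c / math.sqrt(r2))        # unit vector spanning Ker A^perp
    gamma = math.sqrt(a2 * r2)                          # half-length of the segment C
    beta = sum(ai * bi for ai, bi in zip(a, b)) / (a2 * r2)
    z = (u[0] + beta, u[1] + beta * c)                  # u + pinv(A) b
    # Projection of z onto lam*[-gamma*x, gamma*x]: clamp its coordinate along x.
    t = max(-lam * gamma, min(lam * gamma, z[0] * x[0] + z[1] * x[1]))
    return (u[0] - t * x[0], u[1] - t * x[1])

# Three channels carrying the same information up to scale (made-up values).
print(rank_one_update((2.0, 2.0), [1.0, 0.0, 0.0], 1.0, [0.0, 0.0, 0.0], 1.0))
```

With $c = 0$ and a single channel this reduces to the scalar update of Example 1.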
4 Implementation
In this section we will consider how to use the tools developed in the previous section to implement an algorithm for calculating the TV-L1 optical flow between vector valued images. One particularly appealing feature of the formulation we have given is that the algorithm is essentially dimensionless, in the sense that we can produce a single implementation that can take images $I_0, I_1 : \mathbb{R}^2 \to \mathbb{R}^k$ for all values of $k$. This can be done since the calculations given in Example 2 only depend on $k$ for the calculation of norms. We recall that the linearized data fidelity term is given by

\[ \rho(v) = \underbrace{J_{I_1}(x + v_0^x)}_{A}\, v(x) + \underbrace{I_1(x + v_0^x) - I_0(x) - J_{I_1}(x + v_0^x)\, v_0^x}_{b}, \tag{22} \]

and according to Proposition 1 there are two main situations to consider when minimizing $E_2$ in (4). The first situation is when $b \notin \operatorname{Im} A$, which translates to $I_1(x + v_0^x) - I_0(x) \notin \operatorname{Im} A$, and in this situation the energy is smooth and can be minimized by usual methods (following e.g. [7]). In the alternative situation, we can minimize the energy by the projection step described in Proposition 1 (ii). In the case of images with two spatial coordinates, the calculations necessary for the projection step are done in Example 2. A generic algorithm for the vector TV-L1 flow is given in Algorithm 1.
Data: Two vector valued images I0 and I1
Result: Optical flow field u from I0 to I1
for L = Lmax to 0 do                                  // Pyramid levels
    Downsample the images I0 and I1 to current pyramid level
    for W = 0 to Wmax do                              // Warping
        if I1(x + u(x)) − I0(x) ∈ Im J_I1(x + u(x)) then
            Compute v as the minimizer of E2 in (4), using Proposition 1 (ii) on current pyramid level
        else
            Compute the minimizer v by gradient descent
        end
        for I = 0 to Imax do                          // Inner iterations
            Solve (3) for u on current pyramid level
        end
    end
    Upscale flows to next pyramid level
end

Algorithm 1. General TV-L1 algorithm for vector valued images
4.1 Projections on Elliptic Balls
As already mentioned, the set $\mathcal{C}$ will be an elliptic ball, and when the Jacobian $J_{I_1}$ has full rank the ball is proper, i.e. it is not degenerated to a line segment or a point. Projecting a point outside $\mathcal{C}$ onto the boundary ellipsoid is somewhat more complicated than projection onto a line segment, and finding efficient and precise algorithms for this is still an active area of research [16]. We have taken the approach of doing a gradient descent on the boundary ellipsoid. This approach is not very efficient in terms of computational effort, but on the upside the algorithm is very simple and easily implemented. The algorithm is specified in Algorithm 2, where, in order to alleviate notation, we have denoted the matrix $(J_{I_1}^\dagger J_{I_1}^{\dagger\top})(x + v_0^x)$ by $C$. The convergence criterion of the algorithm is based on the fact that the line between the original point $w$ and the projection $v^n$ should be orthogonal to the boundary at $v^n$. If this is not achieved in 100 iterations, the last value is returned.
Data: The point w and the matrix C specifying the bounding ellipsoid.
Result: The orthogonal projection of w onto the ellipsoid.
Set v^0 = w / sqrt(⟨w, Cw⟩), and let the stepsize τ > 0 be small enough.
while v^n has not converged do
    v^{n+1} = v^n + τ (w + ⟨v^n, v^n − v_0^x⟩ C v^n)      // Gradient step
    v^{n+1} = v^{n+1} / sqrt(⟨v^{n+1}, C v^{n+1}⟩)        // Reprojection
end

Algorithm 2. Projection of a point w onto an ellipsoid
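The idea behind Algorithm 2, a descent step followed by rescaling back onto the ellipsoid, can be illustrated in two dimensions. The sketch below is not the authors' update rule: as an assumption, it uses a standard tangential gradient step for the squared distance $\frac{1}{2}\|v - w\|^2$, a default step size of 0.1 and an iteration cap of 1000 (the text above mentions 100). The function name and the test values are hypothetical.

```python
import math

def project_onto_ellipse(w, C, tau=0.1, tol=1e-12, max_iter=1000):
    """Approximate the point of {v : <v, C v> = 1} closest to w (2-D sketch).

    C is a symmetric positive definite 2 x 2 matrix given as nested lists.
    """
    def matvec(M, v):
        return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

    def dot(p, q):
        return p[0] * q[0] + p[1] * q[1]

    def rescale(v):
        # Radial rescaling so that <v, C v> = 1 (the "reprojection" step).
        s = math.sqrt(dot(v, matvec(C, v)))
        return (v[0] / s, v[1] / s)

    v = rescale(w)  # starting point on the ellipse, as in Algorithm 2
    for _ in range(max_iter):
        n = matvec(C, v)                          # normal direction at v
        d = (w[0] - v[0], w[1] - v[1])            # descent direction of 0.5*||v - w||^2
        coef = dot(d, n) / dot(n, n)
        g = (d[0] - coef * n[0], d[1] - coef * n[1])  # tangential part of d
        if math.sqrt(dot(g, g)) < tol:            # w - v is orthogonal to the boundary
            break
        v = rescale((v[0] + tau * g[0], v[1] + tau * g[1]))
    return v

# Ellipse with semi-axes 2 and 1; the point (3, 3) lies outside it.
print(project_onto_ellipse((3.0, 3.0), [[0.25, 0.0], [0.0, 1.0]]))
```

The stopping test mirrors the convergence criterion described above: the iteration ends once the line from $w$ to the current iterate is numerically orthogonal to the boundary.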
In the situation where the Jacobian has full rank, it holds that $J_{I_1}^\dagger J_{I_1} = \mathrm{Id}$. This means that the condition of equation (10) disappears. In addition it simplifies the expression for the point we are projecting, since the $v_0^x$'s cancel out, so the point simply becomes

\[ w = J_{I_1}^\dagger(x + v_0^x)\,\bigl(I_1(x + v_0^x) - I_0(x)\bigr). \tag{23} \]

4.2 Implementation Choices
In the implementations we present here it has been assumed that we are always in the situation that $I_1(x + v_0^x) - I_0(x) \in \operatorname{Im} J_{I_1}(x + v_0^x)$, and when this is not the case, we simply project $I_1(x + v_0^x) - I_0(x)$ onto the image of the Jacobian, so we never do the gradient descent step for $v$. The justification for this is twofold. First, if noise is small (relative to the data fidelity term), one expects that a displacement vector $v$ satisfies the following

\[ 0 \approx I_1(x + v_0^x) - I_0(x) \approx Av + b \tag{24} \]

with the notations of equation (22), which means that $b$ is "approximately" in $\operatorname{Im} A$. Second, in the case where noise is considerable, this step is followed by a regularization step, which should correct for the discrepancies when noise is modest in the neighborhood. Considering the actual case that $b$ is not in the image of the Jacobian may however improve the precision of the computed optical flow slightly.

For the implementation we use five pyramid levels with a downsampling factor of 2, and the gradients used in the projection step are estimated by bicubic lookup. For the minimization of $E_1$ in (3) we use forward and backward differences as suggested in [11]. Finally it should be mentioned that the minimization procedure presented in Proposition 1 is highly parallel. We have chosen to implement the algorithm in CUDA C, so the computations can be accelerated by the hundreds of cores on modern graphics processing units.
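To make the pyramid setting concrete, the following sketch lists the image sizes of a five-level pyramid with downsampling factor 2. The function name and the integer rounding (floor division) are assumptions made for illustration; the text does not specify how sizes are rounded.

```python
def pyramid_shapes(width, height, levels=5, factor=2):
    """Image sizes from the finest level (index 0) to the coarsest level."""
    shapes = []
    for level in range(levels):
        scale = factor ** level
        # Floor division is an assumed rounding rule; sizes never drop below 1.
        shapes.append((max(1, width // scale), max(1, height // scale)))
    return shapes

# Coarse-to-fine order, as traversed by the outer loop of Algorithm 1.
for level, shape in reversed(list(enumerate(pyramid_shapes(640, 480)))):
    print(level, shape)
```

For a 640 × 480 image the coarsest of the five levels is 40 × 30 pixels under this rounding rule.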
5 Examples
In this section we will consider a number of different constancy assumptions for vector valued images. We will start with perhaps the simplest example of this. Consider two gray-scale images $I_0, I_1 : \mathbb{R}^2 \to \mathbb{R}$, and let

\[ \mathbf{I}_0 = \nabla I_0, \qquad \mathbf{I}_1 = \nabla I_1. \tag{25} \]
Solving (1) then corresponds to computing the flow with the gradient constancy assumption (GCA) proposed by Brox et al. [7], which will typically be more robust to illumination changes [10]. A very simple implementation of the flow algorithm from the previous section can then be obtained by assuming that $J_{I_1}$ always has full rank, and then simply doing the ellipse projection in each step. When the Jacobian does not have full rank, a small amount of noise is added to the entries in the matrix until the determinant is no longer zero. This approach is justified by the observation that it is relatively rare that the Hessian of the original images is zero when the derivatives are estimated using bicubic lookup, and it is our experience that the suggested procedure does not introduce a noticeable bias in the resulting flow. In addition it has the positive side effect of slightly faster computations.

The most obvious example of a vector valued constancy assumption is constancy of RGB colors, such that $I_0, I_1 : \mathbb{R}^2 \to \mathbb{R}^3$. Other color spaces can also be used (e.g. HSV for a more robust representation [4]). The colors provide valuable discriminative information that should typically increase the precision of the flow compared to only using brightness values. As opposed to gradient constancy (25), it is much more common that the three RGB layers contain exactly the same information, so the calculations should take into account the rank of the Jacobian, and do the projection step according to Example 2.

An alternative higher order constancy assumption is based on the Laplacian of individual RGB color channels. The Laplacian has the property that it is invariant to rotation or flipping in the pixel neighborhood, and so should be better suited for these types of motion (see e.g. [17]).
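A gradient image as in (25) can be built from a gray-scale image with simple finite differences. This is a minimal sketch, not the authors' code: the function name is hypothetical, and central differences (one-sided at the border) stand in for the bicubic lookup mentioned in the text.

```python
def gradient_image(img):
    """Turn a gray-scale image (list of rows) into a 2-channel gradient image."""
    h, w = len(img), len(img[0])
    out = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp neighbour indices at the border (one-sided differences there).
            xl, xr = max(x - 1, 0), min(x + 1, w - 1)
            yu, yd = max(y - 1, 0), min(y + 1, h - 1)
            gx = (img[y][xr] - img[y][xl]) / (xr - xl) if xr > xl else 0.0
            gy = (img[yd][x] - img[yu][x]) / (yd - yu) if yd > yu else 0.0
            out[y][x] = (gx, gy)
    return out

# A linear ramp and the same ramp with a constant brightness offset.
ramp = [[2.0 * x + 3.0 * y for x in range(4)] for y in range(4)]
brighter = [[value + 10.0 for value in row] for row in ramp]
print(gradient_image(ramp)[1][1], gradient_image(brighter)[1][1])
```

The two gradient images agree, which illustrates why gradient constancy is unaffected by an additive change in brightness.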
Fig. 1. Frame 10 of the Dimetrodon sequence represented in color, gradient (of intensities) and Laplacian of color channels. The color coding of the gradient vectors is also used for the following flow images, and the Laplacian of the color channels is represented in (rescaled) RGB.
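Table 1. Parameters for the flows in Figure 2

Columns: λ, θ, warps, inner iterations. Apart from the warps, the values are listed in the order in which they appear in the source.

             Values (order as extracted)   Warps
RGB TV-L1    10, 0.19, 0.27                75
GCA TV-L1    9, 0.22, 0.23                 75

[Figure 2 panels: RGB, GCA, Ground truth]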
Fig. 2. Flows of the Dimetrodon sequence calculated using RGB TV-L1 (AEE 0.156) and GCA TV-L1 (AEE 0.086) respectively. Last image is ground truth from the Middlebury optical flow database.
Fig. 3. Frame 10 from the Grove3 sequence and the ground truth flow
Fig. 4. Flows of the Grove3 sequence calculated using brightness constancy assumption TV-L1 (AEE 0.85), RGB TV-L1 (AEE 0.62) and RGB TV-L1 with a 3 × 3 median filtering step of the flow (AEE 0.57). The parameters are given in Table 3. Panels: BCA, RGB, RGB+MF.
Figure 1 contains an image from the Dimetrodon sequence in respectively color, gradient and Laplacian of RGB representation. The flows calculated between the RGB images and the gradient images of the Dimetrodon sequence can be seen in Figure 2. One notes that the GCA flow matches the true flow better than the RGB version, especially along the tail of the dimetrodon. This is most likely due to the high gradients around the tip of the tail as can be seen in Figure 1. For this sequence the average endpoint errors (AEE) are 0.156 and 0.086 for the color and gradient flows respectively. The parameters used in this example can be found in Table 1.
As another example, consider the sequence Grove3 from the Middlebury optical flow database (Figure 3). Here we are faced with a much more complicated flow pattern, and it is clear that the colors provide additional information for discriminating objects, and the flow calculated using the RGB information also results in a better flow than using just the brightness channel (Figure 4). There are however still problems with recovering the details of the small structures, which is probably due to the coarse-to-fine pyramid scheme (Xu et al. [2]).
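The average endpoint error (AEE) quoted above is the mean Euclidean distance between estimated and ground-truth flow vectors. The following is a minimal sketch of that computation, assuming flows are stored as (H, W, 2) arrays; the function name `average_endpoint_error` and the optional `valid` mask (for pixels with known ground truth) are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def average_endpoint_error(flow, flow_gt, valid=None):
    """Mean Euclidean distance between two flow fields of shape (H, W, 2).

    `valid` is an optional boolean (H, W) mask selecting the pixels to
    average over (a hypothetical convenience for unknown ground truth).
    """
    diff = np.asarray(flow, dtype=float) - np.asarray(flow_gt, dtype=float)
    endpoint_error = np.sqrt(np.sum(diff ** 2, axis=-1))  # per-pixel error
    if valid is not None:
        endpoint_error = endpoint_error[valid]
    return float(endpoint_error.mean())
```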
6 Results
Results for all training sequences from the Middlebury optical flow database are available in Table 2. These results are for a fixed set of parameters for each algorithm, which can be found in Table 3. The parameters have been chosen as the ones that produce the lowest average (normalized) AAE, and have been found by an extensive grid search, with the number of warps locked at 75. The computation time for the optical flow between a pair of 640 × 480 RGB images is around 0.5 seconds on an NVIDIA® Tesla™ C2050 GPU for the proposed parameters, which is a factor 6 faster than the TV-L1-improved algorithm (cf. the Middlebury database [10]). At a minor cost in accuracy (fewer warps) the RGB+MF algorithm can do realtime flow calculations for 640 × 480 RGB images.
From Table 2 it can be seen that the flows calculated between gradient images improve the results of the baseline (BCA) TV-L1 only in a limited number of cases; however, if changing lighting conditions were a bigger issue, this algorithm or the Laplacian of RGB (Δ-RGB) should be preferred. The results for the RGB algorithm are somewhat more impressive. In six of the eight cases we see more precise flows compared to baseline, and on the two remaining sequences the results are comparable. Finally the results of the TV-L1-improved algorithm from [9] and the RGB algorithm with a 3 × 3 median filter step are included for comparison. It should be noted that the four basic algorithms are implemented quite sparsely, i.e. without median filtering of the flow, structure–texture decomposition of the images etc., since this will correspond to minimizing an energy different from the original TV-L1 (Sun et al. [5]). In the light of this it seems promising that the simple RGB algorithm outperforms the TV-L1-improved on three of the eight sequences, since TV-L1-improved uses a number of these clever tricks. We see that the addition of a small median filter improves the results of the RGB algorithm considerably, and using more of the schemes from the TV-L1-improved algorithm will in all probability give further improvements for the algorithms presented here.
Table 2. Average endpoint error results for the Middlebury optical flow database training sequences for different constancy assumptions. Bold indicates the best result within each of the two blocks. The last two rows consist of our RGB algorithm with 3 × 3 median filtering and the TV-L1-improved results from [9] for comparison.
Algorithm       Dimetrodon  Grove2  Grove3  Hydrangea  RubberWhale  Urban2  Urban3  Venus
BCA             0.14        0.18    0.85    0.20       0.20         0.59    0.82    0.54
GCA             0.10        0.23    0.76    0.22       0.20         0.42    0.99    0.58
RGB             0.16        0.17    0.62    0.24       0.17         0.38    0.62    0.53
Δ-RGB           0.22        0.24    0.84    0.23       0.18         1.52    1.25    0.67
RGB+MF          0.16        0.15    0.57    0.25       0.17         0.36    0.50    0.49
“improved” [9]  0.19        0.15    0.67    0.15       0.09         0.32    0.63    0.26
Table 3. Global parameters for the flow results of Table 2
Algorithm   warps   inner iterations   λ      θ
BCA         75      9                  0.07   0.71
GCA         75      11                 0.05   0.59
RGB         75      12                 0.24   0.76
Δ-RGB       75      7                  0.40   0.45
RGB+MF      75      2                  0.45   0.70
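The 3 × 3 median filtering step that distinguishes RGB+MF from the plain RGB algorithm can be illustrated with a short sketch. It assumes the flow is an (H, W, 2) array and that each flow component is filtered independently with edge replication at the border; these choices, and the name `median_filter_flow`, are assumptions made for illustration, since the text does not specify where in the warping loop the filter is applied or how borders are handled.

```python
import numpy as np

def median_filter_flow(flow, size=3):
    """Apply a size x size median filter to each component of a flow field.

    `flow` has shape (H, W, 2); borders are handled by edge replication
    (an assumption, the text does not specify the boundary treatment).
    """
    r = size // 2
    H, W = flow.shape[:2]
    padded = np.pad(flow, ((r, r), (r, r), (0, 0)), mode="edge")
    # Collect the size*size shifted copies that make up every neighbourhood.
    windows = [padded[dy:dy + H, dx:dx + W]
               for dy in range(size) for dx in range(size)]
    return np.median(np.stack(windows, axis=0), axis=0)
```

A single outlier in an otherwise smooth flow field is removed by this filter, which is one plausible reason why such a step helps around small structures.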
7 Conclusion and Future Research
In this paper we have proposed a generalization of the TV-L1 optical flow algorithm by Zach et al. [1]. We have considered a number of flow algorithms based on different constancy assumptions, and it has been demonstrated that these algorithms are superior to the standard brightness constancy implementation on training data from the Middlebury optical flow database. It was even shown that some of these algorithms surpassed the more sophisticated TV-L1-improved algorithm by Wedel et al. [9] in a number of cases.
A point of interest is to consider if further refinements from the TV-L1-improved algorithm could also enhance the algorithms presented here. The median filter step that was included in the RGB+MF algorithm increased accuracy, as well as the robustness to wrong parameter choices, and a 5 × 5 median filter would probably increase accuracy even further [5]. We suspect that a structure–texture decomposition could increase the precision of the RGB TV-L1 algorithm slightly, but the gain for the GCA TV-L1 would probably be negligible (Sun et al. [5]). Another interesting direction would be to consider higher order data fidelity terms, e.g. GCA of RGB, GCA and RGB (2 + 3 dimensions) like in [6]. The implementation and combination of these terms is very easy in the current setup, as it can be done by simply pre-processing the input images, and inputting these new vector valued images to the same flow algorithm, similarly to what was done for the gradient constancy assumption and Laplacian of RGB. We are currently looking into these refinements, and working on implementing a competitive version of the algorithm to submit to the Middlebury optical flow database.
Another point of future research would be to automatically determine the parameters of the algorithms from the sequences. The results of Table 2 are, as already mentioned, computed from a single set of parameters, but changing the parameters can drastically improve the precision on some sequences, and degrade the quality of others (compare Figure 2 and Table 2). An excellent yet very simple approach for automatic determination of the smoothness weights is the “optimal prediction principle” proposed by Zimmer et al. [4]. An alternative approach is to define a rigorous stochastic model that allows for likelihood estimation of parameters such as the model by Markussen [18] based on stochastic partial differential equations. A step further could be to automatically determine which algorithm should be used for computing the flow, possibly only for parts of the images. One method of doing this has been successfully applied in the magnificent optical flow algorithm by Xu et al. [2], and another method was presented in [19]. In the setting of video coding, multiple motion estimates have been used in [20], where a criterion based on best interpolation quality was introduced.
Finally we are currently working on constructing specialized data fidelity terms for specific applications of optical flow, e.g. for inpainting ([21], [22]) or different schemes for variational super-resolution ([23], [24]), which should produce optical flows that are better suited for these specific tasks.
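The pre-processing route mentioned above, where new constancy assumptions are obtained by transforming the input images and feeding the resulting vector valued images to the same flow algorithm, can be sketched as follows. The helper names (`gradient_image`, `laplacian_channels`, `stack_features`), the finite-difference stencils and the use of the channel mean as brightness are illustrative assumptions; the reported results use derivatives estimated with bicubic lookup, which this sketch does not reproduce.

```python
import numpy as np

def gradient_image(gray):
    """Two-channel image (d/dx, d/dy) of a gray-scale image of shape (H, W)."""
    gy, gx = np.gradient(np.asarray(gray, dtype=float))
    return np.stack([gx, gy], axis=-1)

def laplacian_channels(img):
    """Five-point Laplacian of every channel of an (H, W, C) image."""
    p = np.pad(np.asarray(img, dtype=float), ((1, 1), (1, 1), (0, 0)), mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def stack_features(*features):
    """Concatenate feature images along the channel axis."""
    return np.concatenate(features, axis=-1)

# Example: a 2 + 3 dimensional "GCA and RGB" input for the flow algorithm.
rgb = np.random.default_rng(0).uniform(0.0, 1.0, (12, 10, 3))
gray = rgb.mean(axis=-1)  # assumed brightness: mean of the color channels
gca_and_rgb = stack_features(gradient_image(gray), rgb)  # shape (12, 10, 5)
```

The five-point stencil is symmetric, so flipping the image and then taking the Laplacian gives the same result as taking the Laplacian and then flipping, which is the discrete counterpart of the invariance property referred to for the Δ-RGB variant.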
References
1. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
2. Xu, L., Jia, J., Matsushita, Y.: A unified framework for large- and small-displacement optical flow estimation. Technical report, The Chinese University of Hong Kong (2010)
3. Sun, D., Sudderth, E., Black, M.J.: Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In: NIPS (2010)
4. Zimmer, H., Bruhn, A., Weickert, J.: Optical flow in harmony. International Journal of Computer Vision (2011)
5. Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: CVPR (2010)
6. Bruhn, A., Papenberg, N., Weickert, J.: Towards ultimate motion estimation: Combining highest accuracy with real-time performance. In: ICCV, vol. 4, pp. 749–755 (2005)
7. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
8. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic huber-l1 optical flow. In: BMVC (2009)
9. Wedel, A., Zach, C., Pock, T., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow. In: Dagstuhl Motion Workshop (2008)
10. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International Journal of Computer Vision 31, 1–31 (2011)
11. Chambolle, A.: An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision 20, 89–97 (2004)
12. Bresson, X., Chan, T.: Fast dual minimization of the vectorial total variation norm and application to color image processing. Inverse Problems and Imaging 2, 455–484 (2008)
13. Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. SIAM, Philadelphia (1999)
14. Golub, G., van Loan, C.: Matrix Computations. The Johns Hopkins University Press, Baltimore (1989)
15. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 120–145 (2011)
16. Dai, Y.H.: Fast algorithms for projection on an ellipsoid. SIAM Journal on Optimization 16, 986–1006 (2006)
17. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optical flow computations with theoretically justified warping. International Journal of Computer Vision 67, 141–158 (2006)
18. Markussen, B.: Large deformation diffeomorphisms with application to optic flow. Computer Vision and Image Understanding 106, 97–105 (2007); Special issue on Generative Model Based Vision
19. Mac Aodha, O., Brostow, G.J., Pollefeys, M.: Segmenting video into classes of algorithm-suitability. In: CVPR (2010)
20. Huang, X., Rakêt, L.L., Luong, H.V., Nielsen, M., Lauze, F., Forchhammer, S.: Multi-hypothesis transform domain wyner-ziv video coding including optical flow. In: MMSP (submitted, 2011)
21. Lauze, F., Nielsen, M.: On variational methods for motion compensated inpainting. Technical report, DIKU (2009)
22. Matsushita, Y., Ofek, E., Ge, W., Tang, X., Shum, H.: Full-frame video stabilization with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1150–1163 (2006)
23. Unger, M., Pock, T., Werlberger, M., Bischof, H.: A convex approach for variational super-resolution. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) Pattern Recognition. LNCS, vol. 6376, pp. 313–322. Springer, Heidelberg (2010)
24. Keller, S., Lauze, F., Nielsen, M.: Temporal super resolution using variational methods. In: High-Quality Visual Experience: Creation, Processing and Interactivity of High-Resolution and High-Dimensional Video Signals (2010)
Using the Higher Order Singular Value Decomposition for Video Denoising
Ajit Rajwade, Anand Rangarajan, and Arunava Banerjee
Department of CISE, University of Florida, Gainesville (USA)
{avr,anand,arunava}@cise.ufl.edu
Abstract. We present an algorithm for denoising of videos corrupted by additive i.i.d. zero mean Gaussian noise with a fixed and known standard deviation. Our algorithm is patch-based. Given a patch from a frame in the video, the algorithm collects similar patches from the same and adjacent frames. All the patches in this group are denoised using a transform-based approach that involves hard thresholding of insignificant coefficients. In this paper, the transform chosen is the higher order singular value decomposition of the group of similar patches. This procedure is repeated across the entire video in sliding window fashion. We present results on a well-known database of eight video sequences. The results demonstrate the ability of our method to preserve fine textures. Moreover we demonstrate that our algorithm, which is entirely driven by patch-similarity, can produce mean-squared error results which are comparable to those produced by state of the art techniques such as [1], as well as methods such as [2] that explicitly use motion estimation before denoising.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 344–354, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
Video denoising is an important application in the field of computer vision or signal processing. Videos captured by digital cameras are susceptible to corruption by noise from various sources: film grain noise, noise due to insufficient bit-rate during transmission, mechanical damage to the DVD, insufficient lighting during exposure time, and so on. The restoration of such videos can have broad applications in the film industry, the communication of multimedia, in remote sensing, medical imaging, and also for plain aesthetic purposes.
The literature on video denoising contains several methods that use shrinkage of coefficients measured on fixed bases such as various types of wavelets. Examples include the work in [3] and [4]. However, the image denoising community has witnessed rapid advances in methods that infer ‘optimal’ bases for denoising image patches. These bases can be learned offline from a representative set of image patches, though dictionaries are often learned in situ from the noisy image itself [5]. Several of these methods learn a single global dictionary to sparsely represent the patches in the image [5], some others cluster similar patches a priori and learn a single dictionary for each cluster separately [6], whereas a third category of methods learn pointwise varying bases, i.e. separate bases for a fixed size patch located at each pixel [7], [8]. The method we present in this paper belongs to this third category. Given a ‘reference’ patch from a frame in the video to be denoised, the data for learning the bases consist of the patches from adjacent video frames that are similar to that patch. Thus, our approach can be included in the paradigm of non-local denoising which has proven very successful in recent times, beginning with approaches such as NL-means (for image and video denoising) [9] and culminating in state of the art approaches such as block-matching-3D (‘BM3D’ [8]). The BM3D method treats the group of similar patches as a 3D stack and denoises all the patches jointly using a fixed 3D transform. This joint filtering has been demonstrated to be more effective in filtering several fine textures than individual filtering of each patch using a 2D transform [8].
The current literature on video denoising indicates two divergent schools of thought. There exist papers such as [9], [1] which do not perform any motion estimation prior to smoothing the video. Their main argument is that the well-known aperture problem in optical flow actually helps the denoising process. In fact, a video sequence contains many more patches that are similar to a given reference patch (as compared to a single image) and this added redundancy can enhance the video denoising results. On the other hand, there exist papers such as [2], [10] which are proponents of prior motion estimation and correction. In this paper, we choose not to perform motion estimation and perform denoising solely on the basis of non-local patch similarity. We present arguments and some empirically driven reasons for this later in Section 3 where we present experimental results.
Our work in this paper is based on ideas from our earlier work [11]. The contribution of this paper is the extension of our earlier idea to video denoising with entirely new experimental results.
This paper is organized as follows. Section 2 reviews the theoretical background of our technique. Several experimental results are presented in Section 3. Our results are compared to video-BM3D, which is considered the state of the art in video denoising. We conclude in Section 4.
2 Theory
Let $I_n$ be the corrupted version of a clean image $I_t$ under the action of noise from $N(0, \sigma)$. Consider a reference patch $P_n$ in $I_n$ and let its underlying clean patch in $I_t$ be $P_t$. Suppose we computed $K$ patches $\{Q_{ni}\}$ that were ‘similar’ to $P_n$ from $I_n$. The similarity metric is detailed in Section 2.2. The principal components of these $K$ patches are given by the eigenvectors $U_n$ of the correlation matrix $C_n = \frac{1}{K} \sum_{i=1}^{K} Q_{ni} Q_{ni}^T$. If such a set of eigenvectors is computed for each patch, we get a set of pointwise varying orthonormal bases. Patches from $I_n$ can be denoised by projecting them onto the orthonormal basis computed for each patch, followed by hard-thresholding or some other method to attenuate the smaller coefficient values (which are assumed to consist of mainly noise). This spatially varying PCA approach is presented in [7]. Now assume an ideal situation where all the $K$ patches $\{Q_{ni}\}$ happened to be noisy versions of $P_t$. In such a case, we see that as $K \to \infty$, $C_n \to C_t + \sigma^2 \mathrm{Id}$, where $C_t \stackrel{\mathrm{def}}{=} P_t P_t^T$ and where $\mathrm{Id}$ is the identity matrix. The eigenvectors of $C_n$ then have a very good chance of capturing all the structural information in $P_t$ and hard thresholding the insignificant coefficients of the bases derived in this manner will most likely yield a very good quality denoised output. However, such a situation is usually not possible in most natural images. In fact, the patches that qualify as ‘similar’ will usually not be exact copies of $P_t$ modulo noise. Hence, we adopt the following principle to further constrain our solution: if a group of patches are similar to one another in the noisy image, the denoising procedure should take this fact into account and not filter the individual patches from the group independently. Bearing this in mind, we group together similar patches and represent them in the form of a 3D stack as in Equation 1. The main idea is that the filtering is performed not only across the length and breadth of each individual (2D) patch, but also in the third dimension so as to allow for similarity between intensity values at corresponding pixels of the different patches. The idea of joint filtering of multiple patches has been implemented earlier in the BM3D algorithm [8], but with fixed bases such as DCT, Haar or Biorthogonal wavelets. However, in this paper, we use this idea to learn spatially adaptive bases, which we choose to be the higher order singular value decomposition (HOSVD) bases of the 3D stack of patches. An example in Figure 1 illustrates the superiority of our HOSVD approach over PCA, for denoising a texture image. The third and fourth rows of Figure 1 show the application of coefficient thresholding for smoothing of the 11 structurally similar patches of size 64 × 64 using the PCA and HOSVD transforms respectively, while the last two rows show the filtered patches after the averaging operations (employing the same criteria for patch similarity and coefficient thresholding). These figures reveal that HOSVD preserves the finer textures on the table-cloth surface much better than PCA which almost erases those textures.
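The spatially varying PCA idea and the limit $C_n \to C_t + \sigma^2 \mathrm{Id}$ discussed above can be checked numerically with a small sketch. It assumes patches are vectorized and stored as rows of a matrix `Q`; the names `correlation_matrix` and `pca_hard_threshold` and the threshold argument `tau` are hypothetical and only serve to illustrate the idea, they are not taken from [7].

```python
import numpy as np

def correlation_matrix(Q):
    """C = (1/K) * sum_i q_i q_i^T for K vectorized patches stored as rows."""
    return Q.T @ Q / Q.shape[0]

def pca_hard_threshold(Q, tau):
    """Project patches onto the eigenvectors of C and zero small coefficients."""
    _, U = np.linalg.eigh(correlation_matrix(Q))  # columns: orthonormal basis
    coeffs = Q @ U                                # one coefficient row per patch
    coeffs[np.abs(coeffs) < tau] = 0.0            # hard thresholding
    return coeffs @ U.T                           # back to the pixel domain

# Ideal situation: all K patches are noisy copies of one clean patch p_t.
rng = np.random.default_rng(0)
d, K, sigma = 16, 200000, 0.5
p_t = rng.standard_normal(d)
Q = p_t + sigma * rng.standard_normal((K, d))
C_n = correlation_matrix(Q)
C_limit = np.outer(p_t, p_t) + sigma ** 2 * np.eye(d)
max_deviation = np.max(np.abs(C_n - C_limit))     # shrinks as K grows
```

In this idealized setting the leading eigenvector of `C_n` aligns with the clean patch direction, which is the sense in which the learned basis captures the structure of $P_t$.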
2.1 Implementation of the HOSVD for Video Denoising
Given an $s \times s$ reference patch $P_n$ in the noisy image $I_n$, we create a stack of $K$ similar patches. Here similarity is defined as in Section 2.2. Let us denote the stack as $F \in \mathbb{R}^{s \times s \times K}$. The HOSVD of this stack is given as follows [12]:
$$F = S \times_1 V^{(1)} \times_2 V^{(2)} \times_3 V^{(3)} \qquad (1)$$
where $V^{(1)} \in \mathbb{R}^{s \times s}$, $V^{(2)} \in \mathbb{R}^{s \times s}$ and $V^{(3)} \in \mathbb{R}^{K \times K}$ are orthonormal matrices, and $S$ is a 3D coefficient array of size $s \times s \times K$. Here, the symbol $\times_j$ stands for the $j$th mode tensor product defined in [12]. The orthonormal matrices $V^{(1)}$, $V^{(2)}$ and $V^{(3)}$ are in practice computed from the SVD of the unfoldings $F_{(1)}$, $F_{(2)}$ and $F_{(3)}$ respectively [12]. The exact equations are of the form
$$F_{(j)} = V^{(j)} \cdot S_{(j)} \cdot \left( V^{(\mathrm{mod}(j+1,3))} \otimes V^{(\mathrm{mod}(j+2,3))} \right)^T \qquad (2)$$
where $1 \le j \le 3$. This representation in terms of tensor unfoldings is equivalent to the original formulation of HOSVD from Equation 1. However, the complexity of the SVD computations for $K \times K$ matrices is $O(K^3)$. For computational speed, we impose the constraint that $K \le 8$. The patches from $F$ are then projected onto the HOSVD transform. The parameter for thresholding the transform coefficients is picked to be $\sigma \sqrt{2 \log (p^2 K)}$ (with $p = s$ the patch side length, so that $p^2 K$ is the number of coefficients in the stack), which is the near-optimal threshold for hard thresholding of the coefficients of a noise-corrupted data vector projected onto any orthonormal basis assuming i.i.d. additive $N(0, \sigma)$ noise [13]. The complete stack $F$ is then reconstructed after inverting the transform, thereby filtering all the individual patches. The procedure is repeated over all pixels in sliding window fashion with simple averaging of the multiple hypotheses that appear at any pixel. Note that we filter all the individual patches in the ensemble and not just the reference patch. Moreover, we perform denoising using hard thresholding of coefficients as opposed to low-rank matrix approximations because it is difficult to relate the optimal matrix rank to the noise statistics. Some papers such as [14] penalize the matrix nuclear norm for denoising, but this requires iterated optimization for each patch stack with some heuristically chosen parameters.
Fig. 1. Eleven patches of size 64 × 64 from a textured portion of the original Barbara image (row 1), its noisy version under N(0, 20) (row 2), from the PCA output before averaging (row 3), from the HOSVD output before averaging (row 4), from the PCA output after averaging (row 5), from the HOSVD output after averaging (row 6). Zoom into pdf file for a better view.
The aforementioned framework for image denoising is extended to video denoising in the following manner. The search for patches that are similar to a reference patch in a given frame at time instant $t_0$ is performed over all patches in the time frame $[t_0 - \Delta, t_0 + \Delta]$ where $\Delta$ is a temporal search radius. The existence of multiple images of the same scene varying smoothly with respect to one another yields greater redundancy which can be exploited for the purpose of better denoising (modulo limitations of computing time).
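The per-stack filtering step described in this section (HOSVD of the stack, hard thresholding of the core coefficients, inversion of the transform) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the names `unfold` and `hosvd_denoise` are hypothetical, the threshold is taken as $\sigma\sqrt{2 \log n}$ with $n$ the number of entries in the stack, and the sketch assumes $K \le s^2$ so that all three basis matrices are square.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` unfolding: rows are indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_denoise(F, sigma):
    """Hard-threshold the HOSVD coefficients of an s x s x K patch stack.

    Assumes K <= s*s so that every basis matrix below is square.
    """
    # Left singular vectors of the three unfoldings give V1, V2 and V3.
    V1, V2, V3 = (np.linalg.svd(unfold(F, m), full_matrices=False)[0]
                  for m in range(3))
    # Core array: S = F x1 V1^T x2 V2^T x3 V3^T.
    S = np.einsum("abc,ai,bj,ck->ijk", F, V1, V2, V3)
    # Threshold sigma * sqrt(2 log n), n = number of coefficients in the stack.
    tau = sigma * np.sqrt(2.0 * np.log(F.size))
    S = np.where(np.abs(S) >= tau, S, 0.0)
    # Invert the transform: F_hat = S x1 V1 x2 V2 x3 V3.
    return np.einsum("ijk,ai,bj,ck->abc", S, V1, V2, V3)
```

In a full denoiser this function would be called once per reference patch, and the filtered patches would be accumulated at their original positions and averaged, as described in the text.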
2.2 Choice of Patch Similarity Measure
Given a reference patch $P_n$ of size $s \times s$ in $I_n$, we can compute the patches similar to it by using a distance threshold $\tau_d$ and selecting all patches $P_{ni}$ such that $\|P_n - P_{ni}\|^2 < \tau_d$. Assuming a fixed, known noise model $N(0, \sigma)$, if $P_n$ and $P_{ni}$ were different noisy versions of the same underlying patch $P_t$ (i.e. $P_n \sim N(P_t, \sigma)$ and $P_{ni} \sim N(P_t, \sigma)$), the following random variable would have a $\chi^2(s^2)$ distribution:
$$x = \sum_{k=1}^{s^2} \frac{(P_{n,k} - P_{ni,k})^2}{2\sigma^2}. \qquad (3)$$
The cumulative distribution function of a $\chi^2(z)$ random variable is given by
$$F(x; z) = \gamma\left(\frac{x}{2}, \frac{z}{2}\right) \qquad (4)$$
where $\gamma(x, a)$ stands for the incomplete gamma function defined as
$$\gamma(x, a) = \frac{1}{\Gamma(a)} \int_{t=0}^{x} e^{-t} t^{a-1} \, dt \qquad (5)$$
with $\Gamma(a) = \int_{0}^{\infty} e^{-t} t^{a-1} \, dt$ being the Gamma function. We observe that if $z \ge 3$, for any $x \ge 3z$, we have $F(x; z) \ge 0.99$. Therefore for a patch-size of $s \times s$ and under the given $\sigma$, we choose $\tau_d = 6\sigma^2 s^2$. Hence we regard two patches to be similar if and only if their squared difference (summed over all pixels) is less than or equal to $\tau_d$. Note however, that in our specific implementation, we always restrict the number of similar patches to a maximum of $K = 8$ for the sake of efficiency.
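The selection rule above, combined with the spatio-temporal search window and the cap of $K = 8$ patches, can be sketched as a brute-force search. The function name `similar_patch_stack`, the (T, H, W) video layout, the default radii and the choice of keeping the $K$ closest candidates when more than $K$ pass the threshold are assumptions made for illustration; the text only states that at most $K$ patches are retained.

```python
import numpy as np

def similar_patch_stack(video, t0, y0, x0, s, sigma, radius=8, delta=4, K=8):
    """Collect up to K patches within tau_d = 6 sigma^2 s^2 of the reference.

    `video` has shape (T, H, W); (t0, y0, x0) is the top left corner of the
    s x s reference patch. Returns the s x s x K' stack and the positions.
    """
    video = np.asarray(video, dtype=float)
    T, H, W = video.shape
    ref = video[t0, y0:y0 + s, x0:x0 + s]
    tau_d = 6.0 * sigma ** 2 * s ** 2
    hits = []
    for t in range(max(0, t0 - delta), min(T, t0 + delta + 1)):
        for y in range(max(0, y0 - radius), min(H - s, y0 + radius) + 1):
            for x in range(max(0, x0 - radius), min(W - s, x0 + radius) + 1):
                d = float(np.sum((video[t, y:y + s, x:x + s] - ref) ** 2))
                if d <= tau_d:
                    hits.append((d, t, y, x))
    hits = sorted(hits)[:K]  # assumption: keep the K closest candidates
    stack = np.stack([video[t, y:y + s, x:x + s] for _, t, y, x in hits], axis=-1)
    return stack, [(t, y, x) for _, t, y, x in hits]
```

The reference patch always matches itself with distance zero, so the returned stack is never empty.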
2.3 HOSVD and Universal 3D Transforms
We explain an important theoretical difference between the HOSVD and universal 3D transforms such as a 3D-DCT, 3D-FFT or a product of 2D-DCT and 1D Haar wavelet (as in [8]), in the context of denoising of patch stacks. The latter group of transforms treats the 3D stack as an actual 3D signal - in other words, it assumes a continuity between pixels at corresponding locations in the different patches of the 3D stack. However, as the stack consists of a group of patches similar to the reference patch, all from different locations in the video, this is not a valid assumption. Moreover, a change in the ordering of the patches could potentially affect the denoising results. In the case of HOSVD, permuting the location of the patches in the 3D stack will leave the transform coefficients unchanged (up to a permutation) and hence not affect the denoising results. In summary, while 3D transforms will enforce (functional) smoothness in the third dimension of the stack, the HOSVD uses statistical criteria for coupled filtering of all the patches from the stack.
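The ordering argument can be illustrated numerically. The sketch below compares, for a stack and a reordered copy of it, (i) the singular values of the mode-3 unfolding, which equal the Frobenius norms of the HOSVD core slices along the third mode, and (ii) the per-frequency energies of a fixed 1D DCT applied along the third dimension. The helper names, the hand-built orthonormal DCT-II matrix and the synthetic stack are assumptions for illustration only.

```python
import numpy as np

def mode3_singular_values(F):
    """Singular values of the mode-3 unfolding of an s x s x K stack."""
    return np.linalg.svd(F.reshape(-1, F.shape[2]), compute_uv=False)

def dct_matrix(K):
    """Orthonormal DCT-II matrix of size K x K (rows are basis vectors)."""
    n = np.arange(K)
    D = np.sqrt(2.0 / K) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * K))
    D[0] /= np.sqrt(2.0)
    return D

def dct_slice_energies(F):
    """Energy of each frequency slice after a 1D DCT along the patch axis."""
    coeffs = F @ dct_matrix(F.shape[2]).T
    return np.sqrt((coeffs ** 2).sum(axis=(0, 1)))

# Synthetic stack: patches that differ by a brightness offset plus noise.
rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 8)) + np.arange(8.0)[None, None, :] ** 2
perm = np.array([3, 0, 6, 1, 7, 2, 5, 4])
F_perm = F[:, :, perm]  # same patches, different order in the stack
```

For this example the mode-3 singular values of `F` and `F_perm` coincide, whereas the DCT energies are redistributed across frequencies, so a fixed threshold removes different content depending on the patch order.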
3 Experimental Results and Comparisons
We now present experimental results on video denoising. Our dataset consists of the eight well-known gray-scale video sequences available on http://telin.ugent.be/~vzlokoli/PHD/Grey_scale/. Some sequences such as ‘Miss America’ contain highly homogeneous image frames, whereas others such as ‘flower’ or ‘tennis’ are quite textured. We tested our denoising algorithm on each sequence for noise from $N(0, \sigma)$ where $\sigma \in \{20, 25, 30, 35\}$. The quality metric used for evaluation was the PSNR, which is computed as $10 \log_{10} \frac{255^2}{\mathrm{MSE}}$ where MSE is the ‘mean-squared error’.
The results produced by our HOSVD method were compared to those produced by the video version of BM3D [1]. The latter is a two stage algorithm. The first step performs collaborative hard thresholding of the wavelet transform coefficients of a (3D) stack of similar patches. We refer to this step as ‘VBM3D-1’. The second step performs collaborative Wiener filtering where the transform domain coefficients are attenuated using ratios computed from patches from the output of ‘VBM3D-1’. We refer to this second step as ‘VBM3D-2’. The HOSVD algorithm was run with 8 × 8 patches, a spatial search window of radius 8 and a temporal search radius of 4, for finding similar patches. VBM3D-1 and VBM3D-2 were run using the package provided by the authors of [1] using their default parameter settings (which also included patch sizes of 8 × 8). In all experiments, noise was added to the original sequence using the Matlab command noisy = orig + randn(size(orig))*sigma, followed by clipping of the values in the noisy signal to the [0,255] range.
The comparative results between HOSVD, VBM3D-1 and VBM3D-2 are shown in Tables 1 and 3. From these tables, we see that HOSVD produces PSNR values that are superior to VBM3D-1 on most sequences except ‘Miss America’ which contains highly homogeneous frames. The PSNR values produced by HOSVD are usually close behind those of VBM3D-2. On some complex sequences such as ‘bus’, HOSVD produces results slightly superior to those by VBM3D-2. In general, the difference between the PSNR values is small for the more difficult and textured sequences. We believe that superior results could be produced by HOSVD on homogeneous image frames if larger patch sizes were used. A point to note is that the PSNR values for VBM3D we have reported are slightly less than those reported by the authors of [1] on their webpage http://www.cs.tut.fi/~foi/GCF-BM3D/. We observed that this difference is due to the clipping of the noisy videos to the [0,255] range.
An interesting point to note is the performance on the ‘tennis’ sequence which contains large textured regions (on the wall behind the table tennis table) in many frames. Although VBM3D-2 produces a superior PSNR in comparison to HOSVD at all four noise levels, we observed that HOSVD did a much better job than both VBM3D-1 and VBM3D-2 in preserving the texture on the wall. We believe that the fixed bases used by VBM3D-1 (DCT/Haar) have a tendency to wipe out some subtle textures. These textures get further attenuated during the Wiener filtering step in VBM3D-2. These results can be observed in Figure 2 and Table 2. Essentially, this example yet again highlights the advantages of learning the bases in situ from noisy data as opposed to using fixed, universal bases.
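The noise model and the quality metric used in these experiments can be written compactly in NumPy. The sketch mirrors the Matlab command quoted above (additive Gaussian noise followed by clipping to [0, 255]) and the PSNR definition $10 \log_{10}(255^2/\mathrm{MSE})$; the function names `add_clipped_noise` and `psnr` are illustrative assumptions.

```python
import numpy as np

def add_clipped_noise(orig, sigma, rng=None):
    """NumPy counterpart of noisy = orig + randn(size(orig))*sigma, then clip."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(orig, dtype=float) + sigma * rng.standard_normal(np.shape(orig))
    return np.clip(noisy, 0.0, 255.0)

def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB for signals with peak value 255."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Because of the clipping, the effective noise in very dark or very bright regions is no longer zero mean Gaussian, which is consistent with the remark above that clipping changes the reported PSNR values slightly.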
Table 1. PSNR results for video sequences for σ ∈ {20, 25}
                σ = 20                        σ = 25
Sequence        HOSVD    VBM3D-1  VBM3D-2     HOSVD    VBM3D-1  VBM3D-2
salesman        32.046   31.73    33.49       30.396   30.298   31.99
bus             30.237   28.69    29.568      28.934   27.524   28.4
flower          28.142   27.436   28.17       26.88    26.038   26.873
miss america    34.32    35.809   37.72       32.737   34.795   36.838
foreman         32.4     32.113   33.37       30.886   30.828   32.148
tennis          30.31    30.04    30.94       29.034   28.733   29.51
coastguard      31.09    30.63    31.75       29.634   29.348   30.54
bicycle         33.09    32.8     34.15       31.453   31.4     32.77
Table 2. PSNR results for ‘tennis’ sequence for σ = 20
Sequence        HOSVD    VBM3D-1  VBM3D-2     [2]
tennis          30.31    30.04    30.94       30.21
Table 3. PSNR results for video sequences for σ ∈ {30, 35}
                σ = 30                        σ = 35
Sequence        HOSVD    VBM3D-1  VBM3D-2     HOSVD    VBM3D-1  VBM3D-2
salesman        29.04    29.124   30.66       27.87    28.129   29.486
bus             27.81    26.53    27.45       26.817   25.793   26.66
flower          25.69    24.787   25.727      24.588   23.9     24.81
miss america    31.35    33.91    35.847      30.087   33.05    34.736
foreman         29.645   29.78    31.097      28.594   28.9     30.236
tennis          28       27.8     28.55       27.117   27.076   27.841
coastguard      28.4     28.28    29.497      27.346   27.384   28.628
bicycle         30       30.185   31.476      28.864   29.107   30.28
A similar phenomenon was noted by [2] on the same video sequence (see Figure 4 of [2]). The method in [2] produces a PSNR of 30.21 for σ = 20, whereas ours produces 30.31. It should be noted that the former makes explicit use of a robust motion estimator as well as a much more sophisticated and global search algorithm for finding similar patches (unlike our method which looks for similar patches in a restricted search radius around the top left corner of the reference patch). This highlights the good denoising properties of the HOSVD bases as well as the benefits of the additional smoothing afforded by patch-based algorithms, unlike [2] which uses a pixel-based algorithm such as (a slightly modified version of) NL-Means [9] for the smoothing. We believe that the exact benefit of motion estimation for denoising videos affected by i.i.d. noise remains a debatable matter (even more so, given the error-prone nature of optical flow computations especially in noisy data, the parameter selection involved and the computational cost), but we leave a rigorous testing of this issue to future work. From the arguments in [2], it does seem that motion estimation prior to denoising will enhance the overall performance in case of structured noise which affects real-world color videos.
We show a few more results of the HOSVD method - on frame 75 from the bus and flower sequences at σ=25 in Figures 3 and 4 respectively. We have uploaded sample video results (in the form of avi files) on four sequences: coastguard, flower, bus and tennis, each at noise levels 20, 25, 30, 35, on the following webpages: https://sites.google.com/site/emmcvpr2011submission26/videos and https://sites.google.com/site/emmcvprsubmission26part2/emmcvprsubmission26part2.
Fig. 2. Frame 75 from (a) original ‘tennis’ sequence, (b) noisy sequence (σ = 20), (c) sequence denoised by VBM3D-1, (d) sequence denoised by VBM3D-2, (e) sequence denoised by HOSVD. Zoom into the pdf for a better view.
Fig. 3. Frame 75 from (a) original ‘bus’ sequence, (b) noisy sequence (σ = 25), (c) sequence denoised by VBM3D-1, (d) sequence denoised by VBM3D-2, (e) sequence denoised by HOSVD. Zoom into the pdf for a better view.
Fig. 4. Frame 75 from (a) original ‘flower’ sequence, (b) noisy sequence (σ = 25), (c) sequence denoised by VBM3D-1, (d) sequence denoised by VBM3D-2, (e) sequence denoised by HOSVD. Zoom into the pdf for a better view.
4 Conclusion
We have presented a very simple algorithm for video denoising and compared it to state of the art methods such as BM3D. Our algorithm yields good results on video denoising despite the fact that it does not employ motion estimation as in [2], or Wiener filtering as in [1]. The algorithm can be easily parallelized for improving efficiency. It sometimes preserves fine textural details better than VBM3D-2 - the current state of the art algorithm in video denoising. The performance of the algorithm could perhaps be improved using: (1) a better and more
global search method for finding similar patches, and (2) a robust and efficient method for motion estimation prior to denoising.
References
1. Dabov, K., Foi, A., Egiazarian, K.: Video denoising by sparse 3d transform-domain collaborative filtering. In: European Signal Processing Conference, EUSIPCO (2007)
2. Liu, C., Freeman, W.: A high-quality video denoising algorithm based on reliable motion estimation. In: European Conference on Computer Vision (2010)
3. Balster, E., Zheng, Y., Ewing, R.: Combined spatial and temporal domain wavelet shrinkage algorithm for video denoising. IEEE Transactions on Circuits and Systems for Video Technology 16, 220–230 (2006)
4. Selesnick, I., Li, K.: Video denoising using 2d and 3d dual-tree complex wavelet transforms. In: SPIE Proceedings of Wavelet Applications in Signal and Image Processing (2003)
5. Elad, M., Aharon, M.: Image denoising via learned dictionaries and sparse representation. In: IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 17–22 (2006)
6. Chatterjee, P., Milanfar, P.: Clustering-based denoising with locally learned dictionaries. IEEE Trans. Image Process. 18, 1438–1451 (2009)
7. Muresan, D., Parks, T.: Adaptive principal components and image denoising. In: IEEE Int. Conf. Image Process, pp. 101–104 (2003)
8. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Trans. Image Process. 16, 2080–2095 (2007)
9. Buades, A., Coll, B., Morel, J.M.: Nonlocal image and movie denoising. Int. J. Comput. Vis. 76, 123–139 (2008)
10. Buades, A., Lou, Y., Morel, J., Tang, Z.: A note on multi-image denoising. In: Local and Non-Local Approximation in Image Processing, pp. 1–15 (2009)
11. Rajwade, A., Rangarajan, A., Banerjee, A.: Image denoising using the higher order singular value decomposition. Technical Report REP-2011-515, Department of CISE, University of Florida, Gainesville, Florida (2011)
12. de Lathauwer, L.: Signal Processing Based on Multilinear Algebra. PhD thesis, Katholieke Universiteit Leuven, Belgium (1997)
13. Donoho, D., Johnstone, I.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1993)
14. Ji, H., Liu, C., Shen, Z., Xu, Y.: Robust video denoising using low rank matrix completion. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2010)
Optimization of Robust Loss Functions for Weakly-Labeled Image Taxonomies: An ImageNet Case Study
Julian J. McAuley¹, Arnau Ramisa², and Tibério S. Caetano¹
¹ Statistical Machine Learning Group, NICTA, and the Australian National University
{julian.mcauley,tiberio.caetano}@nicta.com.au
² Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Spain
[email protected]
Abstract. The recently proposed ImageNet dataset consists of several million images, each annotated with a single object category. However, these annotations may be imperfect, in the sense that many images contain multiple objects belonging to the label vocabulary. In other words, we have a multi-label problem but the annotations include only a single label (and not necessarily the most prominent). Such a setting motivates the use of a robust evaluation measure, which allows for a limited number of labels to be predicted and, as long as one of the predicted labels is correct, the overall prediction should be considered correct. This is indeed the type of evaluation measure used to assess algorithm performance in a recent competition on ImageNet data. Optimizing such types of performance measures presents several hurdles even with existing structured output learning methods. Indeed, many of the current state-of-the-art methods optimize the prediction of only a single output label, ignoring this ‘structure’ altogether. In this paper, we show how to directly optimize continuous surrogates of such performance measures using structured output learning techniques with latent variables. We use the output of existing binary classifiers as input features in a new learning stage which optimizes the structured loss corresponding to the robust performance measure. We present empirical evidence that this allows us to ‘boost’ the performance of existing binary classifiers which are the state-of-the-art for the task of object classification in ImageNet.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 355–368, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
The recently proposed ImageNet project consists of building a growing dataset using an image taxonomy based on the WordNet hierarchy (Deng et al., 2009). Each node in this taxonomy includes a large set of images (in the hundreds or thousands). From an object recognition point of view, this dataset is interesting because it naturally suggests the possibility of leveraging the image taxonomy in order to improve recognition beyond what can be achieved independently for each image. Indeed this question has been the subject of much interest recently, culminating in a competition in this context using ImageNet data (Berg et al., 2010; Lin et al., 2011; Sánchez and Perronnin, 2011). Although in ImageNet each image may have several objects from the label vocabulary, the annotation only includes a single label per image, and this label is not
necessarily the most prominent. This imperfect annotation suggests that a meaningful performance measure on this dataset should not penalize predictions that contain legitimate objects that are missing from the annotation. One way to deal with this issue is to enforce a robust performance measure based on the following idea: an algorithm is allowed to predict more than one label per image (up to a maximum of K labels), and as long as one of those labels agrees with the ground-truth label, no penalty is incurred. This is precisely the type of performance measure used to evaluate algorithm performance in the aforementioned competition (Berg et al., 2010). In this paper, we present an approach for directly optimizing a continuous surrogate of this robust performance measure. In other words, we try to optimize the very measure that is used to assess recognition quality in ImageNet. We show empirically that by using binary classifiers as a starting point, which are state-of-the-art for this task, we can boost their performance by means of optimizing the structured loss.
1.1 Literature Review
The success of visual object classification obtained in recent years is pushing computer vision research towards more difficult goals in terms of the number of object classes and the size of the training sets used. For example, Perronnin et al. (2010) used increasingly large training sets of Flickr images together with online learning algorithms to improve the performance of linear SVM classifiers trained to recognize the 20 Pascal Visual Object Challenge 2007 objects, and Torralba et al. (2008) defined a gigantic dataset of 75,062 classes (using all the nouns in WordNet) populated with 80 million tiny images of only 32 × 32 pixels. The WordNet nouns were used as queries in seven search engines, but without any manual or automatic validation of the downloaded images. Despite their low resolution, the images were shown to still be useful for classification. Similarly, Deng et al.
(2009) created ImageNet: a vast dataset with thousands of classes and millions of images, also constructed by taking nouns from the WordNet taxonomy. These were translated into different languages, and used as query terms in multiple image search engines to collect a large number of pictures. However, as opposed to the case of the previously mentioned 80M Tiny Images dataset, in this case the images were kept at full resolution and manually verified using Amazon Mechanical Turk. Currently, the full ImageNet dataset consists of over 17,000 classes and 12 million images. Figure 1 shows a few example images from various classes. Deng et al. (2010) performed classification experiments using a substantial subset of ImageNet, more than ten thousand classes and nine million images. Their experiments highlighted the importance of algorithm design when dealing with such quantities of data, and showed that methods believed to be better in small scale experiments turned out to under-perform when brought to larger scales. They also proposed a cost function for classification that takes the hierarchy into account. In contrast with Deng et al. (2010), most of the works using ImageNet for large scale classification made no use of its hierarchical structure. As mentioned before, in order to encourage large scale image classification using ImageNet, a competition using a subset of 1,000 classes and 1.2 million images, called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC; Berg et al., 2010), was conducted together with the Pascal Visual Object Challenge 2010 competition.
Fig. 1. Example images from ImageNet. Classes range from very general to very specific, and since there is only one label per image, it is not rare to find images with unannotated instances of other classes from the dataset.
Notably, the top-ranked participants of the competition used a traditional one-versus-all approach and completely disregarded the WordNet taxonomy. Lin et al. (2011) obtained the best score in the ILSVRC’10 competition using a conventional one-vs-all approach. However, in order to make their method efficient enough to deal with large amounts of training data, they used Local Coordinate Coding and Super-Vector Coding to reduce the size of the image descriptor vectors, and averaged stochastic gradient descent (ASGD) to efficiently train a thousand linear SVM classifiers. Sánchez and Perronnin (2011) got the second best score in the ILSVRC’10 competition (and a posteriori reported better results than those of Lin et al. (2011)). In their approach, they used high-dimensional Fisher Kernels for image representation with lossy compression techniques: first, dimensionality reduction using Hash Kernels (Shi et al., 2009) was attempted and secondly, since the results degraded rapidly with smaller descriptor dimensionality, coding with Product Quantizers (Jégou et al., 2010) was used to retain the advantages of a high-dimensional representation without paying an expensive price in terms of memory and I/O usage. For learning the standard binary one-vs-all linear classifiers, they also used Stochastic Gradient Descent. The difficulty of using the hierarchical information for improving classification may be explained by the findings of Russakovsky and Fei-Fei (2010). In their work ImageNet is used to show that the relationships endowed by the WordNet taxonomy do not necessarily translate into visual similarity, and that in fact new relations based only on visual appearance information can be established between classes, often far away in the hierarchy.
2 Problem Statement
We are given the dataset $S = \{(x^1, y^1), \ldots, (x^N, y^N)\}$, where $x^n \in \mathcal{X}$ denotes a feature vector representing an image with label $y^n$. Our goal is to learn a classifier $\bar{Y}(x; \theta)$ that for an image $x$ outputs a set of $K$ distinct object categories. The vector $\theta$ ‘parametrizes’ the classifier $\bar{Y}$; we wish to learn $\theta$ so that the labels produced by $\bar{Y}(x^n; \theta)$ are ‘similar to’ the training labels $y^n$ under some loss function $\Delta(\bar{Y}(x^n; \theta), y^n)$. Our specific choice of classifier and loss function shall be given in Section 2.1. We assume an estimator based on the principle of regularized risk minimization, i.e. we aim to find $\theta^*$ such that
$$\theta^* = \operatorname*{argmin}_{\theta} \Bigg[ \underbrace{\frac{1}{N} \sum_{n=1}^{N} \Delta(\bar{Y}(x^n; \theta), y^n)}_{\text{empirical risk}} + \underbrace{\frac{\lambda}{2} \|\theta\|^2}_{\text{regularizer}} \Bigg]. \tag{1}$$
Our notation is summarized in Table 1. Note specifically that each image is annotated with a single label, while the output space consists of a set of $K$ labels (we use $y$ to denote a single label, $Y$ to denote a set of $K$ labels, and $\mathcal{Y}$ to denote the space of sets of $K$ labels). This setting presents several issues when trying to express (eq. 1) in the framework of structured prediction (Tsochantaridis et al., 2005). Apparently for this reason, many of the state-of-the-art methods in the ImageNet Large Scale Visual Recognition Challenge (Berg et al., 2010, or just ‘the ImageNet Challenge’ from now on) consisted of binary classifiers, such as multiclass SVMs, that merely optimized the score of a single prediction (Lin et al., 2011; Sánchez and Perronnin, 2011). Motivated by the surprisingly good performance of these binary classifiers, in the following sections we shall propose a learning scheme that will ‘boost’ their performance by re-weighting them so as to take into account the structured nature of the loss function from the ImageNet Challenge.
2.1 The Loss Function
Images in the ImageNet dataset are annotated with a single label $y^n$. Each image may contain multiple objects that are not labeled, and the labeled object need not necessarily be the most salient, so the method should not be penalized for choosing ‘incorrect’ labels in the event that those objects actually appear in the scene. Note that this is not an issue in some similar datasets, such as the Caltech datasets (Griffin et al., 2007), where the images have been selected to avoid such ambiguity in the labeling, or where all instances of objects covered in the dataset are annotated in every image, as in the Pascal Visual Object Challenge (Everingham et al., 2010). To address this issue, a loss is given over a set of output labels $Y$, that only penalizes the method if none of those labels is similar to the annotated object.
For a training image with label $y^n$, the loss incurred by choosing the set of labels $Y$ is given by
$$\Delta(Y, y^n) = \min_{y \in Y} d(y, y^n). \tag{2}$$
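As a small illustration of (eq. 2), the sketch below evaluates the set loss for an arbitrary distance function. The function names and the 0/1 distance used as the default are assumptions made for this example only; any other choice of $d$ can be passed in.

```python
from typing import Callable, Iterable


def zero_one_distance(y: int, y_true: int) -> int:
    """Illustrative stand-in for d: 0 if the classes agree, 1 otherwise."""
    return 0 if y == y_true else 1


def robust_loss(
    labels: Iterable[int],
    y_true: int,
    d: Callable[[int, int], float] = zero_one_distance,
) -> float:
    """Delta(Y, y^n) = min over y in Y of d(y, y^n), as in (eq. 2)."""
    return min(d(y, y_true) for y in labels)


# The prediction is penalized only if none of its labels matches y_true.
print(robust_loss([3, 7, 12], y_true=7))  # 0
print(robust_loss([3, 8, 12], y_true=7))  # 1
```

Because the loss takes a minimum over the predicted set, adding labels to a prediction can never increase it, which is why the size of the set has to be fixed to $K$.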
Table 1. Notation

Notation                      Description
$x$                           the feature vector for an image (or just ‘an image’ for simplicity)
$x^n$                         the feature vector for the $n$th training image
$\mathcal{X}$                 the features space, i.e., $x^n \in \mathcal{X}$
$F$                           the feature dimensionality, i.e., $F = |x^n|$
$N$                           the total number of training images
$y$                           an image label, consisting of a single object class
$y^n$                         the training label for the image $x^n$
$\mathcal{C}$                 the set of classes, i.e., $y^n \in \mathcal{C}$
$C$                           the total number of classes, i.e., $C = |\mathcal{C}|$
$\bar{Y}(x; \theta)$          the set of output labels produced by the classifier
$\hat{Y}(x; \theta)$          the output labels resulting in the most violated constraints during column-generation
$\bar{Y}^n$                   shorthand for $\bar{Y}(x^n; \theta)$
$\hat{Y}^n$                   shorthand for $\hat{Y}(x^n; \theta)$
$K$                           the number of output labels produced by the classifier, i.e., $K = |\bar{Y}^n| = |\hat{Y}^n|$
$\mathcal{Y}$                 the space of all possible sets of $K$ labels
$\theta$                      a vector parameterizing our classifier
$\theta^y_{\text{binary}}$    a binary classifier for the class $y$
$\lambda$                     a constant that balances the importance of the empirical risk versus the regularizer
$\phi(x, y)$                  the joint parametrization of the image $x$ with the label $y$
$\Phi(x, Y)$                  the joint parametrization of the image $x$ with a set of labels $Y$
$\Delta(Y, y^n)$              the error induced by the set of labels $Y$ when the correct label is $y^n$
$d(y, y^n)$                   a distance measure between the two classes $y$ and $y^n$ in our image taxonomy
$Z^n$                         latent annotation of the image $x^n$, consisting of $K - 1$ object classes distinct from $y^n$
$Y^n$                         the ‘complete annotation’ of the image $x^n$, i.e., $Z^n \cup \{y^n\}$

In principle, $d(y, y^n)$ could be any difference measure between the classes $y$ and $y^n$. If $d(y, y^n) = 1 - \delta(y = y^n)$ (i.e., 0 if $y = y^n$, 1 otherwise), this recovers the ImageNet
Challenge’s ‘flat’ error measure. If $d(y, y^n)$ is the shortest-path distance from $y$ to the nearest common ancestor of $y$ and $y^n$ in a certain taxonomic tree, this recovers the ‘hierarchical’ error measure (which we shall use in our experiments). For images with multiple labels we could use the loss $\Delta(Y, Y^n) = \frac{1}{|Y^n|} \sum_{y^n \in Y^n} \Delta(Y, y^n)$, though when using the ImageNet data we always have a single label.
2.2 ‘Boosting’ of Binary Classifiers
Many of the state-of-the-art methods for image classification consist of learning a series of binary ‘one vs. all’ classifiers that distinguish a single class from all others. That is, for each class $y \in \mathcal{C}$, one learns a separate parameter vector $\theta^y_{\text{binary}}$, and then performs classification by choosing the class with the highest score, according to
$$\bar{y}_{\text{binary}}(x) = \operatorname*{argmax}_{y \in \mathcal{C}} \langle x, \theta^y_{\text{binary}} \rangle. \tag{3}$$
In order to output a set of $K$ labels, such methods simply return the labels with the highest scores,
$$\bar{Y}_{\text{binary}}(x) = \operatorname*{argmax}_{Y \in \mathcal{Y}} \sum_{y \in Y} \langle x, \theta^y_{\text{binary}} \rangle, \tag{4}$$
where $\mathcal{Y}$ is the space of sets of $K$ distinct labels. The above equations describe many of the competitive methods from the ImageNet Challenge, such as Lin et al. (2011) or Sánchez and Perronnin (2011). One obvious improvement is simply to learn a new set of classifiers $\{\theta^y\}_{y \in \mathcal{C}}$ that optimize the structured error measure of (eq. 1). However, given the large number of classes in the ImageNet Challenge ($|\mathcal{C}| = 1000$), and the high dimensionality of standard image features, this would mean simultaneously optimizing several million parameters, which is not practical using existing structured learning techniques. Instead, we would like to leverage the already good classification performance of existing binary classifiers, simply by re-weighting them to account for the structured nature of (eq. 2). Hence we will learn a single parameter vector $\theta$ that re-weights the features of every class. Our proposed learning framework is designed to extend any classifier of the form given in (eq. 4). Given a set of binary classifiers $\{\theta^y_{\text{binary}}\}_{y \in \mathcal{C}}$, we propose a new classifier of the form
$$\bar{Y}(x; \theta) = \operatorname*{argmax}_{Y \in \mathcal{Y}} \sum_{y \in Y} \langle x \odot \theta^y_{\text{binary}}, \theta \rangle, \tag{5}$$
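where $x \odot \theta^y_{\text{binary}}$ is simply the Hadamard product of $x$ and $\theta^y_{\text{binary}}$. Note that when $\theta = \mathbf{1}$ this recovers precisely the original model of (eq. 4). To use the standard notation of structured prediction, we define the joint feature vector $\Phi(x, Y)$ as
$$\Phi(x, Y) = \sum_{y \in Y} \phi(x, y) = \sum_{y \in Y} x \odot \theta^y_{\text{binary}}, \tag{6}$$
so that (eq. 5) can be expressed as
$$\bar{Y}(x; \theta) = \operatorname*{argmax}_{Y \in \mathcal{Y}} \langle \Phi(x, Y), \theta \rangle. \tag{7}$$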
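The re-weighted classifier above can be sketched in a few lines. The code is an illustration in plain Python under assumed names (`class_scores`, `predict_top_k`) and toy inputs, not the authors' implementation. It relies on the fact that the set score is a sum of per-class scores, so the best set of $K$ distinct labels consists of the $K$ highest-scoring classes.

```python
from typing import List, Sequence


def class_scores(
    x: Sequence[float],
    theta_binary: Sequence[Sequence[float]],
    theta: Sequence[float],
) -> List[float]:
    """Score of class y: <x * theta_binary[y], theta> with an elementwise product."""
    return [
        sum(xf * wf * tf for xf, wf, tf in zip(x, w_y, theta))
        for w_y in theta_binary
    ]


def predict_top_k(x, theta_binary, theta, k: int) -> List[int]:
    """Return the k highest-scoring classes, i.e. the argmax over sets of size k."""
    scores = class_scores(x, theta_binary, theta)
    order = sorted(range(len(scores)), key=lambda y: scores[y], reverse=True)
    return order[:k]


# Toy example with F = 2 features and C = 3 classes (values are made up).
x = [1.0, 2.0]
theta_binary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With theta equal to the all-ones vector, the original one-vs-all ranking is recovered.
print(predict_top_k(x, theta_binary, [1.0, 1.0], k=2))   # [2, 1]
# A different theta re-weights the features and can change the predicted set.
print(predict_top_k(x, theta_binary, [1.0, -1.0], k=2))  # [0, 2]
```

Only the $F$ entries of $\theta$ are learned in this setting; the $C \times F$ entries of the binary classifiers stay fixed.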
We will use the shorthand $\bar{Y}^n \triangleq \bar{Y}(x^n; \theta)$ to avoid excessive notation. In the following sections we shall discuss how structured prediction methods can be used to optimize models of this form.
2.3 The Latent Setting
The joint parametrization of (eq. 6) is problematic, since the energy of the ‘true’ label $y^n$, $\langle \phi(x^n, y^n), \theta \rangle$, is not readily comparable with the energy of a set of predicted outputs $Y$, $\langle \Phi(x^n, Y), \theta \rangle$. To address this, we propose the introduction of a latent variable, $Z = \{Z^1 \ldots Z^N\}$, which for each image $x^n$ encodes the set of objects that appear in $x^n$ that were not annotated. The full set of labels for the image $x^n$ is now $Y^n = Z^n \cup \{y^n\}$. If our method outputs $K$ objects, then we fix $|Z^n| = K - 1$, so that $|Y^n| = K$. It is now possible to meaningfully compute the difference between $\Phi(x^n, Y)$ and $\Phi(x^n, Y^n)$, where the latter is defined as
$$\Phi(x^n, Y^n) = \phi(x^n, y^n) + \sum_{y \in Z^n} \phi(x^n, y). \tag{8}$$
The importance of this step shall become clear in Section 3.1, (eq. 13). Note that we still define Δ(Y, yn ) in terms of the single training label yn , as in (eq. 2). Following the programme of Yu and Joachims (2009), learning proceeds by alternately optimizing the latent variables and the parameter vector. Optimizing the parameter vector θi given the latent variables Z i is addressed in Section 3.1; optimizing the latent variables Z i given the parameter vector θi−1 is addressed in Section 3.2.
3 The Optimization Problem
The optimization problem of (eq. 1) is non-convex. More critically, the loss is a piecewise constant function of $\theta$.¹ A similar problem occurs when one aims to optimize a 0/1 loss in binary classification; in that case, a typical workaround consists of minimizing a surrogate convex loss function that upper-bounds the 0/1 loss, such as the hinge loss, which gives rise to support vector machines. We will now see that we can construct a suitable convex relaxation for the problem defined in (eq. 1).
¹ There are countably many values for the loss but uncountably many values for the parameters, so there are large equivalence classes of parameters that correspond to precisely the same loss.
3.1 Convex Relaxation
Here we use an analogous approach to that of SVMs, notably popularized in Tsochantaridis et al. (2005), which optimizes a convex upper bound on the structured loss of (eq. 1). The resulting optimization problem is
$$[\theta^*, \xi^*] = \operatorname*{argmin}_{\theta, \xi} \left[ \frac{1}{N} \sum_{n=1}^{N} \xi_n + \lambda \|\theta\|^2 \right] \tag{9a}$$
$$\text{s.t.} \quad \langle \Phi(x^n, Y^n), \theta \rangle - \langle \Phi(x^n, Y), \theta \rangle \geq \Delta(Y, y^n) - \xi_n \quad \forall n, Y \in \mathcal{Y}. \tag{9b}$$
It is easy to see that $\xi_n^*$ upper-bounds $\Delta(\bar{Y}^n, y^n)$ (and therefore the objective in (eq. 9) upper-bounds that of (eq. 1) for the optimal solution). First note that since the constraints (eq. 9b) hold for all $Y$, they also hold for $\bar{Y}^n$. Second, the left hand side of the inequality for $Y = \bar{Y}^n$ must be non-positive since $\bar{Y}(x; \theta) = \operatorname{argmax}_Y \langle \Phi(x, Y), \theta \rangle$. It then follows that $\xi_n^* \geq \Delta(\bar{Y}^n, y^n)$. This implies that a solution of the relaxation is an upper bound on the solution of the original problem, and therefore the relaxation is well-motivated. The constraints (eq. 9b) basically enforce a loss-sensitive margin: $\theta$ is learned so that mispredictions $Y$ that incur some loss end up with a score $\langle \Phi(x^n, Y), \theta \rangle$ that is smaller than the score $\langle \Phi(x^n, Y^n), \theta \rangle$ of the correct prediction $Y^n$ by a margin equal to that loss (minus the slack $\xi_n$). The formulation is a generalization of support vector machines for the multi-class case.
There are two options for solving the convex relaxation of (eq. 9). One is to explicitly include all $N \times |\mathcal{Y}|$ constraints and then solve the resulting quadratic program using one of several existing methods. This may not be feasible if $N \times |\mathcal{Y}|$ is too large. In this case, we can use a constraint generation strategy. This consists of iteratively solving the
quadratic program by adding at each iteration the constraint corresponding to the most violated $Y$ for the current model $\theta$ and training instance $n$. This is done by maximizing the violation gap $\xi_n$, i.e., solving at each iteration the problem
$$\hat{Y}(x^n; \theta) = \operatorname*{argmax}_{Y \in \mathcal{Y}} \left\{ \Delta(Y, y^n) + \langle \Phi(x^n, Y), \theta \rangle \right\}, \tag{10}$$
(as before we define $\hat{Y}^n \triangleq \hat{Y}(x^n; \theta)$ for brevity). The solution to this optimization problem (known as ‘column generation’) is somewhat involved, though it turns out to be tractable as we shall see in Section 3.3. Several publicly available tools implement precisely this constraint generation strategy. A popular example is SvmStruct (Tsochantaridis et al., 2005), though we use BMRM (‘Bundle Methods for Risk Minimization’; Teo et al., 2007) in light of its faster convergence properties. Algorithm 1 describes pseudocode for solving the optimization problem (eq. 9) with BMRM. In order to use BMRM, one needs to compute, at the optimal solution $\xi_n^*$ for the most violated constraint $\hat{Y}^n$, both the value of the objective function (eq. 9) and its gradient. At the optimal solution for $\xi_n^*$ with fixed $\theta$ we have
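$$\langle \Phi(x^n, Y^n), \theta \rangle - \langle \Phi(x^n, \hat{Y}^n), \theta \rangle = \Delta(\hat{Y}^n, y^n) - \xi_n^*. \tag{11}$$
By expressing (eq. 11) as a function of $\xi_n^*$ and substituting into the objective function we obtain the following lower bound on the objective of (eq. 9a):
$$o_i = \frac{1}{N} \sum_n \left[ \Delta(\hat{Y}^n, y^n) - \langle \Phi(x^n, Y^n), \theta \rangle + \langle \Phi(x^n, \hat{Y}^n), \theta \rangle \right] + \lambda \|\theta\|^2, \tag{12}$$
whose gradient with respect to $\theta$ is
$$g_i = \lambda \theta + \frac{1}{N} \sum_n \left( \Phi(x^n, \hat{Y}^n) - \Phi(x^n, Y^n) \right). \tag{13}$$

Algorithm 1. Taxonomy Learning
 1: Input: training set $\{(x^n, Y^n)\}_{n=1}^{N}$
 2: Output: $\theta$
 3: $\theta \leftarrow 0$ {in the setting of Algorithm 2, $\theta$ can be ‘hot-started’ with its previous value}
 4: repeat
 5:   for $n \in \{1 \ldots N\}$ do
 6:     $\hat{Y}^n \leftarrow \operatorname{argmax}_{Y \in \mathcal{Y}} \left\{ \Delta(Y, y^n) + \langle \Phi(x^n, Y), \theta \rangle \right\}$
 7:   end for
 8:   Compute gradient $g_i$ (equation (eq. 13))
 9:   Compute objective $o_i$ (equation (eq. 12))
10:   $\theta \leftarrow \operatorname{argmin}_{\theta} \frac{\lambda}{2} \|\theta\|^2 + \max(0, \max_{j \leq i} \langle g_j, \theta \rangle + o_j)$
11: until converged (see Teo et al. (2007))
12: return $\theta$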
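Lines 8 and 9 of Algorithm 1 can be sketched as follows. This is a minimal illustration in plain Python, not the BMRM interface: the function names and the data layout (one dictionary of per-class vectors per image) are assumptions, and the regularization term is deliberately left out, on the assumption that the solver adds it separately as in line 10.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vector = List[float]


def joint_feature(phi_n: Dict[int, Vector], labels: Set[int]) -> Vector:
    """Phi(x^n, Y): sum of the per-class vectors phi(x^n, y) over y in Y."""
    dim = len(next(iter(phi_n.values())))
    total = [0.0] * dim
    for y in labels:
        for f, value in enumerate(phi_n[y]):
            total[f] += value
    return total


def risk_and_gradient(
    phi: Sequence[Dict[int, Vector]],  # phi[n][y] stands for phi(x^n, y)
    annotations: Sequence[Set[int]],   # complete annotations Y^n
    violators: Sequence[Set[int]],     # most violated sets \hat{Y}^n
    losses: Sequence[float],           # Delta(\hat{Y}^n, y^n)
    theta: Vector,
) -> Tuple[float, Vector]:
    """Data-dependent parts of (eq. 12) and (eq. 13), without the regularizer."""
    n_images = len(phi)
    risk = 0.0
    grad = [0.0] * len(theta)
    for phi_n, y_full, y_hat, loss in zip(phi, annotations, violators, losses):
        # Phi(x^n, \hat{Y}^n) - Phi(x^n, Y^n)
        diff = [a - b for a, b in zip(joint_feature(phi_n, y_hat),
                                      joint_feature(phi_n, y_full))]
        risk += loss + sum(d * t for d, t in zip(diff, theta))
        grad = [g + d for g, d in zip(grad, diff)]
    return risk / n_images, [g / n_images for g in grad]


# One toy image with three classes and two features (values are made up).
phi = [{0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}]
print(risk_and_gradient(phi, [{0, 1}], [{1, 2}], [1.0], [0.5, 2.0]))
# (3.0, [0.0, 1.0])
```

For fixed sets $\hat{Y}^n$ the risk term is linear in $\theta$, which is what makes each pair $(g_i, o_i)$ usable as a cutting plane in line 10.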
3.2 Learning the Latent Variables
To learn the optimal value of $\theta$, we alternate between optimizing the parameter vector $\theta^i$ given the latent variables $Z^i$, and optimizing the latent variables $Z^i$ given the parameter vector $\theta^{i-1}$. Given a fixed parameter vector $\theta$, optimizing the latent variables $Z^n$ can be done greedily, and is in fact equivalent to performing inference, with the restriction that the true label $y^n$ cannot be part of the latent variable $Z^n$ (see Algorithm 2, Line 5). See Yu and Joachims (2009) for further discussion of this type of approach.

Algorithm 2. Taxonomy Learning with Latent Variables
1: Input: training set $\{(x^n, y^n)\}_{n=1}^{N}$
2: Output: $\theta$
3: $\theta^0 \leftarrow \mathbf{1}$
4: for $i = 1 \ldots I$ do
5:   $Z_i^n \leftarrow \operatorname{argmax}_{Y \in \mathcal{Y}} \langle \Phi(x^n, Y), \theta^{i-1} \rangle \setminus \{y^n\}$ {choose only $K - 1$ distinct labels}
6:   $\theta^i \leftarrow \text{Algorithm1}\left( \{x^n, Z_i^n \cup \{y^n\}\}_{n=1}^{N} \right)$
7: end for
8: return $\theta^I$
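The latent-variable step (Line 5 of Algorithm 2) amounts to a restricted top-$(K-1)$ selection. The sketch below is an illustration under assumed names; `scores[y]` stands for the energy $\langle \phi(x^n, y), \theta^{i-1} \rangle$ of class $y$, which is taken as already computed.

```python
from typing import List, Sequence


def update_latent(scores: Sequence[float], y_true: int, k: int) -> List[int]:
    """Choose the k - 1 highest-scoring classes other than the annotated label."""
    candidates = [y for y in range(len(scores)) if y != y_true]
    candidates.sort(key=lambda y: scores[y], reverse=True)
    return candidates[: k - 1]


# Four classes, annotated label 1, K = 3 (scores are made up).
z = update_latent([0.2, 0.9, 0.5, 0.7], y_true=1, k=3)
print(z)        # [3, 2]
print([1] + z)  # complete annotation: [1, 3, 2]
```

The complete annotation passed to Algorithm 1 is then the annotated label together with the returned classes.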
3.3 Column Generation
Given the loss function of (eq. 2), obtaining the most violated constraints (Algorithm 1, Line 6) takes the form
$$\hat{Y}^n = \operatorname*{argmax}_{Y \in \mathcal{Y}} \left\{ \min_{y \in Y} d(y, y^n) + \sum_{y \in Y} \langle \phi(x^n, y), \theta \rangle \right\}, \tag{14}$$
which appears to require enumerating through all $Y \in \mathcal{Y}$, which if there are $C = |\mathcal{C}|$ classes amounts to $\binom{C}{K}$ possibilities. However, if we know that $\operatorname{argmin}_{y \in \hat{Y}^n} d(y, y^n) = c$, then (eq. 14) becomes
$$\hat{Y}^n = \operatorname*{argmax}_{Y \in \mathcal{Y}'} \left\{ d(c, y^n) + \sum_{y \in Y} \langle \phi(x^n, y), \theta \rangle \right\}, \tag{15}$$
where $\mathcal{Y}'$ is just $\mathcal{Y}$ restricted to those $y$ for which $d(y, y^n) \geq d(c, y^n)$. This can be solved greedily by sorting $\langle \phi(x^n, y), \theta \rangle$ for each class $y \in \mathcal{C}$ such that $d(y, y^n) \geq d(c, y^n)$ and simply choosing the top $K$ classes. Since we don’t know the optimal value of $c$ in advance, we must consider all $c \in \mathcal{C}$, which means solving (eq. 15) a total of $C$ times. Solving (eq. 15) greedily takes $O(C \log C)$ (sorting $C$ values), so that solving (eq. 14) takes $O(C^2 \log C)$. Although this method works for any loss of the form given in (eq. 2), for the specific distance function $d(y, y^n)$ used for the ImageNet Challenge, further improvements are possible. As mentioned, for the ImageNet Challenge’s hierarchical error measure, $d(y, y^n)$ is the shortest-path distance from $y$ to the nearest common ancestor of $y$ and $y^n$
in a taxonomic tree. One would expect the depth of such a tree to grow logarithmically in the number of classes, and indeed we find that we always have $d(y, y^n) \in \{0 \ldots 18\}$. If the number of discrete possibilities for $\Delta(Y, y^n)$ is small, instead of enumerating each possible value of $c = \operatorname{argmin}_{y \in \hat{Y}^n} d(y, y^n)$, we can directly enumerate each value of $\delta = \min_{y \in \hat{Y}^n} d(y, y^n)$. If there are $|L|$ distinct values of the loss, (eq. 14) can now be solved in $O(|L| C \log C)$. In ImageNet we have $|L| = 19$ whereas $C = 1000$, so this is clearly a significant improvement. Several further improvements can be made (e.g. we do not need to sort all $C$ values in order to compute the top $K$, and we do not need to re-sort all of them for each value of the loss, etc.). We omit these details for brevity, though our implementation shall be made available at the time of publication.²
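The enumeration over loss values described above can be sketched as follows. This is an unoptimized illustration in plain Python under assumed names, not the authors' released implementation: `scores[y]` stands for $\langle \phi(x^n, y), \theta \rangle$, `dist[y]` for $d(y, y^n)$, and $K \leq C$ is assumed. For each distinct threshold $\delta$ it keeps the classes with distance at least $\delta$, takes the $K$ best scores among them, and evaluates the objective of (eq. 14) on that candidate.

```python
from typing import List, Sequence, Tuple


def most_violated(
    scores: Sequence[float], dist: Sequence[float], k: int
) -> Tuple[List[int], float]:
    """Maximize min_{y in Y} dist[y] + sum_{y in Y} scores[y] over sets Y of size k."""
    best_set: List[int] = []
    best_val = float("-inf")
    for delta in sorted(set(dist)):
        # Classes whose distance to the annotated label is at least delta.
        allowed = [y for y in range(len(scores)) if dist[y] >= delta]
        if len(allowed) < k:
            break  # larger thresholds leave even fewer classes
        top = sorted(allowed, key=lambda y: scores[y], reverse=True)[:k]
        value = min(dist[y] for y in top) + sum(scores[y] for y in top)
        if value > best_val:
            best_set, best_val = top, value
    return best_set, best_val


# Class 0 is the annotated label (distance 0) and has the highest score, but
# leaving it out raises the loss term by more than the score that is given up.
print(most_violated(scores=[10, 9, 8, 1], dist=[0, 3, 3, 1], k=2))  # ([1, 2], 20)
```

The candidate found at the threshold equal to the optimal set's minimum distance scores at least as high as that optimal set, so the best candidate over all thresholds attains the maximum of (eq. 14). The sketch re-sorts for every threshold; the text notes that this can be avoided.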
4 Experiments
4.1 Binary Classifiers
As previously described, our approach needs, for each class, one binary classifier able to provide some reasonable score as a starting point for the proposed method. Since the objective of this paper is not beating the state-of-the-art, but rather demonstrating the advantage of our structured learning approach to improve the overall classification, we used a standard, simple image classification setup. As mentioned, should the one-vs-all classifiers of Lin et al. (2011) or Sánchez and Perronnin (2011) become available in the future, they should be immediately compatible with the proposed method. First, images have to be transformed into descriptor vectors suitable for classification using machine learning techniques. For this we have chosen the very popular Bag of Features model (Csurka et al., 2004): dense SIFT features are extracted from each image $x^n$ and quantized using a visual vocabulary of $F$ visual words. Next, the visual words are pooled in a histogram that represents the image. This representation is widely used in state-of-the-art image classification methods, and despite its simplicity achieves very good results. Regarding the basic classifiers, a reasonable first choice would be to use a Linear SVM for every class. However, since our objective is to predict the correct class of a new image, we would need to compare the raw scores attained by the classifier, which would not be theoretically satisfying. Although it is possible to obtain probabilities from SVM scores using a sigmoid trained with the Platt algorithm, we opted for training Logistic Regressors instead, which directly give probabilities as output and do not depend on a separate validation set. In order to deal with the computational and memory requirements derived from the large number of training images, we used Stochastic Gradient Descent (SGD) from Bottou and Bousquet (2008) to train the classifiers.
² See http://users.cecs.anu.edu.au/~julianm/
SGD is a good choice for our problem, since it has been shown to achieve a performance similar to that of batch training methods in a fraction of the time (Perronnin et al., 2010). Furthermore, we validated its performance against that of LibLinear in a small-scale experiment using part of the ImageNet hierarchy with satisfactory results. One limitation of online learning methods is that the optimization process iterations are limited by the amount of training data available. In order to add more training data, we cycled over all the training data for 10 epochs. With this approach, the $\theta^y_{\text{binary}}$ parameters for each class used in the structured learning method proposed in this work were generated.
4.2 Structured Classifiers
For every image $x^n$ and every class $y$ we must compute $\langle \phi(x^n, y), \theta \rangle$. Earlier we defined $\phi(x, y) = x \odot \theta^y_{\text{binary}}$. If we have $C$ classes and $F$ features, then this computation can be made efficient by first computing the $C \times F$ matrix $A$ whose $y$th row is given by $\theta^y_{\text{binary}} \odot \theta$.

[Figure 2. Panel (a), ‘Reweighting of 1024 dimensional features’: feature weight against feature index. Panel (b), ‘Reduction in training error with latent variables’: training error against iteration of Algorithm 2. Panel (c), test error:]

                  1nn     2      3      4      5
Before learning   11.35   9.29   8.08   7.25   6.64
After learning    10.88   8.85   7.71   6.93   6.36

Fig. 2. Results for training with 1024 dimensional features. (a) feature weights; (b) reduction in training error during each iteration of Algorithm 2; (c) error for different numbers of nearest-neighbors $K$ (the method was trained to optimize the error for $K = 5$). Results are reported for the best value of $\lambda$ on the validation set (here $\lambda = 10^{-4}$).

[Figure 3. Panel (a), ‘Reweighting of 4096 dimensional features’: feature weight against feature index. Panel (b), ‘Reduction in training error with latent variables’: training error against iteration of Algorithm 2. Panel (c), test error:]

                  1nn     2      3      4      5
Before learning   9.27    7.29   6.23   5.53   5.03
After learning    9.02    7.08   6.05   5.38   4.91

Fig. 3. Results for training with 4096 dimensional features. (a) feature weights; (b) reduction in training error during each iteration of Algorithm 2; (c) error for different numbers of nearest-neighbors $K$ (the method was trained to optimize the error for $K = 5$). Results are reported for the best value of $\lambda$ on the validation set (here $\lambda = 10^{-6}$).
Similarly, if we have $N$ images then the set of image features can be thought of as an $N \times F$ matrix $X$. Now the energy of a particular labeling $y$ of $x^n$ under $\theta$ is given by the matrix product
$$\langle \phi(x^n, y), \theta \rangle = \left( X \times A^T \right)_{n,y}. \tag{16}$$
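The factorization in (eq. 16) can be checked on a toy example. The sketch below uses plain Python lists for clarity only, and the function names are assumptions; a practical implementation would hand the product to a BLAS routine, as the text goes on to describe.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def reweighted_classifiers(
    theta_binary: Sequence[Sequence[float]], theta: Sequence[float]
) -> Matrix:
    """C x F matrix A whose y-th row is theta_binary[y] * theta (elementwise)."""
    return [[w * t for w, t in zip(row, theta)] for row in theta_binary]


def energies(X: Sequence[Sequence[float]], A: Matrix) -> Matrix:
    """N x C matrix X A^T; entry (n, y) is the energy <phi(x^n, y), theta>."""
    return [[sum(xf * af for xf, af in zip(x, a)) for a in A] for x in X]


# N = 2 images, F = 2 features, C = 3 classes (values are made up).
X = [[1.0, 2.0], [3.0, 4.0]]
theta_binary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [2.0, -1.0]

A = reweighted_classifiers(theta_binary, theta)
print(energies(X, A))  # [[2.0, -2.0, 0.0], [6.0, -4.0, 2.0]]
```

Since $A$ depends on $\theta$ but $X$ does not, only the small matrix $A$ has to be rebuilt when $\theta$ changes.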
This observation is critical if we wish to handle a large number of images and high-dimensional feature vectors. In our experiments, we performed this computation using Nvidia’s high-performance BLAS library CUBLAS. Although GPU performance is often limited by a memory bottleneck, this particular application is ideally suited as the matrix $X$ is far larger than either the matrix $A$, or the resulting product, and $X$ needs to be copied to the GPU only once, after which it is repeatedly reused. After this matrix product is computed, we must sort every row, which can be naïvely parallelized. In light of these observations, our method is no longer prohibitively constrained by its running time (running ten iterations of Algorithm 2 takes around one day for a single regularization parameter $\lambda$). Instead we are constrained by the size of the GPU’s onboard memory, meaning that we only used 25% of the training data (half for training, half for validation). In principle the method could be further parallelized across multiple machines, using a parallel implementation of the BMRM library. The results of our algorithm using features of dimension $F = 1024$ and $F = 4096$ are shown in Figures 2 and 3, respectively. Here we ran Algorithm 2 for ten iterations, ‘hot-starting’ $\theta^i$ using the optimal result from the previous iteration. The reduction in training error is also shown during subsequent iterations of Algorithm 2, showing that minimal benefits are gained after ten iterations. We used regularization parameters $\lambda \in \{10^{-1}, 10^{-2} \ldots 10^{-8}\}$, and as usual we report the test error for the value of $\lambda$ that resulted in the best performance on the validation set. We show the test error for different numbers of nearest-neighbors $K$, though the method was trained to minimize the error for $K = 5$. In both Figures 2 and 3, we find that the optimal $\theta$ is non-uniform, indicating that there are interesting relationships that can be learned between the features when a structured setting is used.
As hoped, a reduction in test error is obtained over already good classifiers, though the improvement is indeed less significant for the better-performing high-dimensional classifiers. In the future we hope to apply our method to state-of-the-art features and classifiers like those of Lin et al. (2011) or Sánchez and Perronnin (2011). It remains to be seen whether the setting we have described could yield additional benefits over their already excellent classifiers.
5 Conclusion
Large-scale, collaboratively labeled image datasets embedded in a taxonomy naturally invite the use of both structured and robust losses, to account for the inconsistencies in the labeling process and the hierarchical structure of the taxonomy. However, on datasets such as ImageNet, the state-of-the-art methods still use one-vs-all classifiers, which do not account for the structured nature of such losses, nor for the imperfect nature of the annotation. We have outlined the computational challenges involved in
using structured methods, which sheds some light on why they have not been used before in this task. However, by exploiting a number of computational tricks, and by using recent advances on structured learning with latent variables, we have been able to formulate learning in this task as the optimization of a loss that is both structured and robust to weak labeling. Better yet, our method leverages existing one-vs-all classifiers, essentially by re-weighting, or ‘boosting’ them to directly account for the structured loss. In practice this leads to improvements in the hierarchical loss of already good one-vs-all classifiers.
Acknowledgements. Part of this work was carried out when both AR and TC were at INRIA Grenoble, Rhône-Alpes. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was partially funded by the QUAERO project supported by OSEO, French State agency for innovation and by MICINN under project MIPRCV Consolider Ingenio CSD2007-00018.
References
1. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: IEEE Conference on Computer Vision and Pattern Recognition (2009) [355], [356]
2. Berg, A., Deng, J., Fei-Fei, L.: Imagenet large scale visual recognition challenge 2010 (2010), http://www.image-net.org/challenges/LSVRC/2010/index [355], [356], [358]
3. Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K.: Large-scale image classification: fast feature extraction and SVM training. In: IEEE Conference on Computer Vision and Pattern Recognition (to appear, 2011) [355], [357], [358], [360], [364], [366]
4. Sánchez, J., Perronnin, F.: High-Dimensional Signature Compression for Large-Scale Image Classification. In: IEEE Conference on Computer Vision and Pattern Recognition (to appear, 2011) [355], [357], [358], [360], [364], [366]
5. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: European Conference on Computer Vision, pp. 143–156 (2010) [356], [364]
6. Torralba, A., Fergus, R., Freeman, W.T.: 80 Million Tiny Images: a Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11), 1958–1970 (2008) [356]
7. Deng, J., Berg, A.C., Li, K., Fei-Fei, L.: What does classifying more than 10,000 image categories tell us? In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 71–84. Springer, Heidelberg (2010) [356]
8. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., Vishwanathan, V.: Hash Kernels. In: Artificial Intelligence and Statistics (2009) [357]
9. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(11), 117–128 (2010) [357]
10. Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: ECCV Workshop on Parts and Attributes (2010) [357]
11. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6, 1453–1484 (2005) [358], [361], [362]
12. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007) [358]
13. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88, 303–338 (2010) [358]
14. Yu, C.-N.J., Joachims, T.: Learning structural SVMs with latent variables. In: International Conference on Machine Learning (2009) [361], [363]
15. Teo, C.H., Smola, A., Vishwanathan, S.V.N., Le, Q.V.: A scalable modular convex solver for regularized risk minimization. In: Knowledge Discovery and Data Mining (2007) [362]
16. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: ECCV Workshop on statistical learning in computer vision (2004) [364]
17. Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Neural Information Processing Systems (2008) [364]
Multiple-Instance Learning with Structured Bag Models
Jonathan Warrell and Philip H.S. Torr
Oxford Brookes University, Oxford, UK
{jwarrell,ptorr}@brookes.ac.uk
http://cms.brookes.ac.uk/research/visiongroup
Abstract. Traditional approaches to Multiple-Instance Learning (MIL) operate under the assumption that the instances of a bag are generated independently, and therefore typically learn an instance-level classifier which does not take into account possible dependencies between instances. This assumption is particularly inappropriate in visual data, where spatial dependencies are the norm. We introduce here techniques for incorporating MIL constraints into Conditional Random Field models, thus providing a set of tools for constructing structured bag models, in which spatial (or other) dependencies are represented. Further, we show how Deterministic Annealing, which has proved a successful method for training non-structured MIL models, can also form the basis of training models with structured bags. Results are given on various segmentation tasks.
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 369–384, 2011. © Springer-Verlag Berlin Heidelberg 2011
1 Introduction
Multiple instance learning (MIL) has received intense interest recently in the field of computer vision, with algorithms being developed for a number of different learning scenarios and specific applications. Methods have been proposed based on the frameworks of boosting [3,21], SVMs [6,25], and random forests [12,16], and special algorithms/techniques proposed for online scenarios [16], scenarios where the proportion of positives in a bag is known [6], and multi-label generalizations [24,25]. In several of these methods, the technique of deterministic annealing (DA) has proved an effective method for learning [6,12,16], which provides a general framework for learning in weakly supervised settings [13,14]. Particular vision applications where MIL has had notable success include tracking [3], detection [21], and image classification [12,25]. Other vision problems have also been targeted for MIL solutions. The framework which MIL provides, with labels at two distinct levels, is in principle ideally suited for tasks such as semantic scene segmentation [17], where it is expensive to collect detailed segmentations of whole scenes at the pixel level, but much easier to acquire annotations at the image level of the objects an image contains. Attempts to apply MIL techniques directly to this problem, though, have met with limited success. Vezhnevets et al. [18] for instance found it necessary to include a multitask learning subproblem to achieve reasonable results with MIL, as the former appeared to be needed as a regularizer over the unlabeled data. This may be explained by appealing to the underlying assumptions: while MIL assumes all instances in a bag to be independent samples, labeling tasks in vision such as scene segmentation typically exhibit strong spatial dependencies between
neighboring labels. A recent approach which considers modeling dependencies between the instances of a bag was proposed in [26]. This approach, though, builds a graph-kernel classifier directly at the bag level, and so is not directly applicable if we are interested primarily in learning instance classifiers (as in semantic segmentation). A further method uses Hidden Conditional Random Fields and probabilistic EM learning for MIL in a combined scene segmentation/classification setting [24]. In this form, the MIL approach is highly related to other probabilistic models which integrate labels at multiple levels for scene segmentation (e.g. [7]). The disadvantage of such models from an MIL perspective is that they propose structures in which probabilistic sampling is necessary, the efficiency of which determines the feasibility of the learning.
Recent advances in optimization techniques for Conditional Random Field models (CRFs) have made it possible to perform inference in models which encode a variety of structural constraints or assumptions concerning the form of the output. A technique which has been applied widely in this regard is that of dual decomposition (DD) [4], which arises from the integer programming formulation of maximum a posteriori (MAP) inference in CRF models. This approach has been applied for instance to general minimization of non-submodular energies [9], joint optimization of foreground/background segmentation and color models in the context of interactive segmentation [19], penalizing disagreement between foreground histograms in the context of cosegmentation [20], and enforcing marginal statistics on the solutions of labeling problems in binary and multilabel settings [23]. An advantage of dual decomposition is that it allows complex varieties of constraints to be modeled, while still permitting efficient energy minimization techniques such as graph-cuts to be used.
In this paper we propose to use such techniques to embed MIL constraints directly in CRF models. This will provide us with a way of building structured bag models, which can effectively model the kinds of dependencies between instance variables that we expect in, for instance, semantic segmentation applications. We consider constraints of two kinds: first the traditional ‘hard’ MIL constraint, where we have a 0-1 bag label indicating the presence or absence of positives at the instance level, and second a ‘soft’ constraint, where a continuous bag label in the interval [0, 1] indicates the expected proportion of positives (similar to [6], where this information is added to the positive bags to help training). We also consider hierarchical bag models which incorporate both kinds of constraint at different levels. We then provide a method for extending the deterministic annealing framework discussed above to perform learning in such structured models, using boosted models as our instance level classifiers, and requiring only that approximate MAP inference can be performed at the bag level. We demonstrate the potential of our approach on two vision tasks which both permit MIL formulations, and exhibit structural dependencies appropriate to our models. The first is a binary semantic labeling task involving road-crack detection using the LabelCracks dataset of [22], which we extend with various weaker MIL annotations. The second is the task of interactive cosegmentation, using the iCoseg dataset of [2], in which we consider the extended scenario where the user is free to add soft ‘bag-level’ annotations. We stress, though, the broad applicability of the techniques introduced, particularly for scene segmentation problems with weak annotations as discussed above.
Fig. 1. Summary of structured bag models considered in the paper. Model 1 (a) enforces a hard MIL constraint between a binary bag label y and instance labels x, with internal pairwise dependencies, while model 2 (b) enforces a soft constraint between a continuous bag label σ and the instances. In model 3 (c) a two level bag model is formed, combining both hard and soft constraints. The bag labels y and σ may be observed or unobserved during inference (see Sec. 3).
Sec. 2 introduces our basic structured bag models, Secs. 3 and 4 give details on inference and training, Secs. 5 and 6 give experimental evaluations, and Sec. 7 offers a discussion.
2 Structured Bag Models
We begin by outlining the basic bag models we will consider in Sections 2.1 and 2.2, involving hard and soft MIL constraints respectively. Section 2.3 then outlines a more complex 2-level bag model, which is motivated by the experimental data of Section 5.
2.1 Model 1 (Hard Constraints)
The first model we consider is shown in Figure 1a. Here we have a simple bag model, consisting of a binary bag label $y \in \{0, 1\}$, and instances $x_p$, $p = 1...P$, also taking binary labels. We shall denote observations associated with each instance by $z_p$. Given this notation, we define an energy for the bag as:
$$E^1(x, y|z) = E_{\text{base}}(x|z) + \phi^{\text{hard-MIL}}(x, y) \tag{1}$$
where:
$$E_{\text{base}}(x|z) = \sum_p \phi^{\text{unary}}(x_p, z_p) + \sum_{(p_1, p_2) \in \mathcal{N}} \phi^{\text{pair}}(x_{p_1}, x_{p_2}, z_p) \tag{2}$$
with $\phi^{\text{unary}}$ and $\phi^{\text{pair}}$ representing standard unary and pairwise terms in a base CRF energy (with $\mathcal{N}$ the neighborhood relation) and:
$$\phi^{\text{hard-MIL}}(x, y) = \begin{cases} 0 & \text{if } \max_p x_p = y \\ \infty & \text{otherwise} \end{cases} \tag{3}$$
Eq. 3 ensures that the MIL constraint is enforced between $x$ and $y$: the energy is finite only if the constraint holds. Implicitly, this energy model also implies a probability model, where $Pr(x, y|z) \propto \exp(-E(x, y|z))$.
2.2 Model 2 (Soft Constraints)
Model 2 is illustrated in Figure 1b. The only difference from model 1 is that in place of the binary bag labels $y$ we now have continuous bag labels $\sigma \in [0, 1]$, which represent the expected proportion of positives in the bag (we also change the index $p$ to $q$ to match the notation in model 3). The bag energy now takes the form:
$$E^2(x, \sigma|z) = E_{\text{base}}(x|z) + \phi^{\text{soft-MIL}}(x, \sigma) \tag{4}$$
where $E_{\text{base}}(x|z)$ is as in (2), and:
$$\phi^{\text{soft-MIL}}(x, \sigma) = g\Big(\sigma \cdot Q - \sum_q x_q\Big) \tag{5}$$
Here $g$ is any convex function, for example the $l_1$-distance $g(\cdot) = |\cdot|$ or $l_2$-distance $g(\cdot) = (\cdot)^2$. The intended semantics of $g$ are that it penalizes the difference between the expected number of positives, $\sigma \cdot Q$, and the observed count. We note that these choices of function do not require bags labeled $\sigma = 0$ to have no positives; if we wish to do that, we must use the convex function $g$ which is 0 at 0 and $\infty$ elsewhere, enforcing that the exact counts are observed for all bags. This may be too restrictive in practice.
2.3 Model 3 (Combined Model)
Our third model is illustrated in Figure 1c. This combines the two types of constraints outlined above to form a more complex bag structure, with instance labels on two levels (the first level forming a set of ‘sub-bag’ labels for the second). The overall bag label is again $\sigma \in [0, 1]$, which here represents the expected proportion of positives in the first level of instances, $x^1_q$, $q = 1...Q$. Each of the $x^1_q$'s in turn acts as a hard bag label to a subset of instances at the second level, $x^2_p$, $p = 1...P$, and we write $\mathbf{x}^2_q$ to denote the subset associated with $x^1_q$, and $x^1_{q_p}$ for the $x^1_q$ for which $x^2_p \in \mathbf{x}^2_q$ (assuming non-overlapping sub-bags for simplicity). Given this notation, the bag energy is:
$$E^3(x^1, x^2, \sigma|z^1, z^2) = E_{\text{base}}(x^1|z^1) + E_{\text{base}}(x^2|z^2) + \phi^{\text{soft-MIL}}(x^1, \sigma) + \sum_q \phi^{\text{hard-MIL}}(\mathbf{x}^2_q, x^1_q) \tag{6}$$
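The MIL terms of the three bag energies can be made concrete with a short sketch. The base CRF energy is deliberately left out, and the function names and the list-of-lists layout for sub-bags are illustrative assumptions rather than the authors' implementation.

```python
import math


def phi_hard_mil(x, y):
    """Eq. 3: zero if the bag label equals the max instance label, else infinite."""
    return 0.0 if max(x) == y else math.inf


def phi_soft_mil(x, sigma, g=abs):
    """Eq. 5: convex penalty g on the gap between sigma * Q and the positive count."""
    return g(sigma * len(x) - sum(x))


def model3_mil_terms(x1, sub_bags, sigma, g=abs):
    """MIL terms of Eq. 6; sub_bags[q] holds the second-level labels under x1[q]."""
    hard = sum(phi_hard_mil(x2_q, x1_q) for x2_q, x1_q in zip(sub_bags, x1))
    return phi_soft_mil(x1, sigma, g) + hard


consistent = model3_mil_terms([1, 0], [[0, 1], [0, 0]], sigma=0.5)
violating = model3_mil_terms([1, 0], [[0, 0], [0, 0]], sigma=0.5)
```

In the second call the first sub-bag is labeled positive but contains no positive instance, so the hard constraint makes the energy infinite.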
3 Inference Using Dual Decomposition
We now consider how to perform MAP inference in the above models. In all cases, the base energy $E_{\text{base}}$ is assumed to be submodular.
3.1 Model 1 (Hard Constraints)
If neither $x$ nor $y$ is known, MAP inference in (1) consists simply of performing a graph-cut to find $x = \arg\min_x E_{\text{base}}(x|z)$, and setting $y = \max_p x_p$. If $y$ is known, the only problematic case is when $y = 1$ and performing a graph-cut returns $x = 0$. In this case, we may perform additional graph-cuts forcing each instance $x_p$ to be positive in turn and take the solution with the lowest energy, or as an approximation, force only the instance with the highest unary response, and take the associated solution.
3.2 Model 2 (Soft Constraints)
With unknown $x$ and $\sigma$, inference in (4) is again simple, as we must only make a single graph-cut to find $x = \arg\min_x E_{\text{base}}(x|z)$, and set $\sigma = (1/Q) \sum_q x_q$. With a known $\sigma$, recent methods for MAP inference suggest a dual decomposition approach (DD), as in ‘area matching’ over a Marginal Probability Field (MPF) [23]. We give a slightly simpler DD algorithm than [23] below, which we then extend to model 3 in Sec. 3.3. We reformulate Eq. 4 as an integer program, introducing a (redundant) variable $a$ to encode the soft MIL constraint between $\sigma$ and $x$ and dropping the dependencies on $z$, leading to the primal objective:
$$\min_{x, a} \left[ E_{\text{base}}(x) + g(a) \right] \tag{7}$$
where $g$ is a convex function as in Sec. 2.2, and we have constraints:
$$a = \sum_q x_q - \sigma \cdot Q, \quad x \in \{0, 1\}^Q, \quad a \in \{-Q...Q\} \tag{8}$$
Introducing a Lagrange multiplier $\lambda$ for the first constraint, the dual objective is:
$$\max_\lambda \psi(\lambda) = \max_\lambda \Big( \min_x \big[ E_{\text{base}}(x) - \langle \lambda \cdot \mathbf{1}, x \rangle \big] + \min_a \big[ g(a) + \lambda a \big] + \lambda \sigma Q \Big) \tag{9}$$
where $\lambda$ is unconstrained. Evaluating the dual requires that we solve two minimization problems to give us $x$ and $a$ for a given $\lambda$. The first is straightforward, and requires a single graph-cut on the energy $E_{\text{base}}(x)$ with $-\lambda$ added to the positive unary terms. The second can be solved analytically for most convex $g$. For instance, for $g(\cdot) = |\cdot|$ we set:
$$a = \begin{cases} -Q & \text{if } \lambda > 1 \\ Q & \text{if } \lambda < -1 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$
We can thus maximize (9) by initializing $\lambda = 0$, and taking steps along the subgradient:
$$\frac{\partial \psi}{\partial \lambda} = a - \sum_q x_q + \sigma \cdot Q \tag{11}$$
Several methods are available for choosing step sizes which are guaranteed to converge to the solution of the dual (9) (see [4]). A solution to the primal is only guaranteed if the duality gap is closed at this value. However, a reasonable solution here, and also if the algorithm is not run to convergence, is to take the x with the lowest base energy seen across all iterations.
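The subgradient scheme for model 2 can be sketched as follows. This is a toy illustration under stated assumptions: the base energy is a small chain with Potts pairwise terms, brute-force enumeration stands in for the graph-cut, the step size 1/t is one simple diminishing choice, and the sketch keeps the iterate with the lowest full primal energy $E_{\text{base}}(x) + |\sum_q x_q - \sigma Q|$. All function names and numbers are hypothetical.

```python
from itertools import product


def base_energy(x, unary, pair_weight):
    """Toy chain CRF: unary[q] is the cost of x[q] = 1; Potts penalty on neighbors."""
    cost = sum(u for u, x_q in zip(unary, x) if x_q)
    return cost + pair_weight * sum(a != b for a, b in zip(x, x[1:]))


def argmin_x(unary, pair_weight, lam):
    """Stand-in for the graph-cut: brute force with -lam added to positive unaries."""
    shifted = [u - lam for u in unary]
    return min(product((0, 1), repeat=len(unary)),
               key=lambda x: base_energy(x, shifted, pair_weight))


def argmin_a(lam, Q):
    """Minimizer of |a| + lam * a over a in {-Q, ..., Q} (Eq. 10)."""
    if lam > 1:
        return -Q
    if lam < -1:
        return Q
    return 0


def dual_decomposition(unary, pair_weight, sigma, iters=50):
    """Subgradient ascent on the dual (Eq. 9), keeping the best primal iterate."""
    Q = len(unary)
    lam, best_x, best_primal = 0.0, None, float("inf")
    for t in range(1, iters + 1):
        x = argmin_x(unary, pair_weight, lam)
        a = argmin_a(lam, Q)
        primal = base_energy(x, unary, pair_weight) + abs(sum(x) - sigma * Q)
        if primal < best_primal:
            best_x, best_primal = x, primal
        lam += (a - sum(x) + sigma * Q) / t  # step along the subgradient (Eq. 11)
    return best_x, best_primal


best_x, best_primal = dual_decomposition([-0.75, -0.375, 0.5, 0.5], 0.125, sigma=0.25)
```

On this toy problem the unconstrained minimizer has two positives, while the label σ = 0.25 asks for one; the multiplier moves until the sub-problems agree on a labeling with a single positive.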
3.3 Model 3 (Combined Model)
We now extend the dual decomposition approach outlined in Section 3.2 to the more complex bag structure of model 3. Here, we note that even in the case that $\sigma$ is unknown, inference is now difficult since the hard MIL constraints must be enforced between $x^1$ and $x^2$. Instead of considering this case separately, we outline below the case for inference with known $\sigma$, which is more complex. For $\sigma$ unknown, we can simply remove the $\sigma$ terms from the objectives below, solve for $x^1$ and $x^2$, and set $\sigma = (1/Q) \sum_q x^1_q$. As above, we reformulate (6) as an integer program, giving the primal objective:
$$\min_{x^1, x^2, a} \left[ E_{\text{base}}(x^1) + E_{\text{base}}(x^2) + g(a) \right] \tag{12}$$
We now include additional constraints (the second and third below) to encode the hard MIL constraints between $x^1$ and $x^2$, giving:
$$\begin{aligned} & a = \sum_q x^1_q - \sigma \cdot Q, & x^1 \in \{0, 1\}^Q \\ & \forall p: \; x^2_p \le x^1_{q_p}, & x^2 \in \{0, 1\}^P \\ & \forall q: \; \sum_{p \in \mathbf{x}^2_q} x^2_p \ge x^1_q, & a \in \{-Q...Q\} \end{aligned} \tag{13}$$
Introducing Lagrangian multipliers, we form the dual objective by relaxing the constraints in (13):
$$\max_\lambda \psi(\lambda) = \max_\lambda \Big( \min_{x^1, x^2, a} \Big[ E_{\text{base}}(x^1) + E_{\text{base}}(x^2) + g(a) + \lambda^1 \Big(a - \sum_q x^1_q + \sigma Q\Big) + \sum_p \lambda^2_p (x^2_p - x^1_{q_p}) + \sum_q \lambda^3_q \Big(x^1_q - \sum_{p \in \mathbf{x}^2_q} x^2_p\Big) \Big] \Big) \tag{14}$$
subject to:
$$\lambda^1 \text{ unconstrained}, \quad \lambda^2, \lambda^3 \ge 0 \tag{15}$$
We may further group (14) as follows:
$$\max_\lambda \psi(\lambda) = \max_\lambda \Big( \min_{x^1} \big[ E_{\text{base}}(x^1) + \langle \lambda^3 - \lambda'^2 - \lambda^1 \cdot \mathbf{1}, x^1 \rangle \big] + \min_{x^2} \big[ E_{\text{base}}(x^2) + \langle \lambda^2 - \lambda'^3, x^2 \rangle \big] + \min_a \big[ g(a) + \lambda^1 a \big] + \lambda^1 \sigma Q \Big) \tag{16}$$
where we have defined $\lambda'^2_q = \sum_{p \in \mathbf{x}^2_q} \lambda^2_p$ and $\lambda'^3_p = \lambda^3_{q_p}$. Evaluating the dual then requires solving the 3 minimization problems in (16) to give $x^1$, $x^2$ and $a$ for a given $\lambda$. The first two can be solved by single graph cuts on $x^1$ and $x^2$, where we simply add to the positive unaries the values implied by the inner products in (16). $a$ can again be found for $g(\cdot) = |\cdot|$ as in (10). Using these solutions, we derive a subgradient at $\lambda$ as:
$$\frac{\partial \psi}{\partial \lambda^1} = a - \sum_q x^1_q + \sigma \cdot Q, \quad \frac{\partial \psi}{\partial \lambda^2_p} = x^2_p - x^1_{q_p}, \quad \frac{\partial \psi}{\partial \lambda^3_q} = x^1_q - \sum_{p \in \mathbf{x}^2_q} x^2_p \tag{17}$$
As before, we maximize the dual by starting with $\lambda = 0$ and taking steps along this subgradient (using a scheme as in [4]). Several additional operations are required compared to model 2. First, we must check with each update that the constraints (15) are satisfied, projecting $\lambda$ if not. Further, a rounding step is required to produce a valid solution, as $x^1$ and $x^2$ for any given $\lambda$ may not respect the inter-layer MIL constraints. A simple scheme may be used, such as setting $\mathbf{x}^2_q = 0$ for each $x^1_q = 0$, and treating positive sub-bags violating the constraints using the inference techniques in Sec. 3.1. The rounded primal energy is checked at each iteration, and the minimum selected.
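The projection and rounding steps might look as follows in a minimal sketch. It simplifies the repair of violating positive sub-bags: instead of re-running a graph-cut with one instance forced positive (Sec. 3.1), it only switches on the instance with the lowest positive cost. The function names and the list-of-lists layout are assumptions for illustration.

```python
def project_multipliers(lam2, lam3):
    """Project onto the constraints of Eq. 15: lam2, lam3 >= 0 (lam1 stays free)."""
    return [max(0.0, v) for v in lam2], [max(0.0, v) for v in lam3]


def round_to_feasible(x1, sub_bags, sub_unaries):
    """Make the second level respect the hard MIL constraints given x1.

    sub_bags[q]: second-level labels under x1[q];
    sub_unaries[q]: cost of labeling each of those instances positive.
    """
    rounded = []
    for x1_q, x2_q, u_q in zip(x1, sub_bags, sub_unaries):
        x2_q = list(x2_q)
        if x1_q == 0:
            x2_q = [0] * len(x2_q)  # negative sub-bag: clear all positives
        elif not any(x2_q):
            # positive sub-bag without a positive: switch on the cheapest instance
            best = min(range(len(u_q)), key=u_q.__getitem__)
            x2_q[best] = 1
        rounded.append(x2_q)
    return rounded


x2 = round_to_feasible([0, 1, 1],
                       [[1, 0], [0, 0], [0, 1]],
                       [[0.1, 0.2], [0.5, -0.3], [0.0, 0.0]])
```

Here the first sub-bag is cleared because its parent is negative, the second gains one positive, and the third is already valid and is left unchanged.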
4 Training Using Deterministic Annealing
We describe here how the deterministic annealing (DA) approach used for training MIForests in [12] may be adapted to train structured bag models of the kind proposed. No assumptions are made about the nature of the unary classifiers (which are the main objects to be trained), and the algorithms given may be applied for instance to boosted classifiers, SVMs or random forests (RFs).
4.1 Model 1 (Hard Constraints)
We outline the DA algorithm first in detail for the simplest case of model 1. Modifications for models 2 and 3 are discussed in Sections 4.2 and 4.3. We assume a training set of $I$ bags. We assume for simplicity that we have only the bag labels, $y_{i=1...I}$, and observations, $z_{i=1...I}$, since additional instance observations are easily incorporated. Since we are working with a structured bag model, we formulate our loss in terms of the score (which will be derived from the energy) given by the model to the whole bag, which we write as $F(x, y|z)$. Given this notation, and a loss function $l$, we may write the overall training objective as:
$$(F^*, x^*) = \arg\min_{F, x} \sum_i l(F(x_i, y_i|z_i)) \tag{18}$$
As discussed in [12], optimizing such an objective is difficult due to the large search space for $x$. We thus propose to optimize Eq. 18 by DA, producing an auxiliary objective by introducing a distribution over the instance labels, $\pi(x)$, and a ‘temperature’ $T$:
$$(F^*, \pi^*) = \arg\min_{F, \pi} \sum_i \sum_{x_i} \pi(x_i) \, l(F(x_i, y_i|z_i)) - T \sum_i H(\pi_i) \tag{19}$$
where $H(\pi_i) = -\sum_{x_i} \pi(x_i) \log(\pi(x_i))$ is the entropy of the distribution over bag $i$. Eq. 19 can be minimized by starting at a high $T$, and alternately updating $F$ and $\pi$ while reducing $T$. When $T = 0$, Eq. 19 reduces to Eq. 18, and hence we are optimizing the original objective. Since [12] was working with independent classifier responses, updating $\pi$ in (19) was automatically a convex problem. In our case, since we are dealing with responses for full bags, we must be careful how we formulate $F$ to maintain tractability. The straightforward option of basing $F$ on the CRF distribution implied by (1)
(i.e. $F = Pr(x_i, y_i|z_i) \propto \exp(-E_i)$) will not work, as it involves estimating the partition function. Instead, we choose to base $F$ on a simplified distribution which is easier to work with. We factorize this distribution as follows: $Pr'(x_i, y_i|z_i) = Pr'(y_i|z_i) Pr'(x_i|y_i, z_i)$. The first term we derive from the energy of the mode (MAP estimate) of (1) when $y_i$ is fixed to the specified value. We write these modes as $f_i^{y_i} = \arg\min_{x_i} E^1(x_i, y_i|z_i)$ (noting that $f_i^0 = 0$ automatically), and let $Pr'(y_i|z_i) = (1/Z_i) \exp(-E^1(f_i^{y_i}, y_i|z_i))$, where we note that normalization is now only over two terms, $Z_i = \sum_{y_i \in \{0, 1\}} \exp(-E^1(f_i^{y_i}, y_i|z_i))$. For the second term, $Pr'(x_i|y_i, z_i)$, we introduce an uncertainty parameter $\beta$, which indicates the probability that an instance label may be flipped from the mode estimate. We base $F$ directly on the joint probability of $x_i$ and $y_i$ in this altered distribution:
$$F(x_i, y_i|z_i) = (1/Z_i) \exp(-E^1(f_i^{y_i}, y_i|z_i)) \prod_p \beta^{[f_{ip}^{y_i} \ne x_{ip}]} (1 - \beta)^{[f_{ip}^{y_i} = x_{ip}]} \tag{20}$$
We note that by this approximation we have introduced a (small) probability that the MIL constraints will be violated: this is not a problem, though, if we are concerned with MAP inference, as the modes $f^{y_i}$ will not violate the constraints, so at $T = 0$ all the probability mass in (19) will be placed on a valid labeling. Finally, we take the loss to be the negative log-likelihood, $l(\cdot) = -\log(\cdot)$. The advantage of the factorized distribution form used in (20) is that the losses incurred by the instances become independent given $F$ and the bag labels $y$. This means that we know the optimal $\pi$ will factor as $\pi = \prod_{ip} \pi_{ip}$, where $\pi_{ip}$ is the probability that $x_{ip}$ is positive, and finding these factors reduces to solving a series of convex problems. Differentiating (19) with respect to $\pi_{ip}$ and setting to zero yields:
$$\pi_{ip} = \frac{1}{1 + \exp(l_{ip}/T)} \tag{21}$$
where $l_{ip}$ is the loss that would be incurred by labeling $x_{ip}$ as positive, which, as in [12], is the negative log of the positive margin:
$$l_{ip} = -\log \frac{\beta^{[f_{ip}^{y_i} = 0]} (1 - \beta)^{[f_{ip}^{y_i} = 1]}}{\beta^{[f_{ip}^{y_i} = 1]} (1 - \beta)^{[f_{ip}^{y_i} = 0]}} \tag{22}$$
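The update of the instance distribution can be sketched in a few lines, reading the ratio in Eq. 22 as the odds that an instance is positive under the flip model. The function names and the value β = 0.2 are hypothetical; the schedule (start at T = 5, divide by 1.5 each round, five rounds) is the one reported for the experiments in Sec. 5.

```python
import math


def flip_loss(f_ip, beta):
    """Eq. 22: negative log-odds of labeling an instance positive, given its mode label."""
    p_pos = (1.0 - beta) if f_ip == 1 else beta  # probability that x_ip = 1
    return -math.log(p_pos / (1.0 - p_pos))


def update_pi(modes, beta, T):
    """Eq. 21: pi_ip = 1 / (1 + exp(l_ip / T)) for every instance of a bag."""
    return [1.0 / (1.0 + math.exp(flip_loss(f, beta) / T)) for f in modes]


T = 5.0
history = []
for _ in range(5):
    history.append(update_pi([1, 0, 0], beta=0.2, T=T))
    T /= 1.5  # cooling step
```

At T = 1 the update returns the flip-model probabilities themselves; higher temperatures pull every entry towards 0.5, and as T approaches 0 the distribution concentrates on the mode labeling.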
Optimizing (19) with respect to F for fixed π and y is more involved, and in general will depend on the form of classifiers used in the potentials of the bag CRF models. As in [12], we first sample instance labels xip across all positive bags according to πip (forcing at least one instance to be positive). We then update the bag parameters to maximize the joint probability of xi and yi . We note that the problem of CRF training is simplified in our case given the factorized form of (20), and a reasonable piecewise method is to (1) optimize the unary potentials using the current instance samples, and (2) optimize remaining parameters, i.e. β in (20) and possibly a weight αpair on the base pairwise potential, by 1-d line searches, where MAP inference only is required at each step to evaluate (19), which can be performed as in Sec. 3.1. In practice, step (2)
can be omitted, and parameters such as αpair and β treated as fixed hyperparameters set by cross-validation over several runs of DA.¹
¹ Our approach in Sec. 4.1 is similar to the pseudolikelihood (see [5]) in that we ensure the overall objective (Eq. 18) factorizes across the losses on each variable (Eq. 22). However, unlike the pseudolikelihood, which bases these losses on the true marginals (via each variable’s Markov blanket), we base ours on the altered distribution $Pr'(x, y|z)$, which is centered on the modes of $Pr(x|y, z)$, and factorizes into a product of marginals by construction.
4.2 Model 2 (Soft Constraints)
The DA algorithm for model 2 closely mirrors that for model 1, with some exceptions which we draw attention to below. The overall and auxiliary training objectives are as in Eqs. 18 and 19, where we simply replace the binary bag labels $y_i$ with continuous labels $\sigma_i$. Again, we use a simplified form of distribution to generate the bag responses $F$ as in (20). However, we adopt a slightly different form for this distribution to cater for the more complex energy form of model 2. We begin by quantizing $\sigma$ into a number of levels, $\sigma = k$, $k \in \{k_1, k_2...k_K\}$, effectively forming a set of classes. As before, we factorize the distribution $Pr'(x_i, \sigma_i|z_i) = Pr'(\sigma_i|z_i) Pr'(x_i|\sigma_i, z_i)$, where the first term is now dependent on the MAP solutions of $E^2$ (Eq. 4) when $\sigma_i$ is set to each value $k$ in turn: $Pr'(\sigma_i = k|z_i) = (1/Z_i) \exp(-E^2(f_i^k, k|z_i))$, writing $f_i^k$ for the mode when $\sigma_i$ is set to $k$ (i.e. $\arg\min_{x_i} E^2(x_i, k|z_i)$), which can be inferred by MAP inference as in Sec. 3.2. We note that exact MAP inference is not required (and cannot be guaranteed with dual decomposition): we need only a deterministic way of finding modes to approximate the full CRF distribution, which can be achieved by, say, using a fixed number of subgradient updates, and always initializing to $\lambda = 0$. Further, normalization is again tractable, $Z_i = \sum_k \exp(-E^2(f_i^k, k|z_i))$. The bag responses take the form:
$$F(x_i, \sigma_i = k|z_i) = (1/Z_i) \exp(-E^2(f_i^k, k|z_i)) \prod_q \beta^{[f_{iq}^k \ne x_{iq}]} (1 - \beta)^{[f_{iq}^k = x_{iq}]} \tag{23}$$
The factorized form of (23) means that again it is straightforward to optimize with respect to $\pi$ for fixed $F$ and $\sigma$, and the optimal updates again take the same form as Eq. 21, where $l_{iq}$ is defined as was $l_{ip}$ in (22), but substituting $f_{iq}^{\sigma_i}$ for $f_{ip}^{y_i}$. Optimizing the CRF parameters can again be done in a piecewise fashion, after sampling new instance labels from $\pi$. There is now potentially an extra weight αsoft for the soft-MIL potential, which can be set by line search, or treated as a hyperparameter. A further issue arises in model 2 regarding inference when this model is trained by the above DA procedure. Since the model was trained under the factorized approximation of the CRF energy implied by (23), the correct procedure for MAP inference for a bag with unknown $\sigma$ is to set $\sigma = k$ for each of the $\{k_1, k_2...k_K\}$ quantized levels in turn, run dual decomposition on each and take the solution with the minimum energy. This contrasts with the simpler procedure mentioned in Sec. 3.2 of simply performing a single cut on the base energy and setting $\sigma = (1/Q) \sum_q x_q$, which would assume that our approximation of the original energy during training is good enough to permit switching at test time. In our experimentation we use the former approach, and leave investigation of the relationship between these solutions to future work.
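Test-time inference for a bag with unknown σ, as described for model 2, can be sketched as a loop over the quantized levels. The helper `toy_map` is a hypothetical stand-in for a run of dual decomposition: it uses a unary-only energy with an $l_1$ soft-MIL term and brute-force search, and both the levels and the unary costs are made up for illustration.

```python
from itertools import product


def infer_unknown_sigma(levels, map_for_level):
    """Try each quantized sigma = k and keep the labeling with the lowest energy.

    map_for_level(k) must return (x, energy), for example from a fixed number
    of dual-decomposition steps initialized at lambda = 0.
    """
    best = None
    for k in levels:
        x, energy = map_for_level(k)
        if best is None or energy < best[2]:
            best = (k, x, energy)
    return best


def toy_map(k, unary=(-0.75, -0.375, 0.5, 0.5)):
    """Hypothetical stand-in for MAP inference at a fixed sigma = k."""
    Q = len(unary)

    def energy(x):
        return sum(u for u, v in zip(unary, x) if v) + abs(k * Q - sum(x))

    x = min(product((0, 1), repeat=Q), key=energy)
    return x, energy(x)


sigma_hat, x_hat, e_hat = infer_unknown_sigma((0.0, 0.25, 0.5), toy_map)
```

The loop costs one MAP call per level, so with the K = 3 levels used in the experiments it amounts to three inference runs per bag.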
Fig. 2. Example annotations and results on the LabelCracks dataset [22] for road crack detection. Shown are (a) fully labeled data at the pixel level, (b) weak MIL annotations at the patch level (blue indicates patches containing cracks), and image level (σ denotes proportion of cracking), (c-d) results from model 1 (DA) showing original image, ground truth, patch-level estimates, pixel-level estimates (see Sec. 5). Best viewed in color.
4.3 Model 3 (Combined Model)
The DA algorithm for model 3 is essentially the same as for model 2, and again involves quantizing the continuous bag labels $\sigma$ into a set of discrete levels. The modes $f_i^k$ are now the (approximate) MAP solutions across both levels $x^1$ and $x^2$ when $\sigma = k$, which are found through the DD algorithm of Sec. 3.3 with appropriate rounding to respect the hard MIL constraints between levels. $F$ again takes the same form as in (23), where
$$Pr'(x_i|\sigma_i, z_i) = \prod_q \beta^{[f_{iq}^k \ne x^1_{iq}]} (1 - \beta)^{[f_{iq}^k = x^1_{iq}]} \cdot \prod_p \beta^{[f_{ip}^k \ne x^2_{ip}]} (1 - \beta)^{[f_{ip}^k = x^2_{ip}]}$$
is simply adapted to run across both levels. This implies that the same updates can be used for $\pi$ (an extra check is needed, though, in sampling to ensure all the constraints are satisfied), and piecewise training of the CRF model can proceed by training the unary potentials on each level separately, and setting the remaining parameters to optimize (19) directly.
5 Semi-supervised Segmentation with Weak MIL Annotations (Road Crack Detection)
We first make a comparison of the models proposed using a binary segmentation task involving detecting cracked regions in road surface imagery. We use the LabelCracks dataset from [22], which provides 100 such fully annotated images. To these, we add 572 further images which we weakly annotate by overlaying each with a grid of 25 × 20 squares, and marking a square as positive if it contains cracking (example annotations are shown in Figure 2a-b). The patch and pixel annotations correspond directly to the levels $x^1$ and $x^2$ in model 3 above, and $\sigma$ labels for all images are derived via the proportion of positive-labeled patches. These annotation types all correspond to types of output with application interest. We divide original and new images 0.75/0.25 for training/testing. Several design choices are held constant across models. We extract from the images HOG and Texton features, and use boosted classifiers at both pixel and patch levels on these features for the unary potentials in our models. These use weak-learners
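Table 1. Summary of road crack detection results across all models. Notice particularly (1) the generally higher performance of the models trained with deterministic annealing (DA) (exceptions discussed in text), (2) the higher performance of the more complex bag structures (models 2 and 3) at the image level, and (3) the higher performance of model 1 at patch and pixel levels.

Image Level (% correct)

| Model \ Supervision | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|
| Model 1 | 36.1 | 39.5 | 33.1 | 34.9 |
| Model 2 | 40.7 | 40.7 | 39.0 | 45.9 |
| Model 3 | 44.8 | 45.4 | 47.7 | 45.4 |
| Model 1 (DA) | 30.8 | 32.6 | 34.3 | 34.9 |
| Model 2 (DA) | 45.9 | 44.2 | 42.4 | 45.9 |
| Model 3 (DA) | 45.9 | 47.7 | 50.6 | 45.4 |

Patch Level (union-intersect)

| Model \ Supervision | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|
| Model 1 | 0.232 | 0.225 | 0.246 | 0.248 |
| Model 2 | 0.152 | 0.181 | 0.182 | 0.171 |
| Model 3 | 0.192 | 0.216 | 0.212 | 0.209 |
| Model 1 (DA) | 0.244 | 0.249 | 0.253 | 0.248 |
| Model 2 (DA) | 0.164 | 0.158 | 0.166 | 0.171 |
| Model 3 (DA) | 0.169 | 0.202 | 0.198 | 0.209 |

Pixel Level (union-intersect; no values are listed for model 2)

| Model \ Supervision | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|
| Model 1 | 0.123 | 0.126 | 0.140 | 0.147 |
| Model 3 | 0.085 | 0.118 | 0.127 | 0.118 |
| Model 1 (DA) | 0.131 | 0.137 | 0.138 | 0.147 |
| Model 3 (DA) | 0.083 | 0.106 | 0.108 | 0.118 |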
Table 2. Precision and Recall values for model 1 (DA) at patch and pixel levels. These provide alternatives to the union-intersection metric used in Table 1, and results show how these scores can be traded off by varying parameter settings.

Model 1 (DA), Patch Level

| Proportion labeled | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|
| Precision (Low) | 0.267 | 0.270 | 0.283 | 0.288 |
| Recall | 0.744 | 0.756 | 0.709 | 0.639 |
| Precision (High) | 0.471 | 0.473 | 0.504 | 0.460 |
| Recall | 0.176 | 0.161 | 0.164 | 0.131 |

Model 1 (DA), Pixel Level

| Proportion labeled | 0.25 | 0.5 | 0.75 | 1 |
|---|---|---|---|---|
| Precision (Low) | 0.144 | 0.149 | 0.156 | 0.172 |
| Recall | 0.596 | 0.627 | 0.537 | 0.503 |
| Precision (High) | 0.293 | 0.319 | 0.386 | 0.259 |
| Recall | 0.084 | 0.084 | 0.100 | 0.054 |
which threshold feature responses in regions of random size/offset from the pixel/patch center in the manner of TextonBoost [17]. Further, our pairwise terms are based on contrast-sensitive Potts models at the pixel-level (see [8]), and an Ising model with a single parameter at the patch-level. Dual decomposition is run for 5 iterations, using the sub-gradient update procedure outlined in [9], and deterministic annealing is run for 5 rounds, starting at T = 5 and using the update $T^{n+1} = T^n / 1.5$ each round. We compare the performance of the models under several conditions. We divide the training sets (pixel and patch annotations) into four equal parts, and test the models when the full annotations are provided for 0.25, 0.5, 0.75 and all of the training data. In the first three cases, for the remaining data only the bag labels are visible, consisting of patch-level annotations for pixel-data, and σ annotations for the patch-data. Each model is trained under two scenarios: (a) using only the supervised data, and (b) using both the supervised subset and the extra bag labeled data. In only the latter case is the DA training used, but dual decomposition is used at testing for models 2 and 3 in both training scenarios (i.e. they must propose a consistent labeling across levels). We quantize σ into K = 3 levels for models 2 and 3 (covering equal probability mass in the training distribution, giving σ ∈ {0.04, 0.1, 0.2}), and measure performance in terms
of %-correct classification at the image level, and the union-intersection score at patch and pixel levels (using the true/false positive/negative counts: TP/(TP+FP+FN)). Results: Results are shown across models in Table 1. Several points can be noted. First, as expected, increasing the percentage of supervised training data generally increases performance. A notable feature though is that in several cases, e.g. Model 3 (DA) image level, the maximum is achieved at 0.5 or 0.75 supervision, suggesting that the weaker-labeled MIL annotations may be providing a buffer to overfitting in these cases. Second, we note that in most cases, adding the MIL-annotations for the DA-trained models does appear to improve performance, as is particularly notable in models 2 and 3 at the image level, and model 1 at the patch and pixel levels. The cases where the opposite is true (model 1 at the image level, 2 and 3 at the patch level, and 3 at the pixel level) may be explained as follows: In each case, the level for which the model is optimizing the loss through DA does improve (i.e. image level for models 2 and 3, and patch level for model 1), but in other levels performance may decrease as it is not specifically targeted. In addition, in the case of model 1, there is a separate issue in that the extra annotations we added to the fully labeled LabelCracks set may have subtly different distribution properties, so there is an issue of transfer learning when testing at the image and patch levels for this model. Indeed, it performs at around chance (= 33.3%) at the image level. Nevertheless, it generalizes surprisingly well at the patch level. Finally we note that the image level results clearly show the utility of using bag models with greater amounts of structure, where the two-level model (model 3) outperforms both models 1 and 2. 
This confirms that model 3 is able to use the weak MIL annotations at both levels, and combine all the strong annotations, whose distribution characteristics are not necessarily identical, to improve the overall estimate of the image bag labels. A number of further points about the results should be noted. While the actual numerical results may seem quite low, this is in part due to the metrics being used, as well as the difficulty of the task. Indeed, as shown in Figure 2c-d, the qualitative results are often reasonable, although the cracks are often poorly localized, which is partly a function of the underlying TextonBoost classifier. The intersection-union score is typically a harsh metric, and the pixel and patch level scores in Table 1 are comparable to those of the state-of-the-art in many classes of interest in, for instance, the PASCAL VOC challenge (see [11]). Table 2 gives precision/recall performance for model 1 (DA) at both high and low recall by varying a global weight on the positive unary potentials, where the high recall results correspond to the intersection-union scores in Table 1. We note that at low recall we are able to achieve precisions comparable to or better than the 0.37-0.40 precisions reported in [22]. We use the stricter union-intersection metric in Table 1 so as to have a single measurement to compare across models, since the precision/recall regimes vary in different models according to the loss being optimized (models 2 and 3 impose lower recall rates at the patch and pixel levels, while model 1 favors higher recall at all levels). Finally, the 3-way categorization enforced by the quantization levels of σ may not reflect the total global information in the models, and a comparison of inference techniques (including continuous estimates of σ, see Secs. 4.2 and 4.3) is desirable. In this respect we also note that we have not made use of expected correlations between σ values on neighboring sections of road, and the same sections of road
Table 3. Summary of MIL-based interactive cosegmentation results. We outperform [2] with our baseline CRF, and adding the σ annotations (proportion of positives) allows us to improve performance still further using our MIL model 2 with deterministic annealing.

| # pixels annotated | 120 | 240 | 360 | 480 | 600 | 720 | 840 | 960 | 1080 | 1200 |
|---|---|---|---|---|---|---|---|---|---|---|
| iCoseg (machine) [2] | 61.3 | 73.5 | 79.9 | 84.0 | 86.6 | 88.3 | 89.6 | 91.1 | 91.8 | 92.7 |
| Our baseline CRF (no MIL) | 85.7 | 87.0 | 88.9 | 89.6 | 89.6 | 91.3 | 92.4 | 93.1 | 93.2 | 93.7 |
| Model 2 (MIL-DA) | 90.1 | 91.7 | 92.5 | 92.5 | 93.3 | 94.2 | 95.1 | 95.0 | 95.0 | 95.3 |
captured at different times. With these qualifications, the results generally show promise for combining various levels of annotations on this task using the models provided.
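As a brief illustration of two quantities used in this section, the following Python sketch computes the union-intersection score TP/(TP+FP+FN) and lists the temperatures implied by the annealing update. The function names are hypothetical, and the sketch assumes that the first of the five rounds runs at T = 5; it is not the authors' implementation.

```python
def union_intersection_score(tp, fp, fn):
    """Union-intersection score TP / (TP + FP + FN) from raw counts."""
    denom = tp + fp + fn
    # Assumption: return 0.0 when all counts are zero (a case the text does not cover).
    return tp / denom if denom else 0.0


def annealing_temperatures(t_start=5.0, factor=1.5, rounds=5):
    """Temperature used in each round under the update T_{n+1} = T_n / factor."""
    temps = [t_start]
    for _ in range(rounds - 1):
        temps.append(temps[-1] / factor)
    return temps


print(union_intersection_score(tp=30, fp=50, fn=20))
print([round(t, 3) for t in annealing_temperatures()])
```

Because false positives and false negatives both enter the denominator, the score is never larger than either precision or recall, which is one reason the values in Table 1 look low next to those in Table 2.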
6 MIL-Based Interactive Cosegmentation For our second task, we consider interactive cosegmentation, using the iCoseg dataset of [2]. We use the 37 groups of 5 images provided, each containing a highly related set of images, where the task is to extract out the foreground objects in each group (example groups are shown in Figure 3a-b). We duplicate the machine-based scenario explored in [2], where ‘scribbles’ are automatically generated across all images of the group, totaling between 120 and 1200 pixels, mimicking user foreground/background annotations (see Figure 3c-d for examples). The task is to provide a binary segmentation of each image from these annotations. We over-segment all images using the method of [1], and use the over-segmentations to augment the annotations by labeling segments containing scribbles by their mode scribble-class. In addition, we extract HOG, Color-HOG and Texton features from all images, using a similar quantization into visual words as in [10]. For our unary classifiers, we simply build histograms of these features for both foreground and background classes to learn a set of multinomial models (after adding a constant to each bin), and take the combined negative log-likelihood across features as the unary classifier response at each pixel. To create a base CRF, we also introduce contrast-sensitive Potts pairwise potentials as in Sec. 5. We compare the performance of this baseline CRF with model 2 as outlined earlier, which also has access to global σ values (proportion of positive labels), extracted from the ground truth. This mimics an interaction scenario where the user may provide not only scribbles on the images, but also an estimate of the size of the object relative to the image (which may be revised). We use the DA algorithm of Sec. 4.2 to alternate between sampling labels for the unannotated pixels in the group, and re-estimating the unary models, while maintaining the soft MIL constraints. 
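The unary classifier described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the feature name "texton" are hypothetical, visual words are taken to be integer indices, and the constant added to each bin is set to 1.0 because the text does not give its value.

```python
import math
from collections import Counter


def fit_multinomial(words, vocab_size, alpha=1.0):
    """Histogram of visual-word indices with a constant alpha added to each bin."""
    counts = Counter(words)
    total = len(words) + alpha * vocab_size
    return [(counts[w] + alpha) / total for w in range(vocab_size)]


def unary_nll(pixel_words, models):
    """Combined negative log-likelihood across feature types for one pixel.

    pixel_words: {feature name: visual-word index observed at the pixel}
    models:      {feature name: multinomial probabilities for one class}
    """
    return -sum(math.log(models[name][w]) for name, w in pixel_words.items())


# Toy data: foreground scribbles mostly show word 0, background scribbles word 2.
fg = {"texton": fit_multinomial([0, 0, 1], vocab_size=3)}
bg = {"texton": fit_multinomial([2, 2, 1], vocab_size=3)}
pixel = {"texton": 0}
print(unary_nll(pixel, fg), unary_nll(pixel, bg))
```

The added constant keeps every bin non-zero, so the logarithm stays finite for visual words that were never seen under a scribble of a given class.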
Results: Quantitative results are given in Table 3. Our baseline CRF already outperforms [2] (which also uses a pairwise CRF) at all levels, while adding the extra image-level annotations (as expected) gives a marked further increase. Qualitative comparisons of results before and after introduction of the global constraint are shown in Figure 3c-d. While to an extent a toy task, these results suggest that certain types of user interaction currently under-explored may be opened up by these models. Along similar lines, [23] explored a scenario for single image segmentation with an ‘area constraint’ (equivalent
Fig. 3. Examples and results from the iCoseg dataset [2]. Shown are (a-b) two example subgroups of images from the dataset, and (c-d) examples of results achieved, showing the original image, ground truth and σ annotation (proportion of positives), scribble annotations, output of our baseline CRF, output of our model 2 (MIL-DA), using the provided σ annotations.
to σ), and used a similar DD algorithm for inference. Comparison with our algorithm, which includes the extra DA optimization, is left to future work.
7 Discussion We have introduced a number of techniques in this paper for using multiple-instance learning in settings where bags may be highly structured, drawing on recent dual decomposition approaches from the CRF literature, and also training techniques from the semi-supervised/MIL literature (deterministic annealing). As suggested, the use of more highly structured bags may be particularly beneficial in vision applications, and we provided two example tasks which demonstrate the potential of the models proposed. A number of avenues for further research are suggested by this work. Most directly, we envisage such techniques to be ideally suited to general scene segmentation tasks if extended to a multilabel setting (as is straightforward), enabling us to utilize labelings of varying strengths (image-wise, pixel-wise, image regions etc.). The advantage of this approach over others (e.g. [7,24]) is that CRF designs of arbitrary complexity can be incorporated (e.g. [10]), in which our DA algorithm requires only that we can perform approximate MAP inference. In addition, we have already mentioned the desirability of a closer comparison of different possible inference strategies in the models proposed at test time, and this might be extended to related sampling based approaches (e.g. Swendsen-Wang cuts) to compare the relative merits. Finally, we might also consider incorporating stronger methods of MAP-CRF learning such as structured max-margin approaches as components within the deterministic annealing framework.
Acknowledgments. This work was supported by Yotta DCL, and the IST Programme of the European Community, under the PASCAL2 Network of Excellence. P. H. S. Torr is in receipt of a Royal Society Wolfson Research Merit Award.
References
1. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: From Contours to Regions: An Empirical Evaluation. In: CVPR (2009)
2. Batra, D., Kowdle, A., Parikh, D., Luo, J., Chen, T.: iCoseg: Interactive Cosegmentation with Intelligent Scribble Guidance. In: CVPR (2010)
3. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: CVPR (2009)
4. Bertsekas, D.: Nonlinear Programming. Athena Scientific (1999)
5. Besag, J.: Statistical analysis of non-lattice data. The Statistician 24, 179–195 (1975)
6. Gehler, P.V., Chapelle, O.: Deterministic Annealing for Multiple-Instance Learning. In: AISTATS (2007)
7. Gould, S., Gao, T., Koller, D.: Region-based Segmentation and Object Detection. In: NIPS (2009)
8. Kohli, P., Ladicky, L., Torr, P.H.S.: Robust Higher Order Potentials for Enforcing Label Consistency. In: IJCV (2009)
9. Komodakis, N., Paragios, N., Tziritas, G.: MRF Optimization via Dual Decomposition: Message-passing Revisited. In: ICCV (2005)
10. Ladicky, L., Russell, C., Kohli, P., Torr, P.H.S.: Associative Hierarchical CRFs for Object Class Image Segmentation. In: ICCV (2009)
11. Ladicky, L., Russell, C., Kohli, P., Torr, P.H.S.: Graph Cut Based Inference with Cooccurrence Statistics. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 239–253. Springer, Heidelberg (2010)
12. Leistner, C., Saffari, A., Bischof, H.: MIForests: Multiple-Instance Learning with Randomized Trees. In: ECCV (2009)
13. Leistner, C., Saffari, A., Santner, J., Bischof, H.: Semi-supervised Random Forests. In: ICCV (2009)
14. Rose, K.: Deterministic annealing, constrained clustering, and optimization. In: IJCNN (1998)
15. Rother, C., Kolmogorov, V., Minka, T., Blake, A.: Cosegmentation of Image Pairs by Histogram Matching - Incorporating a Global Constraint into MRFs. In: CVPR (2006)
16. Saffari, A., Leistner, C., Godec, M., Santner, J., Bischof, H.: On-line Random Forests. In: OLCV (2009)
17. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. In: ECCV (2006)
18. Vezhnevets, A., Buhmann, J.: Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. In: CVPR (2010)
19. Vicente, S., Kolmogorov, V., Rother, C.: Joint optimization of segmentation and appearance models. In: ICCV (2009)
20. Vicente, S., Kolmogorov, V., Rother, C.: Cosegmentation revisited: Models and optimization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 465–479. Springer, Heidelberg (2010)
21. Viola, P., Platt, J.C., Zang, C.: Multiple instance boosting for object detection. In: NIPS (2005)
22. Warrell, J., Prince, S., Torr, P.H.S.: StyP-Boost: A Bilinear Boosting Algorithm for Learning Style-Parameterized Classifiers. In: BMVC (2010)
23. Woodford, O.J., Rother, C., Kolmogorov, V.: A Global Perspective on MAP inference for Low-Level Vision. In: ICCV (2009)
24. Zha, Z., Hua, X., Mei, T., Wang, J., Qi, G., Wang, Z.: Joint multi-label multi-instance learning for image classification. In: CVPR (2008)
25. Zhou, Z., Zang, M.: Multiple-Instance Multi-Label Learning with application to Scene Classification. In: NIPS (2006)
26. Zhou, Z., Sun, Y., Li, Y.: Multiple-Instance Learning by Treating Instances as Non-I.I.D. Samples. In: ICML (2009)
Branch and Bound Strategies for Non-maximal Suppression in Object Detection Matthew B. Blaschko Visual Geometry Group Department of Engineering Science University of Oxford United Kingdom
Abstract. In this work, we are concerned with the detection of multiple objects in an image. We demonstrate that typically applied objectives have the structure of a random field model, but that the energies resulting from non-maximal suppression terms lead to the maximization of a submodular function. This is in general a difficult problem to solve, which is made worse by the very large size of the output space. We make use of an optimal approximation result for this form of problem by employing a greedy algorithm that finds one detection at a time. We show that we can adopt a branch-and-bound strategy that efficiently explores the space of all subwindows to optimally detect single objects while incorporating pairwise energies resulting from previous detections. This leads to a series of inter-related branch-and-bound optimizations, which we characterize by several new theoretical results. We then show empirically that optimal branch-and-bound efficiency gains can be achieved by a simple strategy of reusing priority queues from previous detections, resulting in speedups of up to a factor of three on the PASCAL VOC data set as compared with serial application of branch-and-bound.
1 Introduction
Non-maximal suppression has been employed in many settings in vision and image processing. In image processing, objectives for edge and corner detection have been specified in terms of the eigenvalues of a matrix containing local oriented image statistics [12], while more recently general objectives for object detection have been trained discriminatively [28,5,21,2,4,3,9]. Often, an objective function specifies a property of interest in image coordinates, but it is the arg maximum of the objective rather than scalar values that is of importance. From this perspective, an ideal objective would place all its mass on the true location and give zero output elsewhere. In practice, this is rarely the case, and the function output consists instead of hills and valleys characterizing intermediate belief in the fitness of a given location. Discriminative training of detection models can lead to the need for non-maximal suppression as more confident detections will have higher peaks than less confident ones. Without non-maximal suppression the next best-scoring detections will almost certainly be located on the upper slope of the peak corresponding with the most confident detection, while other peaks may be ignored entirely. One may interpret this as maximizing the log-likelihood of the detections assuming that they are independent, while in fact there is a strong spatial dependence on the scores of the output.

Here, we interpret commonly applied non-maximal suppression strategies as the maximization of a random-field model in which energies describing the joint distribution of detections are included. This insight enables us to characterize in general terms the maximization problem, and to make use of existing theoretical results on maximizing submodular (minimizing supermodular) functions. As a result, we can adopt an efficient optimization strategy with strong approximation guarantees. This is of particular interest as maximizing a submodular function is in general NP-hard. The resulting optimization problem can be solved by a series of inter-related optimizations. Here, we follow Lampert et al. and approach the optimization using a branch-and-bound strategy that enables fast detections of typically tens of milliseconds on a standard desktop machine [19].

The branch-and-bound strategy we consider here is a best first search that makes use of a priority queue to manage which regions of the space of detections to explore. Furthermore, the inter-related optimizations resulting from branch-and-bound have a very benign structure in that each problem can use intermediate results stored in the priority queue by the previous optimization. We show empirically that, while reuse of these results does not always give an optimal increase in speed, there is a very simple strategy for the selective reuse of intermediate results that does give optimal empirical performance. This is further illuminated by several theoretical results that motivate the strategy.

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 385–398, 2011. © Springer-Verlag Berlin Heidelberg 2011

1.1 Related Work
Viola and Jones developed one of the best studied and widely used generic detection algorithms [28]. A key step in their algorithm can be interpreted as non-maximal suppression, in which they cluster highly overlapping detections and represent clusters by only one detection. Thus, peaks in the detection landscape are compressed to a single detection, suppressing other output. A key question in such strategies is which metric to use when suppressing detections that are too close. A common approach in the recent object detection literature (e.g. [8,26,27]) is to make use of a detection specific overlap measure, such as the one used in the PASCAL VOC object detection challenge [7]. It has been noted that this overlap measure has several favorable properties compared to other measures, such as invariance to scale and translation [13]. Desai et al. have taken an interesting approach in which the joint distribution between object detections is modeled linearly given features capturing statistics of the joint distribution of objects [6]. The model is trained discriminatively, but without approximation guarantees due to the greedy optimization employed in a cutting plane training algorithm. Their subproblem shares key characteristics with our random field characterization of non-maximal suppression, and the explicit characterization of a tractable family of models is a key contribution of the present work.
The approaches cited above largely work by employing sliding windows or other window subsampling strategies, but alternatively, variants on Hough transform detections have also been used. Leibe et al. proposed a widely adopted model in which visual words vote for an object center [24]. Gall and Lempitsky have developed a state of the art detection framework using Hough forests [9]. Lehmann et al. have recently presented a line of work that extends these models to efficient detection [22,23] where the second citation uses branch-and-bound for optimization of detection. The present work in contrast is agnostic to the exact model employed, and the branch-and-bound framework we employ has been applied to several variants of non-linear models that cannot be represented using Hough transforms [20]. Barinova et al. have proposed a principled method of non-maximal suppression that can be interpreted as an explicit approximation to a full probabilistic model [1]. Their work is to our knowledge the first to couple approximation results for the maximization of submodular functions with object detection. Their work, however, is (i) restricted to models for which one can build a Hough image whereas the class of functions for which we can design a practical bound is more general, and (ii) their approach is restricted to very low dimensional detection parametrizations because Hough images are expensive to build for more than a few dimensions. Such an approach additionally must recompute a Hough image after each detection, while the proposed non-maximal suppression model can reuse the same data-structures (such as integral images [28,20]) for subsequent detections. 
Maximization of a submodular function with monotonic properties is common to many problems in computer science, from robotics [14] to social network analysis [16] and sensor networks [11,18], and has been studied extensively in the operations research literature (a toolbox by Andreas Krause contains many of the algorithms developed there [17]). Branch and bound has been employed to find optimal solutions to the (in general) NP-hard problem [10], but has not, to our knowledge, been applied to greedy optimization of supermodular functions with optimal approximation guarantees, as in this work. The variety of problems that share the same structure promises that analogous optimization approaches to that proposed in this work may have wider application across computer science domains.
2 The Energy
We consider a very general class of joint energy functions that contains both an appearance model of the object class of interest, as well as terms incorporating beliefs about the joint distribution of object detections. These latter terms may be the result of a learning procedure, a prior over the joint positions of objects [6], or a set of constraints chosen a priori to disallow detections that have high overlap. We consider energies of the form
$$\max_{y} \sum_{i} \langle f, \phi(x, y_i) \rangle_{\mathcal{H}} - \Omega(y). \tag{1}$$
Here we consider Ω that factorizes into pairwise terms as well as higher order terms
$$\Omega(y) = \sum_{ij} \Omega(y_i, y_j) + \underbrace{\sum_{c \in C} \Omega_c(y_c)}_{\text{higher order terms}} \tag{2}$$
where x is an image, $y_i$ is an object detection¹, y is a collection of detections, φ is a joint kernel map, f is a function living in the RKHS defined by φ, Ω is a penalization term for detections that overlap too closely, and c ∈ C is a clique in the set of cliques contributing to the energy. In principle, higher order terms that are supermodular (see Section 3) do not affect the analysis in this paper. For simplicity, we will not treat them explicitly in the sequel. We note that this form of energy for the detection of multiple objects may occur in diverse settings, such as object detection test time inference, detection cascades, and inference for cutting plane training of structured output learning [2,15].
3 Minimization of a Supermodular Function
Many optimization approaches to random field models, such as graph cuts, rely on the submodularity of a function to be minimized. In the context of image segmentation, this is reflected in a general principle that neighboring pixels are likely to share the same label. Non-maximal suppression, however, enforces the exact opposite effect: neighboring detections are likely to have different labels, at least when the appearance term indicates an object is likely to be present in the vicinity. In particular Equation (1) is the maximization of a submodular (minimization of a supermodular) function. Submodularity holds for a set function if for any two subsets of detections, A and B such that
$$A \subset B \tag{3}$$
the following holds
$$f(A \cup \{y\}) - f(A) \geq f(B \cup \{y\}) - f(B). \tag{4}$$
This is easy to show as
$$f(A \cup \{y\}) - f(A) = \langle f, \phi(x, y) \rangle_{\mathcal{H}} - \sum_{i \in A} \Omega(y_i, y) \tag{5}$$
$$\langle f, \phi(x, y) \rangle_{\mathcal{H}} - \sum_{i \in A} \Omega(y_i, y) \geq \langle f, \phi(x, y) \rangle_{\mathcal{H}} - \sum_{i \in B} \Omega(y_i, y) \tag{6}$$
$$0 \geq - \sum_{i \in B \setminus A} \Omega(y_i, y). \tag{7}$$
¹ In the sequel we pay particular attention to detections parametrized by bounding boxes.
Supermodular higher order terms in Equation (2) will be negated, resulting in submodularity. Equation (1) is therefore very difficult to optimize globally for multiple detections as maximizing a submodular (minimizing a supermodular) function is in general NP hard. As our proposed optimization methodology is based on branch-and-bound, the practical constraints of its application to global optimization are key. Branch and bound ceases to be efficient due to the curse of dimensionality for approximately 6 or more dimensions. While a bounding box provides a low (four) dimensional parametrization for single object detection, joint optimization of even two boxes leads to a combinatorial explosion of the complexity of the algorithm and is infeasible already for relatively small images. However, as has been exploited by Barinova et al. [1], strong theoretical results about the maximization of submodular functions indicate that a greedy approach gives optimal approximation guarantees for submodular energies [25]. Consequently, our optimization strategy will be to find the best detection without taking into account the non-maximal suppression terms, and then iteratively find subsequent detections, taking into account non-maximal suppression terms only with previously selected detections. The next section addresses the specific implications of this approach for branch and bound strategies, in particular how the structure of the problem can be exploited to improve the computational efficiency of subsequent detections.
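The greedy strategy just described can be sketched in Python over an explicit, finite list of candidate boxes. This is only an illustration: it replaces branch-and-bound with exhaustive scoring, and the names `greedy_detections`, `iou`, `unary` and `penalty`, as well as the toy boxes and the 0.5 overlap threshold, are assumptions made for the example rather than part of the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def greedy_detections(candidates, unary, penalty, k):
    """Pick up to k boxes, each maximizing unary score minus penalties
    with respect to the boxes that were already selected."""
    selected, remaining = [], list(candidates)
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for y in remaining:
            gain = unary(y) - sum(penalty(s, y) for s in selected)
            if gain > best_gain:
                best, best_gain = y, gain
        if best is None:  # every remaining candidate is infinitely penalized
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores for three boxes; the second overlaps the first heavily.
scores = {(0, 0, 10, 10): 5.0, (1, 1, 11, 11): 4.0, (20, 20, 30, 30): 3.0}
hard = lambda a, b: float("inf") if iou(a, b) >= 0.5 else 0.0
print(greedy_detections(list(scores), scores.get, hard, k=3))
```

The first pick ignores the penalty entirely because no detection has been selected yet, which matches the description of finding the best detection without non-maximal suppression terms.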
4 Branch and Bound Implementations
Efficient subwindow search (ESS) is a branch and bound framework for object detection that works by storing sets of windows in a priority queue [19,20]. Sets of windows are specified by intervals indicating the minimum and maximum coordinates of the four sides of the bounding box, and are ordered by an upper bound on the maximum score of any window within the set. This upper bound, $\hat{f}$, must satisfy two properties in order to guarantee the optimality of the result:
$$\hat{f}(Y) \geq f(y) \quad \forall y \in Y \tag{8}$$
$$\hat{f}(\{y\}) = f(y) \tag{9}$$
where Y is a set of bounding boxes specified by intervals for the sides of the box, and y is an individual window. The first property states that the upper bound is a true bound, while the second states that the score for a set containing exactly one window should be the true score of the window. Given these properties, when a state containing only one window is dequeued, we are guaranteed that this window has the maximal score of all windows in the image. As we are pursuing a greedy optimization strategy, we wish to be able to compute upper bounds of the augmented quality function that contains both the unary terms, and the pairwise non-maximal suppression terms. Here, we discuss how to do so for a class of pairwise terms that are monotonic functions of the ratio of the areas of intersection and union of the two windows [7]
$$\Omega(y_i, y_j) = g\left( \frac{\mathrm{Area}(y_i \cap y_j)}{\mathrm{Area}(y_i \cup y_j)} \right) \tag{10}$$
where g is any non-negative monotonic function. Consequently, for the kth detection we require an upper bound for
$$\langle f, \phi(x, y_k) \rangle_{\mathcal{H}} - \sum_{i=1}^{k-1} \Omega(y_i, y_k) \tag{11}$$
where detections are ordered by their selection by the greedy optimization strategy. We may do so by taking the sum of two bounds, that of the unary terms, the construction of which is discussed for a number of linear and non-linear function classes in [20], and that of the non-maximal suppression term. The bound on the non-maximal suppression terms can be computed as
$$\max_{y \in Y} -g\left( \frac{\mathrm{Area}(y_i \cap y)}{\mathrm{Area}(y_i \cup y)} \right) \leq -g\left( \frac{\min_{y \in Y} \mathrm{Area}(y_i \cap y)}{\max_{y \in Y} \mathrm{Area}(y_i \cup y)} \right) \tag{12}$$
$$\leq -g\left( \frac{\min_{y \in Y} \mathrm{Area}(y_i \cap y)}{\left(\max_{y \in Y} \mathrm{Area}(y)\right) + \mathrm{Area}(y_i) - \left(\min_{y \in Y} \mathrm{Area}(y_i \cap y)\right)} \right) \tag{13}$$
The computation of the bounds for area of overlap requires only constant time given sets of windows specified by intervals. A key property of greedy optimization of bounds of this form is that the objective for subsequent detections differs only by the subtraction of one additional Ω term. Since Ω is non-negative, this means that any valid bound for an earlier detection remains a valid upper bound for a subsequent detection (Equation (8)). This suggests that the computation required to find an earlier detection may be leveraged to more efficiently discover subsequent detections by keeping the priority queue expanded by an earlier detection. We also note, however, that Equation (9) may be violated if we simply continue the ESS branch-and-bound procedure without modification. This is because a state may be pushed into the priority queue containing only one window, but that does not consider non-maximal suppression terms resulting from detections discovered after that state was pushed into the queue. 
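The constant-time overlap bound of Equations (12) and (13) can be sketched as follows. This is a hedged illustration: the box convention (left, top, right, bottom), the interval representation of a window set, and all function names are assumptions made for the example, and g is assumed to be non-decreasing. The sketch relies on the observation that every box in the set contains the smallest box (when that box is non-empty) and is contained in the largest one.

```python
def area(box):
    """Area of a (left, top, right, bottom) box; inverted boxes count as empty."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def intersection_area(a, b):
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def suppression_upper_bound(box_set, prev, g):
    """Upper bound on -g(IoU(prev, y)) over all boxes y in box_set.

    box_set: ((l_lo, l_hi), (t_lo, t_hi), (r_lo, r_hi), (b_lo, b_hi))
    prev:    a previously selected detection with positive area
    g:       non-negative, non-decreasing penalty on the overlap ratio
    """
    (l_lo, l_hi), (t_lo, t_hi), (r_lo, r_hi), (b_lo, b_hi) = box_set
    largest = (l_lo, t_lo, r_hi, b_hi)   # contains every box in the set
    smallest = (l_hi, t_hi, r_lo, b_lo)  # contained in every box (may be empty)
    min_inter = intersection_area(prev, smallest)  # lower bound on the intersection
    max_area = area(largest)                       # upper bound on Area(y)
    ratio_lower = min_inter / (max_area + area(prev) - min_inter)
    return -g(ratio_lower)


prev = (0, 0, 10, 10)
loose_set = ((0, 2), (0, 2), (8, 10), (8, 10))
print(suppression_upper_bound(loose_set, prev, g=lambda r: r))
```

For a set that contains a single window, the smallest and largest boxes coincide, so the ratio equals the true overlap and the bound is tight, in line with Equation (9).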
We can account for this by modifying the ESS algorithm in two ways: (i) we augment a state in the priority queue to store not only the upper bound and intervals specifying the set of bounding boxes, but also to store the number of previous detections considered in the computation of the upper bound, and (ii) we modify the termination criterion to check that the number of detections used for computation of the upper bound is equal to the number of detections found up to that point. If not, the bound is recalculated using all previous detections, and the state is re-inserted into the queue. We make a further assumption on the form of g for the purposes of subsequent analysis:
$$g(x) = \begin{cases} 0 & \text{if } x < \gamma \\ \infty & \text{otherwise} \end{cases} \tag{14}$$
where γ is a threshold on the overlap score (e.g. 0.5) above which multiple detections are disallowed. This results in the same non-maximal suppression criterion as used in recent state of the art detection strategies [8,26,27].
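The two modifications can be illustrated on a deliberately simplified one-dimensional problem. In the Python sketch below, candidate "windows" are integer positions, a state is an interval of positions, and a previous detection d suppresses every position p with |p - d| < radius (a hard penalty in the spirit of Equation (14)). These simplifications, and all names, are assumptions made for the example; the unary bound is computed by a plain maximum rather than in constant time.

```python
import heapq


def upper_bound(scores, lo, hi, dets, radius):
    """Valid upper bound on score minus penalty over positions lo..hi."""
    for d in dets:
        # The whole interval lies within the suppression radius of d.
        if max(abs(lo - d), abs(hi - d)) < radius:
            return float("-inf")
    return max(scores[lo:hi + 1])


def detect_with_queue_reuse(scores, k, radius):
    dets = []
    last = len(scores) - 1
    # State: (negated bound, lo, hi, number of detections used for the bound).
    heap = [(-upper_bound(scores, 0, last, dets, radius), 0, last, 0)]
    while heap and len(dets) < k:
        neg_bound, lo, hi, n_used = heapq.heappop(heap)
        if neg_bound == float("inf"):
            break  # everything left in the queue is suppressed
        if lo == hi:
            if n_used < len(dets):
                # Stale bound: recompute with all detections and re-insert.
                fresh = upper_bound(scores, lo, hi, dets, radius)
                heapq.heappush(heap, (-fresh, lo, hi, len(dets)))
            else:
                dets.append(lo)  # bound is exact and maximal: accept
            continue
        mid = (lo + hi) // 2
        for a, b in ((lo, mid), (mid + 1, hi)):
            child = upper_bound(scores, a, b, dets, radius)
            heapq.heappush(heap, (-child, a, b, len(dets)))
    return dets


print(detect_with_queue_reuse([1, 5, 4, 0, 3, 2], k=3, radius=2))
```

The queue is never reset between detections here. States expanded while searching for one detection stay in the heap with bounds that remain valid, and only single-position states are checked for staleness before being accepted.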
[Figure: directed graph with a start node, an end node, and intermediate nodes labelled by the costs C0j and Cij.]
Fig. 1. Mapping of the selection of an optimal strategy to a shortest path problem. The resulting graph is constructed here for four detections. Horizontal moves correspond to keeping an existing priority queue for a subsequent detection, while diagonal moves correspond to resetting the priority queue to the root node containing the set of all bounding boxes. Cij corresponds to the cost of computing the jth detection using the priority queue carried on from the ith detection. C0j corresponds to the cost when resetting the priority queue prior to computing the jth detection. All edges pointing towards a given node have the same cost. This construction demonstrates that the complexity of computing the optimal strategy, given the branch-and-bound costs, is O(n^2) for n detections (see text). These costs are not known at test time, but we show empirically that optimal strategies have a very simple form (Section 6).
With these modifications, we can define a family of branch-and-bound strategies for multiple object detections. For each subsequent detection, a strategy may either reset the priority queue to contain a single state containing all possible windows in an image, or it may use a priority queue expanded from a previous detection (Figure 1). Each of these strategies will result in the same set of detections. Consequently, the goal is to determine a strategy or subset of strategies that reduces the expected computation time (see footnote 2) of all detections. We fix the number of detections to 10 in this work and note that a strong pattern is apparent in the empirically observed computation times, indicating that results are likely to generalize to other numbers of detections in real data.

Footnote 2: We use here the number of dequeuing operations required as a platform-independent measure of the computation time. We note in particular that the bound computation is constant for the family of Ω considered here, making this a natural unit of measurement.

5 Theoretical Results

Branch and bound can be characterized as a best-first search strategy over a DAG whose nodes are isomorphic to a Hasse diagram with direction assigned by set inclusion. We use the notation $\mathcal{Y}$ to indicate the maximal (root) element of the Hasse diagram containing all possible windows, $Y$ to indicate a set of windows ($Y \subset \mathcal{Y}$, $|Y| > 1$), and $y$ to indicate an individual window ($y \in \mathcal{Y}$). In practice, only a subset of possible edges is considered, corresponding to those such that $Y$ can be represented by intervals. Furthermore, we consider a deterministic rule for splitting $Y$ into two subsets following [19]. We denote the set of nodes visited by the best-first search from the root node with an upper bound $\hat{f}$ as $S_{\hat{f}} \subset P(\mathcal{Y})$, where $P(\mathcal{Y})$ denotes the power set of $\mathcal{Y}$.

Theorem 1. For valid upper bounds $\hat{f}_1$ and $\hat{f}_2$,

$$\hat{f}_1(Y) \ge \hat{f}_2(Y) \;\; \forall Y \implies S_{\hat{f}_2} \subseteq S_{\hat{f}_1} \tag{15}$$
Proof. Best-first search expands all nodes with upper bound greater than the value of the true detection $f(y^*)$. $\hat{f}_2(Y) \ge f(y^*) \implies \hat{f}_1(Y) \ge f(y^*)$, but there may be additional $Y$ for which $\hat{f}_2(Y) < f(y^*) \wedge \hat{f}_1(Y) \ge f(y^*)$.

Corollary 1. $S_{\hat{f}_k} \subseteq S_{\hat{f}_i}$, where $k > i$ and $\hat{f}_k$ is a bounding function for the greedy optimization subproblem corresponding to detection $k$.

Corollary 1 implies that there is a strict ordering of the number of nodes expanded by different objectives. As any priority queue expanded up to the point of an earlier detection will contain elements computed with a loose upper bound, we conclude that there is a potential computational advantage to resetting the priority queue to the root node for a subsequent detection. However, we also note that if the values of the function change only slightly, there will be a computational overhead to expanding the same nodes over again. Consequently, there may instead be a computational advantage to keeping an existing priority queue. Stated simply, if we reset the queue to the root node we may have to re-expand nodes that had already been expanded in the previous round. If we do not reset the queue, we may have to go through a large number of nodes that have been expanded, but violate the non-maximal suppression condition in Equation (14).

Theorem 2. The number of nodes to be re-expanded on reset of a queue for detection $k$ is upper bounded by the sum of nodes expanded by other strategies up to that point.

Proof. Nodes that have been previously expanded in round $i$ can be categorized as belonging to one of two groups: (i) those for which $\hat{f}_i(Y) \ge f(y^*) \wedge \hat{f}_k(Y) \ge f(y^*)$ and (ii) those for which $\hat{f}_i(Y) \ge f(y^*) \wedge \hat{f}_k(Y) < f(y^*)$. All nodes in the first case will be expanded by both strategies, while nodes in the second case will be expanded by the previous detections, but not by the current detection.
The proof of Theorem 2 also indicates that in subsequent rounds after a reset, the marginal number of nodes to be expanded is strictly ordered: the older the priority queue, the more nodes will need to be expanded. This implies that once a priority queue has been reset and expanded until a subsequent detection is found, it will be superior to keep using that priority queue rather than one expanded from a previous set of detections.
These theoretical results indicate that for n detections, there are at most $2^{n-1}$ possible strategies of interest: for each detection after the first, we may either keep the existing priority queue with all expanded states, or we may reset the queue to the root node. If we were to know ahead of time all costs associated with a given choice, we could use a single-source shortest path algorithm to determine the optimal strategy. Figure 1 shows a mapping of the problem to a graph for four detections. As the graph is a DAG, the complexity of this procedure is O(V), where V is the number of vertices (each vertex has at most two outgoing edges, keep or reset, so the number of edges is also O(V)). For our graph construction, $V = \frac{n(n+1)}{2} + 2 = O(n^2)$, resulting in an overall complexity of $O(n^2)$ for n detections. This allows us post hoc to efficiently determine the optimal strategies in our empirical analysis.

This result unfortunately does not allow us to determine the lowest cost approach without precomputing all costs. Possible approaches would be to compute the empirical costs of these strategies for a sample of data, or to use a branch-and-bound strategy in the shortest path algorithm to avoid computing all edge costs. However, we show in Section 6 that all optimal strategies selected by this analysis on the PASCAL VOC data set have a simple form. This form consists of resetting the queue for a fixed number of initial detections, and then keeping the resulting priority queue without any resets for all subsequent detections. In practice, this indicates that only n − 1 of the possible $2^{n-1}$ strategies are of interest.
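The post hoc shortest path computation can be sketched as a dynamic program over the graph of Figure 1. The sketch assumes that all costs have already been measured and are supplied in a hypothetical dictionary `costs`, where `costs[(r, j)]` is the cost of detection j when the queue was last reset just before detection r (so `costs[(j, j)]` plays the role of C0j, and `costs[(r, j)]` with r < j is read here as Crj). This indexing is an interpretation of the figure, not notation from the paper.

```python
def optimal_reset_strategy(costs, n):
    """Cheapest keep/reset schedule for n detections.

    Returns (total cost, list of detections before which the queue is reset).
    The first detection always starts from the root, so the list begins with 1.
    """
    # best[r] = cheapest (total, resets) over paths ending at node (r, j).
    best = {1: (costs[(1, 1)], [1])}
    for j in range(2, n + 1):
        cheapest_total, cheapest_resets = min(best.values(), key=lambda t: t[0])
        # Horizontal move: keep the queue that was last reset before detection r.
        nxt = {r: (total + costs[(r, j)], resets)
               for r, (total, resets) in best.items()}
        # Diagonal move: reset the queue to the root before detection j.
        nxt[j] = (cheapest_total + costs[(j, j)], cheapest_resets + [j])
        best = nxt
    return min(best.values(), key=lambda t: t[0])
```

Each of the n(n+1)/2 nodes is visited once and does a constant amount of work apart from the per-column minimum, so the running time is quadratic in n, in line with the complexity stated above.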
6 Empirical Results
We present results for a modified implementation of the publicly available ESS code described in [20]. We use the feature extraction and trained models downloaded from the author’s webpage. All results are reported on the test set of the PASCAL VOC 2007 data set [7], with a different objective trained for each of the 20 classes. Figure 2 shows the number of splits required for several selected classes, as well as the average across all classes for varying values of γ (Equation (14)). Figure 3 shows the number of splits conditioned on the presence or absence of the class of interest averaged across all classes. Table 1 shows statistics of the optimal strategy found by a shortest path search. For all classes, the optimal strategy consists of resetting the priority queue to the root node for a number of initial detections followed by re-using the existing priority queue for all subsequent detections. Table 2 shows the ratio of the amount of computation required by two simple strategies compared to the optimal strategy.

Table 1. Statistics of the number of resets to the root node required by optimal strategies. Statistics are reported across classes.

          γ = 0.25   γ = 0.50   γ = 0.75
min           3          2          1
median        4          3          2
max           4          4          3
[Figure: twelve panels. (a)-(c) Aeroplane, (d)-(f) Cat, (g)-(i) Train, (j)-(l) All classes; within each group, γ = 0.25, 0.50 and 0.75 respectively.]
Fig. 2. Number of splits per subsequent detection when resetting the priority queue at different detections vs. keeping an existing priority queue. x-axis: detection number, y-axis: average number of splits across all images in the VOC2007 test set.
Table 2. Ratios of the amount of computation required by two simple strategies to the optimal strategy. The first, naïve strategy consists of resetting the priority queue to the root node at each subsequent detection. The second strategy consists of keeping a single priority queue for all detections without any resets to the root node. Statistics are reported across classes.

               γ = 0.25               γ = 0.50               γ = 0.75
          all reset  no reset    all reset  no reset    all reset  no reset
min          1.36      1.17         1.38      1.16         1.59      1.14
median       1.48      1.22         1.52      1.20         2.04      1.16
max          1.94      1.28         2.19      1.28         3.15      1.20

[Figure: six panels. (a)-(c) +, for γ = 0.25, 0.50 and 0.75; (d)-(f) −, for γ = 0.25, 0.50 and 0.75.]
Fig. 3. Number of splits per subsequent detection when resetting the priority queue at different detections vs. keeping an existing priority queue. x-axis: detection number, y-axis: average number of splits across all images and classes in the VOC2007 test set conditioned on the presence or absence of an object of interest (denoted + and −, respectively).
7 Discussion
Several broad conclusions can be drawn from the experiments reported in Section 6. The first, and most important for practical application of branch-and-bound to object detection with non-maximal suppression, is that there is a regime in which resetting the priority queue is more efficient than keeping an existing queue. However, after a few detections, ranging from one to four depending on the class of interest (Table 1), it is better to keep an existing priority queue for all subsequent detections. The proof of Theorem 2 indicates that more recently reset priority queues are always preferable to older queues. This has advantages, both in terms of the simplicity of the set of useful strategies, and in terms of reducing memory usage.

Varying behaviors were found when using differing values for γ. In general, the lower the value of γ (more strict non-maximal suppression), the more likely resetting the priority queue is beneficial. As γ increases from 0.25 to 0.75, the median number of resets taken by the optimal strategy for a given class decreases from 4 to 2. This makes intuitive sense, as lower values of γ result in strictly higher numbers of nodes in the search graph that will be suppressed in subsequent branch-and-bound optimizations. A large number of expanded nodes around a peak will result in wasted computation as they are subsequently pruned by non-maximal suppression. Conversely, the higher the overlap threshold (less strict non-maximal suppression), the more likely keeping the existing priority queue is helpful.

Conditioning on the class label does not seem to show a large difference in the average number of splits per detection (Figure 3). This supports the idea that strategies may be fixed ahead of time.

The marginal cost of the first detection after resetting the priority queue to the root node is not strictly increasing (see e.g. Figure 2(d)), but is empirically observed to do so for many classes, and in the average performance across all classes (Figures 2(j)-2(l)). This result is in line with Theorem 2, which says that the upper bound on subsequent detections is increasing. This is especially apparent after the first few detections.
Finally, Table 2 indicates that of the simple strategies consisting of either always resetting the priority queue or never resetting the priority queue, it is preferable to never reset the priority queue. Our experiments showed that the amount of required computation for 10 detections was higher for each class and overlap threshold when using the resetting strategy than the simple strategy of always keeping the same priority queue.
8 Conclusions
Commonly applied non-maximal suppression strategies can be interpreted as optimization of a random field model in which non-maximal suppression is captured by pairwise terms encoding the joint distribution of object detection. We have shown in this work how to adapt a branch-and-bound strategy to optimize jointly over multiple detections with non-maximal suppression terms. An optimal approximation result allowed us to frame this as the subsequent application of inter-related branch-and-bound optimizations, enabling us to reuse computations across multiple detections. It is possible to frame the search for a computationally optimal strategy as a shortest path problem on a DAG with $O(n^2)$ vertices, resulting in efficient post hoc computation of the optimal strategies. We have observed that these strategies have a very simple form: although every length n − 1 bit string encodes a valid strategy, resulting in $2^{n-1}$ possible strategies, all empirically optimal strategies consisted of first resetting the priority queue for a small number of detections, followed by keeping an existing priority queue. Furthermore, simply keeping a single priority queue for all detections resulted in only a modest increase in the total amount of required computation over the optimal strategy. This indicates that simple strategies can significantly improve computational performance over the naïve application of branch-and-bound in serial.

Acknowledgments. This work is supported by the Royal Academy of Engineering through a Newton International Fellowship.
References
1. Barinova, O., Lempitsky, V., Kohli, P.: On the detection of multiple object instances using Hough transforms. In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
2. Blaschko, M.B., Lampert, C.H.: Learning to localize objects with structured output regression. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 2–15. Springer, Heidelberg (2008)
3. Blaschko, M.B., Vedaldi, A., Zisserman, A.: Simultaneous object detection and ranking with weak supervision. In: Proc. NIPS (2010)
4. Blaschko, M.B., Lampert, C.H.: Object localization with global and local context kernels. In: BMVC (2009)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. CVPR (2005)
6. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class layout. In: Proc. ICCV (2009)
7. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
8. Felzenszwalb, P.F., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: Proc. CVPR (2008)
9. Gall, J., Lempitsky, V.: Class-specific Hough forests for object detection. In: Proc. CVPR (2010)
10. Goldengorin, B., Sierksma, G., Tijssen, G.A., Tso, M.: The data-correcting algorithm for the minimization of supermodular functions. Management Science 45(11), 1539–1551 (1999)
11. Guestrin, C., Krause, A., Singh, A.: Near-optimal sensor placements in Gaussian processes. In: International Conference on Machine Learning, ICML (August 2005)
12. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. of The Fourth Alvey Vision Conference, pp. 147–151 (1988)
13. Hemery, B., Laurent, H., Rosenberger, C.: Comparative study of metrics for evaluation of object localisation by bounding boxes. In: Fourth International Conference on Image and Graphics, ICIG 2007, pp. 459–464 (August 2007)
14. Hollinger, G., Singh, S.: Proofs and experiments in scalable, near-optimal search by multiple robots. In: Robotics: Science and Systems (June 2008)
15. Joachims, T., Finley, T., Yu, C.N.J.: Cutting-plane training of structural SVMs. Mach. Learn. 77(1) (2009)
16. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: Proc. KDD (2003)
17. Krause, A.: SFO: A toolbox for submodular function optimization. Journal of Machine Learning Research 11, 1141–1144 (2010)
18. Krause, A., Guestrin, C., Gupta, A., Kleinberg, J.: Near-optimal sensor placements: Maximizing information while minimizing communication cost. In: International Symposium on Information Processing in Sensor Networks (IPSN) (April 2006)
19. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: Proc. CVPR (2008)
20. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Efficient subwindow search: A branch and bound framework for object localization. PAMI 31, 2129–2142 (2009)
21. Laptev, I.: Improvements of object detection using boosted histograms. In: Proc. ECCV (2006)
22. Lehmann, A., Leibe, B., Van Gool, L.: Feature-centric efficient subwindow search. In: Proc. ICCV (2009)
23. Lehmann, A., Leibe, B., Van Gool, L.: Fast PRISM: Branch and bound Hough transform for object class detection. International Journal of Computer Vision, 1–23 (2010)
24. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with implicit shape model. In: ECCV Workshop on Statistical Learning in Comp. Vision (2004)
25. Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions. Mathematical Programming 14, 265–294 (1978)
26. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. In: Proc. ICCV (2009)
27. Vedaldi, A., Zisserman, A.: Structured output regression for detection with partial occlusion. In: Proc. NIPS (2009)
28. Viola, P., Jones, M.J.: Robust real-time object detection. In: IJCV (2002)
Metrics, Connections, and Correspondence: The Setting for Groupwise Shape Analysis
Carole Twining (1), Stephen Marsland (2), and Chris Taylor (1)
(1) Imaging Science (ISBE), University of Manchester, Manchester, UK
(2) School of Engineering and Advanced Technology (SEAT), Massey University, Palmerston North, New Zealand
Abstract. This paper considers the general problem of the analysis of groups of shapes, and the issue of correspondence in that context. Many papers have been published on the topic of pairwise shape distances and pairwise shape similarity measures. However, most of these approaches make the implicit assumption that the methods developed for pairs of shapes are then sufficient when it comes to the problem of analyzing groups of shapes. In this paper, we consider the general case of pairwise and groupwise shape analysis within an infinite-dimensional Riemannian framework. We show how the issue of groupwise or pairwise shape correspondence is inextricably linked to the issue of the metric. We discuss how data-driven approaches can be used to find the optimum correspondence, and demonstrate how different choices of objective function lead to different groupwise correspondence, and why this matters in terms of groupwise modelling of shape.
1 Introduction
It is generally agreed that a “shape” is what is left when the “nuisance” degrees of freedom corresponding to translation, scaling, and rotation (i.e., pose) have been filtered away, as was proposed by Kendall [8]. Underlying this definition is the idea that the shape itself is an object with an infinite number of degrees of freedom, such as a continuous curve or surface. The analysis of shape then comprises the related problems of representing the infinite-dimensional curves, defining metric distances between different shapes, finding optimal correspondences between shapes, and performing statistical analysis on the manifold of shapes. In Fig. 1 we identify these problems within the general framework, and then in the concrete cases of pairwise and groupwise shape analysis. The last of these, which has been generally neglected, is probably the most important for real shape analysis.

General Shape Analysis:
• Define a suitable representation for our shapes, hence a space of shapes.
• Define an energy or distance function on that space, a measure of the similarity between any two such shapes.
Pairwise Shape Analysis:
• To provide a continuous, optimal path, that allows interpolation between the two shapes.
• To give a quantitative measure of the degree of similarity.
Groupwise Shape Analysis:
• To provide a method of interpolation across a training set of shapes.
• To provide a means of analysing the statistics of the training set of shapes.
Fig. 1. The key aspects of general, pairwise, and groupwise shape analysis

Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 399–412, 2011. © Springer-Verlag Berlin Heidelberg 2011

In order to remove the transformation group of rotation, translation, and scaling, a method of shape alignment (such as Procrustes analysis) is often defined a priori. In fact, the same approach has been taken to defining a correspondence between shapes as well, and this causes problems. Defining a correspondence between shapes makes assumptions about how best to interpolate between shapes. When such an a priori choice is applied to interpolation between a pair of shapes, the result may not reflect what is actually seen when we come to consider a group of shapes. This is shown in Fig. 2, where two choices of correspondence lead to different interpolated shapes. We contend that the issue of correspondence should be left open, and the various alternative hypotheses considered, so that the result of interpolation may then be chosen to agree with what is seen from the entire group of shapes.
Fig. 2. For the same pair of shapes (left & right), different correspondences (indicated by colours) lead to different hypotheses as to the intermediate shapes
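The effect illustrated in Fig. 2 can be reproduced with a small sketch. It assumes shapes sampled as polygons with the same number of points, uses plain linear interpolation of corresponding points (a stand-in chosen for simplicity, not a geodesic of any metric discussed in this paper), and changes the correspondence by a cyclic shift of the point ordering; the function name is hypothetical.

```python
import numpy as np


def midpoint_shape(a, b, shift=0):
    """Halfway shape when point i of a is matched to point i + shift of b.

    a, b: arrays of shape (N, 2) sampling two closed curves.
    """
    b_matched = np.roll(b, -shift, axis=0)  # re-index b to change correspondence
    return 0.5 * (a + b_matched)


square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
same = midpoint_shape(square, square, shift=0)       # identical to the square
collapsed = midpoint_shape(square, square, shift=2)  # every point at (0.5, 0.5)
```

The two end shapes are the same set of points in both calls, yet the intermediate shape is either the square itself or a single point, depending only on the correspondence.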
In this paper we present a framework for shape analysis, and place many of the recent papers on the topic within that framework. We will first show how the question of shape correspondence is inextricably linked to the use of a Riemannian metric, both in terms of the tangent space, and in terms of the associated Levi-Civita connection, and then discuss how to optimise correspondence, both pairwise and groupwise.
2 The Metric
We start by considering parameterised shapes, where a shape is described by a function $c(\theta) \in \mathbb{R}^n$, where $\theta = \{\theta_\alpha : \alpha = 1, \ldots, m\}$ represents a point in the m-dimensional parameter space $M$. All that we require is that there is a continuous, one-to-one correspondence between the shape and the parameter space, so that $\{c_{\theta_\alpha} \doteq \partial c / \partial \theta_\alpha\}$ are not all zero (i.e., no pleats). The mapping from the parameter space to $\mathbb{R}^n$ can be either a smooth immersion (self-intersections in $\mathbb{R}^n$ permitted) or a smooth embedding (self-intersections in $\mathbb{R}^n$ not permitted). In order to define a Riemannian metric between shapes, we need to consider the tangent space to our space of parameterised shapes at the shape $c$, the space of all possible infinitesimal deformations of $c(\theta)$. This can be represented by:

$$c(\theta) \Rightarrow c_\epsilon(\theta) \doteq c(\theta) + \epsilon h(\theta), \quad \epsilon \ll 1, \tag{1}$$

where $h(\theta)$ is a continuous and (piecewise) smooth $\mathbb{R}^n$-valued vector field (see footnote 1), defined everywhere on $M$. This hence represents the infinitesimal deformation of our original shape in the direction defined by $h(\theta)$. This gives the most general description possible of deforming a shape, in that the deformation of every point on the original shape is given, with the only constraint being that the deformation is smooth and continuous. However, this very generality means that translations (constant vector fields), rotations, scalings, and re-parameterisations of the original curve (purely tangential vector fields) are also included as elements of the space of vector fields.

It is important to note that this definition defines a point-to-point correspondence between two infinitesimally-separated shapes $c$ and $c_\epsilon$, given by the value of the parameter, so that $c(\theta)$ corresponds to $c_\epsilon(\theta)$. This mapping can also always be made one-to-one, because even if, for some value of $\epsilon$, there are points where $c_{\epsilon,\theta_\alpha} = c_{\theta_\alpha} + \epsilon h_{\theta_\alpha} = 0 \;\forall \alpha$, we can always adjust $\epsilon$ so that it is not the case, since there is always some $\alpha$ such that $c_{\theta_\alpha} \neq 0$.
Fig. 3. Correspondence between two curves, $c$ and $c_\epsilon$, showing how the vector field changes $h \to h_\psi$ as $c_\epsilon$ is reparameterised

Footnote 1: In what follows, we will often absorb the factor of $\epsilon$ into the definition of $h$, hence take $c(\theta) + h(\theta)$ as our deformed shape.
We can also change the correspondence between the infinitesimally-close shapes, by reparameterising the curve $c_\epsilon$: $c_\epsilon(\theta) \to c_\epsilon(\psi(\theta))$ (see Fig. 3). Hence changing the correspondence between the two shapes means changing the vector field $h \to h_\psi$, whilst keeping the set of points $\{c_\epsilon(\theta) : \theta \in M\}$ fixed. A Riemannian metric on our space of shapes is then an assignment of an inner product between elements of the tangent space at $c$, denoted by $G_c(h, k)$. The energy and length of the path segment between the (infinitesimally) separated shapes $c(\theta)$ and $c(\theta) + h(\theta)$ are then given by:

$$\delta\mathrm{Energy} = G_c(h, h), \qquad \delta\mathrm{Length} = \left(G_c(h, h)\right)^{1/2}. \tag{2}$$
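As a concrete illustration of Equation (2), the following sketch evaluates the energy and length of an infinitesimal deformation numerically. It assumes one particular choice of inner product, the arc-length-weighted $H^0$ form $\int \langle h, k \rangle |c_\theta| \, d\theta$ that is discussed later in this section, a closed planar curve sampled at N points, and a central-difference estimate of $|c_\theta| \, d\theta$; the function names are hypothetical.

```python
import numpy as np


def h0_inner_product(c, h, k):
    """Discrete G_c(h, k) = sum_i <h_i, k_i> ds_i on a closed sampled curve.

    c, h, k: arrays of shape (N, 2); h and k are vector fields along c.
    """
    # Central-difference estimate of |c_theta| d(theta) at each sample.
    ds = 0.5 * np.linalg.norm(
        np.roll(c, -1, axis=0) - np.roll(c, 1, axis=0), axis=1)
    return float(np.sum(np.sum(h * k, axis=1) * ds))


def delta_energy(c, h):
    """Energy of the path segment from c to c + h, as in Equation (2)."""
    return h0_inner_product(c, h, h)


def delta_length(c, h):
    """Length of the path segment from c to c + h, as in Equation (2)."""
    return float(np.sqrt(delta_energy(c, h)))


theta = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
translation = np.tile([1.0, 0.0], (1000, 1))  # constant vector field
energy = delta_energy(circle, translation)    # close to the perimeter, 2*pi
```

For the unit circle and a unit translation field, the energy approximates the perimeter of the curve, which shows that under this inner product a pure translation has non-zero cost unless pose is removed beforehand.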
We take as an exemplar (open or closed) parametric planar shapes (PPS). Defining our basic notation:

Curves and vectors in the Argand plane: $c(\theta), h(\theta) \in \mathbb{R}^2 = \mathbb{C}$.
Arc-length: $(ds)^2 \equiv |c(\theta + d\theta) - c(\theta)|^2$.
$M$ = unit line or circle for open or closed curves: $\theta \in [0, 1]$ or $\theta \in [0, 2\pi]$.
Derivatives: $D_s \equiv \frac{\partial}{\partial s}$, $D_\theta = \frac{\partial}{\partial \theta}$, $D_\theta f(\theta) \equiv f_\theta$.
Unit tangent vector: $v_c = c_s = \frac{c_\theta}{|c_\theta|}$. Unit normal to curve: $n_c = i \cdot v_c$.

Following Michor & Mumford [12,11], and Younes et al. [14], we can then define the local and almost-local metrics:

$$\text{Local:} \quad G_c(h, k) = \int \langle Lh(\theta), Lk(\theta) \rangle \, d\mu(\theta), \tag{3a}$$

$$\text{Almost local:} \quad G_c(h, k) = \int \Phi_c(\theta) \langle Lh(\theta), Lk(\theta) \rangle \, d\mu(\theta), \tag{3b}$$

where $d\mu(\theta)$ is our integration measure (which may involve the total length of the curve $l_c$), $\Phi_c(\theta)$ is a function of $l_c$ and the curvature $\kappa_c$ (where $\kappa_c n_c \doteq D_s v_c(s)$), and $L$ is some differential operator, such as $(D_s)^n$. The inner product $\langle \cdot, \cdot \rangle$ will usually be the Euclidean inner product on $\mathbb{R}^2$. The shortest path $c(\theta, t)$ (the geodesic) between two finitely-separated shapes $c(\theta, 0)$ and $c(\theta, 1)$ is then given by minimising the total energy (see footnote 2):

$$\int_0^1 G_{c(\theta,t)}(c_t, c_t) \, dt, \tag{4}$$

with respect to variations of the path $c(\theta, t)$, whilst keeping the endpoints fixed. If we take an algebraic approach, we will need to compute the variation of terms such as the metric, as we vary the path $c(\theta, t)$. It is convenient to define the variational, Gâteaux derivative:

$$D_{c,m} G_c(h, k) \doteq \left. \frac{d}{d\lambda} \right|_{\lambda = 0} G_{c + \lambda m}(h, k), \tag{5}$$

where $\lambda m(\theta)$ represents an infinitesimal variation of the curve $c(\theta)$ in the direction $m(\theta)$. We can then define derivatives of the metric $H_c$ and $K_c$, where:

$$G_c(K_c(m, h), k) \doteq D_{c,m} G_c(h, k), \qquad G_c(m, H_c(h, k)) \doteq D_{c,m} G_c(h, k). \tag{6}$$

The formal solution is then given by the geodesic equation:

$$c_{tt} = \frac{1}{2} \left( H_c(c_t, c_t) - 2 K_c(c_t, c_t) \right). \tag{7}$$

Footnote 2: As in the case of finite-dimensional Riemannian geometry, the total length and total energy have the same minimizer, and the total energy is simpler to work with.
The path ahead in the pairwise case now seems simple: we make our choice of metric, solve the geodesic equation (either numerically [12] or algebraically if possible [14]), and hence compute the geodesic between any pair of shapes. Given the geodesic path between the shapes, it is then trivial to compute the length, hence assess their level of shape similarity. However, we have forgotten the issue of correspondence. As noted previously (see Fig. 3), even for the same sequence of physical shapes, altering the correspondence alters the vector fields, and hence will in general give a different distance between shapes.

2.1 Global and Local Re-parameterisations
Consider a general path on the space of curves $c(\theta, t)$, and suppose that we now consider a global re-parameterisation of the path, so that $c(\theta, t) \to c(\psi(\theta), t)$. This will leave the correspondence unchanged along the path, and the vector fields unchanged, but will change the integration measure, and hence the path energy, unless we choose a metric that is invariant under re-parameterisation. That is, $G_c(h, k)$ is invariant under the transformation $\theta \to \psi(\theta)$. This invariance is important, otherwise we can reduce the energy cost of finite deformations to zero by simply identifying a suitable choice of re-parameterisation (see [11], §3.1). However, even for such a re-parameterisation invariant metric, the path energy will not be invariant under a local re-parameterisation, $\psi(\theta, t)$. Since the re-parameterisation function is different at different points along the path, it changes the correspondence and the vector fields $c_t$ (see Fig. 3), and thus in general, it changes the path energy. Hence using re-parameterisation-invariant metrics does not solve the pairwise correspondence problem.

It would seem a sensible course to nail the issue of correspondence down at the start, and this is the approach taken by Michor & Mumford. Their approach rests on the observation that a purely tangential vector field generates not changes of shape of a curve $c$, but changes of parameterisation of $c$. A general vector field can be decomposed into tangential and normal components, where:

$$h(\theta) = h^\perp(\theta) + h^\parallel(\theta), \qquad h^\parallel(\theta) \doteq \langle h(\theta), v_c(\theta) \rangle v_c(\theta), \qquad h^\perp(\theta) \doteq \langle h(\theta), n_c(\theta) \rangle n_c(\theta).$$

This means that purely normal vector fields are perpendicular to the directions in shape space that represent pure re-parameterisations, hence always perpendicular to the orbit of a shape under the action of the diffeomorphism group of the parameter space $M$. They then use what we will call the normality prescription, where they project out the degrees of freedom in the vector fields that correspond to re-parameterisation, replacing the inner product in $\mathbb{R}^n$ with a term that depends only on the normal components. For the $H^0$ metric:

$$\langle h, k \rangle \to \langle h^\perp, k^\perp \rangle \equiv \langle h, n_c \rangle \langle k, n_c \rangle.$$
This normality prescription is not without some undesirable effects. In [11], Michor and Mumford state that they were looking for the simplest Riemannian metric, the obvious candidate being the re-parameterisation-invariant $H^0$ metric:

$$G^0_c(h, k) \doteq \int \langle h(\theta), k(\theta) \rangle |c_\theta| \, d\theta \equiv \int \langle h(s), k(s) \rangle \, ds,$$

which can be re-written in terms of the arc-length parameter $s$ and which is obviously related to the simple sum-of-squared-distances metric for polygonal shapes represented by a finite number of points. However, if we apply the normality prescription, then this metric goes horribly wrong (see [11], §3.10), and all geodesic distances can be reduced to zero! It was this problem with the $H^0$ metric that led Michor & Mumford to consider higher-order derivative terms (local metric (3a)), or the addition of curvature-dependent terms (almost-local metric (3b)), whilst still keeping the normality prescription.

What the normality prescription does, in effect, is assign an equal cost to all possible correspondences, by ignoring the tangential component that generates such changes of correspondence. This is not the same as choosing one particular way of assigning correspondence, which would be equivalent to assigning an infinite cost to all other correspondences. This freedom to move along shapes is one way of seeing why the geodesic distances can be reduced to zero: points can move from one shape to the second for zero cost if they can complete the journey by sliding along a shape, with saw-teeth added that stretch between the two shapes, since such intermediate shapes are allowed under the prescription. Our contention is that, since the normality prescription does not work for the $H^0$ metric, it is unwise to continue with it.

But perhaps the correspondence problem can be improved by moving to a different representation of shape? We mention here two alternatives. The first is the elastic approach [6,7,15], which rather than the shape function $c(s)$, uses the speed function $c_s$.
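In particular, the square-root-elastic (SRE) representation uses the variable and metric:
$$q(\theta) \doteq \frac{c_\theta(\theta)}{\sqrt{|c_\theta(\theta)|}}, \qquad G^{SRE}(\Delta q_1, \Delta q_2) \doteq \int \langle \Delta q_1, \Delta q_2\rangle\, d\theta, \tag{8}$$
where $\Delta q_1, \Delta q_2$ are elements of the tangent space to the space of speed functions. However, we can re-write this in terms of the elements of the tangent space to the space of shape functions, to find that:
$$G^{SRE} \Rightarrow \int \frac{1}{|c_\theta|}\left[\langle h_\theta, k_\theta\rangle - \frac{3}{4}\langle h_\theta, v_c\rangle \langle k_\theta, v_c\rangle\right] d\theta.$$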
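The step from the speed-function form to the shape-function form can be checked directly. The following is a sketch of that calculation, assuming (as the name of the representation suggests) that $q = c_\theta/\sqrt{|c_\theta|}$ and that $v_c = c_\theta/|c_\theta|$ denotes the unit tangent; $h$ and $k$ are the shape perturbations that induce $\Delta q_1$ and $\Delta q_2$ respectively.

```latex
\begin{align*}
\Delta q_1 &= \frac{d}{d\epsilon}\bigg|_{\epsilon=0}
   \frac{c_\theta + \epsilon h_\theta}{|c_\theta + \epsilon h_\theta|^{1/2}}
 = \frac{1}{|c_\theta|^{1/2}}
   \left[h_\theta - \tfrac{1}{2}\langle h_\theta, v_c\rangle v_c\right], \\
\langle \Delta q_1, \Delta q_2\rangle
 &= \frac{1}{|c_\theta|}\left[\langle h_\theta, k_\theta\rangle
   - \left(\tfrac12 + \tfrac12 - \tfrac14\right)
     \langle h_\theta, v_c\rangle\langle k_\theta, v_c\rangle\right] \\
 &= \frac{1}{|c_\theta|}\left[\langle h_\theta, k_\theta\rangle
   - \tfrac34 \langle h_\theta, v_c\rangle\langle k_\theta, v_c\rangle\right].
\end{align*}
```

The factor $3/4$ therefore arises from the two cross terms (each contributing $1/2$) less the product of the two tangential corrections (contributing $1/4$).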
Hence we see that the SRE metric is just a particular combination of re-parameterisation-invariant terms involving just the first derivatives of the vector field, chosen so as to give the metric a simple form in the space of speed functions. The issue of correspondence is as in the case of parameterised shape, and we note that in [7], Joshi et al. find the detailed correspondence and the path between a pair of shapes by explicit optimisation.
The second, and more intriguing, approach is that given by the use of conformal mappings [13]. For any simple closed planar shape, there always exists a conformal (angle-preserving) mapping from the interior of the shape to the unit disc. This defines a conformal parameterisation of the shape, in terms of the mapping between points on the curve and points on the unit circle. We can also consider the inverted shape (for a shape $c \subset \mathbb{C}$, and a point $z_0$ in the interior of the shape, the inverted shape is given by $\frac{1}{c - z_0}$), and find the conformal parameterisation of that. In general, the conformal parameterisation of the shape and the inverted shape are different, and the difference between the two is a diffeomorphism of the unit circle. Sharon & Mumford [13] show that it is possible to (almost) uniquely reconstruct a shape purely from knowledge of this element of the diffeomorphism group of the unit circle, and hence establish a way to represent shapes in terms of this diffeomorphism group. They can then apply metrics on the diffeomorphism group to generate a metric on the space of shapes, and also, from the group multiplication, obtain an intriguing multiplication of two shapes to give a third shape. This method has been generalized by Lui et al. [9] to the case of planar objects with other topologies. However, the entire method still rests on favouring a particular method of parameterisation over any other, and is limited to purely planar shapes.
Rather than trying to consider each method of shape representation and each method of defining a distance between shapes on a case-by-case basis, we will instead move on to consider the general geometric setting for shape distances.
3 Correspondence and Connections
In this section, we start by discussing what it is about shapes and spaces of shapes that makes them distinct from other spaces. In particular, we will focus on the notion of spatial localization. Let us suppose we have some general method of shape representation and a shape space $S$, shapes $c \in S$, and a Riemannian metric $G_c(\cdot, \cdot)$ defined on such a space. By general, we mean that our shape representation should be such that it can also represent infinitesimal, localized deformations of any permitted shape; this can be thought of in terms of growing an infinitesimal bump or pit at any point on any shape. For a finite-dimensional representation of shape, such as the simple polygonal or spline-based representations, this requirement becomes the ability to move only a single point, plus the ability to increase the number of points used in the representation as required. We will then associate such bumps and pits with localized elements of the tangent space to the space of shapes, which we will refer to as bump vectors. For a general shape, growing a
bump at one of two distinct points on the shape should be recognized as distinct directions in the tangent space, since they generate distinct shapes in the finite limit. Hence we will refer to a shape $c$, where $A$ and $B$ are distinct points on the shape, and distinct elements of the tangent space at $c$, $k_A \in T_c S$ and $k_B \in T_c S$, which correspond to growing a localized bump at point $A$ or at point $B$. It is important to note that our intuitive ideas of shape and shape change rest on the notion of locality. In particular, we have the idea that, in general, spatially-separated small perturbations of a single shape represent distinct degrees of freedom, provided these perturbations are sufficiently far apart. In terms of the metric and our bump vectors, this can usefully be stated in the form that $G_c(k_A, k_B) \to 0$ as $|A - B|_{\mathbb{R}^n}$ increases³. Hence we restrict ourselves to shape metrics that have some notion of spatial locality and localization. The local and almost-local metrics in (3a) & (3b) obviously have this property, and we will consider a more non-local metric later in this section.
For a general space, the tangent spaces at two distinct points are not equivalent, just as the tangent plane to a sphere at the pole is a different plane to the tangent plane at a point on the equator. A connection provides a recipe (called parallel transport) for mapping elements of the tangent space at one point into elements of the tangent space at any other point. For the particular case of a Riemannian metric, there is a unique (torsion-free) connection (the Levi-Civita connection) that preserves the metric. In terms of the Gâteaux derivatives of the metric we defined earlier (6), this connection is given by:
$$\Gamma_c(h, k) \doteq \frac{1}{2}\left(H_c(h, k) - K_c(h, k) - K_c(k, h)\right), \tag{9}$$
where it should be noted that this formula is general, and not specific to the case of a parametric representation of shape. An element $k$ of the tangent space at the point $c$ can then be parallel-transported by an infinitesimal amount in the direction $h$, to give the element of the new tangent space, which can be written as:
$$k \in T_c S \;\to\; k + \Gamma_c(h, k) \in T_{c+h} S. \tag{10}$$
Tangent space vectors can then be parallel-transported a finite distance along paths in the space by integrating up the above result, and in general, the exact result will depend on the path chosen, even with fixed endpoints⁴.
We can now see how this construction of parallel-transport applies to the issue of correspondence between shapes. If we take a bump vector $k_A$ on one shape, we can parallel-transport this tangent-space vector along a path between shapes, and hence generate the corresponding element of the tangent space on our second shape. If this new element is also localized, then its location provides us with a (rough) correspondence between the shapes. We take as our example a metric on parametric shapes, where the connection can be computed in closed form. In this case, we already know the answer
³ That is, they become orthogonal at sufficient spatial separation.
⁴ This dependence of the result of parallel transport on the exact path taken is one definition of the curvature of the underlying manifold.
as to the correspondence between shapes we expect to recover: it is just the correspondence given by parameter value. We take a translation-invariant $H^1$ metric:
$$G_c(h, k) = \int \frac{1}{|c_\theta|}\langle h_\theta, k_\theta\rangle\, d\theta,$$
which is also re-parameterisation invariant. Unlike the $H^1$ metric used by Younes [15], it does not include the extra factor of $\frac{1}{l_c}$, which would make it scale-invariant. To compute the Gâteaux derivative (5), we note that the only piece that varies is the $|c_\theta|$ term, which gives:
$$D_{c,m} G_c(h, k) = -\int \frac{1}{|c_\theta|^2}\langle v_c, m_\theta\rangle \langle h_\theta, k_\theta\rangle\, d\theta.$$
To compute the connection, we take a specific form for the bump vector $k(\theta)$ (a top-hat function, given by a constant vector $\alpha$ between $\theta_0$ and $\theta_1$, and zero elsewhere), so that:
$$k_\theta(\theta) = \alpha\left(\delta(\theta - \theta_0) - \delta(\theta - \theta_1)\right).$$
If we then also let $\theta_1 \to \theta_0$ (so that terms such as $f(\theta_1) - f(\theta_0)$ can be taken to vanish in the limit), then using the definitions (6) & (9), we find that
$$\Gamma_c(h, k)(\theta) = \frac{1}{2|c_\theta|}\left[\langle v_c, h_\theta\rangle \alpha + \langle v_c, \alpha\rangle h_\theta - \langle \alpha, h_\theta\rangle v_c\right](\theta) \ \text{ if } \theta = \theta_0, \text{ else } 0.$$
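The Gâteaux derivative of the $H^1$ metric used in this calculation can be checked numerically. The sketch below discretises a closed curve, evaluates the metric on two arbitrary tangent vectors, and compares a finite-difference derivative in the direction $m$ with the closed-form expression. The ellipse, the particular fields $h$, $k$, $m$, the grid size and the function names are all illustrative assumptions, and $v_c$ is taken to be the unit tangent $c_\theta/|c_\theta|$.

```python
import math

N = 400
DTHETA = 2 * math.pi / N

def d_theta(f):
    """Central-difference derivative of a periodic list of 2D points."""
    return [((f[(i + 1) % N][0] - f[i - 1][0]) / (2 * DTHETA),
             (f[(i + 1) % N][1] - f[i - 1][1]) / (2 * DTHETA)) for i in range(N)]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def h1_metric(c, h, k):
    """Discretised G_c(h, k): integral of <h_theta, k_theta> / |c_theta| d theta."""
    ct, ht, kt = d_theta(c), d_theta(h), d_theta(k)
    return sum(dot(a, b) / math.hypot(*t) for t, a, b in zip(ct, ht, kt)) * DTHETA

def gateaux_formula(c, m, h, k):
    """Closed form: minus the integral of <v_c, m_theta> <h_theta, k_theta> / |c_theta|^2."""
    ct, mt, ht, kt = d_theta(c), d_theta(m), d_theta(h), d_theta(k)
    total = 0.0
    for t, mi, a, b in zip(ct, mt, ht, kt):
        speed = math.hypot(*t)
        v = (t[0] / speed, t[1] / speed)      # unit tangent v_c (assumed definition)
        total -= dot(v, mi) * dot(a, b) / speed ** 2
    return total * DTHETA

thetas = [i * DTHETA for i in range(N)]
c = [(2 * math.cos(t), math.sin(t)) for t in thetas]          # an ellipse
h = [(math.cos(2 * t), 0.0) for t in thetas]                  # tangent vectors at c
k = [(math.cos(2 * t), math.sin(3 * t)) for t in thetas]
m = [(math.cos(t), math.sin(2 * t)) for t in thetas]          # direction of differentiation

eps = 1e-6
c_plus = [(p[0] + eps * q[0], p[1] + eps * q[1]) for p, q in zip(c, m)]
c_minus = [(p[0] - eps * q[0], p[1] - eps * q[1]) for p, q in zip(c, m)]
finite_difference = (h1_metric(c_plus, h, k) - h1_metric(c_minus, h, k)) / (2 * eps)
closed_form = gateaux_formula(c, m, h, k)
```

Both quantities use the same discrete derivative, so any disagreement beyond finite-difference error would indicate a sign or power error in the closed form.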
Hence, as we might have expected, any change in $k(\theta)$ under parallel transport is localized at $\theta_0$. Note also that if the change of shape $h(\theta)$ is locally a translation ($h_\theta(\theta_0) = 0$), then there is no change in $k(\theta)$ under parallel transport, which reflects the fact that the metric is an $H^1$ metric. In general, since the result contains terms in the three directions $\alpha$, $v_c(\theta_0)$, and $h_\theta(\theta_0)$, the direction of $k(\theta)$ may change under the transport, even though the foot-point remains unchanged.
Finally, we consider an extension to the metrics we have considered so far. In [5], Glaunès et al. considered a non-local curve-matching energy term for finitely-separated curves (derived from a norm on the space of currents), which was incorporated within the large deformation diffeomorphic mapping framework. This energy was of the form:
$$E(c, c') = F(c, c) - 2F(c, c') + F(c', c'), \tag{11a}$$
$$F(c, c') \doteq \int d\theta \int d\phi\; K(c(\theta), c'(\phi))\, \langle c_\theta, c'_\phi\rangle, \tag{11b}$$
where $c(\theta)$ and $c'(\phi)$ are two parametric curves, and $K(x, y) \equiv K(|x - y|)$ is a kernel function (such as a Gaussian). If we take an infinitesimal difference of curves, $c' = c + h$, and make the approximation that:
$$K(c(\theta) + h(\theta), c(\phi) + h(\phi)) \approx K(c(\theta), c(\phi)),$$
then we obtain the final non-local Riemannian metric in the form:
$$G_c(h, k) = \int d\theta \int d\phi\; K(c(\theta), c(\phi))\, \langle h_\theta(\theta), k_\phi(\phi)\rangle.$$
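The passage from the energy (11) to this metric can be made explicit. The following is a sketch of the calculation, assuming that the same frozen-kernel approximation is also applied to the cross term $F(c, c')$ and that $K$ is symmetric in its arguments; it gives the quadratic form $G_c(h, h)$, from which the bilinear form $G_c(h, k)$ follows by polarisation.

```latex
\begin{align*}
E(c, c+h) &\approx \int d\theta \int d\phi\; K(c(\theta), c(\phi))
  \Big[\langle c_\theta, c_\phi\rangle
       - 2\langle c_\theta, c_\phi + h_\phi\rangle
       + \langle c_\theta + h_\theta, c_\phi + h_\phi\rangle\Big] \\
 &= \int d\theta \int d\phi\; K(c(\theta), c(\phi))
  \Big[\langle h_\theta, c_\phi\rangle - \langle c_\theta, h_\phi\rangle
       + \langle h_\theta, h_\phi\rangle\Big] \\
 &= \int d\theta \int d\phi\; K(c(\theta), c(\phi))\,
    \langle h_\theta(\theta), h_\phi(\phi)\rangle
  \;=\; G_c(h, h).
\end{align*}
```

The two terms linear in $h$ cancel after exchanging the integration variables $\theta \leftrightarrow \phi$, since $K(c(\theta), c(\phi)) = K(c(\phi), c(\theta))$.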
This is translation and re-parameterisation invariant, and an obvious generalization of the $H^1$ metric that we considered previously (and a similar generalization can obviously be applied to the other metrics considered earlier (3a) & (3b)). Note that the original energy term for finitely-separated curves does not involve an explicit correspondence between the curves based on parameter value, and was in fact invariant to local re-parameterisations. However, when we made the simplifying assumption to replace $K(c, c')$ etc. by $K(c, c)$, we removed this invariance, and instead replaced it by the correspondence according to parameter that we had in the cases of the local and almost-local metrics. It might seem that we have taken a case without correspondence, and put it back in by hand! This is not quite the case: the original formulation assigned an explicit correspondence based on points in the plane. The positioning of the curves in the plane then allowed the distance between points on the two curves to act to establish the notion of locality and the meaning of local differences in shape between the curves. The optimisation over diffeomorphisms of the plane that Glaunès et al. [5] then use to match curves is the equivalent of the optimisation over correspondence that we propose. We note that other formulations (such as shape representation using distance maps) also employ point-to-point correspondence across the plane as an alternative to point-to-point correspondence between shapes.
4 Optimising Correspondence
We have seen from the previous section that the question of correspondence is inextricably tied up with the use of Riemannian metrics on shape spaces. There are then three possible approaches to dealing with this issue:
(1) Define a method of determining correspondence a priori. Examples would be basing correspondence on equal fractional arc-length, or the use of the conformal parameterisation that formed part of the work in [13]. The problems are that this choice is essentially arbitrary, and that a method that gives sensible interpolation for pairs of shapes from one class may not give suitable results for shapes from a different class.
(2) Try to factor-out these degrees of freedom. This is essentially the approach taken in the normality prescription case (see §2.1), where all possible correspondences are assigned an equal weight. But as we have already noted, the simplest $H^0$ metric fails in this case, which does not seem a desirable result.
(3) Determine the optimum correspondence in a data-driven fashion. This can then obviously be extended to find the optimum pose. In the pairwise case, given the absence of any other data, the only information we have that distinguishes between different correspondences is the geodesic distance itself. Hence, it would seem sensible to allow this to define the optimum pairwise correspondence, despite the complication of a further optimisation step. This was the approach taken by Joshi et al. [7], in the case of the square-root-elastic metric.
Fig. 4. A set of training shapes, correspondence indicated by colour, for two different correspondences (Left: correct, Right: arc-length). Bottom: The Euclidean mean.

Table 1. Mean and spread of distances to mean shape for two choices of correspondence, and using two different metrics.

Metric   Correct: Mean (Spread)   Arc-Length: Mean (Spread)
Eucl.    2.03 (1.37)              0.68 (0.11)
SRE      2.10 (1.39)              1.05 (0.25)
When we come to the groupwise case, why can we not just take the pairwise correspondences defined as above to create a groupwise correspondence? The problem is that, in general, the correspondence defined between shapes A and B, and between A and C, will not agree with that defined between B and C. Geometrically, this is because parallel-transport around a closed loop gives a result other than the identity (which is a definition of curvature). This question did not arise in earlier work (such as Davies et al. [3]), since the Riemannian metric used there was Euclidean. The obvious solution is to define correspondence via some reference shape, which makes the groupwise correspondence consistent by construction, the obvious candidate being the Karcher mean. The mean shape also gives us another advantage, in that we can use length/area on the mean in order to define our integration measure (3a), as was done in [3]. This now gives us our basic framework for groupwise shape analysis. However, we still have to define the objective function that we are going to use to define the optimum correspondence. The simplest suggestion is to just repeat the procedure we used in the pairwise case, and take the sum of geodesic distances to the mean (the compactness) to define both the mean (for fixed correspondence), and the optimum correspondence. However, this repeated-pairwise approach does not always work.
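The statement that parallel transport around a closed loop need not return the identity can be illustrated on a much simpler curved space than a shape space. The sketch below uses the unit sphere as a stand-in (an assumption made purely for illustration) and transports a tangent vector around a triangle formed by three quarter great-circle arcs. Transport is approximated by the standard first-order scheme of projecting onto each new tangent plane and restoring the length; all names are hypothetical.

```python
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def transport(v, path):
    """Approximate Levi-Civita transport of v along points on the unit sphere:
    at each step, project v onto the new tangent plane and restore its length."""
    length = math.sqrt(sum(x * x for x in v))
    for p in path:
        radial = sum(a * b for a, b in zip(v, p))
        v = tuple(a - radial * b for a, b in zip(v, p))
        v = tuple(length * x for x in normalise(v))
    return v

def arc(start, end, steps):
    """Quarter great-circle arc between two orthogonal unit vectors (endpoint included)."""
    return [tuple(math.cos(t) * s + math.sin(t) * e for s, e in zip(start, end))
            for t in (0.5 * math.pi * i / steps for i in range(1, steps + 1))]

pole, a, b = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
steps = 2000
loop = arc(pole, a, steps) + arc(a, b, steps) + arc(b, pole, steps)

v0 = (1.0, 0.0, 0.0)            # tangent vector at the pole
v1 = transport(v0, loop)        # the same vector after one trip around the loop
cosine = sum(x * y for x, y in zip(v0, v1))
angle = math.acos(max(-1.0, min(1.0, cosine)))
```

The vector returns to the pole rotated by a quarter turn, which equals the solid angle enclosed by the loop. On a flat (Euclidean) space the same procedure would return the vector unchanged, which is why the inconsistency of pairwise correspondences does not arise there.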
To give an example, consider the set of curves shown in Fig. 4, where we have used the Euclidean metric, and integration measure computed on the mean shape [3]. We take two different correspondences, the correct correspondence (on the left), and arc-length correspondence (on the right). The correct correspondence gives the mean as just another bump, whereas the arc-length case gives a shape unlike any seen in the training set. However, if we follow the same colour across examples, we see that distances from the mean will be larger for the correct correspondence, since the straight-line portions of the curve have to stretch and compress to accommodate the motion of the bump, as well as the motion associated with the bump itself. In the arc-length case, although the mean is not a bump, movements are minimal, hence this correspondence will be measured as being more compact. For the same group of shapes, we also repeated the analysis using the speed function representation and the SRE metric, as in (8). The distances to their respective means, and the spread of values, are given in Table 1. It can be seen that, for both metrics, compactness fails as a means of choosing the correspondence that is in accord with the mode of variation seen in the input shape data.
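The effect described above can be reproduced qualitatively with a toy construction. The sketch below is not the data of Fig. 4 or Table 1: it uses flat curves carrying a top-hat bump at five positions, a discretised Euclidean ($L^2$) distance, and a fixed parameterisation by horizontal position as a simple stand-in for the arc-length correspondence. All shapes, constants and function names are illustrative assumptions.

```python
import math

N = 1000                      # samples per curve
H, W = 0.1, 0.1               # bump height and width (illustrative values)
CENTRES = [0.2, 0.35, 0.5, 0.65, 0.8]
TS = [(i + 0.5) / N for i in range(N)]

def fixed_param(p):
    """Points indexed by horizontal position: the bump occupies different indices in each shape."""
    return [(t, H if abs(t - p) < W / 2 else 0.0) for t in TS]

def tracking_param(p):
    """Indices 0.4..0.6 always lie on the bump; the flat parts stretch or compress to fit."""
    left, right = p - W / 2, p + W / 2
    pts = []
    for t in TS:
        if t < 0.4:
            pts.append((left * t / 0.4, 0.0))
        elif t <= 0.6:
            pts.append((left + W * (t - 0.4) / 0.2, H))
        else:
            pts.append((right + (1.0 - right) * (t - 0.6) / 0.4, 0.0))
    return pts

def mean_shape(shapes):
    """Pointwise (Euclidean) mean for a fixed correspondence."""
    return [(sum(s[i][0] for s in shapes) / len(shapes),
             sum(s[i][1] for s in shapes) / len(shapes)) for i in range(N)]

def distance(a, b):
    """Discretised L2 distance between two sampled curves."""
    return math.sqrt(sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                         for p, q in zip(a, b)) / N)

def summarise(shapes):
    """Return (sum of distances to the mean, peak height of the mean shape)."""
    mean = mean_shape(shapes)
    return sum(distance(s, mean) for s in shapes), max(y for _, y in mean)

fixed_total, fixed_peak = summarise([fixed_param(p) for p in CENTRES])
tracking_total, tracking_peak = summarise([tracking_param(p) for p in CENTRES])
```

In this toy setting the bump-tracking correspondence yields a mean that is itself a full-height bump but a larger summed distance, whereas the fixed parameterisation yields a smaller summed distance and a mean with several low bumps, unlike any input shape.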
5 Discussion
In this paper we have provided a framework for shape analysis and discussed why the related problems of metric, correspondence, and connection all need to be selected in a data-dependent way. This is particularly clear from our last example. It could be argued that in this case a metric that gave greater weight to shape similarity based on curvature would give a better result. However, that would miss the essential point: what distinguishes the correct correspondence in this case is not curvature per se, but the commonality of structure across the group of shapes. The association of the edges of the bumps with regions of high curvature should then be seen as an accident of the artificial shape construction. From the point of view of modelling, it is obviously desirable that the mean should reflect the common structure seen across the group, and unless we have correctly identified the common structure across examples, we will be unable to correctly represent the variation of this structure. We note that the groupwise case is more complicated than the pairwise case, in that we do not want to just interpolate between pairs of examples, but across the whole sub-space in which the training data lies. It was for this reason that more sophisticated groupwise objective functions (such as MDL [3]) were introduced for the case of Euclidean shape spaces. The construction of such objective functions will be more complicated in the non-flat case, since we can no longer construct the simple pdf models on the shape space. Given this, we might ask why we need to use metrics other than the Euclidean one. One illustrative example is where parts of an object undergo motion which is a rotation (such as the thumb of a hand, see [2], Fig. 9.9). The MDL correspondence in this case tends to linearize the motion by allowing points on the tip to slide, which is not quite the correct correspondence from a physical point of view. We will be considering this further in the future.
We note that there do exist methods for modelling on non-flat shape spaces, and these entail constructing models on the tangent space at the mean (e.g., principal geodesic analysis [4]), which again shows that it is necessary that the mean itself is similar to the shapes seen in the group. Developing alternative objective functions to compactness, in the spirit of MDL, is the obvious next step, but is beyond the scope of the current paper. In §3 we identified that the method of Glaunès et al. used a rotationally invariant kernel $K(x, y)$. This kernel defines an inner product between parametric curves $c(\theta)$ and $c'(\phi)$ [10], which clearly induces a particular Riemannian metric on the space. This has been considered in the area of machine learning, where the kernel mapping of a Support Vector Machine performs essentially the same mapping. There, Burges [1] looked for locally invariant kernels under some symmetry and identified how the induced metric can be expressed in closed form. In future work we will follow up this line to identify whether it is possible to choose the kernel in a data-driven way for groupwise shape analysis.
Acknowledgements. Our thanks to S. H. Joshi for making available his MATLAB implementation of the SRE metric.
References
1. Burges, C.: Geometry and invariance in kernel based methods. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 89–116. MIT Press, Cambridge (1999)
2. Davies, R., Twining, C.J., Taylor, C.J.: Statistical models of shape: optimisation and evaluation. Springer, Heidelberg (2008)
3. Davies, R.H., Twining, C.J., Cootes, T.F., Taylor, C.J.: Building 3-D statistical shape models by direct optimization. IEEE Transactions on Medical Imaging 29(4), 961–981 (2010)
4. Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging 23(8), 995–1005 (2004)
5. Glaunès, J., Qiu, A., Miller, M.I., Younes, L.: Large deformation diffeomorphic metric curve mapping. International Journal of Computer Vision 80(3), 317–336 (2008)
6. Joshi, S.H., Klassen, E., Srivastava, A., Jermyn, I.: A novel representation for Riemannian analysis of elastic curves in R^n. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1643–1649 (2007)
7. Joshi, S.H., Klassen, E., Srivastava, A., Jermyn, I.H.: Removing shape-preserving transformations in square-root elastic (SRE) framework for shape analysis of curves. In: Yuille, A.L., Zhu, S.-C., Cremers, D., Wang, Y. (eds.) EMMCVPR 2007. LNCS, vol. 4679, pp. 387–398. Springer, Heidelberg (2007)
8. Kendall, D.G.: The diffusion of shape. Advances in Applied Probability 9(3), 428–430 (1977)
9. Lui, L.M., Zeng, W., Yau, S.-T., Gu, X.: Shape analysis of planar objects with arbitrary topologies using conformal geometry. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 672–686. Springer, Heidelberg (2010)
10. McLachlan, R.I., Marsland, S.: N-particle dynamics of the Euler equations for planar diffeomorphisms. Dynamical Systems 22(3), 269–290 (2007), http://www-ist.massey.ac.nz/smarsland/PUBS/DynSys07.pdf
11. Michor, P.W., Mumford, D.: Riemannian geometries on spaces of plane curves. Journal of the European Mathematical Society 8, 1–48 (2006)
12. Michor, P.W., Mumford, D.: An overview of the Riemannian metrics on spaces of curves using the Hamiltonian approach. Applied and Computational Harmonic Analysis 23, 74–113 (2007)
13. Sharon, E., Mumford, D.: 2D-shape analysis using conformal mapping. International Journal of Computer Vision 70(1), 55–75 (2006)
14. Younes, L., Michor, P.W., Shah, J., Mumford, D.: A metric on shape space with explicit geodesics. Rendiconti Lincei - Matematica e Applicazioni 9, 25–37 (2008)
15. Younes, L.: Computable elastic distances between shapes. SIAM Journal of Applied Mathematics 58(2), 565–586 (1998)
The Complex Wave Representation of Distance Transforms
Karthik S. Gurumoorthy, Anand Rangarajan, and Arunava Banerjee
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
{ksg,anand,arunava}@cise.ufl.edu
Abstract. The complex wave representation (CWR) converts unsigned 2D distance transforms into their corresponding wave functions. The underlying motivation for performing this maneuver is as follows: the normalized power spectrum of the wave function is an excellent approximation (at small values of Planck's constant, here a free parameter τ) to the density function of the distance transform gradients. Or in colloquial terms, spatial frequencies are gradient histogram bins. Since the distance transform gradients have only orientation information, the Fourier transform values mainly lie on the unit circle in the spatial frequency domain. We use the higher-order stationary phase approximation to prove this result and then provide empirical confirmation at low values of τ. The result indicates that the CWR of distance transforms is an intriguing and novel shape representation.
Keywords: distance transforms, Voronoi, Hamilton-Jacobi equation, Schrödinger wave function, complex wave representation (CWR), stationary phase (method of), gradient density, power spectrum.
1 Introduction
Over the past three decades, image analysis has borrowed numerous formalisms, methodologies and techniques from classical physics. These include variational and level-set methods for active contours and surface reconstruction [1,12], Markov Chain Monte Carlo (MCMC) [17], mean-field methods in image segmentation and matching [9,20,8], fluidic flow formulations for image registration [6], etc. Curiously, there has been very little interest in adapting approaches from quantum mechanics. This is despite the fact that linear Schrödinger equations are the quantum counterpart to nonlinear Hamilton-Jacobi equations [4] and the knowledge that the quantum approaches the classical as Planck's constant tends to zero [2].
The principal theme in this work is the introduction of complex wave representations (CWRs) of shapes. We begin by reconstructing the well known bridge between the Hamilton-Jacobi and Schrödinger equations as adapted to the problem of Euclidean distance transform computation. As expected, the familiar nonlinear, static Hamilton-Jacobi equation emerges from a linear, static Schrödinger equation in the limit as τ → 0. This paves the way for the complex wave representation (CWR) of distance transforms: here the wave function $\psi(x, y)$ is equal to $\exp\left(i\frac{S(x, y)}{\tau}\right)$, where $S(x, y)$ is the distance transform in 2D. Since distance transform gradients (when they exist) are unit vectors [13], their appropriate representation is the space of orientations. The centerpiece of this work is the following statement of equivalence: $|\Psi_\tau(u(\theta), v(\theta))|^2$, the squared magnitude of the normalized Fourier transform of $\psi(x, y)$, is approximately equal to the density function of distance transform gradients, with the approximation becoming increasingly accurate as τ tends to zero. We will prove this conjecture using the stationary phase approximation [11], a well known technique in asymptotic analysis. The significance for shape analysis is that spatial frequencies become histogram bins. This result also demonstrates that the well known interpretation of the squared magnitude of the wave function as a probability density [2] is not merely a philosophical position. Instead, the squared magnitude of the wave function (in the spatial frequency basis) is shown to be an approximation to the distance transform gradient density (with the approximation becoming increasingly accurate as τ → 0).
Y. Boykov et al. (Eds.): EMMCVPR 2011, LNCS 6819, pp. 413–427, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 The Complex Wave Representation (CWR)
We begin with Euclidean distance functions, more popularly referred to as distance transforms. Since distance transforms are used to set up the transition from Hamilton-Jacobi to Schrödinger wave functions, we stick with the simplest case: unsigned distance functions of a point-set. Given a point-set $Y_k \in \mathbb{R}^D$, $k \in \{1, \ldots, K\}$, the distance transform is defined as
$$S(x) \stackrel{\text{def}}{=} \min_{\{Y_k\}} \|x - Y_k\| \tag{1}$$
where $x \in \Omega$, and Ω is a bounded domain in $\mathbb{R}^D$. Below, we mainly use $D = 2$. In computational geometry, this is the Voronoi problem [15,7] and the solution $S(x)$ can be visualized as a set of cones (with the centers being the point-set locations $\{Y_k\}$). The distance transform $S(x)$ is not differentiable at the point-set locations and at the Voronoi boundaries but satisfies the static, nonlinear Hamilton-Jacobi equation [13,16]
$$\|\nabla S(x)\| = 1 \tag{2}$$
elsewhere. Furthermore $S(x) = 0$ at the point-set locations. The intimate relationship between the Hamilton-Jacobi and Schrödinger equations is well known in theoretical physics [4] and leveraged by our previous work on this topic [14]. For our purposes, the static, nonlinear Hamilton-Jacobi equation can be embedded in a static, linear Schrödinger equation. Consider the following linear differential equation
$$-\tau^2 \nabla^2 \psi(x) = \psi(x) \tag{3}$$
where $\psi(x)$ is a complex wave function and τ a free parameter (usually Planck's constant in the physics literature). Now, substitute $\psi(x) = \exp\left(i\frac{S(x)}{\tau}\right)$, with the notation $S(x)$ deliberately chosen to resonate with the distance transform above. We get using simple algebra [5]
$$\|\nabla S(x)\|^2 - i\tau \nabla^2 S(x) = 1 \tag{4}$$
which approaches (2) as τ → 0 provided $|\nabla^2 S(x)|$ is bounded. Due to this equivalence, and since the focus in this work is not on efficient computation of $S(x)$, we will henceforth not make a distinction between the Hamilton-Jacobi field $S(x)$ and the phase of the wave function $\psi(x)$. We have shown, following the theoretical physics literature and our previous work in EMMCVPR 2009 [14], that the static, linear Schrödinger equation in (3) is capable of expressing the static, nonlinear Hamilton-Jacobi equation in (2). The focus in this work, however, is on leveraging the complex wave representation (CWR) $\psi(x) = \exp\left(i\frac{S(x)}{\tau}\right)$.
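The algebra behind (4) can be checked numerically. The sketch below builds a brute-force distance transform for a small point-set, forms $\psi = \exp(iS/\tau)$, and compares $-\tau^2\nabla^2\psi/\psi$ (computed by finite differences) with $1 - i\tau\nabla^2 S$, which is what (4) predicts when $\|\nabla S\| = 1$. The point-set, the evaluation point, the step size and all function names are illustrative assumptions.

```python
import cmath
import math

def distance_transform(x, y, points):
    """Unsigned Euclidean distance from (x, y) to the nearest point of the point-set."""
    return min(math.hypot(x - px, y - py) for px, py in points)

def laplacian(f, x, y, step):
    """Five-point finite-difference Laplacian of a function f(x, y)."""
    return (f(x + step, y) + f(x - step, y) + f(x, y + step) + f(x, y - step)
            - 4 * f(x, y)) / step ** 2

def grad_norm(f, x, y, step):
    """Central-difference estimate of the gradient magnitude of f at (x, y)."""
    gx = (f(x + step, y) - f(x - step, y)) / (2 * step)
    gy = (f(x, y + step) - f(x, y - step)) / (2 * step)
    return math.hypot(gx, gy)

points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # illustrative point-set
x0, y0 = 0.6, 0.8                               # nearest point is the origin, so S = 1 here
tau, step = 0.1, 1e-3

def S(x, y):
    return distance_transform(x, y, points)

def psi(x, y):
    return cmath.exp(1j * S(x, y) / tau)         # complex wave representation

eikonal = grad_norm(S, x0, y0, step)                             # left-hand side of (2)
lhs = -tau ** 2 * laplacian(psi, x0, y0, step) / psi(x0, y0)     # equals 1 only if (3) holds exactly
rhs = 1 - 1j * tau * laplacian(S, x0, y0, step)                  # from (4) with unit gradient norm
```

Away from the point-set and the Voronoi boundaries, $\nabla^2 S = 1/S$ in 2D, so in this sketch both sides come out close to $1 - i\tau/S$: the exact distance transform satisfies (3) only up to a term of order τ, consistent with the limit described above.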
3 Distance Transform Gradient Density
The geometry of the distance transform in 2D corresponds to a set of intersecting cones with the origins at the Voronoi centers [7]. The gradients of the distance transform (which exist globally except at the cone intersections and origins) are unit vectors and satisfy $\|\nabla S\| = 1$. Therefore the gradient density function is one dimensional and defined over the space of orientations. The orientations are constant and unique along each ray of each cone. Its probability distribution function is given by
$$F(\theta \le \Theta \le \theta + \Delta) \equiv \frac{1}{L} \iint_{\theta \le \arctan\left(\frac{S_y}{S_x}\right) \le \theta + \Delta} dx\, dy \tag{5}$$
where we have expressed the orientation random variable, $\Theta = \arctan\left(\frac{S_y}{S_x}\right)$, as a random variable transformation of a uniformly distributed random variable (defined on a bounded 2D domain). The probability distribution function also induces a closed-form expression for its density function as shown below.
Let Ω denote the polygonal grid. Let $L = \mu(\Omega)$ represent the area of the grid and $l = \sqrt{L}$. Let $Y = \{Y_k \in \mathbb{R}^2, k \in \{1, \ldots, K\}\}$ be the given point-set locations. Then the Euclidean distance transform at a point $X = (x, y) \in \Omega$ is given by
$$S(X) \equiv \min_k \|X - Y_k\| = \min_k \left(\sqrt{(x - x_k)^2 + (y - y_k)^2}\right). \tag{6}$$
Let $D_k$, centered at $Y_k$, denote the $k$th Voronoi region corresponding to the input point $Y_k$. $D_k$ can be represented by the Cartesian product $[0, 2\pi) \times [0, R_k(\theta)]$ where $R_k(\theta)$ is the length of the ray of the $k$th cone at orientation θ. If a grid point $X = (x, y) \in Y_k + D_k$, then $S(X) = \|X - Y_k\|$. Each $D_k$ is a convex polygon whose boundary is composed of a finite sequence of straight line segments. Even for
points that lie on the Voronoi boundary, where the radial length equals $R_k(\theta)$, the distance transform is well defined. The area $L$ of the polygonal grid Ω is given by
$$L = \sum_{k=1}^{K} \int_0^{2\pi} \int_0^{R_k(\theta)} r\, dr\, d\theta = \sum_{k=1}^{K} \int_0^{2\pi} \frac{R_k^2(\theta)}{2}\, d\theta. \tag{7}$$
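With the above set-up in place and by recognizing the cone geometry at each Voronoi center $Y_k$, equation (5) can be simplified as
$$F(\theta \le \Theta \le \theta + \Delta) \equiv \frac{1}{L} \sum_{k=1}^{K} \int_\theta^{\theta + \Delta} \int_0^{R_k(\theta)} r\, dr\, d\theta = \frac{1}{L} \sum_{k=1}^{K} \int_\theta^{\theta + \Delta} \frac{R_k^2(\theta)}{2}\, d\theta. \tag{8}$$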
Following this drastic simplification, we can write the closed-form expression for the density function of the unit vector distance transform gradients as
$$P(\theta) \equiv \lim_{\Delta \to 0} \frac{F(\theta \le \Theta \le \theta + \Delta)}{\Delta} = \frac{1}{L} \sum_{k=1}^{K} \frac{R_k^2(\theta)}{2}. \tag{9}$$
Based on the expression for $L$ in (7) it is easy to see that
$$\int_0^{2\pi} P(\theta)\, d\theta = 1. \tag{10}$$
Since the Voronoi cells are convex polygons [7], each cell contributes exactly one conical ray to the density function on orientation.
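The closed-form density (9) is easy to evaluate in the simplest possible configuration: a single point ($K = 1$) at the centre of a square domain, so that the only Voronoi cell is the square itself. The sketch below, whose domain, side length and function names are illustrative assumptions, computes $R(\theta)$ and $P(\theta)$ and numerically checks the normalisation (10).

```python
import math

def ray_length(theta, half_side):
    """Length R(theta) of the ray from the centre of an axis-aligned square to its boundary."""
    return half_side / max(abs(math.cos(theta)), abs(math.sin(theta)))

def gradient_density(theta, half_side):
    """P(theta) = R(theta)^2 / (2 L) for a single Voronoi cell, as in (9) with K = 1."""
    area = (2 * half_side) ** 2
    return ray_length(theta, half_side) ** 2 / (2 * area)

def integrate(f, a, b, n):
    """Midpoint rule on n equal sub-intervals."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

half_side = 1.5
total = integrate(lambda t: gradient_density(t, half_side), 0.0, 2 * math.pi, 8000)
along_axis = gradient_density(0.0, half_side)
along_diagonal = gradient_density(math.pi / 4, half_side)
```

The density is independent of the side length, peaks along the diagonals (where the rays are longest) and is smallest along the axes, which is the sense in which longer rays contribute more gradient samples at that orientation.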
4 Properties of the Fourier Transform of the CWR
4.1 Spatial Frequencies as Gradient Histogram Bins
Now, consider the CWR of the distance transform in 2D. We therefore use $\psi(x, y) = \exp\left(i\frac{S(x, y)}{\tau}\right)$, with $S(x, y)$ the actual distance transform of a point-set. We take its 2D scaled Fourier transform:
$$\Psi_\tau(u, v) = \frac{1}{2\pi\tau} \iint_\Omega \exp\left(i\frac{S(x, y)}{\tau}\right) \exp\left(-i\frac{ux + vy}{\tau}\right) dx\, dy. \tag{11}$$
We see in Figure 1 (the figure on the right) that the Fourier transform values lie mainly on a circle and we have observed that this behavior tightens as τ → 0. The preferred theoretical tool in the literature to analyze this general type of behavior is the stationary phase approximation [11], which is well known in theoretical physics but less so in image analysis. Below, we give a very brief and very qualitative exposition. Consider the following integral (in 1D):
$$\int_{-\infty}^{\infty} \exp\left(i\frac{f(x)}{\tau}\right) \exp\left(-i\frac{\nu x}{\tau}\right) dx \tag{12}$$
Fig. 1. Left: Distance transform and gradient map of horse silhouette (τ = 0.00004). Right: The scaled and normalized Fourier transform $\Psi_\tau(u, v)$ (both axes span −0.1 to 0.1). Please zoom into the plots (especially the horse silhouette) to see greater detail.
where $f(x)$ is a twice differentiable function and $\nu$ a fixed parameter. The first exponential is a varying complex “sinusoid” whereas the second is a fixed complex sinusoid at frequency $\nu/\tau$. When we multiply these two complex exponentials, at low values of $\tau$, the two sinusoids are usually not “in sync” and cancellations occur in the integral. Exceptions to the cancellation happen at locations where $f'(x) = \nu$, since around these locations the two sinusoids are in perfect sync (with the approximate duration of this resonance dependent on $f''(x)$). The value of the integral is approximately
$$\sqrt{2\pi\tau} \sum_{\{x_0\}} \frac{1}{\sqrt{|f''(x_0)|}} \exp\left(\pm\frac{i\pi}{4}\right) \exp\left(\frac{i\,[f(x_0) - \nu x_0]}{\tau}\right) \tag{13}$$
where $\{x_0\}$ is the set of locations at which $f'(x_0) = \nu$, and the sign of the $\pi/4$ phase is the sign of $f''(x_0)$. The approximation is increasingly tight as $\tau \to 0$. For more information, please see [11]. The stationary phase approximation gives a theoretical explanation for the Fourier transform of $\psi(x, y)$ taking values mainly on the unit circle. In 2D, the stationary phase approximation indicates that the two sinusoids are in sync when $\nabla S = \nu$, where $\nu$ is now a 2D spatial frequency pair $(u, v)$. However, since $\|\nabla S\| = 1$, strong resonance occurs only when $u^2 + v^2 = 1$ and when the distance transform orientation $\theta = \arctan(v/u)$. While this brief explanation does serious injustice to a vast topic, the important points nonetheless are: i) a match between the orientation $\theta$ of each ray of the distance transform and the angle of the 2D spatial frequency, $\arctan(v/u)$, and ii) a match between the magnitude of $\nabla S$ (which is equal to one) and locations on the unit circle (corresponding to 2D spatial frequencies of magnitude one). Next, we show that the squared magnitude of the Fourier transform (normalized such that it has overall unit norm) is approximately equal to the density function of the distance transform gradients.
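As a numerical illustration of (12) and (13), the following is a minimal sketch and not part of the original analysis. It assumes the test function $f(x) = x^2/2$ (so that $f'(x) = x$, $f''(x) = 1$ and the single stationary point is $x_0 = \nu$), truncates the integral to $[-2, 2]$, and uses a plain trapezoid rule. The helper names `oscillatory_integral` and `stationary_phase` are hypothetical and introduced only for this example.

```python
import numpy as np

def oscillatory_integral(f, nu, tau, a=-2.0, b=2.0, n=400_001):
    """Trapezoid-rule value of the integral in (12), truncated to [a, b]."""
    x = np.linspace(a, b, n)
    y = np.exp(1j * (f(x) - nu * x) / tau)
    dx = x[1] - x[0]
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

def stationary_phase(f, d2f, x0, nu, tau):
    """Right-hand side of (13) for a single stationary point x0 with f'(x0) = nu."""
    curvature = d2f(x0)
    return (np.sqrt(2.0 * np.pi * tau / abs(curvature))
            * np.exp(1j * np.sign(curvature) * np.pi / 4.0)
            * np.exp(1j * (f(x0) - nu * x0) / tau))

f = lambda x: 0.5 * x**2   # assumed test function (not from the paper)
d2f = lambda x: 1.0        # its second derivative
nu = 0.3                   # fixed frequency parameter
x0 = nu                    # solves f'(x0) = nu for this f

errors = {}
for tau in (1e-2, 1e-3):
    numeric = oscillatory_integral(f, nu, tau)
    approx = stationary_phase(f, d2f, x0, nu, tau)
    errors[tau] = abs(numeric - approx) / abs(approx)
    print(f"tau={tau:g}  relative error={errors[tau]:.3e}")
```

The truncation to a finite interval adds endpoint terms of order $\tau$, so the relative discrepancy is expected to stay at the level of a few percent or less for these values of $\tau$ and to be of order $\sqrt{\tau}$ overall.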
4.2 Power Spectrum of ψ(x, y) as a Gradient Density Estimator
The previous section motivated the use of the stationary phase approximation for evaluating integrals. In this section, we first outline the main result and then spend the remainder of the section proving it.

The main result: The squared magnitude of the Fourier transform of the complex wave representation $\psi(x, y) = \exp\left(\frac{i S(x, y)}{\tau}\right)$ is an increasingly accurate approximation to the density function of $\nabla S$ as the free parameter $\tau$ tends to zero.

We now briefly outline the proof strategy. The Fourier transform of the CWR involves two spatial integrals (over $x$ and $y$), which are converted into polar coordinate domain integrals. The squared magnitude of the Fourier transform involves multiplying the Fourier transform with its complex conjugate. The complex conjugate is yet another 2D integral, which we will perform in the polar coordinate domain. Since the Fourier transform (suitably normalized) takes values very close to the unit circle, we then integrate the squared magnitude of the Fourier transform along the radial direction. This is a fifth integral. Finally, in order to eliminate unwanted phase factors, we first integrate the result of the above over very small angles (the sixth integral) and then take the limit as $\tau$ tends to zero. This integral and limit cannot be exchanged because the phase factors will not otherwise cancel. The remainder of this section mainly deals with managing these six integrals.

Define a function $F : \mathbb{R} \times \mathbb{R} \times \mathbb{R}^+ \to \mathbb{C}$ by
$$F(u, v, \tau) \equiv \frac{1}{2\pi\tau l} \iint_\Omega \exp\left(\frac{i S(x, y)}{\tau}\right) \exp\left(-\frac{i(ux + vy)}{\tau}\right) dx\,dy. \tag{14}$$
For a fixed value of $\tau$, define a function $F_\tau : \mathbb{R} \times \mathbb{R} \to \mathbb{C}$ by
$$F_\tau(u, v) \equiv F(u, v, \tau). \tag{15}$$
Observe that $F_\tau$ is closely related to the Fourier transform of $\exp\left(\frac{i S(x, y)}{\tau}\right)$ [3]. The scale factor $\frac{1}{2\pi\tau l}$ is the normalizing term such that $F_\tau \in L^2(\mathbb{R}^2)$ and $\|F_\tau\| = 1$. Consider the polar representation of the spatial frequencies $(u, v)$, namely $u = \tilde{r}\cos(\phi)$ and $v = \tilde{r}\sin(\phi)$ where $\tilde{r} > 0$. For $(x, y) \in Y_k + D_k$, let $x - x_k = r\cos(\theta)$ and $y - y_k = r\sin(\theta)$ where $r \in (0, R_k(\theta)]$. Then
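$$F_\tau(\tilde{r}, \phi) = \sum_{k=1}^{K} C_k I_k(\tilde{r}, \phi) \tag{16}$$
where
$$C_k = \exp\left(-\frac{i}{\tau}\left[\tilde{r}\cos(\phi)\,x_k + \tilde{r}\sin(\phi)\,y_k\right]\right) \tag{17}$$
and
$$I_k(\tilde{r}, \phi) = \frac{1}{2\pi\tau l} \int_0^{2\pi} \int_0^{R_k(\theta)} \exp\left(\frac{i}{\tau}\, r\left[1 - \tilde{r}\cos(\theta - \phi)\right]\right) r\,dr\,d\theta. \tag{18}$$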
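The factorization behind (16)-(18) rests on a pointwise identity: inside the cell of source $k$, where $S = r$, the integrand of (14) equals $C_k$ times the kernel of (18). The following sketch checks that identity on random samples. All numerical values (sample count, $\tau$, sampling ranges) are assumptions chosen for illustration, and the check presumes that each sampled point lies in the cell of its own source.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
tau = 0.01

# Random source points and polar offsets (r, theta) around them.
xk, yk = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
r, theta = rng.uniform(0.05, 1.0, n), rng.uniform(0, 2 * np.pi, n)
# Random spatial frequencies in polar form (r_tilde, phi).
r_t, phi = rng.uniform(0.5, 1.5, n), rng.uniform(0, 2 * np.pi, n)

x, y = xk + r * np.cos(theta), yk + r * np.sin(theta)
u, v = r_t * np.cos(phi), r_t * np.sin(phi)
S = np.hypot(x - xk, y - yk)  # distance to the source, equal to r

# Integrand of (14) versus C_k from (17) times the kernel of (18).
lhs = np.exp(1j * S / tau) * np.exp(-1j * (u * x + v * y) / tau)
C_k = np.exp(-1j / tau * (r_t * np.cos(phi) * xk + r_t * np.sin(phi) * yk))
kernel = np.exp(1j / tau * r * (1.0 - r_t * np.cos(theta - phi)))

max_gap = np.max(np.abs(lhs - C_k * kernel))
print(f"largest discrepancy: {max_gap:.2e}")
```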
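Lemma 1. For $\tilde{r} \neq 1$, $\lim_{\tau \to 0} F_\tau(\tilde{r}, \phi) = 0$.

Proof. As each $C_k$ is bounded, it suffices to show that if $\tilde{r} \neq 1$, then $\lim_{\tau \to 0} I_k(\tilde{r}, \phi) = 0$ for all $I_k$. Consider
$$I(\tilde{r}, \phi) = \frac{1}{2\pi\tau l} \int_0^{2\pi} \int_0^{R(\theta)} \exp\left(\frac{i}{\tau}\, r\left[1 - \tilde{r}\cos(\theta - \phi)\right]\right) r\,dr\,d\theta. \tag{19}$$
Let $p(r, \theta) = r(1 - \tilde{r}\cos(\theta - \phi))$. Since we are interested only in the limit as $\tau \to 0$, the essential contribution to $I(\tilde{r}, \phi)$ in (19) comes only from the stationary points of $p(r, \theta)$ [10,19]. The partial derivatives of $p$ are given by
$$\frac{\partial p}{\partial r} = 1 - \tilde{r}\cos(\theta - \phi), \qquad \frac{\partial p}{\partial \theta} = r\tilde{r}\sin(\theta - \phi). \tag{20}$$
The gradient equals zero only when $\tilde{r} = 1$ and $\theta = \phi$. Since $\tilde{r} \neq 1$ by assumption, no stationary points exist ($\nabla p \neq 0$). Then, using the two-dimensional stationary phase approximation, we can show that $I = O(\tau)$ as $\tau \to 0$ and hence converges to zero in the limit as $\tau \to 0$.

Define a function $P_\tau$ by
$$P_\tau(\tilde{r}, \phi) \equiv |F_\tau(\tilde{r}, \phi)|^2 = F_\tau(\tilde{r}, \phi)\,\overline{F_\tau(\tilde{r}, \phi)}. \tag{21}$$
By definition $P_\tau \geq 0$. Then, from basic algebra we have
$$\int_0^{2\pi} \int_0^{\infty} P_\tau(\tilde{r}, \phi)\,\tilde{r}\,d\tilde{r}\,d\phi = 1 \tag{22}$$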
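independent of $\tau$. Hence $\tilde{r}P_\tau(\tilde{r}, \phi)$ can be treated as a density function irrespective of the value of $\tau$. Furthermore, from Lemma 1, we see that as $\tau \to 0$, $P_\tau$ is concentrated only on the unit circle $\tilde{r} = 1$ and converges to zero everywhere else. We now state and prove the main theorem in this work.

Theorem 1. For any given $0 < \delta < 1$, $\phi_0 \in [0, 2\pi)$ and $0 < \Delta < 2\pi$,
$$\lim_{\tau \to 0} \int_{\phi_0}^{\phi_0 + \Delta} \int_{1-\delta}^{1+\delta} P_\tau(\tilde{r}, \phi)\,\tilde{r}\,d\tilde{r}\,d\phi = \int_{\phi_0}^{\phi_0 + \Delta} P(\phi)\,d\phi \tag{23}$$
where $P(\phi)$ is as defined in (9).

Proof. First observe that
$$\overline{F_\tau(\tilde{r}, \phi)} = \sum_{k=1}^{K} \frac{\overline{C_k}}{2\pi\tau l} \int_0^{2\pi} \int_0^{R_k(\theta')} \exp\left(-\frac{i r'}{\tau}\left[1 - \tilde{r}\cos(\theta' - \phi)\right]\right) r'\,dr'\,d\theta'. \tag{24}$$
$\overline{F_\tau(\tilde{r}, \phi)}$ in (24) is the complex conjugate of the Fourier transform of the CWR in (14) and (15). Define
$$I(\phi) \equiv \int_{1-\delta}^{1+\delta} P_\tau(\tilde{r}, \phi)\,\tilde{r}\,d\tilde{r}. \tag{25}$$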
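As $\tau \to 0$, $I(\phi)$ will approach the density function of the gradients of $S(x, y)$. Note that the integral in (25) is over the interval $[1 - \delta, 1 + \delta]$, where $\delta > 0$ can be made arbitrarily small (as $\tau \to 0$) due to Lemma 1. Since $P_\tau(\tilde{r}, \phi)$ equals $F_\tau(\tilde{r}, \phi)\,\overline{F_\tau(\tilde{r}, \phi)}$, we can rewrite $I(\phi)$ in (25) as
$$I(\phi) = \sum_{j=1}^{K} \sum_{k=1}^{K} \frac{1}{(2\pi\tau l)^2} \int_0^{2\pi} \int_0^{R_k(\theta')} \exp\left(-\frac{i r'}{\tau}\right) g_{jk}(r', \theta')\,r'\,dr'\,d\theta'. \tag{26}$$
Here
$$g_{jk}(r', \theta') \equiv \int_{1-\delta}^{1+\delta} \int_0^{2\pi} \int_0^{R_j(\theta)} \exp\left(\frac{i}{\tau}\,\gamma_{jk}(r, \theta, \tilde{r}; r', \theta', \phi)\right) f(r, \tilde{r})\,dr\,d\theta\,d\tilde{r} \tag{27}$$
where
$$\gamma_{jk}(r, \theta, \tilde{r}; r', \theta', \phi) \equiv r\left[1 - \tilde{r}\cos(\theta - \phi)\right] + r'\tilde{r}\cos(\theta' - \phi) - \tilde{r}\rho_{jk} \tag{28}$$
and
$$\rho_{jk}(\phi) = \cos(\phi)(x_j - x_k) + \sin(\phi)(y_j - y_k) \tag{29}$$
with
$$f(r, \tilde{r}) = r\tilde{r}. \tag{30}$$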
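The main reason for rewriting $I(\phi)$ in this manner will become clear as we proceed. In (29), $\rho_{jk}$ represents the phase term of the quantity $C_j\overline{C_k}$, with $C_k$ defined earlier in (17). In the definition of $\gamma_{jk}(r, \theta, \tilde{r}; r', \theta', \phi)$ in (28), the particular notation is used to emphasize that $\phi$, $r'$ and $\theta'$ are held fixed in the integral in (27). The integration with respect to $\tilde{r}$ is considered before the integration for $r'$ and $\theta'$. The main integral (26) has an integral over $\theta'$ over the range $[0, 2\pi)$. Dividing the integral range $[0, 2\pi)$ for $\theta'$ into three disjoint regions, namely $[0, \phi - \beta)$, $[\phi - \beta, \phi + \beta]$ and $(\phi + \beta, 2\pi)$ with $\beta > 0$, we get
$$I(\phi) = \sum_{j=1}^{K} \sum_{k=1}^{K} \left[J_{jk}^{(1)}(\phi) + J_{jk}^{(2)}(\phi) + J_{jk}^{(3)}(\phi)\right] \tag{31}$$
where
$$\begin{aligned}
J_{jk}^{(1)}(\phi) &= \frac{1}{(2\pi\tau l)^2} \int_0^{\phi - \beta} \int_0^{R_k(\theta')} \exp\left(-\frac{i r'}{\tau}\right) g_{jk}(r', \theta')\,r'\,dr'\,d\theta', \\
J_{jk}^{(2)}(\phi) &= \frac{1}{(2\pi\tau l)^2} \int_{\phi - \beta}^{\phi + \beta} \int_0^{R_k(\theta')} \exp\left(-\frac{i r'}{\tau}\right) g_{jk}(r', \theta')\,r'\,dr'\,d\theta', \text{ and} \\
J_{jk}^{(3)}(\phi) &= \frac{1}{(2\pi\tau l)^2} \int_{\phi + \beta}^{2\pi} \int_0^{R_k(\theta')} \exp\left(-\frac{i r'}{\tau}\right) g_{jk}(r', \theta')\,r'\,dr'\,d\theta'.
\end{aligned} \tag{32}$$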
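Examine the phase term (in the exponent) of the above integrals. One factor comes from $-\frac{r'}{\tau}$ and the other from $\frac{\gamma_{jk}}{\tau}$, which is present in $g_{jk}$ in (27). Let $\alpha_{jk} = -r' + \gamma_{jk}$ denote the phase term in the above integrals relative to $\tau$. Since we are interested only in the limit as $\tau \to 0$, the essential contribution to the above integrals comes only from regions (in 5D) near the stationary points of $\alpha_{jk}$ [10,19]. The partial derivatives of $\alpha_{jk}$ w.r.t. $r$, $\theta$, $\tilde{r}$, $r'$ and $\theta'$ are given by
$$\begin{aligned}
\frac{\partial\alpha_{jk}}{\partial r} &= 1 - \tilde{r}\cos(\theta - \phi), & \frac{\partial\alpha_{jk}}{\partial\theta} &= r\tilde{r}\sin(\theta - \phi), \\
\frac{\partial\alpha_{jk}}{\partial r'} &= \tilde{r}\cos(\theta' - \phi) - 1, & \frac{\partial\alpha_{jk}}{\partial\theta'} &= -r'\tilde{r}\sin(\theta' - \phi), \text{ and} \\
\frac{\partial\alpha_{jk}}{\partial\tilde{r}} &= -r\cos(\theta - \phi) + r'\cos(\theta' - \phi) - \rho_{jk}.
\end{aligned} \tag{33}$$
Since both $r$ and $r'$ are greater than zero, for $\nabla\alpha_{jk} = 0$, we must have
$$\tilde{r} = 1, \qquad \theta = \theta' = \phi, \qquad \text{and} \qquad r = r' - \rho_{jk}. \tag{34}$$
By construction, the integrals $J_{jk}^{(1)}(\phi)$ and $J_{jk}^{(3)}(\phi)$ do not include the stationary point $\theta' = \phi$, and hence $\nabla\alpha_{jk} \neq 0$ in these integrals. Using the higher-order stationary phase approximation [18], both the integrals $J_{jk}^{(1)}(\phi)$ and $J_{jk}^{(3)}(\phi)$ can be shown to be $O(\tau)$ as $\tau \to 0$ and therefore converge to zero in the limit. This leaves us with just the second integral $J_{jk}^{(2)}(\phi)$, which is restricted to the interval $[\phi - \beta, \phi + \beta]$. As $\beta \to 0$, we may assume that $R_k(\theta')$ is constant over the $\theta'$ interval $[\phi - \beta, \phi + \beta]$ and equals $R_k(\phi)$. (Without this assumption, the main result still goes through, but the proof is rather unwieldy.) Hence, the integral $J_{jk}^{(2)}(\phi)$ can be rewritten as
$$J_{jk}^{(2)}(\phi) = \frac{1}{(2\pi\tau l)^2} \int_0^{R_k(\phi)} \exp\left(-\frac{i r'}{\tau}\right) \xi_{jk}(r')\,r'\,dr' \tag{35}$$
where
$$\xi_{jk}(r') \equiv \int_{\phi - \beta}^{\phi + \beta} g_{jk}(r', \theta')\,d\theta'. \tag{36}$$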
In (28), the notation for $\gamma_{jk}(r, \theta, \tilde{r}; r', \theta', \phi)$ was used to emphasize that $\phi$, $r'$ and $\theta'$ were held fixed in the integral in (27). But in order to compute $\xi_{jk}(r')$ in (36), we need the integral over $\theta'$ in the interval $[\phi - \beta, \phi + \beta]$. As $\tau \to 0$, the essential contribution to $\xi_{jk}(r')$ comes only from the stationary points of $\gamma_{jk}$ [18]. Closely following (33), where we computed the gradients of $\alpha_{jk}$, it can be readily verified that for $\nabla\gamma_{jk} = 0$ we must have
$$\tilde{r} = 1, \qquad \theta = \theta' = \phi, \qquad \text{and} \qquad r = r' - \rho_{jk}. \tag{37}$$
Let $p_0$ denote this stationary point. Then
$$\gamma_{jk}(p_0) = r' - \rho_{jk} = r_{p_0}, \qquad f(p_0) = r_{p_0} = r' - \rho_{jk} \tag{38}$$
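and the Hessian matrix $H$ of $\gamma_{jk}$ at $p_0$ is given by
$$H(r, \theta, \tilde{r}, \theta')\big|_{p_0} = \begin{bmatrix} 0 & 0 & -1 & 0 \\ 0 & r' - \rho_{jk} & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -r' \end{bmatrix}.$$
The determinant of $H$ equals $r'(r' - \rho_{jk})$ and its signature (the difference between the number of positive and negative eigenvalues) is zero. From the results of the four-dimensional stationary phase approximation [18], we have as $\tau \to 0$,
$$\xi_{jk}(r') = (2\pi\tau)^2\,\frac{\sqrt{r' - \rho_{jk}}}{\sqrt{r'}}\,\exp\left(\frac{i}{\tau}(r' - \rho_{jk})\right) + \epsilon_1(r', \tau) \tag{39}$$
where $|\epsilon_1(r', \tau)| \leq M_1\tau^{\kappa}$ with $\kappa \geq \frac{5}{2}$ and includes contributions from the boundary. Plugging the value of $\xi_{jk}(r')$ in (35), we get
$$J_{jk}^{(2)}(\phi) = \frac{1}{l^2}\exp\left(-\frac{i\rho_{jk}}{\tau}\right) \int_0^{R_k(\phi)} \sqrt{r'(r' - \rho_{jk})}\,dr' + \frac{1}{(2\pi\tau l)^2} \int_0^{R_k(\phi)} \exp\left(-\frac{i r'}{\tau}\right) \epsilon_1(r', \tau)\,r'\,dr'. \tag{40}$$
Since $|\epsilon_1(r', \tau)|/\tau^2 \leq M_1\tau^{\kappa - 2}$ with $\kappa - 2 \geq \frac{1}{2}$, the second integral converges to zero as $\tau \to 0$. Let $\chi_{jk}(\phi)$ denote the first integral in (40). Note that $l^2 = L$ and $\rho_{jk}$ depends only on $\phi$ [as can be seen from Equation (29)]. Then
$$\chi_{jk}(\phi) = \frac{1}{L}\exp\left(-\frac{i}{\tau}\rho_{jk}(\phi)\right) \int_0^{R_k(\phi)} \sqrt{r'(r' - \rho_{jk})}\,dr'. \tag{41}$$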
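The Hessian used for (39) can be cross-checked numerically. The sketch below differentiates the phase $\gamma_{jk}$ of (28) by central differences at the stationary point (37) and compares the result with the stated matrix, its determinant $r'(r' - \rho_{jk})$ and its zero signature. The values chosen for $r'$, $\phi$ and $\rho_{jk}$ are arbitrary test values with $r' > \rho_{jk}$, and the function names are hypothetical helpers for this illustration only.

```python
import numpy as np

def gamma(q, r_p, phi, rho):
    """Phase gamma_jk of (28) as a function of q = (r, theta, r_tilde, theta')."""
    r, theta, r_t, theta_p = q
    return (r * (1.0 - r_t * np.cos(theta - phi))
            + r_p * r_t * np.cos(theta_p - phi) - r_t * rho)

def numerical_hessian(fun, q, h=1e-4):
    """Central-difference Hessian of fun at the point q."""
    q = np.asarray(q, dtype=float)
    m = q.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (fun(q + ei + ej) - fun(q + ei - ej)
                       - fun(q - ei + ej) + fun(q - ei - ej)) / (4.0 * h * h)
    return H

r_p, phi, rho = 0.8, 0.7, 0.3                 # assumed test values, r' > rho
p0 = np.array([r_p - rho, phi, 1.0, phi])     # stationary point from (37)

H_num = numerical_hessian(lambda q: gamma(q, r_p, phi, rho), p0)
H_ref = np.array([[0.0, 0.0, -1.0, 0.0],
                  [0.0, r_p - rho, 0.0, 0.0],
                  [-1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, -r_p]])

eigenvalues = np.linalg.eigvalsh(H_ref)
signature = int(np.sum(eigenvalues > 0) - np.sum(eigenvalues < 0))
determinant = np.linalg.det(H_ref)
print("max |H_num - H_ref| =", np.max(np.abs(H_num - H_ref)))
print("det =", determinant, " signature =", signature)
```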
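Recall the definition of $I(\phi)$ in (25) and its equivalent statement in (31). So far we have approximated $I(\phi)$ by $\sum_{j=1}^{K}\sum_{k=1}^{K} J_{jk}^{(2)}(\phi)$ and $J_{jk}^{(2)}(\phi)$ by $\chi_{jk}(\phi)$ as $\tau \to 0$. For the theorem statement to hold, it suffices to show that
$$\lim_{\tau \to 0} \sum_{j=1}^{K} \sum_{k=1}^{K} \int_{\phi_0}^{\phi_0 + \Delta} \chi_{jk}(\phi)\,d\phi = \int_{\phi_0}^{\phi_0 + \Delta} P(\phi)\,d\phi. \tag{42}$$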
We now consider two cases: the first in which $j \neq k$ and the second in which $j = k$.

Case (i): If $j \neq k$, then $\rho_{jk}$ varies continuously with $\phi$. The stationary point(s) of $\rho_{jk}$, denoted by $\tilde{\phi}$, satisfy
$$\tan(\tilde{\phi}) = \frac{y_j - y_k}{x_j - x_k} \tag{43}$$
and the second derivative of $\rho_{jk}$ at its stationary point is given by
$$\rho''_{jk}(\tilde{\phi}) = -\rho_{jk}(\tilde{\phi}). \tag{44}$$
For $\rho''_{jk}(\tilde{\phi})$ to become equal to zero, we must have
$$\tan(\tilde{\phi}) = -\frac{x_j - x_k}{y_j - y_k} = \frac{y_j - y_k}{x_j - x_k} \tag{45}$$
where the last equality is obtained using (43). Rewriting, we get
$$\left(\frac{y_j - y_k}{x_j - x_k}\right)^2 = -1 \tag{46}$$
which cannot be true. Since the second derivative cannot vanish at the stationary point $\tilde{\phi}$, from the one-dimensional stationary phase approximation [11], we have
$$\lim_{\tau \to 0} \int_{\phi_0}^{\phi_0 + \Delta} \chi_{jk}(\phi)\,d\phi = \lim_{\tau \to 0} O(\tau^{\kappa}) = 0 \tag{47}$$
where $\kappa = 0.5$ or $1$ depending on whether the interval $[\phi_0, \phi_0 + \Delta)$ contains the stationary point $\tilde{\phi}$ or not. Hence,
$$\lim_{\tau \to 0} \int_{\phi_0}^{\phi_0 + \Delta} \chi_{jk}(\phi)\,d\phi = 0 \tag{48}$$
for $j \neq k$.

Case (ii): If $j = k$, then $\rho_{kk} = 0$. Hence, from (41),
$$\chi_{kk}(\phi) = \frac{1}{L}\int_0^{R_k(\phi)} r'\,dr' = \frac{R_k^2(\phi)}{2L} \tag{49}$$
and
$$\int_{\phi_0}^{\phi_0 + \Delta} \chi_{kk}(\phi)\,d\phi = \frac{1}{L}\int_{\phi_0}^{\phi_0 + \Delta} \frac{R_k^2(\phi)}{2}\,d\phi. \tag{50}$$
Combining both case (i) and case (ii) we get
$$\sum_{j=1}^{K} \sum_{k=1}^{K} \lim_{\tau \to 0} \int_{\phi_0}^{\phi_0 + \Delta} \chi_{jk}(\phi)\,d\phi = \frac{1}{L}\sum_{k=1}^{K} \int_{\phi_0}^{\phi_0 + \Delta} \frac{R_k^2(\phi)}{2}\,d\phi = \int_{\phi_0}^{\phi_0 + \Delta} P(\phi)\,d\phi \tag{51}$$
which completes the proof.
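The vanishing of the cross terms in case (i) can be illustrated numerically. The sketch below integrates only the oscillatory factor $\exp(-i\rho_{jk}(\phi)/\tau)$ of (41) over an interval that contains the stationary point, dropping the slowly varying radial integral for simplicity (an assumption made for this example). The offsets $x_j - x_k = 0.6$ and $y_j - y_k = 0.8$, the interval $[0, 2]$ and the helper name `cross_term` are likewise hypothetical.

```python
import numpy as np

def cross_term(tau, dx=0.6, dy=0.8, phi0=0.0, delta=2.0, n=400_001):
    """Trapezoid value of the integral of exp(-i rho_jk(phi)/tau) over [phi0, phi0 + delta]."""
    phi = np.linspace(phi0, phi0 + delta, n)
    rho = np.cos(phi) * dx + np.sin(phi) * dy   # rho_jk(phi) from (29)
    y = np.exp(-1j * rho / tau)
    h = phi[1] - phi[0]
    return h * (y.sum() - 0.5 * (y[0] + y[-1]))

taus = (1e-2, 1e-3, 1e-4)
magnitudes = [abs(cross_term(tau)) for tau in taus]
for tau, mag in zip(taus, magnitudes):
    # With |rho''| = 1 at the stationary point, the leading term is sqrt(2*pi*tau).
    print(f"tau={tau:g}  |integral|={mag:.4f}  sqrt(2*pi*tau)={np.sqrt(2*np.pi*tau):.4f}")
```

For this choice the stationary point $\tilde{\phi} = \arctan(0.8/0.6) \approx 0.927$ lies inside the interval, so the magnitude should decay like $\tau^{1/2}$, in line with the $\kappa = 0.5$ case of (47).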
We have shown that the Fourier transform of the CWR of the distance transform has magnitude peaks on the unit circle of spatial frequency [Lemma (1)]. We have then shown that the squared magnitude of the Fourier transform (normalized such that it has overall unit norm) is approximately equal to the density function of the distance transform gradients with the approximation becoming increasingly tight as τ (a free parameter) tends to zero [Theorem (1)]. Consequently, we can make the identification that Ψτ (u(θ), v(θ)) is a complex, square-root density (of gradient orientation) and that spatial frequencies are essentially gradient histogram bins. (Since the Fourier transform values lie mainly on the unit circle, the difference between marginalization w.r.t. the radial parameter and evaluation on the unit circle becomes negligible.)
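The concentration of the power spectrum on the unit circle can be observed on a small synthetic example. The following sketch is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical set of five points on $[-0.5, 0.5)^2$, a brute-force distance transform on a $512 \times 512$ grid, $\tau = 0.004$, and NumPy's FFT with the frequency axis rescaled by $\tau$ so that it matches the convention of (11). The band half-width of 0.25 around $\tilde{r} = 1$ is also an arbitrary choice.

```python
import numpy as np

def distance_transform(points, n):
    """Brute-force Euclidean distance transform on an n x n grid over [-0.5, 0.5)^2."""
    xs = (np.arange(n) - n // 2) / n
    X, Y = np.meshgrid(xs, xs, indexing="xy")
    d = np.hypot(X[..., None] - points[:, 0], Y[..., None] - points[:, 1])
    return d.min(axis=-1)

def cwr_power(S, tau):
    """Normalized power spectrum of psi = exp(i S / tau) with scaled frequencies (u, v)."""
    n = S.shape[0]
    P = np.abs(np.fft.fft2(np.exp(1j * S / tau))) ** 2
    P /= P.sum()
    # In (11) the angular frequency is u / tau, so u = tau * 2*pi * (cycles per unit length).
    u = tau * 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
    U, V = np.meshgrid(u, u, indexing="xy")
    return P, U, V

# Hypothetical point set, chosen only for illustration.
points = np.array([[-0.30, -0.20], [0.25, -0.30], [0.00, 0.10],
                   [-0.25, 0.30], [0.30, 0.25]])

S = distance_transform(points, n=512)
P, U, V = cwr_power(S, tau=0.004)
radius = np.hypot(U, V)
ring_mass = P[np.abs(radius - 1.0) < 0.25].sum()
print(f"fraction of power with |r_tilde - 1| < 0.25: {ring_mass:.3f}")
```

Most of the normalized power is expected to fall inside the band around the unit circle, and the band that captures a given fraction should narrow as $\tau$ is reduced, provided the grid is fine enough to keep the frequency $1/\tau$ below the Nyquist limit.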
Fig. 2. Wave function (left), FFT (middle) and density estimation comparison (right). [Plots omitted: for each of the four shapes, the comparison panels show the true gradient density function and the gradient density function from the Fourier transform approach, each plotted against angle.]
Fig. 3. Convergence of the FFT density. Top Left: τ = 0.007. Top Middle: τ = 0.004. Top Right: τ = 0.001. Bottom Left: τ = 0.0006. Bottom Middle: τ = 0.0002. Bottom Right: τ = 0.00004. [Plots omitted: each panel pairs the true density function with the gradient density estimation from the Fourier transform approach, plotted against angle.]
5 Empirical Confirmation of the Main Result

The first set of empirical results seeks to validate the principal result in this paper. In Figure 2, we show four shapes¹, their associated distance transforms, the Fourier transform of the CWR and finally the comparison between the true density function (we refer to (9) as the true density function for the sake of clarity) and the power spectrum of the CWR. As expected, at a value of τ = 0.00004, we note from visual inspection that the two density functions are qualitatively similar. While obviously anecdotal, these empirical findings buttress the theoretical result proved in the paper.

¹ Two shapes were obtained from Kaleem Siddiqi, whom we thank, and the other two shapes are from the GatorBait shape database (http://www.cise.ufl.edu/~anand/GatorBait_100.tgz).

While Theorem 1 establishes the principal result, it says little about the approach toward convergence as τ → 0. Our next set of empirical results examines this convergence. In Figure 3, we show the convergence patterns of the FFT density for six values of τ between 0.007 and 0.00004. From an initial density at τ = 0.007 which does not bear much resemblance to the true density (see Figure 3 top left), we see gradual improvement and much closer correspondence at τ = 0.00004. A curious transitory pattern can also be discerned as τ is reduced. At τ = 0.001 and 0.0006, we notice very smooth density function estimates at odds with the shape of the true density. For τ = 0.0002 and 0.00004, we notice the emergence of the “Manhattan” skyline in the density estimator. We do not have any explanation for this empirical observation at the present time. Finally, we plot a scalar figure of merit, the L1 norm of the difference between the true and estimated densities, for two shapes (Figure 4). As expected, the L1 norm shows a gradual pattern of convergence as τ is reduced. More detailed work (and with arbitrary precision numerics) is required to understand the approach toward convergence.

Fig. 4. Variation of the L1 norm of the error with τ for two shapes (fish9 and horse3). [Plot omitted: L1 norm of estimation error, from 0 to 0.8, against τ, from 1 to 8 ×10⁻³.]
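A comparison of this kind can be sketched on synthetic data. The code below is a hedged illustration and not the experimental setup of the paper: it uses a hypothetical five-point set instead of shape silhouettes, a coarse $256 \times 256$ grid, 36 orientation bins, and much larger values of $\tau$ than those in Figure 3 so that $1/\tau$ stays below the grid's Nyquist frequency. The true density is approximated by an equal-area histogram of gradient orientations, which matches the area-weighted form of $P(\phi)$ implied by (51), and the estimate is the power spectrum summed over a band around the unit circle. All function names are introduced here for illustration.

```python
import numpy as np

def nearest_point_fields(points, n):
    """Distance transform S and gradient orientation theta on an n x n grid over [-0.5, 0.5)^2."""
    xs = (np.arange(n) - n // 2) / n
    X, Y = np.meshgrid(xs, xs, indexing="xy")
    dx = X[..., None] - points[:, 0]
    dy = Y[..., None] - points[:, 1]
    dist = np.hypot(dx, dy)
    k = dist.argmin(axis=-1)[..., None]               # index of the nearest source
    S = np.take_along_axis(dist, k, axis=-1)[..., 0]
    gx = np.take_along_axis(dx, k, axis=-1)[..., 0]
    gy = np.take_along_axis(dy, k, axis=-1)[..., 0]
    theta = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)   # orientation of grad S
    return S, theta

def true_density(theta, bins):
    """Equal-area histogram of gradient orientations, normalized to sum to one."""
    hist, _ = np.histogram(theta, bins=bins, range=(0.0, 2.0 * np.pi))
    return hist / hist.sum()

def fft_density(S, tau, bins, delta=0.25):
    """Power spectrum of exp(i S / tau), summed over |r_tilde - 1| < delta per angle bin."""
    n = S.shape[0]
    P = np.abs(np.fft.fft2(np.exp(1j * S / tau))) ** 2
    u = tau * 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)   # scaled frequencies of (11)
    U, V = np.meshgrid(u, u, indexing="xy")
    ring = np.abs(np.hypot(U, V) - 1.0) < delta
    phi = np.mod(np.arctan2(V[ring], U[ring]), 2.0 * np.pi)
    hist, _ = np.histogram(phi, bins=bins, range=(0.0, 2.0 * np.pi), weights=P[ring])
    return hist / hist.sum()

# Hypothetical point set, chosen only for illustration.
points = np.array([[-0.30, -0.20], [0.25, -0.30], [0.00, 0.10],
                   [-0.25, 0.30], [0.30, 0.25]])
bins = 36
S, theta = nearest_point_fields(points, n=256)
p_true = true_density(theta, bins)

l1_errors = {}
for tau in (0.02, 0.01, 0.005):
    p_fft = fft_density(S, tau, bins)
    l1_errors[tau] = float(np.abs(p_true - p_fft).sum())
    print(f"tau={tau:g}  L1 error={l1_errors[tau]:.3f}")
```

The L1 error of two normalized histograms lies between 0 and 2. How quickly it falls with $\tau$ in this toy setting depends on the grid resolution, the bin count and the band width, so the printed values should be read as indicative only.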
6 Discussion
We have shown that the power spectrum of the complex wave representation (CWR) of 2D distance transforms approaches the true density function of the distance transform gradients as a free parameter τ (usually identified with Planck’s constant in the physics literature) tends to zero. The proof utilizes the higher-order stationary phase approximation, a technique which is well known and widely deployed in the theoretical physics literature but underused in present-day image analysis and machine learning. Insofar as the higher-order stationary phase approximation bounds conspire to work in our favor (as they have clearly done in the 2D case), the extension to 3D distance transforms should be straightforward. This connection to density estimation legitimizes the CWR as a viable distance transform shape representation with potential applications in atlas estimation, shape clustering, etc. It remains to be seen if CWRs can play a role in general image analysis domains as well.
References

1. Blake, A., Zisserman, A.: Visual Reconstruction. The MIT Press, Cambridge (1987)
2. Bohm, D.: A suggested interpretation of the quantum theory in terms of "hidden variables", I. Physical Review 85, 166–179 (1952)
3. Bracewell, R.N.: The Fourier Transform and its Applications, 3rd edn. McGraw-Hill Science and Engineering (1999)
4. Butterfield, J.: On Hamilton-Jacobi theory as a classical root of quantum theory. In: Elitzur, A., Dolev, S., Kolenda, N. (eds.) Quo-Vadis Quantum Mechanics, ch. 13, pp. 239–274. Springer, Heidelberg (2005)
5. Chaichian, M., Demichev, A.: Path Integrals in Physics: Stochastic Processes and Quantum Mechanics, vol. I. Institute of Physics Publishing (2001)
6. Christensen, G.E., Rabbitt, R.D., Miller, M.I.: Deformable templates using large deformation kinematics. IEEE Transactions on Image Processing 5(10), 1435–1447 (1996)
7. de Berg, M., Cheong, O., van Kreveld, M., Overmars, M.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, Heidelberg (2010)
8. Geiger, D., Yuille, A.L.: A common framework for image segmentation. International Journal of Computer Vision 6(3), 227–243 (1991)
9. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6), 721–741 (1984)
10. Jones, D.S., Kline, M.: Asymptotic expansions of multiple integrals and the method of stationary phase. Journal of Mathematical Physics 37, 1–28 (1958)
11. Olver, F.W.J.: Asymptotics and Special Functions. A.K. Peters/CRC Press, Boca Raton (1997)
12. Osher, S.J., Fedkiw, R.P.: Level Set Methods and Dynamic Implicit Surfaces. Springer, Heidelberg (2002)
13. Osher, S.J., Sethian, J.A.: Fronts propagating with curvature dependent speed: algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics 79(1), 12–49 (1988)
14. Rangarajan, A., Gurumoorthy, K.S.: A Schrödinger wave equation approach to the eikonal equation: Application to image analysis. In: Cremers, D., Boykov, Y., Blake, A., Schmidt, F.R. (eds.) EMMCVPR 2009. LNCS, vol. 5681, pp. 140–153. Springer, Heidelberg (2009)
15. Siddiqi, K., Pizer, S. (eds.): Medial Representations: Mathematics, Algorithms and Applications. Computational Imaging and Vision. Springer, Heidelberg (2008)
16. Siddiqi, K., Tannenbaum, A.R., Zucker, S.W.: A Hamiltonian approach to the eikonal equation. In: Hancock, E.R., Pelillo, M. (eds.) EMMCVPR 1999. LNCS, vol. 1654, pp. 1–13. Springer, Heidelberg (1999)
17. Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision 63(2), 113–140 (2005)
18. Wong, R.: Asymptotic Approximations of Integrals. Academic Press, Inc., London (1989)
19. Wong, R., McClure, J.P.: On a method of asymptotic evaluation of multiple integrals. Mathematics of Computation 37(156), 509–521 (1981)
20. Yuille, A.L.: Generalized deformable models, statistical physics, and matching problems. Neural Computation 2(1), 1–24 (1990)
Author Index

Agapito, Lourdes 300
Andres, Björn 31
Ayvaci, Alper 191
Banerjee, Arunava 344, 413
Basri, Ronen 261
Bischof, Horst 104, 273
Blaschko, Matthew B. 385
Boykov, Yuri 147
Breuß, Michael 315
Bugeau, Aurélie 59
Caetano, Tibério S. 355
Caselles, Vicent 59
Chklovskii, Dmitri 261
Cohen, Laurent D. 74
Cremers, Daniel 177
Delong, Andrew 147
El-Zehiry, Noha 233
Facciolo, Gabriele 59
Fetter, Richard 261
Garg, Ravi 300
Glasner, Daniel 261
Gorelick, Lena 147
Grady, Leo 233
Gurumoorthy, Karthik S. 413
Hauberg, Søren 287
Hess, Harald 261
Hlaváč, Václav 1
Hoeltgen, Laurent 315
Hu, Tao 261
Jezierska, Anna 45
Jung, Miyoun 74
Kahl, Fredrik 163, 205
Kappes, Jörg Hendrik 31, 89
Kuang, Yubin 163
Lauze, François 329
Lellmann, Jan 132
Lenzen, Frank 132
Lopatka, Nikolai 118
Marsland, Stephen 399
McAuley, Julian J. 355
Nielsen, Mads 329
Nieuwenhuis, Claudia 177
Nunez-Iglesias, Juan 261
Pedersen, Kim Steenstrup 287
Peyré, Gabriel 74
Pock, Thomas 104, 273
Rajwade, Ajit 344
Rakêt, Lars Lau 329
Ramisa, Arnau 355
Rangarajan, Anand 344, 413
Reinelt, Gerhard 31
Roholm, Lars 329
Rosenhahn, Bodo 219
Roussos, Anastasios 300
Sadek, Rida 59
Savchynskyy, Bogdan 89
Scheffer, Lou 261
Scheuermann, Björn 219
Schlesinger, Michail 118
Schmidt, Frank R. 147
Schmidt, Stefan 89
Schnörr, Christoph 31, 89, 132
Schoenemann, Thomas 17, 163
Setzer, Simon 315
Shekhovtsov, Alexander 1
Soatto, Stefano 191
Speth, Markus 31
Strandmark, Petter 205
Talbot, Hugues 45
Taylor, Chris 399
Töppe, Eno 177
Torr, Philip H.S. 369
Twining, Carole 399
Unger, Markus 104, 273
Veksler, Olga 45, 147
Vetrov, Dmitry 247
Vodolazskiy, Evgeniy 118
Warrell, Jonathan 369
Werlberger, Manuel 273
Wesierski, Daniel 45
Xu, Shan 261
Yangel, Boris 247